ADS-Net

Keywords: Change detection; Attention mechanism; Deep supervision network; Difference feature

Abstract

Change detection technology is an important key to analyzing remote sensing data and is of great significance for the accurate comprehension of the earth's surface changes. With the continuous development and progress of deep learning technology, fully convolutional neural networks are gradually being applied to remote sensing change detection tasks. The present methods mainly encounter the problems of simple network structure, poor detection of small change areas, and poor robustness, since they cannot completely capture the relationships and differences between the features of bi-temporal images. To solve such problems, we propose an attention mechanism-based deep supervision network (ADS-Net) for the change detection of bi-temporal remote sensing images. First, an encoding–decoding fully convolutional network is designed with a dual-stream structure. Features of various levels of the bi-temporal images are extracted in the encoding stage; then, in the decoding stage, the feature maps of different levels are fed into a deep supervision network with different branches to reconstruct the change map. Ultimately, to obtain the final change detection map, the prediction results of each branch of the deep supervision network are fused with various weights. To highlight the characteristics of change, we propose an adaptive attention mechanism combining spatial and channel features to capture the relationships of changes at different scales and achieve more accurate change detection. ADS-Net has been tested on the challenging LEVIR-CD and SVCD remote sensing image change detection datasets. The results of quantitative analysis and qualitative comparison indicate that the ADS-Net method achieves better effectiveness and robustness compared to other state-of-the-art change detection methods.
threshold to determine the changed and unchanged pixels, making the change detection effect poor. The image transformation-based change detection method transforms the remote sensing image into a specific feature space while emphasizing the changed pixels and suppressing the unchanged pixels (Wang et al., 2018). The main idea is to use algorithms such as independent component analysis (ICA) (Yi-Quan et al., 2016), iteratively reweighted multivariate alteration detection (IRMAD) (Wang et al., 2015), change vector analysis (CVA) (Singh and Talwar, 2015), principal component analysis (PCA) (Wei et al., 2016), and slow feature analysis (SFA) (Wu et al., 2014) to analyze and transform the features of the bi-temporal images. Changes are then detected based on the transformed features. These methods improved the accuracy of remote sensing change detection to a certain extent. However, choosing the most appropriate image transformation-based method for a specific area is a very difficult process, and owing to their specific characteristics, these algorithms cannot be applied to all change detection tasks. The change detection method based on object classification uses the spectrum, texture, structure, and geometric features of the bi-temporal images for similarity analysis. In object classification-based change detection methods (Han et al., 2020), a post-classification strategy is usually used: first, the objects of interest are extracted from the bi-temporal images, then compared and analyzed to generate the final change map. Although good change detection effects have been achieved, the process is relatively complex, and the effect of change detection depends on the accuracy of the object classification.

Table 1
Summary of traditional change detection methods.

Methods                                  | Example Studies
Traditional change detection methods     |
  Image arithmetic-based method          | Singh A. (Singh, 1986), Howarth et al. (Howarth and Wickware, 2007), Wang et al. (Zheng et al., 2013)
  Image transformation-based method      | Wu et al. (Yi-Quan et al., 2016), Wang et al. (Wang et al., 2015), Singh et al. (Singh and Talwar, 2015), Zhao et al. (Zhao, 2011), Bovolo et al. (Bovolo and Member, 2012), Thonfeld et al. (Thonfeld and Feilhauer, 2016), Wei et al. (Wei et al., 2016), Wu et al. (Wu et al., 2014)
  Object classification-based method     | Han et al. (Han et al., 2020), Xin et al. (Xin et al., 2018), Zhang et al. (Zhang et al., 2018), Tan et al. (Tan et al., 2019)
Deep learning change detection methods   | Ji et al. (Ji et al., 2019), Wang et al. (Wang et al., 2018), Peng et al. (Peng and Guan, 2019), Daudt et al. (Daudt, 2018), Zhang et al. (Zhang et al., 2020), Chen et al. (Chen and Shi, 2020), Liu et al. (Liu et al., 2019)
With the development of deep learning technology, it has become easier to utilize convolutional neural networks (CNNs) to extract high-level features of images. The good generalization capability of high-level features is very helpful for detecting change information. The current CNN-based remote sensing change detection models have achieved a high accuracy rate and a better effect than other traditional methods (Zhan et al., 2017). Ji et al. (Ji et al., 2019) proposed a CNN-based change detection framework to locate changed building instances and changed building pixels from high-resolution aerial images. The building extraction network is run through two extensively used structures: Mask R-CNN for object instance segmentation and a multi-scale fully convolutional network for pixel-based semantic segmentation. This method reached an average precision (AP) of 63% at the object (building instance) level. The region-based Faster R-CNN method was applied by Wang et al. (Wang et al., 2018) for the change detection of high-resolution remote sensing images. Compared to the traditional methods and other deep learning-based change detection methods, it reduced numerous erroneous changes and achieved higher detection accuracy; the Kappa coefficient reached 71.16% and 79.42% on the two datasets, respectively. Peng et al. (Peng and Guan, 2019) proposed a novel end-to-end CD method based on the encoder-decoder architecture UNet++ for semantic segmentation. It concatenates the registered image pairs as the input of the enhanced UNet++ network and generates the final change detection map. Its F1-Score reached 87.56% while achieving the best performance among all the comparative SOTA methods. Daudt et al. (Daudt, 2018) proposed two Siamese extensions of fully convolutional networks. During the training process, the image difference and the image concatenation features were respectively fused, representing better performance and faster detection speed. Then, a deeply supervised image fusion network (DSIFN) was proposed by Zhang et al. (Zhang et al., 2020) based on the Siamese network. The highly representative deep features of the bi-temporal images are extracted through a fully convolutional dual-stream structure, and the extracted deep features are fed into a deeply supervised difference discrimination network for change detection. The F1-Score of DSIFN reached 90.3%, which is superior to other benchmark methods and yields changing areas with complete boundaries and high internal compactness. Chen et al. (Chen and Shi, 2020) presented a novel spatial-temporal attention neural network based on the Siamese structure and designed a CD self-attention mechanism to calculate attention weights between any two pixels at various times and positions. They improved the F1-score of the baseline model from 83.9% to 87.3% with acceptable computational overhead. A new type of deep neural network architecture was proposed by Liu et al. (Liu et al., 2019) based on information transmission and an attention mechanism. In the design of the DNN structure, an information transmission module was introduced for information interaction and transmission, and the mechanism is utilized to give the corresponding attention weights to the bi-temporal image features. The F1-Score of this network is 7.4% higher than that of the original CNN. To facilitate retrieval, we summarize the relevant literature of the above-mentioned change detection methods in Table 1.

The change detection method based on deep learning has a better performance compared to any traditional method. It can learn change characteristics based on the sample label information with a supervised technique and detect the changes of the regions and features of interest. However, there are some limitations in the structure and function of the existing change detection networks. Thus, the corresponding solutions to the following problems are proposed in the present paper:

(1) The rich information of the pixels in high-resolution remote sensing images is not completely utilized. With the increasing resolution of remote sensing images, the requirements for change detection pixel segmentation are higher. Most of the existing methods apply the classic semantic segmentation networks U-Net and U-Net++ (Peng and Guan, 2019; Wiratama et al., 2020), where the change detection effect is not good. In this paper, we try to focus on the important information in the spatial and channel domains and use the attention mechanism to concatenate the two modules. We combine features of various scales to construct a deep supervision network and fuse low-level and high-level features with various weights to improve the change detection precision.

(2) Insufficient feature fusion results in a poor training effect of the change detection network. The present change detection feature fusion methods are generally divided into pre-fusion and post-fusion (Wiratama and Sim, 2019). Pre-fusion denotes concatenating the two images or inserting the difference of the bi-temporal images into the network for feature extraction and change detection. Such early feature maps cannot present the deep information of a single original image, are more sensitive to noise, and easily result in the accumulation of errors. Post-fusion is based on using two identical networks to process each bi-temporal image separately and then utilizing the respective
extracted features through an integrated network for change detection (Liu et al., 2019). However, in the process of discriminating the difference, the change maps are greatly impacted by the quality of the training images. In this paper, we adopt a mid-layer fusion method: after each layer's features are extracted, the bi-temporal features of the encoding stage are concatenated with the output of the former layer in the decoding part, and the bi-temporal feature maps are concatenated with their difference maps as input for each decoding layer, as sketched below. In this way, we obtain a sufficient and effective fusion of the bi-temporal features.
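A minimal PyTorch sketch of this mid-layer fusion is given below; the tensor names (enc_t1, enc_t2, dec_prev) are illustrative placeholders of ours, not identifiers from any released ADS-Net code:

```python
import torch

def fuse_mid_layer(enc_t1, enc_t2, dec_prev):
    """Mid-layer fusion: concatenate the bi-temporal encoder features of one
    level with their difference map and the output of the former decoding layer.
    enc_t1, enc_t2: (N, C, H, W) encoder features of the T1 / T2 streams.
    dec_prev:       (N, C2, H, W) output of the previous decoding layer.
    """
    diff = torch.abs(enc_t1 - enc_t2)  # difference map of the feature pair
    return torch.cat([enc_t1, enc_t2, diff, dec_prev], dim=1)
```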
In this paper, we propose an attention-based deeply supervised network (ADS-Net) for remote sensing image change detection. First, to extract the multi-layer features of the bi-temporal images, a dual-stream fully convolutional neural network is used. Then, a fusion attention module of channel and spatial features is added to the decoding part of the network, and the features of each layer are merged to obtain the various prediction maps of the multiple supervision modules. Ultimately, the prediction results of each branch supervision network are weighted and then fused to form the ultimate prediction result, realizing the function of deep supervision.
The rest of the paper is organized as follows. Section 2 introduces the proposed method in detail. The change detection experiments and results analysis are presented in Section 3. The discussion is presented in Section 4. The paper is concluded in Section 5.

2. Methodology

In this section, we present the proposed ADS-Net method from four aspects: basic network structure, attention fusion module, deep supervision fusion, and loss function.

2.1. Basic network structure

Fig. 1 represents the proposed attention-based deeply supervised network (ADS-Net) for remote sensing change detection. The network structure consists of encoding and decoding parts. First, in the encode part, a pair of Siamese networks (Daudt, 2018) is utilized to extract the registered bi-temporal image features; the feature pairs of different sizes are then concatenated with their differences and input to the decoding stage, where the change map is reconstructed through deconvolution operations. Ultimately, the reconstructed change maps of the various decoding layers are weighted and fused to form the ultimate change detection result.

In the encoding stage, both networks contain convolution modules (conv1, conv2, …, conv5) extracting the features of the bi-temporal images (T1, T2) respectively, obtaining 5 pairs of feature layers of different sizes. The first two convolution modules, conv1 and conv2, both comprise two convolutional layers and a maximum pooling layer. Conv3 and conv4 both contain three convolutional layers and a maximum pooling layer. Conv5 includes three convolutional layers. It is worth noting that after each convolutional layer in the network, a ReLU activation function, BatchNorm regularization, and a Dropout operation are added to prevent overfitting and guarantee feature layers with strong expressive ability. The layers shown in different colors in Fig. 1 represent the feature layers of different scales output by each convolution module. Furthermore, to effectively obtain different information and relieve data pressure, the weights and structure of the T1 and T2 networks are shared; hence, the bi-temporal image features are converted to the same space for comparison.

Four decoding modules are designed in the decode stage to reconstruct the change map from the feature maps of different sizes; moreover, the various change maps obtained by each module are weighted and fused to obtain the ultimate change detection result (w2, w3, w4, and w5 represent the fusion weights of the respective modules). First, we concatenate the extracted feature pair with its difference as input and strengthen the change information via the designed attention module. Then, we carry out convolution, BatchNorm regularization, ReLU activation, and Dropout operations to aggregate the features. Ultimately, a deconvolution operation is performed to expand the size of the feature maps. By concatenating the results with the feature maps of the corresponding size in the encode part, they are inserted into the next decode steps to realize the ultimate reconstruction of the feature map.

The implementation steps of the ADS-Net architecture are as follows:

(1) The 256 × 256 × 3 bi-temporal images are input into the T1 and T2 networks respectively. After five convolution and down-sampling modules, a concatenate operation is performed on the two 16 × 16 × 256 feature modules. Then, the decoding network follows. To reduce the number of network parameters, we share the weights of the T1 and T2 networks.

(2) Correspondingly, the two Conv5 bi-temporal feature modules from step (1) are subtracted to obtain the difference features. The T1 and T2 features and their difference features are merged, and the proposed attention module is inserted to aggregate the features. An up-sampling operation is then performed to enlarge the size of the feature map. As shown in Fig. 1, before each level of upsampling, the features of the previous level are merged with the corresponding T1 and T2 encoding features and then passed to the upsampling layer through the attention module. Finally, a change map is generated consistent with the original image size.

(3) The encoding layer with five convolution modules may not be able to extract all types of change features; hence, we start from the Conv2 module and add four decoding networks of different depths to form a deep supervision strategy. Each supervisory network has a structure similar to the main network introduced in (2), except for the network depth. To form a final change map with higher accuracy, their predicted change maps are fused with different weights, as sketched below.
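The weighted fusion in step (3) can be sketched as follows; the weight values used here are placeholders, and Section 2.3 describes how the actual fusion weights are obtained:

```python
import torch

def fuse_branch_predictions(preds, weights):
    """Weighted fusion of the change maps predicted by the four decoding
    branches, per the w2..w5 weights of Fig. 1.
    preds:   list of four (N, 1, H, W) change probability maps.
    weights: four scalars (w2, w3, w4, w5), assumed to sum to 1.
    """
    assert len(preds) == len(weights)
    return sum(w * p for w, p in zip(weights, preds))

# Illustrative usage with random maps and placeholder weights:
# maps = [torch.rand(1, 1, 256, 256) for _ in range(4)]
# change_map = fuse_branch_predictions(maps, [0.20, 0.24, 0.26, 0.30])
```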
2.2. Adaptive attention fusion module

In the field of computer vision, machines, like humans, selectively focus on the important parts of the visible information while ignoring other irrelevant information. This mechanism is termed the attention mechanism (Li et al., 2019). In the remote sensing change detection task, we mostly consider the areas where the bi-temporal images have changed. Therefore, enhancing the characteristics of the changed areas is helpful to improve detection efficiency. In recent years,
numerous researchers have added different types of attention modules, comprising spatial attention and channel attention, to change detection networks (Zhang et al., 2020; Chen and Shi, 2020; Zhang et al., 2021). Spatial attention plays a role in increasing the distance difference between the changed and unchanged pixels. The role of channel attention is to amplify the channels associated with the changes in ground features and suppress the irrelevant channels.

In the process of change detection, not all high-dimensional features are helpful for difference discrimination (Saha et al., 2019; Jie and Samuel, 2019), and irrelevant features may make change detection more difficult (Woo et al., 2018). Thus, we propose an adaptive attention fusion module to enhance useful information and suppress irrelevant information. As shown in Fig. 2, the attention fusion module proposed in this paper is a dual-stream attention mechanism. We put the former feature maps into channel attention and spatial attention operations respectively and integrate them in an element-wise way to achieve the improved feature maps, which can better reconstruct the following change map. Remarkably, the convolution kernel sizes of the spatial and channel attention are determined adaptively based on the feature maps.
Channel attention module (CAM): First, average pooling is performed on each channel of the input concatenated feature map (H × W × C). Indeed, the elements of each channel are averaged to create a C × 1 × 1 vector (C is the number of channels). Then, a one-dimensional convolution operation (conv1d) with a convolution kernel size of k1 is performed on this vector. The result of the convolution is normalized to a weight coefficient with a value between 0 and 1 through the sigmoid activation function. Ultimately, the obtained channel attention weight coefficient is element-wise multiplied by each spatial element of the original feature map, and the globally enhanced channel attention feature map (MC) is obtained, for which the calculation expression is as follows:

MC = σ(conv1d(Avgpool(F))) ⊗ F   (1)

where F is the merged feature map input to the CAM module, ⊗ represents element-wise multiplication, and σ(⋅) is the sigmoid activation function, whose expression is as follows:

σ(z) = 1 / (1 + e^(−z))   (2)

For the one-dimensional convolution kernel size k1, we adopt the strategy of adaptively determining its value based on the number of channels C. The functional relationship between C and k1 is constructed as follows:

C = f(k1)   (3)

In general, a linear mapping is the simplest mapping; however, the expressive power of linear relationships is limited. Normally, the channel number C is an exponential power of 2 (Wang et al., 2020); thus, formula (3) can be expressed as:

C = f(k1) = 2^(a·k1 − b)   (4)

Hence, given the specific value of the channel number C, the size of the convolution kernel k1 can be expressed as:

k1 = |(log2(C) + b) / a|_odd   (5)

where |M|_odd denotes the odd number closest to M. In the present paper, we conducted several experiments and found that the CAM has the best performance when a and b are set to 2 and 1, respectively. In the high-dimensional features with more channels, the larger convolution kernel has a wider receptive field, and in the low-level features with fewer channels, the smaller convolution kernel concentrates on a more compact receptive field. Thus, adaptive determination of the convolution kernel size based on the number of channels is achieved, which is used to construct the channel attention module well.
Spatial attention module (SAM): First, max pooling is performed at each pixel position of the input concatenated feature map (H × W × C). In other words, the element at each position takes the maximum value over its channels to create a 1 × W × H map (W and H are respectively the numbers of columns and rows of the feature map). Then, a two-dimensional convolution operation (conv2d) with convolution kernel size k2 is conducted on this map, and the convolution result is normalized to a weight coefficient with a value between 0 and 1 through the sigmoid activation function. Ultimately, the obtained spatial attention weight coefficient is element-wise multiplied by each channel of the original feature map to obtain the locally improved spatial attention feature map (MS), which is calculated as follows:

MS = σ(conv2d(Maxpool(F))) ⊗ F   (6)

where F represents the combined feature map input to the SAM module and σ(⋅) denotes the sigmoid activation function of formula (2). The two-dimensional convolution kernel size k2 is determined similarly to k1: we adopt the strategy of adaptively determining its value based on the size of the feature map (W and H). Since the bi-temporal images and their feature maps utilized in the experimental process are all square, namely the numbers of rows and columns are equal, W = H. We build the functional relationship between W and k2 as follows:

W = g(k2)   (7)

The size of the input bi-temporal images utilized in our paper is 256 × 256, which is reduced to 1/2 of the original after each pooling; hence, the size of the feature map at each stage is always an exponential power of 2. k2 is determined in the same way as k1, as follows:

k2 = |(log2(W) + b) / a|_odd   (8)

In this paper, through numerous experiments, we found that the SAM has the best performance when a and b are set to 2 and 3, respectively. A relatively large convolution kernel is utilized in a low-dimensional feature map with a larger size, and a relatively small convolution kernel is used in a high-dimensional feature map with a smaller size to emphasize the more prominent change features. The spatial attention module is thus constructed using the method of adaptively determining the convolution kernel size based on the size of the feature map.

Ultimately, the enhanced feature maps output by the CAM and SAM are added element-wise to obtain the final attention module output, as follows:

M = MC ⊕ MS   (9)

Since the average pooling method is utilized in the CAM and the maximum pooling method is used in the SAM to reduce dimensionality, the change information is obtained by the two modules after global and local enhancement respectively; thus, the two modules are combined to achieve better detection results.
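Putting the two branches together, a minimal PyTorch sketch of the adaptive attention fusion module is shown below. This is our own illustrative implementation of Formulas (1)-(9) under the stated assumptions (square feature maps, the a and b settings above), not the authors' released code:

```python
import math
import torch
import torch.nn as nn

def adaptive_k(n, a, b):
    """Nearest odd integer to (log2(n) + b) / a (Formulas (5) and (8))."""
    t = (math.log2(n) + b) / a
    k = int(round(t))
    if k % 2 == 0:
        k += 1 if t >= k else -1
    return max(k, 1)

class AttentionFusion(nn.Module):
    """Dual-stream channel (CAM) and spatial (SAM) attention, summed element-wise."""

    def __init__(self, channels, width):
        super().__init__()
        k1 = adaptive_k(channels, a=2, b=1)  # CAM kernel size from C
        k2 = adaptive_k(width, a=2, b=3)     # SAM kernel size from W (= H)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k1, padding=k1 // 2, bias=False)
        self.conv2d = nn.Conv2d(1, 1, kernel_size=k2, padding=k2 // 2, bias=False)

    def forward(self, f):  # f: (N, C, H, W) concatenated feature map
        # CAM: average pooling per channel -> conv1d -> sigmoid (Formula (1)).
        c = f.mean(dim=(2, 3)).unsqueeze(1)      # (N, 1, C)
        wc = torch.sigmoid(self.conv1d(c))       # channel weights in (0, 1)
        mc = wc.squeeze(1)[..., None, None] * f  # globally enhanced map MC

        # SAM: max pooling over channels -> conv2d -> sigmoid (Formula (6)).
        s, _ = f.max(dim=1, keepdim=True)        # (N, 1, H, W)
        ws = torch.sigmoid(self.conv2d(s))       # spatial weights in (0, 1)
        ms = ws * f                              # locally enhanced map MS

        return mc + ms                           # Formula (9): M = MC ⊕ MS

# Example: out = AttentionFusion(channels=256, width=16)(torch.rand(2, 256, 16, 16))
```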
2.3. Deep supervision based on weighted fusion

Deep supervision refers to the method of adding auxiliary classifiers to the middle hidden layers of a deep neural network as network branches to supervise the backbone network. It is utilized to alleviate the gradient vanishing problem and the slow convergence speed of deep neural network training (Lee, 2015). The core idea of deep supervision is to provide integrated direct supervision of the hidden layers, instead of only providing supervision at the output layer, and to propagate this supervision back to the earlier layers. By introducing an accompanying objective function for each hidden layer to present this integrated direct hidden layer supervision, these accompanying objective functions can be considered as additional soft constraints.
Fig. 3. The LEVIR-CD dataset before and after cropping: (a) and (b) the bi-temporal image pair; (c) the Ground Truth of the changed pixels; the top row is the original image, and the bottom row is the cropped image.
Fig. 4. Some scenes of the SVCD dataset: the top row shows the T1-time images, the middle row the T2-time images, and the bottom row the Ground Truth.
images are taken from 20 various areas in several cities in Texas, USA, from 2002 to 2018. It contains numerous changes resulting from seasonal and lighting variations, which helps to train a more effective change detection model that focuses on the changes of interest while reducing the impact of other irrelevant changes on the model. In the present paper, the LEVIR-CD dataset is cropped into image pairs with a size of 256 × 256 pixels to alleviate the pressure on computer GPU memory. The unchanged image pairs are discarded to build 1390 pairs
Fig. 5. The curves of the various metrics on the two datasets over the training iterations: (a) performance on the LEVIR-CD dataset, (b) performance on the SVCD dataset.
of bi-temporal images, of which 1000 pairs are utilized for model training, 200 pairs are used to verify the training process, and 190 pairs are used for testing after the training. The LEVIR-CD dataset before and after cropping is provided in Fig. 3.

The second dataset is the Season-varying Change Detection Dataset (SVCD) proposed by (Lebedev et al., 2018). It uses 7 pairs of remote sensing images of the same area changing with the seasons, obtained from Google Earth (DigitalGlobe), with a size of 4725 × 2700 pixels. The obtained images have a spatial resolution of 3–100 cm and contain objects of various sizes (from cars to large building structures) and seasonal alterations of natural objects (from single trees to wide forest areas). The whole image is rotated randomly and cropped into 256 × 256 segments to create our dataset. To shorten the training time and ease the pressure on the GPU, we randomly selected 5847 images as the training set, as well as 1725 and 959 images as the verification set and test set, respectively. Some bi-temporal images of the SVCD dataset are presented in Fig. 4.

3.2. Evaluation metric and parameter setting

To compare the difference between the labeled map and the predicted change map and assess the effectiveness of our proposed technique, we utilized four evaluation metrics: Precision (Pr), Recall (Re), F1-Score (F1), and Kappa coefficient (Kappa). In the change detection task, a higher Precision denotes a higher accuracy of the detected changed pixels, and a higher Recall represents a greater ability of the model to find more changed pixels. The F1-Score is a metric for measuring the accuracy of a binary classification model. It considers the precision and recall of the classification model at the same time; moreover, it is the harmonic average of model precision and recall. The Kappa coefficient is a metric utilized for consistency testing. For change detection problems, the so-called consistency is whether the model prediction results are consistent with the actual classification results. Thus, we use it for measuring the change detection effect. The overall performance of the model is reflected by the F1-Score and Kappa coefficient, for which the values are between 0 and 1; a larger value reflects better model performance. The four evaluation metrics are expressed as follows:

Pr = TP / (TP + FP)   (14)

Re = TP / (TP + FN)   (15)

F1 = 2 × Pr × Re / (Pr + Re)   (16)

Kappa = (OA − P) / (1 − P)   (17)

TP is the number of correctly detected changed pixels, TN represents the number of correctly detected unchanged pixels, FP is the number of false alarm pixels, and FN is the number of missed changed pixels. In the Kappa calculation formula, OA denotes the overall accuracy, and P represents the proportion of expected agreement between the ground truth and the predictions with the given class distributions (El Amin et al., 2017). The expressions of OA and P are as follows:

OA = (TP + TN) / N   (18)
Fig. 6. The ADS-Net change detection results of some scenes on the LEVIR-CD dataset: (a) and (b) the T1-time and T2-time bi-temporal remote sensing images; (c) the Ground Truth map of actual changes; (d) the change map predicted by ADS-Net.
P = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / N²   (19)

where N is the total number of pixels.
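The four metrics of Formulas (14)-(19) follow directly from the pixel-level confusion counts; a small sketch (our own helper, assuming binary 0/1 change maps and non-degenerate counts) is:

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Pr, Re, F1 and Kappa per Formulas (14)-(19).
    pred, gt: binary arrays where 1 marks a changed pixel."""
    tp = np.sum((pred == 1) & (gt == 1))  # correctly detected changed pixels
    tn = np.sum((pred == 0) & (gt == 0))  # correctly detected unchanged pixels
    fp = np.sum((pred == 1) & (gt == 0))  # false alarm pixels
    fn = np.sum((pred == 0) & (gt == 1))  # missed changed pixels
    n = tp + tn + fp + fn                 # total number of pixels N

    pr = tp / (tp + fp)                                            # (14)
    re = tp / (tp + fn)                                            # (15)
    f1 = 2 * pr * re / (pr + re)                                   # (16)
    oa = (tp + tn) / n                                             # (18)
    p = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2   # (19)
    kappa = (oa - p) / (1 - p)                                     # (17)
    return pr, re, f1, kappa
```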
In the ADS-Net model, all convolution kernel sizes are 3 × 3 in the encoding stage; in the decode stage, the attention module convolution kernel size is adaptively determined based on the input feature map size and the number of channels, and all other convolution kernel sizes are set to 3 × 3. The training period is set to 200 epochs, and the batch size is 16. The learning rate uses a multi-step adjustment strategy (MultiStepLR), where the initial learning rate is 0.001 and the learning rate decays to half of its previous value after every 80 epochs.
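A configuration sketch matching this schedule is given below. The optimizer itself is not named in this section, so Adam appears purely as an illustrative choice, and the model, data, and loss are simple stand-ins for the real network and loader:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Conv2d(6, 1, 3, padding=1)  # stand-in for ADS-Net (hypothetical)
criterion = nn.BCEWithLogitsLoss()     # stand-in for the change-map loss

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)           # initial LR 0.001
scheduler = MultiStepLR(optimizer, milestones=[80, 160], gamma=0.5)  # halve LR every 80 epochs

for epoch in range(200):                # 200-epoch training period, batch size 16
    t1 = torch.rand(16, 3, 256, 256)    # dummy bi-temporal batch (stand-in for the loader)
    t2 = torch.rand(16, 3, 256, 256)
    label = torch.randint(0, 2, (16, 1, 256, 256)).float()
    optimizer.zero_grad()
    loss = criterion(model(torch.cat([t1, t2], dim=1)), label)
    loss.backward()
    optimizer.step()
    scheduler.step()                    # multi-step learning rate decay
```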
The proposed method is implemented in PyTorch with Python 3.6 as the backend and is powered by a workstation with an Intel Core i9-9820X CPU (3.3 GHz, 64 GB RAM) and a single NVIDIA GeForce GTX TITAN X. We utilize the CUDA computing architecture and the cuDNN library for accelerated training.

To more conveniently observe the model training situation, we draw the variation curves of the various indicators on the validation set against the number of epochs during the training process, as shown in Fig. 5. It is observed that on the LEVIR-CD dataset, the training speed is faster: when the training reaches 40 epochs, the network is stabilized. On the SVCD dataset, the training speed is slower owing to the more complex types of changes and the much larger number of training samples; hence, the network gradually tends to converge at 160 epochs.

3.3. Comparison methods

To assess the performance of the proposed model, we introduce the following four change detection benchmark methods and compare their performance on the two datasets:

1) FC-Siam-diff (Daudt, 2018): It is a fully convolutional change detection network based on the Siamese structure, sharing the weights of the two channels in the encoding stage. The difference of the bi-temporal features is input to the network as skip connections in the decode stage to reconstruct the change map.
2) FC-Siam-conc (Daudt, 2018): It was proposed at the same time as FC-Siam-diff, as a fully convolutional change detection network based on the Siamese structure. The difference is that the bi-temporal features are concatenated to the network input as skip connections in the decode stage to reconstruct the change map.
Fig. 7. The ADS-Net change detection results of some scenes on the SVCD dataset: (a) and (b) the T1-time and T2-time bi-temporal remote sensing images; (c) the Ground Truth map of actual changes; (d) the change map predicted by ADS-Net.
3) STA-Net (PAM) (Chen and Shi, 2020): It is a spatial–temporal attention network based on the Siamese structure, using a change detection self-attention mechanism to model spatial–temporal relationships. The self-attention module is integrated into the feature extraction process to calculate attention weights between any two pixels at various times and positions and use them to generate more distinguishing features.
4) DSIFN (Zhang et al., 2020): It is a deeply supervised image fusion network for change detection in high-resolution bi-temporal remote sensing images. The multi-level deep features of the original images and the image difference features are fused via the attention module, and change detection is conducted via the deeply supervised difference discrimination network (DDN) based on full convolution.

3.4. Comparison of experimental results

The change detection results of the proposed ADS-Net method on several scenes of the LEVIR-CD and SVCD datasets are represented in Fig. 6 and Fig. 7. The black pixels represent the unchanged areas, and the white pixels denote the changed areas. It is observed that both large scenes and small objects can be well detected in the changing areas of the bi-temporal images, and the shape and boundary sharpness of the changing areas are highly consistent with the Ground Truth. Moreover, on the LEVIR-CD dataset (Fig. 6), the influence of noise caused by irrelevant changes such as illumination and seasons can be ignored. On the SVCD dataset (Fig. 7), the changes in land types can be well detected, indicating that ADS-Net possesses better change detection performance and robustness.

To further verify the effectiveness of the proposed ADS-Net, we trained and tested the above four methods and the ADS-Net method on the two datasets, and then evaluated the effect of the proposed method for remote sensing change detection through comparative qualitative and quantitative analysis.

1) LEVIR-CD dataset:
We chose four typical scenes in the LEVIR-CD dataset and compared the four benchmark methods with ADS-Net. These scenes mainly contain alterations in buildings, and the image texture changes resulting from various seasons and lighting conditions serve as interference factors. The performances of the different methods are represented in Figs. 8-11. The red boxes show the local details to facilitate observation and comparison.

By comparing the above experimental results, it is found that ADS-Net is very sensitive to alterations in buildings; moreover, it can capture some subtle changes (Figs. 9, 11(h)). Compared to the other methods, it significantly reduces the number of false detection pixels
Fig. 8. The first scene of LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f)
STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
Fig. 9. The second scene of LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f)
STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
(Fig. 8(h)). It can overcome, to the greatest extent, the influence on the model of irrelevant changes of illumination and climate at different times, and detect the changes of the objects of interest accurately (Fig. 10(h)).

To further assess the performance of the above methods, the quantitative results of the various evaluation indicators of the different methods are represented in Table 2. It is found that the proposed ADS-Net achieved the optimal values in the Recall, F1-Score, and Kappa indicators. The Kappa metric reached 88.29%, which is increased by 1.53% and 1.08%, respectively, compared to STA-Net and DSIFN, the excellent remote sensing change detection methods of the past two years. ADS-Net achieved 89.8% in F1-Score, which is 2.48% and 1.53% higher than STA-Net and DSIFN, respectively. Moreover, comparing the two methods FC-Siam-diff and FC-Siam-conc, it is found that the concatenate feature fusion method has a better effect than direct subtraction (Figs. 8-11 (d)(e)). The reason is that direct subtraction loses the relevant information
Fig. 10. The third scene of LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f)
STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
Fig. 11. The fourth scene of LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc,
(f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
in the feature map, while the concatenate operation may increase redundant information but does not reduce the key information related to the change. Compared to FC-Siam-diff, FC-Siam-conc increased by 2.34% and 2.62% on the two indicators F1-Score and Kappa, respectively. Therefore, concatenated skip connections are also utilized in the decoding stage of our proposed ADS-Net, obtaining a very good result. Compared to FC-Siam-diff and FC-Siam-conc, a spatial–temporal attention module is added by STA-Net, emphasizing the characteristics of the changing area; hence, the performance is greatly improved. The DSIFN method connects the channel and spatial attention modules in series and then adopts a deep supervision strategy, representing a better performance than STA-Net, with increases of 0.95% and 0.45% in F1-Score and Kappa, respectively. ADS-Net adaptively selects the size of the convolution kernel in the attention
module and concatenates the channel and spatial attention to increase the weight of the change features; hence, the recall rate is improved compared with DSIFN. Both methods also use a deep supervision strategy: DSIFN introduces losses in the middle layers for backpropagation and updates the weights to further enhance the change detection network, whereas ADS-Net uses various weights at the back end of the network to merge the branch network results and yield the final result (Figs. 8-11 (g)(h)). The comprehensive comparison indicates that ADS-Net has better detection performance.

Table 2
The quantitative evaluation results of different methods on the LEVIR-CD dataset.

Method        | Precision | Recall | F1-Score | Kappa
FC-Siam-diff  | 0.8314    | 0.8464 | 0.8281   | 0.8101
FC-Siam-conc  | 0.8649    | 0.8593 | 0.8515   | 0.8363
STA-Net       | 0.8382    | 0.9108 | 0.8732   | 0.8676
DSIFN         | 0.9120    | 0.8798 | 0.8827   | 0.8721
ADS-Net       | 0.8967    | 0.9136 | 0.8980   | 0.8829

2) SVCD dataset:
We also chose four typical scenarios in the SVCD dataset where the changes are more complex, including changes in cars and houses and changes in vegetation with the seasons. The four benchmark methods were compared with ADS-Net, and the performance of the different methods is shown in Figs. 12-15.

By comparing the above experimental results, it is found that ADS-Net has a strong ability for detecting changes in complex ground features. In the red box of Fig. 12(h), the detection result of ADS-Net is closest to the Ground Truth map. The red box in Fig. 13 represents the models' false detection of the sample; in contrast, only ADS-Net has no false detection of the area. ADS-Net shows its sensitivity to changes in small objects such as the cars in Fig. 14, and the model has a high degree of refinement. Moreover, ADS-Net possesses a more prominent detection effect on ground changes, as shown in Fig. 15(h), clearly distinguishing the boundary between land and grass.

Table 3 represents the quantitative results of the various evaluation indicators of the different methods. The proposed ADS-Net achieved the best performance in three metrics: Precision, F1-Score, and Kappa. The F1-Score and Kappa metrics reached 82.72% and 81.48% respectively, which are lower than the detection results on the LEVIR-CD dataset; however, they are significantly improved compared to the other benchmark methods.

Based on the detection results on the LEVIR-CD and SVCD datasets, ADS-Net has a good recall rate for small areas of change and a good sense of the color changes resulting from various seasons, climates, and illumination. It can well detect some complicated changes and significantly reduce the false detection rate caused by noise interference. To compare the detection effects of ADS-Net and the other benchmark methods more intuitively, Table 4 represents the percentage increase of the metrics of ADS-Net on the two datasets compared to the benchmark networks.

3.5. Ablation experiments

Compared to similar methods, the ADS-Net remote sensing change detection method proposed in this paper has better performance, which is mainly attributable to the attention fusion module and the deep supervision network. To further verify the effect of this work, we performed ablation experiments on the two datasets.

(1) Effectiveness of the attention module
In the research of remote sensing change detection, several methods have used the attention mechanism to highlight the change information and achieve better detection performance. Therefore, to verify the effectiveness of the attention module in this paper, we tested the method without the attention module and the method with the attention fusion module on the LEVIR-CD and SVCD datasets. Fig. 16 represents the comparison of the four evaluation indicators of the two types of methods.

Through comparison, it is found that after adding the proposed channel and spatial attention fusion module, every metric of the model was improved. The recall was improved the most, with an increase of 2.69% and 3.26% on the two datasets, respectively. It is indicated that the proposed attention module can better capture subtle change information and accurately classify possible change pixels, thereby enhancing the overall detection performance of the model.
Fig. 12. The first scene of SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-
Net (PAM), (g) DSIFN, (h) ADS-Net.
Fig. 13. The second scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
Fig. 14. The third scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
(2) Role of the deep supervision network
The proposed deep supervision network integrates the detection results obtained by four networks of different depths and the feature information of different dimensions from the low level to the high level. To verify the function of the deep supervision network, we tested each branch network. De-Conv2, De-Conv3, De-Conv4, and De-Conv5 correspond to the decoding networks of Conv2, Conv3, Conv4, and Conv5 respectively, of which De-Conv5 is the main network. Fig. 17 represents the F1-Score of the test results utilizing the main network, each branch network, and the deep supervision network respectively on the LEVIR-CD and SVCD datasets. The deep supervision network is a weighted fusion of the outputs of the branch networks, where each weight is determined based on the F1-Score of the individual detection results of the corresponding branch network (Equation (10)). It is found that by increasing the number of network layers, the change detection results become more and more accurate, and the performance of the deep supervision network improves.
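Equation (10) itself is not reproduced in this excerpt; one weighting consistent with this description, in which each branch is weighted by its normalized individual F1-Score, can be sketched as:

```python
def f1_based_weights(branch_f1):
    """Normalize the per-branch F1-Scores into fusion weights w2..w5
    (a plausible reading of Equation (10), which is not shown here)."""
    total = sum(branch_f1)
    return [f1 / total for f1 in branch_f1]

# e.g. De-Conv2..De-Conv5 scoring [0.84, 0.86, 0.88, 0.90] gives
# weights of roughly [0.241, 0.247, 0.253, 0.259] for w2..w5.
```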
Fig. 15. The fourth scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.
Fig. 16. Verification of the attention module's effectiveness: (a) the verification result on the LEVIR-CD dataset, (b) the verification result on the SVCD dataset.
Acknowledgments:
References

Yi-Quan, W., Zhao-Qing, C., Fei-Xiang, T., 2016. Change detection of multi-temporal remote sensing images based on contourlet transform and ICA. Chinese Journal of Geophysics 59 (4), 1284–1292. https://doi.org/10.1002/cjg2.20231.
Wang, B., Choi, S.K., Han, Y.K., et al., 2015. Application of IR-MAD using synthetically fused images for change detection in hyperspectral data. Remote Sensing Letters 6 (7–9), 578–586. https://doi.org/10.1080/2150704X.2015.1062155.
Singh, S., Talwar, R., 2015. Assessment of different CVA based change detection techniques using MODIS dataset. Mausam 66 (1), 77–86.
Zhao, He, 2011. Improving change vector analysis in multi-temporal space to detect land cover changes by using cross-correlogram spectral matching algorithm. In: Geoscience & Remote Sensing Symposium. IEEE. https://doi.org/10.1109/IGARSS.2011.6048960.
Bovolo, F., et al., 2012. A framework for automatic and unsupervised detection of multiple changes in multitemporal images. IEEE Transactions on Geoscience & Remote Sensing 50 (6), 2196–2212. https://doi.org/10.1109/TGRS.2011.2171493.
Thonfeld, F., Feilhauer, H., et al., 2016. Robust Change Vector Analysis (RCVA) for multi-sensor very high resolution optical satellite data. International Journal of Applied Earth Observation & Geoinformation. https://doi.org/10.1016/j.jag.2016.03.009.
Wei, H., Jinliang, H., Lihui, W., et al., 2016. Remote sensing image change detection based on change vector analysis of PCA components. Remote Sensing for Land & Resources. https://doi.org/10.6046/gtzyyg.2016.01.04.
Wu, C., Du, B., Zhang, L.P., 2014. Slow feature analysis for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing 52, 2858–2874. https://doi.org/10.1109/TGRS.2013.2266673.
Han, Y., Javed, A., Jung, S., et al., 2020. Object-based change detection of very high resolution images by fusing pixel-based change detection results using weighted Dempster-Shafer theory. Remote Sensing 12 (6), 983. https://doi.org/10.3390/rs12060983.
Xin, W., Sicong, L., Peijun, D., et al., 2018. Object-based change detection in urban areas from high spatial resolution images based on multiple features and ensemble learning. Remote Sensing 10 (2), 276. https://doi.org/10.3390/rs10020276.
Zhang, Y., Peng, D., Huang, X., 2018. Object-based change detection for VHR images based on multiscale uncertainty analysis. IEEE Geoscience & Remote Sensing Letters 15 (1), 13–17. https://doi.org/10.1109/LGRS.2017.2763182.
Tan, K., Zhang, Y., Wang, X., et al., 2019. Object-based change detection using multiple classifiers and multi-scale uncertainty analysis. Remote Sensing 11 (3). https://doi.org/10.3390/rs11030359.
Zhan, Y., Fu, K., Yan, M., et al., 2017. Change detection based on deep Siamese convolutional network for optical aerial images. IEEE Geoscience and Remote Sensing Letters 14 (10), 1845–1849. https://doi.org/10.1109/LGRS.2017.2738149.
Wang, Q., Zhang, X., Chen, G., et al., 2018. Change detection based on Faster R-CNN for high-resolution remote sensing images. Remote Sensing Letters 9 (10–12), 923–932. https://doi.org/10.1080/2150704X.2018.1492172.
Peng, Zhang, Guan, 2019. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sensing 11 (11), 1382. https://doi.org/10.3390/rs11111382.
Daudt, R.C., et al., 2018. Fully convolutional Siamese networks for change detection. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 4063–4067. https://doi.org/10.1109/ICIP.2018.8451652.
Zhang, C., et al., 2020. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 166, 183–200. https://doi.org/10.1016/j.isprsjprs.2020.06.003.
Chen, H., Shi, Z., 2020. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing 12 (10), 1662. https://doi.org/10.3390/rs12101662.
Liu, R., Cheng, Z., Zhang, L., et al., 2019. Remote sensing image change detection based on information transmission and attention mechanism. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2947286.
Wiratama, W., et al., 2020. Change detection on multi-spectral images based on feature-level U-Net. IEEE Access 8, 12279–12289. https://doi.org/10.1109/ACCESS.2020.2964798.
Wiratama, W., Sim, D., 2019. Fusion network for change detection of high-resolution panchromatic imagery. Applied Sciences 9 (7), 1441. https://doi.org/10.3390/app9071441.
Liu, J., Gong, M., Qin, A.K., et al., 2019. Bipartite differential neural network for unsupervised image change detection. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2019.2910571.
Li, X., Wang, W., Hu, X., et al., 2019. Selective kernel networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2019.00060.
Jie, Shen, Samuel, et al., 2019. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2019.2913372.
Zhang, H., Wang, M., Wang, F., et al., 2021. A novel squeeze-and-excitation W-Net for 2D and 3D building change detection with multi-source and multi-feature remote sensing data. Remote Sensing 13 (3), 440. https://doi.org/10.3390/rs13030440.
Saha, S., Bovolo, F., Bruzzone, L., 2019. Unsupervised deep change vector analysis for multiple-change detection in VHR images. IEEE Transactions on Geoscience and Remote Sensing. https://doi.org/10.1109/TGRS.2018.2886643.
Woo, S., Park, J., Lee, J.Y., et al., 2018. CBAM: Convolutional block attention module. https://doi.org/10.1007/978-3-030-01234-2_1.
Wang, Q., Wu, B., Zhu, P., et al., 2020. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR42600.2020.01155.
Lee, C.-Y., et al., 2015. Deeply-supervised nets. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 562–570.
Lin, T.Y., Goyal, P., Girshick, R., et al., 2017. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence. https://doi.org/10.1109/iccv.2017.324.
Cui, Y., et al., 2019. Class-balanced loss based on effective number of samples. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9268–9277. https://doi.org/10.1109/cvpr.2019.00949.
Zhang, L., et al., 2020. A class imbalance loss for imbalanced object recognition. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13, 2778–2792. https://doi.org/10.1109/JSTARS.2020.2995703.
Lebedev, M.A., Vizilter, Y.V., Vygolov, O.V., Knyaz, V.A., Rubis, A.Y., 2018. Change detection in remote sensing images using conditional adversarial networks. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XLII-2, 565–571. https://doi.org/10.5194/isprs-archives-XLII-2-565-2018.
El Amin, A.M., Liu, Q., Wang, Y., 2017. Zoom out CNNs features for optical remote sensing change detection. In: Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017, pp. 812–817. https://doi.org/10.1109/ICIVC.2017.7984667.