DiffUCD - Unsupervised Hyperspectral Image Change Detection With Semantic Correlation Diffusion Model
Abstract—Hyperspectral image change detection (HSI-CD) has emerged as a crucial research area in remote sensing due to its ability to detect subtle changes on the earth's surface. Recently, denoising diffusion probabilistic models (DDPM) have demonstrated remarkable performance in the generative domain. Apart from their image generation capability, the denoising process in diffusion models can comprehensively account for the semantic correlation of spectral-spatial features in HSI, resulting in the retrieval of semantically relevant features in the original image. In this work, we extend the diffusion model's application to the HSI-CD field and propose a novel unsupervised HSI-CD method with a semantic correlation diffusion model (DiffUCD). Specifically, the semantic correlation diffusion model (SCDM) leverages abundant unlabeled samples and fully accounts for the semantic correlation of spectral-spatial features, which mitigates pseudo change between multi-temporal images arising from inconsistent imaging conditions. Besides, objects with the same semantic concept at the same spatial location may exhibit inconsistent spectral signatures at different times, resulting in pseudo change. To address this problem, we propose a cross-temporal contrastive learning (CTCL) mechanism that aligns the spectral feature representations of unchanged samples. By doing so, the spectral difference invariant features caused by environmental changes can be obtained. Experiments conducted on three publicly available datasets demonstrate that the proposed method outperforms the other state-of-the-art unsupervised methods in terms of Overall Accuracy (OA), Kappa Coefficient (KC), and F1 scores, achieving improvements of approximately 3.95%, 8.13%, and 4.45%, respectively. Notably, our method can achieve results comparable to those of fully supervised methods requiring numerous annotated samples.

Index Terms—Hyperspectral image, change detection, diffusion model, contrastive learning.

Xiangrong Zhang, Shunli Tian, Guanchun Wang, and Licheng Jiao are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi Province 710071, China.
Huiyu Zhou is with the School of Informatics, University of Leicester, Leicester LE1 7RH, U.K.
This work was supported in part by the National Natural Science Foundation of China under Grant 61871306 and Grant 62171332.

I. INTRODUCTION

Change detection (CD) involves using remote sensing technologies to compare and analyze images taken at different times in the same area, detecting changes in ground objects between two or more images [1]. Hyperspectral data provides continuous spectral information, making it ideal for detecting subtle changes on the Earth's surface. As such, hyperspectral image change detection (HSI-CD) has become an important research focus in remote sensing [2], with applications in land use and land cover change [3], ecosystem monitoring, natural disaster damage assessment [4], and more.

Broadly speaking, HSI-CD can be achieved using supervised and unsupervised methods. Most current methods rely on supervised deep learning networks trained with high-quality labeled samples [5], [6]. However, obtaining high-quality labeled training samples is costly and time-consuming. Thus, reducing or eliminating the reliance on labeled data is critical to addressing the challenge of HSI-CD.

Although deep learning based supervised HSI-CD methods have shown promising results, they still face several challenges: 1) There are often insufficient labeled samples for HSI-CD, so labeled and unlabeled data must be leveraged effectively to train deep learning networks. 2) HSI-CD involves spatiotemporal data, where changes occur over time and exhibit spatial correlations. While existing approaches primarily focus on extracting features, they often overlook the importance of considering spectral-spatial semantic correlations. Incorporating such correlations is essential for accurate CD. 3) Objects with the same semantic concept at the same spatial location can exhibit different spectral features at different times due to changes in imaging conditions and environments (i.e., the same objects with different spectra). While most deep learning-based CD methods focus on fully extracting spectral features, none of them has investigated extracting the spectral difference invariant features caused by environmental changes.

Recently, many unsupervised HSI-CD methods [7], [8] have been proposed. Unlike supervised methods, unsupervised methods do not require pre-labeled data and can learn features of changed regions using only two HSIs. This confers a significant advantage over supervised methods, as it avoids the need for labor-intensive and time-consuming labeling and mitigates issues such as inaccurate and inconsistent labeling. However, the accuracy of unsupervised methods is often lower than that of supervised methods, despite their ability to function without any annotation information.

Diffusion models have recently demonstrated remarkable successes in image generation and synthesis [9], [10]. Thanks to their excellent generative capabilities, researchers have begun exploring the application of diffusion models in visual understanding tasks such as semantic segmentation [11], [12], object detection [13], image colorization [14], super-resolution [10], [15], and more. However, their potential for HSI-CD remains largely unexplored.
As such, how to apply diffusion models to HSI-CD remains an open problem.

To address the challenges faced by HSI-CD, we propose an unsupervised approach based on the semantic correlation diffusion model (SCDM), which leverages its strong denoising and generation ability. This method consists of two main steps. Firstly, the denoising process of the SCDM can utilize many unlabeled samples, fully consider the semantic correlation of spectral-spatial features, and retrieve the semantically correlated features of the original image. Secondly, we propose a cross-temporal contrastive learning (CTCL) mechanism to address the problem of spectral variations caused by environmental changes. This mechanism aligns the spectral feature representations of unchanged samples cross-temporally, enabling the network to learn features that are invariant to these spectral differences.

The main contributions of this paper are:
• We propose DiffUCD, the first diffusion model designed explicitly for HSI-CD, which can fully consider the semantic correlation of spectral-spatial features and retrieve semantically related features in the original image.
• To address the problem that objects with the same semantic concept at the same spatial location may exhibit different spectral features at different times, we propose CTCL, which enables the network to learn spectral difference invariant features.
• Extensive experiments on three datasets demonstrate that our proposed method achieves state-of-the-art results compared to other unsupervised HSI-CD methods.

Through experiments on three publicly available datasets (Santa Barbara, Bay Area, and Hermiston), we demonstrate that DiffUCD outperforms state-of-the-art methods by a significant margin. Specifically, our method achieves OA values of 96.87%, 96.35%, and 95.47% on the three datasets, respectively, which are 5.73%, 5.56%, and 0.57% higher than those achieved by the state-of-the-art unsupervised method. Even when trained with the same number of human-labeled training samples, our method exhibits competitive performance compared to supervised methods. When compared to ML-EDAN [16], our method achieves slightly better or similar performance, with OA values changing by −1.13%, −0.12%, and +0.89%, respectively. In summary, our approach extends the application of diffusion models to HSI-CD, achieving superior results compared to previous methods.

The rest of this article is organized as follows. Section II introduces the related work. Section III describes the proposed HSI-CD framework in detail. Section IV presents the experiments. Finally, the conclusion of this paper is drawn in Section V.

II. RELATED WORK

A. Unsupervised HSI-CD

There has been growing interest in unsupervised HSI-CD methods based on deep learning in recent years. Recent studies have focused on mitigating the impact of noisy labels in pseudo-labels [17], [18]. Li et al. [18] proposed an unsupervised fully convolutional HSI-CD framework based on noise modeling. This framework uses parallel Siamese fully convolutional networks (FCNs) to extract features from bitemporal images separately. The unsupervised noise modeling module can alleviate the accuracy limitation caused by pseudo-labels. An unsupervised method [19] that self-generates trusted labels has been proposed to improve pseudo-labels' quality. This method combines two model-driven methods, CVA and SSIM, to generate trusted pseudo-training sets, and the trusted pseudo-labels can improve the performance of deep learning networks. While recent advances in unsupervised HSI-CD methods have shown promise, the efficient extraction of changing features remains challenging [20], [21]. UTBANet [21] aims to reconstruct HSIs and adds a decoding branch to reconstruct edge information. Unlike previous methods, this paper utilizes many unlabeled HSI-CD samples to train the SCDM to extract semantically relevant spectral-spatial information.

B. Diffusion models

Diffusion models [14], [22], [23] are Markov chains that reconstruct data samples through a step-by-step denoising process, beginning with randomly distributed samples. Recently, methods based on diffusion models have achieved impressive results in various fields, such as computer vision [10], [24]–[26], natural language processing [27], [28], multimodal learning [29], [30], time series modeling [31], [32], etc. Diffusion models have also been gradually explored for visual representation learning; Baranchuk et al. [33] demonstrated that diffusion models can serve as a tool for semantic segmentation, especially when labeled data is scarce. Gu et al. [34] proposed a new framework, DiffusionInst, which represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process. In this paper, we propose the SCDM and further explore the application of the diffusion model in the field of HSI-CD. To our knowledge, this is the first work that employs a diffusion model for HSI-CD.

C. Contrastive learning

Contrastive learning [35]–[37] learns feature representations of samples by automatically constructing similar and dissimilar samples. BYOL [38] relies on the interaction of an online network and a target network for learning. The online network is trained on augmented views of an image to predict the target network's representations of the same image under different augmentations. SimSiam [39] theoretically explained that the essence of twin-network representation learning with stop-gradient is the Expectation-Maximization (EM) algorithm. BYOL [38] and SimSiam [39] still work without negative samples. Recently, contrastive learning has achieved promising results in HSI classification tasks [40], [41]. Ou et al. [42] proposed an HSI-CD framework based on a self-supervised contrastive learning pre-training model and designed a data augmentation strategy based on Gaussian noise for constructing positive and negative samples. In this paper, we design a CTCL network that extracts the invariant features of spectral differences caused by environmental changes, thereby reducing the impact of imaging conditions and environmental changes on CD results.
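The full CTCL objective is not reproduced in this excerpt; as a rough illustration of the cross-temporal alignment idea described above, here is a minimal InfoNCE-style sketch in PyTorch (our own simplification, not the paper's exact loss; all function and variable names are hypothetical). Each row of z_t1 and z_t2 is assumed to be the spectral embedding of the same, presumably unchanged, pixel at times T1 and T2.

```python
import torch
import torch.nn.functional as F

def cross_temporal_infonce(z_t1, z_t2, temperature=0.1):
    """Pull together the embeddings of the same (assumed unchanged) pixel at
    times T1 and T2, and push apart embeddings of different pixels.
    z_t1, z_t2: (N, D) batches of spectral feature vectors."""
    z1 = F.normalize(z_t1, dim=1)
    z2 = F.normalize(z_t2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (N, N) cross-temporal similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    # Symmetric loss: T1 -> T2 and T2 -> T1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Minimizing such a loss over unchanged pixel pairs encourages representations that ignore spectral shifts caused purely by differing imaging conditions, which is the behavior CTCL is designed to induce.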
Fig. 1. The proposed DiffUCD framework consists of two main modules: SCDM and CTCL. SCDM fully considers the semantic correlation of spectral-spatial features and reconstructs the semantically correlated essential features of the original image. CTCL deals with the problem of the same object exhibiting different spectra and constrains the network to learn the spectral difference invariant features caused by environmental changes.
III. PROPOSED METHOD

This section provides an overview of the DDPM framework [22], [43], [44] and describes the proposed DiffUCD model in detail. Fig. 1 illustrates the architecture of the DiffUCD model, which comprises three main parts: the SCDM, CTCL, and the CD head.

A. Preliminaries

Inspired by nonequilibrium thermodynamics [45], a series of probabilistic generative models called diffusion models have been proposed. There are currently three popular formulations based on diffusion models: denoising diffusion probabilistic models (DDPMs) [22], [43], [44], score-based generative models (SGMs) [23], [46], and stochastic differential equations (Score SDEs) [14], [47]. In this paper, we expand the application of DDPMs to the HSI-CD domain.

Denoising diffusion probabilistic models typically use two Markov chains: a forward chain that perturbs the image with noise and a reverse chain that denoises the noisy image. The forward chain is a forward diffusion process, which gradually adds Gaussian noise to the input data to create interference. The reverse chain learns a denoising network that reverses the forward diffusion process. In the forward diffusion process of noise injection, Gaussian noise is gradually added to the clean data x_0 \sim p(x_0) until the data is entirely degraded, resulting in a Gaussian distribution \mathcal{N}(0, \mathbf{I}). Formally, the operation at each time step t in the forward diffusion process is defined as

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)    (1)

Here, (x_0, x_1, \cdots, x_T) represents a T-step Markov chain, and \beta_t \in (0, 1) denotes the noise schedule.

Importantly, given a clean data sample x_0, we can obtain a noisy sample x_t by sampling a Gaussian vector \epsilon \sim \mathcal{N}(0, \mathbf{I}) and applying the transformation directly to x_0:
q(x_t \mid x_0) = \mathcal{N}\left(x_t \mid x_0\sqrt{\bar{\alpha}_t},\ (1-\bar{\alpha}_t)\mathbf{I}\right)    (2)

x_t = x_0\sqrt{\bar{\alpha}_t} + \epsilon_t\sqrt{1-\bar{\alpha}_t}, \quad \epsilon_t \sim \mathcal{N}(0, \mathbf{I})    (3)

B. DiffUCD

The proposed DiffUCD framework comprises an SCDM, a CTCL module, and a CD head, as illustrated in Fig. 1. The SCDM can use a large number of unlabeled samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. CTCL aligns the spectral sequence information of unchanged pixels, guiding the network to extract features that are insensitive to spectral differences resulting from variations in imaging conditions and environments.

1) Semantic Correlation Diffusion Model: We utilize the forward diffusion process of DDPM [22], given in Eq. (6), which corrupts the input HSI H_0 to obtain H_t at a random time step t:

H_t(H_0, \epsilon_t) = H_0\sqrt{\bar{\alpha}_t} + \epsilon_t\sqrt{1-\bar{\alpha}_t}    (6)

where \bar{\alpha}_t = \prod_{i=0}^{t}\alpha_i = \prod_{i=0}^{t}(1-\beta_i) and \epsilon_t \sim \mathcal{N}(0, \mathbf{I}).

Fig. 1 illustrates that the SCDM takes a patch x_t \in \mathbb{R}^{C\times K\times K} from H_t at time T1 or T2 as input. Our SCDM is structured similarly to U-ViT [48], with the time step t, condition c, and noise image x_t all used as tokens for input into the SCDM. In contrast to the long-skip-connection method of U-ViT, we employ a multi-head cross-attention (MCA) approach for feature fusion between the shallow and deep layers. The noise image x_t is fed into \epsilon_\theta(x_t, t, c), parameterized by the SCDM. The pixel-level representation \hat{x}_0 of x_0 is obtained from the \epsilon_\theta(x_t, t, c) network by inverting Eq. (6):

\hat{x}_0 = \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t, c)\right)/\sqrt{\bar{\alpha}_t}

The spectral-spatial features from the SCDM are then fused with the CTCL encoder features as

\hat{X} = \frac{1}{3}\left(\mathrm{Conv}(\mathrm{Sub}(\hat{X}_1, \hat{X}_2)) + \mathrm{Concat}(\hat{X}_1, \hat{X}_2) + \mathrm{Concat}(\hat{x}_0^1, \hat{x}_0^2)\right)    (8)

Here, \hat{X}_1 and \hat{X}_2 represent the encoder output features obtained through CTCL, while \hat{x}_0^1 and \hat{x}_0^2 denote the spectral-spatial features extracted by the SCDM. The Concat(·) function superimposes features along the channel dimension, while Sub(·) calculates the difference between features. The resulting fused features \hat{X} are then passed to the CD head to generate the final CD map. The structure of the CD head used in this paper is consistent with the spatial transformer in Fig. 1.

D. Training

The training process comprises two stages: 1) The SCDM is pre-trained using a large number of unlabeled HSI-CD samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. 2) A small set of pseudo-label samples is used to train the CTCL network. The spectral-spatial features extracted by the SCDM are fused with the spectrally invariant features learned by the CTCL network and then passed through the CD head to generate the ultimate CD map.
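To make the forward corruption of Eq. (6) that drives the stage-1 SCDM pre-training concrete, here is a minimal PyTorch sketch (the linear β schedule, its range, and all helper names are our own assumptions; only the timestep count T = 200 matches the setting reported in the implementation details):

```python
import torch

def make_alphas_bar(T=200, beta_start=1e-4, beta_end=0.02):
    """Cumulative product ᾱ_t for a linear β schedule (the β range is a common
    DDPM default and an assumption here; only T = 200 follows the paper)."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(h0, t, alphas_bar):
    """Forward corruption of Eq. (6): H_t = sqrt(ᾱ_t) H_0 + sqrt(1 − ᾱ_t) ε.
    h0: clean HSI patches (B, C, K, K); t: integer timesteps of shape (B,)."""
    eps = torch.randn_like(h0)                 # ε ~ N(0, I)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)    # broadcast ᾱ_t over (C, K, K)
    h_t = torch.sqrt(a_bar) * h0 + torch.sqrt(1.0 - a_bar) * eps
    return h_t, eps                            # ε is the regression target for ε_θ
```

During pre-training, the SCDM ε_θ(x_t, t, c) would be optimized to predict eps from the corrupted patch h_t, which is the standard DDPM denoising objective.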
Fig. 2. Visualizations of the proposed method and state-of-the-art unsupervised methods on three datasets: (a) DSFA [49], (b) HyperNet [50], (c) Base+SCDM, (d) Ours, (e) GT. From top to bottom: the Santa Barbara, Bay Area, and Hermiston datasets. Color legend: FP, FN, TN, TP, and unknown area.
\mathrm{precision} = \frac{TP}{TP + FP}    (13)

\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}    (15)
Fig. 3. The t-SNE visualization of features extracted on the three datasets. From top to bottom: the Santa Barbara, Bay Area, and Hermiston datasets.
TABLE V
ABLATION EXPERIMENTS ON MODULE EFFECTIVENESS ON THREE DATASETS

\mathrm{KC} = \frac{\mathrm{OA} - \mathrm{PRE}}{1 - \mathrm{PRE}}    (16)

\mathrm{PRE} = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}    (17)

F1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}    (18)
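For reference, a minimal NumPy sketch that computes these metrics from a predicted binary change map and the ground truth (function and variable names are ours; recall follows the standard TP/(TP + FN) definition, and pixels in the unknown area should be masked out beforehand):

```python
import numpy as np

def cd_metrics(pred, gt):
    """OA, KC, precision, recall, and F1 from binary change maps (Eqs. (13)-(18));
    pred and gt are 0/1 arrays of the same shape, with unlabeled pixels removed."""
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)                                      # Eq. (13)
    recall = tp / (tp + fn)                                         # standard definition
    oa = (tp + tn) / n                                              # Eq. (15)
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2    # Eq. (17)
    kc = (oa - pre) / (1 - pre)                                     # Eq. (16)
    f1 = 2.0 / (1.0 / recall + 1.0 / precision)                     # Eq. (18)
    return {"OA": oa, "KC": kc, "precision": precision, "recall": recall, "F1": f1}
```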
2) Implementation Details: We perform all experiments on the PyTorch platform, running on an NVIDIA GTX 2080Ti GPU with 11 GB of memory. The batch size is 128, and a patch size of 7 is used to process the input data. In the first stage, the SCDM is pre-trained for 1000 epochs using the AdamW optimizer [58] with an initial learning rate of 1e-5. The timestep for the SCDM is set to 200. In the second stage, we fix the parameters of the SCDM and use the Adadelta optimizer [59] to optimize the CTCL and CD head networks. The initial learning rate is set to 1 and linearly decreases to 0 over 200 epochs. Through experiments, we choose the spectral-spatial features produced by the SCDM at t = 5, 10, 100 as the input features of the CD head.
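A minimal sketch of the two-stage optimization setup just described (the module definitions are placeholders, and the linear scheduler is our own reading of "linearly decreases to 0 over 200 epochs"):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real SCDM, CTCL encoder, and CD head.
scdm, ctcl, cd_head = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)

# Stage 1: pre-train the SCDM with AdamW at an initial learning rate of 1e-5.
scdm_opt = torch.optim.AdamW(scdm.parameters(), lr=1e-5)

# Stage 2: freeze the SCDM, then optimize the CTCL and CD head with Adadelta,
# starting at lr = 1 and decaying linearly toward 0 over 200 epochs.
for p in scdm.parameters():
    p.requires_grad_(False)
head_opt = torch.optim.Adadelta(
    list(ctcl.parameters()) + list(cd_head.parameters()), lr=1.0)
sched = torch.optim.lr_scheduler.LinearLR(
    head_opt, start_factor=1.0, end_factor=0.0, total_iters=200)

for epoch in range(200):
    # ... forward pass, loss computation, and head_opt.step() would go here ...
    sched.step()
```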
Fig. 4. Pseudo-color images of the Santa Barbara dataset reconstructed by the SCDM denoising process at different timestamps: (a) t = 200, (b) t = 50, (c) t = 10, (d) t = 5, (e) t = 0. Images at times T1 and T2 are shown from top to bottom.

Fig. 5. Pseudo-color images of the Bay Area dataset reconstructed by the SCDM denoising process at different timestamps: (a) t = 200, (b) t = 50, (c) t = 10, (d) t = 5, (e) t = 0. Images at times T1 and T2 are shown from top to bottom.
C. Comparison to State-of-the-art Methods

We conduct a comprehensive comparison of our method with recent unsupervised and supervised HSI-CD methods, including CVA [53], PCA [51], ISFA [54], DSFA [49], MSCD [55], HyperNet [50], BCG-Net [57], BCNNs [56], and ML-EDAN [16]. Fig. 2 presents a visual comparison of these methods on the three datasets.

From the visual observations in Fig. 2, it is evident that our proposed method, DiffUCD, exhibits the smallest regions of red and green. This compelling visualization underscores the superior performance of DiffUCD compared to all other methods. Table II, Table III, and Table IV provide the quantitative results of DiffUCD alongside various state-of-the-art methods across the three datasets. Remarkably, our proposed method substantially improves performance over the state-of-the-art unsupervised methods, as evidenced by significant margins in OA, KC, and F1-score. Specifically, DiffUCD surpasses the unsupervised methods on the Santa Barbara dataset by remarkable margins of 5.73%, 11.93%, and 7.17% in terms of OA, KC, and F1-score, respectively. Furthermore, compared to supervised methods trained on an equivalent number of human-annotated training examples, our method demonstrates comparable or superior performance.

D. Ablation Study

1) Effectiveness of the module: We conduct a comprehensive ablation study to verify the effectiveness of the proposed SCDM and CTCL. The results are shown in Table V. After adding the pre-training of the SCDM, the results of the network on the three datasets are significantly improved. We argue that this is because the SCDM pre-training process utilizes many unlabeled samples, which can extract the semantic correlation of spectral-spatial features of the CD dataset. The third row of Table V adds a CTCL module to the base model, improving CD accuracy on the three datasets by aligning the spectral features of unchanged samples. The fourth row gives the experimental results of the proposed DiffUCD model, whose OA values on the three datasets increase by 6.39%, 4.58%, and 2.64%, respectively. These experiments fully prove the effectiveness of the proposed DiffUCD and its sub-modules.

2) Comparison of feature extraction ability: Fig. 3 visually demonstrates the effectiveness of the SCDM in extracting compact intra-class features compared to the base model. Notably, the feature distances obtained through the CTCL mechanism are significantly larger on the Santa Barbara and Hermiston datasets. The t-SNE visualization further reinforces the discriminative nature of our model. The t-SNE plot vividly illustrates that the features extracted by DiffUCD are well-separated, allowing for distinct clusters corresponding to different classes. This enhanced feature separability plays a crucial role in boosting CD accuracy.

3) The influence of timestamp t on the reconstruction effect: Fig. 4 and Fig. 5 provide qualitative evidence of the effectiveness of DiffUCD in both noise removal and feature reconstruction of the original HSI. The visualization results clearly illustrate how the denoising process of DiffUCD fully incorporates the semantic correlation of spectral-spatial features, enabling the extraction of essential features that preserve the original image's semantic correlation.

V. CONCLUSION

This work presents a novel diffusion framework, called DiffUCD, designed explicitly for HSI-CD. To our knowledge, this is the first diffusion model developed for this particular task. DiffUCD leverages many unlabeled samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. Additionally, we employ CTCL to align the spectral feature representations of unchanged samples. This alignment facilitates learning the invariant spectral difference features essential for capturing environmental changes. We evaluate the performance of our proposed method on three publicly available datasets and demonstrate that it achieves significant improvements over state-of-the-art unsupervised methods in terms of OA, KC, and F1 metrics. Furthermore, the diffusion model holds great potential as a novel solution for the HSI-CD task. We hope our work will inspire the development of new approaches and foster advancements in this field.

REFERENCES

[1] D. Lu, P. Mausel, E. Brondizio, and E. Moran, "Change detection techniques," International Journal of Remote Sensing, vol. 25, no. 12, pp. 2365–2401, 2004.
[2] S. Liu, D. Marinelli, L. Bruzzone, and F. Bovolo, "A review of change detection in multitemporal hyperspectral images: Current techniques, applications, and challenges," IEEE Geoscience and Remote Sensing Magazine, vol. 7, no. 2, pp. 140–158, 2019.
[3] F. Aslami and A. Ghorbani, "Object-based land-use/land-cover change detection using Landsat imagery: a case study of Ardabil, Namin, and Nir counties in northwest Iran," Environmental Monitoring and Assessment, vol. 190, pp. 1–14, 2018.
[4] T. Rumpf, A.-K. Mahlein, U. Steiner, E.-C. Oerke, H.-W. Dehne, and L. Plümer, "Early detection and classification of plant diseases with support vector machines based on hyperspectral reflectance," Computers and Electronics in Agriculture, vol. 74, no. 1, pp. 91–99, 2010.
[5] Y. Wang, D. Hong, J. Sha, L. Gao, L. Liu, Y. Zhang, and X. Rong, "Spectral–spatial–temporal transformers for hyperspectral image change detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
[6] W. Dong, J. Zhao, J. Qu, S. Xiao, N. Li, S. Hou, and Y. Li, "Abundance matrix correlation analysis network based on hierarchical multi-head self-cross-hybrid attention for hyperspectral change detection," IEEE Transactions on Geoscience and Remote Sensing, 2023.
[7] D. Chakraborty and A. Ghosh, "Unsupervised change detection in hyperspectral images using feature fusion deep convolutional autoencoders," arXiv preprint arXiv:2109.04990, 2021.
[8] J. Lei, M. Li, W. Xie, Y. Li, and X. Jia, "Spectral mapping with adversarial learning for unsupervised hyperspectral change detection," Neurocomputing, vol. 465, pp. 71–83, 2021.
[9] G. Kim, T. Kwon, and J. C. Ye, "DiffusionCLIP: Text-guided diffusion models for robust image manipulation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2426–2435.
[10] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, "Image super-resolution via iterative refinement," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[11] H. Tan, S. Wu, and J. Pi, "Semantic diffusion network for semantic segmentation," Advances in Neural Information Processing Systems, vol. 35, pp. 8702–8716, 2022.
[12] J. Wu, H. Fang, Y. Zhang, Y. Yang, and Y. Xu, "MedSegDiff: Medical image segmentation with diffusion probabilistic model," arXiv preprint arXiv:2211.00611, 2022.
[13] S. Chen, P. Sun, Y. Song, and P. Luo, "DiffusionDet: Diffusion model for object detection," arXiv preprint arXiv:2211.09788, 2022.
[14] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456, 2020.
[15] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, "SRDiff: Single image super-resolution with diffusion probabilistic models," Neurocomputing, vol. 479, pp. 47–59, 2022.
[16] J. Qu, S. Hou, W. Dong, Y. Li, and W. Xie, "A multilevel encoder–decoder attention network for change detection in hyperspectral images," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2021.
[17] H. Zhao, K. Feng, Y. Wu, and M. Gong, "An efficient feature extraction network for unsupervised hyperspectral change detection," Remote Sensing, vol. 14, no. 18, p. 4646, 2022.
[18] X. Li, Z. Yuan, and Q. Wang, "Unsupervised deep noise modeling for hyperspectral image change detection," Remote Sensing, vol. 11, no. 3, p. 258, 2019.
[19] Q. Li, H. Gong, H. Dai, C. Li, Z. He, W. Wang, Y. Feng, F. Han, A. Tuniyazi, H. Li et al., "Unsupervised hyperspectral image change detection via deep learning self-generated credible labels," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 9012–9024, 2021.
[20] Z. Hou, W. Li, R. Tao, and Q. Du, "Three-order Tucker decomposition and reconstruction detector for unsupervised hyperspectral change detection," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 6194–6205, 2021.
[21] S. Liu, H. Li, F. Wang, J. Chen, G. Zhang, L. Song, and B. Hu, "Unsupervised transformer boundary autoencoder network for hyperspectral image change detection," Remote Sensing, vol. 15, no. 7, p. 1868, 2023.
[22] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[23] Y. Song and S. Ermon, "Generative modeling by estimating gradients of the data distribution," Advances in Neural Information Processing Systems, vol. 32, 2019.
[24] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[25] E. A. Brempong, S. Kornblith, T. Chen, N. Parmar, M. Minderer, and M. Norouzi, "Denoising pretraining for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4175–4186.
[26] J. Wyatt, A. Leach, S. M. Schmon, and C. G. Willcocks, "AnoDDPM: Anomaly detection with denoising diffusion probabilistic models using simplex noise," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 650–656.
[27] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, "Structured denoising diffusion models in discrete state-spaces," Advances in Neural Information Processing Systems, vol. 34, pp. 17981–17993, 2021.
[28] X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto, "Diffusion-LM improves controllable text generation," Advances in Neural Information Processing Systems, vol. 35, pp. 4328–4343, 2022.
[29] O. Avrahami, D. Lischinski, and O. Fried, "Blended diffusion for text-driven editing of natural images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18208–18218.
[30] D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y. Zou, and D. Yu, "Diffsound: Discrete diffusion model for text-to-sound generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[31] Y. Tashiro, J. Song, Y. Song, and S. Ermon, "CSDI: Conditional score-based diffusion models for probabilistic time series imputation," Advances in Neural Information Processing Systems, vol. 34, pp. 24804–24816, 2021.
[32] K. Rasul, C. Seward, I. Schuster, and R. Vollgraf, "Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting," in International Conference on Machine Learning. PMLR, 2021, pp. 8857–8868.
[33] D. Baranchuk, I. Rubachev, A. Voynov, V. Khrulkov, and A. Babenko, "Label-efficient semantic segmentation with diffusion models," arXiv preprint arXiv:2112.03126, 2021.
[34] Z. Gu, H. Chen, Z. Xu, J. Lan, C. Meng, and W. Wang, "DiffusionInst: Diffusion model for instance segmentation," arXiv preprint arXiv:2212.02773, 2022.
[35] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[36] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, "Big self-supervised models are strong semi-supervised learners," Advances in Neural Information Processing Systems, vol. 33, pp. 22243–22255, 2020.
[37] G. Wang, X. Zhang, Z. Peng, X. Tang, H. Zhou, and L. Jiao, "Absolute wrong makes better: Boosting weakly supervised object detection via negative deterministic information," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2022, pp. 1378–1384.
[38] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., "Bootstrap your own latent — a new approach to self-supervised learning," Advances in Neural Information Processing Systems, vol. 33, pp. 21271–21284, 2020.
[39] X. Chen and K. He, "Exploring simple siamese representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
[40] X. Hu, T. Li, T. Zhou, Y. Liu, and Y. Peng, "Contrastive learning based on transformer for hyperspectral image classification," Applied Sciences, vol. 11, no. 18, p. 8670, 2021.
[41] P. Guan and E. Y. Lam, "Cross-domain contrastive learning for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
[42] X. Ou, L. Liu, S. Tan, G. Zhang, W. Li, and B. Tu, "A hyperspectral image change detection framework with self-supervised contrastive learning pretrained model," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 7724–7740, 2022.
[43] A. Q. Nichol and P. Dhariwal, "Improved denoising diffusion probabilistic models," in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[44] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.
[45] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265.
[46] Y. Song and S. Ermon, "Improved techniques for training score-based generative models," Advances in Neural Information Processing Systems, vol. 33, pp. 12438–12448, 2020.
[47] Y. Song, C. Durkan, I. Murray, and S. Ermon, "Maximum likelihood training of score-based diffusion models," Advances in Neural Information Processing Systems, vol. 34, pp. 1415–1428, 2021.
[48] F. Bao, C. Li, Y. Cao, and J. Zhu, "All are worth words: A ViT backbone for score-based diffusion models," arXiv preprint arXiv:2209.12152, 2022.
[49] B. Du, L. Ru, C. Wu, and L. Zhang, "Unsupervised deep slow feature analysis for change detection in multi-temporal remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 12, pp. 9976–9992, 2019.
[50] M. Hu, C. Wu, and L. Zhang, "HyperNet: Self-supervised hyperspectral spatial–spectral feature understanding network for hyperspectral change detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2022.
[51] J. Deng, K. Wang, Y. Deng, and G. Qi, "PCA-based land-use change detection and analysis using multitemporal and multisensor satellite data," International Journal of Remote Sensing, vol. 29, no. 16, pp. 4823–4838, 2008.
[52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[53] F. Bovolo and L. Bruzzone, "A theoretical framework for unsupervised change detection based on change vector analysis in the polar domain," IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1, pp. 218–236, 2006.
[54] C. Wu, B. Du, and L. Zhang, "Slow feature analysis for change detection in multispectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 5, pp. 2858–2874, 2013.
[55] S. Saha, P. Ebel, and X. X. Zhu, "Self-supervised multisensor change detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–10, 2022.
[56] Y. Lin, S. Li, L. Fang, and P. Ghamisi, "Multispectral change detection with bilinear convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 10, pp. 1757–1761, 2019.
[57] M. Hu, C. Wu, B. Du, and L. Zhang, "Binary change guided hyperspectral multiclass change detection," IEEE Transactions on Image Processing, 2023.
[58] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[59] M. D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.