DiffUCD - Unsupervised Hyperspectral Image Change Detection With Semantic Correlation Diffusion Model

1. The document proposes an unsupervised hyperspectral image change detection method called DiffUCD that uses a semantic correlation diffusion model (SCDM). 2. SCDM leverages abundant unlabeled samples and accounts for the semantic correlation of spectral-spatial features to mitigate pseudo changes between images from different times. 3. It also uses cross-temporal contrastive learning to align spectral feature representations of unchanged samples and extract spectral difference invariant features caused by environmental changes.

DiffUCD: Unsupervised Hyperspectral Image Change Detection with Semantic Correlation Diffusion Model
Xiangrong Zhang, Senior Member, IEEE, Shunli Tian, Guanchun Wang, Huiyu Zhou, and
Licheng Jiao, Fellow, IEEE
arXiv:2305.12410v1 [cs.CV] 21 May 2023

Abstract: Hyperspectral image change detection (HSI-CD) has emerged as a crucial research area in remote sensing due to its ability to detect subtle changes on the Earth's surface. Recently, denoising diffusion probabilistic models (DDPMs) have demonstrated remarkable performance in the generative domain. Apart from their image generation capability, the denoising process in diffusion models can comprehensively account for the semantic correlation of spectral-spatial features in HSI, resulting in the retrieval of semantically relevant features in the original image. In this work, we extend the diffusion model's application to the HSI-CD field and propose a novel unsupervised HSI-CD method with a semantic correlation diffusion model (DiffUCD). Specifically, the semantic correlation diffusion model (SCDM) leverages abundant unlabeled samples and fully accounts for the semantic correlation of spectral-spatial features, which mitigates pseudo changes between multi-temporal images arising from inconsistent imaging conditions. Besides, objects with the same semantic concept at the same spatial location may exhibit inconsistent spectral signatures at different times, resulting in pseudo changes. To address this problem, we propose a cross-temporal contrastive learning (CTCL) mechanism that aligns the spectral feature representations of unchanged samples. By doing so, spectral difference-invariant features, i.e., features that are invariant to the spectral differences caused by environmental changes, can be obtained. Experiments conducted on three publicly available datasets demonstrate that the proposed method outperforms the other state-of-the-art unsupervised methods in terms of Overall Accuracy (OA), Kappa Coefficient (KC), and F1 score, achieving improvements of approximately 3.95%, 8.13%, and 4.45%, respectively. Notably, our method can achieve results comparable to those of fully supervised methods requiring numerous annotated samples.

Index Terms: Hyperspectral image, change detection, diffusion model, contrastive learning.

Xiangrong Zhang, Shunli Tian, Guanchun Wang, and Licheng Jiao are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi Province 710071, China.
Huiyu Zhou is with the School of Informatics, University of Leicester, Leicester LE1 7RH, U.K.
This work was supported in part by the National Natural Science Foundation of China under Grant 61871306 and Grant 62171332.

I. INTRODUCTION

Change detection (CD) involves using remote sensing technologies to compare and analyze images taken at different times in the same area, detecting changes in ground objects between two or more images [1]. Hyperspectral data provides continuous spectral information, making it ideal for detecting subtle changes on the Earth's surface. As such, hyperspectral image change detection (HSI-CD) has become an important research focus in remote sensing [2], with applications in land use and land cover change [3], ecosystem monitoring, natural disaster damage assessment [4], and more.

Broadly speaking, HSI-CD can be achieved using supervised and unsupervised methods. Most current methods rely on supervised deep learning networks trained with high-quality labeled samples [5], [6]. However, obtaining high-quality labeled training samples is costly and time-consuming. Thus, reducing or eliminating the reliance on labeled data is critical to addressing the challenge of HSI-CD.

Although deep learning based supervised HSI-CD methods have shown promising results, they still face several challenges: 1) There are often insufficient labeled samples for HSI-CD, making it necessary to leverage both labeled and unlabeled data effectively to train deep learning networks. 2) HSI-CD involves spatiotemporal data, where changes occur over time and exhibit spatial correlations. While existing approaches primarily focus on extracting features, they often overlook the importance of considering spectral-spatial semantic correlations. Incorporating such correlations is essential for accurate CD. 3) Objects with the same semantic concept at the same spatial location can exhibit different spectral features at different times due to changes in imaging conditions and environments (i.e., the same objects with different spectra). While most deep learning-based CD methods focus on fully extracting spectral features, none of them has investigated extracting spectral difference-invariant features, which are robust to the spectral differences caused by environmental changes.

Recently, many unsupervised HSI-CD methods [7], [8] have been proposed. Unlike supervised methods, unsupervised methods do not require pre-labeled data and can learn features of changed regions using only two HSIs. This confers a significant advantage over supervised methods, as it avoids the need for labor-intensive and time-consuming labeling and mitigates issues such as inaccurate and inconsistent labeling. However, the accuracy of unsupervised methods is often lower than that of supervised methods, despite their ability to function without any annotation information.

Diffusion models have recently demonstrated remarkable successes in image generation and synthesis [9], [10]. Thanks to their excellent generative capabilities, researchers have begun exploring the application of diffusion models in visual understanding tasks such as semantic segmentation [11], [12], object detection [13], image colorization [14], super-resolution [10], [15], and more. However, their potential for HSI-CD remains largely unexplored. As such, how to apply diffusion models to HSI-CD remains an open problem.

To address the challenges faced by HSI-CD, we propose an unsupervised approach based on a semantic correlation diffusion model (SCDM) that leverages its strong denoising generation ability. This method consists of two main steps. Firstly, the denoising process of the SCDM can utilize many unlabeled samples, fully consider the semantic correlation of spectral-spatial features, and retrieve the semantically correlated features of the original image. Secondly, we propose a cross-temporal contrastive learning (CTCL) mechanism to address the problem of spectral variations caused by environmental changes. This method aligns the spectral feature representations of unchanged samples cross-temporally, enabling the network to learn features that are invariant to these spectral differences.

The main contributions of this paper are:
• We propose DiffUCD, the first diffusion model designed explicitly for HSI-CD, which can fully consider the semantic correlation of spectral-spatial features and retrieve semantically related features in the original image.
• To address the problem that objects with the same semantic concept at the same spatial location may exhibit different spectral features at different times, we propose CTCL, which enables the network to learn spectral difference-invariant features.
• Extensive experiments on three datasets demonstrate that our proposed method achieves state-of-the-art results compared to other unsupervised HSI-CD methods.

Through experiments on three publicly available datasets (Santa Barbara, Bay Area, and Hermiston), we demonstrate that DiffUCD outperforms state-of-the-art unsupervised methods by a significant margin. Specifically, our method achieves OA values of 96.87%, 96.35%, and 95.47% on the three datasets, respectively, which are 5.73%, 5.56%, and 0.57% higher than those achieved by the best competing unsupervised method on each dataset. Even compared with supervised methods trained with the same number of human-labeled training samples, our method exhibits competitive performance. Compared with ML-EDAN [16], our method achieves comparable performance, with OA differences of −1.13%, −0.12%, and +0.89% on the three datasets, respectively. In summary, our approach extends the application of diffusion models to HSI-CD, achieving superior results compared to previous unsupervised methods.

The rest of this article is organized as follows. Section II introduces the related work. Section III introduces the proposed framework for HSI-CD in detail. Section IV presents the experiments. Finally, the conclusion is drawn in Section V.

II. RELATED WORK

A. Unsupervised HSI-CD

There has been a growing interest in unsupervised HSI-CD methods based on deep learning in recent years. Recent studies have focused on mitigating the impact of noisy labels in pseudo-labels [17], [18]. Li et al. [18] proposed an unsupervised fully convolutional HSI-CD framework based on noise modeling. This framework uses parallel Siamese fully convolutional networks (FCNs) to extract features from bitemporal images separately. The unsupervised noise modeling module can alleviate the accuracy limitation caused by pseudo-labels. An unsupervised method [19] that self-generates trusted labels has been proposed to improve the quality of pseudo-labels. This method combines two model-driven methods, CVA and SSIM, to generate trusted pseudo-training sets, and the trusted pseudo-labels can improve the performance of deep learning networks. While recent advances in unsupervised HSI-CD methods have shown promise, the efficient extraction of change features remains challenging [20], [21]. UTBANet [21] aims to reconstruct HSIs and adds a decoding branch to reconstruct edge information. Unlike previous methods, this paper utilizes many unlabeled HSI-CD samples to train the SCDM to extract semantically relevant spectral-spatial information.

B. Diffusion models

Diffusion models [14], [22], [23] are Markov chains that reconstruct data samples through a step-by-step denoising process, beginning with randomly distributed samples. Recently, methods based on diffusion models have achieved strong results in various fields, such as computer vision [10], [24]–[26], natural language processing [27], [28], multimodal learning [29], [30], and time series modeling [31], [32]. Diffusion models have also been gradually explored for visual representation learning: Baranchuk et al. [33] demonstrated that diffusion models can be used as a tool for semantic segmentation, especially when labeled data is scarce. Gu et al. [34] proposed a new framework, DiffusionInst, which represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process. In this paper, we propose SCDM and further explore the application of the diffusion model in the field of HSI-CD. To our knowledge, this is the first work that employs a diffusion model for HSI-CD.

C. Contrastive learning

Contrastive learning [35]–[37] learns feature representations of samples by automatically constructing similar and dissimilar samples. BYOL [38] relies on the interaction of the online and target networks for learning. An online network is trained on an augmented view of an image to predict the target network's representation of the same image under a different augmented view. SimSiam [39] theoretically explained that the essence of twin network representation learning with stop-gradient is the Expectation-Maximization (EM) algorithm. BYOL [38] and SimSiam [39] still work without negative samples. Recently, contrastive learning has achieved promising results in HSI classification tasks [40], [41]. Ou et al. [42] proposed an HSI-CD framework based on a self-supervised contrastive learning pre-training model and designed a data augmentation strategy based on Gaussian noise for constructing positive and negative samples. In this paper, we design a CTCL network that can extract features invariant to the spectral differences caused by environmental changes, thereby reducing the impact of imaging conditions and environmental changes on CD results.
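As a rough illustration of the kind of contrastive objective discussed here, the SimCLR-style loss that the paper later defines for CTCL in Eqs. (10) and (11) can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the function names, the toy two-dimensional embeddings, and the temperature value 0.5 are all hypothetical, and the paper's actual implementation (a spectral transformer encoder with an MLP, trained in PyTorch) is not reproduced.

```python
import math


def cosine_similarity(u, v):
    """sim(u, v): cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(z, i, j, tau):
    """Eq. (10): -log(exp(sim(z_i, z_j)/tau) / sum_{k != i} exp(sim(z_i, z_k)/tau))."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def contrastive_loss(z, tau=0.5):
    """Eq. (11): average over Q positive pairs; tau=0.5 is an assumed value."""
    q = len(z) // 2
    total = 0.0
    for k in range(q):
        a, b = 2 * k, 2 * k + 1  # zero-based version of the pair (2k-1, 2k)
        total += pair_loss(z, a, b, tau) + pair_loss(z, b, a, tau)
    return total / (2 * q)


# Toy embeddings (hypothetical): rows 0/1 and rows 2/3 stand for the T1/T2
# features of the same unchanged pixel, so they form the positive pairs.
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same vectors, but paired so that positives point in different directions.
shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_aligned = contrastive_loss(aligned)
loss_shuffled = contrastive_loss(shuffled)
```

In this toy setting the loss is lower when the two temporal views of an unchanged pixel are close in feature space, which is the behavior the alignment of unchanged samples relies on.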

Fig. 1. The proposed DiffUCD framework consists of two main modules: SCDM and CTCL. SCDM fully considers the semantic correlation of spectral-spatial features and reconstructs the semantically correlated essential features of the original image. CTCL handles the problem of the same object exhibiting different spectra and constrains the network to learn features that are invariant to the spectral differences caused by environmental changes.

III. PROPOSED METHOD

This section provides an overview of the DDPM framework [22], [43], [44] and describes the proposed DiffUCD model in detail. Fig. 1 illustrates the architecture of the DiffUCD model, which comprises three main parts: the SCDM, the CTCL, and the CD head.

A. Preliminaries

Inspired by nonequilibrium thermodynamics [45], a series of probabilistic generative models called diffusion models have been proposed. There are currently three popular formulations based on diffusion models: denoising diffusion probabilistic models (DDPMs) [22], [43], [44], score-based generative models (SGMs) [23], [46], and stochastic differential equations (Score SDEs) [14], [47]. In this paper, we expand the application of DDPMs to the HSI-CD domain.

DDPMs typically use two Markov chains: a forward chain that perturbs the image with noise and a reverse chain that denoises the noisy image. The forward chain is a forward diffusion process that gradually adds Gaussian noise to the input data. The reverse chain learns a denoising network that reverses the forward diffusion process. In the forward diffusion process, Gaussian noise is gradually added to the clean data $x_0 \sim p(x_0)$ until the data is entirely degraded, resulting in a Gaussian distribution $\mathcal{N}(0, I)$. Formally, the operation at each time step $t$ in the forward diffusion process is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{1}$$

Here $(x_0, x_1, \cdots, x_T)$ represents a $T$-step Markov chain, and $\beta_t \in (0, 1)$ represents the noise schedule.

Importantly, given a clean data sample $x_0$, we can obtain a noisy sample $x_t$ by sampling the Gaussian vector $\epsilon_t \sim \mathcal{N}(0, I)$ and applying the transformation directly to $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right) \tag{2}$$

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I) \tag{3}$$

To add noise to $x_0$, we use Eq. (3) to transform the data into $x_t$ for each time step $t \in \{0, 1, \ldots, T\}$. Here $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=0}^{t} \alpha_i = \prod_{i=0}^{t} (1-\beta_i)$.

During the training phase, a U-ViT [48]-like network $\epsilon_\theta(x_t, t)$ is trained to predict $\epsilon$ by minimizing the following L2 training objective:

$$L = \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 = \left\| \epsilon - \epsilon_\theta\left(\sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon,\ t\right) \right\|^2 \tag{4}$$

During the inference stage, given a noisy input $x_t$, the trained model $\epsilon_\theta(x_t, t)$ is used to denoise and obtain $x_{t-1}$. This process can be mathematically represented as follows:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z \tag{5}$$

where $z \sim \mathcal{N}(0, I)$ and $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$. $x_0$ is obtained from $x_t$ through continuous iteration, i.e., $x_t \to x_{t-1} \to x_{t-2} \to \ldots \to x_0$.

In this work, we aim to address the task of unsupervised HSI-CD using a diffusion model. Specifically, we consider the data sample $x_0$ as a patch from the HSI at either T1 or T2. We begin by corrupting $x_0$ with Gaussian noise using Eq. (3) to obtain the noisy input $x_t$ for the noise predictor $\epsilon_\theta(x_t, t, c)$. We define $\epsilon_\theta(x_t, t, c)$ as a noise predictor that can extract spectral-spatial features that are useful for downstream HSI-CD tasks.

B. DiffUCD

The proposed DiffUCD framework comprises an SCDM, a CTCL module, and a CD head, as illustrated in Fig. 1. The SCDM can use a large number of unlabeled samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. CTCL aligns the spectral sequence information of unchanged pixels, guiding the network to extract features that are insensitive to spectral differences resulting from variations in imaging conditions and environments.

1) Semantic Correlation Diffusion Model: We utilize the forward diffusion process proposed by DDPM [22] in Eq. (6), which corrupts the input HSI $H_0$ to obtain $H_t$ at a random time step $t$:

$$H_t(H_0, \epsilon_t) = \sqrt{\bar{\alpha}_t}\, H_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t \tag{6}$$

where $\bar{\alpha}_t = \prod_{i=0}^{t} \alpha_i = \prod_{i=0}^{t} (1-\beta_i)$ and $\epsilon_t \sim \mathcal{N}(0, I)$.

Fig. 1 illustrates that the SCDM takes a patch $x_t \in \mathbb{R}^{C \times K \times K}$ from $H_t$ at time T1 or T2 as input. Our SCDM is structured similarly to U-ViT [48], with the time step $t$, condition $c$, and noisy image $x_t$ all used as tokens for input into the SCDM. In contrast to the long skip connections of U-ViT, we employ a multi-head cross-attention (MCA) approach for feature fusion between the shallow and deep layers. The noisy image $x_t$ is fed into $\epsilon_\theta(x_t, t, c)$, parameterized by the SCDM. The pixel-level representation $\hat{x}_0$ of $x_0$ is obtained through the $\epsilon_\theta(x_t, t, c)$ network, and the corresponding formula is given as follows:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t, c) \right) \tag{7}$$

2) Cross-Temporal Contrastive Learning: The proposed CTCL module aims to learn more discriminative features for HSI-CD by emphasizing spectral difference-invariant features between unchanged samples at times T1 and T2. The architecture consists of two parts: a spectral transformer encoder and an MLP. To construct positive and negative sample pairs, unchanged pixels at the same location but different phases are used as positive samples, while the rest are negative samples. The CTCL network takes $X_1$ and $X_2$ as input and produces contrastive feature representations $z_i$ and $z_j$, which are then aligned through a contrastive loss function. This architecture aims to shorten the distance between the feature representations of unchanged pixel samples in different phases, which helps the network extract more robust and invariant features that are less affected by environmental changes.

C. Change Detection Head

We employ a fusion module to fuse the semantically correlated spectral-spatial features obtained by the SCDM with the spectral difference-invariant features extracted by CTCL. The module is formulated as follows:

$$\hat{X} = \frac{1}{3} \left( \mathrm{Conv}(\mathrm{Sub}(\hat{X}_1, \hat{X}_2)) + \mathrm{Concat}(\hat{X}_1, \hat{X}_2) + \mathrm{Concat}(\hat{x}_0^1, \hat{x}_0^2) \right) \tag{8}$$

Here, $\hat{X}_1$ and $\hat{X}_2$ represent the encoder output features obtained through CTCL, while $\hat{x}_0^1$ and $\hat{x}_0^2$ denote the spectral-spatial features extracted by the SCDM. The $\mathrm{Concat}(\cdot)$ function concatenates features along the channel dimension, while $\mathrm{Sub}(\cdot)$ calculates the difference between features. The resulting fused features $\hat{X}$ are then passed to the CD head to generate the final CD map. The structure of the CD head used in this paper is consistent with the spatial transformer in Fig. 1.

D. Training

The training process comprises two stages: 1) The SCDM is pre-trained using a large number of unlabeled HSI-CD samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. 2) A small set of pseudo-labeled samples is used to train the CTCL network. The spectral-spatial features extracted by the SCDM are fused with the spectral difference-invariant features learned by the CTCL network and then passed through the CD head to generate the final CD map.
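To make the two diffusion formulas used by the SCDM more concrete, the following is a minimal sketch of the forward noising step in Eq. (3) and the clean-sample estimate in Eq. (7), written with the Python standard library only. Several things here are assumptions rather than details from the paper: the linear schedule and its endpoint values, the function names, the use of a flat list of floats as a stand-in for a $C \times K \times K$ patch, and the replacement of the trained noise predictor by the true noise (which simply checks that Eq. (7) inverts Eq. (3)).

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule; the endpoint values are assumptions."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_i) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, alpha_bar_t, rng):
    """Eq. (3): x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    xt = [a * v + b * e for v, e in zip(x0, eps)]
    return xt, eps


def estimate_x0(xt, eps_pred, alpha_bar_t):
    """Eq. (7): x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(v - b * e) / a for v, e in zip(xt, eps_pred)]


rng = random.Random(0)
betas = linear_beta_schedule(200)      # 200 steps, the timestep count reported for the SCDM
alpha_bars = cumulative_alphas(betas)
x0 = [rng.random() for _ in range(8)]  # stand-in for a flattened HSI patch
t = 50
xt, eps = q_sample(x0, alpha_bars[t], rng)

# With a perfect noise prediction (here: the true noise), Eq. (7) recovers x0
# up to floating-point rounding error.
x0_hat = estimate_x0(xt, eps, alpha_bars[t])
max_err = max(abs(u - v) for u, v in zip(x0, x0_hat))
```

In practice the noise passed to `estimate_x0` would come from the trained network $\epsilon_\theta(x_t, t, c)$, so the quality of $\hat{x}_0$ depends on how well the noise is predicted at the chosen time step.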

1) Pretrained Semantic Correlation Diffusion Model: To pre-train the SCDM, we selected the Santa Barbara, Bay Area, and Hermiston datasets¹, which contain large amounts of unlabeled data. For the input $x_0$, we randomly sampled the time step $t$ and added noise using Eq. (3) to obtain $x_t$. The SCDM predicts the noise in $x_t$, and the estimated features of the input data $x_0$ are then calculated using Eq. (7). The noise loss for the SCDM is defined as follows:

$$\mathcal{L}_{noise} = \mathbb{E}_{t, x_0, c, \epsilon} \sum_{i=1}^{N} \left\| \epsilon^i - \epsilon_\theta\left(x_t^i, t, c\right) \right\|^2 = \mathbb{E}_{t, x_0, c, \epsilon} \sum_{i=1}^{N} \left\| \epsilon^i - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0^i + \sqrt{1-\bar{\alpha}_t}\, \epsilon^i,\ t,\ c\right) \right\|^2 \tag{9}$$

where $\epsilon^i$ represents the noise added to the $i$-th sample using Eq. (3), and $N$ represents the number of samples.

¹ https://citius.usc.es/investigacion/datasets/hyperspectral-change-detection

2) Training the Cross-Temporal Contrastive Learning and Change Detection Head: In the second stage, we keep the pre-trained SCDM parameters fixed and only train the CTCL and CD head networks. Our goal is to learn features that are invariant to spectral differences caused by environmental changes. To achieve this, we use CTCL to align the spectral feature representations of unchanged samples. First, we obtain pseudo-labels using the traditional unsupervised method PCA [51] and then use them to train the entire network. We feed the original samples $X_1$ and $X_2$ into the CTCL to obtain contrastive feature representations $z_i$ and $z_j$. The loss function of the CTCL architecture, based on SimCLR [52], is defined as follows:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2Q} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{10}$$

where $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function evaluating to 1 if $k \neq i$.

$$\mathcal{L}_{con} = \frac{1}{2Q} \sum_{k=1}^{Q} \left[ \ell(2k-1, 2k) + \ell(2k, 2k-1) \right] \tag{11}$$

where $\ell_{i,j}$ represents the loss of a pair of positive samples $(i, j)$, and $\mathcal{L}_{con}$ represents the total loss of contrastive learning. $\mathrm{sim}(z_i, z_j)$ is the cosine similarity between feature representations $z_i$ and $z_j$. $Q$ represents the number of unchanged samples in a sample set with a batch size of $N$. $\tau$ denotes a temperature parameter.

The CD task involves pixel-wise evaluation of changes at each location, and we use the cross-entropy loss to measure the change loss, which is defined as follows:

$$\mathcal{L}_{change} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right) \tag{12}$$

where $y_i \in \{0, 1\}$ represents the actual label (0 for no change, 1 for change), and $\hat{y}_i$ represents the label predicted by the network. Therefore, the total loss of our proposed DiffUCD framework is:

IV. EXPERIMENTS

A. Datasets

We demonstrate the effectiveness of our proposed method on three publicly available HSI-CD datasets: Santa Barbara, Bay Area, and Hermiston. The Santa Barbara dataset comprises imagery captured by the AVIRIS sensor over the Santa Barbara region in California. The dataset includes images from 2013 and 2014, with spatial dimensions of 984 × 740 pixels and 224 spectral bands. Similarly, the Bay Area dataset consists of AVIRIS sensor imagery surrounding the city of Patterson, California. The dataset includes images captured in 2013 and 2015, with spatial dimensions of 600 × 500 pixels and 224 spectral bands.

The Hermiston dataset focuses on an irrigated agricultural field in Hermiston, Umatilla County, Oregon. The imagery was acquired on May 1, 2004, and May 8, 2007. The image size is 307 × 241 pixels, consisting of 57,311 unchanged pixels and 16,676 changed pixels. After removing noisy bands, 154 spectral bands were selected for the experiments. The changes observed in this dataset primarily pertain to land cover types and the presence of rivers.

In the Santa Barbara and Bay Area datasets, unlabeled pixels make up approximately 80% of all pixels. To train the CTCL and CD head, we use the SCDM pre-trained on all pixels and select 500 changed and 500 unchanged pixels from the PCA-generated pseudo-labels [51].

TABLE II
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE SANTA BARBARA DATASET

Method            OA (%)   KC (%)   F1 (%)
CVA [53]          87.12    73.10    83.78
PCA [51]          88.40    76.76    86.95
ISFA [54]         89.12    76.75    85.35
DSFA [49]         87.70    73.23    82.49
MSCD [55]         78.68    53.13    68.72
HyperNet [50]     91.14    81.48    88.80
Ours              96.87    93.41    95.97
Supervised Model
BCNNs [56]        97.04    93.77    96.19
ML-EDAN [16]      98.00    95.81    97.46

TABLE III
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE BAY AREA DATASET

Method            OA (%)   KC (%)   F1 (%)
CVA [53]          85.41    71.10    84.89
PCA [51]          89.28    78.77    88.88
ISFA [54]         89.17    78.48    89.05
DSFA [49]         82.68    65.81    81.61
MSCD [55]         78.68    53.13    68.72
HyperNet [50]     90.79    81.52    91.29
Ours              96.35    92.67    96.57
Supervised Model
BCNNs [56]        96.84    93.67    96.97
ML-EDAN [16]      96.47    92.91    96.67

TABLE IV
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE HERMISTON DATASET

Method            OA (%)   KC (%)   F1 (%)
CVA [53]          91.98    74.06    78.77
PCA [51]          92.14    74.56    79.19
ISFA [54]         90.23    67.16    72.62
DSFA [49]         92.67    76.94    81.39
MSCD [55]         78.51    47.88    62.01
HyperNet [50]     92.06    76.13    81.12
BCG-Net [57]      94.90    85.38    88.67
Ours              95.47    86.69    89.58
Supervised Model
BCNNs [56]        93.39    81.49    85.79
ML-EDAN [16]      94.58    84.89    88.41

Fig. 2. Visualizations of the proposed method and state-of-the-art unsupervised methods on three datasets. From top to bottom are the Santa Barbara, Bay Area, and Hermiston datasets. Panels: (a) DSFA [49], (b) HyperNet [50], (c) Base+SCDM, (d) Ours, (e) GT. Legend: FP, FN, TN, TP, Unknown Area.

TABLE I
CONFUSION MATRIX

                      Predicted
                      Change    Unchange
Actual   Change       TP        FN
         Unchange     FP        TN

B. Experimental Details

1) Evaluation Metrics: We quantitatively evaluate DiffUCD's performance using three widely-used metrics: Overall Accuracy (OA), Kappa Coefficient (KC), and F1 score. These metrics are used to comprehensively assess the model's accuracy, consistency, and balance between precision and recall. The above metrics are defined as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{14}$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
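As a small worked illustration, the evaluation metrics of this subsection can be computed directly from the four confusion-matrix counts of Table I. The sketch below covers precision, recall, and OA from Eqs. (13) to (15), and also KC (through the chance-agreement term PRE) and F1, which the paper defines in Eqs. (16) to (18). The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def change_detection_metrics(tp, fp, tn, fn):
    """Compute precision, recall, OA, KC, and F1 from confusion-matrix counts.

    Assumes all denominators are nonzero (at least one predicted change,
    one actual change, and PRE < 1).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)                      # Eq. (13)
    recall = tp / (tp + fn)                         # Eq. (14)
    oa = (tp + tn) / total                          # Eq. (15)
    # Eq. (17): expected agreement by chance, from the row/column marginals.
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kc = (oa - pre) / (1 - pre)                     # Eq. (16)
    f1 = 2 / (1 / recall + 1 / precision)           # Eq. (18)
    return {"precision": precision, "recall": recall, "oa": oa, "kc": kc, "f1": f1}


# Hypothetical counts for 100 pixels, chosen only to illustrate the arithmetic.
metrics = change_detection_metrics(tp=40, fp=10, tn=45, fn=5)
```

For these counts, OA is 0.85 while PRE is 0.5, so KC is 0.7: the Kappa Coefficient discounts the agreement that would be expected by chance, which is why it is typically lower than OA on imbalanced change maps.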

The KC, its chance-agreement term PRE, and the F1 score are defined as:

$$KC = \frac{OA - PRE}{1 - PRE} \tag{16}$$

$$PRE = \frac{(TP + FP)(TP + FN)}{(TP + TN + FP + FN)^2} + \frac{(FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2} \tag{17}$$

$$F1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}} \tag{18}$$

2) Implementation Details: We perform all experiments using the PyTorch platform, running on an NVIDIA RTX 2080 Ti GPU with 11GB of memory. The batch size is 128, and a patch size of 7 is used to process the input data. In the first stage, the SCDM is pre-trained for 1000 epochs using the AdamW optimizer [58] with an initial learning rate of 1e-5. The number of timesteps for the SCDM is set to 200. In the second stage, we fix the parameters of the SCDM and use the Adadelta optimizer [59] to optimize the CTCL and CD head networks. The initial learning rate is set to 1 and linearly decreases to 0 at 200 epochs. Based on experiments, we choose the spectral-spatial features produced by the SCDM at t = 5, 10, 100 as the input features of the CD head.

C. Comparison to State-of-the-art Methods

We conduct a comprehensive comparison of our method with recent unsupervised and supervised HSI-CD methods, including CVA [53], PCA [51], ISFA [54], DSFA [49], MSCD [55], HyperNet [50], BCG-Net [57], BCNNs [56], and ML-EDAN [16]. Fig. 2 presents a visual comparison of these methods on the three datasets.

From the visual observations in Fig. 2, it is evident that our proposed method, DiffUCD, exhibits the smallest regions of red and green. This visualization underscores the superior performance of DiffUCD compared to the other methods shown. Tables II, III, and IV provide the quantitative results of DiffUCD alongside various state-of-the-art methods across the three datasets. Our proposed method substantially improves performance over the state-of-the-art unsupervised methods in OA, KC, and F1 score. Specifically, DiffUCD surpasses the best unsupervised method on the Santa Barbara dataset by margins of 5.73%, 11.93%, and 7.17% in terms of OA, KC, and F1 score, respectively. Furthermore, compared to supervised methods trained on an equivalent number of human-annotated training examples, our method demonstrates comparable or superior performance.

Fig. 4. Pseudo-color images of the Santa Barbara dataset reconstructed by the SCDM denoising process at different time steps: (a) t=200, (b) t=50, (c) t=10, (d) t=5, (e) t=0. From top to bottom: images at times T1 and T2.

Fig. 5. Pseudo-color images of the Bay Area dataset reconstructed by the SCDM denoising process at different time steps: (a) t=200, (b) t=50, (c) t=10, (d) t=5, (e) t=0. From top to bottom: images at times T1 and T2.

D. Ablation Study

1) Effectiveness of the module: We conduct a comprehensive ablation study to verify the effectiveness of the proposed SCDM and CTCL. The results are shown in Table V. After adding the pre-training of the SCDM, the results of the network on the three datasets are significantly improved. We argue that the SCDM pre-training process utilizes many unlabeled samples, which allows it to extract the semantic correlation of spectral-spatial features of the CD dataset. The third row of Table V adds a CTCL module to the base model, improving CD accuracy on the three datasets by aligning the spectral features of unchanged samples. The fourth row shows the results of the proposed DiffUCD model: compared with the base model, the OA values on the three datasets increase by 6.39%, 4.58%, and 2.64%, respectively. These experiments demonstrate the effectiveness of the proposed DiffUCD and its sub-modules.

TABLE V
ABLATION EXPERIMENTS ON MODULE EFFECTIVENESS ON THREE DATASETS (ALL VALUES IN %)

                       Santa Barbara           Bay Area                Hermiston
Base   SCDM   CTCL     OA     KC     F1        OA     KC     F1        OA     KC     F1
√                      90.48  80.51  88.67     91.77  83.35  92.65     92.83  77.24  81.55
√      √               95.64  90.92  94.57     94.74  89.49  94.90     94.62  84.65  88.12
√             √        95.38  90.33  94.15     94.38  88.74  94.59     93.62  82.71  86.89
√      √      √        96.87  93.41  95.97     96.35  92.67  96.57     95.47  86.69  89.58

Fig. 3. The t-SNE visualization of features extracted on three datasets. From top to bottom are the Santa Barbara, Bay Area, and Hermiston datasets. Panels: (a) Base, (b) Base+CTCL, (c) Base+SCDM, (d) Ours. Annotations: intra-class distances, inter-class distances.

2) Comparison of feature extraction ability: Fig. 3 visually demonstrates the effectiveness of the SCDM in extracting compact intra-class features compared to the base model.

V. CONCLUSION

This work presents a novel diffusion framework, called DiffUCD, designed explicitly for HSI-CD. To our knowledge, this is the first diffusion model developed for this particular task. DiffUCD leverages many unlabeled samples to fully consider the semantic correlation of spectral-spatial features and retrieve the semantically correlated features of the original image. Additionally, we employ CTCL to align the spectral feature representations of unchanged samples. This alignment facilitates learning spectral difference-invariant features, which are essential for handling the spectral variations caused by environmental changes. We evaluate the performance of our proposed method on three publicly available datasets and demonstrate that it achieves significant improvements over state-of-the-art unsupervised methods in terms of OA, KC, and F1 metrics. Furthermore, the diffusion model holds great potential as a novel solution for the HSI-CD task. We hope our work will inspire the development of new approaches and foster advancements in this field.

REFERENCES

[1] D. Lu, P. Mausel, E. Brondizio, and E. Moran, “Change detection techniques,” International Journal of Remote Sensing, vol. 25, no. 12, pp. 2365–2401, 2004.
[2] S. Liu, D. Marinelli, L. Bruzzone, and F. Bovolo, “A review of change detection in multitemporal hyperspectral images: Current techniques, applications, and challenges,” IEEE Geoscience and Remote Sensing Magazine, vol. 7, no. 2, pp. 140–158, 2019.
[3] F. Aslami and A. Ghorbani, “Object-based land-use/land-cover change detection using Landsat imagery: a case study of Ardabil, Namin, and Nir counties in northwest Iran,” Environmental Monitoring and Assessment, vol. 190, pp. 1–14, 2018.
[4] T. Rumpf, A.-K. Mahlein, U. Steiner, E.-C. Oerke, H.-W. Dehne, and L. Plümer, “Early detection and classification of plant diseases with support vector machines based on hyperspectral reflectance,” Computers and Electronics in Agriculture, vol. 74, no. 1, pp. 91–99, 2010.
[5] Y. Wang, D. Hong, J. Sha, L. Gao, L. Liu, Y. Zhang, and X. Rong, “Spectral–spatial–temporal transformers for hyperspectral image change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
[6] W. Dong, J. Zhao, J. Qu, S. Xiao, N. Li, S. Hou, and Y. Li, “Abundance matrix correlation analysis network based on hierarchical multi-head self-cross-hybrid attention for hyperspectral change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[7] D. Chakraborty and A. Ghosh, “Unsupervised change detection in hyperspectral images using feature fusion deep convolutional autoencoders,” arXiv preprint arXiv:2109.04990, 2021.
[8] J. Lei, M. Li, W. Xie, Y. Li, and X. Jia, “Spectral mapping with adversarial learning for unsupervised hyperspectral change detection,” Neurocomputing, vol. 465, pp. 71–83, 2021.
Notably, the feature distances obtained through the CTCL [9] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion
mechanism are significantly larger on the Santa Barbara and models for robust image manipulation,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2022, pp.
Hermiston datasets. The t-SNE visualization further reinforces 2426–2435.
the discriminative nature of our model. The t-SNE plot [10] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi,
vividly illustrates that the features extracted by DiffUCD are “Image super-resolution via iterative refinement,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2022.
well-separated, allowing for distinct clusters corresponding to [11] H. Tan, S. Wu, and J. Pi, “Semantic diffusion network for semantic
different classes. This enhanced feature separability plays a segmentation,” Advances in Neural Information Processing Systems,
crucial role in boosting CD accuracy. vol. 35, pp. 8702–8716, 2022.
[12] J. Wu, H. Fang, Y. Zhang, Y. Yang, and Y. Xu, “Medsegdiff: Medical
3) The influence of timestamp t on the reconstruction ef- image segmentation with diffusion probabilistic model,” arXiv preprint
fect: Fig. 4 and Fig. 5 provides qualitative evidence of the arXiv:2211.00611, 2022.
effectiveness of DiffUCD in both noise removal and feature [13] S. Chen, P. Sun, Y. Song, and P. Luo, “Diffusiondet: Diffusion model
for object detection,” arXiv preprint arXiv:2211.09788, 2022.
reconstruction of the original HSI. The visualization results [14] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and
clearly illustrate how the denoising process of DiffUCD fully B. Poole, “Score-based generative modeling through stochastic differ-
incorporates the semantic correlation of spectral-spatial fea- ential equations,” arXiv preprint arXiv:2011.13456, 2020.
[15] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen,
tures, enabling the extraction of essential features that preserve “Srdiff: Single image super-resolution with diffusion probabilistic mod-
the original image’s semantic correlation. els,” Neurocomputing, vol. 479, pp. 47–59, 2022.
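The OA, KC, and F1 figures quoted in this section can be reproduced from a binary change map and its ground truth. The sketch below assumes the standard definitions of overall accuracy, Cohen's kappa coefficient, and F1-score; the exact formulas used in the paper are not restated in this section, and the function name `change_detection_metrics` is hypothetical.

```python
def change_detection_metrics(pred, truth):
    """Return (OA, KC, F1) for two equal-length binary sequences.

    Assumption: 1 marks a changed pixel and 0 an unchanged pixel, and the
    metrics follow their standard textbook definitions.
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and equally long")

    # Confusion-matrix counts, treating "changed" as the positive class.
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    n = tp + tn + fp + fn

    # Overall accuracy: fraction of correctly labelled pixels.
    oa = (tp + tn) / n

    # Kappa coefficient: agreement corrected for chance agreement pe.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kc = (oa - pe) / (1 - pe) if pe != 1 else 0.0

    # F1-score: harmonic mean of precision and recall on the changed class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return oa, kc, f1


# Toy example with eight pixels (flattened change maps).
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
oa, kc, f1 = change_detection_metrics(pred, truth)
```

Because changed pixels are usually a small minority of a scene, OA alone can look high even for a poor detector, which is one reason KC and F1 are typically reported alongside it.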
[16] J. Qu, S. Hou, W. Dong, Y. Li, and W. Xie, “A multilevel encoder– [37] G. Wang, X. Zhang, Z. Peng, X. Tang, H. Zhou, and L. Jiao, “Absolute
decoder attention network for change detection in hyperspectral images,” wrong makes better: Boosting weakly supervised object detection via
IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1– negative deterministic information,” in Proceedings of the International
13, 2021. Joint Conference on Artificial Intelligence (IJCAI), 2022, pp. 1378–1384.
[17] H. Zhao, K. Feng, Y. Wu, and M. Gong, “An efficient feature extrac- [38] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya,
tion network for unsupervised hyperspectral change detection,” Remote C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al.,
Sensing, vol. 14, no. 18, p. 4646, 2022. “Bootstrap your own latent-a new approach to self-supervised learning,”
[18] X. Li, Z. Yuan, and Q. Wang, “Unsupervised deep noise modeling for Advances in neural information processing systems, vol. 33, pp. 21 271–
hyperspectral image change detection,” Remote Sensing, vol. 11, no. 3, 21 284, 2020.
p. 258, 2019. [39] X. Chen and K. He, “Exploring simple siamese representation learning,”
[19] Q. Li, H. Gong, H. Dai, C. Li, Z. He, W. Wang, Y. Feng, F. Han, in Proceedings of the IEEE/CVF conference on computer vision and
A. Tuniyazi, H. Li et al., “Unsupervised hyperspectral image change pattern recognition, 2021, pp. 15 750–15 758.
detection via deep learning self-generated credible labels,” IEEE Journal [40] X. Hu, T. Li, T. Zhou, Y. Liu, and Y. Peng, “Contrastive learning based
of Selected Topics in Applied Earth Observations and Remote Sensing, on transformer for hyperspectral image classification,” Applied Sciences,
vol. 14, pp. 9012–9024, 2021. vol. 11, no. 18, p. 8670, 2021.
[20] Z. Hou, W. Li, R. Tao, and Q. Du, “Three-order tucker decomposition [41] P. Guan and E. Y. Lam, “Cross-domain contrastive learning for hyper-
and reconstruction detector for unsupervised hyperspectral change de- spectral image classification,” IEEE Transactions on Geoscience and
tection,” IEEE Journal of Selected Topics in Applied Earth Observations Remote Sensing, vol. 60, pp. 1–13, 2022.
and Remote Sensing, vol. 14, pp. 6194–6205, 2021. [42] X. Ou, L. Liu, S. Tan, G. Zhang, W. Li, and B. Tu, “A hyperspectral
[21] S. Liu, H. Li, F. Wang, J. Chen, G. Zhang, L. Song, and B. Hu, “Un- image change detection framework with self-supervised contrastive
supervised transformer boundary autoencoder network for hyperspectral learning pretrained model,” IEEE Journal of Selected Topics in Applied
image change detection,” Remote Sensing, vol. 15, no. 7, p. 1868, 2023. Earth Observations and Remote Sensing, vol. 15, pp. 7724–7740, 2022.
[22] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” [43] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilis-
Advances in Neural Information Processing Systems, vol. 33, pp. 6840– tic models,” in International Conference on Machine Learning. PMLR,
6851, 2020. 2021, pp. 8162–8171.
[23] Y. Song and S. Ermon, “Generative modeling by estimating gradients [44] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,”
of the data distribution,” Advances in neural information processing arXiv preprint arXiv:2010.02502, 2020.
systems, vol. 32, 2019. [45] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli,
[24] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- “Deep unsupervised learning using nonequilibrium thermodynamics,”
resolution image synthesis with latent diffusion models,” in Proceedings in International Conference on Machine Learning. PMLR, 2015, pp.
of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- 2256–2265.
tion, 2022, pp. 10 684–10 695. [46] Y. Song and S. Ermon, “Improved techniques for training score-based
[25] E. A. Brempong, S. Kornblith, T. Chen, N. Parmar, M. Minderer, generative models,” Advances in neural information processing systems,
and M. Norouzi, “Denoising pretraining for semantic segmentation,” vol. 33, pp. 12 438–12 448, 2020.
in Proceedings of the IEEE/CVF Conference on Computer Vision and [47] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood
Pattern Recognition, 2022, pp. 4175–4186. training of score-based diffusion models,” Advances in Neural Informa-
[26] J. Wyatt, A. Leach, S. M. Schmon, and C. G. Willcocks, “Anoddpm: tion Processing Systems, vol. 34, pp. 1415–1428, 2021.
Anomaly detection with denoising diffusion probabilistic models using [48] F. Bao, C. Li, Y. Cao, and J. Zhu, “All are worth words: a vit backbone
simplex noise,” in Proceedings of the IEEE/CVF Conference on Com- for score-based diffusion models,” arXiv preprint arXiv:2209.12152,
puter Vision and Pattern Recognition, 2022, pp. 650–656. 2022.
[27] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Struc- [49] B. Du, L. Ru, C. Wu, and L. Zhang, “Unsupervised deep slow feature
tured denoising diffusion models in discrete state-spaces,” Advances analysis for change detection in multi-temporal remote sensing images,”
in Neural Information Processing Systems, vol. 34, pp. 17 981–17 993, IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 12,
2021. pp. 9976–9992, 2019.
[28] X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto, [50] M. Hu, C. Wu, and L. Zhang, “Hypernet: Self-supervised hyperspectral
“Diffusion-lm improves controllable text generation,” Advances in Neu- spatial–spectral feature understanding network for hyperspectral change
ral Information Processing Systems, vol. 35, pp. 4328–4343, 2022. detection,” IEEE Transactions on Geoscience and Remote Sensing,
[29] O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text- vol. 60, pp. 1–17, 2022.
driven editing of natural images,” in Proceedings of the IEEE/CVF [51] J. Deng, K. Wang, Y. Deng, and G. Qi, “Pca-based land-use change
Conference on Computer Vision and Pattern Recognition, 2022, pp. detection and analysis using multitemporal and multisensor satellite
18 208–18 218. data,” International Journal of Remote Sensing, vol. 29, no. 16, pp.
[30] D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y. Zou, and D. Yu, 4823–4838, 2008.
“Diffsound: Discrete diffusion model for text-to-sound generation,” [52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework
IEEE/ACM Transactions on Audio, Speech, and Language Processing, for contrastive learning of visual representations,” in International
2023. conference on machine learning. PMLR, 2020, pp. 1597–1607.
[31] Y. Tashiro, J. Song, Y. Song, and S. Ermon, “Csdi: Conditional score- [53] F. Bovolo and L. Bruzzone, “A theoretical framework for unsupervised
based diffusion models for probabilistic time series imputation,” Ad- change detection based on change vector analysis in the polar domain,”
vances in Neural Information Processing Systems, vol. 34, pp. 24 804– IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1,
24 816, 2021. pp. 218–236, 2006.
[32] K. Rasul, C. Seward, I. Schuster, and R. Vollgraf, “Autoregressive [54] C. Wu, B. Du, and L. Zhang, “Slow feature analysis for change detection
denoising diffusion models for multivariate probabilistic time series in multispectral imagery,” IEEE Transactions on Geoscience and Remote
forecasting,” in International Conference on Machine Learning. PMLR, Sensing, vol. 52, no. 5, pp. 2858–2874, 2013.
2021, pp. 8857–8868. [55] S. Saha, P. Ebel, and X. X. Zhu, “Self-supervised multisensor change
[33] D. Baranchuk, I. Rubachev, A. Voynov, V. Khrulkov, and A. Babenko, detection,” IEEE Transactions on Geoscience and Remote Sensing,
“Label-efficient semantic segmentation with diffusion models,” arXiv vol. 60, pp. 1–10, 2022.
preprint arXiv:2112.03126, 2021. [56] Y. Lin, S. Li, L. Fang, and P. Ghamisi, “Multispectral change detection
[34] Z. Gu, H. Chen, Z. Xu, J. Lan, C. Meng, and W. Wang, “Diffu- with bilinear convolutional neural networks,” IEEE Geoscience and
sioninst: Diffusion model for instance segmentation,” arXiv preprint Remote Sensing Letters, vol. 17, no. 10, pp. 1757–1761, 2019.
arXiv:2212.02773, 2022. [57] M. Hu, C. Wu, B. Du, and L. Zhang, “Binary change guided hy-
[35] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast perspectral multiclass change detection,” IEEE Transactions on Image
for unsupervised visual representation learning,” in Proceedings of the Processing, 2023.
IEEE/CVF conference on computer vision and pattern recognition, 2020, [58] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
pp. 9729–9738. arXiv preprint arXiv:1711.05101, 2017.
[36] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big [59] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv
self-supervised models are strong semi-supervised learners,” Advances preprint arXiv:1212.5701, 2012.
in neural information processing systems, vol. 33, pp. 22 243–22 255,
2020.
