1. Introduction
As an effective monitoring method, change detection has been widely used for disaster evaluation [1], building planning [2,3], sea ice monitoring [4], forest vegetation management [5,6], and land use analysis [7,8]. Because hyperspectral images can provide more abundant spectral information to reveal subtle changes in land cover [9,10], hyperspectral image change detection has become a hot research field [11,12,13].
In general, change detection can be divided into supervised and unsupervised change detection. Supervised methods need labeled data to train the model. Wu et al. proposed a general end-to-end 2-D convolutional neural network (CNN) framework based on unmixing spectral information [14]. To further combine spectral and spatial features, Zhan et al. designed a spectral–spatial CNN to extract features from the spectral information and two spatial directions [15]. Wang et al. designed a low-complexity network with few parameters and a residual self-calibrated structure [16]. To address band redundancy, a slow–fast band selection CNN framework was proposed to extract effective spectral information [17]. Zhang et al. designed a multi-scale Siamese transformer to extract global features [3]. However, there are only a few public hyperspectral change detection datasets with labeled data [18], which limits the applicability of supervised methods.
Because unsupervised methods can be applied to unlabeled datasets, many researchers have exploited them for change detection. Early unsupervised change detection methods mainly relied on direct comparison and algebraic transformation. Pixel-based comparison methods, such as change vector analysis (CVA) [19,20], are simple and efficient but are vulnerable to various types of noise. To reduce noise interference, algebraic transformations are used to extract image features for change detection. Deng et al. added a principal component analysis to the CVA method, which enhances the change information [21]. Nielsen et al. proposed iteratively reweighted multivariate alteration detection (IR-MAD) for hyperspectral image change detection [22]. This method uses canonical correlation analysis to transform the original variables and assigns each pixel a different weight. Slow feature analysis (SFA) change detection aims to minimize the difference between invariant points in the new transformation space [23]. Compared with direct comparison, analyzing the difference between two temporal features makes the change detection results more robust and accurate. In addition, change detection algorithms based on other perspectives have been proposed in recent years. Inspired by orthogonal subspace projection, Wu et al. proposed a subspace-based change detection (SCD) method that calculates the distance from a pixel to the subspace formed by the image at the other time [24]. Spectral unmixing has also been applied to change detection, in which abundance is treated as an image feature for identifying changes [25,26,27,28]. However, these methods usually focus only on spectral information and make little use of spatial information.
Because CNNs and transformers can combine spectral and spatial features and perform well in other computer vision tasks, many unsupervised change detection methods based on deep learning have been proposed in recent years. Transfer learning, autoencoders, and generative adversarial networks (GANs) have attracted great attention for unsupervised change detection [2,29,30,31]. Du et al. proposed an unsupervised deep learning SFA that uses an SFA-based optimizer to train the feature extraction module of the neural network [32]. Hu et al. used two autoencoders to reconstruct the two temporal hyperspectral images, respectively, and then used the reconstructed data to obtain the change results [30]. Nevertheless, these methods pay little attention to edge features, resulting in poor detection of changed boundaries.
Although a lot of research has been carried out, several problems remain in unsupervised change detection. First, the boundaries of changed targets are not easy to distinguish with an unsupervised method. Many environmental factors, such as noise and resolution, blur the boundaries of changed objects and increase the difficulty of feature extraction [33]. Second, the redundancy of feature maps affects the detection results. To extract more image features, a network usually uses many convolution kernels; however, some feature maps contain little useful information, reducing the final detection accuracy [34].
To overcome the abovementioned problems, an unsupervised transformer boundary autoencoder network (UTBANet) is proposed for hyperspectral image change detection in this paper. Our method is an unsupervised method based on an autoencoder structure. A transformer structure based on spectral attention is used in the encoder to enhance the global features and to select effective feature maps. Furthermore, we design a boundary information reconstruction module that allows the autoencoder to reconstruct the edge information of the image in addition to the hyperspectral image. In this way, the ability of the encoder to extract boundary features can be improved. The main contributions of our article are as follows:
- (1)
An unsupervised UTBANet based on a transformer is proposed to improve global feature representation in change detection.
- (2)
The encoder structure with spectral attention can give different weights to each feature map and select important feature maps, thus reducing the impact of feature map redundancy.
- (3)
The boundary information reconstruction module forces the encoder to pay more attention to the edge information, which causes the feature map to contain more discriminative boundary features and improves the accuracy of change detection.
3. Experiment and Discussion
3.1. Datasets
Three public datasets were used in our experiments, all acquired using the Hyperion sensor. The sensor provides 242 spectral bands with wavelengths ranging from 0.4 to 2.5 μm. The spectral resolution is about 10 nm, and the spatial resolution is approximately 30 m.
The first dataset is Farmland, as shown in Figure 6a,b. The two temporal hyperspectral images were acquired on 3 May 2006 and 23 April 2007, respectively. The main scene is a farmland in Yancheng, Jiangsu Province, China, with a size of 450 × 140 pixels. After removing the bands with a low signal-to-noise ratio, 155 bands were retained. The main changes are caused by the seasonal change in crops.
The second dataset is Hermiston, as shown in Figure 6c,d. The two images were acquired on 1 May 2004 and 8 May 2007. The main scenes are rivers and circular irrigation areas in Hermiston City, USA, with a size of 390 × 200 pixels. In total, 163 bands were selected after noise removal. The main changes are also caused by the seasonal change in crops.
The third dataset is River, as shown in Figure 6e,f. The two images were acquired on 3 May 2013 and 31 December 2013. The main scenes are rivers and residential areas in Jiangsu Province, China, with 463 × 241 pixels. In total, 167 bands were selected after noise removal. The main change is sediment in rivers and residential areas.
3.2. Implementation Details
To improve the convergence speed of our model, the three datasets were normalized to the range [0, 1]. To increase the number of training samples, before training, we cropped each hyperspectral image into 32 × 32 sub-images with a step size of eight. The boundary ground truth (GT) was also cropped into corresponding sub-images. The sub-images from the two temporal images were mixed, shuffled, and input into the network as training data. In addition, during training, these sub-images were randomly rotated, vertically flipped, and horizontally flipped.
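For reference, the following is a minimal sketch of this cropping and augmentation step, assuming each temporal image is stored as a normalized NumPy array of shape (H, W, C); the function names and the placeholder arrays are illustrative and not taken from the paper's code.

```python
import numpy as np

def crop_patches(img, patch=32, step=8):
    """Crop an (H, W, C) image into overlapping patch x patch sub-images."""
    h, w = img.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, step):
        for left in range(0, w - patch + 1, step):
            patches.append(img[top:top + patch, left:left + patch])
    return np.stack(patches)

def augment(patch, rng):
    """Random 90-degree rotation plus random vertical/horizontal flips."""
    patch = np.rot90(patch, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:
        patch = patch[::-1, :]   # vertical flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal flip
    return patch.copy()

# Example: sub-images from both temporal images are pooled and shuffled for training.
rng = np.random.default_rng(0)
t1 = rng.random((450, 140, 155))   # placeholders standing in for the two normalized
t2 = rng.random((450, 140, 155))   # temporal images (Farmland-sized for illustration)
train = np.concatenate([crop_patches(t1), crop_patches(t2)])
rng.shuffle(train)                 # shuffles along the sample axis
```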
UTBANet was trained with PyTorch on a GeForce RTX 3090 (24 GB). The initial learning rate was 0.01 and was decayed by a factor of 0.1 every 100 epochs. Early stopping was used to prevent overfitting. The optimizer was Adam with default parameters.
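A minimal sketch of this optimization setup is given below; the placeholder model, the MSE reconstruction loss, the dummy batches, and the early-stopping patience are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for UTBANet; only the optimizer, learning-rate
# schedule, and early stopping reflect the settings reported in the text.
model = nn.Sequential(nn.Conv2d(155, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 155, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)            # Adam, default parameters
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

best_loss, patience, bad_epochs = float("inf"), 20, 0                # early-stopping state (patience assumed)
for epoch in range(500):
    x = torch.rand(8, 155, 32, 32)                                   # dummy batch of 32 x 32 sub-images
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)                       # reconstruction loss (assumed MSE)
    loss.backward()
    optimizer.step()
    scheduler.step()                                                 # decay lr by 0.1 every 100 epochs
    if loss.item() < best_loss - 1e-6:
        best_loss, bad_epochs = loss.item(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                                   # early stopping
            break
```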
To evaluate the performance of our method, we used three metrics: overall accuracy (OA), the kappa coefficient (KC), and intersection over union (IoU). The three metrics can be defined as

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{KC} = \frac{\mathrm{OA} - P_e}{1 - P_e}, \qquad P_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2},$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where TP is a true positive, TN is a true negative, FP is a false positive, and FN is a false negative.
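For reference, these metrics can be computed directly from the binary change maps, as in the following sketch (the function name is illustrative):

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """OA, kappa coefficient (KC), and IoU of the changed class,
    computed from binary change maps in which 1 marks a changed pixel."""
    pred = np.asarray(pred, dtype=bool).ravel()
    gt = np.asarray(gt, dtype=bool).ravel()
    tp = int(np.sum(pred & gt))     # changed pixels detected as changed
    tn = int(np.sum(~pred & ~gt))   # unchanged pixels detected as unchanged
    fp = int(np.sum(pred & ~gt))    # false alarms
    fn = int(np.sum(~pred & gt))    # missed changes
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kc = (oa - pe) / (1 - pe)
    iou = tp / (tp + fp + fn)
    return oa, kc, iou
```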
3.3. Parameter Setting
In our proposed method, the number of kernels, the batch size, and the weight of the two loss functions affect the change detection result. We carried out a series of comparison experiments to explore the influence of these hyperparameters. When analyzing one parameter, the other parameters were fixed.
3.3.1. The Analysis of Kernel Number
Here, we suppose that the kernel number of the embedding layer is K. Then, the numbers of output channels of SAT1, SAT2, Residual block 1, and Residual block 2 are K/2, K/4, K/2, and C, respectively, where C is the number of hyperspectral image channels. We tested K values of 32, 64, 128, and 256.
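To make the channel progression concrete, the toy sketch below uses plain 3 × 3 convolutions to stand in for the embedding layer, the SAT blocks, and the residual blocks (whose internal structure is defined in the method section, not here); it only illustrates how the channel widths depend on K and C.

```python
import torch
import torch.nn as nn

def channel_plan(K, C):
    """Output channels of embedding, SAT1, SAT2, Residual block 1, Residual block 2."""
    return [K, K // 2, K // 4, K // 2, C]

def toy_backbone(K=64, C=155):
    # Plain convolutions are placeholders; only the channel widths follow the text.
    dims = [C] + channel_plan(K, C)
    return nn.Sequential(*[nn.Conv2d(i, o, 3, padding=1) for i, o in zip(dims[:-1], dims[1:])])

x = torch.rand(1, 155, 32, 32)       # one 32 x 32 sub-image with 155 bands
print(toy_backbone()(x).shape)       # torch.Size([1, 155, 32, 32])
```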
Table 1 shows that, as the kernel number increases, the OA values of Farmland and River first increase and then decrease. The best kernel numbers for the three datasets are 64, 256, and 64, respectively.
3.3.2. The Analysis of the Batch Size
In this experiment, we compared three different batch sizes: 32, 64, and 128.
Table 2 shows that the batch size greatly influences the results on the Hermiston and River datasets. The best batch sizes for the three datasets are 128, 32, and 64, respectively.
3.3.3. The Analysis of the Weight of Two Loss Functions
In this experiment, six different values of the loss weight (0.1, 0.5, 1, 2, 5, and 10) were used for comparison. Table 3 shows that when the weight of the hyperspectral loss is higher than or equal to that of the boundary loss, the OA values of Farmland and Hermiston are relatively high. The best weights for the three datasets are 5, 5, and 2, respectively.
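As a sketch only, assuming the training objective is a weighted sum of the hyperspectral reconstruction loss and the boundary reconstruction loss (the exact loss form, and which term the tested weight scales, are defined in the method section and not reproduced here), the weighting could look like this:

```python
import torch.nn.functional as F

def total_loss(hsi_pred, hsi_true, edge_pred, edge_true, weight=5.0):
    # The use of MSE for both terms and the placement of the weight on the
    # hyperspectral term are assumptions made for this illustration.
    loss_hsi = F.mse_loss(hsi_pred, hsi_true)      # hyperspectral reconstruction loss
    loss_edge = F.mse_loss(edge_pred, edge_true)   # boundary reconstruction loss
    return weight * loss_hsi + loss_edge
```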
3.3.4. The Analysis of the Edge Detection Operator
In this experiment, two common edge detection operators (Sobel and Canny) were used for comparison. As shown in Table 4, different operators had little effect on Farmland and Hermiston. However, the Sobel operator performed better than the Canny operator on the River dataset. Therefore, we selected the Sobel operator for edge detection.
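For reference, a boundary map can be generated with the Sobel operator as in the sketch below (using OpenCV); the threshold value and the choice of a single input band are illustrative, since this section does not specify how the boundary GT is constructed.

```python
import cv2
import numpy as np

def sobel_boundary(band, threshold=0.1):
    """Binary edge map from one image band normalized to [0, 1]."""
    band = band.astype(np.float32)
    gx = cv2.Sobel(band, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(band, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    magnitude = np.sqrt(gx**2 + gy**2)                # gradient magnitude
    return (magnitude > threshold).astype(np.float32)
```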
3.4. Comparison with Other Methods
In this section, we compared our model with other unsupervised change detection methods, including CVA, PCA-CVA [7], IR-MAD [8], DCVA [21], and convolutional autoencoder multiresolution features (CAMF) [26].
CVA: The Euclidean distance between the two temporal pixels is calculated to determine whether the pixel has changed.
PCA-CVA: This is an improved CVA method. It first obtains the principal components of the hyperspectral image and then uses the CVA algorithm to detect changes.
IR-MAD: This method assigns a weight to each pixel while using canonical correlation analysis for algebraic transformation.
DCVA: This method first trains a CNN on other images to extract image features. Then, feature variance is used to select useful feature maps. Finally, global and local thresholding are used to make the decision.
CAMF: This method uses the two temporal images to train an autoencoder. Then, feature comparison and selection are used to extract the change features.
For all comparison methods, the parameters were optimized. These methods first generate gray (change-magnitude) results, and then Otsu thresholding is used for segmentation. Although the quantitative evaluation is carried out on the binary maps, we also show the gray results in the experimental section.
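As a concrete example of the simplest baseline and of the thresholding step, the following sketch implements CVA with Otsu segmentation using scikit-image; the function name is illustrative.

```python
import numpy as np
from skimage.filters import threshold_otsu

def cva_change_map(img_t1, img_t2):
    """CVA: per-pixel Euclidean distance between two (H, W, C) images,
    followed by Otsu thresholding of the gray result into a binary map."""
    diff = img_t1.astype(np.float32) - img_t2.astype(np.float32)
    gray = np.linalg.norm(diff, axis=-1)                     # change magnitude per pixel
    binary = (gray > threshold_otsu(gray)).astype(np.uint8)  # 1 = changed
    return gray, binary
```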
3.4.1. Experiment on the Farmland Dataset
The gray results are shown in Figure 7, and Figure 8 shows the binary change detection results of all methods. Compared with the ground truth, the CVA, PCA-CVA, and CAMF results in Figure 8 show many white pixels in unchanged road areas; these false alarm pixels are caused by misjudging unchanged pixels as changed. IR-MAD performed poorly, with a large amount of noise in its result. For DCVA, there was a large false alarm region in the bottom left corner of the result. Because UTBANet combines global and boundary features, our method generated only a few false alarm pixels, and its overall result was the closest to the ground truth; therefore, it better identifies the boundaries of the farmland.
Table 5 displays the metrics of these methods. Our method achieved the best values for all three metrics, with an OA of 0.9663, a KC of 0.9195, and an IoU of 0.9434.
3.4.2. Experiment on the Hermiston Dataset
Figure 9 shows the gray results of the Hermiston dataset, and Figure 10 displays the binary results of the six methods. CVA identified the most changed areas but also misjudged some unchanged areas. PCA-CVA and DCVA mistakenly detected the river as a changed area. IR-MAD also did not perform well on this dataset, producing a large amount of noise in its results. UTBANet has the highest consistency with the ground truth. Figure 10f indicates that, by paying more attention to edge features, our method identifies the changed irrigation areas more accurately, without obvious false alarm pixels.
Table 6 shows the results of all methods. Because there are many false alarm pixels in the PCA-CVA results, its KC value is only 0.2996. Figure 10e shows that CAMF missed many changed pixels and generated many false alarm pixels; therefore, its KC and IoU values are both low. UTBANet demonstrated good performance in all three metrics, with an OA of 0.9858, a KC of 0.9362, and an IoU of 0.9443.
3.4.3. Experiment on the River Dataset
Gray and binary change detection results are shown in Figure 11 and Figure 12, respectively.
Table 7 shows the quantitative indicators of the six methods. PCA-CVA misjudged the river as a changed region, which may be due to the low signal-to-noise ratio of the river region. IR-MAD did not detect the changed area in the middle of the river.
Figure 12d shows a large white area compared with the ground truth, indicating that the DCVA performance is poor. This may be because no suitable data were found to pre-train the CNN for DCVA. For CAMF, there were many false alarm pixels in the residential area. Therefore, the KC value of CAMF was only 0.2769.
Figure 12 shows that CVA and our method are the closest to the ground truth. Compared with CVA, our method shows fewer false alarm pixels, which may be because the boundary reconstruction allows our method to better identify rivers and riversides. However, compared with the ground truth, UTBANet missed some small changed targets. Therefore, our method achieved higher OA and KC values than CVA, but its IoU is lower than that of CVA.
3.5. Ablation Analysis
In our model, the boundary decoder and SE block are used for feature extraction. To verify the effectiveness of these two structures, we designed model experiments with and without the boundary decoder and SE block. The network without the boundary decoder (Net without BD) consists only of feature extraction and the hyperspectral reconstruction module. The network without the SE block (Net without SE) still comprises two decoder modules, but the SE block is removed from the SAT.
Figure 13a shows that, without the boundary decoder, the boundaries of the changed areas are blurred and the network cannot detect the unchanged road between the farmlands. Figure 14 and Figure 15 show that the boundary decoder is effective, not only for changed edges but also for improving the overall detection results.
Table 8 shows quantitatively that the boundary decoder and SE block improved the change detection results.