UMMFF: Unsupervised Multimodal Multilevel Feature Fusion Network for Hyperspectral Image Super-Resolution
Abstract
1. Introduction
- (1) UMMFF designs a gated cross-retention shared encoder to address the insufficient use of inter-modal information in multimodal image fusion. The encoder captures local features through a gated retention mechanism and establishes a cross-sharing relationship between the queries (Q) and keys (K) of the different modalities to achieve information complementarity.
- (2) UMMFF constructs a multilevel parallel fusion decoder based on spatial and channel attention to address the lack of modality-specific feature enhancement in the decoder. The decoder applies channel attention and spatial attention to the multispectral and hyperspectral images, respectively, to extract spatial–spectral features, and the resulting attention features strengthen the extraction and fusion of spatial–spectral information at three levels: low, mid, and high.
- (3) UMMFF proposes a prior-based implicit representation blind degradation estimation network to address the limited optimization freedom of prior-regularized networks. The network uses positional encoding to regularize a multilayer perceptron, increasing the degrees of freedom available to optimization, while degradation priors constrain the network and help it avoid local optima. As a result, the degradation parameters are estimated accurately, improving the reconstruction accuracy and generalizability of the unsupervised algorithm.
2. Related Work
2.1. Theoretical Foundation of the Model
2.2. Transformer
2.3. Blind Estimation Network
3. Methods
3.1. Gate Cross-Retention Shared Encoder
- 1. Gate cross-retention (GCR)
- 2. Feed-forward network (FFN)
3.2. Multilevel Spatial–Channel Attention Parallel Fusion Decoder
- 1. Low-level feature extraction
- 2. Mid-level feature extraction
- 3. High-level feature extraction
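To make the parallel spatial–channel attention fusion of Section 3.2 concrete, the following is a minimal PyTorch sketch of a single fusion level, with channel attention on the multispectral branch and spatial attention on the hyperspectral branch as stated in the contributions. The attention blocks follow the CBAM pattern [34]; all class names, channel counts, and the concatenate-then-convolve fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one level of a spatial/channel attention parallel fusion decoder.
# Assumed design: CBAM-style attention, concatenation + 3x3 conv fusion.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excite style channel attention (CBAM-like)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(x)  # reweight each channel


class SpatialAttention(nn.Module):
    """CBAM-like spatial attention from pooled channel statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))  # reweight each spatial location


class ParallelFusionLevel(nn.Module):
    """One fusion level: channel attention on the MSI branch, spatial attention
    on the HSI branch, then concatenation and a 3x3 convolution to fuse."""
    def __init__(self, msi_ch, hsi_ch, out_ch):
        super().__init__()
        self.ca = ChannelAttention(msi_ch)
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(msi_ch + hsi_ch, out_ch, 3, padding=1)

    def forward(self, msi_feat, hsi_feat):
        return self.fuse(torch.cat([self.ca(msi_feat), self.sa(hsi_feat)], dim=1))
```

In the full decoder, one level of this form would act on low-, mid-, and high-level features in turn, and the three outputs would be combined for the final reconstruction.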
3.3. Prior-Based Implicit Representation Blind Degradation Estimation Network and Loss Design
- 1. Positional encoding
- 2. Multilayer perceptron network
- 3. Loss function based on prior knowledge
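Section 3.3 estimates the degradation blindly through an implicit representation: positional encoding regularizes a multilayer perceptron, and priors on the degradation constrain the estimate. The sketch below illustrates this idea for the spatial blur kernel only; the Fourier positional encoding, the kernel size, the network width, and the softmax used to impose the non-negativity and sum-to-one priors are assumptions made for illustration rather than the paper's exact design.

```python
# Sketch of an implicit (coordinate-MLP) blind estimator of a spatial blur kernel.
# Assumed design: Fourier positional encoding + small MLP + softmax prior.
import torch
import torch.nn as nn


def positional_encoding(coords, num_freqs=6):
    """coords: (n, 2) in [-1, 1]; returns sin/cos Fourier features at octave frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs, device=coords.device) * torch.pi
    angles = coords[:, None, :] * freqs[None, :, None]  # (n, num_freqs, 2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)


class ImplicitKernelEstimator(nn.Module):
    def __init__(self, kernel_size=9, num_freqs=6, hidden=64):
        super().__init__()
        self.kernel_size = kernel_size
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Fixed grid of kernel coordinates fed through the positional encoding.
        axis = torch.linspace(-1, 1, kernel_size)
        yy, xx = torch.meshgrid(axis, axis, indexing="ij")
        self.register_buffer("coords", torch.stack([yy, xx], dim=-1).reshape(-1, 2))

    def forward(self):
        logits = self.mlp(positional_encoding(self.coords, self.num_freqs))
        # Softmax imposes the degradation priors: non-negative weights summing to one.
        kernel = torch.softmax(logits.squeeze(-1), dim=0)
        return kernel.view(1, 1, self.kernel_size, self.kernel_size)
```

A spectral response estimate can be built the same way from band-index encodings, and the "loss function based on prior knowledge" item above indicates that the priors also act through the training loss coupling the estimated degradations with the observed low-resolution hyperspectral and high-resolution multispectral images.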
4. Experiments
4.1. Datasets
4.2. Model Parameter Settings and Hardware Environment
4.3. Evaluation Metrics
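The result tables in Section 5 report PSNR, RMSE, SAM, and ERGAS. For reference, the snippet below computes these indices in their common textbook form; the paper's exact normalization and data ranges may differ.

```python
# Common definitions of the reported quality indices (not necessarily the
# paper's exact implementation). Images are (H, W, bands) arrays.
import numpy as np


def psnr(ref, est, data_range=1.0):
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)


def rmse(ref, est):
    return float(np.sqrt(np.mean((ref - est) ** 2)))


def sam(ref, est, eps=1e-8):
    """Mean spectral angle in degrees between reference and estimated pixel spectra."""
    dot = np.sum(ref * est, axis=-1)
    denom = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1) + eps
    angles = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return float(np.degrees(angles.mean()))


def ergas(ref, est, ratio=4, eps=1e-8):
    """Relative dimensionless global error; ratio is the spatial downsampling factor."""
    band_rmse = np.sqrt(np.mean((ref - est) ** 2, axis=(0, 1)))
    band_mean = np.mean(ref, axis=(0, 1)) + eps
    return float(100.0 / ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))
```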
5. Results and Discussion
5.1. Visualization and Analysis of Super-Resolution
5.2. Comparison with Advanced Methods
5.3. Ablation Experiment
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Pande, C.B.; Moharir, K.N. Application of hyperspectral remote sensing role in precision farming and sustainable agriculture under climate change: A review. In Climate Change Impacts on Natural Resources, Ecosystems and Agricultural Systems; Springer Climate; Springer: Cham, Switzerland, 2023; pp. 503–520. [Google Scholar]
- Zhang, M.; Chen, T.; Gu, X.; Chen, D.; Wang, C.; Wu, W.; Zhu, Q.; Zhao, C. Hyperspectral remote sensing for tobacco quality estimation, yield prediction, and stress detection: A review of applications and methods. Front. Plant Sci. 2023, 14, 1073346–1073360. [Google Scholar] [CrossRef] [PubMed]
- Pan, B.; Cai, S.; Zhao, M.; Cheng, H.; Yu, H.; Du, S.; Du, J.; Xie, F. Predicting the Surface Soil Texture of Cultivated Land via Hyperspectral Remote Sensing and Machine Learning: A Case Study in Jianghuai Hilly Area. Appl. Sci. 2023, 13, 9321. [Google Scholar] [CrossRef]
- Liu, L.; Miteva, T.; Delnevo, G.; Mirri, S.; Walter, P.; de Viguerie, L.; Pouyet, E. Neural networks for hyperspectral imaging of historical paintings: A practical review. Sensors 2023, 23, 2419. [Google Scholar] [CrossRef]
- Vlachou-Mogire, C.; Danskin, J.; Gilchrist, J.R.; Hallett, K. Mapping materials and dyes on historic tapestries using hyperspectral imaging. Heritage 2023, 6, 3159–3182. [Google Scholar] [CrossRef]
- Huang, S.-Y.; Mukundan, A.; Tsao, Y.-M.; Kim, Y.; Lin, F.-C.; Wang, H.-C. Recent advances in counterfeit art, document, photo, hologram, and currency detection using hyperspectral imaging. Sensors 2022, 22, 7308. [Google Scholar] [CrossRef]
- da Lomba Magalhães, M.J. Hyperspectral Image Fusion—A Comprehensive Review. Master’s Thesis, Itä-Suomen Yliopisto, Kuopio, Finland, 2022. [Google Scholar]
- Zhang, M.; Sun, X.; Zhu, Q.; Zheng, G. A survey of hyperspectral image super-resolution technology. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS 2021, Brussels, Belgium, 11–16 July 2021; pp. 4476–4479. [Google Scholar]
- Dian, R.; Li, S.; Sun, B.; Guo, A. Recent advances and new guidelines on hyperspectral and multispectral image fusion. Inf. Fusion 2021, 69, 40–51. [Google Scholar] [CrossRef]
- Chen, Z.; Pu, H.; Wang, B.; Jiang, G.-M. Fusion of hyperspectral and multispectral images: A novel framework based on generalization of pan-sharpening methods. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1418–1422. [Google Scholar] [CrossRef]
- Jia, S.; Qian, Y. Spectral and spatial complexity-based hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3867–3879. [Google Scholar]
- Akhtar, N.; Shafait, F.; Mian, A. Bayesian sparse representation for hyperspectral image super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 3631–3640. [Google Scholar]
- Xie, W.; Jia, X.; Li, Y.; Lei, J. Hyperspectral image super-resolution using deep feature matrix factorization. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6055–6067. [Google Scholar] [CrossRef]
- Dian, R.; Fang, L.; Li, S. Hyperspectral image super-resolution via non-local sparse tensor factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 5344–5353. [Google Scholar]
- Liu, J.; Wu, Z.; Xiao, L.; Sun, J.; Yan, H. A truncated matrix decomposition for hyperspectral image super-resolution. IEEE Trans. Image Process. 2020, 29, 8028–8042. [Google Scholar] [CrossRef]
- Wan, W.; Guo, W.; Huang, H.; Liu, J. Nonnegative and nonlocal sparse tensor factorization-based hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8384–8394. [Google Scholar] [CrossRef]
- Li, J.; Cui, R.; Li, B.; Song, R.; Li, Y.; Dai, Y.; Du, Q. Hyperspectral image super-resolution by band attention through adversarial learning. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4304–4318. [Google Scholar] [CrossRef]
- Hu, J.-F.; Huang, T.-Z.; Deng, L.-J.; Jiang, T.-X.; Vivone, G.; Chanussot, J. Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7251–7265. [Google Scholar] [CrossRef] [PubMed]
- Hu, J.-F.; Huang, T.-Z.; Deng, L.-J.; Dou, H.-X.; Hong, D.; Vivone, G. Fusformer: A transformer-based fusion network for hyperspectral image super-resolution. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6012305. [Google Scholar] [CrossRef]
- Qu, Y.; Qi, H.; Kwan, C. Unsupervised sparse dirichlet-net for hyperspectral image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2511–2520. [Google Scholar]
- Yao, J.; Hong, D.; Chanussot, J.; Meng, D.; Zhu, X.; Xu, Z. Cross-attention in coupled unmixing nets for unsupervised hyperspectral super-resolution. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIX 16. pp. 208–224. [Google Scholar]
- Li, J.; Zheng, K.; Yao, J.; Gao, L.; Hong, D. Deep Unsupervised Blind Hyperspectral and Multispectral Data Fusion. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6007305. [Google Scholar] [CrossRef]
- Qu, Y.; Qi, H.; Kwan, C.; Yokoya, N.; Chanussot, J. Unsupervised and unregistered hyperspectral image super-resolution with mutual Dirichlet-Net. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5507018. [Google Scholar] [CrossRef]
- Liu, J.; Wu, Z.; Xiao, L.; Wu, X.-J. Model inspired autoencoder for unsupervised hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522412. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
- Conde, M.V.; Choi, U.-J.; Burchi, M.; Timofte, R. Swin2SR: Swinv2 transformer for compressed image super-resolution and restoration. In Computer Vision–ECCV 2022 Workshops, Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 669–687. [Google Scholar]
- Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive network: A successor to transformer for large language models. arXiv 2023, arXiv:2307.08621. [Google Scholar]
- Zheng, K.; Gao, L.; Liao, W.; Hong, D.; Zhang, B.; Cui, X.; Chanussot, J. Coupled convolutional neural network with adaptive response function learning for unsupervised hyperspectral super resolution. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2487–2502. [Google Scholar] [CrossRef]
- Gao, L.; Li, J.; Zheng, K.; Jia, X. Enhanced Autoencoders with Attention-Embedded Degradation Learning for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509417. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Li, J.; Li, Y.; Wang, C.; Ye, X.; Heidrich, W. Busifusion: Blind unsupervised single image fusion of hyperspectral and rgb images. IEEE Trans. Comput. Imaging 2023, 9, 94–105. [Google Scholar] [CrossRef]
- Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O.; Benediktsson, J. Quantitative quality evaluation of pansharpened imagery: Consistency versus synthesis. IEEE Trans. Geosci. Remote Sens. 2015, 54, 1247–1259. [Google Scholar] [CrossRef]
- Kruse, F.A.; Lefkoff, A.; Boardman, J.; Heidebrecht, K.; Shapiro, A.; Barloon, P.; Goetz, A. The spectral image processing system (SIPS)—Interactive visualization and analysis of imaging spectrometer data. Remote Sens. Environ. 1993, 44, 145–163. [Google Scholar] [CrossRef]
- Wald, L. Quality of high resolution synthesised images: Is there a simple criterion? In Proceedings of the Third Conference "Fusion of Earth Data: Merging Point Measurements, Raster Maps and Remotely Sensed Images", SEE/URISCA, Sophia Antipolis, France, 26–28 January 2000; pp. 99–103. [Google Scholar]
- Han, X.-H.; Shi, B.; Zheng, Y. SSF-CNN: Spatial and spectral fusion with CNN for hyperspectral image super-resolution. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2506–2510. [Google Scholar]
- Zhang, X.; Huang, W.; Wang, Q.; Li, X. SSR-NET: Spatial–spectral reconstruction network for hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5953–5965. [Google Scholar] [CrossRef]
- Wang, X.; Wang, X.; Song, R.; Zhao, X.; Zhao, K. MCT-Net: Multi-hierarchical cross transformer for hyperspectral and multispectral image fusion. Knowl.-Based Syst. 2023, 264, 110362–110375. [Google Scholar] [CrossRef]
- Zhang, L.; Nie, J.; Wei, W.; Zhang, Y.; Liao, S.; Shao, L. Unsupervised adaptation learning for hyperspectral imagery super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3073–3082. [Google Scholar]
- Chen, S.; Zhang, L.; Zhang, L. MSDformer: Multi-scale deformable transformer for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5525614–5525628. [Google Scholar]
- Ma, Q.; Jiang, J.; Liu, X.; Ma, J. Reciprocal transformer for hyperspectral and multispectral image fusion. Inf. Fusion 2024, 104, 102148–102159. [Google Scholar] [CrossRef]
Def: Gate cross-retention shared encoder (Y, Z)
(the weight and feature symbols were lost in extraction and are shown below as generic placeholders)

W_Q, W_K, W_V, W_G, W_O = initialize_parameters()
D = create_diagonal_matrix(Y)                # retention decay matrix
E_Y = layer_normalization(Embedding(Y))      # embedding of modality Y
Q = D × (E_Y @ W_Q)
E_Z = layer_normalization(Embedding(Z))      # embedding of modality Z
K = D × (E_Z @ W_K)
V = E_Z @ W_V
Attention_scores = normalize(Q @ K^T @ D)
R = Attention_scores @ V
R = group_normalization(R)
R = reshape(R, shape(Y))
output = swish(E_Y × W_G)                    # gating branch
F = output @ R × W_O
F_ffn = feed_forward_network(F)
final_output = F + F_ffn
return final_output
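As a companion to the listing above, the following is a minimal PyTorch sketch of a gated cross-retention block in the same spirit: queries and a swish gate come from one modality (Y), keys and values come from the other (Z), and a retention decay matrix modulates the scores. The class name, the exponential decay construction, the normalization choices, and all dimensions are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a gated cross-retention block (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GateCrossRetention(nn.Module):
    def __init__(self, dim, gamma=0.9):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # queries from modality Y
        self.w_k = nn.Linear(dim, dim, bias=False)   # keys from modality Z
        self.w_v = nn.Linear(dim, dim, bias=False)   # values from modality Z
        self.w_g = nn.Linear(dim, dim, bias=False)   # swish gate from modality Y
        self.w_o = nn.Linear(dim, dim, bias=False)   # output projection
        self.norm_y = nn.LayerNorm(dim)
        self.norm_z = nn.LayerNorm(dim)
        self.group_norm = nn.GroupNorm(1, dim)
        self.gamma = gamma                           # retention decay rate (assumed)

    def decay_matrix(self, n, device):
        # Lower-triangular exponential decay, as used in retention mechanisms.
        idx = torch.arange(n, device=device)
        return torch.tril(self.gamma ** (idx[:, None] - idx[None, :]).float())

    def forward(self, y_tokens, z_tokens):
        # y_tokens, z_tokens: (batch, n_tokens, dim) token embeddings of the two modalities
        y, z = self.norm_y(y_tokens), self.norm_z(z_tokens)
        q, k, v = self.w_q(y), self.w_k(z), self.w_v(z)
        d = self.decay_matrix(y.shape[1], y.device)
        scores = (q @ k.transpose(-2, -1)) * d                   # cross-modal retention scores
        scores = scores / scores.abs().sum(-1, keepdim=True).clamp(min=1e-6)
        r = scores @ v
        r = self.group_norm(r.transpose(1, 2)).transpose(1, 2)   # per-channel group norm
        gated = F.silu(self.w_g(y)) * r                          # swish gating from modality Y
        return y_tokens + self.w_o(gated)                        # residual output
```

A cross-sharing arrangement of this kind can be applied in both directions (hyperspectral queries against multispectral keys and vice versa), with the feed-forward network (FFN) listed in Section 3.1 following each block.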
Index | SSFCNN [39] | SSRnet [40] | MCT [41] | MIAE [24] | Fusformer [19] | UAL [42] | MSDformer [43] | DCT [44] | Ours |
---|---|---|---|---|---|---|---|---|---|
PSNR ↑ | 42.41 | 41.37 | 42.73 | 44.74 | 40.84 | 46.41 | 43.59 | 41.52 | 50.38 |
RMSE ↓ | 0.0057 | 0.0065 | 0.0055 | 0.0097 | 0.0069 | 0.0036 | 0.0050 | 0.0064 | 0.0044 |
SAM ↓ | 1.61 | 1.97 | 1.88 | 2.20 | 1.74 | 1.08 | 1.45 | 1.87 | 1.01 |
ERGAS ↓ | 8.02 | 7.62 | 7.19 | 1.0 | 4.20 | 1.91 | 2.48 | 4.84 | 0.47 |
Index | SSFCNN [39] | SSRnet [40] | MCT [41] | MIAE [24] | Fusformer [19] | UAL [42] | MSDformer [43] | DCT [44] | Ours |
---|---|---|---|---|---|---|---|---|---|
PSNR ↑ | 34.10 | 37.32 | 37.41 | 40.82 | 37.04 | 40.29 | 37.67 | 34.36 | 43.15 |
RMSE ↓ | 0.0032 | 0.0022 | 0.0022 | 0.0014 | 0.0023 | 0.0015 | 0.0021 | 0.0031 | 0.0019 |
SAM ↓ | 3.23 | 2.14 | 2.11 | 1.38 | 5.81 | 1.46 | 1.91 | 2.57 | 1.77 |
ERGAS ↓ | 9.92 | 7.65 | 7.92 | 1.22 | 2.17 | 1.56 | 2.30 | 5.17 | 0.57 |
Index | SSFCNN [39] | SSRnet [40] | MCT [41] | MIAE [24] | Fusformer [19] | UAL [42] | MSDformer [43] | DCT [44] | Ours |
---|---|---|---|---|---|---|---|---|---|
PSNR ↑ | 39.33 | 40.17 | 40.48 | 41.91 | 39.55 | 42.24 | 42.26 | 41.94 | 42.37 |
RMSE ↓ | 0.0107 | 0.0098 | 0.0094 | 0.0080 | 0.0105 | 0.0077 | 0.0077 | 0.0079 | 0.0093 |
SAM ↓ | 2.91 | 2.73 | 2.65 | 2.27 | 2.92 | 2.26 | 2.16 | 2.20 | 2.37 |
ERGAS ↓ | 2.19 | 2.01 | 1.96 | 1.73 | 2.18 | 1.63 | 1.63 | 1.73 | 0.73 |
Methods | X4 PSNR | X4 RMSE | X4 SAM | X4 ERGAS | X8 PSNR | X8 RMSE | X8 SAM | X8 ERGAS | X16 PSNR | X16 RMSE | X16 SAM | X16 ERGAS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SSFCNN [39] | 39.56 | 0.0073 | 0.57 | 0.33 | 38.89 | 0.0079 | 0.64 | 0.36 | 37.90 | 0.0089 | 0.81 | 0.40 |
SSRnet [40] | 40.11 | 0.0069 | 0.58 | 0.31 | 39.32 | 0.0076 | 0.70 | 0.34 | 37.72 | 0.0091 | 0.88 | 0.42 |
MCT [41] | 42.65 | 0.0051 | 0.45 | 0.23 | 42.30 | 0.0053 | 0.48 | 0.24 | 41.44 | 0.0059 | 0.51 | 0.27 |
MIAE [24] | 49.59 | 0.0023 | 0.21 | 0.10 | 46.75 | 0.0112 | 0.72 | 0.26 | 49.68 | 0.0023 | 0.20 | 0.10 |
Fusformer [19] | 47.32 | 0.0030 | 0.24 | 0.13 | 40.34 | 0.0067 | 0.54 | 0.30 | 39.29 | 0.0076 | 0.69 | 0.34 |
UAL [42] | 47.56 | 0.0029 | 0.26 | 0.13 | 45.64 | 0.0036 | 0.31 | 0.16 | 47.42 | 0.0029 | 0.25 | 0.13 |
MSDformer [43] | 46.89 | 0.0031 | 0.27 | 0.14 | 45.84 | 0.0035 | 0.30 | 0.16 | 46.24 | 0.0034 | 0.30 | 0.15 |
DCT [44] | 38.11 | 0.0087 | 0.40 | 0.76 | 39.66 | 0.0073 | 0.64 | 0.34 | 42.22 | 0.0054 | 0.48 | 0.25 |
Ours | 55.34 | 0.0027 | 0.20 | 0.06 | 53.87 | 0.0034 | 0.24 | 0.08 | 53.75 | 0.0030 | 0.24 | 0.07 |
Shared Encoder | Fusion Decoder | Blind Estimation Network | PSNR | RMSE | SAM | ERGAS |
---|---|---|---|---|---|---|
× | × | × | 46.79 | 0.0046 | 0.32 | 0.11 |
√ | × | × | 49.62 | 0.0039 | 0.29 | 0.09 |
√ | √ | × | 52.85 | 0.0035 | 0.24 | 0.08 |
√ | √ | √ | 53.61 | 0.0029 | 0.20 | 0.06 |