1. Introduction
In urban smart construction, the spatial information carried by remote sensing images plays an essential role. The hyperspectral image (HSI) is a pervasive tool in remote sensing owing to its high spectral resolution: it combines spatial imagery with spectral measurements and provides a 3D data representation with rich spectral bands and abundant spatial–spectral information [
1]. This technology has been extensively utilized in numerous Earth observation tasks, including land cover classification [
2], scene classification [
3], object detection [
4], medical diagnosis [
5], and urban planning [
6].
Convolutional neural network (CNN)-based classification models primarily focus on the spatial–spectral characteristics of the hyperspectral image (HSI). Li et al. [
7] introduced a spectral context-aware transformer (SCAT) algorithm that better captures spatial information in HSIs, leading to a notable improvement in classification accuracy. Zhang et al. [
8] introduced an unsupervised convolutional neural network model that fully exploits spectral–spatial features to better extract sample characteristics. 2D CNNs can extract the spatial characteristics of the HSI, while 3D CNNs are effective for spectral feature extraction. Additionally, Gao et al. [
9] introduced a new multi-scale, two-branch feature fusion method based on an attention mechanism that addresses the limitations of previous methods that relied on static convolutional kernels and a step-by-step approach to feature extraction. By utilizing a multi-branch network structure, the depth of the network is reduced, and it better adapts to different scales of features while accounting for both spectral and spatial data.
However, the applicability of the HSI is limited under complex terrain and multi-object conditions, including a limited ability to directly capture the spectral properties of ground objects. Therefore, it can be augmented with other remotely sensed data when performing classification tasks. By integrating complementary information extracted from multimodal data, more robust and reliable decisions can be made in feature classification tasks [
10].
In recent years, many multimodal remote sensing data fusion methods have been proposed [
11], and feature extraction can be performed efficiently by using deep learning networks, which provide a solid foundation for addressing the above challenges. Li et al. [
12] put forth the concept of unsupervised fusion networks with diminished adaptive learning capabilities, which are capable of directly encoding spatial and spectral transformations across a range of resolutions. The CNN-Fus [
13] method fuses the HSI and MSI using a subspace representation and a CNN denoiser trained for grayscale image denoising, and it shows better performance. A cross-modal-learning X-shaped interactive self-encoder network (XINet) [
14] couples two disconnected U-nets through a parameter sharing strategy to realize information exchange and complementarity between modalities. In contrast, traditional light detection and ranging (LiDAR) incorporates additional data, including 3D coordinate information, reflection intensity data, time stamps, and echo information [
15]. This enables more effective compensation for the absence of elevation data in hyperspectral data. Hong et al. [
16] introduced a deep encoder–decoder network structure (End-Net) that reconstructs multimodal data with feature fusion to realize cross-modal activation of neurons, enabling effective fusion of multimodal images. Zhang et al. [
17] introduced the interleaved perception convolutional neural network (IP-CNN), a two-branch CNN architecture for integrating different input data. Zhang et al. [
18] proposed a new three-channel CNN to extract the spectral, spatial, and elevation information of remote sensing images, and a multilevel feature fusion (MLF) module was employed to integrate shallow and deep features. In addition, a mutually guided attention (MGA) module was introduced to achieve a comprehensive fusion of spatial and elevation data. Ding et al. [
19] proposed a novel approach to the utilization of both local and global features simultaneously and employed a probabilistic approach for classification estimation through decision fusion. Lu et al. [
20] introduced a new classification method based on coupled adversarial learning (CALC). This method trains a coupled adversarial feature learning (CAFL) sub-network, which enables the unsupervised extraction of high-level semantic features from hyperspectral images (HSIs) and LiDAR data. The CAFL sub-network generates multiple category-estimated probabilities by learning the low-, intermediate-, and high-level features, which are then combined in an adaptive manner to produce the final, accurate classification results. Yu et al. [
21] proposed a shadow-mask-driven multimodal intrinsic image decomposition (smMIID) approach to overcome the shortcomings of existing intrinsic image decomposition (IID)-based frameworks in terms of information diversity and modal relevance.
Researchers, such as Nitta [
22,
23], generalized neural networks to the quaternion domain and designed quaternion back-propagation (BP) neural networks. Li et al. [
24] applied principal component analysis (PCA) to the HSI to obtain and maintain the orthogonal structure of the HSI and encoded the first three principal components (PCs) as the three imaginary parts of the quaternion. Voronin et al. [
25] used a quaternion framework to represent remote sensing images, and Rao et al. [
26] proposed an innovative quaternion-based network for HSI classification (QHIC-Net). This network captures both the local dependencies among spectral channels for individual pixels and the global structural relationships that define edges or shapes formed by pixel groups. Zhou et al. [
27] investigated the mapping of real-valued HSI features into quaternion features and proposed a separable quaternion convolutional neural network (SQNet) for hyperspectral image classification. However, quaternions have only four components, and thus they cannot fully handle the complex spectral–spatial structural information in remote sensing data.
Remote sensing imagery typically contains rich spectral and spatial detail, yet existing deep learning methods primarily model the external spatial relationships between pixels. They often overlook the intrinsic relationships among the attributes within a single pixel, making it difficult to identify and distinguish the internal and external relationships between different features in the image, which can cause the loss of structural features and critical information across spectral bands. Furthermore, remote sensing data generally comprise many spectral bands whose correlations are essential for precise classification, but traditional deep learning models struggle to capture these holistic connections. Interactions between bands are often ignored or oversimplified, which limits the model’s ability to extract comprehensive spectral–spatial features and thus degrades classification outcomes.
To address these limitations, we introduce geometric algebra (GA) networks into multi-source remote sensing image classification for the first time. The weights of geometric algebraic neurons capture structural features according to the algebraic rules, and each pixel in the remote sensing image is represented as a multivector, so the whole image can be represented by a single GA matrix instead of several independent real-valued matrices. The complex internal and external relationships and the spatial information inherent in the remote sensing data are thus extracted more efficiently through the integrated processing of intra-pixel correlations and the global information of the image. GA-based convolution and real-valued convolution are hierarchically fused across different domains to mitigate the instability associated with extracting deep HSI features, thereby enhancing the model’s overall performance. Based on this framework, we propose a GA-based spectral–spatial hierarchical fusion network (GASSF-Net) for multi-source remote sensing data. The primary contributions of this paper are as follows:
- (1)
In response to the complex spectral and spatial information of remote sensing data, as well as the holistic relationships between different bands, this study extends convolutional layers into the geometric algebra domain for the integration and categorization of multi-source remote sensing images. By using geometric algebra matrices to represent the entire remote sensing image, both internal correlations and holistic spatial relationships can be processed simultaneously. This multi-dimensional representation captures the complex interactions between spectral and spatial features more effectively than traditional real-valued matrices (a minimal multivector-encoding sketch is given after this list of contributions).
- (2)
To enhance the correlation of spectral dimensions while improving model performance, the multi-source feature extraction (MSFE) module uses pairwise ensemble operators (PEOs) to preserve the spectral and spatial information of the HSI, thereby deeply mining spectral features.
- (3)
A GA and real-valued domain multi-dimensional fusion module (GRMF) is proposed as a means to extract deep features from the HSI. GA convolution effectively captures the relationships between different spectral bands, and its integration with the real-valued convolution enables more comprehensive information extraction. In addition, the GA network extracts features from LiDAR data, thus improving the correlation between spatial and spectral information. These neurons can fully capture structural features according to algebraic rules, leading to more efficient extraction of complex internal correlations and spatial information, thus improving the overall performance of the model.
- (4)
A GA-based cross-fusion (GACF) module is employed to achieve comprehensive multi-source feature fusion in the spectral–spatial domain, which enables feature-level fusion while preserving holistic relationships between different attributes.
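To make the multivector representation in contribution (1) concrete, the sketch below encodes a pixel’s leading spectral components as a multivector of the 2D geometric algebra G(2,0) and applies the geometric product. This is only an illustrative NumPy example: the algebra dimension, signature, and encoding actually used in GASSF-Net are not specified here and may differ.

```python
import numpy as np

def geometric_product(a, b):
    """Geometric product in G(2,0) with basis (1, e1, e2, e12).

    a, b: length-4 arrays holding the coefficients of a multivector.
    """
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,  # scalar part
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,  # e1 part
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,  # e2 part
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,  # e12 (bivector) part
    ])

# Hypothetical example: encode four spectral/PCA components of one pixel as a
# single multivector, so an H x W x 4 cube becomes an H x W multivector matrix.
rng = np.random.default_rng(0)
pixel = rng.standard_normal(4)    # stand-in for 4 spectral components
weight = rng.standard_normal(4)   # a GA-valued neuron weight

print(geometric_product(weight, pixel))  # one GA neuron's pre-activation
```

Because the geometric product mixes all four components of the weight and the pixel, a single GA multiplication couples the channels that separate real-valued weights would treat independently, which is the property the contributions above rely on.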
The remainder of this article is organized as follows.
Section 2 provides a detailed analysis of the methodology used for each module of the experiment, while
Section 3 presents the findings of the experimental research and the subsequent analysis.
Section 4 discusses the proposed model. Lastly,
Section 5 provides the conclusion of this paper.
3. Experiments and Results
All experiments were carried out on a Linux system with a single GeForce RTX 3090 GPU and implemented in the PyTorch framework. The Adam optimizer was used, and the network was trained for 200 epochs. The initial learning rate was set to 0.01 and halved every 80 epochs. To assess the reliability of the proposed model, we tested it on three remote sensing datasets.
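For reference, the following PyTorch snippet reproduces the optimization schedule described above (Adam, initial learning rate 0.01, halved every 80 epochs, 200 epochs). The model and data are placeholders, since the GASSF-Net architecture and data loaders are defined elsewhere.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and synthetic data standing in for GASSF-Net and the HSI/LiDAR patches.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 6)).to(device)
data = TensorDataset(torch.randn(256, 64), torch.randint(0, 6, (256,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Halve the learning rate every 80 epochs, as stated in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```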
3.1. Presentation of the Datasets
- (1)
Trento data capture rural areas surrounding the city of Trento, Italy, with dimensions of 600 × 166 pixels. The hyperspectral imagery obtained from the AISA Eagle sensor comprises 63 spectral bands, covering spectral wavelengths from 420.89 to 989.09 nm with a spectral resolution of 9.2 nm and a spatial resolution of 1 m. The LiDAR imagery captured by the Optech ALTM 3100EA sensor is a single-channel image containing elevation heights corresponding to ground locations. The dataset comprises 30,214 ground truth pixels, which have been categorized into six classes. Detailed information regarding the sample quantities for each class is provided in
Table 1, and the visualization is demonstrated in
Figure 4.
- (2)
The MUUFL dataset is a co-registered aerial hyperspectral–LiDAR dataset. It was acquired in a single aerial flight over the University of Southern Mississippi Gulf Park campus on 8 November 2010, capturing two modal images. The images have dimensions of 325 × 220 pixels. The hyperspectral data, acquired with the CASI-1500 imaging sensor, include 64 spectral bands ranging from 375 to 1050 nm, with a spatial resolution of 0.54 m. The LiDAR data consist of two elevation rasters providing elevation heights corresponding to ground positions. The dataset comprises 53,687 ground truth pixels, which have been categorized into 11 classes. Detailed information regarding the sample quantities for each class is provided in
Table 2, and the visualization is demonstrated in
Figure 5.
- (3)
The Houston 2013 data were obtained by the University of Houston and the National Science Foundation (NSF)-funded National Center for Airborne Laser Mapping (NCALM). The scene information was collected using the ITRES CASI-1500 imaging sensor over the University of Houston campus and its surrounding neighborhoods. The images have dimensions of 349 × 1905 pixels. The hyperspectral dataset includes 144 spectral bands spanning from 364 to 1046 nm, with a spectral resolution of 10 nm and a spatial resolution of 2.5 m. The LiDAR dataset is a single-channel image providing elevation heights corresponding to ground positions. The dataset comprises 15,029 ground truth pixels, which have been categorized into 15 classes. Detailed information regarding the sample quantities for each class is provided in
Table 3; the visualization is demonstrated in
Figure 6.
3.2. Multi-Source Data Fusion Analysis
The effectiveness of the comparative classification methods used in our experiments was assessed using three standard metrics: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa). OA indicates the ratio of correctly classified pixels to the total number of pixels. AA is the average of the per-class accuracies. Kappa is an indicator used for consistency testing; it adjusts the observed agreement for the level of agreement that could occur by chance. Higher values of the three metrics (OA, AA, and Kappa) indicate better classification performance in remote sensing image classification tasks. The definitions of these indices are as follows:

$$\mathrm{OA} = \frac{N_c}{N}, \qquad \mathrm{AA} = \frac{1}{C}\sum_{i=1}^{C}\frac{N_c^{(i)}}{N^{(i)}}, \qquad \mathrm{Kappa} = \frac{\mathrm{OA} - P_e}{1 - P_e},$$

where $N_c$ represents the number of correctly identified samples, $N$ represents the total number of samples, and $N_c^{(i)}$ and $N^{(i)}$ represent the number of correctly identified samples and the total number of samples in class $i$ of the $C$ classes, respectively. The hypothesized probability of chance agreement, $P_e$, is calculated using Equation (33):

$$P_e = \frac{1}{N^2}\sum_{i=1}^{C} a_i b_i, \tag{33}$$

where the counts of actual samples and predicted samples for each class are denoted by $a_i$ and $b_i$, respectively.
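The three metrics can be computed directly from a confusion matrix; the short NumPy sketch below implements the definitions above (the toy confusion matrix is purely illustrative).

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Kappa from a confusion matrix (rows: true class, cols: predicted class)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)              # accuracy of each class
    aa = per_class.mean()                                     # average accuracy
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n**2   # chance agreement P_e
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy three-class example.
conf = np.array([[50, 2, 1],
                 [ 3, 45, 2],
                 [ 0, 4, 43]])
print(classification_metrics(conf))
```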
To demonstrate the superiority of the GACF module and the advantages of multi-source data fusion for classification, we assessed the classification performance obtained using HSI or LiDAR data alone versus multi-source data. We trained on the HSI and LiDAR data separately using the MSFE and GRMF modules and compared their classification performance with the results obtained using both modalities simultaneously.
Figure 7 presents the outcomes of the distinct classification methodologies.
Figure 7 shows that using multiple data sources for classification yielded better performance on all three datasets than using a single data source alone. In the Houston dataset, using the HSI alone improved the OA, AA, and Kappa by almost 40% compared to using LiDAR alone. In the MUUFL dataset, the OA improved by about 40%, while the AA and Kappa improved by almost 57%. After fusing the two data sources, both datasets achieved an OA of over 90%, and the AA and Kappa were also improved.
This demonstrates that LiDAR data can effectively compensate for the missing elevation information in HSI data. Our proposed fusion network can then better combine the two sets of data features to achieve high-precision classification.
3.3. Classification Performance
We investigated the effect of different patch sizes (7 × 7, 9 × 9, 11 × 11, 13 × 13, 15 × 15, and 17 × 17) and batch sizes (16, 32, 64, and 128) on performance, and the outcomes are presented in
Figure 8,
Figure 9 and
Figure 10. In the Trento dataset, for a given patch size, the highest OA was achieved with a batch size of 16, and the highest OA of 99.5% was obtained with a patch size of 17 × 17. In the MUUFL dataset, for a given batch size, the OA peaks at a patch size of 13 × 13, reaching a maximum of 92.88% with a batch size of 128. In the Houston dataset, the OA peaked at 94.46% with a batch size of 32 and a patch size of 11 × 11.
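A sweep such as the one summarized above can be organized as a simple grid search over patch size and batch size. The sketch below shows the loop structure only; `train_and_evaluate` is a hypothetical stand-in for the routine that trains the model with a given configuration and returns its OA.

```python
from itertools import product

patch_sizes = [7, 9, 11, 13, 15, 17]
batch_sizes = [16, 32, 64, 128]

def train_and_evaluate(patch_size, batch_size):
    """Hypothetical stand-in: train with the given configuration and return the OA."""
    return 0.0  # replace with the actual training/evaluation routine

# Evaluate every (patch size, batch size) combination and keep the best one.
results = {}
for patch, batch in product(patch_sizes, batch_sizes):
    results[(patch, batch)] = train_and_evaluate(patch, batch)

best_cfg, best_oa = max(results.items(), key=lambda kv: kv[1])
print(f"best patch/batch: {best_cfg}, OA = {best_oa:.2%}")
```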
3.4. Ablation Study
The proposed network includes the MSFE module, which preserves the integrity of the information and the corresponding spatial structure of the HSI. The GRMF module mines the internal and external relationships and the holistic nature of the HSI spectral–spatial features. Finally, the GACF module fuses the features of the different data sources. These three modules work in concert to enhance the classification performance on remotely sensed data. To assess the effectiveness of MSFE, GRMF, and GACF, we conducted ablation experiments, which helped identify the key modules and guide future research.
When ablated, the MSFE and GRMF modules are each replaced with a simple real-valued 2D convolution, and the GACF fusion module is replaced with simple summation fusion. We used identical experimental configurations and evaluated the final classification outcomes on all three datasets.
Table 4 shows the experimental results, with the best results in bold. It shows that using MSFE + GRMF and MSFE + GACF improves the OA on all three datasets compared to using MSFE alone. In the Trento dataset, the OA, AA, and Kappa all improved; in the MUUFL dataset, the OA and AA increased by approximately 6% and 2%, respectively, while Kappa increased by 8% and 2%, respectively. When using MSFE + GRMF, the OA and AA in the Houston dataset improved by 1.42% and 1.96%, respectively, while Kappa increased by 1.57%. This confirms the ability of GRMF to explore fine spatial information in remote sensing images and to preserve the overall relationships between internal and external features, as well as the ability of GACF to fuse multi-source information.
Upon comparing GRMF alone with both MSFE + GRMF and GRMF + GACF, in the Trento dataset there is minimal difference in classification performance between GRMF and GRMF + GACF. However, the OA improves by at least 1% when using MSFE + GRMF, suggesting that the two modules collaborating during feature extraction capture spatial and spectral features more efficiently. In the MUUFL dataset, compared to GRMF alone, both MSFE + GRMF and GRMF + GACF improved the OA, AA, and Kappa; MSFE + GRMF improved the OA, AA, and Kappa by 5.59%, 1.96%, and 7.21%, respectively. In the Houston dataset, the other two methods improved the OA by about 2.32% and 1.85%, the AA by about 3.47% and 2.09%, and Kappa by 2.51% and 2% over GRMF alone. These findings indicate that MSFE can compensate for spatial information in remote sensing data and thoroughly explore the spectral–spatial details.
Using MSFE + GACF and GRMF + GACF resulted in a slight improvement in the OA, AA, and Kappa compared to using GACF alone in both the Trento and Houston datasets. Meanwhile, in the MUUFL dataset, the OA improved by 1.77% and 0.67%, the AA by 1.38% and 1.77%, and Kappa by 2.24% and 0.97% compared to using GACF alone. This highlights the significance of the feature extraction modules in extracting spatial and spectral features from remote sensing data.
The most favorable classification outcomes were achieved when all three modules were used together, resulting in significant improvements in the OA, AA, and Kappa. This demonstrates that the three designed modules not only achieve good classification results separately but also work well in combination.
Concurrently using MSFE, GRMF, and GACF can fully utilize the role of each module in maximizing the extraction of spatial and spectral feature details in remote sensing data while retaining the interrelated structural domain information. This leads to a fully integrated set of features for classification.
3.5. Comparative Experimental Analysis
To demonstrate the superiority of our proposed model, we compared it with various network models designed for the classification of HSI and LiDAR data. The methods for comparison include the Interleaved Perception CNN (IP-CNN) [
16], the Adaptive Mutual Learning Multimodal Data Fusion Network (AM
3-Net) [
43], the Cross-Channel Reconstruction Module (CCR-Net) [
44], the Deep Encoder–Decoder Network (End-Net) [
18], the Hierarchical Random Walk Network (HRWN) [
45], Coupled Adversarial Learning Classification (CALC) [
20], and the Multiscale Spatial Spectral Network (AMSSE) [
46]. To ensure fairness in the experiments, the same training and test samples were utilized for comparison purposes.
Table 5,
Table 6 and
Table 7 show the detailed results of the various methods applied to the three experimental datasets, with the best results in bold. These include the classification accuracy for each category as well as the OA, AA, and Kappa values. Specifically, CCR-Net, IP-CNN, and End-Net use feature-level fusion, while AM3-Net and HRWN use decision-level fusion strategies. CALC and AMSSE, on the other hand, use both feature-level and decision-level fusion methods. Most of the compared architectures are based on deep CNNs. Based on
Table 5,
Table 6 and
Table 7, the following conclusions can be drawn.
First, feature-level fusion methods such as IP-CNN preserve the complementary structure of the HSI and LiDAR data as well as the integrity of the fused information. Nevertheless, using the Gram matrix to preserve multi-source complementary information accumulates a considerable amount of superfluous data, making it difficult to extract the pivotal information. When feature information is exchanged effectively across modalities, however, the edges and structures of the features become clearer, ultimately producing cleaner fused features.
Second, classification accuracy can be improved by implementing a decision-level fusion strategy. AM3-Net represents both global and local information through shallow and deep appearance features, exploiting the strong complementarity between shallow and deep information, and uses different weights when fusing the three levels of features. Similarly, HRWN performs classification through a random walk layer driven by the LiDAR weight map, and both methods achieve better classification accuracy.
Third, AMSSE-Net utilizes MMHF to capture spatial information and feature maps with different receptive fields and fuses the data using a combined strategy of feature-level and weighted decision-level merging. CALC improves the classification performance of the model by using both feature-level and decision-level fusion. In the fusion stage, high-level semantic and complementary information is mined and utilized, adversarial training is added, and intricate details in both the HSI and LiDAR data are efficiently maintained. The adaptive probabilistic fusion strategy further enhances classification performance.
To improve the feature extraction and fusion of various remote sensing data, we propose the GASSF-Net model. Considering the above methods and their challenges, we first extract the features of the HSI data through the PEO before performing feature fusion. Subsequently, the GRMF module simultaneously mines the spectral–spatial information of the HSI through GA networks and real-valued networks, which preserves the correlation of internal and external relationships between high-dimensional signals and maintains the comprehensive information while mining the high-level semantic details of the HSI. To complement the HSI features, a GA-based network extracts elevation and spatial details from the LiDAR data. Finally, we introduce a cross-fusion module that effectively retains the detailed information in both the HSI and LiDAR data to achieve interactive complementation of rich information, thus better fusing the multi-source data using GA-based methods.
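To summarize the data flow described above, the skeleton below sketches a two-branch network with an HSI branch, a LiDAR branch, and feature-level fusion by concatenation. It is only a structural outline under simplifying assumptions: the actual PEO, GRMF, and GACF modules are GA-based and considerably richer than the plain convolutions and concatenation used here.

```python
import torch
from torch import nn

class TwoBranchFusionSketch(nn.Module):
    """Structural outline only: HSI branch + LiDAR branch + feature-level fusion."""

    def __init__(self, hsi_bands=63, lidar_channels=1, n_classes=6):
        super().__init__()
        # Stand-in for the PEO + GRMF feature extraction applied to the HSI patch.
        self.hsi_branch = nn.Sequential(
            nn.Conv2d(hsi_bands, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stand-in for the GA-based extraction of elevation/spatial cues from LiDAR.
        self.lidar_branch = nn.Sequential(
            nn.Conv2d(lidar_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stand-in for GACF: simple concatenation followed by a classifier.
        self.classifier = nn.Linear(64 + 16, n_classes)

    def forward(self, hsi_patch, lidar_patch):
        fused = torch.cat([self.hsi_branch(hsi_patch), self.lidar_branch(lidar_patch)], dim=1)
        return self.classifier(fused)

# Example: a batch of two 11 x 11 patches per modality (Trento-like band count).
model = TwoBranchFusionSketch()
logits = model(torch.randn(2, 63, 11, 11), torch.randn(2, 1, 11, 11))
print(logits.shape)  # torch.Size([2, 6])
```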
The performance of the different comparison methods was similarly evaluated for the Trento and MUUFL datasets with fewer feature classes and higher spatial resolution. The visualized classification figures are shown in
Figure 11 and
Figure 12. For the Trento dataset, our method achieved the best results for the OA and Kappa, with a higher AA as well, and attained the highest accuracy in five categories. Moreover, our method achieves the highest OA and Kappa values for the challenging MUUFL dataset, and it is capable of achieving high classification accuracy for classes with exceptionally large sample sizes (e.g., class 1, trees).
For the Houston dataset, the best results were obtained for all three metrics. The visualized classification graph is shown in
Figure 13, which shows that our method produces smoother classification results. This indicates that our method can mine more features and achieve higher classification accuracy, and its results align more closely with the ground truth map.
To quantitatively analyze the computational cost of different models,
Table 8 shows the total parameters of the neural network models (in millions), the training time (in seconds), and the testing time (in seconds) for different models on the Houston dataset. It can be seen that End-Net, as a lightweight network, has a smaller number of parameters and shorter training and testing times, but it is less efficient in learning. In contrast, IP-CNN has a significant increase in the number of parameters due to two-stage training, while AMSSE-Net and AM
3-Net deeply mine the spectral features through pairwise ensembles, which means that the training and testing time is mainly consumed in this part. Our proposed model maintains a moderate level of training and testing time, but it still achieves the best classification results, indicating its high training and testing efficiency while guaranteeing high classification accuracy.
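For completeness, parameter counts and timings like those reported in Table 8 can be measured with a few lines of PyTorch; the snippet below shows one common way to do so, using an arbitrary placeholder model rather than any of the networks compared above.

```python
import time
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 15))   # placeholder model

# Total trainable parameters, reported in millions as in Table 8.
params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
print(f"parameters: {params_m:.3f} M")

# Rough wall-clock timing of a forward pass over a dummy batch.
x = torch.randn(128, 64)
start = time.perf_counter()
with torch.no_grad():
    model(x)
print(f"inference time: {time.perf_counter() - start:.4f} s")
```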
4. Discussion
The GASSF-Net method proposed in this paper makes notable advancements in remote sensing spectral–spatial information extraction. A comprehensive evaluation of GASSF-Net on several standard datasets, including Trento, MUUFL, and Houston, demonstrates that the method excels in feature extraction accuracy, classification accuracy, and generalization ability. The ablation experiments on single-source and multi-source data demonstrate that the LiDAR features extracted through geometric algebra can effectively compensate for the lack of elevation information in HSI data, and that integrating diverse data features allows a more comprehensive range of information to be extracted. The results of the ablation experiments on the different modules also demonstrate the crucial role of geometric algebra in this method. The fundamental innovation of GASSF-Net lies in its use of geometric algebraic techniques for spectral–spatial information extraction. The geometric algebra network comprehensively captures the complex internal and external relationships and the holistic information within the data, and, combined with the channel information extracted by the real-valued network, it ensures efficient encoding and interpretation of features. This branch fusion method significantly enhances the extraction of spectral–spatial features, enabling GASSF-Net to reflect the information present in remote sensing images more accurately. Concurrently, the GA-based cross-fusion module achieves feature-level fusion and complementarity while maintaining the integrity of the unique spectral–spatial information of each data source; it can handle the complexity of different data sources and effectively fuse multimodal data. As a result, GASSF-Net outperforms the compared methods in feature extraction accuracy, classification accuracy, and generalization ability.
While GASSF-Net demonstrates satisfactory performance in the current experiments, several avenues warrant further investigation. First, although this paper focuses on the fusion of hyperspectral images and LiDAR data, the GASSF-Net framework can be extended to other types of remote sensing data, including synthetic aperture radar (SAR) images and multispectral images. Second, the incorporation of geometric algebra, with its non-commutative multiplication, increases the computational complexity of the algorithm; we expect this issue to be alleviated as parallel computing support for geometric algebra matures. Furthermore, there is scope for further optimization of the geometric algebra design to enhance the efficiency and effectiveness of the cross-fusion module and to make full use of the extracted feature information.
GASSF-Net offers novel insights into the processing of remote sensing data and provides robust support for classification tasks in real-world applications. For instance, in domains such as land cover classification, environmental monitoring, and urban planning, GASSF-Net can enhance the precision and dependability of classification, thereby furnishing more precise guidance for the relevant decisions. The proposed design not only enhances the comprehensiveness and robustness of feature representation but also effectively addresses the pivotal challenges inherent to multi-source data fusion. The incorporation of geometric algebra in feature fusion represents a significant advancement over existing methodologies, as it better captures the intricate relationships between data sources, thereby enhancing classification performance.