1. Introduction
Soil salinization, a global agricultural and environmental challenge, has far-reaching adverse effects on soil properties. It not only restricts crop growth and development but also reduces microbial activity and fertility in the soil, thereby disturbing the equilibrium of ecosystems and posing a major threat to plant health and agricultural output [
1]. Xinjiang is an arid and semi-arid region with many types of saline soil spread over a wide area, making it a typical inland saline region. Data from the second national soil census show that saline soils in Xinjiang total 13.361 million hectares, accounting for 36.8% of the national saline soil area, which calls for timely treatment and long-term planning [
2]. The main causes of salinization include climatic factors, such as low rainfall and high temperatures, accumulation of soluble salts in the soil due to hydrogeology, poorly maintained irrigation setups, and unscientific cultivation habits of farmers [
3]. These factors work together to exacerbate the accumulation of soil salts, seriously affecting agricultural production and the ecological environment [
4]. Therefore, it is important to formulate effective management measures and sustainable agricultural strategies to address soil salinization in Xinjiang. The problem is particularly acute in the agriculturally important areas of northern Xinjiang, China, owing to the geographical conditions of the inland arid zone. Sparse rainfall and intense evaporation cause salts to accumulate at the soil surface, forming saline soils. This hinders healthy crop growth and reduces the availability of key soil nutrients, directly affecting crop yield and quality. In addition, saline soils make land management more difficult and costly, and the vast and varied landscapes of Xinjiang make it hard to obtain the accurate and timely information needed to support agricultural production.
The advancement of precision agriculture has led to the potential application of remote sensing technology for monitoring salinity in agricultural soils. Remote sensing by drones has become a highlight of the field because of its flexibility and high accuracy [
5]; however, its range is often limited by flight endurance and speed, and it is mainly suitable for small-scale monitoring. In contrast, satellite remote sensing, with its wide-area coverage and frequent revisits, is gaining attention for tracking agricultural dynamics and assisting in strategic planning. In particular, the Sentinel-2 satellites launched by the European Space Agency (ESA) offer improved temporal, spatial, and spectral resolution, providing more detailed remote sensing support for agriculture and other fields. In addition, the Landsat series, with decades of accumulated data, has irreplaceable value for analyzing long-term environmental change and land use dynamics. Integrating multi-source satellite data effectively improves the recognition accuracy of surface features [
6,
7]. A study by Wang combined data resources from unmanned aerial vehicles (UAVs) and Sentinel-2A to construct models for regions with different salt concentrations, improving the accuracy of salt inversion [
8]. Zhang’s study [
9] further verified that the fusion of multi-source satellite data enriches data dimensions and significantly reduces the uncertainty factor in sea ice thickness assessments. Similarly, the work by El-Rawy [
10,
11] revealed the effectiveness of multi-source satellite data in assessing soil conductivity and salt content across temporal and spatial scales. In summary, the integration of multi-source satellite data not only strengthens soil property monitoring but also lays a solid information foundation for the practice of smart agriculture.
The rapid progress of pedological remote sensing technology has greatly contributed to the level of accuracy in soil salinity detection. By combining multispectral imagery with ground characteristics, soil salinity can be effectively assessed and salinity indicators defined [
12,
13]. The selection of appropriate spectral indices for soil monitoring is a critical step in ensuring data accuracy in the study area, given its unique topography and ecosystem characteristics [
14]. With the help of soil texture analyses, Duan et al. aim to differentiate between different geospatial layouts of soils and provide a strategic orientation for the efficient management of soil resources [
15]. Furthermore, the integration of environmental covariates, such as topographic relief, soil type, and vegetation cover, can greatly improve the predictive effectiveness of soil salinity monitoring models, especially in arid and semi-arid regions [
16]. For example, Zhao et al. verified the efficacy of a variety of auxiliary variables in assessing soil conductivity and revealed that the inclusion of topographic elements can substantially improve the accuracy of the model in a case study in Karamay, Xinjiang [
17]. Emami et al., in turn, identified the environmental factors that dominate the distribution of soil salinity in northern Iran in order to support the planning of soil management strategies in the region [
18]. In summary, this integrated multi-source information assessment method, by linking spectral data analysis and environmental variables, greatly enhances the accuracy of the soil salinity monitoring model, enriches the information content of the decision support system, and lays a solid foundation for the scientific formulation of soil management planning and intervention [
19].
Currently, machine learning techniques are widely used in soil salinity prediction and show significant advantages over traditional methods in handling complex nonlinear relationships. Among them, ensemble learning has attracted much attention for its ability to handle high-dimensional data and improve model generalization [
20]. Several studies support the conclusion that Random Forest, a robust ensemble learning method, is widely used in soil salinity assessment and is particularly suitable for bare soil environments [
21,
22,
23]. Nevertheless, machine learning models still suffer from limited interpretability and a strong dependence on large data volumes in practical applications; their performance degrades when samples are scarce or data quality is low, and they are susceptible to noise and outliers. Machine learning methods combined with SHapley Additive exPlanations (SHAP) value analysis have been shown to significantly improve model interpretability and to clarify the contribution of each feature variable to the model [
24]. On the other hand, the neural network algorithm demonstrated more stable performance with higher accuracy on the salt prediction task [
25]. In particular, a hierarchical neural network architecture built for fine-grained classification of soil salinity levels outperformed traditional machine learning methods and improved prediction accuracy [
26]. In addition, neural networks have shown superior performance to conventional machine learning techniques in mapping soil salinity distribution [
27].
Existing studies mainly rely on a single data source, which cannot fully reflect the complexity of soil salinity variation. This study addresses this limitation by integrating data from multiple sources and accounting for the effects of environmental factors on soil salinity. Model evaluation was used to ensure the reliability of the model and the credibility of its predictions. The approach has considerable potential for soil management in arid regions and can provide decision-makers with more accurate data support. Accordingly, this study explores the feasibility of an interpretable modeling approach that combines multi-source satellite imagery (Landsat 8 and Sentinel-2) with environmental parameters to monitor soil salinization at a large scale in the farmland of northern Xinjiang. Specifically, the study has three research objectives:
- (1) To systematically assess and improve the accuracy of soil salinity prediction models by integrating multi-source satellite data and a series of environmental auxiliary variables, and to identify the characteristic variables affecting soil salinity;
- (2) To accurately assess the degree of soil salinity in arid farmland in northern Xinjiang and ensure the credibility of the results;
- (3) To apply modules with good interpretability in the modeling process to enhance the learning effectiveness of the model.
2. Materials and Methods
2.1. Overview of the Study Area
Xinjiang is situated in the northwestern part of China, as illustrated in
Figure 1. The total area of arable land in Xinjiang is 7,038,600 hectares, of which paddy fields account for 0.85 percent, irrigated land for 96.00 percent, and dry land for 3.15 percent. The climate in northern Xinjiang is temperate continental arid to semi-arid, with low average annual rainfall, usually between 100 and 200 mm, varying with topography. For example, annual rainfall is about 150 mm in the Tarim Basin and up to 200 mm in parts of the southern foothills of the Tianshan Mountains. Rainfall is mainly concentrated between July and September, which accounts for more than 70 percent of the annual total. This uneven spatial and temporal distribution of rainfall, combined with high evapotranspiration, leads to insufficient soil moisture and exacerbates soil salinization. The wet season in northern Xinjiang usually lasts from June to September, with most of the rainfall falling in July and August. During this period, average temperatures range from 25 °C to 35 °C, and relative humidity is higher than in the dry season. The main cash crop in the area is cotton. These climatic conditions make soils prone to salinization, which challenges agricultural production and complicates agricultural management. Sample sites were selected on the basis of the distribution of cotton fields in northern Xinjiang, which are located mainly in the Fifth, Sixth, Seventh, and Eighth Divisions. Combining the cotton field distribution data released by the Bureau of Agriculture with the digital elevation model (DEM) and topographic maps, we selected the main cotton planting areas of these four divisions as the base area for sample site selection.
Within each division, stratified random sampling was used to select representative sample points, and their final locations were determined through field verification and adjustment. This process ensured a sound spatial distribution of the sample points and the reliability of the data, providing solid support for the subsequent soil salinity analysis. After collection, the soil samples were stored in separate bags according to the sample sequence and then processed for the subsequent experiments, as shown in
Figure 2.
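The stratified random sampling step described above can be illustrated with a minimal sketch. It assumes that each division is treated as one stratum and that candidate locations are already available as lists; the function name `stratified_sample`, the grid-cell candidates, and the sample size per stratum are hypothetical and are not taken from the study.

```python
import random


def stratified_sample(candidates, n_per_stratum, seed=0):
    """Draw up to n_per_stratum random candidates from every stratum.

    candidates: dict mapping a stratum name to a list of candidate points.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    selected = {}
    for stratum, points in candidates.items():
        k = min(n_per_stratum, len(points))
        selected[stratum] = rng.sample(points, k)  # sampling without replacement
    return selected


# Hypothetical candidate grid cells (row, col) per division; not real locations.
candidates = {
    "Fifth Division": [(r, c) for r in range(10) for c in range(10)],
    "Sixth Division": [(r, c) for r in range(10) for c in range(10)],
    "Seventh Division": [(r, c) for r in range(10) for c in range(10)],
    "Eighth Division": [(r, c) for r in range(10) for c in range(10)],
}
picked = stratified_sample(candidates, n_per_stratum=5)
```

In practice the randomly drawn points would still be checked and adjusted in the field, as described in the text.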
2.2. Collection and Analysis of Soil Samples
In this study, soil sampling was conducted in April 2021 in the primary cotton cultivation regions of the Fifth, Sixth, Seventh, and Eighth Divisions in northern Xinjiang. Samples were taken using the five-point (plum-blossom) sampling method, with locations recorded by GPS, and a total of 1044 soil samples were collected at a depth of 0–30 cm. Each soil sample was mixed with distilled water at a 1:5 ratio and agitated in a thermostatic oscillator for 30 min so that the soluble salts dissolved fully. The mixture was then left to stand until the supernatant separated, and the electrical conductivity (EC, dS/m) of the extract was measured with a conductivity meter (Model S230, Mettler Toledo, Shanghai, China). Three replicate measurements were made for each sample and averaged. The soil salt content (SSC) was then calculated according to the established empirical formula [
28], allowing for a more detailed analysis of the soil salinity levels, which were then classified according to the criteria outlined in
Table 1 [
29].
In this equation, SSC denotes the soil salt content in g/kg and EC denotes the electrical conductivity of the soil in μS/cm.
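As a small illustration of the measurement handling described above, the sketch below averages three replicate EC readings and converts them from dS/m to μS/cm (1 dS/m = 1000 μS/cm). The function names and the example readings are hypothetical. The empirical EC-to-SSC formula is deliberately not reproduced here; it is passed in as a callable that must come from the cited calibration.

```python
from statistics import mean

US_CM_PER_DS_M = 1000.0  # unit conversion: 1 dS/m equals 1000 uS/cm


def mean_ec_us_cm(replicates_ds_m):
    """Average replicate EC readings given in dS/m and return the mean in uS/cm."""
    return mean(replicates_ds_m) * US_CM_PER_DS_M


def ssc_from_ec(ec_us_cm, empirical_formula):
    """Apply a user-supplied empirical formula mapping EC (uS/cm) to SSC (g/kg).

    The formula itself is an input: its form and coefficients are not assumed here.
    """
    return empirical_formula(ec_us_cm)


# Hypothetical triplicate readings for one sample, in dS/m.
readings = [0.50, 0.52, 0.51]
ec = mean_ec_us_cm(readings)
```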
2.3. Acquisition and Processing of Satellite Images from Sentinel-2 and Landsat 8
First, the vector maps of the study area in northern Xinjiang were merged, and the resulting shapefile was imported into the Google Earth Engine (GEE) cloud platform (
https://earthengine.google.com/ (accessed on 8 March 2024)), where it was loaded as an ‘ee.FeatureCollection’, which aggregates all the features in the shapefile. Sentinel-2 and Landsat 8 data were acquired within GEE, and clouds were removed from both data sources. For Sentinel-2, the QA60 band, which flags whether each pixel is covered by cloud, was used to mask cloudy pixels. For Landsat 8, the CLOUD_COVER property was thresholded at 60, thereby excluding images with more than 60% cloud cover. Median composites of the April 2021 images from the two satellites were then generated for the study area and downloaded, at 10 m resolution for Sentinel-2 and 30 m resolution for Landsat 8, and the spectral curves for the area were acquired in GEE. These data are presented in
Figure 3. Subsequently, the 30 m resolution Landsat 8 image was resampled to 10 m in ArcGIS software, version 10.2.
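The cloud screening and median compositing logic can be sketched in plain Python for a single pixel. This is an illustration of the idea, not the GEE script used in the study: it assumes the Sentinel-2 QA60 convention in which bit 10 flags opaque clouds and bit 11 flags cirrus, and the function names and example values are hypothetical.

```python
from statistics import median

OPAQUE_CLOUD_BIT = 1 << 10  # QA60 bit 10: opaque clouds
CIRRUS_BIT = 1 << 11        # QA60 bit 11: cirrus clouds


def is_clear(qa60):
    """Return True when neither cloud bit is set in the QA60 value."""
    return (qa60 & OPAQUE_CLOUD_BIT) == 0 and (qa60 & CIRRUS_BIT) == 0


def median_composite(observations):
    """Median of the cloud-free reflectance values observed for one pixel.

    observations: list of (reflectance, qa60) pairs from the images of one month.
    Returns None when every observation is cloudy.
    """
    clear = [value for value, qa in observations if is_clear(qa)]
    return median(clear) if clear else None


def keep_scene(cloud_cover_percent, threshold=60):
    """Scene-level filter: keep scenes whose cloud cover does not exceed the threshold."""
    return cloud_cover_percent <= threshold


# Hypothetical time series for one pixel: two of the five observations are cloudy.
pixel_series = [(0.21, 0), (0.80, 1024), (0.19, 0), (0.75, 2048), (0.23, 0)]
composite_value = median_composite(pixel_series)
```

Taking the median rather than the mean keeps a single residual bright or dark observation from dominating the composite.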
2.4. Environment Variable Selection
In order to assess soil salinity in-depth, this study builds on previous research [
30,
31,
32]. A total of 23 environmental covariates were extracted. Key terrain parameters, including elevation, slope, aspect, curvature, hillshade, terrain relief, and terrain roughness, were derived from the digital elevation model (
https://earthengine.google.com/ (accessed on 11 April 2024)). Nighttime light data were sourced from the Earth Observatory (
https://earthobservatory.nasa.gov/ (accessed on 13 April 2024)), and the spatial distribution data of China’s soil types, comprising soil type, soil parent material, clay, sand, and silt attributes, were downloaded from the Digital Earth Open Platform (
https://open.geovisearth.com/ (accessed on 15 April 2024)). Air temperature and rainfall data were obtained from the National Meteorological Science Data Center (
https://data.cma.cn/ (accessed on 16 April 2024)). Population distribution data were obtained from the Resource and Environment Science Data Platform (
https://www.resdc.cn/ (accessed on 20 April 2024)). Agricultural film data for the study area for 2021 were obtained from the Xinjiang Bureau of Statistics (
http://tjj.xinjiang.gov.cn/tjj/xjq/list_dq.shtml (accessed on 25 April 2024)). Railroad and highway density data were sourced from OpenStreetMap (
https://www.openstreetmap.org/ (accessed on 10 May 2024)). Data on nitrogen, phosphorus, potash, and compound fertilizer usage in the study area were sourced from the China Statistical Yearbook (
http://www.tjcn.org/ (accessed on 12 May 2024)), which provides detailed agricultural statistics, including regional fertilizer application data. The tasseled cap transformation indices were derived from the satellite imagery by a linear transformation of the spectral bands. A detailed account of the variables selected for this study can be found in
Table 2.
Specifically, we selected 23 environmental covariates that cover soil salinization detection, vegetation health assessment, and the analysis of topography and hydrological conditions. This selection aims to reflect the spatial distribution and influencing mechanisms of soil salinization in the cotton fields of northern Xinjiang. The covariates include the Salinity Index (Salinity Index 1–6), the Salt Index (Salt Index 1–3), and the Normalized Difference Salinity Index (NDSI). These indices combine reflectance from different remote sensing bands to capture the accumulation and distribution of salts in the soil, providing direct data for salinization monitoring. Additionally, the Intensity Index (Intensity Index 1–2) reflects salt concentration levels and thereby helps to quantify salinization intensity. Vegetation indices such as NDVI, ENDVI, EVI, EEVI, and GDVI indirectly reflect the impact of salinization on vegetation growth by evaluating vegetation cover and health; high-salinity environments typically inhibit vegetation growth, which lowers these index values. Soil indices, including the Soil Difference Index (SDI) and the Tasseled Cap indices (TCB, TCG, TCW), describe surface brightness, greenness, and wetness to reveal the influence of salt accumulation and moisture conditions on salinization. Furthermore, the Canopy Salt Response Index and the Combined Spectral Response Index integrate information from multiple spectral bands to improve the detection of salinization under complex surface conditions.
The selection of these environmental covariates was based on their high sensitivity and specificity in remote sensing monitoring, effectively capturing spectral changes, vegetation responses, and dynamic variations in topography and hydrological conditions during the salinization process. This robust selection of covariates provides a solid scientific foundation for the spatial analysis and mechanistic study of soil salinization. By comprehensively utilizing these variables, our study is able to thoroughly and accurately assess the spatial distribution and influencing mechanisms of soil salinization in the cotton fields of northern Xinjiang, ensuring the reliability and scientific validity of the research results.
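To make the band-combination idea concrete, the following sketch computes three of the indices named above from surface reflectance values. The formulas are the commonly used definitions of NDVI, NDSI, and EVI and are an assumption here; the exact formulations adopted in this study are those listed in Table 2 and may differ. The reflectance values in the example are hypothetical.

```python
def normalized_difference(a, b):
    """Generic normalized difference (a - b) / (a + b), returning 0.0 when a + b is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndsi(red, nir):
    """Normalized Difference Salinity Index, common form: (Red - NIR) / (Red + NIR)."""
    return normalized_difference(red, nir)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients (G=2.5, C1=6, C2=7.5, L=1)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Hypothetical reflectance values for a vegetated pixel.
nir, red, blue = 0.5, 0.1, 0.05
indices = {"NDVI": ndvi(nir, red), "NDSI": ndsi(red, nir), "EVI": evi(nir, red, blue)}
```

With these definitions NDSI is simply the negative of NDVI, which illustrates why salinity and vegetation indices tend to move in opposite directions over the same pixel.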
2.5. Feature Selection Methods
For salinity monitoring, we constructed feature sets through a systematic screening process that fuses spectral and environmental variables and applies four feature selection techniques. The resulting datasets were then fed into machine learning and deep learning models in order to compare, through detailed performance evaluation, how different feature selection strategies affect prediction accuracy, and to identify the most efficient modeling pathway for salinity monitoring. The four feature selection algorithms adopted in this study were the successive projections algorithm (SPA) [
40], competitive adaptive reweighted sampling (CARS) [
41], uninformative variable elimination (UVE) [
42], and random forest (RF) [
43]. SPA reduces redundancy among variables while selecting the most representative ones. CARS combines a competitive mechanism with Monte Carlo sampling and partial least squares regression to adjust the spectral variable weights dynamically and optimize the variable subset. UVE filters out uninformative, noisy variables and is efficient and direct in high-dimensional data analysis. RF ranks variables by their importance scores; variables whose scores exceed the mean importance are treated as key variables and retained for subsequent model construction and analysis.
The CARS algorithm simulates natural selection mechanisms, effectively reducing redundancy in high-dimensional data. However, its high computational cost makes it suitable primarily for applications with moderate sample sizes and stringent precision requirements. The UVE algorithm, based on Partial Least Squares Regression (PLSR), offers high stability and strong interpretability, particularly excelling in scenarios with multicollinearity, thereby enhancing both model interpretability and predictive accuracy. The SPA employs vector projection analysis to identify candidate wavelengths with the largest projection vectors, ultimately determining a combination of feature wavelengths that effectively minimizes redundancy and collinearity, thereby improving model performance. Nevertheless, the performance gains are limited due to the unsupervised nature of the feature selection process, which constrains the interpretability of the selected variables. The RF algorithm, as an ensemble learning method, is adept at handling complex nonlinear relationships and significantly enhancing predictive accuracy, albeit at the expense of substantial computational resource consumption. By integrating multiple feature selection methods, it is possible to more effectively analyze the impact of environmental covariates on soil salinization, thereby increasing the accuracy and reliability of predictive models and providing robust technical support for related research.
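The mean-importance rule used with RF can be sketched as follows. The sketch assumes that importance scores have already been computed by a fitted random forest; the function name `select_above_mean` and the scores in the example are hypothetical and are not results from this study.

```python
from statistics import mean


def select_above_mean(importances):
    """Keep the variables whose importance score exceeds the mean score.

    importances: dict mapping a variable name to its importance score.
    """
    threshold = mean(list(importances.values()))
    return [name for name, score in importances.items() if score > threshold]


# Hypothetical importance scores for illustration only.
importances = {"S1": 0.35, "NDVI": 0.30, "elevation": 0.15, "slope": 0.12, "TCW": 0.08}
key_variables = select_above_mean(importances)
```

Because the threshold is relative to the mean, the number of retained variables adapts to how concentrated the importance scores are rather than being fixed in advance.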
2.6. Modeling
To assess the effectiveness of machine learning and deep learning for salinity monitoring, this study incorporates the SHAP method into the machine learning model construction process, which enhances the explanatory power of the models, deepens understanding of their decision-making mechanisms, and improves their transparency and credibility. For deep learning, the recently released Kolmogorov-Arnold Network (KAN) framework, a deep learning tool with built-in interpretation functions, was adopted; it allows researchers to better understand and verify the underlying logic of the model predictions, supporting the rigor and practical value of the model. By combining multi-source satellite data with advanced machine learning techniques, this study is able to track and predict soil salinity more effectively, providing scientific support for soil quality improvement and crop yield increase.
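The quantity that SHAP attributes to each feature is a Shapley value. The sketch below computes exact Shapley values by brute force for a toy model with three inputs, replacing absent features with a baseline value. It is an illustration of the concept only: the SHAP library estimates these values far more efficiently, and the toy model, the baseline, and the function name `shapley_values` are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features are switched from their baseline value to their actual value in
    every possible order, and each feature's marginal contributions are averaged.
    Feasible only for a handful of features (cost grows factorially).
    """
    n = len(x)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for j in order:
            current[j] = x[j]  # reveal feature j
            output = model(current)
            totals[j] += output - previous_output
            previous_output = output
    return [t / len(orders) for t in totals]


def toy_model(v):
    """Hypothetical model: two additive terms plus one interaction term."""
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions always sum to the difference between the prediction and the baseline prediction, which is the property that makes SHAP plots additive and easy to read.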
2.6.1. Machine Learning Models
Machine learning models are widely used to tackle complex data mining problems because of their ability to learn directly from data. These algorithms excel at revealing nonlinear associations and can handle a wide range of regression and classification tasks. The models were selected on the following grounds. The two gradient boosting models, LightBoost and XGBoost, handle high-dimensional data and complex nonlinear relationships well; the former is known for its efficiency and the latter for its speed, and using both aims to improve prediction accuracy. RF reduces the risk of overfitting through ensemble learning, provides a robust assessment of feature importance, and enhances model interpretability. Extra Trees (ET) was selected for its parallel computing capability and its advantages in handling high-dimensional data, and was used to compare the performance of different tree-based models. For the soil salinity prediction task, these four models were trained: ET [
44], RF, XGBoost [
45], and LightBoost [
46].
2.6.2. Deep Learning Models
Deep learning neural networks process input data and generate output by emulating the multi-layered neuronal structure of the human brain. This structure enables the effective processing of complex, high-dimensional data, the mapping of the data to a low-dimensional space, the extraction of key information, and the optimization of decision-making. The 1D-CNN can process sequence data and capture spatial hierarchies, thus extracting the key information needed to predict soil salinity. Residual connections allow 1D-ResNet to overcome the vanishing gradient problem in training deep networks, so that deeper networks can be built, more complex features learned, and prediction accuracy and stability improved. The MLP, with its concise yet powerful nonlinear modeling capability, can capture the complex relationships in soil salinity data and serves as a baseline for comparison with the other deep learning models.
The convolutional layers of the 1D-CNN (
https://github.com/poloclub/cnn-explainer (accessed on 20 May 2024)) are capable of efficiently processing sequential information and capturing spatial hierarchies. The 1D-ResNet (
https://github.com/KaimingHe/deep-residual-networks (accessed on 26 May 2024)) is known for its intricate structure and facilitates deeper learning through residual connections. The MLP (
https://github.com/filipecalasans/mlp (accessed on 26 May 2024)) is distinguished by its straightforward architecture and robust nonlinear modeling capabilities, which enable it to capture intricate data relationships. In this study, these classic network structures are applied to soil salinity regression. However, the opaque, black-box structure of these models hinders the interpretability of deep learning and impairs the transparency of the model’s decision-making process. Improving the interpretability of deep models is therefore a topic of considerable current research interest.
2.6.3. Interpretable Deep Learning Model-KAN
The recently introduced KAN model is an alternative to the traditional MLP in the field of neural networks and offers a high degree of flexibility in parameter tuning. With regard to parameter training, KAN has a distinct advantage over the MLP, which requires retraining when its parameters are modified. KAN is named after Kolmogorov and Arnold, and its core idea derives from the Kolmogorov-Arnold representation theorem (KART). KART states that any multivariate continuous function on a bounded domain can be represented as a finite composition of continuous univariate functions and addition. In KAN, learnable spline functions replace the weight parameters of traditional neural networks. Spline functions are highly flexible and adjustable, enabling effective fitting of complex data relationships, thereby reducing approximation errors and enhancing the network’s capability to learn subtle patterns from high-dimensional data. The general formula for the KAN spline can be written in terms of B-splines:

$$\mathrm{spline}(x) = \sum_{i} c_i B_i(x)$$

Here, $\mathrm{spline}(x)$ denotes the spline function, $c_i$ are coefficients optimized during the training process, and $B_i(x)$ are B-spline basis functions defined on a grid. The grid points define the intervals on which each basis function $B_i(x)$ is active and significantly influence the shape and smoothness of the spline. During training, the shape of the spline function is adjusted by minimizing the loss function so as to best fit the training data, and the spline parameters are updated at each iteration to reduce prediction errors. KAN, whose open-source implementation is available at (
https://github.com/KindXiaoming/pykan (accessed on 31 May 2024)), enables parameter scaling through its spline-based network structure without having to go through the training process again. By introducing simplified functions and transformation steps, we are able to gain a deeper understanding of the intrinsic mechanisms of KAN and provide more precise mathematical descriptions of the learned functions. This feature makes it possible to build deep networks by stacking multiple KAN layers to cope with complex problems more efficiently, with each layer tailored to the specific goals of the task.
This pioneering strategy not only improves the model’s adaptability to various tasks, but also simplifies the path to increasing the depth of the model, providing a powerful tool for overcoming difficult challenges in machine learning practice. The unique network architecture of KAN is illustrated in
Figure 4.
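The B-spline parameterization of a single KAN edge function can be sketched in pure Python. The example evaluates a spline as a weighted sum of B-spline basis functions using the Cox-de Boor recursion; it is not the pykan implementation, and the uniform grid, the cubic degree, and the coefficient values are assumptions chosen for illustration. In a KAN, the coefficients would be the trainable parameters.

```python
def bspline_basis(i, k, knots, x):
    """Value at x of the i-th B-spline basis function of degree k (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:
        left = (x - knots[i]) / left_den * bspline_basis(i, k - 1, knots, x)
    if right_den != 0:
        right = (knots[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, knots, x)
    return left + right


def spline(x, coeffs, knots, k=3):
    """Weighted sum of B-spline basis functions: sum over i of coeffs[i] * B_i(x)."""
    return sum(c * bspline_basis(i, k, knots, x) for i, c in enumerate(coeffs))


# Assumed uniform grid; a cubic spline on 8 knots has 8 - 3 - 1 = 4 basis functions.
knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
coeffs = [2.0, 3.0, 4.0, 5.0]  # hypothetical coefficients (trainable in a KAN)
value = spline(3.5, coeffs, knots)
```

On this grid all four basis functions overlap only on the interval from 3 to 4, so that is the range over which the example spline is fully defined; refining the grid adds basis functions and lets the learned function capture finer detail.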
2.7. Data Processing Steps
First, in data preprocessing, the interquartile range (IQR) method was applied because of its strong performance in identifying outliers, which enabled outliers to be screened out of the dataset. Because removing the effect of differing feature scales has a significant effect on model accuracy, the data were then standardized with the RobustScaler method, which subtracts the median and divides by the interquartile range; this limits the influence of any remaining extreme values while retaining the relative relationships among the data. For the machine learning algorithms, grid search was used to obtain the optimal parameter configuration, and each model was built with Scikit-Learn, Python’s open-source machine learning library. For the neural networks, ten-fold cross-validation was used to ensure the generalization ability and stability of the models, and key parameters, such as the learning rate and batch size, were tuned according to the model’s performance in each round.
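The two preprocessing steps can be sketched with the standard library alone. The sketch assumes the conventional 1.5 × IQR fences for outlier screening, which the text does not state explicitly, and linearly interpolated quartiles; the scaling step mirrors what RobustScaler computes (subtract the median, divide by the IQR) but is not the Scikit-Learn implementation. The sample values are hypothetical.

```python
from statistics import median, quantiles


def quartiles(values):
    """First and third quartiles using linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed convention."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def robust_scale(values):
    """Subtract the median and divide by the IQR (assumes the IQR is non-zero)."""
    q1, q3 = quartiles(values)
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]


# Hypothetical salt content values with one extreme reading.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
cleaned = remove_outliers_iqr(raw)
scaled = robust_scale(cleaned)
```

Because the median and IQR are barely affected by extreme values, this scaling stays stable even if a few unusual samples survive the screening step.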
2.8. Model Evaluation Method
In the regression evaluation for salt prediction, we used the coefficient of determination (R², Equation (3)) and the root mean square error (RMSE, Equation (4)) to measure goodness of fit and quantify model error. For the classification of salinity levels, Accuracy (Equation (5)), Precision (Equation (6)), Recall (Equation (7)), and F1-score (Equation (8)) were used to evaluate how well the model distinguishes areas of different salinity. R² measures the model’s ability to explain the variation in the data, i.e., how well the model fits the dataset. RMSE describes the difference between the predicted and measured values and expresses the magnitude of the prediction error. In the classification task, Accuracy is the proportion of all samples that the model classifies correctly and gives a basic overview of performance. Precision is the proportion of samples predicted as positive that are actually positive, reflecting how well the model avoids false positives, while Recall is the proportion of actual positive samples that are correctly predicted as positive, reflecting the model’s ability to recognize positive samples. The F1-score combines Precision and Recall and is particularly suitable for class-imbalanced datasets, providing a more comprehensive picture of overall performance. Together, these metrics allow the overall performance of the models to be evaluated and optimized.
The variables $y_i$, $\bar{y}$, and $\hat{y}_i$ represent the measured, average, and predicted values of the salt content, respectively. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) denote the numbers of true positive, true negative, false positive, and false negative examples, respectively.
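The metrics can also be computed directly from their definitions. The following is a sketch in plain Python with hypothetical function names and toy data. It treats classification as binary with a single positive class, which is an assumption, since a multi-class salinity grading would compute Precision, Recall, and F1 per class and then average them.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def confusion_counts(labels_true, labels_pred, positive=1):
    """Count TP, TN, FP, FN for one class treated as positive."""
    tp = tn = fp = fn = 0
    for t, p in zip(labels_true, labels_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(labels_true, labels_pred, positive=1):
    """Accuracy, Precision, Recall, and F1; ratios with a zero denominator become 0."""
    tp, tn, fp, fn = confusion_counts(labels_true, labels_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data, for illustration only.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
reg_r2 = r2_score(measured, predicted)
reg_rmse = rmse(measured, predicted)
scores = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```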
2.9. Flow Chart
The principal workflow of this study is illustrated in Figure 5. First, soil samples and salinity data were collected from the designated study area. Second, images from the two satellites were acquired. Third, the terrain parameters and other feature variables were derived. Finally, the feature dataset was modeled, and the interpretability of the models was enhanced through visualization techniques.
4. Discussion
4.1. Application of Sentinel-2 and Landsat 8 Remote Sensing Data to Monitor Soil Salinization in the Farmland of Northern Xinjiang
The prediction and monitoring of soil salinization provide crucial decision-making support for salinization management in the farmland of northern Xinjiang. Numerous studies have demonstrated that integrating multi-source remote sensing data enhances the extraction of subtle information, and that the complementary nature of the data can improve the spatial and temporal resolution of remote sensing imagery [55,56]. In this case, the salinity indices performed better than the other spectral indices, with the S1 salinity index having the greatest impact [57]. Because soil salinity, unlike surface indices, cannot be observed directly, satellite-based Earth observation is a powerful detection tool. Causal analysis indicates that, owing to the arid and semi-arid climate of northern Xinjiang, soil salinity content and topographic features are significantly affected by high temperatures and low rainfall. In cultivated farmland, however, farmers irrigate or apply artificial rain enhancement under unfavorable natural conditions, such as extreme dryness or scorching heat, to alleviate soil crusting and salinization and to safeguard normal crop growth [51]. The contributions of compound fertilizer and nitrogen fertilizer were particularly significant in the salinity modeling process, which emphasizes the importance of fertility as a key environmental factor in predicting soil salinity and is in line with the finding of Omuto et al. that environmental factors are influential in revealing soil properties [58]. Changes in compound fertilizer application also increasingly affect soil functioning and soil quality [39]. Given the vastness of the study area, its topographic variability, and geographical differences in rainfall, temperature, and fertilizer application methods, model predictions face a high degree of noise interference, and the current prediction accuracy reaches a maximum of 0.54. Despite these challenges, monitoring remains of great practical value in guiding regional salinity management strategies.
4.2. The Influence of Characteristic Variable Selection on Monitoring Soil Salinization in the Farmland of Northern Xinjiang
Given the extensive geographical scope and distribution of the soil regions, and because soil salinity is not directly discernible at the surface, additional auxiliary variables are needed to establish a correlation with the spectral indices of remote sensing images and thereby improve the accuracy and reliability of prediction. Combining spectral bands through mathematical formulas into vegetation indices, salinity indices, and similar features makes better use of the information in the spectral data. In addition, terrain indices derived from the digital elevation model, together with the brightness, greenness, and humidity indices and the texture characteristics of the remote sensing images, effectively reflect surface characteristics and potential soil properties. In this study, the strategy of multi-model and multi-feature fusion improved the applicability and prediction accuracy of the models, providing theoretical and methodological support for the effective monitoring of soil salinization. These integrated indices jointly consider soil properties, topographic features, vegetation information, and transformation features, and characterize the shallow topsoil layer from multiple perspectives, providing an important analytical framework for an in-depth understanding of soil conditions.
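As an illustration of how a spectral index is built from band values, the sketch below computes the widely used normalized difference vegetation index, NDVI = (NIR − Red) / (NIR + Red). NDVI is chosen here only as a familiar example; the specific indices used in the study, including the salinity indices, are not reproduced, and the reflectance values are hypothetical.

```python
def normalized_difference(band_a, band_b):
    """Generic normalized difference (a - b) / (a + b); returns 0.0 when a + b is 0."""
    total = band_a + band_b
    if total == 0:
        return 0.0
    return (band_a - band_b) / total


def ndvi(nir, red):
    """NDVI from near-infrared and red surface reflectance."""
    return normalized_difference(nir, red)


# Hypothetical per-pixel reflectance pairs (NIR, Red), for illustration only.
pixels = [(0.5, 0.1), (0.3, 0.3), (0.0, 0.0)]
ndvi_values = [ndvi(nir, red) for nir, red in pixels]
```

The same normalized-difference helper can be reused for other band pairs, which is why it is kept separate from `ndvi`.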
4.3. The Great Potential Shown by Neural Network Techniques in Soil Salinity Monitoring in Cotton Growing Areas in Northern Xinjiang
Neural network technology is recognized as a powerful data-processing tool because it efficiently extracts complex multi-dimensional information, and it has demonstrated exceptional performance in pattern recognition, image analysis, and language understanding, especially for nonlinear and high-dimensional data [59,60]. Unlike traditional machine learning methods [61], deep learning can, through its layered architecture, capture complex associations and interdependencies between features and targets [62], which gives it a clear advantage in revealing implicit relationships in data. For highly complex modeling problems, such as salt prediction, neural network models often outperform traditional machine learning models, mainly because their multilevel nonlinear structures learn abstract representations of the data that support more accurate predictions. This makes neural networks particularly valuable for analyzing environmental data with complex interactions. ResNet, with its residual connections, effectively mitigates vanishing and exploding gradients and demonstrates strong noise immunity, which is why it performs well in soil attribute regression. Large-scale soil salinity monitoring faces significant challenges due to its extensive spatial coverage and the pervasive noise in remotely sensed data. Deep neural networks, with their powerful feature extraction capabilities, offer the potential to overcome these challenges. However, training deep neural networks typically requires massive datasets to mitigate noise and accurately capture the complex relationships governing soil salinity variability. With limited data, simpler architectures with fewer layers, such as MLP or KAN, reduce overfitting risk and computational cost. Therefore, augmenting soil salinity datasets with rich auxiliary information, such as environmental variables or remotely sensed indices, is crucial for improving the robustness and reliability of both shallow and deep neural network models in soil salinity monitoring.
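The residual connection mentioned above can be illustrated with a minimal forward pass in plain Python. The block computes `ReLU(x + F(x))`, where `F` is a small two-layer transformation. The fully connected form, the layer sizes, and the weights are assumptions for illustration and do not describe the ResNet configuration used in the study.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights, bias):
    """Dense layer: one output per weight row, plus bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Compute ReLU(x + F(x)) with F(x) = W2 * ReLU(W1 * x + b1) + b2."""
    hidden = relu(linear(x, w1, b1))
    transformed = linear(hidden, w2, b2)
    # The shortcut adds the input back, so the block can fall back to the identity.
    return relu([xi + ti for xi, ti in zip(x, transformed)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]
x = [0.5, 2.0]

passthrough = residual_block(x, zeros, no_bias, zeros, no_bias)
doubled = residual_block(x, identity, no_bias, identity, no_bias)
```

With all weights at zero the block returns its (non-negative) input unchanged. This identity path is what lets gradients flow through many stacked blocks without vanishing.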
KAN, an emerging neural network model intended as an alternative to the traditional multilayer perceptron (MLP), has shown excellent results in theoretical validation and mathematical proofs of principle. In the soil salinity monitoring application considered here, it not only exceeded the conventional neural networks in accuracy but also offered enhanced interpretability. The study therefore suggests that adopting the KAN model is an efficient strategy for satellite-based soil salinity monitoring, and that the model plays a key role in predicting soil salinity trends owing to its high accuracy and robust data-handling capabilities. Overall, the neural network techniques proved effective in monitoring both soil salinity and salinity severity.
Figure 10 illustrates the distribution of the KAN layer spline weights in each channel. The horizontal axis indexes the input features of the dataset, and the color scale (from blue to yellow) reflects the magnitude of the weights, with blue indicating smaller weights and yellow indicating larger ones. This color coding helps the reader quickly identify the most heavily weighted regions in each channel, which in turn provides insight into how the input features are transformed across channels.
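To make the idea of per-edge spline weights concrete, the sketch below implements a heavily simplified KAN-style layer in plain Python. Each connection from input `i` to output `j` has its own learnable one-dimensional function, expressed as a weighted sum of piecewise-linear basis functions on a fixed grid, and each output sums its incoming edge functions. The piecewise-linear basis, the grid, and the names are assumptions for illustration. Published KAN formulations typically use higher-order B-splines plus a base activation, both omitted here.

```python
def hat_basis(x, grid):
    """Piecewise-linear basis values of x on a sorted grid (all 0 outside the grid)."""
    values = []
    for k, center in enumerate(grid):
        left = grid[k - 1] if k > 0 else None
        right = grid[k + 1] if k < len(grid) - 1 else None
        if x == center:
            value = 1.0
        elif left is not None and left < x < center:
            value = (x - left) / (center - left)
        elif right is not None and center < x < right:
            value = (right - x) / (right - center)
        else:
            value = 0.0
        values.append(value)
    return values


def kan_layer(x, coeffs, grid):
    """coeffs[j][i][k] is the weight of basis k on the edge from input i to output j."""
    basis = [hat_basis(xi, grid) for xi in x]
    outputs = []
    for edge_weights in coeffs:
        total = 0.0
        for i, weights in enumerate(edge_weights):
            # Edge function phi_{j,i}(x_i): weighted sum of basis values.
            total += sum(w * b for w, b in zip(weights, basis[i]))
        outputs.append(total)
    return outputs


grid = [-1.0, 0.0, 1.0]
# With weights [1, 0, 1] an edge function equals |x| on [-1, 1].
abs_edge = [1.0, 0.0, 1.0]
coeffs = [[abs_edge, abs_edge]]            # one output, two inputs
layer_out = kan_layer([0.5, -0.25], coeffs, grid)
```

In this layout, a heat map of `coeffs[j]` over inputs and basis positions is one plausible way to inspect which features a given output channel weights most strongly.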
4.4. Analysis and Solutions for Soil Salinity Prediction: Regression and Classification Insights
Regression prediction of soil salinity makes the complementary nature of multi-source satellite data evident. Among the four deep learning models analyzed, the original bands of Landsat 8 and Sentinel-2 each demonstrated their strengths. Specifically, band 17 of Landsat 8 showed outstanding performance in the CNN and KAN models, while bands 12 and 1 of Sentinel-2 were significant contributors in the MLP and ResNet models. Furthermore, in all machine learning models, the salinity index S1 of Landsat 8 was the most prominent feature. Beyond the original bands, the most critical features in the deep learning models included nitrogen fertilizer, compound fertilizer, and temperature. These findings underscore the significant impact of fertilizers on soil health, particularly the water-soluble salts in nitrogen fertilizer, whose improper use can lead to salt accumulation in the soil and thereby damage soil structure and function. In arid regions, high temperatures intensify the evaporation of soil moisture, leaving salts in the surface layer and accelerating soil salinization. Given the gradual increase in soil salinization in the farmland of northern Xinjiang, the key to reducing the risk of salinization is to optimize fertilizer application strategies, apply irrigation techniques rationally, and use remote sensing for real-time decision support. Optimizing the fertilization strategy involves the scientific application of organic and chemical fertilizers to avoid over-fertilization, which leads to salt accumulation. Rational irrigation requires precise water application according to soil and crop water requirements to reduce salt accumulation in the soil. Remote sensing is used to monitor salinity and provide data to support decision-making. In summary, these comprehensive measures can effectively reduce the risk of soil salinization in the farmland of northern Xinjiang and help ensure sustainable agricultural development.
4.5. Satellite Remote Sensing for Monitoring Soil Salinity Enhances Soil Salinization Understanding
Given the special geographic location and climatic environment of northern Xinjiang, the degree of salinization has improved significantly under long-term farming management. Reasonable irrigation measures can effectively regulate salinity in farmland, and optimizing the existing irrigation system supports the sustainable and healthy development of the soil and, in turn, increases grain yield. On this basis, a reasonable fertilization strategy is proposed according to the distribution of salts in farmland, reducing the application of chemical fertilizers and protecting the farmland ecological environment. This not only secures grain yield but also improves grain quality. For high-salinity areas, developing salt-tolerant crops is proposed as a goal, while crop rotation is adopted in low-salinity areas, so that alternating cash crops and salt-tolerant crops restores soil health and supports sustainable agriculture. Irrational fertilization and irrigation, together with the groundwater burial depth and the accumulation of salt in the water body, can cause excessive soil salinity. If the water is not drained in time, salt rises with the evaporating water and its accumulation in the soil becomes more serious. Secondary soil salinization is an important cause of the decline in soil permeability and the deterioration of soil properties and function, and it seriously affects agricultural production. To protect the health of the land and achieve sustainable agricultural development, corresponding technical measures must be taken. Salinity distribution maps can effectively alert agricultural growers and help prevent secondary salinization.
4.6. Summary and Outlook for Multi-Source Satellite Remote Sensing
This study uses Sentinel-2 and Landsat 8 satellite data to monitor and analyze soil salinization, and the complementary strengths of the two sources provide high-quality surface information. Sentinel-2’s high spatial resolution (10–20 m) and high temporal resolution (5-day revisit) suit small-scale, high-precision monitoring, and its multispectral bands effectively capture the spectral characteristics of surface salinity; however, it is susceptible to cloud cover and complex terrain, and it generates large datasets. Landsat 8’s moderate spatial resolution (30 m) and 16-day revisit cycle suit large-scale monitoring, and its long-term data archive facilitates time-series analysis and trend studies; however, its lower temporal resolution limits real-time monitoring of rapidly changing areas, and its coarser spatial resolution hinders the detection of subtle salinization features in small regions. Integrating the two datasets allowed this study to achieve high-precision, large-scale soil salinization monitoring. In the future, multi-source data fusion techniques and advanced algorithms, such as deep learning and spectral unmixing, are expected to further enhance the spatiotemporal resolution, accuracy, and analytical capability of monitoring, enabling more comprehensive and precise salinization monitoring and management.
5. Conclusions
This study integrates various remote sensing data resources, including three-dimensional spectroscopy, terrain features, and vegetation coverage information, and adopts four different attribute selection strategies to build predictive models. Among the collected soil samples, non-salinized soil accounts for 65.7%, while the remaining 34.3% is mainly slightly saline soil, with a negligible proportion of extremely saline soil at only 0.3%. The modeling techniques cover deep neural networks, such as ResNet and KAN, and classic machine learning algorithms. A systematic comparison of different data combinations and model performance reveals that the compound fertilizer and nitrogen fertilizer variables are important features, highlighting the inherent relationship between environmental factors and soil salinity content. In terms of model performance, ResNet achieves the highest accuracy, 0.54, in quantitatively predicting soil salinity in the cotton planting area of northern Xinjiang, while the KAN model is effective in salinity classification, with an accuracy of up to 0.75. Furthermore, this study maps the distribution of soil salinity in northern Xinjiang, clearly indicating heavily saline areas and low-salinity safe zones. The western region exhibits saline characteristics, while the central and eastern regions remain in a benign low-salinity state. These findings provide decision support for improving soil environments and formulating rational crop layout strategies.