Towards Robust Smart Data Driven Soil Erodibility Index Prediction Under Different Scenarios
To cite this article: Ataollah Shirzadi, Himan Shahabi, Kamal Nabiollahi, Ruhollah Taghizadeh-
Mehrjardi, Ivan Lizaga, John J. Clague, Sushant K. Singh, Fariba Golmohamadi & Anuar Ahmad
(2022): Towards robust smart data-driven soil erodibility index prediction under different scenarios,
Geocarto International, DOI: 10.1080/10106049.2022.2076918
1. Introduction
Soil erosion is a major cause of damage to agricultural lands, with severe environmental,
economic, and social consequences (Wang et al. 2016). Erosion decreases soil fertility and
productivity, and poses a major threat to subsistence farmers in many developing countries
(Chauchard et al. 2007). It also has negative downstream effects, for example sedimentation
in reservoirs, aggradation of stream channels, and water pollution (Diwediga
et al. 2018). To provide an adequate supply of food for a rapidly growing population,
humans have converted natural landscapes into agricultural land, leading to soil and eco-
system changes on a global scale (Foley et al. 2005; Chauchard et al. 2007; Li et al. 2007).
Over past decades, continuous conventional tillage practices and deforestation have
accelerated soil erosion (Bruce et al. 1999; Lal 2001; Pimentel and Burgess 2013). This
problem remains the greatest threat to soil health, soil ecosystem services, and cropland
productivity in many countries despite a century of research on the topic (Pimentel 2006;
Pennock 2019). Recent research, however, has shown that changes in agricultural manage-
ment from conventional to conservation tillage practices, along with natural revegetation
following abandonment of cultivated land, increase soil organic carbon and total nitrogen
stocks and may reduce soil erosion (Lizaga et al. 2019).
Areas susceptible to soil erosion must be identified to properly manage croplands and,
more broadly, to understand landscape evolution. Documenting these areas using conven-
tional methods (e.g. physically based and empirical models) is time-consuming and
expensive (Alkharabsheh et al. 2013). However, geographic information system (GIS) and
multi-temporal remote sensing technologies can provide useful, cost-effective information
on environmental problems, including soil erosion (Wang et al. 2003). Nevertheless, soil
erosion can only be effectively reduced when the processes responsible for the problem
are understood (Wang et al. 2016).
As observational data have increased and soil erosion has become better understood,
researchers have identified possible erosion drivers, notably climate, soil properties, vege-
tation, and topography, and have developed empirical equations to forecast soil erosion.
Notable among them is the Universal Soil Loss Equation (USLE), which includes a soil
erodibility factor (K) that is closely related to soil physical and chemical properties
(Wischmeier and Smith 1949). K is a measure of the integrated, average, total annual soil
and soil profile losses due to erosion and other hydrological processes (Bonilla and
Johnson 2012).
Accurate field measurement of K requires complex and expensive experiments and
long-term observations in croplands (Kim et al. 2005); consequently, attempts have been
made to estimate it from easily obtained soil properties (Wischmeier and Smith 1949).
The simplifying equations, however, have been derived from North American soil data-
bases, which might render them unsuitable for soils in other geographical areas. Only a
few studies have focused on estimating K outside the United States (Shabani et al. 2014;
Wang et al. 2016), and none has used machine learning algorithms to determine the vari-
ables that should be used in these estimations. Recently developed machine learning tech-
niques offer an opportunity to establish local or regional soil erodibility indexes due to
their ability to solve complex environmental problems (Rodriguez-Galiano et al. 2014;
Barzegar et al. 2018; J Wang et al. 2019).
Machine learning algorithms (MLAs) detect meaningful patterns in data, have a wide
range of applications, and form one of the fastest growing fields of computer science
(Osisanwo et al. 2017). One of the standard uses of MLAs is classification, in which a
learner maps a vector into one of several classes by looking at several input-output exam-
ples of the function (Osisanwo et al. 2017). Examples of widely used classification
machine learning algorithms are logistic regression, artificial neural networks, support vector
machines, naïve Bayes, Bayes net, Bayesian logistic regression, alternating decision tree,
naïve Bayes tree, logistic model tree, and random forest. These algorithms have been
applied to a wide variety of environmental and natural hazards problems, for example
flooding (Mohammadi et al. 2020; Wang et al. 2020), rockfalls (Shirzadi et al. 2012), snow
avalanches (Mosavi et al. 2020), sinkhole formation (Taheri et al. 2019), and gully erosion
(Lei et al. 2020; Bouslihim et al. 2021).
Another standard task performed by MLAs is regression and prediction. In this case, a
target value (dependent variable) is predicted based on regression of a series of predictors
(independent variables). Regression algorithms offer the advantages of: (1) being relatively
easy to implement; (2) requiring less computational power than classification algorithms
such as genetic algorithms, neural networks, and support vector machines; (3) providing
satisfactory prediction; and (4) increasing data availability through smart metering (Fumo
and Biswas 2015). Regression algorithms have been widely used in soil science studies
(Igwe and Egbueri 2018; Nebeokike et al. 2020; Egbueri and Igwe 2021). Examples include
studies of soil compressibility (Azzouz et al. 1976; Lav and Ansal 2001); soil chemical
properties (Udelhoven et al. 2003); stability analysis of soil slopes (Kang et al. 2015;
Metya et al. 2017); soil compaction (Canillas and Salokhe 2001); soil organic carbon and
organic matter (You-lu et al. 2008); soil variability (Hengl et al. 2004); soil quality (Kettler
et al. 2001); soil texture (Ließ et al. 2012); heavy metals in soils (Covelo et al. 2008; Yang
et al. 2020); hydraulic conductivity of sandy soil (Elbisy 2015); soil shear strength (Nhu
et al. 2020); eolian erodibility of soil (Kouchami-Sardoo et al. 2020); soil erodibility
factors in USLE, RUSLE2, EPIC, and Dg models (Wang et al. 2013); and soil erodibility
in laboratory experiments (Ostovari et al. 2018). However, there are few published studies
that focus on predictions of soil erodibility using machine learning regression algorithms.
An exception is Ostovari et al. (2016), who predicted soil erodibility based on measure-
ments of soil particle size, soil CaCO3 and organic matter, permeability, and wet aggregate
stability in 40 erosion plots in calcareous soils in Iran. They analysed the data using sev-
eral regression models, including multiple linear regression, the Mamdani fuzzy inference
system, and an artificial neural network (ANN), and concluded that the ANN model,
based on its highest R2, lowest RMSE, and lowest ME, provided the best estimates of K.
At present, there is no guideline or standard framework for selecting the best machine
learning model to predict K. In this paper, we employ five well-known benchmark
machine learning techniques, namely Random Forest (RF), M5P, Reduced Error Pruning Tree
(REPTree), Gaussian Processes (GP), and Pace Regression (PR), in a study of soil erodibility
in Kurdistan Province in western Iran. These algorithms are robust and have
previously been tested and verified in regression applications (e.g. Singh et al. 2017; Sihag et al.
2019; Karballaeezadeh et al. 2020; Pham et al. 2021).
RF offers the advantage of being a non-parametric statistical method and can be used
to manage non-linear relationships. Of the machine learning algorithms used in this
study, it is best able to deal with complex relationships (Breiman 2001; Mutanga et al.
2012). It can also yield adequate results with a limited dataset (Hengl et al. 2018). On the
other hand, Pham et al. (2021) have argued that the M5P and GP algorithms can be used
successfully with few user-defined parameters; these algorithms can also provide
mathematical equations that are easy and convenient to implement (Yuan et al. 2008;
Deepa et al. 2010; Sihag et al. 2017). As a decision tree algorithm, REPTree is a simple
algorithm to use. It can easily be applied to a training dataset, and if the output is large, it
can diminish the complexity of the tree structure (Mohamed et al. 2012). However, to our
knowledge, the potential of these algorithms for predicting soil erodibility has not been
investigated, which motivated the present study. In this paper, we use as
input the most important variables controlling K and determine the best algorithm for
predicting soil erodibility.
Figure 1. Location of the study area and soil samples in Kurdistan Province, Iran.
Most of the study area is cropland (approximately 88%; mainly wheat and alfalfa); the
remainder is rangeland. Some of the farmers in the area rent land, and landlords com-
monly put much pressure on the soil through conventional tillage operations and the
overuse of chemical fertilizers to achieve maximum crop yields. These practices facilitate
soil erosion.
In this study, we collected 99 soil samples from all terrain units and soil types
(0–30 cm depth). The samples were stored in plastic bags and transported to the soil
laboratory in the Department of Pedology, College of Agriculture, University of
Kurdistan, Iran. In the laboratory, we removed plant roots, air-dried the samples, and
screened them through a 2-mm sieve prior to further analysis.
$$K = \frac{2.1 \times 10^{-4}\, M^{1.14}\,(12 - OM) + 3.25\,(S - 2) + 2.5\,(P - 3)}{100} \tag{1}$$
where K is soil erodibility (t ha MJ⁻¹ mm⁻¹), M is the product of the percent of silt + very
fine sand, OM is the percent soil organic matter, S is a soil structure code, and P is a
soil permeability code. Soil erodibility is affected by a wide variety of soil properties,
including soil texture, structure, permeability, bulk density, aggregates, organic matter,
and chemical constituents, which are briefly described below.
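Equation 1 can be evaluated directly once M, OM, S, and P are known. The following is a minimal Python sketch, assuming the standard nomograph form of the equation with the structure term written as (S - 2); the function name `usle_k` and the example inputs are hypothetical, and M is passed in as an already computed value rather than derived from texture fractions.

```python
def usle_k(m, om, s, p):
    """Soil erodibility K from Eq. 1.

    m  : particle-size parameter M (assumed to be computed beforehand)
    om : soil organic matter (%)
    s  : soil structure code
    p  : soil permeability code
    """
    texture_term = 2.1e-4 * m ** 1.14 * (12.0 - om)
    structure_term = 3.25 * (s - 2)
    permeability_term = 2.5 * (p - 3)
    return (texture_term + structure_term + permeability_term) / 100.0


# Hypothetical inputs, for illustration only.
k_example = usle_k(m=3000.0, om=2.0, s=3, p=4)
print(f"K = {k_example:.4f}")
```

In this form, each unit increase in the structure code adds 0.0325 to K and each unit increase in the permeability code adds 0.025, whereas a higher organic matter content lowers K.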
2.3.5. pH
Soil pH is also a chemical property that may be related to soil erodibility.
Alkaline soils (pH > 8.5) with high Na typically have low infiltration capacities. In con-
trast, low-pH soils commonly have high Fe and Al, high infiltration capacities, and stable
aggregates. Soil pH was measured in a saturated paste using a pH electrode
(McLean 1983).
2.3.6. CaCO3
Calcium is an important cation in soil aggregate stability and infiltration, and conse-
quently can affect soil erodibility (Vaezi et al. 2008). We measured CaCO3 content as the
total neutralizing value by a volumetric method (Sparks et al. 1996).
2.3.9. Infiltration
Infiltration (n) is the movement of water into and through a soil. Runoff occurs when
rainfall intensity exceeds soil infiltration capacity. Generally, soils with low infiltration
capacities have higher runoff and soil loss during high-intensity rainfall events. We calcu-
lated soil infiltration based on the final infiltration rate in the field using a double-ring
infiltrometer (Scholten 1997).
where x is the input vector, N is the number of trees, and T is the collection of tree pre-
dictors. Three common parameters need to be optimized in random forest (Probst et al.
2019): (1) ntree, the number of regression trees grown based on a bootstrap sample of the
training data; (2) mtry, the number of different features tested at each node; and (3) node-
size, the minimal size of the terminal nodes of the trees.
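The averaging over N tree predictors described above can be sketched as follows. This is an illustration only, not the implementation used in the study: `fit_stub_tree` is a hypothetical stand-in that predicts the mean of its bootstrap sample instead of growing a real regression tree, so mtry and nodesize are not exercised, and the four data rows are made up.

```python
import random
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def fit_stub_tree(sample):
    """Hypothetical stand-in for a regression tree: always predicts the sample mean."""
    level = mean(k for _, k in sample)
    return lambda x: level


def fit_forest(rows, ntree, seed=0):
    """Grow ntree predictors, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stub_tree(bootstrap_sample(rows, rng)) for _ in range(ntree)]


def forest_predict(trees, x):
    """Average the N tree predictions for the input vector x."""
    return sum(tree(x) for tree in trees) / len(trees)


# Made-up (clay %, silt %) -> K rows, for illustration only.
rows = [((30.0, 40.0), 0.025), ((20.0, 55.0), 0.038),
        ((45.0, 30.0), 0.018), ((15.0, 60.0), 0.044)]
trees = fit_forest(rows, ntree=50)
k_hat = forest_predict(trees, (25.0, 50.0))
print(f"forest prediction: {k_hat:.4f}")
```

In a full random forest, each stub would be replaced by a regression tree that tests mtry candidate features at every node and stops splitting when a node holds nodesize samples.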
3.2. M5P
The M5P tree algorithm, developed by Wang et al. (1997) and described in detail by
Witten et al. (2005), implements the model-tree inducer based on the M5 tree (Quinlan
1992). The M5P algorithm is similar to common regression tree methods (Breiman et al.
1984), but the terminal nodes of the model trees are linear regression models (Kuhn and
Johnson 2013) instead of fixed average values. The algorithm involves two steps, tree
growth and pruning. The trees are grown with an exhaustive and iterative search of the
training data. The training data are divided into subsets using a decision structure (node-
splitting) strategy. At this point, the algorithm calculates the standard deviation (sd) of
the observed desired values (soil erodibility index) that reach the node (S) and treats that
value as a measure of the error at the node. Then, each feature (e.g. clay and sand contents)
at that node is tested by calculating the expected reduction in error over the subsets (Si)
produced by the split. Mathematically, it is based on the standard deviation reduction (SDR)
equation and calculated as:
$$SDR = sd(S) - \sum_{i} \frac{|S_i|}{|S|}\, sd(S_i) \tag{4}$$
where S is the set of data that reaches the node; Si is a subset of examples corresponding
to the ith outcome of the specific set; and sd is the standard deviation. The M5P algo-
rithm evaluates all possible splits and chooses one that maximizes the expected error
reduction. Other steps in growing a tree are simplifications of the linear models, pruning,
and a smoothing process that is more complex than the one used in M5 (Quinlan 1992).
The algorithm is described in detail by Witten et al. (2005).
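The split selection based on Equation 4 can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the M5P implementation: it uses the population standard deviation, considers binary splits on a single numeric feature, and the helper names `sdr` and `best_threshold` and the toy values are hypothetical.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction: sd(S) - sum(|Si|/|S| * sd(Si))."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_threshold(x, y):
    """Evaluate every split 'x <= t' and return (t, SDR) with the largest SDR."""
    best = (None, float("-inf"))
    for t in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = sdr(y, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data: one feature (e.g. a texture fraction) and a target value.
feature = [10, 20, 30, 40]
target = [1.0, 1.0, 5.0, 5.0]
threshold, gain = best_threshold(feature, target)
print(threshold, gain)
```

For these toy values, the split at 20 separates the two groups of identical targets, so both subsets have zero standard deviation and the reduction equals the standard deviation of the parent node.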
3.3. REPTree
Reduced Error Pruning Tree (REPTree), first developed by Quinlan (1987), is a fast deci-
sion/regression tree learner. The algorithm builds a decision/regression tree using infor-
mation gain and entropy (i.e. impurity metric) as the splitting criteria and reduced-error
pruning. A decision/regression tree splits the nodes (root node and decision nodes) on all
features (e.g. clay and sand contents) and then selects the split with the most homoge-
neous sub-nodes containing examples of similar values (Breiman et al. 1984). Entropy is
used by the REPTree algorithm to calculate the homogeneity of a sample; if the sample is
fully homogeneous, the entropy is zero (Witten et al. 2005). Mathematically, splitting the
regression tree with the REPTree algorithm is based on the highest information gain ratio
value, as expressed by:
$$\text{Information gain ratio} = \frac{E(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, E(S_i)}{-\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}} \tag{5}$$
where E is entropy, and S and Si denote, respectively, the training dataset and its subset.
The reduced error pruning method, which is the simplest and most understandable
method in decision tree pruning, is used in REPTree. It decreases the complexity of the
regression/decision tree model and the error arising from variance. It also reduces the
over-fitting problem and increases the interpretability of the model (Khosravi et al. 2018).
The REPTree algorithm is commonly combined with bagging to create multiple trees in
different iterations in order to select the best one from all generated trees (Pham
et al. 2019).
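The information gain ratio in Equation 5 can be computed as in the sketch below. For illustration, it assumes discrete class labels standing in for binned soil erodibility values; the labels and the function names are hypothetical and do not come from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy E(S) of a list of class labels; zero for a pure sample."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(parent, subsets):
    """Information gain of a split divided by its split information (Eq. 5)."""
    n = len(parent)
    weights = [len(s) / n for s in subsets]
    gain = entropy(parent) - sum(w * entropy(s) for w, s in zip(weights, subsets))
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical binned erodibility classes before and after a candidate split.
parent = ["low", "low", "high", "high"]
pure_split = [["low", "low"], ["high", "high"]]
mixed_split = [["low", "high"], ["low", "high"]]
print(gain_ratio(parent, pure_split), gain_ratio(parent, mixed_split))
```

A split that yields fully homogeneous subsets removes all of the parent entropy, whereas a split that leaves each subset as mixed as the parent yields no gain.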
$$k(x, x') = \frac{(x^{T} x' + c)^{d}}{\sqrt{(x^{T} x + c)^{d}\,(x'^{T} x' + c)^{d}}} \tag{7}$$
where k(x, x′) denotes a normalized polykernel function, x is the input vector (e.g. clay
and sand contents), c > 0 is a free parameter, and d is the degree of the polynomial. The
hyper-parameters of the kernel functions are determined by maximizing the log-likelihood
of the training data (Kang et al. 2015).
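A normalized polynomial kernel can be sketched in a few lines of Python. The sketch assumes the usual normalization, in which the polynomial kernel value is divided by the square root of the product of the two self-kernel values; the function names, default parameters, and input vectors are hypothetical.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x.y + c)^d with free parameter c > 0 and degree d."""
    return (dot(x, y) + c) ** d


def normalized_poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel scaled so that k(x, x) = 1 for every x."""
    return poly_kernel(x, y, c, d) / sqrt(poly_kernel(x, x, c, d) * poly_kernel(y, y, c, d))


# Hypothetical scaled inputs (e.g. clay and sand fractions).
a = [0.30, 0.45]
b = [0.20, 0.60]
print(normalized_poly_kernel(a, b))
```

With this normalization the kernel is symmetric and never exceeds one, which keeps the covariance entries of the Gaussian process on a common scale.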
where $\bar{X}_1$ and $\bar{X}_2$ are the sample means of dataset groups 1 and 2, $s_1$ and $s_2$ are the
variances of the variables of dataset groups 1 and 2, c.c is the correlation coefficient between the
two datasets, and n is the number of datasets (Kim 2015).
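The quantities defined above are those needed for a paired-sample t statistic. The sketch below uses a standard textbook form, in which the variance of the paired differences is obtained from the two variances and the correlation coefficient; this form is an assumption and may differ in notation from the equation used in the study. The data are toy values.

```python
from math import sqrt
from statistics import mean, variance


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def paired_t(a, b):
    """Paired t statistic from the means, variances, and correlation of a and b."""
    n = len(a)
    v1, v2 = variance(a), variance(b)
    r = pearson(a, b)
    var_diff = v1 + v2 - 2.0 * r * sqrt(v1 * v2)  # variance of the paired differences
    return (mean(a) - mean(b)) / sqrt(var_diff / n)


# Toy observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(paired_t(observed, predicted))
```

The resulting statistic is compared with the two-tailed critical value of the t distribution with n - 1 degrees of freedom.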
Table 4. Pearson correlation coefficients (PCC) and probability values between the soil erodibility index and the independent variables.
Factor    PCC      p-value
Clay      0.369    0.000
Silt      0.540    0.000
Sand      0.002    0.985
OM        0.334    0.001
CEC       0.412    0.000
Bd        0.056    0.584
n         0.067    0.509
LL        0.463    0.000
pH        0.014    0.888
CaCO3     0.349    0.000
EC        0.101    0.318
MWD       0.055    0.587
Fsand     0.326    0.001
maximum VIF and minimum TOL values are, respectively, 2.971 and 0.337 (clay vari-
able), confirming that there is no multi-collinearity among the independent variables. In
other words, all variables have a positive role in soil erodibility prediction in the
study area.
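Tolerance (TOL) and the variance inflation factor (VIF) are reciprocals of each other, with TOL equal to one minus the R² obtained by regressing one predictor on all the others. The short sketch below illustrates that relationship using the values reported for the clay variable; the function names are hypothetical.

```python
def tolerance(r_squared):
    """TOL = 1 - R^2, where R^2 comes from regressing one predictor on the others."""
    return 1.0 - r_squared


def vif(r_squared):
    """VIF = 1 / TOL."""
    return 1.0 / tolerance(r_squared)


# Values reported in the text for the clay variable.
reported_vif = 2.971
reported_tol = 0.337
print(round(1.0 / reported_tol, 3), round(1.0 / reported_vif, 3))
```

The reciprocal of the reported VIF rounds to the reported TOL, and the small difference in the other direction is consistent with rounding to three decimals.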
We computed Pearson correlation coefficients (PCC) to determine the most important
factors in our predictions of the soil erodibility factor based on the training dataset (Table
4). The data show that silt has the highest impact on soil erodibility prediction (PCC =
0.540), whereas sand has the lowest impact (PCC = 0.002). Overall, silt, LL, CEC, clay,
CaCO3, OM, Fsand (fine sand), EC, n, BD, MWD, pH, and sand, in that order, are the
most important factors for predicting the soil erodibility index in the study area.
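The ranking above follows from sorting the factors by the magnitude of their Pearson correlation coefficients. The sketch below shows one way to compute a PCC and reproduces the ordering from the values listed in Table 4; the function name `pearson` and the dictionary name are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# PCC values as listed in Table 4.
table4_pcc = {
    "Clay": 0.369, "Silt": 0.540, "Sand": 0.002, "OM": 0.334, "CEC": 0.412,
    "Bd": 0.056, "n": 0.067, "LL": 0.463, "pH": 0.014, "CaCO3": 0.349,
    "EC": 0.101, "MWD": 0.055, "Fsand": 0.326,
}

# Rank the factors from the strongest to the weakest correlation.
ranking = sorted(table4_pcc, key=lambda f: abs(table4_pcc[f]), reverse=True)
print(", ".join(ranking))
```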
Results for the M5P algorithm are shown in Figure 2(c) and 2(d). The highest value
for CC (0.892) and the lowest error values for MAE (0.0029), RMSE (0.0037), RAE
(0.433), and PRSE (0.453) were computed for input 13 (Figure 2c), indicating that all 13
factors are effective and applicable for predicting the soil erodibility index. For the valid-
ation dataset, these values are 0.884, 0.0043, 0.0053, 0.512, and 0.521, respectively, for all
13 inputs (Figure 2d). The poorest prediction is provided by input 1 (clay) in both the
training and validation datasets.
Results for the REPTree algorithm are shown in Figure 2(e) and 2(f). Input 13 has the
best performance, with error values for the training dataset of 0.969 (CC), 0.0024 (MAE),
0.0033 (RMSE), 0.471 (RAE), and 0.532 (PRSE) (Figure 2e). Corresponding values for the
validation dataset are 0.867, 0.0038, 0.0046, 0.590, and 0.606 (Figure 2f). Again, input 1
yielded the poorest prediction for both the training dataset (CC = 0.670, MAE = 0.0051,
RMSE = 0.0062, RAE = 0.650, and PRSE = 0.790) and validation dataset (CC = 0.630,
MAE = 0.0065, RMSE = 0.0077, RAE = 0.850, and PRSE = 0.852).
Results for the GP algorithm are shown in Figure 2(g) and 2(h). Input 13 yielded error
values for the training dataset of 0.930 (CC), 0.0019 (MAE), 0.0023 (RMSE), 0.373 (RAE),
and 0.374 (PRSE) (Figure 2g). Corresponding values for the validation dataset are 0.918,
0.0032, 0.0038, 0.495, and 0.493 (Figure 2h). Input 4 provided the next highest prediction
accuracy for the validation dataset (CC = 0.913, MAE = 0.0029, RMSE = 0.0037, RAE =
0.447, and PRSE = 0.477), followed by input 5 (CC = 0.908, MAE = 0.0029, RMSE =
0.0037, RAE = 0.451, and PRSE = 0.484). Input 1 provided the poorest goodness-of-fit
and prediction accuracy.
Results for the PR algorithm are shown in Figure 2(i) and 2(j). Once again, input 13
yielded the highest performance, followed by inputs 4, 10, 11, and 12. The lowest per-
formance is for input 1, followed by inputs 6, 7, 8, 9, and 5. Error values for input 13 in
the training dataset are 0.892 (CC), 0.0022 (MAE), 0.0028 (RMSE), 0.433 (RAE), and 0.453
(PRSE); corresponding values for the validation dataset are 0.873, 0.0034, 0.0041, 0.528,
and 0.541.
Overall, the highest predicted goodness-of-fit (training dataset) and accuracy (valid-
ation dataset) were obtained for input 13, which includes clay, silt, sand, OM, CEC, BD,
n, LL, pH, CaCO3, EC, MWD, and Fsand (Fine sand). The lowest goodness-of-fit and
prediction accuracy were returned for input 1. The GP algorithm (CC = 0.918) has the
highest predictive power for soil erodibility, followed by RF (CC = 0.901), M5P (CC =
0.884), PR (CC = 0.873), and REPTree (CC = 0.867).
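The five evaluation measures reported in this section can be computed as in the following sketch. It assumes common definitions: RAE as the sum of absolute errors relative to the absolute deviations of the observations from their mean, and PRSE as the root of the corresponding ratio of squared quantities. These definitions are assumptions made for illustration, and the toy numbers are not soil erodibility data.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return CC, MAE, RMSE, RAE, and PRSE for paired observed/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "CC": cov / sqrt(ss_a * ss_p),            # correlation coefficient
        "MAE": sum(abs_err) / n,                  # mean absolute error
        "RMSE": sqrt(sum(sq_err) / n),            # root mean square error
        "RAE": sum(abs_err) / sum(abs(a - mean_a) for a in actual),
        "PRSE": sqrt(sum(sq_err) / ss_a),
    }


# Toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions the MAE can never exceed the RMSE for the same set of predictions, which provides a quick consistency check on reported values.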
Figure 2. Goodness-of-fit (training) and prediction accuracy (validation) of the machine learning regression models
based on input strategy. (a) RF model, training dataset. (b) RF model, validation dataset. (c) M5P model, training dataset.
(d) M5P model, validation dataset. (e) REPTree model, training dataset. (f) REPTree model, validation dataset.
(g) GP model, training dataset. (h) GP model, validation dataset. (i) PR model, training dataset. (j) PR model, validation
dataset (ton acre h (hundreds of acre ft tonf in)⁻¹).
and the results indicated that CC, MAE, RMSE, RAE, and PRSE were 0.890, 0.0040, 0.0049,
0.613, and 0.641, respectively; in the absence of EC, these values were 0.870, 0.0040, 0.0049,
0.616, and 0.644, respectively (Figure 3a). Figure 3(a) also shows that, with all 13 variables
(input 13), the CC, MAE, RMSE, RAE, and PRSE values are the same as in the enter-input
scenario. After we removed all of the above-mentioned sensitive variables, the CC, MAE,
RMSE, RAE, and PRSE values for the validation dataset were 0.902,
0.0038, 0.0046, 0.583, and 0.604, respectively (Figure 3b).
The results of SAS for the M5P algorithm are shown in Figure 3(c). The enter-input
scenario and sensitivity analysis scenario yielded similar results. The best prediction
Figure 3. Prediction accuracy of the machine learning regression models based on the sensitivity analysis scenario
and the validation dataset. (a) RF model and all variables. (b) RF model and six predictive variables. (c) M5P and all
variables. (d) REPTree model and all variables. (e) REPTree model and ten predictive variables. (f) GP model and all
variables. (g) PR model and all variables. (h) PR model and six predictive variables (ton acre h (hundreds of acre ft
tonf in)⁻¹).
accuracy (CC = 0.884, MAE = 0.0033, RMSE = 0.0040, RAE = 0.512, and PRSE =
0.521) was obtained for input 13, which includes all 13 soil variables.
Results for the REPTree algorithm are shown in Figure 3(d). In this case, the CEC and
BD variables are the most sensitive variables. By removing these two variables, we optimized
the values, with a prediction accuracy (CC = 0.867, MAE = 0.0034, RMSE = 0.0042,
RAE = 0.520, and PRSE = 0.551) better than that obtained using the enter-input scenario
(Figure 3e).
Results for the GP algorithm are similar to those obtained using the EIS technique.
The best prediction accuracy (CC = 0.918, MAE = 0.0032, RMSE = 0.0038, RAE =
0.495, and PRSE = 0.493) was obtained by including all 13 soil variables in the model
(Figure 3f).
Finally, the results for the PR algorithm are shown in Figure 3(g) and 3(h). In this
case, clay and CEC are the most sensitive variables for predicting the soil erodibility
index. In the absence of these two variables, CC, MAE, RMSE, RAE and PRSE are,
respectively, 0.891, 0.0022, 0.0028, 0.431, and 0.453 (Figure 3h). With all 13 variables
included, these values are, respectively, 0.873, 0.0034, 0.0041, 0.529, and 0.541 (Figure 3g).
After optimizing the best input parameters for the predictive modelling based on SAS, we
found that the GP algorithm (CC = 0.918) has the highest predictive power for soil erodibility,
followed by RF (CC = 0.902), PR (CC = 0.891), M5P (CC = 0.884),
and REPTree (CC = 0.811).
Figure 4. Goodness-of-fit and predictive power of the RF algorithm. (a) Trend of the actual and predicted K
using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using
the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the
training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of actual vs. predicted
K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (t h MJ⁻¹ mm⁻¹).
of the modelling procedure for the five machine learning regression algorithms are shown
in Figures 4–8. Figures 4(a)–8(a) summarize the trends of observed and predicted soil
erodibility indexes based on the training dataset; their error values (MAE and RMSE)
are shown in Figures 4(b)–8(b). Corresponding measures for the validation dataset are
shown, respectively, in Figures 4(c)–8(c) and Figures 4(d)–8(d). The means and standard
deviations of training and validation datasets are shown, respectively, in Figures 4(e)–8(e)
Figure 5. Goodness-of-fit and predictive power of the REPTree algorithm. (a) Trend of actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of actual and predicted K using the
validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the training
dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of the actual vs. predicted K
in the training phase. (h) Scatterplot of the actual vs. predicted K in the validation phase (t h MJ⁻¹ mm⁻¹).
and Figures 4(f)–8(f). Finally, Figures 4(g)–8(g) and Figures 4(h)–8(h) are, respectively,
scatterplots of observed versus predicted soil erodibility indexes in the training and validation
datasets. The GP algorithm yielded the highest coefficient of determination (R²)
between the observed and predicted K values (0.843), followed by the RF (0.812), PR
(0.794), M5P (0.781), and REPTree (0.752) algorithms.
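For a simple comparison of observed and predicted values, the coefficient of determination equals the square of the correlation coefficient. The sketch below checks that relationship against CC values reported earlier in the text; pairing each R² with these particular CC values is an assumption made for illustration.

```python
# CC values reported in the text for the validation dataset.
reported_cc = {"GP": 0.918, "RF": 0.901, "PR": 0.891, "M5P": 0.884, "REPTree": 0.867}
# R² values reported in the text.
reported_r2 = {"GP": 0.843, "RF": 0.812, "PR": 0.794, "M5P": 0.781, "REPTree": 0.752}

for model, cc in reported_cc.items():
    squared = round(cc ** 2, 3)
    print(f"{model}: CC^2 = {squared}, reported R2 = {reported_r2[model]}")

all_match = all(round(cc ** 2, 3) == reported_r2[m] for m, cc in reported_cc.items())
```

With these inputs, the squared CC values agree with the reported R² values to three decimals.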
We also compared the results of the models on Taylor diagrams (Figure 9), which summarize
multiple aspects of model performance in a single plot. The correlation, standard deviation,
and RMSE were computed for the five machine learning regression algorithms, and a dot of a
different colour was assigned to each model. The position of each coloured dot on the plot
quantifies how closely that algorithm’s simulated soil erodibility index matches the
observations. Results show that the GP algorithm
Figure 6. Goodness-of-fit and predictive power of the PR algorithm. (a) Trend of the actual and predicted K
using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using
the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the
training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of actual vs. predicted
K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (t h MJ⁻¹ mm⁻¹).
best predicts the soil erodibility index based on training (Figure 9a) and validation
(Figure 9b) datasets.
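The three statistics summarized on a Taylor diagram are linked by a law-of-cosines relationship, which is why they can be shown on a single plot. The sketch below computes them for toy data; it assumes population standard deviations and the centred (bias-removed) RMS difference that Taylor diagrams conventionally display, and the function name and values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def taylor_stats(obs, pred):
    """Return (sd_obs, sd_pred, correlation, centred RMS difference)."""
    mo, mp = mean(obs), mean(pred)
    so, sp = pstdev(obs), pstdev(pred)
    r = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / (len(obs) * so * sp)
    crmsd = sqrt(mean([((p - mp) - (o - mo)) ** 2 for o, p in zip(obs, pred)]))
    return so, sp, r, crmsd


# Toy observed and predicted values, for illustration only.
obs = [0.020, 0.031, 0.027, 0.044, 0.035]
pred = [0.023, 0.029, 0.030, 0.040, 0.033]
so, sp, r, crmsd = taylor_stats(obs, pred)

# Law of cosines linking the three statistics on the diagram.
identity_rhs = so ** 2 + sp ** 2 - 2.0 * so * sp * r
print(crmsd ** 2, identity_rhs)
```

A model plots closest to the observation point when its standard deviation matches that of the observations and its correlation with them approaches one.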
A boxplot (Figure 10) shows that all five algorithms closely predict the observed maximum
soil erodibility index (0.0474 t h MJ⁻¹ mm⁻¹). The observed minimum soil erodibility
index (0.0111 t h MJ⁻¹ mm⁻¹) is best predicted by REPTree, followed, in order, by
GP, PR, M5P, and RF. The rankings for the first quartile (Q1; 25%) are GP, PR,
M5P, RF, and REPTree, and for the third quartile (Q3; 75%), REPTree, PR, M5P, GP,
and RF. The median value of the observed soil erodibility index (0.03109 t h
MJ⁻¹ mm⁻¹) is best predicted by the GP and RF algorithms. These results indicate that,
Figure 7. Goodness-of-fit and predictive power of the M5P algorithm. (a) Trend of the actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using the
validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the training
dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of the actual vs. predicted K
in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (t h MJ⁻¹ mm⁻¹).
although GP generally is the most accurate algorithm, it did not predict extreme val-
ues well.
Figure 8. Goodness-of-fit and predictive power of the GP algorithm. (a) Trend of the actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted soil
erodibility index using the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard
deviation values of the training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot
of the actual vs. predicted K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase
(t h MJ⁻¹ mm⁻¹).
machine learning datasets are normally distributed, and therefore the performance of the
models can be compared using the paired sample T-test (two-tailed). Results, shown in
Table 7, indicate that all algorithms performed well; there is no statistically significant
difference between the predictions of each algorithm and the observed soil erodibility index
values at the 95% confidence level.
Figure 9. Taylor diagrams comparing the performance of the models. (a) Training dataset. (b) Validation dataset.
Figure 10. Box plot comparing model performances (t h MJ⁻¹ mm⁻¹).
5. Discussion
5.1. Relation of factors to soil erodibility
Multi-collinearity analysis indicated that there are no strong correlations between inde-
pendent variables used in this study (Table 3). Therefore, all independent variables
could be used as input in fitting models without problems arising from multi-collinearity.
Several approaches, such as correlation analysis and sensitivity analysis, can be used
to select important soil properties that contribute to soil erodibility and also to inter-
pret the effects of soil properties on differences in the spatial distribution and variabil-
ity of soil erodibility. In this study, there are high negative correlations between soil
erodibility and clay, OM, CEC, LL, and CaCO3 (Table 4). We note that both physical
and chemical soil properties are important in explaining the susceptibility of soils to
water erosion.
Soil texture plays a key role in erosion. Due to cohesion, clay particles are less susceptible
to erosion and transport by runoff than silt and sand (Morgan 1980). Clay particles
also form large stable aggregates that are resistant to raindrop impact and erosion by
overland flow (Belasri et al. 2017). An exception is clay minerals that are susceptible to
expansion and contraction (e.g. smectites). No significant correlation (PCC = 0.002,
p-value = 0.985) was found between soil erodibility and sand content (Table 4); sand
promotes infiltration and reduces overland flow (Pérez-Rodríguez et al. 2007; Efthimiou
2020). As shown in Table 4, soils dominated by silt and fine sand have high K values and
are most vulnerable to water erosion. In addition, these soils commonly have fractured
superficial crusts that favour erosion (Pérez-Rodríguez et al. 2007). Yang et al. (2018)
concluded that soils with high clay contents have relatively high resistance to erosion, and
many other researchers have reported the same finding (e.g. Vaezi et al. 2008, 2016;
Bonilla and Johnson 2012; Shabani et al. 2014).
All other things being equal, soil resistance to erosion increases with an increase in
organic matter content, because soil OM is an important binding agent, producing stable
aggregates (Wang et al. 2016; Liu et al. 2020). Soils with less than 3.5% organic matter are
considered erodible (Evans 1980). Soil OM may also increase infiltration and thus reduce
surface runoff and erosion (Rodríguez et al. 2006; Tejada and Gonzalez 2006).
Cation exchange capacity can also have a significant effect on soil erodibility (Table 4).
CEC typically covaries with clay and OM contents. Abbaslou et al. (2020) reported that
high-CEC soils are less susceptible to dispersion than low-CEC ones. We found a negative
correlation between soil erodibility and CaCO3 content (Table 4). CaCO3 promotes par-
ticle flocculation, cementation, and thus the formation of soil aggregates (Wuddivira and
Camps-Roach 2007; Abbaslou et al. 2020). Vaezi et al. (2008) and Ostovari et al. (2016)
found that CaCO3 plays a key role in reducing erosion in calcareous soils. The same
authors also argued that stable surface aggregates are resistant to erosion by rainfall
splash. We found, however, that aggregate stability (MWD) is not significantly correlated
with soil erodibility (PCC = 0.055 and p-value = 0.587) (Table 4). Parysow et al. (2003)
found that the soil structure explained only 6.53% of the variability of soil erodibility, per-
haps because of the poor structure of soils in their semi-arid study area. Vaezi et al.
(2008) argued that soil texture is more important than soil structure in reducing soil
erodibility in semi-arid regions, and Blanco and Lal (2008) concluded that only strong
stable soil aggregates reduce soil erosion.
Soil erodibility is also negatively correlated with liquid limit, probably because the latter
covaries with clay content (Özdemir and Gülser 2017). High clay content can contribute
to soil cohesiveness and aggregate stability. Vacchiano et al. (2014) and Curtaz et al.
(2015) advocated the use of Atterberg limits as indicators of the susceptibility of soils
to erosion.
Table 6. One-sample Kolmogorov-Smirnov normality test of the machine learning regression models.

| Parameter | Observed | RF | REPTree | PR | M5P | GP |
|---|---|---|---|---|---|---|
| Mean | 3.09E-02 | 4.42E-02 | 3.08E-02 | 4.58E-02 | 3.09E-02 | 3.09E-02 |
| Standard deviation | 8.20E-03 | 6.60E-03 | 7.90E-03 | 7.30E-03 | 7.30E-03 | 8.20E-03 |
| Most extreme difference (absolute) | 0.066 | 0.080 | 0.120 | 0.118 | 0.118 | 0.086 |
| Most extreme difference (positive) | 0.048 | 0.067 | 0.120 | 0.118 | 0.118 | 0.086 |
| Most extreme difference (negative) | 0.066 | 0.080 | 0.074 | 0.089 | 0.090 | 0.067 |
| Kolmogorov-Smirnov Z | 0.553 | 0.667 | 1.005 | 0.986 | 0.989 | 0.721 |
| Significance (2-tailed) | 0.920 | 0.765 | 0.265 | 0.285 | 0.282 | 0.676 |
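The link between the Kolmogorov-Smirnov Z values and the two-tailed significance levels in Table 6 can be checked with the asymptotic Kolmogorov distribution. The sketch below assumes that Z = √n × D and that the significance comes from the standard alternating series for the tail probability. Whether the authors' statistical software used exactly this formula is an assumption, and the function name is illustrative.

```python
import math

def kolmogorov_two_tailed_p(z, terms=100):
    """Asymptotic two-tailed p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2)."""
    if z <= 0:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    # Clamp to [0, 1] to guard against truncation error for very small z.
    return min(1.0, max(0.0, 2.0 * total))

# K-S Z values as listed in Table 6.
z_values = {"Observed": 0.553, "RF": 0.667, "REPTree": 1.005,
            "PR": 0.986, "M5P": 0.989, "GP": 0.721}

for name, z in z_values.items():
    print(f"{name:9s} Z = {z:.3f}  p = {kolmogorov_two_tailed_p(z):.3f}")
```

Under these assumptions the computed p-values agree with the tabulated significance levels to about three decimal places, and all exceed 0.05, which is consistent with normally distributed values.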
Table 7. Performance of the machine learning regression models using the paired sample T-test (two-tailed).

| Models | Mean | St.D. | SDEM | CID lower | CID upper | t | Sig. | D |
|---|---|---|---|---|---|---|---|---|
| O-RF | 7.30E-05 | 2.08E-03 | 2.48E-04 | -4.20E-04 | 5.69E-04 | 0.296 | 0.768 | NO |
| O-REPTree | 1.50E-05 | 2.03E-03 | 2.42E-04 | -4.98E-04 | 4.68E-04 | 0.062 | 0.951 | NO |
| O-M5P | 8.00E-05 | 3.71E-03 | 4.43E-04 | -8.76E-04 | 8.91E-04 | 0.017 | 0.986 | NO |
| O-PR | 6.00E-05 | 3.71E-03 | 4.43E-04 | -8.79E-04 | 8.90E-04 | 0.013 | 0.990 | NO |
| O-GP | 2.10E-05 | 3.07E-03 | 3.67E-04 | -7.10E-04 | 7.52E-04 | 0.056 | 0.955 | NO |

Notes: O: Observed. SDEM: St.D. error of the mean. CID: 95% confidence interval of the difference. Sig.: Significance. D: Difference.
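The columns of Table 7 are linked by the standard paired t-test relations: the standard error of the mean is the standard deviation of the differences divided by √n, t is the mean difference divided by that standard error, and the confidence interval is the mean difference plus or minus a critical t value times the standard error. The sketch below illustrates these relations with hypothetical observed and predicted values. The numbers, the function name, and the critical value (2.262 for 9 degrees of freedom at the 95% level) are assumptions made for the example only.

```python
import math
import statistics

def paired_t_summary(observed, predicted, t_crit):
    """Mean difference, SD, standard error, t, and CI for paired samples."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)      # sample SD (n - 1 denominator)
    sem = sd_d / math.sqrt(n)           # standard error of the mean
    t = mean_d / sem
    ci = (mean_d - t_crit * sem, mean_d + t_crit * sem)
    return {"mean": mean_d, "sd": sd_d, "sem": sem, "t": t, "ci": ci}

# Hypothetical K values (not from the study): 10 observed/predicted pairs.
observed = [0.031, 0.028, 0.035, 0.040, 0.025, 0.033, 0.029, 0.037, 0.030, 0.027]
predicted = [0.030, 0.029, 0.034, 0.038, 0.027, 0.033, 0.030, 0.036, 0.031, 0.026]

result = paired_t_summary(observed, predicted, t_crit=2.262)  # df = 9, two-tailed 95%
print(result)
# The difference is not significant at the 5% level when the CI contains zero.
print("significant:", not (result["ci"][0] <= 0.0 <= result["ci"][1]))
```

A confidence interval that straddles zero corresponds to the "NO" entries in the difference column: the predictions cannot be distinguished from the observations at the chosen level.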
The results obtained using the sensitivity analysis scenario are similar to those reported
above. Specifically, with SAS, the best estimates of K were obtained when all independent
variables were used to model soil erodibility. Removing input variables one by one did
not significantly affect MAE, RMSE, CC, RAE, or PRSE, except when using the
REPTree model.
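For reference, error statistics of the kind named above can be computed as in the following sketch. It assumes that the relative measures follow the definitions common in data-mining toolkits, in which model errors are normalised by those of a predictor that always returns the observed mean. The function name and sample values are hypothetical, and the key "RRSE" (root relative squared error) is an assumed counterpart of the relative squared-error statistic in the text.

```python
import math

def regression_metrics(observed, predicted):
    """MAE, RMSE, CC, RAE (%), and root relative squared error (%)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - o) for o, p in zip(observed, predicted)]
    sq_err = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    # Errors of a baseline that always predicts the observed mean.
    base_abs = sum(abs(o - mean_o) for o in observed)
    base_sq = sum((o - mean_o) ** 2 for o in observed)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(sq_err) / n),
        "CC": cov / math.sqrt(base_sq * var_p),
        "RAE": 100.0 * sum(abs_err) / base_abs,
        "RRSE": 100.0 * math.sqrt(sum(sq_err) / base_sq),
    }

# Hypothetical observed and predicted K values, for illustration only.
obs = [0.020, 0.030, 0.040, 0.050]
pred = [0.022, 0.029, 0.041, 0.048]
for name, value in regression_metrics(obs, pred).items():
    print(f"{name}: {value:.4f}")
```

Relative values below 100% indicate that a model does better than the mean-only baseline, which makes these measures convenient for comparing runs with different input sets.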
The IES and SAS scenario results indicate the considerable potential for using
Gaussian process regression (GPR) in modeling soil erodibility. GPR is a powerful tech-
nique that, in spite of the simplicity of regression models, reveals complex relations
(Ballabio et al. 2019). By providing a sound framework in kernel machines, GPR offers
the benefits of selecting models, as well as interpreting their predictions (Rasmussen and
Williams 2006). Its main drawback, when large datasets are involved, is its high computa-
tional time (Ballabio et al. 2019). Possible approaches for dealing with this problem
include the Nyström kernel matrix approximation (Drineas and Mahoney 2005) and massive
parallel processing (Ballabio et al. 2019).
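To make the computational cost concrete, the following is a minimal sketch of Gaussian process regression written with the Python standard library only. It is not the implementation used in this study. The squared-exponential kernel, the fixed length scale and noise level, the constant mean, and all function names are assumptions chosen for illustration, and the training values are hypothetical.

```python
import math

def rbf(a, b, length=1.0, variance=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return variance * math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit_predict(X, y, X_new, noise=1e-6, length=1.0):
    """GP posterior mean at X_new; the O(n^3) factorisation dominates for large n."""
    n = len(X)
    mean_y = sum(y) / n
    K = [[rbf(X[i], X[j], length) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_cholesky(cholesky(K), [v - mean_y for v in y])
    return [mean_y + sum(rbf(x, X[i], length) * alpha[i] for i in range(n))
            for x in X_new]

# Hypothetical one-feature training set, for illustration only.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 0.8, 0.9, 0.1]
print(gp_fit_predict(X_train, y_train, [[0.5], [1.5], [2.5]]))
```

The n × n kernel matrix and its factorisation are the source of the cubic growth in computing time. Low-rank schemes such as the Nyström approximation reduce this cost by working with a subset of columns of the kernel matrix.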
overall trend of the data will result in large residuals (Kuhn and Johnson 2013).
Nevertheless, the normal distribution of residuals obtained by all models (Figures 4–8)
suggests that regression lines follow the trend of the majority of the data.
Taylor diagrams, which are graphical representations of model performance based on
different validation statistics (R², RMSE, and standard deviation), were used to facilitate
comparisons of the model training and validation datasets (Figure 9). The K factor of
Wischmeier and Smith (1978), expressed as Eq. (1), was used as the reference for the
comparisons. Although the Taylor diagram for the training data (Figure 9a) indicates that
the GP and M5P algorithms perform about the same, GP has a slightly lower RMSE than
M5P, whereas M5P has a lower standard deviation than GP (Figure 9a). The overlap of
PR and REPTree in the Taylor diagram suggests that their performance, in terms of the
validation statistics, is about the same. The pattern for the validation dataset is somewhat
different (Figure 9b). All models have R² values around 0.9, with the value for RF a little
higher than that for GP, which in turn is higher than values for REPTree, PR, and GP, in
that order. RF, REPTree, and PR plot closer than GP and M5P to the observed data point
in Figure 9(b), indicating lower RMSE values. Zhao et al. (2018) also used a Taylor dia-
gram to compare the performance of five models in predicting soil erodibility in China.
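The geometry of a Taylor diagram rests on a law-of-cosines identity: the centred RMS difference E' between predictions and observations satisfies E'² = σ_p² + σ_o² − 2 σ_p σ_o R, where σ_p and σ_o are the standard deviations of the predictions and observations and R is their correlation. The sketch below verifies this identity numerically. It assumes population (n-denominator) statistics, and the values and function name are hypothetical. Note that the centred RMS difference excludes any overall bias, so it equals the ordinary RMSE only when the mean error is zero.

```python
import math

def taylor_statistics(observed, predicted):
    """Standard deviations, correlation, and centred RMS difference."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    sd_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sd_p = math.sqrt(sum(d * d for d in dev_p) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * sd_o * sd_p)
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return sd_o, sd_p, corr, crmsd

observed = [0.025, 0.031, 0.028, 0.040, 0.035, 0.029]   # hypothetical
predicted = [0.027, 0.030, 0.029, 0.037, 0.036, 0.028]  # hypothetical

sd_o, sd_p, corr, crmsd = taylor_statistics(observed, predicted)
# Law-of-cosines identity that underlies the diagram's geometry.
from_identity = math.sqrt(sd_o**2 + sd_p**2 - 2.0 * sd_o * sd_p * corr)
print(f"sd_obs={sd_o:.4f} sd_pred={sd_p:.4f} R={corr:.3f} cRMSD={crmsd:.4f}")
```

Because the three statistics are tied together in this way, a single point per model is enough to show all of them, and models that plot close to the reference point have both a high correlation and a small centred RMS difference.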
6. Conclusion
We used five machine learning algorithms and different combinations of readily available
soil data to predict soil erodibility in the Dehgolan region of Iran. Pearson correlation
analysis shows a high and positive contribution of silt, fine sand, sand, bulk density, and
infiltration to soil erodibility in the study area. A high and negative relationship exists
between clay, organic matter, cation exchange capacity, liquid limit, pH, CaCO3, and elec-
trical conductivity, on one hand, and soil erodibility on the other. These results indicate
that soil erosion could perhaps be reduced through remediation measures that enhance
soil clay, OM, CEC, LL, pH, CaCO3, and EC. The USLE was developed for non-calcar-
eous soils, but our results indicate that including CaCO3 in soil erodibility models might
produce better results.
The input-enter and sensitivity analysis scenarios revealed that the poorest estimates of
soil erodibility were obtained when only one input variable was used in modeling, irre-
spective of the algorithm. The best results were obtained when all 13 variables were used.
SAS revealed that the set of input variables that most affect the model outputs differ
among the five models applied in this study. Comparisons of the IES and SAS results
with the results of the simple correlation analysis indicate that the algorithms used in this
study effectively reveal the complex relationships between soil properties and soil
erodibility.
Validation statistics and statistical analyses show that the five machine learning algo-
rithms successfully modeled soil erodibility, although GP generally outperformed the
other algorithms. The IES and SAS results indicate that the best predictive model is GP
using all 13 independent variables. We suggest that this model can be applied as a power-
ful tool for predictive modeling of soil erosion in countries such as Iran, which lack reli-
able soil data. We further recommend including other independent variables, such as
terrain attributes and remotely sensed data, to improve the performance of these models.
Finally, because the GP models are computationally complex, especially for spatial modeling,
approaches such as Nyström kernel matrix approximation and massive parallel processing
are recommended to split datasets into a more manageable size.
Disclosure statement
The authors declare that they have no known competing financial interests or personal relationships that
could have appeared to influence the work reported in this paper.
Funding
This research was supported by the University of Kurdistan, Iran, with two grants (nos. 99-11-32657-6
and 00-9-27617).
ORCID
Ataollah Shirzadi http://orcid.org/0000-0003-1666-1180
Himan Shahabi http://orcid.org/0000-0001-5091-6947
John J. Clague http://orcid.org/0000-0002-2697-2233
References
Abbaslou H, Hadifard H, Ghanizadeh AR. 2020. Effect of cations and anions on flocculation of dispersive
clayey soils. Heliyon. 6(2):e03462.
Aksakal EL, Angin I, Oztas T. 2013. Effects of diatomite on soil consistency limits and soil compactibility.
Catena. 101:157–163.
Alkharabsheh MM, Alexandridis T, Bilas G, Misopolinos N, Silleos N. 2013. Impact of land cover change
on soil erosion hazard in northern Jordan using remote sensing and GIS. Procedia Environ Sci. 19:
912–921.
Azzouz AS, Krizek RJ, Corotis RB. 1976. Regression analysis of soil compressibility. Soils Found. 16(2):
19–29.
Ballabio C, Lugato E, Fernandez-Ugalde O, Orgiazzi A, Jones A, Borrelli P, Montanarella L, Panagos P.
2019. Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regres-
sion. Geoderma. 355:113912.
Barzegar R, Moghaddam AA, Deo R, Fijani E, Tziritis E. 2018. Mapping groundwater contamination risk
of multiple aquifers using multi-model ensemble of machine learning algorithms. Sci Total Environ.
621:697–712.
Bastien P, Vinzi VE, Tenenhaus M. 2005. PLS generalised linear regression. Comput Stat Data Anal.
48(1):17–46.
Baumgartl T. 2002. Atterberg limits. In: Encyclopedia of soil science. New York: Marcel Dekker Inc; p.
89–93.
Belasri A, Lakhouili A, Halima OI. 2017. Soil erodibility mapping and its correlation with soil properties
of Oued El Makhazine watershed, Morocco. Forestry. 2(3):4.
Benedetto A. 2010. Water content evaluation in unsaturated soil using GPR signal analysis in the fre-
quency domain. J Appl Geodesy. 71(1):26–35.
Blanco H, Lal R. 2008. Principles of soil conservation and management. New York: Springer.
Bonilla CA, Johnson OI. 2012. Soil erodibility mapping and its correlation with soil properties in Central
Chile. Geoderma. 189–190:116–123.
Bouslihim Y, Rochdi A, Aboutayeb R, El Amrani-Paaza N, Miftah A, Hssaini L. 2021. Soil aggregate sta-
bility mapping using remote sensing and GIS-based machine learning techniques. Frontiers Earth Sci.
9:1–13.
Breiman L. 2001. Random forests. Machine Learn. 45(1):5–32.
Breiman L, Friedman J, Stone CJ, Olshen RA. 1984. Classification and regression trees. England (UK):
CRC Press.
Bruce JP, Frome M, Haites E, Janzen H, Lal R, Paustian K. 1999. Carbon sequestration in soils. J Soil
Water Conserv. 54(1):382–389.
Canillas EC, Salokhe VM. 2001. Regression analysis of some factors influencing soil compaction. Soil
Tillage Res. 61(3–4):167–178.
Chai T, Draxler RR. 2014. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments
against avoiding RMSE in the literature. Geosci Model Dev. 7(3):1247–1250.
Chau JF, Bagtzoglou AC, Willig MR. 2011. The effect of soil texture on richness and diversity of bacterial
communities. Environ Forensics. 12(4):333–341.
Chauchard S, Carcaillet C, Guibal F. 2007. Patterns of land-use abandonment control tree-recruitment
and forest dynamics in Mediterranean mountains. Ecosystems. 10(6):936–948.
Chen W, Zhang S, Li R, Shahabi H. 2018. Performance evaluation of the GIS-based data mining techni-
ques of best-first decision tree, random forest, and naïve Bayes tree for landslide susceptibility model-
ing. Sci Total Environ. 644:1006–1018.
Covelo E, Matías J, Vega F, Reigosa M, Andrade M. 2008. A tree regression analysis of factors determin-
ing the sorption and retention of heavy metals by soil. Geoderma. 147(1–2):75–85.
Curtaz F, Stanchi S, D’Amico ME, Filippa G, Zanini E, Freppaz M. 2015. Soil evolution after land-reshap-
ing in mountains areas (Aosta Valley, NW Italy). Agriculture Ecosyst Environ. 199:238–248.
Deepa C, SathiyaKumari K, Sudha VP. 2010. Prediction of the compressive strength of high performance
concrete mix using tree based modeling. IJCA. 6(5):18–24.
Diwediga B, Le QB, Agodzo SK, Tamene LD, Wala K. 2018. Modelling soil erosion response to sustain-
able landscape management scenarios in the Mo River Basin (Togo, West Africa). Sci Total Environ.
625:1309–1320.
Drineas P, Mahoney MW. 2005. On the Nyström method for approximating a Gram matrix for improved
kernel-based learning. J Mach Learn Res. 6(Dec):2153–2175.
Efthimiou N. 2020. The new assessment of soil erodibility in Greece. Soil Tillage Res. 204:104720.
Egbueri JC, Igwe O. 2021. The impact of hydrogeomorphological characteristics on gullying processes in
erosion-prone geological units in parts of southeast Nigeria. Geol Ecol Landscapes. 5(3):227–240.
Elbisy MS. 2015. Support vector machine and regression analysis to predict the field hydraulic conductiv-
ity of sandy soil. KSCE J Civ Eng. 19(7):2307–2316.
Elvidge S, Angling M, Nava B. 2014. On the use of modified Taylor diagrams to compare ionospheric
assimilation models. J Roy Astron Soc Can. 49(9):737–745.
Evans R. 1980. Mechanics of water erosion and their spatial and temporal controls: an empirical view-
point. In: Kirkby MJ, Morgan RPC, editors. Soil erosion. Chichester: Wiley. p. 109–128.
Foley JA, Defries R, Asner GP, Barford C, Bonan G, Carpenter SR, Chapin FS, Coe MT, Daily GC, Gibbs
HK, et al. 2005. Global consequences of land use. Science. 309(5734):570–574.
Fu W, Tunney H, Zhang C. 2010. Spatial variation of soil nutrients in a dairy farm and its implications
for site-specific fertilizer application. Soil Tillage Res. 106(2):185–193.
Fumo N, Biswas MR. 2015. Regression analysis for prediction of residential energy consumption. Renew
Sustain Energy Rev. 47:332–343.
Gauch HG, Hwang JG, Fick GW. 2003. Model evaluation by comparison of model-based predictions and
measured values. Agron J. 95(6):1442–1446.
Gee G, Bauder JW. 1986. Particle-size analysis. In: Klute A, editor. Methods of soil analysis part 1.
Agronomy monograph No. 9. 2nd ed. Madison, WI: American Society of Agronomy/Soil Science
Society of America. p. 383–411.
Gonzalez JP, Cook SE, Oberthür T, Jarvis A, Bagnell JA, Dias MB. 2007. Creating low-cost soil maps for
tropical agriculture using gaussian processes. AI in ICT for Development (ICTD) International Joint
Conference on Artificial Intelligence, Hyderabad, India.
Grossman RB, Reinsch TG. 2002. 2.1 Bulk density and linear extensibility. In: Dick AW, editor. Methods
of soil analysis: Part 4 physical methods. Madison: Soil Science Society of America Book Series; p.
201–228.
Hastie T, Tibshirani R, Friedman J. 2009. The elements of statistical learning: data mining, inference, and
prediction. New York (NY): Springer Science Business Media.
Hengl T, Heuvelink GB, Stein A. 2004. A generic framework for spatial prediction of soil variables based
on regression-kriging. Geoderma. 120(1–2):75–93.
Hengl T, Nussbaum M, Wright MN, Heuvelink GB, Gräler B. 2018. Random forest as a generic framework
for predictive modeling of spatial and spatio-temporal variables. PeerJ. 6:e5518.
Hosmer DW, Lemeshow S. 2000. Applied logistic regression. New York: Wiley.
Hosseini M, Agereh SR, Khaledian Y, Zoghalchali HJ, Brevik EC, Naeini SARM. 2017. Comparison of
multiple statistical techniques to predict soil phosphorus. Appl Soil Ecol. 114:123–131.
Igwe O, Egbueri JC. 2018. The characteristics and the erodibility potentials of soils from different geologic
formations in Anambra State, southeastern Nigeria. J Geol Soc India. 92(4):471–478.
Kang F, Han S, Salgado R, Li J. 2015. System probabilistic stability analysis of soil slopes using Gaussian
process regression with Latin hypercube sampling. Comput Geotech. 63:13–25.
Karballaeezadeh N, Tehrani HG, Shadmehri DM, Shamshirband S. 2020. Estimation of flexible pavement
structural capacity using machine learning techniques. Front Struct Civ Eng. 14(5):1083–1096.
Kemper W, Rosenau R. 1986. Aggregate stability and size distribution. In: Methods of soil analysis, part
1: physical and mineralogical methods. American Society of Agronomy, Inc. Soil Science Society of
America, Inc; Vol. 5. p. 425–442.
Kerry R, Oliver M. 2007. Comparing sampling needs for variograms of soil properties computed by the
method of moments and residual maximum likelihood. Geoderma. 140(4):383–396.
Kettler T, Doran JW, Gilbert T. 2001. Simplified method for soil particle-size determination to accompany
soil-quality analyses. Soil Sci Soc Am J. 65(3):849–852.
Khosravi K, Pham BT, Chapi K, Shirzadi A, Shahabi H, Revhaug I, Prakash I, Bui DT. 2018. A compara-
tive assessment of decision trees algorithms for flash flood susceptibility modeling at Haraz watershed,
northern Iran. Sci Total Environ. 627:744–755.
Kim JB, Saunders P, Finn JT. 2005. Rapid assessment of soil erosion in the Rio Lempa Basin, Central
America, using the universal soil loss equation and geographic information systems. Environ Manage.
36(6):872–885.
Kim TK. 2015. T test as a parametric statistic. Korean J Anesthesiol. 68(6):540–546.
Kouchami-Sardoo I, Shirani H, Esfandiarpour-Boroujeni I, Besalatpour A, Hajabbasi M. 2020. Prediction
of soil wind erodibility using a hybrid genetic algorithm–Artificial neural network method. Catena.
187:104315.
Kuhn M, Johnson K. 2013. Applied predictive modeling. Vol. 26. New York (NY): Springer.
Lal R. 2001. Soil degradation by erosion. Land Degrad Dev. 12(6):519–539.
Lav MA, Ansal AM. 2001. Regression analysis of soil compressibility. Turkish J Eng Environ Sci. 25(2):
101–109.
Lei X, Chen W, Avand M, Janizadeh S, Kariminejad N, Shahabi H, Costache R, Shahabi H, Shirzadi A,
Mosavi A. 2020. GIS-based machine learning algorithms for gully erosion susceptibility mapping in a
semi-arid region of Iran. Remote Sens. 12(15):2478.
Li X-G, Li F-M, Zed R, Zhan Z-Y. 2007. Soil physical properties and their relations to organic carbon
pools as affected by land use in an alpine pastureland. Geoderma. 139(1–2):98–105.
Ließ M, Glaser B, Huwe B. 2012. Uncertainty in the spatial prediction of soil texture: comparison of
regression tree and Random Forest models. Geoderma. 170:70–79.
Liu X, Zhang Y, Li P. 2020. Spatial variation characteristics of soil erodibility in the Yingwugou watershed
of the Middle Dan River, China. IJERPH. 17(10):3568.
Lizaga I, Quijano L, Gaspar L, Ramos MC, Navas A. 2019. Linking land use changes to variation in soil
properties in a Mediterranean mountain agroecosystem. Catena. 172:516–527.
MacKay DJC. 2003. Information theory. In: Inference and learning algorithms. Cambridge (UK):
Cambridge University Press.
Manrique L, Jones C, Dyke P. 1991. Predicting cation-exchange capacity from soil physical and chemical
properties. Soil Sci Soc Am J. 55(3):787–794.
McLean E. 1983. Soil pH and lime requirement. In: Methods of soil analysis, part 2: chemical and micro-
biological properties. The American Society of Agronomy, Inc., Soil Science Society of America; Vol. 9.
p. 199–224.
Menard S. 2002. Applied logistic regression analysis. Sage, Quantitative Applications in the Social
Sciences; Vol. 106. p. 128.
Merdun H, Çınar Ö, Meral R, Apan M. 2006. Comparison of artificial neural network and regression
pedotransfer functions for prediction of soil water retention and saturated hydraulic conductivity. Soil
Tillage Res. 90(1–2):108–116.
Metya S, Mukhopadhyay T, Adhikari S, Bhattacharya G. 2017. System reliability analysis of soil slopes
with general slip surfaces using multivariate adaptive regression splines. Comput Geotech. 87:212–228.
Mohamed WNHW, Salleh MNM, Omar AH. 2012. A comparative study of reduced error pruning
method in decision tree algorithms. International Conference on Control System, Computing and
Engineering. IEEE.
Mohammadi A, Kamran KV, Karimzadeh S, Shahabi H, Al-Ansari N. 2020. Flood detection and suscepti-
bility mapping using Sentinel-1 time series, Alternating Decision Trees, and Bag-ADTree models.
Complexity. 2020:1–21.
Morgan R. 1980. Soil erosion and conservation in Britain. Prog Phys Geogr. 4(1):24–47.
Mosavi A, Shirzadi A, Choubin B, Taromideh F, Hosseini FS, Borji M, Shahabi H, Salvati A, Dineva AA.
2020. Towards an ensemble machine learning model of random subspace based functional tree classi-
fier for snow avalanche susceptibility mapping. IEEE Access. 8:145968–145983.
Mutanga O, Adam E, Cho MA. 2012. High density biomass estimation for wetland vegetation using
WorldView-2 imagery and random forest regression algorithm. Int J Appl Earth Observ. 18:399–406.
Nebeokike UC, Igwe O, Egbueri JC, Ifediegwu SI. 2020. Erodibility characteristics and slope stability ana-
lysis of geological units prone to erosion in Udi area, southeast Nigeria. Model Earth Syst Environ.
6(2):1061–1074.
Nhu V-H, Hoang N-D, Duong V-B, Vu H-D, Bui DT. 2020. A hybrid computational intelligence
approach for predicting soil shear strength for urban housing construction: a case study at Vinhomes
Imperia project, Hai Phong City (Vietnam). Eng Comput. 36(2):603–616.
Norman A. 1965. Methods of soil analysis, Part 2: chemical and microbiological properties. Madison, WI:
American Society of Agronomy, Soil Science Society of America.
Osisanwo F, Akinsola J, Awodele O, Hinmikaiye J, Olakanmi O, Akinjobi J. 2017. Supervised machine
learning algorithms: classification and comparison. IJCTT. 48(3):128–138.
Ostovari Y, Ghorbani-Dashtaki S, Bahrami H-A, Abbasi M, Dematte JAM, Arthur E, Panagos P. 2018.
Towards prediction of soil erodibility, SOM and CaCO3 using laboratory Vis-NIR spectra: a case study
in a semi-arid region of Iran. Geoderma. 314:102–112.
Ostovari Y, Ghorbani-Dashtaki S, Bahrami H-A, Naderi M, Dematte JAM, Kerry R. 2016. Modification of
the USLE K factor for soil erodibility assessment on calcareous soils in Iran. Geomorphology. 273:
385–395.
Özdemir N, Gülser C. 2017. Clay activity index as an indicator of soil erodibility. Eurasian J Soil Sci. 6(4):
307–311.
Page A, Miller R, Keeney D. 1982. Methods of soil analysis, Part 2. Madison, WI: American Society of
Agronomy, Soil Science Society of America.
Parysow P, Wang G, Gertner G, Anderson AB. 2003. Spatial uncertainty analysis for mapping soil erodi-
bility based on joint sequential simulation. Catena. 53(1):65–78.
Pennock DJ. 2019. Soil erosion: the greatest challenge for sustainable soil management. UN Food and
Agriculture Organization.
Pérez-Rodríguez R, Marques MJ, Bienes R. 2007. Spatial variability of the soil erodibility parameters and
their relation with the soil map at subgroup level. Sci Total Environ. 378(1–2):166–173.
Pham BT, Ly H-B, Al-Ansari N, Ho LS. 2021. A comparison of Gaussian process and M5P for prediction
of soil permeability coefficient. Scientific Programming, 1–13.
Pham BT, Prakash I, Singh SK, Shirzadi A, Shahabi H, Tran T-T-T, Bui DT. 2019. Landslide susceptibility
modeling using reduced error pruning trees and different ensemble techniques: hybrid machine learn-
ing approaches. Catena. 175:203–218.
Phinzi K, Ngetar NS. 2019. The assessment of water-borne erosion at catchment level using GIS-based
RUSLE and remote sensing: A review. Int Soil Water Conserv Res. 7:27–46.
Pimentel D. 2006. Soil erosion: a food and environmental threat. Environ Dev Sustain. 8(1):119–137.
Pimentel D, Burgess M. 2013. Soil erosion threatens food production. Agriculture. 3(3):443–463.
Probst P, Wright MN, Boulesteix AL. 2019. Hyperparameters and tuning strategies for random forest.
Wiley Interdiscip Rev Data Min Knowl Discov. 9(3):e1301.
Quinlan JR. 1987. Simplifying decision trees. Intern J Man-Mach Stud. 27(3):221–234.
Quinlan JR. 1992. Learning with continuous classes. 5th Australian Joint Conference on Artificial
Intelligence. Hobart, Tasmania, Australia: World Scientific.
Rasmussen C, Williams C. 2006. Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Richards LA. 1954. Diagnosis and improvement of saline and alkali soils. Lippincott Williams Wilkins.
New York: US Department of Agriculture, New York; Vol. 78, p. 1–166.
Rodríguez AR, Arbelo C, Guerra J, Mora J, Notario J, Armas C. 2006. Organic carbon stocks and soil
erodibility in Canary Islands Andosols. Catena. 66(3):228–235.
Rodriguez-Galiano V, Mendes MP, Garcia-Soldado MJ, Chica-Olmo M, Ribeiro L. 2014. Predictive mod-
eling of groundwater nitrate pollution using Random Forest and multisource variables related to
intrinsic and specific vulnerability: a case study in an agricultural setting (southern Spain). Sci Total
Environ. 476–477:189–206.
Römkens MJM, Young RA, Poesen JWA, McCool DK, El-Swaify SA, Bradford JM. 1997. Soil erodibility
factor (K). Compilers In: Renard KG, Foster GR, Weesies GA, McCool DK, Yoder DC, editors.
Predicting soil erosion by water: a guide to conservation planning with the Revised Universal Soil Loss
Equation (RUSLE). Washington, DC, USA: Agric. HB. 703:65–99.
Scholten T. 1997. Hydrology and erodibility of the soils and saprolite cover of the Swaziland Middleveld.
Soil Technol. 11(3):247–262.
Shabani F, Kumar L, Esmaeili A. 2014. Improvement to the prediction of the USLE K factor.
Geomorphology. 204:229–234.
Sharma A, Weindorf DC, Wang D, Chakraborty S. 2015. Characterizing soils via portable X-ray fluores-
cence spectrometer: 4. Cation exchange capacity (CEC). Geoderma. 239–240:130–134.
Sharpley AN, Williams JR. 1990. EPIC-Erosion/Productivity Impact Calculator. I: Model documentation.
II: User manual. Technical Bulletin-United States Department of Agriculture, (1768).
Shirzadi A, Saro L, Joo OH, Chapi K. 2012. A GIS-based logistic regression model in rock-fall susceptibil-
ity mapping along a mountainous road: Salavat Abad case study, Kurdistan, Iran. Nat Hazards. 64(2):
1639–1656.
Sihag P, Karimi SM, Angelaki A. 2019. Random forest, M5P and regression analysis to estimate the field
unsaturated hydraulic conductivity. Appl Water Sci. 9(5):1–9.
Sihag P, Tiwari N, Ranjan S. 2017. Modelling of infiltration of sandy soil using gaussian process regres-
sion. Model Earth Syst Environ. 3(3):1091–1100.
Singh B, Sihag P, Singh K. 2017. Modelling of impact of water quality on infiltration rate of soil by ran-
dom forest regression. Model Earth Syst Environ. 3(3):999–1004.
Sparks D, Page A, Helmke P, Leoppert R, Soltanpour P, Tabatabai M, Johnston G, Summer M. 1996.
Methods of soil analysis. American Society of Agronomy, Soil Science Society of America, Book Series.
5.
Sridharan A, Nagaraj HB. 1999. Absorption water content and liquid limit of soils. Geotech Test J. 22(2):
127–133.
Staff S. 2014. Keys to soil taxonomy. 12th ed. Washington, DC: Natural Resources Conservation Service,
US Department of Agriculture.
Sumner ME, Miller WP. 1996. Cation exchange capacity and exchange coefficients. In: Methods soil ana-
lysis, part 3: chemical methods. The Soil Science Society of America, Inc., American Society of
Agronomy; Vol. 5. p. 1201–1229.
Taheri K, Shahabi H, Chapi K, Shirzadi A, Gutierrez F, Khosravi K. 2019. Sinkhole susceptibility map-
ping: a comparison between Bayes-based machine learning algorithms. Land Degrad Dev. 30(7):
730–745.
Tejada M, Gonzalez J. 2006. The relationships between erodibility and erosion in a soil treated with two
organic amendments. Soil Tillage Res. 91(1–2):186–198.
Tiessen H, Cuevas E, Chacon P. 1994. The role of soil organic matter in sustaining soil fertility. Nature.
371(6500):783–785.
Torri D, Poesen J, Borselli L. 1997. Predictability and uncertainty of the soil erodibility factor using a glo-
bal dataset. Catena. 31:1–22.
Udelhoven T, Emmerling C, Jarmer T. 2003. Quantitative analysis of soil chemical properties with diffuse
reflectance spectrometry and partial least-square regression: a feasibility study. Plant Soil. 251(2):
319–329.
Vacchiano G, Stanchi S, Ascoli D, Marinari G, Zanini E, Motta R. 2014. Soil-mediated effects of fire on
Scots pine (Pinus sylvestris L.) regeneration in a dry, inner-alpine valley. Sci Total Environ. 472:778–788.
Vaezi AR, Hasanzadeh H, Cerda A. 2016. Developing an erodibility triangle for soil textures in semi-arid
regions, NW Iran. Catena. 142:221–232.
Vaezi A, Sadeghi S, Bahrami H, Mahdian M. 2008. Modeling the USLE K-factor for calcareous soils in
northwestern Iran. Geomorphology. 97(3–4):414–423.
Wang B, Zheng F, Guan Y. 2016. Improved USLE-K factor prediction: a case study on water erosion
areas in China. Intern Soil Water Conserv Res. 4(3):168–176.
Wang B, Zheng F, Römkens MJ. 2013. Comparison of soil erodibility factors in USLE, RUSLE2, EPIC
and Dg models based on a Chinese soil erodibility database. Acta Agric Scand B – Soil Plant Sci.
63(1):69–79.
Wang G, Gertner G, Fang S, Anderson AB. 2003. Mapping multiple variables for predicting soil loss by
geostatistical methods with TM images and a slope map. Photogramm Eng Remote Sens. 69(8):
889–898.
Wang H, Zhang G-h, Li N-n, Zhang B-j, Yang H-y. 2019. Variation in soil erodibility under five typical
land uses in a small watershed on the Loess Plateau, China. Catena. 174:24–35.
Wang J, Ding J, Yu D, Ma X, Zhang Z, Ge X, Teng D, Li X, Liang J, Lizaga I, et al. 2019. Capability of
Sentinel-2 MSI data for monitoring and mapping of soil salinity in dry and wet seasons in the Ebinur
Lake region, Xinjiang, China. Geoderma. 353:172–187.
Wang Y. 2000. A new approach to fitting linear models in high dimensional spaces. Hamilton, NZ:
University of Waikato.
Wang Y, Fang Z, Hong H, Peng L. 2020. Flood susceptibility mapping using convolutional neural net-
work frameworks. J Hydrol. 582:124482.
Wang Y, Witten IH. 2002. Modeling for optimal probability prediction. Proceeding of the Nineteenth
International Conference on Machine Learning, Sydney, Australia.
Wang Y, Witten I, van Someren M, Widmer G. 1997. Inducing models trees for continuous classes.
Poster Papers of the European Conference on Machine Learning, Department of Computer Science,
University Waikato, Hamilton, NZ.
Weisberg S. 2005. Applied linear regression. Vol. 528. New York: John Wiley & Sons.
Wilcoxon F. 1992. Individual comparisons by ranking methods. In: Breakthroughs in statistics. Berlin
(Germany): Springer; p. 196–202.
Williams CK, Rasmussen CE. 2006. Gaussian processes for machine learning. Vol. 2. Cambridge, MA:
MIT Press.
Wischmeier WH, Smith DD. 1949. Predicting rainfall-erosion losses from cropland east of the rocky
mountains: guide for selection of practices for soil and water conservation. US Department of
Agriculture. 282p.
Wischmeier WH, Smith DD. 1978. Predicting rainfall erosion losses: a guide to conservation planning.
US Department of Agriculture, Science and Education Administration. 537p.
Witten IH, Frank E, Hall MA. 2005. Practical machine learning tools and techniques. Morgan Kaufmann.
578p.
World Reference Base for Soil Resources. 2014 (2015). International soil classification system for naming
soils and creating legends for soil maps. FAO Rome, 192p.
Wuddivira M, Camps-Roach G. 2007. Effects of organic matter and calcium on soil structural stability.
Eur J Soil Science. 58(3):722–727.
Yan-li L, You-lu B, Li-ping Y, Hong-juan W, Qing-bo K. 2008. Application of hyperspectral data for soil
organic matter estimation based on principle components regression analysis. Plant Nutr Soil Sci. 7(6):
10.
Yang J, Wang J, Qiao P, Zheng Y, Yang J, Chen T, Lei M, Wan X, Zhou X. 2020. Identifying factors that
influence soil heavy metals by using categorical regression analysis: a case study in Beijing, China.
Frontiers Environ Sci Eng. 14(3):37.
Yuan J, Wang K, Yu T, Fang M. 2008. Reliable multi-objective optimization of high-speed WEDM pro-
cess based on Gaussian process regression. Int J Mach Tools Manuf. 48(1):47–60.
Yang X, Gray J, Chapman G, Zhu Q, Tulau M, McInnes-Clarke S. 2018. Digital mapping of soil erodibil-
ity for water erosion in New South Wales, Australia. Soil Res. 56(2):158–170.
Zeng Q, Darboux F, Man C, Zhu Z, An S. 2018. Soil aggregate stability under different rain conditions
for three vegetation types on the Loess Plateau (China). Catena. 167:276–283.
Zhang W, Parker K, Luo Y, Wan S, Wallace L, Hu S. 2005. Soil microbial responses to experimental
warming and clipping in a tallgrass prairie. Global Change Biol. 11(2):266–277.
Zhao W, Wei H, Jia L, Daryanto S, Zhang X, Liu Y. 2018. Soil erodibility and its influencing factors on
the Loess Plateau of China: a case study in the Ansai watershed. Solid Earth. 9(6):1507–1516.