Towards Robust Smart Data Driven Soil Erodibility Index Prediction Under Different Scenarios

This document discusses using machine learning techniques to predict soil erodibility index (K) in the Dehgolan region of Iran based on soil properties. Five machine learning algorithms - Random Forest, M5P, Reduced Error Pruning Tree, Gaussian Processes, and Pace Regression - were used to model the relationship between K and variables like texture, structure, density and chemistry. The Gaussian Processes model had the highest accuracy based on metrics like R2 of 0.843, MAE of 0.0044 and RMSE of 0.0050. This outperformed the other models, showing it is useful for predicting soil erodibility in similar climates and soil conditions.

Geocarto International

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tgei20

Towards robust smart data-driven soil erodibility index prediction under different scenarios

Ataollah Shirzadi, Himan Shahabi, Kamal Nabiollahi, Ruhollah Taghizadeh-Mehrjardi, Ivan Lizaga, John J. Clague, Sushant K. Singh, Fariba Golmohamadi & Anuar Ahmad

To cite this article: Ataollah Shirzadi, Himan Shahabi, Kamal Nabiollahi, Ruhollah Taghizadeh-
Mehrjardi, Ivan Lizaga, John J. Clague, Sushant K. Singh, Fariba Golmohamadi & Anuar Ahmad
(2022): Towards robust smart data-driven soil erodibility index prediction under different scenarios,
Geocarto International, DOI: 10.1080/10106049.2022.2076918

To link to this article: https://doi.org/10.1080/10106049.2022.2076918

Published online: 25 May 2022.

Ataollah Shirzadi^a, Himan Shahabi^b, Kamal Nabiollahi^c, Ruhollah Taghizadeh-Mehrjardi^d,e,f, Ivan Lizaga^g, John J. Clague^h, Sushant K. Singh^i, Fariba Golmohamadi^c and Anuar Ahmad^j

^a Department of Rangeland and Watershed Management, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran; ^b Department of Geomorphology, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran; ^c Department of Soil Science and Engineering, Faculty of Agriculture, University of Kurdistan, Sanandaj, Iran; ^d Department of Geosciences, Soil Science and Geomorphology, University of Tübingen, Tübingen, Germany; ^e SFB 1070 Resource Cultures, University of Tübingen, Tübingen, Germany; ^f DFG Cluster of Excellence "Machine Learning", University of Tübingen, Tübingen, Germany; ^g Estación Experimental de Aula-Dei (EEAD-CSIC), Spanish National Research Council, Avenida Montañana, Zaragoza, Spain; ^h Department of Earth Sciences, Simon Fraser University, Burnaby, BC, Canada; ^i Center for Artificial Intelligence and Environmental Sustainability (CAIES) Foundation, Patna, Bihar, India; ^j Department of Geoinformation, Faculty of Built Environment and Surveying, Universiti Teknologi, Johor Bahru, Malaysia

ABSTRACT
Soil erosion is a major cause of damage to agricultural lands in many parts of the world and is of particular concern in semiarid parts of Iran. We use five machine learning techniques, namely Random Forest (RF), M5P, Reduced Error Pruning Tree (REPTree), Gaussian Processes (GP), and Pace Regression (PR), under two scenarios to predict soil erodibility in the Dehgolan region, Kurdistan Province, Iran. Our models are based on a variety of soil properties, including soil texture, structure, permeability, bulk density, aggregates, organic matter, and chemical constituents. We checked the validity of the models with statistical metrics, including the coefficient of determination (R²), mean absolute error (MAE), root mean squared error (RMSE), t-tests, Taylor diagrams, and box plots. All five algorithms show a positive correlation between the soil erodibility factor (K) and silt, sand, fine sand, bulk density, and infiltration. The GP model has the highest prediction accuracy (R² = 0.843, MAE = 0.0044, RMSE = 0.0050). It outperformed the RF (R² = 0.812, MAE = 0.0050, RMSE = 0.0061), PR (R² = 0.794, MAE = 0.0037, RMSE = 0.0052), M5P (R² = 0.781, MAE = 0.0043, RMSE = 0.0053), and REPTree (R² = 0.752, MAE = 0.0045, RMSE = 0.0056) algorithms and thus is a useful complement to studies aimed at predicting soil erodibility in areas with similar climate and soil characteristics.

ARTICLE HISTORY
Received 10 January 2022
Accepted 8 May 2022

KEYWORDS
Soil erodibility; erosion; plot studies; machine learning; Iran

CONTACT Himan Shahabi [email protected]
© 2022 Informa UK Limited, trading as Taylor & Francis Group

1. Introduction
Soil erosion is a major cause of damage to agricultural lands, with severe environmental, economic, and social consequences (Wang et al. 2016). Erosion decreases soil fertility and productivity, and poses a major threat to subsistence farmers in many developing countries (Chauchard et al. 2007). It also has negative downstream effects, for example sedimentation in reservoirs, aggradation of stream channels, and water pollution (Diwediga et al. 2018). To provide an adequate supply of food for a rapidly growing population,
humans have converted natural landscapes into agricultural land, leading to soil and eco-
system changes on a global scale (Foley et al. 2005; Chauchard et al. 2007; Li et al. 2007).
Over past decades, continuous conventional tillage practices and deforestation have
accelerated soil erosion (Bruce et al. 1999; Lal 2001; Pimentel and Burgess 2013). This
problem remains the greatest threat to soil health, soil ecosystem services, and cropland
productivity in many countries despite a century of research on the topic (Pimentel 2006;
Pennock 2019). Recent research, however, has shown that changes in agricultural manage-
ment from conventional to conservation tillage practices, along with natural revegetation
following abandonment of cultivated land, increase soil organic carbon and total nitrogen
stocks and may reduce soil erosion (Lizaga et al. 2019).
Areas susceptible to soil erosion must be identified to properly manage croplands and,
more broadly, to understand landscape evolution. Documenting these areas using conven-
tional methods (e.g. physically based and empirical models) is time-consuming and
expensive (Alkharabsheh et al. 2013). However, geographic information system (GIS) and
multi-temporal remote sensing technologies can provide useful, cost-effective information
on environmental problems, including soil erosion (Wang et al. 2003). Nevertheless, soil
erosion can only be effectively reduced when the processes responsible for the problem
are understood (Wang et al. 2016).
As observational data have increased and soil erosion has become better understood,
researchers have identified possible erosion drivers, notably climate, soil properties, vege-
tation, and topography, and have developed empirical equations to forecast soil erosion.
Notable among them is the Universal Soil Loss Equation (USLE), which includes a soil
erodibility factor (K) that is closely related to soil physical and chemical properties
(Wischmeier and Smith 1949). K is a measure of the integrated, average, total annual soil
and soil profile losses due to erosion and other hydrological processes (Bonilla and
Johnson 2012).
Accurate field measurement of K requires complex and expensive experiments and long-term observations in croplands (Kim et al. 2005); consequently, attempts have been made to estimate it from easily obtained soil properties (Wischmeier and Smith 1949).
The simplifying equations, however, have been derived from North American soil data-
bases, which might render them unsuitable for soils in other geographical areas. Only a
few studies have focused on estimating K outside the United States (Shabani et al. 2014;
Wang et al. 2016), and none has used machine learning algorithms to determine the vari-
ables that should be used in these estimations. Recently developed machine learning tech-
niques offer an opportunity to establish local or regional soil erodibility indexes due to
their ability to solve complex environmental problems (Rodriguez-Galiano et al. 2014;
Barzegar et al. 2018; J Wang et al. 2019).
Machine learning algorithms (MLAs) detect meaningful patterns in data, have a wide range of applications, and constitute one of the fastest growing fields of computer science
(Osisanwo et al. 2017). One of the standard uses of MLAs is classification, in which a
learner maps a vector into one of several classes by looking at several input-output exam-
ples of the function (Osisanwo et al. 2017). Examples of widely used classification
machine learning algorithms are logistic regression, artificial neural network, support vector machine, naïve Bayes, Bayes net, Bayesian logistic regression, alternating decision tree, naïve Bayes tree, logistic model tree, and random forest. These algorithms have been
applied to a wide variety of environmental and natural hazards problems, for example
flooding (Mohammadi et al. 2020; Wang et al. 2020), rockfalls (Shirzadi et al. 2012), snow
avalanches (Mosavi et al. 2020), sinkhole formation (Taheri et al. 2019), and gully erosion
(Lei et al. 2020; Bouslihim et al. 2021).
Another standard task performed by MLAs is regression and prediction. In this case, a
target value (dependent variable) is predicted based on regression of a series of predictors
(independent variables). Regression algorithms offer the advantages of: (1) being relatively
easy to implement; (2) requiring less computational power than classification algorithms
such as genetic algorithms, neural networks, and support vector machines; (3) providing
satisfactory prediction; and (4) increasing data availability through smart metering (Fumo
and Biswas 2015). Regression algorithms have been widely used in soil science studies
(Igwe and Egbueri 2018; Nebeokike et al. 2020; Egbueri and Igwe 2021). Examples include
studies of soil compressibility (Azzouz et al. 1976; Lav and Ansal 2001); soil chemical
properties (Udelhoven et al. 2003); stability analysis of soil slopes (Kang et al. 2015;
Metya et al. 2017); soil compaction (Canillas and Salokhe 2001); soil organic carbon and
organic matter (You-lu et al. 2008); soil variability (Hengl et al. 2004); soil quality (Kettler
et al. 2001); soil texture (Ließ et al. 2012); heavy metals in soils (Covelo et al. 2008; Yang
et al. 2020); hydraulic conductivity of sandy soil (Elbisy 2015); soil shear strength (Nhu
et al. 2020); eolian erodibility of soil (Kouchami-Sardoo et al. 2020); soil erodibility
factors in USLE, RUSLE2, EPIC, and Dg models (Wang et al. 2013); and soil erodibility
in laboratory experiments (Ostovari et al. 2018). However, there are few published studies
that focus on predictions of soil erodibility using machine learning regression algorithms.
An exception is Ostovari et al. (2016), who predicted soil erodibility based on measure-
ments of soil particle size, soil CaCO3 and organic matter, permeability, and wet aggregate
stability in 40 erosion plots in calcareous soils in Iran. They analysed the data using sev-
eral regression models, including multiple linear regression, the Mamdani fuzzy inference
system, and an artificial neural network (ANN), and concluded that the ANN model,
based on its highest R2, lowest RMSE, and lowest ME, provided the best estimates of K.
At present, there is no guideline or standard framework for selecting the best machine
learning model to predict K. In this paper, we employ five well-known benchmark machine learning techniques, namely Random Forest (RF), M5P, Reduced Error Pruning Tree (REPTree), Gaussian Processes (GP), and Pace Regression (PR), in a study of soil erodibility in Kurdistan Province in western Iran. These algorithms are robust and have previously been studied and verified in regression applications (e.g. Singh et al. 2017; Sihag et al.
2019; Karballaeezadeh et al. 2020; Pham et al. 2021).
RF offers the advantage of being a non-parametric statistical method and can be used to model non-linear relationships. Of the machine learning algorithms used in this study, it is best able to deal with complex relationships (Breiman 2001; Mutanga et al. 2012). It can also yield adequate results with a limited dataset (Hengl et al. 2018). On the other hand, Pham et al. (2021) have argued that the M5P and GP algorithms can be used successfully with few user-defined parameters; these algorithms also provide mathematical equations that are easy and convenient to implement (Yuan et al. 2008; Deepa et al. 2010; Sihag et al. 2017). As a decision tree algorithm, REPTree is simple to use. It can easily be applied to a training dataset and, if the output is large, it can reduce the complexity of the tree structure (Mohamed et al. 2012). However, to our knowledge, none of these algorithms has been evaluated for its potential to predict soil erodibility, which motivated the present study. In this paper, we use as input the most important variables controlling K and determine the best algorithm for predicting soil erodibility.

Figure 1. Location of the study area and soil samples in Kurdistan Province, Iran.

2. Data and methods


2.1. Study area and sample collection
The study area (approximately 480 km²) is located in the Dehgolan region of Kurdistan Province in western Iran between 47°15′56.25″ and 47°34′56.58″ E longitude and 35°02′17.42″ and 35°21′45.17″ N latitude (Figure 1). It is a semi-arid area with xeric moisture and mesic temperature regimes (Staff 2014). Mean annual temperature, mean annual precipitation, and elevation are 10.2 °C, 399 mm, and 2292 m a.s.l., respectively. The dominant soil classes are
Inceptisols and Entisols based on soil taxonomy, and Cambisols and Leptosols based on
the World Reference Base for Soil Resources (2015). Soil parent materials are limestone,
conglomerate, sandstone, granite, tuff, marl, and alluvium. The major geomorphologic
units are piedmont, plateaus, and hills. Slope gradients range from gentle to very steep
(Figure 1).
Most of the study area is cropland (approximately 88%; mainly wheat and alfalfa); the
remainder is rangeland. Some of the farmers in the area rent land, and landlords com-
monly put much pressure on the soil through conventional tillage operations and the
overuse of chemical fertilizers to achieve maximum crop yields. These practices facilitate
soil erosion.
In this study, we collected 99 soil samples (0–30 cm depth) from all terrain units and soil types. The samples were stored in plastic bags and transported to the soil laboratory in the Department of Pedology, College of Agriculture, University of Kurdistan, Iran. In the laboratory, we removed plant roots, air-dried the samples, and
screened them through a 2-mm sieve prior to further analysis.

2.2. Soil erodibility index (K)


Phinzi and Ngetar (2019) summarized the methods and equations used to estimate K. The researchers who developed the methods include Bouyoucos (1935), Wischmeier and Smith (1978), Sharply and Williams (1990), Romkens et al. (1997), and Torri et al. (1997). All equations are based on soil physical and chemical properties. The Wischmeier and Smith (1978) Universal Soil Loss Equation (USLE) is the most widely used method for estimating soil erosion (Phinzi and Ngetar 2019); we therefore use the K value derived from it as the dependent variable in our study.
The following equation was used to calculate K (Wischmeier and Smith 1978):

 
$$
K = \frac{2.1 \times 10^{-4}\, M^{1.14}\, (12 - \mathrm{OM}) + 3.25\,(S - 2) + 2.5\,(P - 3)}{100} \tag{1}
$$

where K is soil erodibility (t ha MJ⁻¹ mm⁻¹), M is the product of the percent of silt + very fine sand and (100 − percent clay), OM is the percent soil organic matter, S is a soil structure code, and P is a soil permeability code. Soil erodibility is affected by a wide variety of soil properties,
including soil texture, structure, permeability, bulk density, aggregates, organic matter,
and chemical constituents, which are briefly described below.
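As a worked illustration of Equation (1), the following minimal Python sketch evaluates K for a single sample. It assumes the standard nomograph form of Wischmeier and Smith (1978), in which M = (percent silt + very fine sand) × (100 − percent clay) and the structure term is 3.25(S − 2). The function name, argument names, and sample values are hypothetical, and no unit conversion is applied.

```python
def usle_k(silt_vfs_pct, clay_pct, om_pct, structure_code, permeability_code):
    """Soil erodibility K from the nomograph equation (Eq. 1), no unit conversion."""
    # M: particle-size parameter, (% silt + % very fine sand) * (100 - % clay)
    m = silt_vfs_pct * (100.0 - clay_pct)
    return (
        2.1e-4 * m ** 1.14 * (12.0 - om_pct)   # texture and organic matter term
        + 3.25 * (structure_code - 2)          # soil structure term
        + 2.5 * (permeability_code - 3)        # permeability term
    ) / 100.0


# Hypothetical sample: 40% silt + very fine sand, 20% clay, 2% organic matter,
# structure code 2 and permeability code 3 (both correction terms vanish).
k = usle_k(silt_vfs_pct=40.0, clay_pct=20.0, om_pct=2.0,
           structure_code=2, permeability_code=3)
print(round(k, 4))
```

In this form, K decreases as organic matter increases and increases with higher structure and permeability codes, which matches the qualitative behaviour described in the following subsections.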

2.3. Factors affecting soil erodibility


2.3.1. Organic matter
Organic matter (OM) affects the physical, biological, and chemical properties of soils
(Tiessen et al. 1994). As organic matter increases, soil aggregate stability generally
improves, reducing the likelihood of erosion (Zeng et al. 2018; Bouslihim et al. 2021). In
this study, soil organic carbon was determined using a wet combustion method
(Norman 1965).

2.3.2. Cation exchange capacity


Cation exchange capacity (CEC) affects soil properties such as acidity, water retention,
and nutrient holding capacity (Manrique et al. 1991). Basic cations chemically bond soil
particles and thereby contribute to aggregate stability. High amounts of exchangeable
sodium can weaken soil structure, which leads to soil dispersion, surface crust formation,
and a decrease in infiltration (Richards 1954). In contrast, the calcium cation binds
organic matter to clay aggregates, which can result in greater aggregate stability and there-
fore less soil erosion. We measured CEC using the 1 N ammonium acetate (at pH 7.0)
method (Sumner and Miller 1996).

2.3.3. Bulk density


Bulk density (Bd) provides a measure of soil compaction and consolidation, and is used
to compare different land-use practices and to assess the effects of human activities on
soil degradation. In this study, soil bulk density and particle density were determined
using both core (Grossman and Reinsch 2002) and pycnometer methods.

2.3.4. Liquid limit


Liquid limit (LL), one of the Atterberg metrics, is defined as the water content above
which a soil behaves as a viscous liquid (Sridharan and Nagaraj 1999). LL provides a
measure of soil mechanical behaviour, for example compactibility (Aksakal et al. 2013).
Soils with higher LLs are commonly less compactible and more easily tilled (Aksakal et al.
2013). We measured LL using the method of Baumgartl (2002) and the
Casagrande device.

2.3.5. pH
Soil pH is also a chemical property that may be related to soil erodibility.
Alkaline soils (pH > 8.5) with high Na typically have low infiltration capacities. In con-
trast, low-pH soils commonly have high Fe and Al, high infiltration capacities, and stable
aggregates. Soil pH was measured in a saturated paste using a pH electrode
(McLean 1983).

2.3.6. CaCO3
Calcium is an important cation in soil aggregate stability and infiltration, and conse-
quently can affect soil erodibility (Vaezi et al. 2008). We measured CaCO3 content as the
total neutralizing value by a volumetric method (Sparks et al. 1996).

2.3.7. Electrical conductivity


Electrical conductivity (EC) is a measure of the ability of a soil to conduct an electrical
current and is affected by the concentration of dissolved ions in porewater. Conductive
soils with high values of Ca²⁺ are flocculated and commonly have high infiltration rates;
conversely, low-conductivity soils with high Na⁺ are dispersed and have low infiltration
rates (Richards 1954). We measured electrical conductivity in a saturated paste using a
conductivity meter (Page et al. 1982).

2.3.8. Mean weight diameter


Mean weight diameter (MWD) provides a measure of the aggregate stability of a soil, that is, the resistance of its structure to destructive physical and chemical forces. It provides information on the sensitivity of soils to water and wind erosion. We determined the MWD of soil aggregates using the method of Kemper and Rosenau (1986) and the following equation:

$$
\mathrm{MWD} = \sum_{i} X_i W_i \tag{2}
$$

where MWD is the mean weight diameter of water-stable aggregates, X_i is the mean diameter of each size fraction (mm), and W_i is the proportion of the total sample mass in the corresponding size fraction after removal of stones.
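The weighted sum in Equation (2) can be sketched in a few lines of Python. The sieve classes and masses below are hypothetical values chosen for illustration only; the function name is likewise an assumption.

```python
def mean_weight_diameter(mean_diameters_mm, masses_g):
    """MWD = sum(X_i * W_i), with W_i the mass fraction of size class i (Eq. 2)."""
    if len(mean_diameters_mm) != len(masses_g):
        raise ValueError("one mass is required per size fraction")
    total = sum(masses_g)
    if total <= 0:
        raise ValueError("total aggregate mass must be positive")
    return sum(x * (w / total) for x, w in zip(mean_diameters_mm, masses_g))


# Hypothetical water-stable aggregate fractions after wet sieving
diameters = [3.0, 1.5, 0.75, 0.375, 0.125]   # mean diameter of each class (mm)
masses = [5.0, 10.0, 15.0, 10.0, 10.0]       # oven-dry mass retained (g)
mwd = mean_weight_diameter(diameters, masses)
print(mwd)
```

Normalizing by the total mass inside the function means the caller can pass raw masses rather than precomputed proportions.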

2.3.9. Infiltration
Infiltration (n) is the movement of water into and through a soil. Runoff occurs when
rainfall intensity exceeds soil infiltration capacity. Generally, soils with low infiltration
capacities have higher runoff and soil loss during high-intensity rainfall events. We calcu-
lated soil infiltration based on the final infiltration rate in the field using a double-ring
infiltrometer (Scholten 1997).

2.3.10. Soil texture


Soil texture, which is defined by the relative percentages of sand, silt, and clay, is closely
related to many other soil properties. In this study, particle-size distributions were deter-
mined using wet sieving and the Bouyoucos hydrometer method (Gee and Bauder 1986).
We defined four particle-size classes: sand (0.5–2 mm), fine sand (0.10–0.25 mm),
silt (0.002–0.05 mm), and clay (<0.002 mm). Sand has high porosity and permeability, is
more resistant to erosion by runoff, and therefore is less erodible than silt and clay. Fine
sandy soils are erodible because they are non-cohesive and their particles are easily
entrained in runoff. Silt is less cohesive and more easily eroded than clay. In addition, silt
particles have lower mass than sand particles and are more easily entrained and trans-
ported. Soils with high silt content have been shown to be less resistant to erosion than
sandy or clayey soils (Wischmeier and Smith 1978). Clay content affects soil moisture,
hydraulic properties such as infiltration rate and soil permeability (H Wang et al. 2019),
soil fertility and productivity, soil biology, and soil erosion (Wischmeier and Smith 1978;
Benedetto 2010; Chau et al. 2011; Sharma et al. 2015). Clay particles are cohesive; thus,
clay-rich soils have larger aggregates and are more resistant to erosion.

3. Background of the machine learning algorithms


In this study, we use five state-of-the-art machine learning algorithms (random forest,
M5P, REPTree, Gaussian processes, and pace regression) to derive a relationship between
measured soil properties and the soil erodibility index (K).

3.1. Random forest


Random forest (RF), developed by Breiman (2001), is an ensemble classification/regres-
sion algorithm structured with many individual decision trees (Breiman et al. 1984). It
uses averaging to achieve a more accurate and stable prediction than an individual deci-
sion tree. A random forest adds randomness to the model in two ways (Hastie et al.
2009): (1) random sampling (bootstrap) of training data when building trees; and (2) ran-
dom subsets of features (e.g. clay and sand contents) when splitting nodes. The result is a
large number of predictions derived from individual trees, which contributes to a better
model. Importantly, the random forest estimates an internal cross-validation accuracy from the samples that were not included in the bootstrap set, namely the out-of-bag samples (Breiman 2001). Mathematically, RF is the unweighted average over the collection of
tree predictors and is calculated as:
$$
f(x) = \frac{1}{N} \sum_{n=1}^{N} T_n(x) \tag{3}
$$

where x is the input vector, N is the number of trees, and T_n(x) is the prediction of the nth tree in the collection of tree predictors. Three common parameters need to be optimized in random forest (Probst et al.
2019): (1) ntree, the number of regression trees grown based on a bootstrap sample of the
training data; (2) mtry, the number of different features tested at each node; and (3) node-
size, the minimal size of the terminal nodes of the trees.
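The two sources of randomness, the averaging in Equation (3), and the out-of-bag error can be illustrated with a deliberately small pure-Python sketch. It is not the WEKA implementation used in the study: each "tree" is a single-split stump (so nodesize is not modelled), and all function names, data values, and parameter settings are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, features):
    """Fit a one-split regression tree, searching only the given feature indices."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between two observed values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no usable split in this bootstrap sample: predict the mean
        overall = mean(y)
        return lambda row: overall
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_forest(X, y, ntree=50, mtry=1, seed=0):
    """Grow ntree stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees, oob_sets = [], []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]   # (1) bootstrap sample
        feats = rng.sample(range(p), mtry)           # (2) random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
        oob_sets.append(set(range(n)) - set(idx))    # samples this tree never saw
    return trees, oob_sets


def predict(trees, row):
    """Unweighted average over all tree predictions (Eq. 3)."""
    return sum(tree(row) for tree in trees) / len(trees)


def oob_rmse(trees, oob_sets, X, y):
    """RMSE using, for each sample, only the trees for which it was out of bag."""
    errors = []
    for i, (row, yi) in enumerate(zip(X, y)):
        votes = [tree(row) for tree, oob in zip(trees, oob_sets) if i in oob]
        if votes:
            errors.append((mean(votes) - yi) ** 2)
    return (sum(errors) / len(errors)) ** 0.5 if errors else float("nan")


# Hypothetical data: [silt %, clay %] -> K
X = [[20, 40], [25, 35], [30, 30], [35, 28], [45, 22], [50, 20], [55, 15], [60, 12]]
y = [0.020, 0.022, 0.025, 0.027, 0.033, 0.036, 0.040, 0.043]
trees, oob_sets = fit_forest(X, y, ntree=50, mtry=1, seed=0)
k_hat = predict(trees, [40, 25])
print(k_hat, oob_rmse(trees, oob_sets, X, y))
```

Because a stump has only one node, choosing the feature subset once per tree is equivalent here to choosing it at each node; a full random forest would repeat that draw at every split.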

3.2. M5P
The M5P tree algorithm, developed by Wang et al. (1997) and described in detail by
Witten et al. (2005), implements the model-tree inducer based on the M5 tree (Quinlan
1992). The M5P algorithm is similar to common regression tree methods (Breiman et al.
1984), but the terminal nodes of the model trees are linear regression models (Kuhn and
Johnson 2013) instead of fixed average values. The algorithm involves two steps, tree
growth and pruning. The trees are grown with an exhaustive and iterative search of the
training data. The training data are divided into subsets using a decision structure (node-
splitting) strategy. At this point, the algorithm calculates the standard deviation (sd) of
the observed desired values (soil erodibility index) that reach the node (S) and treats that
value as a measure of the error at the node. Then, each feature (e.g. clay and sand contents) at that node is tested by calculating the expected reduction in error that would result from splitting on it. Mathematically, this reduction is the standard deviation reduction (SDR), calculated as:

$$
\mathrm{SDR} = \mathrm{sd}(S) - \sum_{i} \frac{|S_i|}{|S|}\, \mathrm{sd}(S_i) \tag{4}
$$

where S is the set of data that reaches the node; S_i is the subset of examples corresponding to the ith outcome of the candidate split; and sd is the standard deviation. The M5P algorithm evaluates all possible splits and chooses one that maximizes the expected error
reduction. Other steps in growing a tree are simplifications of the linear models, pruning,
and a smoothing process that is more complex than the one used in M5 (Quinlan 1992).
Witten et al. (2005) describe the algorithm in detail.
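The split-selection step can be sketched as follows: compute the standard deviation reduction for every candidate threshold on one feature and keep the largest. This is a minimal illustration, not the M5P implementation itself; it assumes the population standard deviation, and the function names and data are hypothetical.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction: sd(S) - sum(|S_i| / |S| * sd(S_i))."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_threshold(x, y):
    """Try every midpoint between sorted feature values; return (threshold, SDR)."""
    values = sorted(set(x))
    best = (None, float("-inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = sdr(y, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical node: silt content (%) and the K values reaching the node
silt = [10, 20, 30, 40, 50, 60]
k_values = [0.010, 0.012, 0.011, 0.030, 0.032, 0.031]
threshold, gain = best_threshold(silt, k_values)
print(threshold, gain)
```

For these values the chosen threshold separates the three low-K samples from the three high-K samples, because that split leaves the least spread within each subset.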

3.3. REPTree
Reduced Error Pruning Tree (REPTree), first developed by Quinlan (1987), is a fast deci-
sion/regression tree learner. The algorithm builds a decision/regression tree using infor-
mation gain and entropy (i.e. impurity metric) as the splitting criteria and reduced-error
pruning. A decision/regression tree splits the nodes (root node and decision nodes) on all
features (e.g. clay and sand contents) and then selects the split with the most homoge-
neous sub-nodes containing examples of similar values (Breiman et al. 1984). Entropy is
used by the REPTree algorithm to calculate the homogeneity of a sample; if the sample is
fully homogeneous, the entropy is zero (Witten et al. 2005). Mathematically, splitting the
regression tree with the REPTree algorithm is based on the highest information gain ratio
value, as expressed by:
$$
\text{Information gain ratio} = \frac{E(S) - \sum_{i=1}^{n} E(S_i)\,\frac{|S_i|}{|S|}}{-\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}} \tag{5}
$$

where E is entropy, and S and Si denote, respectively, the training dataset and its subset.
The reduced error pruning method, which is the simplest and most understandable
method in decision tree pruning, is used in REPTree. It decreases the complexity of
the regression/decision tree model and the error arising from variance. It also reduces the
over-fitting problem and increases the interpretability of the model (Khosravi et al. 2018).
The REPTree algorithm is commonly combined with bagging to create multiple trees in
different iterations in order to select the best one from all generated trees (Pham
et al. 2019).
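The entropy-based criterion can be illustrated with a short Python sketch that computes the information gain ratio for a candidate split. Entropy is defined on class labels, so the example uses hypothetical "low"/"high" erodibility classes; for a numeric target such as K, tree learners commonly use a variance-based criterion instead. All names and data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels; 0 for a pure sample."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(parent, subsets):
    """Information gain of the split divided by its split information."""
    n = len(parent)
    gain = entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical node with four "low" and four "high" erodibility samples
parent = ["low"] * 4 + ["high"] * 4
perfect = gain_ratio(parent, [["low"] * 4, ["high"] * 4])
mixed = gain_ratio(parent, [["low", "low", "low", "high"],
                            ["low", "high", "high", "high"]])
print(perfect, mixed)
```

A split that produces fully homogeneous sub-nodes attains the highest ratio, whereas a split that leaves both sub-nodes mixed scores lower, which is the ordering the tree learner exploits when choosing a split.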

3.4. Gaussian processes


Gaussian process (GP) regression is a powerful and flexible nonparametric technique, closely related to kernel ridge regression and to neural networks with an infinite number of hidden units, that builds comprehensive probabilistic models of real-world problems (MacKay 2003; Williams
and Rasmussen 2006). The key advantage of Gaussian processes over other techniques is
that it provides both a mean prediction and confidence intervals, which are very import-
ant to soil scientists (Gonzalez et al. 2007). A Gaussian process generalizes the Gaussian
distribution to describe a distribution over functions instead of a random variable. From
a function space point of view, a Gaussian process is a stochastic process that is fully
specified by its mean function m(x) and covariance (or kernel) function k(x, x′), and defined as:

$$
f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big) \tag{6}
$$
where GP denotes a Gaussian process and x is the input vector (e.g. clay and sand con-
tents). In this study, for the sake of simplicity, we place a zero-mean, multivariate,
Gaussian, prior distribution over f. The covariance function describes how much influence
one point has on another (Williams and Rasmussen 2006). The covariance function is
determined by kernel functions (e.g. radial basis function, Laplace, polykernel, sigmoid,
and multiquadric). Kernel functions determine the smoothness of the function in the
distribution and also control model accuracies.
Suitable selection of the covariance functions is an essential part of Gaussian processes.
Based on pre-test results in this research, the normalized polykernel function was selected
as the best kernel function. The normalized polykernel function is defined as follows:

$$
k(x, x') = \frac{(x^{T} x' + c)^{d}}{\sqrt{(x^{T} x + c)^{d}\,(x'^{T} x' + c)^{d}}} \tag{7}
$$

where k(x, x′) denotes the normalized polykernel function, x and x′ are input vectors (e.g. clay and sand contents), c > 0 is a free parameter, and d is the degree of the polynomial. The
hyper-parameters of the kernel functions are determined by maximizing the log-likelihood
of the training data (Kang et al. 2015).
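A normalized polynomial kernel can be sketched in Python as below. The sketch assumes the usual normalization, in which the polynomial kernel k(x, x′) is divided by the square root of k(x, x) · k(x′, x′); the values of c and d and the input vectors are hypothetical and are not the settings used in the study.

```python
from math import sqrt


def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x . y + c)^d for two equal-length vectors."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d


def normalized_poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel scaled so that every point has unit similarity to itself."""
    return poly_kernel(x, y, c, d) / sqrt(
        poly_kernel(x, x, c, d) * poly_kernel(y, y, c, d)
    )


# Hypothetical inputs: [clay fraction, sand fraction] for two soil samples
sample_a = [0.30, 0.50]
sample_b = [0.60, 0.10]
k_ab = normalized_poly_kernel(sample_a, sample_b)
k_aa = normalized_poly_kernel(sample_a, sample_a)
print(k_ab, k_aa)
```

Normalization bounds the kernel value at 1 for identical inputs, so covariance between two samples reflects the similarity of their property profiles rather than the magnitude of the inputs.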

3.5. Pace regression


Pace regression (“projection adjustment by contribution estimation”) was developed by
Wang (2000) and is an improved version of the classical ordinary least squares regression
method. The latter is calculated as:
$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{8}
$$

where y is the dependent (response) variable; x_1, x_2, …, x_n are predictor features; b_0 is the intercept; and b_1, b_2, …, b_n are the constant coefficients of the features under investigation.
The classical ordinary least squares regression method is simple and has been widely used
to predict numeric values by soil science researchers (Merdun et al. 2006; Hosseini et al.
2017), but it has notable disadvantages (Weisberg 2005). For example, it is unable to
detect redundancy in feature spaces (e.g. clay and sand contents), especially when a large
number of features are being considered (Bastien et al. 2005; Witten et al. 2005). Pace
regression overcomes this problem by evaluating the contribution of each feature (e.g. clay and sand contents) to observed values (e.g. soil erodibility index) and uses a clustering method to improve the statistical basis for estimating contributions to overall regressions. It provides an understanding of the relative influences of different features and therefore improves the model's interpretability (Wang and Witten 2002).

Table 1. Input variables based on the input-enter scenario (IES).

Input  Equation
1      K = f(Clay)
2      K = f(Clay, Silt)
3      K = f(Clay, Silt, Sand)
4      K = f(Clay, Silt, Sand, OM)
5      K = f(Clay, Silt, Sand, OM, CEC)
6      K = f(Clay, Silt, Sand, OM, CEC, Bd)
7      K = f(Clay, Silt, Sand, OM, CEC, Bd, n)
8      K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL)
9      K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH)
10     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3)
11     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC)
12     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD)
13     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)

3.6. Input scenarios of variables


3.6.1. Input-enter scenario (IES)
Table 1 shows our input-enter scenario for input variables. The scenario comprises 13
inputs, each of which adds one of the variables described in Sec. 2.2 to the previous input. For
example, input 1 considers only the clay variable, whereas input 2 involves clay and silt,
and input 5 includes clay, silt, sand, OM, and CEC.
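
The cumulative structure of the input-enter scenario can be sketched in a few lines of Python. This is only an illustration of how the 13 inputs of Table 1 are built, not the WEKA workflow used in the study; the names `VARIABLES` and `input_enter_sets` are hypothetical.

```python
# Variable order follows Table 1 (abbreviations as used in the paper).
VARIABLES = ["Clay", "Silt", "Sand", "OM", "CEC", "Bd", "n",
             "LL", "pH", "CaCO3", "EC", "MWD", "Fsand"]


def input_enter_sets(variables):
    """Return cumulative variable lists: input k holds the first k variables."""
    return [variables[:k] for k in range(1, len(variables) + 1)]


ies = input_enter_sets(VARIABLES)
for number, subset in enumerate(ies, start=1):
    print(f"Input {number}: K = f({', '.join(subset)})")
```

Each list in `ies` would then be passed to a regression algorithm in turn, so that the effect of adding one variable at a time can be compared across inputs.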

3.6.2. Sensitivity analysis scenario (SAS)


We use a second (sensitivity analysis) scenario in which each variable is removed in turn
and the model is fitted with the remaining variables. This procedure is repeated
individually for all variables. Table 2 summarizes the input variables based on the sensitivity
analysis scenario. For example, for input 1, the clay variable is removed and
the modeling is performed with all other variables.
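
The leave-one-variable-out construction of the sensitivity analysis scenario can be sketched as follows. This is an illustrative sketch only; `VARIABLES` and `leave_one_out_sets` are hypothetical names, and the model fitting itself is omitted.

```python
# Variable order follows Table 2 (abbreviations as used in the paper).
VARIABLES = ["Clay", "Silt", "Sand", "OM", "CEC", "Bd", "n",
             "LL", "pH", "CaCO3", "EC", "MWD", "Fsand"]


def leave_one_out_sets(variables):
    """Map each removed variable to the list of the remaining variables."""
    return {removed: [v for v in variables if v != removed]
            for removed in variables}


sas = leave_one_out_sets(VARIABLES)
for number, (removed, remaining) in enumerate(sas.items(), start=1):
    print(f"Input {number}: no {removed}; K = f({', '.join(remaining)})")
```

Comparing the validation statistics of each reduced model with those of the full 13-variable model indicates which variables the prediction is sensitive to.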

3.7. Multi-collinearity test


Multi-collinearity reflects near-perfect linear combinations of two or more variables
(Hosmer and Lemeshow 2000). Two well-known metrics for diagnosing multi-collinearity
are tolerance (TOL) and the variance inflation factor (VIF). According to Menard (2002),
a TOL value smaller than 0.2 indicates that there is multi-collinearity between independ-
ent variables. TOL values smaller than 0.1 and VIF values greater than 5 indicate a multi-
collinearity problem (Chen et al. 2018).
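
For reference, the two metrics are linked by the standard definitions TOL_j = 1 − R_j² and VIF_j = 1/TOL_j, where R_j² is the coefficient of determination obtained by regressing predictor j on all other predictors. The sketch below assumes R_j² is already available; the function names and the example value of R_j² are hypothetical, and the default limits are the Chen et al. (2018) thresholds quoted above, combined here with a logical "or" as an assumption.

```python
def tolerance_and_vif(r_squared):
    """Return (TOL, VIF) for one predictor, given the R^2 of regressing
    that predictor on all the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    tol = 1.0 - r_squared
    return tol, 1.0 / tol


def flags_collinearity(tol, vif, tol_limit=0.1, vif_limit=5.0):
    """True if either threshold signals a multi-collinearity problem."""
    return tol < tol_limit or vif > vif_limit


# Hypothetical R^2, chosen so that TOL matches the value reported for clay.
tol, vif = tolerance_and_vif(0.663)
print(f"TOL = {tol:.3f}, VIF = {vif:.3f}, problem: {flags_collinearity(tol, vif)}")
```

Because VIF is the reciprocal of TOL, small differences between a tabulated VIF and 1/TOL usually reflect rounding of the tabulated tolerance.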

3.8. Evaluation and comparison metrics


To compare and evaluate the five machine learning algorithms (RF, M5P, REPTree, GP,
and PR), we split our data randomly into two groups: training (70%) and testing (30%)
datasets (Gauch et al. 2003). After dividing soil samples into the training and testing datasets,
we used the 10-fold cross-validation technique during modeling of the training
dataset, as implemented by default in the WEKA software.
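
The splitting procedure can be illustrated with the following standard-library sketch. It is not WEKA's implementation; the sample count of 100, the seed, and the function names are assumptions made only for the example.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=1):
    """Randomly split sample indices into training and testing lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def k_fold_indices(indices, k=10):
    """Yield (fit, holdout) index lists; every index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, holdout


# Hypothetical dataset of 100 soil samples.
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

The testing indices are set aside untouched, and cross-validation is applied only within the training indices, which mirrors the order of operations described above.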

Table 2. Input variables based on sensitivity analysis scenario (SAS).

Input   Removed   Equation
1       Clay      K = f(Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)
2       Silt      K = f(Clay, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)
3       Sand      K = f(Clay, Silt, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)
4       OM        K = f(Clay, Silt, Sand, CEC, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)
5       CEC       K = f(Clay, Silt, Sand, OM, Bd, n, LL, pH, CaCO3, EC, MWD, Fsand)
6       Bd        K = f(Clay, Silt, Sand, OM, CEC, n, LL, pH, CaCO3, EC, MWD, Fsand)
7       n         K = f(Clay, Silt, Sand, OM, CEC, Bd, LL, pH, CaCO3, EC, MWD, Fsand)
8       LL        K = f(Clay, Silt, Sand, OM, CEC, Bd, n, pH, CaCO3, EC, MWD, Fsand)
9       pH        K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, CaCO3, EC, MWD, Fsand)
10      CaCO3     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, EC, MWD, Fsand)
11      EC        K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, MWD, Fsand)
12      MWD       K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, Fsand)
13      Fsand     K = f(Clay, Silt, Sand, OM, CEC, Bd, n, LL, pH, CaCO3, EC, MWD)

3.8.1. Statistical error-based metrics


In this study, we used six accuracy metrics that are typically employed in soil modeling to
evaluate the performance of the machine learning algorithms (Chai and Draxler 2014):
correlation coefficient (CC), coefficient of determination (R2), mean absolute error
(MAE), root mean squared error (RMSE), relative absolute error (RAE), and root relative
squared error (RRSE). CC is a statistic that ranges from −1 to +1 and measures the linear
correlation between the observed and predicted soil erodibility values, where +1 and −1
indicate, respectively, perfect positive and negative correlation. R2 ranges from 0 to +1
and is a measure of the precision of the linear relationship between the observed and pre-
dicted soil erodibility values. MAE measures the average amount of error (the signs of the
errors are removed) between the observed and predicted soil erodibility values. RMSE is a
statistic that is commonly used to measure the accuracy of prediction models; it indicates
how spread-out the residuals are. RAE and RRSE are unit-less statistics that are similar to
MAE and RMSE measurements, but express how large MAE and RMSE values are com-
pared to mean actual soil erodibility values. These statistics are expressed mathematically
as follows:
$$CC = \frac{\sum_{i=1}^{n} O_i P_i - \frac{1}{n}\sum_{i=1}^{n} O_i \sum_{i=1}^{n} P_i}{\sqrt{\left(\sum_{i=1}^{n} O_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} O_i\right)^2\right)\left(\sum_{i=1}^{n} P_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} P_i\right)^2\right)}} \tag{9}$$

$$R^2 = 1 - \frac{SS_{error}}{SS_{total}} \tag{10}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| P_i - O_i \right| \tag{11}$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( P_i - O_i \right)^2} \tag{12}$$

$$RAE = \frac{\frac{1}{n}\sum_{i=1}^{n} \left| P_i - O_i \right|}{\bar{O}} \tag{13}$$

$$RRSE = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( P_i - O_i \right)^2}}{\bar{O}} \tag{14}$$

where O_i and P_i are, respectively, the observed and predicted ith value; n is the number of
samples; and \bar{O} denotes the average observation value. SS_error is the sum of squared
errors between predicted and observed soil erodibility values; and SS_total is the sum of
squared deviations of each observed value from the observed mean.
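
As a worked illustration, the six statistics can be computed directly from Eqs. (9)–(14) as written here, with RAE and RRSE taken as MAE and RMSE divided by the mean observed value. The sketch below uses only the Python standard library; the function name and the three example values are hypothetical. Note that software packages may normalise RAE and RRSE differently (WEKA, for example, scales them by the errors of a predictor that always returns the mean), so values reported by such tools need not match this sketch.

```python
import math


def evaluation_metrics(observed, predicted):
    """Compute CC, R2, MAE, RMSE, RAE and RRSE following Eqs. (9)-(14)."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    sum_o, sum_p = sum(observed), sum(predicted)
    mean_o = sum_o / n

    # Eq. (9): Pearson correlation from raw sums.
    cov = sum(o * p for o, p in zip(observed, predicted)) - sum_o * sum_p / n
    var_o = sum(o * o for o in observed) - sum_o ** 2 / n
    var_p = sum(p * p for p in predicted) - sum_p ** 2 / n
    cc = cov / math.sqrt(var_o * var_p)

    # Eq. (10): coefficient of determination.
    ss_error = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_o) ** 2 for o in observed)
    r2 = 1.0 - ss_error / ss_total

    # Eqs. (11)-(14).
    mae = sum(abs(p - o) for o, p in zip(observed, predicted)) / n
    rmse = math.sqrt(ss_error / n)
    return {"CC": cc, "R2": r2, "MAE": mae, "RMSE": rmse,
            "RAE": mae / mean_o, "RRSE": rmse / mean_o}


# Hypothetical observed and predicted K values.
metrics = evaluation_metrics([0.020, 0.030, 0.040], [0.022, 0.029, 0.041])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```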

Table 3. Tolerances and VIF for independent variables.

Factor   Tolerance   VIF
Clay     0.337       2.971
Silt     0.596       1.679
OM       0.631       1.584
Bd       0.516       1.937
n        0.522       1.916
LL       0.585       1.710
pH       0.866       1.155
CaCO3    0.586       1.706
EC       0.755       1.325
MWD      0.769       1.300
Fsand    0.759       1.318

3.8.2. Taylor diagram


A Taylor diagram graphically facilitates comparison of different models based on three
statistics: correlation coefficient, root-mean-square error, and standard deviation (Elvidge
et al. 2014). A Taylor diagram is used in this study to evaluate multiple aspects of model
performance in predicting the soil erodibility index. Specifically, the performance of each
model is shown on a Taylor diagram by a color dot, and the position of the dot identifies
how closely that algorithm’s predicted soil erodibility index matches observations.
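
The three statistics can be placed on a single diagram because they are geometrically related: for the centred (bias-removed) root-mean-square difference E′, E′² = σ_o² + σ_p² − 2σ_oσ_pR, where σ_o and σ_p are the standard deviations of the observed and predicted values and R is their correlation coefficient. The following sketch computes the four quantities and can be used to check this identity; the function name and the example values are hypothetical.

```python
import math


def taylor_statistics(observed, predicted):
    """Return (std_obs, std_pred, correlation, centred RMS difference)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]

    # Population standard deviations of both series.
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    std_p = math.sqrt(sum(d * d for d in dev_p) / n)

    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)

    # RMS difference after removing the mean of each series.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return std_o, std_p, corr, crmsd


# Hypothetical observed and predicted K values.
std_o, std_p, corr, crmsd = taylor_statistics(
    [0.020, 0.030, 0.040], [0.022, 0.029, 0.041])
print(f"sd_obs={std_o:.5f} sd_pred={std_p:.5f} R={corr:.4f} E'={crmsd:.5f}")
```

A model plots close to the observation point when its standard deviation is similar to the observed one, its correlation is high, and its centred RMS difference is small.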

3.8.3. Normality and T-test (pairwise comparison test)


Asymmetry in the distribution of data has a significant effect on the statistical analysis
(Kerry and Oliver 2007); therefore, we checked the normality of the dataset using the
Kolmogorov–Smirnov statistical test. This test provides an estimate of the degree to which
a given dataset has a particular theoretical distribution (Fu et al. 2010), and is computed
as the largest absolute difference between the observed and theoretical cumulative distri-
bution functions (Zhang et al. 2005).
After testing the normality of the dataset, we used the paired T-test to compare the per-
formances of the five machine learning algorithms. The t-test and the Wilcoxon signed-ranks
test are the most frequently used statistical tests for determining significant differences
between two machine learning algorithms (Wilcoxon 1992). There are two types of T-test—
independent and paired T-tests. The former are applied to groups that are independent of
each other, whereas the latter are used if two groups are co-dependent (Kim 2015). In the
paired T-test, if the difference between the two treatments is close to zero, there is no differ-
ence between them. The Ttest statistic for a paired T-test can be computed as follows:
$$T_{test} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2 - 2\,c.c\,s_1 s_2}{n}}} \tag{15}$$

where \bar{X}_1 and \bar{X}_2 are the sample means of dataset groups 1 and 2, s_1^2 and s_2^2 are the
variances of the variables of dataset groups 1 and 2, c.c is the correlation coefficient between the
two datasets, and n is the number of paired samples (Kim 2015).
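
A small numerical check may help to show what the paired statistic measures. The sketch below assumes the standard paired form, in which the denominator is the square root of (s_1² + s_2² − 2·c.c·s_1·s_2)/n with sample variances s_1², s_2² and n pairs, and compares it with the equivalent computation on the pairwise differences. The function names and data are hypothetical.

```python
import math
import statistics


def paired_t_from_group_statistics(x1, x2):
    """Paired t from group means, sample variances and their correlation."""
    n = len(x1)
    mean1, mean2 = statistics.fmean(x1), statistics.fmean(x2)
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)
    cc = cov / (s1 * s2)
    return (mean1 - mean2) / math.sqrt((s1 ** 2 + s2 ** 2 - 2 * cc * s1 * s2) / n)


def paired_t_from_differences(x1, x2):
    """Paired t computed directly on the pairwise differences."""
    diffs = [a - b for a, b in zip(x1, x2)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical observed and predicted K values.
observed = [0.021, 0.030, 0.035, 0.028, 0.041]
predicted = [0.023, 0.029, 0.037, 0.027, 0.040]
t_groups = paired_t_from_group_statistics(observed, predicted)
t_diffs = paired_t_from_differences(observed, predicted)
print(f"t (group statistics) = {t_groups:.4f}, t (differences) = {t_diffs:.4f}")
```

The two routes agree because the variance of the differences equals s_1² + s_2² − 2·c.c·s_1·s_2; a value of t close to zero indicates that the two groups do not differ systematically.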

4. Results and analysis


4.1. Qualification of the input variables
A multi-collinearity statistical analysis was performed to check the applicability of the
independent variables as model inputs. The results, shown in Table 3, indicate that the
maximum VIF and minimum TOL values are, respectively, 2.971 and 0.337 (clay variable),
confirming that there is no multi-collinearity among the independent variables. In
other words, all variables have a positive role in soil erodibility prediction in the
study area.

Table 4. Pearson correlation coefficients (PCC) and probability values between soil erodibility index and independent variables.

Factor   PCC     p-value
Clay     0.369   0.000
Silt     0.540   0.000
Sand     0.002   0.985
OM       0.334   0.001
CEC      0.412   0.000
Bd       0.056   0.584
n        0.067   0.509
LL       0.463   0.000
pH       0.014   0.888
CaCO3    0.349   0.000
EC       0.101   0.318
MWD      0.055   0.587
Fsand    0.326   0.001

Table 5. Optimal values of the parameters used in each algorithm.

Algorithm   Optimal parameters
RF          Bag size percent: 100. Batch size: 100. Break ties randomly: False. Calculate out-of-bag: False. Debug: False. Do not check capabilities: False. Maximum depth: 0. Number of decimal places: 2. Number of execution slots: 1. Number of features: 0. Number of iterations: 100. Output of out-of-bag complexity statistics: False. Seed: 1. Store out-of-bag prediction: False
REPTree     Debug: False. Maximum depth: 1. Minimum total weight of the instances in a leaf: 2. Minimum variance proportion: 0.001. No pruning: False. Number of folds: 3. Number of decimal places: 2. Number of seeds: 1
M5P         Build regression tree: False. Debug: False. Minimum number of instances: 4. Save instances: False. Unpruned: False. Use unsmoothed: False
GP          Debug: False. Filter type: Normalize training data. Kernel: Normalized polykernel. Noise: 0.27. Number of decimal places: 2. Number of seeds: 1
PR          Debug: False. Estimator: Nested model selector. Threshold: 2
We computed Pearson correlation coefficients (PCC) between each variable and the soil
erodibility index, based on the training dataset, to determine the most important
factors in our predictions (Table 4). The data show that silt has the highest impact on soil
erodibility prediction (PCC = 0.540), whereas sand has the lowest impact (PCC = 0.002).
Overall, the order of importance of the factors for predicting the soil erodibility index in the
study area is silt, LL, CEC, clay, CaCO3, OM, Fsand (fine sand), EC, n, Bd, MWD, pH, and sand.
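
The ranking can be reproduced from the values in Table 4 by sorting the variables by the magnitude of their PCC. The sketch below uses the tabulated values as printed; the 0.05 significance threshold is an assumption made for the example and is not stated in the text.

```python
# PCC magnitudes and p-values as printed in Table 4.
PCC = {"Clay": 0.369, "Silt": 0.540, "Sand": 0.002, "OM": 0.334,
       "CEC": 0.412, "Bd": 0.056, "n": 0.067, "LL": 0.463, "pH": 0.014,
       "CaCO3": 0.349, "EC": 0.101, "MWD": 0.055, "Fsand": 0.326}
P_VALUE = {"Clay": 0.000, "Silt": 0.000, "Sand": 0.985, "OM": 0.001,
           "CEC": 0.000, "Bd": 0.584, "n": 0.509, "LL": 0.000, "pH": 0.888,
           "CaCO3": 0.000, "EC": 0.318, "MWD": 0.587, "Fsand": 0.001}

# Rank variables from strongest to weakest correlation with K.
ranking = sorted(PCC, key=lambda name: abs(PCC[name]), reverse=True)

# Variables whose correlation is significant at the assumed 0.05 level.
significant = [name for name in ranking if P_VALUE[name] < 0.05]

print("Ranking:", ", ".join(ranking))
print("Significant:", ", ".join(significant))
```

With this threshold, the seven highest-ranked variables (silt to fine sand) are significantly correlated with K, whereas EC, n, Bd, MWD, pH, and sand are not.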

4.2. Modelling process and analysis


The optimal values of the parameters used in each model determine the performance and
predictive power of the model. These values were determined by applying a trial-and-
error technique to the training dataset. The parameters that most affect the model results
are number of seeds, number of iterations, number of folds, and kernel functions
(Table 5).

4.2.1. Model performance based on the input-enter scenario


We first ran all five algorithms on the training dataset and then validated them with the
validation dataset. Each input, from 1 to 13 (Table 1), was entered into the algorithm,
and the performance statistics (CC, MAE, RMSE, RAE, and PRSE, i.e. the RRSE of Sec. 3.8.1)
were computed. The results are shown in Figure 2(a)–(j). For the RF algorithm, the training
dataset yielded values of 0.982 (CC), 0.0015 (MAE), 0.0021 (RMSE), 0.251 (RAE), and 0.270
(PRSE) (Figure 2a). Corresponding values for the validation dataset are 0.901, 0.0050, 0.0061,
0.671, and 0.683 (Figure 2b). Input 13 has the highest performance and prediction accuracy,
followed by inputs 7, 10, 12, and 6. Inputs 1, 2, 3, and 4 have the lowest performance and
prediction accuracy.

Results for the M5P algorithm are shown in Figure 2(c) and 2(d). The highest value
for CC (0.892) and the lowest error values for MAE (0.0029), RMSE (0.0037), RAE
(0.433), and PRSE (0.453) were computed for input 13 (Figure 2c), indicating that all 13
factors are effective and applicable for predicting the soil erodibility index. For the validation
dataset, the corresponding values for input 13 are 0.884, 0.0043, 0.0053, 0.512, and 0.521,
respectively (Figure 2d). The poorest prediction is provided by input 1 (clay) in both the
training and validation datasets.
Results for the REPTree algorithm are shown in Figure 2(e) and 2(f). Input 13 has the
best performance, with error values for the training dataset of 0.969 (CC), 0.0024 (MAE),
0.0033 (RMSE), 0.471 (RAE), and 0.532 (PRSE) (Figure 2e). Corresponding values for the
validation data set are 0.867, 0.0038, 0.0046, 0.590, and 0.606 (Figure 2f). Again, input 1
yielded the poorest prediction for both the training dataset (CC = 0.670, MAE = 0.0051,
RMSE = 0.0062, RAE = 0.650, and PRSE = 0.790) and validation dataset (CC = 0.630,
MAE = 0.0065, RMSE = 0.0077, RAE = 0.850, and PRSE = 0.852).
Results for the GP algorithm are shown in Figure 2(g) and 2(h). Input 13 yielded error
values for the training dataset of 0.930 (CC), 0.0019 (MAE), 0.0023 (RMSE), 0.373 (RAE),
and 0.374 (PRSE) (Figure 2g). Corresponding values for validation dataset are 0.918,
0.0032, 0.0038, 0.495, and 0.493 (Figure 2h). Input 4 provided the next highest prediction
accuracy for the validation dataset (CC = 0.913, MAE = 0.0029, RMSE = 0.0037, RAE =
0.447, and PRSE = 0.477), followed by input 5 (CC = 0.908, MAE = 0.0029, RMSE =
0.0037, RAE = 0.451, and PRSE = 0.484). Input 1 provided the poorest goodness-of-fit
and prediction accuracy.
Results for the PR algorithm are shown in Figure 2(i) and 2(j). Once again, input 13
yielded the highest performance, followed by inputs 4, 10, 11, and 12. The lowest per-
formance is for input 1, followed by inputs 6, 7, 8, 9, and 5. Error values for input 13 in
the training dataset are 0.892 (CC), 0.0022 (MAE), 0.0028 (RMSE), 0.433 (RAE), and 0.453
(PRSE); corresponding values for the validation dataset are 0.873, 0.0034, 0.0025, 0.528,
and 0.541.
Overall, the highest goodness-of-fit (training dataset) and prediction accuracy (validation
dataset) were obtained for input 13, which includes clay, silt, sand, OM, CEC, Bd,
n, LL, pH, CaCO3, EC, MWD, and Fsand (fine sand). The lowest goodness-of-fit and
prediction accuracy were returned for input 1. Based on the validation dataset, the GP
algorithm (CC = 0.918) has the highest predictive power for soil erodibility, followed by RF
(CC = 0.901), M5P (CC = 0.884), PR (CC = 0.873), and REPTree (CC = 0.867).

4.2.2. Model performance based on sensitivity analysis scenario


For each algorithm, the sensitivity analysis technique was used to identify variables with
no predictive ability. Results based on the validation dataset are shown in Figure 3(a)–(h).
In the case of the RF algorithm, the model is sensitive to sand, OM, n, LL, CaCO3, and
EC. We removed these variables from the modelling process, optimizing the values of CC,
MAE, RMSE, RAE, and PRSE (Figure 3a). Removing sand gave values of 0.866, 0.0043, 0.0052,
0.652, and 0.680 for CC, MAE, RMSE, RAE, and PRSE, respectively. Without the OM variable,
these values were 0.868, 0.0042, 0.0051, 0.651, and 0.669; without Bd, 0.865, 0.0042, 0.0051,
0.651, and 0.663; without n, 0.871, 0.0042, 0.0051, 0.648, and 0.662; without LL, 0.890,
0.0040, 0.0049, 0.613, and 0.644; without CaCO3, 0.890, 0.0040, 0.0049, 0.613, and 0.641; and
without EC, 0.870, 0.0040, 0.0049, 0.616, and 0.644 (Figure 3a). With all 13 variables
(input 13), the values of CC, MAE, RMSE, RAE, and PRSE are the same as those obtained with
the input-enter scenario. After removing all of the sensitive variables mentioned above, the
values of CC, MAE, RMSE, RAE, and PRSE based on the validation dataset were 0.902,
0.0038, 0.0046, 0.583, and 0.604, respectively (Figure 3b).

Figure 2. Goodness-of-fit (training) and prediction accuracy (validation) of the machine learning regression models
based on input strategy. (a) RF model, training dataset. (b) RF model, validation dataset. (c) M5P model, training dataset.
(d) M5P model, validation dataset. (e) REPTree model and all twelve variables. (f) REPTree model, training dataset.
(g) GP model, validation dataset. (h) GP model, training dataset. (i) PR model, training dataset. (j) PR model, validation
dataset (Ton. acre. hr. (hundreds of acre. ft. ton f. in)⁻¹).
The results of SAS for the M5P algorithm are shown in Figure 3(c). The input-enter
scenario and sensitivity analysis scenario yielded similar results. The best prediction
accuracy (CC = 0.884, MAE = 0.0033, RMSE = 0.0040, RAE = 0.512, and PRSE =
0.521) was obtained for input 13, which includes all 13 soil variables.

Figure 3. Prediction accuracy of the machine learning regression models based on the sensitivity analysis scenario
and the validation dataset. (a) RF model and all variables. (b) RF model and six predictive variables. (c) M5P and all
variables. (d) REPTree model and all variables. (e) REPTree model and ten predictive variables. (f) GP model and all
variables. (g) PR model and all variables. (h) PR model and six predictive variables (Ton. acre. hr. (hundreds of acre. ft.
ton f. in)⁻¹).
Results for the REPTree algorithm are shown in Figure 3(d). In this case, CEC and
Bd are the most sensitive variables. By removing these two variables, we optimized
the values, obtaining a prediction accuracy (CC = 0.867, MAE = 0.0034, RMSE = 0.0042,
RAE = 0.520, and PRSE = 0.551) better than that obtained using the input-enter scenario
(Figure 3e).
Results for the GP algorithm are similar to those obtained using the IES technique.
The best prediction accuracy (CC = 0.918, MAE = 0.0032, RMSE = 0.0038, RAE =
0.495, and PRSE = 0.493) was obtained by including all 13 soil variables in the model
(Figure 3f).

Finally, the results for the PR algorithm are shown in Figure 3(g) and 3(h). In this
case, clay and CEC are the most sensitive variables for predicting the soil erodibility
index. In the absence of these two variables, CC, MAE, RMSE, RAE and PRSE are,
respectively, 0.891, 0.0022, 0.0028, 0.431, and 0.453 (Figure 3h). With all 13 variables
included, these values are, respectively, 0.873, 0.0034, 0.0041, 0.529, and 0.541 (Figure 3g).
After optimizing the best input parameters for the predictive modelling based on SAS, we
found that the GP algorithm (CC = 0.918) has the highest predictive power for soil
erodibility prediction, followed by RF (CC = 0.902), PR (CC = 0.891), M5P (CC = 0.884),
and REPTree (CC = 0.867).

4.3. Model performance and comparison assessment


After identifying the configuration with the highest predictive power for each algorithm using
the SAS and IES techniques, we ran the algorithms on the training and validation datasets to
estimate the soil erodibility index and finally compared the results with observed data. The results
of the modelling procedure for the five machine learning regression algorithms are shown
in Figures 4–8. Figures 4(a)–8(a) summarize the trends of observed and predicted soil
erodibility indexes based on the training dataset; the corresponding error values (MAE and
RMSE) are shown in Figures 4(b)–8(b). Corresponding measures for the validation dataset are
shown, respectively, in Figures 4(c)–8(c) and Figures 4(d)–8(d). The means and standard
deviations of the training and validation datasets are shown, respectively, in Figures 4(e)–8(e)
and Figures 4(f)–8(f). Finally, Figures 4(g)–8(g) and Figures 4(h)–8(h) are, respectively,
scatterplots of observed versus predicted soil erodibility indexes in the training and validation
datasets. The GP algorithm yielded the highest coefficient of determination (R2)
between the observed and predicted K values (0.843), followed by the RF (0.812), PR
(0.794), M5P (0.781), and REPTree (0.752) algorithms.

Figure 4. Goodness-of-fit and predictive power of the RF algorithm. (a) Trend of the actual and predicted K
using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using
the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the
training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of actual vs. predicted
K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (Ton. hr. MJ⁻¹. mm⁻¹).

Figure 5. Goodness-of-fit and predictive power of the REPTree algorithm. (a) Trend of actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of actual and predicted K using the
validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the training
dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of the actual vs. predicted K
in the training phase. (h) Scatterplot of the actual vs. predicted K in the validation phase (Ton. hr. MJ⁻¹. mm⁻¹).
We also compared the models using Taylor diagrams (Figure 9), which evaluate multiple
aspects of model performance simultaneously. The correlation, standard deviation, and RMSE
were computed for the five machine learning regression algorithms, and a dot of a different
colour was assigned to each model. The position of each dot on the plot quantifies how closely
that algorithm’s simulated soil erodibility index matches observations. Results show that the
GP algorithm best predicts the soil erodibility index based on the training (Figure 9a) and
validation (Figure 9b) datasets.

Figure 6. Goodness-of-fit and predictive power of the PR algorithm. (a) Trend of the actual and predicted K
using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using
the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the
training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of actual vs. predicted
K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (Ton. hr. MJ⁻¹. mm⁻¹).
A boxplot (Figure 10) shows that all five algorithms closely predict the observed maximum
soil erodibility index (0.0474 T hr MJ⁻¹ mm⁻¹). The observed minimum soil erodibility
index (0.0111 T hr MJ⁻¹ mm⁻¹) is best predicted by REPTree, followed, in order, by
GP, PR, M5P, and RF. The ranking for the first quartile (25%; Q1) is GP, PR,
M5P, RF, and REPTree, and for the third quartile (75%; Q3), REPTree, PR, M5P, GP,
and RF. The median value of the observed soil erodibility index (0.03109 T hr
MJ⁻¹ mm⁻¹) is best predicted by the GP and RF algorithms. These results indicate that,
although GP generally is the most accurate algorithm, it did not predict extreme values well.

Figure 7. Goodness-of-fit and predictive power of the M5P algorithm. (a) Trend of the actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted K using the
validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard deviation values of the training
dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot of the actual vs. predicted K
in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase (Ton. hr. MJ⁻¹. mm⁻¹).

4.4. Statistical differences between models


This section evaluates and compares the performance of the algorithms using pairwise
statistics. We first checked the normality of the dataset using the one-sample
Kolmogorov-Smirnov normality test (Table 6). The results indicate that the observed values and
the predictions of all five machine learning models are normally distributed, and therefore the
performance of the models can be compared using the paired sample T-test (two-tailed).
Results, shown in Table 7, indicate that all algorithms performed well; there is no statistically
significant difference between the predictions of each algorithm and the observed soil
erodibility index values at the 95% confidence level.

Figure 8. Goodness-of-fit and predictive power of the GP algorithm. (a) Trend of the actual and predicted
K using the training dataset. (b) MAE and RMSE of the training dataset. (c) Trend of the actual and predicted soil
erodibility index using the validation dataset. (d) MAE and RMSE of the validation dataset. (e) Mean and standard
deviation values of the training dataset. (f) Mean and standard deviation values of the validation dataset. (g) Scatterplot
of the actual vs. predicted K in the training phase. (h) Scatterplot of actual vs. predicted K in the validation phase
(Ton. hr. MJ⁻¹. mm⁻¹).

Figure 9. Taylor diagrams comparing the performance of the models. (a) Training dataset. (b) Validation dataset.

Figure 10. Box plot comparing model performances (Ton. hr. MJ⁻¹. mm⁻¹).

5. Discussion
5.1. Relation of factors to soil erodibility
Multi-collinearity analysis indicated that there are no strong correlations between inde-
pendent variables used in this study (Table 3). Therefore, all independent variables
could be used as input in fitting models without problems arising from multi-collinearity.
Several approaches such as correlation analysis and sensitivity analysis can be used
to select important soil properties that contribute to soil erodibility and also to inter-
pret the effects of soil properties on differences in the spatial distribution and variabil-
ity of soil erodibility. In this study, there are high negative correlations between soil
erodibility and clay, OM, CEC, LL, and CaCO3 (Table 4). We note that both physical
and chemical soil properties are important in explaining the susceptibility of soils to
water erosion.
Soil texture plays a key role in erosion. Due to cohesion, clay particles are less suscep-
tible to erosion and transport by runoff than silt and sand (Morgan 1980). Also clay par-
ticles form large stable aggregates that are resistant to raindrop impact and erosion by
overland flow (Belasri et al. 2017). An exception is clay minerals that are susceptible to
expansion and contraction (e.g. smectites). No significant correlation (PCC = 0.002,
p-value = 0.985) was found between soil erodibility and sand content (Table 4); sand
promotes infiltration and reduces overland flow (Pérez-Rodríguez et al. 2007; Efthimiou
2020). As shown in Table 4, soils dominated by silt and fine sand have high K values and
are most vulnerable to water erosion. In addition, these soils commonly have fractured
superficial crusts that favour erosion (Pérez-Rodríguez et al. 2007). Yang et al. (2018)
concluded that soils with high clay contents have relatively high resistance to erosion, and
many other researchers have reported the same finding (e.g. Vaezi et al. 2008, 2016;
Bonilla and Johnson 2012; Shabani et al. 2014).
All other things being equal, soil resistance to erosion increases with an increase in
organic matter content, because soil OM is an important binding agent, producing stable
aggregates (Wang et al. 2016; Liu et al. 2020). Soils with less than 3.5% organic matter are
considered erodible (Evans 1980). Soil OM may also increase infiltration and thus reduce
surface runoff and erosion (Rodríguez et al. 2006; Tejada and Gonzalez 2006).
Cation exchange capacity can also have a significant effect on soil erodibility (Table 4).
CEC typically covaries with clay and OM contents. Abbaslou et al. (2020) reported that
high-CEC soils are less susceptible to dispersion than low-CEC ones. We found a negative
correlation between soil erodibility and CaCO3 content (Table 4). CaCO3 promotes par-
ticle flocculation, cementation, and thus the formation of soil aggregates (Wuddivira and
Camps-Roach 2007; Abbaslou et al. 2020). Vaezi et al. (2008) and Ostovari et al. (2016)
found that CaCO3 plays a key role in reducing erosion in calcareous soils. The same
authors also argued that stable surface aggregates are resistant to erosion by rainfall
splash. We found, however, that aggregate stability (MWD) is not significantly correlated
with soil erodibility (PCC = 0.055 and p-value = 0.587) (Table 4). Parysow et al. (2003)
found that the soil structure explained only 6.53% of the variability of soil erodibility, per-
haps because of the poor structure of soils in their semi-arid study area. Vaezi et al.
(2008) argued that soil texture is more important than soil structure in reducing soil
erodibility in semi-arid regions, and Blanco and Lal (2008) concluded that only strong
stable soil aggregates reduce soil erosion.
Soil erodibility is also negatively correlated with liquid limit, probably because the latter
covaries with clay content (Ozdemir and Gülser 2017). High clay content can contribute
to soil cohesiveness and aggregate stability. Vacchiano et al. (2014) and Curtaz et al.
(2015) advocated the use of Atterberg limits as indicators of the susceptibility of soils
to erosion.

5.2. Input selection and sensitivity analysis scenarios


Using the input-enter scenario, we found that the poorest estimates of K were obtained
when only clay content was used as an explanatory variable. Clay content alone provides
only a rough measure of soil erodibility; K is controlled by the many other soil properties
and their interactions (Wang et al. 2016). Bonilla and Johnson (2012) concluded that soil
erodibility is poorly predicted by particle-size distribution alone. We conclude, based on
the IES, that even though some individual variables have significant correlations
with erodibility, the best estimates of K are achieved by including all variables in the
models (Figure 2). The relationship between soil erodibility and soil properties is not sim-
ple and linear and therefore cannot be captured by a simple bivariate correlation analysis.
Validation statistics (CC, MAE, RMSE, RAE, PRSE) show different patterns as the number of
variables changes, depending on the dataset (training or validation), the algorithm, and the type
of statistic (Figure 2). For example, the RF algorithm yielded the lowest CC when only one
variable was used in the training dataset, but the value of CC increased when two variables
were modelled. However, an increase in the number of variables beyond two has no
significant effect on CC. A different pattern emerged in the case of the validation dataset:
CC increased as the number of input variables was increased up to seven, followed by a
decrease, and finally an increase when all 13 variables were included as input. MAE and
RMSE were not markedly affected by the number of input variables in the case of the
training dataset, whereas RAE and PRSE decreased and increased, respectively, when the
number of input variables increased from one to two. Likewise, MAE and RMSE were
not affected by the number of input variables for the validation dataset. General patterns
of RAE and PRSE for the validation data are similar to those of the training data, except
that PRSE decreased when the number of input variables was increased to two.

Table 6. One-sample Kolmogorov-Smirnov normality test of the machine learning regression models.

Parameter                 Observed   RF         REPTree    PR         M5P        GP
Mean                      3.09E-02   4.42E-02   3.08E-02   4.58E-02   3.09E-02   3.09E-02
Standard deviation        8.20E-03   6.60E-03   7.90E-03   7.30E-03   7.30E-03   8.20E-03
Absolute                  0.066      0.080      0.120      0.118      0.118      0.086
Positive                  0.048      0.067      0.120      0.118      0.118      0.086
Negative                  0.066      0.080      0.074      0.089      0.090      0.067
Kolmogorov-Smirnov Z      0.553      0.667      1.005      0.986      0.989      0.721
Significance (2-tailed)   0.920      0.765      0.265      0.285      0.282      0.676

Table 7. Performance of the machine learning regression models using the paired sample T-test (two-tailed).
Models     Mean      St.D.     SDEM      CID lower  CID upper  t      Sig.   D
O-RF       7.30E-05  2.08E-03  2.48E-04  4.20E-04   5.69E-04   0.296  0.768  NO
O-REPTree  1.50E-05  2.03E-03  2.42E-04  4.98E-04   4.68E-04   0.062  0.951  NO
O-M5P      8.00E-05  3.71E-03  4.43E-04  8.76E-04   8.91E-04   0.017  0.986  NO
O-PR       6.00E-05  3.71E-03  4.43E-04  8.79E-04   8.90E-04   0.013  0.990  NO
O-GP       2.10E-05  3.07E-03  3.67E-04  7.10E-04   7.52E-04   0.056  0.955  NO
Notes: O: Observed. SDEM: St.D error mean. CID: 95% confidence interval of the difference. Sig.: Significance.
D: Difference.

The results obtained using the sensitivity analysis scenario are similar to those reported
above. Specifically, with SAS, the best estimates of K were obtained when all independent
variables were used to model soil erodibility. Removing input variables one by one did
not significantly affect MAE, RMSE, CC, RAE, or PRSE, except when using the
REPTree model.
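
The one-by-one removal of input variables can be sketched as a simple loop. This is an illustrative sketch only: ordinary least squares stands in for the five algorithms used in the study, and the variable names, synthetic data, and function names are all hypothetical.

```python
import numpy as np

def fit_predict_linear(x_train, y_train, x_test):
    """Stand-in model: ordinary least squares with an intercept."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([np.ones(len(x_test)), x_test])
    return a_test @ coef

def leave_one_variable_out(x_train, y_train, x_test, y_test, names):
    """RMSE on the test split after dropping each input variable in turn."""
    def rmse(cols):
        pred = fit_predict_linear(x_train[:, cols], y_train, x_test[:, cols])
        return float(np.sqrt(np.mean((pred - y_test) ** 2)))

    all_cols = list(range(x_train.shape[1]))
    results = {"all variables": rmse(all_cols)}
    for j, name in enumerate(names):
        kept = [c for c in all_cols if c != j]
        results[f"without {name}"] = rmse(kept)
    return results

# Synthetic data: the response depends on the first two inputs only.
rng = np.random.default_rng(0)
names = ["silt", "clay", "organic_matter"]  # hypothetical subset of inputs
x = rng.normal(size=(70, 3))
y = 0.03 + 0.004 * x[:, 0] - 0.003 * x[:, 1] + rng.normal(scale=0.001, size=70)
results = leave_one_variable_out(x[:50], y[:50], x[50:], y[50:], names)
for label, value in results.items():
    print(f"{label}: RMSE = {value:.5f}")
```

In this synthetic case, dropping an influential input raises the test RMSE, whereas dropping an uninformative one leaves it nearly unchanged. The same comparison can be repeated for any of the other validation statistics.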
The IES and SAS scenario results indicate the considerable potential for using
Gaussian process regression (GPR) in modeling soil erodibility. GPR is a powerful tech-
nique that, in spite of the simplicity of regression models, reveals complex relations
(Ballabio et al. 2019). By providing a sound framework in kernel machines, GPR offers
the benefits of selecting models, as well as interpreting their predictions (Rasmussen and
Williams 2006). Its main drawback, when large datasets are involved, is its high computa-
tional time (Ballabio et al. 2019). Possible approaches for dealing with this problem
include the Nyström kernel matrix approximation (Drineas and Mahoney 2005) and massive
parallel processing (Ballabio et al. 2019).
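
To illustrate how a Nyström approximation reduces the cost of working with a kernel matrix, the sketch below builds a low-rank approximation of an n x n squared-exponential kernel matrix from m landmark points. It is a simplified illustration, not the authors' implementation: landmarks are drawn uniformly at random (Drineas and Mahoney analyze non-uniform sampling), and the kernel, length scale, synthetic data, and function names are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def nystrom_approximation(x, n_landmarks, length_scale=1.0, seed=0):
    """Approximate the full n x n kernel matrix from m landmark rows."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(x.shape[0], size=n_landmarks, replace=False)
    k_nm = rbf_kernel(x, x[idx], length_scale)  # n x m block
    k_mm = k_nm[idx, :]                         # m x m landmark block
    # K is approximated by K_nm K_mm^+ K_nm^T (pseudo-inverse for stability).
    return k_nm @ np.linalg.pinv(k_mm) @ k_nm.T

# Synthetic inputs: 200 samples with 13 variables (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 13))
k_full = rbf_kernel(x, x, length_scale=4.0)
k_approx = nystrom_approximation(x, n_landmarks=50, length_scale=4.0)
rel_error = np.linalg.norm(k_full - k_approx) / np.linalg.norm(k_full)
print(f"relative Frobenius error: {rel_error:.3f}")
```

Only the n x m and m x m blocks need to be evaluated and inverted, so the cost grows with the number of landmarks rather than with the cube of the sample size. The accuracy of the approximation depends on the kernel and on how the landmarks are chosen.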

5.3. Assessment of model performance


The general performances of the five models used in this study are similar (Figures 4–8).
Although the observed and predicted erodibility values for both the training and valid-
ation datasets differ depending on the model used, the general patterns of their variations
are similar. However, the differences between observed and predicted values are greater for the
validation dataset than for the training dataset. This was expected because validation
data were not used to train the models. For all models, the residuals of both the training
and validation datasets lack consistency, indicating that the predictive ability of the
explanatory variables and models differs among sampling locations. The scatterplots of measured
versus predicted erodibility suggest some estimation bias (overestimation and underesti-
mation), especially for the training data (Figures 4–8). The prediction bias of the RF and
REPTree algorithms is apparent only at extreme values, whereas PR, M5P, and GP result
in biases over the full range of observed values. Observations that are far away from the
overall trend of the data will result in large residuals (Kuhn and Johnson 2013).
Nevertheless, the normal distribution of residuals obtained by all models (Figures 4–8)
suggests that regression lines follow the trend of the majority of the data.
Taylor diagrams, which are graphical representations of model performance based on
different validation statistics (R2, RMSE, and standard deviation), were used to facilitate
comparisons of the model training and validation datasets (Figure 9). The K factor of
Wischmeier and Smith (1978), expressed as Eq. (1), was used as the reference for the
comparisons. Although the Taylor diagram for the training data (Figure 9a) indicates that
the GP and M5P algorithms perform about the same, GP has a slightly lower RMSE than
M5P, whereas M5P has a lower standard deviation than GP (Figure 9a). The overlap of
PR and REPTree in the Taylor diagram suggests that their performance, in terms of the
validation statistics, is about the same. The pattern for the validation dataset is somewhat
different (Figure 9b). All models have R2 values around 0.9, with the value for RF a little
higher than that for GP, which in turn is higher than values for REPTree, PR, and GP, in
that order. RF, REPTree, and PR plot closer than GP and M5P to the observed data point
in Figure 9(b), indicating lower RMSE values. Zhao et al. (2018) also used a Taylor dia-
gram to compare the performance of five models in predicting soil erodibility in China.
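
The quantities summarized in a Taylor diagram can be computed directly from the reference and modelled series. The sketch below follows the standard construction of the diagram (correlation coefficient, standard deviations, and centred root-mean-square difference), which may differ in detail from the statistics plotted in Figure 9. The function name and the sample values are hypothetical.

```python
import math

def taylor_statistics(reference, model):
    """Return (std_ref, std_model, correlation, centred RMS difference)."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_m = sum(model) / n
    dev_r = [r - mean_r for r in reference]
    dev_m = [m - mean_m for m in model]

    # Population standard deviations (denominator n).
    std_r = math.sqrt(sum(d * d for d in dev_r) / n)
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    corr = sum(a * b for a, b in zip(dev_r, dev_m)) / (n * std_r * std_m)
    # Centred RMS difference: the means are removed before differencing.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_r, dev_m)) / n)
    return std_r, std_m, corr, crmsd

# Hypothetical K values, for illustration only (not data from this study).
reference = [0.025, 0.031, 0.036, 0.042, 0.028]
model = [0.027, 0.030, 0.034, 0.040, 0.031]
std_r, std_m, corr, crmsd = taylor_statistics(reference, model)

# Law-of-cosines identity that lets one diagram show all three statistics.
identity = math.sqrt(std_r**2 + std_m**2 - 2.0 * std_r * std_m * corr)
print(f"std_ref={std_r:.5f} std_model={std_m:.5f} r={corr:.3f}")
print(f"centred RMSD={crmsd:.5f} (from identity: {identity:.5f})")
```

A model plots close to the reference point when its standard deviation is close to that of the reference and its correlation is high, which together imply a small centred RMS difference.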

5.4. Statistical differences among the models


Although all models differed in their predictions of soil erodibility, validation results
show similar performances. A paired sample T-test was conducted to reveal the magni-
tude of the difference between observed and predicted soil erodibility. Results indicate
that model estimates of soil erodibility are not significantly different from those based on
observed values, confirming the value of the machine learning algorithms in modeling
soil erodibility using a set of readily available soil properties.
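
The paired-sample comparison rests on the differences between observed and predicted values at the same locations. The sketch below computes the mean difference, its standard deviation, the standard error of the mean (SDEM), and the t statistic with the Python standard library. The function name and the sample values are hypothetical. As a rough check against Table 7, the ratio of mean to SDEM for O-RF, 7.30E-05 / 2.48E-04, is about 0.29, close to the reported t of 0.296.

```python
import math
import statistics

def paired_t_statistic(observed, predicted):
    """Return mean difference, its standard deviation, SDEM and t."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)       # sample SD (n - 1 denominator)
    sdem = sd_diff / math.sqrt(len(diffs))  # standard error of the mean
    return mean_diff, sd_diff, sdem, mean_diff / sdem

# Hypothetical K values, for illustration only (not data from this study).
observed = [0.025, 0.031, 0.036, 0.042, 0.028, 0.033]
predicted = [0.027, 0.030, 0.034, 0.040, 0.031, 0.032]
mean_diff, sd_diff, sdem, t_value = paired_t_statistic(observed, predicted)
print(f"mean difference = {mean_diff:.6f}")
print(f"SD = {sd_diff:.6f}, SDEM = {sdem:.6f}, t = {t_value:.3f}")
```

The two-tailed significance and the confidence interval additionally require the t distribution with n - 1 degrees of freedom, available for example through `scipy.stats.ttest_rel`. A t value near zero, as in Table 7, indicates that the mean difference is small relative to its standard error.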

6. Conclusion
We used five machine learning algorithms and different combinations of readily available
soil data to predict soil erodibility in the Dehgolan region of Iran. Pearson correlation
analysis shows a high and positive contribution of silt, fine sand, sand, bulk density, and
infiltration to soil erodibility in the study area. A high and negative relationship exists
between clay, organic matter, cation exchange capacity, liquid limit, pH, CaCO3, and elec-
trical conductivity, on one hand, and soil erodibility on the other. These results indicate
that soil erosion could perhaps be reduced through remediation measures that enhance
soil clay, OM, CEC, LL, pH, CaCO3, and EC. The USLE was developed for non-calcar-
eous soils, but our results indicate that including CaCO3 in soil erodibility models might
produce better results.
The input-enter and sensitivity analysis scenarios revealed that the poorest estimates of
soil erodibility were obtained when only one input variable was used in modeling, irre-
spective of the algorithm. The best results were obtained when all 13 variables were used.
SAS revealed that the set of input variables that most affect the model outputs differ
among the five models applied in this study. Comparisons of the IES and SAS results
with the results of the simple correlation analysis indicate that the algorithms used in this
study effectively reveal the complex relationships between soil properties and soil
erodibility.
Validation statistics and statistical analyses show that the five machine learning algo-
rithms successfully modeled soil erodibility, although GP generally outperformed the
other algorithms. The IES and SAS results indicate that the best predictive model is GP
using all 13 independent variables. We suggest that this model can be applied as a power-
ful tool for predictive modeling of soil erosion in countries such as Iran, which lack reli-
able soil data. We further recommend including other independent variables, such as
terrain attributes and remotely sensed data, to improve the performance of these models.
Finally, because the GP models are computationally complex, especially for spatial modeling,
approaches such as Nyström kernel matrix approximation and massive parallel processing
are recommended to split datasets into a more manageable size.

Disclosure statement
The authors declare that they have no known competing financial interests or personal relationships that
could have appeared to influence the work reported in this paper.

Funding
This research was supported by the University of Kurdistan, Iran, with two grants (nos. 99-11-32657-6
and 00-9-27617).

ORCID
Ataollah Shirzadi http://orcid.org/0000-0003-1666-1180
Himan Shahabi http://orcid.org/0000-0001-5091-6947
John J. Clague http://orcid.org/0000-0002-2697-2233

References
Abbaslou H, Hadifard H, Ghanizadeh AR. 2020. Effect of cations and anions on flocculation of dispersive
clayey soils. Heliyon. 6(2):e03462.
Aksakal EL, Angin I, Oztas T. 2013. Effects of diatomite on soil consistency limits and soil compactibility.
Catena. 101:157–163.
Alkharabsheh MM, Alexandridis T, Bilas G, Misopolinos N, Silleos N. 2013. Impact of land cover change
on soil erosion hazard in northern Jordan using remote sensing and GIS. Procedia Environ Sci. 19:
912–921.
Azzouz AS, Krizek RJ, Corotis RB. 1976. Regression analysis of soil compressibility. Soils Found. 16(2):
19–29.
Ballabio C, Lugato E, Fernandez-Ugalde O, Orgiazzi A, Jones A, Borrelli P, Montanarella L, Panagos P.
2019. Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regres-
sion. Geoderma. 355:113912.
Barzegar R, Moghaddam AA, Deo R, Fijani E, Tziritis E. 2018. Mapping groundwater contamination risk
of multiple aquifers using multi-model ensemble of machine learning algorithms. Sci Total Environ.
621:697–712.
Bastien P, Vinzi VE, Tenenhaus M. 2005. PLS generalised linear regression. Comput Stat Data Anal.
48(1):17–46.
Baumgartl T. 2002. Atterberg limits. In: Encyclopedia of soil science. New York: Marcel Dekker Inc; p.
89–93.
Belasri A, Lakhouili A, Halima OI. 2017. Soil erodibility mapping and its correlation with soil properties
of Oued El Makhazine watershed, Morocco. Forestry. 2(3):4.
Benedetto A. 2010. Water content evaluation in unsaturated soil using GPR signal analysis in the fre-
quency domain. J Appl Geodesy. 71(1):26–35.
Blanco H, Lal R. 2008. Principles of soil conservation and management. New York: Springer.
Bonilla CA, Johnson OI. 2012. Soil erodibility mapping and its correlation with soil properties in Central
Chile. Geoderma. 189–190:116–123.
Bouslihim Y, Rochdi A, Aboutayeb R, El Amrani-Paaza N, Miftah A, Hssaini L. 2021. Soil aggregate sta-
bility mapping using remote sensing and GIS-based machine learning techniques. Frontiers Earth Sci.
9:1–13.
Breiman L. 2001. Random forests. Machine Learn. 45(1):5–32.
Breiman L, Friedman J, Stone CJ, Olshen RA. 1984. Classification and regression trees. England (UK):
CRC Press.
Bruce JP, Frome M, Haites E, Janzen H, Lal R, Paustian K. 1999. Carbon sequestration in soils. J Soil
Water Conserv. 54(1):382–389.
Canillas EC, Salokhe VM. 2001. Regression analysis of some factors influencing soil compaction. Soil
Tillage Res. 61(3–4):167–178.
Chai T, Draxler RR. 2014. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments
against avoiding RMSE in the literature. Geosci Model Dev. 7(3):1247–1250.
Chau JF, Bagtzoglou AC, Willig MR. 2011. The effect of soil texture on richness and diversity of bacterial
communities. Environ Forensics. 12(4):333–341.
Chauchard S, Carcaillet C, Guibal F. 2007. Patterns of land-use abandonment control tree-recruitment
and forest dynamics in Mediterranean mountains. Ecosystems. 10(6):936–948.
Chen W, Zhang S, Li R, Shahabi H. 2018. Performance evaluation of the GIS-based data mining techni-
ques of best-first decision tree, random forest, and naïve Bayes tree for landslide susceptibility model-
ing. Sci Total Environ. 644:1006–1018.
Covelo E, Matías J, Vega F, Reigosa M, Andrade M. 2008. A tree regression analysis of factors determin-
ing the sorption and retention of heavy metals by soil. Geoderma. 147(1–2):75–85.
Curtaz F, Stanchi S, D’Amico ME, Filippa G, Zanini E, Freppaz M. 2015. Soil evolution after land-reshap-
ing in mountains areas (Aosta Valley, NW Italy). Agriculture Ecosyst Environ. 199:238–248.
Deepa C, SathiyaKumari K, Sudha VP. 2010. Prediction of the compressive strength of high performance
concrete mix using tree based modeling. IJCA. 6(5):18–24.
Diwediga B, Le QB, Agodzo SK, Tamene LD, Wala K. 2018. Modelling soil erosion response to sustain-
able landscape management scenarios in the Mo River Basin (Togo, West Africa). Sci Total Environ.
625:1309–1320.
Drineas P, Mahoney MW. 2005. On the Nyström method for approximating a Gram matrix for improved
kernel-based learning. J Mach Learn Res. 6(Dec):2153–2175.
Efthimiou N. 2020. The new assessment of soil erodibility in Greece. Soil Tillage Res. 204:104720.
Egbueri JC, Igwe O. 2021. The impact of hydrogeomorphological characteristics on gullying processes in
erosion-prone geological units in parts of southeast Nigeria. Geol Ecol Landscapes. 5(3):227–240.
Elbisy MS. 2015. Support vector machine and regression analysis to predict the field hydraulic conductiv-
ity of sandy soil. KSCE J Civ Eng. 19(7):2307–2316.
Elvidge S, Angling M, Nava B. 2014. On the use of modified Taylor diagrams to compare ionospheric
assimilation models. J Roy Astron Soc Can. 49(9):737–745.
Evans R. 1980. Mechanics of water erosion and their spatial and temporal controls: an empirical view-
point. In: Kirkby MJ, Morgan RPC, editors. Soil erosion. Chichester: Wiley. p. 109–128.
Foley JA, Defries R, Asner GP, Barford C, Bonan G, Carpenter SR, Chapin FS, Coe MT, Daily GC, Gibbs
HK, et al. 2005. Global consequences of land use. Science. 309(5734):570–574.
Fu W, Tunney H, Zhang C. 2010. Spatial variation of soil nutrients in a dairy farm and its implications
for site-specific fertilizer application. Soil Tillage Res. 106(2):185–193.
Fumo N, Biswas MR. 2015. Regression analysis for prediction of residential energy consumption. Renew
Sustain Energy Rev. 47:332–343.
Gauch HG, Hwang JG, Fick GW. 2003. Model evaluation by comparison of model-based predictions and
measured values. Agron J. 95(6):1442–1446.
Gee G, Bauder JW. 1986. Particle-size analysis. In: Klute A, editor. Methods of soil analysis part 1.
Agronomy monograph No. 9. 2nd ed. Madison, WI: American Society of Agronomy/Soil Science
Society of America. p. 383–411.
Gonzalez JP, Cook SE, Oberthür T, Jarvis A, Bagnell JA, Dias MB. 2007. Creating low-cost soil maps for
tropical agriculture using gaussian processes. AI in ICT for Development (ICTD) International Joint
Conference on Artificial Intelligence, Hyderabad, India.
Grossman RB, Reinsch TG. 2002. 2.1 Bulk density and linear extensibility. In: Dick AW, editor. Methods
of soil analysis: Part 4 physical methods. Madison: Soil Science Society of America Book Series; p.
201–228.
Hastie T, Tibshirani R, Friedman J. 2009. The elements of statistical learning: data mining, inference, and
prediction. New York (NY): Springer Science Business Media.
Hengl T, Heuvelink GB, Stein A. 2004. A generic framework for spatial prediction of soil variables based
on regression-kriging. Geoderma. 120(1–2):75–93.
Hengl T, Nussbaum M, Wright MN, Heuvelink GB, Gräler B. 2018. Random forest as a generic frame-
work for predictive modeling of spatial and spatio-temporal variables. PeerJ. 6:e5518.
Hosmer DW, Lemeshow S. 2000. Applied logistic regression. New York: Wiley.
Hosseini M, Agereh SR, Khaledian Y, Zoghalchali HJ, Brevik EC, Naeini SARM. 2017. Comparison of
multiple statistical techniques to predict soil phosphorus. Appl Soil Ecol. 114:123–131.
Igwe O, Egbueri JC. 2018. The characteristics and the erodibility potentials of soils from different geologic
formations in Anambra State, southeastern Nigeria. J Geol Soc India. 92(4):471–478.
Kang F, Han S, Salgado R, Li J. 2015. System probabilistic stability analysis of soil slopes using Gaussian
process regression with Latin hypercube sampling. Comput Geotech. 63:13–25.
Karballaeezadeh N, Tehrani HG, Shadmehri DM, Shamshirband S. 2020. Estimation of flexible pavement
structural capacity using machine learning techniques. Front Struct Civ Eng. 14(5):1083–1096.
Kemper W, Rosenau R. 1986. Aggregate stability and size distribution. In: Methods of soil analysis, part
1: physical and mineralogical methods. American Society of Agronomy, Inc. Soil Science Society of
America, Inc; Vol. 5. p. 425–442.
Kerry R, Oliver M. 2007. Comparing sampling needs for variograms of soil properties computed by the
method of moments and residual maximum likelihood. Geoderma. 140(4):383–396.
Kettler T, Doran JW, Gilbert T. 2001. Simplified method for soil particle-size determination to accompany
soil-quality analyses. Soil Sci Soc Am J. 65(3):849–852.
Khosravi K, Pham BT, Chapi K, Shirzadi A, Shahabi H, Revhaug I, Prakash I, Bui DT. 2018. A compara-
tive assessment of decision trees algorithms for flash flood susceptibility modeling at Haraz watershed,
northern Iran. Sci Total Environ. 627:744–755.
Kim JB, Saunders P, Finn JT. 2005. Rapid assessment of soil erosion in the Rio Lempa Basin, Central
America, using the universal soil loss equation and geographic information systems. Environ Manage.
36(6):872–885.
Kim TK. 2015. T test as a parametric statistic. Korean J Anesthesiol. 68(6):540–546.
Kouchami-Sardoo I, Shirani H, Esfandiarpour-Boroujeni I, Besalatpour A, Hajabbasi M. 2020. Prediction
of soil wind erodibility using a hybrid genetic algorithm–Artificial neural network method. Catena.
187:104315.
Kuhn M, Johnson K. 2013. Applied predictive modeling. Vol. 26. New York (NY): Springer.
Lal R. 2001. Soil degradation by erosion. Land Degrad Dev. 12(6):519–539.
Lav MA, Ansal AM. 2001. Regression analysis of soil compressibility. Turkish J Eng Environ Sci. 25(2):
101–109.
Lei X, Chen W, Avand M, Janizadeh S, Kariminejad N, Shahabi H, Costache R, Shahabi H, Shirzadi A,
Mosavi A. 2020. GIS-based machine learning algorithms for gully erosion susceptibility mapping in a
semi-arid region of Iran. Remote Sens. 12(15):2478.
Li X-G, Li F-M, Zed R, Zhan Z-Y. 2007. Soil physical properties and their relations to organic carbon
pools as affected by land use in an alpine pastureland. Geoderma. 139(1–2):98–105.
Ließ M, Glaser B, Huwe B. 2012. Uncertainty in the spatial prediction of soil texture: comparison of
regression tree and Random Forest models. Geoderma. 170:70–79.
Liu X, Zhang Y, Li P. 2020. Spatial variation characteristics of soil erodibility in the Yingwugou watershed
of the Middle Dan River, China. IJERPH. 17(10):3568.
Lizaga I, Quijano L, Gaspar L, Ramos MC, Navas A. 2019. Linking land use changes to variation in soil
properties in a Mediterranean mountain agroecosystem. Catena. 172:516–527.
MacKay DJC. 2003. Information theory, inference and learning algorithms. Cambridge (UK):
Cambridge University Press.
Manrique L, Jones C, Dyke P. 1991. Predicting cation-exchange capacity from soil physical and chemical
properties. Soil Sci Soc Am J. 55(3):787–794.
McLean E. 1983. Soil pH and lime requirement. In: Methods of soil analysis, part 2: chemical and micro-
biological properties. The American Society of Agronomy, Inc., Soil Science Society of America; Vol. 9.
p. 199–224.
Menard S. 2002. Applied logistic regression analysis. Sage, Quantitative Applications in the Social
Sciences; Vol. 106. p. 128.
Merdun H, Çınar Ö, Meral R, Apan M. 2006. Comparison of artificial neural network and regression
pedotransfer functions for prediction of soil water retention and saturated hydraulic conductivity. Soil
Tillage Res. 90(1–2):108–116.
Metya S, Mukhopadhyay T, Adhikari S, Bhattacharya G. 2017. System reliability analysis of soil slopes
with general slip surfaces using multivariate adaptive regression splines. Comput Geotech. 87:212–228.
Mohamed WNHW, Salleh MNM, Omar AH. 2012. A comparative study of reduced error pruning
method in decision tree algorithms. International Conference on Control System, Computing and
Engineering. IEEE.
Mohammadi A, Kamran KV, Karimzadeh S, Shahabi H, Al-Ansari N. 2020. Flood detection and suscepti-
bility mapping using Sentinel-1 time series, Alternating Decision Trees, and Bag-ADTree models.
Complexity. 2020:1–21.
Morgan R. 1980. Soil erosion and conservation in Britain. Prog Phys Geogr. 4(1):24–47.
Mosavi A, Shirzadi A, Choubin B, Taromideh F, Hosseini FS, Borji M, Shahabi H, Salvati A, Dineva AA.
2020. Towards an ensemble machine learning model of random subspace based functional tree classi-
fier for snow avalanche susceptibility mapping. IEEE Access. 8:145968–145983.
Mutanga O, Adam E, Cho MA. 2012. High density biomass estimation for wetland vegetation using
WorldView-2 imagery and random forest regression algorithm. Int J Appl Earth Observ. 18:399–406.
Nebeokike UC, Igwe O, Egbueri JC, Ifediegwu SI. 2020. Erodibility characteristics and slope stability ana-
lysis of geological units prone to erosion in Udi area, southeast Nigeria. Model Earth Syst Environ.
6(2):1061–1074.
Nhu V-H, Hoang N-D, Duong V-B, Vu H-D, Bui DT. 2020. A hybrid computational intelligence
approach for predicting soil shear strength for urban housing construction: a case study at Vinhomes
Imperia project, Hai Phong city (Vietnam). Eng Comput. 36(2):603–616.
Norman A. 1965. Methods of soil analysis, Part 2: chemical and microbiological properties. Madison, WI:
American Society of Agronomy, Soil Science Society of America.
Osisanwo F, Akinsola J, Awodele O, Hinmikaiye J, Olakanmi O, Akinjobi J. 2017. Supervised machine
learning algorithms: classification and comparison. IJCTT. 48(3):128–138.
Ostovari Y, Ghorbani-Dashtaki S, Bahrami H-A, Abbasi M, Dematte JAM, Arthur E, Panagos P. 2018.
Towards prediction of soil erodibility, SOM and CaCO3 using laboratory Vis-NIR spectra: a case study
in a semi-arid region of Iran. Geoderma. 314:102–112.
Ostovari Y, Ghorbani-Dashtaki S, Bahrami H-A, Naderi M, Dematte JAM, Kerry R. 2016. Modification of
the USLE K factor for soil erodibility assessment on calcareous soils in Iran. Geomorphology. 273:
385–395.

Ozdemir N, Gülser C. 2017. Clay activity index as an indicator of soil erodibility. Eurasian J Soil Sci. 6(4):
307–311.
Page A, Miller R, Keeney D. 1982. Methods of soil analysis, Part 2. Madison, WI: American Society of
Agronomy, Soil Science Society of America.
Parysow P, Wang G, Gertner G, Anderson AB. 2003. Spatial uncertainty analysis for mapping soil erodi-
bility based on joint sequential simulation. Catena. 53(1):65–78.
Pennock DJ. 2019. Soil erosion: the greatest challenge for sustainable soil management. UN Food and
Agriculture Organization.
Perez-Rodríguez R, Marques MJ, Bienes R. 2007. Spatial variability of the soil erodibility parameters and
their relation with the soil map at subgroup level. Sci Total Environ. 378(1–2):166–173.
Pham BT, Ly H-B, Al-Ansari N, Ho LS. 2021. A comparison of Gaussian process and M5P for prediction
of soil permeability coefficient. Scientific Programming, 1–13.
Pham BT, Prakash I, Singh SK, Shirzadi A, Shahabi H, Tran T-T-T, Bui DT. 2019. Landslide susceptibility
modeling using reduced error pruning trees and different ensemble techniques: hybrid machine learn-
ing approaches. Catena. 175:203–218.
Phinzi K, Ngetar NS. 2019. The assessment of water-borne erosion at catchment level using GIS-based
RUSLE and remote sensing: A review. Int Soil Water Conserv Res. 7:27–46.
Pimentel D. 2006. Soil erosion: a food and environmental threat. Environ Dev Sustain. 8(1):119–137.
Pimentel D, Burgess M. 2013. Soil erosion threatens food production. Agriculture. 3(3):443–463.
Probst P, Wright MN, Boulesteix AL. 2019. Hyperparameters and tuning strategies for random forest.
Wiley Interdiscip Rev Data Min Knowl Discov. 9(3):e1301.
Quinlan JR. 1987. Simplifying decision trees. Intern J Man-Mach Stud. 27(3):221–234.
Quinlan JR. 1992. Learning with continuous classes. 5th Australian Joint Conference on Artificial
Intelligence. Hobart, Tasmania, Australia: World Scientific.
Rasmussen C, Williams C. 2006. Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Richards LA. 1954. Diagnosis and improvement of saline and alkali soils. Lippincott Williams Wilkins.
New York: US Department of Agriculture, New York; Vol. 78, p. 1–166.
Rodríguez AR, Arbelo C, Guerra J, Mora J, Notario J, Armas C. 2006. Organic carbon stocks and soil
erodibility in Canary Islands Andosols. Catena. 66(3):228–235.
Rodriguez-Galiano V, Mendes MP, Garcia-Soldado MJ, Chica-Olmo M, Ribeiro L. 2014. Predictive mod-
eling of groundwater nitrate pollution using Random Forest and multisource variables related to
intrinsic and specific vulnerability: a case study in an agricultural setting (southern Spain). Sci Total
Environ. 476–477:189–206.
Römkens MJM, Young RA, Poesen JWA, McCool DK, El-Swaify SA, Bradford JM. 1997. Soil erodibility
factor (K). Compilers In: Renard KG, Foster GR, Weesies GA, McCool DK, Yoder DC, editors.
Predicting soil erosion by water: a guide to conservation planning with the Revised Universal Soil Loss
Equation (RUSLE). Washington, DC, USA: Agric. HB. 703:65–99.
Scholten T. 1997. Hydrology and erodibility of the soils and saprolite cover of the Swaziland Middleveld.
Soil Technol. 11(3):247–262.
Shabani F, Kumar L, Esmaeili A. 2014. Improvement to the prediction of the USLE K factor.
Geomorphology. 204:229–234.
Sharma A, Weindorf DC, Wang D, Chakraborty S. 2015. Characterizing soils via portable X-ray fluores-
cence spectrometer: 4. Cation exchange capacity (CEC). Geoderma. 239–240:130–134.
Sharpley AN, Williams JR. 1990. EPIC-Erosion/Productivity impact calculator. I: Model documentation.
II: User manual. Technical Bulletin-United States Department of Agriculture, (1768).
Shirzadi A, Saro L, Joo OH, Chapi K. 2012. A GIS-based logistic regression model in rock-fall susceptibil-
ity mapping along a mountainous road: Salavat Abad case study, Kurdistan, Iran. Nat Hazards. 64(2):
1639–1656.
Sihag P, Karimi SM, Angelaki A. 2019. Random forest, M5P and regression analysis to estimate the field
unsaturated hydraulic conductivity. Appl Water Sci. 9(5):1–9.
Sihag P, Tiwari N, Ranjan S. 2017. Modelling of infiltration of sandy soil using gaussian process regres-
sion. Model Earth Syst Environ. 3(3):1091–1100.
Singh B, Sihag P, Singh K. 2017. Modelling of impact of water quality on infiltration rate of soil by ran-
dom forest regression. Model Earth Syst Environ. 3(3):999–1004.
Sparks D, Page A, Helmke P, Leoppert R, Soltanpour P, Tabatabai M, Johnston G, Summer M. 1996.
Methods of soil analysis. American Society of Agronomy, Soil Science Society of America, Book Series.
5.
Sridharan A, Nagaraj HB. 1999. Absorption water content and liquid limit of soils. Geotech Test J. 22(2):
127–133.
Staff S. 2014. Keys to soil taxonomy. 12th ed. Washington, DC: Natural Resources Conservation Service,
US Department of Agriculture.
Sumner ME, Miller WP. 1996. Cation exchange capacity and exchange coefficients. In: Methods soil ana-
lysis, part 3: chemical methods. The Soil Science Society of America, Inc., American Society of
Agronomy; Vol. 5. p. 1201–1229.
Taheri K, Shahabi H, Chapi K, Shirzadi A, Gutierrez F, Khosravi K. 2019. Sinkhole susceptibility map-
ping: a comparison between Bayes-based machine learning algorithms. Land Degrad Dev. 30(7):
730–745.
Tejada M, Gonzalez J. 2006. The relationships between erodibility and erosion in a soil treated with two
organic amendments. Soil Tillage Res. 91(1–2):186–198.
Tiessen H, Cuevas E, Chacon P. 1994. The role of soil organic matter in sustaining soil fertility. Nature.
371(6500):783–785.
Torri D, Poesen J, Borselli L. 1997. Predictability and uncertainty of the soil erodibility factor using a glo-
bal dataset. Catena. 31:1–22.
Udelhoven T, Emmerling C, Jarmer T. 2003. Quantitative analysis of soil chemical properties with diffuse
reflectance spectrometry and partial least-square regression: a feasibility study. Plant Soil. 251(2):
319–329.
Vacchiano G, Stanchi S, Ascoli D, Marinari G, Zanini E, Motta R. 2014. Soil-mediated effects of fire on
Scots pine (Pinus sylvestris L.) regeneration in a dry, inner-alpine valley. Sci Total Environ. 472:778–788.
Vaezi AR, Hasanzadeh H, Cerda A. 2016. Developing an erodibility triangle for soil textures in semi-arid
regions, NW Iran. Catena. 142:221–232.
Vaezi A, Sadeghi S, Bahrami H, Mahdian M. 2008. Modeling the USLE K-factor for calcareous soils in
northwestern Iran. Geomorphology. 97(3–4):414–423.
Wang B, Zheng F, Guan Y. 2016. Improved USLE-K factor prediction: a case study on water erosion
areas in China. Intern Soil Water Conserv Res. 4(3):168–176.
Wang B, Zheng F, Römkens MJ. 2013. Comparison of soil erodibility factors in USLE, RUSLE2, EPIC
and Dg models based on a Chinese soil erodibility database. Acta Agric Scand B – Soil Plant Sci.
63(1):69–79.
Wang G, Gertner G, Fang S, Anderson AB. 2003. Mapping multiple variables for predicting soil loss by
geostatistical methods with TM images and a slope map. Photogramm Eng Remote Sens. 69(8):
889–898.
Wang H, Zhang G-h, Li N-n, Zhang B-j, Yang H-y. 2019. Variation in soil erodibility under five typical
land uses in a small watershed on the Loess Plateau, China. Catena. 174:24–35.
Wang J, Ding J, Yu D, Ma X, Zhang Z, Ge X, Teng D, Li X, Liang J, Lizaga I, et al. 2019. Capability of
Sentinel-2 MSI data for monitoring and mapping of soil salinity in dry and wet seasons in the Ebinur
Lake region, Xinjiang, China. Geoderma. 353:172–187.
Wang Y. 2000. A new approach to fitting linear models in high dimensional spaces. Hamilton, NZ:
University of Waikato.
Wang Y, Fang Z, Hong H, Peng L. 2020. Flood susceptibility mapping using convolutional neural net-
work frameworks. J Hydrol. 582:124482.
Wang Y, Witten IH. 2002. Modeling for optimal probability prediction. Proceeding of the Nineteenth
International Conference on Machine Learning, Sydney, Australia.
Wang Y, Witten I, van Someren M, Widmer G. 1997. Inducing models trees for continuous classes.
Poster Papers of the European Conference on Machine Learning, Department of Computer Science,
University Waikato, Hamilton, NZ.
Weisberg S. 2005. Applied linear regression. Vol. 528. New York: John Wiley & Sons.
Wilcoxon F. 1992. Individual comparisons by ranking methods. In: Breakthroughs in statistics. Berlin
(Germany): Springer; p. 196–202.
Williams CK, Rasmussen CE. 2006. Gaussian processes for machine learning. Vol. 2. Cambridge, MA:
MIT Press.
Wischmeier WH, Smith DD. 1949. Predicting rainfall-erosion losses from cropland east of the rocky
mountains: guide for selection of practices for soil and water conservation. US Department of
Agriculture. 282p.
Wischmeier WH, Smith DD. 1978. Predicting rainfall erosion losses: a guide to conservation planning.
US Department of Agriculture, Science and Education Administration. 537p.
Witten IH, Frank E, Hall MA. 2005. Practical machine learning tools and techniques. Morgan Kaufmann.
578p.
World Reference Base for Soil Resources. 2014 (2015). International soil classification system for naming
soils and creating legends for soil maps. FAO Rome, 192p.
Wuddivira M, Camps-Roach G. 2007. Effects of organic matter and calcium on soil structural stability.
Eur J Soil Science. 58(3):722–727.
Yan-li L, You-lu B, Li-ping Y, Hong-juan W, Qing-bo K. 2008. Application of hyperspectral data for soil
organic matter estimation based on principle components regression analysis. Plant Nutr Soil Sci. 7(6):
10.
Yang J, Wang J, Qiao P, Zheng Y, Yang J, Chen T, Lei M, Wan X, Zhou X. 2020. Identifying factors that
influence soil heavy metals by using categorical regression analysis: a case study in Beijing, China.
Frontiers Environ Sci Eng. 14(3):37.
Yuan J, Wang K, Yu T, Fang M. 2008. Reliable multi-objective optimization of high-speed WEDM pro-
cess based on Gaussian process regression. Int J Mach Tools Manuf. 48(1):47–60.
Yang X, Gray J, Chapman G, Zhu Q, Tulau M, McInnes-Clarke S. 2018. Digital mapping of soil erodibil-
ity for water erosion in New South Wales, Australia. Soil Res. 56(2):158–170.
Zeng Q, Darboux F, Man C, Zhu Z, An S. 2018. Soil aggregate stability under different rain conditions
for three vegetation types on the Loess Plateau (China). Catena. 167:276–283.
Zhang W, Parker K, Luo Y, Wan S, Wallace L, Hu S. 2005. Soil microbial responses to experimental
warming and clipping in a tallgrass prairie. Global Change Biol. 11(2):266–277.
Zhao W, Wei H, Jia L, Daryanto S, Zhang X, Liu Y. 2018. Soil erodibility and its influencing factors on
the Loess Plateau of China: a case study in the Ansai watershed. Solid Earth. 9(6):1507–1516.
