Slope Stability Predictions On Spatially Variable Random Fields Using Machine Learning Surrogate Models
Civil and Infrastructure Engineering Discipline, School of Engineering, Royal Melbourne Institute of
Technology (RMIT), Victoria 3001, Australia
E-mail: [email protected]
E-mail: [email protected]
To be submitted to
November 2021
Slope stability predictions on spatially variable random fields using
machine learning surrogate models
Mohammad Aminpour1*, Reza Alaie2, Navid Kardani1, Sara Moridpour1, Majidreza Nazem1
1
Civil and Infrastructure Engineering, School of Engineering, RMIT University, Melbourne,
Australia
2
Department of Civil Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran.
Abstract:
Random field Monte Carlo (MC) reliability analysis is a robust stochastic method to determine
the probability of failure. This method, however, requires a large number of numerical
simulations demanding high computational costs. This paper explores the efficiency of
different machine learning (ML) algorithms used as surrogate models trained on a limited
number of random field slope stability simulations in predicting the results of large datasets.
The MC data in this paper require only the examination of failure or non-failure, circumventing the computationally expensive calculation of the factor of safety. The datasets cover various levels of soil heterogeneity and anisotropy. The Bagging Ensemble, Random Forest and Support Vector classifiers are found to be the superior models for this problem amongst nine different models and ensemble classifiers. Trained only on 0.47% of the data (500 samples), the ML model can classify the entire 120,000 samples with an accuracy of 85% and an AUC score of 91%. The performance of the ML methods in classifying the random field slope stability results generally reduces with higher anisotropy and heterogeneity of the soil. The ML-assisted MC reliability analysis proves a robust stochastic method where the error in the predicted probability of failure using 5% of the MC data is only 0.46% on average. The approach reduced the computational costs of the MC reliability analysis.
Keywords: Machine learning; slope stability; Monte Carlo; surrogate models; anisotropy;
heterogeneity
Highlights
- Nine machine learning (ML) classifiers are developed as surrogate models for slope stability predictions on spatially variable random fields.
- To validate the ML models, a complete Monte Carlo (MC) dataset containing 120,000 random field simulations is generated.
- Using 5% of the MC data, the error of the ML-predicted probability of failure for all anisotropy and heterogeneity levels is only 0.46% on average.
- With no need to calculate the factors of safety, the approach reduced the computational costs of the MC reliability analysis.
1. Introduction
Uncertainty is an inevitable aspect of geotechnics. The uncertainty can be associated with the inherent spatial variability of soil or with the estimation of geotechnical parameters (Phoon and Kulhawy 1999, Juang, Zhang et al. 2019). The inherent variability of geotechnical parameters is known as the major source of uncertainty (Fenton and Griffiths 2008) with a significant influence on slope stability (Christian, Ladd et al. 1994, Ching and Phoon 2013, Lloret-Cabot, Fenton et al. 2014). However,
while being critically important in a safe geotechnical design, reliability analysis can be
practically difficult to implement due to the great computational efforts required. Combined
with the random field theory (Vanmarcke 1977) and the finite element
method, the Monte Carlo (MC) approach as a conceptually simple and robust statistical method
(Wang, Cao et al. 2010, Wang, Cao et al. 2011) involves the production of thousands of
simulations to evaluate the probability of geotechnical events such as slope failure. In small
probability events which are of high importance in geotechnical practice, the MC method can
be computationally inefficient. The significant computational time and power needed for this
method force geotechnical engineers to trade off risk management for the timely progress of projects, which can remarkably increase the associated risks, with examples in tunnelling (Chen, Wang et al. 2019) and slope stability (Cho 2007, Wang, Hwang et al. 2013, Gong, Tang et al.). To reduce this computational burden, techniques have been proposed such as importance sampling (Au and Beck 2003) or subset simulation (Au and Beck 2001, Papaioannou, Betz et al. 2015). It is shown that the importance sampling is not
applicable in problems with a large number of random variables (Huang, Fenton et al. 2017).
Also, the subset simulation which has been applied for small failure probability problems in
slope stability (Li, Xiao et al. 2016, Tian, Li et al. 2021), appeared to be even more time
consuming than an MC simulation for a given accuracy when a search for the factor of safety
(FOS) is required (Huang, Fenton et al. 2017). The subset simulation is enhanced by avoiding
a search for FOS using the strength reduction technique to increase the efficiency of this
method. However, even with such an enhancement, the subset simulation has not reduced the simulation time by more than a factor of three and can still be less efficient than a complete MC simulation.
Another technique to increase the efficiency of reliability analysis of slope stability is the use of approximation methods such as the polynomial chaos expansion, support vector regression, the Kriging model, and the multiple adaptive regression spline
(Jiang, Li et al. 2014, Jiang and Huang 2016, Kang, Xu et al. 2016, Liu and Cheng 2018, Liu,
Zhang et al. 2019, Wang, Wu et al. 2020, Zeng, Zhang et al. 2020, Deng, Pan et al. 2021). This
technique involves the introduction of surrogate models to replace the Finite Element
simulations approximating the relationship between the input variables and the output
response. Among the surrogate models, machine learning algorithms have received increasing
attention in slope stability problems with spatial variability (Kang, Xu et al. 2016, Zhu, Pei et
al. 2019, He, Xu et al. 2020, He, Wang et al. 2021, He, Wang et al. 2021, Wang and Goh 2021,
Zhang, Phoon et al. 2021, Zhu, Hiraishi et al. 2021). ML methods have been widely used in
slope stability predictions and landslide characterisations (Wei, Lü et al. 2019, Deng, Smith et
al. 2021, Han, Shi et al. 2021, Hu, Wu et al. 2021, Huang, Han et al. 2021, Sun, Xu et al. 2021,
Wang, Zhang et al. 2021, Zhao, Meng et al. 2021). However, a systematic study on different
ML models when used as surrogate models, and in particular the dependence of their performance on the levels of soil heterogeneity and anisotropy, has not been performed. Moreover, previous studies have normally trained the ML models on the data of Monte Carlo simulations with calculated FOSs, which are tedious tasks in geotechnical engineering. The MC reliability analysis, however, can be conducted without a need for the calculation of FOS in each simulation (Huang, Fenton et al. 2017).
Conducted either by numerous failure surfaces in a limit equilibrium analysis or using the
strength reduction method, the calculation of FOS can significantly increase the computational
cost of the MC method. Little is known about the performance of ML surrogate models trained on spatially variable random field MC data with no FOS calculated. The selected ML algorithms include six models and three ensemble classifiers which have been used in geotechnical engineering for more than a decade, with their appropriateness being shown in slope stability studies (Das, Sahoo et al. 2010, Samui and Kothari 2011, Cheng and Hoang 2016, Pham, Bui et al. 2017, Feng, Li et al. 2018, Bui, Nguyen et al. 2020, Kardani, Zhou et al. 2021, Zhang, Wu et al. 2021). These models are trained on small
fractions of MC data where the outcome of each MC simulation is only the failure or non-failure status of the slope, thus further increasing the efficiency by avoiding the computational time needed for calculating FOSs. The performance of the models is tested against a complete MC database covering a wide range of random field heterogeneity and anisotropy. The sensitivity of the results is also investigated.
This paper is structured as follows. Section 2 describes the technical method to generate the
random fields. Section 3 introduces the random field finite difference computational method to
generate slope models. The details of the generated Monte Carlo data with statistical parameters are presented in Section 4. An overview of the ML surrogate models implemented in this study is given in Section 5. The machine learning
algorithms and ensemble classifiers are briefly described in Section 6. Section 7 provides the
results comparing the performance of different ML classifiers when used as surrogate models
on various datasets with respect to the levels of anisotropy and heterogeneity using suitable
performance scores. The sensitivity of the results is also discussed. A summary with concluding remarks is provided in Section 8.

2. Generation of random fields

A random field, as a random function defined over an arbitrary domain, is used to model the spatial variability of the soil. The mean μ_C, the standard deviation σ_C and the correlation distance δ are used to characterise the random fields. The correlation distance, also known as the scale of fluctuation, characterises the variation of the field in space, which is captured by the covariance function. Assuming a Gaussian and stationary random field, where the complete probability distribution is independent of absolute location, the undrained shear strength parameter, C_u, is the random variable in this study.
Among several methods for the generation of random fields such as Covariance Matrix
Decomposition, Moving Average, Fast Fourier Transform, Turning Bands Method and Local
Average Subdivision Method (Griffiths and Fenton 2007), the Cholesky decomposition
technique, developed in (Vetterling, Press et al. 2002) and utilised in various geotechnical studies (Zhu and Zhang 2013, Zhu, Zhang et al. 2017), is selected and used here.
In this paper, the random field is defined as (El‐Kadi and Williams 2000)

C(x) = exp(L ε + μ_lnC)    (1)

where C(x) is the undrained shear strength of soil at the spatial position x, ε is a vector of independent random variables normally distributed with zero mean and unit variance, μ_lnC is the mean of the logarithm of C_u, and L is a lower-triangular matrix computed from the decomposition of the covariance matrix using the Cholesky decomposition technique (Griffiths and Fenton 2008). The covariance function is

A(l_h, l_v) = σ_lnC² exp(−2|l_h|/δ_h − 2|l_v|/δ_v)    (2)

where l_h and l_v are the horizontal and vertical distances between two arbitrary points, δ_h and δ_v are the horizontal and vertical scales of fluctuation (correlation distances), respectively, and σ_lnC is the standard deviation of the logarithm of C_u.
According to the Cholesky decomposition technique, the covariance matrix of the logarithms of the random variable values (here C_u) at any two points is decomposed into

A = L Lᵀ    (3)

The mean (μ_lnC) and the standard deviation (σ_lnC) of the logarithm of C_u are computed as

μ_lnC = ln(μ_C) − σ_lnC²/2    (4)

σ_lnC² = ln(1 + COV_C²)    (5)

COV_C = σ_C / μ_C    (6)
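As a concrete illustration, Eqs. (1)-(6) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the grid size, parameter values and the small diagonal jitter added before the decomposition are illustrative assumptions.

```python
import numpy as np

def lognormal_random_field(coords, mu_C, cov_C, delta_h, delta_v, rng):
    """One lognormal random field realisation via Cholesky decomposition.
    coords: (n, 2) array of cell centroid positions (x, y) in metres."""
    sigma_ln2 = np.log(1.0 + cov_C ** 2)           # Eq. (5)
    mu_ln = np.log(mu_C) - 0.5 * sigma_ln2         # Eq. (4)
    # Pairwise horizontal/vertical separations between all cells
    lh = np.abs(coords[:, 0][:, None] - coords[:, 0][None, :])
    lv = np.abs(coords[:, 1][:, None] - coords[:, 1][None, :])
    A = sigma_ln2 * np.exp(-2.0 * lh / delta_h - 2.0 * lv / delta_v)  # Eq. (2)
    # Small jitter keeps the covariance matrix numerically positive definite
    L = np.linalg.cholesky(A + 1e-10 * np.eye(len(coords)))           # Eq. (3)
    eps = rng.standard_normal(len(coords))         # i.i.d. N(0, 1) variables
    return np.exp(L @ eps + mu_ln)                 # Eq. (1)

# Illustrative 10 x 8 grid of 0.5 m cells (the paper's model has 800 cells)
xs, ys = np.meshgrid(np.arange(10) * 0.5, np.arange(8) * 0.5)
coords = np.column_stack([xs.ravel(), ys.ravel()])
field = lognormal_random_field(coords, mu_C=18.6, cov_C=0.3,
                               delta_h=12.0, delta_v=1.0,
                               rng=np.random.default_rng(0))
```

Each call with a fresh random vector ε yields one realisation; repeating the call 2000 times would produce one Monte Carlo dataset for a given parameter set.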
Random field simulations include a range of mean undrained shear strength values, μ_C, from 18.6 to 33.5 kPa. These μ_C values are chosen based on the FOSs associated with homogeneous slopes where the shear strength of the soil is equal to μ_C. The FOS for a homogeneous slope with C_u = 18.6 kPa is equal to 1. The other μ_C values are determined as 1.2, 1.4, 1.6 and 1.8 times 18.6 kPa. The COV_C values are varied as 0.1, 0.3 and 0.5, representing low to high levels of soil heterogeneity, and the anisotropy ratio of the scales of fluctuation, ξ = δ_h/δ_v, varies from 1 to 25. A summary of the statistical parameters used in this study is presented in Table 1.

Table 1. Statistical parameters of the random field simulations.
Parameter | Values
Mean of undrained shear strength, μ_C (kPa) | 18.6, 22.3, 26, 29.7, 33.5
3. Random field finite difference slope model

The slope considered for this study, as shown in Fig. 1, has the following geometrical specifications: a slope angle of 45°, a slope height of 5 m and a foundation depth of 10 m from the top of the slope. The foundation level is assigned a rigid boundary surface simulating the hard rock bed with all movements restricted. The horizontal displacements are restricted at the lateral boundaries of the model. An elastic-perfectly plastic behaviour incorporating the Mohr-Coulomb failure criterion is adopted for the soil. The shear modulus is defined as G = I_r C_u, where I_r is the rigidity index, defined as the ratio between the shear modulus G and the undrained shear strength. Here, I_r = 800, denoting a moderately stiff soil, is assumed according to the ranges suggested in the literature (e.g., I_r = 300-1500 (Popescu, Deodatis et al. 2005)). The soil unit weight, γ = 20 kN/m³, is kept constant in all simulations.
FLAC (FLAC Itasca) is used for the random finite difference simulations. The mesh size was verified to be sufficiently fine, acquiring accurate results with less than 5% relative error compared to a highly refined but computationally expensive mesh. A four-noded quadrilateral grid with a 0.5 m side dimension was generated for this study, resulting in 800 cells. Each cell is assigned a value of C_u from the random field realisation.
Fig. 1 The slope random finite difference model comprising 800 random variable cells
To identify whether a slope is stable or not, an alternative criterion for slope failure is adopted, replacing the calculation of FOS using the traditional, time-consuming strength reduction technique. In this method, a combined criterion for plastic zones and the velocity of grids is
assessed in the finite difference model. The failure surface can be detected if a contiguous
region of active plastic zones connecting two boundary surfaces is developed. For the velocity
criterion, the failure is also determined when both the velocity gradient and amplitude are
greater than the determined thresholds. The alternative failure criterion is shown to be efficient
with accurate results as compared to stochastic stability results obtained using the strength
reduction technique (see Ref. (Jamshidi Chenari and Alaie 2015) for further details).
4. Monte Carlo simulations

Monte Carlo reliability analysis involves the simulation of a typically large number of samples
for desired statistical parameters. The statistics of the results, such as the probability of failure, converge to constant values for large datasets. For this study, it can be shown that 2000
samples are sufficient (Jamshidi Chenari and Alaie 2015) for constant statistics of the results.
This number agrees with the findings suggesting that a typical number of 2500 simulations can
provide reasonable precision and reproducibility for random field finite element studies
(Griffiths and Fenton 2007). Thus, for each set of statistical parameters (μ_C, COV_C, δ_h, δ_v),
2000 random field realisations are generated. Each realisation is simulated using the finite
difference method to determine the stability status of slopes. The outcomes of all analyses as a
binary result (failure or non-failure) provide the Monte Carlo data for the reliability analysis.
A summary of the details of the random fields generated in this study is shown in Table 2.
Table 2. Summary of the generated random field datasets. Total number of simulations: 120,000.
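The convergence argument above can be illustrated with a toy computation; the failure probability and the random outcomes below are hypothetical stand-ins for the binary finite difference results, not the paper's data.

```python
import numpy as np

# The MC estimate Pf = n_failed / n_samples stabilises as realisations
# accrue, which motivates fixing the dataset size at 2000 per parameter set.
rng = np.random.default_rng(42)
true_pf = 0.25                            # assumed failure probability
outcomes = rng.random(2000) < true_pf     # stand-in binary failure outcomes
running_pf = np.cumsum(outcomes) / np.arange(1, 2001)
```

Plotting `running_pf` against the sample count would show the estimate settling near the underlying probability well before 2000 realisations for probabilities of this magnitude.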
Fig. 2 illustrates some examples of random field realisations. Each plot in Fig. 2 shows a single realisation of the random field for the given μ_C, COV_C and ξ parameters out of 2000 realisations.

Fig. 2 Examples of the random field slope models with spatial variability of the undrained shear strength of the soil.
5. The ML surrogate model approach

In this paper, a complete Monte Carlo analysis of a typical slope stability problem in spatially variable soil is conducted. Covering a wide range of heterogeneity and anisotropy of random fields, the MC
study provides a full-size database as a benchmark by which an actual evaluation of the trained
ML surrogate models can be possible. The classification tasks and the probability of failure
obtained from ML surrogate models using 5% of data are then compared with the actual results
from the MC benchmark data. An overview of the implemented approach in this study is shown
as a flowchart in Fig. 3.
Fig. 3 The approach implemented in this study in evaluating the ML surrogate models versus actual full-size MC
datasets.
6. Machine learning algorithms

6.1. Logistic Regression (LR)

The probability of a binary result can be predicted using the Logistic Regression (LR) algorithm. LR is utilised to describe the connection between one dependent binary variable and one or more independent variables at the nominal, interval, ordinal, or ratio level. A binary
logistic model mathematically consists of a dependent variable with two potential values
indicated by an indicator variable with the values "0" and "1". A LR model generally estimates
the likelihood of belonging to one of the two classes in the dataset. The following equation expresses the LR model:

P(y = 1 | x; w) = 1 / (1 + e^(−wᵀx))    (7)

Given this is a minimisation problem, the LR objective function may be stated as:

min_{w,c}  (1/2) wᵀw + C Σ_i log(exp(−y_i (x_iᵀ w + c)) + 1)    (8)
where c is a constant term and C is a previously fixed hyperparameter that controls the balance
between two terms of the objective function (Hosmer Jr, Lemeshow et al. 2013).
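A hedged sketch of Eqs. (7) and (8) using scikit-learn, whose LogisticRegression solves this C-regularised objective; the features and labels below are synthetic placeholders for the random field inputs, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the random-field inputs: 200 "slopes", 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic failed/stable label

clf = LogisticRegression(C=1.0)   # C balances the two terms of Eq. (8)
clf.fit(X, y)
proba = clf.predict_proba(X[:5])[:, 1]          # Eq. (7): P(y = 1 | x; w)
```

Smaller `C` strengthens the (1/2)wᵀw regularisation term; larger `C` weights the data-fit term more heavily.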
6.2. K-Nearest Neighbours (KNN)

K-Nearest Neighbours (KNN) is a type of instance-based learning that, instead of constructing an internal model, only remembers instances of the training data. A simple majority vote of the nearest neighbours of each sample vector is the basis of classification, while in regression KNN models the connection between the independent factors and the continuous result. Cross-validation may be used to choose the neighbourhood size that minimises the mean square error. A KNN model is simple to build and may effectively handle non-linearity. The KNN classification involves the following steps:
Step 1: Calculate the distance between the objects to be categorised and calculate their
neighbours utilising one of the available distance functions (e.g. Minkowski, Manhattan,
Euclidean).
Step 2: Choose the K closest data points (the items with the K lowest distances).
Step 3: Perform a "majority vote" among the data points; the categorisation that receives the most votes is assigned to the object.
Both data inspection and hyper-parameter tuning methods may be used to determine the optimum value for K (Mucherino, Papajorgji et al. 2009, Cheng and Hoang 2016).
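The three steps and the cross-validated choice of K might look as follows in scikit-learn; the data, candidate K values and distance metric are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data with a non-linear (spherical) class boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (np.linalg.norm(X, axis=1) > 2.0).astype(int)

# Steps 1-3 (distances, K nearest points, majority vote) are handled by
# the classifier; the optimum K is chosen by 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(metric="minkowski"),
                      {"n_neighbors": [3, 5, 7, 9]}, cv=5)
search.fit(X, y)
```

`search.best_params_` then reports the K with the highest mean cross-validated accuracy.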
6.3. Decision Tree (DT)

A decision tree model is a tree data structure with an unspecified number of nodes and
branches. Internal nodes are those that have outward edges, whereas others are referred to as
leaves. A specific internal node divides the examples utilised for classification or regression
into two or more categories. During the training phase, the input variable values are compared
to a specific function. A decision tree inducer is an algorithm that generates a decision tree from given instances. The algorithm attempts to find the optimal decision tree by minimising a fitness function. Since the datasets of this study contain two classes
(stable or failed), a categorisation model is fitted to the target variable employing each
independent variable in this study. A decision node is composed of at least two branches. A
leaf node denotes a categorisation or decision. The root node is the highest decision node in a tree.
An overly big tree increases the chance of overfitting the training data and performing poorly on new samples, while a too-small tree may fail to capture important structural information in the sample space. The horizon effect is a term that refers to this problem.
One commonly used strategy is to continue expanding the tree until each node has a limited
number of instances and then prune the tree to eliminate nodes with no further information.
The primary goal of pruning is to minimise the size of a decision tree while maintaining its predictive accuracy.
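A small sketch of growing and then pruning a tree, here using scikit-learn's cost-complexity pruning (`ccp_alpha`) as one common realisation of the pruning strategy described above; the XOR-like data and the alpha value are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like synthetic target

# A fully grown tree versus the same tree with cost-complexity pruning:
# nodes whose contribution falls below ccp_alpha are collapsed
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)
```

Comparing `full.tree_.node_count` with `pruned.tree_.node_count` shows the size reduction that pruning trades for a slightly looser training fit.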
6.4. Support Vector Machine (SVM)

Support Vector Machine (SVM) is often used for three primary objectives as a supervised learning technique: classification, regression and pattern recognition. An SVM used for classification is called a Support Vector Classifier (SVC). Margin maximisation for optimal and effective performance and the ability to resist overfitting are two significant characteristics distinguishing SVM from basic artificial neural networks. Considering a dataset with N samples and a binary result y_i (1 or 0), the following training vectors are used for binary classification:

(x_i, y_i),  x_i ∈ Rⁿ,  y_i ∈ {0, 1},  i = 1, …, N    (9)
where n influencing features exist in the n-dimensional space that may be used for addressing
the decision boundary. Soft-margin SVMs impose penalties and may be utilised for nonlinear classification when combined with the kernel technique. Thus, the decision function f(x) of the kernel SVM may be expressed in the following manner (Noble 2006):

f(x) = wᵀφ(x) + b = 0    (10)

where w is the weight vector, b denotes the bias, and φ maps the input variable x into a higher-dimensional feature space. The soft-margin optimisation problem is

Minimise:  (1/2)‖w‖² + C Σ_i χ_i    (11)
Subject to:  y_i (w·x_i + b) ≥ 1 − χ_i,  χ_i ≥ 0
Here, C denotes a penalty coefficient while χ_i ≥ 0 are slack variables, which account for the misclassification effects.
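Equations (10) and (11) correspond to a soft-margin kernel SVC; a minimal sketch on hypothetical two-dimensional data, with the RBF kernel playing the role of φ(x):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a circular class boundary (not linearly separable)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)

# C is the slack-variable penalty of Eq. (11); the RBF kernel supplies
# the implicit feature map of Eq. (10)
clf = SVC(kernel="rbf", C=10.0).fit(X, y)
train_acc = clf.score(X, y)
```

Lowering `C` widens the margin and tolerates more slack; raising it penalises misclassification more strongly.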
6.5. Random Forest (RF)

Random Forest (RF) is an ensemble learning method based on a divide-and-conquer strategy, widely known as bagging, to create an ensemble of randomly generated base learners (unpruned DTs). The critical point of RF is the creation of a collection of DTs subjected to controlled variation.
Another critical element of the training process is the random choice of features. The method
constructs a subset of M important variables for each node in the trees. It is necessary to choose
a variety of influencing factors to guarantee DT variation. As a result, each of the DTs conducts
an independent assessment of the data. The final result is obtained by voting over all trees in the forest. The RF procedure can be summarised as:
1- Draw multiple bootstrap samples from the training data.
2- For each sample, construct a decision tree and obtain prediction results from the decision tree.
3- Aggregate the predictions of all trees by majority voting.
Additional information regarding RF may be found in Reference (Belgiu and Drăguţ 2016).
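The bootstrap-sample / per-split random-feature scheme might be sketched as follows; `max_features="sqrt"` realises the random choice of M variables at each node, and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

# 100 unpruned trees, each grown on a bootstrap sample and restricted to
# a random sqrt(n_features) subset of variables at every split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
```

`rf.predict` then returns the majority vote over the 100 trees in `rf.estimators_`.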
6.6. Naive Bayes (NB)

Naive Bayes (NB) is a basic, but robust, method for classifying binary and multi-class problems. Being based on Bayes' theorem, it may be classified as a probabilistic method. The following equation expresses Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)    (12)

where P(A|B) is the probability of A given that B occurs (the posterior probability), viz., a conditional probability.
Gaussian Naive Bayes (Gaussian NB) is an enhanced version of the basic and most common Naive Bayes. The Gaussian distribution requires the mean and standard deviation to be calculated from the training data (Rish 2001, Tsangaratos and Ilia 2016).
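A minimal Gaussian NB sketch: the per-class mean and standard deviation of each feature are estimated from hypothetical training data and plugged into Eq. (12) under the feature-independence assumption.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic classes with shifted feature means
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, size=(150, 3))   # class 0 features
X1 = rng.normal(loc=+1.0, size=(150, 3))   # class 1 features
X = np.vstack([X0, X1])
y = np.array([0] * 150 + [1] * 150)

# fit() estimates the per-class Gaussian parameters (nb.theta_, nb.var_)
nb = GaussianNB().fit(X, y)
```

Predictions follow from Eq. (12): the class with the largest posterior, computed from the fitted Gaussian likelihoods and the class priors, is returned.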
6.7. Stacking ensemble (ST En)

The stacking ensemble is based on learning how to combine the predictions of multiple base models to maximise the performance of the combined model. The generalisation method is indeed based on a new machine learning model in the ensemble that learns when to use each base model for a problem. The base models are typically different and are combined by a single meta-model. With different base models having various types of skills on the dataset, the errors in the predictions of the different base models are less correlated. When training the meta-model, the base models are fed with out-of-sample data, i.e., data not already used to train the base models. The predictions made by the base models, along with the expected outputs, create the input and output data to train the meta-model.
The level-0 or base models can be diverse and complex, where the differences in the approaches
and assumptions in base models can help the meta-model see the data in different ways. Here
we use a range of various machine learning approaches for the base models including logistic
regression (LR), K-Nearest Neighbours (KNN), Decision Tree (DT), Support Vector
Classification (SVC), Random Forest (RF) and Gaussian Naive Bayes (NB). 10-fold cross-
validation of the base models is used where the out-of-fold data create the training dataset for
the meta-model.
A simple model suits the level-1 or meta-model, which provides a smooth combination of the predictions. Thus, linear models are often used as meta-models. A logistic regression model is selected here for the classification task performed by the meta-model. As such, the meta-model derives a weighted average or blending of the predictions made by the base models. A 5-fold cross-validation is used in evaluating the ensemble.
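The stacking arrangement described above (six base models, a logistic regression meta-model, out-of-fold training data) might be assembled as follows; the dataset is a synthetic placeholder, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

base = [("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svc", SVC(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB())]

# cv=10: the meta-model is trained on out-of-fold base-model predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=10)
stack.fit(X, y)
```

The fitted logistic-regression meta-model effectively learns a blending weight for each base model's prediction.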
While the stacking ensemble is designed to improve the overall performance of the models, such an improvement is not always guaranteed. Different aspects may affect the ensemble performance, including the sufficiency of the training data to represent the complexity of the problem and whether there are more complex insights, not captured by the base models, to be revealed by the ensemble (Džeroski and Ženko 2004).
6.8. Bagging ensemble (BG En)

The bagging ensemble, also known as bootstrap aggregation, is an ensemble model that combines multiple learners trained on subsamples of the same dataset. The bagging ensemble
reduces the variance of the prediction errors and generally improves the accuracy of the model
particularly when perturbing the learning dataset can cause a significant change in the predictor
constructed.
The bagging ensemble includes a process of (i) creating multiple datasets sampled from the
original data, (ii) training multiple learners on each dataset, and (iii) combining all learners to
generate a single prediction. The combined prediction is generated by averaging the results of
the applied models for a regression analysis or majority voting approach for classification
problems.
Sampling with replacement is the core idea in bagging ensembles so that some instances are
included in samples multiple times while others are left out. The ensemble can consist of a
single base learner invoked several times using different subsets of the training set. A random
forest model is employed here as the base learner (Rokach 2010, Bühlmann 2012).
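The three-part bagging process (bootstrap datasets, one learner per dataset, combined vote) might be sketched as follows; following the text, a random forest serves as the repeatedly invoked base learner, though the sizes and data here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (np.abs(X[:, 0]) + X[:, 1] > 1).astype(int)

# (i) 10 bootstrap-resampled datasets, (ii) one small random forest fitted
# per dataset, (iii) majority voting over the 10 fitted learners
bag = BaggingClassifier(
    RandomForestClassifier(n_estimators=20, random_state=0),
    n_estimators=10, bootstrap=True, random_state=0).fit(X, y)
```

Sampling with replacement means some instances appear several times in a given bootstrap dataset while others are left out, which is what drives the variance reduction.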
6.9. Voting ensemble (VO En)

As a meta-algorithm, the voting ensemble combines several base models, either regressors or classifiers, and produces predictions based on voting algorithms. The ensemble is ideally suited to a set of equally well-performing models, as it can balance out their individual weaknesses.
For regression predictions, the algorithm involves calculating the average of the predictions
from base models. In the case of classifier ensembles, a voting ensemble contains several base
classifiers. The voting ensemble can use the majority voting (hard voting) in which each base
classifier is entitled to one vote where a class with the highest number of votes is determined
as the final prediction. If the number of classifiers is even, a tiebreak rule should be identified.
The ensemble can also use the average predicted probabilities (soft voting) to combine the
predicted classes of base models. The number of base classifiers can be arbitrary (Rokach
2010). In this study, the base models of the VO En are LR, KNN, DT, RF, SVC and Gaussian
NB.
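Hard and soft voting over the six base models named above can be sketched as follows; note that soft voting needs class probabilities, hence `probability=True` for the SVC. The data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] ** 2 < 0).astype(int)

base = [("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),  # for soft voting
        ("nb", GaussianNB())]

hard = VotingClassifier(estimators=base, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(estimators=base, voting="soft").fit(X, y)  # mean probability
```

Hard voting picks the class with the most votes; soft voting averages the predicted probabilities before choosing the class.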
7. Results and discussion

7.1. Metrics
In this study, the accuracy (ACC), F1-score, AUC score and some metric values extracted from the confusion matrix have been used to compare the performance of the ML models as surrogate classifiers.
The confusion matrix (see Table 3), also known as the error matrix, is a tabular summary used to describe the performance of a classification model.

Table 3. Confusion matrix for binary classification.
                   Predicted Negative   Predicted Positive
Actual Negative    TN                   FP
Actual Positive    FN                   TP

According to Table 3, some metric values including accuracy, sensitivity (true positive rate), specificity (true negative rate), and false positive rate can be defined. The following equations express these metrics:
Accuracy (ACC) = (TP + TN) / total instances    (13)

Sensitivity (recall or true positive rate) = TP / (TP + FN)    (14)

Specificity (true negative rate) = TN / (TN + FP)    (15)

False positive rate = FP / (TN + FP) = 1 − Specificity    (16)
Another performance metric used in this research is the F1-score, also known as the F-measure. It is employed as a metric to evaluate the accuracy using both precision (TP/(TP + FP)) and recall (TP/(TP + FN)). In other words, the F1-score reaches its best value at 1 and its worst at 0 through the harmonic average of precision and recall:

F1 = TP / (TP + (FP + FN)/2)    (17)
It is worth mentioning that in the statistical analysis of geotechnical stability problems, the class of instability events can be rare in comparison with the stable cases. In such problems, the issue of class imbalance should be carefully considered in training the ML models to improve their performance. For uneven class distributions, where a balance between precision and recall is sought, the F1-score is a more suitable metric than the accuracy.
In addition, the receiver operating characteristic (ROC) curve, which plots the true positive rate
against the false positive rate at various cut-off values, is used for further analysis of the models.
This graphical plot can be employed to express the predictive ability of a model. The area under
the ROC curve, i.e., AUC, is commonly used as a measure of the performance of classifiers.
AUC ranges in value from 0 to 1, representing a totally wrong to fully correct prediction by a
classifier, respectively. As a guide for evaluating the AUC score values, one may consider
the following guide: Outstanding (0.9 < AUC < 1), Excellent (0.8 < AUC < 0.9) and Acceptable
(0.7 < AUC < 0.8) (Hosmer Jr, Lemeshow et al. 2013).
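Equations (13)-(17) can be computed directly from the entries of a confusion matrix; a small worked example with hypothetical predictions:

```python
import numpy as np

def classification_scores(y_true, y_pred):
    """Metrics of Eqs. (13)-(17) for binary labels (1 = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)            # Eq. (13)
    sens = tp / (tp + fn)                    # Eq. (14), recall
    spec = tn / (tn + fp)                    # Eq. (15)
    fpr = fp / (tn + fp)                     # Eq. (16), = 1 - specificity
    f1 = tp / (tp + 0.5 * (fp + fn))         # Eq. (17)
    return acc, sens, spec, fpr, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]            # hypothetical actual classes
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]            # hypothetical predictions
acc, sens, spec, fpr, f1 = classification_scores(y_true, y_pred)
# TP=2, TN=4, FP=1, FN=1: acc = 0.75, sens = 2/3, spec = 0.8, F1 = 2/3
```

The same values can be cross-checked against scikit-learn's `accuracy_score`, `recall_score` and `f1_score`.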
Confusion matrices: The confusion matrices of the RF, SVC and BG En models on the test datasets for a single iteration of model training are shown in Table 4. The models are trained on 5% of the MC data and tested on the remaining 95%. The table provides information on the share of the failure and stable classes in each dataset and the model performance in predicting each class. In the best-predicted dataset, U_{0.1,1}, while the accuracy reaches 96.4% for all of RF, SVC and BG En, the incorrectly classified cases are mostly stable slopes wrongly classified as failed (3.3% of samples). When the accuracy declines to its lowest value among the datasets, in U_{0.5,25} (ACC = 0.671 for RF), the wrongly classified cases come from both actually stable and failed slopes, with 14.3% and 18.5% of all samples being wrongly classified as failed and stable, respectively. For the entire MC dataset with 119,500 test samples, the models trained on only 500 samples provide an accuracy reaching 84.3%. The RF and BG En classifiers wrongly classify only about 8% of all samples in each of the stable and failed classes.
Table 4. Confusion matrices of the classifiers on the test datasets. Models are trained on 500 samples; the test datasets consist of 9500 samples for the individual U datasets and 119,500 samples for the entire dataset. Values in parentheses are counts normalised by the total number of samples in each dataset. For each model, entries give the predicted Stable / predicted Failed counts for each actual class, with the overall accuracy (ACC).

U_{0.3,6}
  RF:    Actual Stable: 6208 (0.653) / 772 (0.081);  Actual Failed: 908 (0.096) / 1612 (0.170);  ACC = 0.823
  SVC:   Actual Stable: 5986 (0.630) / 994 (0.105);  Actual Failed: 738 (0.078) / 1782 (0.188);  ACC = 0.818
  BG En: Actual Stable: 6196 (0.652) / 784 (0.083);  Actual Failed: 859 (0.090) / 1661 (0.175);  ACC = 0.827

U_{0.3,12}
  RF:    Actual Stable: 6417 (0.675) / 775 (0.082);  Actual Failed: 1168 (0.123) / 1140 (0.120);  ACC = 0.795
  SVC:   Actual Stable: 6368 (0.670) / 824 (0.087);  Actual Failed: 1075 (0.113) / 1233 (0.130);  ACC = 0.800
  BG En: Actual Stable: 6458 (0.680) / 734 (0.077);  Actual Failed: 1137 (0.120) / 1171 (0.123);  ACC = 0.803

U_{0.3,25}
  RF:    Actual Stable: 6387 (0.672) / 742 (0.078);  Actual Failed: 1296 (0.136) / 1075 (0.113);  ACC = 0.785
  SVC:   Actual Stable: 6536 (0.688) / 593 (0.062);  Actual Failed: 1407 (0.148) / 964 (0.101);  ACC = 0.789
  BG En: Actual Stable: 6337 (0.667) / 792 (0.083);  Actual Failed: 1226 (0.129) / 1145 (0.121);  ACC = 0.788

U_{0.5,1}
  RF:    Actual Stable: 4729 (0.498) / 582 (0.061);  Actual Failed: 812 (0.085) / 3377 (0.355);  ACC = 0.853
  SVC:   Actual Stable: 4640 (0.488) / 671 (0.071);  Actual Failed: 721 (0.076) / 3468 (0.365);  ACC = 0.853
  BG En: Actual Stable: 4737 (0.499) / 574 (0.060);  Actual Failed: 812 (0.085) / 3377 (0.355);  ACC = 0.854

U_{0.5,6}
  RF:    Actual Stable: 3766 (0.396) / 1223 (0.129);  Actual Failed: 1432 (0.151) / 3079 (0.324);  ACC = 0.721
  SVC:   Actual Stable: 3725 (0.392) / 1264 (0.133);  Actual Failed: 1403 (0.148) / 3108 (0.327);  ACC = 0.719
  BG En: Actual Stable: 3759 (0.396) / 1230 (0.129);  Actual Failed: 1405 (0.148) / 3106 (0.327);  ACC = 0.723

U_{0.5,12}
  RF:    Actual Stable: 3710 (0.391) / 1399 (0.147);  Actual Failed: 1619 (0.170) / 2772 (0.292);  ACC = 0.682
  SVC:   Actual Stable: 3611 (0.380) / 1498 (0.158);  Actual Failed: 1553 (0.163) / 2838 (0.299);  ACC = 0.679
  BG En: Actual Stable: 3607 (0.380) / 1502 (0.158);  Actual Failed: 1508 (0.159) / 2883 (0.303);  ACC = 0.683

U_{0.5,25}
  RF:    Actual Stable: 3883 (0.409) / 1426 (0.150);  Actual Failed: 1734 (0.183) / 2457 (0.259);  ACC = 0.667
  SVC:   Actual Stable: 3862 (0.407) / 1447 (0.152);  Actual Failed: 1708 (0.180) / 2483 (0.261);  ACC = 0.668
  BG En: Actual Stable: 3948 (0.416) / 1361 (0.143);  Actual Failed: 1760 (0.185) / 2431 (0.256);  ACC = 0.671

Entire data
  RF:    Actual Stable: 75255 (0.630) / 9548 (0.080);  Actual Failed: 10408 (0.087) / 24289 (0.203);  ACC = 0.833
  SVC:   Actual Stable: 77725 (0.650) / 7078 (0.059);  Actual Failed: 14318 (0.120) / 20379 (0.171);  ACC = 0.821
  BG En: Actual Stable: 75379 (0.631) / 9424 (0.079);  Actual Failed: 10416 (0.087) / 24281 (0.203);  ACC = 0.834
To obtain robust estimates of the performance scores, a repeated k-fold cross-validation (CV) scheme is employed in this study. In k-fold CV, the dataset is first randomly divided into k separate folds with the same number of instances. The model is tested over each fold in turn, where the other k − 1 folds are used for training; the whole procedure is repeated several times in repeated k-fold CV (Wong and Yeh 2019). k-fold CV with a large number of folds and a small
number of replications is shown to be an appropriate method for the performance evaluation of
classification algorithms (Refaeilzadeh, Tang et al. 2009). Three repeats of 10-fold CV create
the distribution of performance scores such as accuracy and F1 in our study. For each model,
a distribution of performance scores with a higher mean value and a smaller variance is desirable.
Fig. 4 The 10-fold cross-validation with three repeats adopted in this study, resulting in the statistics of the model
performance scores shown as boxplots. The interquartile range (IQR) spans Q1 (25th percentile) to Q3 (75th
percentile) and is shown as the boxes. The minimum and maximum are shown as whiskers, and outliers are identified as circles.
Performance scores and their distributions: The mean performance scores of the different
classifiers obtained from repeated k-fold CV are summarised in Table 5. Classifiers are
used as MC surrogate models to predict the stability status of heterogeneous and anisotropic
slopes. For training the surrogate classifiers, 5% of the MC data, i.e., 500 samples, are randomly
chosen from each random field dataset U containing 10,000 samples. The mean CV performance
scores suggest that, in general, RF, SVC and BG En are the most appropriate classifiers
for this study. In the dataset where the highest performance scores are achieved (U_{0.1,1},
characterised by low heterogeneity and no anisotropy), the BG En model provides the highest
scores (F1=0.903, ACC=0.962, AUC=0.98), which are also partly matched by other models.
At the other extreme (U_{0.5,25}, with the highest heterogeneity and
anisotropy), the RF model reaches the highest ACC=0.691 and AUC=0.721, whereas the highest
F1 score belongs to NB (F1=0.659). On the entire dataset, where all sub-datasets are combined,
the BG En and RF models outperform the others with ACC=0.847, AUC=0.912 and F1=0.737
(Table 5).
In addition to the mean performance scores, the distribution of these scores obtained from three
repeats of CV provides more insight into the overall appropriateness of the different classification
models for the current study. These distributions allow a better comparison of the
models, where a smaller variance around higher scores is desirable. The distribution of
F1 and accuracy scores for each classification model is shown in Fig. 5
as boxplots. In the best-predicted dataset (Fig. 5a: U_{0.1,1}), SVC, RF, Naive Bayes and BG En
achieve similarly high scores.
The similar achievement of different models implies higher predictability of the dataset due to
the smaller COV of the strength parameter and stronger correlations between mean strength values
and the stability status; however, with the same COV but a larger scale of fluctuation (ξ = 25),
more complex failure mechanisms may come into play, reducing the predictability
of the stability. In this dataset (Fig. 5b: U_{0.1,25}), the SVC model outperforms all models, with
distributions concentrated at higher values of both ACC and F1 scores. For the U_{0.3,1} dataset, where the
COV of shear strength values is increased to 30% (Fig. 5c), SVC yields the best scores with
small variances, while for the U_{0.3,25} dataset, where anisotropy is also significant (Fig. 5d), the VO
En and RF models appear more efficient when both scores are considered. In the dataset with
the highest COV but isotropic heterogeneity (Fig. 5e: U_{0.5,1}), NB and BG En can be selected
as the best classifiers. For the same COV but a highly anisotropic dataset (Fig. 5f: U_{0.5,25}), BG
En, RF and SVC perform well on the accuracy distributions, and NB can be selected as the winner
on the F1 score.
Table 5. Performance metrics of different models and the ensemble classifiers on different random field data.
The mean values of the metrics obtained from a repeated 10-fold cross-validation scheme are shown. The
highest mean metric values achieved for each random field are shown in bold. 500 samples are
randomly taken for model training from each random field dataset.

Random       LR                   KNN                  DT                   SVC                  RF                   Naive Bayes (Gaussian NB)
field        F1    ACC   AUC      F1    ACC   AUC      F1    ACC   AUC      F1    ACC   AUC      F1    ACC   AUC      F1    ACC   AUC
U_{0.1,1}    0.878 0.955 0.976    0.862 0.950 0.976    0.786 0.931 0.814    0.903 0.962 0.977    0.901 0.962 0.973    0.903 0.962 0.976
U_{0.1,25}   0.666 0.899 0.967    0.800 0.931 0.959    0.645 0.884 0.806    0.820 0.936 0.955    0.775 0.924 0.961    0.662 0.848 0.947
U_{0.3,1}    0.718 0.843 0.915    0.712 0.834 0.906    0.632 0.799 0.765    0.751 0.880 0.922    0.746 0.877 0.940    0.712 0.807 0.893
U_{0.3,25}   0.429 0.729 0.742    0.451 0.759 0.755    0.430 0.693 0.643    0.455 0.775 0.782    0.508 0.775 0.812    0.568 0.699 0.772
U_{0.5,1}    0.762 0.793 0.863    0.738 0.711 0.846    0.634 0.681 0.678    0.816 0.842 0.922    0.808 0.843 0.922    0.823 0.842 0.909
U_{0.5,25}   0.537 0.583 0.658    0.597 0.613 0.646    0.521 0.581 0.565    0.610 0.677 0.721    0.614 0.691 0.721    0.659 0.657 0.673
Entire data  0.603 0.772 0.781    0.619 0.824 0.787    0.583 0.758 0.697    0.701 0.845 0.885    0.722 0.847 0.908    0.643 0.751 0.794

Random       Stacking Ensemble    Bagging Ensemble     Voting Ensemble
field        F1    ACC   AUC      F1    ACC   AUC      F1    ACC   AUC
Entire data  0.717 0.844 0.878    0.737 0.841 0.912    0.696 0.844 0.884
The performance of the classifiers on the entire MC dataset, which combines the various datasets
with different COVs and ξ values, is also examined (Fig. 7a and b). For this
case, a training dataset of 500 samples (0.42% of the data) is used. Here, RF and BG En appear to
outperform the other models.
In general, based on the distribution of accuracy and F1 scores, one can conclude that SVC,
RF and BG En are the most reliable choices for predicting the stability status of heterogeneous
and anisotropic slopes.
Based on AUC scores, for the dataset of low heterogeneity and no anisotropy (Fig. 6a: U_{0.1,1}),
BG En achieves the highest score (AUC=0.980). For low heterogeneity but high anisotropy
(Fig. 6b: U_{0.1,25}), LR provides the best score with AUC=0.967. For both datasets, all
classifiers perform relatively well, denoting the high predictability of the datasets, except the
DT classifier, which can be deemed inappropriate for this problem. With higher heterogeneity,
regardless of the anisotropy level, RF and BG En provide the highest scores (AUC=0.94 for
U_{0.3,1} and AUC=0.813 for U_{0.3,25}; Fig. 6c and d). BG En also performs best for the highest
heterogeneity with no anisotropy (AUC=0.923, Fig. 6e), whereas RF outperforms the others with
AUC=0.721 for the highest heterogeneity and anisotropy (Fig. 6f). When the entire MC dataset
is examined, RF and BG En are the most appropriate classifiers with AUC=0.89 (Fig. 7c).
In general, based on the AUC metric, RF and BG En can be concluded to be the most appropriate
classifiers for this study.
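A sketch of how such ROC curves and AUC values can be produced (scikit-learn; the classifier and the synthetic stand-in data are illustrative assumptions, not the study's random field simulations):

```python
# Illustrative ROC/AUC computation for one classifier on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=500, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# ROC curve from predicted failure probabilities; AUC summarises it
fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
print("AUC = %.3f" % auc(fpr, tpr))
```

Plotting `fpr` against `tpr` for each model yields curves like those of Fig. 6; the area under each curve is the AUC score compared in the text.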
(a) U_{0.1,1}
(b) U_{0.1,25}
(c) U_{0.3,1}
(d) U_{0.3,25}
Fig. 5. The performance of different algorithms obtained from three repeats of 10-fold cross-validation. Boxplots
show the mean (green triangle), median (yellow line), interquartile range (25th to the 75th percentile as the boxes),
minimum and maximum (whiskers), and outliers (circles) for accuracy scores (left panels) and F1 (right panels)
for each model. Models are trained on 500 samples from each random field dataset.
(e) U_{0.5,1}
(f) U_{0.5,25}
Fig. 5 Continued.
(a) U_{0.1,1}   (b) U_{0.1,25}
(c) U_{0.3,1}   (d) U_{0.3,25}
(e) U_{0.5,1}   (f) U_{0.5,25}
Fig. 6 AUC ROC curves for different machine learning algorithms on various datasets. Models are trained on 500 samples from each random field dataset.
a) Accuracy score (ACC) b) F1 score
Fig. 7 The performance of different algorithms on the entire random field data (all COV and ξ datasets). Results
are obtained from three repeats of 10-fold cross-validation. The models are trained on 500 samples (0.42% of the
data). The confusion matrix, normalised over the size of the test dataset, is shown in d) for the
RF model trained with 500 samples and tested against 119,500 samples.
Fig. 8 compares the overall performance of the models on datasets characterised by various levels of soil
heterogeneity and anisotropy. With F1 and AUC being the most important metrics for evaluating
binary classifiers on unbalanced classes, the graphs of Fig. 8 present the performance of the
models with these metrics combined. In the case of random field data with no anisotropy (ξ = 1),
regardless of the level of soil heterogeneity, BG En appears to outperform the other classifiers
with higher F1 and AUC scores (Fig. 8a, c and e). RF and VO En are the next most appropriate
models, with performance scores close to the best values. In the case of random field data with
high anisotropy (ξ = 25), RF is perhaps the most reliable model, while VO En performs better
at low heterogeneity (Fig. 8b) and SVC performs similarly well at high heterogeneity (Fig. 8f).
Ultimately, BG En outperforms the other models when learning the entire MC dataset covering all
heterogeneity and anisotropy levels (Fig. 9).
In general, based on the combined assessment of the F1 and AUC measures for all datasets, we
can conclude that BG En and RF are the most appropriate models for this study, with SVC as
the runner-up. DT, KNN and LR are the least efficient models for this study.
(a) U_{0.1,1}   (b) U_{0.1,25}
(c) U_{0.3,1}   (d) U_{0.3,25}
(e) U_{0.5,1}   (f) U_{0.5,25}
Fig. 8 AUC versus F1 scores for different algorithms on various datasets trained with 500 samples (5% of each dataset).
Fig. 9 AUC versus F1 scores for different algorithms on the entire random field dataset (120,000 samples), trained with 500 samples (0.42% of the data).
7.3. The effects of the heterogeneity and anisotropy of random fields on classifier
performances
Fig. 10 presents the performance of the classifiers in the MC reliability analysis with respect to
different levels of heterogeneity (COV = 0.1–0.5) and anisotropy (ξ = 1–25). The best performing
classifiers, RF, SVC and BG En, are trained on 5% of the data (500 samples) for each dataset,
and the datapoints in Fig. 10 are the mean values of the results obtained from the three times
repeated CV. A general trend in all models is the declining performance (decreased accuracy and
AUC scores) with increasing heterogeneity and anisotropy. This behaviour can lie in the
increased complexity of the failure mechanisms in such random soils. The failure of a less
heterogeneous slope is better correlated with the average of the random strength variables. Such
slopes are also more likely to fail via typical shear surfaces, which are better learned by the
classifiers. A semi-circular geometry of the failure surface is attributed to these failure cases.
With increasing heterogeneity, the predictability decreases as the failure mechanisms become more
complex, depending on the interconnection of local weak zones with irregular failure surfaces.
The effects of anisotropy are also evident in Fig. 10, where a higher anisotropy ratio (ξ) leads
to lower predictability of the stability status of slopes. These effects may also be rooted in the
complexity of the failure mechanisms, where not only the local weak zones (due to
heterogeneity) may cause failure, but the correlated weak layers distributed spatially
through the field can also contribute to irregular failure surfaces. Therefore, any failure geometry,
from a circular deep surface to a local shearing zone or a shallow or deep layered slide, is a
possible mechanism. The lower repeatability of the different mechanisms can reduce the prediction
accuracy.
In general, RF and BG En perform similarly well on all datasets, while according to the AUC
metric (Fig. 10b), SVC is less efficient than the other two models at the moderate level of
heterogeneity (COV=0.3).
It should be noted that while the F1 score is a useful measure for evaluating binary classifiers
on unbalanced data, its value is sensitive to the class ratio, which differs between the
different datasets with varying ratios of unbalanced classes. Thus, the variation of the F1 score
across datasets should be interpreted with caution.
(a)
(b)
Fig. 10 Accuracy score, ACC (a) and Area Under Curve, AUC score (b) versus the anisotropy ratio ξ for
Random Forest (RF), Support Vector (SVC) and Bagging Ensemble (BG En) trained with 500 samples for each
dataset.
While the ACC and AUC for the binary classification (failure, non-failure) may decrease to
about 70%, it should be noted that the errors in the predicted probability of failure, p_f, can be
much smaller. The binary nature of the results, either failure or non-failure, can lead to a
high number of near-to-failure samples being incorrectly classified. Whereas in predictions of
FOS data, fluctuations in the near-to-failure (FOS ≈ 1) predictions induce a
smoothed error in the calculated mean FOS, similar model performance on a binary
classification may yield lower accuracies. However, p_f is calculated as the ratio of the
number of failure predictions to the total number of samples. Near-to-failure misclassified
samples can therefore balance out, reducing the average error in the
calculated p_f.
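This cancellation can be illustrated numerically. The sketch below uses synthetic labels and an idealised, perfectly class-balanced misclassification pattern (an assumption for illustration, not the behaviour of the trained models):

```python
# Balanced misclassifications near the failure boundary cancel in
# p_f = (# predicted failures) / (# samples), so the p_f error can be far
# smaller than the classification error rate.
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.2).astype(int)  # ~20% true failures

# misclassify 300 stable samples as failed and 300 failed samples as stable
y_pred = y_true.copy()
y_pred[rng.choice(np.flatnonzero(y_true == 0), 300, replace=False)] = 1
y_pred[rng.choice(np.flatnonzero(y_true == 1), 300, replace=False)] = 0

acc = (y_pred == y_true).mean()            # 600 of 10,000 samples are wrong
pf_error = abs(y_pred.mean() - y_true.mean())  # flips cancel exactly here
print(f"accuracy = {acc:.3f}, p_f error = {pf_error:.6f}")
```

With 600 of 10,000 samples misclassified, the accuracy drops to 0.94 yet the estimated p_f is unchanged; in practice the cancellation is only partial, but the same effect keeps the p_f errors small.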
The probability of failure p_f predicted by the ML models trained on only 5% of the random field data
is shown in Fig. 11. The errors in the predicted p_f are below 1% for all models, with
an average of 0.46% over all cases of heterogeneity and anisotropy (Fig. 11a).
The actual p_f values obtained from the complete MC data are shown as data points in Fig. 11b.
Trained on only 5% of the data, the ML surrogate models can provide highly accurate predictions.
(a)
(b)
Fig. 11 The errors of the ML-predicted probability of failure p_f (a) and the standard deviation (std) of the ML-predicted
p_f (b) versus the anisotropy ratio ξ. The SVC model is used here. The actual p_f in (b) is obtained from the
complete MC data. ML models are trained with 500 samples for each dataset (5% of data).
7.5. Computational time
The CPU time for each random finite difference simulation without a FOS calculation (i.e.,
determining only the stability status of the slope) is 43 seconds on average on a laptop with an Intel Core
i3-5010U 2.1 GHz processor using four cores. Thus, each dataset of particular COV, ξ and μ
values, consisting of 2,000 simulations, takes a CPU time of 23.9 hours. The entire MC database
of 120,000 simulations therefore requires about 60 days, while a similar dataset with calculated
FOSs can take up to 306 days to complete. The ML models are
trained on 500 samples to predict the entire dataset. Generating this
training dataset takes only about 6 hours, while the ML training and predictions are
performed within a few minutes. Therefore, the proposed ML models can reduce the
computational CPU time of such a study from 306 days to 6 hours, with an expected accuracy score of
~85% and AUC score of ~91%. In the case of random finite difference
simulations with FOS calculation using the strength reduction technique, the CPU time for each
simulation reaches 220 seconds on average, demanding unaffordable computational costs for
such a study. A summary of the computation times and expected performance metrics is
presented in Table 6.
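The quoted figures follow from simple arithmetic; a quick check (all constants taken from the text above):

```python
# Back-of-envelope check of the CPU-time figures in Table 6:
# 43 s per stability-only run, 220 s per run with FOS calculation.
per_run_s, per_run_fos_s = 43, 220
n_dataset, n_total, n_train = 2_000, 120_000, 500

def hours(seconds):
    return seconds / 3600

print(f"one dataset:    {hours(per_run_s * n_dataset):.1f} h")            # ~23.9 h
print(f"entire MC set:  {hours(per_run_s * n_total) / 24:.0f} days")      # ~60 days
print(f"with FOS:       {hours(per_run_fos_s * n_total) / 24:.0f} days")  # ~306 days
print(f"training set:   {hours(per_run_s * n_train):.2f} h")              # ~5.97 h
```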
Table 6. Comparison of the computational time of the original and ML-aided MC methods

                 Original MC method                                  ML-aided MC method
MC dataset       MC       CPU time, no FOS    CPU time, FOS         MC samples       CPU time       Accuracy      F1            AUC           P_f error
                 samples  calculations (h)    calculations (h)      (training)       (h)            (ACC)*
U datasets       10,000   119.4 (~5 days)     611 (~25.5 days)      500 (5% of       5.97           Min: 0.691    Min: 0.568    Min: 0.721    ≤0.010
(specific COV                                                       MC data)         (~0.25 days)   Max: 0.962    Max: 0.903    Max: 0.980
and ξ)
Entire dataset   120,000  1433 (~60 days)     7333 (~306 days)      500 (0.42% of    5.97           0.847         0.737         0.912         0.0046
                                                                    MC data)         (~0.25 days)

* Min and max metric values are associated with datasets U_{0.5,25} and U_{0.1,1}, respectively.
In this study, nine machine learning classifiers are used as response/surrogate models to predict the results of random field slope stability
analyses, for which an MC database containing 120,000 simulations is generated. The random field variable is the soil undrained
shear strength, with COV ranging from 0.1 to 0.5 and the anisotropy ratio ξ ranging from 1 to
25. Predictions on the whole database are made by machine learning surrogate models trained
on only 500 samples. The findings of this research can be summarised as follows:
- Repeated k-fold cross-validation revealed the sensitivity of the ML surrogate models by
providing the distribution of performance scores. A detailed study suggested that the
Bagging Ensemble and Random Forest are the most appropriate models for this random
field study, with SVC the next appropriate choice; DT, KNN and LR are the least
efficient.
- Slightly heterogeneous random fields are highly predictable: when trained on 500
samples (5% of the MC dataset), the accuracy of the ML surrogate models for classifying
failure and non-failure cases reached 96.2% for (COV=0.1, ξ=1), although it decreases
with increasing heterogeneity; the accuracy of ML surrogate models trained on 5% of the data is 88% for (COV=0.3, ξ=1) and 84% for
(COV=0.5, ξ=1).
- The overall performance of the ML surrogate models for the classification task on the entire
random field database (all COV and ξ values), when trained on 500 samples (0.42% of
the data), reached an accuracy of ~85% and an AUC of ~91%.
- The errors in the ML-predicted probability of failure using 5% of the MC data for all
heterogeneity and anisotropy levels are below 1%, with an average of 0.46% over the entire
MC dataset. Such an approach reduced the CPU computational time from 306 days to
only 6 hours.
These efficiency gains apply when no FOS calculations are required. In another publication (Aminpour, Alaie et al. 2022),
we have further explored the efficiency of ML surrogate models in determining the reliability
of slopes. Enriching the model inputs (e.g., combining the input random field data with
geometrical aspects of the problem) can be a potential
research approach. Dimension reduction techniques applied to the random field data can
also be investigated, aiming to improve the model performance. Further investigation of the
training of deep and convolutional neural networks on random field data with failure/non-failure
results (without FOS calculations) can further improve the efficiency of machine learning aided reliability
analysis.
Acknowledgement
This research is funded by the Australian Research Council via the Discovery Projects (No.
DP200100549).
References:
Aminpour, M., R. Alaie, N. Kardani, S. Moridpour and M. Nazem (2022). "Machine learning aided Monte Carlo
reliability analysis on spatially variable random fields: anisotropic heterogeneous slopes." manuscript submitted
for publication.
Au, S.-K. and J. Beck (2003). "Important sampling in high dimensions." Structural safety 25(2): 139-163.
Au, S.-K. and J. L. Beck (2001). "Estimation of small failure probabilities in high dimensions by subset simulation."
Belgiu, M. and L. Drăguţ (2016). "Random forest in remote sensing: A review of applications and future directions."
Bühlmann, P. (2012). Bagging, boosting and ensemble methods. Handbook of computational statistics, Springer:
985-1022.
Bui, X.-N., H. Nguyen, Y. Choi, T. Nguyen-Thoi, J. Zhou and J. Dou (2020). "Prediction of slope failure in open-
pit mines using a novel hybrid artificial intelligence model based on decision tree and evolution algorithm."
Chen, F., L. Wang and W. Zhang (2019). "Reliability assessment on stability of tunnelling perpendicularly beneath
an existing tunnel considering spatial variabilities of rock mass properties." Tunnelling and Underground Space
Cheng, M.-Y. and N.-D. Hoang (2016). "Slope collapse prediction using Bayesian framework with k-nearest
neighbor density estimation: case study in Taiwan." Journal of Computing in Civil Engineering 30(1): 04014116.
Ching, J. and K.-K. Phoon (2013). "Probability distribution for mobilised shear strengths of spatially variable soils
under uniform stress states." Georisk: Assessment and Management of Risk for Engineered Systems and
Cho, S. E. (2007). "Effects of spatial variability of soil properties on slope stability." Engineering Geology 92(3-4):
97-109.
Christian, J. T., C. C. Ladd and G. B. Baecher (1994). "Reliability applied to slope stability analysis." Journal of
Cortes, C. and V. Vapnik (1995). "Support-vector networks." Machine learning 20(3): 273-297.
Das, I., S. Sahoo, C. van Westen, A. Stein and R. Hack (2010). "Landslide susceptibility assessment using logistic
regression and its comparison with a rock mass classification system, along a road section in the northern
Deng, L., A. Smith, N. Dixon and H. Yuan (2021). "Machine learning prediction of landslide deformation behaviour
using acoustic emission and rainfall measurements." Engineering Geology 293: 106315.
Deng, Z.-P., M. Pan, J.-T. Niu, S.-H. Jiang and W.-W. Qian (2021). "Slope reliability analysis in spatially variable
soils using sliced inverse regression-based multivariate adaptive regression spline." Bulletin of Engineering
Džeroski, S. and B. Ženko (2004). "Is combining classifiers with stacking better than selecting the best one?"
El‐Kadi, A. I. and S. A. Williams (2000). "Generating two‐dimensional fields of autocorrelated, normally distributed
Feng, X., S. Li, C. Yuan, P. Zeng and Y. Sun (2018). "Prediction of slope stability using naive Bayes classifier."
Fenton, G. A. and D. V. Griffiths (2008). Risk assessment in geotechnical engineering, John Wiley & Sons New
York.
Gong, W., H. Tang, C. H. Juang and L. Wang (2020). "Optimization design of stabilizing piles in slopes considering
Griffiths, D. V. and G. A. Fenton (2007). Probabilistic methods in geotechnical engineering, Springer Science &
Business Media.
Han, H., B. Shi and L. Zhang (2021). "Prediction of landslide sharp increase displacement by SVM with considering
He, X., F. Wang, W. Li and D. Sheng (2021). "Deep learning for efficient stochastic analysis with spatial variability."
Acta Geotechnica.
He, X., F. Wang, W. Li and D. Sheng (2021). "Efficient reliability analysis considering uncertainty in random field
parameters: Trained neural networks as surrogate models." Computers and Geotechnics 136: 104212.
He, X., H. Xu, H. Sabetamal and D. Sheng (2020). "Machine learning aided stochastic reliability analysis of spatially
Hosmer Jr, D. W., S. Lemeshow and R. X. Sturdivant (2013). Applied logistic regression, John Wiley & Sons.
Hu, X., S. Wu, G. Zhang, W. Zheng, C. Liu, C. He, Z. Liu, X. Guo and H. Zhang (2021). "Landslide displacement
prediction using kinematics-based random forests method: A case study in Jinping Reservoir Area, China."
Huang, J., G. Fenton, D. Griffiths, D. Li and C. Zhou (2017). "On the efficient estimation of small failure probability
Huang, J., G. Fenton, D. V. Griffiths, D. Li and C. Zhou (2017). "On the efficient estimation of small failure
Huang, Y., X. Han and L. Zhao (2021). "Recurrent neural networks for complicated seismic dynamic response
Jamshidi Chenari, R. and R. Alaie (2015). "Effects of anisotropy in correlation structure on the stability of an
undrained clay slope." Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards
9(2): 109-123.
Jiang, S.-H. and J.-S. Huang (2016). "Efficient slope reliability analysis at low-probability levels in spatially variable
Jiang, S.-H., D.-Q. Li, L.-M. Zhang and C.-B. Zhou (2014). "Slope reliability analysis considering spatially variable
shear strength parameters using a non-intrusive stochastic finite element method." Engineering geology 168: 120-
128.
Juang, C. H., J. Zhang, M. Shen and J. Hu (2019). "Probabilistic methods for unified treatment of geotechnical and
Kang, F., Q. Xu and J. Li (2016). "Slope reliability analysis using surrogate models via new support vector machines
Kardani, N., A. Bardhan, S. Gupta, P. Samui, M. Nazem, Y. Zhang and A. Zhou (2021). "Predicting permeability of
tight carbonates using a hybrid machine learning approach of modified equilibrium optimizer and extreme
Kardani, N., A. Zhou, M. Nazem and S.-L. Shen (2021). "Improved prediction of slope stability using a hybrid
stacking ensemble method based on finite element analysis and field data." Journal of Rock Mechanics and
Li, D.-Q., T. Xiao, Z.-J. Cao, C.-B. Zhou and L.-M. Zhang (2016). "Enhancement of random finite element method
in reliability analysis and risk assessment of soil slopes using Subset Simulation." Landslides 13(2): 293-303.
Li, J., Y. Tian and M. J. Cassidy (2015). "Failure mechanism and bearing capacity of footings buried at various
depths in spatially random soil." Journal of Geotechnical and Geoenvironmental Engineering 141(2): 04014099.
Liu, L.-L. and Y.-M. Cheng (2018). "System reliability analysis of soil slopes using an advanced kriging metamodel
Liu, L., S. Zhang, Y.-M. Cheng and L. Liang (2019). "Advanced reliability analysis of slopes in spatially variable
soils using multivariate adaptive regression splines." Geoscience Frontiers 10(2): 671-682.
Lloret-Cabot, M., G. A. Fenton and M. A. Hicks (2014). "On the estimation of scale of fluctuation in geostatistics."
Georisk: Assessment and management of risk for engineered systems and geohazards 8(2): 129-140.
Mucherino, A., P. J. Papajorgji and P. M. Pardalos (2009). K-nearest neighbor classification. Data mining in
Noble, W. S. (2006). "What is a support vector machine?" Nature biotechnology 24(12): 1565-1567.
Papaioannou, I., W. Betz, K. Zwirglmaier and D. Straub (2015). "MCMC algorithms for subset simulation."
Pham, B. T., D. T. Bui and I. Prakash (2017). "Landslide susceptibility assessment using bagging ensemble based
alternating decision trees, logistic regression and J48 decision trees methods: a comparative study." Geotechnical
Phoon, K.-K. and F. H. Kulhawy (1999). "Characterization of geotechnical variability." Canadian geotechnical
Popescu, R., G. Deodatis and A. Nobahar (2005). "Effects of random heterogeneity of soil properties on bearing
Refaeilzadeh, P., L. Tang and H. Liu (2009). "Cross-validation." Encyclopedia of database systems 5: 532-538.
Rish, I. (2001). An empirical study of the naive Bayes classifier. IJCAI 2001 workshop on empirical methods in
artificial intelligence.
Samui, P. and D. Kothari (2011). "Utilization of a least square support vector machine (LSSVM) for slope stability
Sun, D., J. Xu, H. Wen and D. Wang (2021). "Assessment of landslide susceptibility mapping based on Bayesian
hyperparameter optimization: A comparison between logistic regression and random forest." Engineering
Tian, H.-M., D.-Q. Li, Z.-J. Cao, D.-S. Xu and X.-Y. Fu (2021). "Reliability-based monitoring sensitivity analysis
for reinforced slopes using BUS and subset simulation methods." Engineering Geology 293: 106331.
Tsangaratos, P. and I. Ilia (2016). "Comparison of a logistic regression and Naïve Bayes classifier in landslide
susceptibility assessments: The influence of models complexity and training dataset size." Catena 145: 164-179.
Vanmarcke, E. H. (1977). "Probabilistic modeling of soil profiles." Journal of the geotechnical engineering division
103(11): 1227-1246.
Vanmarcke, E. H. (1977). "Reliability of earth slopes." Journal of the Geotechnical Engineering Division 103(11):
1247-1265.
Vetterling, W. T., W. H. Press, S. A. Teukolsky and B. P. Flannery (2002). Numerical recipes example book (c++):
Wang, H., L. Zhang, H. Luo, J. He and R. W. M. Cheung (2021). "AI-powered landslide susceptibility assessment
Wang, L., J. H. Hwang, C. H. Juang and S. Atamturktur (2013). "Reliability-based design of rock slopes — A new
Wang, L., C. Wu, X. Gu, H. Liu, G. Mei and W. Zhang (2020). "Probabilistic stability analysis of earth dam slope
under transient seepage using multivariate adaptive regression splines." Bulletin of Engineering Geology and the
Wang, Y., Z. Cao and S.-K. Au (2010). "Efficient Monte Carlo simulation of parameter sensitivity in probabilistic
Wang, Y., Z. Cao and S.-K. Au (2011). "Practical reliability analysis of slope stability by advanced Monte Carlo
Wang, Z.-Z. and S. H. Goh (2021). "Novel approach to efficient slope reliability analysis in spatially variable soils."
Wei, Z.-l., Q. Lü, H.-y. Sun and Y.-q. Shang (2019). "Estimating the rainfall threshold of a deep-seated landslide by
integrating models for predicting the groundwater level and stability analysis of the slope." Engineering Geology
253: 14-26.
Wong, T.-T. and P.-Y. Yeh (2019). "Reliable accuracy estimates from k-fold cross validation." IEEE Transactions
Zeng, P., T. Zhang, T. Li, R. Jimenez, J. Zhang and X. Sun (2020). "Binary classification method for efficient and
accurate system reliability analyses of layered soil slopes." Georisk: Assessment and Management of Risk for
Zhang, J., K. K. Phoon, D. Zhang, H. Huang and C. Tang (2021). "Deep learning-based evaluation of factor of safety
with confidence interval for tunnel deformation in spatially variable soil." Journal of Rock Mechanics and
Geotechnical Engineering.
Zhang, W., C. Wu, H. Zhong, Y. Li and L. Wang (2021). "Prediction of undrained shear strength using extreme
gradient boosting and random forest based on Bayesian optimization." Geoscience Frontiers 12(1): 469-477.
Zhao, Y., X. Meng, T. Qi, Y. Li, G. Chen, D. Yue and F. Qing (2021). "AI-based rainfall prediction model for debris
Zhu, B., T. Hiraishi, H. Pei and Q. Yang (2021). "Efficient reliability analysis of slopes integrating the random field
method and a Gaussian process regression‐based surrogate model." International Journal for Numerical and
Zhu, B., H. Pei and Q. Yang (2019). "An intelligent response surface method for analyzing slope reliability based on
Gaussian process regression." International Journal for Numerical and Analytical Methods in Geomechanics
43(15): 2431-2448.
Zhu, H., L. Zhang, T. Xiao and X. Li (2017). "Generation of multivariate cross-correlated geotechnical random
Zhu, H. and L. M. Zhang (2013). "Characterizing geotechnical anisotropic spatial variations using random field