0% found this document useful (0 votes)
58 views49 pages

Slope Stability Predictions On Spatially Variable Random Fields Using Machine Learning Surrogate Models

The document discusses using machine learning algorithms as surrogate models for slope stability reliability analysis on spatially variable random fields. It aims to explore using ML models trained on a limited number of Monte Carlo slope stability simulations to predict results from larger datasets. An extensive dataset of 120,000 finite difference Monte Carlo slope stability simulations incorporating soil heterogeneity and anisotropy was generated. Bagging Ensemble, Random Forest and Support Vector classifiers were found to be superior models, able to classify the entire dataset with 85% accuracy when trained on just 0.47% of the data. The ML approach reduced computational time from 306 days to less than 6 hours while maintaining low errors in predicted failure probabilities. Performance of the ML methods decreased with increasing soil anisotropy and heterogeneity.

Uploaded by

Jean Lucas Belo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
58 views49 pages

Slope Stability Predictions On Spatially Variable Random Fields Using Machine Learning Surrogate Models

The document discusses using machine learning algorithms as surrogate models for slope stability reliability analysis on spatially variable random fields. It aims to explore using ML models trained on a limited number of Monte Carlo slope stability simulations to predict results from larger datasets. An extensive dataset of 120,000 finite difference Monte Carlo slope stability simulations incorporating soil heterogeneity and anisotropy was generated. Bagging Ensemble, Random Forest and Support Vector classifiers were found to be superior models, able to classify the entire dataset with 85% accuracy when trained on just 0.47% of the data. The ML approach reduced computational time from 306 days to less than 6 hours while maintaining low errors in predicted failure probabilities. Performance of the ML methods decreased with increasing soil anisotropy and heterogeneity.

Uploaded by

Jean Lucas Belo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 49

Slope stability predictions on spatially variable random fields using

machine learning surrogate models


Mohammad Aminpour*, PhD, Research Assistant (Corresponding author)

Civil and Infrastructure Engineering Discipline, School of Engineering, Royal Melbourne Institute of
Technology (RMIT), Victoria 3001, Australia

E-mail: [email protected]

Reza Alaie, PhD Graduate

Department of Civil Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran

E-mail: [email protected]

Navid Kardani, PhD Research Fellow

Civil and Infrastructure Engineering Discipline, School of Engineering, Royal Melbourne Institute of
Technology (RMIT), Victoria 3001, Australia

Email: [email protected]

Sara Moridpour, PhD, Associate Professor

Civil and Infrastructure Engineering Discipline, School of Engineering, Royal Melbourne Institute of
Technology (RMIT), Victoria 3001, Australia

Email: [email protected]

Majidreza Nazem, PhD, Associate Professor

Civil and Infrastructure Engineering Discipline, School of Engineering, Royal Melbourne Institute of
Technology (RMIT), Victoria 3001, Australia

Email: [email protected]

To be submitted to

Engineering Geology, Elsevier

November 2021


 
Slope stability predictions on spatially variable random fields using
machine learning surrogate models
Mohammad Aminpour1*, Reza Alaie2, Navid Kardani1, Sara Moridpour1, Majidreza Nazem1
1
Civil and Infrastructure Engineering, School of Engineering, RMIT University, Melbourne,
Australia
2
Department of Civil Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran.

Corresponding author: Mohammad Aminpour, E-mail: [email protected]

Abstract:

Random field Monte Carlo (MC) reliability analysis is a robust stochastic method to determine

the probability of failure. This method, however, requires a large number of numerical

simulations demanding high computational costs. This paper explores the efficiency of

different machine learning (ML) algorithms used as surrogate models trained on a limited

number of random field slope stability simulations in predicting the results of large datasets.

The MC data in this paper require only the examination of failure or non-failure, circumventing

the time-consuming calculation of factors of safety. An extensive dataset is generated,

consisting of 120,000 finite difference MC slope stability simulations incorporating different

levels of soil heterogeneity and anisotropy. The Bagging Ensemble, Random Forest and

Support Vector classifiers are found to be the superior models for this problem amongst 9

different models and ensemble classifiers. Trained only on 0.47% of data (500 samples), the

ML model can classify the entire 120,000 samples with an accuracy of %85 and AUC score of

%91. The performance of ML methods in classifying the random field slope stability results

generally reduces with higher anisotropy and heterogeneity of soil. The ML assisted MC

reliability analysis proves a robust stochastic method where errors in the predicted probability

of failure using %5 of MC data is only %0.46 in average. The approach reduced the

computational time from 306 days to less than 6 hours.

Keywords: Machine learning; slope stability; Monte Carlo; surrogate models; anisotropy;

heterogeneity


 
Highlights

 Nine machine Learning (ML) classifiers are developed as surrogate models for slope

stability reliability analysis.

 To validate the ML models, a complete Monte Carlo (MC) dataset containing 120,000

anisotropic heterogeneous slopes is generated.

 Compared to complete MC datasets, surrogate models prove a robust stochastic

reliability method with high accuracies.

 Using 5% of MC data, the error of ML predicted probability of failure for all anisotropy

and heterogeneity levels is below 1% with an average of 0.46%.

 With no need to calculate the factors of safety, the approach reduced the computational

time from 306 days to less than 6 hours.


 
1. Introduction

The uncertainty of geotechnical parameters is the core of reliability-based design in

geotechnics. The uncertainty can be associated with inherent spatial variability of soil or

deterministic calculation models in transforming measurement information into design

parameters (Phoon and Kulhawy 1999, Juang, Zhang et al. 2019). The inherent variability of

geotechnical parameters is known as the major source of uncertainty (Fenton and Griffiths

2008) with a significant influence on slope stability (Christian, Ladd et al. 1994, Ching and

Phoon 2013, Lloret-Cabot, Fenton et al. 2014), Lloret-Cabot, Fenton et al. 2014). However,

while being critically important in a safe geotechnical design, reliability analysis can be

practically difficult to implement due to the great computational efforts required. Combined

with the random field theory (Vanmarcke 1977, Vanmarcke 1977) and the finite element

method, the Monte Carlo (MC) approach as a conceptually simple and robust statistical method

(Wang, Cao et al. 2010, Wang, Cao et al. 2011) involves the production of thousands of

simulations to evaluate the probability of geotechnical events such as slope failure. In small

probability events which are of high importance in geotechnical practice, the MC method can

be computationally inefficient. The significant computational time and power needed for this

method enforce geotechnical engineers to trade off the risk management for timely progress of

project design. However, underestimating the impacts of geotechnical uncertainties can

remarkably increase the associated risks with examples in tunnelling (Chen, Wang et al. 2019),

slope stability (Cho 2007, Wang, Hwang et al. 2013), Wang, Hwang et al. 2013, Gong, Tang

et al. 2020), bearing capacity (Li, Tian et al. 2015), etc.

Different techniques are introduced to address the computational challenges of geotechnical

MC reliability assessments, including the variance reduction methods such as importance

sampling (Au and Beck 2003) or subset simulation (Au and Beck 2001, Papaioannou, Betz et

al. 2015), Papaioannou, Betz et al. 2015). It is shown that the importance sampling is not


 
applicable in problems with a large number of random variables (Huang, Fenton et al. 2017).

Also, the subset simulation which has been applied for small failure probability problems in

slope stability (Li, Xiao et al. 2016, Tian, Li et al. 2021), appeared to be even more time

consuming than an MC simulation for a given accuracy when a search for the factor of safety

(FOS) is required (Huang, Fenton et al. 2017). The subset simulation is enhanced by avoiding

a search for FOS using the strength reduction technique to increase the efficiency of this

method. However, even with such enhancement, the subset simulation has not reduced the

simulation time more than 3 times and can be still less efficient than a complete MC simulation

in many cases (Huang, Fenton et al. 2017).

Another technique to increase the efficiency of reliability analysis of slope stability is the

construction of surrogate models or response surface methods, including polynomial chaotic

expansion, support vector regression, Kriging model, and multiple adaptive regression spline

(Jiang, Li et al. 2014, Jiang and Huang 2016, Kang, Xu et al. 2016, Liu and Cheng 2018, Liu,

Zhang et al. 2019, Wang, Wu et al. 2020, Zeng, Zhang et al. 2020, Deng, Pan et al. 2021). This

technique involves the introduction of surrogate models to replace the Finite Element

simulations approximating the relationship between the input variables and the output

response. Among the surrogate models, machine learning algorithms have received increasing

attention in slope stability problems with spatial variability (Kang, Xu et al. 2016, Zhu, Pei et

al. 2019, He, Xu et al. 2020, He, Wang et al. 2021, He, Wang et al. 2021, Wang and Goh 2021,

Zhang, Phoon et al. 2021, Zhu, Hiraishi et al. 2021). ML methods have been widely used in

slope stability predictions and landslide characterisations (Wei, Lü et al. 2019, Deng, Smith et

al. 2021, Han, Shi et al. 2021, Hu, Wu et al. 2021, Huang, Han et al. 2021, Sun, Xu et al. 2021,

Wang, Zhang et al. 2021, Zhao, Meng et al. 2021). However, a systematic study on different

ML models when used as surrogate models and in particular, the dependence of their

performance on various levels of heterogeneity and anisotropy of random fields is yet to be


 
performed. Moreover, previous studies have normally trained the ML models on the data of

Monte Carlo simulations with calculated FOSs which are tedious tasks in geotechnical

simulations. However, in a direct MC simulation, the probability of failure can be examined

without a need for the calculation of FOS in each simulation (Huang, Fenton et al. 2017).

Conducted either by numerous failure surfaces in a limit equilibrium analysis or using the

strength reduction method, the calculation of FOS can significantly increase the computational

cost of the MC method. Little is known about the performance of ML surrogate models on

slope stability MC data without determined FOSs.

In this study, we systematically investigate the performance of different ML surrogate models

on spatially variable random field MC data with no FOS calculated. The selected ML

algorithms include six models and three ensemble classifiers which have been used in

geotechnical engineering for more than a decade with their appropriateness being shown in

slope stability studies (Das, Sahoo et al. 2010, Samui and Kothari 2011, , Samui and Kothari

2011, Cheng and Hoang 2016, Pham, Bui et al. 2017, Feng, Li et al. 2018, , Pham, Bui et al.

2017, Feng, Li et al. 2018, Bui, Nguyen et al. 2020, Kardani, Zhou et al. 2021, Zhang, Wu et

al. 2021), Kardani, Zhou et al. 2021, Zhang, Wu et al. 2021). These models are trained on small

fractions of MC data when the outcome of MC simulations are only the failure or non-failure

status of slopes, thus further increasing the efficiency with the less computational time needed

for calculated FOSs. The performance of models is tested against a complete MC database

covering a wide range of random field heterogeneity and anisotropy. The sensitivity of results

is assessed using the repeated k-fold cross-validation technique.

This paper is structured as follows. Section 2 describes the technical method to generate the

random fields. Section 3 introduces the random field finite difference computational method to

generate slope models. The details of the generated Monte Carlo data with statistical

parameters are summarised in Section 4. An overview on the evaluation approach of ML


 
surrogate models implemented in this study is given in Section 5. The machine learning

algorithms and ensemble classifiers are briefly described in Section 6. Section 7 provides the

results comparing the performance of different ML classifiers when used as surrogate models

on various datasets with respect to the levels of anisotropy and heterogeneity using suitable

performance scores. The sensitivity of the results is also discussed. A summary with an

overview of the future research directions is presented in Section 8.

2. Random field generation

A random field as a random function defined over an arbitrary domain is used to model the

spatial variability of geotechnical parameters. Statistical parameters including mean value μ,

standard deviation σ and the correlation distance δ are used to characterise the random fields.

The correlation distance, also known as the scale of fluctuation, characterises the variation of

the field in space which is captured by the covariance function. Assuming a Gaussian and

stationary random field where the complete probability distribution is independent of absolute

location, the undrained shear strength parameter, C , is the random variable in this study.

Among several methods for the generation of random fields such as Covariance Matrix

Decomposition, Moving Average, Fast Fourier Transform, Turning Bands Method and Local

Average Subdivision Method (Griffiths and Fenton 2007), the Cholesky decomposition

technique is selected and used here which is developed in Ref. (Vetterling, Press et al. 2002)

and utilised in various geotechnical studies (Zhu and Zhang 2013) (Zhu, Zhang et al. 2017).

In this paper, the random field is defined as (El‐Kadi and Williams 2000)

𝐶   𝑥    𝑒𝑥𝑝  𝐿. 𝜀 𝜇 (1)

where C x is the undrained shear strength of soil at the spatial position x, ε is an independent

random variable normally distributed with zero mean and unit variance, μ is the mean of


 
the logarithm of C , and L is a lower-triangular matrix computed from the decomposition of

the covariance matrix using the Cholesky decomposition technique (Griffiths and Fenton). The

anisotropic exponential Markovian covariance function is defined as (Fenton and Griffiths

2008)

|𝑙 | 𝑙 (2)
𝐴 𝑙 ,𝑙   𝜎 𝑒𝑥𝑝
𝛿 𝛿

where l and l are the horizontal and vertical distances between two arbitrary points, δ and

δ are the horizontal and vertical scales of fluctuation (correlation distances), respectively, and

σ is the standard deviation of the logarithms of undrained shear strength.

According to the Cholesky decomposition technique, the covariance between the logarithms of

the random variable values (here C ) at any two points is decomposed into

𝐴 𝐿𝐿 (3)

where T stands for transpose.

The mean (μ) and the standard deviation (σ) of the logarithms of C are computed as

1 (4)
𝜇 𝑙𝑛 𝜇 𝜎
2

(5)
𝜎 𝑙𝑛 1 𝐶𝑂𝑉

where COV is the coefficient of variation of C which is obtained by

𝐶𝑂𝑉 . (6)

Random filed simulations include a range of mean undrained shear strength values, μ from

18.6 to 33.5 kPa. These μ values are chosen based on the FOSs associated with homogeneous


 
slopes where the shear strength of soil is equal to μ . The FOS for a homogeneous slope with

C 18.6 kPa is equal to 1. The other μ values are determined as 1.2, 1.4, 1.6 and 1.8 times

of 18.6 kPa. The COV values are changed as 0.1, 0.3 and 0.5 representing low to high levels

of heterogeneity. The vertical correlation distance, δ is kept constant as 1 m, whereas the

horizontal correlation distance, δ varies from 1 to 25 m. Thus, the anisotropy ratio, ξ

varies from 1 to 25. The summary of statistical parameters used in this study is presented in

Table .

Table 1. The statistical parameters of spatial variability of soil in this study

Parameter Values

Mean of undrained shear strength, μ kPa 18.6, 22.3, 26, 29.7, 33.5

Coefficient of variation, 𝐶𝑂𝑉 0.1, 0.3, 0.5

Horizontal correlation distance, δ , m 1, 6, 12, 25

Vertical correlation distance, δ , m 1

3. Finite difference random field model

The slope considered for this study, as shown in Fig. 1, has the following geometrical

specifications: the slope angle of 45°, the slope height of 5 m and the foundation depth of 10

m from the top of the slope. The foundation level is assigned with a rigid boundary surface

simulating the hard rock bed with all movements restricted. The horizontal displacements are

also restricted at the vertical boundaries. A linear elastic-perfectly plastic stress-strain

behaviour incorporating the Mohr-Coulomb failure criterion is adapted for the soil. The shear

modulus of soil G is assumed to be a linear function of the undrained shear strength as G

I C where I is the rigidity index.  I  is defined as the ratio between the shear modulus G and

undrained shear strength. Here, I 800 denoting a moderately stiff soil is assumed according

to the ranges suggested in the literature (e.g., I 300 1500 (Popescu, Deodatis et al.

2005)). The soil unit weight equal to γ 20kN/m is kept constant in all simulations.


 
FLAC (FLAC Itasca) is used for random finite difference simulations. The mesh size was tested

to be sufficiently fine in acquiring accurate results with less than 5% relative errors compared

to a highly fine but computationally expensive mesh size. A four-nodded quadrilateral grid

with a 0.5m side dimension was generated for this study resulting in 800 cells. Each cell

attributes to a different C value in a random field realisation (Fig. 1).

Fig. 1 The slope random finite difference model comprising 800 random variable cells

To identify if a slope is stable or not, an alternative criterion for slope failure is adapted

replacing the calculation of FOS using the traditional time-consuming strength reduction

technique. In this method, a combined criterion for plastic zones and the velocity of grids is

assessed in the finite difference model. The failure surface can be detected if a contiguous

region of active plastic zones connecting two boundary surfaces is developed. For the velocity

criterion, the failure is also determined when both the velocity gradient and amplitude are

greater than the determined thresholds. The alternative failure criterion is shown to be efficient

with accurate results as compared to stochastic stability results obtained using the strength

reduction technique (see Ref. (Jamshidi Chenari and Alaie 2015) for further details).

4. Monte Carlo reliability analysis

Monte Carlo reliability analysis involves the simulation of a typically large number of samples

for desired statistical parameters. The statistics of the results such as the probability of failure

converges to a constant value in the large datasets. For this study, it can be shown that 2000

10 
 
samples are sufficient (Jamshidi Chenari and Alaie 2015) for constant statistics of the results.

This number agrees with the findings suggesting that a typical number of 2500 simulations can

provide reasonable precision and reproducibility for random field finite element studies

(Griffiths and Fenton 2007). Thus, for each set of statistical parameters μ , COV , δ , δ ,

2000 random field realisations are generated. Each realisation is simulated using the finite

difference method to determine the stability status of slopes. The outcomes of all analyses as a

binary result (failure or non-failure) provide the Monte Carlo data for the reliability analysis.

A summary of the details of the random fields generated in this study is shown in Table 2.

Table 2. The characteristics of the random fields generated in this study

Random field δ Mean of undrained shear strength, No. of random field


COV ξ
label (m) μ (kPa) simulations

U . , 0.1 1 1 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.1 1 6 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.1 1 12 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.1 1 25 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.3 1 1 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.3 1 6 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.3 1 12 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.3 1 25 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.5 1 1 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.5 1 6 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.5 1 12 18.6, 22.3, 26, 29.7, 33.5 5 2000

U . , 0.5 1 25 18.6, 22.3, 26, 29.7, 33.5 5 2000

total 120000

11 
 
Error! Reference source not found. illustrates some examples of random field realisations.

Each plot in Error! Reference source not found. shows a single realisation of the random

fields for the given μ , COV and ξ parameters from 2000 realisations. The mean value of C

for all illustrated slopes is 18.6 kPa.

a U . , b U . ,

c U . , d U . ,

Fig. 2 Examples of the random field slope models with spatial variability of the undrained shear strength of the

soil.

5. Evaluation of ML surrogate models

In this paper, a complete Monte Carlo analysis on a typical slope stability problem in

conducted. Covering a wide range of heterogeneity and anisotropy of random fields, the MC

study provides a full-size database as a benchmark by which an actual evaluation of the trained

ML surrogate models can be possible. The classification tasks and the probability of failure

obtained from ML surrogate models using 5% of data are then compared with the actual results

from the MC benchmark data. An overview of the implemented approach in this study is shown

as a flowchart in Fig. 3.

12 
 
Fig. 3 The approach implemented in this study in evaluating the ML surrogate models versus actual full-size MC

datasets.

6. Machine Learning Algorithms

6.1. Logistic Regression

The possibility of a binary result can be predicted using the Logistic Regression (LR)

algorithm. The LR is utilised to describe the connection between one dependent binary variable

and one or more independent variables at the nominal, interval, ordinal, or ratio level. A binary

logistic model mathematically consists of a dependent variable with two potential values

indicated by an indicator variable with the values "0" and "1". A LR model generally estimates

13 
 
the likelihood of belonging to one of the two classes in the dataset. The following equation

depicts the possibility of y 1:

1 (7)
P y 1|x; w  
1 e

where w denotes the regression coefficient.

Given this is a minimisation problem, the LR objective function may be stated as:

1 (8)
min , w w C log exp y x w c 1  
2

where c is a constant term and C is a previously fixed hyperparameter that controls the balance

between two terms of the objective function (Hosmer Jr, Lemeshow et al. 2013).

6.2. K-Nearest Neighbour

As a simple ML algorithm, the K-Nearest Neighbour (KNN) is a non-parametric supervised

classification method. The model is a type of non-generalising learning or instance‐based

learning that, instead of constructing an internal model, only remembers instances of the

training data. A simple majority vote of the nearest neighbours of each sample vector is the

basis of this algorithm.

By computing the average observations within a particular neighbourhood, KNN naturally

models the connection between the independent factors and the continuous result. Cross-

validation may be used to establish or to choose the neighbourhood size to decrease the mean

square error. A KNN model is simple to build and may effectively handle non-linearity.

Additionally, model fitting is quick since no particular parameter or number has to be

computed. KNN may be carried out in three distinct steps:

14 
 
Step 1: Calculate the distance between the objects to be categorised and calculate their

neighbours utilising one of the available distance functions (e.g. Minkowski, Manhattan,

Euclidean).

Step 2: Choose the K closest data points (the items with the K lowest distances).

Step 3: Perform a "majority vote" among the data points; the categorisation that receives the

most votes in that pool becomes the final categorisation.

Both data inspection and hyper-parameter methods may be used to determine the optimum

value for K (Mucherino, Papajorgji et al. 2009, Cheng and Hoang 2016).

6.3. Decision Tree

A decision tree model is a tree data structure with an unspecified number of nodes and

branches. Internal nodes are those that have outward edges, whereas others are referred to as

leaves. A specific internal node divides the examples utilised for classification or regression

into two or more categories. During the training phase, the input variable values are compared

to a specific function. A decision tree stimulant is an algorithm that generates a decision tree

given particular occurrences. The performed algorithm attempts to find the optimal decision

trees by lowering the fitness function. Since the datasets of this study contain two classes

(stable or failed), a categorisation model is fitted to the target variable employing each

independent variable in this study. A decision node is composed of at least two branches. A

leaf node denotes a categorisation or decision. The root node is the highest decision node in a

tree that correlates to the best predictor.

An overly big tree improves the chance of overfitting the training data and performing poorly

on new samples. The sample space may include important structural information, but a too-

small tree may not be able to capture it. The horizon effect is a term that refers to this problem.

One commonly used strategy is to continue expanding the tree until each node has a limited

15 
 
number of instances and then prune the tree to eliminate nodes with no further information.

The primary goal of pruning is to minimise the size of a decision tree while maintaining

prediction accuracy as evaluated by a set of cross-validations (Feng, Li et al. 2018).

6.4. Support Vector Classifier

Support Vector Machine (SVM) is often used for three primary objectives as a supervised

learning technique: classification, regression and pattern recognition. We call an SVM used for

classification as a Support Vector Classifier (SVC). Fast computation to achieve globally

optimal and effective performance to resist overfitting are two significant characteristics

distinguishing SVM from basic artificial neural networks. Considering a dataset with N

samples and a represented result type of y (1 or 0), the following training vectors are used for

binary classification:

D x , y , x , y , … , x , y  x  ∈  R  ,  y   ∈   0,  1 (9)

where n influencing features exist in the n-dimensional space that may be used for addressing

the decision boundary. Soft-margin SVMs impose penalties and may be utilised well for

nonlinear classification when combined with the kernel technique. Thus, every function f(x) in

the SVM kernel trick may be shown in the following manner (Noble 2006):

f x w ϕ x b 0 (10)

where W is the vector of the output layer, and b denotes the bias. The input variable x is defined

as an N n matrix, and ϕ x is a kernel function.

A problem of reduction is proposed to determine w and b (Cortes and Vapnik 1995):

Minimise:  w C∑  χ (11)

Subjected to: y w.  x b 1  χ

16 
 
Here, C denotes a penalty coefficient while  χ 0 denotes slack variables, which relate to the

misclassification effects.

6.5. Random Forest

Random Forest (RF) is a method for ensemble learning based on divide-and-conquer widely

used in real-world classification and regression. It uses bootstrap aggregation, commonly

known as bagging, to create an ensemble of randomly generated base learners (unpruned DTs).

The RF critical point is the creation of a collection of DTs subjected to controlled alternation.

Another critical element of the training process is the random choice of features. The method

constructs a subset of M important variables for each node in the trees. It is necessary to choose

a variety of influencing factors to guarantee DT variation. As a result, each of the DTs conducts

an independent assessment of the forest. The results are obtained by voting on the whole tree.

In general, the RF algorithm is composed of the following four steps:

1- Randomly choose samples from a dataset.

2- For each sample, construct a decision tree and obtain prediction results from the decision

tree.

3- Vote on the outcome of each forecast.

4- Assign the final prediction to the most chosen prediction result.

Additional information regarding RF may be found in References (Belgiu and Drăguţ 2016,

Kardani, Bardhan et al. 2021), Kardani, Bardhan et al. 2021) .

6.6. Naive Bayes

Naive Bayes (NB) is a basic, but robust, method for classifying binary and multi-class issues.

Because NB makes significant assumptions about the independence of characteristics based on

17 
 
Bayes' theory, it may be classified as a probabilistic category. The following equation expresses

Bayes' theory:

P B|A P A (12)
P A|B
P B

where P A|B is the possibility of A if B occurs (or posterior probability), viz., a conditional

probability implying that A's possibility is dependent on what happens to B.

Gaussian Naive Bayes (Gaussian NB) is an enhanced version of the basic and most common

Naive Bayes. The Gaussian distribution requires the standard and mean deviation to be

calculated using the training data (Rish 2001) (Tsangaratos and Ilia 2016).

6.7. Stacking ensemble

Stacking generalisation is an ensemble algorithm that employs a meta-learning model to learn

how to combine the predictions of multiple base models maximising the performance of the

combined model. The generalisation method is indeed based on a new machine learning model

in the ensemble that learns when to use each base model for a problem. The base models are

typically different and are combined by a single meta-model. With different base models

having various types of skills on the dataset, the errors in the prediction of different base models

are less correlated. When training the meta-model, the base models are fed with out-of-sample

data, i.e., the data not already used to train the base models. The predictions made by base

models, along with the expected predictions, create the input and output data to train the meta-

model.

The level-1 or base models can be diverse and complex where the difference of the approaches

and assumptions in base models can help the meta-model see the data in different ways. Here

we use a range of various machine learning approaches for the base models including logistic

regression (LR), K-Nearest Neighbours (KNN), Decision Tree (DT), Support Vector

18 
 
Classification (SVC), Random Forest (RF) and Gaussian Naive Bayes (NB). 10-fold cross-

validation of the base models is used where the out-of-fold data create the training dataset for

the meta-model.

A simple model will suit the level-0 or beta-model which provides a smooth combination of

the predictions. Thus, linear models are often used for beta-models. A logistic regression model

is selected here for the classification task performed by the beta-model. As such, the beta-model

derives a weighted average or blending of the predictions made by the base models. 5-fold

cross-validation is used for the beta-model.

While the stacking ensemble is designed to improve the overall performance of model

predictions compared to the base models, it is not guaranteed to always achieve an

improvement. Different aspects may affect the ensemble performance including the sufficiency

of the training data to represent the complexity of the problem and if there are more complex

insights not captured by base models to be revealed by the ensemble (Džeroski and Ženko

2004, Rokach 2010, Kardani, Zhou et al. 2021).

6.8. Bagging ensemble

Bagging ensemble also known as bootstrap aggregation is one of the ensemble models that

combines multiple learners trained on subsamples of the same dataset. Bagging ensemble

reduces the variance of the prediction errors and generally improves the accuracy of the model

particularly when perturbing the learning dataset can cause a significant change in the predictor

constructed.

The bagging ensemble includes a process of (i) creating multiple datasets sampled from the

original data, (ii) training multiple learners on each dataset, and (iii) combining all learners to

generate a single prediction. The combined prediction is generated by averaging the results of

19 
 
the applied models for a regression analysis or majority voting approach for classification

problems.

Sampling with replacement is the core idea in bagging ensembles so that some instances are

included in samples multiple times while others are left out. The ensemble can consist of a

single base learner invoked several times using different subsets of the training set. A random

forest regressor is employed here as the base model (Rokach 2010, (Rokach 2010, Bühlmann

2012).

6.9. Voting ensemble (VO En)

As a meta-algorithm, the voting ensemble combines several base models either as regressors

or classifiers and produces predictions based on voting algorithms. The ensemble is ideally

developed to integrate the predictions of conceptually different base models. A voting

ensemble can be beneficial for a set of equally well-performing models to balance out their

individual weaknesses.

For regression predictions, the algorithm involves calculating the average of the predictions

from base models. In the case of classifier ensembles, a voting ensemble contains several base

classifiers. The voting ensemble can use the majority voting (hard voting) in which each base

classifier is entitled to one vote where a class with the highest number of votes is determined

as the final prediction. If the number of classifiers is even, a tiebreak rule should be identified.

The ensemble can also use the average predicted probabilities (soft voting) to combine the

predicted classes of base models. The number of base classifiers can be arbitrary (Rokach

2010). In this study, the base models of the VO En are LR, KNN, DT, RF, SVC and Gaussian

NB.

7. Results and discussion

20 
 
7.1. Metrics

In this study, the accuracy (ACC), F1-score, AUC-score and some metric values extracted from

the confusion matrix have been used to compare the performance of the ML models as

classifiers for the slope stability problem.

Confusion matrix (see Table 3), also known as the error matrix, is a tabular summary used to

describe and visualise the performance of ML models in a supervised classification.

Table 3 Confusion Matrix

Actual Values Predicted Values

Negative Positive

Negative True Negative, TN False Positive, FP

Positive False Negative, FN True Positive, TP

According to Table 3, some metric values including accuracy, sensitivity (true positive rate),

specificity (true negative rate), and false positive rate can be defined. The following equations

indicate the mathematic expressions of these metrics:

TP TN (13)
Accuracy ACC
total instances

TP (14)
Sensitivity recall or true positive rate
TP FN

TN (15)
Specificity true negative rate
TN FP

FP (16)
False positive rate 1 Specificity
TN FP

Another performance metric used in this research is F1-Score, also known as F measure. It is

employed as a metric to evaluate the accuracy using both precision ( ) and recall ( ).

21 
 
In other words, F1-Score reaches its best value at 1 and worst at 0 through the harmonic average

of the precision and recall as expressed in the following equation:

TP (17)
F
1
TP FP FN
2

It is worth mentioning that in statistical analysis of geotechnical stability problems, the class

of instability events can be rare in comparison with the stable cases. In such problems, the issue

of class imbalance should be carefully considered in training the ML models to improve their

performance. For uneven class distributions, where a balance between precision and recall is

needed, F1-Score can be a better measure to use than the accuracy.

In addition, the receiver operating characteristic (ROC) curve, which plots the true positive rate

against the false positive rate at various cut-off values, is used for further analysis of the models.

This graphical plot can be employed to express the predictive ability of a model. The area under

the ROC curve, i.e., AUC, is commonly used as a measure of the performance of classifiers.

AUC ranges in value from 0 to 1, representing a totally wrong to fully correct prediction by a

classifier, respectively. As a measure to evaluating the AUC-Score values, one may consider

the following guide: Outstanding (0.9 < AUC < 1), Excellent (0.8 < AUC < 0.9) and Acceptable

(0.7 < AUC < 0.8) (Hosmer Jr, Lemeshow et al. 2013).

7.2. The performance of classifiers on slope stability MC data

Confusion matrices: The confusion matrices of RF, SVC and BG En models on test datasets

for a single iteration of model training are shown in Table 4. The models are trained on %5 of

MC data and tested on the remaining %95. The table provides information on the share of

failure and stable classes in each dataset and the model performances on predicting each class.

In the U . , dataset, which is best predicted, while the accuracy reaches %96.4 for all RF, SVC

and BG En, the incorrectly classified cases are mostly the stable slopes wrongly classified as

22 
 
failed (%3.3 of samples). When the accuracy declines to the lowest value among datasets in

the U . , dataset (ACC=0.671 for RF), the wrongly classified classes are both from actually

stable and failed slopes with %14.3 and %18.5 of all samples being wrongly classified as failed

and stable, respectively. For the entire MC dataset with 119,500 test samples, the models

trained only on 500 samples provide an accuracy reaching %84.3. The RF and BG En

classifiers wrongly classify only about %8 of all samples in each class of stable and failed

slopes.

Table 4. Confusion matrices of classifiers on the test datasets. Models are trained on 500 samples

where the test datasets consist of 9500 samples for individual U datasets and 119,500 samples for the

entire data. The values in parentheses are the numbers normalised over the total number of samples in

each test dataset.

RF SVC BG En
Predicted Accuracy Predicted Accuracy Predicted Accuracy

Actual Stable Failed (ACC) Stable Failed (ACC) Stable Failed (ACC)

7577 314 7577 314 7578 313


Stable
(0.798) (0.033) (0.798) (0.033) (0.798) (0.033)
𝐔𝟎.𝟏,𝟏 0.964 0.964 0.964
32 1577 27 1582 28 1581
Failed
(0.003) (0.166) (0.003) (0.167) (0.003) (0.166)
7640 409 7585 464 7626 423
Stable
(0.804) (0.043) (0.798) (0.049) (0.803) (0.045)
𝐔𝟎.𝟏,𝟔 0.938 0.949 0.939
184 1267 19 1432 156 1295
Failed
(0.019) (0.133) (0.002) (0.151) (0.016) (0.136)
7603 441 7555 489 7594 450
Stable
(0.800) (0.046) (0.795) (0.051) (0.799) (0.047)
𝐔𝟎.𝟏,𝟏𝟐 0.931 0.942 0.933
211 1245 61 1395 184 1272
Failed
(0.022) (0.131) (0.006) (0.147) (0.019) (0.134)
7612 413 7544 481 7610 415
Stable
(0.801) (0.043) (0.794) (0.051) (0.801) (0.044)
𝐔𝟎.𝟏,𝟐𝟓 0.931 0.942 0.935
242 1233 71 1404 205 1270
Failed
(0.025) (0.130) (0.007) (0.148) (0.022) (0.134)
6530 342 6580 292 6578 294
Stable
(0.687) (0.036) (0.693) (0.031) (0.692) (0.031)
𝐔𝟎.𝟑,𝟏 0.886 0.888 0.889
744 1884 773 1855 762 1866
Failed
(0.078) (0.198) (0.081) (0.195) (0.080) (0.196)

23 
 
6208 772 5986 994 6196 784
Stable
(0.653) (0.081) (0.630) (0.105) (0.652) (0.083)
𝐔𝟎.𝟑,𝟔 0.823 0.818 0.827
908 1612 738 1782 859 1661
Failed
(0.096) (0.170) (0.078) (0.188) (0.090) (0.175)
6417 775 6368 824 6458 734
Stable
(0.675) (0.082) (0.670) (0.087) (0.680) (0.077)
𝐔𝟎.𝟑,𝟏𝟐 0.795 0.800 0.803
1168 1140 1075 1233 1137 1171
Failed
(0.123) (0.120) (0.113) (0.130) (0.120) (0.123)
6387 742 6536 593 6337 792
Stable
(0.672) (0.078) (0.688) (0.062) (0.667) (0.083)
𝐔𝟎.𝟑,𝟐𝟓 0.785 0.789 0.788
1296 1075 1407 964 1226 1145
Failed
(0.136) (0.113) (0.148) (0.101) (0.129) (0.121)
4729 582 4640 671 4737 574
Stable
(0.498) (0.061) (0.488) (0.071) (0.499) (0.060)
𝐔𝟎.𝟓,𝟏 0.853 0.853 0.854
812 3377 721 3468 812 3377
Failed
(0.085) (0.355) (0.076) (0.365) (0.085) (0.355)
3766 1223 3725 1264 3759 1230
Stable
(0.396) (0.129) (0.392) (0.133) (0.396) (0.129)
𝐔𝟎.𝟓,𝟔 0.721 0.719 0.723
1432 3079 1403 3108 1405 3106
Failed
(0.151) (0.324) (0.148) (0.327) (0.148) (0.327)
3710 1399 3611 1498 3607 1502
Stable
(0.391) (0.147) (0.380) (0.158) (0.380) (0.158)
𝐔𝟎.𝟓,𝟏𝟐 0.682 0.679 0.683
1619 2772 1553 2838 1508 2883
Failed
(0.170) (0.292) (0.163) (0.299) (0.159) (0.303)
3883 1426 3862 1447 3948 1361
Stable
(0.409) (0.150) (0.407) (0.152) (0.416) (0.143)
𝐔𝟎.𝟓,𝟐𝟓 0.667 0.668 0.671
1734 2457 1708 2483 1760 2431
Failed
(0.183) (0.259) (0.180) (0.261) (0.185) (0.256)
75255 9548 77725 7078 75379 9424
Stable
Entire (0.630) (0.080) (0.650) (0.059) (0.631) (0.079)
0.833 0.821 0.834
data 10408 24289 14318 20379 10416 24281
Failed
(0.087) (0.203) (0.120) (0.171) (0.087) (0.203)

Repeated k-fold cross-validation: To assess the sensitivity of the ML model performances, a

repeated k-fold cross-validation (CV) scheme is employed in this study. In k-fold CV, the

dataset is first randomly divided into k separate folds with the same number of instances. The

model is tested over each fold in turn where the other 𝑘 1 folds are used for training the

model. To reduce the variance of the performance estimates, it is recommended to perform

repeated k-fold CV (Wong and Yeh 2019). k-fold CV with a large number of folds and a small

24 
 
number of replications is shown to be an appropriate method for the performance evaluation of

classification algorithms (Refaeilzadeh, Tang et al. 2009). Three repeats of 10-fold CV create

the distribution of performance scores such as accuracy and F1 in our study. For each model,

distribution of performance scores with a higher mean value and smaller variances is desirable.

The CV scheme adopted in this study is illustrated in Fig. 4.

Fig. 4 10-fold cross-validation with three repeats adapted in this study resulting in the statistics of the model

performance scores shown as boxplots. The interquartile range, IQR is identified by Q1 or 25th to Q3 or 75th

percentile as the boxes. The minimum and maximum are shown as whiskers, and outliers are identified as circles.

Performance scores and their distributions: The mean values of performance results of

different classifiers obtained from repeated k-fold CV are summarised in Table . Classifiers are

used as MC surrogate models to predict the stability status of heterogenous and anisotropic

slopes. For training the surrogate classifiers, %5 of MC data, i.e., 500 samples, are randomly

chosen from each random field dataset U containing 10000 samples. The mean performance

scores of CV suggest that in general, RF, SVC, and BG En are the most appropriate classifiers

25 
 
for this study. In the dataset where the highest performance scores are achieved (U . ,

characterised by low heterogeneity and no anisotropy), the BG En model provides the highest

scores (F1=0.903, ACC=0.962, AUC=0.98) which are also partly achieved by other models.

However, on the most challenging dataset (U . , characterised by high heterogeneity and

anisotropy), the RF model reaches the highest ACC=0.691 and AUC=0.721 where the highest

F1 score belongs to NB (F1=0.659). On the entire dataset where all sub-datasets are combined,

BG En and RF model outperform other models with ACC=0.847, AUC=0.912 and F1=0.737

(Table 5).

In addition to the mean performance scores, the distribution of these scores obtained from three

repeats of CV provides more insights into the overall appropriateness of different classification

models for the current study. These distributions can better demonstrate the comparison of

different models where smaller variance on higher scores will be desirable. The distribution of

F1 and accuracy scores for each classification model is shown in Error! Reference source not

found. as boxplots. In the best-predicted dataset (Fig. 5a: U . , ), SVC, RF, Bayes, BG En and

VO En achieve comparable performance with a similar distribution of F1 and accuracy scores.

The similar achievement of different models implies higher predictability of the dataset due to

the smaller COV of strength parameter and stronger correlations between mean strength values

and the stability status; however, with the same COV, but larger scales of fluctuations (ξ

25), more complex failure mechanisms may have come into play reducing the predictability

of the stability. In this dataset (Fig. 5b: U . , ), SVC model outperforms all models with

distributions concentrated at higher values on both ACC and F1 scores. For U . , dataset where

COV of shear strength values is increased to %30 (Fig. 5c), SVC results in best scores and

small variances, while for the U . , dataset where anisotropy is also significant (Fig. 5d), VO

En and RF model appear to be more efficient when both scores are considered. In datasets with

26 
 
the highest COV but isotropic heterogeneity (Fig. 5e: U . , ), NB and BG En can be selected

as the best classifiers. For the same COV but highly anisotropic dataset (Fig. 5f: U . , ), BG

En, RF and SVC perform well in accuracy distributions and NB can be selected as the winner

model considering the F1 score.

Table 5. Performance metrics of different models and the ensemble classifiers on different random field data.

The mean values of the metrics obtained from a repeated 10-fold cross-validation scheme are shown. The

highest mean metric values achieved for each random field are shown as bold numbers. 500 samples are

randomly taken for model training from each random field dataset.

Naive Bayes
Random LR KNN DT SVC RF
(Gaussian NB)
field
F1 ACC AUC F1 ACC AUC F1 ACC AUC F1 ACC AUC F1 ACC AUC F1 ACC AUC

U . , 0.878 0.955 0.976 0.862 0.950 0.976 0.786 0.931 0.814 0.903 0.962 0.977 0.901 0.962 0.973 0.903 0.962 0.976

U . , 0.666 0.899 0.967 0.800 0.931 0.959 0.645 0.884 0.806 0.820 0.936 0.955 0.775 0.924 0.961 0.662 0.848 0.947

U . , 0.718 0.843 0.915 0.712 0.834 0.906 0.632 0.799 0.765 0.751 0.880 0.922 0.746 0.877 0.940 0.712 0.807 0.893

U . , 0.429 0.729 0.742 0.451 0.759 0.755 0.430 0.693 0.643 0.455 0.775 0.782 0.508 0.775 0.812 0.568 0.699 0.772

U . , 0.762 0.793 0.863 0.738 0.711 0.846 0.634 0.681 0.678 0.816 0.842 0.922 0.808 0.843 0.922 0.823 0.842 0.909

U . , 0.537 0.583 0.658 0.597 0.613 0.646 0.521 0.581 0.565 0.610 0.677 0.721 0.614 0.691 0.721 0.659 0.657 0.673

Entire
0.603 0.772 0.781 0.619 0.824 0.787 0.583 0.758 0.697 0.701 0.845 0.885 0.722 0.847 0.908 0.643 0.751 0.794
data

Stacking Bagging
Random Voting Ensemble
Ensemble Ensemble
field
F1 ACC AUC F1 ACC AUC F1 ACC AUC

U . , 0.901 0.961 0.975 0.903 0.962 0.980 0.903 0.961 0.975

U . , 0.789 0.927 0.961 0.778 0.926 0.959 0.791 0.929 0.964

U . , 0.751 0.877 0.937 0.754 0.873 0.940 0.744 0.871 0.933

U . , 0.420 0.773 0.781 0.499 0.780 0.813 0.478 0.780 0.811

U . , 0.816 0.841 0.903 0.816 0.843 0.923 0.820 0.841 0.914

U . , 0.611 0.669 0.681 0.617 0.681 0.711 0.606 0.673 0.708

Entire
0.717 0.844 0.878 0.737 0.841 0.912 0.696 0.844 0.884
data

The efficiency of ML classifiers when implemented on the entire MC dataset combining

various datasets with different COVs and ξ values is also examined (Fig. 7a and b). For this

27 
 
case, a training dataset of 500 samples (0.42% of data) is used. Here, RF and BG En appear to

be the most efficient models.

In general, based on the distribution of accuracy and F1 scores, one can conclude that SVC,

RF and BG En are the most reliable choices for predicting the stability status of heterogeneous

and anisotropic slopes.

Based on AUC scores, for the dataset of low heterogeneity and no anisotropy (Fig. 6a: U . , ),

BG En achieves the highest score (AUC=0.980). For low heterogeneity but high anisotropy

(Fig. 6b: U . , ), LR provides the best score as AUC=0.967. However, for both datasets, all

classifiers perform relatively well denoting the high predictability of the datasets except the

DT classifier which can be deemed inappropriate for this problem. With higher heterogeneity,

regardless of the anisotropy levels, RF and BG En provide the highest scores (AUC=0.94 for

U . , and AUC=0.813 for U . , , Fig. 6c and d). BG En also performs best for the highest

heterogeneity but no anisotropy (AUC=0.923, Fig. 6e) whereas RF outperforms others with

AUC=0.721 for the highest heterogeneity and anisotropy (Fig. 6f). When the entire MC dataset

is examined, RF and BG En are the most appropriate classifiers with AUC=0.89 (Fig. 7c).

In general, based on the AUC metric, RF and BG En can be concluded as the most appropriate

classifiers for this problem where DT seems inappropriate.

28 
 
a U . ,

b U . ,

(c) U . ,

d U . ,
Fig. 5. The performance of different algorithms obtained from three repeats of 10-fold cross-validation. Boxplots

show the mean (green triangle), median (yellow line), interquartile range (25th to the 75th percentile as the boxes),

minimum and maximum (whiskers), and outliers (circles) for accuracy scores (left panels) and F1 (right panels)

for each model. Models are trained on 500 samples from each random field dataset.

29 
 
e U . ,

f U . ,

Fig. 5 Continued.

30 
 
a U . , b U . ,

(c) U . , d U . ,

e U . , f U . ,

Fig. 6 AUC ROC curves for different machine learning algorithms on various datasets. Models are trained on 500

samples from each random field dataset.

31 
 
a) Accuracy score (ACC) b) F1 score

c) AUC ROC d) Confusion matrix

Fig. 7 The performance of different algorithms on the entire random field data (all COV and 𝜉 datasets). Results

are obtained from three repeats of 10-fold cross-validation. The models are trained on 500 samples (%0.42 of

data). The confusion matrix showing the normalised values over the size of the test dataset is shown in d) as the

results of the RF model trained with 500 samples for a test against 119,500 samples.

Fig. 8 provides an overview of the performance of different classifiers on different datasets

characterised by various levels of soil heterogeneity and anisotropy. With F1 and AUC metrics

being the most important measures to evaluate binary classifiers in unbalanced classes, the

graphs of Fig. 8 present the overall performance of models with combined metrics. In the case

of random field data with no anisotropy (ξ 1), regardless of the level of soil heterogeneity,

the BG En appears to outperform other classifiers with higher F1 and AUC scores (Fig. 8a, c

32 
 
and e). The RF and VO En are also the next appropriate models with performance scores close

to the best values. In the case of random field data with high anisotropy (𝜉 25), the RF is

perhaps the most reliable model while the VO En has performed better for low heterogeneity

(Fig. 8b) and SVC has performed similarly well at high heterogeneity (Fig. 8f). Ultimately, BG

En outperforms other models when learning the entire MC dataset of different heterogeneity

and anisotropy levels (Fig. 9).

In general, based on the combined assessment of F1 and AUC measures for all datasets, we

can conclude that BG En and RF are the most appropriate models for this study while SVC can

be the runner up model. DT, KNN and LR are the least efficient models for this study where

Naive’s performance has been inconsistent.

33 
 
   

a U . ,   b U . ,  

   

c U . ,   d U . ,  

   

(e) U . ,   f U . ,  

Fig. 8 AUC versus F1 scores for different algorithms on various datasets trained with 500 samples (%5 of each

dataset). Datapoint values are the mean of repeated CV results.

34 
 
 

Fig. 9 AUC versus F1 scores for different algorithms on the entire random field dataset (120,000 samples) trained

with 500 samples (%0.42 of data).

7.3. The effects of the heterogeneity and anisotropy of random fields on classifier

performances

Fig. 10 compares the performance of the selected ML classifiers as surrogate models in

reliability analysis with respect to different levels of heterogeneity (COV = 0.1 0.5) and

anisotropy (ξ 1 25). The best performing classifiers, RF, SVC and BG En are trained on

%5 of the data (500 samples) for each dataset where the datapoints in Fig. 10 are the mean

values of the results obtained from the three times repeated CV. A general trend in all models

is the declining performance (decreased accuracy and AUC scores) with increasing

heterogeneity and anisotropy. This behaviour can lie in the increased complexity of the failure

mechanisms with these properties of random soils. The failure in a less heterogeneous slope is

better correlated with the average of the random strength variables. Such slopes are also more

likely to fail via typical shear surfaces which are better learned by the classifiers. A semi-

circular geometry of the failure surface is attributed to these failure cases. With increasing

heterogeneity, the predictability decreases as the failure mechanisms are more complex

depending on the interconnection of local week zones with irregular surfaces of failure.

35 
 
The effects of anisotropy are also evident in Fig. 10, where a higher anisotropy ratio (ξ) leads

to lower predictability of the stability status of slopes. These effects could also root in the

complexity of the failure mechanisms where not only the local week zones (due to

heterogeneity) may cause the failure, but also the correlated week layers distributed spatially

into the field can contribute to irregular failure surfaces. Therefore, any failure geometry from

a circular deep surface to a local shearing zone or a shallow or deep layering slide can be a

possible mechanism. The lower repeatability of different mechanisms can reduce the prediction

ability of the models.

In general, RF and BG En perform similarly well on all datasets while according to the AUC

metric (Fig. 10b), SVC is less efficient than the other two models in the moderate level of

heterogeneity (COV=0.3).

It should be noted that while the F1 score is used as a useful measure to evaluate the

performance of classifiers in unbalanced datasets, this score by definition is not comparable on

different datasets with varying ratios of unbalanced classes. Thus, the variation of the F1 score

is not shown for the comparison of different classifiers in Fig. 10.

36 
 
(a)

(b)

Fig. 10 Accuracy Score, ACC (a) and Area Under Curve, AUC score (b) vsversus the anisotropy ratio ξ for

Random Forest (RF), Support Vector (SVC) and Bagging ensemble (BG En) trained with 500 samples for each

dataset obtained as mean values from 3-times repeated 10-fold CV.

7.4. Predicted probability of failure

While the ACC and AUC for the binary classification (failure, non-failure) may decrease to

about %70, it should be noted that the errors in the predicted probability of failure, 𝑝 can be

in much lower extent. The nature of binary results either as failure or non-failure can induce a

high number of near-to-failure samples being incorrectly classified. While in the case of the

predictions of FOS data, fluctuations in the near-to-failure (FOS 1) predictions can induce a

37 
 
smoothed error in the mean FOS calculated, similar model performances on a binary

classification may induce lower accuracies. However, when calculating the 𝑝 , the ratio as the

number of failure predictions over the total number of samples is calculated. Therefore, near-

to-failure mis-classified samples can be balanced which can reduce the average errors in the

calculated 𝑝 .

The probability of failure 𝑝 predicted by ML models trained on only %5 of random field data

is shown in Fig. 11. The errors in the predicted 𝑝 range from less than %1 for all models and

with an average of 0.46% for all cases of heterogeneity and anisotropy (Fig. 11 -a).

The actual 𝑝 values obtained from the complete MC data are shown as data points in Fig. 11-

b. The variation of ML predictions of 𝑝 is shown as the bond of the standard deviations.

Trained only on %5 of data, ML surrogate models can provide highly accurate predictions.

38 
 
(a)

(b)

Fig. 11 The errors of the ML predicted probability of failure 𝑝 (a) and the standard deviation (std) of ML

predicted 𝑝 (b) versus anisotropy ratio 𝜉. The SVC model is used here. The actual 𝑝 in (b) is obtained from

complete MC data. ML models are trained with 500 samples for each dataset (%5 of data).

39 
 
7.5. Computational time

The CPU time for each random finite difference simulation without a FOS calculation to

determine the stability status of the slopes is 43 seconds on average on a Laptop with Intel Core

i3-5010U 2.1GHz processor using four cores. Thus, each dataset of particular 𝐶𝑂𝑉, ξ, μ

values consisting of 2,000 simulations takes a CPU time of 23.9 hours to finish. The entire

random field MC dataset containing 120,000 simulations is completed in about 60 days. A

similar dataset with calculated FOSs can take up to 306 days to complete. The ML models are

trained on 500 samples to predict the entire dataset. The computation time for generating this

training dataset will take only about 6 hours where the ML training and predictions are

performed within a few minutes. Therefore, the proposed ML models can reduce the

computational CPU time of such study from 306 days to 6 hours where an accuracy score of

~%85 and AUC score of ~%91 can be expected. In the case of the random finite difference

simulations with FOS calculation using strength reduction technique, the CPU time for each

simulation can reach 220 seconds on average demanding unaffordable computational costs for

such a study. A summary of the computation time and expected performance metrics is

presented in Table 6.

40 
 
Table 6. Comparison of the computational time of original and ML aided MC methods

Original MC method

with FOS without ML aided MC method *

calculations FOS

calculations

CPU
MC MC CPU time CPU time MC Accuracy
time F1 AUC 𝑃
dataset samples (hours) (hours) samples (ACC)
(hours)

U
500 Min: Min: Min:
datasets 611 5.97
119.4 (%5 of 0.691– 0.568– 0.721–
(specific 10,000 (~25.5 (~0.25 ≤0.010
(~5 days) MC Max: Max: Max:
COV days) days)
data) 0.962 0.903 0.980
and 𝜉)

500
7333 5.97
Entire 1433 (%0.42
120,000 (~306 (~0.25 0.847 0.737 0.912 0.0046
dataset (~60 days) of MC
days) days)
data)

*
Min and max metric values are associated with datasets U . , and U . , , respectively.

8. Conclusion and future research direction

In a systematic approach in this study, we investigated the efficiency of ML classifier methods

as response/surrogate models to predict the results of the random field slope stability with

spatial variability. A complete Monte Carlo database of heterogenous anisotropic slopes

containing 120,000 simulations is generated. The random field variable is the soil undrained

shear strength with COV ranging from 0.1 to 0.5, and the anisotropy ratio, ξ ranging from 1 to

25. Predictions on the whole database are made by machine learning surrogate models trained

on only 500 samples. The findings of this research can be summarised as follows:

41 
 
 Repeated k-fold cross-validation revealed the sensitivity of ML surrogate models

providing the distribution of performance scores. A detailed study suggested that the

Bagging ensemble and Random Forest are the most appropriate models for this random

field study with SVC being the next appropriate choice. DT, KNN and LR are the least

efficient models for this study where Naive’s performance is inconsistent.

 The efficiency of ML surrogate models generally decreases with higher heterogeneity

and anisotropy of random fields.

 Slightly heterogeneous random fields are highly predictable: when trained on 500

samples (%5 of MC dataset), the accuracy of ML surrogate models for classifying the

failure and non-failure cases reached %96.2 for (COV=0.1, 𝜉=1), although decreases

to %93.6 for (COV=0.1, 𝜉=25).

 With higher heterogeneity, the classification predictability decreases: the accuracy of

ML surrogate models trained on %5 of data is %88 for (COV=0.3, 𝜉=1), and %84 for

(COV=0.5, 𝜉=1).

 A higher anisotropy further reduces the classification performances: the accuracy is

%78 for (COV=0.3, 𝜉=25), and %69 for (COV=0.5, 𝜉=25).

 The overall performance of ML surrogate models for the classification task on the entire

random filed database (all COV and ξ values) when trained on 500 samples (%0.42 of

data) is promising with accuracy = %84.7 and AUC = %91.2.

 The errors in the ML predicted probability of failure using 5% of MC data for all

heterogeneity and anisotropy levels is below %1 with an average of %0.46 for the entire

MC dataset. Such an approach reduced the CPU computational time from 306 days to

only 6 hours.

This study portraits the efficiency of ML algorithms as surrogate models in MC simulations

when no FOS calculations are required. In another publication (Aminpour, Alaie et al. 2022),

42 
 
we have further explored the efficiency of ML surrogate models in determining the reliability

of heterogeneous anisotropic slopes outlining the extent of expected errors.

To improve the performance of ML surrogate models, feature classification (e.g., classifying

the input random field data with geometrical aspects within the problem) can be a potential

research approach. The dimension reduction techniques applied on the random field data can

also be investigated aiming to improve the model performances. Further investigations on the

training of deep and convolutional neural networks on random field data with failure, non-

failure results (without FOS calculations) can further improve the efficiency of machine

learning approaches as a computationally efficient surrogate model for geotechnical reliability

analysis.

Acknowledgement

This research is funded by the Australian Research Council via the Discovery Projects (No.

DP200100549).

References:

Aminpour, M., R. Alaie, N. Kardani, S. Moridpour and M. Nazem (2022). "Machine learning aided Monte Carlo

reliability analysis on spatially variable random fields: anisotropic heterogeneous slopes." manuscript submitted

for publication.

Au, S.-K. and J. Beck (2003). "Important sampling in high dimensions." Structural safety 25(2): 139-163.

Au, S.-K. and J. L. Beck (2001). "Estimation of small failure probabilities in high dimensions by subset simulation."

Probabilistic engineering mechanics 16(4): 263-277.

Belgiu, M. and L. Drăguţ (2016). "Random forest in remote sensing: A review of applications and future directions."

ISPRS journal of photogrammetry and remote sensing 114: 24-31.

Bühlmann, P. (2012). Bagging, boosting and ensemble methods. Handbook of computational statistics, Springer:

985-1022.

43 
 
Bui, X.-N., H. Nguyen, Y. Choi, T. Nguyen-Thoi, J. Zhou and J. Dou (2020). "Prediction of slope failure in open-

pit mines using a novel hybrid artificial intelligence model based on decision tree and evolution algorithm."

Scientific reports 10(1): 1-17.

Chen, F., L. Wang and W. Zhang (2019). "Reliability assessment on stability of tunnelling perpendicularly beneath

an existing tunnel considering spatial variabilities of rock mass properties." Tunnelling and Underground Space

Technology 88: 276-289.

Cheng, M.-Y. and N.-D. Hoang (2016). "Slope collapse prediction using Bayesian framework with k-nearest

neighbor density estimation: case study in Taiwan." Journal of Computing in Civil Engineering 30(1): 04014116.

Ching, J. and K.-K. Phoon (2013). "Probability distribution for mobilised shear strengths of spatially variable soils

under uniform stress states." Georisk: Assessment and Management of Risk for Engineered Systems and

Geohazards 7(3): 209-224.

Cho, S. E. (2007). "Effects of spatial variability of soil properties on slope stability." Engineering Geology 92(3-4):

97-109.

Christian, J. T., C. C. Ladd and G. B. Baecher (1994). "Reliability applied to slope stability analysis." Journal of

Geotechnical Engineering 120(12): 2180-2207.

Cortes, C. and V. Vapnik (1995). "Support-vector networks." Machine learning 20(3): 273-297.

Das, I., S. Sahoo, C. van Westen, A. Stein and R. Hack (2010). "Landslide susceptibility assessment using logistic

regression and its comparison with a rock mass classification system, along a road section in the northern

Himalayas (India)." Geomorphology 114(4): 627-637.

Deng, L., A. Smith, N. Dixon and H. Yuan (2021). "Machine learning prediction of landslide deformation behaviour

using acoustic emission and rainfall measurements." Engineering Geology 293: 106315.

Deng, Z.-P., M. Pan, J.-T. Niu, S.-H. Jiang and W.-W. Qian (2021). "Slope reliability analysis in spatially variable

soils using sliced inverse regression-based multivariate adaptive regression spline." Bulletin of Engineering

Geology and the Environment 80(9): 7213-7226.

Džeroski, S. and B. Ženko (2004). "Is combining classifiers with stacking better than selecting the best one?"

Machine learning 54(3): 255-273.

El‐Kadi, A. I. and S. A. Williams (2000). "Generating two‐dimensional fields of autocorrelated, normally distributed

parameters by the matrix decomposition technique." Groundwater 38(4): 530-532.

Feng, X., S. Li, C. Yuan, P. Zeng and Y. Sun (2018). "Prediction of slope stability using naive Bayes classifier."

KSCE Journal of Civil Engineering 22(3): 941-950.

44 
 
Fenton, G. A. and D. V. Griffiths (2008). Risk assessment in geotechnical engineering, John Wiley & Sons New

York.

Gong, W., H. Tang, C. H. Juang and L. Wang (2020). "Optimization design of stabilizing piles in slopes considering

spatial variability." Acta Geotechnica 15(11): 3243-3259.

Griffiths, D. V. and G. A. Fenton (2007). Probabilistic methods in geotechnical engineering, Springer Science &

Business Media.

Han, H., B. Shi and L. Zhang (2021). "Prediction of landslide sharp increase displacement by SVM with considering

hysteresis of groundwater change." Engineering Geology 280: 105876.

He, X., F. Wang, W. Li and D. Sheng (2021). "Deep learning for efficient stochastic analysis with spatial variability."

Acta Geotechnica.

He, X., F. Wang, W. Li and D. Sheng (2021). "Efficient reliability analysis considering uncertainty in random field

parameters: Trained neural networks as surrogate models." Computers and Geotechnics 136: 104212.

He, X., H. Xu, H. Sabetamal and D. Sheng (2020). "Machine learning aided stochastic reliability analysis of spatially

variable slopes." Computers and Geotechnics 126: 103711.

Hosmer Jr, D. W., S. Lemeshow and R. X. Sturdivant (2013). Applied logistic regression, John Wiley & Sons.

Hu, X., S. Wu, G. Zhang, W. Zheng, C. Liu, C. He, Z. Liu, X. Guo and H. Zhang (2021). "Landslide displacement

prediction using kinematics-based random forests method: A case study in Jinping Reservoir Area, China."

Engineering Geology 283: 105975.

Huang, J., G. Fenton, D. Griffiths, D. Li and C. Zhou (2017). "On the efficient estimation of small failure probability

in slopes." Landslides 14(2): 491-498.

Huang, J., G. Fenton, D. V. Griffiths, D. Li and C. Zhou (2017). "On the efficient estimation of small failure

probability in slopes." Landslides 14(2): 491-498.

Huang, Y., X. Han and L. Zhao (2021). "Recurrent neural networks for complicated seismic dynamic response

prediction of a slope system." Engineering Geology 289: 106198.

Jamshidi Chenari, R. and R. Alaie (2015). "Effects of anisotropy in correlation structure on the stability of an

undrained clay slope." Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards

9(2): 109-123.

Jiang, S.-H. and J.-S. Huang (2016). "Efficient slope reliability analysis at low-probability levels in spatially variable

soils." Computers and Geotechnics 75: 18-27.

45 
 
Jiang, S.-H., D.-Q. Li, L.-M. Zhang and C.-B. Zhou (2014). "Slope reliability analysis considering spatially variable

shear strength parameters using a non-intrusive stochastic finite element method." Engineering geology 168: 120-

128.

Juang, C. H., J. Zhang, M. Shen and J. Hu (2019). "Probabilistic methods for unified treatment of geotechnical and

geological uncertainties in a geotechnical analysis." Engineering Geology 249: 148-161.

Kang, F., Q. Xu and J. Li (2016). "Slope reliability analysis using surrogate models via new support vector machines

with swarm intelligence." Applied Mathematical Modelling 40(11-12): 6105-6120.

Kardani, N., A. Bardhan, S. Gupta, P. Samui, M. Nazem, Y. Zhang and A. Zhou (2021). "Predicting permeability of

tight carbonates using a hybrid machine learning approach of modified equilibrium optimizer and extreme

learning machine." Acta Geotechnica: 1-17.

Kardani, N., A. Zhou, M. Nazem and S.-L. Shen (2021). "Improved prediction of slope stability using a hybrid

stacking ensemble method based on finite element analysis and field data." Journal of Rock Mechanics and

Geotechnical Engineering 13(1): 188-201.

Li, D.-Q., T. Xiao, Z.-J. Cao, C.-B. Zhou and L.-M. Zhang (2016). "Enhancement of random finite element method

in reliability analysis and risk assessment of soil slopes using Subset Simulation." Landslides 13(2): 293-303.

Li, J., Y. Tian and M. J. Cassidy (2015). "Failure mechanism and bearing capacity of footings buried at various

depths in spatially random soil." Journal of Geotechnical and Geoenvironmental Engineering 141(2): 04014099.

Liu, L.-L. and Y.-M. Cheng (2018). "System reliability analysis of soil slopes using an advanced kriging metamodel

and quasi–Monte Carlo simulation." International Journal of Geomechanics 18(8): 06018019.

Liu, L., S. Zhang, Y.-M. Cheng and L. Liang (2019). "Advanced reliability analysis of slopes in spatially variable

soils using multivariate adaptive regression splines." Geoscience Frontiers 10(2): 671-682.

Lloret-Cabot, M., G. A. Fenton and M. A. Hicks (2014). "On the estimation of scale of fluctuation in geostatistics."

Georisk: Assessment and management of risk for engineered systems and geohazards 8(2): 129-140.

Mucherino, A., P. J. Papajorgji and P. M. Pardalos (2009). K-nearest neighbor classification. Data mining in

agriculture, Springer: 83-106.

Noble, W. S. (2006). "What is a support vector machine?" Nature biotechnology 24(12): 1565-1567.

Papaioannou, I., W. Betz, K. Zwirglmaier and D. Straub (2015). "MCMC algorithms for subset simulation."

Probabilistic Engineering Mechanics 41: 89-103.

46 
 
Pham, B. T., D. T. Bui and I. Prakash (2017). "Landslide susceptibility assessment using bagging ensemble based

alternating decision trees, logistic regression and J48 decision trees methods: a comparative study." Geotechnical

and Geological Engineering 35(6): 2597-2611.

Phoon, K.-K. and F. H. Kulhawy (1999). "Characterization of geotechnical variability." Canadian geotechnical

journal 36(4): 612-624.

Popescu, R., G. Deodatis and A. Nobahar (2005). "Effects of random heterogeneity of soil properties on bearing

capacity." Probabilistic Engineering Mechanics 20(4): 324-341.

Refaeilzadeh, P., L. Tang and H. Liu (2009). "Cross-validation." Encyclopedia of database systems 5: 532-538.

Rish, I. (2001). An empirical study of the naive Bayes classifier. IJCAI 2001 workshop on empirical methods in

artificial intelligence.

Rokach, L. (2010). "Ensemble-based classifiers." Artificial intelligence review 33(1): 1-39.

Samui, P. and D. Kothari (2011). "Utilization of a least square support vector machine (LSSVM) for slope stability

analysis." Scientia Iranica 18(1): 53-58.

Sun, D., J. Xu, H. Wen and D. Wang (2021). "Assessment of landslide susceptibility mapping based on Bayesian

hyperparameter optimization: A comparison between logistic regression and random forest." Engineering

Geology 281: 105972.

Tian, H.-M., D.-Q. Li, Z.-J. Cao, D.-S. Xu and X.-Y. Fu (2021). "Reliability-based monitoring sensitivity analysis

for reinforced slopes using BUS and subset simulation methods." Engineering Geology 293: 106331.

Tsangaratos, P. and I. Ilia (2016). "Comparison of a logistic regression and Naïve Bayes classifier in landslide

susceptibility assessments: The influence of models complexity and training dataset size." Catena 145: 164-179.

Vanmarcke, E. H. (1977). "Probabilistic modeling of soil profiles." Journal of the geotechnical engineering division

103(11): 1227-1246.

Vanmarcke, E. H. (1977). "Reliability of earth slopes." Journal of the Geotechnical Engineering Division 103(11):

1247-1265.

Vetterling, W. T., W. H. Press, S. A. Teukolsky and B. P. Flannery (2002). Numerical recipes example book (c++):

The art of scientific computing, Cambridge University Press.

Wang, H., L. Zhang, H. Luo, J. He and R. W. M. Cheung (2021). "AI-powered landslide susceptibility assessment

in Hong Kong." Engineering Geology 288: 106103.

Wang, L., J. H. Hwang, C. H. Juang and S. Atamturktur (2013). "Reliability-based design of rock slopes — A new

perspective on design robustness." Engineering Geology 154: 56-63.

47 
 
Wang, L., C. Wu, X. Gu, H. Liu, G. Mei and W. Zhang (2020). "Probabilistic stability analysis of earth dam slope

under transient seepage using multivariate adaptive regression splines." Bulletin of Engineering Geology and the

Environment 79(6): 2763-2775.

Wang, Y., Z. Cao and S.-K. Au (2010). "Efficient Monte Carlo simulation of parameter sensitivity in probabilistic

slope stability analysis." Computers and Geotechnics 37(7-8): 1015-1022.

Wang, Y., Z. Cao and S.-K. Au (2011). "Practical reliability analysis of slope stability by advanced Monte Carlo

simulations in a spreadsheet." Canadian Geotechnical Journal 48(1): 162-172.

Wang, Z.-Z. and S. H. Goh (2021). "Novel approach to efficient slope reliability analysis in spatially variable soils."

Engineering Geology 281: 105989.

Wei, Z.-l., Q. Lü, H.-y. Sun and Y.-q. Shang (2019). "Estimating the rainfall threshold of a deep-seated landslide by

integrating models for predicting the groundwater level and stability analysis of the slope." Engineering Geology

253: 14-26.

Wong, T.-T. and P.-Y. Yeh (2019). "Reliable accuracy estimates from k-fold cross validation." IEEE Transactions

on Knowledge and Data Engineering 32(8): 1586-1594.

Zeng, P., T. Zhang, T. Li, R. Jimenez, J. Zhang and X. Sun (2020). "Binary classification method for efficient and

accurate system reliability analyses of layered soil slopes." Georisk: Assessment and Management of Risk for

Engineered Systems and Geohazards: 1-17.

Zhang, J., K. K. Phoon, D. Zhang, H. Huang and C. Tang (2021). "Deep learning-based evaluation of factor of safety

with confidence interval for tunnel deformation in spatially variable soil." Journal of Rock Mechanics and

Geotechnical Engineering.

Zhang, W., C. Wu, H. Zhong, Y. Li and L. Wang (2021). "Prediction of undrained shear strength using extreme

gradient boosting and random forest based on Bayesian optimization." Geoscience Frontiers 12(1): 469-477.

Zhao, Y., X. Meng, T. Qi, Y. Li, G. Chen, D. Yue and F. Qing (2021). "AI-based rainfall prediction model for debris

flows." Engineering Geology: 106456.

Zhu, B., T. Hiraishi, H. Pei and Q. Yang (2021). "Efficient reliability analysis of slopes integrating the random field

method and a Gaussian process regression‐based surrogate model." International Journal for Numerical and

Analytical Methods in Geomechanics 45(4): 478-501.

Zhu, B., H. Pei and Q. Yang (2019). "An intelligent response surface method for analyzing slope reliability based on

Gaussian process regression." International Journal for Numerical and Analytical Methods in Geomechanics

43(15): 2431-2448.

48 
 
Zhu, H., L. Zhang, T. Xiao and X. Li (2017). "Generation of multivariate cross-correlated geotechnical random

fields." Computers and Geotechnics 86: 95-107.

Zhu, H. and L. M. Zhang (2013). "Characterizing geotechnical anisotropic spatial variations using random field

theory." Canadian Geotechnical Journal 50(7): 723-734.

49 
 

You might also like