A Random Forest Model of Landslide Susceptibility Mapping Based on Hyperparameter Optimization Using Bayes Algorithm
Geomorphology
Article info

Article history:
Received 26 November 2019
Received in revised form 5 April 2020
Accepted 5 April 2020
Available online 12 April 2020

Keywords:
Bayes algorithm
Random forest
Landslide susceptibility mapping
Hyperparameter optimization
Factor screening

Abstract

The choice of model parameters in landslide susceptibility mapping is a major determinant of model accuracy. The purpose of this study is to optimize the hyperparameters with a Bayesian optimization algorithm and to obtain a high-accuracy random forest landslide susceptibility evaluation model. The research steps are as follows. First, taking a typical landslide-prone mountainous area as an example, 16 conditioning factors, such as elevation, annual average rainfall, distance from roads and distance from buildings, were preliminarily selected as the conditioning factors of landslide susceptibility. Combined with 1520 historical landslide events, a geospatial database was established at 30 m resolution. Second, the geospatial data sample set was constructed by random sampling with a 1:10 ratio of historical landslides to non-landslides. Based on the whole sample set, the hyperparameters of the random forest model were optimized with the Bayesian optimization algorithm. Next, the model was trained with the optimal hyperparameters to obtain the landslide susceptibility evaluation model, which was then used to map landslide susceptibility for the whole study area. After that, the recursive feature elimination method was used to screen out the dominant conditioning factors that can explain the degree of landslide susceptibility. The results indicated that the area under the curve (AUC) values of the receiver operating characteristic (ROC) curve for the training data set, verification data set and regional simulation were 0.95, 0.87 and 0.93, respectively. 65% of the historical landslides fell within the high and very high susceptibility regions, which made up <20% of the research area. The model was in good agreement with the distribution characteristics of historical landslides in the study area. We noted that all three recent landslides with impact on the study area occurred at locations predicted by the model to have high or very high susceptibility to typical landslides in the near future. As for conditioning factors, those related to human activities accounted for a large proportion of the contribution. In conclusion, a high-precision random forest evaluation model of landslide susceptibility can be built based on hyperparameter optimization with the Bayesian optimization algorithm. In addition, using the recursive feature elimination method, a random forest landslide susceptibility model with fewer dominant conditioning factors and guaranteed evaluation accuracy can be built to save the running time and input data resources of the model.
© 2020 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.geomorph.2020.107201
0169-555X/© 2020 Elsevier B.V. All rights reserved.
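The 1:10 sampling of landslide to non-landslide cells mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `sample_cells`, the integer cell identifiers and the size of the background pool are hypothetical, and only the count of 1520 landslides and the 1:10 ratio come from the text.

```python
import random


def sample_cells(landslide_cells, other_cells, ratio=10, seed=42):
    """Pair every landslide cell with `ratio` randomly drawn non-landslide cells.

    Returns a shuffled list of (cell_id, label) tuples, where label 1 marks a
    historical landslide cell and label 0 a non-landslide cell.
    """
    n_negative = ratio * len(landslide_cells)
    if n_negative > len(other_cells):
        raise ValueError("not enough non-landslide cells for the requested ratio")
    rng = random.Random(seed)
    # Draw non-landslide cells without replacement.
    negatives = rng.sample(other_cells, n_negative)
    samples = [(cell, 1) for cell in landslide_cells]
    samples += [(cell, 0) for cell in negatives]
    rng.shuffle(samples)
    return samples


# Hypothetical cell identifiers; only the counts mirror the abstract.
landslides = list(range(1520))
background = list(range(1520, 200_000))
dataset = sample_cells(landslides, background)
```

With 1520 landslide cells this yields 15,200 non-landslide cells, the same non-landslide count that appears in the confusion matrix of Table 6.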
2 D. Sun et al. / Geomorphology 362 (2020) 107201
characteristics of landslide development, many problems such as factor selection, parameter optimization and sample optimization of the model in the evaluation of regional landslide susceptibility have not been solved systematically. With the development of Geographic Information System and Artificial Intelligence (AI) technology, various machine learning methods have begun to be used in research, including logistic regression, Classification and Regression Tree (CART), Support Vector Machine (SVM) and so on. With strong robustness against over-fitting, these algorithms are suitable for nonlinear relationships between variables and can naturally model nonlinear decision boundaries. Earlier studies mainly relied on a single model, with limited accuracy and over-fitting concerns, to predict landslide susceptibility. To avoid such issues, the random forest model, which combines multiple decision trees, was proposed to improve the prediction accuracy. The model contains multiple decision trees, and its output is determined by the mode of the outputs of the individual decision trees. Compared to traditional methods like logistic regression, it has certain advantages (Cao, 2014). Specifically, the model is capable of handling datasets with higher dimensions and larger data volume and has greater generalization ability.

While studies in bioinformatics (Chen and Liu, 2005; Pang et al., 2010), medical science (Ying et al., 2008; Xie et al., 2009), business management (Ward et al., 2006; Kim et al., 2010) and other fields have already adopted the multiple decision trees model to achieve more sophisticated research results, research on landslide prediction started late. Based on grid data with a resolution of 25 m and 14 conditioning factors including elevation, slope, aspect, vegetation index, lithology, etc., Hong et al. (2016) evaluated and validated landslide susceptibility in Lianhua County, China with the random forest model, and also compared it with other traditional statistical models (Evidence Belief Function, EBF; Logistic Regression, LR; Frequency Ratio, FR). Chen et al. (2017a), based on grid data of 30 m resolution, applied the random forest method to the spatial prediction of landslide susceptibility in Long County, China and compared it with other advanced machine learning algorithms (Logistic Model Tree, LMT; Classification And Regression Tree, CART; Random Forest, RF). Yu et al. (2016) selected 12 conditioning factors from three categories, i.e., topography, meteorology, hydrology and soil vegetation. Based on grid data with a resolution of 30 m, the relationship between the occurrence of landslides and the conditioning factors in the Shunchang area of Fujian Province was empirically analyzed using the random forest model, and the applicability of the random forest model to the spatial prediction of landslides in South China was discussed. Based on grid data of 100 m resolution and 15 conditioning factors such as elevation, slope and aspect, Taalab et al. (2018) used the random forest model to evaluate landslide susceptibility and predict the landslide viewing space in Piedmont, northwest Italy.

Landslide susceptibility is subject to the combined effect of a variety of conditioning factors. Reichenbach et al. (2018) analyzed the studies on landslide susceptibility assessment published from 1983 to 2016 and found that 596 conditioning factors were examined, with an average of 9 conditioning factors per model. The number of conditioning factors selected in most models was not large; besides, in most cases these factors were selected subjectively according to expert experience. Studies on how to choose the dominant conditioning factors objectively are rare in the relevant literature.

The accuracy of a model depends not only on the learning algorithm but also on the hyperparameters (i.e., parameters whose values are set before the learning process starts), which makes it necessary to optimize the model. The existing LSM research literature has paid more attention to comparing the accuracy of different modeling methods than to the application of hyperparameter optimization in landslide machine learning modeling. In that case, the optimization of LSM conditioning factors and algorithm models deserves study. Most of the existing research on hyperparameter optimization has been in the field of computer algorithms. For example, Wang et al. (2014) proposed a hyperparameter selection method for support vector machines based on the Gaussian kernel, which can be divided into two stages: selecting kernel parameters and training optimal penalty factors. The computational complexity of this method is low, the classification accuracy is high and the training time is reasonable. Kang et al. (2019) proposed a non-inertial particle swarm optimization with elite mutation-Gaussian process regression (NIPSO-GPR) to optimize the hyperparameters of GPR. Wan et al. (2010) proposed a simple, practical and time-effective method to select hyperparameters by orthogonal design. Two kinds of Support Vector Machine (SVM) models of typical landslide displacement time series were designed by combining hyperparameters and orthogonal optimization, and a landslide prediction model with high accuracy and good generalization performance was obtained. However, the problem of hyperparameter optimization is rarely considered in studies of landslide susceptibility using the random forest model. Another factor that affects the accuracy of the model may be the selection of samples. The selection of different training samples will greatly influence the accuracy of the model and may make training results inconsistent with the facts. On how to select training samples, Liu et al. (2020) proposed a method of selecting training samples based on soil type classification, using the random forest model to update the soil map.

As the most typical mountainous county in Western China and the Three Gorges Reservoir area, Fengjie was chosen as the study example. The mountainous area features frequent landslides resulting from the influence of migration, water storage and power generation, continuous precipitation, etc. in the reservoir area. Reservoir migration, in particular, intensifies construction activities, which have resulted in the reconstruction of natural slopes. By using the Bayesian optimization algorithm, hyperparameter optimization and dominant conditioning factor screening analysis, an efficient random forest evaluation model of landslide susceptibility was constructed, and reliability evaluation and application verification were carried out.

The methodology of the study is illustrated in Fig. 1. Using satellite images, DEM, geological data and other multi-source data, 16 landslide susceptibility conditioning factors were extracted, a geospatial database was constructed, and random samples were selected based on historical landslide points. The Bayesian optimization algorithm was used to select the optimal hyperparameters, with which the random forest model was trained and tested to produce the landslide susceptibility assessment map of the study area. Next, recent landslide cases were compared and analyzed to verify the effectiveness of the random forest landslide susceptibility model after hyperparameter optimization. Finally, the importance of conditioning factors and the impact law of typical conditioning factors were analyzed. The dominant conditioning factors that can be used to evaluate landslide susceptibility were screened out by the recursive feature elimination method, so as to build a random forest landslide susceptibility model that contains fewer dominant conditioning factors but maintains good evaluation accuracy.

2. Case research area overview and data sources

2.1. Research area overview

Located at 109°1′17″–109°45′58″E and 30°29′19″–31°22′33″N (Fig. 2), Fengjie County, the research area, is the east gate of Chongqing. This mountainous area with a complex tectonic stress field is located in the east of the Sichuan Basin, at the intersection of the Dabashan arc fold fault zone, the East Sichuan arc concave fold zone and the Sichuan, Hubei, Hunan and Guizhou uplift fold zone. The climate here is a mid-subtropical humid monsoon climate, with abundant rainfall and annual average precipitation of 1132 mm.
There are numerous river systems in the region, including 17 river basins with an area of more than 50 km². The Yangtze River runs through the central part of the region, with an average annual discharge of 13,700 m³/s.

The data of 1520 historical landslides in Fengjie County from 2001 to 2016 and the related conditioning factors of landslide formation were collected, sorted and organized as shown in Table 1.

Table 1
Data and data sources.
Data name                 Data sources                                      Type        Scale
Historical landslides     Chongqing Geological monitoring station           Datasheet
DEM                       Aster satellite                                   Grid        30 m
Geological data           National Geological Data Center                   Grid        1:200,000
Land cover                Chongqing Municipal Bureau of land and resources  Vector      1:100,000
Administrative division   Chongqing Municipal Bureau of land and resources  Vector      1:100,000
River network             Chongqing Water Resources Bureau                  Vector      1:100,000
Satellite image           Geospatial Data Cloud platform                    Grid        30 m
Annual rainfall           Chongqing Meteorological Administration           Datasheet   30 m
Road                      Chongqing Transportation Commission               Vector      1:100,000

The historical landslide data were sorted on two bases, i.e., type and trigger (Fig. 3). In terms of type, most of the landslides in the study area were small/shallow/soil ones (82%) and only 18% were large/deep/bedrock landslides. In terms of trigger, most of the landslides in the study area were caused by rainfall (75.88%), 10.05% were caused by ground water (pore water), 2.01% by human construction activities and 12.06% by coupling.

3. Geospatial databases

The formation mechanism of landslides is very complex, and landslide susceptibility is jointly affected by natural factors and human activities. Reichenbach et al. (2018) analyzed the studies related to landslide susceptibility evaluation from 1983 to 2016 and concluded that 596 factors were used for landslide susceptibility evaluation, i.e., 9 factors on average were used by each model, which shows that not many factors are considered in most models. According to Ayalew and Yamagishi (2005), the selected landslide influencing factors should be measurable, operable, non-uniform, complete and non-redundant. Therefore, we considered the types and triggers of landslides and increased the number of conditioning factors to 16. In this paper, the 16 factors of landslide susceptibility cover four aspects: topography (elevation, slope, aspect, slope position, landforms, profile curvature, topographic wetness index (TWI) (Yu et al., 2017)), geological conditions (lithology, distance from faults, combination reclassification of stratum dip direction and slope aspect (CRDS) (Xie et al., 2018)), environmental conditions (Normalized Difference Vegetation Index (NDVI), distance from rivers, annual average rainfall, land cover), and human activities (distance from roads, distance from buildings). The factors given specific consideration were topographic wetness index (TWI), combination reclassification of stratum dip direction and slope aspect (CRDS), distance from rivers, annual average rainfall, distance from roads and distance from buildings. Among them, TWI is a composite topographic index for evaluating the spatial distribution of soil water, which describes the influence of terrain on the degree of soil water saturation. The content and distribution of water in the soil affect the condition of rock, soil and vegetation on the slope surface, thus affecting landslides. CRDS refers to the combined relationship between rock stratum dip direction and slope aspect, which acts as a comprehensive factor considering both terrain and geology (Wen et al., 2017). Distance from rivers was selected because rivers exert downward cutting, lateral cutting and wave impact on the slope bank, which take away the rock and soil mass at the slope toe and create a free face there, thus preparing the conditions for landslides. One of the main triggers for landslides studied in this paper is ground water (pore water), which indicates that rivers and other water bodies have played a great role in the landslides within the study area. Annual average rainfall refers to the average rainfall over a long period, which affects not only the slope itself but also the development of vegetation, surface runoff and other factors, thus affecting the development of landslides. Besides, it also corresponds to a main cause, i.e., rainfall. Distance from roads and distance from buildings are both factors selected for the trigger of human activities. The construction of roads and buildings will significantly lower the stability of the slope, increase the micro topography generated in the process of slope excavation and accelerate the occurrence of landslides.

3.2. Data processing

The data of slope, aspect, slope position, landforms (Weiss, 2001), profile curvature, TWI (Yu et al., 2017) and CRDS (Wen et al., 2017) were obtained by processing the DEM in ArcGIS. The data of lithology, faults and stratum occurrence were obtained by vectorization of the 1:200,000 geological map. NDVI was generated from Landsat 8 OLI data. The annual average rainfall was derived from grid data, which were produced from raw data by spatial interpolation. The raw data are the complete kilometer-grid product from January 2008 to December 2014 provided by the Chongqing Meteorological Bureau. Distance from faults, distance from rivers, distance from roads and distance from buildings were obtained by multi-level buffering of faults, rivers, roads and buildings, respectively. All mapped faults were included, although there are no active faults in this research area. We summarized the categories of conditioning factors in Table 2.

A geospatial database of landslide conditioning factors was established with a 30 m resolution grid cell as the basic unit for landslide susceptibility assessment (Fig. 4).

4. RF model based on hyperparameter optimization using Bayes algorithm

4.1. Random forest model

Random forest is an ensemble learning method, first proposed by Breiman (1996) and Cutler (2005), that constructs multiple decision trees from different data subsets and votes on their results to obtain the output of the random forest. A large body of research has shown that random forest is considerably tolerant of outliers and noise, unlikely to over-fit, and of high prediction accuracy and stability (Li, 2013).

The core of random forest is to construct a large number of uncorrelated decision tree models [h(X, θk); k = 1, …] for training. Each decision tree makes a prediction about the class of the sample separately (for classification). The final output is the mode of these predictions. The performance of the random forest can be improved by constructing uncorrelated training sets in order to decrease the variance of the model. Different classifiers h1(X), …, hk(X) are obtained by training on the samples and are then combined to construct the random forest model. The output of random forest is determined by a
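As a hedged illustration of the voting step described in Section 4.1, the sketch below takes the class predicted by each tree and returns the mode. The function name `forest_predict` and the five example votes are hypothetical, and ties are resolved in favor of the class that was seen first, which is an assumption of this sketch rather than something the paper specifies.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the mode (most common class) of the individual tree votes."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    # most_common(1) returns [(class, count)] for the most frequent class.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees: 1 = landslide, 0 = non-landslide.
votes = [1, 0, 1, 1, 0]
label = forest_predict(votes)
# Fraction of trees voting "landslide"; this can serve as a susceptibility score.
share = votes.count(1) / len(votes)
```

The vote share is one common way to turn a forest of binary classifiers into a continuous score that can then be binned into susceptibility levels.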
forming the sample data set. Generally, the receiver operating characteristic (ROC) curve can be used to test the evaluation results of typical binary classification problems such as landslides. The area under the ROC curve is the AUC value, which can quantitatively represent the accuracy of model prediction (Li et al., 2014). The ROC curves (Fig. 6) of the models constructed with the different parameters explored by the aforementioned Bayesian optimization algorithm were drawn, and the AUC values were calculated. For AUC, a value of 1 represents the ideal model and a value of 0.5 represents a model without discrimination ability; the higher the value, the better the model. During the iterations of hyperparameter optimization, the AUC values of the models obtained with different parameters ranged between 0.81 and 0.91. We chose the hyperparameters corresponding to the highest value (0.91): ['n_estimators': 50, 'max_depths': 16, 'min_samples_splits': 4, 'max_features': 10]. That is to say, the hyperparameters of the optimized model were used in later model training.

4.3. Model training and accuracy test method

Using the above optimized hyperparameters, the random forest model can be trained and constructed. In order to reduce the influence of a single sampling on the model results, the 5-fold cross-validation method was used to select training data and test data. 5-
Table 3
Main hyperparameters involved in RF.
Hyperparameter Explanation
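The AUC criterion and the 5-fold split described in Sections 4.2 and 4.3 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `auc` and `k_fold_indices` are hypothetical, the AUC is computed with the pairwise rank formulation (the probability that a randomly chosen landslide sample scores higher than a randomly chosen non-landslide sample), and the Bayesian search itself is not reproduced here.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    sample receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and yield (train, test) lists for k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# A perfectly separating score gives AUC 1; a constant score gives AUC 0.5.
ideal = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
no_skill = auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
splits = list(k_fold_indices(10, k=5))
```

In a hyperparameter search, each candidate setting would be scored by its AUC on the held-out folds, and the setting with the highest value kept, which mirrors the selection of the 0.91 configuration reported above.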
Zhakuoshi landslide. There were many tension cracks in the front part of the landslide body, and the largest deformation occurred on October 3. (3) Huoshitan landslide is located in the slope zone on the left bank of the Meixi River, a first-order tributary of the Yangtze River, in Qiaowan village, Xincheng Town, Fengjie County. From October 1 to October 31, a large-scale soil landslide (Fig. 10c) occurred in Huoshitan. Rainfall induced continuous deformation, multiple tension cracks appeared in the front of the landslide body, and the largest landslide deformation occurred on October 7.

Through comparative analysis, both the Xinpu landslide and the Huoshitan landslide are in high and very high susceptibility regions, and the Zuokushi landslide is in a very high susceptibility region. In conclusion, most of the new landslide events are located in either high or very high susceptibility regions. The landslide susceptibility mapping model has strong prediction ability.

6. Discussion and conclusion

6.1. The importance and impact law of typical conditioning factors

It can be seen from Fig. 8 that elevation is the most important conditioning factor affecting the occurrence of landslides in the study area. The impact of elevation on landslide hazard can be explained by its close correlation with vegetation type, vegetation coverage, soil moisture, human engineering activities and rainfall. According to the distribution of historical landslides in each elevation class on the thematic elevation layer, a statistical diagram of landslide density over the elevation range of the study area was generated (Fig. 11a). It can be seen from the figure that, in general, landslide density had a negative correlation with elevation, i.e., landslide density was higher at lower elevations. Fengjie County is a typical mountainous area with complex terrain. Low elevation areas often have loose soil beds, more human engineering activities and more frequent landslides, while high elevation areas have tighter soil beds, fewer human engineering activities, higher vegetation coverage and less frequent landslides. Therefore, the elevation range with the densest landslide distribution is the low elevation range with frequent human activities and low vegetation coverage.

Annual average rainfall is the second most important conditioning factor. The statistical chart of annual average rainfall and landslide density (Fig. 11b) was generated from the multi-year average rainfall and landslide density data. The landslide density increased at first and then decreased as annual average rainfall continued to rise. Rainfall scoured the surface of the slope, and the unstable rock and soil particles on the slope surface were taken away by the surface runoff formed by rainfall, which led to the erosion of the slope. The high rainfall areas are stripped of soil/regolith. The annual average rainfall would also affect the development of vegetation, thus affecting the development of landslides.

The least important conditioning factor was the distance from faults. Earthquake-triggered landslides occurred mostly in the vicinity of the more concentrated active faults, densely distributed along the direction of the structural line (Wen et al., 2016). However, the susceptibility to disaster varies with distance from the fault. The influence of distance from the fault is likely to be limited to a certain range (Ni et al., 2018). However, from Fig. 4(i), it can be found that there were only two fault zones in the landslide-intensive area, which might be caused by the insufficient accuracy of fault data in the study area and the coarse classification intervals of the distance from faults. The landslide sites did not show a direct relationship with the distance from faults, possibly because the distance exceeded the range of influence.

In the past, the influence of human activities was seldom considered in susceptibility models. The influence of human engineering activities is a conditioning factor that cannot be ignored in the formation of landslides. All slopes excavated and backfilled manually show different degrees of deformation and damage. The distance from buildings identified in this study is a conditioning factor that has not been investigated before, and its importance was as high as 7.24% in this study area, which contributed significantly to the occurrence of landslides.

To conclude the above discussion, in the importance evaluation of conditioning factors given by the random forest method, the 16 conditioning factors are not independent, and each may be correlated with other conditioning factors. Some conditioning factors may strengthen or weaken the importance of other related conditioning factors. Therefore, the importance analysis of conditioning factors is a comprehensive and integrated reflection of the mutual restriction and balance of all conditioning factors. In order to reduce the correlation between conditioning factors, it is necessary to screen out the dominant conditioning factors and thereby change the whole conditioning factor system.

6.2. Factor selection

In this paper, the dominant conditioning factors were selected from all the conditioning factors by recursive feature elimination. The purpose of the conditioning factor screening is to remove conditioning
factors that are irrelevant or redundant. In addition, a sufficient set of dominant conditioning factors can save the running time and input data resources of the model. It could be a reference for other similar studies. By using the recursive feature elimination method (Zhou et al., 2014), the last-ranked feature in the importance ranking was eliminated each time, and the accuracy of the model was calculated. The importance rankings of the conditioning factors obtained after each recursion were compared, and the accuracy was kept at about 99%. Hence, 9 dominant conditioning factors of landslides were finally selected, including elevation, distance from buildings, land cover, lithology, annual average rainfall, distance from roads, distance from rivers, NDVI and slope. The confusion matrix test results of the final selected model are shown in Table 6, and the order of importance of the conditioning factors is shown in Fig. 12.

Table 5
Statistical results of landslide susceptibility in different grades.
Susceptibility level   Grid number   Area proportion (%)   Landslide   Landslide proportion (%)   Density proportion

Among the final 9 dominant conditioning factors, elevation and annual average rainfall were still the two most important factors, with an average accuracy reduction of 51.98 and 52.03, and the contribution rate of lithology and slope was relatively high. Distance from rivers and distance from roads ranked lower, indicating that their contribution to the occurrence of landslides was relatively low.

As per the ranking, elevation, annual average rainfall and lithology were the dominant conditioning factors. For these three conditioning factors, we found that in a large number of studies (Chen et al., 2017a; Chen et al., 2017b; Chen et al., 2018; Hong et al., 2016; Youssef et al., 2015), both elevation and lithology have been included in the analysis of conditioning factors; however, annual average rainfall was left out in the articles of Chen et al. (2017b) and Chen et al. (2018). The reason why the annual average rainfall was not considered might be that the data was too difficult to obtain, or it might be
replaced by some other conditioning factors (such as TWI, a soil moisture index indicating the water content and distribution in the soil, and rainfall, one of the sources of soil water). More likely, the landslides in the study areas of those articles were basically large/deep/bedrock ones, and the trigger tended to be earthquakes and other earth movements. Therefore, it shows that elevation and lithology are the essential dominant conditioning factors to evaluate landslides and the annual average rainfall is consistent with 82% of the historical landslides in the study area.

Table 6
Confusion matrix of random forest classification results.
                                      Actual value
                                      Non-landslide (0)   Landslide (1)    Accuracy
Predicted value   Non-landslide (0)   15,200              13               Precision: 0.9991
                  Landslide (1)       0                   1516             Precision: 1
                                      Recall: 1           Recall: 0.9914   Accuracy: 0.9992

6.3. The advance of model optimization

There has been much research on random forest; however, most past research has focused mainly on the comparison between random forest and other landslide evaluation models (including traditional statistical methods and mainstream machine learning algorithms). For example, Chen et al. (2017a) compared three advanced machine learning algorithms, i.e., LMT, CART and RF, for the evaluation accuracy of landslide susceptibility in Long County, China. Chen et al. (2018), taking Longhai, China as the study area, compared the accuracy of the Best-First Decision Tree, Random Forest and Naive Bayes Tree. The comparison results of the two studies showed that the RF model had the best accuracy. Most of the research results show that the random forest model is a promising mapping technique for landslide
susceptibility. However, Youssef et al. (2015) compared the accuracy of four models, namely Random Forest (RF), Boosted Regression Tree (BRT), Classification And Regression Tree (CART) and Generalized Linear Model (GLM), and Hong et al. (2016) compared RF with traditional statistical models (EBF, LR, FR); the results of both studies show that the RF accuracy was generally higher than that of the other models.

It can be found that no conclusion has been drawn on the merits and demerits of the RF model relative to other models. The studies above applied no model optimization, only the built-in parameters of the model. In that case, the model was not necessarily optimal, and therefore its accuracy could be further improved. The comparison between a non-optimized model and other models is not very convincing, since it does not truly reflect the advantages and disadvantages of each model for the specific study area. In order to improve the accuracy of the model, we can consider optimizing the parameters. Bayesian optimization (BO) relies on fitting a probability model to the observations of the black-box objective being optimized. Through iterative processing in the probability model, we can find the hyperparameters within a certain range. The hyperparameters with the highest AUC value of 0.91 were chosen to ensure the accuracy of the optimal model.

1) A random forest evaluation model of landslide susceptibility after Bayesian hyperparameter optimization was proposed in this study. A typical mountainous area with multiple landslides was taken as an example for application analysis, and the importance degree and influence rule of conditioning factors were analyzed.

2) The result of the random forest model after hyperparameter optimization, applied to the case research area, indicated that the AUC values of the ROC curve for the training data set, verification data set and regional simulation were 0.95, 0.87 and 0.93, respectively. 65% of historical landslides fell in high susceptibility region with an area of

factors that shall not be ignored, while slope position and distance from fault were relatively insignificant. Hence, it can be further optimized to obtain the dominant conditioning factors of landslide susceptibility to ensure the accuracy of efficient modeling and analysis.

4) Through Bayesian hyperparameter optimization, 5-fold cross-validation for the selection of the best sample, and dominant conditioning factor screening analysis, we can build a more efficient random forest landslide susceptibility evaluation model with high accuracy and fewer dominant conditioning factors. Among them, as the highlight of this paper, Bayesian hyperparameter optimization is used to find the hyperparameters within a certain range through iterative processing in the probability model. The ROC curve corresponding to each candidate parameter set is obtained through the random forest, and the parameter set with the highest AUC value is selected as the optimal hyperparameters.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

We would like to express our gratitude to Chongqing Meteorological Administration for providing essential meteorological data, and Chongqing Institute of Geology and Mineral Resources for providing valuable research materials of historical slope disaster cases and new slope deformation/damage cases in the research area. We also thank our families and friends who helped us during the writing of this paper.

The current research is supported by grants: National Key R&D Program of China (Grant No. 2018YFC1505501), and the National Natural Science Foundation of China (Grant No. 41807498).
b20%, and the model had high reliability and stability. In 2017, most of References
the new typical landslides in the study area were located in high suscep-
Ayalew, L., Yamagishi, H., 2005. The application of GIS-based logistic regression for land-
tibility region, and the model had high prediction ability.
slide susceptibility mapping in the Kakuda-Yahiko Mountains, Central Japan. Geo-
3) The results of conditioning factor importance evaluation and im- morphology 65 (1–2), 15–31.
pact law analysis of typical conditioning factors showed that elevation, Breiman, L., 1996. Bagging predictors. Mach. Learn. 24 (2), 123–140.
annual average rainfall and other conditioning factors were the most Cao, Z., 2014. Study on Optimization of Random Forests Algorithm. Capital Economic and
Trade University, Beijing Doctor dissertation (in Chinese).
important conditioning factors of landslide susceptibility. Human engi- Chen, X., Liu, M., 2005. Prediction of protein-protein interactions using random decision
neering, such as buildings and roads, were significant conditioning forest framework. Bioinformatics 21 (24), 4394–4400.
Chen, W., Xie, X., Wang, J., Pradhan, B., Hong, H., Bui, D.T., Duan, Z., Ma, J., 2017a. A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. Catena 151, 147–160.
Chen, W., Pourghasemi, H.R., Panahi, M., Kornejady, A., Wang, J., Xie, X., Cao, S., 2017b. Spatial prediction of landslide susceptibility using an adaptive neuro-fuzzy inference system combined with frequency ratio, generalized additive model, and support vector machine techniques. Geomorphology 297, 59–85.
Chen, W., Zhang, S., Li, R., Shahabi, H., 2018. Performance evaluation of the GIS-based data mining techniques of best-first decision tree, random forest, and naïve Bayes tree for landslide susceptibility modeling. Sci. Total Environ. 644, 1006–1018.
Cutler, A., 2005. Random forests. American Cancer Society.
Das, I., Stein, A., Kerle, N., Dadhwal, V.K., 2012. Landslide susceptibility mapping along road corridors in the Indian Himalayas using Bayesian logistic regression models. Geomorphology 179 (60), 116–125.
Du, G., Zhang, Y., Iqbal, J., Yang, Z., Yao, X., 2017. Landslide susceptibility mapping using an integrated model of information value method and logistic regression in the Bailongjiang watershed, Gansu Province, China. J. Mt. Sci. 14 (2), 249–268.
Froude, M.J., Petley, D.N., 2018. Global fatal landslide occurrence from 2004 to 2016. Nat. Hazards Earth Syst. Sci. 18, 2161–2181.
Garrido-Merchán, E.C., Hernández-Lobato, D., 2019. Dealing with categorical and integer-valued variables in Bayesian Optimization with Gaussian processes. Neurocomputing 380, 20–35.
Guo, Z., Yin, K., Huang, F., Fu, S., Zhang, W., 2019. Evaluation of landslide susceptibility based on landslide classification and weighted frequency ratio model. Chin. J. Rock Mech. Eng. 38 (2), 287–300.
Hong, H., Pourghasemi, H.R., Pourtaghi, Z.S., 2016. Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models. Geomorphology 259, 105–118.
Kang, L., Chen, R., Xiong, N., Chen, Y., Hu, Y., Chen, C., 2019. Selecting hyper-parameters of Gaussian process regression based on non-inertial particle swarm optimization in internet of things. IEEE Access 7, 59504–59513.
Kim, S., Lee, J., Ko, B., Nam, J., 2010. X-ray image classification using random forests with local binary patterns. Proceedings of the 9th International Conference on Machine Learning and Cybernetics. IEEE Computer Society, Qingdao, China, pp. 3190–3194.
Li, Z., 2013. Several Research on Random Forest Improvement. Xiamen University, Xiamen. Master dissertation (in Chinese).
Li, T., Tian, Y., Wu, L., Liu, L., 2014. Landslide susceptibility mapping using random forest. Geography and Geo-Information Science 30 (06), 25–30.
Liu, X., Zhu, A., Yang, L., Pei, T., Liu, J., Zeng, C., Wang, D., 2020. A graded proportion method of training sample selection for updating conventional soil maps. Geoderma 357, 113939.
Ni, S., Ma, C., Yang, H., Zhang, Y., 2018. Spatial distribution and susceptibility analysis of avalanche, landslide and debris flow in Beijing mountain region. Journal of Beijing Forestry University 40 (06), 81–91.
Pang, H., Datta, D., Zhao, H., 2010. Pathway analysis using random forests with bivariate node-split for survival outcomes. Bioinformatics 26 (2), 250–258.
Reichenbach, P., Rossi, M., Malamud, B.D., Mihir, M., Guzzetti, F., 2018. A review of statistically-based landslide susceptibility models. Earth-Sci. Rev. 180, 60–91.
Sun, D., Wu, J., Wen, H., Xue, M., 2019. Damage resistance mapping of mountain slopes based on geospatial big data mining. Journal of Chongqing Normal University (Natural Science) 36 (3), 64–71.
Taalab, K., Cheng, T., Zhang, Y., 2018. Mapping landslide susceptibility and types using Random Forest. Big Earth Data 1–20.
UNOCHA, 2019. Asia and the Pacific: Weekly Regional Humanitarian Snapshot (6–13 August 2019). Available online. http://reliefweb.int/report/china/asia-and-pacific-weekly-regional-humanitarian-snapshot-6-13-august-2019.
Wan, Z., Dong, H., Liu, B., 2010. On choice of hyper-parameters of support vector machines for time series regression and prediction with orthogonal design. Rock Soil Mech. 31 (2), 503–508+515.
Wang, X., Huang, F., Cheng, Y., 2014. Super-parameter selection for Gaussian-Kernel SVM based on outlier-resisting. Measurement 58, 147–153.
Ward, M.M., Pajevic, S., Dreyfuss, J., Malley, J.D., 2006. Short-term prediction of mortality in patients with systemic lupus erythematosus: classification of outcomes using random forests. Arthritis Rheum. 55 (1), 74–80.
Weiss, A., 2001. Topographic position and landforms analysis. Proceedings of ESRI User Conference. San Diego, CA, USA, pp. 9–13.
Wen, H., Xie, P., Xiao, P., Hu, D., 2016. Rapid susceptibility mapping of earthquake-triggered slope geohazards in Lushan County by combining remote sensing and the AHP model developed for the Wenchuan earthquake. Bull. Eng. Geol. Environ. 76 (3), 909–921.
Wen, H., Wang, G., Huang, X., Xue, J., Xie, P., Zhang, Y., 2017. A Preliminary Evaluation Method of Slope Stability Based on Topographic Map and Geological Map. Chinese patent No. 2017105719823 (in Chinese).
Xie, Y., Li, X., Ngai, E.W.T., Ying, W., 2009. Customer churn prediction using improved balanced random forests. Expert Syst. Appl. 36 (3), 5445–5449.
Xie, P., Wen, H., Ma, C., Baise, L.G., Zhang, J., 2018. Application and comparison of Logistic regression model and Neural network model in earthquake-induced landslides susceptibility mapping at mountainous region, China. Geomat. Nat. Haz. Risk 9 (1), 501–523.
Yin, K., Zhu, L., 2001. Landslide hazard zonation and application of GIS. Earth Sci. Front. 8 (2), 279–284.
Ying, W., Li, X., Xie, Y., Johnson, E., 2008. Preventing customer churn by using random forests modeling. Proceedings of the 7th IEEE international Conference on Information Reuse and Integration. IEEE Computer Society, Las Vegas, USA, pp. 429–434.
Youssef, A.M., Pourghasemi, H.R., Pourtaghi, Z.S., Al-Katheeri, M.M., 2015. Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 13 (5), 839–856.
Yu, K., Yao, X., Qiu, Q., Liu, J., 2016. Landslide spatial prediction based on random forest model. Transactions of the Chinese Society for Agricultural Machinery 47 (10), 338–345.
Yu, H., Luo, L., Ma, H., Li, H., 2017. Application appraisal in catchment hydrological analysis based on SRTM 1 Arc-Second DEM. Remote Sens. Land Resour. 29 (2), 138–143.
Zhou, Q., Zhou, H., Zhou, Q., Yang, F., Luo, L., 2014. Structure damage detection based on random forest recursive feature elimination. Mech. Syst. Signal Process. 46 (1), 82–90.
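
As a supplementary illustration of the Bayesian optimization procedure summarized in the discussion and conclusions (fit a probability model to the observed scores of a black-box objective, then iteratively choose the next hyperparameter value to evaluate), a minimal sketch is given here. It is not the implementation used in this study. The Gaussian-process surrogate with an RBF kernel, the expected-improvement acquisition rule, the initial design, and the one-dimensional function `toy_auc` (a hypothetical stand-in for the cross-validated AUC of a random forest over a single normalized hyperparameter) are all assumptions made for illustration.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs (assumed surrogate kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(A[i][i] - s, 1e-12))
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, [y - mean_y for y in ys]))
    k_star = [rbf(a, x_new) for a in xs]
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best observed score (maximization)."""
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

def toy_auc(x):
    """Hypothetical objective: stands in for a cross-validated AUC, peaking at x = 0.3."""
    return 0.9 - (x - 0.3) ** 2

def bayes_optimize(objective, n_iter=10):
    """Evaluate three initial points, then pick each new point by maximum expected improvement."""
    grid = [i / 100 for i in range(101)]   # candidate values of the normalized hyperparameter
    xs = [0.0, 0.5, 1.0]                   # initial design (an assumption)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        best = max(ys)
        candidates = [c for c in grid if c not in xs]

        def score(c):
            mu, sigma = gp_posterior(xs, ys, c)
            return expected_improvement(mu, sigma, best)

        x_next = max(candidates, key=score)
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs

best_x, best_y, history = bayes_optimize(toy_auc)
print(f"best x = {best_x:.2f}, score = {best_y:.4f}, evaluations = {len(history)}")
```

In an application of this kind, `toy_auc` would be replaced by a function that trains the random forest with the candidate hyperparameters and returns its AUC; the candidate with the highest observed AUC is then retained, in the same way as the selection by highest AUC described in the text.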