Author's personal copy

Water Resources Management

https://doi.org/10.1007/s11269-020-02704-3

Ensemble Boosting and Bagging Based Machine Learning Models for Groundwater Potential Prediction

Amirhosein Mosavi 1,2 & Farzaneh Sajedi Hosseini 3 & Bahram Choubin 4 &
Massoud Goodarzi 5 & Adrienn A. Dineva 6 & Elham Rafiei Sardooi 7

Received: 22 June 2020 / Accepted: 28 October 2020

© Springer Nature B.V. 2020

Abstract
Due to the rapidly increasing demand for groundwater, one of the principal freshwater
resources, there is an urgent need for novel prediction systems that estimate groundwater
potential more accurately and so support informed groundwater resource management.
Ensemble machine learning methods are generally reported to produce more accurate
results. However, proposing novel ensemble models and running comparative studies to
evaluate their performance are equally essential for identifying suitable methods
precisely. Thus, the current study is designed to provide knowledge on the performance
of four ensemble models, i.e., Boosted generalized additive model (GamBoost), adaptive
Boosting classification trees (AdaBoost), Bagged classification and regression trees
(Bagged CART), and random forest (RF). To build the models, the locations of 339
groundwater resources and the spatial groundwater potential conditioning factors were
used. Thereafter, the recursive feature elimination (RFE) method was applied to identify
the key features. The RFE indicated that the best number of features for groundwater
potential modeling was 12 variables out of 15 (with a mean Accuracy of about 0.84). The
modeling results indicated that the Bagging models (i.e., RF and Bagged CART) had a
higher performance than the Boosting models (i.e., AdaBoost and GamBoost). Overall, the
RF model outperformed the other models (with Accuracy = 0.86, Kappa = 0.67,
Precision = 0.85, and Recall = 0.91). Also, the predictive variables topographic position
index, valley depth, drainage density, elevation, and distance from stream had the
highest contribution in the modeling process. The groundwater potential maps predicted
in this study can help water resources managers and policymakers in the fields of
watershed and aquifer management to maintain optimal exploitation of this important
freshwater resource.

Keywords Groundwater potential prediction · Ensemble machine learning models · Decision tree ·
Boosting · Bagging · Recursive feature elimination

* Adrienn A. Dineva
[email protected]

Extended author information available on the last page of the article

A. Mosavi et al.

1 Introduction

Today, particularly in arid regions, there is an urgent need for groundwater potential
modeling, as the scarcity of secure freshwater is a prominent problem wherever the
expansion of irrigation, industry, and urbanization is almost always associated with
groundwater (Kordestani et al. 2019; Naghibi et al. 2019; Miraki et al. 2019). The traditional
time-consuming and expensive approaches to groundwater exploration, e.g., drilling
and geological and geophysical practices, have been dramatically transformed by
the introduction of geographic information systems (GISs) and remote sensing (RS)
technologies for ground surveys. GIS has become a principal tool for handling big
spatial data, e.g., spring and qanat locations, and integrates easily with data analysis
and visualization tools to assess groundwater potential models (Gnanachandrasamy
et al. 2018). RS has also brought great capabilities through integration with GIS
and multivariate modeling techniques for identifying favorable groundwater
potential zones (Gebre et al. 2018).
Recent advances in the prediction of groundwater potential are of particular
interest within hydrology and hydrogeology, and interest in such prediction has been
increasing all around the world in recent years. The methods can be divided into
several categories. (i) Statistical methods, which have a long tradition in zoning
groundwater potential. Currently popular statistical methods include
weights-of-evidence (Chen et al. 2018), logistic regression, the statistical index
model, the index of entropy (Al-Abadi and Shahid 2015), and the frequency ratio
(Das 2019; Sachdeva and Kumar 2020). Yet they are associated with various drawbacks,
including low accuracy. (ii) Multi-criteria decision analysis (MCDA) techniques,
including the analytic hierarchy process (AHP) and TOPSIS (Agarwal and Garg 2016;
Hassan et al. 2018). (iii) Machine learning (ML) models such as boosted regression
tree, random forest (RF), artificial neural networks (ANNs) (Lee et al. 2018), and
support vector machine (SVM) (Kalantar et al. 2018).
A review of the literature on the modeling techniques used for groundwater modeling,
particularly groundwater potential modeling, shows a strong shift from standard
data mining and ML models to more sophisticated ensemble models (e.g., Sameen
et al. 2019; Chen et al. 2018). Despite the promising results of ensemble models
for groundwater potential modeling, the research area is very young, and the use of
these models has been limited to the studies above. Although a limited number of
studies have used boosting to improve the quality of the prediction, e.g.,
Motevalli et al. (2019), no study has compared the bagging and boosting techniques
for building ensemble methods. Thus, there is a research gap in this regard, and
there has been little discussion about the potential, differences, and individual
characteristics of boosting and bagging tree-based models for groundwater
potential modeling. In this context, the contribution of the current research is to develop
novel Boosting models (i.e., AdaBoost and GamBoost) and Bagging models (i.e., RF
and Bagged CART) for prediction of the groundwater potential. This study, therefore,
set out: (i) to find key features related to the groundwater potential, (ii) to predict the
spatial potential of the groundwater, (iii) to assess the performance of the Boosting
models and Bagging models in modeling groundwater potential, and (iv) to evaluate
the importance of predictive variables in the modeling process.

2 Material and Methods

2.1 Study Area

The Dezekord-Kamfiruz watershed is part of Fars Province, Iran, and extends between latitudes
30° 08′ and 30° 47′ N and longitudes 51° 43′ and 52° 26′ E (Fig. 1). The watershed has a mean
yearly precipitation of 652 mm, and the mean daily minimum and maximum temperatures are
6.25 °C and 21.62 °C, respectively. According to the De Martonne classification method, the
climate of the Dezekord-Kamfiruz watershed is semi-arid, Mediterranean, and semi-humid,
respectively, from east to west. The watershed has an area of 2089.52 square kilometers and a
population of about 47,000 people. The elevation of the study area varies from 1501 to 3699 m a.s.l.

There are 339 groundwater resources (including 308 perennial springs, 18 wells, and 13
qanats) which supply water for drinking and irrigation. Owing to the climate of the study area,
precipitation mostly falls in fall, winter, and early spring, whereas from mid-spring through the
whole summer (the irrigation period) precipitation is close to zero. Hence, the main source
of irrigation water is groundwater. Given this heavy dependence on groundwater, knowing
the groundwater potential zones could help to manage the water supply in this area better.

2.2 Dataset

The dataset in this study included the locations of the groundwater resources as the
dependent variable (i.e., groundwater productivity data) and the groundwater potential
conditioning factors (GPCF) as independent variables.

Fig. 1 Location of the Dezekord-Kamfiruz watershed


2.2.1 Groundwater Resources Data

The locations of the groundwater resources were obtained from the Iranian Water Resources
Management Company (IWRMC). Although the groundwater resource data comprised 401 points, we
used only the perennial resources (339 points) to indicate the existence of groundwater supply.
The groundwater resource locations were randomly divided into 70% (237 points) and 30%
(102 points) for the training and validation phases, respectively (Fig. 1).
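
The random 70/30 partition described above can be illustrated with a minimal Python sketch. It is not the authors' procedure (the study was carried out in R); the function name `split_points`, the seed, and the use of integer identifiers in place of the real point locations are assumptions made only for illustration.

```python
import random

def split_points(point_ids, train_fraction=0.7, seed=42):
    """Randomly split point identifiers into training and validation lists."""
    shuffled = list(point_ids)
    # A seeded generator keeps the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 339 perennial groundwater resource points, represented here by integer ids.
train_ids, valid_ids = split_points(range(339))
print(len(train_ids), len(valid_ids))
```

Rounding 70% of 339 gives the 237 training points and 102 validation points reported in the text.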

2.2.2 Groundwater Potential Conditioning Factors (GPCF)

Based on the literature survey, 15 essential factors were considered (Fig. 2).
The topographic factors, including elevation, slope, aspect, and curvature (Fig. 2a–d), were
extracted from an ASTER Digital Elevation Model (DEM) with a resolution of 30 × 30 m using
ArcGIS 10.3 software. Different topographic conditions create different conditions of climate,
soil, infiltration, and vegetation (Aniya 1985), which can affect groundwater resources. The
slope factor widely controls the recharge processes of groundwater (Prasad et al. 2008). It has
an important role in the velocity of water flow, as gentle slopes give runoff
enough time to penetrate the soil (Nampak et al. 2014). The slope affects hydrological
processes through differences in evapotranspiration, precipitation, and physiographic trends.
It further affects weathering and vegetation development processes, which are all related to
groundwater (Sidle and Ochiai 2006). As an indicator of morphology and topography, the
curvature shows the direction of flow and has a significant role in the stability and instability
of terrain. A concave surface collects more water and retains it for a longer time, allowing it
to percolate and infiltrate into the soil. The topographic position index (TPI) (Fig. 2e)
measures the difference between the elevation of each cell and the average elevation of the
neighboring cells. It shows the locations that are higher or lower than their surroundings. The
topographic roughness index (TRI) (Fig. 2f) indicates the roughness of the surface, and it is
calculated from the variations of elevation in the surrounding pixels. Valley depth (Fig. 2g) is
computed as the vertical distance to the base level of the channel network (Conrad and Olaya
2012). The TPI, TRI, and valley depth factors were generated using the SAGA-GIS software.
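
As an illustration of the TPI definition given above (cell elevation minus the mean elevation of the neighboring cells), the following Python sketch computes it on a tiny grid. The 3 × 3 neighborhood, the edge handling, and the sample elevations are assumptions; the SAGA-GIS tool used in the study may apply a different neighborhood radius or weighting.

```python
def topographic_position_index(dem):
    """TPI per cell: elevation minus the mean elevation of its neighbors.

    Assumes the 8-cell (3 x 3) neighborhood; edge cells use only the
    neighbors that exist inside the grid.
    """
    rows, cols = len(dem), len(dem[0])
    tpi = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                dem[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            tpi[r][c] = dem[r][c] - sum(neighbors) / len(neighbors)
    return tpi

# Hypothetical elevations (m): a small depression surrounded by higher ground.
dem = [
    [1510.0, 1510.0, 1510.0],
    [1510.0, 1502.0, 1510.0],
    [1510.0, 1510.0, 1510.0],
]
tpi = topographic_position_index(dem)
print(tpi[1][1])  # negative: the center cell lies below its surroundings
```

A negative TPI marks a cell lower than its neighborhood (e.g., a valley floor), and a positive TPI marks a cell higher than its neighborhood (e.g., a ridge).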

The water-related factors, including drainage density (Dd), distance from the stream (Dfs),
precipitation, and topographic wetness index (TWI), have an essential role in groundwater
potential mapping. The Dd map (Fig. 2h) was generated with the Line Density tool in ArcGIS 10.3.
Dd is related to the slope, elevation, bedrock, and lithological structures. The Dfs was extracted
using the Euclidean distance in ArcGIS 10.3 (Fig. 2i). The main sources of groundwater recharge
are precipitation and streams. Shorter distances to streams increase the degree of groundwater
recharge. The mean annual precipitation (Fig. 2j) for 1987–2016 was generated using the
gauge stations available in the study area (Fig. 1), whose data were received from the
IWRMC. The TWI (Fig. 2k) was produced by SAGA-GIS. It indicates the spatial patterns of
wetness and measures saturated source zones of surface runoff (Nampak et al. 2014).

Fig. 2 The predictive variables used for groundwater potential prediction: (a) elevation, (b) slope, (c) aspect, (d)
curvature, (e) topographic position index (TPI), (f) topographic roughness index (TRI), (g) valley depth, (h)
drainage density (Dd), (i) distance from stream (Dfs), (j) precipitation, (k) topographic wetness index (TWI), (l)
soil order, (m) lithology, (n) distance from fault (Dff), and (o) landuse

Soil and lithology have a significant role in both the porosity and permeability of aquifer
materials (Chowdhury et al. 2010; Songara et al. 2015a). The distance from fault (Dff) affects
groundwater resources because faults allow surface water to penetrate and increase the
permeability. In this study, the soil, lithology, and fault maps (Fig. 2l–n) were received from the
Forests, Range and Watershed Management Organization (FRWMO) of Iran. The Dff map was
extracted with the Euclidean distance tool in ArcGIS 10.3.
Land use can affect groundwater through its influence on soil and available water and by
changing the topography, vegetation, and infiltration conditions (Songara et al. 2015b). The
land uses of the study area (Fig. 2o), obtained from the FRWMO, are mainly rangeland and
forest, followed by agriculture, dry farming, waterbody, bare land, residential, and orchard.

2.3 Multicollinearity Analysis (MA) and Feature Selection (FS)

Although the input data considered in this study were selected based on the literature, the
existence of collinear variables (i.e., strong relationships among the predictors) can create
unrealistic results in the model outputs (Chatterjee et al. 2000). Therefore, in this study,
multicollinearity was tested using the Variance Inflation Factor (VIF). VIF values of less than
10 indicate that there is no high multicollinearity among the predictor variables.
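
The VIF of a predictor is commonly computed as 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. The Python sketch below follows this standard definition with an ordinary least-squares fit; the helper names and the three short sample columns are hypothetical and are not data from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def vif(columns, j):
    """VIF of predictor j: 1 / (1 - R^2) from regressing it on the others."""
    y = columns[j]
    # Design matrix: intercept plus every predictor except j.
    design = [[1.0] + [col[i] for k, col in enumerate(columns) if k != j]
              for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bk * xk for bk, xk in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Hypothetical predictor columns: the first two are nearly collinear.
columns = [
    [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],        # slope-like variable
    [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],             # roughness-like variable
    [650.0, 640.0, 655.0, 660.0, 645.0, 652.0], # precipitation-like variable
]
for j in range(len(columns)):
    print(j, round(vif(columns, j), 2))
```

In this toy data the first two columns would fail the VIF < 10 rule while the third passes; a perfectly collinear column would give R² = 1 and a division by zero, which the sketch does not guard against.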
Moreover, the presence of redundant data can create problems in the modeling process, such
as longer training times, lower model performance, and overfitting
(Wang and Chen 2019). The selection of key features is therefore an efficient way to
overcome these difficulties. In this study, the Recursive Feature Elimination (RFE) method, as
a feature selection (FS) method, was used to identify the key features. The RFE is a wrapper,
model-based approach in which the random forest model is applied as the estimator (Feng et
al. 2017). The RFE is a backward selection method (Kuhn and Johnson 2013) whose main
concept is to eliminate the unimportant variables. After each run, the importance of the features
is calculated and the features with the lowest importance are removed from the modeling; this
is repeated until a single feature remains (Chen et al. 2015). The FS is done only on the
training dataset (Wang and Chen 2019). The testing dataset (30%) is not used in the FS process
and is held out to test the groundwater potential modeling. A 10-fold cross-validation
method was used for feature selection based on the training dataset (70%). In each run, nine
folds of the training dataset were used for training and one fold was used to evaluate the model
performance, and this process was repeated until all runs were finished. Feature selection was
performed using the Caret package (Kuhn 2015) within the R environment.
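
The backward-elimination loop described above can be sketched in Python as follows. This is not the Caret implementation: `rank_fn` and `score_fn` are hypothetical callables standing in for a random forest importance ranking and a 10-fold cross-validated accuracy, and the toy values below are invented purely so that the loop can run.

```python
def recursive_feature_elimination(features, rank_fn, score_fn):
    """Backward selection: drop the least important feature each round.

    rank_fn(subset)  -> dict mapping feature -> importance (higher is better)
    score_fn(subset) -> cross-validated accuracy for that subset
    Both are supplied by the caller (hypothetical stand-ins here).
    """
    subset = list(features)
    history = []
    while subset:
        history.append((score_fn(subset), list(subset)))
        if len(subset) == 1:
            break  # stop once a single feature remains
        importance = rank_fn(subset)
        subset.remove(min(subset, key=importance.get))
    best_score, best_subset = max(history, key=lambda item: item[0])
    return best_subset, best_score, history

# Toy stand-ins (assumed values, not results from the study).
TOY_IMPORTANCE = {"tpi": 0.9, "elevation": 0.6, "curvature": 0.2, "aspect": 0.1}

def toy_rank(subset):
    return {name: TOY_IMPORTANCE[name] for name in subset}

def toy_score(subset):
    # Informative features raise the score; every extra feature costs a little.
    return 0.5 + 0.2 * ("tpi" in subset) + 0.1 * ("elevation" in subset) - 0.01 * len(subset)

best_subset, best_score, history = recursive_feature_elimination(
    ["tpi", "elevation", "aspect", "curvature"], toy_rank, toy_score
)
print(best_subset, len(history))
```

The subset with the highest cross-validated score is retained, which mirrors how the optimal number of variables is read from the accuracy-versus-subset-size curve.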

2.4 Groundwater Potential Modeling

In this study, after feature selection, the groundwater potential modeling was conducted using
the Boosting (Freund and Schapire 1997) and Bagging (Breiman 1996) methods. Model
training and parameter tuning were conducted with a 10-fold cross-validation method using
70% of the input data (the same data used for feature selection, excluding the redundant
variables identified by the RFE method). Two Boosting models, Adaptive Boosting
Classification Trees (AdaBoost) and Boosted Generalized Additive Model (GamBoost), and two
Bagging models, Random Forest (RF) and Bagged Classification and Regression
Trees (Bagged CART), were employed for this purpose. Parameter optimization was
conducted using the tuning function of the Caret R package (Kuhn 2015) with 10-fold
cross-validation resampling.

2.4.1 AdaBoost Model

AdaBoost represents an adaptive boosting approach to classification trees. It was introduced by
Freund and Schapire (1997) to improve the accuracy of classification-based ML methods.
It is a non-parametric algorithm that combines many weak learners into a strong one.
AdaBoost starts by creating an initial decision tree (DT), assigning equal weights to all
samples in the training data set. In the next step, the fitted model is applied to the entire
training set. The weights of the correctly classified samples are kept fixed, and the
misclassified samples receive higher weights. Once the weights of the whole training data set
are normalized, a new sub data set is randomly sampled according to these weights and used to
create a new DT. This workflow continues until a stopping condition is reached. The final
classifier is the weighted sum of all the DTs. Furthermore, Freund and Schapire (1997) showed
through several experiments that AdaBoost performs well on a variety of data sets. However,
there is a gap in the literature regarding the real-life application of AdaBoost and the
evaluation of its accuracy for groundwater potential mapping.
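
To make the reweighting idea concrete, the following Python sketch implements the common discrete AdaBoost formulation with one-split decision stumps as weak learners. It is a simplified stand-in, not the implementation used in the study: it reweights samples directly instead of resampling them, the normalization step makes correctly classified samples relatively lighter rather than keeping them fixed, and the one-dimensional toy data are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best one-split stump.

    Labels are +1 / -1; the stump predicts `polarity` for x <= threshold.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Train a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n  # start from equal sample weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this stump
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x <= threshold else -polarity for x in xs]
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x <= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each single stump misclassifies at least two of the six points, yet the weighted vote of three stumps reproduces all six labels, which is the behavior boosting is designed to achieve.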

2.4.2 GamBoost Model

GamBoost (Hofner et al. 2016) is a boosted extension of the generalized additive
model (GAM), i.e., an ensemble of GAMs. The GAM sums an independent transformation of each
predictor variable (Hastie and Tibshirani 2017), and these additive terms provide smooth
functions that can fit a broad range of response curves (Sandman
et al. 2008). GAM is considered a flexible regression model that can handle multiple
distribution functions (Hofner et al. 2016). Boosting can be used to complement and improve the
prediction accuracy and integrity of GAM models as well as to deal with overfitting (Mayr et al.
2012). GamBoost provides a novel fitting technique and performs well in promoting the most
important variables, particularly in high-dimensional spaces where variable selection is of major
importance (Hofner et al. 2016).

2.4.3 Bagged CART

CART has been widely used for groundwater modeling, including groundwater potential
prediction, with acceptable performance (Duan et al. 2016). As CART is considered an
unstable model, the bagging technique can greatly improve its accuracy (Murphree et al.
2018). The bagged CART effectively decreases the prediction variance, which improves
classification performance and reduces overfitting. Thus, it is expected that using the bagged
CART in the novel application of groundwater potential prediction can achieve promising
results.

2.4.4 RF Model

The random forest method is an ensemble learning method widely used for regression and
classification. Ho (1995) proposed RF based on the random subspace method, constructing a
multitude of decision trees with controlled variance to improve the accuracy and to address
overfitting to the training data. RF was later advanced and implemented as a package for
bagging and feature selection (Breiman 2001), consisting of an ensemble of independent
classification trees, each built on a random sample. To build a model, about two-thirds of the
data set is often devoted to creating the decision trees, and the rest is used to evaluate the
model accuracy, error, and further performance. In the next step, the predictions of the DTs
are aggregated, and the final prediction is identified according to the majority vote of all trees.
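
The "about two-thirds" figure follows from bootstrap sampling: when n samples are drawn with replacement, each observation is left out of a given sample with probability (1 − 1/n)^n, which is close to 1/e ≈ 0.37. The Python sketch below illustrates this together with majority voting. The seed and the example votes are arbitrary assumptions, and n = 237 is taken from the size of the training set.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(votes):
    """Return the most common class label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # arbitrary seed for reproducibility
n = 237                 # size of the training set in this study

# Fraction of training points that one bootstrap sample leaves out ("out-of-bag").
in_bag = set(bootstrap_indices(n, rng))
oob_fraction = 1 - len(in_bag) / n
expected_oob = (1 - 1 / n) ** n  # theoretical expectation, close to 1/e

print(round(oob_fraction, 3), round(expected_oob, 3))
print(majority_vote(["potential", "no potential", "potential"]))
```

The left-out observations give each tree its own evaluation set, and the final class is the label that most trees vote for.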

2.5 Model Validation

Model validation in this study was conducted with a hit-and-miss analysis using the 30% of the
data that had not been used in the training phase. The Accuracy, Kappa, Precision, and
Recall statistics were used to validate the results (Eqs. 1–5) (Johnson and Olsen 1998; Stanski
et al. 1989):
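$$\text{Accuracy} = \frac{H + CN}{H + FA + M + CN} \tag{1}$$

$$\text{Kappa} = \frac{\text{Accuracy} - P_e}{1 - P_e} \tag{2}$$

$$P_e = \frac{(H + FA)(H + M) + (M + CN)(FA + CN)}{(H + FA + M + CN)^2} \tag{3}$$

$$\text{Precision} = \frac{H}{H + FA} \tag{4}$$

$$\text{Recall} = \frac{H}{H + M} \tag{5}$$
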
where H denotes the number of hits, FA the number of false alarms, M the
number of misses, and CN the number of correct negatives, all of which are obtained from a
contingency table (Johnson and Olsen 1998; Stanski et al. 1989). Also, Pe is the expected
agreement, which indicates how much agreement would be expected by chance alone
(Beucher et al. 2017). The Accuracy, Kappa, Recall, and Precision vary between 0 and 1,
where 1 indicates a perfect prediction.
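
Equations 1–5 translate directly into code. The Python sketch below evaluates them for a contingency table; the counts H = 50, FA = 10, M = 5, and CN = 37 are hypothetical values chosen only to sum to the 102 validation points and are not results from the study.

```python
def contingency_metrics(h, fa, m, cn):
    """Accuracy, Kappa, Precision, and Recall from a 2 x 2 contingency table.

    h: hits, fa: false alarms, m: misses, cn: correct negatives.
    """
    n = h + fa + m + cn
    accuracy = (h + cn) / n                                          # Eq. 1
    pe = ((h + fa) * (h + m) + (m + cn) * (fa + cn)) / n ** 2         # Eq. 3
    kappa = (accuracy - pe) / (1 - pe)                               # Eq. 2
    precision = h / (h + fa)                                         # Eq. 4
    recall = h / (h + m)                                             # Eq. 5
    return {"accuracy": accuracy, "kappa": kappa,
            "precision": precision, "recall": recall}

# Hypothetical counts for 102 validation points.
metrics = contingency_metrics(h=50, fa=10, m=5, cn=37)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Kappa is lower than Accuracy because it discounts the agreement Pe that would be expected by chance alone.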

3 Results and Discussion

3.1 Pre-processing Results

Before the feature selection, multicollinearity was tested using the variance inflation
factor (VIF). The results indicated that there were no collinear variables (i.e., all of
the variables had a VIF of less than 10). The RFE results showed that, among the 15
variables, using 12 variables gives good performance for groundwater potential
prediction (Fig. 3). As can be seen from the box plots (Fig. 3), accuracy increases with
the number of variables up to 12 and then decreases. For 12 variables, the mean Accuracy
(red plus in Fig. 3) is 0.84 and the median Accuracy (the thin black line within the boxes in
Fig. 3) is 0.85 across the repeated model runs (Fig. 3).
Therefore, based on the RFE results, 12 variables were used for the modeling process.
The occurrence frequency (%) of each feature in all model runs (1200 times)
was computed across cross-validation using RFE (Table 1). Land use and the topographic
position index were the most frequently selected features, appearing in 95.9% to 100% of the
model runs. Curvature, soil type, and aspect ranked 13th to 15th in terms of occurrence
in the model runs and were identified as redundant variables. Therefore, following the optimum
number of variables identified by the RFE (Table 1), we excluded these variables from the
input data.

Fig. 3 Performance of the RFE using different numbers of features as input (box plots of
cross-validation accuracy against the number of features; the legend marks the maximum,
3rd quartile, mean, median, 1st quartile, and minimum)

3.2 Modeling Evaluation Results

The modeling results were evaluated by calculating the Accuracy, Kappa,
Precision, and Recall statistics (Table 2). The evaluation indicated that the
Accuracy values of all models exceed 80%. Also, according to the Kappa statistic
(Monserud and Leemans 1992), all of the models show good performance. Precision values
for the models vary between 0.82 and 0.85, and Recall is at least 0.85 (Table 2).

Table 1 The occurrence frequency (%) of features in the model runs using the RFE method

Feature                               Frequency (%)
Landuse                               100.0
Topographic position index (TPI)      95.9
Lithology                             86.5
Elevation                             77.0
Valley depth                          72.3
Topographic wetness index (TWI)       68.9
Drainage density (Dd)                 58.8
Distance from fault (Dff)             50.7
Distance from stream (Dfs)            46.6
Precipitation                         45.3
Topographic roughness index (TRI)     35.1
Slope                                 33.1
Curvature                             18.2
Soil type                             14.9
Aspect                                7.4

Table 2 Evaluation results of the predictive models

Statistic    AdaBoost   GamBoost   RF     Bagged CART
Accuracy     0.83       0.83       0.86   0.85
Kappa        0.61       0.61       0.67   0.64
Precision    0.84       0.82       0.85   0.85
Recall       0.85       0.88       0.91   0.86


A comparison of the models’ performance indicated that the RF model had the highest
performance, followed by the Bagged CART, GamBoost, and AdaBoost models (Table 2).
Therefore, the evaluation results indicated that the Bagging models (i.e., RF and Bagged CART)
had a higher performance than the Boosting models (i.e., AdaBoost and GamBoost). Recently,
Wang and Chen (2019) demonstrated that RF (as a Bagging model) outperforms
AdaBoost (as a Boosting model) in simulating oil well productivity in unconventional
formations. In another study, Alotaibi and Sasi (2016) reported that the AdaBoost and RF
models performed equally well in predicting intensive care unit transfer. In other studies, the
reasons for the success of RF (Liaw and Wiener 2002; Thuiller and Lafourcade 2009) are
given as (i) the ability to output an unbiased estimate of the generalization error, (ii) no need
for pre-analysis to select variables among a large number of predictors, (iii) the possibility of
using both categorical and numerical variables as predictors, (iv) the ability to evaluate
non-linear interactions between variables, and (v) the increased diversity of the classification
trees through the random selection of predictive variables for the different trees.
Neither Bagging nor Boosting is universally superior; their relative performance can depend
on the data and variables in the modeling process. However, where over-fitting is a concern,
Bagging is the better option, as Boosting does not help to avoid over-fitting (Freund and
Schapire 1997; Quinlan 1996). Another possible explanation for the result might be that the
initial model choice in Boosting is weaker than in Bagging (Lemmens and Croux 2006).

3.3 Prediction of Groundwater Potential

After evaluating the models, groundwater potential maps were predicted using the 12 predictors
identified by the RFE method. The pixel values of the predictors for the whole study area were
fed to the trained models. The probability (P) of groundwater potential predicted by each
model was then classified into 5 classes by the equal interval method:
very low (P = 0–0.2), low (P = 0.2–0.4), moderate (P = 0.4–0.6), high (P = 0.6–0.8),
and very high (P = 0.8–1) (Fig. 4).
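
The equal-interval classification can be written as a short Python function. The treatment of values lying exactly on a class boundary (assigned here to the upper class by integer truncation) is an assumption, since the text does not specify it.

```python
CLASSES = ["very low", "low", "moderate", "high", "very high"]

def potential_class(p):
    """Map a predicted probability in [0, 1] to one of five equal-width classes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # Each class spans 0.2; p = 1.0 is folded into the last class.
    index = min(int(p * 5), len(CLASSES) - 1)
    return CLASSES[index]

for p in (0.05, 0.35, 0.5, 0.75, 0.93):
    print(p, potential_class(p))
```

Applying such a function to every pixel of a probability raster yields a five-class groundwater potential map.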

Among the models, AdaBoost assigned the smallest area (220.27 km²) to the very low class,
while GamBoost assigned the largest (845.54 km²). In contrast, the low,
moderate, and high classes of the AdaBoost model have the largest areas (917.91, 619.22, and
307.03 km², respectively) in comparison with the other models. In all of the models, the very
high class covers a smaller area than the other classes, equal to 25.10, 173.46, 113.11, and
185.37 km² for the AdaBoost, GamBoost, RF, and Bagged CART models, respectively (Fig. 4).

Overall, in the AdaBoost model the very low and very high classes cover the smallest areas,
while in the GamBoost and Bagged CART models the class area decreases from the very low
class towards the very high class. In the RF model, the largest areas belong to the low
and very low classes (641.52 and 615.06 km², respectively). From the moderate to the very
high class, the area decreases (Fig. 4).

3.4 Variable Importance and Diagnostic Analysis

The importance of the variables in the modeling process was evaluated through the percent
decrease in Accuracy (Table 3). According to the results, the most important variables were TPI
and valley depth, with decreases in Accuracy of 31.9% and 27.3%, respectively. Other variables
such as drainage density, elevation, and distance from stream came next, with decreases in
Accuracy of 26.5%, 26.4%, and 26.3%, respectively (Table 3).
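
A decrease-in-accuracy importance can be illustrated with a simplified permutation test: the values of one predictor are shuffled and the resulting drop in accuracy is recorded. The Python sketch below is only an analog of the RF measure (which is normally computed on each tree's out-of-bag samples); the toy classifier, the rows, and the seed are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model reproduces the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, column, seed=0):
    """Percent drop in accuracy after shuffling one predictor column."""
    baseline = accuracy(predict, rows, labels)
    shuffled_values = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    permuted = [r[:column] + [v] + r[column + 1:]
                for r, v in zip(rows, shuffled_values)]
    return 100.0 * (baseline - accuracy(predict, permuted, labels))

def toy_model(row):
    """Hypothetical classifier: a negative TPI (column 0) means groundwater potential."""
    return 1 if row[0] < 0 else 0

# Hypothetical rows: [TPI, aspect in degrees]; labels follow the toy rule exactly.
rows = [[-1.8, 10.0], [1.7, 200.0], [-0.9, 90.0], [0.1, 300.0],
        [-1.5, 45.0], [1.2, 120.0], [-0.3, 270.0], [0.8, 15.0]]
labels = [toy_model(r) for r in rows]

tpi_importance = permutation_importance(toy_model, rows, labels, column=0)
aspect_importance = permutation_importance(toy_model, rows, labels, column=1)
print(tpi_importance, aspect_importance)
```

Shuffling a column the model ignores (aspect in this toy case) cannot change any prediction, so its importance is exactly zero, whereas shuffling a column the model relies on can only lower or preserve the accuracy.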

A diagnostic analysis was carried out to explore the complexity of the groundwater potential
classes and their dependence on the predictive variables. The mean values of the predictive
variables in each groundwater potential class (GPC) predicted by the RF model were computed
(Table 3). For categorical variables, the most prominent category in each GPC is presented. As
can be seen, the groundwater potential increases as the TPI decreases: the very high
GPC has a mean value of −1.82 and the very low GPC has a mean value of 1.69.
Therefore, locations with lower TPI (indicating lower elevations than their surroundings)
indicate higher groundwater potential. This matches the valley depth, for which the
groundwater potential increases with increasing depth. The mean values of the very high
and very low GPCs are 255.2 m and 100.8 m, respectively. Regarding the drainage
density, the groundwater potential increases with increasing density, as the very
high GPC has a higher density (Dd = 0.46) and the very low GPC has a lower one (Dd = 0.11).
The hydraulic connection between rivers and groundwater can be a reason for this. The mean
values of the elevation indicate that higher groundwater potential does not exactly follow lower
elevations, as the moderate GPC has the lowest elevation (2103.7 m), followed by the high and
very high GPCs (2115.3 m and 2171.6 m, respectively). With increasing distance from
streams, the groundwater potential decreases. The mean distance for the very high GPC is
about 500 m, whereas for the very low GPC it is about 2298 m. Also, the variation of the mean
slopes among the GPCs indicates that lower slopes have higher groundwater potential (Manap
et al. 2013), and vice versa. It is interesting to note that a shorter distance from faults
indicates a higher groundwater potential. Regarding the precipitation and TWI, the
very high GPC has higher precipitation (681.6 mm) and TWI (12.63) than the other
GPCs. The TRI indicates that higher groundwater potential coincides with lower surface
roughness. Concerning the lithology, the most prominent lithology for the very high GPC is
the Eja unit of the Jahrum formation, which mostly consists of limestones and
dolomite. This agrees well with the high groundwater storage capacity of these lithologies,
which is due to their high fracture porosity (Decker et al. 1998; Ashraf et al. 2018). Also, the
most prominent land use for the very high GPC (about 57.3% of this class) is
agriculture (Table 3). A possible reason for this may be the recharge from the
irrigated area; however, the importance of land use in the modeling process was
lower than that of the other variables (Table 3).

Fig. 4 Groundwater potential map predicted by (a) AdaBoost, (b) GamBoost, (c) RF, and (d) Bagged CART
models

Table 3 Importance, mean values (or prominence category) of the variables by the RF model

                                 Decrease in     Groundwater potential class (GPC)
Variable                         accuracy (%)    Very low         Low                Moderate           High               Very high
Topographic position index       31.9            1.69             0.13               -0.95              -1.54              -1.82
Valley depth (m)                 27.3            100.8            178.7              217.1              232.2              255.2
Drainage density (km/km²)        26.5            0.11             0.28               0.36               0.42               0.46
Elevation (m)                    26.5            2424.3           2185.5             2103.7             2115.3             2171.6
Distance from stream (m)         26.3            2298.1           1411.6             1042.0             816.5              500.2
Slope (%)                        23.7            32.0             23.8               18.0               14.4               13.6
Distance from fault (m)          23.5            12562.2          12737.2            12611.9            10764.7            8982.2
Precipitation (mm)               22.0            647.3            652.0              648.4              663.5              681.6
Topographic wetness index        21.9            9.27             10.37              11.19              11.92              12.63
Topographic roughness index      15.4            5.63             4.10               3.00               2.34               2.18
Lithology (prominence)           13.9            Kbgp (30.7%)     Kgu (37.6%)        Kgu (35.6%)        Kgu (31.5%)        Eja (33.9%)
Landuse (prominence)             12.1            Forest (50.6%)   Rangeland (45.6%)  Rangeland (38.2%)  Rangeland (35.0%)  Agriculture (57.3%)

4 Conclusion

The current study set out to investigate the performance of four tree-based ensemble models,
namely two Boosting models (AdaBoost and GamBoost) and two Bagging models (RF
and Bagged CART), for predicting groundwater potential zones. The study found that the
Bagging models performed better than the Boosting models. TPI, valley
depth, drainage density, elevation, and distance from stream were the most significant
contributors to the modeling process. The major limitation of this study was the lack of a
detailed soil map for the study area. Because soil strongly influences the infiltration process,
the use of other soil characteristics, such as soil texture at a detailed scale, could increase the
modeling accuracy. This study was also limited by the lack of information on groundwater
productivity characteristics such as transmissivity and specific capacity. It is recommended
that the association of these factors be investigated in future studies where these data are
available. Notwithstanding these limitations, the groundwater potential maps predicted in this
study can help water resources managers and policymakers in the fields of watershed and
aquifer management to maintain optimal exploitation of this important freshwater resource.

Acknowledgements We gratefully acknowledge the support of the Alexander von Humboldt Foundation.

Data Availability Not applicable.

Compliance with Ethical Standards

Conflicts of interest/Competing interests Not applicable.

Code Availability Not applicable.

References

Agarwal R, Garg PK (2016) Remote sensing and GIS based groundwater potential & recharge zones mapping
using multi-criteria decision making technique. Water Resour Manag 30:243–260
Al-Abadi AM, Shahid S (2015) A comparison between index of entropy and catastrophe theory methods for
mapping groundwater potential in an arid region. Environ Monit Assess 187(9):576
Alotaibi NN, Sasi S (2016) Tree-based ensemble models for predicting the ICU transfer of stroke in-patients. In
2016 International Conference on Data Science and Engineering (ICDSE). IEEE, Piscataway, pp 1–6
Aniya M (1985) Landslide-susceptibility mapping in the Amahata river basin, Japan. Ann Assoc Am Geogr
75(1):102–114
Ashraf MAM, Yusoh R, Sazalil MA, Abidin MHZ (2018) Aquifer Characterization and groundwater potential
evaluation in sedimentary rock formation. In Journal of Physics: Conference Series, vol 995, No. 1. IOP
Publishing, Bristol, p 012106
Beucher A, Møller AB, Greve MH (2017) Artificial neural networks and decision tree classification for
predicting soil drainage classes in Denmark. Geoderma 320:30–42
Breiman L (1996) Bagging predictors. Mach Learn 24:123–40
Breiman L (2001) Random forests. Mach Learn 45:5–32
Chatterjee S, Hadi AS, Price B (2000) Regression analysis by example (3rd ed.). Wiley, Hoboken. ISBN 978-0-
471-31946-7
Chen W, Yeo CK, Lau CT, Lee BS (2015) Real-time twitter content polluter detection based on direct features. In 2015
2nd International Conference on Information Science and Security (ICISS). IEEE, Piscataway, pp 1–4
Chen W, Li H, Hou E, Wang S, Wang G, Panahi M, Li T, Peng T, Guo C, Niu C, Xiao L, Wang J, Xie X,
Ahmad BB (2018) GIS-based groundwater potential analysis using novel ensemble weights-of-evidence
with logistic regression and functional tree models. Sci Total Environ 634:853–67
Chowdhury A, Jha MK, Chowdary VM (2010) Delineation of groundwater recharge zones and identification of
artificial recharge sites in West Medinipur district, West Bengal, using RS, GIS and MCDM techniques.
Environ Earth Sci 59(6):1209
Conrad O, Olaya V (2012) SAGA-GIS module library documentation (v2.2.3). Module Valley Depth. Available
online: http://www.sagagis.org/saga_tool_doc/2.2.3/index.html
Das S (2019) Comparison among influencing factor, frequency ratio, and analytical hierarchy process techniques
for groundwater potential zonation in Vaitarna basin, Maharashtra, India. Groundw Sustain Dev 8:617–29
Decker K, Heinrich M, Klein P, Kociu A, Lipiarski P, Pirkl H, Rank D, Wimmer H (1998) Karst springs,
groundwater and surface runoff in the calcareous Alps: assessing quality and reliance of long-term water
supply. IAHS Publ Ser Proc Rep Intern Assoc Hydrol Sci 248:149–156
Duan H, Deng Z, Deng F, Wang D (2016) Assessment of groundwater potential based on multicriteria decision
making model and decision tree algorithms. Math Probl Eng. https://doi.org/10.1155/2016/2064575
Feng C, Cui M, Hodge BM, Zhang J (2017) A data-driven multi-model methodology with deep feature selection
for short-term wind forecasting. Appl Energy 190:1245–1257
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to
boosting. J Comput Syst Sci 55:119–139
Gebre T, Ahmad I, Dar MA, Gadissa E, Teka AH, Tolosa AT, Brhane ES (2018) Mapping of groundwater
potential zones using remote sensing and geographic information system: A case study of parts of Tigray,
Ethiopia. Environ Geosci 25:133–40
Gnanachandrasamy G, Zhou Y, Bagyaraj M, Venkatramanan S, Ramkumar T, Wang S (2018) Remote sensing
and GIS based groundwater potential zone mapping in Ariyalur District, Tamil Nadu. J Geol Soc India 92:
484–490
Hassan ZU, Kanth TA, Malik MI (2018) Groundwater potential zonation and prioritization of wular catchment of
Kashmir using GIS based multi-criteria evaluation approach. Water Energy Int 60RNI:49–61
Hastie TJ, Tibshirani RJ (2017) Generalized additive models. CRC Press, Boca Raton
Ho TK (1995) Random decision forests. In Proceedings of the International Conference on Document Analysis
and Recognition, ICDAR. IEEE Computer Society, Washington, D.C., pp 278–82
Hofner B, Mayr A, Schmid M (2016) GamboostLSS: An R package for model building and variable selection in
the GAMLSS framework. J Stat Softw 74(1):1–31
Johnson LE, Olsen BG (1998) Assessment of quantitative precipitation forecasts. Weather Forecast 13(1):75–83
Kalantar B, Pradhan B, Naghibi SA, Motevalli A, Mansor S (2018) Assessment of the effects of training data
selection on the landslide susceptibility mapping: a comparison between support vector machine (SVM),
logistic regression (LR) and artificial neural networks (ANN). Geomatics Nat Hazards Risk 9(1):49–69
Kordestani MD, Naghibi SA, Hashemi H, Ahmadi K, Kalantar B, Pradhan B (2019) Groundwater potential
mapping using a novel data-mining ensemble model. Hydrogeol J 27:211–224
Kuhn M (2015) Caret: classification and regression training. Astrophysics Source Code Library.
http://adsabs.harvard.edu/abs/2015ascl.soft05003K
Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York
Lee S, Hong SM, Jung HS (2018) GIS-based groundwater potential mapping using artificial neural network and
support vector machine models: the case of Boryeong city in Korea. Geocarto Int 33(8):847–861
Lemmens A, Croux C (2006) Bagging and boosting classification trees to predict churn. J Mark Res 43(2):276–
286
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Manap AM, Sulaiman WN, Ramli MF, Pradhan B, Surip N (2013) A knowledge-driven GIS modeling technique
for groundwater potential mapping at the Upper Langat Basin, Malaysia. Arab J Geosci 6(5):1621–1637
Mayr A, Fenske N, Hofner B, Kneib T, Schmid M (2012) Generalized additive models for location, scale and shape
for high dimensional data-a flexible approach based on boosting. J R Stat Soc Ser C Appl Stat 61:403–27
Miraki S, Zanganeh SH, Chapi K, Singh VP, Shirzadi A, Shahabi H, Pham BT (2019) Mapping groundwater
potential using a novel hybrid intelligence approach. Water Resour Manag 33(1):281–302
Monserud RA, Leemans R (1992) Comparing global vegetation maps with the Kappa statistic. Ecol Model
62(4):275–293
Motevalli A, Naghibi SA, Hashemi H, Berndtsson R, Pradhan B, Gholami V (2019) Inverse method using
boosted regression tree and k-nearest neighbor to quantify effects of point and non-point source nitrate
pollution in groundwater. J Clean Prod 228:1248–1263
Murphree DH, Arabmakki E, Ngufor C, Storlie CB, McCoy RG (2018) Stacked classifiers for individualized
prediction of glycemic control following initiation of metformin therapy in type 2 diabetes. Comput Biol
Med 103:109–115
Naghibi SA, Dolatkordestani M, Rezaei A, Amouzegari P, Heravi MT, Kalantar B, Pradhan B (2019)
Application of rotation forest with decision trees as base classifier and a novel ensemble model in spatial
modeling of groundwater potential. Environ Monit Assess 191(4):248
Nampak H, Pradhan B, Manap MA (2014) Application of GIS based data driven evidential belief function model
to predict groundwater potential zonation. J Hydrol 513:283–300
Prasad RK, Mondal NC, Banerjee P, Nandakumar MV, Singh VS (2008) Deciphering potential groundwater
zone in hard rock through the application of GIS. Environ Geol 55(3):467–475
Quinlan JR (1996) Bagging, boosting, and C4.5. AAAI/IAAI 1:725–730
Sachdeva S, Kumar B (2020) A comparative study between frequency ratio model and gradient boosted decision
trees with greedy dimensionality reduction in groundwater potential assessment. Water Resour Manag.
https://doi.org/10.1007/s11269-020-02677-3
Sameen MI, Pradhan B, Lee S (2019) Self-learning random forests model for mapping groundwater yield in data-
scarce areas. Nat Resour Res 28:757–775
Sandman A, Isaeus M, Bergström U, Kautsky H (2008) Spatial predictions of Baltic phytobenthic communities:
Measuring robustness of generalized additive models based on transect data. J Mar Syst 74:S86–S96
Sidle RC, Ochiai H (2006) Landslides: Processes, prediction, and land use. Water Resources Monogr 18.
American Geophysical Union, Washington, D.C
Songara JC, Joshipura NM, Mehmood K, Prakash I (2015a) Assessment and management of watershed of
Machhu Dam III, Morbi, Gujarat using geoinformatics technology. Int J Adv Eng Res Dev
Songara JC, Kadivar HT, Joshipura NM, Prakash I (2015b) Estimation of surface runoff of Machhu Dam III
Chatchment Area, Morbi, Gujarat, India, using curve number method and GIS. Int J Sci Res Dev 3(3):2038–
2043
Stanski HR, Wilson LJ, Burrows WR (1989) Survey of common verification methods in meteorology. World
Weather Watch Technical Report No. 8, TD No. 358, World Meteorological Organization, Geneva, 114 pp
Thuiller W, Lafourcade B (2009) BIOMOD: species/climate modelling functions. R Package Version 1.1-3/r118
Wang S, Chen S (2019) Insights to fracture stimulation design in unconventional reservoirs based on machine
learning modeling. J Petrol Sci Eng 174:682–695

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Affiliations

Amirhosein Mosavi 1,2 & Farzaneh Sajedi Hosseini 3 & Bahram Choubin 4 & Massoud
Goodarzi 5 & Adrienn A. Dineva 6 & Elham Rafiei Sardooi 7

Amirhosein Mosavi
[email protected]

1 Environmental Quality, Atmospheric Science and Climate Change Research Group, Ton Duc Thang
University, Ho Chi Minh City, Vietnam
2 Faculty of Environment and Labour Safety, Ton Duc Thang University, Ho Chi Minh City, Vietnam
3 Reclamation of Arid and Mountainous Regions Department, Faculty of Natural Resources, University of
Tehran, Karaj, Iran
4 Soil Conservation and Watershed Management Research Department, West Azarbaijan Agricultural and
Natural Resources Research and Education Center, AREEO, Urmia, Iran
5 Soil Conservation and Watershed Management Research Institute (SCWMRI), AREEO, Tehran, Iran
6 Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam
7 Faculty of Natural Resources, University of Jiroft, Kerman, Iran
