Ensemble Boosting and Bagging Based Machine Learning Models for Groundwater Potential Prediction
Amirhosein Mosavi 1,2 & Farzaneh Sajedi Hosseini 3 & Bahram Choubin 4 &
Massoud Goodarzi 5 & Adrienn A. Dineva 6 & Elham Rafiei Sardooi 7
Abstract
Due to the rapidly increasing demand for groundwater, one of the principal freshwater
resources, there is an urgent need for novel prediction systems that estimate groundwater
potential more accurately and so support informed groundwater resource management.
Ensemble machine learning methods are generally reported to produce more accurate
results. However, proposing novel ensemble models is only part of the task: comparative
studies that evaluate the performance of these models are equally essential to identify
the suitable methods precisely. Thus, the current study is designed to provide
knowledge on the performance of four ensemble models, i.e., Boosted generalized
additive model (GamBoost), adaptive Boosting classification trees (AdaBoost), Bagged
classification and regression trees (Bagged CART), and random forest (RF). To build the
models, the locations of 339 groundwater resources and the spatial groundwater potential
conditioning factors were used. Thereafter, the recursive feature elimination (RFE)
method was applied to identify the key features. The RFE indicated that the best number
of features for groundwater potential modeling was 12 of the 15 variables (with a mean
Accuracy of about 0.84). The modeling results indicated that the Bagging models (i.e.,
RF and Bagged CART) had a higher performance than the Boosting models (i.e.,
AdaBoost and GamBoost). Overall, the RF model outperformed the other models (with
Accuracy = 0.86, Kappa = 0.67, Precision = 0.85, and Recall = 0.91). Also, the predictive
variables of topographic position index, valley depth, drainage density, elevation, and
distance from stream had the highest contribution in the modeling process. The groundwater
potential maps predicted in this study can help water resources managers and
policymakers in the fields of watershed and aquifer management to maintain optimal
exploitation of this important freshwater resource.
* Adrienn A. Dineva
[email protected]
1 Introduction
The Dezekord-Kamfiruz watershed is part of Fars Province, Iran, and extends between latitudes
30° 08′ and 30° 47′ N and longitudes 51° 43′ and 52° 26′ E (Fig. 1). The watershed has a mean
annual precipitation of 652 mm, and mean daily minimum and maximum temperatures of
6.25 °C and 21.62 °C, respectively. According to the De Martonne classification method, the
climate of the Dezekord-Kamfiruz watershed ranges from semi-arid through Mediterranean to
semi-humid from east to west. The watershed has an area of 2089.52 square kilometers and a
population of about 47,000 people. The elevation of the study area varies from 1501 to 3699 m a.s.l.
There are 339 groundwater resources (308 perennial springs, 18 wells, and 13
qanats), which supply water for drinking and irrigation. Owing to the study area’s climate,
precipitation mostly falls in fall, winter, and early spring, whereas from mid-spring to the end of
summer (the irrigation period) precipitation is close to zero. Hence, the main source
of irrigation water is groundwater. Given this heavy dependence on groundwater, knowing
the groundwater potential zones could help to better manage the water supply in this area.
2.2 Dataset
The dataset in this study included the locations of the groundwater resources as the
dependent variable (i.e., groundwater productivity data) and the groundwater potential
conditioning factors (GPCF) as independent variables.
The locations of the groundwater resources were obtained from the Iranian Water Resources
Management Company (IWRMC). Although the groundwater resource data comprised 401 points, we
used only the perennial resources (339 points) to indicate the existence of groundwater supply.
The groundwater resource locations were randomly divided into 70% (237 points) and 30%
(102 points) for the training and validation phases, respectively (Fig. 1).
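The 70/30 random split described above can be sketched in a few lines. This is only an illustration: the function name `split_points` and the seed are assumptions, and the study itself does not state how the random division was implemented.

```python
import random

def split_points(n_points, train_fraction=0.7, seed=42):
    """Randomly split point indices into training and validation sets.

    The seed is an arbitrary choice made here for reproducibility.
    """
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    n_train = int(n_points * train_fraction)  # 339 * 0.7 = 237.3, truncated to 237
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_points(339)
print(len(train_idx), len(valid_idx))  # 237 102
```

Truncating 237.3 to 237 reproduces the 237 and 102 point counts reported in the text.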
Based on a literature survey, 15 essential factors were considered (Fig. 2).
The topographic factors, including elevation, slope, aspect, and curvature (Fig. 2a–d), were
extracted from an ASTER Digital Elevation Model (DEM) with a resolution of 30 × 30 m using
ArcGIS 10.3 software. Different topographic conditions create different conditions of climate,
soil, infiltration, and vegetation (Aniya 1985), which can affect groundwater resources. The
slope factor largely controls the recharge processes of groundwater (Prasad et al. 2008). It has
an important role in the velocity of water flow, as gentle slopes give runoff
enough time to penetrate the soil (Nampak et al. 2014). The slope affects the hydrological
processes through differences in evapotranspiration, precipitation, and physiographic trends. It further
affects weathering and vegetation development processes, which are all related to groundwater
(Sidle and Ochiai 2006). As an indicator of morphology and topography, the curvature shows
the direction of flow and has a significant role in the stability and instability of terrain. A
concave curvature holds more water and retains it for a longer time, allowing it to percolate and
infiltrate into the soil. The topographic position index (TPI) (Fig. 2e) measures the difference between
the elevation of each cell and the average elevation of the neighboring cells. It identifies the locations
that are higher or lower than their surroundings. The topographic roughness index
(TRI) (Fig. 2f) indicates the roughness of the surface and is calculated from the variations of
elevation in the surrounding pixels. Valley depth (Fig. 2g) is computed as the vertical distance to the
base level of the channel network (Conrad and Olaya 2012). The TPI, TRI, and valley
depth factors were generated using the SAGA-GIS software.
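To make the TPI and TRI definitions concrete, the following minimal sketch computes both for one cell of a small elevation grid. It assumes a 3 × 3 neighborhood and a Riley-style TRI (square root of the summed squared elevation differences); the SAGA-GIS tools used in the study may apply other neighborhoods or formulas, and the grid values are invented for illustration.

```python
import math

def neighbours(grid, r, c):
    """Elevations of the (up to 8) cells surrounding cell (r, c)."""
    rows, cols = len(grid), len(grid[0])
    return [grid[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))
            if (i, j) != (r, c)]

def tpi(grid, r, c):
    """Cell elevation minus the mean elevation of its neighbours."""
    nb = neighbours(grid, r, c)
    return grid[r][c] - sum(nb) / len(nb)

def tri(grid, r, c):
    """Assumed Riley-style roughness: root of summed squared differences."""
    return math.sqrt(sum((grid[r][c] - z) ** 2 for z in neighbours(grid, r, c)))

# Hypothetical DEM patch (m): a 4 m deep pit in a flat surface.
dem = [[10.0, 10.0, 10.0],
       [10.0,  6.0, 10.0],
       [10.0, 10.0, 10.0]]
print(tpi(dem, 1, 1))  # -4.0, i.e. the cell lies below its surroundings
print(tri(dem, 1, 1))  # sqrt(8 * 16), about 11.31
```

A negative TPI marks a cell lower than its surroundings, which is the situation the study later associates with higher groundwater potential.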
The water-related factors, including drainage density (Dd), distance from stream (Dfs),
precipitation, and topographic wetness index (TWI), have an essential role in groundwater
potential mapping. The Dd map (Fig. 2h) was generated with the Line Density tool in ArcGIS 10.3.
Dd is related to the slope, elevation, bedrock, and lithological structures. The Dfs was extracted using the
Euclidean distance tool in ArcGIS 10.3 (Fig. 2i). The main sources of groundwater recharge are
precipitation and streams. Shorter distances to streams increase the degree of groundwater
recharge. The mean annual precipitation map (Fig. 2j) for 1987−2016 was generated using the
available gauge stations in the study area (Fig. 1), whose data were obtained from the
IWRMC. The TWI (Fig. 2k) was produced with SAGA-GIS. It indicates the spatial patterns of
wetness and measures saturated source zones of surface runoff (Nampak et al. 2014).
Fig. 2 The predictive variables used for groundwater potential prediction: (a) elevation, (b) slope, (c) aspect, (d)
curvature, (e) topographic position index (TPI), (f) topographic roughness index (TRI), (g) valley depth, (h)
drainage density (Dd), (i) distance from stream (Dfs), (j) precipitation, (k) topographic wetness index (TWI), (l)
soil order, (m) lithology, (n) distance from fault (Dff), and (o) landuse
Author's personal copy
Soil and lithology have a significant role in both the porosity and permeability of aquifer
materials (Chowdhury et al. 2010; Songara et al. 2015a). The distance from fault (Dff) affects
groundwater resources, since faults allow surface water to penetrate and increase permeability. In
this study, the lithology, soil, and fault maps (Fig. 2l–n) were obtained from the Forests,
Range and Watershed Management Organization (FRWMO) of Iran. The Dff map was
extracted using the Euclidean distance tool in ArcGIS 10.3.
Land use can affect groundwater through its influence on soil and available water, and by
changing the topography, vegetation, and infiltration conditions (Songara et al. 2015b). The
land uses of the study area (Fig. 2o), obtained from the FRWMO, are dominated by rangeland and
forest, followed by agriculture, dry farming, waterbody, bare land, residential, and orchard.
Although the input data considered in this study were selected based on the literature, the existence of
collinear variables (i.e., strong relationships among the predictors) can create unrealistic results in the
model’s outputs (Chatterjee et al. 2000). So, in this study, a multicollinearity analysis (MA) was
performed using the Variance Inflation Factor (VIF). VIF values of less than 10 indicate that there is no
high multicollinearity among the predictor variables.
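As an illustration of the VIF check, the sketch below uses the standard definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. It is written in plain Python so it runs without extra packages; the helper names and the toy data are assumptions, not the authors' workflow.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(X):
    """VIF of every column of X (a list of rows of predictor values)."""
    result = []
    for j in range(len(X[0])):
        target = [row[j] for row in X]
        others = [[v for k, v in enumerate(row) if k != j] for row in X]
        result.append(1.0 / (1.0 - r_squared(others, target)))
    return result

# Toy data: two uncorrelated predictors, then two nearly collinear ones.
uncorrelated = [[1, 1], [2, -1], [3, -1], [4, 1]]
collinear = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 11]]
print(vif(uncorrelated))  # both values close to 1
print(vif(collinear))     # both values far above the threshold of 10
```

Under the threshold used in the text, the second toy data set would be flagged as collinear and the first would not.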
Moreover, the presence of redundant data can create problems in the modeling process, such
as increased training time, reduced model performance, and overfitting
(Wang and Chen 2019). The selection of key features is an efficient way to
overcome these difficulties. In this study, the Recursive Feature Elimination (RFE) method was
used as the feature selection (FS) method to identify key features. The RFE is a wrapper, model-based
approach in which the random forest model is applied as the estimator (Feng et al. 2017). It
is a backward selection method (Kuhn and Johnson 2013) whose main concept is
to eliminate the unimportant variables. After each run, the importance of the features
is calculated and the features with the lowest priority are removed from the modeling; this is
repeated until a single feature remains (Chen et al. 2015). The FS is done based only on the
training dataset (Wang and Chen 2019). The testing dataset (30%) is not used in the FS process
and is held out to test the groundwater potential modeling. A 10-fold cross-validation
method was used for feature selection based on the training dataset (70%). In each run, nine
folds of the training dataset were used to train the model and one fold was used to evaluate its
performance, and this process was repeated until all runs were finished. Feature selection was
performed using the Caret package (Kuhn 2015) within the R environment.
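The backward-elimination loop behind RFE can be sketched as follows. This is a language-neutral illustration in Python, not the Caret implementation: `rank_fn` and `score_fn` are hypothetical callbacks standing in for the random forest importance ranking and the cross-validated accuracy, and the toy importances and scores are invented.

```python
def recursive_feature_elimination(features, rank_fn, score_fn):
    """Drop the least important feature each round and track the score.

    rank_fn(subset)  -> dict mapping feature name to importance (hypothetical)
    score_fn(subset) -> cross-validated accuracy of that subset (hypothetical)
    Returns the best-scoring subset and the full (subset, score) history.
    """
    current = list(features)
    history = []
    while current:
        history.append((list(current), score_fn(current)))
        if len(current) == 1:
            break  # a single feature remains, as described in the text
        importance = rank_fn(current)
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    best_subset, _ = max(history, key=lambda item: item[1])
    return best_subset, history

# Toy stand-ins: fixed importances, and a score that is hurt by a noisy feature.
toy_importance = {"TPI": 0.9, "elevation": 0.7, "slope": 0.4, "aspect": 0.1}
useful = {"TPI", "elevation", "slope"}

def rank_fn(subset):
    return {f: toy_importance[f] for f in subset}

def score_fn(subset):
    return 0.5 + 0.1 * len(useful & set(subset)) - 0.05 * ("aspect" in subset)

best, history = recursive_feature_elimination(list(toy_importance), rank_fn, score_fn)
print(best)  # ['TPI', 'elevation', 'slope']
```

The score peaks at an intermediate subset size, which mirrors how the study selects the number of variables from the RFE accuracy profile.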
In this study, after feature selection, groundwater potential modeling was conducted using
the Boosting (Freund and Schapire 1997) and Bagging (Breiman 1996) methods. Model
training and parameter tuning were conducted with 10-fold cross-validation on
70% of the input data (the same data used for feature selection, excluding the redundant variables
identified by the RFE method). Two Boosting models, Adaptive Boosting Classification
Trees (AdaBoost) and Boosted Generalized Additive Model (GamBoost), and two
Bagging models, Random Forest (RF) and Bagged Classification and Regression
Trees (Bagged CART), were employed for this purpose. Parameter optimization was
conducted using the tuning function of the Caret R package (Kuhn 2015) with 10-fold
cross-validation resampling.
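A minimal sketch of how 10-fold cross-validation partitions the training points is given below. The generator name `k_fold_indices` and the seed are assumptions; in the study the resampling is handled internally by the Caret package.

```python
import random

def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed chosen arbitrarily here
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out

# 237 training points, as in the study's 70% subset.
splits = list(k_fold_indices(237, k=10))
print(sorted(len(held_out) for _, held_out in splits))
```

Each point is held out exactly once, so every tuning candidate is scored on data it was not fitted to.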
GamBoost (Hofner et al. 2016) is a boosted extension of the generalized additive
model (GAM), i.e., an ensemble of GAMs. The GAM applies a
separate transformation to each predictor variable and adds them together (Hastie and Tibshirani 2017);
these additive terms provide smooth functions that can fit a broad range of response curves (Sandman
et al. 2008). GAM is considered a flexible regression model that can handle multiple distribution
functions (Hofner et al. 2016). Boosting can be used to improve the prediction
accuracy and integrity of GAM models as well as to deal with overfitting (Mayr et al.
2012). GamBoost provides a novel fitting technique and performs well in promoting the most
important variables, particularly in high-dimensional spaces where variable selection is of major
importance (Hofner et al. 2016).
CART has been widely used for groundwater modeling, including groundwater potential
prediction, with acceptable performance (Duan et al. 2016). As CART is considered an
unstable model, the bagging technique can greatly improve its accuracy (Murphree et al.
2018). Bagged CART effectively decreases the prediction variance, which improves
classification performance and reduces overfitting. Thus, it is expected that using bagged
CART in the novel application of groundwater potential prediction can achieve promising
results.
2.4.4 RF Model
The random forest method is an ensemble learning method widely used for regression and classification.
Ho (1995) proposed RF based on the random subspace method to construct a multitude of
decision trees with controlled variance, in order to improve accuracy and address overfitting
during training. RF was later advanced and implemented as a package for bagging and feature
selection (Breiman 2001), consisting of an ensemble of independent classification trees and a
set of random samples. To build a model, about two-thirds of the data set is often devoted to creating
the decision trees (DTs), and the rest is used to evaluate the model accuracy, error, and further
performance. In the next step, the predictions of the DTs are aggregated, and the final class is
identified by the majority vote of all trees.
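The two ingredients mentioned above, bootstrap sampling and majority voting, can be illustrated with a short sketch. The function names are assumptions and no trees are actually fitted; the point is only that a bootstrap sample of n points contains roughly 63% distinct points on average, which is the source of the "about two-thirds in, one-third out" rule of thumb.

```python
import random
from collections import Counter

def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(predictions):
    """Class predicted by most trees for one location."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # arbitrary seed for reproducibility
in_bag, oob = bootstrap_sample(237, rng)
print(len(set(in_bag)) / 237)          # fraction of distinct points used by one tree
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

The out-of-bag points play the role of the held-back third: each tree can be evaluated on points it never saw during fitting.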
Model validation in this study was conducted using a hit-and-miss analysis on the 30% of the data
that had not been used in the training phase. The Accuracy, Kappa, Precision, and
Recall statistics were considered to validate the results (Eqs. 1–5) (Johnson and Olsen 1998; Stanski
et al. 1989):
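$$\mathrm{Accuracy} = \frac{H + CN}{H + FA + M + CN} \tag{1}$$

$$\mathrm{Kappa} = \frac{\mathrm{Accuracy} - P_e}{1 - P_e} \tag{2}$$

$$P_e = \frac{(H + FA)(H + M) + (M + CN)(FA + CN)}{(H + FA + M + CN)^2} \tag{3}$$

$$\mathrm{Precision} = \frac{H}{H + FA} \tag{4}$$

$$\mathrm{Recall} = \frac{H}{H + M} \tag{5}$$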
where H denotes the number of hits, FA the number of false alarms, M the
number of misses, and CN the number of correct negatives, all of which are taken from a
contingency table (Johnson and Olsen 1998; Stanski et al. 1989). Pe is the expected
agreement, which indicates how much agreement would be obtained by chance alone
(Beucher et al. 2017). Accuracy, Precision, and Recall vary between 0 and 1, while Kappa
can in principle fall below 0 when agreement is worse than chance; for all four statistics,
1 indicates perfect prediction.
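The five validation statistics can be computed directly from the four contingency counts. The sketch below follows the definitions given in the text; the function name and the example counts are hypothetical and are not the study's validation results.

```python
def contingency_metrics(h, fa, m, cn):
    """Accuracy, Kappa, Precision and Recall from a 2 x 2 contingency table.

    h: hits, fa: false alarms, m: misses, cn: correct negatives.
    """
    n = h + fa + m + cn
    accuracy = (h + cn) / n
    # Expected agreement by chance alone.
    pe = ((h + fa) * (h + m) + (m + cn) * (fa + cn)) / n ** 2
    kappa = (accuracy - pe) / (1 - pe)
    precision = h / (h + fa)
    recall = h / (h + m)
    return {"accuracy": accuracy, "kappa": kappa,
            "precision": precision, "recall": recall}

# Hypothetical counts for 100 validation points.
scores = contingency_metrics(h=40, fa=10, m=5, cn=45)
for name, value in scores.items():
    print(name, round(value, 3))
```

With these counts the expected agreement is 0.5, so an Accuracy of 0.85 corresponds to a Kappa of 0.7, which shows how Kappa discounts the agreement that chance alone would produce.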
Before feature selection, multicollinearity analysis (MA) was performed using the variance inflation
factor (VIF). The results indicated that there were no collinear variables (i.e., all of
the variables had a VIF of less than 10). The RFE results showed that, among the 15
variables, using 12 variables gives good performance for groundwater potential
prediction (Fig. 3). As can be seen from the box plots (Fig. 3), accuracy increases with
the number of variables up to 12 and then decreases. For 12 variables, the mean Accuracy
(red plus in Fig. 3) is 0.84 and the median Accuracy (the thin black line within the boxes
in Fig. 3) is 0.85 over the repeated model runs.
Therefore, 12 variables were retained for the modeling process based on
the RFE results. The occurrence frequency (%) of each feature in all model runs (1200 times)
was computed across cross-validation using RFE (Table 1). Landuse and topographic position
index appeared in 100% and 95.9% of the model runs, respectively.
Curvature, soil type, and aspect ranked 13th to 15th by occurrence
in the model runs and were identified as redundant variables. Therefore, in line with the optimum
number of variables identified by the RFE (Table 1), we excluded these variables from the input data.
Fig. 3 Performance of the RFE using different numbers of features as input
Table 1 The occurrence frequency (%) of features in the model runs using the RFE method
Feature                                Occurrence frequency (%)
Landuse                                100.0
Topographic position index (TPI)       95.9
Lithology                              86.5
Elevation                              77.0
Valley depth                           72.3
Topographic wetness index (TWI)        68.9
Drainage density (Dd)                  58.8
Distance from fault (Dff)              50.7
Distance from stream (Dfs)             46.6
Precipitation                          45.3
Topographic roughness index (TRI)      35.1
Slope                                  33.1
Curvature                              18.2
Soil type                              14.9
Aspect                                 7.4
The modeling results were analyzed by calculating the Accuracy, Kappa,
Precision, and Recall statistics (Table 2). The evaluation indicated that the
Accuracy values of the models are more than 80%. Also, according to the Kappa statistic
(Monserud and Leemans 1992), all of the models indicate good performance. Precision values
for the models vary between 0.82 and 0.85, and Recall is more than 0.85 (Table 2).
A comparison of the models’ performance indicated that the RF model had the highest
performance, followed by the Bagged CART, GamBoost, and AdaBoost models (Table 2).
Therefore, the evaluation results indicated that the Bagging models (i.e., RF and Bagged CART)
had a higher performance than the Boosting models (i.e., AdaBoost and GamBoost). Recently,
Wang and Chen (2019) demonstrated that RF (as a Bagging model) outperforms
AdaBoost (as a Boosting model) in simulating oil well productivity in unconventional
formations. In another study, Alotaibi and Sasi (2016) reported the same performance for
the AdaBoost and RF models in predicting intensive care unit transfer. In other studies, the
reasons for the success of the RF (Liaw and Wiener 2002; Thuiller and Lafourcade 2009) are
given as (i) the ability to output an unbiased estimate of the generalization error, (ii) no need for
pre-analysis to select variables among a large number of predictors, (iii) the possibility to use
categorical and numerical variables as predictors, (iv) the ability to evaluate non-linear interactions
between variables, and (v) increased diversity of the classification trees through the random
selection of predictive variables across the different trees.
Neither Bagging nor Boosting is inherently superior; their relative performance can depend on the
data and variables in the modeling process. However, with regard to over-fitting problems,
Bagging is the better option, as Boosting does not help avoid over-fitting (Freund and Schapire
1997; Quinlan 1996). Another possible explanation for this result is that the initial model
choice in Boosting is weaker than in Bagging (Lemmens and Croux 2006).
After evaluating the models, groundwater potential maps were predicted using the 12 predictors
identified by the RFE method. The pixel values of the predictors for the whole study area were used
as input to the trained models. The probability (P) of
groundwater potential predicted by each model was then classified into 5 classes,
namely very low (P = 0 − 0.2), low (P = 0.2 − 0.4), moderate (P = 0.4 − 0.6), high (P = 0.6 −
0.8), and very high (P = 0.8 − 1), by the equal interval method (Fig. 4).
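The equal interval classification of the predicted probability can be written as a small lookup. This is a sketch: the function name is an assumption, and so is the handling of values that fall exactly on a class boundary (assigned here to the upper class), which the text does not specify.

```python
from bisect import bisect_right

BREAKS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["very low", "low", "moderate", "high", "very high"]

def potential_class(p):
    """Map a predicted probability in [0, 1] to one of five equal-width classes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # Boundary values go to the upper class (an assumption made here).
    return LABELS[bisect_right(BREAKS, p)]

for p in (0.05, 0.35, 0.5, 0.72, 0.95):
    print(p, potential_class(p))
```

Applying this function to every pixel's probability yields the five-class map, from which the class areas can be obtained by counting pixels and multiplying by the pixel area.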
The AdaBoost model assigned the smallest area (220.27 km²) to the very low class compared with
the other models, while GamBoost assigned the largest (845.54 km²). In contrast, the low,
moderate, and high classes of the AdaBoost model have the largest areas (917.91, 619.22,
and 307.03 km², respectively) in comparison with the other models. For all of the models, the
very high class covers a smaller area than the other classes, equal to 25.10, 173.46, 113.11, and
185.37 km² for the AdaBoost, GamBoost, RF, and Bagged CART models, respectively (Fig. 4).
Overall, the very low and very high classes of the AdaBoost model have the smallest areas, while
for the GamBoost and Bagged CART models the class area decreases from the very low class
towards the very high class. In the RF model, the largest areas belong to the low
and very low classes (641.52 and 615.06 km², respectively). From the moderate to the very
high class, the area decreases (Fig. 4).
The importance of the variables in the modeling process was evaluated through the percent decrease in
Accuracy (Table 3). According to the results, the most important variables were TPI and valley
depth, with accuracy reductions of 31.9% and 27.3%, respectively. Other variables such as
drainage density, elevation, and distance from stream ranked next, with decreases in
Accuracy of 26.5%, 26.4%, and 26.3%, respectively (Table 3).
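One common way to obtain a "percent decrease in Accuracy" is permutation importance: the values of one predictor are shuffled and the resulting drop in accuracy is recorded. The text does not state the exact procedure used, so the sketch below is an assumption about the general idea, with a hypothetical threshold model and invented data.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model output equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean percent decrease in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)  # arbitrary seed for reproducibility
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(model, shuffled, y)
        drops.append(100.0 * total / n_repeats)
    return drops

# Hypothetical model that only looks at the first predictor.
def toy_model(row):
    return int(row[0] > 0.5)

X = [[0.1, 5.0], [0.2, 3.0], [0.3, 9.0], [0.4, 1.0],
     [0.6, 4.0], [0.7, 8.0], [0.8, 2.0], [0.9, 7.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(permutation_importance(toy_model, X, y))
```

Shuffling the second predictor leaves the accuracy unchanged because the toy model ignores it, whereas shuffling the first predictor lowers it; ranking predictors by this drop gives an importance ordering of the kind reported in Table 3.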
Fig. 4 Groundwater potential map predicted by (a) AdaBoost, (b) GamBoost, (c) RF, and (d) Bagged CART
models
Table 3 Importance, mean values (or prominence category) of the variables by the RF model
A diagnostic analysis was carried out to explore the complexity of the groundwater potential
classes and their dependencies on the predictive variables. The mean values of the predictive
variables in each groundwater potential class (GPC) predicted by the RF model were computed
(Table 3). For categorical variables, the prominent category in each GPC is presented. As can be
seen, groundwater potential increases as the TPI decreases: the very high
GPC has a mean TPI of −1.82, whereas the very low GPC has a mean TPI of 1.69.
Therefore, locations with lower TPI (indicating lower elevations than their surroundings)
indicate higher groundwater potential. This matches the pattern for valley depth, where
groundwater potential increases with increasing depth: the mean valley depths of the very high
and very low GPCs are 255.2 m and 100.8 m, respectively. Regarding drainage
density, groundwater potential increases with increasing density, as the very
high GPC has a higher density (Dd = 0.46) and the very low GPC a lower one (Dd = 0.11).
The connection between rivers and groundwater can be a reason for this. The mean values of
elevation indicate that higher groundwater potential does not exactly follow lower
elevations, as the moderate GPC has a lower elevation (2103.7 m), followed by the high and
very high GPCs (2115.3 m and 2171.6 m, respectively). With increasing distance from
streams, the groundwater potential decreases. The mean distance for the higher GPCs is
about 500 m, whereas for the lower GPCs it is about 2298 m. Also, the variation of the mean
slopes across the GPCs indicates that lower slopes have higher groundwater potential (Manap
et al. 2013), and vice versa. It is interesting to note that a lower distance from fault
indicates higher groundwater potential. Regarding precipitation and TWI, the
very high GPC has higher precipitation (681.6 mm) and TWI (12.63) than the other
GPCs. The TRI indicates that higher groundwater potential corresponds to lower surface
roughness. Concerning lithology, the prominent lithology for the very high GPC is
the Eja unit of the Jahrum formation, which mostly consists of limestone and
dolomite. This is consistent with the high groundwater storage capacity of these lithologies
due to their high fracture porosity (Decker et al. 1998; Ashraf et al. 2018). Also, the
prominent land use for the very high GPC (about 57.3% of this class) is
agriculture (Table 3). A possible reason for this may be recharge from the
irrigated area; however, the importance of land use in the modeling process was
lower than that of the other variables (Table 3).
4 Conclusion
The current study set out to investigate the performance of four ensemble models,
two Boosting models, i.e., AdaBoost and GamBoost, and two Bagging models, i.e., RF
and Bagged CART, for predicting groundwater potential zones. The study found that the
Bagging models had higher performance than the Boosting models. The variables TPI, valley
depth, drainage density, elevation, and distance from stream were the most significant
contributors to the modeling process. The major limitation of this study was the lack of a
detailed soil map for the study area. Because soil has a great influence on the infiltration process,
the use of other soil characteristics, such as soil texture at a detailed scale, could increase the
modeling accuracy. This study was also limited by the lack of information on groundwater
productivity characteristics such as transmissivity and specific capacity. It is recommended
that the association of these factors be investigated in future studies where these data are
available. Notwithstanding these limitations, the groundwater potential maps predicted in this
study can help water resources managers and policymakers in the fields of watershed and
aquifer management to maintain optimal exploitation of this important freshwater resource.
References
Agarwal R, Garg PK (2016) Remote sensing and GIS based groundwater potential & recharge zones mapping
using multi-criteria decision making technique. Water Resour Manag 30:243–260
Al-Abadi AM, Shahid S (2015) A comparison between index of entropy and catastrophe theory methods for
mapping groundwater potential in an arid region. Environ Monit Assess 187(9):576
Alotaibi NN, Sasi S (2016). Tree-based ensemble models for predicting the ICU transfer of stroke in-patients. In
2016 International Conference on Data Science and Engineering (ICDSE). IEEE, Piscataway, pp 1–6
Aniya M (1985) Landslide-susceptibility mapping in the Amahata river basin, Japan. Ann Assoc Am Geogr
75(1):102–114
Ashraf MAM, Yusoh R, Sazalil MA, Abidin MHZ (2018) Aquifer Characterization and groundwater potential
evaluation in sedimentary rock formation. In Journal of Physics: Conference Series, vol 995, No. 1. IOP
Publishing, Bristol, p 012106
Beucher A, Møller AB, Greve MH (2017) Artificial neural networks and decision tree classification for
predicting soil drainage classes in Denmark. Geoderma 320:30–42
Breiman L (1996) Bagging predictors. Mach Learn 24:123–40
Breiman L (2001) Random forests. Mach Learn 45:5–32
Chatterjee S, Hadi AS, Price B (2000) Regression analysis by example (3rd ed.). Wiley, Hoboken. ISBN 978-0-
471-31946-7
Chen W, Yeo CK, Lau CT, Lee BS (2015) Real-time twitter content polluter detection based on direct features. In 2015
2nd International Conference on Information Science and Security (ICISS). IEEE, Piscataway, pp 1–4
Chen W, Li H, Hou E, Wang S, Wang G, Panahi M, Li T, Peng T, Guo C, Niu C, Xiao L, Wang J, Xie X,
Ahmad BB (2018) GIS-based groundwater potential analysis using novel ensemble weights-of-evidence
with logistic regression and functional tree models. Sci Total Environ 634:853–67
Chowdhury A, Jha MK, Chowdary VM (2010) Delineation of groundwater recharge zones and identification of
artificial recharge sites in West Medinipur district, West Bengal, using RS, GIS and MCDM techniques.
Environ Earth Sci 59(6):1209
Conrad O, Olaya V (2012) SAGA-GIS module library documentation (v2.2.3). Module Valley Depth. Available
online: http://www.sagagis.org/saga_tool_doc/2.2.3/index.html
Das S (2019) Comparison among influencing factor, frequency ratio, and analytical hierarchy process techniques
for groundwater potential zonation in Vaitarna basin, Maharashtra, India. Groundw Sustain Dev 8:617–29
Decker K, Heinrich M, Klein P, Kociu A, Lipiarski P, Pirkl H, Rank D, Wimmer H (1998) Karst springs,
groundwater and surface runoff in the calcareous Alps: assessing quality and reliance of long-term water
supply. IAHS Publ Ser Proc Rep Intern Assoc Hydrol Sci 248:149–156
Duan H, Deng Z, Deng F, Wang D (2016) Assessment of groundwater potential based on multicriteria decision
making model and decision tree algorithms. Math Probl Eng. https://doi.org/10.1155/2016/2064575
Feng C, Cui M, Hodge BM, Zhang J (2017) A data-driven multi-model methodology with deep feature selection
for short-term wind forecasting. Appl Energy 190:1245–1257
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to
boosting. J Comput Syst Sci 55:119–139
Gebre T, Ahmad I, Dar MA, Gadissa E, Teka AH, Tolosa AT, Brhane ES (2018) Mapping of groundwater
potential zones using remote sensing and geographic information system: A case study of parts of Tigray,
Ethiopia. Environ Geosci 25:133–40
Gnanachandrasamy G, Zhou Y, Bagyaraj M, Venkatramanan S, Ramkumar T, Wang S (2018) Remote sensing
and GIS based groundwater potential zone mapping in Ariyalur District, Tamil Nadu. J Geol Soc India 92:
484–490
Hassan ZU, Kanth TA, Malik MI (2018) Groundwater potential zonation and prioritization of wular catchment of
Kashmir using GIS based multi-criteria evaluation approach. Water Energy Int 60RNI:49–61
Hastie TJ, Tibshirani RJ (2017) Generalized additive models. CRC Press, Boca Raton
Ho TK (1995) Random decision forests. In Proceedings of the International Conference on Document Analysis
and Recognition, ICDAR. IEEE Computer Society, Washington, D.C., pp 278–82
Hofner B, Mayr A, Schmid M (2016) GamboostLSS: An R package for model building and variable selection in
the GAMLSS framework. J Stat Softw 74(1):1–31
Johnson LE, Olsen BG (1998) Assessment of quantitative precipitation forecasts. Weather Forecast 13(1):75–83
Kalantar B, Pradhan B, Naghibi SA, Motevalli A, Mansor S (2018) Assessment of the effects of training data
selection on the landslide susceptibility mapping: a comparison between support vector machine (SVM),
logistic regression (LR) and artificial neural networks (ANN). Geomatics Nat Hazards Risk 9(1):49–69
Kordestani MD, Naghibi SA, Hashemi H, Ahmadi K, Kalantar B, Pradhan B (2019) Groundwater potential
mapping using a novel data-mining ensemble model. Hydrogeol J 27:211–224
Kuhn M (2015) Caret: classification and regression training. Astrophysics Source Code Library.
http://adsabs.harvard.edu/abs/2015ascl.soft05003K
Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York
Lee S, Hong SM, Jung HS (2018) GIS-based groundwater potential mapping using artificial neural network and
support vector machine models: the case of Boryeong city in Korea. Geocarto Int 33(8):847–861
Lemmens A, Croux C (2006) Bagging and boosting classification trees to predict churn. J Mark Res 43(2):276–
286
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Manap AM, Sulaiman WN, Ramli MF, Pradhan B, Surip N (2013) A knowledge-driven GIS modeling technique
for groundwater potential mapping at the Upper Langat Basin, Malaysia. Arab J Geosci 6(5):1621–1637
Mayr A, Fenske N, Hofner B, Kneib T, Schmid M (2012) Generalized additive models for location, scale and shape
for high dimensional data-a flexible approach based on boosting. J R Stat Soc Ser C Appl Stat 61:403–27
Miraki S, Zanganeh SH, Chapi K, Singh VP, Shirzadi A, Shahabi H, Pham BT (2019) Mapping groundwater
potential using a novel hybrid intelligence approach. Water Resour Manag 33(1):281–302
Monserud RA, Leemans R (1992) Comparing global vegetation maps with the Kappa statistic. Ecol Model
62(4):275–293
Motevalli A, Naghibi SA, Hashemi H, Berndtsson R, Pradhan B, Gholami V (2019) Inverse method using
boosted regression tree and k-nearest neighbor to quantify effects of point and non-point source nitrate
pollution in groundwater. J Clean Prod 228:1248–1263
Murphree DH, Arabmakki E, Ngufor C, Storlie CB, McCoy RG (2018) Stacked classifiers for individualized
prediction of glycemic control following initiation of metformin therapy in type 2 diabetes. Comput Biol
Med 103:109–115
Naghibi SA, Dolatkordestani M, Rezaei A, Amouzegari P, Heravi MT, Kalantar B, Pradhan B (2019)
Application of rotation forest with decision trees as base classifier and a novel ensemble model in spatial
modeling of groundwater potential. Environ Monit Assess 191(4):248
Nampak H, Pradhan B, Manap MA (2014) Application of GIS based data driven evidential belief function model
to predict groundwater potential zonation. J Hydrol 513:283–300
Prasad RK, Mondal NC, Banerjee P, Nandakumar MV, Singh VS (2008) Deciphering potential groundwater
zone in hard rock through the application of GIS. Environ Geol 55(3):467–475
Quinlan JR (1996) Bagging, boosting, and C4.5. AAAI/IAAI 1:725–730
Sachdeva S, Kumar B (2020) A comparative study between frequency ratio model and gradient boosted decision
trees with greedy dimensionality reduction in groundwater potential assessment. Water Resour Manag.
https://doi.org/10.1007/s11269-020-02677-3
Sameen MI, Pradhan B, Lee S (2019) Self-learning random forests model for mapping groundwater yield in data-
scarce areas. Nat Resour Res 28:757–775
Sandman A, Isaeus M, Bergström U, Kautsky H (2008) Spatial predictions of Baltic phytobenthic communities:
Measuring robustness of generalized additive models based on transect data. J Mar Syst 74:S86–S96
Sidle RC, Ochiai H (2006) Landslides: Processes, prediction, and land use. Water Resources Monogr 18.
American Geophysical Union, Washington, D.C
Songara JC, Joshipura NM, Mehmood K, Prakash I (2015a) Assessment and management of watershed of
Machhu Dam III, Morbi, Gujarat using geoinformatics technology. Int J Adv Eng Res Dev
Songara JC, Kadivar HT, Joshipura NM, Prakash I (2015b) Estimation of surface runoff of Machhu Dam III
Chatchment Area, Morbi, Gujarat, India, using curve number method and GIS. Int J Sci Res Dev 3(3):2038–
2043
Stanski HR, Wilson LJ, Burrows WR (1989) Survey of common verification methods in meteorology. World
Weather Watch Technical Report No. 8, TD No. 358, World Meteorological Organization, Geneva, 114 pp
Thuiller W, Lafourcade B (2009) BIOMOD: species/climate modelling functions. R Package Version 1.1-3/r118
Wang S, Chen S (2019) Insights to fracture stimulation design in unconventional reservoirs based on machine
learning modeling. J Petrol Sci Eng 174:682–695
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Affiliations
Amirhosein Mosavi
[email protected]
1 Environmental Quality, Atmospheric Science and Climate Change Research Group, Ton Duc Thang
University, Ho Chi Minh City, Vietnam
2 Faculty of Environment and Labour Safety, Ton Duc Thang University, Ho Chi Minh City, Vietnam
3 Reclamation of Arid and Mountainous Regions Department, Faculty of Natural Resources, University of
Tehran, Karaj, Iran
4 Soil Conservation and Watershed Management Research Department, West Azarbaijan Agricultural and
Natural Resources Research and Education Center, AREEO, Urmia, Iran
5 Soil Conservation and Watershed Management Research Institute (SCWMRI), AREEO, Tehran, Iran
6 Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam
7 Faculty of Natural Resources, University of Jiroft, Kerman, Iran