Groundwater Quality Assessment Using Random Forest Method
All content following this page was uploaded by Hossein N. on 04 November 2020.
ORIGINAL PAPER
Abstract
The Miandoab plain aquifer, with an area of 1150 km², supplies a significant portion of the agricultural and drinking water demands of the area. In recent years, it has faced a significant decline in water level accompanied by a deterioration of groundwater quality. Therefore, a scientific study of the groundwater resources in the study area is necessary for quantitative and qualitative management. One of the important indicators for assessing and zoning groundwater quality is the groundwater quality index (GQI), which is determined by measuring ion concentrations and relating them to reliable standards. For this purpose, in October 2018, 75 water samples were collected from the groundwater resources of the Miandoab plain aquifer and chemically analyzed. To minimize the uncertainties, a fuzzy groundwater quality index (FGQI) was obtained by fuzzification of the GQI method. In addition, the random forest (RF) algorithm, a learning method based on an ensemble of decision trees, was used for the assessment of groundwater quality. The RF technique has advantages over other methods owing to its high prediction accuracy, its ability to learn nonlinear relationships, and its ability to determine the important variables in the prediction. In the validation and comparison of the methods, the fuzzy groundwater quality index method, with higher accuracy, was identified as the more reliable method for evaluating groundwater quality for drinking purposes. Based on the RFGQI results, 20, 16, 15, 26, and 23% of the Miandoab plain aquifer have suitable, acceptable, moderate, unsuitable, and absolutely unsuitable groundwater quality, respectively. Overall, the results of this study showed that the random forest method can be used as a reliable method for assessing groundwater vulnerability and for investigating, managing, and monitoring aquifers.
Keywords Fuzzy . Groundwater . Random forest . Miandoab plain aquifer . World Health Organization
Arab J Geosci (2020) 13:912

fluctuations as well as exploitation in different seasons of the year. On the other hand, the standards by the related organizations do not have good certainty.

Given that all major ions are involved in the quality of drinking water, it is important to obtain a criterion in which the effects of all these ions are taken into account. For this purpose, Babiker et al. (2007) introduced the groundwater quality index (GQI). The GQI uses a statistical methodology to translate water quality parameters into a new index consistent with the WHO standards. In this method, several parameters that have more importance in groundwater quality were combined in GIS. Most of the studies available in the literature discussed the application of methods and groundwater quality indices for groundwater quality assessment (Zhang et al. 2019; Xu et al. 2019; Adimalla and Li 2019; Al-Hadithi 2012; Sener et al. 2017; Chapman 1996; Jamshidzadeh and Barzi 2018). Many methods were used to identify groundwater quality based on the GQI, such as the fuzzy groundwater quality index (FGQI) method by Vadiati et al. (2016). Nejatijahromi et al. (2019) used hydrochemical techniques to evaluate groundwater nitrate contamination. Kumar and Sangeetha (2020) used the water quality index (WQI) and geospatial techniques to evaluate the groundwater quality around Madurai City by using parameter analysis. In an agricultural region, the WQI was used to evaluate groundwater quality for drinking purposes and human health risk (HHR) (Adimalla and Qian 2019). In Malaysia, a time- and cost-effective approach was used to predict the water quality index class: a decision tree machine learning technique was used to predict the WQI for the Klang River and its classification within a specific water quality class. The results showed that the proposed prediction model has promising potential to predict the class of the WQI and offers a more efficient and cost-effective approach for the computation and prediction of the WQI (Yung et al. 2019).

In this study, the random forest (RF) and fuzzy logic methods were used to identify groundwater quality based on the GQI in the Miandoab plain aquifer, which is located in NW Iran. Random forest was presented by Breiman in 2001 and is an ensemble method that combines multiple decision tree algorithms (Breiman 2001). Recently, Yajima and Derot (2018) used the random forest model for chlorophyll-a forecasting in fresh and brackish water bodies in Japan, using multivariate long-term databases. Sihag et al. (2019) successfully implemented the RF method for the estimation of unsaturated hydraulic conductivity. Random forest is being applied increasingly in land cover classification from remotely sensed data (Pal 2005; Rodriguez et al. 2012; Sesnie et al. 2008) and in other fields related to the environment and water resources (Norouzi et al. 2016; Booker and Snelder 2012; Herrera et al. 2010; Huang et al. 2011).

The Miandoab area is located in the Urmia Lake basin (NW Iran), which in recent years has been affected by a sharp decline in groundwater quality due to excessive evaporation and the reduction of rainfall. Considering the importance of groundwater in the Miandoab region, which is also used for drinking, in the present study, areas suitable or unsuitable for drinking use were investigated with random forest and fuzzy logic methods based on the groundwater quality index.

Materials and methods

Study area

The Miandoab plain aquifer, with an area of approximately 1150 km², is located south of Urmia Lake and is part of the Alborz-Azarbayjan structural zone. The average annual rainfall based on 30-year (1989–2018) data from the Malekan and Miandoab synoptic stations, located in the plain and mountainous area, is 267 and 325 mm, respectively, and the mean annual rainfall in the study area is about 284 mm. According to the Emberger empirical method (Emberger 1952), the region has a cold and semi-arid climate. Figure 1 shows the geographic location of the study area. Zarine Roud, Simineh Roud, Murdouchai, and Leilanchai are the most important surface drains of the Miandoab plain; the surface water of these rivers flows through the plain to Urmia Lake.

Groundwater flow encounters geological formations with diverse lithology. This changes the quality of the water and increases the amount and type of soluble materials, depending on the lithology of the rocks, the geological formations, and the water retention time. The Miandoab region has different geological formations (Norouzi et al. 2018b). The study area is made up of multiple geological formations (Fig. 2), including (1) the Lar Formation of the Jurassic period, with limestone and dolomite lithology, in the northeastern and northern parts of the region; (2) the Shemshak Formation of the Jurassic period, with sandstone and olivine green shale lithology, in the northeastern and eastern parts; (3) the Maragheh Formations, containing pyroclastic and claystone of the Permian, in the northeast of the region; (4) the Mila and Rizo Formations of the Cambrian period, with sandstone, quartzite, and shale horizons, in the south and southwest; (5) volcanic andesite of the Eocene in the eastern part of the region; and (6) limestone of the Cretaceous period in the eastern portion (Shahrabi 1972).

The Miandoab plain aquifer is unconfined. It consists of alluvial fans, old and recent alluvial terraces, and fluvial sediments. The thickness of the aquifer varies from a few meters at the margins of the plain to over 75 m around Miandoab City. In the outlet areas of the plain and toward Urmia Lake, due to the frequency of fine-grained deposits, the number of aquifers increases to three layers (Norouzi et al. 2018a). In these parts, the upper aquifer, which is up to 30 m deep, is unconfined, and below it there are two semi-confined aquifers with a thickness of 20 to 30 m. The separating layers of these aquifers are about 10 to 20 m thick; toward the center of the plain they interlace with coarse-grained alluvial deposits, and in the center of the plain their thickness reaches zero. The transmissivity and the storage coefficient are two important parameters in the assessment of aquifer properties, indicating the transmission capability and the storage capacity of the aquifer, respectively. Transmissivity in most parts of the plain is less than 400 m²/day, and in relatively large areas at the outlet of the plain, it is less than 50 m²/day. Also, the average storage coefficient values for the entire plain are 1.8 to 3.1% (EARWO 2014). Groundwater flow direction is from the southeast toward the northwest (Fig. 3). Also, the groundwater level is high in the southeast and northeast of the aquifer, and toward the outlet of the aquifer, it becomes low. The cumulative mean of groundwater level changes in the Miandoab plain is shown in Fig. 4.

Fig. 3 Groundwater flow direction and groundwater level in the Miandoab plain aquifer

Quality indicators

The World Health Organization (WHO), as the highest international institution for water quality control, has provided guidelines for various pollutants in drinking water. A guideline value is the concentration of a component that does not pose a serious risk to consumer health over a lifetime of consumption (WHO 2008). Considering that not all pollutants, especially chemical pollutants, exist in all water resources or all countries, the WHO recommends that factors such as the type of material, the geographical and geological conditions of the area, and the type of human activity be considered in the development of national standards.

Calculation of groundwater quality index

To calculate the GQI, raster maps were first prepared in ArcGIS for each of the ten chemical parameters using the Kriging interpolation method. In the next step, to unify the different scales of the maps, the concentration of each pixel (K) of the raster maps created in the previous step was related to the WHO standard value of that parameter (K_WHO) using the following formula:

$$C = \frac{K - K_{\mathrm{WHO}}}{K + K_{\mathrm{WHO}}} \tag{1}$$

Unifying the scales produces new maps whose pixel values vary between −1 and 1. The values in these maps are then ranked between 1 and 10 to obtain a rank map for each parameter. In these maps, rank 1 indicates good groundwater quality and rank 10 indicates degraded groundwater quality. In this conversion, the value −1 in the map generated in the previous step should be changed to 1, the value 0 to 5, and the value 1 to 10 in the rank map. For this purpose, the following equation, which is a polynomial function, is used to convert the value of each pixel (C) to a new value (r):

$$r = 0.5\,C^{2} + 4.5\,C + 5 \tag{2}$$

In the final step, to create a map that represents all ten chemical parameters and shows the general status of the plain's water quality in comparison with the WHO standard, the parameter layers are combined using the groundwater quality index (GQI) based on the following equation:

$$\mathrm{GQI} = 100 - \frac{r_1 w_1 + r_2 w_2 + \dots + r_n w_n}{n} \tag{3}$$

In this formula, r is the rank of each pixel of the ranked maps, w is the relative weight of each parameter, which is equal to the average value of all pixels of the corresponding ranked map, and n is the number of parameters. In effect, the GQI is a weighted average over the different parameters, in which parameters with higher values (above the standard values) have higher relative weights and therefore more effect. Because the toxicity of different elements to humans differs, if one or more elements are more toxic than the others, the proposed formula must be calibrated and corrected (Hiyama and Hu 2003).

Parameter selection

In groundwater quality assessment, the parameter selection depends on the purpose of the assessment and the ability of an organization to collect and analyze the groundwater samples. The selection should include the most important parameters related to the groundwater quality background (Vadiati et al. 2016).

Based on the parameter anomalies impacting human health, in the present study, nine key parameters, namely TDS, Cl⁻, Ca²⁺, Mg²⁺, SO₄²⁻, HCO₃⁻, K⁺, Na⁺, and NO₃⁻, were considered in the drinking water quality assessment using the FGQI and random forest groundwater quality index (RFGQI) methods. The GQI values range from 0 to 100 and are categorized into five classes. The output membership functions of the FGQI and RFGQI models are categorized based on the GQI classes: absolutely unsuitable (0–25); unsuitable (25–50); moderate (50–70); acceptable (70–90); and suitable or good (90–100).
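The GQI calculation of Eqs. 1–3 and the five quality classes can be illustrated with a small sketch. It is not the ArcGIS workflow used in the study: each raster is represented as a plain list of pixel values, the function names and the example concentrations are hypothetical, and the handling of class boundaries (lower bound inclusive) is an assumption, since the text does not say to which class a value such as 50 belongs.

```python
def normalized_difference(k, k_who):
    """Eq. 1: C = (K - K_WHO) / (K + K_WHO); lies in [-1, 1] for K >= 0."""
    return (k - k_who) / (k + k_who)


def rank(c):
    """Eq. 2: r = 0.5*C^2 + 4.5*C + 5, so C = -1, 0, 1 gives r = 1, 5, 10."""
    return 0.5 * c ** 2 + 4.5 * c + 5.0


def gqi(rank_maps):
    """Eq. 3 applied pixel by pixel.

    rank_maps holds one list of pixel ranks per parameter; the weight of a
    parameter is the mean rank of its own map, as described in the text.
    """
    n = len(rank_maps)
    weights = [sum(m) / len(m) for m in rank_maps]
    n_pixels = len(rank_maps[0])
    return [
        100.0 - sum(rank_maps[i][p] * weights[i] for i in range(n)) / n
        for p in range(n_pixels)
    ]


def gqi_class(value):
    """Five GQI classes; lower bounds are treated as inclusive (assumption)."""
    if value >= 90:
        return "suitable"
    if value >= 70:
        return "acceptable"
    if value >= 50:
        return "moderate"
    if value >= 25:
        return "unsuitable"
    return "absolutely unsuitable"


# Hypothetical three-pixel rasters (mg/l) with the WHO limits from Table 1.
concentrations = {"TDS": ([300.0, 600.0, 1800.0], 600.0),
                  "NO3": ([10.0, 50.0, 150.0], 50.0)}
rank_maps = [[rank(normalized_difference(k, limit)) for k in pixels]
             for pixels, limit in concentrations.values()]
gqi_values = gqi(rank_maps)
for value in gqi_values:
    print(round(value, 2), gqi_class(value))
```

Because the weight of each parameter is the mean of its own rank map, a parameter that exceeds its limit over much of the area pulls the index down everywhere, not only in the pixels where it is high.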
[Fig. 4 Cumulative mean of groundwater level changes in the Miandoab plain (x-axis: year, 2005–2018)]
Calculation of fuzzy groundwater quality index

The groundwater system is generally confronted with uncertainties. Investigating and deciding on the quality of water based on the information obtained at all stages, from sampling to analysis and interpretation of the results, is subject to a variety of uncertainties (Norouzi et al. 2018a). Due to the uncertainty associated with measurement in the sampling and analysis stages, the use of classical methods does not seem appropriate in assessing the quality of drinking water and even agricultural water. Different methods and criteria have been presented for decision-making and assessment of drinking and agricultural water quality by the fuzzy method. By designing a suitable fuzzy model, uncertainty in the sampling, measurement, and interpretation of water quality can be eliminated (Liu et al. 2003).

Fuzzy modeling comprises three methods: the Mamdani (MFL), Sugeno (SFL), and Larsen (LFL) fuzzy methods (Zadeh 1965). The Sugeno method differs from the other two in its output: in this approach, the output membership function is linear or constant and is obtained by a clustering method (Nadiri et al. 2019). The first step in creating a fuzzy model is data categorization, for which different methods can be used depending on the fuzzy model. A subtractive clustering method was used for the Sugeno model and the Fuzzy C-Means (FCM) method for the Mamdani and Larsen methods. Each fuzzy model consists of three main steps: (a) fuzzification of the data, which is performed by defining membership functions; (b) establishing the connection between input and output, which is done by a series of if-then rules; and (c) the system's review, aggregation, and defuzzification, in which the results obtained as fuzzy sets are converted into numerical values. In this study, the Sugeno fuzzy method was used. The efficiency of fuzzy systems depends on the number of membership functions for each input and the corresponding numerical data (Amalraj and Pius 2018; Vadiati et al. 2016). There are many different shapes of membership functions, and the choice depends on the problem to be solved (Klir and Yuan 1995). Linear interpolation is used to obtain both endpoints of the interval, and simple trapezoidal membership functions work well in most applications (Barua et al. 2014). The shapes of the membership functions and the parameters of the fuzzy rules were adjusted to obtain an optimal fuzzy system. Based on the nature of the physicochemical data, triangular and trapezoidal membership functions were chosen to represent the unacceptable, acceptable, and desirable fuzzy sets using the WHO standard limits.

In fuzzy models, the membership functions and the number of input parameters determine the number of rules. The complexity of the model increases with the number of fuzzy rules. Removing less important or redundant fuzzy rules from the rule base can result in a compact fuzzy model with better generalizing ability and a simpler system architecture (Yen and Wang 1999). The developed fuzzy models consisted of ten input parameters, with each parameter consisting of three membership functions. Six important rules were selected for constructing the fuzzy rule base, based on the experience of experts and the available datasets:

(1) If Mg²⁺ is desirable, Ca²⁺ is desirable, Cl⁻ is desirable, Na⁺ is desirable, SO₄²⁻ is desirable, NO₃⁻ is desirable, and TDS is desirable, then FGQI is excellent.
(2) If Mg²⁺ is desirable, Ca²⁺ is desirable, Cl⁻ is desirable, Na⁺ is desirable, SO₄²⁻ is desirable, NO₃⁻ is desirable, and TDS is acceptable, then FGQI is good.
(3) If Ca²⁺ is unacceptable, Na⁺ is acceptable, Cl⁻ is acceptable, Mg²⁺ is acceptable, SO₄²⁻ is unacceptable, NO₃⁻ is acceptable, and TDS is unacceptable, then FGQI is poor.
(4) If Ca²⁺ is acceptable, Mg²⁺ is acceptable, Na⁺ is desirable, Cl⁻ is acceptable, NO₃⁻ is desirable, SO₄²⁻ is desirable, and TDS is acceptable, then FGQI is acceptable.
(5) If Ca²⁺ is unacceptable, Mg²⁺ is desirable, Na⁺ is desirable, Cl⁻ is desirable, NO₃⁻ is desirable, and SO₄²⁻ and TDS are unacceptable, then FGQI is low.
(6) If Ca²⁺ is acceptable, Mg²⁺ is desirable, Na⁺ is desirable, Cl⁻ is desirable, NO₃⁻ is acceptable, SO₄²⁻ is desirable, and TDS is acceptable, then FGQI is high.

In this study, for applying the FGQI and RFGQI methods, the GQI values range from 0 to 100 and are categorized into five classes. The output membership functions of the FGQI and RFGQI models are categorized based on the GQI classes: absolutely unsuitable (0–25); unsuitable (25–50); moderate (50–70); acceptable (70–90); and suitable or good (90–100).

Random forest groundwater quality index

In recent years, machine learning and other new methods have been widely used in the study of water resources; many of these methods utilize ensemble regressions. One type of machine learning method that uses basic algorithms for repeated multiple predictions is called random forest (Breiman 2001; Friedl et al. 1999). Random forest (RF) was introduced by Breiman in 2001 as a way of building an ensemble that combines the predictions of multiple single rule-based decision tree algorithms. The general principle of ensemble techniques is the assumption that their accuracy is higher than that of other learning algorithms because a combination of several prediction models is more accurate than a single model (Quinlan 1986). Ensembles increase the strength of the individual classifiers while reducing their weaknesses at the same time (Kotsiantis and Pintelas 2004). RF is an ensemble method that combines several decision tree algorithms to generate a repeated prediction of each phenomenon. RF can learn complex patterns and consider the nonlinear relationship between explanatory variables and dependent variables (Norouzi and Shahmohammadi-Kalalagh 2019). It can also incorporate and combine different types of data in the analysis, because it makes no assumption about the distribution (e.g., normality) of the data. RF accepts and runs thousands of input variables without deleting any of them, and it can also determine which variables are important in the prediction (Rodriguez et al. 2012). RF is less sensitive than artificial neural networks to local minima and outlier data and can give a better estimate of the parameters. The RF method evaluates the relative importance of the variables and is able to select important variables; in addition, parameterization in the RF method is easier than in other methods such as neural networks (Rodriguez et al. 2012). In this research, the random forest algorithm, which is a learning method based on an ensemble of decision trees, is proposed to overcome the problems of the base learners. The random forest method is a development of the Bagging method; the main difference from Bagging is the random feature selection. Bagging is an algorithm that can turn weak learners into strong learners. The random forest is appropriate for modeling high-dimensional data because it can handle continuous, categorical, binary, and missing values. The bootstrapping and ensemble scheme make random forest robust enough to overcome the problem of overfitting. This model does not need to prune the trees. Besides high prediction accuracy, random forest is efficient, interpretable, and non-parametric for various types of datasets. Another good feature of RF is that its trees are grown without pruning; in this way, overtraining does not affect the accuracy of the model, which also makes it lighter in terms of computation. In addition, the samples that have not been selected in the training of a tree form a subset called out-of-bag (OOB), and in the RF method these data can be used to evaluate the model's performance (Peters et al. 2007). The prediction error is also calculated based on the out-of-bag samples according to the formula below. The OOB error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that utilize bootstrap aggregating to sub-sample the data used for training. The OOB error is the mean prediction error on each training sample x_i, using only the trees that did not have x_i in their bootstrap sample (Gareth et al. 2013).

$$\mathrm{MSE} \simeq \mathrm{MSE}_{\mathrm{OOB}} = n^{-1} \sum_{i=1}^{n} \left[\hat{y}(x_i) - y_i\right]^{2} \tag{4}$$

where $\hat{y}(x_i)$ is the average of the OOB predictions for the ith observation, $y_i$ is the observed value, and n is the number of observations.

Table 1 Statistical indices of groundwater quality parameters of the Miandoab plain and their maximum limits based on WHO standards

Parameter                 Minimum   Average   Maximum   WHO standard (mg/l)
Calcium                   69.12     259       1112      300
Magnesium                 66.24     115       360       300
Sodium                    20.93     299       1288      200
Chloride                  45.35     350       1410      200
Sulfate                   153.6     388       2490      250
Nitrate                   0.21      33.4      87.7      50
Bicarbonate               183       205       488       150
Potassium                 1.17      6.63      67.4      12
Total dissolved solids    215       2250      5950      600

A decision-maker can quickly become overwhelmed by great amounts of diverse types of data when trying to reach a decision (Rodriguez et al. 2012). A large number of variables related to the properties and behavior of the system can exceed the ability of predictive methods to deal with it (Q). Although more information might be useful for modeling, an increasing number of input features may introduce additional complexity related to the increase in computational time and dimensionality (Bellman 2003). This high dimensionality in the dataset associated with the inclusion of additional features can outweigh the expected increase in prediction accuracy. Feature selection (FS) is an approach for selecting a subset of relevant features for building robust learning models (Blum and Langley 1997; Guyon and Elisseeff 2003; Saeys et al. 2007). FS increases the accuracy of prediction models by accelerating the training process, increasing the generalizability, decreasing the effect of dimensions, and increasing the capability of interpretation. In this study, the RF algorithm, a learning method based on an ensemble of decision trees, is proposed for groundwater quality assessment, a field in which it has not previously been used. The RF technique was identified as an accurate prediction model with advantages over other methods, such as high prediction accuracy, the ability to learn nonlinear relationships, the ability to determine the important variables in the prediction, and less sensitivity to trapping in local minima. The general algorithm of random forest and the methodology flowchart are shown in Fig. 5.
Results and discussion

Hydrogeochemical assessment of Miandoab plain aquifer

Hydrogeochemical assessment of groundwater can provide useful information about the effects of the aquifer and region, water flow paths, bedrock effect, recharge and drainage areas, areas of evaporation from groundwater, and the impact of surface water on groundwater quality. Considering that it is not possible to determine the origin of groundwater without chemical analysis, 75 water samples were collected from the Miandoab plain aquifer and chemically analyzed. The results of the analysis were used to evaluate the groundwater quality of the study area in terms of drinking water standards. In this regard, the values of the chemical parameters that are abundant in groundwater and important in terms of human health were compared with the WHO standards. The statistical indices of these parameters and their maximum limits according to the World Health Organization (WHO) standards for drinking water are presented in Table 1.

Spatial variations of the electrical conductivity and other main hydrochemical parameters in the Miandoab plain aquifer are shown in Fig. 6. In general, the electrical conductivity increases from the east toward the west and northwest of the plain. Given that the depth to the groundwater level in the western part of the plain is in the range of 1 to 4 m and the formations in this part are fine-grained materials with very low permeability, water rises from the shallow depths by capillary force. The very shallow depth of groundwater then causes extreme evaporation and increases the salinity of the groundwater. With freshwater exploitation, the groundwater flow reverses toward the upstream of the aquifer; as a result, salinity extends to the upstream of the aquifer. In the eastern and northeastern parts of the region, owing to their proximity to recharge areas, electrical conductivity is low and the groundwater is of good quality.

In areas where sewage disposal is traditionally carried out through sewage wells, the amount of nitrate in the groundwater is high due to sewage entering the aquifers. Also, in industrial cities, the amount of nitrate is very high due to industrial pollutants entering the aquifers. Figure 6 shows the spatial variations of groundwater nitrate in the Miandoab plain. In the Miandoab and Malekan regions, which are two of the grape-producing areas in the country, animal fertilizers are used every year to increase the yield. These animal fertilizers continuously release nitrate to the groundwater and increase the nitrate pollution potential in the area. As shown in Fig. 6, about half of the area has a nitrate concentration above the global standard (50 mg/l), which is a serious threat to groundwater quality in the area.
Investigation of the groundwater quality index in the study area

FGQI method

Concentrations of different ions in groundwater above the global standards cause problems for drinking, agricultural, and industrial uses. In this study, the hydrogeochemical characteristics of the Miandoab plain aquifer were studied and evaluated in terms of the global standard. Considering the uncertainty in drinking water quality assessment as well as the ability of fuzzy sets in the decision-making process, the quality of drinking water in the region was examined based on the FGQI, obtained by fuzzification of the GQI method. In the FGQI method, the trend of changes is gradual, and the method is more capable than the GQI method. To calculate the FGQI, a rasterized concentration map was prepared for each of the ten chemical parameters by interpolation of the point data in the GIS 10.6 platform. Then, the raster maps were converted to fuzzy raster maps using a linear membership function, and, using Eq. 1, which is referred to in the GQI method, the concentration of each pixel (K) of the fuzzy raster maps created in the previous step was related to the WHO standard value of that parameter. The output with the linear membership function is fuzzy, and ten maps were obtained. Then, by placing the values of the fuzzy maps in Eq. 2, the rank map of each parameter was obtained. Finally, Eq. 3 was used to create the final map according to the WHO standards. Figure 7 shows the map of the fuzzy groundwater quality index (FGQI) based on the WHO standard. In the entrance parts of the plain, groundwater quality is suitable, and in the middle of the plain and near its outlet, groundwater quality decreases. The results show that the FGQI value varies from 0 to 100 according to the WHO standard, and groundwater quality ranges from absolutely unsuitable to suitable.

Table 2 Importance of variables in RF model

Variable                 Score
Nitrate                  100
Total dissolved solids   97.2
Chloride                 67.8
Sulfate                  50.3
Sodium                   44.2
Calcium                  30.2
Magnesium                29.5
Bicarbonate              22.7
Potassium                22.2
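The fuzzification step used in the FGQI method can be illustrated with a toy zero-order Sugeno system. Everything specific in this sketch is an assumption made for illustration: the breakpoints at 0.5, 1.0, and 1.5 times the WHO limit, the three rules, and their constant outputs (95, 60, 10) are hypothetical and are not the membership functions or the six-rule base of the study. Only the WHO limits for TDS (600 mg/l) and nitrate (50 mg/l) come from Table 1.

```python
def desirable(x, limit):
    """Left shoulder: 1 up to 0.5*limit, falling linearly to 0 at the limit."""
    return max(0.0, min(1.0, (limit - x) / (0.5 * limit)))


def acceptable(x, limit):
    """Triangle peaking at the limit, zero at 0.5*limit and 1.5*limit."""
    return max(0.0, 1.0 - abs(x - limit) / (0.5 * limit))


def unacceptable(x, limit):
    """Right shoulder: 0 up to the limit, rising linearly to 1 at 1.5*limit."""
    return max(0.0, min(1.0, (x - limit) / (0.5 * limit)))


def sugeno_fgqi(tds, no3, tds_limit=600.0, no3_limit=50.0):
    """Weighted average of constant rule outputs (zero-order Sugeno).

    AND is taken as min and OR as max. Returns None when no rule fires,
    which can happen because this toy rule base is deliberately incomplete.
    """
    rules = [
        (min(desirable(tds, tds_limit), desirable(no3, no3_limit)), 95.0),
        (min(acceptable(tds, tds_limit), acceptable(no3, no3_limit)), 60.0),
        (max(unacceptable(tds, tds_limit), unacceptable(no3, no3_limit)), 10.0),
    ]
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return None
    return sum(strength * output for strength, output in rules) / total


for tds, no3 in [(200.0, 10.0), (450.0, 37.5), (2000.0, 100.0)]:
    print(tds, no3, sugeno_fgqi(tds, no3))
```

Between the breakpoints the output changes gradually with concentration instead of jumping at the WHO limit, which is the behavior the text attributes to the FGQI relative to the crisp GQI.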
Table 3 Percentage of RFGQI and FGQI areas in the study area variable important were identified by RF method, and then,
GQI-based water GQI value FGQI RFGQI the final map of the groundwater quality obtained according to
quality categorization percentage percentage WHO standards is shown in Fig. 8.
Table 2 shows the importance of each variable, along with
Suitable 90–100 19 20 the predicted score for each of the variables in the model,
Acceptable 70–90 23 16
Moderate 50–70 24.5 15
Unsuitable 25–50 17.5 26
Absolutely unsuitable 0–25 15 23

RFGQI method

Setting the model parameters is the first step in building an RF model for groundwater quality assessment. To find a number of trees, k, at which the error converges and the estimate becomes reliable, models were built with 1 to 100 trees. The error rate decreased as the number of decision trees increased, so 100 trees were used to build the model. The m parameter was also optimized by varying the number of variables between one and the maximum number of variables in each subset. The model results were evaluated by estimating the out-of-bag (OOB) error. Although more information may be useful for modeling, increasing the number of input parameters adds complexity, lengthens computation time, and aggravates dimensionality problems (Bellman 2003). A feature selection (FS) method was therefore used to reduce the dimensionality and to improve the accuracy and comprehensibility of the model. FS is a method for selecting subsets of the relevant parameters for better model training (Guyon and Elisseeff 2003). Also,

which is one of the model outputs. Based on the RF prediction, electrical conductivity and nitrate concentration have the greatest effect on the groundwater quality assessment in the Miandoab plain. According to the WHO standard, the RFGQI results show that about 49% of the groundwater in the Miandoab region is of unsuitable or absolutely unsuitable quality for drinking, and only about 36% is suitable or acceptable for drinking. The percentages of suitable, acceptable, moderate, unsuitable, and absolutely unsuitable areas in the Miandoab plain are given in Table 3.

The prediction model was selected on the basis of three criteria: the accuracy of the prediction model, the number of required variables, and the availability of the data. The prediction error of the RF model was estimated using the OOB error. As shown in Fig. 9a, the RMSE of the model decreases as the number of decision trees increases: with a single decision tree the RMSE is 0.39, and with 100 decision trees it falls to 0.010. Another way of analyzing the RF model performance is the receiver operating characteristic (ROC) curve. A ROC curve is read in a similar way to a success-rate curve, which is governed by the true-positive rate. In general, a ROC curve plots the false-positive rate (FPR) on the X-axis against the true-positive rate (TPR) on the Y-axis. The area under the curve (AUC) in
[Fig. 9: (a) RMSE against the number of trees (0 to 100); (b) ROC curves, TPR against FPR, for the RFGQI and FGQI models]
Arab J Geosci (2020) 13:912
ROC curves is used to assess model performance: the closer the AUC is to one, the more accurate the model. In Fig. 9b, the AUC is equal to 0.96, which indicates the high accuracy of the RFGQI method.

Conclusion

In this study, the hydrogeochemical properties of the Miandoab plain were investigated and evaluated against global standards, and the FGQI and RFGQI methods were used to delineate the groundwater quality. Based on the results, the FGQI value varies from 0 to 100 according to the WHO standard, and groundwater quality ranges from absolutely unsuitable to suitable. Groundwater quality is suitable in the entrance parts of the plain and decreases in the middle of the plain and near its outlet. To overcome the problems of other methods, random forest (RF) algorithms were also applied in this research. RF is a learning method based on an ensemble of decision trees and has advantages over other methods owing to its ability to learn nonlinear relationships and to determine the important variables. According to the RF prediction, electrical conductivity and nitrate concentration have the greatest effect on the groundwater quality assessment in the Miandoab plain. Based on the RFGQI method, 54% of groundwater in the Miandoab region is of unsuitable or absolutely unsuitable quality for drinking, and only about 28% of the groundwater is suitable for drinking. The random forest results in this study also showed that the method can be used reliably for groundwater vulnerability assessment. In further studies, RF results can be related to geological information, land use, depth of water, etc. to identify the factors controlling the groundwater quality and to support better management of the groundwater.

References

Adimalla N, Li PY (2019) Occurrence, health risks, and geochemical mechanisms of fluoride and nitrate in groundwater of the rock-dominant semi-arid region, Telangana State, India. Hum J Ecol Risk Assess 25:81–103
Adimalla N, Qian H (2019) Groundwater quality evaluation using water quality index (WQI) for drinking purposes and human health risk (HHR) assessment in an agricultural region of Nanganur, South India. J Ecotoxicol Environ Saf 176:153–161
Al-hadithi M (2012) Application of water quality index to assess suitability of groundwater quality for drinking purposes in Ratmao-Pathri Rao watershed, Haridwar District India. J Sci Ind Res 23:1321–1336
Amalraj A, Pius A (2018) Assessment of groundwater quality for drinking and agricultural purposes of a few selected areas in Tamil Nadu South India: a GIS-based study. J Sustain Water Resour Manag 4(1):1–21
Babiker IS, Mohamed MAA, Hiyama T (2007) Assessing groundwater quality using GIS. J Water Resour Manag 21:699–715
Barua A, Mudunuri LS, Kosheleva O (2014) Why trapezoidal and triangular membership functions work so well: towards a theoretical explanation. Departmental Technical Reports (CS). Paper 783
Bellman R (2003) Dynamic programming. Dover Publications, Mineola, 366 pp
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. J Artif Intell 97(1–2):245–271
Booker DJ, Snelder TH (2012) Comparing methods for estimating flow duration curves at ungauged sites. J Hydrol 434–435:78–94
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Chapman D (ed) (1996) Water quality assessments - a guide to use of biota, sediments and water in environmental monitoring, 2nd edn. E&FN Spon, London
EARWO (East Azerbaijan Regional Water Organization) (2014) Preparation of water balance and water cycle in the Malekan region. 56 p
Emberger L (1952) Sur le quotient pluviothermique. C R Sci 234:2508–2511
Friedl MA, Brodley CE, Strahler AH (1999) Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Trans Geosci Remote Sens 37(2):969–977
Gareth J, Daniela W, Trevor H, Robert T (2013) An introduction to statistical learning. Springer. 32:316–321
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Herrera M, Torgo L, Izquierdo J, Pérez-García R (2010) Predictive models for forecasting hourly urban water demand. J Hydrol 387(1–2):141–150
Hiyama TL, Hu CY (2003) Application of two stage fuzzy set theory to river quality evaluation Taiwan water resources. J Water Resour Manag 37:1406–1410
Huang J, Xu J, Liu X, Liu J, Wang L (2011) Spatial distribution pattern analysis of groundwater nitrate nitrogen pollution in Shandong intensive farming regions of China using neural network method. J Math Comput Model 54(3–4):995–1004
Jamshidzadeh Z, Barzi M (2018) Groundwater quality assessment using the portability water quality index (PWQI): a case in the Kashan plain, Central Iran. J Environ Earth Sci 77(3):59
Klir G, Yuan B (1995) Fuzzy sets and fuzzy logic, 4th edn. Prentice Hall, Upper Saddle River
Kotsiantis S, Pintelas P (2004) Combining bagging and boosting. Int J Comput Intell 1(4):324–333
Kumar S, Sangeetha B (2020) Assessment of groundwater quality in Madurai city by using parameter geospatial techniques. J Groundw Sustain Dev 10:657–668
Liu X, Xu MM, Huang J, Shi C, Yu XF (2003) Application of geostatistic and technique to characterize spatial variabilities of bioavailable micronutrients in paddy soils. J Agron Environ 12(2):88–91
Nadiri AA, Norouzi H, Khatibi R, Gharekhani M (2019) Groundwater DRASTIC vulnerability mapping by unsupervised and supervised techniques using a modelling strategy in two levels. J Hydrol 574:744–759
Nejatijahromi Z, Nassery HR, Hosono T, Nakhaei M, Alijani F, Okumura A (2019) Groundwater nitrate contamination in an area using urban wastewaters for agricultural irrigation under arid climate condition, southeast of Tehran, Iran. J Agric Water Manag 221:397–414
Norouzi H, Shahmohammadi-Kalalagh S (2019) Locating groundwater artificial recharge sites using random forest: a case study of Shabestar region, Iran. J Environ Earth Sci 78:380
Norouzi H, Moghaddam AA, Nadiri AA (2016) Determining vulnerable areas of Malekan plain aquifer for nitrate, using random forest method. J Environ Stud 41(4):923–942
Norouzi H, Nadiri AA, Moghaddam AA (2018a) Identifying the susceptible area of Malikan plain aquifer to contamination using fuzzy methods. J Environ Stud 44(2):205–221
Norouzi H, Nadiri AA, Moghaddam AA, Gharekhani M (2018b) Comparing performance of fuzzy logic, artificial neural network and random forest models in transmissivity estimation of Malekan plain aquifer. J Ecohydrol 5(3):739–751
Pal M (2005) Random forest classifier for remote sensing classification. Int J Remote Sens 26(1):217–222
Peters J, Baets BD, Verhoest NEC, Samson R, Degroeve S, Becker PD (2007) Random forests as a tool for ecohydrological distribution modelling. Ecol Model 207(2–4):304–318
Quinlan JR (1986) Induction of decision trees. J Mach Learn 1(1):81–106
Rodriguez V, Ghimire B, Rogan J, Chica-Olmo M, Rigol-Sánchez JP (2012) An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J Photogram Remote Sens 67(9):104
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Sener S, Sener E, Davraz A (2017) Evaluation of water quality using water quality index (WQI) method and GIS in Aksu River (SW-Turkey). Journal of Science Total Environment 584:131–144
Sesnie S, Gessler P, Finegan B, Thessler S (2008) Integrating Landsat TM and SRTM-DEM derived variables with decision trees for habitat classification and change detection in complex geotropically environments. Journal of Remote Sensing of Environment 112(5):2145–2159
Shahrabi M (1972) Description of geological map of Urmia. Geological Survey of Iran. 81 p
Sihag P, Mohsenzadeh Karimi S, Angelaki A (2019) Random forest, M5P and regression analysis to estimate the field unsaturated hydraulic conductivity. J Appl Water Sci 9:129. https://doi.org/10.1007/s13201-019-1007-8
Vadiati M, Asghari-Moghaddam A, Nakhaei M, Adamowski J, Akbarzadeh A (2016) A fuzzy-logic based decision-making approach for identification of groundwater quality based on groundwater quality indices. J Environ Manag 184(2):255–270
WHO (World Health Organization) (2008) Guidelines for drinking-water quality, Second addendum, vol 1, Recommendations, 3rd edn., ISBN 9789241547604
Xu PP, Feng WW, Qian H, Zhang QY (2019) Hydrogeochemical characterization and irrigation quality assessment of shallow groundwater in the Central-Western Guanzhong Basin, China. Int J Environ Res Public Health 16:1492
Yajima H, Derot J (2018) Application of the random forest model for chlorophyll-a forecasts in fresh and brackish water bodies in Japan, using multivariate long-term databases. J Hydroinf 20(1):206–220
Yen J, Wang L (1999) Simplifying fuzzy rule-based models using orthogonal transformation methods. IEEE Trans Syst Man Cybern 29(1):13–24
Yung J, Abdulmohsin H, El-Shafie A, Koting S, Mohda N, Jaafar W, Hin L, AbdulMalek M, Ahmed A, Mohtar M, Elshorbagy A, El-Shafie A (2019) Towards a time and cost effective approach to water quality index class prediction. J Hydrol 575:148–165
Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
Zhang YT, Wu JH, Xu B (2019) Human health risk assessment of groundwater nitrogen pollution in Jinghui canal irrigation area of the loess region, Northwest China. J Environ Earth Sci 77:12
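
Supplementary sketch: ROC curve and AUC

The ROC analysis described in the RFGQI method section plots the false-positive rate (FPR) against the true-positive rate (TPR) and summarizes the curve by its area (AUC). The following is a minimal sketch of that calculation, not the procedure used in the study: the function names `roc_curve` and `auc`, the 0/1 label encoding, and the toy scores are all assumptions made for illustration, and the AUC is computed here with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_curve(labels: Sequence[int],
              scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    labels: 1 for the positive class, 0 for the negative class (assumed encoding).
    scores: model outputs where a higher value means "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative sample")

    # Rank samples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_curve(labels, scores)
print(round(auc(fpr, tpr), 3))  # 8 of the 9 positive/negative pairs are ranked correctly
```

An AUC of 1 means every positive sample is ranked above every negative one, while 0.5 corresponds to random ranking. On this scale the reported value of 0.96 for the RFGQI method is close to perfect ranking.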