
Received: 12 June 2023 Revised: 9 November 2023 Accepted: 11 November 2023

DOI: 10.1111/phn.13264

APPLIED THEORY ARTICLE

Prediction of drinking water quality with machine learning models: A public health nursing approach

Gözde Özsezer RN, MSc1,2 Gülengül Mermer RN, PhD3

1 Çanakkale Onsekiz Mart University Faculty of Health Sciences Department of Public Health Nursing, Çanakkale, Turkey
2 Ege University Health Sciences Institute, İzmir, Turkey
3 Ege University Faculty of Nursing Department of Public Health Nursing, İzmir, Turkey

Correspondence
Gülengül Mermer, Ege University Faculty of Nursing Department of Public Health Nursing, İzmir, Turkey.
Email: [email protected]

Abstract
Objective: The aim of this study is to use machine learning models to predict drinking water quality from a public health nursing approach.
Design: Machine learning study.
Sample: “Water Quality Dataset” was used in the study. The dataset contains physical and chemical measurements of water quality for 2400 different water bodies. The process consists of four stages: data processing with Synthetic Minority Oversampling Technique, hyperparameter tuning with 10-fold cross-validation, modeling, and comparative analysis. 80% of the dataset is allocated as training data and 20% as test data. The ML models logistic regression, K-nearest neighbor, support vector machine, random forest, XGBoost, AdaBoost Classifier, and Decision Tree were used for water quality prediction. Accuracy, precision, recall, F1 score, and AUC performance metrics of the ML models were compared. To evaluate the performance of the models, 10-fold cross-validation was used and a comparative analysis was performed. The p-values of the models were also compared.
Results: In this study, where drinking water quality was predicted with seven different ML algorithms, it can be said that XGBoost and Random Forest are the best classification models in all performance metrics. There is a significant difference in all ML algorithms according to the p-value. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.
Conclusion: In conclusion, the use of ML models in the prediction of drinking water quality can help nurses greatly improve access to clean water, a human right, be more knowledgeable about water quality, and protect the health of individuals.

KEYWORDS
machine learning, prediction, public health nursing, water quality

1 INTRODUCTION AND LITERATURE REVIEW

1.1 Introduction

Clean water is vital for health. According to the World Health Organization (WHO), 3.4 million people worldwide die each year due to a lack of access to clean water, especially in regions where access to water is limited or water quality is poor (WHO, 2019). In 2017, WHO and UNICEF announced the first global estimates for water, sanitation, and hygiene in relation to the Sustainable Development Goals (SDGs), reporting that approximately 2.1 billion people worldwide lack access to clean water. Access to clean

Public Health Nurs. 2023;1–17. wileyonlinelibrary.com/journal/phn © 2023 Wiley Periodicals LLC. 1

15251446, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/phn.13264 by Canakkale Onsekiz Mart Uni, Wiley Online Library on [26/11/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
2 ÖZSEZER and MERMER

water is crucial for public health, social development, and economic progress (WHO, 2017).

Water is a critical element for the sustainability of human life, and the right to access clean water is of great importance for adequate and balanced nutrition, healthy living, health services, education, and social development. However, factors such as declining water resources and climate change make it challenging to access clean water. WHO states that many standards are used to assess water quality. The water quality index is one of the most effective tools for communicating information about water quality to concerned citizens and policymakers (Godwin & Oborakpororo, 2019). Water quality depends on the physical, chemical, and biological characteristics of water. Water quality analysis is crucial for assessing water’s suitability for human health. Therefore, physical analysis of water is an important parameter for water quality assessment and management (Godwin & Oborakpororo, 2019; Hamdard et al., 2020). According to WHO (2017) drinking water quality guidelines, there are nine different physicochemical parameters: pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity. In this study, machine learning (ML) models were trained and tested using these physicochemical parameters to assess their effectiveness in predicting water quality.

The right to clean water is also recognized as the right to a life worthy of human dignity (Braig, 2018). It is of primary importance in terms of human rights. It is also guaranteed by the United Nations (n.d.). The United Nations’ Sustainable Development Goal 6 aims to ensure access to clean water and sanitation for all by 2030 (United Nations, 2015). Nurses play a vital role in addressing the health impacts of the environment that can impact social and environmental determinants of health, such as access to clean water and safe drinking water (American Public Health Association Division of Public Health Nursing, 2013; American Nurses Association, 2014). The theme of the International Council of Nurses (2017) is “Nurses: Leading Voice in Achieving Sustainable Development Goals.” Goal 14 of the Sustainable Development Goals focuses on Life Below Water. Within the scope of this purpose, nurses have important duties in ensuring the cleanliness of water, which is an important source of food supply. Research shows that artificial intelligence prediction studies can be used to understand the relationship between water quality and health. However, no specific research on the use of these studies in the field of nursing has been found in the national and international literature. Yet the relationship between water quality and human health is an important issue in the field of nursing. Decreasing water quality can have negative effects on human health and lead to many health problems, such as skin diseases. Therefore, it is important for nurses to have knowledge about water quality management and ensure that water quality measurements are made correctly.

1.2 Literature review

Many researchers have used ML models for water quality prediction. Haghiabi et al. (2018) investigated the performance of artificial intelligence techniques, including artificial neural networks (ANN), group method of data handling (GMDH), and support vector machine (SVM), to predict the water quality components of the Tireh River in southwestern Iran. The review of the ANN and SVM results showed that both models have appropriate performance for predicting water quality constituents. The evaluation of the accuracy of the applied models based on the error indices revealed that SVM is the most accurate model. A similar study investigated a set of ML models to predict water quality classification in the Kelantan River using data from 2005 to 2020. The proposed methodology used 13 physical and chemical parameters of water quality and 7 ML models, including SVM, ANN, Decision Tree (DT), K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), and Gradient Boosting (GB). In the analysis, the Gradient Boosting ensemble model with a learning rate of 0.1 exhibited the best prediction performance compared to the other algorithms. It has the highest accuracy (94.90%), sensitivity (80.00%), and f-measure (86.49%), with the lowest classification error (Ahmed et al., 2019). El Bilali and Taleb (2020) developed 8 ML models to predict irrigation water quality in semi-arid areas. The developed models use conductivity and pH parameters as inputs. Furthermore, Aldhyani et al. (2020) predicted the water quality class with three ML models, mainly SVM, KNN, and Naive Bayes. The dataset used has seven significant parameters. The obtained results show that the constructed models can efficiently predict the water quality index and then classify the water quality. Furthermore, Lu and Ma (2020) proposed two hybrid decision-based ML models to predict water quality in the short term. The base models of the two hybrid models are extreme gradient boosting and random forest. Using Nainital Lake as a study area, the study used eight ML algorithms and nine ML algorithms for classification analysis. The result shows that the Random Forest algorithm is the most efficient ML algorithm in regression analysis. However, when it comes to classification analysis, a single algorithm is not good enough for prediction; three algorithms with the same accuracy, Stochastic Gradient Descent, RF, and SVM, have proven to be effective in predicting water quality (Koranga et al., 2022). Using a set of physicochemical and microbiological parameters as input features to help determine the suitability class of water (i.e., safe or unsafe), another study evaluated the performance of ML models (such as NB, kNN, Logistic Regression (LR), and tree-based classifiers) by applying class balancing (Synthetic Minority Oversampling Technique, SMOTE). ML algorithms are evaluated in terms of accuracy, recall, precision, and area under the curve (AUC). Experimental results show that the stacking classification model after SMOTE with 10-fold cross-validation outperforms the others with an accuracy, precision, and recall of 98.1%, 100%, and 98.1%, respectively, and an AUC equal to 99.9% (Dritsas & Trigka, 2023). In another study, SVM, RF, XGBoost (XGB), Multilayer Perceptron (MLP), and Long Short Term Memory (LSTM) models were investigated for water quality prediction with ML models. SVM performed commendably in predicting water quality, exhibiting excellent generalization capabilities and high prediction accuracy. MLP showed its strength in nonlinear modeling and performed well in predicting multiple water quality parameters. Conversely, RF and XGB models performed relatively poorly in water quality prediction (Wang et al., 2023).

Therefore, ML algorithms for water quality parameters provide stable and accurate predictions.

According to the literature review, ML has been used in recent studies to predict water quality. However, the parameters used as input and ML algorithms are different in the reviewed studies. Therefore, in this paper, similar to the studies in the literature, water potability prediction was performed using seven different ML algorithms and nine different parameters.

1.3 Aim

The aim of this study is to use machine learning models to predict drinking water quality from a public health nursing approach.

2 METHODS

The Water Quality Dataset used in this research was processed with the SMOTE technique. Hyperparameter tuning was performed with LR, KNN, SVM, RF, XGB, ADABoost (ADA), and Decision Tree (DT) algorithms. The performance of the algorithms was evaluated with accuracy, precision, recall, F1 score, and AUC. The flow diagram of the research methodology is shown in Figure 1.

FIGURE 1 The flowchart of the study. [Color figure can be viewed at wileyonlinelibrary.com]

2.1 Water quality dataset

“Water Quality Dataset” was used in the study. The open-access dataset was accessed from the Kaggle (2023) website on March 30, 2023. The dataset contains physical and chemical measurements of water quality for 3276 different water bodies. These measurements include nine feature variables (pH value, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity) and the potability label. Table 1 shows the characteristics of the dataset. 80% of the dataset is allocated as training data and 20% as test data.

2.2 Features of the methods

ML models, a method of artificial intelligence, were used in this research. Artificial intelligence is defined in the Merriam-Webster Dictionary (2023) as “a branch of computer science concerned with the simulation of intelligent behavior in computers, the ability of a machine to imitate intelligent human behavior.” ML is a subset of artificial intelligence. ML, as defined by Samuel in 1959, is “a field of study that gives computers the ability to learn without being explicitly programmed.” ML applications are one of the core components of artificial intelligence, often based on neural networks (artificial neural networks [ANNs]) that mimic the human brain’s ability to perceive and think, governed by algorithmic computations (Özsezer, 2022). ML models use a large training data set to find relationships between input and output data. The relationships found are applied to a new data set to obtain predicted results (Nevala, 2017).

In this research, ML was used to model classification and prediction. Data processing was performed with the Synthetic Minority Oversampling Technique (SMOTE). To achieve high model accuracy in classification and prediction modeling, the quality of the data provided for the model needs to be improved. For this, SMOTE was used to prevent an imbalanced class distribution. In this method, the target variable classes were balanced by increasing the number of samples from the minority class.

2.3 Machine learning classification algorithms

ML models were applied for water quality analysis in the study. LR, KNN, SVM, RF, XGB, ADA, and DT algorithms were applied to the same dataset with 10-fold cross-validation to determine the most appropriate ML method that gives the highest classification performance result. The prediction rates of ML models were evaluated.

2.3.1 Logistic regression (LR)

Logistic regression is an ML classification algorithm used to predict the probability of certain classes based on a set of independent (input) variables. In short, the logistic regression model calculates a weighted sum of the input features (in most cases, plus a bias term) and applies the logistic function to the result. The output of logistic regression is always between 0 and 1, which is suitable for the binary classification task. The higher the value, the higher the probability that the current sample will be classified as class 1 and vice versa (Bailly et al., 2022; Ma et al., 2023; van den Goorbergh et al., 2022; Zabor et al., 2022).

2.3.2 K-nearest neighbor (KNN)

The k-nearest neighbor algorithm, also known as KNN or k-NN, is a nonparametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an

TABLE 1 Dataset features information.

Features         Data type  Description
pH value         float64    pH of water (0 to 14).
Hardness         float64    Capacity of water to precipitate soap in mg/L.
Solids           float64    Total dissolved solids in ppm.
Chloramines      float64    Amount of chloramines in ppm.
Sulfate          float64    Amount of sulfates dissolved in mg/L.
Conductivity     float64    Electrical conductivity of water in µS/cm.
Organic carbon   float64    Amount of organic carbon in ppm.
Trihalomethanes  float64    Amount of trihalomethanes in µg/L.
Turbidity        float64    Measure of light emitting property of water in NTU.
Potability       int64      Indicates if water is safe for human consumption: potable (1) or not potable (0).
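
The row layout in Table 1 (nine float features plus an integer potability label), the 80/20 split, and the minority-class oversampling described in the Methods can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data are random, `train_test_split` and `smote_like` are hypothetical helper names, the interpolation is a simplified SMOTE-style routine rather than the library implementation the authors may have used, and oversampling only the training portion is a choice made here that the paper does not specify.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

# One sample: nine float features (Table 1 order) plus the Potability label (0 or 1).
Row = Tuple[List[float], int]


def train_test_split(rows: Sequence[Row], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows and hold out `test_fraction` of them (80/20 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(rows: Sequence[Row], k: int = 3, seed: int = 0) -> List[Row]:
    """Balance the two classes by adding synthetic minority samples.

    Each synthetic sample lies on the segment between a random minority
    sample and one of its k nearest minority neighbours (simplified SMOTE).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for features, label in rows:
        by_label[label].append(features)
    minority = min(by_label, key=lambda label: len(by_label[label]))
    pool = by_label[minority]
    needed = len(by_label[1 - minority]) - len(pool)
    if len(pool) < 2:
        return list(rows)  # nothing to interpolate between
    synthetic: List[Row] = []
    for _ in range(needed):
        base = rng.choice(pool)
        neighbours = sorted((p for p in pool if p is not base),
                            key=lambda p: squared_distance(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        point = [b + gap * (o - b) for b, o in zip(base, other)]
        synthetic.append((point, minority))
    return list(rows) + synthetic


# Hypothetical toy data: 100 random rows with a 60/40 class imbalance.
rng = random.Random(42)
rows = [([rng.random() for _ in range(9)], 1 if i < 40 else 0) for i in range(100)]
train, test = train_test_split(rows)
balanced_train = smote_like(train)
print(len(train), len(test), Counter(label for _, label in balanced_train))
```

Keeping the held-out 20% free of synthetic samples means the test metrics are computed on measured rows only; whether the original analysis followed this order is not stated in the text.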

individual data point. Although it can be used for both regression and classification problems, it is typically used as a classification algorithm that works on the assumption that similar points can be found close to each other (Bansal et al., 2022; Chumachenko et al., 2022).

2.3.3 Support vector machine (SVM = SVC)

SVM is one of the most popular supervised learning algorithms for classification and regression problems. However, it is primarily used for classification problems in ML. The goal of the SVM algorithm is to create the best line or decision boundary that can classify the n-dimensional space into classes so that, in the future, it can easily place the new data point into the correct category. This best-decision boundary is called the hyperplane. SVM selects the extreme points and vectors that help create the hyperplane. These extreme cases are called support vectors, and hence the algorithm is called Support Vector Machine. Consider the following diagram, where two different categories are classified using a decision boundary or hyperplane (Ahmad et al., 2020; Barjouei et al., 2021; Cortez & Vapnik, 1995; Ghorbani et al., 2020; Kuo et al., 2013; Leong et al., 2021; Rui et al., 2019; Shao et al., 2020).

2.3.4 Random forest (RF)

A random forest is a classifier that contains a set of decision trees on various subsets of a given dataset and averages them to improve the prediction accuracy of that dataset. Instead of relying on a single decision tree, the random forest takes predictions from each tree and predicts the final output based on the majority votes of the predictions. A larger number of trees in the forest generally provides higher accuracy and reduces the risk of overfitting (Breiman, 2001; Chumachenko et al., 2022; Zhang et al., 2022).

2.3.5 XGBoost (XGB)

XGB is an ensemble tree method that applies the principle of boosting weak learners using a gradient descent architecture (Asselman et al., 2023; Chumachenko et al., 2022; Li et al., 2022).

2.3.6 AdaBoost (ADA)

AdaBoost is referred to as an ensemble classifier, which represents a strong classifier resulting from a combination of weak classifiers. The general working logic of the model starts with re-running the classifier at each stage by increasing the weight of the incorrect predictions made as a result of the previous stage. The aim is to increase the classification accuracy of the model by focusing on incorrect predictions (Hao & Hunag, 2023; Sevinç, 2022).

2.3.7 Decision tree (DT)

Decision trees are an algorithm that can represent input variables and output variables in a single tree form. It can be used for classification (categorical) or regression in ML (Anmala & Turuganti, 2021; Breiman et al., 1984; Ma, 2018).

2.4 Hyperparameter tuning

In this study, the hyperparameters of the algorithms were changed to improve the accuracy of the ML algorithms. Hyperparameter optimization was performed to determine the most efficient hyperparameter values. The grid search algorithm is the most basic and slowest hyperparameter determination algorithm. In this technique, all given hyperparameter values are tested one by one, and the hyperparameter values that give the best result are selected. Since all parameters are tested, it works very slowly and makes the most accurate determination. In this way, instead of randomly selecting the next set of parameters, the algorithm optimizes the selection and detects the best set of parameters in the shortest time. Since the ML algorithms used in this study do not have many hyperparameters, the grid search algorithm was preferred to achieve the best results. At the same time, 10-fold cross-validation was performed for hyperparameter optimization. To perform cross-validation, a subset of the data is allocated for validation as “test data”. The reserved subset is not used to train the model but is kept for later use in the validation test. Once the model has been trained, there is a need for reassurance about how well the model will work on data not previously encountered during training. Therefore, the prediction accuracy and performance of the model are tested. Based on the model’s performance on the test data, it is determined whether the model is under-, over-, or well-tuned.

2.5 Model performance comparison metrics

Many different criteria are used to compare the performance of ML models. These metrics are widely used to assess the quality of binary and multiclass classification for ML methods. In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy measures the proportion of correctly classified samples among all samples, as shown in (1).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision measures the proportion of true positives among all cases classified as positive (2).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall (3) shows the proportion of true positives among all actual positive cases.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F1 score (4) is the harmonic mean of precision and recall and provides a balance between the two measures.

$$\mathrm{F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

2.5.1 The ROC curve or AUC

The ROC curve is a graph showing the performance of the classification model at all classification thresholds. The curve plots the true positive rate against the false positive rate for all classification thresholds. The Area Under the ROC Curve (AUC) provides information about the overall performance of a classification model. The maximum value that the AUC can take is 1; the closer the AUC is to 1, the better the classification model performs.

2.6 Data analysis

In this study, the IBM Statistical Package for the Social Sciences (SPSS) 22.0 program was used for statistical analysis. The conformity of the parameters to a normal distribution was evaluated by the Shapiro-Wilk test. In addition to descriptive statistical methods (mean, standard deviation, and frequency), significance was evaluated at the p < .05 level. Python 3.0 was used as the main programming language, and libraries such as NumPy, pandas, and scikit-learn were used for the prediction of ML algorithms. In this research, statistical hypothesis testing was also used for the ML algorithms used in water potability prediction.

The hypotheses for normally distributed data are as follows:
µ0 = mean of actual test values; µ1 = mean of predicted values
H0: µ0 = µ1
H1: µ0 ≠ µ1

The author’s Windows-based personal computer with an Intel i5 7th generation processor and an NVIDIA GeForce 940MX graphics card was used to analyze the ML models using Google Colab as the main IDE.

3 RESULTS

3.1 Water quality exploratory data analysis

The scatter plots of the nine features included in this study according to the potability variable are shown below.

3.1.1 pH

WHO (2017) states that the appropriate pH range for drinking water is 6.5-8.5. This range ensures that water is suitable for both human consumption and industrial use. When the pH of water is lower than 6.5, the water is likely to be acidic, and when the pH is higher than 8.5, the water is likely to be alkaline. This can change the taste and odor of the water. The pH level distribution in this research is shown in Figure 2.

3.1.2 Hardness

Water hardness is a parameter determined by the concentration of magnesium and calcium ions in the water. Water hardness is expressed as the sum of the measured magnesium and calcium values and is reported in mg CaCO3/L. According to WHO (2017) drinking water quality guidelines, water hardness should not exceed 60 mg/L CaCO3. It is also stated that there is no direct relationship between water hardness and health, but it is considered a risk factor for some health effects. Hard water can cause adverse effects on taste and odor. The hardness distribution in this research is shown in Figure 3.
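
The pH range and the hardness unit mentioned above lend themselves to a small worked calculation. The sketch below is an assumption-laden illustration, not part of the study: the function names are hypothetical, and the conversion of calcium and magnesium concentrations to mg/L as CaCO3 uses molar-mass ratios from general chemistry, which the paper itself does not spell out. The 6.5-8.5 limits are the ones quoted in the text.

```python
# Approximate molar masses in g/mol (general chemistry values, not from the paper).
CACO3_MOLAR_MASS = 100.0869
CA_MOLAR_MASS = 40.078
MG_MOLAR_MASS = 24.305


def total_hardness_as_caco3(calcium_mg_l: float, magnesium_mg_l: float) -> float:
    """Express Ca and Mg concentrations (mg/L) as one hardness value in mg/L CaCO3.

    Each ion is converted to the mass of CaCO3 holding the same number of moles,
    and the two contributions are summed.
    """
    calcium_part = calcium_mg_l * CACO3_MOLAR_MASS / CA_MOLAR_MASS
    magnesium_part = magnesium_mg_l * CACO3_MOLAR_MASS / MG_MOLAR_MASS
    return calcium_part + magnesium_part


def ph_in_reported_range(ph: float, low: float = 6.5, high: float = 8.5) -> bool:
    """True when the pH lies inside the 6.5-8.5 range quoted in Section 3.1.1."""
    return low <= ph <= high


# Hypothetical sample: 20 mg/L calcium, 10 mg/L magnesium, pH 7.2.
print(round(total_hardness_as_caco3(20.0, 10.0), 1), ph_in_reported_range(7.2))
```

The conversion factors work out to roughly 2.5 for calcium and 4.1 for magnesium, which is why magnesium contributes more hardness per milligram.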

FIGURE 2 pH level distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 3 Hardness distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

3.1.3 Solids

This parameter is the total mass of solids in drinking water. These can be natural minerals, salts, or waste. Solids can affect the taste, color, and odor of water. Excessive levels of solids can reduce water quality and harm human health. The distribution of total dissolved solids in this research is shown in Figure 4.

3.1.4 Sulfate

Sulfate is part of the mineral salts added to water. It is naturally produced by soil, rocks, and water sources. Sulfate levels can affect water properties such as taste, odor, and appearance. High sulfate levels can also cause vomiting, diarrhea, and other gastrointestinal problems. The sulfate distribution in this research is shown in Figure 5.

3.1.5 Chloramines

Chloramines are chemicals used to disinfect water, consisting of a combination of chlorine and ammonia. Chloramines provide longer-lasting disinfection than chlorination. Chloramines can also be formed during the breakdown of organic matter in water. At high concentrations, chloramines can combine with other organic compounds in the water to produce a foul odor and taste and can cause respiratory problems. The chloramine distribution in this research is shown in Figure 6.

FIGURE 4 Distribution of total dissolved solids in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 5 Sulfate distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 6 Chloramines distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 7 Conductivity distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

3.1.6 Conductivity

Conductivity is the ability of water to conduct electric current. When water shows high conductivity values, it can indicate the presence of high mineral concentrations. This can affect the quality of the water and, in some cases, be harmful to human health. The conductivity distribution in this research is shown in Figure 7.

3.1.7 Organic carbon

Organic carbon is a measure of organic matter added to water. Organic matter can enter water naturally or be introduced by human activities. Excessive levels of organic carbon can cloud water and cause taste, odor, and appearance problems for some people. The organic carbon distribution in this research is shown in Figure 8.

3.1.8 Trihalomethanes

Trihalomethanes are chemicals produced after the chlorination of water. They are formed during the breakdown of organic matter in water or as a result of the reaction of chlorine with water. They can be harmful to human health. Therefore, drinking water standards require trihalomethane levels to be kept below a certain limit. They are strictly monitored in water supplies due to their carcinogenic effects. The trihalomethane distribution in this research is shown in Figure 9.

3.1.9 Turbidity

Turbidity measures the density of dissolved and suspended substances in water. Turbidity measurement is important to determine the efficiency of sedimentation and filtration processes in water resources. The turbidity distribution in this research is shown in Figure 10.

Outliers and missing values were identified in the research. The parameters with missing values were pH, sulfate, and trihalomethanes (Figure 11). The characteristics of outliers are presented in Figure 12. Accordingly, outliers were found in all nine features. SMOTE was used to avoid an imbalanced class distribution.

In this section, graphical data representations and a statistical summary of the dataset are given. The results of the statistical analysis of the dataset features are shown in Table 2. Feature statistics are based on count, mean, standard deviation (Std), minimum (min), 25%, 50%, 75%, and maximum (max) values. The analysis shows that the dataset contains 2400 rows for each feature.

Note: The distribution of potability rates according to the water quality characteristics in the data set is shown in Table 3.
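
The per-feature summary described for Table 2 (count, mean, Std, min, quartiles, max), together with a missing-value count and an outlier check, can be sketched with the Python standard library. The helper names are hypothetical, missing values are represented here as `None` or NaN, and the 1.5 × IQR rule is an assumption chosen for illustration; the paper does not state which outlier criterion was applied.

```python
import math
import statistics
from typing import Dict, List, Optional, Sequence


def _present(values: Sequence[Optional[float]]) -> List[float]:
    """Drop missing entries (None or NaN)."""
    return [v for v in values if v is not None and not math.isnan(v)]


def summarize(values: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summary statistics for one feature column, plus the number of missing values."""
    present = _present(values)
    q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(present),
    }


def iqr_outliers(values: Sequence[Optional[float]], factor: float = 1.5) -> List[float]:
    """Values lying more than `factor` interquartile ranges outside the quartiles."""
    present = _present(values)
    q25, _, q75 = statistics.quantiles(present, n=4, method="inclusive")
    spread = q75 - q25
    low, high = q25 - factor * spread, q75 + factor * spread
    return [v for v in present if v < low or v > high]


# Hypothetical pH readings with one missing entry and one implausible value.
ph_readings = [6.8, 7.1, None, 7.4, 7.0, 13.9, 6.9]
print(summarize(ph_readings))
print(iqr_outliers(ph_readings))
```

Running such a summary before and after imputation or oversampling makes it easier to see how those steps change the row count and the spread of each feature.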

FIGURE 8 Organic carbon distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 9 Trihalomethanes distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 10 Turbidity distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 11 Columns with missing values.

In this study, histograms were used to visualize the data distribution of prediction variables based on the target variable. The bars with different colors in the graphs show the distribution of potable and nonpotable water samples (Figure 13). Based on the graphs, it is seen that the pH, hardness, solids, chloramines, and sulfate values of potable water samples are higher than those of nonpotable water samples. In addition, when the distributions according to the potability variable are analyzed, it is seen that the mineral contents of potable water samples are lower than those of nonpotable water samples. At the same time, trihalomethane values have a similar distribution between potable and nonpotable water samples (Figure 13).
As a result of these analyses, it can be said that there is a relationship between pH, hardness, solids, chloramines, and sulfates values and potability. However, correlation analysis was also performed for these relationships.
There is a correlation matrix of Pearson correlation coefficients between each variable attribute and potability. The matrix shows the correlation coefficient between each pair of variables (Table 4).
According to Table 4, the pH value has a negative correlation with potability. In addition, there is a weak correlation between sulfate, trihalomethanes, hardness variables, and potability. It can be said that these variables are not decisive for potability estimation. On the other hand, there is a moderate correlation between solids, chloramines, conductivity variables, and potability. These variables can be decisive for potability estimation.
The heatmap in Figure 14 is a visualization of the correlation matrix. Each box in the matrix shows the correlation between two variables. As the boxes get closer to dark blue, the correlation becomes negative, and as they get closer to red, the correlation becomes positive. The highest positive correlation with the target variable potability is observed with sodium, hardness, and chloride variables, while the highest negative correlation is observed with turbidity and pH variables. These results seem to be in line with the standards regarding the potability of water. In other words, among the variables affecting the potability of water, increasing parameters such as sodium, hardness, and chloride negatively affect potability, while increasing parameters such as turbidity and pH can positively affect potability (Figure 14). As can be seen in the correlation matrix, correlation values between variables are generally low. This indicates that the variables are independent of each other, and all variables should be used for the model to give more accurate results.
In this study, the relationship between the potability of water and other characteristics in the dataset was detailed with pairplot plots (Figure 15).
When the pairplot graph of other property variables according to the target variable potability is analyzed, it is seen that there are differences in the distributions of pH, hardness, and sulfate variables, which have the strongest relationship with potability. In addition, there is no significant relationship between potability and other variables. This indicates that potability is determined independently of other variables except pH, hardness, and sulfate (Figure 15).
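As a rough illustration of how Pearson coefficients such as those summarized in Table 4 are obtained, the following Python sketch computes a correlation matrix with the standard library only. The function names `pearson` and `correlation_matrix` and the six sample rows are hypothetical; they are not taken from the study's code or from the Water Quality dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Sum of products of deviations, normalized by the two deviation norms.
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / norm


def correlation_matrix(columns):
    """Return a nested dict with the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, for illustration only (not the study's data).
sample = {
    "ph": [6.5, 7.0, 7.4, 8.1, 6.9, 7.2],
    "hardness": [180.0, 210.0, 195.0, 170.0, 205.0, 190.0],
    "potability": [0, 1, 1, 0, 1, 0],
}

matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The coefficient measures linear association only, so values close to 0 do not by themselves rule out a nonlinear relationship between a feature and potability.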

FIGURE 12 Check for outliers among the columns. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 2 Statistical analysis of dataset features.

Features Count Mean Std Min 25% 50% 75% Max

pH 2400.0 7.08 1.57 0.22 6.08 7.02 8.05 13.99
Hardness 2400.0 195.96 32.63 73.49 176.74 197.19 216.44 317.33
Solids 2400.0 21917.44 8642.23 320.94 15615.66 20933.51 27182.58 56488.67
Chloramines 2400.0 7.13 1.58 1.39 6.13 7.14 8.10 13.12
Sulfate 2400.0 333.22 41.20 129.00 307.63 332.23 359.33 481.03
Conductivity 2400.0 426.52 80.71 201.61 366.68 423.45 482.37 753.34
Organic carbon 2400.0 14.35 3.32 2.19 12.12 14.32 16.68 27.00
Trihalomethanes 2400.0 66.40 16.07 8.57 55.95 66.54 77.29 124.00
Turbidity 2400.0 3.96 0.78 1.45 3.44 3.96 4.51 6.49
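The columns of Table 2 correspond to a standard descriptive summary (count, mean, standard deviation, minimum, quartiles, maximum). A minimal sketch of such a summary is given below, assuming the sample standard deviation and linearly interpolated quartiles; the paper does not state which conventions were used, and the helper name `describe` and the readings are hypothetical.

```python
import statistics


def describe(values):
    """Summarize one numeric feature in the style of the Table 2 columns."""
    # Assumption: quartiles use linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": float(len(values)),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # assumption: sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical turbidity readings, for illustration only.
turbidity = [3.44, 3.96, 4.51, 1.45, 6.49, 3.90]
for key, value in describe(turbidity).items():
    print(f"{key:>5}: {value:.2f}")
```

Comparing such a summary across features is one quick way to notice implausible values, for example a pH minimum close to 0.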

3.2 Experimental results of machine learning models

For the analysis of model performances in this study, 80% of the data was randomly allocated for training and 20% for testing. 10-fold cross-validation was applied. LR, KNN, SVM, RF, XGB, ADA, and DT algorithms were used. The accuracy, precision, recall, and F1 Score performance metrics of ML models were compared (Table 5).
According to Table 5, it is seen that the ML algorithm that best predicts the potability level of water is the RF and XGB algorithms, with an accuracy value of 0.79. However, when comparing the ROC curve and AUC of ML models, the value of the AUC is 1.00 in the XGB, ADA, and DT algorithms. In these algorithms, the ROC curve is perfectly decomposed. That is, the classification process was done perfectly, and the model completely separated the positive and negative classes. In terms of precision, RF, LR, XGB, and ADA are good classifiers. In terms of recall and F1 score, KNN and XGB are good classifiers for prediction.
Table 5 also shows the p values. It is observed that there is a significant difference according to the p value in all ML algorithms. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.
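The evaluation protocol described above (an 80/20 random split, 10-fold cross-validation, and per-class precision, recall, and F1 alongside accuracy) can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the function names, the random seed, the choice to run the folds on the training portion, and the toy labels are all hypothetical.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 2400 rows, as reported for the dataset: 1920 for training, 480 for testing.
train_idx, test_idx = train_test_split_indices(2400)
print(len(train_idx), len(test_idx))

# Assumption: the 10 folds are drawn from the training portion only.
fold_sizes = [len(validation) for _, validation in k_fold(train_idx, k=10)]
print(fold_sizes)

# Hypothetical labels (1 = potable, 0 = nonpotable), for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(class_metrics(y_true, y_pred, positive=1), accuracy(y_true, y_pred))
```

Because F1 is the harmonic mean of precision and recall, reported rows can be spot-checked: the LR potable row of Table 5 (precision 0.80, recall 0.68) gives 2 × 0.80 × 0.68 / 1.48 ≈ 0.74, which matches the tabulated F1 score.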

TABLE 3 Distribution of water potability rates according to variable characteristics.

Potability Statistic pH Hardness Solids Chloramines Sulfate Conductivity Organic carbon Trihalomethanes Turbidity
Potable Mean 7.06 196.00 21628.53 7.10 333.74 427.55 14.40 66.27 3.95
std 1.65 30.71 8461.10 1.47 36.39 79.88 3.37 15.93 0.78
Min 1.43 98.45 320.94 2.45 203.44 210.31 4.37 14.34 1.45
25% 5.98 177.31 15378.90 6.16 310.65 369.58 12.11 56.15 3.44
50% 6.99 196.79 20507.39 7.10 332.61 424.47 14.35 66.20 3.94
75% 8.14 214.53 26786.54 8.07 356.43 482.33 16.78 77.14 4.49
max 14.00 300.29 55334.70 12.65 460.10 753.34 27.00 120.03 6.49
Nonpotable mean 7.10 195.96 22091.35 7.18 332.67 422.45 14.30 66.55 3.97
std 1.34 32.77 8694.57 1.63 44.74 77.97 3.09 15.37 0.73
min 0.22 73.49 1198.94 1.39 129.00 201.61 2.20 8.57 1.49
25% 6.30 177.05 15656.42 6.18 304.52 361.29 12.33 56.91 3.46
50% 7.05 197.34 21397.29 7.20 332.49 415.56 14.19 66.61 3.98
75% 7.88 216.76 27416.30 8.15 363.55 477.99 16.44 76.55 4.48
max 11.89 317.33 56488.67 13.12 481.03 695.36 23.60 124.00 6.49

FIGURE 13 Visualization of potability levels of waters according to their characteristics. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 4 Correlation matrix between variables.

Variable pH Hardness Solids Chloramines Sulfate Conductivity Organic carbon Trihalomethanes Turbidity Potability
pH 1.00 -0.09 -0.08 0.0149 -0.01 0.019 0.03 -0.02 -0.03 -0.00
Hardness -0.09 1.00 0.02 -0.03 -0.09 -0.02 -0.02 -0.01 -0.01 -0.01
Solids -0.08 0.02 1.00 -0.07 -0.14 0.01 -0.03 -0.03 -0.00 -0.04
Chloramines 0.01 -0.03 -0.07 1.00 0.02 0.01 -0.01 0.01 0.00 0.02
Sulfate -0.01 -0.09 -0.14 0.02 1.00 -0.01 -0.02 -0.02 -0.00 -0.01
Conductivity 0.01 -0.02 0.01 0.01 -0.01 1.00 0.02 0.00 0.00 -0.00
Organic carbon 0.03 -0.02 -0.03 -0.01 -0.02 0.02 1.00 -0.01 -0.02 -0.03
Trihalomethanes -0.02 -0.01 -0.03 0.01 -0.02 0.00 -0.01 1.00 -0.02 -0.00
Turbidity -0.03 -0.01 -0.00 0.00 -0.00 0.00 -0.02 -0.02 1.00 0.00
Potability -0.00 -0.01 -0.04 0.02 -0.01 -0.00 -0.03 -0.00 0.00 1.00

FIGURE 14 Statistical analysis of dataset features for correlation. [Color figure can be viewed at wileyonlinelibrary.com]

4 DISCUSSION

The Water Quality dataset used in this study has a wide range of applications and is used in different fields. The Water Quality dataset contains various parameters related to drinking water quality, and these parameters are used in evaluations related to the quality of water resources. Drinking water quality needs to be monitored regularly for the healthy use of water resources. In this study, it was determined that the ML algorithm that best predicts drinking water quality is the XGB algorithm, which is common in all of the accuracy, precision, recall, F1 score, and AUC performance metrics. In the study where water quality was predicted using the same dataset, ML algorithms were evaluated with precision using recall, F1 score, and ROC curve/AUC performance metrics. It was stated that KNN in terms of precision, LASSO LARS (LL), Stochastic Gradient Descent (SGD) in terms of recall, SVM, and Artificial Neural Network (ANN) in terms of ROC curve/AUC were good classifiers. Although XGB was used in the study, it was reported to give moderate results (Kaddoura, 2022). Nasir et al. (2022) used SVM, RF, LR, DT, XGB, CatBoost, and Multi-Layer Perceptron (MLP) algorithms for water quality prediction and found that the CatBoost model provided the most accurate classifier with 94.5%. In a similar dataset, RF, NN, SVM, Multinomial Logistic Regression (MLR), and Bagged Tree Model (BTM) algorithms were used to predict the water quality index, and MLR was found to be the best classifier with 99.8% accuracy (Hassan et al., 2021). In the study conducted with the dataset of the Rawal watershed created by the Pakistan Council of Research in Water Resources, MLP, Gaussian Naive Bayes, LR, SGD, KNN, DT, RF, SVM, GB, and Bagging Classifier algorithms were evaluated with MAE, MSE, RMSE, and R2 parameters and accuracy, precision, recall, and F1 score performance metrics for water quality prediction. MLP was reported to be the best classifier (Ahmed et al., 2019). In the study by Bui et al. (2020), water quality prediction was performed with new hybrid ML algorithms. These algorithms (decision-tree algorithms): M5P; random forest (RF); random tree (RT); and reduced error pruning tree (REPT); (meta-classifier or hybrid algorithms): Bagging (BA); CV parameter selection (CVPS); and randomizable filtered classifier (RFC); including BA-M5P; BA-RF; BA-RT; BA-REPT; CVPS-M5P; CVPS-RF; CVPS-RT; CVPS-REPT; RFC-M5P; RFC-RF; RFC-RT; and RFC-REPT. The best classifier is BA-RT. In this study, DT, KNN, SVM, Discriminants Analysis (DA), and Ensemble Trees (ET) algorithms were used to predict water quality indices at a regional scale using ML algorithms in the Naama region, located in the southwestern region of Algeria. It was reported that the SVM classifier achieved 95.4% prediction accuracy (Derdour et al., 2022). In another water quality prediction study using a similar dataset to this research, SVM, KNN, and Naive Bayes algorithms were used, and SVM achieved the highest value with 97.01% accuracy (Aldhyani et al., 2020). It can be said that these differences are due to the difference in the separation of training and test data between the algorithms used in our study. In this study, the predictability of water quality with a ML approach using the Water Quality dataset is addressed in terms of nursing. Studies show that ML methods can be a very effective approach for water quality prediction.

FIGURE 15 The pairplot analysis of dataset features. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 5 Comparison of performance metrics of ML models.

Model Potability Precision Recall F1Score Accuracy AUC p-value
LR Potable 0.80 0.68 0.74 0.78 0.70 .00
Nonpotable 0.77 0.87 0.82
KNN Potable 0.87 0.94 0.90 0.70 0.70 .02
Nonpotable 0.66 0.44 0.53
SVM Potable 0.66 0.76 0.70 0.64 0.50 .01
Nonpotable 0.60 0.47 0.53
RF Potable 0.80 0.79 0.80 0.79 0.70 .00
Nonpotable 0.78 0.79 0.78
XGB Potable 0.80 0.91 0.85 0.79 1.00 .00
Nonpotable 0.74 0.51 0.60
ADA Potable 0.80 0.88 0.84 0.78 1.00 .00
Nonpotable 0.73 0.59 0.65
DT Potable 0.72 0.76 0.74 0.65 1.00 .01
Nonpotable 0.49 0.44 0.47

5 CONCLUSION

Water quality is related to all of the United Nations Sustainable Development Goals. Water quality prediction with ML is the interesting part of this study. To achieve this, a comparative evaluation of a large number of ML classification models, such as LR, KNN, SVM, RF, XGB, ADA, DT, etc., was performed, and the intended model with the highest accuracy and discrimination ability, SMOTE, was developed with 10-fold cross-validation. The performance of the ML algorithms used in this study was compared, and it was observed that the RF and XGB algorithms (accuracy = 0.79) gave the best prediction results. However, when the ROC curve/AUC of the ML models is compared, the AUC value is 1.00 in the XGB, ADA, and DT algorithms. In these algorithms, the ROC curve is perfectly decomposed. In other words, the classification process was done perfectly, and the model completely separated the positive and negative classes. In terms of precision, RF, LR, XGB, and ADA are good classifiers. In terms of recall and F1 score, KNN and XGB are good classifiers for prediction. There is a significant difference in all ML algorithms according to the p-value. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.
This study demonstrates the benefits of an ML tool that can be used by nurses for water quality monitoring. A better understanding of water quality by nurses can lead to better health outcomes. The use of ML algorithms for water quality prediction requires further research to achieve a wider range of applications and better results. The findings of this study support the use of ML techniques for water quality prediction and monitoring in the field of nursing. Monitoring and predicting water quality is important to protect public health, and therefore nurses need to be further educated on this topic. This study can help nurses become more knowledgeable about water quality and protect the health of individuals.
Nurses can educate the community about the protection and treatment of water resources. These trainings can raise awareness about how people can use water resources without harming them. Nurses can participate in prevention efforts to protect water resources. For example, they can support efforts to treat and recycle wastewater. Nurses can monitor the pollution and depletion of water resources. They can play an important role in terms of public health by following studies on water resources. Nurses can play an active role in the management of water resources. Since the management of water resources is critical for public health, nurses should actively work on this issue.
The dataset and ML algorithms used in this study can be used in other water quality studies. In particular, the use of ML techniques has become more important in studies investigating the relationship between pollution in water resources and human health. Therefore, it is recommended that the use of ML techniques in water quality research be more widespread. In addition, it can be suggested that physical, chemical, and biological parameters that can be used in water quality prediction should be included in the dataset and re-evaluated, and water quality predictions should be made with ML models.

6 IMPLICATIONS FOR PUBLIC HEALTH NURSING

Access to clean water is both a human right and a right to health. Public health nurses should facilitate individuals’ access to clean water with their advocacy, consulting, and research roles. Therefore, it is important for public health nurses to be able to analyze water first. As shown in this study, the use of artificial intelligence techniques is prominent today. Public health nurses should also learn about this technology and use it in their research. The use of artificial intelligence methods will help nurses focus more on their main duty of care. At the same time, a different technique will be used to access clean water.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

ACKNOWLEDGMENTS
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

CONFLICT OF INTEREST STATEMENT
The authors declare that there is no conflict of interest.

DATA AVAILABILITY STATEMENT
The data is available open access via Kaggle.

ORCID
Gülengül Mermer RN, PhD https://orcid.org/0000-0002-0566-5656

REFERENCES
Ahmad, M. S., Adnan, S. M., Zaidi, S., & Bhargava, P. (2020). A novel support vector regression (SVR) model for the prediction of splice strength of the unconfined beam specimens. Construction and building materials, 248, 118475. https://doi.org/10.1016/j.conbuildmat.2020.118475
Ahmed, U., Mumtaz, R., Anwar, H., Shah, A. A., Irfan, R., & García-Nieto, J. (2019). Efficient water quality prediction using supervised machine learning. Water, 11(11), 2210. https://doi.org/10.3390/w11112210
Aldhyani, T. H., Al-Yaari, M., Alkahtani, H., & Maashi, M. (2020). Water quality prediction using artificial intelligence algorithms. Applied Bionics and Biomechanics, 2020, 1–12. https://doi.org/10.1155/2020/6659314
American Nurses Association. (2014). Public health nursing: Scope and standards of practice. (Second Edition).
American Public Health Association Public Health Nursing Section. (2013). The definition and practice of public health nursing: A statement of the public health nursing section. American Public Health Association.
Anmala, J., & Turuganti, V. (2021). Comparison of the performance of decision tree (DT) algorithms and extreme learning machine (ELM) model in the prediction of water quality of the Upper Green River watershed. Water Environment Research, 93(11), 2360–2373. https://doi.org/10.1002/wer.1642
Asselman, A., Khaldi, M., & Aammou, S. (2023). Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interactive Learning Environments, 31(6), 3360–3379. https://doi.org/10.1080/10494820.2021.1928235
Bailly, A., Blanc, C., Francis, É., Guillotin, T., Jamal, F., Wakim, B., & Roy, P. (2022). Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Computer Methods and Programs in Biomedicine, 213, 106504. https://doi.org/10.1016/j.cmpb.2021.106504
Bansal, M., Goyal, A., & Choudhary, A. (2022). A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decision Analytics Journal, 3, 100071. https://doi.org/10.1016/j.dajour.2022.100071
Barjouei, H. S., Ghorbani, H., Mohamadian, N., Wood, D. A., Davoodi, S., Moghadasi, J., & Saberi, H. (2021). Prediction performance advantages of deep machine learning algorithms for two-phase flow rates through wellhead chokes. Journal of Petroleum Exploration and Production, 3, 1233–1261. https://doi.org/10.1016/j.dajour.2022.100071
Braig, K. F. (2018). The European Court of Human Rights and the right to clean water and sanitation. Water Policy, 20(2), 282–307. https://doi.org/10.2166/wp.2018.045
Breiman, L. (2001). Random forests. Machine learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. G. (1984). Classification and regression trees. Wadsworth International Group.
Bui, D. T., Khosravi, K., Tiefenbacher, J., Nguyen, H., & Kazakis, N. (2020). Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Science of the Total Environment, 721, 137612. https://doi.org/10.1016/j.scitotenv.2020.137612
Chumachenko, D., Meniailov, I., Bazilevych, K., Chumachenko, T., & Yakovlev, S. (2022). Investigation of statistical machine learning models for COVID-19 epidemic process simulation: Random forest, K-nearest neighbors, gradient boosting. Computation, 10(6), 86. https://doi.org/10.3390/computation10060086
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20, 273–297. https://doi.org/10.1007/BF00994018
Derdour, A., Jodar-Abellan, A., Pardo, M. Á., Ghoneim, S. S., & Hussein, E. E. (2022). Designing efficient and sustainable predictions of water quality indexes at the regional scale using machine learning algorithms. Water, 14(18), 2801. https://doi.org/10.3390/w14182801
Dritsas, E., & Trigka, M. (2023). Efficient data-driven machine learning models for water quality prediction. Computation, 11(2), 16. https://doi.org/10.3390/computation11020016
El Bilali, A., & Taleb, A. (2020). Prediction of irrigation water quality parameters using machine learning models in a semi-arid environment. Journal of the Saudi Society of Agricultural Sciences, 19(7), 439–451. https://doi.org/10.1016/j.jssas.2020.08.001
Ghorbani, H., Wood, D. A., Choubineh, A., Tatar, A., Abarghoyi, P. G., Madani, M., & Mohamadian, N. (2020). Prediction of oil flow rate through an orifice flow meter: Artificial intelligence alternatives compared. Petroleum, 6(4), 404–414. https://doi.org/10.1016/j.petlm.2018.09.003
Godwin, A., & Oborakpororo, O. (2019). Surface water quality assessment of warri metropolis using Water Quality Index. International Letters of Natural Sciences, 74, 18–25. https://doi.org/10.18052/www.scipress.com/ILNS.74.18
Haghiabi, A. H., Nasrolahi, A. H., & Parsaie, A. (2018). Water quality prediction using machine learning methods. Water Quality Research Journal, 53(1), 3–13. https://doi.org/10.2166/wqrj.2018.025
Hamdard, M. H., Soliev, I., Xiong, L., & Kløve, B. (2020). Drinking water quality assessment and governance in Kabul: A case study from a district with high migration and underdeveloped infrastructure. Central Asian Journal of Water Research, 6(1), 66–81. https://doi.org/10.29258/CAJWR/2020-R1.v6-1/66-81.eng
Hao, L., & Huang, G. (2023). An improved AdaBoost algorithm for identification of lung cancer based on electronic nose. Heliyon, 9(3), e13633. https://doi.org/10.1016/j.heliyon.2023.e13633
Hassan, M. M., Hassan, M. M., Akter, L., Rahman, M. M., Zaman, S., Hasib, K. M., Jahan, N., Smrity, R. S., Farhana, J., Raihan, M., & Mollick, S. (2021). Efficient prediction of water quality index (WQI) using machine learning algorithms. Human-Centric Intelligent Systems, 1(3-4), 86–97. https://doi.org/10.2991/hcis.k.211203.001
International Council of Nurses. (2017). Nurses: A voice to lead—Achieving the Sustainable Development Goals. https://www.icnvoicetolead.com/home/
Kaddoura, S. (2022). Evaluation of machine learning algorithm on drinking water quality for better sustainability. Sustainability, 14(18), 11478. https://doi.org/10.3390/su141811478
Kaggle. (2023). Water quality dataset. https://www.kaggle.com/datasets/adityakadiwal/water-potability
Koranga, M., Pant, P., Kumar, T., Pant, D., Bhatt, A. K., & Pant, R. P. (2022). Efficient water quality prediction models based on machine learning algorithms for Nainital Lake, Uttarakhand. Materials today: proceedings, 57, 1706–1712. https://doi.org/10.1016/j.matpr.2021.12.334
Kuo, B. C., Ho, H. H., Li, C. H., Hung, C. C., & Taur, J. S. (2013). A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(1), 317–326. https://doi.org/10.1109/jstars.2013.2262926
Leong, W. C., Bahadori, A., Zhang, J., & Ahmad, Z. (2021). Prediction of water quality index (WQI) using support vector machine (SVM) and least square-support vector machine (LS-SVM). International Journal of River Basin Management, 19(2), 149–156. https://doi.org/10.1080/15715124.2019.1628030
Li, J., An, X., Li, Q., Wang, C., Yu, H., Zhou, X., & Geng, Y. A. (2022). Application of XGBoost algorithm in the optimization of pollutant concentration. Atmospheric Research, 276, 106238. https://doi.org/10.1016/j.atmosres.2022.106238
Lu, H., & Ma, X. (2020). Hybrid decision tree-based machine learning models for short-term water quality prediction. Chemosphere, 249, 126169. https://doi.org/10.1016/j.chemosphere.2020.126169
Ma, J., Dhiman, P., Qi, C., Bullock, G., van Smeden, M., Riley, R. D., & Collins, G. S. (2023). Poor handling of continuous predictors in clinical prediction models using logistic regression: A systematic review. Journal of Clinical Epidemiology, 161, 140–151. https://doi.org/10.1016/j.jclinepi.2023.07.017
Ma, X. (2018). Using classification and regression trees: A practical primer. IAP.
Merriam-Webster Dictionary. (2023). Artifical intelligence, https://www.merriam-webster.com/dictionary/artificial%20intelligence
Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A., & Al-Shamma’a, A. (2022). Water quality classification using machine learning algorithms. Journal of Water Process Engineering, 48, 102920. https://doi.org/10.1016/j.jwpe.2022.102920
Nevala, K. (2017). Machine learning primer. SAS Institute.
Özsezer, G. (2022). The future of artificial intelligence in nursing. Journal of Human Sciences, 19(2), 285–299. https://doi.org/10.14687/jhs.v19i2.6217
Rui, J., Zhang, H., Zhang, D., Han, F., & Guo, Q. (2019). Total organic carbon content prediction based on support-vector-regression machine with particle swarm optimization. Journal of Petroleum Science and Engineering, 180, 699–706. https://doi.org/10.1016/j.petrol.2019.06.014
Sevinç, E. (2022). An empowered AdaBoost algorithm implementation: A COVID-19 dataset study. Computers & Industrial Engineering, 165, 107912. https://doi.org/10.1016/j.cie.2021.107912
Shao, M., Wang, X., Bu, Z., Chen, X., & Wang, Y. (2020). Prediction of energy consumption in hotel buildings via support vector machines. Sustainable Cities and Society, 57, 102128. https://doi.org/10.1016/j.scs.2020.102128
United Nations. (2015). Sustainable development goals. https://www.un.org/sustainabledevelopment/sustainable-development-goals/
United Nations. (n.d). The universal declaration of human rights. https://www.un.org/en/universal-declaration-human-rights/
van den Goorbergh, R., van Smeden, M., Timmerman, D., & van Calster, B. (2022). The harm of class imbalance corrections for risk prediction models: Illustration and simulation using logistic regression. Journal of the American Medical Informatics Association, 29(9), 1525–1534. https://doi.org/10.1093/jamia/ocac093

Wang, X., Li, Y., Qiao, Q., Tavares, A., & Liang, Y. (2023). Water quality prediction based on machine learning and comprehensive weighting methods. Entropy, 25(8), 1186. https://doi.org/10.3390/e25081186
World Health Organization. (2017). Guidelines for drinking-water quality. 4th edn. World Health Organization.
World Health Organization. (2019). Drinking-water. World Health Organization, https://www.who.int/news-room/fact-sheets/detail/drinking-water
Zabor, E. C., Reddy, C. A., Tendulkar, R. D., & Patil, S. (2022). Logistic regression in clinical studies. International Journal of Radiation Oncology* Biology* Physics, 112(2), 271–277. https://doi.org/10.1016/j.ijrobp.2021.08.007
Zhang, A., Yu, H., Huan, Z., Yang, X., Zheng, S., & Gao, S. (2022). SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors. Information Sciences, 595, 70–88. https://doi.org/10.1016/j.ins.2022.02.038

How to cite this article: Özsezer, G., & Mermer, G. (2023). Prediction of drinking water quality with machine learning models: a public health nursing approach. Public Health Nursing, 1–17. https://doi.org/10.1111/phn.13264



Public Health Nurs. 2023;1–17. wileyonlinelibrary.com/journal/phn © 2023 Wiley Periodicals LLC.
