P16 Prediction of Drinking Water Quality With Machine Learning
DOI: 10.1111/phn.13264
1 Çanakkale Onsekiz Mart University Faculty of Health Sciences Department of Public Health Nursing, Çanakkale, Turkey
2 Ege University Health Sciences Institute, İzmir, Turkey
3 Ege University Faculty of Nursing Department of Public Health Nursing, İzmir, Turkey
Correspondence
Gülengül Mermer, Ege University Faculty of Nursing Department of Public Health Nursing, İzmir, Turkey.
Email: [email protected]
Abstract
Objective: The aim of this study is to use machine learning models to predict drinking water quality from a public health nursing approach.
Design: Machine learning study.
Sample: The “Water Quality Dataset” was used in the study. The dataset contains physical and chemical measurements of water quality for 2400 different water bodies. The process consists of four stages: data processing with the Synthetic Minority Oversampling Technique, hyperparameter tuning with 10-fold cross-validation, modeling, and comparative analysis. 80% of the dataset is allocated as training data and 20% as test data. The ML algorithms logistic regression, K-nearest neighbor, support vector machine, random forest, XGBoost, AdaBoost, and decision tree were used for water quality prediction. The accuracy, precision, recall, F1 score, and AUC performance metrics of the ML models were compared. To evaluate the performance of the models, 10-fold cross-validation was used and a comparative analysis was performed. The p-values of the models were also compared.
Results: In this study, where drinking water quality was predicted with seven different ML algorithms, XGBoost and Random Forest were the best classification models across all performance metrics. There is a significant difference in all ML algorithms according to the p-value. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.
Conclusion: In conclusion, the use of ML models in the prediction of drinking water
quality can help nurses greatly improve access to clean water, a human right, be more
knowledgeable about water quality, and protect the health of individuals.
KEYWORDS
machine learning, prediction, public health nursing, water quality
1 INTRODUCTION AND LITERATURE REVIEW

1.1 Introduction

Clean water is vital for health. According to the World Health Organization (WHO), 3.4 million people worldwide die each year due to a lack of access to clean water, especially in regions where access to water is limited or water quality is poor (WHO, 2019). In 2017, WHO and UNICEF announced the first global estimates for water, sanitation, and hygiene in relation to the Sustainable Development Goals (SDGs), reporting that approximately 2.1 billion people worldwide lack access to clean water. Access to clean water is crucial for public health, social development, and economic progress (WHO, 2017).

Water is a critical element for the sustainability of human life, and the right to access clean water is of great importance for adequate and balanced nutrition, healthy living, health services, education, and social development. However, factors such as declining water resources and climate change make it challenging to access clean water. WHO states that many standards are used to assess water quality. The water quality index is one of the most effective tools for communicating information about water quality to concerned citizens and policymakers (Godwin & Oborakpororo, 2019). Water quality depends on the physical, chemical, and biological characteristics of water. Water quality analysis is crucial for assessing water’s suitability for human health. Therefore, physical analysis of water is an important parameter for water quality assessment and management (Godwin & Oborakpororo, 2019; Hamdard et al., 2020). According to the WHO (2017) drinking water quality guidelines, there are nine different physicochemical parameters: pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity. In this study, machine learning (ML) models were trained and tested using these physicochemical parameters to assess their effectiveness in predicting water quality.

The right to clean water is also recognized as the right to a life worthy of human dignity (Braig, 2018). It is of primary importance in terms of human rights. It is also guaranteed by the United Nations (n.d.). The United Nations’ Sustainable Development Goal 6 aims to ensure access to clean water and sanitation for all by 2030 (United Nations, 2015). Nurses play a vital role in addressing the health impacts of the environment that can impact social and environmental determinants of health, such as access to clean water and safe drinking water (American Public Health Association Division of Public Health Nursing, 2013; American Nurses Association, 2014). The theme of the International Council of Nurses (2017) is “Nurses: Leading Voice in Achieving Sustainable Development Goals.” Goal 14 of the Sustainable Development Goals focuses on Life Below Water. Within the scope of this goal, nurses have important duties in ensuring the cleanliness of water, which is an important source of food supply. Research shows that artificial intelligence prediction studies can be used to understand the relationship between water quality and health. However, no specific research on the use of these studies in the field of nursing has been found in the national and international literature. Nevertheless, the relationship between water quality and human health is an important issue in the field of nursing. Decreasing water quality can have negative effects on human health and lead to many health problems, such as skin diseases. Therefore, it is important for nurses to have knowledge about water quality management and ensure that water quality measurements are made correctly.

1.2 Literature review

Many researchers have used ML models for water quality prediction. Haghiabi et al. (2018) investigated the performance of artificial intelligence techniques, including artificial neural networks (ANN), group data processing methods (GMDM), and support vector machine (SVM), to predict the water quality components of the Tireh River in southwestern Iran. The review of the ANN and SVM results showed that both models have appropriate performance for predicting water quality constituents. The evaluation of the accuracy of the applied models based on the error indices revealed that SVM is the most accurate model. A similar study investigated a set of ML models to predict water quality classification in the Kelantan River using data from 2005 to 2020. The proposed methodology used 13 physical and chemical parameters of water quality and 7 ML models, including SVM, ANN, Decision Tree (DT), K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), and Gradient Boosting (GB). In the analysis, the Gradient Boosting ensemble model with a learning rate of 0.1 exhibited the best prediction performance compared to the other algorithms. It had the highest accuracy (94.90%), sensitivity (80.00%), and f-measure (86.49%), with the lowest classification error (Ahmed et al., 2019). El Bilali and Taleb (2020) developed 8 ML models to predict irrigation water quality in semi-arid areas. The developed models use conductivity and pH parameters as inputs. Furthermore, Aldhyani et al. (2020) predicted the water quality class with three ML models, mainly SVM, KNN, and Naive Bayes. The dataset used has seven significant parameters. The obtained results show that the constructed models can efficiently predict the water quality index and then classify the water quality. Furthermore, Lu and Ma (2020) proposed two hybrid decision-based ML models to predict water quality in the short term. The base models of the two hybrid models are extreme gradient boosting and random forest.

Using Nainital Lake as a study area, the study used eight ML algorithms and nine ML algorithms for classification analysis. The result shows that the Random Forest algorithm is the most efficient ML algorithm in regression analysis. However, when it comes to classification analysis, a single algorithm is not good enough for prediction; three algorithms with the same accuracy, Stochastic Gradient Descent, RF, and SVM, have proven to be effective in predicting water quality (Koranga et al., 2022). Using a set of physicochemical and microbiological parameters as input features to help determine the suitability class of water (i.e., safe or unsafe), another study evaluated the performance of ML models (such as NB, kNN, Logistic Regression (LR), and tree-based classifiers) after applying class balancing (Synthetic Minority Oversampling Technique, SMOTE). The ML algorithms were evaluated in terms of accuracy, recall, precision, and area under the curve (AUC). Experimental results show that the stacking classification model after SMOTE with 10-fold cross-validation outperforms the others with an accuracy, precision, and recall of 98.1%, 100%, and 98.1%, respectively, and an AUC equal to 99.9% (Dritsas & Trigka, 2023). In another study, SVM, RF, XGBoost (XGB), Multilayer Perceptron (MLP), and Long Short Term Memory (LSTM) models were investigated for water quality prediction. SVM performed commendably in predicting water quality, exhibiting excellent generalization capabilities and high prediction accuracy. MLP showed its strength in nonlinear modeling and performed well in predicting multiple water quality parameters. Conversely, RF and XGB models performed relatively poorly in water quality prediction (Wang et al., 2023).
15251446, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/phn.13264 by Canakkale Onsekiz Mart Uni, Wiley Online Library on [26/11/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
ÖZSEZER and MERMER 3
The Water Quality Dataset used in this research was processed with the SMOTE technique. Hyperparameter tuning was performed with LR, KNN, SVM, RF, XGB, AdaBoost (ADA), and Decision Tree (DT) algorithms. The performance of the algorithms was evaluated with accuracy, precision, recall, F1 score, and AUC. The flow diagram of the research methodology is shown in Figure 1.

2.1 Water quality dataset

The “Water Quality Dataset” was used in the study. The open-access dataset was accessed from the Kaggle (2023) website on March 30, 2023. The dataset contains physical and chemical measurements of water quality for 3276 different water bodies. These measurements include nine different property variables: pH value, hardness, solids, sulfate, conductivity, organic carbon, trihalomethanes, turbidity, and potability. Table 1 shows the characteristics of the dataset. 80% of the dataset is allocated as training data and 20% as test data.

2.3.1 Logistic regression (LR)

Logistic regression is an ML classification algorithm used to predict the probability of certain classes based on a set of input variables. In short, the logistic regression model computes a weighted sum of the input features (in most cases, with a bias term) and applies the logistic function to the result. The output of logistic regression is always between 0 and 1, which is suitable for the binary classification task. The higher the value, the higher the probability that the current sample will be classified as class 1 and vice versa (Bailly et al., 2022; Ma et al., 2023; van den Goorbergh et al., 2022; Zabor et al., 2022).

2.3.2 K-nearest neighbor (KNN)

The k-nearest neighbor algorithm, also known as KNN or k-NN, is a nonparametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. Although it can be used for both regression and classification problems, it is typically used as a classification algorithm that works on the assumption that similar points can be found close to each other (Bansal et al., 2022; Chumachenko et al., 2022).

2.3.3 Support vector machine (SVM)

SVM is one of the most popular supervised learning algorithms for classification and regression problems. However, it is primarily used for classification problems in ML. The goal of the SVM algorithm is to create the best line or decision boundary that can classify the n-dimensional space into classes so that, in the future, it can easily place a new data point into the correct category. This best decision boundary is called the hyperplane. SVM selects the extreme points and vectors that help create the hyperplane. These extreme cases are called support vectors, and hence the algorithm is called Support Vector Machine. Consider the following diagram, where two different categories are classified using a decision boundary or hyperplane (Ahmad et al., 2020; Barjouei et al., 2021; Cortez & Vapnik, 1995; Ghorbani et al., 2020; Kuo et al., 2013; Leong et al., 2021; Rui et al., 2019; Shao et al., 2020).

2.3.5 XGBoost (XGB)

XGB is an ensemble tree method that applies the principle of strengthening weak learners using a gradient descent architecture (Asselman et al., 2023; Chumachenko et al., 2022; Li et al., 2022).

2.3.6 AdaBoost (ADA)

AdaBoost is referred to as an ensemble classifier, which represents a strong classifier resulting from a combination of weak classifiers. The general working logic of the model starts with re-running the classifier at each stage by increasing the weight of the incorrect predictions made as a result of the previous stage. The aim is to increase the classification accuracy of the model by focusing on incorrect predictions (Hao & Hunag, 2023; Sevinç, 2022).

2.3.7 Decision tree (DT)

A decision tree is an algorithm that can represent input variables and output variables in a single tree form. It can be used for classification (categorical) or regression in ML (Anmala & Turuganti, 2021; Breiman et al., 1984; Ma, 2018).
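To make the logistic regression description in Section 2.3.1 concrete, the following is a minimal sketch in plain Python of how a fitted model turns a weighted feature sum into a class probability. The weights, bias, and sample values are hypothetical and chosen only for illustration; they are not coefficients from this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1: logistic of the weighted feature sum plus bias."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for two standardized features (assumption, not study output)
weights = [0.8, -0.5]
bias = 0.1
sample = [1.2, 0.4]

probability = predict_proba(sample, weights, bias)
label = predict_class(sample, weights, bias)
```

In practice the weights and bias would be estimated from the training data, and the 0.5 threshold is only a common default.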
randomly selecting the next set of parameters, the algorithm optimizes the selection and detects the best set of parameters in the shortest time. Since the ML algorithms used in this study do not have many hyperparameters, the grid search algorithm was preferred to achieve the best results. At the same time, 10-fold cross-validation was performed for hyperparameter optimization. To perform cross-validation, a subset of the data is allocated for validation as “test data”. The reserved subset is not used to train the model but is kept for later use in the validation test. Once the model has been trained, there is a need for reassurance about how well the model will work on data not previously encountered during training. Therefore, the prediction accuracy and performance of the model are tested. Based on the model’s performance on the test data, it is determined whether the model is under-, over-, or well-tuned.

2.5 Model performance comparison metrics

Many different criteria are used to compare the performance of ML models. These metrics are widely used to assess the quality of binary and multiclass classification for ML methods. In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Accuracy measures the proportion of correctly classified samples among all samples, as shown in (1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision measures the proportion of true positives among all cases classified as positive (2).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall (3) shows the proportion of true positives among all actual positive cases.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F1 score (4) is the harmonic mean of precision and recall and provides a balance between the two measures.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

2.6 Data analysis

In this study, the IBM Statistical Package for the Social Sciences (SPSS) 22.0 program was used for statistical analysis. The conformity of the parameters to a normal distribution was evaluated by the Shapiro-Wilk test. In addition to descriptive statistical methods (mean, standard deviation, and frequency), significance was evaluated at the p < .05 level. Python 3.0 was used as the main programming language, and libraries such as NumPy, pandas, and scikit-learn were used for the prediction of ML algorithms. In this research, statistical hypothesis testing was also used for the ML algorithms used in water potability prediction.

The hypotheses for normally distributed data are as follows:
µ0 = mean of actual test values, µ1 = mean of predicted values
H0: µ0 = µ1
H1: µ0 ≠ µ1

The ML models were analyzed on the author’s Windows-based personal computer with an Intel i5 7th generation processor and an NVIDIA GeForce 940MX graphics card, using Google Colab as the main IDE.

3 RESULTS

3.1 Water quality exploratory data analysis

The scatter plots of the nine features included in this study according to the potability variable are shown below.

3.1.1 pH

WHO (2017) states that the appropriate pH range for drinking water is 6.5-8.5. This range ensures that water is suitable for both human consumption and industrial use. When the pH of water is lower than 6.5, the water is likely to be acidic, and when the pH is higher than 8.5, the water is likely to be alkaline. This can change the taste and odor of the water. The pH level distribution in this research is shown in Figure 2.

3.1.2 Hardness
FIGURE 2 pH level distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 3 Hardness distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
3.1.3 Solids

Solids refers to the total mass of solids in drinking water. These can be natural minerals, salts, or waste. Solids can affect the taste, color, and odor of water. Excessive levels of solids can reduce water quality and harm human health. The distribution of total dissolved solids in this research is shown in Figure 4.

3.1.4 Sulfate

Sulfate is part of the mineral salts added to water. It is naturally produced by soil, rocks, and water sources. Sulfate levels can affect water properties such as taste, odor, and appearance. High sulfate levels can also cause vomiting, diarrhea, and other gastrointestinal problems. The sulfate distribution in this research is shown in Figure 5.

3.1.5 Chloramines

Chloramines are chemicals used to disinfect water, consisting of a combination of chlorine and ammonia. Chloramines provide longer-lasting disinfection than chlorination. Chloramines can also be formed during the breakdown of organic matter in water. At high concentrations, chloramines can combine with other organic compounds in the water to produce a foul odor and taste and can cause respiratory problems. The chloramine distribution in this research is shown in Figure 6.
FIGURE 4 Distribution of total dissolved solids in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 5 Sulfate distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 6 Chloramines distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 7 Conductivity distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

3.1.6 Conductivity

Conductivity is the ability of water to conduct electric current. When water shows high conductivity values, it can indicate the presence of high mineral concentrations. This can affect the quality of the water and, in some cases, be harmful to human health. The conductivity distribution in this research is shown in Figure 7.

3.1.7 Organic carbon

Organic carbon is a measure of organic matter added to water. Organic matter can enter water naturally or be introduced by human activities. Excessive levels of organic carbon can cloud water and cause taste, odor, and appearance problems for some people. The organic carbon distribution in this research is shown in Figure 8.

3.1.8 Trihalomethanes

Trihalomethanes are chemicals produced after the chlorination of water. They are formed during the breakdown of organic matter in water or as a result of the reaction of chlorine with water. They can be harmful to human health. Therefore, drinking water standards require trihalomethane levels to be kept below a certain limit. They are strictly monitored in water supplies due to their carcinogenic effects. The trihalomethane distribution in this research is shown in Figure 9.

3.1.9 Turbidity

Turbidity measures the density of dissolved and suspended substances in water. Turbidity measurement is important to determine the efficiency of sedimentation and filtration processes in water resources. The turbidity distribution in this research is shown in Figure 10.

Outliers and missing values were identified in the research. The parameters with missing values were pH, sulfate, and trihalomethanes (Figure 11). The characteristics of the outliers are presented in Figure 12. Accordingly, outliers were found in all nine features. SMOTE was used to avoid an imbalanced class distribution.

In this section, graphical data representations and a statistical summary of the dataset are given. The results of the statistical analysis of the dataset features are shown in Table 2. Feature statistics are based on count, mean, standard deviation (std), minimum (min), 25%, 50%, 75%, and maximum (max) values. The analysis shows that the dataset contains 2400 rows for each feature.

Note: The distribution of potability rates according to the water quality characteristics in the dataset is shown in Table 3.
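The class-balancing step with SMOTE mentioned in this section can be illustrated with a simplified sketch. The study does not state which SMOTE implementation was used, so the function below is an assumption-laden illustration of the core idea only: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name, the sample values, and the choice of k are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features: (pH, hardness)
minority = [[7.1, 190.0], [6.8, 205.0], [7.4, 198.0], [7.0, 210.0]]
new_points = smote_sketch(minority, n_new=3, k=2)
```

Because each synthetic point is a convex combination of two real minority samples, it always stays within the range spanned by the observed minority data. Oversampling of this kind is normally applied to the training portion only, so that the test data contain no synthetic samples.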
FIGURE 8 Organic carbon distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 9 Trihalomethanes distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 10 Turbidity distribution in this research. [Color figure can be viewed at wileyonlinelibrary.com]

In this study, histograms were used to visualize the data distribution of prediction variables based on the target variable. The bars with different colors in the graphs show the distribution of potable and nonpotable water samples (Figure 13). Based on the graphs, it is seen that the pH, hardness, solids, chloramines, and sulfate values of potable water samples are higher than those of nonpotable water samples. In addition, when the distributions according to the potability variable are analyzed, it is seen that the mineral contents of potable water samples are lower than those of nonpotable water samples. At the same time, trihalomethane values have a similar distribution between potable and nonpotable water samples (Figure 13).

As a result of these analyses, it can be said that there is a relationship between pH, hardness, solids, chloramines, and sulfate values and potability. However, correlation analysis was also performed for these relationships.

A correlation matrix of Pearson correlation coefficients was computed between each variable and potability. The matrix shows the correlation coefficient between each pair of variables (Table 4).

According to Table 4, the pH value has a negative correlation with potability. In addition, there is a weak correlation between sulfate, trihalomethanes, hardness variables, and potability. It can be said that these variables are not decisive for potability estimation. On the other hand, there is a moderate correlation between solids, chloramines, conductivity variables, and potability. These variables can be decisive for potability estimation.

The heatmap in Figure 14 is a visualization of the correlation matrix. Each box in the matrix shows the correlation between two variables. As the boxes get closer to dark blue, the correlation becomes negative, and as they get closer to red, the correlation becomes positive. The highest positive correlation with the target variable potability is observed with sodium, hardness, and chloride variables, while the highest negative correlation is observed with turbidity and pH variables. These results seem to be in line with the standards regarding the potability of water. In other words, among the variables affecting the potability of water, increasing parameters such as sodium, hardness, and chloride negatively affect potability, while increasing parameters such as turbidity and pH can positively affect potability (Figure 14). As can be seen in the correlation matrix, correlation values between variables are generally low. This suggests that the variables are largely independent of each other, and all variables should be used for the model to give more accurate results.

In this study, the relationship between the potability of water and the other characteristics in the dataset was detailed with pair plots (Figure 15).

When the pair plot of the other property variables according to the target variable potability is analyzed, it is seen that there are differences in the distributions of the pH, hardness, and sulfate variables, which have the strongest relationship with potability. In addition, there is no significant relationship between potability and the other variables. This indicates that potability is determined independently of the other variables except pH, hardness, and sulfate (Figure 15).
FIGURE 12 Check for outliers among the columns. [Color figure can be viewed at wileyonlinelibrary.com]
3.2 Experimental results of machine learning models

For the analysis of model performances in this study, 80% of the data was randomly allocated for training and 20% for testing. 10-fold cross-validation was applied. LR, KNN, SVM, RF, XGB, ADA, and DT algorithms were used. The accuracy, precision, recall, and F1 score performance metrics of the ML models were compared (Table 5).

According to Table 5, the ML algorithms that best predict the potability level of water are the RF and XGB algorithms, with an accuracy value of 0.79. However, when comparing the ROC curve and AUC of the ML models, the value of the AUC is 1.00 in the XGB, ADA, and DT algorithms. In these algorithms, the ROC curve is perfectly decomposed. That is, the classification process was done perfectly, and the model completely separated the positive and negative classes. In terms of precision, RF, LR, XGB, and ADA are good classifiers. In terms of recall and F1 score, KNN and XGB are good classifiers for prediction.

Table 5 also shows the p values. It is observed that there is a significant difference according to the p value in all ML algorithms. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.
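The accuracy, precision, recall, and F1 score values compared in Table 5 follow from confusion-matrix counts as defined in Equations (1) to (4). The sketch below shows that computation in plain Python; the label vectors are hypothetical and are not taken from the study's test set, and returning 0.0 for an undefined ratio is a convention assumed here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical actual and predicted potability labels (1 = potable)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

For these example labels there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.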
Class       Statistic  pH     Hardness  Solids    Chloramines  Sulfate  Conductivity  Organic carbon  Trihalomethanes  Turbidity
Potable     mean       7.06   196.00    21628.53  7.10         333.74   427.55        14.40           66.27            3.95
Potable     std        1.65   30.71     8461.10   1.47         36.39    79.88         3.37            15.93            0.78
Potable     min        1.43   98.45     320.94    2.45         203.44   210.31        4.37            14.34            1.45
Potable     25%        5.98   177.31    15378.90  6.16         310.65   369.58        12.11           56.15            3.44
Potable     50%        6.99   196.79    20507.39  7.10         332.61   424.47        14.35           66.20            3.94
Potable     75%        8.14   214.53    26786.54  8.07         356.43   482.33        16.78           77.14            4.49
Potable     max        14.00  300.29    55334.70  12.65        460.10   753.34        27.00           120.03           6.49
Nonpotable  mean       7.10   195.96    22091.35  7.18         332.67   422.45        14.30           66.55            3.97
Nonpotable  std        1.34   32.77     8694.57   1.63         44.74    77.97         3.09            15.37            0.73
Nonpotable  min        0.22   73.49     1198.94   1.39         129.00   201.61        2.20            8.57             1.49
Nonpotable  25%        6.30   177.05    15656.42  6.18         304.52   361.29        12.33           56.91            3.46
Nonpotable  50%        7.05   197.34    21397.29  7.20         332.49   415.56        14.19           66.61            3.98
Nonpotable  75%        7.88   216.76    27416.30  8.15         363.55   477.99        16.44           76.55            4.48
Nonpotable  max        11.89  317.33    56488.67  13.12        481.03   695.36        23.60           124.00           6.49
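Per-class summary statistics of the kind tabulated above (count, mean, std, min, quartiles, max) can be reproduced with the Python standard library. The sketch below is illustrative only: the pH readings are hypothetical, the `describe` helper is a name introduced here, and the use of the sample standard deviation and the "inclusive" quantile method are assumptions, since the study does not state how these statistics were computed.

```python
import statistics


def describe(values):
    """Summary statistics for one feature within one class."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical (pH, potability) rows; 1 = potable, 0 = nonpotable
rows = [(7.0, 1), (6.5, 0), (8.1, 1), (7.4, 0), (6.9, 1), (7.8, 0)]

# Group the feature values by class label, then summarize each group
by_class = {}
for ph, potable in rows:
    by_class.setdefault(potable, []).append(ph)

summary = {label: describe(values) for label, values in by_class.items()}
```

The same helper can be applied to each of the nine features in turn to build the full table.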
FIGURE 13 Visualization of potability levels of waters according to their characteristics. [Color figure can be viewed at wileyonlinelibrary.com]
                 pH     Hardness  Solids  Chloramines  Sulfate  Conductivity  Organic carbon  Trihalomethanes  Turbidity  Potability
pH               1.00   -0.09     -0.08   0.0149       -0.01    0.019         0.03            -0.02            -0.03      -0.00
Hardness         -0.09  1.00      0.02    -0.03        -0.09    -0.02         -0.02           -0.01            -0.01      -0.01
Solids           -0.08  0.02      1.00    -0.07        -0.14    0.01          -0.03           -0.03            -0.00      -0.04
Chloramines      0.01   -0.03     -0.07   1.00         0.02     0.01          -0.01           0.01             0.00       0.02
Sulfate          -0.01  -0.09     -0.14   0.02         1.00     -0.01         -0.02           -0.02            -0.00      -0.01
Conductivity     0.01   -0.02     0.01    0.01         -0.01    1.00          0.02            0.00             0.00       -0.00
Organic carbon   0.03   -0.02     -0.03   -0.01        -0.02    0.02          1.00            -0.01            -0.02      -0.03
Trihalomethanes  -0.02  -0.01     -0.03   0.01         -0.02    0.00          -0.01           1.00             -0.02      -0.00
Turbidity        -0.03  -0.01     -0.00   0.00         -0.00    0.00          -0.02           -0.02            1.00       0.00
Potability       -0.00  -0.01     -0.04   0.02         -0.01    -0.00         -0.03           -0.00            0.00       1.00
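The entries of the correlation matrix above are Pearson correlation coefficients. As a minimal sketch of how such a matrix is obtained, the following plain-Python code computes the coefficient for every pair of columns. The column values are hypothetical and the helper names are introduced here for illustration; the study reports using Python libraries but does not show its code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding pearson_r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical measurements for three variables
columns = {
    "pH": [6.8, 7.2, 7.9, 6.5, 7.4],
    "Hardness": [210.0, 190.0, 180.0, 220.0, 195.0],
    "Potability": [0, 1, 1, 0, 1],
}
matrix = correlation_matrix(columns)
```

Each coefficient lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, which is why the table above repeats each value on both sides of the diagonal.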
FIGURE 14 Statistical analysis of dataset features for correlation. [Color figure can be viewed at wileyonlinelibrary.com]

4 DISCUSSION

The Water Quality dataset used in this study has a wide range of applications and is used in different fields. It contains various parameters related to drinking water quality, and these parameters are used in evaluations of the quality of water resources. Drinking water quality needs to be monitored regularly for the healthy use of water resources. In this study, it was determined that the ML algorithm that best predicts drinking water quality is XGB, which ranks among the best models on all of the accuracy, precision, recall, F1 score, and AUC performance metrics. In a study where water quality was predicted using the same dataset, ML algorithms were evaluated using precision, recall, F1 score, and ROC curve/AUC performance metrics. It was stated that KNN in terms of precision; LASSO LARS (LL) and Stochastic Gradient Descent (SGD) in terms of recall; and SVM and Artificial Neural Network (ANN) in terms of ROC curve/AUC were very good classifiers. Although XGB was used in that study, it was reported to give moderate results (Kaddoura, 2022).

Nasir et al. (2022) used SVM, RF, LR, DT, XGB, CatBoost, and Multi-Layer Perceptron (MLP) algorithms for water quality prediction and found that the CatBoost model was the most accurate classifier, at 94.5%. In a similar dataset, RF, NN, SVM, Multinomial Logistic Regression (MLR), and Bagged Tree Model (BTM) algorithms were used to predict the water quality index, and MLR was found to be the best classifier with 99.8% accuracy (Hassan et al., 2021). In the study conducted with the dataset of the Rawal watershed created by the Pakistan Council of Research in Water Resources, MLP, Gaussian Naive Bayes, LR, SGD, KNN, DT, RF, SVM, GB, and Bagging Classifier algorithms were evaluated with MAE, MSE, RMSE, and R² parameters and with accuracy, precision, recall, and F1 score performance metrics for water quality prediction. MLP was reported to be the best classifier (Ahmed et al., 2019).

In the study by Bui et al. (2020), water quality prediction was performed with new hybrid ML algorithms. These combined decision-tree algorithms (M5P, random forest (RF), random tree (RT), and reduced error pruning tree (REPT)) with meta-classifier or hybrid algorithms (Bagging (BA), CV parameter selection (CVPS), and randomizable filtered classifier (RFC)), giving BA-M5P, BA-RF, BA-RT, BA-REPT, CVPS-M5P, CVPS-RF, CVPS-RT, CVPS-REPT, RFC-M5P, RFC-RF, RFC-RT, and RFC-REPT. The best classifier was BA-RT. In another study, DT, KNN, SVM, Discriminant Analysis (DA), and Ensemble Trees (ET) algorithms were used to predict water quality indices at a regional scale in the Naama region, located in southwestern Algeria. It was reported that the SVM classifier achieved 95.4% prediction accuracy (Derdour et al., 2022). In another water quality prediction study using a dataset similar to the one in this research, SVM, KNN, and Naive Bayes algorithms were used, and SVM achieved the highest value with 97.01% accuracy (Aldhyani et al., 2020).

These differences may be due to differences in how the data were separated into training and test sets compared with our study. In this study, the predictability of water quality with an ML approach using the Water Quality dataset is addressed from a nursing perspective. Studies show that ML methods can be a very effective approach for water quality prediction.
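The studies compared above are ranked on accuracy, precision, recall, F1 score, and AUC. As a hedged illustration of how these metrics relate to one another, the sketch below computes them from labels, hard predictions, and scores in plain Python. The toy labels and scores and all function names are hypothetical and are not the study's data or code. The AUC here uses the rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores, positive=1):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels, hard predictions, and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

metrics = classification_metrics(y_true, y_pred)
toy_auc = auc(y_true, scores)

# F1 recomputed from the precision and recall reported for LR (Potable) in Table 5.
table_f1 = f1_score(0.80, 0.68)
print(metrics, toy_auc, round(table_f1, 2))
```

Because AUC is computed from the ranking of scores while accuracy depends on the decision threshold applied to them, the two metrics need not agree for the same model.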
FIGURE 15 The pairplot analysis of dataset features. [Color figure can be viewed at wileyonlinelibrary.com]
5 CONCLUSION

Water quality is related to all of the United Nations Sustainable Development Goals. Water quality prediction with ML is the interesting part of this study. To achieve this, a comparative evaluation of ML classification models (LR, KNN, SVM, RF, XGB, ADA, and DT) was performed, and the intended model with the highest accuracy and discrimination ability was developed using SMOTE and 10-fold cross-validation. The performance of the ML algorithms used in this study was compared, and it was observed that the RF and XGB algorithms (accuracy = 0.79) gave the best prediction results. However, when the ROC curve/AUC of the ML models is compared, the AUC value is 1.00 for the XGB, ADA, and DT algorithms. For these algorithms, the ROC curve indicates perfect separation. In other words, the classification process was done perfectly, and the model completely separated the positive and negative classes. In terms of precision, RF, LR, XGB, and ADA are good classifiers. In terms of recall and F1 score, KNN and XGB are good classifiers for prediction. There is a significant difference in all ML algorithms according to the p-value. The H0 hypothesis is accepted for these algorithms. According to the H0 hypothesis, there is no difference between actual values and predicted values.

This study demonstrates the benefits of an ML tool that can be used by nurses for water quality monitoring. A better understanding of water quality by nurses can lead to better health outcomes. The use of ML algorithms for water quality prediction requires further research to achieve a wider range of applications and better results. The findings of this study support the use of ML techniques for water quality prediction and monitoring in the field of nursing. Monitoring
and predicting water quality is important to protect public health, and therefore nurses need to be further educated on this topic. This study can help nurses become more knowledgeable about water quality and protect the health of individuals.

Nurses can educate the community about the protection and treatment of water resources. Such training can raise awareness of how people can use water resources without harming them. Nurses can participate in prevention efforts to protect water resources. For example, they can support efforts to treat and recycle wastewater. Nurses can monitor the pollution and depletion of water resources. They can play an important role in terms of public health by following studies on water resources. Nurses can play an active role in the management of water resources. Since the management of water resources is critical for public health, nurses should actively work on this issue.

The dataset and ML algorithms used in this study can be used in other water quality studies. In particular, the use of ML techniques has become more important in studies investigating the relationship between pollution in water resources and human health. Therefore, it is recommended that the use of ML techniques in water quality research be more widespread. In addition, it can be suggested that physical, chemical, and biological parameters that can be used in water quality prediction should be included in the dataset and re-evaluated, and water quality predictions should be made with ML models.

TABLE 5 Comparison of performance metrics of ML models.

Model  Potability   Precision  Recall  F1 score  Accuracy  AUC   p-value
LR     Potable      0.80       0.68    0.74      0.78      0.70  .00
       Nonpotable   0.77       0.87    0.82
KNN    Potable      0.87       0.94    0.90      0.70      0.70  .02
       Nonpotable   0.66       0.44    0.53
SVM    Potable      0.66       0.76    0.70      0.64      0.50  .01
       Nonpotable   0.60       0.47    0.53
RF     Potable      0.80       0.79    0.80      0.79      0.70  .00
       Nonpotable   0.78       0.79    0.78
XGB    Potable      0.80       0.91    0.85      0.79      1.00  .00
       Nonpotable   0.74       0.51    0.60
ADA    Potable      0.80       0.88    0.84      0.78      1.00  .00
       Nonpotable   0.73       0.59    0.65
DT     Potable      0.72       0.76    0.74      0.65      1.00  .01
       Nonpotable   0.49       0.44    0.47

6 IMPLICATIONS FOR PUBLIC HEALTH NURSING

Access to clean water is both a human right and a right to health. Public health nurses should facilitate individuals’ access to clean water with their advocacy, consulting, and research roles. Therefore, it is important for public health nurses to be able to analyze water first. As shown in this study, the use of artificial intelligence techniques is prominent today. Public health nurses should also learn about this technology and use it in their research. The use of artificial intelligence methods will help nurses focus more on their main duty of care. At the same time, a different technique will be used to access clean water.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

ACKNOWLEDGMENTS
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

CONFLICT OF INTEREST STATEMENT
The authors declare that there is no conflict of interest.

DATA AVAILABILITY STATEMENT
The data is available open access via Kaggle.

ORCID
Gülengül Mermer RN, PhD https://orcid.org/0000-0002-0566-5656

REFERENCES
Ahmad, M. S., Adnan, S. M., Zaidi, S., & Bhargava, P. (2020). A novel support vector regression (SVR) model for the prediction of splice strength of the unconfined beam specimens. Construction and Building Materials, 248, 118475. https://doi.org/10.1016/j.conbuildmat.2020.118475
Ahmed, U., Mumtaz, R., Anwar, H., Shah, A. A., Irfan, R., & García-Nieto, J. (2019). Efficient water quality prediction using supervised machine learning. Water, 11(11), 2210. https://doi.org/10.3390/w11112210
Aldhyani, T. H., Al-Yaari, M., Alkahtani, H., & Maashi, M. (2020). Water quality prediction using artificial intelligence algorithms. Applied Bionics and Biomechanics, 2020, 1–12. https://doi.org/10.1155/2020/6659314
American Nurses Association. (2014). Public health nursing: Scope and standards of practice. (Second Edition).
American Public Health Association Public Health Nursing Section. (2013). The definition and practice of public health nursing: A statement of the public health nursing section. American Public Health Association.
Anmala, J., & Turuganti, V. (2021). Comparison of the performance of decision tree (DT) algorithms and extreme learning machine (ELM) model in the prediction of water quality of the Upper Green River watershed. Water Environment Research, 93(11), 2360–2373. https://doi.org/10.1002/wer.1642
Asselman, A., Khaldi, M., & Aammou, S. (2023). Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interactive Learning Environments, 31(6), 3360–3379. https://doi.org/10.1080/10494820.2021.1928235
Bailly, A., Blanc, C., Francis, É., Guillotin, T., Jamal, F., Wakim, B., & Roy, P. (2022). Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Computer Methods and Programs in Biomedicine, 213, 106504. https://doi.org/10.1016/j.cmpb.2021.106504
Bansal, M., Goyal, A., & Choudhary, A. (2022). A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decision Analytics Journal, 3, 100071. https://doi.org/10.1016/j.dajour.2022.100071
Barjouei, H. S., Ghorbani, H., Mohamadian, N., Wood, D. A., Davoodi, S., Moghadasi, J., & Saberi, H. (2021). Prediction performance advantages of deep machine learning algorithms for two-phase flow rates through wellhead chokes. Journal of Petroleum Exploration and Production, 3, 1233–1261. https://doi.org/10.1016/j.dajour.2022.100071
Braig, K. F. (2018). The European Court of Human Rights and the right to clean water and sanitation. Water Policy, 20(2), 282–307. https://doi.org/10.2166/wp.2018.045
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth International Group.
Bui, D. T., Khosravi, K., Tiefenbacher, J., Nguyen, H., & Kazakis, N. (2020). Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Science of the Total Environment, 721, 137612. https://doi.org/10.1016/j.scitotenv.2020.137612
Chumachenko, D., Meniailov, I., Bazilevych, K., Chumachenko, T., & Yakovlev, S. (2022). Investigation of statistical machine learning models for COVID-19 epidemic process simulation: Random forest, K-nearest neighbors, gradient boosting. Computation, 10(6), 86. https://doi.org/10.3390/computation10060086
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. https://doi.org/10.1007/BF00994018
Derdour, A., Jodar-Abellan, A., Pardo, M. Á., Ghoneim, S. S., & Hussein, E. E. (2022). Designing efficient and sustainable predictions of water quality indexes at the regional scale using machine learning algorithms. Water, 14(18), 2801. https://doi.org/10.3390/w14182801
Dritsas, E., & Trigka, M. (2023). Efficient data-driven machine learning models for water quality prediction. Computation, 11(2), 16. https://doi.org/10.3390/computation11020016
El Bilali, A., & Taleb, A. (2020). Prediction of irrigation water quality parameters using machine learning models in a semi-arid environment. Journal of the Saudi Society of Agricultural Sciences, 19(7), 439–451. https://doi.org/10.1016/j.jssas.2020.08.001
Ghorbani, H., Wood, D. A., Choubineh, A., Tatar, A., Abarghoyi, P. G., Madani, M., & Mohamadian, N. (2020). Prediction of oil flow rate through an orifice flow meter: Artificial intelligence alternatives compared. Petroleum, 6(4), 404–414. https://doi.org/10.1016/j.petlm.2018.09.003
Godwin, A., & Oborakpororo, O. (2019). Surface water quality assessment of Warri metropolis using Water Quality Index. International Letters of Natural Sciences, 74, 18–25. https://doi.org/10.18052/www.scipress.com/ILNS.74.18
Haghiabi, A. H., Nasrolahi, A. H., & Parsaie, A. (2018). Water quality prediction using machine learning methods. Water Quality Research Journal, 53(1), 3–13. https://doi.org/10.2166/wqrj.2018.025
Hamdard, M. H., Soliev, I., Xiong, L., & Kløve, B. (2020). Drinking water quality assessment and governance in Kabul: A case study from a district with high migration and underdeveloped infrastructure. Central Asian Journal of Water Research, 6(1), 66–81. https://doi.org/10.29258/CAJWR/2020-R1.v6-1/66-81.eng
Hao, L., & Huang, G. (2023). An improved AdaBoost algorithm for identification of lung cancer based on electronic nose. Heliyon, 9(3), e13633. https://doi.org/10.1016/j.heliyon.2023.e13633
Hassan, M. M., Hassan, M. M., Akter, L., Rahman, M. M., Zaman, S., Hasib, K. M., Jahan, N., Smrity, R. S., Farhana, J., Raihan, M., & Mollick, S. (2021). Efficient prediction of water quality index (WQI) using machine learning algorithms. Human-Centric Intelligent Systems, 1(3-4), 86–97. https://doi.org/10.2991/hcis.k.211203.001
International Council of Nurses. (2017). Nurses: A voice to lead - Achieving the Sustainable Development Goals. https://www.icnvoicetolead.com/home/
Kaddoura, S. (2022). Evaluation of machine learning algorithm on drinking water quality for better sustainability. Sustainability, 14(18), 11478. https://doi.org/10.3390/su141811478
Kaggle. (2023). Water quality dataset. https://www.kaggle.com/datasets/adityakadiwal/water-potability
Koranga, M., Pant, P., Kumar, T., Pant, D., Bhatt, A. K., & Pant, R. P. (2022). Efficient water quality prediction models based on machine learning algorithms for Nainital Lake, Uttarakhand. Materials Today: Proceedings, 57, 1706–1712. https://doi.org/10.1016/j.matpr.2021.12.334
Kuo, B. C., Ho, H. H., Li, C. H., Hung, C. C., & Taur, J. S. (2013). A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(1), 317–326. https://doi.org/10.1109/jstars.2013.2262926
Leong, W. C., Bahadori, A., Zhang, J., & Ahmad, Z. (2021). Prediction of water quality index (WQI) using support vector machine (SVM) and least square-support vector machine (LS-SVM). International Journal of River Basin Management, 19(2), 149–156. https://doi.org/10.1080/15715124.2019.1628030
Li, J., An, X., Li, Q., Wang, C., Yu, H., Zhou, X., & Geng, Y. A. (2022). Application of XGBoost algorithm in the optimization of pollutant concentration. Atmospheric Research, 276, 106238. https://doi.org/10.1016/j.atmosres.2022.106238
Lu, H., & Ma, X. (2020). Hybrid decision tree-based machine learning models for short-term water quality prediction. Chemosphere, 249, 126169. https://doi.org/10.1016/j.chemosphere.2020.126169
Ma, J., Dhiman, P., Qi, C., Bullock, G., van Smeden, M., Riley, R. D., & Collins, G. S. (2023). Poor handling of continuous predictors in clinical prediction models using logistic regression: A systematic review. Journal of Clinical Epidemiology, 161, 140–151. https://doi.org/10.1016/j.jclinepi.2023.07.017
Ma, X. (2018). Using classification and regression trees: A practical primer. IAP.
Merriam-Webster Dictionary. (2023). Artificial intelligence. https://www.merriam-webster.com/dictionary/artificial%20intelligence
Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A., & Al-Shamma’a, A. (2022). Water quality classification using machine learning algorithms. Journal of Water Process Engineering, 48, 102920. https://doi.org/10.1016/j.jwpe.2022.102920
Nevala, K. (2017). Machine learning primer. SAS Institute.
Özsezer, G. (2022). The future of artificial intelligence in nursing. Journal of Human Sciences, 19(2), 285–299. https://doi.org/10.14687/jhs.v19i2.6217
Rui, J., Zhang, H., Zhang, D., Han, F., & Guo, Q. (2019). Total organic carbon content prediction based on support-vector-regression machine with particle swarm optimization. Journal of Petroleum Science and Engineering, 180, 699–706. https://doi.org/10.1016/j.petrol.2019.06.014
Sevinç, E. (2022). An empowered AdaBoost algorithm implementation: A COVID-19 dataset study. Computers & Industrial Engineering, 165, 107912. https://doi.org/10.1016/j.cie.2021.107912
Shao, M., Wang, X., Bu, Z., Chen, X., & Wang, Y. (2020). Prediction of energy consumption in hotel buildings via support vector machines. Sustainable Cities and Society, 57, 102128. https://doi.org/10.1016/j.scs.2020.102128
United Nations. (2015). Sustainable development goals. https://www.un.org/sustainabledevelopment/sustainable-development-goals/
United Nations. (n.d.). The universal declaration of human rights. https://www.un.org/en/universal-declaration-human-rights/
van den Goorbergh, R., van Smeden, M., Timmerman, D., & van Calster, B. (2022). The harm of class imbalance corrections for risk prediction models: Illustration and simulation using logistic regression. Journal of the American Medical Informatics Association, 29(9), 1525–1534. https://doi.org/10.1093/jamia/ocac093
Wang, X., Li, Y., Qiao, Q., Tavares, A., & Liang, Y. (2023). Water quality prediction based on machine learning and comprehensive weighting methods. Entropy, 25(8), 1186. https://doi.org/10.3390/e25081186
World Health Organization. (2017). Guidelines for drinking-water quality. 4th edn. World Health Organization.
World Health Organization. (2019). Drinking-water. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/drinking-water
Zabor, E. C., Reddy, C. A., Tendulkar, R. D., & Patil, S. (2022). Logistic regression in clinical studies. International Journal of Radiation Oncology* Biology* Physics, 112(2), 271–277. https://doi.org/10.1016/j.ijrobp.2021.08.007
Zhang, A., Yu, H., Huan, Z., Yang, X., Zheng, S., & Gao, S. (2022). SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors. Information Sciences, 595, 70–88. https://doi.org/10.1016/j.ins.2022.02.038

How to cite this article: Özsezer, G., & Mermer, G. (2023). Prediction of drinking water quality with machine learning models: a public health nursing approach. Public Health Nursing, 1–17. https://doi.org/10.1111/phn.13264