
2021 International Conference on Big Data, Artificial Intelligence and Risk Management (ICBAR)

The Water Potability Prediction Based on Active Support Vector Machine and Artificial Neural Network
Rui Zhao
College of Artificial Intelligence and Data Science, Hebei University of Technology
* Corresponding author: [email protected]

Abstract—The total amount of water on the planet is approximately 1.4 billion cubic kilometers, but only 2.5% of it is fresh water, and water drinkable by humans accounts for only 0.3% of freshwater resources. In Africa, hundreds of thousands of people fall ill or even die from drinking unclean water every year. If water potability can be predicted accurately, countries and regions can save a great deal of the human, material, and financial resources they spend on assessing the drinkability of water, which makes this an important step for the application of machine learning in water resources monitoring. Machine learning technology has matured and has been applied in many fields such as finance, biology, and medical care. Machine learning algorithms can help humans quickly identify whether water can be drunk, so that the efficiency of identifying water availability is greatly improved. Among the many machine learning methods, artificial neural networks and support vector machines have become popular because they can process large amounts of data with fast calculation speed. Therefore, selecting these two algorithms to judge the drinkability of water resources is expected to achieve the desired purpose. After constant tuning of parameters and changes to the calculation mode, a high-precision artificial neural network model and an SVM prediction model were obtained, which make judgments in an extremely efficient manner and give highly accurate predictions.

Keywords-water quality, machine learning, SVM, artificial neural network, active learning.

I. INTRODUCTION
At present, the detection of water resources is slow and inefficient, and the multiple water quality standards in use around the world are all controversial. For many countries and regions in Africa, the cost of water resources testing is high due to a lack of equipment. Thus, the purpose of this research is to use the active SVM algorithm [1] and an artificial neural network model [2] to predict whether newly discovered water resources can be consumed by humans, based on various measurements of the water (microbial content, hardness, pH value) [3].
II. METHOD
A. Data acquisition
The database used in this article comes from Kaggle, and it is a high-fidelity database with a five-star research rating in Google Scholar. During the research process, all missing (default) values were replaced with average values. The ratio of potable to non-potable data items is about 2:1, and there are nine training variables overall, such as pH, hardness, and microbial content. The specific database link is: https://www.kaggle.com/adityakadiwal/water-potability.
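For concreteness, the preprocessing described above can be sketched roughly as follows; the local file name and the column names (such as Potability) are assumptions about the downloaded Kaggle CSV, not details reported in the paper:

```python
import pandas as pd

# Assumed local copy of the Kaggle water-potability dataset.
df = pd.read_csv("water_potability.csv")

# Replace missing (default) entries with the column means, as described above.
df = df.fillna(df.mean(numeric_only=True))

# Nine candidate training variables plus the potability label (column names assumed).
X = df.drop(columns=["Potability"])
y = df["Potability"]

# Inspect the class balance of potable vs. non-potable samples.
print(y.value_counts(normalize=True))
```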
B. Design and analysis of algorithms
Machine learning models can be divided into supervised learning, semi-supervised learning, and unsupervised learning. There are many kinds of algorithms for each learning mode, such as decision trees, naive Bayes, and random forests, and each algorithm model has its own advantages and disadvantages. The two models used in this study are the active SVM model and the neural network model. The specific algorithms and prediction curves of the two models are introduced in turn below.

1) Support vector machine predictive model:
SVM is a common discriminative algorithm and a supervised learning model. The principle of the algorithm is to transform an originally non-linearly separable problem in the sample space into a linearly separable problem in a higher-dimensional feature space through a mapping method [4]. The dimension-raising expansion method used is the kernel function, and kernel-function-related content is introduced in detail below. An application example of the SVM model in a two-dimensional plane is shown below.

Figure 1. Two-dimensional SVM model application

As shown in the figure, there are two different types of shapes in the plane. The support vector machine algorithm finds the two most suitable points from the two categories as the boundary of the margin and performs a linear fit inside the margin. Finally, the fitted straight lines B1 and B2 are obtained.

The margin points selected by the two straight lines are different, so the final shapes of the lines are also different. Line B1 is far from the boundaries b11 and b12, and the area formed is wide; the effect is that there are many unknowable points inside the area and the allowable error range is relatively large, so it is called a large-margin classifier. Line B2 and the boundaries b21, b22 form a small margin area with a low error tolerance, which means that data encountered later in the classification process will not necessarily be tolerated as well. SVM likewise distinguishes soft-margin and hard-margin classifiers: the former still favors correct classification when individual outliers appear and is suitable for situations where complete linear separation is not possible, while the latter is sensitive to the data and applies when a clean linear fit is possible.
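In practice this trade-off is usually exposed through a regularization parameter; in scikit-learn's SVC, for example, a small C yields a wide, tolerant (soft) margin while a very large C approximates a hard-margin classifier. The parameter C is not mentioned in the text above, so the following is only an illustrative sketch on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping point clouds, so a perfect linear separation is impossible.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Small C -> wide, tolerant margin (soft-margin behaviour).
soft = SVC(kernel="linear", C=0.01).fit(X, y)

# Very large C -> every margin violation is penalized heavily,
# approximating a hard-margin classifier that is sensitive to outliers.
hard = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors (soft):", soft.n_support_.sum())
print("support vectors (hard):", hard.n_support_.sum())
```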

Figure 2. Two-dimensional non-linear binary classification

When the model is applied to classification problems, a frequently encountered situation is that no appropriate linear fit can be found for classification. In that case, the kernel-function expansion method can be used to expand into higher dimensions and find a suitable curve. Different kernel functions can be selected during this process, generating different SVMs. For the kernel function of a support vector machine, the commonly used choices are as follows [5]:

Linear kernel function:

K(x,y) = x·y (1)

Polynomial kernel function:

K(x,y) = [(x·y)+1]^d (2)

Radial basis function:

K(x,y) = exp(-|x-y|^2/d^2) (3)

Sigmoid kernel function:

K(x,y) = tanh(a(x·y)+b) (4)

The linear kernel function is usually used when the point set in the feature space is closely distributed and can be separated by a smooth cutting line. Compared with the other kernel functions, it is best to use the linear kernel function when possible, because its calculation is simple and efficient, but it is less applicable in practice. One of the advantages of the support vector machine model is that it can not only use a linear function, as a logistic regression algorithm does, but can also switch the kernel function to a non-linear one.

The polynomial kernel function can map the low-dimensional input space to a high-dimensional feature space, but it has many parameters, and when the polynomial order is relatively high, the element values of the kernel matrix tend toward infinity or toward zero and the computation becomes too complex.

The Gaussian radial basis function (RBF) is a highly localized kernel function that can map a sample to a higher-dimensional space. It is the most widely used kernel function: it performs relatively well for both large and small samples, and it has fewer parameters than the polynomial kernel function. In practical applications, when a non-linear kernel is needed, the Gaussian kernel function is given priority. In model tuning, the parameter degree plays an important role when the kernel is polynomial; it represents the highest degree of the polynomial. The parameters gamma and coef0 apply to the radial basis and polynomial kernel functions. gamma is the coefficient of the kernel function, and its default value is the reciprocal of the number of features. If gamma is too large, the Gaussian curve becomes less smooth, the model fits only the points near the support vectors, and it tends toward over-fitting; conversely, the Gaussian curve is smooth, the model classifies the training set poorly, and it tends toward under-fitting. The value of coef0 represents the constant b in the kernel function.

When the sigmoid kernel function is used for model training, the support vector machine implements a multi-layer neural network.
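The four kernel functions and the degree, gamma, and coef0 parameters discussed here correspond to scikit-learn's SVC options; the following sketch compares them on synthetic data (the data, parameter values, and train/test split are illustrative assumptions, not the settings used in the study):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Kernel functions (1)-(4): linear, polynomial, RBF (Gaussian), sigmoid.
models = {
    "linear":  SVC(kernel="linear"),                             # K(x, y) = x . y
    "poly":    SVC(kernel="poly", degree=3, coef0=1.0),          # K(x, y) = (x . y + coef0)^degree
    "rbf":     SVC(kernel="rbf", gamma="scale"),                 # K(x, y) = exp(-gamma * |x - y|^2)
    "sigmoid": SVC(kernel="sigmoid", gamma="scale", coef0=0.0),  # K(x, y) = tanh(gamma * x . y + coef0)
}

for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)  # feature scaling matters for RBF/sigmoid kernels
    pipe.fit(X_train, y_train)
    print(f"{name:8s} test accuracy: {pipe.score(X_test, y_test):.3f}")
```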
C. Active learning SVM model
In the field of machine learning, learning methods are divided, according to the presence or absence of training-sample labels, into supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. The SVM model uses a supervised learning method: classification samples of known categories are used to adjust the parameters of the classifier until the required effect is achieved. In practical applications, supervised learning can be optimized and improved with active learning, because of the high human-resource cost of labeling samples. The principle is roughly as follows: machine learning methods are used to extract the samples that are hardest to classify, and these are reviewed manually; after the review, the supervised or semi-supervised model is trained on the manually labeled data, which improves the training effect and makes full use of the classification experience.

Regardless of the specific active learning method, the process is divided into the following steps: 1. establish a machine learning model to make predictions; 2. use a query function to extract samples from the unlabeled data set; 3. label them manually based on expert and business experience; 4. obtain the labeled candidate-set data; 5. perform incremental learning or re-learning of the machine learning model to improve its learning effect.

The query function is one of the cores of active learning, and many methods exist. The query function used in this study is uncertainty sampling: the samples that are hardest to label are taken out of the model and labeled, to improve training efficiency and model accuracy. In this research, two methods are used to measure how difficult a sample is to label. The first is the minimum confidence measurement algorithm, which scores each two-class or multi-class case. For example, in a two-class problem, if two data points A and B are scored (0.85, 0.15) and (0.51, 0.49), the two class probabilities of point B are similar, so it is a low-confidence point; in the binary classification problem, margin sampling is equivalent to minimum confidence sampling. The second method is entropy sampling: in mathematics, entropy can be used to measure the instability of a system, and the greater the entropy, the greater the instability, so selecting data points with large information entropy for labeling helps to improve the accuracy and efficiency of machine learning. During the model training process, the accuracy curve is shown in the figure below.

Figure 3. Active SVM model accuracy curve

In the model debugging, a random number seed is used for the first training to control the selection of the training samples, so that the starting accuracy of every query algorithm is the same. In the training process, in addition to the lowest confidence algorithm (LC) and the maximum information entropy algorithm (BT), a randomly sampled labeling group (RS) is used as a control experiment. As shown in Figure 3, the abscissa is epochs, the number of training iterations over the data, and the ordinate is the accuracy rate, which is calculated as [6]:

accuracy = (TP + TN) / (TP + TN + FP + FN) (5)

where TP denotes samples judged to be true that are actually true, FN denotes samples judged to be false that are actually true, and so on.

It is concluded that, with the starting training accuracy fixed at 65% (the random number seed for the first training is fixed, so its results are the same for every strategy), after a few epochs the LC and BT algorithms both reach about 85% accuracy, while the RS accuracy stays below 80%.
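A simplified sketch of this active learning procedure is given below: an SVM with probability estimates is retrained over several query rounds, each time labeling the pool samples chosen by least confidence (LC), maximum information entropy (BT), or random sampling (RS). The synthetic data, pool sizes, and query batch size are assumptions for illustration, not the study's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def query_indices(probs, k, strategy, rng):
    """Pick k pool samples to label according to the given query strategy."""
    if strategy == "LC":   # least confidence: smallest maximum class probability
        return np.argsort(probs.max(axis=1))[:k]
    if strategy == "BT":   # maximum information entropy
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return np.argsort(entropy)[-k:]
    return rng.choice(len(probs), size=k, replace=False)  # RS: random baseline

rng = np.random.default_rng(0)   # fixed seed so every strategy starts from the same labeled set
X, y = make_classification(n_samples=2000, n_features=9, random_state=0)
labeled = list(rng.choice(len(X), size=50, replace=False))
pool = [i for i in range(len(X)) if i not in set(labeled)]

for strategy in ("LC", "BT", "RS"):
    lab, unl = list(labeled), list(pool)
    for _ in range(10):  # 10 query rounds of 20 newly labeled samples each
        clf = SVC(kernel="rbf", probability=True).fit(X[lab], y[lab])
        picked = query_indices(clf.predict_proba(X[unl]), k=20, strategy=strategy, rng=rng)
        lab += [unl[i] for i in picked]
        unl = [v for i, v in enumerate(unl) if i not in set(picked)]
    final = SVC(kernel="rbf", probability=True).fit(X[lab], y[lab])
    print(strategy, "accuracy:", round(final.score(X, y), 3))  # accuracy on the full synthetic set
```

Because the seed is fixed, all three strategies start from the same initial labeled set, mirroring the controlled comparison described above.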
D. Artificial neural networks

In addition to the active learning method, the research also uses an artificial neural network for deep learning, and the accuracy and efficiency of the two are compared. During the research process, Keras is used to construct a one-dimensional neural network. The first layer is a dense input layer, and the data set is fed in as one-dimensional arrays; it is followed by three filtering, fully connected layers whose activation functions are all ReLU, and the final output layer uses the sigmoid function.
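One possible Keras realization of the network described above (a dense input layer, further fully connected ReLU layers, and a sigmoid output) is sketched below; the layer widths, optimizer, loss, and training settings are assumptions, since the paper does not report them:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 9  # nine water-quality features per sample, binary potability label

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),    # dense input-side layer ("density" above)
    layers.Dense(32, activation="relu"),    # three fully connected ("full link") layers
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # sigmoid output for potable / not potable
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Illustrative random data; in the study the Kaggle water-potability features are used.
X = np.random.rand(256, n_features).astype("float32")
y = np.random.randint(0, 2, size=(256,))

# validation_split yields per-epoch training and validation accuracy/loss.
history = model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
print(history.history.keys())
```

The history object returned by fit holds the per-epoch training and validation accuracy and loss, which is the kind of data used to draw curves like those in Figure 4 below.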

The training results of the constructed neural network are shown in the following figure.

Figure 4. Neural network training accuracy and error rate curve

During the training process of the artificial neural network framework, the study uses two indicators, accuracy and loss rate, to judge its performance. The accuracy formula was given above in Eq. (5), and the loss rate is calculated as error rate = (FP + FN) / (TP + TN + FP + FN). In the figure there are curves for both training accuracy and validation accuracy: the training accuracy is the accuracy of the model on the training set, and the validation accuracy is its accuracy on the validation set; the training and validation error rates are defined analogously. If the difference between the two is too large, there is a problem with the trained model. It can be seen from the figure that the accuracy of the model on the training set eventually tends to 0.83, and the validation accuracy is about 0.15 lower. Similarly, the error rate on the training set is about 0.4, and the error rate on the validation set is about 0.2 higher.

III. RESULT AND DISCUSSION

In this study, we examined the training results of the active SVM framework and the artificial neural network on the water quality database. Among the selected input variables, the pH value, hardness, microbial content, and metal content of water all directly affect the drinkability of water resources, and in the data these parameters play a decisive role in determining the dependent variable. It can be seen from the final results that, under the active learning SVM algorithm, incremental learning using low-confidence and maximum-information-entropy data points works better than the random learning algorithm. For the artificial neural network framework, the accuracy obtained is slightly higher than that of active learning. However, in terms of innovation, the active learning SVM framework proposes a new feasible method for combining machine learning with water resources detection.

IV. CONCLUSION

To conclude, after training the models many times, a suitable support vector machine model and neural network model were selected. In the complex tuning process, the appropriate parameter values often depend on the curve composition of the data set model. Regarding the data set, the first choice to make is how to handle its vacant (missing) data; the main methods include the median value, zero padding, and the average value. After multiple trainings, using the average value in place of missing data worked best. In the process of establishing the SVM model, owing to the large amount of calculation required by the linear function, the Gaussian kernel function was finally selected, which improved the overall training accuracy. The selection of the filter density and of the connection functions when building the neural network is also a difficult problem, and it takes multiple training tests to find the most suitable values; likewise, the choice of the sigmoid function for the final output layer was arrived at only after many trainings. In this study, due to equipment limitations, the model parameters have not reached their best accuracy, and there is still room for improvement. In the calculation of the active learning query function, the confidence and entropy calculations have not been tuned to the most appropriate level, and further optimization is possible. In the neural network, the input-layer density value is optimal only within a limited test range, and it cannot be ruled out that better values exist; the number of intermediate fully connected layers and the functions used can continue to be improved, and the final output-layer function remains to be fully determined.

ACKNOWLEDGEMENT

My deepest gratitude goes first and foremost to Professor Robert Murphy, my supervisor, for his constant encouragement and guidance. He helped me a lot with the theoretical basis, taught me how to study machine learning theory in depth, let me learn about its various algorithms, and helped me quickly determine the research direction.

Second, I would like to express my heartfelt gratitude to Dr. Zhang, who helped me apply machine learning algorithms in my code. Under his teaching, I learned how to build a basic model and carry out basic operation and parameter adjustment.

Last, my thanks go to my beloved family for their loving consideration and great confidence in me through all these years.

REFERENCES

[1] M. Goudjil, M. Koudil, N. Hammami, M. Bedda and M. Alruily, "Arabic text categorization using SVM active learning technique: An overview," 2013 World Congress on Computer and Information Technology (WCCIT), 2013, pp. 1-2, doi: 10.1109/WCCIT.2013.6618666.

[2] G. M. Nicoletti, "Artificial neural networks (ANN) as simulators and emulators-an analytical overview," Proceedings of the Second International Conference on Intelligent Processing and Manufacturing of Materials (IPMM'99), vol. 2, 1999, pp. 713-721, doi: 10.1109/IPMM.1999.791476.

[3] Heyi Wang, Yi Gao, Zhaoan Xu and Weidong Xu, "An recurrent neural network application to forecasting the quality of water diversion in the water source of Lake Taihu," 2011 International Conference on Remote Sensing, Environment and Transportation Engineering, 2011, pp. 984-988, doi: 10.1109/RSETE.2011.5964444.

[4] T. Dai and Y. Dong, "Introduction of SVM Related Theory and Its Application Research," 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), 2020, pp. 230-233, doi: 10.1109/AEMCSE50948.2020.00056.

[5] H. Song, Z. Ding, C. Guo, Z. Li and H. Xia, "Research on Combination Kernel Function of Support Vector Machine," 2008 International Conference on Computer Science and Software Engineering, 2008, pp. 838-841, doi: 10.1109/CSSE.2008.1231.

[6] R. Medar, V. S. Rajpurohit and B. Rashmi, "Impact of Training and Testing Data Splits on Accuracy of Time Series Forecasting in Machine Learning," 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), 2017, pp. 1-6, doi: 10.1109/ICCUBEA.2017.8463779.
