
Proceedings of the 2nd International Conference on Inventive Communication and Computational Technologies (ICICCT 2018)

IEEE Xplore Compliant - Part Number: CFP18BAC-ART; ISBN:978-1-5386-1974-2

A Comparative Study of Various Classification Techniques to Determine Water Quality

Ramya Prakash
School of Electronics Engineering, VIT University, Vellore, Tamil Nadu, India

Tharun V.P
School of Electronics Engineering, VIT University, Vellore, Tamil Nadu, India

S. Renuga Devi
Associate Professor, School of Electronics Engineering, VIT University, Vellore, Tamil Nadu, India

Abstract–Classification and monitoring of water quality have attracted a lot of attention in recent years. This work focuses on determining water quality using different classification techniques, namely Decision Tree (DT), K-nearest neighbour (KNN) and Support Vector Machine (SVM), on the ground water samples of Madhya Pradesh, India. Water samples from all 51 districts of Madhya Pradesh, which had been subjected to chemical analysis, were collected. The water samples have been classified (good, average and bad quality) based on the mineral content present in the samples. A comparative study of the classification techniques was done based on the confusion matrix, accuracy of classification and Receiver Operating Characteristic (ROC). The classification is done based on the electrical conductivity levels. The results suggest that SVM is a better classification model than the KNN and DT models on the basis of these performance measures.

Keywords- water quality; classification; confusion matrix; decision tree; electrical conductivity; k-nearest neighbours; receiver operating characteristic; support vector machine

I. INTRODUCTION

Classification of water samples based on their quality is important for meeting the rapidly expanding drinking, agricultural and industrial water requirements. Processing ground water samples before consuming the water or using it for any other purpose is highly necessary. It can reduce the occurrence of chronic water borne diseases such as Hepatitis A, Typhoid fever and Dysentery, and make sure that good quality water is available for the growth of crops.

An elevated level of salinity is bad for the health of crops and humans. Hence, the deciding factor for the classification of water quality is the level of electrical conductivity of the ground water sample [1]. The acceptable range of electrical conductivity differs from state to state. The dataset of interest has been collected from the state of Madhya Pradesh, India, where the electrical conductivity of ground water should lie between 120-1500 μS/cm.

This work provides a comparative study between the Decision Tree, K-Nearest Neighbours and Support Vector Machine classification techniques. The results are evaluated based on the confusion matrix. The confusion matrix is a matrix where each row represents the instances in a predicted class while each column represents the instances in the actual class, or vice versa. It determines the accuracy of classification for each of the classification techniques.

The methodology of the study is as follows:

• Dataset Acquisition
• Data Pre-Processing
• Implementation of Model
• Evaluation of the Model

II. BACKGROUND STUDY

Classification of data into different categories is a procedure to organize large data for efficient computation. In Machine Learning, Statistics and Neural Networks, classification is a supervised learning approach where the network learns the rules of classification from the input data and uses this learning to classify a new observation accordingly. A detailed description of the algorithms is as follows:

A. DECISION TREE
Decision tree is a supervised learning algorithm which breaks a large dataset into smaller homogeneous subsets to make the classification easier and more efficient. The decision tree approach is illustrated in Figure 1.
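To make the idea of splitting a dataset into more homogeneous subsets concrete, the following is a minimal sketch of how a single split threshold could be chosen for one input. It assumes Gini impurity as the homogeneity measure, which the paper does not specify; the chloride values, the labels and the function names `gini` and `best_split` are hypothetical and serve only as illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical chloride concentrations and quality labels, for illustration only.
chloride = [20, 35, 50, 180, 220, 260]
quality = ["good", "good", "good", "bad", "bad", "bad"]

threshold, impurity = best_split(chloride, quality)
print(threshold, impurity)
```

A full decision tree would repeat this search over every input and apply it recursively to each resulting subset until a limit, such as a maximum number of splits, is reached.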

978-1-5386-1974-2/18/$31.00 ©2018 IEEE 1501



Figure 1. Decision Tree Approach

B. k-Nearest Neighbours
k-Nearest Neighbours is a simple algorithm which classifies a new observation according to the most common class among its k nearest neighbours, as measured by a distance function. The distance function can be Euclidean, Manhattan, Minkowski or Hamming distance. The most commonly used measure is the Euclidean distance function. The approach is depicted in Figure 2.

Figure 2. k-Nearest Neighbour Approach

C. Support Vector Machine
Support Vector Machine is a classification method which classifies a new input precisely with the help of support vectors. It plots all the inputs in an n-dimensional space and classifies the data after choosing the hyperplane which minimizes the error. The approach is illustrated in Figure 3.

Figure 3. Support Vector Machine Approach

III. METHODOLOGY

A. DATASET ACQUISITION
The ground water samples for classification have been collected from 51 districts of Madhya Pradesh, India. The data was collected from the Ground Water Yearbook (2015-16) published by the Central Ground Water Board, Ministry of Water Resources, Government of India [7]. The data acquisition interval is a time span of 1 year, i.e. 2014.

B. DATA PRE-PROCESSING
The pre-processing of data includes imputing data and feature scaling [2]. Imputing data means filling in the missing data using the mean of the column, the median or the most frequently used value. It is an important step, as a dataset with missing information may wrongly train the classification model.
Feature scaling and normalisation scale the data into the range of -1 to 1 or 0 to 1. If the features have widely varying values, the feature with higher values dominates the other features. Feature scaling results in better convergence and less computational time as compared to normalisation.

C. IMPLEMENTATION OF THE MODEL
The dataset was divided into 75% for training and 25% for testing. The input parameters are the concentrations of a variety of dissolved constituents or minerals, namely CO3, HCO3, Cl, SO4, NO3, F, PO4, TH, Ca, Mg, Na, K and SiO2, i.e. 13 inputs. The data is classified based on the estimation of electrical conductivity. The acceptable level of electrical conductivity differs from state to state; hence, the nominal level of electrical conductivity of a state should be known in order to classify the water samples [5]. The SVM, KNN and DT classification techniques are implemented on the dataset. The output is divided into 3 classes: good, average and bad quality samples, based on electrical conductivity. The level of electrical conductivity for Madhya Pradesh is 120-1500 μS/cm. The output categories are defined as 0-1000 μS/cm, 1000-1500 μS/cm and above 1500 μS/cm for good, average and bad water quality samples respectively.

D. EVALUATION OF THE MODEL
The model is evaluated based on the accuracy or error, the confusion matrix [6] and the ROC curve. The ratio of the sum of the elements on the main diagonal to the sum of all the elements in the matrix gives the total classification accuracy of the model. The ROC curve is a plot used to check the quality of classifiers according to the true and false positive rates. A true positive test result is one that detects the condition when it is present, whereas a false positive test result is one that detects the condition when the condition is absent.
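The pre-processing, labelling and splitting steps described in this section can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation (the paper used the Classification Learner application). The sample values and the function names are hypothetical, mean imputation and 0-to-1 scaling are picked from the options the text lists, and a sample lying exactly on a class boundary is assumed to belong to the lower class, since the paper does not say how boundaries are treated.

```python
import random

def label_from_ec(ec):
    """Map electrical conductivity (uS/cm) to a quality class.

    Assumption: a boundary value (1000 or 1500) belongs to the lower class.
    """
    if ec <= 1000:
        return "good"
    if ec <= 1500:
        return "average"
    return "bad"

def impute_mean(rows):
    """Replace missing values (None) with the mean of the known values in the column."""
    columns = list(zip(*rows))
    means = []
    for column in columns:
        known = [v for v in column if v is not None]
        means.append(sum(known) / len(known))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def min_max_scale(rows):
    """Rescale every column to the range 0 to 1."""
    columns = list(zip(*rows))
    low = [min(c) for c in columns]
    high = [max(c) for c in columns]
    return [
        [(v - low[j]) / (high[j] - low[j]) if high[j] > low[j] else 0.0
         for j, v in enumerate(row)]
        for row in rows
    ]

def split_train_test(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test, train = indices[:n_test], indices[n_test:]
    train_set = ([rows[i] for i in train], [labels[i] for i in train])
    test_set = ([rows[i] for i in test], [labels[i] for i in test])
    return train_set, test_set

# Hypothetical samples: [Cl, Na] concentrations and measured EC in uS/cm.
features = [[40.0, 25.0], [55.0, None], [120.0, 80.0], [150.0, 95.0],
            [300.0, 210.0], [None, 260.0], [90.0, 60.0], [410.0, 330.0]]
ec_values = [450, 700, 1100, 1300, 1700, 2100, 950, 2600]

labels = [label_from_ec(ec) for ec in ec_values]
scaled = min_max_scale(impute_mean(features))
(train_x, train_y), (test_x, test_y) = split_train_test(scaled, labels)
print(len(train_x), "training samples,", len(test_x), "test samples")
```

For simplicity the sketch scales the whole dataset before splitting. Scaling statistics are often computed on the training portion only, and the paper does not state which order was used.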

IV. RESULTS AND DISCUSSIONS


The classification models were implemented using the Classification Learner application (classificationLearner app). The results obtained after training the models with each classification technique are tabulated below. Table I shows the comparative analysis of the classification models based on overall accuracy and error.

Table I. Comparison based on accuracy and error
Model              Overall Accuracy    Overall Error
KNN                86.6%               13.4%
DT (splits-4)      84%                 16%
DT (splits-20)     90.8%               9.2%
DT (splits-100)    90.5%               9.5%
SVM                96.6%               3.4%

The tabulation shows that the SVM model is the best among the models discussed for classification. In the DT model, increasing the number of splits from 4 to 20 increases the overall accuracy of the classification model. However, the accuracy of the DT (splits-100) model is slightly lower than that of the DT (splits-20) model, which indicates that an increase in splits does not necessarily increase the accuracy. It can be concluded that the number of splits should be balanced according to the requirements.

Figure 4. Confusion matrix for KNN

Figure 5. Confusion matrix for DT (splits-4)

Figure 6. Confusion matrix for DT (splits-20)

Figure 7. Confusion matrix for DT (splits-100)
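The overall accuracy and error reported above follow directly from a confusion matrix: the accuracy is the sum of the main diagonal divided by the sum of all entries, and the error is its complement. The sketch below illustrates this on hypothetical actual and predicted class labels (not the paper's data), with rows taken as actual classes and columns as predicted classes; the function names are illustrative.

```python
def confusion_matrix(actual, predicted, classes):
    """Count samples per (actual class, predicted class) pair; rows are actual classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def overall_accuracy(matrix):
    """Sum of the main diagonal divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_correct_rate(matrix):
    """Fraction of each actual class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Hypothetical labels: 1 = good, 2 = average, 3 = bad.
actual = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
predicted = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

matrix = confusion_matrix(actual, predicted, classes=[1, 2, 3])
accuracy = overall_accuracy(matrix)
error = 1 - accuracy
rates = per_class_correct_rate(matrix)
print(matrix, accuracy, rates)
```

In this toy case 8 of the 10 samples lie on the diagonal, so the overall accuracy is 80%, while the per-class rates show which class contributes the misclassifications.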

Figure 8. Confusion matrix for SVM

Figures 4-8 depict the confusion matrices of the implemented classification models. Class-1, 2 and 3 represent good, average and bad quality water samples respectively. In SVM, Class-1 is classified 97.1% correctly and 2.9% wrongly. A wrong classification of class-1 means that a good quality water sample is wrongly classified as an average or bad quality water sample. Similar analysis was done for class-2 and class-3.
For accurate classification of each class, the area under its ROC curve should be close to unity, in which case the curve of true positive rate against false positive rate approaches a right angle. The ROC curves of the models are as follows:

A. KNN ROC curves

• Class-1
Figure 9. Class-1 KNN ROC curve

• Class-2
Figure 10. Class-2 KNN ROC curve

• Class-3
Figure 11. Class-3 KNN ROC curve

B. DT ROC curve

• Class-1
Figure 12. Class-1 DT ROC curve

• Class-2
Figure 13. Class-2 DT ROC curve

• Class-3
Figure 14. Class-3 DT ROC curve

C. SVM ROC curve

• Class-1
Figure 15. Class-1 SVM ROC curve

• Class-2
Figure 16. Class-2 SVM ROC curve

• Class-3
Figure 17. Class-3 SVM ROC curve

Table II shows the area under the curve of the implemented models, indicating that the SVM models classify more accurately than the KNN and DT models.

Table II. ROC - Area under the curve
Model    Class-1    Class-2    Class-3
KNN      0.899      0.793      0.844
DT       0.963      0.922      0.975
SVM      0.999      0.996      0.999

V. CONCLUSION
Monitoring the quality of water has become indispensable because of the widely prevalent chronic water borne diseases. The level of salinity in terms of electrical conductivity was the classification criterion for the ground water samples collected from the districts of Madhya Pradesh. The KNN, DT and SVM classification models implemented in MATLAB were analysed based on the confusion matrix, accuracy and ROC curve. The results suggested that SVM was better than KNN and DT, with an overall classification accuracy of 96.6% and a classification error of 3.4% [4]. The SVM model was then analysed based on its confusion matrix, which showed that class-1 was 97.1% correctly and 2.9% wrongly classified [3]. Similarly, class-2 and class-3 were classified 96.3% and 94.1% correctly. The classified water samples can be sent to the Central or State Water Board, Government of India, to indicate the amount of processing required for each sample.

VI. FUTURE WORK
On account of the large number of inputs in the dataset, Linear Discriminant Analysis [9] can be used to reduce the number of input parameters through linear combinations of two or more features [8]. Algorithms like Boosted Ensemble trees can be implemented to compare their classification accuracy with that of the KNN, DT and SVM models [10]. Deep learning techniques such as artificial neural networks can be an efficient method for classifying data with multi-variable dependencies [6].

ACKNOWLEDGMENT
The authors wish to express their sincere thanks to the Central Ground Water Board, Ministry of Water Resources, Government of India, for providing the data from the yearbook to carry out the research.

REFERENCES
[1] S. Mehdi Saghebian, M. Taghi Sattari, Rasoul Mirabbasi and Mahesh Pal. Ground Water Classification by Decision Tree method in Ardebil Region, Iran. Arabian Journal of Geosciences, 2014.
[2] Shailesh Jaloree, Anil Rajput and Sanjeev Gour. Decision tree approach to build a model for water quality. Binary Journal of Data Mining & Networking 4, 2014.
[3] M. Sakizadeh and R. Mirzaei. A comparative study of performance of K-nearest neighbors and support vector machine for classification of ground water. Journal of Mining and Environment, 2015.
[4] Amri Danades, Devie Pratama, Dian Anggraini and Diny Anggraini. Comparison of accuracy level K-nearest neighbour algorithm and support vector machine algorithm in classification water quality status. System Engineering and Technology (ICSET), 2016.
[5] Onder Gursoy. Determining the most appropriate classification methods for water quality. Earth and Environmental Sciences 44, 2016.
[6] Salisu Yusuf Muhammad, Mokhairi Makhtar, Azilawati Rozaimee, Azwa Abdul Aziz and Azrul Amri Jamal. Classification Model for Water Quality using Machine Learning Techniques. International Journal of Software Engineering and Its Applications, 2015.
[7] Central Ground Water Board, Ministry of Water Resources, Government of India - All data source.
[8] Sangam Shrestha, Futaba Kazama and Takashi Nakamura. Use of principal component analysis, factor analysis and discriminant analysis to evaluate spatial and temporal variations in water quality of the Mekong River. Journal of Hydroinformatics, 2008.
[9] M. Sakizadeh. Assessment the performance of classification methods in water quality studies, a case study in Karaj River. Environmental Monitoring and Assessment, 2015.
[10] S. R. Mounce, K. Ellis, J.M. Edwards, V. L. Speight, N. Jakomis and J.B. Boxall. Ensemble Decision Tree Models Using RUSBoost for Estimating Risk of Iron Failure in Drinking Water Distribution Systems. Water Resources Management, 2017.
