Prakash 2018
Abstract–Classification and monitoring of water quality is an important
problem that has attracted a lot of attention in recent years. This work
focuses on determining water quality using different classification
techniques, namely Decision Tree (DT), K-nearest neighbour (KNN) and Support
Vector Machine (SVM), on ground water samples of Madhya Pradesh, India. Water
samples from all 51 districts of Madhya Pradesh, which had been subjected to
chemical analysis, were collected. The samples were classified as good,
average or bad quality based on their mineral content. A comparative study of
the classification techniques was done based on the confusion matrix, accuracy
of classification and Receiver Operating Characteristic (ROC). The
classification is based on electrical conductivity levels. The results suggest
that SVM is a better classification model than the KNN and DT models on the
basis of these performance measures.

Keywords- water quality; classification; confusion matrix; decision tree;
electrical conductivity; k-nearest neighbours; receiver operating
characteristic; support vector machine

interest has been collected from the state of Madhya Pradesh, India, where the
electrical conductivity of ground water should lie between 120 and 1500 μS/cm.

This work provides a comparative study of the Decision Tree, K-Nearest
Neighbours and Support Vector Machine classification techniques. The results
are evaluated based on the confusion matrix, a matrix in which each row
represents the instances in a predicted class while each column represents the
instances in an actual class, or vice versa. It determines the accuracy of
classification for each of the classification techniques.

The methodology of the study is as follows:
Dataset Acquisition
Data Pre-Processing
Implementation of Model
Evaluation of the Model
Figure 1. Decision Tree Approach

A. DATASET ACQUISITION
The ground water samples for classification were collected from 51 districts
of Madhya Pradesh, India. The data was taken from the Ground Water Yearbook
(2015-16) published by the Central Ground Water Board, Ministry of Water
Resources, Government of India [7]. The data acquisition interval is a time
span of 1 year, i.e. 2014.
B. k-Nearest Neighbours
k-Nearest Neighbours is a simple algorithm which classifies a new observation
according to the most common class among its k nearest neighbours, as measured
by a distance function. The distance function can be the Euclidean, Manhattan,
Minkowski or Hamming distance. The most commonly used measure is the Euclidean
distance. For two observations x and y with n features, it can be written as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

B. DATA PRE-PROCESSING
The pre-processing of data includes imputing data and feature scaling [2].
Imputing data means filling in missing values using the mean of the column,
the median or the most frequent value. It is an important step, as a dataset
with missing information may train the classification model wrongly.
Feature scaling and normalisation scale the data into the range of -1 to 1 or
0 to 1. If the features have widely varying values, the features with higher
values dominate the others. Feature scaling results in better convergence and
less computational time as compared to normalisation.
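
The two pre-processing steps can be illustrated with a minimal Python sketch. The paper's models were implemented in MATLAB, so this is not the authors' code; the function names and the chloride readings are hypothetical, and mean imputation with min-max scaling is only one of the options the text lists.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale a column into the range [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to the lower bound.
        return [low for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


# Hypothetical chloride concentrations with one missing reading.
chloride = [20.0, None, 60.0, 100.0]
filled = impute_mean(chloride)                      # missing value -> column mean
scaled_0_1 = min_max_scale(filled)                  # range 0 to 1
scaled_pm1 = min_max_scale(filled, -1.0, 1.0)       # range -1 to 1
```

Scaling each of the 13 input columns this way keeps a feature with large raw values from dominating the distance computation in KNN.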
Figure 2. k-Nearest Neighbour Approach

C. Support Vector Machine
It is a classification method which classifies a new input more precisely with
the help of support vectors. It plots all the inputs in an n-dimensional space
and classifies the data after deciding the right hyper-plane, which minimizes
the error. SVM can be pictured as shown in Figure 3.

C. IMPLEMENTATION OF THE MODEL
The dataset was divided into 75% for training and 25% for testing. The input
parameters are the concentrations of a variety of solutes or minerals, namely
CO3, HCO3, Cl, SO4, NO3, F, PO4, TH, Ca, Mg, Na, K and SiO2, i.e. 13 inputs.
The data is classified based on the electrical conductivity. The measure of
electrical conductivity differs from state to state; hence, the nominal level
of electrical conductivity of a state should be known in order to classify the
water samples [5]. The SVM, KNN and DT classification techniques are
implemented on the dataset. The output is divided into 3 classes: good,
average and bad quality samples, based on electrical conductivity. The level
of electrical conductivity for Madhya Pradesh is 120-1500 μS/cm. The output
categories are 0-1000 μS/cm, 1000-1500 μS/cm and above 1500 μS/cm for good,
average and bad water quality samples respectively.
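
The labelling rule and the 75%/25% split can be sketched in Python as follows. This is an illustration rather than the authors' MATLAB implementation. The text does not say which class the boundary values 1000 and 1500 μS/cm belong to, so placing them in the lower class is an assumption, as are the function names, the random seed and the sample readings.

```python
import random


def ec_class(ec):
    """Map electrical conductivity in μS/cm to a quality class.

    Assumption: boundary values (1000, 1500) fall into the lower class.
    """
    if ec <= 1000:
        return "good"
    if ec <= 1500:
        return "average"
    return "bad"


def split_75_25(rows, seed=0):
    """Shuffle rows reproducibly, then split into 75% training and 25% testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical conductivity readings, not values from the yearbook.
readings = [450, 980, 1000, 1200, 1500, 1750, 2300, 640]
labels = [ec_class(ec) for ec in readings]
train, test = split_75_25(list(zip(readings, labels)))
```

In the study the classifier inputs are the 13 mineral concentrations; the conductivity is used only to derive the target class.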
Figure 3. Support Vector Machine Approach

D. EVALUATION OF THE MODEL
The model is evaluated based on the accuracy or error, using the confusion
matrix [6] and the ROC curve. The ratio of the sum of the main diagonal
elements to the sum of all the elements in the matrix gives the total
classification accuracy of the model. The ROC curve is a plot used to check
the quality of classifiers according to the true and false positive rates.
A true positive test result is one that detects the condition when it is
present, whereas a false positive test result is one that indicates the
condition when it is absent.
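
The evaluation measures can be made concrete with a short Python sketch. It is not the authors' MATLAB code, and the labels below are hypothetical. It builds a confusion matrix with predicted classes as rows, computes accuracy as the diagonal sum over the total, and derives the one-vs-rest true and false positive rates for one class. These rates give a single operating point; a full ROC curve would need classifier scores evaluated at several thresholds.

```python
def confusion_matrix(actual, predicted, classes):
    """Count outcomes; rows are predicted classes, columns are actual classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


def accuracy(matrix):
    """Sum of the main diagonal divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def tpr_fpr(matrix, k):
    """One-vs-rest true positive rate and false positive rate for class k."""
    n = len(matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k][j] for j in range(n) if j != k)  # predicted k, actually other
    fn = sum(matrix[i][k] for i in range(n) if i != k)  # actually k, predicted other
    tn = sum(sum(row) for row in matrix) - tp - fp - fn
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels for eight test samples.
classes = ["good", "average", "bad"]
actual = ["good", "good", "good", "average", "average", "bad", "bad", "bad"]
predicted = ["good", "good", "average", "average", "average", "bad", "bad", "good"]

cm = confusion_matrix(actual, predicted, classes)
acc = accuracy(cm)            # classification error is 1 - acc
tpr, fpr = tpr_fpr(cm, 0)     # rates for the "good" class
```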
III. METHODOLOGY
Figure 8. Confusion matrix for SVM
Figure 10. Class-2 KNN ROC curve
B. DT ROC curve
[ROC curve panels: Class-1, Class-2, Class-3]
Figure 15. Class-1 SVM ROC curve

V. CONCLUSION
Monitoring the quality of water has become indispensable because of the widely
prevalent chronic water-borne diseases. The level of salinity, in terms of
electrical conductivity, was the classification criterion for the ground water
samples collected from the districts of Madhya Pradesh. The implementations of
the KNN, DT and SVM classification models in MATLAB were analysed based on the
confusion matrix, accuracy and ROC curve. The results suggested that SVM was
better than KNN and DT, with an overall classification accuracy of 96.6% and a
classification error of 3.4% [4]. Later, the SVM model was analysed based on
the confusion matrix, which suggested that class-1 was 97.1% correctly and
2.9% wrongly classified [3]. Similarly, class-2 and class-3 were classified
96.3% and 94.1% correctly. The classified water samples can be sent to the
Central or State Water Board, Government of India, to indicate the amount of
processing required for each sample.

ACKNOWLEDGMENT
The authors wish to express their sincere thanks to the Central Ground Water
Board, Ministry of Water Resources, Government of India, for providing the
data from the yearbook to carry out the research.

REFERENCES
[1] S. Mehdi Saghebian, M. Taghi Sattari, Rasoul Mirabbasi and Mahesh Pal.
Ground Water Classification by Decision Tree method in Ardebil Region, Iran.
Arabian Journal of Geosciences, 2014.
[2] Shailesh Jaloree, Anil Rajput and Sanjeev Gour. Decision tree approach to
build a model for water quality. Binary Journal of Data Mining & Networking 4,
2014.
[3] M. Sakizadeh and R. Mirzaei. A comparative study of performance of
K-nearest neighbors and support vector machine for classification of ground
water. Journal of Mining and Environment, 2015.
[4] Amri Danades, Devie Pratama, Dian Anggraini and Diny Anggraini. Comparison
of accuracy level K-nearest neighbour algorithm and support vector machine
algorithm in classification water quality status. System Engineering and
Technology (ICSET), 2016.
[5] Onder Gursoy. Determining the most appropriate classification methods for
water quality. Earth and Environmental Sciences 44, 2016.
[6] Salisu Yusuf Muhammad, Mokhairi Makhtar, Azilawati Rozaimee, Azwa Abdul
Aziz and Azrul Amri Jamal. Classification Model for Water Quality using
Machine Learning Techniques. International Journal of Software Engineering and
Its Applications, 2015.
[7] Central Ground Water Board, Ministry of Water Resources, Government of
India. All data source.
[8] Sangam Shrestha, Futaba Kazama and Takashi Nakamura. Use of principal
component analysis, factor analysis and discriminant analysis to evaluate
spatial and temporal variations in water quality of the Mekong River. Journal
of Hydroinformatics, 2008.