Analogy of Water Quality Prediction Using SVM and Xgboost Algorithms
submitted in partial fulfilment of the requirements for the award of the degree of
submitted by
POOJA S (20SUCS32)
Madurai – 625009
APRIL - 2023
THIAGARAJAR COLLEGE (AUTONOMOUS)
(Affiliated to Madurai Kamaraj University)
Re-Accredited with “A++ Grade” by NAAC
Ranked 22nd in NIRF
DEPARTMENT OF COMPUTER SCIENCE
BONAFIDE CERTIFICATE
External Examiner
POOJA S (20SUCS32),
Madurai– 625009.
DECLARATION
Place:
Date:
(POOJA S)
VARSHAA LAKSHMI T G (20SUCS41),
Madurai– 625009.
DECLARATION
Place:
Date:
(VARSHAA LAKSHMI T G)
ACKNOWLEDGEMENT
2. SYSTEM ANALYSIS
2.1 Existing System 3
2.2 Proposed System 3
2.3 Feasibility Study 4
3. SYSTEM SPECIFICATION
3.1 Hardware Specification 7
3.2 Software Specification 7
3.3 Software Description 8
4. PROJECT DESCRIPTION 19
5. SYSTEM DESIGN
5.1 Input Design 23
5.2 Output Design 24
5.3 Dataset 26
5.4 Data Flow Diagram 31
6. SYSTEM TESTING & IMPLEMENTATION 34
7. SAMPLES
7.1 Coding 38
7.2 Screen Shots 54
8. CONCLUSION 85
9. FUTURE ENHANCEMENT 86
10. BIBLIOGRAPHY
10.1 Book References 87
10.2 Web References 88
INTRODUCTION
1. INTRODUCTION
1.1 Abstract
Water quality plays an important role in any aquatic system: it influences the growth of aquatic organisms and reflects the degree of water pollution. Water quality prediction aims to forecast the variation trend of water quality at a certain time in the future so that appropriate management can be carried out over a period of time. Accurate water quality prediction plays a crucial role in environmental monitoring, ecosystem sustainability, and human health, and fresh water is a critical resource for the survival of agriculture and industry. During the last few decades, the quality of water has deteriorated significantly due to pollution and many other issues. As a consequence, there is a need for models that can make accurate projections about water quality.
The prediction of water quality with high accuracy is the key to controlling water pollution and improving water management. This experiment was conducted to compare the performance of two advanced machine learning algorithms, Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost), to determine the most suitable technique for predicting water quality. We conclude our prediction by analyzing which algorithm suits water quality prediction best.
1.2 Aims and Objectives
SYSTEM ANALYSIS
2. SYSTEM ANALYSIS
2.1 Existing System
In the existing system, water quality prediction is done with only the Random Forest algorithm. It does not include features for analyzing the dataset in a more efficient way.
2.2 Proposed System
In the proposed system we have done the water quality predictions using two algorithms: SVM and XGBoost. We have performed the analogy between them and concluded which algorithm is best. We have done feature scaling using the StandardScaler function, so the variations in the dataset values are greatly reduced.
2.3 Feasibility Study
The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out, to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
● Economic Feasibility
● Technical Feasibility
● Operational Feasibility
Economic Feasibility
An organization makes a considerable investment in a system, so the system should be worth the amount spent on it. The financial benefit should equal or exceed the cost of the system, and the cost should not exceed the expected benefit.
Every organization wants to reduce cost, but at the same time the quality of the service should be maintained. The system is developed according to the cost estimate made by the organization. In this project, the proposed system will definitely reduce the cost, and new techniques are used for predicting the water quality. Also, the prediction is done on smaller subsets of the dataset to analyze which algorithm is best.
Technical Feasibility
Technical feasibility is the study of the software and how it fits into our project. Regarding this, there are some technical issues that should be noted; they are as follows:
● Is the necessary technology available, and how can it be acquired?
● Does the proposed equipment have the technical capacity to hold the data required by the new system?
● Will the system provide adequate responses to requests within an acceptable time interval?
● Can the system be expanded after this project is developed?
● Are there technical guarantees of accuracy, reliability, data access and security?
These technical issues are raised during the feasibility study while investigating our system. Thus, the technical consideration evaluates the hardware requirements, software, and so on. This system uses Python as the front end and machine learning as the back end, and they provide sufficient memory to hold and process the data. As the company installs the whole process on one system, it is a cheap and efficient technique.
This system accepts the entire dataset given as input, and the response is produced without failure or delay. Technical feasibility is a study of the resources available and how they are combined into an acceptable system; it is an essential part of the analysis and is conducted in parallel with the rest of the feasibility assessment.
Operational Feasibility
It is a measure of how well a proposed system solves the stated problem and takes advantage of the opportunities identified during the scope definition phase. Additionally, it also determines how the system will satisfy every requirement identified in its requirement analysis phase.
Nevertheless, to determine operational feasibility, you need to understand management's commitment towards the proposed project. If management itself initiated the project, the system is likely to be accepted and used once completed.
Regarding this project, the system is very supportive and friendly for the user and for future research. The methods are defined in an effective manner and proper conditions are given in order to avoid harm or loss of data. These are the three basic feasibility studies that are done in every project.
SYSTEM SPECIFICATION
3. SYSTEM SPECIFICATION
3.1 Hardware Specification
RAM : 8 GB
3.2 Software Specification
Frontend : PYTHON
Browser : Google Chrome
3.3 SOFTWARE DESCRIPTION
Python is a popular programming language. It was created by Guido van Rossum and released in 1991. Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.). It has a simple syntax similar to the English language, which allows developers to write programs with fewer lines than some other programming languages. It runs on an interpreter system, meaning that code can be executed as soon as it is written, so prototyping can be very quick.
Python is one of the most accessible programming languages available because its syntax is simplified and uncomplicated, placing more emphasis on natural language. Due to its ease of learning and use, Python code can be written and executed much faster than in many other programming languages. It requires relatively fewer lines of code to perform the same operations and tasks that other programming languages carry out with larger code blocks.
A simple python program is of the following form:
print("Hello World")
No header file needs to be included. To print a hello statement we just need the print function, which displays the output. There is no need for lengthy prefacing code or inclusion of libraries; the only required code is what is needed to get the job done!
Characteristics of Python
As we may have realized, the Python language revolves around the central theme of
practicality. Python is about providing the programmer with the necessary tools to get the job
done in a quick and efficient fashion. Five important characteristics make Python’s practical
nature possible:
Data Science:
Data science is the in-depth study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data using scientific methods, different technologies, and algorithms. In short, we can say that data science is all about extracting knowledge and insights from data.
Data Analysis Tools in Python:
• NumPy: NumPy is a Python library that provides mathematical functions to handle large multi-dimensional arrays. It provides various functions for arrays, linear algebra, and statistical analysis. NumPy stands for Numerical Python. It offers many practical functions for n-dimensional array and matrix operations in Python. Mathematical operations on large multidimensional arrays and matrices can be vectorized, which improves efficiency and accelerates execution.
• Pandas: One of the most used Python tools for data manipulation and analysis is Pandas. Pandas provides useful functions to manipulate large amounts of structured data and offers the easiest methods to perform analysis. It provides large data structures and can manipulate numerical tables and time-series data. For handling data, Pandas is the ideal instrument; it is intended for fast and simple data aggregation and manipulation. Pandas has two main data structures: Series, which handles and stores one-dimensional data, and DataFrame, which handles and stores two-dimensional data.
• SciPy: SciPy is a well-liked Python library for scientific computing and data analysis. SciPy stands for Scientific Python. It provides great functionality for scientific mathematics and computing. SciPy contains sub-modules for optimization, linear algebra, integration, interpolation, special functions, and other tasks common in science and engineering.
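As a small, hedged illustration of how these libraries are used together (the file name water_potability.csv refers to the dataset described later in this report):

import numpy as np
import pandas as pd

# Load the water quality dataset into a pandas DataFrame (two-dimensional labelled data).
df = pd.read_csv("water_potability.csv")
print(df.shape)            # number of rows and columns
print(df.isnull().sum())   # missing values per column
print(df.describe())       # count, mean, std, min and max for every numeric column

# NumPy works on the underlying arrays, e.g. a vectorized mean and standard deviation of the pH column.
ph = df["ph"].dropna().to_numpy()
print(np.mean(ph), np.std(ph))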
Data Visualization Tools in Python:
• Matplotlib
• Seaborn
• Plotly
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython Notebook project itself. The name Jupyter comes from the core programming languages that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows us to write our programs in Python, but there are currently over 100 other kernels that we can also use.
The Jupyter Notebook is not included with Python, so if we want to try it out, we will need to install Jupyter.
Installation
We can use a handy tool that comes with Python called pip to install Jupyter Notebook like this:
$ pip install jupyter
Starting the Jupyter Notebook Server
Now that we have installed Jupyter, let’s learn how to use it. To get started, all we need
to do is open up our terminal application and go to a folder of our choice. We recommend
using something like our Documents folder to start out with and create a subfolder there
called Notebooks or something else that is easy to remember. Then just go to that
location in our terminal and run the following command
$ jupyter notebook
This will start up Jupyter, and our default browser should open (or open a new tab) at the following URL: http://localhost:8888/tree, showing the Notebook dashboard.
Notebook Extensions
While Jupyter Notebooks have lots of functionality built in, we can add new functionality
through extensions. Jupyter actually supports four types of extensions:
• Kernel
• IPython kernel
• Notebook
• Notebook server
Creating a Notebook
Now that we know how to start a Notebook server, we should probably learn how to
create an actual Notebook document.
All we need to do is click on the New button (upper right), and it will open up a list of
choices. On the machine, we have Python 2 and Python 3 installed, so we can create a
Notebook that uses either of these. For simplicity's sake, let's choose Python 3.
Naming
We will notice that at the top of the page is the word Untitled. This is the title for the
page and the name of our Notebook. Since that isn’t a very descriptive name, let’s change it!
Just move the mouse over the word Untitled and click on the text. We will see an in-browser dialog titled Rename Notebook. Let's rename this one to Hello Jupyter.
Running Cells
A Notebook's cell defaults to using code whenever we first create one, and that cell uses the kernel that we chose when we started our Notebook. In this case, we started ours with Python 3 as our kernel, so that means we can write Python code in our code cells. Since our initial Notebook has only one empty cell in it, the Notebook can't really do anything.
Thus, to verify that everything is working as it should, we can add some Python code to
the cell and try running its contents. Let’s try adding the following code to that cell:
print('Hello Jupyter!')
Running a cell means that we will execute the cell's contents. To execute a cell, we can just select the cell and click the Run button that is in the row of buttons along the top; it's towards the middle. If we prefer using the keyboard, we can just press Shift + Enter. When we run the cell, its output is displayed directly below it.
Machine Learning:
Machine learning brings computer science and statistics together to create predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the higher the performance will be. A machine has the ability to learn if it can improve its performance by gaining more data.
A machine learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends upon the amount of data, as the huge amount of data helps to build a better
model which predicts the output more accurately.
Unsupervised learning:
This type of machine learning involves algorithms that train on unlabeled data. The algorithm scans through data sets looking for any meaningful connections. The data that the algorithms train on, as well as the predictions or recommendations they output, are not predetermined.
Supervised learning:
In this type of machine learning, data scientists supply algorithms with labeled training
data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.
Regression Algorithms:
● Linear Regression.
● Regression Trees.
● Non-Linear Regression.
● Bayesian Linear Regression.
● Polynomial Regression.
Classification Algorithms:
● Random Forest.
● Decision Trees.
● Logistic Regression.
● Support Vector Machines.
● XGBoost.
Among the supervised machine learning algorithms, we have chosen SVM and
XGBOOST Classification algorithms to predict the water quality.
SVM Algorithm :
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems, such as text classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the optimal hyperplane that differentiates the two classes very well.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
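A minimal sketch of this idea using scikit-learn's SVC on a small, hand-made two-dimensional dataset (the points and labels below are purely illustrative and are not taken from the project data):

import numpy as np
from sklearn.svm import SVC

# Two features per sample; class 0 points lie low, class 1 points lie high.
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],
              [6.0, 7.0], [7.0, 6.5], [6.5, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel finds the separating hyperplane (a straight line in two dimensions).
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)                     # the points that define the margin
print(clf.predict([[2.0, 2.0], [6.0, 6.0]]))    # expected output: [0 1]

The project itself uses the RBF kernel (see Section 7.1), which handles classes that are not linearly separable.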
XGBOOST Algorithm :
Weights play an important role in XGBoost. Weights are assigned to all the
independent variables which are then fed into the decision tree which predicts results. The
weights of variables predicted wrongly by the tree are increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then ensembled to give a stronger and more precise model. It can work on regression, classification, ranking, and user-
defined prediction problems. Some important features of XGBoost are:
Non-linearity: XGBoost can detect and learn from non-linear data patterns.
Scalability: XGBoost can run distributed thanks to distributed servers and clusters like
Hadoop and Spark, so you can process enormous amounts of data. It’s also available for many
programming languages like C++, JAVA, Python, and Julia.
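A minimal sketch of an XGBoost classifier, with the boosting-related parameters used later in this report shown explicitly (the synthetic data and parameter values are illustrative, not tuned):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for the nine water quality features.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    objective="binary:logistic",  # binary target, as with Potability
    n_estimators=100,             # number of boosted trees built one after another
    max_depth=4,                  # depth of each individual tree
    learning_rate=0.1,            # shrinks the contribution of each new tree
)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))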
PROJECT DESCRIPTION
4. PROJECT DESCRIPTION
We have imported all the necessary packages: pandas, seaborn, numpy and matplotlib. Then we read the CSV file into a dataframe object and use it for the further processing steps. Next, we perform preprocessing through Exploratory Data Analysis (EDA). In EDA, we first find the shape of the dataset, then check whether null values are present in the dataset, and then use the describe function to find the mean and other summary values. We used the interpolation and backfill methods to replace the null values in the dataset for all the features. Then, we visualized the pH values using a scatterplot to view their distribution, created a histogram visualization for all the features, and used a heatmap to see the correlation between all the features. We used an outlier visualization with a boxplot to see which features have outlier points and whether these outliers should be removed. We then performed the data-splitting process: we separated the Potability column (the dependent feature) from the remaining independent features, dividing the data into two sets. We used the sklearn library and its train_test_split function, giving 80% of the data values to the training set and 20% to the testing set. Finally, we scaled the X datasets using the StandardScaler function.
SVC CLASSIFIER: We have imported the SVC classifier from the sklearn library and created an object for it. We fit the training dataset into that object and predict the results. Then we import accuracy_score from sklearn.metrics and find the accuracy value, and then predict the accuracy for the testing dataset. Finally, we import the classification report, which includes metrics like precision, recall, f1-score and support.
4.1 Modules
● Data Extraction.
● Data Preprocessing.
● Splitting the dataset.
● Creating a model.
● Data Modeling.
● Model Evaluation.
Data Extraction
First, we extract the dataset required for the prediction. We extract it as a CSV file from the Kaggle platform. The dataset contains ten parameters to predict the quality of water for 3276 water bodies. The parameters are:
1. pH
2. Conductivity
3. Hardness
4. Solids
5. Sulfate
6. Organic_carbon
7. Trihalomethanes
8. Turbidity
9. Chloramines
10. Potability
Data Preprocessing
Data preprocessing is the technique used to transform raw data into a useful and efficient format. In this part, steps are taken to check for and fill null values in the columns of the dataset. This is done by performing Exploratory Data Analysis (EDA).
First, we check the shape of the data set. Then we check whether there are null values for the parameters, and we check the information of the data set. Next we describe the dataset, which shows the minimum value, maximum value, mean value, count, standard deviation, etc. Finally, we handle the missing values: we fill them using the interpolate and backfill methods. Using a countplot, we visualize the potability. With a scatterplot, we check whether the data values are normally distributed. Then, we view all the features using histograms, find the correlation between the features using a heatmap, and check whether outliers are present using the boxplot function.
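A brief sketch of the missing-value handling described above (the file path is a placeholder; the method names follow pandas):

import pandas as pd

df = pd.read_csv("water_potability.csv")
# Fill gaps by linear interpolation in both directions, then back-fill anything still missing.
df.interpolate(method="linear", limit_direction="both", inplace=True)
df = df.bfill()
print(df.isnull().sum())   # should now report zero missing values in every column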
Splitting the dataset
Now it is time to split the data set. We divide the data into independent and dependent features: all features are independent except Potability, which is our dependent feature. Using sklearn, we split the data set into training and testing sets with the train_test_split function, which returns four data sets. Then we perform feature scaling to keep all the data values within the same range so that the accuracy can be predicted correctly.
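The splitting and scaling described above can be sketched as follows (a minimal, self-contained illustration in which a tiny hand-made table stands in for the real features and target):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the independent features (x) and the dependent Potability feature (y).
x = pd.DataFrame({"ph":       [7.0, 6.5, 8.1, 5.9, 7.4, 6.8, 7.9, 6.2, 7.1, 6.6],
                  "Hardness": [204, 129, 224, 190, 181, 210, 175, 160, 195, 188]})
y = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 0], name="Potability")

# 80% of the rows go to the training set and 20% to the testing set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same transform to the test data.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# After scaling, every training column has mean close to 0 and standard deviation close to 1.
print(x_train.mean(axis=0).round(2))
print(x_train.std(axis=0).round(2))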
Creating a model
Now we fit the data values into the training and testing datasets and train our data model. Then, we create a model for the SVM algorithm: we import the SVC classifier, whose SVC class allows us to build a kernel SVM model, and fit our training dataset into the model. We also import the XGBoost classifier. Now it is time to evaluate the models using the accuracy score, confusion matrix and classification report. The evaluation techniques take two parameters: one is the actual data and the other is the predicted data.
Data Modeling
We fit our training datasets into our model. Then, we predict the accuracy for the x_test dataset. We import accuracy_score, classification_report and confusion_matrix. The same process is done for both algorithms.
Model Evaluation
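The evaluation step can be sketched as follows (a minimal illustration using scikit-learn's metrics on small placeholder label lists; the real values come from the trained models in Section 7.1):

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Placeholder actual and predicted labels, purely for illustration.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
pred   = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_test, pred))         # fraction of predictions that are correct
print(confusion_matrix(y_test, pred))       # rows: actual class, columns: predicted class
print(classification_report(y_test, pred))  # precision, recall, f1-score and support per class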
SYSTEM DESIGN
5. SYSTEM DESIGN
5.1 Input Design
The input design is the link between the information system and the user. It comprises developing the specifications and procedures for data preparation, the steps necessary to put transaction data into a usable form for processing, which can be achieved by having the computer read data from a written or printed document. Input design considers the following things:
Objectives
1. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of data. The data entry screens are designed in such a way that all the data manipulations can be performed.
Kaggle:
We have collected our dataset from the Kaggle website. Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. It allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle was first launched in 2010 by offering machine learning competitions and now
also offers a public data platform, a cloud-based workbench for data science, and Artificial
Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy
Howard. Nicholas Gruen was the founding chair succeeded by Max Levchin. Equity was
raised in 2011 valuing the company at $25.2 million. On 8 March 2017, Google announced
that they were acquiring Kaggle.
Parameters Description:
SOLIDS: Total solids are dissolved solids plus suspended and settleable solids in water.
TRIHALOMETHANES: The result of a reaction between the chlorine used for disinfecting
tap water and natural organic matter in the water.
POTABILITY: Potable water is defined as water that is suitable for human consumption.
5.2 Output Design
The output design determines how processed information is conveyed to the users; it is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps user decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy to use and effective. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create document, report, or other formats that contain information produced by the
system.
The output form of an information system should accomplish one or more of the following
objectives.
5.3 Dataset
[The dataset preview (sample rows and summary tables from the water potability CSV file) appears here in the original report.]
5.4 Data Flow Diagram
A data flow diagram is a graphical tool. The system models are termed data flow diagrams (DFD). A DFD is used to describe and analyse the movement of data through a system, whether manual or automated. This is a central tool and the basis from which the other components are developed. The data flow diagram in this system is based on the top-down approach. A level-0 DFD, also called a context model, represents the entire software element as a single bubble with the input and output data indicated by incoming and outgoing arrows.
A DFD consists of processes, flows, warehouses and terminators, and there are several ways to view these components. The data flow diagram shows the transfer of information from one part of the system to another; the symbol for a flow is the arrow. A DFD is intended for system developers on the one hand and the project contractor on the other, so the entity names should be adapted to the model domain or its professionals, and it is necessary to maintain consistency across all DFD levels. Creating a data flow diagram allows the programmer to write the actual code with minimal discomfort and further increases the productivity of the programmer or programming group. Data flow diagrams help the programmer figure out what options the program will need in order to handle the data it is given. Using the data flow diagram, it becomes easy to explain the program to laypeople, which saves the time the programmer would otherwise spend explaining the code to other people. The data flow diagram will also help the programmer see what will happen if certain code is injected into the program.
DFD symbol
● A diamond represents a condition or process that transforms incoming data flows into
outgoing data flows.
[Figure: Overall data flow. Start, then the dataset is taken up for data preprocessing, followed by testing and evaluation; if an error is found the flow returns to preprocessing, and if no error is found the flow proceeds to model deployment and then Stop.]
[Figure: Training phase. Dataset, then feature selection, then data mining, then model evaluation, ending with the accuracy result.]
SYSTEM TESTING & IMPLEMENTATION
6. SYSTEM TESTING & IMPLEMENTATION
System Testing is an important stage in any system development life cycle. Testing is a
process of executing a program with the intention of finding errors. The importance of
software testing and its implications with respect to software quality cannot be
overemphasized. Software testing is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding. A good test case is one that
has a high probability of finding a yet undiscovered error.
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
Testing is the set of activities that can be planned in advance and conducted
systematically. Different test conditions should be thoroughly checked and the bugs detected
should be fixed. The testing strategies formed by the user are performed to prove that the
software is free and clear from errors. To do this, there are many ways of testing the system’s
reliability, completeness and maintainability.
The important phase of software development is concerned with translating the design
specification into the error-free source code. Testing is carried out to ensure that the system
does not fail, that it meets the specification and it satisfies the user. The system testing was
carried out in a systematic manner with a test data containing all possible combinations of data
to check the features of the system. A test data was prepared for each module, which took care
of all the modules of the program.
System Testing is an important stage where the system developed is tested with
duplicate or original data. It is a process of executing a program with the intent of finding an
error. It is a critical process that can consume fifty percent of the development time.
The following are the attributes of a good test:
6.1.1 Unit Testing
In unit testing the analyst tests the programs making up a system. The software units
in a system are the modules and routines that are assembled and integrated to perform a
specific function. In a large system, many modules on different levels are needed.
Unit testing can be performed from the bottom up starting with the smallest and lowest
level modules and proceeding one at a time. For each module in a bottom-up testing, a short
program executes the module and provides the needed data.
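For instance, a small bottom-up unit test for the feature-scaling step might look like the sketch below (illustrative only; it is not taken from the project code):

import numpy as np
from sklearn.preprocessing import StandardScaler

def test_standard_scaler():
    # A tiny fixed input standing in for one batch of training features.
    x = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
    scaled = StandardScaler().fit_transform(x)
    # Each scaled column should have mean close to 0 and standard deviation close to 1.
    assert np.allclose(scaled.mean(axis=0), 0.0)
    assert np.allclose(scaled.std(axis=0), 1.0)

test_standard_scaler()
print("scaling unit test passed")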
6.1.2 Integration Testing
Integration testing is performed for this project when all the tested modules are combined to make a complete system. After integration the project works successfully.
6.1.3 Validation Testing
Validation testing can be defined in many ways, but a simple definition is that validation succeeds when the software functions in a manner that can reasonably be expected by the customer. After the validation test has been conducted, one of two possible conditions exists.
The proposed system under consideration has been tested using validation testing and found to be working satisfactorily.
6.1.4. White Box Testing
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using white box testing methods, the software engineer can derive test cases that:
● Guarantee that all independent paths within a module have been exercised at least once.
● Exercise all logical decisions on their true and false sides.
● Execute all loops at their boundaries and within their operational bounds.
● Exercise internal data structures to assure their validity.
6.1.5 Black Box Testing
This method treats the coded module as a black box. The module is run with inputs that are likely to cause errors, and the output is then checked to see whether any error occurred. This method cannot be used to test all errors, because some errors may depend on the code or algorithm used to implement the module.
6.2 Implementation
Implementation is the stage in the project where the theoretical design is turned into a working system. It is the most critical stage in achieving a successful system and in giving the users confidence that the new system will work efficiently and effectively. It involves careful planning, investigation of the current system and its constraints on implementation, and the design of methods to achieve the changeover.
The implementation process begins with preparing a plan for the implementation of the system. According to this plan, the activities are carried out; discussions are held regarding the equipment and resources, and how the activities are to be tested.
The coding step translates a detailed design representation into a programming language realization. Programming languages are vehicles for communication between humans and computers; programming language characteristics and coding style can profoundly affect software quality and maintainability. The coding is done with the following characteristics in mind:
● Ease of design to code translation.
● Code efficiency.
● Memory efficiency.
● Maintainability.
The user should be very careful while implementing a project to ensure that what they have planned is properly implemented. The user should not change the purpose of the project while implementing it, and should not go about a solution in a roundabout way; it should be direct, crisp, clear and to the point.
Implementation is the stage of the project when the theoretical design is turned into a working system. Thus, it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.
In the existing system implementation, the water quality analysis is done with only minimum efficiency. It does not include many features for analyzing the data in a more efficient way, it does not provide a high accuracy value, and it does not have enough visualization techniques to show the predictions for the obtained data.
In our project we have implemented the water quality predictions using two supervised learning algorithms: SVM and XGBoost. We have performed the analogy between them in an efficient manner and provided reliable results by adding well-organized feature scaling using StandardScaler and effective parameters in the SVC and XGBoost classifiers.
SAMPLES
7. SAMPLES
7.1 Coding
# ANALOGY OF WATER QUALITY PREDICTION USING SVM AND XGBOOST ALGORITHMS:
import pandas as pd
import numpy as np
import random
# The imports below were missing from the original listing but are required by the code that follows.
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from xgboost import XGBClassifier
df=pd.read_csv(r"C:\Users\moham\Desktop\water_potability.csv")
print(df.head())
print(df.shape)
print(df.isnull().sum())
print(df.info())
print(df.describe())
df.interpolate(method='linear',limit_direction='both',inplace=True)
print(df.head(50))
print(df.tail(50))
df=df.bfill()
print(df.tail(50))
print(df.Potability.value_counts())
a=sea.countplot(x='Potability',data=df,saturation=5.9)
plt.xticks(ticks=[0,1],labels=["NOT POTABLE","POTABLE"])
plt.show()
# Visualizing the pH values using the scatterplot function to check whether they follow a normal distribution
sea.scatterplot(df['ph'],color="red")
plt.show()
df.hist(color="yellow",ec='red',figsize=(16,16))
plt.show()
# Visualize the correlation of all features using a heat map function of seaborn
plt.figure(figsize=(12,8))
sea.heatmap(df.corr(),annot=True,cmap="rocket_r")
plt.show()
df.boxplot(figsize=(14,7))
print("DATASET....\n",df)
x=df.drop('Potability',axis=1).copy()
print("\n x features...\n")
print(x)
y=df['Potability'].copy()
print("\n Y feature...\n",y)
# 80% of the data goes to training and 20% to testing, as described in Section 4
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=58,shuffle=True)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
# Viewing the number of rows & columns of training and testing data
print("Training data:",x_train.shape)
print("Testing data:",x_test.shape)
# Modeling the object for SVM algorithm
svmclassifier=SVC(kernel='rbf',gamma=0.15)
svmclassifier.fit(x_train,y_train)
pred=svmclassifier.predict(x_test)
print(pred)
accure=accuracy_score(y_test,pred)*100
# Confusion matrix
cm=confusion_matrix(y_test,pred)
print("CONFUSION MATRIX")
ax=sea.heatmap(cm/np.sum(cm),annot=True,fmt='0.2%',cmap='Reds')
ax.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
ax.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
# Classification Report
target_names=['True','False']
clas=classification_report(y_test,pred,target_names=target_names)
print("CLASSIFICATION REPORT:\n",clas)
# Modeling the object for XGBOOST algorithm
para={'objective':'binary:logistic',
'max_depth':4,
'alpha':10,
'learning_rate':1.0,
'n_estimators':100,
'gamma':0.35,
'min_child_weight':1,
}
xgbclas=XGBClassifier(**para)
xgbclas.fit(x_train,y_train)
xgbpred=xgbclas.predict(x_test)
print(xgbpred)
acc=accuracy_score(y_test,xgbpred)*100
# Confusion matrix:
conm=confusion_matrix(y_test,xgbpred)
ax=sea.heatmap(conm/np.sum(conm),annot=True,fmt='0.2%',cmap='Greens')
ax.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
ax.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
# Classification Report:
target_names=['True','False']
classify=classification_report(y_test,xgbpred,target_names=target_names)
print("CLASSIFICATION REPORT:\n",classify)
models=pd.DataFrame({
'MODEL':['SVM','XGBOOST'],   # model names; this column is needed by the comparison plots below
'ACCURACY SCORE':[accure,acc]
})
print(models)
cols=['yellow','salmon']
fig=plt.figure(figsize=(8,15))
h=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=models,palette=cols,width=0.05,edgecolor='0.1')
sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'red'})
plt.locator_params(axis='y',nbins=40)
models.sort_values(by='ACCURACY SCORE',ascending=False)
hi=models.sort_values(by='ACCURACY SCORE',ascending=True)
fig=plt.figure(figsize=(8,15))
plt.grid(axis='x')
g=sea.set(rc={'axes.facecolor':'coral','figure.facecolor':'cyan'})
(markers,stemlines,baseline)=plt.stem(hi['ACCURACY SCORE'],basefmt="")
plt.setp(markers,marker='D',markersize=9,markeredgecolor='red',markeredgewidth=2)
plt.setp(stemlines,linestyle=':',color='blue')
plt.setp(baseline,color='green')
plt.locator_params(axis='y',nbins=40)
plt.ylabel('ACCURACY SCORE')
plt.show()
# TAKING SMALL DATA FROM THE DATASET AND PREDICTING THE ACCURACY USING tail()
# Here we store the last 500 rows in the variable 'n' and view the number of rows and columns in that data
n=df.tail(500)
print(n.shape)
# The feature/target split for this subset was missing from the original listing; it is restored
# below following the same pattern as the full dataset (split parameters are assumed).
l=n.drop('Potability',axis=1)
p=n['Potability']
l_train,l_test,p_train,p_test=train_test_split(l,p,test_size=0.2,random_state=58)
sale=StandardScaler()
l_train=sale.fit_transform(l_train)
l_test=sale.transform(l_test)
from sklearn.svm import SVC
smallsvm=SVC(kernel='rbf',gamma=0.7)
smallsvm.fit(l_train,p_train)
smallpred=smallsvm.predict(l_test)
print(smallpred)
smallaccure=accuracy_score(p_test,smallpred)*100
para={'objective':'binary:logistic',
'max_depth':6,
'alpha':10,
'learning_rate':1.0,
'n_estimators':100,
'gamma':0.15,
'min_child_weight':4,
}
clasxgb=XGBClassifier(**para)
clasxgb.fit(l_train,p_train)
smallxgbpred=clasxgb.predict(l_test)
print(smallxgbpred)
smallacc=accuracy_score(p_test,smallxgbpred)*100
smallmodels=pd.DataFrame({
'MODEL':['SVM','XGBOOST'],
'ACCURACY SCORE':[smallaccure,smallacc]
})
print(smallmodels)
scols=['blue','red']
fig=plt.figure(figsize=(6,9))
m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=smallmodels,palette=scols,width=0.05,edgecolor='0.1')
sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'salmon'})
plt.locator_params(axis='y',nbins=14)
smallmodels.sort_values(by='ACCURACY SCORE',ascending=False)
# Here we store the first 500 rows in the variable 'h' and view the number of rows and columns in that data
h=df.head(500)
print(h.shape)
# The feature/target split for this subset was missing from the original listing; it is restored
# below following the same pattern as above (split parameters are assumed).
v=h.drop('Potability',axis=1)
u=h['Potability']
v_train,v_test,u_train,u_test=train_test_split(v,u,test_size=0.2,random_state=58)
scale=StandardScaler()
v_train=scale.fit_transform(v_train)
v_test=scale.transform(v_test)
ss=SVC(kernel='rbf',gamma=0.15)
ss.fit(v_train,u_train)
sspred=ss.predict(v_test)
print(sspred)
ssaccure=accuracy_score(u_test,sspred)*100
# Restored (missing in the original listing): confusion matrix and heatmap for this block; the colour map is an assumption.
conssa=confusion_matrix(u_test,sspred)
cssa=sea.heatmap(conssa/np.sum(conssa),annot=True,fmt='0.2%',cmap='Blues')
cssa.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
cssa.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
para={'objective':'binary:logistic',
'max_depth':4,
'alpha':10,
'learning_rate':1.0,
'n_estimators':100,
'gamma':0.09,
'min_child_weight':1,
}
classe=XGBClassifier(**para)
classe.fit(v_train,u_train)
ssxgbpred=classe.predict(v_test)
print(ssxgbpred)
ssacc=accuracy_score(u_test,ssxgbpred)*100
ssmodels=pd.DataFrame({
'MODEL':['SVM','XGBOOST'],
'ACCURACY SCORE':[ssaccure,ssacc]
})
print(ssmodels)
scols=['blue','salmon']
fig=plt.figure(figsize=(6,9))
m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=ssmodels,palette=scols,width=0.05,edgecolor='0.1')
sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'yellow'})
plt.locator_params(axis='y',nbins=10)
ssmodels.sort_values(by='ACCURACY SCORE',ascending=False)
# TAKING SMALL MIDDLE DATA FROM THE DATASET AND PREDICTING THE ACCURACY
# Here we store 500 rows from the middle of the dataset in the variable 'r' and view the number of rows and columns in that data
r=df.iloc[1620:2120]
print(r.shape)
# The feature/target split for this subset was missing from the original listing; it is restored
# below following the same pattern as above (split parameters are assumed).
i=r.drop('Potability',axis=1)
j=r['Potability']
i_train,i_test,j_train,j_test=train_test_split(i,j,test_size=0.2,random_state=58)
sca=StandardScaler()
i_train=sca.fit_transform(i_train)
i_test=sca.transform(i_test)
sq=SVC(kernel='rbf',gamma=0.18)
sq.fit(i_train,j_train)
sqpred=sq.predict(i_test)
print(sqpred)
sqaccure=accuracy_score(j_test,sqpred)*100
pars={'objective':'binary:logistic',
'max_depth':4,
'alpha':10,
'learning_rate':1.0,
'n_estimators':100,
'gamma':0.12,
'min_child_weight':2,
}
clag=XGBClassifier(**pars)
clag.fit(i_train,j_train)
sqxgbpred=clag.predict(i_test)
print(sqxgbpred)
# Evaluation (Accuracy score)
sqacc=accuracy_score(j_test,sqxgbpred)*100
sqmodels=pd.DataFrame({
'MODEL':['SVM','XGBOOST'],
'ACCURACY SCORE':[sqaccure,sqacc]
})
print(sqmodels)
scols=['purple','red']
fig=plt.figure(figsize=(6,9))
m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=sqmodels,palette=scols,width=0.05,edgecolor='0.1')
sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'yellow'})
plt.locator_params(axis='y',nbins=40)
sqmodels.sort_values(by='ACCURACY SCORE',ascending=False)
# TAKING SMALL RANDOM DATA FROM THE DATASET AND PREDICTING THE ACCURACY USING THE sample FUNCTION
# Here we store a random sample of about 500 rows in the variable 'c' and view the number of rows and columns in that data
c=df.sample(frac=0.1525,random_state=76)
print(c.shape)
# The feature/target split for this subset was missing from the original listing; it is restored
# below following the same pattern as above (split parameters are assumed).
t=c.drop('Potability',axis=1)
w=c['Potability']
t_train,t_test,w_train,w_test=train_test_split(t,w,test_size=0.2,random_state=58)
scle=StandardScaler()
t_train=scle.fit_transform(t_train)
t_test=scle.transform(t_test)
ssf=SVC(kernel='rbf',gamma=0.122)
ssf.fit(t_train,w_train)
ssfpred=ssf.predict(t_test)
print(ssfpred)
ssfaccure=accuracy_score(w_test,ssfpred)*100
pas={'objective':'binary:logistic',
'max_depth':4,
'alpha':4,
'learning_rate':1.0,
'n_estimators':100,
'gamma':0.15,
'min_child_weight':2,
}
claggi=XGBClassifier(**pas)
claggi.fit(t_train,w_train)
ssfxgbpred=claggi.predict(t_test)
print(ssfxgbpred)
ssfacc=accuracy_score(w_test,ssfxgbpred)*100
ssfmodels=pd.DataFrame({
'MODEL':['SVM','XGBOOST'],
'ACCURACY SCORE':[ssfaccure,ssfacc]
})
print(ssfmodels)
cols=['red','salmon']
fig=plt.figure(figsize=(6,8))
k=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=ssfmodels,palette=cols,width=0.05,edgecolor='0.1')
sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'lightblue'})
plt.locator_params(axis='y',nbins=40)
ssfmodels.sort_values(by='ACCURACY SCORE',ascending=False)
7.2 Screenshots
[Screenshots of the program execution and its output appear here in the original report.]
CONCLUSION
8. CONCLUSION
This project has provided the accuracy for both algorithms using Python and machine learning. The project concludes by saying which algorithm suits the predictions better.
The dataset is taken from the Kaggle website and imported into our project. We have performed exploratory data analysis on the dataset; after that, a dataset with no null values or other erroneous values is obtained. Then the splitting process is done on the obtained dataset, and feature scaling is done to reduce the variations. Then the prediction on the dataset by both algorithms is carried out. A visualization of both accuracies is given, and the report concludes with the more suitable algorithm.
FUTURE ENHANCEMENT
9. FUTURE ENHANCEMENT
The system is now designed to fulfil the current needs and has achieved the primary objectives. There are some additional features that can be added to the system in the future; these features may take the application to the next level.
We can make predictions for different datasets. We can also carry out the analogy process on different machine learning algorithms. We can add separate mechanisms to find which water is best for agricultural purposes and which can be used for other purposes.
Water is the source of life and it is becoming increasingly scarce. By 2030, the gap between global demand and supplies of fresh water is expected to reach 90%. The climate crisis, population growth and the transition to clean energy may increase that deficit further.
BIBLIOGRAPHY
10. BIBLIOGRAPHY
10.1 Book References
[1]. Ambrose, R. B., Barnwell, T. O., McCutcheon, S. C., & Williams, J. R., "Computer Models for Water Quality Analysis", in Water Resources Handbook, edited by L. W. Mays (4th Edition), New York: McGraw-Hill, 1996.
[3]. Arthur W. Hounslow, "Water Quality Data: Analysis and Interpretation" (1st Edition), published by CRC Press in 1995.
[5]. Philippe Quevauviller, "Quality Assurance for Water Analysis" (1st Edition), published by Wiley in 2002.
[6]. David A. Chin, "Water Resources Engineering" (3rd Edition), published by Pearson in 2013.
[7]. Jake VanderPlas, "Python Data Science Handbook" (2nd Edition), published by O'Reilly Media, Inc. in 2022.
10.2 Web References
• https://www.researchgate.net/publication/352907194_An_Introduction_to_Water_Quality_Analysis
• https://www.kaggle.com/code/anbarivan/indian-water-quality-analysis-and-prediction
• https://chemtech-us.com/articles/parameters-in-water-quality-analysis-explained/
• https://www.geeksforgeeks.org/machine-learning/
• https://www.javatpoint.com/machine-learning