
ANALOGY OF WATER QUALITY PREDICTION USING SVM

AND XGBOOST ALGORITHMS

submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF SCIENCE IN COMPUTER SCIENCE

submitted by

POOJA S (20SUCS32)

VARSHAA LAKSHMI T G (20SUCS41)

under the guidance of

Mr. R. CHANDRASEKAR M.C.A., M.Phil(CA)., SET.

DEPARTMENT OF COMPUTER SCIENCE

THIAGARAJAR COLLEGE (AUTONOMOUS)

(Affiliated to Madurai Kamaraj University)

Re-Accredited with “A++ Grade” by NAAC

Ranked 22nd in NIRF

Madurai – 625009

APRIL - 2023
THIAGARAJAR COLLEGE (AUTONOMOUS)
(Affiliated to Madurai Kamaraj University)
Re-Accredited with “A++ Grade” by NAAC
Ranked 22nd in NIRF
DEPARTMENT OF COMPUTER SCIENCE

BONAFIDE CERTIFICATE

This is to certify that the project work entitled “ANALOGY OF
WATER QUALITY PREDICTION USING SVM AND XGBOOST
ALGORITHMS” has been done by POOJA S (20SUCS32) and
VARSHAA LAKSHMI T G (20SUCS41), in partial fulfillment of the
requirements for the award of the degree of Bachelor of Science in
Computer Science, and presented during the final viva voce examination
held on ____________.

Project Guide Head of the Department

(Mr. R. CHANDRASEKAR) (Mrs. SM. VALLI)

External Examiner
POOJA S (20SUCS32),

III BSc (Computer Science) SF,

Thiagarajar College (Autonomous),

Madurai– 625009.

DECLARATION

I hereby declare that this software product “ANALOGY OF
WATER QUALITY PREDICTION USING SVM AND XGBOOST
ALGORITHMS” has been developed entirely to the best of my
knowledge and has not been copied from any other source of information.
This software has been developed purely by me.

Place:

Date:

(POOJA S)
VARSHAA LAKSHMI T G (20SUCS41),

III BSc (Computer Science) SF,

Thiagarajar College (Autonomous),

Madurai– 625009.

DECLARATION

I hereby declare that this software product “ANALOGY OF
WATER QUALITY PREDICTION USING SVM AND XGBOOST
ALGORITHMS” has been developed entirely to the best of my
knowledge and has not been copied from any other source of information.
This software has been developed purely by me.

Place:

Date:

(VARSHAA LAKSHMI T G)
ACKNOWLEDGEMENT

We extend our sincere thanks to the almighty for giving us


physical, mental strength and ability for the effective completion of
the project.

We are grateful to our principal Dr. D. PANDIARAJA M.Sc.,
M.Phil., P.G.D.CA., B.Ed., Ph.D., Thiagarajar College, Madurai, for his
encouragement to take up this project.

We would like to express our profound gratitude for the support
offered by our Head of the Department of Computer Science, Mrs. SM.
VALLI B.E., M.C.A., M.Phil. We are also thankful to her for guiding
and encouraging us in this project work.

We also extend our gratitude and sincere thanks to our project
guide Mr. R. CHANDRASEKAR M.C.A., M.Phil (CA)., SET. We are
grateful to him for having faith in our ability and entrusting us with this
project.

We thank all the faculty members of the Computer Science Department
for their cooperation. We also personally thank our family members and
friends for their continuous encouragement.
TABLE OF CONTENTS

S.NO TITLES PAGENO


1. INTRODUCTION
1.1 Abstract 1
1.2 Aims and Objectives 2

2. SYSTEM ANALYSIS
2.1 Existing System 3
2.2 Proposed System 3
2.3 Feasibility Study 4

3. SYSTEM SPECIFICATION
3.1 Hardware Specification 7
3.2 Software Specification 7
3.3 Software Description 8

4. PROJECT DESCRIPTION 19
5. SYSTEM DESIGN
5.1 Input Design 23
5.2 Output Design 24
5.3 Dataset 26
5.4 Data Flow Diagram 31

6. SYSTEM TESTING & IMPLEMENTATION


6.1 System Testing 34
6.2 Implementations 36

7. SAMPLES
7.1 Coding 38
7.2 Screen Shots 54

8. CONCLUSION 85
9. FUTURE ENHANCEMENT 86
10. BIBLIOGRAPHY
10.1 Book Reference 87
10.2 Web Reference 88
INTRODUCTION
1. INTRODUCTION

1.1 Abstract

Water quality plays an important role in any aquatic system; for example, it can influence
the growth of aquatic organisms and reflect the degree of water pollution. Water quality
prediction is one of the purposes of model development and use, which aims to achieve
appropriate management over a period of time. Water quality prediction forecasts the
variation trend of water quality at a certain time in the future. Accurate water quality
prediction plays a crucial role in environmental monitoring, ecosystem sustainability,
and human health. Fresh water is a critical resource for the survival of agriculture and industry.

During the last few decades, the quality of water has deteriorated significantly due to
pollution and other issues. As a consequence, there is a need for models that can make
accurate projections about water quality.

Moreover, predicting future changes in water quality is a prerequisite for the
early control of intelligent aquaculture. Therefore, water quality prediction has
great practical significance. Water quality testing is a strategy for finding clean drinking
water. Testing the nature of a water body, both surface water and groundwater, can help us
answer questions about whether the water is satisfactory for drinking,
washing, or irrigation, to give a few examples of applications.

Predicting water quality with high accuracy is key to controlling water
pollution and improving water management. This experiment was conducted to
compare the performance of two advanced machine learning algorithms,
Support Vector Machine (SVM) and Extreme Gradient Boosting
(XGBoost), to determine the most suitable technique for predicting water
quality. We conclude by analyzing which algorithm suits water
quality prediction best.

1
1.2 Aims and Objectives

The main aims and objectives are:

• To analyze which algorithm works better in predicting on larger datasets.

• To analyze smaller datasets and find the accuracy.
• To model and predict water quality, which has become very important in
controlling water pollution.
• To check the quality of water to avoid water-borne diseases and improve
health.
• To promote environmental, economic and social sustainability.
• To improve water supply reliability and quality.

2
SYSTEM ANALYSIS
2. SYSTEM ANALYSIS

2.1 Existing System

In the existing system, water quality prediction is done with only the Random
Forest algorithm. It does not include features for analyzing the dataset in a
more efficient way.

Disadvantages of Existing system

● Does not provide a high accuracy value.

● Does not have enough visualizations to show the predictions.
● Output visualizations use older plot styles.

2.2 Proposed System

In the proposed system we have done the water quality predictions using two
algorithms: SVM and XGBoost. We have performed the analogy between them and
concluded which algorithm is best. We have done the feature scaling using the
StandardScaler function, so the variations in scale across the dataset values are greatly reduced.

Advantages of Proposed system

• Analogy between two algorithms: SVM and XGBoost.

• Visualizations are presented in different styles.
• To check whether the algorithms work correctly and efficiently, we take
smaller datasets with 500 data values and predict the water quality.
• The smaller datasets are taken from our original dataset using the head,
tail, iloc and sample functions, as shown in the sketch after this list.
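
A minimal sketch of these two ideas, assuming a pandas DataFrame loaded from the water potability CSV described later in this report (the file path is illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed local copy of the water potability dataset
df = pd.read_csv("water_potability.csv")

# Smaller datasets of 500 data values taken from the original dataset
first_500 = df.head(500)                        # from the start
last_500 = df.tail(500)                         # from the end
middle_500 = df.iloc[1000:1500]                 # a middle slice
random_500 = df.sample(n=500, random_state=1)   # random rows

# Feature scaling with StandardScaler: every feature gets zero mean and unit variance
features = first_500.drop("Potability", axis=1)
scaled = StandardScaler().fit_transform(features)
print(scaled.mean(axis=0).round(2))   # approximately all zeros after scaling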

3
2.3 Feasibility Study

The feasibility of the project is analyzed in this phase and business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis the
feasibility study of the proposed system is to be carried out. This is to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some understanding of the
major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
● Economical Feasibility
● Technical Feasibility
● Operational Feasibility

Economical Feasibility
An organization makes a significant investment in the system, so the system should be worth
the amount spent on it. The financial benefit should always equal or exceed the cost of
the system; the cost should not exceed the benefit.

● The cost of investment is analyzed for the entire system.


● The cost of Hardware and Software is also noted.
● Analyzing the way in which the cost can be
reduced.

Every organization wants to reduce cost, but at the same time the quality of the service
should also be maintained. The system is developed according to the cost estimation made
by the concern. In this project, the proposed system will reduce the cost, and there
are new techniques for predicting the water quality. Also, the prediction is done on
smaller datasets for analyzing which algorithm is best.

Technical Feasibility
Technical feasibility is the study of the software and how it is included in the study
of our project. Regarding this, there are some technical issues that should be noted; they are as
follows:

4
● Is the necessary technology available, and how can it be acquired?
● Does the proposed equipment have the technical capacity to hold the data
required by the new system?
● Will the system provide an adequate response to requests made at
periodic time intervals?
● Can this system be expanded after this project's development?
● Does the technique guarantee accuracy, reliability, data access and
security?

These technical issues are raised during the feasibility study while investigating our system.
Thus, the technical consideration evaluates the hardware requirements, software, etc. This
system uses PYTHON as the front end and MACHINE LEARNING as the back end. These also
provide sufficient memory to hold and process the data. As the whole process runs on a single
system, it is a cheap and efficient technique.

The system accepts the entire dataset given as input, and the response
is produced without failure or delay. Technical feasibility is a study of the resources available and how they are
brought together into an acceptable system. It is an essential part of requirements analysis and definition,
conducted in parallel with the rest of the assessment.

Though the storage and computation of data is enormous, it can be easily handled by
machine learning. The machine learning algorithms can be run on any system, and the
operation does not differ from one system to another, so this approach is effective.

Operational Feasibility

It is a measure of how perfectly a proposed system intends to solve the stated problem
and leverage the opportunities identified during the scope definition phase. Additionally, it
also determines how it will satisfy every requirement identified in its requirement analysis
phase.

Operational feasibility depends on the project’s available human resources and


determines if the system is usable once the project is developed and implemented. It analyzes
the preparedness of the organization to support the proposed system. Unlike the technical and
economic feasibility, operational feasibility is quite hard to gauge.

5
Nevertheless, to determine operational feasibility, you need to understand
management’s commitment towards the proposed project. If management initiated the
project, the system is likely to be accepted and used once completed.

Operational feasibility also performs the following tasks.

● Determines whether the problems anticipated in user requirements are of high


priority
● Determines whether the solution suggested by the software development team
is acceptable
● Analyzes whether users will adapt to new software
● Determines whether the organization is satisfied with the alternative solutions
proposed by the software development team.

Regarding this project, the system is well supported, user-friendly, and suitable
for future research. The methods are defined in an effective manner and proper conditions
are given in order to avoid the harm or loss of data. These are the three basic feasibility studies that
are done in every project.

6
SYSTEM SPECIFICATION
3. SYSTEM SPECIFICATION

3.1 HARDWARE SPECIFICATION

RAM : 8 GB

Hard Disk : 256 GB

Processor : 11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00GHz

Processor Speed : 3.00 GHz

3.2 SOFTWARE SPECIFICATION

Operating System : Windows 11

Frontend : PYTHON

Backend : MACHINE LEARNING

Browser : Google Chrome

7
3.3 SOFTWARE DESCRIPTION

3.3.1 Introduction to Python

Python is a popular programming language. It was created by Guido van Rossum and
released in 1991. Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,
etc.). It has a simple syntax similar to the English language, which allows
developers to write programs with fewer lines than some other programming languages. It runs
on an interpreter system, meaning that code can be executed as soon as it is written. This
means that prototyping can be very quick.

Python can be treated in a procedural way, an object-oriented way or a functional way.
It is an interpreted, high-level, general-purpose, object-oriented and dynamically typed
scripting language. Python can be used on a server to
create web applications. It can connect to database systems. It can also read and modify files.
It can be used to handle big data and perform complex mathematics. It can be used for rapid
prototyping, or for production-ready software development. It is well suited for web app
development, data science, scripting, database programming and quick prototyping.

The Python language is one of the most accessible programming languages available
because it has a simplified, uncomplicated syntax that gives more emphasis to natural
language. Due to its ease of learning and use, Python code can be written and
executed much faster than in many other programming languages. It requires relatively fewer
lines of code to perform the same operations and tasks that other programming
languages need larger code blocks for.
A simple python program is of the following form:
print("Hello World")
No header file needs to be included. To print a hello statement we just
need the print function, which displays the output. There is no need for lengthy prefacing code or
inclusion of libraries; the only required code is what is needed to get the job done!

Characteristics of Python

As we may have realized, the Python language revolves around the central theme of
practicality. Python is about providing the programmer with the necessary tools to get the job

8
done in a quick and efficient fashion. Five important characteristics make Python’s practical
nature possible:

● Free and Open-Source.


● Interpreted.
● Portable.
● Extensible.
● Dynamic Memory Allocation.

Data Science:

Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms. In short, we can say that data science
is all about:

o Asking the correct questions and analyzing the raw data.


o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.

Data Science With Python


Let’s understand some popular Python libraries which make Data science easy with
Python:

9
Data Analysis Tools in Python:
• NumPy: NumPy is a Python library that provides mathematical functions to
handle large multi-dimensional arrays. It provides various functions for array operations, linear
algebra, and statistical analysis. NumPy stands for Numerical Python. It offers
many practical functions for n-dimensional array and matrix operations in Python. Large
multidimensional arrays and matrices can have their mathematical operations
vectorized, which improves efficiency and accelerates execution.
• Pandas: One of the most used Python tools for data manipulation and analysis is
Pandas. Pandas provides useful functions to manipulate large amounts of
structured data and offers the easiest methods to perform analysis. It
provides large data structures and manipulates numerical tables and time series
data. For handling data, Pandas is the ideal instrument. It is intended for fast and
simple data aggregation and manipulation. Pandas has two main data
structures: Series, which handles and stores one-dimensional data, and
DataFrame, which handles and stores two-dimensional data. A short sketch follows this list.
• Scipy: Scipy is a well-liked Python library for scientific computing and data
analysis. Scipy stands for Scientific Python. It provides great functionality for
scientific mathematics and computing programming. SciPy contains sub-
modules for optimization, linear algebra, integration, interpolation, special
functions, and other tasks common in science and engineering.
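
A small illustrative sketch of NumPy and Pandas as they are used in this kind of analysis (the numbers below are made up purely for illustration):

import numpy as np
import pandas as pd

# NumPy: vectorized math on an n-dimensional array
arr = np.array([[7.0, 204.9], [3.7, 129.4]])
print(arr.mean(axis=0))              # column-wise means

# Pandas: Series (one-dimensional) and DataFrame (two-dimensional)
s = pd.Series([7.0, 3.7, 8.1], name="ph")
frame = pd.DataFrame({"ph": [7.0, 3.7, 8.1], "Hardness": [204.9, 129.4, 224.2]})
print(frame.describe())              # count, mean, std, min, max, quartiles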
Data Visualization Tools in Python:
• Matplotlib
• Seaborn
• Plotly

Machine Learning Tools in Python


• Scikit-learn: Sklearn is a Python library for machine learning. Sklearn offers
various algorithms and functions that are used in machine learning. Sklearn is
built on Matplotlib, SciPy, and NumPy. It offers simple and straightforward tools
for data analysis and predictions. It offers users a collection of standard
machine-learning algorithms via a reliable interface. Popular algorithms can be
rapidly applied to datasets to solve real problems (a short sketch follows this list).
• Statsmodels: Statsmodels is a great Python tool. It has a number of functions
that are helpful for statistical tests and for building statistical models in Python.
10
• Tensorflow: Tensorflow is an open-source platform for Deep Learning tasks. It is
developed by Google. It uses the Keras API. It is suitable for computer
vision and natural language processing tasks. It was created using Python,
CUDA, and C++. It is compatible with TPUs, GPUs, and CPUs.
• Keras: Keras is a high-level open-source API for neural network tasks. It was
developed by François Chollet. It can be used with Tensorflow and can also run on
CPUs and GPUs.
• Pytorch: Pytorch is a popular open-source Deep Learning platform. It is based on
the torch library. It has a large number of data structures for multi-dimensional
tensors and various mathematical operations. It is developed by Meta AI
(Facebook). It is easy to use and very much suited to computer vision and natural
language processing tasks.
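
As a hedged illustration of the scikit-learn workflow used in this project, a minimal fit/predict sketch on a small synthetic dataset (the data here is generated, not the water quality data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic toy data, only to show the fit/predict pattern
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SVC(kernel="rbf")
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))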

3.3.2. Introduction to the Jupyter Notebook


The Jupyter Notebook is an open source web application that we can use to create and
share documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at ProjectJupyter.

Jupyter Notebooks are a spin-off project from the IPython project, which used to have
an IPython Notebook project itself. The name Jupyter comes from the core
programming languages that it supports: Julia, Python, and R. Jupyter ships with the IPython
kernel, which allows us to write our programs in Python, but there are currently over 100 other
kernels that we can also use.

Getting Up and Running With Jupyter Notebook

The Jupyter Notebook is not included with Python, so if we want to try it out, we need
to install Jupyter.

Installation

If Python is already installed, we can use a handy tool that comes with Python called pip to install Jupyter
Notebook like this:

$ pip install jupyter

11
Starting the Jupyter Notebook Server

Now that we have installed Jupyter, let’s learn how to use it. To get started, all we need
to do is open up our terminal application and go to a folder of our choice. We recommend
using something like our Documents folder to start out with and create a subfolder there
called Notebooks or something else that is easy to remember. Then just go to that
location in our terminal and run the following command

$ jupyter notebook
This will start up Jupyter and our default browser should start (or open a new tab) to
the following URL: http://localhost:8888/tree. Our browser should now look something like
this:

Notebook Extensions

While Jupyter Notebooks have lots of functionality built in, we can add new functionality
through extensions. Jupyter actually supports four types of extensions:

• Kernel
• IPython kernel
• Notebook
• Notebook server

Creating a Notebook

Now that we know how to start a Notebook server, we should probably learn how to
create an actual Notebook document.

All we need to do is click on the New button (upper right), and it will open up a list of
choices. On the machine, we have Python 2 and Python 3 installed, so we can create a
12
Notebook that uses either of these. For simplicity’s sake, let’s choose Python3. Our web page
should now look like this:

Naming

We will notice that at the top of the page is the word Untitled. This is the title for the
page and the name of our Notebook. Since that isn’t a very descriptive name, let’s change it!

Just move the mouse over the word Untitled and click on the text. We see an in-
browser dialog titled Rename Notebook. Let’s rename this one to Hello Jupyter:

Running Cells

A Notebook’s cell defaults to using code whenever we first create one, and that cell
uses the kernel that we chose when we started our Notebook. In this case, we started ours with
Python 3 as our kernel, so that means we can write Python code in our code cells. Since our
initial Notebook has only one empty cell in it, the Notebook can’t really do anything.

13
Thus, to verify that everything is working as it should, we can add some Python code to
the cell and try running its contents. Let’s try adding the following code to that cell:

print('Hello Jupyter!')
Running a cell means that we will execute the cell’s contents. To execute a cell, we can
just select the cell and click the Run button that is in the row of buttons along the top. It’s
towards the middle. If we prefer using our keyboard, we can just press Shift + Enter. When
we run the above code, the output will be of the form:

3.3.3 Introduction to Machine Learning

Machine Learning is a subset of artificial intelligence that is mainly concerned
with the development of algorithms which allow a computer to learn from data and past
experiences on its own. The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as: “Machine learning enables a
machine to automatically learn from data, improve performance from experiences, and predict
things without being explicitly programmed”. With the help of sample historical data, which is
known as training data, machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.

Machine learning brings computer science and statistics together for creating predictive
models. Machine learning constructs or uses algorithms that learn from historical data. The
more information we provide, the higher the performance will be. A machine has the
ability to learn if it can improve its performance by gaining more data.

Machine Learning system learns from historical data, builds the prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of predicted

14
output depends upon the amount of data, as the huge amount of data helps to build a better
model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and with
the help of these algorithms, the machine builds the logic as per the data and predicts the output.
Machine learning has changed our way of thinking about such problems. The block
diagram below explains the working of a machine learning algorithm.

Features of Machine Learning:

● Machine learning uses data to detect various patterns in a given dataset.


● It can learn from past data and improve automatically.
● It is a data-driven technology.
● Machine learning is very similar to data mining, as it also deals with huge
amounts of data.

15
Unsupervised learning:

This type of machine learning involves algorithms that train on unlabeled data. The
algorithm scans through data sets looking for any meaningful connections. Unlike supervised
learning, neither the training labels nor the predictions or recommendations the algorithms
output are pre-determined.

It can be further classified into two categories of algorithms: Clustering and


Association. Types of unsupervised algorithms are:

● KNN (k-nearest neighbors).


● Hierarchal clustering.
● K-means clustering.
● Anomaly detection.
● Neural Networks.
● Principal Component Analysis.
● Independent Component Analysis.
● Apriori algorithm.
● Singular value decomposition.

Supervised learning:

In this type of machine learning, data scientists supply algorithms with labeled training
data and define the variables they want the algorithm to assess for correlations. Both the input
and the output of the algorithm are specified.

Supervised learning can be grouped further in two categories of algorithms:


Classification and Regression.

Regression Algorithms:

● Linear Regression.
● Regression Trees.
● Non-Linear Regression.
● Bayesian Linear Regression.
● Polynomial Regression.

16
Classification Algorithms:

● Random Forest.
● Decision Trees.
● Logistic Regression.
● Support Vector Machines.
● XGBoost.

Among the supervised machine learning algorithms, we have chosen SVM and
XGBOOST Classification algorithms to predict the water quality.

SVM Algorithm :

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be
used for both classification and regression challenges. However, it is mostly used in
classification problems, such as text classification. In the SVM algorithm, we plot each data
item as a point in n-dimensional space (where n is the number of features you have), with the
value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the optimal hyperplane that differentiates the two classes well.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put new data points into the
correct category in the future. This best decision boundary is called a hyperplane.
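
A minimal sketch of this idea with scikit-learn's SVC on a tiny made-up two-feature dataset, so the separating hyperplane is easy to picture (the points and parameter values here are illustrative only):

import numpy as np
from sklearn.svm import SVC

# Tiny made-up 2-D dataset with two roughly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='rbf', gamma=0.15)    # rbf kernel, as used later in this project
clf.fit(X, y)

print(clf.support_vectors_)            # the points that define the decision boundary
print(clf.predict([[4, 4], [7, 7]]))   # classify new points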

17
XGBOOST Algorithm :

XGBoost is an implementation of gradient-boosted decision trees. This library was
written in C++. It is a software library that was designed primarily to improve speed
and model performance. It has recently been dominating applied machine learning, and
XGBoost models perform strongly in many Kaggle competitions. In this algorithm,
decision trees are created sequentially.

Weights play an important role in XGBoost. Weights are assigned to all the
independent variables, which are then fed into the decision tree that predicts results. The
weight of variables predicted wrongly by the tree is increased, and the variables are then fed to
the second decision tree. These individual classifiers/predictors are then ensembled to give a
stronger and more precise model. It can work on regression, classification, ranking, and user-
defined prediction problems. Some important features of XGBoost are:

Parallelization: The model is implemented to train with multiple CPU cores.

Regularization: XGBoost includes different regularization penalties to avoid overfitting.


Penalty regularizations produce successful training so the model can generalize adequately.

Non-linearity: XGBoost can detect and learn from non-linear data patterns.

Cross-validation: Built-in and comes out-of-the-box.

Scalability: XGBoost can run distributed thanks to distributed servers and clusters like
Hadoop and Spark, so you can process enormous amounts of data. It’s also available for many
programming languages like C++, JAVA, Python, and Julia.
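
A short hedged sketch of an XGBoost classifier that touches the features listed above: regularization through gamma and alpha, parallel training through n_jobs, and cross-validation through scikit-learn. The parameter values and synthetic data are illustrative only:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic data, only to illustrate the API
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

model = XGBClassifier(objective='binary:logistic',
                      max_depth=4,
                      gamma=0.35,        # regularization: minimum loss reduction to make a split
                      alpha=10,          # L1 regularization on leaf weights
                      n_estimators=100,
                      n_jobs=-1)         # parallel training on all CPU cores

# Cross-validation through scikit-learn
print(cross_val_score(model, X, y, cv=5).mean())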

18
PROJECT DESCRIPTION
4. PROJECT DESCRIPTION

We have imported all the necessary packages like pandas, seaborn, numpy
and matplotlib. Then we read the CSV file into a dataframe object and used it for the
further processing steps. Next, we performed preprocessing through Exploratory Data
Analysis (EDA). In EDA, we first found the shape of the dataset, then checked whether null values
are present in the dataset. Using the describe function we found the mean values.
We used the interpolation and backfill methods to replace the null values in the dataset for all the
features. Then we visualized the pH values using a scatterplot to view their distribution,
created a histogram visualization for all the features, and used a
heatmap to see the correlation between all the features. We used an outlier visualization with a
boxplot to see which features have outlier points and whether these outliers
should be removed or not. We then performed the data splitting process. First, we dropped
the Potability column from the original dataset to form the independent features. Then we
split the dataset into independent and dependent features, dividing the data into two
sets. We used the sklearn library and the train_test_split function, giving
80% of our data values to the training set and 20% to the testing set. We scaled the X
datasets using the StandardScaler function.

SVC CLASSIFIER: We imported the SVC classifier from the sklearn library and
created an object for it. We fit the training dataset to that object and
predict the results. We then import accuracy_score from sklearn.metrics and find the
accuracy value, and predict the accuracy for the testing dataset. Then we import the
classification report, which includes metrics like precision, recall, f1-score and support.

XGBOOST CLASSIFIER: We imported the XGBClassifier from the xgboost library and
created an object for it. We defined the model with
parameters such as objective, max_depth, min_child_weight, gamma and alpha. We fit the
training dataset to that object and predict the results. We then import accuracy_score from
sklearn.metrics, find the accuracy value, and predict the accuracy for the testing
dataset. We produce the confusion matrix for the X_test and Y_test datasets, and then
import the classification report, which includes details like precision, recall, f1-score and
support. We then compare the accuracy values predicted by the two algorithms. First,
we create a dataframe and store the accuracy values. We compare them by visualizing them in
a barplot, and we use the stem function in a lollipop plot to compare the accuracy values.
19
We see that the XGBoost algorithm has a higher accuracy value than the SVM algorithm. We
then take 500 data values from the dataset and perform the prediction with both
algorithms, taking data values from the start, end, middle and at random. In start testing,
we take the first 500 data values using the head function. In end testing, we take the
last 500 data values using the tail function. In mid testing, we take 500 data values
using slicing (iloc). In random testing, we take 500 data values using the sample
function. Thus, we conclude that the XGBoost algorithm suits these datasets better than
SVM.

4.1 Modules

● Data Extraction.
● Data Preprocessing.
● Splitting the dataset.
● Creating a model.
● Data Modeling.
● Model Evaluation.

4.2 Module Description

Data Extraction

First, we extract the dataset required for the prediction. We extract it as a CSV file
from the Kaggle platform. The dataset contains ten parameters to predict the quality of water
for 3276 water bodies. The parameters are listed below, followed by a short loading sketch:

20
1. pH
2. Conductivity
3. Hardness
4. Solids
5. Sulfate
6. Organic_carbon
7. Trihalomethanes
8. Turbidity
9. Chloramines
10. Potability
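
A minimal loading sketch, assuming the Kaggle CSV file has been downloaded locally (the path is illustrative):

import pandas as pd

# Assumed local copy of the Kaggle water potability dataset
df = pd.read_csv("water_potability.csv")

print(df.shape)           # expected: (3276, 10) - 3276 water bodies, 10 parameters
print(list(df.columns))   # the ten parameters listed above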

Data Preprocessing

Data preprocessing is the technique used to transform raw data into an useful and
efficient format. In this part, the steps are taken to check and fill null values in the columns of
the dataset. It is done by performing Exploratory Data Analysis (EDA).

Exploratory Data Analysis (EDA)

First, we check the shape of the data set. Then we check whether there are null values
for the parameters and inspect the information of the data set. Next we describe the
dataset, which shows the minimum value, maximum value, mean value, count, standard
deviation, etc. Finally, we handle the missing values. We fill the missing values
using the interpolate and backfill methods. Using a countplot, we visualize the
potability. With a scatterplot, we check whether the data values follow a normal
distribution. Then we view all the features using histograms. We find the
correlation between the features using the heatmap. We check if outliers are
present using the boxplot function.

Splitting the dataset

Now it’s time to split the data set. We divide the data into independent and dependent
features. All columns are independent features except Potability, which is our dependent
feature. Using sklearn, we split the data set into training and testing sets using the
train_test_split function, which returns four data sets. Then we perform feature scaling
to keep all the data values on the same scale so that the accuracy is predicted correctly.
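
A compact sketch of this splitting and scaling step, following the 80/20 split described above (the file path and random_state are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("water_potability.csv")

# Independent features (X) and the dependent feature (y = Potability)
X = df.drop("Potability", axis=1)
y = df["Potability"]

# train_test_split returns four data sets: 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same transform to the test data
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)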

Creating a model

21
Now we fit the data values in the training and testing datasets. We train our data
model with these values. Then we create a model for the SVM algorithm: we import the SVC
classifier, whose class allows us to build a kernel SVM model, and we fit our training datasets to
our model. We also import the XGBoost classifier. Now it’s time to evaluate the models using the
accuracy score, confusion matrix and classification report. Evaluation techniques take two
parameters: one is the actual data and the other is the predicted data.
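
A small sketch of these evaluation calls; each takes the actual labels and the predicted labels (the label values below are placeholders for illustration):

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Placeholder actual and predicted labels, purely for illustration
y_actual = [0, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 0, 0, 1, 1]

print(accuracy_score(y_actual, y_predicted))
print(confusion_matrix(y_actual, y_predicted))
print(classification_report(y_actual, y_predicted))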

Data Modeling

We fit our training datasets into our model. Then we predict the output for the
x_test dataset. We import accuracy_score, classification_report and confusion_matrix.
The same process is done for both algorithms.

Model Evaluation

Model evaluation is a fundamental piece of the model improvement process. In this
step, we evaluate our model and check how well it will do in the future. We compare
both accuracy values using barchart and stemplot functions to find which algorithm
predicts with the best accuracy. We test the model on a custom data set to check the correctness
of the algorithm. Then we take 500 data values as smaller datasets using the head,
tail, slicing and sample functions to check the correctness of the algorithm.
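
A brief sketch of the comparison step, assuming two accuracy values are already available (the numbers used here are placeholders, not the project's results):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sea

# Placeholder accuracy values, for illustration only
models = pd.DataFrame({'MODEL': ['Support Vector Machine', 'XGBoost'],
                       'ACCURACY SCORE': [65.0, 70.0]})

# Bar chart comparison
sea.barplot(x='MODEL', y='ACCURACY SCORE', data=models)
plt.show()

# Lollipop (stem) plot comparison
plt.stem(range(len(models)), models['ACCURACY SCORE'])
plt.xticks(range(len(models)), models['MODEL'])
plt.ylabel('ACCURACY SCORE')
plt.show()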

22
SYSTEM DESIGN
5. SYSTEM DESIGN

5.1. Input Design

The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation, and the steps necessary
to put transaction data into a usable form for processing, which can be achieved by instructing the
computer to read data from a written or printed document. Input design considers the
following things:

● What data should be given as input?


● How the data should be arranged or coded?
● The dialog to guide the operating personnel in providing input.
● Methods for preparing input validations and steps to follow when errors
occur.

Objectives

1. This design is important to avoid errors in the data input process and show the correct
direction to the management for getting correct information from the computerized
system.

2. It is achieved by creating user-friendly screens for data entry to handle large
volumes of data. The data entry screen is designed in such a way that all the data
manipulations can be performed.

Kaggle:
We have collected our dataset from the Kaggle website. Kaggle, a subsidiary of Google LLC,
is an online community of data scientists and machine learning practitioners. It allows users to
find and publish the data sets they want to use in building AI models,
explore and build models in a web-based data-science environment, work with other data
scientists and machine learning engineers, and enter competitions to solve data science
challenges.

23
Kaggle was first launched in 2010 by offering machine learning competitions and now
also offers a public data platform, a cloud-based workbench for data science, and Artificial
Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy
Howard. Nicholas Gruen was the founding chair succeeded by Max Levchin. Equity was
raised in 2011 valuing the company at $25.2 million. On 8 March 2017, Google announced
that they were acquiring Kaggle.

Parameters Description:

PH: A measure of how acidic or basic a substance or solution is.

CONDUCTIVITY: A measure of the ability of water to pass an electrical current.

HARDNESS: The amount of dissolved calcium and magnesium in the water.

SOLIDS: Total solids are dissolved solids plus suspended and settleable solids in water.

SULFATE: Sulfate is one of the major dissolved components of rain.

ORGANIC_CARBON: A measure of the total amount of carbon in organic compounds in


pure water and aqueous systems.

TRIHALOMETHANES: The result of a reaction between the chlorine used for disinfecting
tap water and natural organic matter in the water.

TURBIDITY: Turbidity is caused by particles suspended or dissolved in water that
scatter light, making the water appear cloudy or murky.

CHLORAMINES: Chloramines are most commonly formed when ammonia is added to


chlorine to treat drinking water.

POTABILITY: Potable water is defined as water that is suitable for human consumption.

5.2. Output Design


A quality output is one which meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to
other systems through outputs. In output design it is determined how the information is to be
displayed for immediate need and also as hard copy output. It is the most important and direct
24
source of information for the user. Efficient and intelligent output design improves the system’s
relationship with the user and helps in decision-making.

1. Designing computer output should proceed in an organized, well-thought-out manner;
the right output must be developed while ensuring that each output element is designed
so that people will find the system easy and effective to use. When analysts design
computer output, they should identify the specific output that is needed to meet the
requirements.
2. Select methods for presenting information.
3. Create document, report, or other formats that contain information produced by the
system.

The output form of an information system should accomplish one or more of the following
objectives.

● Convey information about past activities, current status or projections of the


future.
● Signal important events, opportunities, problems, or warnings.
● Trigger an action.
● Confirm an action.

25
5.3. Dataset

[Dataset sample screenshots appear on pages 26–30 of the original report.]
5.4 Data Flow Diagram

A data flow diagram is a graphical tool. The system models are termed data flow
diagrams (DFDs). It is used to describe and analyze the movement of data through a system,
manual or automated. This is a central tool and the basis from which the other components are
developed. The data flow diagram in this system is based on the top-down approach. A level 0
DFD, also called a context model, represents the entire software element as a single bubble
with the input and output data indicated by incoming and outgoing arrows.

A DFD consists of processes, flows, warehouses, and terminators. There are several ways
to view these DFD components. A data flow diagram shows the transfer of information from one
part of the system to another. The symbol for a flow is the arrow. The diagram is intended for system
developers on one hand and project contractors on the other, so the entity names should be adapted
to the model domain or its professionals. It is necessary to maintain consistency across all DFD
levels. Creating a data flow diagram allows the programmer to create a program with
minimal discomfort in programming the actual code and further increases the productivity of
the programmer or programming group. Data flow diagrams help the programmer figure out what
options the program will need in order to handle the data it is given. The data flow
diagram also makes it easy to explain the program to laypeople, saving the time the
programmer would have spent explaining the code to other people. The data flow
diagram helps the programmer see what will happen if certain code is injected
into the program.

DFD symbol

● A magnetic disk defines a source of the system.

● An arrow identifies data flow (data motion).

● A diamond represents a condition or process that transforms incoming data flows into
outgoing data flows.

● A rectangle represents the process.

● A terminator represents the end of the system.

31
5.4 DATA FLOW DIAGRAM

[Data flow diagram: Start → Dataset → Data Preprocessing → check whether the data are
split correctly; if NO, return to preprocessing; if YES, train the data → test and evaluate; if an
error is found, return to training; if no error, proceed to Model Deployment → Stop.]

32
I TRAINING PHASE:

[Diagram: Dataset → Feature Selection → Data Mining → Model → Evaluation.]

II PREDICTION PHASE:

[Diagram: Prediction → Class label 0 (bad prediction) or Class label 1 (good prediction) →
Accuracy result.]

33
SYSTEM TESTING

&

IMPLEMENTATION
6. SYSTEM TESTING & IMPLEMENTATION

6.1 System Testing

System Testing is an important stage in any system development life cycle. Testing is a
process of executing a program with the intention of finding errors. The importance of
software testing and its implications with respect to software quality cannot be
overemphasized. Software testing is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding. A good test case is one that
has a high probability of finding a yet undiscovered error.

The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.

Testing is the set of activities that can be planned in advance and conducted
systematically. Different test conditions should be thoroughly checked and the bugs detected
should be fixed. The testing strategies formed by the user are performed to prove that the
software is free and clear from errors. To do this, there are many ways of testing the system’s
reliability, completeness and maintainability.

The important phase of software development is concerned with translating the design
specification into the error-free source code. Testing is carried out to ensure that the system
does not fail, that it meets the specification and it satisfies the user. The system testing was
carried out in a systematic manner with a test data containing all possible combinations of data
to check the features of the system. A test data was prepared for each module, which took care
of all the modules of the program.

System Testing is an important stage where the system developed is tested with
duplicate or original data. It is a process of executing a program with the intent of finding an
error. It is a critical process that can consume fifty percent of the development time.

34
The following are the attributes of good test:

● A good test is not redundant.


● A good test should be "best of breed".
● A good test should be neither simple nor too complex.

6.1.1. Unit Testing

In unit testing the analyst tests the programs making up a system. The software units
in a system are the modules and routines that are assembled and integrated to perform a
specific function. In a large system, many modules at different levels are needed.

Unit testing can be performed from the bottom up starting with the smallest and lowest
level modules and proceeding one at a time. For each module in a bottom-up testing, a short
program executes the module and provides the needed data.
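
As a hedged illustration of this bottom-up approach in the context of this project, a short driver that exercises the lowest-level preprocessing unit (missing-value filling) on its own; the function under test is a stand-in written here only for the example:

import pandas as pd

def fill_missing(df):
    # Stand-in preprocessing unit: interpolate, then back-fill any remaining nulls
    return df.interpolate(method='linear', limit_direction='both').bfill()

def test_fill_missing_removes_nulls():
    # Short driver data supplied to the module under test
    raw = pd.DataFrame({'ph': [7.0, None, 8.1], 'Sulfate': [None, 330.0, 310.0]})
    cleaned = fill_missing(raw)
    assert cleaned.isnull().sum().sum() == 0   # no null values remain

test_fill_missing_removes_nulls()
print("unit test passed")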

6.1.2. Integration Testing

Integration testing is a systematic technique for constructing the program structure
while conducting tests to uncover errors associated with interfacing. The objective is to take
unit-tested modules and build the program structure that has been dictated by the design.

The integration testing for this project is performed when all the modules are tested and
combined into a complete system. After integration the project works successfully.

6.1.3. Validation Testing

Validation testing can be defined in many ways, but a simple definition is that validation
succeeds when the software functions in a manner that can be
reasonably expected by the customer. After a validation test has been conducted, one of two
possible conditions exists.

● The functions or performance characteristics conform to the
specification and are accepted.
● A deviation from specification is uncovered and a deficiency list is
created.

Proposed system under consideration has been tested by using validation testing and
found to be working satisfactorily.
35
6.1.4. White Box Testing

White box testing, sometimes called glass-box testing is a test case design method that
uses the control structure of the procedural design to derive test cases. Using white box testing
methods, the software engineer can derive test cases that :

● Guarantee that all independent paths within a module have been exercised
at least once.
● Exercise all logical decisions on their true and false sides.
● Execute all loops at their boundaries and within their operational bounds
and
● Exercise internal data structure to assure their validity.

6.1.5. Black Box Testing

This method treats the coded module as a black box. The module runs with inputs that
are likely to cause errors. Then the output is checked to see if any error occurred. This method
cannot be used to test all errors, because some errors may depend on the code or algorithm
used to implement the module.
6.2 Implementations

Implementation is the stage in the project where the theoretical design is turned into a
working system. It is the most critical stage in achieving a successful system and in giving
the users confidence that the new system will work efficiently and effectively. It
involves careful planning, investigation of the current system and its constraints on
implementation, and the design of methods to achieve the changeover.

The implementation process begins with preparing a plan for the implementation of the
system. According to this plan, the activities are to be carried out in these plans; discussion has
been made regarding the equipment, resources and how to test activities.

The coding step translates a detail design representation into a programming language
realization. Programming languages are vehicles for communication between human and
computers programming language characteristics and coding style can profoundly affect
software quality and maintainability. The coding is done with the following characteristics in
mind.

36
● Ease of design to code translation.
● Code efficiency.
● Memory efficiency.
● Maintainability.

The user should be very careful while implementing a project to ensure what they have
planned is properly implemented. The user should not change the purpose of project while
implementing. The user should not go in a roundabout way to achieve a solution; it should be
direct, crisp and clear and up to the point.

Implementation is the stage of the project when the theoretical design is turned out into
a working system. Thus, it can be considered to be the most critical stage in achieving a
successful new system and in giving the user, confidence that the new system will work and be
effective.

The implementation stage involves careful planning, investigation of the existing


system and its constraints on implementation, designing of methods to achieve changeover and
evaluation of changeover methods.

In the existing system implementation, the water quality analysis is done with
only minimum efficiency. It does not include many features for analyzing the data in a
more efficient way.

It also does not provide a high accuracy value and does not have enough visualization
techniques to show the predictions for the obtained data.

In our project we have implemented the water quality predictions using two
supervised learning algorithms: SVM and XGBoost. We have performed the
analogy between them in an efficient manner and provided better results by adding
well-organized feature scaling using StandardScaler and effective parameters in the SVC and
XGBoost classifiers.

37
SAMPLES
7. SAMPLES

7.1 Coding
# ANALOGY OF WATER QUALITY PREDICTION USING SVM AND XGBOOST ALGORITHMS

# Importing necessary packages

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sea

from sklearn import metrics

import random

# Reading csv file

df=pd.read_csv(r"C:\Users\moham\Desktop\water_potability.csv")

print(df.head())

# Displays the number of rows and columns

print(df.shape)

# Checking null values are there

print(df.isnull().sum())

# Getting information of how many values are not null?

print(df.info())

# Finding mean, standard deviation, count, etc. for the 10 parameters

print(df.describe())

# Filling null values with interpolate method:

df.interpolate(method='linear',limit_direction='both',inplace=True)

38
print(df.head(50))

print(df.tail(50))

# Applying the backfill method (assigned back so the fill actually takes effect)

df=df.backfill()

print(df.tail(50))

# Viewing the number of counts of potable and not potable

print(df.Potability.value_counts())

# Visualizing the potability count in bar graph

a=sea.countplot(x='Potability',data=df,saturation=5.9)

plt.xticks(ticks=[0,1],labels=["NOT POTABLE","POTABLE"])

plt.show()

# Visualizing the pH values using the scatterplot function to check whether they follow a normal distribution

sea.scatterplot(df['ph'],color="red")

plt.show()

# Visualizing all features

df.hist(color="yellow",ec='red',figsize=(16,16))

plt.show()

# Visualize the correlation of all features using a heat map function of seaborn

plt.figure(figsize=(12,8))

sea.heatmap(df.corr(),annot=True,cmap="rocket_r")

plt.show()

# Visualize the outliers using boxplot function

df.boxplot(figsize=(14,7))

# Splitting the dataset :

print("DATASET....\n",df)

39
x=df.drop('Potability',axis=1).copy()

print("\n x features...\n")

print(x)

y=df['Potability'].copy()

print("\n Y feature...\n",y)

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.02,random_state=58,shuffle=True)

# Scaling the feature

from sklearn.preprocessing import StandardScaler

sc=StandardScaler()

x_train=sc.fit_transform(x_train)

x_test=sc.transform(x_test)

# Viewing the number of rows & columns of training and testing data

print("Training data:",x_train.shape)

print("Testing data:",x_test.shape)

# Visualizing the training set of X:

print("X TRAINING SET......\n",x_train)

# Visualizing the testing set of X:

print("X TESTING SET......\n",x_test)

# Visualizing the training set of Y

print("Y TRAINING SET.....\n",y_train)

# Visualizing the testing set of Y:

print("Y TESTING SET....\n",y_test)

40
# Modeling the object for SVM algorithm

from sklearn.svm import SVC

svmclassifier=SVC(kernel='rbf',gamma=0.15)

svmclassifier.fit(x_train,y_train)

#predicting the test set results

pred=svmclassifier.predict(x_test)

print(pred)

# Evaluation(Calculating Accuracy Score)

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Accuracy score with rbf kernel")

accure=accuracy_score(y_test,pred)*100

print("Accuracy Score is:",accure)

# Confusion matrix

cm=confusion_matrix(y_test,pred)

print("CONFUSION MATRIX")

ax=sea.heatmap(cm/np.sum(cm),annot=True,fmt='0.2%',cmap='Reds')

ax.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])

ax.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])

# Classification Report

target_names=['True','False']

clas=classification_report(y_test,pred,target_names=target_names)

print("CLASSIFICATION REPORT:\n",clas)

41
# Modeling the object for XGBOOST algorithm

from xgboost import XGBClassifier

para={'objective':'binary:logistic',
      'max_depth':4,
      'alpha':10,
      'learning_rate':1.0,
      'n_estimators':100,
      'gamma':0.35,
      'min_child_weight':1}

xgbclas=XGBClassifier(**para)

#fit the classifier to the training data

xgbclas.fit(x_train,y_train)

# make predictions on test data

xgbpred=xgbclas.predict(x_test)

print(xgbpred)

# Evaluation (Calculating Accuracy Score)

acc=accuracy_score(y_test,xgbpred)*100

print("ACCURACY SCORE OF XGBOOST MODEL:\n",acc)

# Confusion matrix:

conm=confusion_matrix(y_test,xgbpred)

print("CONFUSION MATRIX :\n")

42
ax=sea.heatmap(conm/np.sum(conm),annot=True,fmt='0.2%',cmap='Greens')

ax.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])

ax.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])

# Classification Report:

target_names=['True','False']

classify=classification_report(y_test,xgbpred,target_names=target_names)

print("CLASSIFICATION REPORT:\n",classify)

# Comparing two accuracy scores using barplot

models=pd.DataFrame({

'MODEL':['Support Vector Machine','XGBoost'],

'ACCURACY SCORE':[accure,acc]

})

print(models)

cols=['yellow','salmon']

fig=plt.figure(figsize=(8,15))

h=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=models,palette=cols,width=0.05,edgecolor='0.1')

sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'red'})

plt.locator_params(axis='y',nbins=40)

models.sort_values(by='ACCURACY SCORE',ascending=False)

# Comparing two accuracy scores using stem function in lollipop plot

# Stem function by changing markers

43
hi=models.sort_values(by='ACCURACY SCORE',ascending=True)

fig=plt.figure(figsize=(8,15))

plt.grid(axis='x')

g=sea.set(rc={'axes.facecolor':'coral','figure.facecolor':'cyan'})

(markers,stemlines,baseline)=plt.stem(hi['ACCURACY SCORE'],basefmt="")

plt.setp(markers,marker='D',markersize=9,markeredgecolor='red',markeredgewidth=2)

plt.setp(stemlines,linestyle=':',color='blue')

plt.setp(baseline,color='green')

plt.locator_params(axis='y',nbins=40)

plt.xlabel('SVM ………………………………. XGBOOST')

plt.ylabel('ACCURACY SCORE')

plt.show()

# TAKING SMALL DATA FROM THE DATASET AND PREDICTING THE ACCURACY USING tail()

# From the dataset taking only 500 data values

# here we store the last 500 rows in the variable 'n' and visualize the number of rows and columns in that data

n=df.tail(500)

# Splitting the small dataset into features and target (assumed reconstruction: the original
# listing omits how l_train, l_test, p_train and p_test are created, so this split is supplied here)

l=n.drop('Potability',axis=1)

p=n['Potability']

l_train,l_test,p_train,p_test=train_test_split(l,p,test_size=0.2,random_state=58,shuffle=True)

# Scaling the L data features

from sklearn.preprocessing import StandardScaler

sale=StandardScaler()

l_train=sale.fit_transform(l_train)

l_test=sale.transform(l_test)

# Modeling object for SVM algorithm

44
from sklearn.svm import SVC

smallsvm=SVC(kernel='rbf',gamma=0.7)

smallsvm.fit(l_train,p_train)

#predicting the test set results

smallpred=smallsvm.predict(l_test)

print(smallpred)

# Evaluation of accuracy score

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Accuracy score with rbf kernel")

smallaccure=accuracy_score(p_test,smallpred)*100

print(" Accuracy Score for small dataset is:",smallaccure)

# Modeling object for XGBOOST algorithm

from xgboost import XGBClassifier

para={'objective':'binary:logistic',
      'max_depth':6,
      'alpha':10,
      'learning_rate':1.0,
      'n_estimators':100,
      'gamma':0.15,
      'min_child_weight':4}

clasxgb=XGBClassifier(**para)

# fit the classifier to the training data

clasxgb.fit(l_train,p_train)

# make predictions on test data

smallxgbpred=clasxgb.predict(l_test)

print(smallxgbpred)

# Evaluation of accuracy score for XGBOOST

smallacc=accuracy_score(p_test,smallxgbpred)*100

print("ACCURACY SCORE OF XGBOOST MODEL:\n",smallacc)

# Comparing two accuracy scores of small dataset using barplot

smallmodels=pd.DataFrame({

'MODEL':['Support Vector Machine','XGBoost'],

'ACCURACY SCORE':[smallaccure,smallacc]

})

print(smallmodels)

scols=['blue','red']

fig=plt.figure(figsize=(6,9))

m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=smallmodels,palette=scols,width=0.05,edgecolor='0.1')

sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'salmon'})

plt.locator_params(axis='y',nbins=14)

smallmodels.sort_values(by='ACCURACY SCORE',ascending=False)

# TAKING FIRST 500 ROWS FROM THE DATASET USING head()

# here we store 500 records in the variable 'h' and visualize the number of rows and columns in that data

h=df.head(500)
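print(h.shape)

# Assumed split of the first-500-rows subset (the original listing omits this
# step); the target column name 'Potability' and the 80/20 split are assumptions.
from sklearn.model_selection import train_test_split
v=h.drop('Potability',axis=1)
u=h['Potability']
v_train,v_test,u_train,u_test=train_test_split(v,u,test_size=0.2,random_state=0)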

# Scaling the features

from sklearn.preprocessing import StandardScaler

scale=StandardScaler()

v_train=scale.fit_transform(v_train)

v_test=scale.transform(v_test)

# Modeling object for SVM algorithm

from sklearn.svm import SVC

ss=SVC(kernel='rbf',gamma=0.15)

ss.fit(v_train,u_train)

#predicting the test set results

sspred=ss.predict(v_test)

print(sspred)

# Evaluation of accuracy score

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Accuracy score with rbf kernel")

ssaccure=accuracy_score(u_test,sspred)*100

print("Accuracy Score for small dataset is:",ssaccure)

# Confusion matrix (the creation of the heatmap object 'cssa' is missing from the
# original listing; the lines defining 'sscm' and 'cssa' below are an assumed
# reconstruction following the earlier confusion-matrix plots)
sscm=confusion_matrix(u_test,sspred)
print("CONFUSION MATRIX :\n")
cssa=sea.heatmap(sscm/np.sum(sscm),annot=True,fmt='0.2%',cmap='Blues')
cssa.xaxis.set_ticklabels(['POSITIVE','NEGATIVE'])
cssa.yaxis.set_ticklabels(['POSITIVE','NEGATIVE'])

# Modeling the object for XGBOOST algorithm


from xgboost import XGBClassifier

para={'objective':'binary:logistic',
      'max_depth':4,
      'alpha':10,
      'learning_rate':1.0,
      'n_estimators':100,
      'gamma':0.09,
      'min_child_weight':1}

classe=XGBClassifier(**para)

#fit the classifier to the training data

classe.fit(v_train,u_train)

# make predictions on test data

ssxgbpred=classe.predict(v_test)

print(ssxgbpred)

# Evaluation (Accuracy score)

ssacc=accuracy_score(u_test,ssxgbpred)*100

print("ACCURACY SCORE OF XGBOOST MODEL:\n",ssacc)

# Comparing two accuracy scores using barplot

ssmodels=pd.DataFrame({

'MODEL':['Support Vector Machine','XGBoost'],

'ACCURACY SCORE':[ssaccure,ssacc]

})

print(ssmodels)

scols=['blue','salmon']

fig=plt.figure(figsize=(6,9))

m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=ssmodels,palette=scols,width=0.05,edgecolor='0.1')

sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'yellow'})

plt.locator_params(axis='y',nbins=10)

ssmodels.sort_values(by='ACCURACY SCORE',ascending=False)

# TAKING SMALL MIDDLE DATA FROM THE DATASET AND PREDICT THE ACCURACY

# From the dataset taking only 500 data

# here we store 500 records in the variable 'r' and visualize the number of rows and columns in that data

r=df.iloc[1620:2120]
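print(r.shape)

# Assumed split of the middle-500-rows subset (the original listing omits this
# step); the target column name 'Potability' and the 80/20 split are assumptions.
from sklearn.model_selection import train_test_split
i=r.drop('Potability',axis=1)
j=r['Potability']
i_train,i_test,j_train,j_test=train_test_split(i,j,test_size=0.2,random_state=0)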

# Scaling the features

from sklearn.preprocessing import StandardScaler

sca=StandardScaler()

i_train=sca.fit_transform(i_train)

i_test=sca.transform(i_test)

# Modeling object for SVM algorithm

from sklearn.svm import SVC

sq=SVC(kernel='rbf',gamma=0.18)

sq.fit(i_train,j_train)

#predicting the test set results

sqpred=sq.predict(i_test)
print(sqpred)

# Evaluation of accuracy score

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Accuracy score with rbf kernel")

sqaccure=accuracy_score(j_test,sqpred)*100

print("Accuracy Score for small dataset is:",sqaccure)

# Modeling the object for XGBOOST algorithm

from xgboost import XGBClassifier

pars={'objective':'binary:logistic',
      'max_depth':4,
      'alpha':10,
      'learning_rate':1.0,
      'n_estimators':100,
      'gamma':0.12,
      'min_child_weight':2}

clag=XGBClassifier(**pars)

#fit the classifier to the training data

clag.fit(i_train,j_train)

# make predictions on test data

sqxgbpred=clag.predict(i_test)

print(sqxgbpred)

# Evaluation (Accuracy score)

sqacc=accuracy_score(j_test,sqxgbpred)*100

print("ACCURACY SCORE OF XGBOOST MODEL:\n",sqacc)

# Comparing two accuracy scores using barplot

sqmodels=pd.DataFrame({

'MODEL':['Support Vector Machine','XGBoost'],

'ACCURACY SCORE':[sqaccure,sqacc]

})

print(sqmodels)

scols=['purple','red']

fig=plt.figure(figsize=(6,9))

m=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=sqmodels,palette=scols,width=0.05,edgecolor='0.1')

sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'yellow'})

plt.locator_params(axis='y',nbins=40)

sqmodels.sort_values(by='ACCURACY SCORE',ascending=False)

# TAKING SMALL RANDOM DATA FROM THE DATASET AND PREDICT THE ACCURACY USING SAMPLE FUNCTION

# From the dataset taking only 500 data

# here we store 500 records in the variable 'c' and visualize the number of rows and columns in that data

c=df.sample(frac=0.1525,random_state=76)
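print(c.shape)

# Assumed split of the random 500-row sample (the original listing omits this
# step); the target column name 'Potability' and the 80/20 split are assumptions.
from sklearn.model_selection import train_test_split
t=c.drop('Potability',axis=1)
w=c['Potability']
t_train,t_test,w_train,w_test=train_test_split(t,w,test_size=0.2,random_state=0)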

# Scaling the features

from sklearn.preprocessing import StandardScaler

scle=StandardScaler()
t_train=scle.fit_transform(t_train)

t_test=scle.transform(t_test)

# Modeling object for SVM algorithm

from sklearn.svm import SVC

ssf=SVC(kernel='rbf',gamma=0.122)

ssf.fit(t_train,w_train)

#predicting the test set results

ssfpred=ssf.predict(t_test)

print(ssfpred)

# Evaluation of accuracy score

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Accuracy score with rbf kernel")

ssfaccure=accuracy_score(w_test,ssfpred)*100

print("Accuracy Score for small dataset is:",ssfaccure)

# Modeling the object for XGBOOST algorithm

from xgboost import XGBClassifier

pas={'objective':'binary:logistic',
     'max_depth':4,
     'alpha':4,
     'learning_rate':1.0,
     'n_estimators':100,
     'gamma':0.15,
     'min_child_weight':2}

claggi=XGBClassifier(**pas)

#fit the classifier to the training data

claggi.fit(t_train,w_train)

# make predictions on test data

ssfxgbpred=claggi.predict(t_test)

print(ssfxgbpred)

# Evaluation (Accuracy score)

ssfacc=accuracy_score(w_test,ssfxgbpred)*100

print("ACCURACY SCORE OF XGBOOST MODEL:\n",ssfacc)

# Comparing two accuracy scores using barplot

ssfmodels=pd.DataFrame({

'MODEL':['Support Vector Machine','XGBoost'],

'ACCURACY SCORE':[ssfaccure,ssfacc]

})

print(ssfmodels)

cols=['red','salmon']

fig=plt.figure(figsize=(6,8))

k=sea.barplot(x='MODEL',y='ACCURACY SCORE',data=ssfmodels,palette=cols,width=0.05,edgecolor='0.1')

sea.set(rc={'axes.facecolor':'lightgreen','figure.facecolor':'lightblue'})

plt.locator_params(axis='y',nbins=40)

ssfmodels.sort_values(by='ACCURACY SCORE',ascending=False)

7.2 Screenshots

CONCLUSION
8. CONCLUSION

This project has computed and compared the prediction accuracy of both algorithms using Python
and machine learning, and it concludes by stating which algorithm is better suited for the
prediction task.

The dataset was taken from the Kaggle website and imported into our project. We performed
exploratory data analysis on the dataset, after which a dataset with no null values or other
erroneous values was obtained. The splitting process was then carried out on the cleaned
dataset, and feature scaling was applied to reduce the variations among the values. The two
algorithms were then used to make predictions on the dataset. Finally, a visualization of both
accuracy scores is given, and the more suitable algorithm is identified; a brief sketch of this
workflow is shown below.
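As a brief, hedged illustration of that workflow, the pipeline can be summarised in a few lines
of Python. The file name 'water_potability.csv', the target column 'Potability', and the split
and scaling settings shown here are assumptions rather than the exact values used in the
notebook.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# load the Kaggle dataset and drop rows with null values (assumed file name)
data=pd.read_csv('water_potability.csv').dropna()
X=data.drop('Potability',axis=1)    # features (assumed target column name)
y=data['Potability']                # target labels
x_tr,x_te,y_tr,y_te=train_test_split(X,y,test_size=0.2,random_state=0)
scaler=StandardScaler()             # feature scaling to reduce variations
x_tr=scaler.fit_transform(x_tr)
x_te=scaler.transform(x_te)
for name,model in [('SVM',SVC(kernel='rbf')),('XGBoost',XGBClassifier())]:
    model.fit(x_tr,y_tr)
    print(name,accuracy_score(y_te,model.predict(x_te))*100)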

To confirm this conclusion, we repeated the prediction process on smaller datasets by taking
500 records from the start, the end, the middle and a random sample of the dataset. These
subsets also produced accurate predictions.

Thus, this project fulfils its purpose. We conclude the prediction process by declaring the
XGBoost algorithm the more efficient and advanced choice for this prediction task, rather than
the SVM algorithm.


FUTURE ENHANCEMENT
9. FUTURE ENHANCEMENT

The system is currently designed to fulfil the present needs and has achieved its primary
objectives. Some additional features can be added to the system in the future, and these
features may take the application to the next level.

We can apply the prediction to different datasets, and the same comparison (analogy) process
can be carried out with other machine learning algorithms; a hedged sketch of such an
extension is given below. Separate mechanisms could also be developed to determine which
water is best suited for agricultural and other purposes.
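For instance, extending the analogy to a third algorithm only requires fitting one more
scikit-learn style classifier on the same scaled split and appending its score to the
comparison table. The sketch below uses RandomForestClassifier and assumes that x_train,
x_test, y_train, y_test, accure and acc from the main experiment are already available.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# fit the additional model on the same scaled training data (assumed to exist)
rf=RandomForestClassifier(n_estimators=100,random_state=0)
rf.fit(x_train,y_train)
rfacc=accuracy_score(y_test,rf.predict(x_test))*100

# append its accuracy to the existing comparison
allmodels=pd.DataFrame({
    'MODEL':['Support Vector Machine','XGBoost','Random Forest'],
    'ACCURACY SCORE':[accure,acc,rfacc]})
print(allmodels.sort_values(by='ACCURACY SCORE',ascending=False))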

Water is the source of life, and it is becoming increasingly scarce. By 2030, the gap between
global demand for and supplies of fresh water is expected to reach 90%. The climate crisis,
population growth and the transition to clean energy may widen that deficit further.

BIBLIOGRAPHY
10. BIBLIOGRAPHY

10.1 Book References

[1]. Ambrose, R. B., Barnwell, T. O., McCutcheon, S. C., & Williams, J. R., "Computer Models
for Water Quality Analysis", in Water Resources Handbook (L. W. Mays, Ed., 4th Edition),
New York: McGraw-Hill, 1996.

[2]. Claude E. Boyd, "Water Quality: An Introduction" (2nd Edition), published by Springer
Cham in 2016.

[3]. Arthur W. Hounslow, "Water Quality Data: Analysis and Interpretation" (1st Edition),
published by CRC Press in 1995.

[4]. E. Roberts Alley, "Water Quality Control" (2nd Edition), published by McGraw-Hill
Companies, Inc. in 2007.

[5]. Philippe Quevauviller, "Quality Assurance for Water Analysis" (1st Edition), published
by Wiley in 2002.

[6]. David A. Chin, "Water Resources Engineering" (3rd Edition), published by Pearson in 2013.

[7]. Jake VanderPlas, "Python Data Science Handbook" (2nd Edition), published by O'Reilly
Media, Inc. in 2022.

10.2 Web References

• https://www.researchgate.net/publication/352907194_An_Introduction_to_Water_Quality_Analysis
• https://www.kaggle.com/code/anbarivan/indian-water-quality-analysis-and-prediction
• https://chemtech-us.com/articles/parameters-in-water-quality-analysis-explained/
• https://www.geeksforgeeks.org/machine-learning/
• https://www.javatpoint.com/machine-learning

