Project Synopsis

The document describes a project that aims to predict diabetes using machine learning. It will use the Pima Indian Diabetes dataset and apply classification algorithms like K-NN and logistic regression to build models for prediction. The models will then be compared to determine the best prediction method. The project involves preprocessing the data, training models on 80% of the data and testing on 20%. Key hardware and software tools to be used include Google Colab, Python and its sklearn library. The goal is to help identify diabetes early and control the disease.

Uploaded by

PRITHWIRAJ MIDYA


Diabetes Prediction using Machine Learning

A Project Report Submitted in Partial Fulfilment of the
Requirement for the Degree of

Bachelor of Technology in Information Technology

By
Group No: - 12

Abhishek Sinha -14800218056

Arka Dutta -14800218049
Ritayan Midya -14800218026
Pritam Pal -14800218030
Rohit Paul -14800219004
Ehsan Hassan -14800218043

Under the Guidance of

Prof. Subhasis Mitra & Prof. Debjyoti Basu

DEPARTMENT OF INFORMATION TECHNOLOGY

FUTURE INSTITUTE OF ENGINEERING AND MANAGEMENT
(Affiliated to West Bengal University of Technology)
KOLKATA 700 150
2022


Content of the Project Document

CERTIFICATE
ACKNOWLEDGEMENTS
INTRODUCTION
MOTIVATION OF THE PROJECT
HARDWARE AND SOFTWARE TOOLS TO BE USED
FLOW-CHART OF THE PROJECT
ABOUT DATASET
PREPROCESSING DATASET
ABOUT CLASSIFICATION SUPERVISED MODEL
CONFUSION MATRIX
ROC CURVE
OUTPUT COMPARISON
FUTURE SCOPE
CONCLUSION
REFERENCES


Department of Information Technology

FUTURE INSTITUTE OF ENGINEERING AND MANAGEMENT
Sonarpur Station Road, Kolkata – 700150
Tel: 033-2434 5640 (Extn. – 238) URL: www.futureengineering.in
CERTIFICATE

We hereby declare that the work presented in the Project Report entitled Diabetes
Prediction using Machine Learning, in partial fulfilment of the requirements for the award of the Bachelor of
Technology in Information Technology and submitted to the Department of Information Technology of Future
Institute of Engineering and Management, Kolkata, is an authentic record of our own work carried out during the
period from September 2021 to June 2022, under the supervision of Prof. Debjyoti Basu & Prof. Subhasis Mitra.

The matter presented in this thesis has not been submitted by us for the award of any other degree elsewhere.

Full Signature of the Student(s)

a)
b)
c)
d)
e)
f)
This is to certify that the above statement made by the students is correct to the best of my knowledge.

Date: 09.06.2022
Signature of the Supervisor
Prof. Subhasis Mitra
Assistant Professor

Signature of the Supervisor
Prof. Debjyoti Basu
Assistant Professor

Head
Department of Information Technology
Future Institute of Engineering and Management
Kolkata, WB

Signature of the External Examiner/Panel Members


ACKNOWLEDGEMENT

We have put considerable effort into this project. However, it would not have
been possible without the kind support and help of many individuals. We would
like to extend our sincere thanks to all of them.

We are highly indebted to our guides, Prof. Debjyoti Basu and Prof. Subhasis
Mitra, for their guidance and constant supervision, for providing the
necessary information regarding the project, and for their support in
completing it.

We express our thanks to our Principal, Dr. Aloke Ghosh, and our Head of the
Department, Prof. Prasenjit Basu, for extending their support. We would also
like to thank our Institution and the faculty members, without whom this
project would have been a distant reality.

Our thanks and appreciation also go to all the people who have willingly helped
us with their abilities.

Abhishek Sinha

Arka Dutta

Ritayan Midya

Pritam Pal

Ehsan Hassan

Rohit Paul

INTRODUCTION

Diabetes is one of the most harmful diseases in the world. Its causes include
obesity and high blood glucose levels, among others. It affects the hormone
insulin, resulting in abnormal metabolism of carbohydrates and raised levels of
sugar in the blood. Diabetes occurs when the body does not make enough insulin.
According to the World Health Organization (WHO), about 422 million people
suffer from diabetes, particularly in low- and middle-income countries, and
this number is projected to keep rising through 2030. Diabetes is prevalent in
many countries, including Canada, China and India. India's population now
exceeds one billion, and the number of diabetics in India is estimated at 72.9
million. Diabetes is a major cause of death worldwide. Early prediction of a
disease like diabetes allows it to be controlled and can save lives. To this
end, this work explores the prediction of diabetes from various attributes
related to the disease.


MOTIVATION OF THE PROJECT

In recent times, many people have been suffering from diabetes. There are an
estimated 72.96 million cases of diabetes in the adult population of India. The
prevalence in urban areas ranges between 10.9% and 14.2%, and the prevalence in
rural India is 3.0-7.8% among the population aged 20 years and above, with a
much higher prevalence among individuals aged over 50 years. For this reason we
use the Pima Indian Diabetes Dataset and apply Machine Learning classification
techniques to predict diabetes. Machine Learning is a method of training
computers to learn from data rather than being explicitly programmed. Machine
Learning techniques extract knowledge efficiently by building classification
and ensemble models from a collected dataset, and such data can be used to
predict diabetes. Many Machine Learning techniques are capable of prediction;
however, it is difficult to choose the best one. We therefore apply two popular
classification methods, K-NN and Logistic Regression, to the dataset for
prediction. The main objective of this project is to compare these two methods
and choose the better prediction method.


HARDWARE & SOFTWARE TOOLS TO BE USED

HARDWARE:
• Any kind of laptop or desktop (Windows 10) with internet connectivity.
• GPU

SOFTWARE:
• Google Colab
• MS Excel
• Python
• Sklearn
• Flask
• HTML, CSS


FLOW-CHART OF THE PROJECT

This is the most important phase, which includes model building for the
prediction of diabetes. In this phase we implement the machine learning
algorithms used for diabetes prediction.

Procedure of Proposed Methodology:

Step 1: Import the required libraries and the diabetes dataset.
Step 2: Pre-process the data to handle missing values.
Step 3: Perform a percentage split, assigning 80% of the dataset to the
Training set and 20% to the Test set.
Step 4: Select the machine learning algorithms, i.e. Support Vector
Machine (SVM) and RandomForestClassifier.
Step 5: Build the classifier model for each selected algorithm using the
training set.
Step 6: Test each classifier model on the test set.
Step 7: Compare the experimental performance results obtained for each
classifier.
Step 8: After analysing the various measures, identify the best-performing
algorithm.
Step 9: Dump the best model into a pickle file and load it in the API file.
Step 10: Render the HTML file and get data from it.
Step 11: Predict the output from the given data and show it to the client.
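
The last three steps (saving the chosen model, loading it in the API, and predicting from form data) can be sketched with Python's standard pickle module. This is only an illustration under stated assumptions: the dictionary `best_model`, its `glucose_threshold` value and the `predict` helper are hypothetical stand-ins, whereas in the project the pickled object would be the fitted classifier itself.

```python
import pickle

# Hypothetical stand-in for the best fitted classifier (Step 8).
# A real project would pickle the fitted estimator object instead.
best_model = {"name": "threshold-rule", "glucose_threshold": 140.0}


def predict(model, glucose):
    """Return 1 (diabetic) or 0 (non-diabetic) for a glucose reading."""
    return 1 if glucose >= model["glucose_threshold"] else 0


# Step 9: serialise the model and load it back, as the API file would do
# at start-up. pickle.dump / pickle.load do the same with a file object.
blob = pickle.dumps(best_model)
loaded_model = pickle.loads(blob)

# Steps 10-11: values submitted through an HTML form arrive as strings,
# so they are converted to numbers before prediction.
form_data = {"glucose": "155"}
result = predict(loaded_model, float(form_data["glucose"]))
print("Diabetic" if result == 1 else "Non-diabetic")
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.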


[Flow chart: Pima diabetes dataset → data processing → split dataset into
train dataset (80%) and test dataset (20%) → fitting classification supervised
model (SVM, RandomForestClassifier) → classifier → test result → confusion
matrix → analysing best predicting model and dump into pickle file → load the
classifier into API file → render the HTML file and get data from HTML file →
predict the outcome from given data and show the result]


ABOUT DATASET
This dataset is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective is to predict, based on
diagnostic measurements, whether or not a patient has diabetes.
• The dataset has 768 samples of diabetic and healthy individuals.
• All patients are females of at least 21 years of age.
• The dataset is credited to the UCI Machine Learning Repository.
• The dataset has 9 attributes in total: 8 independent variables and one
dependent (target) variable, which indicates whether or not the patient
has diabetes.
Attribute Details:
• Pregnancies (number of times pregnant)
• Glucose level
• Blood Pressure
• Skin Thickness
• Insulin
• BMI (Body Mass Index)
• Diabetes Pedigree Function (provides information about diabetes
history in relatives and the genetic relationship of those relatives
to the patient)
• Age
• Outcome (0 means Non-diabetic and 1 means Diabetic)
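
As a rough sketch of how the nine attributes map onto code, the snippet below parses two rows in the CSV layout in which this dataset is commonly distributed. The column spellings in `COLUMNS` and the two sample rows are assumptions for illustration; a real run would read the complete 768-row file.

```python
import csv
import io

# Column names as commonly used in the distributed CSV file (assumed).
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Two illustrative rows in that layout, not the full dataset.
SAMPLE_CSV = """\
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""


def load_rows(text):
    """Parse CSV text into (features, label) pairs."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # The first eight columns are the independent variables.
        features = [float(record[name]) for name in COLUMNS[:-1]]
        # Outcome is the target: 0 = non-diabetic, 1 = diabetic.
        label = int(record["Outcome"])
        rows.append((features, label))
    return rows


rows = load_rows(SAMPLE_CSV)
print(len(rows[0][0]), "features; labels:", [label for _, label in rows])
```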


PREPROCESSING DATASET

1. Replace 0 values with the median of each attribute:

The columns Pregnancies, Glucose, Blood Pressure, Skin Thickness,
Insulin and BMI have minimum values of 0. It makes sense to have 0
pregnancies, but it does not make sense for the other variables to have a
minimum value of 0. We can therefore conclude that the 0s in Glucose,
Blood Pressure, Skin Thickness, Insulin and BMI represent missing data.
The 0s in these columns should be replaced with the column median, since
the median is less affected by outliers than the mean.

2. Split the dataset:

After processing the data, we split the dataset into two parts: a
Train dataset (80%) and a Test dataset (20%).
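
Both preprocessing steps can be sketched in plain Python. The sketch assumes that the median is computed over the non-zero values only (the report does not say which values enter the median), and the helper names, the seed and the sample readings are made up for illustration.

```python
import random
import statistics


def replace_zeros_with_median(values):
    """Replace 0 entries with the median of the non-zero entries."""
    non_zero = [v for v in values if v != 0]
    median = statistics.median(non_zero)
    return [median if v == 0 else v for v in values]


def split_80_20(rows, seed=42):
    """Shuffle a copy of the rows and return (train, test) as an 80/20 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Made-up glucose readings; 0 marks a missing value.
glucose = [148, 0, 85, 183, 0, 116]
print(replace_zeros_with_median(glucose))

# 768 placeholder rows, matching the size of the dataset.
train, test = split_80_20(range(768))
print(len(train), len(test))
```

Shuffling before the cut keeps the test set from depending on the order of the rows in the file, and a fixed seed makes the split repeatable.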


ABOUT CLASSIFICATION SUPERVISED MODEL

Supervised learning is a type of machine learning in which machines
are trained using well "labelled" training data and, on the basis of that
data, predict the output. Labelled data means that some input data is
already tagged with the correct output.

Classification algorithms are used when the output variable is categorical,
for example when there are two classes such as Yes-No or True-False.
Examples include:
• SVM algorithms
• Random Forest, etc.

K-Nearest Neighbour: KNN is also a supervised machine learning
algorithm. KNN can solve both classification and regression problems.
KNN is a lazy learning technique: it stores all the training records and
classifies new data points according to a similarity measure. KNN assumes
that similar things are near to each other; in many cases, data points
that are similar lie very close together. To speed up finding the distances
between points, implementations may use a tree-like structure. To make a
prediction for a new data point, the algorithm finds the closest data
points in the training data set, its nearest neighbours. Here K is the
number of nearby neighbours; it is always a positive integer. The
predicted class is chosen from the classes of these neighbours.
Closeness is mainly defined in terms of Euclidean distance. The
Euclidean distance between two points P = (p1, p2, ..., pn) and
Q = (q1, q2, ..., qn) is defined by the following equation:

$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
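
The Euclidean distance, i.e. the square root of the summed squared coordinate differences, can be written as a small helper. The function name and the two sample points are illustrative only.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Two made-up points whose coordinate differences are 3 and 4.
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```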

Algorithm:

• Take a training dataset of rows and columns from the Pima Indian
Diabetes data set.

• Take a test dataset with the same attributes.

• Find the Euclidean distance between each test row and every training
row using the Euclidean distance formula.

• Then decide on a value of K, the number of nearest neighbours.

• Using these distances, find the K training rows at minimum distance
and read the last (Outcome) column of each.

• Compare the output values of these neighbours. If most of them are
diabetic, the patient is classified as diabetic; otherwise not.

On this dataset, K = 19 gives a high prediction score.
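
The steps above can be sketched as a small majority-vote classifier. This is a minimal illustration, not the project's implementation: the function name `knn_predict` and the six two-feature training points are made up, and an odd K is used so that a two-class vote cannot tie.

```python
import math
from collections import Counter


def knn_predict(train_rows, query, k):
    """Classify `query` by majority vote among its k nearest training rows.

    Each training row is a (features, label) pair.
    """
    # Sort training rows by Euclidean distance to the query point.
    by_distance = sorted(train_rows, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up (glucose, BMI) points: label 1 = diabetic, 0 = non-diabetic.
train_rows = [
    ([85.0, 22.0], 0), ([90.0, 25.0], 0), ([95.0, 24.0], 0),
    ([160.0, 35.0], 1), ([170.0, 38.0], 1), ([150.0, 33.0], 1),
]

print(knn_predict(train_rows, [155.0, 34.0], k=3))  # 1
print(knn_predict(train_rows, [88.0, 23.0], k=3))   # 0
```

Because distances are compared across attributes, features on very different scales would normally be standardised first; that step is omitted here for brevity.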

Logistic Regression: Logistic regression is also a supervised learning
classification algorithm. It is used to estimate the probability of a binary
response based on one or more predictors, which can be continuous or
discrete. Logistic regression is used when we want to classify data items
into categories.

It classifies the data in binary form, i.e. only as 0 or 1, which here
corresponds to classifying a patient as negative or positive for diabetes.

The main aim of logistic regression is to find the best fit that describes
the relationship between the target and the predictor variables. Logistic
regression builds on the linear regression model: it applies the sigmoid
function to a linear combination of the predictors to predict the
probability of the positive and negative classes.

Sigmoid function: $$P = \frac{1}{1 + e^{-(a + bx)}}$$

Here P = probability, x = predictor variable, and a and b =
parameters of the model.
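
A minimal sketch of the sigmoid computation for a single predictor follows. The parameter values for a and b are made up for illustration; in practice they are obtained by fitting the model to the training data, and the 0.5 decision threshold is an assumed default.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, a, b):
    """Probability of the positive class for one predictor value x."""
    return sigmoid(a + b * x)


# Made-up parameters; real values come from fitting the model.
a, b = -6.0, 0.05
p = predict_probability(140.0, a, b)  # a + b*x = 1.0
label = 1 if p >= 0.5 else 0          # assumed threshold of 0.5
print(round(p, 3), label)  # 0.731 1
```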


CONFUSION MATRIX
The confusion matrix is a technique used for summarizing the
performance of a classification algorithm, here one with binary outputs.
For this Diabetes Prediction, taking "positive" to mean that the patient
has the disease:
• Cases in which the model predicted that the patient has the disease,
and the patient does have the disease, are termed TRUE POSITIVES
(TP). The model has correctly predicted that the patient has the
disease.
• Cases in which the model predicted that the patient does not have the
disease, and the patient does not have the disease, are termed TRUE
NEGATIVES (TN). The model has correctly predicted that the patient
does not have the disease.
• Cases in which the model predicted that the patient has the disease,
but the patient does not have the disease, are termed FALSE
POSITIVES (FP). Also known as "Type I error".
• Cases in which the model predicted that the patient does not have the
disease, but the patient does have the disease, are termed FALSE
NEGATIVES (FN). Also known as "Type II error".
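
The four counts can be computed directly from the actual and predicted labels. The sketch uses the standard convention in which "positive" (label 1) means diabetic; the function name and the eight sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels, where 1 = diabetic."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # has the disease, predicted diabetic
        elif a == 0 and p == 0:
            tn += 1  # no disease, predicted non-diabetic
        elif a == 0 and p == 1:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


# Made-up labels for eight patients.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```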

1. Confusion Matrix for SVM Algo:


2. Confusion Matrix for RandomForest Classifier:


OUTPUT COMPARISON
Method Name      Accuracy Rate (%)   Misclassification Rate (%)
SVM              77.27               22.73
Random Forest    75.32               24.68
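
The two rates are related by simple arithmetic: the misclassification rate is 100 minus the accuracy rate. The sketch below reproduces the reported figures under an assumption that is not stated in the report, namely a 154-sample test set (20% of 768) with 119 correct predictions for SVM and 116 for Random Forest; these counts are inferred because they match the reported percentages.

```python
def accuracy_percent(correct, total):
    """Share of correct predictions, as a percentage."""
    return 100.0 * correct / total


def misclassification_percent(correct, total):
    """Share of wrong predictions, as a percentage."""
    return 100.0 - accuracy_percent(correct, total)


# Assumed counts (not stated in the report): 154 test samples,
# with 119 and 116 correct predictions respectively.
for name, correct in [("SVM", 119), ("Random Forest", 116)]:
    print(name,
          round(accuracy_percent(correct, 154), 2),
          round(misclassification_percent(correct, 154), 2))
```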

FUTURE SCOPE
• Improve the SVM and Random Forest classifiers to achieve a higher
accuracy rate.
• Implement a GUI as the front end.


CONCLUSION
The main aim of this project was to design and implement diabetes
prediction using machine learning methods and to analyse the performance
of those methods, and this has been achieved successfully. The proposed
approach uses classification methods, including KNN and Logistic
Regression. The experimental results can assist health care providers in
making early predictions and early decisions to treat diabetes and save lives.

REFERENCES
• https://www.javatpoint.com/supervised-machine-learning
• www.youtube.com
• www.kaggle.com
• www.ijert.org