Diabetes Prediction System






2020-21 Even Semester

Report of System Software and Compilers – 18CS61 project work

“Diabetes Prediction System”

Submitted By

Under the guidance of

Dr.Archana.R.A Mrs.Mari Kirthima

(Assistant Professor) (Assistant Professor)


BMSIT&M Department of CSE(2020-21) Page 1


To emerge as one of the finest technical institutions of higher learning, to
develop engineering professionals who are technically competent, ethical and
environment friendly for betterment of the society.

Accomplish stimulating learning environment through high quality academic
instruction, innovation and industry-institute interface.

To develop technical professionals acquainted with recent trends and
technologies of computer science to serve as valuable resource for the

Facilitating and exposing the students to various learning opportunities through
dedicated academic teaching, guidance and monitoring.


1. Lead a successful career by designing, analyzing and solving various
problems in the field of Computer Science & Engineering.
2. Pursue higher studies for enduring edification.
3. Exhibit professional and team building attitude along with effective
4. Identify and provide solutions for sustainable environmental


Web Technology and its applications– 18CS63 - Course Outcomes (COs)

w.r.t this PBL
Explain software system

Subject Name– Code - Course Outcomes (COs) w.r.t this PBL

Inspect JavaScript frameworks like jQuery and Backbone, which help the
developer to focus on core features.

Project to Program Outcomes (PO) Mapping

Project Name: Diabetes Prediction System
SSCD ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
WTA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Program outcomes (POs):

PO1 Engineering knowledge: Apply the knowledge of Mathematics, Science,
Engineering fundamentals and an engineering specialization to the solution of
complex engineering problems
PO2 Problem analysis: Identify, formulate, review research literature, and analyse
complex Engineering problems reaching substantiated conclusions using first
principles of mathematics, Natural sciences and engineering sciences
PO3 Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of
data, and synthesis of the Information to provide valid conclusions
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern Engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for Sustainable development
PO8 Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a member
or leader in diverse teams, and in multidisciplinary settings
PO10 Communication: Communicate effectively on complex engineering activities with
the engineering Community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of
the Engineering and management principles and apply these to one’s own work, as a
member and Leader in a team, to manage projects and in multidisciplinary
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological

Project to Program Specific Outcomes (PSO) Mapping

Project Name: Diabetes Prediction System


SSCD ✓ ✓
WTA ✓ ✓

Program Specific Outcomes (PSOs):

PSO1 Analyze the problem and identify computing requirements appropriate to its
PSO2 Apply design and development principles in the construction of software systems of
varying complexity.


Table Of Contents:

1. Abstract

2. Motivation

3. Introduction

4. Existing system

5. Proposed system

6. System requirement specifications

7. Proposed Methodology

8. Outputs

9. Conclusion

10. Reference



1. Abstract

Diabetes is an illness caused by a high glucose level in the human body. It
should not be ignored: left untreated, diabetes may cause major problems such
as heart disease, kidney disease, high blood pressure and eye damage, and it
can also affect other organs of the body. Diabetes can be controlled if it is
predicted early. To that end, this project predicts diabetes in a patient at
an early stage by applying various machine learning techniques, which build
predictive models from datasets collected from patients. We apply
classification and ensemble techniques to a dataset to predict diabetes:
K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT),
Support Vector Machine (SVM), Gradient Boosting (GB) and Random Forest (RF).
Accuracy differs from model to model. The project identifies the model with
the highest accuracy, showing that it can predict diabetes effectively. Our
results show that Random Forest achieved higher accuracy than the other
machine learning techniques.


2. Motivation

The drastic increase in diabetes calls for new research. Our main motivation is
the current state of people suffering from this disease. Lifestyle is the main
cause of type 2 diabetes. We want to create a system that medical professionals
can use as a resource to detect diabetes in time, so that the patient can manage
his/her diabetes effectively.



3. Introduction

Diabetes is one of the most noxious diseases in the world. It is caused by
obesity, high blood glucose level, and so forth. It affects the hormone
insulin, resulting in abnormal metabolism of carbohydrates and a raised level
of sugar in the blood. Diabetes occurs when the body does not make enough
insulin. According to the World Health Organization (WHO), about 422 million
people suffer from diabetes, particularly in low- and middle-income countries,
and this could increase to 490 million by the year 2030.

Diabetes is prevalent in various countries such as Canada, China and India.
The population of India is now more than one billion, and the number of
diabetics in India is 40 million. Diabetes is a major cause of death in the
world. Early prediction of a disease like diabetes allows it to be controlled
and can save human lives. To accomplish this, this work explores the
prediction of diabetes from various attributes related to the disease. For
this purpose we use the Pima Indian Diabetes Dataset and apply various
machine learning classification and ensemble techniques to predict diabetes.

Machine learning is a method used to train computers or machines explicitly.
Machine learning techniques extract knowledge efficiently by building
classification and ensemble models from a collected dataset, and such data can
be useful for predicting diabetes. Many machine learning techniques are
capable of prediction, but it is hard to choose the best one. We therefore
apply popular classification and ensemble methods to the dataset.



4. Existing system

The existing system consists of a model that considers only some particular
parameters and does not take the remaining parameters and networks into account.
The available smart watches are very expensive and are not available for hardware
usage purposes or for daily-wage activities. Another disadvantage is that such
systems need to be connected very invasively in order to work properly. The
existing system has no automation at all.

The existing work considers only a limited set of machine learning techniques, so
it cannot predict diabetes accurately and needs some modification. XGBoost is
considerably more difficult to apply than techniques such as decision trees,
support vector machines, random forests and linear regression. Creating profiles
of clients and patients and managing their storage both involve real-time
communication.

The E-health scheme manages the real-time retrieval and gathering of database
information. The application services consist of three main parts: the web
services, the emergency response systems and the hospital services. The oximeter
comes under the information perception tasks.

Fig: Architecture Diagram



5. Proposed system

Fig: Different Phases

The proposed work deals with diabetes prediction: we use the Pima Indian diabetes
dataset to predict whether a person will suffer from diabetes, based on the values
recorded for him/her. The dataset has 768 data points and 9 features. The result
is binary: 0 denotes that the person will not suffer from diabetes and 1 that
he/she will. Of the 768 data points, 500 are labelled 0 and the remaining 268 are
labelled 1. We focus mainly on the train/test split of the dataset to determine
the individual contribution of each data value. Training is vital because it
ensures the stability of the data, avoids data redundancy and increases the
overall efficiency of the algorithms. The necessary training data files are
imported before the code segments begin; the result is then stored in a separate
database, which is sent for validation, followed by the splitting of the overall
trained attributes, which are then stable.



6. System requirement specifications

In the design and development of the architecture for the diabetes management
system, the clinical requirements and design analysis of the system were based on
discussions with collaborators from the Department of Nutrition and Food Science of
the University of Ghana and Kwame Nkrumah University of Science and Technology
(KNUST). From these discussions, the diet type of patients was determined to be an
essential approach suitable for the diabetes management system. The following
functionalities were mentioned: (1) Scheduling and reminding diabetic patients to
take their medication and blood glucose readings, (2) recommending healthy meals
for diabetics to keep their blood glucose levels in check, (3) encouraging and
tracking the activity of diabetic patients, (4) providing a visual interface to help them
make meaning of their readings and establishing a sufficient connection between the
doctor and the diabetic patient using e-mail.

Providing the diabetic patient with a data visualization tool to display the data in
tables, charts, and an educational program for newly diagnosed and ongoing
diabetes treatment is valuable for the treatment and management of diabetes.


The system architecture for the Diabetes Management System presented below in
Figure 1 is the conceptual model that defines the structure, behavioural
interactions, and multiple system views that underpin the system development. It
presents the formal descriptions of the system, captured graphically to support
reasoning, and the submodules developed as well as the dataflows between the developed


Fig: System architecture for the implemented system with all submodules.



7. Proposed Methodology

The goal of this work is to find a model that predicts diabetes with better
accuracy. We experimented with different classification and ensemble algorithms
to predict diabetes. The phases are briefly discussed below.

Fig: Comparing Glucose with the outcome.

A. Dataset Description- The data is taken from the UCI repository and is named
the Pima Indian Diabetes Dataset. The dataset has several attributes for 768
data points.

Table: Dataset Description

Sno. Attributes
1. Pregnancy
2. Glucose
3. Blood Pressure
4. Skin Thickness
5. Insulin
6. BMI(Body Mass Index)
7. Diabetes Pedigree Function
8. Age


The 9th attribute is the class variable of each data point. It takes the value
0 or 1, indicating whether the person is negative or positive for diabetes.

Fig: 1v1 characteristics.


Distribution of Diabetic patient- We built a model to predict diabetes; however,
the dataset is slightly imbalanced, with around 500 data points labelled 0
(negative, no diabetes) and 268 labelled 1 (positive, diabetic).
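The effect of this imbalance can be made concrete with a little arithmetic. The sketch below is illustrative only (the function name `class_balance` is hypothetical, not part of the project code); it computes the share of each class and the accuracy that a trivial model would reach by always predicting the majority class, which is a useful floor when reading the accuracy figures of the real models.

```python
def class_balance(counts):
    """Return total size, per-class share, and the majority-class baseline accuracy."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # A model that always predicts the most frequent class is right this often.
    baseline = max(shares.values())
    return total, shares, baseline


# Class counts as stated in the report: 500 negative (0), 268 positive (1).
total, shares, baseline = class_balance({0: 500, 1: 268})
print(total, round(shares[1], 3), round(baseline, 3))
```

With these counts, about 35% of the data points are positive, so any model worth keeping should score clearly above roughly 65% accuracy.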

Fig: Ratio of Diabetic and Non Diabetic Patient

Fig: Correlation matrix between the parameters.

A correlation matrix is a table that displays the correlation coefficient between
each pair of variables. The measure is best used for variables that have a linear
relationship with each other. The fit of the data can be represented visually in a
scatterplot.
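As a minimal sketch of how such a table is built, the code below computes Pearson correlation coefficients for every pair of columns using only the standard library. The column values are made-up toy numbers, not rows from the Pima dataset, and the helper names are hypothetical; the report does not say which tool produced its figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy values for illustration only.
toy = {
    "Glucose": [85, 100, 120, 150, 170],
    "BMI": [22.0, 26.5, 24.0, 31.0, 33.5],
    "Outcome": [0, 0, 0, 1, 1],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal entry is 1 (a variable is perfectly correlated with itself) and the table is symmetric, which is why such figures are often drawn as a heatmap.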


B. Data preprocessing- This is the most important step. Healthcare data often
contains missing values and other impurities that reduce the effectiveness of the
data. Preprocessing is done to improve the quality and effectiveness of the
results obtained after mining. It is essential for applying machine learning
techniques to the dataset effectively and for obtaining accurate results and
successful predictions. For the Pima Indian diabetes dataset we perform
preprocessing in two steps.

1). Missing Values removal- Remove all instances that have zero (0) as a value.
A value of zero is not possible for these attributes, so such instances are
eliminated. By eliminating irrelevant features/instances we obtain a feature
subset; this process is called feature subset selection, and it reduces the
dimensionality of the data and helps the models work faster.
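A minimal sketch of this step is shown below. It assumes each record is a dictionary and that the zero check applies only to measurements where zero is physiologically implausible; the report does not list the columns, so the `ZERO_MEANS_MISSING` tuple and the column names are assumptions made for illustration. The sample rows are invented.

```python
# Assumed list of columns in which 0 stands for "not recorded".
# A count such as Pregnancy can legitimately be 0, so it is left out.
ZERO_MEANS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def drop_rows_with_zero(rows, columns=ZERO_MEANS_MISSING):
    """Keep only the rows in which none of the given columns is zero."""
    return [row for row in rows if all(row[c] != 0 for c in columns)]


# Invented example rows.
rows = [
    {"Pregnancy": 0, "Glucose": 137, "BloodPressure": 40, "SkinThickness": 35,
     "Insulin": 168, "BMI": 43.1, "Outcome": 1},
    {"Pregnancy": 1, "Glucose": 85, "BloodPressure": 66, "SkinThickness": 29,
     "Insulin": 0, "BMI": 26.6, "Outcome": 0},   # Insulin missing -> dropped
    {"Pregnancy": 3, "Glucose": 78, "BloodPressure": 50, "SkinThickness": 32,
     "Insulin": 88, "BMI": 31.0, "Outcome": 1},
]
cleaned = drop_rows_with_zero(rows)
print(len(rows), "->", len(cleaned))
```

Dropping rows is the approach the text describes; replacing zeros with a column mean or median is a common alternative when too many rows would otherwise be lost.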

2). Splitting of data- After cleaning, the data is normalized and split into
training and test sets. We train the algorithm on the training set and keep the
test set aside. Training produces a model based on the algorithm's logic and the
feature values in the training data. The aim of normalization is to bring all
attributes onto the same scale.
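The split and the normalization can be sketched as follows. This is an illustrative standard-library version, not the project's code: the 80/20 ratio, the fixed seed and the choice of min-max scaling are all assumptions, since the report names neither the split ratio nor the scaling method. The scaling ranges are learned from the training rows only, so that no information from the test set leaks into training.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_min_max(rows):
    """Learn (min, max) per feature column from the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, ranges):
    """Rescale each feature with the learned ranges; constant columns map to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


# Invented feature rows: [Glucose, BMI].
features = [[85, 22.0], [100, 26.5], [120, 24.0], [150, 31.0], [170, 33.5],
            [95, 28.1], [110, 30.2], [140, 27.4], [160, 35.0], [130, 25.3]]
train, test = train_test_split(features, test_fraction=0.2, seed=42)
ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)  # may fall slightly outside [0, 1]
print(len(train_scaled), len(test_scaled))
```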

Fig: Feature Importance.

C. Apply Machine Learning- Once the data is ready, we apply machine learning
techniques. We use different classification and ensemble techniques to predict
diabetes, applied to the Pima Indians diabetes dataset. The main objective is to
analyze the performance of these methods, measure their accuracy, and identify
the important features that play a major role in prediction. The techniques are
as follows:-

1. Support Vector Machine- Support Vector Machine (SVM) is a supervised machine
learning algorithm and one of the most popular classification techniques. SVM
creates a hyperplane that separates two classes. It can create a hyperplane or
a set of hyperplanes in a high-dimensional space, which can be used for
classification or regression. SVM separates instances into specific classes
and can also classify entities that are not present in the training data. The
separation is done by the hyperplane that lies farthest from the closest
training points of any class.
• Select the hyperplane that divides the classes best.
• To find the best hyperplane, calculate the distance between the hyperplane
and the data points, which is called the margin.
• If the distance between the classes is small, the chance of
misclassification is high, and vice versa. So we need to select the
hyperplane with the largest margin. Margin = distance to positive
point + distance to negative point.
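The margin rule above can be illustrated with a short sketch. It does not train an SVM; it only compares two hand-picked candidate hyperplanes of the form w·x + b = 0 and keeps the one with the larger margin. The points, the candidate hyperplanes and the function names are all hypothetical.

```python
from math import sqrt


def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from a point to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(positives, negatives, w, b):
    """Margin = distance to nearest positive point + distance to nearest negative point."""
    d_pos = min(distance_to_hyperplane(p, w, b) for p in positives)
    d_neg = min(distance_to_hyperplane(p, w, b) for p in negatives)
    return d_pos + d_neg


# Invented, linearly separable toy points.
positives = [(3.0, 3.0), (4.0, 2.0)]
negatives = [(0.0, 0.0), (1.0, 0.0)]

# Two candidate separating hyperplanes, given as (w, b).
candidates = {"A": ((1.0, 1.0), -3.5), "B": ((1.0, 0.0), -2.0)}
margins = {name: margin(positives, negatives, w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, "best:", best)
```

A real SVM solver searches over all hyperplanes rather than a fixed list, but the quantity it maximizes is the same margin computed here.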
2. K-Nearest Neighbor- KNN is also a supervised machine learning algorithm. It
can solve both classification and regression problems. KNN is a lazy
prediction technique: it assumes that similar things are near each other, and
data points that are similar are usually close together. KNN classifies new
data based on a similarity measure. The algorithm stores all the records and
classifies them according to their similarity. A tree-like structure can be
used to find the distances between points. To make a prediction for a new data
point, the algorithm finds the closest data points in the training data set,
its nearest neighbors. Here K is the number of nearby neighbors, and it is
always a positive integer. The neighbor's value is chosen from the set of
classes. Closeness is mainly defined in terms of Euclidean distance. The
Euclidean distance between two points P = (p1, p2, ..., pn) and
Q = (q1, q2, ..., qn) is defined by the following equation:-

$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

• Take a sample dataset of columns and rows, here the Pima Indian Diabetes
data set.
• Take a test dataset of attributes and rows.
• Find the Euclidean distance using the formula above.
• Then decide a value of K, the number of nearest neighbors.
• Using the minimum Euclidean distances, find the K nearest rows and look up
the nth column (the outcome) of each.
• Compare these output values.
If most of the neighbors' output values are diabetic, the patient is classified
as diabetic, otherwise not.
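The steps above can be sketched in a few lines of standard-library Python. The training points below are invented and the function names are hypothetical; the sketch uses a plain sort over all distances rather than a tree structure, and a majority vote over the K nearest outcomes.

```python
from collections import Counter
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented, already-normalized toy data: two well-separated groups.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(euclidean((0, 0), (3, 4)))                    # 3-4-5 triangle
print(knn_predict(train_x, train_y, (1.1, 1.0)))    # near the first group
print(knn_predict(train_x, train_y, (5.0, 5.1)))    # near the second group
```

Choosing an odd K avoids ties in a two-class problem, and features should be on the same scale first, otherwise the attribute with the largest range dominates the distance.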

3. Decision Tree- The decision tree is a basic supervised classification
method, used when the response variable is categorical. It is a tree-structured
model that describes the classification process based on the input features.
Input variables can be of any type, such as graph, text, discrete or
continuous. Steps for Decision Tree:
• Construct a tree with nodes as input features.
• Select the feature whose information gain is highest to predict the output
from the input features.
• The highest information gain is calculated for each attribute in each node of
• Repeat step 2 to form a subtree using the feature which is not used in above
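Information gain, the quantity used to pick the splitting feature, can be computed as the entropy of the labels before a split minus the weighted entropy after it. The sketch below shows this for a single numeric feature and a threshold; the glucose values and the two thresholds are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Invented toy data: low glucose -> 0, high glucose -> 1.
glucose = [85, 90, 110, 150, 160, 170]
outcome = [0, 0, 0, 1, 1, 1]

for threshold in (87, 120):
    print(threshold, round(information_gain(glucose, outcome, threshold), 3))
```

The threshold that separates the two classes cleanly gives the larger gain, so a tree builder would choose it for the node.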


4. Logistic Regression- Logistic regression is also a supervised learning
classification algorithm. It is used to estimate the probability of a binary
response based on one or more predictors, which can be continuous or discrete.
Logistic regression is used when we want to classify data items into
categories. Here it classifies the data in binary form, 0 or 1, i.e. whether
a patient is negative or positive for diabetes.
The main aim of logistic regression is to find the best fit describing the
relationship between the target and the predictor variables. Logistic
regression is based on the linear regression model and uses the sigmoid
function to predict the probability of the positive and negative class.
Sigmoid function: $P = \frac{1}{1 + e^{-(a + bx)}}$, where P is the
probability and a and b are the parameters of the model.
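As a small worked example of the sigmoid formula, the sketch below evaluates P for a single predictor x and turns the probability into a class with a 0.5 cut-off. The parameter values a = -6.0 and b = 0.05 are hypothetical numbers chosen for illustration, not fitted coefficients from the project, and the cut-off is an assumption.

```python
from math import exp


def predict_probability(x, a, b):
    """Sigmoid of the linear score a + b*x: probability of the positive class."""
    return 1.0 / (1.0 + exp(-(a + b * x)))


def predict_class(x, a, b, threshold=0.5):
    """Label 1 (positive) when the probability reaches the threshold, else 0."""
    return 1 if predict_probability(x, a, b) >= threshold else 0


# Hypothetical parameters for a single predictor such as glucose.
a, b = -6.0, 0.05
for glucose in (80, 120, 180):
    p = predict_probability(glucose, a, b)
    print(glucose, round(p, 3), predict_class(glucose, a, b))
```

The probability is exactly one half where a + bx = 0, which with these parameters is at x = 120; values above that are classified as positive.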
Ensembling- Ensembling is a machine learning technique that uses multiple
learning algorithms together for a task. It is used because it generally gives
better predictions than any individual model. The main causes of error are
noise, bias and variance, and ensemble methods help to reduce these errors.
Popular ensemble methods include bagging, boosting, AdaBoost, gradient
boosting, voting and averaging. In this work we have used bagging (Random
Forest) and gradient boosting to predict diabetes.

5. Random Forest- Random forest is an ensemble learning method used for
classification and regression tasks. In this work its accuracy is greater than
that of the other models. It can easily handle large datasets. Random forest
was developed by Leo Breiman and is a popular ensemble learning method. It
improves on the performance of a decision tree by reducing variance. It
operates by constructing a multitude of decision trees at training time and
outputs the class that is the mode of the classes (classification) or the
mean prediction (regression) of the individual trees.
a. The first step is to select the “R” features from the total features “m” where
b. Among the “R” features, calculate the node using the best split point.
c. Split the node into sub-nodes using the best split.
d. Repeat steps a to c until “l” number of nodes has been reached.
e. Build the forest by repeating steps a to d to create “n” number of trees.

To make a prediction, first take the test features and use the rules of each
randomly created decision tree to predict the outcome, and store the predicted
outcome (target). Secondly, calculate the votes for each predicted target.
Finally, take the most-voted predicted target as the final prediction of the
random forest algorithm. Random forest gives correct predictions for a wide
range of applications.
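The voting step described above can be sketched as follows. Only the prediction stage is shown; tree construction is omitted. Each "tree" here is a hand-written one-rule function standing in for a trained decision tree, and the thresholds, feature names and sample values are invented for illustration.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Collect one vote per tree and return (majority label, vote counts)."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0], dict(votes)


# Stand-ins for trained trees: each looks at a different feature.
trees = [
    lambda s: 1 if s["Glucose"] > 140 else 0,
    lambda s: 1 if s["BMI"] > 30 else 0,
    lambda s: 1 if s["Age"] > 50 else 0,
]

sample = {"Glucose": 155, "BMI": 33.0, "Age": 35}
prediction, votes = forest_predict(trees, sample)
print(prediction, votes)
```

Two of the three stand-in trees vote for class 1, so the forest predicts 1 even though one tree disagrees; averaging over many differing trees in this way is what reduces variance.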

Fig: Algorithm’s accuracies.

6. Gradient Boosting- Gradient boosting is a powerful ensemble technique used
for prediction, here for classification. It combines weak learners to build a
strong learner, using decision trees as the base model. It can classify
complex data sets and is a very effective and popular method. In gradient
boosting, model performance improves over iterations.
• Consider a sample of target values as P.
• Estimate the error in the target values.
• Update and adjust the weights to reduce the error M.
• P[x] = P[x] + alpha M[x]
• Model learners are analyzed and calculated by the loss function F.
• Repeat the steps until the desired target result P is reached.
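The update rule P[x] = P[x] + alpha M[x] can be sketched with very small weak learners. The version below is an illustration under stated assumptions, not the project's implementation: it uses squared-error loss, one numeric feature, and one-split "stumps" fitted to the current errors (residuals); the data, the number of rounds and alpha = 0.5 are invented. It also assumes the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Weak learner M: the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # needs at least two distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=20, alpha=0.5):
    """Start from the mean target, then repeatedly add alpha * M fitted to the errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    learners = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]   # estimate the error
        stump = fit_stump(xs, residuals)                       # learner M
        learners.append(stump)
        predictions = [p + alpha * stump(x)                    # P[x] = P[x] + alpha M[x]
                       for p, x in zip(predictions, xs)]
    return lambda x: base + alpha * sum(m(x) for m in learners)


# Invented toy data: low glucose -> 0, high glucose -> 1.
glucose = [85, 90, 110, 150, 160, 170]
outcome = [0, 0, 0, 1, 1, 1]
model = gradient_boost(glucose, outcome)
print(round(model(90), 3), round(model(160), 3))
```

Each round removes a fraction alpha of the remaining error, which is why performance improves over iterations; a smaller alpha learns more slowly but is less prone to overfitting.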

Fig: Overview of the Process

Fig: Cross-Validates classification metrics.


Fig: PIMA Indian Dataset.

8. Outputs

1. Home Page:


2. With values added:

3. Final Output:


9. Conclusion

This research paper has presented a meal recommendation system with food
recognition capabilities which focused on generating daily personalized meal plans
for the users, according to their nutritional necessities and previous meal
preferences. The reviewed literature presented some gaps which informed the
design and development of an integrated diabetes management platform for patients:
(1) using the K-Nearest Neighbour (KNN) algorithm, a supervised machine learning
model, for a food recommendation system for diabetics, (2) scheduling and reminding diabetic
patients to take their medication and blood glucose readings for doctor’s intervention
via mobile app, (3) encouraging and tracking the activity of diabetic patients, and (4)
providing an interactive visual interface to help them make meaning of their readings
and establishing a sufficient connection between the doctor and the diabetic patient
using e-mail and chatbots. These integrated technologies present state-of-the-art
solutions for the effective management of diabetes. This research paper required us
to provide a framework with a user-friendly interface for people with diabetes to
monitor their diet, medication, and activity levels. The task has been solved using
state of the art algorithms in artificial intelligence. The proposed framework factors
the diabetes management problem into subgoals: building a Tensorflow neural
network model for food classification; thus, it allows users to upload an image to
determine if a meal is recommended for consumption; implementing K-Nearest
Neighbour (KNN) algorithm to recommend meals; using cognitive sciences to build a
diabetes question and answer chatbot; tracking user activity, user geolocation and
generating pdfs of logged blood sugar readings. The food recognition model was
evaluated with cross-entropy metrics that support validation using neural networks
with a backpropagation algorithm. The model learned features of the images fed
from local Ghanaian dishes with specific nutritional value and essence in managing
diabetics and provided accurate image classification with given labels and
corresponding accuracy. The model achieved specified goals by predicting with high
accuracy, labels of unseen new images. The food recognition and classification
model achieved over 95% accuracy levels for specific calorie intakes. The
performance of the meal recommender model and question and answer chatbot was
tested with a designed cross-platform user-friendly interface using Cordova and Ionic
Frameworks for software development for both mobile and web applications. The
system recommended meals to meet the calorific needs of users successfully using
KNN (with k = 5) and answered questions asked in a human-like way. The
implemented system would solve the problem of managing activity, dieting
recommendations, and medication notification for diabetics. The critical limitation of
this work is that it does not address corresponding hardware modules for insulin
pumps and control, as discussed by others in the review, and that may constitute a
fatal limitation since insulin control is crucial. It concentrates principally on
developing software for diabetes management with a machine learning algorithm.
Other supervised and unsupervised machine learning algorithms, such as Support
Vector Machines, random forests, K-Means, and Fuzzy C-Means, could be explored
as well. Finally, there is hope that this system will be useful to people with diabetes
now and in the future. The focus of this work has been on implementing a software
system that will take into consideration the various factors that affect diabetics. The
most crucial issue was to get different models to work; consequently, there are
improvements to make. The nonfatal limitations include a lack of wearables for
physical activity tracking and associated model to determine the number of calories
burned from each activity undertaken. The present system tracks the user’s walk and
saves the route but does not relate the saved route to the calories burned. In the
future, the calories burned would be determined, and the various modules will work
together to predict the user’s future blood glucose readings.



10. Reference

• Stackoverflow
• Flask
• Debadri Dutta, Debpriyo Paul, Parthajeet Ghosh, "Analyzing Feature
Importances for Diabetes Prediction using Machine Learning". IEEE, pp 942-
928, 2018.
• Md. Faisal Faruque, Asaduzzaman, Iqbal H. Sarker, "Performance Analysis of
Machine Learning Techniques to Predict Diabetes Mellitus". International
Conference on Electrical, Computer and Communication Engineering
(ECCE), 7-9 February, 2019.
• Tejas N. Joshi, Prof. Pramila M. Chawan, "Diabetes Prediction Using Machine
Learning Techniques". Int. Journal of Engineering Research and Application,
Vol. 8, Issue 1, (Part -II) January 2018, pp.-09-13
