Predictive Model For Diabetes Using Machine Learning
Himanshu Nadda (161366)
Piyush Thakur (161373)
I hereby declare that the work presented in this report entitled “Diabetes Prediction System
Using Machine Learning”, in fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering/Information Technology,
submitted in the Department of Computer Science & Engineering and Information
Technology, Jaypee University of Information Technology, Waknaghat, is an authentic record
of the work carried out over a period from July 2019 to July 2020 under the supervision of
Dr. Ekta Gandotra (Assistant Professor, Computer Science and Engineering). The matter
embodied in the report has not been submitted for the award of any other degree or diploma.
Himanshu Nadda (161366)
This is to certify that the above statement made by the candidates is true to the best of my
knowledge.
We would like to express our special thanks and gratitude to our project guide, Dr. Ekta
Gandotra, who helped us in conceptualizing the project and in building the procedures used
to complete it. We would also like to thank our Head of Department for providing us
this golden opportunity to work on a project like this, which led us to do a lot of
research and taught us many new things.
Thanking you,
1.2 Problem Statement
1.3 Objective
1.4 Methodology
2. Literature Survey
2.1 Introduction
2.3 Conclusion
3. System Development
3.1 Dataset
3.2 Data Preprocessing
3.3 Conclusion
4. Algorithm
4.2 Logistic Regression
4.6 Feature Importance
4.7 Conclusion
5. Result and Evaluation
5.1.1 Confusion Matrix
5.1.2 Sensitivity
5.1.3 Specificity
5.1.4 Accuracy
5.2 Result Analysis
5.3 Conclusion
6. Conclusion and Future Work
References
List of Abbreviations
ACEI Angiotensin-converting-enzyme inhibitor
ANN Artificial Neural Networks
BUN Blood urea nitrogen
CAD Coronary Artery Disease
CKD Chronic Kidney Disease
COPD Chronic obstructive pulmonary disease
DL Deep Learning
DM Diabetes Mellitus
FN False negatives
FP False positives
Hb Hemoglobin
HTN Hypertension
IHD Ischemic heart disease
LR Logistic Regression
ML Machine Learning
NaN Not-A-Number
PCO Polycystic Ovary Syndrome
RFC Random Forest Classifier
SES Socioeconomic Status
SVM Support Vector Machine
T3 Triiodothyronine
TN True negatives
TP True positives
WHO World Health Organization
List of Figures
Table 1 Features of patients
Diabetes has become a common disease affecting people of all ages, from the young to the
old. The number of diabetic patients is increasing day by day for various reasons, such as
obesity, poor diet, autoimmune reaction, changes in lifestyle and eating habits, and
environmental pollution. Hence, early prediction of diabetes is essential for saving lives.
Data analytics is a branch of computer science concerned with examining large datasets to
find useful hidden patterns and to draw conclusions from those patterns. In health care,
this analysis is carried out using machine learning algorithms: to support medical
diagnosis, the algorithms analyse large volumes of medical data to build machine learning
models. This project presents a diabetes prediction system for diagnosing diabetes. Early
detection of diabetes is possible with the help of this model.
Chapter-1
Introduction
1.1 Introduction
Diabetes is caused by an increased level of sugar (glucose) in the blood. There are two
main types of diabetes, type 1 and type 2. Type 1 diabetes is an autoimmune disease: the
body destroys the cells that are essential for producing insulin, which is needed to absorb
sugar. This type can occur regardless of obesity. Obesity is an increase in body mass index
(BMI) above the normal BMI level of an individual. Type 1 diabetes is found in children and
sometimes in adults. Adults who are obese are predominantly affected by type 2 diabetes,
which mostly occurs in middle-aged people. Diabetes is a major cause of other ailments such
as heart disease, stroke, kidney disease, eye problems, dental disease, nerve damage and
foot problems. Symptoms of diabetes, which include excessive excretion of urine (polyuria),
thirst, constant hunger, weight loss, vision changes and fatigue, can occur suddenly. [1]
1.2 Problem Statement
Diabetes is a serious disease that kills a great many people throughout the world. At the
same time, advances in technology are improving human life, so these technologies should
be put to use for the future of human life and health care. Machine learning and deep
learning algorithms are used for many kinds of prediction, often by large businesses to
increase sales. The question is how the same technologies can be used for human welfare.
The algorithms studied here are to be tested on a prediction task whose expertise
otherwise lies only in the hands of doctors. To learn the complexities of the
biomechanical features of the human body and to predict the complicated health problems
of individuals, the machine must be trained like a doctor, using the features and external
factors provided by a valid dataset.
1.3 Objectives
The main objective of this prediction system for diabetic patients is to find a useful
model to serve humanity. It can be understood through the following points.
b) To find the relationship between a diabetic patient and the different factors that
influence the disease.
c) To compare the performance of all algorithms and, in the end, use the most efficient model.
1.4 Methodology
Chapter-2
Literature survey
2.1 Introduction
Diabetes is a chronic disease in which blood sugar (glucose) levels are very unstable.
Several illnesses result from this instability, and sometimes these health problems can
also cause sudden death. Diabetes is a disease that results from a disorder of
metabolism. It can be classified into three types.
A large number of people suffer from this disease, and the number is increasing day by
day. A recent survey found that one in 11 adults suffers from this disease, which is a
seriously alarming figure for a disease to reach. [3]
Type 1:
In type 1 diabetes the body produces very little insulin or none at all, which can lead
to many complications. The pancreas of a person suffering from type 1 diabetes is at
great risk. A recent study shows that type 1 diabetes generally occurs in the age group
of 1-20.
Type 2:
In type 2 diabetes the body resists the insulin it produces, which results in insulin
being unavailable to the body. Type 2 diabetic patients are more prone to heart-related
ailments. A recent survey by the World Health Organization (WHO) found that the majority
of patients suffer from type 2 diabetes.
Type 3:
Figure 2 Main symptoms of diabetes
Treatment of low blood sugar is in most cases the same for type 1 and type 2. Most cases
are considered mild and are not medical emergencies. A feeling of unease, sweating and
trembling are the common effects. More dangerous effects, such as aggressiveness,
permanent brain damage and death, can occur in severe cases.
2.3 Conclusion
Diabetes is a harmful disease which has interrelated effects on the human body.
There can be many features that are common to both types of the disease. Through those
features it can be established that a meaningful prediction can be made from the
relevant data.
Chapter-3
System Development
3.1 Data
This dataset describes clinical records for Pima Indians and whether each patient will have
an onset of diabetes within five years.
Attribute no.   Attribute
1               No. of times pregnant (NTP)
2               Plasma glucose concentration (PGC)
3               Diastolic blood pressure (mmHg) (DBP)
4               Triceps skin-fold thickness (mm) (TSFT)
5               2-h serum insulin (mu U/mL) (H2SI)
6               Body mass index (kg/m2) (BMI)
7               Diabetes pedigree function (DPF)
8               Age
9               Outcome
3.2 Data Preprocessing
If missing values are not handled properly, we may end up drawing incorrect inferences
about the data. Not all columns may be useful for the model, and the available dataset
may not be in a form that can be used to train the machine; in all of these cases data
pre-processing is an important factor that determines a sound start for the model. Data
pre-processing is the process of turning raw data into a usable format, and it is one of
the most important steps required for training the model. It includes checking for null
values; where null values are found, they are replaced by the mean of the whole column.
In data pre-processing, categorical data can also be converted into numerical data:
label_encoder is an object that helps us convert categorical data into numerical data.
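The two steps described above, replacing null values with the column mean and turning categories into numbers, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the helper names `fill_missing_with_mean` and `encode_labels` and the sample values are assumptions made for the example.

```python
from statistics import mean


def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [value for value in column if value is not None]
    column_mean = mean(observed)
    return [column_mean if value is None else value for value in column]


def encode_labels(values):
    """Map each distinct category to an integer code (categories in sorted order)."""
    codes = {label: index for index, label in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


# Hypothetical BMI column with one missing entry.
bmi = [33.6, None, 23.3, 28.1]
bmi_filled = fill_missing_with_mean(bmi)

# Hypothetical categorical column converted to numeric codes.
outcome = ["yes", "no", "yes", "no"]
outcome_codes, mapping = encode_labels(outcome)
```

With pandas and sklearn the same effect is usually obtained from built-in helpers; the sketch only spells out what those helpers compute.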
Correlation shows the strength and direction of the linear relationship between two
quantitative variables. It takes values between -1 and +1. A positive value of r
indicates a positive association and a negative value of r indicates a negative
association.
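As an illustration of the coefficient r, the following sketch computes the Pearson correlation of two lists directly from its definition. The function name `pearson_r` and the sample values are assumptions for the example; the sketch does not handle the case of a constant column, where r is undefined.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values: y rises with x in the first pair and falls in the second.
r_positive = pearson_r([1, 2, 3], [2, 4, 6])
r_negative = pearson_r([1, 2, 3], [6, 4, 2])
```

Here `r_positive` is close to +1 and `r_negative` is close to -1, matching the interpretation of the sign given above.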
The last step in data pre-processing is the splitting of the data into training and
testing data. In our ML model we have used the train_test_split function from the
cross_validation module of the sklearn library.
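The idea behind such a split can be sketched with the standard library alone. This is a stand-in written for illustration, not sklearn's implementation; the function name `split_train_test`, the 30% test fraction and the seed are assumptions, since the report does not state the values it used.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction of rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Hypothetical dataset of ten records, identified here only by their index.
rows = list(range(10))
train, test = split_train_test(rows)
```

Every record ends up in exactly one of the two lists, so the model is never evaluated on data it was trained on.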
Figure 4 Correlation matrix
3.3 Conclusion
We have pre-processed our data and made it usable for further implementation. Missing
values were replaced, and columns were deleted or converted into numerical values where
needed, in order to have a positive impact on the model.
Chapter-4
Algorithms
Method
In this section, various ML and DL algorithms are used to predict diabetes on our dataset.
We use logistic regression, random forest, decision tree, artificial neural network and
other algorithms to predict and analyse results, and we compare these algorithms on the
basis of performance.
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used
for regression as well as for classification, but it is mostly used for classification, as
in our project. In this algorithm we plot each data item as a point in an n-dimensional
space, where n is the number of features. The points are separated into two classes by
finding a hyper-plane: one side of the hyper-plane belongs to one class and the other side
belongs to the other class. This algorithm is very effective when the number of dimensions
is high. It identifies the extreme points, also known as support vectors, in order to find
the hyper-plane. Every point belongs to either one class or the other.
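To make the role of the hyper-plane concrete, the sketch below classifies a point by the side of the hyper-plane w·x + b = 0 on which it falls. It is not a trained SVM: the weights `w` and bias `b` are hypothetical values chosen for the example, whereas a real SVM learns them from the support vectors.

```python
def decision_value(w, b, x):
    """Signed score of point x relative to the hyper-plane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 for points on the positive side of the hyper-plane, otherwise -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyper-plane in two dimensions: the line x1 - x2 = 0.
w, b = [1.0, -1.0], 0.0
label_a = classify(w, b, [3.0, 1.0])  # lies on the positive side
label_b = classify(w, b, [1.0, 3.0])  # lies on the negative side
```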
This algorithm does not just draw a simple line between two categories; it also keeps a
region of a certain width around the line. We will fit an SVM classifier to our
pre-processed data.
While the mathematical details of the likelihood model are interesting.
Figure 6 Sigmoid function [11]
The working of the random forest algorithm can be better understood with the help of the
following steps:
Step 2: The algorithm constructs a decision tree (DT) for every sample and obtains a
prediction from each DT.
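Once each DT has produced a prediction, the predictions have to be combined into one answer. The remaining steps are not shown above, so the following sketch assumes the usual approach of majority voting; the function name `majority_vote` and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five decision trees for one patient (1 = diabetic).
tree_predictions = [1, 0, 1, 1, 0]
final_prediction = majority_vote(tree_predictions)
```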
equal or independent. The mathematical representation of the Naïve Bayes algorithm is
as follows:
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]
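A small worked example may help in reading Bayes' rule. Take A as "the patient is diabetic" and B as "the glucose reading is high". All probabilities below are hypothetical values chosen for illustration and are not taken from the dataset.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical probabilities.
p_diabetic = 0.35              # P(A)
p_high_given_diabetic = 0.80   # P(B|A)
p_high_given_healthy = 0.20    # P(B|not A)

# Total probability of a high reading, P(B), over both groups.
p_high = (p_high_given_diabetic * p_diabetic
          + p_high_given_healthy * (1 - p_diabetic))

p_diabetic_given_high = posterior(p_diabetic, p_high_given_diabetic, p_high)
```

With these numbers, P(B) = 0.41 and the posterior is 0.28 / 0.41, so a high reading raises the probability of diabetes from 0.35 to roughly 0.68.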
A decision tree (DT) is a widely used tool for classification and prediction. A decision
tree looks like a tree in which the internal nodes contain tests on attributes and the
leaf nodes contain class labels.
Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which gives the classification of the instance. An instance is classified by
starting at the root node of the tree, testing the attribute specified by this node, and
then moving down the branch corresponding to the value of the attribute, as shown in the
figure below. This process is then repeated for the subtree rooted at the new node.
Figure 9 Decision tree [14]
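The traversal described above can be sketched with a tree stored as nested dictionaries. The tree, its attributes and its thresholds are hypothetical and were not learned from the dataset; a real tree would be built by a training algorithm.

```python
# Hypothetical tree: internal nodes test one attribute against a threshold,
# leaf nodes carry a class label (1 = diabetic, 0 = not diabetic).
tree = {
    "attribute": "glucose",
    "threshold": 127,
    "left": {"label": 0},
    "right": {
        "attribute": "bmi",
        "threshold": 30,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "label" not in node:
        go_left = instance[node["attribute"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


prediction = classify(tree, {"glucose": 150, "bmi": 35})
```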
4.6 Artificial Neural Network (ANN)
This algorithm is loosely based on the human brain. It works in a way similar to how
humans take decisions under different conditions. Decision making is done at the various
internal nodes of the ANN. The structure of an ANN is shown in the figure below.
Figure 10 Structure of a basic ANN [16]
Each input value has a weight associated with the prediction of the output parameter.
These values are passed from the input nodes to the hidden nodes. In the hidden layer,
a summation and an activation function work together to predict the final output. An ANN
basically works by feedback: the network is rewarded for a good prediction and penalized
for a bad one.
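The weighted summation and activation described above can be sketched as a forward pass through one hidden layer. The network size, the weights and the choice of the sigmoid as activation function are assumptions made for this example; they are not the architecture used in the project, and the training (feedback) step is not shown.

```python
from math import exp


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Weighted sum plus activation at each hidden node, then at the output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical network: 3 inputs, 2 hidden nodes, 1 output.
inputs = [0.5, 0.2, 0.9]
hidden_weights = [[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [0.7, -0.2]
output_bias = 0.05

prediction = forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias)
```

The output lies strictly between 0 and 1 and can be read as the predicted probability of the positive class.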
Here we have 8 criteria that can be used to estimate and predict diabetes. These
criteria are the number of pregnancies, plasma glucose concentration, diastolic blood
pressure, triceps skin-fold thickness, 2-hour serum insulin, body mass index (BMI), the
diabetes pedigree function and age. Taken together, these features are used to predict
whether a person has diabetes, and the combined set of features gives crucial
information for estimating diabetes in a human being.
All the features of our dataset play an important role in the prediction of diabetes.
Calculating the importance of each feature helps us find out how relevant each
feature is in determining the output of our model.
Figure 11 Importance of all features [17]
4.8 Conclusion
We have tried several different kinds of models and found that the most efficient
algorithm among them so far is random forest, which produced the best results of all
the models tried.
Chapter-5
Having studied a variety of literature and books related to the project, we implement
various models here and compare them on accuracy and other parameters. Different
classifiers from AI, deep learning and machine learning are implemented, and a confusion
matrix is computed for each of them. The confusion matrix is used to find the accuracy,
sensitivity, specificity and overall strength of performance of each model. Finally, we
plot a graph to determine which model is closest to the original data, using the least
squares method (LSM) and other distance formulas. [6]
A confusion matrix is a matrix that consists of four elements, namely TP, FP, FN and TN.
The classifier predicts a class for each sample, and each prediction is checked against
the actual class; depending on the outcome, the sample is counted under one of the four
elements. If the matrix is normalized by row, each row sums to 1 and every entry is a
proportion between 0 and 1; otherwise the entries are counts. This is how the confusion
matrix is formed.
TP: the actual value is positive and the predicted value is also positive.
FP: the actual value is negative but the predicted value is positive.
FN: the actual value is positive but the predicted value is negative.
TN: the actual value is negative and the predicted value is also negative.
                  Predicted Positive   Predicted Negative
Class Positive    TP                   FN
Class Negative    FP                   TN
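The four counts can be obtained by comparing each prediction with the actual class. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels for six patients.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
```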
5.1.2 Sensitivity
Sensitivity is the proportion of actual positives that the model correctly predicts as
positive, out of all actual positives. The mathematical expression of sensitivity is
given below.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
5.1.3 Specificity
Specificity is the proportion of actual negatives that the model correctly predicts as
negative, out of all actual negatives. The mathematical expression of specificity is
given below.
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
5.1.4 Accuracy
Accuracy tells us the proportion of all results that the model predicted correctly, i.e.
the extent to which the model's predictions agree with the data. It can be calculated
with the formula given below.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \]
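The three measures can be computed directly from the four counts. As a worked example, the sketch below uses the Logistic Regression entry (48, 32, 23, 128) from the confusion-matrix table in Section 5.2. It assumes that the entry is laid out as TP, FN in the first row and FP, TN in the second, which the table itself does not state explicitly.

```python
def sensitivity(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives that were predicted negative."""
    return tn / (tn + fp)


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Logistic Regression counts, under the assumed layout [[TP, FN], [FP, TN]].
tp, fn, fp, tn = 48, 32, 23, 128

sens = sensitivity(tp, fn)       # 48 / 80
spec = specificity(tn, fp)       # 128 / 151
acc = accuracy(tp, fp, fn, tn)   # 176 / 231
```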
5.2 Result Analysis
We used six machine learning algorithms on the pre-processed data: Random Forest,
Decision Tree, Logistic Regression, Naïve Bayes, Support Vector Machine and Artificial
Neural Network. The confusion matrices obtained are shown in Table 4.
Algorithm                                  Confusion matrix
Decision Tree                              55    25
                                           44   107
Support Vector Machine                     46    34
                                           27   124
Artificial Neural Network                  40    40
                                           29   122
Logistic Regression - Cross Validation     53    27
                                           42   109
Logistic Regression                        48    32
                                           23   128
Random Forest Classifier                   48    32
                                           28   123
Naive Bayes                                51    27
                                           32   119
Further, from the above confusion matrices we can deduce the values of sensitivity,
specificity, etc. In the medical field we always prefer an ML model with good sensitivity
and specificity.
Figure 12 Comparison of the algorithms (vertical axis from 0 to 0.9): Logistic Regression, Random Forest Classifier, Naive Bayes, Decision Tree, Support Vector Machine, Artificial Neural Network
Figure 12 shows the visualization of the results. It is clear from the figure that the
random forest algorithm has the highest accuracy, i.e. 86%, and the Naïve Bayes algorithm
has the lowest accuracy, i.e. 80%.
5.3 Conclusion
We have used six machine learning algorithms on the pre-processed data: Random Forest,
Decision Tree, Logistic Regression, Naïve Bayes, Support Vector Machine and Artificial
Neural Network. It is clear from the results and all the evaluation parameters used that
the Artificial Neural Network has the highest accuracy, i.e. 0.80, and the Decision Tree
algorithm has the lowest accuracy.
Chapter-6
Conclusion and Future Work
6.1 Conclusion
Here we applied different machine learning and deep learning techniques to construct a
diabetes classifier. We achieved the best accuracy, i.e. 0.80, with the Artificial Neural
Network classifier. One of the significant real-world medical problems is the detection of
diabetes at an early stage. In this study, systematic efforts are made in designing a
system that predicts a disease like diabetes. During this work, six machine learning
classification algorithms are studied and evaluated on various measures. Experiments are
performed on the Pima Indians Diabetes Database.
References
2015.
[13] A. Tettamanzi, M. Tomassini, Soft computing: integrating evolutionary, neural,
and fuzzy systems, Springer Science & Business Media (2013).
[14] M.A. Hearst, S.T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector
machines, IEEE Intell. Syst. Appl. 13 (4) (1998) 18–28.
[15] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and
applications, Neurocomputing 70 (1) (2006) 489–501.
[16] S.A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Trans. Syst.
Man Cybernet. SMC-6 (4) (1976) 325–327.
[17] Zhi-Hua Zhou, Yuan Jiang, NeC4.5: Neural Ensemble Based C4.5, IEEE Trans.
Knowl. Data Eng. 16 (2004).
JAYPEE UNIVERSITY OF INFORMATION TECHNOLOGY, WAKNAGHAT
PLAGIARISM VERIFICATION REPORT
Date: 7/15/2020
Type of Document (Tick): PhD Thesis / M.Tech Dissertation/Report / B.Tech Project Report / Paper
UNDERTAKING
I undertake that I am aware of the plagiarism related norms/regulations. If I am found guilty of any plagiarism or
copyright violations in the above thesis/report, even after award of degree, the University reserves the right to
withdraw/revoke my degree/report. Kindly allow me to avail the plagiarism verification report for the document
mentioned above.
Total No. of Pages = 36
Checked by
Please send your complete Thesis/Report in (PDF) & DOC (Word File) through your Supervisor/Guide at
[email protected]