Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181


Vol. 9 Issue 09, September-2020

Diabetes Prediction using Machine Learning Techniques
Mitushi Soni
Dept of Computer Science and Engineering
Shri G.S. Institute of Technology and Science
Indore, India

Dr. Sunita Varma
Dept of Information Technology
Shri G.S. Institute of Technology and Science
Indore, India

Abstract:- Diabetes is an illness caused by a high glucose level in the human body. It should not be ignored: if left untreated, diabetes may cause major problems such as heart disease, kidney problems, high blood pressure and eye damage, and it can also affect other organs of the body. Diabetes can be controlled if it is predicted early. To achieve this goal, this work performs early prediction of diabetes in a patient with higher accuracy by applying various machine learning techniques. Machine learning techniques provide better prediction results by constructing models from datasets collected from patients. In this work we apply machine learning classification and ensemble techniques to a dataset to predict diabetes: K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Gradient Boosting (GB) and Random Forest (RF). The accuracy differs from model to model. The work identifies the model with the highest accuracy and shows that it is capable of predicting diabetes effectively. Our results show that Random Forest achieved higher accuracy than the other machine learning techniques.

Keywords: Diabetes, Machine Learning, Prediction, Dataset, Ensemble

I. INTRODUCTION
Diabetes is one of the most noxious diseases in the world. It is caused by obesity, a high blood glucose level, and so forth. It affects the hormone insulin, resulting in abnormal metabolism of carbohydrates and raising the level of sugar in the blood. Diabetes occurs when the body does not make enough insulin. According to the World Health Organization (WHO), about 422 million people suffer from diabetes, particularly in low- and middle-income countries, and this could increase to 490 million by the year 2030. Diabetes is prevalent in various countries such as Canada, China and India. India, whose population now exceeds one billion, has about 40 million diabetics. Diabetes is a major cause of death in the world. Early prediction of a disease like diabetes allows it to be controlled and can save human lives. To accomplish this, this work explores the prediction of diabetes using various attributes related to the disease. For this purpose we use the Pima Indian Diabetes Dataset and apply various machine learning classification and ensemble techniques to predict diabetes. Machine learning is a method used to train computers to learn from data rather than through explicit programming. Various machine learning techniques provide efficient results for extracting knowledge by building classification and ensemble models from a collected dataset. Such collected data can be useful for predicting diabetes. Various machine learning techniques are capable of prediction; however, it is hard to choose the best one. For this purpose we apply popular classification and ensemble methods to the dataset for prediction.

II. LITERATURE REVIEW
K. VijiyaKumar et al. [11] proposed a Random Forest algorithm for the prediction of diabetes, developing a system that can perform early prediction of diabetes for a patient with higher accuracy. The proposed model gives the best results for diabetes prediction, and the results showed that the prediction system is capable of predicting the disease effectively, efficiently and, most importantly, instantly. Nonso Nnamoko et al. [13] presented "Predicting Diabetes Onset: an Ensemble Supervised Learning Approach". Five widely used classifiers are employed for the ensembles and a meta-classifier is used to aggregate their outputs. The results are presented and compared with similar studies in the literature that used the same dataset. It is shown that the proposed method predicts diabetes onset with higher accuracy. Tejas N. Joshi et al. [12] presented "Diabetes Prediction Using Machine Learning Techniques", which aims to predict diabetes via three supervised machine learning methods: SVM, logistic regression and ANN. The project proposes an effective technique for earlier detection of diabetes. Deeraj Shetty et al. [15] proposed "Diabetes Disease Prediction Using Data Mining", assembling an Intelligent Diabetes Disease Prediction System that analyzes diabetes using a database of diabetes patients. In this system they propose applying algorithms such as Bayesian and KNN (K-Nearest Neighbor) to the patient database and analyzing various attributes of diabetes to predict the disease. Muhammad Azeem Sarwar et al. [10] proposed a study on the prediction of diabetes using machine learning algorithms in healthcare. They applied six different machine learning algorithms, and the performance and accuracy of the applied algorithms are discussed and compared. The comparison reveals which algorithm is best suited for the prediction of diabetes.

Diabetes prediction is becoming an area of interest for researchers, who train programs to identify whether a patient is diabetic by applying a suitable classifier to the dataset. Based on previous research work, it has been observed that the classification process has not improved much. Hence a system is required to handle the issues identified in previous research, as diabetes prediction is an important area in computing.

III. PROPOSED METHODOLOGY
The goal of the paper is to find a model that predicts diabetes with better accuracy. We experimented with different classification and ensemble algorithms to predict diabetes. In the following, we briefly discuss each phase.

A. Dataset Description- The data is gathered from the UCI repository and is named the Pima Indian Diabetes Dataset. The dataset has several attributes for 768 patients.

Table 1: Dataset Description
S No.   Attributes
1       Pregnancy
2       Glucose
3       Blood Pressure
4       Skin thickness
5       Insulin
6       BMI (Body Mass Index)
7       Diabetes Pedigree Function
8       Age

The 9th attribute is the class variable of each data point. This class variable takes the value 0 or 1, indicating a negative or positive outcome for diabetes.

Distribution of Diabetic patient- We built a model to predict diabetes; however, the dataset is slightly imbalanced, with around 500 instances labeled 0 (negative, no diabetes) and 268 labeled 1 (positive, diabetic).

Figure 1: Ratio of Diabetic and Non Diabetic Patient

B. Data Preprocessing- Data preprocessing is the most important process. Healthcare data often contains missing values and other impurities that reduce the effectiveness of the data. Data preprocessing is done to improve the quality and effectiveness of the results obtained after the mining process. It is essential for accurate results and successful prediction when applying machine learning techniques to the dataset. For the Pima Indian diabetes dataset we perform preprocessing in two steps.

1). Missing Values removal- Remove all instances that have zero (0) as a value, since zero is not a possible value for these attributes. Such instances are therefore eliminated. By eliminating irrelevant features/instances we obtain a feature subset; this process is called feature subset selection, which reduces the dimensionality of the data and helps the model work faster.

2). Splitting of data- After cleaning, the data is normalized and split for training and testing the model. We train the algorithm on the training set and keep the test set aside. The training process produces a trained model based on the logic of the algorithm and the values of the features in the training data. The aim of normalization is to bring all attributes onto the same scale.

C. Apply Machine Learning- When the data is ready, we apply machine learning techniques. We use different classification and ensemble techniques to predict diabetes, applied to the Pima Indians diabetes dataset. The main objective is to analyze the performance of these methods, find their accuracy, and identify the important features that play a major role in prediction. The techniques are as follows-

1) Support Vector Machine- Support Vector Machine, also known as SVM, is a supervised machine learning algorithm and one of the most popular classification techniques. SVM creates a hyperplane that separates two classes. It can create a hyperplane or a set of hyperplanes in a high-dimensional space, and this hyperplane can be used for classification or regression. SVM differentiates instances of specific classes and can also classify entities that are not supported by the data. The separation is performed by the hyperplane that lies farthest from the closest training point of any class.
Algorithm-
• Select the hyperplane that divides the classes best.
• To find the best hyperplane, calculate the distance between the hyperplane and the data points, which is called the margin.
• If the distance between the classes is low, the chance of misclassification is high, and vice versa. So we need to select the hyperplane with the highest margin.
Margin = distance to positive point + distance to negative point.

2) K-Nearest Neighbor- KNN is also a supervised machine learning algorithm. It can solve both classification and regression problems. KNN is a lazy prediction technique. It assumes that similar things are near each other; data points that are similar are often very close to each other. KNN classifies new data based on a similarity measure: the algorithm stores all the records and classifies them according to their similarity. It uses a tree-like structure to find the distances between points. To make a prediction for a new data point, the algorithm finds the closest data points in the training set, its nearest neighbors. Here K is the number of nearby neighbors and is always a positive integer. The neighbor's value is chosen from the set of classes. Closeness is mainly defined in terms of Euclidean distance. The Euclidean distance between two points P and Q, i.e. P (p1, p2, ..., pn) and Q (q1, q2, ..., qn), is defined by the following equation:

$$ d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$

Algorithm-
• Take a sample dataset of columns and rows, the Pima Indian Diabetes dataset.
• Take a test dataset of attributes and rows.
• Find the Euclidean distance using the formula given above.
• Then choose a value for K, the number of nearest neighbors.
• Then, using the K minimum Euclidean distances, find the nth column (the class value) of each neighbor.
• Find the output values that agree.

If the values are the same, then the patient is diabetic, otherwise not.

3) Decision Tree- A decision tree is a basic classification method and a supervised learning method. It is used when the response variable is categorical. A decision tree is a tree-structured model that describes the classification process based on the input features. Input variables can be of any type, such as graph, text, discrete or continuous. Steps for Decision Tree Algorithm-
• Construct a tree with nodes as input features.
• Select the feature with the highest information gain to predict the output.
• Calculate the information gain for each attribute at each node of the tree.
• Repeat step 2 to form a subtree using a feature that is not used in the node above.

4) Logistic Regression- Logistic regression is also a supervised learning classification algorithm. It is used to estimate the probability of a binary response based on one or more predictors, which can be continuous or discrete. Logistic regression is used when we want to classify data items into categories. It classifies the data in binary form, i.e. only 0 and 1, which here corresponds to classifying a patient as negative or positive for diabetes. The main aim of logistic regression is to find the best fit describing the relationship between the target and the predictor variables. Logistic regression is based on the linear regression model and uses the sigmoid function to predict the probability of the positive and negative classes.

Sigmoid function:

$$ P = \frac{1}{1 + e^{-(a + bx)}} $$

Here P = probability, a and b = parameters of the model.

Ensembling- Ensembling is a machine learning technique that uses multiple learning algorithms together for a task. It typically provides better prediction than any individual model, which is why it is used. The main causes of error are noise, bias and variance; ensemble methods help to reduce these errors. Popular ensemble methods include bagging, boosting, AdaBoost, gradient boosting, voting and averaging. In this work we have used bagging (Random Forest) and gradient boosting ensemble methods for predicting diabetes.

5) Random Forest- Random Forest is a type of ensemble learning method used for classification and regression tasks. It gives higher accuracy than the other models and can easily handle large datasets. Random Forest was developed by Leo Breiman and is a popular ensemble learning method. It improves the performance of decision trees by reducing variance. It operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Algorithm-
• The first step is to select "R" features from the total "M" features, where R << M.
• Among the "R" features, choose the node using the best split point.
• Split the node into sub-nodes using the best split.
• Repeat the first three steps until "l" number of nodes has been reached.
• Build the forest by repeating the first four steps "n" times to create "n" trees.

The random forest finds the best split using the Gini index cost function, which is given by:

$$ Gini = 1 - \sum_{k=1}^{c} p_k^2 $$

where p_k is the proportion of samples of class k at the node and c is the number of classes.

To make a prediction, the first step is to take the test features and use the rules of each randomly created decision tree to predict the outcome, storing the predicted outcome (target). Second, calculate the votes for each predicted target; finally, take the highest-voted predicted target as the final prediction of the random forest algorithm. Random Forest gives correct predictions for a wide range of applications.

6) Gradient Boosting- Gradient Boosting is a powerful ensemble technique used for prediction, and it is a classification technique. It combines weak learners to make strong learner models for prediction. It uses the decision tree model. It classifies complex datasets and is a very effective and popular method. In a gradient boosting model, performance improves over iterations.
Algorithm-
• Consider a sample of target values as P.
• Estimate the error in the target values.
• Update and adjust the weights to reduce the error M.
• P[x] = p[x] + alpha M[x]
• Model learners are analyzed and calculated by the loss function F.
• Repeat the steps until the desired target result P is reached.
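As a minimal sketch of the update rule P[x] = p[x] + alpha M[x] listed above, the following pure-Python example boosts one-feature decision stumps on the current errors. The squared-error loss, the stump learner, the toy glucose readings and all function names are assumptions made for illustration; the paper does not give its implementation.

```python
# Gradient boosting sketch: each round fits a decision stump to the current
# errors M and applies the update P[x] = P[x] + alpha * M[x].
# The squared-error loss, the stump learner, and all names are illustrative
# assumptions, not the paper's implementation.

def fit_stump(xs, errors):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    mean = sum(errors) / len(errors)
    best = (float("inf"), float("inf"), mean, mean)  # (loss, threshold, left, right)
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [e for x, e in zip(xs, errors) if x <= t]
        right = [e for x, e in zip(xs, errors) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        loss = sum((e - lv) ** 2 for e in left) + sum((e - rv) ** 2 for e in right)
        if loss < best[0]:
            best = (loss, t, lv, rv)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=20, alpha=0.5):
    """Fit an additive model; returns (base, alpha, stumps)."""
    base = sum(ys) / len(ys)                      # initial prediction P
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        errors = [y - p for y, p in zip(ys, preds)]   # error M in the targets
        stump = fit_stump(xs, errors)                 # weak learner fitted to M
        stumps.append(stump)
        preds = [p + alpha * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, alpha, stumps


def predict(model, x):
    base, alpha, stumps = model
    return base + alpha * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy example: hypothetical glucose readings with outcome 0 or 1.
glucose = [85, 90, 100, 110, 140, 150, 160, 180]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(glucose, outcome)
label = 1 if predict(model, 170) >= 0.5 else 0   # threshold the score to get a class
```

Each round adds only a fraction alpha of the new learner's output, so the training error shrinks gradually over iterations, which matches the statement that performance improves as boosting proceeds.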

Figure 2: Overview of the Process

IV. MODEL BUILDING
This is the most important phase, which includes model building for the prediction of diabetes. In this phase we have implemented the machine learning algorithms discussed above for diabetes prediction.
Procedure of Proposed Methodology-
Step1: Import the required libraries and the diabetes dataset.
Step2: Pre-process the data to remove missing data.
Step3: Perform a percentage split, with 80% of the dataset as the training set and 20% as the test set.
Step4: Select the machine learning algorithm, i.e. K-Nearest Neighbor, Support Vector Machine, Decision Tree, Logistic Regression, Random Forest or Gradient Boosting.
Step5: Build the classifier model for the selected machine learning algorithm based on the training set.
Step6: Test the classifier model for the selected machine learning algorithm based on the test set.
Step7: Perform a comparative evaluation of the experimental performance results obtained for each classifier.
Step8: After analyzing the results on various measures, conclude which algorithm performs best.

V. EXPERIMENTAL RESULTS
In this work different steps were taken. The proposed approach uses different classification and ensemble methods and is implemented in Python. These are standard machine learning methods used to obtain the best accuracy from data. We see that the random forest classifier achieves better accuracy than the others. Overall, we have used the best machine learning techniques for prediction to achieve high accuracy. Figure 3 shows the results of these machine learning methods.

Figure 3: Accuracy Result of Machine learning methods

The features that played an important role in prediction are presented for the random forest algorithm. The importance of each feature that plays a major role in diabetes has been plotted, where the X-axis represents the importance of each feature and the Y-axis the names of the features.

Figure 4: Feature Importance Plot for Random Forest

VI. CONCLUSION
The main aim of this project was to design and implement diabetes prediction using machine learning methods and to analyze the performance of those methods, and it has been achieved successfully. The proposed approach uses various classification and ensemble learning methods, in which SVM, KNN, Random Forest, Decision Tree, Logistic Regression and Gradient Boosting classifiers are used, and 77% classification accuracy has been achieved. The experimental results can assist health care in making early predictions and early decisions to cure diabetes and save human lives.

VII. REFERENCES
[1] Debadri Dutta, Debpriyo Paul, Parthajeet Ghosh, "Analyzing Feature Importance’s for Diabetes Prediction using Machine Learning". IEEE, pp 942-928, 2018.
[2] K.VijiyaKumar, B.Lavanya, I.Nirmala, S.Sofia Caroline, "Random Forest Algorithm for the Prediction of Diabetes". Proceeding of International Conference on Systems Computation Automation and Networking, 2019.
[3] Md. Faisal Faruque, Asaduzzaman, Iqbal H. Sarker, "Performance Analysis of Machine Learning Techniques to Predict Diabetes Mellitus". International Conference on Electrical, Computer and Communication Engineering (ECCE), 7-9 February, 2019.
[4] Tejas N. Joshi, Prof. Pramila M. Chawan, "Diabetes Prediction Using Machine Learning Techniques". Int. Journal of Engineering Research and Application, Vol. 8, Issue 1, (Part -II) January 2018, pp.-09-13.


[5] Nonso Nnamoko, Abir Hussain, David England, "Predicting Diabetes Onset: an Ensemble Supervised Learning Approach". IEEE Congress on Evolutionary Computation (CEC), 2018.
[6] Deeraj Shetty, Kishor Rit, Sohail Shaikh, Nikita Patil, "Diabetes Disease Prediction Using Data Mining". International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 2017.
[7] Nahla B., Andrew et al, "Intelligible support vector machines for diagnosis of diabetes mellitus". IEEE Transactions on Information Technology in Biomedicine, 14, (July 2010), 1114-20.
[8] A.K., Dewangan, and P., Agrawal, "Classification of Diabetes Mellitus Using Machine Learning Techniques". International Journal of Engineering and Applied Sciences, vol. 2, 2015.
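
The eight steps of Section IV can be illustrated with a small, self-contained sketch. It assumes a handful of invented records (glucose, BMI, age, outcome) rather than the Pima Indian Diabetes Dataset, applies the two preprocessing steps of Section III (removal of zero values, then normalization and an 80/20 split), and uses KNN with Euclidean distance as the single classifier. Min-max scaling is one possible normalization and is chosen here as an assumption; the function names are hypothetical and this is not the authors' code.

```python
import math
import random

# Hypothetical records (glucose, BMI, age, outcome); invented for illustration,
# not rows of the Pima Indian Diabetes Dataset.
RECORDS = [
    (85, 22.0, 21, 0), (90, 23.5, 24, 0), (95, 24.0, 26, 0),
    (88, 25.0, 28, 0), (100, 22.8, 23, 0), (0, 30.1, 33, 0),
    (150, 35.0, 45, 1), (160, 36.5, 50, 1), (170, 38.0, 52, 1),
    (180, 40.0, 55, 1), (155, 37.2, 48, 1), (120, 0, 40, 1),
]


def drop_zero_rows(rows):
    """Missing-value removal: discard records with a 0 in any feature column."""
    return [r for r in rows if all(v != 0 for v in r[:-1])]


def split_80_20(rows, seed=0):
    """Shuffle, then keep 80% for training and 20% for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * 0.2))
    return rows[n_test:], rows[:n_test]


def fit_min_max(rows):
    """Per-feature minimum and maximum, learned from the training set only."""
    n_features = len(rows[0]) - 1
    lows = [min(r[i] for r in rows) for i in range(n_features)]
    highs = [max(r[i] for r in rows) for i in range(n_features)]
    return lows, highs


def scale(row, lows, highs):
    """Bring every feature onto the same scale; the class label is kept as is."""
    feats = [(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row[:-1], lows, highs)]
    return feats + [row[-1]]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train, features, k=3):
    """Majority class among the k training records closest to `features`."""
    nearest = sorted(train, key=lambda r: euclidean(r[:-1], features))[:k]
    votes = [r[-1] for r in nearest]
    return max(set(votes), key=votes.count)


def run_pipeline(records, k=3, seed=0):
    clean = drop_zero_rows(records)                  # Step 2: pre-process
    train, test = split_80_20(clean, seed)           # Step 3: 80/20 split
    lows, highs = fit_min_max(train)
    train = [scale(r, lows, highs) for r in train]
    test = [scale(r, lows, highs) for r in test]
    # Steps 5 and 6: KNN stores the training set, then is tested on the test set.
    hits = sum(knn_predict(train, r[:-1], k) == r[-1] for r in test)
    return hits / len(test)                          # accuracy used in Step 7


print("KNN accuracy on the toy test set:", run_pipeline(RECORDS))
```

The scaling bounds are computed from the training records only and then reused for the test records, so no information from the test set leaks into training. Repeating `run_pipeline` with other classifiers in place of `knn_predict` would give the per-classifier comparison described in Step 7.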

IJERTV9IS090496 www.ijert.org 925


(This work is licensed under a Creative Commons Attribution 4.0 International License.)
