
ISSN (Online) 2394-6849

International Journal of Engineering Research in Computer Science and Engineering


(IJERCSE)
Vol 7, Issue 10, October 2020

Predictive Modelling and Analytics for Diabetes using a Machine Learning Approach

[1] Prateek Mishra, [2] Dr. Anurag Sharma, [3] Dr. Abhishek Badholi
[1][2][3] Computer Science and Engineering, MATS University, Raipur, India

Abstract: Diabetes is a major disorder which may adversely affect the entire body system. Undiagnosed diabetes can increase the risk of cardiac stroke, diabetic nephropathy and other disorders. All over the world many people are suffering from this disease. Early detection of diabetes is very important to maintain a healthy life. This disease is a reason for worldwide concern, as the cases of diabetes are rising rapidly. Machine learning (ML) is a computational method for automatic learning from experience which improves performance to make more accurate predictions. In the current research we have applied machine learning techniques to the Pima Indian diabetes dataset to develop trends and detect patterns with risk factors using the R data manipulation tool. To classify the patients as diabetic or non-diabetic, we have developed and analyzed five different predictive models using R. For this purpose, we used the supervised machine learning algorithms linear kernel support vector machine (SVM-linear), radial basis function (RBF) kernel support vector machine, k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR).

Keywords— Machine learning, Multifactor dimensionality reduction (MDR), Support vector machine (SVM), k-nearest neighbour (k-NN), Artificial neural network (ANN)

I. INTRODUCTION

Diabetes is a quite common metabolic disease. Usually the onset of diabetes happens in middle age and sometimes in adulthood, but nowadays incidences of this disease are reported in children also. There are several factors for developing diabetes, like genetic susceptibility, weight, food habits and a sedentary lifestyle. Undiagnosed diabetes may result in a very high blood glucose level, referred to as hyperglycemia, which may cause complications like diabetic retinopathy, nephropathy, neuropathy, cardiac stroke and foot ulcer. So, early detection of diabetes is extremely important to enhance the quality of life of patients and their life expectancy [1].

Machine Learning is concerned with the development of algorithms and techniques that permit computers to learn and gain intelligence based on past experience. It is a branch of artificial intelligence (AI) and is closely associated with statistics. By learning it is meant that the system is able to identify and understand the input data, so that it can make decisions and predictions based on it [2]. The learning process starts with the gathering of data by different means, from various resources. The subsequent step is to prepare the data, that is, to pre-process it in order to fix data-related issues and to reduce the dimensionality of the space by removing the irrelevant data (or selecting the data of interest) [3]. Since the quantity of data used for learning is large, it is difficult for the system to make decisions, so algorithms are designed using logic, probability, statistics, control theory etc. to analyse the data and retrieve the knowledge from past experiences [4]. The next step is testing the model to calculate the accuracy and performance of the system. And finally, optimization of the system, i.e. improving the model by using new rules or data sets [5]. The techniques of machine learning are used for classification, prediction and pattern recognition. Machine learning can be applied in various areas like software applications, website ranking, email filtering, face tagging and recognition, related advertisements, character recognition, gaming, robotics, disease prediction and traffic management [6]. The essential learning process to develop a predictive model is shown in Figure 1.

Nowadays, machine learning algorithms are used for automatic analysis of high-dimensional biomedical data [7]. Diagnosis of disease, skin lesions, cancer classification, risk assessment for disorders, and analysis of genetic and genomic data are some of the examples of biomedical applications of ML [8, 9]. For disease diagnosis, the SVM algorithm has been successfully implemented [10]. In order to diagnose major depressive disorder (MDD) based on an EEG dataset, classification models like support vector machine (SVM), logistic regression (LR) and Naïve Bayesian (NB) have been used [11]. Our novel model is implemented using supervised machine learning techniques in R for the Pima Indian diabetes dataset to develop trends and detect patterns for the knowledge discovery process in diabetes. This dataset describes the Pima Indian population's medical history regarding the onset of diabetes. It includes several independent variables and one class variable of diabetes in terms of 0 and 1. In this work, we have studied the performance of five different models based upon linear kernel support vector


machine (SVM-linear), radial basis kernel support vector machine (SVM-RBF), k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR) algorithms to detect diabetes in female patients [12].
II. Related Material and Method
A dataset of female patients of the Pima Indian population with a minimum age of twenty-one years has been taken from the UCI machine learning repository. This dataset is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. In this dataset there are a total of 768 instances classified into two classes, diabetic and non-diabetic, with eight different risk factors: number of times pregnant, plasma glucose concentration at two hours in an oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, body mass index, diabetes pedigree function and age [13].

We have investigated this diabetes dataset using the powerful R data manipulation tool. Feature engineering is a crucial step in applications of the machine learning process. Modern data sets are described with many attributes for practical machine learning model building; usually most of the attributes are irrelevant to the supervised machine learning classification. The preprocessing phase of the data involved feature selection, removal of outliers and k-NN imputation to predict the missing values [14].

There are various methods for handling irrelevant and inconsistent data. In this work, we have selected the attributes containing the highly correlated data. This step is implemented by a feature selection method, which may be done either by a 'manual method' or by the Boruta wrapper algorithm. The Boruta package provides stable and unbiased selection of important features from a data system, whereas the manual method is error prone. So, feature selection has been done with the assistance of the R package Boruta; the method is available as an R package [15]. This package provides a convenient interface for machine learning algorithms. The Boruta package is designed as a wrapper built around the random forest classification algorithm implemented in R. The Boruta wrapper is run on the Pima Indian dataset with all the attributes and it yielded four attributes as important. With these attributes, the accuracy, precision, recall and other parameters are calculated [16].

Figure 1: Essential learning process to develop a predictive model.

There are a few machine learning techniques which can be used to implement the machine learning process. Learning techniques like supervised and unsupervised learning are most generally used. The supervised learning technique is employed when historical data is available for a particular problem. The system is trained with the inputs and respective responses and then used for the prediction of the response of new data [17]. Common supervised approaches include artificial neural networks, back propagation, decision trees, support vector machines and the Naïve Bayes classifier. The unsupervised learning technique is employed when the available training data is unlabeled. The system is not given any prior information or training [18]. The algorithm has to explore and identify the patterns from the available data in order to make decisions or predictions. Common unsupervised approaches include k-means clustering, hierarchical clustering, principal component analysis and the hidden Markov model [19].

Supervised machine learning algorithms are selected to perform binary classification of the diabetes dataset of Pima Indians. For predicting whether a patient is diabetic or not, we have used five different algorithms in our machine learning predictive models: linear kernel and radial basis function (RBF) kernel support vector machine (SVM), k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR). The details are given below.

A. Support Vector Machine
Support vector machine (SVM) is employed in both classification and regression. In the SVM model, the data points are represented in the space and are categorized into groups, and the points with similar properties fall in the same group.


Figure 2: Representation of Support Vector Machine.

In linear SVM the given data set is considered as a p-dimensional vector which can be separated by a maximum of p−1 planes called hyper-planes [20]. These planes separate the data space or set the boundaries among the data groups for classification or regression problems, as in Figure 2. The best hyper-plane can be selected among the number of hyper-planes on the basis of the distance between the two classes it separates. The plane that has the maximum margin between the two classes is called the maximum-margin hyper-plane [21].
A training set of n data points is defined as:
(X1, Y1), ..., (Xn, Yn)    (1)
where Xi is a real vector and Yi can be 1 or −1, representing the class to which Xi belongs.
A hyper-plane constructed so as to maximize the distance between the two classes y = 1 and y = −1 is defined as:
W · X − b = 0    (2)
where W is the normal vector and b is the offset of the hyper-plane along W.

B. Radial Basis Function (RBF) Kernel Support Vector Machine
Support vector machine has proven its efficiency on linear data, and the radial basis function has been implemented with this algorithm to classify nonlinear data [21].

Figure 3: Representation of Radial Basis Function (RBF) kernel support vector machine.

The kernel function plays a very important role in mapping data into the feature space. Mathematically, the kernel trick (K) is defined as:
K(Xi, Xj) = Φ(Xi) · Φ(Xj)    (3)
A Gaussian function is also known as the radial basis function (RBF) kernel. In Figure 3, the input space is separated by the feature map (Φ). By applying equations 1 and 2 we get:
f(X) = sign(Σi αi Yi (Xi · X) − b)    (4)
By applying equation 3 in 4 we get a new function, where N represents the trained data:
f(X) = sign(Σi=1..N αi Yi K(Xi, X) − b)    (5)

C. k-Nearest Neighbour (k-NN)
k-Nearest neighbour is a simple algorithm but yields excellent results. It is a lazy, nonparametric, instance-based learning algorithm which can be used in both classification and regression problems. In classification, k-NN is applied to find the category to which a new unlabeled object belongs. For this, a 'k' is set (where k is the number of neighbours to be considered), which is usually odd, and the distance between the data points that are nearest to the object is calculated by measures like Euclidean distance, Hamming distance, Manhattan distance or Minkowski distance. After calculating the distances, the 'k' nearest neighbours are selected and the resultant class of the new object is decided on the basis of the votes of the neighbours. The k-NN predicts the result with high accuracy [22].
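Sections A-C above map directly onto standard library calls. The following is a minimal illustrative sketch in Python with synthetic data, not the authors' R models; the synthetic dataset merely mimics the Pima data's shape (768 rows, 8 features, binary class).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the Pima data: 768 rows, 8 features, binary class.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM-linear": SVC(kernel="linear"),           # maximum-margin hyper-plane, eq. (2)
    "SVM-RBF": SVC(kernel="rbf"),                 # Gaussian kernel, eqs. (3)-(5)
    "k-NN": KNeighborsClassifier(n_neighbors=5),  # odd k, Euclidean distance by default
}
accuracy = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(accuracy)
```

The accuracies printed here describe only the synthetic data, not the paper's reported results.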
D. Artificial Neural Network (ANN)
An artificial neural network mimics the functionality of the human brain. It can be seen as a set of nodes called artificial neurons. All of these nodes can transmit information to one another. The neurons can be represented by some state (0 or 1) and every node may also have a weight assigned to it that defines its strength or importance within the system. The structure of an ANN is split into layers of multiple nodes; the data travels from the first


layer (input layer) and, after passing through the middle layers (hidden layers), reaches the output layer; every layer transforms the data into some relevant information and eventually gives the desired output [23]. Transfer and activation functions play an important role in the functioning of neurons. The transfer function sums up all the weighted inputs as:
S = Σi (Wi · Xi) + b    (6)
where b is the bias value, which is usually 1.
The activation function basically flattens the output of the transfer function to a selected range. It can be either linear or nonlinear. The simplest activation function is:
f(S) = S    (7)
Since this function does not provide any limits to the data, the sigmoid function is used, which can be expressed as:
f(S) = 1 / (1 + e^(−S))    (8)

E. Multifactor Dimensionality Reduction (MDR)
Multifactor dimensionality reduction is an approach for locating and representing the consolidation of independent variables which can somehow influence the dependent variables. It is basically designed to find the interactions between the variables which can affect the output of the system. It does not depend on parameters or the type of model being used, which makes it better than other traditional systems. It takes two or more attributes and converts them into one. This conversion changes the space representation of the data, which results in improvement of the performance of the system in predicting the category variable. Several extensions of MDR are utilized in machine learning; a few of them are fuzzy methods, odds ratio, risk scores, covariates and many more [24].

III. Predictive Model
In our proposed predictive model (Figure 4), we have done pre-processing of the data and applied different feature engineering techniques to get better results. Pre-processing involved removal of outliers and k-NN imputation to predict the missing values. The Boruta wrapper algorithm is employed for feature selection because it provides unbiased selection of important and unimportant features from a data system. Training of data after feature engineering plays a significant role in supervised learning. We have used highly correlated variables for better outcomes [25]. Input data here refers to the test data used for prediction and the confusion matrix.
Early diagnosis of diabetes is often helpful to enhance the quality of life of patients and their life expectancy. Supervised algorithms are used to develop different models for diabetes detection. The data set is partitioned into two parts (training and testing). We trained our model with 70% training data and tested with the remaining 30%. Five different models are developed using supervised learning to detect whether the patient is diabetic or non-diabetic. For this purpose, linear kernel support vector machine (SVM-linear), radial basis


Figure 4: Framework for evaluating Predictive Model.


function (RBF) kernel support vector machine (SVM-RBF), k-NN, ANN and MDR algorithms are used. To diagnose diabetes for the Pima Indian population, the performance of all five models is evaluated upon parameters like precision, recall, area under curve (AUC) and F1 score. In order to avoid the problems of overfitting and underfitting, tenfold cross-validation is completed. Accuracy indicates how often our classifier is correct in diagnosing whether a patient is diabetic or not. Precision has been used to determine the classifier's ability to provide correct positive predictions of diabetes. Recall or sensitivity is used in our work to find the proportion of actual positive cases of diabetes correctly identified by the classifier. Specificity is used to compute the classifier's capability of determining negative cases of diabetes. The weighted average of precision and recall gives the F1 score, so this score takes account of both. Classifiers with an F1 score near 1 are termed the best [18]. The receiver operating characteristic (ROC) curve is a well-documented tool to ascertain the performance of a binary classifier algorithm [19]. It is a plot of the true positive rate against the false positive rate as the threshold for assigning observations to a selected class is varied. The area under curve (AUC) value of a classifier may lie between 0.5 and 1. Values below 0.50 indicate a group of random data which cannot distinguish between true and false. An optimal classifier has an AUC value near 1.0; if it is near 0.5 then the classifier is like random guessing [20].
From Table 2, which presents the different parameters for evaluating all the models, it is found that the accuracy of the linear kernel SVM model is 0.89. For the radial basis function kernel SVM, accuracy is 0.84. For the k-NN model accuracy is found to be 0.88, while for ANN it is 0.86. The accuracy of the MDR based model is found to be 0.83. Recall or sensitivity, which indicates the correctly identified proportion of actual positive diabetic cases, is 0.87 for the SVM-linear model and 0.83 for SVM-RBF. For the k-NN, ANN and MDR based models, recall values are found to be 0.90, 0.88 and 0.87 respectively. The precision of the SVM-linear, SVM-RBF, k-NN, ANN and MDR models is found to be 0.88, 0.85, 0.87, 0.85 and 0.82 respectively. The F1 score of the SVM-linear, SVM-RBF, k-NN, ANN and MDR models is found to be 0.87, 0.83, 0.88, 0.86 and 0.84 respectively. We have calculated the area under the curve (AUC) to measure the performance of our models. It is found that the AUC of the SVM-linear model is 0.90, while for the SVM-RBF, k-NN, ANN and MDR models the values are respectively 0.85, 0.92, 0.88 and 0.89. So, from the above studies, it can be said that on the basis of all the parameters SVM-linear and k-NN are the two best models to find out whether a patient is diabetic or not. Further, it can be seen that the accuracy and precision of the SVM-linear model are higher compared to the k-NN model, but the recall and F1 score of the k-NN model are above those of the SVM-linear model. If we examine our diabetic dataset carefully, it is found to be


an example of an imbalanced class, with 500 negative instances and 268 positive instances, giving an imbalance ratio of 1.87. Accuracy alone will not provide a very good indication of the performance of a binary classifier in case of an imbalanced class. The F1 score provides better insight into classifier performance in case of uneven class distribution because it provides a balance between precision and recall [21, 25]. So, in this case the F1 score should also be taken care of. Further, it can be seen that the AUC values of the SVM-linear and k-NN models are 0.90 and 0.92 respectively.
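The evaluation measures discussed above follow the standard confusion-matrix definitions. A quick illustrative sketch in Python, using made-up prediction counts rather than the paper's results:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical predictions on a small imbalanced sample (3 diabetic, 7 control);
# these labels are invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
accuracy = accuracy_score(y_true, y_pred)    # 8 of 10 correct = 0.8
print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)

# The class imbalance the text describes: 500 negative vs 268 positive instances.
print(round(500 / 268, 2))  # 1.87
```

Note how accuracy (0.8) looks healthier than the F1 score (0.67) here, which is exactly why F1 is preferred for the imbalanced Pima data.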
IV. Patient Demographics
The dataset has been taken from the UCI machine learning repository. It consisted of 768 female patients, a minimum of 21 years old, of Pima Indian heritage, with diabetes diagnoses (diabetic or control). There were 268 cases of diabetic patients and 500 cases of control patients. This dataset contains 9 variables: (1) number of times pregnant, (2) plasma glucose concentration at two hours in an oral glucose tolerance test, (3) diastolic blood pressure (mm Hg), (4) triceps skin fold thickness (mm), (5) two-hour serum insulin (mu U/ml), (6) body mass index (weight in kg/(height in m)^2), (7) diabetes pedigree function, (8) age (in years), (9) class variable (diabetic or control). In this dataset five patients have a zero blood glucose level, diastolic blood pressure is zero for 35 patients, 27 patients have zero body mass index, 227 patients have zero skin fold thickness and 374 patients have a zero serum insulin level. However, these zero values are meaningless.
Attribute No.   Attribute                    Variable Type   Range
A1              Pregnancies                  Integer         0-17
A2              Glucose                      Real            0-199
A3              Blood pressure               Real            0-122
A4              Skin thickness               Real            0-99
A5              Insulin                      Real            0-846
A6              Body mass index (BMI)        Real            0-67.1
A7              Diabetes pedigree function   Real            0.078-2.42
A8              Age                          Integer         21-81
Class           Outcome                      Binary          1 = tested positive for diabetes, 0 = tested negative for diabetes

Table 1: Parameters of the dataset.
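The zero-value problem in Table 1's clinical attributes and the k-NN imputation mentioned in the preprocessing step can be sketched as follows. This is an illustrative Python approximation with a toy frame; the paper's actual pipeline was built in R, and the column names follow the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows in the spirit of the Pima attributes; zero glucose, blood
# pressure, or insulin is physiologically meaningless, so mark it missing.
df = pd.DataFrame({
    "Glucose":       [148.0, 85.0, 183.0, 0.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 74.0, 0.0],
    "Insulin":       [0.0, 94.0, 168.0, 88.0, 543.0],
})
df = df.replace(0.0, np.nan)
print(int(df.isna().sum().sum()))  # 3 missing values flagged

# Each missing entry is filled from the k most similar complete rows.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
print(int(imputed.isna().sum().sum()))  # 0
```

Replacing the sentinel zeros with NaN before imputation is the key step; otherwise the imputer would treat them as valid measurements.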
V. RESULT
Table 2 gives a view of the various machine learning models trained on the Pima Indian diabetes dataset with optimized tuning parameters. All classification techniques were experimented with in the "R" programming studio, with the data set partitioned into two parts (training and testing).
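The 70%/30% split and tenfold cross-validation described in the methodology can be sketched as below. This is an illustrative Python sketch on synthetic data, not the authors' R code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 768-row Pima data (8 features, binary class).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# 70% training / 30% testing split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Tenfold cross-validation on the training portion guards against
# overfitting and underfitting, as the text describes.
scores = cross_val_score(SVC(kernel="linear"), X_tr, y_tr, cv=10)
print(len(scores), round(scores.mean(), 2))
```

Each of the ten folds is held out once while the model trains on the other nine; the mean fold accuracy is the cross-validated estimate.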


We adapted the missing value problem using the median approach, which offered simplicity in the process of our classification paradigm. Note that there are several methods for approaching this issue; within the present scope of this paper, we have simplified it using the median-based approach. Note that it also depends upon the data types and the density of the data. Since our data is simple, our strategy yields results which are comparable to the prevailing approaches, while the comprehensive analysis is the novelty of the system. Some statistical information on the variables of the data is also presented.
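The median-based alternative mentioned above can be sketched in one line. This is an illustrative Python fragment; the "Insulin" column name is assumed from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy serum-insulin column with missing entries.
insulin = pd.Series([94.0, np.nan, 168.0, np.nan, 88.0], name="Insulin")

# Fill each missing value with the median of the observed values.
filled = insulin.fillna(insulin.median())  # median of 94, 168, 88 is 94
print(filled.tolist())  # [94.0, 94.0, 168.0, 94.0, 88.0]
```

Unlike k-NN imputation, the median fill ignores the other attributes of a row, which is what makes it the simpler of the two strategies discussed.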

Table 2(a): Experiment: Predictive Modelling and Analytics for Diabetes

Table 2(b): Predictive Modelling and Analytics for Diabetes


Table 2(c): Predictive Modelling and Analytics for Diabetes

Figure 5: Predictive Modelling and Analytics for Diabetes (per-variable values on a -0.2 to 1 scale for Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and Outcome).
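Figure 5 appears to plot a per-feature value on a -0.2 to 1 scale, consistent with feature-versus-outcome correlations; under that assumption, such a view can be computed directly. This is an illustrative Python sketch on toy data with only a subset of the Pima columns.

```python
import pandas as pd

# Toy stand-in with a few of the Pima columns; the real figure uses all nine.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115],
    "BMI":     [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Pearson correlation of every column with the class variable.
corr = df.corr()["Outcome"]
print(corr.round(2))
```

The self-correlation of Outcome is exactly 1.0, which would match the tallest bar in such a chart.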

an example of imbalanced class with 500 negative 2(167) (2011) 2- 7.
instances and 268 positive instances giving an imbalance [2] K. Papatheodorou, M. Banach, M. Edmonds, N.
ratio of 1.87. Accuracy alone may not provide a very Papanas, D. Papazoglou, Complications of Diabetes, J.
good indication of performance of a binary classifier in of Diabetes Res. 2015 (2015), 1-5.
case of imbalanced class. F1 score provides better insight [3] L. Mamykinaa, et al., Personal discovery in diabetes
into classifier performance in case of uneven class self-management: Discovering cause and effect using
distribution as it provides balance between precision and self-monitoring data, J. Biomd. Informat. 76 (2017) 1–8.
recall [21, 25]. So, in this case F1 score should also be [4] A. Nather, C. S. Bee, C. Y. Huak, J. L.L. Chew, C. B.
taken care of. Further it can be seen that AUC value of Lin, S. Neo, E. Y. Sim, Epidemiology of diabetic foot
SVM-linear and k-NN model are 0.90 and 0.92 problems and predictive factors for limb loss, J. Diab.
respectively and its Complic. 22 (2) (2008) 77-82.
[5] Shiliang Sun, A survey of multi-view machine
learning, Neural Comput. & Applic. 23 (7–8) (2013)
VI. CONCLUSION AND FUTURE WORK 2031–2038.
We have developed five different models to detect diabetes, using the linear-kernel support vector machine (SVM-linear), radial-basis-kernel support vector machine (SVM-RBF), k-NN, ANN and MDR algorithms. Feature selection on the dataset is done with the help of the Boruta wrapper algorithm, which provides an unbiased selection of important features. All the models are evaluated on the basis of different parameters: accuracy, recall, precision, F1 score, and AUC. The experimental results suggest that all the models achieved good results; the SVM-linear model provides the best accuracy (0.89) and precision (0.88) for prediction of diabetes compared to the other models used. On the other hand, the k-NN model provided the best recall and F1 score, of 0.90 and 0.88 respectively. As our dataset is an example of imbalanced classes, the F1 score may provide better insight into the performance of our models, since it balances precision against recall. Further, the AUC values of the SVM-linear and k-NN models are 0.90 and 0.92 respectively. Such high AUC values indicate that both SVM-linear and k-NN are optimal classifiers for the diabetes dataset. So, from the above studies, it can be said that on the basis of all the parameters, the linear-kernel support vector machine (SVM-linear) and k-NN are the two best models for determining whether a patient is diabetic or not. This work also suggests that the Boruta wrapper algorithm can be used for feature selection: the experimental results indicated that using the Boruta wrapper feature selection algorithm is better than choosing the attributes manually with limited medical domain knowledge. Thus, with a limited number of parameters, we have achieved higher accuracy and precision through the Boruta feature selection algorithm.
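The Boruta wrapper algorithm works by pitting each real feature against randomly shuffled "shadow" copies of the features and keeping only those that consistently prove more important than the best shadow. The sketch below illustrates that idea only: it substitutes an absolute-correlation proxy for Boruta's random-forest importance, and the mini-dataset ("glucose", "noise") is hypothetical, not the Pima data.

```python
import random

random.seed(0)

def importance(column, target):
    # |Pearson correlation| with the target, as a stand-in importance score.
    n = len(column)
    mx, my = sum(column) / n, sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(column, target))
    vx = sum((x - mx) ** 2 for x in column) ** 0.5
    vy = sum((y - my) ** 2 for y in target) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def boruta_like_select(features, target, n_rounds=50):
    wins = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Build shadow features: shuffled copies that destroy any real association.
        shadow_best = 0.0
        for col in features.values():
            shadow = col[:]
            random.shuffle(shadow)
            shadow_best = max(shadow_best, importance(shadow, target))
        # A real feature "wins" a round if it beats the best shadow.
        for name, col in features.items():
            if importance(col, target) > shadow_best:
                wins[name] += 1
    # Keep features that beat the best shadow in a majority of rounds.
    return {name for name, w in wins.items() if w > n_rounds / 2}

# Hypothetical mini-dataset: "glucose" tracks the label, "noise" does not.
target = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
features = {
    "glucose": [85, 90, 150, 160, 95, 155, 88, 165, 158, 92],
    "noise":   [7, 2, 4, 6, 9, 1, 3, 8, 5, 5],
}
print(sorted(boruta_like_select(features, target)))  # -> ['glucose']
```

The full Boruta procedure additionally uses a statistical test over the win counts and iteratively removes decided features, but the shadow-comparison step shown here is the mechanism that makes the selection unbiased with respect to manually chosen attribute subsets.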
References
[1] D. Soumya, B. Srilatha, Late stage complications of diabetes and insulin resistance, J. Diabetes Metab.
[3] … self-monitoring data, J. Biomed. Informat. 76 (2017) 1–8.
[4] A. Nather, C. S. Bee, C. Y. Huak, J. L. L. Chew, C. B. Lin, S. Neo, E. Y. Sim, Epidemiology of diabetic foot problems and predictive factors for limb loss, J. Diab. Complic. 22 (2) (2008) 77–82.
[5] S. Sun, A survey of multi-view machine learning, Neural Comput. & Applic. 23 (7–8) (2013) 2031–2038.
[6] M. I. Jordan, T. M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (6245) (2015) 255–260.
[7] P. Sattigeri, J. J. Thiagarajan, M. Shah, K. N. Ramamurthy, A. Spanias, A scalable feature learning and tag prediction framework for natural environment sounds, in: 48th Asilomar Conference on Signals, Systems and Computers, 2014, pp. 1779–1783.
[8] M. W. Libbrecht, W. S. Noble, Machine learning applications in genetics and genomics, Nature Reviews Genetics 16 (6) (2015) 321–332.
[9] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, D. I. Fotiadis, Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. 13 (2015) 8–17.
[10] E. M. Hashem, M. S. Mabrouk, A study of support vector machine algorithm for liver disease diagnosis, Amer. J. Intell. Sys. 4 (1) (2014) 9–14.
[11] W. Mumtaz, S. Saad Azhar Ali, M. Azhar, M. Yasin, A. Saeed Malik, A machine learning framework involving EEG-based functional connectivity to diagnose major depressive disorder (MDD), Med. Biol. Eng. Comput. (2017) 1–14.
[12] D. K. Chaturvedi, Soft computing techniques and their applications, in: Mathematical Models, Methods and Applications, Springer Singapore, 2015, pp. 31–40.
[13] A. Tettamanzi, M. Tomassini, Soft Computing: Integrating Evolutionary, Neural, and Fuzzy Systems, Springer Science & Business Media, 2013.
[14] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE Intell. Syst. Appl. 13 (4) (1998) 18–28.
[15] G. B. Huang, Q. Y. Zhu, C. K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[16] S. A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Trans. Syst. Man Cybern. SMC-6 (4) (1976) 325–327.
[17] T. Kohonen, An introduction to neural computing, Neural Networks 1 (1) (1988) 3–16.
[18] Z. C. Lipton, C. Elkan, B. Naryanaswamy, Optimal thresholding of classifiers to maximize F1 measure, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Berlin, Heidelberg, 2014, pp. 225–239.
[19] L. B. Ware, et al., Biomarkers of lung epithelial injury and inflammation distinguish severe sepsis patients with acute respiratory distress syndrome, Crit. Care 17 (5) (2013) 1–7.
[20] M. E. Rice, G. T. Harris, Comparing effect sizes in follow-up studies: ROC area, Cohen's d, and r, Law Hum. Behav. 29 (5) (2005) 615–620.
[21] A. Ali, S. M. Shamsuddin, A. L. Ralescu, Classification with class imbalance problem: a review, Int. J. Adv. Soft Comput. Appl. 5 (3) (2013) 176–204.
[22] S. Park, D. Choi, M. Kim, W. Cha, C. Kim, I. C. Moon, Identifying prescription patterns with a topic model of diseases and medications, J. Biomed. Informat. 75 (2017) 35–47.
[23] H. Kaur, E. Lechman, A. Marszk, Catalyzing Development through ICT Adoption: The Developing World Experience, Springer Publishers, Switzerland, 2017.
[24] H. Kaur, R. Chauhan, Z. Ahmed, Role of data mining in establishing strategic policies for the efficient management of healthcare system – a case study from Washington DC area using retrospective discharge data, BMC Health Services Research 12 (S1) (2012) P12.
[25] J. Li, O. Arandjelovic, Glycaemic index prediction: a pilot study of data linkage challenges and the application of machine learning, in: IEEE EMBS Int. Conf. on Biomed. & Health Informat. (BHI), Orlando, FL, 2017, pp. 357–360.
