Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques

Article in Informatics in Medicine Unlocked · April 2019


DOI: 10.1016/j.imu.2019.100179


All content following this page was uploaded by Christian Idemudia on 19 May 2019.



Informatics in Medicine Unlocked xxx (xxxx) xxxx

Changsheng Zhu a,∗, Christian Uwa Idemudia a, Wenfang Feng b
a School of Computer and Communication, Lanzhou University of Technology, Lanzhou, 730050, China
b School of Economics and Management, Lanzhou University of Technology, Lanzhou, 730050, China

ARTICLE INFO

Keywords: PCA; K-means; Diabetes; Data mining; Logistic regression

ABSTRACT

Diabetes causes a large number of deaths each year, and a large number of people living with the disease do not realize their health condition early enough. In this study, we propose a data mining based model for early diagnosis and prediction of diabetes using the Pima Indians Diabetes dataset. Although K-means is simple and can be used for a wide variety of data types, it is quite sensitive to the initial positions of the cluster centers, which determine the final cluster result. That result either provides a sufficient and efficiently clustered dataset for the logistic regression model, or gives a lesser amount of data as a result of incorrect clustering of the original dataset, thereby limiting the performance of the logistic regression model. Our main goal was to determine ways of improving the k-means clustering and logistic regression accuracy result. Our model comprises PCA (principal component analysis), k-means and a logistic regression algorithm. Experimental results show that PCA enhanced the k-means clustering algorithm and logistic regression classifier accuracy versus the result of other published studies, with a k-means output of 25 more correctly classified data, and a logistic regression accuracy 1.98% higher. As such, the model is shown to be useful for automatically predicting diabetes using patient electronic health records data. A further experiment with a new dataset showed the applicability of our model for the prediction of diabetes.

1. Introduction

Diabetes stands among the top 10 causes of death for 2016. Diabetes killed 1.6 million people in 2016, up from less than 1 million in 2000. With this figure diabetes replaced HIV/AIDS as the seventh top cause of death [1]. The number of people with diabetes has risen from 108 million in 1980 to 422 million in 2014, with the global prevalence of diabetes among adults over 18 years of age rising from 4.7% in 1980 to 8.5% in 2014 [2].

By 2040, 642 million adults (1 in 10 adults) are expected to have diabetes. Also, 46.5% of those with diabetes have not been diagnosed [3]. In order to reduce the number of deaths attributable to diabetes, it is essential that methods and techniques that will aid in early diagnosis of diabetes be devised, because a large number of deaths in diabetic patients are due to late diagnosis.

In order to achieve cutting-edge techniques for the early diagnosis of diabetes, we need to utilize advanced information technology, and data mining is a suitable field for this. Data mining offers the ability to extract and discover previously unknown, hidden, but interesting patterns from a large database repository. These patterns can aid medical diagnosis and decision-making.

Various techniques and algorithms have been designed for application in extracting knowledge and information in the diagnosis and treatment of disease from medical databases. PCA is a simple, non-parametric method for extracting relevant information from confusing data sets [4]. When a large dataset is to be clustered into a user specified number of clusters (k), which are represented by their centroids, k-means will cluster the data by minimizing the squared error function [5], and often misclassifies some data due to outliers; also the time complexity will be greater. To overcome these problems, principal components analysis (PCA) can be used to reduce the dataset to a lower dimension, while ensuring that the least information is lost, and providing a better centroid point for clustering. K-means clustering partitions a dataset into different groups of similar objects. Clusters that are highly dissimilar from the others are regarded as outliers and discarded. Logistic regression is an efficient regression predictive analysis algorithm. Its application is efficient when the dependent variable of a dataset is dichotomous (binary). Logistic regression is used in the description and analysis of data in order to explain the relationship between one dependent binary variable and one or more independent variables.

This research work proposes PCA for dimensionality reduction, which helps to define suitable initial centroids for our dataset when the k-means algorithm is applied. K-means is then used to find outliers and to cluster the data into similar groups, with logistic regression as a classifier for the dataset. In this paper, Section 2 provides a review of related work done by other researchers in the area of diabetes prediction and diagnosis. Section 3 shows details of the experimental procedures. Section 4 describes the experimental result, and Section 5 concludes the work and suggests possible directions for future work.

∗ Corresponding author.
E-mail addresses: [email protected] (C. Zhu), [email protected] (C.U. Idemudia), [email protected] (W. Feng).

https://doi.org/10.1016/j.imu.2019.100179
Received 20 January 2019; Received in revised form 27 March 2019; Accepted 4 April 2019
2352-9148/ © 2019 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

Please cite this article as: Changsheng Zhu, Christian Uwa Idemudia and Wenfang Feng, Informatics in Medicine Unlocked, https://doi.org/10.1016/j.imu.2019.100179

2. Related study

Diabetes is one of the most well-known non-transmittable diseases in the world. It is assessed to be the seventh leading cause of death [6]. It is predicted that the number of adults with diabetes worldwide will reach 642 million in 2040 [3]. The early diagnosis of diabetes in patients has been a major goal for medical researchers and professionals. With the availability of vast technological innovation in computer science, collaborative studies have shown that by applying computer skills and algorithms (such as data mining), efficient, cost effective and rapid techniques can be derived for the diagnosis of diabetes.

Many researchers have developed various prediction models using data mining to predict and diagnose diabetes. Iyer [15] proposed the use of the Naïve Bayes algorithm to predict the onset of diabetes. The study gave an accuracy result of 79.56%. Tarun [13] used PCA and a support vector machine for the classification of diabetic patients. Experimental results from the study showed that the previous level can be improved upon, as they had a classification accuracy of 93.66%. Mustafa S. Kadhm [18] proposed the use of a Decision Tree (DT) to assign each data sample to its appropriate class after applying the K-nearest neighbor algorithm for eliminating undesired data. Han et al. [3] designed a model that uses the k-means algorithm and the logistic regression algorithm for predicting diabetes. The model attained a 95.42% accuracy.

In Ref. [14], the authors used k-means clustering in identifying and eliminating outliers, a genetic algorithm and correlation based feature selection (CFS) for relevant feature extraction, and finally used k-nearest neighbor (KNN) for classification of diabetic patients. Patil [16] proposed a hybrid prediction model that applied k-means clustering to the original dataset and then used the C4.5 algorithm in building the classifier model. The classification accuracy result was 92.38%. Anjali [7] proposed a methodology based on Principal Component Analysis (PCA) to reduce the dimension of extracted features with a Neural Network (NN) as the classifier. The accuracy result was 92.2%.

The studies all used a common dataset (the Pima Indian Diabetes Dataset) from the University of California, Irvine (UCI) machine learning database. Considering the need for an effective prediction algorithm, improving the already existing prediction algorithm will be a major task of our research whilst using the same dataset as other researchers. While great results have been achieved by various researchers, their data preprocessing step limited the amount of data available for their final prediction and classification. Therefore, we need to propose a model for enhanced data preprocessing that will produce a large amount of useable data and also enhance the classification algorithm.

3. Methodology

This section is comprised of the following steps: the data description, preprocessing technique and the classification algorithm. The proposed model is designed and implemented by combining the benefits of applying PCA, K-means and logistic regression. A new methodology is then proposed by using PCA to transform the initial set of features, thereby solving the problem of correlation, which makes it difficult for the classification algorithm to find relationships among the data. The PCA application helps to filter out irrelevant features, thereby lowering the training time and cost and increasing model performance [10]. After performing PCA analysis, the result is then passed for unsupervised clustering using K-means because of the ability of k-means to address outliers [11]. The K-means cluster result is cleaned and logistic regression is applied to build our supervised classification for the dataset. The proposed model flowchart is shown in Fig. 1.

Fig. 1. Proposed algorithm model.

3.1. Data mining toolkit

Anaconda is a free and open Python programming language toolkit. It consists of over 250 popular packages for data science and machine learning related applications. Applying this package, we are able to perform related data mining tasks on our dataset and implement (design) our proposed model.

By efficiently preprocessing the original dataset, performing PCA, and simulating the same experiment as other researchers, we show that the accuracy of diabetes diagnosis using data mining techniques can be improved.

3.2. Dataset description

The Pima Indian Diabetes dataset obtained from the UCI machine learning repository was utilized for this study. The dataset is comprised of 768 sample female patients from the Arizona, USA population who were examined for diabetes. The dataset has a total of 8 attributes (representing medical diagnosis criteria) with one target class (which represents the status of each tested individual). In the dataset there is a total of 268 tested positive instances and 500 tested negative instances. The attributes in the dataset include the following:

• Number of times pregnant (Preg)
• Plasma glucose concentration at 2hr in an oral glucose tolerance test (Plas)
• Diastolic blood pressure (Pres)
• Triceps skin fold thickness (Skin)
• 2-hr serum insulin (Insu)
• Body mass index (BMI)
• Diabetes pedigree function (Pedi)
• Age (Age)
• Target variable (Diag)
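As a minimal sketch of how the attribute list above might be mapped onto a data file, the snippet below parses records into named fields and counts the two classes. It assumes a header-less CSV whose columns follow the order of the list; the file layout, the helper names `load_records` and `class_counts`, and the two sample rows are illustrative assumptions and are not taken from the paper.

```python
import csv
import io

# Short attribute names from Section 3.2; "Diag" is the target variable.
COLUMNS = ["Preg", "Plas", "Pres", "Skin", "Insu", "BMI", "Pedi", "Age", "Diag"]

def load_records(handle):
    """Parse a header-less CSV stream into a list of {column: float} dicts."""
    records = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        records.append(dict(zip(COLUMNS, map(float, row))))
    return records

def class_counts(records):
    """Return (negative, positive) counts of the Diag target."""
    positive = sum(1 for r in records if r["Diag"] == 1.0)
    return len(records) - positive, positive

# Two made-up rows for illustration only (not real patient records).
sample = io.StringIO("2,120,70,30,100,30.0,0.5,40,1\n0,90,60,20,80,25.0,0.2,25,0\n")
records = load_records(sample)
print(class_counts(records))  # (1, 1)
```

On the full dataset described above, the same count would be expected to give 500 negative and 268 positive instances.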


Table 1
Original and preprocessed dataset statistics.

| Statistic | Dataset | Preg | Plas | Pres | Skin | Insu | BMI | Pedi | Age |
|---|---|---|---|---|---|---|---|---|---|
| COUNT | Original | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 |
| COUNT | Preprocess | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 | 768.0000 |
| MEAN | Original | 0.8554 | 120.8945 | 69.1054 | 20.5364 | 79.7994 | 31.9925 | 0.4718 | 33.2408 |
| MEAN | Preprocess | 0.8554 | 121.6867 | 72.4051 | 29.1534 | 155.5482 | 32.4574 | 0.6718 | 33.2408 |
| STD | Original | 0.3518 | 31.9726 | 19.9522 | 15.9522 | 115.2440 | 7.8841 | 0.3313 | 11.7602 |
| STD | Preprocess | 0.3518 | 30.4359 | 12.0963 | 8.7909 | 85.0211 | 6.8751 | 0.3313 | 11.7602 |
| MIN | Original | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0780 | 21.0000 |
| MIN | Preprocess | 0.0780 | 44.0000 | 24.0000 | 7.0000 | 14.0000 | 18.2000 | 0.0780 | 21.0000 |
| 25% | Original | 1.0000 | 99.0000 | 62.0000 | 0.0000 | 0.0000 | 27.3000 | 0.2437 | 24.0000 |
| 25% | Preprocess | 1.0000 | 99.7500 | 64.0000 | 25.0000 | 121.5000 | 27.5000 | 0.2437 | 24.0000 |
| 50% | Original | 1.0000 | 117.0000 | 72.0000 | 23.0000 | 30.5000 | 32.0000 | 0.3725 | 29.0000 |
| 50% | Preprocess | 1.0000 | 117.0000 | 72.2025 | 29.2534 | 155.5482 | 32.4000 | 0.3725 | 29.0000 |
| 75% | Original | 1.0000 | 140.2500 | 80.0000 | 32.0000 | 127.2500 | 36.6000 | 0.6262 | 41.0000 |
| 75% | Preprocess | 1.0000 | 140.2500 | 80.0000 | 32.0000 | 155.5482 | 36.6000 | 0.6262 | 41.0000 |
| MAX | Original | 1.0000 | 199.0000 | 122.0000 | 99.0000 | 846.0000 | 67.1000 | 2.4200 | 81.0000 |
| MAX | Preprocess | 1.0000 | 199.0000 | 122.0000 | 99.0000 | 864.0000 | 67.1000 | 2.4200 | 81.0000 |
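Table 1 contrasts the statistics before and after preprocessing. A hedged sketch of the kind of preprocessing that could produce such a change is given below, covering the binarized pregnancy attribute, the replacement of zero entries by the attribute mean, and the Z-score of equation (1) in Section 3.3. Two points are assumptions rather than statements from the paper: the fill value is taken as the mean of the non-zero entries, and the Z-score uses the population standard deviation. All function names are hypothetical.

```python
from statistics import mean, pstdev

# Attributes whose zero entries are treated as missing values.
ZERO_IS_MISSING = ("Plas", "Pres", "Skin", "Insu", "BMI")

def binarize_pregnancy(column):
    """Map the pregnancy count to 1 (previously pregnant) or 0 (never pregnant)."""
    return [1.0 if v > 0 else 0.0 for v in column]

def impute_zeros_with_mean(column):
    """Replace zeros with the mean of the non-zero entries (assumed fill rule)."""
    observed = [v for v in column if v != 0]
    fill = mean(observed)
    return [fill if v == 0 else v for v in column]

def z_score(column):
    """Equation (1): V' = (V - Y) / Z, with Y the mean and Z the standard deviation."""
    y, z = mean(column), pstdev(column)  # population std. deviation is an assumption
    return [(v - y) / z for v in column]

def preprocess(table):
    """table maps attribute name -> list of floats; returns a cleaned copy."""
    out = {}
    for name, column in table.items():
        if name == "Preg":
            column = binarize_pregnancy(column)
        elif name in ZERO_IS_MISSING:
            column = impute_zeros_with_mean(column)
        out[name] = list(column)
    return out

# Toy values for illustration only.
toy = {"Preg": [0.0, 3.0, 1.0], "Insu": [0.0, 100.0, 200.0]}
cleaned = preprocess(toy)
print(cleaned["Preg"])  # [0.0, 1.0, 1.0]
print(cleaned["Insu"])  # [150.0, 100.0, 200.0]
```

Note that a Z-score yields values with zero mean and unit standard deviation; it does not confine them to a fixed interval.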

3.3. Data preprocessing

Today's real world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge sizes and their likely origin from multiple, heterogeneous sources [13]. Data quality is an important factor in the data mining process for disease prediction and diagnosis, because low quality data may lead to inaccurate or low prediction results. In order to make our original dataset more productive and applicable for predicting diabetes, we applied several preprocessing techniques using various packages offered within the Anaconda integrated development environment.

First, we took a closer look at the various attributes and, in discussion with a professional dietician, analyzed the medical relevance of each attribute to diabetes prediction and diagnosis. It was discovered that "number of times pregnant" has less significance to the current research direction. We decided to apply the same technique used by Han Wu [12] by transforming this numeric attribute into a nominal attribute of value 0 and 1, with 1 indicating a patient previously pregnant and 0 indicating a patient who was never pregnant. This helps to reduce the complexity of analyzing the dataset [12] (see Table 1).

Secondly, statistical analysis of our dataset suggested the presence of missing values. Table 1 shows the statistical result for our dataset. From the statistical result, it is observed that Plasma glucose concentration, Diastolic blood pressure, Skin fold thickness, 2hr serum insulin and Body mass index have a min value of 0. Medical knowledge explains that such attributes (medical results) cannot be 0; this suggests that the dataset contains missing values which, if not handled, can impair the quality of our model result and accuracy. Various methods have been suggested for handling missing values in datasets. In our case we replaced missing values with the mean of the attribute.

As part of our data preprocessing, the original data values are rescaled onto a small common scale by normalizing the dataset. This will improve speed and reduce runtime complexity. Using the Z-score we normalize our value set V to obtain a new set of normalized values V′ with the equation below:

$$V' = \frac{V - Y}{Z} \tag{1}$$

where V′ = new normalized value, V = previous value, Y = mean, and Z = standard deviation.

Table 2
Confusion matrix.

| Class | 0 (Negative) | 1 (Positive) |
|---|---|---|
| Predicted Negative | 391 | 3 |
| Predicted Positive | 13 | 207 |

3.4. Model algorithm design

Our model algorithm is made up of 3 sub-stages. In the first stage of the design we perform dimensionality reduction on the already processed dataset (using PCA). Then we cluster the selected principal components using K-means to address outliers and remove any incorrectly classified data. Finally, the correctly clustered and classified data is used as input for our supervised classification using logistic regression.

3.5. Principal component analysis

During data analysis it is often very difficult to find all the relationships among attributes. PCA allows a huge amount of information enclosed in initially correlated data to be transformed into a set of new orthogonal components, thereby making it possible to discover concealed relationships and to enhance data visualization, detection of outliers, and classification within the newly defined dimensions [5]. The application of PCA on a dataset can be of great help when unsupervised learning is required to be performed on such a dataset, as it will aid in efficiently initializing centroids for clustering.

Because PCA yields a feature subspace that maximizes the variance along the axes, we first standardize the dataset onto a unit scale (mean = 0 and variance = 1) to improve the PCA result; such standardization is a requirement for the optimal performance of many machine learning algorithms.

Our objective here is to transform our dataset X of dimension p into a new sample set Y of smaller dimension L (L < p), where Y is the principal component of X, i.e.

$$Y = PC(X) \tag{2}$$

We proceed as follows:

(a) Organize our dataset:

With X having a set of n vectors $(x_1, x_2, \ldots, x_n)$ where each element $x_i$ is an instance of our dataset.

(b) Find the mean using the equation:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3}$$

(c) Calculate Variance:


$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \tag{4}$$

(d) Calculate Covariance:


$$X^{n \times n} = \left(x_{i,j}\right), \quad x_{i,j} = \operatorname{cov}(Dim_i, Dim_j) \tag{5}$$

where $X^{n \times n}$ is our matrix with n rows and n columns and $Dim_i$ is the ith dimension.

(e) Calculate Eigenvalues and Eigenvectors:

The core of a PCA is the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors determine the directions of the new feature space while the eigenvalues determine the magnitude. If A is an n × n matrix, then a nonzero vector x in $\mathbb{R}^n$ is called an eigenvector of A (or of the matrix operator $T_A$) if Ax is a scalar multiple of x; that is,

$$Ax = \lambda x \tag{6}$$

for some scalar λ. The scalar λ is called an eigenvalue of A and x is said to be an eigenvector corresponding to λ. Since the eigenvectors corresponding to an eigenvalue of a matrix A are the nonzero vectors that satisfy the equation

$$(\lambda I - A)x = 0 \tag{7}$$

we define the set E of all vectors x that satisfy equation (7) as our corresponding eigenspace.

$$E = \{x : (A - \lambda I)x = 0\} \tag{8}$$

(f) Once the eigenspace is found from the covariance matrix, the next step is to order the eigenvectors by eigenvalue, highest to lowest. This eliminates less significant components and we are left with the principal components that provide a good approximation of the original data. Our new principal components will be used as input for our k-means clustering in the next stage of our algorithm design.

3.6. K-means clustering

K-means is one of the simplest and most efficient unsupervised classification algorithms. K-means is a well-known partitioning based clustering technique that attempts to find a user specified number of clusters represented by their centroids [5]. It is a typical distance-based clustering algorithm, in which the distance is used as a measure of similarity, i.e. a smaller distance between objects shows greater similarity [12]. Fig. 2 shows the graphical procedure for the k-means clustering by applying the following steps:

(a) Step (a) in Fig. 2 shows our entire dataset. Initialize k = 2 since the target variable contains two possible outcomes (positive and negative).

(b) Next, determine for each input data point the cluster center that it is nearest to by using equation (9), extracted from Ref. [9] (step b).

$$S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \forall j,\ 1 \le j \le k \right\} \tag{9}$$

(c) Applying equation (10) from Ref. [9], update the cluster centers by recalculating the mean of the input data assigned to each cluster (step c).

$$m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j \tag{10}$$

(d) To bring our k-means clustering to a stop, we loop through steps (b) and (c) until there is a convergence in the mean value of the clusters (step d).

Fig. 2. K-means clustering procedure.

Thereafter, we cleaned our k-means cluster result by removing incorrectly clustered data, and we decide whether to use the new dataset for classification by using equation (11). If the new data size is above 75%, then we proceed with supervised classification; otherwise we repeat the k-means step until a suitable size is determined.

$$\text{new size} = \frac{\text{left data}}{\text{total sum}} \tag{11}$$

After cleaning the clustered data, we obtained 614 correctly clustered patients, which are used as input to train the logistic regression algorithm.

3.7. Logistic regression algorithm

The application of the logistic regression model has featured prominently in many domains such as the biological sciences. The logistic regression algorithm is used when the objective is to classify data items into categories. Usually in logistic regression the target variable is binary, which means that it only contains data classified as 1 or 0, which in our case refers to a patient that is positive or negative for diabetes. The purpose of our logistic regression algorithm is to find the best fit that is diagnostically reasonable to describe the relationship between our target variable and the predictor variables.

The logistic regression algorithm is based on the linear regression model given in equation (12) below.

$$y = h_\theta(x) = \theta^T x \tag{12}$$

Equation (12) is poorly suited to predicting our binary values ($y^{(i)} \in \{0, 1\}$); therefore we introduce the function in equation (13) to predict the probability that a given patient (with given attributes) belongs to the "1" (positive) class versus the probability that it belongs to the "0" (negative) class.

$$P(y = 1 \mid x) = h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} \equiv \sigma(\theta^T x)$$

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - h_\theta(x) \tag{13}$$

Applying equation (14), known as the sigmoid function, we are able to keep the value of $\sigma(\theta^T x)$ within the [0, 1] range. Then we search for a value of θ such that the probability $P(y = 1 \mid x) = h_\theta(x)$ is large when x belongs to the "1" class and small when x belongs to the "0" class (i.e. $P(y = 0 \mid x)$ is large).


$$\sigma(t) = \frac{1}{1 + e^{-t}} \tag{14}$$
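The last two stages of the methodology (Sections 3.6 and 3.7) can be sketched end to end in plain Python: k-means with k = 2 following equations (9) and (10), a cleaning step with the retention check of equation (11), and a logistic regression fitted by gradient ascent using the sigmoid of equation (14). This is a sketch under stated assumptions, not the authors' implementation. The PCA stage is skipped, the centers are seeded with the first and last points, a record counts as "correctly clustered" when its label equals the majority label of its cluster, and the learning rate, epoch count and toy data are all invented for illustration.

```python
import math

def kmeans_two_clusters(points, max_iter=100):
    """k-means with k = 2 (equations 9 and 10); returns a cluster index per point."""
    # Assumption: seed the centers with the first and last points.
    centers = [list(points[0]), list(points[-1])]
    assignment = None
    for _ in range(max_iter):
        # Equation (9): assign each point to its nearest center.
        new_assignment = [
            min((0, 1), key=lambda i: sum((a - c) ** 2 for a, c in zip(p, centers[i])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means have converged
        assignment = new_assignment
        # Equation (10): move each center to the mean of its members.
        for i in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def keep_correctly_clustered(points, labels, assignment):
    """Assumed cleaning rule: keep records matching their cluster's majority label."""
    kept = []
    for i in (0, 1):
        members = [(p, y) for p, y, a in zip(points, labels, assignment) if a == i]
        positives = sum(y for _, y in members)
        majority = 1 if 2 * positives > len(members) else 0
        kept.extend((p, y) for p, y in members if y == majority)
    return kept

def sigmoid(t):
    """Equation (14)."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(points, labels, rate=0.1, epochs=2000):
    """Batch gradient ascent on the log-likelihood; theta[0] is the intercept."""
    theta = [0.0] * (len(points[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for p, y in zip(points, labels):
            x = [1.0] + list(p)
            err = y - sigmoid(sum(t * v for t, v in zip(theta, x)))
            for j, v in enumerate(x):
                grad[j] += err * v
        theta = [t + rate * g / len(points) for t, g in zip(theta, grad)]
    return theta

def predict(theta, p):
    """Class 1 when the probability from equation (13) is at least 0.5."""
    x = [1.0] + list(p)
    return 1 if sigmoid(sum(t * v for t, v in zip(theta, x))) >= 0.5 else 0

# Toy data: two well-separated groups; the fifth record is deliberately mislabeled.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.1, 0.1),
          (2.0, 2.0), (2.2, 1.9), (1.9, 2.1), (2.1, 2.2)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

assignment = kmeans_two_clusters(points)
kept = keep_correctly_clustered(points, labels, assignment)
retention = len(kept) / len(points)  # equation (11)
if retention > 0.75:  # threshold stated in Section 3.6
    theta = fit_logistic([p for p, _ in kept], [y for _, y in kept])
    print(len(kept), [predict(theta, p) for p, _ in kept])
```

In the toy run the mislabeled record is dropped by the cleaning step, so 8 of 9 records are retained and the classifier is trained only on records that agree with their cluster.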

Once our logistic regression algorithm has been successfully modelled and implemented, the output and result are discussed in the next section.

4. Experimental result

A major result discovered from the use of PCA is that the process helped in minimizing the drawback of having redundant features which are of no help for clustering. Since the reduction in the number of variables in the original data set assisted in handling noisy and outlier data, PCA therefore improved our k-means result. The main advantage of PCA is that once we have found these principal components from the data, we can compress the data, i.e., reduce the number of dimensions without much loss of information; it became an essential process in order to determine the number of clusters and provide a statistical framework to model the cluster structure [5].

The efficiency and accuracy of any predictive and diagnostic model is of paramount importance and should be ensured before such a model is deployed for implementation. We analyzed and evaluated our model output using different evaluation metrics, and the result is shown in Fig. 3.

First, to determine the performance of our model, we utilized the k-fold cross validation technique, which allows us to determine how well our model will perform when given new and previously unlearned data. Our choice of 10-fold cross validation meant that our dataset was divided into 10 subsets. On each trial, one subset is used as the test set and the other nine subsets form the training set. Then, the average error across all 10 trials was computed to get the total performance of our model. This method helps solve two issues: first, it reduces the problem of bias, as almost all of the data is used for fitting; secondly, the problem of variance is greatly reduced.

The confusion matrix is a popular way to provide a summarized representation of predictive findings. The confusion matrix gives the result of the following indices: true positive (TP), true negative (TN), false positive (FP), false negative (FN). Table 2 shows the confusion matrix for our model.

The performance metrics of our model are presented in Table 3:

Table 3
Performance metrics.

| Performance measure | Score |
|---|---|
| Recall | 0.97 |
| Precision | 0.97 |
| Accuracy | 0.9739 |
| MCC | 0.94 |

The ROC curve is a graphical plot that represents the performance of a classifier. The ROC allows us to look at the performance of our model across all possible thresholds. In our experiment, the ROC value was 0.967 and the ROC curve is shown in Fig. 4. The Kappa statistic (or value) is a metric that compares an observed accuracy with an expected accuracy (random chance). The Kappa statistic normally holds a value between 0 and 1. Our experiment had a kappa statistic value of 0.942.

Fig. 4. The ROC curve.

4.1. Comparison using other algorithms

To further evaluate how our model performs, we modelled our dataset with four different algorithms using the following variations: original dataset, PCA processed data, PCA + Kmeans processed data and Kmeans only. The result is shown in Table 4.

From Table 4, the PCA and Kmeans integration technique improved the accuracy of the different algorithms we modelled our dataset with; the exception is the XGBoost algorithm. Although there is an improvement over applying XGBoost alone to the original dataset, Table 4 indicates a decline in accuracy from 95%, when only Kmeans is integrated with XGBoost, to 93% with our proposed PCA and Kmeans technique. Furthermore, Kmeans alone was shown to be a good procedure to improve the accuracy of each of the algorithms, while PCA reduced the accuracy result when applied alone.
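As a worked check on the evaluation reported in this section, the accuracy, MCC and kappa values can be re-derived from the four counts of the confusion matrix (Table 2). The sketch below assumes that the table's rows are predictions and its columns are the true classes, which is one reading of the table layout; the formulas are the standard definitions of the Matthews correlation coefficient and Cohen's kappa.

```python
import math

# Counts read from Table 2 (rows: predictions, columns: true classes; assumed layout).
tn, fn = 391, 3    # predicted negative: true class 0, true class 1
fp, tp = 13, 207   # predicted positive: true class 0, true class 1
total = tn + fn + fp + tp

accuracy = (tp + tn) / total

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Cohen's kappa: observed agreement versus agreement expected by chance.
expected = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"n={total} accuracy={accuracy:.4f} mcc={mcc:.2f} kappa={kappa:.3f}")
# n=614 accuracy=0.9739 mcc=0.94 kappa=0.943
```

Under this reading the counts sum to 614, the number of correctly clustered records, and the derived accuracy and MCC agree with Table 3. The derived kappa of about 0.9428 agrees with the reported 0.942 if that figure was truncated rather than rounded.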

Fig. 3. The result of the experiment.

5. Discussion

The experimental results showed that employing PCA enhances the k-means clustering algorithm, as we obtained 614 correctly clustered records, versus other studies (see Table 5). The closest result to ours is that of Han Wu et al. [12], which had an accuracy of 95.42% from a sample size of 589 obtained from their k-means clustering. With reference to our experimental result, we can clearly illustrate that the proposed PCA and K-means technique improved the classification accuracy of logistic regression for the Pima Indian diabetic dataset. A comparison with the classification result reported by other researchers is shown in Table 6.

A key issue solved by our study is the improvement in the accuracy of the prediction model. The PCA technique we proposed contributed much to the improvement of the prediction model. The kappa statistic value of the proposed model is 0.942 (which is almost equal to 1), which indicates that there is a match between the proposed classifier and the real world output.

Table 4
Model comparison with different algorithms.

| Algorithm | Original dataset | PCA processed | Kmeans only clustered dataset | PCA + KMEANS processed dataset |
|---|---|---|---|---|
| Logistic Regression | 0.77 | 0.71 | 0.82 | 0.97 |
| KNN | 0.75 | 0.69 | 0.93 | 0.96 |
| XGBoost | 0.76 | 0.66 | 0.95 | 0.93 |
| SVM | 0.76 | 0.72 | 0.83 | 0.92 |
| Naïve Bayes | 0.74 | 0.73 | 0.86 | 0.90 |

Table 5
Comparison of K-means clustering result.

| Author (year) | Methodology | Correctly clustered data | Accuracy percent |
|---|---|---|---|
| Our proposed method | PCA + K-means | 614 | 79.94% |
| T. Santhanam et al. (2015) [11] | K-means | 511 | 66.53% |
| Han et al. (2017) [12] | K-means | 589 | 76.69% |
| Patil B.M et al. (2016) [16] | k-means | 433 | 56.38% |
| Asha Gowda et al. (2012) [17] | Cascaded K-means | 299 | 38.93% |
| Mustafa Kadhm (2018) [18] | K-means | 570 | 74.21% |

The best result (bold in the original) is that of the proposed method.

Table 6
Accuracy comparison with other experiments.

| Author (year) | Methodology | Accuracy percent |
|---|---|---|
| Our proposed method | PCA + K-means + Logistic regression | 97.40% |
| Han et al. (2017) [12] | K-means + Logistic regression | 95.42% |
| Patil B.M et al. (2016) [16] | k-means + C4.5 | 92.38% |
| Sanakal S. et al. (2014) [17] | SVM + Fuzzy C-means clustering | 94.30% |
| Iyer A et al. (2015) [15] | Naive Bayes | 79.56% |
| Kumari A.V. et al. (2013) | SVM | 78% |
| Anjali Khandegar et al. (2017) [6] | PCA + NN | 92.2% |
| Motka et al. (2013) [8] | PCA-ANFIS | 89.2% |
| Tarun et al. (2014) [4] | PCA + SVM | 93.66% |

The best result (bold in the original) is that of the proposed method.

Table 7
Description of new dataset features.

| Attributes | Description |
|---|---|
| age | The age of each patient |
| body mass index | The measure of body fat based on height and weight |
| family history | Indicates whether the patient has any close family relative ever diagnosed with diabetes |
| increased urination | Does the patient feel the need to urinate often? |
| fatigue | Does the patient feel tired often? |
| increased appetite | Is there an abnormal increase in how often the patient wants to eat? |
| weight loss | Any reported case of weight loss? |
| increased thirst | Did the patient report increased thirst? |
| eating pattern | Does the patient have a controlled eating pattern/diet? |
| regular exercise | Does the patient engage in exercise often? |
| sex | Gender of patient |
| blood pressure | Blood pressure of the patient at the time of test |
| fasting blood glucose | Diabetes fasting blood glucose test score for patient |
| oral glucose tolerance test | Test for incidence of diabetes in patient using the ogtt method |

Table 8
Algorithm comparison.

| Algorithm | Original dataset | PCA processed | Kmeans only clustered dataset | PCA + KMEANS processed dataset |
|---|---|---|---|---|
| Logistic Regression | 0.47 | 0.48 | 0.75 | 0.89 |
| KNN | 0.53 | 0.48 | 0.67 | 0.78 |
| XGBoost | 0.51 | 0.49 | 0.74 | 0.85 |
| SVM | 0.50 | 0.47 | 0.45 | 0.58 |
| Naïve Bayes | 0.54 | 0.52 | 0.64 | 0.82 |

Table 9
Clustering result for new dataset.

| Methodology | Correctly clustered data | Accuracy percent |
|---|---|---|
| PCA + K-means | 773 | 51.53% |
| K-means only | 737 | 49.13% |

5.1. New dataset evaluation

A major concern surrounding the development of machine learning


algorithms for medical applications is the reliability of such a model when deployed in practice. To evaluate the performance of our model, we used a more practical dataset collected from a known population. In collaboration with the Specialist hospital, Benin City, we extracted information from the records of patients who were tested for diabetes at the medical facility. We formed a dataset from 1500 random records, considering only those records that had no missing values for the features that we needed. The dataset consists of 13 attributes, namely age, body mass index (bmi), family history, increased urination, fatigue, increased appetite, weight loss, increased thirst, eating pattern, regular exercise, sex, blood pressure, fasting blood glucose (fbg) and oral glucose tolerance test (ogtt). The class variable has a distribution of 760 negative and 740 positive cases. A description of the dataset is given in Table 7.

We subjected this dataset to the same preprocessing steps as the Pima Indian dataset and then used the output to test the performance of our proposed model. To further demonstrate the applicability of our model, we compared its results with those of other algorithms on the new dataset. The results are shown in Table 8. The accuracy of our model was 89%. This shows that the proposed method is reliable even when used with a more practical dataset.

In addition, the application of PCA to our new dataset also improved the k-means clustering algorithm, as shown in Table 9.

6. Conclusion and future work

The aim of this work was to design an efficient model for the prediction of diabetes. After a careful study of other published work, we proposed a novel model, which consists of using PCA for dimensionality reduction, k-means for clustering, and logistic regression for classification. With the intent to improve the k-means result of other researchers, we first applied the PCA technique to our dataset. Though PCA is a well-known technique, its effectiveness in improving k-means clustering, and in turn the logistic regression classification model, has not been given sufficient attention. Through our experiment we have shown that an improved logistic regression model for predicting diabetes is possible through the integration of PCA and k-means. The novelty achieved in the study includes the ability to obtain an enhanced k-means cluster result far above what other researchers have obtained in similar studies. The logistic regression model also performed at an improved level in predicting diabetes onset, compared with the results obtained when other algorithms were used in our study and in other studies. Another advantage is that our model is able to model a new dataset successfully.

Declarations

Availability of data and materials

The Specialist hospital dataset can be requested through
[email protected].

Competing interests

The authors declare that they have no competing interests.

Funding

This study was funded by the Funds for Distinguished Young Scientists of Lanzhou University of Technology (grant number 201304).

Authors’ contributions

CSZ and CUI conceived and designed the research. CUI conducted the literature review, developed the code, carried out the experiments and wrote the manuscript. CSZ and CUI helped interpret the results. WFF instructed CUI on dataset processing. CSZ instructed CUI on model validation. All authors read and approved the final manuscript.

Ethical statement

There is no additional ethical statement to make.

Acknowledgements

Not applicable.
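
As an illustration only, the three-stage pipeline summarised in Section 6 (PCA for dimensionality reduction, k-means for clustering, logistic regression for classification) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the data are synthetic, the function names (pca, kmeans, train_logreg, predict) are hypothetical, PCA is approximated by power iteration, and passing the cluster label to the classifier as an extra feature is an assumed design choice that the text does not specify.

```python
import math
import random


def pca(rows, n_components, iters=200):
    """Project rows onto leading principal components via power iteration."""
    rng = random.Random(0)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflate so the next pass converges to the next component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Synthetic stand-in data (NOT the study's datasets): two separated classes.
rng = random.Random(42)
X_raw, y = [], []
for i in range(40):
    label = i % 2
    X_raw.append([rng.gauss(3.0 * label, 0.5) for _ in range(4)])
    y.append(label)

X_pca = pca(X_raw, n_components=2)   # 1) dimensionality reduction
clusters = kmeans(X_pca, k=2)        # 2) clustering
# 3) classification; appending the cluster id as a feature is an assumption.
X_aug = [row + [float(c)] for row, c in zip(X_pca, clusters)]
w, b = train_logreg(X_aug, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X_aug, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

The sketch reports training accuracy on toy data only; a faithful evaluation would use held-out data or cross-validation on the actual datasets, and a library implementation of each stage would normally replace these hand-written routines.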
