Diabetes Prediction
Keywords: PCA; K-means; Diabetes; Data mining; Logistic regression

Abstract

Diabetes causes a large number of deaths each year, and a large number of people living with the disease do not realize their health condition early enough. In this study, we propose a data mining based model for early diagnosis and prediction of diabetes using the Pima Indians Diabetes dataset. Although K-means is simple and can be used for a wide variety of data types, it is quite sensitive to the initial positions of the cluster centers, which determine the final cluster result; the clustering either provides a sufficient and efficiently clustered dataset for the logistic regression model, or yields a smaller amount of data as a result of incorrect clustering of the original dataset, thereby limiting the performance of the logistic regression model. Our main goal was to determine ways of improving the K-means clustering and logistic regression accuracy. Our model comprises PCA (principal component analysis), K-means, and the logistic regression algorithm. Experimental results show that PCA enhanced the accuracy of the K-means clustering algorithm and the logistic regression classifier compared with the results of other published studies, with the K-means output containing 25 more correctly classified data points and the logistic regression accuracy being 1.98% higher. As such, the model is shown to be useful for automatically predicting diabetes using patient electronic health record data. A further experiment with a new dataset showed the applicability of our model for the prediction of diabetes.
∗ Corresponding author.
E-mail addresses: [email protected] (C. Zhu), [email protected] (C.U. Idemudia), [email protected] (W. Feng).
https://doi.org/10.1016/j.imu.2019.100179
Received 20 January 2019; Received in revised form 27 March 2019; Accepted 4 April 2019
2352-9148/© 2019 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).
This research work proposes PCA for dimensionality reduction, which helps to define suitable initial centroids for our dataset when the k-means algorithm is applied. K-means is then used to find outliers and to cluster the data into similar groups, with logistic regression as a classifier for the dataset. In this paper, section 2 provides a review of related work done by other researchers in the area of diabetes prediction and diagnosis. Section 3 gives details of the experimental procedures. Section 4 describes the experimental results, and section 5 concludes the work and suggests possible directions for future work.
2. Related study
Table 1
Original and preprocessed dataset statistics.
Statistics Dataset Preg Plas Pres Skin Insu BMI Pedi Age
COUNT Original 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000
Preprocess 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000 768.0000
MEAN Original 0.8554 120.8945 69.1054 20.5364 79.7994 31.9925 0.4718 33.2408
Preprocess 0.8554 121.6867 72.4051 29.1534 155.5482 32.4574 0.6718 33.2408
STD Original 0.3518 31.9726 19.9522 15.9522 115.2440 7.8841 0.3313 11.7602
Preprocess 0.3518 30.4359 12.0963 8.7909 85.0211 6.8751 0.3313 11.7602
MIN Original 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0780 21.0000
Preprocess 0.0780 44.0000 24.0000 7.0000 14.0000 18.2000 0.0780 21.0000
25% Original 1.0000 99.0000 62.0000 0.0000 0.0000 27.3000 0.2437 24.0000
Preprocess 1.0000 99.7500 64.0000 25.0000 121.5000 27.5000 0.2437 24.0000
50% Original 1.0000 117.0000 72.0000 23.0000 30.5000 32.0000 0.3725 29.0000
Preprocess 1.0000 117.0000 72.2025 29.2534 155.5482 32.4000 0.3725 29.0000
75% Original 1.0000 140.2500 80.0000 32.0000 127.2500 36.6000 0.6262 41.0000
Preprocess 1.0000 140.2500 80.0000 32.0000 155.5482 36.6000 0.6262 41.0000
MAX Original 1.0000 199.0000 122.0000 99.0000 846.0000 67.1000 2.4200 81.0000
Preprocess 1.0000 199.0000 122.0000 99.0000 864.0000 67.1000 2.4200 81.0000
Today's real world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge sizes and their likely origin from multiple, heterogeneous sources [13]. Data quality is an important factor in the data mining process for disease prediction and diagnosis, because low quality data may lead to inaccurate or weak prediction results. In order to make our original dataset more productive and applicable for predicting diabetes, we applied several preprocessing techniques using various packages offered within the Anaconda integrated development environment.

First, we took a closer look at the various attributes and, in discussion with a professional dietician, analyzed the medical relevance of each attribute to diabetes prediction and diagnosis. It was discovered that "number of times pregnant" has less significance to the current research direction. We decided to apply the same technique used by Han Wu [12] by transforming this numeric attribute into a nominal attribute of value 0 and 1, with 1 indicating a patient who was previously pregnant and 0 indicating a patient who was never pregnant. This helps to reduce the complexity of analyzing the dataset [12] (see Table 1).

Secondly, statistical analysis of our dataset suggested the presence of missing values. Table 1 shows the statistical results for our dataset. From these statistics, it is observed that plasma glucose concentration, diastolic blood pressure, skin fold thickness, 2-hr serum insulin and body mass index have a minimum value of 0. Medical knowledge explains that such attributes (medical measurements) cannot be 0; this suggests that the dataset contains missing values that, if not handled, can impair the quality of our model's results and accuracy. Various methods have been suggested for handling missing values in datasets. In our case we replaced missing values with the mean of the corresponding attribute.

As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0, 1] values by performing normalization of the dataset. This improves speed and reduces runtime complexity. Using the Z-score we normalize our value set V to obtain a new set of normalized values V' with the equation below:

$$V' = \frac{V - Y}{Z} \quad (1)$$

where V' is the new normalized value, V the previous value, Y the mean, and Z the standard deviation.
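For illustration, the preprocessing steps above can be sketched as follows. This is a minimal sketch, not the authors' code; the file name and the Preg/Plas/... column abbreviations (taken from Table 1) are assumptions.

```python
import pandas as pd

# Load the Pima Indians Diabetes dataset (file name and column names assumed).
cols = ["Preg", "Plas", "Pres", "Skin", "Insu", "BMI", "Pedi", "Age", "Class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=cols)

# Transform "number of times pregnant" into a nominal 0/1 attribute:
# 1 = previously pregnant, 0 = never pregnant.
df["Preg"] = (df["Preg"] > 0).astype(int)

# Treat medically impossible zeros as missing values and replace them
# with the attribute mean.
for col in ["Plas", "Pres", "Skin", "Insu", "BMI"]:
    df[col] = df[col].replace(0, float("nan"))
    df[col] = df[col].fillna(df[col].mean())

# Z-score normalization, V' = (V - mean) / std, as in equation (1).
features = df.drop(columns="Class")
features = (features - features.mean()) / features.std()
```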
3.4. Model algorithm design

Our model algorithm will be made up of three sub-stages. In the first stage of the design we will perform dimensionality reduction on the already processed dataset (using PCA). Then we will cluster the selected principal components using K-means to address outliers and remove any incorrectly classified data. Finally, the correctly clustered and classified data will be used as input for our supervised classification using logistic regression.

3.5. Principal component analysis

During data analysis it is often very difficult to find all the relationships among attributes. PCA allows a huge amount of information enclosed in initially correlated data to be transformed into a set of new orthogonal components, thereby making it possible to discover concealed relationships, enhance data visualization, detect outliers, and perform classification within the newly defined dimensions [5]. The application of PCA on a dataset can be of great help when unsupervised learning is to be performed on that dataset, as it aids in efficiently initializing centroids for clustering.

Because PCA yields a feature subspace that maximizes the variance along the axes, we first standardize the dataset onto a unit scale (mean = 0 and variance = 1) to improve the PCA result; such standardization is a requirement for the optimal performance of many machine learning algorithms.

Our objective here is to transform our dataset X of dimension p into a new sample set Y of smaller dimension L (L < p), where Y is the principal component of X, i.e.

$$Y = PC(X) \quad (2)$$

We proceed as follows:

(a) Organize our dataset: X has a set of n vectors $(x_1, x_2, \ldots, x_n)$, where each $x_i$ is an instance of our dataset.

(b) Find the mean using the equation:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad (3)$$
(c) Find the variance using the equation:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \quad (4)$$
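The remaining PCA steps (covariance matrix, eigendecomposition, component selection) can be delegated to a library in practice. The sketch below uses scikit-learn, which performs those steps internally; it is an illustration under stated assumptions (the input file is a stand-in, and L = 2 components is an arbitrary demonstration choice, not the paper's reported setting).

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the preprocessed attribute table from the previous sketch.
features = pd.read_csv("pima-preprocessed.csv")

# Standardize each attribute to mean = 0 and variance = 1 before PCA.
X_std = StandardScaler().fit_transform(features)

# Project onto L < p orthogonal components, Y = PC(X) as in equation (2).
pca = PCA(n_components=2)
Y = pca.fit_transform(X_std)

# Fraction of the total variance retained by each component.
print(pca.explained_variance_ratio_)
```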
K-means is one of the simplest and most efficient unsupervised classification algorithms. K-means is a well-known partitioning based clustering technique that attempts to find a user specified number of clusters represented by their centroids [5]. It is a typical distance-based clustering algorithm, in which distance is used as the measure of similarity, i.e. a smaller distance between objects indicates greater similarity [12]. Fig. 2 shows the graphical procedure for the k-means clustering, applying the following steps:

(a) Step (a) in Fig. 2 shows our entire dataset. Initialize k = 2, since the target variable contains two possible outcomes (positive and negative).

(b) Next, determine for each input data point the cluster center it is nearest to, using equation (9), extracted from Ref. [9] (step b):

$$S_i^{(t)} = \{x_p : \|x_p - m_i^{(t)}\|^2 \le \|x_p - m_j^{(t)}\|^2 \;\; \forall j,\, 1 \le j \le k\} \quad (9)$$

(c) Applying equation (10) from Ref. [9], update the cluster centers by recalculating the mean of the input data points assigned to each cluster (step c):

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j \quad (10)$$

(d) To bring our k-means clustering to a stop, we loop through steps (b) and (c) until there is convergence in the mean values of the clusters (step d).
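A minimal sketch of this clustering stage, assuming scikit-learn and a placeholder for the PCA output Y from the previous sketch; KMeans mirrors steps (a)-(d) but is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the PCA-projected data (n_samples x L).
rng = np.random.default_rng(0)
Y = rng.random((768, 2))

# k = 2 matches the two possible outcomes (positive / negative).
# KMeans alternates the assignment step of equation (9) and the update
# step of equation (10) until the cluster means converge.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(Y)

# Converged cluster centers m_i.
print(km.cluster_centers_)
```

Samples whose cluster assignment disagrees with their actual class label can then be treated as incorrectly clustered and removed before the classification stage, as described in section 3.4.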
The application of the logistic regression model has featured prominently in many domains, such as the biological sciences. The logistic regression algorithm is used when the objective is to classify data items into categories. Usually in logistic regression the target variable is binary, meaning it only contains data classified as 1 or 0, which in our case refers to a patient being positive or negative for diabetes. The purpose of our logistic regression algorithm is to find the best fit that is diagnostically reasonable to describe the relationship between our target variable and the predictor variables.

The logistic regression algorithm is based on the linear regression model given in equation (12) below:

$$y = h_\theta(x) = \theta^T x \quad (12)$$

Equation (12) would be highly inefficient for predicting our binary values ($y^{(i)} \in \{0, 1\}$); therefore we introduce the function in equation (13) to predict the probability that a given patient (with given attributes) belongs to the "1" (positive) class versus the probability that it belongs to the "0" (negative) class:

$$P(y = 1 \mid x) = h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} \equiv \sigma(\theta^T x)$$
$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - h_\theta(x) \quad (13)$$

Applying equation (14), known as the sigmoid function, we are able to map the value of $\theta^T x$ into the [0, 1] range:

$$\sigma(t) = \frac{1}{1 + e^{-t}} \quad (14)$$

We then search for a value of $\theta$ such that the probability $P(y = 1 \mid x) = h_\theta(x)$ is large when x belongs to the "1" class and small when x belongs to the "0" class (i.e. $P(y = 0 \mid x)$ is large).
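As a sketch of this final stage (again assuming scikit-learn rather than the authors' code; the data arrays are stand-ins for the output of the K-means stage):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders for the correctly clustered samples and their 0/1 labels.
rng = np.random.default_rng(0)
X_clean = rng.random((700, 2))
y_clean = rng.integers(0, 2, 700)

X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=0)

# Fits theta so that sigma(theta^T x) from equations (13)-(14) models
# P(y = 1 | x) for each patient.
clf = LogisticRegression().fit(X_train, y_train)
print("P(y=1|x) for one test patient:", clf.predict_proba(X_test)[0, 1])
```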
4. Experimental result
A major result discovered from the use of PCA is that the process helped in minimizing the drawback of having redundant features which are of no help for clustering. Since the reduction in the number of variables in the original dataset assisted in handling noisy and outlier data, PCA thereby improved our k-means result. The main advantage of PCA is that once the principal components have been found, the data can be compressed, i.e., the number of dimensions reduced without much loss of information; this became an essential process for determining the number of clusters and providing a statistical framework to model the cluster structure [5].
The efficiency and accuracy of any predictive and diagnostic model are of paramount importance and should be ensured before such a model is deployed for implementation. We analyzed and evaluated our model output using different evaluation metrics, and the result is shown in Fig. 3.

First, to determine the performance of our model, we utilized the k-fold cross validation technique, which allows us to determine how well our model will perform when given new and previously unlearned data. Our choice of 10-fold cross validation meant that our dataset was divided into 10 subsets. On each trial, one subset was used as the test set and the other nine subsets formed the training set. Then, the average error across all 10 trials was computed to give the total performance of our model. This method helps solve two issues: first, it reduces the problem of bias, as almost all of the data is used for fitting; and secondly, the problem of variance is greatly reduced.

The confusion matrix is a popular way to provide a summarized representation of predictive findings. The confusion matrix gives the result of the following indices: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Table 2 shows the confusion matrix for our model.

Table 2
Confusion matrix.
0 (Negative)    1 (Positive)    Class
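The 10-fold cross validation described above could be reproduced along these lines (a sketch assuming scikit-learn; the data arrays are stand-ins for the processed dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders for the processed features and diabetes labels.
rng = np.random.default_rng(0)
X = rng.random((768, 2))
y = rng.integers(0, 2, 768)

# cv=10: each fold serves once as the test set while the remaining nine
# form the training set; the fold scores are then averaged.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```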
The performance metrics of our model are presented in Table 3 below.

Table 3
Performance metrics.

Performance Measure    Score
Recall                 0.97
Precision              0.97
Accuracy               0.9739
MCC                    0.94

The ROC curve is a graphical plot that represents the performance of a classifier. The ROC allows us to look at the performance of our model across all possible thresholds. In our experiment, the ROC value was 0.967, and the ROC curve is shown in Fig. 4. The Kappa statistic (or value) is a metric that compares an observed accuracy with an expected accuracy (random chance). The Kappa statistic normally holds a value between 0 and 1. Our experiment had a Kappa statistic value of 0.942.

Fig. 4. The ROC curve.
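These metrics can be computed with scikit-learn's built-in scorers; in this sketch, y_true, y_pred and y_prob are synthetic stand-ins for the model's test labels, hard predictions and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

# Placeholder test labels, P(y=1|x) scores and thresholded predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 154)
y_prob = np.clip(y_true + rng.normal(0, 0.3, 154), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

print("recall   :", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
```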
4.1. Comparison using other algorithms

To further evaluate how our model performs, we modelled our dataset with four different algorithms using the following variations: the original dataset, PCA processed data, PCA + K-means processed data, and K-means only. The result is shown in Table 4 below.

Table 4
Model comparison with different algorithms.
Algorithm    Original dataset    PCA processed    Kmeans only clustered dataset    PCA + KMEANS processed dataset

From Table 4, the PCA and K-means integration technique improved the performance accuracy of the different algorithms we modelled our dataset with; the exception is XGBoost. Although applying XGBoost alone to the original dataset shows an improvement, Table 4 indicates that accuracy declines from 95% when only K-means is integrated with XGBoost to 93% with our proposed PCA and K-means technique. Furthermore, K-means alone was shown to be a good procedure for improving the accuracy of each of the algorithms, while PCA reduced the accuracy when applied alone.
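The comparison grid of Table 4 amounts to evaluating each algorithm on each dataset variant. A hedged sketch of that loop follows; the algorithm choices are assumptions (the paper's four algorithms include XGBoost, omitted here to avoid an extra dependency), and the arrays stand in for the real dataset variants.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for the four dataset variants of Table 4; real arrays would
# come from the preprocessing, PCA and K-means stages above.
variants = {
    "original":    rng.random((768, 8)),
    "pca":         rng.random((768, 2)),
    "kmeans_only": rng.random((700, 8)),
    "pca_kmeans":  rng.random((700, 2)),
}
labels = {name: rng.integers(0, 2, X.shape[0]) for name, X in variants.items()}

# Two stand-in algorithms evaluated on every variant.
algorithms = {
    "logistic_regression": LogisticRegression(),
    "gradient_boosting":   GradientBoostingClassifier(),
}

for algo_name, algo in algorithms.items():
    for var_name, X in variants.items():
        acc = cross_val_score(algo, X, labels[var_name], cv=10).mean()
        print(f"{algo_name} on {var_name}: {acc:.3f}")
```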
5. Discussion
Table 5
Comparison of K-means clustering result.
Author(year) Methodology Correctly clustered data Accuracy percent
Table 6
Accuracy comparison with other experiments.
Author(year) Methodology Accuracy percent
Table 7
Description of new dataset features.
Attributes    Description

Table 9
Clustering result for new dataset.
Methodology    Correctly clustered data    Accuracy percent
Table 8
Algorithm comparison.
Algorithm Original dataset PCA processed Kmeans only clustered dataset PCA + KMEANS processed dataset