International Journal of Electrical, Energy and Power System Engineering (IJEEPSE)

e-ISSN: 2654-4644
Vol. 6, No. 2, pp. 134-139, June 2023
Accredited by KEMENDIKBUDRISTEK, No. 230/E/KPT/2022

Application Of Machine Learning K-Nearest Neighbour Algorithm To Predict Diabetes

Jack Billie Chandra*, Dewi Nasien
Department of Information Technology
Institut Bisnis dan Teknologi Pelita Indonesia
Pekanbaru, Indonesia
[email protected] [email protected]

*Corresponding author: [email protected]

Abstract--- Diabetes is a chronic disease characterized by blood sugar (glucose) levels above the normal range. It can occur when the body is no longer able to absorb glucose properly or when glucose intake is higher than needed. Glucose is the main energy source for the cells of the human body, but glucose that accumulates in the body over the long term can lead to complications and to more serious, life-threatening diseases. Diabetes therefore needs to be predicted before complications set in. Machine learning is a branch of artificial intelligence that can be used to make predictions from datasets of diabetic patients. The tested dataset has 390 observations with cholesterol level, glucose, HDL cholesterol, cholesterol ratio, age, gender, blood pressure, BMI, waist and hip width with their ratio, and the patient's height and weight as variables. Predictions are made using the K-Nearest Neighbor method, which achieves an accuracy of 93.58% with a k value of 3, using 20% of all data as test data.

Keywords--- Diabetes, K-Nearest Neighbor, Prediction, Machine Learning

I. INTRODUCTION

Nowadays, technology is developing ever more rapidly and providing more and more benefits to human life. One of these benefits is computer technology, which can implement a human way of thinking in a computer system. One example is a machine learning system used to detect or predict. Diabetes is a chronic disease characterized by abnormally high levels of glucose (blood sugar). In people suffering from diabetes, the body is unable to properly process food for use as energy. The pancreas makes a hormone called insulin, which helps glucose enter the cells of the body; at times, the body does not make enough insulin, or makes none at all. As a result, the glucose (or sugar) stays in the blood and, over time, causes health problems [1]. Diabetes is one of the most dangerous and deadly diseases in Indonesia, after stroke and coronary heart disease. Early prediction of diabetes risk is needed for early treatment of this disease. According to Sidartawan Soegondo, Indonesia is the country with the fourth-highest number of diabetics in the world, a number that has increased to 14 million people. This is based on a report from the World Health Organization (WHO), in which the number of people with diabetes in Indonesia in 2000 was 8.4 million, after India (31.7 million), China (20.8 million), and the United States (17.7 million). Worldwide, the WHO reports that there are more than 143 million people with diabetes, and this number is projected to double by 2030 [2], with 77% of cases occurring in developing countries [3].

The increase in diabetes cases is due to delays in diagnosing the disease: some patients die from complications before a diagnosis is made. The delay stems from the variety of factors that influence the decision. Therefore, we need a prediction tool that can help determine whether a person has diabetes mellitus or not. The disease is associated with a lifestyle that combines insufficient physical activity with a diet high in calories and fat and low in fiber. Identification of diabetes is needed as a prevention strategy. By utilizing a data mining approach, it is possible to extract previously unknown information [4]. It is a great challenge for healthcare organizations to provide cost-effective and high-quality clinical care for patients. This can be done only by analyzing large healthcare databases to extract knowledge about disease and to make decisions. This is an important application for major diseases such as heart disease, cancer and diabetes [5]. The diagnosis of diabetes is very important, and many machine learning techniques can be used effectively for the prediction and diagnosis of diabetes. These machine learning algorithms prove to be cost-effective and time-saving for diabetic patients [6]. Therefore, machine learning algorithms are now used to identify and diagnose diseases in order to minimize the risk of death and improve a patient's health status, as machine learning contributes to specific decisions [7].

II. METHODOLOGY

A. Machine Learning
Machine learning is a branch of computer science that examines how a machine can solve problems without being explicitly programmed [8]. Peter Harrington (2012) describes the machine learning workflow in several steps, namely:
• Collect data, in the form of Excel, MS Access, text files and so on.

• Prepare the data, by determining the quality of the data and then taking steps to correct problems such as data loss.
• Train a model, with the prepared data split into two parts, namely training data used for model development and test data used as a reference.
• Evaluate the model, by determining the provisions in the selection of algorithms based on the test results.
• Improve performance, which involves choosing a different model or introducing more variables to increase efficiency.

B. Data Mining
Data mining is the process of looking for interesting patterns or information in selected data using certain techniques or methods. Techniques, methods, or algorithms in data mining vary widely [9]. According to Rerun (2018), data mining has several stages, explained as follows:

Fig. 1. Data Mining [10]

• Data Cleaning (to remove noise and inconsistent data).
• Data Integration (separate data sources are unified).
• Data Selection (data relevant to the analysis task is retrieved from the database).
• Data Transformation (data is transformed or consolidated into the right form for mining by performing summary or aggregation operations).
• Data Mining (the essential process where intelligent methods are used to extract data patterns).
• Pattern Evaluation (to identify the truly interesting patterns that represent knowledge, based on some interestingness measures).
• Knowledge Presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).

C. K-Nearest Neighbor
The K-Nearest Neighbor (KNN) method finds the shortest distance between the data to be evaluated and its K closest neighbors in the training data. This technique belongs to the nonparametric classification group. It is very simple and easy to implement. Similar to clustering techniques, it groups new data based on the distance of the new data to some other data, its nearest neighbors [11].

D. Confusion Matrix
The confusion matrix is a method commonly used to calculate accuracy in data mining. It is presented as a table that states the amount of test data that is correctly classified and the amount of test data that is misclassified [12]. In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. Accuracy is the ratio of the data that is classified correctly to the entire data. The accuracy value can be obtained from the following equation [13]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{1}$$

Precision is defined as the ratio of the selected relevant items to all selected items. Precision can be obtained by using the following equation [13]:

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{2}$$

Recall is defined as the ratio of the selected relevant items to the total number of available relevant items. Recall can be obtained from the following equation [13]:

$$\text{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{3}$$

Errors are cases that are incorrectly classified; the error rate shows how often the system used misclassifies the data. The percentage error can be calculated using the following equation [13]:

$$\text{Error} = \frac{FP + FN}{TP + TN + FP + FN} \times 100\% \tag{4}$$

E. Method
The method used is the K-Nearest Neighbor algorithm, which classifies new patient data to predict whether the patient has diabetes or not. The stages of the research are shown in Fig. 2. The first stage is to find the Euclidean Distance between the test data and each training record. Calculations are performed until every training record has a known Euclidean Distance value. The second stage is to determine the value of k that will be used for classification. The value of k sets how many training records are taken into account: the k training records with the smallest Euclidean Distance to the sample data being tested.
The Euclidean Distance is calculated as follows:

$$\text{dis}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \tag{5}$$

where x_1 is the training data, x_2 is the test data, i is the index of the data variable, dis is the distance, and n is the data dimension.

The third stage is classifying the training data based on the value of k. The k nearest training records obtained in the previous stage are separated according to their class, namely diabetes or no diabetes. The fourth stage is to count the classifications among those k training records: how many belong to the Diabetes class and how many belong to the No diabetes class. These counts are used in the next stage to draw conclusions.
The final stage is drawing conclusions. The class counts of the k nearest training records are compared. If the number of diabetes classifications among them is greater than the number of no diabetes classifications, it can be concluded that the test data belongs to the Diabetes class. If the no diabetes classification is more dominant, the test data is classified as no diabetes.
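
As an illustration only, the five stages described in this section can be sketched in plain Python. The function names (euclidean_distance, knn_predict) and the toy two-variable records are assumptions made for this sketch; they are not taken from the paper's implementation or dataset.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Equation (5): square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, test_row, k=3):
    """Classify test_row by majority vote among its k nearest training rows."""
    # Stage 1: distance from the test row to every training row.
    distances = [
        (euclidean_distance(row, test_row), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Stage 2: keep the k training rows with the smallest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Stages 3 and 4: count how many of those neighbours fall in each class.
    votes = Counter(label for _, label in nearest)
    # Final stage: the most frequent class is the prediction.
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (cholesterol, glucose); 1 = diabetes, 0 = no diabetes.
train_rows = [(185, 84), (177, 87), (191, 81), (250, 190), (240, 210)]
train_labels = [0, 0, 0, 1, 1]
prediction = knn_predict(train_rows, train_labels, (180, 85), k=3)
```

With k = 3, the three nearest toy records all carry the label 0, so the sketch classifies the sample as no diabetes. Variables on very different scales would dominate the distance, so a real pipeline might scale them first; the paper does not say whether it does.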

Fig. 2. Research Methodology


A. Dataset


As shown in Table II above, the dataset contains 390 patient records with 14 variables. The raw dataset must be processed first in order to enable the calculation. The ‘gender’ and ‘diabetes’ variables are converted into integer values: ‘female’ becomes ‘0’, ‘male’ becomes ‘1’, ‘diabetes’ becomes ‘1’, and ‘no diabetes’ becomes ‘0’. The following Table III is the result after preprocessing.
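
A minimal sketch of this encoding step is given below, assuming the records are held as Python dictionaries. The sample rows, the column spelling, and the helper name encode_row are hypothetical and only illustrate the mapping described above.

```python
# Mapping described in the text: female -> 0, male -> 1,
# no diabetes -> 0, diabetes -> 1.
GENDER_CODES = {"female": 0, "male": 1}
DIABETES_CODES = {"no diabetes": 0, "diabetes": 1}

# Hypothetical raw records; only the two text columns matter here.
raw_rows = [
    {"patient_number": 1, "gender": "female", "diabetes": "No diabetes"},
    {"patient_number": 2, "gender": "male", "diabetes": "Diabetes"},
]


def encode_row(row):
    """Return a copy of row with 'gender' and 'diabetes' as integers."""
    encoded = dict(row)
    # Normalise case and whitespace before the lookup; an unknown
    # value raises KeyError instead of being silently mis-coded.
    encoded["gender"] = GENDER_CODES[row["gender"].strip().lower()]
    encoded["diabetes"] = DIABETES_CODES[row["diabetes"].strip().lower()]
    return encoded


encoded_rows = [encode_row(row) for row in raw_rows]
```

Failing loudly on an unexpected label is a design choice of this sketch, so that spelling variants in the raw file are noticed rather than ignored.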

B. Calculating Euclidean Distance
The Euclidean Distance of each training record is calculated relative to the test data. The data must first be split into training data and test data at a fixed ratio, preferably with more training data than test data. In this example, the data is split into 20% test data and 80% training data. The test data is chosen randomly by the machine learning program. The program then calculates the Euclidean Distance from the test data to every training record. The following table shows the resulting distances for one test record.

Patient_number    Distance
2                 76.30839
3                 78.39224
4                 73.56964
5                 61.39585
...               ...
389               132.60972
390               91.33635

Distances are then sorted from the closest to the test data to the furthest, with patient 255 being the closest to the test data and patient 257 the furthest among the training data, as shown below:

TABLE IV. SORTED DISTANCES
Patient_number    Distance       Closest Distance
255               15.94302355    1
200               28.01607396    2
176               31.98282195    3
371               32.44895838    4
165               35.56193049    5
...               ...            ...
243               290.655        385
257               304.4431       386

C. K Value and Prediction
After calculating the Euclidean Distances, k must be assigned a value. The value should be no less than 3 and, if possible, odd for better performance, since an odd k avoids tied votes between the two classes. In this example, k is set to 3, so the 3 training records with the closest distance, namely patients 255, 200, and 176, are used for the classification prediction. The following table shows the prediction:

Patient Number    cholesterol    glucose    ...    ...    waist_hip_ratio    diabetes
255               185            84         ...    ...    0.83               No diabetes
200               177            87         ...    ...    0.85               No diabetes
176               191            81         ...    ...    0.86               No diabetes

As shown in Table VI, according to the training data within the range of the given k value, the test data is classified as ‘no diabetes’, since ‘no diabetes’ classifications outnumber the ‘diabetes’ classifications. And since the actual data of patient 231 is classified as ‘no diabetes’, the prediction proves to be accurate. The prediction process is repeated until all test data are predicted, and the prediction results are then evaluated using the confusion matrix.

D. Algorithm Result
The research shows different results across several test runs with different k values and split data ratios, as shown in Table VII.


Test Run    Test Data    Training Data    Accuracy    Best K Value    Precision    Recall    Error
1           20%          80%              93.58%      3               100%         78%       6.4%
2           30%          70%              88.88%      3               91%          45%       11.11%
3           40%          60%              91.02%      17              94%          54%       8.97%
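
To make the relation between these percentages and equations (1) to (4) concrete, the sketch below computes the four measures from confusion-matrix counts. The paper does not report the confusion matrix itself, so the counts used here (18 true positives, 55 true negatives, 0 false positives, 5 false negatives, 78 test records in total) are hypothetical values chosen only because they give figures close to Test Run 1. The function name confusion_metrics is likewise an assumption.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and error as percentages.

    Follows equations (1) to (4). Precision is undefined when tp + fp
    is zero and recall when tp + fn is zero; this sketch does not guard
    against those cases and would raise ZeroDivisionError.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp),
        "recall": 100 * tp / (tp + fn),
        "error": 100 * (fp + fn) / total,
    }


# Hypothetical counts, not taken from the paper (see the note above).
metrics = confusion_metrics(tp=18, tn=55, fp=0, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumed counts, accuracy and error sum to 100%, as they do for every row of Table VII up to rounding.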

As shown in Table VII, K-Nearest Neighbor gives the best prediction result with a 20% - 80% split data ratio and three as the k value: its accuracy of 93.58% is the highest accuracy score and its error rate of 6.4% is the lowest error rate.

E. Interface Discussion
Load Dataset Form View: this view is used to load the raw dataset for the machine learning program to use. As shown in Fig. 3, the file must be in .csv format in order to run.

Fig. 3. Load Dataset Form View

Dataset View: this view shows the contents of the dataset, as shown in Fig. 4. It only serves to indicate that the dataset has been successfully uploaded.

Fig. 4. Dataset View

Training Test Split View: in this view, the user can set the training test split ratio by changing the value of ‘test-size’. The amounts of training and test data are displayed below the code box after running the code, as shown in Fig. 5.

K-Nearest Neighbor Model View: in this view, the value of k can be changed to obtain different prediction results, as shown in Fig. 6.

Fig. 6. K-Nearest Neighbor Model View

Confusion Matrix View: this view displays the confusion matrix table results. The confusion matrix and accuracy are displayed, as shown in Fig. 7.

Fig. 7. Confusion Matrix View

Classification Report View: this view shows the performance of K-Nearest Neighbor in diabetes prediction. Precision, Recall, and Accuracy are displayed, as shown in Fig. 8.

Fig. 8. Classification Report View

Prediction View: in this view, sample data can be entered to obtain its prediction. The input can only be numbers, either integer or float, not strings. If a value is a float, it must use a period (.) as the decimal separator instead of a comma (,).

Error Rate View: this view displays a graph of the error rate against the value of k. Its purpose is to check which k value has the lowest error rate. A lower error rate means better accuracy.
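
The check performed in the Error Rate View can be sketched as a loop over candidate k values. The example below is an assumption-laden illustration: it uses synthetic two-variable data generated with a fixed seed, a plain 80% - 20% split, and helper names (knn_predict, error_rate) that are not taken from the paper's program. Its error rates therefore say nothing about the results reported above.

```python
import math
import random
from collections import Counter


def knn_predict(train, test_row, k):
    """train is a list of (features, label) pairs; majority vote of k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], test_row))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, test, k):
    """Percentage of test records whose predicted label is wrong."""
    wrong = sum(
        1 for features, label in test
        if knn_predict(train, features, k) != label
    )
    return 100 * wrong / len(test)


# Synthetic stand-in data: (glucose-like value, BMI-like value), label.
rng = random.Random(0)
data = [((rng.gauss(90, 10), rng.gauss(25, 3)), 0) for _ in range(60)]
data += [((rng.gauss(180, 20), rng.gauss(32, 3)), 1) for _ in range(40)]
rng.shuffle(data)

# 80% training data, 20% test data.
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# Error rate for each odd k from 3 to 19.
error_by_k = {k: error_rate(train, test, k) for k in range(3, 21, 2)}
# Smallest error rate wins; on ties the smallest k is kept.
best_k = min(error_by_k, key=error_by_k.get)
```

Plotting error_by_k against k would give the kind of graph the view describes; the plotting itself is left out to keep the sketch dependency-free.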

Fig. 5. Training Test Split View

The conclusion obtained from the research conducted is that the K-Nearest Neighbor algorithm performs well in predicting diabetes, with a fairly high accuracy of 93.58% and a fairly low prediction error rate of 6.4%.

The authors would like to thank the Faculty of Computer Science, Institut Bisnis dan Teknologi Pelita Indonesia, for the facilities that have been provided and for its support.

JACK BILLIE CHANDRA was born in Pekanbaru, Indonesia and is a student at the Faculty of Computer Science, Institut Bisnis dan Teknologi Pelita Indonesia Pekanbaru. He graduated in 2022. He also received his A.P in 2019 from the same institute.
