Article
Diagnosis of Diabetes Mellitus Using Gradient Boosting
Machine (LightGBM)
Derara Duba Rufo 1 , Taye Girma Debelee 2,3 , Achim Ibenthal 4, * and Worku Gachena Negera 3
1 College of Engineering and Technology, Dilla University, Dilla 419, Ethiopia; [email protected]
2 College of Electrical and Mechanical Engineering, Addis Ababa Science and Technology University,
Addis Ababa 120611, Ethiopia; [email protected]
3 Ethiopian Artificial Intelligence Center, Addis Ababa 40782, Ethiopia; [email protected]
4 Faculty of Engineering and Health, HAWK University of Applied Sciences and Arts,
37085 Göttingen, Germany
* Correspondence: [email protected]
Abstract: Diabetes mellitus (DM) is a severe chronic disease that affects human health and has a high prevalence worldwide. Research has shown that half of the diabetic people throughout the world are unaware that they have DM, and its complications are increasing, which presents new research challenges and opportunities. In this paper, we propose a preemptive diagnosis method for DM to assist or complement the early recognition of the disease in countries with low medical expert densities. Diabetes data are collected from the Zewditu Memorial Hospital (ZMHDD) in Addis Ababa, Ethiopia. Light Gradient Boosting Machine (LightGBM) is one of the most recent successful research findings for the gradient boosting framework that uses tree-based learning algorithms. It has low computational complexity and, therefore, is suited for applications in limited-capacity regions such as Ethiopia. Thus, in this study, we apply the principle of LightGBM to develop an accurate model for the diagnosis of diabetes. The experimental results show that the prepared diabetes dataset is informative for predicting the condition of diabetes mellitus. With accuracy, AUC, sensitivity, and specificity of 98.1%, 98.1%, 99.9%, and 96.3%, respectively, the LightGBM model outperformed KNN, SVM, NB, Bagging, RF, and XGBoost on the ZMHDD dataset.

Keywords: diabetes mellitus; detection; LightGBM; diabetes diagnosis
In 2017, about 318,000 mobile health applications were available to consumers through-
out the world [11]. This includes tools enabling diabetes self-management by mobile
devices such as mobile phones, tablets, or smart watches [12]. These applications differ in the choice of indicators to be tracked, such as blood glucose estimates, nutrition and sugar intake, and physical activity and weight, and in features such as sharing information with health and social workers and providing patient information. However, most of these
existing diabetes-related mobile health applications are designed for users with a preceding
affirmative diagnosis of the disease status and accompanying factors, while this study is
dedicated to the early diagnosis of DM using machine learning algorithms.
There are several machine-learning-based diabetes assessment approaches; among
them, diagnosis, prediction, and complication analysis are the most researched ones.
In diabetes diagnosis [13–15], researchers used a patient’s diabetes history and physical
examination results such as plasma glucose concentration, diastolic blood pressure, body
mass index, age, weight, diet, insulin, water consumption, blood pressure, sex, etc. as input
to the machine learning algorithms. The most frequently used machine learning algorithms
are support vector machines (SVM), k-nearest neighbor (kNN), decision trees, Naive Bayes
(NB), and tree boosting algorithms such as XGBoost, AdaBoost, and random forest (RF) [15].
Conventional algorithms such as kNN, SVM, and NB yield comparatively low performance, whereas ensemble algorithms such as XGBoost, AdaBoost, and RF achieve a higher level of accuracy. Since these ensemble learners are defined on a set of hyperparameters,
their design involves a global optimization task to combine a set of indicators into a reliable
classification model.
Ravaut et al. [16] performed large-scale machine learning studies with health record
datasets of patients in Ontario, Canada, provided by the Institute of Clinical Evaluative
Sciences (ICES) to predict the risk of diabetes in a range of 1–10 years ahead. The considered
dataset has about 963 total input features. The authors compared logistic regression, XG-
Boost, Highway Network, CNN-LSTM, and LSTM-Seq2Seq algorithms to predict the risk
of diabetes mellitus over a 10-year scope. Based on the experimental analysis, the XGBoost model outperformed the other algorithms. The most researched diabetes complications are
retinopathy, neuropathy, and nephropathy. In [17], logistic regression is used to predict the
involvement of retinopathy, nephropathy, and neuropathy at different time scenarios—3,
5, and 7 years from first diabetes reference. Input features are gender, hypertension, age,
glycated hemoglobin (HbA1c), smoking habit, time from diagnosis (how long after diabetes
diagnosis), and body mass index (BMI).
As discussed above, ensemble learning algorithms in many cases outperform other
machine learning approaches for disease diagnosis. Fundamentally, this is achieved by
combining multiple base classifiers (individual classifier algorithms) into an ensemble
model by learning the inherent statistics of the combined classifiers and, hence, outperform-
ing the single classifiers [18]. In this paper, we investigate LightGBM ensemble classifiers
for the early detection of DM. This research work aims at supporting health practitioners
in the diagnosis of DM.
LightGBM is an ensemble algorithm developed by Microsoft that provides an efficient
implementation of the gradient boosting algorithm [19]. The primary benefit of LightGBM
is a dramatic acceleration of the training algorithm, which, in many cases, results in a
more effective model. LightGBM is constructed on top of decision tree algorithms, employing an ensemble of n_est boosted trees. Tree boosting algorithms outperform others
for prediction problems [20]. The LightGBM ensemble learning algorithm has been applied
in numerous classification and regression studies and achieved excellent detection results,
indicating that LightGBM is an effective classifier algorithm.
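As an orientation, the following minimal sketch shows how such a boosted-tree ensemble is trained through LightGBM's scikit-learn interface; the dataset path and column names are hypothetical placeholders, not the pipeline used in this study.

```python
# Minimal sketch of training a LightGBM classifier on tabular diabetes data;
# "diabetes.csv" and the "diabetic" column are illustrative placeholders.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")                    # placeholder dataset
X, y = df.drop(columns="diabetic"), df["diabetic"]  # binary class variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# n_estimators is the number of boosted trees in the ensemble.
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```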
The proposed LightGBM model provides an optimized decision-support system for
users. The particularity of the proposed approach is in the procedure used to calculate
the number of decision trees, maximum depth of the trees, and number of tree leaves
to construct an optimal LightGBM model. Furthermore, the first local diabetes dataset
of Ethiopia has been prepared to design a CAD (Computer Aided Diagnosis) system for
the early detection of DM. Thus, the purpose of this study is to develop an optimal and
accurate diabetes diagnosis model based on machine learning algorithms.
The remainder of this article is organized as follows: Section 2 discusses the related
existing work and accomplishments in the prediction and diagnosis of DM. Section 3
describes the materials used in the experiment, the research method, and the details of the
proposed diabetes detection model. Section 4 provides a discourse to the experimental
results and model evaluation, including a comparison to previous research approaches.
Section 5 states the study limitations and concludes the study with established guidelines
for future work.
2. Related Work
In general, we found that there are two categories of existing methods related to
diabetes prediction problems: machine learning viz. classification/detection [18,21–23]
and forecasting or forward prediction [16]. In this study, we are interested in estimating the probability of diabetes positivity and in reviewing relevant indicators and machine learning methods.
From the existing publications, we generalized two main approaches related to
diabetes-related features. In the first approach, some indicators that were more relevant to
diabetes mellitus from the view of medicine are selected manually/systematically and used
for diabetes prediction or diagnosis [21–24]. In the second approach, all diabetes-related
available attributes are given to machine (deep) learning algorithms [16,25,26] and learning
models must recognize the important features [16]. Our investigations follow the first
approach by obtaining the expertise of physicians on diabetes indicators for data collection.
The proposed indicators are verified by their correlation with the class variable in Table 1 in order to establish statistical relevance.
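This verification step amounts to correlating each candidate attribute with the class variable, as in the sketch below; the file and column names are hypothetical placeholders for the ZMHDD table.

```python
# Sketch of the indicator verification step: correlate each candidate
# attribute with the binary class variable (cf. Table 1).
import pandas as pd

df = pd.read_csv("zmhdd.csv")                 # placeholder for the ZMHDD table
corr = (
    df.corr(numeric_only=True)["diabetic"]    # Pearson correlation with class
    .drop("diabetic")
    .sort_values(key=abs, ascending=False)    # strongest indicators first
)
print(corr)  # near-zero entries mark statistically weak indicators
```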
According to this survey, Deep Neural Networks (DNN) and Support Vector Machines
(SVM) achieve the best classification outcomes, followed by random forests and other
ensemble classifiers. For DM detection/prediction, the best-in-class method reported by
Chaki et al. applies SVM on oral glucose tolerance test data at an accuracy of 96.8% [27].
Hence, this is regarded as a performance landmark for our algorithmic studies based on
patient anamnesis data used to predict type 2 DM. Subsequently, we refer to studies on
comparable data.
Deberneh and Kim [28] investigated whether patients would develop type 2 DM one year after data elicitation, based on 12 features: (i.) fasting plasma glucose, (ii.) glycated
hemoglobin (HbA1c), (iii.) triglycerides, (iv.) body mass index, (v.) gamma-glutamyltranspep-
tidase (γ-GTP), (vi.) age, (vii.) uric acid, (viii.) sex, (ix.) smoking, (x.) drinking, (xi.) physical
activity, and (xii.) family history. They found that the prediction has an accuracy of up
to 73% for soft voting and random-forest-based approaches, while XGBoost performed
slightly less at 72% accuracy. When the input data were elicited over a period of the past 4 years, the accuracy increased to 81%. On the one hand, this is significantly less than the
96.8% prediction accuracy reported in [27]; on the other hand, the merits are to predict the
occurrence of type 2 DM in the future and, hence, to allow for preventive treatment.
Chaki, J. et al. [29] systematically reviewed the state of the art in machine learning and artificial intelligence for diabetes mellitus detection and self-management. Their work focused on
four specific aspects: (i.) databases, (ii.) ML-based classification and diagnostic methods,
(iii.) AI-based intelligent assistants for patients with DM, and (iv.) performance metrics.
Alasaf et al. [30] proposed a system aimed at preemptively diagnosing DM in Saudi
Arabia. They retrieved data from King Fahd University Hospital (KFUH) in Khobar,
Saudi Arabia. The collected dataset contained 399 records with a binary target variable (diabetic or not), of which 191 instances were diabetic and 208 were not. Preprocessing techniques were applied to the data, and the 48 most relevant features were selected and prepared for the identification/classification
process. Four classification algorithms (SVM (LibSVM), ANN, NaiveBayes, and k-NN)
were applied to predict DM. As a result, the ANN outperformed the other algorithms with a testing accuracy of 77.5%.
Faruque et al. [31] explored various risk factors related to diabetes mellitus using
machine learning techniques. They collected diabetes data of 200 patients consisting of
15 diabetes indicators (features): (i.) age, (ii.) sex, (iii.) weight, (iv.) diet, (v.) polyuria,
(vi.) water consumption, (vii.) excessive thirst, (viii.) blood pressure, (ix.) hypertension,
(x.) tiredness, (xi.) vision problems, (xii.) kidney problems, (xiii.) hearing loss, (xiv.) itchy
skin, and (xv.) genetics with one binary class variable (diabetic or not) from the diagnostic
of Medical Centre Chittagong, Bangladesh. Four machine learning algorithms (SVM, NB,
KNN, and C4.5 Decision Tree) were used to predict diabetes mellitus. Empirical results showed that the C4.5 decision tree achieved the highest accuracy, 73.5%, compared with the other machine learning techniques.
Xu and Wang [18] proposed a type 2 diabetes risk prediction model based on an
ensemble learning method using the publicly available UCI Pima Indian diabetes dataset
(PIDD). PIDD contains eight diabetes indicator attributes, viz. (i.) number of times pregnant, (ii.) plasma glucose concentration at 2 h in an oral glucose tolerance test, (iii.) diastolic blood pressure (mmHg), (iv.) triceps skin fold thickness (mm), (v.) 2-h serum insulin (µU/mL) [32], (vi.) body mass index (weight (kg)/(height (m))²), (vii.) diabetes pedigree function, and (viii.) age (years), with one binary class variable (diabetic or not). They
followed a two-step approach. Firstly, they developed a weighted feature selection algo-
rithm based on random forest (RF-WFS) for optimal feature selection; then, the extreme
gradient boosting (XGBoost) classifier was applied to predict the risk of diabetes mellitus
accurately. The experimental results showed that the model achieved a classification accuracy of 93.75%, outperforming preceding research results.
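The two-step structure of this approach can be illustrated as follows; the sketch uses scikit-learn's importance-based feature selection and a stand-in dataset, not the authors' exact RF-WFS weighting scheme.

```python
# Illustrative sketch of the two-step idea in [18]: rank features with a
# random forest, then classify with XGBoost. A simplification, not RF-WFS.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)            # stand-in binary task
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: keep features whose RF importance exceeds the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0)
).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Step 2: train an XGBoost classifier on the selected features.
clf = XGBClassifier(n_estimators=200).fit(X_tr_sel, y_tr)
print("test accuracy:", clf.score(X_te_sel, y_te))
```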
Nowadays, for classification and diagnosis problems, LightGBM outperforms other
state-of-the-art methods, cf. [33–40]. In these related works, LightGBM is not only selected
for its effective prediction performance, but also for its shorter computational time and op-
timized data handling technique. For instance, in [41], LightGBM and XGBoost algorithms
were employed to construct the prediction models for cardiovascular and cerebrovascular
diseases prediction based on different indicator elements (features) such as systolic blood
pressure (SBP), diastolic blood pressure (DBP), serum triglyceride, serum high-density
lipoprotein, and serum low-density lipoprotein. The LightGBM model achieved the lowest
least mean square error (LMSE) for all indicators.
From the above review, we observed that Ethiopian data have never been explored be-
fore in diagnosing diabetes using artificial intelligence (AI) technology. Hence, an important
goal of this project is to prepare a diabetes dataset for the application of machine-learning-
based diabetes diagnosis serving two purposes: (a.) decision support for physicians and
handling of potential diabetes conditions onset and (b.) improvement of DM detection
coverage in countries with low physician density. From the existing work, we observed that LightGBM and XGBoost ensemble classifiers are the most promising models for diabetes detection and even for diagnosing other diseases. However, XGBoost is slower than LightGBM. Compared with XGBoost, the LightGBM algorithm features lower memory usage, higher speed and efficiency, better scalability to large datasets, and better accuracy than other boosting algorithms [19]. LightGBM is almost seven times faster than XGBoost [19] and, hence, is a much better approach when working on large datasets. This
makes LightGBM an interesting candidate for DM detection.
3. Diabetes dataset: the collected diabetes data were converted into a tabular format recognizable by machine learning models.
4. Data preprocessing: patterns underlying the data were visualized by box plots and a correlation heat map. Irrelevant data elements were removed, and inconsistent column values were replaced. The correlation coefficient of each input variable (attribute) with the dependent variable (diabetic or not) was calculated to identify the important features. The input variables span different ranges; fasting blood sugar (FBS) ranges from a minimum of 60 to a maximum of 200, whereas gender is binary (minimum 0, maximum 1). Machine learning algorithms process patterns numerically and can give higher weight to attributes with large numerical values; in this scenario, FBS would dominate gender, which is logically not always justified. To avoid such bias, the attribute values were normalized to a common range using the Min-Max normalization technique [43] (see the sketch after this list). Finally, the preprocessed dataset was split into training and test data samples.
5. Light Gradient Boosting Machine (LightGBM): the state-of-the-art LightGBM algo-
rithm has been proposed to predict diabetes mellitus. Here, the LightGBM was
optimized by calculating the optimal values of the hyperparameters using 10-fold
cross-validation. Finally, we developed other classifier models, viz. KNN, SVM, NB, Bagging (constructed on decision trees), RF, and XGBoost, and compared the results with the optimal LightGBM model.
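A minimal sketch of the preprocessing in step 4 is given below, assuming the ZMHDD table has been loaded into a pandas DataFrame; the file and column names are placeholders.

```python
# Sketch of step 4: Min-Max normalization and train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("zmhdd.csv")                       # placeholder path
X, y = df.drop(columns="diabetic"), df["diabetic"]

# Min-Max normalization rescales each attribute to [0, 1]:
#   x' = (x - min(x)) / (max(x) - min(x))
# so FBS (range 60-200) no longer outweighs binary attributes such as gender.
# (For a leakage-free pipeline, fit the scaler on the training split only.)
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# 80/20 split into training and test samples, as used later in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```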
The general framework of the proposed approach is summarized in Figure 1.
[Figure 1. General framework of the proposed approach: problem statement, diabetes dataset, data preprocessing, training/test split, and construction of the LightGBM model (M1) alongside the other algorithms (M2, ..., Mn).]

3.3. LightGBM
Gradient Boosting Decision Tree (GBDT) is a common machine learning algorithm,
which has effective implementations such as XGBoost and parallel Gradient Boosted Regression Trees (pGBRT) [44,45]. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability remain comparatively low for high-dimensional feature spaces and large data sizes. A major reason is that, for each feature, they need to scan all the data records to estimate the information gain of all possible split points, which requires very high computational time. Thus, to address these problems, Ke et al. [19] proposed LightGBM.
LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
It is designed to be distributed and efficient using two novel techniques: Gradient-based
One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [19]. GOSS excludes
a significant proportion of data instances with small gradients, and only uses the rest to
estimate the information gain. Since the data records with larger gradients play a vital
role in the computation of information gain, GOSS can obtain quite an accurate estimation
of the information gain with a much smaller dataset. EFB is used for bundling mutually
exclusive features to reduce the number of features. LightGBM carries the prefix "Light" because of its high speed. Compared with other existing Gradient Boosting Decision Tree algorithms, LightGBM offers faster training, higher efficiency, lower memory usage, better accuracy, the capability of handling large-scale data, and support for parallel and GPU learning. It is used for ranking, classification, and many other machine learning tasks.
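The sketch below shows where these techniques surface as user-facing parameters in LightGBM's scikit-learn interface; the values are illustrative only, and in recent LightGBM releases GOSS is configured via data_sample_strategy="goss" rather than the boosting type.

```python
# Sketch: GOSS is selected via the boosting type, while EFB runs
# automatically during histogram/dataset construction
# (enable_bundle=True by default in the native Dataset parameters).
import lightgbm as lgb

model = lgb.LGBMClassifier(
    boosting_type="goss",  # Gradient-based One-Side Sampling
    n_estimators=100,      # number of boosted trees
    num_leaves=31,         # leaf-wise growth is bounded by leaves, not depth
    max_depth=-1,          # -1 means no explicit depth limit
)
```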
One characteristic that distinguishes the LightGBM algorithm from other tree boosting algorithms is that it splits the tree leafwise with the best fit, as shown in Figure 2, whereas other boosting algorithms split the tree depthwise or levelwise, see Figure 3. When growing on the same leaf, the leafwise algorithm can reduce the loss more than the levelwise algorithm and, hence, tends to achieve better accuracy.
[Figure 2. Leaf-wise tree growth in LightGBM [19].]
Figure 3. Level-wise tree growth in other boosting algorithms (such as in XGBoost) [19].
For small datasets, leafwise growing may increase model complexity and result in overfitting [46]. To overcome this problem, we optimized the LightGBM algorithm for our medium-sized ZMHDD dataset (2000+ records) by precalculating the optimal values of the hyperparameters that control the complexity of the LightGBM model. These are (i.) the number of iterations, (ii.) the maximum depth of the trees, and (iii.) the number of leaves. Hence, we retrieve the optimum number of trees, maximum tree depth, and number of tree leaves. Details on the optimization process are given in Section 4.
3.4. Evaluation
To measure how well our model performs, different standard performance evaluation
metrics [47]—i.e., accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC)—have been used. We also used the k-fold cross-validation method, which splits the dataset into k subsets, with k − 1 subsets used for training and the remaining subset used as the test set in each round. This allows for k constellations of model training and testing. Taking the average performance of the k training runs gives an indication of the generalization capability of the model on unknown data.
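A minimal sketch of such k-fold scoring (k = 10) is given below; X and y stand for the preprocessed ZMHDD features and labels from the earlier preprocessing sketch, and scikit-learn has no built-in "specificity" string scorer (it can be built with make_scorer(recall_score, pos_label=0)).

```python
# Sketch of 10-fold cross-validated scoring with the paper's metrics.
import lightgbm as lgb
from sklearn.model_selection import cross_validate

scores = cross_validate(
    lgb.LGBMClassifier(),
    X, y,                                       # preprocessed ZMHDD data
    cv=10,
    scoring=["accuracy", "roc_auc", "recall"],  # recall = sensitivity
)
print("mean CV accuracy:", scores["test_accuracy"].mean())
```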
Specifically, the performance of the proposed model is evaluated on ZMHDD in
two phases. First, 10-fold cross-validation is applied to each grid-search point in a grid
search over three hyperparameters, as described in Section 4. This results in an optimum
hyperparameter set of the LightGBM algorithm, as per Figure 4, and hence, determines the
optimum model architecture. Since, due to cross-validation, the training of this architecture is based on a smaller dataset, its parametrization can be further aligned to the data statistics by retraining on the entire ZMHDD dataset, separated into 80% training and 20% test data samples. The results are discussed in the following section.
Figure 4. 3D visualization of the grid search result over the investigated space of hyperparameters n_est, depth_max, and n_leaves. Bubble size indicates the validation score (the larger, the better); color indicates the required training time (the shorter, the better). The optimum configuration of the LightGBM model is n_est = 150, depth_max = 3, and n_leaves = 4, at a test accuracy of 0.9815 and a training time of 0.624 s.
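Such a search can be expressed compactly with a cross-validated grid search, as in the sketch below; the grid values are illustrative brackets around the reported optimum, not a restatement of the exact grid used in this study.

```python
# Sketch of the 10-fold cross-validated grid search over the three
# hyperparameters (n_est, depth_max, n_leaves).
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 150, 200, 250],
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9],
    "num_leaves": [4, 8, 16, 32],
}
search = GridSearchCV(
    lgb.LGBMClassifier(), param_grid, cv=10, scoring="accuracy"
)
search.fit(X_train, y_train)  # preprocessed ZMHDD training split
print(search.best_params_)    # reported optimum: 150 trees, depth 3, 4 leaves
```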
A leaf-wise tree is typically much deeper than a depthwise tree for a fixed number of leaves, and unconstrained depth can induce overfitting [46]. Thus, when optimizing num_leaves, we should keep it smaller than 2^depth_max.
4. LightGBM model optimization: several LightGBM models with varying n_est, depth_max, and n_leaves parameters were constructed using a 10-fold cross-validation grid search (cf. the sketch after Figure 4) to determine the optimal parametrization in the sense of a validation metric. Following the grid search, our model achieved the best accuracy of 98.15% at the configuration n_est = 150, depth_max = 3, and n_leaves = 4. The 3D visualization of the 10-fold cross-validation grid search result is shown in Figure 4; the size of the bubbles indicates the validation score, and their color indicates the training time.
5. Performance evaluation: Lastly, the performance of the designed LightGBM model is
evaluated on the test data (20% of ZMHDD) using a training and test data splitting
method [48]. Key metrics are given in Table 3 and Figure 5.
Table 3. Performance of the optimized LightGBM model on the ZMHDD test data.

Metric        Value
Accuracy      0.98
Sensitivity   0.99
Specificity   0.96
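For reference, the metrics in Table 3 follow directly from the confusion matrix on the test split, as in the sketch below; model, X_test, and y_test are placeholders for the fitted LightGBM model and the held-out data.

```python
# Sketch: deriving accuracy, sensitivity, and specificity from the
# confusion matrix of the optimized model on the 20% test split.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"acc={accuracy:.2f}  sens={sensitivity:.2f}  spec={specificity:.2f}")
```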
[Figure 5. ROC curve of the LightGBM model; the vertical axis shows the True Positive Rate (Sensitivity).]