
Arabian Journal for Science and Engineering

https://doi.org/10.1007/s13369-020-05206-x

RESEARCH ARTICLE-ELECTRICAL ENGINEERING

Performance Analysis of Machine Learning Algorithms for Thyroid Disease

Hafiz Abbad Ur Rehman1 · Chyi-Yeu Lin1 · Zohaib Mushtaq2 · Shun-Feng Su2

Received: 13 February 2020 / Accepted: 7 December 2020


© King Fahd University of Petroleum & Minerals 2021

Abstract
Thyroid disease arises from an anomalous growth of thyroid tissue at the verge of the thyroid gland. Thyroid disorders normally ensue when this gland releases abnormal amounts of hormones; hypothyroidism (an underactive thyroid gland) and hyperthyroidism (an overactive thyroid gland) are the two main types of thyroid disorder. This study proposes the use of efficient classifiers built with machine learning algorithms, evaluated in terms of accuracy and other performance metrics, to detect and diagnose thyroid disease. This research presents an extensive analysis of different classifiers, namely K-nearest neighbor (KNN), Naïve Bayes, support vector machine, decision tree and logistic regression, implemented with and without feature selection techniques. Thyroid data were taken from DHQ Teaching Hospital, Dera Ghazi Khan, Pakistan. The thyroid dataset is unique and different from those of other existing studies because it includes three additional features: pulse rate, body mass index and blood pressure. The experiment was based on three iterations; the first iteration did not employ feature selection, while the second and third used L1- and L2-based feature selection techniques, respectively. Evaluation and analysis of the experiment considered many factors, such as accuracy, precision and the receiver operating curve with area under curve. The results indicate that classifiers using L1-based feature selection achieved an overall higher accuracy (Naive Bayes 100%, logistic regression 100% and KNN 97.84%) compared to no feature selection and the L2-based feature selection technique.

Keywords  Classification · Thyroid disease · KNN · SVM · DT · NB · LR · Feature selection

List of Symbols
k     Number of neighboring elements
L1    L1-norm
L2    L2-norm
a, b  Feature vectors
d     Distance

1 Introduction

The thyroid is a significant gland which resembles the shape of a butterfly. It is placed in the lower part of the neck and helps to control the body's metabolism [1]. This gland produces two active thyroid hormones, which are levothyroxine (abbreviated T4) and triiodothyronine (abbreviated T3) [2, 3]. These hormones play a vital role in the production of proteins, in the regulation of body temperature, and in overall energy production and regulation [4, 5]. The thyroid gland is prone to many distinct diseases, some of which are especially common, such as hypothyroidism and hyperthyroidism [3]. Deficient secretion of thyroid hormone causes hypothyroidism, and production of an excessive amount of thyroid hormone causes hyperthyroidism [2, 6]. The former case refers to the hypothyroid condition, which deals with deficiency or underproduction of thyroid hormones; its symptoms may involve a person experiencing weight gain, swelling in the front of the neck and a low pulse rate. Hyperthyroidism, in contrast, refers to an excessive amount of thyroid hormone released by the thyroid gland, in which a person may suffer from elevated blood pressure and pulse rate while having reduced body weight [6, 7]. A commonly used method to identify thyroid disorders is a blood test, which can measure the TSH, T3 and T4 levels [8, 9].

* Hafiz Abbad Ur Rehman
  [email protected]

1  Department of Mechanical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
2  Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan


The health care industry produces a large amount of complex data in the medical field that is very challenging to manage [5]. A fair number of machine learning approaches have recently been used to examine and identify different types of diseases. Bayesian network (BN), SVM, neural network, ANN, decision tree (DT), Naive Bayes, K-nearest neighbor (KNN) and many more are among the classification methods used by researchers [9–11]. This literature review will highlight the different machine learning approaches carried out by researchers in order to detect thyroid diseases.

The K-nearest neighbor (KNN) is an extremely popular and common machine learning algorithm, and currently many techniques are based on achieving an effective KNN to diagnose thyroid disease [12]. A variety of classification methods, such as the KNN, neural network and Bayesian belief network discussed by Tomar and Agarwal [13] and the fuzzy logic in MATLAB described by Jahantigh [14], often play a vital role in the identification of diseases in the health care sector in order to obtain an appropriate thyroid classification. S. Sun and R. Huang discussed the limitations of the KNN algorithm and proposed an adaptive KNN algorithm (AdaNN) for classification, showing that it is superior to the traditional KNN algorithm. This is because, for each test case, the AdaNN algorithm finds a suitable k value; it determines the optimal value of k and takes the few nearest neighbors to obtain the right class label [15]. Furthermore, Liu et al. presented an efficient computer-aided diagnosis (CAD) system consisting of a fuzzy K-nearest neighbor (FKNN) classifier for the diagnosis of thyroid disease. Two core parameters of FKNN, the neighborhood size k and the fuzzy parameter m, are adaptively specified by a particle swarm optimization (PSO) approach. The proposed PCA-PSO-FKNN system was reported to achieve 99.09% accuracy with tenfold cross-validation (CV), clearly distinguishing and diagnosing the different classes of thyroid disease [16]. Acharya et al. addressed a CEUS-based thyroid nodule classification CAD system, which uses contrast-enhanced ultrasound imaging to enhance the differential diagnosis of thyroid nodules, as it gives a better representation of the thyroid vascular pattern. Furthermore, discrete wavelet transform (DWT) and texture-based features were extracted from 3D contrast-enhanced ultrasound images of thyroid lesions. K-nearest neighbor (KNN), probabilistic neural network (PNN) and decision tree (DT) classifiers were then used to train and test these resultant features using tenfold cross-validation, achieving a classification accuracy of 98.90% [17]. Nazari et al. used another approach to detect thyroid disease, the support vector machine (SVM) classifier. That study compared and analyzed two thyroid datasets, one taken from UCI and another containing actual data from Imam Khomeini Hospital. For feature selection, sequential forward selection (SFS), sequential backward selection (SBS) and a genetic algorithm (GA) scheme were used. In this case, GA-SVM showed the best classification accuracy, 98.26%, among all proposed methods [18]. Moreover, Chen et al. developed a three-stage system to address thyroid disease; their (FS-PSO-SVM) CAD method with particle swarm optimization demonstrated better performance than existing methods and achieved an accuracy of 98.59% using tenfold cross-validation (CV) [19]. A generalized discriminant analysis (GDA) and wavelet support vector machine (WSVM) approach (GDA–WSVM), consisting of feature extraction, feature reduction and classification phases, was used by Dogantekin et al. for thyroid disease and obtained 91.86% classification accuracy [20]. In the area of fuzzy classifiers, an expert system for thyroid disease called ESTDD (expert system for thyroid disease diagnosis) was introduced by Keleş and Keleş; fuzzy rules were applied on the basis of the neuro-fuzzy classification (NEFCLASS) algorithm, and 95.33% accuracy was reported [21]. Several neural network methods, such as a multilayer perceptron (MLP) with back-propagation, a radial basis function network and an adaptive conic section function neural network, were proposed by Ozyilmaz et al. for thyroid diagnosis; the resulting accuracies were 88.30%, 81.69% and 85.92%, respectively [22]. The existing literature reveals that classification is an imperative technique for detecting, predicting and diagnosing different diseases like heart disease, breast cancer, lung cancer and thyroid disorders. Figure 1 presents the role of classification techniques in detecting various diseases. The literature review also revealed that thyroid disorders have received less focus compared to other diseases [6, 9].

Fig. 1  Health care statistics using classification [6]

Beside the clinical and essential investigation, proper interpretation of thyroid disease is also important for diagnostic purposes. Chen et al. address the importance of feature selection techniques for improving classification accuracy, which is beneficial for diagnosis [19]. In this paper, the effectiveness of different classification methods was investigated with the implementation of L1 and L2 feature selection techniques. Thus, it is hypothesized that the newly introduced features would provide accurate and precise measures for diagnosing thyroid disease. To carry out the research, a


unique thyroid dataset was used. The performance of the proposed research is examined by using the confusion matrix, and the obtained results were also compared with existing studies on thyroid diagnosis, reported in Table 3. The complete paper is organized as follows: Sects. 1 and 2 include the literature review and the dataset description, respectively. Section 3 details the methodology adopted. Sections 4 and 5 include the experimental and result analysis. Section 6 highlights related existing studies. Section 7 concludes the paper with the future scope.

2 Dataset Description

The thyroid disease dataset used in our experiment was taken from District Headquarters (DHQ) Teaching Hospital, Dera Ghazi Khan, Pakistan [23]. This hospital provides health care facilities not only to the inhabitants of the district but also to patients coming from neighboring provinces. The dataset used in this study was fully verified by two endocrinologists associated with a well renowned teaching hospital based in Karachi, Pakistan. There are three classes and 309 patient samples in the dataset. The patients were divided into three categories based on their diagnosis results. The categories are as follows:

Class (1): A total of 170 individuals with an optimal range of hormonal values
Class (2): A total of 66 patients suffering from hyperthyroidism
Class (3): A total of 73 patients suffering from hypothyroidism

The thyroid dataset comprises 309 entries with ten attribute columns and one class column. This dataset has three new features, i.e., body mass index (BMI), blood pressure measurement and pulse rate, which make it unique among the datasets available in the UCI and KEEL repositories [24]. Thirteen missing values, marked by '?', were reported in the T3 column shown in Table 1; mean values were used as a replacement for the missing entries. The three classes are "Hypo" for hypothyroidism, "Hyper" for hyperthyroidism and "Normal" for healthy individuals, contributing 24%, 21% and 55% of the total, respectively.

3 Methodology

The methodology reported in this manuscript consists of a few important steps, as outlined in Fig. 2. Data processing is the initial step of our methodology, which involves the deletion and cleaning of useless columns or entries. Processing missing values and cleaning unnecessary data can potentially improve the accuracy of the overall result. Furthermore, processing missing values is crucial because skipping them would negatively impact the results, as there is a risk of losing valuable information. Following this step, feature scaling based on the min–max method is implemented in order to obtain the maximum and minimum entry values. To assess the accuracy and performance of the classifiers, the first part of the experiment is implemented without feature selection techniques; L1- and L2-based feature selection techniques are implemented in the second and third phases of the experiment, respectively. Features such as blood pressure, pulse rate and BMI are included in this study because they directly correlate with thyroid disorders and played a vital role in achieving the best accuracy results. Various evaluation parameters like F1-score, miss-rate, Matthews correlation coefficient (MCC), error-rate, ROC curve with AUC, sensitivity, selectivity, fall-out and accuracy have been used as the evaluation and comparison criteria for the different classifiers and the best algorithms for detecting thyroid disease.

Table 1  Thyroid dataset description

#    Features                            Range description
1    Serial and hospital reference IDs   ID number
2    Pregnant                            Yes, no
3    Body mass index (BMI)               Underweight–optimal–overweight
4    Blood pressure                      High–healthy–low
5    Pulse rate                          50–110
6    T3                                  0.15–3.7 (missing values '?')
7    TSH                                 0.05–100
8    T4                                  0.015–30
9    Gender                              Male, female
10   Age                                 6–62
11   Class                               0 'Hypo', 1 'Hyper' and 2 'Normal'


Fig. 2  General block diagram of proposed study

3.1 Feature Selection

Feature selection plays a vital role in increasing the efficiency of a given classifier. Modern IoT devices send millions of pieces of information, creating datasets with hundreds of unwanted features. As a result, these features choke the model, exponentially increase the training time and increase the risk of overfitting. By using feature selection techniques, a reduced average time for predicting and training can be achieved without loss of total information. The important selected features are then used for training and testing in order to save cost and time. Such techniques play a large role in impacting the classification results [12].

3.1.1 L1 and L2 Norm-Based Model Feature Selection

For this report, the L1- and L2-based feature selection techniques have been used with the help of a Python library known as scikit-learn. Compared to other existing libraries such as mlpy, pybrain and shogun, scikit-learn is a very user-friendly library with a remarkable response time for various algorithms and techniques [25]. These L1- and L2-based feature selection approaches can be used with classifiers to achieve dimensionality reduction for given datasets. The L1 feature selection technique assigns zero values to some coefficients; therefore, certain features are removed from the estimation of the target because they do not contribute to the final prediction. In the L2 feature selection technique, however, the coefficient values are not assigned zero but rather approach zero. For this research, the linear support vector classifier (LSVC) was used, and a C parameter was selected to control the sparsity. It can be noted that the value of C is directly proportional to the number of features selected; the larger the value of C, the more features will be selected, and vice versa.

3.2 KNN

K-nearest neighbors (KNN) is a very common and widely used supervised machine learning algorithm. KNN performs well for predictive analysis and pattern recognition purposes. One of the main uses of KNN is to predict discrete values in classification problems [26, 27]. KNN uses two factors, namely the similarity measure or distance function and the selected k value, to act as a classifier, with performance depending on these factors. For any new data point, KNN first calculates the distances to all the data points and gathers the ones which are in close proximity to it. Then, the algorithm organizes those closest data points based on their distance from the arriving data point using different distance functions. The next step is to gather the specific number of data points which have the least distance among all and categorize them based on their distance. Figure 3 demonstrates the working principle of KNN. In the figure, the red


Fig. 3  Working principle of the K-nearest neighbor method: a initial data, b calculate distance and c find neighbors and vote

plus sign belongs to class 01, whereas the green sign belongs to class 02. The yellow box point "?" in the figure is related either to class 01 or to class 02, which will be predicted by the algorithm.

Let a and b be feature vectors a = (a_1, a_2, …, a_n) and b = (b_1, b_2, …, b_n). The considered distance functions are discussed as follows:

Minkowski: d(a, b) = ( Σ_{i=1}^{n} |a_i − b_i|^p )^{1/p}, where p = 1, 2, …, ∞.  (1)

Euclidean: d(a, b) = √( Σ_{i=1}^{n} (a_i − b_i)^2 ).  (2)

Manhattan: d(a, b) = Σ_{i=1}^{n} |a_i − b_i|.  (3)

Hamming (per component): d(a_i, b_i) = 0 if a_i = b_i, 1 otherwise; summed over i = 1, …, n.  (4)

Cosine: d(a, b) = Σ_{i=1}^{n} a_i b_i / ( √(Σ_{i=1}^{n} a_i^2) · √(Σ_{i=1}^{n} b_i^2) ).  (5)

Canberra: d(a, b) = Σ_{i=1}^{n} |a_i − b_i| / ( |a_i| + |b_i| ).  (6)

Correlation: d(a, b) = 1 − cov(a, b) / (σ_a σ_b).  (7)

3.3 SVM

The support vector machine (SVM) is a supervised machine learning algorithm which can be used for classification, regression and even outlier detection. The features of the dataset are plotted in n-dimensional space, and the two classes are differentiated by drawing a straight line called a hyperplane [28, 29]. All the dataset points that lie on one side of the line are considered one class, whereas all the points that fall on the other side are labeled as the second class. The strategy sounds simple enough; however, it is important to note that there is an infinite number of lines to choose from. SVM helps with selecting the line that does the best job of classifying the data. The SVM algorithm not only selects a line that separates the two classes but also stays as far away from the closest samples as possible. In fact, the "support vectors" in "support vector machine" refer to the position vectors drawn from the origin to the points which dictate the decision boundary [30]. Figure 4 shows the working principle of SVM.

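A minimal sketch of the L1-/L2-based selection with a linear SVC (Sect. 3.1.1), using scikit-learn's SelectFromModel. The synthetic data and the C value are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# L1 penalty drives some coefficients to exactly zero, so those features
# are dropped entirely; a smaller C gives a sparser (smaller) feature set.
l1_svc = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=10000).fit(X, y)
X_l1 = SelectFromModel(l1_svc, prefit=True).transform(X)

# L2 penalty only shrinks coefficients toward zero; SelectFromModel then
# keeps the features whose coefficient magnitudes exceed its threshold.
l2_svc = LinearSVC(C=0.1, penalty="l2", dual=False, max_iter=10000).fit(X, y)
X_l2 = SelectFromModel(l2_svc, prefit=True).transform(X)

print(X.shape[1], X_l1.shape[1], X_l2.shape[1])
```

Note the asymmetry this illustrates: with the L1 penalty, selection falls out of the sparsity itself, while with the L2 penalty selection relies on an importance threshold, mirroring the paper's observation that L2 shrinks coefficients toward, but not to, zero.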

Hyperplane equation:

w = Σ a_i S_i,  y = wx + b

Fig. 4  Working principle of SVM

3.4 Naive Bayes

Naive Bayes is a very simple algorithm for various classification problems. It is easy to build and can make very powerful and accurate predictions for large amounts of data. This classifier is a probabilistic learning method based on the Bayesian theorem [28, 31]. The working principle depends on three steps. In the first step, the dataset is converted into a frequency table. The second step involves creating a likelihood table after finding out the probabilities. In the last step, the posterior probability is calculated with the help of the Naïve Bayes equation for each class. The class with the highest posterior probability is the outcome of the prediction [30]. Bayes' theorem is as follows:

P(h|D) = [ P(D|h) · P(h) ] / P(D).  (8)

Prior probability = P(h), conditional probability = P(D|h), mixture density = P(D).

3.5 Decision Tree

A famous method for decision making is the decision tree. A unique 'divide-and-conquer' strategy is used, creating decision regions by dividing the instance space. Through a testing process, a root node is established. Then, the dataset is split by the value of the related test attribute. This is repeated until the process is halted by a predefined stopping criterion. The class is indicated by a leaf node, which is a node at the end of the tree. The decision rule is defined by the branch, or the path to the node; each new sample has its unique decision rule for classification purposes [10]. These classifications occur over three steps. Firstly, training data are used to train the model in the learning process. Secondly, a test is conducted to calculate the accuracy of the model, and depending on this value, the model is either accepted or rejected; in order to use the model for further classification of new data, it has to be accurate and have considerable acceptance. Thirdly and finally, the utilization of the model is decided by either using it for classification purposes or for predicting new data [30, 32]. The entropy and Gini equations are defined below in Eqs. 9 and 10, whereas the decision tree working principle is shown in Fig. 5:

Ent(D) = − Σ_{y∈Y} P(y|D) log P(y|D).  (9)

G_gini(D; D_1, …, D_k) = I(D) − Σ_{j=1}^{k} ( |D_j| / |D| ) I(D_j), where I(D) = 1 − Σ_{y∈Y} P(y|D)^2.  (10)

3.6 Logistic Regression

Logistic regression (LR) is a classification model in machine learning which is widely used in fields like medicine and social science [30, 33]. Logistic regression has been used in many types of analysis, not only to explore the risk factors of certain diseases but also to predict the probability of diseases. These predictions are discrete, referring to specific values or categories; the probability scores underlying the model's classifications can also be viewed. The logistic function is defined in Eq. 11 and its working principle is shown in Fig. 6:

Prob(event) = P(x⃗) = 1 / (1 + e^{−g(x⃗)}) = e^{g(x⃗)} / (1 + e^{g(x⃗)}),  (11)

where P(x⃗) is the probability of some output event, x⃗ = (x_1, x_2, …, x_k) is an input vector corresponding to the independent variables (predictors) and g(x⃗) is the logit model.

3.7 Performance Evaluation Metrics

Classification algorithms can be evaluated in several ways. For evaluating various learning algorithms, the metrics should be interpreted correctly. For evaluating a diagnostic test, some of the measures derived from


confusion matrix are reported in Sahu et al. [34] and Islam et al. [35]. There are four distinct terms used in a confusion matrix: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True positive means that the system predicts a positive outcome and the result is indeed positive. False positive means that the system predicts a positive outcome but the result is negative. True negative means that the system predicts a negative outcome and the result is indeed negative. False negative means that the system predicts a negative outcome, whereas the result is positive. Another parameter for assessing the performance of a classifier is the ROC curve with area under curve (AUC). The receiver operating characteristic (ROC) curve is a two-dimensional graph in which the TPR is the y-axis and the FPR is the x-axis. The ROC curve has been used to evaluate many systems such as diagnostic systems, medical decision-making systems and machine learning systems [36]. Figure 7 shows ROC curves with AUC values for three classes, separated by color and labeled G1, B2 and R3. Class 3 has a large AUC value, so its performance is better than that of classes 2 and 1; if a classifier's curve falls below the threshold line, it indicates poor performance of the class/model [12]. Some further measures derived from the confusion matrix [37] are discussed as follows.

Fig. 5  Working principle of decision tree

Fig. 6  Working principle of logistic regression

Fig. 7  Example of receiver operating characteristic (ROC) and area under curve (AUC) [12]

3.7.1 Accuracy and Error

The most important and commonly used factor to measure the performance of a classifier is accuracy. Accuracy (ACC) is calculated as the ratio of correctly predicted samples to the total samples in the dataset.

Accuracy (Acc) = (TP + TN) / (TP + TN + FP + FN) × 100%.  (12)

The error rate (ERR), in turn, represents the number of wrongly classified samples in both the negative and positive classes and is calculated as follows.

Error rate (ERR) = (1 − Acc) × 100% = (FP + FN) / (TP + TN + FP + FN) × 100%.  (13)
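As a small illustration of Eqs. (12) and (13), with made-up binary confusion-matrix counts (not values from the paper's experiments):

```python
# Hypothetical binary confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 50, 35, 3, 5
total = TP + TN + FP + FN

# Eq. (12): share of correctly classified samples.
accuracy = (TP + TN) / total * 100

# Eq. (13): the error rate is the complement of accuracy.
error_rate = (1 - (TP + TN) / total) * 100

print(f"Acc = {accuracy:.2f}%, ERR = {error_rate:.2f}%")
```

By construction the two quantities always sum to 100%, which is a quick sanity check when implementing the metrics by hand.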

3.7.2 Sensitivity and Specificity

Sensitivity (TPR), or recall, is defined as the ratio of truly predicted positive samples to the total number of positive samples, whereas specificity (TNR), or selectivity, is the ratio of truly predicted negative samples to the total number of negative samples. Equations (14) and (15) represent TPR and TNR, respectively.

Sensitivity or Recall (TPR) = TP / (TP + FN) × 100% = (1 − FNR) × 100%.  (14)

Specificity or Selectivity (TNR) = TN / (TN + FP) × 100% = (1 − FPR) × 100%.  (15)

3.7.3 False Positive and False Negative Rate

The false positive rate (FPR), or fall-out, in Eq. (16) represents the proportion of false positive predictions in the total number of negative samples, while the false negative rate (FNR), or miss-rate, in Eq. (17) is the proportion of positive samples that were incorrectly classified.

Fall-out (FPR) = FP / (FP + TN) × 100% = (1 − TNR) × 100%.  (16)

Miss-rate (FNR) = FN / (FN + TP) × 100% = (1 − TPR) × 100%.  (17)

3.7.4 Matthews Correlation Coefficient

Brian W. Matthews introduced the Matthews correlation coefficient (MCC) in 1975 [r]. This coefficient shows the relationship between observed and predicted classifications. MCC is calculated from the confusion matrix; a value of +1 represents perfect prediction, while a value of −1 indicates complete disagreement between predictions and true values. Equation (18) defines MCC as

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) ) × 100%.  (18)

3.7.5 F-Measure

The F-measure is also known as the F1-score. It describes the harmonic mean of precision and recall. A model is considered good if its value is close to one, implying low false positives or low false negatives, whereas a value of zero indicates poor performance. The equation of the F1-score is as follows.

F1-score = 2 × (Precision × Recall) / (Precision + Recall) × 100% = 2TP / (2TP + FP + FN) × 100%.  (19)

4 Experimental Data

The experimental analysis of this research study depended on the hardware and software performance of the system. The hardware used in this experiment was an Intel(R) Core i7-7700HQ CPU @ 2.80 GHz with a 512 GB SSD, a 2 TB HDD, 16 GB of RAM and an Nvidia 6 GB GTX 1060 GPU. The software included scikit-learn [25] and Anaconda [38]. Scikit-learn was a great choice for its accessibility, simplicity and performance in analyzing data. A splitting method for the training and testing dataset was used with five machine learning algorithms: KNN, decision tree, SVM, logistic regression and Naive Bayes. The two well-known L1 and L2 feature selection techniques were used with the five machine learning algorithms. The experiment on the thyroid dataset was repeated three times. The first iteration was without feature selection, abbreviated as WOFS. The second attempt used L1-based feature selection, denoted as WLSVC(L1). The third iteration employed L2-based feature selection, denoted as WLSVC(L2).

The training and testing time can be reduced if a classifier uses important features. The importance of feature selection depends on various parameters, one of which is the F-score, which determines the importance and usefulness of the various features. The Xgboost classifier has the advantage of solving both regression and classification problems; this technique prioritizes superior results while using fewer resources in terms of computing and time. The main objective of using Xgboost alongside the L1 and L2 feature selection techniques is to prevent the model from overfitting. This study also selected various features depending on their F-score values by using the Xgboost classifier [39], in which features were automatically named according to their index in the input array. Figure 8a indicates the result of the experiment done using WOFS, where the algorithms gave weight to five features; according to the results, f0 (TSH) has the highest importance and f4 (pulse rate) the lowest. Similarly, in Fig. 8b, implementing WLSVC(L1), four important features were selected based on their F-scores, with f0 (TSH) having the highest importance and f3 (BMI) the lowest. Lastly, with the implementation of WLSVC(L2) in Fig. 8c,


Fig. 8  a Feature importance for thyroid data by using WOFS. b Feature importance for thyroid data by using WLSVC(L1). c Feature importance for thyroid data by using WLSVC(L2)

only three important features were selected based on their F-scores, where f0 (TSH) and f1 (T4) received the highest and lowest importance, respectively. The training and prediction time for the thyroid dataset is shown in Table 2.

5 Result Analysis

Table 3 outlines the detailed performance of the different classifiers for thyroid disease. Various performance evaluation metrics like accuracy, recall, fall-out and error-rate were used for this comparative study on the basis of the output confusion matrix. Table 3 explicitly shows that accuracy improves when a feature selection technique is applied. In part (a), without feature selection, Naive Bayes provides the most precise results. KNN also performed well and achieved 91.39% accuracy using the Minkowski distance function. SVM and logistic regression both attained satisfying performance, with accuracies of 80.46% and 90.32%, respectively. Decision tree had the lowest accuracy among all the classifiers, 74.19%, using a five-level depth. Furthermore, results after implementing WLSVC(L1) feature selection were significantly improved: KNN accuracy jumped to 97.84% with the same distance function; SVM increased to 86.02% accuracy; the decision tree with five-level depth also showed a little improvement, reaching 75.34% accuracy; and logistic regression indicated 100% accuracy, the best improvement of all. Part (c) demonstrates the results after applying WLSVC(L2) feature selection, where the algorithms with the highest accuracies were logistic regression and KNN, and SVM and decision tree also demonstrated some improvement. Naive Bayes showed the maximum accuracy of 100% in both parts (b) and (c).

Another crucial parameter is the ROC curve with the area under curve (AUC) value, which is used to check a classifier's performance. The AUC ranges from '0' to '1', and a classifier performs better if its value is '1' or close to it. The ROC curves were constructed in Origin Pro 8.5 software, and the AUC was calculated with the help of the trapezoid rule. Naive Bayes achieved the highest AUC value of 1.00 in all three experiments, as indicated in all parts of Fig. 9. KNN and logistic regression came in second place, achieving AUC values of 0.98 and 0.97, respectively, without feature selection; furthermore, both of these classifiers showed an AUC of 1.00 in both WLSVC(L1) and WLSVC(L2). SVM's overall performance was satisfactory, indicating an AUC value of 0.94 in WOFS and AUC values of 0.95 and 0.98 in WLSVC(L1) and WLSVC(L2), respectively. Lastly, decision tree had comparatively the lowest performance in all three parts of the experiment.

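The WLSVC(L1) and WLSVC(L2) selection described above maps naturally onto scikit-learn's SelectFromModel wrapped around a penalized LinearSVC. The sketch below is illustrative only: the synthetic data, the C value and the default importance threshold are assumptions, not the settings used in this study.

```python
# Sketch of L1/L2-penalized LinearSVC feature selection (WLSVC-style),
# on synthetic data standing in for the thyroid feature matrix.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so SelectFromModel keeps only the surviving features.
lsvc_l1 = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000).fit(X, y)
X_l1 = SelectFromModel(lsvc_l1, prefit=True).transform(X)

# The L2 penalty shrinks but rarely zeroes coefficients; selection then
# relies on SelectFromModel's importance threshold (mean |coef| by default).
lsvc_l2 = LinearSVC(C=0.1, penalty="l2", dual=False, max_iter=5000).fit(X, y)
X_l2 = SelectFromModel(lsvc_l2, prefit=True).transform(X)

print("features kept by L1:", X_l1.shape[1], "- by L2:", X_l2.shape[1])
```

The reduced matrices X_l1 and X_l2 would then be fed to the downstream classifiers in place of the full feature set.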
Table 2  New thyroid dataset training and prediction times in seconds

Classifier           | Training time | Predicting time | Training time | Predicting time | Training time | Predicting time
                     | WOFS (s)      | WOFS (s)        | WLSVC(L1) (s) | WLSVC(L1) (s)   | WLSVC(L2) (s) | WLSVC(L2) (s)
---------------------|---------------|-----------------|---------------|-----------------|---------------|----------------
KNN                  | 0.695         | 0.42            | 0.53          | 0.361           | 0.51          | 0.369
Decision tree        | 0.763         | 0.422           | 0.629         | 0.360           | 0.681         | 0.372
Naïve Bayes          | 0.659         | 0.388           | 0.549         | 0.358           | 0.574         | 0.367
SVM                  | 0.601         | 0.398           | 0.506         | 0.359           | 0.511         | 0.361
Logistic regression  | 0.510         | 0.339           | 0.449         | 0.142           | 0.439         | 0.152

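Wall-clock timings like those in Table 2 can be collected by timing the fit and predict calls separately. A minimal sketch, using a placeholder classifier and synthetic data rather than the paper's dataset:

```python
# Measure training and prediction time for one classifier.
import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 309 samples, 3 classes, mirroring the dataset size.
X, y = make_classification(n_samples=309, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = GaussianNB()

t0 = time.perf_counter()
clf.fit(X_train, y_train)            # training time
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
clf.predict(X_test)                  # prediction time
predict_time = time.perf_counter() - t0

print(f"training {train_time:.3f}s, prediction {predict_time:.3f}s")
```

time.perf_counter is preferred over time.time here because it is monotonic and has the highest available resolution for short intervals.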

Table 3  Performance evaluation metrics for thyroid dataset
Columns: (a) without feature selection (WOFS); (b) with feature selection WLSVC(L1); (c) with feature selection WLSVC(L2). Confusion matrix rows (true classes) are separated by ";".

KNN                  (a)      (b)      (c)
  Accuracy (%)       91.39    97.84    96.77
  Recall (%)         90       96       95
  Fall-out (%)       5        2        2
  Specificity (%)    95       98       98
  F1-score (%)       92       97       96
  Error rate (%)     8.61     2.16     3.23
  MCC (%)            78.6     92.4     90
  Miss-rate (%)      10       4        5
  Confusion matrices: (a) [18 8 0; 0 43 0; 0 0 24]  (b) [14 2 0; 0 54 0; 0 0 23]  (c) [17 3 0; 0 51 0; 0 0 22]

Decision tree        (a)      (b)      (c)
  Accuracy (%)       74.19    75.34    76.92
  Recall (%)         67       67       67
  Fall-out (%)       49       20       17
  Specificity (%)    51       80       83
  F1-score (%)       60       62       61
  Error rate (%)     25.81    24.66    23.08
  MCC (%)            100      100      100
  Miss-rate (%)      33       33       33
  Confusion matrices: (a) [26 0 0; 0 43 0; 0 24 0]  (b) [16 4 0; 0 39 0; 0 14 0]  (c) [20 0 0; 0 50 0; 0 21 0]

Naïve Bayes          (a)      (b)      (c)
  Accuracy (%)       100      100      100
  Recall (%)         100      100      100
  Fall-out (%)       0        0        0
  Specificity (%)    100      100      100
  F1-score (%)       100      100      100
  Error rate (%)     0        0        0
  MCC (%)            100      100      100
  Miss-rate (%)      0        0        0
  Confusion matrices: (a) [26 0 0; 0 43 0; 0 0 24]  (b) [16 0 0; 0 54 0; 0 0 23]  (c) [20 0 0; 0 51 0; 0 0 22]

SVM                  (a)      (b)      (c)
  Accuracy (%)       80.46    86.02    86.02
  Recall (%)         70       76       79
  Fall-out (%)       15       20       10
  Specificity (%)    85       80       90
  F1-score (%)       78       84       85
  Error rate (%)     19.54    13.98    13.98
  MCC (%)            52.6     62.6     66.3
  Miss-rate (%)      30       24       21
  Confusion matrices: (a) [9 17 0; 0 43 0; 0 6 18]  (b) [7 9 0; 0 54 0; 0 4 19]  (c) [10 10 0; 0 51 0; 0 3 19]

Logistic regression  (a)      (b)      (c)
  Accuracy (%)       90.32    100      98.92
  Recall (%)         88       100      98
  Fall-out (%)       6        0        1
  Specificity (%)    94       100      99
  F1-score (%)       91       100      98
  Error rate (%)     9.68     0        1.08
  MCC (%)            80       100      97
  Miss-rate (%)      12       0        2
  Confusion matrices: (a) [17 9 0; 0 43 0; 0 0 24]  (b) [16 0 0; 0 54 0; 0 0 23]  (c) [19 1 0; 0 51 0; 0 0 22]

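The per-classifier metrics in Table 3 follow from the reported confusion matrices. As a worked check, the sketch below takes the KNN part (a) matrix and recovers the reported values via one-vs-rest counts with macro averaging; the averaging scheme is an assumption on our part, though it reproduces the published numbers for this case.

```python
# Recompute Table 3 metrics from a 3x3 confusion matrix
# (rows = true class, columns = predicted class).
import numpy as np

cm = np.array([[18, 8, 0],    # KNN, part (a), from Table 3
               [0, 43, 0],
               [0,  0, 24]])

tp = np.diag(cm).astype(float)
fn = cm.sum(axis=1) - tp           # instances of each class that were missed
fp = cm.sum(axis=0) - tp           # instances wrongly assigned to each class
tn = cm.sum() - tp - fn - fp

accuracy = tp.sum() / cm.sum()
recall = (tp / (tp + fn)).mean()                          # a.k.a. sensitivity
specificity = (tn / (tn + fp)).mean()
fallout = (fp / (fp + tn)).mean()                         # 1 - specificity
precision = (tp / np.where(tp + fp > 0, tp + fp, 1)).mean()
f1 = 2 * precision * recall / (precision + recall)        # harmonic mean

print(f"accuracy={accuracy:.4f} recall={recall:.2f} "
      f"specificity={specificity:.2f} fallout={fallout:.2f} f1={f1:.2f}")
# -> accuracy=0.9140 recall=0.90 specificity=0.95 fallout=0.05 f1=0.92
```

These macro-averaged values match the KNN column (a) entries (91.39% accuracy, 90% recall, 95% specificity, 5% fall-out, 92% F1-score).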

Fig. 9  a ROC curves with AUC before using feature selection. b ROC curves with AUC after using WLSVC(L1). c ROC curves with AUC after implementing WLSVC(L2)

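The trapezoid-rule AUC calculation used for the curves in Fig. 9 reduces to a few lines of NumPy; the (FPR, TPR) points below are illustrative, not the paper's actual ROC data.

```python
# AUC of an ROC curve via the trapezoid rule.
import numpy as np

fpr = np.array([0.0, 0.05, 0.20, 1.0])   # false positive rate (x-axis)
tpr = np.array([0.0, 0.80, 0.95, 1.0])   # true positive rate (y-axis)

# Trapezoid rule: sum over segments of 0.5 * (x[i+1]-x[i]) * (y[i+1]+y[i]).
auc = 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]))
print(f"AUC = {auc:.3f}")
# -> AUC = 0.931
```

A perfect classifier's curve passes through (0, 1), giving AUC = 1.0, which is the value Table 3's Naive Bayes results correspond to.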
From Table 2, it is clearly shown that prediction time improves: before applying feature selection, prediction time is comparatively higher for all classifiers. The algorithm with the best accuracy, minimum error rate and lowest prediction time is Naïve Bayes in all three parts of the experiment, whereas logistic regression and KNN both performed well in terms of minimum error rate and low prediction time in part (b) of the experiment. According to the original data, 170 individuals are healthy, 66 suffer from hyperthyroidism and 73 from hypothyroidism. After applying the different classifiers, the results indicated that detection by Naïve Bayes (in all three parts of the experiment) and logistic regression (in part (b) of the experiment) is excellent, with 100% accuracy. Moreover, KNN detection is close to the original data: KNN determined that 146 are healthy, whereas 66 and 73 have hyperthyroidism and hypothyroidism, respectively.

6 Related Existing Studies

The approach utilized in this study is compared with other related existing studies in Table 4. Our dataset is distinguished from those of the existing studies by the three new features described in Sect. 2. The proposed study's results were achieved by using different supervised classifiers; higher accuracy and low training and prediction times were the significant goals of this research. Other existing models use hybrid approaches with a combination of different algorithms and complex models. Such methodologies are not only costly in achieving accurate results, but also take increased time for training and validation.

7 Conclusion

Disease detection and its early diagnosis are very important for human life. By using machine learning algorithms, precise and accurate identification and detection have become


Table 4  Related existing models

References                   Methodology          Accuracy (%)  Dataset
Deepika et al. [30]          SVM                  95.62         UCI repository thyroid disease dataset
                             DT                   95.00
                             ANN                  98.60
Pal et al. [6]               Naïve Bayes          94.70         KEEL repository thyroid disease dataset
                             SVM                  92.70
                             KNN                  96.90
Chandel et al. [9]           KNN                  93.44         KEEL repository thyroid disease dataset
                             Naïve Bayes          22.56
Turanoglu-Bekar et al. [10]  NBTREE               75.00         Local hospital
                             LADTREE              66.25
                             REPTREE              62.50
                             BFTREE               65.00
Chalekar et al. [27]         KNN                  97.00         UCI repository thyroid disease dataset
Tyagi et al. [40]            ANN                  97.50         UCI repository thyroid disease dataset
                             KNN                  98.00
                             DT                   75.00
This study                   Naive Bayes          100           District Headquarters (DHQ) Teaching
                             Logistic regression  100           Hospital, Dera Ghazi Khan, Pakistan
                             K-NN                 97.84
                             SVM                  86.02
                             Decision tree        75.34

more achievable. Thyroid disease is not easy to diagnose because its symptoms mix with those of other conditions. The three newly introduced features in the thyroid dataset of this research show a positive impact on classifier performance, and the results show better accuracies than the existing studies. After comparison and analysis of KNN, Naïve Bayes, SVM, decision tree and logistic regression, it was observed that 100% accuracy is achieved by Naïve Bayes in all three parts of the experiment, while logistic regression attained the second-best accuracies of 100% and 98.92% with L1- and L2-based feature selection, respectively. KNN also delivered an excellent accuracy of 97.84% with an error rate of 2.16%. Upon analyzing the results, the advantages and robustness of the new dataset are clearly seen and would allow doctors to get more precise and accurate results in less time. In the future, classifiers with different distance functions for KNN and data augmentation techniques can be used for more precise results.

Acknowledgements  I would like to extend my sincere gratitude to Dr. Abid Hussain, Dr. Zarnab Lashari and Dr. Aimen Javed for aiding in gathering the thyroid data and contributing continuous support. I am thankful to them for their invaluable guidance.

References

1. Miller, K.D., et al.: Cancer treatment and survivorship statistics, 2016. CA Cancer J. Clin. 66(4), 271–289 (2016)
2. Shroff, S.; Pise, S.; Chalekar, P.; Panicker, S.S.: Thyroid disease diagnosis: a survey. In: IEEE 9th International Conference on Intelligent Systems and Control, 2015 (ISCO 2015), pp. 1–6. IEEE (2015)
3. Thyroid Cancer: https://seer.cancer.gov/statfacts/html/thyro.html. Accessed 01 Jan 2020
4. Thyroid Problems: https://medlineplus.gov/thyroiddiseases.html. Accessed 01 Jan 2020
5. What Is Thyroid Cancer: https://www.cancer.org/cancer/thyroid-cancer/about/what-is-thyroid-cancer. Accessed 01 Jan 2020
6. Pal, R.; Anand, T.; Dubey, S.K.: Evaluation and performance analysis of classification techniques for thyroid detection. Int. J. Bus. Inf. Syst. 28(2), 163–177 (2018)
7. Thyroid Patient Information: https://www.thyroid.org/thyroid-information/. Accessed 01 Jan 2020
8. Acharya, U.R.; Choriappa, P.; Fujita, H., et al.: Thyroid lesion classification in 242 patient population using Gabor transform features from high resolution ultrasound images. Knowl. Based Syst. 107, 235–245 (2016)
9. Chandel, K.; Kunwar, V.; Sabitha, S.; Choudhury, T.; Mukherjee, S.: A comparative study on thyroid disease detection using K-nearest neighbor and Naive Bayes classification techniques. CSI Trans. 4(2–4), 313–319 (2016)
10. Bekar, E.T.; Ulutagay, G.; Kantarcı, S.: Classification of thyroid disease by using data mining models: a comparison of decision tree algorithms. Oxf. J. Intell. Decis. Data Sci. 2016(2), 13–28 (2016)
11. Prasad, V.; Rao, T.S.; Babu, M.S.P.: Thyroid disease diagnosis via hybrid architecture composing rough data sets theory and machine learning algorithms. Soft Comput. 20(3), 1179–1189 (2016)
12. Mushtaq, Z.; Yaqub, A.; Sani, S.; Khalid, A.: Effective K-nearest neighbor classifications for Wisconsin breast cancer data sets. J. Chin. Inst. Eng. 43(1), 1–13 (2019)
13. Tomar, D.; Agarwal, S.: A survey on data mining approaches for healthcare. Int. J. Bio-Sci. Bio-Technol. 5(5), 241–266 (2013)
14. Jahantigh, F.F.: Kidney diseases diagnosis by using fuzzy logic. In: 2015 International Conference on Industrial Engineering and Operations Management (IEOM 2015), pp. 2369–2375. IEEE (2015)
15. Durairaj, M.; Ranjani, V.A.: Data mining applications in healthcare sector: a study. Int. J. Sci. Technol. Res. 2(10), 29–35 (2013)


16. Liu, D.Y.; Chen, H.-L.; Yang, B.; Lv, X.-E.; Li, L.-N.; Liu, J.: Design of an enhanced fuzzy k-nearest neighbor classifier based computer aided diagnostic system for thyroid disease. J. Med. Syst. 36(5), 3243–3254 (2012)
17. Acharya, U.R.; Vinitha Sree, S.; Molinari, F.; Garberoglio, R.; Witkowska, A.; Suri, J.S.: Automated benign and malignant thyroid lesion characterization and classification in 3D contrast-enhanced ultrasound. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2012 (EMBS 2012), pp. 452–455. IEEE (2012)
18. Kousarrizi, M.R.N.; Seiti, F.; Teshnehlab, M.: An experimental comparative study on thyroid disease diagnosis based on feature subset selection and classification. Int. J. Electr. Comput. Sci. 12(1), 13–19 (2012)
19. Chen, H.L.; Yang, B.; Wang, G.; Liu, J.: A three-stage expert system based on support vector machines for thyroid disease diagnosis. J. Med. Syst. 36(3), 1953–1963 (2012)
20. Dogantekin, E.; Dogantekin, A.; Avci, D.: An expert system based on generalized discriminant analysis and wavelet support vector machine for diagnosis of thyroid diseases. Expert Syst. Appl. 38(1), 146–150 (2011)
21. Keleş, A.; Keleş, A.: ESTDD: expert system for thyroid diseases diagnosis. Expert Syst. Appl. 34(1), 242–246 (2008)
22. Ozyilmaz, L.; Yildirim, T.: Diagnosis of thyroid disease using artificial neural network methods. In: 9th International Conference on Neural Information Processing, 2002 (ICONIP 2002), pp. 2033–2036. IEEE (2002)
23. Teaching Hospital - Dera Ghazi Khan: http://thdgkhan.org/. Accessed 15 Mar 2020
24. Alcalá-Fdez, J.; Sánchez, J.L.; García, S.; del Jesus, M.J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 17, 255–287 (2011)
25. Pedregosa, F.; Weiss, R.; Brucher, M.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
26. Li, C.; Zhang, S.; Zhang, H.; Pang, L.; Lam, K.; Hui, C.; Zhang, S.: Using the K-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Comput. Math. Methods Med. (2012)
27. Chalekar, P.; Shroff, S.; Pise, S.; Panicker, S.S.: Use of K-nearest neighbor in thyroid disease classification. Int. J. Curr. Eng. Sci. Res. 1(2), 2394–2697 (2014)
28. Mushtaq, Z.; Yaqub, A.; Hassan, A.; Su, S.F.: Performance analysis of supervised classifiers using PCA based techniques on breast cancer. In: International Conference on Engineering and Emerging Technologies, 2019 (ICEET 2019), pp. 1–6. IEEE (2019)
29. Aboudi, N.; Guetari, R.; Khlifa, N.: Multi-objectives optimisation of features selection for the classification of thyroid nodules in ultrasound images. IET Image Process. 14(9), 1901–1908 (2020)
30. Deepika, M.; Kalaiselvi, K.: A empirical study on disease diagnosis using data mining techniques. In: International Conference on Inventive Communication and Computational Technologies, 2018 (ICICCT 2018), pp. 615–620. IEEE (2019)
31. Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton (2012)
32. Lavanya, D.; Rani, K.U.: Performance evaluation of decision tree classifiers on medical datasets. Int. J. Comput. Appl. 26(4), 1–4 (2011)
33. Yang, Y.; Chen, G.; Reniers, G.: Vulnerability assessment of atmospheric storage tanks to floods based on logistic regression. Reliab. Eng. Syst. Saf. 196, 106721 (2019)
34. Sahu, B.; Mohanty, S.; Rout, S.: A hybrid approach for breast cancer classification and diagnosis. ICST Trans. Scalable Inf. Syst. 6(20), 2–8 (2019)
35. Islam, M.M.; Iqbal, H.; Haque, M.R.; Hasan, M.K.: Prediction of breast cancer using support vector machine and K-nearest neighbors. In: 5th IEEE Region 10 Humanitarian Technology Conference, 2017, pp. 226–229. IEEE (2017)
36. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006). https://doi.org/10.1016/j.patrec.2005.10.010
37. Tharwat, A.: Classification assessment methods. Appl. Comput. Inf. (2018). https://doi.org/10.1016/j.aci.2018.08.003
38. Anaconda: https://www.anaconda.com/. Accessed 05 Jan 2020
39. Feature Importance and Feature Selection with XGBoost in Python: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/. Accessed 05 Jan 2020
40. Tyagi, A.; Mehra, R.; Saxena, A.: Interactive thyroid disease prediction system using machine learning technique. In: PDGC 2018 - 5th International Conference on Parallel, Distributed and Grid Computing, pp. 689–693 (2018). https://doi.org/10.1109/PDGC.2018.8745910
