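International Journal of Intelligent Engineering and Systems, Vol.13, No.5, 2020 DOI: 10.22266/ijies2020.1031.02
Received: January 17, 2020. Revised: March 13, 2020.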

The Prediction of Diseases Using Rough Set Theory with Recurrent Neural
Network in Big Data Analytics

Vamsidhar Talasila1*   Kotakonda Madhubabu1   Meghana Chakravarthy Mahadasyam1
Naga Jyothi Atchala1   Lakshmi Sowjanya Kande1

1 Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
* Corresponding author’s Email: [email protected]

Abstract: In modern life, early healthcare prediction plays an important role in preventing the loss of life caused
by delays in treatment. Researchers have recently focused on big data analysis, which is used to identify future
health status and provides an efficient way to overcome the issues in early prediction. Much research is ongoing on
predictive analytics using machine learning techniques to support better decision making. Big data analysis
provides great opportunities to predict future health status from health parameters and to provide the best
outcomes. However, data classification is one of the major challenges because of noisy or missing data in the
dataset. Feature selection techniques play an important role in the classification process by removing irrelevant
features from the extracted data. In this research work, the Rough Set Theory (RST) technique is used to select the
most relevant features, which supports efficient classification of medical data and disease detection. The selected
features are given as input to a Recurrent Neural Network (RNN) for disease prediction. The proposed method is
called RST-RNN, and the experiments are carried out on datasets from the UCI machine learning repository in terms
of accuracy, F-measure, sensitivity and specificity. The results showed that the RST-RNN method achieved an
accuracy of 98.57% on the heart disease dataset, whereas the existing Support Vector Machine (SVM) achieved
90.57% and Naive Bayes (NB) achieved 97.36%.

Keywords: Big data analysis, Decision making, Feature selection technique, Naive Bayes, Rough set theory,
Recurrent neural network.

1. Introduction

Over the past two decades, digital data has become increasingly important in many domains such as healthcare,
science, technology and society. A large amount of data is captured and generated from multiple sources such as
streaming machines, high-throughput instruments, sensor networks and mobile applications, and especially in
healthcare; this huge amount of collected data is referred to as big data [1-2]. Storing, visualizing and
extracting knowledge from such large and varied data has become a challenge because existing technologies and
tools are inadequate. One of the most important technological challenges of big data analytics is to identify
efficient ways to obtain valuable information for different types of users [3]. Currently, various forms of
healthcare data are being collected in both clinical and non-clinical environments, where the digital copy of a
patient’s medical history is the most important data in healthcare analytics. Designing a distributed data system
to deal with big data therefore faces three main challenges. First, it is difficult to collect data from
distributed locations because the data are heterogeneous and huge in volume [4]. Second, storage is the main
problem for heterogeneous and massive datasets. The last challenge is related to big data analytics, more
precisely to mining massive datasets in real time or near real time, which includes modeling, visualization,
prediction, and optimization [5]. These challenges require a new processing paradigm, as the current data
management systems are not efficient in dealing
with the heterogeneous nature of the data or with real-time processing [6].

Improper diagnosis may cause death or disability to the patient. A disease prediction model can support medical
professionals and practitioners in predicting a particular disease. The huge amount of data that can be collected
using digital devices (by patients themselves in hospital) can be used with big data techniques to diagnose
patients and predict diseases [7-8]. However, applying machine learning to this big data stream is challenging, as
traditional machine learning systems are not suited to handling such a massive volume or varied velocity.
Analytical data processing is another major problem, because richer analytical processing requires efficient data
integration between systems. Most existing works involve machine learning, but in real-time applications these
machine learning techniques are insufficient to handle big data [9-10].

Classification techniques are widely used in healthcare, since they are capable of processing large sets of data.
The techniques commonly used in healthcare are NB, SVM, Nearest Neighbor (NN), Decision Tree (DT), fuzzy logic,
Fuzzy-based Neural Network (FNN), Artificial Neural Network (ANN), and Genetic Algorithms (GA) [11]. Machine
learning with classification can be efficiently applied in medical applications for complex measurements. Modern
classification techniques provide more intelligent and effective approaches to disease prediction [12]. In this
research, the important features are selected using the RST method, which increases the performance of the
classification technique. The important features from the medical dataset are given to the RNN method for
classifying the data. Using the RST method also reduces overfitting on the training data, and the proposed
RST-RNN method is validated on various UCI datasets for predicting diseases.

This research paper is organized as follows. Section 2 reviews existing techniques with their advantages and
limitations. Section 3 explains the proposed method for disease prediction. Section 4 presents the experiments
conducted to validate the effectiveness of the proposed RST-RNN method against existing techniques. Finally,
Section 5 gives the conclusion of the research work with future development.

2. Literature review

This section discusses existing techniques that use machine learning to predict the health status of patients,
together with the advantages and limitations of these methods [13-19].

R. Venkatesh, C. Balasubramanian, and M. Kaliappan [13] designed a Big Data Predictive Analytics model for heart
disease prediction using the NB technique, referred to as BPA-NB. The system used probabilistic classification
based on Bayes’ theorem to analyse the data. To filter out unnecessary data, BPA-NB used a clustering technique,
which made the prediction more effective. The computational complexity was reduced by using the MapReduce
algorithm with the Apache Spark framework. Experiments were carried out on a UCI dataset to validate the
effectiveness of BPA-NB against existing techniques in terms of processing time, CPU utilization and accuracy.
However, the BPA-NB method was used only for predicting heart disease and provided poor performance on the
prediction of other diseases.

M. Nilashi, O. Ibrahim, H. Ahmadi, L. Shahmoradi, and M. Farahmand [14] developed Incremental Support Vector
Regression (ISVR) for predicting the Unified Parkinson's Disease Rating Scale (UPDRS). ISVR was used to predict
the Motor-UPDRS and Total-UPDRS. In this method, a self-organizing map (SOM) was used to cluster the data and
non-linear iterative partial least squares (NIPALS) was used for dimensionality reduction. To evaluate the ISVR
method, several experimental analyses were conducted on a real-world PD dataset taken from UCI. The results
indicated that the method combining SOM, NIPALS, and ISVR was effective in predicting the Total-UPDRS and
Motor-UPDRS. However, the ISVR method reduced the computation time only for small data, and it used only some
important attributes for PD diagnosis, leaving other attributes unconsidered.

A. Di Noia, A. Martino, P. Montanari, and A. Rizzi [15] predicted occupational disease risks using pattern
recognition and computational intelligence techniques, namely SVM and KNN. A set of meaningful labelled clusters,
determined with the k-means algorithm, formed the final model. Genetic algorithms were used to find the optimal
hyperparameters and the optimal weights of an ad-hoc dissimilarity measure for the classification systems, which
improved the performance of those systems. Experiments were carried out to estimate the performance of the three
techniques against an existing technique on standard collected datasets by means of fitness functions for
different classes. This
algorithm performed well only on occupational disease forecasting for the collected dataset.

L. R. Nair, D. Sujala Shetty, and D. Siddhanth Shetty [16] aimed to develop a real-time remote health status
prediction system on the big data processing engine Apache Spark, tested and deployed on a cloud, in which a DT
model was applied to streaming data. The relevant health data of a user were received through tweet streams, and
the DT algorithm was then used to send a direct message to the user about their health status. A user thus
received information about his or her health status instantly and privately with a single tweet, which could be
used to decide whether expert health care was needed. With slight modifications of the DT algorithm, a variety of
diseases could also be predicted. However, recovery of the data was not possible because tweets were deleted
permanently after a certain time period.

T. Chen, J. Xu, H. Ying, X. Chen, R. Feng, X. Fang, and J. Wu [17] predicted Extubation Failure (EF) by analyzing
3636 adult patient records in the MIMIC-III clinical database using the Light Gradient Boosting Machine
(LightGBM). Based on the LightGBM results, a feature importance analysis was carried out by interpreting the
features using SHapley Additive exPlanations (SHAP). Experiments were carried out on the clinical database against
existing techniques, namely SVM, ANN and Logistic Regression (LR). The results stated that LightGBM achieved more
accurate prediction than the other techniques. However, the recognition rate for EF using LightGBM was still not
very high.

F. Van Wyk, A. Khojandi, and R. Kamaleswaran [18] presented a hierarchical analysis of machine learning algorithms
for improving the prediction of at-risk patients. In addition, a multi-layer machine learning approach was
developed to analyze high-frequency, continuous data. The experimental results illustrated the capability of this
approach for early identification of patients at risk of sepsis, using physiological data collected from bedside
monitors. According to this analysis, the multi-layer machine learning approach could help to reduce mortality
and morbidity in the ICU. However, even though the method used the Sequential Organ Failure Assessment (SOFA)
score, the onset of organ failure was not identified by this algorithm.

N. Kausar, A. Abdullah, B. B. Samir, S. Palaniappan, B. S. AlGhamdi, and N. Dey [19] implemented a hybrid
approach, namely SVM with the K-means clustering technique, for medical data classification. The attribute
dimension was reduced by introducing the Principal Component Analysis (PCA) algorithm. The related parameters and
measures were then adjusted to differentiate normal and abnormal patients. The experiments were carried out on UCI
datasets in terms of accuracy, precision, recall and F-measure. However, when unseen patterns of similar behaviour
were introduced within the selected clusters, the detection rate dropped and the classification time was high.

Figure. 1 The overview of the developed RST and RNN

The existing techniques predict disease either only on a UCI dataset or only on a collected dataset, and do not
address other diseases. In this research study, the RST-RNN method is evaluated on five UCI datasets and selects
the most relevant features for better classification.

3. Proposed methodology

Predicting disease from healthcare data is a critical task because of the varied relationships between patient
attributes and the disease. Disease prediction provides many advantages, such as early-stage diagnosis and a
reduced mortality rate. Healthcare data are available in large amounts and need to be analyzed effectively. In
this research, RST and RNN are applied for disease prediction on medical data. RST finds the dependency between
attributes by analyzing their characteristics, and is also used to remove superfluous attributes. The decision
rules generated by RST are provided as input to the RNN, which analyzes the attributes of the data together with
the decision rules for disease prediction. This section describes the working of RST and RNN in detail. The block
diagram of RST and RNN in disease prediction is shown in Fig. 1.

3.1 Rough set theory

Let $I = (U, A)$ be an information system, where $U$ is a non-empty finite set of objects called the universe of
discourse and $A$ is a non-empty set of attributes. With every attribute $a \in A$, a set of its values $V_a$ is
associated. For a subset of attributes $P \subseteq A$ there is an associated equivalence relation
$\mathrm{IND}(P)$, which is called an indiscernibility relation. The relation $\mathrm{IND}(P)$ is defined in
Eq. (1):

$$\mathrm{IND}(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\} \tag{1}$$

If $(x, y) \in \mathrm{IND}(P)$, then $x$ and $y$ are indiscernible by attributes from $P$. The equivalence
classes of the $P$-indiscernibility relation are denoted $[x]_P$. The indiscernibility relation is the
mathematical basis of RST. The lower and upper approximations are the two basic operations in RST. A subset
$X \subseteq U$ can be approximated using only the information contained in $P$. The $P$-lower approximation of
$X$, denoted $\underline{P}X$, is the set of all elements of $U$ that can be certainly classified as elements of
$X$ based on the attribute set $P$. The $P$-upper approximation of $X$, denoted $\overline{P}X$, is the set of
elements of $U$ that can possibly be classified as elements of $X$ based on the attribute set $P$. These two
definitions are expressed in Eq. (2) and (3):

$$\underline{P}X = \{x \in U \mid [x]_P \subseteq X\} \tag{2}$$

$$\overline{P}X = \{x \in U \mid [x]_P \cap X \neq \emptyset\} \tag{3}$$

where $\underline{P}X$ is the $P$-lower approximation and $\overline{P}X$ is the $P$-upper approximation. RST
selects features according to the dependency between attributes and removes the superfluous features. The features
selected by RST are provided as input to the RNN for classification.

3.2 Recurrent neural network

RNNs are a good solution to the problem of modeling dynamic changes in a time series. They are widely used in
natural language processing, speech recognition, and handwriting recognition tasks. The RNN takes as input a
time-ordered sequence of vectors $X_{t-1}, X_t, X_{t+1}, \ldots$. As the sequence advances, the hidden layer $S_t$
is affected both by the input $X_t$ and by the previous hidden layer $S_{t-1}$. Eq. (4) and (5) formally describe
the RNN process:

$$S_t = f(U \cdot X_t + W \cdot S_{t-1}) \tag{4}$$

$$O_t = g(V \cdot S_t) \tag{5}$$

where $S_t$ represents the memory of the sample at time $t$, i.e. the value of the hidden layer, as calculated by
Eq. (4). $W$ is the weight applied to the hidden state of the previous moment, $S_{t-1}$, when it is fed in as
input at the current moment, and $U$ is the weight of the input sample. Eq. (5) calculates the output value $O_t$,
where $V$ is the weight of the output. Both $f$ and $g$ are activation functions: $f$ can be an activation function
such as tanh, ReLU, or the sigmoid, and $g$ is usually a softmax activation function.

As the RNN structure deepens, the gradient calculated by back propagation through the hidden layers may vanish or
explode. Although gradient clipping can cope with exploding gradients, it fails to solve vanishing gradients. So,
in the text sequence of a language model, an RNN cannot easily capture the dependence between text elements across
the
large distances in the sequence. The use of a long short-term memory (LSTM) can solve the aforementioned problems.
The core of an LSTM is the state of the cell (i.e. the cell state). It also includes three kinds of gate
structure: the input, output and forget gates. The relevant formulas, Eq. (6)-(10), are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{6}$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{7}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{8}$$

$$C_t = f_t \times C_{t-1} + i_t \times \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{9}$$

$$h_t = o_t \times \tanh(C_t) \tag{10}$$

Eq. (6), Eq. (7) and Eq. (8) define three multiplicative gates: the forget gate $f_t$, the input gate $i_t$ and
the output gate $o_t$. All three take the same input $[h_{t-1}, x_t]$, but with different parameters. $\sigma$
represents the sigmoid activation function. $C_t$ in Eq. (9) is the cell state, which is obtained from the previous
cell state $C_{t-1}$ and from $[h_{t-1}, x_t]$, i.e. the previous hidden state and the current input. If the
forget gate $f_t$ is 0, the state at the previous moment is completely cleared, so that only the input of the
current time step is considered. The input gate $i_t$ determines whether to accept input at this time step.
Finally, the output gate $o_t$ determines whether to output the cell state. In the proposed method, RST selects
the important features from the training data, which reduces overfitting and improves the performance of the RNN.
The experiments and their results are discussed in the next section.

4. Results and discussion

In this section, the validation of the proposed RST-RNN method and its experimental results are discussed in
comparison with various existing techniques. Five biomedical datasets, namely the Pima Indians diabetes, Wisconsin
breast cancer, heart disease, thyroid and Parkinson datasets, are collected from the UCI machine learning
repository [20] to evaluate the performance of the proposed RST-RNN method. Table 1 shows the details of each
dataset with its ID, number of features and classes.

Missing values are present only in the HD and BC datasets. Missing categorical attributes are replaced by the mode
of the attribute, and missing continuous values are replaced by the mean of the attribute. To avoid numerical
difficulties during calculation, the data are scaled into the range [-1, 1] before constructing the proposed
RST-RNN model, so that feature values in smaller numerical ranges are not dominated by those in greater numerical
ranges. The following subsections describe the experimental setup and parameter settings, and the experimental
results of the RST-RNN method against various existing techniques.

4.1 Experimental setup and parameter settings

The experiments are run on a computer with a 2.2 GHz Intel Core i5 processor and 8 GB of RAM, and the RST-RNN
method is implemented in Python 3.7.3. The performance of the RST-RNN method is validated by conducting several
experiments on the UCI datasets using the following metrics: Area Under Curve (AUC), accuracy, F-measure,
specificity and sensitivity (recall).

Sensitivity, i.e. the true positive rate, is the proportion of positive samples that are correctly classified as
positive. In contrast, specificity, i.e. the true negative rate, is the proportion of negative samples that are
correctly classified as negative. Accuracy is calculated using Eq. (11), and Eq. (12) defines the F-measure, a
single metric that combines precision and recall. Precision is the fraction of samples labeled positive that are
actually positive, i.e. $TP/(TP+FP)$ in the notation of Eq. (11)-(14). Specificity is given in Eq. (13). Recall,
by contrast, is the number of correctly labeled positive samples divided by the total number of positive samples.

Table 1. Dataset description


Datasets with ID      No. of    No. of      No. of     Missing   No. of samples   No. of samples
                      classes   instances   features   values    for training     for testing
Heart Disease - HD    2         303         13         Yes       193              110
Breast Cancer - BC    2         699         9          Yes       499              200
Diabetes - PID        2         768         8          No        576              192
Parkinson - Pks       2         195         22         No        130              65
Thyroid - Thd         3         215         5          No        110              105
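
The preprocessing described for these datasets (mode imputation for categorical attributes, mean imputation for
continuous attributes, and scaling into [-1, 1]) could be sketched in Python as follows. This is a minimal sketch
under assumptions: the function names are hypothetical, missing values are represented as `None`, and min-max
scaling is assumed because the paper states the target range but not the scaling formula.

```python
from statistics import mean, mode
from typing import List, Optional, Sequence


def impute_continuous(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def impute_categorical(column: Sequence[Optional[str]]) -> List[str]:
    """Replace missing (None) entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    # Note: with tied modes, statistics.mode raises StatisticsError before
    # Python 3.8 and returns the first mode encountered from 3.8 onwards.
    fill = mode(observed)
    return [fill if v is None else v for v in column]


def scale_to_minus_one_one(column: Sequence[float]) -> List[float]:
    """Min-max scale a column into [-1, 1] (assumed formula, see lead-in)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to the midpoint.
        return [0.0 for _ in column]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]


# Made-up toy columns for illustration; they are not taken from the UCI datasets.
ages = impute_continuous([60.0, None, 40.0, 50.0])
groups = impute_categorical(["a", None, "b", "a"])
scaled_ages = scale_to_minus_one_one(ages)
```

In practice the imputation values and the minimum and maximum would be estimated on the training split only and
then reused for the test split, so that no information from the test samples leaks into the model.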

The mathematical expression for recall is given in Eq. (14).

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \tag{11}$$

$$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{12}$$

$$Specificity = \frac{TN}{TN + FP} \tag{13}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{14}$$

where TP is true positive, TN is true negative, FP is false positive and FN is false negative.

4.2 Performance of proposed method by means of accuracy and AUC

In this section, the RST-RNN method is compared against various existing techniques: BPA-NB [13], the hybrid
approach of SVM with K-means [19], Radial Basis Function (RBF), NB and J48. The existing BPA-NB work conducted
its experiments only on the heart disease dataset. Therefore, to compare it with the RST-RNN method on the various
datasets, this work implements BPA-NB for the other datasets and conducts the experiments. Table 2 shows the
experimental results of the RST-RNN method, and the graphical representation is given in Fig. 2.

Table 2. Comparative analysis of proposed RST-RNN method

Methods             Accuracy (%)
                    HD      BC      PID     Pks     Thd
SVM+K-means         90.57   96.27   76.50   96.27   74.15
NB                  97.36   95.64   76.48   96.24   76.48
RBF                 96.77   95.57   76.47   95.45   77.19
J48                 93.41   96.24   76.26   95.57   78.03
BPA-NB              97      95.12   78.50   96.12   78.29
Proposed RST-RNN    98.57   98.19   81.47   98.15   83.46

Table 3. AUC performance of proposed method

Methods             AUC (%)
                    HD              BC      PID     Pks     Thd
SVM                 Not Available   96.87   72.34   96.87   79.65
BPA-NB              87.05           98.45   85.40   96.13   84.41
Proposed Method     89.58           99.12   87.46   97.67   89.74

Fig. 2 shows that the RST-RNN method achieved better accuracy than the existing techniques on all five datasets.
For instance, SVM with K-means achieved 76.50% and 74.15% accuracy on the PID and Thd datasets, whereas the
RST-RNN method achieved 81.47% and 83.46% on those two datasets. The existing technique BPA-NB achieved between
95.12% and 97% accuracy on the HD, BC and Pks datasets, whereas the RST-RNN method attained higher accuracy, above
98%, on the same datasets. The better performance of the RST-RNN method is attributed to the learning rate of the
RNN technique used in the proposed method. Table 3 provides the experimental results of the proposed RST-RNN
method against the existing techniques SVM and BPA-NB in terms of AUC for all five datasets. Fig. 3 gives the
graphical representation of the AUC performance.

From Table 3 and Fig. 3, the experimental results show that the RST-RNN method achieved better performance than
the popular existing techniques on all the datasets. Compared with the other datasets, PID yielded among the
lowest AUC values for all three techniques.

Figure. 2 Performance of RST-RNN method in terms of accuracy

Figure. 3 AUC performance for proposed method

For instance, on the PID dataset SVM achieved 72.34% AUC, BPA-NB achieved 85.40% AUC and the RST-RNN method
achieved 87.46% AUC. However, these techniques achieved higher AUC values on the BC and Pks datasets.

4.3 Performance of proposed technique in terms of specificity and sensitivity

In this section, the specificity and sensitivity of the RST-RNN method are compared with existing techniques such
as SVM, RBF, NB, J48, and BPA-NB. The experimental results are tabulated in Table 4, in which the proposed method
attains the best value in every column. Fig. 4 and Fig. 5 show the graphical representation of the sensitivity and
specificity of the RST-RNN method and several existing techniques. The sensitivity for all the datasets is
illustrated in Fig. 4.

Table 4. Performance of proposed method against existing techniques

Methods            Sensitivity (%)                           Specificity (%)
                   HD      BC      PID     Pks     Thd       HD      BC      PID     Pks     Thd
SVM                78.61   96.86   58.07   96.86   75.47     82.58   96.89   89.62   96.86   77.35
NB                 97.40   96.24   59.59   96.42   79.13     97.90   96.59   88.80   96.59   78.26
RBF                97.07   96.27   64.13   94.57   79.87     96.23   94.32   88.16   94.32   80.19
J48                93.43   96.62   73.84   96.62   82.47     90.37   95.45   88.61   95.45   81.27
BPA-NB             96.51   97.50   80.49   95.14   85.68     84.16   98.25   78.46   95.47   87.12
Proposed Method    98.24   99.53   94.51   98.47   90.47     98.64   99.74   96.39   98.50   89.95

The proposed RST-RNN method achieved higher sensitivity than the existing techniques on all the datasets. On the
HD, BC and Pks datasets, the RST-RNN method achieved between 98.24% and 99.53% sensitivity, whereas the existing
RBF, J48, NB and BPA-NB techniques achieved between 93.43% and 97.50%. Compared with the other techniques, SVM
provides poor performance on all datasets except BC and Pks. Fig. 5 shows the specificity of the RST-RNN method.

For the PID dataset, the specificity values are higher than the sensitivity values for all techniques except
BPA-NB, as shown in Table 4. However, the specificity values for the Thd dataset are lower than those for the
other datasets for most techniques, including the RST-RNN method. For instance, the RST-RNN method achieved 89.95%
specificity on the Thd dataset and 99.74% specificity on the BC dataset. Compared with the SVM technique, the
RST-RNN method improved the specificity on the PID dataset by about 7 percentage points (from 89.62% to 96.39%).
In the following sub-section, the performance of the RST-RNN method in terms of F-measure is described.

Figure. 4 Sensitivity of proposed method

Figure. 5 Specificity of proposed method

4.4 Performance of proposed method by means of F-measure

The experiments are conducted on all datasets to validate the performance of the RST-RNN method in terms of
F-measure, and the results are shown in Table 5. Fig. 6 compares the F-measure of the RST-RNN method graphically
with BPA-NB, SVM and RBF.

Table 5. Comparative analysis of proposed method

Methods            F-Measure (%)
                   HD      BC      PID     Pks     Thd
SVM                93.05   92.94   91.83   95.84   88.14
RBF                83.52   98.00   78.15   96.18   80.35
BPA-NB             91.25   95.12   90.81   94.70   87.34
Proposed Method    95.47   99.07   93.62   97.21   90.27

Figure. 6 Analysis of proposed method in terms of F-measure

The experimental analysis of F-measure showed that the RST-RNN method achieved a higher F-measure than the other
existing methods on all five datasets. The RST-RNN method obtained 93.62% on the PID dataset, whereas RBF, BPA-NB
and SVM achieved 78.15%, 90.81% and 91.83% F-measure, respectively. On the BC dataset, the RST-RNN method achieved
99.07% F-measure, whereas RBF achieved 98%. The analysis of the RST-RNN method in the classification of five
datasets shows that it is highly efficient when compared to other existing methods such as SVM, RBF and BPA-NB.
This indicates that the RST-RNN method avoids over-fitting of the training data and selects effective features
using RST, which supports effective classification performance.

5. Conclusion

Big data analytics now plays a vital role in predicting diseases and tailoring treatment for a particular disease.
Big data provides a 360-degree view of patients’ data on which analytics can be performed for better prediction
outcomes. Healthcare prediction increases the accuracy of diagnosis and supports preventive medicine and public
health. Predictive analytics with big data allows researchers to develop prediction models that give accurate
results over a large number of disease cases. However, prediction using traditional methods is limited and
time-consuming because of the large number of features. In this research work, to address the issues of existing
techniques, an RST-RNN method is developed to predict diseases. The most important features are selected using the
RST technique, and the classification of the various diseases is carried out by the RNN method. The experiments
are conducted on five major UCI datasets in terms of several parameters to validate the effectiveness of RST-RNN
against existing techniques. The proposed RST-RNN method achieved 98.57% accuracy, 89.58% AUC, 98.24% sensitivity,
98.64% specificity and 95.47% F-measure on the HD dataset. The existing technique BPA-NB achieved an accuracy of
97%, AUC of 87.05%, sensitivity of 96.51%, specificity of 84.16% and F-measure of 91.25% on the same HD dataset.
The existing techniques did not concentrate on the most relevant features; they process all the features for
classification. In contrast, the proposed method uses an RST method specifically to select the relevant features
for effective classification. However, the proposed RST-RNN method achieved lower performance on the Thd dataset
than on the other datasets. In future work, this method can be further improved by using effective optimization
techniques to achieve higher performance on the Thd UCI dataset.

References
[1] G. Manogaran and D. Lopez, “Health data analytics using scalable logistic regression with stochastic gradient
descent”, International Journal of Advanced Intelligence Paradigms, Vol.10, No.1-2, pp.118-132, 2018.
[2] A. N. Richter and T. M. Khoshgoftaar, “Efficient learning from big data for cancer risk modeling: A case study
with melanoma”, Computers in Biology and Medicine, Vol.110, pp.29-39, 2019.
[3] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, “Disease prediction by machine learning over big data from
healthcare communities”, IEEE Access, Vol.5, pp.8869-8879, 2017.
[4] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid intelligent system framework for the
prediction of heart disease using machine learning algorithms”, Mobile Information Systems, 2018.
[5] H. Hu, Y. Wen, T. S. Chua, and X. Li, “Toward scalable systems for big data analytics: A technology tutorial”,
IEEE Access, Vol.2, pp.652-687, 2014.
[6] A. Ed-daoudy and K. Maalmi, “A new Internet of Things architecture for real-time prediction of various diseases
using machine learning on big data environment”, Journal of Big Data, Vol.6, No.104, 2019.
[7] M. Tarawneh and O. Embarak, “Hybrid Approach for Heart Disease Prediction Using Data Mining Techniques”, In:
Proc. of International Conference on Emerging Internetworking, Data & Web Technologies, Springer, Cham, 2019.
[8] H. G. Schnack, “Improving individual predictions: machine learning approaches for
detecting and attacking heterogeneity in schizophrenia (and other psychiatric diseases)”, Schizophrenia Research,
2017.
[9] A. Ed-Daoudy and K. Maalmi, “Real-time machine learning for early detection of heart disease using big data
approach”, In: Proc. of International Conference on Wireless Technologies, Embedded and Intelligent Systems
(WITS), 2019.
[10] H. D. Masethe and M. A. Masethe, “Prediction of heart disease using classification algorithms”, In: Proc. of
the World Congress on Engineering and Computer Science, Vol.2, 2014.
[11] M. Kumari and S. Godara, “Comparative study of data mining classification methods in cardiovascular disease
prediction”, International Journal of Computer Science and Technology, Vol.2, No.2, pp.304-308, 2011.
[12] G. Purusothaman and P. Krishnakumari, “A survey of data mining techniques on risk prediction: Heart disease”,
Indian Journal of Science and Technology, Vol.8, No.12, pp.1, 2015.
[13] R. Venkatesh, C. Balasubramanian, and M. Kaliappan, “Development of Big Data Predictive Analytics Model for
Disease Prediction using Machine Learning Technique”, Journal of Medical Systems, Vol.43, No.8, pp.272, 2019.
[14] M. Nilashi, O. Ibrahim, H. Ahmadi, L. Shahmoradi, and M. Farahmand, “A hybrid intelligent system for the
prediction of Parkinson's Disease progression using machine learning techniques”, Biocybernetics and Biomedical
Engineering, Vol.38, No.1, pp.1-15, 2018.
[15] A. Di Noia, A. Martino, P. Montanari, and A. Rizzi, “Supervised machine learning techniques and genetic
optimization for occupational diseases risk prediction”, Soft Computing, pp.1-14, 2019.
[16] L. R. Nair, D. Sujala Shetty, and D. Siddhanth Shetty, “Applying spark based machine learning model on
streaming big data for health status prediction”, Computers & Electrical Engineering, Vol.65, pp.393-399, 2018.
[17] T. Chen, J. Xu, H. Ying, X. Chen, R. Feng, X. Fang, and J. Wu, “Prediction of Extubation Failure for Intensive
Care Unit Patients Using Light Gradient Boosting Machine”, IEEE Access, Vol.7, pp.150960-150968, 2019.
[18] F. Van Wyk, A. Khojandi, and R. Kamaleswaran, “Improving Prediction Performance Using Hierarchical Analysis of
Real-Time Data: A Sepsis Case Study”, IEEE Journal of Biomedical and Health Informatics, Vol.23, No.3, pp.978-986,
2019.
[19] N. Kausar, A. Abdullah, B. B. Samir, S. Palaniappan, B. S. AlGhamdi, and N. Dey, “Ensemble clustering
algorithm with supervised classification of clinical data for early diagnosis of coronary artery disease”, Journal
of Medical Imaging and Health Informatics, Vol.6, No.1, pp.78-87, 2016.
[20] A. Asuncion and D. Newman, “UCI machine learning repository”, 2007.

International Journal of Intelligent Engineering and Systems, Vol.13, No.5, 2020 DOI: 10.22266/ijies2020.1031.02
