The Prediction of Diseases Using Rough Set Theory With Recurrent Neural Network in Big Data Analytics
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
* Corresponding author’s Email: [email protected]
Abstract: In modern life, early healthcare prediction plays an important role in preventing the loss of life caused by
delays in treatment. Researchers have therefore focused on Big data analysis, which is used to identify
future health status and provides an efficient way to overcome the issues in early prediction. Much research is
ongoing on predictive analytics using machine learning techniques to support better decision making. Big data
analysis provides great opportunities to predict future health status from health parameters and provide the best
outcomes. However, data classification is one of the major challenges because of noisy or missing data in
the dataset. Feature selection techniques play an important role in the classification process by removing irrelevant
features from the extracted data. In this research work, the Rough Set Theory (RST) technique is used to select the
most relevant features, which helps to provide efficient classification of medical data and disease detection. The
selected features are given as input to the Recurrent Neural Network (RNN) technique for disease prediction. The
proposed method is called RST-RNN, and the experiments are carried out on datasets from the UCI machine learning
repository in terms of accuracy, f-measure, sensitivity and specificity. The results showed that the RST-RNN
method achieved an accuracy of 98.57%, whereas the existing Support Vector Machine (SVM) achieved 90.57%
accuracy and Naive Bayes (NB) achieved 97.36% accuracy on the heart disease dataset.
Keywords: Big data analysis, Decision making, Feature selection technique, Naive bayes, Rough set theory,
Recurrent neural network.
with the heterogeneous nature of data or the real-time [6].

Improper diagnosis may cause death or disability to the patient. A disease prediction model can support medical professionals and practitioners in predicting a particular disease. The huge amount of data that can be collected using digital devices (by patients themselves in hospital) can be used with big data techniques to diagnose patients and predict diseases [7-8]. However, applying machine learning to this big data stream is challenging, as traditional machine learning systems are not suited to handling such massive volume or varied velocity. Analytical data processing is another major problem, because richer analytical processing requires efficient data integration between systems. Most existing works involve machine learning, but in real-time applications the machine learning techniques are insufficient to handle big data [9-10].

Classification techniques are widely used in healthcare, since they are capable of processing large sets of data. The techniques commonly used in healthcare are NB, SVM, Nearest Neighbor (NN), decision tree (DT), fuzzy logic, fuzzy-based neural network (FNN), artificial neural network (ANN), and genetic algorithms (GA) [11]. Machine learning with classification can be efficiently applied in medical applications for complex measurements. Modern classification techniques provide more intelligent and effective prediction techniques for disease prediction [12]. In this research, the important features are selected by using the RST method, which increases the performance of the classification technique. The important features from the medical dataset are given to the RNN method for classifying the data. Overfitting on the training data is also reduced by using the RST method, and the proposed RST-RNN method is validated on various UCI datasets for predicting diseases.

This research paper is organized as follows. Section 2 reviews existing techniques with their advantages and limitations. Section 3 explains the proposed method for disease prediction. The experiments conducted to validate the effectiveness of the proposed RST-RNN method against existing techniques are presented in Section 4. Finally, the conclusion of the research work with future development is given in Section 5.

2. Literature review

In this section, existing techniques that predict the health status of patients using machine learning are discussed, together with the advantages and limitations of the existing methods [13-18].

R. Venkatesh, C. Balasubramanian, and M. Kaliappan [13] designed a Big Data Predictive Analytics model for heart disease prediction using the NB technique, called BPA-NB. This system used a probabilistic classification based on Bayes' theorem to analyse the data. To filter the unnecessary data, BPA-NB used a clustering technique and made the prediction in an effective way. The computational complexity was reduced by using the MapReduce algorithm with the Apache Spark framework. The experiments were carried out on a UCI dataset to validate the effectiveness of BPA-NB against existing techniques by means of processing time, CPU utilization and accuracy. The BPA-NB method was used to predict only heart disease, and it provided poor performance on the prediction of other diseases.

M. Nilashi, O. Ibrahim, H. Ahmadi, L. Shahmoradi, and M. Farahmand [14] developed Incremental Support Vector Regression (ISVR) for predicting the Unified Parkinson's Disease Rating Scale (UPDRS). The prediction of Motor-UPDRS and Total-UPDRS was done by ISVR. In this method, a self-organizing map (SOM) was used to cluster the data and non-linear iterative partial least squares (NIPALS) was used for dimensionality reduction. To evaluate the ISVR method, several experimental analyses were conducted on a real-world PD dataset taken from UCI. The results indicated that the method that combines the SOM, NIPALS, and ISVR techniques was effective in predicting the Total-UPDRS and Motor-UPDRS. The ISVR method reduced the computation time only for small data and used only some of the important attributes for PD diagnosis; other attributes were not considered.

A. Di Noia, A. Martino, P. Montanari, and A. Rizzi [15] predicted occupational disease risks by using pattern recognition techniques and computational intelligence techniques, namely SVM and KNN. A set of meaningful labelled clusters was determined as the final model by using the k-means algorithm. The optimal hyperparameters and the optimal ad-hoc dissimilarity measure weights of the classification systems were found using genetic algorithms, which improved the performance of those systems. The experiments were carried out to estimate the performance of the three techniques against an existing technique on standard collected datasets by means of fitness functions for different classes. This
International Journal of Intelligent Engineering and Systems, Vol.13, No.5, 2020 DOI: 10.22266/ijies2020.1031.02
Received: January 17, 2020. Revised: March 13, 2020. 12
algorithm performed well only on occupational disease forecasting of the collected dataset.

L. R. Nair, D. Sujala Shetty, and D. Siddhanth Shetty [16] aimed to develop a real-time remote health status prediction system on the big data processing engine Apache Spark, tested and deployed on a cloud, where a DT was built on streaming data. The relevant health data of a user was received through tweet streams, and a direct message about the user's health status was then sent back by using the DT algorithm. A user received the information about his or her health status instantly and privately with a single tweet, which was used to decide whether he or she needed expert health care. A variety of diseases could also be predicted with slight modifications of this DT algorithm. The recovery of data was not possible because tweets were deleted permanently after a certain time period.

T. Chen, J. Xu, H. Ying, X. Chen, R. Feng, X. Fang, and J. Wu [17] predicted Extubation Failure (EF) by analyzing 3636 adult patient records in the MIMIC-III clinical database using Light Gradient Boosting Machine (LightGBM). Based on the results of LightGBM, a feature importance analysis was carried out by interpreting the features using SHapley Additive exPlanations (SHAP). The experiments were carried out on the clinical database against existing techniques, namely SVM, ANN and Logistic Regression (LR). The results stated that LightGBM achieved more accurate prediction than the other existing techniques. However, the recognition rate for EF using LightGBM was still not very high.

F. Van Wyk, A. Khojandi, and R. Kamaleswaran [18] presented a hierarchical analysis of machine learning algorithms for improving the predictions of at-risk patients. In addition, a multi-layer machine learning approach was developed to analyze high-frequency and continuous data. The experimental results illustrated the capabilities of this approach for early identification of patients at risk of sepsis, where the physiological data were collected from bedside monitors. By this analysis, the multi-layer machine learning approach potentially helped to reduce mortality and morbidity in the ICU. Even though the method used the Sequential Organ Failure Assessment (SOFA) score, the onset of organ failure was not identified by this algorithm.

N. Kausar, A. Abdullah, B. B. Samir, S. Palaniappan, B. S. AlGhamdi, and N. Dey [19] implemented a hybrid approach, namely SVM with the K-means clustering technique, for medical data classification. The attribute dimension was reduced by introducing the Principal Component Analysis (PCA) algorithm. Then, the related parameters and measures were adjusted to differentiate the normal and abnormal patients. The experiments were carried out on the UCI datasets in terms of accuracy, precision, recall and f-measure. When unseen patterns of similar behaviours were introduced within the selected clusters, the detection rate of the developed approach decreased and the classification time was high.
The existing techniques predict disease either only on UCI datasets or on collected datasets, and other diseases are not addressed by these traditional techniques. In this research study, the RST-RNN method focuses on five UCI datasets and selects the most relevant features for better classification.

3. Proposed methodology

Disease prediction from healthcare data is a critical task because of the various relationships between the aspects of the patients and the disease. Disease prediction provides many advantages, such as early-stage disease diagnosis and a reduced mortality rate. Healthcare data are available in large amounts and need to be analyzed effectively. In this research, RST and RNN are applied for disease prediction on medical data. RST finds the dependency between the attributes by analyzing their characteristics and is also used to remove the superfluous attributes. The decision rule generated by RST is provided as input to the RNN. The RNN analyzes the attributes of the data with the decision rule for disease prediction. This section discusses the working of RST and RNN in detail. The block diagram of RST and RNN in disease prediction is shown in Fig. 1.

3.1 Rough set theory

Let $I = (U, A)$ be an information system, where $U$ is a non-empty finite set of objects called the universe of discourse and $A$ is a non-empty set of attributes. With every attribute $a \in A$, a set of its values $V_a$ is associated. For a subset of attributes $P \subseteq A$ there is an associated equivalence relation $IND(P)$, which is called an indiscernibility relation. The relation $IND(P)$ is defined in Eq. (1):

$$IND(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\} \qquad (1)$$

If $(x, y) \in IND(P)$, then $x$ and $y$ are indiscernible by attributes from $P$. The equivalence classes of the $P$-indiscernibility relation are denoted $[x]_P$. The indiscernibility relation is the mathematical basis of RST. The lower and upper approximations are two basic operations in RST. A subset $X \subseteq U$ can be approximated using only the information contained within $P$. The $P$-lower approximation of $X$, denoted $\underline{P}X$, is the set of all elements of $U$ that can be certainly classified as elements of $X$ based on the attribute set $P$. The $P$-upper approximation of $X$, denoted $\overline{P}X$, is the set of all elements of $U$ that can possibly be classified as elements of $X$ based on the attribute set $P$. These two definitions are expressed in Eq. (2) and Eq. (3):

$$\underline{P}X = \{x \in U \mid [x]_P \subseteq X\} \qquad (2)$$

$$\overline{P}X = \{x \in U \mid [x]_P \cap X \neq \emptyset\} \qquad (3)$$

where $\underline{P}X$ is the $P$-lower approximation and $\overline{P}X$ is the $P$-upper approximation. RST selects the features using the dependency of attributes and removes the superfluous features. The features selected by RST are provided as input to the RNN for classification.

3.2 Recurrent neural network

RNNs are a good solution to the problem of modeling dynamic changes in a time series. They are widely used in natural language processing, speech recognition, and handwriting recognition tasks. The RNN takes as input a sequence of vectors over time, $\ldots, X_{t-1}, X_t, X_{t+1}, \ldots$. As the sequence advances, the hidden layer $S_t$ is affected both by the input $X_t$ and by the previous hidden layer $S_{t-1}$. Eq. (4) and Eq. (5) formally describe the RNN process:

$$S_t = f(U \cdot X_t + W \cdot S_{t-1}) \qquad (4)$$

$$O_t = g(V \cdot S_t) \qquad (5)$$

where $S_t$ represents the memory of the sample at time $t$, i.e. the value of the hidden layer, as calculated by Eq. (4). $W$ is the weight applied to the hidden-layer value of the previous moment, and $U$ is the weight applied to the input. Eq. (5) calculates the output value $O_t$, where $V$ is the weight of the output. Both $f$ and $g$ are activation functions: $f$ can be an activation function such as tanh, ReLU, or the sigmoid, and $g$ is usually a softmax activation function.

As the RNN structure deepens, the gradient calculated by back propagation through the hidden layers may vanish or explode. Although gradient clipping can cope with exploding gradients, it does not solve vanishing gradients. So, in the text sequence of a language model, an RNN cannot easily capture the dependence between the text elements across the
large distances in the sequence. The use of a long short-term memory (LSTM) can solve the aforementioned problems. The core of an LSTM is the state of the cell (i.e. the cell state). It also includes three kinds of gate structure: the input, output and forget gates. The relevant formulas, Eq. (6) to Eq. (10), are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (6)$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (7)$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (8)$$

$$C_t = f_t \times C_{t-1} + i_t \times \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (9)$$

$$h_t = o_t \times \tanh(C_t) \qquad (10)$$

Eq. (6), Eq. (7) and Eq. (8) are three multiplicative gates: the forget gate, $f_t$; the input gate, $i_t$; and the output gate, $o_t$. The input in Eq. (6), Eq. (7) and Eq. (8) is $[h_{t-1}, x_t]$, but the parameters are different. $\sigma$ represents the sigmoid activation function. $C_t$ in Eq. (9) is the cell state, which is obtained from the previous cell state $C_{t-1}$ and a candidate value computed from the current input $x_t$ and the previous hidden state $h_{t-1}$. If the forget gate $f_t$ is 0, then the state at the previous moment is completely cleared, so that only the input data of this time step is considered. The input gate $i_t$ determines whether to receive input at this time. The final output gate $o_t$ determines whether to output the cell state. Hence, overfitting is avoided by using RST on the training data to select the important features, which improves the performance of the RNN. The experiments and their validated results are discussed in the next sections.

4. Results and discussion

In this section, the validation of the proposed RST-RNN method and its experimental results are discussed against various existing techniques. Five biomedical datasets, namely the Pima Indians diabetes, Wisconsin breast cancer, heart disease, thyroid and Parkinson datasets, are collected from the UCI machine learning repository [20] to evaluate the performance of the proposed RST-RNN method. Table 1 shows the details of the datasets with ID, number of features and classes.

Missing values are present only in the HD and BC datasets. Missing categorical attributes are replaced by the mode of the attribute, and missing continuous values are replaced by the mean of the attribute. Numerical difficulties during calculation are addressed by scaling the data into the range [-1, 1] before constructing the proposed RST-RNN model. Hence, feature values in smaller numerical ranges are not dominated by those in greater numerical ranges. In the following subsections, the experimental setup with parameter settings and the validated experimental results of the RST-RNN method against various existing techniques are explained.

4.1 Experimental setup and parameter settings

The RST-RNN method is developed in Python 3.7.3 on a computer with a 2.2 GHz Intel Core i5 processor and 8 GB of RAM. The performance of the RST-RNN method is validated by conducting several experiments on the UCI datasets using various metrics, namely Area Under Curve (AUC), accuracy, F-measure, specificity, sensitivity (recall) and precision.

Sensitivity, i.e. the true positive rate, is the proportion of positive samples that are correctly classified as positive. In contrast, specificity, i.e. the true negative rate, is the proportion of negative samples that are correctly classified as negative. Accuracy is calculated using Eq. (11), and Eq. (12) is used to evaluate the single combined metric, which is defined as the F-measure. Precision, shown in Eq. (13), is the number of correctly labeled positive samples among all samples labeled as positive. In contrast, recall is the number of correctly labeled positive samples divided by the total number of samples that actually belong to the positive class.
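The measures described above can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch, assuming a binary problem and the standard textbook definitions of sensitivity, specificity, precision, accuracy and F-measure; Eq. (11) to Eq. (13) are not reproduced here, and the names `ConfusionCounts`, `count_outcomes` and `binary_metrics` are illustrative rather than taken from the RST-RNN implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """True/false positive and negative counts for a binary classifier."""
    tp: int
    fp: int
    tn: int
    fn: int


def count_outcomes(y_true, y_pred, positive=1):
    """Tally the confusion-matrix counts; `positive` marks the disease class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c):
    """Compute the evaluation measures from confusion-matrix counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # true positive rate (recall)
    specificity = _ratio(c.tn, c.tn + c.fp)  # true negative rate
    precision = _ratio(c.tp, c.tp + c.fp)    # correct share of predicted positives
    accuracy = _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)
    # F-measure: harmonic mean of precision and recall (assumed F1 form).
    f_measure = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only; these are not results from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    for name, value in binary_metrics(count_outcomes(y_true, y_pred)).items():
        print(f"{name}: {value:.4f}")
```

Keeping the counts in a separate structure makes it clear that specificity and precision are different quantities: the former is normalized by the actual negatives, the latter by the predicted positives.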
For instance, SVM achieved 72.34% AUC, BPA-NB achieved 85.40% AUC and the RST-RNN method achieved 87.46% AUC. However, these techniques achieved higher AUC values for the BC and Pks datasets.

4.3 Performance of proposed technique in terms of specificity and sensitivity

In this section, the specificity and sensitivity of the RST-RNN method are compared with existing techniques such as SVM, RBF, NB, J48, and BPA-NB. The experimental results are tabulated in Table 4, in which the best values are marked in bold. Fig. 4 and Fig. 5 show the graphical representation of the sensitivity and specificity of the RST-RNN method and several existing techniques. The sensitivity results for all the datasets are illustrated in Fig. 4.

The results clearly show that the proposed RST-RNN method achieved higher performance than the existing techniques for all the datasets. On the HD, BC and Pks datasets, the RST-RNN method achieved nearly 99% sensitivity, whereas the existing RBF, J48, NB and BPA-NB techniques achieved nearly 96% sensitivity. Compared with the other techniques, SVM provides poor performance on all datasets except the BC and Pks datasets. Fig. 5 shows the specificity of the RST-RNN method.

Compared with the sensitivity on the PID dataset, the specificity values are higher for the same dataset, as illustrated in Table 4. However, the specificity values for the Thd dataset are lower than those for the other datasets for all techniques, including the RST-RNN method. For instance, the RST-RNN method achieved 89.95% specificity for the Thd dataset and 99.74% specificity for the BC dataset. Compared with the SVM technique, the RST-RNN method improved the specificity by 7% for the PID dataset. In the following subsection, the performance of the RST-RNN method in terms of F-measure is described.

4.4 Performance of proposed method by means of F-measure
detecting and attacking heterogeneity in schizophrenia (and other psychiatric diseases)”, Schizophrenia research, 2017.
[9] A. Ed-Daoudy and K. Maalmi, “Real-time machine learning for early detection of heart disease using big data approach”, In: Proc. of International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS), 2019.
[10] H. D. Masethe and M. A. Masethe, “Prediction of heart disease using classification algorithms”, In: Proc. of the World Congress on Engineering and Computer Science, Vol.2, 2014.
[11] M. Kumari and S. Godara, “Comparative study of data mining classification methods in cardiovascular disease prediction”, International Journal of Computer Science and Technology, Vol.2, No.2, pp.304-308, 2011.
[12] G. Purusothaman and P. Krishnakumari, “A survey of data mining techniques on risk prediction: Heart disease”, Indian Journal of Science and Technology, Vol.8, No.12, pp.1, 2015.
[13] R. Venkatesh, C. Balasubramanian, and M. Kaliappan, “Development of Big Data Predictive Analytics Model for Disease Prediction using Machine Learning Technique”, Journal of Medical Systems, Vol.43, No.8, pp.272, 2019.
[14] M. Nilashi, O. Ibrahim, H. Ahmadi, L. Shahmoradi, and M. Farahmand, “A hybrid intelligent system for the prediction of Parkinson's Disease progression using machine learning techniques”, Biocybernetics and Biomedical Engineering, Vol.38, No.1, pp.1-15, 2018.
[15] A. Di Noia, A. Martino, P. Montanari, and A. Rizzi, “Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction”, Soft Computing, pp.1-14, 2019.
[16] L. R. Nair, D. Sujala Shetty, and D. Siddhanth Shetty, “Applying spark based machine learning model on streaming big data for health status prediction”, Computers & Electrical Engineering, Vol.65, pp.393-399, 2018.
[17] T. Chen, J. Xu, H. Ying, X. Chen, R. Feng, X. Fang, and J. Wu, “Prediction of Extubation Failure for Intensive Care Unit Patients Using Light Gradient Boosting Machine”, IEEE Access, Vol.7, pp.150960-150968, 2019.
[18] F. Van Wyk, A. Khojandi, and R. Kamaleswaran, “Improving Prediction Performance Using Hierarchical Analysis of Real-Time Data: A Sepsis Case Study”, IEEE Journal of Biomedical and Health Informatics, Vol.23, No.3, pp.978-986, 2019.
[19] N. Kausar, A. Abdullah, B. B. Samir, S. Palaniappan, B. S. AlGhamdi, and N. Dey, “Ensemble clustering algorithm with supervised classification of clinical data for early diagnosis of coronary artery disease”, Journal of Medical Imaging and Health Informatics, Vol.6, No.1, pp.78-87, 2016.
[20] A. Asuncion and D. Newman, “UCI machine learning repository”, 2007.