Association Rule Mining For Healthcare Data Analysis

This document discusses using association rule mining algorithms like Apriori and FP-growth to analyze healthcare data and discover relationships between diseases, treatments, and other variables. It analyzes the results of experiments on healthcare data to find strong association rules. The document covers using neural networks to predict heart disease using a dataset with 481 records and 14 attributes. It trains a neural network model with status as the dependent variable and other variables like age, gender, and hospital stay details as independent variables.


Association Rules Mining for Healthcare Data Analysis

Punyaban Patel (1), Borra Sivaiah (2), Riyam Patel (3), Ruplal Choudhary (4)

(1) Department of Computer Science and Engineering, CMR Technical Campus, Kandlakoya,
    Hyderabad, India. E-mail: [email protected]
(2) Department of Computer Science and Engineering, CMR College of Engineering & Technology,
    Kandlakoya, Hyderabad, India. E-mail: [email protected]
(3) Department of CSE (AI & ML), SRM Institute of Science & Technology, SRM University,
    Kattankulathur, Chennai, India. E-mail: [email protected]
(4) Department of Plant, Soil and Agriculture Systems, Southern Illinois University,
    Carbondale, United States. E-mail: [email protected]

Abstract
In health care, massive amounts of data are generated and gathered from many sources, in both
structured and unstructured formats. As a result, significant amounts of money and time are
spent on storing and analysing the data. The health industry has seen significant changes in
recent years, with growth in the number of physicians, patients, and diseases. Data volumes
that were once measured in terabytes have now grown to zettabytes and continue to increase
exponentially. Doctors can analyse patient symptoms using data from the health-care industry,
and data mining methods are necessary for analysing such data.
Association Rule Mining (ARM) is one of the most significant tasks in data mining. ARM
techniques such as Apriori and FP-growth may be used to analyse data about doctors, patients,
and illnesses. ARM is a strong approach for uncovering hidden associations between data
variables as well as statistically confirming those that are already known. These associations
can lead to a better understanding of illnesses and their causes, which in turn aids
prevention. Association rules connect different diseases and treatments and provide important
information to doctors and health institutions. The association rules we discovered could be
useful for medical and healthcare research and development in areas such as preventive
medicine, disease diagnosis, and disease prevention.
This chapter presents ARM algorithms such as Apriori and FP-growth. It analyses the results of
experiments based on association rules and the relationships among various diseases, and
discovers strong association rules from health-care data.
Keywords: Association Rule Mining (ARM), Apriori algorithm, FP-growth, Support, Confidence
1. Introduction
Association Rule Mining (ARM) is one of the most significant tasks in data mining. ARM
techniques such as Apriori and FP-growth may be used to analyse data about doctors, patients,
and illnesses. ARM is a strong approach for uncovering hidden associations between data
variables as well as statistically confirming those that are already known. These associations
can lead to a better understanding of illnesses and their causes, which in turn aids
prevention. Association rules connect different diseases and treatments and provide important
information to doctors and health institutions. The association rules we discovered could be
useful for medical and healthcare research and development in areas such as potential
complications, preventive medicine, disease diagnosis, and disease prevention.

This chapter covers ARM algorithms in the health-care field. We examine current association
rules and the links between illnesses, and discover strong association rules using health-care
data.

Data mining is presently employed in a variety of fields, and it is very significant in
clinical practice. Thousands of patients visit hospitals every day for various treatments, and
every department in a hospital is seeing an increase in the number of patient records. Data
mining methods are employed in the medical industry to uncover hidden knowledge in medical
datasets [1]. The patterns uncovered might help decision-making and save lives. Different data
mining techniques, such as classification, clustering, association rule mining, statistical
learning, and link mining, are all useful in research and development in the respective field
[2]. Association rule mining is an effective approach for extracting frequent itemsets from
large data sets. A minimum support value is used to identify the frequent itemsets: an itemset
is frequent if its support is greater than or equal to the minimum support value. If an
itemset is frequent, all of its subsets must be frequent as well [3]. Heart disease is one of
the primary causes of death in humans. It is the leading cause of mortality for both men and
women in the United States, an equal opportunity killer that takes the lives of around 1
million people each year. In 2011, the illness claimed the lives of over 787,000 individuals,
with 380,000 people dying each year from heart disease. Every 30 seconds, someone suffers a
heart attack, and every 60 seconds, someone dies from a heart-related condition [4].

Diabetes, cancer, renal disease, high blood pressure, TB, heart disease, musculoskeletal
disorders, and stroke all have a significant influence on human health. Chronic diseases are
long-term illnesses that usually worsen, and they can be caused by a variety of circumstances.
A chronic illness is one that lasts more than six months in a person. Chronic illnesses are
currently the leading cause of early adult mortality and disability worldwide. According to
World Health Organization (WHO) statistics, adults under the age of 70 account for over half
of all non-communicable disease fatalities worldwide [2]. Cardiovascular disorders, chronic
respiratory illnesses, cancer, and diabetes are the principal non-communicable diseases of
concern. A patient with any of these disorders will need to undergo a series of tests in order to
fully understand their condition. Other factors, such as lifestyle changes, are also taken into
consideration. As a result, each patient has a variety of such characteristics. Relationships
between these traits can be uncovered by analysing patient-specific data, which may reveal
some type of link between these attributes and other life-saving information. This information
can aid in comprehending the nature of diseases, the relationships between various
characteristics, the regions of the body that the disease affects (e.g., diabetes can impact the
kidneys), and what tests are necessary for any condition, among other things.

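The rest of the chapter is organized as follows. Section 2 describes a neural network model
for heart disease prediction, Section 3 reviews related work, Section 4 presents association
rule mining, Section 5 describes the measures used in ARM, Section 6 presents the experimental
analysis and results, Section 7 gives observations, and Section 8 concludes the chapter and
outlines future directions.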
2. Neural Network
Neural networks are used to solve many challenging artificial intelligence problems. They can
outperform traditional machine learning models because they offer non-linearity, variable
interactions, and customization. In this section, a neural network model is used for the
prediction of heart disease. The heart disease data set consists of 481 records and 14
attributes, which are shown in Table 1.
Table 1: Heart disease data set (481 records, 14 attributes)

Sl. No.  Variable  Description                                        Codes / Units
01       ID        Identification Code                                1 to 481
02       AGE       Age (per chart)                                    years
03       SEX       Gender                                             0 = Male, 1 = Female
04       CPK       Peak Cardiac Enzyme                                International Units (iu)
05       SHO       Cardiogenic Shock Complications                    0 = No, 1 = Yes
06       CHF       Left Heart Failure Complications                   0 = No, 1 = Yes
07       MIORD     MI Order                                           0 = First, 1 = Recurrent
08       MITYPE    MI Type                                            1 = Q-wave, 2 = Not Q-wave, 3 = Indeterminate
09       YEAR      Cohort Year                                        1 = 1975, 2 = 1978, 3 = 1981, 4 = 1984, 5 = 1986, 6 = 1988
10       YRGRP     Grouped Cohort Year                                1 = 1975 & 1978, 2 = 1981 & 1984, 3 = 1986 & 1988
11       LENSTAY   Length of Hospital Stay                            Days in Hospital
12       DSTAT     Discharge Status from Hospital                     0 = Alive, 1 = Dead
13       LENFOL    Total Length of Follow-up from Hospital Admission  Days
14       FSTAT     Status as of Last Follow-up                        0 = Alive, 1 = Dead

A neural network model was trained with FSTAT as the dependent variable and all other
variables as independent variables. The data set was divided into a training set containing
70% of the records and a test set containing the remaining 30%. The first model used one
hidden layer with 2 nodes, with linear output set to false because this is a classification
model. A second model was then trained on the same training set with the same input and output
variables; the only difference is that it has two hidden layers, one with 4 nodes and one with
2 nodes. This model is illustrated in Figure 1. The confusion matrices on the training data
and testing data are shown in Table 2 and Table 3, respectively.
The accuracy of the neural network model is 84.50%, which is consistent with the training-data
confusion matrix in Table 2: (168 + 116) / 336.
Table 2: Confusion matrix on the training data

Predicted    Actual 0    Actual 1
0            168         52
1            0           116

Table 3: Confusion matrix on the testing data

Predicted    Actual 0    Actual 1
0            0           0
1            64          81
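
The accuracy implied by a confusion matrix is the share of cases on its diagonal. The short Python sketch below recomputes it from the values printed in Tables 2 and 3. It assumes that rows are predicted classes and columns are actual classes, as the table headers suggest; the function name `accuracy` is ours, not part of the original study.

```python
# Confusion matrices copied from Tables 2 and 3.
# Assumption: rows = predicted class, columns = actual class.
train_cm = [[168, 52],
            [0, 116]]
test_cm = [[0, 0],
           [64, 81]]


def accuracy(cm):
    """Fraction of cases on the diagonal (predicted class == actual class)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


train_acc = accuracy(train_cm)  # (168 + 116) / 336
test_acc = accuracy(test_cm)    # (0 + 81) / 145

print(f"training accuracy: {train_acc:.4f}")
print(f"testing accuracy:  {test_acc:.4f}")
```

The two matrices together cover 336 + 145 = 481 records, which matches the 70/30 split of the data set. The training figure comes out at about 0.845. Table 3 as printed gives about 0.559 and contains no predictions of class 0, so readers may wish to check those values against the original experiment.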

Figure 1: Neural network model for heart disease prediction


3. Related Works

Many studies on the application of association rule mining to liver, heart, and kidney
diseases are available on the web and in online repositories. A few of them are discussed
briefly in this section.
3.1 Liver Diseases
The liver plays an important part in human bodily activities, from protein manufacturing to
toxin removal, and it is necessary for survival. Failure of the liver to operate properly might
result in major health problems. Two types of testing, imaging and liver function tests, are used
to assess the liver's function and aid in the diagnosis of liver illnesses. Many factors contribute
to liver disease, including stress, eating habits, alcohol usage, and drug use. It has recently been
shown that it is extremely difficult to diagnose at an early stage since symptoms are difficult
to define. The physician frequently fails to recognise liver illness, resulting in ineffective
medical treatment. Various data mining techniques may be used to anticipate various illness
stages, even early stages, to aid physicians in providing appropriate therapy.

Many individuals nowadays suffer from liver disease as a result of their eating habits,
alcohol intake, stress, and a variety of other activities. Early detection of liver illness
may increase the chances of cure; if it is not treated appropriately at an early stage, it may
lead to major health problems. Although existing algorithms are effective at forecasting, they
become inefficient as data grows [8]. Because clinical test reports provide a large amount of
data, predicting any specific disease is quite challenging. To address such challenges, the
medical field frequently relies on automation technology, applying machine learning,
classification algorithms, data analytics, and other computing techniques. To address the
concerns with liver disease prediction, a comprehensive study of prediction algorithms was
conducted, followed by a comparative analysis to determine the most accurate method. Although
existing solutions are good, their accuracy, execution speed, specificity, and sensitivity
must be improved in order to create an effective system [6,7]. The effectiveness of existing
approaches is assessed by comparing the results of the various algorithms.

To forecast liver illness, Nazmun Nahar and Ferdous Ara [4] used the decision tree algorithms
J48, LMT, Random Tree, Random Forest, REPTree, Decision Stump, and Hoeffding Tree, and
compared them. The accuracy, precision, recall, mean absolute error, F-measure, kappa
statistic [17], and run time of each algorithm were measured. According to the findings, the
Decision Stump algorithm performs well when compared to the other algorithms, with an accuracy
of 70.67 percent. Classification is another strategy for distinguishing different types of
data and predicting accuracy [5,6,8]. Clustering is the process of dividing a collection of
abstract objects into classes. Association rule mining is the process of discovering rules
that describe associations and causal relationships among a set of items.
Sindhuja et al. [9] conducted a survey of several classification algorithms for predicting
liver disorders. The C4.5, Naive Bayes, Decision Tree, SVM, Back Propagation Neural Network,
and Classification and Regression Tree algorithms were compared and assessed using speed,
accuracy, performance, and cost as criteria. When compared against the other algorithms, the
C4.5 method was shown to be the best.
To detect the illness and assess the algorithms' performance, Vijarani et al. [8] used MATLAB
2013 to develop classification techniques such as Naive Bayes and Support Vector Machine
(SVM). When the accuracy and execution time of the SVM and Naive Bayes algorithms were
compared, the SVM method was found to perform better.
To predict fatty liver disease (FLD), Chieh-Chen Wu et al. [11] utilised data from New Taipei
City Hospital and machine learning methods such as random forest, Naive Bayes, Artificial
Neural Networks (ANN), and logistic regression. The models were compared on their ROC curves
and accuracy. The random forest model performed better than the other classification models
(87.48), which should aid clinicians in classifying fatty liver patients for early treatment.
Noor Sadiyah Novita Alfisahrin et al. [13] used the WEKA tool to create a model in which
liver function test attributes such as age, gender, total bilirubin, direct bilirubin,
alkaline phosphatase, total proteins, albumin, aspartate aminotransferase, and the
albumin-to-globulin ratio were combined with classification algorithms such as Decision Tree,
Naive Bayes, and NBTree to predict liver disease. In addition, the Chi-squared ranking
algorithm was utilised to assess the influence of the various attributes. The execution time
of each algorithm was used to assess its performance, and the accuracy was measured using a
confusion matrix. According to the experimental results, the NBTree algorithm has the best
accuracy, while the Naive Bayes algorithm has the fastest computing time.
Alice Auxilia et al. [12] used the R programming environment to create a model to examine
several liver disease conditions. Machine learning techniques such as decision trees, support
vector machines, and the Naive Bayes algorithm were used to train and evaluate the datasets.
The accuracy, specificity, and sensitivity of each method were measured using Pearson
correlation, and the decision tree outperformed the other classification algorithms.
One of the most difficult aspects of medical data mining is automated illness prediction and
diagnosis. Sina Bahramirad et al. [13] used eleven algorithms to build a classification model
based on two real liver patient datasets: Logistic, Linear Logistic Regression, Gaussian
Processes, Logistic Model Trees, Multilayer Perceptron, K-STAR, RIPPER, Neural Net, Rule
Induction, Support Vector Machine, and Classification and Regression Trees. A comparative
study was conducted using these algorithms on two datasets, one from the state of Andhra
Pradesh in India (AP dataset) and one from the state of California in the United States (BUPA
dataset), and their performance was evaluated in terms of accuracy, precision, and recall. The
results on the AP dataset were better than those on the BUPA dataset in terms of accuracy,
whereas the results on the BUPA dataset were better in terms of precision and recall.
To obtain the optimum algorithm, Ashwani Kumar et al. [14] utilised the info-gain feature
selection approach with classification algorithms such as C4.5, Random Forest, CART, Random
Tree, and REP. To improve accuracy, the datasets were partitioned using two training-testing
ratios (70-30 percent and 80-20 percent), and the performance of the algorithms was compared.
It was determined that, with an 80-20 percent training-testing split and 6 features, Random
Forest achieves an accuracy of 79.22%.
Anju Gulia et al. [15] created a hybrid model using several algorithms such as J48, MLP, SVM,
Random Forest, and BayesNet, and compared them to enhance accuracy. The model contains three
phases. The first phase involves applying a classification algorithm to the original dataset;
the second phase involves selecting the features that influence liver disease; and the third
phase involves comparing the results on the dataset with and without feature selection. The
accuracy of the algorithms was measured to evaluate performance in the experiments. Before
feature selection, the SVM method was rated the best; after feature selection, the Random
Forest method performed better than the other algorithms.

Sanjay Kumar et al. [16] used real liver disease patient data to develop models that used
various classification algorithms to detect liver disorders. The liver function test
attributes, which included age, gender, total bilirubin, Alkphos, DB, Sgpt, TP, A/G Ratio,
ALB, Sgot, and the Selector field, were considered with classification algorithms such as
Naive Bayes, Random Forest, K-means, C5.0, and K-Nearest Neighbour (KNN). Before the adaptive
boosting method was applied, the Random Forest approach provided high accuracy. After the C5.0
algorithm was adopted, however, the accuracy improved.

3.2 Heart diseases

Cardiovascular disease (CVD) is one of the world's leading causes of death. According to the
World Health Organization (WHO) and the Global Burden of Disease (GBD) research, it is the
leading cause of mortality worldwide every year [26, 27]. According to the WHO, CVD is
estimated to impact over 23.6 million individuals by 2030. In developed nations such as the
United States of America, the mortality rate is about 1 in 4 [28].

The Middle East and North Africa (MENA) region has an even higher fatality rate, accounting
for 39.2 percent of the total [30]. As a result, lowering the number of deaths caused by
cardiovascular illnesses requires early and precise diagnosis as well as adequate treatment. For
people who are at high risk of acquiring heart disease, such services must be available [29].

Many factors influence the likelihood of developing heart disease. In the past, researchers
were more concerned with selecting important features to include in their heart disease
prediction models [31]. The relevance of understanding the links between these features and
deciding their priority [32] within the prediction model was downplayed. Many data
mining-related studies have already been undertaken to address the challenges that impede
early and accurate diagnosis [33, 34, 35].

ARM is also used to predict cardiac disease. Table 1 lists the studies that employed ARM to
predict heart disease. ARM has been employed on the UCI dataset by Akbaş et al. [19],
Shuriya and Rajendran [22], Srinivas et al. [24], Khare and Gupta [20], and Lakshmi and
Reddy [21].

Private datasets from hospitals and cardiac centres were used in several of the studies
described in Table 1. Despite the high scores achieved on these datasets (99 percent by Sonet
et al. [23] and 100 percent by Thanigaivel and Kumar [25]), the studies have a reproducibility
issue because the datasets are not publicly accessible. Using the UCI dataset, on the other
hand, Akbaş et al. [19] obtained a confidence score of 97.8%. That confidence score, however,
applied to rules predicting persons who were not at risk of heart disease.

3.3 Kidney diseases


Kidney problems occur when the kidneys are unable to filter blood as effectively as they
should. Chronic Kidney Disease (CKD) occurs when the kidneys' function deteriorates over
time. Diabetes and high blood pressure are the two most common causes of chronic kidney
disease, accounting for up to two-thirds of cases. Diabetes affects several organs in the body,
including the kidneys and heart, as well as blood vessels, nerves, and eyes, when blood sugar
levels are too high.

Researchers sought to use information system techniques in health care fifteen years ago in
order to reduce the expense and burden of illnesses on individuals, hence saving their lives
(Akiyama and Fujita, 2013; Maeda et al., 2016). Knowledge Discovery and Data Mining
(KDDM) is one of these methods for predicting and discovering illness indications (Boukenze
et al.,2017). KDDM is a critical method for extracting information from large amounts of data.
KDDM employs a variety of approaches and strategies to extract relevant information that may
be utilised to aid decision-making. Association rules, classification, and clustering are among
the methodologies and strategies used (Zeynu and Patil, 2018). KDDM techniques involve
many iterative steps to extract the significant knowledge, which is used to make a right decision
in an efficient manner (Arasu and Thirumalaiselvi, 2017).

On the other hand, few studies have used an integrated strategy that gathers insight from
medical data by merging different methodologies. To close this gap, classification and
association rule mining techniques were combined and employed in this study to build a
classification system for predicting CKD using the Weka tool.

To predict and diagnose CKD, the classification algorithms naive Bayes (NB), decision tree
(J48), support vector machine (SVM), K-nearest neighbour (KNN), and a classifier based on
association rules (JRip) were utilised. The Apriori technique may also be used to uncover
strong association rules between attributes. After all of these algorithms are used, the
findings are more remarkable and valuable for patients, clinicians, governments, and
decision-makers in the medical and health informatics industry. This demonstrates that an
integrated method that combines classification algorithms with association rule mining
enhances prediction accuracy, particularly for medical data.

4. Association Rule Mining

Every health-care organization has a vast database of patient data, and it is difficult to
analyse each of these records manually. Data mining methods are used to extract meaningful
information from datasets with a significant volume of data. In the medical industry they are
used to analyse patient data in order to identify patients who are more likely to be affected
by an ailment and to aid doctors in detecting the condition. Humans are afflicted with
ailments such as dengue fever, liver disease, and kidney disease, among others. In this
chapter we used several association rule mining algorithms to determine which disease belongs
to which group based on clinical data.
4.1 FP-Growth Algorithm
The FP-Growth algorithm [1] is as follows.
Step 1: Scan the database to find the occurrences of each item. This is identical to Apriori's
        first step. The number of occurrences of a 1-itemset in the database is called its
        support count, or the frequency of the 1-itemset.

Step 2: Construct the FP tree. Start by creating the root of the tree, which is represented by
        null.

Step 3: Scan the database again and examine the transactions. Examine the first transaction
        and find the items it contains. The item with the highest count is placed at the top,
        followed by the item with the next-highest count, and so on, so that the branch of the
        tree is built from the transaction's items in descending order of count.

Step 4: Examine the next transaction in the database, again ordering its items by descending
        count. If some of its items are already present in another branch (for example, that
        of the first transaction), the new branch shares a common prefix with that branch,
        starting at the root.

Step 5: The shared items are therefore linked to new nodes for the remaining items of this
        transaction.

Step 6: The counts are incremented as transactions are inserted: the count of each shared node
        and of each new node is increased by 1 as nodes are created and linked for a
        transaction.

Step 7: Mine the constructed FP tree. The lowest nodes, together with the links between them,
        are examined first; a lowest node represents a frequent pattern of length 1. From it,
        traverse the path upward in the FP tree. This path, or set of paths, is called the
        conditional pattern base.

Step 8: The conditional pattern base is a sub-database consisting of the prefix paths in the
        FP tree that occur with the lowest node (the suffix).

Step 9: Construct a conditional FP tree from the counts of the items along these paths. Only
        items that meet the support threshold are kept in the conditional FP tree.

Step 10: Frequent patterns are generated from the conditional FP tree.
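
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `Node`, `build_tree`, and `fp_growth` are ours, the symptom transactions are hypothetical, and the minimum support is given as an absolute count. Each transaction carries a weight so that the same function can also process a conditional pattern base.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Steps 1-6: count the items, then insert each transaction in descending count order."""
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)        # Step 2: the null root
    links = defaultdict(list)      # item -> all tree nodes that hold the item
    for items, weight in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:      # a new branch starts below the shared prefix
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += weight  # Step 6: increment counts along the path
            node = child
    return links, frequent


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Steps 7-10: mine patterns by recursing on conditional pattern bases."""
    links, frequent = build_tree(weighted_transactions, min_count)
    patterns = {}
    for item, count in frequent.items():
        pattern = suffix | {item}
        patterns[pattern] = count
        base = []                  # conditional pattern base of `item`
        for node in links[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        patterns.update(fp_growth(base, min_count, pattern))
    return patterns


# Hypothetical symptom transactions, for illustration only.
transactions = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"fever", "fatigue"},
    {"cough", "fatigue"},
    {"fever", "cough", "fatigue"},
]
result = fp_growth([(t, 1) for t in transactions], min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum count of 3, the three single symptoms and the three pairs are frequent, while the full triple occurs only twice and is dropped. Recursing on the conditional pattern base avoids the explicit candidate generation that Apriori performs.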

4.2 Apriori Algorithm

The Apriori algorithm [1] is a popular algorithm for generating frequent itemsets and is
described below.

Step 1: Scan the database to obtain the frequent 1-itemsets.

Step 2: Generate candidate itemsets of length (k+1) from the frequent itemsets of length k.

Step 3: Test the candidates against the database.

Step 4: Terminate when no frequent itemsets or candidate itemsets can be generated.
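
The four steps can be sketched in Python as follows. This is an illustrative sketch rather than the Weka implementation used later in the chapter; the function name `apriori` and the symptom transactions are assumptions made for the example. Candidates are pruned when any of their k-subsets is infrequent, which follows from the property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: scan the database for the frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 1
    while current:
        # Step 2: join frequent k-itemsets into (k+1)-candidates and prune
        # every candidate that has an infrequent k-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Step 3: test the candidates against the database.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1  # Step 4: the loop stops once no frequent itemsets remain.
    return frequent


# Hypothetical symptom transactions, for illustration only.
transactions = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"fever", "fatigue"},
    {"cough", "fatigue"},
    {"fever", "cough", "fatigue"},
]
frequent_itemsets = apriori(transactions, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

On this toy data the three single symptoms have support 0.8 and the three pairs have support 0.6, so all six are reported. The triple has support 0.4 and is rejected in Step 3, after which no candidates remain and the loop terminates.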

5. Measures used in Association Rule Mining (ARM)

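The support-confidence framework is commonly used to capture a certain form of dependency
among items recorded in a database. This approach uses five measures to assess the
uncertainty of an association rule: support, confidence, lift, leverage, and conviction.

The support can be written as

$$\mathrm{Support}(X \rightarrow Y) = \mathrm{Support}(Y \rightarrow X) = P(X \text{ and } Y) = \frac{|XY|}{|D|} \tag{1}$$

where X and Y are itemsets, |XY| is the number of transactions that contain both X and Y, and
|D| is the total number of transactions in the database. The support of a single itemset is
defined in the same way, so that Support(X) = P(X).

The confidence [40] can be defined as

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\mathrm{Support}(X \rightarrow Y)}{\mathrm{Support}(X)} \tag{2}$$

The leverage [40] can be defined as

$$\mathrm{Leverage}(X \rightarrow Y) = P(X \text{ and } Y) - P(X)\,P(Y) \tag{3}$$

The conviction [40] can be defined as

$$\mathrm{Conviction}(X \rightarrow Y) = \frac{1 - \mathrm{Support}(Y)}{1 - \mathrm{Confidence}(X \rightarrow Y)} \tag{4}$$

The lift [40] can be defined as

$$\mathrm{Lift}(X \rightarrow Y) = \mathrm{Lift}(Y \rightarrow X) = \frac{\mathrm{Confidence}(Y \rightarrow X)}{\mathrm{Support}(X)} = \frac{\mathrm{Confidence}(X \rightarrow Y)}{\mathrm{Support}(Y)} \tag{5}$$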
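
As a worked illustration, the measures can be computed directly from transaction counts. The sketch below uses a hypothetical helper named `rule_measures` (our name) and, as input, the counts that the chapter reports for the breast cancer rule inv-nodes=0-2 ==> node-caps=no: 286 instances in total (201 + 85), 213 matching the antecedent, 222 matching the consequent, and 201 matching both.

```python
def rule_measures(n, count_x, count_y, count_xy):
    """Measures for the rule X -> Y computed from transaction counts.

    n: total number of transactions
    count_x, count_y: transactions containing X, respectively Y
    count_xy: transactions containing both X and Y
    """
    sup_x, sup_y, sup_xy = count_x / n, count_y / n, count_xy / n
    confidence = sup_xy / sup_x
    lift = confidence / sup_y
    leverage = sup_xy - sup_x * sup_y
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - sup_y) / (1 - confidence)
    return {
        "support": sup_xy,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Counts taken from the breast cancer rule inv-nodes=0-2 ==> node-caps=no.
m = rule_measures(n=286, count_x=213, count_y=222, count_xy=201)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

The confidence (0.94), lift (1.22), and leverage (0.12) agree with the values tabulated for this rule. The textbook conviction comes out at about 3.97 rather than the tabulated 3.67. A variant that adds 1 to the number of counter-examples, 213 × 64 / (286 × 13) ≈ 3.67, would reproduce the tabulated figure, but this is our inference and is not stated in the source.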

6. Experimental Analysis and Results

In this section, the implementation details and the data sets used for evaluation are
described, and a practical analysis of the results of the proposed approach is presented.
Publicly available data sets for heart disease and breast cancer are used to evaluate the
proposed approach. The heart disease data set has 76 attributes in total and records for 303
patients; the 14 attributes that are linked with heart disease are used. The details of these
14 attributes are as follows:

1. 'age' real
2. 'sex' {female, male}
3. 'cp' {typ_angina, asympt, non_anginal, atyp_angina}
4. 'trestbps' real
5. 'chol' real
6. 'fbs' {t, f}
7. 'restecg' {left_vent_hyper, normal, st_t_wave_abnormality}
8. 'thalach' real
9. 'exang' {no, yes}
10. 'oldpeak' real
11. 'slope' {up, flat, down}
12. 'ca' real
13. 'thal' {fixed_defect, normal, reversable_defect}
14. 'num' {'<50', '>50_1', '>50_2', '>50_3', '>50_4'}

The breast cancer data set [3] includes 201 instances of one class and 85 instances of the
other class. The instances are described by 9 attributes plus the class attribute; some of the
attributes are linear and some are nominal. The attribute information for the breast cancer
data is given below.
1. Class: no-recurrence-events, recurrence-events.
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.
The frequent itemsets generated by the Apriori algorithm are shown in Table 4.

Table 4: Frequent itemsets generated by the Apriori algorithm for the breast cancer data set

Minimum support  Confidence  L1  L2   L3   L4   L5  L6
0.1              1.0         26  138  230  205  87  13
0.2              1.0         19  58   69   29   5
0.3              1.0         13  26   20   4
0.4              1.0         9   10   4    1
0.5              1.0         6   6    4    1
0.5              0.9         6   6    4    1
0.6              0.8         4   3    1
0.6              0.7         4   3    1
0.6              0.6         4   3    1
7. Observations
Here, the Apriori algorithm was run in the Weka tool. When the minimum support is increased
and the confidence value is kept the same, the number of frequent itemsets decreases. When the
minimum support is kept the same and the confidence value is decreased, the number of frequent
itemsets generated is not affected.
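
Both observations follow from how the thresholds are applied: minimum support filters itemsets, whereas minimum confidence is applied only afterwards, when rules are derived from the itemsets that are already frequent. The brute-force sketch below illustrates the first effect on hypothetical transactions; the function name `count_frequent` is ours, and the exhaustive enumeration is suitable only for very small examples.

```python
from itertools import combinations


def count_frequent(transactions, min_support):
    """Brute-force count of the itemsets whose support is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    count = 0
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / n >= min_support:
                count += 1
    return count


# Hypothetical symptom transactions, for illustration only.
transactions = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"fever", "fatigue"},
    {"cough", "fatigue"},
    {"fever", "cough", "fatigue"},
]
thresholds = [0.4, 0.6, 0.8, 1.0]
counts = [count_frequent(transactions, s) for s in thresholds]
for s, c in zip(thresholds, counts):
    print(f"min_support={s}: {c} frequent itemsets")
```

The count can only stay the same or fall as the support threshold rises, because every itemset that passes a higher threshold also passes a lower one. No confidence parameter appears in the function at all, which mirrors the observation that changing the confidence value leaves the number of frequent itemsets unchanged.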

We used the Apriori algorithm with a minimum support of 20% and a minimum confidence of 90%.
The best 10 association rules generated from the breast cancer data set are shown in Table 5.

Table 5: The best 10 rules generated by the Apriori algorithm for the breast cancer data set

Sl. No.  Confidence  Lift  Leverage  Conviction  Association rule
1        0.99        1.27  0.11      10.97       inv-nodes=0-2 irradiat=no Class=no-recurrence-events 147 ==> node-caps=no 145
2        0.97        1.25  0.12      5.85        inv-nodes=0-2 irradiat=no 183 ==> node-caps=no 177
3        0.96        1.29  0.11      5.51        node-caps=no irradiat=no Class=no-recurrence-events 151 ==> inv-nodes=0-2 145
4        0.96        1.23  0.11      4.67        inv-nodes=0-2 Class=no-recurrence-events 167 ==> node-caps=no 160
5        0.94        1.22  0.12      3.67        inv-nodes=0-2 213 ==> node-caps=no 201
6        0.94        1.26  0.13      4           node-caps=no irradiat=no 188 ==> inv-nodes=0-2 177
7        0.94        1.26  0.11      3.64        node-caps=no Class=no-recurrence-events 171 ==> inv-nodes=0-2 160
8        0.92        1.19  0.08      2.62        irradiat=no Class=no-recurrence-events 164 ==> node-caps=no 151
9        0.91        1.19  0.08      2.38        inv-nodes=0-2 node-caps=no Class=no-recurrence-events 160 ==> irradiat=no 145
10       0.91        1.22  0.12      2.58        node-caps=no 222 ==> inv-nodes=0-2 201

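The association rules in Table 5 can be used by doctors for analysing the relationships among
the attributes in the breast cancer data. In each rule, the number after the antecedent is the
count of instances that match it, and the number after the consequent is the count of
instances that match both sides of the rule.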

Table 6: The best 10 rules generated by the Apriori algorithm for the heart disease data set

Sl. No.  Confidence  Lift  Leverage  Conviction  Association rule
1        0.98        2.15  0.09      14.43       sex=male cp=asympt fbs=f ca='(0.5-inf)' 53 ==> num=>50_1 52
2        0.98        2.15  0.08      12.8        cp=asympt exang=yes ca='(0.5-inf)' 47 ==> num=>50_1 46
3        0.97        2.12  0.1       11.25       sex=male cp=asympt ca='(0.5-inf)' 62 ==> num=>50_1 60
4        0.96        1.13  0.02      2.57        cp=asympt thal=normal 52 ==> fbs=f 50
5        0.96        2.11  0.09      9.26        cp=asympt slope=flat ca='(0.5-inf)' 51 ==> num=>50_1 49
6        0.96        2.11  0.08      9.08        cp=asympt ca='(0.5-inf)' thal=reversable_defect 50 ==> num=>50_1 48
7        0.96        2.1   0.08      8.71        cp=asympt slope=flat thal=reversable_defect 48 ==> num=>50_1 46
8        0.95        2.08  0.09      7.49        sex=male slope=flat ca='(0.5-inf)' 55 ==> num=>50_1 52
9        0.94        2.07  0.08      7.08        cp=asympt exang=yes thal=reversable_defect 52 ==> num=>50_1 49
10       0.94        1.11  0.02      2.02        restecg=normal thalach='(147.5-inf)' thal=normal 68 ==> fbs=f 64
The association rules in Table 6 can be used by doctors to analyse the relationships among the
attributes in the heart disease data.

8. Conclusion and Future Directions

Healthcare data is generated by hospitals and diagnostic centers, and identifying the most
frequently occurring symptoms in this data is essential. We applied the Apriori algorithm to a
heart disease data set and a breast cancer data set to discover frequently occurring symptoms,
and generated strong association rules from them. Healthcare professionals and physicians can
use these rules to find strong associations among symptoms. We intend to extend this research
by considering more risk factors, in order to extract more useful and significant rules not
only for breast cancer and heart disease but also for other disease types using association
rule mining. Furthermore, we plan to build predictive models for these diseases using machine
learning techniques.

REFERENCES

1. Jiawei Han, Micheline Kamber & Jian Pei.: Data Mining: Concepts and Techniques. 3rd ed.
The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers,
July (2011). ISBN 978-0123814791.

2. Margaret H. Dunham.: Data Mining: Introductory and Advanced Topics. Pearson Education
India, (2006).

3. Kabir, M. F., Ludwig, S. A., & Abdullah, A. S. Rule Discovery from Breast Cancer Risk
Factors using Association Rule Mining. IEEE International Conference on Big Data (Big
Data). pp.2433-2441, (2018). doi:10.1109/bigdata.2018.8622028.

4. Nazmun Nahar and Ferdous Ara,.:Liver disease prediction by using different decision tree
techniques, International Journal of Data Mining & Knowledge Management Process
(IJDKP),Vol.8, No.2, March (2018).

5. Divya, B, & Kalaiselvi, R,:Review on Confidentiality of the Outsourced Data. Research


Journal of Science and Engineering Systems, vol.1, pp.1-7 (2017).

6. Hassoon, M, Kouhi, M S, Zomorodi-Moghadam, M, & Abdar, M,.:Rule optimization of


boosted C5.0 classification using genetic algorithm for liver disease prediction. IEEE
International Conference on Computer and Applications (ICCA), pp. 299-305, (2017).
7. S. Dinesh, Metin KOK. :A Review on Different Parameters Effecting the Vehicle Emission
Gases of Different Fuel Mode Operations. Research Journal of Science and Engineering
Systems, Vol.3, (2018).

8. Dr.S.Vijayarani,& Mr.S.Dhayanand. : Liver disease prediction using SVM and Navies


Bayes. International Journal of Science Engineering and Technology Research, Vol.4, Issue 4,
April (2015).

9. D.Sindhuja, & R. Jemina Priyadarsini .:A Survey on Classification Techniques in Data


Mining for Analyzing Liver Disease Disorder. International Journal of Computer Science and
Mobile Computing, Vol.5 Issue.5, May (2016).

10. Sadiyah Noor, Novita Alfisahrin, & Teddy Mantoro.: Data Mining Techniques For
Optimatization of Liver Disease",International Conference on Advanced Computer Science
Applications and Technologies , (2013).

11. Chieh-Chen Wu et. Al.: Prediction of fatty liver disease using machine learning algorithms,
Computer Methods and Programs in Biomedicine, Vol.170, Pages 23-29, March (2019).

12. L. Alice Auxilia.:Accuracy Prediction using Machine Learning Techniques for Indian
PatientLiver Disease, 2nd IEEE International Conference on Trends in Electronics and
Informatics (2018). ISBN:978-1-5386-3570-4.

13. Sina Bahramirad, Aida Mustapha, Maryam Eshraghi, "Classification of Liver Disease
Diagnosis: A Comparative Study", 2013 Second International Conference on Informatics &
Applications (ICIA),Lodz, Poland,IEEE Explore,2013.

14. Ashwani Kumar, Neelam Sahu,. : Categorization of liver disease using classification
techniques, International Journal for Research in Applied Science & Engineering Technology
(IJRASET),Vol. 5, No.5, (2017).

15. Anju Gulia, Rajan Vohra, & Praveen Rani, "Liver Patient Classification Using Intelligent
Techniques", International Journal of Computer Science and Information Technologies, Vol.
5, No.4, (2014).

16. Sanjay Kumar,Sarthak Katyal,.:Effective Analysis and Diagnosis of Liver Disorder byData
Mining", Proceedings of the International Conference on Inventive Research in Computing
Applications (2018).

17 Sasikala B S, Vinai George Biju, : C. M. PrashanthKappa andAccuracy Evaluations of


MachineLearning Classifiers. 2nd IEEE International Conference On Recent Trends In
Electronics Information & Communication Technology, pp. 1920, (2017).

18. Shambel Kefelegn, Pooja Kamat, .: Prediction and Analysis of Liver Disorder Diseasesby
using Data Mining Technique: Survey", International Journal of Pure and Applied
Mathematics, Vol.118,No. 9, pp.765-770, (2018).
19. Akbas KE, Kivrak M, Arslan AK, Çolak C.: Assessment of association rules based on
certainty factor: an application on heart data set, IEEE International artificial intelligence and
data processing symposium (IDAP), pp. 1–5, (2019).

20. Khare S, & Gupta D.: Association rule analysis in cardiovascular disease,” In: Cognitive
Computing and Information Processing (CCIP), IEEE 2nd International Conference on
Cognitive Computing and Information Processing, pp. 1–6, (2016).

21. Lakshmi KP, Reddy CRK.: Fast rule-based heart disease prediction using associative
classiication mining, IEEE International conference on computer, communication and control
(IC4), pp.1–5, (2015).

22. Shuriyaa B, & Rajendranb A.: Cardio vascular disease diagnosis using data mining
techniques and ANFIS approach, Int Journal of Appl Eng Res., 13(21):15356–61, (2018).

23. Sonet, K. M. H., Rahman, M. M., Mazumder, P., Reza, A., & Rahman, R. M.: Analyzing
patterns of numerously occurring heart diseases using association rule mining, Twelfth IEEE
International Conference on Digital Information Management (ICDIM), pp. 38–45, (2017).

24. Srinivas K, Reddy BR, Rani BK & Mogili R. Hybrid.: Approach for prediction of
cardiovascular disease using class association rules and MLP. Int. Journal Electrical Computer
Eng. pp.2088–8708, 6(4), (2016).

25. Thanigaivel R & Kumar KR.: Boosted apriori: an efective data mining association rules for
Heart disease prediction system”, Middle-East J Sci Res., 24(1):192–200, (2016).

26. Roth GA, et al.: Global, Regional, and National Age-Sex-Specific Mortality for 282 causes
of death in 195 Countries and Territories, 1980–2017: a systematic analysis for the Global
Burden of Disease Study”, Lancet.,392(10159): pp.1736–88, (2018).

27. World Health Organization. Global action plan for the prevention and control of non-
communicable diseases 2014–2020, Geneva (2013). ISBN 978 92 4 1506236.

28. Murphy SL, Xu J, Kochanek KD & Arias E.: Mortality in the United States, NCHS data
brief, no 328. Hyattsville, MD: National Center for Health Statistics, 2018).

29. Maji S,& Arora S.,”Decision tree algorithms for prediction of heart disease, In information
and communication technology for competitive strategies, pp. 447–454, Springer, Singapore
(2019).

30. James SL, et al.: Global, regional, and national incidence, prevalence, and years lived with
disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a
systematic analysis for the Global Burden of Disease Study 2017. LANCET392 (10159),
pp.1789–1858, (2018).

31. Amin MS, Chiam YK, Varathan KD Identiication of signiicant features and data mining
techniques in predicting heart disease.Telem Inform.Vol.36, pp.82–93(2019).

32. Mohammed KI, Zaidan AA, Zaidan BB, Albahri OS, Albahri AS, Alsalem MA, & Mohsin
AH.: Novel technique for reorganisation of opinion order to interval levels for solving several
instances representing prioritisation in patients with multiple chronic diseases. Comput
Methods Programs Biomed. 2020;185:105151, (2020).

33. Bashir, S., Khan, Z. S., Khan, F. H., Anjum, A., & Bashir, K.: Improving heart disease
prediction using feature selection approaches”, 16th IEEE International Bhurban Conference
on Applied Sciences and Technology(IBCAST), pp.619–623, (2019).

34. Mahdi MA, & Al-Janabi S.: A novel software to improve healthcare base on predictive
analytics and mobile services for cloud data centers, in International conference on big data
and networks technologies, pp. 320–339, Springer, Cham; (2019).

35. Fitriyani NL, Syafrudin M, Alian G, & Rhee J.: HDPM: an efective heart disease prediction
model for a clinical decision support system. IEEE Access. Vol.8, pp.133034–50, (2020).

36. Alagugowri S,& Christopher T.: Enhanced Heart Disease Analysis and Prediction System
[EHDAPS] Using Data Mining. International Journal of Emerging Trends in Science and
Technology, Vol.1, pp.1555-1560, (2014).

37. Tzung-Pei Hong, Chun-Wei Lin, Tsung-Ching Lin. The MFFP-Tree Fuzzy Mining
Algorithm to Discover Complete Linguistic Frequent Itemsets. International Journal of
Computational Intelligence, Vol.30, pp.145–166, (2014).

38. Marghny H, Mohamed, Mohammed M,& Darwieesh.: Efficient Mining Frequent Itemsets
Algorithms. International Journal of Machine Learning and Cybernetics; Vol. 5, pp. 823-833.
(2013).

39. Mir Md. Jahangir Kabir, Shuxiang Xu, Byeong Ho Kang,& Zongyuan Zhao.: A Novel
Approach to Mining Maximal Frequent Itemsets Based on Genetic Algorithm. 9th International
Conference on Information Technology and Applications (ICITA), Sydney, Australia, (2014).

40. Agarwal, R., Imielinski, T., & Swami,A,N. 1993. : Mining association rules between sets
of items in large databases.” Proceedings of the 1993 ACM SIGMOD International Conference
on Management of Data. (1993).
