
DIFFERENTIAL LINK ANALYSIS IN HEALTH CARE USING

DATA MINING
V. Jayaraj
Department of Computer Applications,
Periyar Maniammai University,
Vallam, Thanjavur, Tamil Nadu, India.
Fax: 04362-264660,
e-mail: [email protected]

Abstract:
This thesis describes the use of rule induction in a data mining application in the field of medicine.
Association rule induction is a powerful method for so-called market basket analysis, which aims at finding
regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops
and the like. Link analysis is a modeling technique based on the theory that if you buy a certain group of
items, you are more (or less) likely to buy another group of items. There may be many databases in which
one or more of the item sets are interrelated. Experimental data in many domains serves as a basis for
predicting useful trends. This work generates association rules in one such medical database. Data in the
medical field are usually vast and interrelated. The database chosen here depicts the complications
occurring in patients with diabetes and/or hypertension (increased blood pressure). All the patients were in
the age group of 40 to 60 years, and the sex ratio was almost equal (M:F of 1.1:1). Various methods have
been used to study such vast data, to arrive at conclusions, and to predict future probabilities. Data mining,
more specifically the Apriori algorithm, is used to derive association rules that represent relationships
between input conditions and the results of domain experiments. Use of the P-tree (partial support tree)
and T-tree (total support tree) concepts in the Apriori algorithm (Differential Link Analysis) reduces the
memory space and execution time. Differential Link Analysis gives clues as to what complications a patient
might develop if he/she has one disease, the other, or a combination of both.

Key words: data mining, association rule, Apriori algorithm, market basket analysis, partial support tree,
Differential Link Analysis.
Introduction

Data mining, also called data archeology, data dredging or data harvesting, is the process of extracting
knowledge hidden in large volumes of raw data and using it to make crucial business decisions. The
extracted information can be used to form a prediction or classification model, identify relations between
database records, or provide a summary of the database(s) being mined. Data mining consists of a number
of operations, each of which is supported by a variety of techniques such as rule induction, neural networks,
conceptual clustering, association discovery, decision trees, etc.

In this thesis, the discovery of association rules between items in a large database in the
medical field is projected. Given a collection of items and a set of records, each of which contains some
number of items from the given collection, a link discovery function is applied as an operation against this
set of records. It projects affinities that exist among the collection of items.

The importance of a rule is usually measured by two numbers: its support, which is the percentage
of transactions to which the rule can be applied (or, alternatively, the percentage of transactions in which it
holds), and its confidence, which is the number of cases in which the rule is correct relative to the
number of cases in which it is applicable (and thus is equivalent to an estimate of the conditional probability
of the consequent of the rule, given its antecedent). These measures are used to select interesting rules from
the set of all possible rules.

Among the best-known algorithms for association rule induction is the Apriori algorithm [1] [3].
This algorithm works in two steps. In the first step the frequent item sets (often misleadingly called large
item sets) are determined. These are sets of items that have at least the given minimum support (i.e., occur
in at least a given percentage of all transactions). In the second step association rules are generated from the
frequent item sets found in the first step. Usually the first step is the more important, because it accounts for
the greater part of the processing time. In order to make it efficient, the Apriori algorithm exploits the simple
observation that no superset of an infrequent item set (i.e., an item set not having minimum support) can be
frequent (can have enough support).
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with
medicine. Although the volume of data has been reduced, finding useful rules is still like finding a needle in
a haystack. One partial solution to this problem is Differential Link Analysis: with the help of the P-tree and
T-tree concepts, only valuable rules are generated. This gives clues about the probability of a diabetic and/or
hypertensive patient developing a particular complication. As a first step, therefore, Differential Link
Analysis can be used in deciding the line of management and promoting good health among patients.
But this is only a first analysis. Differential Link Analysis can find interesting results using minimal
memory space and improved execution speed. It is a relatively new operation whose rules help to extract
new knowledge from the database.

To extract such new knowledge, an improved procedure based on Apriori, i.e., Differential Link Analysis,
is developed. The main ideas of the association analysis are:
• Different items in the databases behave differently.
• Support values for all association rules work well.
• Some useful rules can be obtained using a tree structure.
• Association rules can be generated with the P-tree.
• Rules can be generated with reduced memory space.
• Execution time can be reduced.

Classification and data clustering [5] are cognate to supervised machine learning. Assume
that we are given a large set of lists, each containing the values of n parameters and their known
classification. One then groups the lists into clusters that have the same classification. In the context of
gene expression analysis, effective attribute selection and classification methods have been a focus of
study in recent years. Classification and clustering [6] [8] are two major tasks in gene expression data
analysis: classification is concerned with assigning memberships to samples based on expression
patterns, while clustering aims at finding new biological classes and refining existing ones. To cluster
and/or recognize patterns in gene expression data sets, dimension problems are encountered.
The first algorithm to generate all frequent sets and confident association rules was the AIS
algorithm by Agrawal et al. [1], which was given together with the introduction of this mining problem.
Shortly after that, the algorithm was improved and renamed Apriori by Agrawal et al. [2] [3], exploiting
the monotonicity property of the frequency of item sets and the confidence of association rules.

Methodology
Basics of Association Rule

The discovery of association rules finds attribute-value conditions that occur frequently
together in a given set of data. Market basket analysis is a modeling technique based on the theory that if
you buy a certain group of items, then you are more (or less) likely to buy another group of items. The set of
items a customer buys is referred to as an item set, and market basket analysis seeks to find relationships
between purchases.
Typically the relationship will be in the form of a rule

IF {bread} THEN {butter}


This rule expresses the hidden information that if a customer buys bread, he/she is also likely to buy butter.
There are two measures associated with an association rule:
a. Support Level
b. Confidence Level

Rules of Support Level


The minimum percentage of instances in the database that contain all items listed in a given
association rule.
Support of an Item Set
Let T be the set of all transactions under consideration, e.g., let T be the set of all “baskets” or
“carts” of products bought by the customers of a supermarket, say, on a given day. The support of an
item set S is the percentage of those transactions in T which contain S. In the supermarket example this is
the number of “baskets” that contain a given set S of products, for example S = {bread, butter, milk}. If U is
the set of all transactions that contain all items in S, then
support(S) = (|U|/|T|) * 100%
where |U| and |T| are the numbers of elements in U and T respectively. For example, if a customer buys the
set X = {milk, bread, apples, wine, sausages, cheese, onions, potatoes}, then S is obviously a subset of X
and hence X is in U. If there are 318 customers and 242 of them buy such a set X or a similar one that
contains S, then support(S) = (242/318) * 100% = 76.1%.
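This support computation can be sketched in Java. This is an illustrative sketch only: the class name, method name and basket data are assumptions, not part of the software described later in this thesis.

```java
import java.util.List;
import java.util.Set;

public class SupportDemo {

    // Percentage of transactions in T that contain every item in S.
    public static double support(List<Set<String>> transactions, Set<String> itemset) {
        long count = transactions.stream()
                .filter(t -> t.containsAll(itemset))
                .count();
        return 100.0 * count / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
                Set.of("bread", "butter", "milk"),
                Set.of("bread", "milk"),
                Set.of("milk", "apples"),
                Set.of("bread", "butter"));
        // {bread, butter} appears in 2 of 4 baskets.
        System.out.println(support(baskets, Set.of("bread", "butter"))); // prints 50.0
    }
}
```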
Rules for Confidence Level
For a rule “If A then B”, the confidence is the conditional probability that B is true when A is known to be
true.

Confidence of an Association Rule


To evaluate association rules, the confidence of a rule R = “A and B -> C” is the support of the set of
all items that appear in the rule divided by the support of the antecedent of the rule, i.e.
Confidence (R) = (support ({A, B, C}) / support ({A, B})) * 100%
More intuitively, the confidence of a rule is the number of cases in which the rule is correct relative to the
number of cases in which it is applicable. For example, let R = “butter and bread -> cheese”. If a customer
buys butter and bread, then the rule is applicable and it says that he/she can be expected to buy cheese. If
he/she does not buy butter or does not buy bread, the rule is not applicable and thus (obviously) does not say
anything about this customer.
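The confidence measure can be sketched the same way. All names and data below are illustrative assumptions; conf(X -> Y) = supp(X ∪ Y) / supp(X) as defined above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConfidenceDemo {

    // Number of transactions that contain every item in the given itemset.
    static long occurrences(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }

    // conf(X -> Y) = supp(X ∪ Y) / supp(X), expressed as a percentage.
    public static double confidence(List<Set<String>> transactions,
                                    Set<String> antecedent, Set<String> consequent) {
        Set<String> union = new HashSet<>(antecedent);
        union.addAll(consequent);
        return 100.0 * occurrences(transactions, union)
                     / occurrences(transactions, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
                Set.of("butter", "bread", "cheese"),
                Set.of("butter", "bread"),
                Set.of("bread", "milk"));
        // "butter and bread -> cheese": applicable in 2 baskets, correct in 1.
        System.out.println(confidence(baskets,
                Set.of("butter", "bread"), Set.of("cheese"))); // prints 50.0
    }
}
```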

Existing Apriori Algorithm

Frequent Item set Mining problem


A transactional database consists of a sequence of transactions: T = (t1, …, tn). A transaction is a set
of items (ti ⊆ I). Transactions are often called baskets, referring to the primary application domain (i.e.
market-basket analysis). A set of items is often called an item set by the data mining community. The
(absolute) support or occurrence of X (denoted by supp(X)) is the number of transactions that are
supersets of X (i.e. that contain X). The relative support is the absolute support divided by the number of
transactions (i.e., n). An item set is frequent if its support is greater than or equal to a threshold value.

Association Rule Mining Problem


This program is also capable of mining association rules. An association rule is like an
implication: X -> Y means that if item set X occurs in a transaction, then item set Y also occurs with high
probability. This probability is given by the confidence of the rule. It is like an approximation of p(Y|X): it
is the number of transactions that contain both X and Y divided by the number of transactions that contain
X, thus conf(X -> Y) = supp(X ∪ Y)/supp(X). An association rule is valid if its confidence and support are
greater than or equal to the corresponding threshold values.

In the frequent item set mining problem a transaction database and a relative support threshold
(traditionally denoted by min supp) are given, and we have to find all frequent item sets.

They have decomposed the problem of mining association rules into two parts:
• Find all combinations of items that have transaction support above the minimum support. Call those
combinations frequent item sets.
• Use the frequent item sets to generate the desired rules. The general idea is that if, say, ABCD and
AB are frequent item sets, then we can determine whether the rule AB -> CD holds by computing the ratio
r = support(ABCD)/support(AB). The rule holds only if r >= minimum confidence. Note that the
rule will have minimum support because ABCD is frequent. The Apriori algorithm used for
finding all frequent item sets is given below.

Algorithm 5.1 Partitioning algorithm


Here the database D is divided into ‘P’ partitions D1, D2,…, DP.

Advantages
1. The algorithm is more efficient, as it exploits the fact that a large item set must be large in at least one
of the partitions.
2. These algorithms adapt better to limited main memory.
3. It is easier to perform incremental generation of association rules by treating the current state of the
database as one partition and treating the new entries as a second partition.

Input
I // Item set
D = (D1, D2, …, DP) // Database of transactions divided into P partitions
s // Support

Output
L // Large item sets

Partitioning Algorithm
C = 0
for i = 1 to P do // Find large item sets in each partition
    Li = Apriori(I, Di, s)
    C = C ∪ Li
L = 0
for each ti ∈ D do // Count candidates during second scan
    for each Ii ∈ C do
        if Ii ⊆ ti then
            ci = ci + 1
for each Ii ∈ C do
    if ci >= (s × |D|) then
        L = L ∪ Ii
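The two-phase partitioning scheme can be sketched in Java. This is an assumed illustration: a brute-force subset enumerator stands in for the per-partition Apriori call, and all class and method names are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionDemo {

    // All non-empty subsets of one (small) transaction, via bitmask enumeration.
    static Set<Set<String>> subsets(Set<String> t) {
        List<String> items = new ArrayList<>(t);
        Set<Set<String>> out = new HashSet<>();
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<String> s = new HashSet<>();
            for (int i = 0; i < items.size(); i++)
                if ((mask & (1 << i)) != 0) s.add(items.get(i));
            out.add(s);
        }
        return out;
    }

    // Item sets meeting the support threshold within a single partition
    // (stand-in for the Apriori(I, Di, s) call in the pseudocode above).
    static Set<Set<String>> locallyFrequent(List<Set<String>> part, double minSupp) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> t : part)
            for (Set<String> s : subsets(t))
                counts.merge(s, 1, Integer::sum);
        Set<Set<String>> freq = new HashSet<>();
        counts.forEach((s, c) -> { if (c >= minSupp * part.size()) freq.add(s); });
        return freq;
    }

    // Phase 1: union the locally frequent sets; Phase 2: recount globally.
    public static Set<Set<String>> partitionMine(List<List<Set<String>>> partitions,
                                                 double minSupp) {
        Set<Set<String>> candidates = new HashSet<>();
        for (List<Set<String>> p : partitions)
            candidates.addAll(locallyFrequent(p, minSupp));
        List<Set<String>> all = new ArrayList<>();
        partitions.forEach(all::addAll);
        Set<Set<String>> large = new HashSet<>();
        for (Set<String> c : candidates)
            if (all.stream().filter(t -> t.containsAll(c)).count() >= minSupp * all.size())
                large.add(c);
        return large;
    }
}
```

The correctness argument mirrors advantage 1 above: any globally large item set is large in at least one partition, so it must appear among the candidates recounted in phase 2.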

Algorithm 5.2 Sampling Algorithm


Here a sample is drawn from the database such that it is memory resident. Then any one algorithm
is used to find the large item sets for the sample. If an item set is large in the sample, it is viewed as
potentially large in the entire database.

Advantages
1. Facilitates efficient counting of item sets in large databases.
2. Original sampling algorithm reduces the number of database scans to one or two.
Input
I // Item set
D // Database of transactions
s // Support

Output
L // Large item sets

Sampling algorithm
Ds = sample drawn from D
PL = Apriori(I, Ds, smalls) // smalls: a support threshold lowered for the sample
C = PL ∪ BD⁻(PL) // Candidates: PL plus its negative border
L = 0
for each Ii ∈ C do
    ci = 0 // Initial counts for each item set are 0
for each ti ∈ D do // First scan count
    for each Ii ∈ C do
        if Ii ⊆ ti then
            ci = ci + 1
for each Ii ∈ C do
    if ci >= (s × |D|) then
        L = L ∪ Ii
ML = {x | x ∈ BD⁻(PL) and x ∈ L} // Missing large item sets
if ML ≠ 0 then
    C = L // Set candidates to be the large item sets
    repeat
        C = C ∪ BD⁻(C) // Expand candidate sets using the negative border
    until no new item sets are added to C
    for each Ii ∈ C do
        ci = 0 // Initial counts for each item set are 0
    for each ti ∈ D do // Second scan count
        for each Ii ∈ C do
            if Ii ⊆ ti then
                ci = ci + 1
    for each Ii ∈ C do
        if ci >= (s × |D|) then
            L = L ∪ Ii
Apriori Algorithm
Input
I // Item set
D // Database of transactions
s // Support

Output
L // Large item sets

Apriori Algorithm
k = 0 // k is used as the scan number
L = 0
C1 = I // Initial candidates are the single items
repeat
    k = k + 1
    Lk = 0
    for each Ii ∈ Ck do
        ci = 0 // Initial counts for each item set are 0
    for each tj ∈ D do
        for each Ii ∈ Ck do
            if Ii ⊆ tj then
                ci = ci + 1
    for each Ii ∈ Ck do
        if ci >= (s × |D|) then
            Lk = Lk ∪ Ii
    L = L ∪ Lk
    Ck+1 = Apriori-Gen(Lk)
until Ck+1 = 0
Input
Li-1 // Large item sets of size i-1
Output
Ci // Candidates of size i
Apriori-Gen algorithm
Ci = 0
for each Ii ∈ Li-1 do
    for each Jj ∈ Li-1, Jj ≠ Ii do
        if i-2 of the elements in Ii and Jj are equal then
            Ci = Ci ∪ (Ii ∪ Jj)
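The level-wise loop and candidate generation above can be sketched together in Java. This is a simplified, assumed illustration: the join step unions any two frequent (k-1)-sets that differ in one item and then prunes by the Apriori property, which yields the same candidates as the prefix join described above. All names are invented for the sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AprioriDemo {

    static long count(List<Set<String>> db, Set<String> itemset) {
        return db.stream().filter(t -> t.containsAll(itemset)).count();
    }

    // Apriori-Gen: join frequent (k-1)-sets whose union has size k, then
    // prune any candidate with an infrequent (k-1)-subset.
    static Set<Set<String>> candidates(Set<Set<String>> prevFrequent) {
        Set<Set<String>> cands = new HashSet<>();
        for (Set<String> a : prevFrequent)
            for (Set<String> b : prevFrequent) {
                Set<String> joined = new HashSet<>(a);
                joined.addAll(b);
                if (joined.size() == a.size() + 1) cands.add(joined);
            }
        cands.removeIf(c -> c.stream().anyMatch(item -> {
            Set<String> sub = new HashSet<>(c);
            sub.remove(item);
            return !prevFrequent.contains(sub);
        }));
        return cands;
    }

    public static Set<Set<String>> apriori(List<Set<String>> db, double minSupp) {
        // Scan 1: frequent singletons.
        Set<Set<String>> level = new HashSet<>();
        db.stream().flatMap(Set::stream).distinct()
          .map(Set::of)
          .filter(s -> count(db, s) >= minSupp * db.size())
          .forEach(level::add);
        Set<Set<String>> frequent = new HashSet<>(level);
        // Scan k: count candidates generated from level k-1, keep frequent ones.
        while (!level.isEmpty()) {
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> c : candidates(level))
                if (count(db, c) >= minSupp * db.size()) next.add(c);
            frequent.addAll(next);
            level = next;
        }
        return frequent;
    }
}
```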

AR-Gen algorithm
Input
D // Database of transactions
I // Items
L // Large item sets
s // Support
α // Confidence
Output
R // Association rules satisfying s and α
AR-Gen algorithm
R = 0
for each l ∈ L do
    for each x ⊂ l such that x ≠ 0 do
        if support(l) / support(x) >= α then
            R = R ∪ {x -> (l − x)}
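The AR-Gen loop translates almost directly into code. In this assumed sketch the non-empty proper subsets of one frequent item set are enumerated by bitmask, and a rule string is emitted whenever the confidence threshold is met (all names illustrative).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleGenDemo {

    static long count(List<Set<String>> db, Set<String> s) {
        return db.stream().filter(t -> t.containsAll(s)).count();
    }

    // For every non-empty proper subset X of a frequent item set l, emit the
    // rule X -> (l \ X) when supp(l)/supp(X) meets the confidence threshold.
    public static List<String> rules(List<Set<String>> db,
                                     Set<String> frequentSet, double minConf) {
        List<String> out = new ArrayList<>();
        List<String> items = new ArrayList<>(frequentSet);
        for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
            Set<String> x = new HashSet<>();
            for (int i = 0; i < items.size(); i++)
                if ((mask & (1 << i)) != 0) x.add(items.get(i));
            double conf = (double) count(db, frequentSet) / count(db, x);
            if (conf >= minConf) {
                Set<String> y = new HashSet<>(frequentSet);
                y.removeAll(x);
                out.add(x + " -> " + y);
            }
        }
        return out;
    }
}
```

For example, over the baskets {bread, butter}, {bread, butter}, {bread}, the subset {butter} gives confidence 2/2 = 1.0 while {bread} gives 2/3, so at a 90% confidence threshold only "butter -> bread" survives.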
Result and Discussion
Patient Data Selection
In the database used for this thesis, the records of patients with hypertension and diabetes as
diseases were chosen initially. It is a selection of operational data from the Primary Health Centre patients
of some villages and contains information about the name, designation, age, address, disease particulars,
and duration of disease. In order to facilitate the KDD process, a copy of this operational data was drawn
and stored in a separate database.

Data Transformation

At first, 10000 patients were screened for hypertension, diabetes, kidney disease, heart disease and
stroke. Of these, 1000 cases with hypertension, diabetes or both were selected for a detailed case study.
Data smoothing was done by the clustering method.
Data Enrichment
In the example taken, if extra information about the chosen patients, such as their smoking and
drinking habits, is purchased, the data becomes more realistic. This is because smoking and drinking
habits influence the development of certain complications under consideration. Though these data may not
immediately alter the conclusive results of the study, they may be helpful in proposing other theories of
the development of complications and have scope for future study.

Name           Address         Sex     Age  Smoker  Alcoholic  Hypedur (hypertension duration)
Nagaya         Keeranur        Male    52   Yes     No         12 years
Shakilabanu    Pavattakudi     Female  57   No      No         15 years
Sudarvili      A. Thirumalam   Male    53   Yes     Yes        15 years
Palani         Pavattakudi     Male    58   No      No         13 years
Seethalakshmi  Pavattakudi     Male    53   No      No         9 years
Laskhmana      Pavattakudi     Male    47   No      No         14 years

Coding
Coding is a process wherein certain data are represented as codes in the records. For example, in our
data, the presence of a disease or complication is coded as “1” and its absence as “0”. There are several
advantages to such a coding process. One is the maintenance of secrecy, as only the data operator is aware
of the details of the coding. The second is that coding removes incorrect and inconsistent data and thus
prevents fraudulent information.
Name         Occupation  Place           Sex     Age  Hypertension  Diabetes  Duration  Wellcon
Rajendran    Farmer      Sirupuliyur     Male    56   1             0         15 years  1
Kumar        Labourer    Pavattakudi     Male    52   0             1         12 years  1
Rajathi      Labourer    Annathanapuram  Female  46   1             1         10 years  1
Suryamoothi  Farmer      Kaliyakdui      Male    49   0             1         11 years  1
Jagannathan  Farmer      Palayar         Male    45   1             0         10 years  1
Md Marzook   Labourer    Pavattakudi     Male    51   1             0         10 years  1
Senthamarai  Labourer    Keeranur        Female  49   1             0         10 years  1
Nagaya       Labourer    Keeranur        Male    43   0             1         12 years  1

Visualization
Visualization techniques are a very useful method of discovering patterns in data sets. They may be
used at the beginning of a data mining process to get a rough feeling for the quality of the data set, and also
later, where patterns are to be found. Interesting possibilities are offered by object-oriented three-
dimensional toolkits, such as Inventor, which enable the user to explore three-dimensional structures
interactively. Advanced graphical techniques in virtual reality enable people to wander through artificial
data spaces, while the historic development of data sets can be displayed as a kind of animated movie. These
simple methods can provide a wealth of information. An elementary technique that can be of great value is
the so-called scatter diagram. Scatter diagrams can be used to identify interesting subsets of the data sets so
that the rest of the data mining process can focus on them. There is a whole field of research dedicated to
the search for interesting projections of data sets, called projection pursuit.

Medical Database Description


Table 1 shows the real data from a survey conducted at Government Primary Health Centre,
Pavattakudi, and its hamlets. The database, drawn from the entire population of Pavattakudi, consists of
long-duration hypertensive and diabetic patients and takes into account their complications to generate
useful association rules. Table 1 contains the surveyed data particulars.
Particulars                                                        Total cases
Total cases surveyed                                               10000
No hypertension, no diabetes, no kidney disease,
no heart disease, no stroke                                        8870
Long duration (> 10 years) of diabetes alone                       380
Long duration (> 10 years) of hypertension alone                   450
Long duration of both diabetes and hypertension                    170
Kidney disease without diabetes or hypertension                    20
Heart disease without diabetes or hypertension                     90
Stroke without diabetes or hypertension                            20
Table 1 Real Time Surveyed Data
From the above table, the records are classified into hypertension and diabetes as diseases, and
heart disease, kidney disease and stroke as complications, as described in Table 2. It gives the particulars
about the complications in patients who are affected by the diseases under consideration.

S.No  Hypertension  Diabetes  Kidney disease  Heart disease  Stroke  None
1.    No            No        20              90             20      8870
2.    No            Yes       280             280            20      40
3.    Yes           No        90              360            190     90
4.    Yes           Yes       120             150            60      10
Table 2 Result of Survey Details
Support Level from Medical Expert Systems
Table 3 contains detailed information about patients with diabetes alone, hypertension alone or
both. They were sub-categorized based on the extent of control of these diseases, and kidney disease, heart
disease and stroke were evaluated in this group of patients. This database is obtained from the medical
expert system.
S.No  Hypertension       Diabetes           Kidney disease  Heart disease  Stroke
1.    No                 No                 0               +/-            0
2.    No                 Well controlled    +               +/-            0
3.    No                 Poorly controlled  ++              +              0/+
4.    Well controlled    No                 0               +              0/+
5.    Well controlled    Well controlled    +               +              0/+
6.    Well controlled    Poorly controlled  ++              ++             0/+
7.    Poorly controlled  No                 ++              ++             ++
8.    Poorly controlled  Well controlled    ++              ++/+++         ++
9.    Poorly controlled  Poorly controlled  +++             +++            +++
Table 3 Predicted end organ disease

The database containing the information on the subcategories of well controlled and poorly controlled
disease is shown in Appendix 3.

Support Level from the Surveyed Database

This support level also holds good for the surveyed particulars about the complications.
Table 4 deals with the support level of the complications that occurred among the above-said diseased
population.

S.No  Hypertension       Diabetes           Kidney disease  Heart disease  Stroke  None
1.    No                 No                 20              90             20      8870
2.    No                 Well controlled    100             160            Nil     400
3.    No                 Poorly controlled  180             120            20      Nil
4.    Well controlled    No                 Nil             22             70      700
5.    Well controlled    Well controlled    10              20             Nil     100
6.    Well controlled    Poorly controlled  40              50             10      Nil
7.    Poorly controlled  No                 90              140            120     2000
8.    Poorly controlled  Well controlled    300             20             20      Nil
9.    Poorly controlled  Poorly controlled  40              60             30      Nil
Table 4 Prevalence of end organ disease in the surveyed database
Study of Differential Link Analysis

When the Apriori algorithm is directly applied to the medical database under consideration, it
generates many rules with minimum support but good confidence. However, this approach does not
support effective use of memory or control of execution speed. Also, in the Apriori algorithm it is very
difficult to store the values directly; doing so occupies a large amount of memory. Hence this chapter
describes the use of a tree-traversal data structure and a new algorithm developed with the help of the
tree-traversal concept.

Need for Tree Traverse


When this Apriori algorithm is applied, the memory occupied by database processing is
very large. To reduce the memory storage requirements as well as the run time of the program, a desired
total support count is calculated.

This chapter introduces the new P-tree technique in order to reduce the memory storage
requirements as well as the execution time of the program. The first section discusses P-tree generation and
the occupation procedure.
The P-tree data structure
A P-tree is a set enumeration tree structure which helps to store partial counts for item sets. The
top, single-attribute level comprises an array of references to structures of the form shown below, one
for each column.

Identifier       Type
Support          int (4 bytes)
Child reference  Pointer to child P-tree node (4 bytes)

Each of these top-level structures is then the root of a sub-tree of the overall P-tree. The nodes
in each of these sub-trees are represented as structures of the form:

Identifier         Type
Node code          Array of short integers (2 bytes each)
Child reference    Pointer to child P-tree node (4 bytes)
Support            int (4 bytes)
Sibling reference  Pointer to sibling P-tree node (4 bytes)

Assuming a sample data set as shown below

Row Number  Columns
1           1 3 4
2           2 4 5
3           2 4 6

where D = 6 (number of columns/attributes) and N = 3 (number of records). Prior to commencement of
the generation process the P-tree will have a 6-element array as shown in Figure 1(a). The first row
will be stored in the P-tree as shown in Figure 1(b).
Similarly the second row (2, 4, 5) and the third row (2, 4, 6) are added to the P-tree. The last row
shares a common leading substring with an existing node in the P-tree. A dummy node is created to
ensure that the tree does not become “flat”. The final P-tree is shown in Figure 2.

Figure 2 P-tree in its final form
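The essential behaviour can be shown with a much-simplified sketch: the real P-tree compresses runs of items into node codes and keeps sibling links, but the core idea, that rows sharing a leading substring share a path whose nodes accumulate partial support counts, survives in a plain prefix tree over the sample rows above. The class and method names here are invented for the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified prefix-tree stand-in for the P-tree: each node holds a partial
// support count; rows that share a leading substring share a path.
public class PTreeSketch {
    final Map<Integer, PTreeSketch> children = new TreeMap<>();
    int support = 0;

    // Insert one record (attribute ids in ascending order), incrementing the
    // partial count of every node along its path.
    public void insert(List<Integer> row) {
        PTreeSketch node = this;
        for (int item : row) {
            node = node.children.computeIfAbsent(item, k -> new PTreeSketch());
            node.support++;
        }
    }

    // Partial support of a leading substring, 0 if no row starts with it.
    public int supportOfPrefix(List<Integer> prefix) {
        PTreeSketch node = this;
        for (int item : prefix) {
            node = node.children.get(item);
            if (node == null) return 0;
        }
        return node.support;
    }
}
```

Inserting the three sample rows (1 3 4), (2 4 5) and (2 4 6) gives the prefix (2 4) a partial support of 2, since the last two rows share that leading substring.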

Improved Procedure for Apriori Algorithm (Different Link Analysis)


Differential Link Analysis determines association rules by first creating a T-tree
from the P-tree. The T-tree is generated in an Apriori manner. There are a number of features of
the P-tree table that enhance the efficiency of this process.
1. The first pass of the P-tree calculates supports for singletons, and thus the entire P-tree
must be traversed. However, on the second pass, when calculating the support for “doubles”,
the top level in the P-tree can be ignored, i.e. the process can be started from index 2. Further,
at the end of the previous pass the top level (cardinality = 1) part of the table can be deleted.
Consequently, as the T-tree grows in size, the P-tree table shrinks.
2. To prevent double counting, on the first pass of the P-tree, the elements in the top-level array
of the T-tree that correspond to the column numbers in node codes (not parent codes) are
updated. On the second pass, for each P-tree table record found, only those branches in the T-
tree that emanate from a top-level element corresponding to a column number represented by
the node code (not the parent code) are considered. Once the appropriate branch has been
located, the process proceeds down to level 2, and those elements that correspond to the
column numbers in the union of the parent and node codes are updated. This process is then
repeated for all subsequent levels until there are no more levels in the T-tree to consider.

Design of Differential Link Analysis


Input design
The medical database is converted to suit the requirements of the software. Data records are
converted into numerical values for the individual field names. The following are the values of the medical
database fields taken in this thesis. Taking diabetes, patients are classified into two categories: well
controlled and poorly controlled. For easy substitution, the category of person and disease are combined
into well controlled diabetes and poorly controlled diabetes. A similar rule is applied for hypertension too.
Table 5 represents the values for the fields of the database.
Field Name Value
Well controlled Diabetes 1
Poorly controlled Diabetes 2
Well controlled hypertension 3
Poorly controlled hypertension 4
Heart disease 5
Kidney disease 6
Stroke 7
Table 5 Input data value for the software
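Under this coding, each patient record reduces to a set of integers ready for item set mining. A minimal sketch of the conversion follows; the class and method names are illustrative, not part of the thesis software.

```java
import java.util.Collection;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class CodingDemo {

    // Field-name-to-code mapping from Table 5.
    static final Map<String, Integer> CODES = Map.of(
            "Well controlled Diabetes", 1,
            "Poorly controlled Diabetes", 2,
            "Well controlled hypertension", 3,
            "Poorly controlled hypertension", 4,
            "Heart disease", 5,
            "Kidney disease", 6,
            "Stroke", 7);

    // A patient record becomes the sorted set of codes for the conditions present.
    public static SortedSet<Integer> encode(Collection<String> conditions) {
        SortedSet<Integer> out = new TreeSet<>();
        for (String c : conditions) out.add(CODES.get(c));
        return out;
    }
}
```

For example, a patient with poorly controlled diabetes and heart disease is encoded as the item set {2, 5}.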

Implementation of Differential Link Analysis

Importance of Methods
Differential Link Analysis Application classes have the following basic form

public class < CLASS_NAME > {

    /** Main method */
    public static void main(String[] args) throws IOException {
        // Create instance of class PartialSupportTree
        PartialSupportTree newAprioriTFP = new PartialSupportTree(args);
        newAprioriTFP.inputDataSet();
        // If desired either: (1) keep the data set as it is (do no
        // preprocessing), (2) reorder the data set according to the
        // frequency of single attributes:
        newAprioriTFP.idInputDataOrdering();
        newAprioriTFP.recastInputData();
        // or (3) reorder and prune the input data
        newAprioriTFP.idInputDataOrdering();
        newAprioriTFP.recastInputDataAndPruneUnsupportedAtts();
        newAprioriTFP.setNumOneItemSets();
        // Mine data and produce T-tree
        double time1 = (double) System.currentTimeMillis();
        newAprioriTFP.createTotalSupportTree();
        newAprioriTFP.outputDuration(time1, (double) System.currentTimeMillis());
        // Generate ARs
        newAprioriTFP.generateARs();
        System.exit(0);
    }
}
Figure 10.1 Application classes of the software
Some output is always generated, such as: (1) the input parameters and start settings and (2)
“milestones” during processing. Additional output statements can be included in application
classes. The available additional output options are as follows.

1. outputDataArray(): Outputs the input data.

2. outputDataArraySize(): Outputs the size of the input data (number of records and columns,
number of elements and overall density).

3. outputDuration(double time1, double time2): The elapsed time, in seconds, between time1
and time2. Typically used to measure processing time.
double time1 = (double) System.currentTimeMillis();
// Program statements
< INSTANCE_NAME >.outputDuration(time1, (double) System.currentTimeMillis());

4. outputPtreeTable(): Outputs the P-tree table (for computational convenience the P-tree is cast
into a table prior to being processed further).

5. outputPtreeTableStats(): Outputs parameters for the P-tree table --- storage and number of nodes.

6. outputTtree(): Lists the T-tree nodes.

7. outputTtreeStats(): Outputs parameters for the T-tree --- storage, number of nodes and number of
updates.

8. outputNumFreqSets(): The number of identified frequent sets.

9. outputNumUpdates(): The number of updates used to build the P-tree (a measure of the
amount of “work done”).

10. outputFrequentSets(): Lists the identified frequent sets contained in the T-tree, and their
associated support values.

11. outputStorage(): The storage requirements, in bytes, for the T-tree.

12. outputNumRules(): The number of discovered association rules.

13. outputRules(): Lists the association rules and their associated confidence values.

14. outputRulesWithReconversion(): Lists the association rules but using reconversion (only
appropriate where attributes have been reordered).

Note that the first eleven of the above output methods are instance methods contained in
either the PartialSupportTree class or its parent classes. The last three are instance
methods of the RuleList class and thus must be called using an instance of that class. An instance of
the RuleList class is created during processing and may be accessed using the
getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example,
the outputRules() method would be invoked as follows:
<INSTANCE_NAME>.getCurrentRuleListObject().outputRules();
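As an illustration of this accessor pattern, the sketch below uses minimal stand-in classes. The bodies of RuleList and TotalSupportTree here are invented for demonstration and are not the real LUCS-KDD implementations; only the method names and the chained-call shape mirror the text above:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the RuleList class: holds generated rules and lists them.
class RuleList {
    private final List<String> rules = new ArrayList<>();
    void add(String rule) { rules.add(rule); }
    // Analogue of outputRules(): print each rule with its confidence value.
    void outputRules() { rules.forEach(System.out::println); }
}

// Stand-in for TotalSupportTree: creates a RuleList during processing and
// exposes it through getCurrentRuleListObject().
class TotalSupportTree {
    private final RuleList currentRuleList = new RuleList();
    RuleList getCurrentRuleListObject() { return currentRuleList; }
    void generateRules() {
        // Placeholder for real rule generation; the rule text is hypothetical.
        currentRuleList.add("diabetes -> heartDisease (conf 0.82)");
    }
}

public class AccessorDemo {
    public static void main(String[] args) {
        TotalSupportTree tree = new TotalSupportTree();
        tree.generateRules();
        // The chained accessor call, as in the invocation shown above.
        tree.getCurrentRuleListObject().outputRules();
        // prints: diabetes -> heartDisease (conf 0.82)
    }
}
```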

Output analysis
With the help of the software developed in Java, association rules are generated at various
minimum support and minimum confidence levels. For example, with a support threshold of 5% and a
confidence threshold of 20%, 119 rules are generated; from these, the best rules are chosen, so that
at the end many rules are available. The following are the outputs for various minimum support and
minimum confidence threshold values. Each result contains output information such as the number of
T-tree levels, bytes of storage, number of tree nodes, number of frequent item sets, generation time
(in both seconds and minutes) and the number of association rules.

1. Minimum support threshold = 5.0% (5.0 records) and Minimum confidence is 20%
Levels in T-tree = 5
Generation time = 0.55 seconds (0.009166666666666667 mins)
T-TREE STATISTICS
Number of frequent sets = 39
Number of Nodes created = 47
Number of Updates = 133
T-tree Storage = 596 (Bytes)
GENERATE Association Rules: Number of rules = 119

2. Minimum support threshold = 5.0% (5.0 records) and minimum confidence is 50%
Levels in T-tree = 5
Generation time = 0.66 seconds (0.011000000000000001 mins)
T-TREE STATISTICS
Number of frequent sets = 39
Number of Nodes Created = 47
Number of Updates = 133
T-tree Storage = 596 (Bytes)
GENERATE Association Rules: Number of rules = 57

3. Minimum support threshold = 2.0% (2.0 records) and Minimum confidence is 90%
Levels in T-tree = 6
Generation time = 0.49 seconds (0.008166666666666666 mins)
T-TREE STATISTICS
Number of frequent sets = 48
Number of Nodes created = 55
Number of Updates = 143
T-tree Storage = 712 (Bytes)
GENERATE Association Rules: Number of rules = 23
Rule generation with a minimum support threshold of 5% and a minimum confidence limit of 50%
was performed by the developed software. This threshold combination yields 57 rules. Most of the
rules have high confidence levels, so the software can extract the hidden knowledge with greater
support and greater confidence.
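The support and confidence thresholds above follow the standard definitions: the support of a rule X → Y is the fraction of records containing X ∪ Y, and its confidence is support(X ∪ Y) / support(X). A minimal self-contained sketch of these computations; the toy transaction database and attribute names are invented for illustration and are not drawn from the thesis data:

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SupportConfidenceDemo {
    // Count how many records contain every item in 'pattern'.
    static long count(List<Set<String>> db, Set<String> pattern) {
        return db.stream().filter(t -> t.containsAll(pattern)).count();
    }

    public static void main(String[] args) {
        // Toy database of patient attribute sets (illustrative only).
        List<Set<String>> db = List.of(
            Set.of("diabetes", "heartDisease"),
            Set.of("diabetes", "heartDisease", "kidneyDisease"),
            Set.of("hypertension", "heartDisease"),
            Set.of("diabetes"),
            Set.of("hypertension"));

        Set<String> body = Set.of("diabetes");                  // rule body X
        Set<String> rule = Set.of("diabetes", "heartDisease");  // X union Y

        double n = db.size();
        double support = count(db, rule) / n;                        // fraction with X and Y
        double confidence = (double) count(db, rule) / count(db, body); // support(XY)/support(X)

        System.out.printf(Locale.ROOT, "support=%.2f confidence=%.2f%n", support, confidence);
        // prints: support=0.40 confidence=0.67
    }
}
```

A rule is reported only when both values clear the chosen thresholds; here, with a minimum support of 5% and a minimum confidence of 50%, this rule would be kept.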
Results Interpretation
Based on the analysis of the output generated from the association rules, the following observations
are made. They are in close agreement with the results generated by expert medical systems. They are
as follows.
a) Patients with stroke developing as a complication of diabetes or hypertension have every
chance of developing other complications, namely kidney disease or heart disease.
b) Heart disease is the most common complication in patients with diabetes and/or hypertension
irrespective of the presence or absence of other complications.
c) Patients with poorly controlled diabetes have the highest chance of developing one or all end
organ complications.
d) Patients with poorly controlled hypertension have a definite possibility of developing heart
disease and stroke.
e) Patients with poorly controlled hypertension have a greater chance of developing heart disease
when compared to any other complication.
f) Poorly controlled diabetics have a higher possibility of developing kidney disease than any
other complication.
g) Patients having kidney disease as a complication of diabetes and/or hypertension are almost
certain to develop heart disease or already have pre-existing heart disease.
h) The combination of hypertension and diabetes in a patient, irrespective of the level of control,
increases the probability of developing heart disease. Poor control of either or both diseases
further increases the risk.
i) Patients having diabetes are at a higher risk of developing kidney disease than non-diabetics,
irrespective of their level of control.
j) Patients with hypertension are at a higher risk of developing heart disease than the control
population, irrespective of the level of control.

Goodness of testing
The chi2 measure is well known in statistics. It is often used to measure the difference
between the observed joint distribution of two variables and the distribution that would be expected
if they were independent, i.e. how strongly the two variables depend on each other. As defined in
statistics, this measure contains the number of cases from which it is computed as a factor. This is
not very appropriate if one wants to evaluate rules that can have varying support. Hence this factor
is removed by simply dividing the measure by the number of item sets (the total number, i.e., with
the names used above, the number of sets in X). With this normalization, the chi2 measure can assume
values between 0 (no dependence) and 1 (very strong dependence). For the values obtained from the
above implementation of the software, a lower bound can be set on the strength of the dependence
(0 = no dependence, 100 = perfect dependence). Only those rules are selected in which the head
depends on the body with a high degree of dependence.
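This normalization can be sketched for the common 2×2 case: for a rule X → Y over n records, chi2 is computed from the contingency table of X and Y, and dividing it by the number of records yields a value in [0, 1] (the phi-squared coefficient). The counts below are hypothetical, and the per-record normalization is one standard choice rather than necessarily the exact one used by the thesis software:

```java
public class ChiSquareDemo {
    // Normalized chi-square for rule X -> Y from raw counts: n records in total,
    // nX containing X, nY containing Y, nXY containing both. Result lies in [0, 1].
    static double normalizedChi2(long n, long nX, long nY, long nXY) {
        // Cells of the 2x2 contingency table for X and Y.
        double a = nXY;               // X and Y
        double b = nX - nXY;          // X, not Y
        double c = nY - nXY;          // Y, not X
        double d = n - nX - nY + nXY; // neither
        double chi2 = n * Math.pow(a * d - b * c, 2)
                / ((a + b) * (c + d) * (a + c) * (b + d));
        // Dividing by the number of cases removes the case-count factor.
        return chi2 / n;
    }

    public static void main(String[] args) {
        java.util.Locale loc = java.util.Locale.ROOT;
        // Hypothetical counts: 100 records, 40 with X, 35 with Y, 30 with both.
        System.out.printf(loc, "%.3f%n", normalizedChi2(100, 40, 35, 30)); // 0.469
        // Perfect dependence: Y occurs exactly when X does -> 1.0
        System.out.printf(loc, "%.3f%n", normalizedChi2(100, 40, 40, 40)); // 1.000
        // Independence: P(Y|X) = P(Y) -> 0.0
        System.out.printf(loc, "%.3f%n", normalizedChi2(100, 40, 50, 20)); // 0.000
    }
}
```

Multiplying the result by 100 gives the 0–100 dependence scale mentioned above, against which a lower bound can be applied when selecting rules.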
Conclusion

The use of the Apriori algorithm on the medical database under consideration yields a very large
number of association rules, and it is very difficult to sort out which of them are important and
interesting. To overcome this difficulty, an improved version of Apriori has been used in this
thesis: "Differential Link Analysis." Differential linking leads to progressive mining of refined
knowledge from the vast database. It has interesting applications in knowledge discovery in medical
databases and similar fields. From the use of Differential Link Analysis on this particular medical
database, it is clearly evident that not just the presence of hypertension or diabetes is important
in the development of complications; the extent of control of these diseases and their duration are
the key determinants. The results of the association rules generated using differential linking lie
parallel to the conclusions predicted by expert medical systems.

In this thesis, association rules at multiple levels are mined using the partial support and total
support tree techniques. The differential linking technique used to mine rules from the database
therefore does not leave out any of the intrinsic details (e.g. the extent of control of diabetes
and hypertension). Use of the P-tree and T-tree also has the benefit of decreasing both the memory
space occupied and the execution time.
Future Scope
1. In the same medical database, the differential linking technique can be applied to predict end
organ complications in diabetic and hypertensive patients who smoke and drink.
2. The mutation concept, which is a slight modification of a data field in the database to generate
more interesting association rules, can be used. For example, the level of control may vary over
time, such as poorly controlled for some period and better controlled over another. The
complications in such individuals can be predicted using this concept.
3. Crossover analysis, wherein this database and a similar related database are combined to
generate newer association rules. For example, a related database recording a positive family
history or use of tobacco can be taken to predict the incidence of complications.

Similarly, new algorithms using clustering techniques, concept hierarchies and multiple
minimum supports can be used to generate other interesting association rules from the same database.
The above techniques can be extended for use in databases in other business fields which have
interrelated and intrinsic details that are crucial in determining the output.

