Enhanced Algorithms for Mining Optimized Positive and Negative Association Rules from a Cancer Dataset

I. Berin Jeba Jingle
Assistant Professor, Department of CSE
Noorul Islam University, Nagercoil, India
[email protected]

Dr. J. Jeya A. Celin
Assistant Professor, Department of Computer Application
Hindustan College of Arts and Science, Chennai, India
[email protected]Abstract The association rule (AR) mining is
a technique of data mining which is used to
Recently researchers focus many research analyze high-dimensional relational data. The
challenges on association rule mining. association rule mining discovers interesting
Several existing algorithms has been relationship hidden in a large dataset. The
implemented in order to assure these association rule techniques are implemented
challenges but many such algorithms cause effectively in application domain such as
data loss, lack of efficiency and accuracy market basket analysis, intrusion detection,
which also results in redundant rules. The telecommunication and diagnosis decision
major issue in using this analytic optimizing support.
method is specifying the activist initialization The proposed work proves that from
limit were the quality of the association rule this infrequent itemset both positive and
relays on. The proposed work has three negative association rule can be mined more
methods to mine an optimized PAR and effectively of the form X¬Y, ¬XY, ¬
NAR. The first method is the X ¬ Y. The proposed work chooses the
Apriori_AMLMS (Accurate multi level medical dataset for the experimental analysis.
minimum support). The next method is the The proposed system intention is 1) generate
GPNAR (Generating Positive and Negative Frequent and infrequent itemset very
Association Rule) algorithm to mine the PAR accurately 2) mining both PAR and NAR
and NAR from frequent itenset, PAR and from the frequent and infrequent itemset. 3)
NAR from infrequent itemset. The third Generate an optimized PAR and NAR.
method is to obtain an optimized PAR and
NAR using the decidedly efficient swarm The major parameters in ABC algorithm are:
intelligence algorithm called the Advance Number of Food Sources, Limit and a
ABC (Artificial Bee Colony) algorithm uniform random number. The uniformly
which proves that an efficient optimized generated random number does not depend
Positive and negative rule can be mined. on any component as it is randomly
Keywords: Data mining, Association Rule generated during each iteration. The number
mining, Apriori algorithm, Accurate Multi of food sources analogous to population size.
level and Multi support, advance ABC Performance of ABC algorithm depends on
algorithm, GPNAR. size of initial population. If population size
increases then after certain limit the
1. Introduction performance of ABC deteriorates sharply.
The parameter limit also changes with
Data mining deals with the process variation in population size as it is product of
of mining unseen projecting information dimension and population size. Thus change
from massive databases. Recent technology in limit has some effect on performance of
helps medical, market, weather forecasting advance ABC algorithm.
analysis to focus requisite information in
their data warehouse. The data mining mines 2. Related Works
data’s from different sources like image, text,
and web. The data mining has several Idheba Mohammad Ali et al [21]
techniques like association rule, suggested PNAR and IMLMS an approach
classification, decision tree, clustering, for mining positive and negative association
prediction, etc. for extracting the valuable rule from transaction dataset. This approach
information from this data sources. is integrated by two algorithms. The positive
negative association rule (PNAR) algorithm
Idheba Mohammad Ali et al. [21] suggested PNAR_IMLMS, an approach for mining positive and negative association rules from a transaction dataset. The approach integrates two algorithms: the positive and negative association rule (PNAR) algorithm and the Interesting Multiple Level Minimum Support (IMLMS) algorithm. The IMLMS algorithm generates the frequent and infrequent itemsets, and the PNAR algorithm then generates positive and negative association rules from them. The approach performs significantly better than the previous methodologies but lacks efficiency and accuracy and is time consuming.

Xiangjun Dong [22] proposes an enhanced Apriori-IMLMS (Interesting MLMS (Multiple Level Minimum Supports)) algorithm, which is designed for pruning uninteresting infrequent and frequent itemsets discovered by the MLMS model. One of the pruning measures used in the IMLMS model, interest, can be described as follows: for two disjoint itemsets A and B, if interest(A,B) = |s(A∪B) − s(A)s(B)| < mi, then A∪B is recognized as an uninteresting itemset and is pruned, where s(·) is the support and mi a minimum interestingness threshold. This measure, however, makes it difficult for users to set the value mi, because interest(A,B) depends strongly on the values of s(·). That paper therefore proposes a new measure, MCS (minimum correlation strength), as a substitute. MCS, which is based on the correlation coefficient, performs better than interest, and it is very easy for users to set its value; the theoretical analysis and experimental results show the validity of the new measure.
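The interest-based pruning test quoted above is simple to state in code. The following is a minimal sketch of that test only, not the cited authors' implementation; the toy transactions and the threshold mi are made-up values for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_uninteresting(a, b, transactions, mi):
    """IMLMS pruning test: prune A∪B when
    interest(A,B) = |s(A∪B) - s(A)s(B)| < mi."""
    s_ab = support(a | b, transactions)
    interest = abs(s_ab - support(a, transactions) * support(b, transactions))
    return interest < mi

# Hypothetical toy data, for illustration only.
transactions = [{"tumor", "bone"}, {"tumor"}, {"bone"}, {"tumor", "bone"}]
print(is_uninteresting({"tumor"}, {"bone"}, transactions, mi=0.02))  # False
```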
Zhendong Niu et al. [23] proposed the PNAR_MLMS algorithm to mine infrequent itemsets. In previous work, the MLMS model had been proposed to discover both frequent and infrequent itemsets simultaneously by using multiple level minimum supports. Their paper combines the correlation coefficient and minimum confidence, and a corresponding algorithm, PNAR_MLMS, is proposed to generate PNARs correctly from the frequent and infrequent itemsets discovered by the MLMS model. The experimental results show that the measure and the algorithm are effective.

Azadeh Soltani et al. [2] proposed CARM (Confabulation-Inspired Association Rule Mining), an algorithm which generates frequent and infrequent itemsets. The main achievement of this model is finding association links between attainment and rule extraction; the extracted rules are then executed by deriving the weightage of these communication links in a second phase.

Li-Min Tsai et al. [3] proposed GNAR (Generalized Negative Association Rules), an improved algorithm which shows that negative rules are as imperative as positive rules. It helps the user make quick decisions when analysing which association rule is best. The advantage of this algorithm is cost and time reduction, but it lacks accuracy and efficiency.

Christian Hidber [4] introduced CARMA (Continuous Association Rule Mining Algorithm), which processes large online datasets with a two-phase scan of the transactions. The algorithm consists of two passes: the first pass continuously constructs a lattice of potentially large itemsets, and the second pass continuously removes itemsets that fall below a user-specified threshold value.

Xiangjun et al. [5] suggested another algorithm, the IMLMS model, which generates the frequent and infrequent itemsets using a minimum correlation coefficient. This algorithm is defined so that the interestingness of the rules is high and the minimum support value is easy to set, but it lacks accuracy in the generation of the frequent and infrequent itemsets.

The MLMS algorithm mines the FIS and inFIS itemsets, but the discovered patterns are not very interesting and are noisy, so they require pruning. An existing method therefore used the modified Wu's pruning strategy with IMLMS [7], an algorithm designed to discover interesting frequent and infrequent patterns. The next existing method rectifies the interest measure and uses another measure, Minimum Correlation Strength (MCS) [4], based on the correlation coefficient; its performance is better than that of interest, users find its values easy to set, and ρ(A,B) is calculated instead of interest(A,B). The performance improves, but this method still lacks accuracy and efficiency. The AMLMS-GA [18] generates the accurate frequent and infrequent itemsets and later mines positive and negative association rules from them.

3. The Proposed System
The research proposes three algorithms in order to generate the optimized positive and negative association rules that are eventually applied in the area of association rule mining. The proposed work chooses a medical dataset from the UCI machine learning repository for the analysis of the proposed algorithms. There are three phases: 1) the Apriori_AMLMS algorithm, 2) the GPNAR algorithm, and 3) the Advance ABC algorithm.

3.1 The Proposed Architecture

Figure 1: Overall proposed architecture

The architecture shows that the proposed work has three contributions. The first contribution generates the accurate frequent and infrequent itemsets based on the user-defined threshold value. The second contribution generates the positive and negative association rules from the generated frequent and infrequent itemsets. The third contribution generates optimized positive and negative association rules.
3.2 Apriori_AMLMS Algorithm

The proposed Apriori_AMLMS [19][20] algorithm uses the user-defined minimum support threshold values to generate the frequent and infrequent itemsets. The transactional medical dataset is first transformed into a decision table in the preprocessing step. The data are arranged in the decision table, which contains the conditional and decisional attributes; the attributes are compared with their neighbouring attributes and arranged hierarchically in priority order.

Figure 2: Apriori_AMLMS architecture

The input of the algorithm is the dataset and the user-defined threshold values, and the output is the frequent and infrequent itemsets. For each itemset X the support is calculated: if suppo(X) ≥ minisuppo(number(X)), then X is a frequent itemset; if suppo(X) < minisuppo(number(X)), then X is an infrequent itemset, where minisuppo(k) is the minimum support assigned to itemsets of size k. The generated frequent and infrequent itemsets are collected in a hash table: the hash map stores each itemset with an index and a value, and the index denotes whether the itemset is frequent or infrequent. The FIS and inFIS generated through this algorithm are very accurate.
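To make the frequent/infrequent split concrete, here is a minimal sketch of the support test and hash-map bookkeeping described above. It assumes, as in the MLMS model, one minimum support per itemset size; the brute-force candidate enumeration, toy dataset, thresholds, and function names are illustrative and are not the paper's implementation.

```python
from itertools import combinations

def apriori_amlms_sketch(transactions, minisuppo, max_len=3):
    """Split itemsets into frequent (FIS) and infrequent (inFIS).

    transactions: list of sets of items.
    minisuppo:    dict mapping itemset size k -> minimum support ms(k).
    Returns a hash map: itemset -> ("FIS" or "inFIS", support).
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    table = {}
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            cand_set = frozenset(cand)
            suppo = sum(1 for t in transactions if cand_set <= t) / n
            if suppo == 0:          # never occurs, so it is not stored
                continue
            label = "FIS" if suppo >= minisuppo[k] else "inFIS"
            table[cand_set] = (label, suppo)
    return table

# Toy data for illustration only.
transactions = [{"tumor", "bone"}, {"tumor", "chemo"},
                {"bone"}, {"tumor", "bone", "chemo"}]
thresholds = {1: 0.5, 2: 0.4, 3: 0.3}   # ms(k) per itemset size
for itemset, (label, s) in apriori_amlms_sketch(transactions, thresholds).items():
    print(sorted(itemset), label, round(s, 2))
```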
3.3 GPNAR Algorithm

The second contribution is the GPNAR [20] (Generating Positive and Negative Association Rules) algorithm. It mines the positive and negative association rules based on the user-defined minimum confidence (miniconfi) threshold value. The main measures in association rule mining are confidence, support, correlation, comprehensibility, and time. The correlation gives the interestingness of the positive and negative association rules, and the comprehensibility describes the clarity of a rule. A valid association rule is of the form X → Y, where X ∈ {A, ¬A}, Y ∈ {B, ¬B}, X∪Y ⊂ I and X∩Y = ∅, with I = {i1, i2, i3, ..., in}. Consider:

a) suppo(X∪Y) ≥ minisuppo,
b) suppo(X) ≥ minisuppo,
c) suppo(Y) ≥ minisuppo,
d) lift(X → Y) > 1.

For the experimental analysis the UCI machine learning cancer dataset is chosen, which has 18 attributes and 1582 instances; it is a primary tumor domain obtained from the University Medical Centre. As a worked example, let

suppo(tumor) = 0.4, suppo(¬tumor) = 0.6,
suppo(cancer) = 0.6, suppo(¬cancer) = 0.4,
suppo(cancer ∪ tumor) = 0.04,
minisuppo = 0.2, miniconfi = 0.6.

From these values the PAR and NAR can be identified for the itemsets:

i) suppo(cancer ∪ tumor) = 0.04 < minisuppo, hence (tumor, cancer) is an infrequent itemset.
ii) confi(tumor ⇒ cancer) = suppo(cancer ∪ tumor)/suppo(tumor) = 0.04/0.4 = 0.1 < miniconfi, hence tumor ⇒ cancer cannot be a valid rule under the support-confidence framework.

Hence a negative rule is derived from the example:

i) suppo(tumor ∪ ¬cancer) = suppo(tumor) − suppo(cancer ∪ tumor) = 0.4 − 0.04 = 0.36 > minisuppo.
ii) confi(tumor ⇒ ¬cancer) = suppo(tumor ∪ ¬cancer)/suppo(tumor) = 0.36/0.4 = 0.9 > miniconfi, hence tumor ⇒ ¬cancer is a negative rule.
iii) lift(tumor ⇒ ¬cancer) = suppo(tumor ∪ ¬cancer)/(suppo(tumor) · suppo(¬cancer)) = 0.36/(0.4 × 0.4) = 2.25 > 1, which indicates a strong relationship in the negative rule: the presence of tumor goes together with the absence of cancer.
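The arithmetic of this worked example is easy to verify mechanically. The sketch below reproduces steps i)–iii) from the supports given above. The comprehensibility formula log(1+|Y|)/log(1+|X∪Y|) is a standard measure from the rule-mining literature and is an assumption here, since the paper's own comprehensibility formula did not survive extraction; for |Y| = 1 and |X∪Y| = 2 it evaluates to 0.6309, the value that also appears in Table 1.

```python
import math

# Supports taken from the worked example in Section 3.3.
suppo_tumor = 0.4
suppo_not_cancer = 0.4
suppo_tumor_and_cancer = 0.04
minisuppo, miniconfi = 0.2, 0.6

# i) support of the negative rule: tumor together with absence of cancer
suppo_tumor_not_cancer = suppo_tumor - suppo_tumor_and_cancer     # 0.36 > minisuppo

# ii) confidence of tumor => not-cancer
confi = suppo_tumor_not_cancer / suppo_tumor                      # 0.9 > miniconfi

# iii) lift of tumor => not-cancer
lift = suppo_tumor_not_cancer / (suppo_tumor * suppo_not_cancer)  # 2.25 > 1

# Assumed comprehensibility: log(1+|Y|)/log(1+|X∪Y|), here |Y|=1, |X∪Y|=2.
comp = math.log(1 + 1) / math.log(1 + 2)

print(f"support={suppo_tumor_not_cancer:.2f}, confidence={confi:.2f}, "
      f"lift={lift:.2f}, comprehensibility={comp:.4f}")
assert suppo_tumor_not_cancer > minisuppo and confi > miniconfi and lift > 1
```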
3.4 Advance ABC Algorithm

The Advance ABC algorithm is the third contribution of the proposed system. Its main goal is to extract optimized PAR and NAR. The ABC algorithm is a recent swarm intelligence algorithm and is categorized into four phases: 1) initialization, 2) employed bee, 3) onlooker bee, and 4) scout bee. Here the extracted PAR and NAR are given as input, and the rules are taken as the food sources. The steps involved in the Advance ABC algorithm are:

Step 1: Input the PAR and NAR and miniconfi; output the optimized PAR and NAR.
Step 2: Initialize the population, using ABC on the selected members to discover associations.
Step 3: Compute the fitness function of each association rule.
Step 4: Check the following condition: if (fitness function > miniconfi) then
Step 5: set Q = Q ∪ (x ⇒ y) /* the rules are added to the temporary variable Q */
  5.1 In memory, employed bees are placed on the food sources;
  5.2 Generate new offspring from older offspring after applying the onlooker bee phase;
  5.3 To find new food sources, send scout bees into the search space.
Step 6: UNTIL the requirements are met.

Figure 3: Architecture of the Advance ABC algorithm

The proposed Advance ABC algorithm is given in Figure 3. The process is initialized first, and the food sources are the rules.

3.4.1 Initialization Phase

The beginning process is the initialization. The positions of the food sources (CS/2, one per employed bee) are initialized first; they are drawn at random from a uniform distribution over the range (50, 500). A standard ABC initialization of this form is

x_jk = lb_k + rand(0, 1) × (ub_k − lb_k),

where ub_k is the upper bound and lb_k the lower bound of dimension k. Here x_jk is the quantity to be optimized for the jth employed bee on dimension k of the D-dimensional space, and the number of employed bees is denoted M.
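A minimal sketch of this initialization follows, assuming the standard ABC placement formula given above and the (50, 500) range from the text; the colony size is a made-up value. The same random placement is what a scout bee later reuses when an exhausted source is abandoned.

```python
import random

LB, UB = 50.0, 500.0        # range of the uniform distribution from the text

def init_food_sources(colony_size, dim, lb=LB, ub=UB, seed=1):
    """Place CS/2 food sources (one per employed bee) uniformly at random:
    x[j][k] = lb + rand(0,1) * (ub - lb)."""
    rng = random.Random(seed)
    num_sources = colony_size // 2           # CS/2 employed bees
    return [[lb + rng.random() * (ub - lb) for _ in range(dim)]
            for _ in range(num_sources)]

sources = init_food_sources(colony_size=10, dim=3)
print(len(sources), "food sources; first:", [round(v, 1) for v in sources[0]])
```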
3.4.2 Employed Bee Phase

Next, the objective function of each rule is defined. The objective (obj) value is based on the support, comprehensibility, and confidence of each rule: ObjVal = (suppo .* confi) .* comp, i.e., the element-wise product of the support, confidence, and comprehensibility over the rule set. The fitness value depends on the objective function; in the standard ABC formulation it is calculated as

fit_j = 1/(1 + obj_j) if obj_j ≥ 0, and fit_j = 1 + |obj_j| otherwise.

This obj value is considered as the food. To calculate the fitness value, a number of iterations is used, comparing each source with the neighbouring food sources; the iteration continues until the best optimized value is obtained, and the rules with the best fitness values are stored in a new memory space. The new position is given as

v_jk = x_jk + φ_jk (x_jk − x_nk),

where x_j is the employed bee, v_j is the new solution for x_j, x_n is a neighbour bee of x_j in the employed bee population, and φ_jk is randomly selected from the range [−1, 1]. From the above equation the values are selected and memorized as the best solution.
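The sketch below ties the pieces of this phase together: the composite objective, the standard-ABC fitness mapping assumed above, and the neighbour-based position update. The rule metrics and positions are made-up numbers; only the formulas mirror the text.

```python
import random

def obj_val(suppo, confi, comp):
    """ObjVal = suppo * confi * comp, as defined for each rule."""
    return suppo * confi * comp

def fitness(obj):
    """Standard ABC fitness mapping (assumed form)."""
    return 1.0 / (1.0 + obj) if obj >= 0 else 1.0 + abs(obj)

def employed_bee_step(x, neighbours, k, rng):
    """Candidate position: v[k] = x[k] + phi * (x[k] - xn[k]), phi in [-1, 1]."""
    xn = rng.choice(neighbours)
    phi = rng.uniform(-1.0, 1.0)
    v = list(x)
    v[k] = x[k] + phi * (x[k] - xn[k])
    return v

rng = random.Random(7)
rule_metrics = (0.36, 0.9, 0.6309)     # support, confidence, comprehensibility
obj = obj_val(*rule_metrics)
print("obj =", round(obj, 4), "fitness =", round(fitness(obj), 4))
print("new position:", [round(v, 2) for v in
      employed_bee_step([60.0, 70.0], [[55.0, 90.0], [80.0, 65.0]], k=1, rng=rng)])
```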
3.4.3 Onlooker Bee Phase

The onlooker is an unemployed honey bee. The employed bees complete their phase by calculating the optimized fitness values, and the onlooker bee then probabilistically picks a food source relying on that data. In the Advance ABC, an onlooker honey bee picks a food source according to probability values computed from the fitness values provided by the employed bees. For this purpose a fitness-proportionate selection strategy can be used; the algorithm uses the roulette wheel selection technique. The selection probability is given as

p_j = fit_j / Σ_{n=1..SN} fit_n,

where fit_j denotes the fitness value of source j and SN denotes the swarm size.
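Here is a small sketch of the roulette-wheel choice described above; the fitness values are made up, and random.choices is used as a convenience for fitness-proportionate sampling.

```python
import random

def onlooker_pick(fitnesses, rng):
    """Roulette wheel selection: source j is chosen with probability
    p_j = fit_j / sum(fit_n for n in 1..SN)."""
    total = sum(fitnesses)
    probabilities = [f / total for f in fitnesses]
    # random.choices implements the cumulative-wheel spin for us.
    return rng.choices(range(len(fitnesses)), weights=probabilities, k=1)[0]

rng = random.Random(3)
fitnesses = [0.83, 0.91, 0.64, 0.77]     # illustrative fitness values
picks = [onlooker_pick(fitnesses, rng) for _ in range(1000)]
for j, f in enumerate(fitnesses):
    print(f"source {j}: p={f / sum(fitnesses):.2f}, picked {picks.count(j)} times")
```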
3.4.4 Scout Bee Phase

The scout bee also comes under the unemployed bees; these bees pick their food randomly. A scout searches for new food sources at random, depending on an internal motivation or possible promising external clues. For instance, if a solution has been abandoned, the new solution discovered by the scout that was formerly the employed bee of that source takes its place. Hence the sources that are initially poor, or that have been made poor by exploitation, are abandoned, and this negative feedback behaviour arises to balance the positive feedback.

4. Experimental Results

The proposed methods choose the UCI machine learning cancer dataset for the experimental and performance analysis. This dataset has 18 attributes, 1634 instances, and 20 transactions. The experimental analysis first shows how the frequent and infrequent itemsets are generated with the support value (minisuppo) as input for the dataset DS; then how the PAR and NAR are mined from the FIS and inFIS; and finally how the optimized rules are mined from these PAR and NAR using the optimized ABC algorithm.

Figure 4: Frequent and infrequent itemsets

From the experimental results it is clearly seen that there is a gradual decrease in the frequent itemsets as the user-defined threshold value minisuppo is increased. This shows that candidate itemset generation decreases and database scanning decreases, and hence the space requirement automatically decreases as well. The time to generate the FIS and inFIS is also reduced using the proposed Apriori_AMLMS algorithm when compared with the existing algorithms. The generated PAR and NAR from the frequent and infrequent itemsets are of the form:
Rule                                   Support   Confidence   Comprehensibility   Lift
{tumor, bone} -> {treatment}           25%       85.5%        0.6309              1.432
{chemo, radiation} -> {tumor}          40%       93%          0.7253              1.342
{bone, tumor} -> {treatment}           40%       95.3%        0.6342              2.54
{tumor, CT scan, brain} -> {cancer}    65%       92.5%        0.6309              1.43

Table 1: PAR and NAR generated from the accurate FIS and inFIS

Figure 5: Valid positive and negative association rules generated using the proposed algorithm

The rules generated by the proposed algorithm are very numerous and many of them are redundant, so to achieve optimized high-confidence rules the optimized Advance ABC algorithm is applied. The result analysis proves that the achieved rules are well optimized; each optimized rule is selected on the basis of its fitness (optimized) value after several iterations.

Figure 6: Optimized PAR and NAR with confidence and fitness values

5. Conclusion

This paper concentrates on three proposed contributions. The research was made by analyzing the existing algorithms. The first contribution generates the frequent and the infrequent itemsets more accurately and in less time. The second method mines the PAR and NAR from the frequent and infrequent itemsets. The third method is the Advance ABC algorithm, which is used as the optimization algorithm and generates the optimized PAR and NAR. The generated rules have high confidence, more support, and high comprehensibility, and their quality is good compared with the existing algorithms. The proposed algorithms mine rules from the cancer dataset very accurately, and many hidden, useful data are mined through these contributions. The experimental analysis shows that the proposed algorithm is promising and efficient.

References:
[1] Nikky Suryawanshi Rai, Susheel Jain, Anurag Jain, "Mining Interesting Positive and Negative Association Rule Based on Improved Genetic Algorithm", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 5, No. 1, 2014.

[2] Azadeh Soltani and M.-R. Akbarzadeh-T, "Confabulation-Inspired Association Rule Mining for Rare and Frequent Itemsets", IEEE Transactions on Neural Networks and Learning Systems, 2014.

[3] Li-Min Tsai, Shu-Jing Lin, Don-Lin Yang, "Effective Mining of Generalized Negative Association Rules", IEEE Conference Publication, 2010.

[4] Christian Hidber, "Online Association Rule Mining", Proc. SIGMOD '99, Philadelphia, PA, ACM 1-58113-084-8/99/05, 1999.

[5] Li-Min Tsai, Shu-Jing Lin, Don-Lin Yang, "Effective Mining of Generalized Negative Association Rules", IEEE Conference Publication, 2010.

[6] Xiangjun Dong, Zhendong Niu, Donghua Zhu, Zhiyun Zheng, Qiuting Jia, "Mining Interesting Infrequent and Frequent Itemsets Based on MLMS Model", Springer, 2008, pp. 444-451.

[7] Xiangjun Dong, "Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength", Springer, 2011, pp. 437-443.

[8] K. Mythili, K. Yasodha, "A Pattern Taxonomy Model with New Pattern Discovery Model for Text Mining", International Journal of Science and Applied Information Technology, Vol. 1, No. 3, ISSN 2278-3083, July-August 2012.

[9] Charushila Kadu, Praveen Bhanodia, Pritesh Jain, "Hybrid Approach to Improve Pattern Discovery in Text Mining", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 6, June 2013.

[10] Zhen Hai, Kuiyu Chang, Jung-Jae Kim, Christopher C. Yang, "Identifying Features in Opinion Mining via Intrinsic and Extrinsic Domain Relevance", IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 6, June 2012.

[11] Spyros I. Zoumpoulis, Michail Vlachos, Nikolaos M. Freris, Claudio Lucchese, "Right-Protected Data Publishing with Provable Distance-Based Mining", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 19, November 2012.

[12] K. Aas and L. Eikvil, "Text Categorisation: A Survey", Technical Report NR 941, Norwegian Computing Center, 1999.

[13] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), 1994, pp. 478-499.

[14] Dipti S. Charjan, Mukesh A. Pund, "Pattern Discovery for Text Mining Using Pattern Taxonomy", International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 10, October 2013.

[15] J. Han and K.C.-C. Chang, "Data Mining for Web Intelligence", Computer, vol. 35, no. 11, Nov. 2002, pp. 64-70.

[16] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), 2000, pp. 1-12.

[17] Rupesh Dewang, Jitendra Agarwal, "A New Method for Generating All Positive and Negative Association Rules", International Journal on Computer Science and Engineering, Vol. 4, No. 4, ISSN 0975-3397, April 2012.

[18] I. Berin Jeba Jingle and J. Jeya A. Celin, "Markov Model in Discovering Knowledge in Text Mining", Journal of Theoretical and Applied Information Technology, vol. 70, no. 3, Dec. 2014, pp. 459-463.

[19] I. Berin Jeba Jingle and J. Jeya A. Celin, "Discovering Useful Patterns in Text Mining Using AMLMS-GA Algorithm", International Journal of Applied Engineering Research, vol. 10, no. 18, 2015, pp. 39763-39767.

[20] J. Jeya A. Celin, I. Berin Jeba Jingle, "Mining Useful Patterns from Text Using Apriori_AMLMS-MGA Algorithm", International Journal of Control Theory and Applications (IJCTA), Volume 10, Issue 18, pp. 81-89, Jan. 2017.

[21] Idheba Mohamad Ali O. Swesi, Azuraliza Abu Bakar, Anis Suhailis Abdul Kadir, "Mining Positive and Negative Association Rules from Interesting Frequent and Infrequent Itemsets", IEEE Conference Publication, pp. 650-655, 2012.

[22] Xiangjun Dong, Zhendong Niu, Xuelin Shi, Xiaodan Zhang, Donghua Zhu, "Mining Both Positive and Negative Association Rules from Frequent and Infrequent Itemsets", Springer, pp. 122-133, 2007.

[23] Xing Xue, Chen Yao, and Wang Yan-en, "Study on Mining Theories of Association Rules and Its Application", IEEE, 2010.