Framework For Comparison of Association Rule Mining Using Genetic Algorithm
Abstract: A new framework for comparing the literature on Genetic Algorithms for Association Rule Mining is proposed in this paper. Genetic Algorithms (GAs) have emerged as practical, robust optimization and search methods for generating accurate and reliable association rules. The main motivation for using GAs in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining. The objective of the paper is to compare the performance of different methods based on the methodology, the datasets used and the results achieved. It is shown that the modifications introduced in GAs increase the prediction accuracy and reduce the error rate in mining effective association rules. The time required for mining is also reduced.
I. INTRODUCTION
Today, enormous amounts of data are stored in files, databases and other repositories. Hence it becomes necessary to develop powerful means for the analysis and interpretation of such data and for the extraction of interesting knowledge to help in decision-making. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data. This need has led to the emergence of a field called data mining and knowledge discovery.
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. The KDD process comprises a few steps, leading from raw data collections to the formation of new knowledge. The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
Data selection: at this step, the data relevant to the analysis are decided on and retrieved from the data collection.
Data transformation: also known as data consolidation, a phase in which the selected data are transformed into forms appropriate for the mining procedure.
Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
This paper reviews works published in the literature in which the basic Genetic Algorithm is modified in some form to address Association Rule Mining. The rest of the paper is organized as follows. Section II briefly explains Association Analysis. Section III gives a preliminary overview of the Genetic Algorithm for Rule Mining. Section IV reviews the different approaches reported in the literature based on the Genetic Algorithm for mining association rules. Section V lists the inferences drawn from the comparison. Section VI presents the concluding remarks and suggestions for further research.
C. [New population] Create a new population by repeating the following steps until the new population is complete:
i. [Selection] Select two parent chromosomes from the population according to their fitness (the better the fitness, the bigger the chance of being selected).
ii. [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
iii. [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
iv.
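The selection, crossover and mutation steps listed above can be sketched as follows. This is only a minimal illustration, not the procedure of any surveyed method: it assumes bit-string chromosomes, roulette-wheel selection, single-point crossover and bit-flip mutation. The function names and the default values of `pc` and `pm` are hypothetical choices made for the example.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        # All fitnesses are zero: fall back to a uniform choice.
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]


def crossover(p1, p2, pc, rng):
    """Single-point crossover with probability pc; otherwise copy the parents."""
    if rng.random() < pc and len(p1) > 1:
        point = rng.randint(1, len(p1) - 1)
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]


def mutate(chrom, pm, rng):
    """Flip each bit (locus) independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in chrom]


def next_generation(population, fitness_fn, pc=0.8, pm=0.01, rng=None):
    """Build a new population of the same size from the current one."""
    rng = rng or random.Random()
    fitnesses = [fitness_fn(c) for c in population]
    new_pop = []
    while len(new_pop) < len(population):
        a = roulette_select(population, fitnesses, rng)
        b = roulette_select(population, fitnesses, rng)
        for child in crossover(a, b, pc, rng):
            new_pop.append(mutate(child, pm, rng))
    return new_pop[:len(population)]
```

Calling `next_generation(population, fitness_fn)` repeatedly, with a fitness function suited to the rules being mined, gives the generational loop; the surveyed methods differ mainly in which of these operators they replace or adapt.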
[Figure: framework for comparison. 1. Genetic Operations (I. Encoding; II. Initial Population; III. Crossover; IV. Mutation; V. Fitness Threshold); 2. Methodology; 3. Application Areas; 4. Evaluation Parameters]
considered for the fitness threshold results in better predictive accuracy [7]. When the real values of confidence and support are derived and applied to the threshold, rules are found to be generated faster than with traditional methods [8]. Taking the predictability and comprehensibility of rules into account tends to provide better classification performance [11].
2. Methodology.
Rather than altering the operators of the basic GA, changes made to the overall methodology have also been shown to improve performance. In [5] the crossover operation is replaced by symbiotic recombination techniques. Wenxiang Dou et al. [8] describe a model in which the generated rules are displayed to the user, who decides which rules are interesting; the population is thereby seeded by the user. Instead of searching the whole database for support, the system searches the K-itemset alone and hence is faster than other methodologies. The real values of support and confidence are taken into account when generating threshold values. If the user is not satisfied, more rules at the next level of support can be generated. The response time of the system is found to improve dramatically.
Antonio Peregrin et al. [9] describe a distributed genetic algorithm in which the elite data pool is the central node. The connected nodes are treated as intercommunicating subpopulations. The Data Learning Flow (DLF) copies training examples with low fitness to neighboring nodes, and each node is assigned a different partition of the learning data. In [18] the concept of Dynamic Immune Evolution and Biometric Mechanism is introduced. Immune recognition, immune memory and immune regulation are applied to the GA, making the system faster and allowing it to discover interesting rules even when their support is low.
3. Application Areas.
From the datasets listed in Table A1 it can be seen that the Genetic Algorithm is flexible in nature and can be applied to varying datasets. The application areas taken up for study include biology, biometrics, education, manufacturing information systems, Application Protocol Interface records from computers for intrusion detection, software engineering, virus information from computer data, image databases, finance information and student information. By altering the representation and the operators, the Genetic Algorithm can be applied to any field without compromising efficiency.
4. Evaluation Parameters.
To achieve the best performance from any system, the setting of the evaluation parameters is vital. For mining efficient association rules with the evolutionary Genetic Algorithm, the parameters that affect the functioning of the system are the representation of chromosomes and their length, the population size, the mutation probability, the crossover probability and the fitness threshold values. Using the support and confidence factors in deciding the threshold values also increases the performance of the system.
It is seen from Table A1 that the parameters do not have predefined boundaries. Their values depend on the methodology applied and on the dataset. The mutation rate ranges from 0.005 to 0.6, and the crossover probability from 0.1 to 0.9. The fitness function is found to be the most crucial parameter, and an optimum method for evaluating the fitness threshold has to be arrived at in order to achieve the highest performance. The carrying over of offspring from one generation to the next is based on the fitness of the population. Different factors of the system are taken into consideration when constructing the fitness function. The more efficient the fitness function, the more efficient and accurate are the association rules generated by the system.
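To make the role of support and confidence in the fitness threshold concrete, the sketch below computes both measures for a rule "antecedent implies consequent" over a list of transactions. The weighted-sum fitness and the threshold test are assumptions chosen for illustration; the surveyed papers each define their own fitness function. All function names and the weights `w_sup` and `w_conf` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_a


def rule_fitness(transactions, antecedent, consequent, w_sup=0.5, w_conf=0.5):
    """Illustrative fitness: weighted sum of rule support and confidence."""
    both = set(antecedent) | set(consequent)
    return (w_sup * support(transactions, both)
            + w_conf * confidence(transactions, antecedent, consequent))


def passes_threshold(transactions, antecedent, consequent,
                     min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


# Toy data, invented for the example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
sup = support(transactions, {"bread", "milk"})        # 2 of 4 transactions
conf = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread
```

With thresholds such as the 0.1 support and 0.7 confidence reported for one of the compared methods, the toy rule above would be rejected on confidence; lowering the threshold, as in the user-driven approach of [8], would admit it.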
V. INFERENCES
A study of the different systems using the Genetic Algorithm for association rule mining was carried out. The predictive accuracy of the methods under comparison was as follows:
Predictive accuracy of more than 95% was seen in 55% of datasets.
Predictive accuracy between 80% and 94% was seen in 15% of datasets.
Predictive accuracy between 75% and 79% was seen in 10% of datasets.
Predictive accuracy of less than 75% was seen in 20% of datasets.
Predictive accuracy of 100% is achieved with methodologies [1] and [14]. The maximum error rate is seen on the Pima dataset from the UCI repository with the SGERD method [10].
In terms of the time taken for mining association rules, values as low as 5 and 9.81 seconds were achieved by SGERD [10] and the Quick Response Data Mining model [8] respectively, while a maximum of 7012 seconds was taken for mining the KDDCUP99 dataset using the Genetic Algorithm with symbiogenesis [5].
The tools used for mining are the GA Toolbox from MATLAB, Java, C++ and Weka for training the datasets.
From the study it can be inferred that:
The system is flexible and can be applied to varying datasets and applications.
The performance of the system is improved when compared with other existing methods such as Apriori, C4.5, REGAL and Data-miner.
The genetic operators are the key to achieving increased performance.
The speed of the system increases rapidly when compared with other methods.
The optimum values for mutation, crossover and threshold determine the performance of the system.
Table A1
Columns: Ref. No.; Methodology (steps in Genetic Algorithm: representation, selection, crossover, mutation, fitness); Data set; Evaluated by / parameters (size / generations, Pc, Pm, support, confidence); Results.

[1] Methodology: varying-length representation; seeded population; macro mutations. Data set: grilled mushrooms in the Agaricus and Lepiota family, 8128 of 23 species, 22 attributes. Results: accuracy on the training dataset between 95 and 100; on test data between 62 and 71.
[2] Methodology: binary coding; fitness uses TP, TN, FP, FN, with completeness considered. Data set: synthetic database for the selection of electives for a course. Results: rules with negation of attributes as well as general rules are generated.
[3] Data set: Vehicle Silhouette dataset, 846 records, 18 attributes. Results: based on time and fetching knowledge, GA is faster than Apriori.
[4] Methodology: representation using natural numbers; roulette wheel selection. Data set: synthetic image database. Results: runs 2 to 5 times faster than Apriori.
[5] Methodology: symbiotic combination operator; generates single rule sets; changes in weight of membership function. Data sets: KDDCUP99 (2 classes, 41 features, size 494021); CRX (2 classes, 15 features, size 690); IRIS (3 classes, 4 features, size 150); Vote (2 classes, 16 features, size 435); Wine (3 classes, 13 features, size 178). Results: SEA has better or similar results when compared with GA; SEA is much faster than GA.
[6] Methodology: representation using natural numbers; individual evaluation using strength of implication. Data set: car test results dataset. Results: interesting rules between generations 1 and 200, whatever the threshold.
[7] Methodology: array representation; fitness based on the sustaining, creditable and inclusive indexes. Data set: six datasets from the Irvine repository. Results: predictive accuracy better than the CN2 and Ant-Miner methods for all six datasets.
[8] Methodology: selection done by users from the rules generated; fitness based on real support and confidence. Data set: single table produced randomly with 100 transactions. Results: response obtained in ten seconds, whereas Apriori takes more than 3000 seconds.
[9] Data set: UCI Nursery. Results: faster and better behavior; the number of rules generated is between 60% and 80% smaller.
[10] Methodology: fuzzy rules. Data set: 11 data sets from the Irvine Machine Learning Repository. Results: classification error rates are low.
[11] Data set: Nursery datasets from UCI. Results: outperforms C4.5; better classification performance.
[12] Data set: finance service data of a certain city. Results: produces partial association rules after 252 generations, whereas the traditional GA needs 850.
[13] Data set: database of student achievement in schools in recent years. Results: the algorithm based on 0.1 support and 0.7 confidence is close to the actual situation.
[14] Methodology: fitness modified to decide whether a chromosome is right or not. Data set: KDD CUP99 dataset. Results: the rules generated are useful in detecting intrusion.
[15] Methodology: feature selection is applied. Data sets: CM1, KC1, KC2, PC1 from the UCI repository. Results: generated rules provide better estimation and explanation of defective modules.
[16] Data sets: Vehicle dataset (846 records, 18 attributes, 4 classes) and Lymphography dataset from UCI ML (148 records, 18 attributes, 4 classes). Results: GRA outperforms conventional methods.
[17] Methodology: gene string structure. Results: performance and effectiveness of the proposed model are close to real-world analysis.
[18] Methodology: Mutation 1 and Mutation 2; adaptive. Results: faster, and discovers new critical rules even though support is not high.
[Remaining methodology cells and the per-method parameter values (size / generations, Pc, Pm, support, confidence) are not recoverable from the flattened table.]

REFERENCES
[1]. Cattral, R., Oppacher, F., Deugo, D., "Rule Acquisition with a Genetic Algorithm," Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, 1999.
[2]. Saggar, M., Agrawal, A.K., Lad, A., "Optimization of Association Rule Mining," IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, Page(s): 3725-3729, 2004.
[3].
[4]. Shangping Dai, Li Gao, Qiang Zhu, Changwu Zhu, "A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules," 6th IEEE/ACIS International Conference on Computer and Information Science, Page(s): 977-980, 2007.
[5]. Halavati, R., Shouraki, S.B., Esfandiar, P., Lotfi, S., "Rule Based Classifier Generation Using Symbiotic Evolutionary Algorithm," 19th IEEE International Conference on Tools with Artificial Intelligence, Volume: 1, Page(s): 458-464, 2007.
[6]. Zhou Jun, Li Shu-you, Mei Hong-yan, Liu Hai-xia, "A Method for Finding Implicating Rules Based on the Genetic Algorithm," Third International Conference on Natural Computation, Volume: 3, Page(s): 400-405, 2007.
[7]. Hua Tang, Jun Lu, "A Hybrid Algorithm Combined Genetic Algorithm with Information Entropy for Data Mining," 2nd IEEE Conference on Industrial Electronics and Applications, Page(s): 753-757, 2007.
[8]. Wenxiang Dou, Jinglu Hu, Hirasawa, K., Gengfeng Wu, "Quick Response Data Mining Model using Genetic Algorithm," SICE Annual Conference, Page(s): 1214-1219, 2008.
[9]. Peregrin, A., Rodriguez, M.A., "Efficient Distributed Genetic Algorithm for Rule Extraction," Eighth International Conference on Hybrid Intelligent Systems, HIS '08, Page(s): 531-536, 2008.
[10]. Mansoori, E.G., Zolghadri, M.J., Katebi, S.D., "SGERD: A Steady-State Genetic Algorithm for Extracting Fuzzy Classification Rules From Data," IEEE Transactions on Fuzzy Systems, Volume: 16, Issue: 4, Page(s): 1061-1071, 2008.
[11]. Xian-Jun Shi, Hong Lei, "A Genetic Algorithm-Based Approach for Classification Rule Discovery," International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII '08, Volume: 1, Page(s): 175-178, 2008.
[12]. Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, "Genetic Algorithm Based on Evolution Strategy and the Application in Data Mining," First International Workshop on Education Technology and Computer Science, ETCS '09, Volume: 1, Page(s): 848-852, 2009.
[13]. Hong Guo, Ya Zhou, "An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application," 3rd International Conference on Genetic and Evolutionary Computing, WGEC '09, Page(s): 117-120, 2009.
[14]. Yong Wang, Dawu Gu, Xiuxia Tian, Jing Li, "Genetic Algorithm Rule Definition for Denial of Services Network Intrusion Detection," International Conference on Computational Intelligence and Natural Computing, CINC '09, Volume: 1, Page(s): 99-102, 2009.
[15]. Rodriguez, D., Riquelme, J.C., Ruiz, R., Aguilar-Ruiz, J.S., "Searching for Rules to find Defective Modules in Unbalanced Data Sets," 1st International Symposium on Search Based Software Engineering, Page(s): 89-92, 2009.
[16]. Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., "Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction," IEEE Congress on Evolutionary Computation, CEC '09, Page(s): 3249-3255, 2009.
[17]. Haiying Ma, Xin Li, "Application of Data Mining in Preventing Credit Card Fraud," International Conference on Management and Service Science, MASS '09, Page(s): 1-6, 2009.
[18]. Genxiang Zhang, Haishan Chen, "Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining," International Conference on Artificial Intelligence and Computational Intelligence, AICI '09, Volume: 4, Page(s): 341-345, 2009.
[19]. D. Whitley, "A genetic algorithm tutorial," Colorado State Univ., Fort Collins, Rep. CS-93-103, 1993, pp. 2931.
[20].
[21]. Chen Singmis et al., Data Warehouse and Data Mining Techniques, Publishing House of Electronics Industry, 2002.8, p33U-340.
[22]. R. Agrawal, "Mining Association Rules Between Sets of Items in Large Databases," Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., United States, 1993, pp. 207-216.