Framework For Comparison of Association Rule Mining Using Genetic Algorithm



Abstract: A new framework for comparing the literature on Genetic Algorithms for Association Rule Mining is
proposed in this paper. Genetic Algorithms have emerged as practical, robust optimization and search methods
for generating accurate and reliable association rules. The main motivation for using GAs in the discovery of
high-level prediction rules is that they perform a global search and cope better with attribute interaction than the
greedy rule induction algorithms often used in data mining. The objective of the paper is to compare the
performance of different methods based on the methodology, the datasets used and the results achieved. It is shown that
the modifications introduced in GAs increase the prediction accuracy and reduce the error rate in mining
effective association rules. The time required for mining is also reduced.

Keywords: Data Mining, Genetic Algorithm, Association Rule Mining

I. INTRODUCTION

Today, enormous amounts of data are stored in files, databases, and other repositories. Hence it
becomes necessary to develop powerful means for the analysis and interpretation of such data and for the extraction
of interesting knowledge to help in decision-making. Thus, there is a clear need for (semi-)automatic methods
for extracting knowledge from data. This need has led to the emergence of a field called data mining and
knowledge discovery.
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data in databases. The
Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to the
formation of new knowledge. The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed
from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common
source.
Data selection: at this step, the data relevant to the analysis are decided on and retrieved from the data
collection.
Data transformation: also known as data consolidation, a phase in which the selected data are transformed
into forms appropriate for the mining procedure.
Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on
given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the
user. This essential step uses visualization techniques to help users understand and interpret the data mining
results.
This paper reviews work published in the literature in which the basic Genetic Algorithm is modified in some
form to address Association Rule Mining. The rest of the paper is organized as follows. Section II briefly
explains Association Analysis. Section III gives a preliminary overview of the Genetic Algorithm for rule mining.
Section IV reviews the different approaches reported in the literature based on Genetic Algorithms for mining
association rules. Section V lists the inferences drawn from the comparison. Section VI presents the
concluding remarks and suggestions for further research.

II. ASSOCIATION ANALYSIS


Association analysis is the discovery of what are commonly called association rules. It studies the frequency of
items occurring together in transactional databases, and based on a threshold called support, identifies the
frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a
transaction when another item appears, is used to pinpoint association rules.
The discovered association rules are of the form P → Q [s, c], where P and Q are conjunctions of attribute-value
pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is
the conditional probability that Q appears in a transaction when P is present.
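As an illustration of the two measures defined above, the following minimal Python sketch computes support and confidence over a small list of transactions. The function names and the toy transactions are assumptions made for this example and do not come from any of the surveyed systems.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate of P(consequent | antecedent): support(P and Q) / support(P)."""
    p = frozenset(antecedent)
    q = frozenset(consequent)
    support_p = support(p, transactions)
    if support_p == 0.0:
        return 0.0  # the rule never fires, so confidence is taken as 0 here
    return support(p | q, transactions) / support_p


# Hypothetical toy data: four market-basket transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
```

For the rule bread → milk on this toy data, s is 0.5 and c is 2/3, so the rule would be reported as bread → milk [0.5, 0.67].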
III. GENETIC ALGORITHM
A Genetic Algorithm (GA) is a procedure used to find approximate solutions to search problems
through the application of the principles of evolutionary biology. Genetic algorithms use biologically inspired
techniques such as genetic inheritance, natural selection, mutation, and sexual reproduction (recombination, or
crossover).
Genetic algorithms are typically implemented using computer simulations in which an optimization
problem is specified. For this problem, members of a space of candidate solutions, called individuals, are
represented using abstract representations called chromosomes. The GA consists of an iterative process that
evolves a working set of individuals called a population toward an objective function, or fitness function.
Traditionally, solutions are represented using fixed length strings, especially binary strings, but alternative
encodings have been developed.
The evolutionary process of a GA is a highly simplified and stylized simulation of the biological version. It
starts from a population of individuals randomly generated according to some probability distribution, usually
uniform, and updates this population in steps called generations. In each generation, multiple individuals are
randomly selected from the current population based upon some application of fitness, bred using crossover, and
modified through mutation to form a new population. The outline of the basic GA is as follows:
A. [Start] Generate a random population of n chromosomes (suitable solutions for the problem).
B. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
C. [New population] Create a new population by repeating the following steps until the new population is
complete:
   i. [Selection] Select two parent chromosomes from the population according to their fitness (the better
   the fitness, the bigger the chance of being selected).
   ii. [Crossover] With a crossover probability, cross over the parents to form new offspring (children).
   If no crossover is performed, the offspring are exact copies of the parents.
   iii. [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the
   chromosome).
   iv. [Accepting] Place the new offspring in the new population.
D. [Replace] Use the newly generated population for a further run of the algorithm.
E. [Test] If the end condition is satisfied, stop and return the best solution in the current population.
F. [Loop] Go to step B.
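The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any surveyed paper: chromosomes are fixed-length bit lists, selection is fitness-proportionate (roulette wheel), crossover is single-point, the end condition is a fixed number of generations, and the fitness function is assumed to return non-negative values. All names and default parameter values are hypothetical.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def run_ga(fitness: Callable[[Chromosome], float],
           n_bits: int = 20, pop_size: int = 30,
           p_cross: float = 0.8, p_mut: float = 0.01,
           generations: int = 50, seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    # [Start] random initial population of bit strings
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # [Loop] / [Test]: fixed generation budget
        # [Fitness] evaluate every chromosome; the small constant keeps the
        # roulette wheel valid even if every fitness value is zero
        weights = [fitness(ch) + 1e-9 for ch in population]
        new_population: List[Chromosome] = []
        while len(new_population) < pop_size:  # [New population]
            # [Selection] fitness-proportionate choice of two parents
            p1, p2 = rng.choices(population, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]  # copies, used if no crossover happens
            # [Crossover] single-point, applied with probability p_cross
            if rng.random() < p_cross:
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                # [Mutation] flip each locus with probability p_mut
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] ^= 1
                new_population.append(child)  # [Accepting]
        population = new_population[:pop_size]  # [Replace]
        best = max(population + [best], key=fitness)  # remember best so far
    return list(best)


# Toy usage: maximise the number of 1 bits ("OneMax").
best = run_ga(sum)
```

The toy fitness function simply counts 1 bits; for rule mining it would be replaced by a measure computed from the rule a chromosome encodes.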
IV. ANALYSIS ON GENETIC ALGORITHM FOR MINING ASSOCIATION RULES
The Genetic Algorithms designed for Association Rule Mining are discussed based
on the following criteria:
1. Genetic Operations
   I. Encoding
   II. Initial Population
   III. Crossover
   IV. Mutation
   V. Fitness Threshold
2. Methodology
3. Application Areas
4. Evaluation Parameters

The various methodologies are listed in Table A1.


1. Genetic Operations.
The basic steps in the traditional Genetic algorithm implementations are discussed in the previous section.
Modifications are carried out in the traditional GA to increase the prediction accuracy thereby reducing error
rate in mining association rules. The variations have been carried out in various steps of GA.
Encoding:
Encoding is the process of representing the entities in datasets for mining. Rules or chromosomes can be
represented either with fixed-length data [2]-[18] or with varying-length chromosomes [1]. Fuzzy rules are
used for encoding data in [10]. In [4] and [6], coding is done using natural numbers. In [17], gene
expressions are used to represent chromosomes, and in [7] encoding is carried out using arrays.
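To make the idea of a fixed-length encoding concrete, the sketch below decodes a bit string into a rule. The two-bits-per-item convention and the item catalogue are assumptions chosen for illustration; the surveyed papers each define their own representation.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical item catalogue; each item owns two consecutive bits.
ITEMS = ["bread", "milk", "butter", "jam"]


def decode(chromosome: Sequence[int],
           items: List[str] = ITEMS) -> Tuple[Set[str], Set[str]]:
    """Decode a fixed-length bit string into (antecedent, consequent).

    Assumed convention per item: 01 -> antecedent, 10 -> consequent,
    00 or 11 -> item does not take part in the rule.
    """
    if len(chromosome) != 2 * len(items):
        raise ValueError("chromosome must hold two bits per item")
    antecedent: Set[str] = set()
    consequent: Set[str] = set()
    for i, item in enumerate(items):
        pair = (chromosome[2 * i], chromosome[2 * i + 1])
        if pair == (0, 1):
            antecedent.add(item)
        elif pair == (1, 0):
            consequent.add(item)
    return antecedent, consequent


# 01 | 10 | 00 | 11 decodes to the rule bread -> milk
rule = decode([0, 1, 1, 0, 0, 0, 1, 1])
```

Because every chromosome has the same length, standard single-point crossover and bit-flip mutation can be applied without any repair step.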
Initial Population:
The initial population can be generated by random selection, by user seeding [1], by single rule set
generation [5], or from fuzzy rules [10].
Crossover:
The crossover operator, which produces new offspring and hence a new population, plays a vital role in
enhancing the efficiency of the algorithms. The changes are listed in Table A1. Saggar et al. [2] treat
the decision of whether crossover is to be performed and, if so, the locus point where the crossover
begins as being of prime importance. In [11], crossover is carried out on the same attributes of both
chromosomes if present, or on random attributes in the absence of similar attributes. In [12], the crossover
rate is set dynamically for each generation. Halavati et al. [5] replace crossover with symbiotic combination,
in which chromosomes are combined to generate a new chromosome based on symbiogenesis; this is reported
to increase the speed of the rule generation system.
Mutation:
Mutation is the process in which attributes of selected entities are changed to form a new offspring. The
mutation operator is altered based on the application domain: macro mutation is used in [1], and the locus
points of mutation are changed in [2]. In [5], a weight factor is taken into consideration for the locus point of
mutation so as to generate better offspring. Dynamic mutation, in which the mutation point is decided by the
particular entity and generation selected, is introduced in [12] and enhances the diversity of the colony. In [16],
mutation is performed twice (mutation 1 and mutation 2) to generate offspring. Adaptive mutation, in which the
mutation rate differs for each generation, is found to produce better offspring in [17].
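A mutation rate that differs for each generation can be sketched as below. The linear decay schedule is purely an assumption for illustration and is not the rule used in [17]; the start and end values are simply the extremes of the mutation rates reported across the surveyed papers.

```python
import random
from typing import List


def mutation_rate(generation: int, max_generations: int,
                  start: float = 0.6, end: float = 0.005) -> float:
    """Per-locus mutation probability for a given generation.

    Assumed schedule: linear decay from `start` (generation 0) to `end`
    (last generation). Other schedules are equally possible.
    """
    if max_generations <= 1:
        return end
    step = min(max(generation, 0), max_generations - 1)
    return start + (end - start) * step / (max_generations - 1)


def mutate(chromosome: List[int], rate: float, rng: random.Random) -> List[int]:
    """Return a copy of the bit list with each bit flipped with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


rng = random.Random(42)
early_child = mutate([0, 1, 0, 1, 1, 0], mutation_rate(0, 100), rng)   # high rate
late_child = mutate([0, 1, 0, 1, 1, 0], mutation_rate(99, 100), rng)   # low rate
```

A high early rate favours exploration, while a low late rate protects good rules already found; this trade-off is the usual motivation for generation-dependent rates.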
Fitness Threshold:
The passing of chromosomes from one population to the new population for the succeeding generation
depends on the fitness criteria. Changes to the fitness function or threshold values alter the population, and
hence effective fitness values improve the efficiency of the rules generated. In [2], the negation of attributes is
taken into consideration while generating rules by including criteria such as True Positive, True Negative, False
Positive and False Negative; with these criteria, rules with negated conditions of attributes are also generated.
By varying the fitness values dynamically in each generation, the speed of the system can be improved [4].
Considering factors such as the strength of implication of rules when calculating the fitness threshold is reported
to generate more interesting rules [6]. Considering the sustainability index, creditable index and inclusive index
for the fitness threshold results in better predictive accuracy [7]. When the real values of confidence and
support are derived and applied in the threshold rate, rules are found to be generated faster than with traditional
methods [8]. Taking the predictability and comprehensibility of rules into account tends to provide better
classification performance [11].
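The role of True Positive, True Negative, False Positive and False Negative counts in a fitness function can be illustrated as follows. The product of sensitivity and specificity used here is one common choice in GA-based rule discovery and is an assumption for this sketch; it is not claimed to be the exact formula of [2]. Function names and data are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple


def confusion_counts(antecedent: Iterable[str], consequent: Iterable[str],
                     transactions: List[FrozenSet[str]]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, FN, TN) for the rule antecedent -> consequent."""
    p, q = frozenset(antecedent), frozenset(consequent)
    tp = fp = fn = tn = 0
    for t in transactions:
        has_p, has_q = p <= t, q <= t
        if has_p and has_q:
            tp += 1  # rule fires and the consequent holds
        elif has_p:
            fp += 1  # rule fires but the consequent is absent
        elif has_q:
            fn += 1  # consequent holds but the rule does not fire
        else:
            tn += 1  # neither side is present
    return tp, fp, fn, tn


def rule_fitness(tp: int, fp: int, fn: int, tn: int) -> float:
    """Assumed fitness: sensitivity * specificity, a value in [0, 1]."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity


# Hypothetical toy data.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"butter"}),
]
counts = confusion_counts({"bread"}, {"milk"}, transactions)
score = rule_fitness(*counts)
```

On this toy data the rule bread → milk has counts (TP, FP, FN, TN) = (2, 1, 1, 1), giving sensitivity 2/3, specificity 1/2 and a fitness of 1/3.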
2. Methodology.
Rather than altering the operations of the basic GA, changes made to the methodology have
also been shown to increase performance. In [5], the crossover operation is replaced by symbiotic recombination
techniques. In Wenxiang Dou et al. [8], rules are generated and displayed to the user, who
decides which rules are interesting, thereby seeding the population. Instead of searching the whole
database for support, the system searches the K-itemset alone and hence is faster than other methodologies. The real
values of support and confidence are taken into account while generating threshold values. If the user is not satisfied,
more rules with the next level of support can be generated. The response time of the system is found to
improve dramatically.
Antonio Peregrin et al. [9] describe a distributed genetic algorithm in which the elite data pool is the central
node. The connected nodes are treated as intercommunicating subpopulations. The Data Learning Flow
(DLF) copies the training examples with low fitness to the neighborhood, and each node is assigned a different
partition of the learning data. In [18], the concept of Dynamic Immune Evolution and Biometric Mechanism is
introduced. Immune recognition, immune memory and immune regulation are applied to the GA,
making the system faster and allowing it to discover interesting rules even though support might be low.
3. Application Areas.
From the datasets in Table A1 it can be seen that the Genetic Algorithm is flexible in nature and can be
applied to varying datasets. Different application areas such as biology, biometrics, education, manufacturing
information systems, Application Programming Interface (API) records from computers for intrusion detection,
software engineering, virus information from computer data, image databases, finance information and student
information have been studied. It is seen that by altering representations and operators, the Genetic
Algorithm can be applied to any field without compromising efficiency.
4. Evaluation Parameters.
To achieve the best performance from any system, the setting of the evaluation parameters is vital. For
mining efficient association rules using the evolutionary Genetic Algorithm, the parameters that affect the
functioning of the system are the representation of chromosomes and their length, population size, mutation
probability, crossover probability and the fitness threshold values. The use of the support and confidence factors in
deciding the threshold values also increases the performance of the system.
It is seen from Table A1 that the parameters do not have predefined boundaries. The values of the
parameters depend on the methodology applied and the dataset. The mutation rate is found to range
from 0.005 to 0.6. Similarly, the crossover probability ranges from 0.1 to 0.9. The fitness function is found to be
the most crucial parameter. The optimum method for fitness threshold evaluation has to be arrived at in order
to achieve the highest performance. The carrying over of offspring from one generation to the next is based on
the fitness function of the population. Different factors of the system are taken into consideration while generating
the fitness function. The more efficient the fitness function, the more efficient and accurate are the association
rules generated by the system.
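The parameters named in this section can be gathered into a single configuration object, as in the hedged sketch below. The class name, field names and example values are hypothetical; only the list of parameters is taken from the text, and the validation merely checks that the probabilities are valid, since the survey finds no fixed boundaries for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAParameters:
    """Hypothetical container for the tunable parameters listed above."""
    chromosome_length: int
    population_size: int
    mutation_probability: float
    crossover_probability: float
    fitness_threshold: float

    def __post_init__(self) -> None:
        if self.chromosome_length < 1:
            raise ValueError("chromosome_length must be at least 1")
        if self.population_size < 2:
            raise ValueError("population_size must be at least 2")
        for name in ("mutation_probability", "crossover_probability"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


# Illustrative values only; suitable settings depend on method and dataset.
params = GAParameters(
    chromosome_length=41,
    population_size=50,
    mutation_probability=0.01,
    crossover_probability=0.6,
    fitness_threshold=0.7,
)
```

Keeping the parameters in one immutable object makes it straightforward to record exactly which settings produced a given set of rules when comparing runs.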
V. INFERENCES
The different systems using Genetic Algorithms for Association Rule Mining were studied.
The predictive accuracy achieved by the methods under comparison is as follows:
Predictive accuracy of more than 95 percent was seen in 55% of datasets.
Predictive accuracy between 80 and 94 percent was seen in 15% of datasets.
Predictive accuracy between 75 and 79 percent was seen in 10% of datasets.
Predictive accuracy of less than 75 percent was seen in 20% of datasets.
Predictive accuracy of 100% is achieved with methodologies [1] and [14]. The maximum error rate is seen on the
Pima dataset from the UCI repository with the SGERD method [10].

In terms of the time for mining association rules, times as low as 5 and 9.81 seconds were achieved by SGERD [10]
and the Quick Response Data Mining model [8] respectively, and a maximum of 7012 seconds was taken for mining
the KDDCUP99 dataset using the Genetic Algorithm with symbiogenesis [5].
The tools used for mining are the GA Toolbox from MATLAB, Java, C++ and Weka for training the datasets.
From the study it can be inferred that:
The system is flexible and can be applied to varying datasets and applications.
The performance of the system is improved when compared to other existing methods such as Apriori,
C4.5, REGAL and Data-miner.
The genetic operators are the key to achieving increased performance.
The speed of the system increases rapidly when compared with other methods.
The optimum values for mutation, crossover and threshold determine the performance of the
system.
Better and more precise rules are generated.
Classification error rates are reduced in comparison.
Negated attributes can be derived in rules.
Critical rules can be generated even if support is low.
VI. CONCLUSION
A framework for comparing the effectiveness of Genetic Algorithms in mining association rules was
taken up for study. Various systems applying GAs for mining association rules were considered and tabulated
with the data chosen. From the analysis it can be derived that the classification performance of the
systems is robust. Genetic Algorithm based implementations outperform conventional methods, and
the speedup achieved is remarkable. For future work, combinations of the methodologies could be
applied to the basic GA operators. A system designed for a particular domain can be further modified for other
domains.

Table A1 : Genetic Algorithms For Association Rule Mining: A Comparative Study

Data Set
Ref .No Methodology
Steps In Genetic Algorithm
Represen Selection
-tation

[1]

Mutation

Varying Seeded
Macro
Length Population Mutations

Cross Over

As GA

As GA

Gilled
Mushrooms in
Agaricus and
Lepiota Family.

[2]

Binary
Coding

[3]

As GA

As GA

[4]

Using
Roulette
Natural
Wheel
Numbers Selection

846 Records
18
Attributes

As GA

As GA

As GA

Data Set from


MIS

Optimum
Value

Optimum
Value

As GA

Synthetic
Image
Database

Synthetic
Database for
TP,TN,FP, FN
the Selection of
Used
Electives for a
Course

Pm

Results
Fitness Gener
-ations

8128 Of 23
Species
22
Attributes

Whether
Required
Or Not
And If
Point Of
Mutation

Whether
Required Or
Not And If
Point Of
Crossover

Evaluated By / Parameters
Size /
Pc/
Support Confidence

Fitness

Vehicle
Silhouette
Dataset
Roulette
Wheel
Selection

Sample
Size

100
*

5%

.25%
To 2%

0.1

0.005

Completeness
considered

5%

User
User
defined defined

4 To
14

Accuracy On
Training Dataset
Between 95 To
100. On Test
Data Between
62 And 71
Rules With
Negation Of
Attributes As
Well As
General Rules
Generated
Based On
Time And
Fetching
Knowledge
GA Is Faster
Than Apriori.
Runs 2 To 5
Times Faster
Than Apriori

KDDCUP99
Classes 2
CRX
Classes 2
[5]

As GA

Changes In
Generate
Symbiotic
Weight Of
Single
Combination
Membership
Rule Sets
Operator
Function

As GA

[6]

Using
Natural
Numbers

As GA

As GA

As GA

Individual
Evaluation
Using Strength
Of Implication

[7]

Array
Representation

As GA

As GA

As GA

Based On
Sustaining,
Creditable And
Inclusive Index

[8]

As GA

Done By
Users
From
Rules
Generated

As GA

Based On Real
Support And
Confidence

[9]

-[10]

As GA

Multiple Intercommunicating Subpopulations


Distributed Data And DLF
Central Elite Pool
The Data Learning Flow (DLF) Copies The Training
Example With Low Fitness To Neighborhood
Each Node Is Assigned With Different Partitions Of Learning
Data

Fuzzy
Rules

As GA

As GA

As GA

As GA

Done On
Same
Attributes If
Present Or
Random

As GA

[11]

[12]

[13]

[14]

As GA

As GA

As GA

As GA

As GA

As GA

As GA

Dynamic

Adaptive

As GA

Dynamic

As GA

As GA

Predictive
Accuracy

Features 4
Size 150

Vote
Classes 2

Features 16
Size 435

Wine
Classes 3

Features 13
Size 178

Car
Test Results
Dataset

Six Datasets
from Irvine
Repository

Single Table
Produced
Randomly with
100
Transactions

UCI : Nursery

Nursery
Datasets
from UCI
Finance
Service Data of
Certain City

Individual
Based

Database Of
Student
Achievement in
Schools in
Recent Years

40
Attributes

10%

12960
Instances

0.85

0.4

0.2

0.8

0.6

50%

0.01
*

optimum
48842
Instances
15attributes

0.4

40

0.6

S.I 1.0
C I 1.0
I.I 1.0

0.9

50

10

500

500

optimum

Optimum

0.05

Optimum

12960 With 9
Attributes

2050
Groups

0.3

200

0.6

0.9

0.01

0.7

SEA Has
Better Or
Similar
Results When
Compared
With GA
SEA Much
Faster Than
GA

10

KDD CUP99
Dataset

10000

11 Data Sets
from Irvine
Machine
Learning
Repository

Based On Last
Generation

Modified To
Decide Whether
A Chromosome

Features 15
Size 690

IRIS
Classes 3

Adult
Elitist
Recombination
Method

Features 41
Size 494021

50

Length Of Chromosome 41
Generations 100

During
Generation
Between 1 To
200 Interesting
Rules For
Whatever Be
The Threshold.
Predictive
Accuracy
Better Than
CN2 And AntMiner
Methods For
All Six
Datasets
Response Got
In Ten Seconds
Whereas For
Apriori It Is
More Than
3000 Seconds
Faster And
Better
Behavior
Number Of
Rules
Generated Is
Between 60%-80% Smaller
Classification
Error Rates
Are Low
Outperforms
C4.5

Better
Classification
Performance
Produces Partial
Association
Rules After 252
Generations
Whereas It Is
850 In
traditional GA
The Algorithm
Based On 0.1
Support And
0.7 Confidence
Is Close To
Actual
Situation
Rules
Generated Are
Useful In

Generation Gap 0.9

Is Right Or Not

[15]

As GA

As GA

As GA

As GA

Feature
Selection Is
Applied

CM1, KC1,
KC2, PC1
From UCI
Repository

Vehicle Dataset
And
[16]

[17]

As GA

Gene
String
Structure

[18]

As GA

As GA

Mutation1
&
Mutation 2

Adaptive

As GA

Adaptive

Based On
Distance
Between Rules Lymphography
Dataset From
UCI ML
Measure Of
Overall
Performance

Dynamic Immune Evolution And Biometric Mechanism Is


Introduced
Immune Recognition, Immune Memory And Immune
Regulation Is Applied To GA

22

Varies

Varies

Varies

240

0.1

0.01
0.01

148 Records
18 Attributes
4 Classes
846 Records,
18 Attributes
4 Classes

Real Case Data

Varies

100

600

0.6

0.005

50

0.26

0.8

Computers Daily
Records Of API

Note: * = not defined in the literature


References
[1]. Cattral, R., Oppacher, F., Deugo, D., "Rule Acquisition with a Genetic Algorithm," Proceedings of the
1999 Congress on Evolutionary Computation, CEC 99, 1999.

[2]. Saggar, M., Agrawal, A.K., Lad, A., "Optimization of Association Rule Mining,"
IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, Page(s): 3725-3729, 2004.

[3]. Cunrong Li, Mingzhong Yang, "Association Rules Data Mining in Manufacturing,"
3rd International Conference on Computational Electromagnetics and Its Applications, Page(s): 153-156, 2004.

[4]. Shangping Dai, Li Gao, Qiang Zhu, Changwu Zhu, "A Novel Genetic Algorithm Based on Image
Databases for Mining Association Rules," 6th IEEE/ACIS International Conference on Computer and
Information Science, Page(s): 977-980, 2007.

[5]. Halavati, R., Shouraki, S.B., Esfandiar, P., Lotfi, S., "Rule Based Classifier Generation Using
Symbiotic Evolutionary Algorithm," 19th IEEE International Conference on Tools with Artificial
Intelligence, Volume: 1, Page(s): 458-464, 2007.

[6]. Zhou Jun, Li Shu-you, Mei Hong-yan, Liu Hai-xia, "A Method for Finding Implicating Rules Based on
the Genetic Algorithm," Third International Conference on Natural Computation, Volume: 3, Page(s):
400-405, 2007.

[7]. Hua Tang, Jun Lu, "A Hybrid Algorithm Combined Genetic Algorithm with Information Entropy for
Data Mining," 2nd IEEE Conference on Industrial Electronics and Applications, Page(s): 753-757,
2007.

[8]. Wenxiang Dou, Jinglu Hu, Hirasawa, K., Gengfeng Wu, "Quick Response Data Mining Model using
Genetic Algorithm," SICE Annual Conference, Page(s): 1214-1219, 2008.

Table A1, Results (continued):
Detecting Intrusion
Generated Rules That Provide Better Estimation And Explanation Of Defective Modules
GRA Outperforms Conventional Methods
Performance And Effectiveness Of Proposed Model Is Close With Real World Analysis
Faster & Discovers New Critical Rules Though Support Not High
[9]. Peregrin, A., Rodriguez, M.A., "Efficient Distributed Genetic Algorithm for Rule Extraction," Eighth
International Conference on Hybrid Intelligent Systems, HIS '08, Page(s): 531-536, 2008.

[10]. Mansoori, E.G., Zolghadri, M.J., Katebi, S.D., "SGERD: A Steady-State Genetic Algorithm for
Extracting Fuzzy Classification Rules From Data," IEEE Transactions on Fuzzy Systems, Volume: 16,
Issue: 4, Page(s): 1061-1071, 2008.

[11]. Xian-Jun Shi, Hong Lei, "A Genetic Algorithm-Based Approach for Classification Rule Discovery,"
International Conference on Information Management, Innovation Management and Industrial
Engineering, ICIII '08, Volume: 1, Page(s): 175-178, 2008.

[12]. Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, "Genetic Algorithm Based on Evolution Strategy and the
Application in Data Mining," First International Workshop on Education Technology and Computer
Science, ETCS '09, Volume: 1, Page(s): 848-852, 2009.

[13]. Hong Guo, Ya Zhou, "An Algorithm for Mining Association Rules Based on Improved Genetic
Algorithm and its Application," 3rd International Conference on Genetic and Evolutionary Computing,
WGEC '09, Page(s): 117-120, 2009.

[14]. Yong Wang, Dawu Gu, Xiuxia Tian, Jing Li, "Genetic Algorithm Rule Definition for Denial of
Services Network Intrusion Detection," International Conference on Computational Intelligence and
Natural Computing, CINC '09, Volume: 1, Page(s): 99-102, 2009.

[15]. Rodriguez, D., Riquelme, J.C., Ruiz, R., Aguilar-Ruiz, J.S., "Searching for Rules to find Defective
Modules in Unbalanced Data Sets," 1st International Symposium on Search Based Software
Engineering, Page(s): 89-92, 2009.

[16]. Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., "Mining Multi-class Datasets using
Genetic Relation Algorithm for Rule Reduction," IEEE Congress on Evolutionary Computation, CEC
'09, Page(s): 3249-3255, 2009.

[17]. Haiying Ma, Xin Li, "Application of Data Mining in Preventing Credit Card Fraud,"
International Conference on Management and Service Science, MASS '09, Page(s): 1-6, 2009.

[18]. Genxiang Zhang, Haishan Chen, "Immune Optimization Based Genetic Algorithm for Incremental
Association Rules Mining," International Conference on Artificial Intelligence and Computational
Intelligence, AICI '09, Volume: 4, Page(s): 341-345, 2009.

[19]. D. Whitley, "A Genetic Algorithm Tutorial," Colorado State Univ., Fort Collins, Rep. CS-93-103, 1993,
pp. 2931.

[20]. Tom M. Mitchell, Machine Learning, China Machine Press, pp. 38-56.

[21]. Chen Singmis et al., Data Warehouse and Data Mining Techniques, Publishing House of Electronics
Industry, 2002.8, p33U-340.

[22]. R. Agrawal, "Mining Association Rules Between Sets of Items in Large Databases," Proc. of the ACM
SIGMOD Intl. Conf. on Management of Data, Washington, D.C., United States, 1993, pp. 207-216.
