Survey On GA and Rules

This document discusses the use of genetic algorithms in association rule mining. It begins with an introduction to association rule mining and its goal of discovering frequent patterns and relationships in data. The document then provides details on how genetic algorithms can be applied in various aspects of association rule mining, including their coding strategy, selection, crossover and mutation operators, and fitness functions. It also discusses how genetic algorithms have been used to further analyze and optimize rules previously generated by other association rule mining methods like Apriori. The effectiveness of genetic algorithms is found to improve with modifications to the genetic algorithm itself.
Genetic Algorithm in Association Rule Mining: A Survey

Introduction
The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of
stored data contains valuable hidden knowledge, which may be used to improve the decision-making
process of an organization. Thus, there is a clear need for (semi-)automatic methods for
extracting knowledge from data. This need has led to the emergence of a field called data mining
and knowledge discovery. Association rule mining is one such data mining task, and it involves
frequent pattern mining.
The aim of frequent pattern mining is to search for recurring relationships in a given data set,
which enables us to discover various kinds of associations and correlations among different
items in data sets. Let us formally define the problem: Let I = {i1, i2, i3, ..., in} be the set of all
items. A k-itemset l, which consists of k items from I, is frequent if l occurs in a transaction
database D no fewer than σ|D| times. Here σ is a user-specified parameter called minimum support
and |D| is the total number of tuples in the database.
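The frequency test in the definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from any surveyed paper: the function names, the `min_support` parameter (the minimum support fraction), and the toy database `D` are all assumptions made for the example.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= set(t))


def is_frequent(itemset, transactions, min_support):
    """True if the itemset occurs in at least min_support * |D| transactions."""
    return support_count(itemset, transactions) >= min_support * len(transactions)


# Toy transaction database D (hypothetical data, for illustration only).
D = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
```

With this toy database, the 2-itemset {bread, milk} occurs in 2 of 4 transactions, so it is frequent at a minimum support of 0.5 but not at 0.6.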
In this paper the role of genetic algorithms in data mining, and specifically in association rule
mining, is taken up for analysis. The effectiveness of the algorithm is found to increase with
modifications to the genetic algorithm. We present here the application of genetic algorithms in
association rule mining, with variations in the algorithm and the results achieved.

GENETIC ALGORITHM AND DATA MINING


A. The position of genetic algorithms in data mining
Genetic algorithms play an important role in data mining technology, which follows from their
own characteristics and advantages. These can be summed up in the following aspects:
1) A genetic algorithm does not process the parameters themselves but the encoded individuals of
the parameter set, so it can operate directly on sets, queues, matrices, charts, and other structures.
2) It has good global search performance, which reduces the risk of being trapped in a local
optimum. At the same time, the genetic algorithm itself is very easy to parallelize.
3) The standard genetic algorithm uses essentially no knowledge of the search space or other
supporting information; it uses only the fitness function to evaluate individuals and performs the
genetic operations on that basis.
4) A genetic algorithm does not adopt deterministic rules; it uses probabilistic transition rules
to guide the search direction.
The steps in a genetic algorithm are:
1. Randomly select parents.
2. Reproduce through crossover. Reproduction is the operator that chooses which individual entities
will survive. In other words, some objective function or selection characteristic is needed to
determine survival. Crossover relates to changes in future generations of entities.
3. Select survivors for the next generation through a fitness function.
4. Mutate. Mutation is the operation by which randomly selected attributes of randomly selected
entities in subsequent operations are changed.
5. Iterate until either a given fitness level is attained or the preset number of iterations is reached.
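The five steps can be sketched as a small loop over bit-string chromosomes. This is a minimal sketch under stated assumptions, not the procedure of any surveyed paper: the function name `run_ga`, all default parameter values, one-point crossover, bit-flip mutation, and keeping the fittest of parents plus offspring are illustrative choices. Mutation is applied to the offspring before survivor selection, which is a common ordering.

```python
import random


def run_ga(fitness, length, pop_size=20, generations=50,
           pc=0.8, pm=0.02, target=None, seed=0):
    """Evolve bit strings of the given length (>= 2) and return the best one found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):                      # 5. iterate
        if target is not None and fitness(best) >= target:
            break                                     # required fitness level reached
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(pop, 2)               # 1. randomly select parents
            c1, c2 = p1[:], p2[:]
            if rng.random() < pc:                     # 2. one-point crossover
                cut = rng.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                    # 4. bit-flip mutation
                for i in range(length):
                    if rng.random() < pm:
                        child[i] ^= 1
            children += [c1, c2]
        # 3. survivors: the fittest pop_size individuals among parents and offspring
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        best = pop[0]
    return best
```

For example, `run_ga(sum, 12)` searches for a 12-bit string with as many ones as possible. Because survivors are drawn from parents and offspring together, the best fitness never decreases from one generation to the next.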
a) Coding strategy and coding string length L:
Because there are many parameters, multi-parameter coding can be used. The basic idea is to
encode each parameter as a substring and then combine these substrings into a complete
chromosome. For example, the gene string 18 | 36 | medium | good | man expresses the employee
group aged 18 to 36 with medium income, good health condition, and male sex. Several codes are
combined in use, so that 18 | 36 | medium | good | man is encoded as the string
18 | 36 | 01 | 00 | 1.
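The multi-parameter coding of the employee example can be sketched as follows. Only the codes medium = 01, good = 00, and man = 1 come from the text; every other table entry, and the function names `encode` and `decode`, are assumptions added so that the example is complete.

```python
# Code tables: only "medium" -> "01", "good" -> "00" and "man" -> "1" are taken
# from the text; all other entries are hypothetical placeholders.
INCOME_CODES = {"low": "00", "medium": "01", "high": "10"}
HEALTH_CODES = {"good": "00", "fair": "01", "poor": "10"}
SEX_CODES = {"woman": "0", "man": "1"}


def encode(min_age, max_age, income, health, sex):
    """Encode each parameter as a substring and join them into one chromosome."""
    parts = [str(min_age), str(max_age),
             INCOME_CODES[income], HEALTH_CODES[health], SEX_CODES[sex]]
    return " | ".join(parts)


def decode(chromosome):
    """Invert encode(): split the chromosome and map codes back to values."""
    min_age, max_age, income, health, sex = [p.strip() for p in chromosome.split("|")]

    def invert(table):
        return {code: name for name, code in table.items()}

    return (int(min_age), int(max_age),
            invert(INCOME_CODES)[income],
            invert(HEALTH_CODES)[health],
            invert(SEX_CODES)[sex])
```

Keeping one substring per parameter means crossover and mutation can act on a single parameter without disturbing the others.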
b) Selection operator: The selection mechanism is the deterministic expected value model. The
integer part of each individual's expected survival count determines the number of times that
individual is selected. If a selected individual takes part in crossover, its expected survival
count in the next generation is reduced by 0.5; otherwise it is reduced by 1. The M individuals are
then listed by the decimal part of their expected values, from large to small, and selected one by
one in that order until the population is full. Such a selection mechanism can overcome the
randomness in selection.
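The integer-part and decimal-part stages of this selection can be sketched as below. The expected count used here, M x f_i / (sum of all fitnesses), is the usual definition in deterministic sampling and is an assumption, since the formula did not survive in the text. The 0.5 and 1 decrements are left out to keep the sketch short, and fitness values are assumed to be positive.

```python
import math


def deterministic_sampling(fitnesses):
    """Return how many copies of each individual enter the mating pool.

    Assumes positive fitness values and an expected count of
    M * f_i / sum(f), where M is the population size.
    """
    m = len(fitnesses)
    total = sum(fitnesses)
    expected = [m * f / total for f in fitnesses]
    # Stage 1: every individual gets the integer part of its expected count.
    counts = [math.floor(e) for e in expected]
    # Stage 2: fill the remaining slots in order of decreasing decimal part.
    remaining = m - sum(counts)
    by_fraction = sorted(range(m), key=lambda i: expected[i] - counts[i], reverse=True)
    for i in by_fraction[:remaining]:
        counts[i] += 1
    return counts
```

For fitness values 4, 3, 2, 1 the expected counts are 1.6, 1.2, 0.8, 0.4, so the integer parts fill two slots and the two largest decimal parts (0.8 and 0.6) fill the rest, which gives the counts 2, 1, 1, 0.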
c) Crossover operator: Because multi-parameter coding is used, and taking into account the
characteristics of the string code, two-point crossover is adopted.
d) Mutation operator: The basic mutation operator is adopted, mutating the age gene locus by a
random integer below 5.
e) The group size M: When M is small, the evolution speed of the genetic algorithm improves, but
the diversity of the group decreases, which might cause premature convergence of the genetic
algorithm; when M is large, the evolution speed of the genetic algorithm decreases. Therefore,
considering both aspects, a value of M between 20 and 100 is good.
f) Fitness function f(x): The best employee group is the employee group that obtains the highest
number of excellent comprehensive evaluations under the same age condition, and the ultimate
aim is to find young and excellent employees. In addition, a restrictive condition is added: the
minimum age of an employee group must be less than its maximum age. The objective function is set
in terms of the following quantities: t(x) is the number of excellent comprehensive evaluations of
employees for gene string x; T is the total number of excellent comprehensive evaluations of all
employees; i(x) is the age span of string x. Generally speaking, the selection intensity should be
slightly lower in the initial stage of genetic optimization, so that the population is not
dominated by one or a few individuals with higher fitness; in the later stage of genetic
optimization, because the difference between groups is relatively small and the potential for
further optimization is low, it is necessary to raise the selection intensity so that the genetic
algorithm converges to a better solution. The fitness function is designed accordingly.

g) Crossover probability Pc: The crossover probability Pc controls the frequency of the exchange
operation. A high Pc can reach a greater solution space, thus reducing the probability of staying
at a non-optimal solution, but too large a Pc wastes much time searching unnecessary solution
space. To this end, an adaptive Pc can be used, in which x is the larger fitness of the two
individuals participating in crossover, fmax is the largest fitness in the group, and favg is the
average fitness of the group.
h) Mutation probability Pm: The mutation probability Pm controls the ratio of new genes introduced
into the population. If it is too low, some useful genes will not be able to enter the selection;
if it is too high, there is too much random change, and future generations may lose good
characteristics inherited from both parents. To this end, the adaptive Pm can be used in (4), in
which y is the fitness of the individual in a particular mutation operation, fmax is the largest
fitness in the group, and favg is the average fitness of the group.
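Both adaptive probabilities depend only on an individual's fitness, the largest group fitness, and the average group fitness. One widely used scheme with exactly these inputs lowers the rate linearly for above-average individuals and keeps a fixed rate for the rest. The sketch below shows that scheme as an assumption; it is not a reconstruction of the formulas lost from the text, and the function name and the constants `k_above` and `k_below` are hypothetical.

```python
def adaptive_rate(f, f_max, f_avg, k_above, k_below):
    """Adaptive crossover or mutation probability for an individual of fitness f.

    Below-average individuals get the fixed rate k_below. Above-average
    individuals get a rate that falls linearly from k_above (at f == f_avg)
    to 0 (at f == f_max), which protects the best solutions from disruption.
    """
    if f < f_avg or f_max == f_avg:
        return k_below
    return k_above * (f_max - f) / (f_max - f_avg)
```

For crossover, `f` would be the larger fitness of the two parents; for mutation, it would be the fitness of the individual being mutated.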

i) Termination: The genetic algorithm is considered to have converged, and stops, when the best
fitness of two successive generations does not change or changes only slightly, that is, when
| (f1 - f2) / f1 | < ε for a small threshold ε.
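The termination test compares the best fitness of two successive generations. A minimal sketch follows; the function name, the parameter name `eps`, and its default value are assumptions, as is the handling of a zero previous fitness.

```python
def has_converged(best_prev, best_curr, eps=1e-4):
    """True if the relative change in best fitness is below the threshold eps."""
    if best_prev == 0:
        # The relative change is undefined here; treat "still zero" as no change.
        return best_curr == 0
    return abs((best_prev - best_curr) / best_prev) < eps
```

In practice the check is often required to hold for several generations in a row, so that one flat generation does not stop the search too early.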
Application of Genetic Algorithms on Pre-mined Rules
Association rules mined using other methods, such as Apriori and the Network Programming
Model, are further mined using genetic algorithms. In [1], a new evolutionary method named
Genetic Relation Algorithm (GRA) is proposed for reducing the number of class association rules
extracted by other methods. GRA is composed of nodes and their directed or undirected
branches. Nodes represent events and branches represent the relations between nodes. The basic
structure of GRA and the genotype expression of a GRA node are shown in the figures below.

Basic Structure of GRA

Genotype Expression of GRA

The table describes the gene of node i; the set of these genes represents the genotype of a
GRA individual. IDi is an identification number: for example, IDi = 1 means node i has
directed branches to other nodes, while IDi = 2 means node i has undirected branches to the
nodes. Fi denotes the function of node i. Ci1, Ci2, . . . , Cik denote the nodes connected from
node i first, second, and so on, and Si1, Si2, . . . , Sik denote the strength from node i
to node Ci1, Ci2, . . . , Cik or the strength between node i and node Ci1, Ci2, . . . , Cik, depending on
the arguments of node i.
In order to find really important class association rules, the function of the nodes in GRA should
be changed. It is possible to realize the above effectively by GRA genetic operations, because
mutation and crossover will change the connections or contents of the nodes. Three kinds of
genetic operators are used: crossover, mutation-1 (change the connection of nodes) and
mutation-2 (change the function of nodes).
The algorithm is depicted in the flow chart shown below

Two datasets from the UCI ML Repository, Lymphography and Vehicle, were used for the
experiments. The experimental results show that when the reduction rate is small, GRA achieves
accuracy comparable to that of the large set of rules, that is, 100% of the rules, especially in
the partial match. Furthermore, the accuracy does not change drastically compared to the accuracy
of the large set of rules.

Accuracy on Dataset Lymphography

Accuracy on Dataset Vehicle

The classification accuracy compared with other methods is tabulated in the table below.

Comparison of Classification Accuracy with other Methods


In [2], the objective was to optimize the rules generated by association rule mining (the Apriori
method) using genetic algorithms. The algorithm with modifications is:
1. Individuals are represented using the Michigan approach, i.e. each individual
encodes a single rule.
2. The rule antecedent is represented using binary encoding.
3. Genetic operators are applied.
4. For selection, the authors used the roulette wheel sampling procedure.
5. Fitness function:
Confidence Factor, CF = TP / (TP + FP)
Completeness, Comp = TP / (TP + FN)
Fitness = CF x Comp
Fitness = w1 x (CF x Comp) + w2 x Simp
where TP is True Positive, FP is False Positive, FN is False Negative, Simp is a measure of rule
simplicity (normalized to take on values in the range 0..1), and w1 and w2 are user-defined
weights.
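The two fitness forms can be computed directly from the confusion counts of a rule. The sketch below follows the formulas as given; the function name, the default weights, and the choice to return 0 when a denominator is zero are assumptions made for the example.

```python
def rule_fitness(tp, fp, fn, simp=None, w1=1.0, w2=0.0):
    """Fitness of a single rule from its confusion counts.

    Without `simp`, returns CF * Comp. With `simp` (rule simplicity in 0..1),
    returns w1 * (CF * Comp) + w2 * simp.
    """
    cf = tp / (tp + fp) if tp + fp else 0.0      # confidence factor
    comp = tp / (tp + fn) if tp + fn else 0.0    # completeness
    base = cf * comp
    if simp is None:
        return base
    return w1 * base + w2 * simp
```

A rule that covers 8 positive and 2 negative examples while missing 2 positives has CF = 0.8 and Comp = 0.8, so its basic fitness is 0.64. With Simp = 0.5, w1 = 0.7, and w2 = 0.3, the weighted fitness is 0.598.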
When tested on a synthetic database with the parameters listed in the table, the algorithm produced
some rules with negations in the attributes, as predicted and desired.
GA Parameters

Genetic Algorithm from Application Perspective


When made adaptable, the genetic algorithm can be applied with data mining techniques in various
application areas. The areas taken up for analysis are tabulated below.
Areas of Application           Genetic Algorithm
Finance                        Evolution Strategy Excellence in GA [3]
Watermarking                   Data Mining Genetic Algorithm [4]
MIS                            Mining Association Rules with GA [5]
Medicine                       Constraint Based GA [6]
Image Database                 Novel GA [7]
Students Information System    GA through Adapted Mutation [8]
Daily Records from API         Dynamic Immune Evolution to GA [9]
Car Test Data                  Implicating and Optimized Rules [10]
Software Engineering           Defective Module implication [11]

In [3], the excellence of the evolution strategy is applied in the evolutionary process of the
genetic algorithm. The optimized genetic algorithm is then used for mining association rules. The
shortcomings of the traditional GA are overcome with modifications.
Improved genetic algorithm

The genetic algorithm based on evolution strategy has the following improvements. Firstly, the
dissimilar degree of individuals in the population is judged after each generation has evolved.
The dissimilar degree of two random individuals a and b in the population is defined by a formula
in which l is the length of the gene chain, aj is the j-th gene in the gene chain of individual a,
and bj is the j-th gene in the gene chain of individual b. The dissimilar degree of the whole
population is then defined by a formula in which P is the population size.

Secondly, the crossover probability and mutation probability are set up using K, the ratio of the
dissimilar degree of the population in the last generation to that in the current generation. In
the formulas, Pc' and Pm' are the crossover probability and mutation probability of the last
generation, and Pc and Pm are the crossover probability and mutation probability of the current
generation. In this way, the evolution of the current generation is based on the last generation,
and the original population contains the excellent individuals of the last generation. Otherwise,
some new individuals are randomly produced, and the crossover probability and mutation probability
are newly set up. This can enhance the diversity of the population.
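The idea of measuring how different the individuals of a group are can be illustrated with a normalized Hamming distance. This is an assumed measure chosen for the sketch: the formulas of [3] did not survive in the text, so the definitions below (the fraction of differing gene positions for a pair, and the mean over all pairs for the group) are stand-ins, and the function names are hypothetical.

```python
from itertools import combinations


def pair_dissimilarity(a, b):
    """Fraction of gene positions at which chromosomes a and b differ."""
    if len(a) != len(b):
        raise ValueError("chromosomes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def population_dissimilarity(population):
    """Mean pairwise dissimilarity over all pairs (needs at least two individuals)."""
    pairs = list(combinations(population, 2))
    return sum(pair_dissimilarity(a, b) for a, b in pairs) / len(pairs)
```

A ratio such as K could then be formed by dividing the value from the previous generation by the value from the current one, so that falling diversity pushes K above 1.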
When the algorithm was applied to 2050 groups of finance data from a certain city, the following
observations were made. After 252 generations the partial association rules shown in the table
were obtained, whereas the traditional GA required 850 generations.

The improved algorithm was faster and could also be applied to other domains.

In [4] a data mining-based GA is presented to efficiently improve the Traditional GA (TGA).


The flowchart of the algorithm is depicted below.

Algorithm of the Data Mining-Based GA:

1. Set up the environment parameters.
Initialize the support and confidence arrays; set the DNA pool to be empty. (The support and
confidence arrays are introduced in Section 3.1 of [4].)
2. Evaluate all of the chromosomes based on the fitness function.
Record the important gene information for each high-quality chromosome by updating the
support and the confidence arrays.
3. Recombine new chromosomes based on the traditional GA operations.
4. Recombine new chromosomes based on the data mining-based GA operation.
Type 1: Randomly select some chromosomes obtained from step 3, and then perform the new
GA operation, DNA implantation, to generate new chromosomes.
Type 2: Randomly select some chromosomes obtained from step 3, and then disable the genes
of the chromosome if the genes appear in the DNA pool.
5. Repeat steps 2 to 4 until either of the following two conditions is reached.
Condition 1 (Obtain the optimum solution): The predefined condition is satisfied, i.e. the
obtained solution satisfies our expectation, or a constant number of iterations has been
performed.
Condition 2 (Fall into a DNA trap): The obtained solution does not satisfy our requirements,
but there is no improvement after a constant number of new evolutionary generations. Note
that this constant number of iterations is much smaller than the one in condition 1. Collect all
the important genes based on the support and the confidence to form a new DNA, put the
DNA into the DNA pool, and reset the support and confidence arrays. Then go to step 2.
When implemented on a watermarking problem, the above algorithm resulted in improved performance
over the traditional GA. This is clear from the graph given.

Traditional GA vs. Data Mining GA

In the process of data mining the MIS database [5], a set of history records can be regarded
as a population of individuals, each record as a corresponding individual, and the fields
representing the properties of the table as the genes. The correspondence of the items and a
table is shown in the figure.

Correspondence of Item and a Table.


The Basic Genetic Algorithm is divided into six steps:
1. Encode.
2. Select the initial population of individuals, from which the excellent gene list set A1 is
computed; select the second population of individuals, from which the excellent gene list
set B1 is computed.
3. Save the excellent gene list sets above to the result set R.
4. Apply mutation to A1 and B1, cross the genes of A1 and B1, and generate a new set A2;
then select the third population of individuals, from which the excellent gene list set
B2 is computed.
5. Repeat steps 2 to 3 until there is no new excellent gene list.
6. Decode the result set R and generate the knowledge of association rules.
The algorithm was applied to an MIS repository on an IBM Netfinity (Pentium 1G/512M RAM). The
GA-based association rule method was more efficient than the Apriori algorithm: the time used by
the Apriori algorithm increases sharply with the amount of data, whereas with the GA-based method
the precision decreased a little and the efficiency increased a lot.
A Constraint-Based Genetic Algorithm Approach for Mining Classification Rules [6] proposes a
constraint-based genetic algorithm (CBGA) approach to reveal more accurate and significant
classification rules. It describes a rule induction system that consists of three modules: the
user interface, the symbol manager, and the constraint-based GA (CBGA). According to the figure,
the user interface module allows users to execute the following system operations:
loading a constraint program;
adding or retracting the constraints;
controlling the GA's parameter settings;
monitoring the best solutions.
Interesting knowledge or given constraints can be issued by either domain experts or other
meta-knowledge mechanisms.

To introduce the details of the proposed CBGA approach, a synthetic medical data set of
patients' information is used for illustration. This data set includes the following attributes:
age, sex, blood pressure (BP), the status of cholesterol (Cho), the values of Na and K, and the
quantity (Qty) and frequency (Freq) of taking the drug. The prediction attribute is one of five
drug types: Drug A, Drug B, Drug C, Drug D and Drug E.
In comparison with a regular GA, CBGA achieves higher classification accuracy rates in rule
induction for both UCI data sets. In addition, the rule sets discovered by CBGA not only have
higher predictive accuracy but also carry more significant knowledge in accordance with the
users' preferences.
In [7], A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules is
proposed, using a novel spatial mining algorithm called ARMNGA (Association Rules Mining in
Novel Genetic Algorithm).
Association rule mining based on the novel genetic algorithm is carried out as follows.
Encoding
Encoding employs natural numbers to encode the variable Aij. That is, the number of the line
of every range in the matrix A in which the element 1 exists is regarded as a gene. The genes are
independent of each other. They are marked by A1, A2, ..., Aj, ..., An, in which Aj ∈ [1, m],
j ∈ [1, m], and the Aj may be repeated, equal natural numbers.
The Fitness
The fitness function uses weights Wc and Ws with Wc + Ws = 1, Wc ≥ 0, Ws ≥ 0; Smin is the minimum
support, and Cmin is the minimum confidence.


Reproduction Operator
The roulette selection strategy is adopted; each individual's reproduction probability is
proportional to its fitness value.
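Roulette selection can be sketched as a single spin of a wheel whose slices are sized by fitness. This is a generic illustration rather than the ARMNGA implementation; the function name and the optional `rng` argument are assumptions, and fitness values are assumed to be non-negative with a positive sum.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)          # where the wheel stops
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f                      # right edge of this individual's slice
        if pick < running:
            return individual
    return population[-1]                 # guard against floating-point round-off
```

Calling the function repeatedly builds a mating pool in which an individual with three times the fitness of another is drawn about three times as often.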
Mutation Operator
The selection of the mutation probability is a vital point because it influences the behavior and
performance of ARMNGA. If it is too small, new genes are introduced too slowly; if it is too
large, ARMNGA becomes a purely random search. The adaptive mutation probability uses pm1 = 0.1 and
pm2 = 0.001, where fmax(X) is the maximum fitness value of the population and favg(X) is the
average fitness value of the population.

When implemented on an image database, the following observations were made.

Runtime was compared against the minimum support for both algorithms, with the minimum support
varying from 0.25% to 2% on the synthetic dataset. The proposed algorithm runs 25 times faster than
the Apriori algorithm, because a large number of candidates can be pruned by using the
ARMNGA pruning strategy.

From the runtime vs. the average size of transactions for both algorithms, where the average size
of transactions varies from 4 to 14 on the synthetic dataset, it can be deduced that ARMNGA has a
higher convergence speed and a more reasonable selection scheme, which guarantees that the optimal
solution does not degrade.

In [8], the genetic algorithm applied to mining association rules is improved by adopting an
adaptive mutation rate and improving the method of individual selection.
Here, the mutation rate is adapted in the early stages of evolution.

A phase-out method that improves selection is applied in the later part of the genetic algorithm:
1) Sort the individuals by fitness.
2) Each individual in the first 1/4 is copied twice and each individual in the 1/4-2/4 part is
copied once, and they enter the next round of selection.
3) The individuals in the 2/4-3/4 part are retained and enter the next round of selection.
4) The individuals in the 3/4-4/4 part are phased out and no longer enter the next round of
selection.
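One reading of these four steps keeps the population size constant: the best quarter appears twice in the next round, the second and third quarters appear once each, and the worst quarter is dropped. The sketch below implements that reading only. It is an interpretation, since the wording of the steps also allows other copy counts, and the function name is hypothetical.

```python
def phase_out_selection(population, fitness):
    """Rank by fitness, duplicate the top quarter, and drop the bottom quarter.

    Assumes the population size is a multiple of 4; otherwise the lowest-ranked
    leftover individuals are dropped as well.
    """
    ranked = sorted(population, key=fitness, reverse=True)   # step 1
    q = len(ranked) // 4
    top = ranked[:q]               # step 2: appears twice in the next round
    second = ranked[q:2 * q]       # step 2: appears once
    third = ranked[2 * q:3 * q]    # step 3: retained
    return top + top + second + third                        # step 4: rest dropped
```

For eight individuals with fitness 0 to 7, the next round contains 7 and 6 twice, then 5, 4, 3, and 2, while 1 and 0 are eliminated.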
When applied to a database of student achievement in schools in recent years, the new algorithm
reduces the number of unnecessary operations, streamlines the generation of frequent sets, and
improves efficiency compared to the Apriori algorithm, as shown in the graph.
In [9], an IOGA (Immune Optimization based Genetic Algorithm) approach for incremental
association rules is proposed. Dynamic immune evolution and the biomimetic mechanisms of
Engineering Immune Computing (EIC), namely immune recognition, immune memory, and immune
regulation, are introduced into the GA.
Immune recognition is critical in the immune system; its essence is to distinguish self from
non-self, which can be evaluated by the affinity between antibodies and antigens.
The experimental data set comes from a company's daily records of the APIs (in the local computer
operating system) that were called by outside files from the network, together with the result of
whether the files led to a computer virus.
A Method for Finding Implicating Rules Based on the Genetic Algorithm [10] is implemented for car
test results with the following algorithm.
Algorithm GAFIR
Input: database D, threshold θ of the strength of implication, maximum number of generations GEN,
population size N, crossover probability Pc, mutation probability Pm
Output: rule set RS
Procedure GAFIR
1. L0 = Initial(D, N);
2. TR = GetRules(L0, θ);
3. For i = 1 to GEN
4. Begin
5. C = Crossover(Li-1, Pc);
6. Li = Mutation(C, Pm);
7. TR = TR ∪ GetRules(Li, θ);
8. End
9. RS = TR;
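The control flow of GAFIR can be sketched in Python as a loop that accumulates rules from every generation. Only the loop structure follows the procedure above. The four operators are passed in as callables because the source does not describe their internals, so their signatures, and the parameter name `threshold` for the strength-of-implication threshold, are assumptions.

```python
def gafir(database, threshold, gen, n, pc, pm,
          initial, get_rules, crossover, mutation):
    """Collect the rules found in every generation (loop structure of GAFIR).

    initial(database, n)             -> starting population
    get_rules(population, threshold) -> iterable of hashable rules
    crossover(population, pc)        -> recombined population
    mutation(population, pm)         -> mutated population
    """
    population = initial(database, n)
    rules = set(get_rules(population, threshold))
    for _ in range(gen):
        offspring = crossover(population, pc)
        population = mutation(offspring, pm)
        rules |= set(get_rules(population, threshold))   # TR = TR U GetRules(Li)
    return rules
```

Because the rule set is a running union, a rule discovered in an early generation is kept even if the individuals that encoded it later disappear from the population.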
When tested on car test data, the number of interesting rules levels off after about 400
generations of evolution. Generations 1 to 200 form the phase in which interesting implication
rules are discovered frequently. The count then levels off, and by 700 generations nearly all
the interesting rules have been discovered. The greater the fitness threshold, the fewer
interesting implication rules are distilled; conversely, the smaller the fitness threshold, the
more interesting implication rules are distilled.
In the area of software engineering, Searching for Rules to find Defective Modules in Unbalanced
Data Sets [11] is proposed for finding defective modules. Feature selection (attribute
selection) is applied so as to work only with those attributes from the data sets capable of
predicting defective modules. With the reduced data set, a genetic algorithm is used to search for
rules characterizing modules with a high probability of being defective.
For the given data sets, feature selection was used as a necessary step to reduce the data, and
then a genetic algorithm was used as a subgroup discovery technique to generate rules covering
only defective modules. Results showed that in general the data sets are not very homogeneous,
either in the features (attributes) selected in each data set or in the rules generated. The
results, however, provide some points for further research.
Mining Large Data Sets
When the data size becomes too large, a better solution is an efficient distributed genetic
algorithm for classification rule extraction in data mining, based on a new method of dynamic data
distribution applied to parallelism using networks of computers in order to mine large datasets.
The model is as shown

Distributed Model

EDGAR uses a local GA in each node, with some communication with the neighborhood for
individuals and poorly covered examples. The algorithm is:
Generate initial population using seeding
While (Stop Criteria)
For a number of generations
Select g individuals by US
For each individual
If % Perform recombination
If % Perform mutation
end
replace g individual from population
Exchange individuals
Exchange training examples
end
end
Extract set of rules by greedy algorithm
Send set of rules to Central Pool
If (not improving) reduce training data
end
For the experimental study, a well-known problem, Nursery, was chosen from the UCI repository. This
dataset has 12,960 instances, big enough to test data distribution. Nursery is a complex dataset
with 6 characteristics and 5 unbalanced classes, three of which represent more than 97% of
the dataset. The results observed were:

The execution time of the proposed algorithm shows a considerable speedup and better behavior than
the compared algorithm when the number of processors increases.
Classification accuracy is similar in both algorithms and does not follow any tendency
relative to the number of processors.
The number of rules generated is between 60% and 80% smaller in EDGAR.

Conclusion
Compared to other association rule generating methods in data mining, the genetic algorithm
produces better accuracy and increases efficiency, and its robustness was found to be sound. Its
speed is also higher than that of the other methods.
The pitfalls of the genetic algorithm are overcome by making changes to the algorithm itself, and
the algorithm is found to be versatile in nature, which enables it to be applied to any dataset.
