Survey On GA and Rules
Introduction
The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of
stored data contains valuable hidden knowledge that could be used to improve the decision-making
process of an organization. There is therefore a clear need for (semi-)automatic methods for
extracting knowledge from data. This need has led to the emergence of the field of data mining
and knowledge discovery. Association rule mining is one such data mining task, and it relies on
frequent pattern mining.
The aim of frequent pattern mining is to search for recurring relationships in a given data set,
which makes it possible to discover associations and correlations among the items in that data
set. The problem can be stated formally as follows. Let I = {i1, i2, i3, ..., in} be the set of all
items. A k-itemset l, which consists of k items from I, is frequent if l occurs in a transaction
database D no fewer than σ|D| times. Here σ is a user-specified parameter called the minimum
support and |D| is the total number of tuples in the database.
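As a minimal illustration of the definition above, the following Python sketch counts the support of an itemset and tests it against the threshold. The function names, the toy transactions and the reading of the minimum support as a fraction of |D| are assumptions made for this example only.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= set(t))


def is_frequent(itemset, transactions, min_support):
    """True if the itemset occurs in at least min_support * |D| transactions."""
    return support_count(itemset, transactions) >= min_support * len(transactions)


# Toy transaction database D (illustrative data only).
D = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

print(support_count({"bread", "milk"}, D))     # 3
print(is_frequent({"bread", "milk"}, D, 0.5))  # True
print(is_frequent({"butter", "jam"}, D, 0.5))  # False
```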
This paper analyses the role of genetic algorithms in data mining, and in association rule
mining in particular. The algorithm is found to be more effective when the basic genetic
algorithm is modified. We present applications of genetic algorithms to association rule mining,
the variations made to the algorithm, and the results achieved.
4. Mutation is the operation by which randomly selected attributes of randomly selected entities
in subsequent operations are changed.
5. Iterate until either a given fitness level is attained, or the preset number of iterations is reached.
a) Coding strategy and coding string length L:
Because there are many parameters, multi-parameter coding can be used. The basic idea is to
encode each parameter as a substring and then combine these substrings into a complete
chromosome. For example, the gene string 18 | 36 | medium | good | man describes the group of
employees aged 18 to 36 with medium income, good health and male sex. Categorical values are
replaced by codes, so that 18 | 36 | medium | good | man is encoded as the string
18 | 36 | 01 | 00 | 1.
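The multi-parameter coding described above can be sketched as follows. Only the codes that appear in the example (medium as 01, good as 00, man as 1) come from the text; the remaining table entries and the function names are hypothetical placeholders chosen for illustration.

```python
# Code tables for the categorical parameters. Only medium -> "01",
# good -> "00" and man -> "1" are taken from the example in the text;
# every other entry is a hypothetical placeholder.
INCOME_CODES = {"low": "00", "medium": "01", "high": "10"}
HEALTH_CODES = {"good": "00", "fair": "01", "poor": "10"}
SEX_CODES = {"woman": "0", "man": "1"}


def encode(min_age, max_age, income, health, sex):
    """Join one substring per parameter into a complete chromosome string."""
    if min_age >= max_age:
        raise ValueError("minimum age must be less than maximum age")
    parts = [str(min_age), str(max_age),
             INCOME_CODES[income], HEALTH_CODES[health], SEX_CODES[sex]]
    return " | ".join(parts)


def decode(chromosome):
    """Split a chromosome string back into its five parameters."""
    min_age, max_age, income, health, sex = chromosome.split(" | ")

    def name_of(table, code):
        return next(name for name, c in table.items() if c == code)

    return (int(min_age), int(max_age),
            name_of(INCOME_CODES, income),
            name_of(HEALTH_CODES, health),
            name_of(SEX_CODES, sex))


print(encode(18, 36, "medium", "good", "man"))  # 18 | 36 | 01 | 00 | 1
print(decode("18 | 36 | 01 | 00 | 1"))
```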
b) Selection operator: Selection uses the deterministic expected-value model. The integer part of
each individual's expected value determines the number of times that individual is selected. If a
selected individual takes part in crossover, its expected survival value in the next generation is
reduced by 0.5; otherwise it is reduced by 1. The fractional parts of the expected values are then
sorted from largest to smallest, and individuals are selected in that order until the population
of size M is full. This mechanism reduces the randomness of selection.
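A minimal sketch of this kind of deterministic selection is given below. The expected value of an individual is not spelled out in the text, so the common definition M * f_i / sum(f) is assumed here; the bookkeeping that subtracts 0.5 or 1 after crossover is omitted, and the function name is hypothetical.

```python
import math


def expected_value_selection(fitness, m):
    """Return the indices of m selected individuals.

    Assumes expected value e_i = m * f_i / sum(f). Each individual is first
    selected floor(e_i) times; the remaining slots are filled in order of
    decreasing fractional part of e_i.
    """
    total = sum(fitness)
    expected = [m * f / total for f in fitness]

    selected = []
    for i, e in enumerate(expected):
        selected.extend([i] * math.floor(e))

    # Rank individuals by the fractional part of their expected value.
    by_fraction = sorted(range(len(fitness)),
                         key=lambda i: expected[i] - math.floor(expected[i]),
                         reverse=True)
    for i in by_fraction:
        if len(selected) >= m:
            break
        selected.append(i)
    return selected


# Expected values are 0.4, 0.8, 1.2 and 1.6 for a population of size 4.
print(sorted(expected_value_selection([1, 2, 3, 4], 4)))  # [1, 2, 3, 3]
```

Because no random numbers are drawn, the same fitness values always yield the same selection, which is the property the text attributes to this mechanism.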
c) Crossover operator: Because multi-parameter coding is used, two-point crossover is adopted to
suit the structure of the string code.
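Two-point crossover on a multi-parameter chromosome can be sketched as below, assuming that each chromosome is held as a list with one element per parameter substring so that cuts fall on substring boundaries. The function name, the optional cut arguments and the second parent's values are illustrative assumptions.

```python
import random


def two_point_crossover(parent_a, parent_b, cut1=None, cut2=None, rng=random):
    """Swap the segment between two cut points of two equal-length parents."""
    n = len(parent_a)
    if cut1 is None or cut2 is None:
        # Draw two distinct cut positions and order them.
        cut1, cut2 = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:cut1] + parent_b[cut1:cut2] + parent_a[cut2:]
    child_b = parent_b[:cut1] + parent_a[cut1:cut2] + parent_b[cut2:]
    return child_a, child_b


a = [18, 36, "01", "00", "1"]
b = [25, 50, "10", "01", "0"]  # hypothetical second parent
child_a, child_b = two_point_crossover(a, b, cut1=1, cut2=3)
print(child_a)  # [18, 50, '10', '00', '1']
print(child_b)  # [25, 36, '01', '01', '0']
```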
d) Mutation operator: The basic mutation operator is adopted; the age gene locus is mutated by a
random integer below 5.
e) Population size M: A small M speeds up the evolution of the genetic algorithm but reduces
population diversity and may cause premature convergence; a large M slows the evolution
down. Balancing these two effects, a value of M between 20 and 100 is a good choice.
f) Fitness function f(x): The best employee group is the one that obtains the highest number of
excellent ratings in the comprehensive evaluation for a given age range, and the ultimate
aim is to find young and excellent employees. A restrictive condition is added: the
minimum age of the group must be less than its maximum age. The objective function is defined in
terms of t(x), the number of excellent comprehensive evaluations obtained by the employees matching
gene string x; T, the total number of excellent comprehensive evaluations over all employees; and
i(x), the age span of string x. Generally speaking, the selection pressure should be slightly lower
in the initial stage of genetic optimization, so that the population is not dominated by one or
a few individuals with high fitness. In the later stage, the differences between individuals are
relatively small and little further improvement is possible, so the selection pressure must be
increased for the genetic algorithm to converge to a better solution. The fitness function is
designed accordingly. Here, x is the fitter of the two individuals taking part in crossover,
f max is the largest fitness in the population, and f avg is the average fitness.
h) Mutation probability Pm: The mutation probability Pm controls the rate at which new genes are
introduced into the population. If it is too low, some useful genes never enter the population;
if it is too high, there is too much random change and offspring may lose the good characteristics
inherited from both parents. For this reason, the adaptive Pm of (4) can be used.
i) Termination: The genetic algorithm is considered to have converged, and stops, when the best
fitness values f1 and f2 of two successive generations no longer change or change only slightly,
that is, when |(f1 - f2) / f1| < ε for a small threshold ε.
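The termination test can be sketched as a relative-change check on the best fitness of two successive generations. The threshold name `eps`, its default value and the handling of a zero denominator are assumptions of this example, since the text does not state them.

```python
def has_converged(best_prev, best_curr, eps=1e-4):
    """True when the relative change |(f1 - f2) / f1| falls below eps."""
    if best_prev == 0:
        # Assumed convention: avoid division by zero and require no change.
        return best_curr == 0
    return abs((best_prev - best_curr) / best_prev) < eps


print(has_converged(0.80, 0.80004))  # True
print(has_converged(0.50, 0.60))     # False
```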
Application of Genetic Algorithms on pre-mined Rules.
Association rules mined by other methods, such as Apriori and the Network Programming
Model, can be mined further using genetic algorithms. In [1], a new evolutionary method named
Genetic Relation Algorithm (GRA) is proposed for reducing the number of class association rules
extracted by other methods. GRA is composed of nodes and their directed or undirected
branches. Nodes represent events and branches represent the relations between nodes. The basic
structure of GRA and the genotype expression of a GRA node are shown in the figures below.
The table describes the gene of node i; the set of these genes represents the genotype of a
GRA individual. IDi is an identification number: for example, IDi = 1 means node i has
directed branches to other nodes, while IDi = 2 means node i has undirected branches to other
nodes. Fi denotes the function of node i. Ci1, Ci2, . . . , Cik denote the first, second, . . . , k-th
nodes connected from node i, and Si1, Si2, . . . , Sik denote the strength from node i
to nodes Ci1, Ci2, . . . , Cik, or the strength between node i and nodes Ci1, Ci2, . . . , Cik, depending
on the arguments of node i.
In order to find really important class association rules, the function of the nodes in GRA should
be changed. It is possible to realize the above effectively by GRA genetic operations, because
mutation and crossover will change the connections or contents of the nodes. Three kinds of
genetic operators are used: crossover, mutation-1 (change the connection of nodes) and
mutation-2 (change the function of nodes).
The algorithm is depicted in the flow chart shown below
Two datasets from the UCI ML Repository, Lymphography and Vehicle, were used in the
experiments. The experimental results show that when the reduction rate is small, GRA achieves
accuracy comparable to that of the full set of rules (100% of the rules), especially under
partial matching. Furthermore, the accuracy does not change drastically compared with the
accuracy of the full set of rules.
The classification accuracy compared with other methods is tabulated in the table below.
Fitness = CF × Comp
Fitness = w1 × (CF × Comp) + w2 × Simp
where TP is True Positive, FP is False Positive, FN is False Negative, Simp is a measure of rule
simplicity (normalized to take values in the range 0..1), and w1 and w2 are user-defined
weights.
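The two fitness expressions can be evaluated as in the sketch below. The text does not reproduce the definitions of CF and Comp, so the usual confidence factor CF = TP / (TP + FP) and completeness Comp = TP / (TP + FN) are assumed here; the function name, the default weights and the sample counts are likewise illustrative.

```python
def rule_fitness(tp, fp, fn, simp=None, w1=1.0, w2=0.0):
    """Fitness of a rule from its true/false positive and false negative counts.

    Assumed definitions: CF = TP / (TP + FP), Comp = TP / (TP + FN).
    Without `simp` the result is CF * Comp; with it, the weighted form
    w1 * (CF * Comp) + w2 * Simp is returned.
    """
    cf = tp / (tp + fp) if tp + fp else 0.0
    comp = tp / (tp + fn) if tp + fn else 0.0
    base = cf * comp
    if simp is None:
        return base
    return w1 * base + w2 * simp


print(round(rule_fitness(40, 10, 10), 3))                            # 0.64
print(round(rule_fitness(40, 10, 10, simp=0.5, w1=0.7, w2=0.3), 3))  # 0.598
```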
When tested on a synthetic database with the parameters listed in the table, the algorithm
discovered some rules with negated attributes, as predicted and desired.
GA Parameters
Genetic Algorithm
Evolution Strategy Excellence in GA[3]
Data Mining Genetic Algorithm[4]
Mining Association Rules with GA[5]
Constraint Based GA[6]
Novel GA[7]
GA through Adapted Mutation[8]
Dynamic Immune Evolution to GA[9]
Implicating and Optimized Rules[10]
Defective Module implication[11]
In [3], the excellence of the evolution strategy is applied to the evolutionary process of the
genetic algorithm. The optimized genetic algorithm is then used for mining association rules. The
modifications overcome the shortcomings of the traditional GA.
Improved genetic algorithm
In the formula, l is the length of the gene chain, aj is the j-th gene in the gene chain of
individual a, and bj is the j-th gene in the gene chain of individual b. The dissimilarity
degree of the whole population is defined as follows.
K is the ratio of the population's dissimilarity degree in the previous generation to that in the
current generation. In the formulas, Pc' and Pm' are the crossover and mutation probabilities of
the previous generation, and Pc and Pm are the crossover and mutation probabilities of the current
generation. In this way, the evolution of the current generation is based on the previous one: the
original population contains the excellent individuals of the previous generation. Otherwise, some
new individuals are produced at random, and the crossover and mutation probabilities are set
anew, which enhances the diversity of the population.
When the algorithm was applied to 2050 groups of finance data from a certain city, the partial
association rules listed in the table were obtained after 252 generations, whereas the
traditional GA required 850 generations.
The method is therefore faster, and it could also be applied to other domains.
Condition 2 (Fall into a DNA trap): The obtained solution does not satisfy our requirements,
but there is no improvement after a constant number of new evolutionary generations. Note
that the constant number of iterations is much smaller than the one in condition 1. Collect all
the important genes based on the support and the confidence to form a new DNA, put the
DNA into the DNA pool, and reset the support and confidence arrays. Then go to step 2.
When implemented on a watermarking problem, the above method improved on the performance of
the traditional GA, as the graph shows.
Figure: traditional GA versus data mining GA.
In mining the MIS database [5], a set of history records can be regarded as a population of
individuals, a single record as an individual, and the fields that represent the properties of
the table as genes. The correspondence between the items and a table is shown in the figure.
2. Select the initial population of individuals and compute from it the excellent gene list set A1;
select the second population of individuals and compute from it the excellent gene list
set B1.
3. Save the excellent gene list sets above to the result set R.
4. Mutate A1 and B1, cross the genes of A1 and B1, and generate a new set A2;
then select the third population of individuals and compute from it the excellent gene list set
B2.
5. Repeat the steps from b to c until there is no new excellent gene list.
6. Decode the result set R and generate the knowledge of association rules.
The algorithm was applied to an MIS repository on an IBM Netfinity (Pentium 1G/512M RAM). The
GA-based association rule method was more efficient than the Apriori algorithm, whose running time
increases sharply as the amount of data grows. With the GA-based method the precision decreased
a little, while the efficiency increased considerably.
A Constraint-Based Genetic Algorithm Approach for Mining Classification Rules [6] proposes a
constraint-based genetic algorithm (CBGA) approach to reveal more accurate and significant
classification rules. The rule induction system consists of three modules: the user
interface, the symbol manager, and the constraint-based GA (CBGA). According to the figure, the
user interface module allows users to execute the following system operations:
loading a constraint program;
adding or retracting the constraints;
controlling the GA's parameter settings;
monitoring the best solutions.
Interesting knowledge or given constraints can be issued by either domain experts or other
meta-knowledge mechanisms.
To introduce the details of the proposed CBGA approach, a synthetic medical data set of
patient information is used for illustration. This data set includes the following attributes: age,
sex, blood pressure (BP), cholesterol status (Cho), the values of Na and K, and the
quantity (Qty) and frequency (Freq) of taking the drug. The prediction attribute is one of five
drug types: Drug A, Drug B, Drug C, Drug D and Drug E.
Compared with a regular GA, CBGA achieves higher classification accuracy in rule
induction for both UCI data sets. In addition, the rule sets discovered by CBGA not only have
higher predictive accuracy but also carry more significant knowledge in accordance with the
user's preferences.
In [7], A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules, a novel
spatial mining algorithm called ARMNGA (Association Rules Mining in Novel Genetic Algorithm) is
proposed.
Association rule mining based on the novel genetic algorithm is carried out as follows.
Encoding employs natural numbers to encode the variable Aij. That is, for every column of the
matrix A, the number of the row in which the element 1 occurs is regarded as a gene. The genes are
independent of each other. They are denoted A1, A2, ..., Aj, ..., An, where Aj ∈ [1, m], j ∈ [1, m],
and different genes may take the same natural number.
The Fitness
Here, pm1=0.1, pm2=0.001, fmax (X) is the maximum fitness value of the population, f(X) is the
average fitness value of the population.
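The formula to which these parameters belong is not reproduced in the text. As a hedged illustration only, the sketch below uses a commonly seen linear adaptive scheme in which individuals at or above the average fitness receive a mutation rate that falls from pm1 towards pm2 as their fitness approaches the maximum; this form is an assumption and not necessarily the one used in [7]. The values pm1 = 0.1 and pm2 = 0.001 are taken from the text, and the function name is hypothetical.

```python
def adaptive_mutation_rate(f, f_avg, f_max, pm1=0.1, pm2=0.001):
    """Mutation probability for an individual with fitness f (assumed form).

    Below-average individuals mutate at the high rate pm1. Above-average
    individuals mutate at a rate interpolated linearly between pm1
    (at f == f_avg) and pm2 (at f == f_max).
    """
    if f < f_avg or f_max == f_avg:
        return pm1
    return pm1 - (pm1 - pm2) * (f - f_avg) / (f_max - f_avg)


print(adaptive_mutation_rate(0.2, f_avg=0.5, f_max=0.9))            # 0.1
print(round(adaptive_mutation_rate(0.9, f_avg=0.5, f_max=0.9), 6))  # 0.001
```

Under this assumed scheme the best individuals are disturbed least, while weak individuals are mutated often to keep exploring.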
The plot of runtime against the average transaction size for both algorithms, where the average
transaction size varies from 4 to 14 for the synthetic dataset, indicates that ARMNGA has a higher
convergence speed and a more reasonable selection scheme, which guarantees that the quality of
the optimal solution does not degrade.
The genetic algorithm applied to mining association rules can be improved by adopting an
adaptive mutation rate and by improving the method of individual selection [8].
Here, an adaptive mutation rate is applied in the early stages of evolution.
A phase-out method that improves selection is applied in the later part of the genetic algorithm:
1) Sort the individuals by fitness.
2) The first quarter of the individuals enter the next round of selection with two copies each,
and the second quarter (1/4-2/4) with one copy each.
3) The third quarter of the individuals (2/4-3/4) are retained and enter the next round of
selection.
4) The last quarter of the individuals (3/4-4/4) are eliminated and do not enter the next round
of selection.
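The four steps above can be sketched as follows, under the reading that the best quarter is duplicated, the two middle quarters are carried over once, and the worst quarter is dropped, so that the population size is preserved. The function name and the rounding of quarter sizes are assumptions of this example.

```python
def phase_out_selection(population, fitness):
    """Rank-based selection that duplicates the best quarter and drops the worst.

    `fitness` is a function mapping an individual to its fitness value.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    q = len(ranked) // 4                  # size of one quarter, rounded down
    top = ranked[:q]                      # best quarter: two copies each
    middle = ranked[q:len(ranked) - q]    # middle part: one copy each
    # The last q individuals are eliminated.
    return top + top + middle


# Toy population in which an individual's fitness is its own value.
survivors = phase_out_selection(list(range(1, 9)), fitness=lambda x: x)
print(sorted(survivors, reverse=True))  # [8, 8, 7, 7, 6, 5, 4, 3]
```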
When applied to a database of recent student achievement in schools, the new algorithm
reduces the number of unnecessary operations, streamlines the generation of frequent itemsets,
and is more efficient than the Apriori algorithm, as shown in the graph.
In [9], an IOGA (Immune Optimization based Genetic Algorithm) approach for incremental
association rules is proposed. Dynamic immune evolution and the biomimetic mechanisms of
Engineering Immune Computing (EIC), namely immune recognition, immune memory, and immune
regulation, are introduced into the GA.
Immune recognition is critical in the immune system. Its essence is to distinguish self from
non-self, which can be evaluated by the affinity between antibodies and antigens.
The experimental data set comes from a company's daily records of the APIs (in the local computer
operating system) that were called by outside files from the network, together with whether the
files led to a computer virus.
A Method for Finding Implicating Rules Based on the Genetic Algorithm [10] is applied to car test
results using the following algorithm.
Algorithm GAFIR
Input: database D, threshold of the strength of implication, maximum number of generations GEN,
population size N, crossover probability Pc, mutation probability Pm
Output: rule set RS
Procedure GAFIR
L0 = Initial(D, N)
TR = GetRules(L0, threshold)
For i = 1 to GEN
Begin
    C = Crossover(Li-1, Pc)
    Li = Mutation(C, Pm)
    TR = TR ∪ GetRules(Li, threshold)
End
RS = TR
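The control flow of GAFIR can be sketched in Python as below. The source does not define Initial, GetRules, Crossover or Mutation, so they are passed in as callables; the toy callables in the usage example are placeholders with no mining meaning and only show how rules accumulate across generations.

```python
def gafir(db, threshold, gen, n, pc, pm,
          initial, get_rules, crossover, mutation):
    """Accumulate the rules found in every generation of a simple GA loop."""
    population = initial(db, n)
    rules = set(get_rules(population, threshold))
    for _ in range(gen):
        offspring = crossover(population, pc)
        population = mutation(offspring, pm)
        rules |= set(get_rules(population, threshold))
    return rules


# Toy placeholders: individuals are integers, "mutation" adds 1 to each,
# and a "rule" is any individual that reaches the threshold.
result = gafir(
    db=[1, 2, 3], threshold=4, gen=2, n=3, pc=0.8, pm=0.05,
    initial=lambda db, n: db[:n],
    get_rules=lambda pop, thr: [x for x in pop if x >= thr],
    crossover=lambda pop, pc: list(pop),
    mutation=lambda pop, pm: [x + 1 for x in pop],
)
print(sorted(result))  # [4, 5]
```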
When tested on the car test data, the number of interesting rules levels off after about 400
generations. Generations 1 to 200 form the phase in which interesting implication rules
are discovered most frequently. The discovery rate then levels off, and by about 700 generations
nearly all the interesting rules have been discovered. The greater the fitness threshold, the
fewer interesting implication rules are extracted; conversely, the smaller the threshold, the
more interesting implication rules are extracted.
In software engineering, Searching for Rules to Find Defective Modules in Unbalanced Data
Sets [11] addresses the problem of finding defective modules. Feature selection (attribute
selection) is applied so that only those attributes of the data sets capable of predicting
defective modules are used. With the reduced data set, a genetic algorithm searches for rules
characterizing modules with a high probability of being defective.
For the given data sets, feature selection was a necessary step to reduce the data, and a
genetic algorithm was then used as a subgroup discovery technique to generate rules covering
only defective modules. The results showed that, in general, the data sets are not very
homogeneous in either the attributes selected or the rules generated. The results nevertheless
provide some points for further research.
Mining Large Data Sets
When the data becomes too large, a better solution is an efficient distributed genetic algorithm
for extracting classification rules, based on a new method of dynamic data distribution
that is parallelized over networks of computers in order to mine large datasets.
The model is as shown
Distributed Model
EDGAR uses a local GA in each node, with some communication with neighboring nodes to exchange
individuals and poorly covered examples. The algorithm is as follows.
Generate initial population using seeding
While (Stop Criteria)
For a number of generations
Select g individuals by US
For each individual
If % Perform recombination
If % Perform mutation
end
replace g individual from population
Exchange individuals
Exchange training examples
end
end
Extract set of rules by greedy algorithm
Send set of rules to Central Pool
If (not improving) reduce training data
end
For the experimental study, a well-known problem, Nursery, was chosen from the UCI repository.
This dataset has 12,960 instances, which is large enough to test data distribution. Nursery is a
complex dataset with 6 characteristics and 5 unbalanced classes, three of which account for more
than 97% of the dataset. The following results were observed.
The execution time of the proposed algorithm shows a considerable speedup and better behavior
than the compared algorithm as the number of processors increases.
Classification accuracy is similar in both algorithms and does not follow any trend
relative to the number of processors.
The number of rules generated is between 60% and 80% smaller in EDGAR.
Conclusion
Compared with other association rule mining methods, the genetic algorithm gives better
accuracy and higher efficiency, and its robustness was found to be sound. It is also faster
than the other methods.
The pitfalls of the genetic algorithm are overcome by modifying it, and the algorithm is
found to be versatile, which allows it to be applied to any dataset.