Association Rule Mining Using Genetic Algorithm: The Role of Estimation Parameters
Abstract. Genetic Algorithms (GA) have emerged as practical, robust optimization and search methods for generating accurate and reliable association rules.
The performance of a GA for mining association rules depends greatly on its
parameters, namely population size, crossover rate, mutation rate, the fitness function adopted and the selection method. The objective of this paper is to compare the
performance of the genetic algorithm for association rule mining as these
parameters are varied. Tests on three datasets, namely Lenses, Iris and Haberman,
indicate that accuracy depends mainly on the fitness function, which is the key
parameter of the GA. The appropriate population size is affected by the size of
the dataset under study. The crossover probability changes the convergence rate
with minimal changes in accuracy. The size of the dataset and the relationship
between its attributes also play a role in achieving the optimum accuracy.
Keywords: Association rules, Genetic Algorithm, Population size, Crossover
rate, Fitness function.
1 Introduction
Data mining, also referred to as knowledge discovery in databases, is the process of
nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints and regularities) from data in a database. Data
mining combines the theory and technology of several domains, including artificial
intelligence, machine learning, statistics and neural networks. Association rule
mining is a major area of data mining that discovers the relations between different
attributes by analyzing and processing the data in a database.
Many algorithms for generating association rules have been developed over time. Some
of the well-known algorithms are Apriori, Eclat and FP-Growth. Many existing
algorithms traverse the database many times, so the I/O overhead and computational
complexity become very high and cannot meet the requirements of large-scale database mining. A genetic algorithm is a global random search algorithm based on the
biological theory of evolution and on molecular genetics. It has strong randomness,
robustness and implicit parallelism, and it can search quickly and effectively for a
global optimum, which makes it an effective way to deal with large-scale datasets.
At present, data mining methods based on genetic algorithms have made some progress,
and classification systems based on genetic algorithms have also yielded some results.
A. Abraham et al. (Eds.): ACC 2011, Part I, CCIS 190, pp. 639-648, 2011.
© Springer-Verlag Berlin Heidelberg 2011
This paper analyses the mining of association rules by applying genetic algorithms. There have been several attempts at mining association rules using genetic
algorithms. Robert Cattral et al. [1] describe the evolution of a hierarchy of rules using
a genetic algorithm with chromosomes of varying length and macro mutations. The
initial population is seeded rather than randomly selected. Manish Saggar et al. [2]
propose an algorithm with binary encoding in which the fitness function is
based on the confusion matrix. The individuals are represented using the Michigan approach. Roulette wheel selection is performed after first normalizing the values of all candidates.
A genetic algorithm based on the concept of strength of implication of rules was presented by Zhou et al. [3]. The properties of independence and correlation of descriptions in rules are used for the fitness calculation. Genxiang et al. [4] introduced
dynamic immune evolution and the biometric mechanisms of engineering immune computing, namely immune recognition, immune memory and immune regulation, into GA
for mining association rules.
Gonzales et al. [5] introduced the Genetic Relation Algorithm (GRA), which is based on
evaluating the distances between rules. The distance is calculated using two matching
criteria, namely complete match and partial match. A genetic algorithm can easily
converge prematurely or take too long to converge during the evolution process.
Hong Lei et al. [6] propose a GA in which the fitness function is based on predictive accuracy, comprehensibility and an interestingness factor. The selection method is based
on elitist recombination.
In Haiying Ma et al. [7], the data are encoded with a gene string structure
in which complex concepts are mapped to linear symbols. The fitness function measures the overall performance of the process rather than that of
individual rules, with the bit strings interpreted as a complex process. Adaptive
exchange probability (Pc) and mutation probability (Pm) are adopted in that paper.
Hong Guo et al. [8] adopt an adaptive mutation rate to avoid excessive
variation causing non-convergence or convergence to a local optimum. An individual-based selection method with sorting is applied during evolution in order to prevent high-fitness individuals from causing early convergence through rapid growth in
their number.
As the parameters of the genetic algorithm and the fitness function are found to be
the major area of interest in the above studies, this paper explores the effects
of the genetic parameters and the controlling variables of the fitness function on three
different datasets.
A brief introduction to Association Rule Mining and GA is given in Section 2,
followed by the methodology in Section 3, which describes the basic implementation details of Association Rule Mining with GA. Section 4 presents the parameters that decide the
efficiency of the algorithm. Section 5 presents the experimental results,
followed by the conclusion in the last section.
3 Methodology
The evolutionary process of a GA is a highly simplified and stylized simulation of its
biological counterpart. It starts from a population of individuals randomly generated according to some probability distribution, usually uniform, and updates this population
in steps called generations. In each generation, multiple individuals are selected from the current population according to their fitness, recombined through crossover and
modified through mutation to form a new population.
A. [Start] Generate random population of n chromosomes.
B. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
C. [New population] Create a new population by repeating the following steps until
the new population is complete.
i. [Selection] Select two parent chromosomes from a population according
to their fitness.
ii. [Crossover] With a crossover probability alter the parents to form a new
offspring.
iii.
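The generational loop outlined above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study: the list above breaks off after the crossover step, so the mutation step, the fitness-proportionate selection, the bit-string encoding and all names (`evolve`, `pc`, `pm`) are assumptions taken from the conventional GA outline.

```python
import random

def evolve(fitness, n=20, length=8, pc=0.5, pm=0.01, generations=30, seed=0):
    """Generic generational GA over fixed-length bit strings (illustrative)."""
    rng = random.Random(seed)
    # [Start] generate a random population of n chromosomes
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        # [Fitness] evaluate f(x) for each chromosome x (assumed non-negative)
        weights = [fitness(c) + 1e-9 for c in pop]
        new_pop = []
        # [New population] repeat until the new population is complete
        while len(new_pop) < n:
            # [Selection] two parents chosen according to their fitness
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            child = p1[:]
            # [Crossover] with probability pc, recombine at a single point
            if rng.random() < pc:
                cut = rng.randint(1, length - 1)
                child = p1[:cut] + p2[cut:]
            # [Mutation] assumed step: flip each bit with probability pm
            child = [1 - g if rng.random() < pm else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness for demonstration only: the number of 1-bits in the chromosome.
best = evolve(fitness=sum)
```

The toy fitness stands in for a rule-quality measure; in a rule-mining setting `fitness` would score the rule that a chromosome encodes.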
(2)
In the above formula, Rs + Rc = 1 (Rs ≥ 0, Rc ≥ 0), and Suppmin and Confmin are the respective values of minimum support and minimum confidence. If
Suppmin and Confmin are set to higher values, the value of the fitness function is also
found to be high.
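As a point of reference for the thresholds Suppmin and Confmin, the sketch below shows how the support and confidence of a rule are conventionally computed and compared with the minimum values. It uses the standard textbook definitions only and does not reproduce the fitness formula; the function names, the `supp_min` and `conf_min` arguments and the toy transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / supp_a

def passes_thresholds(transactions, antecedent, consequent, supp_min, conf_min):
    """True when the rule meets both the minimum support and minimum confidence."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= supp_min
            and confidence(transactions, antecedent, consequent) >= conf_min)

# Toy data, invented for illustration.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
supp_ab = support(transactions, {"a", "b"})        # 2 of 4 transactions
conf_ab = confidence(transactions, {"a"}, {"b"})   # 2 of the 3 containing "a"
```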
4.5 Crossover Operator
Crossover entails choosing two individuals to swap segments of their code, producing
artificial "offspring" that are combinations of their parents. This process is intended to
simulate the analogous process of recombination that occurs in chromosomes during
sexual reproduction. Common forms of crossover include single-point crossover and
uniform crossover. In single-point crossover, a point of exchange is set at a random
location in the two individual genomes; one individual contributes all its code up to
the point of crossover and the second individual contributes all its code after it to
produce an offspring. In uniform crossover, the value at any given location in the
offspring's genome is either the value of one parent's genome at that location or the
value of the other parent's genome at that location, chosen with 50/50 probability [8].
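The two crossover forms described above can be sketched as follows. The list-based genome representation and the function names are assumptions made for illustration.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Child takes p1's genes before a random cut point and p2's genes after it."""
    point = rng.randint(1, len(p1) - 1)  # cut strictly inside the genome
    return p1[:point] + p2[point:]

def uniform_crossover(p1, p2, rng):
    """Each gene is copied from one parent or the other with 50/50 probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

rng = random.Random(42)
parent_a, parent_b = [0] * 6, [1] * 6
child_sp = single_point_crossover(parent_a, parent_b, rng)
child_u = uniform_crossover(parent_a, parent_b, rng)
```

With an all-zero and an all-one parent, the single-point child is always a run of zeros followed by a run of ones, which makes the cut point visible.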
4.6 Mutation Operator
Mutation adjusts some of the gene values of an individual [5]. This
part of the genetic algorithm requires great care. Two probabilities are involved. The
first, usually called Pm, decides whether mutation is to be performed at all. When a
candidate fulfills this criterion, a second probability, the locus probability, decides the point of the candidate at which the mutation is
applied.
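A minimal sketch of this two-stage mutation is given below. It assumes binary genes and a uniform choice of locus; neither the function name nor the uniform locus distribution is specified in the text.

```python
import random

def mutate(chromosome, pm, rng):
    """Two-stage mutation: Pm decides whether to mutate, then a locus is chosen."""
    child = list(chromosome)
    if rng.random() < pm:                  # stage 1: does this candidate mutate?
        locus = rng.randrange(len(child))  # stage 2: where (assumed uniform)
        child[locus] = 1 - child[locus]    # flip the binary gene at that locus
    return child

rng = random.Random(7)
original = [0, 1, 1, 0, 1]
unchanged = mutate(original, pm=0.0, rng=rng)  # Pm = 0 never mutates
flipped = mutate(original, pm=1.0, rng=rng)    # Pm = 1 always flips one gene
```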
4.7 Number of Generations
The generational process of mining association rules by Genetic algorithm is repeated
until a termination condition has been reached. Common terminating conditions are:
5 Experimental Studies
The objective of this study is to compare the accuracy achieved on the datasets by varying
the GA parameters. Chromosomes use fixed-length binary encoding.
As crossover is performed at the attribute level, the mutation rate is set to zero
so as to retain the original attribute values. The selection method used is tournament
selection. The fitness function adopted is as given in equation (1).
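Tournament selection, the selection method named above, can be sketched as follows. The tournament size `k` is an assumption, since the text does not state the value used, and the rule labels and fitness values are invented for illustration.

```python
import random

def tournament_select(population, fitnesses, k, rng):
    """Draw k distinct contenders at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]

rng = random.Random(3)
population = ["rule1", "rule2", "rule3", "rule4"]
fitnesses = [0.1, 0.9, 0.4, 0.3]
# With k equal to the population size, the fittest individual always wins.
champion = tournament_select(population, fitnesses, k=4, rng=rng)
pick = tournament_select(population, fitnesses, k=2, rng=rng)
```

Smaller values of `k` lower the selection pressure, because weaker individuals are more likely to land in a tournament without a strong competitor.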
Three datasets from the UCI Machine Learning Repository, namely Lenses, Haberman's Survival and Iris, were used for experimentation. The Lenses dataset
has 4 attributes and 24 instances, the Haberman's Survival dataset has 3 attributes and
306 instances, and the Iris dataset has 5 attributes and 150 instances. The algorithm is
implemented using the MATLAB R2008a simulation package. The flow of the system is
shown in the flowchart below.
[Figure: flowchart of the system]
The default values set for the GA parameters are given in Table 1.
The accuracy and the convergence rate obtained by controlling the GA parameters are recorded in Table 2. Accuracy is the number of instances that match between the original
dataset and the resulting population, divided by the number of instances in the dataset. The
convergence rate is the generation at which the fitness value becomes fixed. The population size is varied for the three datasets, from the size of the dataset to one and a half
times the dataset size, while the other parameters are kept fixed.
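The two measures defined above can be sketched as follows. This is one reading of the definitions, stated as an assumption: an instance counts as matched when an identical individual appears in the final population, and the convergence generation is the first generation from which the best fitness no longer changes. The function names and the toy values are hypothetical.

```python
def accuracy(dataset, population):
    """Share of dataset instances that also occur in the final population."""
    individuals = {tuple(ind) for ind in population}
    matched = sum(tuple(row) in individuals for row in dataset)
    return matched / len(dataset)

def convergence_generation(best_fitness_history):
    """1-based generation from which the recorded fitness stays fixed."""
    final = best_fitness_history[-1]
    gen = len(best_fitness_history) - 1
    while gen > 0 and best_fitness_history[gen - 1] == final:
        gen -= 1
    return gen + 1

# Toy values, invented for illustration.
dataset = [(0, 1), (1, 0), (1, 1), (0, 0)]
population = [(0, 1), (1, 1), (1, 1)]
acc = accuracy(dataset, population)                          # 2 of 4 instances matched
gen = convergence_generation([0.2, 0.5, 0.7, 0.7, 0.7])      # fixed from generation 3
```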
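Table 1. Default GA parameters

Parameter            Value
Population size      Instances * 1.5
Crossover rate       0.5
Mutation rate        0.0
Selection method     Tournament Selection
Minimum support      0.2
Minimum confidence   0.8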
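Table 2. Comparison based on variation in population size

                   Lenses                   Haberman                 Iris
Population size    Accuracy %  Generations  Accuracy %  Generations  Accuracy %  Generations
No. of Instances   75          7            71          114          77          88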
It can be seen from Table 2 that for the Lenses dataset, which is small, optimal
accuracy is achieved when the population size is one and a half times the size of
the dataset, whereas for the larger Haberman dataset the accuracy is maximum when
the population size is equal to the dataset size. For the Iris dataset, which is of moderate size,
the population has to be set to 1.25 times the size of the dataset to achieve the optimum
result.
As the fitness function is considered to be the crucial factor for the GA, variations
are introduced in the fitness function while the other parameters remain unchanged. In
Table 3 the minimum confidence and minimum support values are altered while the others are at
their default values, and the results are recorded.
From Table 3 it is clear that variation in minimum support and confidence
brings greater changes in accuracy. When the values of minimum support and confidence are both set to the minimum, the accuracy is found to be low regardless of the size of the
dataset. The same is noted when both values are set to the maximum. Optimum accuracy is achieved when a tradeoff between minimum confidence and minimum
support is set.
Table 3. Comparison based on variation in Minimum Support and Confidence
Lenses
Haberman
Iris
When the parameters Rs and Rc are altered in the fitness function, minimal alterations in accuracy are noted, and hence their impact is not taken up for analysis.
In Table 4 the crossover probability is altered while the other GA parameters are set to
their default values, and the results observed are recorded.
Table 4. Comparison based on variation in Crossover Probability

            Pc = 0.25                Pc = 0.5                 Pc = 0.75
Dataset     Accuracy %  Generations  Accuracy %  Generations  Accuracy %  Generations
Lenses      95          8            95          16           95          13
Haberman    69          77           71          83           70          80
Iris        84          45           86          51           87          55
Dataset    No. of     No. of      Minimum  Minimum     Crossover  Accuracy
           instances  attributes  support  confidence  rate       in %
Lenses     24         4           0.2      0.9         0.25       95
Haberman   306        3           0.9      0.2         0.5        71
Iris       150        5           0.2      0.9         0.75       87
It is observed from the experimental analysis that the choice of optimum population size for better accuracy depends upon the number of instances in the dataset. If the dataset is large, a population size equal to the number of instances in the dataset is
found to produce better accuracy.
The values to set for minimum support and confidence depend on the dataset and
the relationship between its attributes. A tradeoff between minimum confidence and minimum support has to be struck to attain optimum results. The crossover rate mainly affects the
convergence rate of the system and has minimal effect on the accuracy of the
system.
6 Conclusion
Genetic Algorithms have been used to solve difficult optimization problems in a
number of fields and have proved to produce optimum results in mining association
rules. When a genetic algorithm is used for mining association rules, the GA parameters
decide the efficiency of the system. Minimum support, minimum confidence and
population size are the key parameters deciding the accuracy of the system. The setting of the population size is based on the size of the problem under study, whereas
the minimum confidence and minimum support to be set depend upon the problem
under study. The optimum value of the crossover rate leads to earlier convergence while
playing a minimal role in achieving better accuracy. The optimum setting of
the GA parameters varies from dataset to dataset, and the fitness function plays a major role
in optimizing the results. The size of the dataset and the relationship between attributes in
the data contribute to the setting of the parameters. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.
References
1. Cattral, R., Oppacher, F., Deugo, D.: Rule Acquisition with a Genetic Algorithm. In: Proceedings of the 1999 Congress on Evolutionary Computation, CEC 1999 (1999)
2. Saggar, M., Agrawal, A.K., Lad, A.: Optimization of Association Rule Mining. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3725-3729 (2004)
3. Zhou, J., Li, S.-y., Mei, H.-y., Liu, H.-x.: A Method for Finding Implicating Rules Based on the Genetic Algorithm. In: Third International Conference on Natural Computation, vol. 3, pp. 400-405 (2007)
4. Zhang, H. Chen.: Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining. In: International Conference on Artificial Intelligence and Computational Intelligence, AICI 2009, vol. 4, pp. 341-345 (2009)
5. Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K.: Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction. In: IEEE Congress on Evolutionary Computation, CEC 2009, pp. 3249-3255 (2009)
6. Shi, X.-J., Lei, H.: Genetic Algorithm-Based Approach for Classification Rule Discovery. In: International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII 2008, vol. 1, pp. 175-178 (2008)
7. Ma, H., Li, X.: Application of Data Mining in Preventing Credit Card Fraud. In: International Conference on Management and Service Science, MASS 2009, pp. 1-6 (2009)
8. Guo, H., Zhou, Y.: An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application. In: 3rd International Conference on Genetic and Evolutionary Computing, WGEC 2009, pp. 117-120 (2009)
9. Tang, H., Lu, J.: Hybrid Algorithm Combined Genetic Algorithm with Information Entropy for Data Mining. In: 2nd IEEE Conference on Industrial Electronics and Applications, pp. 753-757 (2007)
10. Dou, W., Hu, J., Hirasawa, K., Wu, G.: Quick Response Data Mining Model using Genetic Algorithm. In: SICE Annual Conference, pp. 1214-1219 (2008)