

Performance Analysis of Genetic Algorithm for Mining Association Rules

Abstract
Association rule (AR) mining is a data mining task that attempts to discover interesting patterns or relationships among data in large databases. The genetic algorithm (GA), based on evolutionary principles, has found a strong base in mining ARs. Many works using genetic algorithms to mine predictive and classification rules have been carried out since the 1990s. This paper analyzes the performance of GA in mining ARs based on the variations and modifications made to the GA parameters. Recent works from the past seven years on mining association rules using genetic algorithms are considered for the analysis. In the works surveyed, the genetic algorithm generates more accurate results than the other formal methods available. The fitness function, crossover rate, and mutation rate are found to be the primary parameters involved in implementing a genetic algorithm. Variations and modifications introduced in these primary GA parameters are found to increase the accuracy of the system moderately. The speedup of the system is found to increase when the selection and fitness function are altered.

Keywords: Association rule, Genetic Algorithm, GA parameters, Accuracy, Speedup.

1. Introduction
Data mining, also referred to as knowledge discovery in databases, is the process of nontrivial extraction of implicit, previously unknown and potentially useful information from data in large databases [23]. The knowledge or information mined from databases is often expressed as association rules. Association rule mining is one of the important research areas in data mining [20]. It describes the relationships among itemsets present in databases. Association rule mining has been implemented using algorithms such as Apriori, Eclat and FP-growth. These algorithms traverse the database repeatedly. Their input/output overhead and computational complexity are high, so they cannot meet the requirements of large-scale database mining.

Genetic algorithm is a promising and emerging research area for mining association rules. A genetic algorithm [25] is a search method that simulates the evolutionary process. It can handle large volumes of data and is widely applied in mining association rules. Genetic algorithms are typically implemented as computer simulations in which optimization is the main criterion for solving the problem. Members of a space of candidate solutions, called individuals, are represented using abstract representations called chromosomes. The GA consists of an iterative process that evolves a working set of individuals, called a population, toward an objective function, or fitness function. Traditionally, solutions are represented using fixed-length strings, especially binary strings, but alternative encodings have been developed.

As many works have been carried out on mining association rules with genetic algorithms, this paper surveys the existing work on the application of genetic algorithms in mining association rules and analyzes the performance of the methodologies adopted. This paper is organized as follows. Section 2 discusses the preliminaries of association rules and the genetic algorithm for mining association rules. Section 3 surveys the existing work on mining association rules based on genetic algorithms, Section 4 presents the observations drawn from that work.

2. Preliminaries
The preliminary concepts are explained in this section. The concept of association rules is explained first, followed by the genetic algorithm for association rule mining and then the genetic operators.

2.1. Association Rules and Association Rule Mining
Association rules are if-then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent. An association rule [2] is expressed as X => Y, where X is the antecedent and Y is the consequent. Each association rule has two quality measurements, support and confidence. Support is the frequency of the occurring pattern, and confidence is the strength of the implication. They are defined as follows.

An itemset X in a transaction database D has a support, denoted sup(X) or simply p(X), that is the ratio of transactions in D containing X:

sup(X) = (number of transactions containing X) / (total number of transactions)    (1)

The confidence of a rule X => Y, written conf(X => Y), is defined as

conf(X => Y) = sup(X U Y) / sup(X)    (2)
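As a minimal sketch of equations (1) and (2), the snippet below computes support and confidence over a small in-memory transaction list. The function names and the toy database `TOY_DB` are illustrative assumptions, not part of any surveyed system.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Equation (1): fraction of transactions containing every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Equation (2): sup(X U Y) / sup(X); returns 0.0 when X never occurs."""
    x = frozenset(antecedent)
    sup_x = support(x, transactions)
    if sup_x == 0.0:
        return 0.0
    return support(x | frozenset(consequent), transactions) / sup_x


# Hypothetical toy database of five market-basket transactions.
TOY_DB: List[Transaction] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk", "eggs"}),
]
```

For the rule bread => milk on this toy data, the support of {bread, milk} is 3/5 and the confidence is approximately 0.75, since three of the four transactions containing bread also contain milk.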

Association rule mining finds frequent patterns, correlations, associations or causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories.

2.2. Genetic Algorithm for Association Rule Mining
A genetic algorithm (GA) is an adaptive heuristic search algorithm based on the evolutionary ideas of natural selection and genetics. GAs represent an intelligent exploitation of random search used to solve optimization problems. Although randomized, GAs are by no means random; instead they exploit historical information to direct the search into the region of better performance within the search space. The basic techniques of GAs are designed to simulate the processes in natural systems necessary for evolution, especially those that follow the principle first laid down by Charles Darwin of "survival of the fittest".

The evolutionary process of a GA [3] is a highly simplified and stylized simulation of the biological version. It starts from a population of individuals randomly generated according to some probability distribution, usually uniform, and updates this population in steps called generations. In each generation, multiple individuals are selected from the current population based on their fitness, bred using crossover, and modified through mutation to form a new population.

A. [Start] Generate a random population of n chromosomes.
B. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
C. [New population] Create a new population by repeating the following steps until the new population is complete:
   i. [Selection] Select two parent chromosomes from the population according to their fitness.
   ii. [Crossover] With a crossover probability, cross over the parents to form new offspring (children).
   iii. [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
   iv. [Accepting] Place the new offspring in the new population.
D. [Replace] Use the newly generated population for a further run of the algorithm.
E. [Test] If the end condition is satisfied, stop and return the best solution in the current population.
F. [Loop] Go to step B.

2.3. Genetic Operators
The GA maintains a population of n chromosomes (solutions) with associated fitness values. Parents are selected to mate on the basis of their fitness, producing offspring via a reproductive plan (mutation and crossover). Consequently, highly fit solutions are given more opportunities to reproduce (that is, to be selected for the next generation), so that offspring inherit characteristics from each parent. As parents mate and produce offspring, room must be made for the new arrivals, since the population is kept at a static size (the population size). In this way it is hoped that over successive generations better solutions will thrive while the least fit solutions die out. The representation scheme, population size, crossover rate, mutation rate, fitness function and selection operator are the GA operators and are discussed below.

2.3.1. Encoding chromosomes
The process of representing the individual chromosomes is called encoding. The representation can be in the form of bits, numbers, trees, arrays, lists or any other objects. The encoding method adopted depends mainly on the problem being solved. The decision on the best coding system is part of the design of the evaluation function. Some coding schemes are shown in Figure 1.
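To make the binary encoding and the generational loop of steps A to F in Section 2.2 concrete, here is a minimal sketch. It assumes a non-negative fitness function, roulette wheel selection, one-point crossover, bit-flip mutation and a fixed number of generations as the end condition; the function name `run_ga` and all default parameter values are illustrative choices, not values taken from the surveyed works.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def run_ga(fitness: Callable[[Chromosome], float], n_genes: int = 16,
           pop_size: int = 20, p_cross: float = 0.8, p_mut: float = 0.02,
           generations: int = 50, seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    # A. [Start] random population of binary chromosomes
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # F. [Loop]
        # B. [Fitness] evaluate every chromosome
        scores = [fitness(c) for c in population]
        # Small offset keeps the weights valid when every score is zero.
        weights = [s + 1e-9 for s in scores]
        # C. [New population]
        new_population: List[Chromosome] = []
        while len(new_population) < pop_size:
            # i. [Selection] fitness-proportionate choice of two parents
            p1, p2 = rng.choices(population, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            # ii. [Crossover] one-point crossover with probability p_cross
            if rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # iii. [Mutation] flip each locus with probability p_mut
            for child in (c1, c2):
                for i in range(n_genes):
                    if rng.random() < p_mut:
                        child[i] ^= 1
            # iv. [Accepting]
            new_population.extend([c1, c2])
        # D. [Replace]
        population = new_population[:pop_size]
        # E. [Test] here the end condition is simply the generation budget;
        # the best chromosome seen so far is tracked along the way.
        best = max(population + [best], key=fitness)
    return best
```

Calling `run_ga(sum)` maximizes the number of 1 bits, a stand-in objective chosen only because it is easy to inspect; a rule-mining fitness function would replace it.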

Coding scheme           Chromosome
Binary encoding         101110101110, 111100110001
Octal encoding          23147632, 15754231
Hexadecimal encoding    9DA4, F34E
Value encoding          3.1234 5.3214 7.9812 2.1567
                        AHGYNBRYGUJJHUYIUYIU
                        (back) (right) (forward) (left)

Figure 1. Encoding schemes

2.3.2. Fitness function
The fitness of an individual in a genetic algorithm is the value of an objective function for its phenotype. To calculate fitness, the chromosome first has to be decoded and the objective function then evaluated. The fitness not only indicates how good the solution is, but also corresponds to how close the chromosome is to the optimal one. Each solution or chromosome needs to be awarded a figure of merit to indicate how close it came to meeting the overall specification, and this is generated by applying the fitness function to the test, or simulation, results obtained from that solution.

2.3.3. Selection operator

The process of choosing the two parents from the mating pool for reproduction is characterized by the selection operator. Selection is based on the fitness of the individual: the higher the fitness, the greater the chance of the individual being selected. The convergence of the algorithm largely depends upon the chromosomes selected for reproduction. Figure 2 shows the selection process.
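The idea that fitter individuals are more likely to be chosen can be sketched with two common selection schemes, roulette wheel and tournament selection. The function names and signatures below are assumptions made for illustration; fitness values are assumed to be non-negative.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def roulette_select(population: Sequence[T], fitnesses: Sequence[float],
                    rng: random.Random) -> T:
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # No fitness information: fall back to a uniform random choice.
        return rng.choice(population)
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def tournament_select(population: Sequence[T], fitnesses: Sequence[float],
                      rng: random.Random, k: int = 2) -> T:
    """Draw k distinct contenders at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]
```

Roulette wheel selection keeps every individual with positive fitness in play, while the tournament size `k` gives a direct handle on selection pressure: a larger `k` favours the best individuals more strongly.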

[Figure 2: diagram showing the two best individuals, the mating pool and the new population]

Figure 2. Selection process

Some popular selection methods are roulette wheel selection, random selection, tournament selection and universal sampling. Elitism is introduced to eliminate the chance of losing information during mutation.

2.3.4. Crossover operator
Crossover is the process of taking two parent solutions and producing two new offspring from them. Crossover is a recombination operator that proceeds in three steps:
1. The reproduction operator selects at random a pair of individual strings for mating.
2. A cross site is selected at random along the string length.
3. Finally, the position values are swapped between the two strings following the cross site.

Single-point crossover, two-point crossover, multipoint crossover and uniform crossover are among the crossover techniques adopted.

One Point Crossover
A crossover operator that randomly selects a crossover point within a chromosome and then interchanges the two parent chromosomes at this point to produce two new offspring. Consider the following two parents, which have been selected for crossover. The | symbol indicates the randomly chosen crossover point.

Table 1. One point crossover
Parent chromosomes       Reproduced chromosomes
Parent 1: 11001|010      Offspring 1: 11001|111
Parent 2: 00100|111      Offspring 2: 00100|010

Two Point Crossover
A crossover operator that randomly selects two crossover points within a chromosome and then interchanges the two parent chromosomes between these points to produce two new offspring.

Table 2. Two point crossover
Parent chromosomes       Reproduced chromosomes
Parent 1: 110|010|10     Offspring 1: 110|001|10
Parent 2: 001|001|11     Offspring 2: 001|010|11
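The one-point and two-point operators can be sketched on bit strings as follows. The crossover points are passed in explicitly so that the examples of Tables 1 and 2 can be reproduced; in a GA run they would be drawn at random. The function names are illustrative assumptions.

```python
from typing import Tuple


def one_point_crossover(p1: str, p2: str, point: int) -> Tuple[str, str]:
    """Swap the tails of two equal-length strings after position `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def two_point_crossover(p1: str, p2: str, first: int,
                        second: int) -> Tuple[str, str]:
    """Swap the middle segment [first, second) between the two parents."""
    child1 = p1[:first] + p2[first:second] + p1[second:]
    child2 = p2[:first] + p1[first:second] + p2[second:]
    return child1, child2


# Parents of Table 1, cut after the fifth gene.
print(one_point_crossover("11001010", "00100111", 5))
# -> ('11001111', '00100010')

# Parents of Table 2, cut after the third and sixth genes.
print(two_point_crossover("11001010", "00100111", 3, 6))
# -> ('11000110', '00101011')
```

A random cut point for strings of length n could be drawn with `random.randint(1, n - 1)`, which keeps at least one gene on each side of the cut.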

Uniform Crossover
A crossover operator that decides (with some probability, known as the mixing ratio) which parent will contribute each of the gene values in the offspring chromosomes. This allows the parent chromosomes to be mixed at the gene level rather than the segment level (as with one and two point crossover). For some problems, this additional flexibility outweighs the disadvantage of destroying building blocks. If the mixing ratio is 0.5, approximately half of the genes in the offspring will come from parent 1 and the other half will come from parent 2.

Table 3. Uniform crossover
Parent chromosomes       Reproduced chromosomes
Parent 1: 11001010       Offspring 1:
Parent 2: 00100111       Offspring 2:
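A minimal sketch of uniform crossover on bit strings is given below. It assumes that the mixing ratio is the probability that offspring 1 takes a gene from parent 1, with offspring 2 receiving the gene from the other parent; the function name and this interpretation are assumptions for illustration.

```python
import random
from typing import Optional, Tuple


def uniform_crossover(p1: str, p2: str, mixing_ratio: float = 0.5,
                      rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Choose the donor parent independently for every gene position."""
    rng = rng or random.Random()
    child1, child2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < mixing_ratio:
            # Offspring 1 inherits from parent 1, offspring 2 from parent 2.
            child1.append(g1)
            child2.append(g2)
        else:
            # The roles of the parents are swapped at this position.
            child1.append(g2)
            child2.append(g1)
    return "".join(child1), "".join(child2)


# Parents of Table 3; the result depends on the random draws.
print(uniform_crossover("11001010", "00100111", 0.5, random.Random(42)))
```

Because the donor is chosen per gene, the two offspring are complementary: at every position, the pair of offspring genes is the same as the pair of parent genes.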

Arithmetic Crossover
A crossover operator that linearly combines two parent chromosome vectors to produce two new offspring according to the following equations:

Offspring1 = a * Parent1 + (1 - a) * Parent2    (3)
Offspring2 = (1 - a) * Parent1 + a * Parent2    (4)

where a is a random weighting factor (chosen before each crossover operation).

Table 4. Arithmetic crossover (a = 0.7)
Parent chromosomes                   Reproduced chromosomes
Parent 1: (0.3)(1.4)(0.2)(7.4)       Offspring 1: (0.36)(2.33)(0.17)(6.86)
Parent 2: (0.5)(4.5)(0.1)(5.6)       Offspring 2: (0.44)(3.57)(0.13)(6.14)
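As a sketch of arithmetic crossover, the code below assumes the usual complementary-weight form, Offspring1 = a * Parent1 + (1 - a) * Parent2 and Offspring2 = (1 - a) * Parent1 + a * Parent2. The function name is an illustrative assumption.

```python
from typing import List, Sequence, Tuple


def arithmetic_crossover(p1: Sequence[float], p2: Sequence[float],
                         a: float) -> Tuple[List[float], List[float]]:
    """Blend two real-valued parents gene by gene with weight `a` in [0, 1]."""
    child1 = [a * x + (1 - a) * y for x, y in zip(p1, p2)]
    child2 = [(1 - a) * x + a * y for x, y in zip(p1, p2)]
    return child1, child2


parent1 = [0.3, 1.4, 0.2, 7.4]
parent2 = [0.5, 4.5, 0.1, 5.6]
child1, child2 = arithmetic_crossover(parent1, parent2, a=0.7)
# Rounded to two decimals, child1 matches Offspring 1 of Table 4.
print([round(g, 2) for g in child1])  # -> [0.36, 2.33, 0.17, 6.86]
```

With complementary weights, each gene of the two offspring sums to the same total as the corresponding parent genes, so the operator never moves a gene outside the interval spanned by the parents.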

Heuristic Crossover
A crossover operator that uses the fitness values of the two parent chromosomes to determine the direction of the search. The offspring are created according to the following equations:

Offspring1 = BestParent + r * (BestParent - WorstParent)    (5)
Offspring2 = BestParent    (6)

where r is a random number between 0 and 1.

2.3.5. Mutation operator
Mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state. Mutation of a bit involves flipping it, changing 0 to 1 and vice versa. Mutation plays the role of recovering lost genetic material as well as randomly disturbing genetic information. It is an insurance policy against the irreversible loss of genetic material. Mutation helps the search escape from local minima and maintains diversity in the population. There are many different forms of mutation for the different kinds of representation.

Flip Bit - A mutation operator that simply inverts the value of the chosen gene (0 goes to 1 and 1 goes to 0). This mutation operator can only be used for binary representation schemes.

Boundary - A mutation operator that replaces the value of the chosen gene with either the upper or lower bound for that gene (chosen randomly). This mutation operator can only be used for integer and float representation schemes.

Non-Uniform - A mutation operator that increases the probability that the amount of the mutation will be close to 0 as the generation number increases. This mutation operator keeps the population from stagnating in the early stages of the evolution and then allows the genetic algorithm to fine-tune the solution in the later stages of evolution. This mutation operator can only be used for integer and float representation schemes.

Uniform - A mutation operator that replaces the value of the chosen gene with a uniform random value selected between the user-specified upper and lower bounds for that gene. This mutation operator can only be used for integer and float representation schemes.

Gaussian - A mutation operator that adds a unit Gaussian distributed random value to the chosen gene. The new gene value is clipped if it falls outside the user-specified lower or upper bounds for that gene. This mutation operator can only be used for integer and float representation schemes.
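Four of the mutation operators described above can be sketched as small functions acting on one chosen gene. The names, signatures and the choice to return a mutated copy are illustrative assumptions; the non-uniform operator is omitted because its decay schedule is not specified in the text.

```python
import random
from typing import List, Sequence


def flip_bit(chromosome: str, index: int) -> str:
    """Flip Bit: invert the binary gene at `index`."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]


def boundary_mutation(genes: Sequence[float], index: int, lower: float,
                      upper: float, rng: random.Random) -> List[float]:
    """Boundary: replace the gene with its lower or upper bound at random."""
    mutated = list(genes)
    mutated[index] = rng.choice((lower, upper))
    return mutated


def uniform_mutation(genes: Sequence[float], index: int, lower: float,
                     upper: float, rng: random.Random) -> List[float]:
    """Uniform: replace the gene with a uniform random value in the bounds."""
    mutated = list(genes)
    mutated[index] = rng.uniform(lower, upper)
    return mutated


def gaussian_mutation(genes: Sequence[float], index: int, lower: float,
                      upper: float, rng: random.Random) -> List[float]:
    """Gaussian: add unit Gaussian noise, then clip to the bounds."""
    mutated = list(genes)
    noisy = mutated[index] + rng.gauss(0.0, 1.0)
    mutated[index] = min(upper, max(lower, noisy))
    return mutated


print(flip_bit("1010", 1))  # -> 1110
```

In a GA run, each gene would typically be passed to one of these operators with the mutation probability, rather than mutating a fixed index.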

3. Mining Association rules using Genetic algorithm


The existing work on mining association rules based on genetic algorithms is taken up for performance analysis in this section. The rules are categorized into prediction rules and classification rules. The datasets used for both kinds of rules come from domains such as medicine, business, education, finance, administration and problem solving. Both benchmark datasets and synthetic datasets are reported to give better accuracy when GA is adopted for rule mining. The analysis is carried out based on the genetic parameters and the methodology adopted.

If the accuracy of the rules generated is not up to the expected value, a data modification process [10] is introduced. In such cases the unselected attributes are taken and classified until the expected accuracy is attained. The attribute is then added to the training set for further generations. The accuracy of the system is enhanced through this data modification process. The insert and remove operators introduced in [2] control the size of the rules evolved and hence influence the comprehensibility of the rules. Rule pruning [4] removes the irrelevant attributes included in a rule during evolution, thereby reducing the number of attributes involved in processing.

3.1. Representation
Fixed-length chromosomes are adopted in [6]. Binary string representation is followed in [7, 10, 18, 21]. Array encoding [4, 12, 13, 20] allows a variable number of attributes to be stored. It enables mutation and crossover to be achieved at file level, thereby speeding up the system. The significance of this scheme is that the number of attributes need not be fixed in advance, and attribute addition and removal are easier. Array representation also makes the genetic operators easy to implement. The multilevel phenotype structure [9] is simple and helps to expand indicators for further studies. The chromosome length in the first level is the number of indicators encoded in the chromosome. The value of each gene indicates whether a specific kind of relation related to this indicator exists or not. The changes introduced in the representation of rules in the rule space [11] change the fitness function values, and this can be used to estimate the distance to the global maximum. In the individual representation [17], each individual is directly defined as an expression formed by a conjunction of predicates over some attributes. Representing an individual as a set of rules (ruleset) [19] facilitates reduction of the search space size.

3.2. Fitness Function
In almost all the studies carried out, the fitness criteria for classification rules depend upon two major factors, namely the comprehensibility metric and the confidence factor. The comprehensibility metric is based on the number of rules and the number of conditions in these rules. If a rule can have at most AC conditions, the comprehensibility metric Comp(R) of a rule R with NC(R) conditions can be defined as

Comp(R) = 1 - (NC(R) / AC)    (7)

The confidence factor Con(R) is measured by

Con(R) = SUP(A U C) / SUP(A)    (8)

where SUP(A) is the number of cases satisfying the antecedent A and SUP(A U C) is the number of cases satisfying both the antecedent A and the consequent C. Weights are attached to these two factors, and the fitness is a measure that combines them. The fitness function of prediction rules depends upon the support and confidence values of the itemset under analysis [13, 18, 20]. Fitness function [6] is defined as follows.

where F(x) denotes the number of genes in the chromosome of rule x that are not 0, n denotes the total number of attributes in the system, abs(n(x)) denotes the absolute value of the strength of implication of rule x, and k is the adjustment parameter between rule reduction and the strength of implication. The absolute value of the strength of implication of rule x steers chromosome evolution in the direction in which the strength of implication becomes stronger, and also steers the rule chromosome towards the simplest rules. In [8], minimum support and minimum confidence are not specified for the fitness function; the support factor alone decides the fitness value. The fitness function is designed as

The Interestingness factor (INF) and completeness factor (CF) are used for evaluating the rules [12]. The evaluation is based on the confusion matrix created with classification labels namely true positive, true negative, false positive and false negative.

A Pareto dominance based fitness factor [2] is introduced to find a single global solution or to solve a multi-objective problem, depending on the non-dominance criteria in each generation. Pareto-based methods measure an individual's fitness according to its dominance property. The non-dominated individuals in the population are regarded as the fittest regardless of their single-objective values. In [22], evaluation based on the all-confidence and collective strength factors enables the creation of quality rules and avoids infeasible rule generation. Predictive accuracy and a sensitivity test [9] are introduced for measuring rulesets. The sensitivity test measures the performance of the system by varying the GA parameters and comparing the results to determine how sensitive the system is to these parameters. The sustainable index, creditable index and inclusive index are used in [4] to build the evaluation function. These indexes are measures based on the number of rules satisfying the antecedent, the number of rules satisfying the consequent, and the number of cases satisfying the particular rule. The recall and precision parameters [19] are also used in fitness evaluation.
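The Pareto idea can be sketched with the two objectives of equations (7) and (8): a rule is kept if no other rule is at least as good on both comprehensibility and confidence and strictly better on one. The rule names and counts below are hypothetical, and the helper names are assumptions for illustration; this is not the specific ranking scheme of [2].

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, float]  # (comprehensibility, confidence)


def comprehensibility(n_conditions: int, max_conditions: int) -> float:
    """Equation (7): Comp(R) = 1 - NC(R) / AC."""
    return 1.0 - n_conditions / max_conditions


def confidence_factor(sup_a_and_c: int, sup_a: int) -> float:
    """Equation (8): Con(R) = SUP(A U C) / SUP(A); 0.0 if A never holds."""
    return sup_a_and_c / sup_a if sup_a else 0.0


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores: Dict[str, Objectives]) -> List[str]:
    """Names of the rules that no other rule dominates."""
    return [name for name, s in scores.items()
            if not any(dominates(other, s)
                       for other_name, other in scores.items()
                       if other_name != name)]


# Hypothetical rules: (conditions used, max conditions), (SUP(A U C), SUP(A)).
rules = {
    "r1": (comprehensibility(2, 5), confidence_factor(30, 40)),
    "r2": (comprehensibility(4, 5), confidence_factor(38, 40)),
    "r3": (comprehensibility(3, 5), confidence_factor(20, 40)),
}
print(pareto_front(rules))  # -> ['r1', 'r2']
```

Here r3 is dominated by r1, which is both shorter and more confident, while r1 and r2 trade comprehensibility against confidence and so both stay on the front.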

3.3. Selection Operator
The strength of implication [6] extracts proper rules for reproduction and hence increases the efficiency of the system. It steers the rule chromosome towards simple rules. Selection based on fitness criteria [12, 13, 20] tends to increase the efficiency and speed of the system. In this method the user sets the boundaries to values close to 1 so as to maintain the selection of high-quality rules. The immune concept [16], when adopted for selection, maintains the diversity of individuals in the population. This avoids premature convergence of the system. The selection strategy based on self-adaptive suppression and promotion [18] ensures that individuals with greater fitness values are retained for further generations. It also ensures the diversity of the population. The concentration plays a suppressive role and avoids premature convergence. Niched Pareto based selection [1, 3] uses a standard deviation function when the difference in absolute count is small. This promotes accurate selection of candidates for reproduction and also saves system time. A roulette wheel based selection method is adopted in [2]. Tournament-based selection [5] enhances the random selection of candidates for the process of symbiogenesis. Elitist recombination selection [17] retains the appropriate rule for the next generation, bypassing the fitness criteria.

3.4. Crossover operator
To avoid producing invalid chromosomes, order-1 crossover [8] is adopted. One segment is selected from each parent and copied into the corresponding offspring. Each offspring then copies the information it does not yet contain from the other parent, starting from the right of the segment. Generation of rules with a high number of attributes [12] is made possible with a single crossover operator. Attributes are selected randomly from the antecedent and consequent of the rules, and they are then exchanged to generate new offspring:

Parents:      R1: AB => CD      R2: EFG => HIJ
Offspring:    R1: F => CHJ      R2: EGAB => ID
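The order-1 crossover mentioned at the start of Section 3.4 can be sketched in its textbook form for permutation chromosomes. This is the generic operator, not necessarily the exact variant used in [8]; the function name and explicit segment bounds are assumptions for illustration.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def order1_crossover(p1: Sequence[T], p2: Sequence[T],
                     start: int, end: int) -> List[T]:
    """Keep p1[start:end] in place and fill the rest in p2's order.

    Filling starts just right of the segment and wraps around, so every
    gene appears exactly once and the child is always a valid permutation.
    """
    n = len(p1)
    child: List[Optional[T]] = [None] * n
    child[start:end] = p1[start:end]
    taken = set(p1[start:end])
    # Genes of p2 read from the position right of the segment, wrapping.
    donor = [p2[(end + i) % n] for i in range(n)]
    fill = [gene for gene in donor if gene not in taken]
    for i, gene in enumerate(fill):
        child[(end + i) % n] = gene
    return child  # type: ignore[return-value]


parent1 = list("ABCDEFGH")
parent2 = list("HGFEDCBA")
print("".join(order1_crossover(parent1, parent2, 2, 5)))  # -> GFCDEBAH
```

The second offspring is obtained by swapping the roles of the parents, that is, `order1_crossover(parent2, parent1, 2, 5)`.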

Random and heuristic crossover [14] helps in achieving diversity of the group and obtains more frequent itemsets quickly. Making the crossover operator a dynamic part of the evolution process [15] helps in evolving the new population based on the population of the last generation. This is found to enhance the diversity of the colony. Multipoint crossover [18] classifies the domains of each attribute into a group and sets the crossover point based on continuous attributes. Single point uniform crossover [2], also called hybrid crossover, combines the best attributes of single point and uniform crossover. In single point crossover the swapping is done in adjacent genes, and in uniform crossover the swapping is performed on genes at distributed locations. The advantages of both schemes are combined, thereby creating a diverse population. N-point crossover, adopted in [4, 20, 21], enables crossover on attributes at n different points according to the crossover points set. When the crossover is altered to a symbiotic combination [5], it creates a new ruleset combining the attributes of both parents rather than producing invalid rules. The symbiotic combination operator takes two partially specified chromosomes and makes an offspring with the sum of their characteristics. In [9], crossover is carried out only on the first-level genes of the chromosomes, rather than on all levels, at the crossover point indicated. This prevents the generation of infeasible rules. If the same attributes are present in both selected parents [17], crossover is performed at randomly selected common attributes. In the absence of common attributes, crossover is performed on randomly selected attributes. Best class crossover (BCX) [20], based on a crossover matrix created from the individuals' fitness and a random number, uses specific domain knowledge and links individuals more effectively.

3.5. Mutation operator
In most cases the mutation operator remains fixed to the probability Pm. The mutation rate in [8] prevents the generation of invalid chromosomes. The mutation in such cases changes the confidence of the rules alone, thereby keeping the support, and hence the fitness function, intact. In [12], either the antecedent or the consequent alone is selected, and mutation is carried out on that attribute alone. This avoids the evolution of invalid chromosomes. An adaptive mutation rate [13] helps in attaining a local optimal solution. Adaptive mutation is based on the fitness of the individual in the present and previous generations and the highest fitness among the individual stocks. It avoids excessive variation in fitness in earlier generations, thereby avoiding non-convergence. This enhances the efficiency of the genetic algorithm. Heuristic mutation [14] generates more new frequent itemsets. As the heuristic operators are executed over the generations, some current maximal frequent itemsets are created.

In [9, 20, 21] multilevel mutation is performed on all levels of the chromosomes, while crossover is performed on the top level of the phenotype structure alone. Mutation is carried out at attribute level in [17], where the attributes are either deleted or mutated based on the probability Pm and the fitness of the individual chromosome under analysis. Directed mutation [19], based on a mutation matrix generated from the fitness values of individuals and a random number, helps in formulating the mutation rate. This evolution process increases the mutation rate if the selected rule is of low quality. This enables the production of quality rules.

4. Observations
Traditional rule mining methods are usually accurate but have brittle operations. Genetic algorithms, on the other hand, provide a robust and efficient approach to exploring a large search space. In recent years numerous works have been carried out using genetic algorithms for mining ARs. Selected works from the past seven years on mining ARs using genetic algorithms have been studied, and the performance analysis of these methods is presented in this section. The comparison of these methods at parameter level is presented in Table 5 below.

Table 5. Comparison of parameters involved in mining ARs using genetic algorithm
Columns: Ref. No. | Selection | Fitness | Crossover | Mutation | Accuracy Measure | Dataset | Remarks

Ref1

Niched Pareto Selection

Based on comprehensibility and confidence factor

Execution time compared Insert, Remove operations control size of rules thereby influencing comprehensibility

Normal

Normal

Confidence Nursery factor. Comprehension. Comprehensibility Predictive Accuracy Rule Interestingness Comprehensibility Predictive Accuracy Zoo Nursery Adult

Ref2

Roulette Wheel Selection

Rank based Pareto Dominance

Single point uniform crossover

Normal

Ref3

Niched Pareto Selection

Based on comprehensibility and predictive accuracy

Uniform Crossover

Normal

Ref4

Based on sustaining index, creditable index, Inclusive index.

Random crossover

Random mutation

Predictive Accuracy Simplicity of Rules

Repair operator used after crossover to create Zoo. valid chromosomes Nursery Insert, Remove operations control size of rules Population initialization based on Tic-tac toe entropy. Dermatology Rule pruning done to Hepatitis Cleveland Heart remove irrelevant terms. Disease. CRX Iris Vote Wine KDD99

Ref5

Tournament Selection

Based on simplicity operator

Symbiotic combination operator

Normal

Classification Rate

Execution time compared

Ref6

Strength of implication Based on completeness measure, confidence factor

Single crossover operator Random crossover

Bases position mutation

Ref7

Roulette Wheel Selection

Normal

Strength of implication, fitness Based on minimum support and minimum confidence Number of rules generated for given minimum confidence and support Predictive accuracy Sensitivity test

Car test result The frequent item sets with min support are found by Apriori method and algorithm applied to this output. Minimum support and minimum confidence are not considered

Lens

Ref8

Random selection

Based on support factor

Order 1 crossover

Ref9

Selection based on fitness values

Based on confidence

Crossover applies only to first level genes

Done by swapping data after going through all chromosome in population Mutation applied to all genes both at first and second level Based on mutation probability pm

SPECT heart Solar flare Nursery Monks problems Balance Scale Stock trading data of top 10 companies from S&P 500

Chromosome representation is multiple level phenotype structure. Data modification process updates the training data by checking the mined rules and

Ref10

Roulette Wheel Selection

Based on comprehensibility and confidence factor

Single point crossover

Predictive accuracy

Wisconsin breast cancer

Ref11

Based on confidence factor

Based on minimum Based on support and minimum crossover confidence factor probability pc

Based on mutation probability pm

Balance scale Chess Number of rules Nursery based on Adult minimum Mushroom support and Hayes Roth confidence Tic tac toe

By changing the representation and they grow with generation

Ref12

Based on fitness generated Individual based selection method Based on maximal frequent set thru filtering operation Based on fitness values of individuals

Based on Interesting factor and completeness factor Based on support and confidence factor

Mutation done at attribute level Attribute level by selecting and crossover deleting the attribute Based on crossover probability pc Random and heuristic crossover Random selection of crossover point done dynamically Adaptive mutation

Number of rules Car Evaluation generated dataset Student achievement database from schools Max itemsets generated, mining time

Array representation used The crossover and mutation operator modified

Ref13

Mining time

Ref14

Individual fitness based on upgrade index Based on support confidence, minimum support and confidence

Heuristic mutation Genes values adjusted partially by mutation Done dynamically

Single table generated randomly

User picks maximum frequent itemsets. Of interest

Ref15

Partial AR after certain generation

Finance service data from certain city

Ref16

Random selection

Based on support confidence, minimum Based on support and crossover confidence probability pc

Based on mutation probability pm

Partial AR after certain generation

Companies daily record of API

Distance measure between datasets are taken up for selection

Ref17

Tournament Selection

Based on comprehensibility interestingness and confidence factor

If same attribute present then at Randomly and attribute level based on fitness else randomly

No of rules mined

Adult dataset

Each individual defined as expression form

Ref18

Based on self adaptive suppression and promotion Based on fitness values of individuals

Based on support and minimum support

Grouping multipoint crossover Best Class crossover based on fitness of individuals Based on crossover matrix generated from fitness values and random number.

Based on mutation probability pm

No of rules, reduction of attributes, execution time Based on true positive, true negative, false positive and false negative values.

Abalone dataset

Rule extracted from final population based on confidence and minimum confidence factor Data buffering, Data indexing, Data sequencing and evaluation cache adopted to reduce time.

Ref19

Based on factors recall and precision

Directed mutation based on fitness of individual Based on mutation matrix generated from fitness values of chromosome and random number.

Iris Diabetics Glass Wine

Ref20

Based on fitness of individuals

Based on support of the chromosomes and minimum support given by user.

Synthetic standard Number of rules database generated generated by IBM QUEST Wisconsin Diagnostic breast cancer Wisconsin breast Number of rules cancer generated Wisconsin Prognostic breast cancer Yeast Adult Number of rules Chess generated Wine Zoo

Selection, crossover and mutation become parameter free due to the self-adaptive evolution process.

Ref21

Roulette wheel selection using fitness.

Based on support count, comprehensibility and interestingness.

Multi-point crossover

Multi-point Mutation

Extracts Association rules from incremental database with single pass of the whole dataset.

Ref22

Roulette wheel based on probabilistic survival of fittest.

Based on all confidence and collective strength factor.

Random crossover

Random mutation

Fitness function designed to prioritize the rules based on users preference.

Table 5 shows that rules mined with a genetic algorithm are promisingly accurate compared with rules mined by other methods. The genetic parameters, namely selection, crossover, mutation and the fitness function, generate accurate rules when they are set to their optimum values and modified appropriately. These are considered the primary parameters. When a dataset contains attributes whose values range widely, for example age, the accuracy of the mined rules is lower than for other datasets. This can be seen in medical datasets that include the age of the patient as an attribute. The execution time for mining ARs can be decreased considerably by altering the factors involved, with only minimal change in accuracy. The other GA parameters, such as population size, selection methodology, encoding scheme and termination condition, have the least significance for the accuracy of the rules mined. The GA is found to produce optimum results for both association rule mining and classification rule mining. The GA extracts association rules from an incremental database with a single pass over the whole dataset, whereas other methods go through the dataset many times to produce the result.

The effects of the genetic operators on accuracy are as follows.
Using selection alone tends to fill the population with copies of the best individual in the population.
Using selection and crossover tends to cause the algorithm to converge on a good but sub-optimal solution.
Using mutation alone induces a random walk through the search space.
Using selection and mutation creates a parallel, noise-tolerant, hill-climbing algorithm.
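The interplay of selection, crossover and mutation described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed works: the function names, the binary encoding, the one-point crossover, the bit-flip mutation and the default values of pc and pm are all assumptions chosen for illustration.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # No usable fitness signal: fall back to a uniform random choice.
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(chromosome, pm, rng):
    """Flip each gene independently with mutation probability pm."""
    return [1 - g if rng.random() < pm else g for g in chromosome]


def evolve(fitness, length=12, pop_size=20, generations=40,
           pc=0.8, pm=0.02, seed=0):
    """Toy generational GA over binary strings (hypothetical defaults)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(ind) for ind in population]
        next_pop = []
        while len(next_pop) < pop_size:
            p1 = roulette_select(population, fits, rng)
            p2 = roulette_select(population, fits, rng)
            if rng.random() < pc:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_pop.append(bit_flip_mutation(c1, pm, rng))
            next_pop.append(bit_flip_mutation(c2, pm, rng))
        population = next_pop[:pop_size]
    return max(population, key=fitness)
```

Calling `evolve(sum)` runs the loop with a toy fitness that counts the ones in a chromosome. Setting `pm=0` leaves only selection and crossover, and setting `pc=0` leaves only selection and mutation, which gives a simple way to observe the operator effects listed above.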

When the primary GA parameters are modified to suit the dataset used for mining ARs, the accuracy of the generated rules increases. The accuracy of the mined rules depends mainly on the fitness function, the crossover operator and the mutation operator. Introducing a self-adaptive mechanism or evolution process into the GA parameters increases the accuracy marginally. The fitness function is the key to selecting accurate rules into the next generation. Crossover and mutation operators, when designed effectively, avoid premature convergence and thereby increase the efficiency of the rules generated. The chromosomes created by crossover should not violate the existing chromosomes, so the crossover operator is designed accordingly. The comprehensibility factor tends to play a major role in the fitness function. Fitness functions based on the support and confidence factors are noted to be significant in the surveyed works. Pareto-based dominance ranking is also adopted in fitness functions to avoid premature convergence. A crossover operator fixed at its optimum value makes the algorithm converge early, thereby speeding up the results. Hence different crossover operators are implemented in the works surveyed. The mutation factor alters the chromosomes, so it is set after careful analysis in order to maintain the validity of the rules mined. The measures used for the validity of the mined rules are similar in almost all the works. Predictive accuracy based on the comprehensibility metric and the confidence factor is applied in more than two thirds of the works surveyed. The interestingness measure, the strength of implication and the number of rules generated for the given confidence and support thresholds were noted as measures of rule set quality. The observations based on predictive accuracy are listed in Table 6.
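A fitness function based on support and confidence, as noted above, might look like the following sketch. The weighted sum, the weights `w_sup` and `w_conf`, the `min_support` cutoff and all function names are assumptions made for illustration; the surveyed works each define their own fitness functions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    sup_a = support(antecedent, transactions)
    if sup_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / sup_a


def rule_fitness(antecedent, consequent, transactions,
                 min_support=0.2, w_sup=0.5, w_conf=0.5):
    """Hypothetical fitness: weighted sum of support and confidence.

    Rules below the user-given minimum support score zero, so they are
    unlikely to be selected into the next generation.
    """
    sup = support(set(antecedent) | set(consequent), transactions)
    if sup < min_support:
        return 0.0
    conf = confidence(antecedent, consequent, transactions)
    return w_sup * sup + w_conf * conf
```

Each transaction is expected to be a set or frozenset of items, for example `frozenset({"bread", "milk"})`. Other measures mentioned in the survey, such as comprehensibility or interestingness, could be added as further weighted terms under the same assumptions.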

Table 6. Predictive Accuracy

Dataset                               Predictive Accuracy (%)   Method Applied
Nursery                               89                        Elitist multiobjective GA
Adult                                 86                        Elitist multiobjective GA
Wisconsin breast cancer               98.15                     GA with information entropy
Wisconsin breast cancer               96.14 to 96.99            GA with data modification process
Wisconsin breast cancer               72.62 to 91.6             Incremental Association Mining
Tic tac toe                           97.86                     GA with information entropy
Dermatology                           95.61                     GA with information entropy
Cleveland heart disease               63.58                     GA with information entropy
Stock trading data                    90-100                    GA-ACR
Iris                                  99.1                      CAREX
Diabetics                             76.43                     CAREX
Glass                                 83.54                     CAREX
Wine                                  100                       CAREX
Wisconsin Diagnostic breast cancer    78.57-95.16               Incremental Association Mining
Wisconsin Prognostic breast cancer    73.33-76.19               Incremental Association Mining

The accuracy achieved ranges from as low as 63.58 percent to a maximum of 100 percent. Datasets in which the values of an attribute such as age range widely are prone to lower accuracy than problems where the attribute value ranges are narrow. This can be noted in medical datasets, where including age as an attribute generates rules with lower accuracy. The analysis based on the number of rules generated by the methods is listed in Table 7. The number of rules generated usually depends on the support factor set, and it varies with the perspective or objective of the method. For the Adult dataset, method [ ] generates five rules, whereas method [] generates around three hundred and fifty rules.
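The dependence of the number of rules on the support threshold can be seen in a small sketch. It enumerates only single-item rules of the form a -> c over a toy transaction list; the function name `count_rules` and the restriction to one item on each side are assumptions for illustration and do not reproduce any surveyed method.

```python
from itertools import permutations


def count_rules(transactions, min_support, min_confidence):
    """Count single-item rules a -> c that meet both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    count = 0
    for a, c in permutations(items, 2):
        # Transactions containing the antecedent, and both items together.
        ante = sum(1 for t in transactions if a in t)
        both = sum(1 for t in transactions if a in t and c in t)
        if ante == 0:
            continue
        if both / n >= min_support and both / ante >= min_confidence:
            count += 1
    return count
```

Raising `min_support` while keeping `min_confidence` fixed can only keep or reduce the count, which is one reason the rule counts reported by different methods are hard to compare unless the thresholds are stated.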

Table 7. Number of rules generated

Dataset                          Number of rules mined   Method Applied
Balance scale                    34                      ARMMGA
Nursery                          4                       ARMMGA
Nursery                          5                       GA based classification of AR, CF above 0.6
Monks Problems                   4                       ARMMGA
Solar flare                      23                      ARMMGA
SPECT heart                      23                      ARMMGA
Car Evaluation                   11                      M-GARM with fitness threshold 0.85
Group data of finance service    3-10                    GA based on evolution strategy
Company's daily record of API    3                       IOGA
Adult                            5                       GA based classification of AR, CF above 0.6
Adult                            Around 350              GA for prioritization of rules
Abalone                          13                      MAR-IGA
IBM QUEST synthetic database     40-50                   AGA
Chess                            Around 470              GA for prioritization of rules
Wine                             Around 230              GA for prioritization of rules
Zoo                              Around 325              GA for prioritization of rules

Table 8 lists the comparison of the methods based on execution time. Mining ARs with a genetic algorithm takes less execution time than conventional methods.

Table 8. Execution Time

Dataset                          Execution time (ms)   Method
Balance Scale                    8                     GEA-DM
Chess                            768                   GEA-DM
Car Evaluation                   9                     GEA-DM
Nursery                          132                   GEA-DM
Nursery                          286.38                INPGA
Tic Tac toe                      42                    GEA-DM
Adult                            1844                  GEA-DM
Iris                             6.75                  INPGA
Iris                             40                    SEA
Vote                             89                    SEA
Wine                             98                    SEA
KDD99                            7012                  SEA
Student achievement database     20-10                 Improved GA for varied support
Abalone                          48                    MAR-IGA

Based on the performance analysis carried out, further exploration of GA-based AR mining could extend the analysis to other domains. Methods to deal with noisy, imprecise and uncertain information could be explored further. Careful selection of attributes in the preprocessing step might result in better predictive accuracy. Further enhancement of the self-adaptive mechanism might lead to better performance. Other interestingness measures could also be incorporated.

5. Conclusion
A performance analysis of association rule mining using GA was carried out on recent research in the area. The use of GA has resulted in both predictive and classification ARs with higher predictive accuracy. The fitness function, crossover rate and mutation rate influence the accuracy more than the other GA parameters. The right indicators, when used in the fitness function, generate high quality rules, which avoids the generation of infeasible rules in the discovered rule set. The fitness function is found to be the key to selecting accurate rules into the next generation.

The crossover rate and mutation rate, when set to their optimum values, avoid premature convergence of the algorithm. This leads to the generation of feasible rules. GA is found to have produced enhanced results on all types of datasets, ranging from medicine to problem solving. The selection method plays a major role in reducing the execution time by selecting the right parents for reproduction. Adopting the right representation scheme tends to speed up the system. The capability of a GA to scan the dataset quickly, when it is designed effectively, reduces the execution time. Introducing a self-adaptive mechanism or evolution process into the GA parameters increases the accuracy marginally.

References.
1. Junlin Lu, Fan Yang, Momo Li, Lizhen Wang, Multi-objective rule discovery using Niched Pareto genetic algorithm, Third IEEE international conference on measuring technology and mechatronics automation, 2011.
2. S. Dehuri, S. Patnaik, A. Ghosh, R. Mall, Application of elitist multi-objective genetic algorithm for classification rule generation, Applied Soft Computing 8 (2008) 477-487.
3. Dehuri S., Mall R., Predictive and comprehensible rule discovery using a multiobjective genetic algorithm, Knowledge based systems, Elsevier, vol 19, pp. 413-421, 2006.
4. Hua Tang, Jun u, A hybrid algorithm combines with Genetic Algorithm with information entropy for data mining, second IEEE international conference on Industrial electronics and applications, 2007.
5. Ramin Halavathi, Saee Bagheri Shouraki, Pooya Esfandiar, Sima Lotfi, Rule based classifier using symbiotic Evolutionary algorithm, 19th IEEE international conference on tools and artificial intelligence, 2007.
6. Zhou Jun, Li Shu-you Mei, Hong-yan Liu, Hai-xia, A Method for Finding Implicating Rules Based on the Genetic Algorithm, Third International Conference on Natural Computation (ICNC 2007).
7. Anandhavalli M., Suraj Kumar Sudhanshu, Ayush Kumar and Ghose M.K., Optimized association rule mining using genetic algorithm, Advances in Information Mining, ISSN: 0975-3265, Volume 1, Issue 2, 2009, pp. 01-04.
8. Hamid Reza Qodmanan, Mahdi Nasiri, Behrouz Minaei-Bidgoli, Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence, Expert Systems with Applications 38 (2011) 288-298.
9. Ya-Wen Chang Chien, Yen-Liang Chen, Mining associative classification rules with stock trading data: A GA-based method, Knowledge-Based Systems 23 (2010) 605-614.
10. Ta-Cheng Chen, Tung-Chou Hsu, GAs based approach for mining breast cancer pattern, Expert Systems with Applications 30 (2006) 674-681.
11. Zhan-min Wang, Hong-liang Wang, Du-wa Cui, A growing evolutionary algorithm for data mining, IEEE 2010.
12. Avendano J. Christian, Gutierrez P. Martin, Optimization of association rules with genetic algorithms, 29th IEEE international conference of the Chilean computer science society, 2010.
13. Hong Guo, Ya Zhou, An algorithm for mining association rules based on improved genetic algorithm and its applications, Third IEEE international conference on Genetic and evolutionary computing, 2009.
14. Wenxiang Dou, Jinglu Hu, Kotaro Hirasawa and Gengfeng Wu, Quick Response Data Mining Model Using Genetic Algorithm, SICE Annual Conference 2008.
15. Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, Genetic Algorithm based on Evolution Strategy and the Application in Data Mining, First IEEE International Workshop on Education Technology and Computer Science, 2009.
16. Genxiang Zhang, Haishan Chen, Immune Optimization based Genetic Algorithm for incremental association rules mining, International Conference on Artificial Intelligence and Computational Intelligence, 2009.
17. Xian Jun Shi, Hong Lei, A genetic algorithm based approach for classification rule discovery, IEEE international conference on information management, innovation management and industrial engineering, 2008.
18. Guangjun Yang, Mining association rules from data with hybrid attributes based on immune genetic algorithm, 7th international conference on fuzzy systems and knowledge discovery.
19. Powel B. Myszkowski, Coevolutionary Algorithm for Rule Induction, Proceedings of the IEEE International Multiconference on computer science and information technology, 2010.
20. Min Wang, Qin Zou, Caihui Liu, Multi-dimension Association Rule Mining Based on Adaptive Genetic Algorithm, IEEE International Conference on Uncertainty Reasoning and Knowledge Engineering, 2011.
21. B. Nath, D. K. Bhattacharyya and A. Ghosh, Discovering Association Rules from Incremental Datasets, International Journal of Computer Science & Communication, Vol 1, No. 2, July-December 2010, pp. 433-441.
22. M. Ramesh Kumar, K. Iyakutti, Application of Genetic algorithms for the prioritization of Association Rules, IJCA Special Issue on Artificial Intelligence Techniques - Novel Approaches & Practical Applications AIT, 2011.
23. Z. Michalewicz, Genetic Algorithms + Data Structure = Evolution Programs, Springer-Verlag, Berlin, 1994.
24. R. Agrawal, T. Imielinski and A. Swami, Mining association rules between sets of items in large databases, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '93), Washington, USA, May 1993.
25. J.H. Holland, Adaptation in Natural and Artificial Systems, Univ. Michigan Press, Ann Arbor, MI, 1975.
