
CSE 634 Data Mining Concepts & Techniques
Prof. Anita Wasilewska

Genetic Algorithms (GAs)

By: Group 1
Abhishek Sharma, Mikhail Rubnich, George Iordache, Marcela Boboila
General Description of the Method

By: Abhishek Sharma
References

 "Data Mining: Concepts and Techniques", Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2003.
 Data Mining Techniques: class lecture notes and PowerPoint slides.
 http://cs.felk.cvut.cz/~xobitko/ga/
 Massachusetts Institute of Technology, Prof. de Weck and Prof. Willcox, Multidisciplinary System Design Optimization course lecture notes on heuristic techniques, "A Basic Introduction to Genetic Algorithms":
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
History of Genetic Algorithms

 "Evolutionary computing" was introduced in the 1960s by I. Rechenberg.

 Professor John Holland of the University of Michigan published the book "Adaptation in Natural and Artificial Systems" (1975), which explored the concept of using mathematically based artificial evolution as a method to conduct a structured search for solutions to complex problems.

 Dr. David E. Goldberg, in his 1989 landmark text "Genetic Algorithms in Search, Optimization and Machine Learning", suggested applications for genetic algorithms in a wide range of engineering fields.
What Are Genetic Algorithms (GAs)?

 Genetic Algorithms are search and optimization techniques based on Darwin's principle of natural selection: problems are solved by an evolutionary process resulting in a best (fittest) solution, a survivor. In other words, the solution is evolved.

1. Inheritance – offspring acquire the characteristics of their parents
2. Mutation – random change, to avoid uniformity
3. Natural Selection – variations improve survival
4. Recombination – crossover
Genetics

Chromosome
 All living organisms consist of cells, and each cell contains the same set of chromosomes.
 Chromosomes are strings of DNA and consist of genes, blocks of DNA.
 Each gene encodes a trait, for example eye color.

Reproduction
 During reproduction, recombination (or crossover) occurs first: genes from the parents combine to form a whole new chromosome. The newly created offspring can then be mutated; such changes are mainly caused by errors in copying genes from the parents.

 The fitness of an organism is measured by the success of the organism in its life (survival).

Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
Principle of Natural Selection

 "Select the best, discard the rest."

 Two elements are required for any problem before a genetic algorithm can be used to find a solution:

 A method for representing a solution (encoding),
   e.g. a string of bits, numbers, or characters.

 A method for measuring the quality of any proposed solution: the fitness function,
   e.g. determining total weight.
GA Elements

(Figure: elements of a genetic algorithm.)

Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
Search Space

 When solving a problem, we are usually looking for the best solution among many. The space of all feasible solutions (the objects among which the desired solution lies) is called the search space (also the state space). Each point in the search space represents one feasible solution, and each feasible solution can be "marked" by its value, or fitness, for the problem.

 Initialization
Many individual solutions are randomly generated to form an initial population, covering the entire range of possible solutions (the search space).

 Selection
A proportion of the existing population is selected to breed a new generation.

 Reproduction
A second generation of solutions is produced from those selected, through the genetic operators crossover and mutation.

 Termination
Common stopping conditions:
- A solution is found that satisfies minimum criteria.
- A fixed number of generations is reached.
- The allocated budget (computation time/money) is reached.
- The highest-ranking solution's fitness has reached a plateau, so that further generations no longer improve it.
Methodology Associated with GAs

(Flowchart of the GA loop:)

Begin
  T = 0 (first step)
  Initialize population
  Evaluate solutions
  While the optimum solution has not been found:
    Selection
    Crossover
    Mutation
    T = T + 1 (go to next step)
    Evaluate solutions
  Stop

Citation: http://cs.felk.cvut.cz/~xobitko/ga/
Creating a GA on a Computer

Simple_Genetic_Algorithm()
{
    Initialize the Population;
    Calculate Fitness Function;

    While (Fitness Value != Optimal Value)
    {
        Selection;   // natural selection, survival of the fittest
        Crossover;   // reproduction, propagate favorable characteristics
        Mutation;    // random variation
        Calculate Fitness Function;
    }
}
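Below is a minimal, compilable C sketch of this loop, assuming a toy "OneMax" problem (maximize the number of 1-bits in the chromosome). The population size, chromosome length, and 5% mutation rate are illustrative, and tournament selection stands in for the selection step (the slides discuss roulette-wheel selection later).

#include <stdio.h>
#include <stdlib.h>

#define POP 20      /* population size */
#define LEN 16      /* chromosome length in bits */

static int fitness(const int *c) {           /* OneMax: count of 1-bits */
    int f = 0;
    for (int i = 0; i < LEN; i++) f += c[i];
    return f;
}

static int tournament(int pop[POP][LEN]) {   /* pick the fitter of two */
    int a = rand() % POP, b = rand() % POP;
    return fitness(pop[a]) > fitness(pop[b]) ? a : b;
}

int main(void) {
    int pop[POP][LEN], next[POP][LEN];
    for (int i = 0; i < POP; i++)            /* initialize the population */
        for (int j = 0; j < LEN; j++) pop[i][j] = rand() % 2;

    for (int gen = 0; ; gen++) {
        int best = 0;
        for (int i = 1; i < POP; i++)
            if (fitness(pop[i]) > fitness(pop[best])) best = i;
        if (fitness(pop[best]) == LEN) {     /* optimal value reached */
            printf("solved in generation %d\n", gen);
            return 0;
        }
        for (int i = 0; i < POP; i++) {      /* selection, crossover, mutation */
            int p1 = tournament(pop), p2 = tournament(pop);
            int cut = rand() % LEN;          /* single-point crossover */
            for (int j = 0; j < LEN; j++)
                next[i][j] = (j < cut) ? pop[p1][j] : pop[p2][j];
            if (rand() % 100 < 5)            /* 5% mutation: flip one bit */
                next[i][rand() % LEN] ^= 1;
        }
        for (int i = 0; i < POP; i++)
            for (int j = 0; j < LEN; j++) pop[i][j] = next[i][j];
    }
}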

Nature vs. Computer – Mapping

Nature          Computer
Population      Set of solutions
Individual      Solution to a problem
Fitness         Quality of a solution
Chromosome      Encoding for a solution
Gene            Part of the encoding of a solution
Reproduction    Crossover
Encoding

 Encoding is the process of representing a solution in the form of a string that conveys the necessary information.

 Just as each gene in a chromosome controls a particular characteristic of the individual, each element in the string represents a characteristic of the solution.
Encoding Methods

 Binary Encoding – The most common encoding method. Chromosomes are strings of 1s and 0s, and each position in the chromosome represents a particular characteristic of the problem.

Chromosome A   10110010110011100101
Chromosome B   11111110000000011111

 Permutation Encoding – Useful in ordering problems such as the Traveling Salesman Problem (TSP). For example, in TSP every chromosome is a string of numbers, each of which represents a city to be visited.

Chromosome A   1 5 3 2 6 4 7 9 8
Chromosome B   8 5 6 7 2 3 1 4 9
Encoding Methods (contd.)

 Value Encoding – Used in problems involving complicated values, such as real numbers, where binary encoding would not suffice. Good for some problems, but it is often necessary to develop specific crossover and mutation techniques for these chromosomes.

Chromosome A   1.235 5.323 0.454 2.321 2.454
Chromosome B   (left), (back), (left), (right), (forward)
Encoding Methods (contd.)

 Tree Encoding – Used mainly for evolving programs or expressions, i.e. for genetic programming. Every chromosome is a tree of some objects, such as values, arithmetic operators, or commands in a programming language.

(+ x (/ 5 y))          (do_until step wall)

Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
GA Operators

By: Mikhail Rubnich
References

 "Data Mining: Concepts and Techniques", Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2003.
 http://www.ai-junkie.com/ga/intro/gat2.html
 http://www.faqs.org/faqs/ai-faq/genetic/part2/
 http://en.wikipedia.org/wiki/Genetic_algorithms

(Figure.)
Citation: http://www.ewh.ieee.org/soc/es/May2001/14/GA.GIF
Basic GA Operators

 Recombination
 Crossover – looks for solutions near existing solutions
 Mutation – looks at completely new areas of the search space
Fitness Function

 The fitness function quantifies the optimality of a solution (that is, a chromosome), so that a particular chromosome can be ranked against all other chromosomes.

 A fitness value is assigned to each solution depending on how close it actually is to solving the problem.

 An ideal fitness function correlates closely with the goal and is quick to compute.

 For instance, in the knapsack problem:
   Fitness Function = total value of the things in the knapsack
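As a hedged sketch of the knapsack fitness just mentioned, the C function below scores a bit-string chromosome by the total value of the packed items, returning 0 when the weight limit is exceeded; the item data and CAPACITY constant are invented for illustration.

#include <stdio.h>

#define N 5
static const int value[N]  = {60, 100, 120, 40, 30};
static const int weight[N] = {10, 20, 30, 15, 5};
#define CAPACITY 50

/* chromosome: N bits, bit i == 1 means "item i is in the knapsack" */
static int knapsack_fitness(const int *bits) {
    int v = 0, w = 0;
    for (int i = 0; i < N; i++)
        if (bits[i]) { v += value[i]; w += weight[i]; }
    return (w <= CAPACITY) ? v : 0;   /* infeasible solutions score 0 */
}

int main(void) {
    int chrom[N] = {1, 1, 0, 0, 1};
    printf("fitness = %d\n", knapsack_fitness(chrom));  /* 60+100+30 = 190 */
    return 0;
}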
Recombination

Main idea: "Select the best, discard the rest."

 Recombination is the process that chooses which solutions are preserved and allowed to reproduce, and which must die out.

 The main goal of the recombination operator is to emphasize the good solutions and eliminate the bad solutions in a population, while keeping the population size constant.
So, how to select the best?

 Roulette Selection
 Rank Selection
 Steady-State Selection
 Tournament Selection
Roulette Wheel Selection

Main idea: the fitter the solution, the greater its chance of being chosen.

HOW DOES IT WORK?
Example of Roulette Wheel Selection

No.   String   Fitness   % of Total
1     01101    169       14.4
2     11000    576       49.2
3     01000    64        5.5
4     10011    361       30.9
Total          1170      100.0

Citation: www.cs.vu.nl/~gusz/
Roulette Wheel Selection

(Pie chart: Chromosome 1, Chromosome 2, Chromosome 3, Chromosome 4, each slice proportional to its fitness.)

All you have to do is spin the wheel and grab the chromosome at the point where the ball stops.
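A small C sketch of the spin, using the four fitness values from the table above; slice widths are proportional to fitness, so repeated spins approach the 14.4 / 49.2 / 5.5 / 30.9% shares.

#include <stdio.h>
#include <stdlib.h>

static int roulette(const int *fit, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += fit[i];
    int spin = rand() % total;              /* where the "ball" lands */
    for (int i = 0; i < n; i++) {
        if (spin < fit[i]) return i;        /* slice i covers this point */
        spin -= fit[i];
    }
    return n - 1;                           /* not reached */
}

int main(void) {
    int fit[4] = {169, 576, 64, 361};       /* fitness values from the table */
    int count[4] = {0};
    for (int k = 0; k < 10000; k++) count[roulette(fit, 4)]++;
    for (int i = 0; i < 4; i++)
        printf("chromosome %d picked %d times\n", i + 1, count[i]);
    return 0;
}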

Crossover

Main idea: combine the genetic material (bits) of two "parent" chromosomes (solutions) to produce a new "child" possessing characteristics of both parents.

How does it work? There are several methods.
Crossover Methods

 Single-Point Crossover – A random point is chosen on the individual chromosomes (strings) and the genetic material is exchanged at this point.

Citation: http://www.ewh.ieee.org/soc/es/May2001/14/CROSS0.GIF
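A possible C sketch of single-point crossover on bit strings: one cut point is drawn at random, and the two children swap tails at that point.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void single_point(const char *p1, const char *p2,
                         char *c1, char *c2) {
    size_t len = strlen(p1);
    size_t cut = 1 + (size_t)rand() % (len - 1);  /* cut inside the string */
    for (size_t i = 0; i < len; i++) {
        c1[i] = (i < cut) ? p1[i] : p2[i];        /* head of p1, tail of p2 */
        c2[i] = (i < cut) ? p2[i] : p1[i];        /* head of p2, tail of p1 */
    }
    c1[len] = c2[len] = '\0';
}

int main(void) {
    char c1[32], c2[32];
    single_point("10110010", "11111110", c1, c2);
    printf("%s %s\n", c1, c2);
    return 0;
}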

Crossover Methods

 Two-Point Crossover – Two random points are chosen on the individual chromosomes (strings) and the genetic material is exchanged at these points.

Chromosome 1   11011 | 00100 | 110110
Chromosome 2   10101 | 11000 | 011110
Offspring 1    10101 | 00100 | 011110
Offspring 2    11011 | 11000 | 110110

NOTE: These chromosomes are different from those in the previous example.
Crossover Methods

 Uniform Crossover – Each gene (bit) is selected randomly from one of the corresponding genes of the parent chromosomes.

Chromosome 1   11011 | 00100 | 110110
Chromosome 2   10101 | 11000 | 011110
Offspring      10111 | 00000 | 110110

NOTE: Uniform crossover yields only one offspring.
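A possible C sketch of uniform crossover: each gene of the single offspring is drawn from one parent or the other by a coin flip.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void uniform_crossover(const char *p1, const char *p2, char *child) {
    size_t len = strlen(p1);
    for (size_t i = 0; i < len; i++)
        child[i] = (rand() % 2) ? p1[i] : p2[i];  /* coin flip per gene */
    child[len] = '\0';
}

int main(void) {
    char child[32];
    uniform_crossover("110110010", "101011100", child);
    printf("%s\n", child);
    return 0;
}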

Crossover (contd.)

 Crossover between two good solutions may not always yield a better, or equally good, solution.

 Since the parents are good, the probability of the child being good is high.

 If an offspring is not good (a poor solution), it will be removed in the next iteration during selection.
Elitism

Main idea: copy the best chromosomes (solutions) to the new population before applying crossover and mutation.

 When creating a new population by crossover or mutation, the best chromosome might be lost.

 Elitism forces GAs to retain some number of the best individuals at each generation.

 It has been found that elitism significantly improves performance.
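A sketch of how elitism might be coded in C, assuming the population array is already sorted best-first by fitness; POP, LEN, and ELITE are illustrative constants.

#include <stdio.h>
#include <string.h>

#define POP 4
#define LEN 8
#define ELITE 2   /* how many best individuals survive untouched */

/* copy the ELITE best chromosomes (pop assumed sorted best-first)
   into the next generation before any crossover or mutation runs */
static void apply_elitism(int pop[POP][LEN], int next[POP][LEN]) {
    for (int i = 0; i < ELITE; i++)
        memcpy(next[i], pop[i], sizeof next[i]);
    /* slots ELITE .. POP-1 are then filled by selection/crossover/mutation */
}

int main(void) {
    int pop[POP][LEN] = {{1,1,1,1,1,1,1,1}, {1,1,1,1,0,0,0,0}};
    int next[POP][LEN] = {{0}};
    apply_elitism(pop, next);
    printf("%d %d\n", next[0][0], next[1][3]);   /* best two carried over */
    return 0;
}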

32
Mutation

Main idea: randomly invert bits in a solution to maintain diversity in the population.

Example: in giraffes, mutations could be beneficial.

Citation: http://www.ewh.ieee.org/soc/es/May2001/14/MUTATE0.GIF
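A possible C sketch of bit-flip mutation: each gene is inverted independently with a small probability (the 1% rate here is illustrative).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void mutate(char *chrom, double rate) {
    size_t len = strlen(chrom);
    for (size_t i = 0; i < len; i++)
        if ((double)rand() / RAND_MAX < rate)
            chrom[i] = (chrom[i] == '0') ? '1' : '0';   /* invert the bit */
}

int main(void) {
    char chrom[] = "1101100101";
    mutate(chrom, 0.01);
    printf("%s\n", chrom);
    return 0;
}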
Advantages and Disadvantages

Advantages:
 There is always an answer, and the answer gets better with time.
 Good for "noisy" environments.
 Inherently parallel; easily distributed.

Issues:
 Performance.
 The solution is only as good as the evaluation function (often the hardest part).
 Termination criteria.
Applications – Genetic Programming and Data Mining

By: George Iordache
References

 A.A. Freitas. "A survey of evolutionary algorithms for data mining and knowledge discovery", Pontificia Universidade Catolica do Parana, Brazil. In A. Ghosh and S. Tsutsui, editors, Advances in Evolutionary Computation, pages 819-845. Springer-Verlag, 2002.
http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_papers.dirzSzAdvEC-bk.pdf/freitas01survey.pdf

 Anita Wasilewska, course lecture notes (2007 and previous years) on classification (Data Mining book, Chapters 5 and 7).
http://www.cs.sunysb.edu/~cse634/lecture_notes/07classification.pdf

 J. Han and M. Kamber. "Data Mining: Concepts and Techniques", 2nd ed., Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.

 R. Mendes, F. Voznika, A. Freitas, and J. Nievola. "Discovering fuzzy classification rules with genetic programming and co-evolution", Pontificia Universidade Catolica do Parana, Brazil. In L. de Raedt and A. Siebes, editors, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), volume 2168 of LNAI, pages 314-325. Springer-Verlag, 2001.
http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_papers.dirzSzPKDD-2001.pdf/mendes01discovering.pdf

 John R. Koza, Medical Informatics, Department of Medicine, Department of Electrical Engineering, Stanford University. Genetic algorithms and genetic programming, lecture notes, 2003.
www.genetic-programming.com/c2003lecture1modified.ppt
Genetic Programming

A program in C:

int foo(int time)
{
    int temp1, temp2;
    if (time > 10)
        temp1 = 3;
    else
        temp1 = 4;
    temp2 = temp1 + 1 + 2;
    return temp2;
}

Equivalent expression (similar to a classification rule in data mining):

(+ 1 2 (IF (> TIME 10) 3 4))

Citation: www.genetic-programming.com/c2003lecture1modified.ppt
Program Tree

(Figure: the tree form of the expression below, with "+" at the root.)

(+ 1 2 (IF (> TIME 10) 3 4))

Citation: www.genetic-programming.com/c2003lecture1modified.ppt
Given Data

Input: independent variable X. Output: dependent variable Y.

X       Y
-1.00   1.00
-0.80   0.84
-0.60   0.76
-0.40   0.76
-0.20   0.84
 0.00   1.00
 0.20   1.24
 0.40   1.56
 0.60   1.96
 0.80   2.44
 1.00   3.00

Citation: www.genetic-programming.com/c2003lecture1modified.ppt
Problem Description

Objective: find a computer program with one input (independent variable X) whose output Y matches the given data.

1. Terminal set: T = {X, Random-Constants}
2. Function set: F = {+, -, *, /}
3. Initial population: randomly created individuals built from elements of T and F.
4. Fitness: |y0' - y0| + |y1' - y1| + ..., where yi' is the computed output and yi is the given output for xi in the range [-1, 1].
5. Termination: an individual emerges whose sum of absolute errors (the value of its fitness function) is less than 0.1.

Citation: www.genetic-programming.com/c2003lecture1modified.ppt
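A small C sketch of this fitness measure for fixed candidate programs: it sums the absolute errors over the 11 data points. The data follow y = x^2 + x + 1, so that candidate scores 0.00 while x+1 scores 4.40, matching the tables on the next slides.

#include <stdio.h>
#include <math.h>

static const double xs[11] = {-1.0, -0.8, -0.6, -0.4, -0.2, 0.0,
                               0.2,  0.4,  0.6,  0.8,  1.0};
static const double ys[11] = { 1.00, 0.84, 0.76, 0.76, 0.84, 1.00,
                               1.24, 1.56, 1.96, 2.44, 3.00};

static double fitness(double (*candidate)(double)) {
    double err = 0.0;
    for (int i = 0; i < 11; i++)
        err += fabs(candidate(xs[i]) - ys[i]);  /* sum of absolute errors */
    return err;
}

static double x_plus_1(double x)  { return x + 1.0; }
static double quadratic(double x) { return x * x + x + 1.0; }

int main(void) {
    printf("fitness(x+1)     = %.2f\n", fitness(x_plus_1));   /* 4.40 */
    printf("fitness(x*x+x+1) = %.2f\n", fitness(quadratic));  /* 0.00 */
    return 0;
}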
Generation 0

Population of 4 randomly created individuals: (a) x+1, (b) x^2+1, (c) 2, (d) x.

X       Y     |  x+1    |x+1-Y|  |  x^2+1  |x^2+1-Y|  |  2   |2-Y|  |  x      |x-Y|
-1.00   1.00  |  0.00   1.00    |  2.00   1.00       |  2   1.00   |  -1.00  2.00
-0.80   0.84  |  0.20   0.64    |  1.64   0.80       |  2   1.16   |  -0.80  1.64
-0.60   0.76  |  0.40   0.36    |  1.36   0.60       |  2   1.24   |  -0.60  1.36
-0.40   0.76  |  0.60   0.16    |  1.16   0.40       |  2   1.24   |  -0.40  1.16
-0.20   0.84  |  0.80   0.04    |  1.04   0.20       |  2   1.16   |  -0.20  1.04
 0.00   1.00  |  1.00   0.00    |  1.00   0.00       |  2   1.00   |   0.00  1.00
 0.20   1.24  |  1.20   0.04    |  1.04   0.20       |  2   0.76   |   0.20  1.04
 0.40   1.56  |  1.40   0.16    |  1.16   0.40       |  2   0.44   |   0.40  1.16
 0.60   1.96  |  1.60   0.36    |  1.36   0.60       |  2   0.04   |   0.60  1.36
 0.80   2.44  |  1.80   0.64    |  1.64   0.80       |  2   0.44   |   0.80  1.64
 1.00   3.00  |  2.00   1.00    |  2.00   1.00       |  2   1.00   |   1.00  2.00
Fitness (sum) |         4.40    |         6.00       |      9.48   |         15.40

Best in Generation 0: x+1 (fitness 4.40).

Citation: examples taken from www.genetic-programming.com/c2003lecture1modified.ppt
Mutation

(Figure: mutation of an individual from generation 0, picking the constant "2" as the mutation point and replacing it with a randomly grown subtree rooted at "/".)

Citation: part of the pictures used as examples are taken from www.genetic-programming.com/c2003lecture1modified.ppt
Crossover

(Figure: crossover of two parent trees, picking the "+" subtree of one parent and the leftmost "x" of the other as crossover points, then swapping the selected subtrees.)

Citation: example taken from www.genetic-programming.com/c2003lecture1modified.ppt


Generation 1

The new population:
 Copy of (a)
 Mutant of (c), picking "2" as the mutation point
 First offspring of the crossover of (a) and (b), picking the "+" of parent (a) and the leftmost "x" of parent (b) as crossover points
 Second offspring of the crossover of (a) and (b), picking the "+" of parent (a) and the leftmost "x" of parent (b) as crossover points

Citation: part of the examples is taken from www.genetic-programming.com/c2003lecture1modified.ppt
Generation 1 fitness table:

X       Y     |  x+1    |x+1-Y|  |  1   |1-Y|  |  x      |x-Y|  |  x^2+x+1  |x^2+x+1-Y|
-1.00   1.00  |  0.00   1.00    |  1   0.00   |  -1.00  2.00   |  1.00     0.00
-0.80   0.84  |  0.20   0.64    |  1   0.16   |  -0.80  1.64   |  0.84     0.00
-0.60   0.76  |  0.40   0.36    |  1   0.24   |  -0.60  1.36   |  0.76     0.00
-0.40   0.76  |  0.60   0.16    |  1   0.24   |  -0.40  1.16   |  0.76     0.00
-0.20   0.84  |  0.80   0.04    |  1   0.16   |  -0.20  1.04   |  0.84     0.00
 0.00   1.00  |  1.00   0.00    |  1   0.00   |   0.00  1.00   |  1.00     0.00
 0.20   1.24  |  1.20   0.04    |  1   0.24   |   0.20  1.04   |  1.24     0.00
 0.40   1.56  |  1.40   0.16    |  1   0.56   |   0.40  1.16   |  1.56     0.00
 0.60   1.96  |  1.60   0.36    |  1   0.96   |   0.60  1.36   |  1.96     0.00
 0.80   2.44  |  1.80   0.64    |  1   1.44   |   0.80  1.64   |  2.44     0.00
 1.00   3.00  |  2.00   1.00    |  1   2.00   |   1.00  2.00   |  3.00     0.00
Fitness (sum) |         4.40    |      6.00   |         15.40  |           0.00

Found! The individual x^2+x+1 matches the data exactly (fitness 0.00 < 0.1).
GA & Classification

Classify customers based on number of children and salary:

Parameter                    # of children (NOC)                           Salary (S)
Domain                       0 .. 10                                       0 .. 500000
Syntax of atomic expression  NOC = x, NOC < x, NOC <= x, NOC > x, NOC >= x  S = x, S < x, S > x

Citation: data table is taken from prof. Anita Wasilewska's previous years' course slides
GA & Classification Rules

 A classification rule has the form (the rule is in predicate form – see the course lectures):

IF formula THEN class = ci

where the formula is the antecedent and "class = ci" is the consequence.
Formula Representation

 Possible rule:
 If (NOC = 2) AND (S > 80000) then GOOD (customer)

The formula part is stored as a tree: an AND node at the root, with a "=" subtree over (NOC, 2) and a ">" subtree over (S, 80000). The class (GOOD) is the rule's consequence.

Citation: the example is taken from prof. Anita Wasilewska's previous years' course slides
Initial Data Table

No.  Number of children (NOC)  Salary (S)  Type of customer (C)
1    2                         > 80000     GOOD
2    1                         > 30000     GOOD
3    0                         = 50000     GOOD
4    > 2                       < 10000     BAD
5    = 10                      = 30000     BAD
6    = 5                       < 30000     BAD
Initial Data (written as rules inferred from the initial table)

 Rule 1: If (NOC = 2) AND (S > 80000) then C = GOOD
 Rule 2: If (NOC = 1) AND (S > 30000) then C = GOOD
 Rule 3: If (NOC = 0) AND (S = 50000) then C = GOOD
 Rule 4: If (NOC > 2) AND (S < 10000) then C = BAD
 Rule 5: If (NOC = 10) AND (S = 30000) then C = BAD
 Rule 6: If (NOC = 5) AND (S < 30000) then C = BAD
Generation 0

 Population of 3 randomly created individuals:
 If (NOC > 3) AND (S > 10000) then C = GOOD
 If (NOC > 1) AND (S > 30000) then C = GOOD
 If (NOC >= 0) AND (S < 40000) then C = GOOD

 We want to find a more general (if possible, the most general) "characteristic description" for class GOOD, so we assign the predicted class GOOD to all individuals.
Generation 0

The three individuals as trees (AND at the root, one comparison per branch):

Individual 1: (NOC > 3) AND (S > 10000)
Individual 2: (NOC > 1) AND (S > 30000)
Individual 3: (NOC >= 0) AND (S < 40000)
Fitness Function

 For one rule (IF A THEN C), the fitness is the confidence factor:

CF = |A & C| / |A|

 |A| = number of records that satisfy A
 |A & C| = number of records that satisfy A and are in the predicted class C

Citation: the confidence formula is taken from class slides: http://www.cs.sunysb.edu/~cse634/lecture_notes/07association.pdf
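A hedged C sketch of this confidence factor, evaluated against plain (NOC, salary, class) records rather than the slide's rule-form table; the sample data and the hard-coded candidate rule (NOC > 1 AND S > 30000) are for illustration only.

#include <stdio.h>

struct record { int noc; int salary; int good; };

static double confidence(const struct record *r, int n) {
    int a = 0, ac = 0;
    for (int i = 0; i < n; i++) {
        if (r[i].noc > 1 && r[i].salary > 30000) {  /* antecedent A holds */
            a++;
            if (r[i].good) ac++;                    /* ...and class is GOOD */
        }
    }
    return a ? (double)ac / a : 0.0;                /* CF = |A & C| / |A| */
}

int main(void) {
    struct record data[] = {
        {2, 85000, 1}, {1, 40000, 1}, {0, 50000, 1},
        {3,  5000, 0}, {10, 30000, 0}, {5, 20000, 0},
    };
    printf("CF = %.2f\n", confidence(data, 6));     /* only {2,85000,GOOD} matches: 1.00 */
    return 0;
}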


Fitness Function – Generation 0

Rule 1: If (NOC = 2) AND (S > 80000) then GOOD
Rule 2: If (NOC = 1) AND (S > 30000) then GOOD
Rule 3: If (NOC = 0) AND (S = 50000) then GOOD
Rule 4: If (NOC > 2) AND (S < 10000) then BAD
Rule 5: If (NOC = 10) AND (S = 30000) then BAD
Rule 6: If (NOC = 5) AND (S < 30000) then BAD

 Fitness of Individual 1: If (NOC > 3) AND (S > 10000) then GOOD
   |A| = 2 (Rules 5 & 6), |A & C| = 0, CF = 0 / 2 = 0
 Fitness of Individual 2: If (NOC > 1) AND (S > 30000) then GOOD
   |A| = 1 (Rule 1), |A & C| = 1, CF = 1 / 1 = 1   (Best in Generation 0)
 Fitness of Individual 3: If (NOC >= 0) AND (S < 40000) then GOOD
   |A| = 4 (Rules 2, 4, 5 & 6), |A & C| = 1, CF = 1 / 4 = 0.25
Mutation

Mutation example on a tree:

(NOC >= 0) AND (S < 40000)   becomes   (NOC > 0) AND (S < 90000)

(the ">=" node mutates to ">" and the value 40000 mutates to 90000)
Crossover

Crossover example on trees: the two parents exchange their salary subtrees.

Parents:   (NOC > 1) AND (S > 30000)   and   (NOC >= 0) AND (S < 40000)
Children:  (NOC > 1) AND (S < 40000)   and   (NOC >= 0) AND (S > 30000)
Generation 1

Individual 1: (NOC > 1) AND (S < 40000)
Individual 2: (NOC >= 0) AND (S > 30000)
Individual 3: (NOC > 0) AND (S < 90000)
Fitness Function – Generation 1

Rule 1: If (NOC = 2) AND (S > 80000) then GOOD
Rule 2: If (NOC = 1) AND (S > 30000) then GOOD
Rule 3: If (NOC = 0) AND (S = 50000) then GOOD
Rule 4: If (NOC > 2) AND (S < 10000) then BAD
Rule 5: If (NOC = 10) AND (S = 30000) then BAD
Rule 6: If (NOC = 5) AND (S < 30000) then BAD

 Individual 1: If (NOC > 1) AND (S < 40000) then GOOD
   |A| = 3 (Rules 4, 5 & 6), |A & C| = 0, CF = 0 / 3 = 0
 Individual 2: If (NOC >= 0) AND (S > 30000) then GOOD
   |A| = 3 (Rules 1, 2 & 3), |A & C| = 3, CF = 3 / 3 = 1   (Best in Generation 1)
 Individual 3: If (NOC > 0) AND (S < 90000) then GOOD
   |A| = 5 (Rules 1, 2, 4, 5 & 6), |A & C| = 1, CF = 1 / 5 = 0.2
GA Operators on Rules – the Approach of Flockhart's Paper

By: Marcela Boboila
References

 I.W. Flockhart and N.J. Radcliffe. "GA-MINER: parallel data mining with hierarchical genetic algorithms – final report". EPCC-AIKMS-GAMINER Report 1.0, University of Edinburgh, UK, 1995.
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf

 I.W. Flockhart and N.J. Radcliffe. "A genetic algorithm-based approach to data mining". In The Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, pages 299-302, AAAI Press, Aug. 2-4, 1996.
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzkdd96a.pdf/flockhart96genetic.pdf
From Rules to Subset Descriptions

 Step 1: We have the following rules, which describe part of the data table:
Rule 1: A1 => C
Rule 2: A2 => C
...
Rule n: An => C
 Step 2: Combine the antecedents: (A1 U A2 ... U An) => C
 Step 3: Look only at the antecedent to get the subset description: (A1 U A2 ... U An)
Part of the Data Table: an Example

No.  Age       Hobby    Class C
1    20 .. 30  dancing  GOOD
2    25 .. 55  reading  GOOD

Rule 1: If Age = 20 .. 30 AND Hobby = dancing then GOOD   (antecedent A1)
Rule 2: If Age = 25 .. 55 AND Hobby = reading then GOOD   (antecedent A2)

Combined: A1 U A2 => GOOD
From Rules to Subset Descriptions: an Example

 Step 1: We have the rules:
 Rule 1: If Age = 20 .. 30 AND Hobby = dancing then GOOD
 Rule 2: If Age = 25 .. 55 AND Hobby = reading then GOOD
 Step 2: We combine the antecedent parts to form a single rule describing the "subset" of individuals in the same class:
 If ((Age = 20 .. 30 AND Hobby = dancing) OR (Age = 25 .. 55 AND Hobby = reading)) then GOOD
 Step 3: The subset description is the antecedent part:
 {Age = 20 .. 30 AND Hobby = dancing} OR {Age = 25 .. 55 AND Hobby = reading}
Subset Description

The example description as a tree: an "or" node at the root joins two clauses, and each clause is an "and" of terms.

or
  Clause: Age = 20 .. 30 (Term) and Hobby = dancing (Term)
  Clause: Age = 25 .. 55 (Term) and Hobby = reading (Term)
Subset Description

 Chromosomes are represented as subset descriptions.
 Subset descriptions consist of disjunctions and conjunctions of attribute-value or attribute-range constraints:

Subset Description: {Clause} [or Clause]
Clause:             {Term} [and Term]
Term:               Attribute in Value Set
                  | Attribute in Range

 E.g.: {Age = 20 .. 30 and Hobby = dancing} or {Age = 25 .. 55 and Hobby = reading}
Crossover
 Apply crossover at all levels, successively:
 Subset description crossover
 Clause crossover (uniform or single-point)
 Term crossover

Subset Description Crossover

(Figure: clause-level crossover of two subset descriptions.)

Parent A: Clause A1 OR Clause A2 OR Clause A3
Parent B: Clause B1 OR Clause B2 OR Clause B4

Paired clauses (A1 with B1, A2 with B2) undergo clause crossover; the unpaired A3 is included with probability rBias and the unpaired B4 with probability 1 - rBias.

Child: Clause C1 OR Clause C2 OR Clause C3 OR Clause C4
Subset Description Crossover

 Consider the following two descriptors (chromosomes):
 A: Clause A1 or Clause A2 or Clause A3
 B: Clause B1 or Clause B2 or Clause B4

 Apply clause crossover (uniform or single-point) to cross clause A1 with B1, and A2 with B2.
 For clauses with no partner:
 Include A3 with probability rBias (first parent).
 Include B4 with probability 1 - rBias (second parent).
Uniform Clause Crossover

(Figure:)

Parent A: Age = 20 .. 30 AND Height = 1.5 .. 2.0
Parent B: Hobby = dancing AND Age = 0 .. 25

The two Age terms undergo term crossover; the unpaired Height term is kept with probability rBias and the unpaired Hobby term with probability 1 - rBias.

Possible child: Hobby = dancing AND Age = .. AND Height = 1.5 .. 2.0
Uniform Clause Crossover

 Consider the clauses:
 A: Age = 20 .. 30 and Height = 1.5 .. 2.0
 B: Hobby = dancing and Age = 0 .. 25

 Align the clauses with respect to terms.

 Apply term crossover between the Age terms.
 Include:
 the Height term (with no partner) in the child with probability rBias;
 the Hobby term (with no partner) in the child with probability 1 - rBias.
Single-Point Clause Crossover

(Figure: a single crossover point is chosen in the aligned clauses; the child takes terms from the first clause on one side of the point and from the second clause on the other side.)

Clause A: Age = 20 .. 30 AND Height = 1.5 .. 2.0
Clause B: Hobby = dancing AND Age = 0 .. 25
Child:    Age = 0 .. 25
Single-Point Clause Crossover

 Consider the clauses, aligned with respect to terms:
 A: Age = 20 .. 30 and Height = 1.5 .. 2.0
 B: Hobby = dancing and Age = 0 .. 25

 E.g., consider a crossover point between Hobby and Age:
 the child takes the terms to the left of the crossover point in clause A, and the terms to the right of the crossover point in clause B:
Child C: Age = 0 .. 25
Term Crossover – Value Terms

(Figure:)

Parent A: Hobby = dancing, singing
Parent B: Hobby = dancing, hiking

The common value (dancing) is kept; singing is inherited with probability rBias and hiking with probability 1 - rBias.

Possible child: Hobby = dancing, singing, hiking
Term Crossover – Range Terms

(Figure:)

Parent A: Age = 20 .. 30
Parent B: Age = 0 .. 25

Each limit of the child's range is taken from parent A with probability rBias and from parent B with probability 1 - rBias:

Child: Age = low limit .. high limit
Term Crossover

 Used to combine two terms concerning the same attribute.
 Consider the clauses:
 A: Hobby = dancing, singing and Age = 20 .. 30
 B: Hobby = hiking, dancing and Age = 0 .. 25
 How to form the child:
 Value terms:
   Include values common to both parents, e.g. dancing.
   Include values unique to one parent with a probability: rBias for singing and 1 - rBias for hiking.
 Range terms:
   Select the low and high limits with a probability:
   Low limit for Age: rBias for the value 20, 1 - rBias for the value 0.
   High limit for Age: rBias for the value 30, 1 - rBias for the value 25.
 Non-valid ranges are later pruned (discarded). A sketch of both cases follows.
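A speculative C sketch of the two cases: value sets are represented as bit masks (one bit per possible hobby), and rBias = 0.7 is an invented setting.

#include <stdio.h>
#include <stdlib.h>

#define RBIAS 0.7
static int coin(double p) { return (double)rand() / RAND_MAX < p; }

/* value terms: keep common values, inherit each unique value with a bias */
static unsigned value_cross(unsigned a, unsigned b) {
    unsigned child = a & b;                      /* values in both parents */
    unsigned only_a = a & ~b, only_b = b & ~a;
    for (unsigned bit = 1; bit; bit <<= 1) {
        if ((only_a & bit) && coin(RBIAS))       child |= bit;
        if ((only_b & bit) && coin(1.0 - RBIAS)) child |= bit;
    }
    return child;
}

/* range terms: each limit comes from parent A with probability rBias */
static void range_cross(int lo_a, int hi_a, int lo_b, int hi_b,
                        int *lo, int *hi) {
    *lo = coin(RBIAS) ? lo_a : lo_b;
    *hi = coin(RBIAS) ? hi_a : hi_b;
    /* if *lo > *hi, the non-valid range would be pruned later */
}

int main(void) {
    int lo, hi;
    range_cross(20, 30, 0, 25, &lo, &hi);        /* Age = 20..30 x Age = 0..25 */
    printf("hobbies=%#x age=%d..%d\n", value_cross(0x3, 0x5), lo, hi);
    return 0;
}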
Mutation
 Apply mutation at all levels, successively:
 Subset description mutation
 Clause mutation
 Term mutation

Subset Description Mutation

(Figure:)

Clause A1 OR Clause A2 OR Clause A3
  -> clause mutation is applied to each clause ->
Clause A1' OR Clause A2' OR Clause A3'

Then decide with probability pCls whether to add or delete a clause, each with equal probability (50%):
Add:    Clause A1' OR Clause A2' OR Clause A3' OR Clause A4'
Delete: one clause is removed at random
Subset Description Mutation

 Consider the following descriptor (chromosome):
 A: Clause A1 or Clause A2 or Clause A3 or Clause A4
 Steps (a code sketch follows this list):
1. Apply clause mutation on each clause: on A1, A2, A3 and A4.
2. Decide with probability pCls whether or not to do an add/delete clause operation.
3. If add/delete has been decided, either add a new clause or delete an existing clause with equal probability (50%):
 deletion: pick a clause at random and delete it;
 adding: generate a new clause at random (from random possible attributes with random values/ranges assigned).
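A possible C sketch of steps 2 and 3, with clauses held abstractly as an array of ids; pCls = 0.3 is illustrative and new_random_clause() is a hypothetical stand-in for the real clause generator.

#include <stdio.h>
#include <stdlib.h>

#define PCLS 0.3
static int new_random_clause(void) { return rand() % 1000; }  /* placeholder */

static int mutate_description(int *clauses, int n, int max_n) {
    if ((double)rand() / RAND_MAX >= PCLS) return n;  /* no add/delete */
    if (rand() % 2 == 0 && n < max_n) {
        clauses[n++] = new_random_clause();           /* add a random clause */
    } else if (n > 1) {
        int victim = rand() % n;                      /* delete one at random */
        clauses[victim] = clauses[--n];
    }
    return n;                                         /* new clause count */
}

int main(void) {
    int clauses[8] = {1, 2, 3, 4};
    int n = mutate_description(clauses, 4, 8);
    printf("%d clauses after mutation\n", n);
    return 0;
}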
Clause Mutation

(Figure:)

Hobby = dancing AND Age = 20 .. 30
  -> term mutation is applied to each term ->
Term Hobby' AND Term Age'

Then decide with probability pTerm whether to add or delete a term, each with equal probability (50%):
Add:    Term Hobby' AND Term Age' AND Term X
Delete: one term is removed at random
Clause Mutation

 Consider the following clause:
 Hobby = dancing and Age = 20 .. 30
 Steps:
1. Apply term mutation on each term.
2. Decide with probability pTerm whether or not to do an add/delete term operation.
3. If add/delete has been decided, either add a new term or delete an existing term with equal probability (50%):
 deletion: pick a term at random and delete it;
 adding: generate a new term at random.
Term Mutation – Value

(Figure:)

Hobby = dancing
  -> decide with probability rMutTerm whether to mutate the term;
  -> if so, do attribute mutation with probability rAvr or value mutation with probability 1 - rAvr:

Attribute mutation: Occupation = student
Value mutation:     Hobby = swimming
Term Mutation – Range

(Figure:)

Age = 10 .. 50
  -> decide with probability rMutTerm whether to mutate the term;
  -> if so, do attribute mutation with probability rAvr or range mutation with probability 1 - rAvr:

Attribute mutation: Occupation = student
Range mutation:     Age = 3 .. 25
Term Mutation

 First decide, with probability rMutTerm, whether or not to mutate the term.
 If term mutation is decided, do either attribute mutation or value/range mutation, each with some probability.
 Consider the following term: Hobby = dancing
 Attribute mutation: randomly choose another available attribute, e.g. Occupation, and a random value for it, e.g. student. New term: Occupation = student.
 Value mutation: randomly choose another value for the current attribute, e.g. swimming. New term: Hobby = swimming.
 Consider the following term: Age = 10 .. 50
 Range mutation: randomly choose another range for the current attribute, e.g. 3 .. 25. New term: Age = 3 .. 25.
