Bestintro
1
General description of the method
2
References
J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2003.
Data Mining Techniques: class lecture notes and PowerPoint slides.
http://cs.felk.cvut.cz/~xobitko/ga/
Massachusetts Institute of Technology – Prof. de Weck and Prof. Willcox, Multidisciplinary System Design Optimization course lecture notes on heuristic techniques, “A Basic Introduction to Genetic Algorithms”:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
3
History of Genetic Algorithms
“Evolutionary Computing” was introduced in the 1960s by I. Rechenberg.
4
What Are Genetic Algorithms (GAs)?
Genetic Algorithms are search and optimization techniques based on Darwin's principle of natural selection:
“problems are solved by an evolutionary process resulting in a best (fittest) solution (the survivor); in other words, the solution is evolved.”
1. Inheritance – Offspring acquire characteristics
2. Mutation – Change, to avoid similarity
3. Natural Selection – Variations improve survival
4. Recombination - Crossover
5
Genetics
Chromosome
All living organisms consist of cells, and each cell contains the same set of chromosomes. Chromosomes are strings of DNA and consist of genes, which are blocks of DNA. Each gene encodes a particular trait, for example the color of the eyes.
Reproduction
During reproduction, recombination (or crossover) occurs first: genes from the parents combine to form a whole new chromosome. The newly created offspring can then be mutated; these changes are mainly caused by errors in copying genes from the parents.
The fitness of an organism is measured by the success of the organism in its life (survival).
Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
6
Principle Of Natural Selection
“Select The Best, Discard The Rest”
7
GA Elements
Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
8
Search Space
If we are solving a problem, we are usually looking for the best solution among many. The space of all feasible solutions (the set of objects among which the desired solution lies) is called the search space (also state space). Each point in the search space represents one feasible solution, and each feasible solution can be “marked” by its value, or fitness, for the problem.
Initialization
Initially, many individual solutions are randomly generated to form an initial population that covers the entire range of possible solutions (the search space).
Each point in the search space represents one possible solution, marked by its value (fitness).
Selection
A proportion of the existing population is selected to breed a new generation.
Reproduction
Generate a second generation population of solutions from those selected through genetic
operators: crossover and mutation.
Termination
A solution is found that satisfies minimum criteria
A fixed number of generations is reached
The allocated budget (computation, time/money) is exhausted
The highest-ranking solution's fitness has reached, or is approaching, a plateau, so that successive iterations no longer produce better results
9
Methodology Associated with GAs
[Flowchart: Begin → T = 0 (first step) → Initialize population → Evaluate solutions → Optimum solution? If N: Selection → Mutation → evaluate the new solutions again (loop). If Y: stop.]
10
Citation: http://cs.felk.cvut.cz/~xobitko/ga/
Creating a GA on Computer
Simple_Genetic_Algorithm()
{
    Initialize the Population;
    Calculate Fitness Function;
    While (Fitness Value != Optimal Value) {
        Selection; Crossover; Mutation; Calculate Fitness Function;
    }
}
11
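To make the loop above concrete, here is a minimal, self-contained C sketch (not part of the original slides) that evolves bit-string chromosomes toward the all-ones string – the classic OneMax toy problem. The population size, rates, tournament selection and the helper names (fitness, tournament) are illustrative assumptions, not a definitive implementation.

/* Minimal GA sketch for OneMax: maximize the number of 1-bits in a chromosome. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POP   20      /* population size              */
#define LEN   32      /* chromosome length in bits    */
#define GENS  100     /* number of generations        */
#define PMUT  0.02    /* per-bit mutation probability */

static int fitness(const char *c) {              /* fitness = count of 1-bits */
    int i, f = 0;
    for (i = 0; i < LEN; i++) f += (c[i] == '1');
    return f;
}

static int tournament(char pop[POP][LEN + 1]) {  /* pick the fitter of two random individuals */
    int a = rand() % POP, b = rand() % POP;
    return fitness(pop[a]) >= fitness(pop[b]) ? a : b;
}

int main(void) {
    char pop[POP][LEN + 1], next[POP][LEN + 1];
    int g, i, j, best = 0;
    srand(42);
    for (i = 0; i < POP; i++) {                  /* random initial population */
        for (j = 0; j < LEN; j++) pop[i][j] = (rand() % 2) ? '1' : '0';
        pop[i][LEN] = '\0';
    }
    for (g = 0; g < GENS; g++) {                 /* elitism omitted for brevity */
        for (i = 0; i < POP; i++) {
            int p1 = tournament(pop), p2 = tournament(pop);
            int cut = rand() % LEN;              /* single-point crossover */
            memcpy(next[i], pop[p1], cut);
            memcpy(next[i] + cut, pop[p2] + cut, LEN - cut);
            next[i][LEN] = '\0';
            for (j = 0; j < LEN; j++)            /* bit-flip mutation */
                if ((double)rand() / RAND_MAX < PMUT)
                    next[i][j] = (next[i][j] == '1') ? '0' : '1';
        }
        memcpy(pop, next, sizeof pop);
    }
    for (i = 1; i < POP; i++) if (fitness(pop[i]) > fitness(pop[best])) best = i;
    printf("best = %s  fitness = %d/%d\n", pop[best], fitness(pop[best]), LEN);
    return 0;
}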
Nature Vs Computer - Mapping
Nature            Computer
Population        Set of solutions
Individual        Solution to a problem
Fitness           Quality of a solution
Chromosome        Encoding for a solution
Gene              Part of the encoding of a solution
Reproduction      Crossover
12
Encoding
The process of representing the solution in
the form of a string that conveys the
necessary information.
13
Encoding Methods
Binary Encoding – Most common method of encoding. Chromosomes are
strings of 1s and 0s and each position in the chromosome represents a
particular characteristic of the problem.
Chromosome A   10110010110011100101
Chromosome B   11111110000000011111
Permutation Encoding – Used in ordering problems (e.g. the travelling salesman problem); each chromosome is a string of numbers that represents a position in a sequence.
Chromosome A   1 5 3 2 6 4 7 9 8
Chromosome B   8 5 6 7 2 3 1 4 9
14
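As a small illustration (an assumption, not from the slides), binary and permutation chromosomes can be represented directly as arrays; the values below reuse the example chromosomes above.

#include <stdio.h>

#define N_BITS   20   /* length of a binary chromosome                    */
#define N_ITEMS  9    /* length of a permutation chromosome (e.g. a tour) */

int main(void) {
    /* Binary encoding: each position is a 0/1 gene (Chromosome A above). */
    int chrom_bin[N_BITS]  = {1,0,1,1,0,0,1,0,1,1,0,0,1,1,1,0,0,1,0,1};

    /* Permutation encoding: the chromosome is an ordering of items,
       e.g. the order in which cities are visited. */
    int chrom_perm[N_ITEMS] = {1,5,3,2,6,4,7,9,8};

    for (int i = 0; i < N_BITS; i++)  printf("%d", chrom_bin[i]);
    printf("\n");
    for (int i = 0; i < N_ITEMS; i++) printf("%d ", chrom_perm[i]);
    printf("\n");
    return 0;
}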
Encoding Methods (contd.)
Value Encoding – Used in problems where complicated values, such as real numbers, are needed and where binary encoding would not suffice. Good for some problems, but it is often necessary to develop problem-specific crossover and mutation operators for these chromosomes.
15
Encoding Methods (contd.)
Tree Encoding – This encoding is used mainly for evolving programs or
expressions, i.e. for Genetic programming.
Tree Encoding - every chromosome is a tree of some objects, such as
values/arithmetic operators or commands in a programming language.
Citation:
http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
16
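A sketch (not from the slides) of how a tree-encoded chromosome might be stored and evaluated in C; the node types, the mk helper and the recursive evaluator are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

/* A chromosome under tree encoding: each node is either an operator
   (internal node) or a terminal (a constant or the variable x). */
typedef enum { ADD, MUL, CONST, VAR_X } NodeType;

typedef struct Node {
    NodeType     type;
    double       value;            /* used when type == CONST      */
    struct Node *left, *right;     /* used when type == ADD or MUL */
} Node;

static Node *mk(NodeType t, double v, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->type = t; n->value = v; n->left = l; n->right = r;
    return n;
}

/* Recursively evaluate the program tree for a given x. */
static double eval(const Node *n, double x) {
    switch (n->type) {
    case CONST: return n->value;
    case VAR_X: return x;
    case ADD:   return eval(n->left, x) + eval(n->right, x);
    default:    return eval(n->left, x) * eval(n->right, x);   /* MUL */
    }
}

int main(void) {
    /* The tree for x*x + (x + 1), i.e. x^2 + x + 1; memory is not freed in this short sketch. */
    Node *tree = mk(ADD, 0,
                    mk(MUL, 0, mk(VAR_X, 0, NULL, NULL), mk(VAR_X, 0, NULL, NULL)),
                    mk(ADD, 0, mk(VAR_X, 0, NULL, NULL), mk(CONST, 1, NULL, NULL)));
    printf("f(0.8) = %.2f\n", eval(tree, 0.8));   /* prints 2.44 */
    return 0;
}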
GA Operators
17
References
18
19
Citation: http://www.ewh.ieee.org/soc/es/May2001/14/GA.GIF
Basic GA Operators
Recombination
20
Fitness function
Quantifies the optimality of a solution (that is, a chromosome) so that a particular chromosome can be ranked against all the other chromosomes.
22
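As a small illustration (an assumption, not from the slides, but consistent with the roulette-wheel example a few slides later, where the string 01000 has fitness 64): decode a binary chromosome to an integer x and score it with f(x) = x².

#include <stdio.h>

/* Decode a binary chromosome to an integer and score it with f(x) = x*x. */
static int fitness(const char *chromosome) {
    int x = 0;
    for (int i = 0; chromosome[i] != '\0'; i++)
        x = 2 * x + (chromosome[i] - '0');     /* binary string -> integer */
    return x * x;
}

int main(void) {
    printf("%d\n", fitness("01000"));   /*  8 *  8 =  64 */
    printf("%d\n", fitness("11000"));   /* 24 * 24 = 576 */
    return 0;
}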
So, how to select the best?
Roulette Selection
Rank Selection
Tournament Selection
23
Roulette wheel selection
Main idea: the fitter the solution, the better its chances of being chosen.
How does it work?
24
Example of Roulette wheel selection
No.   String   Fitness   % of total
3     01000    64        5.5
Citation: www.cs.vu.nl/~gusz/
25
Roulette wheel selection
[Pie chart: the wheel is divided among Chromosome 1, Chromosome 2, Chromosome 3 and Chromosome 4 in proportion to their fitness.]
All you have to do is spin the ball and grab the chromosome at the point where it stops.
26
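A sketch (not from the slides) of roulette-wheel selection: each chromosome occupies a slice of the wheel proportional to its fitness, and a random “spin” picks one. The function name roulette_select and the fitness values are illustrative (the 64 matches the 01000 row of the example table; the rest are made up for the demo).

#include <stdio.h>
#include <stdlib.h>

/* Pick an index in [0, n) with probability proportional to fitness[i]. */
static int roulette_select(const double *fitness, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) total += fitness[i];

    double spin = ((double)rand() / RAND_MAX) * total;  /* where the ball stops */
    double cumulative = 0.0;
    for (int i = 0; i < n; i++) {
        cumulative += fitness[i];
        if (spin <= cumulative) return i;
    }
    return n - 1;                                        /* guard against rounding */
}

int main(void) {
    double fitness[4] = {169.0, 576.0, 64.0, 361.0};     /* illustrative fitness values */
    int counts[4] = {0, 0, 0, 0};
    srand(1);
    for (int k = 0; k < 10000; k++) counts[roulette_select(fitness, 4)]++;
    for (int i = 0; i < 4; i++)
        printf("chromosome %d selected %d times\n", i + 1, counts[i]);
    return 0;
}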
Crossover
Main idea: combine the genetic material (bits) of two “parent” chromosomes (solutions) to produce a new “child” possessing characteristics of both “parents”.
How does it work?
Several methods…
27
Crossover methods
Single Point Crossover- A random point is chosen on the individual chromosomes (strings) and the genetic material is exchanged at this point.
Citation: http://www.ewh.ieee.org/soc/es/May2001/14/CROSS0.GIF
28
Crossover methods
Two-Point Crossover- Two random points are
chosen on the individual chromosomes (strings) and
the genetic material is exchanged at these points.
29
Crossover methods
Uniform Crossover- Each gene (bit) is selected
randomly from one of the corresponding genes of
the parent chromosomes.
30
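A sketch (not from the slides) of single-point and uniform crossover on bit-string chromosomes; the function names are assumptions, and the parents reuse chromosomes A and B from the encoding slide. Two-point crossover simply exchanges the segment between two random cut points instead of everything after one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Single-point crossover: exchange everything after a random cut point. */
static void single_point(const char *p1, const char *p2, char *child, int len) {
    int cut = rand() % len;                   /* random crossover point */
    memcpy(child, p1, cut);
    memcpy(child + cut, p2 + cut, len - cut);
    child[len] = '\0';
}

/* Uniform crossover: each gene is taken from either parent with equal probability. */
static void uniform(const char *p1, const char *p2, char *child, int len) {
    for (int i = 0; i < len; i++)
        child[i] = (rand() % 2) ? p1[i] : p2[i];
    child[len] = '\0';
}

int main(void) {
    const char *a = "10110010110011100101";   /* chromosome A from the encoding slide */
    const char *b = "11111110000000011111";   /* chromosome B */
    char c1[32], c2[32];
    srand(7);
    single_point(a, b, c1, 20);
    uniform(a, b, c2, 20);
    printf("single-point child: %s\n", c1);
    printf("uniform child:      %s\n", c2);
    return 0;
}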
Crossover (contd.)
Crossover between 2 good solutions MAY NOT
ALWAYS yield a better or as good a solution.
31
Elitism
Main idea: copy the best chromosomes (solutions) to the new population before applying crossover and mutation.
32
Mutation
Main idea: random inversion of bits in a solution, to maintain diversity in the population.
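A sketch (not from the slides) of bit-inversion mutation: each bit of the chromosome is flipped with a small probability; the function name mutate and the rate are assumptions.

#include <stdio.h>
#include <stdlib.h>

/* Flip each bit of the chromosome with probability pm. */
static void mutate(char *chromosome, double pm) {
    for (int i = 0; chromosome[i] != '\0'; i++)
        if ((double)rand() / RAND_MAX < pm)
            chromosome[i] = (chromosome[i] == '1') ? '0' : '1';
}

int main(void) {
    char c[] = "11111110000000011111";
    srand(3);
    mutate(c, 0.05);                 /* about 1 bit in 20 flipped on average */
    printf("%s\n", c);
    return 0;
}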
Advantages:
Always an answer; answer gets better with time
Good for “noisy” environments
Inherently parallel; easily distributed
Issues:
Performance
Solution is only as good as the evaluation function
(often hardest part)
Termination Criteria
34
Applications – Genetic programming and data mining
By: George Iordache
35
A.A. Freitas. “A survey of evolutionary algorithms for data mining and knowledge
discovery”, Pontificia Universidade Catolica do Parana, Brazil. In A. Ghosh and S.
Tsutsui, editors, Advances in Evolutionary Computation, pages 819--845. Springer-
Verlag, 2002.
http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_pape
rs.dirzSzAdvEC-bk.pdf/freitas01survey.pdf
Anita Wasilewska, Course Lecture Notes (2007 and previous years) on Classification
(Data Mining book Chapters 5 and 7) -
http://www.cs.sunysb.edu/~cse634/lecture_notes/07classification.pdf
J. Han, and M. Kamber. “Data Mining: Concepts and Techniques 2nd ed.”, Morgan
Kaufmann Publishers, March 2006. ISBN 1-55860-901-6
R. Mendes, F. Voznika, A. Freitas, and J. Nievola. “Discovering fuzzy classification rules
with genetic programming and co-evolution”, Pontificia Universidade Catolica do
Parana, Brazil. In L. de Raedt and A. Siebes, editors, 5th European Conference on
Principles and Practice of Knowledge Discovery in Databases (PKDD'01), volume 2168 of
LNAI, pages 314--325. Springer Verlag, 2001.
http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_papers.dirz
SzPKDD-2001.pdf/mendes01discovering.pdf
36
Genetic Programming
A program in C
int foo (int time)
{
    int temp1, temp2;
    if (time > 10)
        temp1 = 3;
    else
        temp1 = 4;
    temp2 = temp1 + 1 + 2;   /* returns 6 when time > 10, otherwise 7 */
    return (temp2);
}
Citation: www.genetic-programming.com/c2003lecture1modified.ppt
37
Program tree
Citation: www.genetic-programming.com/c2003lecture1modified.ppt
38
Given data
Input: Independent variable X Output: Dependent variable Y
-1.00 1.00
-0.80 0.84
-0.60 0.76
-0.40 0.76
-0.20 0.84
0.00 1.00
0.20 1.24
0.40 1.56
0.60 1.96
0.80 2.44
1.00 3.00
Citation: www.genetic-programming.com/c2003lecture1modified.ppt
39
Problem description
Objective: Find a computer program with one
input (independent variable X) whose
output Y equals the given data
Candidate programs: x + 1,   x² + 1,   2,   x
Mutation:
[Figure: picking “2” as the mutation point.]
43
Citation: part of the pictures used as examples are taken from: www.genetic-programming.com/c2003lecture1modified.ppt
Crossover
Crossover: picking the “+” subtree of parent (a) and the leftmost “x” of parent (b) as crossover points.
[Figures: a copy of (a); a mutant of (c), obtained by picking “2” as the mutation point; the first and second offspring of the crossover of (a) and (b), obtained by picking “+” of parent (a) and the leftmost “x” of parent (b) as crossover points.]
45
Citation: part of the examples is taken from: www.genetic-programming.com/c2003lecture1modified.ppt
Fitness of each candidate program = sum of absolute errors against the given data:

X       Y       x+1    |x+1-Y|    1      |1-Y|    x       |x-Y|    x²+x+1   |x²+x+1-Y|
-1.00   1.00    0.00   1.00       1.00   0.00     -1.00   2.00     1.00     0.00
-0.80   0.84    0.20   0.64       1.00   0.16     -0.80   1.64     0.84     0.00
-0.60   0.76    0.40   0.36       1.00   0.24     -0.60   1.36     0.76     0.00
-0.40   0.76    0.60   0.16       1.00   0.24     -0.40   1.16     0.76     0.00
-0.20   0.84    0.80   0.04       1.00   0.16     -0.20   1.04     0.84     0.00
 0.00   1.00    1.00   0.00       1.00   0.00      0.00   1.00     1.00     0.00
 0.20   1.24    1.20   0.04       1.00   0.24      0.20   1.04     1.24     0.00
 0.40   1.56    1.40   0.16       1.00   0.56      0.40   1.16     1.56     0.00
 0.60   1.96    1.60   0.36       1.00   0.96      0.60   1.36     1.96     0.00
 0.80   2.44    1.80   0.64       1.00   1.44      0.80   1.64     2.44     0.00
 1.00   3.00    2.00   1.00       1.00   2.00      1.00   2.00     3.00     0.00
                       Σ 4.40            Σ 6.00            Σ 15.40          Σ 0.00
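The fitness in the table above is just the sum of absolute errors over the given data points; a small C sketch (assumed, not from the slides, with illustrative function names) that scores the four candidate programs:

#include <stdio.h>
#include <math.h>

#define N 11
static const double X[N] = {-1.0,-0.8,-0.6,-0.4,-0.2,0.0,0.2,0.4,0.6,0.8,1.0};
static const double Y[N] = { 1.0,0.84,0.76,0.76,0.84,1.0,1.24,1.56,1.96,2.44,3.0};

/* Candidate programs evolved by GP. */
static double f_xp1(double x)  { return x + 1.0; }
static double f_one(double x)  { (void)x; return 1.0; }
static double f_x(double x)    { return x; }
static double f_full(double x) { return x * x + x + 1.0; }

/* Fitness = sum of |program(x) - y| over the data; smaller is better. */
static double fitness(double (*prog)(double)) {
    double err = 0.0;
    for (int i = 0; i < N; i++) err += fabs(prog(X[i]) - Y[i]);
    return err;
}

int main(void) {
    printf("x+1       : %.2f\n", fitness(f_xp1));   /*  4.40 */
    printf("1         : %.2f\n", fitness(f_one));   /*  6.00 */
    printf("x         : %.2f\n", fitness(f_x));     /* 15.40 */
    printf("x*x+x+1   : %.2f\n", fitness(f_full));  /*  0.00 */
    return 0;
}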
A rule consists of an Antecedent (the IF part) and a Consequence (the THEN part).
48
Formula representation
Possible rule:
If (NOC = 2) AND ( S > 80000) then GOOD (customer)
Tree representation – Formula: AND( NOC = 2, S > 80000 );  Class: GOOD
Citation: the example is taken from Prof. Anita Wasilewska's previous years' course slides
49
Initial data table
No.   Number of children (NOC)   Salary (S)   Type of customer (C)
1     2                          > 80000      GOOD
2     1                          > 30000      GOOD
3     0                          = 50000      GOOD
4     > 2                        < 10000      BAD
5     = 10                       = 30000      BAD
6     = 5                        < 30000      BAD
50
Initial data (written as rules inferred
from the initial table)
Rule 1: If (NOC = 2) AND ( S > 80000) then C = GOOD
Rule 2: If (NOC = 1) AND ( S > 30000) then C = GOOD
Rule 3: If (NOC = 0) AND ( S = 50000) then C = GOOD
Rule 4: If (NOC > 2) AND ( S < 10000) then C = BAD
Rule 5: If (NOC = 10) AND ( S = 30000) then C = BAD
Rule 6: If (NOC = 5) AND ( S < 30000) then C = BAD
51
Generation 0
Population of 3 randomly created individuals:
If (NOC > 3) AND ( S > 10000) then C = GOOD
If (NOC > 1) AND ( S > 30000) then C = GOOD
If (NOC >= 0) AND ( S < 40000) then C = GOOD
52
Generation 0
Individual 1 – tree representation: an AND node with children (NOC > 3) and (S > 10000), i.e. (NOC > 3) AND ( S > 10000)
53
Fitness function
55
Mutation
56
Crossover
[Tree figures: crossover between two rule trees over NOC and S conditions.]
(NOC > 1) AND ( S > 30000)
(NOC > 1) AND ( S < 40000)
Crossover
[Tree figures: the two trees resulting from the crossover.]
(NOC > 1) AND ( S < 40000)
(NOC >= 0) AND ( S > 30000)
Individual 3 – tree representation: an AND node with children (NOC > 0) and (S < 90000), i.e. (NOC > 0) AND ( S < 90000)
58
Fitness function – Generation 1
Rule 1: If (NOC = 2) AND ( S > 80000) then GOOD
Rule 2: If (NOC = 1) AND ( S > 30000) then GOOD
Rule 3: If (NOC = 0) AND ( S = 50000) then GOOD
Rule 4: If (NOC > 2) AND ( S < 10000) then BAD
Rule 5: If (NOC = 10) AND ( S = 30000) then BAD
Rule 6: If (NOC = 5) AND ( S < 30000) then BAD
59
GA Operators on Rules – Flockhart's paper approach
60
I.W. Flockhart and N.J. Radcliffe. “GA-MINER: parallel data
mining with hierarchical genetic algorithms - final report”.
EPCC-AIKMS-GAMINER -Report 1.0. University of
Edinburgh, UK, 1995.
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/3487/
http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
61
From rules to subset descriptions
Step 1: We have the following rules, which describe part of the data table:
Rule 1: A1 => C
Rule 2: A2 => C
…
Rule n: An => C
Step 2: (A1 U A2 … U An) => C
Step 3: We look only at the antecedent to get
the subset description:
(A1 U A2 … U An)
62
Part of the data table. An example
No.   Age        Hobby     Class C
1     20 .. 30   dancing   GOOD
2     25 .. 55   reading   GOOD
64
Subset description
[Tree: an OR node over two Clauses; each Clause is an AND node over Terms.]
Clause 1: Age = 20 .. 30 and Hobby = dancing
Clause 2: Age = 25 .. 55 and Hobby = reading
65
Subset description
Chromosomes are represented as subset descriptions. A subset description is a disjunction (or) of clauses, and each clause is a conjunction (and) of attribute-value or attribute-range constraints (terms):
Subset Description: {Clause} [or Clause]
Clause: {Term} [and Term]
Term: Attribute in Value Set
| Attribute in Range
E.g.: {Age = 20 .. 30 and Hobby = dancing} or {Age =
25 .. 55 and Hobby = reading}
66
Crossover
Apply crossover at all levels, successively:
Subset description crossover
Clause crossover (uniform or single-point)
Term crossover
67
Subset description crossover
[Figure: crossover applied to the description Clause A1 OR Clause A2 OR Clause A3; the branches are taken with probability rBias % and (1 – rBias) %.]
68
Subset description crossover
Consider the following 2 descriptors
(chromosomes):
A : Clause A1 or Clause A2 or Clause A3
B : Clause B1 or Clause B2 or Clause B4
[Figure: the clauses of A and B are paired and recombined by term crossover, with probabilities rBias % and (1 – rBias) %.]
70
Uniform clause crossover
Consider the clauses:
A : Age = 20 .. 30 and Height = 1.5 .. 2.0
B : Hobby = dancing and Age = 0 .. 25
[Figure: each term of the child clause is taken from one of the two parents; e.g. the child may take Age = 0 .. 25 from B with probability (1 – rBias) %.]
74
Term crossover – range terms
[Figure: parent ranges Age = 20 .. 30 and Age = 0 .. 25; each limit of the child range is taken from the first parent with probability rBias % and from the second with probability (1 – rBias) %.]
75
Term crossover
Used to combine two terms concerning the same
attribute.
Consider the clauses:
A : Hobby = dancing, singing and Age = 20 .. 30
B : Hobby = hiking, dancing and Age = 0 .. 25
How to form child:
Value terms:
Include values common to both parents: e.g.: dancing
Include values unique to one parent with a probability:
e.g.: rBias for singing and 1-rBias for hiking
Range terms:
Select low and high limit with a probability:
Low limit for Age: rBias for value 20 and 1-rBias for value 0
High limit for Age: rBias for value 30 and 1-rBias for value 25
Later prune (discard) invalid ranges; a small code sketch of this limit selection follows below.
76
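A sketch (an assumption, not taken from the GA-MINER paper itself) of the limit-selection rule for range terms described above: each limit of the child range comes from the first parent with probability rBias and from the second with probability 1 – rBias, and invalid ranges (low > high) are pruned. The type Range and the functions coin and range_crossover are hypothetical names.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double low, high; } Range;   /* a range term, e.g. Age = 20 .. 30 */

/* Return 1 with probability p, 0 otherwise. */
static int coin(double p) { return ((double)rand() / RAND_MAX) < p; }

/* Range-term crossover: pick each limit from parent a (prob. rBias) or parent b. */
static int range_crossover(Range a, Range b, double rBias, Range *child) {
    child->low  = coin(rBias) ? a.low  : b.low;
    child->high = coin(rBias) ? a.high : b.high;
    return child->low <= child->high;          /* 0 = invalid range, prune it */
}

int main(void) {
    Range a = {20, 30};        /* Age = 20 .. 30 (parent A) */
    Range b = { 0, 25};        /* Age =  0 .. 25 (parent B) */
    Range child;
    srand(11);
    if (range_crossover(a, b, 0.7, &child))
        printf("child: Age = %.0f .. %.0f\n", child.low, child.high);
    else
        printf("invalid range pruned\n");
    return 0;
}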
Mutation
Apply mutation at all levels, successively:
Subset description mutation
Clause mutation
Term mutation
77
Subset description mutation
[Figure: mutation applied to the description Clause A1 OR Clause A2 OR Clause A3; term mutation is applied inside the clauses.]
81
Term mutation - Value
Term: Hobby = dancing
Do term mutation? – decided with probability rMutTerm %.
If yes: attribute mutation with probability rAvr %, or value mutation with probability (1 – rAvr) %.
82
Term mutation - Range
Term: Age = 10 .. 50
Do term mutation? – decided with probability rMutTerm %.
If yes: attribute mutation with probability rAvr %, or range mutation with probability (1 – rAvr) %.
83
Term mutation
First decide, with probability rMutTerm, whether to mutate this term at all.
If term mutation is decided, perform either attribute mutation or value/range mutation, chosen with probability rAvr / (1 – rAvr).
Consider the following term: Hobby = dancing
Attribute mutation: randomly choose another attribute
available, e.g. occupation, and a random value for it: e.g.
student. New term: occupation = student
Value mutation: randomly choose another value for current
attribute. E.g.: swimming. New term: Hobby = swimming
Consider the following term: Age = 10 .. 50
Range mutation: randomly choose another range for
current attribute. E.g.: 3 .. 25. New term: Age = 3 .. 25
84
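A sketch (assumed, not from the paper) of the term-mutation decision just described for a value term: with probability rMutTerm the term is mutated at all; if it is, either its attribute or its value is replaced, governed by rAvr. The function names are hypothetical, and the "random" replacements are hard-coded to the slide's own examples (Occupation = student, swimming) for clarity.

#include <stdio.h>
#include <stdlib.h>

static int coin(double p) { return ((double)rand() / RAND_MAX) < p; }

/* Mutate a value term such as "Hobby = dancing" following the decision tree above. */
static void mutate_value_term(const char **attr, const char **value,
                              double rMutTerm, double rAvr) {
    if (!coin(rMutTerm)) return;          /* no term mutation this time            */
    if (coin(rAvr)) {                     /* attribute mutation: another attribute */
        *attr  = "Occupation";            /*   and a (here fixed) value for it     */
        *value = "student";
    } else {                              /* value mutation: another value for the */
        *value = "swimming";              /*   current attribute                   */
    }
}

int main(void) {
    const char *attr = "Hobby", *value = "dancing";
    srand(5);
    mutate_value_term(&attr, &value, 0.3, 0.5);
    printf("term after mutation: %s = %s\n", attr, value);
    return 0;
}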