Cse 590 Data Mining: Prof. Anita Wasilewska SUNY Stony Brook
References
D. E. Goldberg, ‘Genetic Algorithms in Search, Optimization, and Machine
Learning’, New York: Addison-Wesley (1989)
John H. Holland, ‘Genetic Algorithms’, Scientific American, July 1992.
Kalyanmoy Deb, ‘An Introduction to Genetic Algorithms’, Sadhana, Vol. 24,
Parts 4 and 5.
T. Starkweather, et al., ‘A Comparison of Genetic Sequencing Operators’,
International Conference on Genetic Algorithms (1991)
http://obitko.com/tutorials/genetic-algorithms/introduction.php
http://en.wikipedia.org/wiki/Genetic_algorithm
http://en.wikipedia.org/wiki/Fitness_function
http://en.wikipedia.org/wiki/Crossover_(genetic_algorithm)
http://www.edc.ncl.ac.uk/highlight/rhjanuary2007g02.php/
Tutorial: Wendy Williams, “Metaheuristic Algorithms, Genetic Algorithms a Tutorial”
http://brainz.org/15-real-world-applications-genetic-algorithms/
Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
Quick Background
The idea of evolutionary computing was introduced in the
1960s by I. Rechenberg in his work "Evolution strategies"
(Evolutionsstrategie in the original).
Citation “http://en.wikipedia.org/wiki/Genetic_algorithm”
Role of GA
To solve optimization and search related problems.
Citation “http://www.public.iastate.edu/~olafsson/mining_course_info.html ”
Components of GA
Encoding (e.g. binary encoding, permutation encoding)
Initialization Function
Termination
Citation “http://en.wikipedia.org/wiki/Genetic_algorithm”
GA in a gist
(Flowchart; G is the generation counter)
Begin
Initialize data, G = 0
Evaluate data
Optimal solution?
Y: STOP
N: Selection, then Crossover, then Mutation, then G = G + 1, and back to Evaluate data
Citation “http://obitko.com/tutorials/genetic-algorithms/ga-basic-description.php”
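The loop in the flowchart can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: chromosomes are bit strings, selection is a two-way tournament, crossover is single-point, and the function name run_ga and all of its parameters are hypothetical rather than taken from the slides.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, max_generations=50,
           target=None, crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    # Initialize data, G = 0: a random population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    # Evaluate data: remember the best chromosome seen so far.
    best = max(population, key=fitness)

    def pick_parent():
        # Selection: the fitter of two randomly drawn chromosomes.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(max_generations):
        # Optimal solution? If a target fitness is known and reached, STOP.
        if target is not None and fitness(best) >= target:
            break
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            # Crossover: single cut point, head of p1 joined to tail of p2.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            offspring.append(child)
        # G = G + 1: the offspring become the next generation.
        population = offspring
        best = max(population + [best], key=fitness)
    return best
```

For example, run_ga(sum, n_bits=16, target=16) searches for the all-ones string, using the number of ones as a toy fitness function.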
Overall Algorithm
1. Generate a random population of n chromosomes.
Permutation Encoding
The input is represented as a string of numbers.
E.g. If the input is a sequence of cities to be visited then one such sequence can be
123456
Chromosome A 1 5 3 2 6 4 7 9 8
Chromosome B 8 5 6 7 2 3 1 4 9
Citation “http://obitko.com/tutorials/genetic-algorithms/operators.php”
Encoding types (contd.)
Value Encoding
The input is a set of values where each value is for a specific
characteristic. E.g. For finding weights in neural networks, one possible input can
be 2 0.3 -0.3 0.1 1 0.2 -0.2
Evaluation (Fitness Function)
“Means to quantify the optimality of the solution so
that it can be ranked against all other solutions.”
Citation “http://en.wikipedia.org/wiki/Fitness_function”
Selection
Roulette Wheel Selection:
• Fitness level is used to associate a probability of selection with
each individual solution e.g. chromosome.
• We first calculate the fitness for each input and then represent it
on the wheel in terms of percentages.
• In a population of N chromosomes, we spin the roulette wheel
N times, selecting one chromosome per spin.
Citation “http://www.edc.ncl.ac.uk/highlight/rhjanuary2007g02.php/ ”
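The roulette wheel described above can be sketched as follows. This is a minimal illustration that assumes non-negative fitness values with a positive sum; the function name roulette_wheel_select and its parameters are hypothetical.

```python
import random

def roulette_wheel_select(population, fitnesses, n_spins, rng=None):
    """Pick n_spins individuals with probability proportional to fitness."""
    rng = rng or random.Random()
    total = sum(fitnesses)  # the full circumference of the wheel
    selected = []
    for _ in range(n_spins):
        spin = rng.uniform(0, total)  # where the wheel stops
        running = 0.0
        for individual, fit in zip(population, fitnesses):
            running += fit  # each individual owns a slice of size `fit`
            if spin < running:
                selected.append(individual)
                break
        else:
            # Floating-point round-off guard: fall back to the last slice.
            selected.append(population[-1])
    return selected
```

An individual holding 50% of the total fitness is expected to be chosen on about half of the spins, while an individual with zero fitness is only ever returned by the round-off guard.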
Selection (contd.)
Tournament Selection:
• Two solutions are picked out of the pool of possible
solutions, their fitness is compared, and the better is
permitted to reproduce.
• Deterministic tournament selection selects the best
individual in each tournament.
• Can take advantage of parallel architecture
Others:
Rank Selection
Steady State Selection etc.
Citation “http://obitko.com/tutorials/genetic-algorithms/operators.php”
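Deterministic tournament selection is short enough to sketch directly. The function name tournament_select and the parameter k (tournament size, 2 in the description above) are assumptions made for illustration.

```python
import random

def tournament_select(population, fitness, k=2, rng=None):
    """Draw k distinct individuals at random and return the fittest one."""
    rng = rng or random.Random()
    contestants = rng.sample(population, k)
    # Deterministic tournament: the best contestant always wins.
    return max(contestants, key=fitness)
```

Because each tournament only looks at its own contestants, many tournaments can run independently, which is why this scheme suits parallel architectures. For a minimization problem, a negated cost can be passed as the fitness.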
Selection: Elitism
Elitism first copies the best solutions to the next set of
solutions. E.g. it copies the best input chromosome (or few
chromosomes) to the new population.
Citation “http://obitko.com/tutorials/genetic-algorithms/selection.php”
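A possible way to add elitism to a generation step is sketched below. The function next_generation_with_elitism, the callback make_offspring and the parameter n_elite are hypothetical names; the callback stands in for whatever selection, crossover and mutation steps are in use.

```python
def next_generation_with_elitism(population, fitness, make_offspring, n_elite=1):
    """Copy the n_elite fittest individuals unchanged, then fill with offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]
    # The rest of the new population comes from the usual operators.
    offspring = [make_offspring(population)
                 for _ in range(len(population) - n_elite)]
    return elites + offspring
```

With this step the best fitness in the population can never decrease from one generation to the next.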
Crossover
Varies the programming of input solutions from one
generation to the next.
APPLICATIONS
The Traveling Salesman Problem
Given:
a set of cities &
a symmetric distance matrix that indicates
the cost of travel from each city to every
other city.
Goal:
To find the shortest circular tour, visiting
every city exactly once, so as to minimize
the total travel cost, which includes the
cost of traveling from the last city back
to the first city
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Encoding
Every city represented as an integer
We will take an example involving 6 cities, namely
{A, B, C, D, E, F}
Representation:
City Encoding
A 1
B 2
C 3
D 4
E 5
F 6
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Encoding (contd.)
A tour through the cities is represented as a sequence of the
integers from 1 to 6
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Distance Matrix For TSP
Cities 1 2 3 4 5 6
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Fitness Function
The “fitness function” is the total cost of the tour represented by
each chromosome.
The smaller the sum, the fitter the solution represented by that
chromosome.
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
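The tour cost can be computed as sketched below. The distance values here are hypothetical and cover only 4 cities, since the matrix from the slides is not reproduced in this text; the function name tour_cost is likewise an assumption.

```python
# Hypothetical symmetric distance matrix for 4 cities (not the slides' data).
DIST = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def tour_cost(tour, dist):
    """Total length of a circular tour; cities are numbered from 1."""
    total = 0
    for i, city in enumerate(tour):
        # The modulo wraps the last city back to the first one.
        next_city = tour[(i + 1) % len(tour)]
        total += dist[city - 1][next_city - 1]
    return total

print(tour_cost([1, 2, 4, 3], DIST))  # 10 + 25 + 30 + 15 = 80
print(tour_cost([1, 2, 3, 4], DIST))  # 10 + 35 + 30 + 20 = 95
```

Under this measure the tour 1 2 4 3 is fitter than 1 2 3 4 because its total cost is smaller.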
Selection Operator
For selection, “Tournament Selection” is used.
Tournament 1:
  1 2 3 4 5 6    D = 9750
  5 1 2 4 3 6    D = 8000
  Winner: 5 1 2 4 3 6
Tournament 2:
  1 3 6 2 4 5    D = 11750
  2 4 6 1 5 3    D = 12000
  Winner: 1 3 6 2 4 5
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Crossover Operator
For crossover, the “Enhanced Edge Recombination” operator
is used. This involves creating an “Edge Table”.
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Crossover Operator
Parent 1: 1 2 3 4 5 6
Parent 2: 1 3 4 2 5 6
Edge Table (a negative entry marks an edge present in both parents)
City  Connected cities
1     2 -6 3
2     1 3 4 5
3     2 -4 1
4     -3 5 2
5     4 -6 2
6     -5 -1
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
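The edge table for the two parents above can be built as sketched below. The function name build_edge_table is hypothetical, tours are treated as circular, and entries are listed in sorted order, so the order within a row may differ from the slide.

```python
def build_edge_table(parent1, parent2):
    """Map each city to its neighbours in either parent tour.

    A neighbour that is adjacent in both parents is stored as a negative number.
    """
    def neighbours(tour):
        n = len(tour)
        # tour[i - 1] wraps around for i == 0, so the tour is circular.
        return {city: {tour[i - 1], tour[(i + 1) % n]}
                for i, city in enumerate(tour)}

    adj1, adj2 = neighbours(parent1), neighbours(parent2)
    table = {}
    for city in parent1:
        common = adj1[city] & adj2[city]
        table[city] = [-other if other in common else other
                       for other in sorted(adj1[city] | adj2[city])]
    return table

table = build_edge_table([1, 2, 3, 4, 5, 6], [1, 3, 4, 2, 5, 6])
for city, edges in table.items():
    print(city, edges)
```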
Enhanced Edge Recombination Algorithm
1. Choose the initial city from one of the two parent tours. This is
the current city.
Partial tour after Step 1: 1; after Step 2: 1 6
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Example (contd.)
City  Step 3    Step 4
1     2 3       2 3
2     3 4 5     3 4
3     2 -4      2 -4
4     -3 5 2    -3 2
5     4 2       4 2
6     -5        (empty)
Tour  1 6 5     1 6 5 4
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
Example (contd.)
City  Step 5      Step 6
1     2 3         2
2     3           (empty)
3     2           2
4     -3 2        2
5     2           2
6     (empty)     (empty)
Tour  1 6 5 4 3   1 6 5 4 3 2
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
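The walk through the edge table in the example can be sketched as below. Only the first step of the algorithm survives in this text, so the rules used here are assumptions based on the commonly described form of the operator: prefer an edge shared by both parents, otherwise move to the neighbour with the fewest remaining edges, break ties at random, and pick any unvisited city if the current city has no edges left. The function names are hypothetical, and tours are assumed to have at least 3 cities.

```python
import random

def edge_table(p1, p2):
    """Map each city to {neighbour: True if the edge is in both parents}."""
    table = {city: {} for city in p1}
    for tour in (p1, p2):
        n = len(tour)
        for i, city in enumerate(tour):
            for other in (tour[i - 1], tour[(i + 1) % n]):
                # Already present means the first parent had this edge too.
                table[city][other] = other in table[city]
    return table

def enhanced_edge_recombination(p1, p2, start=None, rng=None):
    rng = rng or random.Random()
    table = edge_table(p1, p2)
    current = p1[0] if start is None else start
    child = [current]
    while len(child) < len(p1):
        # Remove the current city from every edge list.
        for edges in table.values():
            edges.pop(current, None)
        candidates = table[current]
        common = [c for c, shared in candidates.items() if shared]
        if common:
            pool = common  # edges shared by both parents have priority
        elif candidates:
            fewest = min(len(table[c]) for c in candidates)
            pool = [c for c in candidates if len(table[c]) == fewest]
        else:
            pool = [c for c in p1 if c not in child]  # dead end: any unvisited city
        current = rng.choice(pool)
        child.append(current)
    return child

print(enhanced_edge_recombination([1, 2, 3, 4, 5, 6], [1, 3, 4, 2, 5, 6]))
```

Starting from city 1, the child always begins 1 6 5. At city 5 the neighbours 4 and 2 both have two remaining edges, so the tie is broken at random; the tour 1 6 5 4 3 2 shown in the example corresponds to choosing city 4 at that point.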
Mutation Operator
The mutation operator induces a change in the
solution, so as to maintain diversity in the
population and prevent Premature Convergence.
We mutate the string by randomly selecting any
two cities and interchanging their positions in the
solution, thus giving rise to a new tour.
Before mutation: 1 2 3 4 5 6
After mutation (cities 2 and 5 interchanged): 1 5 3 4 2 6
Citation “D. Whitley, et al , ‘Traveling Salesman And Sequence Scheduling: Quality Solutions
Using Genetic Edge Recombination’, Handbook Of Genetic Algorithms, New York ”
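The swap mutation described above can be sketched as follows; the function name swap_mutation is an assumption.

```python
import random

def swap_mutation(tour, rng=None):
    """Return a copy of the tour with two randomly chosen positions swapped."""
    rng = rng or random.Random()
    i, j = rng.sample(range(len(tour)), 2)  # two distinct positions
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Because the operator only exchanges two entries, the result is always a valid tour that visits every city exactly once.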
TSP Example: 30 Cities
[Figures: tours over the 30 cities plotted on x (0 to 100) and y (0 to 120) axes]
[Chart: tour distance (0 to 1800) versus Generations (1000), with Best, Worst and Average series]
Bio-mimetic Invention
GA programmers work on applications that
not only analyze natural designs to learn
how they work, but can also combine
natural designs to create something
entirely new that can have exciting
applications.
Reference:
http://brainz.org/15-real-world-applications-genetic-algorithms/
OTHER APPLICATIONS (contd.)
Robotics
Generally a robot's design is dependent on
the job it is intended to do, so there are
many different designs possible.
Reference:
http://brainz.org/15-real-world-applications-genetic-algorithms/
OTHER APPLICATIONS (contd.)
Telecommunications Routing
GAs can be used to identify
optimized routes within
telecommunications networks.
These could take notice of your
system's instability and anticipate
your re-routing needs.
GAs are being developed to
optimize placement and routing of
cell towers for best coverage and
ease of switching
Reference:
http://brainz.org/15-real-world-applications-genetic-algorithms/
OTHER APPLICATIONS (contd.)
Finance and Investment Strategies
GAs can be used to explore different
parts of the search space and produce
solutions which potentially capture
different patterns in the data
GAs can also be used for noise filtering
and achieve enhanced pattern
detection for improving the overall
learning accuracy
Reference:
http://brainz.org/15-real-world-applications-genetic-algorithms/
Advantages & Disadvantages
Advantages:
Modular, separate from application
Good for “noisy” environments
Answer gets better with time
Inherently parallel; easily distributed
Disadvantages:
Choosing basic implementation issues:
representation
population size, mutation rate, ...
selection, deletion policies
crossover, mutation operators
Performance, scalability
Solution is only as good as the evaluation function (often hardest part)
Citation “Wendy Williams Metaheuristic Algorithms, Genetic Algorithms a Tutorial”
Conclusion
GAs have been applied to a variety of function
optimization problems and have been shown to be
highly effective in searching a large, poorly defined
search space even in the presence of difficulties
such as high-dimensionality, multi-modality,
discontinuity and noise.
Thank You!!
Watch Demo :
1. Evolution of Mona Lisa: http://www.youtube.com/watch?v=S1ZPSbImvFE
2. Development of chromosomes : http://obitko.com/tutorials/genetic-algorithms/example-function-minimum.php
Ian W. Flockhart and Nicholas J. Radcliffe
{iwf,njr}@quadstone.co.uk
Presented by
Gaurav Naigaonkar
Kumaran Shanmugam
Rucha Lale
Paper References
S. Augier, G. Venturini, and Y. Kodratoff, 1995. Learning first order logic rules with a genetic algorithm. In Usama M.
Fayyad and Ramasamy Uthurusamy, editors, Proceedings of the First International Conference on Knowledge
Discovery and Data Mining. AAAI Press.
Kenneth A. DeJong, William M. Spears, and Diana F. Gordon, 1993. Using genetic algorithms for concept learning.
Machine Learning, 13:161-188.
Ian W. Flockhart and Nicholas J. Radcliffe, 1995. GA-MINER: Parallel data mining with hierarchical genetic algorithms.
Technical Report EPCC-AIKMS-GA-MINER-REPORT, Edinburgh Parallel Computing Centre.
William J. Frawley, 1991. Using functions to encode domain and contextual knowledge in statistical induction. In
Gregory Piatetsky-Shapiro and William J. Frawley, editors, Knowledge Discovery in Databases, pages 261-275. MIT
Press.
Attilio Giordana, Filippo Neri, and Lorenza Saitta, 1994. Search-intensive concept induction. Technical report,
Università di Torino, Dipartimento di Informatica, Corso Svizzera 185, 10149 Torino, Italy.
David Perry Greene and Stephen F. Smith, 1993. Competition-based induction of decision models from examples.
Machine Learning, 13:229-257.
David Perry Greene and Stephen F. Smith, 1994. Using coverage as a model building constraint in learning classifier
systems. Evolutionary Computation, 2(1).
John H. Holland, 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press (Ann Arbor).
John H. Holland, 1986. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel
rule-based systems. Machine Learning, an artificial intelligence approach, 2.
Types of Data Mining
Undirected Data Mining (Pure Data Mining)
The kind of rule which is expected is not specified
Maximum freedom to identify patterns
e.g. “Tell me something interesting about my data”
Directed Data Mining
User asks for more specific information
Constraints imposed on the system
e.g. “Characterize my high spending customers”
Hypothesis refinement
User specifies a hypothesis
System evaluates the hypothesis and refines it, if needed.
e.g. “I think that there is a positive correlation between sales of
peaches and sales of cream: am I right?”
Objective
Reproductive Plan Language
GA-MINER is implemented using Reproductive Plan
Language [RPL2]
For Stochastic Search Algorithms with special
emphasis on evolutionary algorithms such as GA
Features pertinent to GA-MINER
Automatic parallelism
Arbitrary representations ( important for rules )
Large library of functions
Pattern Representation
Patterns represented using subset descriptions
Subset descriptions
Clauses used to select subsets of databases
Units used by Genetic Algorithm
Consist of a disjunction of conjunctions of attribute value
or attribute range constraints
Subset Description =: Clause {or Clause}
Clause =: Term {and Term}
Term =: Attribute in Value Set
| Attribute in Range
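One way to hold a subset description in memory and test it against a database record is sketched below. The class and function names are hypothetical, records are assumed to be dictionaries, and ranges are assumed to be inclusive at both ends; none of this is taken from the GA-MINER implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValueTerm:
    """Term of the form: Attribute in Value Set."""
    attribute: str
    values: frozenset

    def matches(self, record):
        return record.get(self.attribute) in self.values

@dataclass(frozen=True)
class RangeTerm:
    """Term of the form: Attribute in Range (inclusive bounds assumed)."""
    attribute: str
    low: float
    high: float

    def matches(self, record):
        value = record.get(self.attribute)
        return value is not None and self.low <= value <= self.high

def clause_matches(clause, record):
    # A clause is a conjunction: every term must hold.
    return all(term.matches(record) for term in clause)

def subset_matches(description, record):
    # A subset description is a disjunction: any clause may hold.
    return any(clause_matches(clause, record) for clause in description)

# (Sex = M and Age = 0 .. 25) or (Age = 20 .. 30)
description = [
    [ValueTerm("Sex", frozenset({"M"})), RangeTerm("Age", 0, 25)],
    [RangeTerm("Age", 20, 30)],
]
print(subset_matches(description, {"Sex": "F", "Age": 22}))  # True
print(subset_matches(description, {"Sex": "F", "Age": 40}))  # False
```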
Pattern Representation (contd.)
Patterns supported by GA-MINER
Explicit Rule Patterns
e.g. if C then P
where C and P are subset descriptions representing the condition and
prediction respectively
Correlation Patterns
e.g. when C, the variables A and B are correlated
where A and B are hypothesis variables and C is a subset description
Pattern Templates
Used to constrain the system to particular forms of patterns i.e.,
to allow certain features of the pattern to be restricted
Pattern Templates (contd.)
Undirected mining
Performed with a minimal template
Directed mining
Performed by restricting the pattern
Hypothesis refinement
Performed by seeding the initial population with patterns
based on the template, plus random components
The search can then modify these patterns
The Genetic Algorithm in GA-MINER
Structured population model
Reproductive partners are selected from the same neighborhood
to improve diversity
Also to identify several patterns in a single run
Crossover is performed at the different levels
Subset description
Clause – both uniform and single point crossover
Term – uniform crossover
Mutation also done at various levels
With separate probability for mutating each of the component parts
The population is updated using a heuristic such as replacing the
least-fit individual
Crossover
Subset Description Crossover
Each clause in the first parent is crossed with the clause in the
corresponding position in the second parent.
For each clause, we use uniform clause crossover with
probability rPUCross and single-point crossover with
probability (1–rPUCross).
e.g. A : Clause A1 or Clause A2 or Clause A3 or Clause A4
B : Clause B1 or Clause B2
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
Crossover (contd.)
Uniform Clause Crossover
performs an “alignment” of terms between the two
parent clauses
terms concerning the same variable will be crossed with
each other
e.g. A : Age = 20 .. 30
B : Sex = M and Age = 0 .. 25
After alignment, we get
A :             Age = 20 .. 30
B : Sex = M and Age = 0 .. 25
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
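The alignment step can be illustrated with a rough sketch in which a clause is a dictionary from attribute name to its constraint. How GA-MINER treats terms that have no partner in the other parent is not stated here, so this sketch assumes they are kept with probability keep_prob; choosing one of the two aligned terms stands in for the separate term crossover. All names are hypothetical.

```python
import random

def uniform_clause_crossover(clause_a, clause_b, keep_prob=0.5, rng=None):
    """Cross two clauses after aligning their terms by attribute name."""
    rng = rng or random.Random()
    child = {}
    for attribute in sorted(set(clause_a) | set(clause_b)):
        in_a, in_b = attribute in clause_a, attribute in clause_b
        if in_a and in_b:
            # Aligned terms: take the constraint from either parent.
            child[attribute] = rng.choice([clause_a[attribute],
                                           clause_b[attribute]])
        elif rng.random() < keep_prob:
            # Unaligned term: kept with probability keep_prob (an assumption).
            child[attribute] = clause_a[attribute] if in_a else clause_b[attribute]
    return child

a = {"Age": (20, 30)}
b = {"Sex": "M", "Age": (0, 25)}
print(uniform_clause_crossover(a, b, rng=random.Random(0)))
```

For the two clauses above, the child always has an Age term taken from A or B, and may or may not inherit Sex = M.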
Crossover (contd.)
Single-Point Clause Crossover
Alignment of clauses
Selection of crossover point
e.g.
A : Age = 20 .. 30 and Height = 1.5 .. 2.0
B : Sex = M and Age = 0 .. 25
After alignment, we get
A :             Age = 20 .. 30 and Height = 1.5 .. 2.0
B : Sex = M and Age = 0 .. 25
Result: C : Age = 0 .. 25
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
Crossover (contd.)
Term Crossover
to combine two terms concerning the same variable
Crossover of value and range terms is handled
differently.
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
GA-Miner GUI
http://citeseer.ist.psu.edu/cache/papers/cs/3487/http:zSzzSzwww.quadstone.co.ukzSz~ianzSzaikmszSzreport.pdf/flockhart95gaminer.pdf
Examples of discovered patterns
Explicit Rule Pattern
Examples of discovered patterns (contd.)
Distribution shift Pattern
Conclusion
GAs can be applied to pattern discovery, in addition
to classification and concept learning
THANK YOU!