
UNIT 4: POPULAR EVOLUTIONARY ALGORITHM VARIANTS
Ami Munshi
Syllabus
Course Outcomes
References
Paradigms of Evolutionary
Computing/Evolutionary Algorithm

Evolutionary Computing/Algorithm comprises five major paradigms:
 Genetic Algorithm
 Evolutionary Strategy
 Evolutionary Programming
 Genetic Programming
 Differential Evolution
Genetic Algorithm-A Recap
Genetic Algorithm-Recap
 Simple Genetic Algorithm has
 a binary representation
 fitness proportionate selection
 a low probability of mutation
 and an emphasis on genetically inspired recombination (crossover) as a means of generating new
candidate solutions

Ref: A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Comparing GA and GP
 Three important differences exist between GAs and GP:
 Structure: GP usually evolves tree structures, while GAs evolve binary or real-number strings
 Active vs passive: Because GP usually evolves computer programs, the solutions can be executed without post-processing (active structures), while GAs typically operate on coded binary strings (passive structures), which require post-processing
 Variable vs fixed length: In traditional GAs, the length of the binary string is fixed before the solution procedure begins
 A GP parse tree, however, can vary in length throughout the run, although it is recognized that in more advanced GA work variable-length strings are also used

Ref: S.N.Sivanandam, S.N.Deepa, Introduction to Genetics Algorithm, Springer


Genetic Programming
Role of AI to search the solution space
 Fundamental role of AI is to search the solution space for the most optimal solution
 There are three primary types of search
 Blind search
 Hill climbing and
 Beam search
 GP is classified as a beam search
 Beam search
 Maintains a population of solutions that is smaller than all of the available solutions
 Beam search is a heuristic search algorithm that explores a graph by expanding the most
promising node in a limited set
 Beam search is an optimization of best-first search that reduces its memory requirements
Genetic Programming
 Genetic programming also differs from all other approaches to artificial intelligence, machine
learning, neural networks, adaptive systems, reinforcement learning, or automated logic in all
(or most) of the following seven ways
 Representation: Genetic programming overtly conducts its search for a solution to the given problem in
program space
 Role of point-to-point transformations in the search: Genetic programming does not conduct its search
by transforming a single point in the search space into another single point, but instead transforms a set
of points into another set of points
 Role of hill climbing in the search: Genetic programming does not rely exclusively on greedy hill
climbing to conduct its search, but instead allocates a certain number of trials, in a principled way, to
choices that are known to be inferior
 Role of determinism in the search: Genetic programming conducts its search probabilistically
 Role of an explicit knowledge base: None.
 Role of formal logic in the search: None.
 Underpinnings of the technique: Biologically inspired.
Genetic Programming
 Central Idea of AI, ML, DL
 Goal of having computers automatically solve problems
 Machine learning pioneer Arthur Samuel, in his 1983
talk entitled ‘AI: Where It Has Been and Where It Is
Going’
 “to get machines to exhibit behavior, which if done
by humans, would be assumed to involve the use of
intelligence.”
Genetic Programming
 Genetic programming (GP) is an evolutionary
computation (EC) technique
 that automatically solves problems
 without having to tell the computer explicitly how to do it

 At the most abstract level GP is


 a systematic, domain-independent method for getting
computers to automatically solve problems starting from a
high-level statement of what needs to be done
GP Algorithm
GP in a Nutshell

William B. Langdon, Riccardo Poli, Nicholas F. McPhee, and John R. Koza, Genetic Programming: An Introduction and Tutorial, with a Survey of Techniques and Applications
GP in a Nutshell
 GP is a special evolutionary algorithm (EA) where the individuals in the population are computer programs
 Generation by generation, GP iteratively transforms populations of programs into other populations of programs
 During the process, GP constructs new programs by applying genetic operations which are specialized to act on computer programs
 Algorithmically, GP comprises the steps shown in Algorithm 1. The main genetic operations involved in GP (line 5 of Algorithm 1) are the following (a minimal sketch of this loop is given below):
 Crossover: the creation of one or two offspring programs by recombining randomly chosen parts from two selected programs
 Mutation: the creation of one new offspring program by randomly altering a randomly chosen part of one selected program
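As a concrete illustration, the following minimal Python sketch mirrors this generate-evaluate-vary loop. It is not Koza's Algorithm 1 itself; the callables init_population, evaluate, select, crossover and mutate are placeholders the caller supplies, and lower fitness is assumed to mean smaller error.

```python
import random

def run_gp(init_population, evaluate, select, crossover, mutate,
           p_crossover=0.9, max_generations=50, target_error=0.01):
    """Skeleton of a GP run: evaluate, select, vary, repeat."""
    population = init_population()
    best_prog, best_err = None, float("inf")
    for _ in range(max_generations):
        errors = [evaluate(p) for p in population]
        for prog, err in zip(population, errors):
            if err < best_err:
                best_prog, best_err = prog, err
        if best_err <= target_error:           # termination criterion
            break
        offspring = []
        while len(offspring) < len(population):
            if random.random() < p_crossover:  # GP applies crossover OR mutation
                a = select(population, errors)
                b = select(population, errors)
                offspring.extend(crossover(a, b))
            else:
                offspring.append(mutate(select(population, errors)))
        population = offspring[:len(population)]
    return best_prog, best_err
```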
GP in a Nutshell
 Some GP systems support structured solutions
 These may include architecture-altering operations, which randomly alter the architecture (for example, the number of subroutines) of a program to create a new offspring program
 In addition to crossover, mutation and the architecture-altering operations, an operation called reproduction, which simply copies selected individuals into the next generation, is used
 It is typically applied only to produce a fraction of the new generation
Idea of Genetic Programming
 To obtain the structure of a computer program that solves a given problem by means of an evolutionary process
Problems where GP is used
 Program will
 take ant sensor value
as input
 Produce moving and
turning actions as
output

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Problems where GP is used
 Program will
 Cartesian coordinates
as input
 Classification as blue
or red as output

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Problems where GP is used
 Program will
 take location of the
destination as input
and
 Produce appropriate
motion of robot arm
as output
Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo
Problems where GP is used
 Program will
 Take value of
independent variable
as input
 Produces the value of
dependent variable
as output
Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo
GP Programming Outline
 Representation, Initialization and operators in GP
 Representation, Initialization of population, selection,
recombination, mutation
 Operators in Tree-Based GP
 Run GP System
 Terminal set, function set, fitness measure, parameters for
controlling, termination criteria
 Example of GP
GP programming outline
 Representation
 Tree Structure
 Recombination
 Exchange of sub trees
 Mutation
 Random change in tree
 Parent selection
 Fitness Proportionate
 Survival Selection
 Generational replacement
Example: Credit Scoring
 Bank wants to distinguish good from bad loan
applicants
 Model needed that matches historical data

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Example: Credit Scoring
 A possible model:
IF (NOC = 2) AND (S > 80000) THEN good ELSE bad
 In general:
IF formula THEN good ELSE bad
 Only unknown is the right formula, hence
 Our search space (phenotypes) is the set of formulas
 Natural fitness of a formula: percentage of well classified cases of
the model it stands for
 Natural representation of formulas (genotypes) is: parse trees

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Example: Credit Scoring
IF (NOC = 2) AND (S > 80000) THEN good ELSE bad
can be represented by the following tree

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Tree based representation
 Trees are a universal form, e.g. consider
 Arithmetic formula: 2 · π + ((x + 3) − y/(5 + 1))

 Logical formula: (x ∧ true) → ((x ∨ y) ∨ (z ↔ (x ∧ y)))

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Representation in GP
 GP programs are usually expressed as syntax
trees rather than as lines of code
 Figure shows, for example, the tree
representation of the program
 max(x*x,x+3*y)
 Note
 variables and constants in the program (x, y,
and 3), called terminals in GP, are leaves of the
tree
 while the arithmetic operations (+, *, and max)
are internal nodes (typically called functions in
the GP literature)
 The sets of allowed functions and terminals
together form the primitive set of a GP system
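One lightweight way to realize such syntax trees (a sketch, not a prescribed GP encoding) is nested Python tuples, with functions as internal nodes and terminals as leaves; the FUNCTIONS table and evaluate helper below are illustrative names of our own.

```python
# Internal nodes are (function_name, child, child); leaves are terminals.
# This encodes max(x*x, x+3*y) from the slide.
tree = ("max",
        ("*", "x", "x"),
        ("+", "x", ("*", 3, "y")))

FUNCTIONS = {"+": lambda a, b: a + b,
             "*": lambda a, b: a * b,
             "max": max}

def evaluate(node, env):
    """Recursively execute a program tree for a terminal binding `env`."""
    if isinstance(node, tuple):                  # function (internal) node
        op, *children = node
        return FUNCTIONS[op](*(evaluate(c, env) for c in children))
    if isinstance(node, str):                    # variable terminal
        return env[node]
    return node                                  # constant terminal

print(evaluate(tree, {"x": 2.0, "y": 1.0}))      # max(4.0, 5.0) -> 5.0
```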
Multi-component program representation
 In more advanced forms of GP, programs
can be composed of multiple components
(say, subroutines)
 In this case the representation used in GP is
a set of trees (one for each component)
grouped together under a special root
node that acts as glue, as illustrated in Fig
 We will call these (sub)trees branches
 The number and type of the branches in a
program, together with certain other
features of the structure of the branches,
form the architecture of the program.
Tree based representation

2 · π + ((x + 3) − y/(5 + 1))

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Tree based representation

(x ∧ true) → ((x ∨ y) ∨ (z ↔ (x ∧ y)))
Tree based representation
 In GA, ES, EP chromosomes are linear structures
(bit strings, integer string, real-valued vectors,
permutations)
 Tree shaped chromosomes are non-linear
structures
 In GA, ES, EP the size of the chromosomes is
fixed
 Trees in GP may vary in depth and width
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Tree based representation
 Symbolic expressions can be defined by
 Terminal set T
 Function set F (with the arities of function symbols)

https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_6614.pdf
Tree based representation

 For the expression (x*x) + x
 Function set is {*, +}
 Terminal set is {x}

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Initializing the population
 Similar to other evolutionary algorithms, in GP the
individuals in the initial population are randomly generated
 There are a number of different approaches to generating
this random initial population
 Here we will describe two of the simplest (and earliest)
methods
 the Full and Grow methods
 and a widely used combination of the two known as Ramped
half-and-half
Initializing the population
 In both the Full and Grow methods, the initial individuals
are generated subject to a pre-established maximum
depth
 In the Full method (so named because it generates full
trees)
 nodes are taken at random from the function set until this
maximum tree depth is reached
 and beyond that depth only terminals can be chosen
Initializing the population
 Full initialization method
 Figure shows snapshots of this
process in the construction of
a full tree of depth 2
 The children of the * node, for example, must be leaves, or the resulting tree would be too deep
 Thus at time t = 3 and time t = 4 terminals must be chosen (x and y in this case)
 The Full method generates trees of a specific size and shape
Function set = {*, +, /, −}; Terminal set = {x, y, 1, 0}
Initializing the population
 Grow initialization method
 Grow method allows for the creation of trees
of varying size and shape
 Here nodes are selected from the whole
primitive set (functions and terminals) until the
depth limit is reached, below which only
terminals may be chosen
 Figure illustrates this process for the
construction of a tree with depth limit 2
 Here the first child of the root + node
happens to be a terminal, thus closing off
that branch before actually reaching the
depth limit
 The other child, however, is a function (-), but its children are forced to be terminals to ensure that the resulting tree does not exceed the depth limit
Function set = {*, +, /, −}; Terminal set = {x, y, 2}
Initializing the population
 Because neither the Grow nor the Full method provides a very wide array of sizes or shapes on its own, Koza proposed a combination called ramped half-and-half
 Here half the initial population is constructed using Full
and half is constructed using Grow
 This is done using a range of depth limits (hence the
term ‘ramped’) to help ensure that we generate trees
having a variety of sizes and shapes
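A minimal sketch of Full, Grow and ramped half-and-half initialization, assuming the tuple-tree encoding from the earlier sketch; FUNCS, TERMS and the function-vs-terminal probability inside Grow are illustrative assumptions, not prescribed values.

```python
import random

FUNCS = {"+": 2, "-": 2, "*": 2, "/": 2}   # function set with arities
TERMS = ["x", "y", 1, 0]                   # terminal set

def gen_tree(depth_limit, method):
    """Full: only functions until the depth limit; Grow: sample from the
    whole primitive set, so branches may close early with a terminal."""
    if depth_limit == 0:
        return random.choice(TERMS)
    p_func = len(FUNCS) / (len(FUNCS) + len(TERMS))
    if method == "full" or random.random() < p_func:
        op = random.choice(list(FUNCS))
        return tuple([op] + [gen_tree(depth_limit - 1, method)
                             for _ in range(FUNCS[op])])
    return random.choice(TERMS)            # Grow closed this branch early

def ramped_half_and_half(pop_size, max_depth):
    """Half Full, half Grow, over a range ('ramp') of depth limits."""
    pop = []
    for i in range(pop_size):
        depth = 2 + i % (max_depth - 1)    # ramp the depth limits 2..max_depth
        method = "full" if i % 2 == 0 else "grow"
        pop.append(gen_tree(depth, method))
    return pop

print(ramped_half_and_half(4, 3))
```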
Selection
 Parent selection typically fitness proportionate
 Survivor selection:
 Typical: generational scheme (thus none)
 Recently steady-state is becoming popular for its elitism
Mutation
 Most common mutation: replace randomly chosen
subtree by randomly generated tree
Mutation
 Mutation has two parameters:
 Probability pm to choose mutation vs. recombination
 Probability to choose an internal point as the root of the subtree to be replaced
 Remarkably, pm is advised to be 0 (Koza '92) or very small, like 0.05 (Banzhaf et al. '98)
 The size of the child can exceed the size of the parent (see the sketch below)
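A sketch of subtree mutation over the tuple trees used above; subtree_points, replace_at and the random_tree parameter (for instance, the gen_tree helper from the initialization sketch) are illustrative names of our own, not part of any standard API.

```python
import random

def subtree_points(tree, path=()):
    """Enumerate paths to all nodes of a tuple-encoded tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_points(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def subtree_mutation(tree, random_tree, max_depth=2):
    """Replace a randomly chosen subtree with a freshly grown one;
    `random_tree(depth)` is assumed to be a generator like gen_tree above."""
    point = random.choice(list(subtree_points(tree)))
    return replace_at(tree, point, random_tree(max_depth))
```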
One more example of subtree mutation
Recombination
 Most common recombination
 exchange two randomly chosen subtrees among the parents
 Recombination has two parameters:
 Probability pc to choose recombination vs. mutation
 Probability to choose an internal point within each parent as crossover point
 The size of offspring can exceed that of the parents
(Figure: Parent 1 and Parent 2 exchange subtrees, producing Child 1 and Child 2)
One more example of subtree crossover

William B. Langdon, Riccardo Poli, Nicholas F. McPhee, and John R. Koza, Genetic Programming: An Introduction and Tutorial, with a Survey of Techniques and Applications
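The matching sketch for subtree crossover, reusing the subtree_points and replace_at helpers from the mutation sketch above; as before, the helper names are illustrative.

```python
import random

def subtree_at(tree, path):
    """Fetch the subtree of a tuple-encoded tree at `path`."""
    for i in path:
        tree = tree[i]
    return tree

def subtree_crossover(parent1, parent2):
    """Swap two randomly chosen subtrees between the parents."""
    p1 = random.choice(list(subtree_points(parent1)))
    p2 = random.choice(list(subtree_points(parent2)))
    child1 = replace_at(parent1, p1, subtree_at(parent2, p2))
    child2 = replace_at(parent2, p2, subtree_at(parent1, p1))
    return child1, child2
```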
GA flowchart vs. GP flowchart
Compare GA and GP
 Offspring creation scheme
 GA scheme using crossover AND mutation sequentially
(be it probabilistically)
 GP scheme using crossover OR mutation (chosen
probabilistically)
 Randomly create an initial
population (generation 0) of
individual computer programs
composed of the available
functions and terminals
 Iteratively perform the following
sub-steps (called a generation)
on the population until the
termination criterion is satisfied:
 Execute each program in the
population and ascertain its
fitness (explicitly or implicitly)
using the problem’s fitness
measure.
 Select one or two individual
program(s) from the population
with a probability based on
fitness (with reselection allowed)
to participate in the genetic
operations
 Create new individual program(s)
for the population by applying the
following genetic operations with
specified probabilities:
 Reproduction: Copy the selected
individual program to the new
population.
 Crossover: Create new offspring
program(s) for the new population by
recombining randomly chosen parts
from two selected programs
 Mutation: Create one new offspring
program for the new population by
randomly mutating a randomly chosen
part of one selected program
 Architecture-altering operations:
Choose an architecture altering
operation from the available
repertoire of such operations and
create one new offspring program for
the new population by applying the
chosen architecture-altering operation
to one selected program
 After the termination
criterion is satisfied, the
single best program in
the population
produced during the run
(the best-so-far
individual) is harvested
and designated as the
result of the run
 If the run is successful,
the result may be a
solution (or approximate
solution) to the problem
Getting ready to run genetic program
Five Preparatory Steps
 Step 1: Identify the terminal set
 They are the inputs to the computer programs that make up the population

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Five Preparatory Steps
 Step 2: Identify the
function set
 Here the function set
consists of arithmetic
operations like
 Addition, Subtraction,
Multiplication, Division

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Examples of function and terminal sets
Closure property
 For GP to work effectively, most function sets are
required to have an important property known as
closure [188], which can in turn be broken down into
the properties of type consistency and evaluation
safety
Five Preparatory Steps
 Step 3: Identify the
fitness measure
 For the given example of regression, the fitness can be in terms of error

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Five Preparatory Steps
 Step 4: Control Parameters
 Example
 Population size (M)
 Number of generations (n)

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Five Preparatory Steps
 Step 5: Termination
criteria and result
declaration
 Example:
 We may terminate
when the result is at a
distance of 0.01 from
the actual target
 Or error is within
certain upper limit

Genetic Programming-Movie Part 1, https://www.youtube.com/watch?v=tTMpKrKkYXo


Five Preparatory Steps
Example of a genetic program run
Parameters and their values:
 Objective: to automatically create a computer program whose output is equal to the values of the quadratic polynomial x² + x + 1 in the range from −1 to +1
 Terminal set: {x, ℝ}, where ℝ denotes constants in the range [−5, 5]
 Function set: {*, +, −, %}
 Population size: 4
 Crossover probability: 90%
 Mutation probability: 1%
 Reproduction probability: 8%
 Architecture-altering probability: 1%
 Selection: fitness proportionate
 Termination criteria: fitness < 0.01
 Maximum number of generations: none
 Initialization method: Grow
 Depth of trees in initial population: 2

John Koza, Riccardo Poli, Genetic Programming, Jan 2005
Example of a genetic program run
 Objective:
 To automatically create a computer program whose output is equal to the values of the quadratic polynomial x² + x + 1 in the range from −1 to +1
 That is, the goal is to automatically create a computer
program that matches certain numerical data
 This process is sometimes called system identification or
symbolic regression
Example of a genetic program run
 Five preparatory steps
 Terminal set
 Function set

 Fitness measure

 Control parameters

 Termination condition
Example of a genetic program run
 Terminal Set
 The purpose of the first two preparatory steps is to
 Specify the ingredients of the to-be-evolved program
 Because the problem is to find a mathematical function of one
independent variable,
 terminal set (inputs to the to-be-evolved program) includes the independent
variable, x
 The terminal set also includes numerical constants
 Terminal set = {x, ℝ}
 where ℝ denotes constant numerical terminals in some range, say [−5, 5]
Example of a genetic program run
 Function set
 The objective of the problem is somewhat flexible in that it does not specify
what functions may be employed
 A reasonable choice is the set of four typical arithmetic operations: {*, +, −, %}
 Note: To avoid run-time errors, the division function % is protected
 It returns a value of 1 when division by 0 is attempted (including 0 divided by 0)
 But otherwise returns the quotient of its two arguments
 Each individual in the population is a composition of functions from the
specified function set and terminals from the specified terminal set
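A direct transcription of this protected division rule into Python; the name protected_div is our own.

```python
def protected_div(a, b):
    """Protected division %: returns 1 on any division by zero
    (including 0/0), otherwise the ordinary quotient."""
    return 1.0 if b == 0 else a / b

print(protected_div(6.0, 3.0))  # 2.0
print(protected_div(4.0, 0.0))  # 1.0 (protected case)
```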
Example of a genetic program run
 Third preparatory step involves constructing the fitness measure
 Purpose of the fitness measure is to specify what the human wants
 High-level goal of this problem is to find a program whose output is equal to the values of the quadratic polynomial x² + x + 1
 Therefore, the fitness assigned to a particular individual in the population for this problem must reflect how closely the output of an individual program comes to the target polynomial x² + x + 1
 Fitness measure could be defined as the value of the integral (taken over values of the independent variable x between −1.0 and +1.0) of the absolute value of the differences (errors) between the value of the individual mathematical expression and the target quadratic polynomial x² + x + 1
 A smaller value of fitness (error) is better
 the integral is numerically approximated using dozens or hundreds of different values of the
independent variable x in the range between −1.0 and +1.0
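A hedged sketch of such a fitness measure, approximating the integral with a midpoint sum over 100 sample points; the sample count and helper name are our choices, not Koza's.

```python
def fitness(program, n_points=100):
    """Approximate the integral of |program(x) - (x**2 + x + 1)|
    over [-1, 1] by a midpoint sum; smaller is better."""
    step = 2.0 / n_points
    xs = [-1.0 + step * (i + 0.5) for i in range(n_points)]
    return sum(abs(program(x) - (x * x + x + 1)) for x in xs) * step

# The first generation-0 individual, x + 1, scores about 0.67:
print(round(fitness(lambda x: x + 1), 2))
```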
Example of a genetic program run
 Fourth preparatory step-Control Parameters
 The population size in this small illustrative example will be just four
 In actual practice
 the crossover operation is commonly performed on about 90% of the individuals in
the population
 the reproduction operation is performed on about 8% of the population
 the mutation operation is performed on about 1% of the population
 and the architecture-altering operations are performed on perhaps 1% of the
population
 Because this illustrative example involves an abnormally small population of only
four individuals
 the crossover operation will be performed on two individuals and the mutation and
reproduction operations will each be performed on one individual
 For simplicity, the architecture-altering operations are not used for this problem
Example of a genetic program run
 Fifth preparatory step
A reasonable termination criterion for this problem is that
the run will continue from generation to generation until the
fitness of some individual gets below 0.01
Example of a genetic program run
 Actual run starts now by randomly creating a
population of size 4

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 First randomly constructed program tree is equivalent
to the mathematical expression 𝑥 + 1
 A program tree is executed in a depth-first way, from
left to right
 Specifically, the addition function (+) is executed with
the variable x and the constant value 1 as its two
arguments
 Then, the two-argument subtraction function (−) is
executed
 Its first argument is the value returned by the just-
executed addition function
 Its second argument is the constant value 0
 The overall result of executing the entire program
tree is thus x + 1

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 Grow initialization method
 The first individual was constructed using the “Grow” method, by first choosing the subtraction function for the root (top point) of the program tree
 The random construction process continued in a depth-first
fashion (from left to right) and chose the addition function to
be the first argument of the subtraction function
 The random construction process then chose the terminal x to
be the first argument of the addition function (thereby
terminating the growth of this path in the program tree)
 The random construction process then chose the constant
terminal 1 as the second argument of the addition function
(thereby terminating the growth along this path)
 Finally, the random construction process chose the constant
terminal 0 as the second argument of the subtraction function
(thereby terminating the entire construction process)

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 Second program adds the constant terminal 1 to the result of multiplying x by x and is equivalent to x² + 1
 The third program adds the constant terminal 2 to the constant terminal 0 and is equivalent to the constant value 2
 The fourth program is equivalent to x
John Koza, Riccardo Poli, Genetic Programming, Jan 2005
Example of a genetic program run
 Note:
 Randomly created computer programs will, of course,
typically be very poor at solving the problem at hand
 However, even in a population of randomly created
programs, some programs are better than others
Example of a genetic program run
 Four random individuals from generation 0 produce outputs that deviate from the output
produced by the target quadratic function by different amounts
 In this particular problem, fitness can be graphically illustrated as the area between two
curves
 That is, fitness is equal to the area between the parabola x² + x + 1 and the curve representing the candidate individual

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 Fig shows (as shaded areas) the integral of the absolute value of the errors between each of the four individuals in generation 0 and the target quadratic function

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 The integral of absolute error for the straight line (the first
individual) is 0.67
 The integral of absolute error for the parabola (the second
individual) is 1.0

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 The integrals of the absolute errors for the remaining
two individuals are 1.67 and 2.67 respectively

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
Equation and fitness of the four generation-0 individuals:
 x + 1 : 0.67
 x² + 1 : 1.0
 2 : 1.67
 x : 2.67
 Straight line x + 1 is closer to the parabola x² + x + 1 in the range from −1 to +1 than any of its three cohorts in the population
 This straight line is, of course, not equivalent to the parabola x² + x + 1
 This best-of-generation individual from generation 0 is not even a quadratic function
 It is merely the best candidate that happened to emerge from the blind random search of generation 0
 In the valley of the blind, the one-eyed man is king

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 After the fitness of each individual in the population is ascertained, genetic
programming then probabilistically selects relatively more fit programs from the
population
 Genetic operations are applied to the selected individuals to create offspring
programs
 The most commonly employed methods for selecting individuals to participate in the
genetic operations are tournament selection and fitness-proportionate selection
 In both methods, the emphasis is on selecting relatively fit individuals
 An important feature common to both methods is that the selection is not greedy
 Individuals that are known to be inferior will be selected to a certain degree
 The best individual in the population is not guaranteed to be selected
 Moreover, the worst individual in the population will not necessarily be excluded
 Anything can happen and nothing is guaranteed
Example of a genetic program run
 Creating Next generation- Reproduction, Mutation,
Crossover, architectural restructuring
 We first perform the reproduction operation
 Because the first individual is the most fit individual in the
population, it is very likely to be selected to participate in a
genetic operation
 Let us suppose that this particular individual is, in fact, selected for
reproduction
 If so, it is copied, without alteration, into the next generation
(generation 1)
Example of a genetic program run
 We next perform the mutation operation (figure: the individual before and after mutation)
 Because selection is probabilistic, it is possible that the third best individual in the population is selected
 One of the three nodes of this individual is then randomly picked as the site for the mutation
 In this example, the constant terminal 2 is picked as the mutation site
 This program is then randomly mutated by deleting the entire subtree rooted at the picked point (in this case, just the constant terminal 2) and inserting a subtree that is randomly grown in the same way that the individuals of the initial random population were originally created
 In this particular instance, the randomly grown subtree computes the quotient of x and x using the protected division operation %
 The resulting individual is shown in Fig
 This particular mutation changes the original individual from one having a constant value of 2 into one having a constant value of 1
 This particular mutation improves fitness from 1.67 to 1.00
Example of a genetic program run
 Finally, we perform the crossover operation
 Because the first and second individuals in generation 0 are both relatively fit, they are likely to be selected to participate in crossover
 However, selection can always pick suboptimal individuals
 So, let us assume that in our first application of crossover the pair of selected parents is composed of the above-average tree (fitness = 0.67) and the below-average tree (fitness = 2.67), as shown
Example of a genetic program run
 One point of the first parent, namely the + function, is randomly picked as the crossover point for the first parent
 One point of the second parent, namely the leftmost terminal x, is randomly picked as the crossover point for the second parent
 The crossover operation is then performed on the two parents
 The offspring is equivalent to x and is not particularly noteworthy
Example of a genetic program run
 Let us now assume that, in our second application of crossover, selection chooses the two most fit individuals as parents, as shown: parent 1 (fitness = 1) and parent 2 (fitness = 0.67)
 Leftmost leaf x in parent 1 is replaced by the + subtree of parent 2 to form the offspring
John Koza, Riccardo Poli, Genetic Programming, Jan 2005
Example of a genetic program run
 One of the offspring is equivalent to x and is not noteworthy
 However, the other offspring is equivalent to x² + x + 1 and has a fitness (integral of absolute errors) of zero
 Because the fitness of this individual is below 0.01, the termination criterion for the run is satisfied and the run is automatically terminated

John Koza, Riccardo Poli, Genetic Programming, Jan 2005


Example of a genetic program run
 Note that the best-of-run individual incorporates
 a good trait (the quadratic term x²) from the second parent with
 two other good traits (the linear term x and the constant term of 1) from the first parent
 The crossover operation produced a solution to this problem by recombining good traits from these two relatively fit parents into a superior (indeed, perfect) offspring
 In summary,
 Genetic programming has, in this example, automatically created a computer program whose output is equal to the values of the quadratic polynomial x² + x + 1 in the range from −1 to +1.
Over selection
 Over-selection is often used to deal with the typically large population sizes
 (population sizes of several thousands are not unusual in GP)
 The method first ranks the population, then divides it into two groups,
 one containing the top x% and
 the other containing the other (100 − x)%
 When parents are selected,
 80% of the selection operations come from the first group, and
 the other 20% from the second group
 The values of x used are found empirically by rule of thumb and depend
on the population size with the aim that the number of individuals from
which the majority of parents are chosen stays constant in the low hundreds,
i.e., the selection pressure increases dramatically for larger populations
Bloat in Genetic Programming
 A special effect of varying chromosome sizes in GP is that these tend to grow during a GP run
 That is, without appropriate countermeasures the average tree size is growing during the search
process
 This phenomenon is known as bloat (sometimes called the "survival of the fattest")
 The simplest way to prevent bloat is to introduce a maximum tree size and forbid a variation
operator if the child(ren) resulting from its application would exceed this maximum size
 In this case, this threshold can be seen as an additional parameter of mutation and recombination in
GP.
 One technique that is widely acknowledged is that of parsimony pressure
 Such a pressure towards parsimony (i.e., being "stingy" or ungenerous) is achieved through
 introducing a penalty term in the fitness formula that reduces the fitness of large chromosomes or using multi-
objective techniques.
Differential Evolution
Differential Evolution
 Belongs to a broader family of Evolutionary Computing Algorithms
 Population based, stochastic, optimization algorithm to solve optimization problems
over continuous domain
 Follows heuristic approach
 Similar to other popular direct search approaches, such as genetic algorithms and
evolution strategies,
 Differential evolution algorithm starts with an initial population of candidate solutions
 These candidate solutions are iteratively improved by
 introducing mutations into the population, and
 retaining the fittest candidate solutions that yield a lower objective function value
 It was conceptualized by Storn and Price in 1996
 Developed to optimise real-parameter, real-valued functions
 Also known as black box (derivative-free) optimization
https://machinelearningmastery.com/differential-evolution-from-scratch-in-python/
Why use Differential Evolution Algorithm?
 Global optimisation is necessary in fields such as engineering,
statistics and finance
 But many practical problems have objective functions that are
 nondifferentiable, non-continuous, non-linear, noisy, flat, multi-
dimensional or have many local minima, constraints or stochasticity
 Such problems are difficult if not impossible to solve analytically
 DE can be used to find approximate solutions to such problems

https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
Why use Differential Evolution Algorithm?
 Differential evolution algorithm is advantageous over direct search approaches and Genetic Algorithms because
 Itcan handle nonlinear and non-differentiable multi-
dimensional objective functions, while requiring very
few control parameters to steer the minimization
 These characteristics make the algorithm easier and
more practical to use.
https://machinelearningmastery.com/differential-evolution-from-scratch-in-python/
Why use Differential Evolution (DE)
Algorithm?
 An optimization method is considered to be practical if it has
 Ability to handle non-differentiable, non-linear and multimodal cost functions
 Ability to parallel process computation-intensive cost functions
 Ease of use: few control variables, which should be robust and easy to select
 Good convergence properties: consistent convergence to the global minimum
 DE fulfills all the above requirements

https://machinelearningmastery.com/differential-evolution-from-scratch-in-python/
Differential Evolution
 DE is an Evolutionary Algorithm
 Hence it follows the steps below

https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm- Notations
 Suppose we want to optimize a function with D real parameters
 Let us select the population size as N (here N has to be a minimum of 4)
 The parameter vectors have the form
 x_{i,G}, i = 1, 2, …, N
 Here G is the generation number
https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm
 Step 1: Initialization
 Define upper and lower bounds for each parameter:
 x_j^L ≤ x_{j,i,1} ≤ x_j^U, where j = 1, 2, …, D
 Randomly select the initial parameter values uniformly on the intervals [x_j^L, x_j^U]

https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm
 Step 2: Mutation
 Each of the N parameter vectors undergoes mutation, recombination and selection
 Mutation expands the search space
 For a given parameter vector x_{i,G}, randomly select three vectors x_{r1,G}, x_{r2,G} and x_{r3,G} such that the indices i, r1, r2, r3 are distinct
 Add the weighted difference of two of the vectors to the third to obtain the noisy vector, also called the donor vector:
 v_{i,G+1} = x_{r1,G} + F (x_{r2,G} − x_{r3,G})
 Here F is called the mutation factor, which ranges over [0, 2]
 v_{i,G+1} is the noisy or donor vector
https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm
 Step 3: Recombination
 Crossover between the successful solution from the previous generation and the donor (noisy) vector is performed to obtain an offspring called the trial vector u_{i,G+1}
 Some crossover rate CR is used to perform the crossover:
 u_{j,i,G+1} = v_{j,i,G+1} if rand_{j,i} ≤ CR or j = I_rand
 u_{j,i,G+1} = x_{j,i,G} if rand_{j,i} > CR and j ≠ I_rand
 Here rand_{j,i} is drawn uniformly from [0, 1], and I_rand is a random integer from [1, 2, …, D]
 I_rand ensures that u_{i,G+1} ≠ x_{i,G}
https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm
 Step 4: Selection
 Target vector x_{i,G} is compared with the trial vector u_{i,G+1}, and the one with the lowest function value is admitted to the next generation:
 x_{i,G+1} = u_{i,G+1} if f(u_{i,G+1}) ≤ f(x_{i,G})
 x_{i,G+1} = x_{i,G} otherwise

https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
DE Algorithm
 Mutation, recombination and selection continue until
some stopping criterion is reached

https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
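Putting the four steps together, here is a compact DE/rand/1/bin sketch in Python; the parameter defaults and function name are our choices, and N must be at least 4 so that three distinct vectors besides the target can be drawn.

```python
import random

def differential_evolution(f, bounds, N=20, F=0.5, CR=0.9, generations=200):
    """DE/rand/1/bin sketch following the four steps above.
    bounds: list of (low, high) per parameter; minimizes f. N >= 4."""
    D = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(N)]
    cost = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(N):
            # Mutation: three distinct vectors, none equal to the target i
            r1, r2, r3 = random.sample([j for j in range(N) if j != i], 3)
            donor = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                     for j in range(D)]
            # Recombination: binomial crossover with a guaranteed donor gene
            j_rand = random.randrange(D)
            trial = [donor[j] if (random.random() <= CR or j == j_rand)
                     else pop[i][j] for j in range(D)]
            # Selection: keep whichever of target/trial has lower cost
            trial_cost = f(trial)
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = min(range(N), key=lambda i: cost[i])
    return pop[best], cost[best]
```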
Performance testing of DE algorithm
 Using Ackley’s function
 In mathematical optimization, the Ackley function is
a non-convex function used as a performance test
problem for optimization algorithms- Wikipedia
Performance testing of DE algorithm
using Ackley’s function
 DE with N = 10, F = 0.5 and CR = 0.1
 Find min f(x1, x2), where f is the two-dimensional Ackley function
 f(x1, x2) = −20 exp(−0.2 √(0.5 (x1² + x2²))) − exp(0.5 (cos 2πx1 + cos 2πx2)) + e + 20
 Should obtain the solution x = (0, 0), where f(0, 0) = 0
https://www.maths.uq.edu.au/MASCOS/Multi-Agent04/Fleetwood.pdf
https://en.wikipedia.org/wiki/Ackley_function
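A small test harness in the spirit of this slide, defining the 2-D Ackley function and feeding it to the differential_evolution sketch above with the slide's N, F and CR; convergence toward (0, 0) is expected but, as with any stochastic search, not guaranteed on every run.

```python
import math

def ackley(x):
    """2-D Ackley function; global minimum f(0, 0) = 0."""
    x1, x2 = x
    return (-20 * math.exp(-0.2 * math.sqrt(0.5 * (x1**2 + x2**2)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x1)
                              + math.cos(2 * math.pi * x2)))
            + math.e + 20)

# Settings from the slide; differential_evolution is the sketch above.
best_x, best_f = differential_evolution(ackley, [(-5, 5), (-5, 5)],
                                        N=10, F=0.5, CR=0.1, generations=300)
print(best_x, best_f)   # should approach (0, 0) and 0
```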
Example run for DE
 A simple numerical example is presented to
illustrate the DE algorithm
 Let us consider the following objective function:
 Minimize f(x) = x1 + x2 + x3
Example run for DE
Example run for DE
Example run for DE
Example run for DE
Example run for DE
Flow diagram of DE
 Initial population
 Target vector
 Two randomly selected vectors
 Build weighted difference vector
 Crossover to obtain child (trial) vector
 Check for fitness: the higher-fitness vector will be inserted in the new generation of the population
Evolutionary Programming
Evolutionary Programming
 Origins
 L. Fogel (1960) – development of methods, inspired by the natural evolution,
which generate automatically systems with some intelligent behavior
 D. Fogel (1990) – in the last two decades the evolutionary programming
became more oriented toward solving problems (optimization and design)
 Particulars
 Representation or encoding variants
 e.g. real vectors, state diagrams, neural networks structures
 Based only on mutation, no recombination
 Current variants of mutation: self-adaptive

Ref: https://staff.fmi.uvt.ro/~daniela.zaharie/ma2015/lectures/metaheuristics2015_slides6.pdf
Evolutionary Programming
 First (traditional) direction
 Evolve systems (e.g. finite state machine) with prediction abilities
 The fitness of such a structure is measured by analyzing the behavior of the system = prediction
abilities
 The fitness is a quality measure related to the behaviour of the system
 Second (current) direction
 It is related to optimization methods similar to evolution strategies
 there is only a mutation operator (no recombination)
 the mutation is based on random perturbation of the current configuration (x’=x+N(0,s))
 s is inversely correlated with the fitness value (high fitness leads to small s, low fitness leads to large values
for s)
 starting from a population with m elements, m children are constructed by mutation, and the survivors are selected from the 2m elements by tournament or by truncation
 There are self-adaptive variants, called MetaEP; these variants are similar to self-adaptive Evolution
Strategies

Ref: https://staff.fmi.uvt.ro/~daniela.zaharie/ma2015/lectures/metaheuristics2015_slides6.pdf
Evolutionary Programming
 Evolutionary programming was invented by Dr. Lawrence J. Fogel in 1960
 Dr. Fogel refrained from modeling the end product of evolution
 Rather, he considered modeling the process of evolution itself as a vehicle for producing intelligent behavior
 Dr. Fogel crafted a series of experiments in which finite state machines
(FSMs) represented individual organisms in a population of problem solvers
 These graphical models are used to describe the behavior of computer software and hardware, which is why he termed his approach "Evolutionary Programming"

http://www.scholarpedia.org/article/Evolutionary_programming
Evolutionary Programming
 The experimental procedure was as follows
 A population of FSMs is exposed to the environment – that is, the sequence of symbols that has been
observed up to the current time
 For each parent machine, as each input symbol is presented to the machine, the corresponding output
symbol is compared with the next input symbol
 The worth of this prediction is then measured with respect to the payoff function (e.g., all-none, squared
error)
 After the last prediction is made, a function of the payoff for the sequence of symbols (e.g., average
payoff per symbol) indicates the fitness of the machine or program
 Offspring machines are created by randomly mutating the parents and are scored in a similar manner
 Those machines that provide the greatest payoff are retained to become parents of the next
generation, and the process iterates
 When new symbols are to be predicted, the best available machine serves as the basis for making such
a prediction and the new observation is added to the available database
 Fogel described this process as “evolutionary programming” in contrast to “heuristic programming”

http://www.scholarpedia.org/article/Evolutionary_programming
Overview of Evolutionary Programming
 Developed: USA in the 1960s
 Early names: L. Fogel, D. Fogel
 Typically applied to:
 Traditional EP: machine learning tasks executed by finite state machines
 Contemporary EP: (numerical) optimization
 Attributed features:
 Very open framework: any representation and mutation options OK
 Crossbred with ES (contemporary EP)
 Consequently: hard to determine a “standard” EP
 Special:
 No recombination
 Self-adaptation of parameters standard (contemporary EP)
http://www.scholarpedia.org/article/Evolutionary_programming
https://slideplayer.com/slide/15979842/
Finite State Machine

https://www.cs.mtsu.edu/~xyang/3080/fsm_recognition.html
Finite State Machine

https://www.cs.mtsu.edu/~xyang/3080/fsm_recognition.html
Traditional Direction: Prediction using
Finite State Machine
 Finite State Machine
 States S
 Input I
 Output O
 Transition function (also known as state transition)
 δ: S × I → S
 Output function
 Λ: S × I → O
 Transforms input stream to output Stream
 Can be used for prediction
 Example predict next symbol in the sequence

Ref: https://slideplayer.com/slide/15979842/
Prediction using Finite State Machine
 Finite State Machine
 States S : {A,B,C}
 Input I : {1,0}
 Output O: {1,0}
 Transition function (also known as state transition)
 δ: S × I → S
 δ as given in the diagram
 Example: If the present state is A and the input is 0, then the next state is B
 Output function
 Λ: S × I → O
 Example: If the present state is A and the input is 0, then the output is 1
 Transforms input stream to output Stream
Ref: https://slideplayer.com/slide/15979842/
Example: FSM as predictor
 Consider the following FSM
 Task: To predict next output
 Quality:
 Given: initial state C
 For the input sequence 011101, using the FSM diagram given, we get
 the output sequence 110111
 Quality: 3 correct predictions out of 5 (a simulation sketch follows)
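A sketch of how such an FSM predictor can be simulated and scored; the transition and output tables below are hypothetical stand-ins (the slide's actual diagram is not reproduced here), so the computed quality will generally differ from the slide's 3 out of 5.

```python
def run_fsm(delta, lam, state, inputs):
    """Step an FSM: delta maps (state, symbol) -> next state,
    lam maps (state, symbol) -> output symbol."""
    outputs = []
    for s in inputs:
        outputs.append(lam[(state, s)])
        state = delta[(state, s)]
    return outputs

def prediction_quality(inputs, outputs):
    """Each output predicts the NEXT input symbol, so the last
    output has nothing to be scored against."""
    return sum(o == i for o, i in zip(outputs, inputs[1:]))

# Hypothetical tables for states A, B, C over binary alphabets.
delta = {("A", 0): "B", ("A", 1): "C", ("B", 0): "A",
         ("B", 1): "C", ("C", 0): "B", ("C", 1): "A"}
lam = {("A", 0): 1, ("A", 1): 0, ("B", 0): 0,
       ("B", 1): 1, ("C", 0): 1, ("C", 1): 1}

seq = [0, 1, 1, 1, 0, 1]
outs = run_fsm(delta, lam, "C", seq)
print(outs, prediction_quality(seq, outs))
```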
Example: Parity Problem
 A simple test problem
 Design an FSM to check if a binary string has an even or an odd number of elements equal to 1
 S = {even, odd}
 I = {0, 1}
 O = {0, 1}
 FSM output
 final state = 0 (the sequence has an even number of 1s)
 final state = 1 (the sequence has an odd number of 1s)
Ref: https://staff.fmi.uvt.ro/~daniela.zaharie/ma2015/lectures/metaheuristics2015_slides6.pdf
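The parity FSM itself is small enough to write down directly; a minimal sketch:

```python
# Parity-checking FSM: two states; the final state is the output
# (even -> 0, odd -> 1).
delta = {("even", 0): "even", ("even", 1): "odd",
         ("odd", 0): "odd", ("odd", 1): "even"}

def parity(bits, state="even"):
    for b in bits:
        state = delta[(state, b)]
    return 0 if state == "even" else 1

print(parity([1, 0, 1, 1]))  # 1: three 1s -> odd
```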
Example: Parity Problem
 State diagram: labelled graph
 EP Design: Choose S, I, O
 Population
 Generate random FSM
 Generate labels for nodes
 Generate arcs
 Mutation
 Mutation of output symbol
 Redirect an arc (mutate target node)
 Add or eliminate nodes
 Change initial state

Ref: https://staff.fmi.uvt.ro/~daniela.zaharie/ma2015/lectures/metaheuristics2015_slides6.pdf
Example: Parity Problem
 Mutation example
 Change of target node of an arc

Ref: https://staff.fmi.uvt.ro/~daniela.zaharie/ma2015/lectures/metaheuristics2015_slides6.pdf
One more example: Predict prime
numbers
 Evolving FSM to predict prime numbers

 Fitness function:
 1 point for correct prediction of the next input
 0 point for the wrong prediction
FSM to predict prime numbers
 Parent selection: Each FSM is mutated only once
 Mutation operators (one selected randomly)
 Change an output symbol
 Change a state transition
 Add a state
 Delete a state
 Change initial state
 Survivor selection
Evolutionary Programming Perspective
 Idea was to simulate evolution as a learning process with the aim of
generating artificial intelligence
 It was viewed as the
 capability of a system to adapt its behaviour
 in order to meet some specified goals in a range of environments
 Adaptive behaviour is the key term in this definition, and the
capability to predict the environment was considered to be a
prerequisite
 The classic EP systems used finite state machines as individuals

Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Modern or contemporary EP
 No predefined representation, in general, it is
chosen based on features of the problem at hand
 No predefined mutation because it must be suitable
for the chosen representation
 Often applies self-adaptation of mutation
parameters, as ES does
 In the sequel we present one EP variant
Contemporary Evolutionary Programming
Perspective and Evolutionary Strategies
 EP frequently uses real-valued representations, and so has almost
merged with ES
 The principal differences (between ES and EP) lie in the biological inspiration:
 In EP each individual is seen as corresponding to a distinct species, and so
there is no recombination
 Furthermore, the selection mechanisms are different
 In ES parents are selected stochastically, then the selection of the μ best
from the union of μ + λ offspring is deterministic
 By contrast, in EP
 Each parent generates exactly one offspring (i.e., λ = μ)
 These parents and offspring populations are then merged and compete in
stochastic round-robin tournaments for survival
Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Technical Summary

Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Representation
 Continuous parameter optimisation is the most frequent type of application: f: ℝⁿ → ℝ
 Chromosomes consist of two parts:
 Object variables: x1, …, xn
 Mutation step sizes: σ1, …, σn
 Full size chromosome:
 ⟨x1, …, xn, σ1, …, σn⟩
 Chromosome ⟨x1, …, xn, σ1, …, σn⟩ is mutated as follows:
 σi′ = σi · (1 + α · N(0,1)), with α ≈ 0.2
 xi′ = xi + σi′ · Ni(0,1)
 Boundary rule: σ′ < ε0 ⇒ σ′ = ε0
Recombination
 None
Parent Selection
 Each individual creates one child by mutation
 Hence parent selection is deterministic and not
biased by fitness
Survivor selection
 (μ + λ) selection:
 μ parents and λ = μ offspring are the candidates
 A pairwise competition in round-robin format (it works on a rotating basis; a chosen pair is moved to the back of the list after the competition) is executed
 A pairwise competition:
 Each solution x from the μ + λ candidates is evaluated against q other randomly chosen solutions
 For each comparison, a "win" is assigned if x is fitter than its opponent
 The μ solutions with the greatest number of wins are retained to be parents of the next generation
 Parameter q allows tuning selection pressure:
 Typically q = 10 (see the sketch below)
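A sketch of this round-robin survivor selection in Python; the function name and the assumption that a lower fitness value is better (minimization) are our own.

```python
import random

def ep_survivor_selection(candidates, fitness, mu, q=10):
    """Round-robin survivor selection: each of the mu + lambda candidates
    meets q random opponents; the mu candidates with most wins survive."""
    wins = []
    for x in candidates:
        opponents = random.sample([c for c in candidates if c is not x],
                                  min(q, len(candidates) - 1))
        wins.append(sum(fitness(x) < fitness(o) for o in opponents))
    ranked = sorted(range(len(candidates)),
                    key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in ranked[:mu]]

# Usage: survivors = ep_survivor_selection(parents + offspring,
#                                          my_fitness_fn, mu=len(parents))
```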
Optimal control problems using
Evolutionary Programming

Ref:D.B.Fogel, “Applying Evolutionary Programming to Selected Control Problems”, Computers Math. Applic. Vol. 27, No. 11, pp. 89-104, 1994
Optimal control problems using
Evolutionary Programming

Ref:D.B.Fogel, “Applying Evolutionary Programming to Selected Control Problems”, Computers Math. Applic. Vol. 27, No. 11, pp. 89-104, 1994
Optimal control problems using
Evolutionary Programming

Ref:D.B.Fogel, “Applying Evolutionary Programming to Selected Control Problems”, Computers Math. Applic. Vol. 27, No. 11, pp. 89-104, 1994
 https://www.youtube.com/watch?v=kL7xOZvdhDo
 https://scialert.net/fulltext/?doi=jai.2008.12.20

 https://www.youtube.com/watch?v=Wff2InJn3p0
Evolutionary Strategies
Evolutionary Strategies
 Evolution strategies are evolutionary algorithms
 Invented in the early 1960s by Rechenberg and Schwefel
 Similar to Genetic Algorithms
 find a (near-)optimal solution to a problem within a search space (all possible
solutions)
 Often used for empirical experiments, numerical optimisation
 Attributed features
 Fast and good optimizer for real-valued optimisation
 Self-adaptation of (mutation) parameters standard
 Based on the principle of strong causality: small changes have small effects
Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
http://www.cmap.polytechnique.fr/~nikolaus.hansen/es-overview-2015.pdf
Hans George Beyer, Hans Paul Schwefel, “Evolutionary Strategies a comprehensive introduction”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
Evolutionary Strategy
Evolutionary strategies survivor selection
variations
 (μ + λ) Selection
 (μ, λ) Selection

Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
Evolutionary strategies survivor selection
variations
 (μ + λ) Selection
 Set of offspring and parents are merged
 They are ranked according to (estimated) fitness
 Then the top μ are kept to form the next generation
 Or we can say that
 to keep the population size constant, the λ worst out
of all μ + λ individuals are discarded;
Evolutionary strategies survivor selection
variations
 (μ, λ) Selection
 (μ, λ) strategy used in Evolution Strategies where typically λ>μ children are created from a population
of μ parents
 Method works on a mixture of age and fitness
 Age component means that all the parents are discarded, so no individual is kept for more than one
generation
 The fitness component comes from the fact that the λ offspring are ranked according to the fitness, and
the best μ form the next generation
 Parents are “forgotten” no matter how good or bad their fitness was compared to that of
the new generation
 In Evolution Strategies, (μ, λ) selection is generally preferred over (μ + λ) selection for the following reasons
 The (μ, λ) selection discards all parents and is therefore in principle able to leave (small) local optima
 This may be advantageous in a multimodal search space with many local optima
Two membered Evolutionary Strategy
 Simple mutation selection scheme
 Works on one individual (parent) which will create one offspring by means of mutation
 The better of parent and offspring is selected deterministically to survive to the next generation
 This is the (1+1) selection evolutionary strategy
Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
Two membered Evolutionary Strategy
 Here we describe the basic algorithm, termed the two-membered
evolution strategy, for the abstract problem of minimizing an n-
dimensional function
 Task
 Minimize f: ℝⁿ → ℝ
 Algorithm: “two-membered ES” using
 Vectors from ℝⁿ directly as chromosomes
 Population size 1
 Only mutation, creating one child
 Greedy selection
Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Two membered Evolutionary Strategy
 Set t = 0 (here t denotes the generation counter)
 Create initial point x^t = ⟨x1^t, …, xn^t⟩
 This is the initial population with only one chromosome
 REPEAT UNTIL (TERMINATION CONDITION satisfied) DO
 Draw zi from a normal distribution for all i = 1, …, n
 yi^t = xi^t + zi
 IF f(x^t) < f(y^t) THEN x^{t+1} = x^t
 ELSE x^{t+1} = y^t
 Set t = t + 1
 END
Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Two membered Evolutionary Strategy

Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Two membered Evolutionary Strategy-
Mutation mechanism
 z values are drawn from a normal distribution N(ξ, σ)
 mean ξ is set to 0
 standard deviation σ is called the mutation step size
 Rechenberg suggested the following heuristic rule for adjusting σ
 1/5 success rule:
 The ratio of successful mutations to all mutations should be 1/5. If it is greater than 1/5, increase the variance; if it is less, decrease the mutation variance.
 σ is varied on the fly by the “1/5 success rule”:
 This rule resets σ after every k iterations by
 σ = σ / c if ps > 1/5
 σ = σ • c if ps < 1/5
 σ = σ if ps = 1/5
 where ps is the % of successful mutations, 0.8 ≤ c ≤ 1

Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
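A compact (1+1)-ES sketch with the 1/5 success rule; the defaults (c = 0.9, k = 20) are illustrative choices within the ranges discussed above, and lower f is assumed to be better.

```python
import random

def one_plus_one_es(f, x, sigma=1.0, c=0.9, k=20, iterations=2000):
    """(1+1)-ES with Rechenberg's 1/5 success rule: every k mutations,
    grow sigma if more than 1/5 succeeded, shrink it if fewer did."""
    successes = 0
    for t in range(1, iterations + 1):
        y = [xi + random.gauss(0.0, sigma) for xi in x]
        if f(y) < f(x):            # minimization; child replaces parent
            x = y
            successes += 1
        if t % k == 0:
            ps = successes / k
            if ps > 0.2:
                sigma /= c         # too many successes: widen the search
            elif ps < 0.2:
                sigma *= c         # too few: focus around current point
            successes = 0
    return x, f(x)

# Example: minimize the sphere function from a random start
sphere = lambda v: sum(vi * vi for vi in v)
print(one_plus_one_es(sphere, [random.uniform(-5, 5) for _ in range(3)]))
```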
Normal distribution basics

Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Two membered Evolutionary Strategy-
Mutation mechanism
 Given a current solution x^t in the form of a vector of length n, a new candidate x^{t+1} is created by adding a random number zi, for i = 1, …, n, to each of the n components
 A Gaussian, or normal, distribution is used with zero mean and standard deviation σ for drawing the random numbers
 This distribution is symmetric about zero and has the feature that the probability of drawing a random number with any given magnitude is a rapidly decreasing function of the standard deviation σ
 Thus the σ value is a parameter of the algorithm that determines the extent to which given values xi are perturbed by the mutation operator
Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Two membered Evolutionary Strategy-
Mutation mechanism
 For this reason  is often called the mutation step size
 Theoretical studies motivated an on-line adjustment of step sizes by the famous 1/5 success rule of Rechenberg
 This rule states that the ratio of successful mutations (those in which the child is fitter than the parent) to all
mutations should be 1/5
 Hence if the ratio is greater than 1/5 the step size should be increased to make a wider search of the space,
and if the ratio is less than 1/5 then it should be decreased to concentrate the search more around the current
solution
 The rule is executed at periodic intervals, for instance, after every k iterations σ is reset by
 σ = σ / c if ps > 1/5 (foot of big hill → increase σ)
 σ = σ • c if ps < 1/5 (near the top of the hill → decrease σ)
 σ = σ if ps = 1/5
 where ps is the relative frequency of successful mutations measured over a number of trials, and the parameter c is in the range 0.8 ≤ c ≤ 1

Ref: Thomas Back, Gunter Rudolph, Hans-Paul Schwefel, “Evolutionary Programming and Evolutionary Strategies”
A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Evolutionary Strategies- example
 Some essential characteristics of evolution strategies are
illustrated here in the following example
 Evolution strategies are typically used for continuous parameter
optimization
 There is a strong emphasis on mutation for creating offspring
 Mutation is implemented by adding some random noise drawn
from a Gaussian distribution
 Mutation parameters are changed during a run of the algorithm

Ref: https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Historical example:
the jet nozzle experiment

Task: to optimize the shape of a jet nozzle


Approach: random mutations to shape + selection

[Figure: initial nozzle shape vs. final evolved shape — images not reproduced in this text]
Ref: https://research.iaun.ac.ir/pd/faramarz_safi/pdfs/UploadFile_1072.pdf
https://www.cs.vu.nl/~gusz/ecbook/ecbook-course.html
Jet nozzle experiment: Representation
 Here Self Adaptation mechanism for mutation is used instead of 1/5 success rule adaptation
 Hence the vector x = ⟨x₁, x₂, …, xₙ⟩ forms only part of a typical ES genotype
 Individuals contain some strategy parameters, in particular, parameters of the mutation
operator
 Details of mutation are discussed later
 Here we focus only on the structure of individuals, and specify the meaning of the special
genes
 Strategy parameters can be divided into two sets,
 the 𝜎 values and the 𝛼 values
 𝜎 values represent the mutation step sizes
 𝛼 values, which represent interactions between the step sizes used for different variables, are not
always used
Jet nozzle experiment: Representation
 Chromosomes consist of three parts:
 Object variables: x₁,…,xₙ
 Strategy parameters:
 Mutation step sizes: σ₁,…,σ_nσ
 Rotation angles: α₁,…,α_nα
 Not every component is always present
 Full size: ⟨x₁,…,xₙ, σ₁,…,σₙ, α₁,…,αₖ⟩
 where k = n • (n−1)/2 (the number of pairwise rotation angles)
Jet nozzle experiment: Mutation
 Main mechanism: changing value by adding random
noise drawn from normal distribution
 x′ᵢ = xᵢ + N(0, σ)
 Key idea:
 σ is part of the chromosome ⟨x₁,…,xₙ, σ⟩
 σ is also mutated into σ′ (see later how)
 Thus: mutation step size σ is coevolving with the solution x
Jet nozzle experiment: Mutate σ first
 Net mutation effect: ⟨x, σ⟩ → ⟨x′, σ′⟩
 Order is important:
 first σ → σ′ (see later how)
 then x → x′ = x + N(0, σ′)
 Rationale: the new ⟨x′, σ′⟩ is evaluated twice
 Primary: x′ is good if f(x′) is good
 Secondary: σ′ is good if the x′ it created is good
 Reversing the mutation order would not work
Mutation case 1:
Uncorrelated mutation with one σ
 Chromosomes: ⟨x₁,…,xₙ, σ⟩
 σ′ = σ • exp(τ • N(0,1))
 x′ᵢ = xᵢ + σ′ • N(0,1)
 Typically the “learning rate” τ ∝ 1/n½


Mutants with equal likelihood

Circle: mutants having the same chance to be created


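A minimal Python sketch of this self-adaptive mutation (function names and the default learning rate are illustrative assumptions):

import math
import random

def mutate_one_sigma(x, sigma):
    # sigma' = sigma * exp(tau * N(0,1)); then perturb every x_i with sigma'
    tau = 1 / math.sqrt(len(x))                  # learning rate ~ 1/sqrt(n)
    sigma_new = sigma * math.exp(tau * random.gauss(0, 1))
    x_new = [xi + sigma_new * random.gauss(0, 1) for xi in x]
    return x_new, sigma_new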
Mutation case 2:
Uncorrelated mutation with n σ’s
 Chromosomes: ⟨x₁,…,xₙ, σ₁,…,σₙ⟩
 σ′ᵢ = σᵢ • exp(τ′ • N(0,1) + τ • Nᵢ(0,1))
 x′ᵢ = xᵢ + σ′ᵢ • Nᵢ(0,1)
 Two learning rate parameters:
 τ′ overall learning rate
 τ coordinate-wise learning rate
 τ′ ∝ 1/(2n)½ and τ ∝ 1/(2n½)½
Mutants with equal likelihood

Ellipse: mutants having the same chance to be created


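Extending the previous sketch to one step size per coordinate (again a hedged sketch of ours; note the single shared draw feeding the τ′ term):

import math
import random

def mutate_n_sigmas(x, sigmas):
    n = len(x)
    tau_prime = 1 / math.sqrt(2 * n)             # overall learning rate
    tau = 1 / math.sqrt(2 * math.sqrt(n))        # coordinate-wise learning rate
    shared = random.gauss(0, 1)                  # one N(0,1) draw shared by all i
    new_sigmas = [s * math.exp(tau_prime * shared + tau * random.gauss(0, 1))
                  for s in sigmas]
    x_new = [xi + s * random.gauss(0, 1) for xi, s in zip(x, new_sigmas)]
    return x_new, new_sigmas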
Mutation case 3:
Correlated mutations
 Chromosomes: ⟨x₁,…,xₙ, σ₁,…,σₙ, α₁,…,αₖ⟩
 where k = n • (n−1)/2
 and the covariance matrix C is defined as:
 cᵢᵢ = σᵢ²
 cᵢⱼ = 0 if i and j are not correlated
 cᵢⱼ = ½ • (σᵢ² − σⱼ²) • tan(2αᵢⱼ) if i and j are correlated
 Note the numbering / indices of the α‘s


Correlated mutations cont’d
The mutation mechanism is then:
 σ′ᵢ = σᵢ • exp(τ′ • N(0,1) + τ • Nᵢ(0,1))
 α′ⱼ = αⱼ + β • N(0,1)
 x′ = x + N(0, C′)
 x stands for the vector ⟨x₁,…,xₙ⟩
 C′ is the covariance matrix C after mutation of the α values
 τ′ ∝ 1/(2n)½ and τ ∝ 1/(2n½)½ and β ≈ 5°
 σ′ᵢ < ε₀ ⇒ σ′ᵢ = ε₀
 |α′ⱼ| > π ⇒ α′ⱼ = α′ⱼ − 2π • sign(α′ⱼ)
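A sketch of how C could be assembled from the σ and α values and used to draw the correlated perturbation. This builds C literally from the formulas above; a robust implementation would instead compose rotation matrices, since this direct construction is not guaranteed to yield a valid covariance matrix for arbitrary angles.

import numpy as np

def correlated_perturbation(x, sigmas, alphas):
    # alphas: flat list of the k = n(n-1)/2 pairwise rotation angles
    n = len(x)
    C = np.diag(np.array(sigmas) ** 2)           # c_ii = sigma_i^2
    k = 0
    for i in range(n):
        for j in range(i + 1, n):                # c_ij for correlated pairs
            C[i, j] = C[j, i] = 0.5 * (sigmas[i]**2 - sigmas[j]**2) * np.tan(2 * alphas[k])
            k += 1
    return x + np.random.multivariate_normal(np.zeros(n), C)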
Mutants with equal likelihood

Ellipse: mutants having the same chance to be created


Recombination
 Creates one child
 Acts per variable / position by either
 Intermediary: Averaging parental values, or
 Discrete: Selecting one of the parental values

 From two or more parents by either:


 Local: Using two selected parents to make a child
 Global: Selecting two parents for each position anew
Names of recombinations

                                   Two fixed parents      Two parents selected for each i
 zᵢ = (xᵢ + yᵢ)/2                  Local intermediary     Global intermediary
 zᵢ is xᵢ or yᵢ, chosen randomly   Local discrete         Global discrete
Recombination
 Basic recombination scheme in evolution strategies involves two
parents that create one child
 To obtain λ offspring, recombination is performed λ times.
 There are two recombination variants distinguished by the manner of
recombining parent alleles
 Using discrete recombination one of the parent alleles is randomly
chosen with equal chance for either parent
 In intermediate recombination the values of the parent alleles are
averaged
Recombination
 An extension of this scheme allows the use of more than two recombinants, because the two
parents x and y are drawn randomly for each position i ∈ {1, …, n} in the offspring anew
 These drawings take the whole population of individuals into consideration, and the result is a
recombination operator with possibly more than two individuals contributing to the offspring.
 The exact number of parents, however, cannot be defined in advance. This multi-parent variant
is called global recombination
 To make terminology unambiguous, the original variant is called local recombination
 Evolution strategies typically use global recombination
 Interestingly, different recombination is used for the object variable part (discrete is
recommended) and the strategy parameters part (intermediary is recommended).
 This scheme preserves diversity within the phenotype (solution) space, allowing the trial of very
different combinations of values, whilst the averaging effect of intermediate recombination
assures a more cautious adaptation of strategy parameters
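This recommended mix (discrete on object variables, intermediary on strategy parameters, parents redrawn per position) might be sketched as follows; representing individuals as (x, sigmas) pairs is our own assumption:

import random

def global_recombination(population):
    # population: list of (x, sigmas) pairs; returns one child
    n = len(population[0][0])
    child_x, child_sigmas = [], []
    for i in range(n):
        (x1, s1), (x2, s2) = random.sample(population, 2)  # fresh parents per position
        child_x.append(random.choice((x1[i], x2[i])))      # discrete on object variables
        child_sigmas.append((s1[i] + s2[i]) / 2)           # intermediary on step sizes
    return child_x, child_sigmas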
Parent selection
 Parents are selected by uniform random distribution
whenever an operator needs one/some
 Thus: ES parent selection is unbiased - every
individual has the same probability to be selected
 Note that in ES “parent” means a population member
(in GA’s: a population member selected to undergo
variation)
Survivor selection
 Applied after creating λ children from the μ
parents by mutation and recombination
 Deterministically chops off the “bad stuff”
 Basis of selection is either:
 The set of children only: (μ,λ)-selection
 The set of parents and children: (μ+λ)-selection
Survivor selection cont’d
 (μ+λ)-selection is an elitist strategy
 (μ,λ)-selection can “forget”
 Often (μ,λ)-selection is preferred for being:
 Better in leaving local optima
 Better in following moving optima
 Using the + strategy, bad σ values can survive in ⟨x, σ⟩ too long if their host x is very fit
 Selective pressure in ES is very high (λ ≈ 7 • μ is the common setting)
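Both schemes amount to deterministic truncation over different pools; a minimal sketch (the fitness-to-minimise argument f and the flag name are our own):

def survivor_selection(parents, children, mu, f, plus=False):
    # (mu,lambda): select from children only; (mu+lambda): parents compete too
    pool = parents + children if plus else children
    return sorted(pool, key=f)[:mu]     # keep the mu best, discard the rest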
Self-adaptation illustrated
 Given a dynamically changing fitness landscape
(optimum location shifted every 200 generations)
 Self-adaptive ES is able to
 follow the optimum and
 adjust the mutation step size after every shift !
Prerequisites for self-adaptation
 μ > 1 to carry different strategies
 λ > μ to generate offspring surplus
 Not “too” strong selection, e.g., λ ≈ 7 • μ
 (μ,λ)-selection to get rid of misadapted σ‘s
 Mixing strategy parameters by (intermediary)
recombination on them
Learning Classifier Systems
 Learning Classifier Systems (LCS) represent an alternative evolutionary approach to model
building based on the use of rule sets, rather than parse trees, to represent knowledge
 LCS are used primarily in applications where the objective is to evolve a system that will
respond to the current state of its environment (i.e., the inputs to the system) by suggesting a
response that in some way maximises future reward from the environment
 An LCS is therefore a combination of a classifier system and a learning algorithm
 The classifier system component is typically a set of rules, each mapping certain inputs to
actions
 The whole rule set therefore constitutes a model that covers the space of possible inputs and
suggests the most appropriate actions for each
 The learning algorithm component of an LCS is implemented by an evolutionary algorithm,
whose population members either represent individual rules, or complete rule sets, known
respectively as the Michigan and Pittsburgh approaches

Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Learning Classifier Systems
 The learning algorithm component of an LCS is implemented by an evolutionary
algorithm, whose population members either represent individual rules, or complete
rule sets, known respectively as the Michigan and Pittsburgh approaches
 The fitness driving the evolutionary process may come from many different forms
of learning; here we restrict ourselves to ‘supervised’ learning, where at each stage
the system receives a training signal (reward) from the environment in response to
the output it proposes
 This helps emphasise the difference between the Michigan and Pittsburgh
approaches
 In the Michigan data items are presented to the system one-by-one and individual
rules are rewarded according to their predictions
 By contrast, in a Pittsburgh approach each individual represents a complete model,
so the fitness would normally be calculated by presenting the entire data set and
calculating the mean accuracy of the predictions
Michigan Style Classifier Systems
 The list below summarizes the main workflow of the algorithm.
 1. A new set of inputs are received from the environment
 2. The rule base is examined to find the match-set of rules.
 • If the match set is empty, a ‘cover operator’ is invoked to generate one or more new matching rules with a random
action

 3. The rules in the match-set are grouped according to their actions


 4. For each of these groups the mean accuracy of the rules is calculated.
 5. An action is chosen, and its corresponding group noted as the action
set.
 If the system is an ‘exploit’ cycle, the action with the highest mean accuracy is chosen.
 If the system is in an ‘explore’ cycle, an action is chosen randomly or via fitness-proportionate selection, acting on
the mean accuracies.
Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Michigan Style Classifier Systems
 6. The action is carried out and a reward is received from the environment
 7. The estimated accuracy and predicted payoffs are then updated for the rule in the
current and previous action sets, based on the rewards received and the predicted pay-
offs, using a Widrow–Hoff style update mechanism
 8. If the system is in an ‘explore’ cycle, an EA is run within the action-set, creating new rules
(with pay-off and accuracies set to the mean of their parents), and deleting others

Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
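Step 7’s Widrow–Hoff style update is simply the delta rule; a one-line sketch (the learning rate β = 0.2 is an illustrative choice, not a prescribed value):

def widrow_hoff_update(predicted_payoff, reward, beta=0.2):
    # move the stored estimate a fraction beta toward the observed reward
    return predicted_payoff + beta * (reward - predicted_payoff)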
Michigan Style Classifier Systems
[Figure: schematic workflow of a Michigan-style LCS — image not reproduced in this text]
Ref:A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, 2nd edition, Natural Computing Series, Springer, 2015
Pittsburgh-style LCS
 The Pittsburgh-style LCS predates, but is similar to, the better-known GP:
 Each member of the evolutionary algorithm’s population represents a complete model of the mapping from input to
output spaces
 Each gene in an individual typically represents a rule, and again a new input item may match more than
one rule, in which case typically the first match is taken
 This means that the representation should be viewed as an ordered list, and two individuals which contain
the same rules, but in a different order on the genome, are effectively different models. Learning of
appropriately complex models is typically facilitated by using a variable-length representation so that new
rules can be added at any stage
 This approach has several conceptual advantages — in particular, since fitness is awarded to complete rule
sets, models can be learned for complex multi-step problems
 The downside of this flexibility is that, like GP, Pittsburgh-style LCS suffers from bloat and the search space
becomes potentially infinite
 Nevertheless, given sufficient computational resources, and effective methods of parsimony to counteract
bloat, Pittsburgh-style LCS has demonstrated state-of-the-art performance in several machine learning
domains, especially for applications such as bioinformatics and medicine, where human-interpretability of
the evolved models is vital and large data-sets are available so that the system can evolve off-line to
minimise prediction error
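For instance, the fitness of one Pittsburgh individual could be computed roughly as follows (a sketch under our own representation assumptions: rules as (condition, action) pairs over fixed-length strings, ‘#’ as a wildcard):

def matches(condition, inputs):
    # '#' is a don't-care symbol; every other position must agree exactly
    return all(c in ("#", x) for c, x in zip(condition, inputs))

def pittsburgh_fitness(rule_list, dataset):
    # one individual = a complete, ordered rule list; first matching rule wins
    correct = 0
    for inputs, label in dataset:
        for condition, action in rule_list:
            if matches(condition, inputs):
                correct += (action == label)
                break
    return correct / len(dataset)       # mean accuracy over the whole data set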
 The Michigan-style LCS was first described by Holland in 1976 as a framework for studying learning in condition/action rule-based systems, using genetic algorithms as the principal method for the discovery of new rules and the reinforcement of successful ones [219]
 Typically each member of the population was a single rule representing a partial model – that is to say, it might only cover a region of the decision space. Thus it is the entire population that together represents the learned model
 Each rule is a tuple {condition:action:payoff}. The condition specifies a region of the space of possible inputs in which the rule applies
 The condition parts of rules may contain wildcard, or ‘don’t-care’, characters for certain variables, or may describe a set of values that a given variable may take – for example, a range of values for a continuous variable
 Rules may be distinguished by the number of wildcards they contain, and one rule is said to be more specific than another if it contains fewer wildcards, or if the ranges for certain variables are smaller — in other words, if it covers a smaller region of the input space
 Given this flexibility, it is common for the condition parts of rules to overlap, so a given input may match a number of rules. In the terminology of LCS, the subset of rules whose condition matches the current inputs from the environment is known as the match set
 These rules may prescribe different actions, of which one is chosen. The action specifies either the action to be taken (for example, if controlling robots or on-line trading agents) or the system’s prediction (such as a class label or a numerical value). The subset of the match set advocating the chosen action is known as the action set
 Holland’s original framework maintained lists of which rules had been used, and when a reward was received from the environment a portion was passed back to recently used rules to provide information for the selection mechanism. The intended effect is that the strength of a rule predicts the value of the reward that the system will gain for undertaking the action. However, the framework proved unwieldy and difficult to make work well in practice
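These notions translate directly into data structures; a self-contained sketch (field names and the binary-string condition format are illustrative):

from dataclasses import dataclass

@dataclass
class Rule:
    condition: str      # e.g. "1#0#" over binary inputs; '#' is a don't-care position
    action: int
    payoff: float
    accuracy: float

def match_set(rules, inputs):
    # all rules whose condition matches the current inputs
    return [r for r in rules
            if all(c in ("#", x) for c, x in zip(r.condition, inputs))]

def action_set(match, chosen_action):
    # the subset of the match set advocating the chosen action
    return [r for r in match if r.action == chosen_action]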
 LCS research was reinvigorated in the mid-1990s by Wilson, who removed the concept of
memory and stripped out all but the essential components in his minimalist ZCS algorithm
[464]. At the same time several authors were noting the conceptual similarity between LCS and
reinforcement learning algorithms, which attempt to learn, for each input state, an accurate
mapping from possible actions to expected rewards. The XCS algorithm [465] firmly
established this link by extending rule-tuples to {condition:action:payoff,accuracy},
where the accuracy value reflects the system’s experience of how well the predicted payoff
matches the reward received. Unlike ZCS, the EA is restricted at each cycle — originally to the
match set, latterly to the action set, which increases the pressure to discover generalised
conditions for each action. As per ZCS, a credit assignment mechanism is triggered by the
receipt of rewards from the environment to update the predicted pay-offs for rules in the
previous action set. However, the major difference is that these are not used directly to drive
selection in the evolution process. Instead selection operates on the basis of accuracy, so the
algorithm can in principle evolve a complete mapping from input space to actions
 Two recent examples winning Humies Awards for better-than-human performance are in the realms of prostate cancer detection [272] and protein structure prediction [16]