Simple Genetic Operators Are Universal Approximators of Probability Distributions (And Other Advantages of Expressive Encodings)
arXiv:2202.09679v4 [cs.NE] 2 Aug 2022

ABSTRACT
This paper characterizes the inherent power of evolutionary algorithms. This power depends on the computational properties of the genetic encoding. With some encodings, two parents recombined with a simple crossover operator can sample from an arbitrary distribution of child phenotypes. Such encodings are termed expressive encodings in this paper. Universal function approximators, including popular evolutionary substrates of genetic programming and neural networks, can be used to construct expressive encodings. Remarkably, this approach need not be applied only to domains where the phenotype is a function: Expressivity can be achieved even when optimizing static structures, such as binary vectors. Such simpler settings make it possible to characterize expressive encodings theoretically: Across a variety of test problems, expressive encodings are shown to achieve up to super-exponential convergence speed-ups over the standard direct encoding. The conclusion is that, across evolutionary computation areas as diverse as genetic programming, neuroevolution, genetic algorithms, and theory, expressive encodings can be a key to understanding and realizing the full power of evolution.

CCS CONCEPTS
• Computing methodologies → Genetic algorithms; Genetic programming; Neural networks.

KEYWORDS
genetic algorithms, universal approximation, expressive encodings

ACM Reference Format:
Elliot Meyerson, Xin Qiu, and Risto Miikkulainen. 2022. Simple Genetic Operators are Universal Approximators of Probability Distributions (and other Advantages of Expressive Encodings). In Proceedings of The Genetic and Evolutionary Computation Conference 2022 (GECCO ’22). ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3512290.3528746

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
GECCO ’22, July 9–13, 2022, Boston, USA
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9237-2/22/07...$15.00
https://doi.org/10.1145/3512290.3528746

1 INTRODUCTION
Evolutionary algorithms (EAs) promise to generate powerful solutions in complex environments. To fulfill this promise, not only must great solutions exist somewhere in their search space, but EAs must be able to find them with non-vanishing probability even in high-dimensional, dynamic, and deceptive domains. One way to deliver on this promise is to design more and more specialized operators to capture the structure of a particular domain. Although this approach has led to many practical successes [8, 16, 29, 92], it is inherently limited by the human effort required to design such operators for each domain: That is, humans remain a bottleneck.

This paper advocates for an alternative approach: using simple, generic evolutionary operators with general genotype encodings in which arbitrarily high levels of complexity can accumulate. Such encodings are termed expressive encodings in this paper, and are analyzed theoretically and experimentally. Using standard simple genetic operators (SGOs) involving only one or two parents, expressive encodings are capable of universal approximation of child phenotype distributions. Thus, in theory, no hand-design is necessary to capture complex domain-adapted behavior.

Expressive encodings can be implemented with universal function approximators such as genetic programming and neural networks; the paper shows that such systems can achieve up to super-exponential improvements in settings that are intractable with standard direct encodings. The space of expressive encodings overlaps with indirect encodings, but is distinctly different: Some direct encodings can be expressive as well.

The expressive encoding approach can be seen as analogous to recent progress in machine learning: Prior to deep learning [32, 51], progress was made by crafting better features and increasingly complex learning algorithms to solve different kinds of problems. With deep learning, large, complex structures with a general form (deep networks) allowed better solutions to be found across huge swaths of problems using a generic and simple algorithm: stochastic gradient descent (SGD). EAs could make similar progress via complex, general encodings evolved with SGOs.

The approach creates further opportunities in at least four areas of evolutionary computation:
• Genetic Programming and Neuroevolution: Expressive encodings in these frameworks can be developed further and used as a starting point for improved methods.
• Genetic Algorithms: They can be defined for GAs in general, increasing their power.
• Theory: They can serve as a foundation for a new theoretical perspective on EAs.
In other words, not only are expressive encodings powerful and valuable in their own right, they can provide a shared platform of research that would allow fruitful communication and knowledge-sharing between often disparate subfields.

2 CONCEPTUAL FRAMEWORK
This section provides background on encodings and genetic operators, including running examples that will be used in the paper.

2.1 Encodings
An encoding is a function 𝐸 : 𝑋 → 𝑌, where 𝑋 is a set of genotypes and 𝑌 is a set of phenotypes. The genotype (or genome) 𝑥 ∈ 𝑋 is a description of the individual, which 𝐸 uses to produce the phenotype 𝑦 ∈ 𝑌, which can then be evaluated in a given environment. A fitness function 𝑓 : 𝑌 → R assigns a score to the phenotype. The evolutionary computation literature is filled with a diversity of encodings. This paper focuses on some of the most popular, and, for simplicity, primarily uses a boolean vector phenotype space 𝑌 = {0, 1}^𝑛, but the results can be extended to other settings.

2.1.1 Direct Encoding. This is the simplest and most popular approach in using GAs and EAs for optimization. With a direct encoding, 𝐸 is the identity function 𝐸(𝑥) = 𝑦, and 𝑋 = 𝑌. It is called direct because the algorithm directly evolves elements of the phenotype: the genotype contains no information or structure not in the phenotype, and vice versa. When evolving boolean vectors with a direct encoding, the genotype and phenotype are simply a vector of bits, which is the most common setting in EA theory [2, 19, 22, 91, 95].

2.1.2 Neural Networks. Evolving neural networks (NNs), or neuroevolution (NE), is a popular approach to discovering solutions for problems like function approximation, control, and sequential decision-making [26, 61, 78, 80, 83, 97]. The phenotype is usually a function, which may receive multiple distinct inputs during its evaluation. Many encodings exist for evolving NNs. The simplest of them is a direct encoding, where the NN structure is fixed and its weights are evolved [26, 83]. However, NNs can also be used as the genetic encoding in cases where the phenotype is not a function but simply a fixed structure, such as a binary vector. In this case, since there is no varying input to the phenotype during evaluation, a fixed value, e.g., 1, can simply be fed into the network to generate the fixed phenotype. Formally, an NN genotype ℎ is of the form ℎ : R^1 → R^𝑁, and a sigmoid activation 𝜎 in the final layer squashes the output of the NN into (0, 1). Therefore, the overall encoding is

𝐸(ℎ) = round(𝜎(ℎ(1))),   (1)

where ‘round’ is applied elementwise, so that 𝐸(ℎ) produces a binary vector. In this paper, all internal nodes have sigmoid activation, and all biases are 0 unless otherwise noted.

Although we are not aware of any prior work that has evolved NNs to optimize binary vectors, there is substantial prior work evolving NNs to generate static artifacts such as pictures [49, 56, 70] and 3D objects [10, 52]. Similarly to this paper, the motivation is that the structure of NNs can lead to interesting patterns in how offspring phenotypes relate to parent phenotypes. High-level (and often interpretable [43]) reproductive complexity can be achieved that would not be possible if, say, pixels or voxels were evolved directly. The complexity of what is possible in evolution at any point in time is accumulated in what has been evolved so far. This area of work has achieved impressive, visually appealing results [53, 64]. A goal of the present work is to show that such an approach is indeed fundamentally powerful for problem-solving, and overcomes serious fundamental limitations of direct encodings.

2.1.3 Genetic Programming. Like neuroevolution, genetic programming (GP) is used in situations where phenotypes require complex behavior [5, 39, 48, 50, 66, 71, 76]. GP is often used for function approximation [65, 68], but its main motivation is to generate programs that meet a given description. As evolved programs accumulate complexity, as in the case of NNs [43, 54], the behavior of evolution itself changes over time [1, 4, 42]. So, like NNs, GP can be used to generate static structures, leading to benefits from the complexity of evolved programs. As there are many languages for human programmers, there are many encodings for GP [5, 48, 76]. The encoding can generally be defined by the available terminals and operators. This paper considers a small set of terminals and operators, which can naturally be extended:
• Terminals: binary vectors with evolved values, integers;
• Operators: <, >, +, par, ⊕, if, return.
The binary vectors found at terminals can be of varying length, but at least one must be of length 𝑛 so that the program can return a solution of length 𝑛. Vectors of length 1 will be broadcast when used in binary operations; ‘<’ and ‘>’ compare vectors from left to right as if they were binary integers, e.g., ([0, 1, 0] < [0, 1, 1]) = (010 < 011) = 1; ‘par’ returns the parity of the vector; ‘⊕’ performs elementwise addition (mod 2). A program using these operators can be rendered in sequential, tree, or graph form [5, 48]; this paper uses sequential programs for readability.

Of course, NNs and programs on their own are expressive in the space of functions: NNs can approximate any function arbitrarily closely [13, 41, 47], and sufficiently powerful programming languages can compute anything that is computable [9, 76, 87]. This paper focuses on a different type of expressivity: How do different encodings enable the evolutionary process to behave in complex and powerful ways? This behavior involves pairing encodings with genetic operators, which are discussed next.

2.2 Genetic Operators
Genetic (or evolutionary) operators are the mechanisms by which genomes reproduce to generate other genomes. A genetic operator 𝑔 is a (usually stochastic) function that produces a new genome 𝑥′ given a set of 𝑛_𝑔 parent genomes 𝑋_𝑝 ⊂ 𝑋. Since 𝑔(𝑋_𝑝) results in a distribution over genomes, we can write 𝑥′ ∼ 𝑔(𝑋_𝑝). This paper focuses on the two most common operators (in both nature and computation): crossover and mutation. For consistency, assume the genotype can be flattened into a string of symbols in a canonical way, so that 𝑥^𝑗 refers to the 𝑗th symbol in the string form of a genotype 𝑥. The following operators are likely familiar to EA practitioners, but they are briefly described here for completeness.

2.2.1 Uniform Crossover. This operator 𝑔_𝑐 takes two parents of equivalent structure and produces a child by independently selecting the value in one of the parents at random for each element.
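The uniform crossover operator just described can be sketched in a few lines; the following is our own minimal Python illustration (the function name is ours, not from the paper), which also checks the guarantee that positions where the parents agree are preserved in every child:

```python
import random

def uniform_crossover(x1, x2, rng=random):
    """For each element, the child independently takes the value from one
    of the two parents, chosen uniformly at random."""
    assert len(x1) == len(x2), "parents must have equivalent structure"
    return [rng.choice((a, b)) for a, b in zip(x1, x2)]

p1 = [0, 1, 1, 0, 1]
p2 = [0, 0, 1, 1, 1]
child = uniform_crossover(p1, p2)
# Wherever the parents agree (indices 0, 2, and 4 here), every child agrees too.
assert child[0] == 0 and child[2] == 1 and child[4] == 1
```

The same sketch applies unchanged to any flattened genotype of symbols, which is the canonical-string view assumed above.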
That is, 𝑥_𝑐 ∼ 𝑔_𝑐(𝑥_1, 𝑥_2) =⇒ 𝑥_𝑐^𝑗 ∼ 𝑈({𝑥_1^𝑗, 𝑥_2^𝑗}) ∀𝑗, where 𝑈 is the uniform distribution. Importantly, if the two parents have the same value at index 𝑗, then the child is also guaranteed to have that value at 𝑗: 𝑥_1^𝑗 = 𝑥_2^𝑗 =⇒ 𝑥_1^𝑗 = 𝑥_𝑐^𝑗. The results on uniform crossover in this paper should be extendable to other forms of crossover, including single- and multi-point crossover [15, 62, 91].

2.2.2 Single-point Mutation. With single-point mutation 𝑔_𝑚, the child is a copy of a single parent with a single location altered. For example, if the encoding is an NN, 𝑔_𝑚 can alter a single weight in the network, e.g., by adding Gaussian noise. If the encoding is GP, 𝑔_𝑚 can replace a symbol with a different valid symbol, e.g., flip a bit of a location that contains a binary value. Single-point mutation is similar to uniform mutation, which mutates each element independently with equal probability, but is simpler to analyze, and has been shown to be effective in many benchmarks [21].

2.2.3 Simple Genetic Operators (SGOs). We call a genetic operator simple if both its description length and the cardinality of its parent set are constant (w.r.t. the phenotype dimensionality 𝑛). Clearly, the above examples are simple, which matches our intuition, since they are some of the most basic operators in EAs. These operators are also simple in the colloquial sense: They are easy to explain, implement, and apply in a wide variety of settings without invoking too much domain-specific or encoding-specific complexity. Operators that are not simple include model-based EAs such as EDAs [36, 37, 67], linkage-based algorithms [30, 86], and evolution strategies [34, 94], which construct Ω(𝑛)-size probabilistic models from Ω(𝑛) genotypes, and use these models to generate new candidates. Such models usually have a restricted structure, but in theory could model any phenotype distribution. This paper shows that SGOs are also fundamentally powerful, as long as the encoding is sufficiently complex. SGOs are more in line with how evolution operates in nature, as well as the original motivation for GAs [40]. They also avoid the pitfalls of a central algorithmic bottleneck: The operations of variation in the algorithm can be exceedingly simple, and can still result in powerful complex behavior via complexity accumulating in genotypes. This idea is made formal in the next section.

3 EXPRESSIVE ENCODINGS
The behavior of evolutionary algorithms can be described by their transmission function, which probabilistically maps parent phenotypes to child phenotypes [1, 7, 73]. EAs that use expressive encodings can, in principle, yield the behavior of any transmission function, making them general and powerful generative systems.

Definition 1 (Expressive Encoding). An encoding 𝐸 : 𝑋 → 𝑌 is expressive for a simple genetic operator 𝑔 if, for any set of parent phenotypes {𝑦_1, …, 𝑦_{𝑛_𝑔}} = 𝑌_𝑝 ⊂ 𝑌, any probability density 𝜇 over 𝑌, and any 𝜖 > 0, there exists a set of parent genotypes {𝑥_1, …, 𝑥_{𝑛_𝑔}} = 𝑋_𝑝 ⊂ 𝑋 such that 𝐸(𝑥_𝑖) = 𝑦_𝑖 ∀𝑦_𝑖 ∈ 𝑌_𝑝, and

| Pr[𝐸(𝑔(𝑋_𝑝)) = 𝑦] − 𝜇(𝑦) | < 𝜖 ∀𝑦 ∈ 𝑌.   (2)

In other words, starting from any initial set of parent phenotypes, a single application of the genetic operator can generate any distribution of child phenotypes. This paper focuses on the case where 𝜇 is a discrete distribution, but the ideas can be extended to the continuous case. The complexity of an expressive encoding with a particular genetic operator is the genome size required to achieve the desired level of approximation. Expressive encodings have the satisfying property that the current definition of the evolutionary process must be stored in the genomes themselves. There is no reliance on a large external controller of evolution; the only external algorithmic information is in SGOs; the process and artifacts of evolution are stored at a single level: in the population.

It may seem that all expressive encodings are indirect encodings [78, 79, 81]. However, if the phenotype space is sufficiently expressive, such as in the case of searching for a neural network to perform a task, then a direct encoding may be an expressive encoding. This property is demonstrated for the case of neural networks in the next section (see Theorem 4.6). There are other kinds of systems capable of universal approximation of probability distributions, such as neural generative models (e.g., the generator in a GAN architecture [33, 57]). If EAs were not capable of such universality, then one should be hesitant to use them as a problem-solving tool or engine of a generative system. Fortunately, they are, as is shown next.

4 EXPRESSIVITY OF GP, NN, DIRECT, AND UNIVERSAL APPROXIMATOR ENCODINGS
This section provides constructions that illustrate the fundamental power of expressive encodings, including showing that NNs and GP are expressive for uniform crossover and single-point mutation, while the direct encoding is not. A key mechanism is the representation of switches in the genome. This mechanism is first introduced in the special case of analyzing big jumps in GP, NN, and direct encodings, and the results are then generalized to full universal approximation.

4.1 Special Case: Miracle Jumps
There is a common hope in evolutionary computation of stumbling upon ‘big jumps’ that end up being useful. There is a large theory literature on how well algorithms handle ‘jump functions’, i.e., problems where there are local optima from which a jump must be taken to reach the global optimum [3, 14, 93]. This subsection shows how expressive encodings can enable reliable jumps of maximal size, a useful feature for EAs in general. We call this behavior a miracle jump because, if it were to occur in a standard evolutionary algorithm with a direct encoding, it would be considered a miracle.

Let 𝑌 = {0, 1}^𝑛. Given an encoding 𝐸 and genetic operator 𝑔, the goal is to find 𝑋_𝑝 such that 𝐸(𝑥) = [0, …, 0] ∀𝑥 ∈ 𝑋_𝑝 and Pr[𝐸(𝑔(𝑋_𝑝)) = [1, …, 1]] = Θ(1). In other words: Find parents whose phenotypes are all zeros, but who generate a child of all ones with probability that does not go to zero as 𝑛 increases. To solve this problem, the encoding must be capable of encoding a kind of switch that allows non-trivial likelihood of flipping to the all-ones state.

4.1.1 Genetic Programming. Consider the two parent programs shown in Figure 1(a). The two parents are equivalent except for their values of 𝑎 and 𝑏. So, uniform crossover 𝑔(𝑋_𝑝) results in a 1/4 chance that 𝑎 = 𝑏 = 1 =⇒ 𝐸(𝑔(𝑋_𝑝)) = [1, …, 1].

4.1.2 Neural Networks. Consider the two parent NNs shown in Figure 1(b). The two parents are equivalent except for the values of each of their two weights in the first layer. With uniform crossover, there is a 1/4 chance that the weights of the first layer of the child are all 1, resulting in a phenotype of [1, …, 1].
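The two-control-bit switch behind these constructions can be checked exactly by enumerating all crossover outcomes; the sketch below is our own re-encoding of the Figure 1(a) programs (the helper names are ours), not the paper's GP representation:

```python
from itertools import product

N = 1000  # phenotype dimensionality; the jump probability does not depend on it

def express(genome):
    """Decode the two control bits, as in the Figure 1(a) programs:
    all ones if a + b > 1, all zeros otherwise."""
    a, b = genome
    return [1] * N if a + b > 1 else [0] * N

parent1, parent2 = (0, 1), (1, 0)
# Both parents' phenotypes are all zeros.
assert express(parent1) == [0] * N and express(parent2) == [0] * N

# Enumerate the equally likely uniform-crossover children of the control bits.
children = [(a, b) for a, b in product(*zip(parent1, parent2))]
jumps = sum(express(c) == [1] * N for c in children)
print(jumps / len(children))  # 0.25, independent of N
```

Only the child that inherits a = 1 from Parent 2 and b = 1 from Parent 1 fires the switch, giving the constant 1/4 maximal-jump probability.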
[Figure 1 graphic: in panel (a), Parent 1 sets a = 0, b = 1 and Parent 2 sets a = 1, b = 0; both then run “if a + b > 1: return [1,...,1] else: return [0,...,0]”. Panels (b) and (c) show the corresponding NN and directly encoded parent pairs.]
Figure 1: Miracle Jump Parents. (a) Two GP parents whose phenotypes are all 0’s, but whose crossover has maximal jump (to a child of all 1’s) with probability 0.25. They differ only in their values of 𝑎 and 𝑏; the probability is independent of phenotype dimensionality. (b) Two NN parents with this same property; they differ only in the weights in the second layer. (c) Directly encoded parents cannot have this property: If both parents are all 0’s, their crossover cannot yield all 1’s. This minimal example illustrates how expressive encodings yield high-dimensional structured behavior that direct encodings cannot capture.
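The claim in panel (c) can be quantified directly: under uniform mutation each bit flips i.i.d., so the all-zeros-to-all-ones jump needs every bit to flip at once. A small illustrative check (the flip rate p is our arbitrary choice):

```python
# Jump probability from all zeros to all ones under uniform mutation is p**n,
# since all n bits must flip simultaneously.
p = 0.5  # a generous per-bit flip rate, for illustration
for n in (10, 100, 1000):
    print(n, p ** n)  # vanishes exponentially in n
```

This contrasts with the constant 1/4 jump probability of the expressive parents above.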
4.1.3 Direct Encoding. Since there are no 1’s in either parent (Figure 1(c)), this solution is impossible for crossover to find. Single-point mutation only changes a single bit, but even uniform mutation fails: It flips bits i.i.d. with probability 𝑝 < 1, and lim_{𝑛→∞} 𝑝^𝑛 = 0.

The GP and NN constructions above are built on the idea of having a large part of the genome completely shared between the two parents, and then having some auxiliary values that are unshared and function as control bits with a high-level influence on how the phenotype is generated. This kind of construction is used to generalize these results in the next subsection.

4.2 General Case: Universal Approximation
This subsection shows that certain encodings are expressive with SGOs. Unless otherwise noted, 𝑌 is the phenotype space of binary vectors 𝑌 = {0, 1}^𝑛. Let y_𝑖 (∀𝑖 ∈ [1, …, 𝑚]) be the desired phenotypes in the child distribution, and 𝑝_𝑖 be their associated probabilities. Notice that, since there can be no more than 1/𝜖 𝑝_𝑖’s such that 𝑝_𝑖 ≥ 𝜖, any construction need only assign nonzero probability to at most the top 1/𝜖 most probable y_𝑖’s to achieve an approximation error of 𝜖. While expressivity can be demonstrated with many different constructions, the goal of this section is to provide constructions that are both intuitive and highlight where the power of expressive encodings comes from. Sketches of the proofs are provided below. Detailed proofs are included in Appendix A.

Theorem 4.1. Genetic programming is an expressive encoding for uniform crossover, with complexity 𝑂(𝑚𝑛 − 𝑚 log 𝜖).

Proof sketch. Take the parents in Figure 2(a). They differ only in their values of a, and their child’s value of a can be viewed as an integer sampled uniformly from {0, …, 2^dim(a) − 1}. If dim(a) > ⌈lg(1/𝜖) + 1⌉, then the t_𝑖’s can be chosen so that 𝜇(y_𝑖) probability is apportioned to each y_𝑖 with error less than 𝜖. □

Theorem 4.2. Genetic programming is an expressive encoding for single-point mutation, with complexity 𝑂(𝑚𝑛/𝜖).

Proof sketch. Take the parent in Figure 2(b). Let dim(a_𝑖) be proportional to 𝜇(y_𝑖), and then scale up all dim(a_𝑖) so the chance of a mutation in a y_𝑖 is sufficiently small. □

Notice that these constructions resemble the Pitts GP model [39, 66, 74, 88]. Neither of the above constructions depends on the y_𝑖’s being drawn from a particular phenotype space. The y’s could have any structure and be drawn from any phenotype space, and a similar construction could be provided, highlighting the fact that expressive encodings enable powerful EA behavior that can be general across problem domains.

Note also that the construction for crossover is asymptotically more compact than that for mutation: −𝑚 log 𝜖 = 𝑚 log(1/𝜖) vs. 𝑚/𝜖 in the first term. This result suggests that crossover possibly has an inherent advantage over mutation: It achieves equivalently complex behavior with a more compact genome. An interesting question is: Is this observation reflected in biological evolution? Importantly, the same phenomenon emerges for NNs:

Theorem 4.3. Feed-forward neural networks with sigmoid activation are an expressive encoding for uniform crossover, with complexity 𝑂(𝑚𝑛 − log 𝜖).

Proof sketch. Take the parents in Figure 3(a), who differ only in their second-layer weights. The input to the child’s bottleneck is proportional to a uniformly-sampled integer. Choosing threshold biases and high enough 𝑐_1, 𝑐_2, and 𝑐_3 results in a mutually-exclusive switch for each y_𝑖 that fires with the correct probability. □

In the case of NNs, let single-point mutation be a Gaussian mutation, i.e., a single weight is selected uniformly at random and modified by the addition of Gaussian noise. The expressivity construction is similar to that of crossover, leading to the following:

Theorem 4.4. Feed-forward neural networks with sigmoid activation are an expressive encoding for single-point mutation, with complexity 𝑂(𝑚𝑛/𝜖).

Proof sketch. Take Parent 2 from Figure 3(a). Choose 𝐿 so that mutation is very likely to occur in the first two layers. Since applying 𝑔_𝑚 there yields some continuous distribution over bottleneck output, suitable thresholds and switches can be created. □

The constructions so far give concrete ways to create parents to demonstrate expressivity. The next result generalizes these ideas to encodings built from any universal function approximator. The idea is that any universal function approximator can be extended to an expressive encoding via 𝐸_Ω, defined as follows:

Definition 2 (𝐸_Ω). Let Ω be any universal function approximator. Define 𝐸_Ω to be an encoding whose genotypes are of the form 𝜔(a), where a ∈ {0, 1}^𝐿, and 𝜔 ∈ Ω is a function 𝜔 : {0, 1}^𝐿 → 𝑌.
Figure 2: Universal GP Parents. (a) A template for two parents whose crossover can approximate any probability distribution over phenotypes. The parents differ only in their values of the variable a; (b) A similar template for which single-point mutation approximates any probability distribution over phenotypes. Each a_𝑖 may have a different length, and ‘par’ indicates the parity function. These templates can be used to show that GP is an expressive encoding.
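The mechanism behind the crossover template in Figure 2(a) can be checked numerically: the unshared bits a act as an integer sampled uniformly by crossover, and thresholds t_𝑖 carve its range into segments whose sizes match the target probabilities 𝜇(y_𝑖). The following is a hypothetical sketch (the names, the placeholder phenotypes, and the three-outcome distribution are ours) that enumerates all 2^L children exactly:

```python
from itertools import product

L = 8                                  # number of control bits that differ between parents
targets = ["y1", "y2", "y3"]           # placeholder phenotypes y_i
mu = [0.5, 0.3, 0.2]                   # target distribution mu(y_i)

# Thresholds t_i: cumulative shares of the 2^L equally likely integers.
cuts, acc = [], 0.0
for p in mu:
    acc += p
    cuts.append(round(acc * 2 ** L))

def express(a_bits):
    """Map the crossover-sampled integer to the first segment it falls in."""
    z = int("".join(map(str, a_bits)), 2)
    for y, cut in zip(targets, cuts):
        if z < cut:
            return y

# Parent 1 has a = all zeros and Parent 2 a = all ones, so every integer in
# {0, ..., 2^L - 1} is an equally likely crossover outcome.
counts = {y: 0 for y in targets}
for a in product([0, 1], repeat=L):
    counts[express(a)] += 1

probs = {y: c / 2 ** L for y, c in counts.items()}
for y, p in zip(targets, mu):
    assert abs(probs[y] - p) < 2 ** -L  # error shrinks as L grows
```

Growing L by one bit halves the worst-case apportioning error, matching the O(−log 𝜖) dependence in Theorem 4.1.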
Theorem 4.5. 𝐸_Ω is an expressive encoding for uniform crossover.

Proof sketch. Choose a large enough 𝐿 = Θ(−lg 𝜖). Then, let 𝑥_1 = 𝜔([0, …, 0]) and 𝑥_2 = 𝜔([1, …, 1]), for a suitable 𝜔 ∈ Ω. □

So, if you have a favorite class of models that is expressive in terms of its function approximation capacity, it can be turned into a potentially powerful evolutionary substrate.

The constructions so far have used indirect encodings, in that the phenotype space (e.g., binary vectors) is of a different form than the genotype space (e.g., GP, NN, or 𝜔(a)). However, an encoding need not be indirect to be expressive, as is demonstrated by the following case of direct encoding of neural networks.

Theorem 4.6. Direct encoding of feed-forward neural networks with sigmoid activation is an expressive encoding for uniform crossover.

Proof sketch. Take the parents in Figure 3(b). They differ only in the biases in the first layer of nodes. If this layer is large enough, ℎ′′ has enough information from a child’s biases to decide which function the overall NN should compute. □

This section has shown that sufficiently complex evolutionary encodings, in particular NN and GP, are expressive. Another way of seeing the power of this property is to consider any stochastic process (e.g., an EA) that samples from distribution 𝜇 after 𝑇 steps: An expressive encoding can simulate this behavior in one step. This view gives us an analogy to a powerful result from the NN meta-learning literature: Given a distribution of tasks, there exists a neural network that can learn any task from this distribution with a single step of gradient descent [25]. This connection is meaningful: while meta-learning should be able to encode any learning process, evolution should be able to encode any phenotype sampling process.

The fact that the above encodings are expressive with single-point mutation, also known as random local search [21], is remarkable. Thanks to expressive encodings, random local search in the genotype space leads to maximally global evolutionary sampling in the phenotype space. The next section will show that this property can be used to solve challenging problems where the standard direct encoding is likely to fail.

5 ADAPTATION AND CONVERGENCE
Section 4 showed how arbitrarily complex behavior is possible with a single application of an SGO when encodings are expressive. An immediate question is: Can this power actually be exploited to solve challenging problems with evolution? This section considers some of the most frustrating kinds of problems for the standard GA: those where there is high-level, high-dimensional structure that can only be exploited if the population develops to a point where particular high-dimensional jumps emerge with reliable probability. Standard GAs struggle to achieve such an emergence, since, as the dimensionality of the required jump increases, its probability vanishes. This section considers three problems that cover the different ways in which such high-level, high-dimensional structure can appear: It can be temporal and deterministic, temporal and stochastic, or spatial. In all three cases, expressive encodings, implemented through GP, yield striking improvements over direct encodings.

5.1 Problem Setup
As is common for ease of analysis [17, 55, 60, 95], this section considers the (1+𝜆)-EA, with 𝜆 ∈ {1, 2} (pseudocode is provided in Appendix B). In this algorithm, during each generation, 𝜆 candidates are independently generated by mutating the current champion. One of the best candidates replaces the champion if it meets the acceptance criteria, which is usually that it has higher fitness. When the fitness function is dynamic, the fitness of the champion is evaluated in every generation along with the fitnesses of the candidates. To evaluate algorithms on dynamic fitness functions, the concept of adaptation complexity needs to be defined.

Definition 3 (Adaptation Complexity). The adaptation complexity of an algorithm 𝐴 on a dynamic fitness function 𝑓 is the expected time 𝐴 spends away from a global optimum vs. at a global optimum in the limit, i.e., the proportion of time it spends adapting. Formally, if 𝑓*_𝑡 is the maximum value of 𝑓 at time 𝑡, and 𝑓(𝐸(𝑥_𝑡)) is the best fitness in the population at time 𝑡, the adaptation complexity is

E[ lim_{𝑡→∞} |{𝑡′ < 𝑡 : 𝑓(𝐸(𝑥_{𝑡′})) ≠ 𝑓*_{𝑡′}}| / |{𝑡′ < 𝑡 : 𝑓(𝐸(𝑥_{𝑡′})) = 𝑓*_{𝑡′}}| ].   (3)
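The (1+𝜆)-EA loop described in Section 5.1 can be rendered in a few lines; this is our own minimal sketch (not the paper's Appendix B pseudocode), using single-point bit-flip mutation on a direct encoding and, as an illustrative assumption, a static all-ones target:

```python
import random

def one_plus_lambda(fitness, n, lam=2, generations=2000, seed=0):
    """(1+lambda)-EA: each generation, lam mutants of the champion are
    scored; the best mutant replaces the champion if it is at least as fit."""
    rng = random.Random(seed)
    champ = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(generations):
        candidates = []
        for _ in range(lam):
            child = list(champ)
            child[rng.randrange(n)] ^= 1   # single-point mutation
            candidates.append(child)
        best = max(candidates, key=fitness)
        if fitness(best) >= fitness(champ):
            champ = best
    return champ

n = 20
fitness = lambda y: sum(y)  # bits matching the all-ones target
champ = one_plus_lambda(fitness, n)
assert fitness(champ) == n  # the champion reaches the optimum
```

For a dynamic fitness function as in Definition 3, the same loop applies, with the champion re-evaluated each generation and the fraction of generations spent below 𝑓*_𝑡 tallied.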
Figure 3: Universal NN Parents. (a) A class of parent NNs that can approximate a given probability distribution over phenotypes when recombined with uniform crossover (Theorem 4.3). Values are chosen for 𝑐_1, 𝑐_2, and 𝑐_3 such that for the desired phenotypes there are mutually-exclusive switches that fire with the corresponding probabilities. Parent 2 can also be used as the parent in the case of mutation (Theorem 4.4). (b) A direct encoding of two NN parents that can approximate any probability distribution over NN functions when recombined with uniform crossover. Solid lines indicate a weight of 1, dotted lines 0. The numbers in the units of the first hidden layer indicate the bias of that unit. A generated child will have biases sampled uniformly from {0, 1}^𝐿, and the remainder of the network, ℎ′′, decides which function the entire network computes based on the sampled values. As 𝐿 increases, ℎ′′ can better approximate the desired distribution. Thus, NNs can implement an expressive encoding, even when they themselves are based on a direct encoding.

The three problems are considered in order of ease of analysis. It may seem counter-intuitive to consider dynamic fitness functions
before static fitness functions, but it turns out that the dynamism of the fitness function provides an exploratory power that expressive encodings are able to exploit naturally, but direct encodings cannot exploit at all (other advantages of dynamic fitness have been observed in prior work [12, 46]). As before, sketches of the proofs are provided in this section; details are included in Appendix C.

5.2 Challenge Problems

Problem 1. Deterministic Flipping Challenge (DFC): Take any two target phenotypes y∗1, y∗2 ∈ {0, 1}^𝑛, where y∗2 is the complement of y∗1. At time 𝑡 = 0, the current target vector is y∗1. The fitness is the number of bits in the phenotype that match the target. If the fitness of the champion is 𝑛, then at the next time step the current target vector flips to the other target vector.

The difficulty in this problem is clear: As soon as the maximum fitness is achieved, the target changes. This kind of situation arises naturally in continual evolution, in which new problems arise over time, and the algorithm is initialized at its solution to the previous problem [6, 24, 27, 58, 63, 90]. For example, imagine a neural architecture search system in which new machine learning problems arise over time. The algorithm cannot be restarted from scratch for each problem, because it is so expensive to run, so it picks up from where it left off. Partial attempts at such a system have been made previously [31]. Problem 1 can also be viewed as a dynamic version of the OneMax problem; other such versions have been considered in the past [23, 44, 84], but none allow for this extreme of a change in the target vector. More generally, this test problem evaluates an algorithm's ability to avoid 'catastrophic forgetting': it should not be too quick to forget useful past solutions, and ideally, it should be able to recover them in constant time.

For the expressive encoding, consider GP genotypes with the structure shown in Figure 4(a), where a, b, c are bit vectors of length dim(a) = 𝐿, dim(b) = dim(c) = 𝑛. So, the evolvable genome is defined by 𝑥 = (a, b, c), and has length 2𝑛 + 𝐿. The champion is initialized with random bits. It turns out that GP results in a super-exponential speed-up: While the direct encoding takes 𝑂(𝑛 log 𝑛) steps to adapt to the new target, the GP encoding, initially ignorant of y∗1 and y∗2, spends only constant time.

Theorem 5.1. The (1+𝜆)-EA with direct encoding has Θ(𝑛 log 𝑛) adaptation complexity on the deterministic flipping challenge.

Proof sketch. This follows from a standard coupon collector argument [17]. □

Theorem 5.2. The (1+𝜆)-EA with GP encoding has 𝑂(1) adaptation complexity on the deterministic flipping challenge.

Proof sketch. Let dim(a) = 𝐿. Say par(a) = 1 and 𝑓(b) > 𝑓(c). Then, the algorithm only accepts mutations that move b towards y∗1. Once b = y∗1 and a is mutated, c will converge to y∗2. Then, if 𝜆 = 2 and 𝐿 is large enough, the chance of making a mistake in b or c is so small that the expected time spent on repairs is constant. □

Thus, the expressive encoding yields asymptotically optimal adaptation complexity. The reader may be concerned that this encoding takes longer to initially reach a global optimum than the direct encoding. There are many ways to address this issue, e.g., by including mutations that enable 𝐿 to grow in size over time. However, the next problem shows an even sharper advantage: Direct
(a)
    y = 0
    if par(a1): y ⊕= b
    if par(a2): y ⊕= c
    return y
(b)
Figure 4: Adaptation and Convergence. (a-b) Genotype templates for the encoding used in (a) Problems 1 and 2 and (b) Problem 3. The a∗, b, and c are evolvable bit vectors with dim(b) = dim(c) = 𝑛. Distinct structure is learned in b and c, whose coupling can then be exploited, temporally or in the phenotype space. (Note that these encodings are isomorphic to NNs with multiplicative units [69] and weights in {−1, 1}.) (c-d) Experimental results for Problems 1 and 2 with 𝑛 = 16, 𝜆 = 2, and 𝐿 = 10^5. Consistent with its 𝑂(1) adaptation complexity (Thm. 5.2 and 5.4), the GP encoding spends an increasing proportion of time at optimal fitness; the direct encoding does not (mean, 90% conf. over 100 trials). (e) Experimental results for Problem 3 with 𝑛 = 64, 𝜆 = 1, and 𝑅 = 10^5, showing max, mean, and min over 100 trials. The GP encoding converges on all trials, while the direct encoding converges on none of them. Thus, with expressive encodings, even simple genotypes can lead to massive improvements.
encoding spends exponential time away from the optimum, while the expressive encoding is asymptotically unaffected.

Problem 2. Random Flipping Challenge (RFC): Take any two target phenotypes y∗1, y∗2 ∈ {0, 1}^𝑛, where y∗2 is the complement of y∗1. At each time 𝑡 the current target vector is selected to be y∗1 or y∗2 with 0.5 probability each. The fitness is the number of bits in the phenotype that match the target.

This is the same as Problem 1, but with the target vector flipping randomly at each generation.

Theorem 5.3. The (1+𝜆)-EA with direct encoding and 𝜆 = 2 has Ω(2^(𝑛/2)) adaptation complexity on the random flipping challenge.

Proof sketch. Lower bound the hitting time of either target, using the fact that when within 𝑛/4 bits of the target, the chance of moving towards it is less than half that of moving away. □

Theorem 5.4. The (1+𝜆)-EA with GP encoding and 𝜆 = 2 has 𝑂(1) adaptation complexity on the random flipping challenge.

Proof sketch. With 𝜆 = 2, 𝐿 can be chosen large enough so the chance of moving towards the correct target is always more than twice that of moving away, which can be used to upper bound the hitting time. Then, as in Theorem 5.2, once b and c have initially converged, the expected time spent on repairs is constant. □

The theoretical results for Problems 1 and 2 are validated experimentally in Figure 4(c-d). Consistent with its 𝑂(1) adaptation complexity, the GP encoding spends an increasing overall percentage of time at optimal fitness. The power of expressive encodings also manifests on static fitness functions, like the one in Problem 3.

Problem 3. Large Block Assembly Problem (LBAP): In this problem, there are two targets y∗1, y∗2 ∈ {0, 1}^(𝑛/2) hidden somewhere amongst the 𝑛 bits at non-overlapping indices, with |y∗1|, |y∗2| > 1. If the solution contains both targets the fitness is 𝑛, otherwise it is the maximum number of bits matched to either target.

That is, there are solutions of size Θ(𝑛) in disjoint subspaces that must be discovered and combined. This problem is challenging for a direct encoding: Once one target is found there is no fitness gradient until all remaining 𝑛/2 bits are matched, which takes exponential time to happen by chance. Since the fitness function is fixed, the metric of interest is simply expected time to reach the optimum.

For this problem, there is no dynamism in the fitness function to provide exploratory power to the (1+𝜆)-EA, so some basic mechanisms must be added to prevent it from getting stuck. Instead of simply accepting if the candidate has higher fitness, the champion is replaced if any of the following conditions are met:

• Fitness. The candidate has greater fitness.
• Diversity. The candidate phenotype is further than Hamming distance one from the champion phenotype.
• Sparsity. The candidate phenotype has equal fitness and fewer ones than the champion phenotype.

With these rules, the algorithm can still be subject to deception, so the champion is reinitialized after 𝑅 steps if it has not yet converged. Call this adjusted algorithm (1+𝜆)-EA*. The pitfalls of direct encoding are too great for these methods to help much in that case.

Theorem 5.5. The (1+𝜆)-EA* with direct encoding and 𝜆 = 1 converges in Ω(2^𝑛) steps on the large block assembly problem.

Proof sketch. Once one target is found, the algorithm is stuck, so both targets can only be found together by chance. □

However, using genotypes with the GP structure shown in Figure 4(b), with dim(a1) = dim(a2) = 1, leads to tractability.

Theorem 5.6. The (1+𝜆)-EA* with GP encoding and 𝜆 = 1 converges in 𝑂(𝑛^(3+𝑜(1))) steps on the large block assembly problem.

Proof sketch. Consider convergence paths where a1 = a2 = 1 only occurs at the final step. Compute the probability of such convergence, and use restarts to get the expected run time. □

These theoretical conclusions are demonstrated experimentally in Figure 4(e). In 100 independent trials, with the direct encoding,
the algorithm never reaches beyond half the optimal fitness; with GP encoding it always makes the final large jump before the 2M evaluation limit.

5.3 Extensions

The algorithms in this section worked with fixed GP structures, evolving the values within them. This approach is sufficient to demonstrate the power and potential of expressive encodings; future work will analyze methods that evolve the structure of representations as well. However, note that the above structures are each of a constant size: One could simply enumerate all such structures, trying the algorithm on each, and only pay a constant multiplicative cost, not affecting the asymptotic complexities. Although this is a brute-force approach, it suggests that powerful structures may actually not be so difficult to find, and indeed meaningful structures are commonly found by existing GP methods [68, 71].

The problems in this section all required jumps of size Θ(𝑛). Prior work with direct encodings has sought to tackle larger and larger jumps, but they are still generally sublinear [3, 14]. The closest comparison with SGOs [91] had jumps of size 𝑂(√𝑛), but required a representation that was a priori well-aligned for two-point crossover, and an island model with a number of islands dependent on 𝑛. In contrast, with an expressive encoding, successful jumps of size Θ(𝑛) can be achieved with only single-point mutation in a (1+1 or 2)-EA, and simple sparsity and diversity methods.

The diversity mechanism used for Problem 3 could also be used to generalize the first two problems to any two target vectors. This simple diversity mechanism is an instance of a behavior domination approach [59], in that solutions at least a fixed distance from the current solution are non-dominated.

6 DISCUSSION AND FUTURE WORK

The definitions and analysis of expressive encodings in this paper can be seen as a starting point for future work in several areas:

Biological Interpretation. Because they can capture complex reproductive distributions, expressive encodings could in general be more accurate than direct encodings in representing complex models of biological evolution. If genetic regulatory networks are universal function approximators (models of them often are, e.g., recurrent NNs, ODEs, and boolean networks [12, 45, 96]), then Theorem 4.5 suggests how natural evolution can become arbitrarily powerful over time. In the expressive encodings with crossover constructions in Section 4, the high level of shared structure across parents is also consistent with nature. Humans share >99% of their DNA, and high proportions of DNA are even shared across species [11, 35, 72]. This construction may partly explain why crossover is so prominent in biology: It works best when a large part of the genome is shared. Another surprising result from Section 5 is that for expressive encodings in dynamic environments, generating multiple offspring per generation is critical to achieving stable performance. This result makes biological sense as well: Even when the probability of a good offspring is high, if there is any chance that a bad one can set you back considerably due to latent nonlinearities in the genome, it is prudent to have multiple offspring.

Theory. The theory in this paper did not use any deep EA theory methods, such as drift methods [20, 22, 55]. Such methods could enable rapid extension of these results to other encodings, operators, and test domains. To highlight the mechanisms that make expressive encodings powerful, the test domains were idealized to contain two complementary targets. Future work can generalize this approach to more targets, and to compositions of targets that have not been seen before, akin to generalization in machine learning. Section 5 demonstrated a type of stability, i.e., preventing catastrophic forgetting in particular cases. How far can such results be generalized? For instance, can catastrophic forgetting be asymptotically avoided in individuals as complex as the ones constructed in Section 4? What other kinds of stability are possible?

Practice. In building practical applications, it is not clear whether the first step should use incrementally growing encodings like those in GP and NEAT [80], or giant fixed structures with many evolvable parameters. The EA community has tended to prefer incremental encodings, but deep learning has seen remarkable success via huge fixed structures with canonical form and learnable parameters. Perhaps expressive encodings with a fixed structure would be an easier place to start, since understanding the behavior of the system may be easier without the variable of dynamic genotypic structure. Also, one of the common motivations for indirect encodings is that the genotype can be smaller than the phenotype. However, in the constructions of Section 4 the genotype is generally much larger than the phenotype, which suggests that directly evolving huge genotypes could be promising (similar to deep GA [83]).

Open-ended Evolution. The power of expressive encodings comes from the ability of the transmission function to improve throughout evolution. This phenomenon has been previously explored in GP [1, 42] and neuroevolution [43, 54], and insights there should be useful in analyzing the behavior of expressive encodings more generally. E.g., motivated by Evolvability-ES, a construction related to miracle jumps for directly-encoded NNs has been developed for i.i.d. perturbations [28, Thm. S6.1]. As a longer-term opportunity, the promise of expressive encodings to continually complexify and innovate upon the transmission functions indicates that they should be a good substrate for open-ended evolution [38, 75, 77, 82, 85, 90], and, likewise, insights from open-endedness could be useful in developing more practical SGO + Expressive Encoding algorithms. Thus, the long-term ideal is an algorithm that never needs to be restarted: press 'go' once and over time it becomes better and better at solving problems as they appear, accumulating problem-solving ability in the genotypes themselves. Nature has provided an existence proof of such an algorithm. For organisms to become boundlessly and efficiently more complex over time, the mapping from parents to offspring must become more complex over time. Expressive encodings are an important piece of this puzzle.

7 CONCLUSION

This paper identifies expressivity as a fundamental property that makes encodings powerful. It allows local changes in the genotype to have global effects in the phenotype, which is useful in making large jumps in particular, and approximating arbitrary probability distributions in general. Direct encodings are not generally expressive, whereas genetic programming and neural network encodings are. As demonstrated in three problems illustrating different challenges, expressive encodings may bring striking improvements in
[53] Joel Lehman and Kenneth O Stanley. 2012. Beyond open-endedness: Quantifying impressiveness. In ALIFE 2012: The Thirteenth International Conference on the Synthesis and Simulation of Living Systems. MIT Press, 75–82.
[54] Joel Lehman and Kenneth O Stanley. 2013. Evolvability is inevitable: Increasing evolvability without the pressure to adapt. PloS ONE 8, 4 (2013), e62186.
[55] Johannes Lengler. 2020. Drift analysis. In Theory of Evolutionary Computation. Springer, 89–131.
[56] Antonios Liapis, Héctor P Martínez, Julian Togelius, and Georgios N Yannakakis. 2021. Transforming exploratory creativity with DeLeNoX. arXiv preprint arXiv:2103.11715 (2021).
[57] Yulong Lu and Jianfeng Lu. 2020. A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions. In Advances in Neural Information Processing Systems, Vol. 33. 3094–3105.
[58] Benno Lüders, Mikkel Schläger, and Sebastian Risi. 2016. Continual learning through evolvable neural turing machines. In NIPS 2016 Workshop on Continual Learning and Deep Networks (CLDL 2016).
[59] Elliot Meyerson and Risto Miikkulainen. 2017. Discovering evolutionary stepping stones through behavior domination. In Proceedings of the Genetic and Evolutionary Computation Conference. 139–146.
[60] Elliot Meyerson and Risto Miikkulainen. 2019. Modular universal reparameterization: Deep multi-task learning across diverse domains. Advances in Neural Information Processing Systems 32 (2019), 7903–7914.
[61] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. 2019. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 293–312.
[62] Melanie Mitchell, JH Holland, and S Forrest. 1991. The royal road for genetic algorithms: Fitness landscapes and GA performance. Technical Report. Los Alamos National Lab., NM (United States).
[63] Frank Neumann and Carsten Witt. 2015. On the runtime of randomized local search and simple evolutionary algorithms for dynamic makespan scheduling. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[64] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. 2015. Innovation engines: Automated creativity and improved stochastic optimization via deep learning. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. 959–966.
[65] Patryk Orzechowski, William La Cava, and Jason H Moore. 2018. Where are we now? A large benchmark study of recent symbolic regression methods. In Proceedings of the Genetic and Evolutionary Computation Conference. 1183–1190.
[66] Una-May O'Reilly, Mark Wagy, and Babak Hodjat. 2013. EC-Star: A massive-scale, hub and spoke, distributed genetic programming system. In Genetic Programming Theory and Practice X. Springer, 73–85.
[67] Martin Pelikan, David E Goldberg, Erick Cantú-Paz, et al. 1999. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Vol. 1. Citeseer, 525–532.
[68] Michael Schmidt and Hod Lipson. 2009. Distilling free-form natural laws from experimental data. Science 324, 5923 (2009), 81–85.
[69] Michael Schmitt. 2002. On the complexity of computing and learning with multiplicative neural networks. Neural Computation 14, 2 (2002), 241–301.
[70] Jimmy Secretan, Nicholas Beato, David B D'Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. 2008. Picbreeder: Evolving pictures collaboratively online. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1759–1768.
[71] Hormoz Shahrzad and Babak Hodjat. 2015. Tackling the Boolean multiplexer function using a highly distributed genetic programming system. In Genetic Programming Theory and Practice XII. Springer, 167–179.
[72] Oleg Simakov, Jessen Bredeson, Kodiak Berkoff, Ferdinand Marletaz, Therese Mitros, Darrin T. Schultz, Brendan L. O'Connell, Paul Dear, Daniel E. Martinez, Robert E. Steele, Richard E. Green, Charles N. David, and Daniel S. Rokhsar. 2022. Deeply conserved synteny and the evolution of metazoan chromosomes. Science Advances 8, 5 (2022), eabi5884.
[73] Montgomery Slatkin. 1970. Selection and polygenic characters. Proceedings of the National Academy of Sciences 66, 1 (1970), 87–93.
[74] Stephen Frederick Smith. 1980. A learning system based on genetic adaptive algorithms. University of Pittsburgh.
[75] L Soros and Kenneth Stanley. 2014. Identifying necessary conditions for open-ended evolution through the artificial life world of Chromaria. In ALIFE 14: The Fourteenth International Conference on the Synthesis and Simulation of Living Systems. MIT Press, 793–800.
[76] Lee Spector and Alan Robinson. 2002. Genetic programming and autoconstructive evolution with the Push programming language. Genetic Programming and Evolvable Machines 3, 1 (2002), 7–40.
[77] Kenneth O Stanley. 2019. Why open-endedness matters. Artificial Life 25, 3 (2019), 232–235.
[78] Kenneth O Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. 2019. Designing neural networks through neuroevolution. Nature Machine Intelligence 1, 1 (2019), 24–35.
[79] Kenneth O Stanley, David B D'Ambrosio, and Jason Gauci. 2009. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life 15, 2 (2009), 185–212.
[80] Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[81] Kenneth O Stanley and Risto Miikkulainen. 2003. A taxonomy for artificial embryogeny. Artificial Life 9, 2 (2003), 93–130.
[82] Susan Stepney. 2021. Modelling and measuring open-endedness. Artificial Life 25, 1 (2021), 9.
[83] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567 (2017).
[84] Dirk Sudholt. 2018. On the robustness of evolutionary algorithms to noise: Refined results and an example where noise helps. In Proceedings of the Genetic and Evolutionary Computation Conference. 1523–1530.
[85] Tim Taylor, Mark Bedau, Alastair Channon, David Ackley, Wolfgang Banzhaf, Guillaume Beslon, Emily Dolson, Tom Froese, Simon Hickinbotham, Takashi Ikegami, et al. 2016. Open-ended evolution: Perspectives from the OEE workshop in York. Artificial Life 22, 3 (2016), 408–423.
[86] Dirk Thierens. 2010. The linkage tree genetic algorithm. In International Conference on Parallel Problem Solving from Nature. Springer, 264–273.
[87] Alan Mathison Turing et al. 1936. On computable numbers, with an application to the Entscheidungsproblem. J. of Math 58, 345-363 (1936), 5.
[88] Ryan J Urbanowicz and Jason H Moore. 2009. Learning classifier systems: A complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications 2009 (2009).
[89] vonbrand. 2019. Tight bound for.... https://math.stackexchange.com/questions/3268900/tight-bound-for-e-sqrt-log-n Accessed: 2022-04-11.
[90] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeffrey Clune, and Kenneth Stanley. 2020. Enhanced POET: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. In International Conference on Machine Learning. PMLR, 9940–9951.
[91] Richard A Watson and Thomas Jansen. 2007. A building-block royal road where crossover is provably essential. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. 1452–1459.
[92] Darrell Whitley. 2019. Next generation genetic algorithms: A user's guide and tutorial. In Handbook of Metaheuristics. Springer, 245–274.
[93] Darrell Whitley, Swetha Varadarajan, Rachel Hirsch, and Anirban Mukhopadhyay. 2018. Exploration and exploitation without mutation: Solving the Jump function in Θ(𝑛) time. In International Conference on Parallel Problem Solving from Nature. Springer, 55–66.
[94] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. 2014. Natural evolution strategies. The Journal of Machine Learning Research 15, 1 (2014), 949–980.
[95] Carsten Witt. 2013. Tight bounds on the optimization time of a randomized search heuristic on linear functions. Combinatorics, Probability and Computing 22, 2 (2013), 294–318.
[96] Hanif Yaghoobi, Siyamak Haghipour, Hossein Hamzeiy, and Masoud Asadi-Khiavi. 2012. A review of modeling techniques for genetic regulatory networks. Journal of Medical Signals and Sensors 2, 1 (2012), 61.
[97] Xin Yao. 1999. Evolving artificial neural networks. Proc. IEEE 87, 9 (1999), 1423–1447.
A PROOFS FOR SECTION 4

Theorem 4.1. Genetic programming is an expressive encoding for uniform crossover, with complexity 𝑂(𝑚𝑛 − 𝑚 log 𝜖).

Proof. Let 𝑦1′, 𝑦2′ be the phenotypes of the two parent arguments to the uniform crossover operator 𝑔𝑐. Choose the parent genotypes 𝑥1′, 𝑥2′ to be the two programs shown in Figure 2(a). These two programs differ only in their value of a, so the child will be equivalent to its parents, except its value of a will be uniformly sampled from {0, 1}^𝐿.

When interpreted as a binary integer, this value of a is an integer uniformly sampled from [0, 2^𝐿 − 1], where 𝐿 is the length of the vector a. Let each t𝑖 be a binary vector of length 𝐿, each of which can also be interpreted as an integer, with t1 < t2 < . . . < t𝑚−1. Now, the approach is to choose 𝐿 large enough and then choose the t𝑖's so that each y𝑖 is generated when the integer a falls in a given range, which occurs with probability within 𝜖 of 𝑝𝑖.

Let 𝐿 = ⌈lg(1/𝜖)⌉ + 2 = Θ(− log 𝜖). Then, there are 2^𝐿 > 2/𝜖 integers, each sampled with probability 1/2^𝐿 < 𝜖/2. Let t1 = ⌊𝑝1 2^𝐿⌋ and t𝑖 = ⌊2^𝐿 Σ_{𝑗=1}^{𝑖} 𝑝𝑗⌋ ∀ 𝑖 ∈ [2, . . . , 𝑚 − 1]. Then, ∀ 𝑖 ∈ [2, . . . , 𝑚 − 1]

𝑝𝑖 − 𝜖/2 < Pr[𝐸_GP(𝑔𝑐(𝑥1′, 𝑥2′)) = y𝑖] ≤ 𝑝𝑖 .    (4)

In order to ensure that the parent genotypes generate their required phenotypes, the single integers 0 . . . 0 and 1 . . . 1 are assigned to y1′ and y2′. The probabilities of sampling these are each less than 𝜖/2, and are subtracted from the probabilities for y1 and potentially y𝑚, so that for 𝑗 ∈ {1, 𝑚},

𝑝𝑗 − 𝜖 < Pr[𝐸_GP(𝑔𝑐(𝑥1′, 𝑥2′)) = y𝑗] ≤ 𝑝𝑗 .    (5)

In the parent genotypes, each conditional contains 𝑂(𝐿 + 𝑛) bits (to encode t𝑖 and y𝑖) and there are 𝑚 conditionals. Since 𝐿 = 𝑂(− log 𝜖), the required total genotype size is 𝑂(−𝑚 log 𝜖 + 𝑚𝑛). □

Theorem 4.2. Genetic programming is an expressive encoding for single-point mutation, with complexity 𝑂(𝑚𝑛/𝜖).

Proof. Let 𝑦′ be the phenotype of the single parent argument to the single-point mutation operator 𝑔𝑚. Choose the parent genotype 𝑥′ to be the program shown in Figure 2(b), with each a𝑖 a binary vector of zeros [0, . . . , 0]. Since par(a𝑖) is false for all a𝑖 in 𝑥′, 𝐸(𝑥′) = y′. Our approach is to choose the length 𝐿𝑖 of each a𝑖 so that the probability of selecting a bit to flip in a𝑖 is approximately 𝑝𝑖.

When 𝑔𝑚 chooses a bit to flip, it selects the bit uniformly from the a𝑖's and the y𝑖's. The total size of the y𝑖's is 𝑚𝑛. So, let

Σ_{𝑖=1}^{𝑚} 𝐿𝑖 = 𝑚𝑛/(2𝜖).    (6)

Then, the 𝐿𝑖's can be apportioned proportionally to 𝑝𝑖 such that the chance of flipping a bit in a𝑖 out of all a's is within 𝜖/2 of 𝑝𝑖. The chance of instead flipping a bit in the y's is less than 𝜖/2, so the overall probability of flipping a bit in a𝑖 is within 𝜖 of 𝑝𝑖. In this construction, the size of the a's dominates the y's, giving us a complexity of 𝑂(𝑚𝑛/𝜖). □

Note that the parent in Figure 2(b) could have been more like Parent 1 in Figure 2(a), using a single long a with <. Instead, the parity function 'par' is used, because it demonstrates an alternative kind of construction, and because it is used again in Section 5.

Theorem 4.3. Feed-forward neural networks with sigmoid activation are an expressive encoding for uniform crossover, with complexity 𝑂(𝑚𝑛 − log 𝜖).

Proof. This construction is similar to the uniform crossover construction for genetic programming. Let the parents 𝑥1′, 𝑥2′ be defined by the four-layer neural networks shown in Figure 3(a), where each internal node is followed by sigmoid activation, and all biases are 0.

The two parents differ only in the values of their second layer of weights. Since all the input weights to the first layer are zero, the output of each node in the first layer is 0.5. Then, since each power of two is included with probability 1/2, uniform crossover makes the input to the bottleneck equal 0.5 times an integer sampled uniformly from [0, 2^𝐿 − 1]. Each such value uniquely defines the output of the bottleneck neuron.

Similarly to the GP case, a set of thresholds is then created. First, 𝑐1 is set to amplify the differences between successive outputs from the bottleneck, and the biases of the threshold units are then set so that they output nearly one if the threshold is met and nearly zero otherwise. For brevity, we say that the unit "fires" if its output is nearly one.

There is a switch neuron for each desired y𝑖. By setting 𝑐2 as large as we want, this switch will fire only if its lower threshold unit fires, but its upper one does not. Since the thresholds are monotonically increasing, exactly one switch unit will fire.

Finally, the weights from the switch for y𝑖 to the output of the whole network are 𝑐3(2y𝑖 − 1), with 𝑐3 arbitrarily large. So, if the switch from y𝑖 fires, the input to the 𝑗th output node will be negative if the 𝑗th element of y𝑖 is 0, and positive if the 𝑗th element of y𝑖 is 1. Therefore, after squashing through the final sigmoid, the output of the entire network will be nearly y𝑖, which after rounding yields y𝑖 exactly (see Equation 1). That is, the NN outputs the binary vector y𝑖.

Note that similar considerations as in the case of GP can be taken into account to ensure that 𝐸(𝑥1′) = y1′ and 𝐸(𝑥2′) = y2′.

As in the case of GP, it is necessary that 𝐿 = Θ(− lg 𝜖). There are Θ(− lg 𝜖) weights in the first two layers, 𝑂(𝑚) in the next two layers, and 𝑂(𝑚𝑛) in the final layer. So, in terms of number of weights, the complexity of this construction is 𝑂(𝑚𝑛 − log 𝜖). The apparent improvement over the GP construction comes from the fact that real-numbered weights were used instead of bits. □

Theorem 4.4. Feed-forward neural networks with sigmoid activation are an expressive encoding for single-point mutation, with complexity 𝑂(𝑚𝑛/𝜖).

Proof. Take the single parent 𝑥′ for mutation to be Parent 2 in Figure 3(a). In fact, in this case, the exact weights in the first two layers of 𝑥′ do not matter, as long as the weights in the second layer are non-zero. A weight in the first two layers is selected for mutation with probability of at least 1 − 𝜖/2, by choosing a large
strictly monotonic, so the input to ℎ′′ still uniquely identifies the input to the whole network. □

enough 𝐿 = Θ(𝑚𝑛/𝜖). Now, no matter what the weights are in the first two layers of 𝑥′, the stochastic process of (1) selecting one of these weights, and (2) adding Gaussian noise to this selected weight yields some continuous distribution over the output of the bottleneck unit. Similar to the case of crossover, the biases of the threshold units can be set to partition the output distribution of the threshold unit appropriately, again setting 𝑐2 and 𝑐3 high enough to make the output of each unit nearly one or nearly zero. It is also possible to include threshold units for the case of no mutation to ensure that 𝐸(𝑥′) = y′. Similar to the construction for GP with mutation, the complexity of this construction is dominated by 𝐿, i.e., the overall complexity is 𝑂(𝑚𝑛/𝜖). □

Theorem 4.5. 𝐸Ω is an expressive encoding for uniform crossover.

Proof. As in the previous constructions for uniform crossover, choose 𝐿 = Θ(− lg 𝜖). Let 𝑥1′ = 𝜔([0, . . . , 0]) and 𝑥2′ = 𝜔([1, . . . , 1]). Again, since the parents differ only in their values of a, the child will be of the form 𝜔(b), where b is drawn uniformly from {0, 1}^𝐿. It is clear from previous constructions that values in {0, 1}^𝐿 can be apportioned to assign accurate-enough probability to each 𝑦𝑖, and 𝜔 can thus be chosen so that it assigns probability in this way. That is, {0, 1}^𝐿 is partitioned into subsets 𝑆𝑖 so that |𝑆𝑖|/2^𝐿 ≈ 𝑝𝑖, and 𝜔 is chosen so that if b ∈ 𝑆𝑖 then 𝜔(b) = y𝑖, while also ensuring that 𝜔([0, . . . , 0]) = y1′ and 𝜔([1, . . . , 1]) = y2′. □

B THE (1+𝜆)-EVOLUTIONARY ALGORITHM

Algorithm 1 provides pseudocode for the algorithm used in Section 5. 𝐸 is the encoding and 𝑓 is the fitness function.

Algorithm 1 (1+𝜆)-EA
  𝑥𝑜 ← random initial genotype              ⊲ Initialize champion
  while not done do
      for 𝑖 ∈ 1, . . . , 𝜆 do
          𝑥𝑖 ← mutate(𝑥𝑜)                    ⊲ Generate candidates
      for 𝑖 ∈ 1, . . . , 𝜆 do
          if 𝑓(𝐸(𝑥𝑖)) > 𝑓(𝐸(𝑥𝑜)) then
              𝑥𝑜 ← 𝑥𝑖                         ⊲ Replace champion

C PROOFS FOR SECTION 5

The proofs in this section consider the (1+𝜆)-EA (Algorithm 1), with 𝜆 ∈ {1, 2}.

Theorem 5.1. The (1+𝜆)-EA with direct encoding has Θ(𝑛 log 𝑛) adaptation complexity on the deterministic flipping challenge.
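Algorithm 1 translates directly into code. The following Python sketch follows the pseudocode above; the bit-vector genotype, identity (direct) encoding, single-bit mutation, and OneMax-style fitness used to instantiate it are illustrative assumptions for this sketch, not constructions from the paper:

```python
import random

def one_plus_lambda_ea(E, f, mutate, init, lam=2, steps=2000, rng=None):
    """Generic (1+lambda)-EA per Algorithm 1: keep a champion genotype x_o,
    generate lam mutated candidates, and replace the champion whenever a
    candidate's phenotype E(x) achieves strictly higher fitness f."""
    rng = rng or random.Random(0)
    x_o = init(rng)                       # random initial genotype (champion)
    for _ in range(steps):
        candidates = [mutate(x_o, rng) for _ in range(lam)]
        for x_i in candidates:
            if f(E(x_i)) > f(E(x_o)):     # strict improvement replaces champion
                x_o = x_i
    return x_o

# Illustrative instantiation (an assumption, not the paper's GP/NN encodings):
# direct encoding on bit vectors, single-bit mutation, OneMax fitness.
n = 20
identity = lambda x: x                    # direct encoding: phenotype = genotype
onemax = sum                              # fitness: number of ones

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(n)]

def flip_one_bit(x, rng):
    child = list(x)
    child[rng.randrange(len(x))] ^= 1
    return child

best = one_plus_lambda_ea(identity, onemax, flip_one_bit, random_bits, lam=2)
```

Any other encoding can be plugged in through `E` and `mutate` without changing the loop, which is what lets the analyses in Appendix C vary only the encoding.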
Case 3: flip a bit of c, so 𝑓(𝑥𝑖′) = 0 = 𝑓(𝑥𝑖).

Only Case 2 is of concern, since in this case an important bit of b can be forgotten when the new candidate is accepted. In this case, up to 𝑛 − 1 mistakes can be made in b before a bit is finally flipped in a. The chance of making 𝑘 mistakes in a row vanishes quickly, but even in the worst case, it would only take 𝑂((𝑛 + 𝐿) log 𝑛) steps to make the repairs. So, the expected time needed to make repairs is less than the probability of making at least one mistake times the cost of making repairs, i.e., E(repairs) = Pr(mistake) · 𝑂((𝑛 + 𝐿) log 𝑛).

Suppose 𝜆 = 1. Then,

  Pr(mistake) = 𝑛 / (2𝑛 + 𝐿), and  (7)

  E(repairs) = 𝑂(𝑛 / (𝑛 + 𝐿)) · 𝑂((𝑛 + 𝐿) log 𝑛) = 𝑂(𝑛 log 𝑛).  (8)

That is, 𝑂(𝑛 log 𝑛) steps are spent on repairs every time the target flips, which is no better than the direct encoding case.

Suppose instead 𝜆 = 2. Then,

  Pr(mistake) = (𝑛 / (2𝑛 + 𝐿))², and  (9)

  E(repairs) = 𝑂((𝑛 / (𝑛 + 𝐿))²) · 𝑂((𝑛 + 𝐿) log 𝑛) = 𝑂(𝑛² log 𝑛 / (𝑛 + 𝐿)).  (10)

So, choosing 𝐿 = 𝜔(𝑛² log 𝑛) leads to

  E(repairs) = 𝑂(𝑛² log 𝑛 / (𝑛 + 𝐿)) = 𝑂(1).  (11)

Thus, both the expected time to hit Case 1 and the expected time to make any necessary repairs are constant. □

Theorem 5.3. The (1+𝜆)-EA with direct encoding and 𝜆 = 2 has Ω(2^{𝑛/2}) adaptation complexity on the random flipping challenge.

Proof. W.l.o.g., suppose y∗1 and y∗2 are the vectors of all zeros and all ones. Suppose the champion 𝑥 contains 𝑘 ones at time 𝑡, and generates candidate 𝑥′ at time 𝑡 + 1. If the target at time 𝑡 + 1 is all-ones, and a zero in 𝑥 is flipped to a one, then the change is accepted. This flip happens with probability (1/2) · (1 − (𝑘/𝑛)²). If the target at time 𝑡 + 1 is all-zeros, and a one in 𝑥 is flipped to a zero, then the change is accepted. This flip happens with probability (1/2) · (1 − ((𝑛 − 𝑘)/𝑛)²). The 𝑥 is kept with the remaining probability, which is greater than 1/2.

Now, if 𝑘 < (1 − (√12/2 − 1))𝑛 = (2 − √3)𝑛, then the chance of moving towards the target is less than half that of moving away; in particular, this holds for all 𝑘 ≤ 𝑛/4. Note that with 𝜆 = 1, this sufficient bound on 𝑘 is even higher, i.e., 𝑛/3. So, suppose 𝑘 = ⌊𝑛/4⌋, and let ℎ𝑖0 be the expected hitting time of reaching the all-zeros target from 𝑖 ones. Since each step toward the target is at most half as likely as a step away from it, a standard biased-random-walk argument gives ℎ𝑖0 = Ω(2^{𝑛/2}); that is, the pull towards the center is so strong that even when the champion is one bit away from a target, the expected time to a target is Ω(2^{𝑛/2}). When at a target, the chance of moving away from it is constant, so this time to again reach a target dominates the adaptation complexity. □

Theorem 5.4. The (1+𝜆)-EA with GP encoding and 𝜆 = 2 has 𝑂(1) adaptation complexity on the random flipping challenge.

Proof. W.l.o.g., assume y∗1 is the vector of ones, y∗2 is the vector of zeros, and suppose that |b| > |c| + 1 and par(a) = 1 in 𝑥0. The question is how long it takes for b to become all ones. The value |b| is improved if the target is all-ones, a zero is flipped in b, and the change is accepted; call this a fix. The value |b| is hurt if the target is all-zeros, a one is flipped in b, and the change is accepted; call this a break. Let 𝜆 = 2. A break occurs if both candidates make a mistake, or one does and the other is neutral (i.e., it flips a bit in c). So,

  Pr(break) = (1/2) · (|b|² + 2|b|𝑛) / (2𝑛 + 𝐿)².  (15)

The probability of a fix is the chance that at least one of the two candidates makes an improvement:

  Pr(fix) = (1/2) · (1 − ((2𝑛 + 𝐿 − (𝑛 − |b|)) / (2𝑛 + 𝐿))²).  (16)

Of interest is the ratio Pr(fix)/Pr(break). Choose 𝐿 > 3𝑛² − 5𝑛 + 3/2. Then, with some algebra, it can be seen that Pr(fix)/Pr(break) > 2 for all |b| < 𝑛.

Now, let ℎ𝑖 be the expected hitting time of b to reach all-ones starting from 𝑖 ones. For brevity, say Pr(fix) given 𝑖 ones is 𝑠𝑖. Then,

  ℎ0 = 1 + 𝑠0ℎ1 + (1 − 𝑠0)ℎ0  ⟹  ℎ0 = 1/𝑠0 + ℎ1 = 𝑐0 + ℎ1.  (17)

Suppose ℎ𝑖−1 ≤ 𝑐𝑖−1 + ℎ𝑖. Then,

  ℎ𝑖 ≤ 1 + (𝑠𝑖/2)ℎ𝑖−1 + 𝑠𝑖ℎ𝑖+1 + (1 − 3𝑠𝑖/2)ℎ𝑖  (18)

  ⟹  ℎ𝑖 ≤ 1/𝑠𝑖 + (1/2)𝑐𝑖−1 + ℎ𝑖+1 = 𝑐𝑖 + ℎ𝑖+1.  (19)

Then,

  ℎ0 = (2𝑛 + 𝐿)/𝑛 + Σ_{𝑖=1}^{𝑛−1} 𝑐𝑖 ≤ (2𝑛 + 𝐿)/𝑛 + Σ_{𝑖=1}^{𝑛−1} (1/𝑠𝑖 + (1/2)𝑐𝑖−1)  (20)

  ≤ (2𝑛 + 𝐿)/𝑛 + Σ_{𝑖=1}^{𝑛−1} ((2𝑛 + 𝐿)/(𝑛 − 𝑖) + (1/2^𝑖) · (2𝑛 + 𝐿)/𝑛)  (21)

  = 𝑂(𝐿/𝑛) + 𝑂((𝑛 + 𝐿) log 𝑛) = 𝑂((𝑛 + 𝐿) log 𝑛).  (22)
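The ratio claim following (16) — that Pr(fix)/Pr(break) > 2 for all |b| < 𝑛 once 𝐿 is large enough — can be sanity-checked numerically. A minimal sketch, transcribing (15) and (16) directly; it assumes the threshold reads 𝐿 > 3𝑛² − 5𝑛 + 3/2, and skips |b| = 0, where a break is impossible:

```python
def pr_break(b, n, L):
    # Eq. (15): both candidates flip a one in b, or one does while the other
    # is neutral (flips a bit in c); the leading 1/2 is the random target.
    return 0.5 * (b * b + 2 * b * n) / (2 * n + L) ** 2

def pr_fix(b, n, L):
    # Eq. (16): at least one of the two candidates flips a zero in b,
    # again with the 1/2 factor for the target being all-ones.
    return 0.5 * (1 - ((2 * n + L - (n - b)) / (2 * n + L)) ** 2)

def min_ratio(n):
    # Smallest integer L exceeding the assumed threshold 3n^2 - 5n + 3/2.
    L = 3 * n * n - 5 * n + 2
    return min(pr_fix(b, n, L) / pr_break(b, n, L) for b in range(1, n))

worst = min(min_ratio(n) for n in range(2, 60))
```

The minimum over |b| is attained at |b| = 𝑛 − 1, and the ratio stays strictly above 2 for every tested 𝑛, consistent with the claim.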
GECCO ’22, July 9–13, 2022, Boston, USA Meyerson, Qiu, and Miikkulainen
So, the complexity for |b| to converge from any starting point is the same as in Theorem 5.2, and, as in Theorem 5.2, if |b| = 𝑛,

  Pr(mistake) = 𝑂((𝑛 / (𝑛 + 𝐿))²), and  (23)

  E(repairs) = 𝑂((𝑛 / (𝑛 + 𝐿))²) · 𝑂((𝑛 + 𝐿) log 𝑛) = 𝑂(𝑛² log 𝑛 / (𝑛 + 𝐿)).  (24)

So, again choosing 𝐿 = 𝜔(𝑛² log 𝑛) leads to

  E(repairs) = 𝑂(𝑛² log 𝑛 / (𝑛 + 𝐿)) = 𝑂(1),  (25)

and if b and c are both converged, then there is a constant chance of sampling the correct target at each step. □

Theorem 5.5. The (1+𝜆)-EA* with direct encoding and 𝜆 = 1 converges in Ω(2^𝑛) steps on the large block assembly problem.

Proof. Since only a single bit is flipped, the diversity method has no effect on this problem. The sparsity method is not useful either: It only biases the search towards 0's, which is not helpful in general; in the worst case, both hidden targets are all 1's. In this case, the algorithm cannot make any progress once one of the targets is found, unless the state is already only one bit away from reaching the second target. So, the best strategy is to restart every iteration, i.e., set 𝑅 = 1, leading to a convergence time of Θ(2^𝑛). □

Theorem 5.6. The (1+𝜆)-EA* with GP encoding and 𝜆 = 1 converges in 𝑂(𝑛^{3+𝑜(1)}) steps on the large block assembly problem.

Proof. Call the two hidden targets t1 and t2. W.l.o.g., suppose t1 occurs in the first 𝑛/2 bits of a solution, and t2 occurs in the last 𝑛/2 bits. Let [b1, b2] = b and [c1, c2] = c, where dim(b1) = dim(b2) = dim(c1) = dim(c2) = 𝑛/2. Since dim(a1) = dim(a2) = 1, refer to the scalar contents of a1 and a2 as 𝑎1 and 𝑎2, respectively.

Upon random initialization, there is a constant probability that |b1 − t1| < |b2 − t2|, |c2 − t2| < |c1 − t1|, and either 𝑎1 = 0 or 𝑎2 = 0. We are interested in how long it takes to reach the state b1 = t1, b2 = 0, c1 = 0, c2 = t2, and 𝑎1 = 𝑎2 = 1, since this is a state of maximal fitness.

However, if a state with 𝑎1 = 𝑎2 = 1 is reached before b = [t1, 0] and c = [0, t2], then the states in b and c can be pulled in the wrong direction, due to the interaction between b and c caused by '⊕'. So, suppose that the algorithm reaches a state with b1 = t1, b2 = 0, c1 = 0, and c2 = t2, before ever reaching a state with 𝑎1 = 𝑎2 = 1. Consider, then, the remaining three cases, which the algorithm alternates between as it converges:

Case 1: 𝑎1 = 1 and 𝑎2 = 0. In this case, a new solution is kept if (1) a bit in b1 is flipped that makes b1 closer to t1 (due to the fitness condition, since the fitness will increase), if (2) a bit in b2 is flipped from a 1 to a 0 (due to the sparsity condition, since the sparsity will increase), or if (3) 𝑎1 is flipped, which results in Case 2 (due to the diversity condition, since disabling b flips more than one bit in y).

Case 2: 𝑎1 = 0 and 𝑎2 = 0. In this case, a new solution is kept if 𝑎1 or 𝑎2 is flipped, resulting in Case 1 and Case 3, respectively (diversity condition). No other flips affect the phenotype.

Case 3: 𝑎1 = 0 and 𝑎2 = 1. In this case, a new solution is kept if (1) a bit in c2 is flipped that makes c2 closer to t2 (fitness condition), if (2) a bit in c1 is flipped from a 1 to a 0 (sparsity condition), or if (3) 𝑎2 is flipped, resulting in Case 2 (diversity condition).

In expectation, half of the bits in b and c are correct upon initialization, and there are 2𝑛 possible bits that can be flipped without moving to a different case. So, since single-point mutation is used, the convergence of b and c are each equivalent to a coupon collector problem where there are 2𝑛 unique coupons of which a particular 𝑛 must be collected [18]. The expected convergence time of this process is

  2𝑛𝐻_{𝑛/2} ≤ 2𝑛(log(𝑛/2) + 1) < 2𝑛(log 𝑛 + 1),  (26)

where 𝐻𝑘 is the 𝑘th harmonic number.

Each time Case 1 or 3 is visited, the process remains there for at least 2𝑛 steps in expectation, before moving to Case 2, from which there is a fifty-fifty chance of next moving to Case 1 or Case 3. Let 𝑉1 and 𝑉3 be the number of visits before convergence for Case 1 and Case 3, resp. Each case must be visited 𝐻_{𝑛/2} times to converge in expectation, and the number of visits to either case is sampled from a binomial distribution 𝐵(𝑡, 1/2), where 𝑡 is the number of total visits 𝑉1 + 𝑉3. Of interest is how many trials 𝑡 the binomial distribution needs to ensure that min(𝑉1, 𝑉3) ≥ 𝐻_{𝑛/2}. This can be achieved by finding 𝑡 so that the mean minus the standard deviation of the distribution is at least 𝐻_{𝑛/2}:

  𝜇(𝐵(𝑡, 1/2)) − 𝜎(𝐵(𝑡, 1/2)) = 𝑡/2 − √𝑡/2 ≥ 𝐻_{𝑛/2}  (27)

  ⟹  𝑡 − √𝑡 − 2𝐻_{𝑛/2} ≥ 0  (28)

  ⟹  𝑡 ≥ 2𝐻_{𝑛/2} + (1/2)√(8𝐻_{𝑛/2} + 1) + 1/2  (29)

  ⟹  𝑡 ≥ 2𝐻_{𝑛/2} + 𝑂(√(log 𝑛)).  (30)

That is, in expectation, after 𝑡 total visits, the case visited least is visited 𝐻_{𝑛/2} times and the case visited most is visited 𝐻_{𝑛/2} + 𝑂(√(log 𝑛)) times.

Overall, the state converges to b1 = t1, b2 = 0, c1 = 0, and c2 = t2 after spending 𝑂(𝑛 log 𝑛) steps in Case 1 and 𝑂(𝑛 log 𝑛) steps in Case 3. By symmetry, 𝑂(𝑛 log 𝑛) time is also spent in Case 2. Since 𝑂(𝑛 log 𝑛) steps are spent in each case, the total time is also 𝑂(𝑛 log 𝑛), at which point 𝑂(𝑛) additional steps are needed to reach 𝑎1 = 1 and 𝑎2 = 1, completing convergence.

What is the chance that 𝑎1 = 1 and 𝑎2 = 1 is not reached until both b and c have converged? Suppose max(𝑉1, 𝑉3) = 𝑉1. Then, for Case 1, the chance that 𝑎2 is not flipped before b converges is no less than

  (1 − 1/(2𝑛 + 2))^{2𝑛(𝐻_{𝑛/2} + 𝑂(√log 𝑛))} = (1/𝑒)^{𝐻_{𝑛/2} + 𝑂(√log 𝑛)}  (31)

  > (1/𝑒)^{log 𝑛 + 1 + 𝑂(√log 𝑛)}  (32)

  = (1/𝑛) · (1/𝑒) · (1/𝑒)^{𝑂(√log 𝑛)}  (33)

  ≥ (1/𝑛) · (1/𝑒) · (1/𝑛^𝜖) = Ω(1/𝑛^{1+𝜖}),  (34)
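The step from (28) to (29) is just the quadratic formula applied in √𝑡, and substituting the bound (29) back into the left side of (27) recovers 𝐻_{𝑛/2} exactly. A quick numerical check of this algebra (the test sizes are arbitrary):

```python
import math

def harmonic(k):
    # H_k, the k-th harmonic number
    return sum(1.0 / i for i in range(1, k + 1))

def t_bound(H):
    # Eq. (29): the root of t - sqrt(t) - 2H = 0; any t at least this
    # large satisfies the concentration requirement in Eq. (27).
    return 2 * H + 0.5 * math.sqrt(8 * H + 1) + 0.5

# At the bound itself, t/2 - sqrt(t)/2 should equal H_{n/2} exactly.
residuals = []
for n in (10, 100, 1000, 10000):
    H = harmonic(n // 2)
    t = t_bound(H)
    residuals.append(abs(t / 2 - math.sqrt(t) / 2 - H))
```

Each residual is zero up to floating-point error, confirming that (29) is the exact solution of the equality case of (27)–(28).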
for any 𝜖 > 0. The last step uses the fact that 𝑒^{𝑂(√log 𝑛)} = 𝑜(𝑛^𝜖) for any 𝜖 > 0, which can be seen from the following [89]:

  lim_{𝑛→∞} 𝑒^{𝛼√log 𝑛} / (𝛽𝑛^𝜖) = exp(lim_{𝑛→∞} log(𝑒^{𝛼√log 𝑛} / (𝛽𝑛^𝜖)))  (35)

  = exp(lim_{𝑛→∞} (𝛼√(log 𝑛) − 𝜖 log 𝑛 − log 𝛽))  (36)

  = exp(−∞) = 0,  (37)

for any constants 𝛼 > 0, 𝛽 > 0.

By symmetry, there is the same lower bound on the probability that c converges before 𝑎1 is flipped in Case 3. So, the probability that neither event occurs is

  Ω(1/𝑛^{1+𝜖})² = Ω(1/𝑛^{1+𝑜(1)})² = Ω(1/𝑛^{2+𝑜(1)}).  (38)

Therefore, one can choose a restart threshold 𝑅 = Θ(𝑛 log 𝑛), with convergence expected in the above manner after 𝑂(𝑛^{2+𝑜(1)}) restarts, yielding an overall convergence time of 𝑂(𝑛^{3+𝑜(1)} log 𝑛) = 𝑂(𝑛^{3+𝑜(1)}). □
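The limit argument in (35)–(37) can also be illustrated numerically. A small sketch, working in log-space (i.e., evaluating 𝛼√(log 𝑛) − 𝜖 log 𝑛 − log 𝛽 directly) to avoid floating-point overflow for huge 𝑛; the choices 𝛼 = 𝛽 = 1 and 𝜖 = 0.1 are arbitrary:

```python
import math

def log_ratio(log_n, alpha=1.0, beta=1.0, eps=0.1):
    # log of e^{alpha*sqrt(log n)} / (beta * n^eps), per Eqs. (35)-(36):
    # alpha*sqrt(log n) - eps*log n - log(beta), which tends to -infinity.
    return alpha * math.sqrt(log_n) - eps * log_n - math.log(beta)

# Evaluate at n = 10^5, 10^50, 10^500, 10^5000 via log n = k * log(10).
values = [log_ratio(k * math.log(10)) for k in (5, 50, 500, 5000)]
```

The sampled values decrease without bound, matching exp(−∞) = 0 in (37); the decay is slow, which is why the exponent is only 𝑜(1) rather than a constant.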