0% found this document useful (0 votes)

13 views233 pages

N Are Do Final Thesis

Uploaded by

Ruiqi Guo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views233 pages

N Are Do Final Thesis

Uploaded by

Ruiqi Guo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 233

Genetic Programming Based on Novelty Search

Enrique Naredo

To cite this version:

Enrique Naredo. Genetic Programming Based on Novelty Search. Artificial Intelligence [cs.AI]. ITT,
Instituto tecnologico de Tijuana, 2016. English. �NNT : �. �tel-01668776�

HAL Id: tel-01668776

https://inria.hal.science/tel-01668776
Submitted on 20 Dec 2017

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est

archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents
entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non,
lished or not. The documents may come from émanant des établissements d’enseignement et de
teaching and research institutions in France or recherche français ou étrangers, des laboratoires
abroad, or from public or private research centers. publics ou privés.
Enrique Naredo Garcı́a

Genetic Programming Based

on Novelty Search
Genetic
Programming Based
on Novelty Search

Enrique Naredo Garcı́a

SEP Tecnológico Nacional de México

Instituto Tecnológico de Tijuana

Departamento de Ingenierı́a Eléctrica y Electrónica

Programación Genética Basada en Búsqueda de

Novedad

tesis sometida por:

Enrique Naredo Garcı́a

para obtener el grado de:

Doctor en Ciencias de la Ingenierı́a

director:
Dr. Leonardo Trujillo Reyes

Tijuana, Baja California, México.

3 de Junio del 2016
SEP Tecnológico Nacional de México

Instituto Tecnológico de Tijuana

Departamento de Ingenierı́a Eléctrica y Electrónica

Genetic Programming Based on Novelty Search

thesis submitted by:

Enrique Naredo Garcı́a

to obtain the degree of:

Doctor in Engineering Sciences

advisor:
Dr. Leonardo Trujillo Reyes

Tijuana, Baja California, México.

June 3, 2016
ii
Resumen

La búsqueda de novedad (Novelty Search –NS) es un enfoque único

hacia la búsqueda y optimización, en donde una función objetivo ex-
plicita es reemplazada por una medida de novedad de la solución. Sin
embargo, NS ha sido mayormente utilizada en robótica evólutiva, su
utilidad en problemas clásicos de aprendizaje máquina no ha sido ex-
plorado. Este trabajo presenta un algoritmo de programación genética
(GP) basado en novedad para clasificación supervisada, con las sigu-
ientes contribuciones. Se muestra que NS puede resolver problemas
de clasificación del “mundo real”, validado en problemas de referencia
binarios y multiclase. Estos resultados son posibles al utilizar un de-
scriptor de comportamiento especı́fico del dominio del problema, rela-
cionado con el concepto de semántica en GP. Por otra parte, se propo-
nen dos nuevas versiones del algoritmo de NS; búsqueda de novedad
probabı́listica (PNS) y una variante del criterio mı́nimo de NS (MCNS).
El primero modela el compartamiento de cada solución como un vec-
tor aleatorio elimiando todos los parámetros de NS, reduciendo además
el costo computacional del algoritmo. El segundo utiliza una función
objetivo para restringir y sesgar la búsqueda hacia soluciones con un
alto desempeño. Este trabajo discute los efectos de NS en la dinámica
de la búsqueda de GP y en el crecimiento de código. Los resultados
muestran que NS puede ser utilizado como una alternativa real para
clasificación supervisada. Además muestran que para problemas bina-
rios de clasificación el algoritmo de NS exhibe un implı́cito control de
bloat.

Palabras clave: Programación Genética, Búsqueda de Novedad, Clasi-

ficación, Decepción, Bloat.

i
ii
Abstract

Novelty Search (NS) is a unique approach towards search and opti-

mization, where an explicit objective function is replaced by a measure
of solution novelty. However, NS has been mostly used in evolutionary
robotics, its usefulness in classic machine learning problems has been
unexplored. This thesis presents a NS-based Genetic Programming
(GP) algorithms for common machine learning problems, with the fol-
lowing contributions. It is shown that NS can solve real-world clas-
sification, clustering and symbolic regression tasks, validated on real-
world benchmarks and synthetic problems. These results are made
possible by using a domain-specific behavior descriptor, related to the
concept of semantics in GP. Moreover, two new versions of the NS al-
gorithm are proposed, Probabilistic NS (PNS) and a variant of Minimal
Criteria NS (MCNS). The former models the behavior of each solution
as a random vector and eliminates all the NS parameters while reduc-
ing the computational overhead of the NS algorithm; the latter uses a
standard objective function to constrain and bias the search towards
high performance solutions. The thesis also discusses the effects of NS
on GP search dynamics and code growth. Results show that NS can be
used as a realistic alternative for machine learning, and particularly for
GP-based classification.

Keywords: Genetic Programming, Novelty Search, Classification, De-

ception, Bloat.

iii
iv
To my whole dear family and to all my true friends.

v
vi
Acknowledgements

This dissertation has been possible because the convergence of

many wills that supported my own.
Firstly, I am very grateful to my dissertation committee for their
suggestions and constructive remarks, which enabled me to improve
greatly the quality of my research work: Dr. Pierrick Legrand, Dr.
Leonardo Trujillo Reyes, Dr. Nohé Ramón Cázares Castro, Dr. Luis
Néstor Coria de los Rios, Dr. Victor Hugo Dı́az Ramı́rez, and Dr. Paulo
Jorge Cunha Vaz Dias. Particularly, I would like to thank my advisor
Leonardo Trujillo for his pretty patient guide, and specially for his de-
tailed and constructive feedback to keep my research work on the right
path.
Secondly, I would like to thank all my colleagues from the Instituto
Tecnológico de Tijuana (ITT) doctoral program in engineering sciences
at the Otay campus, particularly to all members of the Tree-Lab, for
being open, friendly, and even for creating a pleasant working atmo-
sphere.
Thirdly, thanks to all the researchers involved in the project ACOB-
SEC, who gave me a great opportunity to visit their corresponding labs
at their universities, and even more important to share with me their
time and knowledge: Dr. Pierrick Legrand (INRIA and Université de
Bordeaux, France), Dr. Francisco Fernández de Vega (Universidad de
Extremadura, Spain), Dr. Gustavo Olague (CICESE Ensenada, México),
Dr. Sara Silva (Universidade de Lisboa), and of course, to all of their
corresponding students.
Special thanks to Dr. Paulo Urbano (Universidade de Lisboa), for all
of his invaluable comments to my research work. Even though, we met
each other because a very funny causality at the Universidade de Lisboa,
we started an interesting and valuable research work in addition to a
friendship, which has been increasing over time.

vii
On a personal note, I would like to thank all the friends (‘contras’)
I made from the ITT postgraduate program in computer sciences at
the Tomás Aquino campus, for their unconditional support on my bad
and sometimes worst days in my PhD journey, and because they always
make me feel as a part of their family as well as they are for me.
On a more personal note, I am very thankful to all my family for
their lovely support. They are far away in distance, but always close in
affection.
Lastly, I would like to thank the funding provided by CONACYT
(México) for my scholarship No. 232288, as well as the funding pro-
vided by FCT project EXPL/EEISII/1861/2013, and by CONACYT Ba-
sic Science Research Project No. 178323, DGEST (Mexico) Research
Project 5414.14-P. Particularly I am very thankful for all the support
from the project ACOBSEC: FP7-PEOPLE-2013-IRSES financed by the
European Commission with contract No. 612689.

viii
Contents

Resumen i
Dedication v
Acknowledgements vii
Contents ix
List of Figures xi
List of Tables xiii
1. introduction 1
1.1. Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . 3
1.2. GP open issues . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Mechanisms to guide the search . . . . . . . . . . . . . . . 4
1.4. Original Contributions . . . . . . . . . . . . . . . . . . . . 6
1.5. Outline of the Dissertation . . . . . . . . . . . . . . . . . . 7
1.6. Summary of publications . . . . . . . . . . . . . . . . . . . 8
2. genetic programming 11
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2. Genetic Programming Overview . . . . . . . . . . . . . . . 13
2.3. Nuts and Bolts of GP . . . . . . . . . . . . . . . . . . . . . 14
2.4. Search Spaces in GP . . . . . . . . . . . . . . . . . . . . . . 15
2.5. GP Case Study: Disparity Map Estimation for Stereo Vision 21
2.5.1. Experimental Configuration and Results . . . . . . 22
2.6. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 25
3. deception 31
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2. Deception in the Artificial Evolution . . . . . . . . . . . . 32
3.3. Deception in Evolution . . . . . . . . . . . . . . . . . . . . 33
3.4. Deceptive Problems . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1. Trap Functions . . . . . . . . . . . . . . . . . . . . . 34
3.4.2. Scalable Fitness Function . . . . . . . . . . . . . . . 35
3.4.3. Deceptive Navigation . . . . . . . . . . . . . . . . . 36

ix
3.5. Deceptive Classification Problem Design . . . . . . . . . . 36
3.5.1. Synthetic Classification Problem . . . . . . . . . . 37
3.5.2. Linear Classifier . . . . . . . . . . . . . . . . . . . . 38
3.5.3. Objective Function . . . . . . . . . . . . . . . . . . 40
3.5.4. Non-Deceptive Fitness Landscape . . . . . . . . . . 40
3.5.5. Local Optima and Global Optimum . . . . . . . . 42
3.5.6. Deceptive Objective Function . . . . . . . . . . . . 42
3.5.7. Deceptive Fitness Landscape . . . . . . . . . . . . . 43
3.5.8. Preliminary Results . . . . . . . . . . . . . . . . . . 45
3.6. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 46
4. novelty search 47
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2. Open-ended Search . . . . . . . . . . . . . . . . . . . . . . 49
4.3. Nuts & Bolts of NS . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1. Behavioral representation . . . . . . . . . . . . . . 51
4.3.2. NS algorithm . . . . . . . . . . . . . . . . . . . . . 52
4.3.3. Underlying evolutionary algorithm . . . . . . . . . 53
4.3.4. Minimal Criteria Novelty Search . . . . . . . . . . 54
4.4. Contributions on NS . . . . . . . . . . . . . . . . . . . . . 54
4.4.1. MCNSbsf . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2. Probabilistic NS . . . . . . . . . . . . . . . . . . . . 56
4.5. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 59
5. ns case study: automatic circuit syn-
thesis 61
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2. GA Synthesis and CFs Representation . . . . . . . . . . . 63
5.2.1. Objective Function . . . . . . . . . . . . . . . . . . 64
5.2.2. CFs Representation for a GA . . . . . . . . . . . . . 65
5.2.3. CF synthesis with GA-NS . . . . . . . . . . . . . . . 66
5.3. Results and Analysis . . . . . . . . . . . . . . . . . . . . . 66
5.3.1. CF Topologies . . . . . . . . . . . . . . . . . . . . . 66
5.3.2. Comparison between GA-OS and GA-NS . . . . . . 67
5.4. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 68

x
6. generalization of ns-based gp con-
trollers for evolutionary robotics 75
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2. Background . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2.1. Grammatical Evolution . . . . . . . . . . . . . . . . 80
6.2.2. Generalization in Genetic Programming . . . . . . 81
6.3. Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1. Navigation Task and Environment . . . . . . . . . 85
6.3.2. Training and Testing Set Size . . . . . . . . . . . . 86
6.3.3. Training Set Selection . . . . . . . . . . . . . . . . . 86
6.3.4. Objective Functions . . . . . . . . . . . . . . . . . . 87
6.3.5. Search Algorithms . . . . . . . . . . . . . . . . . . 88
6.3.6. Limit for Allowed Moves . . . . . . . . . . . . . . . 89
6.4. Results and Analysis . . . . . . . . . . . . . . . . . . . . . 90
6.4.1. Results for Randomly Chosen Training Sets . . . . 90
6.4.2. Comparison with a Manually Selected Training Set 92
6.4.3. Statistical Analysis . . . . . . . . . . . . . . . . . . 96
6.5. Heat Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 99
7. gp based on ns for regression 107
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1. Behavioral Descriptor . . . . . . . . . . . . . . . . . 108
7.1.2. Pictorial Example . . . . . . . . . . . . . . . . . . . 108
7.1.3. NS Modifications . . . . . . . . . . . . . . . . . . . 109
7.2. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1. Test Problems and Parametrization . . . . . . . . . 111
7.2.2. Parameter Settings . . . . . . . . . . . . . . . . . . 112
7.2.3. Experimental Results . . . . . . . . . . . . . . . . . 112
7.3. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 113
8. gp based on ns for clustering 117
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.2. Clustering with Novelty Search . . . . . . . . . . . . . . . 118
8.2.1. K-means . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2.2. Fuzzy C-means . . . . . . . . . . . . . . . . . . . . 119

xi
8.2.3. Cluster Descriptor (CD) . . . . . . . . . . . . . . . 120
8.2.4. Cluster Distance Ratio (CDR) . . . . . . . . . . . . 121
8.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.3.1. Test Problems . . . . . . . . . . . . . . . . . . . . . 123
8.3.2. Configuration and Parameter Settings . . . . . . . 123
8.3.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 125
9. gp based on ns for classification 133
9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.1.1. Accuracy Descriptor (AD) . . . . . . . . . . . . . . 134
9.1.2. Binary Classifier: Static Range Selection . . . . . . 136
9.1.3. Multiclass Classifier: M3GP . . . . . . . . . . . . . 136
9.1.4. Preliminary Results . . . . . . . . . . . . . . . . . . 137
9.1.5. Discussion . . . . . . . . . . . . . . . . . . . . . . . 138
9.2. Real-world classification experiments . . . . . . . . . . . . 141
9.2.1. Results: Binary Classification . . . . . . . . . . . . 142
9.2.2. Results: Multiclass Classification . . . . . . . . . . 147
9.2.3. Results: Analysis . . . . . . . . . . . . . . . . . . . 150
9.3. Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . 152
10. conclusions & future directions 157
10.1. Summary and Conclusions . . . . . . . . . . . . . . . . . . 158
10.2. Open Issues in GP . . . . . . . . . . . . . . . . . . . . . . . 158
10.3. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Bibliography 163

xii
List of Figures

1.1. Different classifications of metaheuristics. (Figure author:

Johann “nojhan” Dréo, Caner Candan). . . . . . . . . . . . . . 2
2.1. Example of programs (expressions) K1 and K2 , evolved by
GP using the previous set of terminals and functions. On
the left is depicted the tree representation, while on the
right it can be seen both prefix and infix mathematical no-
tation for each GP program. . . . . . . . . . . . . . . . . . . 15
2.2. Block diagram depicting the main processes involved in a
tree-based GP search . . . . . . . . . . . . . . . . . . . . . . 17
2.3. The traditional program evaluation process used in GP. . . 17
2.4. The basic program evaluation process used in GP. . . . . . . 18
2.5. Conceptual view of how the performance of a program can
be analyzed. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6. Stereo images for experiment tests from Middlebury
Stereo Vision Page. . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7. Convergence Graphic of the best GP run result . . . . . . . 24
2.8. Function tree of the best GP run result. . . . . . . . . . . . . 25
2.9. Disparity maps comparison for the Barn1 image. . . . . . . 26
2.10. Disparity maps comparison for the Barn2 image. . . . . . . 27
2.11. Disparity maps comparison for the Bull image. . . . . . . . 28
2.12. Disparity maps comparison for the Poster image. . . . . . . 29
2.13. Disparity maps comparison for the Sawtooth image. . . . . 30
2.14. Disparity maps comparison for the Venus image. . . . . . . 30
3.1. Variable deceptive function . . . . . . . . . . . . . . . . . . . 35
3.2. Deceptive navigation task . . . . . . . . . . . . . . . . . . . 36
3.3. Synthetic classification problem with 2-classes and 2-
dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4. Search space & fitness landscape. . . . . . . . . . . . . . . . 41
3.5. Class separation criteria. . . . . . . . . . . . . . . . . . . . . 43
3.6. Set of synthetic deceptive classification problems. . . . . . . 44

xiii
4.1. Two examples of evolutionary robotics applying NS. . . . . 49
4.2. Objective-based against novelty-based in a task navigation. 50
4.3. Three scenarios for an individual’s behavior . . . . . . . . . 53
4.4. Illustration of the effect of the proposed MCNS approach. . 55
4.5. Representation of the PNS novelty measure, where each
column represents one feature βi of the AD vector and
each row is a different generation. In each column, two
graphics are presented, on the left is the frequency of in-
dividuals with either a 1 or 0 for that particular feature
in the current population, and on the right the cumulative
frequency over all generations. . . . . . . . . . . . . . . . . . 58
5.1. CF representation using (a) nullors and (b) MOSFETs. . . . 65
5.2. Three CF topologies for the synthesis of a CF with the
topology size of one MOSFET. . . . . . . . . . . . . . . . . . 69
5.3. Five CF topologies for the synthesis of a CF with the topol-
ogy size of two MOSFETs found by GA-NS. . . . . . . . . . . 70
5.4. Four CF topologies for the synthesis of a CF with the topol-
ogy size of three MOSFETs found by GA-NS. . . . . . . . . . 71
5.5. Convergence plots for GA-NS and GA-OS for the two MOS-
FET CFs, showing the performance (objective function
value) of the best solution found over the initial genera-
tions of the search. The lines represent the average over
10 runs of the algorithms. . . . . . . . . . . . . . . . . . . . . 72
5.6. Histograms showing the average composition of the chro-
mosome of the best solution found by each algorithm after
10 generations, for the single MOSFET circuits. . . . . . . . 72
5.7. Histograms showing the average composition of the chro-
mosome of the best solution found by each algorithm after
10 generations, for the two MOSFET circuits. . . . . . . . . 73
5.8. Histograms showing the average composition of the chro-
mosome of the best solution found by each algorithm after
10 generations, for the three MOSFET circuits. . . . . . . . . 73
6.1. Example of a GE genotype-phenotype mapping process. . . 79
6.2. Learning environment & test set. . . . . . . . . . . . . . . . 85

xiv
6.3. Performance comparison of objective, novelty and
random-based search. . . . . . . . . . . . . . . . . . . . . . . 89
6.4. Box plot comparison of the best solution found using train-
ing set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5. Box plot comparison of the best solution found using test
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6. Box plot comparison of the percentage of testing hits. . . . . 97
6.7. Box plot comparison of the best solution found using train-
ing set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.8. Box plot comparison of the best solution found using test
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.9. Box plot comparison of the percentage of testing hits. . . . . 100
6.10. Comparison of the performance on the test set based on
percentage of hits. . . . . . . . . . . . . . . . . . . . . . . . . 101
6.11. Overfit heat-maps, at the right of each figure is shown a
color scale that goes from low values of overfitting in color
blue to higher values in color red. . . . . . . . . . . . . . . . 103
6.12. Overfit binary maps, obtained from the binarization of the
Overfit heat-maps by a color threshold to divide the map
into easy and difficult regions. . . . . . . . . . . . . . . . . . 103
6.13. Easy-difficult training Set & Test Set. . . . . . . . . . . . . . 104
7.1. Graphical depiction of how the −descriptor is con-
structed. Notice how the interval specified by deter-
mines the values of each βi . . . . . . . . . . . . . . . . . . . . 109
7.2. Benchmark regression problems showing the ground truth
function taken from (Uy et al., 2011a) . . . . . . . . . . . . . 112
7.3. Boxplot analysis of NS-GP-R performance on each bench-
mark problem, showing the best test-set error in each run. . 114
8.1. Fitness landscape in behavioral space for the CD descriptor.121
8.2. Five synthetic clustering problems; the observed clusters
represent the ground truth data. . . . . . . . . . . . . . . . . 123
8.3. Comparison of clustering performance on Problem No.1. . 126
8.4. Comparison of clustering performance on Problem No.2. . 127
8.5. Comparison of clustering performance on Problem No.3. . 128
8.6. Comparison of clustering performance on Problem No.4. . 129

xv
8.7. Comparison of clustering performance on Problem No.5. . 130
8.8. Evolution of sparseness for NS-GP-20, showing the aver-
age sparseness of he best individual at each generation. . . 131
8.9. Evolution of sparseness for NS-GP-40, showing the aver-
age sparseness of he best individual at each generation. . . 131
8.10. Evolution of the best solutions found at progressive gener-
ations for Problem 5 with NS-GP-40. . . . . . . . . . . . . . 132
9.1. Graphical depiction of the Accuracy Descriptor (AD) . . . . 135
9.2. Graphical depiction of SRS-GPC. . . . . . . . . . . . . . . . 136
9.3. Five synthetic 2-class problems . . . . . . . . . . . . . . . . 138
9.4. Evolution of the average size of individuals at each gener-
ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.5. Convergence of the classification error . . . . . . . . . . . . 144
9.6. Evolution of the average size of the population . . . . . . . 145
9.7. Classification error on the test data . . . . . . . . . . . . . . 148
9.8. Convergence of training and testing error for multiclass
problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.9. Plot showing the percentage of rejected individuals . . . . . 151
9.10. Archive size & Relative speed-up . . . . . . . . . . . . . . . 152
9.11. Relative ranking of NS, PNS and OS on the IM-3 problem. . 155
9.12. Relative ranking of NS, PNS and OS on the SEG problem. . 156

xvi
List of Tables

2.1. GP system parameters for the experimental tests . . . . . . 23

2.2. Comparison results for the Barn1 image . . . . . . . . . . . 25
2.3. Comparison results for the Barn2 image . . . . . . . . . . . 26
2.4. Comparison results for the Bull image . . . . . . . . . . . . 27
2.5. Comparison results for the Poster image . . . . . . . . . . . 28
2.6. Comparison results for the Sawtooth image . . . . . . . . . 29
2.7. Comparison results for the Venus image . . . . . . . . . . . 29
3.1. Preliminary results, using a Naive Bayesian method and
SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1. Shared parameters for the GA and GA-NS. . . . . . . . . . . 67
5.2. GA-NS parameters. . . . . . . . . . . . . . . . . . . . . . . . 67
5.3. Synthesized CF topologies with one MOSFET. . . . . . . . . 68
5.4. Synthesized CF topologies with two MOSFET. . . . . . . . . 68
5.5. Synthesized CF topologies with three MOSFET. . . . . . . . 69
5.6. Number of generations required by GA-OS and GA-NS to
converge for each of the CF circuits: one MOSFET (M1),
two MOSFETS (M2) and three MOSFETs (M3). Each row
is a different run and the final row shows the average. . . . 70
6.1. A general description of the algorithms and measures used
in the experimental set-up. . . . . . . . . . . . . . . . . . . . 88
6.2. Parameters used for the experimental work. . . . . . . . . . 90
6.3. Summary of the experimental results for all of the variants. 91
6.4. Results from each of the twelve fixed instances. . . . . . . . 94
6.5. Table that summarizes the performance of the three meth-
ods for a set of 12 fixed instances. . . . . . . . . . . . . . . . 96
6.6. Summary of the statistical tests. . . . . . . . . . . . . . . . . 102
7.1. Benchmark symbolic regression problems taken from (Uy
et al., 2011a). . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2. Parameters for the experiments. . . . . . . . . . . . . . . . . 113
7.3. Parameters for novelty search. . . . . . . . . . . . . . . . . . 113

xvii
7.4. Comparison of NS-GP-R with two control methods re-
ported in (Uy et al., 2011a): SSC and SGP. Values are the
mean error computed over all runs. . . . . . . . . . . . . . . 114
8.1. Parameters for the GP-based search. . . . . . . . . . . . . . . 124
8.2. Average classification error and standard error. . . . . . . . 125
9.1. Parameters for the GP systems. . . . . . . . . . . . . . . . . 139
9.2. Average and standard deviation of the classification error
on the test data. . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.3. Average program size at the final generation for each algo-
rithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.4. Real-world and synthetic datasets for binary and multi-
class classification problems. . . . . . . . . . . . . . . . . . . 141
9.5. Binary classification performance for all MCNSbest variants. 143
9.6. Binary classification performance on the test data. . . . . . 146
9.7. Resulting p-values of the Friedman’s test with Bonferroni-
Dunn correction. . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.8. Multiclass classification performance on the test data. . . . 149
9.9. Resulting p-values of the Friedman’s test with Bonferroni-
Dunn correction, for the multiclass problems. . . . . . . . . 149

xviii
Introduction
1
Many of the problems which concern to human beings to improve
their quality of life can be posed as optimization problems. However, in
order to do so, we need to be able to describe the system that is involved
in the problem. With enough knowledge about the system we can de-
rive a useful abstraction and describe it by a model, identifying the
variables and constraints which are relevant to improve some objective.
Often, for an engineering problem the objective is to choose the design
parameters that maximize the benefits, or minimizes costs without vio-
lating the constraints, in other words finding the best feasible solution.
Therefore, solutions must be scored according to the quality they show
when solving the problem, this typically is measured through to what is
known as an objective function or reward function, which precisely re-
wards those solutions that are closer to the objective. Depending on the
problem domain, solutions can be encoded with real-valued or discrete
variables, the latter ones are referred to as combinatorial optimization
(CO) problems.
Examples of CO problems are the Travellings Salesman Problem
(TSP), the Quadratic Assignment problem (QAP), scheduling problems,
and many others. Due to the practical importance of CO problems,
many algorithms to tackle them have been developed. These algo-
rithms can be classified as either complete or approximate algorithms.
Complete algorithms are guaranteed to find for every finite size in-
stance of a CO problem an optimal solution in bounded time meta-
heuristics.
Therefore, complete methods might need exponential computation
time in the worst-case. This often leads to computation times too high

1
introduction

for practical purposes. In approximate methods we sacrifice the guar-

antee of finding optimal solutions for the sake of getting good solutions
in a substantially reduced amount of time.
In the last 30 years, a new kind of approximate algorithms has
emerged which basically tries to combine basic heuristic methods in
higher level frameworks aimed at efficiently and effectively exploring a
search space, these methods are nowadays commonly called metaheuris-
tics. Figure 1.1 shows the different classifications of metaheuristics.

Figure 1.1: Different classifications of metaheuristics. (Figure author:

Johann “nojhan” Dréo, Caner Candan).

2
1.1 evolutionary algorithms

1.1 Evolutionary Algorithms

Evolutionary Algorithms (EAs) are a broad family of search and

optimization algorithms that attempt to solve complex problems by
mimicking the processes of Darwinian evolution. Among the meth-
ods shown in the figure 1.1, genetic programming (GP) is a noteworthy
example of EAs.
GP is a branch of genetic algorithms (GAs) and the main difference
between GP and GAs is the representation of the solutions. GP evolves
computer programs that can have different sizes and shapes, while GAs
evolve fixed-length string that represent solution parameters with most
of the structure of the final solution already defined by the user. GP has
generated a plethora of human-competitive results and applications,
including novel scientific discoveries and patentable inventions (Poli
et al., 2008a; Koza, 2010). Because its own advantages over other EAs,
we focus on this particular method to conduct our research work.

1.2 GP open issues

Even though, the literature on GP shows a progressive increase re-

search devoted to both theory and applications, a series of open issues
were clearly stated in the 10th anniversary issue of the main journal in
the field (Genetic Programming and Evolvable Machines) (O’Neill et al.,
2010), among them the following are of particular importance:

open-ended evolution in gp: Design an evolutionary system ca-

pable of continuously adapting and searching. It’s difficulty ranges from
medium to hard. There is difficulty in defining clear criterion for mea-
suring success, and whether or not open-ended evolution is required
by GP practitioners.

gp benchmarks: Is it possible to define a set of test problems that can

be rigorously evaluated by the scientific community and then accepted as a
more or less agreed upon set of benchmarks for experimental studies in GP?

3
introduction

It’s difficulty ranges from medium to high. Benchmark problems serve

as a common language for a community when developing algorithms,
and therefore a certain amount of inertia to not adopt better ones exists.
Also, since GP is already a fairly complex algorithm, adopting overly
simple benchmark problems enables overall lower complexity in the
system, albeit at the expense of the significance of results.

fitness landscapes and problem difficulty in gp: Identifying

how hard a particular problem, or problem instance, will be for some GP
system, enabling a practitioner to make informed choices before and during
application. It’s a hard issue. Problem difficulty relates not only to the
issue of representation, but also to the choice of genetic operators and
fitness functions.

generalization in gp: Defining the nature of generalization in GP

to allow the community to deal with generalization more often and more
rigorously, as in other ML fields and statistical analysis. It is weighted
as a medium difficulty issue. The hardness of defining generalization
in a reliable and rigorous way is common with other ML paradigms.
The lack of generalization in a GP system is often due to lack of prior
planning by the practitioner.
Even though, these issues are different they are nevertheless closely
related. For this reason, the approach taken in this thesis can par-
tially address each of these issues through a mechanism to guide the
GP search process that does not focus explicitly on the objective.

1.3 Mechanisms to guide the search

The mechanism to drive the search in an evolutionary algorithm

is traditionally through a fitness function which typically is similar to
the objective or cost function used to evaluate the solutions based on
the problem objective. But there is no technical reason that prevents
the fitness function to be different from the objective function.

4
1.3 mechanisms to guide the search

For instance, a trivial example of using different functions to drive

the search can be observed in a random search; where a random func-
tion is used to generate solutions and a different function (the objective
function) is used to choose the best solution.
In this research, we highlight a conceptual difference between natu-
ral and artificial evolution. Nature seems to not follow any objective at
all, while artificial evolution is mostly seen as an objective-based pro-
cess.
According to the Darwinian approach, evolution by natural selec-
tion is a process where individuals from current species that are well
adapted to their environment better, will have more chances to survive
and to pass on their genetic information to their offspring. In this pro-
cess, it can be seen that natural selection exerts pressure over time to
maintain the kind of individuals with the right features to thrive in
their (dynamic) environment.
On the other hand, individuals that are not well adapted are consid-
ered inferior and probably will not survive. Following this reasoning, it
means that these less fit individuals can either go extinct or must look
for a different environmental niche in which to survive. Since in nature
we can observe a fierce battle between individuals for natural resources,
natural evolution clearly can be seen as a process which tends to pro-
duce a large diversity of individuals, in other words it is a divergent
process.
This thesis explores a recent and alternative approach to guide the
search process that is more akin to the divergent nature of biological
evolution, where instead of rewarding solutions that seem to be closer
to a given objective, it rewards solutions that are different or unique
compared to the solutions found before. In this approach, the more
different the solution is, the more novel it is considered to be and this
is the fitness score assigned to each solution. This new approach is
called Novelty Search (Lehman and Stanley, 2011a) where solutions
are described by a representative description of their behavior1 or main
features.
1 The concept of behavior will be formally explained in the following chapters.

5
introduction

NS is best suited for deceptive problems2 , which are closely related

with the difficulty of the problem. NS mainly has been used in the
field of evolutionary robotics where the behavioral characterization re-
quired to implement NS is intuitive. Furthermore, the underlying evo-
lutionary algorithm mostly used with NS has been the NeuroEvolution
of Augmenting Topologies (NEAT) (Stanley, 2004). NEAT is a GP-like
algorithm (Trujillo et al., 2014), but the use of NS with standard GP
systems has been mostly unexplored in related literature. To the best
of our knowledge there is just one previous work using a tree-GP sys-
tem with NS (Lehman and Stanley, 2010a), and more recently another
using grammatical evolution (Urbano and Loukas, 2013), both applied
on maze navigation tasks. The study of GP and NS, as a combined ap-
proach, is explored in depth for the first time in this research.

1.4 Original Contributions

Main motivation in this research work is to contribute with insight

to help tackle some of the open issues in GP by incorporating NS in
the search process. Our assumption is that integrating NS with GP it
can helps to address these issues with a different perspective, particu-
larly on the manner in which a GP search process should proceed. In
particular, this thesis makes the following original contributions:

1. This work is the first to use NS outside the evolutionary robotics

domain, particularly to solve machine learning and pattern recog-
nition problems, namely: regression (Martı́nez et al., 2013), clus-
tering (Naredo and Trujillo, 2013) and classification problems
(Naredo et al., 2013b).

2. This work is the first to propose a method to design a set of de-

ceptive classification problems as GP benchmarks (Naredo et al.,
2015).

2 More information about NS can be obtained in; The Novelty Search Users Page, http:
//eplex.cs.ucf.edu/noveltysearch/userspage/

6
1.5 outline of the dissertation

3. This work studies generalization in GP based on NS for evolution-

ary robotics, showing for the first time that NS generalizes better
than traditional search and how the size of the training set influ-
ences generalization for both approaches (Naredo et al., 2016c).

4. We propose in this work a behavior-based approach to analyze the

performance of a GP program for supervised learning (Trujillo
et al., 2013b).

5. One of the limitations in the development of real-world GP solu-

tions is the issue of bloat, which is the evolution of unnecessarily
large solutions, and according with the fitness-causes-bloat the-
ory this is caused by precisely the fitness function. In this work
we found that in some problems, using NS with GP which does
not use an objective-based fitness function, the average program
size of the evolving population does not increase (Trujillo et al.,
2013a).

6. We propose an extension of the progressive minimal criteria

NS (PMCNS), named as MCNSbsf , which considers a dynamical
threshold based on the best-so-far (bsf) solution (Naredo et al.,
2016b).

7. We propose a probabilistic approach to compute novelty named

as probabilistic NS (PNS), which eliminates all of the underlying
NS parameters, and at the same time reduces the computational
overhead from the original NS algorithm (Naredo et al., 2016b).

1.5 Outline of the Dissertation

The dissertation begins with Chapter 2, which introduces the back-

ground about GP (Koza, 1992a), explaining two different flavours of
GP; tree-based GP and grammatical evolution. Furthermore, we ad-
dress the main search spaces used in GP, and present a case study to
explain how a tree-based GP can be used to solve a complex real-world

7
introduction

problem, in this case disparity map estimation for stereo computer vi-
sion.
Chapter 3 introduces the notion of deception, which is frequently
present in real-world problems and closely related with the problem
difficulty, giving some examples about deceptive functions and de-
ceptive navigation tasks. Furthermore, a method to design synthetic
classification problems with deceptive fitness landscape is presented
(Naredo et al., 2015).
Chapter 4 presents the NS algorithm (Lehman and Stanley, 2011a),
and its unique approach to search inspired by natural evolution’s open-
ended property. We highlight that traditional evolutionary search is
driven by a given objective, but NS on the other hand, drives the
search by rewarding the most different solutions. Though, it may sound
strange and counter intuitive, NS has shown that in some problems ig-
noring the objective can outperforms traditional search. The reason for
this phenomenon is that sometimes the intermediate steps to the goal
do not resemble the goal itself (Stanley and Lehman, 2015).
Furthermore, the related work is presented analysing different pre-
vious versions of NS. Two new versions of NS are proposed; MCNSbsf
and PNS (Naredo et al., 2016b). The first version is an extension of the
progressive minimal criteria NS (PMCNS) (Gomes et al., 2012). The
second one is a probabilistic approach to compute novelty, this last
proposal has the advantage that it eliminates all of the underlying NS
parameters, and at the same time reduces the computational overhead
from the original NS algorithm while achieving the same level of per-
formance.
In Chapter 5 a first and original case study of applying NS in a GA-
based search is presented to synthesize topologies of current follower
(CF) circuits. Experimental results show twelve CFs synthesis gener-
ated by GA-NS, and their main attributes are summarized. This work
confirms that NS can be used as a promising alternative in the field of
automatic circuit synthesis (Naredo et al., 2016a).
Chapter 6 studies the problem of generalization in evolutionary
robotics, using a navigation task for an autonomous agent in a 2D envi-
ronment. The learning system used is a GE system based on NS, which

8
1.6 summary of publications

is a different flavour of GP. In particular, this Chapter studies the im-

pact that the training set has on the performance of the learning process
and the generalization abilities of the evolved solutions. Experimental
results clearly suggest that NS improves the generalization abilities of
the GE system (Urbano et al., 2014a; Naredo et al., 2016c).
Chapter 7 presents a GP system based on NS for the classical GP
problem of symbolic regression. In this respect, the paper represents
the first attempt to apply NS in this domain, one of the most com-
mon engineering and scientific problems. A behavioral descriptor was
proposed called -descriptor in order to properly describe the perfor-
mance that each GP individual exhibits with respect to the rest of the
population. Results are encouraging and consistent with recent propos-
als that expand the use of the NS algorithm to other mainstream areas
(Martı́nez et al., 2013).
Chapter 8 presents GP system based on NS to search for data cluster-
ing functions. A domain-specific behavioral descriptor was proposed
named as cluster descriptor (CD). Results show that for simple prob-
lems the exploratory capacity of NS is mostly unexploited, but perfor-
mance is improved in more difficult cases (Naredo and Trujillo, 2013).
Chapter 9 address the supervised classification problem through a
GP system based on NS. In this Chapter the proposed new versions
of NS presented in Chapter 4, MCNSbsf and PNS, are used to solve
binary and multi-class classification problems. Experimental results
are evaluated based on two measures, solution quality and average size
of all solutions in the population. In terms of performance, results
show that all NS variants achieve competitive results relative to the
standard OS approach in GP (Naredo et al., 2013b, 2016b).
Lastly, Chapter 10 presents the general conclusions about this re-
search work.

1.6 Summary of publications

conference papers:

9
introduction

1. Naredo, E., Dunn, E., and Trujillo, L. (2013a). Disparity map es-
timation by combining cost volume measures using genetic pro-
gramming. In Schütze, O., Coello Coello, C. A., Tantar, A.-A., Tan-
tar, E., Bouvry, P., Del Moral, P., and Legrand, P., editors, EVOLVE
- A Bridge between Probability, Set Oriented Numerics, and Evolu-
tionary Computation II, volume 175 of Advances in Intelligent Sys-
tems and Computing, pages 71–86. Springer Berlin Heidelberg

2. Naredo, E., Trujillo, L., and Martı́nez, Y. (2013b). Searching for

novel classifiers. In Proceedings from the 16th European Confer-
ence on Genetic Programming, EuroGP 2013, volume 7831 of LNCS,
pages 145–156. Springer-Verlag.

3. Naredo, E. and Trujillo (2013). Searching for novel clustering pro-

grams. In Proceedings of the 15th Annual Conference on Genetic and
Evolutionary Computation, GECCO ’13. ACM.

4. Martı́nez, Y., Naredo, E., Trujillo, L., and López, E. G. (2013).

Searching for novel regression functions. In IEEE Congress on Evo-
lutionary Computation, pages 16–23.

5. Trujillo, L., Spector, L., Naredo, E., and Martı́nez, Y. (2013b). A

behavior-based analysis of modal problems. In Proceedings of the
15th Annual Conference on Genetic and Evolutionary Computation
Companion, GECCO Companion ’13, pages 1047–1054.

6. Trujillo, L., Naredo, E., and Martı́nez, Y. (2013a). Prelimi-

nary study of bloat in genetic programming with behavior-based
search. In EVOLVE - A Bridge between Probability, Set Oriented Nu-
merics, and Evolutionary Computation IV, volume 227 of Advances
in Intelligent Systems and Computing, pages 293–305. Springer In-
ternational Publishing.

7. Martı́nez, Y., Trujillo, L., Naredo, E., and Legrand, P. (2014). A

comparison of fitness-case sampling methods for symbolic regres-
sion with genetic programming. In EVOLVE - A Bridge between
Probability, Set Oriented Numerics, and Evolutionary Computation

10
1.6 summary of publications

V, volume 288 of Advances in Intelligent Systems and Computing,

pages 201–212. Springer International Publishing.

8. Trujillo, L., Muñoz, L., Naredo, E., and Martı́nez, Y. (2014).

Genetic Programming: 17th European Conference, EuroGP 2014,
Granada, Spain, April 23-25, 2014, Revised Selected Papers, chapter
NEAT, There’s No Bloat, pages 174–185. Springer Berlin Heidel-
berg, Berlin, Heidelberg.

9. Urbano, P., Naredo, E., and Trujillo, L. (2014a). Generalization in

maze navigation using grammatical evolution and novelty search.
In Theory and Practice of Natural Computing, volume 8890 of Lec-
ture Notes in Computer Science, pages 35–46. Springer Interna-
tional Publishing.

10. Naredo, E., Trujillo, L., Fernández De Vega, F., Silva, S., and
Legrand, P. (2015). Diseñando problemas sintéticos de clasifi-
cación con superficie de aptitud deceptiva. In X Congreso Español
de Metaheurı́sticas, Algoritmos Evolutivos y Bioinspirados (MAEB
2015), Mérida, España.

journal articles: published and submitted.

1. Naredo, E., Urbano, P., and Trujillo, L. (2016c). The training

set and generalization in grammatical evolution for autonomous
agent navigation. Soft Computing, pages 1–18.

2. Martı́nez, Y., Naredo, E., Trujillo, L., Pierrick, L., and López, U.
(2016). A comparison of fitness-case sampling methods for ge-
netic programming. Submitted to: Journal of Experimental & The-
oretical Artificial Intelligence, currently working with the reviewers’
comments.

3. Naredo, E., Trujillo, L., Legrand, P., Silva, S., and Muño, L.
(2016b). Evolving genetic programming classifiers with novelty
search. To appear: Information Siences Journal.

11
introduction

4. Naredo, E., Duarte-Villaseñor, M. A., Garcı́a-Ortega, M. d. J.,

Vázquez-López, C. E., and Trujillo, L. (2016a). Novelty search
for the synthesis of current followers. Submitted to: Information
Siences Journal.

12
Genetic Programming
2
abstract — Genetic programming (GP) is a powerful tool that
is widely used to solve interesting and complex real-world problems.
This chapter introduces an overview of GP, focusing mainly on the
evaluation process, and particularly using the tree-representation. We
highlight the difference between fitness and objective function, which
is relevant to this research work. Furthermore, we aboard the different
search spaces which are concurrently sampled during the search pro-
cess. On the other hand, we go deeper in the concept of behavior as a
form to observe the performance of a GP program, proposing a variable
scale containing all different behaviors, from the a low level to a higher
description about the performance observed by each solution.

2.1 Introduction

Genetic programming (GP) is a subset of evolutionary computation

and a generalization of genetic algorithms (GA) that provides a note-
worthy proposal to address the central questions in computer science
stated in the 1950s by Arthur Lee Samuel (Koza, 1992a); “how can com-
puters be made to do what needs to be done, without being told exactly
how to do it?”. GP evolves knowledge and behavior from a high-level
problem statement represented by a population of programs, this pro-
cess is also called program synthesis or automatic program induction

13
genetic programming

(Koza et al., 2000, 2008; Koza, 2010). For this reason GP has become
shorthand for the generation of programs, code, algorithms and struc-
tures (O’Neill et al., 2010). GP can genetically breed computer pro-
grams capable of solving, or approximately solving, a wide variety of
problems from a wide variety of fields (Koza, 1992a).
GP is a domain-independent evolutionary method intended to solve
problems without requiring the user to know or specify the form or
structure of the solution in advance (Poli et al., 2008b). GP is inspired
in natural genetic operations, applying similar operations to computer
programs, such as crossover (sexual recombination), mutation and re-
production. Computer programs can be represented in several ways,
but since trees can be easily evaluated in a recursive manner this is the
traditional representation used in the literature (Koza, 1992a).
GP-trees are composed by two different types of nodes known as
functions and terminals. Function are internal tree nodes that rep-
resent the primitive operations used to construct more complex pro-
grams, such as standard arithmetic operations, programming struc-
tures, mathematical functions, logical functions, or domain-specific
functions (Koza, 1992a). Terminal nodes are at the leaves of the GP-
trees and usually correspond to the independent variables of the prob-
lem, zero-arity functions or random constants.
There are other GP versions that use different representations, such
as linear genetic programming (Brameier and Banzhaf, 2010), cartesian
GP (Peter, 2000), MicroGP1 or in short µGP (Squillero, 2005), and some
authors have even developed special languages for GP-based evolution
such as Push, which is used to implement the PushGP system (Spector
and Robinson, 2002) .
Moreover, GP is a very flexible technique, a characteristic that has
allowed researchers to apply it in various fields and problem domains
(Koza et al., 2000, 2008; Koza, 2010). In computer vision, for instance,
GP has been used for object recognition (Howard et al., 1999; Ebner,
2009; Hernández et al., 2007), image classification (Krawiec, 2002; Tan
1 More information about MicroGP can be found on the following web-pages;
http://www.cad.polito.it/research/Evolutionary_Computation/MicroGP.htm,
and http://ugp3.sourceforge.net/

14
2.2 genetic programming overview

et al., 2005), feature synthesis (Krawiec and Bhanu, 2005; Puente et al.,
2011), image segmentation (Poli, 1996; Song and Ciesielski, 2008), fea-
ture detection (Trujillo et al., 2008c, 2010; Olague and Trujillo, 2011)
and local image description (Pérez and Olague, 2008; Perez and Olague,
2009). GP has been successfully applied to computer vision problems
since it is easy to define concrete performance criteria and the develop-
ment or acquisition of data-sets is now trivially done. However, most
problems in computer vision remain open and most successful propos-
als rely on state-of-the-art machine learning and computational intelli-
gence techniques.

2.2 Genetic Programming Overview

As stated previously, GP is a generalization of GAs, where the basic

GA process is used to write computer programs (Koza, 1992a), from
the artificial evolution perspective GP is a program that is able to write
other computer programs. The GA operations of mutation, reproduc-
tion (crossover) and cost calculation require only minor modifications.
GP is a more complicated procedure because it must work with the
variable length structure to represent programs or functions (Haupt
and Haupt, 2004).
Like other EAs, GP starts with a population of randomly created in-
dividuals, which in this case represent programs or mathematical func-
tions. Inspired in the Darwinian principle of natural selection GP tra-
ditionally applies the maxima of “survival of the fittest” to guide the
search. Every generation, the individuals in the population are evalu-
ated through an objective function to determine their fitness, usually
measuring how close a given computer program is to achieving the de-
sired goal.
The fitness score is then used in the selection process to choose in-
dividuals as parents to generate the next generation of candidate so-
lutions. Chosen parents are passed through genetic-like operations to
share genetic information (between two parents) or to mutate any of
their genetic material.

15
genetic programming

There are three termination criteria traditionally used in GP, first

two are related with respect to an upper limit reached: a maximum
number of either generations or evaluations of the fitness function. The
third criterion is to stop the search when the chance of improvement in
the next generations is excessively low. After at least one of the termina-
tion criteria is satisfied, the best program produced during the whole
run, usually named as the best-so-far, is returned as the result of the
run.

2.3 Nuts and Bolts of GP

GP evolves a randomly generated initial population of programs

into a new population of programs, through genetic operations applied
on every where each iteration is referred as a generation using the evo-
lutionary terminology. Hopefully the evolved programs will be better
than their predecessors, but since it is a stochastic process similarly as
used in nature there is no guarantee of finding the optimal solution.
Since randomness is an essential feature of GP, there are some draw-
backs that deterministic methods do not exhibit.
Computer programs in GP traditionally are represented as tree
structures with nodes and leafs. GP tree nodes are program instruc-
tions named as functions and GP tree leafs are the features (dimen-
sions) of the dataset, we call these elements terminals. For instance,
a data point can be given by xi = (d1 , ..., dk , ..., dn ) where n data ele-
ments dk are the problem features. The set of functions can be com-
posed by the basic arithmetic, trigonometric, or by any other com-
plex function or even by a combination between them, for instance:
plus, minus, times, divide, sin, cos, log, sqrt, power, abs.
The overall process of evolving programs with standard GP is
shown in the algorithm 1. The main difference that can be highlighted
between the algorithms found in the literature is that the fitness f and
objective function g traditionally is the same, but there is no restric-
tion to be different as shown in Algorithm 1. The general flow of a GP
search is presented in Figure 2.2.

16
2.4 search spaces in gp

*
GP program K1
+ * Prefix notation:
K1 (x ) = times(plus(x, x ), times(plus(1, 1), x )).
X X + X

Infix notation:
1 1 K1 (x ) = (x + x ) ∗ ((1 + 1) ∗ x )) .

+
GP program K2
power power Prefix notation:

sin 2 cos 2
K2 (x ) = plus(power(sin(x ), 2), power(cos (x ), 2)).
Infix notation:
X
K2 (x ) = sin2 x + cos2 x.
X

Figure 2.1: Example of programs (expressions) K1 and K2 , evolved by

GP using the previous set of terminals and functions. On the left is
depicted the tree representation, while on the right it can be seen both
prefix and infix mathematical notation for each GP program.

2.4 Search Spaces in GP

It is known that many EAs concurrently sample three different

spaces during a search: genotypic, phenotypic and objective space. In
the case of traditional tree-based GP, the genotypic space corresponds
to the syntactic space and selective pressure is given in objective space
by the objective function designed for the problem. Typically, the geno-
type of a computer program K is given by its syntax (GP-tree), as shown
in Figure 2.3. Syntactic space contains all possible computer programs
which can be composed by the set of functions and terminals used in
the search. The syntax receives as input the information from the vec-
tor of terminals x. In GP, the phenotype corresponds to the algorithm
which is implicitly implemented by the syntax; though, the phenotype
is rarely analyzed explicitly by most GP systems (McDermott et al.,
2011) since extracting it is not a trivial task.

17
genetic programming

Algorithm 1 Genetic Programming Algorithm.

Input: elitism rate; e, cross rate; x, mutation rate; m
Set of functions; S, Set of terminals; T
Fitness function; f
Objective function; g
Termination criteria; number of generations; n, threshold; h
Output: The best-so-far solution scored with the objective function g, found at any generation
is designated as the result.
Procedure:
Initial generation
t ← 0;
Create initial population
P (t = 0) ← generate (S, T );
Execute and evaluate each initial program
evaluate (P (t = 0))
Assign fitness according with the fitness function
Fp (t = 0) ← f (P (t = 0))
Assign quality score according with the objective function
Gp (t = 0) ← g (P (t = 0))
while not meet termination criteria n, h do
i) Parents to apply genetic operations
a. Best existing programs taken as parents.
Pp (t ) ← best (P (t ), e )
b. Select parents according the fitness score
Pp (t ) ← selectP arents (P (t ), f );
c. Create new programs as parents if necessary
Pp (t ) ← generate (S, T )
ii) Create children by mutation.
Pch (t ) ← mutate (Pp (t ), m)
iii) Create children by crossover (sexual reproduction)
Pch (t ) ← reproduction(Pp (t ), x )
Execute and evaluate each evolved (children) program
evaluate (Pch (t ))
Assign fitness according with the fitness function
Fp (t ) ← f (Pch (t ))
Assign quality score according with the objective function
Gp (t ) ← g (Pch (t ))
Assign the new population for the next generation
Pp (t + 1) ← Pch (t )
end while

While these spaces have been the focus of most GP research, other
spaces have recently been used to develop new GP-based algorithms.
For instance, the computer program K produces an output vector,
named recently in GP literature as semantics (Moraglio et al., 2012).

18
2.4 search spaces in gp

New population Evaluation

K1 K2 f(K1)=0.6 f(K2)=0.4
+ +

÷ x x x ÷ x x x
x x x x

K3 K4 f(K3)=0.5 f(K4)=0.8
- - *
*
x x x x
x x

Mutation Recombination Selection

Simbology

Parent Parents
K: Individual
f: Fitness function - + * +
Functions
x ÷ x x ÷ x x x
Terminals x x x x x Trash

x
+ * -
*
x x ÷ x x x
x x
x x x x
O spring O spring

Figure 2.2: Block diagram depicting the main processes involved in a

tree-based GP search

Afterwards, the output semantics are transformed into a single quality

measure by the objective function O which assigns fitness to K.

target ( )

program objective
syntax function
stimulli Fitness
input semantics
genotype

Figure 2.3: The traditional program evaluation process used in GP,

where K is the genotype of a computer program, s is the program out-
put vector called semantics, the objective function O which typically
is defined as a distance d (s, t) between semantics (s) and the expected
target (t), obtaining then the fitness score assigned to the program K.

19
genetic programming

Formally, semantics can be defined as in (Moraglio et al., 2012).

Given a set of n fitness cases in a training set T = {(x1 , y1 ), ..., (xn , yn )},
the semantics of a program Kj with j = 1, ..., m is the corresponding
set of outputs it produces, expressed as s = (Kj (x1 ), ..., Kj (xn )). If we
consider real-valued outputs, then s ∈ Rn . In this case, the objective
function is usually defined as a distance d (s, t) between a program’s se-
mantics s and the desired target t = (y1 , ..., yn ), see Figure 2.3. The most
interesting feature about semantic space is that it defines a unimodal
fitness landscape. Through semantics, researchers have proposed new
search operators that help improve the GP search. For instance, Bea-
dle and Johnson (Beadle and Johnson, 2008) proposed Semantically
Driven Crossover (SDC) that promotes semantic diversity. Uy et al. (Uy
et al., 2011b) proposed four different variants of semantically aware
crossover operators for symbolic regression. Moraglio et al. (Moraglio
et al., 2012) proposed Geometric Semantic GP (GSGP), designing spe-
cial syntactic search operators that generate offspring that are mapped
to geometrically bounded regions in semantic space. Recently, GSGP
has been enhanced with local search mechanisms that improve perfor-
mance and convergence (Castelli et al., 2015). However, in some do-
mains the target is known beforehand, (the optimal output -semantics-
might not be known), instead what is measured is the effect that the out-
put has on an external system, what we are referring to as the domain
context.
For example, consider the static range selection (SRS) GP classifier
(Zhang and Smart, 2006) (discussed in Section 9.1), which functions as
follows. For a two class {ω1 , ω2 } problem and real-valued GP outputs,
the SRS classifier applies a simple classification rule R: if the program
output for input pattern x is greater than a threshold r, then the pat-
tern is labeled as belonging to class ω1 , otherwise it is labeled as a class
ω2 pattern. In this case, while the semantics of two programs might be
different (maybe substantially), they can still achieve the same classifi-
cation performance.
Indeed, these issues have led some authors to pose that the differ-
ent mappings found and sometimes used in the GP process are the
representation of different phenotypes mappings between them. For

20
2.4 search spaces in gp

target ( )

yes program output objective

function
( ) semantics
program
syntax
stimulli explicit
Fitness
objective ?
input
genotype context
Classi novelty
no
Robo measure
intera behavior
description
etc.

Figure 2.4: The basic program evaluation process used in GP consid-

ering the choice of using the objective explicitly or implicitly. First
option (yes), computes fitness through a distance function between the
computer program s (semantics) and the expected target t, named as
objective-based fitness. Second option (no), computes fitness through
a behavior description β, then applying a distance function between
each current behavior β and a set of behaviors B composed by current
behaviors and previous behaviors to assign a novelty score as fitness,
named as novelty-based fitness.

instance, (McDermott et al., 2011) suggests that the behavior of each

component mapping can be studied separately. In this work, however,
we take advantage of recent findings gained from the field of evolu-
tionary robotics (ER), to focus on the aggregate behavior of all these
mappings and analyze the total effect that a given program has on a
specific problem. The concept of behaviors has been widely used in
robotics, particularly emphasized in the seminal works by Brooks from
the 1980’s (Brooks, 1999), making it easy to integrate this concept in
ER. In behavior-based robotics, for instance, an individual’s behavior is
described at a higher level of abstraction than the semantics of a partic-
ular controller, accounting not only for program output but the context
in which the output was produced. The relation between input and
semantic space can be injective or non-injective, while behaviors can
allow for multivalued mappings or many-to-many relations between
inputs and behavior space.
In ER, where GP can be used to evolve controllers that interface
directly with a robot’s actuators, the fitness functions used normally

21
genetic programming

account for the observed high-level interactions between the robot and
its environment. In fact, in a broad survey of types of fitness functions
used in ER, Nelson et al. (Nelson et al., 2009) found that much of the
research introduces a priori human knowledge when selecting a fitness
function to solve a given problem. Nelson et al. group the fitness func-
tions into seven classes, called training data, behavioral, functional in-
cremental, tailored, environmental, competitive and aggregate fitness
functions, ordered based on the amount of a priori knowledge incorpo-
rated into the function. The first class of fitness functions coincides
with the most common approach taken in GP, where training data is
used, the highest amount of a priori knowledge about the problem, the
same as depicted in Figure 2.3. However, as Nelson et al. stated, this
type of fitness functions require specific knowledge of what should be
the optimal output, something that in many cases is not feasible or
might be even unnecessary, as we argued in the example of the SRS
classifier. Therefore, all other classes of fitness functions in ER, each to
a different extent, incorporate the concept of behavior.
Since we can find some disagreement about the interpretation of the
concept of behavior that we can find in different areas, we agree with
the definition given in (Levitis et al., 2009) (p. 108) from the behav-
ioral biology perspective, which states that a “behavior is the internally
coordinated responses (actions or inactions) of whole living organisms
(individuals or groups) to internal and/or external stimuli, excluding
responses more easily understood as developmental changes”.
Considering the above definition, in this work we understand the
concept of behavior as a measurable description about the internally
coordinated external responses of a computer program K to internal
and/or external stimuli x within a given environment or context. The
behavior produced by a particular solution K is captured by a domain
dependent descriptor β. In the ER case, context is given by the robot
morphology and parts of the environment that cannot be sensed, while
the inputs are the robot sensors and the outputs of the controller in-
terface directly with the actuators. In this case the robot behavior de-
scriptor can include such quantities as the robot position (Lehman and
Stanley, 2008), the robot velocity (Nelson et al., 2009) or patterns gen-

22
2.4 search spaces in gp

erated in the robot’s path (Trujillo et al., 2008b). Conversely, for the
classification problem described above, the context is provided by the
specified classification rule R, while one way to describe a classifier be-
havior can be related to the accuracy of the classifier on the training
instances. Afterward, the observed behavior β can be used to compute
fitness by a function that considers the objective either explicitly or im-
plicitly.
The explicit approach is the customary way to compute fitness, us-
ing an objective function measuring how close the observed behavior
is to a particular goal. The implicit approach can be to compute fitness
based on the novelty or uniqueness of each behavior, such as in NS.
The concept of behavior is graphically depicted in Figure 2.4, along
with the manner in which behavioral information is included in the
computation of the objective and fitness functions.

O
B

A V IORS
J
s E

B EH
e
m C
a
n T
t
i I
c
s
V
E
Fine Medium Coarse
Level of Detail
Figure 2.5: Conceptual view of how the performance of a program can
be analyzed. At one extreme we have objective-based analysis, a coarse
view of performance. Semantics-based analysis lies at another extreme,
where a high level of detail is sought. Finally, behavior analysis pro-
vides a variable scale based on how the problem context is considered.

We propose that the behavior-based perspective should be seen as

part of a scale of different forms of analyzing performance. In essence,
the objective function, semantics, and behaviors are different levels of
abstraction of a program’s performance as shown in the Figure 2.5. At

23
genetic programming

one extreme form of analysis, an objective function provides a coarse

grained look at performance, a single value (for each criterion) that
attempts to capture a global description of performance. At the other
end of the analysis scale, semantics describe program performance in
great detail, considering the raw outputs. On the other hand, behavior
descriptors should be considered to be situated between fitness and
semantics, providing a finer or coarser level of description depending
on how behaviors are meaningfully characterized within a particular
domain.

Moreover, the behavior-based approach allows us to modify the fit-

ness computation in other ways. In this work we study the NS algo-
rithm (Lehman and Stanley, 2008; Stanley and Lehman, 2015) that fo-
cuses selective pressure based on the concept of solution novelty. In
particular, since NS was conceived for ER problems it measures novelty
in behavioral space. This work extends previous NS-based contribu-
tions, to apply NS in the common machine learning task of supervised
classification.

2.5 GP Case Study: Disparity Map Estimation for Stereo Vision

This section intends to provide a real-world example of how GP

is used to solve a real-world complex task. In particular, we address
the problem of dense stereo correspondence, where the goal is to de-
termine which image pixels in both images are projections of the same
3-D point from the observed scene. The proposal in this work is to
build a non-linear operator that combines three well known methods
to derive a correspondence measure that allows us to retrieve a better
approximation of the ground truth disparity of stereo image pair. To
achieve this, the problem is posed as a search and optimization task and
solved with genetic programming (GP). Experimental results on well
known benchmark problems show that the combined correspondence
measure produced by GP outperforms each standard method, based
on the mean error and the percentage of bad pixels. In conclusion, this

24
2.5 gp case study: disparity map estimation for stereo vision

work shows that GP can be used to build composite correspondence

algorithms that exhibit a strong performance on standard tests.
In this work, the hypothesis is that three is better than one when it
comes to correspondence algorithms. In other words, we wish to find
an operator K that takes as input the cost volume produced by the three
correspondence methods discussed above (CVSAD , CVN CC and CVBT O )
and generates as output a new cost volume CVO which is optimal with
respect to the accuracy of the respective disparity map. This task is
posed as a supervised learning problem, where the ground truth dis-
parity is known for a set of N images, the goal is to minimize the error
of the disparity map given by K (CVSAD , CVN CC , CVBT O ) with respect
to the ground truth.
This chapter formulates the above problems in terms of search and
optimization, with the goal of automatically generating an optimal op-
erator K. The proposal is to solve this problem using a GP search,
where the terminal (input) elements for each individual are the cost
volumes CVSAD , CVN CC and CVBT O , and the output is CVGP , the best
possible approximation to CVO found by the evolutionary process.

2.5.1 Experimental Configuration and Results

The proposal developed in this work is to evolve an operator that

combines three well known correspondence measures, using Genetic
Programming.
The images used on this work are taken from the Middlebury Stereo
Vision Page2 , which are standard benchmark tests for stereo vision al-
gorithms. The dataset used contains six pairs of stereo images named:
Barn1, Barn2, Bull, Poster, Sawtooth, and Venus. The left image from
each pair is shown in Figure 2.6 for each data set. Moreover, for each
image pair the dataset includes both left-right and right-left pixel accu-
rate disparity maps as ground truth.

2 Images available on the Middlebury Stereo Vision Page: http://www.vision.

middlebury.edu/stereo.

25
genetic programming

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) barn1 (b) barn2 (c) bull

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) poster (e) sawtooth (f) venus

Figure 2.6: Stereo images for experiment tests from Middlebury Stereo
Vision Page.

The goal of our experimental work is to asses the quality of the

raw disparity map generated by GP, compared with respect to the base
methods and the ground truth (GT) of each image pair.
In this work a standard Koza style tree-based GP representation is
used, with subtree-crossover and subtree-mutation. The general pa-
rameters of the search are given in Table 2.1. Of particular importance
is the terminal set, which is given by the cost volumes from the cor-
respondence functions used: Sum of Absolute Differences (SAD), Nor-
malized Cross Correlation (NCC) and Birchfield-Tomasi (BTO), there-
fore, CVSAD , CVN CC , CVBT O , represent cost volumes from each method
and are used as terminals for GP.
To compute each cost volume the search region depends on the
minimum and maximum disparity given by the ground truth disparity
map of the particular image; therefore, for Barn1 the search region is:
[28,131], for Barn2 is: [27,132], for Bull: [29,153], for Poster: [27,161],
Sawtooth has: [31,143], and Venus: [24,158]. Furthermore, the neigh-
borhood M × N considered by each method is given by the number of

26
2.5 gp case study: disparity map estimation for stereo vision

Table 2.1: GP system parameters for the experimental tests

Parameter Description
Population size 20 individuals.
Generations 100 generations.
Initialization Ramped Half-and-Half,
with 6 levels of maximum depth.
Operator probabilities Crossover pc = 0.8; Mutation pµ = o0.2.
n √
Function set + , − , ∗ , / , · , sin , cos , log , x2 , | · |
Terminal set {CVSAD , CVN CC , CVBT O } Where CV is the Cost Volume
from SAD, NCC, and BTO functions respectively
Bloat control Dynamic depth control.
Initial dynamic depth 6 levels.
Hard maximum depth 20 levels.
Selection Lexicographic parsimony tournament
Survival Keep best elitism

rows N and the number of columns M, and particularly for the experi-
ments the neighborhood size used is 3 × 3.
Fitness is given by the cost function shown in Equation 2.1, that
assigns a cost value to every individual K expression as feasible solution
given by GP. The goal is to minimize the error computed between the
disparity map from every K expression and the ground truth, therefore

M−1 N −1
1 XX S
f S (K ) = |d(CV ) (i, j ) − d(SGT ) (i, j )| , (2.1)
NM m
i =0 j =0

where s is the image sequence used, N is the total number of image

pixels, (i, j ) represents the pixel position on the matrix, dCVms is the
disparity map computed using the method m (SAD, NCC, or BTO) for
the s image sequence.
First three image sequences; Barn1, Barn2, and Bull, were selected
as training data and the other three images sequences; Poster, Sawtooth,
and Venus, as testing data.

27
genetic programming

1.39

1.38

1.37
Cost Function

1.36

1.35

1.34

1.33

1.32
0 20 40 60 80 100 120
Number of generations

Figure 2.7: Convergence Graphic of the best GP run result

Plus

Sin Plus

Plus Sin
CVbto

Sin Plus
CVbto

Plus Plus Sin

Plus Sin Sin

CVbto CVncc

Plus Sin Sin

CVsad

Plus Sin
CVncc CVsad

Plus
CVsad CVsad

sin
CVncc

CVsad

Figure 2.8: Function tree of the best GP run result.

Besides the error evaluation explained in Eq. 2.1, we use to com-

pare the error function showed on Middlebury University page that
compute bad pixel percentage, given by

M−1 N −1
1 XX s s
BP = (|dCVm (i, j ) − dGT (i, j )| > δd ) , (2.2)
N
i =0 j =0

28
2.5 gp case study: disparity map estimation for stereo vision

Table 2.2: Comparison results for the Barn1 image (bold indicates best
result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 1.483379 1.344098 1.834855 1.311435
Bad Pixel Percentage 1.021885 0.805163 1.397609 0.771604

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.9: Disparity maps comparison for the Barn1 image.

where the threshold for a bad evaluation δd is a disparity error toler-

ance. For the experiments we use δd = 1.0, since it is the threshold
value used on the reference Middlebury web page.

2.5.1.1 Experimental Results

We compare the SAD, NCC, and BTO matching costs against the
GP cost volume. The GP search was executed over 20 times, in all cases
achieving similar performance. Figure 2.7 presents the convergence
plot for the best run, which shows how the fitness of the best individ-
ual evolved over each generation. Figure 2.7 presents the best GP so-

29
genetic programming

Table 2.3: Comparison results for the Barn2 image (bold indicates best
result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 1.959252 1.660405 2.038269 1.640279
Bad Pixel Percentage 1.579143 1.248889 1.774773 1.226271

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.10: Disparity maps comparison for the Barn2 image.

lution found expressed as a program tree. Moreover, the mathematical

expression of the best GP individual K is given by
∗
fgp =sin{sin[sin(sin CVSAD + CVSAD + CVN CC ) + CVSAD + CVN CC + CVBT O ]
+ CVBT O } + sin(sin3 CVSAD + sin2 CVSAD + CVN CC ) + CVBT O .
(2.3)

Finally, Tables 2.2 to 2.7 present a comparative analysis of the best

GP individual and the three standard methods. Each table shows both
the mean of the absolute difference, and the bad pixel percentage er-
rors for each pair of stereo images. In all cases we can note that the GP

30
2.6 chapter conclusions

Table 2.4: Comparison results for the Bull image (bold indicates best
result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 1.785469 1.378297 2.030898 1.330506
Bad Pixel Percentage 1.186709 0.703539 1.422414 0.675807

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.11: Disparity maps comparison for the Bull image.

method achieves the best performance. Figures 2-6 present a compar-

ative visual analysis of the best GP individual and the three standard
methods, compared with the ground truth. Visually, we can detect the
improvement provided by the GP operator K over the standard meth-
ods.

2.6 Chapter Conclusions

In this chapter a background about GP is presented, focusing on the

fitness function as well as in the search spaces. Here, is highlighted that

31
genetic programming

Table 2.5: Comparison results for the Poster image (bold indicates best
result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 2.432111 1.988050 3.010154 1.945446
Bad Pixel Percentage 1.57552 1.218279 1.985391 1.202187

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.12: Disparity maps comparison for the Poster image.

we can use a different approach to implement a fitness function. The

traditional approach is to reward solutions which are closer to the ob-
jective. Different approaches to the customary objective-based fitness
function has been proposed, one interesting approach will be presented
in the Chapter 4.
Furthermore, this chapter studies the problem of dense stereo corre-
spondence using GP. The proposed approach is to combine three well-
known similarity measures and derive a composed estimation of the
disparity map for a stereo image pair. This task is posed a search and
optimization problem and solved with GP. The terminal elements for
the GP search is composed by the cost volumes produced by the three

32
2.6 chapter conclusions

Table 2.6: Comparison results for the Sawtooth image (bold indicates
best result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 1.532218 1.433264 1.727435 1.374315
Bad Pixel Percentage 1.073662 0.859555 1.169753 1.202187

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.13: Disparity maps comparison for the Sawtooth image.

correspondence methods; SAD, NCC, and BTO. Fitness is based on the

error between the estimated disparity map and the ground truth dis-
parity.
Experimental results for this case study show that the evolved GP
operator achieves better performance that the standard methods based
on well-known benchmark problems. These results are validated with
a set of test cases and an additional performance metric. While these
results are an encouraging first step, further work is considering follow-
ing this topic. For instance, we can add other standard approaches as
input to the GP search, such as non-parametric correspondence meth-
ods. Moreover, we can use the raw disparity map generated by the GP

33
genetic programming

Table 2.7: Comparison results for the Venus image (bold indicates best
result).

Error method SAD NCC BTO GP

Mean of Abs. Differences 2.281053 1.916733 2.704778 1.880739
Bad Pixel Percentage 1.617716 1.285286 1.804815 1.267094

50 50 50

100 100 100

150 150 150

200 200 200

250 250 250

300 300 300

350 350 350

. 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(a) SAD (b) NCC (c) BTO

50 50

100 100

150 150

200 200

250 250

300 300

350 350

50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400

(d) GP (e) Ground Truth

Figure 2.14: Disparity maps comparison for the Venus image.

operators as raw input for global optimization methods which could

allow us to define a higher level fitness evaluation.

34
3
Deception

abstract — Since the very introduction of genetic algorithms

(GAs), it was noted by Goldberg that there is a certain kind of prob-
lem which deceives a search process. Rewarding solutions closer to
the objective in these deceptive problems, the search will be lead away
from the global optimum, getting trapped into regions of local optima.
The notion of deception is related with problem complexity. The more
complex the problem is, the higher the possibilities that it is a decep-
tive problem. Most deceptive problems found in the evolutionary com-
putation literature are toy problems, particularly applied on binary
representations, and more recently in evolutionary robotics. In this
chapter, our intent is to design a set of deceptive classification prob-
lems which hopefully could serve as benchmark to test evolutionary or
search-based classification strategies.

3.1 Introduction

EAs in general have shown good performance when solving com-

plex problems, but a central issue is the characterization of problems
that are difficult to solve, particularly by GAs. A noteworthy attempt
was made by (Goldberg, 1989) to explain the ability of GAs in fre-
quently succeeding at finding good solutions; what is called the build-

35
deception

ing block hypothesis (BBH). In this hypothesis Goldberg states that GAs
can identify segments (blocks) of the optimal solution contained in the
current solution. Furthermore, Goldberg hypothesises that GAs use
these blocks to generate new and better solutions by recombining them
or by mutating them, which at the end (mostly) build-up the complete
optimal solution.
There are a wide range of benchmark problems to test the per-
formance of EAs, which can be ranked from easy to hard problems.
Among the efforts to characterize GA-hard problems some have fo-
cused particularly on the notion of deception. Since the introduction
of the GAs, many deceptive problems have been proposed for binary-
coded GAs.
More recently, (Lehman and Stanley, 2008) have proposed several
deceptive navigation robotic tasks. But, to the best of our knowledge,
there are not deceptive benchmark problems for pattern recognition,
particularly deceptive classification problems.
This work introduces a first attempt to design a deceptive classi-
fication problem that can be used for benchmarking. The following
sections address the notion of deception.

3.2 Deception in the Artificial Evolution

Finding the factors that affect the performance of GAs to solve opti-
mization problems, has been a major interest in the theoretical commu-
nity (Jones and Forrest, 1995). In general, the performance is measured
in terms of their ability to find the closest solution to the global opti-
mum of a given problem. So, according with the BBH, if a GA finds the
global optimum, particularly on problems with a binary representation,
it is because in fact it correctly identified the correct building blocks for
the problem, classifying those problems as easy-GA problems. On the
other hand, it is a hard-GA problem if a GA fails to find the global opti-
mum for that particular problem and this means it did not identify the
correct building blocks.

36
3.3 deception in evolution

Nonetheless, there is another kind of problems, where a GA seems

to identify some of the building blocks, apparently pointing to the
global optimum, but instead gets trapped into regions of local optima.
These are known as deceptive problems since they deceive the search
when combining information of the best solutions found so-far, leading
the search far away from the global optimum (Rana, 1999). If the GA is
not able to find building blocks of low order which are instances of the
global solution, then probably it will not be able to find the global opti-
mum, and instead will find a sub-optimal solution (Deb and Goldberg,
1994). The notion of deception was originally introduced by (Goldberg,
1987), in order to better understand what kind of situations are more
probable to create difficulties for a GA search.

3.3 Deception in Evolution

The importance of the concept of deception is a controversial topic.

In the EA literature we can find a large diversity of arguments, from
those that assure that deception is the only important thing that must
be considered to classify a problem as a hard-GA problem (Das and
Whitley, 1991), to those that claim that deception is neither neces-
sary nor enough to determine the difficulty of a problem (Grefenstette,
1993).
It is clear that some deceptive problems are hard problems, but it
is also true that there are other factors that affect the difficulty of a
problem, such as: epistasis, multimodality, noise, and spurious correla-
tion among others (Goldberg, 1989; Schaffer et al., 1990; Whitley, 1991;
Grefenstette, 1993; Deb and Goldberg, 1993; Jones, 1994; Yang, 2004;
Day and Lamont, 2004; Rada-Vilela et al., 2014).
Another well known example that captures the characteristics that
make some problems difficult to be solved by GAs focus on the notion
of “rugged fitness landscape” (Weinberger, 1990; Horn and Goldberg,
1995). Concerning this (Whitley, 1991) argues that the complexity is re-
lated with the difficulty of a problem, then a problem with high rugged-
ness tends to generate deceptive fitness landscapes. On the other hand,

37
deception

in the evolutionary robotics, for instance, (Lehman and Stanley, 2008)

introduce an intuitive definition of deception related with the difficulty
of a problem: “A deceptive problem is that where an evolutionary algo-
rithm does not find the objective in a reasonable period of time”.
Unifying both approaches, we can agree that if a problem is so com-
plex that show a very rugged fitness landscape, then an EA will take a
long time to find the right path to the global optimum, if possible. So
we can conclude as (Rana, 1999) said; deceptive problems correspond
with what we normally consider to be difficult problems.

3.4 Deceptive Problems

Typical examples of deceptive problems can be found in the liter-

ature, mostly for a binary representation to test GAs. One important
deceptive problem for our research work is the trap function, which
will help us later to explain the approach used in this chapter to design
a deceptive classification problem. On the other hand, another impor-
tant problem presented here, is the deceptive navigation task, which
will allow us to introduce, in the next chapter, a new approach to man-
age the search to solve deceptive problems.

3.4.1 Trap Functions

Trap functions are deceptive functions of unitation. The advantage

of trap functions is that they have few parameters but can still be made
deceptive (Mengshoel et al., 1998). In order for a function to be decep-
tive, must have at least two optimums; one being the local optimum,
and the other the global optimum. The trap function definition given
in (Deb and Goldberg, 1993) is the following: A trap function f for a
binary string u, assigns fitness values depending solely on the number
of ones, defined as
 a
(u − h) , h < l If u ≥ h ,
 l−h



f (u ) = 
 (3.4)
 b (h − u ) , h > 0 otherwise,

h

38
3.4 deceptive problems

where l is the binary string length, a is the global optimum, b is the local
(deceptive) optimum, and h ∈ (0, l ) is the location of the slope change
which divide the region where the optimum local is located from the
region where can be found the global optimum with the binary string
of size l containing just ones, an example is shown in Figure ??. Trap

global
optimum

local
optimum

=number of ones

Figure 3.1: Trap function with the global optimum located at point a,
and the local optimum is located at point b, u is the variable with the
number of ones in the binary string. Varying the parameters of the
function the degree of deception can be varied.

functions are highly artificial in two respects: i) they are composed by

sub-problems which can be solved independently, and ii) the comple-
ment of the deceptive solutions is always the global solution. The trap
functions are attractive because they are relatively easy to manipulate
and can be difficult to solve by GAs (Mengshoel et al., 1998).

3.4.2 Scalable Fitness Function

This function is proposed by (Cuccu et al., 2011a), which is a decep-

tive function that exhibits plateaus of local minima and multi-modality.

39
deception

It is used to test the ability of a GA to cross a large region with fitness

values that mimic on average a flat region, which is defined as follow

x2 l , |x| ≤ 1,




1 2 πω (|x|−1)

fl,ω (x ) =  1 − 2 sin , 1 ≤ |x| ≤ l + 1 , (3.5)


 l
(|x| − l )2 ,

 |x| ≥ l + 1.
with parameters l ≥ 0 and ω ∈ N. This function is symmetric around
the global optimum 0, and is based on the standard parabolic curve,
which is cut between +1 and −1 constructing a plateau of length l. In

20
-20
region with similar ﬁtness local optima
10

global optimum
-20 -15 -10 -5 5 10 15 20

Figure 3.2: Variable deceptive function with parameters: l = 20 which

defines the plateau length and ω = 2 which defines the amplitude of
the sine function.

the region 1 ≤ |x| ≤ l + 1 where both plateaus are located, the parameter
ω controls the amplitude of the sine function, in such a way that if
ω = 0 there will be a flat region, while varying ω will introduce a local
optima region generated by the sine waves, this controls the deception
degree of the function.

3.4.3 Deceptive Navigation

In the evolutionary robotics literature (Lehman and Stanley, 2008)

proposed a deceptive navigation task similar to the one shown in Fig-
ure 3.2. A robot controlled by an artificial neural network (ANN) must

40
3.5 deceptive classification problem design

navigate from a starting point to an end point in a fixed time. A reason-

able fitness function for such a domain is how close the maze navigator
is to the goal at the end of the evaluation. Using this fitness function,
the neuro-evolution search was only successful in three out of 40 runs,
showing that this is a highly deceptive problem. In the next chapter
will be shown how this navigation task is solved in 39 out of 40 runs
with a new approach to explore the search space.

OBJECTIVE Attractor

START

Agent Optimal path Last position Measure Best

Figure 3.3: Deceptive navigation task. The figure on the left shows the
optimal path to the objective. The figure on the right shows how the
objective acts as an attractor rewarding the solutions that are closer to
the objective, but in fact they are getting trapped into a local optima
region.

3.5 Deceptive Classification Problem Design

This section presents the description of a method to design a set

of synthetic classification problems, which show deception and hope-
fully can serve as benchmarks to test classification algorithms (Naredo
et al., 2015). The approach to build a deceptive classification problem
is through a similar strategy that is used to build a trap function shown
in Figure ??, where we can note two regions, one smaller region where
the global optimum is located, and another large region where is found

41
deception

the local optimum. If we observe the trap function shown in Figure ??,
we can note that the smaller the region where the global optimum is
located, the more difficult will be to reach that region and to find the
global optimum.
We can see clearly that for a GA it is more likely to fall in the larger
region and getting trapped in the local optimum, even if there is a so-
lution that falls in the smaller region the gradient can push it all the
way to reach the global optimum, can be a set of counterpart located
in the bigger region with higher fitness, therefore having higher prob-
ability to be chosen, in such a way that the solution in the right way
can be lost. Following this reasoning, we can now attempt to generate a
similar fitness landscape considering a special synthetic dataset, linear
classifiers and an objective function that measures the deceptive nature
of the problem.

3.5.1 Synthetic Classification Problem

Let us consider a simple synthetic classification problem with

a dataset X = {x1 , x2 , ..., xn } composed by 2-classes {ω1 , ω2 }, and 2-
dimensions (d 1 , d 2 ). Each class is contained into the sets C 1 and C 2 ,
each set is composed by two subsets C 1 = {Ca1 , Cb1 }, C 2 = {Ca2 , Cb2 } re-
spectively, these are imbalanced subsets where the sub-index a stands
for the majority subset, while sub-index b stands for the minority one.
The imbalance factor is controlled by the parameter p ∈ [0, 1]. Figure
3.3 shows a configuration for both classes with p = 0.90.
To generate the datasets, a Gaussian function has been chosen with
parameters (µ, f). The centroid of each subset is given by µ with
coordinates (d 1 , d 2 ), while f is the data covariance matrix, where
f = (σd 1 , σd 2 ). The standard deviation σ1 is measured over dimension
d 1 , while σ2 corresponds to d 2 . To simplify our task, we build the data
set such that the clusters of samples follow a regular shape within this
2-dimensional space, then we use the strategy to generate more data
than needed with a factor q for each subset, and then trim the shape
selecting the required samples that fall within this shape. For instance,

42
3.5 deceptive classification problem design

local optima
global optima

Figure 3.4: Synthetic classification problem with 2-classes {ω1 , ω2 } and

2-dimensions (d 1 , d 2 ), where each class set is composed by two imbal-
anced clusters C 1 = {Ca1 , Cb1 }, C 2 = {Ca2 , Cb2 } with circular shapes. The
figure shows two linear classifier models, the dotted green line shows
the local optimum hyperplane Ll , and the dotted orange line shows the
global optima hyperplane Lg . Furthermore, the optimal class separa-
tion is given by δ1 , while δ2 is the separation between both majority
clusters from each class.

for Figure 3.3 we generated p data points that fall within circle shaped
clusters.

3.5.2 Linear Classifier

We consider linear classifiers to solve the previous synthetic classi-

fier problem proposed considering the definitions given in (Theodor-
idis et al., 2010; Theodoridis and Koutroumbas, 2008). The linear
classification problem involves a linear boundary, that is a hyperplane
which can be described by aT x = b for some a ∈ Rn and b ∈ R, where a
and b correspond to d 1 and d 2 respectively for the classification prob-
lem presented in the previous section. Such a line is said to correctly
classify these two sets if all data points with yi = +1 fall on one side
(hence aT x ≥ b) and all the others on the other side (hence aT x < b).

43
deception

Hence, the affine inequalities on (a, b ): yi (aT x − b ) ≥ 0, i = 1, ..., m, guar-

antee correct classification. The above constitute linear (in fact, affine)
inequalities on the decision variable, (a, b ) ∈ Rn+1 . This fact is the ba-
sis on which we can build a linear programming solution to a classi-
fication problem. Once a classifier (a, b ) is found, we can classify a
new point x ∈ Rn by assigning to it the label by the classification rule:
ŷ := sign(aT x − b ).
A classifier is a function c that maps an example x ∈ X to a bi-
nary class c (x ) ∈ {−1, +1}. A linear classifier uses; feature functions
f(x ) = (f1 (x ), ..., fm (x )) and feature weights w = (w1 , ..., wm ) to assign
x ∈ X to class c (x ) = sign(w · f(x )), where sign(y ) = +1 if y > 0 and −1
if y < 0. The main idea is to apply a non-linear transformation to orig-
inal features h(x ) = (g1 (f(x )), ..., gn (f(x ))) and learn a linear classifier
based on h(xi ) Linear classifier decision rule: given feature functions f
and weights w, assign x ∈ X to class c (x ) = sign(w · f(x ))
The focus is on the direct design of a discriminant func-
tion/decision surface that separates the classes in some optimal sense
according to an adopted criterion. The technique that is built around
the optimal Bayesian classifier relies on the estimation of the probabil-
ity density function (pdf) describing the data distribution in each class.
However, in general this turns out to be a difficult task, especially in
high-dimensional spaces.
Alternatively, one may focus on designing a decision surface that
separates the classes directly from the training data set, without having
to deduce it from the pdfs. We begin with the simple case of designing
a linear classifier, described by the equation

g ( x ) = wT x + w 0 = 0 (3.6)

which can also be written as

" #
0T 0 T x
w x ≡ [w , w0 ] =0 (3.7)
1

where w = [w1 , w2 , ...wl ]T is known a the weight vector and w0 as the

bias. That is, instead of working with hyperplanes in the Rl space, we

44
3.5 deceptive classification problem design

work with hyperplanes in the Rl +1 space, which pass through the ori-
gin. This is only for notational simplification. Once a w0 is estimated,
an x is classified to class ω1 if

w0T x0 = wT x + w0 ≥ 0 (3.8)

or to class ω2 if

w0T x0 = wT x + w0 < 0 (3.9)

for the 2-class classification task. In other words, this classifier gener-
ates a hyperplane decision surface; points lying on one side of it are
classified to ω1 and points lying on the other side are classified to ω2 .
For notational simplicity, we drop out the prime and adhere to the no-
tation w, x; the vectors are assumed to be augmented with w0 and 1,
respectively, and they reside in the Rl +1 space.

3.5.3 Objective Function

The most important factor in a deceptive problem is the objec-

tive function. A typical measure for classification performance, which
is frequently used as objective function is the classification accuracy
Acc ∈ [0, 1], which computes the proportion of the correctly classified
data sample with respect to the dataset size. Accuracy gives informa-
tion about how close the classifier performance is to the perfect classi-
fication, given by

TP +TN
Acc = (3.10)
T P + FP + T N + FN

where T P is the true positive, given by the number of samples correctly

classified; FP is the false positive, given by the number of samples in-
correctly classified; T N is the true negative, given by the number of
samples correctly rejected; and finally FN is the false negative, given
by the number of samples incorrectly rejected.

45
deception

3.5.4 Non-Deceptive Fitness Landscape

We are here considering only linear models, and we must analyse

how these models shape the fitness landscape. So, first we need to note
that linear models can be located at any place around the datasets ac-
cording to the line equation d 2 = md 1 + b, with m the slope, and b
the vertical axis (d 2 ) intersection. These parameters allow 3 degrees
of freedom building a 3D fitness landscape, but for the sake of clarity
and to ease our explanation, we will analyse a 2D fitness landscape
represented by a blue dotted line shown in Figure 3.4 (b), built by con-
sidering b fixed, and centred at the origin, as shown in Figure 3.4 (a).

0.9

0.8

0.7
itness

0.6
FFitness

0.5

0.4

0.3

0.2

0.1

0
180 160 140 120 100 80 60 40 20 0
Degrees
Rotation (degrees)
(a) Search space (b) Fitness landscape

Figure 3.5: The left figure shows the search space showing the sectors
Sg and Sl , where the global optimum and local optima are located re-
spectively. The right figure shows the 2D fitness landscape.

We begin our graphical analysis to build a fitness landscape in the

search space, by drawing a line starting from the rotation center to the
higher point of the minority sub-dataset Cb2 , and extending this line to
meet the circle border as shown in the Figure 3.4 (a).
Secondly allowing to increase m to rotate this line clockwise till to
the lower point of the majority sub-dataset Ca1 , we already had drawn
the left hand sector Sg , approximately from 190◦ to 170◦ , where we
can find linear models with 100% accuracy as shown in Figure 3.4 (b)

46
3.5 deceptive classification problem design

where it is shown with a blue dotted line the performance for the half
section approximately from 180◦ to 170◦ .
Rotating the drawn line with the same direction approximately
from 170◦ to 155◦ , will start classifying the data from the majority sub-
dataset Ca1 beginning to decrease the classification rate from 100% to
50% as shown in Figure 3.4 (b).
From approximately 155◦ to 140◦ , the linear model start recovering
its classification performance reaching in this section a maximum of
90%, given by the imbalance factor p = 0.90 for this dataset configura-
tion.
The next circle sector that our rotating linear model meets is the
sector Sl , which contains the local optima region generates a plateau re-
gion in the fitness landscape from approximately 140◦ to 15◦ as shown
in Figure 3.4 (b), which can be seen as a region with neutrality.
Finally, the linear model will increase the classification rate when it
meets the minority sub-dataset Cb1 until it reaches the perfect accuracy
when it meets the next sector Sg . Since the geometry of the clusters
is symmetric, then left half from the circle shaped search space will
show similar behavior as described previously, mirroring the fitness
landscape from 0◦ .

3.5.5 Local Optima and Global Optimum

The choice of getting two imbalanced partitions for each class is to

create at least two regions Sl , Sg , where Sl contains the local optima,
and Sg the global optimum, and considering just linear models as clas-
sifiers we can find models such as as Ll that correctly classifies a great
part of the dataset, but there is a model Lg which is provides the global
optimal (perfect accuracy) solution as shown in Figure 3.3. In order to
create these regions, one possible strategy is to separate first the classes
through a distance δ1 to create the region Sg , then to separate the ma-
jority sub-datasets by a distance δ2 to create the region Sl .
Figure 3.3 shows a possible configuration for the proposed classi-
fication problems. The symmetric construction of the dataset in Fig-

47
deception

ure 3.3 is to easily transmit the general idea, but there could be other
choices, for instance to consider more than two clusters per class with
different amounts of imbalance.

3.5.6 Deceptive Objective Function

In order to get a deceptive objective function we can take a similar

approach used to build a trap function shown in Figure ??. Consider-
ing the same objective function showed in Equation 3.9 and applied to
the previous synthetic classification problem, we can build a piecewise
function to get two regions; one for the local optima and the other for
the global optima. Therefore, we add a class separation criteria given
by the minimum ∆a and maximum ∆b distance between classes, where
∆a is measured between the points Pb1 and Pb2 , and ∆b is measured be-
tween the points Pa1 and Pa2 as shown in Figure 3.5.

( )

Figure 3.6: Class separation criteria, considering minimum and maxi-

mum separation given by the Euclidean distance ∆a and ∆b respectively,
where ∆a = E (Pb1 , Pb2 ) and ∆b = E (Pba , Pa2 ).

Furthermore, we consider a minimum distance dmin measured from

the current linear classifier and the closest point to any of the sub-

48
3.5 deceptive classification problem design

datasets. Now, we can implement the piecewise function through

a threshold h related with classification accuracy rate, for instance
h = 0.99, meaning that the rule will be applied when the linear model
success classifying 99% of the dataset as follows

Acc · d∆min ,


 ∆a > 0, Acc ≥ h,

 a
Fitness =  (3.11)


Acc · dmin , ∆b > 2σd 1 + 2σd 2 , otherwise


∆b

Recall that we are designing a deceptive classification problem, which

must show two regions, one for the local and other for the global op-
tima. So, notice that ∆a must be greater than zero in order to assure
a class separation, and since in our design we are considering circle
shaped clusters and their radius are σd 1 , σd 2 for class 1 and 2 respec-
tively, then in order to both majority subsets be separated ∆b must be
greater than 2σd 1 + 2σd 2 .

3.5.7 Deceptive Fitness Landscape

Using the piecewise function from 3.5.6 applied in the 2D search

space for linear models shown in 3.4 (a), and following similar expla-
nation done to build the fitness landscape shown in 3.4 (b), we can now
generate a deceptive fitness landscape when solving a particular con-
figuration of the synthetic classification problem.
In Figure 3.6, we present three versions of the synthetic classifica-
tion problem shown on the first raw, with its respective 2D fitness land-
scape on the second row, and the 3D fitness landscape on the third raw.
The configurations are sorted according to the proportion δ1 : δ2 used
to built them, which relates the global optimum and local optima re-
spectively. Since the notion of deception is related with the relation
between the regions where both the global optimum and local optima
are located, then can be said that this proportion controls the grade of
deception for each dataset configuration. A proportion 1 : 1 means that
both regions are equal, while 1 : 10 means that the region for the lo-
cal optima is 10 times bigger than the region for the global optimum.

49
deception

Class-1 Class-1
100 Class-2 100 Class-2 Class-1
100 Class-2

50 50
50

0 0
0

−50 −50
−50

−100 −100

−100
−100 −50 0 50 100 −100 −50 0 50 100

−100 −50 0 50 100

1 1 1

0.9 0.9 0.9

0.8 0.8 0.8

0.7 0.7 0.7

itness
ness

tness
F i tFitness

0.6 0.6 0.6

FFitness
F iFitness

0.5
0.5 0.5

0.4 0.4 0.4

0.3 0.3 0.3

0.2 0.2 0.2

0.1 0.1 0.1

0 0 0
180 160 140 120 100 80 60 40 20 0 180 160 140 120 100 80 60 40 20 0 180 160 140 120 100 80 60 40 20 0
Rotation (degrees)
Degrees Degrees
Rotation (degrees) Degrees
Rotation (degrees)
Fitness

Fitness
Fitness

Y-a Y-a Y-a

x is in xis in x is in
ter
cep eg rees) t er es) ter rees)
on (d
cep egre cep eg
tio
n ti tio ti on (d tio tion (d
Rota n
Rota
n
Rota

(g) 1 : 1, Low (h) 1 : 5, Medium (i) 1 : 10, High

Figure 3.7: Set of synthetic deceptive classification problems sorted

from left to right in increasing order of deception induced in the fit-
ness landscape. First row shows the dataset configuration, vertical and
horizontal axes are d 1 , d 2 used to locate the subsets, in our experiments
were set to be in [−100, 100]. Second row shows the 2D fitness land-
scape. Third row shows the 3D fitness landscape.

The second row of Figure 3.6 presents the corresponding 2D fitness

landscape, and the third row shows the 3D fitness landscape for each
dataset configuration version.
On the 2D fitness landscape a graphical performance comparison is
presented between the non-deceptive fitness function represented by a
blue dotted line, against the deceptive fitness function represented by
a red solid line. The maximum classification performance value that

50
3.6 chapter conclusions

can be reached by the deceptive fitness function in the region of local

optima is of approximately 50%. The main difference between both
fitness functions, is that on the deceptive fitness function there are re-
gions with values of zero performance, as can be noted on the red solid
line. This particular behavior separate both regions, isolating one from
the other, and clearly it can be observed how deception operates in a
similar way to the trap function of Figure ??.
The 3D fitness landscapes on the third row of the Figure 3.6 con-
sider in addition the parameter of the vertical axis (d 2 ) intersection,
meaning that now the linear models can be located around the datasets
too. On the last plot with the configuration proportion of 1 : 10 it can
be seen a high degree of deception involved in this particular configu-
ration.

3.5.8 Preliminary Results

The next step after generating the dataset configuration and the de-
ceptive fitness function is testing a set of synthetic classification prob-
lems using some standard classification methods. In this case, the meth-
ods selected are; naive Bayesian method and the Support Vector Ma-
chine (SVM). The first method based on the data distribution leads to
the local optimal with 50% performance in all the datasets tested, while
the SVM method achieve a perfect score of 100% finding the global op-
timal, because this method concerns mainly in the optimal data separa-
tion.

3.6 Chapter Conclusions

In this chapter the notion of deception was presented as well as

a quick tour guide from its origins to our days. We begin our jour-
ney from the very origins of the GAs and how deception was already
identified as an important issue, we went through describing some well
known deceptive functions, and finally arrived to the navigation tasks
which are good environments where deception can be easily observed.

51
deception

Table 3.1: Preliminary results, using a Naive Bayesian method and

SVM.

Proportion Naive Bayes SVM

1:1 0.5000 1.0000
1:2 0.5000 1.0000
1:3 0.5000 1.0000
1:4 0.5000 1.0000
1:5 0.5000 1.0000
1:10 0.5000 1.0000

This short survey allowed us to observe that machine learning re-

search lacks deceptive classification benchmarks, taking this research
opportunity in this chapter was presented a design process to build a
synthetic classification problem that together with a deceptive fitness
function can be used to test classification methods. In this chapter, we
only cover the GP open issue about benchmarks in a very reduced form,
but proposing a very novel one.
To the best of our knowledge this is the first attempt to design a
deceptive classification problem (Naredo et al., 2015). According with
the preliminary results, as expected, linear classifiers will get trapped
in the region of local optima. Of course, a non-linear classifier can eas-
ily solve this data set. Particularly, a method based on data separation
(SVM) has success in finding the global optimum then the conclusion
in this case is that the classification problem designed here is not de-
ceptive for this kind of methods. Furthermore, this chapter helps to
introduce the importance of the deception concept, which is frequently
present in real-world problems and mostly related with the difficulty
of the problem.
There are several future research lines to extend this work, one is
to take into account the methods based on data separation to design
synthetic classification problems that can introduce deception.

52
4
Novelty Search

abstract — Novelty search is a unique approach to search inspired

by natural evolution’s open-ended propensity to perpetually discover
novelty. Rather than converging on a single optimal solution, nature
discovers a vast variety of different ways to meet the challenges of
life. As an abstraction of novelty-discovering in nature, novelty search
directly rewards behaving differently instead of rewarding progress to
some ultimate goal, which is the traditional approach to search. That
is, in a search for novelty there is no pressure to be better.

To be more precise, evolutionary search is usually driven by mea-

suring how close the current candidate solution is to the objective.
That measure then determines whether the candidate is rewarded (i.e.
whether it will have offspring) or discarded. In contrast, novelty search
never measures progress at all. Rather, it simply rewards those individ-
uals that are different.

Instead of aiming for the objective, novelty search looks for novelty;
surprisingly, sometimes not looking for the goal in this way leads to find-
ing the goal more quickly and consistently. While it may sound strange,
in some problems ignoring the goal outperforms looking for it. The
reason for this phenomenon is that sometimes the intermediate steps
to the goal do not resemble the goal itself. John Stuart Mill termed this
source of confusion the “like-causes-like” fallacy. In such situations,
rewarding resemblance to the goal does not respect the intermediate
steps that lead to the goal, often causing search to fail.

53
novelty search

4.1 Introduction

Novelty search (NS) was born from a radical idea about artificial
intelligence (AI) proposed by Lehman and Stanley (2008) based on
previous evolutionary art experiments using their Picbreeder system
(Stanley, 2007). Genetic art was first introduced by Richard Dawkins
(Dawkins, 1986), currently known as evolutionary art (Romero and
Machado, 2007). Picbreeder1 is a webpage where visitors using their
creativity can breed pictures, which are able to have “children” that
are slightly different from their parents, and finally get new pictures
with awesome designs.
The radical idea borrowed from Picbreeder to develop NS, is that
it can be found solutions without really looking for them. The evolu-
tionary art experiment is just one example of non-objective system of
discovery (Stanley and Lehman, 2015). Furthermore, in the experiment
was noticed that the webpage visitors frequently pick the available pic-
tures up as parents, according with an interestingness criteria to have
children from them. Pictures that were the most different, or more
novel, among all other pictures had better chances of being selected for
survival and reproduction. In other words, novelty is a rough shortcut
for identifying interestingness (Stanley and Lehman, 2015).
Provided with this insight the following step was to use these ideas
into the design of a search algorithm. But a first stone on the road
to design this algorithm and then endorse it, is the counter intuitive
idea that a computer algorithm without an objective can work properly,
since almost each of the algorithm designed do have an objective.
When solving a problem with relative low complexity to find the
desired solution, the objective becomes a good idea to drive the search.
But when the problem at hand shows an increasing degree in its com-
plexity, then it is not so easy to find the desired (or approximate) so-
lution. Particularly, in this context, following an objective could guide
the search to find solutions that seem to be good, but in fact are far
away from the desired solution. This is the deceptive phenomenon we

1 More information about Picbreeder can be obtained in; http://www.picbreeder.org/

54
4.2 open-ended search

discussed in the previous chapter, and NS has shown an ability to find

the correct stepping stones to construct the desired solution.
NS has achieved promising results in different areas of evolution-
ary robotics Woolley and Stanley (2012), such as navigation Lehman
and Stanley (2008, 2011a); Urbano et al. (2014a); Urbano and Loukas
(2013), morphology design Lehman and Stanley (2011b), and gait con-
trol Lehman and Stanley (2011a).

(a) Biped (b) Creature

Figure 4.1: Two examples of evolutionary robotics applying NS.

4.2 Open-ended Search

The bio-inspired origins of EAs suggest a substantial difference

with respect to traditional optimization approaches. However, EAs are
guided by an objective function and specially designed search opera-
tors, just like most optimization algorithms Luke (2013). The use of
an objective function in EAs is a key difference with respect to natural
evolution, an open-ended process that lacks a predefined purpose.
Open-ended artificial evolution does not use an objective function
to drive the search, at least not an explicit one. An important feature of
open-ended systems is the continuous emergence of novelty Banzhaf
(2014). In fact, some of early EAs were open-ended Dawkins (1996);

55
novelty search

(a) objective-based (b) novelty-based

Figure 4.2: Objective-based against novelty-based in a task navigation.

novelty-based

however, they have mostly been used in Artificial Life Ofria and Wilke
(2004) and interactive search Kowaliw et al. (2012). Only recently has
open-ended search been proposed to solve mainstream problems, one
promising algorithm is Novelty Search (NS) proposed by Lehman and
Stanley (2008).

4.3 Nuts & Bolts of NS

In NS individuals in an evolving population are selected based ex-

clusively on how different their behaviour is when compared to the
other behaviours discovered so far. Through the exploration of the be-
haviour space, the objective will eventually be achieved, even though
it is not being actively pursued. Implementing novelty search requires
little change to any evolutionary algorithm aside from replacing the
fitness function with a domain dependent novelty metric. This met-
ric quantifies how different an individual is from the other individuals
with respect to their behaviour. Like the fitness function, the novelty
metric must be adequate to the domain, expressing what behaviour
characteristics should be measured and therefore conditioning what
behaviours will be explored. The use of a novelty measure creates a

56
4.3 nuts & bolts of ns

constant pressure to evolve individuals with novel behaviour features,

instead of maximising a fitness objective.
A common criticism of NS is that it might appear to be a random
or exhaustive search, however NS performs a structured search by ac-
cumulating information of the problem (Velez and Clune, 2014). Basi-
cally, NS will exhibit strong performance when randomly generated so-
lutions, those from initial generations, exhibit weak performance, thus
the search towards novelty will be correlated with the search for bet-
ter performance. However, unlike objective-based search, NS does not
stagnate around local optima, since once they are discovered NS will
quickly push the search towards new regions in solutions space (Velez
and Clune, 2014).

4.3.1 Behavioral representation

The behaviour of each individual is typically characterised by a vec-

tor of numbers. The experimenter should design the behaviour charac-
terisation so that each vector contains aspects of the behaviour of the
individual that are considered relevant to the problem that is being
solved. Once the behaviour characterisation is defined, the novelty dis-
tance metric dist can be defined as the distance between the behaviour
vectors. A commonly used distance is the Euclidean distance between
the vectors.
The behaviour characterisation can be for example the situation of
the agent at the end of the trial or some measure that is sampled along
the trial. For instance, when originally introduced, novelty search was
demonstrated on a maze navigation task (Lehman and Stanley, 2011a),
where the behaviour characterisation was a vector containing the final
position (x, y ) of the robot in the maze. Choosing the aspects of the
behaviour that should be put in the behaviour characterisation typi-
cally requires domain knowledge, and has direct implications on the
diversity of behaviours that will be found by evolution. Excessively de-
tailed behaviour characterisations can open the search space too much,
and might cause the evolution to focus on evolving behaviours that are

57
novelty search

irrelevant for solving the problem. On the other hand, a too simple be-
haviour characterisation might be insufficient for accurately estimating
the novelty of each individual, and can prevent the evolution of some
types of solutions.
It is important to note that the detail of the behaviour characterisa-
tion is not necessarily correlated with the length of the behaviour vec-
tor. In the maze navigation experiments (Lehman and Stanley, 2011a),
the authors expanded the behaviour characterisation to include inter-
mediate points along the path of an individual through the maze, in-
stead of just the final position. The authors experimented with differ-
ent sampling frequencies, resulting in behaviour characterisations of
different lengths, and the results showed that the performance of the
evolution was largely unaffected by the length of the behaviour char-
acterisation. Although a longer characterisation increased the dimen-
sionality of the behaviour space, only a small portion of this space was
reachable since adjacent points in a given path were highly correlated
(i.e. the agent can only move so far in the interval between samples). It
was demonstrated that larger behaviour descriptions do not necessarily
imply a less effective search, despite having a larger behaviour space.

4.3.2 NS algorithm

Instead of designing an objective function that summarizes the per-

formance of each individual, to use NS successfully the concept of
uniqueness must be grounded in some way. The uniqueness of a so-
lution must be measured against the rest of the evolved solutions; for
instance, solutions can be compared based on their genotype or pheno-
type. Instead, NS is based on the concept of behavior space, where a be-
havior is characterized by a vector β that describes the way an agent K
(or GP individual) acts in response to a series of stimuli (input) within a
particular context. In NS, the proposed measure of sparseness ρ around
each individual K described by its behavior descriptor β, is given by
k
1X
ρ (βj ) = d(βj , βl ) , (4.12)
k
l =1

58
4.3 nuts & bolts of ns

where βl is the l−th nearest neighbor of βj in behavior space with re-

spect to a distance or similarity measure d(, ) (Lehman and Stanley,
2008), and the number of neighbors k is an algorithm parameter. Given
this definition, when the average distance is large then the individual
is located within a sparse region of behavior space, and it is located
in a dense region if the measure is small, see Figure 4.3. The original
NS proposal considers the current population and an archive of indi-
viduals to compute sparseness, this avoids backtracking. An individ-
ual is added to the archive if its sparseness satisfies a certain threshold
condition ρth , which is the second NS parameter. Several papers have
suggested implementing the archive as a FIFO queue of size q, this al-
leviates the cost of computing sparseness but adds another parameter.

Figure 4.3: The original NS proposed in Lehman and Stanley (2008)

uses a measure of local sparseness ρ around each individual behavior
β within behavior space to estimate its novelty, considering the current
population and novel solutions from previous generations stored in an
archive by mean of a threshold which depends on the sparseness mea-
sure ρth of the current individual behavior β. The figure shows three
different scenarios for an individual’s behavior β, where the cases are
sorted from the most dense region (less novel) to the most sparse one
(the most novel).

59
novelty search

4.3.3 Underlying evolutionary algorithm

Once the objective-based fitness is replaced with novelty, the un-

derlying evolutionary algorithm operates as normal, selecting the most
novel individuals to reproduce. Over generations, novelty search en-
courages the population to spread out across the space of possible
behaviours, eventually encountering individuals that solve the given
problem, even though progress towards the solution is not directly re-
warded.

4.3.4 Minimal Criteria Novelty Search

Lehman and Stanley proposed an extension to NS that evolves so-

lutions more efficiently (Lehman and Stanley, 2010c), called minimal
criteria novelty search (MCNS). The main idea behind MCNS is that
novelty should be preserved as long as it satisfies some minimal crite-
ria (MC) for selection. Those individuals that meet the MC preserve
their novelty score, and individuals that do not satisfy the MC will re-
ceive a penalized score, given by


ρ (βj ), if the MC are satisfied


ρMCN S (βj ) =  (4.13)
0,
 otherwise.

For instance, for navigation in an unenclosed maze a simple MC is that

each individual must stay within the maze. The MC can be used as
a tool to manage and discard infeasible solutions during the search,
thereby directing the search towards the most promising regions in the
search space (Lehman and Stanley, 2010c). Moreover, several works
have attempted to combine novelty with an objective function. For
instance, (Cuccu et al., 2011b) and (Doucette and Heywood, 2010)
proposed an arithmetic combination of both measures, and a novelty-
based multiobjectivisation is proposed in (Mouret, 2011).

60
4.4 contributions on ns

4.4 Contributions on NS

NS suffers from several shortcomings and are the main topic of the
proposal developed in this section. In particular, computing novelty
using Equation 4.11 can lead to several problems. First, it is not evi-
dent which value for k will provide the best performance. Second, the
sparseness computation based on Equation 4.11, has a complexity of
O (m + q )2 , where m is the size of population, and q is the archive size,
which will grow unbounded if it is not implemented as a FIFO queue
(Lehman and Stanley, 2008, 2010b,c, 2011a). Third, choosing which
individuals should be stored in the archive is also important, several
approaches have been proposed and each adds an additional empirical
parameter or decision rule.
In this section we present two proposals to compute novelty. The
first proposal is an extension of the progressive minimal criteria NS
(Gomes et al., 2012), named as MCNSbsf , which considers a dynamical
threshold based on the best-so-far (bsf) solution. The second proposal
is a probabilistic approach to compute novelty named as probabilistic
NS (PNS), which eliminates all of the underlying NS parameters, and
at the same time reduces the computational overhead of the original
NS algorithm.

4.4.1 MCNSbsf

In this work we propose an extension of the progressive minimal cri-

teria NS, we call it MCNSbsf (best-so-far) and it combines NS with the
problem’s objective using the above equation. Thus, MCNSbsf evolves
the population through novelty, but constrains the search by penalizing
individuals that do not meet the MC, which is based on performance.
Considering a minimization problem we define that the MC of Equa-
tion 4.12 is satisfied if and only if F (Kj ) ≤ F (Kbsf )(α + 1), where F (·)
is the objective function that assigns a quality score to each individual
Kj , Kbsf is the best solution found so far and α ∈ [0, 1]. We choose a
dynamic threshold (α + 1) as a proportion of F (Kbsf ), because if we

61
novelty search

minimization

measure
>15% bsf
not satisfy MCNS

up to 15% bsf
satisfy MCNS

novelty = 0
quality

best solution
so-far (bsf) 0
compute
novelty

Figure 4.4: Illustration of the effect of the proposed MCNS approach.

use a static threshold we would need to set this threshold differently

for each problem. If the value is too high, then none of the individuals
will satisfy the MC, which was observed in preliminary tests as shown
in chapter 9. Conversely, if it is too low, all individuals will satisfy
the MC and it would become useless. The proposed approach can be
seen as a broader group of methods, where the MC is set proportion-
ally to some statistic over the current population, in this case it is the
best fitness, but the method could set the MC based on the median, the
mean or any other statistic. However, in this work we only consider
the best solution since it provides the greediest approach to possibly
improve convergence speed, the original goal of the MCNS approach,
while other variants will be studied in future work.

Some drawbacks of the proposed MCNS variant are the following.

MCNS introduces a new parameter α, in addition to those already used
by NS. Moreover, if many individuals in a given generation are assigned
a novelty measure of 0, the worst case, then the search might lack a suf-
ficient gradient and could proceed randomly. However this last point
is applicable to most MCNS algorithms.

62
4.4 contributions on ns

4.4.2 Probabilistic NS

The second proposal to overcome the drawbacks of NS is to use a

probabilistic approach towards computing the novelty of each solution.
The behavior descriptor β of a GP classifier is modelled as a binary
random vector of size n, with n the number of fitness cases and an
estimation of its probability mass function P (β ) is used to compute the
novelty φ of each solution behavior β , given by

1
φ (β ) = , (4.14)
P (β )

such that the novelty of each solution is inversely proportional to the

probability of producing it during the search. In this way, measuring
novelty is accomplished without the need of empirically tuning any
additional parameters in the GP search. The time complexity of com-
puting φ is negligible once P (β ) is known and does not require the use
of an external archive. To simplify this process further, we make the
naive Bayesian assumption that the individual dimensions i in the be-
havior descriptor βj,i are independent; i.e. that the performance of a
GP individual Kj on a particular training sample is independent with
its performance on any other. Under this assumption, φ can be com-
puted as

1
φ(βj ) = Qn , (4.15)
i =1 Pi (βj,i )

where P (βi ) represents the pmf of the i-th component βi of β. There-

fore, the problem then becomes estimating each P (βi ) during the
search, which is accomplished as follows. First, let Bt represent the
behavior matrix of generation t, given by the Equation ??, such that Bt
t
contains all behaviors βj,i from each individual j corresponding to the

63
novelty search

i-th fitness case at a generation t, with m individuals in the population

and n fitness cases.
 t  t t 
 β1   β1,1 · · · · · · β1,n 
 .   . . . 
 .   . .. .. 
 .   . 
t  β t   β t
B =  j  =   , (4.16)

j,i 
 ..   .. . .
  
 .   . .. ..  

   
t t t
βm βm,1 · · · · · · βm,n

X n
δt = t
βj,i = [ δ1t ··· · · · δnt ] . (4.17)
j =1

A second step is to compute the frequency of different behaviors in

the first generation (t = 0) for each fitness case as shown in Equation ??.
The frequency of 1s as behavior description related with a particular
fitness case is computed by summing over each column of Bt , given
by 1 δit =0 = nj=1 (βj,i
t =0
P
= 1), and the frequency of 0s is expressed as
the complement 0 δit =0 = m − (1 δit =0 ). The accumulated frequencies
δit =1
of 1’s and 0’s, are computed iteratively every generation t by; 1 b
δit−1 +1 δit , and 0 b
b δit = (t + 1)m −1 b
δi , respectively.
Knowing the frequency of the different behaviors, the probability
Pi (βj,i )t can be estimated by

1 δi
b
t
Pi (βj,i = 1) = , (4.18)
m(t + 1)

for the 1s, and for the 0s it is given by

t t
Pi (βj,i = 0) = 1 − Pi (βj,i = 1). (4.19)

Then, the probabilistic novelty of a new behavior βjt that appears at

generation t can be computed by

1
φjt = Qn t , (4.20)
i =1 Pi (βj,i )

64
4.4 contributions on ns

t=1

t=2

t=k

t=n-1

t=n
0 1 0 1 0 1 0 1 0 1 0 1

current cumulative current cumulative current cumulative

frequency frequency frequency generations

f r e q u e n c y a b o u t (pop + archive) b e h a v i o r s d e s c r ip t io n f r o m e a c h s a m p le

Figure 4.5: Representation of the PNS novelty measure, where each col-
umn represents one feature βi of the AD vector and each row is a dif-
ferent generation. In each column, two graphics are presented, on the
left is the frequency of individuals with either a 1 or 0 for that particu-
lar feature in the current population, and on the right the cumulative
frequency over all generations.

taking logarithms on both sides to avoid numerical error, we obtain

n
 
X  1 
log φjt = log  t 
 . (4.21)
i =1
P ( β
i j,i ) 

If we consider a selection method based on ranking (such as tour-

nament), which is the usual method for GP, then the novelty measure
used by PNS can be computed by
n
X 1
φP N S tj = , (4.22)
P (β t ) +
i =1 i j,i

where is a small real value to avoid numerical errors caused by divi-

sions by zero, one way to set it is to 1/population − size. The general
idea of the PNS measure for a behavior descriptor β with a binary string
representation, is depicted in Figure 4.5.

65
novelty search

PNS seen to be related with the frequency fitness assignment (FFA)

method proposed in (Weise et al., 2014), where the fitness is assigned
through a lookup table. This table contains the absolute frequencies
of the discretized values from the real-valued objective function for
each individual. The main difference between FFA and PNS is that PNS
model the population behaviors (which can be the objective-based fit-
ness) through a distribution function (i.e. a Gaussian function), and the
fitness is assigned according to the corresponding probabilistic value
from the currently updated distribution.
Another approach that can be related with PNS is the estimation of
distribution algorithms (EDAs) (Larrañaga and Lozano, 2001), some-
times called probabilistic model-building genetic algorithms (PMB-
GAs) (Pelikan et al., 2002). EDAs diverge from the traditional EAs,
particularly in the approach to generate the new population. The pop-
ulation in EAs is generated using genetic operators, such as crossover
and mutation, while EDAs generate a new population by sampling a
probability distribution, estimated by an explicit probability function
encoded by a Bayesian network, a multivariate normal distribution,
or another model from selected individuals in previous generations
(Larrañaga and Lozano, 2001).
Because the similarities that PNS show respect to the methods FFA
and EDAs, could be some that argue that PNS is a different kind of
search from NS, and that could be named, for instance as: rarity search
(RS).
PNS is a method directly related with NS, but Even though, PNS
uses a probability distribution to assign fitness, it can be used on top
of most of the EAs, without affecting the selection method or any other
mechanism in the search process, but only the approach to assign fit-
ness.

4.5 Chapter Conclusions

This chapter presented the NS algorithm as an option of the tradi-

tional approach to guide the search through following objectives. It

66
4.5 chapter conclusions

was introduced the philosophical ideas behind NS and its relationship

with natural evolution. The Open-ended evolution issue in GP stated
in (O’Neill et al., 2010) is addressed by the NS approach, which is one
of a limited amount of heuristic strategies that all but guarantees that
the search can proceed in an open-ended manner.
Furthermore, it was explained the NS algorithm as well the re-
quired steps to implement it over most evolutionary algorithms. One
important consideration to implement NS is the behavioral representa-
tion, which in evolutionary robotics mostly is used the agent last po-
sition. Finally, we present two proposals to extend NS, by incorporat-
ing the objective function explicitly and dynamically (MCNS) and by
proposing a more efficient and parameter free NS variant (PNS). Both
of these approaches are fully evaluated on supervised classification in
Chapter 9.
Last consideration is that NS as an open-ended search is not a niche
strategy, it can be used to solve traditional learning tasks effectively,
and based on some measures outperforms standard OS.

67
novelty search

68
5
NS Case Study: Automatic Circuit Synthesis

abstract — Before applying NS to GP search, this chapter presents

a case study of NS applied to a complex real-world task. In this case
we use a simple binary GA, to illustrate the manner in which NS can be
applied and its results relative to standard objective-based search. In
particular, in this chapter a topology synthesis method is introduced
using genetic algorithms based on novelty search (NS). The synthe-
sized topologies are current follower (CF) circuits; these topologies
are new and designed using integrated circuit CMOS technology of
0.35µm. Topologies are coded using a chromosome divided into four
genes: small signal gen (SS), MOSFET synthesis gene (SMos), polariza-
tion gene (Bias) and current source synthesis gene (CM). The proposed
synthesis method is coded in Matlab and uses SPICE to evaluate the
CFs fitness. The GA based on NS (GA-NS) is compared with a standard
objective-based GA, showing unique search dynamics and improved
performance. Experimental results show twelve CFs synthesis gener-
ated by GA-NS, and their main attributes are summarized. This work
confirms that NS can be used as a promising alternative in the field of
automatic circuit synthesis.

5.1 Introduction

The automated design process of integrated circuits (IC) based on

complementary metal-oxide-semiconductor (CMOS) is carried out by
the industry of Electronic Design Automation (EDA), to synthesize am-
plifiers, voltage followers, current mirrors, among others. EDA is a

69
ns case study: automatic circuit synthesis

category of software tools for designing electronic systems, which typi-

cally are validated by simulation (Gielen and Rutenbar, 2000; Martens
and Gielen, 2008; Mazumder and Rudnick, 1999; Rutenbar et al., 2007;
Vellasco et al., 2001).

EDA tools increase productivity in the design of ICs, even for the
circuit blocks that are not repetitive. Particularly, the analog design
automation is more complex compared to digital design automation,
because the relationships among their specifications are more complex.
Moreover, analog design requires experience, intuition and creativity,
primarily because it works with a large number of parameters that usu-
ally exhibit complex interactions among them.

In recent years several researchers have proposed some methods for

analog circuit synthesis; for example, (Gielen and Rutenbar, 2000) fo-
cuses on the design of passive circuits. Evolutionary algorithms (EA)
have proven to be a good choice to generate new circuits, or to opti-
mize previous by sizing their topologies (Gielen and Rutenbar, 2000;
Martens and Gielen, 2008; Rutenbar et al., 2007; Vellasco et al., 2001).
In this work, we focus on finding new topologies that synthesize a cur-
rent follower (CF) circuit using a binary encoded genetic algorithm
(GA) (Holland, 1975). A binary genetic encoding of unity gain cells
(UGC) is implemented through nullator and/or norator elements.

Therefore, in this work we use a GA (Holland, 1975) based on NS to

synthesize new topologies for a current follower circuit. The hypothesis
in this work is that NS will be able to find new circuit designs, designs
that are left unexplored by a standard search process that relies on a
fixed objective function. The experimental results show that NS allows
the GA to find different topologies with respect to those found in pre-
vious works where a standard objective-based GA was used. This work
and its results give us insights about the usefulness of NS on electronic
circuit synthesis, the first such work in related literature.

70
5.2 ga synthesis and cfs representation

5.2 GA Synthesis and CFs Representation

GAs in particular have been used to design electronic circuits in

several works. For instance, a GA was used to generate the topol-
ogy, and/or the component values of electronic circuits in (Gielen and
Rutenbar, 2000).
Another alternative was proposed by Koza et al. in (Koza et al.,
2005), that use genetic programming (GP) with a variable-length
representation containing topology modifying operators, component-
creating operators and arithmetic-performing operators. In Koza’s ap-
proach the operators that modify the circuit topology and select com-
ponent values are inseparable, and all are under the control of the evo-
lutionary processes operating on the expression.
Circuit synthesis involves both the selection of a suitable topology
and the choice of component values, and as Koza has shown, these may
be optimized simultaneously by the evolutionary process. However,
there is no reason why these operations should not be performed sep-
arately, with different optimization methods being used, specifically
tailored for each problem. Clearly the circuit topology must be cho-
sen first, and the appropriate algorithm for this task can be a GA. For
each circuit topology generated the component values can then be opti-
mized, and the performance of the circuit used as the fitness function
in a GA. In such cases, circuit fitness can be evaluated using SPICE, that
is a simulation program with IC emphasis. The code for which can bee
incorporated into the synthesis program.
The component values could also be optimized using a GA, but
this is not the best choice for problems involving well-behaved objec-
tive functions dependent on a fixed number of variables. It is well es-
tablished that numerical optimization methods converge much faster
and involve fewer objective function evaluations (Flores Becerra et al.,
2014). No optimization method is guaranteed to find the global opti-
mum, but it has been found that numerical optimization of component
values achieves a high proportion of results close to the global opti-
mum. This hybrid approach using a GA to select a suitable topology
and numerical optimization to choose component values is likely to be

71
ns case study: automatic circuit synthesis

more efficient than allowing evolution to perform both tasks concur-

rently. In this work, however, we will only focus on the first part of the
problem, that of finding novel circuit topologies, leaving any further
optimization as future work for real-world implementations.
In particular, this work focuses on the synthesis of current followers
(CFs). These circuits copy the value of a current to other parts of a
circuit with higher impedance. In other words, the CF can maintain
a current higher impedance loads without altering the original source.
CFs can be used in various analog circuits, such as filters, oscillators,
data transmission, and current conveyors (CC) (Duarte-Villaseñor et al.,
2011; Flores Becerra et al., 2014; Tlelo-Cuatle and Duarte-Villaseñor,
2008; Tlelo-Cuautle et al., 2007, 2010).

5.2.1 Objective Function

The objective function for a synthesized CF circuit K is adapted

from (Duarte-Villaseñor et al., 2011), and is given by

F (K ) = |1 − g| + 0.25 · f (BW ) + 0.5 · (f (Zin ) + f (Zout )) (5.23)

with f (BW ), f (Zin ), and f (Zout ) defined as

 BW −107
| |



 107
, BW ≤ 107
7
α 1|BW −10 |

107 < BW ≤ 108



107
,
f (BW ) =  (5.24)

7
α |BW −10 |
0.1 + 2 107 , 108 < BW ≤ 109




7

 0.2 + α3 |BW −10 | , 109 < BW


107
(
α2 |100 − Zin | , Zin < 100
f (Zin ) = (5.25)
α1 |100 − Zin | , Zin ≥ 100

α2 104 − Zout , Zout < 104

(
f (Zout ) = (5.26)
α3 104 − Zout , Zout ≥ 104
where g is the circuit gain, BW is the circuit bandwith, Zin is the in-
put impedance, Zout the output impedance and the αi s are function

72
5.2 ga synthesis and cfs representation

parameters set to α1 = 0.01, α2 = 0.001 and α3 = 0.0001. As can be

seen, Equation 5.20 defines a minimization problem, where the goal is
to increase the gain g of the circuit, subject to the penalizing terms of
equations 5.21, 5.22 and 5.23. In particular, the parametrization given
in the above equations pushes the search to find circuits with g = 1,
BW = 10MHz, Zin = 100Ω and Zout = 10KΩ, which are good values
for CF circuits (Duarte-Villaseñor et al., 2011). As stated before, for a
standard GA Equation 5.20 can be used to define fitness directly, but
this is not necessarily the case as discussed in Chapter 4.

5.2.2 CFs Representation for a GA

Using nullators and norators, it is possible to describe the behavior

of current followers, where for synthesis purposes, a nullator (O) and
a norator (P) always must form a joined-pair (Tlelo-Cuatle and Duarte-
Villaseñor, 2008; Tlelo-Cuautle et al., 2007). The node in the joined
terminals of the O-P pair is associated to the source (S) of a MOSFET,
the other terminal of the O element is associated to the gate (G), and the
other terminal of the P element to the drain (D). Figure 5.1 (a) shows
the nullor-based description of a CF, it consists of four P elements (P1-
P4), each one joined with an O element. In this work we use the binary
genetic encoding introduced by (Tlelo-Cuautle et al., 2007), for auto-
matic synthesis of analog circuits, particularly to find CF topologies.
The O-P pairs can be described by a small-signal gene called SS; the
synthesis of each O-P pair can be codified by the gene called SMos.
The addition of current and voltage biases is codified by the gene Bias;
and the synthesis of the current biases for the current mirror CMOS by
the gene CM.

The SS gene is defined as joined terminals of the O-P pairs, us-

ing two bits for each O-P pair. The SMos gene is defined as PMOS
or NMOS, using one bit for each O-P pair. The Bias gene is three bits
for each CMOS, and the CM gene is defined for 4 CMs: simple, cas-
code, Wilson, and modified Wilson; it uses two bits for each CMOS. As

73
ns case study: automatic circuit synthesis

Vdd Vdd

2I 2I

P1 P2
Vb1 M1 M2 Vb2
Vb1 Vb2
O1 O2
Iin Iout Iin Iout
O3 O4
Vb3 Vb4 Vb3 M3 M4 Vb4
P3 P4

2I 2I
Vss Vss

(a) Nullors (b) MOSFETs

Figure 5.1: CF representation using (a) nullors and (b) MOSFETs.

a result, the genetic representation of the CF consists of a chromosome

ChCF of four ordered genes, given by

ChCF = (SS, SMos, Bias, CM ) . (5.27)

An example of the CF synthesis for the chromosome proposal

shown in Figure 5.1(a) is depicted in Figure 5.1(b) using MOSFETs,
where the CF obtained is known. This description has been reported
in (Duarte-Villaseñor et al., 2011), where CFs were synthesized using
a standard GA. By using n as the number of O-P pairs, the length of
ChCF is given by

Length(ChCF ) = 2n + n + 3n + 2
= 6n + 2 .

In this work we consider three different topology sizes related with

the number of MOSFETs used by the CFs, from 1 to 3. For the topology
using just one MOSFET, the chromosome length is 8 bit, when using
two MOSFETs 13 bits are used, and for three MOSFETs 18 bits.

74
5.3 results and analysis

5.2.3 CF synthesis with GA-NS

In this scenario, we hypothesize that the very nature of circuit de-

sign, requires an exploratory and unorthodox approach, that promotes
uniqueness and creativity. It is precisely in these scenarios, Lehman
and Stanley argue, that NS should be able to produce strong results rel-
ative to standard objective-based search (Stanley and Lehman, 2015).
To apply NS, we choose to make β = ChCF , and thus sparseness is com-
puted based on the Hamming distance.
As stated before, our hypothesis is that NS will be able to find de-
signs that are left unexplored by a standard search process. There-
fore, in the following experimental section we will focus our work on
comparing the performance of the GA-NS with the standard GA-OS
approach, highlighting the unique solutions found by GA-NS. It must
be stressed that Equation 4.11 is used to determine fitness in NS, and
thus establish the selection pressure for the GA search. Nonetheless,
the final solutions returned by the search are still chosen based on the
objective function, in this case F (K ) which is given Equation 5.20. It
mus be understood that NS changes the manner in which the search
progresses, but the underlying goal of the task is still be evaluated by a
domain specific objective. The difference with a standard GA, however,
is that the objective function is not used to directly guide the search,
it is only used offline to choose the final solution returned by the algo-
rithm.

5.3 Results and Analysis

Hereafter, we will refer to the standard GA as objective or objective-

based search (GA-OS), and to GA based on NS as novelty or novelty-
based search (GA-NS). The parameters used for both algorithms in all
experiments are shown in Table 5.1, in accordance with previous works
(Duarte-Villaseñor et al., 2011; Flores Becerra et al., 2014; Tlelo-Cuatle
and Duarte-Villaseñor, 2008; Tlelo-Cuautle et al., 2007, 2010). The GA-
NS parameters are summarized in Table 7.3. All algorithms are ex-

75
ns case study: automatic circuit synthesis

Table 5.1: Shared parameters for the GA and GA-NS.

Parameter Description
population size 20
Max generations 200
Stop criteria 10 generations without
the best-fitness changing
Selection Tournament (size= 3)
Crossover one point
Crossover rate 0.9
mutation one point
Mutation rate 0.1

Table 5.2: GA-NS parameters.

Parameter Description
k−neighbors half of pop size
ρth half of chromosome size
archive control FIFO
archive size double of pop size

ecuted 10 times and performance is analyzed based on the objective

function value of the best solutions found.

The quality measured by the objective function in Equation 5.20 is

used by GA-NS to choose the best solution at the end of the run, and
by GA-OS as the fitness value to guide the search and to choose the
best solution. Note that the objective function defines a minimization
problem. As stated before, fitness in GA-NS is given by the sparseness
measure in Equation 4.11, using a binary representation for β and the
Hamming distance. All experiments were carried out on a Workstation
with a Xeon 3.50GHz processor with 16GB of RAM.

76
5.3 results and analysis

Table 5.3: Synthesized CF topologies with one MOSFET.

method generation decimal score

GA-OS 6 143 5.4956
GA-NS 6 122 5.4522
GA-NS 4 219 5.2143

Table 5.4: Synthesized CF topologies with two MOSFET.

method generation decimal score

GA-NS 8 4567 5.0003
GA-NS 6 4979 5.2450
GA-NS 95 5059 5.0003
GA-NS 27 6147 4.8482
GA-NS 91 3638 4.7058

5.3.1 CF Topologies

Tables 5.3 to 5.5 summarize the most notable topologies found by

GA-OS and GA-NS, respectively for the three considered topology sizes
(1, 2 and 3 MOSFETs). In these tables, the second column shows the
generation where the best CF topology was found, and the third col-
umn shows the decimal conversion of the binary chromosome, while
the last column shows its performance as given by the objective func-
tion. Moreover, figures 5.2 - 5.4 present the corresponding circuit
topologies; i.e., the phenotype of the solutions. For convenience, the
legend of each topology in figures 5.2 - 5.4 provides the corresponding
decimal conversion of the genotype (chromosome) of the solution.
For the simplest case, considering a single pair O-P, Table 5.3 and
Figure 5.2 show two topologies found by GA-NS and one by GA-OS. It
is important to note that the GA-OS consistently found the same topolo-
gies, one and again. Furthermore, the GA-NS found different results.
All topologies found have been previously reported and studied in the
literature (Razavi, 2001); this was expected due to the small sizes of

77
ns case study: automatic circuit synthesis

Table 5.5: Synthesized CF topologies with three MOSFET.

method generation decimal score

GA-NS 90 152195 5.2188
GA-NS 11 62399 5.2468
GA-NS 49 116851 5.2465
GA-NS 193 252746 5.3453

the CFs, and because we are using the same representation and fitness
function as previous works.
For the second series of experiments considering two pairs of O-
P (two MOSFETs), Table 5.4 and Figure 5.3 only the best topologies
found by GA-NS are presented. Topologies No. 4979 and 5059 can be
considered trivial because you can find it manually quite easily. More-
over, the topology No. 36387 shows a known circuit (Razavi, 2001),
which confirms that the GA-NS is guiding the search to good solutions
in the search space. Finally, topologies No. 4567 and 6147 are novel,
and should be considered as new designs of CF. In both topologies gain
values are near to one, but Zin and Zout are not ideal; however these
impedances can be occupied for filter designs made to measure.
The third set of experiments used three O-P pairs to build circuits
with 3 MOSFETs, these results are summarized in Table 5.5 and Figure
5.4. The topology No. 152195 (Figure 5.4(a)) shows three serial Cfsof
ine CMOS, this a clear example of a synthesized CF built from smaller
Cfs. Figures 5.4 (b) and (c) show more elaborate constructions of CFs
found by GA-NS. Finally, the topology no. 252746 in Figure 5.4 (d)
shows a novel structure that has not been presented in any related liter-
ature. Such a design confirms the ability of the NS paradigm to explore
the search space and find unorthodox solutions, even to long-standing
and well-known problems.
It is noteworthy, that the current sources as ideal presented in Fig-
ures 5.2, 5.3 and 5.4, the GAs are generated by mirrors current type
Wilson modified. All topologies have a performance as a current fol-

78
5.3 results and analysis

Vdd
Vdd
Vdd
2I
I I
Iout Iin
Iout
V1

Iin Iin Iout

I 2I I

Vss Vss Vss

(a) No. 143 (b) No. 122 (c) No. 219

Figure 5.2: Three CF topologies for the synthesis of a CF with the topol-
ogy size of one MOSFET.

Vdd
Vdd
2I
Iin Vdd Vdd I Vdd
Iin
2I 2I I I I
Iin Iin Iin Iout
V1
V1
V2

Iout Iout Iout I

Iout
2I 2I 2I 2I
I

Vss Vss Vss Vss Vss

(a) No. (b) No. 4979 (c) No. 5059 (d) No. 6147 (e) No. 3638
4567

Figure 5.3: Five CF topologies for the synthesis of a CF with the topol-
ogy size of two MOSFETs found by GA-NS.

lower with dimensions W n = 6µ, W p = 4.4, and Ln = Lp = 1.2µ; with

V dd = 1.6V , V ss = 1.6V , and I = 20uA.

5.3.2 Comparison between GA-OS and GA-NS

Let us now analyze the effect that the NS algorithm has on the
search process for circuit synthesis, relative to objective-based search.
Figure 5.5 shows convergence plots of the best solution found by each
algorithm, showing the average behavior over all runs. The figure plots

79
ns case study: automatic circuit synthesis

Vdd Vdd

3I 3I
IoutIin

Iin Iout

3I 3I

Vss Vss
(a) No. 152195 (b) No. 62399

Vdd
Vdd

3I I 2I
Iin Iin

V1
V2
V2
Iout Iout

3I I 2I

Vss Vss
(c) No. 116851 (d) No. 252746

Figure 5.4: Four CF topologies for the synthesis of a CF with the topol-
ogy size of three MOSFETs found by GA-NS.

the objective function value of the best solution with respect to the
number of generations. In particular, we focus on the experiments us-
ing two O-P pairs, for circuits with two MOSFETs. Based on this plot,
we can see no significant difference between both algorithms, the qual-
ity of the solutions found is comparable and the convergence is similar
for both algorithms, even though GA-NS reaches a better average per-
formance. However, as stated before, the topologies found by GA-NS
do not match those found by GA-OS. Therefore, in terms of the per-
formance of the synthesized circuits both algorithms are more or less
equivalent, the difference lies on the actual topologies found by each

80
5.3 results and analysis

5.3 GA−OS
GA−NS

Objective Score
5.2

5.1

4.9

5 10 15 20
Generations
Figure 5.5: Convergence plots for GA-NS and GA-OS for the two MOS-
FET CFs, showing the performance (objective function value) of the
best solution found over the initial generations of the search. The lines
represent the average over 10 runs of the algorithms.

method. This results should be highlighted, since some have argued

that the search carried out by the NS algorithm seems to be random,
given its exclusion of the objective function. However, our results sup-
port findings in other domains (Urbano et al., 2014a; Velez and Clune,
2014), that show that NS is not a random process and that it can at
the very least achieve the same level of performance as objective-based
search.
Figure 5.5 only shows the convergence over the first generations ac-
cording with the number of generations that GA-OS takes to converge.
Table 5.6 presents the average number of generations required by each
algorithm to find the best solution for each experiment over 10 runs.
It is evident from these results that the GA-OS converges much earlier
than GA-NS, much earlier than the maximum number of generations
allowed (after the fitness stagnates for 10 generations). Since GA-NS
fitness is based on novelty, it requires more generations for the search
to converge. This behavior highlights the fact that when the search is
driven directly by the objective it can become stagnated around local

81
ns case study: automatic circuit synthesis

Table 5.6: Number of generations required by GA-OS and GA-NS to

converge for each of the CF circuits: one MOSFET (M1), two MOSFETS
(M2) and three MOSFETs (M3). Each row is a different run and the final
row shows the average.

GA-OS GA-NS
Run M1 M2 M3 M1 M2 M3
R1 11 24 20 108 102 140
R2 11 18 11 109 191 201
R3 16 12 13 201 201 97
R4 15 10 21 36 201 97
R5 10 10 22 50 201 113
R6 6 6 6 193 106 176
R7 11 18 11 126 192 70
R8 16 12 13 135 201 201
R9 15 10 21 136 135 201
R10 10 10 22 72 74 201
Aver. 12.1 13.0 16.0 116.6 160.4 149.7

optima. GA-NS certainly finds similar local optima, but the search does
not stagnate given its ability to promote diversity and explore other re-
gions of the search space, allowing it to find solutions that might be-
come inaccessible to GA-OS.
For simplicity, and given the quick conversion of GA-OS, we can
take a snapshot of the type of solutions found by each algorithm af-
ter the first 10 generations. Figures 5.6 - 5.8 compare the composition
of the best solutions found by both GA-OS and GA-NS. The figures
present frequency histograms, where the height of each bar represents
the percentage of runs for which a particular bit in the chromosome
was set to a 1 value in the best solution found so far. For instance, if
a bar reaches a value 0.5 this it means that a 50% of the best solutions
found after 10 generations have a 1 at that particular position within
the chromosome. The plots are divided for each experimental config-
uration, with Figure 5.6 showing the results for the single MOSFET
CFs, Figure 5.7 for two MOSFETs and Figure 5.8 for three MOSFETs.

82
5.4 conclusions

1 1

0.8 0.8
Frequency

Frequency
0.6 0.6

0.4 0.4

0.2 0.2

0 0
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Bit position Bit position
(a) GA-OS (b) GA-NS

Figure 5.6: Histograms showing the average composition of the chro-

mosome of the best solution found by each algorithm after 10 genera-
tions, for the single MOSFET circuits.

These figures nicely illustrate our claim, that GA-NS finds solutions
that are different from those found by GA-OS. Moreover, that the solu-
tions found, while being different, achieve the same performance based
on the objective function. This means that GA-NS explores other areas
of the search space, some of which might contain local (or even global)
optima that are not accessible to the standard GA-OS.

5.4 Conclusions

In this work we use an automatic synthesis approach for analog cir-

cuit topologies using a GA, particularly to find topologies for CF cir-
cuits. In particular, this work proposes the use of the NS algorithm
for circuit synthesis, the first such work in this field. While standard
objective-based search assigns fitness and selective pressure based on
the domain-specific objective function, NS guides the search by deter-
mining fitness based on the uniqueness, or novelty, of each individ-
ual solution. In the proposed GA-NS method, the objective function is
only used to select the final solution returned by the algorithm, but the

83
ns case study: automatic circuit synthesis

1 1

0.8 0.8
Frequency

Frequency
0.6 0.6

0.4 0.4

0.2 0.2

0 0
1 2 3 4 5 6 7 8 9 10111213 1 2 3 4 5 6 7 8 9 10111213
Bit position Bit position
(a) GA-OS (b) GA-NS

Figure 5.7: Histograms showing the average composition of the chro-

mosome of the best solution found by each algorithm after 10 genera-
tions, for the two MOSFET circuits.

1 1

0.8 0.8
Frequency

Frequency

0.6 0.6

0.4 0.4

0.2 0.2

0 0
0 5 10 15 20 0 5 10 15 20
Bit position Bit position
(a) GA-OS (b) GA-NS

Figure 5.8: Histograms showing the average composition of the chro-

mosome of the best solution found by each algorithm after 10 genera-
tions, for the three MOSFET circuits.

search is carried out based on the concept of solution novelty. The ex-
perimental results showed that NS allows the algorithm to explore the
search space in a different way than a standard GA-OS does. While the

84
5.4 conclusions

standard approach converged to expected results, the NS approach was

able to discover unique solutions for even this well-known and widely
studied circuit design problem. Moreover, while the GA-NS algorithm
found unique circuits, its performance was equivalent to GA-OS, show-
ing that the NS approach can also produce high-performance solutions
even when it omits the objective function from the search process.
Future work derived from this research will focus on the follow-
ing. First, to optimize the evolved topologies of the CF circuits and
to subject them to real-world experimental validation. Second, apply
the NS paradigm to synthesis other specialized circuits of interest in
the field of electronic design automation. Third, we can enhance the
NS approach by attempting to force the search away from specific ar-
eas of the search space. For example, it should be possible to seed the
population with previously known designs that should be avoided by
the search, since they are not as interesting. In this way, the NS algo-
rithm could be used to explicitly search for circuits that are unique in
electronic literature.

85
ns case study: automatic circuit synthesis

86
6
Generalization of NS-based GP Controllers
for Evolutionary Robotics

abstract — Over recent years, evolutionary computation research

has begun to emphasize the issue of generalization. Instead of evolving
solutions that are optimized for a particular problem instance, the goal
is to evolve solutions that can generalize to various different scenarios.
This chapter compares objective-based search and novelty search on a
set of generalization oriented experiments for a navigation task using
grammatical evolution (GE). In particular, this chapter studies the im-
pact that the training set has on the generalization of evolved solutions,
considering: (1) the training set size; (2) the manner in which the train-
ing set is chosen (random or manual); and (3) if the training set is fixed
throughout the run or dynamically changed every generation. Experi-
mental results suggest that novelty search outperforms objective-based
search in terms of evolving navigation behaviors that are able to cope
with different initial conditions. The traditional objective-based search
requires larger training sets and its performance degrades when the
training set is not fixed. On the other hand, novelty search seems to be
robust to different training sets, finding general solutions in almost all
of the studied conditions with almost perfect generalization in many
scenarios.

87
generalization of ns-based gp controllers
for evolutionary robotics
6.1 Introduction

Generalization is considered to be a cornerstone concept of main-

stream machine learning research (ML) (Kaelbling et al., 1996). In
general, one of the most important issues in supervised and semi-
supervised ML is the ability of an algorithm to generate general models
or solutions, starting from a finite set T of training instances (Kushchu,
2002b). Particularly in robotics, generalization is extremely desirable
when addressing reinforcement learning (RL) tasks (Kaelbling et al.,
1996). RL is the problem faced by an agent that must learn a behavior
through trial-and-error interactions with an environment (Kaelbling
et al., 1996). Another research area where generalization plays an im-
portant role is evolutionary robotics (ER), a form of RL that uses an
evolutionary algorithm (EA) to carry out the learning process.
The present chapter studies generalization in ER where the learn-
ing is carried out by grammatical evolution (GE) (O’Neill and Ryan,
2001; Dempsey et al., 2009; Georgiou, 2012) a form of EA based on the
general principles of the genetic programming (GP) paradigm (Koza,
1992b). Although generalization is closely related to robustness, they
refer to separate concepts. Robustness can be defined as resiliency to
slight changes in the inputs, due to noise or other random processes,
while generalization should be understood as the ability of a solution
to solve unseen cases (data or conditions not used during the learning
process) with a satisfactory degree of success. Therefore, a solution
is considered to be general if it is able to solve a large number of un-
seen cases. The difficulty of generating general solutions stems from
the fact that good performance on the training set T cannot guarantee
equal performance on unseen inputs, what is also referred to as the
bias-variance dilemma (Duda et al., 2000; Bishop, 2006).
The generalization issue has begun to receive much attention in ER
over recent years (Francone et al., 1996; Banzhaf et al., 1996; Mahler
et al., 2005; Kushchu, 2002a,b; Uy et al., 2010; Langdon and Poli, 2001;
Gonçalves and Silva, 2011a,b; Naik and Dabhi, 2013). On the other
hand, most works using GP study generalization in standard super-
vised learning problems, which are usually defined by a training set

88
6.1 introduction

T = {(xi , yi )} with i = 1, ..., n, where each pair (xi , yi ) represents an ex-

pected input and desired output (also referred to as a fitness-case in GP
literature). The goal is to find a model K that minimizes an objective
function based on the difference between K (xi ) and yi ∀i. However, ER
relies on RL (Nolfi and Floreano, 2000; Nelson et al., 2009), where an
autonomous agent is placed within an unknown environment and it
must learn to carry out a specific task, such as navigating towards a de-
sired target position. Similar problems have also been studied in GP lit-
erature, since the original proposal of the Santa-Fe trail (SFT) by Koza
(Koza, 1992b). In this domain, the objective function can be defined
in different ways (Kushchu, 2002a,b; Robilliard et al., 2006; Doucette
and Heywood, 2010; Georgiou and Teahan, 2010; Urbano and Loukas,
2013; Nelson et al., 2009), but basically it must measure the degree by
which the desired high-level task is fulfilled. The training and testing
sets are defined by the set of initial conditions specified for the agent
within the environment (such as location and orientation), as well as
the characteristics of the environment and the amount of time it has
to complete the task. Generalization in this domain still merits further
research.
This chapter presents an extensive study of generalization in a GE-
based ER system for an autonomous agent that must navigate within a
2D environment and reach a predefined and fixed target. The goal is to
study the effect that the size of the training set has on the generalization
ability of a GE-based learning system. Moreover, this chapter presents
an extensive comparison between two approaches towards defining fit-
ness and applying selective pressure during evolution.
First, the standard approach in GE, as in most other EAs, is to define
fitness proportionally to the performance of the solutions as expressed
by the objective function. This measure is used to quantify fitness and
to guide the search, we will refer to the traditional approach as classi-
cal or objective-based search. For instance, for a navigation problem
the objective function could be defined as the Euclidean distance be-
tween the final position of the agent and the target. In this chapter, two
objective functions are tested to measure performance: (1) f1 computes
the Euclidean distance between the agents final position and the target

89
generalization of ns-based gp controllers
for evolutionary robotics
(Lehman and Stanley, 2008), and (2) f2 considers both the Euclidean
distance and the length of the agents trajectory without considering
any repeated position (Georgiou, 2012).
Second, the novelty search (NS) algorithm proposes a different per-
spective to define fitness (Lehman and Stanley, 2008). NS was moti-
vated by the goal of dealing with deceptive fitness-landscapes (Gold-
berg, 1987) in ER, where the objective function tends to guide the
search away from the global optimum. NS replaces the objective func-
tion to compute fitness with a measure that quantifies how unique, or
novel, an evolved solution is with respect to all previously found solu-
tions during the search. However, instead of using genotypic diversity,
a common tool in EAs (Nicoară, 2009; Burke et al., 2004), NS uses a
description of what each solution does within its environment, what
can be referred to as a behavior descriptor (Lehman and Stanley, 2008;
Trujillo et al., 2011b, 2008a; Mouret and Doncieux, 2012). In this way,
selective pressure in NS pushes the search towards novel behaviors,
allowing the search to avoid local optima. In this chapter we experi-
mentally compare two different behavior descriptors including the one
proposed in (Lehman and Stanley, 2010a). NS has been widely used
in navigation and other ER problems with strong results (Lehman and
Stanley, 2008, 2010a, 2011a; Urbano and Loukas, 2013; Gomes et al.,
2013), and has recently been extended to more traditional ML prob-
lems with GP (Naredo and Trujillo, 2013; Martı́nez et al., 2013). How-
ever, most works in ER have not studied the issue of generalization in
NS, or how the size of the training set affects it.
This chapter also studies the impact of how the training set is deter-
mined, considering several different variants. First, the training set is
constructed randomly, using a different number of instances, to evalu-
ate how the size of the training set impacts generalization. This random
strategy is compared with manually selected initial conditions, to eval-
uate the bias that a human designer introduces into the learning pro-
cess and to compare the newly found results with our previous work
(Urbano et al., 2014b). Moreover, recent works in GP suggest that vary-
ing the training set during the evolutionary process can improve gener-
alization (Gathercole and Ross, 1994; Gonçalves et al., 2012; Gonçalves

90
6.1 introduction

and Silva, 2013), but this has only been validated in traditional ML
problems, not in ER. Therefore, this chapter also studies the effect of
varying the training set during evolution, instead of keeping the train-
ing set fixed during the search. The training set is either set statically
for the entirety of the evolutionary process or it is randomly varied at
the beginning of each generation, or at at the beginning of every run
and fixed then. It is assumed that by varying the training set the sys-
tem might be able to cope with a larger set of different scenarios. In
Individual BNF-Grammar Transcription
Binary String
(A) <expr> ::= <line> (0) <expr> 219 % 2 = 1
110110110101010100101001
| <expr> <line> (1) <expr> <line> 85 % 2 = 1
101111110000101100011000
(B) <line> ::= ifelse <condition> [<expr>] [<expr>] (0) <line> <line> 41 % 2 = 1
Translat ion | [<op>] (1) <op> <line> 191 % 3 = 2
Integer String (C) <condition>::= wall-ahead? (0) move <line> 11 % 2 = 1
219 85 41 191 11 24 | wall-left? (1) move <op> 24 % 3 = 0
| wall-right? (2) move turn-left ----------------
(D) <op> ::= turn-left (0)
| turn-right (1) Resulting
move turn-left program
| move (2)

Figure 6.1: Example of a GE genotype-phenotype mapping process,

where the binary genotype is translated into an integer string, which is
then used to select production rules from a predefined grammar. The
derivation sequence of the program is shown on the right, all codons
(typically groups of 8 bits) were used but wrapping was unnecessary in
this example.

summary, the main contributions of this chapter can be summarized as

follows:

i. Experimental evaluation of the effects that the training set has on

the generalization ability of two approaches towards determining
fitness for a navigation problem within a 2D environment solved
with GE: (1) novelty-based, and (2) objective based; compared with
random search as a baseline. Moreover, two different objective
functions are considered for objective-based search, and two differ-
ent behavior descriptors are evaluated to measure solution novelty.

ii. The size of the training set is varied, from the simplest case with
a single fitness-case, to a relatively large set with 60 fitness-cases.
Each training instance defines different initial conditions for the

91
generalization of ns-based gp controllers
for evolutionary robotics
agent within the environment, specifying its position and orienta-
tion. Generalization is then evaluated based on the performance
achieved on a test set of 100 randomly generated instances, con-
sidering both the quality given by the objective functions and the
percentage of cases in which the agent reaches the target (referred
to as hits).

iii. The manner in which the training set is determined is also evalu-
ated, considering three different approaches: (1) randomly setting
the training set at the beginning of each run; (2) randomly chang-
ing the training set at the beginning of each generation; or (3) man-
ually setting the training set based on expert knowledge, which is
fixed for all runs.

The remainder of this chapter is organized as follows. Section 6.2 re-

views the main background concepts. Afterwards, Section 6.3 presents
the experimental set-up of our work, outlining all of the considered
variants and performance measures. Section 6.4 presents and discusses
the main experimental results. Finally, Section 5.4 presents the conclu-
sions and future research lines that can be derived from this work.

6.2 Background

This section reviews background topics related to the present work.

First, we quickly describe GE, which is the basis of our ER system. Sec-
ond, we discuss the issue of generalization in GP literature in general.
Third, we present the NS algorithm and review previous works that
study generalization.

6.2.1 Grammatical Evolution

As stated before, GE is a variant of GP which is capable of evolv-

ing computer programs in an arbitrary language defined by a Backus-
Naur Form (BNF) grammar (O’Neill and Ryan, 2001). Programs are
indirectly represented by variable length binary genomes, and built by

92
6.2 background

a developmental process. The linear representation of the genome al-

lows for the application of genetic operators such as crossover and mu-
tation in the manner of a typical genetic algorithm, unlike tree-based
GP (Koza, 1992b). Starting with the start symbol of the grammar, each
individual’s chromosome contains in its codons (typically groups of 8
bits) the information necessary to select and apply the grammar pro-
duction rules to construct the final program.
Production rules for each non-terminal are indexed starting from 0.
To select a production rule for the left-most non-terminal of the devel-
oping program, from left to right, the next codon value in the genome
is read and interpreted using the formula I = c%r, where c represents
the current codon value, % represents the modulus operator, and r is
the number of production rules for the left-most non-terminal. The cor-
responding production rule in the I-th index will be used to replace the
left-most non-terminal. If, while reading codons, the algorithm reaches
the end of the genome, a wrapping operator is invoked and the pro-
cess continues reading from the beginning of the genome. The process
stops when all of the non-terminal symbols have been replaced, result-
ing in a valid program. In the wrapping process, if an individual fails
with the condition of replacing all of the non-terminal symbols after a
maximum number of iterations, then it is considered as an invalid in-
dividual and it is penalized with the lowest possible fitness. The map-
ping process is illustrated with an example in Figure 6.1, where we use
a grammar to describe maze navigation programs written in Netlogo
(Wilensky, 1999).

6.2.2 Generalization in Genetic Programming

Generalization in evolutionary learning has not received the same

amount of attention as in traditional ML, but has been addressed by
some works. In (Kushchu, 2002a), Kushchu studies the generalization
ability of GP, viewed as forming a hypothesis that goes beyond the ob-
served training instances. Training is the process of finding a solution
during the learning process, and generalization occurs when a learner

93
generalization of ns-based gp controllers
for evolutionary robotics
takes a decision and prefers the correct hypotheses over others. This
preference can be due to a bias in the learning algorithm or prior knowl-
edge regarding the problem domain. According to Kushchu, there are
two major types of bias; representational and procedural. Representa-
tional bias is given by the language used to describe the hypotheses,
and procedural bias is given by the search process.

Regarding the latter, to promote generalization in ML it is neces-

sary to gather more knowledge about the problem, in such a way that
the learning system increases its likelihood of success. Early research in
generalization with GP includes (Francone et al., 1996), which provides
an evaluation of a compiled GP system on supervised classification
problems, showing that GP can achieve comparable generalization to
other ML paradigms. Some researchers have correlated the problem of
excessive code growth in GP, normally referred to as bloat (Vanneschi
et al., 2010), with a lack of generalization. Suggesting that when a pro-
gram grows too large it might overfit the training data and thus fail
to generalize (Rosca, 1996). However, (Mahler et al., 2005) and (Van-
neschi et al., 2010) showed that controlling bloat and limiting program
growth does not necessarily improve generalization.

More recently, (Uy et al., 2010) examines the impact of semantically

aware search operators, showing that this new class of genetic opera-
tors can help improve overall performance and generalization of a GP
system for symbolic regression problems. Similarly, (Gonçalves et al.,
2015) has shown that other semantic operators might help improve gen-
eralization in GP for classical ML problems. Other approaches towards
improving generalization can be found in (Gonçalves and Silva, 2013;
Martı́nez et al., 2013; Spector, 2012). These works propose fitness case
sampling methods for GP, where the set of fitness-cases used to deter-
mined fitness at each generation is constructed by sub-sampling the
original training set. Results indicate that these algorithms can help im-
prove generalization, but further research is still required. In (Castelli
et al., 2010) the authors compare several flavors of GP based on their
generalization abilities, determinining that multi-objective optimiza-
tion seems to improve generalization. Finally, (Trujillo et al., 2011c)

94
6.2 background

and (Castelli et al., 2011) have attempted to derive measures of func-

tional complexity to be used as predictors of generalization in GP.
What is important to note from this high-level overview, is that al-
most no work regarding generalization in GP has studied GE or con-
sidered RL that does not use a static training set. Moreover, all of
these works have focused on traditional objective-based search and
none have addressed these issues with NS, an alternative that is dis-
cussed next.

6.2.2.1 Generalization with Novelty Search

Most recent works that have applied NS to ER (Lehman and Stan-

ley, 2010a, 2011a; Georgiou and Teahan, 2010) have studied the issue
of deception, but have not provided in depth results regarding general-
ization. These works did not test if the evolved behaviors were able to
generalize to different initial conditions, target points or environments.
Velez and Clune in (Velez and Clune, 2014), on the other hand, did
transfer maze navigation controllers evolved with NS to new scenar-
ios. Their experiments in neuro-evolution have confirmed that agents
using NS can learn general exploration skills. The transferred robots
were able to perform much better than randomly generated agents, but
did not outperform transferred robots evolved by a standard objective-
based EA. More recently in (Shorten and Nitschke, 2015) the authors
present a comparation of an objective-based search, NS and a combi-
nation of both approaches to evolve robot neurocontrollers that solve a
test set of 1, 000 mazes using a static training set of 100 mazes. The
results in (Shorten and Nitschke, 2015) show that NS and the NS-
objective combination approaches yield comparable generalized maze
navigation behaviors, and that both outperform the objective-based ap-
proach, but they did not analyse the effect of the training set size.
Kushchu (Kushchu, 2002b) has identified the SFT problem as an ex-
ample were objective-based search will lead towards brittle solutions
with GP. His experimental results suggest that a successful agent won’t
perform well on small variations of the problem. Therefore, Kushchu
proposed to train an agent on a set composed of variations of the SFT

95
generalization of ns-based gp controllers
for evolutionary robotics
using an objective-based approach, sharing similar characteristics, and
tested the learned behaviors evolved with different set of similar trails.
He was able to successfully evolve general trail following agents. In
(Lehman and Stanley, 2010a) using standard GP, and in (Urbano and
Loukas, 2013) using GE, NS was applied successfully to the SFT prob-
lem, a known deceptive problem in GP.

Doucette and Heywood (Doucette and Heywood, 2010) have empir-

ically evaluated the impact of NS on generalization performance for the
SFT, using the SFT as a single training set and a test set of similar trails.
They compared different ways of assigning selective pressure, combin-
ing novelty and the objective into a single measure, or using each one in-
dependently; their results suggest that no method could produce indi-
viduals that solved both the training and testing sets. However, results
showed that the classical objective-based GP achieved the best train
and test performances, but programs evolved by NS alone had better
generalization abilities. In contrast, in two other works (Lehman and
Stanley, 2010a; Urbano and Loukas, 2013) NS outperformed objective-
based search. Therefore, it is reasonable to state the results so far are
inconclusive.

Important open issues that have not been considered include the
effect that the training set size has on generalization, as well as the
manner in which the training set is constructed, or how different ob-
jectives and descriptors might impact the search. However, (Trujillo
et al., 2011b) showed that by promoting behavioral diversity during
the search for navigation behaviors, an EA could find several different
solutions to the same problem, and that such solutions exhibited bet-
ter generalization abilities when placed in an unknown environment.
Those results correlate with the hypothesis that the search for novel
and unique behaviors can help an evolutionary process identify gen-
eral solutions.

96
6.3 experimental setup

6.3 Experimental Setup

This section summarizes our proposed experimental work, to study

the impact of the training set on the generalization ability of a GE-
based system using standard objective-based search and NS. A high
level view of our experimental work is depicted in Figure ??, consist-
ing of: (1) a navigation task; (2) different training set sizes; (3) differ-
ent training set selection approaches; (4) two different objective func-
tions; (5) three different search strategies (objective, novelty and ran-
dom); and (6) two different behavior descriptors for NS. Each aspect is
described next.

Training Training Set Objective Search

Environment:
Set Size Selection algorithm
Navigation 1,2,4,6,12, Manual Rand Rand
Behavior
Task 24,48,60 Run
descriptor NS OS R
(Fixed) Gen

Figure 6.2: High level description of the proposed experimental work,

where the environment is a navigation task and the experiments con-
sider several sizes of the training set from 1 to 60 instances. The choices
for the training set are: fixed, randomly chosen each generation (Rand-
Gen) or every run (Rand-Run). Two objective functions f1 and f2 are
implemented for different experiments, at the same time two different
quality measures are used for both objective functions. The algorithms
used are novelty search (NS), objective-based search (OS), and random
search (R). NS uses two descriptors of the behaviors: dα and dβ , which
are implicitly related to objective functions f1 and f2 respectively.

6.3.1 Navigation Task and Environment

The learning task is a navigation problem for an autonomous agent

situated in a 2D discrete environment that must reach a static target.
The environment is shown in Figure 6.2, it is a rectangular grid of
39 × 23 cells considering the outer walls, the target is represented by
a black square and it is surrounded by an U-like wall composed by 10

97
generalization of ns-based gp controllers
for evolutionary robotics

I12 I3
I11 I4

I9 I10
target
I2
I1

I6
I5

(a) Manual Training Set (b) Test Set

Figure 6.3: Figure (a) shows the learning environment, where the target
is depicted by a black square and 12 manually chosen initial conditions
for each artificial agent (labeled from I1 to I12 ). These instances are
grouped into 6 pairs represented as two triangles with different color,
and each pair share the same location (x, y ) but different orientations
(North, East, South, or West). Figure (b) shows 100 initial conditions,
which were randomly generated to be used as the test set for all experi-
ments.

cells, so the total cells that an agent can visit is (37 × 21) − 11 = 766.
This environment is similar to the one used in (Lehman and Stanley,
2010a), where it is named as the ‘medium’ (difficulty) maze. Similar
to the present work, (Lehman and Stanley, 2010a) compared the per-
formance of GP using three different approaches to guide the search:
objective, novelty, and a random fitness. However, in (Lehman and
Stanley, 2010a) the training set always contained a single starting point
for the agent (the top-left corner), only allowed 100 moves for the agent
and did not consider a test set.
In this scenario, each training (or testing) instance is defined by the
pair Ii = (xi , θi ), that defines the initial position xi = (x, y ) of the agent
within the grid environment specified by the row x and column y, and
the initial orientation θi which can be four possible values: North (N),
South (S), West (W) and East (E).
The BNF grammar that defines the space of possible programs is
shown in Figure 6.1, which is used to perform the genotype to pheno-

98
6.3 experimental setup

type mapping. The agent has 3 boolean functions as sensors, which

each of them is formulated as a question related with the location of
a near wall respect to the agent; wall-ahead?, wall-left?, and wall-right?.
It can also perform three actions: move (moves 1 unit forward, if not
blocked by a wall), turn-left (90 degrees counter clockwise) and turn-
right (90 degrees clockwise). The program will be repeatedly executed
until the agent hits the target or reaches the maximum number of
moves. The agent succeeds if it reaches the target, this is referred to
as a hit.
All experiments in this study are performed using the jGE library
(Georgiou and Teahan, 2006), a Java implementation of GE, and jGE
Netlogo (Georgiou and Teahan, 2010) which is a Netlogo extension of
the jGE library. NetLogo1 is a multi-agent programmable modeling en-
vironment, and it was extended with an implementation of the NS al-
gorithm.

6.3.2 Training and Testing Set Size

To the best of our knowledge, previous works have not studied the
effect that the training set size has on generalization for a navigation
problem in ER. Here, we consider seven different training set sizes, of:
1, 2, 6, 12, 24, 48, and 60 instances. Moreover, to evaluate generaliza-
tion a test set is needed, thus here the testing set is composed by 100
randomly chosen initial conditions, shown in Figure 6.2(b). We believe
that given the size of the environment, 100 instances is sufficient to
evaluate the generalization of the evolved solutions.

6.3.3 Training Set Selection

We evaluate three different approaches to choose the instances that

compose the training set: (1) a set of randomly chosen instances de-
1 Netlogo is authored by Uri Wilensky and developed at the Northwestern’s Center for
Connected Learning and Computer-Based Modeling (CCL). You can download it free
of charge in: https://ccl.northwestern.edu/netlogo/.

99
generalization of ns-based gp controllers
for evolutionary robotics
termined at the beginning of each run; (2) a set of randomly chosen
instances chosen at the beginning of each generation; and (3) a set of
manually chosen instances used in all runs.
The manual approach undoubtedly introduces a human bias into
the learning process, which might compromise any conclusions drawn
from the experimental results. If the problem requires specific initial
conditions then this is not a issue, but determining how to construct
the training set for an arbitrarily complex environment is in no way a
trivial task. Moreover, as the results of this chapter will show, using
random training sets not only simplifies the problem formulation but
it actually improves the quality of the results.
The first two approaches, (1) and (2), remove this bias and allow
us to evaluate the effect of training set size irrespective of the initial
positions of the robot. One drawback about randomizing the selection
of the training set in our experimental setup is that they may coincide
with those used for testing. But even in the worst case with a training
set size of 60 and using 100 testing instances, there will be less than
2% of overlap between both sets, and will be mostly particularly for
the smaller training sets. The second approach (2) induces a dynamic
learning process with a non-static fitness landscape, a scenario where
solution diversity and generalization would seem to be necessary. For
instance, recent work (Gonçalves and Silva, 2011b) has suggested that
changing the fitness cases used at each generation can help improve
generalization and reduce overfitting in GP.

6.3.4 Objective Functions

For the evaluation of the evolved programs two objective functions

are considered: f1 and f2 . First, f1 computes the inverse of the sum of
Euclidean distance plus one between the final position of the agent α
and the target t (Lehman and Stanley, 2008), given by

1
f1 = (6.28)
1 + dist(α, t )

100
6.3 experimental setup

where dist () represents the Euclidean distance in this work. Con-

versely, f2 considers both the Euclidean distance and the length of the
agent’s trajectory. In this case, for a discrete environment represented
by a 2D grid the trajectory is represented by the number of visited
(non-repeated) cells β before reaching the target, or if the target is not
reached when the total amount of allowed moves is reached; it is com-
puted by

1
f2 = . (6.29)
dist(α,t )
1+ β

For both objective functions and a training set with n instances,

each agent is evaluated n times, once for each instance, and the final
value is the average score across all the evaluations. Similarly, the qual-
ity score on the test set will be the average score across all 100 instances.
It is important to note that f2 scale higher values than f1 , for this reason
a fair quality measure for both functions is the binary verification that
an agent reaches the target, this is named as a hit.

6.3.5 Search Algorithms

The proposed generalization experiments compare the perfor-

mance of GE using objective (OS), novelty (NS), and random-based
search (R). The goal is to determine which form of selective pressure
can increase the generalization abilities of the GE-system in the pro-
posed problem. Moreover, it is important to determine if improving
generalization has the negative effect of reducing overall training per-
formance, which will also be evaluated.
To implement NS in this work we use two behavior descriptors dα
and dβ , which are implicitly based on f1 and f2 respectively. The first
descriptor builds a behavior vector dα = (α1 , ..., αn ), where each αi is
the final position of the agent for training instance Ii . The second de-
scriptor dβ = (β1 , ..., βn ) is composed by values βi that represent the
number of visited (non-repeated) cells for training instance Ii .

101
generalization of ns-based gp controllers
for evolutionary robotics
Table 6.1: A general description of the algorithms and measures used
in the experimental set-up.

Name Description
Training set selection g Randomly chosen at the beginning of each generation
r Randomly chosen every run.
Manual Manual selection of training cases.
Algorithms Ng Novelty-based search with a training set of randomly cho-
sen
instances chosen at the beginning of each generation.
Nr Novelty-based search with a training set of randomly cho-
sen
instances determined at the beginning of a run.
Og Objective-based search search with a training set of ran-
domly chosen
instances chosen at the beginning of each generation.
Or Objective-based search with a training set of randomly
chosen
instances determined at the beginning of a run.
R Random-based search
instances determined at the beginning of a run.
Measures f1 Fitness function based on the Euclidean distance.
f2 Fitness function based on the Euclidean distance & vis-
ited cells.
dα Descriptor using the final position of the agent, related
with f1 .
dβ Descriptor using the number of of visited (non-repeated)
cells, related with f2 .

In summary, Table 6.1 presents all of the algorithmic variants tested

in this work and the different comparative measures. Finally, it is im-
portant to point out that R does not use the training set, so only a single
version of R is required.

6.3.6 Limit for Allowed Moves

The limit of allowable moves simulates the energy (i.e. battery life)
that an agent has to solve the navigation task. In this work, we chose
to determine a fixed limit for all the experimental conditions. To do

102
6.3 experimental setup

100 100

80 80

60 60

40 40

20 20

0 0
0 200 400 600 800 1000 0 200 400 600 800 1000

(a) Hit-Test for f1 (OS) and dα (NS) (b) Hit-Test for f2 (OS) and dβ (NS)

Figure 6.4: Performance comparison of objective, novelty and random-

based search using a fixed training set of 12 manually chosen instances
and different number of allowed moves, from 100 to 1000. The plots
show the average hit percentage from 100 runs using the test set for
both fitness functions f1 (a) and f2 (b) and both behavior descriptors dα
(a) and dβ (b).

so, we used twelve manually selected initial conditions as a training

set for consistency with our previous work (Urbano et al., 2014b), and
performed experiments varying the total number of allowable moves,
starting from 100 through 1000 moves, with steps of 100. The algo-
rithms were evaluated based on the percentage of hits achieved on the
testing sets. The results are shown in Figure 6.3, which is the average
over 100 independent runs. In these plots, it is possible to see that for
objective-based search hit percentage reaches a maximum value at 600
moves and levels off after that. For NS, performance still increases af-
ter 600 moves, but it still reaches almost maximum performance at 600.
Random search, on the other hand, performs poorly in all cases. There-
fore, for all of the following experiments we set the limit for allowed
moves at 600.
The parameters used in the experiments are summarized in Table
6.2, which are standard values for GE systems (Urbano and Loukas,
2013; Robilliard et al., 2006). The experiments were executed in 100
runs with a population of 250 individuals for 50 generations. For in-

103
generalization of ns-based gp controllers
for evolutionary robotics
Table 6.2: Parameters used for the experimental work. Codons-min
and Codons-max are the minimal and maximal number of codons in
the initial random population of the GE search.

Parameter Value Parameter Value

Codon-size 8 Generational YES
Codons-min 15 Mutation prob 0.01
Codons-max 25 Elitism 10%
Number of runs 100 NS archive NO
Wraps 10 Crossover codon-crossover
Number of individuals 250 NS k−neighbors 3
Crossover prob. 0.9 Selection Roulette Wheel

valid individuals, a value of 0 is given for both fitness and novelty.

After some preliminary exploration, we set the number of neighbors
used to compute the novelty score to k = 3, invalid individuals were
not considered, and we did not use an archive as suggested in (Lehman
and Stanley, 2008), since preliminary experiments showed that it did
not improve performance, as well the success of NS without using an
archive combined with a short value for k is reported in (Gomes et al.,
2015).

6.4 Results and Analysis

This section presents the experimental results divided into three

subsections. The first one considers the randomly chosen training sets,
the second considers the manually selected training set, while the final
one presents an analysis of statistical significance.

6.4.1 Results for Randomly Chosen Training Sets

Table 6.3 summarizes the results for all of the experiments that used
randomly chosen training sets, organized by the number of training in-
stances (columns 1), the manner in which the training set was selected

104
6.4 results and analysis

(column 2) and the type of selection pressure (column 3). Moreover,

results are presented for the average training and testing performance
over 100 independent runs. First, for training we show the average fit-
ness of the best solution found, remembering that for NS when f1 is
reported learning was performed with the dα descriptor and when f2 is
reported dβ was used. For testing, the fitness of the best solution found
is computed with the test set, and the percentage of test hits is also pre-
sented: H1 for function f1 (objective) and descriptor dα (novelty); and
H2 for function f2 (objective) and descriptor dβ (novelty).

Table 6.3: Summary of the experimental results for all of the variants
showing the average over a 100 runs, for the configurations that used
a random training set: the training set size is from 1 to 60; Sel is the
manner in which the training set is set, for every generation (Gen) or
for every run (Run); and three search strategies are considered, Objec-
tive, Novelty, and Random. Results are given for training and testing
performance, for training the performance of the best solution found
is evaluated on each objective function (f1 and f2 ) as well as for testing.
The percentage of hits, H1 and H2, is shown only for the test set. In all
cases Bold indicates the best performance.

Training Test
Size Sel Fitness f1 f2 H1 H2 f1 f2
1 Rand-Gen Novelty 1.0000 1.0000 12% 16% 0.1966 0.7655
Objective 0.9691 0.9989 4% 7% 0.1227 0.7246
Rand-Run Novelty 1.0000 1.0000 27% 29% 0.3315 0.8559
Objective 0.7068 0.9945 12% 22% 0.1954 0.8326
Random 0.5386 0.9689 8% 6% 0.1610 0.7848
2 Rand-Gen Novelty 1.0000 1.0000 45% 43% 0.5073 0.9206
Objective 0.6576 0.9870 10% 18% 0.1848 0.8939
Rand-Run Novelty 0.9960 1.0000 56% 59% 0.6080 0.9504
Objective 0.7296 0.9824 39% 38% 0.4439 0.9272
Random 0.4485 0.9460 10% 8% 0.1826 0.8771
6 Rand-Gen Novelty 0.9854 1.0000 92% 88% 0.9289 0.9914
Objective 0.4824 0.9655 15% 35% 0.2254 0.9435
Rand-Run Novelty 0.9733 0.9998 92% 91% 0.9280 0.9918
Continued on next page. . .

105
generalization of ns-based gp controllers
for evolutionary robotics
Table 6.3 – continued from previous page
Training Test
Size Sel Fitness f1 f2 H1 H2 f1 f2
Objective 0.7209 0.9744 56% 48% 0.6057 0.9611
Random 0.2908 0.9335 7% 13% 0.1524 0.9190
12 Rand-Gen Novelty 0.9730 1.0000 94% 98% 0.9485 0.9979
Objective 0.5131 0.9664 29% 39% 0.3609 0.9524
Rand-Run Novelty 0.9731 0.9999 94% 97% 0.9501 0.9978
Objective 0.7520 0.9784 66% 62% 0.6964 0.9735
Random 0.2524 0.9274 8% 12% 0.1673 0.9223
24 Rand-Gen Novelty 0.9548 0.9999 92% 99% 0.9306 0.9990
Objective 0.6375 0.9643 49% 40% 0.5419 0.9540
Rand-Run Novelty 0.9650 0.9999 94% 99% 0.9509 0.9989
Objective 0.8099 0.9740 75% 56% 0.7749 0.9721
Random 0.2682 0.9253 13% 8% 0.2126 0.9188
48 Rand-Gen Novelty 0.9739 1.0000 96% 100% 0.9642 0.9997
Objective 0.6727 0.9660 58% 48% 0.6199 0.9597
Rand-Run Novelty 0.9614 0.9999 95% 99% 0.9521 0.9996
Objective 0.8139 0.9742 78% 58% 0.8068 0.9727
Random 0.2379 0.9267 13% 12% 0.2102 0.9244
60 Rand-Gen Novelty 0.9648 0.9998 95% 99% 0.9581 0.9993
Objective 0.6842 0.9640 58% 41% 0.6223 0.9576
Rand-Run Novelty 0.9530 0.9996 95% 99% 0.9530 0.9993
Objective 0.7390 0.9748 70% 58% 0.7336 0.9737
Random 0.2550 0.9261 15% 10% 0.2313 0.9237

To gain a better appreciation of the differences of each method

shown in Table 6.3, Figures 6.4-6.9 present box plots that show the me-
dian and spread over the first and third quartiles of the 100 runs. First,
for objective function f1 and its related behavior descriptor dα for NS:
Figure 6.4 shows a comparison of the best training fitness from each
run; Figure 6.5 compares the test fitness of the best solution found; and
Figure 6.6 shows the percentage of hits on the test set. Similar results
are summarized for objective function f2 and descriptor dβ in Figures
6.7, 6.8, and 6.9, respectively for training fitness, test fitness and test

106
6.4 results and analysis

1
1 1 1
0.8
Fit−train

Fit−train

Fit−train
0.8 0.8 0.8
0.6 0.6
0.6 0.6
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

1 1 1
Fit−train

Fit−train

Fit−train
0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.5: Box plot comparison of the best solution found using train-
ing fitness with function f1 and descriptor dα . ‘N’ stands for novelty-
based search, ‘O’ for objective-based search, and ‘R’ stands for random-
based search. Sub-index ‘g’ and ‘r’ stand for the method used to choose
randomly the training set, either every generation or every run, respec-
tively. Figures are sorted in ascending order according to the number
of instances used in the training sets, from 1 to 60.

hits. All of these figures present results for five of the algorithms sum-
marized in Table 6.1, these are: Ng, Nr, Og, Or and R.
These results exhibit some clear trends. First, let’s consider train-
ing performance for both objective functions, shown in Figures 6.4 and
6.7. NS achieves good performance irrespective of the size of the train-
ing set or the manner in which the training set is chosen (per run or
per generation) for both objectives. On the other hand, objective-based
search shows worse performance than NS, and Or is relatively better
than Og in both figures, suggesting that it is better to keep the training
set static for all the generations of the run when using this form of se-
lective search. Random search clearly shows worse performance, but R
is basically equivalent to Og for small training sets.

107
generalization of ns-based gp controllers
for evolutionary robotics

1 1 1 1
0.8 0.8 0.8 0.8
Fit−test

Fit−test

Fit−test
0.6 0.6 0.6 0.6
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2

Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

1 1 1
0.8 0.8 0.8
Fit−test

Fit−test

Fit−test
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2

Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.6: Box plot comparison of the best solution found using test
fitness with function f1 and descriptor dα . ‘N’ stands for novelty-based
search, ‘O’ for objective-based search, and ‘R’ stands for random-based
search. Sub-index ‘g’ and ‘r’ stand for the method used to choose ran-
domly the training set, either every generation or every run, respec-
tively. Figures are sorted in ascending order according to the number
of instances used in the training sets, from 1 to 60.

When we consider testing performance, the following is suggested

by Figures 6.5 and 6.8. First, NS clearly improves on the test set when
the training set size increases. In fact, for training sets larger than
6 instances performance is almost perfect for both Ng and Nr. Simi-
larly, objective-based search also improves proportionally to the train-
ing set size, but it does not reach the same level of performance than NS.
Moreover, Or always outperforms Og, confirming that objective-based
search struggles with a dynamic objective function. All of these ob-
servations are valid for both objective functions, this means that these
trends seem to not depend on the manner in which performance is
measured, at least for the two methods considered here. Finally, it is
clear that random search cannot reach the same level of performance.
This result is of interest, since many researchers erroneously conflate

108
6.4 results and analysis

100 100 100 100

Hit−test

Hit−test
50 50 50 50

0 0 0 0
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

100 100 100

Hit−test

Hit−test
50 50 50

0 0 0
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.7: Box plot comparison of the percentage of testing hits by the
best solution found using function f1 and descriptor dα . ‘N’ stands for
novelty-based search, ‘O’ for objective-based search, and ‘R’ stands for
random-based search. Sub-index ‘g’ and ‘r’ stand for the method used
to choose randomly the training set, either every generation or every
run, respectively. Figures are sorted in ascending order according to
the number of instances used in the training sets, from 1 to 60.

NS with a random search process, while our results in agreement with

(Velez and Clune, 2014) confirm that NS is not a random search.
Finally, let’s consider the quality measures given by both the objec-
tive functions and hits. The percentage of hits on the test set are the
proportion of test instances where the agent reached the target; results
are summarized in Figures 6.6 and 6.9. These results are very similar
to those observed for the objective functions, basically the same trends
are apparent. For both NS and objective-based search performance in-
creases proportionally to the number of training instances. However,
NS clearly outperforms all other methods, and only requires 6 training
instances to reach an almost perfect performance on the test set. To
clearly show the differences between all methods, a bar plot of these
results is presented in Figure 6.10. This figure shows the percentage

109
generalization of ns-based gp controllers
for evolutionary robotics

1
1 1 1
0.8
Fit−train

Fit−train

Fit−train
0.8 0.8 0.8
0.6 0.6
0.6 0.6
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

1 1 1
Fit−train

Fit−train

Fit−train
0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.8: Box plot comparison of the best solution found using train-
ing fitness with function f2 and descriptor dβ . ‘N’ stands for novelty-
based search, ‘O’ for objective-based search, and ‘R’ stands for random-
based search. Sub-index ‘g’ and ‘r’ stand for the method used to choose
randomly the training set, either every generation or every run, respec-
tively. Figures are sorted in ascending order according to the number
of instances used in the training sets, from 1 to 60.

of hits on the test set relative to the number of training instances (x-
axis), for both objective functions (columns) and for both manners in
which the training set is selected (rows). These plots clearly suggest
that NS achieves substantially better generalization performance than
traditional objective-based search and random search. Indeed, NS can
achieve almost perfect generalization for sufficiently large training sets,
above 6 instances for the present task.

6.4.2 Comparison with a Manually Selected Training Set

In this section, the goal is to compare the random construction of

training sets with manually chosen initial conditions. Here, we con-
sider 12 different initial conditions for the agent, where it is easier to

110
6.4 results and analysis

1 1 1 1
0.8 0.8 0.8 0.8
Fit−test

Fit−test

Fit−test
0.6 0.6 0.6 0.6
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2

Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

1 1 1
0.8 0.8 0.8
Fit−test

Fit−test

Fit−test
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2

Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.9: Box plot comparison of the best solution found using test
fitness with function f2 and descriptor dβ . ‘N’ stands for novelty-based
search, ‘O’ for objective-based search, and ‘R’ stands for random-based
search. Sub-index ‘g’ and ‘r’ stand for the method used to choose ran-
domly the training set, either every generation or every run, respec-
tively. Figures are sorted in ascending order according to the number
of instances used in the training sets, from 1 to 60.

reach the target from some positions than others. The chosen instances
are depicted in Figure 6.2(a), specifying the location and orientation
of each instance; these instances were used in our preliminary work
(Urbano et al., 2014b). Table 6.4 summarizes the performance of the
learning process when each of the single instances is used for training
(each row), and the average over all (final row). It is evident that most
of the chosen instances cannot lead the search towards a general so-
lution for none of the search strategies. Moreover, if we consider the
averages reported for the test set, they are worse than when compared
with a single randomly chosen training case (first row of Table 6.4). It
is important to note that these training instances were carefully chosen
to provide different scenarios for the learning process, with the hope
that they might allow the learning process to find general solutions

111
generalization of ns-based gp controllers
for evolutionary robotics

100 100 100 100

Hit−test

Hit−test
50 50 50 50

0 0 0 0
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(a) 1 instance (b) 2 instances (c) 6 instances (d) 12 instances

100 100 100

Hit−test

Hit−test
50 50 50

0 0 0
Ng Nr Og Or R Ng Nr Og Or R Ng Nr Og Or R

(e) 24 instances (f) 48 instances (g) 60 instances

Figure 6.10: Box plot comparison of the percentage of testing hits by

the best solution found using function f2 and descriptor dβ . ‘N’ stands
for novelty-based search, ‘O’ for objective-based search, and ‘R’ stands
for random-based search. Sub-index ‘g’ and ‘r’ stand for the method
used to choose randomly the training set, either every generation or
every run, respectively. Figures are sorted in ascending order according
to the number of instances used in the training sets, from 1 to 60.

(Kushchu, 2002a,b). However, the bias introduced by the human ex-

pert did not help, and actually performs worse than randomly chosen
instances.
It is also true, as the results from the previous subsection showed,
that using a single training instance is too few. NS for instance reaches
strong performance only when the training set has 6 or more instances.
Therefore, we group all of the manually chosen instances into a sin-
gle training set and compare it with the results of using 12 randomly
chosen instances. These results are summarized in Table 6.5. First, it
is clear that NS outperforms objective and random search, for all ap-
proaches towards training set construction. However, the performance
of randomly chosen instances (per run or per generation) outperform

112
6.4 results and analysis

100 100

80 80
HIts %

HIts %
60 60

40 40

20 20

0 0
1 2 6 12 24 48 60 1 2 6 12 24 48 60
Training Set Size Training Set Size
(a) Hits-Gen using f1 and dα (b) Hits-Run using f1 and dα

100 100

80 80
HIts %

HIts %

60 60

40 40

20 20

0 0
1 2 6 12 24 48 60 1 2 6 12 24 48 60
Training Set Size Training Set Size
(c) Hits-Gen using f2 and dβ (d) Hits-Run using f2 and dβ
Novelty-based Objective-based Random-based

Figure 6.11: Comparison of the performance on the test set based on

percentage of hits, considering both fitness functions (f1 and f2 ), both
behavior descriptors (dα and dβ ) and both strategies to set the training
set (every run (Run) or every generation (Gen)).

the Manual approach. Again, it seems that human bias does not im-
prove search performance, and in fact generates a negative impact.

113
generalization of ns-based gp controllers
for evolutionary robotics
Table 6.4: Results from each of the twelve instances; I1 , ...I12 , where
the sub-index stands for the number of the training instance (in the
heading as ’Size’). Three different approaches of fitness metrics are
used to evolve solutions; novelty, objective and random. Two fitness
function are used to score the best program from each run; f1 considers
last position, and f2 and considers both last position and number of
non-repeated visited cells. The average number of best programs that
hit the target is shown in the column ‘H’ where the subindex indicates
the function used (f1 or f2 ). Limit of moves for all experiments is 600.
Bold indicates the best performance.

Training Test
Size Fitness f1 f2 H1 H2 f1 f2
I1 Novelty 1.0000 1.0000 3% 3% 0.2631 0.8735
Objective 0.7445 0.9847 3% 0% 0.2862 0.8588
Random 0.4403 0.9649 1% 0% 0.1496 0.8572
I2 Novelty 1.0000 1.0000 6% 1% 0.3652 0.8889
Objective 0.9524 0.9889 2% 4% 0.2017 0.9098
Random 0.2801 0.9567 2% 2% 0.1594 0.8715
I3 Novelty 1.0000 1.0000 12% 3% 0.4768 0.9297
Objective 0.6977 0.9890 1% 5% 0.3238 0.9212
Random 0.2019 0.9429 0% 0% 0.1455 0.8873
I4 Novelty 1.0000 1.0000 8% 5% 0.4553 0.9261
Objective 1.0000 0.9774 3% 0% 0.6656 0.9055
Random 0.2273 0.9529 1% 2% 0.1651 0.8661
I5 Novelty 1.0000 1.0000 0% 1% 0.0913 0.7812
Objective 0.9802 0.9983 0% 0% 0.0720 0.7779
Random 1.0000 0.9936 0% 0% 0.0839 0.6711
I6 Novelty 1.0000 1.0000 4% 0% 0.1682 0.8468
Objective 0.9527 0.9946 2% 0% 0.1073 0.8236
Random 0.9301 1.0000 0% 0% 0.1068 0.8502
I7 Novelty 1.0000 1.0000 0% 0% 0.1246 0.5477
Objective 0.9917 1.0000 0% 0% 0.1252 0.5442
Random 1.0000 1.0000 0% 0% 0.1203 0.5372
I8 Novelty 1.0000 1.0000 0% 0% 0.1215 0.5638
Objective 1.0000 1.0000 0% 0% 0.1207 0.5246
Random 1.0000 1.0000 0% 0% 0.1154 0.5456
I9 Novelty 1.0000 1.0000 0% 0% 0.0951 0.4644
Objective 1.0000 1.0000 0% 0% 0.0967 0.4462
Continued on next page. . .

114
6.4 results and analysis

Table 6.4 – continued from previous page

Training Test
Size Fitness f1 f2 H1 H2 f1 f2
Random 1.0000 1.0000 0% 0% 0.0950 0.4819
I10 Novelty 1.0000 1.0000 0% 0% 0.0816 0.4403
Objective 1.0000 1.0000 0% 0% 0.0819 0.4502
Random 1.0000 1.0000 0% 0% 0.0842 0.4329
I11 Novelty 1.0000 1.0000 11% 7% 0.5381 0.9505
Objective 0.6492 0.9775 6% 1% 0.3878 0.8555
Random 0.1906 0.9570 2% 1% 0.1731 0.8000
I12 Novelty 1.0000 1.0000 13% 17% 0.6270 0.9585
Objective 0.6710 0.9821 2% 5% 0.4221 0.8763
Random 0.1972 0.9518 2% 0% 0.1748 0.7315
Aver. Novelty 1.0000 1.0000 5% 3% 0.2837 0.7643
Objective 0.8832 0.9911 2% 1% 0.2401 0.7412
Random 0.6195 0.9766 1% 0% 0.1306 0.7110

Table 6.5: Table that summarizes the performance of the three meth-
ods; novelty, objective, and random-based search for a set of 12 in-
stances considering 600 as limit of moves, showing the results for the
three different ways to select the training set; manually (Manual), ran-
domly chosen each generation (Rand-Gen), and randomly chosen each
run (Rand-Run). Bold indicates the best performance.

Training Test
Size Sel Fitness f1 f2 H1 H2 f1 f2
12 Manual Novelty 0.9576 0.9996 68% 81% 0.9302 0.9985
Objective 0.6775 0.9701 29% 30% 0.5310 0.9638
Random 0.4100 0.9328 1% 4% 0.1713 0.9177
12 Rand-Gen Novelty 0.9730 1.0000 94% 98% 0.9485 0.9979
Objective 0.5131 0.9664 29% 39% 0.3609 0.9524
Rand-Run Novelty 0.9731 0.9999 94% 97% 0.9501 0.9978
Objective 0.7520 0.9784 66% 62% 0.6964 0.9735
Random 0.2524 0.9274 8% 12% 0.1673 0.9223

115
generalization of ns-based gp controllers
for evolutionary robotics
6.4.3 Statistical Analysis

Finally, to validate the results presented above we perform exten-

sive statistical comparisons. For each experimental configuration, and
for each performance measure, we perform N ×N pairwise tests, where
N is the number of different search methods (Ng, Nr, Og, Or and R).
In other words, for each pair of algorithms (Ng-Nr, Ng-Og, and so on)
we perform a statistical test based on each performance measure. We
employ the Friedman non-parametric test to perform all pairwise com-
parisons of the tested algorithms as suggested in (Derrac et al., 2011),
with Bonferroni-Dunn correction of the p-values. The null hypothesis
that booth groups have equal medians was rejected at the α = 0.01 sig-
nificance level. For almost all comparisons the null hypothesis was re-
jected with p-values ≈ 0, but in some cases this was not the case. Table
6.6 summarizes the cases where the null hypothesis was not rejected
for at least one performance measure, showing the corresponding p-
values for each comparison based on each of the performance measures.
However, even in these cases for some performance measures the null
hypothesis was rejected, if so this is marked with an asterisk.
Table 6.6 shows that when the training set is small (1 or 2) there are
more cases in which the null hypothesis cannot be rejected. As the box-
plots suggested, all algorithms show a similar weak performance with
such a small training set. However, as the training set size increases
then we can see that in almost all cases the null hypothesis was rejected,
indicating a clear statistical significance in the results discussed in the
previous subsections, particularly when we compare the NS variants
(Ng and Nr) with OS (Og and Or) and random search.

6.5 Heat Maps

In this section a new approach is proposed as a possible research

line to extend this work. The main goal is to explore the impact of the
training on to evolve generalization abilities, making a preprocessing
of the entire set of initial conditions, getting a binary division into re-

116
6.5 heat maps

Table 6.6: Summary of the statistical tests, showing the resulting

p-values of each pairwise comparison using the Friedman test with
Bonferroni-Dunn correction, for each experimental configuration and
for each performance measure. Since in most cases the resulting p-
values were ≈ 0, the table only presents cases in which the null hy-
pothesis is not rejected at the α = 0.01 significance level based on at
least one performance measure, with an asterisk identifying the cases
in which the null hypothesis is rejected.

Training Test
Size Methods f1 f2 H1 H2 f1 f2
1 Ng vs Nr 4.000 4.000 0.000 * 0.000* 0.646 0.026
Ng vs Og 0.101 0.629 0.000* 0.000* 0.000* 0.438
Ng vs Or 0.000* 0.000* 3.654 0.000* 2.194 0.111
Nr vs Og 0.101 0.629 0.000* 0.000* 0.000* 0.000*
Nr vs Or 0.000* 0.000* 0.000* 0.000* 0.020* 0.646
Og vs R 0.000* 0.000* 0.000* 0.000* 0.047* 0.920
Or vs R 0.004* 0.000* 0.000* 0.000* 0.139 0.287
2 Ng vs Nr 1.269 4.000 0.000* 0.000* 0.006* 0.007*
Ng vs Or 0.000* 0.000* 0.000* 0.000* 0.124 1.909
Nr vs Or 0.000* 0.000* 0.000* 0.000* 0.038* 0.053
Og vs Or 0.059 2.530 0.000* 0.000* 0.000* 0.224
Og vs R 0.000* 0.000* 2.367 0.000* 2.194 0.010*
6 Ng vs Nr 1.269 0.629 3.332 0.000* 2.060 2.929
Og vs Or 0.000* 3.012 0.000* 0.000* 0.000* 0.117
Og vs R 0.000* 0.000* 0.000* 0.000* 0.920 0.026*
12 Ng vs Nr 4.000 1.269 3.324 1.484 2.764 3.545
Og vs Or 0.000* 0.542 0.000* 0.000* 0.000* 0.068
24 Ng vs Nr 2.955 1.269 0.0000* 2.397 1.968 2.360
Og vs Or 0.000* 0.311 0.000* 0.000* 0.000* 0.029*
48 Ng vs Nr 1.693 2.254 0.0000* 2.344 3.526 3.274
Nr vs Or 0.106 0.000* 0.000* 0.000* 0.064 0.000*
Og vs Or 0.135 0.059 0.000* 0.000* 0.021* 0.004*
60 Ng vs Nr 3.091 4.000 0.198 0.155 3.563 0.882
Og vs Or 1.790 0.132 0.000* 0.000* 0.524 0.035*

gions of easy-difficult to generalize initial conditions. In order to get

this division, 30 objective-based runs were executed from each possi-
ble initial condition in the maze navigation. The average overfit value

117
generalization of ns-based gp controllers
for evolutionary robotics
0.3 0.3 0.3 0.3

2 2 2 2

4 0.25 4 0.25 4 0.25 4 0.25

6 6 6 6

0.2 0.2 0.2 0.2

8 8 8 8

10 10 10 10
0.15 0.15 0.15 0.15
12 12 12 12

14 14 14 14
0.1 0.1 0.1 0.1

16 16 16 16

18 0.05 18 0.05 18 0.05 18 0.05

20 20 20 20

0 0 0 0
5 10 15 20 25 30 35 5 10 15 20 25 30 35 5 10 15 20 25 30 35 5 10 15 20 25 30 35

(a) West (b) North (c) South (d) East

Figure 6.12: Overfit heat-maps, at the right of each figure is shown

a color scale that goes from low values of overfitting in color blue to
higher values in color red.

is computed from the difference between the training and test qual-
ity using the 30 experimental runs. Using the experimental results we
build-up the heat-maps for each orientation is showed in 6.11, where
the average overfitting from the 30 runs are showed according with the
color scale, low values are in color blue while higher values in red.
Trough a color threshold these heat-maps can be transformed into
the binary-maps as shown in the Figure 6.12. Region in color white
contains instances that produce higher values of overfitting, while the
region in black contains the instances with the lower values of overfit-
ting. The hypothesis is that overfitting is directly correlated with the
ability to generalize. Then, instances in the white regions since show
higher values of overfitting they tend to generate very specialized con-
trollers which hardly can find the target from different instances spe-
cially from those located in the the black regions, and for this reason
are considered difficult instances to generalize. This reasoning applies
for the black region, considering their instances as easy to generalize.
With this approach we can test different combinations of easy-
difficult instances, for instance if we consider a training set with a
size of 6, we can get the following set of combinations easy-difficult:
{0 − 6, 1 − 5, 2 − 4, 3 − 3, 4 − 2, 5 − 1, 6 − 0}, each initial condition is chosen
randomly from the corresponding region. The Figure 6.13 (a) shows
an example of a set of 12 initial conditions, particularly for a combina-

118
6.6 chapter conclusions

(a) West (b) North (c) South (d) East

Figure 6.13: Overfit binary maps, obtained from the binarization of the
Overfit heat-maps by a color threshold to divide the map into easy and
difficult regions.

tion of 8-easy and 4-difficult initial conditions, and the Figure 6.13 (b)
shows the same test set used in the previous experiments.
This experimental framework can give us insight about how this
distinction between easy and difficult initial conditions can benefit to
any of the methods tested, and furthermore to test if any of the possible
combinations is the best suited to improve the generalization abilities
of the navigator.

6.6 Chapter Conclusions

This work studies the problem of generalization in ER, using a nav-

igation task for an autonomous agent in a 2D environment. The learn-
ing system is based on GE and three search strategies are evaluated:
objective-based search, NS and a random search process. In particular,
this chapter studies the impact that the training set has on the perfor-
mance of the learning process and the generalization abilities of the
evolved solutions. Several aspects are considered, namely: (1) the size
of the training set; (2) the manner in which the training set is selected
(manually or randomly); and (3) if the training set is fixed during the
run or if it is varied every generation. Moreover, two objective functions
are used and two behavior descriptors are tested to promote solution
diversity with NS.

119
generalization of ns-based gp controllers
for evolutionary robotics

easy
target

(a) Easy-difficult training Set (b) Test Set

Figure 6.14: Figure (a) shows the learning environment, where the tar-
get is depicted by a black square, and 12 randomly chosen starting posi-
tions, particularly with a combination (4-8) of 8 “easy” and 4 “difficult”
instances. Each instance has a triplet (x, y, θ ), where x, y is the 2-d loca-
tion and θ is the orientation (North, East, South, and West). Figure (b)
shows 100 initial conditions that were randomly generated to be used
as a test set.

Experimental results clearly suggest that NS improves the general-

ization abilities of the GE system. NS can reach close to optimal perfor-
mance during training in all scenarios, and reaches the same level of
test performance with relatively small training sets (above 6 instances
in our experiments). This is the main conclusion of our work, that NS
can clearly guide the search towards general solutions within the stud-
ied environment. On the other hand, objective-based search cannot
reach the same level of performance, both in training as well as testing,
especially when considering the percentage of test instances where the
target is reached (hits). It is clear that traditional search cannot solve
the navigation problem for a diverse set of training instances, and fails
to generalize to arbitrary testing conditions. Random search also ex-
hibits very weak performance, confirming that NS is not equivalent to
a random search process, even if it does not consider the objective func-
tion explicitly.
Varying the training set during evolution was expected to be ben-
eficial towards generalization, but results do not suggest this. For NS,

120
6.6 chapter conclusions

keeping the training set fixed during the run produces almost the ex-
act same results as randomly varying it every generation. On the other
hand, objective-based search is hampered when a dynamic fitness func-
tion is used, both in terms of training and testing. Objective-based
search achieves better performance when the training set size is in-
creased and when it is kept fixed throughout the evolutionary process.
Finally, we compare a manual selection of training instances, the
strategy used in our previous work (Urbano et al., 2014b), with the ran-
dom strategy studied here. Results clearly indicate that manual selec-
tion introduces an undesirable bias that negatively effects generaliza-
tion. In particular, testing performance is clearly better with a random
training set instead of a manual a priori set, suggesting that removing
human bias can be very beneficial.
The results presented here open up several possible lines of future
research. While this work provides useful insights regarding the im-
pact of the training set, only coarse features are considered, such as its
size or the overall strategy used to build it (random or manual/fixed or
dynamic). However, more detailed features of the training set can be
controlled. For instance, it is evident that not all training instances are
created equally, some of them specify more difficult scenarios than oth-
ers, or some of them might lead towards deceptive landscapes while
other might not. Therefore, future work will study the impact that
the composition of the training set has on learning and generalization.
Considering, for example, the proportion of easy and difficult training
instances, or the proportion of deceptive and non-deceptive initial con-
ditions. One final open question is to perform similar tests to study the
issues related with the so-called reality-gap in ER.

121
generalization of ns-based gp controllers
for evolutionary robotics

122
7
GP based on NS for Regression

abstract — This chapter presents our approach towards applying

NS to symbolic regression with GP. In order to do so, it is highlighted
on this work, that the key component is the individual behavior repre-
sentation for a given domain. NS can lead to complex behavioral traits,
trying just to find novel solutions can guide the search from the com-
mon low quality solutions to the scarce and better solutions.

7.1 Introduction

The proposal of this chapter is to develop a NS-based GP system for

solving real-valued symbolic regression problems, hereafter referred to
as NS-GP-R. In the regression analysis a function f (x) is approximated
by an operator K (x) considering the independent variable x ∈ Dn ⊆ Rn ,
where Dn is the domain of f (x). If we consider the training set of
fitness cases as X, then the goal is to find an operator K that minimizes
f (x) − K (x) . However, to use NS the performance of K (x) must not
be expressed as an error over all fitness cases, but a description of how
K behaves over the entire set of fitness cases.

123
gp based on ns for regression

7.1.1 Behavioral Descriptor

Since the goal of a symbolic regression problem is straightforward,

to minimize the error between the desired output and the proposed ap-
proximation, then a behavioral descriptor for symbolic regression must
consider the concept of error in some way. The descriptor must pro-
vide a sufficiently detailed description of how an individual behaves
over all fitness cases. Moreover, the descriptor should abstract away
unnecessary details, and focus on describing how unique a particular
individual is relative to others within the population.

-descriptor: For a regression problem, let f :

R n 7 → R be the function to be approximated, X =
{ ( x 1 , f ( x 1 )) , ( x 2 , f ( x 2 )) , ..., ( x L , f ( x L )) } the set of fitness
cases (input-output tuples), K i a valid GP individual, and e i =
{e i 1 , e i 2 , ..., e i L } be the error vector of K i with e i j = |f ( x j ) − K ( x j ) |;
then, the -descriptor β i of K i is a binary vector of size L that is
constructed as follows.
For simplicity and without loss of generality, consider L = 1. Then,
at generation t with population P , sort P in ascending order based
on e i , and compute the order statistic p of the set of error vectors
E = {e i | ∀K i ∈ P }, where p = |Ph | with h as an algorithm parame-
ter. Then, set a bounding value t = mi n ( t −1 , p ) if t > 0 and set
t = p otherwise; such that β i = 1 for all K i with i ≤ p, and β i = 0
otherwise.

7.1.2 Pictorial Example

For example, consider h = 10, then the -descriptor β i of K i

identifies the fitness cases j on which individual K i is in the best 10-
percentile of the population, when β ij = 1. Conversely, if β ij = 0, this
means that the individual performs in the worst 90% of the population.
A graphical description of this process is shown in Figures 7.1. More-

124
7.1 introduction

0.15
max error
Ground Truth
GP-individual
0.1
2

0.05
i

min error
0 1
f(x)

L
εi
−0.05 L-1

εi
−0.1

ε
1=0, 2=1,
i=0, L-1=1, L=1}
−0.15 ε
=[01011]

−0.2 xi xL-1
x1 x2 xL
X

Figure 7.1: Graphical depiction of how the −descriptor is constructed.

Notice how the interval specified by determines the values of each β i .

over, a description of the algorithm used to compute the -descriptor

is given in Algorithm 2.

7.1.3 NS Modifications

The behavioral descriptor presented above is constructed based on

the performance of each individual relative to the rest of the popula-
tion. Therefore, the descriptor assigned to any given individual will
depend on the population to which it belongs. In other words, the
context in which a descriptor is computed varies as a function of the
current population. This represents a slight modification to the orig-
inal NS implementation. However, it is consistent with the idea that
the uniqueness of an individual depends on the search progress at the
moment in which the individual was found.

125
gp based on ns for regression

Algorithm 2 Behavior descriptor β

Require: x, f (x ), n, L, K, p, m, gen
. x ∈ Rn
. n : individuals, pop : population size
. L : sample size
. K : GP function
. p : heuristic parameter
. m : number of generations,
. gen : number of generations
Ensure: binary vector β
for t = 0 : gen do
for i = 1 : pop do
for j = 1 : L do
. Eij = {eij } : error matrix
eij ⇐ |f (xj ) − Ki (xj )|
end for
end for
. descending sort by column, E = eij
E ⇐ sort(E,j,descend)
. order statistic p
p ⇐ | Ph |
t + 1 ⇐ p
if t +1 > t &t , 0 then
. epsilon vector update
t + 1 ⇐ t
end if
. individual descriptor loop
for i = 1 : pop do
for j = 1 : L do
if eij ≥ (t +1),j then
⇐1
βij
else
⇐0
βij
end if
end for
end for
end for

This design choice does raise a problem when sparseness is com-

puted, since the descriptors of individuals stored in the archive will,
with high probability, not represent a consistent behavioral descriptor
with respect to the individuals in the current population. Therefore,
the second modification made to the NS algorithm for regression prob-
lems is to omit the archive when sparseness is computed.

126
7.2 experiments

The lack of an archive can lead the canonical NS to perform back-

tracking during the search, cycling within behavioral space without
any real progress. However, this problem is avoided by not allowing
to increase in magnitude. Since, for most problems, the initial random
population will mostly contain individuals with a large error vector, in
the initial generations the bounds will tend to be high.

Then, with selective preference given to individuals with unique be-

haviors and given the way in which the -descriptor is constructed, the
algorithm will be biased towards individuals with descriptors that con-
tain a large proportion of ones since most low performance solutions
will have behavioral descriptors with a large proportion of zeros; i.e.,
it will be biased towards individuals that achieve a low error on many
fitness cases. Therefore, since cannot increase over generations, then
selective pressure will push the search towards novel individuals that
exhibit a low error on many fitness cases. Nonetheless, the archive is
still used to save promising individuals across generations, and is used
to select the best solution at the end of the run, but not used to compute
novelty or guide the search.

NS helped by the mechanism will push the search towards novel

solutions that finally find the global optima β 1 . The performance of the
behavior proposed is better explained in the Algorithm 2.

7.2 Experiments

The proposed NS-GP-R algorithm is evaluated on three benchmark

problems, suggested by (McDermott et al., 2012) and proposed in (Uy
et al., 2011a). Moreover, for comparative purposes, the algorithm is
compared to the results published in (Uy et al., 2011a), that use a stan-
dard GP, hereafter referred to as SGP, and a GP with the Semantic
Similarity-based Crossover (SSC) also proposed in (Uy et al., 2011a).

127
gp based on ns for regression

Table 7.1: Benchmark symbolic regression problems taken from (Uy

et al., 2011a).

Problems Fitness/Test cases

f1 x4 + x3 + x2 + x 20 random points ⊆ [−1, 1]

f2 sin(x2 ) ∗cos (x ) −1 20 random points ⊆ [−1, 1]
f3 2sin(x ) ∗ cos (y ) 100 random points ⊆ [−1, 1]x [−1, 1]

7.2.1 Test Problems and Parametrization

The three symbolic regression problems are given in Table 7.1. The
problems were chosen based on the difficulty the problems posed to the
methods published in (Uy et al., 2011a), and they are ordered from the
easiest to the most difficult. It is not claimed that this ordering implies
any deeper understanding of the intrinsic difficulty of the problem, it
is only based on the performance of the algorithms compared in (Uy
et al., 2011a).
The two easiest problems have one independent variable, while the
hardest problem has two independent variables; Figure 7.2 shows the
ground truth function in each problem. Table 7.1 specifies the desired
function and the manner in which the training set (fitness cases) and
testing set are constructed, using the same random procedure in both.

7.2.2 Parameter Settings

The novelty based Regression Funciton Descriptor (ED) is tested,

with one configuration that uses an sparseness threshold of ρmin = 3
for single variable problems and ρmin = 13 for bi-variable problems.
The number of neighbors is set to k = 15. The proposed algorithms
is a tree-based Koza style, as well a crossover and a mutation subtree-
based. Table 7.2 summarizes the parameters for this algorithm. Finally,
the algorithm were coded using Matlab 2012b and the GPLAB toolbox
(Silva and Almeida, 2003).

128
7.2 experiments

4 −0.5

3.5 −0.55

3 −0.6 1

2.5 −0.65
0.5
−0.7
2

f(x,y)
f(x)

f(x)
−0.75
1.5
−0.8 −0.5
1
−0.85
−1
0.5 1
−0.9
1
0 −0.95 0.5
0
0
−0.5 −1 −0.5
−1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 −1 −1
x x y x

(a) f1 (x ) (b) f2 (x ) (c) f3 (x, y )

Figure 7.2: Benchmark regression problems showing the ground truth

function taken from (Uy et al., 2011a)

Table 7.2: Parameters for the experiments.

Parameter Description
Population size 200 individuals.
Generations 100 generations.
Initialization Ramped Half-and-Half,with 6 levels
of maximum depth.
Operator probabilities Crossover pc = 0.8,
Mutation pµ = 0.2.
√
Function set ( + , − , × , ÷ , | · | , x2 , x ,
n log , sin , cos ,oif ) .
Terminal set x1 , ..., xi , ..., xp ,
where xi is a dimension of
the data patterns x ∈ Rn .
Bloat control Dynamic depth control.
Initial dynamic depth 6 levels.
Hard maximum depth 20 levels.
Selection Tournament of size 6.

7.2.3 Experimental Results

After executing the algorithm 30 times on each problem, the follow-

ing results are obtained. First, Figure 7.3 presents a boxplot analysis

129
gp based on ns for regression

Table 7.3: Parameters for novelty search.

Parameter Value
NS nearest neighbor k = 15
Sparseness threshold:
for single variable problems ρmin = 3
for bivariable problem ρmin = 13
-descriptor threshold p = 10
Number of runs per problem runs = 30

of the performance of NS-GP-R on each benchmark problem, showing

the best error on the test-set achieved in each run. For comparative pur-
poses Table III presents a comparison of the average error of NS-GP-R
on each benchmark, along with the average error reported in (Uy et al.,
2011b) for SGP and SSC. The table shows an interesting trend, NS per-
forms worse on the easiest problem, performs about equally on the in-
termediate problem, and outperforms all methods on the most difficult
problem. Results are compared based only on average error, no statis-
tical tests are performed because only the published results are being
used for comparison. The trend is apparent, NS works best when the
problems are difficult, because in this scenario, novelty indeed leads
towards quality, given that the initial random populations mostly con-
tain individuals that perform poorly. Therefore, in difficult problems
the selection pressure in novelty search can lead towards individuals
that achieve good performance. In other words, when the gradient for
novelty is positively correlated with the gradient for fitness, then NS
can indeed find high performance solutions.

7.3 Chapter Conclusions

This chapter presented a NS-based GP algorithm for the classical

GP problem of symbolic regression. In this respect, this chapter repre-
sents the first attempt to apply NS in this domain, one of the most com-

130
7.3 chapter conclusions

0.9

0.8

E r r o r
0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
1 2 3
P r o b l e m s

Figure 7.3: Boxplot analysis of NS-GP-R performance on each bench-

mark problem, showing the best test-set error in each run.

Table 7.4: Comparison of NS-GP-R with two control methods reported

in (Uy et al., 2011a): SSC and SGP. Values are the mean error computed
over all runs.

Problem NS-GP-R SSC SGP

1 0.47 0.15 0.26
2 0.14 0.10 0.21
3 0.42 1.53 2.26

mon engineering and scientific problems. To achieve this, however, the

concept of behavioral space needed to be adapted for it to be applied
in this domain. Therefore, a behavioral descriptor was proposed called
-descriptor, that emphasizes the main characteristics of program be-
havior while abstracting away unnecessary details, in order to properly
describe the behavior that each GP individual exhibits with respect to
the rest of the population.
Results are encouraging and consistent with recent proposals that
expand the use of the NS algorithm to other mainstream areas. The
proposed NS-based GP was compared with recently published results
on three benchmark problems that are currently suggested for GP eval-

131
gp based on ns for regression

uation within the community (Zhang and Smart, 2006). NS shows a

consistent trend, it achieved quite bad performance on easy problems,
and performs substantially better on difficult ones, results that are sim-
ilar to those published in (Naredo et al., 2013b; Naredo and Trujillo,
2013).
For real-world scenarios these results are promising, since most in-
teresting problems are difficult. The reason for these results can be
inferred from the nature of the NS algorithm. Random solutions in the
initial population mostly exhibit bad performance, thus the selective
pressure towards good solutions increases during the search for nov-
elty. On the other hand, for easier problems the exploration performed
by NS is mostly counterproductive, since random solutions might pro-
vide good initial approximations and all that is lacking is an exploita-
tive search. Therefore, when random solutions have a high fitness then
novelty could easily lead the search towards worse results. Conversely,
when the problem is difficult and random solutions are of low fitness,
then the search for novelty will lead towards high quality solutions.
Future work will center on exploring further experimental tests
in this domain, comprehensively evaluating different parametrizations
and performance achieved on other benchmarks and real-world prob-
lems, comparing the algorithm with other GP systems for symbolic
regression. Moreover, it is imperative to provide a deep compari-
son between the proposed behavior-based search strategy and recent
semantics-based approaches, a comparison that goes beyond merely
experimental results, one that makes a detailed analysis of the main
algorithmic differences between both approaches and their effects on
search.

132
GP based on NS for Clustering
8
abstract — This chapter applies NS to unsupervised machine
learning, namely the task of clustering unlabeled data. To this end, a
behavioral descriptor is proposed describing a clustering function per-
formance. Experimental results show that NS-based search can be used
to derive effective clustering functions. In particular, NS is best suited
to solve difficult problems, where exploration needs to be encouraged
and maintained.

8.1 Introduction

Data analysis plays an indispensable role for understanding various

phenomena. Cluster analysis, primitive exploration with little or no
prior knowledge, consists of research developed across a wide variety
of communities. One of the vital means in dealing with information
data is to classify or group them into a set of categories or clusters.
In order to learn a new object or understand a new phenomenon,
people always try to seek the features that can described it, and fur-
ther compare it with other known objects or phenomena, based on the
similarity or dissimilarity, generalized as proximity, according to some
certain standard or rules. Basically, classification systems are either
supervised or unsupervised, depending on whether they assign new

133
gp based on ns for clustering

inputs to one of a finite number of discrete supervised classes or unsu-

pervised categories, respectively.
In unsupervised classification, called clustering or exploratory data
analysis, no labelled data are available. The goal of clustering is to sep-
arate a finite unlabelled data set into a finite discrete set of ”natural”,
hidden data structures, rather than provide an accurate characteriza-
tion of unobserved samples generated from the same probability distri-
bution.
Most researchers describe a cluster by considering the internal ho-
mogeneity and the external separation, i.e., patterns in the same clus-
ter should be similar to each other, while patterns in different clusters
should not.
Given a set of patterns X = {x1 , ..., xj , .., xN }, where xj =
(xj1 , xj2 , ..., xjd )T ∈ Rd and each measure xji is said to be a feature (at-
tribute, dimension, or variable). As an important tool for data explo-
ration, cluster analysis examines unlabelled data, by either construct-
ing a hierarchical structure, or forming a set of groups according to a
pre-specified number.
Research in Evolutionary Computation (EC) has produced search
and optimization algorithms that frequently achieve promising new re-
sults in diverse domains (Koza, 2010). Therefore, the practical value of
evolutionary algorithms (EAs) is by now widely accepted. Nonetheless,
for some within the field a conceptual, or even philosophical, problem
remains regarding most EAs. At their core, EAs are simple abstractions
of Neo-Darwinian evolution. However, instead of the open-ended na-
ture of biological evolution, EAs are objective driven, just like any con-
ventional optimization algorithm. Therefore, EAs are expected to con-
verge on a small subset of local optima within a static fitness landscape.
The chapter is organized as follows. Section 8.2 presents the pro-
posed NS-based GP algorithm for data clustering and a behavioral de-
scriptor for evolved clustering functions. Then, Section 8.3 presents
the experiments and an analysis of the results. Finally, a summary of
the chapter and concluding remarks are given in Section 8.4.

134
8.2 clustering with novelty search

8.2 Clustering with Novelty Search

This section presents first two well known data clustering algo-
rithms; K-means and Fuzzy C-means. Afterward, the proposed behav-
ioral descriptor for GP-based clustering functions is presented, and we
provide a discussion regarding its fitness landscape.

8.2.1 K-means

This is by far among clustering techniques that are based on min-

imizing an objective function, the most widely used and studied algo-
rithm (Jain, 2010). K-means is a clustering method that intends to par-
tition n observations into k clusters, where each observation belongs to
the cluster with the nearest mean, using the cluster centres to model
the data.
This algorithm by mean of an iterative partitioning minimizes the
sum, over all clusters, of the within-cluster sums of point-to-cluster-
centroid distances, using the squared Euclidean distance.
Given a set of observations (x1 , x2 ), ..., xn , where each observation is
a d-dimensional real vector, K-means clustering intends to partitioning
the n observations into k sets (k 6 n), S = S1 , S2 , ..., SK , so as to minimize
the within-cluster sum of squares, according with

k X
X
arg min ||xj − µi ||2 (8.30)
S i =1 xj ∈Si

where µi is the mean of point in Si .

8.2.2 Fuzzy C-means

Fuzzy C-means (FCM) is a method of clustering which allows a data

element to belong to more than one cluster, the assignation to a cluster
is relaxed, and the data element can belong to all of the clusters with
a certain degree of membership (Bezdek et al., 1984). FCM attempts

135
gp based on ns for clustering

to find a partition for a set of data points xj ∈ Rd , j = 1, .., N while

minimizing the following objective function:

c X
X N
J (U, M) = (ui,j )m Dij (8.31)
i =1 j =1

where, U = [ui,j ]cxN is the fuzzy partition matrix and ui,j ∈ [0, 1] is
the membership coefficient of the jth object in the ith cluster; M =
[m1 , ..., mc ] is the cluster prototype (mean or center) matrix; m ∈ [1, ∞)
is the fuzzification parameter and usually is set to 2; Dij = D (xj , mi ) is
the distance measure between xj and mi .

8.2.3 Cluster Descriptor (CD)

The training set T used contains sample patterns from each cluster.
Then, for a two-cluster problem with Ω = {ω1 , ω2 } the clustering de-
scriptor (CD) is constructed in the following way. If T = {y1 , y2 , ...yL },
then the behavioural descriptor for each GP clustering function Ki is a
binary vector ai = (a1 , a2 , ...aL ) of size L, where each vector element aj
is set to 1 if clustering function Ki assigns label ω1 to pattern yj and is
set to 0 otherwise.
Suppose that the number of training examples from each cluster
is L2 , and suppose that they are ordered in such a way that the first L2
elements in T correspond to cluster label ω1 . Let x represent a binary
vector, and function u (x) return the number of 1s in x. Moreover, let
KO be the optimal clustering function that achieves a perfect accuracy
on the training set.
Then, the CD of KO s behaviour is given by a1 = (11 , 12 , ...,
1 L , 0 L +1 , ..., 0L ). Moreover, for a two-cluster problem, an equally use-
2 2
ful solution is to take the opposite (complement) behaviours and invert
the clustering, such that a 1 is converted to a 0 and vice-versa. These
mirror behaviours are a0 = (01 , 02 , ...0 L , 1 L +1 ....1L ) for the CD descrip-
2 2
tors. The complete fitness landscapes in behavioural space are depicted
in Figure 8.1.

136
8.2 clustering with novelty search

0.8

0.6

0.4

0.2

0
L/2
L/2
L/4
L/4

0 0
UR(x) UL(x)

Figure 8.1: Fitness landscape in behavioral space for the CD descriptor.

For a two-cluster problem with a reasonable degree of difficulty, the

initial generations of a GP search should be expected to contain close to
random clustering functions, with roughly a 50% accuracy. For the CD
descriptor, behavioral space is organized on a two dimensional surface,
such that one axis uL considers the number of ones on the left hand side,
first L2 bits, of a behavior descriptor a, and uR considers the remaining
L
2 bits; see Figure 8.1.

Notice that the middle valley of the fitness landscape corresponds

to basically random clustering functions, with worst case scenario per-
formance. Hence, NS will push the search towards either of the two
global optima, a1 and a0 . On the other hand, for the AD descriptor,
early behaviours will almost exclusively exhibit descriptors with equal
proportions of zeros and ones; see Figure 8.1. Then, NS will progres-
sively explore towards two opposite points in behavioral space, b1 or b0 .
The effect of these differences between the CD is explored experimen-
tally in the following section. Finally, given the above binary descriptor,
we use the Hamming distance to compute the novelty measure.

137
gp based on ns for clustering

8.2.4 Cluster Distance Ratio (CDR)

This measure compares the dispersion within the clusters, consider-

ing both the intra and inter-cluster, we use the distances of the points
from their cluster centre to determine whether the clusters are com-
pact. First the intra-cluster measure is computed, using the Euclidean
distance from each point to its nearest neighbor within its own clus-
ter, then calculate the average nearest neighbor distance for all points,
defined as

K
1 XX
IntraCluster = ||x − vi ||2 , (8.32)
N
i =1 x∈Ci

where N is the number of data elements, K is the number of clusters,

and vi is the cluster centre of cluster Ci . The intra-cluster give us a
clustering compaction measure; the less intra-cluster distance the more
compact the cluster is.
The inter-cluster distance or the distance between clusters, is com-
puted using the Euclidean distance between each cluster centre to the
rest of the cluster centres taking the minimum of these values, defined
as

InterCluster = min(||vi − vj ||2 ) , (8.33)

where i = 1, 2, ..., K − 1, and j = i + 1, ..., K. Notice that it is considered

more than one cluster, and since there must exist at least one data point
within its corresponding cluster different from the other cluster, it is
assured that the inter cluster to be greater than zero.
The ratio between the intra and inter-cluster determines the quality
about the hypothesis used to group the data, defined as

IntraCluster
CDR = (8.34)
InterCluster

138
8.3 experiments

8.3 Experiments

The performance of the NS-based GP clustering function is exam-

ined against two well known clustering methods; K-means and Fuzzy
C-means. The novelty based Class Descriptoy (CD) is tested, with two
different versions, one configuration uses an sparseness threshold of
S=20 and the other one S=40.

8.3.1 Test Problems

Gaussian Mixture Models are used to generate five random syn-

thetic problems, each with three dimensions and different amounts
of class overlap and geometry. All problems are set in the R2 plane
with (x, y ) ∈ [−10, 10]2 , and 200 sample points were randomly gener-
ated for each class. The parameters for the GMM of each class were
also randomly chosen, following the same strategy reported in (Trujillo
et al., 2011a). The five problems are of increasing difficulty, these prob-
lems are graphically depicted in Figure 8.2 and denoted with the label
Ground Truth (GT).

8.3.2 Configuration and Parameter Settings

As stated above, four different algorithms are experimentally com-

pared: K-means, Fuzzy C-menas, CD-S20 and CD-S40. However, all
algorithms share the same GP representation and genetic operators, a
tree-based Koza style algorithm with subtree mutation and crossover.
The parameters shared by all algorithms are summarized in Table 8.1.
For the NS-based algorithms two different parameter settings are
used. In particular, two different values for the archive threshold ρmin
are used, 20 and 40. Parameter k on the other hand is set to 15 for all
algorithms. Finally, all algorithms were coded using Matlab and the
GPLAB toolbox (Silva and Almeida, 2003).

139
gp based on ns for clustering

5 20
5

10
0 0
0
−5
−5 −10

−10
−20

5 5 10
4 10
2 0 5
0 5 10
0 −5 0 0
−2 −5 0
−5 −10 −5
−4 −10 −10

(a) Problem-1 (b) Problem-2 (c) Problem-3

6
4
2 5

0
−2 0

−4
−6 −5

5 10
5 5
0 5
0 0 0
−5
−5 −5 −5

(d) Problem-4 (e) Problem-5

Figure 8.2: Five synthetic clustering problems; the observed clusters

represent the ground truth data.

Table 8.1: Parameters for the GP-based search.

Parameter Description

Population size 200 individuals.

Generations 100 generations.
Initialization Ramped Half-and-Half,with 6 levels
of maximum depth.
Operator probabilities Crossover pc = 0.8,
Mutation pµ = 0.2.
√
Function set ( + , − , × , ÷ , | · | , x2 , x , log ,
nsin , cos , if ) . o
Terminal set x1 , ..., xi , ..., xp , where xi is a dimension of the data patterns x ∈ Rn .
Bloat control Dynamic depth control.
Initial dynamic depth 6 levels.
Hard maximum depth 20 levels.
Selection Tournament.

8.3.3 Results

This section presents an experimental evaluation of NS-GP cluster-

ing. All of the clustering algorithms were executed 30 times to find
statistically significant results. Table 8.2 compare the performance of

140
8.3 experiments

Table 8.2: Average classification error and standard error of the best
solution found by each algorithm on each problem; GT: Ground Truth,
KM: K-means, FC: Fuzzy C-means, NS-based algorithms both use k =
15, but CD1 with ρmin = 20 and CD2 use ρmin = 40.

Class Distance Ratio (CDR)

GT KM FC CD1 CD2

Problem-1 9.1720 10.2193 10.2511 0.006 0.006

Problem-2 4.4629 5.5085 5.4570 0.006 0.006
Problem-3 5.5502 5.4053 5.2887 0.006 0.006
Problem-4 4.1075 6.2132 6.1094 0.006 0.006
Problem-5 1.5977 4.5417 4.5184 0.006 0.006
Clustering Error
GT KM FC CD1 CD2

Problem-1 9.1720 0.1025 0.1000 0.0500 ± 0.005 0.0050 ± 0.005

Problem-2 4.4629 0.0750 0.0725 0.0925 ± 0.005 0.1300 ± 0.005
Problem-3 5.5502 0.2175 0.2625 0.2675 ± 0.005 0.4375 ± 0.005
Problem-4 4.1075 0.3600 0.3550 0.1875 ± 0.005 0.1725 ± 0.005
Problem-5 1.5977 0.4550 0.4500 0.4688 ± 0.005 0.4300 ± 0.005

NS-GP with two baseline methods, KM and FC. The table presents two
comparative views of average performance over all runs. First, the al-
gorithms are compared based on their CDR score, and the CDR of the
ground truth of each problem is also presented. Additionally, using
the ground truth, a classification error was computed, based on the or-
dering suggested by each clustering method. In general, the results
indicate two noteworthy trends. First, NS seems to performs much
worse on the simpler problems, it seems like it is basically doing a ran-
dom search. On the other hand, NS noticeably outperforms the control
methods on the harder problems, this is especially true for the hard-
est, Problem 5. Second, it seems that a lower ρmin encourages better
performance in most cases. A detailed view of how the data is being
clustered can provide a different analysis of the results. Figures 8.3 -
8.7 present a graphical illustration of the clustering output achieved
by each method. All figures show the ground truth clusters for visual
comparison, along with a typical clustering output from each method.

141
gp based on ns for clustering

−5

5
4
2
0
0
−2
−5
−4

(a) Ground Truth

10 10

5 5

0 0

−5 −5

−10 −10
10 10

5 5 5 5

0 0
0 0
−5 −5

−10 −5 −10 −5

(b) K-means (c) FC-means

5 5

0 0

−5 −5

5 5
4 4
2 2
0 0
0 0
−2 −2
−5 −5
−4 −4

(d) NS-GP-20 (e) NS-GP-40

Figure 8.3: Comparison of clustering performance on Problem No.1.

These figures confirm the data presented in Table 8.2, NS-GP performs
worse on the easy problems and performs better on the difficult ones.
Figures 8.8 and 8.9 examine how sparseness evolves during the
NS-GP search, for NS-GP-20 and NS-GP-40 respectively. Each figure
presents a similar plot that shows how the sparseness of the best indi-
vidual (based on fitness) evolves over each generation. The plots are av-
erages of the 30 runs of each experiment and present a curve for each
problem. A horizontal line shows the corresponding threshold value

142
8.3 experiments

−5

−10

5
10
0
5
−5 0
−5
−10 −10

(a) Ground Truth

15 15

10 10

5 5

0 0

−5 −5

−10 −10

−15 −15
10 10
5 15 5 15
0 10 0 10
5 5
−5 0 −5 0
−10 −5 −10 −5
−10 −10
−15 −15 −15 −15

(b) K-means (c) FC-means

10 10

5 5

0 0

−5 −5

−10 −10

5 5
10 10
0 0
5 5
−5 0 −5 0
−5 −5
−10 −10 −10 −10

(d) NS-GP-20 (e) NS-GP-40

Figure 8.4: Comparison of clustering performance on Problem No.2.

for each configuration, set to 20 in Figure 8.8 and set to 40 in Figure

8.9. In both cases it is possible to observe a similar pattern. For the
easy problems (1, 2 and 3) the sparseness value of the best individual
at each generation does not reach the threshold, and is therefore not
included in the population archive. This means that the individual is
lost, and maybe it is never found again. This explains the reason why
NS fails to produce good results on the easy problems. It appears that
since the problems are not difficult, then novelty does not necessarily

143
gp based on ns for clustering

−10

−20

10
5
10
0 0
−5 −10

(a) Ground Truth

30 30

20 20

10 10

0 0

−10 −10

−20 −20

−30 −30
15 15
10 20 10 20
5 10 5 10
0 0 0 0
−5 −10 −5 −10
−10 −20 −10 −20

(b) K-means (c) FC-means

20 20

10 10

0 0

−10 −10

−20 −20

10 10
5 5
10 10
0 0 0 0
−5 −10 −5 −10

(d) NS-GP-20 (e) NS-GP-40

Figure 8.5: Comparison of clustering performance on Problem No.3.

lead to quality, and NS is not influencing the search as desired. On the

other hand, for the more difficult problems (4 and 5) it is clear that
good solutions almost always correspond with novel solutions; i.e., the
solutions that are being introduced into the archive of novel solutions
also exhibit good fitness values. In these cases, the search for novelty is
indeed leading evolution towards better overall solutions. This is rea-
sonable, since for difficult problems the initial (random) programs will
mostly exhibit bad performance, and novelty can lead towards quality.

144
8.3 experiments

6
4
2
0
−2
−4
−6

5
5
0
0
−5
−5

(a) Ground Truth

8 8

6 6

4 4

2 2

0 0

−2 −2

−4 −4

−6 −6

−8 −8
10 10

5 10 5 10
5 5
0 0
0 0
−5 −5
−5 −5
−10 −10 −10 −10

(b) K-means (c) FC-means

6 6
4 4
2 2
0 0
−2 −2
−4 −4
−6 −6

5 5
5 5
0 0
0 0
−5 −5
−5 −5

(d) NS-GP-20 (e) NS-GP-40

Figure 8.6: Comparison of clustering performance on Problem No.4.

Finally, to illustrate how evolution progresses with the NS-GP clus-

tering algorithm, Figure 8.10 presents four snapshots of NS-GP-40 ap-
plied to Problem 5. Each plot shown in the figure represents the
best NS solution (taken from the current population and population
archive) based on the CDR value at four different generations. For
this difficult problem, it is clear that NS is progressively exploring the
search space, and finding better solutions.

145
gp based on ns for clustering

−5

10
5 5
0 0
−5 −5

(a) Ground Truth

10 10

5 5

0 0

−5 −5

−10 −10
15 15
10 10 10 10
5 5 5 5
0 0 0 0
−5 −5 −5 −5
−10 −10 −10 −10

(b) K-means (c) FC-means

5 5

0 0

−5 −5

10 10
5 5 5 5
0 0 0 0
−5 −5 −5 −5

(d) NS-GP-20 (e) NS-GP-40

Figure 8.7: Comparison of clustering performance on Problem No.5.

8.4 Chapter Conclusions

This chapter presented a NS-based GP system to search for data

clustering functions. To do so, a domain-specific behavioral descriptor
was proposed; the Cluster Descriptor (CD). It appears that NS-based
search exhibits the best results when confronted with difficult prob-
lems. Since generating a high-quality solution at random is less prob-
able for difficult problems, the incentive for behavioural exploration
is incremented and the search for novelty indeed leads towards qual-

146
8.4 chapter conclusions

Prob.1
60 Prob.2
Prob.3
Prob.4
50
Prob.5
Treshold
40
Sparseness
30

0
20 40 60 80 100
Generations

Figure 8.8: Evolution of sparseness for NS-GP-20, showing the average

sparseness of he best individual at each generation.

Prob.1
60 Prob.2
Prob.3
Prob.4
50
Prob.5
Treshold
40
Sparseness

0
20 40 60 80 100
Generations

Figure 8.9: Evolution of sparseness for NS-GP-40, showing the average

sparseness of he best individual at each generation.

147
gp based on ns for clustering

5 5

0 0

−5 −5

10 10
5 5 5 5
0 0 0 0
−5 −5 −5 −5

(a) Generation 0 (b) Generation 67

5 5

0 0

−5 −5

10 10
5 5 5 5
0 0 0 0
−5 −5 −5 −5

(c) Generation 134 (d) Generation 200

Figure 8.10: Evolution of the best solutions found at progressive gener-

ations for Problem 5 with NS-GP-40.

ity during the search. On the other hand, for simple problems the
exploratory capacity of NS is mostly unexploited. In particular, CD
descriptor is less restrictive and for this reason it can be used on clus-
tering. According with the results, CD descriptor probed that can be
used with NS to solve clustering problems. Therefore, future work will
focus on exploring the usefulness of NS on the more difficult problem
of non-supervised learning.

148
9
GP based on NS for Classification

abstract — This chapter presents a NS-based genetic program-

ming (GP) algorithm for supervised classification. Results show that
NS can solve real-world classification tasks, the algorithm is validated
on real-world benchmarks for binary and multiclass problems. These
results are made possible by using a domain-specific behavior descrip-
tor. Moreover, two new versions of the NS algorithm are proposed,
Probabilistic NS (PNS) and a variant of Minimal Criteria NS (MCNS).
The former models the behavior of each solution as a random vector
and eliminates all of the original NS parameters while reducing the
computational overhead of the NS algorithm. The latter uses a stan-
dard objective function to constrain and bias the search towards high
performance solutions. The chapter also discusses the effects of NS on
GP search dynamics and code growth. Results show that NS can be
used as a realistic alternative for supervised classification, and specif-
ically for binary problems the NS algorithm exhibits an implicit bloat
control ability.

9.1 Introduction

In this chapter we will continue our work on applying NS to com-

mon machine learning tasks, and will now focus on supervised clas-

149
gp based on ns for classification

sification. In particular, we will consider the new NS variants we pro-

posed in Chapter 4, namely PNS and MCNS, as well as a hybrid method
that combines both we call MCPNS. To recap, in PNS the novelty of a
solution is estimated probabilistically by modeling each behavior as a
random vector, a new way to compute novelty, which in addition elim-
inates all of the underlying NS parameters as well as reduces the com-
putational overhead from the original NS algorithm. Moreover, MCNS
proposes a blending of the objective function with the sparseness mea-
sure, where the minimal solution quality represents a dynamic crite-
rion that is proportional to the quality of the best solution found so
far.
To apply NS successfully, a behavior descriptor must first be pro-
posed (Kistemaker and Whiteson, 2011). For instance, in a maze nav-
igation problem Lehman and Stanley used the final robot position as
the behavior descriptor (Lehman and Stanley, 2008, 2010b,c, 2011a).
In this work, our main goal is to apply NS in GP-based classification. In
particular, we use two GP classifiers: a simple binary classifier based
based on a static threshold (Zhang and Smart, 2006) and a recently pro-
posed multiclass approach (Ingalalli et al., 2014; Muñoz et al., 2015).
Both latter algorithms are wrapper methods that evolve features trans-
formations of the form g : Rn → Rp , such that each input pattern x ∈ Rn
is transformed into a new feature vector x ∈ Rp that is easier to classify
using a predefined classification rule R. In this domain, the context
is provided by R; in this work we use a simple threshold (Zhang and
Smart, 2006) and a minimum Mahalanobis distance criterion (Muñoz
et al., 2015); as described below.

9.1.1 Accuracy Descriptor (AD)

The proposed descriptor considers the accuracy of a classifier at a

fine scale, which is closely related to its semantics but also takes into
account the context provided by the classification rule R. Here we con-
sider supervised classification problems specified by a training set T
which contains sample patterns from each class. The accuracy descrip-

150
9.1 introduction

1
[1 1 ... 1 1]
0.9 optimal

Classi cation Error

0.8

0.7
1 6 1 6
0.6

2 random
7 7 0.5
performance
2
0.4
3 8 3 8 0.3
[0 0...1 1]
[0 1...0 1]
0.2 [0 1...1 0]
4 4
[1 0...0 1]
9 9 0.1

5 10 5 10 0
[0 0 ... 0 0] [1 1...0 0]
optimal L/2
0 L
βo = [1 1 1 1 1 1 1 1 1 1] β =[0 0 0 1 1 1 0 1 0 0] Number of ones

(a) Optimal behavior (b) Current behavior (c) Behavior landscape

Figure 9.1: Graphical depiction of the Accuracy Descriptor (AD): (a)

shows the optimal behavior; (b) shows a possible behavior; and (c)
shows the underlying fitness landscape abased on behavior space.

tor (AD) is constructed in the following way. If T = {x1 , ..., xL }, then

the behavior descriptor for each GP individual Kj is a binary vector
βjAD = (β1 , βj,2 , ...βj,i .., βj,L ) of size L, where each vector element βj,i is
set to 1 if the classification rule R correctly classifies sample xi based
on the output provided by Kj , and is set to 0 otherwise. The proposed
descriptor is illustrated in Figure 9.1.
To understand the underlying behavior space specified by AD, lets
consider a binary classification problem. The AD descriptor specifies
β as a binary vector, and suppose that function u (β ) returns the num-
ber of 1s in β. Moreover, let KO be the optimal transformation that
achieves a perfect accuracy on the training set based on R. The AD
AD AD
of KO is given by βO where u (βO ) = L. Moreover, for a two-class
problem, an equally useful solution is to take the worst transformation
KW , that achieves a complete misclassification in the whole training set
AD AD
given by βW where u (βW ) = 0, then use the opposite (complement)
behavior and invert the classification. This mirror behaviors is a β AD
with u (β AD ) = 0. These two optima are shown in the underlying be-
havior fitness landscape depicted in Figure 9.1(c), when fitness is given
by the classification accuracy. For a problem with a reasonable degree
of difficulty, the initial generations of a GP search should be expected
to contain close to random classifiers, with roughly a 50% accuracy.

151
gp based on ns for classification

Therefore, early individuals will mostly exhibit behavior descriptors

with equal proportions of zeros and ones. Then, NS will necessarily
explore towards two opposite points in behavior space.

Just like in semantic space, the specified fitness landscape based

on behaviors defines clear global optima (Uy et al., 2011b; Beadle and
Johnson, 2008, 2009; Krawiec and Pawlak, 2013). However, for super-
vised classification semantic space would not be useful, since the global
semantic optima is not known beforehand, it necessarily depends upon
the training set T and the specified classification rule R. It is impor-
tant to mention that (Moraglio et al., 2012) also proposed geometric
operators for GP-based classifiers based on semantics, however that for-
mulation is not applicable for wrapper methods such as (Zhang and
Smart, 2006; Muñoz et al., 2015). Moreover, while semantic approaches
have been widely used for symbolic regression (Castelli et al., 2015),
they have not been fully studied in supervised classification. Finally,
given the above binary descriptors, a natural distance function to apply
NS with Equation 4.11 showed in Chapter 4, is the Hamming distance
(any other suitable distance function for binary strings could be used),
which counts the number of bits that differ between two binary vectors.

9.1.2 Binary Classifier: Static Range Selection

For binary classification, we use the Static Range Selection GP Clas-

sifier (SRS-GPC) described by Zhang and Smart (Zhang and Smart,
2006). In a binary problem, a pattern x ∈ <n has to be classified as
belonging to one of either two classes, ω1 or ω2 . In this method, the
goal is to evolve a mapping g (x) : <n → <. The classification rule
R states that pattern x is labelled as belonging to class ω1 if g (x) > r,
and belongs to ω2 otherwise. The fitness function is simple, it consists
on maximizing the total classification accuracy after R is applied, nor-
mally setting the decision boundary to r = 0; the process is depicted in
Figure 9.2.

152
9.1 introduction

h
d1
Feature space 1-dimensional space

Figure 9.2: Graphical depiction of SRS-GPC.

9.1.3 Multiclass Classifier: M3GP

To test the NS approach on multiclass problems, we choose the

recently proposed Multidimensional Multiclass GP with multidimen-
sional populations (M3GP) (Muñoz et al., 2015). The goal of M3GP
is to evolve a mapping g (x) : <n → <p , where p is determined by
the evolutionary process. The classification rule R used by M3GP con-
sists of the following steps: (1) build M p-dimensional Gaussian clus-
ters, one for each class, using the training set; (2) classify each train-
ing instance based on the minimum Mahalanobis distance to each class
cluster, this provides the total accuracy used as fitness measure. For
testing, the clusters found during training are used to determine class
membership using the same distance measure. Results reported with
M3GP are quite competitive with state of the art methods, such as Ran-
dom Forests (Breiman, 2001) and Random Subspaces (Ho, 1998; Bryll
et al., 2003) (Muñoz et al., 2015), providing a more advanced GP-based
classification algorithm compared with SRS-GPC. Nonetheless, given
the proposed behavior descriptor we can use the same NS algorithms
in both cases. In essence, each classifier provides a different context
for the outputs generated by the GP trees, just like previous works in
evolutionary robotics solved different navigation tasks in different en-
vironments, using the same descriptor and NS algorithm (Lehman and
Stanley, 2008, 2010b,c, 2011a).
Summarizing, SRS-GPC and M3GP are used to evaluate the NS
approach, respectively on binary and multiclass problems. For both

153
gp based on ns for classification

algorithms, the traditional approach will hereafter be referred to as

objective-based search (OS). However, before testing these algorithms
on real-world datasets, we evaluate the proposed NS approach on syn-
thetic binary problems.

9.1.4 Preliminary Results

Gaussian Mixture Models (GMM) are used to generate five random

synthetic problems. All problems are set in the R2 plane with x, y ∈
[−10, 10]2 , and 200 sample points were randomly generated for each
class. The parameters for the GMM of each class were also randomly
chosen, following the same strategy reported in (Trujillo et al., 2011a).
The five problems are of increasing difficulty (according to SRS-GPC
performance), denoted as: Trivial; Easy; Moderate; Hard; and Hardest;
these problems are depicted in Figure 9.3.
Here we compare SRS-GPC using either the objective function (OS)
or novelty (NS). The parameters of both GP systems are given in Ta-
ble 9.1. Moreover, for NS method the parameters are set as follows:
(1) the number of neighbors considered for sparseness computation is
set to k = 15; (2) the sparseness threshold is set to ρth = 40; and (3)
the archive is implemented as a FIFO queue of three times the popu-
lation size. Both algorithms are executed 30 times and performance
is analyzed based on averages over all runs. For each problem, 70%
of the data is used for training and the rest for testing, with random
partitions for each run. Performance is measured based on the total
classification error and all algorithms were coded using Matlab and the
GPLAB toolbox (Silva and Almeida, 2003).

9.1.5 Discussion

Two aspects will be discussed; classification error and the average

size (number of nodes) of the individuals in the population. First, Ta-
ble 9.2 presents the average classification error on the test data. The
results are consistent with those reported in (Naredo et al., 2013b), NS

154
9.1 introduction

3 6 6

2
4 4
1

0 2 2

−1
0 0
−2

−3 −2 −2

−4
−4 −4
−5

−6 −6 −6
−6 −5 −4 −3 −2 −1 0 1 2 3 −8 −6 −4 −2 0 2 4 6 −10 −5 0 5

(a) Trivial (b) Easy (c) Moderate

6 8

4 6

4
2
2

0
0

−2 −2

−4
−4
−6
−6
−8

−8 −10
−6 −4 −2 0 2 4 6 −10 −5 0 5

(d) Hard (e) Hardest

Figure 9.3: Five synthetic 2-class problems used to evaluate each algo-
rithm in ascending order of difficulty (according with the GP objective-
based standard performance) from left to right.

performs well, achieving basically the same results compared to OS.

However, there is a trend, NS seems to perform relatively better on the
more difficult problems and worse on the easier ones. Basically, the
explorative search performed by NS is fully exploited when random
initial solution perform badly, in these conditions the search for nov-
elty can lead towards better solutions. Conversely, for easy problems
random solutions can perform quite well, thus the search for novelty
can lead the search towards solutions with undesirable performance.
The results are encouraging, especially if we consider the evolu-
tion of average program size which is related to bloat (Silva and Costa,
2009). Figure 9.4 shows how the average size of all individuals in the
population evolves across generations. First, consider OS, Figure 9.4(a)
shows typical GP behavior, with a clear tendency of code growth, with

155
gp based on ns for classification

Table 9.1: Parameters for the GP systems.

Parameter Description
Population size 100 individuals.
Generations 100 generations.
Initialization Ramped Half-and-Half,
with 6 levels of maximum depth.
Operator prob- Crossover pc = 0.8, Mutation pµ = 0.2.
abilities n √ o
Function set + , − , × , ÷ , | · | , x2 , x , log , sin , cos , if .
n o
Terminal set x1 , ..., xi , ..., xp , where xi is a dimension of the
data patterns x ∈ <n .
Hard maxi- 20 levels.
mum depth
Selection Tournament of size 4.

Table 9.2: Average and standard deviation of the classification error on

the test data, of the best solution found by each algorithm.

Problem OS NS
Trivial 0.004 ±0.007 0.007 ±0.008
Easy 0.105 ±0.040 0.144 ±0.044
Moderate 0.136 ±0.033 0.159 ±0.041
Hard 0.260 ±0.052 0.266 ±0.053
Hardest 0.365 ±0.033 0.370 ±0.043

more bloat on difficult problems. In the case of NS, Figure 9.4(b) shows
that NS controls code growth quite effectively, exhibiting the same av-
erage program size on all problems.
Based on these results, we can revisit the fitness-causes-bloat theory
of Langdon and Poli (Langdon and Poli, 1997). It basically states that
the search for better fitness (given by the objective function) will bias

156
9.1 introduction

Figure 9.4: Evolution of the average size of individuals at each genera-

tion, computed on thirty runs; where: (a) represents OS and (b) is for
NS.

(a) Objective-based (b) Novelty Search

250 250
Trivial Trivial
Easy Easy
Moderate Moderate
200 Hard 200 Hard
Hardest Hardest

150 150
Size

Size
100 100

50 50

0 0
20 40 60 80 100 20 40 60 80 100
Generations Generations

Table 9.3: Average program size at the final generation for each algo-
rithm. For the NS algorithm, the population (Pop), archive (Arch) and
both (Pop+Arch) are considered.

Problem OS NS Pop NS Arch NS

Pop+Arch
Trivial 151.3 47.9 38.5 46.1
Easy 188.5 41.4 33.6 35.4
Moderate 184 41.3 3.7 38.6
Hard 197 43.8 37.9 42.9
Hardest 220.1 40.7 32.7 38.2

the search towards larger trees, simply because there are more large
programs than small ones. Silva and Costa (Silva and Costa, 2009) state
it clearly:

... one cannot help but notice the one thing that all the
[bloat] theories have in common, the one thing that if re-

157
gp based on ns for classification

moved would cause bloat to disappear, ironically the one

thing that cannot be removed without rendering the whole
process useless: the search for fitness.

Our experimental results show similarly with those reported in

(Lehman and Stanley, 2010b), that show how NS consistently evolves
smaller program trees, and this findings correlate nicely with the
fitness-causes-bloat theory. This results suggest that the bloat phe-
nomenon was avoided by abandoning an explicit objective function.

9.2 Real-world classification experiments

In this section we present an experimental comparison of all the

novelty-based variants discussed thus far, compared against a standard
objective-based search. The algorithms are compared on real-world
classification problems taken from several public repositories summa-
rized in Table 9.4. The first 8 datasets are used to construct binary clas-
sification problems and solved using the SRS-GPC decision rule, the
datasets with more than two classes are divided into 12 binary classi-
fication tasks. Datasets 1 − 5 are balanced regarding the number of in-
stances per class, while datasets 6 to 12 pose imbalanced problems. The
remaining 3 datasets pose the multiclass problems which are solved
with M3GP.
In all, 4 different novelty-based algorithms are compared with
objective-based search: these are NS, PNS, MCNS and MCPNS, the lat-
ter of which is a hybrid that combines PNS with the MC presented in
Section 3.1. For the binary classification problems all algorithms use
the parameters given in Table 9.1, except that the population size was
reduced to 50. Similarly, to compare with M3GP the parameters are
slightly modified following (Muñoz et al., 2015), where: population
size is 500, initialization used the full method, crossover and muta-
tion probabilities are 0.5, the function set is F = {+, −, ×, ÷}, maximum
depth is 17 levels, and selection uses lexicographic tournament of size
4 . All algorithms are executed 30 times and performance is analyzed
based on the median value over all runs. In each run 70% of the data is

158
9.2 real-world classification experiments

Table 9.4: Real-world and synthetic datasets for binary and multiclass
classification problems, taken from the UC Irvine Machine Learning
Repository ♦ , from the U.S. geological survey
N (USGS) earth resources
observationJsystems (EROS) data center , and from the KEEL dataset
repository .

Type Dataset Name Short Class Attrib Inst

Binary ♦ Pima Indians Diabetes Diab 2 8 536
♦ Iris Iris 3 4 150
♦ Parkinsons Parkin 2 22 96
♦ Teaching Assist. Eval. TAE 3 5 147
♦ Wine Wine 3 13 144
♦ Cardiotocography Cardio 3 21 2126
♦ Indian Liver Patient Liver 2 10 579
♦ Fertility Fertil 2 9 100
N
Multiclass J
Satellite dataset IM-3 3 6 322
J
Segment SEG 7 19 2310
Movement-Libras M-L 15 90 360

used for training and the rest for testing, the data partition is randomly
selected for each run. The objective function is given by the classifica-
tion error, which is used by all NS variants to choose the best solution
at the end of the run, and by OS to guide the search.
Since all classification problems differ on the sample size, and be-
cause PNS considers all of the individuals generated during the search,
for NS and MCNS all the behaviors in the current population and in the
population archive are used to compute the novelty measure in Equa-
tion 4.11. Moreover, ρth is set to 50% of the largest possible distance, as
well as k−neighbors is set to 50% of the population size, and the archive
is a FIFO queue with a size three times that of the population.
For the MCNS algorithm, the minimal criteria for each individual
is that it’s fitness must be within a certain percentage of the best solu-
tion found so far. Six different versions of MCNS are tested, from 5%

159
gp based on ns for classification

to 30% (MCNS5 - MCNS30) of the fitness from the best-so-far solution,

in increments of 5%. In this way, any solution that is more than x%
worse than the best solution has it’s novelty value set to 0. It is impor-
tant to mention that all the PNS variants do not require the parameters
used for NS. Finally, all algorithms were coded using MATLAB and the
GPLAB toolbox (Silva and Almeida, 2003).
The algorithms are compared based on training error, test error (er-
ror on test data) and the average size of all individuals in the population
based on the number of tree nodes (hereafter referred to as average pop-
ulation size). To verify statistical significance, the Friedman test with
Bonferroni-Dunn correction of the p-values is performed between the
control algorithm (OS) and each NS variant, as suggested in (Derrac
et al., 2011). In Tables 9.6 and 9.8 an asterisk indicates that the corre-
sponding novelty-based algorithm achieves statistically better results
than OS, with a α = 0.05 significance level. The p-values of the statisti-
cal tests are given in Tables 9.7 and 9.9.

9.2.1 Results: Binary Classification

First, we analyze the performance of different versions of the MCNS

algorithm. Table 9.5 shows the median test error for all classification
problems; each problem is denoted by an abbreviation (Diab, Parkin,
etc., as showed in Table 9.4) followed by the classes used to pose the
binary problem (C1, C2, C3). In general, most variants are quite simi-
lar, with MCNS15 achieving the overall best results. Therefore, for the
sake of clarity and conciseness only MCNS15 is included in the subse-
quent comparisons with other methods and will simply be referred to
as MCNS.
Figure 9.5 presents convergence plots of the median training error
of the best solution over generations. In general, OS converges quicker
than most NS variants, with PNS showing the most similar conver-
gence. For some problems (plots i-l) PNS converge faster than OS. In-
deed, plots (a)-(h) correspond with well balanced problems, with simi-
lar amounts of samples for each class. On the other hand, the problems

160
9.2 real-world classification experiments

Table 9.5: Binary classification performance for all MCNSbest variants.

Table shows the median classification error on the test data for the best
solution found, bold indicates the best performance.

Dataset MC5 MC10 MC15 MC20 MC25 MC30

Diab C1C2 0.306 0.325 0.309 0.328 0.338 0.322
Iris C2C3 0.100 0.100 0.067 0.100 0.100 0.100
Wine C1C3 0.036 0.054 0.000 0.000 0.000 0.000
Wine C2C3 0.125 0.143 0.036 0.107 0.071 0.089
Parkin C1C2 0.250 0.214 0.214 0.214 0.214 0.250
TAE C1C2 0.429 0.429 0.393 0.393 0.393 0.429
TAE C1C3 0.411 0.429 0.321 0.357 0.357 0.357
TAE C2C3 0.464 0.429 0.411 0.375 0.429 0.429
Wine C1C2 0.143 0.143 0.125 0.143 0.161 0.125
Wine C1C3 0.036 0.054 0.000 0.000 0.000 0.000
Wine C2C3 0.125 0.143 0.036 0.107 0.071 0.089
Cardio C1C2 0.126 0.109 0.119 0.110 0.113 0.111
Liver C1C2 0.298 0.309 0.286 0.298 0.289 0.292
Fertil C1C2 0.138 0.138 0.138 0.155 0.172 0.138

of plots (i)-(l) present larger class imbalance, suggesting that conver-

gence of some NS algorithms is better on unbalanced problems. How-
ever, NS shows the slowest convergence rates, in many cases not reach-
ing the best training error found by other algorithms.
Figure 9.6 presents plots of how the mean size of the population
evolves. It is clear that OS pushes GP toward larger program sizes on
these problems, a trend that is particularly evident in plots (a)-(j). In
general most NS variants produce smaller GP trees, with some note-
worthy tendencies. First, NS evolves smaller trees in all cases except
for the Fertility C1C2 problem. In most cases the difference with OS is
quite large, a result that is consistent with the experiments presented
in Section 9.1.4. Second, MCNS also evolves very small trees, in some
cases ((b),(c),(g),(h),(i)) the average program size is smaller than those
generated by NS. This is a particularly encouraging result given the

161
gp based on ns for classification

−3
x 10 0.5
Classification Error

0.45 0.45

Classification Error
14

Classification Error
12 0.4
0.4
10 0.35
8
0.35 0.3
6
0.25
4
0.3
0.2
2
0.15
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100
Generations Generations Generations

(a) Diabetes C1C2 (b) Iris C2C3 (c) Parkinsons C1C2

0.4
0.4
0.4
Classification Error

Classification Error

Classification Error
0.35 0.35
0.35

0.3
0.3 0.3
0.25
0.25 0.25
0.2
0.2
0.2
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100
Generations Generations Generations

(d) TAE C1C2 (e) TAE C1C3 (f) TAE C2C3

0.4 0.3
0.25
Classification Error

Classification Error

Classification Error
0.35 0.25
0.3 0.2
0.2
0.25 0.15
0.15
0.2
0.1
0.15 0.1

0.1 0.05 0.05

0.05
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100
Generations Generations Generations

(g) Wine C1C2 (h) Wine C1C3 (i) Wine C2C3

0.16 0.31
0.13
Classification Error

Classification Error

0.15 0.3
0.12
0.14 0.29
0.11
0.13 0.28
0.12 0.27 0.1

0.11 0.26 0.09

0.1 0.25 0.08
0.09 0.24 0.07

20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

Generations Generations Generations

(j) Cardiotocography C1C2 (k) Indian Liver C1C2 (l) Fertility C1C2

OS NS MCNS PNS MCPNS

Figure 9.5: Convergence of the classification error on the training data

for the best solution found, showing the median value over all runs.

162
9.2 real-world classification experiments

80
100 100

80 80 60
Tree size

Tree size

Tree size
60 60
40

40 40
20
20 20

20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

Generations Generations Generations

(a) Diabetes C1C2 (b) Iris C2C3 (c) Parkinsons C1C2

120
120 120
100
100 100
Tree size

Tree size

Tree size
80
80 80

60 60
60

40 40 40

20 20 20

20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

Generations Generations Generations

(d) TAE C1C2 (e) TAE C1C3 (f) TAE C2C3

100
80
80
80
Tree size

Tree size

Tree size
60
60
60
40
40 40

20 20 20

20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

Generations Generations Generations

(g) Wine C1C2 (h) Wine C1C3 (i) Wine C2C3

80 100
80
80
60
Tree size

Tree size

60
60
40 40
40

20 20
20

20 40 60 80 100 20 40 60 80 100 20 40 60 80 100

Generations Generations Generations

(j) Cardiotocography C1C2 (k) Indian Liver C1C2 (l) Fertility C1C2

OS NS MCNS PNS MCPNS

Figure 9.6: Evolution of the average size of the population at each gen-
eration, showing the median value over all runs.

163
gp based on ns for classification

quality of the solutions found by MCNS. Third, PNS in some cases

((a),(c),(d),(e),(f),(g),(h)) also evolves smaller trees than OS. Finally, OS
evolves smaller trees than most methods on two problems, Liver C1C2
and Fertility C1C2. It seems that this result may be correlated with the
slow convergence of the search on these problem, produced by the high
class imbalance in both datasets.
Numerical comparisons based on the test error and average pro-
gram size are presented in Table 9.6. Moreover, the p-values of the
statistical tests, comparing each method with the control method OS,
are given in Table 9.7. To illustrate the performance differences among
the methods, Figure 9.7 shows a box plot comparison of the test error
from the best solution found in each run.
In general, all NS algorithms are very competitive relative to OS.
In fact, all methods show statistically equivalent results based on test
error with only three exceptions: NS is worse on Wine C2C3; PNS is bet-
ter on Cardio C1C2; PNS is worse on Liver C1C2; and MCPNS is worse
on Fertility C1C2. Even in the cases where the differences are statisti-
cally significant, the relative difference is rather low. However, when
we consider the average program size it is clear that all NS variants
evolve much smaller populations. In particular NS and MCNS show
an intrinsic bloat-control property in this domain. There is one excep-
tion, in Fertility C1C2 NS evolves larger populations, which might be
related to the class imbalance issue. Regarding PNS and MCPNS, both
also control code growth on several problems with respect to OS, but
the difference is less compared to NS and MCNS.
Finally, it is instructive to comment on the results for the most im-
balanced problems, Liver C1C2 and Fertility C1C2. If we inspect the
convergence plots in Figure 9.5, several trends appear. In both cases
(plots (k) and (l)), the search performed by OS and NS seem to be stag-
nated, with only minimal improvements across generations. It is evi-
dent that these algorithms have converged to an almost trivial solution
for imbalanced problems; i.e, assign to all (or a large majority) of the
samples the class label of the majority class. In these cases, such trivial
solutions can be generated randomly and will have the same objective
score, thus the selective pressure will be null, making OS function as

164
9.2 real-world classification experiments

Table 9.6: Binary classification performance, showing the median clas-

sification error on the test data (Test) for the best solution found, and
the median of the average program size in the last generation (A-size).
Statistically significant with respect to the control method with a p-
value less than 0.05 is marked with an asterisk (*).

Dataset Measure OS NS MCNS PNS MCPNS

Diabet Test 0.331 0.341 0.309 0.347 0.334
C1C2 A-size 103.84 42.77∗ 52.92∗ 77.47 73.54
Iris Test 0.067 0.067 0.067 0.067 0.100
C2C3 A-size 96.81 55.15∗ 32.69∗ 101.56 54.70
Parkin Test 0.250 0.250 0.214 0.250 0.232
C1C2 A-size 70.61 47.48∗ 9.20∗ 57.26 36.40∗
TAE Test 0.357 0.446 0.393 0.375 0.321
C1C2 A-size 113.91 49.04∗ 74.66∗ 78.39∗ 102.74
TAE Test 0.357 0.393 0.321 0.321 0.393
C1C3 A-size 123.30 45.36∗ 52.93∗ 81.36∗ 77.32∗
TAE Test 0.411 0.393 0.411 0.375 0.429
C2C3 A-size 108.64 49.03∗ 93.94∗ 81.73∗ 87.27
Wine Test 0.107 0.143 0.125 0.107 0.161
C1C2 A-size 75.20 40.93∗ 9.25∗ 56.97 28.45∗
Wine Test 0.000 0.000 0.000 0.000 0.000
C1C3 A-size 86.51 37.03∗ 12.79∗ 68.67 21.71∗
Wine Test 0.071 0.143∗ 0.036 0.036 0.071
C2C3 A-size 94.43 36.83∗ 12.46∗ 64.51 12.44∗
Cardio Test 0.117 0.128 0.119 0.098∗ 0.112
C1C2 A-size 78.27 8.79∗ 60.10 70.17 65.27
Liver Test 0.283 0.289 0.286 0.306∗ 0.295
C1C2 A-size 57.81 11.91∗ 66.31 62.59 75.77
Fertil Test 0.103 0.103 0.138 0.138 0.138∗
C1C2 A-size 47.50 78.01∗ 59.91 91.07∗ 94.75∗

165
gp based on ns for classification

Table 9.7: Resulting p-values of the Friedman’s test with Bonferroni-

Dunn correction, for the binary classification problems using OS as the
control method. The null hypothesis is rejected with a p-value less than
0.05, marked with an asterisk (*).

Dataset Measure NS MCNS PNS MCPNS

Diabet Test 2.86 0.27 2.86 3.41
C1C2 A-size 0.00∗ 0.01∗ 0.58 0.27
Iris Test 0.182 0.057 0.629 1.269
C1C3 A-size 0.00 0.00 0.58 1.09
Parkin Test 3.39 0.11 2.78 1.41
C1C2 A-size 0.04∗ 0.00∗ 1.09 0.00∗
TAE Test 0.07 0.71 0.24 3.39
C1C2 A-size 0.00∗ 0.00∗ 0.04∗ 1.09
TAE Test 0.09 1.34 1.41 4.00
C1C3 A-size 0.00∗ 0.00∗ 0.00∗ 0.01∗
TAE Test 3.39 1.80 0.41 3.41
C2C3 A-size 0.00∗ 0.04∗ 0.01∗ 0.11
Wine Test 2.26 2.78 4.00 3.39
C1C2 A-size 0.00∗ 0.00∗ 1.09 0.00∗
Wine Test 0.63 4.00 3.13 0.36
C1C3 A-size 0.00∗ 0.00∗ 0.11 0.00∗
Wine Test 0.01∗ 0.47 0.13 1.73
C2C3 A-size 0.00∗ 0.00∗ 0.58 0.00∗
Cardio Test 0.78 1.41 0.02∗ 0.33
C1C2 A-size 0.00∗ 0.58 1.09 1.86
Liver Test 1.41 1.03 0.03∗ 0.05
C1C2 A-size 0.00∗ 1.09 1.09 0.58
Fertil Test 0.79 0.16 0.29 0.01∗
C1C2 A-size 0.00∗ 0.58 0.00∗ 0.01∗

a random search. This is consistent with the size of the evolved pro-
grams generated by OS on these problems, which are among the small-

166
9.2 real-world classification experiments

0.5
0.55
Classifcation Error

Classifcation Error

Classifcation Error
0.5
0.5 0.4

0.45 0.4
0.3
0.4
0.3
0.2
0.35
0.2
0.3 0.1

0.25 0.1
0
OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS

(a) Diabetes C1C2 (b) Iris C2C3 (c) Parkinsons C1C2

0.7
0.6

Classifcation Error

Classifcation Error
Classifcation Error

0.6 0.5

0.5
0.5 0.4

0.4 0.4
0.3
0.3 0.3
0.2
0.2
0.2
OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS

(d) TAE C1C2 (e) TAE C1C3 (f) TAE C2C3

0.25
0.3 0.2
Classifcation Error

Classifcation Error

Classifcation Error
0.25 0.2
0.15
0.2 0.15
0.1
0.15
0.1
0.1 0.05
0.05
0.05
0
0 0
OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS

(g) Wine C1C2 (h) Wine C1C3 (i) Wine C2C3

0.16 0.4
0.35
Classifcation Error
Classifcation Error

Classifcation Error

0.14 0.3
0.35
0.25
0.12 0.2
0.3
0.15
0.1
0.1
0.25
0.08 0.05
OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS OS NS MCNS PNS MCPNS

(j) Cardiotocography C1C2 (k) Indian Liver C1C2 (l) Fertility C1C2

Figure 9.7: Classification error on the test data for the best solution
found, showing box plots of the median value over all runs.

est of all the algorithms as shown in Figure 9.6. Conversely, the con-
vergence plots for PNS, MCNS and MCPNS are substantially different,

167
gp based on ns for classification

displaying a clear tendency towards incremental improvement during

the training phase. However, their respective classification error on the
test data is a bit worse but not statistically significant.

9.2.2 Results: Multiclass Classification

In these tests, we use three problems (IM-3, SEG and M-L) with
different number of classes (3, 7 and 15). Moreover, the SEG problem
has 2,310 instances, which leads to a very large behavior descriptor;
i.e., using 70% of the data for training, we obtain a descriptor length
of 1,617 bits, which gives a very large behavior space. Assessing the
performance of the NS variants on this problem is of particular interest,
since previous works have shown that the performance of NS degrades
when behavior space is very large (Kistemaker and Whiteson, 2011).
The numerical comparison of the algorithms is given in Tables 9.8
and 9.9, the former shows the median test error and median population
size, while the latter shows the corresponding p-values of the statistical
tests. Similar to the binary case, all NS variants achieve basically the
same performance as OS, with slight improvements on some problems
(particularly M-L), but not statistically significant. Indeed the similar-
ity in terms of performance is even more evident when we analyze the
convergence plots for each problem shown in Figure 9.8, which shows
how the classification error evolves for the best solution found on the
training and testing sets. On the other hand, code growth is not con-
trolled like in the previous tests, in only one problem (M-L) the NS
variants produces statistically significant differences in terms of aver-
age program size.

9.2.3 Results: Analysis

The above results show that NS can be used to solve binary and mul-
ticlass classification problems, without a performance drop-off relative
to standard OS. This was not expected, given that the search process
omits the use of a standard objective function. Moreover, on some prob-

168
9.2 real-world classification experiments

Table 9.8: Multiclass classification performance, showing the median

classification error on the test data (Test) for the best solution found,
and the median of the average program size in the last generation (A-
size). Statistically significant with respect to the control method with a
p-value less than 0.05 is marked with an asterisk (*).

Dataset Measure OS NS MCNS PNS MCPNS

IM-3 Test 0.046 0.052 0.062 0.052 0.046
A-size 66.22 71.08 55.05 55.42 53.84
SEG Test 0.044 0.043 0.042 0.043 0.041
A-size 111.10 143.61 104.57 138.48 123.01
M-L Test 0.429 0.414 0.394 0.394 0.444
A-size 13.05 12.77∗ 12.82∗ 12.63∗ 12.69∗

Table 9.9: Resulting p-values of the Friedman’s test with Bonferroni-

Dunn correction, for the multiclass classification problems using OS as
the control method. The null hypothesis is rejected with a p-value less
than 0.05, marked with an asterisk (*).

Dataset Measure NS MCNS PNS MCPNS

IM-3 Test 1.73 0.57 2.78 2.12
A-size 1.09 2.86 0.58 1.86
SEG Test 2.86 1.41 0.28 1.79
A-size 0.11 2.86 1.86 2.86
M-L Test 1.79 0.37 1.86 3.36
A-size 0.00∗ 0.04∗ 0.04∗ 0.00∗

lems, mainly in binary classification tasks, the NS variants provide an

intrinsic bloat control property.
This section provides a deeper analysis of the experimental results.
Regarding the MC variants, namely MCNS and MCPNS, we show the
impact of the penalty assigned in Equation 4.12 when the MC is not
satisfied. Figure 9.9 plots the percentage of individuals that did not

169
gp based on ns for classification

0.8
0.4
0.08
Classification Error

Classification Error

Classification Error
0.3 0.6
0.06

0.2 0.4
0.04

0.02 0.1 0.2

0 0 0
0 50 100 0 50 100 0 50 100
Generations Generations Generations

(a) IM-3 Train (b) SEG Train (c) M-L Train

0.8
0.4
0.08
Classification Error

Classification Error

Classification Error
0.3 0.6
0.06

0.2 0.4
0.04

0.02 0.1 0.2

0 0 0
0 50 100 0 50 100 0 50 100
Generations Generations Generations

(d) IM-3 Test (e) SEG Test (f) M-L Test

OS NS MCNS ++++ PNS MCPNS

Figure 9.8: Convergence of training and testing error for multiclass

problems.

satisfy the MC at each generation, for all three multiclass problems.

For both algorithms, we can see very similar patterns on all problems.
At the beginning of the run most individuals do not satisfy the MC, the
number of rejected individuals then quickly declines and then more or
less stabilizes after about 20 to 40 generations.
Another important aspect is to consider the computational cost of
each NS method, relative to standard OS. Moreover, as stated before,
PNS is posed as an approximation of the original NS algorithm, work-
ing under a naive Bayesian assumption. PNS characterizes the com-
plete distribution of behaviors, but uses a possibly unsatisfiable as-
sumption to simplify the novelty measure. NS with a FIFO archive
computes a more realistic measure of sparseness but sacrifices histor-
ical information to keep computational overhead low. One possible
alternative would be to run NS without an archive limit, such that all
individuals generated by the search are included into the archive. How-

170
9.2 real-world classification experiments

100 100

80 80
Ind. Rejected (%)

Ind. Rejected (%)

60 60

40 40

20 20

20 40 60 80 100 20 40 60 80 100
Generations Generations
(a) MCNS (b) MCPNS

IM-3 M-L SEG

Figure 9.9: Plot showing the percentage of rejected individuals, with

respect to the population size, that did not satisfy the MC.

ever, computing the NS in this scenario would surely have a detrimen-

tal effect on computational efficiency. Figure 9.10(a) shows how the size
of the population archive grows when all individuals are stored during
NS runs on the IM-3 problem, we refer to this variant as NSn. We can
see that the size of the archive grows linearly, reaching very large values
by the end of the run.
Figure 9.10(b) plots the median speed-up of each method relative to
OS on the same problem, based on all runs. Values above unity repre-
sent a proportional reduction in total CPU time and values below unity
represent an increase in total CPU time. In this plot we compare NS,
PNS and MCNS. These experiments were carried out on a Windows 7
PC with an Intel Xeon e3-1220 v3 64-bit processor @ 3.10 Ghz and 8
GB of RAM. Results show the advantage of using PNS, with CPU run
times showing a slight speed-up relative to OS of about 10%. On the
other hand, NS and MCNS clearly pay the price of the more expensive
sparseness estimation method. It is correct to assume that NSn would
fair even worse, since it uses a much larger novelty archive.

171
gp based on ns for classification

4
x 10 1.2
10 NS
MCNS
PNS
9
1.1

Speed−up respect to OS
8

7 1
Archive Size

5 0.9

4
0.8
3

2
0.7
1

0.6
10 20 30 40 50 60 70 80 90 100 20 40 60 80 100
Generations Generations

(a) Archive Size (b) Relative Speed-up

Figure 9.10: Analysis of NS variants on the IM-3 problem: (a) Archive

size for NSn; (b) Relative speed-up for NS, PNS and MCNS.

Finally, to compare the selective pressure applied by each algo-

rithm, Figures 9.11 and 9.12 compare the relative ranking of each in-
dividual in the population for problems IM-3 and SEG; the plots for
problem M-L are omitted because they are very similar to those of IM-
3. The left column of plots in each figure (indices a,c and e) shows
the percentage of individuals in the population that are ranked as the
best solution of each generation (the top-ranked solutions). In Figures
9.11(a) and 9.12(a) the runs were guided by OS, in Figures 9.11(c) and
9.12(c) by NS, and in Figures 9.11(e) and 9.12(e) selective pressure was
applied by PNS. Conversely, the right column of plots in these figures
(indices b,d and f) shows the average ranking of the best individuals at
each generation based on the other two fitness measures.
For instance, Figure 9.11(a) shows that the number of top-ranked
individuals is relatively small in the first 25 generations using OS, af-
terwards the number of top-ranked individuals grows quickly and then
stabilizes and oscillates at around 20% of the total population. Notice
that this 25 generation threshold coincides with the moment at which
training error converges on this problem, as shown in Figure 9.8(a).

172
9.2 real-world classification experiments

Figure 9.11(b) takes these top-ranked individuals and computes new

ranks for them based on the NS and PNS novelty measures, and the
plot shows their average rank based on these methods. For NS, the plot
shows that the ranking provided by the sparseness measure disagrees
substantially with OS in the initial generations, reaching a peek at gen-
eration 20. Afterwards, the ranking of individuals by NS is basically
equivalent with that of OS, this corresponds with the similar conver-
gence plots of both algorithms. PNS, on the other hand, differs with
OS only at the very beginning of the run, afterwards the ranking is the
same across all generations. Figure 9.11(c) and Figure 9.11(d) show a
similar comparison, but in this case NS provides the selective pressure.
Notice that the percentage of top-ranked individuals is very similar to
OS, but in this case we can see larger disagreement in terms of ranking
by the other methods. In particular, we can see that OS shows large
disagreement at the beginning of the runs (up to generation 25), af-
terwards the differences start to reduce and are equal after generation
70. On the other hand, PNS ranking is more erratic, we can see that
in some generations the ranking of the best individuals is in complete
agreement with NS, while in other moments the average difference can
become quite high (around generation 60). Finally, when PNS applies
the selective pressure, Figure 9.11(e) and Figure 9.11(f), we can see a
different pattern. First, the percentage of top-ranked individuals is al-
ways small, suggesting that PNS does not converge to many similar
behaviors. Second, OS and NS mostly agree with PNS ranking after
the first initial generations, but surprisingly NS shows a larger ranking
difference than OS.
Figure 9.12 presents a similar analysis on the SEG problem. How-
ever, in this case we can see different behaviors. The percentage of top-
ranked individuals oscillates with OS, while NS and PNS exhibit simi-
lar patterns, with a small number of individuals achieving the top-rank
every generation. When selective pressure is applied by OS, the relative
ranking of NS and PNS is quite similar, with large differences at the be-
ginning of the run and then converging to similar rankings at about 40
generations, see Figures 9.12(b). Conversely, when NS and PNS apply
selective pressure, we can observe a larger disagreement by the other

173
gp based on ns for classification

two methods, shown in Figures 9.12(d) and 9.12(f). In particular, when

NS applies selective pressure we can see that OS disagrees with the
rankings throughout most of the run, with the differences progressively
declining. On the other hand, the disagreement between PNS and NS
seems more erratic, with total agreement in some generations and large
disagreements in others, as seen in Figure 9.12(d). Figure 9.12(f) shows
the relative differences in ranking when PNS applies selective pressure.
In this case, both OS and NS disagree with PNS throughout most of the
run, but both methods converge to similar rankings at the end of the
search.
In summary, these plots provide useful insights regarding the dif-
ferent selective pressure applied by each method. The results confirm
that the naive Bayesian assumption made by PNS does in fact lead to
differences in search dynamics relative to NS, this might partially ex-
plain the differences in average solution size by both methods in binary
classification tasks. However, these plots also show that while PNS and
NS are preferring different individuals during some portions of the run,
after a certain number of generations these differences start to decline.
This is a plausible explanation for the similar performance achieved by
all methods in terms of classification error.

9.3 Chapter Conclusions

This chapter presented for the first time an application of the NS ap-
proach to supervised classification with GP, with several contributions.
First, the concept of behavior space is framed as a conceptual middle-
ground between the well-known concept of objective space and the re-
cently popular semantic space in GP. Second, a domain-specific descrip-
tor has been proposed and tested on supervised classification tasks,
considering synthetic and real-world data as well as binary and multi-
class problems. The proposed descriptor is a binary vector, where each
element corresponds with each fitness case in the training set, taking a
value of 1 when that fitness case is correctly classified and a 0 value oth-
erwise. Third, two extensions to the basic NS approach have been de-

174
9.3 chapter conclusions

veloped, PNS and MCNS, as well as a hybrid method MCPNS. PNS pro-
vides a probabilistic framework to measure a solution’s novelty, elimi-
nating all of the underlying NS parameters while reducing the compu-
tational overhead that the original NS algorithm suffers from. On the
other hand, the proposed MCNS extends the minimal criteria approach
by combining the objective function with the sparseness measure, con-
straining the NS algorithm by specifying a minimal solution quality, a
dynamic criterion that is proportional to the quality of the best solution
found so far.
Experimental results are evaluated based on two measures, solution
quality and average size of all solutions in the population. In terms of
performance, results show that all NS variants achieve competitive re-
sults relative to the standard OS approach in GP. These results show
that the general open-ended approach towards evolution followed by
NS can compete with objective driven search in traditional machine
learning domains. On the other hand, in terms of solutions size and
the bloat phenomenon, the NS approach can lead the search towards
maintaining smaller program trees, particularly in the simpler binary
tasks. In particular, NS and MCNS show substantial reductions in pro-
gram size relative to OS.
Finally, a promising aspect of the present work is that several future
lines of research can be explored, in no particular order we contem-
plate the following. Firstly, there seems to be a possible link between
the PNS algorithm and two similar methods in evolutionary compu-
tation, estimation of distribution algorithms (EDAs) (Larrañaga and
Lozano, 2001), the frequency fitness assignment (FFA) method (Weise
et al., 2014), and fitness sharing methods (Nguyen et al., 2012). While
EDAs use a distribution over genotype space to generate new individ-
uals, PNS uses a distribution in behavior space to measure the novelty
of each solution. FFA favors solutions with unique objective scores, in-
stead of uniqueness in behavior space as done in PNS.
Nonetheless, many of the theoretical and practical insights derived
from EDA and FFA research might be brought to bear during fur-
ther development of the PNS approach, while further comparisons
with recent diversity preservation techniques might also be of interest

175
gp based on ns for classification

(Nguyen et al., 2012). Secondly, we might extend the proposed PNS

variants in other ways, such as testing PNS with real-valued behavior
descriptors or simply apply PNS within semantic space, similar to the
approach suggested in (Castelli et al., 2014). Fourthly, the effect that NS
has on bloat should be studied further, it is clear that standard NS and
MCNS provide the best bloat control, but is unclear why this effect was
not observed on the multiclass problems. Finally, the proposed algo-
rithms should be evaluated in other machine learning problems, such
as unsupervised clustering (Naredo and Trujillo, 2013) and symbolic
regression (Martı́nez et al., 2013).

176
9.3 chapter conclusions

30 500
Top ranked individuals (%)
450
25 400
350

Average rank
20
300
15 250
200
10
150
100
5
50

20 40 60 80 100 20 40 60 80 100
Generations Generations
(a) OS (b) Relative ranking