Design of Modern Heuristics
Series Editors
G. Rozenberg (Managing Editor)
[email protected]
Th. Bäck, J.N. Kok, H.P. Spaink
Leiden Center for Natural Computing
Leiden University
Niels Bohrweg 1
2333 CA Leiden, The Netherlands
A.E. Eiben
Vrije Universiteit Amsterdam
The Netherlands
This is the kind of book that all students interested in search and optimization should read. Almost half of the book is devoted to laying a foundation for understanding search and optimization for both heuristic methods and more traditional exact methods. This is really why the book is so valuable. It puts heuristic search in context, and this integrated view is important and is often lacking in other books on modern heuristic search. The book then goes on to provide an excellent tutorial-level discussion of heuristic methods such as evolutionary algorithms, variable neighborhood search, iterated local search and tabu search. The book is a valuable contribution to the field. Other books have tried to provide the same breadth by collecting together tutorials by multiple authors. But Prof. Rothlauf’s book is stronger because it provides an integrated and unified explanation of modern heuristic methods.
Darrell Whitley, Colorado State University,
chair of SIGEVO, former Editor-in-Chief of the journal Evolutionary Computation
The book by Franz Rothlauf is interesting in many ways. First, it goes much further than a simple description of the most important modern heuristics; it provides insight into the reasons that explain the success of some methods. Another attractive feature of the book is its thorough, yet concise, treatment of the complete scope of optimization methods, including techniques for continuous optimization; this allows readers with a limited background in optimization to gain a deeper appreciation of the modern heuristics that are the main topic of the book. Finally, the case studies presented provide a nice illustration of the application of modern heuristics to challenging and highly relevant problems.
Michel Gendreau, École Polytechnique de Montréal, former Vice-President of the International Federation of Operational Research Societies (IFORS) and of the Institute for Operations Research and the Management Sciences (INFORMS), Editor-in-Chief of the journal Transportation Science
Franz Rothlauf’s new book, Design of Modern Heuristics: Principles and Application, is a celebration of computer science at its best, combining a blend of mathematical analysis, empirical inquiry, conceptual modeling, and useful application.
The book on modern heuristic methods by Franz Rothlauf is very special as it has a very strong practical flavour – it teaches us how to design efficient and effective modern heuristics to solve a particular problem. This emphasis on design results in an in-depth discussion of topics such as: for which types of problems should we use modern heuristics, how can we select a modern heuristic that fits our problem well, what are the basic principles for the design of modern heuristics, and how can we use problem-specific knowledge in the design of modern heuristics? I highly recommend this book to the whole optimization research community and, in particular, to every practitioner who is interested in the applicability of modern heuristic methods.
Zbigniew Michalewicz, University of Adelaide,
author of “How to Solve It: Modern Heuristics”
Contents
1 Introduction

Part I Fundamentals

2 Optimization Problems
   2.1 Solution Process
      2.1.1 Recognizing Problems
      2.1.2 Defining Problems
      2.1.3 Constructing Models
      2.1.4 Solving Models
      2.1.5 Validating Solutions
      2.1.6 Implementing Solutions
   2.2 Problem Instances
   2.3 Search Spaces
      2.3.1 Metrics
      2.3.2 Neighborhoods
      2.3.3 Fitness Landscapes
      2.3.4 Optimal Solutions
   2.4 Properties of Optimization Problems
      2.4.1 Problem Difficulty
      2.4.2 Locality
      2.4.3 Decomposability

3 Optimization Methods
   3.1 Analytical and Numerical Optimization Methods
   3.2 Optimization Methods for Linear, Continuous Problems
      3.2.1 Linear Optimization Problems
      3.2.2 Simplex Method
      3.2.3 Simplex and Interior Point Methods
   3.3 Optimization Methods for Linear, Discrete Problems
      3.3.1 Integer Linear Problems

9 Summary

References
Nomenclature
Glossary
Index
Chapter 1
Introduction
1. The first part of the book lays a foundation for understanding search and optimization. It discusses relevant properties of optimization problems and gives an overview of both traditional exact methods and heuristic optimization methods.
2. The second part of the book teaches us how to design efficient and effective modern heuristics. It studies basic design elements (representation, search operator, fitness function, initialization, and search strategy), illustrates relevant design principles (locality and bias), and presents a coherent categorization of modern heuristics. We learn why locality is relevant for the design of representations and search operators and how we can exploit problem-specific knowledge by biasing the design elements.
3. The third part presents two case studies on the systematic design of modern heuristics. In the first one, we examine the locality of two approaches for automated programming and illustrate why high locality is important for the successful use of modern heuristics. The second case study is about the design of problem-specific modern heuristics for the optimal communication spanning tree problem. We demonstrate how to make use of problem-specific knowledge for the design of representations, search operators, and initial solutions. In particular, we exploit the bias of optimal solutions towards minimum spanning trees by introducing an analogous bias in the design elements of modern heuristics. The results emphasize that it is not the choice of a particular type of modern heuristic that is relevant for high performance, but an appropriate consideration of problem-specific knowledge.
In detail, the book is structured as follows: Chap. 2 illustrates the process of finding high-quality solutions and discusses relevant properties of optimization problems. We introduce the locality and decomposability of a problem and describe how they affect local and recombination-based search methods. Chapter 3 provides an overview of combinatorial optimization methods. Section 3.2 discusses relevant optimization methods for linear, continuous problems. Such problems are relatively “easy” as they can be solved in polynomial time. Common optimization methods are the Simplex method and interior point methods. Section 3.3 focuses on linear, discrete problems and describes the functionality of selected optimization methods like informed and uninformed search, branch-and-bound, dynamic programming, and cutting plane methods. These methods guarantee returning optimal solutions; however, for NP-hard problems, their effort grows exponentially with the problem size. Section 3.4 gives an overview of heuristic optimization methods. It distinguishes between simple construction and improvement heuristics, approximation algorithms, and modern heuristics. Modern heuristics differ from simple improvement heuristics, which only use intensifying elements, by their use of both intensifying and diversifying elements. Approximation algorithms are heuristics for which bounds on their worst-case performance exist. The chapter ends with the no-free-lunch theorem, which tells us that black-box optimization, where an algorithm does not make use of information learned about a particular problem, is not efficient, and that high-quality modern heuristics must be problem-specific.
Chapter 4 discusses common and fundamental design elements of modern heuristics, namely representation and search operator, fitness function, initialization, and search strategy. These design elements are relevant for all different types of modern heuristics, and understanding the general concepts behind them is a prerequisite for their systematic design. Representations are mappings that assign problem solutions (phenotypes) to, usually linear, strings (genotypes). Given the genotypes, we can
define a search space either by defining search operators or by formulating explicit
neighborhood relationships between solutions. Solutions that are created by a local
search operator are neighbors. Recombination operators generate offspring, where
the distances between offspring and parents are usually equal to or smaller than
the distance between parents. Therefore, the definition of search operators directly
implies a neighborhood structure on the search space. Designing a fitness function
and initialization method is usually easier than designing proper representations and
operators. The fitness function is determined by the objective function and allows
modern heuristics to perform pairwise comparisons between solutions. Initial solutions are usually created randomly if no a priori knowledge about the problem exists.
Chapter 5 presents the fifth design element, which is the concept for controlling the search. Search strategies differ in the design and control of intensification and diversification. Diversification steps randomize the search by performing large changes to solutions, but in return allow modern heuristics to escape from local optima. Diversification can be introduced into the search by a proper design of a representation or search operator, a fitness function, initial solutions, or explicit control of the search strategy. Consequently, we classify modern heuristics according to their diversification mechanisms and present representative examples of local search methods (variable neighborhood search, guided local search, iterated local search, simulated annealing, tabu search, and evolution strategies) and recombination-based search methods (genetic algorithms, estimation of distribution algorithms, and genetic programming).
Chapter 6 presents two general principles for the design of modern heuristics: locality and bias. Assuming that the vast majority of real-world problems are neither deceptive nor (very) difficult and have high locality, modern heuristics must ensure that their design does not destroy the high locality of a problem. Therefore, the search operators used should fit the search space, and representations should have high locality: this means similarities between phenotypes must correspond to similarities between genotypes. Second, the performance of modern heuristics can be increased by problem-specific knowledge. Consequently, we study how properties of high-quality solutions can be exploited by biasing representations, search operators, initial solutions, fitness functions, or search strategies.
Chapters 7 and 8 present two case studies on the design of modern heuristics using the design principles outlined in Chap. 6. In Chap. 7, we examine how the locality of a representation affects the performance of modern heuristics. We examine grammatical evolution, which is a variant of genetic programming using linear strings as genotypes instead of parse trees. We find that the locality of the representation used in grammatical evolution is lower than for genetic programming, which reduces the performance of local search approaches. In Chap. 8, we study how to consider problem-specific knowledge by biasing representations, search operators, or initial solutions. We find that optimal solutions for the optimal communication spanning tree problem are similar to the minimum spanning tree (MST). Consequently, biasing the representation, operator, or initial solutions such that MST-like solutions are preferred results in efficient and effective modern heuristics. The book ends with a summary of the main findings of this work.
Part I
Fundamentals
Chapter 2
Optimization Problems
Section 2.3 introduces search spaces and discusses metrics, neighborhoods, and the concept of a fitness landscape. Finally, Sect. 2.4 deals with properties of problems. We review complexity theory as a tool for formulating upper and lower bounds on problem difficulty. Furthermore, we study the locality and decomposability of a problem and their importance for local and recombination-based search, respectively.
Researchers, users, and organizations like companies or public institutions are confronted in their daily life with a large number of planning and optimization problems. In such problems, different decision alternatives exist, and a user or an organization has to select one of them. Selecting one of the available alternatives has some impact on the user or the organization, which can be measured by some kind of evaluation criteria. Evaluation criteria are selected such that they describe the (expected) impact of choosing one of the different decision alternatives. In optimization problems, users and organizations are interested in choosing the alternative that either maximizes or minimizes an evaluation function which is defined on the selected evaluation criteria.
Usually, users and organizations cannot freely choose from all available decision alternatives; rather, there are constraints that restrict the number of available alternatives. Common restrictions come from law, technical limitations, or interpersonal relations between humans. In summary, optimization problems have the following characteristics:
• Different decision alternatives are available.
• Additional constraints limit the number of available decision alternatives.
• Each decision alternative can have a different effect on the evaluation criteria.
• An evaluation function defined on the decision alternatives describes the effect of the different decision alternatives.
For optimization problems, a decision alternative should be chosen that considers all available constraints and maximizes/minimizes the evaluation function. For planning problems, a rational, goal-oriented planning process should be used that systematically selects one of the available decision alternatives. Therefore, planning describes the process of generating and comparing different courses of action and then choosing one prior to action.
2.1 Solution Process

Planning processes to solve planning or optimization problems have been of major interest in operations research (OR) (Taha, 2002; Hillier and Lieberman, 2002; Domschke and Drexl, 2005). Planning is viewed as a systematic, rational, and theory-guided process to analyze and solve planning and optimization problems. The planning process consists of several steps:
1. recognizing the problem,
2. defining the problem,
3. constructing a model for the problem,
4. solving the model,
5. validating the solution, and
6. implementing the solution.
2.1.1 Recognizing Problems
In the very first step, it must be recognized that there is a planning or optimization problem. This is probably the most difficult step, as users or institutions quickly get used to their current way of doing business. They appreciate the current situation and are not aware that there are many different ways to do their business or to organize a task. Users or institutions are often not aware that there might be more than one alternative to choose from.
The first step in problem recognition is that users or institutions become aware that there are different alternatives (for example, using a new technology or organizing the current business in a different way). Such an analysis of the existing situation often occurs as a result of external pressure or changes in the environment. If everything goes well, users and companies do not question the currently chosen decision alternatives. However, when running into economic problems (for example, accumulating losses or losing market share), companies have to think about re-structuring their processes or re-shaping their businesses. Usually, a re-design of business processes is done with respect to some goals. Designing the proper (optimal) structure of the business processes is an optimization problem.
A problem has been recognized if users or institutions have realized that there are other alternatives and that selecting from these alternatives affects their business. Often, problem recognition is the most difficult step, as users or institutions have to abandon the current way of doing business and accept that there are other (and perhaps better) ways.
2.1.2 Defining Problems

After we have identified a problem, we can describe and define it. For this purpose, we must formulate the different decision alternatives, study whether there are any additional constraints that must be considered, select evaluation criteria which are affected by choosing different alternatives, and determine the goals of the planning process. Usually, there is not only one possible goal, but we have to choose from a variety of different goals. Possible goals of a planning or optimization process are either to find an optimal solution for the problem or to find a solution that is better than some predefined threshold (for example, the current solution).
An important aspect of problem definition is the selection of relevant decision alternatives. There is a trade-off between the number of decision alternatives and
the difficulty of the resulting problem. The more decision alternatives we have to consider, the more difficult it is to choose a proper alternative. In principle, we can consider all possible decision alternatives (independently of whether they are relevant for the problem or not) and try to solve the resulting optimization problem. However, since such problems cannot be solved in a reasonable way, usually only those decision alternatives are considered that are relevant and affect the evaluation criteria. All aspects that have no direct impact on the goal of the planning process are neglected. Therefore, we have to focus on carefully selected parts of the overall problem and find the right level of abstraction.
It is important to define the problem broadly enough to ensure that solving it yields some benefit, and narrowly enough that it can still be solved. The resulting problem definition is often a simplified description of the original problem.
2.1.3 Constructing Models

In this step, we construct a model of the problem which represents its essence. Therefore, a model is a (usually simplified) representative of the real world. Mathematical models describe reality by extracting the most relevant relationships and properties of a problem and formulating them using mathematical symbols and expressions. Therefore, when constructing a model, there are always aspects of reality
that are idealized or neglected. We want to give an example. In classical mechanics, the kinetic energy E of a moving object can be calculated as E = ½mv², where m is the object’s mass and v its velocity. This model describes the energy of an object well if v is much smaller than the speed of light c (v ≪ c), but it becomes inaccurate for v → c. Then, other models based on the special theory of relativity are necessary.
This example illustrates that the model used is always only a simplified representation of the real world.
When formulating a model for an optimization problem, the different decision alternatives are usually described by using a set of decision variables {x1 , . . . , xn }. The use of decision variables allows modeling of the different alternatives that can be chosen. For example, if somebody can choose between two decision alternatives, a possible decision variable would be x ∈ {0, 1}, where x = 0 represents the first alternative and x = 1 the second one. Usually, more than one decision variable is used to model different decision alternatives (for choosing proper decision variables, see Sect. 2.3.2). Restrictions that hold for the different decision variables can be expressed by constraints. Representative examples are relationships
between different decision variables (e.g., x1 + x2 ≤ 2). The objective function assigns an objective value to each possible decision alternative and measures the quality of the different alternatives (e.g., f(x) = 2x1 + 4x2²). One possible decision alternative, which is represented by specific values of the decision variables, is called a solution of the problem.
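To make these ingredients concrete, the following sketch encodes the illustrative expressions above (the constraint x1 + x2 ≤ 2 and the objective f(x) = 2x1 + 4x2²) in Python; the binary domains and the choice of maximization are assumptions made only for the sake of a small runnable example.

    # Minimal sketch of an optimization model in code: decision variables,
    # a constraint, and an objective function (expressions from the text).
    # The binary domains and the choice of maximization are assumptions.
    from itertools import product

    def is_feasible(x1, x2):
        # constraint from the text: x1 + x2 <= 2
        return x1 + x2 <= 2

    def objective(x1, x2):
        # objective from the text: f(x) = 2*x1 + 4*x2^2
        return 2 * x1 + 4 * x2 ** 2

    # Enumerate the search space {0, 1}^2 and keep the feasible solutions.
    feasible = [x for x in product((0, 1), repeat=2) if is_feasible(*x)]
    best = max(feasible, key=lambda x: objective(*x))
    print(best, objective(*best))  # (1, 1) 6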
To construct a model with an appropriate level of abstraction is a difficult task
(Schneeweiß, 2003). Often, we start with a realistic but unsolvable problem model
and then iteratively simplify the model until it can be solved by existing optimization
methods. There is a basic trade-off between the ability of optimization methods to
solve a model (tractability) and the similarity between the model and the underlying
real-world problem (validity). A step-wise simplification of a model by iteratively
neglecting some properties of the real-world problem makes the model easier to
solve and more tractable but reduces the relevance of the model.
Often, a model is chosen such that it can be solved using existing optimization approaches. This especially holds for classical optimization methods like the Simplex method or branch-and-bound techniques, which guarantee finding the optimal solution. In contrast, the use of modern heuristics allows us to reduce the gap between reality and model and to solve more relevant problem models. However, we have to pay a price: such methods often find good solutions, but we have no guarantee that the solutions found are optimal.
Two other relevant aspects of model construction are the availability of relevant data and the testing of the resulting model. For most problems, it is not sufficient to describe the decision variables, the relationships between the decision variables, and the structure of the evaluation function; additional parameters are necessary. These parameters are often not easily accessible and have to be determined by using simulation and other predictive techniques. An example problem is assigning jobs to different agents. Relevant for the objective value of an assignment is the order of the jobs. To be able to compare the durations of different assignments (each specific assignment is a possible decision alternative), parameters like the duration of one work step, the time that is necessary to transfer a job to a different agent, or the setup times of the different agents are relevant. These additional parameters can be determined by analyzing or simulating real-world processes.
Finally, a model is available which should be a representative of the real problem but is usually idealized and simplified in comparison to the real problem. Before continuing with this model, we must ensure that the model is a valid representative of the real world and really represents what we originally wanted to model. A proper criterion for judging the correctness of a model is whether different decision alternatives are modeled with sufficient accuracy and lead to the expected results. Often, the relevance of a model is evaluated by examining the relative differences of the objective values resulting from different decision alternatives.
2.1.4 Solving Models

After we have defined a model of the original problem, the model can be solved by some kind of algorithm (usually an optimization algorithm). An algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task. An algorithm starts in an initial state and terminates in a defined end-state. The concept of an algorithm was formalized by Turing (1936) and Church (1936) and is at the core of computers and computer science. In optimization, the goal of an algorithm
is to find a solution (either specific values for the decision variables or one specific
decision alternative) with minimal or maximal evaluation value.
Practitioners sometimes view solving a model as simple, as the outcome of the
model construction step is already a model that can be solved by some kind of
optimization method. Often, they are not aware that the effort to solve a model can
be high and only small problem instances can be solved with reasonable effort. They
believe that solving a model is just applying a black-box optimization method to the
problem at hand. An algorithm is called a black-box algorithm if it can be used
without any further problem-specific adjustments.
However, we have a trade-off between tractability and specificity of optimization
methods. If optimization methods are to perform well for the problem at hand, they
usually need to be adapted to the problem. This is typical for modern heuristics but
also holds for classical optimization methods like branch-and-bound approaches.
Modern heuristics can easily be applied to problems that are very realistic and near
to real-world problems but usually do not guarantee finding an optimal solution.
Modern heuristics should not be applied out of the box as black-box optimization
algorithms but adapted to the problem at hand. To design high-quality heuristics is
an art as they are problem-specific and exploit properties of the model.
Comparing classical OR methods like the Simplex method with modern heuristics reveals that for classical methods constructing a valid model of the real problem is demanding and needs the designer’s intuition. Model solution is simple, as existing algorithms can be used which yield optimal solutions (Ackoff, 1973). The situation is different for modern heuristics, where formulating a model is often a relatively simple step, as modern heuristics can also be applied to models that are close to the real world. However, model solution is difficult, as standard variants of modern heuristics usually show limited performance and only problem-specific and model-specific variants yield high-quality solutions (Droste and Wiesmann, 2002; Puchta and Gottlieb, 2002; Bonissone et al, 2006).
2.2 Problem Instances

We have seen in the previous section how the construction of a model is embedded in the solution process. When building a model, we can represent different decision alternatives using a vector x = (x1 , . . . , xn ) of n decision variables. We denote an assignment of specific values to x as a solution. All solutions together form a set X of solutions, where x ∈ X.
2.3 Search Spaces
2.3.1 Metrics
A metric d : X × X → R assigns a distance to any pair of solutions and satisfies the following conditions:

d(x, y) ≥ 0,
d(x, x) = 0,
d(x, y) = d(y, x),
d(x, z) ≤ d(x, y) + d(y, z),
where x, y, z ∈ X.
On the 2-dimensional search space R^2, an example is the city-block metric

d(x, y) := |x1 − y1| + |x2 − y2|,   (2.4)

where x = (x1, x2) and y = (y1, y2). It is named the city-block metric as it describes the distance between two points on a 2-dimensional plane in a city like Manhattan or Mannheim with a rectangular ground plan. On n-dimensional search spaces R^n, the city-block metric becomes

d(x, y) := ∑_{i=1}^{n} |xi − yi|,   (2.5)

where x, y ∈ R^n.
Another example of a metric that can be defined on R^n is the Euclidean metric. In Euclidean spaces, a solution x = (x1 , . . . , xn ) is a vector of continuous values (xi ∈ R). The Euclidean distance between two solutions x and y is defined as

d(x, y) := √( ∑_{i=1}^{n} (xi − yi)² ).   (2.6)

For n = 1, the Euclidean metric coincides with the city-block metric. For n = 2, we have a standard 2-dimensional search space, and the distance between two elements x, y ∈ R^2 is just the length of the direct line between the two points on a 2-dimensional plane.
If we assume that we have a binary space (x ∈ {0, 1}^n), a commonly used metric is the binary Hamming metric (Hamming, 1980)

d(x, y) = ∑_{i=1}^{n} |xi − yi|,   (2.7)

where d(x, y) ∈ {0, . . . , n}. The binary Hamming distance between two binary vectors x and y of length n is just the number of binary decision variables on which x and y differ. It can be extended to continuous and discrete decision variables:

d(x, y) = ∑_{i=1}^{n} zi,   (2.8)

where zi = 0 for xi = yi and zi = 1 for xi ≠ yi. In general, the Hamming distance measures the number of decision variables on which x and y differ.
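The metrics (2.5)–(2.8) translate directly into code. The sketch below is a plain transcription, assuming solutions are given as equal-length tuples of numbers:

    import math

    def cityblock(x, y):
        # city-block (Manhattan) metric, Eq. (2.5)
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def euclidean(x, y):
        # Euclidean metric, Eq. (2.6)
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def hamming(x, y):
        # generalized Hamming metric, Eq. (2.8): number of differing variables
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    # On binary vectors, the Hamming metric (2.7) coincides with the
    # city-block metric, since |xi - yi| is either 0 or 1:
    assert hamming((0, 1, 1), (1, 1, 0)) == cityblock((0, 1, 1), (1, 1, 0)) == 2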
2.3.2 Neighborhoods
A neighborhood is a mapping

N : X → 2^X,   (2.9)

where X is the search space containing all possible solutions to the problem. 2^X stands for the set of all possible subsets of X, and N is a mapping that assigns to each element x ∈ X a set of elements y ∈ X. A neighborhood definition can thus be viewed as a mapping that assigns to each solution x ∈ X a set of solutions y that are neighbors of x. Usually, the neighborhood N(x) defines a set of solutions y which are in some sense similar to x.
The definition of a topological space (X, T ) already defines a neighborhood, as it introduces an abstract structure of space on the set X. Given a topological space (X, T ), a subset N of X is a neighborhood of a point x ∈ X if N contains an open set U ∈ T containing the point x. We want to give examples. For the trivial topology ({a, b, c}, {{}, {a, b, c}}), the points in the search space cannot be distinguished by topological means, and either all or no points are neighbors of each other. For ({a, b, c}, {{}, {a}, {a, b}, {a, b, c}}), the points a and b are neighbors.
Many optimization models use metric search spaces. A metric search space is a topological space where a metric between the elements of the set X is defined. Therefore, we can define similarities between solutions based on the distance d.
Given a metric search space, we can use balls to define a neighborhood. For x ∈ X, an (open) ball around x of radius ε is defined as the set

Bε(x) := {y ∈ X | d(x, y) < ε}.
The ε-neighborhood of a point x ∈ X is the open set consisting of all points whose distance from x is less than ε. This means that all solutions y ∈ X whose distance d from x is lower than ε are neighbors of x. By using balls, we can define a neighborhood function N(x), which defines for each x a set of solutions similar to x.
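A possible implementation of such a ball-based neighborhood function, reusing the cityblock helper from the sketch above; the point set and the value of ε are arbitrary illustrative choices:

    def ball_neighborhood(x, X, d, eps):
        # open ball around x: all y in X with d(x, y) < eps, excluding x itself
        return [y for y in X if y != x and d(x, y) < eps]

    # Example on a few points of R^2 with the city-block metric:
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
    print(ball_neighborhood((0, 0), points, cityblock, eps=1.5))
    # -> [(0, 1), (1, 0)]; (1, 1) is at distance 2 and therefore no neighbor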
Figure 2.1 illustrates the definition of a neighborhood in the 2-dimensional continuous search space R^2 for Euclidean distances (Fig. 2.1(a)) and Manhattan distances (Fig. 2.1(b)). Using an open ball, all solutions y with d(x, y) < ε are neighbors of x. For Euclidean distances, we use d(x, y) := √((x1 − y1)² + (x2 − y2)²), and the neighbors of x are all solutions inside a circle around x with radius ε. For city-block distances, we use d(x, y) := |x1 − y1| + |x2 − y2|, and all solutions inside a rhombus with the vertices (x1 − ε, x2), (x1, x2 + ε), (x1 + ε, x2), and (x1, x2 − ε) are neighbors of x.
It is problematic to apply metric search spaces to problems where no meaningful similarities between different decision alternatives exist or can be defined. For such problems, the only option is to define a trivial topology (see the trivial topology above), which assumes that all solutions are neighbors and no meaningful structure on the search space exists. However, practitioners (as well as users) are used to metric search spaces and often seek to apply them also to problems where no meaningful similarities between different decision alternatives can be defined. This is a mistake, as a metric search space does not model such a problem in an appropriate way.
We want to give an example. We assume a search space containing four different fruits (apple (a), banana (b), pear (p), and orange (o)). This search space forms a trivial topology ({a, b, p, o}, {{}, {a, b, p, o}}) as no meaningful distances between the four fruits exist and all solutions are neighbors of each other. Nevertheless, we can define a metric search space X = {0, 1}^2 for the problem. Each solution ((0, 0), (0, 1), (1, 0), and (1, 1)) represents one fruit. Although the original problem defines no similarities, the use of a metric space implies that the solution (0, 0) is more similar to (0, 1) than to (1, 1) (using the Hamming distance (2.7)). Therefore, a metric space is inappropriate for this problem definition as it defines similarities where none exist. A more appropriate model would be x ∈ {0, . . . , 3} using the Hamming distance (2.8). Then, all distances between the different solutions are equal and all solutions are neighbors.
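The following snippet, reusing the hamming helper from above, makes the difference between the two models visible; the assignment of fruits to codewords is one arbitrary choice among several:

    # Binary encoding of the four fruits: the metric space imposes
    # similarities that do not exist in the original problem.
    binary = {'apple': (0, 0), 'banana': (0, 1), 'pear': (1, 0), 'orange': (1, 1)}
    print(hamming(binary['apple'], binary['banana']))   # 1
    print(hamming(binary['apple'], binary['orange']))   # 2 -- spurious structure

    # One decision variable with four values and the Hamming metric (2.8):
    # all pairwise distances are equal, matching the original problem.
    single = {'apple': (0,), 'banana': (1,), 'pear': (2,), 'orange': (3,)}
    print(hamming(single['apple'], single['banana']))   # 1
    print(hamming(single['apple'], single['orange']))   # 1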
A different problem can occur if the metric used is not appropriate for the problem and the existing “similarities” between different decision alternatives do not fit the similarities between different solutions described by the model. The metric defined for the problem model is a result of the choice of the decision variables. Any choice of decision variables {x1 , . . . , xn } allows the definition of a metric space and, thus, defines similarities between different solutions. However, if the metric induced by the use of the decision variables does not fit the metric of the problem description, the problem model is inappropriate.
Table 2.1 illustrates this situation. We assume that there are s = 9 different decision alternatives {a, b, c, d, e, f, g, h, i}. We assume that the decision alternatives form a metric space (using the Hamming metric (2.8)) where the distances between all elements are equal. Therefore, all decision alternatives are neighbors (for ε > 1). In the first problem model (model 1), we use the metric space X = {0, 1, 2}^2 and the Hamming metric (2.8). Each decision alternative is represented by (x1, x2) with xi ∈ {0, 1, 2}. For the Hamming metric, each solution has four neighbors. For example, decision alternative (1, 1) is a neighbor of (1, 2) but not of (2, 2). Model 1 thus results in a different neighborhood than the original decision alternatives. Model 2 is an example of a different metric space. In this model, we use binary variables xij ∈ {0, 1} with i ∈ {1, 2} and j ∈ {1, 2, 3}, and a solution is the vector x = (xij) subject to the additional restriction ∑_j xij = 1 for each i. Again, the Hamming distance (2.8) can be used. For ε = 1.1, no neighboring solutions exist. For ε = 2.1, each solution has four neighbors. We see that different models for the same problem result in different neighborhoods which do not necessarily coincide with the neighborhoods of the original problem.
The examples illustrate that selecting an appropriate model is important. For the same problem, different models are possible, and they can result in different neighborhoods. We must select the model such that the metric induced by the model fits the metric that exists for the decision alternatives well. Although no neighborhood needs to be defined in the problem description, a notion of neighborhood usually exists. Users who formulate a model description often know which decision alternatives are similar to each other, as they have a feeling for which decision alternatives result in the same outcome. When constructing the model, we must ensure that the neighborhood induced by the model fits the (often intuitive) neighborhood formulated by the user well.
Relevant aspects which determine the resulting neighborhood of a model are the type and number of decision variables. The types of decision variables should be determined by the properties of the decision alternatives. If decision alternatives are continuous (for example, choosing the right amount of crushed ice for a drink), the use of continuous decision variables in the model is useful and discrete decision variables should not be used. Analogously, for discrete decision alternatives, discrete decision variables and combinatorial models should be preferred. For example, the number of ice cubes in a drink should be modelled using integers and not continuous variables.
The number of decision variables used in the model also affects the resulting
neighborhood. For discrete models, there are two extremes: first, we can model s
different decision alternatives by using only one decision variable that can take s
different values. Second, we can use l = log2(s) binary decision variables xi ∈ {0, 1} (i ∈ {1, . . . , l}). If we use the Hamming distance (2.8) and define neighboring solutions by d(x, y) ≤ 1, then all possible solutions are neighbors if we use only one decision variable. In contrast, each solution x ∈ {0, 1}^l has only l neighbors y ∈ {0, 1}^l with d(x, y) ≤ 1. We see that using different numbers of decision variables for modeling the decision alternatives results in completely different neighborhoods. In general, a high-quality model is a model where the neighborhoods defined in the model fit the neighborhoods that exist in the problem well.
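This effect can be checked directly. The sketch below, again reusing the hamming helper, uses s = 8 decision alternatives as an arbitrary example:

    from itertools import product

    def neighbors(x, X, d, radius=1):
        # all y in X with 0 < d(x, y) <= radius
        return [y for y in X if y != x and d(x, y) <= radius]

    s = 8                                        # number of decision alternatives
    X_single = [(v,) for v in range(s)]          # one variable with s values
    X_binary = list(product((0, 1), repeat=3))   # l = log2(8) = 3 binary variables

    # Under the Hamming metric with d(x, y) <= 1, every solution is a neighbor
    # of every other one in the single-variable model, while each binary
    # string has only l = 3 neighbors:
    print(len(neighbors((0,), X_single, hamming)))       # 7
    print(len(neighbors((0, 0, 0), X_binary, hamming)))  # 3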
2.3.3 Fitness Landscapes

For combinatorial search spaces where a metric is defined, we can introduce the concept of a fitness landscape (Wright, 1932). A fitness landscape (X, f , d) of a problem
instance consists of a set of solutions x ∈ X, an objective function f that measures
the quality of each solution, and a distance measure d. Figure 2.2 is an example of
a one-dimensional fitness landscape.
We denote dmin = min{d(x, y) | x, y ∈ X, x ≠ y} as the minimum distance between any two elements x and y of a search space. Two solutions x and y are denoted as neighbors if d(x, y) = dmin . Often, d can be normalized to dmin = 1. A
fitness landscape can be described using a graph GL with a vertex set V = X and an
edge set E = {(x, y) ∈ X × X | d(x, y) = dmin } (Reeves, 1999a; Merz and Freisleben,
2000b). The objective function assigns an objective value to each vertex. We assume that each solution has at least one neighbor and that the resulting graph is connected. Therefore, an edge exists between neighboring solutions. The distance between two solutions x, y ∈ X is proportional to the number of nodes on the path of minimal length between x and y in the graph GL . The maximum distance dmax = max{d(x, y) | x, y ∈ X} between any two solutions is called the diameter diam GL of the landscape.
We want to give an example: we use the search space defined by model 1 in Table 2.1 and the Hamming distance (2.8). Then, all solutions where only one decision variable differs are neighboring solutions (d(x, y) = dmin ), and the maximum distance is dmax = 2. More details on fitness landscapes can be found in Reeves and Rowe (2003, Chap. 9) or Deb et al (1997).
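A minimal sketch of the landscape graph GL as a plain adjacency dictionary, reusing the hamming helper and the model 1 example just given; no graph library is assumed:

    from itertools import product

    def landscape_graph(X, d):
        # connect solutions at minimum distance d_min; returns an adjacency dict
        d_min = min(d(x, y) for x in X for y in X if x != y)
        return {x: [y for y in X if y != x and d(x, y) == d_min] for x in X}

    # Model 1 from Table 2.1: X = {0, 1, 2}^2 with the Hamming metric (2.8).
    X = list(product((0, 1, 2), repeat=2))
    G = landscape_graph(X, hamming)
    print(sorted(G[(1, 1)]))  # [(0, 1), (1, 0), (1, 2), (2, 1)] -- four neighbors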
2.3.4 Optimal Solutions

Figure 2.2 illustrates the differences between locally and globally optimal solutions and shows how local optima depend on the definition of N. We have a one-dimensional minimization problem with x ∈ [0, 1] ⊂ R. We assume an objective function f that assigns objective values to all x ∈ X. Independently of the
neighborhood used, u is always the globally optimal solution. If we use the 1-dimensional Euclidean distance (2.6) as metric and define a neighborhood around x as N(x) = {y | y ∈ X and d(x, y) ≤ ε}, the solution v is a locally optimal solution if ε < d1 . Analogously, w is locally optimal for all neighborhoods with ε < d2 . For ε ≥ d2 , the only locally optimal solution is the globally optimal solution u.
The modality of a problem describes the number of local optima in the problem. Unimodal problems have only one local optimum (which is also the global optimum), whereas multi-modal problems have multiple local optima. In general, multi-modal problems are more difficult for guided search methods to solve than unimodal problems.
2.4.1 Problem Difficulty

The complexity of an algorithm is the effort (usually time or memory) that is necessary to solve a particular problem. The effort depends on the input size, which is
equal to the size n of the problem to be solved. The difficulty or complexity of a
problem is the lowest possible effort that is necessary to solve the problem.
Therefore, problem difficulty is closely related to the complexity of algorithms.
Based on the complexity of algorithms, we are able to find upper and lower bounds
on the problem difficulty. If we know that an algorithm can solve a problem, we
automatically have an upper bound on the difficulty of the problem, which is just the
complexity of the algorithm. For example, we study the problem of finding a friend’s
telephone number in the telephone book. The most straightforward approach is to
search through the whole book starting from “A”. The effort for doing this increases
linearly with the number of names in the book. Therefore, we have an upper bound
on the difficulty of the problem (the problem has at most linear complexity), as we know a linear algorithm that can solve it. A more efficient way to solve this problem is bisection or binary search, which iteratively splits the entries of the book into halves. With n entries, we need only log(n) search steps to find the number. So we have a new, improved upper bound on the problem difficulty.
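A standard implementation of this binary search over a sorted list, matching the O(log n) argument above; the tiny name list is, of course, only illustrative:

    def binary_search(sorted_names, target):
        # search an ordered list with O(log n) comparisons
        lo, hi = 0, len(sorted_names) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if sorted_names[mid] == target:
                return mid
            if sorted_names[mid] < target:
                lo = mid + 1          # target lies in the upper half
            else:
                hi = mid - 1          # target lies in the lower half
        return -1                     # not in the book

    book = ['Ada', 'Bob', 'Eve', 'Mia', 'Tom', 'Zoe']
    print(binary_search(book, 'Mia'))  # 3, found after at most ~log2(6) steps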
Finding lower bounds on the problem difficulty is more difficult, as we have to show that no algorithm exists that needs less effort to solve the problem. Our problem of finding a friend’s name in a telephone book is equivalent to the problem of searching an ordered list. Binary search, which searches by iteratively splitting the list into halves, is optimal, and there is no method with lower effort (Knuth, 1998). Therefore, we have a lower bound, and no algorithm needs fewer than log(n) steps to find an address in a phone book with n entries. A problem is called closed if the upper and lower bounds on its problem difficulty are identical. Consequently, the problem of searching an ordered list is closed.
This section illustrates how bounds on the difficulty of problems can be derived
by studying the effort of optimization algorithms that are used to solve the problems.
As a result, we are able to classify problems as easy or difficult with respect to the
performance of the best-performing algorithm that can solve the problem.
The following paragraphs give an overview of the Landau notation which is an
instrument for formulating upper and lower bounds on the effort of optimization
algorithms. Thus, we can also use it for describing problem difficulty. Then, we
illustrate that each optimization problem can also be modeled as a decision problem
of the same difficulty. Finally, we illustrate different complexity classes (P, NP, NP-
hard, and NP-complete) and discuss the tractability of decision and optimization
problems.
Landau Notation

The Landau notation (which was introduced by Bachmann (1894) and made popular
by the work of Landau (1974)) can be used to compare the asymptotic growth of
functions and is helpful when measuring the complexity of problems or algorithms.
It allows us to formulate asymptotic upper and lower bounds on function values.
For example, Landau notation can be used to determine the minimal amount of
memory or time that is necessary to solve a specific problem. With n ∈ N, c ∈ R,
and f , g : N → R the following bounds can be described using the Landau symbols:
• asymptotic upper bound (“big O notation”): f ∈ O(g) ⇔ ∃c > 0 ∃n0 > 0 ∀n ≥ n0 : |f(n)| ≤ c|g(n)|; f is dominated by g.
• asymptotically negligible (“little o notation”): f ∈ o(g) ⇔ ∀c > 0 ∃n0 > 0 ∀n ≥ n0 : |f(n)| < c|g(n)|; f grows slower than g.
• asymptotic lower bound: f ∈ Ω(g) ⇔ g ∈ O(f); f grows at least as fast as g.
• asymptotically dominant: f ∈ ω(g) ⇔ g ∈ o(f); f grows faster than g.
To derive upper and lower bounds on problem difficulty, we need a formal description of the problem. Thus, at least the solutions x ∈ X and the objective function f : X → R must be defined. The search space can be very trivial (e.g., have a trivial topology), as the definition of a neighborhood structure is not necessary. Developing bounds is difficult for problems where the objective function does not systematically assign objective values to each solution. Although describing X and f is sufficient for formulating a problem model, in most problems of practical relevance the size |X| of the search space is large and, thus, the direct assignment of objective values to each possible solution is often very time-consuming and not appropriate for building an optimization model that should be solved by a computer.
To overcome this problem, we can define each optimization problem implicitly
using two algorithms AX and A f , and two sets of parameters SX and S f . Given a
set of parameters SX , the algorithm AX (x, SX ) decides whether the solution x is an
element of X, i.e. whether x is a feasible solution. Given a set of parameters S f ,
the algorithm A f (x, S f ) calculates the objective value of a solution x. Therefore, the optimization version of an optimization problem can be defined as:
Given are the two algorithms AX and A f and representations of the parameters SX and S f . The goal is to find a feasible solution x∗ such that A f (x∗ , S f ) is minimal (or maximal).
This formulation of an optimization problem is equivalent to (2.3). The algorithm AX checks whether all constraints are met and A f calculates the objective
value of x. Analogously, the evaluation version of an optimization problem can be
defined as:
Given are the two algorithms AX and A f and representations of the parameters SX and S f .
The goal is to find the objective value of the optimal solution.
The first two versions are problems where an optimal solution or its objective value has to be found, whereas in the decision version of an optimization problem a question has to be answered either by yes or no: given the two algorithms AX and A f , representations of the parameters SX and S f , and a bound L, does a feasible solution x with f (x) ≤ L exist? We denote feasible solutions x whose objective value f (x) ≤ L as yes-solutions. We can solve the decision version of the optimization problem by solving the original optimization problem, calculating the objective value f (x∗ ) of the optimal solution x∗ , and deciding whether f (x∗ ) ≤ L. Therefore, the difficulty of the three versions is roughly the same if we assume that f (x∗ ), respectively A f (x∗ , S f ), is easy to compute (Papadimitriou and Steiglitz, 1982, Chap. 15.2). If this is the case, all three versions are equivalent.
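The relationship between the three versions can be sketched in a few lines of code. The feasibility check AX, the objective A f, and the use of minimization below are invented toy examples, not a problem taken from the text:

    from itertools import product

    def A_X(x):
        # feasibility check (the constraint is an invented toy example)
        return sum(x) <= 2

    def A_f(x):
        # objective value of a solution (invented toy example)
        return 3 * x[0] + 2 * x[1] + x[2]

    X = [x for x in product((0, 1), repeat=3) if A_X(x)]

    x_star = min(X, key=A_f)      # optimization version: find the best solution
    f_star = A_f(x_star)          # evaluation version: its objective value

    def decision_version(L):
        # decision version: does a feasible x with f(x) <= L exist?
        return f_star <= L

    print(x_star, f_star, decision_version(0))  # (0, 0, 0) 0 True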
We may ask why we would want to formulate an optimization problem as a decision problem, which is much less intuitive. The reason is that in computational complexity theory, many statements on problem difficulty are formulated for decision (and not optimization) problems. By formulating optimization problems as decision problems, we can apply all these results to optimization problems. Therefore, complexity classes, which categorize decision problems into classes of different difficulty (compare the following paragraphs), can also be used for categorizing optimization problems. For more information on the differences between optimization,
evaluation, and decision versions of an optimization problem, we refer the interested
reader to Papadimitriou and Steiglitz (1982, Chap. 15) or Harel and Rosner (1992).
Computational complexity theory (Hartmanis and Stearns, 1965; Cook, 1971; Garey and Johnson, 1979; Papadimitriou and Yannakakis, 1991; Papadimitriou, 1994; Arora and Barak, 2009) allows us to categorize decision problems into different
groups based on their difficulty. The difficulty of a problem is defined with respect
to the amount of computational resources that are at least necessary to solve the
problem.
In general, the effort (amount of computational resources) that is necessary to
solve an optimization problem of size n is determined by its time and space com-
plexity. Time complexity describes how many iterations or search steps are necessary to solve a problem. Problems are more difficult if more time is necessary. Space complexity describes the amount of space (usually memory on a computer) that is necessary to solve a problem. As for time, problem difficulty increases
with higher space complexity. Usually, time and space complexity depend on the
input size n and we can use the Landau notation to describe upper and lower bounds
on them. A complexity class is a set of computational problems where the amount
of computational resources that are necessary to solve the problem shows the same
asymptotic behavior. For all problems that are contained in one complexity class,
we can give bounds on the computational complexity (in general, time and space
complexity). Usually, the bounds depend on the size n of the problem, which is also
called its input size. Usually, n is much smaller than the size |X| of the search space.
Typical bounds are asymptotic lower or upper bounds on the time that is necessary
to solve a particular problem.
Complexity Class P
The complexity class P (P stands for polynomial) is defined as the set of decision problems that can be solved by an algorithm with worst-case polynomial time complexity. The time that is necessary to solve a decision problem in P is asymptotically bounded (for n > n0 ) by a polynomial function O(n^k). For all problems in P, an algorithm exists that can solve any instance of the problem in time that is O(n^k), for some constant k. Therefore, all problems in P can be solved efficiently in the worst case. As we showed in the previous section, all optimization problems can be formulated as decision problems, so the class P can also be used to categorize optimization problems.
Complexity Class NP
The class NP (which stands for non-deterministic polynomial time) describes the set of decision problems where a yes solution of a problem can be verified in polynomial time. Therefore, both the size of the formal representation of a solution x and the time it takes to check its validity (to check whether it is a yes solution) must be polynomially bounded.
Therefore, all problems in NP have the property that their yes solutions can be checked efficiently. The definition of NP says nothing about the time necessary for verifying no solutions, and a problem in NP cannot necessarily be solved in polynomial time. Informally, the class NP consists of all “reasonable” problems
of practical importance where a yes solution can be verified in polynomial time:
this means the objective value of the optimal solution can be calculated fast. For
problems not in NP, even verifying that a solution is valid (is a yes answer) can be
extremely difficult (needs exponential time).
An alternative definition of NP is based on the notion of non-deterministic algorithms. Non-deterministic algorithms are algorithms which have the additional ability to guess any verifiable intermediate result in a single step. If we assume that we find a yes solution for a decision problem by iteratively assigning values to the decision variables, a non-deterministic algorithm always selects the value (possibility) that leads to a yes answer, if a yes answer exists for the problem. Therefore, we can view a non-deterministic algorithm as an algorithm that always guesses the right possibility whenever the correctness can be checked in polynomial time. The class NP is the set of all decision problems that can be solved by a non-deterministic algorithm in worst-case polynomial time. The two definitions of NP are equivalent to each other. Although non-deterministic algorithms cannot be executed directly by conventional computers, this concept is important and helpful for the analysis of the computational complexity of problems.
All problems that are in P also belong to the class NP. Therefore, P ⊆ NP. An
important question in computational complexity is whether P is a proper subset of
NP (P ⊂ NP) or whether NP is equal to P (P = NP). So far, this question has not been finally answered (Fortnow, 2009), but most researchers assume that P ≠ NP and that there are problems that are in NP but not in P.
In addition to the classes P and NP, there are also problems where yes solutions cannot be verified in polynomial time. Such problems are very difficult to solve and are, so far, of little practical relevance.
When solving an optimization problem, we are interested in the running time of the
algorithm that is able to solve the problem. In general, we can distinguish between
polynomial running time and exponential running time. Problems that can be solved
using a polynomial-time algorithm (there is an upper bound O(n^k) on the running time of the algorithm, where k is constant) are tractable. Usually, tractable problems
are easy to solve as running time increases relatively slowly with larger input size
n. For example, finding the lowest element in an unordered list of size n is tractable
as there are algorithms with time complexity that is O(n). Spending twice as much
effort solving the problem allows us to solve problems twice as large.
In contrast, problems are intractable if they cannot be solved by a polynomial-time algorithm and there is a lower bound on the running time which is Ω(k^n), where k > 1 is a constant and n is the problem size (input size). For example, guessing the
correct number for a digital door lock with n digits is an intractable problem, as the time necessary for finding the correct key is Ω(10^n). Using a lock with one more digit increases the number of required search steps by a factor of 10. For this problem, the size of the problem is n, whereas the size of the search space is |X| = 10^n. The effort to find the correct key depends on n and increases at the same rate as the size of the search space. Table 2.2 lists the growth rates of some common functions, ordered by how fast they grow.
Table 2.2 Polynomial (top) and exponential (bottom) functions:

constant                 O(1)
logarithmic              O(log n)
linear                   O(n)
quasilinear              O(n log n)
quadratic                O(n^2)
polynomial (of order c)  O(n^c), c > 1

exponential              O(k^n)
factorial                O(n!)
super-exponential        O(n^n)
All decision problems that are in P are tractable and thus can be easily solved using the “right” algorithm. If we assume that P ≠ NP, then there are also problems that are in NP but not in P. These problems are difficult as no polynomial-time algorithms exist for them.
Among the decision problems in NP, there are problems for which no polynomial algorithm is available and which can be transformed into each other with polynomial effort. A problem is denoted NP-hard if every problem in NP is polynomial-time reducible to it; an algorithm that solves an NP-hard problem can thus be used to solve any problem in NP with only polynomial overhead. A problem A is polynomial-time reducible to a different problem B if and only if there is a transformation that transforms any instance x of A into an instance x′ of B in polynomial time such that x is a yes-instance of A if and only if x′ is a yes-instance of B. Informally, a problem A is reducible to some other problem B if problem B either has the same difficulty or is harder than problem A. Therefore, NP-hard problems are at least as hard as any other problem in NP, although they might be harder. Note that NP-hard problems are not necessarily in NP.
2.4.2 Locality
In general, the locality of a problem describes how well the distances d(x, y) be-
tween any two solutions x, y ∈ X correspond to the difference of the objective values
| f (x) − f (y)| (Lohmann, 1993; Rechenberg, 1994; Rothlauf, 2006). The locality of
a problem is high if neighboring solutions have similar objective values. In contrast,
the locality of a problem is low if low distances do not correspond to low differences
of the objective values. Relevant determinants for the locality of a problem are the
metric defined on the search space and the objective function f .
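As a rough illustration, the following Python sketch (the names locality_sample and flip_one are ours and purely illustrative) estimates how strongly a single search step changes the objective value; a small average change indicates high locality.

    import random

    def locality_sample(f, sample_neighbor, x0, steps=1000):
        """Estimate locality: do moves to neighboring solutions cause only
        small changes of the objective value?  Returns the average
        difference |f(x) - f(y)| over sampled neighboring pairs."""
        x, diffs = x0, []
        for _ in range(steps):
            y = sample_neighbor(x)          # y has distance 1 from x
            diffs.append(abs(f(x) - f(y)))
            x = y
        return sum(diffs) / len(diffs)

    # Example: binary strings, Hamming-distance-1 neighbors, f = number of ones.
    def flip_one(x):
        i = random.randrange(len(x))
        return x[:i] + [1 - x[i]] + x[i + 1:]

    print(locality_sample(sum, flip_one, [0] * 20))  # 1.0: neighbors differ little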
In the heuristic literature, there are a number of studies on locality for discrete de-
cision variables (Weicker and Weicker, 1998; Rothlauf and Goldberg, 1999, 2000;
Gottlieb and Raidl, 2000; Gottlieb et al, 2001; Whitley and Rowe, 2005; Caminiti
and Petreschi, 2005; Raidl and Gottlieb, 2005; Paulden and Smith, 2006) as well
as for continuous decision variables (Rechenberg, 1994; Igel, 1998; Sendhoff et al,
1997b,a). For continuous decision variables, locality is also known as causality.
High and low locality correspond to strong and weak causality, respectively. Al-
though causality and locality describe the same concept and causality is the older
one, we refer to the concept as locality as it is currently more often used in the
literature.
Guided search methods are optimization approaches that iteratively sample solu-
tions and use the objective values of previously sampled solutions to guide the future
search process. In contrast to random search which samples solutions randomly and
uses no information about previously sampled solutions, guided search methods dif-
ferentiate between promising (for maximization problems these are solutions with
high objective values) and non-promising (solutions with low objective values) areas
in the fitness landscape. New solutions are usually generated in the neighborhood
of promising solutions with high objective values. A prominent example of guided
search is greedy search (see Sect. 3.4.1).
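A minimal sketch of a guided search method is the following greedy hill climber (illustrative Python; the function names are ours): it repeatedly moves to the best neighbor until no neighbor improves the objective value.

    def guided_search(f, neighbors, x):
        """Greedy hill climber (maximization): move to the best neighbor
        until no neighbor improves the objective value."""
        while True:
            best = max(neighbors(x), key=f)
            if f(best) <= f(x):
                return x          # local optimum reached
            x = best

    # Example: maximize the number of ones in a binary string.
    def neighbors(x):
        return [x[:i] + [1 - x[i]] + x[i + 1:] for i in range(len(x))]

    print(guided_search(sum, neighbors, [0, 1, 0, 0, 1]))  # [1, 1, 1, 1, 1]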
The locality of optimization problems has a strong impact on the performance of
guided search methods. Problems with high locality allow guided search methods
to find high-quality solutions in the neighborhood of already found good solutions.
Furthermore, the underlying idea of guided search methods to move in the search
space from low-quality solutions to high-quality solutions works well if the problem
has high locality. In contrast, if a problem has low locality, guided search methods
cannot make use of previous search steps to extract information that can be used
for guiding the search. Then, for problems with low locality, guided search methods
behave like random search.
One of the first approaches to the question of what makes problems difficult for
guided search methods was the study of deceptive problems by Goldberg (1987)
which was based on the work of Bethke (1981). In deceptive problems, the objec-
tive values are assigned in such a way to the solutions that guided search methods
are led away from the global optimal solution. Therefore, based on the structure of
the fitness landscape (Weinberger, 1990; Manderick et al, 1991; Deb et al, 1997),
the correlation between the fitness of solutions can be used to describe how difficult
a specific problem is to solve for guided search methods. For an overview of correla-
tion measurements and problem difficulty we refer to Bäck et al (1997, Chap. B2.7)
or Reeves and Rowe (2003).
The following paragraphs present approaches that try to determine what makes
a problem difficult for guided search. Their general idea is to measure how well the
metric defined on the search space fits the structure of the objective function. A high
fit between metric and structure of the fitness function makes a problem easy for
guided search methods.
Problems become more difficult if there is no correlation between the fitness dif-
ference and the distance to the optimal solution. The locality of such problems is
low, as no meaningful relationship exists between the distances d between differ-
ent solutions and their objective values. Thus, the fitness landscape cannot guide
guided search methods to optimal solutions. Optimization methods cannot use in-
formation about a problem which was collected in prior search steps to determine
the next search step. Therefore, all search algorithms show the same performance as
no useful information (information that indicates where the optimal solution can be
found) is available for the problem. Because all search strategies are equivalent,
random search is also an appropriate search method for such problems. Random search
uses no information and performs as well as other search methods on these types of
problems.
We want to give two examples. In the first example, we have a discrete search
space X with n elements x ∈ X. A deterministic random number generator assigns
a random number to each x. Again, the optimization problem is to find x∗ , where
f (x∗ ) ≤ f (x) for all x ∈ X. Although we can define neighborhoods and similari-
ties between different solutions, all possible optimization algorithms show the same
behavior. All elements of the search space must be evaluated to find the globally
optimal solution.
The second example is the needle-in-a-haystack (NIH) problem. Following its
name, the goal is to find a needle in a haystack. In this problem, a metric exists
defining distances between solutions, but there is no meaningful relationship be-
tween the metric and the objective value (needle found or not) of different solutions.
When physically searching in a haystack for a needle, there is no good strategy for
choosing promising areas of the haystack that should be searched in the next search
step. The NIH problem can be formalized by assuming a discrete search space X
and the objective function

    f(x) = 0   for x ≠ x_opt,
           1   for x = x_opt.                                  (2.12)
Figure 2.4(a) illustrates the problem. The NIH problem is equivalent to the problem
of finding the largest number in an unordered list of numbers. The effort to solve
such problems is high and increases linearly with the size |X| of the search space.
Therefore, the difficulty of the NIH problem is Θ(|X|).
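The following Python sketch (illustrative; the helper names are ours) implements the NIH objective (2.12) and shows why no strategy beats blind sampling: the needle is only ever found by chance.

    import random

    def nih(x, x_opt):
        """Needle-in-a-haystack objective (2.12): 1 only at the optimum."""
        return 1 if x == x_opt else 0

    def random_search(f, X, x_opt):
        """Sample without replacement until the needle is found; on average
        (|X| + 1) / 2 evaluations are needed, and no strategy does better."""
        candidates = list(X)
        random.shuffle(candidates)
        for evaluations, x in enumerate(candidates, start=1):
            if f(x, x_opt) == 1:
                return evaluations
        return None

    print(random_search(nih, range(1000), x_opt=123))  # uniform in 1..1000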
Guided search methods perform worst for problems where the fitness landscape
leads the search method away from the optimal solution. Then, the distance to the
optimal solution is negatively correlated to the fitness difference between a solution
and the optimal solution. The locality of such problems is relatively high as most
neighboring solutions have similar fitness. However, since guided search finds the
optimal solution by performing iterated small steps in the direction of better solu-
tions, all guided search approaches must fail as they are misled. All other search
methods that use information about the fitness landscape also fail. More effective
search methods for such problems are those that do not use information about the
structure of the search space but search randomly, like random search. The most
prominent example of such types of problems are deceptive traps (see Figure 2.4(b)).
For this problem, the optimal solution is x∗ = xmin. The solution xmax is a deceptive
attractor, and guided search methods that search in the direction of solutions with
higher objective values always find xmax, which is not the optimal solution.
A common tool for studying the fitness-distance correlation of problems is fitness-
distance plots. Usually, such plots are more meaningful than just calculating the
fitness-distance correlation coefficient c_fd. Fitness-distance plots show how the fitness
of randomly sampled solutions depends on their distance to the optimal solution.
Examples can be found in Kauffman (1989) (NK-landscapes), Boese (1995) (traveling salesman problem), Reeves
(1999b) (flow-shop scheduling problems), Inayoshi and Manderick (1994) and Merz
and Freisleben (2000b) (graph bipartitioning problem), Merz and Freisleben (2000a)
(quadratic assignment problem), or Mendes et al (2002) (single machine scheduling
problem).
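A simple way to approximate c_fd without enumerating the search space is to sample solutions and compute the Pearson correlation between fitness and distance to the optimum, as in this Python sketch (assuming numpy is available; the helper name fdc is ours):

    import numpy as np

    def fdc(samples, f, dist, x_opt):
        """Estimate the fitness-distance correlation c_fd: the Pearson
        correlation between f(x) and d(x, x_opt) over sampled solutions."""
        fit = np.array([f(x) for x in samples])
        d = np.array([dist(x, x_opt) for x in samples])
        return np.corrcoef(fit, d)[0, 1]

    # Example: number-of-ones problem with Hamming distance to the optimum.
    hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
    samples = [np.random.randint(0, 2, 20).tolist() for _ in range(500)]
    x_opt = [1] * 20
    print(fdc(samples, sum, hamming, x_opt))  # -1.0: easy for guided search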
2.4.2.2 Ruggedness

A common approach to studying the ruggedness of a fitness landscape is to interpret the ob-
jective values as random variables and to obtain statistical properties on how the
distribution of the objective values depends on the distances between solutions. The
autocorrelation function (which is interchangeable with the autocovariance function
if the normalization factor ⟨f^2⟩ − ⟨f⟩^2 is dropped) of a fitness landscape is defined
as (Merz and Freisleben, 2000b)

    ρ(d) = (⟨f(x) f(y)⟩_{d(x,y)=d} − ⟨f⟩^2) / (⟨f^2⟩ − ⟨f⟩^2),        (2.13)
where ⟨f⟩ denotes the average value of f over all x ∈ X and ⟨f(x) f(y)⟩_{d(x,y)=d} is
the average value of f(x) f(y) for all pairs (x, y) ∈ X × X with d(x, y) = d. The
autocorrelation function has the attractive property of being in the range [−1, 1]. An
autocorrelation value of 1 indicates perfect correlation (positive correlation) and −1
indicates perfect anti-correlation (negative correlation). For a fixed distance d, ρ is
the correlation between the objective values of all solutions that have a distance of d.
Weinberger recognized that landscapes with exponentially decaying autocovariance
functions are often easy to solve for guided search methods (Weinberger, 1990).
To calculate the autocorrelation function is demanding for optimization problems
as it requires evaluating all solutions of the search space. Therefore, Weinberger
used random walks through the fitness landscape to approximate the autocorrela-
tion function. A random walk is an iterative procedure where in each search step
a random neighboring solution is created. The random walk correlation function
(Weinberger, 1990; Stadler, 1995, 1996; Reidys and Stadler, 2002) is defined as
    r(s) = (⟨f(x_i) f(x_{i+s})⟩ − ⟨f⟩^2) / (⟨f^2⟩ − ⟨f⟩^2),        (2.14)
where x_i is the solution examined in the ith step of the random walk and s is the number
of steps between two solutions in the search space. For a fixed s, r defines the corre-
lation of two solutions that are reached by a random walk in s steps, where s ≥ dmin.
For a random walk with a large number of steps, r(s) is a good estimate for ρ (d).
Correlation functions have some nice properties and can be used to measure the
difficulty of a problem for guided search methods. If we assume that we have a
completely random problem, where random objective values are assigned to all x ∈
X, then the autocorrelation function will have a peak at d = s = 0 and will be close
to zero for all other d and s. In general, for all possible problems the autocorrelation
function reaches its peak at the origin d = s = 0. Thus, |r(s)| ≤ r(0) holds for all
0 < s ≤ dmax.
When assuming that the distance between two neighboring solutions x and y is
equal to one (d(x, y) = 1), r(1) measures the correlation between the objective values
of all neighboring solutions. The correlation length lcorr of a landscape (Stadler,
1992, 1996) is defined as

    lcorr = −1 / ln(|r(1)|) = −1 / ln(|ρ(1)|).

The larger the correlation length, the smoother and less rugged the fitness landscape is.
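The random walk correlation function (2.14) and the correlation length can be estimated as in the following Python sketch (assuming numpy; the function names are ours):

    import math
    import random
    import numpy as np

    def random_walk_correlation(f, sample_neighbor, x0, steps=10000, s=1):
        """Estimate r(s) of (2.14) from one random walk and derive the
        correlation length lcorr = -1 / ln(|r(1)|)."""
        x, values = x0, []
        for _ in range(steps):
            values.append(f(x))
            x = sample_neighbor(x)
        v = np.array(values)
        r = np.corrcoef(v[:-s], v[s:])[0, 1]   # correlation at lag s
        return r, -1.0 / math.log(abs(r))

    def flip_one(x):
        i = random.randrange(len(x))
        return x[:i] + [1 - x[i]] + x[i + 1:]

    r1, lcorr = random_walk_correlation(sum, flip_one, [0] * 50)
    print(r1, lcorr)   # smooth landscape: r(1) close to 1, large lcorr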
2.4.3 Decomposability
The decomposability of a problem describes how well the problem can be decom-
posed into several, smaller subproblems that are independent of each other (Polya,
1945; Holland, 1975; Goldberg, 1989c). The decomposability of a problem is high
if the structure of the objective function is such that not all decision variables must
be simultaneously considered for calculating the objective function but there are
groups of decision variables that can be set independently of each other. It is low
if it is not possible to decompose a problem into subproblems with few interdepen-
dencies between the groups of variables.
When dealing with decomposable problems, it is important to choose the type
and number of decision variables such that they fit the properties of the problem.
The fit is high if the variables used result in a problem model where groups of
decision variables can be solved independently or, at least, where the interactions
between groups of decision variables are low. Given the set of decision variables
D = {x1, . . . , xl}, a problem can be decomposed into several subproblems if the
objective value of a solution x is calculated as f(x) = ∑_{D_s} f({x_i | x_i ∈ D_s}), where the D_s
are non-intersecting and proper subsets of D (D_s ⊂ D, ∪ D_s = D) and i ∈ {1, . . . , l}.
Instead of summing the objective values for the subproblems, also other functions
(e.g. multiplication, resulting in f = ∏_{D_s} f({x_i | x_i ∈ D_s})) can be used.
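As a small illustration (Python; the helper name is ours), an additively decomposable objective function can be built directly from its subfunctions and variable groups:

    def additively_decomposed_f(subfunctions, groups):
        """Build f(x) = sum over subproblems f_s of the variables in group D_s.
        `groups` is a list of disjoint index sets covering all variables."""
        def f(x):
            return sum(f_s([x[i] for i in D_s])
                       for f_s, D_s in zip(subfunctions, groups))
        return f

    # Example: two independent 2-bit subproblems.
    f = additively_decomposed_f(
        subfunctions=[sum, lambda bits: bits[0] * bits[1]],
        groups=[[0, 1], [2, 3]])
    print(f([1, 0, 1, 1]))   # 1 + 1 = 2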
In the previous Sect. 2.4.2, we have studied various measures of locality and dis-
cussed how the locality of a problem affects the performance of guided search meth-
ods. This section focuses on the decomposability of problems and how it affects the
performance of recombination-based search methods. Recombination-based search
methods solve problems by trying different decompositions of the problem, solv-
ing the resulting subproblems, and putting together the obtained solutions for these
subproblems to get a solution for the overall problem. High decomposability of a
problem usually leads to high performance of recombination-based search meth-
ods because solving a larger number of smaller subproblems is usually easier than
solving the larger, original problem. Consequently, it is important for effective
recombination-based search methods to identify proper subsets of variables such
that there are no strong interactions between the variables of the different subsets.
We discussed in Sect. 2.3.2 that the type and number of decision variables influence
the resulting neighborhood structure. In the case of recombination-based
search methods, we must choose the decision variables such that the problem can
be easily decomposed by the search method. We want to give an example of how
the decomposition of a problem can make a problem easier for recombination-based
search methods. Imagine you have to design the color and material of a chair. For
each of the two design variables, there are three different options. The quality of a
design is evaluated by marketing experts that assign an objective value to each com-
bination of color and material. Overall, there are 3 × 3 = 9 possible chair designs.
If the problem cannot be decomposed, the experts have to evaluate all nine different
solutions to find the optimal design. If we assume that the color and material are
independent of each other, we (or a recombination-based search method) can try to
decompose the problem and separately solve the decomposed subproblems. If the
experts separately decide about color and material, the problem becomes easier as
only 3 + 3 = 6 designs have to be evaluated.
Therefore, the use of recombination-based optimization methods suggests that
we should define the decision variables of a problem model such that they allow
a decomposition of the problem. The variables should be chosen such that there
are no (or at least few) interdependencies between different sets of variables. We
can study the importance of choosing proper decision variables for the chair ex-
ample. The first variant assumes no decomposition of the problem. We define one
decision variable x ∈ X, where |X| = 9. There are nine different solutions and
non-recombining optimization methods have to evaluate all possible solutions to
find the optimal one. In the second variant, we know that the objective function
of the problem can be decomposed. Therefore, we choose two decision variables
x1 ∈ X1 = {y, b, g} (yellow, blue, and green) and x2 ∈ X2 = {w, m, p} (wood, metal,
or plastic), where |X1| = |X2| = 3. A possible decomposition for the example prob-
lem is f = f1(x1) + f2(x2) (see Table 2.3). Decomposing the problem in such a way
results in two subproblems f1 and f2 of size |X1| = |X2| = 3. Comparing the two
problem formulations shows that the resulting objective values f of different solu-
tions are the same for both formulations. However, the number of solutions that
must be evaluated is lower for the second variant. Therefore, the problem becomes easier to
solve for recombination-based search methods as the assumed problem decomposi-
tion (f = f1 + f2) fits the properties of the problem well.
We want to give another example and study two different problems with l binary
decision variables x_i ∈ {0, 1} (|X| = 2^l). In the first problem, a random objective
value is assigned to each x ∈ X. This problem cannot be decomposed. In the sec-
ond problem, the objective value of a solution is calculated as f = ∑_{i=1}^{l} x_i. This
example problem can be decomposed. Using recombination-based search methods
for the first example is not helpful as no decomposition of the problem is possi-
ble. Therefore, all efforts of recombination-based search methods to find proper
decompositions of the problem are useless. The situation is different for the second
example. Recombination-based methods should be able to correctly decompose the
problem and to solve the l subproblems. If the decomposition is done properly by
the recombination-based search method, only 2l different solutions (two for each of
the l subproblems) need to be evaluated and the problem becomes much easier to
solve once the correct decomposition of the problem is found. However, usually
there is additional effort necessary for finding the correct decomposition.
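The following sketch (illustrative Python; the helper name is ours) exploits a known bitwise decomposition and finds the optimum of f = ∑ x_i with 2l instead of 2^l evaluations:

    def solve_bitwise(f, l):
        """Exploit a known decomposition f(x) = sum_i f_i(x_i): fix each bit
        independently by comparing its two settings (2*l evaluations
        instead of 2**l)."""
        x = [0] * l
        for i in range(l):
            x0, x1 = list(x), list(x)
            x0[i], x1[i] = 0, 1
            x[i] = 1 if f(x1) > f(x0) else 0   # keep the better setting
        return x

    print(solve_bitwise(sum, 10))   # [1]*10 found with only 20 evaluations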
We see that the choice of proper decision variables is important for the decom-
posability of an optimization problem. In principle, there are two extremes for com-
binatorial optimization problems. The one extreme is to encode all possible solu-
tions x ∈ X using only one decision variable x1 , where x1 ∈ {1, . . . , |X|}. Using such
a problem model, no decomposition is possible as only one decision variable ex-
ists. At the other extreme, we could use log_2 |X| binary decision variables encoding
the |X| different solutions. Then, the number of possible decompositions becomes
maximal (there are 2^{log_2 |X|} possible decompositions of the problem). Proper decision
variables for an optimization model should be chosen such that they allow a high
decomposition of the problem. Problem decomposition is problem-specific and de-
pends on the properties of f . We should have in mind that using a different number
of decision variables not only influences problem decomposition but also results in
a different neighborhood (see also Sect. 2.3.2).
In the following paragraphs, we discuss different approaches developed in the
literature to estimate how well a problem can be solved using recombination-based
search methods. All approaches assume that search performance is higher if a prob-
lem can be decomposed into smaller subproblems. Section 2.4.3.1 presents polyno-
mial problem decomposition and Sect. 2.4.3.2 illustrates the Walsh decomposition
of a problem. Finally, Sect. 2.4.3.3 discusses schemata and building blocks and how
they affect the performance of recombination-based search methods.
2.4.3.1 Polynomial Decomposition

Every objective function f defined on l binary decision variables can be written as a
polynomial

    f(x) = ∑_N α_N ∏_{n∈N} e_n^T x,

where the sum runs over the subsets N of {1, . . . , l}, the vector e_n contains 1 in the nth column and 0 elsewhere, T denotes
transpose, and the α_N are the coefficients (Liepins and Vose, 1991). Regarding
x = (x1, . . . , xl), we may view f as a polynomial in the variables x1, . . . , xl. The
coefficients α_N describe the non-linearity of the problem. If there are high-order
coefficients, the problem function is non-linear. If a decomposed problem has only
order-1 coefficients, then the problem is linearly decomposable. It is possible to de-
termine the maximum non-linearity of f(x) from its highest polynomial coefficients:
the higher the order of the non-zero α_N, the more non-linear the problem is.
There is some correlation between the non-linearity of a problem and its difficulty
for recombination-based search methods (Mason, 1995). However, as illustrated in
the following example, there could be high order αN although the problem can still
easily be solved by recombination-based search methods. The function

    f(x) = 1    for x1 = x2 = 0,
           2    for x1 = 0, x2 = 1,
           4    for x1 = 1, x2 = 0,
           10   for x1 = x2 = 1                                (2.15)

has the polynomial representation f(x) = 1 + 3x1 + x2 + 5x1x2. The second-order
coefficient α_{12} = 5 indicates non-linearity, although the problem can easily be solved
by setting x1 and x2 independently of each other.
2.4.3.2 Walsh Decomposition

Every objective function f : {0, 1}^l → R can also be decomposed into a weighted
sum of Walsh functions,

    f(x) = ∑_{j=0}^{2^l − 1} w_j ψ_j(x).
The Walsh functions ψ_j : {0, 1}^l → {−1, 1} form a set of 2^l orthogonal functions.
The weights w_j ∈ R are called Walsh coefficients. The indices j are binary strings of
length l representing the integers ranging from 0 to 2^l − 1. The jth Walsh function
is defined as

    ψ_j(x) = (−1)^{bc(j ∧ x)},

where x and j are binary strings and elements of {0, 1}^l, ∧ denotes the bitwise logical
AND, and bc(x) is the number of 1-bits in x (Goldberg, 1989a,b; Vose and Wright,
1998a,b). The Walsh coefficients can be computed by the Walsh transformation:
    w_j = (1/2^l) ∑_{k=0}^{2^l − 1} f(k) ψ_j(k).

For the example problem (2.15), the Walsh coefficients
are w = (w00, w01, w10, w11) = (4.25, −1.75, −2.75, 1.25). Although the problem is easy to solve for
recombination-based search methods (x1 and x2 can be set independently of each
other), there is a high-order Walsh coefficient (w11 = 1.25) which indicates high
problem difficulty.
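These coefficients can be verified with a direct implementation of the Walsh transformation (Python sketch; the helper names are ours):

    def bc(x):
        """Number of 1-bits of the integer x."""
        return bin(x).count("1")

    def walsh_coefficients(f, l):
        """Walsh transformation: w_j = 2**-l * sum_k f(k) * psi_j(k)
        with psi_j(k) = (-1)**bc(j & k)."""
        return [sum(f[k] * (-1) ** bc(j & k) for k in range(2 ** l)) / 2 ** l
                for j in range(2 ** l)]

    # Objective values of example (2.15), indexing (x1 x2) as a binary number.
    f = [1, 2, 4, 10]                 # f(00)=1, f(01)=2, f(10)=4, f(11)=10
    print(walsh_coefficients(f, 2))   # [4.25, -1.75, -2.75, 1.25]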
Walsh analysis not only overestimates problem difficulty but also underestimates
it. For example, MAX-SAT problems (see Sect. 4.4, p. 126) are difficult (APX-hard,
see Sect. 3.4.2), but have only low-order Walsh coefficients (Rana et al, 1998). For
example, the MAX-3SAT problem has no coefficients of higher order than 3 and
the number of non-zero coefficients of order 3 is low (Rana et al, 1998). Although
the Walsh coefficients indicate that the problem is easy, recombination-based search
methods cannot perform well on this difficult optimization problem (Rana et al,
1998; Rana and Whitley, 1998; Heckendorn et al, 1996, 1999).
2.4.3.3 Schemata and Building Blocks

Schemata analysis is an approach developed and commonly used for measuring the
difficulty of problems with respect to genetic algorithms (GA, Sect. 5.2.1). As the
main search operator of GAs is recombination, GAs are a representative example
of recombination-based search methods. Schemata are usually defined for binary
search spaces and thus schemata analysis is mainly useful for problems with binary
decision variables. However, the idea of building blocks is also applicable to other
search spaces (Goldberg, 2002). In the following paragraphs, we introduce schemata
and building blocks and describe how these concepts can be used for estimating the
difficulty of problems for recombination-based search methods.
Schemata
Schemata were first proposed by Holland (1975) to model the ability of GAs to
process similarities between binary decision variables. When using l binary deci-
sion variables xi ∈ {0, 1}, a schema H = [h1 , h2 , . . . , hl ] is a sequence of symbols of
length l, where hi ∈ {0, 1, ∗}. ∗ denotes the “don’t care” symbol and tells us that a
decision variable is not fixed. A schema stands for the set of solutions which match
the schema at all the defined positions, i.e., those positions having either a 0 or a 1.
Schemata of this form allow for coarse graining (Stephens and Waelbroeck, 1999;
Contreras et al, 2003), where whole sets of strings can be treated as a single entity.
A position in a schema is fixed if there is either a 0 or a 1 at this position. The
size or order o(H) of a schema H is defined as the number of fixed positions (0’s or
1’s) in the schema string. The defining length δ (H) of a schema H is defined as the
distance between (meaning the number of bits that are between) the two outermost
fixed bits. The fitness fs (H) of a schema is defined as the average fitness of all
instances of this schema and can be calculated as
    fs(H) = (1/||H||) ∑_{x∈H} f(x),
where ||H|| is the number of solutions x ∈ {0, 1}l that are instances of the schema H.
The instances of a schema H are all solutions x ∈ H. For example, x = (0, 1, 1, 0, 1)
and y = (0, 1, 1, 0, 0) are instances of H = [0∗1∗∗]. The number of solutions that are
instances of a schema H can be calculated as 2^{l−o(H)}. For a more detailed discussion
of schemata in the context of GA, we refer to Holland (1975), Goldberg (1989c),
Altenberg (1994) or Radcliffe (1997).
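The schema measures defined above can be computed directly, as in this Python sketch (the helper names are ours):

    from itertools import product

    def order(H):
        """Order o(H): number of fixed positions (0's or 1's)."""
        return sum(h != "*" for h in H)

    def defining_length(H):
        """Defining length delta(H): distance between outermost fixed bits."""
        fixed = [i for i, h in enumerate(H) if h != "*"]
        return fixed[-1] - fixed[0] if fixed else 0

    def instances(H):
        """All solutions x in {0,1}^l that match the schema H."""
        options = [(0, 1) if h == "*" else (int(h),) for h in H]
        return [list(x) for x in product(*options)]

    def schema_fitness(H, f):
        """Average fitness fs(H) over all instances of H."""
        xs = instances(H)
        return sum(f(x) for x in xs) / len(xs)

    H = "0*1**"
    print(order(H), defining_length(H), len(instances(H)))  # 2 2 8
    print(schema_fitness(H, sum))                           # 2.5 for f = sum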
Building Blocks
Based on schemata, Goldberg (1989c, p. 20 and p. 41) defined building blocks (BB)
as “highly fit, short-defining-length schemata”. Although BBs are commonly used
(especially in the GA literature) they are rarely defined. We can describe a BB as a
solution to a subproblem that can be expressed as a schema. Such a schema has high
fitness and its size is smaller than the length l of the binary solution. By combining
BBs of lower order, recombination-based search methods like GAs can form high-
quality overall solutions.
We can interpret BBs also from a biological perspective and view them as genes.
A gene consists of one or more alleles and can be described as a schema with high
fitness. Often, genes do not strongly interact with each other and determine specific
properties of individuals like hair or eye color.
BBs can be helpful for estimating the performance of recombination-based
search algorithms. If the sub-solutions to a problem (the BBs) are short (low
δ (H)) and of low order (low o(H)), then the problem is assumed to be easy for
recombination-based search.
After a problem has been decomposed, the resulting subproblems (the BBs) usually cannot be
decomposed any more, and usually guided or random search methods are applied to
find the correct solution for the decomposed subproblems. The difficulty of solving
a single subproblem is denoted as its intra-BB difficulty.
Therefore, intra-BB difficulty depends on the locality of the subproblem. As dis-
cussed in Sect. 2.4.2, (sub)problems are most difficult to solve if the structure of the
fitness landscape leads guided search methods away from the optimal solution. Con-
sequently, the deceptiveness (Goldberg, 1987) of a subproblem (for an example of a
deceptive problem, see Fig. 2.4(b), p. 34) is at the core of intra-BB difficulty. We can
define the deceptiveness of a problem not only by the correlation between objective
function and distance (as we have done in Sect. 2.4.2) but also by using the notion of
BBs. A problem is said to be deceptive of order kmax if for k < kmax all schemata that
contain parts of the best solution have lower fitness than their competitors (Deb and
Goldberg, 1994). Schemata are competitors if they have the same fixed positions.
An example of four competing schemata of size k = 2 for a binary problem of length
l = 4 are H1 = [0 ∗ 0∗], H2 = [0 ∗ 1∗], H3 = [1 ∗ 0∗], and H4 = [1 ∗ 1∗]. Therefore, the
highest order kmax of the schemata that are not misleading determines the intra-BB
difficulty of a problem. The higher the maximum order kmax of the schemata, the
higher is the intra-BB difficulty.
Table 2.4 shows the average fitness of the schemata for the example from (2.15).
All schemata that contain a part of the optimal solution are above average (printed
bold) and better than their competitors. Calculating the deceptiveness of the problem
based on the fitness of the schemata correctly classifies this problem as very easy.
When using this concept of BB-difficulty for estimating the difficulty of a prob-
lem for recombination-based search methods, the most natural and direct way to
measure problem difficulty is to analyze the size and length of the BB in the prob-
lem. The intra-BB difficulty of a problem can be measured by the maximum length
δ (H) and size k = o(H) of its BBs H (Goldberg, 1989c). Representative examples
of use of these concepts to estimate problem difficulty can be found in Goldberg
(1992), Radcliffe (1993), or Horn (1995).
Recombination-based search methods solve problems by trying different prob-
lem decompositions and solving the resulting subproblems. If a problem is correctly
decomposed, optimal solutions (BBs) of the subproblems can be determined inde-
pendently of each other. Often, the contributions of different subproblems to the
overall objective value of a solution are non-uniform. Non-uniform contributions of
subproblems to the objective value of a solution determine inter-BB difficulty. Prob-
lems become more difficult if some BBs have a lower contribution to the objective
value of a solution. Furthermore, problems often cannot be decomposed into com-
pletely separated and independent sub-problems, but have some interdependencies
between subproblems which are an additional source of inter-BB difficulty.
Section 3.2 presents optimization methods for problems where the objec-
tive function and the constraints are continuous and linear. As interior point methods
can solve such problems with polynomial effort, such problems belong to the class
P. Section 3.3 gives an overview of exact optimization methods for discrete opti-
mization problems. Many linear, discrete optimization problems are NP-complete.
We discuss and illustrate the functionality of enumeration methods like informed
search methods and branch and bound, dynamic programming methods, and cut-
ting plane methods. The large drawback of such exact methods is that their effort to
solve NP-complete problems increases exponentially with the problem size. Finally,
Sect. 3.4 gives an overview of heuristic optimization methods. Important types of
heuristic optimization methods are approximation algorithms and modern heuris-
tics. Approximation algorithms are heuristics where we have a bound on the quality
of the returned solution. Modern heuristics are general-purpose heuristics that use
sophisticated intensification and diversification mechanisms for searching through
the search space. The relevant elements of modern heuristics are the representation
and variation operators used, the fitness function, the initial solution, and the search
strategy. These elements are discussed in the subsequent chapters. Finally, we dis-
cuss the no-free-lunch theorem, which states that general-purpose problem solving
is not possible and that heuristic optimization methods must exploit specific properties
of a problem to outperform random search.
3.1 Analytical and Numerical Optimization Methods

Finding optimal solutions for optimization problems can be relatively easy if the
problem is well-defined and there are no constraints on the decision variables. A
common example is the continuous optimization problem

    minimize f(x), where f : R^n → R is differentiable.

For all local maxima and minima of the problem, ∇f(x) = 0. All solutions where
∇f(x) = 0 are called stationary points.
There are several options available to determine whether a stationary point is a
local minimum or maximum. A simple and straightforward approach is to calculate
the objective values of solutions that are in each dimension slightly (infinitesimally)
larger and smaller than the stationary point. If the objective values of all solutions
are larger than the objective value of the stationary point, then we have a minimum;
if they are all smaller, we have a maximum. If some of them are smaller and some
are larger, we have a saddle point. A saddle point is a stationary point that is no
extremum. Figure 3.1 illustrates the saddle point (0, 0) and the minimum (1, −1)
for the objective function f(x) = 3x^4 − 4x^3.
If the determinant of the Hessian matrix is unequal to zero and the Hessian matrix is
positive definite at a stationary point y, then y is a local minimum. A matrix is pos-
itive definite at y if all eigenvalues of the matrix are larger than zero. Analogously,
a stationary point is a local maximum if the Hessian matrix is negative definite at y.
If the Hessian matrix is indefinite at y (some eigenvalues are positive and some are
negative), y is a saddle point. After determining all local minima and maxima of a
problem, the optimal solution can be found by comparing the objective values of all
local optima and choosing the solution with minimal or maximal objective value,
respectively.
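A minimal numerical version of this test (Python with numpy; the example function is ours) classifies a stationary point from the eigenvalues of its Hessian:

    import numpy as np

    def classify_stationary_point(hessian):
        """Classify a stationary point from the Hessian eigenvalues."""
        eig = np.linalg.eigvalsh(hessian)    # Hessians are symmetric
        if np.all(eig > 0):
            return "local minimum"
        if np.all(eig < 0):
            return "local maximum"
        if np.any(eig > 0) and np.any(eig < 0):
            return "saddle point"
        return "degenerate (determinant is zero)"

    # Example: f(x, y) = x**2 - y**2 has a saddle point at the origin.
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])
    print(classify_stationary_point(H))   # saddle point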
If we assume that there are some additional constraints on the range of the deci-
sion variables (xi ∈ [xmin , xmax ]), we can apply the same procedure. In addition, we
have to examine all solutions at the edge of the feasible search space (xi = xmin and
xi = xmax ) and check whether they are the global optimum. In general, all solutions
at the boundary of the definition space are treated as stationary points.
We want to give an example. Rosenbrock’s function is a well known test problem
for continuous optimization methods (De Jong, 1975; Goldberg, 1989c). It is twice-
differentiable and defined as
    f(x) = ∑_{i=1}^{n−1} [100 (x_i^2 − x_{i+1})^2 + (x_i − 1)^2].        (3.1)
The global minimum of this problem is x∗ = (1, . . . , 1) with the function value
f (x∗ ) = 0. For n < 4, there are no other local optima. For higher dimensions (n ≥ 4),
the function becomes multi-modal and local optima exist (Hansen and Ostermeier,
2001; Deb et al, 2002; Shang and Qiu, 2006). The problem is viewed as diffi-
cult to solve for local search approaches as the minimum resides in a long, nar-
row, and parabolic-shaped flat valley. Figure 3.2 plots the two-dimensional variant
f(x1, x2) = 100 (x1^2 − x2)^2 + (x1 − 1)^2 of Rosenbrock's function for x1, x2 ∈ [−2, 2].
For ∂f/∂x1 = ∂f/∂x2 = 0, we get

    400 x1 (x1^2 − x2) + 2 (x1 − 1) = 0

and

    −200 (x1^2 − x2) = 0.

The second equation yields x2 = x1^2. Substituting this into the first equation, we
finally get x1 = x2 = 1, which is the globally optimal solution.
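This stationary-point computation can be reproduced symbolically, for example with sympy (a sketch, not part of the original text):

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2", real=True)
    f = 100 * (x1**2 - x2) ** 2 + (x1 - 1) ** 2

    # Solve the gradient system of the 2-d Rosenbrock function.
    stationary = sp.solve([sp.diff(f, x1), sp.diff(f, x2)], [x1, x2], dict=True)
    print(stationary)    # [{x1: 1, x2: 1}]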
The example illustrates how optimal solutions can be found for problems where
the objective function is twice-differentiable. We have to identify stationary points
where the gradient of f (x) is zero, exclude saddle points from consideration, and
find the optimal solutions among all local optima. However, for many problems we
are not able to solve the equation system ∇ f (x) = 0 although the objective function
is available in functional form and twice-differentiable.
For solving such problems, we can use numerical methods to identify station-
ary points. Numerical methods often start with an initial solution x0 and perform
iterative steps in the direction of a local optimal solution using the gradient ∇ f (x)
for guiding the search. Common examples are Newton’s method (also known as
Newton-Raphson method) and gradient search. Newton’s method generates new
solutions as
    x_{n+1} = x_n − γ [H(f(x_n))]^{−1} ∇f(x_n),        (3.2)
where n ≥ 0, [H(f(x_n))]^{−1} is the inverse of the Hessian matrix at x_n, and γ is a small
step size (γ > 0). For n → ∞, this sequence converges towards a local optimum.
Gradient search also performs iterative steps and generates new solutions as

    x_{n+1} = x_n − γ_n ∇f(x_n),        (3.3)

where the variable step size γ_n is reduced with larger n such that (for a minimizing search)
f(x_{n+1}) < f(x_n). Again, this sequence converges to a locally optimal solution for
n → ∞.
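A minimal gradient search for the two-dimensional Rosenbrock function might look as follows (Python sketch with a fixed small step size γ; practical implementations adapt γ, e.g. by a line search):

    import numpy as np

    def gradient_search(grad_f, x0, gamma=1e-3, steps=100000):
        """Plain gradient search x_{n+1} = x_n - gamma * grad f(x_n)
        with a fixed small step size (a sketch, not a robust solver)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - gamma * grad_f(x)
        return x

    def grad_rosenbrock(x):
        """Gradient of the 2-d Rosenbrock function."""
        x1, x2 = x
        return np.array([400 * x1 * (x1**2 - x2) + 2 * (x1 - 1),
                         -200 * (x1**2 - x2)])

    # Approaches (1, 1); convergence along the narrow valley is slow.
    print(gradient_search(grad_rosenbrock, [0.0, 0.0]))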
Both numerical methods allow us to find a local optimum in the neighborhood
of x0 . Usually, we have to guess an area where the optimal solution can be found
(d(x∗ , x0 ) should be small) and the methods return the nearest local optimum. If we
are lucky, the local optimum is also the global optimum. However, as the methods
just “move downhill” towards the next local optimum, choosing a random initial so-
lution only returns the optimal solution x∗ for unimodal problems. For multi-modal
problems, we have no guarantee that the local optimum found is the global optimum.
Summarizing, for some well-defined and twice-differentiable optimization prob-
lems we can analytically determine the optimal solution. However, the range of
problems that can be solved in such a way is limited. We can extend the number of
solvable problems by using numerical methods that use the gradient of the objective
function to direct their search through the search space. However, the use of numer-
ical methods is also limited as the effort for searching through the search space and
identifying all local optima is high.
3.2 Optimization Methods for Linear, Continuous Problems

Linear optimization problems (linear programs, LPs) have a linear objective function
f(x) = c^T x that is minimized or maximized. In addition, there can be linear con-
straints. We can distinguish between linear equality constraints f_i(x) = b_i, where the
f_i are linear functions, and linear inequality constraints f_i(x) ≥ b_i and f_i(x) ≤ b_i,
respectively. The sets of linear equalities and inequalities can be formulated in matrix
notation as Ax = b, Ax ≤ b, or Ax ≥ b, where b is an m-dimensional vector (usually
b ≥ 0) and A is an m × n matrix, where m denotes the number of constraints. The
combined linear equalities and inequalities are called linear constraints.
In LPs, different problem formulations exist which are equivalent to each other.
The canonical form of an LP consists of a linear objective function, linear inequali-
ties, and non-negative decision variables:

    min c^T x,        (3.4)
    subject to Ax ≥ b,
    x_i ≥ 0.

The equivalent maximization formulation is

    max c^T x,        (3.5)
    subject to Ax ≤ b,
    x_i ≥ 0.

The standard form of an LP replaces the inequalities by equalities:

    max c^T x,        (3.6)
    subject to Ax = b,
    x_i ≥ 0.
To transform a problem from canonical form to standard form, slack and surplus
variables are used. With the slack variables yi and the surplus variables zi (yi , zi ≥ 0),
we can rewrite inequalities of the form f (x) ≤ bi or f (x) ≥ bi as equalities f (x) +
yi = bi or f (x) − zi = bi , respectively. Consequently, each problem in canonical form
can be transformed into a problem in standard form with the same properties.
The Simplex method, which was developed by Dantzig in 1947 (Dantzig, 1949,
1951, 1962), is an effective method for solving LP problems. The Simplex method
takes as input an LP in standard form and returns an optimal solution. Although its
worst-case time complexity is exponential (Klee and Minty, 1972), it is very effi-
cient in practice (Smale, 1983; Kelner and Spielman, 2006).
When searching for optimal solutions of LPs, we can make use of the fact that
a linear inequality splits an n-dimensional search space into two halves which are
called half-spaces. One half-space contains feasible solutions, the other one infea-
sible solutions. The intersection of all feasible half-spaces forms the feasible region
and is called a polyhedron if the feasible region is unbounded or polytope if the
feasible region is bounded. A set S in an n-dimensional vector space Rn is called a
convex set if the line segment joining any pair of points of S lies entirely in S. When
using linear constraints, the feasible region of any LP is always a convex set. The
feasible, convex region forms a simplex, the simplest possible polytope in an
n-dimensional space (this is where the name of the Simplex method comes from). A
convex polytope can be defined either as the convex hull of a feasible region, or as
the intersection of a finite set of half-spaces (Weyl, 1935). Solutions that are on the
border of the feasible region (simplex) are called boundary points. Solutions that
are feasible and not boundary points are called interior points. Feasible points that
are boundary points and lie on the intersections of n half-spaces are called corner
points or vertices of the simplex. The number of corner points grows exponentially
with the size n of the problem.
The set of solutions for which the objective function obtains a specific value is
a hyperplane of dimension n − 1. Because the set of feasible solutions is a con-
vex set (simplex) and the objective function is linear, one of the corner points of the
simplex is an optimal solution for an LP (if a bounded optimal solution exists). Con-
sequently, optimal solutions can be found by examining all vertices of the simplex
and choosing the one(s) where the objective function becomes optimal.
We give two examples. Example 1 is a two-dimensional problem, where the ob-
jective function f (x) = x1 + 2x2 must be maximized. We assume that x1 , x2 ≥ 0. In
addition, there are three inequalities x2 − 2x1 ≤ 1, x2 ≤ 3, and 3x1 + x2 ≤ 9. Figure
3.3(a) shows the different half-spaces and the resulting simplex (shaded area). One
of the five vertices of the simplex is an optimal solution. The vertex (2, 3) is the
optimal solution as the objective function becomes maximal for f (2, 3) = 8. Figure
3.3(b) shows the same LP including the objective values. On the ground plane, we
see the feasible region defined by the five different inequalities. The surface shows
the objective function which becomes maximal for (2, 3).
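Example 1 can be reproduced with any LP solver; the following sketch uses scipy.optimize.linprog (which minimizes, so we negate the objective):

    from scipy.optimize import linprog

    # Example 1: maximize x1 + 2*x2, i.e. minimize -(x1 + 2*x2), subject to
    # x2 - 2*x1 <= 1, x2 <= 3, 3*x1 + x2 <= 9, and x1, x2 >= 0.
    res = linprog(c=[-1, -2],
                  A_ub=[[-2, 1], [0, 1], [3, 1]],
                  b_ub=[1, 3, 9],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)    # [2. 3.] 8.0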
Example 2 is a three-dimensional minimization problem:

    min c^T x,        (3.7)
    subject to Ax ≥ b, x_i ≥ 0,

where

    c = (1, 1, −3)^T,

    A = [  1   0   1
           0  −1   1
          −1   0   1
           0   1   1
           0   0   1 ],   b = (2, −1, −1, 2, 1)^T.
The set of five inequalities forms a polyhedron (unbounded simplex) with four cor-
ner points at (1, 1, 1), (1, 2, 1), (2, 2, 1), and (2, 1, 1). Figure 3.4 shows the five dif-
ferent constraints splitting the search space into a feasible and infeasible region. As
the set of feasible solutions forms a convex set and the objective function is linear,
the optimal solution is one of the four corner points. We plot the hyperplane
f(x) = x1 + x2 − 3x3 = 1; all points in this plane have the same objective value
f(x) = 1. The optimal solution (2, 2, 1) is contained in this plane and lies at the
corner that is formed by the constraints x3 ≥ 1, −x1 + x3 ≥ −1, and x2 + x3 ≥ 2.
In general, we can find optimal solutions for LPs of size n ≤ 3 using a graphical
procedure. In a first step, we plot the different constraints in a two-dimensional
or three-dimensional plot. If we assume that a non-empty, bounded feasible region
(simplex) exists, then the optimal solution is one of the corner points of this simplex.
To find the optimal solution, we have to consider the hyperplanes that are defined
by f (x) = c, which are parallel to each other. We obtain the optimum at the highest
(or lowest) value of c such that the resulting hyperplane still intersects the simplex
formed by the constraints.
A similar approach (which can also be used for larger problems n ≥ 3) is to
enumerate all corner points of the simplex, to calculate the objective value of each
corner point, and to select the corner point with the lowest (minimization problem)
objective value as optimal solution. The drawback of this approach is that the num-
ber of corner points that have to be evaluated increases exponentially with n.
A more systematic and usually more efficient approach is the Simplex method.
The Simplex method starts at some vertex of the simplex and performs a sequence of
iterations. In each iteration, it moves along an edge of the simplex to a neighboring
vertex with higher or equal objective value. It terminates at a local optimum which
is a vertex where all neighboring vertices have a smaller objective value. As the
feasible region is convex and the objective function is linear, there is only one local
optimum which is also the global optimum. The Simplex method only moves on the
convex hull of the simplex and makes use of the fact that the optimal solution of LPs
is never an interior point but always a corner point on the boundary of the feasible
region.
Before we can apply the Simplex method, we have to transform an LP from
canonical form (3.5) to a standard form (3.6) where all bi ≥ 0. Therefore, in a first
step, all constraints where bi ≤ 0 must be multiplied by -1. Then, we have to intro-
duce slack and surplus variables for all inequality constraints. Slack variables are
introduced for all inequalities of the form f (x) ≤ b and surplus variables are used
for all inequalities f (x) ≥ b. After these operations, the LP is in standard form and
one additional variable (either slack or surplus variable) has been introduced for
each inequality constraint. Then, a new variable xi , called an artificial variable is
added to the left-hand side of each equality constraint that does not contain a slack
variable. Consequently, each equality constraint will then contain either one slack
variable or one artificial variable. A non-negative initial solution for this problem is
obtained by setting each slack and artificial variable equal to the right-hand side of
the equation in which it appears and setting all other variables, including the surplus
variables, equal to zero. All variables that are set unequal to zero form a basis and
are called basic variables. Usually, a basis contains m different variables, where m
is the number of constraints (m ≥ n).
How do slack, surplus, and artificial variables change the LP? Slack and surplus
variables do not affect the constraints or the objective function. Therefore, they are
incorporated into the objective function with zero coefficients. However, artificial
variables change constraints because they are added to only one side of an equality
equation. Therefore, the new constraints are equivalent to the original ones if and
only if the artificial variables are set to zero. To guarantee such an assignment in the
optimal solution (not in the initial solution), artificial variables are incorporated into
the objective function with large positive coefficients M for minimization problems
and large negative coefficients −M for maximization problems.
We study this procedure for our example problem (3.7, p. 53). By introducing
slack (x5 and x6) and surplus (x4, x7, and x8) variables, we can remove the inequality
constraints and transform the problem into standard form:

    x1 + x3 − x4 = 2
    x2 − x3 + x5 = 1
    x1 − x3 + x6 = 1        (3.8)
    x2 + x3 − x7 = 2
    x3 − x8 = 1
Then, we have to add additional artificial variables to all constraints that do not
contain a slack variable. Therefore, we add the artificial variables x9 and x10 to the
first and last constraint, respectively. After introducing the artificial variables we get
the equations

    x1 + x3 − x4 + x9 = 2
    x2 − x3 + x5 = 1
    x1 − x3 + x6 = 1        (3.9)
    x2 + x3 − x7 = 2
    x3 − x8 + x10 = 1.

The basic decisions in each iteration of
the Simplex method are which variable is brought into the basis and which variable
is removed from the basis. Usually, a non-basic variable is brought into the basis
which makes the solution least infeasible or leads to the highest improvement in
the objective function. The basic variable to leave the basis is the one which is
most infeasible or expected to go infeasible first. Usually, the rules for selecting the
variables that enter and leave the basis are heuristics.
By performing iterative basis changes, the Simplex method moves from one so-
lution to the next, selecting the new basic variables such that the objective function
becomes lower (minimizing). If a feasible solution exists, the Simplex method finds
it. We know that we have found the optimal solution if no basis change leads to a fur-
ther reduction of the objective function. The Simplex method is fast in practice, as the optimal
solution is often reached after only 2m to 3m basis changes.
If the initial solution is infeasible, often a two-phase approach is used. In the first
phase, the entering and leaving variables are chosen such that a feasible solution is
obtained. The second phase starts after a feasible solution has been found. Then, the
Simplex method tries to improve the solution. We should have in mind that to obtain
a feasible solution usually all decision variables must be in the basis.
For more details on the functionality of the Simplex method and appropriate
strategies of moving from one corner of the simplex to adjacent corners, the in-
terested reader is referred to the literature. Basically each textbook on OR or lin-
ear programming describes the Simplex method in detail and provides additional
background information. Bronson and Naadimuthu (1997) or Grünert and Irnich
(2005, in German) provide an illustrative introduction but also other literature is
recommended (Papadimitriou and Steiglitz, 1982; Chvatal, 1983; Williams, 1999;
Cormen et al, 2001; Hillier and Lieberman, 2002; Winston, 1991; Domschke and
Drexl, 2005).
Prior to 1979, the Simplex method was the dominant solution approach
for LPs as it routinely and efficiently solved even very large LPs. The Simplex
method showed the nice property that, on almost all relevant real-world problems,
the number of iterations that are necessary to move to the feasible optimal so-
lution is a small multiple of the problem dimension n. However, as problem in-
stances exist where the Simplex method visits every vertex of the feasible region
(Klee and Minty, 1972), the worst-case complexity of the Simplex method is expo-
nential in the problem dimension n.
The situation changed in 1979, when Leonid Khachian proposed an ellipsoid
method (Khachian, 1979), which was the first polynomial-time LP algorithm. This
method is a specific variant of general non-linear approaches (Shor, 1970; Yudin
and Nemirovskii, 1976) as it does not search through the convex hull of the fea-
sible region but inscribes a sequence of ellipsoids with decreasing volume in the
feasible region. Such methods that do not search on the convex hull but approach
the optimal solution (which is a corner-point) from within the feasible region are
called interior point methods. Although the algorithm had a great impact on theory,
its impact on practice was low as the algorithm usually reaches the worst-
case bound and needs on average a larger number of search steps than the Simplex
method. Khachian’s method is a nice example of a polynomial optimization method
that on average needs more search steps than a method with worst-case exponential
behavior. However, it was the first approach to succeed in establishing a connec-
tion between linear problems and non-linear problems where the objective function
depends non-linearly on the decision variables. Before, these two types of optimiza-
tion problems were solved with completely different methods although one is a strict
subset of the other.
In 1984, Karmarkar presented a more advanced and faster polynomial-time
method for solving LPs (Karmarkar, 1984). In this method, a sequence of spheres
is inscribed in the feasible region such that the centers of the spheres converge to
the optimal solution. There is a formal equivalence between this method and the
logarithmic barrier method (Gill et al, 1986; Forsgren et al, 2002). Comparing the
interior point method from Karmarkar to the Simplex method revealed similar or
better performance of interior point methods for many problem instances (Gill et al,
1986). For more information on the development of interior point methods and a
detailed explanation of their functionality, we refer the interested reader to Wright
(1997) and Wright (2005).
In the last few years, a number of improvements have been developed for the
Simplex method as well as interior point methods. Nowadays, the performance dif-
ferences between these two approaches depend strongly on the particular geometric
and algebraic properties of the LP problem. In general, however, interior-point meth-
ods show similar or better performance than Simplex methods for larger problems
when no prior information about the optimal solution is available. In contrast, when
prior information is available and a so called “warm start” is possible (starting from
a previously found solution), the situation is reversed and Simplex methods tend to
be able to make much better use of it than interior-point methods (Maros and Mitra,
1996).
Prior to 1987, all commercial LP problem solvers used the Simplex algorithm.
Nowadays, there are also a number of free interior-point implementations available
that are competitive with commercial variants. For an overview of current interior-
point methods we refer to Benson (2010a) and Benson (2010b). As the emergence of
interior point methods spurred the development of more efficient Simplex methods,
there are a number of free and commercial LP solvers based on the Simplex method
available (Mittelmann, 2010).
Summarizing, there are efficient methods available for linear, continuous opti-
mization problems. Linear optimization problems belong to the class P as they can
be solved in polynomial time using interior point methods. The Simplex method,
which has worst-case exponential running time, and interior point methods, which
have worst-case polynomial running time, show similar performance and can solve
even large problem instances in a short time. The main limitation of linear optimiza-
tion methods is that they can only be applied to linear optimization problems where
the objective function and the constraints are linear. If the objective function or one
of the constraints is not linear, LP optimization methods cannot be applied. When
creating an LP model for a real-world optimization problem, we must make sure
that the character of the problem is described well by a linear model. Otherwise,
we may find an optimal solution for the LP model but the solution does not fit the
underlying non-linear optimization problem.
3.3 Optimization Methods for Linear, Discrete Problems

In the previous section, we studied optimization methods for linear problems where
the decision variables are continuous. This section describes optimization meth-
ods for problems where all decision variables are integers. Such problems are also
called combinatorial problems if the number of possible solutions is finite and
we are able to enumerate all possible solutions. Discrete decision variables are
used to model a variety of different combinatorial problems like assignment prob-
lems (e.g. time tabling), scheduling problems (e.g. traveling salesman problem, job
scheduling problems, or Chinese postman problem), grouping problems (e.g. cutting
and packing problems or lot sizing problems), and selection problems (e.g. knapsack
problems, set partitioning problems, or set covering problems).
Representative optimization methods for combinatorial optimization problems
are decision tree-based enumeration methods and cutting plane methods. The use
of decision trees allows us to formulate the process of problem solution as a se-
quence of decisions on the decision variables of the problem. Common methods
working with decision trees are uninformed and informed graph search algorithms
like depth or A*-search, branch-and-bound approaches, or dynamic optimization.
Cutting plane methods are often based on the Simplex method and add additional
constraints (cuts) to a problem to ensure that the decision variables in the optimal
solution are discrete. The optimization methods presented in this section can not
only be used for linear problems but can also be applied to non-linear optimization
problems.
Section 3.3.1 introduces integer linear problems. This is followed by represen-
tative decision tree methods. In particular, we describe uninformed and informed
tree search methods (Sect. 3.3.2), branch-and-bound methods (Sect. 3.3.3), and dy-
namic programming (Sect. 3.3.4). An overview of the functionality of cutting plane
methods is given in Sect. 3.3.5.
3.3.1 Integer Linear Programs

Integer linear programs (ILP) are LPs where the decision variables are integers.
ILPs in canonical form can be formulated as
    max c^T x        (3.10)
    subject to Ax ≤ b,
    x_i ∈ N_0,

where N_0 is the set of non-negative integers x ≥ 0. Problems are called mixed integer
linear programs (MILP) if there are discrete decision variables x_i and continuous
decision variables y_j. Their canonical form is

    max c^T x + d^T y        (3.11)
    subject to Ax + By ≤ b,
    x_i ∈ N_0,
    y_j ≥ 0.
Both types of problems are discrete optimization problems where some (MILP) or
all (ILP) of the decision variables are not allowed to be fractional.
In general, models using integer variables can be converted to models using only
binary variables xi ∈ {0, 1}. Sometimes, such a conversion provides some nice prop-
erties which can be exploited, but typically more variables are needed to characterize
the same integer problem.
If we drop the integer constraints of an ILP, we get an LP in canonical form which
is called the relaxation of the ILP or the relaxed problem. We can apply standard
methods for continuous, linear problems like the Simplex method to solve relaxed
ILPs. However, usually the optimal solution of a relaxed problem is not integral and
is, thus, not a feasible solution of the underlying ILP.
However, applying LP optimization methods to the relaxation of an ILP and ig-
noring the integer feasibility constraint can give us a first bound on the solution
quality of the underlying ILP. Even more, if the optimal solution obtained for the
LP is integral, then this solution is also the optimal solution to the original ILP. How-
ever, usually some or all variables of the optimal solution for the relaxed problem
are fractional and the optimal solution lies in the interior of the resulting simplex.
Straightforward approaches to obtain an integral feasible solution from a fractional
solution are
• rounding or
• search in the neighborhood of the optimal solution.
Rounding may be a useful strategy when the decision variables are expected to
be large integers and insensitive to rounding. However, rounding can also be com-
pletely misleading as many integer problems are not just continuous problems with
additional integrality constraints, but integer constraints are used to model combi-
natorial constraints, logical constraints, or non-linearities of any sort. Due to the
nature of such ILPs, rounding often does not work as it would defeat the purpose of
the ILP formulation (Papadimitriou and Steiglitz, 1982, Chap. 13). The same holds
true for local search in the neighborhood of the optimal solution. Often the search
space is highly constrained and searching around the optimal solution of the relaxed
problem does not yield the optimal solution of the ILP.
Although solving the relaxed problem usually does not allow us to find the in-
tegral, optimal solution, it gives us a lower (for minimization problems) bound on
the objective value of the optimal solution of an ILP. The gap between the objective
values of the optimal ILP solution and the optimal relaxed solution is called the inte-
grality gap. Obtaining a lower bound on the objective value of the optimal solution
can be helpful to estimate how close we are to the optimal solution. For example,
if we have found a feasible solution for an ILP and know that the gap between the
quality of this solution and the relaxed solution is small, spending much effort in
finding better solutions may not be necessary as the potential improvement is low.
We want to give two examples where rounding or local search around the opti-
mal solution of the relaxed problem does not find the optimal solution. In the first
example, we maximize the objective function f (x) = 6x1 + x2 . There are additional
constraints x1 ≥ 0, x2 ≥ 0, x2 ≤ 4, and 5x1 +x2 ≤ 19. We are searching for an integral
solution, where xi ∈ N0 . Figure 3.5(a) shows the problem. All feasible solutions of
the ILP are denoted as dots. The optimal solution of the relaxed problem is (3.8, 0)
with f (3.8, 0) = 22.8. Rounding this solution would yield (4, 0), which is an infea-
sible solution. The nearest integral feasible solution (using Euclidean distances) is
(3, 0) with f (3, 0) = 18. This is not the optimal solution as the correct, optimal,
solution of the ILP is (3, 4) with f (3, 4) = 22.
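As an illustration, a minimal Python sketch of this first example, assuming SciPy's linprog is available (linprog minimizes, so the objective is negated):

from scipy.optimize import linprog

# max 6*x1 + x2  subject to  5*x1 + x2 <= 19,  x2 <= 4,  x >= 0
c = [-6, -1]                     # negated objective coefficients
A_ub = [[5, 1], [0, 1]]
b_ub = [19, 4]

res = linprog(c, A_ub=A_ub, b_ub=b_ub)   # default bounds are x >= 0
print(res.x, -res.fun)           # approx. (3.8, 0.0) with objective 22.8

# Rounding (3.8, 0) up to (4, 0) violates 5*x1 + x2 <= 19; rounding down
# to (3, 0) gives objective 18, while the ILP optimum (3, 4) has objective 22.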
The second example is a problem where finding the optimal solution of the re-
laxed problem is not at all helpful for solving the original ILP (Papadimitriou and
Steiglitz, 1982; Bonze and Grossmann, 1993). We have an ILP of the form
min x1 + x2 (3.12)
subject to Ax ≥ b,
x1 , x2 ∈ N0 ,
where

A = ( 1 − 2n      2n     )        b = (   n   )
    ( 2n − 1    2(1 − n) )            ( 1 − n ).

For growing n, the optimal solution of the relaxed problem moves arbitrarily far
away from the optimal integral solution, so neither rounding nor local search in the
neighborhood of the relaxed optimum yields the optimal solution of the ILP.

3.3.2 Uninformed and Informed Tree Search
We can use graphs and trees to model combinatorial search spaces if the size of the
search spaces is limited. Each node of the graph represents a possible solution of a
combinatorial optimization problem and an edge between two nodes indicates some
relationship. A common relationship between nodes is the similarity of solutions
with respect to a distance metric (see Sect. 2.3.1). For example, two solutions x and
y are related and an edge exists between them if their Hamming distance d(x, y) = 1.
Formally, a graph G = (V, E) consists of a set V of n = |V | vertices (solutions)
and a set E of m = |E| vertex pairs or edges (relationships). A path is a sequence of
edges connecting two nodes. A graph is connected if there is a path between every
two nodes. Trees are a special variant of graphs where there is exactly one path
between every two nodes. Therefore, a tree T is defined as a connected graph with
no cycles. For a tree T with n nodes, there are exactly n − 1 edges.
Figure 3.6(a) uses a graph to describe the relationships between the different so-
lutions of a combinatorial optimization problem with three variables xi ∈ {0, 1}.
There are 23 = 8 different solutions and we can use a graph to model the search
space (analogously to Sect. 2.3.3, p. 21). As each node represents a possible solu-
tion, the number of solutions equals the number of nodes.
Instead of a graph, we can use a hierarchical tree of depth d to represent a search
space. One node is selected to be the root node. Figure 3.6(b) models the example
using a hierarchical tree. An undirected edge between x and y indicates neighboring
solutions (Hamming distance d(x, y) = 1). The depth d is equivalent to the maximum
distance maxx∈X (d(x, r)) between a solution x and the root node. When using a
hierarchical tree, the same solution can occur multiple times in the tree, as no cycles
are allowed in a tree.
We can also use tree structures to model the process of iteratively assigning val-
ues to the n decision variables. Then, the maximum depth d of a tree is equal to n.
At each level i ∈ {1, . . . , n}, we assign possible values to the ith decision variable.
Tree structures modeling such a decision process are called decision trees. Figure
3.7 shows a possible decision tree for the example from above. In the root node,
all three variables are undecided and we successively assign values to the three bi-
nary decision variables starting with x0 . On the lowest level we can find all possible
solutions for the problem.
The examples demonstrate that we can use tree structures to describe solution
strategies for discrete optimization problems if the number of options for the deci-
sion variables is limited. We can use trees
• either as hierarchical trees where each node represents a complete solution of the
problem (see Fig. 3.6(b)) or
• as decision trees where we assign, at each level, possible values to a decision
variable. Solutions are completed at depth n of the tree (see Fig. 3.7).
When representing solutions of a combinatorial optimization problem using a tree
structure, we can apply graph search (or graph traversal) algorithms to systemati-
cally go through all the nodes in the tree, often with the goal of finding a particular
node, or one with a given property. We can distinguish between
• uninformed search methods which completely enumerate all possible solutions
using a fixed search behavior and
• informed search methods that are problem-specific and use the estimated distance
to the goal node to control the search.
In contrast to unordered linear structures where we can start at the beginning and
work through to the end, searching a tree is more complex as it is a hierarchical
structure modeling relationships between solutions.
Fig. 3.8 Order of node expansion for breadth-first search and depth-first search

Breadth-first search starts with the root node and expands all nodes at one depth of
the tree before moving on to the nodes at the next depth. Thus, nodes are expanded
in the order of their creation; Fig. 3.8(a) shows this order. For a branching factor b
and tree depth d, both the time and space complexity of breadth-first search are
O(b^d).
Depth-first search also starts with the root node and expands the last-created
node. Therefore, depth-first search moves directly to the bottom of the tree and
then successively expands the remaining nodes. Figure 3.8(b) shows the order in
which the nodes are expanded. The time complexity of depth-first search is O(b^d).
Its space complexity of O(bd) is lower than that of breadth-first search, as depth-first
search traverses the tree from “top to bottom” and only stores the nodes on the
current path, whereas breadth-first search traverses the tree from “left to right” and
must store all nodes at the current depth.
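The two uninformed strategies differ only in how the next node to be expanded is chosen. A minimal sketch, assuming the tree is given as a dictionary mapping each node to its children:

from collections import deque

tree = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7], 3: [8, 9]}   # hypothetical example

def breadth_first(root):
    frontier = deque([root])              # FIFO queue: expand the oldest node first
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        frontier.extend(tree.get(node, []))
    return order

def depth_first(root):
    frontier = [root]                     # LIFO stack: expand the newest node first
    order = []
    while frontier:
        node = frontier.pop()
        order.append(node)
        frontier.extend(reversed(tree.get(node, [])))   # leftmost child on top
    return order

print(breadth_first(0))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(depth_first(0))     # [0, 1, 4, 5, 2, 6, 7, 3, 8, 9]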
Uniform-cost search is an uninformed search method which uses a cost that is
associated to each node for controlling the search. Uniform-cost search iteratively
expands the node with the lowest associated cost. It behaves like breadth-first search
if the cost of a node is equal to its distance to the root node. Often, costs are assigned
to edges and the cost of a node is the sum of the costs of the edges on the path be-
tween the node and the root node. Figure 3.9 shows the order in which the nodes
are expanded (the numbers in brackets show the cost of each node and edge, respec-
tively). The space and time complexity of uniform-cost search is O(b^d).
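A sketch of uniform-cost search along the same lines, with a hypothetical example graph whose edges carry costs:

import heapq

edges = {0: [(1, 2), (2, 1), (3, 4)], 1: [(4, 3)], 2: [(5, 1)]}   # node -> [(child, cost)]

def uniform_cost(root):
    frontier = [(0, root)]                # priority queue ordered by path cost g
    order = []
    while frontier:
        g, node = heapq.heappop(frontier) # always expand the cheapest node
        order.append((node, g))
        for child, cost in edges.get(node, []):
            heapq.heappush(frontier, (g + cost, child))
    return order

print(uniform_cost(0))   # [(0, 0), (2, 1), (1, 2), (5, 2), (3, 4), (4, 5)]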
Other examples of uninformed graph search methods are modifications of depth-
first search like iterative-deepening search or depth-limited search and bidirectional
search (for details see for example Russell and Norvig (2002) or Zhang (1999)).
Informed search strategies use an evaluation function h to guide the search through
a tree. h(x) measures or estimates the cost of the shortest path between node x and
a goal node. Finding a path from the root to a goal node is easy if h is accurate as
then we only need to iteratively expand the nodes with minimal h. Representative
examples of informed graph search strategies are
• best-first search and
• A* search.
Both search strategies estimate the minimal distance between node x and a goal node
using an estimation function h(x). If node x is a goal node, then h(x) = 0. h(x) is
called admissible if h(x) ≤ h∗ (x), where h∗ (x) is the true minimal distance between
x and a goal node.
Best-first search starts with the root node and iteratively expands the nodes with
lowest h. If h is accurate (h(x) = h∗ (x)), the goal is found after d expansions (if
we assume that a goal node is located at depth d). Best-first search is called greedy
search (GS) if h estimates the minimum distance to a goal node.
A* search combines uniform-cost search with best-first search. The cost f (x) of
a node x is calculated as f (x) = g(x) + h(x), where g(x) is the distance between root
node and x and h(x) estimates the minimal distance between x and a goal node. A*
always expands the node x with minimal f (x) and stops after a goal node is found.
The complexity of A* depends on the accuracy of the estimation h. If h is accurate,
time complexity is O(bd). However, in general h is inaccurate and time and space
complexity can increase up to O(b^d). If h is admissible, the first goal node that is
found has minimal distance g to the root node.
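A compact sketch of A*, where children, h, and is_goal are problem-specific placeholders to be supplied by the user:

import heapq
import itertools

def a_star(root, children, h, is_goal):
    # children(x) yields pairs (child, edge_cost); h(x) estimates the distance to a goal
    tie = itertools.count()               # tie-breaker so nodes are never compared directly
    frontier = [(h(root), 0, next(tie), root)]
    while frontier:
        f, g, _, node = heapq.heappop(frontier)   # expand node with minimal f = g + h
        if is_goal(node):
            return node, g                # g is minimal if h is admissible
        for child, cost in children(node):
            heapq.heappush(frontier, (g + cost + h(child), g + cost, next(tie), child))
    return None, float('inf')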
In principle, there are two different possibilities for optimization methods to use
decision trees (compare also Sect. 3.3.2): First, each node of a tree represents a
problem solution (for an example, see Fig. 3.6(b), p. 62). This is, for example, the
case for most modern heuristics which search through the search space by iteratively
sampling solutions. Such optimization methods are interested in finding, with mini-
mal effort, a node which represents an optimal solution. Expanding a node x usually
means investigating all neighboring solutions of x. An evaluation function h(x) can,
for example, estimate the minimal distance dx,x∗ between node x and a goal node x∗ ,
which represents an optimal solution. However, such an estimation is often difficult
as we do not know where the optimal solution can be found. More useful evaluation
functions would be, for example, based on the objective values or other properties
of solutions.
Second, starting with a root node where no values are assigned to the decision
variables, optimization methods can search by iteratively assigning values to the de-
cision variables (for an example, see Fig. 3.7, p. 62). To obtain a solution, a method
must traverse the tree down to depth d. In this case, edge costs describe the effect of
setting a decision variable to a particular value. h(x) measures the minimal cost of
a path from a node x, where some of the decision variables are fixed, to an optimal
solution, which can be found at depth d in the decision tree.
3.3.2.3 Example
We want to compare the behavior of the different search methods for the traveling
salesman problem (TSP). The TSP is a well-studied and common NP-hard combi-
natorial optimization problem. It can be formulated as follows: given a collection of
cities (nodes) and distances d(i, j) between each pair of nodes i and j, the TSP is to
find the cheapest way of visiting all of the cities exactly once and returning to the
starting point. A TSP is called symmetric if d(i, j) = d( j, i). The size of the search
space grows super-exponentially, as for symmetric problems there are (n − 1)!/2
different solutions. A solution can be represented as a permutation of the n cities,
which is also called a tour.
We want to study a symmetric problem with four cities (n = 4) denoted a, b, c,
and d. The distances d(i, j) are shown in Fig. 3.10. There are only 3!/2 = 3 different
solutions (adbca with cost 12, abdca with cost 17, and adcba with cost 11).
We can model the search space using a decision tree where we start with an
empty tour and successively assign a city to a tour. We assume that the tour always
starts with city a. The cost of an edge between nodes i and j is equivalent to the
distance di, j . Figure 3.11 shows the resulting tree. The numbers next to the edges
are the edge costs. The node labels represent the selected city and the node number.
The number in brackets is g(x) (cost of path from root to node x). The optimal
solution x∗ is represented twice (abcda (node 16) and adcba (node 21)) and has
cost f (x∗ ) = d(a, b) + d(b, c) + d(c, d) + d(d, a) = 11.
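As the instance is tiny, the tree and all tour costs can also be checked by brute force. A short sketch, with the distance values reconstructed from the computations in the text:

from itertools import permutations

d = {('a', 'b'): 6, ('a', 'c'): 8, ('a', 'd'): 1,
     ('b', 'c'): 2, ('b', 'd'): 1, ('c', 'd'): 2}

def dist(i, j):
    return d.get((i, j)) or d[(j, i)]     # symmetric distances

def tour_length(tour):
    return sum(dist(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour)))

for perm in permutations('bcd'):          # tours always start in city a
    tour = ('a',) + perm
    print(''.join(tour) + 'a', tour_length(tour))
# abcda 11, abdca 17, acbda 12, acdba 17, adbca 12, adcba 11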
We want to illustrate how different search strategies traverse the tree. Breadth-
first search expands the nodes in order of creation: 0-1-2-3-4-5-6-7-8-9-10-11-12-
13-14-15. Depth-first search expands the nodes
in the order 0-1-4-10-5-11-2-6-12-7-13-3-8-14-9-15. Uniform-cost search considers
the distance g(x) between node x and root node. g(x) measures the length of the
partial tour represented by node x. Uniform-cost search expands the nodes in the
order 0-3-8-9-14-15-1-5-4-2-11-6-7-10. The nodes 4 and 2 as well as 6, 7, and 10
have equal cost g and are, thus, expanded in random order. The nodes 12 and 13
need not be expanded since their cost g is equal to the cost of an already found
solution.
Informed search methods estimate the minimal distance to a goal state. The es-
timation function h is accurate if it correctly predicts the length of the optimal tour.
However, as we do not know the optimal solution, we need an estimate h(x). We
assume that h(x) is the length of the shortest path from city x to city a containing
no cities that are already part of the tour. In general, h(x) ≤ h∗ (x). h provides an
inaccurate estimation but it allows us to demonstrate the functionality of informed
search methods.
Figure 3.12 illustrates the functionality of greedy search. It uses the node num-
bers introduced in Fig. 3.11. The numbers in brackets indicate h(x). Greedy search
starts with expanding node 0. We obtain h(1) = 1+1 = 2 (we go back to city a using
the path bda), h(2) = 2 + 1 = 3 (we go back to a using cda), and h(3) = 1 + 6 = 7
(we go to a using path dba). Consequently, we continue with expanding node 1 and
get h(4) = 3 (we use path cda) and h(5) = 1 (we directly go back to a from d).
We continue with expanding node 5 and get h(11) = 8. Next, we expand nodes 4
and 2 to obtain node 10 with h(10) = 1 and the nodes 6 and 7 with h(6) = 2 and
h(7) = 1. Finally, we expand nodes 10 and 7 to obtain node 16 with h(16) = 0 and
node 13 with h(13) = 6. Therefore, we have found a solution (node 16), terminate
the search, and return abcda. Although in this example, greedy search returns the
optimal solution, this is not always the case. Greedy search is myopic and often gets
stuck in local optima, not returning an optimal solution.
We can also model the search space of the TSP as a hierarchical tree in which each
node represents a complete tour and exchanging the position of two cities in a tour
results in a neighboring solution. Given n cities, the maximum depth d of the tree
is n − 2. The branching factor b of a node depends on the level of the node and
decreases with increasing tree depth.
We can use standard uninformed and informed tree search methods to traverse
the tree. Uninformed search methods just enumerate all possible solutions and return
the optimal solution with minimum tour length. Informed search methods need an
estimate of the minimal distance to a goal node. The goal node is the optimal solu-
tion x∗. If we expand the nodes in the order of the objective value of the represented
tour, we are performing a greedy search through the search space. We start at a ran-
dom solution and expand it. Expanding a node means generating all neighboring
solutions. Then, we calculate the objective value of all neighbors and continue by it-
eratively expanding the best neighboring solution. The search stops at a node where
all expanded nodes have a higher objective value. This node is a local optimum. In
our example, greedy search starts with expanding node abcda. Since all expanded
nodes have the same (adcba) or higher objective value, greedy search returns either
abcda or adcba as optimal solutions.
3.3.3 Branch and Bound

Besides complete enumeration of the search space, which can be performed us-
ing the search strategies discussed in the previous section, a common enumerative
approach is branch-and-bound. The general idea behind branch-and-bound algo-
rithms, which can also be used for non-linear problems, is to recursively decompose
a problem into subproblems. For example, when solving ILPs, we can obtain sub-
problems by fixing or introducing additional constraints on the decision variables.
By subsequently adding constraints to the original problem, we obtain subproblems,
and the relaxed versions of these subproblems are solved using LP methods. The pro-
cess of subsequently adding additional constraints to the original problem is called
branching and can be modeled using hierarchical tree structures.
Bounding refers to removing (also called killing or fathoming) subproblems from
further consideration. Subproblems that have been killed are not considered any
more and are not decomposed into subproblems. Subproblems are killed if a bound
on their attainable objective value (e.g., obtained by solving the relaxed version of
the subproblem) is below an existing lower bound (for a maximization problem).
For maximization problems, the objective
values of feasible integral solutions can be used as lower bounds on the objective
value.
Branch-and-bound methods were introduced by Land and Doig (1960) and the
first practical implementation was presented by Dakin (1965). Detailed descriptions
of branch-and-bound methods can be found in standard OR and optimization lit-
erature (Papadimitriou and Steiglitz, 1982; Bronson and Naadimuthu, 1997; Hillier
and Lieberman, 2002). Commercial branch-and-bound implementations for solv-
ing ILPs start with solving the relaxed version of the original ILP using LP solving
methods like the Simplex method (see Sect. 3.2.2). The solution of the relaxed prob-
lem is the optimal solution of the ILP if it is integral. The original ILP is infeasible
if the relaxed problem cannot be solved. Otherwise, at least one of the integer vari-
ables of the relaxed optimal solution is fractional. Commercial solvers usually use
one fractional variable and create two subproblems such that the fractional solution
is excluded but all feasible integer solutions still remain feasible. These new ILPs
with an additional constraint represent nodes in a branching tree. The nodes in the
search tree are iteratively expanded by adding additional constraints to the subprob-
lems and solving the relaxed versions of the subproblems. Nodes can be killed if the
solution to a subproblem is infeasible, satisfies all integrality restrictions, or has an
objective function value worse than the best known integral solution.
Therefore, we can formulate the branch-and-bound method for solving ILPs
(maximization problems) as follows:
1. Solve the linear relaxation of the original ILP. This gives an upper bound on the
objective value of the optimal solution. The lower bound is set to −∞. If the
obtained solution is integral, stop, and return the optimal solution.
2. Create two new subproblems by branching on a fractional variable. Solving the
linear relaxations of the two subproblems returns upper bounds on the two sub-
problems. A subproblem can be killed (or fathomed) when any of the following
occurs:
• All variables in the solution for the relaxed subproblem are integral. If the
objective value is larger than the existing lower bound, it replaces the existing
lower bound.
• The relaxed subproblem is infeasible.
• The objective value of the fractional solution is below the current lower bound.
Then, the subproblem is killed by a bounding argument.
3. If any subproblems exist that are not yet killed (we call such subproblems active),
choose one and continue with step 2. Otherwise, stop and return the current lower
bound as the optimal solution.
Branch-and-bound is an exhaustive search method and yields the optimal solution
to an ILP if it exists. The maximum depth of the resulting decision tree is n, where n
is the number of decision variables. Therefore, the time complexity of branch-and-
bound is exponential (O(2^n)) as, in the extreme case, we have to branch on all n
decision variables. When implementing branch-and-bound methods, there are three
important aspects:
• branching nodes,
• bounding solutions, and
• traversing the tree.
We briefly discuss these three aspects in the following paragraphs. Branching par-
titions a set of solutions into two mutually exclusive sets. When using search trees,
each subset in the partition is represented by a child of the original node. For ILPs,
linear relaxation of the problem can result in a solution with a non-integral variable
xi∗ . Branching on the variable xi∗ creates two mutually exclusive ILPs with either
the constraint xi ≤ ⌊xi∗ ⌋ or the constraint xi ≥ ⌈xi∗ ⌉. This branching step shrinks the
feasible region such that the current non-integral solution for xi is eliminated from
further consideration but still all possible integral solutions to the original problem
are preserved.
For bounding, we need an algorithm that calculates a bound on the objective value
of any solution in a given set. In general, a bound can be obtained by calculating
any feasible solution of the original problem. For ILPs, we obtain a lower bound by
finding a feasible, integral solution. All ILPs whose relaxed problems yield values
of the objective function lower than the lower bound can be discarded. All such
problems are excluded from further consideration.
We can order the different ILPs using a search tree with the original ILP as a
root node. Branching creates two child nodes and bounding discards branches of
the tree from further consideration. As usually more than one node of the search
tree is active (this means it is not yet fathomed or killed), we need a search strat-
egy for deciding which active node should be expanded next. For this task, we can
use for example uninformed search methods described in Sect. 3.3.2. Other search
strategies are to expand the node with the highest number of integral variables in
the optimal solution or the node with the highest objective value. After choosing an
active node to be expanded, we also have to choose a variable for which an addi-
tional constraint should be added (usually not only one variable is fractional). There
are different strategies available (for an overview see Lee and Mitchell (2001) or
Lee (2002)) but often the fractional variable is chosen that is furthest from being an
integer (the fractional part is closest to 0.5).
The remainder of this section gives an example of how to use branch-and-
bound to solve an ILP. We want to find the optimal solution for the following two-
dimensional ILP:

max 2x1 + x2 (3.13)
subject to x1 + 4x2 ≤ 9,
8x1 + 2x2 ≤ 19,
x1 , x2 ∈ N0 .

Figure 3.15 plots the original ILP and indicates the optimal solution of the ILP and
of the relaxed ILP. Solving the relaxed ILP (we drop the integrality constraints and
use the Simplex method for the resulting LP), we find the optimal fractional solution
x∗ = (29/15, 53/30) with f (x∗ ) = 169/30 ≈ 5.63. Therefore, we have an upper
bound on the objective value of the optimal solution.
As x2∗ = 53/30 is fractional, we branch on x2 and create two subproblems from the
original problem by adding the constraint x2 ≥ 2 (subproblem 2) or the constraint
x2 ≤ 1 (subproblem 3).
Figure 3.16 shows the two resulting subproblems. The optimal solution x∗ of the
relaxed subproblem 2 is integral with x1 = 1 and x2 = 2. The objective value of the
optimal solution is f (x∗ ) = 4. As the optimal solution for the relaxed subproblem
2 is integral, we have a lower bound on the solution quality. Therefore, all other
active subproblems where the optimal solution of the relaxed problem has a lower
objective value can be killed and removed from further consideration. In our case, no
other subproblems can be killed as the optimal solution of the relaxed subproblem
3 is x∗ = (17/8, 1) with f (x∗ ) = 5.25 > 4.
As subproblem 3 is the only active problem (subproblem 2 is not active any more
as the optimal solution for the relaxed problem is integral), we have to split it up into
two exclusive subproblems.

Fig. 3.16 Two mutually exclusive subproblems created from the original problem (3.13). The
problems are created by adding the constraints x2 ≥ 2 and x2 ≤ 1

The only fractional variable of the optimal solution of
the relaxed subproblem 3 is x1∗ = 17/8. Therefore, we create two new subproblems
from subproblem 3 by adding the constraint x1 ≤ 2 and the constraint x1 ≥ 3. The
two resulting subproblems are subproblem 4 (with x1 ≤ 2) and subproblem 5 (with
x1 ≥ 3).

Fig. 3.17 Subproblem 4 created from subproblem 3 by adding the constraint x1 ≤ 2
Fig. 3.18 Decision tree when using branch-and-bound for solving the ILP from (3.13)

The optimal solution of relaxed subproblem 4 is integral with x∗ = (2, 1) and
f (x∗ ) = 5 > 4, which replaces the existing lower bound. Relaxed subproblem 5 is
infeasible, as x1 ≥ 3 violates 8x1 + 2x2 ≤ 19. Since no active subproblems remain,
branch-and-bound terminates and returns x∗ = (2, 1) with f (x∗ ) = 5 as the optimal
solution of (3.13).
3.3.4 Dynamic Programming

Dynamic programming solves a problem by decomposing the solution process into
single stages and relating the stages by a recursion. We illustrate its functionality
for problems of the form

max ∑_{i=1}^{n} fi (xi ) (3.18)
subject to ∑_{i=1}^{n} xi ≤ b,
xi ∈ N0 ,

where fi (xi ) are (non-linear) functions of single variables. All problems, linear or
non-linear, that have this form can be formulated as a multistage process with n
stages. In each stage (except the first one), we have b + 1 different states u, where
u ∈ {0, . . . , b}. The value of state u j chosen in stage j is determined by the integral
decision variable x j ∈ {0, 1, . . . , b}. For j > 1, the number of states in stage j equals
the number of possible values of the jth decision variable, |u j | = |x j |. In the first
stage, we have only a single state b. In stage j ∈ {1, . . . , n}, we specify possible
values for the decision variable x j that determine the state u j and contribute with
f j (x j ) to the overall return.
When solving this multistage problem using dynamic programming, we must for-
mulate a recursion on how the optimum return from completing the process depends
on beginning at state u in stage j. This recursion is problem-dependent and different
for different types of multistage processes. In general, for formulating a recursion
we need a termination criterion and a recursion step. The termination criterion de-
termines how in the last stage the return depends on the state u. The recursion step
relates the return of a state in stage j to the return of stage j + 1. For problems in the
form (3.18), we can calculate in the last stage n the optimum return mn (u) of state u
as
mn (u) = max_{0 ≤ xn ≤ u} ( fn (xn )), (3.19)
where u = 0, 1, . . . , b. u is the state variable, whose values specify the states. We also
need a recursion step for calculating the optimum return m j (u) from completing the
process beginning at stage j in state u:

m j (u) = max_{0 ≤ x j ≤ u} ( f j (x j ) + m j+1 (u − x j )), (3.20)

where j = n − 1, . . . , 1. We can solve the problem by starting with stage n and recur-
sively determining m j (u). The optimal solution x∗ has an objective value of m1 (b).
We want to illustrate the functionality of the approach for an example (non-linear
problem) which is defined as
max z = 1.2 √x1 + log(1 + 20x2 ) + √x3 (3.21)
subject to x1 + x2 + x3 ≤ 3,
with xi ∈ N0 ,
where f1 (x1 ) = 1.2 √x1 , f2 (x2 ) = log(1 + 20x2 ), and f3 (x3 ) = √x3 . This is a non-
linear multistage problem with n = 3 stages and a maximum of b + 1 = 4 different
states at each stage. When solving the problem, we start with the last stage of the
process, assuming that all previous steps (stages 1 and 2) have been completed.
Therefore, we have four possible states and x3 can be either 0, 1, 2, or 3. Using
(3.19), we get
m3 (3) = max( f3 (0), f3 (1), f3 (2), f3 (3)) = max(√0, √1, √2, √3) = √3,
m3 (2) = max( f3 (0), f3 (1), f3 (2)) = max(√0, √1, √2) = √2,
m3 (1) = max( f3 (0), f3 (1)) = max(√0, √1) = 1,
m3 (0) = max( f3 (0)) = √0 = 0.
We continue with stage 2 and use (3.20) to calculate the maximum returns for the
four possible states:

m2 (3) = max( f2 (0) + m3 (3), f2 (1) + m3 (2), f2 (2) + m3 (1), f2 (3) + m3 (0))
≈ max(1.73, 2.74, 2.61, 1.79) = 2.74,
m2 (2) = max( f2 (0) + m3 (2), f2 (1) + m3 (1), f2 (2) + m3 (0)) ≈ max(1.41, 2.32, 1.61) = 2.32,
m2 (1) = max( f2 (0) + m3 (1), f2 (1) + m3 (0)) ≈ max(1, 1.32) = 1.32,
m2 (0) = max( f2 (0)) = 0.

After completing stage 2, we turn to stage 1. There is only one state associated with
this stage, u = 3. We get

m1 (3) = max( f1 (0) + m2 (3), f1 (1) + m2 (2), f1 (2) + m2 (1), f1 (3) + m2 (0))
≈ max(2.74, 3.52, 3.02, 2.08) = 3.52.

Thus, the optimal solution x∗ = (1, 1, 1) has an objective value of f (x∗ ) = 2.2 +
log(21) ≈ 3.5.
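The recursion can be transcribed almost literally into code. A sketch for example (3.21), assuming base-10 logarithms (consistent with f(x∗) ≈ 3.5) and indexing the stages from 0:

import math
from functools import lru_cache

b, n = 3, 3
f = [lambda x: 1.2 * math.sqrt(x),       # f1
     lambda x: math.log10(1 + 20 * x),   # f2
     lambda x: math.sqrt(x)]             # f3

@lru_cache(maxsize=None)
def m(j, u):
    # optimum return when stages j, ..., n-1 remain and u units are still available
    if j == n - 1:                        # termination criterion (3.19)
        return max(f[j](x) for x in range(u + 1))
    return max(f[j](x) + m(j + 1, u - x)  # recursion step (3.20)
               for x in range(u + 1))

print(round(m(0, b), 2))                  # 3.52, attained by x = (1, 1, 1)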
As a second example, we want to formulate the TSP as a dynamic program
(Held and Karp, 1962). There are n cities and we start the tour from city o. For-
mulating this problem as a multistage problem, we have n − 1 different stages and
in each stage we have to decide on the next city we want to visit. We only need n − 1
stages as the start city o is randomly fixed and, thus, need not be considered for the
recursion. A state u is represented by a pair (i, S), where S is the set of j cities that
are already visited and i is the city visited in stage j (i ∈ S). Assuming that the cities
are labeled from 1 to n, we can define the termination criterion for the recursion as

m_{n−1} (i, S) = d(i, o), (3.22)

where o is a randomly chosen start city, S = {1, . . . , n} − {o}, and i ∈ {1, . . . , n} −
{o}. d(i, j) denotes the distance between two cities i and j. For other stages, the
recursion is
m j (i, S) = min_{l ≠ o, l ∉ S} (d(i, l) + m j+1 (l, S ∪ {l})). (3.23)
For the example problem from Fig. 3.10 with start city o = a, we can use (3.22)
to calculate the optimal return m3 (i, S) from completing the process at stage 3:

m3 (b, {b, c, d}) = d(b, a) = 6,
m3 (c, {b, c, d}) = d(c, a) = 8,
m3 (d, {b, c, d}) = d(d, a) = 1.
In stage 2, there are six states and their optimal return m2 (i, S) can be calculated
according to (3.23) as:
m2 (b, {b, c}) = (d(b, d) + m3 (d, {b, c} ∪ {d})) = d(b, d) + m3 (d, {b, c, d})
= 1 + 1 = 2,
m2 (b, {b, d}) = (d(b, c) + m3 (c, {b, d} ∪ {c})) = d(b, c) + m3 (c, {b, c, d})
= 2 + 8 = 10,
m2 (c, {b, c}) = (d(c, d) + m3 (d, {b, c} ∪ {d})) = d(c, d) + m3 (d, {b, c, d})
= 2 + 1 = 3,
m2 (c, {c, d}) = (d(c, b) + m3 (b, {c, d} ∪ {b})) = d(c, b) + m3 (b, {b, c, d})
= 2 + 6 = 8,
m2 (d, {c, d}) = (d(d, b) + m3 (b, {c, d} ∪ {b})) = d(d, b) + m3 (b, {b, c, d})
= 1 + 6 = 7,
m2 (d, {b, d}) = (d(d, c) + m3 (c, {b, d} ∪ {c})) = d(d, c) + m3 (c, {b, c, d})
= 2 + 8 = 10.
In stage 1, we obtain

m1 (b, {b}) = min(d(b, c) + m2 (c, {b} ∪ {c}), d(b, d) + m2 (d, {b} ∪ {d}))
= min(2 + 3, 1 + 10) = 5,
m1 (c, {c}) = min(d(c, b) + m2 (b, {c} ∪ {b}), d(c, d) + m2 (d, {c} ∪ {d}))
= min(2 + 2, 2 + 7) = 4,
m1 (d, {d}) = min(d(d, b) + m2 (b, {d} ∪ {b}), d(d, c) + m2 (c, {d} ∪ {c}))
= min(1 + 10, 2 + 8) = 10.
Although the recursion is completed, we have not yet considered traveling from
city a to the first city. We can either introduce an additional step in the recursion or
calculate modified optimal returns d(a, i) + m1 (i, {i}). We get

d(a, b) + m1 (b, {b}) = 6 + 5 = 11,
d(a, c) + m1 (c, {c}) = 8 + 4 = 12,
d(a, d) + m1 (d, {d}) = 1 + 10 = 11.
We have found two optimal solutions (abcda and adcba) with tour cost 11.
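The recursion (3.22)/(3.23) translates directly into code. A sketch for the example instance, with the distances reconstructed from the computations above:

from functools import lru_cache

cities = ['a', 'b', 'c', 'd']
d = {('a', 'b'): 6, ('a', 'c'): 8, ('a', 'd'): 1,
     ('b', 'c'): 2, ('b', 'd'): 1, ('c', 'd'): 2}
dist = lambda i, j: d.get((i, j)) or d[(j, i)]

@lru_cache(maxsize=None)
def m(i, visited):
    # minimum cost of completing the tour from city i, having visited `visited`
    if len(visited) == len(cities) - 1:   # termination (3.22): return to start city a
        return dist(i, 'a')
    return min(dist(i, l) + m(l, visited | frozenset([l]))
               for l in cities if l != 'a' and l not in visited)

print(min(dist('a', i) + m(i, frozenset([i])) for i in 'bcd'))   # 11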
This section illustrated the functionality of dynamic programming which can be
used to solve ILPs as well as non-linear problems if they can be formulated as a
multistage process. A problem is solved by decomposing the solution process into
single steps and defining a recursion that relates the different steps. Dynamic pro-
gramming is an exact approach that enumerates the search space and intelligently
discards inferior solutions. The time and space complexity of dynamic programming
depends on the problem considered but is usually much lower than complete enu-
meration. For further details on dynamic programming, we refer to Bellman (2003)
and Denardo (2003). A nice summary including additional exercises can be found
in Bronson and Naadimuthu (1997, Chap. 19).
3.3.5 Cutting Plane Methods

Cutting plane methods are approaches for solving ILPs that can either be used alone
or in combination with other techniques like branch-and-bound. Cutting plane meth-
ods are based on the idea of adding constraints (cutting planes) to a problem
such that infeasible fractional solutions (including the optimal, but fractional, solu-
tion) are removed but all integer solutions remain feasible. When performing such
iterative cuts by adding constraints to an ILP, the original set of constraints is re-
placed by alternative constraints that are closer to the feasible integral solutions and
exclude fractional solutions.
In Sect. 3.2.2, we discussed how a convex polyhedron, which represents the feasi-
ble search space of an LP, can be described as the intersection of half-spaces (Weyl,
1935), where each half-space is defined by a constraint. When using the Simplex
method for solving LPs, we move on the convex hull defined by the half-spaces and
iteratively examine corner points until we find the optimal solution. By analogy to
LPs, the feasible search space of an ILP can also be described by using a convex
hull where the corner points are integral solutions. Thus, if we can find a set of lin-
ear inequalities that completely defines the feasible search space of an ILP such that
the corner points are integral, then we can solve an ILP using the Simplex method.
The goal of cutting plane methods is to obtain that integral convex hull from the
fractional convex hull of the underlying linear program by introducing additional
constraints that cut away fractional corner points.
In general, there are two different ways to generate cuts. First, we can gener-
ate cuts based on the structure of the problem. Such cuts are problem-specific and
have to be generated separately for each problem. However, once good cuts are
found, these cuts provide very efficient solution techniques. An example is problem-
specific cuts for the TSP (Dantzig et al, 1954) which inspired branch-and-cut al-
gorithms (Grötschel et al, 1984; Padberg and Rinaldi, 1991). Branch-and-cut algo-
rithms are problem-specific methods that combine cutting plane methods with a
branch-and-bound algorithm. Branch-and-cut methods systematically attempt to ob-
tain stronger (in the sense that the optimal solution is less fractional) LP relaxations
at every node of the search tree by introducing additional cutting planes.
The second way is to iteratively solve the relaxed ILP using the Simplex method
and generate cuts that remove the optimal fractional solution from the Simplex.
Such approaches are not problem-specific and can be applied to any ILP. The first
successful approach following this idea was presented by Gomory (1958) who de-
veloped a cutting plane method for ILPs which obtains the integral convex hull after
applying a sequence of cuts (linear constraints) on the fractional convex hull (Go-
mory, 1960, 1963). Chvátal (1973) showed that this procedure always terminates
in a finite number of steps and converges to an optimal solution obtaining the in-
tegral convex hull. However, the number of steps can be very large due to the fact
that these algebraically-derived cuts are often weak in the sense that the area that is
cut from the Simplex is small. Furthermore, the performance of cutting plane algo-
rithms is limited as the minimal number of inequalities that are necessary to describe
the integral convex hull increases exponentially with the number of decision vari-
ables. However, usually we do not need to correctly describe the complete convex
hull but only need a partial, accurate description of the convex hull in the neighbor-
hood of the optimal solution. Therefore, in practice, cutting plane approaches show
good performance for a variety of different combinatorial optimization problems
and many implementations for solving ILPs use cutting plane methods. Usually,
cutting plane methods for ILPs iteratively perform three steps:
1. We drop the integral constraint and solve the relaxed ILP obtaining the optimal
solution x∗ .
2. If the relaxed ILP is unbounded or infeasible, so is the original ILP. If x∗ is
integral, the problem is solved and we can stop.
3. If not, we add a linear inequality constraint to the problem such that all integral
solutions remain feasible and x∗ becomes infeasible. We continue with step 1.
The main difficulty of cutting plane methods is to generate a proper constraint. The
following paragraphs describe how the Gomory cutting plane algorithm (Gomory,
1958) generates cuts for ILPs in standard form:
min cT x (3.24)
subject to Ax = b,
xi ∈ N0 .
In this formulation, the vector x contains both the original set of decision variables
as well as the slack variables (see Sect. 3.2.2). b is assumed to be integral.
We can solve the relaxed ILP using the Simplex method and obtain the optimal
solution x∗ . Let m be the number of constraints, which is equal to the number of
rows of A. We assume that the variables {x1 , . . . , xm } are basis variables and the
variables {xm+1 , . . . , xn } are non-basis variables. Then, we can reformulate problem
(3.24) as
Ixb + Āx f = b̄ (3.25)
If the constraints are in form (3.25), then xi∗ = b̄i as we can set the non-basis
variables x j , j ∈ {m + 1, . . . , n}, to zero. Therefore, to get an optimal integral
solution, the b̄i must be integral.
Consequently, we are now able to formulate a Gomory cut as

xi + ∑_{j=m+1}^{n} ⌊āi j ⌋ x j ≤ ⌊b̄i ⌋, (3.27)

or, after subtracting (3.27) from the ith row of (3.25), equivalently as

∑_{j=m+1}^{n} (āi j − ⌊āi j ⌋) x j ≥ b̄i − ⌊b̄i ⌋. (3.28)

In the formulation (3.28) of the cut, no basis variables are included. If the optimal solution
x∗ is fractional, it does not satisfy constraint (3.27) or (3.28). However, no feasible
integral solutions are removed from the search space using this cut.
We want to use the problem defined in (3.13) to study Gomory cuts. After in-
troducing the slack variables x3 and x4 , we can formulate the constraints for this
maximization problem as
x1 + 4x2 + x3 = 9
8x1 + 2x2 + x4 = 19. (3.29)
Solving the relaxed problem with the Simplex method makes x1 and x2 basis
variables; bringing the constraints into form (3.25) yields

x1 − (1/15) x3 + (2/15) x4 = 29/15 (3.30)
x2 + (4/15) x3 − (1/30) x4 = 53/30.
As already seen in Sect. 3.3.3, the optimal solution of the relaxed problem is x∗ =
(29/15, 53/30).
Now we want to find an additional constraint (Gomory cut) for this problem. As
both basis variables x1∗ and x2∗ are fractional, we have two possible cuts. According
to (3.27) we get
cut 1: x1 + ⌊−1/15⌋ x3 + ⌊2/15⌋ x4 ≤ ⌊29/15⌋, (3.31)
cut 2: x2 + ⌊4/15⌋ x3 + ⌊−1/30⌋ x4 ≤ ⌊53/30⌋. (3.32)
We choose cut 1 and obtain the constraint x1 − x3 ≤ 1. Using (3.29), cut 1 can
be formulated as 2x1 + 4x2 ≤ 10. Figure 3.19(a) shows the resulting problem with
the new cut. The optimal solution x∗ of the relaxed problem is x∗ = (2, 3/2) with
f (x∗ ) = 5.5. As the optimal solution is not yet integral, we want to find another cut.
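The floor operation behind such cuts is purely mechanical. A tiny sketch, assuming the tableau row of a basis variable is already available:

import math

def gomory_cut(row, rhs):
    # for x_i + sum_j a_j*x_j = rhs, return the cut x_i + sum_j floor(a_j)*x_j <= floor(rhs)
    return [math.floor(a) for a in row], math.floor(rhs)

# row of x1 in (3.30): x1 - (1/15)*x3 + (2/15)*x4 = 29/15
print(gomory_cut([-1/15, 2/15], 29/15))   # ([-1, 0], 1), i.e., cut 1: x1 - x3 <= 1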
We add cut 1 to the set of constraints (3.30) and obtain
x1 − (1/15) x3 + (2/15) x4 = 29/15
x2 + (4/15) x3 − (1/30) x4 = 53/30 (3.33)
x1 − x3 + x5 = 1.
Again we have to bring the constraints into form (3.25). This transformation yields
x1 + (1/7) x4 − (1/14) x5 = 2
x2 − (1/14) x4 + (2/7) x5 = 3/2 (3.34)
x3 + (1/7) x4 − (15/14) x5 = 1.
x1 , x2 , and x3 are the basis variables. As only x2∗ is fractional, there is only one
possible cut, x2 − x4 ≤ 1. Using (3.29), we can reformulate this cut as 8x1 + 3x2 ≤ 20.
Figure 3.19(b) shows the cut and the optimal solution of the relaxed ILP, x∗ =
(25/13, 20/13) with f (x∗ ) = 70/13.
Fig. 3.19 Two Gomory cuts for the problem defined in (3.13)
Here we will stop generating more Gomory cuts for the example problem and
leave finding additional cuts to the reader. The example illustrates the relevant prop-
erties of cutting plane methods. Adding cuts to the problem reduces the size of
the feasible search space and removes fractional solutions. However, such improve-
ments are usually small and often we need a large number of cuts to find an optimal
integral solution.
Table 3.1 Effort for solving TSP problems using dynamic programming

problem size n    |  50 |   51 |   52 |   55 |      60 |        70 |        100 |      200
necessary effort  |   1 | 2.04 | 4.16 | 35.2 | 1,228.8 | 1.47×10^6 | 2.25×10^15 | 5.7×10^45
3.4 Heuristic Optimization Methods

As many interesting problems are intractable and only small problem instances
can be solved exactly, methods have been developed that do not guarantee finding
an optimal solution but whose running time is lower than the running time of ex-
act optimization methods. Usually, such methods are problem-specific. We want to
denote optimization methods that do not guarantee to find the optimal solution and
which use problem-specific (heuristic) information about the problem as heuristic
optimization methods. The necessity to use problem-specific information in heuris-
tic optimization methods is a result of the no-free-lunch theorem. The theorem says
that, averaged over all possible optimization problems, all search methods show the
same performance; consequently, heuristic optimization methods can outperform
random search only if they exploit problem-specific knowledge. We can distinguish
between construction heuristics and improvement heuristics: construction heuris-
tics use problem-specific rules for the construction of a solution and improvement
heuristics apply problem-specific search operators and search strategies.
Approximation algorithms are heuristic optimization methods that return an ap-
proximate solution with guaranteed solution quality. Therefore, for approximation
algorithms we are able to provide a bound on the quality of the returned solution.
The main difference between heuristics and approximation algorithms is the exis-
tence of a bound on the solution quality. If we are able to formulate a bound, a
heuristic “becomes” an approximation algorithm.
Finally, we denote improvement heuristics that use a general, problem-invariant,
and widely applicable search strategy as modern heuristics (Polya, 1945; Romany-
cia and Pelletier, 1985; Reeves, 1993; Rayward-Smith et al, 1996; Rayward-Smith,
1998; Michalewicz and Fogel, 2004). Such methods are denoted as modern heuris-
tics since they define a strategy for searching through the search space on a meta-
level. Therefore, they are, especially in the OR community, also often denoted as
metaheuristics (Glover, 1986). Modern heuristics often imitate the functionality
of search strategies observed in other domains (for example in biology, nature, or
physics). Characteristic properties of modern heuristics are that the same search
concept can successfully be applied to a relatively wide range of different problems
and that intensification and diversification are alternately used in the search. Dur-
ing intensification, modern heuristics focus their search on promising areas of the
search space and during diversification new areas of the search space are explored.
A precise distinction between improvement heuristics and modern heuristics is
difficult as the concepts are closely related and in the literature no consistent naming
convention exists. Usually, modern heuristics are defined problem-independently,
whereas heuristics are explicitly problem-specific and exploit problem structure.
Therefore, the design and application of high-quality heuristics is demanding, since
problem-specific properties must be known and actively exploited. In contrast, stan-
dard modern heuristics can easily be applied to different problems with little or
no modification. There is a trade-off between ease of application and effectiveness.
Problem-specific heuristics are more difficult to develop and apply in comparison
to modern heuristics. However, if problem-specific knowledge is considered appro-
priately, they often outperform standard modern heuristics. Later in this book, we
discuss what types of problem characteristics are exploited by modern heuristics
(e.g. locality or decomposability) and how the performance of modern heuristics
can also be improved by problem-specific knowledge.
The following sections provide a brief overview of the different types of heuristic
optimization methods. We start with heuristics and continue with approximation
algorithms (Sect. 3.4.2) and modern heuristics (Sect. 3.4.3). The chapter ends with
the no-free-lunch theorem.
3.4.1 Heuristics
Construction heuristics build solutions step by step; in the TSP example from
Sect. 3.3.2.3, in each iteration one city is added to a tour and the heuristic iteratively
chooses a city that minimizes the estimated length of a tour back to the starting city.
Improvement heuristics start with a complete solution and iteratively try to im-
prove the solution. In contrast to modern heuristics where improvement steps alter-
nate with diversification steps (diversification steps usually lead to solutions with a
worse objective value), improvement heuristics use no explicit diversification steps.
Often, improvement heuristics perform only improvement steps and stop if the cur-
rent solution cannot be improved any more and a local optimum has been reached.
By performing iterative improvement steps, we define a metric on the search space
(see Sect. 2.3.2) as all possible solutions that can be reached in one improvement
step are neighboring solutions. For the design of appropriate neighborhoods of im-
provement heuristics, we can use the same criteria as for modern heuristics (see
Chap. 4.2). An example of an improvement heuristic is given in Fig. 3.14, p. 68. An
improvement heuristic for the TSP starts with a complete solution and iteratively
examines all neighboring solutions. It stops at a local optimum, where all neighbors
have a worse objective value.
Both types of heuristics, improvement heuristics as well as construction heuris-
tics, are often based on greedy search. Greedy search (see also Sect. 3.3.2) is an
iterative search approach that uses a heuristic function h to guide the search process.
h estimates the minimal distance to an optimal solution. In each search step, greedy
search chooses the solution where the heuristic function becomes minimal, i.e. the
improvement of the solution is maximal. A representative example of greedy search
is best-first search (Sect. 3.3.2).
As heuristics are problem-specific, only representative examples can be given.
In the following paragraphs, we present selected heuristics for the TSP illustrat-
ing the functionality of heuristics. In the TSP, the goal is to find a tour of n cities
with minimal tour length. We can distinguish between construction and improve-
ment heuristics. Construction heuristics start with an empty tour and iteratively add
cities to the tour until it is completed. Improvement heuristics start with a complete
tour and perform iterative improvements. Representative examples of construction
heuristics for the TSP are:
1. Nearest neighbor heuristic: This heuristic (Rosenkrantz et al, 1977) chooses a
starting city at random and iteratively constructs a tour by going from the cur-
rent city to the nearest city that is not yet included in the tour. After adding all
cities, the tour is completed by connecting the last city with the starting city (a
code sketch follows this list). This
heuristic does not perform well in practice, although for non-negative distances
that satisfy the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y)) the ratio between
the length l(T ) of the tour T found by the heuristic and the length of the optimal
tour Topt never exceeds (log2 n)/2. Therefore, we have an upper bound on the
solution quality l(T )/l(Topt ) ≤ (log2 n)/2.
2. Nearest insertion heuristic: We start with a tour between two cities and iter-
atively add the city whose distance to any city in the tour is minimal. The city
is added to the tour in such a way that the increase in the tour length is mini-
mal. A city can be either added at the endpoints of the tour or by removing one
edge and inserting the new city between the two new ends. We have a worst-case
performance ratio of l(T )/l(Topt ) ≤ 2.
3. Cheapest insertion heuristic: This heuristic works analogously to the nearest
insertion heuristic but chooses the city to insert as the one which increases the
cost of the tour the least. The bound on the solution quality is l(T )/l(Topt ) ≤
log2 n.
4. Furthest insertion heuristic: This heuristic starts with a tour that visits the two
cities which are furthest apart. The next city to be inserted is the one that increases
the length of the current tour the most when this city is inserted in the best posi-
tion on the current tour. The idea behind this heuristic is to first insert cities that
are far apart. The worst-case performance ratio is l(T )/l(Topt ) ≤ log2 n. Although
the performance guarantee is the same as for the cheapest insertion heuristic, in
practice furthest insertion consistently outperforms the other construction heuris-
tics.
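The nearest neighbor heuristic referred to in item 1 can be sketched in a few lines (dist is a placeholder for the problem's distance function):

def nearest_neighbor_tour(cities, dist, start):
    tour, remaining = [start], set(cities) - {start}
    while remaining:
        nxt = min(remaining, key=lambda c: dist(tour[-1], c))   # greedy step
        tour.append(nxt)
        remaining.remove(nxt)
    return tour          # the tour is implicitly closed back to the start city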
In the literature, there are a variety of other construction heuristics for TSPs. Another
relevant example of construction heuristics is the Christofides heuristic which is
discussed in Sect. 3.4.2 as a representative example of an approximation algorithm.
Furthermore, there are also examples of representative and effective improve-
ment heuristics for the TSP.
1. Two-opt heuristic: This heuristic iteratively removes all possible pairs of non-
adjacent edges in a tour and connects the two resulting unconnected sub-tours
such that the tour length is minimal (a code sketch follows this list). If the distances between all cities are
Euclidean (the cities are on a two-dimensional plane), two-opt returns a non-
crossing tour.
2. k-opt heuristic: k-opt (Lin, 1965) is the generalization of 2-opt. The idea is to
examine some or all of the (n choose k) subsets of k edges in a tour. For each possible subset
S, we test whether an improved tour can be constructed by replacing the edges
in S with k new edges. Such a replacement is called a k-switch. A tour where
the application of a k-switch does not reduce the tour length, is denoted as a
k-optimum. A k-opt heuristic iteratively performs k-switches until a k-optimum
is obtained. Although the effort for k-opt is high and only 2-opt and 3-opt are
practical, the expected number of iterations required by 2-opt is polynomial for
Euclidean problem instances and is in O(n^10 log n) (Chandra et al, 1994). For
TSPs satisfying the triangle inequality, the worst-case performance ratio of 2-opt
is O(4√n) and for k-opt it is O((1/4) n^{1/(2k)}) (Chandra et al, 1994).
3. Lin-Kernighan heuristic: This heuristic (Lin and Kernighan, 1973) is an ex-
tension of the k-opt heuristic. In contrast to k-opt, the Lin-Kernighan heuristic
allows k to vary during the search and does not necessarily use all improvements
immediately after they are found. The heuristic starts by removing an edge from
a random tour creating a path. Then one end of this path is connected to some
internal node creating a cycle with an additional tail. Then, an edge is again re-
moved, thereby yielding a new path. These add/remove operations are repeated
as long as the “gain sum” is positive and there still remain unadded/unremoved
edges. The “gain sum” is the difference between the sum of the lengths of the
removed edges (except the last) and the sum of the lengths of the added edges.
For each path constructed, the cost of the tour obtained by joining its endpoints
is also computed and if it is lower than the original tour, the heuristic keeps track
of this solution. When a sequence of possible interchanges has been exhausted
and an improvement was found during the add/remove operations, the best tour
found replaces the original tour and the procedure is repeated. Although it is pos-
sible to construct problem instances where the performance of the Lin-Kernighan
heuristic is low (Papadimitriou and Steiglitz, 1978), it is a powerful heuristic for
TSPs and shows good performance for large real-world problem instances (Hels-
gaun, 2000). Numerous improvements on the basic strategy have been proposed
over the years and state-of-the-art implementations of Lin-Kernighan heuristics
are the methods of choice for practitioners.
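The two-opt heuristic from item 1 can be sketched as follows; a pair of non-adjacent edges is removed by reversing the tour segment between them (dist is assumed symmetric):

def two_opt(tour, dist):
    improved, n = True, len(tour)
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue               # edges share a city, skip
                a, b = tour[i], tour[i + 1]
                c, e = tour[j], tour[(j + 1) % n]
                # replace edges (a, b) and (c, e) by (a, c) and (b, e)
                if dist(a, c) + dist(b, e) < dist(a, b) + dist(c, e):
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour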
3.4.2 Approximation Algorithms

We have seen in the previous section that heuristics are an interesting and promising
approach to obtain high-quality solutions for intractable problems. Although heuris-
tics provide no guarantee of finding the optimal solution, they often find “good” so-
lutions with polynomial effort. Therefore, heuristics trade optimality for tractability.
Approximation algorithms are an attempt to formalize heuristics. Approxima-
tion algorithms emerged from the field of theoretical computer science (Johnson,
1974; Hochbaum and Shmoys, 1987; Hochbaum, 1996; Vazirani, 2003) but are also
becoming popular in the field of artificial intelligence, especially evolutionary algo-
rithms (Droste et al, 2002; Jansen and Wegener, 2002; Beyer et al, 2002; Neumann
and Wegener, 2007; Neumann, 2007). As we know that there are intractable prob-
lems (for example NP-complete problems), we are interested in heuristics that have
polynomial running time and for which we can develop bounds on the quality of the
returned solution. Therefore, approximation algorithms are heuristics for which we
can derive a bound on the quality of the returned solution.
The performance of approximation algorithms is measured using the approxima-
tion ratio
ρ(n) ≥ max ( f (xapprox )/ f (x∗ ), f (x∗ )/ f (xapprox ) ),
where n is the problem size, xapprox is the solution returned by an approximation
algorithm and x∗ is the optimal solution. ρ (n) represents a bound on the solution
quality and measures the worst-case performance of an algorithm. This definition
of ρ (n) holds for minimization and maximization problems. We say that a heuristic
has an approximation ratio of ρ (n) if for any input of size n the objective value
f (xapprox ) is within a factor of ρ (n) of the objective value f (x∗ ). If an algorithm
always returns the optimal solution, ρ (n) = 1.
Figure 3.20 gives an overview of the different classes and lists some represen-
tative combinatorial optimization problems. We know for the general variant of the
TSP problem that no approximation methods with polynomial running time can
be developed (Orponen and Mannila, 1987). If the distances between the cities are
symmetric, then the TSP becomes APX-complete (Papadimitriou and Yannakakis,
1993) and constant-factor approximations are possible. A representative example
of a constant-factor approximation for symmetric TSPs is the Christofides heuris-
tic which is discussed in the next paragraphs. Other examples of APX-hard prob-
lems are the maximum satisfiability problem (MAX SAT, see Sect. 4.4, p. 126),
for which a 3/4-approximation (ρ (n) = 4/3) exists (Yannakakis, 1994) and vertex
cover, which has a 2-approximation (ρ (n) = 2) (Papadimitriou and Steiglitz, 1982).
The goal of MAX SAT is to assign variables of a given Boolean formula in such a
way that the formula becomes true. Vertex cover aims at finding, for a given graph,
the smallest set of vertices that is incident to every edge in the graph.
If the distances between the cities of a TSP are Euclidean, then PTASs exist
(Arora, 1998). For the two-dimensional case where the cities are located on a two-
dimensional grid, Arora developed a PTAS that finds a (1 + ε )-approximation to the
optimal tour in time O(n (log n)^{O(1/ε)} ). For the d-dimensional problem, the running
time increases to O(n (log n)^{(O(√d/ε))^{d−1}} ), which is nearly linear for any fixed ε
and d (Arora et al, 1998).
Finally, the knapsack problem is a representative example for FPTAS-hard prob-
lems. In a knapsack problem, we have a set of n objects, each with a size and a profit.
Furthermore, we have a knapsack of some capacity. The goal is to find a subset of
objects whose total size is no greater than the capacity of the knapsack and whose
profit is maximized. This problem can be solved using dynamic programming and
there are fast FPTASs that can obtain a (1 + ε)-approximation in time O(n log(1/ε) + 1/ε^4 )
(Lawler, 1979; Ibarra and Kim, 1975).
In Sect. 3.4.1, we presented heuristics for the TSP and studied bounds on the ex-
pected solution quality. The heuristics presented are examples of approximation al-
gorithms if their worst-case running time is polynomial and we are able to calculate
a bound on their approximation ratio. Another prominent example of approxima-
tion algorithms for symmetric TSP is the Christofides heuristic (Christofides, 1976)
which is also known as Christofides algorithm.
The Christofides heuristic was one of the first approximation algorithms illus-
trating that it is possible to develop effective heuristics for intractable problems. It
accelerated a paradigm shift in the computer science and OR communities from
trying to find exact methods to developing effective approximation algorithms. The heuristic
is a constant-factor approximation with an approximation ratio of ρ (n) = 3/2. Its
running time is O(n3 ).
The Christofides heuristic solves the TSP by combining a minimum spanning
tree problem with the minimum weight perfect matching problem. Let G = (V, E)
be a connected, undirected graph with n = |V | cities (nodes) and m = |E| edges. The
distance weights between every pair of nodes correspond to the city distances. The
heuristic works as follows:
1. A minimum weight spanning tree connects all the vertices such that the sum of
the edge weights is minimal. Find a minimum weight spanning tree T = (V, R)
(using for example Kruskal’s algorithm). The degree of a node counts the number
of edges that are connected to that node. Let V1 ⊂ V be those nodes having odd
degree.
2. A perfect matching is a subset of edges without common vertices that touches
all vertices exactly once. Any perfect matching of a graph with n nodes has n/2
edges. As every graph has an even number of odd-degree nodes, |V1 | is even and
a perfect matching of V1 exists. Find a minimum weight perfect matching M of V1 .
3. An Euler tour in an undirected graph is a cycle that uses each edge exactly once.
Find an Euler tour S in the graph (V, R ∪ M).
4. Convert S into a TSP tour by taking shortcuts. That is, we obtain the tour by
deleting from S all but one copy of each vertex in V .
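Using the networkx library, the four steps can be sketched compactly (a sketch under the stated assumptions only: G is a complete weighted graph; recent networkx versions also ship a ready-made implementation in nx.approximation):

import networkx as nx

def christofides_sketch(G):
    T = nx.minimum_spanning_tree(G)                      # step 1: minimum spanning tree
    odd = [v for v, deg in T.degree() if deg % 2 == 1]   # nodes of odd degree
    M = nx.min_weight_matching(G.subgraph(odd))          # step 2: min weight perfect matching
    multi = nx.MultiGraph(T.edges())                     # T plus M has even degrees only
    multi.add_edges_from(M)
    tour, seen = [], set()
    for u, _ in nx.eulerian_circuit(multi):              # step 3: Euler tour
        if u not in seen:                                # step 4: shortcut repeated vertices
            seen.add(u)
            tour.append(u)
    return tour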
Although Christofides’s construction heuristic is more than 30 years old, it is still
state-of-the-art and no better approximation algorithms for metric TSPs are known.
Many researchers have conjectured that a heuristic with a lower approximation
factor than 3/2 may be achievable (Vazirani, 2003). However, there can be no
polynomial-time algorithm with an approximation factor better than 220/219 (Pa-
padimitriou and Vempala, 2000).
For more details and information on approximation algorithms, we refer the
reader to Hochbaum (1996) and Vazirani (2003) where an overview of different
types of approximation algorithms for a variety of different, relevant optimization
problems is given.
3.4.3 Modern Heuristics

Usually, modern heuristics perform a limited number of search steps and are stopped
after either a certain quality level has been reached or a maximum number of search
steps has been performed. For more details on the functionality of different types of modern heuristics,
we refer to Chap. 5 and to selected literature (Rechenberg, 1973b; Holland, 1975;
Schwefel, 1981; van Laarhoven and Aarts, 1988; Goldberg, 1989c; Fogel, 1995; Os-
man and Kelly, 1996; Osman and Laporte, 1996; Bäck, 1996; Michalewicz, 1996;
Mitchell, 1996; Bäck et al, 1997; Glover and Laguna, 1997; Aarts and Lenstra,
1997; Goldberg, 2002; Ribeiro and Hansen, 2001; Langdon and Poli, 2002; Glover
and Kochenberger, 2003; Blum and Roli, 2003; Reeves and Rowe, 2003; Resende
and de Sousa, 2003; Gendreau, 2003; Hoos and Stützle, 2004; Alba, 2005; Burke
and Kendall, 2005; Dréo et al, 2005; Gendreau and Potvin, 2005; Ibaraki et al, 2005;
De Jong, 2006; Doerner et al, 2007; Siarry and Michalewicz, 2008).
The application of modern heuristics is easy as usually only two requirements must be fulfilled:
• Representation: We must be able to represent complete solutions to a problem
such that variation operators can be applied to them.
• Pairwise fitness comparisons: The fitness function returns the fitness of a solution.
The fitness function indicates the quality of a solution and is based on, but is
not necessarily identical to, the objective function. For the application of modern heuristics, we must be able to compare the quality of two solutions and indicate which of the two solutions has a higher objective value.
To successfully apply modern heuristics, usually it is not necessary to determine the
correct, absolute objective value of a solution but pairwise comparisons of solutions
are sufficient. If we can define an appropriate representation and perform pairwise
fitness comparisons, we are able to apply a modern heuristic to a problem.
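As an illustration, the two requirements can be captured in a minimal, hypothetical problem interface; the names used here are our own choice, not a standard API:

```python
from abc import ABC, abstractmethod

class Problem(ABC):
    """Hypothetical minimal interface capturing the two requirements."""

    @abstractmethod
    def random_solution(self):
        """Representation: return a complete candidate solution to
        which variation operators can be applied."""

    @abstractmethod
    def better(self, x, y):
        """Pairwise fitness comparison: return True if solution x has
        a higher objective value than solution y. Absolute objective
        values are not required."""
```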
Studying the literature reveals that there exists a large variety of different mod-
ern heuristics. However, often the differences between the different approaches are
small and similar concepts are presented using different labels. In the literature, vari-
ous categorizations of modern heuristics have been proposed to classify the existing
approaches. According to Blum and Roli (2003), the most common criteria classify modern heuristics according to:
• the origin of the method (nature-inspired versus non-nature-inspired),
• the number of solutions considered in parallel during search (population-based versus single-point search),
• the type of fitness function (dynamic versus static fitness function),
• the type of neighborhood definition (static versus dynamic neighborhood definition), and
• the consideration of previous search steps for new solutions (memory usage versus memory-less).
Although
these classifications allow us to categorize modern heuristics, the categorizations
are not necessarily meaningful with respect to the underlying principles of modern
heuristics. If we want to design modern heuristics in a systematic way, we have to
understand what are the basic ingredients of modern heuristics and how we have to
combine them to obtain effective algorithms. Partial aspects, for example concern-
ing the history of methods or the number of solutions used in parallel, are of minor
importance. In the remainder of this work, we want to categorize modern heuristics
using the following characteristics (main design elements):
Heating up a solid material allows atoms to move around. If the cooling process is
sufficiently slow, the size of the resulting crystals is large and few defects occur in
the crystalline structure. Thus, the collection of atoms settles into a crystal structure with a minimum energy configuration.
SA imitates this behavior and uses iterative search steps. In each search step, a
variation operator is applied to a solution xo . For combinatorial optimization prob-
lems, the variation operator is often designed in such a way that it creates a neigh-
boring solution xn . Usually, the fitness f (x) of a solution x is equivalent to its objec-
tive value and the search starts with a randomly created initial solution. The search
strategy consists of intensification and diversification elements as the probability of accepting worse solutions changes during the search process. The probability is con-
trolled by the strategy parameter T which is often called temperature. At the begin-
ning of the search, diversification dominates and the SA explores the search space
(see Fig. 3.21(a)). New solutions xn replace the old solutions xo with high probabil-
ity and also solutions with lower fitness are accepted. As the number of search steps
increases, exploration becomes weaker and at the end of an SA run (T ≈ 0) only
better solutions are allowed to replace the current solution. The level of intensifica-
tion and exploration is controlled by the acceptance probability Pacc (T ) of replacing
a current solution xo by a neighboring solution xn . Pacc depends on the temperature
T which is reduced during search and on the fitness difference δ E = f (xn ) − f (xo )
between old and new solution. For minimization problems, it is calculated as
$$P_{acc}(T) = \begin{cases} 1 & \text{if } f(x_n) \leq f(x_o), \\ \exp(-\delta E / T) & \text{if } f(x_n) > f(x_o). \end{cases}$$
The definition of a proper cooling schedule is one of the most important method-
specific design options (van Laarhoven and Aarts, 1988; Aarts et al, 1997). If T is
reduced very slowly, SA returns the optimal solution at the end of a run (Aarts
and van Laarhoven, 1985; Aarts et al, 1997; Henderson et al, 2003). However, in
practice, cooling schedules that guarantee finding the optimal solution are too slow
and time-consuming as the necessary number of search steps often exceeds the
size of the search space. Therefore, often a fixed cooling schedule is used, where
the temperature at time i + 1 is set to Ti+1 = cTi (0 < c < 1). Typical values are
c ∈ [0.9, 0.999]. The initial temperature T0 should be set such that exploration is
possible at the beginning of the search and also some areas of the search space with
low-quality solutions are explored. A proper strategy for setting T0 is to randomly
generate a number of solutions before the SA run and to set the initial temperature
to T0 ≈ σ(f(x)) … 2σ(f(x)), where σ(f(x)) denotes the standard deviation of the
objective value of the randomly generated solutions. A low value of T at the end of
an SA run leads to a strong intensification of the search. Then, only better solutions
are accepted and SA becomes a pure local search. To force an SA to perform a local
search at the end of a run, sometimes T is set to zero near the end of a run to sys-
tematically explore the neighborhood around the current solution and to ensure that
the search returns a local optimum (van Laarhoven and Aarts, 1988).
We give an example and apply SA to the TSP. The first design option is the choice
of a proper problem representation and search operator. We directly represent a tour
as a sequence of cities (see Sect. 3.3.2.3, p. 65) and assume that two solutions x
and y are neighbors if their Hamming distance d(x, y) = 2. Therefore, neighboring solutions differ in two different cities and each solution has $\binom{n}{2}$ different neighbors. If the initial city is fixed, the number of neighbors is $\binom{n-1}{2}$. Figure 3.14 (p. 68) illustrates a search space for a TSP with four cities where each solution has $\binom{3}{2} = 3$ neighbors (the tour always starts with city a).
We have to determine a proper value for T0. As the standard deviation of the fitness values is σ = √62/3 ≈ 2.62 (we have three different solutions with fitness 11, 12, and 17), we set T0 = 3. Furthermore, we use a simple geometric cooling schedule, where Ti+1 = 0.9Ti. As in Fig. 3.14, we start with the initial solution x0 = abcda with f(x0) = 11. The variation operator randomly exchanges the
position of two cities in the tour and, for example, creates a random neighboring
solution x1 = abdca with f (x1 ) = 17. As f (x1 ) > f (x0 ), x1 replaces x0 with prob-
ability P = exp(−6/3) ≈ 0.14. We generate a random number rnd ∈ [0, 1) and if
rnd < 0.14, x1 replaces x0 and we continue with x1 . Otherwise, we continue with x0 .
Then, we reduce the temperature T which becomes T1 = 2.7. We continue iterations
until we exceed a predefined number of search steps or until we have found no better
solution for a certain number of search steps. With lowering T , the search converges
with high probability to the optimal solution (there are only three solutions).
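A minimal sketch of the SA just described, assuming a symmetric distance matrix dist as the only problem-specific input; the parameter values are illustrative, not tuned:

```python
import math
import random

def sa_tsp(dist, t0, c=0.9, steps=10_000):
    """Sketch of SA for the TSP with swap neighborhood and
    geometric cooling."""
    n = len(dist)

    def length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    x = list(range(n))        # random initial solution
    random.shuffle(x)
    fx, T = length(x), t0

    for _ in range(steps):
        # Variation operator: exchange the positions of two cities.
        i, j = random.sample(range(n), 2)
        y = x[:]
        y[i], y[j] = y[j], y[i]
        fy = length(y)

        # Accept always if better, otherwise with probability exp(-dE/T).
        dE = fy - fx
        if dE <= 0 or random.random() < math.exp(-dE / T):
            x, fx = y, fy
        T = c * T             # cooling schedule T_{i+1} = c * T_i
    return x, fx
```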
Already in its early days, the heuristic optimization community observed a gen-
eral trade-off between effectiveness and application range of optimization methods
(Polya, 1945; Simon and Newell, 1958; Kuehn and Hamburger, 1963; Romanycia
and Pelletier, 1985). Often, the more problems a particular optimization method could solve, the lower its resulting average performance. Therefore,
researchers have started to design heuristics in a more problem-specific way to in-
crease the performance of heuristics for selected optimization problems.
The situation is different for modern heuristics as many of these methods are
viewed as general-purpose problem solvers that reliably and effectively solve a large
variety of different problems. The goal of many researchers in the modern heuris-
tics field has been to develop modern heuristics that are black-box optimization
methods. Black-box optimization methods are algorithms that need no additional
information about the structure of a problem but are able to reliably and efficiently
return high-quality solutions for a large variety of different optimization problems.
In 1995, Wolpert and Macready presented the No Free Lunch (NFL) theorem for
optimization (Wolpert and Macready, 1995). It builds upon previous work (Watan-
abe, 1969; Mitchell, 1982; Radcliffe and Surry, 1995) and basically says that the
design of general black-box optimization methods is not possible. In 1997, it was fi-
nally published in a journal (Wolpert and Macready, 1997). For an introduction, see
Whitley and Watson (2005). The NFL theorem is concerned with the performance of
algorithms that search through a search space by performing iterative search steps.
The authors summarize the main result of the NFL theorem as:
. . . for both static and time dependent optimization problems, the average performance of
any pair of algorithms across all possible problems is exactly identical. This means in par-
ticular that if some algorithm A1 ’s performance is superior to that of another algorithm A2
over some set of optimization problems, then the reverse must be true over the set of all
other optimization problems (Wolpert and Macready, 1997).
Consequently, a search algorithm can only perform well on a class of problems if its search bias fits the structure of the optimization problem and the algorithm is able to exploit these problem-specific properties.
We will have a closer look at the theorem. Wolpert and Macready assume a discrete search space X and an objective function f that assigns a fitness value y = f(x) to each x ∈ X. During search, a set of m distinct solutions $\{d^x_m(i)\}$ with corresponding objective values $\{d^y_m(i)\}$ is created, where i ∈ {1, . . . , m} denotes an order on the sets (for example, the order in which the solutions are created). New solutions are created step by step and after m − 1 search steps, we have the sets $\{d^x_m(1) = x_1, d^x_m(2) = x_2, \ldots, d^x_m(m) = x_m\}$ and $\{d^y_m(1) = f(x_1), d^y_m(2) = f(x_2), \ldots, d^y_m(m) = f(x_m)\}$. In each search step, a search algorithm A uses the previously generated solutions to decide which solution is created next. We assume that newly generated solutions have not been generated before. Therefore, A is a mapping from a previously visited set of solutions to a single new, unvisited solution, $A : \{d^x_m(i)\} \to x_{m+1}$, where $x_{m+1} \in X \setminus \{d^x_m(i)\}$ and i ∈ {1, . . . , m}. Usually, A considers the objective values of the previously visited solutions.

Wolpert and Macready use two performance measures. First, the performance of A can be measured using a function $\Phi(d^y_m)$ that depends on the set of objective values $d^y_m = \{d^y_m(i)\}$. $\Phi(d^y_m)$ can, for example, return the lowest fitness value in $d^y_m$: $\Phi(d^y_m) = \min_i \{d^y_m(i) : i = 1, \ldots, m\}$. Second, the performance of an algorithm A is measured using $P(d^y_m \mid f, m, A)$. This is the conditional probability of obtaining an ordered set of m distinct objective values $\{d^y_m(i)\}$ under the stated conditions.

The NFL theorem compares the performance of different algorithms A averaged over all possible f. Measuring an algorithm's performance using $P(d^y_m \mid f, m, A)$, the theorem says:
Theorem 3.1. For any pair of algorithms A1 and A2,

$$\sum_f P(d^y_m \mid f, m, A_1) = \sum_f P(d^y_m \mid f, m, A_2).$$
Therefore, there is no algorithm that outperforms some other algorithm on all prob-
lems that can be created by all possible assignments of objective values to solutions.
If one algorithm gains in performance on one class of problems, it necessarily pays
for it on the remaining problems.
Important for the interpretation of the NFL theorem is the definition of “all possible (objective) functions f”. Schumacher (2000) and Schumacher et al (2001) introduced the permutation closure of a set of functions: applying a permutation σ of the search space X to f yields the function f ∘ σ, and the permutation closure of a set F of functions contains all functions that can be generated from elements of F in this way. In its sharpened form, the NFL theorem states that the performance of any two search algorithms is identical when averaged over a set F of functions if and only if F is closed under permutation (Schumacher et al, 2001).
Important for a proper interpretation of the NFL theorem is which problems are
closed under permutation. Trivial examples are problems where all solutions x ∈
X have the same fitness value y. Then, obviously all search algorithms show the
same performance. English (2000) recognized that NFL also holds for needle-in-a-
haystack (NIH) problems ((2.12), p. 30). In such problems, all solutions have the
same evaluation value except one which is the optimum and has a higher objective
value (maximization problem). For the NIH, |X| ≠ |Y|. Since |Y| = 2, there are only |X| different functions in the permutation closure. The performance of search
algorithms is the same when averaged over all |X| functions. Figure 3.23 shows all
possible NIH functions for |X| = 4. We assume a maximization problem and y1 > y2 .
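The size of this closure can be checked directly. The following snippet, with arbitrarily chosen values y1 and y2, enumerates all distinct functions (represented as value tuples) obtained by permuting the assignment of objective values for |X| = 4:

```python
from itertools import permutations

# Permuting the assignment of objective values to solutions generates
# the permutation closure. For a NIH with |X| = 4 there are exactly
# |X| = 4 distinct functions.
y1, y2 = 1.0, 0.0
nih = (y1, y2, y2, y2)
closure = {tuple(nih[i] for i in p) for p in permutations(range(4))}
print(len(closure))  # -> 4
```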
NFL-like results can also be observed for problems that are not closed under per-
mutation. Whitley and Rowe (2008) showed that a subset of algorithms can have
identical performance over a subset of functions, even when the subset is not closed
under permutation. In the extreme case, two algorithms can have identical perfor-
mance over just two functions. In contrast to Theorem 3.1, which assumes unknown
algorithms A1 and A2 , Whitley and Rowe (2008) assume that there is some a-priori
knowledge about the algorithms used. For example, they observe identical perfor-
mance of algorithms on some functions that are not closed under permutation, if the
search algorithms are limited to m steps, where m is significantly smaller than the
size of the search space.
The consequences of the NFL theorem are in agreement with Sects. 2.3.2 and
2.4.2. In Sect. 2.3.2, we discussed trivial topologies where no meaningful neigh-
borhood can be defined. Since all solutions are neighbors of each other, step-wise
search algorithms are not able to make a meaningful guess at the solution that should
be sampled next but just select a random solution from the search space. Therefore,
the performance of all search algorithms is, on average, the same.
In Sect. 2.4.2.1, we discussed problems where no correlation exists between the
distance between solutions and their corresponding objective values (see Fig. 2.3,
middle). The situation is analogous to problems with a trivial topology. Search al-
gorithms are not able to make a reasonable guess at the solution that should be
sampled next. Therefore, all possible optimization methods show, on average, the
same performance.
What about the locality and decomposability of a set of functions that is closed
under permutation? For a set F of functions f : X → Y that is closed under permuta-
tion, the correlation between X and Y is zero averaged over all functions. Therefore,
the locality, averaged over all functions, is low and the resulting functions can, on
average, not be decomposed. We illustrate this for two types of problems. Measuring
the correlation ρX,Y between X and Y for NIH problems reveals that ∑F ρX,Y ( f ) = 0.
NIH problems have low locality and cannot be decomposed. The same holds for
other problems that are closed under permutation like f : X → Y with X = {1, 2, 3}
and Y = {1, 2, 3} (Fig. 3.24). The average correlation ∑F ρX,Y ( f ) between X and Y
is ∑i ρ ( fi ) = 1 + 0.5 + 0.5 − 0.5 − 1 − 0.5 = 0.
We can generalize this result. As discussed in Sect. 2.2, a problem defines a
set of problem instances. Each problem instance can be viewed as one particular
function f : X → Y . The performance of a search algorithm is the same averaged
over all functions f that are defined by the problem if the functions are closed under
permutation. Then, also the correlation ρX,Y averaged over all functions f is zero.
Given a set of functions that are closed under permutation, search performance
is also the same if we choose only a selection of functions from the permutation
closure as long as the correlation ρX,Y averaged over the selected functions is zero.
Choosing a set of random functions from a permutation closure also results in the
same performance of search algorithms on the selected set. Instead of using ρX,Y ,
we could also use the fitness-distance correlation ρFDC or other measurements that
measure a correlation between X and Y .
After discussing types of problems for which we are not able to develop well-
performing search algorithms, we might ask for which problems the NFL does not
hold and high-performing heuristics become possible. Christensen and Oppacher
(2001) studied the performance of an algorithm called submedian-seeker. The al-
gorithm proceeds as follows (minimization problem):
1. Evaluate a sample of solutions and estimate median( f ) of the objective value.
2. If f (xi ) < median( f ), then sample a neighbor of xi . Else sample a new random
solution.
3. Repeat step 2 until half of the search space is explored.
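A sketch of these steps for minimization; the arguments f, solutions, neighbors, and the fixed step budget are assumptions of this illustration, not part of the original formulation:

```python
import random
import statistics

def submedian_seeker(f, solutions, neighbors, budget):
    """Sketch of the submedian-seeker (minimization)."""
    # Step 1: estimate median(f) from a random sample.
    sample = random.sample(solutions, min(100, len(solutions)))
    med = statistics.median(f(s) for s in sample)

    x = best = min(sample, key=f)
    for _ in range(budget):
        # Step 2: exploit submedian solutions, otherwise restart.
        if f(x) < med:
            x = random.choice(list(neighbors(x)))
        else:
            x = random.choice(solutions)
        if f(x) < f(best):
            best = x
    return best
```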
We assume a metric search space where f is one-dimensional and a bijection. The
number of submedian values of f that have supermedian successors is denoted as
M( f ). For M( f ) = 1 the set of solutions can be split into two halves, one with
objective values lower than median( f ) and one with higher objective values. With
increasing M( f ) the number of local optima increases. For M( f ) = |X|/2, there
are no neighboring solutions that both have an objective value below median( f ).
Christensen and Oppacher (2001) showed that there exists Mcrit such that when
M( f ) < Mcrit the proposed algorithm (submedian-seeker) beats random search. The
work of Christensen and Oppacher was the origin of work dealing with algorithms
that are robust: this means they are able to beat random search on a wide range of
problems (Streeter, 2003; Whitley et al, 2004; Whitley and Rowe, 2006). Whitley et al (2004) generalized the submedian-seeker of Christensen and Oppacher to subthreshold-seeking algorithms, which sample solutions below an arbitrary fitness threshold.
Designing a fitness function and initialization method is usually easier than de-
signing proper representations and search operators. The fitness function is deter-
mined by the objective function and allows modern heuristics to perform pairwise
comparisons between solutions. Initial solutions are usually randomly created if no
a priori knowledge about the problem exists.
This chapter starts with a guideline on how to find the right optimization method
for a particular problem. Then, we discuss the different design elements. Section
4.2 discusses properties of representations and gives an overview of standard geno-
types. Analogously, Sect. 4.3 presents recommendations for the design of search
operators. Here, we distinguish between local and recombination search operators
and present standard search operators with well-known behavior. Finally, Sects. 4.4
and 4.5 present guidelines for the design of fitness functions and initialization meth-
ods for modern heuristics.
Since the literature provides a large number of different optimization methods, the
identification of the “right” optimization method for a particular problem is usually
a difficult task. We need some guidelines on how to find the right method.
Important for a proper choice of an optimization method is recognizing the fundamental structure of a problem. For this process, we can use prob-
lem catalogs (Garey and Johnson, 1979; Vazirani, 2003; Crescenzi and Kann, 2003;
Alander, 2000) and look for existing problems that are similar to the problem at
hand. If the problem at hand is a standard problem and if it can be solved with pol-
ynomial effort by some optimization method (e.g. linear problems, Sect. 3.2), often
such a method is the method of choice for our problem. In contrast, problem solv-
ing is more difficult if the problem at hand is NP-hard. In this case, problem solving
becomes easier if either fast exact optimization methods or efficient approximation
methods (e.g. fully polynomial-time approximation schemes, Sect. 3.4.2) are avail-
able. Modern heuristics are the method of choice if the problem at hand is NP-hard,
difficult, and no efficient exact methods, approximation methods, or simple heuris-
tics exist.
Relating a particular real-world problem to existing standard problems in the
literature is sometimes difficult as problems in the real world usually differ from
the well-defined problems that we can find in textbooks. Real-world problems of-
ten have additional constraints, additional decision variables, and other optimization
goals. Therefore, the resulting problem models for “real” problems are large, com-
plex, non-deterministic, and often different from standard problems from the liter-
ature. We can try to reduce complexity and make our problem more standard-like
by removing constraints, simplifying the objective function, or limiting the number
of decision variables (see Sect. 2.1.3). However, we often do not want to neglect
some important aspects and are not happy with the simplified model. Furthermore,
we never know if the reduced model is an appropriate model for the real world or if
it is just an abstract model without any practical relevance. This problem was recognized early, for example by Ackoff (1973, p. 670), who stated that ‘accounts are
given about how messes were murdered by reducing them to problems, how prob-
lems were murdered by reducing them to models, and how models were murdered
by excessive exposure to the elements of mathematics.’ Therefore, when simplifying
realistic models to make them similar to existing standard models we must be care-
ful as the use of standardized models and methods easily leads to over-simplified
models and “wrong” solutions for the original problem.
After formulating a realistic problem model, we have to solve it. Finding an
appropriate modern heuristic for the problem at hand is a difficult task as we are
confronted with two major difficulties: first of all, the literature is full of success
stories of modern heuristics (for example Alander (2000)). When studying the liter-
ature, we get the impression that modern heuristics are fabulous optimization tools
that can solve all possible types of problems in short time. Of course, experience
as well as the NFL theorem tell us that such generalizations are wrong. However,
the literature usually does not provide us with failures and limitations of modern
heuristics but emphasizes successful applications. In many published applications,
the optimization problems are chosen extremely carefully and often only limited
comparisons to other methods are performed. Furthermore, the parameters and design of the modern heuristic used are tweaked until its performance is sufficient to
allow publication. When users try to apply the methods they find in the literature to
their own real-world problems (which are usually slightly different and larger), they
often observe low performance. Thus, they start changing and tweaking the param-
eters until the method delivers solutions of reasonable quality. This process is very
time-consuming and frustrating for users that just want to apply modern heuristics
to solve their problems.
The second major problem for users of modern heuristics is the choice and parameterization of a method. The literature offers a huge variety of different modern heuristics to choose from. Moreover, there are many variants of the same search concepts which are denoted differently. Often, users have no
overview of the different methods and have no idea which are the “right” methods
for their problem. The situation becomes worse as in the literature mainly success
stories are published (see previous paragraph) and limitations of approaches are of-
ten not known or remain fuzzy. Furthermore, existing comparisons in the literature
of different types of modern heuristics are often biased due to the expertise and
background of the authors. This is not surprising because researchers are usually
able to find a better design and parameter setting for such types of modern heuris-
tics with which they are more familiar. Therefore, as users are overwhelmed by the
large variety of different methods, they most often choose out-of-the-box methods
from standard text books. Applying such methods to complex and large real-world
problems is often problematic and leads to low-quality results as users do not appro-
priately consider problem-specific knowledge for the design of the method. Consid-
ering problem-specific knowledge for the design of modern heuristics turns modern
heuristics from black-box optimization methods into powerful optimization tools
and is at the core of successful problem solving.
4.2 Representation
Successful and efficient use of modern heuristics depends on the choice of the geno-
types and the representation - that is, the mapping from genotype to phenotype - and
on the choice of search operators that are applied to the genotypes. These choices
cannot be made independently of each other. The question whether a certain repre-
sentation leads to a better performing modern heuristic than an alternative represen-
tation can only be answered when the operators applied are taken into account. The
reverse is also true: deciding between alternative operators is only meaningful for a
given representation.
In practice, one can distinguish two complementary approaches to the design of
representations and search operators (Rothlauf, 2006). The first approach defines
representations (also known as decoders or indirect representations) where a solu-
tion is encoded in a standard data structure, such as strings or vectors, and applies
standard off-the-shelf search operators to these genotypes. To evaluate a solution,
the genotype needs to be mapped to the phenotype space. The proper choice of this
genotype-phenotype mapping is important for the performance of the search pro-
cess. The second approach encodes solutions to the problem in its most “natural”
problem space and designs search operators to operate on this search space. In this
case, often no additional mapping between genotypes and phenotypes is necessary,
but domain-specific search operators need to be defined. The resulting combination
of representation and operator is often called direct representation.
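As an example of the first (indirect) approach, a random-key decoder maps a real-valued genotype to a permutation phenotype; the function name and values below are purely illustrative:

```python
# Illustrative indirect representation: "random keys" for tours. The
# genotype is a real-valued vector; the decoder (the genotype-phenotype
# mapping) sorts the keys and reads off a permutation, so standard
# continuous search operators can be applied to the genotype.

def decode_random_keys(genotype):
    return sorted(range(len(genotype)), key=lambda i: genotype[i])

tour = decode_random_keys([0.42, 0.07, 0.93, 0.55])  # -> [1, 0, 3, 2]
```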
This section focuses on representations. It introduces genotypes and phenotypes
(Sect. 4.2.1) and discusses properties of the resulting genotype and phenotype space
(Sect. 4.2.2). Section 4.2.3 lists the benefits of using (indirect) representations. Fi-
nally, Sect. 4.2.4 gives an overview of standard genotypes.
In 1866, Mendel recognized that nature stores the complete genetic information
for an individual in pairwise alleles (Mendel, 1866). The genetic information that
determines the properties, appearance, and shape of an individual is stored by a
number of strings. Later, it was discovered that the genetic information is formed by
a double string of four nucleotides, called DNA.
Mendel realized that nature distinguishes between the genetic code of an individ-
ual and its outward appearance. The genotype represents all the information stored
in the chromosomes and allows us to describe an individual on the level of genes.
The phenotype describes the outward appearance of an individual. A transformation
exists – a genotype-phenotype mapping or a representation – that uses the genotype
information to construct the phenotype. To represent the large number of possible
phenotypes with only four nucleotides, the genotype information is not stored in the
alleles themselves, but in the sequence of alleles. By interpreting the sequence of alleles,
nature can encode a large number of different phenotypes using only a few different
types of alleles.
In Fig. 4.1, we illustrate the differences between chromosome, gene, and allele.
A chromosome is a string of some length where all the genetic information of an
individual is stored. Although nature often uses more than one chromosome, many
modern heuristics use only one chromosome for encoding all phenotype informa-
tion. Each chromosome consists of many alleles. Alleles are the smallest informa-
tion units in a chromosome. In nature, alleles exist pairwise, whereas in most im-
plementations of modern heuristics an allele is represented by only one symbol. For
example, binary genotypes only have alleles with value zero or one. If a phenotypic
property of an individual (solution), like its hair color or eye size is determined by
one or more alleles, then these alleles together are called a gene. A gene is a region
on a chromosome that must be interpreted together and which is responsible for a
specific property of a phenotype.
We must carefully distinguish between genotypes and phenotypes. The pheno-
typic appearance of a solution determines its objective value. Therefore, when com-
paring the quality of different solutions, we must judge them on the phenotype level.
However, when it comes to the application of variation operators we must view so-
lutions on the genotype level. New solutions that are created using variation opera-
tors do not “inherit” the phenotypic properties of their parents, but only the genotype
information regarding the phenotypic properties. Therefore, search operators work
on the genotype level, whereas the evaluation of the solutions is performed on the
phenotype level.
Formally, we define Φg as the genotype space where the variation operators are
applied. An optimization problem on Φg could be formulated as
f (x) : Φg → R,
where x is usually a vector or string of decision variables (alleles) and f (x) is the ob-
jective or fitness function. x∗ is the global maximum. We have chosen a maximiza-
tion problem, but without loss of generality, we could also model a minimization
problem. To be able to apply modern heuristics to a problem, the inverse function
f −1 does not need to exist.
When using an (indirect) representation, we distinguish between genotypes and phenotypes, and the evaluation of a solution splits into two functions: the genotype-phenotype mapping

$$f_g(x^g) : \Phi_g \to \Phi_p,$$

and the phenotype fitness function

$$f_p(x^p) : \Phi_p \to \mathbb{R},$$

where Φp denotes the phenotype space and the fitness of a genotype is $f(x^g) = f_p(f_g(x^g))$. Problems can occur if neighboring phenotypes are not neighbors in the genotype search space. Representations that ensure that neighboring phenotypes are also neighboring genotypes are called high-locality representations (see Sect. 6.1.2).
We describe some of the most important and widely used genotypes, and summarize
some of their major characteristics. For a more detailed overview of different types
of genotypes, we refer to Bäck et al (1997, Sect. C1).
Binary genotypes are commonly used in genetic algorithms (Goldberg, 2002, 1989c).
Such types of modern heuristics use recombination as the main search operator and
mutation only serves as background noise. A typical search space is $\Phi_g = \{0,1\}^l$, where l is the length of a binary vector $x^g = (x^g_1, \ldots, x^g_l) \in \{0,1\}^l$. The genotype-
phenotype mapping fg depends on the specific optimization problem to be solved.
For many combinatorial optimization problems using binary genotypes allows a di-
rect and very natural encoding.
When using binary genotypes for encoding integer phenotypes, specific genotype-
phenotype mappings are necessary. Different types of binary representations for in-
tegers assign the integers x p ∈ Φ p (phenotypes) in different ways to the binary vec-
tors xg ∈ Φg (genotypes). The most common binary genotype-phenotype mappings
are binary, Gray, and unary encoding. For a more detailed description of these three
types of encodings, we refer to Rothlauf (2006, Chap. 5) and Rowe et al (2004).
When using binary genotypes to encode continuous phenotypes, the accuracy
(precision) depends on the number of bits that represent one phenotype variable. By
increasing the number of bits that are used to represent one continuous variable the
accuracy of the representation can be increased.
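A sketch of such a mapping; the helper decode_binary and its arguments are illustrative assumptions:

```python
def decode_binary(bits, lo, hi):
    """Illustrative genotype-phenotype mapping from a binary genotype
    to a continuous phenotype in [lo, hi]; precision grows with the
    number of bits per variable."""
    value = int("".join(map(str, bits)), 2)
    return lo + value * (hi - lo) / (2 ** len(bits) - 1)

# 8 bits give 256 distinct phenotype values in [0, 1]:
x = decode_binary([1, 0, 1, 1, 0, 1, 0, 1], 0.0, 1.0)  # approx. 0.7098
```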
Instead of using binary strings with cardinality χ = 2, higher χ-ary alphabets, where χ ∈ N+ \ {0, 1}, can also be used for the genotypes. Then, instead of a binary alphabet, a χ-ary alphabet is used for a string of length l. Instead of encoding $2^l$ different individuals with a binary alphabet, we are able to encode $\chi^l$ different possibilities. The size of the search space increases from $|\Phi_g| = 2^l$ to $|\Phi_g| = \chi^l$.
For many integer problems, users often prefer to use binary instead of integer
genotypes because schema processing (Sect. 2.4.3.3) is maximally efficient with bi-
nary alphabets when using standard recombination operators in genetic algorithms
(Goldberg, 1990). Goldberg (1991b) qualified this recommendation and emphasized
that the alphabet used in the encoding should be as small as possible while still al-
lowing a natural representation of solutions. To give general recommendations is
difficult, as users often do not know a priori whether binary genotypes allow a natu-
ral encoding of integer phenotypes (Radcliffe, 1997; Fogel and Stayton, 1994). We
recommend that users use binary genotypes for encoding binary decision variables
and integer genotypes for integer decision variables.
When using continuous genotypes, the search space is $\Phi_g = \mathbb{R}^l$, where l is the size
of a real-valued string or vector. Continuous genotypes are often used in local search
methods like evolution strategies (Sect. 5.1.5) or evolutionary programming. These
types of optimization methods are mainly based on local search and move through the search space by adding a zero-mean Gaussian random variable to each continuous decision variable. In contrast, when using recombination-based genetic algorithms,
continuous decision variables are often represented by using binary genotypes.
Continuous genotypes cannot only be used for encoding continuous problems,
but also for permutation and combinatorial problems. Trees, schedules, tours, or
other combinatorial problems can easily be represented by using continuous geno-
types and special genotype-phenotype mappings (for examples see Sects. 8.1.2 and
8.4.1).
In all previously presented genotypes, the position of each allele is fixed along
the chromosome and only the corresponding value is specified. The first position-independent genotype was proposed by Holland (1975). He proposed the inversion
operator which changes the relative order of the alleles in the string. The position
of an allele and the corresponding value are coded together as a tuple in a string.
This concept can be used for all types of genotypes such as binary, integer, and
real-valued alleles and allows an encoding which is independent of the position
of the alleles in the chromosome. Later, Goldberg et al (1989) used this position-
independent representation for the messy genetic algorithm.
During the 1990s, Radcliffe developed guidelines for the design of search opera-
tors. It is important for search operators that the representation used is taken into
account as search operators are based on the metric that is defined on the genotype
space. Radcliffe introduced the principle of formae, which are subsets of the search
space (Radcliffe, 1991b,a, 1992, 1993; Radcliffe and Surry, 1994; Radcliffe, 1994).
Formae are defined as equivalence classes that are induced by a set of equivalence
relations. Any possible solution of an optimization problem can be identified by
specifying the equivalence class to which it belongs for each of the equivalence re-
lations. For example, if we have a search space of faces (Surry and Radcliffe, 1996),
basic equivalence relations might be “same hair color” or “same eye color”, which
would induce the formae “red hair”, “dark hair”, “green eyes”, etc. Formae of higher
order like “red hair and green eyes” are then constructed by composing simple for-
mae. The search space, which includes all possible faces, can be constructed with
strings of alleles that represent the different formae. For the definition of formae,
the structure of the phenotypes is relevant. For example, for binary problems, pos-
sible formae would be “bit i is equal to one/zero”. When encoding tree structures,
possible basic formae would be “contains link from node i to node j”.
It is an unsolved problem to find appropriate equivalences for a particular prob-
lem. From the equivalences, the genotype search space Φg and the genotype-
phenotype mapping fg can be constructed. Usually, a solution is encoded as a string
of alleles. The value of an allele indicates whether the solution satisfies a particu-
lar equivalence. Radcliffe (1991a) proposed several design guidelines for creating
appropriate equivalences for a given problem. The most important design guide-
line is that the generated formae should group together solutions of related fitness
(Radcliffe and Surry, 1994), in order to create a fitness landscape or structure of the
search space that can be exploited by search operators.
Radcliffe recognized that the genotype search space, the genotype-phenotype
mapping, and the search operators belong together and their design cannot be sepa-
rated from each other (Radcliffe, 1992). He assumed that search operators create off-
spring solutions from a set of parent solutions. For the development of appropriate
search operators that are based on predefined formae, he formulated the following
four design principles (Radcliffe, 1991a, 1994):
• Respect: Offspring produced by recombination should be members of all formae
to which both their parents belong. This means for the “face example” that off-
spring should have red hair and green eyes if both parents have red hair and green
eyes.
• Transmission: An offspring should be equivalent to at least one of its parents
under each of the basic equivalence relations. This means that every gene should
be set to an allele which is taken from one of the parents. If one parent has dark
hair and the other red hair, then the offspring has either dark or red hair.
• Assortment: An offspring can be formed with any compatible characteristics
taken from the parents. Assortment is necessary as some combinations of equiv-
alence relations may be infeasible. This means for example, that the offspring
inherits dark hair from the first parent and blue eyes from the second parent only
if dark hair and blue eyes are compatible. Otherwise, the alleles are set to feasible
values taken from a random parent.
• Ergodicity: An iterative use of search operators allows us to reach any point in
the search space from all possible starting solutions.
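For plain strings, uniform crossover (see Sect. 4.3.3) satisfies transmission, and therefore respect, by construction. The following is a small illustrative check, not a general verification procedure:

```python
import random

def uniform_crossover(p1, p2):
    """Each offspring allele is taken from a randomly chosen parent,
    so transmission holds by construction."""
    return [random.choice(pair) for pair in zip(p1, p2)]

p1, p2 = [0, 1, 1, 0], [0, 0, 1, 1]
child = uniform_crossover(p1, p2)
# Respect: wherever both parents agree, the child agrees with them too.
assert all(c == a for c, a, b in zip(child, p1, p2) if a == b)
```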
The recommendations from Radcliffe confirm that representations and search op-
erators depend on each other and cannot be designed independently. He developed
a consistent concept of how to design efficient modern heuristics once appropriate
equivalence classes (formae) are defined. However, the finding of appropriate equiv-
alence classes, which is equivalent to either defining the genotype search space and
the genotype-phenotype mapping or appropriate direct search operators on the phe-
notypes, is often difficult and remains an unsolved problem.
As long as the genotypes are either binary, integer, or real-valued strings, stan-
dard recombination and mutation operators can be used. The situation is different if
direct representations (Sect. 4.3.4) are used for problems whose phenotypes are not
binary, integer, or real-valued. Then, standard recombination and mutation operators
cannot be used any more. Specialized operators are necessary that allow offspring to
inherit important properties from their parents (Radcliffe, 1991a,b; Kargupta et al,
1992; Radcliffe, 1993). In general, these operators are problem-specific and must be
developed separately for every optimization problem.
Local search and the use of local search operators are at the core of modern heuris-
tics. The goal of local search is to generate new solutions with similar properties
in comparison to the original solutions (Doran and Michie, 1966). Usually, a local
search operator creates offspring that have a small or sometimes even minimal dis-
tance to their parents. Therefore, local search operators and the metric on the corre-
sponding search space cannot be decided independently of each other but determine
each other. A metric defines possible local search operators and a local search op-
erator determines the metric. As search operators are applied to the genotypes, the
metric on Φg is relevant for the definition of local search operators.
The basic idea behind using local search operators is that the structure of a fitness
landscape should guide a search heuristic to high-quality solutions (Manderick et al,
1991), and that good solutions can be found by performing small iterated changes.
We assume that high-quality solutions are not isolated in the search space but
grouped together (Christensen and Oppacher, 2001; Whitley, 2002). Therefore, bet-
ter solutions can be found by searching in the neighborhood of already found good
solutions (see also the discussion on the submedian seeker in Sect. 3.4.4, p. 101).
The search steps must be small because too large search steps would result in ran-
domization of the search, and guided search around good solutions would become
impossible. In contrast, when using search operators that perform large steps in the
search space it would not be possible to find better solutions by searching around
already found good solutions but the search algorithm would jump randomly around
the search space (see Sect. 6.1).
The following paragraphs review some common local search operators for bi-
nary, integer, and continuous genotypes and illustrate how they are designed based
on the underlying metric. The local search operators (and underlying metrics) are
commonly used and usually a good choice. However, in principle we are free to
choose other metrics and to define corresponding search operators. Then, the metric
should be chosen such that high-quality solutions are neighboring solutions and the
resulting fitness landscape leads guided search methods to an optimal solution. The
choice of a proper metric and corresponding search operators are always problem-
specific and the ultimate goal is to choose a metric such that the problem becomes
easy for guided search methods (see the discussion on how locality affects guided
search methods, Sect. 2.4.2). However, we want to emphasize that for most practical
applications the illustrated search operators are a good choice and allow us to design
efficient and effective modern heuristics.
When using binary genotypes, the distance between two solutions x, y ∈ {0, 1}l is of-
ten measured using the Hamming distance (2.7). Many local search operators based
on this metric generate new solutions with Hamming distance d(x, y) = 1. This
type of search operator is also known as a standard mutation operator for binary
strings or a bit-flipping operator. As each binary solution of length l has l neigh-
bors, this search operator can create l different offspring. For example, applying the
bit-flipping operator to (0, 0, 0, 0) can result in four different offspring (1, 0, 0, 0),
(0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1).
Reeves (1999b) proposed another local search operator for binary strings based
on a different neighborhood definition: for a randomly chosen k ∈ {1, . . . , l}, it complements the bits $x_k, \ldots, x_l$. Again, each solution has l neighbors. For exam-
ple, applying this search operator to (0, 0, 0, 0) can result in (1, 1, 1, 1), (0, 1, 1, 1),
(0, 0, 1, 1), or (0, 0, 0, 1). Although the operator is of minor practical importance, it
has some interesting theoretical properties. First, it is closely related to the one-point
recombination crossover (see below) as it chooses a random point and inverts all xi
with i ≥ k. Therefore, it has also been called the complementary crossover operator.
Second, if all genotypes are encoded using Gray code (Gray, 1953; Caruana et al,
1989), the neighbors of a solution in the Gray-coded search space using Hamming
distance are identical to the neighbors in the original binary-coded search space us-
ing the complementary crossover operator. Therefore, Hamming distances between
Gray encoded solutions are equivalent to the distances between the original binary
encoded solutions using the metric induced by the complementary crossover oper-
ator (neighboring solutions have distance one). For more information regarding the
equivalence of different neighborhood definitions and search operators we refer to
the literature (Reeves, 1999b; Höhn and Reeves, 1996a,b).
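Sketches of both operators on binary lists; the function names are our own:

```python
import random

def bit_flip(x):
    """Standard mutation: flip one random bit (Hamming distance 1)."""
    y = x[:]
    i = random.randrange(len(y))
    y[i] ^= 1
    return y

def complement_tail(x):
    """Reeves' operator: complement all bits from a random position on."""
    k = random.randrange(len(x))
    return x[:k] + [b ^ 1 for b in x[k:]]
```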
For integer genotypes different metrics are common, leading to different local search operators. When using the binary Hamming metric (2.8), two individuals are neighbors if they differ in one decision variable. Search operators based on this metric assign a random value to a randomly chosen allele. Therefore, each solution $x \in \{0, \ldots, k\}^l$ has lk neighbors. For example, x = (0, 0) with $x_i \in \{0, 1, 2\}$ has four
different neighbors ((1, 0), (2, 0), (0, 1), and (0, 2)).
The situation changes when defining local search operators based on the city-
block metric (2.5). Then, a local search operator can create new solutions by slightly
increasing or decreasing one randomly chosen decision variable. For example, new
solutions are generated by adding +/-1 to a randomly chosen variable xi . Each so-
lution of length l has a maximum of 2l different neighbors. For example, x = (1, 1)
with xi ∈ {0, 1, 2, 3} has four different neighbors ((0, 1), (2, 1), (1, 0), and (1, 2)).
Finally, we can define search operators such that they do not modify values of
decision variables but exchange values of two decision variables xi and x j . There-
fore, using the binary Hamming distance (2.8), two neighbors have distance d = 2 and each solution has a maximum of $\binom{l}{2}$ different neighbors. For example, x = (3, 5, 2)
has three different neighbors ((5, 3, 2), (2, 5, 3), and (3, 2, 5)).
For continuous genotypes, we can define local search operators analogously to inte-
ger genotypes. Based on the binary Hamming metric (2.8), the application of a local
search operator can assign a random value xi ∈ [xi,min , xi,max ] to the ith decision vari-
able. Furthermore, we can define a local search operator such that it exchanges the
values of two decision variables xi and x j . The binary Hamming distance between
old and new solutions is d = 2.
The situation is a little more complex in comparison to integer genotypes when
designing a local search operator based on the city-block metric (2.5). We must
define a search operator such that its iterative application allows us to reach all solu-
tions in reasonable time. Therefore, a search step should be not too small (we want
to have some progress in search) and not too large (the offspring should be similar to
the parent solution). A commonly used concept for such search operators is to add a
random variable with zero mean to the decision variables. This results in $x'_i = x_i + m$, where m is a random variable and $x'$ is the offspring generated from x. Sometimes
m is uniformly distributed in [−a, a], where a < (xi,max − xi,min ). More common is
the use of a normal distribution N (0, σ ) with zero mean and standard deviation σ .
The addition of zero-mean Gaussian random variables generates offspring that have,
on average, the same statistical properties as their parents. For more information on
local search operators for continuous variables, we refer to Fogel (1997).
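A minimal sketch of this operator; sigma is an illustrative default:

```python
import random

def gaussian_mutation(x, sigma=0.1):
    """Add a zero-mean normal random variable to each decision
    variable (a common local search operator for continuous
    genotypes)."""
    return [xi + random.gauss(0.0, sigma) for xi in x]
```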
With n-point crossover, offspring are created by alternately selecting alleles from the parent solutions between randomly chosen crossover points. For uniform crossover, we decide independently for every single allele of the offspring from which parent solution it inherits the value of the allele. In most implementations, no parent is preferred and the probability of an offspring inheriting the value of an allele from a specific parent is p = 1/m, where m denotes the number of parents that are considered for recombination. For example, for two parents and p = 1/2, we could get as offspring $x^{o_1} = [x^{p_1}_1, x^{p_1}_2, x^{p_2}_3, \ldots, x^{p_1}_{l-1}, x^{p_2}_l]$ and $x^{o_2} = [x^{p_2}_1, x^{p_2}_2, x^{p_1}_3, \ldots, x^{p_2}_{l-1}, x^{p_1}_l]$. We see that uniform crossover is equivalent to (l − 1)-point crossover.
Figure 4.2 presents examples for the three crossover variants. All three recombination operators are based on the binary Hamming distance and follow (4.1) as $d(x^{p_1}, x^{p_2}) \geq \max(d(x^{p_1}, x^o), d(x^{p_2}, x^o))$. Therefore, the similarity between off-
spring and parent is higher than between the parents.
Uniform and n-point crossover can be used independently of the type of deci-
sion variables (binary, integer, continuous, etc) since these operators only exchange
alleles between parents. In contrast, intermediate recombination operators attempt
to average or blend components across multiple parents and are designed for con-
tinuous and integer problems. Given two parents $x^{p_1}$ and $x^{p_2}$, a crossover operator known as arithmetic crossover (Michalewicz, 1996) creates an offspring $x^o$ as

$$x^o = \alpha\, x^{p_1} + (1 - \alpha)\, x^{p_2},$$

where α ∈ [0, 1]. If α = 0.5, the crossover just takes the average of both parent solutions. In general, for m parents, this operator becomes

$$x^o = \sum_{i=1}^{m} \alpha_i\, x^{p_i}, \qquad \sum_{i=1}^{m} \alpha_i = 1.$$
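As a sketch, the two-parent case in code; the function name is illustrative:

```python
def arithmetic_crossover(p1, p2, alpha=0.5):
    """Blend two continuous parents; alpha = 0.5 takes their average."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
```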
If genotypes and phenotypes are identical (Φg = Φp), the fitness of a solution is computed directly from its genotype,

$$f(x^g) : \Phi_g \to \mathbb{R}.$$

Then, fg does not exist and we have a direct representation. Because there is no longer an
additional mapping between Φg and Φ p , a direct representation does not change any
aspect of the phenotype problem such as difficulty or metric. However, when using
direct representations, we often cannot use standard search operators, but have to
define problem-specific operators. Therefore, important for the success of modern
heuristics using a direct representation is not finding a “good” representation for a
specific problem, but developing proper search operators defined on phenotypes.
Relevant for different implementations of direct representations are not the rep-
resentations used (there are no genotype-phenotype mappings) but the definition of
the variation operators. Since we assume that local search operators always generate
neighboring solutions, the definition of a local search operator induces a metric on
the genotypes. Therefore, the metric that we use on the genotype space should be
chosen in such a way that new solutions that are generated by local search operators
have small (or better minimal) distance to the old solutions and the solutions are
neighbors with respect to the metric used. Furthermore, the distance between two
solutions x ∈ Φg and y ∈ Φg should be proportional to the minimal number of local
search steps that are necessary to move from x to y. Analogously, the definition of a
recombination operator also induces a metric on the search space. The metric used
should guarantee that the application of a recombination operator to two solutions
x p ∈ Φg and y p ∈ Φg creates a new solution xo ∈ Φg whose distances to the parents
are not larger than the distance between the parents (see equation (4.1)).
For the definition of variation operators, we should also consider that for many
problems we have a natural notion of similarity between phenotypes. When we cre-
ate a problem model, we often “know” whether two solutions are similar to each
other, or not. Such a notion of similarity should be considered for the definition
of variation operators. We should design local search operators in such a way that
their application creates solutions which we “view” as similar. Such a definition of
local search operators ensures that neighboring phenotypes are also neighbors with
respect to the metric that is induced by the search operators.
At a first glance, it seems that the use of direct representations makes life easier
as direct representations release us from the challenge to design efficient represen-
tations. However, we are confronted with some problems:
• Standard variation operators cannot be applied to many types of phenotypes.
• The design of high-quality problem-specific search operators is difficult.
• We cannot use modern heuristics that only work on standard genotypes.
For indirect representations with standard genotypes, the definition of search op-
erators is straightforward as these are usually based on the metric of the genotype
space (see Sects. 4.3.2 and 4.3.3). The behavior of modern heuristics using standard
search operators is usually well examined and well understood. However, when us-
ing direct representations, standard operators often can no longer be used. For each
different phenotype, problem-specific operators must be developed. This is difficult,
as we cannot use most of our knowledge about the behavior of modern heuristics
using standard genotypes and standard operators.
The design of proper search operators is often demanding as phenotypes are usu-
ally not string-like but are more complicated structures like trees, schedules, or other
structures (for some examples, see Sect. 4.3.5). In this case, phenotypes cannot be
depicted as a string or in another way that is accessible to variation operators. Other
representative examples are the form or shape of an object. Search operators that
can be directly applied to the shape of an object are often difficult to design.
Finally, using specific variants of modern heuristics like estimation of distribu-
tion algorithms (EDA, Sect. 5.2.2) becomes very difficult. These types of modern
heuristics do not use standard search operators that are applied to genotypes but
build new solutions according to a probabilistic model of previously generated so-
lutions (Mühlenbein and Paaß, 1996; Mühlenbein and Mahnig, 1999; Pelikan et al,
1999a,b; Larrañaga et al, 1999; Bosman, 2003). These search methods were devel-
oped for a few standard genotypes (usually binary and floats) and result in better
performance than, for example, traditional simple genetic algorithms for decom-
posable problems (Larrañaga and Lozano, 2001; Pelikan, 2006). However, because
direct representations with non-standard phenotypes and problem-specific search
operators can hardly be implemented in EDAs, direct representations cannot benefit
from these optimization methods.
The following paragraphs provide an overview of standard search spaces and the
corresponding search operators. The search spaces can either represent genotypes
(indirect representation) or phenotypes (direct representation). We order the search
spaces by increasing complexity. With increasing complexity of the search space,
the design of search operators becomes more demanding. An alternative to design-
ing complex search operators for complex search spaces is to introduce additional
mappings that map complex search spaces to simpler ones. Then, the design of the
corresponding search operators becomes easier, however, a proper design of the ad-
ditional mapping (representation) becomes more important.
Strings and vectors of either fixed or variable length are the most elementary search
spaces. They are the most frequently used genotype structures. Vectors allow us
to represent an ordered list of decision variables and are the standard genotypes
for the majority of optimization problems (compare Sects. 3.2 and 3.3). Strings are
appropriate for sequences of characters or patterns. Consequently, strings are suited
for problems where the objects modeled are “text”, “characters”, or “patterns”.
For strings as well as vectors, we can use standard local search and recombination
operators (Sects. 4.3.2 and 4.3.3) that are mostly based on the Hamming metric (2.7)
or binary Hamming metric (2.8).
4.3.5.2 Coordinates/points
4.3.5.3 Graphs
Common genotypes for graphs are lists of edges indicating which edges are used.
Often, the characteristic vector representation (Sect. 8.1.2) or variants of it are used
to represent graph structures. Standard search operators for the characteristic vector
representation are based on the Hamming metric (2.8) as the distance between two
graphs can be calculated as the number of different edges. Standard search operators
can be used if there are no additional constraints.
4.3.5.4 Subsets
Subsets represent selections from a set of objects. Given n different objects, the number of subsets having exactly k elements is equal to $\binom{n}{k}$. Thus, the number of possible subsets can be calculated as $\sum_{k=0}^{n} \binom{n}{k} = 2^n$. For subsets, the order of the
objects does not matter. Therefore, the two example subsets {1, 3, 5} and {3, 5, 1}
represent the same phenotype solution. Local search operators that can be applied
directly to subsets often either modify the objects in the subset, or increase/reduce
the number of objects in one subset. Recombination operators that are directly ap-
plied to subsets are more sophisticated as no standard operators can be used. We
refer to Falkenauer (1998) for detailed information on the design of search operators
for subsets. Subsets are often used for problems that seek a “cluster”, “collection”,
“partition”, “group”, “packaging”, or “selection”.
Given n different objects, a subset of fixed size k can be represented using an integer vector x of length k, where the xi indicate the selected objects and xi ≠ xj for i ≠ j and i, j ∈ {1, . . . , k}. Then, standard local search operators can be applied if we as-
sume that each of the k selected objects is unique. The application of recombination
operators is more demanding as each subset is represented by k! different genotypes
(integer vectors) and the distances between the k! different genotypes that represent
the same subset are large (Choi and Moon, 2008). Recombination operators must be
designed such that the distances between offspring and parents are smaller than the
distances between parents (4.1) and the recombination of two genotypes that repre-
sent the same subset always results in the same subset. For guidelines on the design
of appropriate recombination operators and examples, we refer the interested reader
to Choi and Moon (2003) and Choi and Moon (2008).
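To make the discussion concrete, the following minimal Python sketch shows a local search operator for fixed-size subsets of the kind described above: it swaps one selected object for one that is currently unselected. The function name subset_neighbor and its arguments are illustrative assumptions, not taken from the literature.

import random

def subset_neighbor(subset, universe, rng=random):
    # Create a neighboring subset of the same size k by exchanging
    # one selected object for one currently unselected object.
    inside = list(subset)
    outside = [o for o in universe if o not in subset]
    if not inside or not outside:
        return set(subset)  # no exchange possible
    neighbor = set(subset)
    neighbor.remove(rng.choice(inside))
    neighbor.add(rng.choice(outside))
    return neighbor

# Example: a random neighbor of the subset {1, 3, 5} over the objects 1..10
print(subset_neighbor({1, 3, 5}, range(1, 11)))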
4.3.5.5 Permutations
A large variety of modern heuristics have been developed for permutation problems
as many such problems are of practical relevance but NP-hard. Permutations are orderings of items; in contrast to subsets, the order of the objects matters. The number of permutations of a set of n elements is n!. For example, 1-2-3 and 1-3-2 are two different permutations of the three integer numbers x ∈ {1, 2, 3}. The TSP is a prominent
example of a permutation problem (Sect. 3.3.2). Permutations are commonly used
for problems that seek an “arrangement”, “tour”, “ordering”, or “sequence”.
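A common local search operator for permutations exchanges two randomly chosen positions. A minimal Python sketch (the name swap_neighbor is an illustrative assumption) that always yields a valid permutation again:

import random

def swap_neighbor(perm, rng=random):
    # Return a neighboring permutation by exchanging two positions;
    # the result is always a valid permutation.
    i, j = rng.sample(range(len(perm)), 2)
    neighbor = list(perm)
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor

print(swap_neighbor([1, 2, 3]))  # e.g. [1, 3, 2]

For tour problems like the TSP, edge-based operators such as 2-opt are usually more effective than simple position swaps.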
4.3.5.6 Trees
Trees are used to describe hierarchical relationships between objects. Trees are a
specialized variant of graphs where only one path exists between each pair of nodes.
As standard search operators cannot be applied to tree structures, we either need to
define problem-specific search operators that are directly applied to trees or addi-
tional genotype-phenotype mappings that map each tree to simpler genotypes where
standard variation operators can be applied.
We can distinguish between trees of fixed and variable size. For trees of fixed
size, search operators are presented in Sect. 8.3 and appropriate genotypes and
genotype-phenotype mappings in Sect. 8.4. Search operators for tree structures of
variable size are at the core of genetic programming and are discussed in Chap. 7.
Further information about appropriate search operators for trees of variable size can
be found in Koza (1992) and Banzhaf et al (1997).
4.4 Fitness Function

Modern heuristics use a fitness function to compare the quality of solutions. The
fitness of a solution is its quality “seen” by the optimization method. It is based
on the objective function (which is often also called evaluation function) and often
both are equivalent to each other. However, sometimes modern heuristics modify
the objective function and use the resulting fitness function to compare the quality
of different solutions. In general, the objective function is based on the problem and
model formulation whereas the fitness function measures the quality of solutions
from the perspective of modern heuristics. In the following paragraphs, we discuss
some relevant aspects of objective and fitness functions.
The objective function is usually defined for all problem solutions and allows us
to compare the quality of different solutions. The objective function is based on the
problem model and is a mapping from the set of possible candidate solutions to a set
of objective values. We can distinguish between two types of objective functions:
ordinal and numerical. Ordinal objective functions denote the position of a solution
in an ordered sequence. With Y = {1, . . . , |Φ |},
f (x) : Φ → Y
indicates the position of a solution in the sequence. Ordinal objective functions al-
low us to order the solutions with respect to their quality (best, second best, . . . ,
worst) but give us no information on their absolute quality. For example, ordinal
objective functions are commonly used when human experts are involved in the
evaluation of solutions. For humans, it is usually easier to establish a ranking of
potential solutions instead of assigning absolute quality values.
Numerical objective functions assign a real-valued objective value to all possible
solutions which indicates the quality of a solution:
f (x) : Φ → R.
Based on the objective value, we can order solutions with respect to their qual-
ity. Numerical objective functions are common for many technical or mathematical
optimization problems, where the goal is to minimize or maximize some quality
measurement (e.g. cost or profit).
The objective function is formulated during model construction (Sect. 2.1.3) and
depends on the problem definition. Usually, we have different possibilities for for-
mulating the objective function. When defining an objective function, we should
ensure that optimal solutions (solutions that meet the objective completely) have
the best evaluation. Therefore, solutions that have lower quality than the optimal
solution also have a lower objective value. Furthermore, solutions of similar qual-
ity should have similar objective values. Finally, objective functions should assign
objective values to the solutions in such a way that guided search methods can eas-
ily find the optimal solution and are “guided” towards the optimal solutions by
the structure of the fitness landscape. Therefore, objective functions should make
problems either straightforward (Sect. 2.4.2.1) if local search operators are used,
or decomposable (Sect. 2.4.3) if recombination search operators are used, or both.
Thus, the dissimilarity between phenotype solutions (measured by the problem
metric) should be positively correlated with the difference in their objective values.
We want to give two examples of a bad design of objective functions. In the first
example, we have a search space X of size n. The objective function assigns the highest objective value to the best solution x∗ ( f (x∗ ) = n) and randomly assigns an objective value from {1, . . . , n − 1} to each of the other n − 1 solutions x ∈ X \ {x∗ }.
This problem is equivalent to a NIH problem and is closed under permutation
(Sect. 3.4.4). Therefore, guided search methods have large problems finding the op-
timal solution. On average, they cannot perform better than random search as they
cannot exploit any information that leads them in the direction of optimal solutions.
The second example is the satisfiability (SAT) problem (Cook, 1971; Garey and
Johnson, 1979; Gu et al, 1996). An instance of the SAT problem is a Boolean for-
mula with three components:
• A set of n variables xi , where i ∈ {1, . . . , n}.
• A set of literals. A literal is a variable or a negation of a variable.
• A set of m distinct clauses {C1 ,C2 , . . . ,Cm }. Each clause consists only of literals
combined by logical or operators (∨).
The SAT is a decision problem and its objective is to determine whether there exists
an assignment of values to the n variables such that the conjunctive normal form C1 ∧ C2 ∧ · · · ∧ Cm evaluates to true, where ∧ is the logical and operator. The
most natural objective function for the SAT problem assigns a zero objective value
to all solutions that do not satisfy the given compound Boolean statement and a
one to all solutions that satisfy the statement. However, such an objective function
results in large problems for guided search methods as the resulting problem is a
needle-in-a-haystack problem and it is not possible to extract information from the
search history to guide a search method in the direction of an optimal solution. A
more appropriate fitness function would use a measure for the quality of a solution,
for example the number of satisfied clauses. This would allow modern heuristics to
estimate how close a solution is to feasibility (how many clauses are satisfied) and distinguish between solutions based on their “degree” of inadmissibility.
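The difference between the two objective functions can be made explicit in a few lines of Python. The following sketch (the function name and the clause encoding are illustrative assumptions) implements the graded fitness that counts satisfied clauses; replacing the return value by int(satisfied == len(clauses)) would yield the needle-in-a-haystack variant.

def sat_fitness(assignment, clauses):
    # Graded SAT fitness: the number of satisfied clauses.
    # assignment: dict mapping variable index to a Boolean value.
    # clauses: list of clauses; each clause is a list of literals,
    # where literal +i stands for x_i and -i for NOT x_i.
    satisfied = 0
    for clause in clauses:
        if any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            satisfied += 1
    return satisfied

# (x1 OR NOT x2) AND (x2 OR x3)
clauses = [[1, -2], [2, 3]]
print(sat_fitness({1: True, 2: False, 3: True}, clauses))  # 2: both satisfied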
From the second example, we can learn two important lessons for the design of
fitness functions. First, fitness landscapes with large plateaus like in the NIH prob-
lem can be made easier for guided search methods if we modify the objective func-
tion and consider the objective value of neighboring solutions for calculating a solu-
tion’s fitness. For example, this can be achieved by smoothing the fitness landscape.
Smoothing algorithms are commonly used in statistics to allow us to recognize rel-
evant trends in data (Simonoff, 1996). We can use this concept for optimization
and calculate the fitness value of a solution considering the neighboring solutions.
Smoothing has the drawback that we need additional fitness evaluations as we must
know for each evaluation the fitness values of neighboring solutions. An example
of smoothing is calculating the fitness of a solution as the average of the weighted
neighboring solutions (the weight decreases with increasing distance). Thus, we can
transform plateaus or NIH-landscapes into fitness landscapes that guide local search
methods towards optimal solutions (see Fig. 4.3 for an example).
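A minimal sketch of such a smoothing scheme, assuming a binary search space with bit-flip neighbors (all names and the weighting constant are illustrative assumptions):

def bitflip_neighbors(x):
    # All solutions at Hamming distance 1 from the binary tuple x.
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def smoothed_fitness(x, f, neighbors, w=0.5):
    # Weighted average of a solution's own fitness and the mean
    # fitness of its neighbors; plateaus around a needle get a slope.
    nbrs = neighbors(x)
    return (1 - w) * f(x) + w * sum(f(y) for y in nbrs) / len(nbrs)

# Needle-in-a-haystack on 3 bits: only (1, 1, 1) has fitness 1
f = lambda x: 1.0 if x == (1, 1, 1) else 0.0
print(smoothed_fitness((1, 1, 0), f, bitflip_neighbors))  # > 0 next to the needle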
Second, when dealing with constrained problems, we have to define fitness values
for solutions that are infeasible (often known as penalty functions). Defining fitness
4.5 Initialization 127
values for infeasible solutions is necessary if not all infeasible solutions can be ex-
cluded from the search space, for example by a proper definition of the genotype
space or by search operators that do not generate infeasible solutions. As we have
seen in the SAT example, we sometimes cannot systematically exclude all infeasible
solutions from the solution space but also have to assign a fitness value to infeasible
solutions. The proper design of fitness functions for constrained problems is de-
manding as often the optimal solutions are at the edge of the feasible search space
(Gottlieb, 1999) and we have to design fitness functions for infeasible solutions such
that the resulting problem is either straightforward (when using local search opera-
tors), decomposable (when using recombination operators), or both. For details on
the design of appropriate fitness functions for constrained problems we refer to the
literature (Smith and Coit, 1997; Gottlieb, 1999; Coello Coello, 1999; Michalewicz
et al, 1999).
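A common way to realize such fitness functions is an additive penalty that grows with the number or degree of constraint violations, so that guided search can grade infeasible solutions instead of rejecting them outright. A minimal sketch for a minimization problem (all names and the constant penalty weight are illustrative assumptions):

def penalized_fitness(x, f, violations, penalty_weight=10.0):
    # Minimization: infeasible solutions get a penalty proportional
    # to their number of violated constraints.
    return f(x) + penalty_weight * violations(x)

# Toy example: minimize sum(x) subject to x[0] + x[1] >= 3
f = lambda x: sum(x)
violations = lambda x: 1 if x[0] + x[1] < 3 else 0
print(penalized_fitness([1, 1, 0], f, violations))  # 2 + 10 = 12.0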
As modern heuristics usually need a large number of solution evaluations, the
calculation of the fitness values must be fast. Problems where the calculation is
very time-consuming often cannot be solved efficiently by modern heuristics. In
many real-world problems, there is a trade-off between speed and accuracy of the
evaluations (Handa, 2006). We can get rough approximations of a solution’s quality
after a short time but often need more time to calculate the exact quality. As the
differences in the objective values of solutions at the beginning of optimization runs
are usually large, we do not need accurate quality estimations at early stages of a run.
With increasing run time, modern heuristics find better and more similar solutions
and accuracy becomes more important. When designing the fitness function for real-
world problems where solution evaluation is time-consuming, users have to take
these aspects into account.
4.5 Initialization
The proper choice of an initialization method is important for the design of effective
and efficient modern heuristics. An initialization method provides modern heuristics
with some initial solutions from which the search starts. We can distinguish between
initialization methods that generate either only one solution or a population of so-
lutions. If we want to use local search approaches, one solution may be sufficient;
for recombination-based approaches, we need a larger set of individuals since we
can only recombine properties from different solutions if the number of available
solutions is large enough.
Biased initial populations should have sufficient diversity to make sure that recombi-
nation operators work properly and solutions that are different from the biased initial population can also be found.
Sections 8.3 and 8.5 present illustrative examples of how biased initialization
methods influence the performance of modern heuristics. The results show that mod-
ern heuristics for the OCST problem can benefit from biasing initial solutions to-
wards minimum spanning trees (MST). However, if optimal solutions have a larger
distance to MSTs, a strong initialization bias towards MSTs results in problems for
modern heuristics.
Chapter 5
Search Strategies
5.1 Local Search Methods

The idea of local search is to iteratively create neighboring solutions. Since such
strategies usually consider only one solution, recombination operators are not mean-
ingful. For the design of efficient local search methods it is important to incorporate
intensification as well as diversification phases into the search.
Local search methods are also called trajectory methods since the search pro-
cess can be described as a trajectory in the search space. The search space is a
result of the interplay between representation and operator. A trajectory depends
on the initial solution, the fitness function, the representation/operator combination,
and the search strategy used. The behavior and the dynamics of local as well as
recombination-based search methods can be described using Markov processes and
concepts of statistical mechanics (Rudolph, 1996; Vose, 1999; Reeves and Rowe,
2003). In Markov processes, states represent the successively generated solutions (or populations of solutions). A search step transforms one state
into a following state. The behavior of search algorithms can be analyzed by study-
ing possible sequences of states and their corresponding probabilities. The transition
matrix describes how states depend on each other and depends on the initial solu-
tion, fitness function, representation/operator combination, and search strategy.
Existing local as well as recombination-based search strategies mainly differ
in how they control diversification and intensification. Diversification is usually
achieved by applying variation operators or making larger modifications of solu-
tions. Intensification steps use the fitness of solutions to control search and usually
ensure that the search moves in the direction of solutions with higher fitness. The
goal is to find a trajectory that overcomes local optima by using diversification and
ends in a global optimal solution.
In greedy search approaches (like the best-first search strategy discussed in
Sect. 3.3.2), intensification is maximal as in each search step the neighboring solu-
tion with highest fitness is chosen. No diversification is possible and the search stops
at the nearest local optimum. Therefore, greedy search finds the global optimum if
we have an unimodal problem where only one local optimum exists. However, as problems usually have a large number of local optima, the probability of finding
the global optimum using greedy search is low.
Based on the design elements of modern heuristics, there are different strategies
to introduce diversification into the search and to escape from local optima:
• Representation and search operator: Choosing a combination of representation
and search operators is equivalent to defining a metric on the search space and de-
fines which solutions are neighbors. By using different types of neighborhoods,
it is possible to escape from local optima and explore larger areas of the search
space. Different neighborhoods can be the result of different genotype-phenotype
mappings or search operators applied during search. Standard examples for lo-
cal search approaches that use modifications of representations or operators to
diversify the search are variable neighborhood search (Hansen and Mladenović,
2001), problem space search (Storer et al, 1992) (see also Sect. 6.2.1), the rollout
algorithm (Bertsekas et al, 1997), and the pilot method (Duin and Voß, 1999).
• Fitness function: The fitness function measures the quality of solutions. Modi-
fying the fitness function has the same effect as changing the representation as
it assigns different fitness values to the problem solutions. Therefore, variations
and modifications of the fitness function lead to increased diversification in lo-
cal search approaches. A common example is guided local search (Voudouris,
1997; Balas and Vazacopoulos, 1998) which systematically changes the fitness
function with respect to the progress of search.
• Initial solution: As the search trajectory depends on the choice of the initial so-
lution (for example, greedy search always finds the nearest local optimum), we
can introduce diversification by performing repeated runs of search heuristics us-
ing different initial solutions. Such multi-start search approaches allow us to explore a larger area of the search space and lead to higher diversification (see the sketch following this list). Variants
of multi-start approaches include iterated descent (Baum, 1986a,b), large-step
Markov chains (Martin et al, 1991), iterated Lin-Kernighan (Johnson, 1990),
chained local optimization (Martin and Otto, 1996), and iterated local search
(Lourenco et al, 2001).
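The following Python sketch illustrates the simplest form of the third strategy, a multi-start hill climber; the function names and the toy problem are illustrative assumptions:

import random

def hill_climb(x, f, neighbors):
    # Greedy local search: move to the best neighbor until no
    # neighbor improves the current solution (maximization).
    while True:
        best = max(neighbors(x), key=f)
        if f(best) <= f(x):
            return x
        x = best

def multi_start(f, neighbors, random_solution, starts=20, rng=random):
    # Repeat local search from random initial solutions and keep
    # the best local optimum found.
    return max((hill_climb(random_solution(rng), f, neighbors)
                for _ in range(starts)), key=f)

f = lambda x: -(x - 3) ** 2                      # unimodal toy problem
neighbors = lambda x: [x - 1, x + 1]
print(multi_start(f, neighbors, lambda rng: rng.randint(-100, 100)))  # 3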
5.1.1 Variable Neighborhood Search

Variable Neighborhood Search (VNS) (Mladenović and Hansen, 1997; Hansen and Mladenović, 2001) combines local search strategies with dynamic neighborhood
structures that are changed subject to the progress made during search. VNS is based
on the following observations (Hansen and Mladenović, 2003):
• A local minimum with respect to a neighborhood structure is not necessarily a
local optimum with respect to a different neighborhood (see also Sects. 2.3.2
and 4.2.2). The neighborhood structure of the search space depends on the met-
ric used and is different for different search operators and representations. This
observation goes back to earlier work (Liepins and Vose, 1990; Jones, 1995b,a)
which found that different types of operators result in different fitness landscapes.
• A global minimum is a global minimum with respect to all possible neighbor-
hood structures. Different neighborhood structures only result in different sim-
ilarity definitions but do not change the fitness of the solutions. Therefore, the
global optimum is independent of the search operators used and remains the
global optimum for all possible metrics.
• Hansen and Mladenović (2003) conjecture that in many real-world problems, lo-
cal optima with respect to different neighborhood structures have low distance to
each other and local optima have some properties that are also relevant for the
global optimum. This observation is related to the decomposability of problems
which is relevant for recombination-based search. Local optima are not randomly
distributed in the search space but local optima already contain some optimal so-
lutions to subproblems. Therefore, as they share common properties, the average
distance between local optima is low.
Figure 5.1 illustrates the basic idea of VNS. The goal is to repeatedly perform a
local search using different neighborhoods N. The global optimum x∗ remains the
global optimum with respect to all possible neighborhoods. However, as different
neighborhoods result in different neighbors, x can be a local optimum with respect
to neighborhood N1 but it is not necessarily a local optimum with respect to N2 .
Thus, performing a local search starting from x and using N2 can find the global
optimum.
Fig. 5.1 Changing the neighborhood from N1 to N2 allows local search to find the global optimum
VNS contains intensification and diversification elements. The local search fo-
cuses search as it searches in the direction of high-quality solutions. Diversification
is a result of changing neighborhoods as a solution x is not necessarily locally op-
timal with respect to a different neighborhood. Therefore, by changing neighbor-
hoods, VNS can easily escape from local optima. Furthermore, due to the increas-
ing cardinality of the neighborhoods (the neighborhoods are ordered with respect to
their cardinality), diversification gets stronger as the shaking steps can choose from
a larger set of solutions and local search covers a larger area of the search space (the
basin of attraction increases).
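A compact Python sketch of this scheme is given below. The arguments (a list of neighborhood functions ordered by increasing cardinality and a shake function that samples a random solution from N_k(x)) are illustrative assumptions, not the canonical formulation:

import random

def local_search(x, f, neighbors):
    # Greedy ascent to the nearest local optimum (maximization).
    improved = True
    while improved:
        improved = False
        for y in neighbors(x):
            if f(y) > f(x):
                x, improved = y, True
                break
    return x

def vns(x, f, neighborhoods, shake, iters=100, rng=random):
    # Basic VNS: shake in N_k, run local search, return to N_1 on
    # improvement, otherwise move on to the next (larger) neighborhood.
    for _ in range(iters):
        k = 0
        while k < len(neighborhoods):
            x1 = shake(x, k, rng)                      # random point in N_k(x)
            x2 = local_search(x1, f, neighborhoods[k])
            if f(x2) > f(x):
                x, k = x2, 0                           # success: restart with N_1
            else:
                k += 1                                 # failure: enlarge neighborhood
    return x

# Toy demo: maximize f over the integers with nested neighborhoods
f = lambda x: -(x - 42) ** 2
nbhds = [lambda x, k=k: [x + d for d in range(-k, k + 1) if d] for k in (1, 5, 25)]
shake = lambda x, k, rng: rng.choice(nbhds[k](x))
print(vns(0, f, nbhds, shake, iters=30))  # 42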
Although in the last few years VNS has become quite popular and many publica-
tions have shown successful applications (for an overview see Hansen and Mladen-
ović (2003)), the underlying ideas are older and more general. The goal is to intro-
duce diversification into modern heuristics by changing the metric of the problem
with respect to the progress that is made during search. A change in the metric
can be achieved either by using different search operators or a different genotype-
phenotype mapping. Both lead to different metrics and neighborhoods. Early ideas
on varying the representation (adaptive representations) with respect to the search
progress go back to Holland (1975). First implementations were presented by Grefenstette et al (1985), Shaefer (1987), Schraudolph and Belew (1992), and Storer
et al (1992) (see also the discussion in Sect. 6.2.1). Other examples are approaches
that use additional transformations (Sebald and Chellapilla, 1998b), a set of pre-
selected representations (Sebald and Chellapilla, 1998a), or multiple and evolvable
representations (Liepins and Vose, 1990; Schnier, 1998; Schnier and Yao, 2000).
5.1.2 Guided Local Search
A fitness landscape is the result of the interplay between a metric that defines simi-
larities between solutions and a fitness function that assigns a fitness value to each
solution. VNS uses modifications of the metric to create different fitness landscapes
and to introduce diversification into the search process. Guided local search (GLS)
(Voudouris and Tsang, 1995; Voudouris, 1997) uses a similar principle and dynam-
ically changes the fitness landscape subject to the progress that is made during the
search. In GLS, the neighborhood structure remains constant. Instead, it dynam-
ically modifies the fitness of solutions near local optima so that local search can
escape from local optima.
GLS considers problem-specific knowledge by using the concept of solution fea-
tures. A solution feature can be any property or characteristics that can be used to
distinguish high-quality from low-quality solutions. Examples of solution features
are edges used in a tree or graph, city pairs (for the TSP), or the number of unsatis-
fied clauses (for the SAT problem; see p. 126). The indicator function Ii (x) indicates
whether a solution feature i ∈ {1, . . . , M} is present in solution x. For Ii (x) = 1, solu-
tion feature i is present in x, for Ii (x) = 0 it is not present. GLS modifies the fitness
function f such that the fitness of solutions with solution features that exist in many
local optimal solutions is reduced. For a minimization problem, f (x) is modified to yield a new fitness function

h(x) = f (x) + λ ∑_{i=1}^{M} pi Ii (x),
with the regularization parameter λ and the penalty parameters pi . The pi are ini-
tialized as pi = 0. M denotes the number of solution features. λ weights the impact
of the solution features on the original fitness function f and pi balances the impact
of solution features of different importance.
Algorithm 3 describes the functionality of GLS. It starts from a random solution
x0 and performs a local search returning the local optimum x1 . To escape the local
optimum, a penalty is added to the fitness function f such that the resulting fitness
function h allows local search to escape. The strength of the penalty depends on the
utility ui , which is calculated for all solution features i ∈ {1, . . . , M} as

ui (x1 ) = Ii (x1 ) ci /(1 + pi ),

where ci is the cost of solution feature i. The ci are problem-specific and usually
remain unchanged during search. They are determined by the user and describe
the relative importance of the solution features. Examples of ci are the weights of
edges (graph or tree problems) or the city-pair distances (TSP). The function ui (x)
is unequal to zero for all solution features that are present in x. After calculating
the utilities, the penalty parameters pi are increased for those solution features i that
yield the highest utility value. After modifying the fitness function, we start a new
local search from x1 using the modified fitness function h. Search continues until a
termination criterion is met.
The utility function u penalizes solution features i with high cost ci and allows
us to consider problem-specific knowledge by choosing appropriate values for ci .
The presence of a solution feature with high cost leads to a high fitness value of
the corresponding solution, allowing local search to escape from this local optimum.
Figure 5.2 illustrates the idea of changing the fitness of local optima. By modifying
the fitness function f (x) and adding a penalty to f (x), h(x) assigns a lower fitness
to x. Thus, local search can leave x and is able to find the global optimal solution x∗ .
Fig. 5.2 Changing the fitness function from f to h allows local search to find the global optimum
If the parameter λ is chosen too small, the penalties have only a small effect on the modified fitness function h(x) and local search repeatedly finds the same local optimum (Mills and Tsang, 2000). In general, the setting of λ is problem-specific and must be done with care.
Examples of the successful application of GLS are TSPs (Voudouris and Tsang,
1999), bin-packing problems (Faroe et al, 2003b), VLSI design problems (Faroe
et al, 2003a), and SAT problems (Mills and Tsang, 2000; Zhang, 2004).
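The core of GLS fits in a few lines of Python. In the following hedged sketch, features is a list of indicator functions Ii, costs holds the corresponding ci, and penalties the mutable pi; these names are assumptions for illustration:

def gls_fitness(x, f, features, penalties, lam):
    # Augmented fitness h(x) = f(x) + lambda * sum_i p_i * I_i(x)
    # for a minimization problem.
    return f(x) + lam * sum(p * I(x) for p, I in zip(penalties, features))

def update_penalties(x, costs, penalties, features):
    # After local search has reached the local optimum x, increase the
    # penalties of the present features with maximum utility
    # u_i = I_i(x) * c_i / (1 + p_i).
    utils = [I(x) * c / (1 + p)
             for I, c, p in zip(features, costs, penalties)]
    u_max = max(utils)
    for i, u in enumerate(utils):
        if u == u_max:
            penalties[i] += 1

Alternating a local search on gls_fitness with calls to update_penalties reproduces the escape mechanism illustrated in Fig. 5.2.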
5.1.3 Iterated Local Search

Heuristics that use only intensification steps (like local search) are often able to
quickly find a local optimal solution but unfortunately cannot leave a local optimum
again. A straightforward way to introduce diversification is to perform sequential
local search runs using different initial solutions. Such approaches are commonly
known as multi-start approaches. The simplest variants of multi-start approaches
iteratively generate random solutions and perform local search runs starting from
those randomly generated solutions. Thus, we have distinct diversification phases
and can explore larger areas of the search space. Search strategies that randomly
generate initial solutions and perform a local search are also called multi-start de-
scent search methods.
However, randomly creating an initial solution and performing a local search often results in low solution quality as the complete search space is uniformly searched
and search cannot focus on promising areas of the search space. Iterated local search
(ILS) (Martin et al, 1991; Stützle, 1999; Lourenco et al, 2001) is an approach to
connect the unrelated local search phases as it creates initial solutions not randomly
but based on solutions found in previous local search runs. Therefore, it is based on
the same observations as VNS which assumes that local optima are not uniformly
distributed in the search space but similar to each other (Sect. 5.1.1, p. 134).
Algorithm 4 outlines the basic functionality of ILS. Relevant design criteria for
ILS are the modification of x and the acceptance criterion. If the perturbation steps
are too small, the following local search cannot escape from a local optimum and
again finds the same local optimum. If perturbation is too strong, ILS shows the
same behavior as multi-start descent search methods. The modification step as well
as the acceptance criterion can depend on the search history.
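A minimal Python sketch of this scheme (local_search, perturb, and accept are problem-specific functions supplied by the user; all names are illustrative assumptions):

def ils(x0, f, local_search, perturb, accept, iters=100):
    # Iterated local search: perturb the incumbent local optimum,
    # re-run local search, and keep the new optimum if accepted.
    x = local_search(x0)
    best = x
    for _ in range(iters):
        x1 = local_search(perturb(x))
        if accept(x, x1):            # e.g. accept if f(x1) >= f(x)
            x = x1
        if f(x) > f(best):
            best = x
    return best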
5.1.4 Simulated Annealing and Tabu Search

The previous examples illustrated how modern heuristics can make diversification
steps by modifying the representation/search operator, fitness function, or initial so-
lutions. Simulated annealing (SA) is a representative example of a modern heuris-
tic where the search strategy used explicitly defines intensification and diversifica-
tion phases/steps. The functionality and properties of SA are discussed in detail in
Sect. 3.4.3 (pp. 94-97). Its functionality is outlined in Algorithm 1, p. 96.
SA is a combination of a random walk through the search space and local
search. Diversification of search is a result of the random walk process and intensi-
fication is due to the local search steps. The amount of intensification and diversifi-
cation is controlled by the parameter T (see Fig. 3.21). With lower temperature T ,
intensification becomes stronger as solutions with lower fitness are accepted with
lower probability. The cooling schedule which determines how T is changed during
an SA run is designed such that at the beginning of the search diversification is high
whereas at the end of a run pure local search steps are used to find a local optimum.
Tabu search (TS) (Glover, 1977, 1986; Glover and Laguna, 1997) is a popular
modern heuristic. To diversify search and to escape from local optima, TS uses a list
of previously visited solutions. A simple TS strategy combines such a short term
memory (implemented as a tabu list) with local search mechanisms. New solutions
that are created by local search are added to this list, and local search may only move to solutions that are not contained in the tabu list T . To limit memory requirements, the length of the tabu list is usually bounded and the oldest solutions are removed after some search steps. Algorithm 5 outlines the basic
functionality of a simple TS. It uses a greedy search, which evaluates all neighboring
solutions N(x) of x and continues with the solution x′ ∈ N(x) that is not contained in T and has maximum fitness. Instead of removing the oldest element xoldest from T , other strategies for updating the tabu list are also possible.
The purpose of the tabu list is to allow local search to escape from local optima.
By prohibiting previously found solutions, new solutions must be explored. Figure
5.3 illustrates how the tabu list T contains all solutions of high fitness that are in the
neighborhood of x. Therefore, new solutions with lower fitness are created and local
search is able to find the global optimum.
Fig. 5.3 Removing solutions x ∈ T from the search space allows TS to escape from a local opti-
mum
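A minimal Python sketch of this simple TS (solutions must be hashable; the names are illustrative assumptions, and the bounded deque implements the removal of the oldest tabu entries):

from collections import deque

def tabu_search(x, f, neighbors, tabu_len=10, iters=100):
    # Simple tabu search (maximization): always move to the best
    # non-tabu neighbor, even if it is worse than the incumbent.
    tabu = deque([x], maxlen=tabu_len)    # oldest entries drop out automatically
    best = x
    for _ in range(iters):
        candidates = [y for y in neighbors(x) if y not in tabu]
        if not candidates:
            break                         # the whole neighborhood is tabu
        x = max(candidates, key=f)
        tabu.append(x)
        if f(x) > f(best):
            best = x
    return best

f = lambda x: -(x - 7) ** 2
print(tabu_search(0, f, lambda x: [x - 1, x + 1]))  # 7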
Diversification can be controlled by the length of the tabu list. Low values lead
to stronger intensification as the memory of the search process is smaller. In many
applications, the length of the tabu list is dynamically changed and depends on the
progress made during search (Glover, 1990; Taillard, 1991; Battiti and Tecchiolli,
1994; Battiti and Protasi, 2001). Such approaches are also known as reactive Tabu
search.
More advanced TS approaches do not store complete solutions in the tabu list
but solution attributes. Similarly to the solution features used in GLS, solution at-
tributes can be not only components of solutions but also search steps or differences
between solutions. When using different solution attributes, we obtain a set of tabu
criteria (called tabu conditions) which are used to filter the neighborhood of a so-
lution. To overcome the problem that single tabu conditions prohibit the creation of
appropriate solutions (each attribute prohibits the creation of a set of solutions), as-
piration criteria are used which overwrite tabu conditions and allow search steps to
create solutions where some tabu conditions are present. A commonly used aspira-
tion criterion selects solutions with higher fitness than the current best one. For more
information on TS and on application examples, we refer to the literature (Glover
and Laguna, 1997; Gendreau, 2003; Voß et al, 1999; Osman and Kelly, 1996).
5.1.5 Evolution Strategies
Evolution strategies (ES) are local search methods for continuous search spaces.
They were developed by Rechenberg and Schwefel in the 1960s at the Technical
University of Berlin (Rechenberg, 1973a,b) and first applications were experimental
and dealt with hydrodynamical problems like shape optimization of a bent pipe, drag
minimization of a joint plate (Rechenberg, 1965), or structure optimization of a two-
phase flashing nozzle (Schwefel, 1968). There are two different types of ES: Simple
strategies that use only one individual (like the local search strategies discussed in
the previous paragraphs) and advanced strategies that use a set of solutions which
is called a population. The use of a population allows ES to exchange information
between solutions in the population.
The simple (1 + 1)-ES uses n-dimensional continuous vectors and iteratively creates one offspring x′ ∈ Rn from one parent x ∈ Rn by adding randomly created values with zero mean and identical standard deviation σ to each parental decision variable xi :

x′i = xi + σ Ni (0, 1),
where i ∈ {1, . . . , n}. In ES and other biology-inspired search methods, a search step
is usually denoted as a mutation. N (0, 1) is a normally distributed one-dimensional
random variable with expectation zero and standard deviation one. Ni (0, 1) indi-
cates that the random variable is sampled anew for each possible value of the counter
i. In the (1 + 1)-ES, the resulting individual is evaluated and compared to the origi-
nal solution. The better one survives to be used for the creation of the next solution.
Algorithm 6 summarizes the basic functionality (maximization problem).
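A direct Python transcription of this scheme could look as follows (a sketch with a fixed σ and no step-size adaptation such as Rechenberg's 1/5-success rule; all names are illustrative):

import random

def one_plus_one_es(x, f, sigma=1.0, iters=1000, rng=random):
    # (1+1)-ES for maximization: mutate every component with Gaussian
    # noise of strength sigma; the better of parent and offspring survives.
    for _ in range(iters):
        y = [xi + sigma * rng.gauss(0, 1) for xi in x]
        if f(y) >= f(x):
            x = y
    return x

# Maximize the negative sphere function; the optimum is the origin
f = lambda x: -sum(xi ** 2 for xi in x)
print(one_plus_one_es([5.0, -3.0], f))  # close to [0.0, 0.0]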
For the (1 + 1)-ES, theoretical convergence models for two simple problems, the
sphere model and the corridor model, exist (Rechenberg, 1973a). Both problems are
standard test problems for continuous local search methods. The corridor model is
representative of the situation where the current solution x has a large distance to
the optimum x∗ :

fcorridor (x) = c0 + c1 x1 , with |xi | ≤ b/2 for i ∈ {2, . . . , n}, (5.1)

and the sphere model describes the situation near the optimal solution x∗ :

fsphere (x) = c0 + c1 ∑_{i=1}^{n} (xi − x∗i )2 . (5.2)
In population-based ES, recombination is applied to the decision variables xi as well as to the standard deviations σi . For discrete crossover, xi is randomly taken from one parent, whereas intermediate crossover creates σi as the arithmetic mean of the parents’ standard deviations (see Sect. 4.3).
The main search operator in ES is mutation. It is applied to every individual after recombination. In the standard self-adaptive variant, first the standard deviations σi are mutated and then the decision variables:

σ′i = σi exp(τ ′ N(0, 1) + τ Ni (0, 1)),
x′i = xi + σ′i Ni (0, 1).

For more details on ES and theoretical results, we refer to the literature (Schwefel, 1981; Rechenberg, 1994; Schwefel, 1995; Bäck and Schwefel, 1995; Bäck, 1996, 1998; Beyer and Deb, 2001; Beyer and Schwefel, 2002).
5.2 Recombination-Based Search Methods

In contrast to local search approaches which exploit the locality of problems and
show good performance for problems with high locality, recombination-based ap-
proaches make use of the decomposability of problems and perform well for de-
composable problems. Recombination-based search solves decomposable problems
by decomposing them into smaller subproblems, solving those smaller subproblems
separately, and combining the resulting solutions for the subproblems to form over-
all solutions (Sect. 2.4.3). Hence, the main search operator used in recombination-
based search methods is, of course, recombination.
Recombination operators should be designed such that they properly decompose
the problem and combine high-quality sub-solutions in different ways (Goldberg,
1991a; Goldberg et al, 1992). A proper decomposition of problems is the key fac-
tor for the design of successful recombination operators. Recombination operators
must be able to identify the relevant properties of a solution and transfer these as
a whole to offspring solutions. In particular, they must detect between which parts
of a solution a meaningful linkage exists and not destroy this linkage when creat-
ing an offspring solution. This property of recombination operators is often called
linkage learning (Harik and Goldberg, 1996; Harik, 1997). If a problem is decom-
posable, the proper problem decomposition is the most demanding part as usually
the smaller subproblems can be solved much more easily than the full problem.
However, in reality, problems are most often not completely separable and some linkage between the different sub-problems remains. Therefore, it is usually not enough for recombination-based methods to decompose the problem and solve the smaller sub-problems; they must also try different com-
binations of high-quality sub-solutions to form an overall solution. This process of
juxtaposing various high-quality sub-solutions to form different overall solutions is
often called mixing (Goldberg et al, 1993b; Thierens, 1995; Sastry and Goldberg,
2002).
Recombination operators construct new solutions by recombining the proper-
ties of parent solutions. Therefore, recombination-based modern heuristics usually
use a population of solutions since when using only single individuals, like in
most local search approaches, no properties of different solutions can be recom-
bined to form new solutions. Consequently, the main differences between local and
recombination-based search are the use of a recombination operator and a popula-
tion of solutions.
For local search approaches, we have been able to classify methods with respect
to their main source of diversification. In principle, we can use the same mechanisms
for recombination-based search methods, too.
5.2.1 Genetic Algorithms

Genetic algorithms (GAs) were introduced by Holland (1975) and imitate basic
principles of nature formulated by Darwin (1859) and Mendel (1866). They are
(like population-based ES discussed on p. 142) based on three basic principles:
• There is a population of solutions. The properties of a solution are evaluated
based on the phenotype, and variation operators are applied to the genotype.
Some of the solutions are removed from the population if the population size
exceeds an upper limit.
• Variation operators create new solutions with similar properties to existing solu-
tions. In GAs, the main search operator is recombination and mutation serves as
background operator. In ES, the situation is reversed.
• High-quality individuals are selected more often for reproduction by a selection
process.
To illustrate the basic functionality of GAs, we want to use the standard simple ge-
netic algorithm (SGA) popularized by Goldberg (1989c). This basic GA variant is
commonly used and well understood; it uses crossover as the main operator (mutation serves only as background noise). SGAs use a constant population size N; usually
the N individuals xi ∈ {0, 1}l (i ∈ {1, . . . , N}) are binary strings of length l, and re-
combination operators like uniform or n-point crossover are directly applied to the
genotypes. The basic functionality of a SGA is shown in Algorithm 7. After ran-
domly creating and evaluating an initial population, the algorithm iteratively creates
new generations by recombining (with probability pc ) the selected highly fit indi-
viduals and applying mutation (with probability pm ) to the offspring. The function
random(0, 1) returns a uniformly distributed value in [0, 1).
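A compact Python sketch of such a GA is given below. It follows the structure just described but, as a simplification, uses binary tournament selection and uniform crossover; all names and default parameter values are illustrative assumptions:

import random

def sga(f, l=20, N=50, pc=0.9, pm=None, gens=100, rng=random):
    # Simple GA sketch: binary strings, binary tournament selection,
    # uniform crossover with probability pc, bit-flip mutation with
    # probability pm per bit (default 1/l).
    pm = pm if pm is not None else 1.0 / l
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(N)]
    for _ in range(gens):
        nxt = []
        while len(nxt) < N:
            p1 = max(rng.sample(pop, 2), key=f)   # tournament selection
            p2 = max(rng.sample(pop, 2), key=f)
            if rng.random() < pc:                 # uniform crossover
                child = [rng.choice(pair) for pair in zip(p1, p2)]
            else:
                child = list(p1)
            child = [b ^ 1 if rng.random() < pm else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=f)

print(sum(sga(sum)))  # onemax: the best string is close to all ones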
The following paragraphs briefly explain the basic elements of a GA. The se-
lection process performed in population-based search approaches is equivalent to
local search for single individuals as it distinguishes high-quality from low-quality
solutions and selects promising solutions. Popular selection schemes are propor-
tionate selection (Holland, 1975) and tournament selection (Goldberg et al, 1989).
For proportionate selection, the expected number of copies a solution has in the next
population is proportional to its fitness. The chance of a solution xi being selected is pi = f (xi )/∑_{j=1}^{N} f (x j ). Recombination plays an ambivalent role for diversification: on the one hand, recombination operators are able to create new solutions. On the other
hand, usually recombination does not actively diversify the population but has an in-
tensifying character. Crossover operators reduce diversity as the distances between
offspring and parents are usually smaller than the distance between parents (see
(4.1), p. 118). Therefore, the iterative application of crossover alone reduces the di-
versity of a population as either some solution properties can become extinct in the
population (this effect is known as genetic drift (see p. 171) and especially relevant
for binary solutions) or the decision variables converge to an average value (espe-
cially for continuous decision variables). The statistical properties of a population to which only recombination operators are applied do not change (recall that the aim of crossover is to re-combine different optimal solutions for subproblems); for example, the average fitness of a population remains constant (given a meaningful recombination operator and N → ∞). However, diversity decreases
as the solutions in the population become more similar to each other.
5.2.2 Estimation of Distribution Algorithms

The simplest EDAs, which are applicable to binary problem domains, are population-based incremental learning (PBIL) (Baluja, 1994), the univariate marginal distribution algorithm (UMDA) (Mühlenbein and Paaß, 1996), and the compact GA (cGA) (Harik et al, 1998). These algorithms assume that all variables
are independent (which is an unrealistic assumption for most real-world problems)
and calculate the probability of an individual as the product of the probabilities
of every decision variable. More advanced approaches assume bivariate dependen-
cies between decision variables. Examples are mutual-information-maximizing input clustering (MIMIC) (de Bonet et al, 1997), combining optimizers with
mutual information trees (COMIT) (Baluja and Davies, 1997), probabilistic incre-
mental program evolution (PIPE) (Salustowicz and Schmidhuber, 1997), and bi-
variate marginal distribution algorithm (BMDA) (Pelikan and Mühlenbein, 1999).
The most complex EDAs that assume multivariate dependencies between decision
variables are the factorized distribution algorithm (FDA) (Mühlenbein and Mah-
nig, 1999), the extended compact GA (Harik, 1999), the polytree approximation of
distribution algorithm (PADA) (Soto et al, 1999), estimation of Bayesian networks
algorithm (Etxeberria and Larrañaga, 1999), and the Bayesian optimization algo-
rithm (BOA) (Pelikan et al, 1999a). Although EDAs are still a young research field,
EDAs using multivariate dependencies show promising results on binary problem
domains and often outperform standard GAs (Larrañaga and Lozano, 2001; Pelikan,
2006). Their main advantage in comparison to standard crossover operators is the
ability to learn the linkage between solution variables and the position-independent
creation of new solutions.
The situation is different for continuous domains (Larrañaga et al, 1999; Bosman
and Thierens, 2000; Gallagher and Frean, 2005). Here, the probability distributions
used (for example Gaussian) are often not able to model the structure of the land-
scape in an appropriate way (Bosman, 2003) leading to a low performance of EDAs
for continuous search spaces (Grahl et al, 2005).
For more detailed information on the functionality and application of EDAs we
refer to Larrañaga and Lozano (2001) and Pelikan (2006).
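To make the idea concrete, the following sketch shows a bare-bones UMDA for binary strings: it assumes independent variables, estimates one frequency per bit from the selected individuals, and samples the next population from these frequencies. All parameter names are illustrative assumptions; practical EDAs add safeguards such as probability borders.

import random

def umda(f, l=20, N=100, sel=50, gens=50, rng=random):
    # Univariate marginal distribution algorithm for binary strings:
    # the probabilistic model is one independent frequency per bit.
    p = [0.5] * l
    for _ in range(gens):
        pop = [[1 if rng.random() < p[i] else 0 for i in range(l)]
               for _ in range(N)]
        best = sorted(pop, key=f, reverse=True)[:sel]   # truncation selection
        p = [sum(x[i] for x in best) / sel for i in range(l)]
    return p

print(umda(sum))  # onemax: the bit frequencies drift towards 1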
5.2.3 Genetic Programming

Genetic programming (GP) (Smith, 1980; Cramer, 1985; Koza, 1992) is a variant
of GAs that evolves programs. Although most GP approaches use trees to represent
programs (Koza, 1992, 1994; Koza et al, 1999, 2005; Banzhaf et al, 1997; Langdon
and Poli, 2002), there are also a few approaches that encode programs using lin-
ear bitstrings (for example, grammatical evolution (Ryan, 1999; O’Neill and Ryan,
2003) or Cartesian genetic programming (Miller and Thomson, 2000)). The com-
mon feature of GP approaches is that the phenotypes are programs or variable-length
structures like electronic circuits or controllers.
In analogy to GAs, GP starts with a population of random candidate programs.
Each program is evaluated on a given task and its fitness value is assigned. Often,
the fitness of an individual is determined by measuring how well the found solution
(e.g. a computer program) performs a specific task. The basic functionality of GP
follows GA functionality. The main differences are the search space which consists
of tree structures of variable size and the corresponding search operators which have
to be tree-specific. Solutions (programs) are usually represented as parse trees. Parse
trees represent the syntactic structure of a string according to some formal grammar.
In a parse tree, each node is either a root node, a branch node, or a leaf node. Interior
nodes represent functions and leaf nodes represent variables, constants, or functions
with no arguments. Therefore, the nodes in a parse tree are either members of
• the terminal set T (leaf nodes) representing independent variables of the problem,
zero-argument functions, random constants, or terminals with side-effects (for
example move operations like “turn-left” or “move forward”) or
• the function set F (interior nodes) representing functions (for example arithmetic
or logical operations like “+”, ∧, or ¬), control structures (for example “if” or
“while” clauses), or functions with side-effects (for example “write to file” or
“read”).
The definition of F and T is problem-specific and they should be designed such that
• each function is defined for each possible parameter. Parameters are either termi-
nals or function returns.
• T and F must be chosen such that a parse tree can represent a solution for the
problem. Solutions that cannot be constructed using the sets T and F are not
elements of the search space and cannot be found by the search process.
In GP, the depth k and structure of a parse tree are variable. Figure 5.5 gives two
examples of parse trees.
The functionality of GP is analogous to GAs (see Algorithm 7, p. 148). The differences are the use of a direct representation (parse trees) and tree-specific initialization
and variation operators. During initialization, we generate random parse trees of
maximal depth kmax . There are three basic methods (Koza, 1992): grow, full and
ramped-half-and-half. The grow method starts with an empty tree and iteratively
assigns a node either to be a function or a terminal. If a node is a terminal, a random
terminal from the terminal set T is chosen. If the node is a function, we choose a
random function from F. Furthermore, a number of child nodes are added such that
their number equals the number of arguments necessary for the chosen function. The
procedure stops either at depth k = kmax or when all leaf nodes are terminals. The
full method also starts with an empty tree but all nodes at depth k ∈ {1, . . . , kmax − 1}
are functions. All nodes at depth k = kmax become terminals. Therefore, all resulting
trees have depth kmax . For the ramped-half-and-half method, the population is evenly
divided into (kmax − 1) parts. Half of each part is created by the grow method and
half by the full method. For both halves, the depth of the trees in the ith part is equal to i, where i ∈ {2, . . . , kmax }. Thus, the diversity is high in the resulting initial
population.
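The grow method translates almost directly into code. In the following sketch, the function set (with arities), the terminal set, and the probability of choosing a terminal before the depth limit are all illustrative assumptions:

import random

FUNCS = [("+", 2), ("*", 2), ("neg", 1)]   # hypothetical function set (name, arity)
TERMS = ["x", "y", "1"]                    # hypothetical terminal set

def grow(depth, k_max, rng=random):
    # Each node becomes a terminal or a function at random;
    # at depth k_max only terminals are allowed.
    if depth == k_max or rng.random() < 0.3:
        return rng.choice(TERMS)                          # leaf node
    name, arity = rng.choice(FUNCS)                       # interior node
    return (name, [grow(depth + 1, k_max, rng) for _ in range(arity)])

print(grow(0, 4))  # e.g. ('+', ['x', ('neg', ['1'])])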
As no standard search operators can be applied to parse trees, tree-specific
crossover and mutation operators are necessary. Like in SGAs, crossover is the
main search operator and mutation acts as background noise. Crossover exchanges
randomly selected sub-trees between two parse trees, whereas mutation usually re-
places a randomly selected sub-tree by a randomly generated one. Like in GAs,
standard GP crossover (Koza, 1992) chooses two parent solutions and generates
two offspring by swapping sub-trees (see Fig. 5.6). Analogously, mutation chooses
a random sub-tree and replaces it with a randomly generated new one (see Fig. 5.7).
Other operators used in GP (Koza et al, 2005) are permutation, which changes the
order of function parameters, editing, which replaces sub-trees by semantically sim-
pler expressions, and encapsulation, which encodes a sub-tree as a more complex
single node.
In recent years, GP has shown encouraging results finding programs or strategies
for problem-solving (Koza et al, 2005; Kleinau and Thonemann, 2004). However,
often the computational effort for finding high-quality solutions even for problems
of small sizes is extremely high. Currently, open problems in GP are the low locality
of the representation/operator combination (compare Chap. 7), the bias of the search
operators (compare the discussion in Sect. 6.2.3, p. 171), and bloat. Bloat describes the problem that during a GP run the average size of programs grows, sometimes exponentially. Although there is a substantial amount of work
trying to fix problems with bloat (Nordin and Banzhaf, 1995; Langdon and Poli,
1997; Soule, 2002; Luke and Panait, 2006), there is no solution for this problem yet
(Banzhaf and Langdon, 2002; Gelly et al, 2005; Luke and Panait, 2006; Whigham
and Dick, 2010).
Chapter 6
Design Principles
The first topic of this chapter is locality: for guided search to work, search operators must fit the metric on the search space and representations must have high locality; this means phenotype distances must correspond to genotype distances.
The second topic of this chapter is how we can consider knowledge about
problem-specific properties of a problem for the design of modern heuristics. For
example, we have knowledge about the character and properties of high-quality (or
low-quality) solutions. Such problem-specific knowledge can be exploited by intro-
ducing a bias into modern heuristics. The bias should consider this knowledge and,
for example, concentrate search on solutions that are expected to be of high quality
or avoid solutions expected to be of low quality. A bias can be considered in all de-
sign elements of modern heuristics namely the representation, the search operator,
the fitness function, the initialization, and also the search strategy. However, mod-
ern heuristics should only be biased if we have obtained some particular knowledge
about an optimization problem or problem instance. If we have no knowledge about
properties of a problem, we should not bias modern heuristics as this will mislead
the search heuristics.
Section 6.1 discusses how the design of modern heuristics can modify the locality
of a problem. The locality of a problem is high if the distances between individu-
als correspond to fitness differences. To ensure guided search, the search operators
must fit the problem metric. Local search operators must generate neighboring so-
lutions and recombination operators must generate offspring where the distances
between offspring and parents do not exceed the distances between parents. If the fit
is poor, guided search is not possible any more and modern heuristics become ran-
dom search due to a high diversification of search. Section 6.1.2 focuses on repre-
sentations and recommends that the locality of representations should be high; this
means neighboring genotypes correspond to neighboring phenotypes. Only then,
the locality of a problem can be preserved and guided search becomes possible.
Section 6.2 focuses on the possibility of biasing modern heuristics. We discuss
how problem-specific construction heuristics can be used as genotype-phenotype
mappings (Sect. 6.2.1) and how redundant representations affect heuristic search
(Sect. 6.2.2). We distinguish between synonymously and non-synonymously redun-
dant representations. Non-synonymously redundant representations lead to reduced
performance of guided search methods for most real-world problems. Using synony-
mously redundant representations can increase performance if a priori knowledge is
considered for their design. The section ends with comments on the bias of search
operators (Sect. 6.2.3).
6.1 High Locality

The locality of a problem measures how well the distances d(x, y) between any two
solutions x, y ∈ X correspond to the difference in their fitness values | f (x) − f (y)|
(Sect. 2.4.2). Locality is high if neighboring solutions have similar fitness values
and fitness differences correlate positively with distances. In contrast, the locality
of a problem is low if low distances do not correspond to low differences in the
fitness values. Important for the locality of a problem is the metric defined on the
search space. Usually, the problem metric is a result of the model developed for the
problem.
The performance of guided search methods is high if the locality of a problem is
relatively high; this means the structure of the fitness landscape leads search algo-
rithms to high quality solutions (see Sects. 2.4.2 and 3.4.4). Local search methods
show especially good performance if either high-quality or low-quality solutions are
grouped together in the solution space (compare the discussion of the submedian-
seeker in Sect. 3.4.4). Optimization problems with high locality can usually be
solved well using modern heuristics as all modern heuristics have some kind of
local search elements. Representative examples of optimization problems with high
locality are unimodal problems like the sphere model (Sect. 5.1.5, p. 143) where
high-quality solutions are grouped together in the search space.
This section discusses design guidelines for search operators and representations.
Search operators must fit the metric of a search space, because otherwise modern
heuristics show low performance as they behave like random search. A representa-
tion introduces an additional genotype-phenotype mapping. The locality of a repre-
sentation describes how well the metric on the phenotype space fits to the metric on
the genotype space. Low locality, which means there is a poor fit, randomizes the
search and also leads to low performance of modern heuristics.
6.1.1 Search Operators

Modern heuristics rely on the concept of local search. Local search iteratively gener-
ates new solutions similar to existing ones. Local search is a reasonable and success-
ful search approach for real-world problems as most real-world problems have high
locality and are neither deceptive nor difficult. In addition, to avoid being trapped
in local optima, modern heuristics use diversification steps. Diversification steps
randomize search and allow modern heuristics to jump through the search space.
We have seen in the previous chapter that different types of modern heuristics use
different concepts for controlling intensification and diversification. Local search
intensifies the search as it allows incremental improvements of already found so-
lutions. Diversification steps must be relatively rare as they usually lead to inferior
solutions. When designing search operators, we must have in mind that modern
heuristics use local search operators as well as recombination operators for intensi-
fying the search (Sects. 5.1 and 5.2). Solutions that are generated should be similar
to the existing ones. Therefore, we must ensure that search operators (local search
operators as well as recombination operators) generate similar solutions and do not
jump around in the search space. This can be done by ensuring that local search
operators generate neighboring solutions and recombination operators generate so-
lutions where the distances between parent and offspring are smaller or equal to the
distances between parents (Sect. 4.3.3 and 4.1).
6.1.2 Representation
The locality dm of a representation can be defined as

dm = ∑_{d^g_{x,y} = d^g_{min}} |d^p_{x,y} − d^p_{min}| , (6.1)

where d^p_{x,y} is the phenotypic distance between the phenotypes x^p and y^p , d^g_{x,y} is the genotypic distance between the corresponding genotypes, and d^p_{min} and d^g_{min} are the minimum distances between two (neighboring) phenotypes and genotypes, respectively. Without loss of generality, we assume that d^g_{min} = d^p_{min} . For dm = 0, all genotypic neighbors correspond to phenotypic neighbors and the encoding has perfect (high) locality.
We want to emphasize that the locality of a representation depends on the rep-
resentation fg and the metrics that are defined on Φg and Φ p . fg alone only deter-
mines which phenotypes are represented by which genotypes and cannot be used
for measuring similarity between solutions. To describe or measure the locality of a
representation, a metric must be defined on Φg and Φ p .
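Given such metrics, dm can be computed by enumerating all pairs of genotypic neighbors, as in the following sketch (assuming, for simplicity, d^g_{min} = d^p_{min} = 1; all names are illustrative):

from itertools import combinations

def locality(genotypes, phenotype, d_g, d_p, d_min_g=1, d_min_p=1):
    # Locality d_m of a representation, cf. (6.1): sum of the excess
    # phenotypic distances over all pairs of neighboring genotypes.
    d_m = 0.0
    for x, y in combinations(genotypes, 2):
        if d_g(x, y) == d_min_g:                  # genotypic neighbors
            d_m += abs(d_p(phenotype(x), phenotype(y)) - d_min_p)
    return d_m

# Binary encoding of the integers 0..7: Hamming neighbors can be
# phenotypically distant, so d_m > 0 (low locality for high-order bits)
hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
binary = lambda g: sum(b << i for i, b in enumerate(reversed(g)))
bits = [tuple((n >> s) & 1 for s in (2, 1, 0)) for n in range(8)]
print(locality(bits, binary, hamming, lambda a, b: abs(a - b)))  # 16.0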
Figure 6.1 illustrates the difference between high- and low-locality representations.
We assume 12 different phenotypes (a-l) and measure distances between solutions
using the Euclidean metric. Each phenotype (lower-case symbol) corresponds
to one genotype (upper-case symbol). The representation f_g has perfect (high) locality.
This section discusses how to bias modern heuristics. If we have some knowledge
about the properties of either high-quality or low-quality solutions, we can make
use of this knowledge for the design of the key elements of modern heuristics. For
representations, we can incorporate heuristics or introduce redundant encodings and
assign a larger number of genotypes to high-quality phenotypes. Search operators
can be designed in such a way that they distinguish between high-quality and low-
quality solution features (building blocks) and prefer the high-quality ones. Analo-
gously, we can bias the fitness function, the initial solution, and the search strategy.
A representation or search operator is biased if an application of a variation oper-
ator generates some solutions in the search space with higher probability (Caruana
and Schaffer, 1988). We can bias representations by incorporating heuristics into
the genotype-phenotype mapping. Furthermore, representations can be biased if the
number of genotypes exceeds the number of phenotypes. Then, representations are
called redundant (Gerrits and Hogeweg, 1991; Ronald et al, 1995; Shipman, 1999;
Rothlauf and Goldberg, 2003). Redundant representations are biased if some phe-
notypes are represented by a larger number of genotypes. Analogously, search op-
erators are biased if some solutions are generated with higher probability.
In analogy to representations and search operators, we can also bias the initial-
ization method. An unbiased initialization method creates all solutions in the search
space with equal probability and chooses the starting point of the search randomly.
If we have some knowledge about the structure or properties of optimal solutions,
we can bias the initial solutions and focus on more promising areas of the search
space. Construction heuristics that apply some rules of thumb for the creation of an
initial solution are usually biased as the created solution is expected to be better than
an average solution. For more information and examples on biasing initial solutions,
we want to refer to Sects. 4.5 and 8.5.
We can also bias the fitness function. Here, we must have in mind that the fitness
function is biased per se as it allows us to distinguish high-quality from low-quality
solutions. However, we can add an additional bias to the fitness function that fa-
vors solutions with advantageous properties. An example is the smoothing of fitness
functions (Sect. 4.4). Here, we modify the fitness of solutions according to the fit-
ness of neighboring solutions. An example of a search strategy with a systematic
bias of the fitness function is GLS. GLS adds a penalty to all solutions that have
some specific solution feature. Furthermore, we often bias the fitness function if
infeasible solutions exist in the search space. Subject to the number or types of con-
straint violations, we add penalties to infeasible solutions. The penalties are added
in such a way that they should lead guided search methods in the direction of fea-
sible solutions (Gottlieb, 1999). For more information on bias of fitness functions,
we refer to Sect. 4.4.
Finally, we also can bias a search strategy. Search strategies determine the inten-
sification and diversification mechanisms of a modern heuristic. The ratio between
intensification and diversification depends on the properties of the underlying prob-
lem. If we know that a problem is easy for pure local search (e.g. unimodal prob-
lems), we can bias a search strategy to use more intensification steps. For example,
for unimodal problems, we can set the number of restarts for ILS to zero or start
simulated annealing with temperature T = 0. If the problem is more difficult and a
larger number of local optima exist, diversification becomes more important and we
can bias modern heuristics to use more diversification steps. For example, we can
increase the number of restarts for ILS or start SA with a higher temperature.
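As a sketch of this idea, the following Python skeleton makes the diversification
strength of ILS an explicit parameter; local_search, perturb, and fitness are
hypothetical problem-specific functions, and n_restarts = 0 reduces the procedure
to pure local search, as suggested above for unimodal problems:

    import random

    def biased_ils(initial, local_search, perturb, fitness,
                   n_restarts=10, seed=None):
        # n_restarts controls the intensification/diversification ratio:
        # 0 means pure local search; larger values add diversification
        # steps that help on problems with many local optima.
        rng = random.Random(seed)
        best = local_search(initial)
        for _ in range(n_restarts):
            candidate = local_search(perturb(best, rng))
            if fitness(candidate) > fitness(best):  # assume maximization
                best = candidate
        return best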
When biasing modern heuristics, we must make sure that we have a priori knowl-
edge about the problem and the bias exploits this knowledge in an appropriate way.
Introducing an inappropriate or wrong bias into modern heuristics would mislead
search and result in low solution quality. Furthermore, we must make sure that a bias
is not too strong. Using a bias can focus the search on specific areas of the search
space and exclude solutions from consideration. If the bias is too strong, modern
heuristics can easily fail. For example, reducing the number of restarts of ILS to
zero and performing only local search is a promising strategy for unimodal prob-
lems. However, if only a few local optima exist, the bias is too strong as search can
never escape any local optimum. In general, we must use bias carefully. A wrong or
too strong bias can easily lead to failure of modern heuristics.
The following sections discuss bias of representations and search operators.
Sect. 6.2.1 gives an overview of how problem-specific construction heuristics can be
used as genotype-phenotype mappings. Then, heuristic search varies either the input
(problem space search) or the parameters (heuristic space search) of the construc-
tion heuristic. Sect. 6.2.2 discusses redundant representations. Redundant represen-
tations with low locality (called non-synonymously redundant representations) ran-
domize guided search and thus should not be used. Redundant representations with
high locality (synonymously redundant representations) can be biased by overrepre-
senting solutions similar to optimal solutions. Finally, Sect. 6.2.3 discusses how the
bias of search operators affects modern heuristics and gives an example of the bias
of standard search operators used in genetic programming.
For non-synonymously redundant representations, a search step does not result in a similar phenotype but creates a random solution.
Therefore, guided search is no longer possible but becomes random search. As a re-
sult, we get reduced performance of modern heuristics on straightforward problems
when using non-synonymously redundant representations.
Fig. 6.4 The effects of local search steps for synonymously versus non-synonymously redundant
representations. The arrows indicate search steps
Chapter 7
High Locality Representations for Automated Programming
The previous chapter illustrated that high locality of representations and search op-
erators is a prerequisite for efficient and effective modern heuristics. This chap-
ter presents a case study on the locality of the genotype-phenotype mapping of
grammatical evolution in comparison to standard genetic programming. Gram-
matical evolution (GE) is a variant of GP (Sect. 5.2.3, p. 153) that can evolve
complete programs in an arbitrary language using a variable-length binary string
(O’Neill and Ryan, 2001). In GE, phenotypes (programs) are created from binary
genotypes by using a complex representation (genotype-phenotype mapping). The
representation selects production rules in a Backus-Naur form grammar and thereby
creates a phenotype. GE approaches have been applied to several test problems and
real-world applications and good performance has been reported (O’Neill and Ryan,
2001, 2003; Ryan and O’Neill, 1998).
We study the locality of the genotype-phenotype mapping used in GE. In contrast
to standard GP, which applies search operators directly to phenotypes (Sect. 5.2.3),
GE uses an additional mapping and applies search operators to binary genotypes.
Therefore, there is a large semantic gap between genotypes (binary strings) and
phenotypes (programs or expressions). The case study shows that the mapping used
in GE has low locality leading to low performance of standard mutation operators.
The study at hand is an example of how basic design principles of modern heuristics
can be applied to explain performance differences between different GP approaches
and demonstrates current challenges in the design of GE.
We start with relevant properties of GE. Then, Sect. 7.2 presents two standard
benchmark problems for GP. In Sect. 7.3, we study the locality of the genotype-
phenotype mapping used in GE. Finally, Sect. 7.4 presents results on how the local-
ity of the representation affects the performance of local search.
Functionality
GE is a search method that can evolve computer programs defined in BNF. In con-
trast to standard GP (Koza, 1992), the genotypes are not parse trees but bitstrings
of variable length. A genotype consists of groups of eight bits (called codons) that
select production rules from a BNF grammar. The construction of the phenotype
from the genotype is presented in Sect. 7.1.
The functionality of GE follows standard GA approaches using binary genotypes
(Algorithm 7, p. 148). As standard binary strings are used as genotypes, no specific
crossover or mutation operators are necessary. Therefore, we have an indirect repre-
sentation and standard crossover operators like one-point or uniform crossover and
standard mutation operators like bit-flipping mutation can be used (Sect. 4.3.5). A
common metric for measuring distances between binary genotypes is Hamming dis-
tance (2.7). Therefore, the application of bit-flipping mutation creates a new solution
with genotype distance d^g = 1. For selection, standard operators like tournament
selection or roulette-wheel selection can be used (Sect. 5.2.1). Some GE imple-
mentations use steady state replacement mechanisms and duplication operators that
duplicate a random number of codons and insert these after the last codon position.
As usual, selection decisions are performed based on the fitness of phenotypes.
GE has been successfully applied to a number of diverse problem domains
such as symbolic regression (O’Neill and Ryan, 2001, 2003), trigonometric iden-
tities (Ryan and O’Neill, 1998), symbolic integration (O’Neill and Ryan, 2003), the
Santa Fe trail (O’Neill and Ryan, 2001), and others. The results indicate that GE can
be applied to a wide range of problems and validate the ability of GE to generate
multi-line functions in any language following BNF notation.
Backus-Naur Form
In GE, the Backus-Naur form grammar is used to define the grammar of a lan-
guage as production rules. Based on the information stored in the genotypes, BNF-
production rules are selected and form the phenotype. BNF distinguishes between
terminals, which are equivalent to leaf nodes in trees, and non-terminals, which
can be interpreted as expandable interior nodes (Sect. 5.2.3). A grammar in BNF
is defined by the set {N, T, P, S}, where N is the set of non-terminals, T is the set
of terminals, P is a set of production rules that maps N to T , and S ∈ N is a start
symbol.
In order to apply GE to a problem, it is necessary to define a BNF grammar for
the problem. It must be defined in such a way that an optimal solution can be created
from the elements defined by the grammar.
Genotype-phenotype Mapping
In GE, a phenotype is created from binary genotypes in two steps. In the first step,
integer values are calculated from codons of eight bits. Therefore, from a binary
genotype x^{g,bin} of length 8l, we get an integer genotype x^{g,int} of length l,
where each integer x_i^{g,int} ∈ {0, . . . , 255}, for i ∈ {0, . . . , l − 1}. Beginning with
the start symbol S ∈ N, the integer value x_i^{g,int} is used to select production rules
from the BNF grammar. n_P denotes the number of production rules in P. To select
a rule, we calculate the number of the used rule as x_i^{g,int} mod n_P (mod denotes
the modulo operation). In this manner, the mapping process traverses the genotype
string beginning from the left-hand side (x_0^{g,int}) until one of the following
situations arises:
• The mapping is complete. All non-terminals are transformed into terminals and
a complete phenotype x p is generated.
• The end of the genome is reached (i = l − 1) but the mapping process is not
yet finished. The individual is wrapped, which means the alleles are reused, and
the reading of codons continues. As the genotype is iteratively used with differ-
ent meaning, wrapping can have a negative effect on locality. However, without
wrapping, a larger number of individuals would be incomplete and infeasible.
• An upper threshold on the number of wrapping events is reached and the mapping
is not yet complete. The mapping process is halted and the individual is assigned
the lowest possible fitness value.
The mapping is deterministic, as the same genotype always results in the same
phenotype. However, the interpretation of x_i^{g,int} can be different if the genotype
is wrapped and a different type of rule is selected. Figure 7.1 illustrates the
functionality of the mapping. One after another, eight bits of x^{g,bin} are interpreted
as a binary string resulting in an integer value between 0 and 255. Then, we
iteratively select the (x_i^{g,int} mod n_P)-th production rule. For the example, we
assume n_P = 4.
For a more detailed description of the mapping process including illustrative ex-
amples, we refer to O’Neill and Ryan (2001) or O’Neill and Ryan (2003).
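As an illustration of the mapping process, the following Python sketch decodes a
binary GE genotype; the small grammar at the end is hypothetical and serves only
to show the modulo rule selection and the wrapping mechanism:

    def ge_map(bits, grammar, start, max_wraps=2):
        # bits: list of 0/1 values, length a multiple of 8 (codons).
        # grammar: dict mapping non-terminals to lists of productions;
        # a production is a sequence of terminals and non-terminals.
        # Returns the phenotype, or None if the mapping is incomplete
        # after max_wraps wrapping events (lowest fitness is then assigned).
        codons = [int(''.join(map(str, bits[i:i + 8])), 2)
                  for i in range(0, len(bits), 8)]
        symbols = [start]          # sentential form, expanded left to right
        out, i, wraps = [], 0, 0
        while symbols:
            s = symbols.pop(0)
            if s not in grammar:   # terminal symbol
                out.append(s)
                continue
            if i == len(codons):   # end of genome: wrap, reuse the alleles
                wraps += 1
                if wraps > max_wraps:
                    return None
                i = 0
            rules = grammar[s]
            choice = codons[i] % len(rules)   # modulo rule selection
            i += 1
            symbols = list(rules[choice]) + symbols
        return ' '.join(out)

    # Hypothetical grammar: <expr> ::= <expr> <op> <expr> | x ; <op> ::= + | *
    grammar = {'<expr>': [['<expr>', '<op>', '<expr>'], ['x']],
               '<op>':   [['+'], ['*']]}
    genotype = [0, 0, 0, 0, 0, 0, 0, 0,   # codon value 0: <expr> -> <expr> <op> <expr>
                0, 0, 0, 0, 0, 0, 0, 1,   # codon value 1: <expr> -> x
                0, 0, 0, 0, 0, 0, 1, 1,   # codon value 3: <op> -> *
                0, 0, 0, 0, 0, 0, 0, 1]   # codon value 1: <expr> -> x
    print(ge_map(genotype, grammar, '<expr>'))   # prints: x * x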
We study the locality and performance of GE for the Santa Fe ant trail and symbolic
regression problem. Both problems are commonly used test problems for GP and
GE.
Santa Fe Ant Trail
In the Santa Fe ant trail problem, 89 pieces of food are located on a discontinuous
trail which is embedded in a 32×32 toroidal grid. The goal is to determine rules that
guide the movements of an artificial ant and allow the ant to collect the maximum
number of pieces of food in tmax search steps. In each search step, exactly one action
can be performed. The ant can turn left (left()), turn right (right()), move
one square forward (move()), or look ahead one square in the direction it is facing
(food ahead()). Figure 7.2(a) shows the BNF grammar for the Santa Fe ant trail
problem.
Symbolic Regression
It is more difficult to define appropriate metrics for phenotypes that are programs
or expressions. In GE and GP, phenotypes are usually described as trees. Therefore,
edit distances can be used for measuring differences/similarities between pheno-
types. In general, the edit distance between two trees (phenotypes) is defined as the
minimum cost sequence of elemental edit operations that transform one tree into the
other. There are three elemental operations:
• deletion: A node is removed from the tree. The children of this node become
children of its parent.
• insertion: A single node is added.
• replacement: The label of a node is changed.
A cost is assigned to every operation (usually the same for the different operations).
Selkow (1977) presented an algorithm to calculate an edit distance where the opera-
tions insertion and deletion may only be applied to leaf nodes. Tai (1979) introduced
an unrestricted edit distance and Zhang and Shasha (1989) developed a dynamic
programming algorithm to compute tree edit distances.
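A compact recursive variant in the spirit of Selkow (1977) can serve as an illustration;
the sketch below assumes unit costs and ordered trees given as (label, children)
pairs, so that deleting or inserting an inner node costs the size of its whole subtree:

    def size(t):
        # Number of nodes in tree t = (label, [children]).
        return 1 + sum(size(c) for c in t[1])

    def tree_dist(t1, t2):
        # Selkow-style edit distance with unit costs: insertions and
        # deletions apply to leaves only, so removing or adding an inner
        # node costs the size of its whole subtree.
        d = 0 if t1[0] == t2[0] else 1          # relabel the roots
        c1, c2 = t1[1], t2[1]
        m, n = len(c1), len(c2)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):               # delete leading subtrees
            D[i][0] = D[i - 1][0] + size(c1[i - 1])
        for j in range(1, n + 1):               # insert leading subtrees
            D[0][j] = D[0][j - 1] + size(c2[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i - 1][j] + size(c1[i - 1]),
                              D[i][j - 1] + size(c2[j - 1]),
                              D[i - 1][j - 1] + tree_dist(c1[i - 1], c2[j - 1]))
        return d + D[m][n]

    # Example: (x + y) and (x * y) differ by one replacement.
    t1 = ('+', [('x', []), ('y', [])])
    t2 = ('*', [('x', []), ('y', [])])
    print(tree_dist(t1, t2))   # 1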
In the context of GP, tree edit distances have been used as a measurement for
the similarity of trees (Keller and Banzhaf, 1996; O’Reilly, 1997; Brameier and
Banzhaf, 2002). Igel (1998) and Igel and Chellapilla (1999) used tree edit distances
for analyzing the locality of GP approaches.
Results
Fig. 7.3 Distribution of tree edit distances d^p_{x,y} for neighboring genotypes x and y, where
d^g_{x,y} = 1. We show the frequency (left) and the cumulative frequency (right) over d^p_{x,y} for
the Santa Fe ant trail problem and the symbolic regression problem
The previous results indicate some problems of GE with low locality. Therefore,
we investigate how strongly the low locality of the genotype-phenotype mapping
influences the performance of GE. We focus the study on mutation only. However,
the results for mutation are usually also relevant for crossover operators (Gottlieb
et al, 2001; Gottlieb and Raidl, 1999; Rothlauf, 2006).
Experimental Setting
For the experiments, we want to make sure that we only examine the impact of
locality on GE performance and that no other factors blur the results. Therefore,
we implement a simple local search using only mutation as a search operator. The
search strategy starts with a randomly created genotype and iteratively applies one
bit-flipping mutation to the genotype. If the offspring has a higher fitness than the
parent, it replaces the parent. Otherwise, the parent remains the current solution.
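The described search strategy corresponds to a simple first-improvement hill climber;
a minimal Python sketch (with a placeholder fitness function) could look as follows:

    import random

    def bitflip_hillclimb(fitness, length, steps=1000, seed=None):
        # First-improvement local search with one-bit-flip mutation:
        # start from a random genotype; in each step flip one randomly
        # chosen bit and accept the offspring only if its fitness is
        # strictly higher than the parent's.
        rng = random.Random(seed)
        parent = [rng.randint(0, 1) for _ in range(length)]
        f_parent = fitness(parent)
        for _ in range(steps):
            child = parent[:]
            child[rng.randrange(length)] ^= 1   # flip a single bit
            f_child = fitness(child)
            if f_child > f_parent:              # accept only improvements
                parent, f_parent = child, f_child
        return parent, f_parent

    # Example with the ONEMAX fitness (number of ones in the string):
    best, f = bitflip_hillclimb(sum, length=32, steps=1000, seed=0)
    print(f)   # typically 32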
We perform experiments for both test problems and compare an encoding with
high locality with the representation used in GE. In the runs, we randomly generate a
GE-encoded initial solution and use this solution as the initial solution for both types
of representations. For GE, a search step is a mutation of one bit of the genotype,
and the phenotype is created from the genotype using the GE genotype-phenotype
mapping process. Due to the low locality of the representation, we expect problems
when focusing the search on areas of the search space where solutions with high
fitness can be found. However, the low locality increases diversification and makes
it easier to escape local optima. Furthermore, we should bear in mind that many
genotype search steps do not result in a different phenotype.
We compare the representation used in GE with a standard direct representation
used in GP (parse trees, see Sect. 5.2.3). We define the search operators of GP in
such a way that a mutation always results in a neighboring phenotype (d^p_{x,y} = 1).
Therefore, the mutation operators are directly applied to the trees x^p. In our
implementation, we use the following mutation operators:
• Santa Fe ant trail
– Deletion: A leaf node from the set of terminals T is deleted.
– Insertion: A new leaf node from T is inserted.
– Replacement: A leaf node (from T ) is replaced by another leaf node.
• symbolic regression
– Insertion: sin, cos, exp, or log are inserted at a leaf that contains x.
– Replacement: +, -, *, and / are replaced by each other; sin, cos, exp, and log
are replaced by each other.
In each search step, the type of mutation operator used is chosen randomly. A muta-
tion step always results in a neighboring phenotype and we do not need an additional
genotype-phenotype mapping like in GE as we apply the search operators directly
to the phenotypes.
Comparing these two different approaches, in GE, a mutation of a genotype re-
sults in most cases in the same phenotype, sometimes in a neighboring phenotype,
but also sometimes in phenotypes that are completely different (see Fig. 7.3). In
contrast, the standard GP representation (parse trees) is a high-locality “representa-
tion” as mutation always results in a neighboring phenotype. Therefore, the search
can focus on promising areas of the search space but never escape local optima.
Performance Results
For the GE approach, we use the same parameter setting as described in Sect. 7.3.
For both problems, we perform 1,000 runs of a local search strategy using ran-
domly created initial solutions. Each run is stopped after 1,000 search steps. Figure
7.4 compares the performance for the Santa Fe ant trail (Fig. 7.4(a)) and the sym-
bolic regression problem (Fig. 7.4(b)) over the number of search steps. Figure 7.4(a)
shows the mean fitness of the found solution and Fig. 7.4(b) shows the mean error
(1/20) \sum_{i=0}^{19} |f_j(x_i) - f(x_i)|, where f is defined in (7.1) and f_j
(j ∈ {0, . . . , 1000}) denotes the function found by the search in search step j. The
results are averaged
over all 1,000 runs.
Fig. 7.4 Performance of local search using either the GE encoding or a high-locality encoding for
the Santa Fe ant trail problem and the symbolic regression problem
The results show that local search using a high-locality representation outper-
forms local search using the GE representation. Therefore, the low locality of the
encoding illustrated in Sect. 7.3 has a negative effect on the performance of mod-
ern heuristics. Although the low locality of the GE encoding allows a local search
strategy to escape local optima, local search using the GE encoding shows lower
performance than a high-locality encoding.
Comparing the results for the two types of encodings reveals that using the GE
encoding prolongs the search: more search steps are necessary to converge. This
increase is expected, as for the GE encoding a search step often does not change the
corresponding phenotype. However, the plots show that allowing local search us-
ing the GE encoding to run for a higher number of search steps does not increase its
performance.
The results show that the GE representation has some problems with locality as
neighboring genotypes often do not correspond to neighboring phenotypes. There-
fore, a guided search around high-quality solutions can be difficult. However, due
to the lower locality of the representation, it is easier to escape from local optima.
Comparing local search using either the GE representation or the standard GP en-
coding (parse trees) with high locality reveals that the low locality of the GE repre-
sentation reduces the performance of local search.
The results of this experimental case study allow us a better understanding of the
functionality of GE and can deliver some explanations for problems of GE that have
been observed in the literature. We want to encourage GE researchers to consider
locality issues for further developments of the genotype-phenotype mapping. We
believe that increasing the locality of the GE representation can also increase the
performance and effectiveness of GE.
Chapter 8
Biased Modern Heuristics for the OCST
Problem
This section introduces the OCST problem and studies relevant properties which can
be used for the design of biased modern heuristics. We start with a problem descrip-
tion. Then, we give a brief overview of optimization methods for OCST problems.
As the problem is NP-hard and only constant-factor approximations are possible,
no efficient exact approaches are available and modern heuristics are the methods
of choice. The section ends with an analysis of the bias of existing OCST test in-
stances. We find that optimal solutions are on average more similar to the minimum
spanning tree (MST) than randomly generated solutions.
The optimal communication spanning tree (OCST) problem (Hu, 1974), which is
also known as the minimum communication spanning tree problem or the simple
network design problem (Johnson et al, 1978), seeks a tree that connects all given
nodes and satisfies their communication requirements for a minimum total cost. The
number and positions of the nodes are given a priori and the cost of a tree is deter-
mined by the cost of the edges. An edge’s flow is the sum of the communication
demands between all pairs of nodes communicating either directly, or indirectly,
over the edge. The cost of each link is not fixed a priori but depends on its length
and the flow over this link. Figure 8.1
shows a communication spanning tree on 15 nodes and emphasizes the path con-
necting nodes 3 and 14.
The cost of a communication spanning tree T with edge set F is

w(T) = \sum_{(i,j) \in F} f(w_{ij}, tr_{ij}),

where the n × n matrix TR = (tr_{ij}) denotes the traffic flowing directly and
indirectly over the edge between the nodes i and j. The traffic is calculated according
to the demand matrix R and the structure of T. T is a minimum communication
spanning tree if w(T) ≤ w(T') for all other spanning trees T'.
The OCST problem is listed as [ND7] in Garey and Johnson (1979) and Crescenzi
and Kann (2003). For the OCST problem as proposed by Hu, the cost f of a link
depends linearly on its distance weight wi j and the overall traffic tri j running over
the edge. Therefore, f = w_{ij} tr_{ij} results in the OCST problem

w(T) = \sum_{(i,j) \in F} w_{ij} tr_{ij},

which is studied throughout this chapter. In principle, other cost functions f are
possible. For example, f can depend non-linearly on the length of a link or the traffic
over it, or there may be communication lines with only discrete capacities available.
The OCST problem becomes the minimum spanning tree (MST) problem if f = w_{ij}.
Then, T is an MST if w(T) ≤ w(T') for all other spanning trees T', where
w(T) = \sum_{(i,j) \in F} w_{ij}.
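To make the cost function concrete, the following Python sketch evaluates w(T) for
a tree given as an edge list by routing every demand r_uv along the unique tree path
between u and v; the symmetric weight matrix w and demand matrix r are illustrative
inputs, and nodes are assumed to be labeled 0..n−1:

    from collections import defaultdict

    def ocst_cost(n, edges, w, r):
        # w(T) = sum over tree edges (i,j) of w[i][j] * tr(i,j), where the
        # traffic tr on an edge is the sum of all demands r[u][v] whose
        # unique tree path between u and v uses that edge.
        adj = defaultdict(list)
        for i, j in edges:
            adj[i].append(j)
            adj[j].append(i)

        def path(u, v):
            # Edges on the unique u-v path, found by depth-first search.
            stack, parent = [u], {u: None}
            while stack:
                x = stack.pop()
                if x == v:
                    break
                for y in adj[x]:
                    if y not in parent:
                        parent[y] = x
                        stack.append(y)
            p, x = [], v
            while parent[x] is not None:
                p.append((parent[x], x))
                x = parent[x]
            return p

        tr = defaultdict(float)
        for u in range(n):
            for v in range(u + 1, n):
                for i, j in path(u, v):
                    tr[frozenset((i, j))] += r[u][v]
        return sum(w[i][j] * tr[frozenset((i, j))] for i, j in edges)

    # Example: the path 0-1-2 with unit weights and unit demands; the
    # traffic on each edge is 2, so the cost is 4.
    print(ocst_cost(3, [(0, 1), (1, 2)],
                    w=[[0, 1, 1], [1, 0, 1], [1, 1, 0]],
                    r=[[0, 1, 1], [1, 0, 1], [1, 1, 0]]))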
Cayley's formula identified the number of spanning trees on n nodes as n^{n-2}
(Cayley, 1889). Furthermore, there are n different stars on n nodes. The
dissimilarity between two spanning trees T_i and T_j can be measured using the
distance d_{ij} ∈ {0, 1, . . . , n − 2}, which is defined as

d_{ij} = \frac{1}{2} \sum_{u,v \in V} |e^i_{uv} - e^j_{uv}|,    (8.2)

where e^i_{uv} is 1 if an edge between u and v exists in T_i and 0 if it does not exist
in T_i. The number of links that two trees T_i and T_j have in common is then
n − 1 − d_{ij}. This definition of distance between two trees is based on the Hamming
distance (Sect. 2.3.1).
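When trees are stored as sets of undirected edges, this distance reduces to simple
set operations; a minimal sketch:

    def tree_distance(T_i, T_j):
        # Distance (8.2) between two spanning trees given as sets of
        # frozenset edges: the number of edges of T_i that are not in T_j
        # (which equals half the symmetric difference, since both trees
        # have n - 1 edges).
        return len(T_i - T_j)

    T1 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}   # a list
    T2 = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3)]}   # a star
    print(tree_distance(T1, T2))   # 2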
Like other constrained spanning tree problems, the OCST problem is NP-hard
(Garey and Johnson, 1979, p. 207). Furthermore, Reshef (1999) showed that the
problem is APX-hard (Fig. 3.20, p. 90) which means it cannot be solved using a
polynomial-time approximation scheme (Sect. 3.4.2), unless P = NP. Therefore,
the OCST problem belongs to the class of optimization problems that behave like
MAX-SAT (Garey and Johnson, 1979).
Consequently, no efficient algorithmic methods are available. Although some al-
gorithms exist for simplified versions of the OCST problems (complete unweighted
graph problem (Gomory and Hu, 1961; Hu, 1974) and uniform demand problem
(Hu, 1974)), there are no efficient methods for standard OCST problems. For de-
tailed information about approximation algorithms for the OCST problem, we refer
to the literature (Reshef, 1999; Peleg and Reshef, 1998; Rothlauf, 2009a).
To find high-quality solutions for OCST problems, modern heuristics are the
methods of choice (Palmer, 1994; Li, 2001; Rothlauf and Goldberg, 2003; Raidl
and Julstrom, 2003; Rothlauf, 2006; Fischer, 2007; Fischer and Merz, 2007). The
first modern heuristic was presented by Palmer (1994). He recognized that the de-
sign of a proper representation is crucial for the performance of modern heuristics.
Palmer compared different representations and developed a new one, the link and
node biased (LNB) encoding (Sect. 8.4.1). GAs using the LNB encoding showed
good results in comparison to a greedy star search heuristic (Palmer, 1994, Chap. 5).
The characteristic vector (CV) encoding is a common approach for encoding
graphs (Davis et al, 1993; Sinclair, 1995; Berry et al, 1997, 1999) and trees (Berry
et al, 1995). It represents a tree or graph as a list of n(n−1)/2 binary values. The CV
encoding shows good performance when used for trees on a small number of nodes.
However, with increasing problem size, the performance of the CV encoding de-
creases and modern heuristics using this encoding show low performance (compare
Rothlauf (2006, Sect. 6.3)).
Weighted encodings, like the LNB encoding (Palmer, 1994), the weighted encod-
ing (Raidl and Julstrom, 2000), the NetKey encoding (Rothlauf et al, 2002), or vari-
ants of the LNB encoding (Krishnamoorthy and Ernst, 2001) represent a tree using
a list of continuous numbers (weights). The weights define an order on the edges,
from which the tree is constructed (Sect. 8.4.1). Weighted encodings show good
performance when used for tree optimization problems (Rothlauf, 2006). Further-
more, Abuali et al (1995) introduced determinant factorization. This representation
is based on the in-degree matrix of the original graph, and each factor represents a
spanning tree if the determinant corresponding to that factor is equal to one. Tests
indicate performance similar to the LNB encoding.
Several modern heuristics using direct representations for trees have been pre-
sented: a direct representation for trees (Li, 2001), the edge-set encoding (Raidl and
Julstrom, 2003) (Sect. 8.3.1), and the NetDir encoding (Rothlauf, 2006, Chap. 7).
The performance results for unbiased direct representations are similar to weighted
encodings, however, it is difficult to design search operators in such a way that the
search space is uniformly searched (Tzschoppe et al, 2004).
Prüfer numbers were introduced by Prüfer (1918) in a constructive proof of Cay-
ley’s theorem (Cayley, 1889), and have subsequently been used to encode spanning
trees in modern heuristics (Palmer, 1994; Zhou and Gen, 1997; Gen et al, 1998;
Gargano et al, 1998; Gen and Li, 1999; Li et al, 1998; Kim and Gen, 1999; Edelson
and Gargano, 2000, 2001). Prüfer numbers belong to the class of Cayley codes as
they describe a one-to-one mapping between spanning trees on n nodes and strings
of n − 2 node labels. Other Cayley codes have been proposed by Neville (1953)
(Neville II and Neville III), Deo and Micikevicius (2001) (D-M code), and Picciotto
(1999) (Blob Code, Happy Code, and Dandelion Code). Caminiti et al (2004) pre-
sented a unified approach for Cayley codes which is based on the definition of
node pairs, reducing the coding problem to the problem of sorting these pairs into
lexicographic order. Due to problems with low locality, Prüfer numbers lead to
low performance of modern heuristics (Rothlauf and Goldberg, 1999; Gottlieb et al,
2001; Rothlauf, 2006). Caminiti and Petreschi (2005) showed that the locality of
the Blob code is higher than that of Prüfer numbers, resulting in higher performance of
modern heuristics (Julstrom, 2001). Paulden and Smith (2006) extended this work
and showed that a single mutation to a Dandelion encoded string leads to at most
five edge changes in the corresponding tree, whereas the Prüfer number encoding
has no fixed locality bound.
For more details on modern heuristics for OCST problems, we refer to Rothlauf
(2006).
Test instances for the OCST problem have been proposed by Palmer (1994), Berry
et al (1995), and Raidl (2001). Palmer (1994) described OCST problems with six
(palmer6), twelve (palmer12), and 24 (palmer24) nodes. The nodes correspond to
cities in the US and the distances between the nodes are obtained from a tariff
database. The inter-node traffic demands are inversely proportional to the distances
between the nodes. Berry et al (1995) presented three OCST test instances, one with
six nodes (berry6) and two with 35 nodes (berry35 and berry35u). For berry35u, the
distance weights are w_{ij} = 1. Raidl (2001) proposed various test instances ranging from
10 to 100 nodes. The distance weights and the traffic demands were generated ran-
domly and are uniformly distributed in the interval [0, 100]. For all problems the
cost of a tree T is w(T) = \sum_{(i,j) \in F} w_{ij} tr_{ij}. Details on the test problems
can be found in Rothlauf (2006, Appendix A).
A second set of test instances are random problem instances. The real-valued
demands ri j are generated randomly and are usually uniformly distributed either in
the interval [0, 100] or [0, 10]. The real-valued distance weights wi j are
• generated randomly either in ]0, 100] or [0, 10] (random wi j ), or
• the Euclidean distances between the nodes placed randomly on a two-dimensional
grid of different sizes (e.g. 10×10) (Euclidean wi j ).
Again, the cost of a tree T is w(T) = \sum_{(i,j) \in F} w_{ij} tr_{ij}. To measure the
performance of modern heuristics, we can use the success probability P_{suc}, which
is the percentage of runs that find an optimal solution, and the number of generations
t_{conv}, which is the number of generations until a run terminates.
For OCST problems, the cost of a solution depends on the weights wi j of the
links used in the tree, since trees that prefer links with low wi j tend to have lower
overall costs on average. For Euclidean wi j , usually links near the graph’s center of
gravity have lower weight than links that are far away from the center of gravity.
Therefore, it is useful to run more traffic over the nodes near the center of gravity
than over nodes at the edge of the tree (Kershenbaum, 1993). Consequently, nodes
can be characterized as either interior (some traffic only transits) or leaf nodes (all
traffic terminates). The more important a link is, and the more transit traffic crosses
one of the two nodes it connects, the higher the degree of these nodes. Nodes near
the center of gravity tend to have higher degrees than nodes at the edge of the tree. Palmer
(1994) used this result for the design of the LNB encoding (Sect. 8.4.1). This repre-
sentation considers the relative importance of nodes in a tree and the more important
a node is, the more traffic transits over it (Palmer and Kershenbaum, 1994).
We study whether test instances from the literature have a bias. The goal is to gain
knowledge about the problems as well as their optimal solutions. To identify relevant
properties of unbiased solutions, we randomly generated 10,000 solutions (trees)
for each test problem. Raidl and Julstrom (2003) showed that this is not as simple
as it might seem, as techniques based on Prim’s or Kruskal’s MST algorithms us-
ing randomly chosen distance weights do not associate uniform probabilities with
spanning trees. An appropriate method for generating unbiased trees is generating
random Prüfer numbers (Sect. 8.1.2) and creating the corresponding tree from the
Prüfer number (for details on Prüfer numbers, see Rothlauf (2006, Sect. 6.2)).
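A minimal sketch of this sampling procedure: draw a uniform random Prüfer number
(a string of n − 2 node labels) and decode it with the standard algorithm, which
yields each of the n^{n-2} spanning trees with equal probability (nodes are assumed
to be labeled 0..n−1 and n ≥ 2):

    import random

    def random_tree(n, rng=random):
        # Sample a spanning tree on nodes 0..n-1 uniformly at random by
        # decoding a random Pruefer number.
        prufer = [rng.randrange(n) for _ in range(n - 2)]
        degree = [1] * n
        for p in prufer:
            degree[p] += 1
        edges = []
        for p in prufer:
            # attach p to the smallest-labeled current leaf
            leaf = min(v for v in range(n) if degree[v] == 1)
            edges.append((leaf, p))
            degree[leaf] -= 1
            degree[p] -= 1
        # two nodes of degree one remain; connect them
        u, v = [x for x in range(n) if degree[x] == 1]
        edges.append((u, v))
        return edges

    print(random_tree(6, random.Random(42)))   # 5 edges on 6 nodes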
Table 8.1 Properties of randomly created solutions and optimal solutions for the test instances

                   Properties of random solutions          Properties of optimal solutions
problem            dmst,rnd            min(dstar,rnd)      dmst,opt       min(dstar,opt)
instance      n    μ (μ/n)      σ      μ (μ/n)      σ      (dmst,opt/n)   (min(dstar,opt)/n)   w(Tbest)
palmer6       6    3.36 (0.56)  0.91   2.04 (0.34)  0.61   1 (0.17)       2 (0.33)             693,180
palmer12     12    9.17 (0.76)  1.17   7.22 (0.60)  0.75   5 (0.42)       7 (0.58)             3,428,509
palmer24     24    21.05 (0.88) 1.30   18.50 (0.77) 0.80   12 (0.50)      17 (0.71)            1,086,656
raidl10      10    7.20 (0.72)  1.10   5.42 (0.54)  0.70   3 (0.30)       4 (0.40)             53,674
raidl20      20    17.07 (0.85) 1.27   14.69 (0.73) 0.77   4 (0.20)       14 (0.70)            157,570
raidl50      50    47.09 (0.94) 1.32   43.88 (0.88) 0.87   13 (0.26)      41 (0.82)            806,864
raidl75      75    72.02 (0.96) 1.36   68.55 (0.91) 0.83   18 (0.24)      68 (0.91)            1,717,491
raidl100    100    97.09 (0.97) 1.36   93.29 (0.93) 0.89   32 (0.32)      90 (0.90)            2,561,543
berry6        6    3.51 (0.59)  0.83   2.03 (0.34)  0.61   0 (0)          2 (0.33)             534
berry35u     35    -            -      29.19 (0.83) 0.83   -              28 (0.80)            16,273
berry35      35    32.05 (0.92) 1.32   29.16 (0.83) 0.83   0 (0)          30 (0.86)            16,915
Table 8.1 lists the properties of randomly created unbiased trees and the prop-
erties of the optimal (or best known) solution. It shows the mean μ , the normal-
ized mean μ /n, and the standard deviation σ of the distances dmst,rnd between ran-
domly generated trees and MSTs and of the minimum distances min(dstar,rnd ) be-
tween randomly created trees and the n different stars. In the instance berry35u, all
distance weights are uniform (w_{ij} = 1), so all spanning trees are minimal and
dmst,rnd = 0 for every tree. For the optimal solutions, we calculated their distance to an MST
(dmst,opt and dmst,opt /n) and their minimum distance to a star (min(dstar,opt ) and
min(dstar,opt )/n). Furthermore, we list the cost w(T best ) of optimal solutions. Com-
paring dmst,opt with dmst,rnd reveals that for all test instances, dmst,opt < μ (dmst,rnd ).
This means that, on average, the best solution shares more links with an MST than
a randomly generated solution. Comparing min(dstar,opt ) and μ (min(dstar,rnd )) does
not reveal any differences. Randomly created solutions have approximately the same
expected minimum distance to a star as do the optimal solutions.
For different problem sizes n, we created 100 random problem instances. The de-
mands ri j are generated randomly and are uniformly distributed in ]0, 100]. For the
wi j , there are two possibilities:
• Random: The distance weights wi j are in ]0, 100].
• Euclidean: The nodes are randomly placed on a 1,000×1,000 grid. The wi j are
the Euclidean distances between the nodes i and j.
For each of the 100 problem instances, we generated 10,000 random trees.
Table 8.2 Properties of randomly created solutions for the OCST problem

         dmst,rnd              min(dstar,rnd)
nodes    μ (μ/n)        σ      μ (μ/n)        σ
8        5.25 (0.66)    1.04   3.74 (0.47)    0.62
12       9.17 (0.77)    1.16   7.31 (0.61)    0.72
16       13.13 (0.82)   1.22   11.00 (0.69)   0.76
20       17.10 (0.86)   1.26   14.78 (0.74)   0.77
24       21.09 (0.88)   1.29   18.60 (0.78)   0.80
28       25.05 (0.895)  1.30   22.70 (0.81)   0.72
Table 8.2 presents the mean μ, the normalized mean μ/n, and the standard deviation
σ of dmst,rnd and min(dstar,rnd). As we get the same results using random or
Euclidean distance weights, we do not distinguish between the two types of weights.
Both the average distance μ(dmst,rnd) between randomly generated solutions and
MSTs and the average minimum distance μ(min(dstar,rnd)) between a random
solution and a star increase approximately linearly with n.
For finding optimal or near-optimal solutions for randomly created problem in-
stances, we implement a GA. Although GAs are modern heuristics that cannot guar-
antee that an optimal solution is found, we choose its design in such a way that an
optimal or near-optimal solution is found with high probability. Due to the proposed
GA design and due to the NP-hardness of the problem, the effort for finding optimal
solutions is high, and optimal solutions can only be determined for small problem
instances (n < 30). As GA performance increases with population size N, we it-
eratively apply a GA to the problem and increase N after n_iter runs, starting with a
small population size.
Table 8.3 Properties of optimal solutions for OCST problems with Euclidean distance weights

      dmst,opt             min(dstar,opt)        dopt,rnd             Ni        w(Tibest)
n     μ (μ/n)       σ      μ (μ/n)       σ       μ (μ/n)       σ      μ         μ            σ
8     1.98 (0.25)   1.18   2.96 (0.37)   0.93    5.25 (0.66)   0.01   389       899,438      177,140
12    4.35 (0.36)   1.47   5.88 (0.49)   1.20    9.17 (0.76)   0.01   1,773     2,204,858    301,507
16    6.74 (0.42)   1.71   9.28 (0.58)   1.24    13.12 (0.82)  0.01   7,048     4,104,579    463,588
20    9.75 (0.49)   1.92   12.68 (0.63)  1.15    17.10 (0.86)  0.01   20,232    6,625,619    649,722
24    11.92 (0.50)  2.14   16.48 (0.69)  1.23    21.08 (0.88)  0.01   40,784    9,320,862    718,694
28    14.60 (0.52)  1.90   19.70 (0.70)  1.42    25.07 (0.89)  0.01   98,467    13,121,110   723,003
The average distance dmst,opt is lower for random wi j than for Euclidean wi j .
Comparing the properties of the best found solutions to the properties of randomly
created solutions listed in Table 8.2 reveals that μ (dmst,opt ) < μ (dmst,rnd ). This
means optimal solutions share more links with MSTs than random solutions. Fur-
thermore, the distances dopt,rnd are similar to dmst,rnd and min(dstar,opt ) is slightly
Table 8.4 Properties of optimal solutions for OCST problems with random distance weights

      dmst,opt             min(dstar,opt)        dopt,rnd             Ni        w(Tibest)
n     μ (μ/n)       σ      μ (μ/n)       σ       μ (μ/n)       σ      μ         μ            σ
8     0.83 (0.10)   0.88   3.42 (0.43)   0.68    5.25 (0.66)   0.01   234       50,807       18,999
12    1.42 (0.12)   1.04   6.87 (0.57)   0.75    9.16 (0.76)   0.01   478       90,842       28,207
16    2.58 (0.16)   1.36   10.23 (0.64)  0.95    13.12 (0.82)  0.01   2,208     136,275      36,437
20    3.40 (0.17)   1.63   13.94 (0.70)  1.10    17.10 (0.86)  0.01   6,512     183,367      49,179
24    4.08 (0.17)   1.83   17.76 (0.74)  1.04    21.10 (0.88)  0.01   10,432    228,862      58,333
28    5.02 (0.18)   2.07   21.96 (0.78)  0.83    25.06 (0.90)  0.01   19,456    271,897      70,205
Fig. 8.2 μ (dmst,rnd ) and μ (dmst,opt ) over n for random and Euclidean distance weights. The error
bars indicate the standard deviation
lower than min(dstar,rnd ), especially for Euclidean wi j . However, this effect is weak
and can be neglected in comparison to the low distance dmst,opt .
Figures 8.2(a) and 8.2(b) summarize the results from Tables 8.2, 8.3, and 8.4.
They plot dmst,rnd (dmst,rnd /n) and dmst,opt (dmst,opt /n) using either random wi j or
Euclidean wi j over n. Figure 8.2(a) indicates that all distances to the MST increase
approximately linearly with n and that optimal solutions share many edges with
MSTs. Finally, Fig. 8.3 shows the distribution of dmst,opt for Euclidean and random
wi j for the 100 randomly generated instances of 20-node OCST problems as well
as the distribution of dmst,rnd for 100 randomly generated trees. We see that optimal
solutions share many edges with MSTs.
The similarity between optimal solutions and MSTs can be used for biasing modern
heuristics. Relevant design elements are the problem representation and search oper-
ators, the initialization method, and the fitness function. Consequently, four promising
approaches to exploiting the similarity between optimal solutions and
MSTs can be identified:
• Design search operators that favor trees similar to an MST.
• Design a redundant problem representation such that solutions similar to an MST
are overrepresented.
• Start with solutions that are similar to an MST.
• Assign a higher fitness value to solutions similar to an MST.
The following paragraphs discuss these four possibilities. We can bias search oper-
ators to favor MST-like trees. If this principle is applied to OCST problems, search
operators will prefer edges of low weight. This concept was implicitly used by
Raidl and Julstrom (2003) who proposed the edge-set representation. The search
operators of edge-sets can be combined with additional heuristics (Sect. 6.2.1) to
prefer low-weight edges. As a result, the search operators favor MST-like solutions
which can result in higher performance in comparison to unbiased search operators
(Raidl and Julstrom, 2003). In Sect. 8.3, we study how the performance of modern
heuristics using edge-sets depends on the properties of optimal solutions.
For redundant representations, the number of genotypes exceeds the number of
phenotypes. Redundant representations can increase the performance of modern
heuristics if their bias appropriately exploits an existing bias of optimal solutions
(Sect. 6.2.2). Therefore, biasing a redundant representation towards MSTs can in-
crease the performance of modern heuristics for OCST problems. This principle
is, for example, used in the link-biased encoding (Palmer, 1994). In Sect. 8.4, we
study this redundant representation and examine how it influences the performance
of modern heuristics.
The performance of modern heuristics depends on the starting point of the search.
Therefore, as the average distance between optimal solutions and MSTs is low, it is
a reasonable approach to start the search with solutions that are an MST or similar to
it. We make use of this observation in Sect. 8.5 and examine how the performance
of different modern heuristics depends on the starting solution. The experimental
results confirm that modern heuristics either need fewer fitness evaluations or find
better solutions when starting from an MST than when starting from a random tree.
Finally, we can modify the fitness evaluation of trees such that solutions that are
similar to an MST get an additional bonus. Such a bias could push modern heuristics
more strongly in the direction of MST-like solutions and increase the performance
of the search. A proper strength of the bias is important, since too strong a bias could
easily result in premature convergence. Steitz and Rothlauf (2010) presented the
first results of such an approach which indicate that biasing the fitness function
can increase the performance of modern heuristics.
8.3 Search Operator
When using modern heuristics for tree problems, it is necessary to encode a solution
(tree) such that search operators can be applied. There are two different possibili-
ties for doing this: indirect representations usually encode a tree (phenotype) as a
list of strings (genotypes) and apply standard search operators to the genotypes. The
phenotype is constructed by an appropriate genotype-phenotype mapping (represen-
tation). In contrast, direct representations encode a tree as a set of edges and apply
search operators directly to this set. Therefore, no representation is necessary. In-
stead, tree-specific search operators must be developed as standard search operators
can no longer be used (Sect. 4.3.4).
The edge-set encoding (Raidl and Julstrom, 2003) is a representative example
of a direct representation. Raidl and Julstrom (2003) proposed two different vari-
ants: heuristic variants where the search operators consider the weights of the edges,
and non-heuristic ones. Results from applying the edge-set encoding to two sets of
degree-constrained MST problem instances have indicated the superiority of edge-
sets in comparison to several other codings of spanning trees (i.e. the Blob Code,
network random keys, and strings of weights) particularly when the operators im-
plement edge-cost-based heuristics (Raidl and Julstrom, 2003, p. 238).
In this section, we investigate the bias of the search operators of edge-sets, and
study how the performance of modern heuristics depends on the properties of op-
timal solutions. As the heuristic variants of edge-sets prefer edges with a low cost,
these variants are expected to show a bias towards MSTs (Rothlauf, 2009b).
In Sect. 8.3.1, we introduce the functionality of the edge-set encoding with
and without heuristics. Section 8.3.2 studies the bias of the search operators. Fi-
nally, Sect. 8.3.3 examines the performance of modern heuristics using edge-sets
for OCST problems. It presents results for known test instances from the literature
as well as randomly generated test instances.
The edge-set encoding directly represents trees as sets of edges. Therefore, encoding-
specific initialization, recombination, and local search operators are necessary. The
following paragraph summarizes the functionality of the different variants with and
without heuristics (Raidl and Julstrom, 2003).
Raidl and Julstrom (2003) proposed and investigated three different initialization
strategies: PrimRST, RandWalkRST, and KruskalRST. PrimRST, which is based
on Prim’s algorithm (Prim, 1957), slightly overrepresents stars and underrepre-
sents lists. RandWalkRST has an average running time of O(n log n), however, the
worst-case running time is unbounded. Therefore, Raidl and Julstrom (2003) rec-
ommended the use of the KruskalRST which is based on the algorithm of Kruskal
(Kruskal, 1956). In contrast to Kruskal’s algorithm, KruskalRST chooses edges
(i, j) not according to their corresponding weights wi j but randomly. KruskalRST
has a small bias towards star-like trees (which is lower than the bias of PrimRST).
The bias is a result of the sequential insertion of edges during the construction
of a tree using Kruskal’s algorithm. Algorithm 9 outlines the functionality of
KruskalRST.
Algorithm 9 KruskalRST(V, E)
T ← ∅; A ← E; // E is the set of available edges
while |T| < |V| − 1 do
    choose an edge (u, v) ∈ A at random;
    A ← A \ {(u, v)};
    if u and v are not yet connected in T then
        T ← T ∪ {(u, v)};
    end if
end while
return T
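Algorithm 9 translates almost line by line into Python; the sketch below uses a
union-find forest for the test whether u and v are already connected in T:

    import random

    def kruskal_rst(V, E, rng=random):
        # Random spanning tree: scan the edges in random order and keep
        # every edge that does not close a cycle.
        parent = {v: v for v in V}          # union-find forest

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path halving
                v = parent[v]
            return v

        T, A = [], list(E)
        rng.shuffle(A)                      # random edge order
        for u, v in A:
            ru, rv = find(u), find(v)
            if ru != rv:                    # u, v not yet connected in T
                parent[ru] = rv
                T.append((u, v))
                if len(T) == len(V) - 1:
                    break
        return T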
For recombination, the KruskalRST* crossover operator first transfers all edges
E_1 ∩ E_2 that the two parents T_1 and T_2 have in common to the offspring and
then completes the tree by applying KruskalRST to the remaining edges
(E_1 ∪ E_2) \ (E_1 ∩ E_2). Results from Raidl and Julstrom (2003) indicate a better
performance of KruskalRST* for the degree-constrained MST problem.
The local search operator (also called mutation) randomly replaces one edge in
the spanning tree. This replacement can be realized in two different ways. The first
variant randomly chooses one edge that is not present in T and includes it in T .
Then, a randomly chosen edge of the cycle is removed (“insertion before deletion”).
The second variant first randomly deletes one edge from T and then connects the
two disjoint connected components using a random edge not present in T (“deletion
before insertion”). The running time is O(n).
We describe how heuristics that rely on the weights wi j can be included in the edge-
set encoding. Raidl and Julstrom (2003) introduced these variants of edge-sets due
to the assumption that in weighted tree optimization problems optimal solutions
often prefer edges with low weights wi j .
To favor low-weighted edges when generating an initial population, the algorithm
KruskalRST does not choose an edge at random but sorts all edges in the underlying
graph according to their weights wi j in ascending order. The first spanning tree in
a population is created by choosing the first edges in the ordered list. As these are
the edges with lowest weights, the first generated spanning tree is an MST. Then,
the k edges with lowest weights are permuted randomly and another tree is created
using the first edges in the list. This heuristic initialization results in a strong bias
towards MSTs. With increasing k, the bias of randomly created trees towards MSTs
is reduced. The number of edges that are permuted increases according to

k = α(i − 1)n/N,

where N denotes the population size, i is the index of the tree that is currently
generated (i = 1, . . . , N), and α, with 0 ≤ α ≤ (n − 1)/2, is a parameter that controls
the strength of the heuristic bias.
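The following Python sketch combines this rule with Kruskal-style construction;
edges are given as hypothetical (u, v, w) triples, and the union-find helper is repeated
for self-containedness. For i = 1 we get k = 0, so the first generated tree is an MST:

    import random

    def heuristic_init_tree(V, weighted_edges, i, N, alpha=1.5, rng=random):
        # Build the i-th initial tree (i = 1..N): sort edges by weight,
        # randomly permute only the k cheapest, and scan the list front
        # to back, keeping every edge that does not close a cycle.
        n = len(V)
        order = sorted(weighted_edges, key=lambda e: e[2])
        k = int(alpha * (i - 1) * n / N)    # bias decreases with i
        head = order[:k]
        rng.shuffle(head)
        order = head + order[k:]
        parent = {v: v for v in V}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        T = []
        for u, v, w in order:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                T.append((u, v))
        return T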
The heuristic recombination operator is a modified version of KruskalRST*
crossover. Firstly, the operator transfers all edges (E1 ∩ E2 ) that exist in both parents
T1 and T2 to the offspring. Then, the remaining edges are chosen randomly from
E = (E1 ∪ E2 ) \ (E1 ∩ E2 ) using a tournament with replacement of size two. This
means the weights wi j of two randomly chosen edges are compared and the edge
with the lower weight is inserted into the offspring (if no cycle is created).
The heuristic mutation operator is based on mutation by “insertion before deletion”.
In a pre-processing step, all edges in the underlying graph are sorted according
to their weights in ascending order. In this way, a rank is assigned to every edge;
rank one is assigned to the edge with the lowest weight. To favor low-weighted edges,
the edge that is inserted by the heuristic mutation operator is not chosen randomly
but according to its rank

R = |N(0, βn)| mod m + 1,

where N(0, βn) is the normal distribution with mean 0 and standard deviation βn,
and m = n(n − 1)/2. β is a parameter that controls the bias towards low-weighted
edges. If a chosen edge already exists in T, the edge is discarded and the selection
is repeated.
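A sketch of this rank-based choice (assuming the edges are pre-sorted by weight in
ascending order, so that index 0 holds the rank-one edge):

    import random

    def pick_edge_by_rank(sorted_edges, beta, n, rng=random):
        # Draw R = |N(0, beta*n)| mod m + 1 and return the edge of rank R;
        # small beta concentrates the choice on low-weighted edges.
        m = len(sorted_edges)               # m = n(n-1)/2 for complete graphs
        R = int(abs(rng.gauss(0, beta * n))) % m + 1
        return sorted_edges[R - 1]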
8.3.2 Bias
8.3.2.1 Initialization
Raidl and Julstrom (2003) examined the bias of different initialization methods and
found KruskalRST to be slightly biased towards stars. As this bias is sufficiently
small, and due to its lower running time, KruskalRST is preferred over RandWalkRST
and PrimRST; the latter shows a stronger bias towards stars.
Table 8.5 shows the average distances dmst,rnd between MSTs and randomly gen-
erated trees Trand (the standard deviations are shown in brackets). For each problem
instance (250 of each type) we generated 1,000 random solutions Trand using either
an unbiased encoding (Prüfer numbers), non-heuristic KruskalRST (Sect. 8.3.1.1),
or the heuristic initialization (Sect. 8.3.1.2). For the heuristic initialization α was set
either to α = 1.5 as recommended in Raidl and Julstrom (2003) or to the maximum
value α = (n − 1)/2, which results in the lowest possible bias. The results confirm
that KruskalRST is not biased towards MSTs (Raidl and Julstrom, 2003). Further-
more, the heuristic versions show a bias towards MSTs even when using a large
value of α .
Table 8.5 Mean and standard deviation of distances dmst,rnd between random trees and MSTs

                                  KruskalRST         heuristic initialization
n     weights      unbiased       non-heuristic      α = 1.5         α = (n − 1)/2
10    Euclidean    7.20 (0.06)    7.20 (0.04)        0.44 (0.19)     3.80 (0.08)
      random                      7.20 (0.07)        0.20 (0.13)     3.74 (0.08)
20    Euclidean    17.10 (0.04)   17.10 (0.04)       1.06 (0.28)     12.02 (0.14)
      random                      17.10 (0.04)       0.42 (0.22)     12.09 (0.08)
100   Euclidean    97.02 (0.04)   97.02 (0.04)       5.98 (0.67)     87.89 (0.18)
      random                      97.02 (0.05)       2.12 (0.60)     88.22 (0.87)
200   Euclidean    197.03 (0.04)  197.02 (0.04)      11.99 (0.92)    176.45 (0.29)
      random                      197.00 (0.04)      3.93 (0.69)     177.25 (0.12)
8.3.2.2 Recombination
We perform experiments for different problem sizes n, using iter = 5,000 (n = 25)
or iter = 10,000 (n = 100) search steps. As no selection operator (Sect. 5.2.1)
is used, no selection pressure pushes the population towards high-quality solutions.
Search operators are unbiased if the statistical properties of the population do not
change by applying variation operators alone. In our experiments, we measure in
each generation the average distance dmst−pop = (1/N) \sum_{i=1}^{N} d_{i,mst}
between all individuals T_i (i ∈ {1, . . . , N}) in the population and an MST. If
dmst−pop decreases, the variation operators are biased towards MSTs. If dmst−pop
remains constant, the variation operators are unbiased with respect to MSTs.
As for initialization, we perform this experiment on 250 randomly generated tree
instances of different sizes with random distance weights wi j . Results for Euclidean
distance weights are analogous. For every tree instance, we perform 10 runs with
different randomly chosen initial populations.
Figure 8.4 shows the mean of dmst−pop over the number of search steps. For clarity,
standard deviations are generally omitted; as an example, we plot the standard
deviations for “nohxover only”. The results confirm previous findings (Tzschoppe et al,
2004) and show that the crossover operator without heuristics is not biased towards
MSTs and does not modify the statistical properties of the population (dmst−pop re-
mains constant over the number of search steps). Furthermore, heuristic crossover
operators show a strong bias towards MSTs. Applying heuristic crossover operators
alone (hxover only) pushes the population towards MSTs. After some search steps
Fig. 8.4 Mean of the distance dmst−pop between a population of N = 50 individuals and an MST
over the number of search steps
(for example, after 1,000 search steps for n = 25 and N = 50), the population has
fully converged and crossover-based search gets stuck.
The problem of premature convergence can be alleviated by combining heuris-
tic crossover with heuristic or non-heuristic mutation. Then, new edges are contin-
ually introduced into the population. Consequently, with an increasing number of
search steps, the population keeps moving towards an MST. However, in contrast
to when using no mutation, the population does not reach an MST as the mutation
operator continuously inserts new edges (hxover, nohmut). The situation is similar
when replacing the non-heuristic mutation operator by heuristic variants with β = 1
or β = 5 (hxover, hmut (β = 1) or hxover, hmut (β = 5)). With lower β , the bias
towards low-weighted edges increases and the population converges faster towards
MSTs. Similarly to when using non-heuristic mutation, the population does not fully
converge to MSTs as the mutation operator (heuristic as well as non-heuristic) con-
tinually inserts new edges into the population.
Table 8.5 indicates that heuristic initialization is biased towards MSTs. When us-
ing heuristic initialization with α = 1.5, the plots (“hxover, nohmut, hini”, “hxover,
hmut (β = 1), hini”, and “hxover, hmut (β = 5), hini”) indicate that the variation
operators allow the population to recover from the strong initial bias. With an in-
creasing number of search steps, dmst−pop converges to the same values as when
using non-heuristic initialization.
Summarizing the results, non-heuristic crossover results in an unbiased popu-
lation. When using the heuristic crossover operator, the population converges to
solutions only slightly different from an MST.
8.3.2.3 Mutation
Finally, we investigate the bias of the mutation (local search) operator for 250
randomly generated tree instances with random distance weights wi j . As for the
crossover operator, we create a random population of N = 50 individuals. Then,
in every search step, a randomly chosen individual is mutated exactly once using
either the non-heuristic “insertion-before-deletion” mutation or the heuristic ver-
sion. The mutated offspring replaces a randomly selected individual in the popu-
lation. Therefore, no intensification mechanisms focus the search. For the heuris-
tic mutation operator the parameter β is set to 1, 5, or 10. With lower β , edges
with lower weights are preferred. The initial population is generated randomly
using either the non-heuristic initialization or the heuristic variant with α = 1.5
(Raidl and Julstrom, 2003). We perform experiments for n = 25 (iter = 5,000) and
n = 100 (iter = 10,000).
Fig. 8.5 Mean of the distance dmst−pop between a population and MSTs over the number of search
steps
Figure 8.5 shows the mean of dmst−pop over the number of search steps. The
results show that the non-heuristic mutation operator is approximately unbiased,
whereas the heuristic mutation is biased towards MSTs. As expected, the bias in-
creases with lower β . Furthermore, the population does not converge completely
towards MSTs but the average distance dmst−pop remains stable after a number of
search steps. For heuristic initialization with α = 1.5, the initial population shows
a strong bias towards MSTs (see Table 8.5, p. 198) but the population can recover
due to the mutation operator. With an increasing number of search steps, the population
converges to the same dmst−pop as when using the non-heuristic initialization.
To be able to study how the performance of modern heuristics using edge-sets de-
pends on the structure of the optimal solution Topt , an optimal or near-optimal solu-
tion must be determined. However, due to the NP-hardness of the OCST problem,
guaranteed optimal solutions can only be determined with reasonable computational
effort for small problem instances. Therefore, we split our study into three parts: the
first part is concerned with small 20-node problem instances where we are able to
determine optimal (or at least near-optimal) solutions using the GA framework de-
scribed in Sect. 8.1.3.2. The second part deals with known test problems from the
literature, and the third part studies larger problems with unknown Topt .
Fig. 8.6 Performance of a steady-state (50 + 1)-GA using different variants of the edge-set encod-
ing for randomly generated n = 20 node OCST problems
performance than the use of biased heuristic mutation (nohxover, hmut (β = 1)).
Furthermore, OCST problems with random wi j are easier to solve for GAs using
heuristic variants of edge-sets since optimal solutions are more similar to MSTs.
In summary, the heuristic variants of the edge-set encoding only perform well for
OCST problems where Topt is only slightly different from an MST. Otherwise, EAs
using the heuristic crossover operator fail due to their strong bias towards MSTs.
Using heuristic mutation has a similar, but weaker effect on GA performance than
heuristic crossover. If dmst,opt is low, its bias towards MSTs helps GAs to find op-
timal solutions; in contrast, if dmst,opt is large, GA performance is reduced. The
performance of GAs using non-heuristic crossover and mutation operators is nearly independent of dmst,opt . Therefore, problems with a large dmst,opt can also be solved.
We examine GA performance for the OCST test instances from Sect. 8.1.3. Since
for all test instances optimal or near-optimal solutions are known, we can also study
how GA performance depends on dmst,opt . Table 8.6 lists the properties of optimal
or best known solutions Topt for the test instances. It shows the number of nodes
n, the average distance dmst,rnd of 10,000 randomly generated unbiased trees Trand ,
the distance dmst,opt , and the cost w(Topt ). For all test instances, dmst,opt is always
smaller than dmst,rnd .
For our experiments we use the same steady-state GA as described in the previous
paragraphs. Each run is stopped after iter fitness evaluations and we perform 25 runs
for each test instance. We compare the performance of GAs that use the same search
operators as described in Sects. 8.3.2.2 (p. 199) and 8.3.3.1 (p. 202). As optimal
solutions are similar to MSTs, we extend our study and also present results for:
• nohxover, nohmut (MST): edge-sets with non-heuristic crossover (pc = 1), non-
heuristic mutation (pm = 1/n), and non-heuristic initialization. One randomly
chosen solution in the initial population is set to the MST (all other N − 1 solu-
tions are generated randomly).
The original variants “nohxover, nohmut” with random initial populations are de-
noted as “nohxover, nohmut (rnd)”. Furthermore, we show results for 10,000 ran-
domly generated trees Trand (indicated as Trand ).
Table 8.6 lists the percentage of runs Psuc that find Topt , the average distance dbf,opt between the best solution Tbf that was found after iter fitness evaluations and Topt , and the average gap (w(Tbf ) − w(Topt ))/w(Topt ) (in percent) between w(Tbf ) and w(Topt ).
For the 10,000 randomly generated solutions Trand , we present the average gap (w(Trand ) − w(Topt ))/w(Topt ) (in percent).
We see that the gap between w(Tmst ) and w(Topt ) is much lower than the gap between w(Trand ) and w(Topt ). Therefore, an MST is already a high-quality solution for all test problems. For berry6 and berry35, where an MST is optimal, GAs
using heuristic crossover (hxover) always and easily find Topt . For test instances with
a large dmst,opt (e.g. the large Raidl test instances), GAs using heuristic crossover
have problems finding Topt . Although the gap between w(Tbf ) and w(Topt ) is low and similar to that of GAs using non-heuristic crossover and heuristic mutation (nohxover, hmut (β = 1)), the strong bias of the heuristic crossover operator does not allow the
GA to reach Topt .
High GA performance is obtained (except for palmer24) when combining non-
heuristic crossover with heuristic mutation (nohxover, hmut (β = 1)). Combining
heuristic mutation with the unbiased non-heuristic crossover operator seems to be a
good compromise and also allows high GA performance for problems where dmst,opt
is large.
The results confirm the findings from the previous paragraphs. The non-heuristic
KruskalRST* crossover shows good performance also for large dmst,opt . The heuris-
tic crossover operator only finds optimal solutions if they are similar to an MST.
However, the gap between w(Tb f ) and w(Topt ) is low as good solutions are often
MST-like.
Finally, we study the GA performance for randomly generated, large OCST prob-
lem instances. In contrast to the previous paragraphs, we do not know Topt and GA
performance is determined only by the fitness w(Tb f ). We must bear in mind that
optimal solutions are similar to MSTs, and in comparison to randomly chosen trees,
an MST is already a high-quality solution.
As before, we use a steady-state GA with N = 50 which is stopped after iter
fitness evaluations (Table 8.7). We present results for randomly generated OCST
problems of different sizes (n = 10, n = 25, n = 50, n = 100, n = 150, n = 200)
with either random or Euclidean distance weights wi j . For each type of problem, we
create either 100 (n < 100), 25 (100 ≤ n ≤ 150), or 10 (n = 200) random problems.
For each random problem, 10 GA runs are performed. We compare GA performance
using the search operators described in Sects. 8.3.2.2 (p. 199) and 8.3.3.2 (p. 204).
Figure 8.7 presents the average gap (w(Tmst ) − w(Tbf ))/w(Tmst ) (in percent). The higher the gap, the better the solutions found. We compare the performance of the different search operators based on the fitness gap between Tbf and an MST, since an MST is
already a high-quality solution and since the design of the search operators used
is such that MSTs (or slight variants of them) are created with a high probability in
either the initial population (heuristic initialization), or during the GA run (heuris-
tic mutation or crossover). For random distance weights, we do not present results
for “nohxover, nohmut (rnd)” as GAs using this representation were unable to find solutions with a cost similar to or lower than that of an MST.
The results show differences between Euclidean and random problem instances.
For Euclidean wi j , best solutions are found when using the non-heuristic variant
of edge-sets starting with an MST (“nohxover, nohmut (MST)”). In contrast, solu-
tion quality is low for heuristic initialization, crossover, and mutation (hxover, hmut
(β = 1), hini). The situation is reversed for random wi j . GA performance is high
for heuristic initialization, crossover, and mutation; in contrast, GA performance is
lower when using the non-heuristic variants. This is because random wi j result in
a stronger bias of optimal solutions towards the MST (Fig. 8.3). Therefore, encod-
ings with a strong bias towards MSTs result in higher GA performance if optimal
solutions are similar to MSTs, but lower GA performance otherwise. Consequently,
encodings like “hxover, hmut (β = 1), hini” with a high bias perform better for
random wi j (optimal solutions are more similar to MSTs) than for Euclidean wi j .
The results confirm previous findings. Heuristic edge-sets result in high GA per-
formance if optimal solutions are similar to MSTs (random weights). In contrast, for
larger dmst,opt (Euclidean weights) non-heuristic variants show better performance
as the bias of the heuristic variants hinders finding optimal solutions.
8.4 Representation
This section examines the link-biased (LB) encoding, which is an (indirect) representation for trees. The LB encoding is a redundant representation, which can encode
trees similar to an MST with higher probability than random trees. In Sect. 8.1.3,
we found that optimal solutions of OCST problems are MST-like. The LB encoding
exploits this problem-specific knowledge. Experimental results show high perfor-
mance for different types of modern heuristics. This section illustrates how biasing
a redundant representation towards high-quality solutions can increase the perfor-
mance of modern heuristics (Rothlauf, 2009c).
Section 8.4.1 describes the functionality of the LB encoding. Section 8.4.2
presents experimental results on a proper setting of the encoding-specific param-
eter P1 and the performance of the LB encoding for various test problems.
The idea of the link-biased (LB) encoding (Palmer, 1994; Rothlauf and Goldberg,
2003) is to represent a tree using a bias for each edge and to modify the edge weights
wij according to these biases. A tree is constructed from the modified edge weights w′ij by calculating an MST. Therefore, a genotype b of length l = n(n − 1)/2 holds biases for the links. When constructing the phenotype (the tree) from the genotype, the biases are temporarily added to the original distance weights wij . To get the represented tree, an MST is calculated using the modified distance weights w′ij . Links with a low w′ij will be used with high probability, whereas edges with a high w′ij will not appear in the tree. To finally calculate the tree’s fitness, the encoded tree is
evaluated by using the original edge weight matrix W and demand matrix R.
The biases bij are floating-point values between zero and one. The original distance weights wij are modified by the elements bij of the bias vector as

    w′ij = wij + P1 bij wmax ,                                    (8.3)

where the w′ij are the modified weights, wmax is the largest weight (wmax = max(wij )),
and P1 is the link-specific bias. The parameter P1 controls the influence of the biases
bi j and has a large impact on the structure of the encoded tree. For P1 = 0, the bi j
have no influence and only an MST based on the wij can be represented. Prim’s or Kruskal’s algorithm can be used for generating the MST, resulting in a running time of O(n²) or O(n² log n), respectively. The structure of the tree depends not only
on the bias values bi j , but also on the weights wi j . Therefore, the same link-biased
genotype can represent different trees if different weight matrices W are used.
We give an example of the construction of a tree from a bias vector b. For a
tree with n = 4 nodes, the genotype has length l = n(n − 1)/2 = 6. The link-biased
genotype is b = {0.1, 0.6, 0.2, 0.1, 0.9, 0.3}. With P1 = 1 and the edge weights w =
{10, 30, 20, 40, 10, 20}, the modified weights are calculated according to (8.3) as
w′ = {14, 54, 28, 44, 46, 32}. Notice that wmax = 40. The represented tree, calculated as the MST using the modified edge weights w′, is shown in Fig. 8.8. The six possible edges are labeled from 1 to 6, and the tree consists of eAB (link 1 with w′AB = 14), eAD (link 3 with w′AD = 28), and eCD (link 6 with w′CD = 32).
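The decoding step can be made concrete with a short sketch; the function name decode_lb and the dict-based edge layout are our own, while the modified-weight formula is (8.3). It reproduces the n = 4 example above using Kruskal's algorithm:

def decode_lb(bias, w, n, P1=1.0):
    # bias, w: dicts mapping an edge (i, j), i < j, to its bias b_ij and weight w_ij.
    # Returns the tree (list of edges) encoded by the genotype, i.e., the MST
    # with respect to the modified weights w'_ij = w_ij + P1 * b_ij * w_max.
    w_max = max(w.values())
    w_mod = {e: w[e] + P1 * bias[e] * w_max for e in w}

    # Kruskal's algorithm: scan edges by increasing modified weight and
    # insert every edge that does not close a cycle (union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for e in sorted(w_mod, key=w_mod.get):
        ri, rj = find(e[0]), find(e[1])
        if ri != rj:
            parent[ri] = rj
            tree.append(e)
    return tree, w_mod

# The n = 4 example from the text; links 1..6 are the edges
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D) with A..D mapped to 0..3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
b = dict(zip(edges, [0.1, 0.6, 0.2, 0.1, 0.9, 0.3]))
w = dict(zip(edges, [10, 30, 20, 40, 10, 20]))
tree, w_mod = decode_lb(b, w, n=4, P1=1.0)
print(list(w_mod.values()))  # [14, 54, 28, 44, 46, 32]
print(tree)                  # [(0, 1), (0, 3), (2, 3)], i.e., e_AB, e_AD, e_CD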
The LB encoding is a specialized version of the more general link and node
biased (LNB) encoding (Palmer, 1994). The functionality of the LNB encoding is
analogous to the LB encoding. However, the LNB encoding uses additional biases
for each node and the modified edge weights are calculated as

    w′ij = wij + P1 bij wmax + P2 (bi + bj ) wmax ,

with the elements bi (i ∈ [0, . . . , n − 1]) of the node bias vector and the node-specific bias P2 . The tree T is calculated as an MST using the modified distance weights w′ij . In comparison to the LB encoding, the length of a genotype increases from n(n − 1)/2 to n(n + 1)/2.
Abuali et al (1995) compared the LNB encoding to some other representations
for the probabilistic MST problem and in some cases found the best solutions by
using the LNB encoding. Raidl and Julstrom (2000) proposed a variant of this en-
coding and observed solutions superior to those of several other representations for
the degree-constrained MST problem. For the same type of problem, Krishnamoor-
thy and Ernst (2001) proposed yet another version of the LNB encoding. Gaube and Rothlauf (2001) found that for P2 > 0, trees similar to stars are encoded with higher probability and for large P2 not all possible trees can be encoded. For P2 → ∞ (and P1 finite), only stars can be encoded. Therefore, the LNB encoding with additional node biases (P2 > 0) is only useful if optimal solutions are similar to stars.
In the following paragraphs, we focus on the LB encoding. The LB encoding
is redundant as genotypes consisting of n(n − 1)/2 real values bi j encode a finite
number of trees (phenotypes). For large values of the link-specific bias (P1 → ∞), the
LB encoding becomes uniformly redundant (Sect. 6.2.2). This means every possible
tree is represented by the same number of different genotypes. With decreasing
P1 , the LB encoding becomes non-uniformly redundant and MST-like solutions are
over-represented. Then, a random LB-encoded genotype represents MST-like trees
with higher probability. If P1 = 0, only one tree – an MST with respect to the wij – can be represented, since the elements bij of the bias vector have no impact on the modified weights w′ij .
Figure 8.9 illustrates how the LB encoding over-represents trees similar to an
MST (Rothlauf and Goldberg, 2003). We plot the probability Pr that a link of a ran-
domly generated genotype is part of an MST over the link-specific bias P1 . Results
are presented for two tree sizes n (16 and 28 nodes) and for P1 ranging from 0.01
to 1,000. We plot the mean and standard deviation of Pr . For large values of P1 , all
n^(n−2) possible trees are uniformly represented. With decreasing P1 , randomly created genotypes contain links that are also used in MSTs with higher probability. For
small values of P1 , all edges of a randomly created individual are also with high
probability Pr part of an MST. For P1 → 0, only an MST can be encoded.
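The over-representation shown in Fig. 8.9 can be replicated roughly by reusing decode_lb from the sketch above; the problem size and P1 values follow the text, while the weight range and sample counts are our own assumptions:

import random

def mst_overlap(n, P1, samples=100):
    # Estimate Pr that a link of a randomly generated genotype is part of
    # an MST of the (randomly weighted) instance.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    w = {e: random.uniform(0.0, 100.0) for e in edges}
    zero_bias = {e: 0.0 for e in edges}
    mst = set(decode_lb(zero_bias, w, n, P1=0.0)[0])  # plain MST of w
    hits = 0
    for _ in range(samples):
        genotype = {e: random.random() for e in edges}
        tree = decode_lb(genotype, w, n, P1)[0]
        hits += sum(1 for e in tree if e in mst)
    return hits / (samples * (n - 1))

for P1 in (0.01, 1.0, 1000.0):
    print(P1, round(mst_overlap(n=16, P1=P1), 2))  # Pr decreases with P1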
If P1 is too small, not all possible trees can be represented. How large must P1 be to allow the LB encoding to represent all possible trees? Kruskal’s algorithm, which is used to create the tree from the w′ij , uses a sorted list of edges and iteratively inserts the edges with the lowest w′ij into the tree. Because the LB encoding modifies the original edge weights wij (8.3), it also modifies this ordered list
of edges and thus allows the encoding of trees different from MSTs. If P1 ≥ 1 (and bij = 1), the term P1 bij max(wij ) is equal to or greater than the highest original edge weight max(wij ). Then, the encoding can completely change the order of the list of edges: for an edge elm , the original wlm = min(wij ) can be changed to w′lm = min(wij ) + P1 max(wij ) > max(wij ) for P1 ≥ 1, since all wij > 0. As a result, it is possible to encode all n^(n−2) different trees using the LB encoding if P1 ≥ 1 (assuming a proper setting of the bij ).
Therefore, we make a recommendation for the choice of P1 (Rothlauf, 2009c).
With lower P1 , the LB encoding is biased towards MSTs. Thus, we expect higher
performance for OCST problems. However, if P1 < 1, some trees cannot be rep-
resented. Therefore, we recommend setting P1 ≈ 1. This ensures that all possible
solutions can be encoded while still over-representing solutions similar to MSTs.
This section presents experimental results for the LB encoding. We study the per-
formance of several modern heuristics for OCST problems from the literature and
randomly generated test instances. Furthermore, we examine how performance de-
pends on P1 . In the experiments, we use the NetKey representation (Sect. 8.1.2) and the edge-set encoding (Sect. 8.3.1) as benchmarks.
We give details on the different types of modern heuristics that are used as repre-
sentatives of recombination-based search (genetic algorithm) and local search (sim-
ulated annealing).
In all experiments, the initial solutions are generated randomly. For representa-
tions that encode solutions as strings (NetKey, LB encoding), the initial solutions are
generated by assigning random values to the genotype strings. For edge-sets with-
out heuristics, KruskalRST is used as initialization method (Sect. 8.3.1.1). For the
heuristic variant of edge sets, we set the encoding-specific initialization parameter
to α = 1, resulting in a bias towards MST-like trees.
In the experiments, we use a standard GA (Sect. 5.2.1) with uniform crossover,
no mutation, and standard 2-tournament selection without replacement as a repre-
sentative for recombination-based search. A GA run is stopped after the population
has fully converged (all solutions in the population encode the same phenotype),
or after 100 generations. For string representations, uniform crossover chooses the
alleles for the offspring randomly from the two parents. For edge-sets, we use the
heuristic as well as non-heuristic crossover variant (Sect. 8.3.1).
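For string genotypes (NetKeys, LB encoding), the uniform crossover used here reduces to a one-liner (a sketch):

import random

def uniform_crossover(parent1, parent2):
    # Each offspring allele is chosen randomly from one of the two parents.
    return [random.choice(pair) for pair in zip(parent1, parent2)]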
In SA (Sect. 5.1.4), our local search operator randomly exchanges the position of
two values in the encoded solution. For the LB encoding or NetKeys, two randomly
chosen elements bi j and bkl of the bias vector are exchanged. Therefore, each search
step assigns modified weights w to two edges, resulting in a different encoding of
a tree. In each iteration, the temperature T is reduced by a factor of 0.99 (Ti =
0.99Ti−1 ). As T decreases, the probability of accepting worse solutions decreases.
Each SA run is stopped if the number of search steps exceeds iter or there are no
improvements in the last iterterm search steps. The initial temperature T0 is set with
respect to the target problem instance. For each instance, 1,000 unbiased trees Ti are
generated and T0 = 2σ (w(Ti )), where σ (w(Ti )) denotes the standard deviation.
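A condensed sketch of this SA variant follows; the Metropolis acceptance rule exp(−Δ/T) is a standard assumption not spelled out in the text, and fitness is a problem-specific placeholder (for the LB encoding it would decode the genotype and evaluate the tree):

import math
import random

def simulated_annealing(genotype, fitness, T0, iter_max, iter_term):
    # Minimize fitness(genotype). T0 would be set to twice the standard
    # deviation of the costs of 1,000 unbiased random trees (see above).
    current, f_cur = list(genotype), fitness(genotype)
    best, f_best = list(current), f_cur
    T, since_improvement = T0, 0
    for _ in range(iter_max):
        i, j = random.sample(range(len(current)), 2)
        cand = list(current)
        cand[i], cand[j] = cand[j], cand[i]        # swap two positions
        f_cand = fitness(cand)
        delta = f_cand - f_cur
        if delta <= 0 or random.random() < math.exp(-delta / T):
            current, f_cur = cand, f_cand
        if f_cur < f_best:
            best, f_best, since_improvement = list(current), f_cur, 0
        else:
            since_improvement += 1
            if since_improvement >= iter_term:     # no recent improvement
                break
        T *= 0.99                                  # T_i = 0.99 * T_(i-1)
    return best, f_best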
Table 8.8 compares the GA performance using NetKeys, edge-sets with either naive or heuristic initialization and crossover operators, and the LB encoding with differ-
ent link-specific biases P1 . We show results for problem instances from Sect. 8.1.3
and list success probabilities Psuc and average number of generations tconv of a GA
run. The GA is run 200 times for each of the test instances with population size N.
The results indicate that the GA performs well for the LB encoding with P1 ≈ 1.
For some problem instances, a higher (palmer24) or lower (palmer6 and palmer12)
value of P1 can slightly improve GA performance, but choosing P1 ≈ 1 results on
average in a good and robust GA performance. As expected, GA performance is
low for small values of P1 as only MSTs can be encoded (except for berry6 and
berry35 where an MST is optimal (see Table 8.1, p.190)). For high values of P1 (P1 =
100), all possible trees are represented with the same probability, and GAs using
the LB encoding show similar performance as with NetKeys. Then, no problem-
Table 8.8 Success probabilities Psuc and average number of generations tconv of a GA using differ-
ent types of representations
                          edge-sets       LB encoding
             N   NetKey   naive  heur.  P1=0.05  P1=0.5  P1=1   P1=2   P1=100
palmer6     16  Psuc   0.27   0.17   0      0.31    0.84   0.66   0.48   0.28
                tconv  15.1   10.8   1.1    4.3     10.0   12.2   13.9   15.3
palmer12   300  Psuc   0.31   0.31   0      0       0.74   0.62   0.48   0.34
                tconv  60.8   36.5   4.5    40.9    48.6   52.4   54.8   60.9
palmer24   800  Psuc   0.72   0.08   0      0       0.03   0.49   0.64   0.77
                tconv  87.8   60.4   6.6    112.5   83.5   85.1   85.8   87.1
raidl10     70  Psuc   0.78   0.44   0      0       1      1      0.98   0.78
                tconv  33.9   22.8   2.1    16.3    21.5   25.0   27.4   33.0
raidl20    450  Psuc   0.46   0.53   0      0       0.99   0.92   0.85   0.41
                tconv  82.7   54.3   6.0    40.1    63.1   69.0   73.1   82.7
berry6      16  Psuc   0.54   0.38   1      1       1      0.98   0.82   0.56
                tconv  14.2   10.4   1.5    1.0     6.9    9.4    11.6   14.5
berry35    300  Psuc   0.03   0      1      1       1      1      0.98   0.04
                tconv  98.3   74.8   1.0    1.0     70.3   76.0   80.0   98.0
8.4.2.3 Influence of P1
Figures 8.10(a) and 8.10(b) show the success probability Psuc of a GA and SA over
the link-specific bias P1 for 100 randomly created OCST problem instances with
problem size n = 16 and randomly chosen edge weights wi j ∈]0, 100]. For each
problem instance, 50 independent runs are performed and Psuc is averaged over all
100 problem instances. The GA has population size N = 200. The SA runs are
stopped either after iter = 20,000 search steps, or if there are no improvements in the last iterterm = 2,000 search steps.
Fig. 8.10 Success probability Psuc of a GA and SA over the link-specific bias P1
We find that GA and SA performance is best when using the LB encoding with
P1 ≈ 1. A pairwise t-test is performed on the success probabilities Psuc of GA and
SA; for 0.1 ≤ P1 ≤ 10, the LB encoding outperforms the unbiased NetKey encoding
with high significance (p < 0.001). Furthermore, as expected, the performance of
both modern heuristics is similar for the LB encoding with large P1 and NetKeys.
For large values of P1 , the LB encoding becomes uniformly redundant, all possible
trees are represented with the same probability, and it is not possible to exploit
problem-specific properties of OCST problems. For low values of P1 , Psuc becomes small, as only those OCST problems can be solved whose optimal solution is an MST.
Fig. 8.11 Average distance μ (db f ,opt ) between the best solutions Tb f that have been found by a GA
and SA and the optimal solution Topt over the link-specific bias P1 for 100 random OCST problems
Figure 8.12 shows how the success probability Psuc of a modern heuristic depends
on the problem size n for OCST problems with either random or Euclidean wi j . We
use either a GA with N = 200 or SA with iter = 20,000 and iterterm = 2,000. As
before, we randomly generate 100 instances of each size n and perform 50 runs for
each representation. We plot results for different values of P1 . The NetKey encoding
is used as an example of a problem-independent (unbiased) encoding.
Fig. 8.12 Success probability Psuc of modern heuristics over n for random OCST problems
The plots show similar results for GA or SA as well as for OCST problems with
random or Euclidean weights. As expected, the performance of representations that
are biased towards MSTs is higher for OCST problems with random weights than
with Euclidean weights (compare Fig. 8.2, p. 193). Furthermore, GA and SA performance
decreases with increasing n as the parameters of the modern heuristics are fixed and
larger instances are more difficult to solve. To get better results, either larger popu-
lation sizes (GA), or a different cooling schedule combined with a higher number of
search steps (SA) would be necessary. More importantly, GA and SA performance
is highest with P1 ≈ 1. Unbiased representations such as NetKeys are outperformed.
Results for edge-sets are omitted due to the low performance of these encodings.
Tables 8.9 and 8.10 present the performance of modern heuristics using the LB en-
coding for random OCST problems of different type and size and compare it to
NetKeys and the results (denoted as PeRe) obtained by an approximation algorithm
from Peleg and Reshef (1998) and Reshef (1999). They presented an O(log n) ap-
proximation for OCST problems with Euclidean distance weights and an O(log3 n)
approximation for arbitrary (non-Euclidean) distance weights. The approximation
uses a recursive construction algorithm with a partitioning step that clusters the
nodes in disjoint subsets. The proposed algorithms are the best approximation algo-
rithms currently available for OCST problems.
Table 8.9 Performance of GA, SA, and approximation algorithm (PeRe) for randomly generated
OCST problems of different size n with Euclidean distance weights
                       GA                                              SA
 n         NetKey     LB P1=0.2  LB P1=1    LB P1=5    NetKey     LB P1=0.2  LB P1=1    LB P1=5    PeRe
10 Psuc      0.56       0.31       0.78       0.60       0.70       0.35       0.83       0.76       0
   w(Tbf)  153,583    155,600    152,717    153,334    153,628    155,249    152,851    153,237    210,540
   eval      4,071      2,565      3,289      3,868      3,770      2,865      2,837      3,389      -
20 Psuc      0.02       0.06       0.28       0.12       0.04       0.06       0.33       0.21       0
   w(Tbf)  690,697    692,340    677,477    682,193    691,633    693,085    680,821    684,057    1.1*10^6
   eval     15,207     10,438     12,690     14,292     16,496      7,647      8,350     12,776      -
50 w(Tbf)  4.78*10^6  4.56*10^6  4.47*10^6  4.51*10^6  5.25*10^6  4.56*10^6  4.52*10^6  4.61*10^6  9.7*10^6
Table 8.10 Performance of GA, SA and approximation algorithm (PeRe) for randomly generated
OCST problems of different size n with random distance weights
                      GA                                          SA
 n         NetKey    LB P1=0.2  LB P1=1   LB P1=5   NetKey     LB P1=0.2  LB P1=1   LB P1=5   PeRe
10 Psuc      0.84      0.86       0.97      0.90      0.91       0.88       0.96      0.93      0
   w(Tbf)    6,824     6,799      6,789     6,800     6,812      6,798      6,789     6,800     19,438
   eval      3,665     1,867      2,722     3,355     3,652      2,293      2,599     3,124     -
20 Psuc      0.11      0.94       0.89      0.57      0.14       0.86       0.84      0.61      0
   w(Tbf)   19,946    18,738     18,746    18,908    20,059     18,803     18,794    19,017    88,069
   eval     14,113     8,215     10,809    12,626    16,645      5,547      7,384    11,721    -
50 w(Tbf)  108,465    65,256     65,610    69,213   196,147     65,394     66,405    78,640   592,808
The table shows the average cost w(Tbf ) of the best solution found, and the average number eval of fitness evaluations for randomly created problems with 10,
20, 50, and 100 nodes. For each n, we generate 100 random problem instances and
perform 20 runs for each instance (for n = 100, we only create 15 instances due
to the high computational effort). The distance weights wi j are either random or
Euclidean. The demands ri j are generated randomly and are uniformly distributed
in ]0, 10]. For small problem instances (n = 10 and n = 20) optimal solutions are
determined using the iterated GA described in Sect. 8.1.3.2.
Because the performance of modern heuristics depends on parameter settings and
modern heuristics with fixed parameters show lower performance with increasing
problem size n (compare Fig. 8.12), the GA and SA parameters have to be adapted
with increasing n. Table 8.11 lists how the GA (population size N) and SA (maximal
number of iterations iter, and termination criterion iterterm ) parameters are chosen
with respect to n. By using higher population sizes (GA), or a larger number of
search steps (SA), solutions with higher quality can be found for larger n.
Table 8.11 GA and SA parameters

 n              10      20      50      100
 GA  N          100     200     400     800
 SA  iter       10,000  20,000  40,000  80,000
     iterterm   2,000   4,000   8,000   16,000
8.5 Initialization
This section studies how the choice of a biased initial solution affects the perfor-
mance of modern heuristics. We have seen in Sect. 8.1.3 that optimal solutions of
OCST problems are similar to an MST. This similarity suggests that the performance
of modern heuristics can be improved by starting with an MST.
Consequently, we investigate how the performance of different modern heuristics
depends on the type of initial solution. We show results for a greedy search strat-
egy, local search, SA, and a GA. Greedy search and local search are representative
start procedure of Elias and Ferguson (1974)). Examples of modern heuristics for
the capacitated MST problem that start with an MST are presented in Kershenbaum
et al (1980), and Gavish et al (1992). For a survey on heuristics and modern heuris-
tics for the capacitated MST, we refer to Amberg et al (1996) and Voß (2001).
We study how the choice of the initial solution affects the performance of various
modern heuristics.
In the experiments, we use greedy search (GS), simulated annealing (SA), local
search (LS), and genetic algorithms (GA). Greedy search (Sect. 3.3.2.3) starts with
an initial solution Ts and iteratively examines all neighboring trees that are different
in one edge and chooses the tree T j with lowest cost w(T j ). The neighborhood of
Ti consists of all T j that can be created by the change of one edge (di j = 1). GS
stops if the current solution cannot be improved any more. The number of different
neighbors of a tree depends on its structure and varies between (n − 1)(n − 2) and n(n − 1)(n + 1)/6 − n + 1 (Rothlauf, 2006).
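A sketch of this greedy search over the one-edge-exchange neighborhood (trees as frozensets of edges (i, j) with i < j; cost is a problem-specific callable; helper names are our own):

import itertools

def one_edge_neighbors(tree, n):
    # All spanning trees that differ from `tree` in exactly one edge:
    # delete one tree edge and reconnect the two components with any
    # other edge that crosses the resulting cut.
    for removed in tree:
        rest = tree - {removed}
        adj = {i: set() for i in range(n)}
        for a, b in rest:
            adj[a].add(b)
            adj[b].add(a)
        comp, frontier = {0}, [0]          # component of node 0
        while frontier:
            x = frontier.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    frontier.append(y)
        for u, v in itertools.combinations(range(n), 2):
            if ((u in comp) != (v in comp)) and (u, v) != removed:
                yield rest | {(u, v)}

def greedy_search(tree, n, cost):
    f = cost(tree)
    while True:
        best = min(one_edge_neighbors(tree, n), key=cost, default=None)
        if best is None or cost(best) >= f:
            return tree, f                 # local optimum: no improvement
        tree, f = best, cost(best)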
Table 8.12 presents results for the test instances from Sect. 8.1.3. It lists for different
starting solutions Ts , the cost w(T ) of 1,000 randomly generated solutions of type Ts
(denoted random), the performance (average cost of best found solution and average
number of evaluations) of a greedy search starting with Ts , the maximum number
iter of evaluations for SA, LS, GA, and GA-ESH, and the average cost of the best
found solution for SA, LS, GA, and GA-ESH. Ts is either an MST, a random tree,
or the result of the PeRe-approximation. For GA-ESH, we only present results for
heuristic initialization (Sect. 8.3.1.2, p. 197) which creates MST-like solutions. For
the population-based GA, one randomly chosen individual in the initial population
of size N is set to Ts . Therefore, this randomly chosen solution is either an MST, a
randomly chosen tree, or the result of the PeRe-approximation. All other solutions
in the initial population are random trees. 50 independent runs are performed for each problem instance and the gap (w(Tbf ) − w(Tbest ))/w(Tbest ) (in %) between the best found solution Tbf and Tbest (Table 8.1) is shown in brackets.
For all problem instances, the average cost of randomly created solutions and
PeRe is greater than the cost of an MST. For the greedy heuristic, there are no large
differences between the cost of the best found solution for different Ts . However,
there are large differences in the average number eval of evaluations. Starting from a random tree always results in the highest eval; starting from an MST always results (except for palmer24) in the lowest eval. On average, eval when starting from PeRe lies
between the results for MST and random tree. For SA, LS, and GA, starting with an
MST always results (except for palmer24 and raidl10) in better solutions than start-
ing from a random tree, and often in better solutions than starting from PeRe. Using
SA or LS and starting from an MST always results (except palmer6 and berry35)
in the same (berry6) or better performance than GA-ESH. Summarizing the results,
the cost of an MST is on average lower than the cost of the PeRe-approximation (up
to a factor of six for raidl100) and the cost of a random tree (up to a factor of 35
for raidl100). The modern heuristics find better solutions (up to a factor of four for
raidl75) when starting with an MST in comparison to starting with a random tree.
Tables 8.13 and 8.14 extend the analysis and show results for random OCST prob-
lems with 10, 25, 50, 100, and 150 nodes. We report the same performance figures
as in the above paragraphs (Table 8.13). Table 8.15 lists the maximum number of
evaluations allowed for SA, LS, GA, and GA-ESH.
Table 8.12 Performance of different modern heuristics using different starting solutions Ts for test
instances from the literature
           Ts    random     greedy                      iter    SA               LS               GA                GA-ESH
                 cost       cost (%opt)      eval               cost (%opt)      cost (%opt)      cost (%opt)       cost (%opt)
palmer6    MST   709,770    693,180 (0)      57         500     698,644 (0.8)    699,740 (0.9)    693,680 (0.07)    693,180 (0)
           rnd   1.7*10^6   699,442 (0.9)    186                700,150 (1.0)    700,635 (1.1)    693,781 (0.09)
           PeRe  926,872    695,566 (0.3)    75                 700,927 (1.1)    697,951 (0.7)    693,478 (0.04)
palmer12   MST   3.8*10^6   3,541,915 (3.3)  1,794      500     3,579,384 (4.4)  3,498,458 (2.0)  3,730,657 (8.8)   3,623,253 (5.7)
           rnd   1.1*10^7   3,492,357 (1.9)  5,752              3,594,002 (4.8)  3,512,364 (2.5)  4,004,888 (17)
           PeRe  4.6*10^6   3,522,282 (2.7)  2,806              3,580,330 (4.4)  3,502,914 (2.2)  3,912,014 (14)
palmer24   MST   1.9*10^6   1,086,656 (0)    57,654     2,500   1,098,379 (1.1)  1,092,353 (0.5)  1,154,449 (6.2)   1,250,073 (15)
           rnd   1.0*10^7   1,086,656 (0)    92,549             1,097,179 (1.0)  1,098,732 (1.1)  1,226,783 (13)
           PeRe  2.3*10^6   1,086,656 (0)    39,303             1,103,834 (1.6)  1,096,615 (0.9)  1,181,316 (8.7)
raidl10    MST   58,352     53,674 (0)       435        500     54,762 (2.0)     53,699 (0.05)    57,141 (6.5)      55,761 (3.9)
           rnd   328,993    53,674 (0)       2,296              54,663 (1.8)     54,009 (0.6)     67,893 (26)
           PeRe  194,097    53,674 (0)       1,711              54,796 (2.1)     53,674 (0)       70,120 (31)
raidl20    MST   168,022    157,570 (0)      3,530      2,500   158,983 (0.9)    157,570 (0)      159,911 (1.5)     158,974 (0.9)
           rnd   1.9*10^6   157,995 (0.3)    50,843             161,023 (2.2)    160,214 (1.7)    205,718 (30.6)
           PeRe  849,796    158,704 (0.7)    32,053             160,943 (2.1)    160,578 (1.9)    209,731 (33)
raidl50    MST   912,303    809,311 (0.30)   211,394    10,000  829,780 (2.84)   811,098 (0.52)   852,091 (5.6)     880,927 (9.18)
           rnd   2.0*10^7   806,946 (0.01)   1.7*10^6           864,736 (7.17)   887,066 (9.94)   1,541,047 (91)
           PeRe  5.9*10^6   807,353 (0.06)   1.0*10^6           890,082 (10.3)   883,483 (9.50)   1,488,774 (85)
raidl75    MST   2.4*10^7   1,717,491 (0)    1.3*10^6   10,000  2,042,603 (19)   1,852,905 (7.9)  1,971,638 (15)    2,003,433 (17)
           rnd   5.8*10^7   1,717,491 (0)    8.4*10^6           2,401,151 (40)   2,330,226 (36)   8,814,074 (413)
           PeRe  1.3*10^7   1,749,322 (1.9)  4.5*10^6           2,370,276 (38)   2,303,405 (34)   4,957,022 (189)
raidl100   MST   3.6*10^6   2,561,543 (0)    2.4*10^6   40,000  2,713,040 (6)    2,619,256 (2.3)  2,831,167 (11)    2,935,381 (14.6)
           rnd   1.1*10^8   2,603,146 (1.6)  2.2*10^7           2,870,197 (12)   2,937,911 (15)   5,200,334 (103)
           PeRe  2.4*10^7   2,709,603 (5.8)  1.2*10^7           2,941,064 (15)   2,959,953 (16)   4,620,145 (80)
berry6     MST   534        534 (0)          28         500     534 (0)          534 (0)          534 (0)           534 (0)
           rnd   1,284      534 (0)          207                534 (0)          534 (0)          534.04 (0.07)
           PeRe  842        534 (0)          120                534 (0)          534 (0)          536 (0.3)
berry35    MST   16,915     16,915 (0)       3,387      2,500   21,818 (29)      16,915 (0)       16,915 (0)        16,915 (0)
           rnd   379,469    16,915 (0)       467,008            22,426 (33)      22,642 (34)      44,661 (164)
           PeRe  60,382     16,915 (0)       171,425            21,563 (27)      20,777 (23)      31,765 (88)

(iter and GA-ESH are listed once per instance; GA-ESH uses heuristic initialization and therefore does not depend on Ts .)
Table 8.13 Performance of different modern heuristics using different starting solutions Ts for
randomly created OCST test instances with Euclidean distance weights
  n  Ts    random     greedy                       SA               LS               GA               GA-ESH
           cost       cost (%opt)       eval       cost (%opt)      cost (%opt)      cost (%opt)      cost (%opt)
 10  MST   1,670      1,515 (0%)        698        1,527 (0.80%)    1,539 (1.58%)    1,605 (5.94%)    1,574 (3.89%)
     rnd   3,284      1,520 (0.32%)     2,527      1,555 (2.65%)    1,550 (2.28%)    1,673 (10.4%)
     PeRe  2,114      1,519 (0.23%)     1,475      1,530 (1.00%)    1,549 (2.20%)    1,665 (9.89%)
 25  MST   10,794     8,839 (0%)        41,854     8,937 (1.11%)    8,885 (0.52%)    9,230 (4.43%)    9,412 (6.49%)
     rnd   40,261     8,844 (0.06%)     122,463    9,004 (1.88%)    9,027 (2.14%)    10,337 (17.0%)
     PeRe  12,827     8,855 (0.18%)     62,073     9,021 (2.06%)    9,044 (2.33%)    11,088 (25.5%)
 50  MST   61,450     44,142 (0%)       786,904    44,749 (1.38%)   44,600 (1.04%)   45,320 (2.67%)   47,391 (7.36%)
     rnd   207,139    73,179 (65.8%)    999,800    45,187 (2.37%)   45,364 (2.77%)   50,336 (14.0%)
     PeRe  63,638     45,301 (2.63%)    778,602    45,248 (2.51%)   45,312 (2.65%)   51,592 (16.9%)
100  MST   272,987    256,149 (42%)     10^7       181,538 (0.5%)   180,592 (0%)     182,787 (1.2%)   189,559 (4.97%)
     rnd   1,207,970  1,069,412 (492%)  10^7       183,188 (1.4%)   182,706 (1.2%)   194,423 (7.7%)
     PeRe  251,198    255,158 (41%)     10^7       182,419 (1.0%)   183,280 (1.5%)   197,270 (9.2%)
150  MST   645,042    624,139 (56%)     10^7       399,312 (0.0%)   399,262 (0%)     401,127 (0.5%)   417,045 (4.45%)
     rnd   3,390,500  3,082,983 (672%)  10^7       403,119 (1.0%)   402,687 (0.9%)   425,814 (6.7%)
     PeRe  562,282    563,160 (41%)     10^7       401,615 (0.6%)   401,946 (0.7%)   434,682 (8.9%)

(GA-ESH is listed once per n since it uses heuristic initialization independent of Ts .)
Table 8.14 Performance of different modern heuristics using different starting solutions Ts for
randomly created OCST test instances with uniformly distributed random distance weights
  n  Ts    random     greedy                          SA              LS              GA              GA-ESH
           cost       cost (%opt)          eval       cost (%opt)     cost (%opt)     cost (%opt)     cost (%opt)
 10  MST   717        676 (0%)             291        686 (1.37%)     680 (0.49%)     703 (3.93%)     689 (1.95%)
     rnd   2,863      677 (0.10%)          2,472      691 (2.11%)     685 (1.25%)     852 (25.9%)
     PeRe  1,949      677 (0.11%)          1,661      687 (1.62%)     684 (1.10%)     855 (26.4%)
 25  MST   3,094      2,720 (0%)           12,420     2,806 (3.16%)   2,738 (0.65%)   2,840 (4.40%)   2,855 (4.93%)
     rnd   37,254     2,723 (0.08%)        126,859    2,893 (6.34%)   2,882 (5.93%)   4,894 (79.9%)
     PeRe  14,000     2,723 (0.09%)        74,831     2,885 (6.03%)   2,864 (5.29%)   5,290 (94.5%)
 50  MST   7,921      6,663 (0%)           169,398    6,989 (4.89%)   6,753 (1.36%)   7,081 (6.28%)   7,205 (8.14%)
     rnd   214,238    45,822 (587%)        10^7       7,447 (11.8%)   7,435 (11.6%)   16,009 (140%)
     PeRe  59,304     8,968 (34.6%)        949,768    7,492 (12.4%)   7,405 (11.1%)   14,523 (118%)
100  MST   21,719     18,436 (9.22%)       10^7       17,584 (4.2%)   16,879 (0%)     18,020 (6.8%)   18,561 (10.0%)
     rnd   1,206,920  990,391 (5,767%)     10^7       18,882 (11%)    19,765 (17%)    42,324 (151%)
     PeRe  244,349    187,958 (1,014%)     10^7       19,003 (13%)    19,432 (15%)    36,938 (119%)
150  MST   40,596     39,184 (29%)         10^7       31,451 (4.0%)   30,247 (0%)     32,807 (8.5%)   33,350 (10.3%)
     rnd   3,290,630  3,094,862 (10,131%)  10^7       36,107 (19%)    36,712 (21%)    84,711 (180%)
     PeRe  549,068    505,367 (1,570%)     10^7       37,417 (24%)    37,355 (24%)    71,298 (136%)

(GA-ESH is listed once per n since it uses heuristic initialization independent of Ts .)
For Table 8.13, we generate random problem instances with Euclidean weights.
The nodes are placed randomly on a two-dimensional grid of size 10×10. For Table
8.14, the weights are randomly generated and uniformly distributed in ]0,10]. For
both tables, the demands ri j are randomly generated and uniformly distributed in
]0,10]. The results for the randomly created OCST problem instances are similar
to the results for the problem instances from the literature. The average cost of an
MST is always lower than the average cost of random solutions (significant using
a t-test with an error level of 0.01). Because for Euclidean weights (Table 8.13) the PeRe approximation is tighter and because optimal solutions have a larger average distance μ(dmst,opt ) to an MST (compare Fig. 8.2), the PeRe approximation results in better solutions, whose costs are similar to those of an MST. For random weights (Table
8.14), the average cost of an MST is always lower than the average cost of the PeRe
approximation (also significant using a t-test with an error level of 0.01).
For Euclidean weights (Table 8.13), greedy search starting from an MST either
finds good solutions faster (n < 100), or finds better solutions (n ≥ 100) than starting
from a random tree. With increasing n, greedy search starting from PeRe performs
similarly to when starting from an MST. For SA, LS, and GA, starting from an MST
always finds better solutions than starting from a random solution or from PeRe.
Furthermore, starting from an MST yields better results than GA-ESH (except for n = 10).
For random weights (Table 8.14), starting from an MST always results in better
performance than starting from a random tree or PeRe (significant using a t-test
with an error level of 0.01, except SA with n = 10). Furthermore, using SA, LS, or
GA and starting with an MST results in better performance than GA-ESH (except
n = 10).
Chapter 9
Summary
This textbook taught us the art of systematically designing efficient and effective
modern heuristics. We learned when to use modern heuristics, how to deliberately
choose among the available types of modern heuristics, what are the relevant design
elements and principles of modern heuristics, and how we can improve their per-
formance by considering problem-specific knowledge. We want to summarize the
main lessons learned.
For which types of problems should we use modern heuristics? There are
not only modern heuristics, but we can usually choose from a variety of exact and
heuristic optimization methods (see Chap. 3). After we have obtained a model of
our problem, we have to select a proper optimization method. For problems that
can be solved with polynomial effort (e.g. continuous linear problems, Sect. 3.2),
modern heuristics are usually not appropriate. If the problem at hand is more diffi-
cult (e.g. NP-hard, Sect. 2.4.1), we recommend searching the literature for fast exact
optimization methods (Sect. 3.3), fast heuristics (Sect. 3.4.1), or efficient approxi-
mation methods (Sect. 3.4.2). If such methods are available, we should try them;
often modern heuristics are not necessary. For example, the knapsack problem is
FPTAS-hard (Fig. 3.20, p. 90) and, thus, efficient solution approaches are available.
The situation becomes different if our problem is difficult and/or non-standard.
Examples of difficult problems are APX-hard problems (Sect. 2.4.1) like the MAX
SAT problem, the symmetric TSP, or the OCST problem. For such problems, mod-
ern heuristics are often a good choice, especially if problems become large. Further-
more, modern heuristics are often the method of choice for non-standard real-world
problems which are different from the well-defined problems that we can find in the
literature (Sect. 4.1). Real-world problems often have additional constraints, addi-
tional decision variables, and other optimization goals. We can try to reduce com-
plexity and make our problem more standard-like by removing constraints, simpli-
fying the objective function, or limiting the number of decision variables. However,
we often do not want to neglect some important aspects and are not happy with the
simplified model. In this case, modern heuristics are often the right choice.
How can we select a modern heuristic that fits our problem well? The local-
ity of a problem (Sect. 2.4.2) describes how well distances between solutions corre-
spond to their fitness differences. Locality has a strong impact on the performance
of local search methods. High locality allows local search to find high-quality solutions in the neighborhood of already found good solutions and guides local search
methods towards optimal solutions. In contrast, if a problem has low locality, local
search methods cannot make use of information gathered in previous search steps
but behave like random search. The decomposability of a problem (Sect. 2.4.3) de-
scribes how well a problem can be decomposed into smaller and independent sub-
problems. The decomposability of a problem is high if the structure of the objective
function is such that there are groups of decision variables that can be set indepen-
dently of decision variables contained in other groups. It is low if it is not possible to
decompose a problem into sub-problems that have little interdependence. Problems
with high decomposability can be solved well using recombination-based modern
heuristics because solving a number of smaller sub-problems is usually easier than
solving the larger, original problem. Decomposable problems usually have high lo-
cality. For decomposable problems, the change of a decision variable changes the
fitness of only one sub-problem. If the fitness of the other sub-problems remains
unchanged, similar solutions often have similar fitness and, thus, the locality of the
problem is high.
As many real-world problems have high locality and are decomposable, local as
well as recombination-based search often show good performance and return high-
quality solutions. However, direct comparisons between local and recombination-
based search are only meaningful for particular problem instances and general state-
ments on the superiority of one of these search concepts are unjustified. Which one
is more appropriate for solving a particular problem instance depends on the specific
characteristics of the problem (locality versus decomposability).
What are common design elements of modern heuristics? In Chap. 4, we
discussed the design elements representation, search operator, fitness function, and
initial solution. These elements are relevant for all different types of modern heuris-
tics. Chapter 5 studied the fifth design element, which is the search strategy. The
search strategy defines the intensification and diversification mechanisms.
Given a set of solutions, we can define a search space either by defining search
operators or by selecting a metric (Sect. 2.3.1). Local search operators usually cre-
ate neighboring solutions. Recombination operators generate offspring where the
distances between offspring and parents are usually equal to or smaller than the
distances between parents. Therefore, defining search operators implies a neighbor-
hood relationship for the search space, and vice versa. There are two complemen-
tary approaches for designing representations and search operators: We may use
(indirect) representations where solutions are encoded in a standard data structure
(Sect. 4.2.4), and standard operators (Sect. 4.3.5) are applied to these genotypes.
Then, a proper choice of the genotype-phenotype mapping (Sect. 4.2.3) is impor-
tant for the performance of search. In contrast, a direct representation (Sect. 4.3.4)
encodes solutions in their most natural problem space and designs search operators
to operate on this search space. Then, no mapping between genotypes and pheno-
types needs to be specified, but the search operators are specially designed for the
phenotypes and are problem-specific (Sect. 4.3.1).
Designing a fitness function (Sect. 4.4) and initialization method (Sect. 4.5) is
usually easier than designing proper representations and search operators. The fit-
ness function is determined by the objective function and must allow modern heuris-
tics to perform pairwise comparisons between solutions. Initial solutions are usually
randomly created if no a priori knowledge about the problem exists.
Finally, search strategies define intensification and diversification mechanisms
used during search. Search strategies must ensure that the search stays focused (in-
tensification) but also allow the search to escape from local optima (diversification).
This is achieved by various diversification techniques based on the representation,
search operator, fitness function, initialization, or explicit diversification steps con-
trolled by the search strategy (Chap. 5).
How can we categorize different types of modern heuristics? Modern heuris-
tics are extended variants of improvement heuristics (Sect. 3.4.3). Both consider in-
formation about previously sampled solutions for future search decisions. However,
in contrast to improvement heuristics which only perform intensification steps, mod-
ern heuristics also allow inferior solutions to be generated during search. Therefore,
modern heuristics use intensification as well as diversification steps during search.
As the behavior of modern heuristics is defined independently of the problem, they
are general-purpose methods applicable to a wide range of problems.
Diversification allows search to escape from local optima whereas intensification
ensures that the search moves in the direction of solutions with higher quality. Di-
versification is often the result of large modifications of solutions. Intensification
uses the fitness of solutions to guide search and ensures that the search moves in the
direction of solutions with higher fitness. Existing local and recombination-based
search strategies mainly differ in the way in which they control diversification and
intensification (Sect. 5.1). Based on the design elements, there are different strate-
gies to introduce diversification into the search and to escape from local optima:
• By using different types of neighborhoods during search, it is possible to es-
cape from local optima and explore larger areas of the search space. Different
neighborhoods can be the result of changing genotype-phenotype mappings or
search operators during search (Sect. 5.1.1). Examples are variable neighbor-
hood search, problem space search, the rollout algorithm, or the pilot method.
• Modifications of the fitness function can also lead to diversification (Sect. 5.1.2).
An example is guided local search which systematically changes the fitness func-
tion with respect to the progress of search.
• Diversity can be introduced by performing repeated runs with different initial
solutions (Sect. 5.1.3). Examples are iterated descent, large-step Markov chains,
iterated Lin-Kernighan, chained local optimization, or iterated local search.
• Finally, the search strategy explicitly controls diversification and intensification
(Sect. 5.1.4). Examples of search strategies that use a controlled number of large
search steps towards solutions of lower quality to increase diversity are simulated
annealing, threshold accepting, or stochastic local search. Representative exam-
ples of strategies that consider previous search steps for diversification are tabu
search or adaptive memory programming.
How can we use problem-specific knowledge for the design of modern heuristics? Often, we have some knowledge about the properties of good or even bad solutions to our problem. Such problem-specific knowledge can be ex-
ploited by introducing a bias into modern heuristics. The bias should consider this
knowledge and, for example, concentrate search on solutions that are expected to
be of high quality or avoid solutions expected to be of low quality. A bias can be
considered in all design elements of modern heuristics, namely the representation,
the search operator, the fitness function, the initialization, and also the search strat-
egy. However, we recommend biasing modern heuristics only if we have obtained
some particular knowledge about an optimization problem or problem instance. If
we have no knowledge about the properties of a problem, we should not bias modern heuristics, as an ill-chosen bias would mislead the search.
Chapter 8 presented a case study on the design of problem-specific modern
heuristics for the optimal communication spanning tree (OCST) problem. We find
that optimal solutions for OCST problems are similar to optimal solutions for the
(much simpler) minimum spanning tree problem. Thus, biasing the representation,
search operator, initial solution, or fitness function towards the minimum spanning
tree can increase the performance of modern heuristics. Experimental results con-
firm this conjecture for a problem-specific problem representation (link-biased en-
coding), search operators (edge-set encoding), and initial solutions. We find that it
is not the particular type of modern heuristic used that is relevant for high performance, but rather the appropriate consideration of problem-specific knowledge in the basic design elements.
References
Fortnow L (2009) The status of the P versus NP problem. Commun ACM 52(9):78–
86
Foster JA, Lutton E, Miller J, Ryan C, Tettamanzi AGB (eds) (2002) Proceedings of
the Fifth European Conference on Genetic Programming (EuroGP-2002), LNCS,
vol 2278, Springer, Berlin
Foulds LR (1983) The heuristic problem-solving approach. Journal of the Operational Research Society 34(10):927–934
Gale JS (1990) Theoretical Population Genetics. Unwin Hyman, London
Gallagher M, Frean M (2005) Population-based continuous optimization, proba-
bilistic modelling and mean shift. Evolutionary Computation 13(1):29–42
Garey MR, Johnson DS (1979) Computers and Intractability: A Guide to the Theory
of NP-completeness. W. H. Freeman, New York
Gargano ML, Edelson W, Koval O (1998) A genetic algorithm with feasible search
space for minimal spanning trees with time-dependent edge costs. In: Koza JR,
Banzhaf W, Chellapilla K, Deb K, Dorigo M, Fogel DB, Garzon MH, Goldberg
DE, Iba H, Riolo RL (eds) Genetic Programming 98, Morgan Kaufmann, San
Francisco, p 495
Gaube T, Rothlauf F (2001) The link and node biased encoding revisited: Bias and
adjustment of parameters. In: Boers EJW, Cagnoni S, Gottlieb J, Hart E, Lanzi
PL, Raidl GR, Smith RE, Tijink H (eds) Applications of Evolutionary Computing:
Proc. EvoWorkshops 2001. Springer, Berlin, pp 1–10
Gavish B, Li CL, Simchi-Levi D (1992) Analysis of heuristics for the design of tree
networks. Annals of Operations Research 36:77–86
Gelernter H (1963) Realization of a geometry-theorem proving machine. In: Feigen-
baum EA, Feldman J (eds) Computers and thought, McGraw-Hill, New York, pp
134–152, published 1959 in Proceedings of International Conference on Infor-
mation Processing, Unesco House
Gelly S, Teytaud O, Bredeche N, Schoenauer M (2005) A statistical learning theory
approach of bloat. In: Beyer HG, O’Reilly UM, Arnold DV, Banzhaf W, Blum C,
Bonabeau EW, Cantu-Paz E, Dasgupta D, Deb K, Foster JA, de Jong ED, Lipson
H, Llora X, Mancoridis S, Pelikan M, Raidl GR, Soule T, Tyrrell AM, Watson JP,
Zitzler E (eds) Proceedings of the Genetic and Evolutionary Computation Con-
ference, GECCO 2005, ACM Press, New York, pp 1783–1784
Gen M, Li Y (1999) Spanning tree-based genetic algorithms for the bicriteria fixed
charge transportation problem. In: Angeline et al (1999), pp 2265–2271
Gen M, Zhou G, Takayama M (1998) A comparative study of tree encodings on
spanning tree problems. In: Fogel (1998), pp 33–38
Gendreau M (2003) An introduction to tabu search. In: Glover F, Kochenberger GA
(eds) Handbook of Metaheuristics, Kluwer, Alphen aan den Rijn, pp 37–54
Gendreau M, Potvin JY (2005) Metaheuristics in combinatorial optimization. Annals of Operations Research 140(1):189–213. DOI http://dx.doi.org/10.1007/s10479-005-3971-7
Gerrits M, Hogeweg P (1991) Redundant coding of an NP-complete problem al-
lows effective genetic algorithm search. In: Schwefel HP, Männer R (eds) Paral-
lel Problem Solving from Nature – PPSN I, Springer, Berlin, LNCS, vol 496, pp
70–74
Gill PE, Murray W, Saunders MA, Tomlin JA, Wright MH (1986) On projected
Newton barrier methods for linear programming and an equivalence to Kar-
markar’s projective method. Mathematical Programming 36:183–209
Glover F (1977) Heuristics for integer programming using surrogate constraints.
Decision Sciences 8(1):156–166
Glover F (1986) Future paths for integer programming and links to artificial
intelligence. Computers & OR 13(5):533–549, DOI http://dx.doi.org/10.1016/
0305-0548(86)90048-1
Glover F (1990) Tabu search – part II. ORSA Journal on Computing 2(1):4–32
Glover F (1994) Genetic algorithms and scatter search – unsuspected potentials.
Statistics And Computing 4(2):131–140
Glover F (1997) A template for scatter search and path relinking. In: Hao JK, Lutton
E, Ronald EMA, Schoenauer M, Snyers D (eds) Proceedings of Artificial Evolu-
tion: Fourth European Conference, Berlin, LNCS, vol 1363, pp 1–51
Glover F, Kochenberger GA (eds) (2003) Handbook of Metaheuristics. Kluwer,
Boston
Glover F, Laguna M (1997) Tabu Search. Kluwer, Boston
Goldberg DE (1987) Simple genetic algorithms and the minimal, deceptive problem.
In: Davis L (ed) Genetic Algorithms and Simulated Annealing, Morgan Kauf-
mann, San Mateo, chap 6, pp 74–88
Goldberg DE (1989a) Genetic algorithms and Walsh functions: Part I, a gentle in-
troduction. Complex Systems 3(2):129–152
Goldberg DE (1989b) Genetic algorithms and Walsh functions: Part II, deception
and its analysis. Complex Systems 3(2):153–171
Goldberg DE (1989c) Genetic algorithms in search, optimization, and machine
learning. Addison-Wesley, Reading
Goldberg DE (1990) Real-coded genetic algorithms, virtual alphabets, and blocking.
IlliGAL Report No. 90001, University of Illinois at Urbana-Champaign, Urbana,
IL
Goldberg DE (1991a) Genetic algorithm theory. Fourth International Conference on
Genetic Algorithms Tutorial, unpublished manuscript
Goldberg DE (1991b) Real-coded genetic algorithms, virtual alphabets, and block-
ing. Complex Systems 5(2):139–167
Goldberg DE (1992) Construction of high-order deceptive functions using low-order
Walsh coefficients. Annals of Mathematics and Artificial Intelligence 5:35–48
Goldberg DE (2002) The Design of Innovation. Series on Genetic Algorithms and Evolutionary Computation, Kluwer, Dordrecht
Goldberg DE, Deb K (1991) A comparative analysis of selection schemes used in
genetic algorithms. Foundations of Genetic Algorithms 1:69–93
Goldberg DE, Lingle, Jr R (1985) Alleles, loci, and the traveling salesman problem.
In: Grefenstette (1985), pp 154–159
Goldberg DE, Segrest P (1987) Finite Markov chain analysis of genetic algorithms.
In: Grefenstette (1987), pp 1–8
Goldberg DE, Korb B, Deb K (1989) Messy genetic algorithms: Motivation, analy-
sis, and first results. Complex Systems 3(5):493–530
Goldberg DE, Deb K, Clark JH (1992) Genetic algorithms, noise, and the sizing of
populations. Complex Systems 6:333–362
Goldberg DE, Deb K, Kargupta H, Harik G (1993a) Rapid, accurate optimization
of difficult problems using fast messy genetic algorithms. In: Forrest (1993), pp
56–64
Goldberg DE, Deb K, Thierens D (1993b) Toward a better understanding of mixing
in genetic algorithms. Journal of the Society of Instrument and Control Engineers
32(1):10–16
Gomory RE (1958) Outline of an algorithm for integer solutions to linear programs.
Bulletin of the American Mathematical Society 64:275–278
Gomory RE (1960) Solving linear programming problems in integers. In: Bellman
R, Hall, Jr M (eds) Combinatorial Analysis, Symposia in Applied Mathematics
X, American Mathematical Society, Providence, pp 211–215
Gomory RE (1963) An algorithm for integer solutions to linear programs. In: Graves
RL, Wolfe P (eds) Recent Advances in Mathematical Programming. McGraw-
Hill, New York, pp 269–302
Gomory RE, Hu TC (1961) Multi-terminal network flows. SIAM Journal on
Applied Mathematics 9:551–570
Gottlieb J (1999) Evolutionary algorithms for constrained optimization problems.
PhD thesis, Technische Universität Clausthal, Institut für Informatik, Clausthal,
Germany
Gottlieb J, Raidl GR (1999) Characterizing locality in decoder-based EAs for the
multidimensional knapsack problem. In: Fonlupt et al (1999), pp 38–52
Gottlieb J, Raidl GR (2000) The effects of locality on the dynamics of decoder-
based evolutionary search. In: Whitley et al (2000), pp 283–290
Gottlieb J, Julstrom BA, Raidl GR, Rothlauf F (2001) Prüfer numbers: A poor rep-
resentation of spanning trees for evolutionary search. In: Spector et al (2001), pp
343–350
Grahl J, Minner S, Rothlauf F (2005) Behaviour of UMDAc with truncation selec-
tion on monotonous functions. In: Corne D, Michalewicz Z, McKay B, Eiben G,
Fogel D, Fonseca C, Greenwood G, Raidl G, Tan KC, Zalzala A (eds) Proceed-
ings of 2005 IEEE Congress on Evolutionary Computation, IEEE Press, Piscat-
away, vol 3, pp 2553–2559
Grahl J, Radtke A, Minner S (2007) Fitness landscape analysis of dynamic multi-
item lot-sizing problems with limited storage. In: Günther HO, Mattfeld DC, Suhl
L (eds) Management logistischer Netzwerke – Entscheidungsunterstützung, In-
formationssysteme und OR-Tools, Physica, Heidelberg, pp 257–277
Gray F (1953) Pulse code communication. U.S. Patent 2632058
Grefenstette JJ (ed) (1985) Proceedings of an International Conference on Genetic
Algorithms and Their Applications, Lawrence Erlbaum, Hillsdale
Grefenstette JJ (ed) (1987) Proceedings of the Second International Conference on
Genetic Algorithms, Lawrence Erlbaum, Hillsdale
Grefenstette JJ, Gopal R, Rosmaita BJ, Van Gucht D (1985) Genetic algorithms for
the traveling salesman problem. In: Grefenstette (1985), pp 160–168
Grötschel M, Jünger M, Reinelt G (1984) A cutting plane algorithm for the linear
ordering problem. Operations Research 32:1195–1220
Grünert T, Irnich S (2005) Optimierung im Transport 1: Grundlagen. Shaker,
Aachen
Gu J (1992) Efficient local search for very large-scale satisfiability problems.
SIGART Bulletin 3(1):8–12
Gu J, Purdom P, Franco J, Wah B (1996) Algorithms for the satisfiability (SAT)
problem: A survey. DIMACS Series in Discrete Mathematics and Theoretical
Computer Science 35:19–151
Hamming R (1980) Coding and Information Theory. Prentice Hall, Englewood
Cliffs
Handa H (2006) Fitness function for finding out robust solutions on time-varying
functions. In: Keijzer M, Cattolico M, Arnold D, Babovic V, Blum C, Bosman P,
Butz MV, Coello Coello C, Dasgupta D, Ficici SG, Foster J, Hernandez-Aguirre
A, Hornby G, Lipson H, McMinn P, Moore J, Raidl G, Rothlauf F, Ryan C,
Thierens D (eds) Proceedings of the Genetic and Evolutionary Computation Con-
ference, GECCO 2006, ACM Press, New York, pp 1195–1200
Hansen N, Ostermeier A (2001) Completely derandomized self-adaptation in evo-
lution strategies. Evolutionary Computation 9(2):159–195
Hansen P, Mladenović N (2001) Variable neighborhood search – principles and ap-
plications. European Journal of Operational Research 130:449–467
Hansen P, Mladenović N (2003) Variable neighborhood search. In: Glover F,
Kochenberger G (eds) Handbook of Metaheuristics, Kluwer, Boston, pp 145–184
Harel D, Rosner R (1992) Algorithmics: The spirit of computing, 2nd edn. Addison-
Wesley, Reading
Harik G (1999) Linkage learning via probabilistic modeling in the ECGA. IlliGAL
Report No. 99010, University of Illinois at Urbana-Champaign, Urbana, IL
Harik GR (1997) Learning gene linkage to efficiently solve problems of bounded
difficulty using genetic algorithms. PhD thesis, University of Michigan, Ann Ar-
bor
Harik GR, Goldberg DE (1996) Learning linkage. In: Belew and Vose (1996), pp
247–262
Harik GR, Lobo FG, Goldberg DE (1998) The compact genetic algorithm. In: Fogel
(1998), pp 523–528
Hartl DL, Clark AG (1997) Principles of population genetics, 3rd edn. Sinauer, Sun-
derland
Hartmanis J, Stearns RE (1965) On the computational complexity of algorithms.
Transactions of the American Mathematical Society 117:285–306
Heckendorn RB, Whitley D, Rana S (1996) Nonlinearity, Walsh coefficients, hyper-
plane ranking and the simple genetic algorithm. In: Belew and Vose (1996), pp
181–201
Heckendorn RB, Rana S, Whitley D (1999) Polynomial time summary statistics for
a generalization of MAXSAT. In: Banzhaf et al (1999), pp 281–288
Hwang FK (1976) On Steiner minimal trees with rectilinear distance. SIAM Journal
on Applied Mathematics 30(1):104–114
Ibaraki T, Nonobe K, Yagiura M (eds) (2005) Metaheuristics: Progress as Real Prob-
lem Solvers. Springer, New York
Ibarra OH, Kim CE (1975) Fast approximation algorithms for the knapsack and sum
of subset problems. Journal of the ACM 22(4):463–468
Igel C (1998) Causality of hierarchical variable length representations. In: Fogel
(1998), pp 324–329
Igel C, Chellapilla K (1999) Investigating the influence of depth and degree of geno-
typic change on fitness in genetic programming. In: Banzhaf et al (1999), pp
1061–1068
Inayoshi H, Manderick B (1994) The weighted graph bi-partitioning problem: A
look at GA performance. In: Davidor et al (1994), pp 617–625
Jansen T, Wegener I (2002) The analysis of evolutionary algorithms – A proof that
crossover really can help. Algorithmica 34(1):47–66
Jansen T, Wegener I (2005) Real royal road functions – where crossover provably
is essential. Discrete Applied Mathematics 149(1–3):111–125, DOI http://dx.doi.
org/10.1016/j.dam.2004.02.019
Johnson DS (1974) Approximation algorithms for combinatorial problems. Journal
of Computer and System Sciences 9(3):256–278
Johnson DS (1990) Local optimization and the traveling salesman problem. In: Pa-
terson M (ed) Automata, Languages and Programming, 17th International Collo-
quium, ICALP90, Warwick University, England, July 16–20, 1990, Proceedings,
Springer, Berlin, LNCS, vol 443, pp 446–461
Johnson DS, Lenstra JK, Rinnooy Kan AHG (1978) The complexity of the network
design problem. Networks 8:279–285
Jones DR, Beltramo MA (1991) Solving partitioning problems with genetic algo-
rithms. In: Belew and Booker (1991), pp 442–449
Jones T (1995a) Evolutionary algorithms, fitness landscapes and search. PhD thesis,
University of New Mexico, Albuquerque, NM
Jones T (1995b) One operator, one landscape. Tech. Rep. 95-02-025, Santa Fe In-
stitute
Jones T, Forrest S (1995) Fitness distance correlation as a measure of problem dif-
ficulty for genetic algorithms. In: Eshelman (1995), pp 184–192
Julstrom BA (1999) Redundant genetic encodings may not be harmful. In: Banzhaf
et al (1999), p 791
Julstrom BA (2001) The blob code: A better string coding of spanning trees for evo-
lutionary search. In: Wu AS (ed) Proceedings of the 2001 Genetic and Evolution-
ary Computation Conference Workshop Program, San Francisco, pp 256–261
Julstrom BA (2002) A scalable genetic algorithm for the rectilinear Steiner problem.
In: Fogel DB, El-Sharkawi MA, Yao X, Greenwood G, Iba H, Marrow P, Shack-
leton M (eds) Proceedings of 2002 IEEE Congress on Evolutionary Computation,
IEEE Press, Piscataway, pp 1169–1173
Maros I, Mitra G (1996) Simplex algorithms. In: Beasley JE (ed) Advances in Linear
and Integer Programming, Oxford Lecture Series in Mathematics and its Appli-
cations, vol 4, Clarendon, Oxford, pp 1–46
Martin O, Otto SW (1996) Combining simulated annealing with local search heuris-
tics. Annals of Operations Research 63:57–75
Martin O, Otto SW, Felten EW (1991) Large-step Markov chains for the traveling
salesman problem. Complex Systems 5:299–326
Mason A (1995) A non-linearity measure of a problem’s crossover suitability. In:
Fogel and Attikiouzel (1995), pp 68–73
Mattfeld DC (1996) Evolutionary Search and the Job Shop: Investigations on Ge-
netic Algorithms for Production Scheduling. Physica, Heidelberg
Mendel G (1866) Versuche über Pflanzen-Hybriden. In: Verhandlungen des natur-
forschenden Vereins, Naturforschender Verein zu Brünn, Brünn, vol 4, pp 3–47
Mendes AS, França PM, Moscato P (2002) Fitness landscapes for the total tardi-
ness single machine scheduling problem. Neural Network World: An Interna-
tional Journal on Neural and Mass-Parallel Computing and Information Systems
12(2):165–180
Merz P, Freisleben B (2000a) Fitness landscape analysis and memetic algorithms
for the quadratic assignment problem. IEEE Transactions on Evolutionary Com-
putation 4(4):337–352
Merz P, Freisleben B (2000b) Fitness landscapes, memetic algorithms, and greedy
operators for graph bipartitioning. Evolutionary Computation 8(1):61–91
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation
of state calculations by fast computing machines. J Chem Phys 21:1087–1092
Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs.
Springer, New York
Michalewicz Z, Fogel DB (2004) How to Solve It: Modern Heuristics, 2nd edn.
Springer, Berlin
Michalewicz Z, Janikow CZ (1989) Handling constraints in genetic algorithms. In:
Schaffer (1989), pp 151–157
Michalewicz Z, Schoenauer M (1996) Evolutionary computation for constrained
parameter optimization problems. Evolutionary Computation 4(1):1–32
Michalewicz Z, Deb K, Schmidt M, Stidsen TJ (1999) Towards understand-
ing constraint-handling methods in evolutionary algorithms. In: Angeline et al
(1999), pp 581–588
Miller JF, Thomson P (2000) Cartesian genetic programming. In: Poli R, Banzhaf
W, Langdon WB, Miller J, Nordin P, Fogarty TC (eds) Genetic Programming:
Third European Conference, Springer, Berlin, LNCS, vol 1802, pp 121–132
Mills P, Tsang EPK (2000) Guided local search for solving SAT and weighted
MAX-SAT problems. Journal of Automated Reasoning 24:205–223
Mitchell M (1996) An introduction to genetic algorithms. MIT Press, Cambridge
Mitchell M, Forrest S, Holland JH (1992) The royal road for genetic algorithms:
Fitness landscapes and GA performance. In: Varela FJ, Bourgine P (eds) Proc.
of the First European Conference on Artificial Life, MIT Press, Cambridge, pp
245–254
Rothlauf F, Goldberg DE, Heinzl A (2002) Network random keys – A tree net-
work representation scheme for genetic and evolutionary algorithms. Evolution-
ary Computation 10(1):75–97
Rowe JE, Whitley LD, Barbulescu L, Watson JP (2004) Properties of Gray and
binary representations. Evolutionary Computation 12(1):46–76
Rudolph G (1996) Convergence properties of evolutionary algorithms. PhD thesis,
Universität Dortmund, Dortmund
Russell S, Norvig P (2002) Artificial Intelligence: A Modern Approach, 2nd edn.
Prentice Hall, Englewood Cliffs
Ryan C (1999) Automatic Re-engineering of Software Using Genetic Programming.
Kluwer Academic, Amsterdam
Ryan C, O’Neill M (1998) Grammatical evolution: A steady state approach. In: Late
Breaking Papers, Genetic Programming 1998, pp 180–185
Salustowicz R, Schmidhuber J (1997) Probabilistic incremental program evolution.
Evolutionary Computation 5(2):123–141
Sastry K, Goldberg DE (2001) Modeling tournament selection with replacement
using apparent added noise. IlliGAL Report No. 2001014, University of Illinois
at Urbana-Champaign, Urbana, IL
Sastry K, Goldberg DE (2002) Analysis of mixing in genetic algorithms: A survey.
Tech. Rep. 2002012, IlliGAL, Department of General Engineering, University of
Illinois at Urbana-Champaign
Schaffer JD (ed) (1989) Proceedings of the Third International Conference on Ge-
netic Algorithms, Morgan Kaufmann, San Mateo
Schneeweiß C (2003) Distributed Decision Making. Springer, Berlin
Schnier T (1998) Evolved representations and their use in computational creativ-
ity. PhD thesis, University of Sydney, Department of Architectural and Design
Science
Schnier T, Yao X (2000) Using multiple representations in evolutionary algorithms.
In: Fonseca C, Kim JH, Smith A (eds) Proceedings of 2000 IEEE Congress on
Evolutionary Computation, IEEE Press, Piscataway, pp 479–486
Schoenauer M, Deb K, Rudolph G, Yao X, Lutton E, Merelo JJ, Schwefel HP (eds)
(2000) Parallel Problem Solving from Nature – PPSN VI, Springer, Berlin
Schraudolph NN, Belew RK (1992) Dynamic parameter encoding for genetic algo-
rithms. Machine Learning 9:9–21
Schumacher C (2000) Fundamental limitations of search. PhD thesis, University of
Tennessee, Department of Computer Science, Knoxville, TN
Schumacher C, Vose MD, Whitley LD (2001) The no free lunch and problem de-
scription length. In: Spector et al (2001), pp 565–570
Schwefel HP (1965) Kybernetische Evolution als Strategie der experimentellen
Forschung in der Strömungstechnik. Master’s thesis, Technische Universität
Berlin
Schwefel HP (1968) Experimentelle Optimierung einer Zweiphasendüse.
Bericht 35, AEG Forschungsinstitut Berlin, Projekt MHD-Staustrahlrohr
Schwefel HP (1975) Evolutionsstrategie und numerische Optimierung. PhD thesis,
Technische Universität Berlin
Smale S (1983) On the average number of steps in the simplex method of linear
programming. Mathematical Programming 27:241–262
Smith AE, Coit DW (1997) Penalty functions. In: Bäck et al (1997), pp C5.2:1–6
Smith SF (1980) A learning system based on genetic adaptive algorithms. PhD the-
sis, University of Pittsburgh
Soto M, Ochoa A, Acid S, de Campos LM (1999) Introducing the polytree approx-
imation of distribution algorithm. In: Rodriguez AAO, Ortiz MRS, Hermida R
(eds) Second Symposium on Artificial Intelligence. Adaptive Systems. CIMAF
99, La Habana, pp 360–367
Soule T (2002) Exons and code growth in genetic programming. In: Foster et al
(2002), pp 142–151
Spector L, Goodman E, Wu A, Langdon WB, Voigt HM, Gen M, Sen S, Dorigo
M, Pezeshk S, Garzon M, Burke E (eds) (2001) Proceedings of the Genetic and
Evolutionary Computation Conference, GECCO 2001, Morgan Kaufmann, San
Francisco
Stadler PF (1992) Correlation in landscapes of combinatorial optimization prob-
lems. Europhys Lett 20:479–482
Stadler PF (1995) Towards a theory of landscapes. In: López-Peña R, Capovilla R,
García-Pelayo R, Waelbroeck H, Zertuche F (eds) Complex Systems and Binary
Networks, Springer, Berlin, Lecture Notes in Physics, vol 461, pp 77–163
Stadler PF (1996) Landscapes and their correlation functions. J Math Chem 20:1–45
Stadler PF, Schnabl W (1992) The landscape of the traveling salesman problem.
Physics Letters A 161:337–344
Steitz W, Rothlauf F (2010) Solving OCST problems with problem-specific guided
local search. In: Pelikan M, Branke J (eds) Proceedings of the Genetic and Evo-
lutionary Computation Conference, GECCO 2010, ACM Press, New York, pp
301–302
Stephens CR, Waelbroeck H (1999) Schemata evolution and building blocks. Evo-
lutionary Computation 7:109–124
Storer RH, Wu SD, Vaccari R (1992) New search spaces for sequencing problems
with application to job shop scheduling. Management Science 38(10):1495–1509
Storer RH, Wu SD, Vaccari R (1995) Problem and heuristic space search strategies
for job shop scheduling. ORSA Journal on Computing 7(4):453–467
Streeter MJ (2003) Two broad classes of functions for which a no free lunch result
does not hold. In: Cantú-Paz (2003), pp 1418–1430
Stützle T (1999) Iterated local search for the quadratic assignment problem. Tech.
Rep. AIDA-99-03, FG Intellektik, FB Informatik, TU Darmstadt
Surry PD (1998) A prescriptive formalism for constructing domain-specific evolu-
tionary algorithms. PhD thesis, University of Edinburgh, Edinburgh
Surry PD, Radcliffe N (1996) Formal algorithms + formal representations = search
strategies. In: Voigt et al (1996), pp 366–375
Suzuki J (1995) A Markov chain analysis on simple genetic algorithms. IEEE Trans-
actions on Systems, Man, and Cybernetics 25(4):655–659
Syswerda G (1989) Uniform crossover in genetic algorithms. In: Schaffer (1989),
pp 2–9
Taha HA (2002) Operations Research: An Introduction, 7th edn. Prentice Hall, En-
glewood Cliffs
Tai KC (1979) The tree-to-tree correction problem. Journal of the ACM 26(3):422–
433
Taillard ÉD (1991) Robust taboo search for the quadratic assignment problem. Par-
allel Computing 17(4–5):443–455
Taillard ÉD (1995) Comparison of iterative searches for the quadratic assignment
problem. Location Science 3:87–105
Taillard ÉD, Gambardella LM, Gendreau M, Potvin JY (2001) Adaptive memory
programming: A unified view of metaheuristics. European Journal of Operational
Research 135(1):1–16
Thierens D (1995) Analysis and design of genetic algorithms. PhD thesis,
Katholieke Universiteit Leuven, Leuven, Belgium
Thierens D (1999) Scalability problems of simple genetic algorithms. Evolutionary
Computation 7(4):331–352
Thierens D, Goldberg DE, Pereira ÂG (1998) Domino convergence, drift, and the
temporal-salience structure of problems. In: Fogel (1998), pp 535–540
Tonge FM (1961) The use of heuristic programming in management science. Man-
agement Science 7:231–237
Turban E, Aronson JE, Liang TP (2004) Decision Support Systems and Intelligent
Systems, 7th edn. Prentice Hall, Englewood Cliffs
Turing A (1936) On computable numbers, with an application to the Entschei-
dungsproblem. Proceedings of the London Mathematical Society 2(42):230–265
Tzschoppe C, Rothlauf F, Pesch HJ (2004) The edge-set encoding revisited: On the
bias of a direct representation for trees. In: Deb et al (2004), pp 1174–1185
van Laarhoven PJM, Aarts EHL (1988) Simulated Annealing: Theory and Applica-
tions. Kluwer, Dordrecht
Vazirani VV (2003) Approximation Algorithms. Springer, Berlin
Voigt HM, Ebeling W, Rechenberg I, Schwefel HP (eds) (1996) Parallel Problem
Solving from Nature – PPSN IV, Springer, Berlin
Vose MD (1993) Modeling simple genetic algorithms. In: Whitley (1993), pp 63–73
Vose MD (1999) The simple genetic algorithm: foundations and theory. MIT Press,
Cambridge
Vose MD, Wright AH (1998a) The simple genetic algorithm and the Walsh trans-
form: Part I, theory. Evolutionary Computation 6(3):253–273
Vose MD, Wright AH (1998b) The simple genetic algorithm and the Walsh trans-
form: Part II, the inverse. Evolutionary Computation 6(3):275–289
Voß S (2001) Capacitated minimum spanning trees. In: Floudas CA, Pardalos PM
(eds) Encyclopedia of Optimization, vol 1, Kluwer, Boston, pp 225–235
Voß S, Martello S, Osman IH, Roucairol C (eds) (1999) Metaheuristics: Advances
and Trends in Local Search Paradigms for Optimization. Kluwer, Boston
Voudouris C (1997) Guided local search for combinatorial optimization problems.
PhD thesis, Dept. of Computer Science, University of Essex, Colchester, UK
Voudouris C, Tsang E (1995) Guided local search. Tech. Rep. CSM-247, University
of Essex, Colchester, UK
Voudouris C, Tsang E (1999) Guided local search. European Journal of Operational
Research 113(2):469–499
Watanabe S (1969) Knowing and guessing – A formal and quantitative study. Wiley,
New York
Weicker K, Weicker N (1998) Locality vs. randomness – dependence of operator
quality on the search state. In: Banzhaf and Reeves (1998), pp 289–308
Weinberger E (1990) Correlated and uncorrelated fitness landscapes and how to tell
the difference. Biological Cybernetics 63:325–336
Weyl H (1935) Elementare Theorie der konvexen Polyeder. Comment Math Helv
7:290–306
Whigham PA, Dick G (2010) Implicitly controlling bloat in genetic programming.
IEEE Transactions on Evolutionary Computation 14(2):173–190, DOI http://dx.
doi.org/10.1109/TEVC.2009.2027314
Whitley LD (1989) The GENITOR algorithm and selection pressure: Why rank-
based allocation of reproductive trials is best. In: Schaffer (1989), pp 116–121
Whitley LD (ed) (1993) Foundations of Genetic Algorithms 2, Morgan Kaufmann,
San Mateo
Whitley LD (1997) Permutations. In: Bäck et al (1997), pp C3.3:114–C3.3:20
Whitley LD (2002) Evaluating evolutionary algorithms. Tutorial Program at Parallel
Problem Solving from Nature (PPSN 2002)
Whitley LD, Rowe J (2005) Gray, binary and real valued encodings: Quad search
and locality proofs. In: Wright AH, Vose MD, De Jong KA, Schmitt LM (eds)
Foundations of Genetic Algorithms 8, LNCS, vol 3469, Springer, Berlin, pp 21–
36
Whitley LD, Rowe J (2008) Focused no free lunch theorems. In: Keijzer M, An-
toniol G, Congdon CB, Deb K, Doerr B, Hansen N, Holmes JH, Hornby GS,
Howard D, Kennedy J, Kumar S, Lobo FG, Miller JF, Moore J, Neumann F, Pe-
likan M, Pollack J, Sastry K, Stanley K, Stoica A, Talbi EG, Wegener I (eds)
Proceedings of the Genetic and Evolutionary Computation Conference, GECCO
2008, ACM Press, New York, pp 811–818
Whitley LD, Rowe JE (2006) Subthreshold-seeking local search. Theor Comput Sci
361(1):2–17, DOI http://dx.doi.org/10.1016/j.tcs.2006.04.008
Whitley LD, Vose MD (eds) (1994) Foundations of Genetic Algorithms 3, Morgan
Kaufmann, San Mateo
Whitley LD, Watson J (2005) Complexity theory and the no free lunch theorem. In:
Burke EK, Kendall G (eds) Search Methodologies, Springer, pp 317–339, DOI
http://dx.doi.org/10.1007/0-387-28356-0_11
Whitley LD, Goldberg DE, Cantú-Paz E, Spector L, Parmee I, Beyer HG (eds)
(2000) Proceedings of the Genetic and Evolutionary Computation Conference,
GECCO 2000, Morgan Kaufmann, San Francisco
Whitley LD, Bush K, Rowe JE (2004) Subthreshold-seeking behavior and robust
local search. In: Deb et al (2004), pp 282–293
Wiest JD (1966) Heuristic programs for decision making. Harvard Business Review
44(5):129–143
List of Abbreviations
BB building block, 42
BNF Backus-Naur form, 176
LB link-biased, 207
LNB link and node biased, 208
LP linear programming, 50
LS local search, 217
NFL no-free-lunch, 97
NIH needle-in-a-haystack, 33
SA simulated annealing, 94
SAT satisfiability problem, 126
SGA simple genetic algorithm, 147