Benchmarking in Optimization - Best Practice and Open Issues
Thomas Weise^12

1 Institute for Data Science, Engineering, and Analytics, TH Köln, Germany
2 Sorbonne Université, CNRS, LIP6, Paris, France
3 Yamasan Science & Education
4 Optimisation and Logistics, School of Computer Science, The University of Adelaide, Adelaide, Australia
5 Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
6 Statistics and Optimization Group, University of Münster, Münster, Germany
7 School of Computer Science and Engineering, University of Málaga, Málaga, Spain
8 Department of Decision Sciences, University of South Africa, Pretoria, South Africa
9 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
10 Department of Automatics, AGH University of Science and Technology, Krakow, Poland
11 modl.ai, Copenhagen, Denmark
12 Institute of Applied Optimization, School of Artificial Intelligence and Big Data, Hefei University, Hefei, China

[email protected]
Abstract
This survey compiles ideas and recommendations from more than a dozen researchers with different
backgrounds and from different institutes around the world. Promoting best practice in benchmarking
is its main goal. The article discusses eight essential topics in benchmarking: clearly stated goals, well-
specified problems, suitable algorithms, adequate performance measures, thoughtful analysis, effective
and efficient designs, comprehensible presentations, and guaranteed reproducibility. The final goal is to
provide well-accepted guidelines (rules) that might be useful for authors and reviewers. As benchmarking
in optimization is an active and evolving field of research, this manuscript is meant to co-evolve over time
by means of periodic updates.
Contents
1 Introduction
3 Problem Instances
   3.1 Desirable Characteristics of a Problem Set
   3.2 Evaluating the Quality of a Problem Set
   3.3 Available Benchmark Sets
   3.4 Open Issues
4 Algorithms
   4.1 Algorithm Families
   4.2 Challenges and Guidelines for the Practitioner
   4.3 Challenges and Open Issues
7 Experimental Design
   7.1 Design of Experiments (DoE)
   7.2 Design Decisions
   7.3 Designs for Benchmark Studies
   7.4 How to Select a Design for Benchmarking
   7.5 Tuning Before Benchmarking
   7.6 Open Issues
8 How to Present Results?
   8.1 General Recommendations
   8.2 Reporting Methodologies
   8.3 Open Issues
Glossary
References
1 Introduction
Introducing a new algorithm without testing it on a set of benchmark functions appears to be very strange to
every optimization practitioner, unless there is a strong theoretical motivation justifying the interest in the
algorithm. Taking theory-focused papers aside, from the very beginning in the 1960s nearly every publication
in Evolutionary Computation (EC) was accompanied by benchmarking studies. One of the key promoters
of the EC research domain, Hans-Paul Schwefel [1975], wrote in his PhD thesis:
The extremely large and constantly increasing number of optimization methods inevitably leads
to the question of the best strategy. There does not seem to be a clear answer. Because, if there
were an optimal optimization process, all other methods would be superfluous . . . 1
Famous studies, e.g., from Moré et al. [1981], were performed in this period and established test functions
that are today well known among algorithm developers. Some of them can still be found in the portfolio
of recent benchmark studies, e.g., Rosenbrock’s function [Rosenbrock, 1960]. In the 1960s, experiments
could be rerun only a very limited number of times, using different starting points or random seeds. This
situation has changed drastically: nowadays, new algorithms can be run a hundred or even a thousand times.
This enables very complex and sophisticated benchmark suites such as those available in the Comparing
Continuous Optimizers (COCO) [Hansen et al., 2016b] platform or in Nevergrad [Rapin and Teytaud, 2018].
However, the questions to be answered by benchmarking remain basically the same, e.g.,
• how well does a certain algorithm perform on a given problem?
• why does an algorithm succeed/fail on a specific test problem?
Specifying the goal of a benchmark study is as important as the study itself, as it shapes the experimental
setup – i.e., the choice of problem instances, of the algorithm instances, the performance criteria, and the
statistics. Typical goals that a user or a researcher wishes to answer through a benchmarking study are
discussed in Section 2.
Not only has computational power increased significantly in the last decades; theory has made important
progress as well. In the 1980s, some researchers claimed that there is an algorithm that is able to outperform
all other algorithms on average [Goldberg, 1989]. A set of no free lunch theorems (NFLTs), presented by
Wolpert and Macready [1997] changed this situation [Adam et al., 2019]. Statements about the performance
of algorithms should be coupled with the problem class or even the problem instances. Brownlee [2007]
summarizes NFLT consequences and gives the following recommendations:
1) bound claims of algorithm or parameter suitability to the problem instances being tested,
2) research into devising problem classes and matching suitable algorithms to classes is a good
thing,
3) be cautious about generalizing performance to other problem instances, and
4) be very cautious about generalizing performance to other problem classes or domains.
Haftka [2016] describes NFLT consequences as follows:
Improving an algorithm for one class of problem is likely to make it perform more poorly for
other problems.
Some authors claim that this statement is too general and should be detailed as follows: improving the
performance of an algorithm, e.g., via parameter tuning, for a subset of problems may make it perform
worse for a different subset. This does not work so well for classes of problems, unless the classes are finite
and small. It also does not work for any two arbitrary subsets, since the subsets may be correlated in
precisely the way that leads to better performance of the algorithm. A number of works discuss limitations
1 German original quote: “Die überaus große und ständig steigende Zahl von Optimierungsmethoden führt zwangsläufig
zu der Frage nach der besten Strategie. Eine eindeutige Antwort scheint es nicht zu geben. Denn, gäbe es ein optimales
Optimierungsverfahren, dann würden sich alle anderen Methoden erübrigen. . . ”
for the consequences and the impact of NFLT, such as García-Martínez et al. [2012] and McDermott [2020].
For example, Culberson [1998] stated: “In the context of search problems, the NFL theorem strictly only
applies if arbitrary search landscapes are considered, while the instances of basically any search problem of
interest have compact descriptions and therefore cannot give rise to arbitrary search landscapes”.
Without doubt, the NFLT has changed the way benchmarking is considered in EC. Problems caused by
the NFLT are still the subject of current research, e.g., Liu et al. [2019] discuss paradoxes in the numerical comparison
of optimization algorithms based on NFLT. Whitley et al. [2002] examine the meaning and significance of
benchmarks in light of theoretical results such as NFLT.
Independently of the ongoing NFLT discussion, benchmarking plays a central role in current research,
both for theory and practice. Three main aspects that need to be addressed in every benchmark study are
the choice of the problem instances, the choice of the algorithms, and the choice of the performance measures.
Excellent papers on how to set up a good benchmark test have existed for many years. Hooker and Johnson
are just two authors who published papers still worth reading today [Hooker, 1994, 1996, Johnson et al.,
1989, 1991, Johnson, 2002b]. McGeoch [1986] can be considered a milestone in the field of experimental
algorithmics, which forms the cornerstone of benchmark studies. Gent and Walsh [1994] stated that the
empirical study of algorithms is a relatively immature field – and we claim that this situation has unfortunately
not changed significantly in the last 25 years. The reasons for this unsatisfactory situation in EC are
manifold. For example, EC has not agreed upon a general methodology for performing benchmark studies like
the fields of statistical Design of Experiments (DOE) or data mining [Chapman et al., 2000, Montgomery,
2017]. These fields provide a general methodology to encourage the practitioner to consider important issues
before performing a study. Some journals provide explicit minimal standard requirements.2
The question remains: why are minimum standards not considered in every paper submitted to EC
conferences and journals? Or, formulated alternatively: why have such best practices not become minimum
required standards? One answer might be: setting up a sound benchmark study is very complicated. There
are many pitfalls, especially stemming from complex statistical considerations [Črepinšek et al., 2014]. So, to
avoid doing anything wrong, practitioners oftentimes report only average values decorated with corresponding standard
deviations, p-values, or boxplots. Another answer might be: practical guidelines are missing. Researchers
from computer science would apply these guidelines if examples were available. This paper is a joint initiative
from several researchers in EC. It presents best-practice examples with references to relevant publications and
discusses open issues. This joint initiative was established during the Dagstuhl seminar 19431 on Theory of
Randomized Optimization Heuristics, which took place in October 2019. Since then, we have been compiling
ideas covering a broad range of disciplines, all connected to EC.
We are aware that every version of this paper represents a snapshot, because the field is evolving. New
theoretical results such as no-free-lunch theorems might emerge from theory, new algorithms (quantum computing,
heuristics supported by deep learning techniques, etc.) appear on the horizon, and new measures, e.g., based
on extensive resampling (Monte Carlo), may be developed in statistics.
We consider this paper as a starting point, as a first trial to support the EC community in improving the
quality of benchmark studies. Surely, this paper cannot cover every single aspect related to benchmarking.
Although this paper mainly focuses on single-objective, unconstrained optimization problems, its findings
can easily be transferred to other domains, e.g., multi-objective or constrained optimization. The objectives
in other problem domains may differ slightly and may require different performance measures – but the
2 See https://www.springer.com/cda/content/document/cda_downloaddocument/Journal+of+Heuristic+Policies+
on+Heuristic+Search.pdf?SGWID=0-0-45-1483502-p35487524 for guidelines of the Journal of Heuristics and
https://static.springer.com/sgw/documents/1593723/application/pdf/Additional_submission_instructions.pdf for
similar ones of the journal Swarm Intelligence.
content of most sections should be applicable. Each of the following sections presents references to best-
practice examples and discusses open topics. The following aspects, which are considered relevant to every
benchmark study, are covered in the subsequent sections:
1. Goals: what are the reasons for performing benchmark studies (Section 2)?
2. Problems: how to choose suitable problem instances (Section 3)?
3. Algorithms: how to select a suitable set of algorithms (Section 4)?
4. Performance: how to measure algorithm performance (Section 5)?
5. Analysis: how to evaluate the experimental results (Section 6)?
6. Design: how to set up a study, e.g., how many runs shall be performed (Section 7)?
7. Presentation: how to describe results (Section 8)?
8. Reproducibility: how to guarantee scientifically sound results and how to guarantee a lasting impact,
e.g., in terms of comparability (Section 9)?
Generalization of benchmarking results. As discussed above in the context of the NFLT, we recom-
mend being very precise in the description of the algorithms and the problem instances that were used in
the benchmark study. Performance extrapolation or generalization always needs to be flagged as such, and
where algorithms are compared to each other, it should be made very clear what the basis for the comparison
is. We suggest to very carefully distinguish between algorithms (e.g., “the” Covariance Matrix Adaptation
Evolution Strategy (CMA-ES) [Hansen, 2000]) and algorithm instances (e.g., the pycma-es [Hansen et al.,
2020] with population size 8, budget 100, restart strategy X, etc.).3 A similar rule applies to the prob-
lems (e.g., “the” sphere function) vs. a concrete problem instance (the five-dimensional sphere function
$f: \mathbb{R}^5 \to \mathbb{R},\ x \mapsto \alpha \sum_{i=1}^{5} x_i^2 + \beta$, centered at the origin, with multiplicative scaling $\alpha$ and additive shift $\beta$). Go-
ing one step further, one may even argue that we only benchmark a certain implementation of an algorithm
instance, which is subject to a concrete choice of implementation language, compiler and operating system
optimizations, and concrete versions of software libraries.
Algorithm and problem instances may be (and in the context of this survey often are) randomized, so
that the performance of the algorithm instance on a given problem instance is a series of (typically highly
correlated) random variables, one for each step of the algorithm. In practice, replicability is often achieved
by fixing the random number generator and storing the random seed, which plays an important role in
guaranteeing reproducibility as discussed in Sec. 9.
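To make these distinctions concrete, the following minimal Python sketch (all names are hypothetical and chosen for illustration only) defines one concrete problem instance of the sphere function, one algorithm instance of pure random search, and runs it once per stored seed so that the randomized experiment can be replicated later.

```python
import random

def make_sphere_instance(dim=5, alpha=1.0, beta=0.0):
    """A concrete problem instance: f(x) = alpha * sum(x_i^2) + beta."""
    def f(x):
        return alpha * sum(xi * xi for xi in x) + beta
    return f

def random_search(f, dim, budget, rng):
    """A minimal algorithm instance: pure random search in [-5, 5]^dim."""
    best = float("inf")
    for _ in range(budget):
        x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        best = min(best, f(x))
    return best

# One independent run per stored seed: keeping the seed list makes the
# randomized experiment replicable, cf. Section 9.
problem = make_sphere_instance(dim=5, alpha=2.0, beta=0.5)
seeds = [1000 + i for i in range(10)]
results = [random_search(problem, dim=5, budget=1000, rng=random.Random(s))
           for s in seeds]
print(results)
```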
Figure 1: Summary of common goals of benchmark studies.
and the problem instance characteristics. However, benchmarking also plays an important role as interme-
diary between the scientific community and users of optimization heuristics and as intermediary between
theoretically and empirically-guided streams within the research community.
(G1.2) Comparing Algorithms. The comparison of algorithms serves, most notably, the purpose of understanding strengths and weaknesses of different algorithmic
approaches for different types of problems or problem instances during the different stages of the
optimization process. These insights can be leveraged to design or to select, for a given problem class
or instance, a most suitable algorithm instance.
(G1.3) Competition.
One particular motivation to compare algorithms is to determine a “winner”, i.e., an algorithm
that performs better than any of its competitors, for a given performance measure and on a given
set of problem instances. Benchmarking is of great value in selecting the most adequate algorithm
especially in real-world optimization settings [Beiranvand et al., 2017]. The role of competitive
studies for benchmarking is discussed quite controversially [Hooker, 1996], as competitive studies
may overstate the importance of the problems that the algorithms are tested upon and may thereby
promote over-fitting. At the same time, however, one cannot neglect that competitions
can provide an important incentive to contribute to the development of new algorithmic ideas and
better selection guidelines.
(G1.4) Assessment of the Optimization Problem. In many real-world problems like scheduling, container
packing, chemical plant control, or protein folding, the global optimum is unknown, while in other
problems it is necessary to deal with limited knowledge, or lack of explicit formulas. In those
situations, computer simulations or even physical experiments are required to evaluate the quality of
a given solution candidate. In addition, even if a problem is explicitly modelled by a mathematical
formula, it can nevertheless be difficult to grasp its structure or to derive a good intuition for what its
fitness landscape looks like. Similarly, when problems consist of several instances, it can be difficult
to understand in what respect these different instances are alike and in which aspects they differ.
Benchmarking simple optimization heuristics can help to analyze and to visualize the optimization
problem and to gain knowledge about its characteristics.
(G1.5) Illustrating Algorithms’ Search Behavior.
How an optimization heuristic operates on a problem can be difficult to grasp when looking only
at the algorithm and problem descriptions. One of the most basic things that benchmarking
has to offer is numerical and graphical illustration of the optimization process. With
these numbers and visualizations, a first idea about the optimization process can be derived. This
also includes an assessment of the stochasticity when considering several runs of a randomized algo-
rithm or an algorithm operating upon a stochastic problem. In the same vein, benchmarking offers a
hands-on way of visualizing effects that are difficult to grasp from mathematical descriptions. That
is, where mathematical expressions are not easily accessible to everyone, benchmarking can be used
to illustrate the effects that the mathematical expressions describe.
(G2.2) Algorithm Tuning.
Most optimization heuristics are configurable, i.e, we are able to adjust their search behavior (and,
hence, performance) by modifying their parameters. Typical parameters of algorithms are the num-
ber of individuals kept in the memory (its ‘population size’), the number of individuals that are
evaluated in each iteration, parameters determining the distribution from which new samples are
generated (e.g., the mean, variance, and direction of the search), the selection of survivors for the
next generation’s population, and the stopping criterion. Optimization heuristics applied in practice
often comprise tens of parameters that need to be tuned.
Finding the optimal configuration of an algorithm for a given problem instance is referred to as offline
parameter tuning [Eiben and Jelasity, 2002, Eiben and Smith, 2015]. Tuning can be done manually
or with the help of automated configuration tools [Akiba et al., 2019, Bergstra et al., 2013, Olson and
Moore, 2016]. Benchmarking is a core ingredient of the parameter tuning process. A proper design
of experiments is an essential requirement for tuning studies [Bartz-Beielstein, 2006, Orzechowski
et al., 2018, 2020]. Parameter tuning is a necessary step before comparing a viable configuration of
a method with others, as it allows us to disregard those parameter combinations that do not yield
promising results.
Benchmarking can help to shed light on suitable choices of parameters and algorithmic modules.
Selecting a proper parameterization for a given optimization problem is a tedious task [Fialho et al.,
2010]. Besides the selection of the algorithm and the problem instance, tuning requires the specifi-
cation of a performance measure, e.g., best solution found after a pre-specified number of function
evaluations (to be discussed in Sec. 5) and a statistic, i.e., number of repeats, which will be discussed
in Sec. 7.
Another important concern with respect to algorithm tuning is the robustness of the performance
with respect to these parameters, i.e., how much does the performance deteriorate if the parameters
are mildly changed? In this respect, parameter recommendations with a better robustness might be
preferable over less robust ones, even if compromising on performance [Paenke et al., 2006].
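As a rough sketch of how benchmarking enters the tuning loop, the following Python snippet samples configurations at random and selects the one with the best mean performance over repeated runs. It is meant only as an illustration of the principle, not as a substitute for dedicated tuners such as irace or SMAC, and the callable `algorithm(problem, conf, rng)` is a hypothetical interface.

```python
import random
import statistics

def tune(algorithm, problem, param_space, n_configs=20, n_reps=5, seed=42):
    """Offline tuning sketch: sample configurations at random, benchmark each
    with repeated runs, and return the configuration with the best mean result."""
    rng = random.Random(seed)
    best_conf, best_score = None, float("inf")
    for _ in range(n_configs):
        conf = {name: rng.choice(values) for name, values in param_space.items()}
        scores = [algorithm(problem, conf, rng=random.Random(rng.randrange(2**31)))
                  for _ in range(n_reps)]
        score = statistics.mean(scores)   # minimization: smaller is better
        if score < best_score:
            best_conf, best_score = conf, score
    return best_conf, best_score
```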
(G2.3) Understanding the Influence of Parameters and Algorithmic Components.
While algorithm tuning focuses on finding the best configuration for a given problem, understanding
refers to the question: why does one algorithm perform better than a competing one? Understanding
requires additional statistical tools, e.g., analysis of variance or regression techniques. Questions
such as “Does recombination have a significant effect on the performance?” are considered in this
approach. Several tools that combine methods from statistics and visualization are integrated in
the software package Sequential Parameter Optimization Toolbox (SPOT), which was designed for
understanding the behavior of optimization algorithms. SPOT provides a set of tools for model
based optimization and tuning of algorithms. It includes surrogate models, optimizers and DOE
approaches [Bartz-Beielstein et al., 2017].
(G2.4) Characterizing Algorithms’ Performance by Problem (Instance) Features and Vice Versa.
Whereas understanding as discussed in the previous paragraph tries to get a deep insight into the
elements and working principles of algorithms, characterization refers to the relationship between
algorithms and problems. That is, the goal is to link features of the problem with the performance
of the algorithm(s). A classical example for a question answered by the characterization approach is
how the performance of an algorithm scales with the number of decision variables.
Problem instance features can be high-level features such as its dimensionality, its search constraints,
its search space structure, and other basic properties of the problem. Low-level features of the
problem, such as its multi-modality, its separability, or its ruggedness can either be derived from the
problem formulation or via an exploratory sampling approach [Kerschke and Trautmann, 2019a,b,
Malan and Engelbrecht, 2013, Mersmann et al., 2010, 2011, Muñoz Acosta et al., 2015a,b].
2.3 Benchmarking as Training: Performance Extrapolation
(G3.1) Performance Regression.
Probably the most classical hope associated with benchmarking is that the generated data can be
used to extrapolate the performance of an algorithm for other, not yet tested problem instances.
This extrapolation is highly relevant for selecting which algorithm to choose and how to configure it,
as we shall discuss in the next section. Performance extrapolation requires a good understanding of
how the performance depends on problem characteristics, the goal described in G2.4.
In the context of machine learning, performance extrapolation is also referred to as transfer learn-
ing [Pan and Yang, 2010]. It can be done manually or via sophisticated regression techniques.
Regardless of the methodology used to extrapolate performance data, an important aspect in this
regression task is a proper selection of the instances on which the algorithms/configurations are
tested. For performance extrapolation based on supervised learning approaches, a suitable selection
of feature extraction methods is another crucial requirement for a good fit between extrapolated and
true performance.
(G3.2) Automated Algorithm Design, Selection, and Configuration.
When the dependency of algorithms’ performance with respect to relevant problem characteristics is
known and performance can be reasonably well extrapolated to previously unseen problem instances,
the benchmarking results can be used for designing, selecting, or configuring an algorithm for the
problem at hand. That is, the goal of the benchmark study is to provide training data from which
rules can be derived that help the user choose the best algorithm for her optimization task. These
guidelines can be human-interpretable such as proposed in Bartz-Beielstein [2006], Liu et al. [2020] or
they can be implicitly derived by AutoML techniques [Hutter et al., 2019, Kerschke and Trautmann,
2019a, Kerschke et al., 2019, Olson and Moore, 2016].
2.5 Benchmarking in Algorithm Development
(G5.1) Source Code Validation.
Another important aspect of benchmarking is that it can be used to verify that a given program
performs as it is expected to. To this end, algorithms can be assessed on problem instances with
known properties. If the algorithm consistently does not behave as expected, a source code review
might be necessary.
(G5.2) Algorithm Development.
In addition to understanding performances, benchmarking is also used to identify weak spots with
the goal to develop better performing algorithms. This also includes initial empirical comparisons of
new ideas to gain first insights into whether or not they are worth investigating further. This can result
in a loop of empirical and theoretical analysis. A good example for this is parameter control: it
has been observed early on that a dynamic choice of algorithms’ parameters can be beneficial over
static ones [Karafotias et al., 2015]. This led to the above mentioned loop of evaluating parameters
empirically and stimulated theoretical investigations.
3 Problem Instances
A critical element of algorithm benchmarking is the choice of problem instances, because it can heavily
influence the results of the benchmarking. Assuming that we (ultimately) aim at solving real-world problems,
ideally, the problem set should be representative of the real-world scenario under investigation; otherwise, it is
not possible to derive general conclusions from the results of the benchmarking. In addition, it is important
that problem sets are continually updated to prevent the over-tuning of algorithms to particular problem
sets.
This section discusses various aspects related to problem sets used in benchmarking. The four questions
we address are:
1. What are the desirable properties of a good problem set?
2. How can the quality of a problem set be evaluated?
3. What benchmark problem sets are publicly available?
4. What are the open problems in research related to problem sets for benchmarking?
(B1.1) Diverse.
A good benchmark suite should contain problems with a range of difficulties [Olson et al., 2017].
However, what is difficult for one algorithm could be easy for another algorithm and for that reason,
it is desirable for the suite to contain a wide variety of problems with different characteristics. In
this way, a good problem suite can be used to highlight the strengths and weaknesses of different
algorithms. Competition benchmark problems are frequently distinguished based on a few simple
characteristics such as modality and separability, but there are many other properties that can affect
the difficulty of problems for search [Kerschke and Trautmann, 2019b, Malan and Engelbrecht, 2013,
Muñoz Acosta et al., 2015b] and the instances in a problem suite should collectively capture a wide
range of characteristics.
(B1.2) Representative.
At the end of a benchmarking exercise, claims are usually made regarding algorithm performance.
The more representative the benchmarking suite is of the class of problems under investigation, the
stronger the claim about algorithm performance will hold. The problem instances should therefore
include the difficulties that are typical of real world instances of the problem class under investigation.
(B1.3) Scalable and tunable.
Ideally a benchmark set/framework includes the ability to tune the characteristics of the problem
instances. For example, it could be useful to be able to set the dimension of the problem, the level
of dependence between variables, the number of objectives, and so on.
(B1.4) Known solutions / best performance.
If the optimal solution(s) of a benchmark problem are known, then it is easier to measure the
exact performance of algorithms in relation to the known optimal performance. There are, however,
simple problems for which optimal solutions are not known even for relatively small dimensions (e.g.
the Low Auto-correlation Binary Sequence (LABS) problem [Packebusch and Mertens, 2016]). In
these cases it is desirable to have the best known performance published for particular instances.
(B2.1) Feature space. One of the ways of assessing the diversity of a set of problem instances is to consider
how well the instances cover a range of different problem characteristics. When these characteristics
are measurable in some way, then we can talk about the instances covering a wide range of feature
values. Garden and Engelbrecht [2014] use a self-organizing feature map to cluster and analyse
the Black-Box-Optimization-Benchmarking (BBOB) and Congress on Evolutionary Computation
(CEC) problem sets based on fitness landscape features (such as ruggedness and the presence of
multiple funnels). In a similar vein, Škvorc et al. [2020] use Exploratory Landscape Analysis (ELA)
features [Mersmann et al., 2011] combined with clustering and a t-distributed stochastic neighbor
embedding visualization approach to analyse the distribution of problem instances across feature
space.
(B2.2) Performance space.
Simple statistics such as mean and best performance aggregate much information without always
enabling the discrimination of two or more algorithms. For example, two algorithms can be very
similar (and thus perform comparably) or they might be structurally very different but the aggregated
scores might still be comparable. From the area of algorithm portfolios, we can employ ranking-based
concepts such as the marginal contribution of an individual algorithm to the total portfolio, as well
as the Shapley values, which consider all possible portfolio configurations [Fréchette et al., 2016].
Still, for the purpose of benchmarking and better understanding of the effect of design decisions on
an algorithm’s performance, it might be desirable to focus more on instances that enable the user to
tell the algorithms apart in the performance space.
This is where the targeted creation of instances comes into play. Among the first articles that evolved
small Traveling Salesperson Problem (TSP) instances that are difficult or easy for a single algorithm
is that by Mersmann et al. [2013], which was then followed by a number of articles also in the
continuous domain as well as for constrained problems. Recently, this was extended to the explicit
discrimination of pairs of algorithms for larger TSP instances [Bossek et al., 2019], which required
more disruptive mutation operators.
competitions5, travelling salesperson problem library6, and the mixed integer programming library of
problems7.
In contrast to these instance-driven sets are the more abstract models that define variable interac-
tions at the lowest level (i.e., independent of a particular problem) and then construct an instance
based on fundamental characteristics. Noteworthy examples here are (for binary representations)
the NK landscapes [Kauffman, 1993] (which has the idea of tunable ruggedness at its core), the
W-Model [Weise and Wu, 2018] (with configurable features like length, neutrality, epistasis, multi-
objectivity, objective values, and ruggedness), and the Pseudo-Boolean Optimization (PBO) suite of
23 binary benchmark functions by Doerr et al. [2020], which covers a wide range of landscape features
and which extends the W-model in various ways (in particular, superposing its transformations to
other base problems).
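To illustrate how such abstract models construct instances from fundamental characteristics, the sketch below generates a simplified NK-landscape instance in which the parameter k tunes the degree of epistasis (and thus ruggedness). This is an illustrative toy generator under our own simplifying assumptions, not a reference implementation of the cited suites.

```python
import random

def make_nk_landscape(n, k, seed=0):
    """Simplified NK-landscape generator: each of the n bits contributes a value
    that depends on itself and k other, randomly chosen bits, so larger k
    yields a more rugged (more epistatic) landscape."""
    rng = random.Random(seed)
    # for each bit: the indices it interacts with (itself plus k others)
    links = [[i] + rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    # one random lookup table of size 2^(k+1) per bit
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

    def fitness(bits):
        total = 0.0
        for i in range(n):
            idx = 0
            for j in links[i]:
                idx = (idx << 1) | bits[j]
            total += tables[i][idx]
        return total / n   # to be maximized
    return fitness

f = make_nk_landscape(n=20, k=3, seed=1)
print(f([random.randint(0, 1) for _ in range(20)]))
```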
(B3.7) Expensive optimization problems.
The GECCO 2020 Industrial Challenge provides a suite of discrete-valued electrostatic precipitator
problems with expensive simulation-based evaluation13 . An alternative approach to benchmarking
expensive optimization (used by CEC competitions) is to limit the number of allowed function eval-
uations for solving existing benchmark problems.
First, the number of real-world benchmarks seems to be orders of magnitude smaller than the actual
number of real-world optimisation problems that are tackled on a daily basis—this is especially true for
continuous optimization. When there are some proper real-world problems available (e.g. data sets for
combinatorial problems, or the CEC problems mentioned), they are often single-shot optimizations, i.e.,
only a single run can be conducted, which then makes it difficult to retrieve generalizable results. Having
said this, a recent push towards a collection and characterization has been made with a survey24 by the
Many Criteria Optimization and Decision Analysis (MACODA) working group.
Second, the availability of diverse instances and of source code (of fitness functions, problem generators,
but also of algorithms) leaves much to be desired. Ideal are large collections of instances, their features,
algorithms, and their performance—the Algorithm Selection Library (ASlib)25 [Bischl et al., 2016] has such
data, although for a different purpose. As a side effect, these (ideally growing) repositories offer a means
against the reinvention of the wheel and the benchmarking against so-called “well-established” algorithms
that are cited many times—but maybe just cited many times because they can be beaten easily.
Third, and this is more of an educational opportunity: we as the community need to make sure that we
watch our claims when benchmarking. This means that we should not only make claims like “my approach is
better than your approach”, but also investigate what we can learn about the problem and about
the algorithms (see, e.g., the discussion in [Agrawal et al., 2020] in the context of data mining), so that we
can again inform the creation of new instances. To paraphrase: we need to clarify which conclusions
we can actually attempt to draw, given that the performance comparison is always “with respect to the given
benchmark suite”.
Fourth, it is an advantage of test problem suites that they can provide an objective means of comparing
systems. However, there are also problems related to test problem suites: Whitley et al. [2002] discuss the
potential disadvantage that systems can become overfitted to work well on benchmarks and therefore that
good performance on benchmarks does not generalize to real-world problems. Fischbach and Bartz-Beielstein
[2020] list and discuss several drawbacks of these test suites, namely: (i) problem instances are somehow
artificial and have no direct link to real-world settings; (ii) since there is a fixed number of test instances,
algorithms can be fitted or tuned to this specific and very limited set of test functions; (iii) statistical tools
for comparisons of several algorithms on several test problem instances are relatively complex and not easy
to analyze.
Last, while for almost all benchmark problems and for a wide range of real-world problems the fitness of a
solution is deterministic, there are also many problems out there where the fitness evaluations are conducted
under noise. Hence, the adequate handling of noise can be critical so as to allow algorithms to explore
and exploit the search space in a robust manner. Branke et al. [2001] discuss strategies for coping with
noise, and Jin and Branke [2005] present a good survey. While noise (in computational experiments) is often
drawn from relatively simple distributions, real-world noise can be non-normal, time-varying, and even be
dependent on system states. To validate experimental outcomes from such noisy environments, mechanisms
that go well beyond “perform n repetitions” are needed, and Bokhari et al. [2020] compare five such approaches.
4 Algorithms
To understand strengths and weaknesses of different algorithmic ideas, it is important to select a suitable
set of algorithms that is to be tested within the benchmark study. While the algorithm portfolio is certainly
one of the most subjective choices in a benchmarking study, there are nevertheless a few design principles
to respect. In this section we summarize the most relevant of these guidelines.
• one-shot optimization algorithms (e.g., pure random search, Latin Hypercube Design (LHD) [McKay et al., 2000], or quasi-random point constructions),
• greedy local search algorithms (e.g., randomized local search, Broyden-Fletcher-Goldfarb-Shanno
(BFGS) algorithm [Shanno, 1970], conjugate gradients [Fletcher, 1976], and Nelder-Mead [Nelder and
Mead, 1965])
• non-greedy local search algorithms (e.g., Simulated Annealing (SANN) [Kirkpatrick et al., 1983],
Threshold Accepting [Dueck and Scheuer, 1990], and Tabu Search [Glover, 1989])
• single-point global search algorithms (e.g., (1 + λ) Evolution Strategies [Eiben and Smith, 2015] and
Variable Neighborhood Search [Mladenović and Hansen, 1997])
• population-based algorithms (e.g., Particle Swarm Optimization (PSO) [Kennedy and Eberhart, 1995,
Shi and Eberhart, 1998], ant colony optimization [Dorigo et al., 2006, Socha and Dorigo, 2008], most
EAs [Bäck et al., 1997, Eiben and Smith, 2015], and Estimation of Distribution Algorithms (EDAl-
gos) [Larrañaga and Lozano, 2002, Mühlenbein and Paaß, 1996] such as the CMA-ES [Hansen et al.,
2003])
• surrogate-based algorithms (e.g., Efficient Global Optimization (EGO) algorithm [Jones et al., 1998]
and other Bayesian optimization algorithms)
Note that the “classification” above is by no means exhaustive, nor stringent. In fact, classification schemes
for optimization heuristics always tend to be fuzzy, as hybridization between one or more algorithmic ideas
or components is not unusual, rendering the attribution of algorithms to the different categories subjective;
see [Birattari et al., 2003, Boussaïd et al., 2013, Stork et al., 2020] for examples.
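For illustration, the following sketch implements two of the simplest local search baselines from the families above, randomized local search and the (1+1) EA, for pseudo-Boolean maximization. Such cheap baselines are often worth including in a portfolio as sanity checks (the function names and interface are our own, hypothetical choices).

```python
import random

def rls(fitness, n, budget, rng):
    """Randomized Local Search: flip exactly one uniformly chosen bit per step,
    accept the offspring if it is not worse (maximization)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = fitness(x)
    for _ in range(budget - 1):
        y = list(x)
        i = rng.randrange(n)
        y[i] = 1 - y[i]
        fy = fitness(y)
        if fy >= fx:
            x, fx = y, fy
    return fx

def one_plus_one_ea(fitness, n, budget, rng):
    """(1+1) EA: flip each bit independently with probability 1/n,
    accept the offspring if it is not worse (maximization)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = fitness(x)
    for _ in range(budget - 1):
        y = [1 - b if rng.random() < 1.0 / n else b for b in x]
        fy = fitness(y)
        if fy >= fx:
            x, fx = y, fy
    return fx
```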
(C4.3) State of the art
Results of a benchmark study can easily be biased if only outdated algorithms are added. We clearly
recommend familiarizing oneself with the state-of-the-art algorithms for the given problem type,
where state of the art may relate to performance on a given problem class or the algorithm family
itself. Preliminary experiments may lead to a proper pre-selection (i.e., exclusion) of algorithms.
The practitioner should be certain to compare against the best algorithms. Consequently, always
compare to the most current versions and implementations of algorithms. This also applies to the
programming platform and its versions. For algorithm implementations on programming platforms
and operating systems the practitioner is not familiar with, there nowadays exist methods and
technologies to resolve this inconvenience, e.g., containers (like Docker) or virtualization. For details about
considerations regarding the experimental design, e.g., the number of considered algorithms, the
number of repetitions, the number of problem instances, the number of different parameter settings,
or sequential designs if the state of the art is unknown, please see Section 7.
(C4.4) Hyperparameter handling
All discussed families of algorithms require one or several control parameters. To enable a fair
comparison of their performances and to judge their efficiency, it is crucial to avoid bad parameter
configurations and to properly tune the algorithms under consideration [Beiranvand et al., 2017, Eiben
and Smit, 2011]. Even a parameter configuration that works well for a certain setup, i.e., a fixed budget,
may perform considerably worse for a significantly different budget. As mentioned in Section 2 under
goal (G2.2), the robustness of algorithms with respect to their hyperparameters can be an important
characteristic for users, in which case this question should be integrated into (or even be the subject
of) the benchmarking study. Furthermore, the practitioner should be certain that the algorithm
implementation is properly using the parameter setup. It may occur that some implementations do
not warn the user if a parameter setting is out of bounds.
Several tools developed for automatic parameter configuration are available, e.g., iterated rac-
ing (irace) [López-Ibáñez et al., 2016], Iterated Local Search in Parameter Configuration Space
(ParamILS) [Hutter et al., 2009], SPOT [Bartz-Beielstein et al., 2005], Sequential Model-based Algo-
rithm Configuration (SMAC) [Hutter et al., 2011], GGA [Ansótegui et al., 2015], and hyperband [Li
et al., 2017] to name a few. As manual tuning can be biased, especially for algorithms unknown to
the experimenter, automated tuning is state of the art and highly recommended. The large amount of
research in the field of automated algorithm configuration and hyperparameter optimization has given
rise to several related benchmarking platforms, like the algorithm configuration library
(ACLib) [Hutter et al., 2014] or the hyperparameter optimization library (HPOlib) [Eggensperger
et al., 2013], which deal particularly with this topic.
(C4.5) Initialization
A good benchmarking study should ensure that the achieved results do not happen by chance. In
particular, it is important to ensure that algorithm performances are not rated erroneously due to the (e.g.,
random) initialization of the algorithm. Depending on the given problem instance, a randomly seeded
starting point can be beneficial for an algorithm if it is placed near or at one or more local optima.
Consequently, the practitioner should be aware that the performance of algorithms can be biased
by the initialization of algorithms with respect to, e.g., their random seeds, the starting points, the
sampling strategy, combined with the difficulty of the chosen problem instance.
We recommend letting all candidate algorithms use the same starting points, especially when the
goal of the benchmarking study is to compare algorithms (goals G1.2 and G1.3) or to analyze the algorithms’
search behavior (G1.1). This recommendation also extends to the comparison with historical data.
Additionally, the design of experiment (see Section 7) can reflect the considerations by properly
handling the number of problem instances, repetitions, sampling strategies (in terms of the algo-
rithm parametrization), and random seeds. For (random) seed handling and further reproducibility
handling, we refer to Section 9.
(C4.6) Performance assessment
Not all algorithms support the configuration of the same stopping criteria, which may influence the
search [Beiranvand et al., 2017] and which has to be taken into account in the interpretation of the
results. For example, implementations of algorithms may not respect the given number of objective
function evaluations. If not detected by the practitioner, this can largely bias the evaluation of the
benchmark.
Figure 2: Visualization of a fixed-budget perspective (vertical, green line) and a fixed-target perspective (horizontal, orange line), inspired by Figure 4 in [Hansen et al., 2012]. Dashed lines show three exemplary performance trajectories (axes: budget vs. performance).
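Both perspectives can be read directly off a run's performance trajectory. The following minimal sketch (with hypothetical data) extracts a fixed-budget value and a fixed-target runtime from a list of (evaluations, best-so-far quality) pairs, assuming minimization.

```python
def fixed_budget_value(trajectory, budget):
    """Fixed-budget view: best quality reached with at most `budget` evaluations.
    `trajectory` is a list of (evaluations, best_so_far) pairs, minimization."""
    vals = [q for (evals, q) in trajectory if evals <= budget]
    return min(vals) if vals else float("inf")

def fixed_target_runtime(trajectory, target):
    """Fixed-target view: number of evaluations needed to reach quality <= target,
    or None if the target was never reached within the run."""
    for evals, q in trajectory:
        if q <= target:
            return evals
    return None

trajectory = [(1, 12.0), (10, 5.3), (50, 1.2), (200, 0.4), (1000, 0.05)]
print(fixed_budget_value(trajectory, budget=100))    # -> 1.2
print(fixed_target_runtime(trajectory, target=0.5))  # -> 200
```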
Satisfiability (SAT) problems [Xu et al., 2008]. CPU time is the default. However, as it is highly sensitive
to a variety of external factors – such as hardware, programming language, or workload of the processors –
results from experiments that relied on CPU time are much less replicable and thus hardly comparable.
In an attempt of mitigating this issue, Johnson and McGeoch [2002] proposed a normalized time, which
is computed by dividing the runtime of an algorithm by the time a standardized implementation of a
standardized algorithm requires for the same (or at least a comparable) problem instance.
An alternative way of measuring time are Function Evaluations (FEs), i.e., the number of fully evaluated
candidate solutions. In fact, in case of sampling-based optimization, like in classical continuous optimization,
this machine-independent metric is the most common way of measuring algorithmic performances [Hansen
et al., 2016a]. Yet, from the perspective of actual clock time, they risk giving a wrong impression as the FEs
of different algorithms might be of different time complexity [Weise et al., 2014]. In such cases, counting
algorithm steps in a domain-specific method – e.g., the number of distance evaluations on the TSP [Weise
et al., 2014] or bit flips on the Maximum Satisfiability (MAX-SAT) problem [Hains et al., 2013] – may be
useful. Nevertheless, within the EC community, counting FEs is clearly the most commonly accepted way
to measure the effort spent by an algorithm to solve a given problem instance.
From a practical point of view, both options have their merits. If the budget is given by means of clock
time – e.g., models have to be trained until the next morning, or they need to be adjusted within seconds
in case of predictions at the stock market – then results relying on CPU time are more meaningful. On the
other hand, in case single FEs are expensive – e.g., in case of physical experiments or cost-intensive numerical
simulations – the number of required FEs is a good proxy for clock time, and a more universal measure, as
discussed above. To satisfy all perspectives, best practice would be to report both: FEs and the CPU time.
Moreover, in situations in which single FEs are expensive, CPU time should ideally be separated into the
part used by the expensive FE, and the part used by the algorithm at each iteration.
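One simple way to follow this recommendation is to wrap the objective function so that FEs are counted and the CPU time spent inside the FEs is recorded separately from the optimizer's own overhead. The sketch below is one possible, hypothetical realization of such a wrapper.

```python
import time

class CountingObjective:
    """Wraps an (expensive) objective function: counts function evaluations (FEs)
    and accumulates the CPU time spent inside the FEs, so that the time used by
    the algorithm itself can be reported separately."""
    def __init__(self, func):
        self.func = func
        self.n_evaluations = 0
        self.time_in_evaluations = 0.0

    def __call__(self, x):
        start = time.process_time()
        value = self.func(x)
        self.time_in_evaluations += time.process_time() - start
        self.n_evaluations += 1
        return value

# usage sketch: wrapped = CountingObjective(expensive_simulation)
# run the optimizer on `wrapped`, then report wrapped.n_evaluations,
# wrapped.time_in_evaluations, and total_time - wrapped.time_in_evaluations.
```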
In surrogate-based optimization, algorithms commonly slow down over time due to an ever-increasing
complexity of their surrogate models [Bliek et al., 2020, Ueno et al., 2016]. In this case, it is even possible
that the algorithm becomes more expensive than the expensive FE itself. By measuring the CPU time used
by the algorithm separately from the CPU time used by the FEs, it can be verified that the number of FEs
is indeed the limiting factor. At the same time, this reveals more information about the benchmark, namely
how expensive it is exactly, and whether all FEs have the same cost. However, if the FEs are expensive not
because of the computation time but due to some other costs (deteriorating biological samples, the use of
expensive equipment, human interaction, etc.), then just measuring the number of FEs could be sufficient.
Notably, many papers in the EC community also use generations as machine-independent time
measures. However, it might not be a good idea to report only generations, because the exact relationship
between FEs and generations is not always clear. This makes results hard to compare with, e.g., local search
algorithms, so if generations are reported, FEs should be reported as well.
Constraint optimization In constraint optimization, a solution is either feasible or not, which is
decided based on a set of constraints. Here, the absolute violations of each constraint can be summed up as
a performance metric [Hellwig and Beyer, 2019, Kumar et al., 2020].
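For constraints written in the standard form g(x) ≤ 0, such a metric can be computed as in the following small sketch (the constraint functions here are made up for illustration).

```python
def total_violation(x, constraints):
    """Sum of absolute constraint violations for constraints of the
    form g(x) <= 0; a value of 0 means x is feasible."""
    return sum(max(0.0, g(x)) for g in constraints)

constraints = [lambda x: x[0] + x[1] - 1.0,   # x0 + x1 <= 1
               lambda x: -x[0]]               # x0 >= 0
print(total_violation([0.8, 0.5], constraints))  # -> 0.3
```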
Location: Measures of Central Behaviors In case of a fixed budget (i.e., vertical cut) approach,
solution qualities are usually aggregated using the arithmetic mean. But of course, other location measures
like the median or the geometric mean [Fleming and Wallace, 1986] can be useful alternatives when interested
in robust metrics or when aggregating normalized benchmark results, respectively.
In scenarios in which the primary goal is to achieve a desired target quality (horizontal cut), it might
be necessary to aggregate successful and failed runs. In this case, two to three metrics are mostly used for
aggregating performances across algorithm runs, and we will list them below.
The gold standard in (single-objective) continuous optimization is the Expected Running Time
(ERT) [Auger and Hansen, 2005, Price, 1997], which computes the ratio between the sum of consumed
budget across all runs and the number of successful runs [Hansen et al., 2012]. Thereby, it estimates the av-
erage running time an algorithm needs to find a solution of the desired target quality (under the assumption
of independent restarts every T time units until success).
In other optimization domains, like TSP, SAT, etc., the Penalized Average Runtime (PAR) [Bischl et al.,
2016] is more common. It penalizes unsuccessful runs with a multiple of the maximum allowed budget –
penalty factors ten (PAR10) and two (PAR2) are the most common versions – and afterwards computes the
arithmetic mean of the consumed budget across all runs. The Penalized Quantile Runtime (PQR) [Bossek
et al., 2020a, Kerschke et al., 2018a] works similarly, but instead of using the arithmetic mean for aggregating
across the runs, it utilizes quantiles – usually the median – of the (potentially penalized) running times. In
consequence, PQR provides a robust alternative to the respective PAR scores.
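As a minimal illustration (with invented run data), the following sketch computes the ERT, the PAR10 score, and the success rate from a list of per-run hitting times, where None marks a run that did not reach the target within the maximum budget.

```python
def ert(runtimes, max_budget):
    """Expected Running Time: total consumed budget over all runs divided by
    the number of successful runs. `runtimes` holds the evaluations used by
    successful runs and None for unsuccessful runs (which consumed max_budget)."""
    consumed = sum(r if r is not None else max_budget for r in runtimes)
    successes = sum(1 for r in runtimes if r is not None)
    return consumed / successes if successes else float("inf")

def par(runtimes, max_budget, factor=10):
    """Penalized Average Runtime (PAR10 by default): unsuccessful runs are
    counted as factor * max_budget before averaging."""
    penalized = [r if r is not None else factor * max_budget for r in runtimes]
    return sum(penalized) / len(penalized)

runs = [120, 300, None, 80, None]   # evaluations to target; None = failed run
print(ert(runs, max_budget=1000))   # (120+300+1000+80+1000)/3 = 833.33...
print(par(runs, max_budget=1000))   # (120+300+10000+80+10000)/5 = 4100.0
print(sum(r is not None for r in runs) / len(runs))  # success rate = 0.6
```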
Spread and Reliability A common measure of reliability is the estimated success probability, i.e., the
fraction of runs that achieved a defined goal. Bossek et al. [2020a] use it to take a multi-objective view
by combining the probability of success and the average runtime of successful runs. Similarly, Hellwig and
Beyer [2019] aggregate the two metrics in a single ratio, which they called SP.
As measures of dispersion of a given single performance metric, statistics like standard deviations as well
as quantiles are used, where the latter are more robust.
For constraint optimization, a feasibility rate (FR) [Kumar et al., 2020, Wu et al., 2017] is defined as
the fraction of runs discovering at least one feasible solution. The number of constraints violated by the
median solution [Kumar et al., 2020] and the mean amount of constraint violation over the best results of
all runs [Hellwig and Beyer, 2019] can also be used.
• single-problem analysis, and
• multiple-problem analysis.
In both scenarios, multiple algorithms will be considered, i.e., following the notation introduced in Section 2,
there are at least two different algorithm instances, say, $a_j$ and $a_k$ from algorithm $A$, or at least two different
algorithm instances $a_j \in A$ and $b_k \in B$, where $A$ and $B$ denote the corresponding algorithms. Single-problem
analysis is a scenario where the data consists of multiple runs of the algorithms on a single problem instance
$\pi_i \in \Pi$. This is necessary because many optimization algorithms are stochastic in nature, so there is no
guarantee that the result will be the same for every run. Additionally, the path leading to the final solution
is often different. For this reason, it is not enough to perform just a single algorithm run per problem, but
many runs are needed to make a conclusion. In this scenario, the result from the analysis will give us a
conclusion which algorithm performs the best on that specific problem.
Otherwise, in the case of multiple-problem analysis, focusing on (G1.2), we are interested in comparing
the algorithms on a set of benchmark problems. The best practices of how to select a representative value
for multiple-problem analysis will be described in Section 7.
No matter whether we perform single-problem or multiple-problem analysis, best practice suggests
analyzing the results of the experiments in a three-level approach, which consists of the following three steps:
1. Exploratory Data Analysis (EDA)
2. Confirmatory Analysis
3. Relevance Analysis
This section focuses on analyzing the empirical results of an experiment using descriptive, graphical, and
statistical tools, which can be used for the three-level approach to analysis. More information about various
techniques and best practices for analyzing the results of experiments can be found in Crowder et al. [1979],
Golden et al. [1986], Barr et al. [1995], Bartz-Beielstein et al. [2004], Chiarandini et al. [2007], Garcı́a et al.
[2009], Bartz-Beielstein et al. [2010], Derrac et al. [2011], Eftimov et al. [2017], Beiranvand et al. [2017].
Mersmann et al. [2010], and more recently Kerschke and Trautmann [2019a], present methods based on ELA
to answer two basic questions that arise when benchmarking optimization algorithms. The first one is: which
algorithm is the ‘best’ one? and the second one: which algorithm should I use for my real world problem? In
the following, we summarize the most accepted and standard practices to evaluate the considered algorithms
stringently. These methods, if adhered to, may lead to wide acceptance and applicability of empirically
tested algorithms and may be a useful guide in the jungle of statistical tools and methods.
The following are the key tools available in EDA. EDA can provide valid conclusions that are presented
graphically, without requiring further statistical analysis. For further reading about EDA, the reader is
referred to [Tukey, 1977].
Visualising run-time behaviour. The second set of tools can be used to analyse the algorithm perfor-
mance over time, i.e., information about the performance for every lth iteration is required. Suitable for the
analysis of the performances of the optimization algorithms are convergence plots in which the performance
of the algorithm can be evaluated against the number of function evaluations. This helps us to understand
the dynamics of multiple algorithms in a single plot.
Histograms and box plots are also used in the graphical multiple problem analysis. Besides these common
tools, specific tools for the multiple problem analysis were developed, e.g., performance profiles proposed
in Dolan and Moré [2002]. They have emerged as an important tool to compare the performances of
optimization algorithms based on the cumulative distribution function of a performance ratio: the
performance metric (e.g., CPU time or achieved optimum) obtained by each algorithm divided by the best
value of that metric among all algorithms being compared. Such plots help to visualize
the advantages (or disadvantages) of each competing algorithm graphically. Performance profiles are not
applicable if the (true or theoretical) optimum is unknown. However, there are solutions for this problem,
e.g., using the best known solution so far or a guessed (most likely) optimum based on the user’s experience;
however, the latter is likely to be error-prone.
As performance profiles are not evaluated against the number of function evaluations, they cannot be
used to infer the percentage of the test problems that can be solved with some specific number of function
evaluations. To attain this feature, data profiles were designed for fixed-budget derivative-free optimization
algorithms [Moré and Wild, 2009]. They are appropriate for comparing the best possible solutions obtained
from various algorithms within a fixed budget.
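A bare-bones sketch of how performance-profile values can be computed is given below. It assumes a dictionary mapping each solver to its per-problem performance metric (the data and names are hypothetical) and returns, for each solver, the fraction of problems solved within a factor tau of the best solver.

```python
def performance_profile(costs, taus):
    """Performance profiles (Dolan and Moré, 2002): `costs[s][p]` is the
    performance metric (e.g., CPU time or FEs) of solver s on problem p.
    Returns, for each solver, the fraction of problems whose performance
    ratio (relative to the best solver on that problem) is at most tau."""
    problems = range(len(next(iter(costs.values()))))
    best = [min(costs[s][p] for s in costs) for p in problems]
    profiles = {}
    for s in costs:
        ratios = [costs[s][p] / best[p] for p in problems]
        profiles[s] = [sum(r <= tau for r in ratios) / len(ratios) for tau in taus]
    return profiles

costs = {"A": [10, 50, 40], "B": [20, 30, 80]}
print(performance_profile(costs, taus=[1, 2, 4]))
# A: ratios [1.0, 1.67, 1.0] -> [0.67, 1.0, 1.0]; B: ratios [2.0, 1.0, 2.0] -> [0.33, 1.0, 1.0]
```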
6.3 Confirmatory Analysis
6.3.1 Motivation
The second step in the three-level approach is referred to as confirmatory analysis, which is based on
inferential statistics, because it implements a deductive approach: a given assumption (statistical hypothesis) is
tested using the experimental data. Since the assumptions are formulated as statistical hypotheses, confir-
matory analysis heavily relies on probability models. Its final goal is to provide definite answers to specific
questions, i.e., questions for a specific experimental design. Because it uses probability models, its emphasis
is on complex numerical calculations. Its main ingredients are hypothesis tests and confidence intervals.
Confirmatory analysis usually generates more precise results for a specific context than EDA. But, if the
context is not suitable, e.g., statistical assumptions are not fulfilled, a misleading impression of precision
might occur.
Often, EDA tools are not sufficient to clearly analyze the differences in the performances of algorithms,
mainly when the differences are of smaller magnitude. The need to perform statistical analysis and various
procedures involved in making decisions about selecting the best algorithm are widely discussed in [Amini
and Barr, 1993, Barr et al., 1995, Carrano et al., 2011, Chiarandini et al., 2007, Eftimov et al., 2017, Garcı́a
et al., 2009, Golden et al., 1986, McGeoch, 1996]. The basic idea of statistical analysis is based on hypothesis
testing. Before analysing the performance data, we should define two hypotheses: i) the null hypothesis
$H_0$ and ii) the alternative hypothesis $H_1$. The null hypothesis states that there is no significant statistical
difference between the two algorithms’ performances, while the alternative hypothesis directly contradicts the
null hypothesis by indicating a statistically significant difference between the algorithms’ performances. Hypothesis
testing can be two-sided or one-sided. We will consider the one-sided case in the following, because it allows
us to ask if algorithm instance a is better than algorithm instance b. Let p(a) denote the performance
of algorithm a. In the context of minimization, smaller performance values will be better, because we will
compare the best solutions or the run times. The statement ”a outperforms b” is equivalent to ”p(a) < p(b)”,
which can be formulated as the statistical hypothesis $H_1: p(b) - p(a) > 0$. It is a common agreement in
hypothesis testing that this hypothesis $H_1$ will be tested against the null hypothesis $H_0: p(b) - p(a) \leq 0$,
which states that $a$ is not better than $b$.
After the hypotheses are defined, we should select an appropriate statistical test for the analysis. The
corresponding test statistic T is a function of a random sample that allows researchers to determine how likely
the observed outcome is if the null hypothesis were true. The mean of the best found values from n repeated
runs of an algorithm is a typical example of a test statistic. Additionally, a significance level α should be
selected. Usually, a significance level of α = 0.05 (corresponding to a 95% confidence level) is used. However,
the selection of this value depends on the experimental design and the scientific question to be answered.
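As a minimal, purely illustrative sketch (the run data below are synthetic, and the choice of the Mann-Whitney U test anticipates the discussion of suitable tests below), the one-sided hypothesis can be evaluated in Python as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical best objective values from 30 independent runs of a and b (minimization).
best_a = rng.normal(loc=10.0, scale=1.0, size=30)
best_b = rng.normal(loc=11.0, scale=1.0, size=30)

alpha = 0.05  # significance level
# One-sided test of H1: the values of a tend to be smaller than those of b,
# against H0: a is not better than b.
stat, p_value = stats.mannwhitneyu(best_a, best_b, alternative="less")
if p_value < alpha:
    print(f"Reject H0 (p = {p_value:.4f}): a outperforms b.")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f}): no evidence that a outperforms b.")
```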
Figure 3: A pipeline for selecting an appropriate statistical test [Eftimov et al., 2020].
Samples are paired when natural or matched couplings occur. This means that each data value in one sample is
uniquely paired to a data value in the other sample. The choice between paired and unpaired samples depends on
the experimental design, and researchers need to be aware of this when designing their experiment. Using Common
Random Numbers (CRN) is a well-known technique for generating paired samples. If the same seeds are used during
the optimization, CRNs might reduce the variances and lead to more reliable statistical conclusions [Kleijnen,
1988, Nazzal et al., 2012].
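A minimal sketch of this idea, with a stand-in optimizer and purely illustrative numbers, is shown below; reusing the same seeds creates paired samples that can then be analyzed with a paired test such as the Wilcoxon signed-rank test:

```python
import numpy as np
from scipy import stats

def run_algorithm(mean_quality, seed, algo_id):
    # Stand-in for a stochastic optimizer: the shared seed models the common
    # source of randomness (e.g., identical noise realizations or start points),
    # while algo_id adds some algorithm-specific variation.
    common = np.random.default_rng(seed).normal(0.0, 1.0)
    specific = np.random.default_rng(seed * 1000 + algo_id).normal(0.0, 0.3)
    return mean_quality + common + specific

seeds = range(30)
best_a = np.array([run_algorithm(10.0, s, algo_id=0) for s in seeds])
best_b = np.array([run_algorithm(10.4, s, algo_id=1) for s in seeds])

# Paired, one-sided Wilcoxon signed-rank test of H1: a yields smaller best values than b.
stat, p_value = stats.wilcoxon(best_a, best_b, alternative="less")
print(f"paired one-sided p-value: {p_value:.4f}")
```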
Single-problem analysis. As we previously mentioned, in this case, the performance measure data is
obtained using multiple runs of k algorithm instances a1 , . . . , ak on one selected problem instance πj .
The comparison of samples in pairs is called a pairwise comparison. Note that a pairwise comparison
of algorithms does not necessarily mean that the corresponding samples are paired. In fact, most pairwise
comparisons use unpaired samples, because the setup for paired sampling is demanding, e.g., it requires
implementing common random number streams. If more than two samples are compared at the same time, a multiple
comparison is performed.
For a pairwise comparison, the t test [Sheskin, 2003] is the appropriate parametric test, while its non-
parametric counterpart is the Mann-Whitney U test (i.e., the Wilcoxon rank-sum test) [Hart, 2001]. When
more than two algorithms are involved, the parametric choice is the one-way ANOVA [Lindman, 1974,
Montgomery, 2017], while its non-parametric counterpart is the Kruskal-Wallis rank-sum test [Kruskal
and Wallis, 1952]. Here, if the null hypothesis is rejected, we should continue with a post-hoc procedure
to identify the pairs of algorithms that contribute to the statistical significance.
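For more than two algorithm instances on a single problem, a minimal sketch with synthetic run data (all names and numbers are illustrative) could look as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical best objective values of k = 3 algorithm instances,
# each from 25 repeated runs on one problem instance.
runs = [rng.normal(loc=mu, scale=1.0, size=25) for mu in (10.0, 10.2, 12.0)]

# Non-parametric alternative to the one-way ANOVA for k > 2 samples.
h_stat, p_value = stats.kruskal(*runs)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
# If the null hypothesis is rejected, continue with a post-hoc procedure
# (see below) to identify which pairs of algorithms differ.
```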
Multiple-problem analysis. The mean of the performance measure over multiple runs can be used as a
representative value of each algorithm on each problem. However, as stated above, averaging is sensitive to
outliers, which needs to be considered especially because optimization algorithms can have poor runs. For
this reason, the median of the performance measure over the multiple runs can also be used as a more robust
statistic.
Both mean and median are sensitive to errors inside some ε-neighborhood (i.e., small differences between
their values that are not recognized by the ranking schemes of the non-parametric tests), which can additionally
affect the statistical result. For these reasons, Deep Statistical Comparison (DSC) for comparing evolutionary
algorithms was proposed [Eftimov et al., 2017]. Its main contribution is its ranking scheme, which is based
on the whole distribution, instead of using only one statistic, such as the mean or the median, to describe the
distribution.
The impact of these three transformations, which can be used to find a representative value for each algorithm
on each problem, on the final result of the statistical analysis in the multiple-problem setting is presented in
[Eftimov and Korošec, 2018].
Statistical tests. No matter which transformation is used, once the data for the analysis is available, the
next step is to select an appropriate statistical test. For a pairwise comparison, the t test is the appropriate
parametric test [Sheskin, 2003], while its relevant non-parametric counterpart is the Wilcoxon signed-rank test
[Wilcoxon, 1945]. When more than two algorithms are involved, the parametric choice is the
repeated-measures ANOVA [Lindman, 1974, Montgomery, 2017], while appropriate non-parametric
tests are the Friedman rank-based test [Friedman, 1937], the Friedman aligned-ranks test [Garcı́a et al., 2009], and
the Iman-Davenport test [Garcı́a et al., 2009]. Additionally, if the null hypothesis is rejected, we should, as in the
single-problem analysis, continue with a post-hoc procedure to identify the pairs of algorithms that
contribute to the statistical significance.
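A minimal sketch of this multiple-problem workflow, with synthetic raw results (algorithm names, problem counts, and values are illustrative), first condenses the runs per problem into a representative value and then applies the Friedman test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_problems, n_runs = 20, 15
# Hypothetical raw results: for each algorithm, a (problems x runs) matrix of best values.
raw = {
    "alg1": rng.normal(10.0, 1.0, size=(n_problems, n_runs)),
    "alg2": rng.normal(10.5, 1.0, size=(n_problems, n_runs)),
    "alg3": rng.normal(12.0, 1.0, size=(n_problems, n_runs)),
}
# One representative value per algorithm and problem; the median is more robust
# against poor runs than the mean.
medians = {name: np.median(values, axis=1) for name, values in raw.items()}

# Friedman rank-based test; the problems act as blocks.
chi2, p_value = stats.friedmanchisquare(*medians.values())
print(f"Friedman chi-square = {chi2:.2f}, p = {p_value:.4f}")
```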
Further options are rank-based non-parametric tests, which are suitable when the distributional assumptions
are questionable [Sheskin, 2003]. With these tests, the data is ranked and the p-value is calculated on the ranks
rather than on the raw data. This ranking helps to mitigate skewness and to handle extreme values.
The permutation test [Pesarin, 2001] estimates the permutation distribution by shuffling the data without
replacement and enumerating (almost) all possible values of the test statistic. Page's trend test
[Derrac et al., 2014] is another non-parametric test, which can be used to analyse the convergence
behavior of evolutionary algorithms.
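The permutation test can be sketched in a few lines; the two samples below are synthetic, and the chosen statistic (difference of means) is only one possible choice:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(10.0, 1.0, size=20)   # hypothetical results of algorithm a
b = rng.normal(10.8, 1.0, size=20)   # hypothetical results of algorithm b

observed = np.mean(b) - np.mean(a)   # test statistic on the original labeling
pooled = np.concatenate([a, b])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)            # reshuffle the group labels
    diff = np.mean(shuffled[len(a):]) - np.mean(shuffled[:len(a)])
    count += diff >= observed
p_value = (count + 1) / (n_perm + 1)              # one-sided permutation p-value
print(f"permutation p-value: {p_value:.4f}")
```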
Post-hoc procedures. When more than two algorithms are involved in the comparison, the appropriate
statistical test can detect a statistically significant difference between the algorithms' performances, but it
cannot identify the pairs of algorithms that contribute to this difference. For this reason, if
the null hypothesis is rejected, we should continue with a post-hoc test.
The post-hoc testing can be done in two scenarios: i) all pairwise comparisons and ii) multiple comparisons
with a control algorithm. Let us assume that we have k algorithms involved in the comparison, so in the
first scenario we should perform k(k − 1)/2 comparisons, and in the second one k − 1.
In the case of all pairwise comparisons, a post-hoc test statistic has to be calculated. It depends
on the statistical test that was used to compare all algorithms together and that rejected the null
hypothesis. After that, the obtained p-values are corrected with some post-hoc procedure. For example, if the
null hypothesis of the Friedman test, the Friedman aligned-ranks test, or the Iman–Davenport test is rejected, we
can use the Nemenyi, Holm, Shaffer, or Bergmann corrections to adjust the p-values and handle the multiple-testing
issue.
Multiple comparisons with a control algorithm describe the scenario in which our newly developed algorithm is the
control algorithm and we compare it with state-of-the-art algorithms. As in the previous scenario,
the post-hoc statistic depends on the statistical test that was used to compare all algorithms
together and that rejected the null hypothesis, and the obtained p-values are corrected with some post-hoc
procedure. In the case of the Friedman test, the Friedman aligned-ranks test, or the Iman–Davenport test,
appropriate post-hoc procedures are Bonferroni, Holm, Hochberg, Hommel, Holland, Rom, Finner, and Li.
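As an illustration of such a correction, the following sketch implements the Holm step-down adjustment for a set of hypothetical unadjusted p-values (the numbers are made up); ready-made implementations of Holm and several of the other corrections listed above are available, e.g., in statsmodels' multipletests:

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjustment of unadjusted p-values (controls the FWER).
    Minimal sketch; returns adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical unadjusted p-values from k - 1 comparisons against a control algorithm.
raw_p = [0.001, 0.020, 0.049, 0.300]
print(holm_adjust(raw_p))   # -> approximately [0.004, 0.060, 0.098, 0.300]
```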
Another way to perform multiple comparisons with a control algorithm is to compare the control algorithm with
each other algorithm using some pairwise test. In this case, we should be careful when drawing conclusions, since
we lose control over the Family-Wise Error Rate (FWER) when performing multiple pairwise comparisons: if all
obtained p-values come from m independent pairwise comparisons, each performed at significance level α, the FWER
grows to 1 − (1 − α)^m, e.g., to roughly 0.40 for m = 10 and α = 0.05. The calculation of the true statistical
significance when combining pairwise comparisons is presented in [Eftimov et al., 2017, Garcı́a et al., 2009].
More information about different post-hoc procedures and their application to benchmarking in
evolutionary computation is presented in [Garcı́a et al., 2009].
The first approach, the Chess Rating System for Evolutionary Algorithms (CRS4EA), treats a comparison between the
performance measures of two optimization algorithms as the outcome of a single game. A draw limit that defines
when two performance measure values are considered equal has to be specified by the user and is problem-specific.
In the end, each algorithm has its own rating resulting from the tournament, and the statistical analysis is
performed using confidence intervals calculated from the algorithms' ratings.
The second approach is the practical Deep Statistical Comparison (pDSC), which is a modification of the
DSC approach used for testing statistical significance [Eftimov and Korošec, 2019]. The basic idea is that
the data on each problem is first pre-processed with some practical level of significance specified by the user
and is then analyzed with DSC to find relevant differences. Two pre-processing variants are proposed: i) sequential
pre-processing, which pre-processes the performance measures from multiple runs in sequential order, and
ii) a Monte Carlo approach, which pre-processes the performance measure values such that the practical
significance does not depend on the order of the independent runs. A comparison between CRS4EA and pDSC is
presented in [Eftimov and Korošec, 2019]. Using these two approaches, the analysis is made for the
multiple-problem scenario. Additionally, the rankings obtained by pDSC on the single-problem level can be used
for single-problem analysis.
7 Experimental Design
7.1 Design of Experiments (DoE)
Unfortunately, many empirical evaluations of optimization algorithms are performed and reported with-
out addressing basic experimental design considerations [Brownlee, 2007]. An important step towards making this
procedure more transparent and more objective is to use DOE and related techniques, which provide a systematic
procedure for comparisons in benchmarking. Experimental design offers an excellent way of deciding which and how
many algorithm runs should be performed so that the desired information can be obtained with the smallest number
of runs.
DOE comprises planning, conducting, analyzing, and interpreting controlled tests to evaluate the influence of
the varied factors on the outcome of the experiments. The importance and the benefits of a well-designed and
well-planned experiment have been summarized by Hooker [1996]. Johnson [2002b] suggests not only reporting
the run time of an algorithm, but also explaining the corresponding adjustment process (preparation or tuning
before the algorithm is run) in detail, and including the adjustment time in all reported
running times to avoid a serious underestimate.
The various key implications involved in the DOE are clearly explained in Kleijnen [2001]. A compre-
hensive list of the recent publications on design techniques can be found in Kleijnen [2017]. The various
design strategies in the Design and Analysis of Computer Experiments (DACE) are discussed by Santner
et al. [2003]. Wagner [2010] discusses important experimental design topics, e.g., “How many replications of
each design should be performed?” or “How many algorithm runs should be evaluated?”
This section discusses important practical aspects of formulating the design of experiments for a
stochastic optimization problem and outlines the key principles. For a detailed treatment of DOE, the
readers are referred to Montgomery [2017] and Kleijnen [2015].
c2 = 2, starting value of the inertia weight wmax = 0.9, final value of the inertia weight wscale = 0, percentage
of iterations for which wmax is reduced witerscale = 1, and maximum value of the step size vmax = 100. This
algorithm design contains infinitely many design points, because c1 is not fixed.
Problem designs provide information related to the optimization problem, such as the available resources
(number of function evaluations) or the problem’s dimension.
An experimental design consists of a problem design and an algorithm design. Benchmark studies require
complex experimental designs, because they are combinations of several problem and algorithm designs.
Furthermore, as discussed in Section 5, one or several performance measures must be specified.
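A minimal sketch of such a combination, with purely illustrative factor names and levels, crosses a two-level algorithm design with a two-level problem design into a full factorial benchmark design; each resulting design point would then be run with several replications and evaluated with the chosen performance measures:

```python
import itertools

# Illustrative algorithm design (parameter levels) and problem design (instance levels).
algorithm_design = {
    "population_size": (20, 100),
    "mutation_rate": (0.01, 0.1),
}
problem_design = {
    "dimension": (10, 40),
    "budget": (1_000, 10_000),   # number of function evaluations
}
factors = {**algorithm_design, **problem_design}

# Two-level full factorial design: every combination of factor levels is one design point.
design = list(itertools.product(*factors.values()))
for run_id, levels in enumerate(design):
    config = dict(zip(factors, levels))
    # for repeat in range(10): run_and_log(config, seed=repeat)   # hypothetical call
    print(run_id, config)
```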
• If algorithm runs on the chosen problems take too long for a full tuning process, one may still
perform a simple space-filling design on the parameter space, e.g., an LHD or a low-discrepancy point
set [Matoušek, 2009] with only a few design points and repeats (see the sketch after this list). This prevents
misconfigurations of algorithms, as one likely gets into the “ball park” [De Jong, 2007] of relatively good
parameter settings. Most likely, neither algorithm works at its peak performance level, but the comparison is
still fair.
• If no code other than that of one's own algorithm is available, one has to resort to comparing with default
parameter values. For a new algorithm, these could be determined by a common tuning process over
the whole problem set. Note, however, that such a comparison deliberately abstains from setting good
parameters for specific problems, even though this would be attempted in any real-world application.
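The space-filling option mentioned in the first item above can be sketched with SciPy's quasi-Monte Carlo module (available in recent SciPy versions); the parameter names and ranges below are illustrative only, and run_algorithm is a hypothetical evaluation call:

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical parameter space of a PSO-like algorithm: (c1, c2, w_max, v_max).
l_bounds = [0.5, 0.5, 0.4, 10.0]
u_bounds = [2.5, 2.5, 1.0, 200.0]

sampler = qmc.LatinHypercube(d=4, seed=7)
unit_points = sampler.random(n=10)                    # 10 design points in [0, 1]^4
design = qmc.scale(unit_points, l_bounds, u_bounds)   # map to the parameter ranges

for c1, c2, w_max, v_max in design:
    # best = run_algorithm(c1, c2, w_max, v_max, repeats=3)   # hypothetical call
    print(np.round([c1, c2, w_max, v_max], 3))
```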
26 At the moment, this is only a list, which will be extended in forthcoming versions of this survey.
7.6 Open Issues
(O7.1) Best Designs.
Some authors consider LHDs as the default choice, even though for numerous applications the superiority
of other space-filling or low-discrepancy designs has been demonstrated [Santner et al., 2003]. The
question when to prefer i.i.d. uniform sampling, LHDs, low-discrepancy point sets, other space-filling
designs, or sets minimizing some other diversity criterion is largely open.
(O7.2) Multiple Objectives.
Sometimes properties of the objective function are used to determine the quality of a design. It therefore
remains unclear how to measure the quality of a design in settings where the objective function is unknown.
Furthermore, problems occur if wrong assumptions about the objective function, e.g., linearity, are made.
And, last but not least, in Multi-Objective Optimization (MOO), where no single objective can be specified,
finding the optimal design can be very difficult [Santner et al., 2003].
2. do not push deadlines, i.e., do not reduce the quality of the report because a deadline is approaching.
Invest some time in planning: the number of experiments, the algorithm portfolio, and the hardness of the
problem instances, as discussed in Section 7,
3. and report negative results, i.e., present and discuss problem instances on which the algorithms fail
(this is a key component of a good scientific report, as discussed in this section).
Barr et al. [1995] in their classical work on reporting empirical results of heuristics specify a loose exper-
imental setup methodology with the following steps:
1. define the goals of the experiment,
2. select measure of performance and factors to explore,
They then suggest eight guidelines for reporting results; in summary, these are: ensure reproducibility, specify
all influential factors (code, computing environment, etc.), be precise regarding measures, specify parameters,
use statistical experimental design, compare with other methods, reduce the variability of results, and ensure
that results are comprehensive. They go on to clarify these points with examples.
8.2 Reporting Methodologies
Besides recommendations that provide valuable hints on how to report results, there also exist reporting
methodologies that follow a scientific method, e.g., one based on hypothesis testing [Popper, 1959, 1975]. Such a
methodology was proposed by Bartz-Beielstein and Preuss [2010]. They propose organizing the presentation
of experiments into seven parts, as follows:
(R.1) Research question
Briefly names the matter dealt with, the (possibly very general) objective, preferably in one sentence.
This is used as the report’s “headline” and related to the primary model.
(R.2) Pre-experimental planning
Summarizes the first (possibly explorative) program runs, leading to task and setup (R.3 and R.4).
Decisions on employed benchmark problems or performance measures should be taken according to
the data collected in preliminary runs. The report on pre-experimental planning should also include
negative results, e.g., modifications to an algorithm that did not work or a test problem that turned
out to be too hard, if they provide new insight.
(R.3) Task
Concretizes the question in focus and states scientific claims and derived statistical hypotheses to test.
Note that one scientific claim may require several, sometimes hundreds, of statistical hypotheses. In
case of a purely explorative study, as with the first test of a new algorithm, statistical tests may not
be applicable. Still, the task should be formulated as precisely as possible. This step is related to the
experimental model.
(R.4) Setup
Specifies problem design and algorithm design, including the investigated algorithm, the controllable
and the fixed parameters, and the chosen performance measures. It also includes information about
the computational environment (hard- and software specification, e.g., the packages or libraries used).
The information provided in this part should be sufficient to replicate an experiment.
(R.5) Results/Visualization
Gives raw or produced (filtered) data on the experimental outcome and additionally provides basic
visualizations where meaningful. This is related to the data model.
(R.6) Observations
Describes exceptions from the expected, or unusual patterns noticed, without subjective assessment
or explanation. As an example, it may be worthwhile to look at parameter interactions. Additional
visualizations may help to clarify what happens.
(R.7) Discussion
Decides about the hypotheses specified in R.3, and provides necessarily subjective interpretations of
the recorded observations. Also places the results in a wider context. The leading question here is:
What did we learn?
This methodology was extended and refined in Preuss [2015]. It is important to separate parts R.6 and R.7,
to allow others to draw different conclusions from the same results/observations. This division
into parts of increasing subjectivity is similar to the suggestions of Barr et al. [1995], who distinguish
between results, their analysis, and the conclusions drawn by the experimenter.
Note that all of these parts are already included in current good experimental reports. However, they
are usually not separated but wildly mixed. Thus, we only suggest inserting labels into the text to make the
structure more obvious.
We also recommend keeping a journal of experiments with single reports according to the above scheme
to enable referring to previous experiments later on. This is useful even if single experiments do not find their
way into a publication, as it improves the overview of subsequent experiments and helps to avoid repeated
tests.
8.3 Open Issues
Reporting negative results has many benefits: for example, it demonstrates what has been tried and does not work,
so that others will not repeat it in the future, and negative results are valuable for illustrating the
limitations of new approaches. Nevertheless, the presentation of negative results, as recommended above (item 3),
is not adequately accepted in the research community (cf. Gent and Walsh [1994]). Whereas a paper improving
existing experimental results or outperforming another algorithm regularly gets accepted for publication, papers
presenting negative results usually do not.
Repeatability (Same team, same experimental setup) The measurement can be obtained with stated pre-
cision by the same team using the same measurement procedure, the same measuring system, under
the same operating conditions, in the same location on multiple trials. For computational experiments,
this means that a researcher can reliably repeat her own computation.
Reproducibility (Different team, same experimental setup) The measurement can be obtained with stated
precision by a different team using the same measurement procedure, the same measuring system, under
the same operating conditions, in the same or a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using the author’s own
artifacts.
Replicability (Different team, different experimental setup) The measurement can be obtained with stated
precision by a different team, a different measuring system, in a different location on multiple trials.
For computational experiments, this means that an independent group can obtain the same result
using artifacts which they develop completely independently.
The above classification helps to identify various levels of reproducibility, reserving the term “Replicabil-
ity” for the most scientifically useful, yet hardest to achieve, level. There are many practical guidelines and
software systems available to achieve repeatability and reproducibility [Gent et al., 1997, Johnson, 2002a],
including code versioning tools (Subversion and Git), data repositories (Zenodo), reproducible documents
(Rmarkdown and Jupyter notebooks), and reproducible software environments (OSF29, CodeOcean, and Docker).
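A small, illustrative step in this direction is to fix the random seeds and to record basic environment metadata alongside the results; the sketch below is deliberately minimal, and real studies should additionally pin exact package versions, e.g., via a requirements file, a notebook, or a container image:

```python
import json
import platform
import random
import sys

import numpy as np

def record_run_metadata(seed, path="experiment_meta.json"):
    # Fix the seeds of the random number generators used by the experiment.
    random.seed(seed)
    np.random.seed(seed)
    # Record basic information needed to repeat or reproduce the computation.
    meta = {
        "seed": seed,
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "argv": sys.argv,
    }
    with open(path, "w") as fh:
        json.dump(meta, fh, indent=2)
    return meta

record_run_metadata(seed=12345)
```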
Unfortunately, it is not so clear how to successfully achieve Replicability. For achieving replicability,
one must give up on exactly reproducing the results and provide statistical guidelines that are commonly
accepted by the field to provide sufficient evidence for a conclusion, even under different, but similar, exper-
imental conditions. What constitutes similar experimental conditions depends on the experiment and there
is no simple answer when benchmarking algorithms. One step towards better replicability is to pre-register
experimental designs [Nosek et al., 2018] to fix the hypothesis and design of experiments. Preregistration
27 http://folk.idi.ntnu.no/odderik/reproducibility_guidelines.pdf
28 Quoting from:
https://www.acm.org/publications/policies/artifact-review-and-badging-current
29 https://osf.io/
reduces the risk of spurious results due to adaptations to data analysis. However, it is much harder to
systematically control for adaptive computational experiments because, unlike randomized controlled trials,
they are much easier to run and re-run prior to registration.
2. compiling videos, which explain how to set up the experiments, analyze results, and report important
findings,
3. providing software tools,
4. developing a comprehensible check-list, especially for beginners in benchmarking,
5. including a discussion section in every section, which describes controversial topics and ideas.
Our final goal is to provide well-accepted guidelines (rules) that might be useful for authors, reviewers,
and others. Consider the following (rudimentary and incomplete) checklist, which can serve as a guideline for
authors and reviewers:
1. goals: did the authors clearly state the reasons for this study?
2. problems: is the selection of problem instances well motivated and justified?
3. algorithms: do comparisons include relevant competitors?
4. performance: is the choice of the performance measure adequate?
Transparent, well-accepted standards will significantly improve the review process in EC and related fields.
These common standards might also accelerate the review process, because they improve the quality of
submissions and help reviewers to write objective evaluations. Most importantly, it is not our intention
to dictate specific test statistics, experimental designs, or performance measures. Instead, we claim that
publications in EC would improve if authors explained why they chose a specific measure, tool, or
design. And, last but not least, authors should describe the goal of their study.
Although we tried to include the most relevant contributions, we are aware that important contributions
are missing. Because the acceptance of the proposed recommendations is crucial, we would like to invite
more researchers to share their knowledge with us. Moreover, as the field of benchmarking is constantly
changing, this article will be regularly updated and published on arXiv [Bartz-Beielstein et al., 2020]. To
get in touch, interested readers can use the associated e-mail address for this project:
[email protected].
There are several other initiatives that are trying to improve benchmarking standards in query-based
optimization fields, e.g., the Benchmarking Network30 , an initiative built to consolidate and to stimulate
activities on benchmarking iterative optimization heuristics [Weinand et al., 2020].
In our opinion, starting and maintaining this public discussion is very important. Maybe, this survey
poses more questions than answers, which is fine. Therefore, we conclude this article with a famous saying
that is attributed to Richard Feynman31 :
I would rather have questions that can’t be answered than answers that can’t be questioned.
Acknowledgments
This work has been initiated at Dagstuhl seminar 19431 on Theory of Randomized Optimization Heuristics,32 and we gratefully
acknowledge the support of the Dagstuhl seminar center to our community.
We thank Carlos M. Fonseca for his important input and our fruitful discussion, which helped us shape the section on
performance measures. We also thank participants of the Benchmarking workshops at GECCO 2020 and at PPSN 2020 for
several suggestions to improve this paper. We thank Nikolaus Hansen for providing feedback on an earlier version of this survey.
C. Doerr acknowledges support from the Paris Ile-de-France region and from a public grant as part of the Investissement
d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.
J. Bossek acknowledges support by the Australian Research Council (ARC) through grant DP190103894.
J. Bossek and P. Kerschke acknowledge support by the European Research Center for Information Systems (ERCIS).
S. Chandrasekaran and T. Bartz-Beielstein acknowledge support from the Ministerium für Kultur und Wissenschaft des
Landes Nordrhein-Westfalen in the funding program FH Zeit für Forschung under the grant number 005-1703-0011 (OWOS).
T. Eftimov acknowledges support from the Slovenian Research Agency under research core funding No. P2-0098 and project
No. Z2-1867.
A. Fischbach and T. Bartz-Beielstein acknowledge support from the German Federal Ministry of Education and Research
in the funding program Forschung an Fachhochschulen under the grant number 13FH007IB6 (KOARCH).
W. La Cava is supported by NIH grant K99-LM012926 from the National Library of Medicine.
M. López-Ibáñez is a “Beatriz Galindo” Senior Distinguished Researcher (BEAGAL 18/00053) funded by the Ministry of
Science and Innovation of the Spanish Government.
K.M. Malan acknowledges support by the National Research Foundation of South Africa (Grant Number: 120837).
B. Naujoks and T. Bartz-Beielstein acknowledge support from the European Commission’s H2020 programme, H2020-
MSCA-ITN-2016 UTOPIAE (grant agreement No. 722734), as well as the DAAD (German Academic Exchange Service),
Project-ID: 57515062 “Multi-objective Optimization for Artificial Intelligence Systems in Industry”.
M. Wagner acknowledges support by the ARC projects DE160100850, DP200102364, and DP210102670.
T. Weise acknowledges support from the National Natural Science Foundation of China under Grant 61673359 and the
Hefei Specially Recruited Foreign Expert program.
We also acknowledge support from COST action 15140 on Improving Applicability of Nature-Inspired Optimisation by
Joining Theory and Practice (ImAppNIO).
30 https://sites.google.com/view/benchmarking-network/
31 https://en.wikiquote.org/w/index.php?title=Talk:Richard_Feynman&oldid=2681873#%22I_would_rather_have_
questions_that_cannot_be_answered%22
32 https://www.dagstuhl.de/19431
Glossary
AAAI Association for the Advancement of Artificial Intelligence. 34
ACM Association for Computing Machinery. 14, 34
ASlib Algorithm Selection Library. 16
MOO Multi-Objective Optimization. 32
OFAT One-factor-at-a-time. 30
References
Stavros P Adam, Stamatios-Aggelos N Alexandropoulos, Panos M Pardalos, and Michael N Vrahatis. No Free Lunch
Theorem: A Review. In Approximation and Optimization, pages 57 – 82. Springer, 2019.
Amritanshu Agrawal, Tim Menzies, Leandro L. Minku, Markus Wagner, and Zhe Yu. Better software analytics via
“duo”: Data mining algorithms using/used-by optimizers. Empirical Software Engineering, 25(3):2099–2136, 2020.
doi:10.1007/s10664-020-09808-9. URL https://doi.org/10.1007/s10664-020-09808-9.
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation
hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pages 2623–2631, 2019.
Mohammad M Amini and Richard S Barr. Network reoptimization algorithms: A statistically designed comparison.
ORSA Journal on Computing, 5(4):395–409, 1993.
Theodore W Anderson and Donald A Darling. Asymptotic theory of certain “goodness of fit” criteria based on
stochastic processes. The Annals of Mathematical Statistics, pages 193–212, 1952.
Carlos Ansótegui, Yuri Malitsky, Horst Samulowitz, Meinolf Sellmann, and Kevin Tierney. Model-based genetic
algorithms for algorithm configuration. In Proc. of International Conf. on Artificial Intelligence (IJCAI’15), pages
733–739. AAAI, 2015.
Dirk V. Arnold. Noisy Optimization with Evolution Strategies, volume 8. Springer, 2012.
Anne Auger and Benjamin Doerr. Theory of Randomized Search Heuristics. World Scientific, 2011.
Anne Auger and Nikolaus Hansen. Performance evaluation of an advanced local search evolutionary algorithm. In
Proceedings of the IEEE Congress on Evolutionary Computation, pages 1777–1784. IEEE, 2005.
Thomas Bäck, David B. Fogel, and Zbigniew Michalewicz. Handbook of Evolutionary Computation. IOP Publishing
Ltd., GBR, 1st edition, 1997.
Richard S Barr, Bruce L Golden, James P Kelly, Mauricio GC Resende, and William R Stewart. Designing and
reporting on computational experiments with heuristic methods. Journal of Heuristics, 1(1):9–32, 1995.
Maurice Stevenson Bartlett. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London.
Series A-Mathematical and Physical Sciences, 160(901):268–282, 1937.
Thomas Bartz-Beielstein and Mike Preuss. The Future of Experimental Research. In Thomas Bartz-Beielstein,
Marco Chiarandini, Luis Paquete, and Mike Preuss, editors, Experimental Methods for the Analysis of Optimization
Algorithms, pages 17–46. Springer, Berlin, Heidelberg, New York, 2010.
Thomas Bartz-Beielstein, Konstantinos E Parsopoulos, and Michael N Vrahatis. Design and analysis of optimization
algorithms using computational statistics. Applied Numerical Analysis & Computational Mathematics, 1(2):413–
433, 2004.
Thomas Bartz-Beielstein, Christian WG Lasarczyk, and Mike Preuss. Sequential Parameter Optimization. In Pro-
ceedings of the 2005 IEEE Congress on Evolutionary Computation, volume 1, pages 773 – 780. IEEE, 2005.
Thomas Bartz-Beielstein, Marco Chiarandini, Luı́s Paquete, and Mike Preuss. Experimental methods for the analysis
of optimization algorithms. Springer, 2010.
Thomas Bartz-Beielstein, Lorenzo Gentile, and Martin Zaefferer. In a nutshell: Sequential parameter optimization.
Technical report, TH Köln, 2017.
Thomas Bartz-Beielstein, Carola Doerr, Jakob Bossek, Sowmya Chandrasekaran, Tome Eftimov, Andreas Fischbach,
Pascal Kerschke, Manuel Lopez-Ibanez, Katherine M. Malan, Jason H. Moore, Boris Naujoks, Patryk Orzechowski,
Vanessa Volz, Markus Wagner, and Thomas Weise. Benchmarking in Optimization: Best Practice and Open Issues.
arXiv e-prints, art. arXiv:2007.03488, July 2020.
Vahid Beiranvand, Warren Hare, and Yves Lucet. Best Practices for Comparing Optimization Algorithms. Opti-
mization and Engineering, 18(4):815 – 848, 2017.
James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A python library for optimizing the hyperparameters of
machine learning algorithms. In Proceedings of the 12th Python in science conference, volume 13, page 20. Citeseer,
2013.
Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies: A comprehensive introduction. Natural Computing,
1:3–52, 2002.
Mauro Birattari, Thomas Stützle, Luı́s Paquete, and Klaus Varrentrapp. A racing algorithm for configuring meta-
heuristics. In W. B. Langdon et al., editors, Proceedings of the Genetic and Evolutionary Computation Conference,
GECCO 2002, pages 11–18. Morgan Kaufmann Publishers, San Francisco, CA, 2002.
Mauro Birattari, Luis Paquete, and Thomas Stützle. Classification of metaheuristics and design of experiments
for the analysis of components. https://www.researchgate.net/publication/2557723_Classification_of_
Metaheuristics_and_Design_of_Experiments_for_the_Analysis_of_Components, 2003. technical report.
Bernd Bischl, Pascal Kerschke, Lars Kotthoff, Thomas Marius Lindauer, Yuri Malitsky, Alexandre Fréchette, Hol-
ger H. Hoos, Frank Hutter, Kevin Leyton-Brown, Kevin Tierney, and Joaquin Vanschoren. ASlib: A Benchmark
Library for Algorithm Selection. Artificial Intelligence (AIJ), 237:41 – 58, 2016.
Laurens Bliek, Sicco Verwer, and Mathijs de Weerdt. Black-box mixed-variable optimisation using a surrogate model
that satisfies integer constraints. arXiv preprint arXiv:2006.04508, 2020.
Mahmoud A. Bokhari, Brad Alexander, and Markus Wagner. Towards rigorous validation of energy optimisa-
tion experiments. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20,
page 1232–1240, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371285.
doi:10.1145/3377930.3390245. URL https://doi.org/10.1145/3377930.3390245.
Mohammad Reza Bonyadi, Zbigniew Michalewicz, and Luigi Barone. The Travelling Thief Problem: The First
Step in the Transition from Theoretical Problems to Realistic Problems. In 2013 IEEE Congress on Evolutionary
Computation, pages 1037 – 1044. IEEE, 2013.
Mohammad Reza Bonyadi, Zbigniew Michalewicz, Markus Wagner, and Frank Neumann. Evolutionary Computa-
tion for Multicomponent Problems: Opportunities and Future Directions. In Optimization in Industry: Present
Practices and Future Scopes, pages 13 – 30. Springer, 2019. doi:10.1007/978-3-030-01641-8 2.
Jakob Bossek, Pascal Kerschke, Aneta Neumann, Markus Wagner, Frank Neumann, and Heike Trautmann. Evolving
Diverse TSP Instances by Means of Novel and Creative Mutation Operators. In Proc. of the 15th ACM/SIGEVO
Conference on Foundations of Genetic Algorithms, pages 58 – 71. ACM, 2019.
Jakob Bossek, Pascal Kerschke, and Heike Trautmann. A Multi-Objective Perspective on Performance Assessment
and Automated Selection of Single-Objective Optimization Algorithms. Applied Soft Computing Journal (ASOC),
88:105901, March 2020a.
Jakob Bossek, Pascal Kerschke, and Heike Trautmann. Anytime Behavior of Inexact TSP Solvers and Perspectives
for Automated Algorithm Selection. In Proc. of the IEEE Congress on Evolutionary Computation. IEEE, 2020b.
A preprint of this manuscript can be found at https://arxiv.org/abs/2005.13289.
Ilhem Boussaı̈d, Julien Lepagnot, and Patrick Siarry. A survey on optimization metaheuristics. Information Sciences,
237:82–117, 2013. doi:10.1016/j.ins.2013.02.041. URL https://doi.org/10.1016/j.ins.2013.02.041.
Jürgen Branke. Creating Robust Solutions by Means of Evolutionary Algorithms. In International Conference on
Parallel Problem Solving from Nature, pages 119 – 128. Springer, 1998.
Jürgen Branke, Christian Schmidt, and Hartmut Schmeck. Efficient Fitness Estimation in Noisy Environments. In
Genetic and Evolutionary Computation Conference (GECCO’01), pages 243 – 250. Morgan Kaufmann, 2001.
Jason Brownlee. A Note on Research Methodology and Benchmarking Optimization Algorithms. Technical report,
Complex Intelligent Systems Laboratory (CIS), Centre for Information Technology Research (CITR), Faculty of
Information and Communication Technologies (ICT), Swinburne University of Technology, Victoria, Australia,
Technical Report ID 70125, 2007.
Eduardo G Carrano, Elizabeth F Wanner, and Ricardo HC Takahashi. A multicriteria statistical based comparison
methodology for evaluating evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 15(6):
848–870, 2011.
Marie-Liesse Cauwet and Olivier Teytaud. Noisy Optimization: Fast Convergence Rates with Comparison-based
Algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 1101 – 1106,
2016.
Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger
Wirth. CRISP-DM 1.0: Step-by-Step Data Mining Guide. Technical report, SPSS Inc., 2000.
Marco Chiarandini, Luis Paquete, Mike Preuss, and Enda Ridge. Experiments on Metaheuristics: Methodological
Overview and Open Issues. Technical report, Institut for Matematik og Datalogi Syddansk Universitet, 2007.
Nicos Christofides. The Vehicle Routing Problem. Revue française d’automatique, d’informatique et de recherche
opérationnelle (RAIRO). Recherche opérationnelle, 10(1):55 – 70, 1976. URL http://www.numdam.org/item?id=
RO_1976__10_1_55_0.
Matej Črepinšek, Shih-Hsi Liu, and Marjan Mernik. Replication and Comparison of Computational Experiments in
Applied Evolutionary Computing: Common Pitfalls and Guidelines to Avoid Them. Applied Soft Computing, 19:
161 – 170, June 2014.
Harlan Crowder, Ron S Dembo, and John M Mulvey. On reporting computational experiments with mathematical
software. ACM Transactions on Mathematical Software (TOMS), 5(2):193–203, 1979.
Joseph C. Culberson. On the futility of blind search: An algorithmic view of “no free lunch”. Evolutionary Compu-
tation, 6(2):109–127, 1998. doi:10.1162/evco.1998.6.2.109.
Kenneth De Jong. Parameter Setting in EAs: a 30 Year Perspective. In Parameter Setting in Evolutionary Algorithms,
pages 1 – 18. Springer, 2007.
Marleen De Jonge and Daan van den Berg. Parameter sensitivity patterns in the plant propagation algorithm. In
IJCCI, page 92–99, 2020.
Lucas Augusto Müller de Souza, José Eduardo Henriques da Silva, Luciano Jerez Chaves, and Heder Soares
Bernardino. A benchmark suite for designing combinational logic circuits via metaheuristics. Applied Soft Comput-
ing, 91:106246, June 2020. doi:10.1016/j.asoc.2020.106246. URL https://doi.org/10.1016/j.asoc.2020.106246.
Joaquı́n Derrac, Salvador Garcı́a, Daniel Molina, and Francisco Herrera. A practical tutorial on the use of nonpara-
metric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm
and Evolutionary Computation, 1(1):3–18, 2011.
Joaquı́n Derrac, Salvador Garcı́a, Sheldon Hui, Ponnuthurai Nagaratnam Suganthan, and Francisco Herrera. Analyz-
ing convergence performance of evolutionary algorithms: A statistical approach. Information Sciences, 289:41–58,
2014.
Jay L Devore. Probability and Statistics for Engineering and the Sciences. Cengage learning, 2011.
Benjamin Doerr and Frank Neumann. Theory of Evolutionary Computation – Recent Developments in Discrete
Optimization. Springer, 2020.
Benjamin Doerr, Carola Doerr, and Johannes Lengler. Self-adjusting mutation rates with provably optimal success
rules. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1479 – 1487. ACM, 2019.
Carola Doerr, Hao Wang, Furong Ye, Sander van Rijn, and Thomas Bäck. IOHprofiler: A Benchmarking and
Profiling Tool for Iterative Optimization Heuristics. arXiv e-prints, art. arXiv:1810.05281, Oct 2018. Wiki page of
IOHprofiler is available at https://iohprofiler.github.io/.
Carola Doerr, Furong Ye, Naama Horesh, Hao Wang, Ofer M. Shir, and Thomas Bäck. Benchmarking Discrete
Optimization Heuristics with IOHprofiler. Applied Soft Computing, 88:106027, 2020.
Elizabeth D Dolan and Jorge J Moré. Benchmarking optimization software with performance profiles. Mathematical
programming, 91(2):201–213, 2002.
Marco Dorigo, Mauro Birattari, and Thomas Stützle. Ant colony optimization: Artificial ants as a computational
intelligence technique. IEEE Computational Intelligence Magazine, 1(4):28–39, 2006.
Gunter Dueck and Tobias Scheuer. Threshold accepting: a general purpose optimization algorithm appearing superior
to simulated annealing. J. Comput. Phys., 90:161–175, 1990.
Tome Eftimov and Peter Korošec. The impact of statistics for benchmarking in evolutionary computation research.
In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1329–1336, 2018.
Tome Eftimov and Peter Korošec. Identifying practical significance through statistical comparison of meta-heuristic
stochastic optimization algorithms. Applied Soft Computing, 85:105862, 2019.
Tome Eftimov, Peter Korošec, and Barbara Koroušić Seljak. A Novel Approach to Statistical Comparison of Meta-
Heuristic Stochastic Optimization Algorithms Using Deep Statistics. Information Sciences, 417:186 – 215, 2017.
Tome Eftimov, Gašper Petelin, and Peter Korošec. Dsctool: A web-service-based framework for statistical comparison
of stochastic optimization algorithms. Applied Soft Computing, 87:105977, 2020.
Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger H. Hoos, and Kevin
Leyton-Brown. Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters. In
NIPS Workshop on Bayesian Optimization in Theory and Practice, volume 10, December 2013.
Ágoston Endre Eiben and Márk Jelasity. A Critical Note on Experimental Research Methodology in EC. In Proceed-
ings of the 2002 IEEE Congress on Evolutionary Computation, volume 1, pages 582 – 587. IEEE, 2002.
Ágoston Endre Eiben and Selmar K Smit. Evolutionary Algorithm Parameters and Methods to Tune Them. In
Autonomous search, pages 15 – 36. Springer, 2011.
Ágoston Endre Eiben and James E Smith. Introduction to Evolutionary Computing. Natural Computing. Springer,
2 edition, 2015.
Álvaro Fialho, Luı́s Da Costa, Marc Schoenauer, and Michèle Sebag. Analyzing bandit-based adaptive operator
selection mechanisms. Annals of Mathematics and Artificial Intelligence, 60:25 – 64, 2010.
Steffen Finck, Nikolaus Hansen, Raymond Ros, and Anne Auger. COCO Documentation, Release 15.03, November
2015. URL http://coco.lri.fr/COCOdoc/COCO.pdf.
Andreas Fischbach and Thomas Bartz-Beielstein. Improving the reliability of test functions generators. Applied
Soft Computing, 92:106315, 2020. ISSN 1568-4946. doi:https://doi.org/10.1016/j.asoc.2020.106315. URL http:
//www.sciencedirect.com/science/article/pii/S1568494620302556.
Philip J. Fleming and John J. Wallace. How not to lie with statistics: The correct way to summarize benchmark
results. Commun. ACM, 29(3):218–221, March 1986.
Roger Fletcher. Conjugate gradient methods for indefinite systems. In Numerical analysis, pages 73–89. Springer,
1976.
Alexandre Fréchette, Lars Kotthoff, Tomasz Michalak, Talal Rahwan, Holger H. Hoos, and Kevin Leyton-Brown.
Using the shapley value to analyze algorithm portfolios. In Proc. of the Thirtieth AAAI Conference on Artificial
Intelligence, pages 3397 —- 3403. AAAI Press, 2016.
Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal
of the american statistical association, 32(200):675–701, 1937.
Marcus Gallagher. Towards improved benchmarking of black-box optimization algorithms using clustering problems.
Soft Computing, 20(10):3835–3849, March 2016. doi:10.1007/s00500-016-2094-1. URL https://doi.org/10.1007/
s00500-016-2094-1.
Salvador Garcı́a, Daniel Molina, Manuel Lozano, and Francisco Herrera. A study on the use of non-parametric tests
for analyzing the evolutionary algorithms’ behaviour: a case study on the cec’2005 special session on real parameter
optimization. Journal of Heuristics, 15(6):617, 2009.
Carlos Garcı́a-Martı́nez, Francisco J. Rodrı́guez, and Manuel Lozano. Arbitrary function optimisation with meta-
heuristics: No free lunch and real-world problems. Soft Computing, 16(12):2115–2133, 2012. doi:10.1007/s00500-
012-0881-x.
Robert W. Garden and Andries P. Engelbrecht. Analysis and classification of optimisation benchmark func-
tions and benchmark suites. In 2014 IEEE Congress on Evolutionary Computation (CEC). IEEE, July 2014.
doi:10.1109/cec.2014.6900240. URL https://doi.org/10.1109/cec.2014.6900240.
Ian P. Gent and Toby Walsh. How not to do it. In AAAI Workshop on Experimental Evaluation of Reasoning and
Search Methods, 1994.
Ian P. Gent, Stuart A. Grant, Ewen MacIntyre, Patrick Prosser, Paul Shaw, Barbara M. Smith, and Toby Walsh.
How not to do it. Technical Report 97.27, School of Computer Studies, University of Leeds, May 1997.
Sim Kuan Goh, Kay Chen Tan, Abdullah Al-Mamun, and Hussein A. Abbass. Evolutionary big optimiza-
tion (BigOpt) of signals. In 2015 IEEE Congress on Evolutionary Computation (CEC). IEEE, May 2015.
doi:10.1109/cec.2015.7257307. URL https://doi.org/10.1109/cec.2015.7257307.
David E Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading
MA, 1989.
Bruce L Golden, Arjang A Assad, Edward A Wasil, and Edward Baker. Experimentation in optimization. European
Journal of Operational Research, 27(1):1–16, 1986.
Raphael T. Haftka. Requirements for papers focusing on new or improved global optimization algorithms. Structural
and Multidisciplinary Optimization, 54(1):1–1, 2016.
Doug Hains, L. Darrell Whitley, Adele E. Howe, and Wenxiang Chen. Hyperplane Initialized Local Search for
MAXSAT. In Proc. of the Genetic and Evolutionary Computation Conference, pages 805 – 812. ACM, 2013.
Nikolaus Hansen. Invariance, self-adaptation and correlated mutations in evolution strategies. In Proc. of Interna-
tional Conference on Parallel Problem Solving from Nature, pages 355–364. Springer, 2000. ISBN 978-3-540-45356-7.
Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized
evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 11(1):1–18, 2003.
Nikolaus Hansen, Anne Auger, Steffen Finck, and Raymond Ros. Real-parameter black-box optimization bench-
marking: Experimental setup. Technical report, Université Paris Sud, INRIA Futurs, Équipe TAO, Orsay, France,
March 24, 2012. URL http://coco.lri.fr/BBOB-downloads/download11.05/bbobdocexperiment.pdf.
Nikolaus Hansen, Anne Auger, Dimo Brockhoff, Dejan Tušar, and Tea Tušar. COCO: performance assessment.
CoRR, abs/1605.03560, 2016a. URL http://arxiv.org/abs/1605.03560.
Nikolaus Hansen, Anne Auger, Olaf Mersmann, Tea Tušar, and Dimo Brockhoff. COCO: A Platform for Comparing
Continuous Optimizers in a Black-Box Setting. arXiv preprint, abs/1603.08785v3, August 2016b. URL http:
//arxiv.org/abs/1603.08785v3.
Nikolaus Hansen, Youhei Akimoto, yoshihikoueno, Dimo Brockhoff, and Matthew Chan. CMA-ES/pycma: r3.0.3,
2020. URL https://doi.org/10.5281/zenodo.3764210.
Anna Hart. Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ,
323(7309):391–393, 2001.
Jörg Heitkötter and David Beasley. The hitch-hiker’s guide to evolutionary computation, 1994.
Michael Hellwig and Hans-Georg Beyer. Benchmarking Evolutionary Algorithms For Single-Objective Real-valued
Constrained Optimization – A Critical Review. Swarm and Evolutionary Computation, 44:927–944, 2019.
John N Hooker. Needed: An Empirical Science of Algorithms. Operations research, 42(2):201 – 212, 1994.
John N. Hooker. Testing heuristics: We have it all wrong. Journal of Heuristics, 1(1):33–42, 1996.
doi:10.1007/BF02430364.
Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm
configuration framework. Journal of Artificial Intelligence Research, 36:267–306, October 2009.
Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm
Configuration. In International Conference on Learning and Intelligent Optimization, pages 507 – 523. Springer,
2011.
Frank Hutter, Manuel López-Ibáñez, Chris Fawcett, Marius Thomas Lindauer, Holger H. Hoos, Kevin Leyton-Brown,
and Thomas Stützle. AClib: a benchmark library for algorithm configuration. In Panos M. Pardalos, Mauricio
G. C. Resende, Chrysafis Vogiatzis, and Jose L. Walteros, editors, Learning and Intelligent Optimization, 8th
International Conference, LION 8, volume 8426 of Lecture Notes in Computer Science, pages 36–40. Springer,
Heidelberg, Germany, 2014. doi:10.1007/978-3-319-09584-4 4.
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges.
Springer, 2019.
Alexandre D. Jesus, Arnaud Liefooghe, Bilel Derbel, and Luı́s Paquete. Algorithm Selection of Anytime Algorithms.
In Proc. of the 2020 Genetic and Evolutionary Computation Conference, pages 850 – 858. ACM, 2020.
Yaochu Jin and Jürgen Branke. Evolutionary Optimization in Uncertain Environments – A Survey. IEEE Transac-
tions on Evolutionary Computation, 9(3):303 – 317, 2005.
David S Johnson, Cecilia R Aragon, Lyle A McGeoch, and Catherine Schevon. Optimization by Simulated Annealing:
An Experimental Evaluation. Part I, Graph Partitioning. Operations Research, 37(6):865 – 892, 1989.
David S Johnson, Cecilia R Aragon, Lyle A McGeoch, and Catherine Schevon. Optimization by Simulated Annealing:
An Experimental Evaluation. Part II, Graph Coloring and Number Partitioning. Operations Research, 39(3):378 –
406, 1991.
David Stifler Johnson. A Theoretician’s Guide to the Experimental Analysis of Algorithms. In Proc. of a DIMACS
Workshop on Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation
Challenges, volume 59 of DIMACS – Series in Discrete Mathematics and Theoretical Computer Science, pages 215–
250, 2002a.
David Stifler Johnson. A Theoretician’s Guide to the Experimental Analysis of Algorithms. Data Structures, Near
Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, 59:215 – 250, 2002b.
David Stifler Johnson and Lyle A. McGeoch. Experimental Analysis of Heuristics for the STSP. In The Traveling
Salesman Problem and its Variations, volume 12 of Combinatorial Optimization, chapter 9, pages 369 – 443. Kluwer
Academic Publishers, 2002.
Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box
functions. Journal of Global Optimization, 13(4):455–492, 1998.
Giorgos Karafotias, Mark Hoogendoorn, and Ágoston E. Eiben. Parameter control in evolutionary algorithms: Trends
and challenges. IEEE Transactions on Evolutionary Computation, 19(2):167–187, April 2015.
Stuart A. Kauffman. The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press,
USA, 1993.
James Kennedy and Russell Eberhart. Particle swarm optimization. In Neural Networks, 1995. Proceedings., IEEE
International Conference on, volume 4, pages 1942–1948. IEEE, 1995.
Pascal Kerschke and Heike Trautmann. Automated Algorithm Selection on Continuous Black-Box Problems by
Combining Exploratory Landscape Analysis and Machine Learning. Evolutionary Computation (ECJ), 27(1):
99 – 127, 2019a.
Pascal Kerschke and Heike Trautmann. Comprehensive Feature-Based Landscape Analysis of Continuous and Con-
strained Optimization Problems Using the R-package flacco. In Applications in Statistical Computing, pages
93 – 123. Springer, 2019b.
Pascal Kerschke, Jakob Bossek, and Heike Trautmann. Parameterization of State-of-the-Art Performance Indicators:
A Robustness Study based on Inexact TSP Solvers. In Proceedings of the Genetic and Evolutionary Computation
Conference Companion, pages 1737 – 1744. ACM, 2018a.
Pascal Kerschke, Lars Kotthoff, Jakob Bossek, Holger H. Hoos, and Heike Trautmann. Leveraging TSP Solver
Complementarity through Machine Learning. Evolutionary Computation (ECJ), 26(4):597 – 620, December 2018b.
Pascal Kerschke, Holger H. Hoos, Frank Neumann, and Heike Trautmann. Automated Algorithm Selection: Survey
and Perspectives. Evolutionary Computation (ECJ), 27:3 – 45, 2019.
Scott Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
Jack P. C. Kleijnen. Analyzing Simulation Experiments with Common Random Numbers. Management Science, 34
(1):65 – 74, 1988.
Jack P. C. Kleijnen. Experimental Design for Sensitivity Analysis of Simulation Models. Workingpaper, Operations
Research, 2001.
Jack P. C. Kleijnen. Design and Analysis of Simulation Experiments. In International Workshop on Simulation,
pages 3–22. Springer, 2015.
Jack P. C. Kleijnen. Regression and Kriging Metamodels with their Experimental Designs in Simulation: A Review.
European Journal of Operational Research, 256(1):1–16, 2017.
Lasse Kliemann and Peter Sanders. Algorithm Engineering: Selected Results and Surveys, volume 9220. Springer,
2016.
William H Kruskal and W Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American
statistical Association, 47(260):583–621, 1952.
Abhishek Kumar, Guohua Wu, Mostafa Z. Ali, Rammohan Mallipeddi, Ponnuthurai Nagaratnam Suganthan, and
Swagatam Das. Guidelines for real-world single-objective constrained optimisation competition. Technical report,
2020.
Pedro Larrañaga and José A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Compu-
tation. Kluwer Academic Publishers, 2002.
Per Kristian Lehre and Carsten Witt. Black-box search by unbiased variation. Algorithmica, 64:623–642, 2012.
Howard Levene. Robust tests for equality of variances. Contributions to probability and statistics. Essays in honor
of Harold Hotelling, pages 279–292, 1961.
Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel
Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18:185:1–185:52,
2017.
Rui Li, Michael T. M. Emmerich, Jeroen Eggermont, Ernst G. P. Bovenkamp, Thomas Bäck, Jouke Dijkstra, and
Johan H. C. Reiber. Mixed-Integer NK Landscapes. In Proc. of Parallel Problem Solving from Nature, pages 42–51.
Springer, 2006.
Tianjun Liao, Krzysztof Socha, Marco A. Montes de Oca, Thomas Stützle, and Marco Dorigo. Ant colony optimization
for mixed-variable optimization problems. IEEE Transactions on Evolutionary Computation, 18(4):503–518, 2014.
Harold R Lindman. Analysis of variance in complex experimental designs. WH Freeman & Co, 1974.
Jialin Liu, Antoine Moreau, Mike Preuss, Baptiste Rozière, Jérémy Rapin, Fabien Teytaud, and Olivier Teytaud.
Versatile black-box optimization. CoRR, abs/2004.14014, 2020. URL https://arxiv.org/abs/2004.14014.
Qunfeng Liu, William V. Gehrlein, Ling Wang, Yuan Yan, Yingying Cao, Wei Chen, and Yun Li. Paradoxes in
Numerical Comparison of Optimization Algorithms. IEEE Transactions on Evolutionary Computation, pages
1–15, 2019.
Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Thomas Stützle, and Mauro Birattari. The
irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58,
2016. doi:10.1016/j.orp.2016.09.002.
Katherine Mary Malan and Andries Petrus Engelbrecht. A Survey of Techniques for Characterising Fitness Land-
scapes and Some Possible Ways Forward. Information Sciences (JIS), 241:148 – 163, 2013.
Deborah G Mayo and Aris Spanos. Severe testing as a basic concept in a neyman–pearson philosophy of induction.
The British Journal for the Philosophy of Science, 57(2):323–357, 2006.
Kent McClymont and Ed Keedwell. Benchmark Multi-Objective Optimisation Test Problems with Mixed Encodings.
In Proceedings of the 2011 IEEE Congress on Evolutionary Computation, pages 2131 – 2138. IEEE, 2011.
James McDermott. When and why metaheuristics researchers can ignore ”no free lunch” theorems. SN Computer
Science, 1(60):1–18, 2020. doi:10.1007/s42979-020-0063-3.
Catherine C McGeoch. Experimental Analysis of Algorithms. PhD thesis, Carnegie Mellon University, Pittsburgh
PA, 1986.
Catherine C McGeoch. Toward an experimental method for algorithm simulation. INFORMS Journal on Computing,
8(1):1–15, 1996.
Michael D. McKay, Richard J. Beckman, and William J. Conover. A Comparison of Three Methods for Selecting
Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 42(1):55 – 61, 2000.
Olaf Mersmann, Mike Preuss, and Heike Trautmann. Benchmarking Evolutionary Algorithms: Towards Exploratory
Landscape Analysis. In Proc. of International Conference on Parallel Problem Solving from Nature, pages 73 – 82.
Springer, 2010.
Olaf Mersmann, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and Günter Rudolph. Exploratory
Landscape Analysis. In Proc. of the 13th Annual Conference on Genetic and Evolutionary Computation, pages
829 – 836. ACM, 2011.
Olaf Mersmann, Bernd Bischl, Heike Trautmann, Markus Wagner, Jakob Bossek, and Frank Neumann. A Novel
Feature-Based Approach to Characterize Algorithm Performance for the Traveling Salesperson Problem. Annals
of Mathematics and Artificial Intelligence, 69(2):151 – 182, 2013.
Nenad Mladenović and Pierre Hansen. Variable neighborhood search. Computers & Operations Research, 24(11):1097–1100, 1997. ISSN 0305-0548. doi:10.1016/S0305-0548(97)00031-2. URL https://doi.org/10.1016/S0305-0548(97)00031-2.
Douglas C Montgomery. Design and Analysis of Experiments. John Wiley & Sons, 9th edition, 2017.
Jorge J Moré and Stefan M Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Opti-
mization, 20(1):172–191, 2009.
Jorge J Moré, Burton S Garbow, and Kenneth E Hillstrom. Testing Unconstrained Optimization Software. ACM Transactions on Mathematical Software (TOMS), 7(1):17–41, 1981.
Mario Andrés Muñoz Acosta, Michael Kirley, and Saman K. Halgamuge. Exploratory Landscape Analysis of Continuous Space Optimization Problems Using Information Content. IEEE Transactions on Evolutionary Computation (TEVC), 19(1):74–87, 2015a.
Mario Andrés Muñoz Acosta, Yuan Sun, Michael Kirley, and Saman K. Halgamuge. Algorithm Selection for Black-Box Continuous Optimization Problems: A Survey on Methods and Challenges. Information Sciences (JIS), 317:224–245, 2015b.
H. Mühlenbein and G. Paaß. From recombination of genes to the estimation of distributions I. Binary parameters. In
Hans-Michael Voigt, Werner Ebeling, Ingo Rechenberg, and Hans-Paul Schwefel, editors, Proc. of Parallel Problem
Solving from Nature (PPSN’96), pages 178–187. Springer, 1996. ISBN 978-3-540-70668-7.
Matthias Müller-Hannemann and Stefan Schirra. Algorithm Engineering: Bridging the Gap Between Algorithm
Theory and Practice. Springer, 2010.
Mario A. Muñoz and Kate A. Smith-Miles. Performance Analysis of Continuous Black-Box Optimization Algorithms
via Footprints in Instance Space. Evolutionary Computation, 25(4):529–554, December 2017.
Dima Nazzal, Mansooreh Mollaghasemi, H Hedlund, and A Bozorgi. Using Genetic Algorithms and an Indifference-Zone Ranking and Selection Procedure Under Common Random Numbers for Simulation Optimisation. Journal of Simulation, 6(1):56–66, 2012.
John A Nelder and Roger Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
Aneta Neumann, Wanru Gao, Markus Wagner, and Frank Neumann. Evolutionary diversity optimization using
multi-objective indicators. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO
’19, pages 837–845, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361118.
doi:10.1145/3321707.3321796. URL https://doi.org/10.1145/3321707.3321796.
Frank Neumann and Carsten Witt. Bioinspired Computation in Combinatorial Optimization – Algorithms and Their
Computational Complexity. Springer, 2010.
Trung Thanh Nguyen, Shengxiang Yang, and Jürgen Branke. Evolutionary dynamic optimization: A survey of the
state of the art. Swarm and Evolutionary Computation, 6:1–24, October 2012. doi:10.1016/j.swevo.2012.05.001.
URL https://doi.org/10.1016/j.swevo.2012.05.001.
Brian A. Nosek, Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor. The preregistration revolution.
Proceedings of the National Academy of Sciences, 115(11):2600–2606, March 2018. ISSN 0027-8424, 1091-6490.
doi:10.1073/pnas.1708274114.
Randal S Olson and Jason H Moore. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on Automatic Machine Learning, pages 66–74, 2016.
Randal S Olson, William La Cava, Patryk Orzechowski, Ryan J Urbanowicz, and Jason H Moore. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):1–13, 2017.
Patryk Orzechowski, William La Cava, and Jason H Moore. Where are we now? A large benchmark study of recent
symbolic regression methods. In Proceedings of the Genetic and Evolutionary Computation Conference, pages
1183–1190, 2018.
Patryk Orzechowski, Franciszek Magiera, and Jason H Moore. Benchmarking manifold learning methods on a large
collection of datasets. In European Conference on Genetic Programming (Part of EvoStar), pages 135–150. Springer,
2020.
Tom Packebusch and Stephan Mertens. Low autocorrelation binary sequences. Journal of Physics A: Mathematical
and Theoretical, 49(16):165001, 2016. doi:10.1088/1751-8113/49/16/165001. URL https://doi.org/10.1088/1751-8113/49/16/165001.
Ingo Paenke, Jürgen Branke, and Yaochu Jin. Efficient search for robust solutions by means of evolutionary algorithms and fitness approximation. IEEE Transactions on Evolutionary Computation, 10(4):405–420, 2006.
Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data
Engineering, 22(10):1345–1359, Oct 2010.
Fortunato Pesarin. Multivariate Permutation Tests: With Applications in Biostatistics, volume 240. Wiley, Chichester, 2001.
Robin L Plackett and J Peter Burman. The design of optimum multifactorial experiments. Biometrika, 33(4):305–325,
1946.
Sergey Polyakovskiy, Mohammad Reza Bonyadi, Markus Wagner, Zbigniew Michalewicz, and Frank Neumann. A
Comprehensive Benchmark Set and Heuristics for the Traveling Thief Problem. In Proc. of the 2014 Annual
Conference on Genetic and Evolutionary Computation, pages 477–484. ACM, 2014. ISBN 9781450326629.
Karl Raimund Popper. The Logic of Scientific Discovery. Hutchinson & Co, 2 edition, 1959.
Karl Raimund Popper. Objective Knowledge: An Evolutionary Approach. Oxford University Press, 1975.
Kenneth V. Price. Differential Evolution vs. The Functions of the 2nd ICEO. In Proc. of the IEEE International Conference on Evolutionary Computation, pages 153–157. IEEE, 1997.
Jeremy Rapin and Olivier Teytaud. Nevergrad - A gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad, 2018.
Howard Harry Rosenbrock. An Automatic Method for Finding the Greatest or Least Value of a Function. The Computer Journal, 3(3):175–184, 1960.
Jonathan Rowe and Michael Vose. Unbiased black box search algorithms. In Proc. of Genetic and Evolutionary
Computation Conference, pages 2035 – 2042. ACM, 2011.
Ranjit K Roy. Design of experiments using the Taguchi approach: 16 steps to product and process improvement. John
Wiley & Sons, 2001.
Ragav Sachdeva, Frank Neumann, and Markus Wagner. The dynamic travelling thief problem: Benchmarks and
performance of evolutionary algorithms, 2020.
Thomas J Santner, Brian J Williams, and William I Notz. The Design and Analysis of Computer Experiments, volume 1. Springer, 2003.
Hans-Paul Schwefel. Evolutionsstrategie und numerische Optimierung. PhD thesis, Technische Universität Berlin,
Fachbereich Verfahrenstechnik, Berlin, Germany, 1975.
David F Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.
Samuel Sanford Shapiro and Martin B Wilk. An analysis of variance test for normality (complete samples).
Biometrika, 52(3/4):591–611, 1965.
David J Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 2003.
Yuhui Shi and Russell Eberhart. A Modified Particle Swarm Optimizer. In Proc. of the 1998 IEEE International
Conference on Evolutionary Computation, within the IEEE World Congress on Computational Intelligence, pages
69–73. IEEE, 1998.
Ofer M. Shir, Carola Doerr, and Thomas Bäck. Compiling a Benchmarking Test-Suite for Combinatorial Black-Box
Optimization: A Position Paper. In Proc. of Genetic and Evolutionary Computation Conference, pages 1753–1760.
ACM, 2018.
Urban Škvorc, Tome Eftimov, and Peter Korošec. Understanding the Problem Space in Single-Objective Numerical
Optimization Using Exploratory Landscape Analysis. Applied Soft Computing (ASOC), 90:106138, 2020.
Kate Smith-Miles and Simon Bowly. Generating new test instances by evolving in instance space. Computers
& Operations Research, 63:102–113, November 2015. doi:10.1016/j.cor.2015.04.022. URL https://doi.org/10.1016/j.cor.2015.04.022.
Kate Smith-Miles and Thomas T. Tan. Measuring Algorithm Footprints in Instance Space. In 2012 IEEE Congress
on Evolutionary Computation. IEEE, 2012.
Krzysztof Socha and Marco Dorigo. Ant colony optimization for continuous domains. European Journal of Operational Research, 185(3):1155–1173, 2008.
Jörg Stork, A. E. Eiben, and Thomas Bartz-Beielstein. A new taxonomy of global optimization algorithms, Nov
2020. ISSN 1572-9796. URL https://doi.org/10.1007/s11047-020-09820-4.
El-Ghazali Talbi. Metaheuristics: From Design to Implementation. John Wiley & Sons Inc., July 2009. ISBN
978-0-470-27858-1.
Ryoji Tanabe and Hisao Ishibuchi. An Easy-to-Use Real-World Multi-Objective Optimization Problem Suite. Applied
Soft Computing (ASOC), 89:106078, 2020.
Shigeyoshi Tsutsui, Ashish Ghosh, and Yoshiji Fujimoto. A Robust Solution Searching Scheme in Genetic Search. In
International Conference on Parallel Problem Solving from Nature, pages 543–552. Springer, 1996.
John Wilder Tukey. Exploratory Data Analysis, volume 2. Addison-Wesley, Reading, MA, 1977.
Tea Tušar, Dimo Brockhoff, and Nikolaus Hansen. Mixed-integer benchmark problems for single- and bi-objective
optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 718–726. ACM,
2019.
Tsuyoshi Ueno, Trevor David Rhone, Zhufeng Hou, Teruyasu Mizoguchi, and Koji Tsuda. COMBO: an efficient
Bayesian optimization library for materials science. Materials Discovery, 4:18–21, 2016.
Niki Veček, Marjan Mernik, and Matej Črepinšek. A chess rating system for evolutionary algorithms: A new method
for the comparison and ranking of evolutionary algorithms. Information Sciences, 277:656–679, 2014.
Vanessa Volz, Boris Naujoks, Pascal Kerschke, and Tea Tušar. Single- and Multi-Objective Game-Benchmark for
Evolutionary Algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 647–655. ACM, 2019.
Wouter Vrielink and Daan van den Berg. Fireworks algorithm versus plant propagation algorithm. In IJCCI, pages
101–112, 2019.
Tobias Wagner. A subjective review of the state of the art in model-based parameter tuning. In Thomas Bartz-
Beielstein, Marco Chiarandini, Luis Paquete, and Mike Preuss, editors, Workshop on Experimental Methods for
the Assessment of Computational Systems (WEMACS 2010), Algorithm Engineering Report, pages 1–13. TU
Dortmund, Faculty of Computer Science, Algorithm Engineering (Ls11), 2010.
Hao Wang, Diederick Vermetten, Furong Ye, Carola Doerr, and Thomas Bäck. IOHanalyzer: Performance analysis
for iterative optimization heuristic. CoRR, abs/2007.03953, 2020. URL https://arxiv.org/abs/2007.03953. A
benchmark data repository is available through the web-based GUI at iohprofiler.liacs.nl/.
Jann Michael Weinand, Kenneth Sörensen, Pablo San Segundo, Max Kleinebrahm, and Russell McKenna. Research
trends in combinatorial optimisation. arXiv e-prints, art. arXiv:2012.01294, December 2020.
Thomas Weise. jsspinstancesandresults: Results, data, and instances of the job shop scheduling problem, 2019.
URL http://github.com/thomasWeise/jsspInstancesAndResults/. A meta-study of 145 algorithm setups from
literature on the JSSP.
Thomas Weise and Zijun Wu. Difficult features of combinatorial optimization problems and the tunable W-Model
benchmark problem for simulating them. In Proceedings of the Genetic and Evolutionary Computation Conference
Companion, GECCO ’18, pages 1769–1776. ACM, 2018. ISBN 9781450357647.
Thomas Weise, Raymond Chiong, Ke Tang, Jörg Lässig, Shigeyoshi Tsutsui, Wenxiang Chen, Zbigniew Michalewicz,
and Xin Yao. Benchmarking optimization algorithms: An open source framework for the traveling salesman
problem. IEEE Computational Intelligence Magazine (CIM), 9:40–52, August 2014.
L. Darrell Whitley, Soraya B. Rana, John Dzubera, and Keith E. Mathias. Evaluating Evolutionary Algorithms.
Artificial Intelligence (AIJ), 85(1–2):245–276, 1996.
L. Darrell Whitley, Jean-Paul Watson, Adele Howe, and Laura Barbulescu. Testing, Evaluation and Performance of
Optimization and Learning Systems. In Ian C. Parmee, editor, Adaptive Computing in Design and Manufacture
V, pages 27–39, London, 2002. Springer.
Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
David H. Wolpert and William G. Macready. No Free Lunch Theorems for Optimization. IEEE Transactions on
Evolutionary Computation, 1(1):67–82, April 1997.
Guohua Wu, Rammohan Mallipeddi, and Ponnuthurai Nagaratnam Suganthan. Problem definitions and evaluation
criteria for the CEC 2017 competition on constrained real-parameter optimization. Technical report, National
University of Defense Technology, Changsha, Hunan, PR China and Kyungpook National University, Daegu,
South Korea and Nanyang Technological University, Singapore, September 2017.
Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. SATzilla: portfolio-based algorithm selection for
SAT. Journal of Artificial Intelligence Research, 32:565–606, June 2008.