A Genetic Algorithm Approach for the Multidimensional Two-Way Number Partitioning Problem
Learning and Intelligent Optimization
7th International Conference, LION 7
Catania, Italy, January 7–11, 2013
Revised Selected Papers
Lecture Notes in Computer Science 7997
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Editors
Giuseppe Nicosia
Department of Mathematics and Computer Science
University of Catania
Catania, Italy

Panos Pardalos
Department of Industrial and Systems Engineering
University of Florida
Gainesville, FL, USA
We had more submissions than ever this year; each manuscript was independently
reviewed by at least three members of the Programme Committee in a blind review
process. These proceedings therefore contain 49 research articles, written by leading
scientists in the field from 47 different countries on 5 continents, describing an
impressive array of ideas, technologies, algorithms, methods and applications in
Optimization and Machine Learning.
We couldn’t have organized this conference without these researchers, so we thank
them all for coming. We also couldn’t have organized LION without the excellent
work of all of the Programme Committee members, the session chair Giovanni
Stracquadanio, the publicity chair, and the chair of the local Organizing Committee,
Patrizia Nardon.
We would like to express our appreciation to the keynote and tutorial speakers who
accepted our invitation, and to all authors who submitted research papers to LION 2013.
1 Introduction
Optimization involves the process of finding one or more solutions which correspond
to the minimization or maximization of one or more objectives. In a
single-objective optimization problem, a single optimal solution is sought for a
single objective function, whereas in a multi-objective optimization (MOO) problem
the optimization involves more than one objective function. In most MOO
problems, especially those
found in the real world, these objective functions are in conflict with each other. Thus,
seeking one single best solution that optimizes all of them simultaneously is impos-
sible, because improving one objective would deteriorate the others [1]. This scenario
gives rise to a set of optimal compromise (trade-off) solutions, widely known as
Pareto-optimal solutions. The so-called Pareto Front (PF) in the objective space
consists of solutions in the Pareto-optimal solution set.
Despite the existence of multiple trade-off solutions, in most cases, only one of
them will be chosen as the solution for implementation, for example, in a product or
system design. Therefore, two equally important tasks are usually involved in solving
an MOO problem: (1) searching for the PF solution set, so that the decision-maker can
acquire an idea on the extent and nature of the trade-off among the objectives, and
(2) choosing a particular preferred solution from the Pareto-optimal set. While the first
task is computationally intensive, and can be fully automated using an MOO algo-
rithm, the second task usually necessitates a manual decision-making process using
the preference information of the decision-maker. It is interesting to note that an MOO
problem can easily be converted into a single-objective optimization problem, by
formulating a weighted-sum objective function which is composed of the multiple
objectives, so that a single trade-off optimal solution can effectively be sought.
However, the major drawback is that the trade-off solution obtained by using this
procedure is very sensitive to the relative preference vector. Therefore, the choice of
the preference weights, and thus the obtained trade-off solution, is highly subjective,
varying with the particular decision-maker. Firstly, without detailed knowledge
about the product or system under study, selecting an appropriate preference
vector can be very difficult. Secondly, converting an MOO problem into a simplistic
single-objective problem puts decision-making ahead of knowing the best possible
trade-offs. In other words, by generating multiple trade-off solutions, an MOO
procedure supports decision-making better than a single-objective optimization
procedure. On one hand, the decision-maker is provided with multiple "optimal"
(or, more precisely, near-optimal) alternatives for consideration before making the
final choice. On the other hand, since these solutions are "high-performing" with
respect to at least one objective, conducting an analysis that answers "What makes
these solutions optimal?" can provide the decision-maker with very important
information, or knowledge, which cannot be obtained if only one single solution is
sought in the optimization task. The idea of deciphering knowledge, or knowledge
discovery, using the post-optimality analysis of Pareto-optimal solutions from an
MOO, was first proposed by Deb and Srinivasan [2]. They coined the term
innovization (innovation via optimization) to describe the task of discovering the
salient common principles present in the Pareto-optimal solutions, in order to obtain
deeper knowledge/insights regarding the behavior/nature of the problem. The
innovization task employed in earlier publications involved the manual identification
of the important relationships among decision variables and objectives that are
common to the obtained trade-off solutions. Recent studies have shown that using data
mining (DM) techniques to enable innovization procedures to be performed
automatically [3, 4] can be promising for various engineering problems. In these
innovization tasks, the efficient evolutionary multi-objective optimization (EMO)
algorithm, NSGA-II [5], has been applied to generate the Pareto-optimal solutions.
As argued in [14], a simulation-based innovization (SBI) process does not deviate
much from a standardized process model of an ordinary knowledge discovery in
databases (KDD) process. The major
difference, when comparing SBI with a KDD process, is that the data used for the
pattern detection is not from a data source containing historical data, but from
experimental data generated from simulation-based multi-objective optimization
(SMO). Simulation is a common approach to solving industrial problems. The level
of detail of the simulation model can vary from a conceptual model to one with
detailed operation logic, but stochastic processes are used in almost all production
simulation models to capture variability, e.g., due to machine failures. It is this
challenge, imposed by the stochastic simulation commonly found in production
system models, that motivates the design of distance-based DM approaches
specifically for these types of innovization applications.
[Fig. 1 plot: dominated solutions, non-dominated solutions, and the interpolant in the min(f1)–min(f2) plane.]
Fig. 1. Color representing the minimum Euclidean distance of solutions to the interpolation
curve in a 2-objective MOO problem.
Minimize max_{i=1,...,M} [wi (fi(x) − zi)]
where wi is the ith component of a chosen weight vector used for scalarizing the
objectives, and zi is the ith component of the reference point (RP). If the
decision-maker is interested in biasing some objectives more than others, a
suitable weight vector can be used with the selected RP.
The use of RP to guide an EMO algorithm was proposed earlier in [20] and later in
[21] and [22]. Here, we are interested in investigating the use of RP for the decision-
maker to specify the preference for the distance-based SBI analysis. This concept is
illustrated in Fig. 2. For a two-objective problem, a Pareto-optimal solution can be
found by solving the above achievement scalarizing function (ASF), using the RP and
weight vector supplied by the decision-maker. Therefore, the Euclidean distances for
all solutions with respect to this so-called ASF solution can be calculated as shown by
the color scale in Fig. 2.
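
To make the selection concrete, here is a minimal sketch (an illustration, not the authors' implementation) that picks the ASF solution from a set of objective vectors and ranks all solutions by Euclidean distance to it. It assumes minimized objectives stored row-wise in a NumPy array, with z the reference point and w the weight vector.

import numpy as np

def asf_distances(F, z, w):
    # F: (n_solutions, M) objective values; z: reference point; w: weights
    asf = np.max(w * (F - z), axis=1)      # achievement scalarizing value per solution
    asf_solution = F[np.argmin(asf)]       # the so-called ASF solution
    return np.linalg.norm(F - asf_solution, axis=1), asf_solution

# Example: three solutions, RP at the origin, equal weights
F = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
distances, asf_sol = asf_distances(F, np.array([0.0, 0.0]), np.array([1.0, 1.0]))
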
[Fig. 2 plot: all solutions and the ASF solution in the min(f1)–min(f2) plane.]
Fig. 2. Color representing the minimum Euclidean distance of solutions to the ASF point.
[Fig. 3 schematic, components: Simulation; Multi-Objective Optimization; Knowledge Discovery; Visualization; Decision Making; with flows labelled all solutions, decision rules, rules visualizations, significant variables, partial preference, partial Pareto frontier, verification, and rules as constraints.]
rule, which is represented by two parts: an antecedent set of conditions and the
consequent averaged regression value (rv). The elements in the antecedent set of
conditions consist of a design vector (x1, …, xn) and its corresponding values (v1, …,
vn), linked by operators (op1, …, opn), as shown in the following form:
Rule j: IF (x1 op1 v1) AND ... AND (xn opn vn) THEN rv = d
For a rule to be "interesting", rv = d (where d is the predicted Euclidean distance)
must be sufficiently small. Therefore, all nodes are checked in order to determine
all the rules with d below a certain threshold of interestingness. Such a threshold,
which ensures the high interestingness of the selected rules, can be predetermined
or determined at a later
stage by the decision-maker in conjunction with the visualization of the extracted
rules.
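
As an illustration of this filter, the sketch below uses a hypothetical rule encoding (the triple-based data structure is an assumption, not the authors' representation): rules whose predicted distance d lies below the threshold are kept, and a rule's antecedent can be tested against a design vector.

import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq, ">": operator.gt}

def interesting_rules(rules, threshold):
    # rules: list of (conditions, d); conditions: list of (attribute, op, value)
    return [(conds, d) for conds, d in rules if d < threshold]

def rule_matches(conditions, design):
    # design: dict mapping each attribute to its value in a solution
    return all(OPS[op](design[attr], val) for attr, op, val in conditions)
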
In general, since EMO algorithms do not use any mathematical optimality con-
ditions in their operators, the solutions obtained after a finite number of computations
are not guaranteed to be optimal, although an asymptotic convergence for EMOs with
certain restrictions has been proven in the past [24]. To enhance the convergence
properties of EMO algorithms, one common approach is to first apply an EMO and the
solutions obtained are then modified one at a time by using a local search procedure.
Although this hybrid method is commonly employed, the overall computational effort
needed to execute the local search for each EMO solution can be burdensome.
Therefore, an alternative strategy, which involves the hybrid use of MOO and DM,
also schematically illustrated in Fig. 3, is proposed in this paper. First, an MOO is run
to generate a data set of sufficient size for the innovization study using DM tech-
niques. Since the derived rules can reveal salient properties present in the MOO
solutions close to the preferred solution selected by the decision-maker, the rules
obtained can then be used to re-define the original MOO problem, so that
a faster convergence can be achieved, compared to a single application of EMO to the
original problem.
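
The strategy can be summarized by the schematic sketch below, in which run_moo, mine_rules and refine are caller-supplied stand-ins (placeholders, not a real API) for the optimization, knowledge-discovery and problem-refinement steps of Fig. 3.

def interleaved_innovization(problem, run_moo, mine_rules, refine, n_cycles=3):
    # Each cycle: optimize, extract distance-based rules, re-define the problem
    solutions, rules = [], []
    for _ in range(n_cycles):
        solutions = run_moo(problem)       # MOO / SMO run
        rules = mine_rules(solutions)      # distance-based DM step
        problem = refine(problem, rules)   # extracted rules become constraints
    return solutions, rules
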
With the introduction of the distance-based approach based on the preference region
selected by the decision-maker, it is believed that faster and "better" optimization can
be achieved in an interactive and interleaved manner. Interactive means that the
decision-maker can select, and subsequently change, the preferred region by choosing
different reference points. Interleaved means that several iterations of the
MOO-DM cycle can be repeated in order to obtain important rules with respect to the
preference of the decision-maker. In addition, the efficiency of the optimization can
simultaneously be enhanced by having the optimization converge faster to the preferred
region. While there is a vast amount of literature on interactive multi-objective opti-
mization (see e.g. [25]), most of the existing methods in Multi-criteria Decision Making
(MCDM) literature aim to assist the user in selecting the best solution through some
interaction processes. Related visualization techniques in [26, 27] are also targeted to
help the decision-maker analyze the Pareto solutions, particularly in the objective space,
but not the relationships between decision variables and objectives. The approach
proposed by Greco et al. [28], in which decision rules are used to represent user pref-
erences and also to describe the Pareto front, is very relevant to this current work.
Similar to the concept proposed here, the Dominance-based Rough Set Approach
(DRSA) described in that article can also be used for the progressive exploration of the
Pareto optimal set, which is interesting from the point of view of the decision-maker’s
preferences as well as using the extracted rules to refine the original constraints of the
optimization problem. Nevertheless, there are some key contrasts between SBI and
other approaches like DRSA in the MCDM literature. In practical problems like pro-
duction systems engineering, decision-makers usually have a strong interest in the
decision space. While they indicate their preference in the objective space, they would
prefer to acquire information about the values of the decision variables that put the
solutions closer to their targeted region in the objectives. The preference model using,
e.g., the pairwise comparison of solutions is also less suitable for this type of problem
with stochastic continuous objectives, because a large number of solutions can be close
to the preferred solution. To illustrate the applicability of the proposed interleaved
SBI approach to practical production system problems, we focus on results from
studies of real-world production optimization problems rather than on theoretical
benchmark functions.
The application case study presented here was part of an industrial research
project conducted with several automotive manufacturers in Sweden. Many
industrial production systems engineering problems can be effectively formulated
as optimization problems, because their aim is to find the optimal setting,
through re-allocations of workforce, work-in-process and/or buffers, to improve the
performance of the system [29]. One objective of the project is to verify and evaluate
the combined use of SMO and innovization, as a new and innovative manufacturing
management toolset to support decision-making [30]. For the SMO and the sub-
sequent SBI methodology to be useful in solving real-world manufacturing decision-
making problems, the optimization of production systems via simulation models that
take into account both productivity and environmental factors related to the decision-
making process is essential. In other words, formulating the optimization objectives
related to both productivity and environmental issues, such as energy efficiency, was
the first step.
The production system considered in this industrial case study is a truck gears
machining cell comprising seven machines connected by conveyors and industrial
robots. The cell produces five types of truck gears with different cycle (processing)
times on different machines and is usually staffed by two operators who perform manual
tasks including tool changes, setups, and measurements. In the case study, some new
processing plans involving gear machining and cutting-time changes had been carried
out, which required the cell to be re-configured. The goal of the study was therefore to
investigate how changes in capacities, conveyor (buffer) sizes, setup sequences,
operator utilization, and planning schemes would affect the productivity and energy
efficiency cost, in order to optimally re-configure the cell. Figure 4 schematically
illustrates the product flow of the cell and its major components. There are five lathes
(L1-5) and two parallel milling machining centers (M1-2). Work-pieces flow from the
raw material inventory (RMI) to L1, L2 and L3 through some long conveyors (CB1-3)
which are not only used for material handling, but also serve as in-process buffers for
temporarily storing the work-pieces. The buffer sizes are therefore determined by the
lengths of the conveyors. An identical type of conveyor buffer is located before L5
(CB5). Operators are located in two regions and they serve different machine groups.
As later indicated, the number of workers in each region (W1 or W2) is also a decision
variable in the MOO study.
A discrete-event simulation model of the production line was developed for the
machining cell and used in the simulation-based optimization of this study. Readers
are referred to [31] for the full details of the cell and the simulation model.
[Fig. 4 diagram: RMI, conveyors CB1–CB3 and CB5, lathes L1–L5, milling centers M1–M2, and FGI.]
Fig. 4. Schematic illustration of the components and product flow of the machining cell.
[Fig. 5 plot: 3D scatter of the solutions over Work in process, Energy usage (×10^5), and Production.]
buffer sizes between workstations, contribute to controlling the final average WIP of
the whole cell, which is logical and not being shown here.
Apart from the visualization of the solutions generated from the MOO, we used
the ASF-based distance SBI data mining process to extract decision rules from the
entire MOO data set. In this study, the decision-maker chose the reference point
(RP) to be [WIP = 35, EnergyUsage = 550000 (kW), Production = 5700]. This is
an ideal performance indicator: it combines relatively low WIP and low energy
usage with the highest possible total production, which also reflects a typical
decision strategy of production managers. This RP is highlighted as the red dot
in Fig. 5. A set of weight vectors was tried with this RP, each projecting to a
different non-dominated solution that was presented to the decision-maker:
Fig. 6. 4D plot visualizing how key decision variables divide the solutions into clusters.
extracted may conclude that the other decision variables, i.e., CB3, W1 and W2,
have less influence with respect to the ASF point chosen by the decision-maker.
The accuracy of the extracted rule can be partially verified by highlighting the
solutions that satisfy the rule "LS = 1 ∧ CB1 < 5.61 ∧ CB2 < 5.43 ∧ PS = 2 ∧
CB5 < 1.705" in the objective space together with all solutions, as shown in
Fig. 7 (the ASF point is indicated by the blue dot).
[Fig. 7 plots: Energy usage vs. Work in process, Production vs. Work in process, and Production vs. Energy usage (×10^5).]
Fig. 7. 2D plots visualizing how close the solutions represented by the rule are to the ASF
point for the problem with 3 objectives.
shown as rules in Table 4. By applying the last rule with the shortest average distance
to ASF [1, 10, 1], the optimization problem was re-defined as shown in Table 5.
Results from re-running the optimization with 1000 solutions are plotted in Fig. 9,
showing very clearly that the new solutions are focused towards the selected alter-
native ASF point.
[Fig. 8 plot: 3D scatter over Work in process, Energy usage (×10^5), and Production.]
Fig. 8. Solutions from the second optimization run (blue triangles) with the refined constraints;
grey solutions from the previous optimization; red dot = [36, 576840.3, 5585].
[Fig. 9 plot: 3D scatter over Work in process, Energy usage (×10^5), and Production.]
Fig. 9. Solutions from the optimization run (blue triangles) with the refined constraints based
on the newly selected rule; grey solutions from the first optimization; red dot = [36, 551462.1, 5342].
4 Conclusions
References
1. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. Wiley, Chichester
(2004)
2. Deb, K., Srinivasan, A.: Innovization: innovating design principles through optimization.
In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation,
Seattle, USA, July 2006, pp. 1629–1636 (2006)
3. Bandaru, S., Deb, K.: Automated discovery of vital knowledge from Pareto optimal
solutions: first results from engineering design. In: IEEE Congress on Evolutionary
Computation, CEC ’10, pp. 1–8 (2010)
4. Bandaru, S., Deb, K.: Towards automating the discovery of certain innovative design
principles through a clustering based optimization technique. Eng. Optim. 43(9), 911–941
(2011)
5. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective genetic
algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 181–197 (2002)
6. Obayashi, S., Sasaki, D.: Visualization and data mining of Pareto solutions using self-
organizing map. In: Fonseca, C.M., Fleming, P.J., Zitzler, E., Deb, K., Thiele, L. (eds.)
EMO 2003. LNCS, vol. 2632, pp. 796–809. Springer, Heidelberg (2003)
7. Jeong, S., Chiba, K., Obayashi, S.: Data mining for aerodynamic design space. J. Aerosp.
Comput. Inf. Commun. 2(11), 452–469 (2005)
8. Sugimura, K., Obayashi, S., Jeong, S.: Multi-objective design exploration of a centrifugal
impeller accompanied with a vaned diffuser. In: Proceedings of FEDSM2007, 5th Joint
ASME/JSME Fluids Engineering Conference, San Diego, USA, 30 July–2 August,
pp. 939–946 (2007)
9. Oyama, A., Nonomura, T., Obayashi, S.: Data mining of Pareto optimal transonic airfoil
shapes using proper orthogonal decomposition. In: Proceedings of 19th AIAA
Computational Fluid Dynamics, San Antonio, USA, 22–25 June, pp. 1514–1523 (2009)
10. Liebscher, M., Witowski, K., Goel, T.: Decision making in multi-objective optimization for
industrial applications – data mining and visualization of Pareto data. In: The 8th World
Congress on Structural and Multidisciplinary Optimization, Lisbon, Portugal, 1–5 June
(2009)
11. Ng, A.H.C., Deb, K., Dudas, C.: Simulation-based innovization for production systems
improvement: an industrial case study. In: Proceedings of the 3rd International Swedish
Production Symposium, Göteborg, Sweden, December 2009, pp. 278–286 (2009)
12. Dudas, C., Frantzén, M., Ng, A.H.C.: A synergy of multi-objective optimization and data
mining for the analysis of a flexible flow shop. Robot. Comput. Integr. Manuf. 27(4),
687–695 (2011)
13. Ng, A.H.C., Dudas, C., Nießen, J., Deb, K.: Simulation-based innovization using data
mining for production systems analysis. In: Wang, L., Ng, A., Deb, K. (eds.) Evolutionary
Multi-objective Optimization in Product Design and Manufacturing, pp. 401–430. Springer,
London (2011)
14. Ng, A.H.C., Dudas, C., Pehrsson, L., Deb, K.: Knowledge discovery in production
simulation by interleaving multi-objective optimization and data mining. In: Proceedings of
the 5th Swedish Production Symposium (SPS’12), Linköping, Sweden, 6–8 November
2012, pp. 461–471 (2012)
15. Deb, K., Datta, R.: Hybrid evolutionary multiobjective optimization and analysis of
machining operations. Eng. Optim. 44(6), 685–706 (2012)
1 Preliminary Discussion
In a currently ongoing project, we investigate a new possibility for solving the
minimum labelling spanning tree (MLST) problem by an intelligent optimization
algorithm. The MLST problem is a challenging combinatorial problem [1]. Given
an undirected graph with labelled (or coloured) edges as input, where each edge
is assigned a single label and a label may be assigned to one or more edges, the
goal of the MLST problem is to find a spanning tree with the minimum number
of labels (or colours).
The MLST problem can be formally formulated as a network or graph problem
[2]. We are given a labelled connected undirected graph G = (V, E, L), where
V is the set of nodes, E is the set of edges, and L is the set of labels. The purpose
is to find a spanning tree T of G such that |LT| is minimized, where LT is the
set of labels used by the edges of T.
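
A label set C ⊆ L is a feasible solution exactly when the subgraph restricted to edges whose labels are in C connects all nodes; any spanning tree of that subgraph then uses only labels from C. The sketch below, which assumes edges given as (u, v, label) triples, tests this with a union-find pass; it is an illustration rather than code from the project.

def n_components(nodes, edges, C):
    # Count connected components of the subgraph induced by label set C
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for u, v, label in edges:
        if label in C:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return len({find(v) for v in nodes})

def is_solution(nodes, edges, C):
    return n_components(nodes, edges, C) == 1
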
Given the set S of all possible solutions, we define the distance between any two
such solutions C1, C2 ∈ S as the Hamming distance: ρ(C1, C2) = |C1 Δ C2| =
Σ_{i=1}^{φ} λi, where λi = 1 if label i is included in one of the solutions but not
in the other, and 0 otherwise, ∀i = 1, ..., φ. VNS starts from an initial solution C,
with k increasing from 1 up to the maximum neighbourhood size, kmax, during
the progressive execution.
The basic idea of VNS, changing the neighbourhood structure when the search
is trapped at a local minimum, is implemented by the shaking phase. It consists
of the random selection of another point in the neighbourhood Nk(C) of the
current solution C. Given C, we consider its kth neighbourhood, Nk(C), as all
the different sets having a Hamming distance from C equal to k labels, where
k = 1, 2, ..., kmax. In order to construct the neighbourhood of a solution C, the
algorithm first proceeds with the deletion of labels from C. In other words, given
a solution C, its kth neighbourhood, Nk(C), consists of all the different sets
obtained from C by removing k labels, where k = 1, 2, ..., kmax. More formally,
given a solution C, its kth neighbourhood is defined as Nk(C) = {S ⊂ L :
|C Δ S| = k}, where k = 1, 2, ..., kmax.
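
With label sets represented as Python sets (an assumed encoding), the distance and the removal-based neighbourhood construction can be sketched as follows.

from itertools import combinations

def hamming(C1, C2):
    return len(C1 ^ C2)                    # |C1 Δ C2|, the symmetric difference

def neighbourhood(C, k):
    # All sets obtained from C by removing k labels (the kth neighbourhood)
    return [C - set(drop) for drop in combinations(C, k)]
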
The iterative process of selection of a new incumbent solution from the com-
plementary space of the current solution if no improvement has occurred, is
aimed at increasing the diversification capability of the basic VNS for the MLST
problem. When the local search is trapped at a local minimum, COMPL extracts
a feasible complementary solution which lies in a very different zone of the search
domain, and is set as the new incumbent solution for the local search. This new
starting point allows the algorithm to escape from the local minimum where it
is trapped, producing an immediate peak of diversification.
with BestC being the current best solution, and |BestC | its number of labels.
This cooling law is very fast for the MLST problem, yielding a good balance
between intensification and diversification. Furthermore, thanks to its self-tuning
parameter setting, which is guided automatically by the best solution BestC
without requiring any user-intervention, the algorithm is allowed to adapt on-
line to the problem instance explored and to react in response to the search
algorithm’s behavior [12].
The aim of the probabilistic local search is to allow, with a specified probability,
worse labels, i.e., those yielding a higher number of connected components, to
be added to incomplete solutions. The probability value assigned to each label
is inversely proportional to the number of components it yields: labels yielding
fewer connected components have a higher probability of being chosen, while
labels yielding more connected components have a lower probability, so that
choosing less promising labels remains possible. Moreover, these differences in
probabilities increase step by step as the temperature of the adaptive cooling
schedule is reduced: the difference between the probabilities of two labels yielding
different numbers of components grows as the algorithm proceeds. The probability
of a label with a high number of components decreases as the algorithm runs
and tends to zero. In this sense, the search becomes MVCA-like.
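
One plausible realization of this behaviour (the paper's exact probability formula is not reproduced here, so the Boltzmann weighting below is an assumption) assigns each candidate label a weight that decays with the number of connected components it leaves, sharpened as the temperature T drops; in the limit T → 0 the choice becomes greedy, i.e., MVCA-like.

import math, random

def choose_label(candidates, components, T):
    # candidates: list of labels; components[l]: component count after adding l
    m = min(components[l] for l in candidates)
    weights = [math.exp((m - components[l]) / T) for l in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
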
A simple VNS implementation which uses the probabilistic local search as
a constructive heuristic has been tested. However, the best results were obtained
by combining Complementary Variable Neighbourhood Search with the proba-
bilistic local search, resulting in the hybrid intelligent algorithm that we propose.
Note that the probabilistic local search is applied both in COMPL, to obtain a
solution from the complementary space of the current solution, and in the inner
local search phase, to restore feasibility by adding labels to incomplete solutions.
References
1. Chang, R.S., Leu, S.J.: The minimum labelling spanning trees. Inf. Process. Lett.
63(5), 277–282 (1997)
2. Consoli, S., Darby-Dowman, K., Mladenović, N., Moreno-Pérez, J.A.: Greedy ran-
domized adaptive search and variable neighbourhood search for the minimum
labelling spanning tree problem. Eur. J. Oper. Res. 196(2), 440–449 (2009)
3. Xiong, Y., Golden, B., Wasil, E.: A one-parameter genetic algorithm for the min-
imum labelling spanning tree problem. IEEE Trans. Evol. Comput. 9(1), 55–60
(2005)
4. Krumke, S.O., Wirth, H.C.: On the minimum label spanning tree problem. Inf.
Process. Lett. 66(2), 81–85 (1998)
5. Xiong, Y., Golden, B., Wasil, E.: Improved heuristics for the minimum labelling
spanning tree problem. IEEE Trans. Evol. Comput. 10(6), 700–703 (2006)
6. Cerulli, R., Fink, A., Gentili, M., Voß, S.: Metaheuristics comparison for the mini-
mum labelling spanning tree problem. In: Golden, B.L., Raghavan, S., Wasil, E.A.
(eds.) The Next Wave on Computing, Optimization, and Decision Technologies,
pp. 93–106. Springer-Verlag, New York (2005)
7. Brüggemann, T., Monnot, J., Woeginger, G.J.: Local search for the minimum label
spanning tree problem with bounded colour classes. Oper. Res. Lett. 31, 195–201
(2003)
8. Chwatal, A.M., Raidl, G.R.: Solving the minimum label spanning tree problem by
ant colony optimization. In: Proceedings of the 7th International Conference on
Genetic and Evolutionary Methods (GEM 2010), Las Vegas, Nevada (2010)
9. Consoli, S., Moreno-Pérez, J.A.: Solving the minimum labelling spanning tree
problem using hybrid local search. In: Proceedings of the mini EURO Conference
XXVIII on Variable Neighbourhood Search (EUROmC-XXVIII-VNS), Electronic
Notes in Discrete Mathematics, vol. 39, pp. 75–82 (2012)
10. Hansen, P., Mladenović, N.: Variable neighbourhood search: principles and appli-
cations. Eur. J. Oper. Res. 130, 449–467 (2001)
11. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science 220(4598), 671–680 (1983)
12. Osman, I.H.: Metastrategy simulated annealing and tabu search algorithms for the
vehicle routing problem. Ann. Oper. Res. 41, 421–451 (1993)
A Constraint Satisfaction Approach
to Tractable Theory Induction
1 Introduction
Inductive Logic Programming (ILP) is a branch of machine learning that
represents knowledge using predicate logic. Examples are generalized into a theory
that covers all positive and no negative examples [1,2]. Thus, positive examples
provide observables, and negative examples provide instances of what should
never be observed. The induced theories are typically given in Prolog code (first
order Horn clauses), which has both declarative and procedural interpretations
[3,4]. Thus ILP is simultaneously capable of first order concept learning and
Turing complete program synthesis. For a summary of ILP as a research field,
and its applications, we refer the reader to [1].
State of the art ILP systems—such as Progol and Aleph—use the method
of Inverse Entailment to compute a bottom clause from a positive example,
background knowledge, and an optional set of mode declarations specifying
the input-output requirements of predicates [5].
The bottom clause is intended to be a most specific clause for the example
(relative to background knowledge and mode declarations) and is used to con-
strain the search space by providing a bottom element in the lattice generated
by the subsumption order [2,6,7].
Example 1. If the bottom clause is h(X, Y) ← b1(X, Z), b2(Z, Y), there are four
candidate solutions, given by (ordered) subsets of the bottom clause's literals:
h(X, Y) (the top element), h(X, Y) ← b1(X, Z), h(X, Y) ← b2(Z, Y), and the
bottom clause itself. In general, there are 2^n candidates whenever the bottom
clause has n body literals.
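
The candidate space of Example 1 can be enumerated directly, as in the short sketch below (literals encoded as strings purely for illustration): every subset of the bottom clause's body literals yields one candidate.

from itertools import combinations

def candidates(head, body):
    # Yield (head, body-subset) pairs: 2^len(body) candidates in total
    for r in range(len(body) + 1):
        for subset in combinations(body, r):
            yield (head, list(subset))

bottom_body = ["b1(X,Z)", "b2(Z,Y)"]
print(list(candidates("h(X,Y)", bottom_body)))  # 4 candidates for n = 2
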
There are other approaches to ILP using constraints. Constraint logic pro-
gramming with ILP [9] converts negative examples into constraints in order to
deal with numerical constraints. Theta-subsumption with constraint satisfac-
tion [10] speeds up subsumption testing [2,6,7] using constraints. Our approach
differs from these in that our constraints are propositional and explicitly used
to construct candidate solutions by describing existential relations between the
literals of bottom clauses.
We call candidates that satisfy all constraints valid. Candidates that are not
valid are said to be invalid.
We have tested our framework using two particular constraints: mode dec-
larations (mode constraints) and search space pruning (pruning constraints),
which we present next.
and mode declarations specifying that for each literal, the last argument is an
output variable and all other arguments are input variables, we see that b4
requires both b1 and b2 (to instantiate C and D), b3 requires b1 (b3 also needs
C), and instantiating h’s output variable B means at least one of b3 and b4 must
be present. As propositional clauses, this is b4 → b1 , b4 → b2 , b3 → b1 , and
b3 ∨ b4 . One model is {b1 , b3 }, which specifies the candidates containing precisely
the first and third body literals: h(A, B) ← b1 (A, C), b3 (C, B). The constraints
prevent the invalid candidate h(A, B) ← b1 (A, C) from being generated: intu-
itively, because it does not instantiate B; logically, because the constraint b3 ∨ b4
is not satisfied.
must occur for bi to occur. Moreover, for each output Z of the head (and where
it does not also occur as input), we find all body literals where Z occurs as
output, bZ1, ..., bZk. Since at least one of these body literals must be present to
instantiate Z, we add the clause bZ1 ∨ ... ∨ bZk.
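
A hedged sketch of this clause generation follows; the data layout, with body literals as names and ins/outs giving each literal's input and output variables, is an assumption. On the running example it reproduces the clauses b4 → b1, b4 → b2, b3 → b1, and b3 ∨ b4 in CNF form.

def mode_clauses(head_ins, head_outs, ins, outs):
    clauses = []
    for b, in_vars in ins.items():
        for v in in_vars:
            if v in head_ins:
                continue                   # variable supplied by the head
            providers = [p for p, o in outs.items() if v in o and p != b]
            # implication b -> (provider1 or ... or providerk), as a CNF clause
            clauses.append([("not", b)] + [("pos", p) for p in providers])
    for z in head_outs:
        producers = [b for b, o in outs.items() if z in o]
        clauses.append([("pos", b) for b in producers])  # at least one producer
    return clauses

# The running example: head h(A,B) with body literals b1..b4
print(mode_clauses({"A"}, {"B"},
                   {"b1": {"A"}, "b2": {"A"}, "b3": {"C"}, "b4": {"C", "D"}},
                   {"b1": {"C"}, "b2": {"D"}, "b3": {"B"}, "b4": {"B"}}))
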
3 Experimental Results
We benchmark NrSample against a best-first and enumeration search on 9
problems. Best-first search is an implementation of Progol's well-established
A∗ algorithm [5], whereas enumeration search is the default search method in
the state-of-the-art Aleph ILP system.
First, we borrow some concept learning problems from the Progol distribu-
tion1 : animal, train, grammar, member, and append. Second, we add our own
problems: sorted, sublist, reverse, and add.2 Moreover, append is benchmarked
1 Available at http://www.doc.ic.ac.uk/~shm/Software/progol4.4/
2 Available with our source code distribution upon request.
at three difficulty levels, which we call append1, append2, and append3. Sim-
ilarly, add is benchmarked as add1 and add2. The problems are made more
difficult by altering default settings. By default, an upper limit of i = 3 iterations
is used in bottom clause construction; at most n = 200 candidates are
then explored. The maximum number of body literals in any body candidate is
c = 4. For append1, we use c = 3; for append3 and add2, we use c = ∞.
Table 1 shows the average execution time over 30 trials for NrSample (TS ),
best-first search (TH ), and enumeration (TE ). All benchmarks were performed
on an Intel Dual Core i5 (2 × 2.30 GHz) with 4 GB RAM. NrSample out-
performed best-first on all experiments. Compared to enumeration, NrSample
performed approximately equal for the smaller problems (member, animal,
grammar, sorted, sublist), but much better for the larger (reverse, add1, add2,
all versions of append). The only exception is train, where enumeration outper-
formed NrSample. We put an execution time limit of 10 minutes per run on all
tests. This limit was reached in append2 for best-first. In append3 and add2,
both best-first and enumeration timed out.
The growing performance difference between NrSample and its competing
algorithms for larger problems, ranging from no speedup to factors of more than
1200 for append3 and 16000 for add2, can be explained as follows: a more complex
problem is characterized by a longer bottom clause, which makes the search
space grow exponentially larger (in the number of bottom clause literals, see
Example 1). As the search space grows larger, there will be more ways of violating
the input-output specification given by the mode declarations. The percentage
of valid candidates during best-first and enumeration search is given by VH and
VE in Table 1, respectively. As can be seen, the probability of blindly generating
a valid candidate approaches zero for the larger tests, having less than 1 in 1000
valid candidates for append2, append3, and add2.3 By contrast, with NrSample,
100 % of the candidates are valid.
3 We could not measure the exact proportion for the tests that timed out, but it is
estimated to be even less than for the easier variants, thus always less than 0.1 %.
4 Conclusions
We have presented NrSample, a novel framework for theory induction using
constraints. To the best of our knowledge, we are the first to use constraints to
construct valid candidates rather than to eliminate already constructed candi-
dates. Our benchmarks indicate that, as the problems become larger, it becomes
increasingly beneficial to use mode and pruning constraints, with observed
increases of four orders of magnitude.
Given that top-down and bottom-up algorithms also prune the search space,
pruning constraints do not by themselves provide an advantage for NrSample.
Its SAT solving overhead can only be compensated for by providing additional
(non-pruning) constraints. We used mode constraints, but other alternatives may
be to use domain specific knowledge. We plan to do more research into using
other kinds of constraints within our framework.
Acknowledgments. The work described in this paper was supported by a grant from
the Research Grants Council of the Hong Kong Special Administrative Region, China
[Project No. CityU 124409].
References
1. Muggleton, S., Raedt, L.D., Poole, D., Bratko, I., Flach, P.A., Inoue, K., Srinivasan,
A.: ILP turns 20 - biography and future challenges. Mach. Learn. 86(1), 3–23 (2012)
2. Nienhuys-Cheng, S.H., de Wolf, R.: Foundations of Inductive Logic Programming.
Springer-Verlag New York Inc., Secaucus (1997)
3. Blackburn, P., Bos, J., Striegnitz, K.: Learn Prolog Now!. College Publications,
London (2006)
4. Sterling, L., Shapiro, E.: The art of Prolog: advanced programming techniques,
2nd edn. MIT Press, Cambridge (1994)
5. Muggleton, S.H.: Inverse entailment and progol. New Gener. Comput. 13, 245–286
(1995)
6. Plotkin, G.D.: A note on inductive generalization. Mach. Intell. 5, 153–163 (1970)
7. Plotkin, G.D.: A further note on inductive generalization. Mach. Intell. 6, 101–124
(1971)
8. Lavrac, N., Dzeroski, S.: Inductive Logic Programming: Techniques and Applica-
tions. Ellis Horwood, New York (1994)
9. Sebag, M., Rouveirol, C.: Constraint inductive logic programming. In: De Raedt,
L. (ed.) Advances in ILP. IOS Press, Amsterdam (1996)
10. Maloberti, J., Sebag, M.: Fast theta-subsumption with constraint satisfaction algo-
rithms. Mach. Learn. 55(2), 137–174 (2004)
11. Harrison, J.: Handbook of Practical Logic and Automated Reasoning. Cambridge
University Press, Cambridge (2009)
12. Davis, M., Logemann, G., Loveland, D.: A machine program for theorem-proving.
Commun. ACM 5, 394–397 (1962)
13. Moskewicz, M.W., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: engineer-
ing an efficient SAT solver. In: Proceedings of the 38th Annual Design Automation
Conference, DAC ’01, pp. 530–535. ACM, New York (2001)
Features for Exploiting Black-Box Optimization
Problem Structure
1 Introduction
This paper tackles the challenge of crafting a set of features that can capture the
structure of black-box optimization (BBO) problem fitness landscapes for use
in portfolio algorithms. BBO problems involve the minimization of an objective
function f (x1 , . . . , xn ), subject to the constraints li ≤ xi ≤ ui , over the variables
xi ∈ R, ∀1 ≤ i ≤ n. These types of problems are found throughout the scientific
and engineering fields, but are difficult to solve due to their oftentimes expensive
objective functions. This complexity can arise when the objective involves
expressions that are difficult to compute, or that are too complicated to be defined
by a simple mathematical expression. Even though BBO algorithms do not
guarantee the discovery of the optimal solution, they are an effective tool for finding
approximate solutions. However, different BBO algorithms vary greatly in per-
formance across a set of problems. Thus, deciding which solver to apply to a
particular problem is a difficult task.
Portfolio algorithms, such as Instance Specific Algorithm Configuration
(ISAC), which uses a clustering approach to identify groups of similar instances,
provide a way to automatically choose a solver for a particular BBO instance using
offline learning. However, such methods require a set of features that consolidate
Yuri Malitsky is partially supported by the EU FET grant ICON (project 284715).
Kevin Tierney is supported by the Danish Council for Strategic Research as part of
the ENERPLAN project.
the relevant attributes of a BBO instance into a vector that can then be used for
learning. The only way to generate these features for BBO problems is by eval-
uating expensive queries to the black box, which contrasts with most non-black-
box problems, e.g. SAT or the set covering problem, where many features can be
quickly inferred from the problem definition itself.
In this paper, we propose a novel set of features that are fast to compute and
are descriptive enough of the instance structure to allow a portfolio algorithm like
ISAC to accurately cluster and tune for the benchmark. These features are based
on well-known fitness landscape measures and are learned through sampling the
black box. They allow for the analysis and classification of BBO problems so that
anybody can take advantage of the recent advances in the ISAC framework in
order to more efficiently solve their BBO problems. This paper is a short version
of [1].
Related Work. There has been extensive research studying the structure of
BBO problems, and copious measures have been proposed for determining the
hardness of local search problems by sampling their fitness landscape [9], such
as the search space diameter, optimal solution density/distribution [6], fitness-
distance correlation (FDC) [10], the correlation length [16,19], epistasis mea-
sures [14], information analysis [17], modality and neutrality measures [15], and
fitness-distance analysis [13]. Difficulty measures for BBO problems in particu-
lar were studied by [8], who concluded that, in the worst case, building predictive
difficulty measures for BBO problems cannot be done in polynomial time.1
Most recently, Watson introduced several cost models for combinatorial land-
scapes in order to try to understand why certain algorithms perform well on
certain landscapes [18].
In [12], the authors identify six “low-level feature classes” to classify BBO
problems into groups. In [4], algorithm selection for BBO problems is considered
with a focus on minimizing the cost of incorrect algorithm selections, unlike our
approach, which minimizes a score based on the penalized expected runtime.
Our approach also differs from online methods [5] and reactive techniques [3]
that attempt to guide algorithms based on information from previously explored
states because ISAC performs all of its work offline.
We evaluate the effectiveness and robustness of our features on a dataset from the
GECCO 2012 Workshop on Black-Box Optimization Benchmarking (BBOB) [2].
The dataset contains the number of evaluations required to find a particular
objective value within some precision on one of 24 continuous, noise-free, opti-
mization functions from [7] in 6 different dimension settings for 27 solvers. The
solvers are all run on the data 15 times, each time with a different target value
set as the artificial global optimum. Note that the BBOB documentation refers
1 Our results do not contradict this, as we are not predicting the hardness of instances.
3 Features
Computing features for BBO problems is difficult because evaluating the
objective function of a BBO problem is expensive, and there is scarce information
about a problem instance in its definition, other than the number of dimensions
and the desired solver accuracy. In the absence of any structure in the problem
definition, we have to sample the fitness landscape. However, such sampling is
expensive, and on our dataset performing more than 600 objective evaluations
removes all benefits of using a portfolio approach. We therefore introduce a set of
10 features that are based on well-studied aspects of search landscapes in the lit-
erature [18]. Our features are drawn from three information sources: the problem
definition, hill climbs, and random points. Table 1 summarizes our features.
The problem definition features contain the desired accuracy of the continuous
variables (Feature 1), and the number of dimensions that the problem has (Fea-
ture 2), which, together, describe the size of the problem.
The hill climbing features are based on a number of hill climbs that are
initiated from random points and continued until a local optimum or a fixed
number of evaluations is reached. We then calculate the average and standard
deviation of the distance between optima (Features 3 and 4), which describes
the density of optima in the landscape. Using the best optimum found, we then
2 Full details about the algorithms are available in [2].
compute the average and standard deviation of the distance between the optima
and the best optimum (Features 5 and 6), using the nearest to each non-best
optimum for these features if multiple optima qualify as the best. Feature 7
describes what percentage of the optima are equal to the best optimum, giving
a picture of how spread out the optima are throughout the landscape.
The random point features 8 and 9 contain the average and standard deviation
of the distance of each random point to the nearest optimum, which describes
the distribution of local optima around the landscape. Feature 10 computes the
fitness-distance correlation, a measure of how effectively the fitness value at a
particular point can guide the search to a global optimum [10]. In feature 10, we
compute an approximation to the FDC.
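
These features condense into a few vectorized computations, as in the sketch below; it assumes NumPy arrays of the sampled optima and random points, omits feature 7, and approximates the FDC with respect to the best optimum found, so it is illustrative rather than the authors' code.

import numpy as np

def landscape_features(optima_x, optima_f, rand_x, rand_f):
    # Features 3-4: mean/std of pairwise distances between local optima
    diffs = optima_x[:, None, :] - optima_x[None, :, :]
    pd = np.linalg.norm(diffs, axis=-1)[np.triu_indices(len(optima_x), k=1)]
    # Features 5-6: distances of the optima to the best optimum found
    best = optima_x[np.argmin(optima_f)]
    d_best = np.linalg.norm(optima_x - best, axis=1)
    # Features 8-9: distance of each random point to its nearest optimum
    d_near = np.min(np.linalg.norm(
        rand_x[:, None, :] - optima_x[None, :, :], axis=-1), axis=1)
    # Feature 10: approximate fitness-distance correlation
    fdc = np.corrcoef(rand_f, np.linalg.norm(rand_x - best, axis=1))[0, 1]
    return (pd.mean(), pd.std(), d_best.mean(), d_best.std(),
            d_near.mean(), d_near.std(), fdc)
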
4 Numerical Results
In this section we describe the results of using our features, in full and in various
combinations, to train a portfolio solver using the ISAC method on the BBOB
2012 dataset. We measure the performance of each solver using a penalized score
that takes into account the relative performance of each solver on an instance.
We do not directly use the expected running time (ERT) value because the
amount of evaluations can vary greatly between instances, and too much focus
would be placed on instances where a large number of evaluations is required.
The penalized score of solver s on an instance i is given by
score(s, i) = (pERT(s, i) − best(i)) / (worst(i) − best(i)), where pERT(s, i) =
ERT(s, i) if s finds the target optimum on i, and 10 · worst(i) otherwise;
best(i) refers to the lowest ERT score on instance i, and worst(i) refers to the
highest non-infinity ERT score on the instance. The penalized ERT therefore
returns ten times the worst ERT on an instance for solvers that were unable to
find the global optimum. We are forced to use a penalized measure because if a
solver cannot solve a particular instance, it becomes impossible to calculate its
performance over the entire dataset.
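
Written as code, the score reads as follows; this is a direct sketch of the formula above, with an unsolved instance signalled by an infinite ERT.

import math

def penalized_score(ert, best, worst):
    # ert: solver's ERT on the instance; best/worst: its finite extremes
    p = ert if math.isfinite(ert) else 10.0 * worst
    return (p - best) / (worst - best)
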
Table 2 shows the results of training and testing ISAC on the BBOB 2012
dataset. For each entry in the table, we run a 10-fold cross validation using
features from each of the 15 BBOB target values. The scores of each of the
cross-validation folds are accumulated for each instance, and the entries in the
table are the average and standard deviation across all instances in the dataset.
Table 2. The average and standard deviation of the scores across all instances for
various minimum cluster sizes, numbers of hill climbs and hill climb lengths for the
best single solver and ISAC using various features.
We compare our results against the best single solver (BSS) on the dataset,
which is the best performing solver across all instances, which is MVDE [11].
We train using several subsets of our features; only feature 1 (F1 ), only feature
2 (F2 ), and only features 1 and 2 (F1,2 ). We then train using all features (All),
and only landscape features (LSF), i.e., features 3 through 10. All* and LSF*
include the evaluations necessary to compute the features, whereas all other
entries do not include the feature computation in the results. We used several
different settings of the number of hill climbs and maximum hill climb length
based on our feature robustness experiments in [1]: 10 hill climbs of maximum
length 10, 50 hill climbs of maximum length 20, and 200 hill climbs of maximum
length 400. The closer a score is to 0 (the score of the virtual best solver) the
better the performance of an approach.
Based on the results for F1, F2 and F1,2, the easy-to-compute BBO features
alone give ISAC only limited information about the dataset, so a landscape
analysis is justified. On the other hand, F2 outperforms BSS; in
fact, F2 performs as well as All and LSF for cluster size 100 with 10 hill climbs
of length 10 and for 50 hill climbs of length 20. In addition, F2 significantly
outperforms All on cluster size 50, where it is clear that it overfits the training
data. This is a clear indication that 10 hill climbs of length 10, or 50 hill climbs
of length 20, do not provide enough information to train ISAC to be competitive
with simply using the number of dimensions of a problem.
The fact that LSF* is able to match the performance of F2 on 10 hill climbs
of length 10 for both cluster sizes 50 and 100 is an important accomplishment. With
so little information learned about the landscape, the fact that ISAC can learn
such an effective model indicates that our features are indeed effective.
Once we move up to 200 hill climbs of length 400, LSF significantly out-
performs F2 , and even outperforms All, which suffers from overfitting. In fact,
LSF is able to cut the total score to under a fourth of BSS’s score, and to one
half of F2 ’s score, indicating that the fitness landscape can indeed be used for
a portfolio. In addition, LSF has a lower standard deviation than BSS. LSF’s
score on the training set of 0.53 and 0.55 on the test set are surprisingly close to
the virtual best solver, which has a score of zero, indicating that ISAC is able to
exploit the landscape features to nearly always choose the best or second best
solver for each instance. On the downside, 200 hill climbs of length 400 requires
too many evaluations to be used in a competitive portfolio, and All* needs 50
times the evaluations of BSS. However, the 200/400 features are still useful for
classifying instances into groups and analyzing the landscape.
References
1. Abell, T., Malitsky, Y., Tierney, K.: Fitness landscape based features for exploit-
ing black-box optimization problem structure. Technical report TR-2012-163, IT
University of Copenhagen (2012)
2. Auger, A., Hansen, N., Heidrich-Meisner, V., Mersmann, O., Posik, P., Preuss, M.:
GECCO 2012 Workshop on Black-Box Optimization Benchmarking (BBOB).
http://coco.gforge.inria.fr/doku.php?id=bbob-2012 (2012)
3. Battiti, R., Brunato, M.: Reactive search optimization: learning while optimizing.
In: Gendreau, M., Potvin, J.-Y. (eds.) Handbook of Metaheuristics, vol. 146, pp.
543–571. Springer, New York (2010)
4. Bischl, B., Mersmann, O., Trautmann, H., Preuß, M.: Algorithm selection based
on exploratory landscape analysis and cost-sensitive learning. In: GECCO’12, pp.
313–320. ACM, New York (2012)
5. Boyan, J., Moore, A.W.: Learning evaluation functions to improve optimization
by local search. J. Mach. Learn. Res. 1, 77–112 (2001)
6. Brooks, C., Durfee, E.: Using landscape theory to measure learning difficulty for
adaptive agents. In: Alonso, E., Kudenko, D., Kazakov, D. (eds.) AAMAS 2000
and AAMAS 2002. LNCS (LNAI), vol. 2636, pp. 291–305. Springer, Heidelberg
(2003)
7. Finck, S., Hansen, N., Ros, R., Auger, A.: Real-parameter black-box optimization
benchmarking 2010: presentation of the noisy functions. Technical report 2009/21,
Research Center PPE (2010)
8. He, J., Reeves, C., Witt, C., Yao, X.: A note on problem difficulty measures in
black-box optimization: classification, realizations and predictability. Evol. Com-
put. 15, 435–443 (2007)
9. Hoos, H.H., Stützle, T.: Stochastic Local Search: Foundations & Applications.
Morgan Kaufmann Publishers Inc., San Francisco (2004)
10. Jones, T., Forrest, S.: Fitness distance correlation as a measure of problem difficulty
for genetic algorithms. In: ICGA-95, pp. 184–192 (1995)
11. Melo, V.V.: Benchmarking the multi-view differential evolution on the noiseless
bbob-2012 function testbed. In: GECCO’12, pp. 183–188. ACM (2012)
12. Mersmann, O., Bischl, B., Trautmann, H., Preuß, M., Weihs, C., Rudolph, G.:
Exploratory landscape analysis. In: GECCO’11, pp. 829–836. ACM (2011)
13. Merz, P., Freisleben, B.: Fitness landscapes, memetic algorithms, and greedy oper-
ators for graph bipartitioning. Evol. Comput. 8, 61–91 (2000)
14. Naudts, B., Kallel, L.: A comparison of predictive measures of problem difficulty
in evolutionary algorithms. IEEE Trans. Evol. Comp. 4(1), 1–15 (2000)
15. Smith, T., Husbands, P., Layzell, P., O’Shea, M.: Fitness landscapes and evolv-
ability. Evol. Comput. 10(1), 1–34 (2002)
16. Stadler, P.F., Schnabl, W.: The landscape of the traveling salesman problem. Phys.
Lett. A 161(4), 337–344 (1992)
17. Vassilev, V.K., Fogarty, T.C., Miller, J.F.: Information characteristics and the
structure of landscapes. Evol. Comput. 8, 31–60 (2000)
18. Watson, J.: An introduction to fitness landscape analysis and cost models for local
search. In: Gendreau, M., Potvin, J. (eds.) Handbook of Metaheuristics, vol. 146,
pp. 599–623. Springer, New York (2010)
19. Weinberger, E.: Correlated and uncorrelated fitness landscapes and how to tell the
difference. Biol. Cybern. 63, 325–336 (1990)
MOCA-I: Discovering Rules and Guiding
Decision Maker in the Context of Partial
Classification in Large and Imbalanced Datasets
1 Introduction
Data mining on real datasets can lead to handling imbalanced data. It occurs
when many attributes are available for each observation, but only a few are
actually entered. This is especially the case with medical data: ICD-10¹, a
medical coding system, allows encoding up to 14,199 diseases and symptoms.
However, in hospital data, for each patient only a very small subset of these codes
will be used: up to 100 symptoms and diseases. This implies that the most frequent
symptoms, like high blood pressure, are found in at best 10 % of the patients. For
less common diseases, like transient ischemic stroke, this figure can drop below
0.5 % of patients. This can also happen with market-basket data: many different
¹ International classification of diseases; http://www.who.int/classifications/icd/en/
38 J. Jacques et al.
items are available in the store but only a few are actually bought by a single
customer. Additionally, more and more information is available and collected
nowadays: algorithms must be able to deal with larger datasets. According to
Fernández et al. dealing with large datasets is still a challenge that needs to be
addressed [1]. This work is a part of the OPCYCLIN project – an industrial
project involving Alicante company, hospitals and academics as partners – that
aims at providing a tool to optimize screening of patients for clinical trials.
Different tasks exist in data mining; this paper focuses on the classification task, useful to predict or explain a given class (e.g., cardiovascular risk) on unseen observations. Classification uses known data, composed of a set of N known observations i1, i2, ..., iN, to build a model. Each observation is described by M attributes a1, a2, ..., aM and a class c. Therefore each observation i is associated with a set of values vi1, vi2, ..., viM, where vij ∈ Vj = {val1, val2, ..., valp}, Vj being the set of possible values for attribute aj. In the same manner, each observation i is associated with a class value ci ∈ C, C being the set of all possible values for the class. A classification algorithm generates a model that describes how to determine cv for an unseen observation v, using its values vv1, vv2, ..., vvM. This paper focuses on models with good interpretability: they allow medical experts to give feedback about them. Decision trees and classification rules give easy-to-interpret models, by generating trees or rules – like “aj = valj and ag = valg ⇒ class”, where valj ∈ Vj, valg ∈ Vg and class ∈ C – using combinations of the attributes a1, a2, ..., aM and one of their possible values to lead to the decision.
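To make this representation concrete, here is a minimal sketch (Python; the names and helper functions are illustrative, not taken from MOCA-I) encoding a rule as a list of attribute-value terms and applying a ruleset to an observation.

```python
# Illustrative sketch (not the authors' implementation): a rule
# "a_j = val_j and a_g = val_g => class" as a list of attribute tests.
from typing import Dict, List, Tuple

Rule = Tuple[List[Tuple[str, str]], str]  # ([(attribute, value), ...], class)

def matches(rule: Rule, observation: Dict[str, str]) -> bool:
    """True if the observation satisfies every term of the rule's condition."""
    terms, _ = rule
    return all(observation.get(attr) == val for attr, val in terms)

def predict(rules: List[Rule], observation: Dict[str, str], default: str) -> str:
    """Return the class of the first matching rule, or a default class."""
    for rule in rules:
        if matches(rule, observation):
            return rule[1]
    return default

# Hypothetical example: "overweight = yes and smoker = yes => high risk".
rule = ([("overweight", "yes"), ("smoker", "yes")], "high_risk")
print(predict([rule], {"overweight": "yes", "smoker": "yes"}, "low_risk"))
```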
Decision trees – like C4.5 [2] or CART (classification and regression trees) – are popular and efficient solutions for knowledge extraction. However, the tree representation is composed of conjunctions, which prevents expressing classes explained by different contexts (e.g., presence of overweight or of high blood pressure implies an increased cardiovascular risk; having both increases the risk further). The separate-and-conquer strategy frequently implemented in tree algorithms often causes rules issued from different contexts to be missed: each sub-tree is constructed using only a subset of the data (observations not corresponding to the top of the tree are removed from learning). To avoid this problem we focus on classification rule mining approaches.
The majority of state-of-the-art classification algorithms have trouble dealing with imbalanced data because they use Accuracy to build their predictive model [1]. Accuracy counts the good classifications obtained by a given algorithm: true positives and true negatives. However, when predicting a class present in only 1 % of the observations, an algorithm can obtain a very good classification Accuracy – 99 % – by predicting every observation as negative (99 % of observations) and missing every positive observation. Some resampling methods exist to pre-process the data and convert it into balanced data; an overview can be found in [3]. Jo and Japkowicz showed that combining data resampling with an algorithm able to deal with class imbalance is more effective than using resampling alone [4]. Moreover, in addition to class imbalance and
This section presents some rule interestingness measures and their meaning. It then presents the 3 objectives that will be used to find rules.
When mining rules, an important question arises: how can we assess that a rule is better than another? Over 38 common rule interestingness measures are referenced by Geng and Hamilton in their review [5], while Ohsaki et al. studied measures used in the medical domain [6] and Greco et al. studied Bayesian confirmation measures [7].
The following contingency table summarizes how a rule C ⇒ P matches the data:

        P     P̄
C       TP    FP
C̄       FN    TN

TP (true positives) counts the observations having both C and P, while TN (true negatives) counts the number of observations not having C and not having P. FN (false negatives) and FP (false positives) count observations on which C and P do not match. When dealing with imbalanced data,

P̄ = FP + TN ≫ P = TP + FN,    (1)

so problems may arise with some measures, as previously seen with Accuracy.
To ease the design of a rule mining algorithm we must focus on a subset of these measures: handling too many measures adds complexity and increases computational time. An analysis of these measures showed that Confidence and Sensitivity,

Confidence = TP / (TP + FP),    Sensitivity = TP / (TP + FN),    (2)

are two interesting complementary measures. Increasing Confidence decreases the number of false positives, while increasing Sensitivity decreases the number of false negatives. However, increasing Confidence often decreases Sensitivity, and vice versa. From the medical point of view, only rules having both good Confidence and good Sensitivity are interesting. Moreover, Bayardo and Agrawal showed that mining rules optimizing both Confidence and Support yields rules optimizing several other measures, including Gain, Chi-squared value, Gini, Entropy gain, Laplace, Lift, and Conviction [8]. Since, in classification, the Sensitivity and Support measures are proportional, optimizing Confidence and Support brings the same rules as optimizing Confidence and Sensitivity.
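A minimal numeric illustration of this point, with hypothetical confusion-matrix counts mimicking a 1 % positive class:

```python
# Minimal sketch: evaluating a rule C => P from confusion-matrix counts.
# Counts below are hypothetical, chosen to mimic a heavily imbalanced class.
def confidence(tp, fp):       # decreases with false positives (Eq. 2)
    return tp / (tp + fp)

def sensitivity(tp, fn):      # decreases with false negatives (Eq. 2)
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# A "classifier" predicting every observation as negative on data with
# 1 % positives: 0 TP, 0 FP, 10 FN, 990 TN.
tp, fp, fn, tn = 0, 0, 10, 990
print(accuracy(tp, fp, fn, tn))   # 0.99 -- looks excellent
print(sensitivity(tp, fn))        # 0.0  -- every positive is missed
```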
When mining variable-length rules, bloat can happen: rules grow endlessly with no predictive enhancement. Because of bloat, a rule R1 : C ⇒ P can turn into R2 : C OR C ⇒ P, then R3 : C OR C OR C ⇒ P, increasing computational time and preventing the algorithm from stopping. Above all, R3 is needlessly complex and harder to interpret than R1. Rissanen introduced the Minimum Description Length (MDL) principle, which can be used to overcome this problem [9]: given two equivalent rules, the simplest must be preferred. Adding one objective promoting simpler rules is a common solution, successfully applied in the work of Reynolds and de la Iglesia [10]. In addition to this, Bacardit et al. used rule deletion operators [11]. In application of this principle, we introduce a third objective to promote simpler rulesets: minimizing the number of terms of each solution. Finally, we choose to find rules optimizing the 3 following objectives:
– maximize Confidence
– maximize Sensitivity
– minimize the number of terms
minimizing the size of each ruleset. Thus, a rule will be added only if it improves the Sensitivity (true positive rate) or the Confidence of the ruleset.
Simplified neighborhood of a term (columns give the current term; the entries below are its candidate neighbor terms):

Term        a = vi        a < vi        a > vi
Neighbors   Ø             Ø             Ø
            a > vi−1      a = vi−1      a = vi+1
            a < vi+1      a < vi−1      a > vi+1
            a = vi−1      a < vi+1      a < vi+2
            a = vi+1      a > vi−2      a > vi−1
On a dataset with mixed (ordered and unordered) attributes (the heart dataset, introduced in the results section), this simplified neighborhood decreases computational time (by 14 % on average) while degrading classification performance only slightly (by less than 1 %). Another optimization of the neighborhood exploration is done on rules with Confidence = 1: adding one term can only decrease Sensitivity, because the obtained rule will be more specific and will concern fewer observations. In this case, we restrict the neighbors to the modification or removal of one random term.
The ROC curve is often used in data mining to assess the performance of classification algorithms, especially ranking algorithms. It is plotted using the true positive rate (TPR, known as Sensitivity) and the false positive rate (FPR, also called 1 − Specificity) as axes and allows comparing algorithms. Fawcett presented different ROC curve usages [18]. In our case, we use the ROC curve to select which rules to keep. Since the objective is to use the developed method in a medical context, it can also help our medical users calibrate the classifier or choose rules using a tool they are familiar with. Algorithm 2 describes how the ROC curve can be generated for a given ruleset. Rules are first ordered from the highest Confidence score to the lowest; rules having the same Confidence are ordered by descending Sensitivity. Then, the TPR and FPR are computed and drawn for each subruleset {R1}, {R1, R2}, ..., {R1, R2, ..., Ri}.
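A sketch of this construction (illustrative Python, assuming rules are already sorted and represented as predicates; this paraphrases the description of Algorithm 2 rather than reproducing it):

```python
# Sketch: rules sorted by (Confidence desc, Sensitivity desc) produce one
# (FPR, TPR) point per subruleset {R1}, {R1,R2}, ..., {R1,...,Ri}.
def roc_points(rules, data, n_pos, n_neg):
    """rules: list of callables obs -> bool, already sorted;
    data: list of (observation, is_positive) pairs."""
    points, covered = [], set()
    for rule in rules:
        for idx, (obs, _) in enumerate(data):
            if rule(obs):
                covered.add(idx)      # cumulative coverage of the subruleset
        tp = sum(1 for i in covered if data[i][1])
        fp = len(covered) - tp
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points

# Hypothetical data: two positives, two negatives.
data = [({"x": 1}, True), ({"x": 1}, False), ({"x": 0}, True), ({"x": 2}, False)]
rules = [lambda o: o["x"] == 1, lambda o: o["x"] >= 0]
print(roc_points(rules, data, n_pos=2, n_neg=2))  # [(0.5, 0.5), (1.0, 1.0)]
```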
with this problem. Figure 1 shows, on the right, one sample ruleset containing rules R1 ... R10, ordered from the highest Confidence to the lowest and, for rules having the same Confidence score, in descending order of Sensitivity. On the left, the matching ROC curve is drawn. Each point on this curve depicts the performance of a subruleset, e.g. {R1, R2, R3}. The higher the point, the more observations of the positive class the subruleset detects; likewise, the further right the point, the more false positives it brings. Consequently, point (0, 1) is the ideal point, where all positive observations are found without bringing any false positive. This figure shows the performance of the ruleset when cut at different places, allowing the decision maker to choose the subruleset giving the most interesting performance according to his needs. On this curve we can see that between point a and point b, and after point c, there is only a small improvement of the true positive rate, but many more false positives are brought in. Performance is more interesting before point a, matching the ruleset R1, R2, R3. The subruleset R1 ... R4 (cut b) does not seem to be a good choice because it brings many more false positives than true positive cases. Point c brings more true positive cases, giving the ruleset R1, R2, ..., R7. Cutting at point d yields a ruleset able to detect all positive observations, R1, R2, ..., R9; keeping rule R10 is useless and would only increase false positives. Depending on how many false positives are tolerable, the cut point can be changed. In a medical context, cut points bringing fewer false positives will be preferred (like cut a). On the contrary, an advertising campaign will accept more false positives in order to reach a larger audience. In fraud detection, the cut point can be moved until a given number of positive observations is found.
Fig. 1. Example of ROC curve obtained from one ruleset. R1 ... R10 is one ruleset obtained after the post-processing. a, b, c and d represent different cuts and their associated positions on the ROC curve.
to determine automatically the best cut point. Thus, we can obtain a classifier with good performance without the intervention of the decision maker. A final ruleset classifier is generated from all the rulesets in the archive, as described in Algorithm 3. After merging all rules into one ruleset, the ROC curve is drawn. The subruleset giving the point closest to the ideal point (0, 1) according to the Euclidean distance is chosen. All rules after this point are removed (or disabled, if we want to allow the decision maker to change the Sensitivity according to his needs). Once this ruleset is obtained, common data mining measures can be computed on the entire ruleset: Confidence, Support, etc. Diverse cut conditions have been tested, but only the one presented above gives classifiers with interesting performance. An improvement of this condition could consist in weighting the true positive rate and the false positive rate according to the decision maker's needs, since false positives can be more or less important than true positives, depending on the context.
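A minimal sketch of this automatic cut, assuming the (FPR, TPR) points of the successive subrulesets are already computed:

```python
import math

def best_cut(points):
    """points[i] is the (FPR, TPR) of subruleset {R1, ..., R_{i+1}};
    returns the number of rules to keep (the rest is removed or disabled)."""
    dist = lambda p: math.hypot(p[0] - 0.0, p[1] - 1.0)  # distance to (0, 1)
    best = min(range(len(points)), key=lambda i: dist(points[i]))
    return best + 1

# Hypothetical curve: the third subruleset is closest to the ideal point.
print(best_cut([(0.0, 0.3), (0.05, 0.6), (0.1, 0.9), (0.5, 0.95)]))  # -> 3
```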
4.1 Protocol
Following the protocol proposed by Fernández et al., our algorithm was run 25 times for each dataset. We use 5-fold cross-validation: datasets are split into 5 parts, each containing 20 % of the observations. Then 4 parts are used for training and 1 for evaluation. As the algorithm contains stochastic components, it was run 5 times for each available partition; so, for stochastic algorithms, we obtain 25 Pareto fronts per dataset. For each partition, solutions are evaluated on both the training and test partitions. In our case, the objective is to maximize the results on test data, because they show the ability of the algorithm to handle unseen data. A discretization of the data was applied with Weka when necessary (weka.filters.unsupervised.attribute.Discretize; bins=10, findNumBins=true) to allow our algorithm to handle datasets containing continuous attributes.
Generally, the Accuracy measure is used to assess classification performance. We saw previously that Accuracy is not effective in handling class imbalance. Therefore Fernández et al. proposed to use the geometric mean of the true rates (GM):

GM = √( TP/(TP + FN) × TN/(FP + TN) ).    (3)

To obtain a good score, a classifier now has to classify both classes correctly: positive and negative. GM has one drawback though: when a classifier is not able to predict one class, the score is 0. Hence, when two classifiers both fail to predict the negative class, there is no difference between a classifier able to find 50 % of the positive observations and another classifier finding 70 % of them: both have a score of 0.
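For reference, a direct transcription of Eq. (3), reusing the all-negative classifier from the Accuracy example:

```python
import math

def geometric_mean(tp, fp, fn, tn):
    """Geometric mean of the true positive and true negative rates (Eq. 3)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (fp + tn) if fp + tn else 0.0
    return math.sqrt(tpr * tnr)

# The all-negative classifier scores 0 despite its 99 % accuracy.
print(geometric_mean(0, 0, 10, 990))  # 0.0
```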
For each dataset we computed the average of the GM values obtained over the 25 runs. In order to get a single GM value from the rulesets proposed by our algorithm, we generated a ruleset and its ROC curve, and cut it automatically as shown previously. Then we computed GM on the resulting ruleset: if an observation matches a rule from the ruleset, it is considered as positive class; observations not matching any rule are considered as negative class.
Tests were carried out on a computer with a quad-core Xeon 3500 and 8 GB of memory, under Ubuntu 12. We used Weka version 3.6 for the discretization of datasets and for running the C4.5 tests. Our approach is implemented in C++, using metaheuristics from the ParadisEO framework [17]. In our experiments we set, for each dataset, MOCA-I max ruleset size = 5 and max rule size = 9.
Table 4. Relative error to the best average of GM: algorithms with 0 obtained the
best average of GM
             MOCA-I  XCS   O-DT  SIA   CORE  GAssist  OCEC  DT-GA  HIDER  C4.5  C4.5-CS
haberman     0.00    0.41  0.04  0.16  0.40  0.27     0.27  0.42   0.48   0.42  0.19
ecoli1       0.05    0.04  0.10  0.19  0.03  0.05     0.36  0.07   0.16   0.06  0.00
ecoli2       0.00    0.66  0.06  0.05  0.15  0.04     0.42  0.15   0.35   0.07  0.03
yeast3       0.01    0.81  0.10  0.10  0.23  0.06     0.01  0.10   0.43   0.07  0.00
yeast2vs8    0.11    0.18  0.18  0.88  0.14  0.26     0.16  0.88   0.18   0.88  0.00
abalone19    0.00    1.00  0.88  1.00  1.00  1.00     0.03  1.00   1.00   1.00  0.52
err. average 0.03    0.52  0.40  0.23  0.32  0.28     0.21  0.44   0.43   0.42  0.12
          GM on Training   GM on Test
MOCA-I    0.92 ±0.02       0.74 ±0.07
C4.5      0.58 ±0.03       0.52 ±0.11
C4.5-CS   0.99 ±0.0009     0.47 ±0.11
best performance on test data, MOCA-I uses the previously presented post-processing method to output a ruleset with different cut possibilities. In Table 5 the ruleset is cut to improve GM; Table 6 shows the results obtained with different cut points over the ROC curve. Cut 1 is a cut where there is no false positive. Cut 3 is the cut presented previously in our post-processing method. Cut 2 lies between these two cuts on the ROC curve. The cut point can be adapted depending on the cost of a false positive. When no error is tolerable, cut points bringing fewer false positives will be preferred (like Cut 1 or Cut 2 in Table 6).
References
1. Fernández, A., Garciá, S., Luengo, J., Bernadó-Mansilla, E., Herrera, F.: Genetics-
based machine learning for rule induction: state of the art, taxonomy, and com-
parative study. IEEE Trans. Evol. Comput. 14(6), 913–941 (2010)
2. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publish-
ers Inc., San Francisco (1993)
3. Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Maimon, O.,
Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, 2nd edn, pp.
875–886. Springer, New York (2010)
4. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. ACM SIGKDD
Explor. Newsl. 6(1), 40–49 (2004)
5. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM
Comput. Surv. (CSUR) 38(3), 1–32 (2006)
6. Ohsaki, M., Abe, H., Tsumoto, S., Yokoi, H., Yamaguchi, T.: Evaluation of rule
interestingness measures in medical knowledge discovery in databases. Artif. Intell.
Med. 41, 177–196 (2007)
7. Greco, S., Pawlak, Z., Slowiński, R.: Can bayesian confirmation measures be useful
for rough set decision rules? Eng. Appl. Artif. Intell. 17(4), 345–361 (2004)
8. Bayardo Jr., R.J., Agrawal, R.: Mining the most interesting rules. In: Proceedings of the Fifth ACM SIGKDD, ser. KDD '99, pp. 145–154 (1999)
1 Introduction
Search problems arise in various domains, ranging from small logic puzzles over scheduling problems like railway scheduling [1] to large job shop scheduling problems [2]. As long as the answers need not be optimal, these problems can be translated into a constraint satisfaction problem [3] or into satisfiability testing (SAT) [4]. The SAT approach is often successful; e.g., scheduling railway trains has been improved by a speedup of up to 10000 compared to the state-of-the-art domain-specific solver [1]. With the advent of parallel architectures, interest has moved towards parallel SAT solvers [5–7]. Most relevant parallel SAT solvers can be divided into two families: portfolio solvers, where several sequential solvers compete with each other on the same formula, and iterative partitioning, where each solver is assigned a partition of the original problem and partitions are created iteratively. Portfolio solvers have received much attention from the community, leading to enhancements by means of sharing according to some filter heuristics [5] or by controlling the diversification and intensification among the solvers [8]. The same cannot be said for iterative partitioning: for a grid implementation of the parallel solver, only a study on how to divide the search space [9] and one on limited sharing [10] have been done. As for portfolio solvers [5,11], Hyvärinen et al. report that on average even this limited sharing results in a speedup. In this paper we present an improved clause sharing mechanism for the iterative partitioning approach. Our evaluation reveals interesting insights: first, sharing clauses introduces almost no overhead in computation. Furthermore, the performance of the overall search is increased. One of the reasons for this improved behavior is that the number

Davide Lanti was supported by the European Master's Program in Computational Logic (EMCL).
2 Preliminaries
Let V be a finite set of Boolean variables. The set of literals V ∪ {x̄ | x ∈ V} consists of the positive and negative Boolean variables. A clause is a finite disjunction of literals, and a formula (in conjunctive normal form (CNF)) is a finite conjunction of clauses. We sometimes consider clauses and formulae as sets of literals and sets of clauses, respectively, because duplicates can be removed safely. We denote clauses with square brackets and formulae with angle brackets, so that ((a ∨ b) ∧ (a ∨ c ∨ d)) is written as ⟨[a, b], [a, c, d]⟩. An interpretation J is a (partial or total) mapping from the set of variables to the set {⊤, ⊥} of truth values; the interpretation is represented by a set of literals, also denoted by J, with the understanding that a variable x is mapped to ⊤ if x ∈ J and is mapped to ⊥ if x̄ ∈ J. One should observe that {x, x̄} ⊈ J for any x and J.
A clause C is satisfied by an interpretation J if l ∈ J for some literal l ∈ C. An interpretation satisfies a formula F if it satisfies every clause in F. If there exists an interpretation that satisfies F, then F is satisfiable; otherwise F is unsatisfiable. An interpretation J that satisfies a formula F is called a model of F (J ⊨ F). Given two formulae F and G, we say that F models G (F ⊨ G) if and only if every model of F is also a model of G. Two formulae F and G are equivalent (F ≡ G) if they have the same set of models. Observe that if F ⊨ G, then F ∧ G ≡ F. Let C = [x, c1, ..., cm] and D = [x̄, d1, ..., dn] be two clauses. We call the clause E = [c1, ..., cm, d1, ..., dn] the resolvent of C and D, which has been produced by resolution on the variable x. We write E = C ⊗ D. Note that ⟨C, D⟩ ⊨ ⟨E⟩, and therefore ⟨C, D⟩ ≡ ⟨C, D, E⟩.
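To make the resolvent operation concrete, here is a tiny illustrative sketch using the common signed-integer encoding of literals (an assumption of this sketch, not the authors' solver code):

```python
# Literals as signed ints: x -> +x, its negation x-bar -> -x.
# Clauses as frozensets of literals (duplicates removed, as in the text).
def resolve(c, d, x):
    """Resolvent of clauses c and d on variable x, assuming x in c, -x in d."""
    assert x in c and -x in d
    return frozenset(c - {x}) | frozenset(d - {-x})

C = frozenset({1, 2, 3})    # [x1, x2, x3]
D = frozenset({-1, 4})      # [x1-bar, x4]
print(sorted(resolve(C, D, 1)))  # [2, 3, 4], i.e. [x2, x3, x4]
```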
Fig. 1. The tree shows how a formula can be partitioned iteratively by using a parti-
tioning function that creates two child formulae.
beyond two cores. Since modern hardware provides many more cores, we focus on techniques that are more promising, namely:
• parallel portfolio search [5], where different sequential solvers solve the same input formula in parallel;
• iterative partitioning [9], where a formula is partitioned iteratively into a tree of subproblems and each subproblem is solved in parallel.
Portfolio parallelization is the most common approach, and many parallel SAT solvers rely on this technique, e.g. [5,19]. Iterative partitioning is a partitioning scheme that does not suffer from the theoretical slowdown common to other partitioning approaches [20]. Since [20] reports that iterative partitioning is the most scalable algorithm, we focus on improving it further by allowing more communication.
Fig. 2. Partition tree for F, where unsafe clauses are underlined. Resolution w.r.t. the incoming constraints has been applied at each node. The clause [x4, x2], which could be learnt by the solver working on F121, is unsafe because it depends on the constraint ⟨[x6]⟩. However, it could safely be shared among the children of F1.
4 Empirical Evaluation
The experiments were run on AMD Opteron 6274 CPUs with 2.2 GHz and 16 cores, so that 16 local solvers run in parallel. Each instance is assigned a timeout of 1 h (wall clock) and a total of 16 GB of main memory. Each sharing approach has been tested over the 600 instances of the SAT Challenge 2012. Our iterative-partitioning solver is based on MiniSat [7], following the ideas of [20]. As in [9], the resources of the local solvers are restricted: a branch is created after 8096 conflicts, and a local solver is allowed to search until 512000 conflicts are reached. The partitioning function uses VSIDS scattering [9]. We tested our solver with 4 different configurations: “POS” and “FLAG” use position-based and flag-based clause sharing, respectively; here, a clause can be shared only if its size is less than or equal to 2. “RAND” uses position-based sharing where each learnt clause can be shared with a probability of 5 %. The last configuration, “NONE”, does not share any clauses.
5 Conclusion
We presented a new position-based clause sharing technique that allows sharing clauses among subsets of the solvers of a parallel iterative-partitioning SAT solver. Position-based clause sharing improves the intensification of parallel searching SAT solvers by identifying the search space in which a shared clause is valid, so that the total number of shared clauses can be increased compared to previous work [10].
Future work could improve the shared clauses further. By rejecting resolution steps, the sharing position of learnt clauses can be improved. Moreover, a filter on the receiving solver should be considered as well. Also, it is not trivial to decide which shared clauses are important and whether these should actively drive the search. Additionally, parallel resources should be exploited further, for example by using different partitioning strategies or by replacing the local sequential solver by another parallel SAT solver. Finally, improvements to the local solver, such as restarts and advanced search direction techniques, could also be incorporated into the search space partitioning.
References
1. Großmann, P., Hölldobler, S., Manthey, N., Nachtigall, K., Opitz, J., Steinke, P.:
Solving periodic event scheduling problems with SAT. In: Jiang, H., Ding, W.,
Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS, vol. 7345, pp. 166–175. Springer,
Heidelberg (2012)
2. Carlier, J., Pinson, E.: An algorithm for solving the job-shop problem. Manage.
Sci. 35(2), 164–176 (1989)
3. Rossi, F., Beek, P.V., Walsh, T.: Handbook of Constraint Programming (Founda-
tions of Artificial Intelligence). Elsevier Science Inc, New York (2006)
4. Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfiability.
Frontiers in Artificial Intelligence and Applications, vol. 185. IOS Press, Amster-
dam (2009)
5. Hamadi, Y., Jabbour, S., Sais, L.: Manysat: a parallel sat solver. JSAT 6(4), 245–
262 (2009)
6. Biere, A.: Lingeling, Plingeling, PicoSAT and PrecoSAT at SAT Race 2010. FMV
Report Series Technical Report 10/1. Johannes Kepler University, Linz, Austria
(2010)
7. Eén, N., Sörensson, N.: An extensible SAT-solver. In: Giunchiglia, E., Tacchella,
A. (eds.) SAT 2003. LNCS, vol. 2919, pp. 502–518. Springer, Heidelberg (2004)
8. Guo, L., Hamadi, Y., Jabbour, S., Sais, L.: Diversification and intensification in
parallel SAT solving. In: Cohen, D. (ed.) CP 2010. LNCS, vol. 6308, pp. 252–265.
Springer, Heidelberg (2010)
9. Hyvärinen, A.E.J., Junttila, T., Niemelä, I.: Partitioning SAT instances for dis-
tributed solving. In: Fermüller, C.G., Voronkov, A. (eds.) LPAR-17. LNCS, vol.
6397, pp. 372–386. Springer, Heidelberg (2010)
10. Hyvärinen, A.E.J., Junttila, T., Niemelä, I.: Grid-based SAT solving with iterative
partitioning and clause learning. In: Lee, J. (ed.) CP 2011. LNCS, vol. 6876, pp.
385–399. Springer, Heidelberg (2011)
11. Arbelaez, A., Hamadi, Y.: Improving parallel local search for SAT. In: Coello,
C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 46–60. Springer, Heidelberg (2011)
12. Davis, M., Logemann, G., Loveland, D.: A machine program for theorem-proving.
Commun. ACM 5, 394–397 (1962)
13. Marques Silva, J.P., Sakallah, K.A.: Grasp: a search algorithm for propositional
satisfiability. IEEE Trans. Comput. 48(5), 506–521 (1999)
14. Katebi, H., Sakallah, K.A., Marques-Silva, J.: Empirical study of the anatomy of
modern sat solvers. In: Sakallah, K.A., Simon, L. (eds.) SAT 2011. LNCS, vol.
6695, pp. 343–356. Springer, Heidelberg (2011)
15. Böhm, M., Speckenmeyer, E.: A fast parallel SAT-solver – efficient workload balancing (1994)
16. Martins, R., Manquinho, V., Lynce, I.: An overview of parallel sat solving. Con-
straints 17(3), 304–347 (2012)
17. Hölldobler, S., Manthey, N., Nguyen, V., Stecklina, J., Steinke, P.: A short overview
on modern parallel SAT-solvers. In: Wasito, I., et al. (ed.) ICACSIS, pp. 201–206
(2011)
18. Manthey, N.: Parallel SAT solving - using more cores. In: Pragmatics of
SAT(POS’11) (2011)
19. Audemard, G., Hoessen, B., Jabbour, S., Lagniez, J.-M., Piette, C.: Revisiting
clause exchange in parallel SAT solving. In: Cimatti, A., Sebastiani, R. (eds.) SAT
2012. LNCS, vol. 7317, pp. 200–213. Springer, Heidelberg (2012)
20. Hyvärinen, A.E.J., Manthey, N.: Designing scalable parallel SAT solvers. In:
Cimatti, A., Sebastiani, R. (eds.) SAT 2012. LNCS, vol. 7317, pp. 214–227.
Springer, Heidelberg (2012)
21. Lanti, D., Manthey, N.: Sharing information in parallel search with search space
partitioning. Technical Report 1, Knowledge Representation and Reasoning Group,
Technische Universität Dresden, 01062 Dresden, Germany (2013)
22. Baader, F., Nipkow, T.: Term rewriting and all that. Cambridge University Press,
New York (1998)
Fast Computation of the Multi-Points Expected
Improvement with Applications
in Batch Selection
1 Introduction
In the last decades, metamodeling (or surrogate modeling) has been increasingly
used for problems involving costly computer codes (or “black-box simulators”).
Practitioners typically dispose of a very limited evaluation budget and aim at
selecting evaluation points cautiously when attempting to solve a given problem.
In global optimization, the focus is usually put on a real-valued function f
with d-dimensional source space. In this settings, Jones et al. [1] proposed the
now famous Efficient Global Optimization (EGO) algorithm, relying on a kriging
metamodel [2] and on the Expected Improvement (EI) criterion [3]. In EGO, the
optimization is done by sequentially evaluating f at points maximizing EI. A
crucial advantage of this criterion is its fast computation (besides, the analytical
gradient of EI is implemented in [4]), so that the hard optimization problem is
replaced by series of much simpler ones.
Coming back to the decision-theoretic roots of EI [5], a Multi-points Expected Improvement (also called “q-EI”) criterion for batch-sequential optimization was defined in [6] and further developed in [7,8]. Maximizing this criterion enables choosing batches of q > 1 points at which to evaluate f in parallel, and is of particular interest in the frequent case where several CPUs are simultaneously available. Even though an analytical formula was derived for the 2-EI in [7], the Monte Carlo (MC) approach of [8] for computing q-EI when q ≥ 3 makes the criterion itself expensive-to-evaluate, and particularly hard to optimize.
A lot of effort has recently been devoted to this problem. The pragmatic approach proposed by Ginsbourger and Le Riche [8] consists in circumventing a direct q-EI maximization and replacing it by simpler strategies where batches are obtained using an offline q-points EGO. In such strategies, the model updates are done using dummy response values such as the kriging mean prediction (Kriging Believer) or a constant (Constant Liar), and the covariance parameters are re-estimated only when real data is assimilated. In [9] and [10], q-EI optimization strategies were proposed relying on the MC approach, where the number of MC samples is tuned online to discriminate between candidate designs. Finally, Frazier [11] proposed a q-EI optimization strategy involving stochastic gradients, with the crucial advantage of not requiring the evaluation of q-EI itself.
In this article we derive a formula allowing a fast and accurate approximate evaluation of q-EI. This formula may contribute to significantly speeding up strategies relying on q-EI. The main result, relying on Tallis' formula, is given in Sect. 2. The usability of the proposed formula is then illustrated in Sect. 3 through benchmark experiments, where a brute-force maximization of q-EI is compared to three variants of the Constant Liar strategy. In particular, a new variant (CL-mix) is introduced and is shown to offer very good performance at a competitive computational cost. For self-containedness, a slightly revisited proof of Tallis' formula is given in the appendix.
In this section we give an explicit formula allowing a fast and accurate deterministic approximation of q-EI. Let us first give a few precisions on the mathematical setting. Throughout the paper, f is assumed to be one realisation of a Gaussian process (GP) with known covariance kernel and mean known up to some linear trend coefficients, so that the conditional distribution of a vector of values of the GP conditional on past observations is still Gaussian (an improper uniform prior is put on the trend coefficients when applicable). This being said, most forthcoming derivations boil down to calculations on Gaussian vectors. Let Y := (Y1, ..., Yq) be a Gaussian vector with mean m ∈ R^q and covariance matrix Σ. Our aim in this paper is to explicitly calculate expressions of the following kind:

E[ ( max_{i∈{1,...,q}} Yi − T )_+ ]    (1)
matrix Σ are the so-called “kriging mean” and “kriging covariance” at X^q and can be calculated relying on the classical kriging equations (see, e.g., [12]).
In order to obtain a tractable analytical formula for Expression (1), not requiring any Monte Carlo simulation, let us first give a useful formula obtained by [13], and recently used in [14] for GP modeling with inequality constraints. For a Gaussian vector Z := (Z1, ..., Zq) with mean m ∈ R^q, covariance matrix Σ, and a threshold vector b ∈ R^q:

E(Zk | Z ≤ b) = mk − (1/p) Σ_{i=1}^{q} Σik φ_{mi,Σii}(bi) Φ_{q−1}(c.i, Σ.i),    (2)

where:
– p := P(Z ≤ b) = Φq(b − m, Σ),
– Φq(u, Σ) (u ∈ R^q, Σ ∈ R^{q×q}, q ≥ 1) is the c.d.f. of the centered multivariate Gaussian distribution with covariance matrix Σ,
– φ_{m,σ²}(.) is the p.d.f. of the univariate Gaussian distribution with mean m and variance σ²,
– c.i is the vector of R^{q−1} with general term (bj − mj) − (bi − mi) Σij/Σii, j ≠ i,
– Σ.i is the (q − 1) × (q − 1) matrix obtained by computing Σuv − Σiu Σiv/Σii for u ≠ i and v ≠ i. This matrix corresponds to the conditional covariance matrix of the random vector Z−i := (Z1, ..., Zi−1, Zi+1, ..., Zq) knowing Zi.
For the sake of brevity, the proof of this proposition is deferred to the appendix. A crucial point for the practical use of this result is that there exist very fast procedures to compute the c.d.f. of the multivariate Gaussian distribution. For example, the work of Genz [15,16] has been used in many R packages (see, e.g., [17,18]). Formula (2) above is an important tool to efficiently compute Expression (1), as shown by the following property:
Denoting by m^(k) and Σ^(k) the mean and covariance matrix of Z^(k), and defining the vector b^(k) ∈ R^q by b_k^(k) = −T and b_j^(k) = 0 if j ≠ k, the EI of X^q writes:

EI(X^q) = Σ_{k=1}^{q} [ (mk − T) pk + Σ_{i=1}^{q} Σ_{ik}^(k) φ_{m_i^(k), Σ_{ii}^(k)}(b_i^(k)) Φ_{q−1}( c_{.i}^(k), Σ_{.i}^(k) ) ]    (3)
where:
– pk := P(Z^(k) ≤ b^(k)) = Φq(b^(k) − m^(k), Σ^(k)); pk is actually the probability that Yk exceeds T and Yk = max_{j=1,...,q} Yj,
– Φq(., Σ) and φ_{m,σ²}(.) are defined as in Proposition 1,
– c_{.i}^(k) is the vector of R^{q−1} constructed as in Proposition 1, by computing (b_j^(k) − m_j^(k)) − (b_i^(k) − m_i^(k)) Σ_{ij}^(k)/Σ_{ii}^(k), with j ≠ i,
– Σ_{.i}^(k) is the (q − 1) × (q − 1) matrix constructed from Σ^(k) as in Proposition 1. It corresponds to the conditional covariance matrix of the random vector Z_{−i}^(k) := (Z_1^(k), ..., Z_{i−1}^(k), Z_{i+1}^(k), ..., Z_q^(k)) knowing Z_i^(k).
Proof 1. Using that 1_{max_{i∈{1,...,q}} Yi ≥ T} = Σ_{k=1}^{q} 1_{Yk ≥ T, Yj ≤ Yk ∀j≠k}, we get

EI(X^q) = E[ ( max_{i∈{1,...,q}} Yi − T ) Σ_{k=1}^{q} 1_{Yk ≥ T, Yj ≤ Yk ∀j≠k} ]
        = Σ_{k=1}^{q} E[ (Yk − T) 1_{Yk ≥ T, Yj ≤ Yk ∀j≠k} ]
        = Σ_{k=1}^{q} E[ Yk − T | Yk ≥ T, Yj ≤ Yk ∀j≠k ] P( Yk ≥ T, Yj ≤ Yk ∀j≠k )
        = Σ_{k=1}^{q} ( −T − E[ Z_k^(k) | Z^(k) ≤ b^(k) ] ) P( Z^(k) ≤ b^(k) ),

where Z^(k) is the Gaussian vector defined by Z_k^(k) := −Yk and Z_j^(k) := Yj − Yk for j ≠ k. Now the computation of pk := P(Z^(k) ≤ b^(k)) simply requires one call to the Φq function, and the proof can be completed by applying Tallis' formula (2) to the random vectors Z^(k) (1 ≤ k ≤ q).
Remark 1. From Propositions 1 and 2, it appears that computing q-EI requires a total of q calls to Φq and q² calls to Φ_{q−1}. The proposed approach thus performs well when q is moderate (typically lower than 10). For higher values of q, estimating q-EI by Monte Carlo might remain competitive. Note that, when q is larger (say, q = 50) and q CPUs are available, one can always distribute the calculations of the q² calls to Φ_{q−1} over these q CPUs.
Remark 2. In the particular case q = 1 and with the convention Φ0(., Σ) = 1, Eq. (3) corresponds to the classical EI formula proven in [1,5].
Remark 3. The Multi-points EI can be used in a batch-sequential strategy to optimize a given expensive-to-evaluate function f, as detailed in the next section. Moreover, a similar criterion can also be used to perform optimization based on a kriging model with linear constraints, such as the one developed by Da Veiga and Marrel [14]. For example, expressions like E[ ( max_{i∈{1,...,q}} Yi − T )_+ | Y ≤ a ], a ∈ R^q, can be computed using Tallis' formula and the same proof.
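To illustrate Eq. (3), here is a hedged Python/SciPy sketch (the works cited above rely on R packages such as mvtnorm; here scipy.stats.multivariate_normal.cdf plays the role of Φq). It assumes q ≥ 2, non-degenerate covariance matrices, and the construction of Z^(k) from the proof; it is an illustration of the formula, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

def q_ei(m, Sigma, T):
    """Approximate q-EI of Y ~ N(m, Sigma) above threshold T via Eq. (3).
    Assumes q >= 2 and non-degenerate covariance matrices."""
    q = len(m)
    total = 0.0
    for k in range(q):
        # Linear map Y -> Z^(k): Z_k = -Y_k, Z_j = Y_j - Y_k for j != k.
        A = np.eye(q)
        A[:, k] -= 1.0
        A[k, k] = -1.0
        mk = A @ m
        Sk = A @ Sigma @ A.T
        b = np.zeros(q)
        b[k] = -T
        pk = mvn.cdf(b - mk, mean=np.zeros(q), cov=Sk)   # Phi_q(b - m, Sigma)
        total += (m[k] - T) * pk
        for i in range(q):
            # c_.i and Sigma_.i as in Proposition 1 (conditioning on Z_i).
            mask = np.arange(q) != i
            c = (b[mask] - mk[mask]) - (b[i] - mk[i]) * Sk[mask, i] / Sk[i, i]
            S_cond = Sk[np.ix_(mask, mask)] - np.outer(Sk[mask, i], Sk[i, mask]) / Sk[i, i]
            phi = norm.pdf(b[i], loc=mk[i], scale=np.sqrt(Sk[i, i]))
            total += Sk[i, k] * phi * mvn.cdf(c, mean=np.zeros(q - 1), cov=S_cond)
    return total

# Hypothetical example: a batch of two points with correlated predictions.
m = np.array([0.0, 0.2])
Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
print(q_ei(m, Sigma, T=1.0))
```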
Fig. 1. Convergence (middle) of MC estimates to the q-EI value calculated with Propo-
sition 2 in the case of a batch of four points (shown on the left plot). Right: candidate
batches obtained by q-EI stepwise maximisation (squares), and the CL-min (circles)
and CL-max (triangles) strategies.
Fig. 2. Contour lines of the Rastrigin function (grayscale) and location of the global
optimizer (black triangle)
batch assimilation. Since the tests are done for several designs of experiments, we chose to represent, along the runs, the relative mean squared error:

rMSE = (1/M) Σ_{i=1}^{M} ( (y_min^(i) − y_opt) / y_opt )²,    (4)

where y_min^(i) is the current observed minimum in run number i and y_opt is the real unknown optimum. The total number M of different initial designs of experiments is fixed to 50. The tested strategies are:
– (1) q-EI stepwise maximization: q sequential d-dimensional optimizations are
performed. We start with the maximization of the 1-point EI and add this
point to the new batch. We then maximize the 2-point EI (keeping the first
point obtained as first argument), add the maximizer to the batch, and iterate
until q points are selected.
– (2) Constant Liar min (CL-min): We start with the maximization of the 1-point EI and add this point to the new batch. We then assume a dummy response (a “lie”) at this point, and update the kriging metamodel with this point and the lie. We then maximize the 1-point EI obtained with the updated kriging metamodel, get a second point, and iterate the same process until a batch of q points is selected. The dummy response has the same value over the q − 1 lies, and is here fixed to the minimum of the current observations.
– (3) Constant Liar max (CL-max): The lie in this Constant Liar strategy is fixed to the maximum of the current observations.
– (4) Constant Liar mix (CL-mix): At each iteration, two batches are generated
with the CL-min and CL-max strategies. From these two “candidate” batches,
we choose the batch with the best actual q-EI value, calculated based on
Proposition 2.
– (5) Random sampling.
Note that CL-min tends to explore the function near the current minimizer
(as the lie is a low value and we are minimizing f ) while CL-max is more
exploratory. Thus, CL-min is expected to perform well on unimodal functions.
On the contrary, CL-max may perform better on multimodal functions. For all
the tests we use the DiceKriging and DiceOptim packages [4]. The optimizations
of the different criteria rely on a genetic algorithm using derivatives, available
in the rgenoud package [21]. Figure 3 represents the compared performances of
these strategies.
[Fig. 3: two panels plotting log10(rMSE) versus iteration (0 to 10) for the strategies q-EI, CL-min, CL-max, CL-mix, and random.]
From these plots we draw the following conclusions: first, the q-EI stepwise maximization strategy outperforms the strategies based on constant lies, CL-min and CL-max. However, the left graph of Fig. 3 points out that the CL-min strategy seems particularly well adapted to the Hartman6 function. Since running a CL strategy is computationally much cheaper than a brute-force optimization of q-EI, it is tempting to recommend the CL-min strategy for Hartman6. However, it is not straightforward to know in advance which of CL-min or CL-max will perform better on a given test case. Indeed, for example, CL-max outperforms CL-min on the Rastrigin function. Now, we observe that using q-EI in the CL-mix heuristic enables very good performance in both cases without having to select one of the two lie values in advance. For the Hartman6 function, CL-mix even outperforms both CL-min and CL-max and has roughly the same performance as a brute-force q-EI maximization. This suggests that a good heuristic might be to generate, at each iteration, candidate batches obtained with different strategies (e.g. CL with different lies) and to discriminate those batches using q-EI.
4 Conclusion
In this article we give a closed-form expression enabling a fast computation of the Multi-points Expected Improvement criterion for batch-sequential Bayesian global optimization. This formula is consistent with the classical Expected Improvement formula, and its computation does not require Monte Carlo simulations. Optimization strategies based on this criterion are now ready to be used on real test cases, and a brute-force maximization of this criterion shows promising results. In addition, we show that good performance can be achieved by using a cheap-to-compute criterion and discriminating the candidate batches generated by such a criterion with the q-EI. Such heuristics might be particularly interesting when the time needed to generate batches becomes a computational bottleneck, e.g. when q ≥ 10 and calls to the Gaussian c.d.f. become expensive. A perspective, currently under study, is to improve the maximization of q-EI itself, e.g. through a more adapted choice of the algorithm and/or an analytical calculation of the gradient of q-EI.
Acknowledgments. This work has been conducted within the frame of the ReDice
Consortium, gathering industrial (CEA, EDF, IFPEN, IRSN, Renault) and academic
(Ecole des Mines de Saint-Etienne, INRIA, and the University of Bern) partners around
advanced methods for Computer Experiments. Clément Chevalier gratefully acknowl-
edges support from the French Nuclear Safety Institute (IRSN). The authors also would
like to thank Dr. Sébastien Da Veiga for raising our attention to Tallis’ formula.
It is known (see, e.g., [22]) that the conditional expectation of Zk can be obtained by differentiating this MGF with respect to tk at t = 0. Mathematically this writes:

E(Zk | Z ≤ b) = ∂M_Z(t)/∂t_k |_{t=0}    (6)

The main steps of this proof are then to calculate this MGF and its derivative with respect to any coordinate tk.
Let us consider the centered random variable Z^c := Z − m. Denoting h = b − m, conditioning on Z ≤ b or on Z^c ≤ h is equivalent. The MGF of Z^c can be calculated as follows:

where p := P(Z ≤ b) and φ_{v,Σ}(.) denotes the p.d.f. of the multivariate normal distribution with mean v and covariance matrix Σ. The calculation can be continued by noting that:

M_{Z^c}(t) = (1/p) (2π)^{−q/2} |Σ|^{−1/2} exp( (1/2) tᵀΣt ) ∫_{−∞}^{h1} ... ∫_{−∞}^{hq} exp( −(1/2) (u − Σt)ᵀ Σ^{−1} (u − Σt) ) du
           = (1/p) exp( (1/2) tᵀΣt ) Φq(h − Σt, Σ),

where Φq(., Σ) is the c.d.f. of the centered multivariate normal distribution with covariance matrix Σ.
Now, let us calculate, for some k ∈ {1, ..., q}, the partial derivative ∂M_{Z^c}(t)/∂t_k at t = 0:

p · ∂M_{Z^c}(t)/∂t_k |_{t=0} = − Σ_{i=1}^{q} Σik ∫_{−∞}^{h1} ... ∫_{−∞}^{h_{i−1}} ∫_{−∞}^{h_{i+1}} ... ∫_{−∞}^{hq} φ_{0,Σ}(u_{−i}, ui = hi) du_{−i}

The last step is obtained by applying the chain rule to x ↦ Φq(x, Σ) at the point x = h. Here, φ_{0,Σ}(u_{−i}, ui = hi) denotes the p.d.f. of the centered multivariate normal distribution evaluated at the point (u_{−i}, ui = hi) := (u1, ..., u_{i−1}, hi, u_{i+1}, ..., uq). Note that the integrals in the latter expression are of dimension q − 1 and not q. In the ith term of the sum above, we integrate with respect
to u_{−i} := (u1, ..., u_{i−1}, u_{i+1}, ..., uq) only, using the decomposition φ_{0,Σ}(u_{−i}, ui = hi) = φ_{0,Σii}(hi) · φ_{Σi Σii^{−1} hi, Σ_{−i,−i} − Σi Σii^{−1} Σiᵀ}(u_{−i}), where Σi = (Σ_{1i}, ..., Σ_{i−1,i}, Σ_{i+1,i}, ..., Σ_{qi}) (Σi ∈ R^{q−1}) and Σ_{−i,−i} is the (q − 1) × (q − 1) matrix obtained by removing line and column i from Σ. This identity can be proven using the Bayes formula and the Gaussian vector conditioning formulas. Its use gives:

p E(Z_k^c | Z^c ≤ h) = − Σ_{i=1}^{q} Σik φ_{0,Σii}(hi) Φ_{q−1}( h_{−i} − Σi Σii^{−1} hi, Σ_{−i,−i} − Σi Σii^{−1} Σiᵀ )
                     = − Σ_{i=1}^{q} Σik φ_{mi,Σii}(bi) Φ_{q−1}( h_{−i} − Σi Σii^{−1} hi, Σ_{−i,−i} − Σi Σii^{−1} Σiᵀ )
References
1. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)
2. Santner, T.J., Williams, B.J.: The Design and Analysis of Computer Experiments.
Springer, New York (2003)
3. Mockus, J.: Bayesian Approach to Global Optimization. Theory and Applications.
Kluwer Academic Publisher, Dordrecht (1989)
4. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: Two R pack-
ages for the analysis of computer experiments by kriging-based metamodelling and
optimization. J. Stat. Softw. 51(1), 1–55 (2012)
5. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seek-
ing the extremum. In: Dixon, L., Szego, E.G. (eds.) Towards Global Optimization,
pp. 117–129. Elsevier, Amsterdam (1978)
6. Schonlau, M.: Computer experiments and global optimization. PhD thesis, Uni-
versity of Waterloo (1997)
7. Ginsbourger, D.: Métamodèles multiples pour l’approximation et l’optimisation de
fonctions numériques multivariables. PhD thesis, Ecole nationale supérieure des
Mines de Saint-Etienne (2009)
8. Ginsbourger, D., Le Riche, R., Carraro, L.: Kriging is well-suited to parallelize opti-
mization. Computational Intelligence in Expensive Optimization Problems. Adap-
tation Learning and Optimization, vol. 2, pp. 131–162. Springer, Heidelberg (2010)
9. Janusevskis, J., Le Riche, R., Ginsbourger, D.: Parallel expected improvements for
global optimization: summary, bounds and speed-up (August 2011)
10. Janusevskis, J., Le Riche, R., Ginsbourger, D., Girdziusas, R.: Expected improve-
ments for the asynchronous parallel global optimization of expensive functions:
potentials and challenges. In: Hamadi, Y., Schoenauer, M. (eds.) LION 6. LNCS,
vol. 7219, pp. 413–418. Springer, Heidelberg (2012)
11. Frazier, P.I.: Parallel global optimization using an improved multi-points expected
improvement criterion. In: INFORMS Optimization Society Conference, Miami FL
(2012)
12. Chilès, J.P., Delfiner, P.: Geostatistics: Modeling Spatial Uncertainty. Wiley, New
York (1999)
13. Tallis, G.: The moment generating function of the truncated multi-normal distri-
bution. J. Roy. Statist. Soc. Ser. B 23(1), 223–229 (1961)
14. Da Veiga, S., Marrel, A.: Gaussian process modeling with inequality constraints.
Annales de la Faculté des Sciences de Toulouse 21(3), 529–555 (2012)
15. Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput.
Graph. Stat. 1, 141–149 (1992)
16. Genz, A., Bretz, F.: Computation of Multivariate Normal and t Probabilities.
Springer, Heidelberg (2009)
17. Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Bornkamp, B.,
Hothorn, T.: Mvtnorm: Multivariate Normal and t Distributions. R package version
0.9-9992 (2012)
18. Azzalini, A.: mnormt: The multivariate normal and t distributions. R package
version 1.4-5 (2012)
19. Finck, S., Hansen, N., Ros, R., Auger, A.: Real-parameter black-box optimization benchmarking 2009: presentation of the noiseless functions. Technical report, Research Center PPE, 2009 (2010)
20. Hansen, N., Finck, S., Ros, R., Auger, A.: Real-parameter black-box optimization
benchmarking 2009: Noiseless functions definitions. Technical report, INRIA 2009
(2010)
21. Mebane, W., Sekhon, J.: Genetic optimization using derivatives: The rgenoud pack-
age for R. J. Stat. Softw. 42(11), 1–26 (2011)
22. Cressie, N., Davis, A., Leroy Folks, J.: The moment-generating function and neg-
ative integer moments. Am. Stat. 35(3), 148–150 (1981)
R2-EMOA: Focused Multiobjective Search
Using R2-Indicator-Based Selection
1 Introduction
Throughout this paper, we consider multiobjective optimization problems consisting of d objectives Yj and objective functions fj : R^n → R with 1 ≤ j ≤ d. In the context of performance assessment of multiobjective optimizers, the (binary) R-indicator family was introduced by Hansen and Jaszkiewicz [5]. It is based on a set of utility functions. In total, three different variants were proposed, which differ in the way the utilities are evaluated and combined – the ratio of one set being better than the other (R1), the mean difference in utilities (R2), or the mean relative difference in utilities (R3). In particular, the second variant R2 is one of the most recommended performance indicators [8], together with the hypervolume (HV, [9]), which directly measures the dominated objective hypervolume bounded by a reference point dominated by all solutions. Recently, we defined an equivalent unary version of this R2 indicator [3]. In case the standard weighted Tchebycheff utility function with ideal point i is used, it is defined as

R2(A, Λ, i) = (1/|Λ|) Σ_{λ∈Λ} min_{a∈A} max_{j∈{1,...,d}} { λj |ij − aj| }
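A direct transcription of this definition (illustrative Python sketch; the solution set, weight vectors, and ideal point are assumed given):

```python
def r2_indicator(A, weights, ideal):
    """Unary R2 indicator with the weighted Tchebycheff utility:
    mean over lambda of min over a of max_j lambda_j * |i_j - a_j|."""
    total = 0.0
    for lam in weights:
        total += min(
            max(l * abs(i - a) for l, i, a in zip(lam, ideal, point))
            for point in A
        )
    return total / len(weights)

# Two bi-objective solutions, ideal point (0, 0), three weight vectors.
A = [(0.2, 0.8), (0.7, 0.3)]
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
print(r2_indicator(A, weights, ideal=(0.0, 0.0)))
```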
2 R2-EMOA
The proposed R2-EMOA implements a steady state strategy based on the con-
tribution to the unary R2-indicator (see Algorithm 1).
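The core of the steady-state idea can be sketched as follows (illustrative Python; the helper names are assumptions, and the actual Algorithm 1 may differ in details such as mating selection):

```python
def r2(A, weights, ideal):
    # Unary R2 value of solution set A (to be minimized).
    return sum(
        min(max(l * abs(i - a) for l, i, a in zip(lam, ideal, p)) for p in A)
        for lam in weights
    ) / len(weights)

def steady_state_step(population, offspring, weights, ideal):
    """One iteration sketch: add the offspring, then drop the individual
    whose removal leaves the remaining set with the best (lowest) R2."""
    pool = population + [offspring]
    worst = min(range(len(pool)),
                key=lambda i: r2(pool[:i] + pool[i + 1:], weights, ideal))
    return pool[:worst] + pool[worst + 1:]
```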
3 Experiments
Fig. 2. Results of best R2-EMOA runs for increasing γ on DTLZ1 (left), DTLZ2 (mid-
dle) and ZDT1 (right). The movement of the x-axis positions for γ ∈ {1, 4, 8} is shown.
The optimal distributions regarding HV are reflected by dashed vertical lines.
n = 6), and DTLZ2 (concave PF, n = 11) [4]. On each function, ten independent runs were conducted using simulated binary crossover (SBX) and polynomial mutation (pc = 0.9, pm = 1/n, ηc = 15, ηm = 20), 150,000 function evaluations (FE), the ideal point i = (0, 0)ᵀ, and 501 weight vectors. A population size of μ = 10 was chosen in order to allow a clear visualization of the results and the comparison to the reference distributions of [6].
The influence of restricted weight vector domains and of altered weight vector distributions on the R2-EMOA results is considered. For this purpose, Algorithm 1 of [6] was used to generate weight vector distributions with increasing focus on the extremes of the weight vector domain (see Fig. 1). This is reflected by an increasing value of γ, where γ = 1 corresponds to uniformly distributed weight vectors in [0, 1]². The R2-EMOA is able to accurately approximate the optimal distributions. With increasing γ, the points tend to drift towards the extremes of the front (Fig. 2), which is perfectly in line with the results of [6].
Fig. 3. Results of the best R2-EMOA runs (black dots) with restricted weight vector
domains for DTLZ1 (left), DTLZ2 (middle) and ZDT1 (right). The areas within the
intersections with the true PF (solid line) are highlighted.
Fig. 4. Boxplots of R2 values at the final R2-EMOA generation for DTLZ1 (left), DTLZ2 (middle) and ZDT1 (right) for altered weight distributions with parameter γ (top) or restricted weight space (bottom), corresponding to Fig. 3. The R2 value of the approximated optimal 10-distribution of R2 from [6] is visualized by a red horizontal line.
Individually for each problem, distributions close to the optimal ones regarding HV can be obtained for a specific choice of γ.
Moreover, the first component of the weight vector domain was restricted to one or two intervals within [0, 1]. From [6] it is known that, in this setting, the optimal solutions regarding R2 lie within the target cone defined by the two outermost weight vectors of the interval(s). This is reflected by the respective R2-EMOA results (Fig. 3).
Figure 4 relates the final R2 values of all experiments to the approximated optimal 10-distributions regarding R2 [6]. It can be observed that the variance of the R2-EMOA results is small. Sometimes even slightly better approximations of the optimal distributions are obtained than in [6]. This is rather surprising, as these reference solutions were determined based on a global optimization on the front. The evolutionary mechanism and the greedy selection seem to provide efficient heuristics for the considered class of problems.
First experiments show very promising results for the R2-EMOA regarding solution quality and the possibility of incorporating preferences of the decision maker. In future studies, the R2-EMOA will be theoretically and empirically compared to other EMOAs optimizing the R2 indicator, such as MOEA/D and
References
1. Branke, J., Deb, K., Dierolf, H., Osswald, M.: Finding knees in multi-objective
optimization. In: Yao, X., Burke, E.K., Lozano, J., Smith, J., Merelo-Guervós, J.,
Bullinaria, J.A., Rowe, J.E., Tino, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN
2004. LNCS, vol. 3242, pp. 722–731. Springer, Heidelberg (2004)
2. Bringmann, K., Friedrich, T.: Convergence of hypervolume-based archiving algo-
rithms I: effectiveness. In: Genetic and Evolutionary Computation Conference
(GECCO 2011), pp. 745–752. ACM, New York (2011)
3. Brockhoff, D., Trautmann, H., Wagner, T.: On the properties of the R2 indicator.
In: Genetic and Evolutionary Computation Conference (GECCO 2012), pp. 465–
472. ACM, New York (2012)
4. Deb, K., Thiele, L., Laumanns, M., Zitzler, E.: Scalable multi-objective optimiza-
tion test problems. In: Congress on Evolutionary Computation (CEC 2002), pp.
825–830. IEEE Press, New Jersey (2002)
5. Hansen, M.P., Jaszkiewicz, A.: Evaluating the quality of approximations of the non-
dominated set. Technical report, Institute of Mathematical Modeling, Technical
University of Denmark (1998), IMM Technical Report IMM-REP-1998-7
6. Wagner, T., Trautmann, H., Brockhoff, D.: Preference articulation by means of the
R2 indicator. In: Purshouse, R.C., Fleming, P.J., Fonseca, C.M., Greco, S., Shaw,
J. (eds.) EMO 2013. LNCS, vol. 7811, pp. 81–95. Springer, Heidelberg (2013)
7. Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algo-
rithms: empirical results. Evol. Comput. 8(2), 173–195 (2000)
8. Zitzler, E., Knowles, J.D., Thiele, L.: Quality assessment of Pareto set approxima-
tions. In: Branke, J., Deb, K., Miettinen, K., Słowiński, R. (eds.) Multiobjective
Optimization. LNCS, vol. 5252, pp. 373–404. Springer, Heidelberg (2008)
9. Zitzler, E., Thiele, L.: Multiobjective optimization using evolutionary algorithms
- a comparative case study. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel,
H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, pp. 292–301. Springer, Heidelberg (1998)
10. Zitzler, E., Thiele, L., Bader, J.: On set-based multiobjective optimization. IEEE
Trans. Evol. Comput. 14(1), 58–79 (2010)
A Heuristic Algorithm for the Set Multicover
Problem with Generalized Upper Bound
Constraints
1 Introduction
applications into the SCP, because they often have additional side constraints in practice. Most practitioners accordingly formulate them as general mixed integer programming (MIP) problems and apply general-purpose solvers, which are usually less efficient than solvers specially tailored to the SCP.
In this paper, we consider an extension of the SCP introducing (i) multicover and (ii) generalized upper bound (GUB) constraints, which arise in many real applications of the SCP. The multicover constraint is a generalization of the covering constraint, in which each element i ∈ M must be covered at least bi ∈ Z+ times (Z+ is the set of non-negative integers). The GUB constraint is defined as follows. We are given a partition {G1, ..., Gk} of N (∀h ≠ h′, Gh ∩ Gh′ = ∅, ∪_{h=1}^{k} Gh = N). For each block Gh ⊆ N (h ∈ K = {1, ..., k}), the number of selected subsets Sj (j ∈ Gh) is constrained to be at most dh (≤ |Gh|). We call this problem the set multicover problem with GUB constraints (SMCP-GUB).
The SMCP-GUB is NP-hard, and the (supposedly) simpler problem of judging the existence of a feasible solution is NP-complete. We accordingly consider the following formulation of SMCP-GUB, which allows violations of the multicover constraints and introduces a penalty function with a penalty weight vector w = (w1, ..., wm) ∈ R^m_+:

min.  z(x) = Σ_{j∈N} cj xj + Σ_{i∈M} wi yi
s.t.  Σ_{j∈N} aij xj + yi ≥ bi,   i ∈ M,
      Σ_{j∈Gh} xj ≤ dh,           h ∈ K,            (2)
      xj ∈ {0, 1},                j ∈ N,
      yi ∈ {0, ..., bi},          i ∈ M.
For a given x ∈ {0, 1}^n, we can easily compute an optimal y by yi = max{ bi − Σ_{j∈N} aij xj, 0 }. We note that when y* = 0 holds for an optimal solution (x*, y*) of SMCP-GUB under the soft multicover constraints, x* is also optimal under the original (hard) multicover constraints. Moreover, for an optimal solution x* under hard multicover constraints, (x*, 0) is also optimal with respect to the soft multicover constraints if the values of wi are sufficiently large, e.g., if wi > Σ_{j∈N} cj holds for all i ∈ M. We accordingly set wi = Σ_{j∈N} cj + 1 for all i ∈ M.
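A small sketch of this evaluation (illustrative Python; the dense 0-1 matrix a and the index-list representation of blocks are assumptions of the sketch):

```python
def evaluate(x, a, b, c, w, groups, d):
    """Penalized objective z(x) of formulation (2): the optimal slack is
    y_i = max{b_i - sum_j a_ij x_j, 0}; returns (z, GUB feasibility)."""
    m, n = len(b), len(c)
    y = [max(b[i] - sum(a[i][j] * x[j] for j in range(n)), 0) for i in range(m)]
    z = sum(c[j] * x[j] for j in range(n)) + sum(w[i] * y[i] for i in range(m))
    gub_ok = all(sum(x[j] for j in groups[h]) <= d[h] for h in range(len(groups)))
    return z, gub_ok
```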
In this paper, we propose a 2-flip neighborhood local search algorithm with an efficient mechanism for finding improved solutions. The above generalization of the SCP substantially extends the variety of its applications. However, GUB constraints often make the pricing method less effective (a method known to be very effective for large-scale instances of the SCP), because GUB constraints prevent solutions from containing highly evaluated variables together. To overcome this, we develop a heuristic size reduction algorithm, in which a new evaluation scheme of variables is introduced that takes GUB constraints into account.
Relaxing the multicover constraints with a Lagrangian multiplier vector u ∈ R^m_+ yields the Lagrangian relaxation

\begin{aligned}
z_{LR}(u) = \text{min.} \quad & \sum_{j \in N} \tilde{c}_j(u)\, x_j + \sum_{i \in M} (w_i - u_i)\, y_i + \sum_{i \in M} u_i b_i \\
\text{s.t.} \quad & \sum_{j \in G_h} x_j \le d_h, \quad h \in K, \\
& x_j \in \{0, 1\}, \quad j \in N, \\
& y_i \in \{0, \ldots, b_i\}, \quad i \in M,
\end{aligned}

where we call c̃_j(u) = c_j − Σ_{i∈M} a_{ij} u_i the Lagrangian cost associated with column j ∈ N. For any u ∈ R^m_+, z_LR(u) gives a lower bound on the optimal value z(x*) of SMCP-GUB. The problem of finding a Lagrangian multiplier vector u that maximizes z_LR(u) is called the Lagrangian dual problem.
A common approach to compute a near optimal Lagrangian multiplier vec-
tor u is the subgradient method. When huge instances of SCP are solved, the
computing time spent on the subgradient method becomes very large if a naive
implementation is used. Caprara et al. [1] developed a variant of the pricing method for the subgradient method. They define a dual core problem consisting of a small subset of columns C_d ⊆ N (|C_d| ≪ |N|), chosen among those having the lowest Lagrangian costs c̃_j(u) (j ∈ C_d), and iteratively update the dual core problem in a similar fashion to that used for solving large-scale LP problems.
In order to solve huge instances of SMCP-GUB, we also introduce their pricing
method into the basic subgradient method (BSM) described in [3].
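As an illustration of these two ingredients, here is a minimal sketch (with our own names and a fixed step size, which a real implementation would adapt) of the Lagrangian costs and one subgradient step:

    import numpy as np

    def lagrangian_costs(u, a, c):
        # c~_j(u) = c_j - sum_i a_ij u_i
        return c - a.T @ u

    def subgradient_step(u, x, y, a, b, step=0.01):
        # Subgradient of z_LR at u: the violation of the relaxed
        # multicover constraints, g_i = b_i - sum_j a_ij x_j - y_i.
        g = b - a @ x - y
        return np.maximum(u + step * g, 0.0)  # project back onto R^m_+

A pricing variant in the spirit of [1] would evaluate lagrangian_costs only on the dual core C_d and periodically rebuild C_d from the full column set.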
We therefore propose new conditions that reduce the number of candidates in NB_2(x) \ NB_1(x), taking account of GUB constraints. As a result, the number of solutions searched by our algorithm becomes O(n + kν + n′τ), while the size of NB_2 is O(n²), where ν = max_{j∈N} |S_j|, n′ = Σ_{j∈N} x_j and τ = max_{j∈N} Σ_{i∈S_j} |N_i| with N_i = {j ∈ N | i ∈ S_j}.
Since the region searched in a single application of LS is limited, LS is usually applied many times. When a locally optimal solution is obtained, the standard strategy of our algorithm is to update the penalty weights and to resume LS from the obtained locally optimal solution. We accordingly evaluate solutions with an alternative evaluation function ẑ(x), where the original penalty weight vector w is replaced with ŵ = (ŵ_1, . . . , ŵ_m) ∈ R^m_+. Our algorithm iteratively applies LS, updating the penalty weight vector ŵ after each call to LS.
Starting from ŵ ≡ w, the penalty weight vector ŵ is updated as follows. Let x_best denote the best feasible solution found so far with respect to the original objective function z(x). If the previous locally optimal solution x satisfies ẑ(x) ≥ z(x_best), our algorithm uniformly decreases the penalty weights ŵ_i (i ∈ M). Otherwise, our algorithm increases the penalty weights ŵ_i (i ∈ M) in proportion to the amount of violation of the ith multicover constraint.
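A minimal sketch of this adaptive weight control (the decrease factor and increase rate below are our own illustrative choices; the paper does not fix them here):

    def update_penalty_weights(w_hat, violations, stagnated, dec=0.9, inc=0.1):
        # stagnated: True if the last local optimum x satisfies z^(x) >= z(x_best)
        if stagnated:
            # uniform decrease when the local optimum is no better than x_best
            return [wi * dec for wi in w_hat]
        # otherwise increase each weight in proportion to its constraint violation
        return [wi + inc * vi for wi, vi in zip(w_hat, violations)]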
Table 1. The benchmark instances for SMCP-GUB and the time limits (in seconds) for our algorithm LS-SR and the MIP solver CPLEX. The four instance types are characterized by d_h/|G_h|.

Instance  Rows  Columns    Density (%)  Type 1  Type 2   Type 3  Type 4   LS-SR  CPLEX
G.1–G.5   1000  10,000     2.0          1/10    10/100   5/10    50/100   600    3600
H.1–H.5   1000  10,000     5.0          1/10    10/100   5/10    50/100   600    3600
I.1–I.5   1000  50,000     1.0          1/50    10/500   5/50    50/500   600    3600
J.1–J.5   1000  100,000    1.0          1/50    10/500   5/50    50/500   600    3600
K.1–K.5   2000  100,000    0.5          1/50    10/500   5/50    50/500   1200   7200
L.1–L.5   2000  200,000    0.5          1/50    10/500   5/50    50/500   1200   7200
M.1–M.5   5000  500,000    0.25         1/50    10/500   5/50    50/500   3000   18,000
N.1–N.5   5000  1,000,000  0.25         1/100   10/1000  5/100   50/1000  3000   18,000
5 Computational Results
We first prepared eight classes of random instances for SCP, where each class has five instances. We denote the instances in class G as G.1, . . . , G.5, and the instances in classes H–N similarly. A summary of these instances is given in Table 1, where the density is defined by Σ_{i∈M} Σ_{j∈N} a_{ij} / (mn) and the costs c_j are random integers drawn from the interval [1, 100]. For each SCP instance, we generate four types of SMCP-GUB instances with different values of the parameters d_h and |G_h| as shown in Table 1, where all blocks G_h (h ∈ K) have the same size |G_h| and upper bound d_h within each instance. The right-hand sides b_i of the multicover constraints are random integers drawn from the interval [1, 5].
We compared our algorithm, called the local search algorithm with heuristic size reduction (LS-SR), with one of the latest mixed integer programming (MIP) solvers, CPLEX 12.3; both were tested on an IBM-compatible personal computer (Intel Xeon E5420, 2.5 GHz, 4 GB memory) and run on a single thread. Table 1 also shows the time limits in seconds for LS-SR and CPLEX 12.3, respectively. We tested two variants of LS-SR: LS-SR1 evaluates variables x_j with the proposed score ĉ_j(x), while LS-SR2 uses the Lagrangian cost c̃_j(u) in the heuristic reduction of problem sizes. Figure 1 compares them for each type of SMCP-GUB instance with respect to the relative gap ((z(x) − z_LP) / z_LP) × 100, where z_LP is the optimal value of the LP relaxation of SMCP-GUB. The horizontal axis shows the instance classes G–N, and the vertical axis shows the average relative gap over the five instances of each class.
We first observe that LS-SR1 and LS-SR2 achieve better upper bounds than CPLEX 12.3 for type 3 and type 4 instances, especially the large instances with 10,000 variables or more. One of the main reasons for this is that the proposed algorithms evaluate a series of candidate solutions efficiently, while CPLEX 12.3 consumes much computing time solving LP relaxation problems. We also observe that LS-SR1 achieves much better upper bounds than both LS-SR2 and CPLEX 12.3 for type 1 and type 2 instances.
6 Conclusion
In this paper, we considered an extension of SCP called the set multicover problem with generalized upper bound constraints (SMCP-GUB). For this problem, we developed a 2-flip neighborhood local search algorithm with a heuristic size reduction algorithm, in which a new evaluation scheme of variables is introduced taking account of GUB constraints. According to a computational comparison on benchmark instances with the latest version of the MIP solver CPLEX 12.3, our algorithm performs quite effectively on various types of instances, especially very large-scale ones.
References
1. Caprara, A., Fischetti, M., Toth, P.: A heuristic method for the set covering problem.
Oper. Res. 47, 730–743 (1999)
2. Caprara, A., Toth, P., Fischetti, M.: Algorithms for the set covering problem. Ann.
Oper. Res. 98, 353–371 (2000)
3. Umetani, S., Yagiura, M.: Relaxation heuristics for the set covering problem. J.
Oper. Res. Soc. Jpn. 50, 350–375 (2007)
4. Yagiura, M., Kishida, M., Ibaraki, T.: A 3-flip neighborhood local search for the set
covering problem. Eur. J. Oper. Res. 172, 472–499 (2006)
A Genetic Algorithm Approach
for the Multidimensional Two-Way
Number Partitioning Problem
1 Introduction
Several heuristic and metaheuristic approaches have been proposed for the TWNPP: a genetic algorithm by Ruml et al. [9], GRASP by Arguello et al. [1], tabu search by Glover and Laguna [5], a memetic algorithm by Berretta et al. [2], etc.
The problem has attracted a lot of attention due to its theoretical aspects and important real-world applications. For a more detailed description of the applications we refer to [3].
The multidimensional two-way number partitioning problem (MDTWNPP) was introduced by Kojic [8] and is a generalization of the TWNPP in which, instead of numbers, we are given a set of vectors: we look for a partition of the vectors into two subsets such that the sums per coordinate are as close as possible.
The MDTWNPP is NP-hard, since when the vectors have dimension one it reduces to the TWNPP, which is known to be NP-hard. Little research has been done on mathematical models and solution methods for this problem. Kojic [8] described an integer programming formulation and tested the model on randomly generated sets using CPLEX; as far as we know, this is the only existing approach to the problem. The reported experimental results show that the MDTWNPP is very hard to solve, even for medium-size instances.
The aim of this paper is to describe a novel use of genetic algorithms for solving the multidimensional two-way number partitioning problem. The results of preliminary computational experiments are presented, analyzed and compared with the previous method introduced by Kojic [8]. The results reveal that, on medium and large instances, our proposed methodology performs very well in terms of both the quality of the solutions obtained and the computational times.
Example. Consider the set of vectors S = {(1, 3), (5, 5), (3, −2), (−3, 12)} and the partition:
S_1 = {(5, 5)} and S_2 = {(1, 3), (3, −2), (−3, 12)};
then the sums per coordinate are (5, 5) and (1, 13), the difference is (4, 8), and therefore t = 8. The second component of the vector (1, 3) is the closest to t/2 = 4, so we reassign this vector to the subset S_1, obtaining the partition:
S_1 = {(1, 3), (5, 5)} and S_2 = {(3, −2), (−3, 12)}
with sums (6, 8) and (0, 10); the difference is (6, 2) and t = 6.
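A small sketch of the objective computed in this example, reading t as the largest coordinate-wise absolute difference of the two subset sums (the function name is ours):

    def partition_objective(s1, s2):
        dims = len((s1 or s2)[0])
        sum1 = [sum(v[d] for v in s1) for d in range(dims)]
        sum2 = [sum(v[d] for v in s2) for d in range(dims)]
        return max(abs(a - b) for a, b in zip(sum1, sum2))

    partition_objective([(1, 3), (5, 5)], [(3, -2), (-3, 12)])  # -> 6, as above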
In our algorithm we investigated and used the properties of (μ, λ) selection, where μ parents produce λ offspring (λ ≥ μ) and only the offspring undergo selection. In other words, the lifetime of every individual is limited to one generation. This may lead to short periods of recession, but it avoids long stagnation phases due to unadapted strategy parameters.
The genetic parameters are very important for the success of a GA. Based on preliminary experiments, we chose the following parameters: the population size μ was set to 10 times the number of vectors, the intermediate population size λ was chosen as ten times the population size (λ = 10 · μ), the mutation probability was set to 10 %, and the maximum number of generations (epochs) was set to 10000.
The algorithm terminates when this maximum number of generations is reached, or earlier if there is no improvement in the objective function for 15 consecutive generations.
For each instance, the table reports the best solution obtained by the CPLEX-based approach of Kojic [8] and the computational time needed to obtain it, while the last three columns give the results of our novel genetic algorithm: the best solutions, the average solutions, and the time required to obtain them. Because CPLEX did not finish within the time limit on any of the considered instances, the table reports the best solutions it found.
Analyzing the computational results, we observe that our genetic-algorithm-based heuristic provides better solutions than the CPLEX-based approach of Kojic [8].
References
1. Arguello, M.F., Feo, T.A., Goldschmidt, O.: Randomized methods for the number
partitioning problem. Comput. Oper. Res. 23(2), 103–111 (1996)
2. Berretta, R.E., Moscato, P., Cotta, C.: Enhancing a memetic algorithms’ perfor-
mance using a matching-based recombination algorithm: results on the number par-
titioning problem. In: Resende, M.G.C., Souza, J. (eds.) Metaheuristics: Computer
Decision-Making, pp. 65–90. Kluwer, Boston (2004)
3. Coffman, E., Lueker, G.S.: Probabilistic Analysis of Packing and Partitioning Algo-
rithms. Wiley, New York (1991)
4. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)
5. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic, Norwell (1997)
6. Johnson, D.S., Aragon, C.R., McGeoch, L.A., Schevon, C.: Optimization by simu-
lated annealing: an experimental evaluation. Part II: Graph coloring and number
partitioning. Oper. Res. 39(3), 378–406 (1991)
7. Karmarkar, N., Karp, R.M.: The differencing method of set partitioning, Techni-
cal report UCB/CSD 82/113, University of California, Berkeley, Computer Science
Division (1982)
8. Kojic, J.: Integer linear programming model for multidimensional two-way number
partitioning problem. Comput. Math. Appl. 60, 2302–2308 (2010)
9. Ruml, W., Ngo, J.T., Marks, J., Shieber, S.M.: Easily searched encodings for number
partitioning. J. Optim. Theor. Appl. 89(2), 251–291 (1996)
Adaptive Dynamic Load Balancing
in Heterogeneous Multiple GPUs-CPUs
Distributed Setting: Case Study
of B&B Tree Search
1 Introduction
Context and Motivation. The current trend in high performance comput-
ing is converging towards the development of new software tools which can be
efficiently deployed over large scale hybrid platforms, interconnecting several
hundreds to thousands of heterogeneous processing units (PUs) ranging from
multiple distributed CPUs, multiple shared-memory cores, to multiple GPUs.
Although the aggregation of those resources can in theory offer an impressive
computing power, achieving high performance and scalability is still bound to the
expertise of programmers in developing new parallel techniques and paradigms
operating both at the algorithmic and at the system levels. The heterogeneity
and incompatibility of resources in terms of computing power and programming
models make it difficult to parallelize a given application without significantly
drifting away from the optimal and theoretically attainable performance. In par-
ticular, when parallelizing highly irregular applications producing unpredictable
workload at runtime, mapping dynamically generated tasks into the hardware
so that workload is distributed evenly is a challenging issue. In this context,
adjusting the workload distributively is mandatory to maximize resource uti-
lization and to optimize work balance over massively parallel and large scale
distributed PUs.
experimented with the B&B algorithm and the well-known FlowShop combina-
torial optimization problem [14] as a case study. Firstly, on one single GPU, we
improve on the running time of the previous B&B GPUs implementations [4,11]
by at least a factor of two on the considered instances (the speedup with respect
to one CPU is around ×70). More importantly, independently of the scale and
power of CPUs or GPUs, our approach provides a substantial speed-up which is
nearly optimal compared to the ideal performance one could expect in theory.
It is worth to notice that although our experimentations are conducted for the
specific FlowShop problem, it is generic in the sense that it undergoes no spe-
cific optimization with respect to neither B&B nor FlowShop. Therefore, it can
be appropriate to solve other optimization problems, as far as a GPU parallel
evaluation (bounding) of search nodes (viewed as a blackbox) is available.
From the optimization perspective, relatively few investigations are known on heterogeneous parallel tree search algorithms. Specific to B&B, some very recent GPU parallelizations are known for some specific problems [2–4,10,11]. The focus there is mainly on the parallelization of the bounding step, which is known to be very time-consuming. The only study we found on aggregating the power of multiple GPUs presents a Master/Slave-like model and an experimental scale of 2 GPUs [4]. The authors there stressed the parallel design issues rather than scalability or performance optimality. They reported a good but sub-optimal speed-up when using 2 GPUs, which witnesses the difficulty of the problem. To the best of our knowledge, the new parallel approach presented in this paper is not only the first to scale near-linearly up to 20 GPUs, but also the first to address the joint use of multiple distributed CPUs in the system.
From the parallel perspective, very few works exist on the parallelization of highly irregular applications on heterogeneous platforms. In particular, we found no in-depth and systematic studies of application speed-up at different CPU-GPU scales. Since the adaptive workload distribution strategy adopted in this paper is generic and not specific to tree search or B&B, our study provides new insights into the scalability of distributed protocols harnessing both multiple GPUs and CPUs which have a substantial gap in their respective computing power.
Outline. In Sect. 2, we outline the main components underlying our distributed approach while motivating their design. A more detailed and technical description then follows in Sect. 3. In Sect. 4, we report and discuss our experimental results. In Sect. 5, we conclude the paper and raise some open issues.
To simplify the presentation and clarify our contribution, let us model the B&B algorithm as a tree search algorithm that starts from a root node representing an optimization problem. During the search, a parent node generates new child nodes (e.g., representing partial/complete candidate solutions) at runtime. The quality of these nodes is evaluated (bounding) using a given (heuristic) procedure. Then, according to the search state, some nodes are discarded (pruning) while others are selected and the tree is expanded (branching) to push the search forward, and so on. With this in mind, the general architecture of our approach for distributing search computations is depicted in Fig. 1 and discussed in the following subsections. Each subsection answers one of the three questions raised in the introduction.
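To make this tree search model concrete, the following is a minimal, sequential sketch of the select/bound/branch/prune loop for a minimization problem (the function names are ours; the distributed version of the paper replaces the single pool with work stealing among PUs):

    def branch_and_bound(root, lower_bound, branch, is_complete, value):
        best = float("inf")                    # incumbent objective value
        pool = [root]                          # pool of open tree nodes
        while pool:
            node = pool.pop()                  # select
            if lower_bound(node) >= best:      # bounding + pruning
                continue
            if is_complete(node):
                best = min(best, value(node))  # complete candidate solution
            else:
                pool.extend(branch(node))      # branching: generate children
        return best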
Performing the tree management operations inside the GPU could suffer from the thread divergence induced by the SIMD programming model of GPUs. The evaluation step, on the other hand, can highly benefit from the parallelism offered by the many GPU cores. These design/model choices are essentially motivated by the fact that the evaluation phase of many combinatorial optimization problems is very time-consuming, e.g., bounding for B&B, so that it dominates the other operations.
Although the GPU device can handle the evaluation of many tree nodes in parallel [4,10], the CPU host still has to prepare the data containing these nodes, copy them into GPU memory and copy back the results. This implies that while computations are carried out on the GPU device, the host is idle and vice versa. In our Level 2 parallelism, the host and the device run computations in parallel, i.e., while the device is evaluating tree nodes, the host is preparing new data for the next evaluation on the device. Notice that the evaluation of many tree nodes inside the GPU of course implies another type of parallelism, which we will not address in this paper, since our focus is on scalability and work distribution over multiple PUs.
A badly chosen steal granularity f can result in large idle times despite the fact that surplus work could be available. In classical RWS approaches, this is a hand-tuned parameter which depends on the distributed system and the application context [12]. In a theoretical study [1], the stability and optimality of RWS can be analytically guaranteed for f ≤ 1/2. In practice, the so-called steal-half strategy (f = 1/2) is often shown to perform efficiently with homogeneous computing units. In a heterogeneous and large-scale scenario, this parameter is even more sensitive because of the wide variety of computing capabilities of the different PUs. In this context, relatively little is known about how to attain good performance with RWS-based protocols.
To understand the issues we face when distributing tree search work over multiple CPUs and GPUs, one has to keep in mind that (i) a GPU is substantially faster at evaluating tree nodes than a CPU, and (ii) nothing can be assumed about the initial amount of tree nodes. Hence, if GPUs run out of work and stay idle searching for work, the performance of the system can drop dramatically. If only a few CPUs are available in the system, work stealing operations from CPUs to GPUs can cause a severe performance penalty. This is because the few CPUs can only contribute very little to the overall performance, but their stealing operations can disturb the GPU computations and prevent them from reaching their maximal speed. In contrast, if work is scheduled more on GPUs, then a significant loss in performance can occur when a relatively large number of CPUs are available. To tackle these issues, we propose to configure RWS so that, when performing a steal operation, the value of f is computed at runtime based on the normalized powers of the thief and the victim, where the computing power of every PU is estimated continuously at runtime with respect to the application being tackled.
Recent GPU devices provide two copy engines, one for uploading data from host to device and one for downloading from device to host. Each engine is equipped with a queue to store pending data and kernels that will be processed by the engine shortly.
The Level 2 host-device parallelism discussed in our approach can be enabled using CUDA primitives as sketched in Algorithm 1. Each Enqueue procedure dispatches CUDA operations to the GPU device asynchronously, i.e., pushes/retrieves data and launches the kernel. This is made possible by wrapping those operations in a CUDA stream. All operations inside the same CUDA stream are automatically synchronized and executed sequentially, but the CUDA operations of different streams can overlap with one another, e.g., the kernel of stream 1 can execute while data from stream 2 is retrieved in parallel. In our implementation, we use a maximum number of streams, the variable rmax, which bounds the number of (data, kernel) elements in the queues of the GPU copy and compute engines. The maximum number of streams that a GPU can handle depends in general on the GPU's global memory characteristics. For B&B search, data is a pool of tree nodes and kernel is the bounding function.
Asynchronously, in parallel to the Enqueue procedure, the Dequeue procedure in Algorithm 1 waits for data copied back from the device on a given CUDA stream and processes the output data. In our B&B implementation, this corresponds to the pruning operation. Notice also that Algorithm 1 is independent of the specific data or kernels being used, so it can be customized with respect to the search operations or optimization problems at hand. In particular, any existing kernel implementing parallel tree node evaluation is applicable.
Procedure Enqueue
1 while q_host_size < rmax do
2   q_host[w_index].task ← prepare a pool of tree nodes;
    // Asynchronous operations on stream[w_index]
3   cudaMemcpyAsync(q_device[w_index], q_host[w_index], sizeof(q_host[w_index].task), cudaMemcpyHostToDevice, stream[w_index]);
    // Launch parallel evaluation (bounding) on the device
4   KERNEL<<< stream[w_index] >>>(q_device[w_index]);
5   cudaMemcpyAsync(q_host[w_index].bound, q_device[w_index].bound, sizeof(q_device[w_index].bound), cudaMemcpyDeviceToHost, stream[w_index]);
6   w_index ← (w_index + 1) mod rmax; q_host_size ← q_host_size + 1;
Procedure Dequeue
1 if q_host_size > 0 then
    // Wait for results from the device on stream[r_index]
2   cudaStreamSynchronize(stream[r_index]);
3   process output data from q_host[r_index], i.e., prune nodes;
4   r_index ← (r_index + 1) mod rmax; q_host_size ← q_host_size − 1;
Procedure Thief
1 x ← runtime normalized computing power;
2 repeat
3   u ← pick one PU victim uniformly at random;
    // v denotes the actual thief PU executing the procedure
4   send a steal request message (v, x) to u;
5   receive u's response (reject or work) message;
6 until some tree nodes are successfully transferred from victim u;

Procedure Victim
1 if a steal request is pending then
2   y ← runtime normalized computing power;
3   if tree nodes are available then
4     (v, x) ← pull the next pending thief request;
5     work ← share tree nodes in the proportion of x/(x + y);
6     send back the shared work to v;
7   else
8     send back a reject message to v;
In the literature, stealing half has been shown to work best for binomial trees, whereas stealing a fixed amount of work items (e.g., 7 items) is shown to work well for geometric trees. Besides, in a heterogeneous and hybrid computing system, the hardware characteristics of the PUs, e.g., clock speed, cache, RAM, etc., may need to be taken into account to balance the workload evenly according to the capabilities of every available PU. Because high variations in computing power among PUs can lead to high imbalance and idle times, this issue has to be managed carefully when distributing work. One possible solution would be to profile the system components/PUs and tune the work granularity offline, before application execution, in order to get the best performance. It should be clear that such an approach is neither reasonable nor feasible, for instance when the system comprises a huge number of different types of PUs, or when many different applications are at hand.
In our stealing approach, every PU maintains at runtime a measure reflecting its computing power, i.e., variable x in Algorithm 2. As the computations proceed, every PU continuously adjusts this measure with respect to the work processed in the previous iterations. In our approach, we simply use the average time needed for processing one tree node. More precisely, each PU sets its computing power to x = N/T, where T is the (normalized) time elapsed since the PU started the computation and N is the number of tree nodes explored locally by that PU. Notice that the time T includes, in addition to tree node evaluation (i.e., B&B lower bounding), the time needed for the other search operations (i.e., select, branch and prune), but not the time during which a PU stays idle. When running out of work, a PU v then attempts to steal work by sending a request message to another PU u chosen at random, wrapping the value of x in the request. If the victim has some work to serve, then the amount of work (i.e., the number of tree nodes) to be transferred is in the proportion of x/(x + y), where y is the computing power maintained locally by the victim. Otherwise, a reject message is sent back to notify the thief and a new stealing round is performed. Initially, the value of x is normalized so that all PUs have the same computing ability. In other words, the system starts by stealing half, and the stealing granularity is then refined for each pair of PUs. Intuitively, each PU acts as a black hole: the higher the computing power of a PU, the more of the available work flows towards it. Furthermore, no a priori knowledge about the relative powers of the PUs is required.
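A compact sketch of this bookkeeping and of the victim's proportional split (the class and function names are ours; message passing is abstracted away):

    class PUPower:
        # Runtime estimate x = N / T of a PU's computing power
        def __init__(self):
            self.nodes_explored = 0   # N: tree nodes processed locally
            self.busy_time = 1e-9     # T: elapsed non-idle time

        def value(self):
            return self.nodes_explored / self.busy_time

    def split_work(pool, x_thief, y_victim):
        # Victim shares tree nodes in the proportion x / (x + y)
        k = int(len(pool) * x_thief / (x_thief + y_victim))
        return pool[:k], pool[k:]     # (sent to thief, kept by victim)

With equal initial power estimates this reduces to steal-half, matching the start-up behavior described above.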
4 Experimental Results
4.1 Experimental Setting
For each run, we measure T and N, respectively the time needed to complete the B&B tree search and the number of B&B tree nodes that were effectively explored. All reported speedups are based on the number of B&B tree nodes explored per time unit, that is, N/T.
Fig. 2. Level 2 parallelism vs. the baseline sequential host-device execution [4]. Left: execution time for different numbers rmax of CUDA streams with s = sΣ (lower is better). Right: speedup w.r.t. the baseline for different values of s (4096 and 8192) and rmax = 10 (higher is better). The x-axes range over the FlowShop instances Ta21–Ta30.
Fig. 3. Left: scalability of our adaptive approach vs. static stealing (f = 1/8, 1/4, 1/2) with s = sΣ, against the ideal linear speedup; the x-axis gives the number of GPUs (1–16) in log scale, the y-axis the speedup with respect to one GPU. Right: speedups of our approach as a function of the pool size s (2048–8192) for 2, 4, 8 and 16 GPUs; rmax = 10.
In this section, we study the properties of our approach when mixing both CPUs and GPUs. For that purpose, we proceed as follows. Let α_ij be the speedup obtained by a single PU j with respect to PU i. We naturally define the linear (ideal) normalized speedup with respect to PU i to be Σ_j α_ij. For instance, having p identical GPUs and q identical CPUs, each GPU being β times faster than each CPU, our definition gives a linear speedup with respect to one GPU (resp. one CPU) of p + q/β (resp. q + β · p). The following sets of experiments allow us to appreciate the performance of our approach when varying substantially the ratio between the numbers of GPUs and CPUs.
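As a quick worked example of this definition (the β value below reuses the ×70 GPU-over-CPU speedup reported earlier; the function names are ours):

    def ideal_speedup_wrt_one_gpu(p, q, beta):
        # p identical GPUs plus q identical CPUs, each GPU beta times faster
        return p + q / beta

    def ideal_speedup_wrt_one_cpu(p, q, beta):
        return q + beta * p

    ideal_speedup_wrt_one_gpu(2, 128, 70)  # ~ 3.83 for 2 GPUs and 128 CPUs
    ideal_speedup_wrt_one_cpu(2, 128, 70)  # 268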
Fig. 4. Speedup of our approach (Adaptive) vs. static stealing (Steal 1/2) and Weighted Steal, against the linear GPU-normalized speedup, when scaling the number of CPUs (1–128, log scale) and using 1 GPU (left) and 2 GPUs (right). Speedups are w.r.t. one GPU; rmax = 10.
GPU Scaling. We now fix the number of CPUs and study the behavior of the system when scaling the number of GPUs. Results with 128 (identical) CPUs and (identical) GPUs ranging from 1 to 16 are reported in Fig. 5. We can similarly see that our adaptive approach still scales in a linear manner while being near optimal. It also substantially outperforms the static steal-half strategy.
Mixed Scaling. Our last set of experiments is more complex, since we mix multiple GPUs with empirically different powers and multiple CPUs with different clock speeds. This scenario is intended to reproduce a heterogeneous setting where even PUs of the same family do not have the same computing abilities. In this kind of scenario, where in addition the power of PUs can evolve, e.g., due to system maintenance constraints or hardware renewals/updates, even a weighted hand-tuned steal strategy is neither plausible nor applicable. In the results of Fig. 6, we fix the number of CPUs to 128, with half of them taken from cluster C2 and the other half from cluster C3 (C2 and C3 have different CPU clock speeds, as specified previously). For GPUs, we proceed as follows. We use a variable number of GPUs in the range p ∈ {1, 4, 8, 12, 16, 20}. For p > 1, we configure the system so that 1/2 of the GPUs run a kernel with pool size sΣ, 1/4 of them with pool size sΣ/2 and the last 1/4 of them with pool size sΣ/4. Once again, our approach is able to adapt the load in this complex heterogeneous scenario and to obtain a nearly optimal speedup while outperforming the standard steal-half strategy.
Fig. 5. Speedup (w.r.t. one GPU) of our approach (Adaptive) vs. Steal 1/2 and Weighted Steal, against the linear GPU-normalized speedup, when scaling the number of GPUs (1–16) and using 128 CPUs; rmax = 10.
Fig. 6. Speedup of our approach (Adaptive) vs. Steal 1/2, against the linear GPU-normalized speedup, when scaling heterogeneous GPUs (1/2 configured with sΣ, 1/4 with sΣ/2, 1/4 with sΣ/4) together with 128 heterogeneous CPUs (1/2 from cluster C2, 1/2 from cluster C3). Speedup is w.r.t. one GPU configured with sΣ; rmax = 10.
From the previous sets of experiments we can thus conclude that our approach takes full advantage of both GPU and CPU power, independently of the considered scales and without any hand-tuned parameter.
5 Conclusion
In this paper, we proposed and experimentally evaluated an adaptive load-balancing distributed scheme for parallelizing compute-intensive B&B-like tree search algorithms in heterogeneous systems, where multiple CPUs and GPUs with possibly different properties are used. Our approach is based on a two-level parallelism
allowing for (i) distributed subtree exploration among PUs and (ii) concurrent
operations between every single GPU host and device. Through extensive experi-
ments involving different PU configurations, we showed that the scalability of our
approach is near optimal, which leaves very little space for further improvements.
Besides experimenting with our approach using other problem-specific GPU kernels, one interesting and challenging research direction would be to extend it to a dynamic distributed environment where: (i) processing units can
join or leave the system, and (ii) different end-users can concurrently request the
system for solving different optimization problems. In this setting, the load has
to be balanced not only w.r.t. the irregularity/dynamicity of a single application, but also w.r.t. many other factors and constraints that may affect the computing system at runtime.
References
1. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work
stealing. J. ACM 46, 720–748 (1999)
2. Boukedjar, A., Lalami, M.E., El-Baz, D.: Parallel branch and bound on a CPU-
GPU system. In: 20th International Conference on Parallel, Distributed and
Network-Based Processing, pp. 392–398 (2012)
3. Carneiro, T., Muritiba, A.E., Negreiros, M., De Campos, L., Augusto, G.: A new
parallel schema for branch-and-bound algorithms using GPGPU. In: 23rd Sym-
posium on Computer Architecture and High Performance Computing, pp. 41–47
(2011)
4. Chakroun, I., Melab, N.: An adaptive multi-GPU based branch-and-bound. A case study: the flow-shop scheduling problem. In: 14th IEEE International Conference on High Performance Computing and Communications (2012)
5. Dijkstra, E.W.: Derivation of a termination detection algorithm for distributed
computations. In: Broy, M. (ed.) Control Flow and Data Flow: Concepts of Dis-
tributed Programming, pp. 507–512. Springer, Berlin (1987)
6. Dinan, J., Olivier, S., Sabin, G., Prins, J., Sadayappan, P., Tseng, C.-W.: A mes-
sage passing benchmark for unbalanced applications. Simul. Model. Pract. Theor.
16(9), 1177–1189 (2008)
7. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. SIGPLAN Not. 33, 212–223 (1998)
8. Grid5000, the French national grid. https://www.grid5000.fr/
9. Dinan, J., Larkins, D.B., Sadayappan, P., Krishnamoorthy, S., Nieplocha, J.: Scalable work stealing. In: Proceedings of the ACM Conference on High Performance Computing Networking, Storage and Analysis, pp. 53:1–53:11 (2009)
10. Lalami, M.E., El-Baz, D.: GPU implementation of the branch and bound method
for knapsack problems. In: IPDPS Workshops, pp. 1769–1777 (2012)
11. Melab, N., Chakroun, I., Mezmaz, M., Tuyttens, D.: A GPU-accelerated b&b algo-
rithm for the flow-shop scheduling problem. In: 14th IEEE Conference on Cluster
Computing (2012)
12. Min, S.-J., Iancu, C., Yelick, K.: Hierarchical work stealing on manycore clusters.
In: Proceedings of 5th Conference on Partitioned Global Address Space Program-
ming Models (2011)
13. Saraswat, V.A., Kambadur, P., Kodali, S., Grove, D., Krishnamoorthy, S.: Lifeline-
based global load balancing. In: 16th ACM Symposium on Principles and Practice
of Parallel Programming (PPoPP ’11), pp. 201–212 (2011)
14. Taillard, E.: Benchmarks for basic scheduling problems. Eur. J. Oper. Res. 64(2),
278–285 (1993)
Multi-Objective Optimization for Relevant
Sub-graph Extraction
1 Introduction
Nowadays, many types of data can be represented as graphs (networks), such as web graphs, social networks, biological networks, communication networks, road networks, etc. Mining hidden knowledge in networks is a non-trivial task because networks are usually big (containing thousands of nodes and edges) and discrete (mathematical models are difficult to apply). A graph is a structure that consists of vertices (nodes) and edges. Vertices can be anything, such as pages in web graphs, members in social networks, or proteins and genes in biological networks. Edges, which are links between two nodes, represent a relationship between the two nodes. There are many problems involving graphs, such as shortest paths, graph coloring, routing, etc. In this work, we focus on graph clustering, the problem of grouping vertices into clusters satisfying some pre-defined criteria. Graph clustering methods have been applied in many fields, especially to biological networks [1–3,10]. Most of the sub-graph extraction methods introduced so far are based on graph clustering, such as the Markov Cluster Algorithm (MCL) [5], a scalable clustering algorithm for graphs. The basic idea of MCL rests on Markov chains: starting at a node and then randomly traveling to a connected node, one is more likely to stay within a cluster than to travel between clusters. Spirin et al. [13] used three
F(V', E') = \bigl(f_1(V', E'),\ f_2(V', E')\bigr) \tag{1}

subject to

f_2(V', E') > 0 \tag{2}

where
– f_1(V', E') represents the density of the sub-graph, given by

f_1(V', E') = \frac{2\,|E'|}{|V'|\,(|V'| - 1)} \tag{3}
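For concreteness, a one-line sketch of the density objective (the function name is ours):

    def density(num_nodes, num_edges):
        # f1 = 2|E'| / (|V'| (|V'| - 1)): fraction of all possible edges present
        return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

    density(4, 3)  # a 4-node sub-graph with 3 of its 6 possible edges -> 0.5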
Fig. 1. Circle representation of a solution. The dashed circle has center P1 and radius one; the dash-dot circle has center P1 and radius two.
3 Results
Protein complexes play an important role in cellular organization and function. The accumulation of large-scale protein interaction data for multiple organisms requires novel computational techniques to analyse biological data through these large-scale networks. Biological information can often be represented in the form of a ranked list.
The Biological General Repository for Interaction Datasets (BioGRID) [14]
contains genetic and protein interaction data from model organisms and humans.
4 Conclusion
The objective of this work is to pose the problem of finding a subgraph connecting n nodes in a graph as a multi-objective problem in the context of systems biology. We used stochastic methods to solve the problem and a graph-specific representation of solutions, a center and a radius, to reduce the search space. We used simulated annealing and a genetic algorithm to solve the optimization problem and compared our results with a graph clustering method, the MCL algorithm. MCL does not actually solve this problem, as it does not maximize f_2, but graph clustering was the most logical choice for a comparison.
MCL produced very big sub-graphs (2000–6000 nodes), making the analysis of the results extremely difficult and adding many false positives, whereas the results produced by the genetic algorithm had 29–190 nodes, which is much more realistic. The size of the MCL clusters explains why their density is so low (0.1 on average) and why the number of nodes from L is so high, thereby motivating the view of the problem as a multi-objective one. Moreover, the execution time of MCL on such large networks is on the order of hours or days (here two and a half hours on four cores), whereas solving the problem with the genetic algorithm took two minutes on average.
References
1. Bader, G., Hogue, C.: An automated method for finding molecular complexes in
large protein interaction networks. BMC Bioinformatics 4, 2 (2003)
2. Birmele, E., Elati, M., Rouveirol, C., Ambroise, C.: Identification of functional
modules based on transcriptional regulation structure. BMC Proc. 2(Suppl 4), S4
(2008)
3. Brohee, S., van Helden, J.: Evaluation of clustering algorithms for protein-protein
interaction networks. BMC Bioinformatics 7, 488 (2006)
4. Coello, C.A.C., Dhaenens, C., Jourdan, L. (eds.): Advances in Multi-Objective
Nature Inspired Computing. Studies in Computational Intelligence, vol. 272.
Springer, Heidelberg (2010)
5. van Dongen, S.: Graph clustering by flow simulation. Ph.D. thesis, University of
Utrecht (May 2000)
6. Dupont, P., Callut, J., Dooms, G., Monette, J.N., Deville, Y.: Relevant subgraph
extraction from random walks in a graph (2006)
7. Faloutsos, C., McCurley, K.S., Tomkins, A.: Fast discovery of connection subgraphs. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pp. 118–127 (2004)
8. Gavin, A., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C.,
Jensen, L., Bastuck, S., Dumpelfeld, B.: Proteome survey reveals modularity of
the yeast cell machinery. Nature 440, 631–636 (2006)
9. Ishibuchi, H., Yoshida, T., Murata, T.: Balance between genetic search and local
search in hybrid evolutionary multi-criterion optimization algorithms. IEEE Trans.
Evol. Comput. 7, 204–223 (2002)
10. Jiang, P., Singh, M.: Spici: a fast clustering algorithm for large biological networks.
Bioinformatics 26(8), 1105–1111 (2010)
11. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science 220, 671–680 (1983)
12. Mueller-Gritschneder, D., Graeb, H., Schlichtmann, U.: A successive approach to
compute the bounded pareto front of practical multiobjective optimization prob-
lems. SIAM J. Optim. 20, 915–934 (2009)
13. Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular
networks. Proc. Natl Acad. Sci. 100, 12123–12128 (2003)
14. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.:
Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–
D539 (2006)
15. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4, 65–85 (1994)
PROGRESS: Progressive
Reinforcement-Learning-Based Surrogate
Selection
1 Introduction
The optimization of real-world systems based on expensive experiments or time-consuming simulations constitutes an important research area. Against the back-
ground of increasing flexibility and complexity of modern product portfolios,
such kinds of problems have to be constantly solved. The use of surrogate (meta)-
models fˆ for approximating the expensive or time-consuming objective function
f : x → y represents an established approach to this task. After determining
the values of f for the points x of an initial design of experiments, the surrogate
model fˆ is computed and then used for the further analysis and optimization.
Here, we consider deterministic, i.e., noise-free minimization problems. In such
a scenario, the former approach has a conceptual drawback. The location of
the optimum can only roughly be determined based on the initial design. A high
accuracy of the optimization on the model does not necessarily provide improved
quality with respect to the original objective function. As a consequence, the
resources expended for the usually uniform coverage of the experimental region
for the approximation of the global response surface may be spent more effi-
ciently in order to increase the accuracy of the surrogate in the regions of the
actual optimum.
A solution to this problem is provided by sequential techniques, called effi-
cient global optimization (EGO) [16], sequential parameter optimization [1] and
sequential designs [24] within the different disciplines. Sequential techniques do
not focus on an approximation of a global response surface, but on an efficient
way to obtain the global minimum of the objective function f . After evaluating
a sparse initial design in the parameter space, much smaller than the actual
experimental budget, the surrogate model is fitted and proposes a new point
which is then evaluated on the original function f . The point is added to the
design and the procedure is repeated until the desired objective value has been
obtained or the experimental budget has been depleted.
For the design of a sequential technique, the choice of the surrogate model
is a crucial decision. Whereas resampling techniques [3] can be used to estimate
the global prediction quality in the classical approach, the optimization capa-
bilities of a model have to be assessed in the sequential approach. Therefore,
this capability is not necessarily static. Some models may be suited to efficiently
identify the most promising basin in the beginning, whereas others are good for
refining the approximation in the final stage of the optimization.
In this paper, we tackle the model selection problem for sequential optimiza-
tion techniques. To this end, the proposed optimization algorithm utilizes a heterogeneous ensemble of surrogate models. An approach to solve the progressive model
selection problem is proposed as a central scientific contribution. It is designed
to identify models that are most promising at a certain stage of the optimization.
Preference values are used to stochastically select a surrogate model, which in
turn proposes a new design point in each iteration of the algorithm. Based on the
quality of this design point, the rating of the model is adjusted by means of rein-
forcement learning techniques. The procedure is general and can be performed
with arbitrary ensembles of surrogate models.
In the following, an overview of related research is provided by means of
a brief review of the literature in Sect. 2. Details of the applied methods are
presented in Sect. 3 and the actual PROGRESS algorithm is described in Sect. 4
based on these foundations. In Sect. 6, its results on the benchmark specified
in Sect. 5 are presented and discussed. The main conclusions are summarized in
Sect. 7 and an outlook for further research in this area is introduced.
2 Review
In the following review of the literature, we mainly restrict the focus to sequential optimization techniques using ensembles of surrogate models in which the selection or combination of the models for the internal optimization is dynamically adapted.
Currently, only a few related approaches exist for the third concept of ensem-
ble techniques. Its main advantage is that it constitutes a very general approach
which also allows many heterogeneous models to be integrated into the ensem-
ble since only one model has to be fitted and optimized per iteration. Thus,
the selected model can be subject to a more time-consuming tuning to specifi-
cally adapt it to the objective function. Friese et al. [8] applied and compared
different strategies to assess their suitability for sequential parameter optimiza-
tion, among them also ensemble-based methods using reinforcement learning.
However, these methods were used in a rather out-of-the-box manner, without
specifically adapting the generic reinforcement learning techniques to the prob-
lem at hand to exploit their full potential. Some of the potential problems, as
well as enhancements to overcome them, will be discussed in this paper. Another
variant of the approach was applied in the context of operator selection in evo-
lutionary algorithms by Da Costa et al. [5].
3 Methods
Response surface models are a common approach in cases where the budget of evaluations available for the original function is severely limited. As this surface can be explored with a much larger number of evaluations, the optimum of the so-called infill criterion can be accurately approximated using standard optimization methods. This generic MBO approach is summarized in Algorithm 1. The stages of proposing new points and updating the model alternate in a sequential fashion.
In the following, the steps of the generic MBO algorithm are discussed and
some details are provided.
1. For the initial design, many experimental design types are possible, but for
nonlinear regression models usually space-filling designs like Latin hypercube
sampling (LHS) are used, see [4] for an overview. Another important choice is
the size of the initial design. Rules of thumb are usually somewhere between
4d and 10d, the latter being recommended by [16].
2. As surrogate model, kriging was proposed in the seminal EGO paper [16]
because it is especially suited for nonlinear, multimodal functions and allows
local refinements to be performed, but basically any surrogate model is possi-
ble. As presented in Sect. 2, also more sophisticated approaches using ensem-
ble methods have been applied within MBO algorithms.
3. The infill criterion is optimized in order to find the next design point for evaluation. It measures how promising the evaluation of a point x is according to the surrogate model. One obvious choice is the direct use of fˆ(x). For kriging models, the expected improvement (EI) [23] is commonly used. It factors in the local model uncertainty in order to guarantee a reasonable trade-off between the exploration of the decision space and the exploitation of the already obtained information (see the sketch after this list). These and other infill criteria have been proposed and assessed in several studies [15,25,30,36].
4. As a stopping criterion, a fixed budget for function evaluations, the attain-
ment of a specified y-level, or a combination of both is often used in practice.
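For reference, a minimal sketch of the standard EI formula for minimization, given a kriging prediction mean mu and standard deviation sigma at a point (the function name is ours; this is the textbook formula, not code taken from the paper):

    from statistics import NormalDist

    def expected_improvement(mu, sigma, y_min):
        # EI = (y_min - mu) * Phi(z) + sigma * phi(z), with z = (y_min - mu) / sigma
        if sigma <= 0.0:
            return max(y_min - mu, 0.0)   # deterministic prediction
        nd = NormalDist()
        z = (y_min - mu) / sigma
        return (y_min - mu) * nd.cdf(z) + sigma * nd.pdf(z)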
Based on these probabilities, we select action at and receive its stochastic reward
rt . Assuming we are in a general nonstationary scenario, we now have to decide
whether rt is favourable or not. For this, we compare it with a reference reward
r̄t , which encodes the expected, average pay-off across all actions at iteration t.
Assuming we already have such an r̄_t, the element of the preference vector q_t for the chosen action a_t = v_j ∈ A is updated, while the preferences for all other actions stay the same:

q_{t+1,k} =
\begin{cases}
q_{t,k} + \beta\,[r_t - \bar{r}_t], & \text{if } k = j, \\
q_{t,k}, & \text{otherwise.}
\end{cases} \tag{2}

Here, the strategy parameter β encodes the strength of the adjustment, i.e., the desired trade-off between exploitation (high β) and exploration (low β).
Finally, we update the reference reward r̄_t via the following exponential smoothing formula:

\bar{r}_{t+1} = \bar{r}_t + \alpha\,(r_t - \bar{r}_t). \tag{3}

The strategy parameter α ∈ (0, 1] determines how much influence the current reward has on the reference reward, i.e., how much we shift the reference reward towards r_t. It thus reflects the assumed degree of nonstationarity in the reward distributions.
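A compact sketch of one reinforcement comparison step, using the paper's tuned values α = 0.1 and β = 0.25 as defaults (the softmax mapping from preferences to probabilities is our assumption for the earlier equation (1), being the usual choice for this method):

    import math

    def selection_probabilities(q):
        # pi_t from preferences q_t via a numerically stable softmax
        m = max(q)
        e = [math.exp(v - m) for v in q]
        s = sum(e)
        return [v / s for v in e]

    def reinforcement_comparison_step(q, r_bar, j, r, alpha=0.1, beta=0.25):
        q = list(q)
        q[j] += beta * (r - r_bar)    # eq. (2): adjust only the chosen action
        r_bar += alpha * (r - r_bar)  # eq. (3): smooth the reference reward
        return q, r_bar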
4 Algorithm
In this section, we address how action selection by means of reinforcement comparison can be exploited for model selection in MBO. Regarding models as selectable actions is straightforward, but beyond that, many technical details of the basic method in Sect. 3.2 have to be clarified or adapted. The reward will be based on the improvement in objective value obtained by the proposed x∗. The sum of rewards over time then measures the progress made during the optimization. The main idea is that models which generated larger improvements in the past should be preferred in the future.
Instead of using an expected improvement criterion, we directly optimize
the response surface of the selected model, i.e., no local uncertainty estimation
is used. Although this carries the risk of getting stuck in a local optimum for
one model, it offers two important advantages: (a) It is possible to optimize
this criterion for arbitrary regression models, and (b) by using a heterogeneous
Algorithm 2: PROGRESS
1 Let f be the black-box function that should be optimized;
2 Let E = {h1 , . . . , hm } be the regression models in the ensemble;
3 Generate initial design {x1 , . . . , xn };
4 Evaluate f on design ∀i ∈ {1, . . . , n} : yi = f (xi );
5 Let D={(x1 , y1 ), . . . , (xn , yn )};
6 for j ∈ {1, . . . , m} do
7 Build surrogate model ĥj based on D;
8 Select next promising point x∗ (ĥj ) by optimizing ĥj ;
9 Evaluate new point y ∗ (ĥj ) = f (x∗ (ĥj ));
10 Extend design set D ← D ∪ {(x∗ (ĥj ), y ∗ (ĥj ))};
11 Calculate the vector of initial rewards r_0 ∈ R^m using equation (5);
12 Initialize the preference vector q_1 = r_0 and the reference reward r̄_1 = median(r_0);
13 Let t = 1;
14 while stopping rule not met do
15 Calculate model selection probabilities π t from q t using equation (1);
16 Sample model hj according to π t ;
17 Build surrogate model ĥj based on D;
18 Select next promising point x∗ (ĥj ) by optimizing ĥj ;
19 Evaluate new point y ∗ (ĥj ) = f (x∗ (ĥj ));
20 Extend design set D ← D ∪ {(x∗ (ĥj ), y ∗ (ĥj ))};
21 Calculate reward rt using equation (4);
22 Update preferences using equation (2) to obtain q t+1 ;
23 Update reference reward using equation (3) to obtain r̄t+1 ;
24 t ← t+1
ensemble the models are likely to focus on different basins. The exploration thus
emerges from a successful adaptation of the model selection.
The complete procedure of the PROGressive REinforcement-learning-based
Surrogate Selection (PROGRESS) is shown in Algorithm 2. It is an enhanced
instance of the generic Algorithm 1, whereby the methodological contributions
are: the initialization phase (lines 6 to 13), the stochastic model selection (lines
16 to 17) and the preference vector and reference reward updates (lines 22 to
24). Details are provided in the following.
4.1 Rewards
Let y_min be the minimum response value of the design before the integration of x∗(ĥ_j). The reward of a chosen surrogate model h_j is then given by

r_t = \phi\bigl(y_{\min} - y^*(\hat{h}_j)\bigr), \tag{4}

where x∗(ĥ_j) is the next point proposed by model ĥ_j and φ is a simple linear rescaling function which will be detailed in the next section. Thereby, it is possible that a model produces negative rewards. We intentionally use all available
5 Experimental Setup
PROGRESS and all experiments in this article have been implemented in the
statistical programming language R [26]. We analyzed the performance of our
Table 1. Overview of the considered test functions and the problem features covered.

Separable: 1 Sphere, 2 Ellipsoidal, 3 Rastrigin, 4 Bueche-Rastrigin, 5 Linear slope
Low or moderate conditioning: 6 Attractive sector, 7 Step ellipsoidal, 8 Rosenbrock (original), 9 Rosenbrock (rotated)
High conditioning and unimodal: 10 Ellipsoidal function, 11 Discus function, 12 Bent cigar, 13 Sharp ridge, 14 Different powers
Adequate global structure: 15 Rastrigin, 16 Weierstrass, 17 Schaffers F7, 18 Schaffers F7 (ill-conditioned), 19 Composite Griewank-Rosenbrock
Weak global structure: 20 Schwefel, 21 Gallagher's Gaussian 101-me Peaks, 22 Gallagher's Gaussian 21-hi Peaks, 23 Katsuura, 24 Lunacek bi-Rastrigin
algorithm on the 24 test functions of the BBOB noise-free test suite [13], which
is a common benchmarking set for black-box optimization. It covers a variety
of functions that differ w.r.t. problem features like separability, multi-modality,
ill-conditioning and existence of global structure. A summary of these functions
and their respective properties is provided in Table 1. Their function definitions,
box constraints and global minima were taken from the soobench R package
[21]. The dimension of the scalable test functions was set to d = 5.
The regression models used in our PROGRESS ensemble and their respective
R packages are listed in Table 2: A second order (polynomial) response surface
model (RSM) [14], a kriging model with power exponential covariance kernel
[29], multivariate adaptive regression splines (MARS) [6], a feedforward neural
network [14] with one hidden layer, a random forest [14], a gradient boosting
machine (GBM) [7] and a regression tree (CART) [14]. Table 2 also lists (con-
stant) parameter settings of the regression models which deviate from default
values and box constraints for parameters which were tuned prior to model fit-
ting in every iteration of PROGRESS based on all currently observed design
points (lines 7 and 17 in Algorithm 2). To accomplish this, we used a 10-fold
repeated cross-validation (5 repetitions) to measure the median absolute predic-
tion error and minimized this criterion in hyperparameter space by CMAES with
a low number of iterations. Integer parameters were determined by a rounding
strategy and CMAES was always started at the point in hyperparameter space
which was discovered as optimal during the previous tuning run of the same
model (or a random point if no such point is available).
The response surfaces of the regression models were also optimized by CMAES
with 100 iterations and λ = 10d offspring. This setting deviates from the literature
recommendation for technical reasons as more parallel model evaluations (predic-
tions) reduce the computational overhead and lead to a better global search qual-
ity with the same number of iterations. We performed 10 random CMAES restarts
and one additional restart at the currently best point of the observed design points
for further exploration of the search space and to reduce the risk of getting stuck
in a local optimum.
The learning parameters α and β of PROGRESS were manually tuned prior
to the experiments. In order to not bias the results by overtuning on the prob-
lems, these parameters were fixed to α = 0.1 and β = 0.25 for all of the test
instances. Thereby, our aim was to find a global parametrization leading to
robust results and a successful adaptation of the selection probabilities.
6 Results
In the following results, we report the distance of the best visited design point to the global optimum in objective space as a measure of algorithm performance. Because the optimization of the black-box function at hand is our major interest, we do not report global prediction quality indicators of the internal models. Neither do we calculate expected run times to reach a certain target level, as we assume an a priori fixed budget of function evaluations. The performance distributions of the 20 runs of each algorithm on the 24 test functions of the benchmark are shown in Fig. 1 on a logarithmic scale.
The most obvious result is the superiority of the sequential designs over
the static random LHS. In all cases, except for functions 12 and 23, at least
one of the three sequential algorithms outperforms the static approach. Test
functions 12 and 23 can apparently not be approximated well enough by any of
the considered regression models to provide helpful guidance for the optimization
process. This might be due to the high condition number (function 12) or the
absence of global structure and an extreme number (> 10^5) of local optima
(function 23). The comparison of the two versions of PROGRESS with and
without hyperparameter tuning shows that both variants obtain similar results
in most instances. There are cases (e.g., functions 15, 17, and 19), however, where
hyperparameter tuning leads to a significantly better outcome.
Fig. 1. Performance box plots per algorithm and test function. Displayed is the dif-
ference in objective space between best design point during optimization and global
minimum on a log10 scale. PROGRESS is run in two variants, one with hyperparameter
tuning of the selected surrogate model in each iteration and one without.
The EGO algorithm and the use of kriging models represent the state of
the art in sequential designs. If PROGRESS can in general compete with EGO,
it can be considered successful, as it manages to detect suitable surrogate
models within the strictly limited budget. If we assign equal importance to all
test functions in the benchmark, neither of the two algorithms clearly dominates
the other. Whereas PROGRESS outperforms EGO on functions 1, 5, 13, 15,
16, 17, 21, and 24, the opposite holds on functions 2, 3, 6, 8, 10, 14 and 20.
On the remaining test cases, both algorithms show no significant difference in
performance. Hence, PROGRESS is competitive with EGO and is the preferable
choice on about one third of the benchmark set. For this reason, the proposed
procedure can be considered successful.
To obtain a more detailed understanding of the principles behind the surro-
gate selection, we analyzed the progression of the selection probabilities in three
exemplary runs. These are shown in Fig. 2. We also display the total number of
selections per model during the whole run.
[Fig. 2. Selection probabilities over time (top row, scale 0.0-0.8) and total number of selections per model (bottom row, scale 0-60) for the models QM, RF, NNET, MARS, GBM, KM and CART in three exemplary runs.]
In the first case, shown in the left plot, a dynamic shift between a model cap-
turing global trends (QM) and a more detailed model for local optimization (RF)
is accomplished during the optimization. While the former efficiently identifies
the basins of the Lunacek bi-Rastrigin function, whose weak global structure is
rooted in the existence of two basins of almost the same size, the latter succeeds
in guiding the search through the rugged area around the optimum. This synergy
allows PROGRESS to significantly outperform EGO on this function.
References
1. Bartz-Beielstein, T., Lasarczyk, C.G., Preuss, M.: Sequential parameter optimiza-
tion. In: McKay, B., et al. (eds.) Proceedings of the 2005 Congress on Evolutionary
Computation (CEC’05), Edinburgh, Scotland, pp. 773–780. IEEE Press, Los Alami-
tos (2005)
2. Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J., Weihs, C.: BatchJobs and
BatchExperiments: abstraction mechanisms for using R in batch environments.
Submitted to Journal of Statistical Software (2012a)
3. Bischl, B., Mersmann, O., Trautmann, H., Weihs, C.: Resampling methods for meta-
model validation with recommendations for evolutionary computation. Evol. Com-
put. 20(2), 249–275 (2012b)
4. Bursztyn, D., Steinberg, D.M.: Comparison of designs for computer experiments. J.
Stat. Planning Infer. 136(3), 1103–1119 (2006)
5. DaCosta, L., Fialho, A., Schoenauer, M., Sebag, M.: Adaptive operator selection
with dynamic multi-armed bandits. In: Proceedings of the 10th Conference Genetic
and Evolutionary Computation (GECCO ’08), pp. 913–920. ACM, New York (2008)
6. Friedman, J.: Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67
(1991)
7. Friedman, J.: Greedy function approximation: a gradient boosting machine. Ann.
Stat. 29(5), 1189–1232 (2001)
8. Friese, M., Zaefferer, M., Bartz-Beielstein, T., Flasch, O., Koch, P., Konen, W.,
Naujoks, B.: Ensemble based optimization and tuning algorithms. In: Hoffmann,
F., Hüllermeier, E. (eds.) Proceedings of the 21. Workshop Computational Intelli-
gence, pp. 119–134 (2011)
9. Ginsbourger, D., Helbert, C., Carraro, L.: Discrete mixtures of kernels for kriging-
based optimization. Qual. Reliab. Eng. Int. 24(6), 681–691 (2008)
10. Goel, T., Haftka, R.T., Shyy, W., Queipo, N.V.: Ensemble of surrogates. Struct.
Multidisc. Optim. 33(3), 199–216 (2007)
11. Gorissen, D., Dhaene, T., Turck, F.: Evolutionary model type selection for global
surrogate modeling. J. Mach. Learn. Res. 10, 2039–2078 (2009)
12. Hansen, L., Salamon, P.: Neural network ensembles. IEEE Trans. Pattern Anal.
Mach. Intell. 12(10), 993–1001 (1990)
13. Hansen, N., Finck, S., Ros, R., Auger, A.: Real-Parameter Black-Box Optimization
Benchmarking 2009: Noiseless Functions Definitions. Tech. Rep. RR-6829, INRIA
(2009). http://hal.inria.fr/inria-00362633/en/
14. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York
(2009)
15. Jones, D.R.: A taxonomy of global optimization methods based on response sur-
faces. J. Global Optim. 21(4), 345–383 (2001)
16. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-
box functions. J. Global Optim. 13(4), 455–492 (1998)
17. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90(430), 773–795
(1995)
18. Lenth, R.V.: Response-surface methods in R, using rsm. J. Stat. Softw. 32(7), 1–17
(2009)
19. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3),
18–22 (2002)
20. Lim, D., Ong, Y.S., Jin, Y., Sendhoff, B.: A study on metamodeling techniques,
ensembles, and multi-surrogates in evolutionary computation. In: Thierens, D., et
al. (eds.) Proceedings of the 9th Annual Genetic and Evolutionary Computation
Conference (GECCO 2007), pp. 1288–1295. ACM, New York (2007)
21. Mersmann, O., Bischl, B.: soobench: Single Objective Optimization Benchmark
Functions (2012). http://CRAN.R-project.org/package=soobench, R package ver-
sion 1.0-73
22. Milborrow, S.: earth: Multivariate Adaptive Regression Spline Models (2012).
http://CRAN.R-project.org/package=earth, R package version 3.2-3
23. Mockus, J.B., Tiesis, V., Zilinskas, A.: The application of bayesian methods for seek-
ing the extremum. In: Dixon, L.C.W., Szegö, G.P. (eds.) Towards Global Optimiza-
tion 2, pp. 117–129. Elsevier North-Holland, New York (1978)
24. Myers, R.H., Montgomery, D.C., Anderson-Cook, C.M.: Response Surface Method-
ology, 3rd edn. Wiley, Hoboken (2009)
25. Picheny, V., Wagner, T., Ginsbourger, D.: A benchmark of kriging-based infill cri-
teria for noisy optimization. Struct. Multidisc. Optim. 48(3), 607–626 (2013)
26. R Core Team: R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna (2012). http://www.R-project.org/ ISBN
3-900051-07-0
27. Ridgeway, G.: gbm: Generalized Boosted Regression Models (2012). http://CRAN.
R-project.org/package=gbm, R package version 1.6-3.2
28. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R pack-
ages for the analysis of computer experiments by kriging-based metamodeling and
optimization. J. Stat. Softw. 51(1), 1–55 (2012). http://www.jstatsoft.org/v51/i01/
29. Santner, T., Williams, B., Notz, W.: The Design and Analysis of Computer Exper-
iments. Springer, New York (2003)
30. Sasena, M.J., Papalambros, P., Goovaerts, P.: Exploration of metamodeling sam-
pling criteria for constrained global optimization. Eng. Optim. 34(3), 263–278
(2002)
31. Shan, S., Wang, G.G.: Survey of modeling and optimization strategies to solve high-
dimensional design problems with computationally-expensive black-box functions.
Struct. Multi. Optim. 41(2), 219–241 (2010)
32. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. Cambridge Uni-
versity Press, Cambridge (1998)
33. Therneau, T.M., Atkinson, B. (R port by Ripley, B.): rpart: Recursive Partitioning (2012).
http://CRAN.R-project.org/package=rpart, R package version 3.1-54
34. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer,
New York (2002)
35. Viana, F.A.C.: Multiple Surrogates for Prediction and Optimization. Ph.D. thesis,
University of Florida (2011)
36. Wagner, T., Emmerich, M., Deutz, A., Ponweiser, W.: On expected-improvement
criteria for model-based multi-objective optimization. In: Schaefer, R., Cotta, C.,
Kolodziej, J., Rudolph, G. (eds.) PPSN XI. LNCS, vol. 6238, pp. 718–727. Springer,
Heidelberg (2010)
37. Wichard, J.D.: Model selection in an ensemble framework. In: International Joint
Conference on Neural Networks, pp. 2187–2192 (2006)
Neutrality in the Graph Coloring Problem
1 Motivation
The graph coloring problem (GCP) consists in finding the minimal number of
colors χ, called the chromatic number, that leads to a legal coloring of a graph.
This is an NP-hard problem [3] widely studied in the literature. Hence, for large
instances, approximate algorithms such as local search methods [2] or evolu-
tionary strategies [8] are used. The most efficient metaheuristic schemes include
encodings and mechanisms specific to the GCP; designing them requires very
good knowledge of the problem and lengthy experimental analysis to tune the
best parameters.
Another way to design efficient algorithms is to analyze the problem struc-
ture. For example, landscape analysis aims at a better understanding of the char-
acteristics of a problem in order to design efficient algorithms [11]. Neutrality
appears when neighboring solutions have the same fitness value; thus, neutrality
is a characteristic of the landscape [10].
Many insights about the neutrality of the GCP arise when considering
the number of edges with the same color at both ends, but, as far as we know,
no deep analysis has been conducted in the literature. In this paper, we are
interested in the χ-GCP, which looks for a legal coloring with χ colors while
minimizing the number of conflicts. This paper analyses whether the χ-GCP
may be considered a neutral problem and whether this neutrality may be
exploited to solve it. Therefore, Sect. 2 gives results on the neutrality of hard
GCP instances. Then, in Sect. 3, the benefit of exploiting the neutrality when
solving the GCP is studied. Section 4 discusses the presented work and future
research interests.
2.2 Approach
The average or the distribution of neutral degrees over the landscape may be
used to qualify the level of neutrality of a problem instance. This measure plays
an important role in the dynamics of local search algorithms [12,13]. When a
problem exhibits the neutrality property, Marmion et al. [6] suggested character-
izing the plateaus reached from the local optima. They sampled plateaus using
neutral walks from each local optimum found. The plateaus are classified in a
three-class topology: (T1) the local optimum is the single solution of the plateau;
(T2) no neighbor with a better fitness value was met for any solution of the
plateau encountered along the neutral walk; and (T3) a portal has been identified
on the plateau. This last type is the most interesting, since it reveals
solutions from which the plateau can be escaped with an improving move. Indeed,
the number of solutions visited before finding a portal during a neutral random
walk is a good indicator of the probability of finding an improving solution.
Marmion et al. therefore propose to compute the distribution of the number of
solutions on the plateau that need to be visited before finding a portal. To study
the cost/quality trade-off, they compare this distribution with that of the number
of solutions visited to find a (new) local optimum starting from a new solution,
called the step length. This approach and the characterization of the neutrality
of the χ-GCP are detailed further in [4]; a sketch of such a neutral walk is given
below.
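To make the sampling procedure concrete, the following sketch (with assumed problem-specific callables neighbors and fitness, minimization assumed) performs a neutral random walk from a local optimum and classifies the plateau according to the three-class topology above.

import random

def classify_plateau(local_opt, neighbors, fitness, max_steps=1000):
    f0 = fitness(local_opt)
    current, visited = local_opt, 1
    for _ in range(max_steps):
        nbrs = list(neighbors(current))
        if any(fitness(n) < f0 for n in nbrs):     # portal: improving neighbor
            return "T3", visited
        neutral = [n for n in nbrs if fitness(n) == f0]
        if not neutral:
            break                                  # nowhere left to walk
        current = random.choice(neutral)           # one neutral step
        visited += 1
    if visited == 1:
        return "T1", visited   # the local optimum is alone on the plateau
    return "T2", visited       # no improving neighbor met along the walk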
2.3 Experiments
For this work, we focus on GCP instances from the literature known to have a
“difficult upper bound”, that is to say, instances for which a minimal legal coloring
is hard to obtain. These instances are extracted from the DIMACS Computational
Challenge on “Graph Colouring and its Generalisations”.1 There are four classes
of instances, according to the type of generation. All the indicators presented
above are computed from 30 different solutions.
1
http://dimacs.rutgers.edu/Challenges/
Table 1. Average neutral degree and the corresponding ratio for the random solutions
and the local optima.
The ratio of the neutral degree is the neutral degree divided by the size of the
neighborhood. This measure is also computed for the random solutions and the
local optima, as it makes the comparison between different instances easier. Table 1
gives the average neutral degree and the corresponding ratio for random solu-
tions and local optima on the GCP instances. For each instance, the number
of nodes V, the chromatic number χ and the size of the neighborhood |nbh|
are also given. This table first shows that the ratios for random solutions are
quite high (up to 24.2 % for the instance r250.5). Neutrality thus characterises
the problem in general: the landscape may have many flat parts. The sec-
ond observation on the results of Table 1 is that the ratios for local optima are
smaller than those of random solutions. Hence, depending on the instance,
the number of neutral neighbors in the neighborhood of a local optimum may or
may not be substantial. Some instances present a high neutral degree (such as 9.7 %
for the highest, or around 5 % for others) while others have a much smaller
neutral degree (down to 0.2 % for the instance dsjr1000.1c). These results con-
firm the a priori intuition that neutrality is a strong property of the graph coloring
problem that may be used to make local search strategies more efficient. In the
following, only the instances where the average neutral degree of the local optima
is higher than 1 % are considered.
Table 2 gives the statistics of the number of solutions visited on a plateau
(nbS) before finding a portal. The statistics of the step lengths (L) are also
given to study the cost/quality trade-off. Clearly, a portal is met very quickly,
even randomly: it is necessary to visit only one or two new solutions
on the plateau to find a portal to escape from. Let us remark that reaching a portal
may be quick, but the difficulty is to identify a solution as a portal. However,
compared to the step lengths, it seems more interesting to continue the
Table 2. Number of T1, T2 and T3 plateaus. nbS stands for the number of visited
solutions before finding a portal, and L for the step length.
search process by moving on a plateau to find a portal than to restart the search
process from a new random solution.
The analysis of the plateaus of the local optima shows that portals, i.e., solutions
of the plateau with at least one improving neighbor, are quick to reach with a
random neutral walk. This suggests that exploiting neutrality in the search process
may help to find better solutions. The following section provides insight into
the way to exploit neutrality to solve the GCP.
3.2 Experiments
Four instances, one of each type, have been selected to run NILS on: dsjc250.5,
r250.5, flat300_28_0 and le450_25c. The landscape analysis of these instances
has shown that the plateaus of the local optima contain portals that lead to improving
solutions. Several MNS values were tested in order to analyze the trade-off
between exploiting the plateau and exploring another part of the search space;
a sketch of the NILS scheme is given below. The MNS values were set to 0, 1, 2 and
5 times the size of the neighborhood. For the value 0, NILS corresponds to a
classical ILS that restarts from a new part of the search space. 30 runs were
performed for each configuration of NILS, with a stopping criterion of 2 × 10^7
evaluations.
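A minimal sketch of such a neutrality-based iterated local search, under assumed helpers (local_search for the descent step, restart for jumping to another part of the search space): up to MNS neutral moves are tried on the plateau before giving up; with MNS = 0 the scheme degenerates to the classical ILS baseline. This is our reading of the description above, not the reference implementation of [5].

import random

def nils(initial, local_search, neighbors, fitness, restart, mns, max_iters):
    best = current = local_search(initial)
    for _ in range(max_iters):
        escaped = False
        for _ in range(mns):                       # neutral phase on the plateau
            nbrs = list(neighbors(current))
            improving = [n for n in nbrs if fitness(n) < fitness(current)]
            if improving:                          # portal reached: leave the plateau
                current = local_search(random.choice(improving))
                escaped = True
                break
            neutral = [n for n in nbrs if fitness(n) == fitness(current)]
            if not neutral:
                break
            current = random.choice(neutral)       # neutral move
        if not escaped:                            # MNS exhausted: restart
            current = local_search(restart())
        if fitness(current) < fitness(best):
            best = current
    return best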
Fig. 1. ILS and NILS performance on 4 instances of the graph coloring problem.
4 Discussion
The experimental results have shown that the hard instances of the GCP present
neutrality, on whose plateaus local search algorithms may get stuck. Indeed,
the classical ILS is not able to find interesting solutions. However, when the
neutrality is exploited in the local search, results are improved, even if no config-
uration of NILS yields a legal solution. This may be explained by the fact that
these instances are the hardest in the literature and that, for each of them, k is set
to the χ-value, the best known chromatic number. In 2010, Porumbel et al. [8]
compared their algorithm dedicated to the GCP with the 10 best
performing algorithms from the literature. Except for the Iterated Local Search, all
the other algorithms are highly sophisticated and specific to the GCP: GCP-
specific mechanisms are used to improve the search. These mechanisms require
deep knowledge of the GCP to be designed and tuned efficiently. Despite
this high level of sophistication, the comparison points out the difficulty for
some algorithms to find the χ-value. For example, the results reported for the
instances considered above indicate that the instance r250.5 is solved to
optimality (k = χ) by only four algorithms out of six, the instance le450_25c
is solved to optimality by only six algorithms out of ten, and the instance
flat300_28_0 is solved to optimality by only four algorithms out of eleven.
Moreover, the ILS [7] never finds the χ-value for the last two instances. Its per-
formance illustrates the difficulty for a generic algorithm to be efficient.
This paper should be considered as preliminary work on the neutrality
of the GCP. Indeed, we point out the neutrality of some hard instances and
quantify its degree. The performance of NILS is not
as good as expected, but it shows the potential of exploiting neutrality to solve
the GCP. Since heuristic methods represent the state-of-the-art algorithms [1,9],
we want to investigate how to exploit neutrality in such heuristics.
References
1. Caramia, M., Dell’Olmo, P., Italiano, G.F.: Checkcol: improved local search for
graph coloring. J. Discrete Algorithms 4, 277–298 (2006)
2. Galinier, P., Hertz, A.: A survey of local search methods for graph coloring. Com-
put. Oper. Res. 33(9), 2547–2562 (2006)
3. Garey, M.R., Johnson, D.S.: Computers and Intractability; A Guide to the Theory
of NP-Completeness. W. H. Freeman and Co., San Francisco (1990)
4. Marmion, M.-E., Blot, A., Jourdan, L., Dhaenens, C.: Neutrality in the graph
coloring problem. Technical Report RR-8215, INRIA (2013)
5. Marmion, M.-E., Dhaenens, C., Jourdan, L., Liefooghe, A., Verel, S.: NILS: a
neutrality-based iterated local search and its application to flowshop scheduling. In:
Merz, P., Hao, J.-K. (eds.) EvoCOP 2011. LNCS, vol. 6622, pp. 191–202. Springer,
Heidelberg (2011)
6. Marmion, M.-E., Dhaenens, C., Jourdan, L., Liefooghe, A., Verel, S.: On the neu-
trality of flowshop scheduling fitness landscapes. In: Coello, C.A.C. (ed.) LION
2011. LNCS, vol. 6683, pp. 238–252. Springer, Heidelberg (2011)
7. Paquete, L., Stützle, T.: An experimental investigation of iterated local search for
coloring graphs. In: Cagnoni, S., Gottlieb, J., Hart, E., Middendorf, M., Raidl, G.
(eds.) EvoWorkshops 2002. LNCS, vol. 2279, pp. 122–131. Springer, Heidelberg
(2002)
8. Porumbel, D.C., Hao, J.K., Kuntz, P.: An evolutionary approach with diversity
guarantee and well-informed grouping recombination for graph coloring. Comput.
Oper. Res. 37(10), 1822–1832 (2010)
9. Porumbel, D.C., Hao, J.K., Kuntz, P.: A search space “cartography” for guiding
graph coloring heuristics. Comput. Oper. Res. 37(4), 769–778 (2010)
10. Reidys, C.M., Stadler, P.F.: Neutrality in fitness landscapes. Appl. Math. Comput.
117, 321–350 (2001)
11. Stadler, P.F.: Landscapes and their correlation functions. J. Math. Chem. 20, 1–45
(1996)
12. Verel, S., Collard, P., Tomassini, M., Vanneschi, L.: Fitness landscape of the cellular
automata majority problem: view from the “Olympus”. Theor. Comput. Sci. 378,
54–77 (2007)
13. Wilke, C.O.: Adaptive evolution on neutral networks. Bull. Math. Biol. 63, 715–
730 (2001)
Kernel Multi Label Vector Optimization
(kMLVO): A Unified Multi-Label
Classification Formalism
1 Introduction
Classic supervised learning problems are formulated as a set of (xi , yi ) pairs,
where xi lies in the problem domain (typically R^n, but may be more complex),
and yi ∈ {0, 1, . . . , n} for classification problems (with n = 1 for decision prob-
lems) or yi ∈ R for regression problems. Naturally, not all problems fall into these
categories, and several generalizations have been suggested in which each instance
belongs to more than one class or multiple instances have multiple labels
[1,2]. In this study, we keep the assumption that each instance either belongs
to some target class or does not; however, the available data might not con-
tain this labeling but rather some indirect measurements. This formulation is
related to many real-life problems. For example, in the medical domain, the
decision whether a subject is ill or not is made not just based on past subjects’
data along with their diagnoses, but also on past subjects’ data along with their
physiological condition scores, appetite and happiness scores, etc.
This notion can be especially helpful in areas where the classification is diffi-
cult to obtain, with limited data sets, or where the data suffers from high varia-
tion in measurement modality and protocol. The application of a first solution,
the MLVO method, was briefly introduced in a recent publication [3]. Here, we
present the extension of this formalism to a kernel machine, the kMLVO, along
with some useful extensions.
The remainder of this manuscript is organized as follows. In Sect. 2, we
present the kMLVO framework. Extensions of the kMLVO are shown in Sect. 3.
In Sect. 4 we discuss the results on simulated and experimental data sets and
compare the performance of different classifiers, and we conclude with a sum-
mary in Sect. 5.
We use the SVR formalization for the regression part. This implies linear penal-
ties on regression errors, which improves the robustness to outliers. When using
a non-linear kernel, the direct relation to the original problem dimension is lost,
and the contribution of w0 must be indirect (we will return to this issue in the
kMLVO extensions).
2.1 Formalism
\begin{aligned}
\min_{w,b,\xi,\lambda,\alpha,\zeta,\zeta^{*}}\quad & \tfrac{1}{2}\,\lVert w\rVert^{2} + C^{T}\xi + \sum_{i=1}^{L} D_i^{T}\,(\zeta_i + \zeta_i^{*}) \\
\text{subject to}\quad & Y(Xw + \mathbf{1}b) \geq \mathbf{1} - \xi, \quad \xi \geq 0 \\
& (\lambda_i s_i + \mathbf{1}\alpha_i) - Xw - \mathbf{1}b \leq \mathbf{1}\varepsilon_i + \zeta_i, \quad \zeta_i \geq 0, \ \forall i,\ 1 \leq i \leq L \\
& Xw + \mathbf{1}b - (\lambda_i s_i + \mathbf{1}\alpha_i) \leq \mathbf{1}\varepsilon_i + \zeta_i^{*}, \quad \zeta_i^{*} \geq 0, \ \forall i,\ 1 \leq i \leq L
\end{aligned} \tag{1}
where s_i is the vector of scores of the i-th modality for all of the samples.
Note that the units of the continuous scores s_i may differ from the units of the
optimal separating hyperplane of the binary data; we therefore added the vectors λ, α
of length L, holding the linear transformations for the corresponding modalities. ε
is a vector of length L containing the insensitive-loss parameters for the different
modalities (as in [4]). Note that since, for each modality, this parameter is applied
after the linear transformation, its value is normalized, so that searching for the
optimal value is easier and indicative of the regression fit (which is otherwise
hidden).
\begin{aligned}
\max_{\mu,\pi,\pi^{*}}\quad & -\tfrac{1}{2}\Big(\mu^{T} Y X X^{T} Y \mu + \big(\textstyle\sum_{i=1}^{L}\pi_i^{-}\big)^{T} X X^{T} \big(\textstyle\sum_{i=1}^{L}\pi_i^{-}\big)\Big) + \mu^{T}\mathbf{1} - \sum_{i=1}^{L} \pi_i^{+T}\,\mathbf{1}\varepsilon_i - \sum_{i=1}^{L} \pi_i^{-T} X X^{T} Y \mu \\
\text{subject to}\quad & \mu,\ \beta,\ \pi_i,\ \pi_i^{*},\ \eta_i,\ \eta_i^{*} \geq 0, \ \forall i,\ 1 \leq i \leq L \\
& \mu^{T} y = 0; \quad \mathbf{1}^{T}\pi_i^{-} = 0; \quad s_i^{T}\pi_i^{-} = 0, \ 1 \leq i \leq L \\
& C = \mu + \beta; \quad D_i = \pi_i^{(*)} + \eta_i^{(*)}, \ \forall i,\ 1 \leq i \leq L
\end{aligned} \tag{2}
where π_i^- = π_i − π_i^* and π_i^+ = π_i + π_i^*, with μ, β being the Lagrange multipli-
ers corresponding to the classification problem (as in the SVM formalism) and
π_i, π_i^*, η_i, η_i^* the Lagrange multipliers corresponding to the i-th modality of the
regression problem (as in the SVR formalism, following [4,7]). Using such a for-
malism, we enjoy the advantages of a kernel machine, i.e., the optimization is
over the support vector coefficients μ, π, π^*, and the (possibly high-dimensional)
product XX^T can be replaced using any kernel function K. The decision
function becomes
$g(q) = \sum_{j=1}^{n} \big( \mu_j y_j + \sum_{i=1}^{L} \pi^{-}_{i,j} \big) K(x_j, q) + b$,
i.e., a weighted sum of the contributions of the kernel function between the
support vectors and the classified sample point, where the weight of each support
vector considers both the classification and the various regression constraints.
The method can handle any combination of inputs by simply setting the corre-
sponding element of the classification cost vector C and/or of the cost matrix D
to 0 where a value is missing. This constrains the corresponding support vector
coefficient to be fixed (a box constraint of 0) and effectively removes the element
from the optimization problem, while keeping all the other information intact.
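Both points can be illustrated in a few lines (assumed names and array shapes): the decision function aggregates the classification and regression multipliers through the kernel, and a missing observation is handled by zeroing the corresponding cost, which fixes the coefficient's box constraint to [0, 0].

import numpy as np

def decision_function(q, X, y, mu, pi_minus, b, kernel):
    """g(q) = sum_j (mu_j * y_j + sum_i pi^-_{i,j}) K(x_j, q) + b."""
    weights = mu * y + pi_minus.sum(axis=0)       # pi_minus has shape (L, n)
    return sum(w * kernel(xj, q) for w, xj in zip(weights, X)) + b

def mask_missing(D, missing_pairs):
    """Zero the cost of sample j in modality i, fixing its coefficient to 0."""
    D = D.copy()
    for i, j in missing_pairs:                    # (modality, sample) indices
        D[i, j] = 0.0
    return D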
3 Extending kMLVO
Two possible extensions of the kMLVO framework handle the incorporation of
w0 , as in the MLVO, and non-linear regression.
3.1 Incorporation of w0
Given w_0, we would like to introduce a penalty for diverging from it, similar to
the one used in the MLVO, i.e., $E_2 = \frac{1}{2}\lVert w_0 - w\rVert_2^2$. This, however,
automatically leads to terms which are not quadratic in the values of x_i. This
can be solved by projecting w and w_0 onto any basis of the feature space. Given
a basis $B = \{B_1, \ldots, B_n\}$ of the feature space,
$\lVert w_0 - w\rVert_2^2 = \sum_{i=1}^{n} (\langle w_0, B_i\rangle - \langle w, B_i\rangle)^2 = \sum_{i=1}^{n} (\mu_i - \langle w, B_i\rangle)^2,$
where $\langle w_0, B_i\rangle = \mu_i$, ∀i, is the score induced by w_0 for the basis vectors.
We would like the scores induced by w to be similar, i.e., the problem is
transformed into a regression problem. Thus it suffices to add the basis vectors
$\{B_1, \ldots, B_n\}$ as additional input samples along with their w_0-induced scores,
as sketched below. As before, while the MLVO uses a squared penalty, the
kMLVO uses an L1 penalty.
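A sketch of this construction under assumed names: the basis vectors are appended to the regression samples with target scores μ_i = ⟨w_0, B_i⟩, so the prior direction enters the problem as ordinary regression data.

import numpy as np

def add_prior_samples(X_reg, s_reg, w0, basis=None):
    """Append basis vectors as regression samples scored by the prior w0."""
    B = np.eye(len(w0)) if basis is None else basis   # any basis works
    prior_scores = B @ w0                             # mu_i = <w0, B_i>
    return np.vstack([X_reg, B]), np.concatenate([s_reg, prior_scores])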
3.2 Non-linear Regression

Suppose that the scores of the i-th modality are not linear in the optimal
(or real) separating hyperplane of the classification problem, but follow
$s_{i,k} = f(w^T x_k + b) = f\big(\sum_{j=1}^{n} \mu_j y_j K(x_j, x_k) + b\big)$ for some function f. In this case
we would like to linearize the scores, i.e., to apply f^{-1} before performing the
kMLVO. If f (and thus f^{-1}) is unknown, we may let the kMLVO approximate
it as a linear combination of score vectors. This can be done by replacing
the term $(\lambda_i s_i + \mathbf{1}\alpha_i)$ in the equations with $(\lambda_{i,1} s_{i,1} + \cdots + \lambda_{i,p_i} s_{i,p_i} + \mathbf{1}\alpha_i)$;
only the corresponding additional constraints are added to the final dual problem.
These different score vectors can be, but are not limited to, the original scores s_i
raised to different powers, etc.
4 Simulations
In order to test whether the proposed formalism outperforms existing methods,
we have compared the precision obtained using four different formalisms: SVR,
SVM, MLVO and kMLVO on artificial datasets of different dimentionality, noise
levels, and sample sizes. Additinaly, 10 % of the points were randomly chosen to
be “outliers”. For these points, the standard deviation of the added noise was 3
times the sandard deviation of all scores (instead of 0.03 or 0.6).
The kMLVO outperforms the other classifiers in the presence of outliers. Such
outliers can significantly affect the SVM and SVR formalisms, and we have here
tested their effect on the kMLVO formalism. The average performance scores on
the different data sets can be seen in Fig. 1. While for weak noise levels the
MLVO is dominant, at the strong noise level a clear dominance of the kMLVO
can be seen, especially in the region of a high number of samples with continuous
scores and a low number of binary samples. This can be explained by the fact
that the kMLVO (like SVR) has an L1 regularization term, while the regression part
of the MLVO (like LS-SVR) is regularized with an L2 term.
In another application, concerning binding to an immune system molecule,
the transporter associated with antigen processing (TAP), the kMLVO outper-
formed the MLVO and the uni-label classifiers SVM (using binding/non-binding
data) and SVR (using affinity scores) of the commonly used package LibSVM [8]
on our data.
Fig. 1. Simulation results with outliers. The relative number of winners (that is, the most
accurate estimation of the direction vector) is coded in RGB: red for kMLVO, green
for SVM, blue for SVR, and black for MLVO. The titles give the dimensionality of the
data set and the standard deviation of the noise.
5 Discussion
The approach presented can be used as a general supervised learning method
when multiple labels of data are available. Such a situation often emerges in
biological interactions, such as transcription factor binding or protein-protein
interactions. In such cases, observations can either be binary (the presence or
absence of an interaction) or continuous (the affinity).
Several extensions of the kMLVO have been proposed. The simplest exten-
sion is the use of multiple continuous scores. Assume samples have continuous
scores derived from several unit scales (e.g., IC50 and EC50 affinity-
related measurements). As part of the solution (as described above), we simul-
taneously fit the predicted score to the continuous scores by linear regression.
Thus, all the available measurements can actually be merged together, and the
problem is transformed into a set of linear regressions with multiple values
of λ and α. The algorithm can also be improved if the validity of the different
dimensions of the samples in the n-dimensional space, or the validity of the sam-
ples themselves, can be guessed. In such a case, the weight given to the similarity
to the a priori guess in each dimension, or the error of each classified data point
(ξ_i), can be varied.
The use of kernels, along with extensions for handling multiple measurement
types with different non-linear relations, and the inherent handling of missing
values, gives the suggested approach high flexibility and applicability for real-
life problems.
The main limitation of the proposed methodology is that it mainly applies to
cases where the number of observations is limited. When the number of obser-
vations is very large and biased toward one type of observation, the MLVO
performs worse than the appropriate SVM or SVR. Another important caveat
is the need to determine three constants, instead of the single box-constraint
constant in the standard SVM formalism. In the presented cases, we have pre-
defined the constants to be used, or used an internal test set to determine the
optimal constants, and then applied the results to an external test set. Even
when these caveats are taken into consideration, the MLVO can be an impor-
tant methodological approach in many biological cases, where the number of
observations is highly limited.
References
1. Zhou, Z., Zhang, M., Huang, S., Li, Y.: Multi-instance multi-label learning. Artif.
Intell. 176, 2291–2320 (2011)
2. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification.
Pattern Recogn. 37, 1757–1771 (2004)
3. Vider-Shalit, T., Louzoun, Y.: MHC-I prediction using a combination of T cell epitopes
and MHC-I binding peptides. J. Immunol. Methods 374, 43–46 (2010)
4. Farag, A., Mohamed, R.M.: Regression using support vector machines: basic foun-
dations. Technical Report, CVIP Laboratory, University of Louisville (2004)
5. Karush, W.: Minima of functions of several variables with inequalities as side con-
straints. Master’s thesis, Department of Mathematics, University of Chicago (1939)
6. Kuhn, H., Tucker, A.: Nonlinear programming. In: Proceedings of the Second Berke-
ley Symposium on Mathematical Statistics and Probability, California, vol. 5 (1951)
7. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
8. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans.
Intell. Sys. Technol. 2, 27:1–27:27 (2011)
Robust Benchmark Set Selection
for Boolean Constraint Solvers
1 Introduction
The availability of representative sets of benchmark instances is of crucial impor-
tance for the successful development of high-performance solvers for compu-
tationally challenging problems, such as propositional satisfiability (SAT) and
answer set programming (ASP). Such benchmark sets play a key role for assess-
ing solver performance and thus for measuring the computational impact of
algorithms and/or their vital parameters. On the one hand, this allows a solver
developer to gain insights on the strengths and weaknesses of features of inter-
est. On the other hand, representative benchmark instances are indispensable to
empirically underpin the claims of computational benefit of novel ideas.
A representative benchmark set is composed of benchmark instances stem-
ming from a variety of different benchmark classes. Such benchmark sets have
been assembled (manually) in the context of well-known solver competitions,
such as the SAT and ASP competitions, and then widely used in the research
literature. These sets of competition benchmarks are well-accepted, because
they have been constituted by an independent committee using sensible criteria.
Moreover, these sets evolve over time and thus usually reflect the capabilities
(and limitations) of state-of-the-art solvers; they are also publicly available and
well-documented.
However, instance sets from competitions are not always suitable for bench-
marking scenarios where the same runtime cutoff is used for all instances. For
example, in the last three ASP competitions, only ≈ 10 % of all instances were
non-trivial (runtime over 9 s, i.e., 1 % of the runtime cutoff) for the state-of-the-
art ASP solver clasp, while all other instances were trivial or unsolvable for clasp
within the time cutoff used in the competition. When benchmarking, results
are (typically) aggregated over all instances. But if the percentage of
interesting instances in the benchmark set is too small, the interesting instances
have little influence on the aggregated result, and the overall result is dominated
by uninteresting, i.e., trivial or unsolvable, instances. Hence, a significant change
in the runtime behaviour of a new algorithm is harder to identify on such degen-
erate benchmark sets. In addition, uninteresting instances unnecessarily waste
computational resources and thus cause avoidable delays in the benchmarking
process.
Moreover, in ASP, competition instances do not necessarily represent real
world applications. In the absence of a common modelling language, benchmark
instances are often formulated in the most basic common setting and thus bear
no resemblance to how real world problems are addressed (e.g., they are usu-
ally free of any aggregates).1 The situation is simpler in SAT, where a wide
range of benchmark instances stems from real-world applications and are quite
naturally encoded in a low-level format, without the modelling layer present in
ASP. Notably, SAT competitions place considerable emphasis on a public and
transparent instance selection procedure [1]. However, as we discuss in detail in
Sect. 3, competition settings may differ from other benchmarking contexts.
In what follows, we elaborate upon the composition of representative bench-
mark sets for evaluating and improving the performance of Boolean constraint
solvers in the context of ASP and SAT. Starting from an analysis of current
practice of benchmark set selection in the context of SAT competitions (Sect. 2),
we isolate a set of desiderata for representative benchmark sets (Sect. 3). For
instance, sets with a large variety of instances are favourable when developing
a default configuration of a solver that is desired to perform well across a wide
range of instances. We rely on these desiderata for guiding the development of
a parametrized benchmark selection algorithm (Sect. 4).
Overall, our approach makes use of (i) a large base set (or distribution) of
benchmark instances; (ii) instance features; and (iii) a representative set of state-
of-the-art solvers. Fundamentally, it constructs a benchmark set with desirable
properties regarding difficulty and diversity by sampling from the given base
set. It achieves diversity of the benchmark set by clustering instances based on
their similarity w.r.t. a given set of features, while ensuring that no cluster is
overrepresented. The difficulty of the resulting set is calibrated based on the
given set of solvers. Use of the benchmark sets thus obtained helps save com-
putational resources during solver development, configuration and evaluation,
while concentrating on interesting instances.
1
In ASP competitions, this deficit is counterbalanced by a modelling track, in which
each participant can use its preferred modelling language.
2 Current Practice
The generation or selection of benchmark sets is an important factor in the
empirical analysis of algorithms. Depending on the goals of the empirical study,
there are various criteria for benchmark selection. For example, in the field of
Boolean constraint solving, regular competitions are used to asses new app-
roaches and techniques as well as to identify and recognize state-of-the-art
solvers. Over the years, competition organizers came up with sets of rules for
selecting subsets of submitted instances to assess solver performance in a fair
manner. To begin with, we investigate the rules used in the well-known and
widely recognized SAT Competition,2 which try to achieve (at least) three over-
all goals. First, the selection should be broad, i.e., the selected benchmark set
should contain a large variety of different kinds of instances to assess the robust-
ness of solvers. Second, each selected instance should be significant w.r.t. the
ranking obtained from the competition. Third, the selection should be fair, i.e.,
the selected set should not be dominated by a set of instances from the same
source (either a domain or a benchmark submitter).
For the 2009 SAT Competition [2] and the 2012 SAT Challenge [1], instances
were classified according to hardness, as assessed based on the runtime of a set
of representative solvers. For instance, for the 2012 SAT Challenge, the orga-
nizers measured the runtimes of the best five SAT solvers from the Application
and Crafted tracks of the last SAT Competition on all available instances and
assigned each instance to one of the following classes: easy instances are solved
by all solvers under 10 % of the runtime cutoff, i.e., 90 CPU seconds; medium
instances are solved by all solvers under 100 % of the runtime cutoff; too hard
instances are not solved by any solver within 300 % of the runtime cutoff; and
hard instances are solved by at least one solver within 300 % of the runtime
cutoff but not by all solvers within 100 % of the runtime cutoff. Instances were
then selected with the objective to have 50 % medium and 50 % hard instances
in the final instance set and, at the same time, to allow at most 10 % of the final
instance set to originate from the same source.
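These rules translate directly into code; the function below is our paraphrase of them, with unsolved runs encoded as infinity, assigning one instance to a hardness class given the runtimes of the representative solvers and the benchmarking cutoff.

def classify(runtimes, cutoff):
    if all(t <= 0.1 * cutoff for t in runtimes):
        return "easy"      # solved by all solvers within 10% of the cutoff
    if all(t <= cutoff for t in runtimes):
        return "medium"    # solved by all solvers within the cutoff
    if not any(t <= 3 * cutoff for t in runtimes):
        return "too hard"  # no solver solves it within 300% of the cutoff
    return "hard"          # solved within 300% by some, not by all within 100%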
While the easy instances are assumed to be solvable by all solvers, the too
hard instances are presumably not solvable by any solver. Hence, neither class
contributes to the solution count ranking used in the competition.3 On the other
hand, medium instances help to rank weaker solvers and to detect performance
deterioration w.r.t. previous competitions. The hard instances are most useful
for ranking the top-performing solvers and provide both a challenge and a chance
to improve state-of-the-art SAT solving.
2
http://www.satcompetition.org
3
Solution count ranking assesses solvers based on the number of solved instances.
3 Desiderata

Before diving into the details of our selection algorithm, let us first explicate the
desiderata for a representative benchmark set (cf. [4]).
Large Variety of Instances. As mentioned, a large variety of instances is favourable
to assess the robustness of solver performance and to reduce the risk of generalis-
ing from results that only apply to a limited class of problems. Such large variety
can include different types of problems, i.e., real-world applications, crafted prob-
lems, and randomly generated problems; different levels of difficulty, i.e., easy,
medium, and hard instances; different instance sizes; or instances with diverse
structural properties. While the structure of an instance is hard to assess, a
qualitative assessment could be based on visualizing the structure [5], and a
quantitative assessment can be performed based on instance features [6,7]. Such
instance features have already proven useful in the context of algorithm selection
[7,8] and algorithm configuration [9,10].
Adapted Instance Hardness. While easy problem instances are sometimes useful
for investigating certain properties of specific solvers, intrinsically hard or dif-
ficult to solve problem instances are better suited to demonstrate state-of-the-
art solving capabilities through benchmarking. However, in view of the nature
of NP-hard problems, it is likely that many hard instances cannot be solved
efficiently. Resource limitations, such as runtime cutoffs or memory limits, are
commonly applied in benchmarking. Solver runs that terminated prematurely
because of violations of resource limits are not helpful in differentiating solver
performance. Hence, instances should be carefully selected so that such prema-
turely terminated runs for the given set of solvers are relatively rare. Therefore,
the distribution of instance hardness within a given benchmark set should be
adjusted based on the given resource limits and solvers under consideration. In
particular, instances that are too hard (i.e., for which there is a high probability
of a timeout), as well as instances that are too easy, should be avoided.
4
Careful ranking compares pairs of solvers based on statistically significant perfor-
mance differences and ranks solvers based on the resulting ranking graph.
4 Benchmark Selection Algorithm

Our selection process starts from a given base set of instances I. This set can
be a benchmark collection or simply a mix of previously used instances from
competitions.
Inspired by past SAT competitions, a representative set of solvers S (e.g.,
the best solvers of the last competition, the state-of-the-art (SOTA) contributors
identified in the last competition, or contributors to SOTA portfolios [12]) is
used to assess the hardness h(i) ∈ R of an instance i ∈ I. Typically, the runtime
t(i, s) (measured in CPU seconds) is used to assess the hardness of an instance
i ∈ I for solver s ∈ S. The aggregation of the runtimes of all solvers s ∈ S on
a given instance i can be carried out in several ways, e.g., by considering the
minimal runtime ($\min_{s \in S} t(i, s)$) or the average runtime ($\frac{1}{|S|} \sum_{s \in S} t(i, s)$). The resulting
hardness metric is closely related to the intended ranking scheme for solvers.
For example, the minimal runtime is a lower bound of the portfolio runtime.
Diversity is achieved by clustering instances based on their similarity in feature space. We then require that a clus-
ter must not be over-represented in the selected instance set; in what follows,
roughly reminiscent of the mechanism used in SAT competitions, we say that
a cluster is over-represented if it contributes more than 10 % of the instances
to the final benchmark set. While other mechanisms are easily conceivable, the
experiments we report later demonstrate that this simple criterion works well.
Algorithm 1 implements these ideas with the precondition that the base
instance set I is free of duplicates. (This can be easily ensured by means of
simple preprocessing.) In Line 1, all instances are removed from the given base
set that cannot be solved by all solver from the representative solver set S within
the selection runtime cutoff tc (rejection of too hard instances). If solution count
ranking is to be used in the benchmarking scenario under consideration, the cut-
off in the instance selection process should be larger than the cutoff for bench-
marking, as was done in the 2012 SAT Challenge. In Line 2, all instances are
removed that are solved by all solvers under e % of the cutoff time (rejection of
too easy instances). For example, in the 2012 SAT Challenge [1], all instances
were removed which were solved by all solvers under 10 % of the cutoff. Line 3
performs clustering of the remaining instances based on their normalized features.
To perform this clustering, the well-known k-means algorithm can be used, and
the number of clusters can be computed using G-means [10,15] or by increas-
ing the number of clusters until the clustering optimization no longer improves
under cross-validation [16]. In our experiments, we used the latter,
because the G-means algorithm relies on a normality
assumption that is not necessarily satisfied for the instance feature data used
here. Beginning with Line 4, instances are sampled within a loop until enough
instances are selected or no more instances are left in the base set. To this end,
x ∈ R is sampled from a distribution D_h induced by the instance hardness metric
h, and for each sample x the instance i* whose hardness h(i*) is closest to x
is selected. Instance i* is removed from the base instance set I. If the respective
cluster S(i*) is not already over-represented in I*, instance i* is added to I*,
the benchmark set under construction.
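A condensed sketch of this procedure (not the authors' implementation): per-solver runtimes drive the two filters, clustering is assumed to have been done beforehand (the cluster_of mapping), and sample_x draws from the hardness distribution D_h; the 10 % cap corresponds to the over-representation criterion above.

def select_benchmark(instances, runtimes, h, cluster_of, sample_x,
                     n, tc, e, cap=0.10):
    # Line 1: drop instances not solved by every solver within cutoff tc
    pool = [i for i in instances if all(t <= tc for t in runtimes[i])]
    # Line 2: drop instances solved by all solvers under e% of the cutoff
    pool = [i for i in pool
            if not all(t <= e / 100.0 * tc for t in runtimes[i])]
    selected, per_cluster = [], {}
    while pool and len(selected) < n:
        x = sample_x()                                   # draw from D_h
        i_star = min(pool, key=lambda i: abs(h[i] - x))  # hardness closest to x
        pool.remove(i_star)
        c = cluster_of[i_star]
        if per_cluster.get(c, 0) < cap * n:              # cluster not over-represented
            selected.append(i_star)
            per_cluster[c] = per_cluster.get(c, 0) + 1
    return selected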
$$Q(I_1, I_2, s, m) = \frac{m(s(c_{I_1}), I_1)}{m(s(c_{I_2}), I_1)} \tag{1}$$
$$Q^{*}(I_1, I_2) = \frac{Q_{I_1}(I_2)}{Q_{I_2}(I_1)} \tag{2}$$
We use the Q*-score to assess the quality of the sets I* obtained from our
benchmark selection algorithm in comparison to the respective base sets I. Based
on this score, we can assess the degree to which our benchmark selection algo-
rithm succeeded in producing a set that is representative of the given base set
in the sense motivated earlier. Thereby, a Q*-score (Q*(I_1, I_2)) and a Q-score
(Q_{I_1}(I_2)) larger than 1.0 indicate that I_2 is a better proxy for I_1 than vice
versa, and that I_2 is a good proxy for I_1.
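Written out as code, the two scores read as follows (PAR10-style metrics, lower is better; the argument names, our own, spell out which configuration is evaluated on which set).

def q_score(m_cI1_on_I1, m_cI2_on_I1):
    """Q(I1, I2) = m(s(c_I1), I1) / m(s(c_I2), I1); with PAR10 (lower is
    better), a value > 1 means the configuration found on I2 beats the one
    found on I1 when both are evaluated on I1."""
    return m_cI1_on_I1 / m_cI2_on_I1

def q_star(m_cI1_on_I1, m_cI2_on_I1, m_cI2_on_I2, m_cI1_on_I2):
    """Q*(I1, I2) = Q_{I1}(I2) / Q_{I2}(I1)."""
    return (m_cI1_on_I1 / m_cI2_on_I1) / (m_cI2_on_I2 / m_cI1_on_I2)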
5 Evaluation
We evaluated our benchmark set selection approach by means of the Q∗ -score
criterion on widely studied instance sets from SAT and ASP competitions.
Instance Sets. We used three base instance sets to select our benchmark set:
SAT-Application includes all instances of the application tracks from the 2009
and 2011 SAT Competition and 2012 SAT Challenge; SAT-Crafted includes
instances of the crafted tracks (resp. hard combinatorial track) of the same com-
petitions; and ASP includes all instances of the 2007 ASP Competition (SLparse
track), the 2009 ASP Competition (with the encodings of the Potassco group
[19]), the 2011 ASP Competition (decision NP-problems from the system track),
and several instances from the ASP benchmark collection platform asparagus.5
Duplicates were removed from all sets, resulting in 649 instances in
SAT-Application, 850 instances in SAT-Crafted, and 2,589 instances in ASP.
Solvers. In the context of the two sets of SAT instances, the best two solvers
of the application track, i.e., Glucose [20] (2.1) and SINN [21], and of the hard
combinatorial track, i.e., clasp [19] (2.0.6) and Lingeling [22] (agm), and the
best solver of the random track, i.e., CCASAT [23], of the 2012 SAT Challenge
were chosen as representative state-of-the-art SAT solvers. clasp [19] (2.0.6),
cmodels [24] (3.81) and smodels [25] (2.34) were selected as competitive and
representative ASP solvers capable of reading the smodels-input format [26].
Instance Features. We used efficiently computable, structural features to cluster
instances. The 54 base features of the feature extractor of SATzilla [7] (2012)
were utilized for SAT. The seven structural features of claspfolio [13] were con-
sidered for ASP, namely, tightness (0 or 1), number of atoms, all rules, basic
rules, constraint rules, choice rules, and weight rules of the grounded program.
For feature computation, a runtime limit of 900 CPU seconds per instance and a
z-score normalization was used. Any instance for which the complete set of fea-
tures could not be computed within 900 s was removed from the set of candidate
instances. This led to the removal of 52 instances from the SAT-Application
set, 2 from the SAT-Crafted set, and 3 from the ASP set.
5
http://asparagus.cs.uni-potsdam.de/
Execution Environment and Solver Settings. All our experiments were performed
on a computer cluster with dual Intel Xeon E5520 quad-core processors
(2.26 GHz, 8,192 KB cache) and 48 GB RAM per node, running Scientific Linux
(2.6.18-308.4.1.el5). Each solver run was limited to a runtime cutoff of 900 CPU
seconds. Furthermore, we set parameter e in our benchmark selection procedure
to 10, i.e., instances solved by all solvers within 90 CPU seconds were discarded,
and the number of instances to select (n) to 200 for SAT (because of the relatively
small base sets) and 300 for ASP. After filtering out too hard instances (Line 1
of Algorithm 1), 404 instances remained in SAT-Application, 506 instances in
SAT-Crafted and 2,190 instances in ASP; after filtering out too easy instances
(Line 2), we obtained sets of size 393, 425, and 1,431, respectively.
Clustering. To cluster the instances based on their features (Line 3), we applied
k-means 100 times with different randomised initial cluster centroids. To find
the optimal number of clusters, we gradually increased the number of clusters
(starting with 2) until the quality of the clustering, assessed via 10-fold cross val-
idation and 10 randomised repetitions of k-means for each fold, did not improve
any further [16]. This resulted in 13 clusters for each of the two SAT sets, and
25 clusters for the ASP set.
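A sketch of this selection of k, under stated assumptions: clustering quality is scored here as held-out distance to the nearest centroid (an assumption; the paper only cites [16] for the exact criterion), with 10 folds and 10 randomized k-means repetitions per fold, and k grows until the score stops improving.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def choose_k(features, k_max=50):
    best_k, best_score = 2, np.inf
    for k in range(2, k_max + 1):
        scores = []
        for tr, te in KFold(n_splits=10, shuffle=True).split(features):
            km = KMeans(n_clusters=k, n_init=10).fit(features[tr])
            # held-out quality: distance of test points to nearest centroid
            scores.append(-km.score(features[te]))
        score = float(np.mean(scores))
        if score >= best_score:           # no further improvement: stop
            return best_k
        best_k, best_score = k, score
    return best_k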
Selection. To measure the hardness of a given problem instance, we used the
average runtime over all representative solvers. We considered a cluster to be
over-represented (Line 8) if more than 20 % of the final set size (n) were selected
for SAT, and more than 5 % in case of ASP; the difference in threshold was
motivated by the fact that substantially more clusters were obtained for the ASP
set than for SAT-Application and SAT-Crafted.
Algorithm Configuration. After generating the benchmark sets SAT-
Application∗ , SAT-Crafted∗ and ASP∗ using our automated selection proce-
dure, these sets were evaluated by assessing their Q∗ -scores. To this end, we
used the freely available, state-of-the-art algorithm configurator ParamILS [18]
to configure the SAT and ASP solver clasp (2.0.6). clasp is a competitive solver in
several areas of Boolean constraint solving6 that is highly parameterized, expos-
ing 46 performance-relevant parameters for SAT and 51 for ASP. This makes it
particularly well suited as a target for automated algorithm configuration meth-
ods and hence for evaluating our instance sets. Following standard practice, for
each set, we performed 10 independent runs of ParamILS of 2 CPU days each
and selected from these the configuration with the best training performance as
the final result of the configuration process for each instance set.
Sampling Distributions. One of the main input parameters of Algorithm 1 is the
sampling distribution. With the help of our Q∗ -score criterion, three distribu-
tions are assessed: a normal (Gaussian) distribution, a log-normal distribution,
and an exponential distribution. The parameters of these distributions were set
to the empirical statistics (e.g., empirical mean and variance) of the hardness
distribution over the base sets. The log-normal and exponential distributions
6
clasp won several first places in previous SAT, PB and ASP competitions.
Table 1. Comparison of set qualities of the base sets I and benchmark sets I ∗ generated
by Algorithm 1; evaluated with Q∗ -Scores with I1 = I ∗ , I2 = I, clasp as algorithm A
and PAR10-scores as performance metric m
have fat right tails and typically reflect the runtime behaviour of solvers
for NP problems better than the normal distribution. However, when using the
average runtime as our hardness metric, the instances sampled using a normal
distribution are not necessarily atypically easy. For instance, an instance i on
which half of the representative solvers time out while the other half solve the
instance in nearly no time has an average runtime of half the runtime cutoff.
Such an instance therefore counts as medium-hard and is likely to be selected
when using the normal distribution.
In Table 1, we compare the benchmark sets we obtained from the base sets
SAT-Application, SAT-Crafted and ASP when using these three types of distri-
butions, based on their Q∗ -scores. On the left of the table, we show the PAR10
performance on the base set I of the default configuration of clasp (cdef ; we use
this as a baseline), the configuration cI found on the base set I, and the con-
figuration cI ∗ found on the selected set I ∗ ; this is followed by the performance
on the benchmark sets I ∗ generated using our new algorithm. The last column
reports the Q∗ -score values for the pairs of sets I and I ∗ .
For all three instance sets, the Q*-scores obtained via the normal distribution
were larger than 1.0, indicating that cI* performed better than cI and that the set
I* obtained from our benchmark selection algorithm proved to be a good alter-
native to the entire base set I. Although on the ASP set the log-normal
distribution yielded a larger Q*-score (1.90) than the normal distri-
bution (1.46), on the SAT-Application set the log-normal distribution
did not produce good benchmark sets. When using exponential distributions,
Q*-scores are larger than 1.0 in all three cases, but smaller than those obtained
with normal distributions.
Fig. 1. Boxplots indicating the median, quartiles, minimum and maximum speedup
(PAR10, in CPU seconds) achieved on the instance clusters within the base set
SAT-Application; (left) compares c_default and c_I (high values are favourable for c_I);
(right) compares c_default and c_I* (high values are favourable for c_I*). Special clusters:
S_f, uncompleted feature computation; S_e, too easy; S_h, too hard.
In this work, we have introduced an algorithm for selecting instances from a base
set or distribution to form an effective and efficient benchmark set. We consider a
benchmark set to be effective if a solver configured on it performs at least as well
as when configured on the original set, and we consider it to be efficient if the
instances in it are on average easier to solve than those in the base set. By using
such benchmark sets, the computational resources required for assessing the
performance of a solver can be reduced substantially.
Boosting Sequential Solver Portfolios:
Knowledge Sharing and Accuracy Prediction
1 Introduction
Significant advances in solution techniques for propositional satisfiability test-
ing, or SAT, in the past two decades have resulted in wide adoption of the SAT
technology for solving problems from a variety of fields such as design automa-
tion, hardware and software verification, cryptography, electronic commerce, AI
planning, and bioinformatics. This has also resulted in a wide array of challeng-
ing problem instances that continually keep pushing the design of better and
faster SAT solvers to the next level. The annual SAT Competitions and SAT
Races have played a key role in this advancement, posing as a challenge a set
of so-called “application” category (previously known as the “industrial” cat-
egory) instances, along with equally, but differently challenging, “crafted” and
“random” instances.
Given the large diversity in the characteristics of problems as well as spe-
cific instances one would like to solve by translation to SAT, it is no surprise
that different SAT solvers, some of which were designed with a specific set of
application domains in mind, work better on different kinds of instances. Algo-
rithm portfolios (cf. [7]) attempt to leverage this diversity by employing several
individual solvers and, at runtime, dynamically selecting what appears to be
the most promising solver — or a schedule of solvers — for the given instance.
This has allowed sequential SAT portfolios such as SATzilla [16] and 3S [8,10]
to perform very well in the annual SAT Competitions and Races.
Most of the state-of-the-art sequential algorithm portfolios are based on two
main components: (a) a schedule of “short running” solvers to be run first in
sequence for some small amount of time (usually some fixed percentage of the
total available time, such as 10 %) and (b) a “long running” solver, executed
for the remainder of the time, which is selected by some machine learning
technique (e.g., logistic regression, nearest neighbor search, or decision
forest). If one of the short running solvers succeeds in solving the instance, then
the portfolio terminates successfully. However, all work performed by each short
running solver in this execution sequence is completely wasted unless it manages
to fully solve the instance. If none of the short running solvers in the schedule
succeeds, all faith is put in the one long running solver.
Given this typical sequential portfolio setup, it is natural to consider an
extension that attempts to utilize information gained by short running solvers
even if they all fail to solve the instance. Further, one may also consider an
automated way to carefully revisit the choice of the long running solver whose
improper selection may substantially harm the overall portfolio performance. We
propose two relatively simple yet powerful techniques towards this end, namely,
learnt clause forwarding and accuracy prediction.
We remark that one limitation of current algorithm portfolios is that their
performance can never be better than that of the oracle or “virtual best solver”
(VBS), which, for each given instance, (magically) selects an individual solver that will
perform best on it. By sharing knowledge, we allow portfolio solvers to, in prin-
ciple, go beyond VBS performance. Specifically, a distinguishing strength of our
proposed clause forwarding scheme is that it enables the portfolio solver to poten-
tially succeed in solving an instance that no constituent SAT solver can.
Learnt clause forwarding focuses on avoiding waste of effort by the short
running solvers in the schedule. We propose to share, or “forward,” the knowl-
edge gained by the first k solvers in the form of a selection of short learned
clauses, which are passed on to the (k + 1)-st solver. Conflict-driven clause
learning (CDCL) is a very powerful technique in SAT solving, often regarded as the
single most important element that allows these solvers to tackle real-life prob-
lems with millions of variables and constraints. Forwarding learnt clauses is a
cheap but promising way to share knowledge between solvers and is commonly
employed in parallel SAT solving. We demonstrate that sharing learnt clauses
can improve performance in sequential SAT solver portfolios as well. (For the
specific case of population-based algorithm portfolios, Peng et al. [11] have
proposed sharing information through migration of individuals across
populations.)
2 Background
Given a set of clauses C1 , . . . , Cm over Boolean variables X1 , . . . , Xn ,
we call their conjunction C1 ∧ · · · ∧ Cm a formula in conjunctive normal form
(CNF). The SAT problem has played a prominent role in theoretical computer
science, where it was the first problem proven to be NP-complete [3]. At the
same time, it
has driven research in combinatorial problem solving for decades. Moreover, the
SAT problem has great practical relevance in a variety of areas, in particular in
cryptography and in verification.
The last point regards the idea of inferring new clauses during search that are
redundant to the given formula but encode, often in a succinct way, the reason
why a certain partial truth assignment cannot be extended to any solution. These
redundant constraints strengthen our inference algorithm when a different partial
valuation cannot be extended to a full valuation that satisfies the given formula
for a “similar” reason. One of the ideas that we pursue in this paper is to inform
a solver about the clauses learnt by another solver that was invoked previously
to try and solve the same CNF formula. This technique is standard in parallel
SAT solving but, surprisingly, has not been considered for solver portfolios.
Clause sharing is, moreover, a standard ingredient of parallel SAT solvers,
such as ManySAT [6] and Plingeling [2]. When design-
ing these parallel solvers, it has been observed that the overall performance can
be improved by carefully sharing a limited amount of knowledge between the
search efforts led by different threads. This knowledge sharing must be care-
fully done, as it must balance usefulness of the information against the effort of
communicating and incorporating it. One effective strategy has been to share
information in the form of very short learned clauses, often just unit clauses,
i.e., clauses with only one literal (e.g., the winning parallel solver [2] at the 2011
SAT Competition).
Table 1. Gap closed to the virtual best solver (VBS) by using clause forwarding.
and also after forwarding. Our overall pre-schedule was composed of the origi-
nal one used by 3S in the 2011 SAT Competition, scaled appropriately to take
the difference in machine speeds into account, enhanced with clause forwarding
solvers, and reordered to have non-forwarding CDCL solvers appear after the
forwarding ones. We note that changing the pre-schedule itself did not signifi-
cantly alter the performance of 3S. For example, in the application category, as we will
later see in Table 3, the performances of 3S with the original and the updated
pre-schedules were very similar.
One other consideration that has a significant impact in practice is whether to
simplify the CNF formula before handing it to the next solver in the schedule,
after (up to) M forwarded clauses have been added to it. With some experimenta-
tion, we found that minimal simplification of the formula after adding forwarded
clauses, performed using SatElite [13] in our case, was the most rewarding. We
thus used this clause forwarding setup for the experiments reported in this paper.
We selected 34 features composed of: the first 10 features of the test instance,
Euclidean distance measures between training instances in the neighborhood
and the test instance, and runtime measures of the five best solvers on a restricted
neighborhood (see Table 2 for details). These features are inspired by the k-
nearest-neighbor classifier that 3S employs.
Consequently, for the guardian we need to learn a classifier function f : F → L
that maps feature vectors to labels. To this end we require training data. The
3S portfolio is based on data T that is composed of features of, and runtimes
on, 5,467 SAT instances appearing in earlier competitions. We can split T into
a training set Ttrain and a test set Ttest . Now, we can run the portfolio
restricting its knowledge base to Ttrain and test its performance on Ttest .
For each test instance i ∈ Ttest we can compute the corresponding feature
vector fi and obtain the label Li . Hence, the number of training instances we
obtain for the classifier is |Ttest |. Obviously, one can split T
differently over and over by random subsampling, and each time one creates
new training data to train the “guardian” classifier.
The question arises whether different splits will not merely regenerate exist-
ing knowledge. This depends on the features chosen, but here the feature vector
is actually highly likely to differ for each single split, since in
each split the neighborhood of a test instance will be different. A thought exper-
iment that makes this more apparent is the following: Assume that, for a single
instance i, we sort all other instances according to the distance to i (neighbor-
hood of i). Assume further we select training instances from the neighborhood of
i with probability 1/k until we have selected k instances (where k is the desired
neighborhood size). When k > 10 it is obviously very unlikely for an instance to
have exactly the same neighbors.
In order to determine an appropriate amount of training data, we first ran-
domly split the data set T into a training split Ttrain and a test split Ttest , before
generating the data for the classifier. We then perform the aforementioned split-
ting to generate training data for the classifier on Ttrain and test it on the data
generated by running k-NN with data Ttrain on the test set Ttest . We use 10
different random splits of type Ttrain and Ttest and try to determine the best
number of splits for generating training data for the classifier.
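A minimal Python sketch of this data-generation loop follows; the helpers run_knn_portfolio and features_of are hypothetical stand-ins for the 3S selector and the 34-feature computation described above:

import random

def make_guardian_data(T, n_splits, run_knn_portfolio, features_of,
                       test_fraction=0.2, seed=0):
    """Generate (feature vector, label) pairs for the guardian classifier
    by repeatedly subsampling the portfolio's knowledge base T."""
    rnd = random.Random(seed)
    data = []
    for _ in range(n_splits):
        instances = list(T)
        rnd.shuffle(instances)
        cut = int(len(instances) * (1.0 - test_fraction))
        T_train, T_test = instances[:cut], instances[cut:]
        for inst in T_test:
            # run the k-NN portfolio selector restricted to T_train
            solver, solved = run_knn_portfolio(T_train, inst)
            f = features_of(T_train, inst)   # the 34 features described above
            data.append((f, "good" if solved else "bad"))
    return data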
While normally one could essentially look at the plain accuracy of the clas-
sifier and select the number of splits that result in the highest accuracy, we
propose to employ another measure based on the following reasoning. The clas-
sifier’s “confusion matrix” looks in our context like this (denoting the solver that
was selected by the portfolio on instance I with S):
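The matrix itself appears to have been lost in extraction; from the discussion of cases (a)–(d) below, it can be reconstructed as:

                           classifier: keep S    classifier: overrule S
  S solves instance I             (a)                    (b)
  S fails on instance I           (c)                    (d)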
Instances that fall in category (a) reflect a “good” choice by the portfolio
(our original selector) and, while correctly detected, there is also nothing for us
to gain. In case (c) we cannot exploit the wrong choice of the portfolio since
the guardian classifier does not detect it. However, we will also not degrade
the performance of the portfolio. Cases (b) and (d) are the interesting cases. In
(b) we collect the false-positives where the classifier predicts that the portfolio’s
choice was wrong while it was not. Consequently it could be the case that we
degrade the performance of the original portfolio selector by altering its decision.
All instances falling in category (d) represent the correctly labeled decisions of
the primary selector that should be overturned. In (d) lies the potential of our
method: all instances that fall in this category cannot be solved by solver S that
the primary selector chose, and the guardian classifier correctly detected it. Since
cases (a) and (c) are somewhat irrelevant to any potential recourse action, we
focus on keeping the ratio (b)/(d) as small as possible in order to favorably
balance potential losses and wins. Based on this quality measure we determined
that roughly 100 splits achieve the most favorable trade-off on our data.
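As a small illustration of this selection criterion (a sketch; evaluate_guardian is a hypothetical helper returning the confusion counts for a given number of splits):

def best_split_count(candidates, evaluate_guardian):
    # (b)/(d): false positives per correctly detected bad selection
    def ratio(n_splits):
        a, b, c, d = evaluate_guardian(n_splits)
        return b / d if d > 0 else float("inf")
    return min(candidates, key=ratio)

# on the data described above, roughly 100 splits came out best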
4.2 Recourse
When the guardian classifier triggers, we need to select an alternative solver. For
this purpose we need to devise a second “recourse” classifier. While we clearly
do not want to select the same solver that was suggested by the original portfolio
selector, the space of possible recourse actions is vast and their benefits are hardly
apparent. We introduce the following recourse strategy:
Since we want to replace the suggested solver S, we assume S is not suitable
for the given test instance I. Based on this assumption, we can also
infer that the instances solved by S in the neighborhood of size k of I can
be removed from its neighborhood. Now, it can be the case that the entire
neighborhood of I can be solved by S and therefore we extend the size of the
neighborhood by 30 %. If on this extended neighborhood S cannot solve all
instances, we choose the solver with the lowest PAR10-score on the instances
in the extended neighborhood not solved by S. Otherwise, we choose the solver
with the second best ranking by the original portfolio selector. In the context of
3S this is the solver that has the second lowest PAR10-score on the neighborhood
of the test instance.
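A compact Python rendering of this recourse strategy follows; the lookup tables par10 and solved_by, as well as the precomputed extended neighbourhood, are hypothetical stand-ins for the 3S training data:

def recourse_solver(S, neighbourhood, extended, par10, solved_by):
    """Select an alternative to the suggested solver S (assumed unsuitable).
    neighbourhood: the k nearest training instances of the test instance;
    extended: the neighbourhood enlarged by 30 %."""
    # instances in the extended neighbourhood that S fails to solve
    hard_for_S = [i for i in extended if i not in solved_by[S]]
    candidates = [s for s in par10 if s != S]
    if hard_for_S:
        # solver with the lowest PAR10 score on the instances S fails on
        return min(candidates,
                   key=lambda s: sum(par10[s][i] for i in hard_for_S))
    # otherwise: the selector's second choice, i.e. the solver with the
    # second-lowest PAR10 score on the original neighbourhood
    return min(candidates,
               key=lambda s: sum(par10[s][i] for i in neighbourhood))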
Designing a good recourse strategy poses a challenge. As we will see later
in Sect. 5.3, our proposed recourse strategy resulted in solving 209 instances on
the 2011 SAT Competition application benchmark, compared to the 204 that
3S solved. We tried a few other simpler strategies as well, which did not fare
as well. We briefly mention them here: First, we used the solver that has the
second best ranking in terms of the original classifier. For 3S this means choosing
the solver with the second lowest PAR10-score on the neighborhood of the test
instance. This showed only a marginal improvement, solving 206 instances. We
then tried to leverage diversity by mixing-and-matching the two recourse strate-
gies mentioned above, giving each exactly half the remaining time. This resulted
in the overall performance dropping below that of 3S without accuracy prediction. Finally,
we computed offline a static replacement map that, for each solver S, specifies
one fixed solver f (S) that works the best across all training data whenever S
is selected by the original classifier but does not solve the instance. This static,
feature-independent strategy also resulted in degrading performance. For the
rest of this paper, we will not consider these alternative replacement strategies.
5 Empirical Evaluation
We implemented learnt clause forwarding in three CDCL SAT solvers that were
used by 3S in the 2011 Competition: CryptoMiniSat 2.9.0 [12], Glucose 1.0 [1],
and MiniSat 2.2.0 [14]. The pre-schedule was modified to prolong the time these
three clause-learning solvers are run, as discussed earlier. With clause forward-
ing disabled, 3S with this modified pre-schedule resulted in roughly the same
performance on our testbed as 3S with the original pre-schedule used in the
Competition (henceforth referred to as 3S-C). In other words, any performance
differences we observe can be attributed to clause forwarding and accuracy pre-
diction and recourse, not to the change in the pre-schedule itself.
For clause forwarding, we used parameter values L = 10 and M = 10, 000,
i.e., each of the three solvers may share up to 10,000 clauses of size up to 10 for
the next solver to be run. The maximum number of clauses shared is therefore
30,000. We note that these parameters are by no means optimized. Among other
variations, we tried sharing an unlimited number of (small) clauses, but this
unsurprisingly degraded performance. We expect that these parameters can be
tuned better. Nevertheless, the above choices worked well enough to demonstrate
the benefits of clause sharing, which is the main purpose of this experimentation.
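The following Python sketch illustrates this forwarding loop under the stated parameters; run_solver is a hypothetical wrapper that runs one solver on a clause list under a time budget and returns its result together with its learnt clauses, so this is an assumption-laden sketch rather than the authors' implementation:

MAX_LEN = 10         # L: forward only clauses with at most 10 literals
MAX_FORWARD = 10000  # M: each solver may forward at most 10,000 clauses

def run_schedule(cnf, schedule, run_solver):
    forwarded = []  # clauses accumulated from earlier solvers in the schedule
    for solver, budget in schedule:
        # hand the instance plus all clauses forwarded so far to the solver
        result, learnt = run_solver(solver, cnf + forwarded, budget)
        if result is not None:           # SAT or UNSAT: done
            return result
        # keep only short clauses, capped at M per solver
        short = [c for c in learnt if len(c) <= MAX_LEN][:MAX_FORWARD]
        forwarded.extend(short)
    return None  # schedule exhausted without solving the instance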
Fig. 1. Histogram showing how often N clauses are forwarded (number of
instances vs. number of forwarded clauses, 0–30,000). Left: crafted instances.
Right: application instances.
On the 2011 SAT Competition application instances, the guardian classifier's
confusion matrix contained the following counts:

    (a) 61    (b) 7
    (c) 25    (d) 14

Hence, the best possible outcome for a recourse action would be to solve the
previously unsolved 14 instances (≈ 5 % of all the 2011 application instances)
under (d) and to still be able to solve the 7 instances (≈ 2 %) under (b). While in
the best case we could gain 14 instances and lose none, it is obviously not clear
whether one would achieve any gain at all, or even solve at least the 7 instances
that originally used to be solved. Fortunately, with our recourse strategy, we
witness a significant gain in overall performance. We integrated the classifier in
3S in the following way: When 3S suggests the primary solver, if indicated by
our guardian REP-Tree model, we intercept its decision and alter it as proposed
by our recourse strategy.
Since our base portfolio solver, 3S, already works best on random and crafted
instances considered, the objective is to close the large gap between the best
sequential portfolios and the best individual solvers in the application track,
while not degrading the performance of the portfolio on crafted and random
categories.
To this end, let us first note that adding the methods proposed in this paper
has no significant impact on 3S performance on random and crafted instances.
On random instances, knowledge sharing hardly takes place, since CDCL-based
complete solvers are barely able to learn any short clauses on these instances.
For crafted instances, a limited amount of clause forwarding does happen,
but much less so than in application instances. In Fig. 1 we show how many
instances in our test set share how many clauses. On the left we see that, on
crafted instances, we mostly share a modest amount of clauses between solvers,
if any. The plot on the right shows the situation for application instances. Here
it is usually the case that the solvers share the full allowance of 30,000 clauses.
Table 3. Performance comparison of 3S-C from the competition and its four new
variants: 3S, 3S+f, 3S+p, and 3S+fp on application.
6 Conclusion
We presented two novel generic techniques for boosting the performance of SAT
portfolios. The first approach shares the knowledge discovered by SAT solvers
that run in sequence, while the second improves solver selection accuracy by
detecting when a selection is likely to be inferior and proposing a more promis-
ing recourse selection. Applying these generic techniques to the SAT portfolio
3S resulted in significantly better performance on application instances while
not reducing performance on the crafted and random categories, making the result-
ing solver, 3S+fp, excel in all categories in our evaluation using the 2011 SAT
Competition data and solvers.
References
1. Audemard, G., Simon, L.: Predicting learnt clauses quality in modern SAT solvers.
In: 21st IJCAI, Pasadena, CA, pp. 399–404, July 2009
2. Biere, A.: Plingeling: solver description. SAT Race (2010)
3. Cook, S.A.: The complexity of theorem-proving procedures. In: STOC, pp. 151–
158. ACM (1971)
4. Gomes, C.P., Selman, B.: Algorithm portfolios. AI J. 126(1–2), 43–62 (2001)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
6. Hamadi, Y., Sais, L.: ManySAT: a parallel SAT solver. JSAT 6, 245–262 (2009)
7. Rice, J.R.: The algorithm selection problem. Adv. Comput. 15, 65–118 (1976)
8. Kadioglu, S., Malitsky, Y., Sabharwal, A., Samulowitz, H., Sellmann, M.: Algo-
rithm selection and scheduling. In: Lee, J. (ed.) CP 2011. LNCS, vol. 6876, pp.
454–469. Springer, Heidelberg (2011)
9. Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y.: A port-
folio approach to algorithm selection. In: IJCAI, pp. 1542–1543 (2003)
10. Malitsky, Y., Sabharwal, A., Samulowitz, H., Sellmann, M.: Non-model-based algo-
rithm portfolios for SAT. In: Sakallah, K.A., Simon, L. (eds.) SAT 2011. LNCS,
vol. 6695, pp. 369–370. Springer, Heidelberg (2011)
11. Peng, F., Tang, K., Chen, G., Yao, X.: Population-based algorithm portfolios for
numerical optimization. IEEE Trans. Evol. Comput. 14(5), 782–800 (2010)
12. Soos, M.: CryptoMiniSat 2.9.0. http://www.msoos.org/cryptominisat2 (2010)
13. Sorensson, N., Een, N.: SatELite 1.0. http://minisat.se (2005)
14. Sorensson, N., Een, N.: MiniSAT 2.2.0. http://minisat.se (2010)
15. Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla-07: the design and
analysis of an algorithm portfolio for SAT. In: Bessière, C. (ed.) CP 2007. LNCS,
vol. 4741, pp. 712–727. Springer, Heidelberg (2007)
16. Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: portfolio-based algo-
rithm selection for SAT. JAIR 32(1), 565–606 (2008)
A Fast and Adaptive Local Search Algorithm
for Multi-Objective Optimization
1 Introduction
Multi-objective Optimization Problems (MOP) appear in many practical appli-
cations in engineering, finance, transportation, etc. They are characterized by
multiple objective functions which need to be jointly optimized. The Pareto-
optimal solutions of MOP are the solutions where no single objective can be
improved without worsening at least one other objective. The set of all Pareto-
optimal solutions in the objective space is called the Pareto front. Several algo-
rithms have been proposed to approximate the Pareto front with a single run.
They can be split into two groups: population-based methods and individual-solution
methods. The first group contains evolutionary algorithms (EAs) [1,2]. In each
generation, EAs modify a population of solutions by genetic operators such that
the next population contains high-quality and diverse solutions. Although EAs
are robust for many problems, they often require a large number of function
calls to evaluate the quality of solutions. In contrast, the individual-solution
algorithms [5,6] try to improve only one solution in each iteration, by replacing
it with a better one in its neighborhood. When that solution is Pareto-optimal,
it is saved as a representative solution on the Pareto front. As only one solution
is evaluated in each iteration, the number of function evaluations can be sig-
nificantly reduced. However, without a good trade-off between exploration and
exploitation, these algorithms can be stuck in local minima.
To solve the above difficulties, we introduce a Fast and Adaptive Local
Search Algorithm for Multi-objective Optimization (abbreviated as FASAMO).
the perturbation level. Besides, the two phases of the algorithm can be switched
between according to the approximate front and the current solution. This gives the algo-
rithm a chance to correct the mistake of switching to the exploitation phase too
soon. In addition, to increase the convergence speed, if a perturbation step is
successful then it will be performed again in the next step.
solution. Next, the new solution xnew is added to the non-dominated solution
set nonDomSet. If xnew dominates at least one solution in nonDomSet, then
xnew is considered to have improved nonDomSet. If the new solution improves
the non-dominated solution set or its energy is smaller than or equal to the
current solution energy, then the algorithm moves to the new solution and sets
the search direction to goOn. Otherwise, the search direction is set to random.
After that, the algorithm considers the improvement status to switch between
two phases and adjust the perturbation level. In detail, if the algorithm is in the
exploration phase and the number of times that the algorithm sequentially can-
not improve the current solution reaches a maximum value maxUnimproved,
then the algorithm switches to the exploitation phase, as the current solution
is now on a local front. When the algorithm is in the exploitation phase, the
perturbation level perturbationLevel is adjusted by a cosine function with a fre-
quency of cosFreq. If the number of times that the current solution is improved
equals a maximum value maxImproved, then the algorithm switches again to
the exploration phase. The main reason for this step is that an improvement of
the current solution implies that it is now in a new and more promising region;
thus, large search steps can be performed to quickly reach the local minima of
that region or to identify other promising neighboring regions.
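The following Python sketch captures this bookkeeping; since the excerpt does not give the exact cosine formula, the modulation below is only one plausible reading, and all names are ours:

import math

def init_state(max_perturbation=0.3):
    # start in the exploration phase with the full perturbation level
    return {"phase": "explore", "improved": 0, "unimproved": 0,
            "step": 0, "level": max_perturbation}

def adapt(state, improved, max_unimproved=100, max_improved=10,
          max_perturbation=0.3, cos_freq=101):
    """One update of the phase/perturbation bookkeeping described above."""
    if improved:
        state["unimproved"] = 0
        state["improved"] += 1
        # enough consecutive improvements: the solution has reached a new,
        # promising region, so switch back to exploration (large steps)
        if state["phase"] == "exploit" and state["improved"] >= max_improved:
            state["phase"], state["improved"] = "explore", 0
    else:
        state["improved"] = 0
        state["unimproved"] += 1
        # prolonged stagnation: the solution sits on a local front
        if state["phase"] == "explore" and state["unimproved"] >= max_unimproved:
            state["phase"], state["unimproved"] = "exploit", 0
    if state["phase"] == "explore":
        state["level"] = max_perturbation  # assumed: full level while exploring
    else:
        # exploitation: modulate the level with a cosine wave of frequency
        # cos_freq (the exact formula is not given in the excerpt)
        state["step"] += 1
        state["level"] = max_perturbation * abs(math.cos(cos_freq * state["step"]))
    return state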
3 Experiments
This section presents the experiments to compare our algorithm and four other
algorithms: SMOSA, UMOSA, DMOSA and NSGAII on the DTLZ bench-
mark [3] (with the number of variables and objectives as suggested in [3]). All
algorithms are implemented on the jmetalcpp framework [4]. The parameters
of FASAMO are set as follows: maxUnimproved = 100, maxImproved = 10,
maxPerturbation = 0.3 and cosFreq = 1 + 10 · numberOfVariables. The odd
value of cosFreq is used to eliminate the zero value of the perturbation level.
The parameters of DMOSA are set as suggested in [6]. The default parameters
of NSGAII in the jmetalcpp framework are used in the experiments. The max-
imum temperature Tmax of SMOSA and UMOSA is set to 1.0. For measuring
the distance to the true Pareto front and the coverage of the solution set, we
use the generational distance metric [8] and the hypervolume metric [9] (see
the implementation for the minimization problem in the jmetalcpp framework),
respectively. Besides, the Pareto-front samples from the jmetalcpp website are
used when computing the metrics.
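For reference, a minimal Python implementation of one common form of the generational distance metric [8] (implementations differ in the exact normalization, so this is a sketch rather than the jmetalcpp version):

import math

def generational_distance(front, true_front):
    # squared Euclidean distance between two objective vectors
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # root of summed squared nearest-neighbour distances, divided by |front|
    return math.sqrt(sum(min(d2(p, q) for q in true_front)
                         for p in front)) / len(front)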
Experimental Results on the Number of Function Evaluations: In this
experiment, we measure the number of evaluations required by each algorithm to
obtain the approximate front with the generational distance of 0.01 (to the true
Pareto front). We run each algorithm 10 times on each problem. In a run, for
every 100 evaluations, the algorithms check whether the generational distance
of the approximate front is less than or equal to 0.01. If this condition holds or
the number of evaluations reaches 100,000, the algorithm stops and reports the
number of evaluations.
4 Conclusion
In this paper, we propose a fast and adaptive local search algorithm for MOP,
called FASAMO. Our algorithm is an individual-solution algorithm with an
automatic mechanism for switching between two phases of exploration and
exploitation. The experiments on seven problems of the DTLZ benchmark show
that our algorithm significantly outperforms the popular evolutionary algorithm
NSGAII and three other multi-objective simulated annealing algorithms.
References
1. Coello, C.A.C., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary algorithms for
solving multi-objective problems, vol. 5. Springer, Heidelberg (2007)
2. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective
genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
3. Deb, K., Thiele, L., Laumanns, M., Zitzler, E.: Scalable multi-objective optimization
test problems. In: Proceedings of the Congress on Evolutionary Computation (CEC-
2002), Honolulu, USA, pp. 825–830 (2002)
4. Durillo, J.J., Nebro, A.J.: jMetal: a Java framework for multi-objective
optimization. Adv. Eng. Softw. 42, 760–771 (2011)
5. Serafini, P.: Simulated annealing for multiple objective optimization problems. In:
Tzeng, G., Wang, H., Wen, U., Yu, P. (eds.) Multiple Criteria Decision Mak-
ing. Expand and Enrich the Domains of Thinking and Application, pp. 283–292.
Springer, Heidelberg (1994)
6. Smith, K.I., Everson, R.M., Fieldsend, J.E., Murphy, C., Misra, R.: Dominance-
based multiobjective simulated annealing. IEEE Trans. Evol. Comput. 12(3), 323–
342 (2008)
7. Ulungu, E.L., Teghem, J., Fortemps, P.H., Tuyttens, D.: MOSA method: a tool for
solving multiobjective combinatorial optimization problems. J. Multi-Criteria Decis.
Anal. 8(4), 221–236 (1999)
8. Van Veldhuizen, D.A., Lamont, G.B.: Evolutionary computation and convergence
to a Pareto front. In: Late Breaking Papers at the Genetic Programming 1998 Con-
ference, pp. 221–228 (1998)
9. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case
study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271
(1999)
An Analysis of Hall-of-Fame Strategies
in Competitive Coevolutionary Algorithms
for Self-Learning in RTS Games
1 Introduction
components that evolve simultaneously and the fitness depends on the interac-
tion between these components; in competition-based approaches, an individual
competes with other individuals for the fitness value and, if appropriate, will
increase its fitness at the expense of its counterparts, whose fitnesses decrease.
This latter approach resembles an arms race in which the improvement of some
individuals causes the improvement of others, and vice versa.
This paper deals with the application of competitive coevolution (CC) as a
self-learning mechanism in RTS games. As a first contribution, the paper ana-
lyzes the performance of different approaches to applying the concept of the
Hall-of-Fame (HoF), defined by Rosin and Belew in [1], as a long-term memory
in competitive coevolutionary algorithms; the analysis is conducted in the
context of the real-time strategy (RTS) game RobotWars. The goal is to
automatically produce game strategies that govern the behavior of an army in
the game and can beat its opponent counterpart. As a second contribution, this
work proposes alternatives for optimizing two key aspects of the implementation
of the HoF, namely the diversity of the solutions and the growth of the
champions' memory.
This paper is organized as follows. Next, given that we focus in this work on
exploring different variants of the HoF as an evaluation and memory mechanism
in competitive coevolutionary settings, we present an overview of competitive
coevolution in games. In Sect. 3 we explain the game which is our arena for
competitive coevolution. Section 4 describes our variants for implementing a
Hall-of-Fame based competitive coevolutionary algorithm. In Sect. 5 we analyze
the results obtained by each variant in a series of experiments. Finally, in
Sect. 6 we conclude this investigation.
3 Game Description
This section is devoted to RobotWars (http://www.lcc.uma.es/∼afdez/robotWars),
our arena for competitive coevolution, which was first presented in [18].
RobotWars is a test environment that allows two virtual players (i.e., game
AIs) to compete in a three-dimensional scenario of an RTS war game; thus, it is
not a standard game in the sense that no human players intervene interactively
during the game; however it
Regarding the use and implementation of the HoF, some aspects must be defined.
The first is the criterion for inserting a new member into the memory. We have
also considered different policies for maintaining the champions in the set;
regarding this issue, one has to take into account the contribution of the
individual (i.e., the champion) to the search process as, for instance, it
might be the case that some opponents that belong to very old generations do
not show a valuable performance in comparison with opponents generated in
recent generations and thus might be easily beaten; it is therefore crucial to
remove those champions not contributing to the solution, which, in other words,
provides a mechanism to control the size of the champions' memory. Another
relevant aspect concerns the selection of those strategies from the HoF that
will be employed in the evaluation process; considering all the champions might
produce more consistent solutions at the expense of a very high computational
cost (note that a simulation of the match must be executed for each champion
involved in the evaluation; we will provide more details on this further on).
Next we present our HoF-based competitive coevolutionary algorithm (HofCC) and
five variants that differ precisely in the policy for establishing the aspects
mentioned previously.
Algorithm 1 shows the schema of our basic algorithm HofCC. A specific strategy
is considered winning if it achieves a certain score (see below) when it deals with
each of the strategies belonging to the set of winning strategies of its opponent
(i.e., the enemy Hall-of-Fame). The initial objective is to find a winning strategy
of player 1 with respect to player 2 (i.e., the initial opponent) so that the HoF of
player 2 is initially loaded with some strategies (randomly or manually initialized:
line 2). Then a standard evolutionary process tries to find a strategy for player
1 that can be considered as victorious (lines 7–13). A strategy is considered
winning if its fitness value is above a certain threshold value φ (line 14) that
enables the tuning of the selective pressure of the search process by considering
higher/lower quality strategies; in case of success (line 14), this strategy is added
to the HoF of player 1 (line 16) and the process is initiated again but with
the players’ roles interchanged (line 17); otherwise (i.e., no winning strategy is
found) the search process is restarted again. If after a number of coevolutionary
steps no winning strategy is found the search is considered to have stagnated
and the coevolution finishes (see while condition in line 4). At the end of the
whole process we obtain as a result two sets of winning strategies associated
respectively to each of the players.
Regarding the evaluation of candidates for a specific player p (where p ∈
{player 1, player 2}), the fitness of a specific strategy is computed by facing
it against a selected subset of the (winning) strategies in the Hall-of-Fame of
its opponent player (which we call the selected opponent set). Given a specific
strategy s, its fitness is computed as follows:

$$\mathrm{fitness}(s) = \frac{\sum_{j=1}^{k}\left(p_j^{s} + \mathrm{extras}_s(j)\right)}{k} \qquad (1)$$
Algorithm 1: HofCC()
 1  nCoev ← 0; A ← player1 ; B ← player2 ; φ ← threshold value;
 2  HoFA ← ∅; HoFB ← InitialOpponent();
 3  pop ← Evaluate(HoFB ); // evaluate initial population
 4  while nCoev < MaxCoevolutions ∧ NOT(timeout) do
 5      pop ← RandomSolutions(); // pop randomly initialized
 6      i ← 0;
 7      while (i < MaxGenerations) ∧ (fitness(best(pop)) < φ) do
 8          parents ← Select(pop);
 9          childs ← Recombine(parents, pX );
10          childs ← Mutate(childs, pM );
11          pop ← Replace(childs);
12          pop ← Evaluate(HoFB );
13          i ← i + 1;
14      end while
15      if fitness(best(pop)) ≥ φ then // winner found!
16          nCoev ← 0; // start new search
17          HoFA ← HoFA ∪ {best(pop)};
18          temp ← A; A ← B; B ← temp; // interchange players' roles
19      else
20          nCoev ← nCoev + 1; // continue search
21      end if
22  end while
This section is devoted to describing five variants of our HofCC algorithm; basi-
cally, these variants differ in the nature (i.e., in this case, the size) of the HoF when
it is used as a memory mechanism, ranging from short-term memory versions to
long-term memory instances.
4.2.1 HofCC-Complete
In this variant the Hall-of-Fame acts as a long-term memory by keeping all the
winners found in previous coevolutions, and all of them are also used in the
evaluation process. So, in the coevolutionary step n each possible solution of
army A fights against each solution in {B1 , B2 , . . . , Bn−1 }, where Bi is the champion
found by army B in the i-th (for 1 ≤ i ≤ n − 1) coevolutionary step.
Note that the cardinality of the selected opponent set and the cardinality of
the HoF in the coevolutionary step n are equal (i.e., k = n in Eq. 1).
4.2.2 HofCC-Reduced
Here the HoF acts as the minimum short-term memory mechanism by minimiz-
ing the number of battles required for evaluating an individual; that is to say,
in the n-th coevolutionary step, each individual in army A only faces the latest
champion inserted in the HoF of army B (i.e., Bn−1 ). Note that this means
k = 1 in Eq. 1.
4.2.3 HofCC-Diversity
In this proposal the HoF acts as a long-term memory mechanism, but the content
of the HoF is updated by removing those members that provide less diversity.
The value of diversity that an individual in the HoF provides is calculated via
genotypic distance as follows: we manage the memory of champions as a matrix
in which each row represents a solution and each column a gene (i.e., an action
in the strategy). Then, we compute the entropy value for a specific column j as
follows:

$$H_j = -\sum_{i=1}^{k} p_{ij}\,\log p_{ij} \qquad (3)$$

where p_{ij} is the probability of action i in column j, and k is the length of the
memory. Finally, the entropy of the whole set is defined by the following formula:

$$H = \sum_{j=1}^{n} H_j \qquad (4)$$
The higher the value of H, the greater the diversity of the set. To determine
the diversity contribution of a specific solution, we calculate the value of
entropy with this solution inside the set and the corresponding value with this
solution outside the set; the difference of these two values represents the
solution's contribution to diversity.
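A direct Python transcription of Eqs. (3)–(4) and of this contribution computation (a sketch; the HoF is represented as a list of equal-length action sequences):

import math
from collections import Counter

def entropy(hof):
    """Entropy of the champions' memory: the sum over columns (genes) of
    H_j = -sum_i p_ij log p_ij, with p_ij the frequency of action i in
    column j (Eqs. 3-4)."""
    if not hof:
        return 0.0
    h = 0.0
    for j in range(len(hof[0])):                 # iterate over columns/genes
        counts = Counter(sol[j] for sol in hof)
        n = len(hof)
        h -= sum((c / n) * math.log(c / n) for c in counts.values())
    return h

def diversity_contribution(hof, idx):
    # entropy with the solution inside the set minus entropy without it
    return entropy(hof) - entropy(hof[:idx] + hof[idx + 1:])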
The number of individuals to be deleted from the memory should be set by
the programmer as a percentage value (α) representing the portion of the HoF
to be removed; in other words, the HoF (with cardinality #HoF) is ordered
according to the diversity value in decreasing order, and the last ⌈#HoF · α⌉
individuals in this ordered sequence are removed. The frequency of updating (λ)
is also a parameter of this version (i.e., the HoF is updated every λ coevolutions).
The motivation of this proposal is to maintain certain diversity among the
members of the HoF, and at the same time to reduce (or maintain an acceptable
value for) the size of the memory. With this idea, we assume that the deleted
individuals will not affect the quality of the found solutions.
Here, the cardinality of the selected opponent set k in the evaluation phase
(see Eq. 1) is the cardinality of the opponent HoF after executing the updating
of the memory (i.e., after removing the individuals).
4.2.4 HofCC-Quality
In this version, we follow a similar approach to that applied in HofCC-Diversity
but now the HoF is ordered with respect to a measure of quality that is defined as
the number of defeats that an individual obtained in the previous coevolutionary
step; in other words, a simple counter variable associated with each member of
the HoF stores the number of defeats that were computed for the corresponding
member during the evaluation process of the opponent army in the previous
coevolutive turn.
Based on our game experience, we assume that this metric is representative
of the strength of a solution, and the aim is to keep only the robust individuals
in the champions’ memory by removing the weak strategies.
As in the HofCC-Diversity, the parameters α and λ have to be set, and the
cardinality of the selected opponent set k is exactly the same.
4.2.5 HofCC-U
This variant of HofCC follows the idea of optimizing the memory of champions,
but in this case we propose a multiobjective approach where each solution has a
diversity value and also a quality value as described previously associated with
it. Then, a percentage (α) of the set of dominated solutions according
to the multiobjective values is removed; if the set of dominated solutions is
empty, then the HoF is ordered according to the measure of quality and the
solutions with the worst quality are removed.
As in the previous algorithms (HofCC-Quality and HofCC-Diversity) the
frequency of updating the HoF is an important parameter that must be defined.
This proposal uses a different fitness function (from that shown in Eq. 1) whose
definition was inspired by Competitive Fitness Sharing (CFS) [3]. The main
idea is that defeating an opponent X is more valuable if few other individuals
also defeat X. A penalization value N_i for each individual i (for
1 ≤ i ≤ k) in the population is then calculated as follows:

$$N_i = 1 - \frac{\sum_{j=1}^{k} \frac{v_{ij}}{V(j)}}{k} \qquad (5)$$

where v_{ij} = 1 if individual i of the population defeats the strategy (or
champion) j in the HoF (whose cardinality is k) and 0 otherwise, and

$$V(j) = \sum_{i=1}^{n} v_{ij}.$$
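Equation (5) translates directly into code; the sketch below adds a guard for champions that nobody defeats, a corner case the excerpt does not specify:

def cfs_penalties(v):
    """Competitive-fitness-sharing penalties (Eq. 5): v[i][j] == 1 iff
    individual i defeats champion j in the opponent HoF; a win over a
    champion that few others beat yields a smaller penalty."""
    n = len(v)       # population size
    k = len(v[0])    # cardinality of the opponent HoF
    # V[j]: how many individuals of the population defeat champion j
    V = [sum(v[i][j] for i in range(n)) for j in range(k)]
    # N_i = 1 - (sum_j v_ij / V(j)) / k; undefeated champions contribute 0
    return [1.0 - sum(v[i][j] / V[j] for j in range(k) if V[j] > 0) / k
            for i in range(n)]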
Figure 1 shows the behavior of the average fitness for each algorithm instance. In
this figure the algorithms are sorted according to the median of the distribution
(i.e., the vertical red line). The Kruskal-Wallis test confirms that the differences
between values are statistically significant (see the first row in Table 1). The
HofR algorithm reaches the worst results for this indicator; such results may be
a sign that this algorithm does not exploit the search space sufficiently because
in this version the HoF acts as a short-term memory mechanism, whereby it is easier
to find a champion than in the rest of the algorithms. Moreover, note that
the best results are obtained by algorithms which optimize the use of HoF (in
terms of diversity, quality, or both), and at the same time do not reduce too
significantly the memory size; note also that the average fitness value decreases
in those cases where the HoF suffered a reduction of 50 % during the updating
process. In the results of multiple tests for the value of average fitness, the HofR
distribution has significant differences with respect to the majority of the algorithms
Table 1. Results of the Kruskal-Wallis test for all the indicators (p-value < 0.05)

    Indicator                p-value
    Average fitness          2.6205E−004
    Best fitness             0.0175
    Number of evaluations    0.2214
    Number of defeats        5.6909E−007
(except HofU-50, and HofQua-50), and the rest of the versions have a similar
behavior.
For the number of evaluations, the results are shown in Fig. 3; according
to the statistical tests performed (see Table 1), there is no statistically
significant difference in the distribution data. For this indicator we noted
that the increase in the number of evaluations is in consonance with the length
of the coevolutive cycle, except in the case of HofR, which presents a very
long cycle but is not affected, because during the evaluation process of this
algorithm the individuals face a single opponent, and this decreases the number
of evaluations significantly. Consider that the coevolutive cycle's length is
determined by the number of coevolutions that the algorithm uses to find an
undefeated champion (i.e., a member of the HoF which cannot be defeated by the
opponent side), and it helps to identify whether the problem difficulty
increases as better solutions are obtained, or whether it remains stable. In
all algorithms (except HofR) the rigor of the competition increases until it
reaches the point at which the algorithm cannot exceed the level of
specialization achieved. However, in HofR the cycles were very long, because
the quality of the solutions had stagnated, and it was necessary to limit the
length of cycles to 500 iterations.
For this test, the last champions (i.e., the last member added to the HoF) found
by each algorithm instance (in each execution) fought in an All versus All tour-
nament. The results with respect to the number of defeats are shown in Fig. 4
and Table 1 (row 4). The main differences are in the values of HofR and HofC,
which still maintain the same poor results as in the previous indicators. On the
other hand, HofDiv-10 and HofDiv-30 again obtain the best values. Curiously,
HofQua-30, which had the worst results in the analysis of best fitness, has a
Fig. 4. Number of defeats obtained by each algorithm in an all-vs-all tournament
low ranking of defeats here; this is certainly an indicator that the fitness mea-
sure used is insufficient. The instances of HofQua-10 and HofU-10 behave similarly,
with high numbers of defeats. Another detail that attracts attention is that
the variants that reduce the HoF by 50 % had not shown encouraging results
in the previous indicators (except for HofDiv-50); however, in this analysis
we can see that they are in the upper middle of the ranking.
By executing fighting tournaments, we have also shown that the fitness values
are not related to solution robustness; this is a particular result which
can be interpreted as follows. During the coevolutive cycle the search space is
explored, but the self-learning mechanism falls into a local optimum and gets
trapped there, so that the solutions found improve their fitness values without
a global improvement in a more general context.
References
1. Rosin, C.D., Belew, R.K.: Methods for competitive co-evolution: finding opponents
worth beating. In: ICGA, pp. 373–381 (1995)
2. Ficici, S.G., Bucci, A.: Advanced tutorial on coevolution. In: Proceedings of the:
GECCO Conference Companion on Genetic and Evolutionary Computation, pp.
3172–3204. ACM, New York (2007)
3. Rosin, C., Belew, R.: New methods for competitive coevolution. Evol. Comput.
5(1), 1–29 (1997)
4. de Jong, E.: Towards a bounded Pareto-coevolution archive. In: Congress on Evo-
lutionary Computation, CEC2004, vol. 2, pp. 2341–2348. IEEE, New York (2004)
5. Jaskowski, W., Krawiec, K.: Coordinate system archive for coevolution. [21], pp.
1–10
6. Yang, L., Huang, H., Yang, X.: A simple coevolution archive based on bidirectional
dimension extraction. In: International Conference on Artificial Intelligence and
Computational Intelligence: AICI’09, vol. 1, pp. 596–600. IEEE, Washington (2009)
Resources Optimization in (Video) Games
Abstract. In spite of the efficacy of Operations Research (OR), its tools are
still underused, due to the difficulties that people experience when describing a
problem through a mathematical model. For this reason, teaching how to
approach and model complex problems is still an open issue. A strong relation
exists between (video) games and learning: for this reason we explore to what
extent (real-time) simulation video games could be envisaged as an inno-
vative, stimulating and compelling approach to teach OR techniques.
1 Introduction
We will not dig into how the game should be presented to a classroom or how the
modelling phase should be taught, but into a preliminary aspect: (dis)proving that
simulation video games contain enough combinatorial structure to imply interesting
optimization problems and enough numerical data to allow building reasonable
instances.
The paper is organized as follows: Sect. 2 briefly recalls the grounding upon
which we have built our work. The following Sect. 3 describes how it is possible to
formalize quite complex OR problems starting from the video game Caesar IV.
Section 4 summarizes several perspective results, draws conclusions and describes
future developments.
learning are connected [27, 28]. The basic observation is that humans have always
used games as playgrounds for learning and exercising safely specific skills (see, e.g.,
the relation that, in the past, linked archery tournaments to the ability to catch food).
During this process, the human brain secretes endorphins (which makes a game an
enthralling and fun activity) and is highly focused on recognizing recurring patterns in
problems and on creating appropriate neural routines to deal with them. Once the
pattern is fully caught by the player, the game becomes boring, but the skill has been
accurately acquired. To exploit this phenomenon, we have envisaged a set of par-
ticularly “difficult to approach” problems, typical of the OR domain, and for each
problem we have identified at least one popular video game which can be exploited as
a case study to help students in the process of refining their modelling skills (Table 1).
For each game we have developed and tested the related OR problems, but for reasons
of paper length, in the following we will focus mainly on the maximum profit problem
for the game Caesar IV.
Caesar IV [29] is a managerial strategic video game in real time, set in the ancient
Rome period, developed by Tilted Mill Entertainment. The game aims at simulating
the management of a city, and the player's ultimate goal is to develop the “ideal” city,
endowed with whichever good or service its population may need, while at the same
time maximizing its income. It is an ideal playground to verify how OR techniques
can be proficiently applied to video games. We have designed, developed and tested
models based on Caesar (namely: profit maximization, optimal service configuration,
facility location) that allow the maximization of the player's performance. This
means exploiting the “rewarding” effect implicit in the accomplishment of in-game
goals that characterizes each (video) game, as a tool for motivating learners to approach
the modeling problem. As an example, we briefly describe the approach to the
maximization of the city profit. From the point of view of OR, the goal of the game
maps into the optimization of the starting budget of the player, that is to say, the
optimal allocation of the starting resources in order to maximize the
city income derived from trading goods with other cities. Since goods can be
produced either from scratch (e.g. olives) or from raw materials (e.g. olive oil), a first
set of decisions focuses on the selection of the best (in terms of trade-off between
production costs and income provided) mix of goods to produce. This decision is
influenced also by the types of goods produced by or tradable (in defined quantities)
with neighboring cities. Once the mix has been chosen, the necessity of other deci-
sions arises: each good can be produced only by a specific plant (characterized by a
construction cost), which, in turn, is operated by workers characterized by specific sets
of needs. As a consequence, the player must face the necessity to preserve enough of
each good to satisfy domestic consumption. The problem is further complicated by the
fact that goods must be distributed among the population according to the needs of
different classes. Also, different social classes will need different types of houses (with
different cost, space occupation, etc.).
In other words, the player aims at maximizing the difference between total rev-
enues (R) and total costs (C). R is the sum of the revenues produced by selling abroad
certain amounts of goods at their fixed prices. Costs are, as usual, the sum between
total fixed costs and total variable costs, constrained by the total starting budget
available to the player. Total fixed costs are the sum of the costs sustained to build
production plants, houses, warehouses and to open commercial routes, and total
variable costs are produced by the total amount of goods imported. To model the
problem, it is necessary to define sets, constraints and several variables, whose values
indicate whether a certain activity is in place, the quantity of specific plants, goods, etc.
Once the objective function, the relevant sets and variables are in place, several
constraints mirroring the restrictions in the gameplay are needed for the following in-
game objects (a minimal model sketch is given after the list):
– production plants: many of them need raw materials to produce goods. Raw
materials can be produced or imported. A production plant should be built only if
the raw materials it requires can be acquired;
– goods for import/export: if no neighbouring city needs a specific good, there is no
reason to set up its production for export. The production of a certain good should
be constrained by the fact that a trading route with possible customers is active and
by the total quantity of that good that can be absorbed by the market;
– production plants: to define the number of each production plant to build, we need
to constrain their number to the actual production (i.e. if the player does not
produce wine, she does not need wineries), and then to model the fact that a certain
good can be both a final product and a raw material. Moreover, the same good may
be imported from abroad suppliers. Finally, we must take into account the pro-
duction necessary to satisfy domestic consumption and export, plus the import;
– workforce: the production plants should be operated by the appropriate workforce.
We need two constraints: the first models the number of houses needed to host
the workforce, the second the correct number of workers for operating all the
plants (and warehouses) set up by the player.
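To make the structure of this model concrete, the following is a minimal sketch of a drastically reduced instance in Python/PuLP. All names and numbers (goods, prices, capacities, the 2:1 olive-to-oil conversion) are hypothetical placeholders rather than data from Caesar IV, and the models mentioned in this paper were actually written in AMPL.

```python
# Toy profit-maximization model in the spirit of the constraints listed above
# (hypothetical data; the paper's models were written in AMPL, not PuLP).
from pulp import LpProblem, LpMaximize, LpVariable, LpInteger, lpSum

GOODS = ["olives", "olive_oil"]                # hypothetical good mix
price = {"olives": 60, "olive_oil": 150}       # export price per unit
plant_cost = {"olives": 400, "olive_oil": 700} # fixed cost per plant
capacity = {"olives": 10, "olive_oil": 6}      # units produced per plant
demand = {"olives": 8, "olive_oil": 2}         # domestic consumption
budget = 2000                                  # starting budget for fixed costs

prob = LpProblem("city_profit", LpMaximize)
plants = {g: LpVariable(f"plants_{g}", 0, cat=LpInteger) for g in GOODS}
export = {g: LpVariable(f"export_{g}", 0) for g in GOODS}
produced = {g: capacity[g] * plants[g] for g in GOODS}

# objective: export revenues minus fixed plant costs
prob += lpSum(price[g] * export[g] - plant_cost[g] * plants[g] for g in GOODS)
# fixed costs are bounded by the starting budget
prob += lpSum(plant_cost[g] * plants[g] for g in GOODS) <= budget
# olives cover domestic demand, export, and the oil plants' raw material needs
prob += produced["olives"] >= (demand["olives"] + export["olives"]
                               + 2 * produced["olive_oil"])
# oil covers domestic demand plus export
prob += produced["olive_oil"] >= demand["olive_oil"] + export["olive_oil"]

prob.solve()
print({g: (plants[g].value(), export[g].value()) for g in GOODS})
```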
Our goal was to verify to what extent several problems presented in an appropriate
selection of video games can be described "ex-post" through formal systems, aimed at
supporting teaching activities in the field of applied mathematics (i.e. simulation and
OR). Obviously, this approach would arouse greater interest in the student/player if
the effort put into the modelling process actually produced better in-game results
(since players are always struggling to enhance their in-game performance). For these
reasons we have developed and tested (using AMPL, www.ampl.com) several of the
above-mentioned models (namely those based on Caesar IV, Armor Picross and
Domino). In the case of Armor Picross, for example, the simulation allowed the player
to find very quickly an optimal configuration that respects all the constraints (and, in
several cases, the game map can produce more than one optimal solution). However,
the most interesting results emerged from the simulations run on Caesar IV. In this
case, we ran many simulations, gradually reducing the starting budget, in order to
test which starting choices are most suitable when the budget is very limited. The
solutions supplied by the simulations were then adopted by a human player in the
game: as a result, the profit deriving from trading was higher than the profit that
a skilled player can typically obtain. As we showed, the models that can be built
starting from a real-time simulation video game are not trivial and require considerable
effort to be fully understood by undergraduate students. Some drawbacks are also
present: for example, while numerical data for several problems (e.g. maximizing the
city profit) are presented by the game in tabular form, gathering data for several other
problems can be a truly boring task. In this latter case, the simulated environment
could be used mainly as a support tool to introduce real-life problems and to build
their related models. As our preliminary computational campaign also shows, the
non-trivial impact of right decisions on in-game performance can be used to encourage
students' interest. Indeed, video games could be exploited as tools to convey the
teaching of applied mathematics, provided that they offer problems matching the
skills required by each specific teaching unit and student. This is, by the way,
coherent with one of the fundamental requirements of game design: a "fun" game
provides a challenge that closely matches the player's skills, drawing her into the
"flow" [30] (hence, the difficulty of the game should also grow at the same pace as the
player's ability [5, 30]).
The next steps in our research will focus on further developing the corpus of
problems derived from video games that can be exploited for OR teaching purposes,
and on testing their use with undergraduate students in Computer Science.
References
1. Kaiser, G., Blum, W., Borromeo Ferri, R., Stillman, G. (eds.): Trends in Teaching and
Learning of Mathematical Modelling, vol. 1. Springer, Netherlands (2011)
2. Adams, E.: Fundamentals of Game Design. New Riders, Indianapolis (2009)
3. Gardner, M.: My Best Mathematical and Logic Puzzles. Dover Recreational Math. Dover,
Mineola (1994)
4. Crawford, C.: On Game Design. New Riders Publishing, Thousand Oaks (2003)
5. Koster, R.: A Theory of Fun for Game Design. Paraglyph Press, Scottsdale (2005)
6. Fullerton, T.: Game Design Workshop: A Playcentric Approach to Creating Innovative
Games. Morgan Kaufmann, Burlington (2008)
7. DePuy, G.W., Taylor, D.: Using board puzzles to teach operations research. INFORMS
Trans. Educ. 7(2), 160–171 (2007)
8. Zyda, M.: From visual simulation to virtual reality to games. Computer 38(9), 25–32 (2005)
(IEEE Computer Society Press, Los Alamitos)
9. Prensky, M.: Don’t Bother Me Mom, I’m Learning!. Paragon House Publisher, St. Paul
(2005)
10. Abt, C.: Serious Games. The Viking Press, New York (1970)
11. Din, F.S., Calao, J.: The effects of playing educational videogames on kindergarten
achievement. Child Study J. 2, 95–102 (2001)
12. Durik, A.M., Harackiewicz, J.M.: Achievement goals and intrinsic motivation: coherence,
concordance, and achievement orientation. J. Exp. Soc. Psychol. 39, 378–385 (2003)
13. Ritterfeld, U., Weber, R.: Video games for entertainment and education. In: Vorderer,
P., Bryant, J. (eds.) Playing Video Games – Motives, Responses, and Consequences.
Lawrence Erlbaum, Mahwah (2006)
14. Squire, K., Jenkins, H.: Harnessing the Power of Games in Education. Insight 3(1), 5–33
(2003)
15. Susi, T., Johannesson, M., Backlund, P.: Serious games – an overview. Technical Report
HS-IKI-TR-07-001, University of Skovde, Sweden (2007)
16. Ripamonti, L.A., Peraboni, C.: Managing the design-manufacturing interface in VEs
through MUVEs: a perspective approach. Int. J. Comput. Integr. Manuf. 23(8–9), 758–776
(2010) (Taylor & Francis)
17. Michael, D., Chen, S.: Serious Games: Games that Educate, Train, and Inform. Thomson
Course Technology, Boston (2006)
18. Mitchell, A., Savill-Smith, C.: The Use of Computer and Video Games for Learning: A
Review of the Literature. Learning and Skills Development Agency, London (2004)
19. Wong, W. L., Shen, C., Nocera, L., Carriazo, E., Tang, F., Bugga, S., Narayanan, H.,
Wang, H., Ritterfeld, U.: Serious Video Game Effectiveness. In: ACE’07, Salzburg, Austria
(2007)
20. Ripamonti, L.A., Maggiorini, D.: Learning in virtual worlds: a new path for supporting
cognitive impaired children. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) FAC 2011.
LNCS, vol. 6780, pp. 462–471. Springer, Heidelberg (2011)
21. Caillois, R.: Man, Play, and Games. The Free Press, Glencoe (1961)
22. Huizinga, J.: Homo Ludens. The Beacon Press, Boston (1950)
23. Juul, J.: The game, the player, the world: looking for a heart of gameness. In: Copier, M.,
Raessens, J. (eds.) Level Up: Digital Games Research Conference Proceedings, pp. 30–45.
Utrecht University, Utrecht (2003)
24. Crawford, C.: The Art of Computer Game Design. McGraw-Hill/Osborne Media Berkeley,
California (1984)
25. Rollings, A., Adams, E.: Andrew Rollings and Ernest Adams on Game Design. New Riders,
California (2003)
26. Salen, K., Zimmerman, E.: Game design and meaningful play. In: Raessens, J., Goldstein,
J. (eds.) Handbook of Computer Game Studies, pp. 59–79. MIT Press, Cambridge (2005)
27. Miller, G.A.: The magical number seven, plus or minus two: some limits on our capacity
for processing information. Psychol. Rev. 63, 81–97 (1956)
28. Johnson, S.: Mind Wide Open: Your Brain and the Neuroscience of Everyday Life.
Scribner, New York (2004)
29. Caesar IV. http://caesar4.heavengames.com/
30. Csikszentmihalyi, M.: Flow: The Psychology of Optimal Experience. Harper & Row, New
York (1990)
CMF: A Combinatorial Tool to Find
Composite Motifs
1 Introduction
Transcription Factors (or simply factors) are particular proteins that bind to
short specific stretches of DNA (called TFBSs - Transcription Factor Binding
Sites) in the proximity of genes and participate in regulating the expression
of those genes [1]. The "language" of gene regulation is a complex one, since a
single factor regulates multiple genes, and a gene is usually regulated over time
by a cohort of cooperating factors. This network of interactions is still far from
being completely uncovered and understood even for well studied model species.
Groups of factors that concur in regulating the expression of groups of genes
form functional elements of such complex network and are likely to have TFBS
in the proximity of the regulated genes. TFBSs are often described by means of
Position Weight Matrices (PWMs) (see Sect. 2 for a quick recap).
Over the last two decades more than a hundred computational methods have
been proposed for the de-novo prediction “in silico” of single functional TFBSs
(often called single motifs, or simply motifs) [2–5]. Moreover, several hundreds
of validated PWMs for identifying TFBSs are available in databases such as
TRANSFAC [6] and JASPAR [7]. Observe that, although these PWMs have been
subject to validation in some form, the highly degenerate nature of TFBSs
implies that, when scanning sequences for PWM matches, false positive non-
functional matches are quite likely.
In this paper we address the problem of discovering groups of TFBSs that
are functional for a set of cooperating factors, given as input a set of PWMs
that describe the binding affinities of the single factors. This is known as the
Composite Motif Discovery problem in the literature [8]. For us, a composite
motif will be simply a set of TFBSs that are close by in a stretch of DNA,
i.e., we do not pose any constraints on the order or the spacing between the
participating TFBSs (but see [9] for other possible models).
The composite motif discovery problem has been the subject of a number
of studies, and we refer to [10] for a survey. In addition we observe that the
phenomenon of clustering of TFBS is used also by tools that try to predict the
location and composition of Cis-Regulatory Modules (see, e.g., [11]), which then
address composite motif discovery problems of some sort.
In this paper we present a new tool (CMF) for composite motif discovery
that adopts a two-stage approach: first it looks for candidate single TFBSs in
the given sequences, and then it uses them to devise the prospective composite
motifs using mainly combinatorial techniques. CMF borrows the idea of the
two-stage approach from a previous tool we developed for the related problem of
structured motif detection [12]. Using the data set and the published results in
[8,13] we can readily compare CMF's performance against the eight state-of-the-art
methods listed in [8] and three other more recent methods [13–15], showing
that our tool is highly competitive with the others.
The detection of TFBSs and composite motifs is a complex and challenging
problem (witness the wide spectrum of approaches) which is far from having a
satisfactory solution [16]; thus there is ample scope for improvements from both
the modeling and the algorithmic points of view. CMF introduces several
new key ideas within a combinatorial approach which, on the one hand, have been
shown empirically to be valid on challenging benchmarks, and, on the other hand,
may prove useful in developing future, more advanced solutions.
The rest of the paper is organized as follows: Sect. 2 introduces preliminary
notions and definitions, Sect. 3 describes the algorithm adopted by CMF and,
finally, Sect. 4 reports experimental results.
2 Preliminary Notions
In this Section we introduce the fundamental notions used in the description of
the algorithm that forms the computational core of CMF.
Given the DNA alphabet D = {A, C, G, T}, a short word w ∈ D* is called an
oligonucleotide, or simply oligo (typically |w| ≤ 20), and we say that w occurs
in S ∈ D* if and only if w is a substring of S.
Note that we have adopted the string representation for multisets. We next
consider pairs ⟨M, n⟩, where M is a multiset and n is a positive integer, and sets
P of such pairs which only include maximal pairs. That is, if ⟨M, n⟩ ∈ P then
there is no other pair ⟨M̄, n̄⟩ in P such that M̄ ⊇ M and n̄ ≥ n, where inclusion
takes multiplicity into account.
We define special union and intersection operations, denoted by ⊔ and ⊓,
over sets of maximal pairs. The definition of ⊔ is easy:

P ⊔ Q = {p : p is maximal in P ∪ Q}

Then, for arbitrary sets P1 = {p_i^(1)}_{i=1,...,h} and P2 = {p_j^(2)}_{j=1,...,k}:

P1 ⊓ P2 = ⊔_{i,j} (p_i^(1) ⊓ p_j^(2)).
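As an aid to the notation, the following is a minimal Python sketch of the parts that are fully specified above, namely maximality and the special union ⊔, with multisets kept in their sorted-string representation; the pairwise intersection underlying ⊓ is not spelled out here, so it is omitted.

```python
# Maximal pairs and the special union over sorted-string multisets.
from collections import Counter

def dominates(p, q):
    """True if p = (M1, n1) dominates q = (M2, n2): M1 contains M2
    (with multiplicities) and n1 >= n2, for two distinct pairs."""
    (M1, n1), (M2, n2) = p, q
    return p != q and n1 >= n2 and not Counter(M2) - Counter(M1)

def maximal(pairs):
    return {p for p in pairs if not any(dominates(q, p) for q in pairs)}

def special_union(P, Q):        # the special union: maximal pairs of P | Q
    return maximal(P | Q)

print(special_union({("ab", 2), ("a", 1)}, {("abc", 1)}))
# ("a", 1) is dominated by ("ab", 2); ("ab", 2) and ("abc", 1) are incomparable
```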
3 Algorithm
CMF's main operation mode is composite motif discovery in a set S =
{S1, . . . , SN} of DNA sequences, using a collection of PWMs.²
PWMs can be passed to CMF either in a single file or in multiple files. In the latter
case, CMF assumes that each file contains PWMs for only one given factor.
Actually, when the input set is prepared using matrices taken from an annotated
repository (e.g., the TRANSFAC database [20]), assuming the knowledge of the
corresponding factors is not an artificial scenario. However, here we describe the
main steps of CMF's operation mode when the input is a single PWM file,
namely:
1. (Optional) PWM clustering, to organize the matrices in classes believed to
belong to different factors;
2. Discretization, to detect PWM matches in the input sequences;
3. Group and composite motif finding.
² Even if not taken into consideration in this paper, CMF is also able to run a number
of third-party motif discovery tools to "synthesize" PWMs.
By default, CMF assumes that the PWMs in the input file correspond to different
factors, and hence it does not perform any clustering. However, in many cases
the number of matrices available, which describe the binding affinities of the
factors involved in the upstream experimental protocol, is much larger than the
number of such factors. If the latter information is available to the user, then
clustering may be highly useful both to improve the accuracy and to reduce the
group finding complexity. Another circumstance in which clustering is advisable
(not discussed here) is when the input matrices are produced by third-party
motif discovery tools.
To perform the clustering, CMF first builds a weighted adjacency graph
whose nodes are the matrices and whose edges are the pairs (M1, M2) such that
the similarity³ between M1 and M2 is above a given threshold. Then, CMF
executes a single-linkage partitioning step on the graph vertices; finally, it
identifies the dense cores in each set of the partition, via pseudo-clique
enumeration [22], returning them as the computed clusters.
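As an illustration, single linkage over the thresholded similarity graph amounts to computing connected components, which the sketch below does with a union-find structure; `sim` stands in for the RSAT-based similarity of footnote 3, and the pseudo-clique core extraction [22] is omitted.

```python
# Single-linkage partitioning of n PWMs via connected components of the
# thresholded similarity graph (union-find with path halving).
def single_linkage(sim, n, threshold):
    """sim: function (i, j) -> similarity score; returns a list of clusters."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if sim(i, j) >= threshold:      # edge in the adjacency graph
                parent[find(i)] = find(j)   # merge the two components
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```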
The experiments described in Sect. 4 suggest that, when the PWM file mainly
includes good matrices corresponding to possibly different factors, then even a
simple clustering algorithm like the one mentioned above is able to correctly
separate them into the “right” groups (factors). In general, however, performing
a good partitioning of the input matrices when the fraction of "noisy" PWMs
increases (as is the case when CMF is used downstream of de-novo motif discovery
software tools) is one of the major issues left open for further work.
3.2 Discretization
Even with the most accurate PWM description of a motif, the problem of
determining the "true" motif matches in the input sequences is anything but a
trivial task. Whatever the algorithm adopted, there is always the problem of
setting a threshold τ to distinguish matches from non-matches, a choice that
may have a dramatic impact on the tool's performance.
In general, low thresholds improve sensitivity, while high thresholds may
improve the rate of positive predicted values (PPVs). A reasonable strategy
is to moderately privilege sensitivity during discretization, with the hope of
increasing the positive predicted rate thanks to the combinatorial effect of close-
by matches. Indeed, keeping initially low thresholds may give the benefit of not
filtering out low-score matches.⁴ On the other hand, complexity issues demand
that the number of possible combinations of motif matches, from which the
composite motifs should emerge, does not explode. Now, for factors with many
matrices, low thresholds may incur a very high number of matches, and these in
turn affect the number of potential composite motifs.
³ Currently, CMF invokes RSAT's utility compare-matrices for this purpose [21],
which uses pairwise normalized correlation.
⁴ Sometimes referred to as weak signals in the literature.
In light of the above arguments, we formulate the following general and simple
qualitative criterion: assign factors (motif classes) with many/few matrices
a high/low threshold. All the experiments of Sect. 4 were performed with fixed
threshold values. Although these can be varied (in the configuration file, hence in
a way completely transparent to the typical user), the overall good results suggest
that the above criterion may have some merit, to be further investigated.
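For concreteness, the following sketch shows one common way such a discretization can be realized: a log-odds scan of each sequence against a PWM under a uniform background, keeping positions that score at least τ. The scoring details here are an assumption for illustration, not CMF's documented internals.

```python
# Log-odds PWM scan: report positions scoring at least tau.
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pwm_matches(seq, pwm, tau):
    """pwm: list of columns, each a dict base -> probability (pseudocounted).
    Returns (position, score) pairs whose log-odds score is >= tau."""
    w = len(pwm)
    hits = []
    for i in range(len(seq) - w + 1):
        score = sum(math.log(pwm[j][seq[i + j]] / BACKGROUND[seq[i + j]])
                    for j in range(w))
        if score >= tau:
            hits.append((i, score))
    return hits

# toy 3-bp PWM; a permissive tau privileges sensitivity, as discussed above
pwm = [{"A": .7, "C": .1, "G": .1, "T": .1},
       {"A": .1, "C": .1, "G": .7, "T": .1},
       {"A": .1, "C": .7, "G": .1, "T": .1}]
print(pwm_matches("TTAGCAGCTT", pwm, tau=2.0))
```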
3. Set G1 = P1.
4. For i = 2, . . . , N compute Gi = Gi−1 ⊓ Pi.
5. Discard from GN all the pairs ⟨M, n⟩ such that n < ⌈q · N⌉.
6. If the (remaining) multisets in GN include all the letters of R, or W = Wr
and q = qs, then set G = {M : ⟨M, n⟩ ∈ GN} and return G.
7. In alternate order (whenever possible) advance W or q to the next value and
jump to step 2.
The above general description has only explanatory purposes, since a direct
implementation would be highly inefficient. For instance, when relaxing the
quorum value, step 2 can be avoided, since the multisets M_j^(i) have already been
computed. On the other hand, the pairwise intersections of step 4 can be
performed quite efficiently thanks to the character-sorted string representation of
multisets of factors.
By the properties of the ⊔ and ⊓ operators, the pairs ⟨M, n⟩ included in
GN are maximal, with n satisfying the last fixed quorum value. Note, however,
that even with the weakest parameter values (i.e., widest window and smallest
The cost of the bare CMF algorithm is dominated by the composite motif
finding step or, more precisely, by the combinatorial group-finding subprocess. This
is easily seen to be exponential in the length of the longest group g (regarded
as a string over R) in any of the initial sets Mi's, simply because g may have
an exponential number of maximal subgroups that also satisfy the quorum
constraint. In turn, the length of g may be of the order of the composite motif
width and hence of the sequence length. At the other extreme, there is the situation
where we only have two (or few) factors and look for sites where both factors
bind (as for the TRANSCompel datasets of Sect. 4). In this case the cost of the
subprocess is linear in the number of sequences.
When combinatorial group finding is fast (as in all the experiments we have
performed) the computational bottlenecks move to other parts of the code, i.e.,
outside of the software module that implements the core CMF algorithm. In
particular, the computation of PWM pairwise similarities takes quadratic time
in the number of PWMs, which can be pretty high in a number of scenarios.
4 Experiments
4.1 Datasets
We use the TRANSCompel as well as the liver and muscle datasets presented in
[8]. The TRANSCompel benchmark includes ten datasets corresponding to as
many composite motifs, each consisting of two binding sites for different factors.
In [8], all the matrices corresponding to the same factor were grouped to form
an "equivalence set" and treated as if they were one. These matrices form what is
called, in the assessment paper, the noise 0 benchmark. To simulate conditions in
which input data are fuzzier, we also consider the so-called noise 50 benchmark
presented in [8], in which each dataset is composed of an equal number of good
and random (i.e., taken at random from TRANSFAC) matrices.
Two additional benchmarks are discussed in [8], namely liver and mus-
cle, having very different characteristics from the previous ones. Liver includes
sequences with up to nine binding sites from four different factors, while muscle
includes sequences with up to eight sites from five factors.
Statistics for the tools evaluated in [8] were downloaded from the site
http://tare.medisin.ntnu.no/composite/composite.php. Regarding COMPO,
we computed the statistics for the liver and muscle datasets starting from
the prediction files made available by the authors at http://tare.medisin.ntnu.no/compo/.
For the TRANSCompel datasets (noise 0 and noise 50), we directly used the
statistics provided at the same address.
4.3 Results
In all the experiments, CMF was run with a fixed configuration file, with W =
{50, 75, 100, 125, 150} and q = {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1} (see Sect. 3.3).
Nucleotide level analysis. Table 1 shows the results obtained by CMF compared
to eleven competitor algorithms on the whole collection of twelve datasets
(noise 0, liver, and muscle). The results suggest that CMF is indeed competitive
with other state-of-the-art tools. In an attempt to assess the significance of the
results of Table 1, we first performed a Friedman non-parametric test (see, e.g.,
[32]) involving eleven tools (all but CORECLUST, because of the limited
availability of homogeneous data with which to perform the comparisons). As can
be easily argued, here the null hypothesis (i.e., that all the considered algorithms
behave similarly, and hence that the average ranks over all the datasets are
essentially the same) can be safely rejected, with a P-value around 2.2 · 10⁻⁹.
Table 1. CC results for noise 0, liver, and muscle data, with best figures in bold-face.
CB = Cluster-Buster, MS = ModuleSearcher, CMA = Composite Module Analyst,
CM = CisModule, C = CORECLUST.
AP1-Ets 0.52 0.19 0.24 0.00 0.11 0.30 0.20 0.15 0.22 −0.0 0.37
AP1-NFAT 0.11 0.06 0.04 0.00 0.00 0.05 0.14 −0.01 0.15 −0.02 0.14
AP1-NFkB 0.76 0.59 0.49 0.19 0.36 0.29 0.26 0.35 0.55 0.05 0.18
CEBP-NFkB 0.74 0.70 0.72 0.45 0.56 0.56 0.60 0.36 0.60 −0.03 0.38
Ebox-Ets 0.59 0.55 0.16 0.26 0.44 0.20 0.23 0.14 0.18 0.05 0.15
Ets-AML 0.49 0.42 0.30 0.07 0.31 0.38 0.26 0.23 0.33 0.03 0.27
IRF-NFkB 0.92 0.73 0.77 0.62 0.91 0.85 0.41 0.41 0.69 0.04 0.57
NFkB-HMGIY 0.26 0.31 0.35 0.10 0.30 0.40 0.23 0.07 0.15 −0.03 0.13
PU1-IRF 0.92 0.28 0.16 0.27 0.00 0.43 0.16 0.17 0.24 −0.01 0.21
Sp1-Ets 0.20 0.05 0.09 0.20 0.00 0.00 0.13 0.19 0.15 0.02 0.09
Liver 0.49 0.57 0.59 0.31 0.51 0.42 0.50 0.48 0.36 −0.01 0.33
Muscle 0.56 0.52 0.41 0.36 0.50 0.46 0.30 0.24 0.46 0.29 0.37 0.56
We then performed the post hoc tests associated with the Friedman statistics,
considering CMF as the newly proposed method to be compared against the
other ten tools. Table 2 shows the P-values of the ten comparisons, adjusted
(according to the Hochberg step-up procedure [32]) to take into account possible
type-I errors in the whole set of comparisons. For nine competing algorithms we
obtained figures below the critical 0.05 threshold; only in the case of COMPO can
we not reject with high confidence the null hypothesis, namely that the observed
average ranks of the two algorithms (CMF and COMPO) differ by chance
only.
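The statistical protocol can be reproduced along the following lines with standard Python libraries (a sketch only: the paper does not state which software was used, pairwise Wilcoxon signed-rank tests merely stand in for its unspecified post hoc comparisons, and the score matrix is a random placeholder).

```python
# Friedman test over per-dataset scores, then post hoc comparisons of the
# first tool against the others, Hochberg-adjusted.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
scores = rng.random((11, 12))        # 11 tools x 12 datasets (placeholder)
cmf = scores[0]                      # row 0 stands in for CMF

stat, p = friedmanchisquare(*scores) # null: all tools rank similarly
print(f"Friedman P-value: {p:.3g}")

raw = [wilcoxon(cmf, other).pvalue for other in scores[1:]]
reject, adjusted, _, _ = multipletests(raw, alpha=0.05,
                                       method="simes-hochberg")
print(list(zip(adjusted.round(3), reject)))
```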
Table 2. Adjusted P-values for post hoc comparisons of CMF against the other 10 tools:
MS = ModuleSearcher, CMA = Composite Module Analyst, CM = CisModule.
Table 3. Further nucleotide level comparisons between CMF and COMPO. We report
here the results most favorable to COMPO, as the authors provide three different files
with predictions for each dataset.
Table 4. Motif level results for CMF and COMPO on the noise 0 dataset
⁶ Note that in the already cited paper by Tompa et al. [31], true negative predictions
at the motif level are not considered.
Table 5. CMF motif level statistics for the muscle and liver datasets
Table 4 reports the performance at motif level obtained using our computed
CMF predictions and the predictions made by COMPO on the TRANSCompel
datasets. The results are essentially similar, with a slightly better Performance
Coefficient (the sole comprehensive measure computed at motif level) exhibited
by CMF. The results suggest once more that our software is competitive with
current state-of-the-art tools.
Finally, Table 5 reports CMF statistics on the muscle and liver datasets. We
do not include a comparison against COMPO here since the way to correctly
and fairly interpret COMPO's predictions is not completely clear to us. First of
all, the authors present three different prediction sets, obtained under different
configuration runs. Secondly, all the prediction files contain multiple identical
predictions, which negatively influences the PPV counts.
5 Conclusions
In this paper we have presented CMF, a novel tool for composite motif detection,
a computational problem which is well known to be very difficult. Indeed, to
date, no available software for (simple or composite) motif discovery can be
clearly identified as the "best one" under all application settings. Knowing this,
we are also aware that more comparisons are required, in different experimental
frameworks, before general conclusions can be drawn about the competitiveness
of CMF.
However, we think that some interesting findings have emerged from this
work, all related to the power of simple motif combinations. First, the good
results exhibited by CMF have been obtained without using any sophisticated
statistical filtering criteria; the combination of the "right" simple sites was
often strong enough to emerge from a huge set of potentially active motif
clusters. Secondly, the conceptually simple CMF architecture, based on a two-
stage approach to composite motif finding (i.e., first detect simple motifs, then
combine them to form clusters of prospective functional motifs), proved to be
competitive against other, more sophisticated approaches (see also [12]). Thirdly,
lowering the thresholds that "define" (in silico) the DNA occupancy by a
transcription factor can be an appropriate strategy, and one that can be kept
hidden from the user.
On the other hand, the same issues outlined in the preceding paragraph
suggest possible directions for improving CMF's performance. For instance,
incorporating a statistical filtering may enhance the PPV rate of the prospective
composite motifs devised by simple site combinations. However, we think that
References
1. Davidson, E.H.: The Regulatory Genome: Gene Regulatory Networks in Develop-
ment and Evolution, 1st edn. Academic Press, San Diego (2006)
2. Pavesi, G., Mauri, G., Pesole, G.: In silico representation and discovery of tran-
scription factor binding sites. Brief. Bioinform. 5, 217–236 (2004)
3. Sandve, G.K., Drabløs, F.: A survey of motif discovery methods in an integrated
framework. Biol. Direct. 1, 11 (2006)
4. Häußler, M., Nicolas, J.: Motif discovery on promotor sequences. Research report
RR-5714, INRIA (2005)
5. Zambelli, F., Pesole, G., Pavesi, G.: Motif discovery and transcription factor bind-
ing sites before and after the next-generation sequencing era. Brief. Bioinf. (2012)
6. Wingender, E., et al.: Transfac: a database on transcription factors and their DNA
binding sites. Nucl. Acids Res. 24, 238–241 (1996)
7. Sandelin, A., Alkema, W., Engström, P.G., Wasserman, W.W., Lenhard, B.: Jas-
par: an open-access database for eukaryotic transcription factor binding profiles.
Nucl. Acids Res. 32, 91–94 (2004)
8. Klepper, K., Sandve, G., Abul, O., Johansen, J., Drabløs, F.: Assessment of com-
posite motif discovery methods. BMC Bioinform. 9, 123 (2008)
9. Sinha, S.: Finding regulatory elements in genomic sequences. Ph.D. thesis, Univer-
sity of Washington (2002)
10. Van Loo, P., Marynen, P.: Computational methods for the detection of cis-
regulatory modules. Brief. Bioinform. 10, 509–524 (2009)
11. Ivan, A., Halfon, M., Sinha, S.: Computational discovery of cis-regulatory modules
in drosophila without prior knowledge of motifs. Genome Biol. 9, R22 (2008)
12. Federico, M., Leoncini, M., Montangero, M., Valente, P.: Direct vs 2-stage
approaches to structured motif finding. Algorithms Mol. Biol. 7, 20 (2012)
13. Sandve, G., Abul, O., Drablos, F.: Compo: composite motif discovery using discrete
models. BMC Bioinform. 9, 527 (2008)
14. Hu, J., Hu, H., Li, X.: Mopat: a graph-based method to predict recurrent cis-
regulatory modules from known motifs. Nucl. Acids Res. 36, 4488–4497 (2008)
15. Nikulova, A.A., Favorov, A.V., Sutormin, R.A., Makeev, V.J., Mironov, A.A.:
Coreclust: identification of the conserved CRM grammar together with prediction
of gene regulation. Nucl. Acids Res. 40, e93 (2012). doi:10.1093/nar/gks235
16. Vavouri, T., Elgar, G.: Prediction of cis-regulatory elements using binding site
matrices - the successes, the failures and the reasons for both. Curr. Opin. Genet.
Develop. 15, 395–402 (2005)
17. Kel, A., Gößling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O., Wingen-
der, E.: Matchtm: a tool for searching transcription factor binding sites in DNA
sequences. Nucl. Acids Res. 31, 3576–3579 (2003)
18. Chen, Q.K., Hertz, G.Z., Stormo, G.D.: Matrix search 1.0: a computer program
that scans DNA sequences for transcriptional elements using a database of weight
matrices. Comp. Appl. Biosci.: CABIOS 11, 563–566 (1995)
19. Prestridge, D.S.: Signal scan: a computer program that scans DNA sequences
for eukaryotic transcriptional elements. Comp. Appl. Biosci.: CABIOS 7, 203–206
(1991)
20. Matys, V., et al.: TRANSFAC and its module TRANSCompel: transcriptional gene
regulation in eukaryotes. Nucl. Acids Res. 34, D108–D110 (2006)
21. Thomas-Chollier, M., et al.: RSAT: regulatory sequence analysis tools. Nucl. Acids
Res. 36, W119–W127 (2008)
22. Uno, T.: Pce: Pseudo clique enumerator, ver. 1.0 (2006)
23. Zhou, Q., Wong, W.H.: Cismodule: De novo discovery of cis-regulatory modules
by hierarchical mixture modeling. Proc. Natl. Acad. Sci. 101, 12114–12119 (2004)
24. Frith, M.C., Hansen, U., Weng, Z.: Detection of cis-element clusters in higher
eukaryotic DNA. Bioinformatics 17, 878–889 (2001)
25. Frith, M.C., Li, M.C., Weng, Z.: Cluster-Buster: finding dense clusters of motifs in
DNA sequences. Nucl. Acids Res. 31, 3666–3668 (2003)
26. Kel, A., Konovalova, T., Waleev, T., Cheremushkin, E., Kel-Margoulis, O., Win-
gender, E.: Composite module analyst: a fitness-based tool for identification of tran-
scription factor binding site combinations. Bioinformatics 22, 1190–1197 (2006)
27. Bailey, T.L., Noble, W.S.: Searching for statistically significant regulatory modules.
Bioinformatics 19, ii16–ii25 (2003)
28. Aerts, S., Van Loo, P., Thijs, G., Moreau, Y., De Moor, B.: Computational detec-
tion of cis -regulatory modules. Bioinformatics 19, ii5–ii14 (2003)
29. Johansson, Ö., Alkema, W., Wasserman, W.W., Lagergren, J.: Identification of
functional clusters of transcription factor binding motifs in genome sequences: the
mscan algorithm. Bioinformatics 19, i169–i176 (2003)
30. Sinha, S., van Nimwegen, E., Siggia, E.D.: A probabilistic method to detect regu-
latory modules. Bioinformatics 19, i292–i301 (2003)
31. Tompa, M., et al.: Assessing computational tools for the discovery of transcription
factor binding sites. Nat. Biotechnol. 23, 137–144 (2005)
32. Garcı́a, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests
for multiple comparisons in the design of experiments in computational intelligence
and data mining: experimental analysis of power. Inf. Sci. 180, 2044–2064 (2010)
Hill-Climbing Behavior on Quantized
NK-Landscapes
1 Introduction
Basic iterative improvement methods like climbers are generally used as components
of more sophisticated local search techniques or metaheuristics. A climber
reaches a local optimum by iteratively improving a single solution
with local modifications. Although most metaheuristics use climbers or variants
as an intensification mechanism, they mainly focus on determining how to
escape local optima. Nevertheless, several important questions have to be
considered when designing a climber. Usually, the design effort of a local
search algorithm focuses on the neighborhood structure as well
as on how to build the solution initiating the search. However, there are several
questions which are regularly considered during the design process, but not
really investigated empirically or theoretically. Among them, one can identify
two main issues. First, the choice of pivoting rule: do different pivoting rules
lead to local optima of similar quality with comparable computational effort? To
the best of our knowledge, there is no real consensus on the benefit of using a
best-improvement strategy rather than a first-improvement one, or vice versa.
Second, the neutral move policy: should we restrict the use of neutral moves
to escaping local optima? The use of neutral moves during the climb should
be experimentally analyzed. In particular, it is contradictory that traditional
climbers only allow strictly improving moves, while a derived search strategy,
e.g. simulated annealing, systematically accepts neutral moves.
Most local search and evolutionary computation contributions focus
on the design and evaluation of advanced and original search mechanisms.
However, the aforementioned elementary components are rarely discussed in
the experimental analysis.
Pivoting Rule. The best-improvement strategy (or greedy hill-climbing)
consists in selecting, at each iteration, a neighbor which achieves the best fitness.
This implies generating the whole neighborhood at each step of the search,
unless an incremental evaluation of all neighbors can be performed. On the
contrary, the first-improvement strategy accepts the first evaluated neighbor which
satisfies the moving condition. This avoids the systematic generation of the entire
neighborhood and allows more conceptual options.
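The two pivoting rules can be sketched as follows for the one-flip neighborhood over bit strings, where `fitness` is any function from bit tuples to numbers; randomizing the evaluation order in first improvement is one of the design options discussed below.

```python
# Best-improvement vs. first-improvement climbers on bit strings.
import random

def neighbors(x):
    for i in range(len(x)):                    # one-flip neighborhood
        yield x[:i] + (1 - x[i],) + x[i + 1:]

def best_improvement(x, fitness):
    while True:
        best = max(neighbors(x), key=fitness)  # scan the whole neighborhood
        if fitness(best) <= fitness(x):
            return x                           # strict local optimum
        x = best

def first_improvement(x, fitness):
    improved = True
    while improved:
        improved = False
        nbrs = list(neighbors(x))
        random.shuffle(nbrs)                   # random evaluation order
        for y in nbrs:
            if fitness(y) > fitness(x):        # accept the first improvement
                x, improved = y, True
                break
    return x
```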
Neutral Move Policy. A basic hill-climbing algorithm does not allow neutral
moves (i.e. moves to neutral neighbors) during the search, and only performs
improving moves until reaching a local optimum. Neutral moves
can be considered for escaping local optima (neutral perturbation, NP) when the
fitness landscape contains a substantial proportion of neutral transitions (on
smooth landscapes). Another variant, called stochastic hill-climbing, can accept
indifferently neutral or improving neighbors throughout the search, even before
reaching a local optimum. It is not obvious to determine the influence of
the neutral move policy on the quality of the configurations reached. However,
it is interesting to note that the more advanced simulated annealing algorithm,
which allows some deteriorating moves during the search, systematically accepts
neutral moves.
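A stochastic hill-climber in the above sense is obtained by relaxing the acceptance test to non-deteriorating moves, as in the sketch below; the explicit move budget is an implementation choice of ours (plateaus allow endless neutral wandering), not a detail from the paper.

```python
# Stochastic hill-climbing: accept the first improving *or* neutral neighbor.
import random

def stochastic_hill_climbing(x, fitness, neighbors, max_steps=10_000):
    for _ in range(max_steps):
        nbrs = list(neighbors(x))
        random.shuffle(nbrs)
        for y in nbrs:
            if fitness(y) >= fitness(x):   # neutral moves are accepted too
                x = y
                break
        else:
            return x                       # strict local optimum, no move left
    return x
```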
There are other aspects which could be discussed. For instance, the
neighborhood evaluation can be made with or without replacement, and its generation
order can be deterministic or random. Nevertheless, these choices greatly
depend on the problem under study and are not discussed here.
The fitness of a configuration x = (x_1, . . . , x_N) ∈ {0, 1}^N in an NK-landscape
is the average of N fitness contributions:

f(x) = (1/N) Σ_{i=1}^{N} c_i(x_i, x_{i1}, . . . , x_{iK}),

where c_i : {0, 1}^{K+1} → [0, 1) defines the component function associated with
each variable x_i, i ∈ {1, . . . , N}, and where K < N.
NK-landscape instances are determined both by the (K + 1)-tuples
(x_i, x_{i1}, . . . , x_{iK}) and by the N · 2^{K+1} c_i result values, corresponding to a
fitness contribution matrix C whose values are randomly generated in [0, 1). The
usual precision of random values implies that plateaus are almost absent in
NK-landscapes.
Quantized NK-landscapes (NKq) introduce neutrality by limiting the precision of
the c_i result values. Indeed, limiting their possible values increases the number of
neutral neighbors. Thus, NKq implies a third parameter q ≥ 2 which specifies the
size of the codomain of the c_i functions. The maximal degree of neutrality is reached
when q = 2 (C is then a binary matrix), and decreases as q increases.
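A minimal NKq construction consistent with this description is sketched below; the exact quantization scheme is an assumption (contributions drawn uniformly from the q evenly spaced values 0, 1/q, ..., (q−1)/q in [0, 1)).

```python
# Quantized NK-landscape: N variables, epistasis K, codomain size q.
import random

def make_nkq(N, K, q, seed=0):
    rng = random.Random(seed)
    # each variable depends on itself plus K random other variables
    links = [[i] + rng.sample([j for j in range(N) if j != i], K)
             for i in range(N)]
    # contribution matrix C: N rows of 2^(K+1) quantized values in [0, 1)
    table = [[rng.randrange(q) / q for _ in range(2 ** (K + 1))]
             for _ in range(N)]
    def fitness(x):
        total = 0.0
        for i in range(N):
            idx = int("".join(str(x[j]) for j in links[i]), 2)
            total += table[i][idx]
        return total / N
    return fitness

f = make_nkq(N=16, K=2, q=2)       # q = 2: maximal neutrality
print(f(tuple(0 for _ in range(16))))
```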
Empirical Analysis
Experimental results are given in Tables 1 and 2, which focus respectively on the
NK and NKq instances. For each climber/instance pair, we report the average
fitness of the 10,000 resulting configurations. For each instance, the best average
value appears in bold. Moreover, we indicate in grey the methods which are not
statistically outperformed by any other method (w.r.t. the Mann-Whitney test).
Table 1. Climbers' results on NK-landscapes. Only the two variants with no neutral
moves are reported.
the neutral move policy being adopted. Anyway, this table provides us with a
significant piece of information when comparing the best-improvement and first-
improvement pivoting rules. Best improvement statistically outperforms first
improvement when K ∈ {1, 2}, while first improvement appears more efficient
as K increases. In other words, best improvement is well suited to exploring
smooth landscapes, whereas first improvement seems better adapted to exploring
rugged ones.
The NKq experiments lead to relevant outcomes. One can see in Table 2
that neutral moves are necessary to climb landscapes containing even a small
level of neutrality. Indeed, basic climbers are always statistically outperformed
by the others. Moreover, this table emphasizes significant differences between the
three strategies allowing neutral moves. First, stochastic climbers reach the best
results on most instances, especially on more rugged and/or neutral landscapes
(high K, low q). This is particularly interesting since, to our knowledge, basic
policies – with or without neutral perturbations – are more traditionally used
when designing metaheuristics. However, a best-improvement strategy combined
with neutral perturbations remains suitable on smooth landscapes, especially
with the lowest levels of neutrality. Globally, one observes that the search space size
given by parameter N does not influence the overall tendency of the results,
although efficiency differences between policies tend to be more significant for
larger search spaces.
Let us point out that in our original experimental analysis, two other models
of neutrality were also experimented with: probabilistic NK-landscapes [1] as well as
rounded NK-landscapes, designed by simply rounding fitnesses. These experiments
also lead to significant outcomes, which are not reported here due to lack of
space.
4 Conclusion
Climbers are often considered as basic components of advanced search
methods. However, the influence of their design choices is rarely discussed through
advanced studies. In this paper we have focused on the capacity of different
hill-climbing versions to reach good configurations in various landscapes. In
particular, we compared the first- and best-improvement strategies as well as
three different neutral move policies. In order to provide an empirical analysis
on a large panel of representative instances, we used NK-landscapes with
different sizes and ruggedness levels. On landscapes with no neutrality, we show that
best improvement performs better on smooth landscapes, while first improvement
is well suited to more rugged ones. To evaluate the impact of neutral move
policies, we used quantized NK-landscapes (NKq) as a model of neutrality. First,
one observes that stochastic hill-climbing globally reaches better configurations
than the other variants. In other words, at each step of the search, it makes sense to
perform the first non-deteriorating move instead of extending the neighborhood
evaluation.
Perspectives of this work mainly include the extension of this analysis to
Iterated Local Search methods [4]. Indeed, several questions arise when
considering iterated versions. First, we have to determine to what extent efficient
climbers can improve iterated searches. Then, a similar study performed in an
iterated context will determine whether the overall influence of structural choices
remains unchanged.
References
1. Barnett, L.: Ruggedness and neutrality - the NKp family of fitness landscapes. In:
Artificial Life VI: Sixth International Conference on Artificial Life, pp. 18–27. MIT Press
(1998)
2. Hoos, H., Stützle, T.: Stochastic Local Search: Foundations & Applications. Morgan
Kaufmann Publishers Inc., San Francisco (2004)
3. Kauffman, S.A.: The Origins of Order: Self-Organization and Selection in Evolution,
1st edn. Oxford University Press, USA (1993)
4. Lourenço, H.R., Martin, O., Stützle, T.: Iterated local search. In: Glover, F., Kochen-
berger, G. (eds.) Handbook of Metaheuristics. International Series in Operations
Research and Management Science, vol. 57, pp. 321–353. Kluwer Academic, Nor-
well (2002)
5. Newman, M.E.J., Engelhardt, R.: Effects of selective neutrality on the evolution of
molecular species. Proc. Roy. Soc. B 265(1403), 1333–1338 (1998)
6. Ochoa, G., Verel, S., Tomassini, M.: First-improvement vs. best-improvement local
optima networks of NK landscapes. In: Proceedings of the 11th International Con-
ference on Parallel Problem Solving From Nature, Krakow Pologne, pp. 104–113,
September 2010
Neighborhood Specification for Game Strategy
Evolution in a Spatial Iterated Prisoner’s
Dilemma Game
1 Introduction
behavior in the IPD (Iterated Prisoner’s Dilemma) game strongly depended on the
choice of a representation scheme for game strategy encoding. They examined a
number of different representation schemes, and obtained totally different results from
different settings. In each run in their computational experiments, a single represen-
tation scheme was assigned to all agents to examine the relation between the choice of
a representation scheme and the evolution of cooperative behavior.
Simultaneous use of two representation schemes was examined in Ishibuchi et al.
[4]. One of the two schemes was randomly assigned to each agent. When the IPD
game was not played between agents with different schemes, totally different results
were obtained from each scheme. However, similar results were obtained from each
scheme when the IPD game was played between agents with different schemes. The
simultaneous use of different representation schemes was also examined in other areas
of evolutionary computation (e.g., see [5] for the use in island models).
It has been demonstrated in many studies on the IPD game that spatial structures
of agents have a large effect on the evolution of cooperative behavior (e.g., [6–9]). A
two-dimensional grid-world with a neighborhood structure is often used in such a
spatial IPD game. An agent in each cell plays the IPD game against its neighbors (i.e.,
local interaction through the IPD game). A new strategy for each agent is generated
from current strategies of the agent and its neighbors (i.e., local mating). Local
interaction and local mating usually use the same neighborhood structure. The use of
different neighborhood structures was examined in Ishibuchi and Namikawa [10].
Motivated by the above-mentioned studies on representation schemes [1–5] and
spatial structures [6–10], we examined a spatial IPD game model with a number of
different representation schemes in our former study [11]. One of four representation
schemes was randomly assigned to each cell in a two-dimensional grid-world. An
example of such a random assignment is shown for a two-dimensional 11 × 11 grid-
world in Fig. 1, where the von Neumann neighborhood with four neighbors is also
illustrated. In Fig. 1, the center cell of the illustrated von Neumann neighborhood has
representation scheme D. However, none of its four neighbors has representation
scheme D. Thus no strategies of the four neighbors can be used to generate a new
strategy for the center cell. As a result, only the current strategy of the center cell is
used to generate its new strategy (i.e., new strategies are generated by mutation).
In our former study [11], we used a larger neighborhood when a cell had no
neighbors with the same representation scheme as the cell’s scheme. That is, each cell
had a different neighborhood and a different number of neighbors. This makes it very
difficult to discuss the effect of the neighborhood size on the strategy evolution.
In this paper, we examine the following three settings of neighborhood for local
mating in our spatial IPD game model in a two-dimensional grid-world:
(i) Basic setting: A given neighborhood structure for local mating is always used
for all cells. If a cell has no neighbors with the same representation scheme as
the cell’s scheme, a new strategy is always generated from the current one by
mutation. Otherwise, a new strategy is generated by local mating, crossover and
mutation.
Fig. 1. An example of random assignment of four representation schemes (A, B, C and D) over
a two-dimensional 11 × 11 grid-world.
(ii) Setting in our former study [11]: When a cell has no neighbors with the same
representation scheme as the cell’s scheme, a larger neighborhood structure is
used so that the cell has at least one neighbor with the same scheme. In this
setting, a new strategy at each cell is generated by local mating, crossover and
mutation.
(iii) Setting with the same number of neighbors for local mating: In this setting, a
pre-specified number of cells with the same representation scheme are defined
for each cell as neighbors for local mating. That is, each cell has the same
number of neighbors for local mating. A new strategy at each cell is generated
by local mating, crossover and mutation.
For comparison, we also examine a different spatial IPD game model with a
different spatial structure where each agent is randomly assigned to a point in a two-
dimensional continuous space. We do not use a two-dimensional grid-world in this
spatial IPD game model. For each agent, a pre-specified number of the nearest agents
with the same representation scheme are defined as its neighbors for local mating.
One advantage of this model is that the number of neighbors for local mating can
be arbitrarily specified in each computer simulation. Another advantage is that the
definition of neighbors is clear and easy to implement. One disadvantage is that the
neighborhood relation is not always symmetric. That is, "X is among the nearest
neighbors of Y" does not always mean "Y is among the nearest neighbors of X". The
above-mentioned settings (ii) and (iii) of neighbors for local mating in a two-
dimensional grid-world also have such an asymmetry property.
This paper is organized as follows. In Sect. 2, we explain our spatial IPD game
model in a two-dimensional grid-world with different representation schemes. We use
two neighborhood structures in our spatial IPD game model: One is for local mating
and the other is for local interaction. Game strategies and representation schemes are
also explained in Sect. 2. In Sect. 3, we explain our cellular genetic algorithm for
strategy evolution. In each generation of the evolutionary algorithm, the fitness of each
agent is evaluated as the average payoff from the IPD game against its neighbors.
A new strategy of each agent is generated from its own and its neighbors’ strategies.
In Sect. 4, we report the results of computational experiments where one of four
representation schemes is randomly assigned to each cell in a two-dimensional
11 × 11 grid-world. The above-mentioned three settings of neighborhood for local
mating are compared with each other using various specifications of the neighborhood size.
In Sect. 5, we explain a different spatial IPD game model in a two-dimensional con-
tinuous space where each agent is randomly located. Experimental results using this
spatial IPD game model are reported in Sect. 6. In Sect. 7, we conclude this paper.
In this section, we explain our spatial IPD game model in a two-dimensional grid-
world with two neighborhood structures and different representation schemes. As in
many other studies on the spatial IPD game, we assume the torus structure of the two-
dimensional grid-world. Our spatial IPD game model in this section is the same as in
our previous study [11] except for the specification of local mating neighborhood.
Payoff Matrix: The PD (Prisoner’s Dilemma) game is a two-player non-zero sum
game with two actions: cooperation and defection. We use a frequently-used standard
payoff matrix in Table 1. When both the agent and the opponent cooperate in Table 1,
each of them receives the payoff 3. When both of them defect, each of them receives
the payoff 1. The agent receives the maximum payoff 5 by defecting when the
opponent cooperates. In this case, the opponent receives the minimum payoff 0. In
Table 1, the defection is a rational action because the agent always receives the larger
payoff by defecting than cooperating in each of the two cases of the opponent action.
The defection is also a rational action for the opponent. However, the payoff 1 by
mutual defection is smaller than the payoff 3 by mutual cooperation. That is, the
rational actions of the agent and the opponent lead to the smaller payoff 1 than the
payoff 3 by their irrational actions. This is the dilemma in Table 1.
IPD Game Strategies: The IPD game is an iterated version of the PD game. The
agent plays the PD game with the payoff matrix in Table 1 against the same opponent
for a pre-specified number of rounds. In the first round, no information is available
with respect to the previous actions of the opponent. When the agent and the opponent
choose their actions for the second round, they can use the information about
their actions in the first round. In the third round, the information about the actions in
the first two rounds is available. A game strategy for the IPD game determines the
next action based on a finite memory about previous actions. In this paper, we use the
following four representation schemes to encode IPD game strategies:
(1) Binary strings of length 3,
(2) Real number strings of length 3,
(3) Binary strings of length 7,
(4) Real number strings of length 7.
Each value in these strings shows the probability of cooperation in the next action
in a different situation. Strings of length 3 determine the next action based on the
opponent’s previous action whereas strings of length 7 use the opponent’s actions in
the previous two rounds. As an example, we show the binary string strategy "101",
called TFT (tit for tat), in Table 2. This string chooses cooperation in the first
round, which corresponds to the first value "1" of the string "101". When the
opponent cooperated in the previous round, cooperation is selected using the third
value "1". Only when the opponent defected in the previous round is defection
selected, using the second value "0". Table 3 shows an example of a binary string
strategy of length 7 ("1110111"). The first value "1" is used for the first round. The
next two values "11" are used for the second round. The choice of an action in each of
the other rounds is specified by the last four values "0111". This strategy, which is
called TF2T (tit for two tats), defects only when the opponent defected in the previous
two rounds. In the case of real number strings, cooperation is probabilistically
chosen using the corresponding real number as the cooperation probability.
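The decoding rule for length-3 strings can be sketched as follows (the mapping of string positions to situations follows the TFT example above; binary strings are the special case in which every entry is 0 or 1).

```python
# Interpret a length-3 strategy: s[0] applies in round 1, s[1] after an
# opponent defection, s[2] after an opponent cooperation.
import random

def next_action(strategy, opp_prev):
    if opp_prev is None:            # first round: no history available
        p = strategy[0]
    elif opp_prev == "D":
        p = strategy[1]
    else:
        p = strategy[2]
    return "C" if random.random() < p else "D"

tft = (1, 0, 1)                     # the "101" TFT strategy of Table 2
print(next_action(tft, None), next_action(tft, "C"), next_action(tft, "D"))
```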
Local Interaction Neighborhood in the Grid-World: We use the 11 × 11 grid-
world with the torus structure in Fig. 1. Each cell has an agent, a game strategy, and a
representation scheme. Each agent plays the IPD game against its local interaction
neighbors. Let us denote the set of local interaction neighbors of Agent i by NIPD(i). If
NIPD(i) includes five or fewer neighbors, Agent i plays the IPD game against all
neighbors in NIPD(i). If NIPD(i) includes more than five neighbors, five opponents are
randomly selected from NIPD(i). It should be noted that no agent is allowed to
play the IPD game against itself. It was demonstrated that the IPD game of each agent
against itself had a large positive effect on the evolution of cooperative behavior [12].
We examine six specifications of NIPD(i) in Fig. 2, with 4, 8, 12, 24, 40 and 48
neighbors. We also examine an extreme specification of NIPD(i) for Agent i where all
the other 120 agents are included in NIPD(i).
Local Mating Neighborhood in the Grid-World: A new strategy for each agent is
generated from its own and its neighbors’ strategies. Let NGA(i) be a set of neighbors
of Agent i for local mating. As in the case of local interaction neighborhood NIPD(i),
we examine as local mating neighborhood NGA(i) the six specifications in Fig. 2 and
the extreme specification including all the other agents. Only neighbors with the same
representation scheme as Agent i are qualified as mates of Agent i. If no neighbors are
qualified, the current strategy of Agent i is selected to generate its new strategy. We
consider the following three settings to handle this undesirable case:
(i) Basic Setting: We do not modify the local mating neighborhood even in this case.
(ii) Use of a Larger Neighborhood: In our former study [11], we used a larger
neighborhood structure for Agent i if Agent i had no qualified neighbors. More
specifically, a larger neighborhood structure with at least one qualified neighbor
was searched for in the order (a) → (b) → (c) → (d) → (e) → (f) in Fig. 2. If no
qualified neighbor was included even in Fig. 2(f), all agents were handled as
neighbors.
(iii) The Same Number of Qualified Neighbors: In this new setting, all agents have
the same number of qualified neighbors. First, all neighbors are systematically
sorted (e.g., see Fig. 3). Then qualified neighbors are added to NGA(i) up to the
pre-specified number. We can use any kind of neighbor ordering.
In this section, we explain our cellular genetic algorithm for game strategy evolution.
In our algorithm, each cell has a representation scheme and an agent with a game
strategy. The fitness value of each strategy is evaluated through the IPD game against
its neighbors. A new strategy is generated from its own and its neighbors’ strategies
with the same representation scheme. The current strategy is always replaced with a
newly generated one. The following are more detailed explanations.
Representation Scheme Assignment: One of the four representation schemes (i.e.,
binary strings of length 3, real number strings of length 3, binary strings of length 7,
and real number strings of length 7) is randomly assigned to each agent. More spe-
cifically, the 121 agents in the 11 × 11 grid-world are randomly divided into four
subsets with 30, 30, 30 and 31 agents. Each representation scheme is randomly
assigned to a different subset (i.e., to all agents in that subset). Each agent uses the
assigned representation scheme throughout the current execution of our cellular
genetic algorithm. The random assignment of a representation scheme to each agent is
updated in each run of our cellular genetic algorithm.
Initial Strategies: Each agent randomly generates an initial strategy using the
assigned representation scheme. For binary strings, each bit is randomly specified as 0
or 1 with the same probability. Real numbers in real number strings are randomly
specified using the uniform distribution over the unit interval [0, 1].
Fitness Evaluation: The fitness value of each agent is calculated as the average
payoff obtained from the IPD game with 100 rounds against up to five opponents in its
local interaction neighborhood NIPD(i). If an agent has five or fewer neighbors, the
average payoff is calculated over the IPD game against all neighbors. Otherwise, five
neighbors are randomly chosen as opponents to calculate the average payoff.
Parent Selection, Crossover and Mutation: Two parent strategies are selected for
each agent from its own and its qualified neighbors’ strategies in NGA(i) using binary
tournament selection with replacement. A new strategy is generated from the selected
pair by crossover and mutation. For binary strings, we use one-point crossover and
bit-flip mutation. For real number strings, we use blend crossover (BLX-α [13]) with
α = 0.25 and uniform mutation. If a real number becomes larger than 1 (or smaller
than 0) after crossover, it is repaired to 1 (or 0) before mutation. The same
crossover probability 1.0 and the same mutation probability 1/(5 × 121) are used for
binary and real number strategies.
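The variation operators for real number strings can be sketched as follows (BLX-α with α = 0.25, repair to [0, 1] after crossover, then uniform mutation with the stated per-gene probability).

```python
# BLX-alpha crossover with repair, followed by uniform mutation.
import random

ALPHA, P_MUT = 0.25, 1.0 / (5 * 121)

def blx_crossover(p1, p2):
    child = []
    for a, b in zip(p1, p2):
        lo, hi = min(a, b), max(a, b)
        d = hi - lo
        v = random.uniform(lo - ALPHA * d, hi + ALPHA * d)
        child.append(min(1.0, max(0.0, v)))   # repair to [0, 1]
    return child

def uniform_mutation(s):
    return [random.random() if random.random() < P_MUT else v for v in s]

offspring = uniform_mutation(blx_crossover([0.9, 0.1, 0.8], [0.7, 0.3, 0.6]))
```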
Generation Update and Termination: The current strategy of each agent is always
replaced with a newly generated one. The execution of our cellular genetic algorithm
is terminated after 1000 generation updates.
In this section, we report experimental results with four representation schemes. The
reported results are average results over 500 runs of our cellular genetic algorithm for
each setting of local interaction and mating neighborhood structures. For comparison,
we also report experimental results using a single representation scheme.
Experimental Results using a Single Representation Scheme: For comparison, we
first report experimental results with a single representation scheme. One of the four
representation schemes was assigned to all the 121 agents. We examined 7 × 7
combinations of the seven neighborhood specifications for local interaction
NIPD(i) and local mating NGA(i). Experimental results are summarized in Fig. 4 where
each bar shows the average payoff over 1000 generations, averaged over the 500 runs.
Experimental Results with the Basic Setting: As explained in Sect. 2, some agents
have no qualified neighbors as mates when the four representation schemes are ran-
domly assigned over the 121 agents. Table 4 summarizes the average percentage of
those agents with no qualified neighbors as mates. In the case of the smallest
neighborhood with four neighbors, many agents (i.e., 32.3 % of the 121 agents) have
no qualified neighbors as mates. A new strategy for each of those agents is generated
from its current strategy by mutation. No selection pressure towards good strategies
with higher average payoff is applied to those agents.
In Fig. 5, we show experimental results with the basic setting where the specified
local mating neighborhood structure is never modified. As shown in Table 4, many
agents have no qualified neighbors as mates in this setting. In Fig. 5, lower average
payoff was obtained from the smallest two neighborhood structures for local mating
(i.e., NGA(i) with four and eight neighbors) than the other larger structures. We can
also observe in Fig. 5 that similar results were obtained from the four representation
schemes when they were randomly assigned over the 121 agents (whereas totally
different results were obtained from each representation scheme in Fig. 4).
Experimental Results with the Use of a Larger Neighborhood: As in our former
study, we used a larger neighborhood structure for each agent when the agent had no
qualified neighbors as its mate. In this setting, each agent has at least one qualified
neighbor as its mate. This means that each agent has at least two candidates for its
parent strategies. As a result, some selection pressure was applied to strategies of all
[Four 3-D bar plots of average payoff versus the sizes of NIPD(i) and NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 4. Experimental results with a single representation scheme for all the 121 agents.
[Four 3-D bar plots of average payoff versus the sizes of NIPD(i) and NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 5. Experimental results with the randomly assigned four representation schemes. The
specified neighborhood structure for local mating was never modified (i.e., the basic setting).
Simulation Results with the Same Number of Qualified Neighbors: In the third
setting of local mating neighborhood NGA(i), all neighbors of each agent are sorted in
a systematic manner. Then qualified neighbors are added to NGA(i) up to a pre-
specified number. In our computational experiments, we used the order of neighbors
in Fig. 3. Since our two-dimensional 11 × 11 grid-world has the torus structure, we
can use Fig. 3 for all agents (not only for the agent at the center of the grid-world). As
the number of qualified neighbors, we examined seven specifications: 1, 2, 3, 6, 10, 12
and 30. These values are 1/4 of the size of the seven neighborhood structures
examined in the previous computational experiments. When 30 was used as the
number of qualified neighbors in NGA(i), the actual number of qualified neighbors was
29 or 30, since the 121 agents were divided into four subsets with 30, 30, 30 and 31 agents.
Experimental results are summarized in Fig. 7. The axis with NGA(i) in Fig. 7 is
the number of qualified neighbors whereas it is the size of NGA(i) including unqual-
ified neighbors in Figs. 5 and 6. Experimental results in Figs. 5–7 are similar to one
another. A closer examination of the results for |NGA(i)| = 4 and |NGA(i)| = 8
in Figs. 5–7 suggests that Fig. 7 lies somewhere between Figs. 5 and 6 (see
Table 5 for the average number of qualified neighbors in Figs. 5 and 6).
[Four 3-D bar plots of average payoff versus the sizes of NIPD(i) and NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 6. Experimental results with the randomly assigned four representation schemes. A larger
neighborhood structure was used as the local mating neighborhood for each agent when the
agent had no qualified neighbor as its mate.
[Four 3-D bar plots of average payoff versus the size of NIPD(i) and the number of qualified neighbors in NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 7. Experimental results with the randomly assigned four representation schemes. Each
agent had the pre-specified number of qualified neighbors as its mate.
[Scatter plot of the 121 agents in the unit square [0, 1] × [0, 1]; markers A–D denote the four representation schemes.]
Fig. 8. An example of 121 agents with a different representation scheme in the unit square. The
four representation schemes are shown by A, B, C and D (i.e., A: binary strings of length 3,
B: real number strings of length 3, C: binary strings of length 7, and D: real number strings of
length 7). Those representation schemes are randomly assigned to 30, 30, 30 and 31 agents.
In the continuous space, each neighborhood structure is specified by a number of
nearest neighbors for each agent. We use the Euclidean distance between agents when
we choose the nearest neighbors. The neighborhood structure NIPD(i) for local inter-
action includes the nearest agents independent of the representation scheme of each
agent. The neighborhood structure NGA(i) for local mating includes the nearest agents
with the same representation scheme as agent i. Except for the specifications of these
two neighborhood structures, we use the same settings for the IPD game and the
cellular genetic algorithm in the continuous space [0, 1] × [0, 1] in this section
as in the 11 × 11 grid-world in Sect. 4.
In each run of our cellular genetic algorithm, the locations of the 121 agents were
randomly specified in the unit square as in Fig. 8. One of the four representation
schemes was randomly assigned to each agent in the same manner as in Sect. 4 (i.e.,
the 121 agents were divided into four subsets with 30, 30, 30 and 31 agents in each
run). As the number of neighbors in NIPD(i) for local interaction and NGA(i) for local
mating, we examined seven specifications for each.
[Four 3-D bar plots of average payoff versus the size of NIPD(i) and the number of qualified neighbors in NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 9. Experimental results of our different spatial IPD game in the two-dimensional unit
square with the randomly assigned four representation schemes over 121 agents. Each agent had
the same number of qualified neighbors as its mates.
[Four 3-D bar plots of average payoff versus the numbers of qualified neighbors in NIPD(i) and NGA(i): (a) binary strings of length 3, (b) real number strings of length 3, (c) binary strings of length 7, (d) real number strings of length 7.]
Fig. 10. Experimental results with no local interaction through the IPD game between agents
with different representation schemes. All the other settings of the computational
experiments are the same as in Fig. 9.
Experimental results over all of the 49 combinations of those seven specifications are
summarized in Fig. 10. Experimental results in Fig. 10 were totally different from
those in Fig. 9. In Fig. 9, similar results were obtained from each of the four repre-
sentation schemes. That is, the four plots in Fig. 9 are similar to one another. How-
ever, experimental results from each representation scheme were totally different in
Fig. 10. This is because there was no interaction through the IPD game between
agents with different representation schemes in Fig. 10.
7 Conclusions
In this paper, we discussed the handling of agents with no qualified neighbors for local
mating. Those agents generated new strategies from their current strategies by
mutation. No selection pressure towards better strategies with higher average payoff
was applied to those agents. Thus those agents might have negative effects on the
evolution of cooperative behavior. We examined the three settings with respect to the
handling of those agents with no qualified neighbors through computational experi-
ments. However, the effect of those agents was not so clear. This is because the
existence of those agents prevents the population from converging not only to
cooperative strategies but also to uncooperative strategies. We also proposed the use
of a different spatial IPD game in a two-dimensional continuous space. Since we do
not use any discrete structure in the continuous space, we can arbitrarily specify the
number of agents and the size of neighborhood structures. The location of each agent
can also be specified arbitrarily in the continuous space. That is, the spatial
IPD game in the continuous space makes it possible to perform various computational
experiments in a more flexible manner than in the grid-world. Those com-
putational experiments are left for future research.
References
1. Ashlock, D., Kim, E.Y., Leahy, N.: Understanding representational sensitivity in the
iterated prisoner’s dilemma with fingerprints. IEEE Trans. Syst. Man Cybern.: Part C 36,
464–475 (2006)
2. Ashlock, D., Kim, E.Y.: Fingerprinting: visualization and automatic analysis of prisoner’s
dilemma strategies. IEEE Trans. Evol. Comput. 12, 647–659 (2008)
3. Ashlock, D., Kim, E.Y., Ashlock, W.: Fingerprint analysis of the noisy prisoner’s dilemma
using a finite-state representation. IEEE Trans. Comput. Intell. AI Games 1, 154–167
(2009)
4. Ishibuchi, H., Ohyanagi, H., Nojima, Y.: Evolution of strategies with different
representation schemes in a spatial iterated prisoner’s dilemma game. IEEE Trans.
Comput. Intell. AI Games 3, 67–82 (2011)
5. Skolicki, Z., De Jong, K.A.: Improving evolutionary algorithms with multi-representation
island models. In: Yao, X., Burke, E.K., Lozano, J., Smith, J., Merelo-Guervós, J.,
Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS,
vol. 3242, pp. 420–429. Springer, Heidelberg (2004)
6. Oliphant, M.: Evolving cooperation in the non-iterated prisoner’s dilemma: the importance
of spatial organization. In: Brooks, R.A., Maes. P. (eds.) Artificial Life IV, pp. 349–352
(1994)
7. Grim, P.: Spatialization and greater generosity in the stochastic prisoner’s dilemma.
BioSystems 37, 3–17 (1996)
8. Brauchli, K., Killingback, T., Doebeli, M.: Evolution of cooperation in spatially structured
populations. J. Theor. Biol. 200, 405–417 (1999)
9. Seo, Y.G., Cho, S.B., Yao, X.: The impact of payoff function and local interaction on the N-
player iterated prisoner’s dilemma. Knowl. Inf. Syst. 2, 461–478 (2000)
10. Ishibuchi, H., Namikawa, N.: Evolution of iterated prisoner’s dilemma game strategies in
structured demes under random pairing in game playing. IEEE Trans. Evol. Comput. 9,
552–561 (2005)
11. Ishibuchi, H., Hoshino, K., Nojima, Y.: Evolution of strategies in a spatial IPD game with a
number of different representation schemes. In: Proceedings of 2012 IEEE Congress on
Evolutionary Computation, pp. 808–815 (2012)
12. Ishibuchi, H., Hoshino, K., Nojima, Y.: Strategy evolution in a spatial IPD game where
each agent is not allowed to play against itself. In: Proceedings of 2012 IEEE Congress on
Evolutionary Computation, pp. 688–695 (2012)
13. Eshelman, L.J., Schaffer, J.D.: Real-coded genetic algorithms and interval-schemata.
In: Foundations of Genetic Algorithms 2, pp. 187–202. Morgan Kaufmann, San Mateo (1993)
A Study on the Specification of a Scalarizing
Function in MOEA/D for Many-Objective
Knapsack Problems
1 Introduction
MOEA/D [1] has several advantages over Pareto dominance-based EMO algorithms
such as NSGA-II [9] and SPEA2 [10]. For example, its scalarizing
function-based fitness evaluation is very fast even for many-objective problems. Its
hybridization with local search and other heuristics is often very easy.
One important issue in the implementation of MOEA/D is the choice of an
appropriate scalarizing function. In our former study [11], we proposed an idea of
automatically switching between the weighted sum and the weighted Tchebycheff
during the execution of MOEA/D. We also proposed an idea of simultaneously using
multiple scalarizing functions [12]. In the original study on MOEA/D [1], the
weighted sum and the weighted Tchebycheff were used for multiobjective knapsack
problems while the weighted Tchebycheff and the PBI (penalty-based boundary
intersection) function were used for multiobjective continuous optimization. Good
results were obtained from a different scalarizing function for a different test problem.
In this paper, we compare the weighted sum, the weighted Tchebycheff and the
PBI function with each other, which were used in the original study on MOEA/D [1].
We examine a wide range of penalty parameter values in the PBI function. As test
problems, we use multiobjective knapsack problems with 2-10 objectives. One of our
test problems is the 2-500 (two-objective 500-item) knapsack problem in Zitzler and
Thiele [13]. Many-objective test problems with 4-10 objectives are generated by
adding new objectives to the 2-500 problem. Objectives in some test problems are
strongly correlated [14] while objectives in other problems are not. MOEA/D with a
different scalarizing function shows a different search behavior on each test problem.
This paper is organized as follows. In Sect. 2, we explain our implementation of
MOEA/D, which is its basic version with no archive population [1]. In Sect. 2, we also
explain the three scalarizing functions used in [1]: the weighted sum, the weighted
Tchebycheff, and the PBI function. In Sect. 3, we compare the three scalarizing
functions through computational experiments on the 2-500 problem. Experimental
results by each scalarizing function are explained using its contour lines. It is also
shown that the PBI function with different penalty parameter values has totally dif-
ferent contour lines. In Sect. 4, we examine the performance of each scalarizing
function through computational experiments on many-objective problems. In Sect. 5,
we suggest a potential usefulness of the PBI function with a negative penalty
parameter value. Finally we conclude this paper in Sect. 6.
2 MOEA/D Algorithm
We explain the basic version of MOEA/D [1] with no archive population for the
following m-objective maximization problem:
Maximize f(x) = (f1(x), f2(x), ..., fm(x)),   (1)
where f(x) is an m-dimensional objective vector, fi(x) is the ith objective to be
maximized, and x is a decision vector. We use the basic version in order to concentrate
on the choice of a scalarizing function, since the maintenance of an archive population
needs additional special care and a large computation load in many-objective optimization.
subject to ∑_{j=1}^{n} wij xj ≤ ci,  i = 1, 2,   (11)
xj = 0 or 1,  j = 1, 2, ..., n,   (12)
where n is the number of items (i.e., n = 500), x is a 500-bit binary string, pij is the
profit of item j according to knapsack i, wij is the weight of item j according to
knapsack i, and ci is the capacity of knapsack i [13]. The values of each profit pij and
each weight wij were randomly specified integers in [10, 100], and the constant value
ci was set to half of the total weight (i.e., ci = (wi1 + wi2 + ··· + win)/2) in [13].
In the next section, we generate many-objective knapsack problems by adding new
objectives to the 2-500 problem.
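For reference, the three scalarizing functions of [1] that are compared in this paper have the following standard forms (written here in the minimization convention with weight vector λ and reference point z*; this rendering is ours and is not quoted from the paper):

```latex
% Scalarizing functions of MOEA/D (Zhang and Li [1]), minimization convention.
% lambda: weight vector, z*: reference point, theta: PBI penalty parameter.
\begin{align*}
g^{ws}(x \mid \lambda)       &= \sum_{i=1}^{m} \lambda_i f_i(x), \\
g^{te}(x \mid \lambda, z^*)  &= \max_{1 \le i \le m} \lambda_i \bigl| f_i(x) - z_i^* \bigr|, \\
g^{pbi}(x \mid \lambda, z^*) &= d_1 + \theta\, d_2, \quad
  d_1 = \frac{\bigl\lVert (f(x) - z^*)^{\mathsf{T}} \lambda \bigr\rVert}{\lVert \lambda \rVert}, \quad
  d_2 = \Bigl\lVert f(x) - \bigl( z^* + d_1 \tfrac{\lambda}{\lVert \lambda \rVert} \bigr) \Bigr\rVert.
\end{align*}
```

In the PBI function, d1 measures progress along the direction λ and d2 measures the deviation from that direction, so the penalty parameter θ trades convergence against diversity.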
We applied MOEA/D with the weighted sum, the weighted Tchebycheff, and the
PBI function (θ = 5) to the 2-500 problem using the following specifications:
Coding: 500-bit binary string; Population size: 200;
Termination condition: 200 × 2000 solution evaluations;
Parent selection: random selection from the neighborhood;
Crossover: uniform crossover (probability: 0.8);
Mutation: bit-flip mutation (probability: 1/500 for each bit);
Constraint handling: maximum profit/weight ratio-based repair in [13] (sketched below);
Number of runs of MOEA/D with each scalarizing function: 100 runs.
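A minimal sketch of a ratio-based repair in the spirit of [13]: while some knapsack overflows, drop the included item whose maximum profit/weight ratio over the knapsacks is smallest. The data layout and names are assumptions, not the authors' code.

```cpp
#include <algorithm>
#include <vector>

void repair(std::vector<int>& x,                     // x[j] in {0, 1}
            const std::vector<std::vector<int>>& p,  // p[i][j]: profit of item j in knapsack i
            const std::vector<std::vector<int>>& w,  // w[i][j]: weight of item j in knapsack i
            const std::vector<long long>& c) {       // c[i]: capacity of knapsack i
    const int m = (int)p.size(), n = (int)x.size();
    std::vector<long long> load(m, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) load[i] += (long long)w[i][j] * x[j];
    auto infeasible = [&] {
        for (int i = 0; i < m; ++i) if (load[i] > c[i]) return true;
        return false;
    };
    while (infeasible()) {                           // empty solution is feasible,
        int worst = -1;                              // so an included item exists
        double worst_ratio = 1e300;
        for (int j = 0; j < n; ++j) {
            if (!x[j]) continue;
            double r = 0.0;                          // best profit/weight ratio of item j
            for (int i = 0; i < m; ++i)
                r = std::max(r, (double)p[i][j] / w[i][j]);
            if (r < worst_ratio) { worst_ratio = r; worst = j; }
        }
        x[worst] = 0;                                // drop the least attractive item
        for (int i = 0; i < m; ++i) load[i] -= w[i][worst];
    }
}
```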
Solutions in the final generation of a single run of MOEA/D with each scalarizing
function are shown in Fig. 1a–c. Figure 1d shows the 50 % attainment surface over
100 runs of MOEA/D with each scalarizing function. In Fig. 1, better distributions of
solutions were obtained by the weighted Tchebycheff and the PBI function (θ = 5).
In Fig. 2, we show the contour lines of each scalarizing function for the weight
vector λ = (0.5, 0.5). Contour lines have been used to explain scalarizing functions in
the literature [15]. When we used the weighted sum in Fig. 2a, the same solution was
often obtained from different weight vectors. As a result, many solutions were not
obtained in Fig. 1a. In the case of the weighted Tchebycheff in Fig. 2b, the same
solution was not often obtained from different weight vectors. Thus more solutions
were obtained in Fig. 1b than in Fig. 1a. The best results with respect to the number of
obtained solutions came from the PBI function with θ = 5 in Fig. 1c, due to
the sharp edge of the contour lines in Fig. 2c. In Fig. 1d, the PBI function slightly
outperformed the weighted Tchebycheff with respect to the width of the obtained
solution sets along the Pareto front.
The shape of the contour lines of the PBI function depends entirely on the value of
θ. Figure 3 shows the contour lines of the PBI function for some other values of θ.
When θ = 0, the contour lines in Fig. 3a are similar to those of the weighted sum in
Fig. 2a. The contour lines of the PBI function with θ = 1 in Fig. 3b look somewhat
similar to those of the weighted Tchebycheff in Fig. 2b. The valley of the contour
lines in Fig. 3c with θ = 50 is much longer than that in Fig. 2c with θ = 5. This means
that the contour lines with θ = 50 have a sharper edge, which leads to a slower con-
vergence speed towards the reference point (i.e., towards the Pareto front).
Fig. 3. Contour lines of the PBI function with different values of the penalty parameter θ.
Figure 4 shows experimental results using each value of the penalty parameter θ in
Fig. 3. As expected, the use of θ = 50 in Fig. 4c deteriorated the convergence ability
of MOEA/D compared with Fig. 1c with θ = 5. Figure 4a with θ = 0 is similar to Fig. 1a with
the weighted sum. However, Fig. 4b with θ = 1 is not similar to Fig. 1b with the
weighted Tchebycheff. This is because the contour lines of the two scalarizing
functions are not similar to each other for other weight vectors (see Fig. 5).
Figure 5 shows the contour lines of each scalarizing function for the five weight
vectors λ = (1, 0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0, 1). The contour lines of the
weighted sum are similar to those of the PBI function with θ = 0 in Fig. 5. Thus
similar results were obtained from these two scalarizing functions in Figs. 1 and 4.
Except for the PBI function with θ = 1, the contour lines are (almost) vertical or
(almost) horizontal when λ = (1, 0) and λ = (0, 1). Thus a wide variety of solutions
were obtained in Figs. 1 and 4 except for the case of the PBI function with θ = 1.
Fig. 5. Contour lines of the scalarizing functions for various weight vectors.
problems). It should be noted that all of those test problems have the same constraint
condition as in the 2-500 problem. This means that all of our test problems have the
same feasible solution set as in the 2-500 problem.
In one extreme case with α = 1 in (13)–(15), our many-objective test problems
have only two different objectives g1(x) and g2(x) (i.e., f1(x) and f2(x) of the 2-500
problem). In the other extreme case with α = 0, all objectives gi(x) are totally different
from each other and have no strong correlation since profit values were randomly
specified in each objective. We examined six values of α: 0, 0.2, 0.4, 0.6, 0.8, 1.
We applied MOEA/D with the weighted sum, the weighted Tchebycheff and the
PBI function (θ = 0, 0.1, 0.5, 1, 5, 10, 50) to our many-objective 4-500, 6-500, 8-500
and 10-500 problems with α = 0, 0.2, 0.4, 0.6, 0.8, 1. For comparison, we also applied
NSGA-II and SPEA2 to the same test problems. Computational experiments were
performed using the same parameter specifications as in Sect. 3 except for the pop-
ulation size in MOEA/D due to the combinatorial nature of the number of weight
vectors. The population size in MOEA/D was specified as 220 (4-500), 252 (6-500),
120 (8-500), and 220 (10-500). In NSGA-II and SPEA2, the population size was
always specified as 200. We evaluated the performance of each algorithm on each test
problem using the average hypervolume over 100 runs. The origin of the objective
space of each test problem (i.e., (0, 0, …, 0)) was used as a reference point for
hypervolume calculation.
Experimental results are summarized in Tables 1, 2, 3 and 4 where all experi-
mental results are normalized using the results of MOEA/D with the weighted sum for
each test problem. The average hypervolume value by MOEA/D with the weighted
sum is normalized as 1.00. In Table 1, experimental results on the 2-500 problem are
also included for comparison. The largest normalized average hypervolume value for
each test problem is highlighted by bold face in Tables 1, 2, 3 and 4. Good results
were obtained for all test problems by MOEA/D with the weighted sum and the PBI
function with θ = 0. The weighted Tchebycheff did not work well on many-objective
problems with no or small correlation among objectives (e.g., 10-500 with α = 0.2)
while it worked well on many-objective problems with highly correlated objectives
(e.g., 10-500 with α = 0.8). It also worked well on the 2-500 and 4-500 problems.
For the 2-500 problem, the best results were obtained by the PBI function with θ = 5
and θ = 10.
In Tables 1, 2, 3 and 4, the performance of MOEA/D with the PBI function on many-
objective problems was improved by using a smaller value of the penalty parameter θ
(e.g., see the results on the 10-500 problem with α = 0). From this observation, one
may think that better results would be obtained from a negative value of θ.
To examine this issue, we further performed computational experiments using
three negative values of θ: θ = −1, −0.5, −0.1. Experimental results are summarized
in Tables 5, 6, 7 and 8 where the normalized average hypervolume is calculated in the
same manner as in Tables 1, 2, 3 and 4. The origin of the objective space of each test
problem was used in Tables 5, 6, 7 and 8 for hypervolume calculation as in Tables 1,
2, 3 and 4. We can see from Tables 5, 6, 7 and 8 that the best results were obtained
from the PBI function with θ = −0.1 for almost all test problems except for those with
highly correlated objectives.
These experimental results in Tables 5, 6, 7 and 8, however, are somewhat
misleading. In Fig. 6a, we show all solutions in the final generation of a single run of
MOEA/D on the 2-500 problem for the case of the PBI function with θ = −0.1. We
can see that no solutions were obtained around the center of the Pareto front. Such a
distribution of the obtained solutions can be explained by the concave shape of the
contour lines of the PBI function with θ = −0.1, which are shown in Fig. 6b. Due to
the concave contour lines, MOEA/D with the PBI function tends to search for solu-
tions around the edges of the Pareto front even when the weight vector k is specified as
(0.5, 0.5). As a result, no solutions were obtained around the center of the Pareto front
in Fig. 6a.
In Tables 1, 2, 3, 4, 5, 6, 7 and 8 we used the origin of the objective space as the
reference point for hypervolume calculation. Since the origin is far from the Pareto
front of each test problem, extreme solutions around the edge of the Pareto front have
Fig. 6. Experimental results by the PBI function with θ = −0.1 on the 2-500 problem.
large effects on hypervolume calculation [22]. This is why the best results were
obtained from the PBI function with θ = −0.1 in Tables 5, 6, 7 and 8. Good results
from θ = −0.1 also suggest the importance of diversity improvement even for many-
objective problems [23].
In Tables 9, 10, 11 and 12, we recalculated the normalized average hypervolume
using another reference point closer to the Pareto front. For each
test problem, we used (15000, 15000, …, 15000) as the reference point for hyper-
volume calculation. This setting of hypervolume calculation increases the importance
of the convergence of solutions around the center of the Pareto front. Thus the nor-
malized average hypervolume value by the PBI function with θ = −0.1 deteriorated
for almost all test problems in Tables 9, 10, 11 and 12 (e.g., see the last row in
Table 9 on the 2-500 problem).
6 Conclusions
In this paper, we examined the choice of a scalarizing function in MOEA/D for many-
objective knapsack problems. Good results were obtained from the weighted sum over
various settings of test problems. With respect to the specification of the penalty
parameter in the PBI function, we obtained the following interesting observations:
(i) The best hypervolume values were obtained from a small negative parameter
value (i.e., θ = −0.1) when the reference point was far from the Pareto front. In
this case, the search of MOEA/D was biased towards the edges of the Pareto
front.
(ii) When the reference point was close to the Pareto front, the best hypervolume
values were obtained from a small positive parameter value (i.e., θ = 0.1).
(iii) Almost the same results were obtained from θ = 0 and the weighted sum.
References
1. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on
decomposition. IEEE Trans. Evol. Comput. 11, 712–731 (2007)
2. Chang, P.C., Chen, S.H., Zhang, Q., Lin, J.L.: MOEA/D for flowshop scheduling problems.
In: Proceedings of IEEE Congress on Evolutionary Computation, pp. 1433–1438 (2008)
3. Zhang, Q., Liu, W., Li, H.: The performance of a new version of MOEA/D on CEC09
unconstrained MOP test instances. In: Proceedings of IEEE Congress on Evolutionary
Computation, pp. 203–208 (2009)
4. Li, H., Zhang, Q.: Multiobjective optimization problems with complicated Pareto sets,
MOEA/D and NSGA-II. IEEE Trans. Evol. Comput. 13, 284–302 (2009)
5. Ishibuchi, H., Sakane, Y., Tsukamoto, N., Nojima, Y.: Evolutionary many-objective
optimization by NSGA-II and MOEA/D with large populations. In: Proceedings of IEEE
International Conference on Systems, Man, and Cybernetics, pp. 1820–1825 (2009)
6. Zhang, Q., Liu, W., Tsang, E., Virginas, B.: Expensive multiobjective optimization by
MOEA/D with Gaussian process model. IEEE Trans. Evol. Comput. 14, 456–474 (2010)
7. Tan, Y.Y., Jiao, Y.C., Li, H., Wang, X.K.: A modification to MOEA/D-DE for
multiobjective optimization problems with complicated Pareto sets. Inf. Sci. 213, 14–38
(2012)
8. Zhao, S.Z., Suganthan, P.N., Zhang, Q.: Decomposition-based multiobjective evolutionary
algorithm with an ensemble of neighborhood sizes. IEEE Trans. Evol. Comput. 16,
442–446 (2012)
9. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic
algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
10. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary
Algorithm. TIK-Report 103, Department of Electrical Engineering, ETH, Zurich (2001)
11. Ishibuchi, H., Hitotsuyanagi, Y., Ohyanagi, H., Nojima, Y.: Effects of the existence of
highly correlated objectives on the behavior of MOEA/D. In: Takahashi, R.H.C., Deb, K.,
Wanner, E.F., Greco, S. (eds.) EMO 2011. LNCS, vol. 6576, pp. 166–181. Springer,
Heidelberg (2011)
12. Ishibuchi, H., Sakane, Y., Tsukamoto, N., Nojima, Y.: Simultaneous use of different
scalarizing functions in MOEA/D. In: Proceedings of Genetic and Evolutionary
Computation Conference, pp. 519–526 (2010)
13. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study
and the strength Pareto approach. IEEE Trans. Evol. Comput. 3, 257–271 (1999)
14. Ishibuchi, H., Hitotsuyanagi, Y., Ohyanagi, H., Nojima, Y.: Effects of the existence of
highly correlated objectives on the behavior of MOEA/D. In: Takahashi, R.H.C., Deb, K.,
Wanner, E.F., Greco, S. (eds.) EMO 2011. LNCS, vol. 6576, pp. 166–181. Springer,
Heidelberg (2011)
15. Miettinen, K.: Nonlinear Multiobjective Optimization. Kluwer, Dordrecht (1999)
16. Hughes, E. J.: Evolutionary many-objective optimisation: many once or one many? In:
Proceedings of IEEE Congress on Evolutionary Computation, pp. 222–227 (2005)
17. Purshouse, R.C., Fleming, P.J.: On the evolutionary optimization of many conflicting
objectives. IEEE Trans. Evol. Comput. 11, 770–784 (2007)
18. Sato, H., Aguirre, H.E., Tanaka, K.: Local dominance and local recombination in MOEAs
on 0/1 multiobjective knapsack problems. Eur. J. Oper. Res. 181, 1708–1723 (2007)
19. Ishibuchi, H., Tsukamoto, N., Hitotsuyanagi, Y., Nojima, Y.: Effectiveness of scalability
improvement attempts on the performance of NSGA-II for many-objective problems. In:
Proceedings of Genetic and Evolutionary Computation Conference, pp. 649–656 (2008)
20. Ishibuchi, H., Tsukamoto, N., Nojima, Y.: Evolutionary many-objective optimization: a
short review. In: Proceedings of IEEE Congress on Evolutionary Computation,
pp. 2424–2431 (2008)
21. Kowatari, N., Oyama, A., Aguirre, H., Tanaka, K.: Analysis on population size and
neighborhood recombination on many-objective optimization. In: Coello, C.A.C., Cutello,
V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part II. LNCS, vol.
7492, pp. 22–31. Springer, Heidelberg (2012)
22. Ishibuchi, H., Nojima, Y., Doi, T.: Comparison between single-objective and multi-
objective genetic algorithms: performance comparison and performance measures. In:
Proceedings of IEEE Congress on Evolutionary Computation, pp. 3959–3966 (2006)
23. Ishibuchi, H., Tsukamoto, N., Nojima, Y.: Diversity improvement by non-geometric binary
crossover in evolutionary multiobjective optimization. IEEE Trans. Evol. Comput. 14,
985–998 (2010)
Portfolio with Block Branching
for Parallel SAT Solvers
1 Introduction
The Boolean satisfiability (SAT) problem asks whether an assignment of vari-
ables exists that can evaluate a given formula as true. A formula is given in
Conjunctive Normal Form (CNF), which is a conjunction of clauses. A clause
is a disjunction of literals, where a literal is a positive or a negative form of a
Boolean variable. The solvers for this problem are called SAT solvers. Today,
there are many real applications [6] of SAT solvers, such as AI planning, circuit
design, and software verification. Many state-of-the-art SAT solvers are based
on the Davis–Putnam–Logemann–Loveland (DPLL) algorithm. In recent decades,
conflict-driven clause learning and non-chronological backtracking, Variable
State Independent Decaying Sum (VSIDS) decision heuristic [7], and restart
[3] were added to DPLL, which improved the performance of DPLL SAT solvers
significantly. Today, these solvers are called Conflict Driven Clause Learning
(CDCL) solvers [1].
State-of-the-art parallel SAT solvers are also built upon CDCL solvers. The
mainstream approach to parallelizing SAT solvers is portfolio [5]. In this app-
roach, all workers conduct the search competitively and cooperatively, without an
explicit division of the search space.
2 Block Branching
Our method is quite simple. Firstly, we divide the variables in a given CNF into
blocks, in such a way that the variables in each block have close relationships.
Then, we assign them to each worker. Each worker conducts a search, focusing
on the variables in the assigned block. In this manner, we can easily reduce
the overlaps of the search spaces between the workers. In order to focus the search
on the specified variables, each worker periodically increases the VSIDS scores of
the variables in its block, so as to change the decision order. The method used to increase
the VSIDS scores is based on Counter Implication Restart (CIR) [10]. For every
several restarts, we increase the scores of the variables vigorously, immediately
after the restart, in order to force the solver to select them as decision variables.
In [2], each worker fixes three variables that are randomly chosen as the root of
the search tree. Our method does not always assign values to the target variables
at the top of the search tree.
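A hypothetical sketch of this score boost on a MiniSat-like solver; the activity array, the bump amount, and the restart interval are assumptions rather than the authors' code:

```cpp
#include <vector>

// Every few restarts, vigorously raise the VSIDS activity of the variables
// in this worker's block so they are selected as decision variables right
// after the restart (in the style of CIR [10]).
void boost_block(std::vector<double>& activity,      // per-variable VSIDS score
                 const std::vector<int>& block_vars, // this worker's block
                 int restart_count, int interval = 5, double bump = 1e6) {
    if (restart_count % interval != 0) return;       // only every few restarts
    for (int v : block_vars) activity[v] += bump;    // push block vars to the top
    // A real solver would also rebuild or percolate its order heap here.
}
```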
We utilize binary clauses in a given CNF for dividing the variables (literals)
into blocks. The basic concept comes from our previous work [9] and we explain
the procedure in Fig. 1. Assuming that there are four variables (eight literals),
State #1 indicates that each literal belongs to its own block. Next, the first
binary clause, (a ∨ b), is evaluated in State #2. This clause logically stands
for (¬a ⇒ b) and (¬b ⇒ a). In the case of the former clause, the literal b
is assigned to True, immediately after the assignment of a = F alse, and the
same assignment is made for the latter clause. In other words, the literal b is
dominated by the literal ¬a. Thus, we can make a block represented by the
literal ¬a, which consists of the literal ¬a and the literal b. In this manner, we
can identify the two blocks in State #4 by analyzing three binary clauses. We
believe that variables in a certain block have close relationships, and an intensive
search for these variables can achieve effective diversification for the whole search.
Note that although there can be illogical relationships between some literals in
a block, we permit such irrelevancies since we want to divide the variables into
blocks in the simplest possible way. We use the Union-Find algorithm to detect
the representative literals and to merge two blocks efficiently. For the merging
process, we set a threshold in order to control the number of blocks, because it
has to be adjusted to the number of worker processes. When the new block size
after a merge would exceed the threshold, the merge is cancelled. We use a binary
search for identifying a suitable threshold.
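A minimal sketch of this block construction with a size threshold, under an illustrative literal encoding (the merge is cancelled when the resulting block would be too large, as described above):

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Literals start in singleton blocks; each binary clause (a or b) merges the
// block of (not a) with that of b, and the block of (not b) with that of a.
struct Blocks {
    std::vector<int> parent, size;
    int threshold;
    Blocks(int nLits, int thr) : parent(nLits), size(nLits, 1), threshold(thr) {
        std::iota(parent.begin(), parent.end(), 0);  // each literal is its own block
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void merge(int a, int b) {            // cancel if the merged block is too big
        a = find(a); b = find(b);
        if (a == b || size[a] + size[b] > threshold) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a; size[a] += size[b];
    }
};
// For a clause (a or b), with neg()/lit() mapping literals to indices:
//   blocks.merge(neg(a), lit(b)); blocks.merge(neg(b), lit(a));
```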
3 Experimental Results
All experiments were conducted on a shared-memory machine. The cactus plot of the results of the four solvers
is shown in Fig. 2 and the details are shown in Table 1. Our proposed solver
achieved better performance than ParaCIRMiniSAT within the time limit. In
total, it could solve eight more instances than ParaCIRMiniSAT and three more
instances than the solver with no block branching (no bb p16). Even in this
small parallel setting (sixteen processes), block branching was shown to improve
the performance of the base solver. We believe that block branching
can achieve stronger diversification in massively parallel environments.
Table 1. Details of the results: 252 instances (SAT: 111, UNSAT: 141) could be
solved by at least one solver. Instances that could not be solved within 5000 s are
counted as 5000 s.
Solver               SAT (111)  UNSAT (141)  Total  Total time for 252 instances (s)
bb INT5 p16             107         132       239        190040
bb INT10 var30 p16      108         131       239        176250
no bb p16               106         130       236        188790
ParaCIRMiniSAT p8       105         126       231        200070
Fig. 2. The experimental results of four solvers using 300 instances from the SAT
Competition 2011
4 Conclusion
References
1. Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfiability.
Frontiers in Artificial Intelligence and Applications, vol. 185. IOS Press, Amster-
dam (2009)
2. Bordeaux, L., Hamadi, Y., Samulowitz, H.: Experiments with massively parallel
constraint solving. In: Proceedings of the 21st International Joint Conference on
Artificial Intelligence, IJCAI'09, pp. 443–448 (2009)
3. Gomes, C.P., Selman, B., Kautz, H.: Boosting combinatorial search through ran-
domization. In: Proceedings of the 15th National/10th Conference on Artificial
Intelligence, AAAI’98/IAAI’98, pp. 431–437 (1998)
4. Guo, L., Hamadi, Y., Jabbour, S., Sais, L.: Diversification and intensification in
parallel SAT solving. In: Cohen, D. (ed.) CP 2010. LNCS, vol. 6308, pp. 252–265.
Springer, Heidelberg (2010)
5. Hamadi, Y., Jabbour, S., Sais, L.: ManySAT: a parallel SAT solver. JSAT 6(4),
245–262 (2009)
6. Marques-Silva, J.: Practical applications of boolean satisfiability. In: Proceedings
of Workshop on Discrete Event Systems, WODES’08, pp. 74–80 (2008)
7. Moskewicz, M.W., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: engineer-
ing an efficient SAT solver. In: Proceedings of the 38th Annual Design Automation
Conference, DAC’01, pp. 530–535 (2001)
8. Sonobe, T., Inaba, M.: Counter implication restart for parallel SAT solvers. In:
Hamadi, Y., Schoenauer, M. (eds.) LION 2012. LNCS, vol. 7219, pp. 485–490.
Springer, Heidelberg (2012)
9. Sonobe, T., Inaba M.: Division and alternation of decision variables. In: Pragmatics
of SAT 2012 (2012)
10. Sonobe, T., Inaba, M., Nagai, A.: Counter implication restart. In: Pragmatics of
SAT 2011 (2011)
Parameter Setting with Dynamic Island Models
1 Introduction
Island Models [7] have been introduced in order to better manage diversity in
population-based algorithms. A well-known drawback of evolutionary algorithms
(EA) is indeed the premature convergence of their population.
Island models provide a natural abstraction for dividing the population into
several subsets, distributed on the islands. An independent EA is run on each
of these islands. Individuals are allowed to migrate from one island to another
in order to insure information sharing during the resolution and to maintain
some diversity on each island thanks to incoming individuals. Another immediate
advantage of such models is that they facilitate the parallelization of EAs. The
basic model can be refined, especially concerning the following aspects:
– Migration policies: Generally, migrations between the different islands are
performed according to predefined rules [5]. Individuals may be chosen in
order to reinforce the islands’population characteristics [6]. Recently, dynamic
policies have been proposed [1]. A transition matrix is updated during the
search process in order to dynamically regulate the diversity and the size of
the different islands’ populations, according to their quality.
– Search algorithms: In classic island models, each island uses the same EA. It
may be interesting to consider on each island different algorithms or different
possible configurations of an algorithm, by changing its parameters.
We propose to use island models as a new method for improving parameter man-
agement in EAs. Although the efficiency of EAs is well-established on numerous
optimization problems, their performance and robustness depend on the correct
setting of their components by means of parameters. Different types of parameters
can be considered, ranging from the solving components (e.g., the selection process or
the variation operators) to numerical parameters that modify the behavior of
a given EA (e.g., the population size or the application rates of the variation
operators). Most of the time, these parameters have strong interactions and it
is very difficult to forecast how they will affect the performance of the EA. We
focus here on the use of variation operators in population based algorithms: i.e.,
choosing at each iteration which variation operator should be applied.
Automatic tuning (off-line setting before solving) tools [3] have been devel-
oped to search for good parameter values in the parameter space, which is
defined as the cross-product of the parameter values and the possible instances
of the problem. Specific heuristics are used in order to sample efficiently the pos-
sible values of parameters that are expected to have a significant impact on the
performance. Another possible approach for parameter setting is to control the
value of the parameters during the search. Adaptive operator selection (AOS) [3]
consists in providing an adaptive mechanism for selecting the suitable variation
operator to apply on individuals at each iteration of the EA process.
Dynamic island models can be used as an AOS method for a classic steady-
state GA. The main principle is to distribute the operators of the GA on the
different islands. We use a dynamic migration policy in order to achieve a suitable
distribution of the individuals on the most promising island. The purpose of
this paper is to show that our approach is able to identify the most efficient
operators of a given EA. This result can be used to improve the performance
of an algorithm whose behavior is difficult to handle by non specialist users but
also to help algorithm’s designers to improve their EAs.
Different choices can be made for the topology (complete graph, ring, etc.). A loop
edge (i, i) in the graph allows individuals to stay on the
same island. The size of the global population is fixed along the process but the
size of each Pk changes constantly according to the migrations. Dynamic island
models use a migration policy that evolves during the search. Each island is
equipped with two main processes that define how its population evolves and
how the migration policy is controlled.
Figure 1 highlights that the migration policy is modified during the run
according to incoming feedback received from other islands. This feedback allows
the island to know about the improvement that its individuals have obtained on
other islands. The basic learning principle consists in sending more individuals
to islands that have been able to improve them significantly and less to islands
that are currently less efficient for these individuals. To avoid brutal changes,
these effects are evaluated on a time window. The selection component uses the
migration matrix M to select the individuals that are sent to other islands. Of
course, some individuals may stay on the same island. The analysis component
evaluates the improvements of the individuals after the EA has been
applied and sends this feedback information to the islands these individuals
originated from. According to the previous notations, we define the following
basic processes and data structures:
- The update process uses the data matrix D (i.e., the feedback from the other islands)
in order to modify the migration matrix. Of course, only the row corresponding
to island i is modified. We compute an intermediate reward vector R. We propose
to use an intensification strategy: only the island where individuals of i have
obtained the best improvement is rewarded (note that there could be several
such best islands).
R(k) = 1/card(B) if k ∈ B, and R(k) = 0 otherwise, with B = argmax_k D(i, k).
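A minimal sketch of this update of row i of the migration matrix M; the smoothing with a learning rate beta is our assumption, since only the reward vector R is specified above:

```cpp
#include <algorithm>
#include <vector>

// Intensification update: island i rewards only the island(s) where its
// emigrants obtained the best improvement (ties share the reward), then
// row i of M is nudged toward the reward vector R.
void update_row(std::vector<std::vector<double>>& M,       // migration matrix
                const std::vector<std::vector<double>>& D, // feedback matrix
                int i, double beta) {                      // beta: learning rate
    const int n = (int)M.size();
    double best = D[i][0];
    for (int k = 1; k < n; ++k) best = std::max(best, D[i][k]);
    int card = 0;                                          // card(B): number of ties
    for (int k = 0; k < n; ++k) if (D[i][k] == best) ++card;
    for (int k = 0; k < n; ++k) {
        double r = (D[i][k] == best) ? 1.0 / card : 0.0;   // reward vector R
        M[i][k] = (1.0 - beta) * M[i][k] + beta * r;       // smoothed update
    }
}
```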
3 Experimental Results
The NK family of landscapes is a problem-independent model for construct-
ing multimodal landscapes. An NK-landscape is commonly defined by triplet
(X , N , f ), with the set of N -length binary strings as search space X and 1-flip
as neighborhood relation N (two configurations are neighbors iff their Hamming
distance is 1). The fitness function f : {0, 1}^N → [0, 1), to be maximized, is
defined as follows:

f(x) = (1/N) ∑_{i=1}^{N} ci(xi, xi1, ..., xiK),   (1)

where ci : {0, 1}^{K+1} → [0, 1) defines the component function associated with
each variable xi, i ∈ {1, ..., N}, and where K < N.
Parameters N and K define the landscape characteristics. The configuration
length N determines the size of the search space, while K specifies
the ruggedness of the landscape. Indeed, the fitness value of a configuration
is given by the sum of N terms, each one depending on K + 1 bits of the
configuration. Thus, by increasing the value of K from 0 to N − 1, NK-landscapes
can be tuned from smooth to rugged.
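A minimal sketch of an NK-landscape evaluation, assuming (as one common choice, not fixed by the paper) that each variable is linked to its K right neighbors cyclically:

```cpp
#include <vector>

// Evaluate f(x) = (1/N) * sum_i c_i(...), where table[i] holds the 2^(K+1)
// values of the component function c_i (e.g., drawn uniformly from [0, 1)).
double nk_fitness(const std::vector<int>& x,                    // bits in {0, 1}
                  const std::vector<std::vector<double>>& table,
                  int K) {
    const int N = (int)x.size();
    double f = 0.0;
    for (int i = 0; i < N; ++i) {
        int idx = 0;
        for (int k = 0; k <= K; ++k)           // pack x_i and its K linked bits
            idx = (idx << 1) | x[(i + k) % N]; // cyclic linkage (an assumption)
        f += table[i][idx];                    // c_i(x_i, x_{i1}, ..., x_{iK})
    }
    return f / N;
}
```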
Fig. 2. DIM behavior for the AOS problem on an NK-landscape instance (a); comparisons
among DIM, ParamILS, Uniform and 1-flip (b).
Fig. 3. Multi-armed bandit problem with two learning strategies: (a) for mean and (b)
for best.
An armed bandit is a machine which rewards the gambler with a fixed probability.
It can be represented by a couple (p, r), where p is the probability of
gaining r. While playing, the gambler knows neither p nor r. The multi-armed
bandit is a set of armed bandits where each arm has its own probability and
its own reward. Figure 3 shows the Dynamic Island Model behavior with the
mean (a) and the max (b) strategy for the multi-armed bandit problem. In this
experiment, 3 operators have been used, corresponding to a 3-armed bandit with
the following rewards: (0.4,0.6), (0.8,0.4) and (1,0.001).
The mean strategy leads mainly to the use of the operator with the highest expected
value (operator (0.8, 0.4)), whereas the best strategy leads to the exclusive use of the
operator with the highest reward (operator (0.4, 0.6)).
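A small simulation makes the two preferences visible; the uniform exploration loop and the seed are purely illustrative:

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <random>
#include <utility>

int main() {
    // The three arms (p, r) used in the experiment above.
    std::array<std::pair<double, double>, 3> arms{{{0.4, 0.6}, {0.8, 0.4}, {1.0, 0.001}}};
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::array<double, 3> sum{}, best{};
    std::array<int, 3> pulls{};
    for (int t = 0; t < 30000; ++t) {
        int a = t % 3;                                   // uniform exploration
        double reward = (u(rng) < arms[a].first) ? arms[a].second : 0.0;
        sum[a] += reward; ++pulls[a];
        best[a] = std::max(best[a], reward);             // largest single reward
    }
    auto argmax = [](const std::array<double, 3>& v) {
        return std::max_element(v.begin(), v.end()) - v.begin();
    };
    std::array<double, 3> mean;
    for (int a = 0; a < 3; ++a) mean[a] = sum[a] / pulls[a];
    std::cout << "mean strategy favors arm " << argmax(mean)   // expect (0.8, 0.4)
              << ", best strategy favors arm " << argmax(best) // expect (0.4, 0.6)
              << "\n";
}
```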
References
1. Candan, C., Goëffon, A., Lardeux, F., Saubion, F.: A dynamic island model for
adaptive operator selection. In: Proceedings of Genetic and Evolutionary Compu-
tation Conference (GECCO’12), pp. 1253–1260 (2012)
2. Eiben, A., Smith, J.: Introduction to Evolutionary Computing. Natural Computing
Series. Springer, Heidelberg (2003)
3. Hamadi, Y., Monfroy, E., Saubion, F. (eds.): Autonomous Search. Springer, Heidel-
berg (2012)
4. Hutter, F., Hoos, H.H., Leyton-Brown, K., Stützle, T.: ParamILS: an automatic
algorithm configuration framework. J. Artif. Int. Res. 36(1), 267–306 (2009)
5. Rucinski, M., Izzo, D., Biscani, F.: On the impact of the migration topology on the
island model. CoRR, abs/1004.4541 (2010)
6. Skolicki, Z., Jong, K.A.D.: The influence of migration sizes and intervals on island
models. In: Proceedings of Genetic and Evolutionary Computation Conference
(GECCO’05), pp. 1295–1302 (2005)
7. Whitley, D., Rana, S., Heckendorn, R.B.: The island model genetic algorithm: on
separability, population size and convergence. J. Comput. Inf. Tech. 7, 33–47 (1998)
A Simulated Annealing Algorithm
for the Vehicle Routing Problem with Time
Windows and Synchronization Constraints
1 Introduction
The vehicle routing problem (VRP) [9] is a widely studied combinatorial opti-
mization problem in which the aim is to design optimal tours for a set of vehicles
serving a set of customers geographically distributed and respecting some side
constraints. We are interested in a particular variant of VRP, the so-called VRP
with time windows and synchronization constraints (VRPTWSyn). In such a
problem, each customer is associated with a time window that represents the
interval of time when the customer is available to receive the vehicle service.
This means that if the vehicle arrives too soon, it should wait until the opening
of the time window to serve the customer while too late arrival is not allowed.
Additionally, for some customers, more than one visit, e.g. two visits from two
This work is partially supported by the Regional Council of Picardie and the Euro-
pean Regional Development Fund (ERDF), under the PRIMA project.
of valid positions (for insertion) from the existing routes. This process is known
as the computation of transitive closures.
The function Diversification(X, dmin, dmax) first removes a number (randomly
generated between dmin and dmax) of visits from the current solution, then
detected by w, then the other neighborhood will be put back to W (in case it
was removed). The procedure is terminated when W is empty.
3 Results
We tested our algorithm on the instances introduced by [3]. The benchmark
comprises 10 sets grouped in 3 categories based on the number of customers.
Each set has 5 varieties of instances, which are named after the width of the time
windows. Our algorithm is coded in C++ and all experiments were conducted on
an Intel Core i7-2620M 2.70 GHz. This configuration is comparable to the com-
putational environment employed by Bredström and Rönnqvist [2,3] (a 2.67 GHz
Intel Xeon processor). According to the protocol proposed in [2], all the meth-
ods were tested with the varieties of S (small), M (medium) and L (large) time
windows. After several experiments on a subset of small instances, we decided
to fix the parameters as follows: T0 = 20, α = 0.99, d = 3, itermax = 10 × n
and rhmax = 10.
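A minimal sketch of the corresponding acceptance rule with these parameters, assuming a standard Metropolis criterion [6] and geometric cooling (the exact hook points in the algorithm are assumptions):

```cpp
#include <cmath>
#include <random>

// Simulated annealing acceptance with T0 = 20 and geometric cooling
// (alpha = 0.99), as fixed above. delta = newCost - oldCost (minimization).
struct Annealer {
    double T = 20.0, alpha = 0.99;
    std::mt19937 rng{12345};
    std::uniform_real_distribution<double> u{0.0, 1.0};
    bool accept(double delta) {
        if (delta <= 0.0) return true;            // improvement: always accept
        return u(rng) < std::exp(-delta / T);     // worse: accept with prob e^(-d/T)
    }
    void cool() { T *= alpha; }                   // after each cooling step
};
```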
Table 1 shows our results compared to the literature. Instances for which all
methods report the same results are omitted from this table. Columns n, m,
s and Best show the number of visits, the number of vehicles, the number of
synchronizations and the best known solution from all methods (including ours)
respectively for each instance. A star symbol (*) is used in Best to indicate that
the solution is proved to be optimal. The other column headers are: MIP for the
results of the default CPLEX solver reported in [3]; H for the heuristic proposed
in [3] which is based on the local-branching technique [5]; BP1 and BP2 for
the results of the two branch-and-price algorithms presented in [2] and finally
SA for our simulated annealing algorithm. Columns Sol and CPU correspond to
the best solution found by each method and the associated total computational
time. Bold numbers in Sol indicate that the solution quality reaches Best.
From these results, we remark that SA finds all known optimal solutions
(20 of 30) in very short computational times compared to the other methods.
The quality of the other solutions is also better than that found in the literature.
The algorithm strictly improved the best known solutions for 9 instances of the
data sets. Those instances are 7M , 7L, 8L, 9S, 9M , 9L, 10S, 10M and 10L. To
summarize, our SA is clearly fast and efficient.
4 Conclusion
References
1. Bouly, H., Moukrim, A., Chanteur, D., Simon, L.: An iterative destruc-
tion/construction heuristic for solving a specific vehicle routing problem. In:
MOSIM’08 (2008) (in French)
2. Bredström, D., Rönnqvist, M.: A branch and price algorithm for the combined
vehicle routing and scheduling problem with synchronization constraints (Feb 2007)
3. Bredström, D., Rönnqvist, M.: Combined vehicle routing and scheduling with tem-
poral precedence and synchronization constraints. Eur. J. Oper. Res. 191(1), 19–31
(2008)
4. Drexl, M.: Synchronization in vehicle routing a survey of vrps with multiple syn-
chronization constraints. Transp. Sci. 46(3), 297–316 (2012)
5. Fischetti, M., Lodi, A.: Local branching. Math. Program. 98(1–3), 23–47 (2003)
6. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science 220, 671–680 (1983)
7. Potvin, J.Y., Kervahut, T., Garcia, B.L., Rousseau, J.M.: The vehicle routing prob-
lem with time windows part I: tabu search. INFORMS J. Comput. 8(2), 158–164
(1996)
8. Solomon, M.M., Desrosiers, J.: Time window constrained routing and scheduling
problems. Transp. Sci. 22, 1–13 (1988)
9. Toth, P., Vigo, D.: The Vehicle Routing Problem. Monographs on Discrete Mathe-
matics and Applications. Society for Industrial and Applied Mathematics, Philadel-
phia (2002)
Solution of the Maximum k-Balanced
Subgraph Problem
1 Introduction
maximize ∑_{(i,j)∈A0} xij
subject to ∑_{i∈O(j)} xij ≤ 1, ∀ j ∈ V,   (1)
∑_{i∈V} xii ≤ k,   (2)
3 Branch-and-Cut Algorithm
Due to space limitations, we only sketch the principal components of this
algorithm. The branch-and-cut algorithm implemented has two basic compo-
nents: the initial formulation and the cut generation. The initial formulation is
composed of inequalities (1), (2), (3), (4), all the trivial inequalities, and a
subset of inequalities (5) (with |S| = 1).
A partial description of the polytope associated with the formulation described
in the last section was carried out. We described families of valid inequalities associated
with substructures of the graph (negative cliques, positive cliques and holes)
and based on a related problem (the stable set problem). The cut generation
component consists, basically, in detecting cliques and holes in G. A GRASP
heuristic was used for finding clique cuts and a modification of Hoffman and
Padberg's heuristic [10] for finding odd hole cuts.
4 Primal Heuristics
Table 1. Random instances not solved by the branch-and-cut algorithm within the
time limit.
Instance                          Branch-and-cut
|V|   d    |E−|/|E+|   k     LR root   LB   UB   Nodes
50   .25      .5       2      36.92    25   29    1591
50   .50      .5       2      27.74    17   27     273
50   .75      .5       2      22.91    12   22     299
50   .25      1       10      37.77    27   37     243
50   .25      .5      10      38.09    26   37      97
50   .50      1       10      29.48    18   28     202
50   .50      2       10      29.72    22   25    2164
50   .50      .5      10      29.75    17   29     150
50   .75      1       10      22.77    14   22     258
50   .75      .5      10      24.07    12   23     106
The branch-and-cut algorithm and the heuristic procedures are coded in C++
and run on an Intel(R) Pentium(R) 4 CPU at 3.06 GHz, equipped with 3 GB of
RAM. We use Xpress-Optimizer 20.00.21 to implement the components of the
enumerative algorithm. The CPU time limit is set to 1 h for the branch-and-cut
and to 5 min for the heuristics.
We run our experiments with the branch-and-cut algorithm on a set of 74
random instances. The random instances were generated by varying |V |, graph
density d, rate |E − |/|E + | and parameter k, respectively, in sets {20, 30, 40, 50},
{0.25, 0.50, 0.75}, {0.5, 1.0, 2.0} and {2, |V |/5}. All random instances with
up to 40 vertices were solved to optimality; those with 20 and 30 vertices were
solved in few seconds. For each unsolved instance, Table 1 exhibits the value of
the linear relaxation (LR) at the root of the branch-and-cut tree; upper (UB)
and lower (LB) bounds obtained and the total number of nodes in the branch-
and-cut tree. The metaheuristics used in the separation procedures proved to
be quite efficient: effective cuts were found and the time spent in separation
was around 20 % of the CPU time. However, for half of the instances in Table 1,
the upper bound after 1 h of computation is still very close to the value of
the LR at the root. This seems to indicate that stronger inequalities are
needed to optimally solve large instances of this problem.
We ran our experiments with the GRASP and the ILS-VND procedures on a set of
portfolio instances described in [3]. This test set is composed of instances
with the number of vertices varying in the set {30, 60, 90, . . . , 510} and a
threshold value t (used to define the sets of positive and negative edges)
varying in the set {0.300, 0.325, 0.350, 0.375, 0.400}. For each combination of
these values, 10 different signed graphs were randomly chosen, each
representing a different subset of stocks, for a total of 850 instances. For
more details on the definition of these instances, we refer the reader to [8,11].
Figure 1 presents the results obtained for the portfolio instances when k = 2.
The x-axis exhibits instances ordered primarily by number of vertices and
secondarily by the threshold value, while the y-axis exhibits the percentage
gaps. Clearly, the GRASP metaheuristic found the best heuristic solution for
almost all portfolio instances when k = 2. Similar results were obtained for
the portfolio instances with k = |V |/5.
6 Future Research
is observed for other sets of instances in the literature (e.g. the community
structure instances defined in [3]). For each test set, we also need to
investigate which heuristic is the best option when a quick solution (within a
few seconds) is needed.
References
1. Barahona, F., Mahjoub, A.R.: Facets of the balanced (acyclic) induced subgraph
polytope. Math. Program. 45, 21–33 (1989)
2. Campelo, M., Correa, R.C., Frota, Y.: Cliques, holes and the vertex coloring poly-
tope. Inf. Process. Lett. 89, 1097–1111 (2004)
3. Figueiredo, R., Frota, Y.: The maximum balanced subgraph of a signed graph:
applications and solution approaches. Paper submitted (2012)
4. Figueiredo, R., Labbé, M., de Souza, C.C.: An exact approach to the problem
of extracting an embedded network matrix. Comput. Oper. Res. 38, 1483–1492
(2011)
5. Frota, Y., Maculan, N., Noronha, T.F., Ribeiro, C.C.: A branch-and-cut algorithm
for partition coloring. Networks 55, 194–204 (2010)
6. Gülpinar, N., Gutin, G., Mitra, G., Zverovitch, A.: Extracting pure network subma-
trices in linear programs using signed graphs. Discrete Appl. Math. 137, 359–372
(2004)
7. Harary, F., Kabell, J.A.: A simple algorithm to detect balance in signed graphs.
Math. Soc. Sci. 1, 131–136 (1980)
8. Harary, F., Lim, M., Wunsch, D.C.: Signed graphs for portfolio analysis in risk
management. IMA J. Manag. Math. 13, 1–10 (2003)
9. Heider, F.: Attitudes and cognitive organization. J. Psychol. 21, 107–112 (1946)
10. Padberg, M., Hoffman, K.L.: Solving airline crew scheduling problems. Manag. Sci.
39, 657–682 (1993)
11. Hüffner, F., Betzler, N., Niedermeier, R.: Separator-based data reduction for signed
graph balancing. J. Comb. Optim. 20, 335–360 (2010)
12. Martin, O.C., Stützle, T., Lourenço, H.R.: Iterated local search. In: Handbook of
Metaheuristics, pp. 1355–1377. Kluwer Academic Publishers, Norwell (2003)
13. Hansen, P., Mladenović, N.: Variable neighborhood search. Comput. Oper. Res.
24(11), 1097–1100 (1997)
14. Resende, M.G.C., Ribeiro, C.C.: GRASP: greedy randomized adaptive search pro-
cedures. In: Burke, E.K., Kendall, G. (eds.) Search Methodologies, 2nd edn, pp.
285–310. Springer, New York (2013)
Racing with a Fixed Budget
and a Self-Adaptive Significance Level
1 Introduction
Racing algorithms were first introduced in [1] to solve the model selection problem in
Machine Learning. Within the parameter tuning context, a Racing algorithm takes
as input a number of parameter settings k, a computational budget N, and a
problem instance generator I. Running each parameter setting i on an instance j
consumes a portion of N and returns a performance value Uij for that setting on
that instance. These performance values are used to identify inferior parameter settings
and systematically discard them from the race. When only one parameter setting
remains, or when N is consumed, the race terminates. If N is consumed with more than
one surviving parameter setting, the one with the best aggregated performance value is
selected as the winner. In both cases, only one parameter setting is returned by the
algorithm to be used to solve other instances of the problem (i.e. test instances). The
assumption made here is that the training set is representative and the parameter
setting which performed best on the training set will also perform best on the test set.
Racing initially runs all parameter settings on n0 ≪ N instances to get an
estimate of their performance; then a two-stage statistical test is carried out
to determine which parameter setting(s) to keep and which to discard. The first
stage only detects whether at least one parameter setting is significantly
different from the rest; tests such as
Analysis of Variance (ANOVA), cf. [2], for normally distributed data, or
rank-based ANOVA, such as Friedman's F-test or the Kruskal-Wallis test, for
non-normal data are usually used. If at least one parameter setting is
significantly different from the rest, the second-stage test is carried out to
identify inferior parameter settings, using tests such as the paired t-test for
normally distributed data, or any non-parametric post hoc test for non-normal
data; see [3, 4] for various suggestions. Following this filtering stage, the
surviving parameter settings are tested on a new instance and the tests are
applied again. This continues until only one parameter setting remains, or
until N is consumed. See Fig. 1 for an illustration.
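The generic racing loop just described can be summarized in a few lines of
code. The following Python sketch is ours, not the authors'; the callbacks
run, first_stage_test and second_stage_test are hypothetical placeholders for
the instance runner and the two statistical tests.

import numpy as np

def race(settings, instances, budget, n0, first_stage_test, second_stage_test, run):
    alive = list(settings)                  # parameter settings still in the race
    results = {s: [] for s in settings}     # performance values U_ij per setting
    used = 0
    for j, inst in enumerate(instances):
        if used >= budget or len(alive) == 1:
            break
        for s in alive:                     # evaluate every survivor on instance j
            results[s].append(run(s, inst))
            used += 1
        if j + 1 < n0:                      # no testing before n0 instances are seen
            continue
        data = np.array([results[s] for s in alive])
        if first_stage_test(data):          # stage 1: is any setting different?
            alive = second_stage_test(alive, data)  # stage 2: drop inferior ones
    # winner: best aggregated performance (here: lowest mean) among survivors
    return min(alive, key=lambda s: np.mean(results[s]))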
F-Race [5] is a Racing algorithm which uses the F-test followed by a
non-parametric pairwise comparison with the current best to discard inferior
parameter settings. It requires setting a significance level α for both tests.
If α is set to a high value, parameter settings, including the true best, are
likely to be dropped based on only a few samples, and the algorithm terminates
before consuming the entire budget. On the other hand, if it is set to a small
value, F-Race will be very conservative, rarely discarding any parameter
setting, and will allocate the computational budget almost equally between the
competing parameter settings. In both cases the performance of F-Race is
sub-optimal, which is reflected in the U-shaped curve in Fig. 2, where F-Race
has to find the best out of 10 parameter settings with a limited budget of at
most 1000 function evaluations. Another observation from Fig. 2 is that the
chosen α does not correspond to the actual failure rate (the percentage of runs
in which F-Race fails to select the true best parameter setting): for example,
an α of 0.08 achieves a 0.12 failure rate, and an α of 0.001 achieves a 0.097
failure rate. This demonstrates that setting an appropriate value for the
parameter α is not straightforward.
In this paper, we propose a simple modification to F-Race which automatically
adapts α such that the failure rate f, after a pre-determined budget N has been
used up, is minimized. Using a fixed N reflects a typical real-world scenario
where a decision has to be made within a limited time. To calculate f we use
data drawn from probability distributions with pre-defined means, variances,
and covariances, and assume that the best parameter setting is the one with the
lowest mean. The rest of the paper is organized as follows: Sect. 2 reviews
related work on Racing algorithms, Sect. 3 describes how to adapt α and the
experimental setup, Sect. 4 presents the empirical results, and Sect. 5
concludes the paper.
[Fig. 1 (illustration): parameter settings PS1–PS4 are evaluated on one
instance per iteration; after iteration 3 the first test drops PS3 and PS4,
after iteration 4 the second test drops nothing, and after iteration 5 the
third test drops PS1.]
[Fig. 2 (plot): average budget consumed and failure rate of F-Race as a
function of the significance level.]
2 Literature Review
k individuals equally to find the best one, Racing was used to efficiently
allocate the computational budget to the most promising individuals. In a
similar fashion, Racing was combined with Ant Colony Optimization in [15], Mesh
Adaptive Direct Search [16], Bound Optimization By Quadratic Approximation,
Covariance Matrix Adaptation Evolution Strategy, and Uniform Random and
Iterated Random Sampling in [17].
F-Race can only select from the initial set of parameter settings provided to
it. If a better parameter setting exists for that algorithm but was not
included in the initial set, it will never be discovered. An interesting
modification of F-Race to overcome this issue can be found in [18], where the
authors insert a new parameter setting into the race after each iteration; this
parameter setting is first tested on as many instances as the others, then the
statistical tests are carried out. Yet another exploration mechanism added to
F-Race can be found in [19], where the authors created an iterated version of
F-Race (I/F-Race); each iteration is a single race, and with each iteration the
initial set of parameter settings is biased towards the best. In their
implementation, biasing was done analogously to an Estimation of Distribution
Algorithm working at a higher level than F-Race.
In all of these applications, and many others, the significance levels of the
statistical tests were set by the user to a fixed value throughout the race. In
the following section we propose a simple modification to automatically adapt α
such that f is minimized given a fixed N. We restrict the application of this
modification to F-Race.
In the following, we assume that F-Race is run under a fixed budget constraint
N; finishing at a time t < N brings no benefit. If more than one parameter
setting remains in the race after N is consumed, the race is aborted and the
parameter setting with the best estimated performance is selected. The basic
idea of the proposed approach is to allow F-Race to consume the entire budget
even if the race terminates beforehand. Starting with k, N, I, and α (the
latter set to a relatively high default value, later shown to be irrelevant),
F-Race is run until a single parameter setting remains. Because α has been
chosen large, it is likely that the race terminates before N is consumed. If
this happens, we roll back to the point/iteration where the first parameter
setting was discarded, lower α by a factor of q, and then repeat the
statistical tests on all the parameter settings. Because of the smaller α,
F-Race is now more conservative and less likely to discard parameter settings.
Obviously, all the samples already collected are kept in memory and reused in
subsequent iterations if needed; only those parameter settings which had
previously been discarded, but are now back in the race due to the lower α,
need to be sampled. This "reset" is done as many times as needed until N is
consumed, each time discounting the previous α by q. See Fig. 3.
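In code, the reset mechanism can be sketched as below (a minimal sketch under
our own naming, not the authors' implementation): f_race_from is a hypothetical
routine that resumes a race from a given iteration, reusing the sample cache,
and reports the survivors, the budget it spent, and the iteration at which the
first setting was discarded.

def f_race_r(settings, alpha0, q, budget, f_race_from, cache):
    alpha, used, start = alpha0, 0, 0
    survivors = list(settings)
    while used < budget:
        survivors, spent, first_drop = f_race_from(settings, alpha, start,
                                                   budget - used, cache)
        used += spent
        if len(survivors) > 1:   # budget ran out with several survivors: stop
            break
        alpha *= q               # race ended early: lower alpha, be conservative
        start = first_drop       # roll back to where the first setting was dropped
    return survivors             # the best estimated survivor is finally selected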
Assessment of the proposed method is based primarily on f, calculated over many
replications r. This requires prior knowledge of which parameter setting is the
best; therefore, we draw numbers from multivariate normal distributions with
pre-defined means, variances, and covariances. Using such data helps to better
understand how the method works and where it fails. In addition to f, the
dropout rate d (the portion of
times the best parameter setting was discarded from the race over all
replications) is reported. Clearly f ≥ d, since f accounts both for failures
due to dropout and for failures due to variation in the data, which can lead to
a sub-optimal parameter setting appearing better than the true best.
The proposed method, F-Race_R, is compared to the standard F-Race, which uses a
fixed α, and to Equal Allocation. Comparisons are based on f and d plotted
against different α values. Specifically, an α value is set for both F-Race and
F-Race_R (an initial one for the latter) and they are run until termination. If
the returned parameter setting does not correspond to the one with the true
best mean, that method is considered to have failed on that replication. The
same process is repeated for all r to find f and d. This yields one point on
the plot; other values are obtained by changing α. F-Race_R introduces its own
parameter q, which determines the new α at each reset. It is shown that
F-Race_R is robust to different values of q: we expect the performance of
F-Race_R to be insensitive to q, provided q is not set too low, as decreasing α
by a large amount may result in the optimal value of α being skipped.
The specific settings of the experiments are:
• k = 10, N = 1000 FEs, n0 = 10 instances, r = 3000, q = 0.5, 0.2
• α values used to construct the f and d plots: 0.2, then 0.1 down to 0.01 in
steps of 0.01, and 0.009 down to 0.001 in steps of 0.001. These represent
initial values for F-Race_R
• Sampling distribution of the performance values: N(i, 10²) ∀i = 1, . . . , k, or
N(U(1, 10), U(24, 48)) ∀i = 1, . . . , k
• Correlation values used were: 0, 0.1, 0.3, and 0.6. Although the 0-correlation case
does not apply to F-Race for tuning parameter settings, it was included to observe
the performance of the algorithm under such conditions.
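This data generation can be reproduced along the following lines; the
constant-correlation covariance structure below is our assumption about how the
stated correlations were imposed, not a detail taken from the paper.

import numpy as np

def sample_utilities(k=10, sd=10.0, rho=0.1, n_instances=100, seed=0):
    rng = np.random.default_rng(seed)
    means = np.arange(1, k + 1, dtype=float)   # setting i has mean i, i = 1..k
    cov = np.full((k, k), rho * sd**2)         # common pairwise covariance
    np.fill_diagonal(cov, sd**2)               # variance 10^2 on the diagonal
    # one row of utilities per instance, one column per parameter setting
    return rng.multivariate_normal(means, cov, size=n_instances)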
In the original implementation of F-Race [19], the F-test is replaced with the
Wilcoxon matched-pairs signed-ranks test if only two parameter settings remain
in the race. The Wilcoxon test statistic T follows a Wilcoxon distribution, for
which tabulated critical values are available for a number of significance
levels and sample sizes; however, the α values which F-Race_R may require with
every reset are not available and can only be calculated by enumeration, which
is infeasible within F-Race_R. A normal approximation of the Wilcoxon T
statistic is possible if there are many ties (which is not the case here, since
the utility values are drawn from continuous probability distributions) or if
the sample size is large, although there is no general agreement on what a
large enough sample size is [4]. Given these difficulties, we chose not to
replace the F-test with the Wilcoxon test when two parameter settings remain.
This applies to both F-Race and F-Race_R.
4 Results
[Fig. 4 (plots): failure rate and dropout rate vs. significance level in four
panels: a) independent, b) 0.1 correlation, c) 0.3 correlation, d) 0.6
correlation; curves for Equal Allocation, F-Race (fail/drop) and F-Race_R with
q = 0.5 and q = 0.2 (fail/drop).]
Fig. 4. F-Race vs. F-Race_R using utility values drawn from N(i, 10²) for i = 1, …, k
[Fig. 5 (plots): failure rate and dropout rate vs. significance level in four
panels: a) independent, b) 0.1 correlation, c) 0.3 correlation, d) 0.6
correlation; curves for Equal Allocation, F-Race (fail/drop) and F-Race_R with
q = 0.5 and q = 0.2 (fail/drop).]
Fig. 5. F-Race vs. F-Race_R using utility values drawn from N(U(1, 10), U(24, 48)) for i = 1, …, k
Table 1. Comparing the best f obtained by F-Race with the f obtained by
F-Race_R started at a single high default value of α.

Sampling distribution of utility values   Correlation   F-Race_R @ 0.2   F-Race_R @ 0.1
                                                        p-value          p-value
N(i, 10²) ∀i = 1, . . . , k               0             0.000            0.000
                                          0.1           0.000            0.000
                                          0.3           0.001            0.000
                                          0.6           0.000            0.000
5 Conclusion
A new method to automatically adjust the significance level of the F-Race
algorithm for best performance under a fixed budget was presented. It was shown
that a chosen significance level does not correspond to the actual failure rate
the user observes, and that choosing an appropriate significance level is not
straightforward. The proposed method, F-Race_R, allows the user to set a
computational budget; it then adapts the significance level such that the
failure rate is minimized. This is achieved by systematically lowering the
initial significance level each time the race terminates, until the entire
budget is consumed.
Experiments were carried out using performance values drawn from normal dis-
tributions with known means, variances, and covariances. Results show that F-Race_R
is quite robust to the initial significance level chosen, as long as it is not too low,
demonstrating its ability to adapt it online. Finally, and perhaps most importantly,
F-Race_R is able to find significance levels which achieve lower failure rates than any
fixed significance level used in F-Race. The 1-sample sign test indicates that the
improvement in the failure rate is significant. F-Race_R comes with its own new
parameter, the reduction factor q. However, as the experiments show, F-Race_R
is also robust to different values of this reduction factor.
More experiments are still needed to better understand the full potential, and
limitations, of the proposed method. Utility values drawn from other normal and non-
normal distributions need to be tested, in addition to using actual utility values from
algorithms solving real optimization problems.
References
1. Maron, O., Moore, A.: Hoeffding races: accelerating model selection search for
classification and function approximation. Adv. Neural Inf. Process. Syst. 6, 59–66 (1994)
2. Schaffer, J.D., Caruana, R.A., Eshelman, L.J., Das, R.: A study of control parameters
affecting online performance of genetic algorithms for function optimization. In:
International Conference on Genetic Algorithms, pp. 51–60 (1989)
An Efficient Best Response Heuristic for a Non-Preemptive
Strictly Periodic Scheduling Problem
The graph G involved in constraint (1) contains the arcs (i, j) for which li,j > 0
(since otherwise the equation is trivially satisfied). Since some complications
are introduced when only one of the delays li,j or lj,i is strictly positive, we
will suppose in the following that G is symmetric (which can always be enforced
[6]). Seeing it as an undirected graph, we will write G(i) for the set of
neighbors of i, i.e. the set of tasks with which i is constrained.
max α (2)
s.t. (tj − ti ) mod gi,j ≥ li,j α ∀(i, j) ∈ G (3)
ti ∈ Z ∀i (4)
α≥0 (5)
The main component of the algorithm is called the best response procedure.
Following a game theory analogy, each task is seen as an agent which tries to
optimize its own offset ti, while the other offsets (t∗j) are fixed. Moreover,
the agent only takes into account the constraints in which it is involved, i.e.
only the constraints associated with its neighborhood G(i). We obtain the
following program:
Since the offsets are integer, and since the problem is periodic, we can always
impose ti to belong to {0, · · · , Ti − 1}. Therefore we can trivially solve the best
response program (BRi ) by computing the α-value for each ti ∈ {0, · · · , Ti − 1},
using the following expression, and selecting the best offset:
α = min_{j∈G(i)} min( ((ti − t∗j) mod gi,j) / lj,i , ((t∗j − ti) mod gi,j) / li,j )   (11)
This procedure runs in O(Ti N), hence any alternative method should at least be
faster. In [1], the authors propose a method that precomputes a set of
intersection points to reduce the number of evaluations. In the following, we
present a new line-search method which greatly speeds up the solution of (BRi).
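For reference, the baseline O(Ti N) scan of Eq. (11) can be written as follows.
This is a sketch with our own data layout (neighbors maps each j ∈ G(i) to the
tuple (gi,j, li,j, lj,i, t∗j)); the line-search method of this paper replaces
exactly this enumeration.

def best_response(T_i, neighbors):
    best_t, best_alpha = 0, float("-inf")
    for t_i in range(T_i):                    # enumerate all candidate offsets
        alpha = min(
            min(((t_i - t_j) % g) / l_ji,     # slack in direction j -> i
                ((t_j - t_i) % g) / l_ij)     # slack in direction i -> j
            for (g, l_ij, l_ji, t_j) in neighbors.values()
        )
        if alpha > best_alpha:                # keep the offset with the best alpha
            best_t, best_alpha = t_i, alpha
    return best_t, best_alpha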
Fig. 2. Possible values for (ti, α) when constrained by (a) a single task j, and
(b) all the tasks j ∈ G(i)
local fractional optimum, as well as the local integral optimum (see Sect. 2.3).
If the latter is better than the current best solution, we update it. We now
want to reach the next polyhedron. At the local fractional optimum, there are
two active lines, an increasing one and a decreasing one. Since the procedure
runs from left to right, we follow the decreasing line until we reach the
x-axis (i.e. the offset o∗k in Fig. 4). We can use this point as the new
reference offset. We continue until a whole period has been traversed. This
method is illustrated in Fig. 4.
3 Experimental Results
We test the method on non-harmonic instances, with N = 20 tasks and P = 4
processors, generated using the procedure described in [3]: the periods were
chosen in the set {2^x 3^y 50 | x ∈ [0, 4], y ∈ [0, 3]} and the processing times
were generated following an exponential distribution, averaging about 20 % of
the period. Columns 2 and 3 allow us to compare our results with a MILP
formulation presented in [6], restricted with a timeout of 200 s. Results for
the original heuristic [1,2] are presented in columns 4–7. Here, timesingle is
the time needed for a single run and starts2s = 2/timesingle measures the
average number of starts performed by the original heuristic in 2 s. Finally,
columns 8–11 contain the results for our version of the heuristic. In [1,2],
the stopping criterion
Table 1. Results of the MILP, the heuristic of [1], and the new version of the heuristic
id MILP (200s) Original heuristic [2] (Bayesian test) New heuristic (2s)
σMILP timesol σheuristic timesingle timestop starts2s σheuristic startssol starts2s timesol
0 2.5 159 2.3 1.43 101.33 1.4 2.5 35 3624 0.01932
1 2 18 2.01091 3.27 5064.67 0.61 2.01091 28 4477 0.01251
2 1.6 6 1.40455 1.52 869.45 1.32 1.6 2 2462 0.00162
3 1.6 4 1.6 4.34 8704.45 0.46 1.64324 45 2910 0.03093
4 2 5 1.92 3.48 1115.51 0.57 2 1 2489 0.00080
5∗ 3 7 1.43413 1.63 1498.21 1.23 3 1 2107 0.00095
6 2.5 54 2.3 1.44 101.25 1.39 2.5 35 3805 0.01840
7 2 19 2 0.23 302.27 8.7 2 3 4431 0.00135
8 2.12222 8 1.75794 1.03 871.8 1.94 2.12222 3 2513 0.00239
9∗ 2 11 2 2.42 3541.79 0.83 2 3 3575 0.00168
10 1.12 6 0.87 0.72 368.44 2.78 1.12 4 3466 0.00231
11 2.81098 20 0.847368 3.78 478.63 0.53 2.81098 1 3421 0.00058
12 1.5 7 1.5 0.27 313.74 7.4 1.5 4 3645 0.00219
13 1.56833 49 1.5 1.77 3293.33 1.13 1.56833 1 3863 0.00052
14 2 8 2 1.85 3873 1.08 2 2 2331 0.00172
was a Bayesian test, but column 4 shows that it often stopped with a solution
far from the best solution found by the MILP. Hence, for the new heuristic, the
stopping criterion is simply a timeout of 2 s. The value starts2s is the number
of times the heuristic was started during this period. This number is much
greater than the equivalent number for the original heuristic; our heuristic is
therefore much faster (about 3100 times on these instances). Moreover, timesol
represents approximately the time needed to find the best solution. This is to
be compared with the column timesol of the MILP formulation, which shows that
our version of the heuristic is very competitive with the MILP (Table 1).
References
1. Al Sheikh, A.: Resource allocation in hard real-time avionic systems - scheduling
and routing problems. Ph.D. thesis, LAAS, Toulouse, France (2011)
2. Al Sheikh, A., Brun, O., Hladik, P.E., Prabhu, B.: Strictly periodic scheduling in
IMA-based architectures. Real Time Syst. 48(4), 359–386 (2012)
3. Eisenbrand, F., Kesavan, K., Mattikalli, R.S., Niemeier, M., Nordsieck, A.W.,
Skutella, M., Verschae, J., Wiese, A.: Solving an avionics real-time scheduling prob-
lem by advanced IP-methods. In: de Berg, M., Meyer, U. (eds.) ESA 2010, Part I.
LNCS, vol. 6346, pp. 11–22. Springer, Heidelberg (2010)
4. Korst, J.: Periodic multiprocessors scheduling. Ph.D. thesis, Eindhoven University
of Technology, Eindhoven, The Netherlands (1992)
5. Megiddo, N.: Linear-time algorithms for linear programming in R³ and related prob-
lems. SIAM J. Comput. 12(4), 759–776 (1983)
6. Pira, C., Artigues, C.: An efficient best response heuristic for a non-preemptive
strictly periodic scheduling problem. Technical report LAAS-CNRS, Toulouse,
France, October 2012. http://hal.archives-ouvertes.fr/hal-00761345
Finding an Evolutionary Solution to the Game
of Mastermind with Good Scaling Behavior
Abstract. There are two main research issues in the game of Mastermind: one is
finding strategies that minimize the number of turns needed to find the
solution, and another is finding methods that scale well as the size of the
search space increases. In this paper we present a method that uses
evolutionary algorithms to find fast solutions to the game of Mastermind that
scale better with problem size than previously described methods; this is
obtained by fixing just one parameter.
is reflected by a score. However, scores are heuristic and there is no rigorous
way of scoring combinations. To compute these scores, every combination is
compared in turn with the rest of the combinations in the set, and the number
of combinations that yield each response (there is a limited number of possible
responses) is noted. This results in a series of partitions into which the set
of consistent combinations is divided by its distance (in terms of common
positions and colors) to every other combination, and in a set of combinations
with the best score; one of the combinations in this set is chosen
deterministically (using lexicographical order, for instance) or randomly. In
this paper we use Most Parts, proposed in [7], which takes into account only
the number of non-zero partitions.
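The Most Parts score is easy to state in code: partition the consistent set by
the response each of its members would give to a candidate guess, and count the
non-empty partitions. The sketch below is ours, not code from the paper; it
assumes combinations are strings or tuples of symbols, and the peg counting is
the standard Mastermind response rule.

from collections import Counter

def response(guess, code):
    black = sum(g == c for g, c in zip(guess, code))            # exact matches
    common = sum(min(guess.count(x), code.count(x)) for x in set(guess))
    return black, common - black                                # (blacks, whites)

def most_parts(candidate, consistent_set):
    partitions = Counter(response(candidate, c) for c in consistent_set)
    return len(partitions)          # number of non-zero partitions = the score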
The current state of the art was established by Berghman et al. in [4]. They
obtained a system that is able to find the solution in an average number of
moves that is, for all the sizes tested, better than previously published
results. The number of evaluations was not published, but the running time was.
In both respects, their solutions were quite good. However, many parameters had
to be set for each size, starting with the first guess and the size of the
consistent set, as well as the population size and other evolutionary algorithm
parameters. In this paper we try to adapt our previously published Evo method
by reducing the number of parameters without compromising too much on algorithm
performance. Even though a good solution can be found using only a sample of
the consistent set, as proved in [4,5], different set sizes do have an
influence on the outcome, and reducing the size to the minimum is bound to have
an influence on the result, in terms of the turns needed to win and the number
of evaluations needed to do it: it decreases the probability of finding, and
thus playing, the hidden combination, and also the probability of finding the
combination that maximally reduces the search space when played. However, in
this paper we show that good solutions can be found by using a small and, what
is more, common set size across all Mastermind problem sizes.
In the next section we present the experiments carried out and their results
for sizes from Δ = 4, φ = 8 to Δ = 7, φ = 10.
This paper uses the method called, simply, Evo [8–11]. This method, which has
been released as open source code at CPAN (http://search.cpan.org/dist/
Algorithm-MasterMind/), is an evolutionary algorithm that has been optimized
for speed and to obtain the minimal number of evaluations possible. An
evolutionary algorithm [12] is a Nature-inspired search and optimization method
that, modeling natural evolution and its molecular basis, uses a (codified)
population of solutions to find the optimal one. Candidate solutions are scored
according to their closeness to the optimal solution (their fitness), and the
whole population is evolved by discarding solutions with the lowest fitness and
making those with the highest fitness reproduce via combination (crossover) and
random change (mutation).
Table 1. Comparison among this approach (Evo10) and previous results published by
the authors (Evo++) in [11] and Berghman et al. [4].
(a) Mean number of guesses and the standard error of the mean for Δ = 4, 5; the
quantities in parentheses indicate population and consistent set size (in the case of the
previous results).
Δ=4 Δ=5
φ=8 φ=8 φ=9
Berghman et al. 5.618
Evo++ (400,30) 5.15 ± 0.87 (600,40) 5.62 ± 0.84 (800,80) 5.94 ± 0.87
Evo10 (200) 5.209 ± 0.91 (600) 5.652 ± 0.818 (800) 6.013 ± 0.875
(b) Mean number of guesses and the standard error of the mean for Δ = 6, 7; the
quantities in parentheses indicate population and consistent set size (in the case of the
previous results).
Δ=6 Δ=7
φ=9 φ=10 φ=10
Berghman et al. 6.475
Evo++ (1000,100) 6.479 ± 0.89
Evo10 (800) 6.504 ± 0.871 (1000) 6.877 ± 0.013 (1500) 7.425 ± 0.013
(c) Mean number of evaluations and its standard deviation, Δ = 4, 5.
Δ=4 Δ=5
φ=8 φ=8 φ=9
Evo++ 6412 ± 3014 14911 ± 6120 25323 ± 9972
Evo10 2551 ± 1367 7981 ± 3511 8953 ± 3982
(d) Mean number of evaluations and its standard deviation, Δ = 6, 7.
Δ=6 Δ=7
φ=9 φ=10 φ=10
Evo++ 46483 ± 17031
Evo10 17562 ± 135367 21804 ± 67227 40205 ± 65485
The first of these tables, which reports the average number of moves needed to
find the solution, shows results that are quite similar. The average for Evo10
is consistently higher (more turns are needed to find the solution), but in
half the cases the difference is not statistically significant according to the
Wilcoxon paired test. There is a significant difference for the two smaller
sizes (Δ = 4, φ = 8 and Δ = 5, φ = 8), but not for the larger sizes Δ = 5, φ = 9
and Δ = 6, 7. This is probably due to the fact that, with increasing search
space size, the difference between 10 and any other sample size, even if they
are orders of magnitude apart, becomes negligible: the difference between
sampling 10 % and 1 % of the actual sample size is significant, but the
difference between 0.001 % and 0.0001 % is not.
However, the difference in the number of evaluations (shown in Tables 1(c) and
(d)), that is, in the total population evaluated to find the solution, is quite
significant, going from a bit less than half to a third of the total evaluations
for the larger size. This means that the time needed scales roughly in the same
way, but it is even more interesting to note that it scales better for a fixed
sample size than for the best consistent set size. Besides, in all cases the
algorithm does not examine the full set of combinations, while previously the
number of combinations evaluated, 6412, was almost 50 % bigger than the search
space size for that problem. The same argument can be applied to the comparison
with Berghman's results (where available): Evo++ was able to find solutions
quite similar to theirs, but Evo10 obtains an average number of turns that is
slightly worse. Since we do not have their complete set of results, and
besides, their experiments were made on a different set of combinations, a
direct comparison is not possible, but at any rate it would be reasonable to
think that this difference is significant.
It is also clear that, when increasing the search space size, the size of the
sample will become negligible with respect to the actual size of the consistent
set. This could work both ways: first, by making the results independent of the
sample size (for this small size, at least), or by making the strategy of
extracting a sample of a particular size indistinguishable from finding a
single consistent combination and playing it. As we improve computation speed,
it would be interesting to take measurements to test these hypotheses.
References
1. Meirovitz, M.: Board game. US Patent 4,241,923 (December 30, 1980)
2. Knuth, D.E.: The computer as master mind. J. Recreational Math. 9(1), 1–6 (1976–
1977)
3. Montgomery, G.: Mastermind: improving the search. AI Expert 7(4), 40–47 (1992)
4. Berghman, L., Goossens, D., Leus, R.: Efficient solutions for mastermind using
genetic algorithms. Comput. Oper. Res. 36(6), 1880–1885 (2009)
5. Runarsson, T.P., Merelo-Guervós, J.J.: Adapting heuristic mastermind strategies
to evolutionary algorithms. In: González, J.R., Pelta, D.A., Cruz, C., Terrazas, G.,
Krasnogor, N. (eds.) NICSO 2010. SCI, vol. 284, pp. 255–267. Springer, Heidelberg
(2010). ArXiV: http://arxiv.org/abs/0912.2415v1
6. Merelo-Guervós, J.J., Mora, A.M., Cotta, C., Runarsson, T.P.: An experimental
study of exhaustive solutions for the mastermind puzzle. CoRR abs/1207.1315
(2012)
7. Kooi, B.: Yet another mastermind strategy. ICGA J. 28(1), 13–20 (2005)
8. Cotta, C., Merelo Guervós, J.J., Mora García, A.M., Runarsson, T.P.: Entropy-
driven evolutionary approaches to the mastermind problem. In: Schaefer, R., Cotta,
C., Kołodziej, J., Rudolph, G. (eds.) PPSN XI. LNCS, vol. 6239, pp. 421–431.
Springer, Heidelberg (2010)
9. Merelo, J., Mora, A., Runarsson, T., Cotta, C.: Assessing efficiency of different
evolutionary strategies playing mastermind. In: 2010 IEEE Symposium on Com-
putational Intelligence and Games (CIG), pp. 38–45, August 2010
10. Merelo, J.J., Cotta, C., Mora, A.: Improving and scaling evolutionary approaches
to the mastermind problem. In: Di Chio, C., et al. (eds.) EvoApplications 2011,
Part I. LNCS, vol. 6624, pp. 103–112. Springer, Heidelberg (2011)
11. Merelo-Guervós, J.J., Mora, A.M., Cotta, C.: Optimizing worst-case scenario in
evolutionary solutions to the MasterMind puzzle. In: IEEE Congress on Evolu-
tionary Computation, pp. 2669–2676. IEEE (2011)
12. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Springer, Hei-
delberg (2003)
A Fast Local Search Approach
for Multiobjective Problems
1 Introduction
Pareto set of this problem. The aim is to locate n stations in a given area so
as to maximise several daily requests of population flows.
The area is discretized into a grid and all the flows are stored in a 3D matrix
F = (fi,j,t), where fi,j,t represents the number of displacements from cell i to
cell j during time period t. We have 3 objectives for locating the charging
stations:
f1: flow maximization, i.e. the locations must allow us to maximize the flows
between themselves
f1 = max_{s∈Ω} Σ_{sti∈s} Σ_{stj∈s\{sti}} f(sti, stj)   (1)

With,
4 Performance Analysis
be a good way to distinguish these sets. Here we have considered the additive
ε-indicator [14]. The unary additive ε-indicator gives the minimum factor by
which a set A has to be translated in order to dominate the reference set R. As
we do not know the optimal reference set of the problem, we composed an
approximate R from the best solutions obtained with PLS and FLS-MO over many
runs.
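For completeness, the unary additive ε-indicator of [14] admits a very short
computation. The sketch below is our own illustration; it assumes minimization
objectives and represents each solution as a tuple of objective values. A set A
with additive_epsilon(A, R) ≤ 0 weakly dominates the reference set R.

def additive_epsilon(A, R):
    # smallest eps such that every r in R is weakly dominated by some a in A
    # shifted by eps: I(A, R) = max_{r in R} min_{a in A} max_k (a_k - r_k)
    return max(
        min(max(a_k - r_k for a_k, r_k in zip(a, r)) for a in A)
        for r in R
    )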
References
1. Paquete, L., Chiarandini, M., Stützle, T.: Pareto local optimum sets in the biob-
jective traveling salesman problem: an experimental study. In: Gandibleux, X.,
Sevaux, M., Sörensen, K., T’kindt, V., Fandel, G., Trockel, W. (eds.) Metaheuris-
tics for Multiobjective Optimisation. Lecture Notes in Economics and Mathemat-
ical Systems, vol. 535, pp. 177–199. Springer, Heidelberg (2004)
2. Coello, C., Lamont, G.: Applications of Multi-Objective Evolutionary Algorithms,
vol. 1. World Scientific, Singapore (2004)
3. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective
genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
4. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case
study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3, 257–271
(1999)
5. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: improving the strength Pareto evo-
lutionary algorithm. TIK-Report 103 (2001)
6. Knowles, J., Corne, D.: M-PAES: a memetic algorithm for multiobjective optimiza-
tion. In: Proceedings of the 2000 Congress on Evolutionary Computation, vol. 1,
pp. 325–332 (2000)
7. Jaszkiewicz, A.: Genetic local search for multi-objective combinatorial optimiza-
tion. Eur. J. Oper. Res. 137(1), 50–71 (2002)
8. Wu, Z., Chow, T.S.: A local multiobjective optimization algorithm using neighbor-
hood field. Struct. Multi. Optim. 46, 853–870 (2012)
9. Liefooghe, A., Humeau, J., Mesmoudi, S., Jourdan, L., Talbi, E.-G.: On dominance-
based multiobjective local search: design, implementation and experimental analy-
sis on scheduling and traveling salesman problems. J. Heuristics 18, 317–352
(2012). doi:10.1007/s10732-011-9181-3
10. Gandibleux, X., Freville, A.: Tabu search based procedure for solving the 0-1 mul-
tiobjective knapsack problem: the two objectives case. J. Heuristics 6, 361–383
(2000). doi:10.1023/A:1009682532542
11. Hansen, M.P.: Tabu search for multiobjective optimization: MOTS. In: MCDM'97.
Springer (1997)
12. Shaheen, S.A., Cohen, A.P.: Worldwide Carsharing Growth: An International Com-
parison. University of California, Berkeley (2008)
13. de Almeida Correia, G.H., Antunes, A.P.: Optimization approach to depot location
and trip selection in one-way carsharing systems. Transp. Res. Part E 48(1), 233–
247 (2012)
14. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C., da Fonseca, V.: Performance
assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol.
Comput. 7, 117–132 (2003)
Generating Customized Landscapes
in Permutation-Based Combinatorial
Optimization Problems
1 Introduction
2 Mallows Model
The Mallows model [4] is an exponential probability model for permutations
based on a distance. This distribution is defined by two parameters: the central
permutation σ0 , and the spread parameter θ. If Ω is the set of all permutations
of size n, for each σ ∈ Ω the Mallows distribution is defined as:

p(σ) = e^{−θ d(σ0,σ)} / Z(θ),
where Z(θ) = Σ_{σ′∈Ω} e^{−θ d(σ0,σ′)} is a normalization term and d(σ0, σ) is the
distance between the central permutation σ0 and σ. The most commonly used
distance is the Kendall tau. Given two permutations σ1 and σ2 , it counts the
minimum number of adjacent swaps needed to convert σ1 into σ2 . Under this
metric the normalization term Z(θ) has closed form and does not depend on σ0 :
Z(θ) = Π_{j=1}^{n−1} (1 − e^{−(n−j+1)θ}) / (1 − e^{−θ}).
Notice that if θ > 0, then σ0 is the permutation with the highest probability.
The rest of the permutations σ ∈ Ω − {σ0} have probability inversely
exponentially proportional to θ and to their distance to σ0. So, the Mallows
distribution can be considered analogous to the Gaussian distribution on the
space of permutations.
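A minimal sketch of the model, using the Kendall tau distance and the
closed-form normalization constant above (our own code, for illustration only):

import math
from itertools import combinations

def kendall_tau(s1, s2):
    # minimum number of adjacent swaps needed to convert s1 into s2
    pos = {v: i for i, v in enumerate(s2)}
    r = [pos[v] for v in s1]
    return sum(1 for i, j in combinations(range(len(r)), 2) if r[i] > r[j])

def mallows_prob(sigma, sigma0, theta):
    n = len(sigma)
    Z = math.prod((1 - math.exp(-(n - j + 1) * theta)) / (1 - math.exp(-theta))
                  for j in range(1, n))      # closed form under Kendall tau
    return math.exp(-theta * kendall_tau(sigma0, sigma)) / Z

Summing mallows_prob over all n! permutations returns 1, and a larger θ
concentrates the probability mass around σ0.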
3 Instance Generator
In this section we present a generator of instances of COPs whose solutions lie
in the space of permutations. Our generator defines an optimization function
based on a mixture of Mallows models.
The generator proposed in this paper uses 3m parameters: m central permutations
{σ1, ..., σm}, m spread parameters {θ1, ..., θm} and m weights {w1, ..., wm}. We
generate m Mallows models pi(σ|σi, θi), one for each σi and θi, ∀i ∈ {1, ..., m}.
The objective function value for each permutation σ ∈ Ω is then defined as the
mixture f(σ) = Σ_{i=1}^{m} wi pi(σ|σi, θi).
Some of these interesting properties are analyzed here. The first requirement
we consider is that all the central permutations σi be local optima. Clearly, in
order to be local optima, {σ1, ..., σm} have to fulfill d(σi, σj) ≥ 2, ∀i ≠ j.
A second constraint is that the objective function value of σi has to be
attained by the ith Mallows model, i.e.:
To satisfy (2), and taking into account constraint (1), we need to comply with:

wi / Z(θi) > wj pj(σ),  ∀j = 1, ..., m,  ∀σ s.t. d(σi, σ) = 1.

However, taking into account that if σ ∈ Ω is s.t. d(σi, σ) = 1, then
d(σj, σ) = d(σj, σi) − 1 or d(σj, σ) = d(σj, σi) + 1, Eq. (2) can be stated as:

wi / Zi(θi) > (wj / Zj(θj)) e^{−θj (d(σi,σj)−1)},  ∀j ∈ {1, 2, ..., m}, i ≠ j.   (3)
Notice that once the parameters θi have been fixed, the previous inequalities
are linear in the wi. So the values of the wi can be obtained as the solution
of a linear programming problem. However, we have not yet defined any objective
function to be optimized in our linear programming problem. This function can
be chosen taking into account the desired characteristics of the instance.
For example, one could aim for a landscape with attraction basins of similar
sizes. In this case, and without loss of generality, we consider that σ1 is the
global optimum and that σm is the local optimum with the lowest objective
function value. Our objective function then minimizes the difference between
the objective function values of these two permutations (and implicitly
minimizes the differences between the objective function values of all the
local optima). In addition, we have to include new constraints to enforce these
properties of the objective function values. This landscape can be generated as
follows:
wi / Z(θi) > wi+1 / Z(θi+1)   (∀i ∈ {1, 2, ..., m − 1})

wi / Z(θi) > (wj / Z(θj)) e^{−θj (d(σi,σj)−1)}   (∀i, j ∈ {1, 2, ..., m}, i > j)
4. Assign to each σ ∈ Ω the objective function value f(σ) = Σ_{i=1}^{m} wi pi(σ|σi, θi).
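A hedged sketch of this construction with an off-the-shelf LP solver (our own
code, not the authors'): the θi and the pairwise distances d[i][j] are assumed
fixed, the strict inequalities are imposed through a small margin delta, and
the normalization Σ wi = 1 is our own addition to keep the LP bounded.

import numpy as np
from scipy.optimize import linprog

def solve_weights(theta, d, Z, delta=1e-6):
    m = len(theta)
    c = np.zeros(m)                      # minimize f(sigma_1) - f(sigma_m)
    c[0], c[-1] = 1 / Z(theta[0]), -1 / Z(theta[-1])
    A_ub, b_ub = [], []
    for i in range(m - 1):               # ordering: w_i/Z_i > w_{i+1}/Z_{i+1}
        row = np.zeros(m)
        row[i], row[i + 1] = -1 / Z(theta[i]), 1 / Z(theta[i + 1])
        A_ub.append(row)
        b_ub.append(-delta)
    for i in range(m):                   # local optimality, inequality (3)
        for j in range(m):
            if i == j:
                continue
            row = np.zeros(m)
            row[i] = -1 / Z(theta[i])
            row[j] = np.exp(-theta[j] * (d[i][j] - 1)) / Z(theta[j])
            A_ub.append(row)
            b_ub.append(-delta)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=[np.ones(m)], b_eq=[1.0], bounds=(0, None))
    return res.x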
the Traveling Salesman Problem, the Flowshop Scheduling Problem, the Linear
Ordering Problem, etc. Moreover, we think that the model could be flexible
enough to represent the complexity of real-world problems.
References
1. De Jong, K.A., Potter, M.A., Spears, W.M.: Using problem generators to explore
the effects of epistasis. Seventh International Conference on Genetic Algorithms,
pp. 338–345. Morgan Kaufmann, San Francisco (1997)
2. Rönkkönen, J., Li, X., Kyrki, V., Lampinen, J.: A framework for generating tunable
test functions for multimodal optimization. Soft Comput., 1–18 (2010)
3. Gallagher, M., Yuan, B.: A general-purpose tunable landscape generator. IEEE
Trans. Evol. Comput. 10(5), 590–603 (2006)
4. Mallows, C.L.: Non-null ranking models. Biometrika 44(1–2), 114–130 (1957)
5. Fligner, M.A., Verducci, J.S.: Distance based ranking models. J. Roy. Stat. Soc.
48(3), 359–369 (1986)
Multiobjective Evolution of Mixed
Nash Equilibria
1 Introduction
In Game Theory [6] a mixed strategy is a probability distribution over the actions
available for the players. This allows a player to randomly choose an action
instead of choosing a single pure strategy. If only one action has a probability of
one to be selected, the player is said to use a pure strategy.
The most popular solution concept in game theory is Nash equilibrium [5].
A game state is a Nash equilibrium, if no player has the incentive to unilaterally
deviate from his/her strategy. Every finite game has a Nash equilibrium. How-
ever, not every game has a Nash equilibrium in pure strategies. Therefore, the
concept of mixed strategies is a fundamental component in Game Theory, as it
can provide Nash equilibria in games where no equilibrium in pure strategies
exists.
The empirical relevance of mixed strategies has often been criticized for being
"intuitively problematic" [1]. In [9] mixed strategies are viewed from the
perspective of evolutionary biology. According to other interpretations,
players randomize because they think their strategies may be observed in
advance by other players [8]. Although there are theoretical arguments trying
to rationalize this concept [4], it is not clear why and how players randomize
their decisions.
Besides the behavioral observation that people seldom make their choices
following a lottery, the most puzzling question arises from the "indifference"
property of a mixed strategy equilibrium. In a mixed equilibrium, given the
strategies
chosen by the other players, each player is indifferent among all the actions
that he/she may select with positive probability, as they do not affect the
resulting payoff. Therefore, there is no direct benefit in selecting precisely
the strategy that induces the opponents to be indifferent, as required for the
existence of the equilibrium. Then, in the absence of communication between
players, how can a mixed equilibrium arise in a real-world scenario, especially
in cases of incomplete information?
Computing a Nash equilibrium is a computationally complex problem [2,7]. We
experiment with a novel model that can lead to the emergence of mixed
equilibria. Here, agents aim to develop strategies for which the payoff outcome
of the opponent can be predicted.
where:
– S = S1 × S2 × ... × SN is the set of all possible situations of the game, and
s ∈ S is a strategy (or strategy profile) of the game;
– for each player i ∈ N, ui : S → R represents the payoff function.
Let us denote by (si, s∗−i) the strategy profile obtained from s∗ by replacing
the strategy of player i with si. A mixed strategy for player i is a mapping

φi : Si → R+,  with  Σ_{sk∈Si} φi(sk) = 1.

The set of mixed strategies available to player i is the set of all probability
distributions over Si. A player may assign probability 1 to a single action; in
this case the player chooses a pure strategy.
The payoff for player i in a game allowing mixed strategies is given by:

ui(φi) = Σ_{si∈Si} φi(si) ui(s−i, si).
The mixed strategy Nash equilibrium (or simply mixed Nash equilibrium) is an
extension of the Nash equilibrium. A mixed strategy profile φ∗ is a mixed Nash
equilibrium if there is no player i that would prefer the lottery over the
payoffs generated by the strategy profile (φi, φ∗−i):

ui(φ∗) ≥ ui(φi, φ∗−i),  ∀i ∈ N, ∀φi ∈ Σi.
Example 1. The game of Matching Pennies is a simple two-player game. Both
players have a penny, which they simultaneously turn to either heads (H) or
tails (T). If the two pennies match (both heads or both tails), Player 1 wins a
penny from Player 2; otherwise, Player 2 wins a penny from Player 1. The
payoffs for the game are depicted in Table 1.
The game does not have a Nash equilibrium in pure strategies; however, there is
a Nash equilibrium in mixed strategies, namely when both players choose Heads
with probability 1/2 and Tails with probability 1/2. This way both players end
up with an expected payoff of 0, and neither can do better by deviating from
this strategy.
In most cases a player can benefit from knowing the next move of the opponent,
so each player wants to keep his/her opponent guessing. An important feature of
a mixed Nash equilibrium is that, given the actions of the opponents, the
players are indifferent among the actions chosen with positive probability. For
example, in the Matching Pennies game, given that Player 2 chooses Heads
or Tails with the same probability, Player 1 is indifferent among its actions.
Thus the goal of each player is to randomize in such a way as to keep the
opponent indifferent.
3 Proposed Model
Rational agents often build internal models that anticipate the actions of the
other players and adapt their strategies accordingly. Here, we experiment with
a model for two-player games where players try to anticipate directly the game
payoff of the other player. The agents adapt their strategies in order to
reduce the uncertainty of this prediction. This is in contrast with the
classical scenario, where the players' foremost objective is to maximize their
game utility.
Let (w, p) be a mixed strategy profile for a two-player game, where w defines
the probability distribution over the actions available to the first player,
with p having a similar role for the second player. Let u1(w, p) and u2(w, p)
denote the game payoff for player one and for player two, respectively.
Then, the proposed model is formalized as follows:

o1 = argmin_w (1/m) Σ_{i=1}^{m} (u2(w, p) − u2(w, Δi(p)))²
o2 = argmin_p (1/m) Σ_{i=1}^{m} (u1(w, p) − u1(Δi(w), p))²    (1)
every possible strategy profile, while small perturbations assume only local
knowledge about the game outcomes.
– How many perturbations of the actual strategy profile are required at each
step for reliable results (how big should the parameter m be)?
We empirically investigate these issues in the next section.
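To make the two objectives in Eq. (1) concrete, the following sketch evaluates
them for a bimatrix game with payoff matrices A and B. This is our own
formulation for illustration; the Gaussian form of the perturbation operator
Δi and the renormalization are our assumptions, not details from the paper.

import numpy as np

def payoffs(w, p, A, B):
    return w @ A @ p, w @ B @ p              # (u1, u2) for the profile (w, p)

def objectives(w, p, A, B, m=3, eps=1e-2, rng=None):
    rng = rng or np.random.default_rng(0)
    def perturb(x):                           # a small random move Delta_i(x)
        y = np.clip(x + rng.normal(0.0, eps, len(x)), 0.0, None)
        return y / y.sum()                    # keep x a probability vector
    u1, u2 = payoffs(w, p, A, B)
    o1 = np.mean([(u2 - payoffs(w, perturb(p), A, B)[1]) ** 2 for _ in range(m)])
    o2 = np.mean([(u1 - payoffs(perturb(w), p, A, B)[0]) ** 2 for _ in range(m)])
    return o1, o2                             # both minimized, e.g. by NSGA-II [3]

At a mixed equilibrium the opponents are indifferent to these local
perturbations, so both objectives vanish, which is what drives the
multiobjective search toward it.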
4.1 Game 1
We study a game where each player has two actions. The payoff matrix is
described in Table 2.
Table 2. The payoff matrix for Game 1. The game has two pure Nash equilibria,
(p = 1, w = 0) and (p = 0, w = 1), and one mixed equilibrium at (p = 6/7, w = 6/7).
Table 3. The payoff matrix for Game 2. The game has one pure Nash equilibrium
and one mixed equilibrium at (p = 0, w = 1/2).
Table 4. Payoff matrix for Game 3. The game has no pure Nash equilibrium and
one mixed equilibrium at (p = 4/7, w = 2/7).
The game has two Nash equilibria in pure strategies and one mixed equilibrium
at (6/7, 6/7).
4.2 Game 2
We consider another two-player game, whose payoffs and actions are presented in
Table 3. The game has one pure Nash equilibrium (p = 1, w = 0), with the
corresponding payoff of (1, 0), and one mixed equilibrium at (0, 1/2).
4.3 Game 3
We consider a bimatrix zero-sum game, with the payoff matrix described in Table
4. The game has no pure Nash equilibrium, but it has one mixed equilibrium at
(4/7, 2/7).
4.4 Game 4
Finally, we consider the Rock-Paper-Scissors game. Both players simultaneously
choose a sign: Rock (R), Paper (P) or Scissors (S). The winner gets a dollar
from the loser according to the following rule: Paper beats (wraps) Rock, Rock
beats (blunts) Scissors and Scissors beats (cuts) Paper. The payoff matrix of
the game is depicted in Table 5.
The Rock-Paper-Scissors game has no Nash equilibrium in pure strategies. There
is, however, a single Nash equilibrium in mixed strategies, namely when both
players play all three strategies with equal probability.
Table 5. The Rock-Paper-Scissors game. The game has no pure Nash equilibrium
and one symmetric mixed equilibrium at (1/3, 1/3, 1/3).
Fig. 1. Semilogarithmic plot of the average convergence speed for various perturbation
magnitudes for Game 1.
Fig. 2. Average convergence speed when using various number of perturbed states for
Game 1.
Fig. 3. Semilogarithmic plot of the average convergence speed for various perturbation
magnitudes (Game 2).
Fig. 4. Average convergence speed when using various number of perturbed states
(1-7) (Game 2).
Fig. 5. Semilogarithmic plot of the average convergence speed for various perturbation
magnitudes (Game 3).
Fig. 6. Average convergence speed when using various number of perturbed states
(1-7) (Game 3).
the method displays a robust behavior for finding mixed equilibria, even for
very small perturbations. The obtained averages are displayed in Fig. 1 for
Game 1, Fig. 3 for Game 2, and Fig. 5 for Game 3. Surprisingly, as one can see
in these figures, the search is mostly insensitive to the amount of
perturbation, with slightly better behavior (faster convergence) for smaller
perturbations.
In a next step, we fix φ = 0.00000001 and experiment with various numbers of
perturbations (parameter m) used at each evaluation, ranging from 1 to 7.
Again, an average over 500 independent runs for each game and for each case is
computed. The results of this experiment are displayed in Figs. 2, 4 and 6.
In the case of Game 4, the strategy of a player is perturbed by Gaussian noise
with a standard deviation of φ = 0.15. The number of perturbations is equal to
three. 500 different runs are performed, and the algorithm finds the mixed Nash
equilibrium in 93.8 % of them.
The results suggest that the model is moderately sensitive to the parameter m.
The lower this number is, the higher the required average number of generations
until convergence. However, this difference is not very large. For example, for
Game 1, the difference between using only one perturbed point and using seven
points is on average 4.9627 generations, representing a 32.19 % increase.
5 Conclusions
We propose a model that adaptively converges to mixed strategy Nash equilibria
when optimized via an evolutionary multiobjective search method. The model can
work with only local knowledge about the game, centered around the actual
strategy profile, and at each step requires only one evaluation of a slightly
perturbed strategy profile.
The results suggest that a player can adaptively develop the strategy that
makes the opponent indifferent with regard to his own actions. Interestingly,
to obtain this result, it is enough to consider and measure one additional
alternative. The players need only local knowledge about the game, in which
each can internally evaluate the outcome of strategy profiles that are very
close to the current profile and its known outcome.
Numerical experiments cover several two-player games, with different pure and
mixed Nash equilibria. The results indicate the potential of the proposed
evolutionary search method.
Future work will extend the model to more than two players.
Acknowledgments. The first author acknowledges the financial support of the Sec-
toral Operational Program for Human Resources Development 2007-2013, co-financed
by the European Social Fund, within the project POSDRU 89/1.5/S/60189 with the
title “Postdoctoral Programs for Sustainable Development in a Knowledge Based Soci-
ety”. The second author wishes to thank for the financial support of the national
project code TE 252 financed by the Romanian Ministry of Education and Research
CNCSIS-UEFISCSU and “Collegium Talentum”.
References
1. Aumann, R.J.: What is game theory trying to accomplish? In: Arrow, K., Honkapo-
hja, S. (eds.) Frontiers of Economics. Blackwell, Oxford (1985)
2. Daskalakis, C., Goldberg, P.W., Papadimitriou, C.H.: The complexity of computing
a Nash equilibrium. In: Proceedings of the Thirty-Eighth Annual ACM Symposium
on Theory of Computing, STOC '06, pp. 71–78. ACM, New York (2006)
3. Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast elitist non-dominated sorting
genetic algorithm for multi-objective optimisation: NSGA-II. In: Schoenauer, M.,
Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J.J., Schwefel, H.-P. (eds.)
PPSN 2000. LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000)
4. Harsanyi, J.C.: Games with randomly disturbed payoffs: a new rationale for mixed-
strategy equilibrium points. Int. J. Game Theor. 2(1), 1–23 (1973)
5. Nash, J.: Non-cooperative games. Ann. Math. 54(2), 286–295 (1951)
6. Osborne, M.J.: An introduction to game theory. Oxford University Press, New York
(2004)
7. Papadimitriou, C.H.: On the complexity of the parity argument and other inefficient
proofs of existence. J. Comput. Syst. Sci. 48(3), 498–532 (1994)
8. Reny, P.J., Robson, A.J.: Reinterpreting mixed strategy equilibria: a unification of
the classical and Bayesian views. Games Econ. Behav. 48(2), 355–384 (2004)
9. Smith, J.M.: Evolution and the Theory of Games. Cambridge University Press,
Cambridge (1982)
Hybridizing Constraint Programming
and Monte-Carlo Tree Search: Application
to the Job Shop Problem
1 Introduction
This paper focuses on hybridizing Constraint Programming (CP) and Monte-Carlo
Tree Search (MCTS) methods. The proof of concept of the approach is given on
the job-shop problem (JSP), where JSPs are modelled as CP problem instances and
MCTS is hybridized with the Gecode constraint solver environment [3]. This
paper first briefly presents the JSP modeling in constraint programming and the
MCTS framework, referring the reader to [2] and [4,5], respectively, for a
comprehensive presentation. The proposed hybrid approach, referred to as
Bandit-Search for Constraint-Programming (BaSCoP), is thereafter described.
First experimental results on the difficult Taillard 11-20 20 × 15 problem
instances are presented in Sect. 5. The paper concludes with a discussion
w.r.t. related work [10] and some perspectives for further research.
– When MCTS reaches a leaf node, a child node is attached to it (the tree thus
involves N nodes after N tree-walks) and MCTS enters the roll-out phase. The
expansion may also occur only every k-th tree-walk, where k is referred to as
the expand rate.
– In the roll-out phase, nodes (a.k.a. actions) are selected using a default (usually
randomized) policy, until arriving at a terminal state.
– In this terminal state, the overall reward associated to this tree-walk is computed
and used to update the average reward μ̂i of every node in the tree-walk.
MCTS is frequently combined with the so-called RAVE (Rapid Action Value
Estimate) heuristic [9], which stores the average reward associated to each action
(averaging all rewards received along tree-walks involving this action). In
particular, the RAVE information is used to select the new nodes added to
the tree.
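For concreteness, the following Python sketch outlines one tree-walk combining the bandit, expansion, roll-out and back-propagation phases with a RAVE table. The game-state interface (legal_actions, apply, is_terminal, reward), the UCB1 exploration constant and all other names are illustrative assumptions, not the authors' implementation.

```python
# Minimal MCTS tree-walk with a RAVE table; a sketch under assumed interfaces.
import math, random
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}       # action -> Node
        self.visits = 0
        self.mean_reward = 0.0   # empirical mean reward (mu_hat)

rave = defaultdict(lambda: [0, 0.0])   # action -> [count, mean reward]

def tree_walk(root, state, c=0.7):
    path, actions, node = [root], [], root
    while not state.is_terminal():
        untried = [a for a in state.legal_actions() if a not in node.children]
        if untried:                                      # expansion: one new node,
            a = max(untried, key=lambda x: rave[x][1])   # selected using RAVE
            node.children[a] = Node()
        else:                                            # bandit phase (UCB1)
            parent = node
            a = max(parent.children,
                    key=lambda x: parent.children[x].mean_reward +
                        c * math.sqrt(math.log(parent.visits + 1) /
                                      (parent.children[x].visits + 1)))
        node = node.children[a]
        state.apply(a); path.append(node); actions.append(a)
        if untried:
            break                                        # enter the roll-out phase
    while not state.is_terminal():                       # roll-out: default policy
        a = random.choice(state.legal_actions())
        state.apply(a); actions.append(a)
    r = state.reward()                                   # reward of this tree-walk
    for n in path:                                       # back-propagation
        n.visits += 1
        n.mean_reward += (r - n.mean_reward) / n.visits
    for a in actions:                                    # RAVE update: average over
        cnt, mean = rave[a]                              # all walks involving a
        rave[a] = [cnt + 1, mean + (r - mean) / (cnt + 1)]
```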
4 BaSCoP
The considered CP setting differs significantly from the MCTS one. Basically,
CP relies on a multiple-restart strategy, which implies that it deals with many,
mostly narrow, trees. In contrast, MCTS proceeds by searching a single, gradually
growing and eventually very large tree. The CP and MCTS approaches were thus
hybridized, first, by attaching average rewards (Sect. 4.1) to each value of a
variable (Sect. 4.2); second, BaSCoP redefines the selection rule in the bandit
and roll-out phases (Sects. 4.3 and 4.4).
4.1 Reward
4.2 RAVE
In line with [4], the most straightforward option would have been to associate
to each node in the tree (that is, a (variable, value) assignment conditioned by
the former assignment nodes) the average objective associated to this partial
assignment. This option is however irrelevant in the considered setting: the
multiple restarts make it ineffective to associate an average reward to an
assignment conditioned by other variable assignments, since there are not enough
tree-walks to compute reliable statistics before they are discarded by the change
of context (the new tree). Hence, a radical form of RAVE was used, where
statistics are computed for each (variable, value) assignment, independently of
the context (previous variable assignments).
The roll-out phase also presents a key difference with the usual MCTS setting.
The roll-out policy launched after reaching a leaf of the MCTS tree usually
5 Experimental Results
Figure 2 depicts the overall results in terms of mean relative error w.r.t. the best
(non CP-based) solution found in the literature, on the Taillard 11-20 problem
suite (20 × 15), averaged on 11 independent runs, versus the number of tree-
walks. The computational cost is ca. 30 mn on a PC with Intel dual-core CPU
2.66GHz. Compared to DFS, a simple diversification improves only on the early
stages, while a left-biased one yields a significant improvement, of the same order
as a failure-depth one, and improvements seem to add up when combining both
biases.
Overall, BaSCoP is shown to match the CP-based state of the art [2]: the use
of MCTS was found to compensate for the lack of JSP-specific variable ordering.
Fig. 1. Balanced + DFS search tree. Concurrent DFS are run under each leaf of the
−growing− MCTS tree (dotted nodes).
Fig. 2. Mean relative error, for 11 runs on the ten 20 × 15 Taillard instances (11-20).
References
1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)
2. Beck, J.C.: Solution-guided multi-point constructive search for job shop scheduling.
J. Artif. Intell. Res. 29, 49–77 (2007)
3. Gecode Team: Gecode: Generic constraint development environment,
www.gecode.org
4. Gelly, S., et al.: The grand challenge of computer Go: Monte Carlo tree search and
extensions. Commun. ACM 55(3), 106–113 (2012)
5. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Fürnkranz, J.,
Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp.
282–293. Springer, Heidelberg (2006)
6. Luby, M., Sinclair, A., Zuckerman, D.: Optimal speedup of Las Vegas algorithms.
Inf. Process. Lett. 47, 173–180 (1993)
7. Mathon, R., Rosa, A.: Tables of parameters for BIBD’s with r ≤ 41 including
existence, enumeration, and resolvability results. Ann. Discrete Math. 26, 275–308
(1985)
8. Meseguer P.: Interleaved depth-first search. In: IJCAI 1997, vol. 2, pp. 1382–1387
(1997)
9. Rimmel, A., Teytaud, F., Teytaud, O.: Biasing Monte-Carlo simulations through
RAVE values. In: ICCG 2010, pp. 59–68 (2010)
10. Runarsson, T.P., Schoenauer, M., Sebag, M.: Pilot, rollout and Monte Carlo tree
search methods for job shop scheduling. In: Hamadi, Y., Schoenauer, M. (eds.)
LION 2012. LNCS, vol. 7219, pp. 160–174. Springer, Heidelberg (2012)
11. Watson, J.-P., Beck, J.C.: A hybrid constraint programming/local search approach
to the job-shop scheduling problem. In: Trick, M.A. (ed.) CPAIOR 2008. LNCS,
vol. 5015, pp. 263–277. Springer, Heidelberg (2008)
From Grammars to Parameters: Automatic
Iterated Greedy Design for the Permutation
Flow-Shop Problem with Weighted Tardiness
1 Introduction
Designing an effective stochastic local search (SLS) algorithm for a hard optimi-
sation problem is a time-consuming, creative process that relies on the experience
The flowshop scheduling problem (FSP) is one of the most widely studied sche-
duling problems, as it models a very common kind of production environment
in industries. The goal in the FSP is to find a schedule to process a set of
n jobs (J1 , . . . , Jn ) on m machines (M1 , . . . , Mm ). The specificity of flowshop
environments is that all jobs must be processed on the machines in the same
order, i.e., all jobs have to be processed on machine M1 , then machine M2 , and
so on until machine Mm . A common restriction in the FSP is to forbid job
passing between machines, i.e., to restrict to solutions that are permutations of
jobs. The resulting problem is called permutation flowshop scheduling problem
(PFSP). In the PFSP, all processing times pij for a job Ji on a machine Mj
are fixed, known in advance, and non-negative. In what follows, Cij denotes the
completion time of a job Ji on machine Mj and Ci denotes the completion time
of a job Ji on the last machine, Mm .
In many practical situations, for instance when products are due to arrive at
customers at a specific time, jobs have an associated due date, denoted here by di
for a job Ji . Moreover, some jobs may be more important than others, which can
be expressed by a weight associated to them representing their priority. Thus,
the so-called tardiness of a job $J_i$ is defined as $T_i = \max\{C_i - d_i, 0\}$ and the total
weighted tardiness is given by $\sum_{i=1}^{n} w_i \cdot T_i$, where $w_i$ is the priority assigned to
job $J_i$.
We consider the problem of minimizing the total weighted tardiness (WT).
This problem, which we call PFSP-WT, is NP-hard in the strong sense even for
a single machine [3]. Let $\pi_i$ denote the job in the $i$th position of a permutation
$\pi$. Formally, the PFSP-WT consists of finding a job permutation $\pi$ so as to

$$\text{minimize } F(\pi) = \sum_{i=1}^{n} w_i \cdot T_i$$

$$\text{subject to } C_{\pi_0 j} = 0, \quad j \in \{1, \dots, m\}, \qquad C_{\pi_i 0} = 0, \quad i \in \{1, \dots, n\}, \tag{1}$$
$$C_{\pi_i j} = \max\{C_{\pi_{i-1} j}, C_{\pi_i j-1}\} + p_{\pi_i j}, \quad i \in \{1, \dots, n\},\ j \in \{1, \dots, m\},$$
$$T_i = \max\{C_i - d_i, 0\}, \quad i \in \{1, \dots, n\}.$$
We tackle the PFSP-WT by means of iterated greedy (IG), which has been
shown to perform well in several PFSP variants [11]. Our goal is to automatically
generate IG algorithms in a bottom-up manner from a grammar description. The
next section explains our approach in more detail.
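As a preview of the kind of algorithm being generated, the following Python sketch shows a bare-bones iterated greedy loop for the PFSP-WT. The helper functions mirror the roles of the grammar components discussed below, but their names and the acceptance rule are illustrative assumptions.

```python
# A minimal iterated-greedy skeleton; helpers are placeholders for the
# grammar-generated components (selection, ordering, insertion).
def iterated_greedy(solution, evaluate, select_jobs, sort_removed_jobs,
                    best_insertion, budget=1000):
    best = list(solution)                          # a permutation of jobs
    for _ in range(budget):
        candidate = list(best)
        removed = select_jobs(candidate)           # mark jobs for removal
        for job in removed:
            candidate.remove(job)
        for job in sort_removed_jobs(removed):     # reinsert, in a chosen order
            pos = best_insertion(candidate, job, evaluate)
            candidate.insert(pos, job)
        if evaluate(candidate) <= evaluate(best):  # keep improving solutions
            best = candidate
    return best
```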
3 Methods
The methodology used in this work is the following. Given a problem to be tack-
led, we first define a set of algorithmic components for the problem at hand,
avoiding as much as possible assumptions on which component will contribute
the most to the effectiveness of the algorithm or which is the best way of combin-
ing the components in the final algorithm. Once the building blocks are defined,
we use tools for automatic algorithm configuration to explore the large design
space of all possible combinations and select the best algorithm for the problem
at hand.
The size and complexity of the building blocks are set at a level that is below
a full-fledged heuristic, but that still allows us to easily combine them in a modular
way to generate a very large number of different algorithms. This is in contrast
with the more standard way of designing SLS algorithms for a given problem,
in which the algorithm designer defines the full-fledged heuristics and leaves out
some parameters to tune specific choices within the already defined structure.
In this paper, the building blocks and the way in which they can be combined
are described by means of context-free grammars. Grammars comprise a set
of rules that describe how to construct sentences in a language given a set of
symbols. The grammar discussed in this paper generates algorithm descriptions
in pseudo-code; the actual grammar used in the experiments is equivalent to the
one presented here but directly generates C++ code.
Fig. 1. Grammar that describes the rules for generating IG algorithms for the PFSP.
A derivation rule states that the non-terminal symbol on the left-hand side can be
replaced by the expression on the right-hand side. Expressions are strings of
terminal and/or non-terminal symbols. If there are alternative strings of symbols
for the replacement of the non-terminal on the left-hand side, the alternatives are
separated by the symbol “|”.
In Fig. 1, the non-terminal symbol <program> defines the main step of the
algorithm. First one or more jobs are marked for removal from the current solu-
tion, then the selected jobs are removed and sorted, and finally the solution is
reconstructed by inserting the jobs back into the current solution. Implementing an
IG algorithm for the PFSP requires making some design choices, in particular:
(i) which jobs, and how many, are selected for removal; (ii) in which order the jobs
are reinserted; and (iii) which criteria should be optimized when deciding the
insertion point. All the possibilities that we consider in this paper are described
by the grammar in Fig. 1. Next, we explain these components in detail.
Heuristics for the Selection of Jobs. The selection of jobs for removal (rule
<select jobs>) consists in applying one or more selection rules. In
particular, this is done with the function select jobs(<heuristic>, <num>,
<low range>, <high range>), which selects <num> jobs from the current solution
according to the rule specified in <heuristic>. Each rule computes a numerical
value for each job Ji, which may be one of the following properties:
After the heuristic values are computed, they are normalized in the following
way: the minimum heuristic value among all jobs is mapped to 0, the maximum
one to 100, and values in-between are mapped linearly onto the range [0, 100].
Only jobs whose normalized heuristic value lies within a certain range [low, high]
are considered for selection. The range is computed from the values given by the
grammar as high = <high range> and low = <low range> · high/100. Finally, from
the jobs considered for selection, at most <num> percent (computed as
<num> · n/100) of the jobs are actually selected, where n is the total number of
jobs. An example of a selection rule is select jobs(DueDate,20,10,50), which
means that, from those jobs that have a normalized due date in the range [10, 50],
at most 0.2 · n jobs are selected.
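The selection rule above can be sketched as follows; the function names and the random choice among eligible jobs are assumptions, since the text only states that at most <num> percent of the eligible jobs are selected.

```python
# Sketch of the normalize-and-select rule; `heuristic` maps a job to its raw
# value (e.g. its due date). Arguments mirror the grammar non-terminals.
import random

def select_jobs(jobs, heuristic, num, low_range, high_range):
    values = {j: heuristic(j) for j in jobs}
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0                      # guard against equal values
    norm = {j: 100.0 * (values[j] - lo) / span for j in jobs}
    high = high_range                            # range given by the grammar
    low = low_range * high / 100.0
    eligible = [j for j in jobs if low <= norm[j] <= high]
    k = int(num * len(jobs) / 100.0)             # at most <num> percent of n
    return random.sample(eligible, min(k, len(eligible)))
```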
Rules for Ordering the Jobs. The function that sorts jobs for re-insertion
(sort removed jobs) is composed of one or more order criteria
(<order criteria>), where each additional order criterion is used for breaking
ties. Each order criterion sorts the removed jobs by a particular heuristic
value, in either ascending or descending order, according to <comparator>. The
result is a permutation of the removed jobs according to the order criteria.
Rules for Inserting the Jobs. In this paper we consider the minimization of
the weighted tardiness of the solution; thus, it is natural to optimize primarily
this objective when choosing the position for re-inserting each job. However, it
often happens that the weighted tardiness is the same for any insertion position
of a job (in particular, when the solution is partial: all jobs can easily respect
their due dates and the weighted tardiness is therefore 0).
Thus, we consider the possibility of breaking ties according to additional
criteria, namely, the minimization of the sum of completion times and the
maximization of the weighted earliness, computed as $\sum_{i=1}^{n} w_i \cdot (d_i - C_i)$. Both are
correlated with the minimization of the weighted tardiness and allow us to
differentiate between partial schedules with zero weighted tardiness, in which
none of the jobs is tardy. In total, we consider five alternatives for the insertion
criteria (<insert criteria>), corresponding to breaking ties with any combination
of either, none, or both of the sum of completion times and the weighted earliness.
such rules are never applied more than a small number of times. We use this
consideration and explicitly limit the number of parameters that describe such
rules; thus, in this way we also limit the length of the generated algorithm.
Converting rules that can be derived an unbounded number of times is the
non-trivial case, which we explain here with an example. Assume we want
to map the following rule to a set of categorical parameters:
In order to have at least one job, the first parameter should not have the
empty string among the possible values. This would more directly map to the
following equivalent grammar:
Table 1 shows the mapping of Fig. 1 to parameters. Both rules <select jobs>
and <order criteria> can be applied up to i and j times respectively. More-
over, each parameter used in those rules has to be duplicated for each possible
application of the rules.
How to search for the best algorithm in the design space defined by the grammar,
and how to represent the sequence of derivation rules that yields an algorithm,
is the subject of different methods in grammar-based genetic programming
(GBGP) [10]. Among the GBGP techniques proposed in the literature, we
consider recent work in grammatical evolution (GE) [2].
In GE, the instantiation of a grammar is done by starting with the <program>
non-terminal symbol, and successively applying the derivation rules in the gram-
mar, until there are no non-terminal symbols left. Every time that a non-terminal
symbol can be replaced following more than one production rule, a choice has
to be made. The sequence of specific choices made during the derivation, which
leads to a specific program, is encoded in a sequence of integers.
This linearisation of the derivation tree leads to a strong decoupling between
the sequence of integers and the programs being generated. For example, when
a derivation is complete and there are still numbers left in the sequence, these
numbers are discarded. Conversely, if the derivation is not complete and there are
no numbers left in the sequence, the sequence is read again from the beginning.
This operation is called wrapping and is repeated a limited number of times.
If after a given number of wrappings the derivation is still not complete, the
sequence of integers is considered to lead to an invalid program. Moreover, since
the integers usually lie in a range that is larger than the number of possible
choices for the derivation of a non-terminal, a modulo operation is applied at
each choice.
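The following minimal decoder illustrates the left-most derivation, the modulo rule and wrapping just described. The grammar encoding is illustrative, and codons are consumed only when there is an actual choice, a common GE convention.

```python
# Minimal GE decoder: codons pick productions via the modulo rule; the
# sequence is reread ("wrapped") a bounded number of times.
def decode(codons, grammar, start="<program>", max_wraps=3):
    symbols, out, i, wraps = [start], [], 0, 0
    while symbols:
        s = symbols.pop(0)                  # expand the left-most symbol
        if s not in grammar:                # terminal symbol: emit it
            out.append(s)
            continue
        rules = grammar[s]
        if len(rules) == 1:                 # no choice: no codon consumed
            choice = rules[0]
        else:
            if i == len(codons):            # wrapping: reread the sequence
                wraps += 1
                if wraps > max_wraps:
                    return None             # invalid program
                i = 0
            choice = rules[codons[i] % len(rules)]   # modulo rule
            i += 1
        symbols = list(choice) + symbols
    return out

# Toy example: <p> -> "a" <p> | "b".
grammar = {"<p>": [["a", "<p>"], ["b"]]}
print(decode([2, 3], grammar, start="<p>"))   # ['a', 'b']
```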
In GE, the sequences of integers are used as chromosomes in a genetic algorithm
that searches for the best algorithm for a given problem. The strong decoupling
between the programs and their representation has its drawbacks when used
within a genetic algorithm: the decoupling translates into non-locality of the
mutation and crossover operators [10]. Wrapping operations are clearly
responsible for part of this decoupling but, even without wrapping, the way in
which an algorithm is derived from the grammar and the sequence of integer
values leads to non-locality in the operators. In fact, since the integer values are
used to transform the left-most non-terminal symbol, a choice in one of the early
transformations can impact the structure of the program being generated and
the meaning of all subsequent integers in the sequence. Therefore, since a
mutation of one integer in the sequence (a codon) affects the meaning of all
the following codons, one-bit mutations in different positions of the individual
genotype have impacts of different magnitude on the phenotype of the individual.
For the same reason, the offspring of two highly fit parents are not necessarily
highly fit individuals. On the contrary, a one-point crossover of the best
individuals in the population could lead to individuals whose genotype cannot be
translated to any algorithm at all, because of the upper bound on the wrapping
operations. Regardless of these specific issues within a genetic algorithm, we are
interested in seeing whether this representation presents similar drawbacks when
used with a tool for automatic algorithm configuration.
In fact, this linearisation of the grammar can easily be used within a tool for
algorithm configuration by mapping all codons to categorical parameters. The
choice of categorical rather than integer parameters is due to the highly
non-linear response between the values of the codons and the algorithm they are
decoded into.
Both the parametric representation and the sequence of codons limit the length
of the algorithms that can be generated. In fact, a grammar can represent an
arbitrarily long algorithm, but in practice the length is limited by the number
of parameters in one case, and by the number of possible wrapping operations
in the other.
4 Experimental Results
4.1 Experiments
The automatic configuration procedure used in this work is irace [8], a publicly
available implementation of Iterated F-Race [1]. Iterated F-Race starts by sam-
pling a number of parameter configurations uniformly at random. Then, at each
iteration, it selects a set of elite configurations using a racing procedure and the
non-parametric Friedman test. This racing procedure runs the configurations
iteratively on a sequence of (training) problem instances, and discards config-
urations as soon as there is enough statistical evidence that a configuration is
worse than the others. After the race, the elite configurations are used to bias
a local sampling model. The next iteration starts by sampling new configura-
tions from this model, and racing is applied to these configurations together
with the previous elite configurations. This procedure is repeated until a given
budget of runs is exhausted. The fact that irace handles categorical, numerical
and surrogate parameters with complex constraints makes it ideal for instantiating
algorithms from grammars in the manner proposed in this paper.
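The following is a schematic racing loop in the spirit of F-Race. The statistics are deliberately simplified (a mean-based cut instead of the Friedman test and its post-hoc comparisons), and none of the names correspond to the actual irace API.

```python
# Schematic racing loop: run surviving configurations on successive instances
# and discard those that look clearly worse once enough evidence is gathered.
import statistics

def race(configs, instances, run, min_survivors=2, warmup=5):
    results = {c: [] for c in configs}
    alive = list(configs)
    for inst in instances:
        for c in alive:
            results[c].append(run(c, inst))        # cost of config c on inst
        if len(results[alive[0]]) >= warmup:       # enough evidence gathered
            means = {c: statistics.mean(results[c]) for c in alive}
            cut = min(means.values()) * 1.05       # simplified elimination rule
            alive = [c for c in alive if means[c] <= cut]
        if len(alive) <= min_survivors:
            break
    return alive                                   # the elite configurations
```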
Fig. 2. Mean relative percentage deviation (RPD) obtained by the heuristics generated
by each tuning method. Results are given separately for the heuristics trained and
tested on 50x20 instances and on 100x20 instances.
Using the same computational budget, we also consider two additional meth-
ods that generate heuristics randomly, to use as a baseline comparison. These
methods generate 250 IG heuristics randomly, run them on 10 randomly selected
training instances and select the heuristic that obtains the lowest mean value.
Method rand-ge uses the grammar representation, while method rand-param
uses the parametric representation.
Each method (irace-ge, irace-param5, irace-param3, irace-param1,
rand-ge, rand-param) is repeated 30 times with different random seeds for each
benchmark set, that is, in total, each method generates 60 IG heuristics. The
training set used by all methods consists of the first 90 instances of each size. A
grammar equivalent to the one in Fig. 1 is used to directly generate C++ code, which
is in turn compiled with GCC 4.4.6 with optimization level -O3. Experiments
were run on a single core of an AMD Opteron 6272 CPU (2.1 GHz, 16 MB L2/L3
cache size) running under Cluster Rocks Linux version 6/CentOS 6.3, 64 bits.
4.2 Results
For assessing the quality of the generated heuristics, we run them on 10 test
instances (that are distinct from the ones used for the training), repeating each
run 10 times with different random seeds. Next, we compute the relative per-
centage deviation (RPD) from the best solution ever found by any run for each
instance. The RPD is averaged over the 10 runs and over the 10 test instances.
Figure 2 compares the quality of the heuristics generated by the six methods
described above on each test set for the 50x20 and 100x20 benchmark sets. For each
method, we plot the distribution of the mean RPD of each heuristic generated
by it. In both benchmark sets, irace-param5 and irace-param3 obtain the best
heuristics. The heuristics generated by irace-param1 are typically worse than
those generated by irace-ge.
Pairwise comparisons using the Wilcoxon signed rank test indicate that all
pair-wise differences are statistically significant with the only exception being the
Table 2. Comparison of the methods through the Friedman test on the two bench-
mark sets. ΔRα=0.95 gives the minimum difference in the sum of ranks between two
methods that is statistically significant. For both benchmark sets, irace-param5 and
irace-param3 are clearly superior to the other methods.
5 Conclusions
The main conclusion from our work is that existing automatic configuration tools
may be used to generate algorithms from grammars.
Defining algorithmic components to be combined into an SLS algorithm presents
several advantages over designing a full-fledged heuristic in which some design
choices are left to be tuned automatically. Most importantly, less intuition, and
therefore less designer bias, goes into the definition of the separate blocks than in
the classical top-down approach. There are also more practical advantages to
following a bottom-up strategy. In fact, every instantiation of the grammar is a
minimal SLS algorithm designed and implemented to have a very specific
behaviour. Fewer programming abstractions are needed, and the simpler code may
be optimised more easily by the compiler. The parameters even become constant
values in the source code, so there is no need to pass them around to the various
parts of the algorithm. In contrast, when following a top-down approach, the
designer tackles the hard engineering task of designing a full-fledged framework
in which all possible combinations of design choices have to be defined beforehand.
This leads to a reduced number of possible combinations with respect to a modular
bottom-up approach, and also to the added complexity of the intricate conditional
expressions required to instantiate only the parts of the framework needed to
express a specific algorithm.
We have shown that it is possible to represent the instantiation of the grammar
by means of a parametric space. The number of parameters required is
proportional to the number of times a production rule can be applied and, hence,
our approach is most appropriate for grammars where this number is bounded
and not excessively large. It is an open research question for which kinds of
grammars the number of parameters required to represent applications of
production rules becomes prohibitively large and other representations are more
appropriate. Nonetheless, the grammar used in this work is similar in this respect
to others that can be found in the literature and, hence, we believe that grammars
whose production rules are applied only a rather limited number of times are
common in the development of heuristic algorithms.
In our experimental results, the heuristics generated by irace when using the
parametric representation achieve better results than those generated when using
the GE representation. This indicates that the parametric representation can help
to avoid disadvantages of grammatical evolution, such as poor fine-tuning
behaviour due to the low locality of its operators. Furthermore, our approach is
not limited to irace: it can be applied with other automatic configuration tools,
as long as they are able to handle categorical and conditional parameters.
In future work, we plan to compare our approach with a pure GE algorithm,
which is the algorithm used in previous similar works. Moreover, our intention
is to test the proposed method on different grammars and benchmark problems
to investigate its benefits and limitations.
References
1. Balaprakash, P., Birattari, M., Stützle, T.: Improvement strategies for the F-Race
algorithm: sampling design and iterative refinement. In: Bartz-Beielstein, T., Blesa
Aguilera, M.J., Blum, C., Naujoks, B., Roli, A., Rudolph, G., Sampels, M. (eds.)
HM 2007. LNCS, vol. 4771, pp. 108–122. Springer, Heidelberg (2007)
2. Burke, E.K., Hyde, M.R., Kendall, G.: Grammatical evolution of local search
heuristics. IEEE Trans. Evol. Comput. 16(7), 406–417 (2012)
3. Du, J., Leung, J.Y.T.: Minimizing total tardiness on one machine is NP-hard.
Math. Oper. Res. 15(3), 483–495 (1990)
4. Dubois-Lacoste, J., López-Ibáñez, M., Stützle, T.: A hybrid TP+PLS algorithm for
bi-objective flow-shop scheduling problems. Comput. Oper. Res. 38(8), 1219–1236
(2011)
5. Garey, M.R., Johnson, D.S., Sethi, R.: The complexity of flowshop and jobshop
scheduling. Math. Oper. Res. 1, 117–129 (1976)
6. Johnson, D.S.: Optimal two- and three-stage production scheduling with setup
times included. Naval Res. Logistics Quart. 1, 61–68 (1954)
7. KhudaBukhsh, A.R., Xu, L., Hoos, H.H., Leyton-Brown, K.: SATenstein: automat-
ically building local search SAT solvers from components. In: Boutilier, C. (ed.)
Proceedings of the Twenty-First International Joint Conference on Artificial Intel-
ligence (IJCAI-09), pp. 517–524. AAAI Press/International Joint Conferences on
Artificial Intelligence, Menlo Park (2009)
8. López-Ibáñez, M., Dubois-Lacoste, J., Stützle, T., Birattari, M.: The irace
package, iterated race for automatic algorithm configuration. Technical report
TR/IRIDIA/2011-004, IRIDIA, Université Libre de Bruxelles, Belgium (2011)
9. López-Ibáñez, M., Stützle, T.: The automatic design of multi-objective ant colony
optimization algorithms. IEEE Trans. Evol. Comput. 16(6), 861–875 (2012)
10. Mckay, R.I., Hoai, N.X., Whigham, P.A., Shan, Y., O’Neill, M.: Grammar-based
genetic programming: a survey. Genet. Program. Evolvable Mach. 11(3–4), 365–
396 (2010)
11. Ruiz, R., Stützle, T.: A simple and effective iterated greedy algorithm for the
permutation flowshop scheduling problem. Eur. J. Oper. Res. 177(3), 2033–2049
(2007)
12. Taillard, É.D.: Benchmarks for basic scheduling problems. Eur. J. Oper. Res. 64(2),
278–285 (1993)
13. Vázquez-Rodrı́guez, J.A., Ochoa, G.: On the automatic discovery of variants of the
NEH procedure for flow shop scheduling using genetic programming. J. Oper. Res.
Soc. 62(2), 381–396 (2010)
Architecture for Monitoring Learning
Processes Using Video Games
1 Introduction
The incorporation of Game-Based Learning (GBL) into learning processes has
already become a reality, both in schools and for research purposes. In particular,
playing these games in groups has been shown to be a desirable way of doing so,
as players are used to playing commercial video games in groups. Combining these
two aspects, we have focused our research on the use of group activities within
Educational Video Games (EVG) in order to promote collaborative skills in
students and to exploit the many advantages that both elements offer.
Although several aspects of designing and using EVG have been developed in our
research, in this paper we focus on the personalization of the learning/playing
process. Assuming that an educational video game includes recreational activities
that hide some educational content, monitoring and adapting the game is closely
related to monitoring and adapting the learning process. Starting from this
assumption, we think that it is necessary to monitor relevant activities in the
game in order to analyze and adapt it according to the features of the player or
group who is playing.
In this context of using collaborative learning and EVG as tools for teaching, we
think our previously proposed architecture PLAGER-VG [1] (PLAtform for manag-
inG Educational multiplayeR Video Games) needs to be modified. PLAGER-VG
helps teachers and designers to obtain more efficient video games and is able to
monitor and to adapt the learning processes underpinned by them.
2 Architecture PLAGER-VG
In this paper we have presented our proposal of using agents (1) to analyze interaction
between players of EVG in order to assess their learning processes and (2) to control
the adaptation of the game. These agents have been included as an extension of the
architecture PLAGER-VG.
We have defined the concept of an interesting event and we have presented the
information to be collected in order to classify and analyze the learning process
during the game, based on the widely known 3C model (communication,
coordination, cooperation).
To monitor the interaction during the game, we have proposed the inclusion of
additional information (context) describing the conditions under which events
occur. This information is collected by a set of special agents of the Monitoring
Agent type.
Once the previously mentioned information has been analyzed, another special
agent, called the Facilitator Agent, decides whether some adaptation actions have
to be performed or whether the teacher has to be informed. In the latter case,
the teacher can accept or reject the proposed changes.
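As an illustration of the data such agents might exchange, the following sketch shows a hypothetical event record with attached context and a stub of the Facilitator Agent decision. All names and the threshold rule are assumptions, not part of PLAGER-VG.

```python
# Hypothetical event record and decision stub; purely illustrative.
from dataclasses import dataclass, field

@dataclass
class InterestingEvent:
    player: str
    action: str                  # e.g. "solved_task", "failed_task" (assumed)
    timestamp: float
    context: dict = field(default_factory=dict)   # conditions when it occurred

def facilitator_decision(events, failure_threshold=3):
    """Propose an adaptation when a player accumulates too many failures."""
    failures = sum(1 for e in events if e.action == "failed_task")
    if failures >= failure_threshold:
        return "propose_adaptation"   # could instead be sent to the teacher
    return None
```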
Our immediate future work is to refine and to implement the modifications of the
PLAGER-VG prototype with the defined agents. We want to integrate the prototype
with a modular design which allows the design, execution and analysis of educational
video games with group activities.
We also intend to improve the adaptation mechanism. We are going to complete
the list of adaptation actions and to include the corresponding pre- and
post-conditions for them; these conditions will guarantee the integrity of the
game when the new adaptation actions are carried out. This process is performed
dynamically while the game is running.
Acknowledgments. This study has been financed by the Ministry of Science and
Innovation, Spain, as part of the VIDECO Project (TIN2011-26928), and by the
Vice-Rector’s Office for Scientific Policy and Research of the University of Granada
(Spain).
References
1. Padilla Zea, N.: Metodología para el diseño de videojuegos educativos sobre una arquitectura
para el análisis del aprendizaje colaborativo. Ph.D. thesis, University of Granada (2011)
2. Gutwin, C., Stark, G., Greenberg, S.: Support for workspace awareness in educational
groupware. In: Proceedings of CSCL’95. The First International Conference on Computer
Support for Collaborative Learning, pp. 147–156 (1995)
3. Baker, M., Lund, K.: Promoting reflective interactions in a CSCL environment. J. Comput.
Assist. Learn. 3(13), 175–193 (1997)
4. Collazos, C., Guerrero, L.A., Pino, J., Ochoa, S.F.: Evaluating collaborative learning pro-
cesses. In: Haake, J., Pino, J. (eds.) CRIWG 2002. LNCS, vol. 2440, pp. 203–221. Springer,
Heidelberg (2002)
5. Paderewski-Rodríguez, P., Rodríguez-Fortiz, M.J., Parets-Llorca, J.: An architecture for
dynamic and evolving cooperative software agents. Comput. Stand. Interfaces 25(3),
261–269 (2003)
6. Paderewski-Rodríguez, P., Torres-Carbonell, J., Rodríguez-Fortiz, M.J., Medina-Medina, N.,
Molina-Ortiz, F.: A software system evolutionary and adaptive framework: application to
agent-based systems. J. Syst. Archit. 50, 407–416 (2004)
7. Ellis, C.A., Gibbs, S.J., Rein, G.L.: Groupware: some issues and experiences. Commun.
ACM 34(1), 39–58 (1991)
8. Padilla, N., González, J.L., Gutiérrez, F.L.: Collaborative learning by means of video games:
an entertainment system in the learning processes. In: Proceedings of 9th IEEE International
Conference on Advanced Learning Technologies (ICALT), pp. 215–217 (2009)
9. Hanneman, R.A., Riddle, M.: Introduction to social network methods. Free online textbook.
http://www.faculty.ucr.edu/~hanneman/nettext/ (2005). Accessed 2010
Quality Measures of Parameter Tuning
for Aggregated Multi-Objective
Temporal Planning
1 Introduction
Parameter tuning is now well recognized as a mandatory step when attempting
to solve a given set of instances of some optimization problem. All optimization
algorithms behave very differently on a given problem depending on their
parameter values, and setting the algorithm parameters to the correct values can
make the difference between failure and success. This is equally true for
deterministic complete algorithms [1] and for stochastic approximate algorithms
[2,3]. Current approaches range from racing-like methods [4,5] to
This work is being partially funded by the French National Research Agency under
the research contract DESCARWIN (ANR-09-COSI-002).
2 AI Planning
An AI Planning problem (see e.g. [10]) is defined by a set of predicates, a set
of actions, an initial state and a goal state. A state is a set of non-exclusive
instantiated predicates, or (Boolean) atoms. An action is defined by a set of
pre-conditions and a set of effects: the action can be executed only if all pre-
conditions are true in the current state, and after an action has been executed,
the effects of the action modify the state: the system enters a new state. A plan
is a sequence of actions, and a feasible plan is a plan such that executing each
action in turn from the initial state puts the system into the goal state. The goal
of (single-objective) AI Planning is to find a feasible plan that minimizes some
quantity related to the actions: the number of actions for STRIPS problems, the
sum of action costs in case actions have different costs, or the makespan in the
case of temporal planning, where actions have a duration and can possibly be
executed in parallel. All these problems are PSPACE-complete.
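The definitions above can be made concrete with a minimal plan checker. The split of effects into add and delete lists follows the usual STRIPS convention, and the dictionary-based action encoding is an illustrative assumption.

```python
# Minimal STRIPS-style plan checker: states are sets of Boolean atoms.
def applicable(state, action):
    return action["pre"] <= state                # all preconditions hold

def apply_action(state, action):
    return (state - action["del"]) | action["add"]

def feasible(plan, init, goal):
    state = frozenset(init)
    for a in plan:
        if not applicable(state, a):
            return False
        state = apply_action(state, a)
    return goal <= state                         # all goal atoms are reached

# Example: a single "fly" action moving a plane from city 0 to city 1.
fly = {"pre": {"plane_at_c0"}, "add": {"plane_at_c1"}, "del": {"plane_at_c0"}}
print(feasible([fly], {"plane_at_c0"}, {"plane_at_c1"}))   # True
```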
A simple planning problem in the domain of logistics (inspired by the well-
known Zeno problem of IPC series) is given in Fig. 1: the problem involves cities,
passengers, and planes. Passengers can be transported from one city to another,
following the links on the figure. One plane can only carry one passenger at a
time from one city to another, and the flight duration (number on the link) is
the same whether or not the plane carries a passenger (this defines the domain
of the problem). In the simplest non-trivial instance of such domain, there are 3
passengers and 2 planes. In the initial state, all passengers and planes are in city
0, and in the goal state, all passengers must be in city 4. The not-so-obvious
optimal solution has a total makespan of 8 and is left as a teaser for the reader.
AI Planning is a very active field of research, as witnessed by the success of
the ICAPS series of yearly conferences (http://icaps-conferences.org) and
its biennial competition IPC, where the best planners in the world compete on
a set of problems. This competition has led researchers to design a common
language for describing planning problems, PDDL (Planning Domain Definition
Language). Two main categories of planners can be distinguished: exact planners
are guaranteed to find the optimal solution . . . if given enough time; satisficing
planners return the best solution they can find, but with no optimality guarantee.
Most existing work in AI Planning involves one single objective, even though
real-world problems are generally multi-objective (e.g., optimizing the makespan
while minimizing the cost, two contradictory objectives). An obvious approach to
Multi-Objective AI planning is to aggregate the different objectives into a single
objective, generally a fixed linear combination (weighted sum) of all objectives.
The single objective is to be minimized, and the weights have to be positive
(resp. negative) for the objectives to be minimized (resp. maximized) in the
original problem. The solution of one aggregated problem is Pareto optimal if
all weights are non-zero, or the solution is unique [12]. It is also well-known that
Fig. 1. A schematic view of MultiZeno, a simple benchmark transportation domain:
Flight durations of available routes are attached to the corresponding edges, costs are
attached to landing in the central cities (in grey circles).
Fig. 2. The exact Pareto Fronts for the MultiZeno6 problem for different values of
cost(city2) (all other values as in Fig. 1).
3 Divide-and-Evolve
Let PD (I, G) denote the planning problem defined on domain D (the predicates,
the objects, and the actions), with initial state I and goal state G. In the STRIPS
representation model [22], a state is a list of Boolean atoms defined using the
predicates of the domain, instantiated with the domain objects.
In order to solve PD (I, G), the basic idea of DaEX is to find a sequence of
states S1 , . . . , Sn , and to use some embedded planner X to solve the series of
3 In the MultiZenoRisk problem, not detailed here, the second objective is the risk:
its maximal value ever encountered is to be minimized.
346 M.R. Khouadjia et al.
planning problems PD (Sk , Sk+1 ), for k ∈ [0, n] (with the convention that S0 = I
and Sn+1 = G). The generation and optimization of the sequence of states
(Si )i∈[1,n] is driven by an evolutionary algorithm. After each of the sub-problems
PD (Sk , Sk+1 ) has been solved by the embedded planner, the concatenation of
the corresponding plans (possibly compressed to take into account possible
parallelism, in the case of temporal planning) is a solution of the initial problem.
In case one sub-problem cannot be solved by the embedded solver, the individual
is said to be unfeasible and its fitness is heavily penalized, in order to ensure that
feasible individuals always have a better fitness than unfeasible ones, which are
selected only when there are not enough feasible individuals. A thorough
description of DaEX can be found in [11]. The rest of this section briefly recalls
the evolutionary parts of DaEX .
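The decomposition principle can be sketched as follows. The solver, compression and objective callbacks, as well as the penalty value, are illustrative placeholders for the components described above.

```python
# Sketch of the DaE_X fitness: an individual is a sequence of intermediate
# states; each consecutive pair is handed to an embedded planner.
PENALTY = 1.0e9    # keeps unfeasible individuals below all feasible ones

def evaluate(individual, init, goal, solve, compress, objective):
    states = [init] + list(individual) + [goal]
    plans = []
    for s, t in zip(states, states[1:]):       # P_D(S_k, S_k+1), k = 0..n
        plan = solve(s, t)                     # embedded planner, bounded budget
        if plan is None:                       # sub-problem unsolved:
            return PENALTY                     # the individual is unfeasible
        plans.append(plan)
    return objective(compress(plans))          # e.g. makespan after compression
```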
DaEX uses an external embedded planner to solve in turn the sequence of sub-
problems defined by the ordered list of partial states. Any existing planner can in
theory be used. However, there is no need for an optimality guarantee when solv-
ing the intermediate problems in order for DaEX to obtain good quality results
[11]. Hence, and because a very large number of calls to this embedded planner
are necessary for a single fitness evaluation, a sub-optimal but fast planner was
found to be the best choice: YAHSP [25] is a lookahead-strategy planning system
for sub-optimal planning, which uses the actions in the relaxed plan to compute
reachable states in order to speed up the search process. Because the rationale of
DaEX is that all sub-problems should hopefully be easier than the initial global
problem, and for computational performance reasons, the search capabilities of
the embedded planner YAHSP are limited by setting a maximal number of nodes
that it is allowed to expand to solve any of the sub-problems (see again [11] for
more details).
However, even though YAHSP, like all planners known to date, is a single-objective
planner, it has nevertheless been possible since PDDL 3.0 to add to a PDDL
domain file other quantities (a.k.a. Soft Constraints or Preferences [19]) that are
simply computed along the execution of the final plan, without interfering with
the search. Two strategies are then possible for YAHSP in the two-objective
context of MultiZeno: it can optimize either the makespan or the cost, and simply
compute the other quantity (cost or makespan) along the solution plan. The
corresponding strategies will be referred to as YAHSPmakespan and YAHSPcost .
In the multi-objective versions of DaEYAHSP , the choice between both strategies
is governed by user-defined weights, named respectively W-makespan and
W-cost (see Table 1). For each individual, the actual strategy is randomly chosen
according to those weights, and applied to all subproblems of the individual.
4 Experimental Conditions
The Aggregation Method for multi-objective optimization runs in turn a
series of single-objective problems. The fitness of each of these problems is
defined using a single positive parameter α. In the following, Fα will denote
α ∗ makespan + (1 − α) ∗ cost, and DaEYAHSP run optimizing Fα will be called
the α-run. Because the range of the makespan values is approximately twice as
large as that of the cost, the following values of α have been used instead of
regularly spaced values: 0, 0.05, 0.1, 0.3, 0.5, 0.55, 0.7, 1.0. One “run” of the
aggregation method thus amounts to running the corresponding eight α-runs,
and returns as the approximation of the Pareto front the set of non-dominated
solutions among the merge of the eight final populations.
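A sketch of this aggregation scheme is given below. The alpha values and the (makespan, cost) point format follow the text, while the alpha_run callback and the naive non-dominated filter are illustrative assumptions.

```python
# Weighted-sum aggregation: one F_alpha run per weight, then a non-dominated
# filter over the merged final populations.
ALPHAS = [0.0, 0.05, 0.1, 0.3, 0.5, 0.55, 0.7, 1.0]

def f_alpha(alpha, makespan, cost):
    return alpha * makespan + (1.0 - alpha) * cost

def non_dominated(points):
    """Keep the non-dominated (makespan, cost) pairs (both minimised)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

def aggregation_run(alpha_run):
    merged = []
    for a in ALPHAS:                                   # the eight alpha-runs
        merged += alpha_run(lambda m, c, a=a: f_alpha(a, m, c))
    return non_dominated(merged)                       # Pareto approximation
```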
ParamILS [8] is used to tune the parameters of DaEYAHSP . ParamILS uses a
simple Iterated Local Search heuristic [26] to optimize parameter configurations,
and can be applied to any parameterized algorithm whose parameters can be
discretized. ParamILS repeats local search loops from different random starting
points and, during each local search loop, modifies one parameter at a time,
runs the target algorithm with the new configuration and computes the quality
measure it aims at optimizing, accepting the new configuration if it improves
the quality measure over the current one.
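A bare-bones sketch of this one-exchange local search with random restarts is given below. The acceptance rule, budget handling and all names are simplifications of the actual ParamILS procedure.

```python
# One-exchange local search over discretized parameter domains, with random
# restarts; quality() wraps a run of the target algorithm (to be minimised).
import random

def one_exchange_search(domains, quality, restarts=10, steps=200):
    best, best_q = None, float("inf")
    for _ in range(restarts):                        # random restarts
        cur = {p: random.choice(v) for p, v in domains.items()}
        cur_q = quality(cur)
        for _ in range(steps):
            p = random.choice(list(domains))         # one parameter at a time
            trial = dict(cur)
            trial[p] = random.choice(domains[p])
            trial_q = quality(trial)
            if trial_q < cur_q:                      # accept improving moves
                cur, cur_q = trial, trial_q
        if cur_q < best_q:
            best, best_q = cur, cur_q
    return best
```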
The most prominent parameters of DaEYAHSP that have been subject to
optimization can be seen in Table 1.
Quality Measures for ParamILS: The goal of the experiments presented
here is to compare the influence of two quality measures used by ParamILS for
the aggregated DaEYAHSP on MultiZeno instances. In AggregF itness , the quality
measure used by ParamILS to tune the α-run of DaEYAHSP is Fα , the fitness
also used by the target α-run. In AggregHyper , ParamILS uses, for each α-run,
the same quality measure, namely the unary hypervolume [27] of the final
population of the α-run w.r.t. the exact Pareto front of the problem at hand (or
its best available approximation when the exact front is not known). The lower
the better (a value of 0 indicates that the exact Pareto front has been reached).
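In two dimensions the hypervolume of a minimisation front can be computed by a simple sweep, as sketched below. The reference point is an assumption, and the indicator is taken as the volume dominated by the exact front minus that dominated by the approximation, so that 0 means the exact front is matched.

```python
# 2-D hypervolume by sweeping the front in ascending makespan order.
def hypervolume(front, ref):
    """Hypervolume of a minimisation front w.r.t. a reference point."""
    hv, prev_cost = 0.0, ref[1]
    for makespan, cost in sorted(front):
        if cost < prev_cost:                   # skip dominated points
            hv += (ref[0] - makespan) * (prev_cost - cost)
            prev_cost = cost
    return hv

def unary_hypervolume(approx, exact, ref):
    """0 when the approximation covers the exact Pareto front."""
    return hypervolume(exact, ref) - hypervolume(approx, ref)

# Example: unary_hypervolume([(2, 1)], [(1, 3), (2, 1)], ref=(5, 5)) -> 2.0
```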
Implementation: Algorithms have been implemented within the ParadisEO-
MOEO framework.4 All experiments were performed on the MultiZeno3,
MultiZeno6, and MultiZeno9 instances. The first objective is the makespan,
and the second objective is the cost. The values of the different flight durations
(makespans) and costs are those given on Fig. 1 except otherwise stated.
Performance Assessment and Stopping Criterion: For all experiments, 11
independent runs were performed. Note that all performance assessment
procedures, including the hypervolume calculations, were carried out using the
PISA performance assessment tool suite.5 The main quality measure used here
to compare Pareto fronts is, as above, the unary hypervolume $I_H^-$ [27] of the set
of non-dominated points output by the algorithms with respect to the complete
true Pareto front. For aggregated runs, the union of the final populations of the
α-runs for the different values of α is considered the output of the complete “run”.
4 http://paradiseo.gforge.inria.fr/
5 http://www.tik.ee.ethz.ch/pisa/
However, because the true front is known exactly, and is made of a few
scattered points (at most 17, for MultiZeno9, in this paper), it is also possible
to visually monitor, for each point of the front, the ratio of runs (out of
11) that discovered it at any given time. This provides another point of view on
the comparison between algorithms, even when none has found the whole Pareto
front. Such hitting plots will be used in the following, together with more classical
plots of hypervolume versus computational effort. In any case, when comparing
different approaches, statistical significance tests are performed on the
hypervolumes, using the Wilcoxon signed rank test at the 95 % confidence level.
Finally, because different fitness evaluations involve different numbers of calls to
YAHSP (and because YAHSP runs can have different computational costs too,
depending on the difficulty of the sub-problem being solved), the computational
effort is measured in terms of CPU time rather than the number of function
evaluations; the same goes for the stopping criterion: the absolute limits in terms
of computational effort were set to 300, 600, and 1800 seconds for MultiZeno3,
MultiZeno6, and MultiZeno9, respectively. The stopping criterion for ParamILS
was likewise set to a fixed wall-clock time: 48 h for MultiZeno3 and MultiZeno6,
and 72 h for MultiZeno9, corresponding to 576, 288, and 144 parameter
configuration evaluations per value of α for MultiZeno3, 6 and 9, respectively.
5 Experimental Results
5.1 ParamILS Results
Table 2 presents the optimal values for DaEYAHSP parameters of Table 1 found
by ParamILS in both experiments, for all values of α - as well as for the multi-
objective version of DaEYAHSP presented in [15] (last column, entitled IBEAH ).
The most striking and clear conclusion regards the weights for the choice of
YAHSP strategy (see Sect. 3.2) W-makespan and W-cost. Indeed, for the
Fig. 3. Evolution of the hypervolume for both approaches, for all α-runs and overall,
on the MultiZeno instances. Warning: the hypervolume is in log scale, and the X-axis
is placed not at value 0 but at 6.7 · 10−5 for MultiZeno6 and 0.0125 for MultiZeno9.
AggregHyper approach, ParamILS found out that YAHSP should optimize only
the makespan (W-cost = 0) for small values of α, and only the cost for large values
of α, while the exact opposite holds for the AggregF itness approach. Remember
that small (resp. large) values of α correspond to an aggregated fitness having all
its weight on the cost (resp. the makespan). Hence, during the 0- or 0.05-runs, the
fitness of the corresponding α-run is pulling toward minimizing the cost; but for
the AggregHyper approach, the best choice of YAHSP strategy, as identified by
ParamILS, is to minimize the makespan (i.e., setting W-cost to 0): as a result,
the population has a better chance to remain diverse, and hence to optimize the
hypervolume, i.e., ParamILS’ quality measure. In the same situation (small α),
on the opposite, for AggregF itness , ParamILS identified that the best strategy
for YAHSP is to also favor the minimization of the cost, setting W-makespan
to zero. The symmetrical reasoning applies to the case of large values of
α. For the multi-objective version of DaEYAHSP (IBEA column in Table 2), the
best strategy that ParamILS came up with is a perfect balance between both
strategies, setting both weights to 1.
The values returned by ParamILS for the other parameters are more difficult
to interpret. It seems that large values of Proba-mut are preferable for AggregHyper
when α is set to 0 or 1, i.e. when DaEYAHSP explores the extreme sides of the
objective space: more mutation is needed to depart from the boundary of the
objective space and cover more of its volume. Another tendency is that ParamILS
repeatedly found higher values of Proba-cross and lower values of Proba-mut for
AggregHyper than for AggregF itness . Together with large population sizes
(compared to the one for IBEA, for instance), the 1-point crossover of DaEYAHSP
remains exploratory for a long time, and leads to viable individuals that can remain
in the population even though they do not optimize the α-fitness, thus contributing
to the hypervolume. On the opposite, a large mutation rate is preferable for
AggregF itness as it increases the chances to hit a better fitness, and otherwise
generates likely non-viable individuals that will be quickly eliminated by
selection, making DaEYAHSP closer to a local search. The values found for IBEA,
on the other hand, are rather small – but the small population size also has to
be considered here: because it aims at exploring the whole objective space in one
go, the most efficient strategy for IBEA is to make more but smaller steps, in all
possible directions.
Figs. 4 and 5. Hitting plots (attainment vs. time in seconds) for the AggregHyper
(left) and AggregF itness (right) approaches, and the corresponding approximated
Pareto fronts (cost vs. makespan) together with the exact Pareto front, on the
MultiZeno instances.
Fig. 6. Hitting plots for different Pareto fronts for MultiZeno6. See Sect. 2.2 and com-
pare with Fig. 4(a), (b).
References
1. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Automated configuration of mixed inte-
ger programming solvers. In: Lodi, A., Milano, M., Toth, P. (eds.) CPAIOR 2010.
LNCS, vol. 6140, pp. 186–202. Springer, Heidelberg (2010)
2. Eiben, A., Michalewicz, Z., Schoenauer, M., Smith, J.: Parameter control in evolu-
tionary algorithms. In: Lobo, F.G., Lima, C.F., Michalewicz, Z. (eds.) Parameter
Setting in Evolutionary Algorithms. SCI, vol. 54, pp. 19–46. Springer, Heidelberg
(2007)
3. Yuan, Z., de Oca, M.A.M., Birattari, M., Stützle, T.: Modern continuous optimiza-
tion algorithms for tuning real and integer algorithm parameters. In: Dorigo, M., et
al. (eds.) ANTS 2010. LNCS, vol. 6234, pp. 203–214. Springer, Heidelberg (2010)
4. Birattari, M., Yuan, Z., Balaprakash, P., Stützle, T.: Automated algorithm tuning
using F-Races: recent developments. In: Caserta, M., et al. (eds.) Proceedings of
MIC’09. University of Hamburg (2009)
Quality Measures of Parameter Tuning 355
26. Lourenço, H., Martin, O., Stützle, T.: Iterated local search. In: Glover, F., Kochen-
berger, G.A. (eds.) Handbook of Metaheuristics, pp. 320–353. Kluwer Academic,
New York (2003)
27. Zitzler, E., Künzli, S.: Indicator-based selection in multiobjective search. In: Yao,
X., et al. (eds.) PPSN VIII. LNCS, vol. 3242, pp. 832–842. Springer, Heidelberg
(2004)
28. Bibaı̈, J., Savéant, P., Schoenauer, M., Vidal, V.: On the generality of parameter
tuning in evolutionary planning. In: Proceedings of 12th GECCO, pp. 241–248.
ACM (2010)
Evolutionary FSM-Based Agents
for Playing Super Mario Game
Abstract. Most game development over the years has focused on the
technical part (graphics and sound), leaving artificial intelligence aside.
However, computational intelligence is becoming more significant,
leading to much research on how to provide non-playing characters
with adaptive and unpredictable behaviour so as to afford users a better
gaming experience. This work applies strategies based on Genetic
Algorithms mixed with behavioural models to obtain an agent (or bot)
capable of autonomously completing different scenarios in a simulator
of the Super Mario Bros. game. Specifically, the agent follows the rules of
the Gameplay track of the Mario AI Championship. Different approaches
have been analysed, combining Genetic Algorithms with Finite State
Machines, yielding agents which can complete levels of different
difficulties, playing much better than an expert human player.
1 Introduction
The Mario Bros. game series was created by Shigeru Miyamoto1 and appeared in
the early 80s. The most famous title so far is the platform game Super Mario Bros.,
together with its sequels (for instance the blockbuster Super Mario World).
All of them follow a well-known plot: the plumber Mario must rescue the
princess of the Mushroom Kingdom, Peach, who has been kidnapped by the king
of the Koopas, Bowser. The main goal is to traverse side-scrolling platform levels,
trying to avoid different types of enemies and obstacles and using some useful
(but limited) items, such as mushrooms or fire flowers.
Due to their success, amusement and attractiveness, the Mario series has
become a successful research environment in the field of Computational
Intelligence (CI) [1,5,6]. The most widely used framework is Mario AI, a modified
version of the game known as Infinite Mario Bros.,2 an open-source application where the
1 Designer and producer at Nintendo Ltd., and winner of the 2012 Príncipe de Asturias
Prize in Humanities and Communication.
2 http://www.mojang.com/notch/mario/
The proposed agents follow the rules of the Mario AI Championship, considering
the GamePlay track (complete as many levels as possible). The game consists of
moving the character, Mario, through two-dimensional levels. He can move left
and right, crouch (down), run (keeping the button pressed), jump and shoot
fireballs (when in “fire” mode).
The main goal is to complete the level, whereas secondary goals include killing
enemies and collecting coins or other items. These items may be hidden and may
cause Mario to change his state (for instance a fire flower placed ‘inside’ a block).
The difficulty of the game lies in the presence of cliffs/gaps and enemies. Mario
loses power (i.e., his status goes down one level) when touched by an enemy, and
dies if he falls off a cliff.
The Mario AI simulator provides information about Mario's surroundings. According to the rules of the competition, two matrices give this information, both of them 19 × 19 cells in size and centred on Mario. One contains the positions of surrounding enemies, and the other provides information about the objects in the area (scenery objects and items).
Every tick (40 ms), Mario's next action must be indicated. This action consists of a combination of the five possible movements that Mario can perform (left, right, down, fire/speed, jump). This information is encoded as a boolean array, containing a true value for every movement that must be done.
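To make this encoding concrete, here is a minimal Python sketch of such an action array; the key names and their ordering are illustrative assumptions, not the simulator's actual API.

```python
# Illustrative sketch of the five-button action encoding; key names and
# ordering are assumptions, not the Mario AI simulator's actual API.
KEYS = ("left", "right", "down", "fire_speed", "jump")

def make_action(*pressed):
    # True in a position means that movement must be performed this tick
    return [k in pressed for k in KEYS]

# Example: run to the right while jumping.
assert make_action("right", "fire_speed", "jump") == [False, True, False, True, True]
```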
The action to perform depends, of course, on the scenery characteristics around Mario, but it is also important to know where the enemies are and their type; thus, the agent can know whether it is best to jump over, shoot or avoid them. We have defined four main enemy groups according to what the agent needs to do when facing them.
³ http://www.marioai.org/
Table 1. Codification of the feasible states of the FSM which will model the Mario agent's AI. 1 is true/active, 0 is false/non-active.

          St0 St1 St2 St3 St4 St5 St6 St7 St8 St9 St10 St11 St12 St13
Right      1   1   1   1   0   0   0   0   0   0   0    0    0    0
Left       0   0   0   0   1   1   1   1   0   0   0    0    0    0
Fire/Run   0   0   1   1   0   0   1   1   0   0   0    0    1    1
Jump       0   1   0   1   0   1   0   1   0   0   1    1    0    1
Down       0   0   0   0   0   0   0   0   0   1   0    1    0    0
Depending on the input string, the state changes (or remains the same), i.e., the transition is decided. Inputs are the possible situations of the agent in the environment: for example, finding an enemy or being near a cliff/gap. Each input is represented as a boolean string, with a true (1) value in a position if a specific situation or event has happened.
These possible states, along with the possible inputs and transitions, are evolved by means of the GA. The output state for a new entry is set randomly, but according to a probability which models the preference of the states, in order to improve the convergence of the algorithm. Due to the huge search space, a parameter indicating the percentage of new individuals to be added in each generation is included, in order to control the diversity rate.
Every individual in the population is represented by a set of tables, one per state, and every table contains an output state for every possible input. The fitness function is calculated for each individual by setting the FSM represented in a chromosome as the AI of one agent; the agent is then placed in a level and plays to obtain a fitness value. Two different schemes have been implemented: mono-seed, where all the individuals are tested in the same level (with the same difficulty), which grows in length with the generations; and a multi-seed approach, where every individual is tested in 30 randomly generated levels (of the same difficulty, using different seeds). In both cases every agent plays until it passes the level, dies or gets stuck.
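As a rough illustration of this representation, the sketch below encodes a chromosome as one transition table per state; the number of input bits and the preference weights are assumptions made purely for the example.

```python
import random

N_STATES = 14       # feasible states St 0 .. St 13 from Table 1
N_INPUT_BITS = 6    # length of the boolean input string: an assumption

def random_chromosome(state_preference):
    """One table per state: each table maps every possible input string
    (encoded as an integer) to an output state, drawn with the bias given
    by the per-state preference probabilities mentioned in the text."""
    inputs = range(2 ** N_INPUT_BITS)
    return [{i: random.choices(range(N_STATES), weights=state_preference)[0]
             for i in inputs}
            for _ in range(N_STATES)]

def next_state(chromosome, state, input_bits):
    # a transition is a simple table lookup
    return chromosome[state][input_bits]
```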
The aim of the latter scheme is twofold: first, to avoid the noise [4] usually present in this type of problem (videogames), i.e. to obtain a fair evaluation of an individual, since the same configuration could represent an agent which is sometimes very good and sometimes quite bad, owing to the stochasticity present in every play (in the agent's behaviour alone); and second, to obtain individuals prepared for a wide set of situations in every level and difficulty, since 30 levels should present a large number of different scenarios and configurations.
Thus, there is a generic fitness which, as a restriction, is only set to a positive value if the agent completely finishes the level. Conversely, individuals that have not finished the level start from the lowest fitness possible, and their negativity is reduced according to their behaviour during the level run. This generic fitness is a weighted aggregation based on the values: marioWinner (1 if the agent finishes the level), marioSize (0 small, 1 big, and 2 fire), numKilledEnemies, numTotalEnemies, numCellsPassed, remainingTime, timeSpent, coinsGathered, totalCoins, numCollisions (number of times the agent has bumped into an enemy), numGatheredPowerUps, and causeOfDeath (a value representing how the agent died). This fitness is taken as the result of the evaluation for individuals in the mono-seed approach, whereas multi-seed considers a hierarchical fitness, where the population is ordered according to the following criteria: first, the percentage of levels in which the individuals got stuck or fell from a cliff; then, the average percentage of levels completed; and finally, the average generic fitness.
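A minimal sketch of the two fitness schemes follows; the weights, the value chosen for the lowest possible fitness and the exact field handling are our own assumptions, since the text only fixes the qualitative behaviour.

```python
LOWEST_FITNESS = -1_000_000  # assumed "lowest fitness possible" for non-finishers

def generic_fitness(stats, w_cells=1.0, w_kills=50.0, w_coins=10.0, w_time=1.0):
    # weighted aggregation over (a subset of) the run statistics listed above
    progress = (w_cells * stats["numCellsPassed"]
                + w_kills * stats["numKilledEnemies"]
                + w_coins * stats["coinsGathered"]
                + w_time * stats["remainingTime"])
    if stats["marioWinner"] == 1:       # only finishers get a positive fitness
        return progress
    return LOWEST_FITNESS + progress    # negativity reduced by good behaviour

def hierarchical_key(agent):
    # multi-seed ordering: fewer stuck/fallen levels first, then higher
    # average completion, then higher average generic fitness
    return (agent["pct_stuck_or_fallen"],
            -agent["avg_pct_completed"],
            -agent["avg_generic_fitness"])
```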
The selection mechanism considers the best individual and a percentage of the best ones, selected by tournament according to their fitness. The percentage of individuals to consider as parents follows two schemes: in mono-seed it is low at the beginning and increases as the number of generations grows; in multi-seed it is constant.
Uniform crossover is performed considering the best individual of the present generation as one of the parents and one of the individuals with positive fitness as the other parent. They generate a number of descendants which depends on the percentage of the population to be completed by crossover.
The mutation operator selects a percentage of individuals to be mutated and a random set of genes to be changed in every such individual; the output state for an input in the table is then changed at random.
A 1-elitism replacement is used to form the new population (the best individual survives). The rest of the population is composed of the offspring generated in the previous generation (a percentage of the global population) and a set of random individuals, in order to increase diversity.
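The variation steps can be sketched as follows; the per-gene handling is an assumption, while the rates correspond to those in the parameter table below.

```python
import random

def mutate(chromosome, gene_rate=0.01, n_states=14):
    # change the output state of a random fraction of table entries
    for table in chromosome:
        for key in table:
            if random.random() < gene_rate:
                table[key] = random.randrange(n_states)

def uniform_crossover(best_parent, other_parent):
    # per-entry coin flip between the two parents' output states
    return [{k: random.choice((tb[k], to[k])) for k in tb}
            for tb, to in zip(best_parent, other_parent)]
```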
Parameter                          Mono-seed                              Multi-seed
Population size                    1000 (difficulty 0)                    2000 (difficulty 4)
Number of generations              30 (difficulty 0)                      500 (difficulty 4)
Crossover percentage               95 %                                   95 %
Mutation percentage                2 % (individuals)                      2 % (individuals)
Mutation rate                      1 % (genes)                            1 % (genes)
Percentage of random individuals   5 % (decreased with the generations)   5 % (constant)
Fitness function                   generic (aggregation)                  hierarchical
The last agent was evolved only for some generations (not all those desired) at that difficulty level, due to the problems mentioned above, so it cannot complete this hard level in the simulator. However, as can be seen, it is quite good in the first part of the play. Thus, if we could finish the complete evolution process at this difficulty level, we think the agent could complete any possible level.
5 Conclusions
In this work, two different approaches for evolving, by means of Genetic Algorithms (GAs), agents which play the Super Mario Bros. game have been proposed and analysed. They have been implemented using Finite State Machine (FSM) models and considering different schemes, mono-seed and multi-seed evaluation approaches, along with two different fitness functions. Both algorithms have been tested inside a simulator named Mario AI, implemented for the Mario AI Competition, focusing on the GamePlay track.
Several experiments have been conducted to test the algorithms, and a deep analysis has been performed in each case in order to set the best configuration parameters for the GA. Some problems have arisen, such as the high memory requirements, which have made it hard to complete the optimisation process in several cases. However, very competent agents have been obtained for the difficulty levels 0 to 4 in both approaches, which are, in turn, enough for the GamePlay competition requirements.
Comparing the approaches, mono-seed can yield excellent agents for the level where they were 'trained' (evolved), but these behave quite badly in a different level. Multi-seed takes much more computational time and has higher resource requirements, but the agents it yields are very good at playing any level of the difficulty considered during evolution. All these agents play much better than an expert human player and can complete the levels in a time impossible for a human to achieve.
References
1. Bojarski, S., Bates-Congdon, C.: REALM: A rule-based evolutionary computation
agent that learns to play mario. In: Proceedings of the IEEE CIG 2011, pp. 83–90.
IEEE Press (2011)
2. Booth, T.L.: Sequential Machines and Automata Theory, 1st edn. Wiley, New York
(1967)
3. Goldberg, D.E., Korb, B., Deb, K.: Messy genetic algorithms: motivation, analysis,
and first results. Complex Syst. 3(5), 493–530 (1989)
4. Mora, A.M., Fernández-Ares, A., Merelo-Guervós, J.-J., García-Sánchez, P.: Dealing with noisy fitness in the design of a RTS game bot. In: Di Chio, C., et al. (eds.) EvoApplications 2012. LNCS, vol. 7248, pp. 234–244. Springer, Heidelberg (2012)
5. Pedersen, C., Togelius, J., Yannakakis, G.: Modeling player experience in super
mario bros. In: Proceedings 2009 IEEE Symposium on Computational Intelligence
and Games (CIG’09), pp. 132–139. IEEE Press (2009)
6. Togelius, J., Karakovskiy, S., Koutnik, J., Schmidhuber, J.: Super mario evolution.
In: Proceedings 2009 IEEE Symposium on Computational Intelligence and Games
(CIG’09), pp. 156–161. IEEE Press (2009)
Identifying Key Algorithm Parameters
and Instance Features Using Forward Selection
University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
{hutter,hoos,kevinlb}@cs.ubc.ca
1 Introduction
State-of-the-art algorithms for hard combinatorial optimization problems tend
to expose a set of parameters to users to allow customization for peak perfor-
mance in different application domains. As these parameters can be instantiated
independently, they give rise to combinatorial spaces of possible parameter con-
figurations that are hard for humans to handle, both in terms of finding good
configurations and in terms of understanding the impact of each parameter. As
an example, consider the most widely used mixed integer programming (MIP)
software, IBM ILOG CPLEX, and the manual effort involved in exploring its 76
optimization parameters [1].
By now, substantial progress has been made in addressing the first sense
in which large parameter spaces are hard for users to deal with. Specifically,
it has been convincingly demonstrated that methods for automated algorithm
configuration [2–7] are able to find configurations that substantially improve the
state of the art for various hard combinatorial problems (e.g., SAT-based formal
verification [8], mixed integer programming [1], timetabling [9], and AI planning
[10]). However, much less work has been done towards the goal of explaining
to algorithm designers which parameters are important and what values for
these important parameters lead to good performance. Notable exceptions in
the literature include experimental design based on linear models [11,12], an
entropy-based measure [2], and visualization methods for interactive parameter
exploration, such as contour plots [13]. However, to the best of our knowledge,
none of these methods has so far been applied to study the configuration spaces
of state-of-the-art highly parametric solvers; their applicability is unclear, due
to the high dimensionality of these spaces and the prominence of discrete para-
meters (which, e.g., linear models cannot handle gracefully).
In the following, we show how a generic, model-independent method can be
used to:
– identify key parameters of highly parametric algorithms for solving SAT, MIP,
and TSP;
– identify key instance features of the underlying problem instances;
– demonstrate interaction effects between the two; and
– identify values of these parameters that are predicted to yield good perfor-
mance, both unconditionally and conditioned on instance features.
2 Methods
Ultimately, our forward selection methods aim to identify a set of the kmax most
important algorithm parameters and mmax most important instance features
(where kmax and mmax are user-defined), as well as the best values for these
parameters (both on average across instances and on a per-instance basis). Our
approach for solving this problem relies on predictive models, learned from given
algorithm performance data for various problem instances and parameter con-
figurations. We identify important parameters and features by analyzing which
inputs suffice to achieve high predictive accuracy in the model, and identify good
parameter values by optimizing performance based on model predictions.
There are many possible approaches for identifying important input dimensions
of a model. For example, one can measure the model coefficients w in ridge
regression (large coefficients mean that small changes in a feature value have
a large effect on predictions, see, e.g., [26]) or the length scales λ in Gaussian
process regression (small length scales mean that small changes in a feature value
have a large effect on predictions, see, e.g., [27]). In random forests, to measure
the importance of input dimension i, Breiman suggested perturbing the values in
the ith column of the out-of-bag (or validation) data and measuring the resulting
loss in predictive accuracy [28].
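As an illustration of the last of these measures, the following sketch computes Breiman-style permutation importance for a fitted model `model` on validation data; the RMSE loss and the NumPy-based implementation are our own choices.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def permutation_importance(model, X_val, y_val, seed=0):
    """Importance of input i = increase in validation RMSE after
    permuting the values in column i (a sketch of Breiman's method)."""
    rng = np.random.default_rng(seed)
    base = rmse(y_val, model.predict(X_val))
    importances = []
    for i in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, i] = rng.permutation(X_perm[:, i])  # perturb column i only
        importances.append(rmse(y_val, model.predict(X_perm)) - base)
    return np.asarray(importances)
```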
All of these methods run into trouble when input dimensions are highly correlated. While this does not occur with randomly sampled parameter configurations, it does occur with instance features, which cannot be freely sampled. Our goal is to build models that yield good predictions yet depend on as few input dimensions as possible; to achieve this goal, it is not sufficient to merely find important parameters; rather, we need to find a set of important parameters that are as uncorrelated as possible.
Forward selection is a generic, model-independent tool that can be used to solve this problem [17,29].¹ Specifically, this method identifies sets of model inputs that are jointly sufficient to achieve good predictive accuracy; our variant of it is defined in Algorithm 1. After initializing the complete input set I and the subset of important inputs S in lines 1–2, the outer for-loop incrementally adds one input at a time to S. The forall-loop over inputs i not yet contained in S (and not violating the constraint of adding at most kmax parameters and mmax features) uses validation data to compute err(i), the root mean squared error (RMSE) of a model containing i and the inputs already in S. It then adds the input resulting in the lowest RMSE to S. Because inputs are added one at a time, highly correlated inputs will only be added if they provide large marginal value to the model.
Note that we simply call procedure learn with a subset of input dimensions, regardless of whether they are numerical or categorical (for models that require a so-called "1-in-K encoding" to handle categorical parameters, this means we introduce/drop all K binary columns representing a K-ary categorical input at once). Also note that, while here we use prediction RMSE on the validation set to assess the value of adding input i, forward selection can also be used with any other objective function.²

¹ A further advantage of forward selection is that it can be used in combination with arbitrary modeling techniques. Although here we focus on using our best-performing model, random forests, we also provide summary results for other model types.

Algorithm 1 (fragment; the selection step):
12: S ← S \ {i};
13: î ← random element of arg min_i err(i);
14: S ← S ∪ {î};
15: return S;
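In Python, the core loop of Algorithm 1 might look as follows; `learn` and `validation_rmse` stand in for the paper's model-learning and validation steps, and the bookkeeping for the kmax/mmax constraint is omitted for brevity.

```python
def forward_selection(all_inputs, n_select, learn, validation_rmse):
    """Greedy forward selection (a sketch of Algorithm 1): repeatedly add
    the input whose inclusion yields the lowest validation RMSE."""
    S = []
    for _ in range(n_select):
        err = {i: validation_rmse(learn(S + [i]))
               for i in all_inputs if i not in S}
        S.append(min(err, key=err.get))  # input with the lowest RMSE
    return S
```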
Having selected a set S of inputs via forward selection, we quantify their
relative importance following the same process used by Leyton-Brown et al. to
determine the importance of instance features [17], which is originally due to
[31]: we simply drop one input from S at a time and measure the increase in
predictive RMSE. After computing this increase for each feature, we normalize
by dividing by the maximal RMSE increase and multiplying by 100.
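A sketch of this quantification, reusing the same hypothetical `learn` and `validation_rmse` procedures:

```python
def relative_importance(S, learn, validation_rmse):
    """Drop one input at a time, measure the increase in validation RMSE,
    and normalize so that the largest increase maps to 100 (a sketch)."""
    base = validation_rmse(learn(S))
    increase = {i: validation_rmse(learn([j for j in S if j != i])) - base
                for i in S}
    top = max(increase.values())  # assumed positive: dropping an input hurts
    return {i: 100.0 * inc / top for i, inc in increase.items()}
```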
We note that forward selection can be computationally costly due to its need
for repeated model learning: for example, to select 5 out of 200 inputs via forward
selection requires the construction and validation of 200 + 199 + 198 + 197 +
196 = 990 models. In our experiments, this process required up to a day of CPU
time.
² In fact, it also applies to classification algorithms and has, e.g., been used to derive classifiers for predicting the solubility of SAT instances based on 1–2 features [30].
Algorithm  Parameter type  # parameters of this type  # values considered  Total # configurations
CPLEX      Boolean         6                          2                    1.90 × 10^47
           Categorical     45                         3–7
           Integer         18                         5–7
           Continuous      7                          5–8
SPEAR      Categorical     10                         2–20                 8.34 × 10^17
           Integer         4                          5–8
           Continuous      12                         3–6
LK-H       Boolean         5                          2                    6.91 × 10^14
           Categorical     8                          3–10
           Integer         10                         3–9
We gathered a large amount of runtime data for these solvers by executing them
with various configurations and instances. Specifically, for each combination of
solver and instance distribution (CPLEX run on MIP, SPEAR on SAT, and LK-H
on TSP instances), we measured the runtime of each of M = 1 000 randomly-
sampled parameter configurations on each of the P problem instances available
for the distribution, with P ranging from 63 to 2 000. The resulting runtime
observations can be thought of as a M × P matrix. Since gathering this runtime
matrix meant performing M · P (i.e., between 63 000 and 2 000 000) runs per
dataset, we limited each single algorithm run to a cutoff time of 300 CPU seconds
on one node of the Westgrid cluster Glacier (each of whose nodes is equipped with
two 3.06 GHz Intel Xeon 32-bit processors and 2–4 GB RAM). While collecting
this data required substantial computational resources (between 1.3 CPU years
and 18 CPU years per dataset), we note that this much data was only required
for the thorough empirical analysis of our methods; in practice, our methods are
often surprisingly accurate based on small amounts of training data. For all our
experiments, we partitioned both instances and parameter configurations into
training, validation, and test sets; the training sets (and likewise, the validation
and test sets) were formed as subsamples of training instances and parameter
configurations. We used 10 000 training subsamples throughout our experiments
but demonstrate in Sect. 4.3 that qualitatively similar results can also be achieved
based on subsamples of 1 000 data points.
We note that sampling parameter configurations uniformly at random is not
the only possible way of collecting training data. Uniform sampling has the
advantage of producing unbiased training data, which in turn gives rise to models
that can be expected to perform well on average across the entire configuration
space. However, because algorithm designers typically care more about regions
of the configuration space that yield good performance, in future work, we also
aim to study models based on data generated through a biased sequential sam-
pling approach (as is implemented, e.g., in model-based algorithm configuration
methods, such as SMAC [6]).
4 Experiments
We carried out various computational experiments to identify the quality of
models based on small subsets of features and parameters identified using for-
ward selection, to quantify which inputs are most important, and to determine
good values for the selected parameters. All our experiments made use of the
algorithm performance data described in Sect. 3, and consequently, our claims
hold on average across the entire configuration space. Whether they also apply
to biased samples from the configuration space (in particular, regions of very
strong algorithm performance) is a question for future work.
This cannot be attributed to correlation between inputs in the configuration space: in our experimental design, parameter values have been sampled uniformly at random and are thus independent (i.e., uncorrelated) by design. Thus, this finding indicates that some parameters influence performance much more than others, to the point where knowledge of a few parameter values suffices to predict performance just as well as knowledge of all parameters.
Figure 2 focuses on what we consider to be the most interesting case, namely
performance prediction in the joint space of instance features and parameter
configurations. The figure qualitatively indicates the performance that can be
achieved based on subsets of inputs of various sizes. We note that in some cases,
in particular in the SPEAR scenarios, predictions of models using all inputs closely
resemble the true performance, and that the predictions of models based on a
few inputs tend to capture the salient characteristics of the full models. Since the
instances we study vary widely in hardness, instance features tend to be more
predictive than algorithm parameters, and are thus favoured by forward selec-
tion. This sometimes leads to models that only rely on instance features, yielding
predictions that are constant across parameter configurations; for example, see
the predictions with up to 10 inputs for dataset CPLEX-CORLAT (the second row
in Fig. 2). While these models yield low RMSE, they are uninformative about
parameter settings; this observation caused us to modify forward selection as
discussed in Sect. 2.2 to limit the number of features/parameters selected.
Table 2. Key inputs, in the order in which they were selected, along with their omission cost from this set.

              CPLEX-BIGMIX                        SPEAR                       LK-H
1st selected  cplex prob time (10.1)              Pre featuretime (35.9)      tour const heu avg (0.0)
2nd selected  obj coef per constr2 std (7.7)      nclausesOrig (100.0)        cluster distance std (0.8)
3rd selected  vcg constr weight0 avg (30.2)       sp-var-dec-heur (32.6)      EXCESS (10.0)
4th selected  mip limits cutsfactor (8.3)         VCG CLAUSE entropy (34.5)   bc no1s q25 (100.0)
5th selected  mip strategy subalgorithm (100.0)   sp-phase-dec-heur (27.6)    BACKTRACKING (0.0)
Table 3. Key parameters and their best fixed values as judged by an empirical performance model based on 3 features and 2 parameters.
CPLEX-BIGMIX (Table 2, left). While the single most important feature in this
case was cplex prob time (a timing feature measuring how long CPLEX probing
takes), in the context of the other four features, its importance was relatively
small; on the other hand, the input selected 5th, mip strategy subalgorithm
(CPLEX’s MIP strategy parameter from above) was the most important input in
the context of the other 4. We also note that all algorithm parameters that were
selected as important in this context of instance features (mip limits cutsfactor
and mip strategy subalgorithm for CPLEX; sp-var-dec-heur and sp-phase-dec-heur
for SPEAR; and EXCESS and BACKTRACKING for LK-H) were already selected
and labeled important when considering only parameters. This finding increases
our confidence in the robustness of this analysis.
Next, we used our subset models to identify which values the key parameters
identified by forward selection should be set to. For each dataset, we used the
same subset models of 3 features and 2 parameters as above; Table 3 lists the best
predicted values for these 2 parameters. The main purpose of this experiment
was to demonstrate that this analysis can be done automatically, and we thus
only summarize the results at a high level; we see them as a starting point that
can inform domain experts about empirical properties of their algorithm in a
particular application context and trigger further in-depth studies. At a high
level, we note that CPLEX’s parameter mip strategy subalgorithm (determining
the continuous optimizer used to solve subproblems in a MIP) was important
for most instance sets, the most prominent values being 2 (use CPLEX’s dual
simplex optimizer) and 0 (use CPLEX’s auto-choice, which also defaults to the dual
simplex optimizer). Another important choice was to set preprocessing reduce
to 3 (use both primal and dual reductions) or 1 (use only primal reductions),
depending on the instance set. For SPEAR, the parameter determining the variable
selection heuristic (sp-var-dec-heur) was the most important one in all 3 cases,
with an optimal value of 2 (select variables based on their activity level, breaking
ties by selecting the more frequent variable). For good average performance of
LK-H on TSPLIB, the most important choices were to set EXCESS to −1 (use an
instance-dependent setting of the reciprocal problem dimension), and to not use
backtracking moves.
We also measured the performance of parameter configurations that actually
set these parameters to the values predicted to be best by the model, both on aver-
age across instances and in an instance-specific way. This serves as a further way
of evaluating model quality and also facilitates deeper understanding of the para-
meter space. Specifically, we consider parameter configurations that instantiate
the selected parameters according to the model and assign all other parameter
to randomly sampled values; we compare the performance of these configurations
to that of configurations that instantiate all values at random. Figure 4 visual-
izes the result of this comparison for two datasets, showing that the model indeed
selected values that lead to high performance: by just controlling two parameters,
improvements of orders of magnitude could be achieved for some instances. Of
course, this only compares to random configurations; in contrast to our work on
algorithm configuration, here, our goal was to gain a better understanding of an algorithm's parameter space rather than to improve over its manually engineered default parameter settings.³ However, we nevertheless believe that the speedups achieved by setting only the identified parameters to good values demonstrate the importance of these parameters. While Fig. 4 only covers 2 datasets, Fig. 5 (top) summarizes results for a wide range of datasets. Figure 5 (bottom) demonstrates that predictive performance does not degrade much when using sparser training data (here: 1 000 instead of 10 000 training data points); this is important for facilitating the use of our approach in practice.

Fig. 5. Log10 speedups over random configurations achieved by setting almost all parameters at random, except for 2 key parameters whose values (fixed best, and best per instance) are selected by an empirical performance model with 3 features and 2 parameters. The boxplots show the distribution of log10 speedups across all problem instances; note that, e.g., log10 speedups of 0, −1, and 1 mean identical performance, a 10-fold slowdown, and a 10-fold speedup, respectively. The dashed green lines indicate where two configurations performed the same; points above the line indicate speedups. Top: based on models trained on 10 000 data points; bottom: based on models trained on 1 000 data points.
5 Conclusions
In this work, we have demonstrated how forward selection can be used to analyze
algorithm performance data gathered using randomly sampled parameter config-
urations on a large set of problem instances. This analysis identified small sets of
key algorithm parameters and instance features, based on which the performance
of these algorithms could be predicted with surprisingly high accuracy. Using
this fully automated analysis technique, we found that for high-performance
solvers for some of the most widely studied NP-hard combinatorial problems,
namely SAT, MIP and TSP, only very few key parameters (often just two of
dozens) largely determine algorithm performance. Automatically constructed
performance models, in our case based on random forests, were of sufficient
³ In fact, in many cases, the best settings of the key parameters were their default values.
quality to reliably identify good values for these key parameters, both on aver-
age across instances and dependent on key instance features. We believe that our
rather simple importance analysis approach can be of great value to algorithm
designers seeking to identify key algorithm parameters, instance features, and
their interaction.
We also note that the finding that the performance of these highly parametric algorithms mostly depends on a few key parameters has broad implications for the design of algorithms for NP-hard problems, such as the ones considered here, and for future algorithm configuration procedures.
In future work, we aim to reduce the computational cost of identifying key
parameters; to automatically identify the relative performance obtained with
their possible values; and to study which parameters are important in high-
performing regions of an algorithm’s configuration space.
References
1. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Automated configuration of mixed inte-
ger programming solvers. In: Proceedings of CPAIOR-10, pp. 186–202 (2010)
2. Nannen, V., Eiben, A.E.: Relevance estimation and value calibration of evolution-
ary algorithm parameters. In: Proceedings of IJCAI-07, pp. 975–980 (2007)
3. Ansotegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm for the
automatic configuration of solvers. In: Proceedings of CP-09, pp. 142–157 (2009)
4. Birattari, M., Yuan, Z., Balaprakash, P., Stützle, T.: F-race and iterated F-race:
an overview. In: Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M.
(eds.) Empirical Methods for the Analysis of Optimization Algorithms. Springer,
Heidelberg (2010)
5. Hutter, F., Hoos, H.H., Leyton-Brown, K., Stützle, T.: ParamILS: an automatic
algorithm configuration framework. JAIR 36, 267–306 (2009)
6. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization
for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 5. LNCS, vol.
6683, pp. 507–523. Springer, Heidelberg (2011)
7. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Parallel algorithm configuration. In:
Hamadi, Y., Schoenauer, M. (eds.) LION 2012. LNCS, vol. 7219, pp. 55–70.
Springer, Heidelberg (2012)
8. Hutter, F., Babić, D., Hoos, H.H., Hu, A.J.: Boosting verification by automatic
tuning of decision procedures. In: Proceedings of FMCAD-07, pp. 27–34 (2007)
9. Chiarandini, M., Fawcett, C., Hoos, H.: A modular multiphase heuristic solver for
post enrolment course timetabling. In: Proceedings of PATAT-08 (2008)
10. Vallati, M., Fawcett, C., Gerevini, A.E., Hoos, H.H., Saetti, A.: Generating fast
domain-optimized planners by automatically configuring a generic parameterised
planner. In: Proceedings of ICAPS-PAL11 (2011)
11. Ridge, E., Kudenko, D.: Sequential experiment designs for screening and tuning
parameters of stochastic heuristics. In: Proceedings of PPSN-06, pp. 27–34 (2006)
12. Chiarandini, M., Goegebeur, Y.: Mixed models for the analysis of optimization
algorithms. In: Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M. (eds.)
Experimental Methods for the Analysis of Optimization Algorithms, pp. 225–264.
Springer, Berlin (2010)
36. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. JMLR
13, 281–305 (2012)
37. Wang, Z., Zoghi, M., Hutter, F., Matheson, D., de Freitas, N.: Bayesian opti-
mization in a billion dimensions via random embeddings. ArXiv e-prints, January
(2013). arXiv:1301.1942
Using Racing to Automatically Configure
Algorithms for Scaling Performance
University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
{jastyles,hoos}@cs.ubc.ca
1 Introduction
High performance algorithms for computationally hard problems often have
numerous parameters which control their behaviour and performance. Finding
good values for these parameters, some exposed to end users and others hidden
as hard-coded design choices, can be a challenging problem for algorithm design-
ers. Recent work on automatically configuring algorithms has proven to be very
effective. These automatic algorithm configurators rely on the use of significant
computational resource to explore the design space of an algorithm.
In previous work [7], we examined a limitation of the basic protocol for using automatic algorithm configurators in scenarios where the intended use case of an algorithm is too expensive to be feasibly used during configuration. We proposed a new protocol for using algorithm configurators, referred to as train-easy select-intermediate (TE-SI), which uses so-called easy instances during the configuration step of the protocol and so-called intermediate instances during the selection step. Through a large empirical study we were able to show that TE-SI reliably outperformed the basic protocol.
In this work, we show how even better configurations can be found using
two novel configuration protocols that combine the idea of using intermediate
instances for validation with the concept of racing. One of these protocols uses
a new variant of F-Race [1] and the other is based on a novel racing procedure
dubbed ordered permutation race. We show that both racing-based protocols
reliably outperform our previous protocol [7] and are able to produce configu-
rations up to 25 % better within the same time budget or configurations of the
same quality in up to 45 % less total time and up to 90 % less time for validation.
Judge the Racers by What Matters in the End. The configuration scenarios examined in this work involve minimising a given target algorithm's runtime. While rank-based methods may indirectly lead to a reduction in runtime, they are more appropriate for scenarios where the magnitude of performance differences does not matter. We therefore propose the use of a permutation test focused on runtime, instead of the rank-based Friedman test, for eliminating configurations.
In detail, our testing procedure works as follows. Given n configurations c1, . . . , cn and m problem instances i1, . . . , im considered at stage m of the race, we use pk,j to denote the performance of configuration ck on instance ij, and pk to denote the aggregate performance of configuration ck over i1, . . . , im. In this work, we use penalised average runtime, PAR10, to measure aggregate performance, and our goal is to find a configuration with minimal PAR10. Let c1 be the current leader of the race, i.e., the configuration with the best aggregate performance among c1, . . . , cn. We now perform pairwise permutation tests between the leader, c1, and all other configurations ck. Each of these tests assesses whether c1 performs significantly better than ck; if so, ck is eliminated from the race. To perform this one-sided pairwise permutation test between c1 and ck, we generate 100,000 resamples of the given performance data for these two configurations. Each resample is generated from the original performance data by swapping the performance values p1,j and pk,j with probability 0.5 and leaving them unchanged otherwise; this is done independently for each instance j = 1, . . . , m. We then consider the distribution of the aggregate performance ratios p′1/p′k over these resamples and determine the q-quantile of this distribution that equals the p1/pk ratio for the original performance data. Finally, if, and only if, q > α2, where α2 is the significance level of the one-sided pairwise test, we conclude that c1 performs significantly better than ck. Different from F-race, where the multi-way Friedman test is used to gate a series of pairwise post-tests, we only perform pairwise tests and therefore need to apply multiple-testing correction. While more sophisticated corrections could be applied, we decided to use the simple but conservative Bonferroni correction and set α2 := α/(n − 1) for an overall significance level α.
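The following sketch implements this test; the vectorised resampling follows the description above, while reading the q-quantile criterion as a conventional one-sided p-value test is our own interpretation.

```python
import numpy as np

def leader_significantly_better(p1, pk, alpha, n_configs,
                                n_resamples=100_000, seed=0):
    """One-sided pairwise permutation test between the leader c1 and ck.
    p1, pk: per-instance PAR10 scores over the m instances seen so far."""
    rng = np.random.default_rng(seed)
    p1, pk = np.asarray(p1, float), np.asarray(pk, float)
    observed = p1.mean() / pk.mean()       # < 1 when the leader is better
    # swap the pair of scores on each instance independently with prob. 0.5
    swap = rng.random((n_resamples, p1.size)) < 0.5
    a = np.where(swap, pk, p1)
    b = np.where(swap, p1, pk)
    ratios = a.mean(axis=1) / b.mean(axis=1)
    p_value = float(np.mean(ratios <= observed))
    alpha2 = alpha / (n_configs - 1)       # Bonferroni correction
    return p_value < alpha2
```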
We refer to the racing procedure that considers problem instances in order
of increasing difficulty for the default configuration of the given target algorithm
and in each stage eliminates configurations using the previously described series
of pairwise permutation tests as ordered permutation race (op-race), and the
variant of basic F-race that uses the same ordering as ordered F-race
(of-race).
This yields two new protocols for using algorithm configurators: (1) train-
easy validate-intermediate with of-race (TE-FRI) and (2) train-easy validate-
intermediate with op-race (TE-PRI). We have observed that both protocols are
quite robust with respect to the significance level α (see extended version) and
generally use α = 0.01 for TE-FRI and α = 0.1 for TE-PRI.
The first CPLEX scenario considered configuring CPLEX 12.1 for CORLAT instances. Each run of ParamILS was given a time budget of 20 h. A 120 second per-instance cutoff was enforced during configuration and a 2 hour per-instance cutoff was enforced during testing. The second CPLEX scenario considered configuring CPLEX 12.3 for CORLAT instances. Each run of ParamILS and SMAC was given a time budget of 3456 s. A 15 second per-instance cutoff was enforced during configuration and a 346 second cutoff was enforced during testing. The third CPLEX scenario considered configuring CPLEX 12.3 for RCW instances. Each run of ParamILS and SMAC was given a time budget of 48 h. A 180 second per-instance cutoff was enforced during configuration and a 10 hour cutoff was enforced during testing.
4 Results
Using the methods described in Sect. 3 we evaluated each of the four protocols
on all five configuration scenarios. The results are shown in Table 1,
where we report bootstrapped median quality (in terms of speedup over the
default configurations, where run time was measured using PAR10 scores)
of the configurations found within various time budgets as well as bootstrap
[10 %, 90 %] percentile confidence intervals (i.e., 80 % of simulated applications
of the respective protocol fall within these ranges; note that these confidence
intervals are not for median speedups, but for the actual speedups over simulated
experiments).
As can be seen from these results, TE-PRI is the most effective configuration
protocol, followed by TE-FRI and TE-SI. These three protocols tend to produce
very similar [10 %, 90 %] confidence intervals, but the two racing approaches
achieve better median speedups, especially for larger time budgets.
To further investigate the performance differences between the protocols, we compared them against a hypothetical protocol with an oracle selection mechanism. This mechanism uses the same configurator runs as the other protocols, but always selects the configuration from this set that has the best testing performance, without incurring any additional computational burden. This provides an upper bound on the performance that could be achieved by any method for selecting from a set of configurations obtained for a given training set, configurator and time budget. These results, shown in Table 1, demonstrate that for some scenarios (e.g., CPLEX 12.1 for CORLAT) the various procedures, particularly TE-PRI, provide nearly the same performance as the oracle, while for others (e.g., CPLEX 12.3 for RCW), there is a sizable gap.
5 Conclusion
In this work, we have addressed the problem of using automated algorithm configuration in situations where instances in the intended use case of an algorithm are too difficult to be used directly during the configuration process. Building on the idea of selecting from a set of configurations optimised on easy training instances, we introduced two racing-based protocols, TE-FRI and TE-PRI, which reliably outperform our previous TE-SI protocol.
References
1. Birattari, M., Stützle, T., Paquete, L., Varrentrapp, K.: A racing algorithm for
configuring metaheuristics. In: GECCO ’02: Proceedings of the Genetic and Evolu-
tionary Computation Conference, pp. 11–18 (2002)
2. Gomes, C.P., van Hoeve, W.-J., Sabharwal, A.: Connections in networks: A hybrid
approach. In: Perron, L., Trick, M. (eds.) CPAIOR 2008. LNCS, vol. 5015, pp.
303–307. Springer, Heidelberg (2008)
3. Helsgaun, K.: An effective implementation of the Lin-Kernighan traveling salesman
heuristic. EJOR 126, 106–130 (2000)
4. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for
general algorithm configuration. In: Coello Coello, C.A. (ed.) LION 2011. LNCS,
vol. 6683, pp. 507–523. Springer, Heidelberg (2011)
5. Hutter, F., Hoos, H.H., Leyton-Brown, K., Stützle, T.: ParamILS: An automatic
algorithm configuration framework. J. Artif. Intell. Res. 36, 267–306 (2009)
6. Reinelt, G.: TSPLIB. http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95. Version visited in October 2011
7. Styles, J., Hoos, H.H., Müller, M.: Automatically configuring algorithms for scaling
performance. In: Hamadi, Y., Schoenauer, M. (eds.) LION 2012. LNCS, vol. 7219,
pp. 205–219. Springer, Heidelberg (2012)
Algorithm Selection
for the Graph Coloring Problem
1 Introduction
Many heuristic algorithms have been developed to solve combinatorial optimiza-
tion problems. Usually, such techniques show different behavior when solving
particular instances. According to the no free lunch theorems [45], no algorithm
can dominate all other techniques on each problem. In practice, this raises new
issues, as selecting the best (or most appropriate) solver for a particular instance
may be challenging. Often, the “winner-take-all” strategy is applied and the
algorithm with the best average performance is chosen to solve all instances.
However, this methodology has its drawbacks, because the distribution of tested instances affects the average performance, and usually in practice only a special class of instances is solved.
One possible approach to obtain better solutions on average is to select for
each particular instance the algorithm with the highest expected performance.
This task is known as algorithm selection (AS), and one emerging and very promising approach used for AS is based on machine learning methods. These techniques are able to learn a model based on previous observations and then predict the best algorithm for a new, unseen instance. In this paper we present new attributes of a graph that can be calculated in polynomial time and are suitable for predicting the most appropriate heuristic for the GCP.
For the clique-based features, we compute maximal cliques using a simple greedy algorithm and take statistical information about the size of these cliques as relevant attributes. Regarding the local clustering coefficient [44] of a node, we use, besides the classical value, a modified version denoted as the weighted clustering coefficient, where the coefficient of the node is multiplied by its degree. The local search probing features are extracted from
10 executions of a simple 1-opt best-improvement local search on the k-coloring
problem. The greedy coloring attributes are based on the application of DSATUR
and RLF. For these features, we take, besides the number of used colors, also
the sizes of the independent sets into account and calculate statistical informa-
tion like the average size or the variation coefficient. Furthermore, we consider
attributes of a tree decomposition obtained by a minimum-degree heuristic. Such
features have been used successfully by [32] for AS in Answer Set Programming.
The last category builds on a lower bound of k, denoted as Bl , which is the car-
dinality of the greatest maximal clique found, and an upper bound Bu , which
is the minimum number of colors needed by the two greedy algorithms. Apart
from the features described above, we also take the computation times of some feature classes as additional parameters. Note that we also experimented with attributes based on the betweenness centrality [15] and the eccentricity [20] of the nodes. Unfortunately, the algorithms we implemented to calculate these features required too much time during our tests, for which reason we did not use them in our approach.
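As an example of one of these feature classes, a sketch of the weighted clustering coefficient described above, using networkx (our choice of library, not necessarily the authors'):

```python
import statistics
import networkx as nx

def weighted_clustering_features(G):
    """Local clustering coefficient of each node multiplied by its degree,
    summarised by simple statistics (a sketch of the feature class)."""
    cc = nx.clustering(G)                      # classical local coefficient
    wcc = [cc[v] * G.degree(v) for v in G]
    return {"wcc_mean": statistics.mean(wcc),
            "wcc_stdev": statistics.pstdev(wcc),
            "wcc_max": max(wcc)}
```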
It is widely accepted that the performance of learning algorithms depends on the choice of features, and that using irrelevant features may lead to suboptimal results. Therefore, we apply feature subset selection using a forward selection with limited backtracking and a genetic search to reduce the set of basic features. Both techniques are applied with the CfsSubsetEval criterion as evaluation function. Only features that are selected by one of these methods are used further. Additionally, for each pair of features xj, xk, k > j, we create two new features that represent the product xj · xk and the quotient xj/xk, respectively. This idea is based on a similar technique used in [47], where the product of two features is also included as an additional attribute. Finally, we apply feature selection on these expanded attributes to eliminate unnecessary ones. In the end, we obtain 90 features, including 8 basic features and 82 composed attributes.
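The pairwise expansion step can be sketched as follows; the guard against division by zero is our own addition.

```python
def expand_features(features, eps=1e-12):
    """For each pair (x_j, x_k), k > j, add the product and the quotient
    as new attributes, as described above."""
    names = sorted(features)
    out = dict(features)
    for a, j in enumerate(names):
        for k in names[a + 1:]:
            out[f"{j}*{k}"] = features[j] * features[k]
            out[f"{j}/{k}"] = features[j] / (features[k] + eps)
    return out
```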
HEA is chosen because it shows good performance on flat graphs and is used as a basis for many other evolutionary heuristics applied to the GCP. We selected FPC and MMT because we also wanted to use algorithms working with partial colorings, and these two candidates are the corresponding versions of TABU and HEA. The last competitor, MAFS, is included because it shows good performance on large graphs.
As training instances, we take three different publicly available sets. The first set, further denoted as dimacs, consists of 174 graphs from the Graph Coloring and its Generalizations series (COLOR02/03/04),¹ which builds on the well-established Dimacs Challenge [22]. This set includes instances from the coloring and the clique parts of the Dimacs Challenge. The second and third sets of instances, denoted as chi500 and chi1000, are used in a comparative study [9] of several heuristics for the GCP and contain 520 instances with 500 nodes and 740 instances with 1000 nodes, respectively.² These instances were created using Culberson's [10] random instance generator by controlling various parameters such as the edge density (p = {0.1, 0.5, 0.9}) and the edge distribution (resulting in three groups of graphs: uniform graphs (G), geometric graphs (U) and weight biased graphs (W)).
For the final evaluation of our algorithm selection approach with the underlying algorithms, we use a test set comprising completely new instances of different size, density and type, generated with Culberson's instance generator. We constructed uniform (G), geometric (U) and weight biased (W) graphs of different sizes n = {500, 750, 1000, 1250} and density values p = {0.1, 0.5, 0.9}. For each parameter setting we created 5 graphs, leading to a total of 180 instances.
In order to ensure practicable results and prevent excessive computational effort, we use a maximal time limit per color tmax = min(3600, √|E| · x), where |E| is the number of edges and x is 15, 5 and 3 for the sets dimacs, chi500 and chi1000, respectively. For the test set, which contains graphs of different sizes, we stick to the values used for chi1000 (x = 3). These values for x were obtained experimentally. In this context, we note that the average time needed for the best solution on the hard instances is only 21.58 % of the allowed value tmax, and 90 % of the best solutions are found within 62.66 % of tmax.
Regarding the feature computation, we do not use any time limitations except
for the local search probing, although this might be reasonable for practical
implementations. However, for our test data the median calculation time is 2 s,
the 95th percentile is 18 s and the 99th percentile is 53 s.
In total, we collected 1434 graphs of variable size and density as training data. We removed instances where an optimal solution has been found by one of the two greedy algorithms, or where the heuristics did not find better colorings than those obtained by the greedy algorithms. We further excluded all instances where at least four heuristics (more than 50 %) yield the best solution in less than five seconds. These seem to be easy instances which can be solved efficiently by most heuristics and are therefore less interesting for algorithm selection. In the end, our training data consists of 859 hard instances.

¹ Available at http://mat.gsia.cmu.edu/COLOR04/, last visited on 22.10.2012.
² Available at www.imada.sdu.dk/~marco/gcp-study/, last visited on 28.10.2012.
Note that during our experiments we discovered instances on which several heuristics obtain the best result. For machine learning, this is rather inconvenient, as the training data should contain only one recommended algorithm per instance. One solution to this issue is multi-labeled classification [23]. However, we follow a different strategy, prioritizing the algorithms according to their average rank over all instances: in case of a tie, we prefer the algorithm with the lower rank. Concerning the performance evaluation of the classifiers, we have to take into account that there might be several "best" algorithms. For that reason, we introduce a new performance measure, called success rate (sr), defined as follows: given, for each instance i ∈ I, the set of algorithms B^i that obtain the best result on i, the success rate of a classifier c on a set of instances I is

sr = |{i ∈ I : c(i) ∈ B^i}| / |I|,

where c(i) is the predicted algorithm for instance i. Furthermore, the success rate of a solver is the ratio between the number of instances for which the solver achieves the best solution and the total number of instances.
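In code, the success rate of a classifier is a direct transcription of this definition:

```python
def success_rate(instances, predict, best_algorithms):
    """sr = |{i in I : c(i) in B^i}| / |I|, where best_algorithms[i] is the
    set B^i of algorithms achieving the best result on instance i."""
    hits = sum(1 for i in instances if predict(i) in best_algorithms[i])
    return hits / len(instances)
```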
For the selection procedure itself, we test six popular classification algorithms: Bayesian Networks (BN), C4.5 Decision Trees (DT), k-Nearest Neighbor (kNN), Multilayer Perceptrons (MLP), Random Forests (RF), and Support Vector Machines (SVM). For all these techniques, we use the implementations included in the Weka software collection [2], version 3.6.6. Furthermore, we manually identified important parameters of these learning algorithms and experimented with different settings. We refer the reader to [38] for more details regarding the different parameter settings that we used for the classification algorithms.
Apart from the selection of relevant features, a different but also important issue is whether to use the original numeric attributes or to apply a discretization step to transform the values into nominal attributes. Besides the fact that some classification algorithms cannot deal with numeric features, research has clearly shown that some classifiers achieve significantly better results when applied to discretized variables [11]. In this work, we experimented with two different supervised discretization techniques. The first one is the classical minimum description length (MDL) method [13], while the second is a derivation of MDL using a different criterion [24] (further denoted as Kononenko's criterion (KON)).
4 Experimental Results
All our experiments have been performed on a Transtec CALLEO 652 Server
containing 4 nodes, each with 2 AMD Opteron Magny-Cours 6176 SE CPUs
(2 · 12 = 24 cores with 2.3 GHz) and 128 GB memory.
Concerning the heuristics for the GCP, we execute each algorithm n = 10 times (n = 20 for the dimacs instances) using different random seeds. The result of each algorithm is the lowest number of colors that has been found in more than 50 % of the trials. Furthermore, we take the median time needed within the n executions as the required computation time. In case of a timeout, we take tmax as the computation time. Detailed results of the experiments can be found at http://www.kr.tuwien.ac.at/staff/mschweng/gcp/.
Fig. 2. Prediction of the best algorithm by different classifiers on the training data and
their comparison with the existing (meta)heuristics.
In the next step, we trained the classifiers on the complete training set and evaluated their performance on the test set. The corresponding results are shown in Fig. 3, which gives the number of instances on which the solvers show the best performance. From this figure, we can see that all learning strategies except MLP accomplish a higher number of best solutions than any existing solver for the GCP. The most successful classifiers are RF, BN and kNN, which predict the most appropriate algorithm on up to 71.71 % of the 152 graphs.
Fig. 3. Number of instances from the test set on which a solver shows best performance.
A more detailed view of the results using different metrics is given in Table 2. Besides the success rate, we also consider the distance to the best known solution, err(χ̂, G) [6], and the average rank. The figures point out that MMT is the best single heuristic with respect to the number of best solutions. Moreover, it accomplishes the lowest average distance err(χ̂, G), with a larger gap to the other approaches. Surprisingly, when we look at the average rank, MMT is not ranked first, because TABU and HEA both show a lower value. Thus, it seems that although MMT often obtains solutions with a low number of colors (resulting in a low err(χ̂, G)), it is not always ranked first. One possible explanation for this is that MMT is a powerful, but rather slow, method. Consequently, on instances where other heuristics (e.g. TABU or HEA) find equal colorings, MMT requires more computation time and is therefore ranked behind its competitors.
Compared with our solver, which applies all algorithms and an automated algorithm selection mechanism, we can see that for all considered metrics except err(χ̂, G), at least one system shows a stronger performance than the best single heuristic. The best selection mechanism is clearly provided by RF, which is ahead on all of these metrics.
Table 2. Performance metrics of the algorithm selection and the underlying heuristics
on the test set.
Fig. 4. Number of instances of the test set on which a solver shows the best performance, grouped by the graph type (edge distribution) and the density. The dark bar denotes that our approach is at least as successful as the best single solver.
In detail, TABU is to be preferred on uniform and weight biased graphs, while MMT is better on geometric graphs. Unfortunately, this information is not part of the feature space, and it seems that the classifier is not able to distinguish this based on other features, which leads to mispredictions. The suboptimal prediction rate on W-0.1 is hard to explain, as FPC is also the best algorithm on the related subset W-0.1 of the training data. Thus, it seems that the classifier is just not able to learn this pattern correctly. On the groups G-0.5 and W-0.5, our approach is also not able to achieve competitive results compared to the best single solver. This is surprising, as the best heuristic on these instances is HEA, which also shows good results on the corresponding training data. One possible explanation is that, as for instances with high density, the classifier is unable to detect the type of edge distribution. Nevertheless, we can see that in many cases the classifier is able to predict the most appropriate algorithm, which leads to a better average performance compared to any single heuristic.
5 Conclusion
Acknowledgments. The work was supported by the Austrian Science Fund (FWF):
P24814-N23. Additionally, the research herein is partially conducted within the com-
petence network Softnet Austria II (www.soft-net.at, COMET K-Projekt) and funded
by the Austrian Federal Ministry of Economy, Family and Youth (bmwfj), the province
of Styria, the Steirische Wirtschaftsförderungsgesellschaft mbH. (SFG), and the city of
Vienna in terms of the center for innovation and technology (ZIT).
References
1. Blöchliger, I., Zufferey, N.: A graph coloring heuristic using partial solutions and
a reactive tabu scheme. Comput. Oper. Res. 35(3), 960–975 (2008)
2. Bouckaert, R.R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A.,
Scuse, D.: Weka manual (3.6.6), October 2011
3. Brélaz, D.: New methods to color the vertices of a graph. Commun. ACM 22,
251–256 (1979)
4. Leyton-Brown, K., Nudelman, E., Shoham, Y.: Empirical hardness models: methodology and a case study on combinatorial auctions. J. ACM 56(4), 1–52 (2009)
5. Chaitin, G.: Register allocation and spilling via graph coloring. SIGPLAN Not.
39(4), 66–74 (2004)
6. Chiarandini, M.: Stochastic local search methods for highly constrained combina-
torial optimisation problems. Ph.D. thesis, TU Darmstadt, August 2005
7. Chiarandini, M., Dumitrescu, I., Stützle, T.: Stochastic local search algorithms for
the graph colouring problem. In: Gonzalez, T.F. (ed.) Handbook of Approximation
Algorithms and Metaheuristics. Chapman & Hall/CRC, Boca Raton (2007)
8. Chiarandini, M., Stützle, T.: An application of iterated local search to graph col-
oring. In: Johnson, D.S., Mehrotra, A., Trick, M.A. (eds.) Proceedings of the Com-
putational Symposium on Graph Coloring and its Generalizations (2002)
402 N. Musliu and M. Schwengerer
9. Chiarandini, M., Stützle, T.: An analysis of heuristics for vertex colouring. In:
Festa, P. (ed.) SEA 2010. LNCS, vol. 6049, pp. 326–337. Springer, Heidelberg
(2010)
10. Culberson, J.C., Luo, F.: Exploring the k-colorable landscape with iterated greedy.
In: Dimacs Series in Discrete Mathematics and Theoretical Computer Science, pp.
245–284. American Mathematical Society, Providence (1995)
11. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization
of continuous features. In: Machine Learning: Proceedings of the Twelfth Interna-
tional Conference, pp. 194–202. Morgan Kaufmann, San Francisco (1995)
12. Ewald, R.: Experimentation methodology. In: Ewald, R. (ed.) In: Automatic Algo-
rithm Selection for Complex Simulation Problems, pp. 203–246. Vieweg+Teubner
Verlag, Wiesbaden (2012)
13. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued
attributes for classification learning. In: Bajcsy, R. (ed.) IJCAI. Morgan Kauf-
mann, San Mateo (1993)
14. Feige, U., Kilian, J.: Zero knowledge and the chromatic number. J. Comput. Syst.
Sci. 57(2), 187–199 (1998)
15. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry
40(1), 35–41 (1977)
16. Galinier, P., Hao, J.-K.: Hybrid evolutionary algorithms for graph coloring. J.
Comb. Optim. 3, 379–397 (1999)
17. Garey, M.R., Johnson, D.S., Hing, S.C.: An application of graph coloring to printed
circuit testing. IEEE Trans. Circ. Syst. (1976)
18. Guerri, A., Milano, M.: Learning techniques for automatic algorithm portfolio
selection. In: de Mántaras, R.L., Saitta, L. (eds.) In: Conference on Artificial Intel-
ligence, ECAI’2004, pp. 475–479. IOS Press, Amsterdam (2004)
19. Guo, H., Hsu, W.H.: A machine learning approach to algorithm selection for NP-
hard optimization problems: a case study on the MPE problem. Ann. Oper. Res.
156, 61–82 (2007)
20. Hage, P., Harary, F.: Eccentricity and centrality in networks. Soc. Netw. 17(1),
57–63 (1995)
21. Hertz, A., de Werra, D.: Using tabu search techniques for graph coloring. Com-
puting 39(4), 345–351 (1987)
22. Johnson, D.J., Trick, M.A. (eds) Cliques, Coloring, and Satisfiability: Second
DIMACS Implementation Challenge, 11–13 October 1993. American Mathematical
Society (1996)
23. Kanda, J., de Carvalho, A.C.P.L.F., Hruschka, E.R., Soares, C.: Selection of algo-
rithms to solve traveling salesman problems using meta-learning. Int. J. Hybrid
Intell. Syst. 8(3), 117–128 (2011)
24. Kononenko, I.: On biases in estimating multi-valued attributes. In: IJCAI. Morgan
Kaufmann, San Francisco (1995)
25. Leighton, F.T.: A graph coloring algorithm for large scheduling problems. J. Res.
Natl Bur. Stand. 84(6), 489–506 (1979)
26. Lewis, R., Thompson, J., Mumford, C.L., Gillard, J.W.: A wide-ranging computa-
tional comparison of high-performance graph colouring algorithms. Comput. Oper.
Res. 39(9), 1933–1950 (2012)
27. Luce, D.R., Perry, A.D.: A method of matrix analysis of group structure. Psy-
chometrika 14, 95–116 (1949)
28. Malaguti, E., Monaci, M., Toth, P.: A metaheuristic approach for the vertex col-
oring problem. INFORMS J. Comput. 20(2), 302–316 (2008)
Algorithm Selection for the Graph Coloring Problem 403
29. Malaguti, E., Toth, P.: A survey on vertex coloring problems. Int. Trans. Oper.
Res. 17, 1–34 (2010)
30. Malitsky, Y., Sabharwal, A., Samulowitz, H., Sellmann, M.: Non-model-based algo-
rithm portfolios for SAT. In: Sakallah, K.A., Simon, L. (eds.) SAT 2011. LNCS,
vol. 6695, pp. 369–370. Springer, Heidelberg (2011)
31. Messelis, T., De Causmaecker, P.: An algorithm selection approach for nurse ros-
tering. In: Proceedings of BNAIC 2011, Nevelland, pp. 160–166, November (2011)
32. Morak, M., Musliu, N., Pichler, R., Rümmele, S., Woltran, S.: Evaluating tree-
decomposition based algorithms for answer set programming. In: Hamadi, Y.,
Schoenauer, M. (eds.) LION 2012. LNCS, vol. 7219, pp. 130–144. Springer, Hei-
delberg (2012)
33. Nadeau, C., Bengio, Y.: Inference for the generalization error. Mach. Learn. 52(3),
239–281 (2003)
34. Nudelman, E.: Empirical approach to the complexity of hard problems. Ph.D.
thesis, Stanford University, Stanford, CA, USA (2006)
35. Pardalos, P., Mavridou, T., Xue, J.: The Graph Coloring Problem: A Bibliographic
Survey, pp. 331–395. Kluwer Academic Publishers, Boston (1998)
36. Paschos, V.T.: Polynomial approximation and graph-coloring. Computing 70(1),
41–86 (2003)
37. Rice, J.R.: The algorithm selection problem. Adv. Comput. 15, 65–118 (1976)
38. Schwengerer, M.: Algorithm selection for the graph coloring problem. Vienna Uni-
versity of Technology, Master’s thesis, October 2012
39. Smith-Miles, K.: Towards insightful algorithm selection for optimisation using
meta-learning concepts. In: IEEE International Joint Conference on Neural Net-
works. IEEE, New York (2008)
40. Smith-Miles, K., Lopes, L.: Measuring instance difficulty for combinatorial opti-
mization problems. Comput. OR 39(5), 875–889 (2012)
41. Smith-Miles, K., van Hemert, J., Lim, X.Y.: Understanding TSP difficulty by learn-
ing from evolved instances. In: Blum, C., Battiti, R. (eds.) LION 2010. LNCS, vol.
6073, pp. 266–280. Springer, Heidelberg (2010)
42. Smith-Miles, K., Wreford, B., Lopes, L., Insani, N.: Predicting metaheuristic per-
formance on graph coloring problems using data mining. In: El Talbi, G. (ed.)
Hybrid Metaheuristics. SCI, pp. 3–76. Springer, Heidelberg (2013)
43. Venkatesan, R., Levin, L.: Random instances of a graph coloring problem are hard.
Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing,
STOC ’88, pp. 217–222. ACM, New York (1988)
44. Watts, D.J., Strogatz, S.M.: Collective dynamics of ‘small-world’ networks. Nature
393(6684), 440–442 (1998)
45. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE
Trans. Evol. Comput. 1(1), 67–82 (1997)
46. Xie, X.F., Liu, J.: Graph coloring by multiagent fusion search. J. Comb. Optim.
18(2), 99–123 (2009)
47. Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: portfolio-based algo-
rithm selection for sat. J. Artif. IntelL. Res. 32, 565–606 (2008)
48. Zufferey, N., Giaccari, P.: Graph colouring approaches for a satellite range schedul-
ing problem. J. Schedul. 11(4), 263–277 (2008)
Batched Mode Hyper-heuristics
1 Introduction
A goal of hyper-heuristic [1] research is to raise the level of generality of search
methods by providing high level strategies, and associated directly-usable soft-
ware components, that are useful across different problem domains rather than
for a single one. (Note that this general goal is not unique to hyper-heuristics
but also occurs in other forms, e.g. in memetic computation [2].) There are
two main types of hyper-heuristics depending on whether they do generation or
selection of heuristics [3]. In this paper, we focus on selection hyper-heuristics,
and in particular, those that combine heuristic selection and move acceptance
processes under a single point search (i.e. not population-based) framework [4].
A candidate solution is improved iteratively by selecting and applying a heuristic
(neighbourhood operator) from a set of low level heuristics and then using some
acceptance criteria to decide if it should replace the incumbent. We also use the HyFlex (Hyper-heuristics Flexible framework, http://www.hyflex.org/) [5] software tool associated with the CHeSC 2011 hyper-heuristic competition (http://www.asap.cs.nott.ac.uk/chesc2011/).
2 Background
Fig. 1. Performance profiles for runs on instances from the SAT domain using AdapHH. The best-so-far (BSF) quality is plotted against running time; the lines stop when no further improvements are made within the 600 nsec limit. (a) A single run on each of 5 separate instances. (b) 5 separate runs on a single instance.
3 http://code.google.com/p/generic-intelligent-hyper-heuristic/downloads/list
4 In experiments, the "10 minutes" is a "nominal" (or normalised) standardised time as determined by a benchmarking program available via the CHeSC website. To aid future comparisons, we always report results using nominal seconds (nsecs).
Table 1. For each domain, and then the aggregate of all domains, we give the average
“non-stagnant fraction of the runtime” that was taken to reach the last improvement.
(Based on runs of 1800 s per instance).
In Fig. 2(a) we give the results of ranking the instances in a domain by their
value of t(LastImp) and then plotting t(LastImp) against this rank (achieved by
AdapHH). In the SAT domain we see that most instances stagnate fairly early. In
other domains there is a wide range of stagnation times. In Fig. 2(b) we
use the same data, but give the ranking over the domains aggregated together.
We see that the instances show widely different behaviours. In particular, around
10 instances stagnate well before the usual 600 nsecs deadline; in contrast, many
other instances would potentially benefit from a longer runtime.
The dispersion of stagnation times suggests that each instance should be
given a new time limit. There are many potential ways to do this; here, however, we use two simple schemes. In the first scheme, some instances are selected to
be ‘down’ by reducing their time to a new limit arbitrarily chosen as 500 nsecs,
and others are ‘up’ with their time limit increased to 700 nsecs. Equal numbers
of up and down instances are taken so that the average time will still be 600
nsecs. The up and down instances are (heuristically) selected based on their
performance during trial runs with down being those that do not need the need
the extra time, and the up the ones that are most likely to benefit. Specifically,
8 and 6 instances were selected for AdapHH and RHH respectively. We then
tested the effects of the new time limits on the selected instances using separate
test runs that compare the qualities achieved at 500 vs. 600 nsecs for the down
instances, and 600 vs. 700 nsecs for the up instances. For efficiency, we use the
“Retrospective Parameter Variation” technique in the style of [9] (originally used
for the loosely-related topic of restarts) so as to re-use data from long runs to
quickly simulate shorter runs of different lengths. A run on a down (up) instance
is called a loser (winner) if it improved between 500 and 600 (600 and 700) nsecs,
Fig. 2. Times until the last improvement on the instances after ranking, using AdapHH. (a) For each domain separately; (b) for all the domains together.
as the reduced (increased) time would have caused a loss (an improvement) in
quality. The net gain is just the number of winners minus losers per run; we found net gains of around 7 and 5 instances for AdapHH and RHH, respectively. Although this first scheme uses knowledge of the performances, which means it is not directly implementable, it does show that simple transfers of runtimes between instances have the potential for significant improvements.
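The first scheme's retrospective bookkeeping can be made concrete with a small sketch (ours, not the authors' code); the trace format and helper names are assumptions:

```python
# Sketch: retrospective evaluation of moving an instance's deadline from
# 600 to 500 nsecs ("down") or from 600 to 700 nsecs ("up"), reusing the
# improvement traces of long runs instead of re-running the solver.
# A trace is the list of times (in nsecs) at which the best-so-far improved.

def improved_between(trace, t_lo, t_hi):
    """True if the run found a strictly better solution in (t_lo, t_hi]."""
    return any(t_lo < t <= t_hi for t in trace)

def net_gain(down_traces, up_traces):
    """Winners minus losers, as in the first scheme.

    A run on a 'down' instance is a loser if it improved between 500 and
    600 nsecs (the cut would have cost quality); a run on an 'up' instance
    is a winner if it improved between 600 and 700 nsecs (the extension
    would have paid off).
    """
    losers = sum(improved_between(tr, 500, 600) for tr in down_traces)
    winners = sum(improved_between(tr, 600, 700) for tr in up_traces)
    return winners - losers

# Hypothetical traces: one down-instance run that stagnated at 420 nsecs
# (safe to cut) and one up-instance run that still improved at 655 nsecs.
print(net_gain([[30, 180, 420]], [[50, 310, 655]]))  # -> 1
```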
In the second experimental scheme, a two-phase on-the-fly approach is followed. In the first phase, each instance is given a standard time; however, if it makes no progress within a given duration of 100 nsecs, it is terminated and the remaining time is added to a time budget. In the second phase, the time budget is shared between the instances that were not terminated in the first phase. The point of this simple and common scheme, unlike the first scheme, is that it can be implemented directly without pre-knowledge of the performance profile on an instance. The success rate, denoting the percentage of runs on which extra time allocation leads to improvement in the solution, was 75 % and 69 % for AdapHH and RHH, respectively.
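A sketch of the second scheme's budget pooling, under simplifying assumptions (we approximate phase-1 termination using only each run's last-improvement time; all names are ours):

```python
# Sketch of the second, on-the-fly scheme: phase 1 gives every instance the
# standard 600 nsecs but terminates a run once it shows no progress for 100
# nsecs; the unused time is pooled and, in phase 2, shared equally among
# the surviving instances. For simplicity we derive the phase-1 kill time
# from each run's final improvement time only.

def batched_budgets(last_improvement, standard=600, stall=100):
    """last_improvement maps instance -> time of its last improvement.

    Returns the extra time granted to each surviving instance in phase 2.
    """
    budget = 0.0
    survivors = []
    for inst, t_last in last_improvement.items():
        kill_time = t_last + stall
        if kill_time < standard:          # stagnated: terminate early
            budget += standard - kill_time
        else:
            survivors.append(inst)
    extra = budget / len(survivors) if survivors else 0.0
    return {inst: extra for inst in survivors}

print(batched_budgets({"a": 120, "b": 580, "c": 90}))
# a and c are terminated at 220 and 190 nsecs; b gets the pooled 790 nsecs.
```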
could choose to continue improving it at a later stage if time allows. Since part of the goal of batched mode is to allow inter-instance learning, and learning might naturally start with the smallest instances first, we will also add a method returning an indicative measure of the size of a given instance. A further planned improvement, "delayed commitment", arises from the observation that moves are often rejected, and the decision to reject relies only on the objective function, which is often obtainable with a fast incremental computation with no immediate need to also update internal data structures. It may be more efficient if "virtual states" were created, with a full construction of the state and data structures only occurring after a separate call to a commit method, and only if it is decided to accept the move.
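The delayed-commitment idea could look roughly like this in code (interface names such as VirtualMove and commit are our invention; the toy objective stands in for a real incremental evaluation):

```python
# Sketch: a move is first scored through a cheap incremental delta; the
# expensive update of internal data structures happens only if commit()
# is called after the acceptance decision.

class Solution:
    def __init__(self, values):
        self.values = list(values)
        self.objective = sum(values)          # toy objective for the sketch

    def propose_flip(self, i, new_value):
        """Cheap evaluation: return a virtual move, touching no structures."""
        delta = new_value - self.values[i]    # incremental objective change
        return VirtualMove(self, i, new_value, delta)

class VirtualMove:
    def __init__(self, sol, i, new_value, delta):
        self.sol, self.i, self.new_value, self.delta = sol, i, new_value, delta

    def commit(self):
        """Full construction of the new state, done only on acceptance."""
        self.sol.values[self.i] = self.new_value
        self.sol.objective += self.delta

s = Solution([3, 5, 2])
move = s.propose_flip(1, 1)       # evaluated, but nothing is updated yet
if move.delta < 0:                # acceptance uses only the objective delta
    move.commit()
print(s.objective)                # -> 6
```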
Besides the extensions specific to batch mode, we hold the view (shared,
we believe, by many others in the community) that the classical strict domain
barrier returning only the fitness is too restrictive; that more information needs
to be passed through the barrier, but still ensuring that there is no loss of the
modularity of splitting the search control from the domain details (for example,
a common and natural request seems to be to allow multiple objectives rather
than one). However, we expect that such relaxations of the domain barrier can
generally be made independently of changes for the batched mode.
References
1. Cowling, P., Kendall, G., Soubeiga, E.: A hyperheuristic approach to scheduling a
sales summit. In: Burke, E., Erben, W. (eds.) PATAT 2000. LNCS, vol. 2079, pp.
176–190. Springer, Heidelberg (2001)
2. Chen, X., Ong, Y.S.: A conceptual modeling of meme complexes in stochastic search.
IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 42(5), 612–625 (2012)
3. Burke, E.K., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., Qu, R.: Hyper-heuristics:
a survey of the state of the art. Technical Report NOTTCS-TR-SUB-0906241418-
2747, School of Computer Science, University of Nottingham (2010)
4. Özcan, E., Bilgin, B., Korkmaz, E.E.: A comprehensive analysis of hyper-heuristics.
Intell. Data Anal. 12(1), 3–23 (2008)
5. Ochoa, G., et al.: HyFlex: a benchmark framework for cross-domain heuristic search.
In: Hao, J.-K., Middendorf, M. (eds.) EvoCOP 2012. LNCS, vol. 7245, pp. 136–147.
Springer, Heidelberg (2012)
6. Mısır, M., Verbeeck, K., De Causmaecker, P., Vanden Berghe, G.: A new hyper-
heuristic implementation in HyFlex: a study on generality. In: Fowler, J., Kendall,
G., McCollum, B. (eds.) Proceedings of the MISTA’11, pp. 374–393 (2011)
7. Zilberstein, S., Russell, S.J.: Approximate reasoning using anytime algorithms. In:
Natarajan, S. (ed.) Imprecise and Approximate Computation. Kluwer Academic
Publishers, The Netherlands (1995)
8. Kheiri, A., Özcan, E.: A Hyper-heuristic with a round robin neighbourhood selec-
tion. In: Middendorf, M., Blum, C. (eds.) EvoCOP 2013. LNCS, vol. 7832, pp. 1–12.
Springer, Heidelberg (2013)
9. Parkes, A.J., Walser, J.P.: Tuning local search for satisfiability testing. In: Proceed-
ings of AAAI 1996, pp. 356–362 (1996)
Tuning Algorithms for Tackling Large Instances:
An Experimental Protocol
1 Introduction
Many applications require the solution of very large problem instances. If such
large instances are to be solved effectively, the algorithms need to operate at
appropriate settings of their parameters. As one intriguing way of deriving
appropriate algorithm parameters, the automatic configuration of algorithms
has shown impressive advances [1]. However, tuning algorithms for very large
instances directly is difficult, a main reason being the high computation times
that even a single algorithm run on very large instances requires. There are two
main reasons for these high computation times. First, the computational cost
of a single search step scales with instance size; second, larger instances usually
require a much larger number of search steps to find good quality solutions. From
a theoretical side, the tuning time would scale linearly with the number of con-
figurations tested or linearly with the computation time given to each instance.
However, even if a limited number of algorithm configurations are tested during
the tuning of the algorithm, the sheer amount of time required to test a single
2 Modelling
In the most general case, we want to define a protocol for tuning an SLS algo-
rithm on a set of small training instances s ∈ S and make predictions on the
behaviour of the tuned algorithm on a very large instance sΣ . There are two
free variables in our experimental setting. The first one is the maximum cut-off
time on the small instances, which we define as a policy t(s) that depends on
the instance size s. The second one is the parameter setting that we define as
the policy m̂(s; t(s)) that depends on the instance size and the cut-off time t(s).
Our aim is to optimise policies t(s) and m̂(s; t(s)) to predict a good parameter
setting m̂(sΣ ; T Σ ) when executing the algorithm on a very large instance sΣ with
cut-off time T Σ , where T Σ is the maximum time-budget available for solving the
target instance sΣ .
We cast this problem as a parameter estimation problem for the minimisation of a loss function. In more detail, we select a priori a parametric family for the policies t(s) and m̂(s; t(s)). The value defined by the policies for an instance
size s will be determined by the sets of parameters πt and πm̂ . The number and
type of parameters in πt and πm̂ depend on the parametric family chosen for
the respective policies. We further constrain the policies by requiring that the
maximum cut-off time is larger than a specified threshold t(s) > δ and that the
policy defines a specific cut-off time for the target instance t(sΣ ) = T Σ .
Very small instances should have a smaller impact on the optimisation of the policies than small or small-to-medium training instances have. The latter are in fact closer and more similar to the target instance size sΣ. Therefore, in the most general case, we also use a weighting policy ω(s) of a specific parametric family with parameters πω. The only constraint on this policy is that Σ_{s∈S} ω(s) = 1.
We define the loss function in Eq. 1 as the difference between Cm̂(s; t(s)), which is the cost obtained when executing the algorithm with the parameter setting determined by m̂(s; t(s)), and CB(s; t(s)), which is the cost obtained when executing the algorithm with the best possible parameter setting B(s; t(s)) given the same maximum run-time t(s), and we try to determine:

arg min_{πω, πm̂, πt} Σ_{s∈S} ω(s) [Cm̂(s; t(s)) − CB(s; t(s))] .   (1)
By finding the optimal settings for πω, πm̂, and πt, we effectively find the best scaling of the examples in S and the best cut-off time, which allow us to find the policy that best describes how the parameter setting scales with the sizes in S. The same policy can be used to extrapolate a parameter setting for a target instance size sΣ and a target cut-off time TΣ.
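To make the estimation problem concrete, the following sketch encodes Eq. 1 with the parametric families adopted later in Sect. 3; cost_of and best_cost are assumed lookups over pre-computed runs, not part of the paper's tooling:

```python
# Sketch of Eq. 1 (our encoding): policies are parametric functions of the
# instance size s, and the loss sums the weighted gap between the cost under
# the policy's parameter setting and the cost under the best known setting
# for the same cut-off time.

def make_policies(pi_t, pi_m):
    c0, c1, alpha = pi_t
    c, m = pi_m
    t = lambda s: c0 + c1 * s**alpha          # cut-off time policy t(s)
    m_hat = lambda s: c + m * s               # parameter policy m_hat(s; t(s))
    return t, m_hat

def loss(sizes, pi_t, pi_m, cost_of, best_cost):
    weights = [s**3 for s in sizes]           # omega(s) = s^3 / sum s'^3
    total_w = sum(weights)
    t, m_hat = make_policies(pi_t, pi_m)
    return sum(w / total_w * (cost_of(s, m_hat(s), t(s)) - best_cost(s, t(s)))
               for s, w in zip(sizes, weights))
```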
3 A Proof of Concept
In this paper, we present a proof of concept in which we concretely use the
parameter estimation in Eq. 1 to tune an ILS algorithm and a RoTS algorithm
for the QAP [5].
The QAP models the problem of finding a minimal cost assignment between
a set P of facilities and a set L of locations. Between each pair of facilities there
is a flow defined by a flow function w : P × P → R, and locations are at distance
d : L × L → R. To simplify the notation, flow and distance functions can be seen
as two real-valued square matrices W and D, respectively. The QAP is to find a bijective function π : P → L that assigns each facility to a location and that minimises the cost functional:

Σ_{i,j∈P} w_{i,j} d_{π(i),π(j)} .
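For concreteness, the cost functional can be evaluated directly; a tiny sketch (variable names are ours):

```python
# The QAP cost functional above, evaluated directly: perm[i] gives the
# location assigned to facility i; W and D are the flow/distance matrices.

def qap_cost(W, D, perm):
    n = len(perm)
    return sum(W[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))

W = [[0, 3], [2, 0]]           # toy flows between 2 facilities
D = [[0, 5], [5, 0]]           # toy distances between 2 locations
print(qap_cost(W, D, [1, 0]))  # facility 0 -> location 1: cost 3*5 + 2*5 = 25
```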
The Experimental Setting. For each size in S = {40, 50, . . . , 100}, we generate 10 random instances of Taillard's structured asymmetric family [6]. We then measure the average percentage deviation from the best-known solution for each size s ∈ S by running the ILS algorithm 10 times on each instance with 100 values of the perturbation parameter and by taking the mean value. The maximum cut-off time for these experiments is set to a number of CPU-seconds that is larger than a threshold maxt(s), which allows for at least 1000 iterations of the ILS algorithm with the perturbation strength set to k = 0.5s.
We fix the scaling policy as ω(s) = s³ / Σ_{s′∈S} s′³, with no parameters πω to be optimised. The parametric family of the parameter policy is the linear function m̂(s; t(s)) = c + m·s with the parameters πm̂ = (c, m). The cut-off time policy t(s) is defined as a function t(s) = c0 + c1·s^α, with the constraint that t(s) > δ ∀s > s∗, where s∗ is the smallest s ∈ S. The constant δ is set to 20 ms as the minimum amount of time that can be prescribed for an experiment. Moreover, since we pre-computed the cost function for a given maximum cut-off time, we also set an upper bound t(s) < 3·maxt(s). Finally, the policy has to pass through the point (sΣ, TΣ); hence one of the parameters can be determined as a function of the others (from t(sΣ) = TΣ, c1 = (TΣ − c0)/(sΣ)^α) and πt can be restricted to the two parameters (c0, α):
arg min_{c, m, c0, α} Σ_{s∈S} (s³ / Σ_{s′∈S} s′³) [Cm̂(s; t(s)) − CB(s; t(s))] .   (2)
To minimise the loss in Eq. 2 we implemented, for this case study, an ad hoc local search procedure that estimates the parameter values within predefined ranges. The local search starts by generating random solutions until a feasible one is obtained. This initial random solution is then minimised until a local optimum is reached. The algorithm is repeated until the loss function is equal to zero or a maximum number of iterations is reached. To minimise the incumbent solution, the algorithm selects an improving neighbour by systematically selecting a parameter and increasing or decreasing its value by a step l. For integer-valued parameters l is always equal to 1. For real-valued parameters the value of l is set as in a variable neighbourhood search (VNS) [10,11]. Initially, l is set to 10⁻⁶; then, as soon as the local search is stuck in a local minimum, its value is first increased to 10⁻⁵, then 10⁻⁴, and so on until l is equal to 1. As soon as the local search escapes the local minimum, the value of l is reset to 10⁻⁶. With this local search we do not aim to define an effective procedure for parameter estimation; the aim here is mainly to have some improvement algorithm for finding reasonable parameter settings for our policy and to present a proof of concept of the experimental protocol presented in this paper. In the future, we plan to replace our ad hoc local search with better-performing local search algorithms for continuous optimization such as CMA-ES [7] or Mtsls1 [8]. The parameters c and m are evaluated in the range [−0.5, 0.5], and the parameter α in the set {3, 4, . . . , 7}. The parameter c0 is initialised to maxt(s∗), where s∗ is the smallest s ∈ S, and evaluated in the range [0, 3·maxt(s∗)].
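Our reading of this step-size schedule, as a sketch (the loss callable and the ranges are assumed):

```python
# Sketch of the ad hoc step-size schedule described above: each parameter
# is perturbed by +/- l; for real-valued parameters, l starts at 1e-6 and
# grows by a factor of 10 whenever the search is stuck in a local minimum,
# resetting to 1e-6 as soon as it escapes.

def local_search_step(params, loss, l=1e-6):
    """Try one improving +/- l move over all real-valued parameters.

    Returns (new_params, new_l): l grows tenfold (up to 1) on failure and
    resets to 1e-6 on success, mimicking a variable neighbourhood search.
    """
    base = loss(params)
    for k in range(len(params)):
        for step in (+l, -l):
            cand = list(params)
            cand[k] += step
            if loss(cand) < base:
                return cand, 1e-6           # escaped: reset the step size
    return params, min(l * 10, 1.0)         # stuck: enlarge the neighbourhood
```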
Results. For each target size sΣ and target cut-off time TΣ, we first let the t(s) policy pass through the point (sΣ, TΣ), and then we find the optimal policies by minimising Eq. 2. We optimise and test our policies for the target sizes sΣ = 150, 200, 300, 400 and 500. For the minimisation of the loss function, we allow our ad hoc local search algorithm to restart at most 25 times.
To evaluate the policies obtained, we compute two metrics: the loss obtained by the predicted value and a normalised score inspired by [9]. The loss is the difference between the cost obtained when running the algorithm with the parameter prescribed by the policy and the best parameter for the given size and cut-off time. The normalised score is computed as:

[E_unif C(s; t(s)) − Cm̂(s; t(s))] / [E_unif C(s; t(s)) − CB(s; t(s))] ,
where E_unif C(s; t(s)) is the expectation under a uniform choice of the parameter setting. This score is equal to zero when the cost of the parameter prescribed by the policy is the same as the one expected from a uniform random choice of the parameter. It is equal to one when the cost of the prescribed parameter corresponds to the cost attainable by an oracle that selects the best parameter setting. Negative values of the score indicate that the prescribed parameter is worse than what could be expected from a uniform random choice.
To calculate the two metrics, we pre-compute, on the target instance sizes, the values of the cost function for 100 possible parameter configurations. Then, to evaluate the predicted values, we round them to the closest pre-computed ones.
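Both metrics are easy to state in code; a sketch with assumed inputs (the pre-computed costs of the candidate parameter values, and the cost of the rounded prescription):

```python
# The two evaluation metrics written out: costs holds the pre-computed cost
# for each candidate parameter value, c_policy is the cost of the (rounded)
# value prescribed by the policy.

def loss_and_score(costs, c_policy):
    c_best = min(costs)                     # oracle: best a-posteriori setting
    c_unif = sum(costs) / len(costs)        # expectation of a uniform choice
    loss = c_policy - c_best
    score = (c_unif - c_policy) / (c_unif - c_best)
    return loss, score

# score = 1 means an oracle-quality prescription, 0 means no better than a
# uniform random pick, negative means worse than random.
print(loss_and_score([1.0, 2.0, 3.0, 6.0], 1.5))   # -> (0.5, 0.75)
```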
In Fig. 1 we present the results for the largest of the test instances, with sΣ = 500 and TΣ = 6615.44 CPU-seconds. The plot on top shows the cut-off time policy with exponent α = 4 that passes through the target size and target cut-off time. The second plot from the top shows the linear policy for the parameter setting, which prescribes a perturbation of size 177, while the optimal perturbation value for the specified cut-off time is 171. The third plot shows the loss. The predicted parameter is rounded to the closest pre-computed one, which is 176. The difference in the average deviation from the best-known solution amounts to 0.044977. The last plot at the bottom shows the normalised score, which for the prediction on target size 500 is equal to 0.841609. In Table 1 we summarise similar results for sizes 150, 200, 300 and 400.
Fig. 1. Cut-off time policy, parameter policy for ILS, loss, and prediction quality on
target instance size s = 500.
Table 1. Summary of the loss and normalised score on the target sizes of the policies
optimised for ILS.
Table 2. Normalised score on the target sizes for ILS in the case in which the cut-off
time is optimised as a free variable of the experimental setting, and in the case in which
the cut-off time is fixed.
Fig. 2. Comparison between VNS and the parameter policy for ILS on target instances
s at time T .
The RoTS algorithm for the QAP [4] is a rather straightforward tabu search algorithm that is put on top of a best-improvement algorithm making use of the usual 2-exchange neighbourhood for the QAP, where the locations of two facilities are exchanged at each iteration. A move is tabu if the two facilities involved are assigned to locations they were assigned to within the last tl iterations, where tl is the tabu list length. Diversification is ensured by enforcing specific assignments of facilities to locations if such an assignment was not considered for a rather large number of local search moves. In addition, an aspiration criterion is used that overrides the tabu status of a move if it leads to a new best solution. The term robust in RoTS stems from the random variation of the tabu list length within a small interval; this mechanism was intended to increase the robustness of the algorithm by making it less dependent on one fixed setting of tl [4]. Hence, instead of having a fixed tabu list length, at each iteration the value of the tabu tenure is selected uniformly at random in the range max(2, unif(μ − 0.1, μ + 0.1) · s). Thus, μ is the expected value of the scaled tabu list length and it is the only parameter exposed to the tuning. In the original paper, a setting of μ = 1.0 was proposed.
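As a one-line illustration (rounding to an integer iteration count is our addition):

```python
# The randomised tabu tenure described above: drawn uniformly around mu,
# scaled by the instance size s, with a floor of 2.

import random

def tabu_tenure(mu, s):
    return max(2, round(random.uniform(mu - 0.1, mu + 0.1) * s))

# For mu = 1.0 and s = 100 the tenure varies within roughly [90, 110].
print(tabu_tenure(1.0, 100))
```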
The Experimental Setting. For instance sizes S = {40, 50, . . . , 100}, we generate 10 of Taillard's instances with uniformly random flows between facilities and uniformly random distances between locations [6]. For each parameter setting μ ∈ {0.0, 0.1, . . . , 2.5} and for each instance size, we compute the mean deviation from the best-known solution. The mean is computed over 10 runs of the RoTS algorithm on each of the 10 instances. The maximum cut-off time for these experiments is set to a number of CPU-seconds that allows for at least 100 · s iterations of the RoTS algorithm.
We keep for this problem the same free variables and the same parametric families we used for the ILS algorithm. The only difference is a further constraint on the parameter policy m̂(s; t(s)), which is required to prescribe a positive value for the parameter for all s ∈ S and for the target size sΣ. Since the problem is more constrained and harder to minimise, we allow our ad hoc local search algorithm to restart from at most 5000 random solutions. We optimise and test the policies on target instance sizes 150, 200, 300, 400 and 500.
Results. As for the ILS algorithm, we evaluate the policies by measuring the
loss on the target instance sizes and by computing the normalised score. In Fig. 3
we present the results on the largest test instance sΣ = 500. On this instance,
the parameter policy prescribes a parameter setting of 0.040394 while the best
parameter for this instance size and cut-off time is 0.1. The loss amounts to
0.051162 and the normalised score is 0.653224.
Table 3 summarises the results on all test instances. On instance sΣ = 400
the parameter policy obtains a loss of 0 and a normalised score equal to 1. This
is due to the fact that for evaluating the policies, and hence knowing the best
parameter setting given the size and cut-off time, we pre-computed the cost of
a fixed number of parameter configurations. When computing the cost of the
parameter configuration prescribed by the policy, the value of the parameter
is rounded to the closest value for which the cost has been pre-computed. For
instance, for size sΣ = 400 the prescribed parameter value is 0.086961. This value
is rounded to 0.1, which corresponds to the best parameter setting.
To further evaluate the policy, we compare it with a default setting of the parameter, which in the case of RoTS would be μ = 1.0. Figure 4 shows the average deviation obtained with the parameter setting prescribed by the policy and with the default parameter setting. The average deviation is computed with respect to the solution quality obtained when using the best a posteriori parameter setting given the instance size and the cut-off time. On all instances, the policy obtained for the parameter setting leads to results which are much better than what we could expect from the default parameter setting. Also in this case, a stratified rank-based permutation test rejects at the 0.05 significance level the null hypothesis of no difference between the average deviations obtained with the two parameter settings.
Table 3. Summary of the loss and normalised score on the target sizes of the policies
optimised for RoTS.
Fig. 3. Cut-off time policy, parameter policy for RoTS, loss, and prediction quality on
target instance size s = 500.
Fig. 4. Comparison between the default value µ = 1.0 and the parameter policy for
RoTS on target instances s at time T .
To test the importance of also optimising a policy for the cut-off time, we also tested for this problem a tuning protocol in which we optimise the parameter policy while keeping the cut-off time fixed. In this case the cut-off time was fixed to a number of CPU-seconds that allows for 100 · s steps of the RoTS algorithm. On this problem there was no clear-cut result: on instance sizes 300, 400, and 500, with a fixed cut-off time policy, the score remains the same; on instance size 150 the score drops from 0.827993 to 0.354483; and on instance size 200 the score increases from 0.697857 to 1.
for very large instances if SLS algorithms rely on few key parameters, as do the algorithms tested here.
One key element of our contribution is the optimisation of a policy for the
cut-off time that prescribes how long a configuration should be tested during
the tuning on small instances. We showed experimentally that, at least for the ILS algorithm, optimising a cut-off time policy allows for better extrapolations
of the parameter setting on large instances.
As future work, we plan to extend the approach to algorithms with (many)
more than one parameter and to extrapolate to much larger instance sizes. In
both cases we also need to define an extended protocol for assessing the per-
formance of our method since pre-computing the cost function may become
prohibitive. Furthermore, an automatic selection of the parametric models and a comparison to other recent approaches for tuning for large instances, such as [12], would be interesting.
References
1. Hoos, H.H.: Programming by optimization. Commun. ACM 55(2), 70–80 (2012)
2. Lourenço, H.R., Martin, O., Stützle, T.: Iterated local search: framework and appli-
cations. In: Gendreau, M., Potvin, J.Y. (eds.) Handbook of Metaheuristics. Interna-
tional Series in Operations Research & Management Science, 2nd edn, pp. 363–397.
Springer, New York (2010)
3. Stützle, T.: Iterated local search for the quadratic assignment problem. Eur. J.
Oper. Res. 174(3), 1519–1539 (2006)
4. Taillard, É.D.: Robust taboo search for the quadratic assignment problem. Parallel
Comput. 17(4–5), 443–455 (1991)
5. Koopmans, T.C., Beckmann, M.J.: Assignment problems and the location of eco-
nomic activities. Econometrica 25, 53–76 (1957)
6. Taillard, E.D.: Comparison of iterative searches for the quadratic assignment prob-
lem. Location Sci. 3(2), 87–105 (1995)
7. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution
strategies. Evol. Comput. 9(2), 159–195 (2001)
8. Tseng, L.Y., Chen, C.: Multiple trajectory search for large scale global optimiza-
tion. In: Proceedings of IEEE Congress on Evolutionary Computation, Piscataway,
NJ, IEEE, pp. 3052–3059, June 2008
9. Birattari, M., Zlochin, M., Dorigo, M.: Towards a theory of practice in metaheuris-
tics design: a machine learning perspective. Theor. Inform. Appl. 40(2), 353–369
(2006)
10. Mladenović, N., Hansen, P.: Variable neighbourhood search. Comput. Oper. Res. 24(11), 1097–1100 (1997)
11. Hansen, P., Mladenovic, N.: Variable neighborhood search: principles and applica-
tions. Eur. J. Oper. Res. 130(3), 449–467 (2001)
12. Styles, J., Hoos, H.H., Müller, M.: Automatically configuring algorithms for scaling
performance. In: Hamadi, Y., Schoenauer, M. (eds.) LION 6. LNCS, vol. 7219, pp.
205–219. Springer, Heidelberg (2012)
Automated Parameter Tuning Framework
for Heterogeneous and Large Instances: Case
Study in Quadratic Assignment Problem
1 Introduction
Good parameter configurations are critically important to ensure that algorithms are efficient and effective. Automated parameter tuning (also called automated
algorithm configuration or automated parameter optimization) is concerned with
finding good parameter configurations based on training instances. Existing
approaches for automated parameter tuning mainly differ in their applicability to
different types of parameters: tuners that handle both categorical and numerical
parameters include ParamILS [16], GGA [1], iterated F-Race [4], and SMAC [15]; and there exist approaches specially designed for tuning numerical parameters, e.g. by Kriging models (SPO [2]) and continuous optimizers [32].
In this paper, we are concerned with two specific challenges of automated
parameter tuning:
1. Heterogeneity. This refers to the phenomenon that different problem instances require different parameter configurations of the same target algorithm to be solved well. Hutter et al. [14] define "inhomogeneous" instances as those for which algorithm rankings are unstable across instances. They state that inhomogeneous instance sets are problematic to address with both manual and automated methods for offline algorithm configuration. Schneider and Hoos [26]
2 AutoParTune
AutoParTune is a realization of the concept proposed in [20]. As shown in
Fig. 1(a), the framework consists of four components: (I) parameter search space
reduction; (II) instance-specific tuning to handle heterogeneous instances; (III)
scaling large instances; and (IV) global tuning. The first three components are considered pre-processing steps, which can be executed in any combination and in any order. The details of the AutoParTune subsystems are shown in Fig. 1(b).
Parameter Search Space Reduction. This component allows us to drasti-
cally reduce the size of the parameter space before tuning. We implement the
method presented in [6], which is based on Design of Experiments (DoE) and involves full factorial design and Response Surface Methodology (RSM).
Instance-Specific Tuning. For the instance-specific tuning, we construct a
generic approach for instance-specific parameter tuning: SufTra. The detail will
be presented in Sect. 4.
Scaling Large Instances. For handling large instances, we present a novel technique based on computational time analysis, ScaLa, and empirically explore its feasibility in Sect. 5.
Global Tuning. There are many global tuning methods in the literature. Here,
we embed ParamILS [16] which utilizes Iterated Local Search (ILS) to explore
the parameter space in order to find a good parameter configuration based on
the given training instances. ParamILS has been very successfully applied to tune a broad range of high-performing algorithms for several hard combinatorial problems with large numbers of parameters.
instances and 400 as testing instances. In ScaLa, we use 120 training instances
with sizes 50, 70, 90, and 150, and 100 testing instances of size 150 from the
first generator.
All experiments were performed on a 1.7 GHz Pentium-4 machine running
Windows XP for SufTra and on a 3.30 GHz Intel Core i5-3550 running Windows
7 for ScaLa. We measured runtime as the CPU time needed by these machines.
encodes a combination of two solution attributes: (1) position type, based on the
topology of the local neighborhood as given in Table 3 [13]; and (2) percentage
deviation from optimum or best known. These two attributes are combined: the
first two digits are the deviation from optimum and the last digit is the position
type. To handle target algorithms with cycles and (random) restarts, SufTra
adds two additional symbols: ‘CYCLE’ and ‘JUMP’; ‘CYCLE’ is used when the
target algorithm returns to a previously-visited position, while ‘JUMP’ is used
when the local search is restarted.
Note that in a search trajectory, several consecutive solutions may have similar solution properties before the final improvement and reaching a local optimum. We therefore compress the search trajectory sequence to a Hash String by removing consecutive repeated symbols and storing the number of repetitions in a Hash Table, to be used later in the pair-wise similarity calculation.
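A sketch of this encoding and compression step (the exact symbol layout is our assumption, based on the description above):

```python
# Sketch: each visited solution becomes a 3-character symbol -- two digits
# of percentage deviation from the best known plus one digit of position
# type -- and consecutive repeats are collapsed, with the repeat counts
# kept aside for the later pair-wise similarity calculation.

def encode(deviation_pct, position_type):
    return f"{min(int(deviation_pct), 99):02d}{position_type}"

def compress(symbols):
    """Run-length compress a trajectory: returns (hash_string, hash_table)."""
    hash_string, counts = [], []
    for sym in symbols:
        if hash_string and hash_string[-1] == sym:
            counts[-1] += 1
        else:
            hash_string.append(sym)
            counts.append(1)
    return "".join(hash_string), counts

traj = [encode(12, 3), encode(12, 3), encode(12, 3), encode(8, 1), "CYCLE"]
print(compress(traj))   # -> ('123081CYCLE', [3, 1, 1])
```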
Suffix Tree Construction. The search trajectory sequences found in the previous section are used to build a suffix tree. A suffix tree is a data structure that exposes the internal structure of a string for particularly fast implementations of many important string operations. The construction of a suffix tree has linear time complexity w.r.t. the input string length [7]. A suffix tree T for an m-character string S is a rooted directed tree having exactly m leaves numbered 1 to m. Each internal node, except for the root, has at least two children, and each edge is labeled with a substring (including the empty substring) of S. No two edges out of a node have edge-labels beginning with the same character. To represent the suffixes of a set {S1, S2, . . . , Sn} of strings, we use a generalized suffix tree. A generalized suffix tree is built by appending a different end-of-string marker (a symbol not used in any of the strings) to each string in the set, then concatenating all the strings together and building a suffix tree for the concatenated string [7].
We construct the suffix tree for the Hash Strings derived from search trajectories using Ukkonen's algorithm [7]. We build a single generalized suffix tree by concatenating all the Hash Strings together to cover all training instances. The length of the concatenated string is proportional to the sum of all the Hash String lengths. Ukkonen's algorithm works by first building an implicit suffix tree containing the first character of the string and then adding successive characters until the tree is complete. Details of Ukkonen's algorithm can be found in [7]. Our implementation of Ukkonen's algorithm requires O(n × l) time, where n is the number of instances and l is the maximum length of a Hash String.
with Inst1(feature_i) and Inst2(feature_i) as the scores from the instance-feature metric for instances 1 and 2, and feature i.
4.3 Clustering
Similar to [21], we cluster the instances with the well-known clustering approach AGNES [12], combined with the L method [25]. AGNES (AGglomerative NESting) is a hierarchical clustering approach that starts with a cluster for each individual instance and then merges the two closest clusters (i.e., the pair of clusters with the smallest distance), resulting in fewer clusters of larger sizes, until all instances belong to one cluster or a termination condition is satisfied (e.g. a prescribed number of clusters is reached). We implement the L method [25] to automatically find the optimal number of clusters.
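For illustration, a minimal AGNES-style sketch on a pre-computed distance matrix (average linkage; the L method for choosing the number of clusters is omitted here and a target count is passed in instead):

```python
# Sketch: merge the two closest clusters until k clusters remain, using
# average linkage over a pre-computed pairwise distance matrix.

def agnes(dist, k):
    clusters = [[i] for i in range(len(dist))]

    def d(a, b):  # average linkage between two clusters
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

dist = [[0, 1, 9, 8],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [8, 9, 2, 0]]
print(agnes(dist, 2))   # -> [[0, 1], [2, 3]]
```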
                          Training    Testing
I. Computational time
   CluPaTra              1,051 s     2,718 s
   SufTra                56 s        146 s
II. Performance result
   ParamILS              1.07        2.12
   CluPaTra              0.87        1.54
   ISAC                  0.83        1.21
   SufTra                0.81        1.16
   p-value*              0.061       0.042
* based on a statistical test on ISAC and SufTra
Each of the three questions above will be addressed in one of the following
subsections.
consider scaling among different instances by, e.g., considering different solution quality thresholds. In this work, we adopt the performance-based similarity measures proposed in [26], more specifically the variance measure, described in more detail in the next section, and use them to find similarities among different instances with different runtimes.
the four instance sizes are less obvious. Nevertheless, this interesting clustering result confirms the hypothesis raised in question 2: using a performance-based similarity measure and given the right runtime, different instances, small or large, can become similar to each other.
Fig. 2. Clustering four different instance sizes, each with four different computation times, by hierarchical agglomerative clustering based on the variance measure (axis labels: instance size @ runtime).
How can automatic tuning benefit from this automatically detected instance similarity? One straightforward follow-up idea is to use the best parameter configuration tuned on small instances with short runtime to solve similar large instances with long runtime. However, it remains open how good these tuned-on-small parameter configurations are compared with, for example, a parameter configuration tuned directly on instances of the same size with the same runtime. In this experiment, the two most similar groups of size-runtime pairs (see Fig. 2 of Sect. 5.2) are used: the first group includes (50, 6.709), (70, 8.002), (90, 9.822), and (150, 24.823); the second group includes (50, 23.556), (70, 29.796), (90, 39.156), and (150, 127.609). For each of the two groups, two sets of experiments are set up: (1) tuned by oracle: as a quick proof of concept, we take the best configuration from 100 configurations based on 30 instances of each instance size, as found in Sect. 5.2, and test them on another 100 testing instances of size 150 with the corresponding runtime; (2) tuned by ParamILS [16]: we tune the target algorithm using three independent runs of ParamILS for each instance size using the size-runtime pairs mentioned above; each run was assigned a maximum of 300 calls of the target algorithm on newly randomly generated training instances, and each tuned configuration is tested on the same 100 testing instances as in (1). The second experiment set is to test the generalizability of the similarity information detected in Sect. 5.2. The goal is to see how good these best configurations tuned on small
instances such as 50, 70, and 90 with shorter runtime are, compared with the best configuration tuned on instance size 150, when tested on instance size 150 with the same runtime.
The results are listed in Table 5. They confirm that, firstly, as expected, a large amount of tuning time is saved by tuning on small instances, ranging from 59 to 81 % in our experiments; and secondly, that in general, parameter configurations tuned on smaller instances with shorter runtime do not differ significantly from the ones directly tuned on large instances, as long as similarities between them are confirmed. In both groups of both experiment sets, there is no statistical difference between the configurations tuned on 50, 70, 90, and 150, tested by Wilcoxon's rank-sum test. In the first experiment set, tuned by oracle, configurations tuned on sizes 70 and 90 sometimes perform even better than those tuned on 150. The mean performance difference from the tuned-on-150 configuration in the first group is usually less than 0.1 %, and even less than 0.01 % in the second group. In the second experiment set, tuned by ParamILS, although the configuration tuned on size 150 performs best, the difference is not significant: the mean performance difference is usually less than 0.1 % in the first group, and less than 0.05 % in the second group. This shows that the similarity information detected in Sect. 5.2 can be generalized to tuners with different training instances. As a reference, the performance of the default parameter configuration (listed in Table 2) is presented in Table 5; it is statistically significantly outperformed by almost all the above tuned configurations in both groups, which proves the necessity and success of the tuning process. We also include as a reference the best configuration tuned on instance size 150 with runtime 23.556 (127.609) s, to be tested
Table 5. Results for the performance of the best parameter configurations tuned on sizes 50, 70, 90, 150, and tested on instances of size 150. The two most similar groups of size-runtime pairs (see text or Fig. 2) are used. Two experiment sets are presented, oracle and ParamILS (see text). Each column of %oracle and %ParamILS shows the mean percentage deviation from the reference cost; +x (−x) means that the tuned configuration's cost is x % more (less) than the reference cost. %time.saved shows the percentage of tuning time saved compared with tuning on instances of size 150. The performance of the default parameter configuration is shown in row "def.". The last row, 150', uses the best parameter configuration tuned on instance size 150 with runtime 127.609 (23.556) s and tested on instance size 150 with runtime 23.556 (127.609) s, respectively. Results marked with † are statistically significantly worse than the tuned-on-150 configuration according to Wilcoxon's rank-sum test.
                 23.556 s                              127.609 s
tuned.on   %oracle  %ParamILS  %time.saved   %oracle  %ParamILS  %time.saved
50         −0.48    −0.047     72            −0.048   −0.048     81
70         −0.65    −0.053     66            −0.060   −0.027     76
90         −0.61    −0.093     59            −0.057   −0.040     69
150        −0.58    −0.151     0             −0.060   −0.070     0
def.       +1.17†   +0.150†    -             −0.008†  −0.024     -
150'       +1.16†   +0.195†    -             +0.232†  +0.208†    -
on instance size 150 with a different runtime, i.e. 127.609 (23.556) s, respectively (row 150' of Table 5). Although the tuning and testing instances are of the same size, the different runtime makes a great performance difference, with results almost one order of magnitude worse than tuning on the small instances with the appropriate runtime. The 150' performance is statistically significantly worse than all the above tuned configurations belonging to the same group, and it is even significantly worse than the default configuration in the second group. This contrasts with the very minor differences among the similar size-runtime pairs (the first four rows of Table 5), and it shows the risk of tuning on algorithm solution quality without assigning the right runtime, which in fact proves the necessity of the automatic similarity detection procedure in ScaLa.
References
1. Ansótegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm for
the automatic configuration of algorithms. In: 15th International Conference on
Principles and Practice of Constraint Programming, pp. 142–157 (2009)
2. Bartz-Beielstein, T., Lasarczyk, C., Preuss, M.: Sequential parameter optimization.
In: Congress on Evolutionary Computation 2005, pp. 773–780. IEEE Press (2005)
3. Birattari, M., Gagliolo, M., Saifullah bin Hussin, Stützle, T., Yuan, Z.: Discussion
in IRIDIA coffee room, October 2008
4. Birattari, M., Yuan, Z., Balaprakash, P., Stützle, T.: F-race and iterated f-race: an
overview. In: Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M. (eds.)
Experimental Methods for the Analysis of Optimization Algorithms, pp. 311–336.
Springer, Heidelberg (2010)
5. Glover, F.: Tabu search - part I. ORSA J. Comput. 1, 190–206 (1989)
6. Gunawan, A., Lau, H.C., Lindawati: Fine-tuning algorithm parameters using the
design of experiments approach. In: Coello Coello, C.A. (ed.) LION 5. LNCS, vol.
6683, pp. 278–292. Springer, Heidelberg (2011)
7. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University
Press, Cambridge (1997)
8. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature Extraction: Foun-
dations and Applications. Springer, Heidelberg (2006)
9. Halim, S., Yap, Y.: Designing and tuning SLS through animation and graphics: an extended walk-through. In: Stochastic Local Search Workshop (2007)
10. Halim, S., Yap, Y., Lau, H.C.: Viz: a visual analysis suite for explaining local search
behavior. In: 19th ACM Symposium on User Interface Software and Technology,
pp. 57–66 (2006)
11. Halim, S., Yap, R.H.C., Lau, H.C.: An integrated white+black box approach for
designing and tuning stochastic local search. In: Bessière, C. (ed.) CP 2007. LNCS,
vol. 4741, pp. 332–347. Springer, Heidelberg (2007)
12. Han, J., Kamber, M.: Data Mining: Concept and Techniques, 2nd edn. Morgan
Kaufman, San Francisco (2006)
13. Hoos, H.H., Stützle, T.: Stochastic Local Search: Foundation and Application.
Morgan Kaufman, San Francisco (2004)
14. Hutter, F., Hoos, H., Leyton-Brown, K.: Tradeoffs in the empirical evaluation of
competing algorithm designs. Ann. Math. Artif. Intell. (AMAI), Spec. Issue Learn.
Intell. Optim. 60, 65–89 (2011)
15. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization
for general algorithm configuration. In: Coello Coello, C.A. (ed.) LION 5. LNCS,
vol. 6683, pp. 507–523. Springer, Heidelberg (2011)
16. Hutter, F., Hoos, H.H., Leyton-Brown, K., Stützle, T.: ParamILS: an automatic algorithm configuration framework. J. Artif. Intell. Res. 36, 267–306 (2009)
17. Kadioglu, S., Malitsky, Y., Sellmann, M., Tierney, K.: ISAC: instance-specific algorithm configuration. In: 19th European Conference on Artificial Intelligence (2010)
18. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–680 (1983)
19. Knowles, J.D., Corne, D.W.: Instance generators and test suites for the multiob-
jective quadratic assignment problem. In: Fonseca, C.M., Fleming, P.J., Zitzler,
E., Thiele, L., Deb, K. (eds.) EMO 2003. LNCS, vol. 2632, pp. 295–310. Springer,
Heidelberg (2003)
20. Lau, H.C., Xiao, F.: Enhancing the speed and accuracy of automated parameter
tuning in heuristic design. In: 8th Metaheuristics International Conference (2009)
21. Lindawati, Lau, H.C., Lo, D.: Clustering of search trajectory and its application to
parameter tuning. JORS Special Edition: Systems to Build Systems (to appear)
22. Ng, K.M., Gunawan, A., Poh, K.L.: A hybrid algorithm for the quadratic assign-
ment problem. In: International Conference on Scientific Computing, pp. 14–17
(2008)
23. Ochoa, G., Verel, S., Daolio, F., Tomassini, M.: Clustering of local optima in com-
binatorial fitness landscapes. In: Coello Coello, C.A. (ed.) LION 5. LNCS, vol.
6683, pp. 454–457. Springer, Heidelberg (2011)
24. Reeves, C.R.: Landscapes, operators and heuristic search. Ann. Oper. Res. 86(1),
473–490 (1999)
25. Salvador, S., Chan, P.: Determining the number of clusters/segments in hierarchical
clustering/segmentation algorithms. In: 16th IEEE International Conference on
Tools with Artificial Intelligence, pp. 576–584 (2004)
26. Schneider, M., Hoos, H.H.: Quantifying homogeneity of instance sets for algorithm
configuration. In: Hamadi, Y., Schoenauer, M. (eds.) LION 6. LNCS, vol. 7219,
pp. 190–204. Springer, Heidelberg (2012)
27. Smith-Miles, K., Lopes, L.: Measuring instance difficulty for combinatorial opti-
mization problems. Comput. Oper. Res. 39(5), 875–889 (2012)
28. Stützle, T., Fernandes, S.: New benchmark instances for the QAP and the exper-
imental analysis of algorithms. In: Gottlieb, J., Raidl, G. (eds.) EvoCOP 2004.
LNCS, vol. 3004, pp. 199–209. Springer, Heidelberg (2004)
29. Styles, J., Hoos, H.H., Müller, M.: Automatically configuring algorithms for scaling
performance. In: Hamadi, Y., Schoenauer, M. (eds.) LION 6. LNCS, vol. 7219, pp.
205–219. Springer, Heidelberg (2012)
30. Xu, L., Hoos, H.H., Leyton-Brown, K.: Hydra: automatically configuring algo-
rithms for portfolio-based selection. In: Conference of the Association for the
Advancement of Artificial Intelligence (AAAI-10) (2010)
31. Yong, L., Pardalos, P.M., Resende, M.G.C.: A greedy randomized adaptive search
procedure for the quadratic assignment problem. In: Pardalos, P.M., Wolkowicz, H.
(eds.) Quadratic Assignment and Related Problems. DIMACS Series in Discrete
Mathematics and Theoretical Computer Science, vol. 16, pp. 237–261. American
Mathematical Society, Providence (1994)
32. Yuan, Z., Montes de Oca, M., Birattari, M., Stützle, T.: Continuous optimization
algorithms for tuning real and integer parameters of swarm intelligence algorithms.
Swarm Intell. 6(1), 49–75 (2012)
Practically Desirable Solutions Search
on Multi-Objective Optimization
1 Introduction
Evolutionary multi-objective algorithms [1] optimize simultaneously two or more
objective functions that are usually in conflict with each other. The aim of the
algorithm is to find the set of Pareto optimal solutions that capture the trade-offs
among objective functions. In the presence of several solutions, a decision maker
often considers preferences in objective space to choose one or few candidate
solutions for implementation. This approach is valid when there is no concern
about the buildability of candidate solutions.
In many practical situations, however, the decision maker has to pay spe-
cial attention to decision space in order to determine the constructability of a
potential solution. In manufacturing applications, for example, preferences for
particular values of variables could appear due to unexpected operational con-
straints, such as the availability or lack of materials with particular specifica-
tions. Or simply because the physical processes that determine a particular value for a decision variable have become easier to perform than those required to determine another value. When these situations arise, the decision maker is interested in knowing how far these possible solutions are from optimality. Furthermore, in design optimization and innovation-related applications, rather than precise values of decision variables of a few candidate solutions, the extraction of useful design knowledge is more relevant. In these cases, analysis of what-if scenarios in decision space, without losing sight of optimality, is important.
2 Proposed Method
2.1 Concept
We pursue an approach that incorporates additional fitness functions associated with particular decision variables, aiming to find solutions around preferred values of the chosen variables while searching for optimal solutions in the original objective space.
Let us define the original objective space as the vector of functions $f^{(m)}(x) = (f_1(x), \ldots, f_m(x))$ and the expanded objective space as
$$f^{(M)}(x) = (f_1(x), \ldots, f_m(x), f_{m+1}(x), \ldots, f_M(x)),$$
where $f_{m+1}(x), \ldots, f_M(x)$ are the additional $M - m$ functions used to evaluate solutions with preferred values in one or more decision variables.
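The surviving text does not spell out the form of the additional functions; as a minimal sketch, assuming each extra function measures the distance of variable x5 to one preferred value (the two preferred values used in the experiments below), the expanded objective vector could be assembled as:

    def expanded_objectives(x, original_objs, preferred=(0.3, 0.4), var=4):
        # f1(x), ..., fm(x): the original objective functions.
        f = [obj(x) for obj in original_objs]
        # fm+1(x), ..., fM(x): one additional fitness function per preferred
        # value; here (our assumption) the distance of x[var] to that value,
        # so smaller is better. var=4 is the 0-based index of x5.
        f += [abs(x[var] - p) for p in preferred]
        return f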
Extending the objective space would work to bias selection to include solu-
tions with particular desired values for some decision variables. However, it is
also expected that evolution in an expanded objective space would substantially
increase diversity of solutions, which could jeopardize convergence of the algo-
rithm in the original space and the expanded space as well. Thus, in addition to the expanded space, we also constrain the distance that solutions may lie from the instantaneous Pareto optimal set computed in the original space, as illustrated in Fig. 1(a). We call these solutions practically desirable solutions.
In the following we describe a method that implements this concept.
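A minimal sketch of the distance condition, assuming Euclidean distance in the original objective space (the metric is not fixed by the text):

    import math

    def is_desirable(f_sol, pareto_front, d):
        # f_sol: objective vector of a candidate in f(m); pareto_front: the
        # instantaneous Pareto optimal set computed in the original space.
        # The candidate is practically desirable when it lies within
        # distance d of the closest member of that set.
        return min(math.dist(f_sol, f_ref) for f_ref in pareto_front) <= d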
Fig. 1. (a) Region of practically desirable solutions with preferred values in variable space, located at most a distance d from the Pareto optimal set computed in the original objective space $f^{(m)}$. (b) Sorting by desirability with respect to the original space $f^{(m)}$ and front number in the extended space $f^{(M)}$.
Step 1: Get a copy of the set of non-dominated solutions from Population B that evolves in the original space $f^{(m)}$. Let us call this set $F_1^{(m)}$.
[Fig. 3 panels plot, from left to right, solutions against variable x5, the objective space f1–f2, and GD (log scale) over the generations.]
Fig. 3. Proposed method searching concurrently on the original space $f^{(m)} = (f_1, f_2)$, m = 2, n = 5, and the expanded space $f^{(M)}$, M = 4. Final generation, DTLZ3 problem.
We run the algorithms 30 times using different random seeds and present average results, unless stated otherwise. The number of generations is set to 1000; the parent and offspring population sizes are $|P_t^A| = |Q_t^A| = 2250$ for the search on the expanded space and $|P_t^B| = |Q_t^B| = 250$ for the search on the original space. As variation operators, the algorithms use SBX crossover and polynomial mutation, setting their distribution exponents to $\eta_c = 15$ and $\eta_m = 20$, respectively. The crossover rate is $p_c = 1.0$, the crossover rate per variable is $p_{cv} = 0.5$, and the mutation rate per variable is $p_m = 1/n$.
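Polynomial mutation is a standard operator; the following is a generic sketch of its basic form with the settings above, not the authors' implementation:

    import random

    def polynomial_mutation(x, low, high, eta_m=20.0, pm=None):
        # Basic polynomial mutation (Deb and Agrawal): each variable is
        # perturbed with probability pm (1/n by default) by a polynomially
        # distributed step whose spread shrinks as eta_m grows.
        pm = 1.0 / len(x) if pm is None else pm
        y = list(x)
        for i in range(len(y)):
            if random.random() < pm:
                u = random.random()
                if u < 0.5:
                    delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
                else:
                    delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
                y[i] = min(max(y[i] + delta * (high - low), low), high)
        return y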
Figure 3 shows the results of the proposed method searching concurrently on the original space $f^{(m)} = (f_1, f_2)$, m = 2, and on the expanded space $f^{(M)}$, M = 4, for n = 5 variables. From Fig. 3(a) it can be seen that the proposed method effectively finds solutions around the two preferred values x5 = 0.3 and x5 = 0.4. In addition, it also finds solutions around x5 = 0.5, the value at which solutions become Pareto optimal in this problem. Also, from Fig. 3(b) note that the solutions found are within the threshold distance d established as a condition for solution desirability.
Figure 3(c) shows the generational distance GD over the generations. GD is calculated separately by grouping solutions around the preferred values x5 = 0.3 and x5 = 0.4 and the optimal value x5 = 0.5. A solution is assigned to a group if its value of x5 lies within ±0.005 of the group's reference value. Note that GD decreases considerably for all three groups of solutions. This clearly shows that the concurrent search on the original space pulls the population closer to the Pareto optimal front and achieves good convergence, in addition to finding solutions around the preferred values in variable space. These solutions are valuable for the designer to analyze alternatives that include practically desirable features in addition to optimality.
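As a sketch of this grouped evaluation, taking GD to be the mean Euclidean distance to the nearest reference point (one common definition; the paper does not restate its formula):

    import math

    def generational_distance(objs, reference_front):
        # Mean distance from each objective vector to its closest point on
        # the reference (true Pareto optimal) front.
        return sum(min(math.dist(f, r) for r in reference_front)
                   for f in objs) / len(objs)

    def group_by_x5(solutions, objs, centers=(0.3, 0.4, 0.5), tol=0.005):
        # A solution joins the group of center c when x5 (index 4) lies in
        # [c - tol, c + tol]; GD is then computed separately per group.
        return {c: [f for x, f in zip(solutions, objs)
                    if c - tol <= x[4] <= c + tol]
                for c in centers}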
Figure 4 shows the number of solutions that fall within the desirable area at various generations of the evolutionary process, i.e. solutions located within a distance d = 10 of the instantaneous set of Pareto optimal solutions in $f^{(m)}$. Results are shown for the DTLZ3 problem with m = {2, 3} original objectives, varying the number of variables n = {5, 10, 15}. Note that the proposed method can effectively find a large number of solutions for any number of variables.
[Fig. 4 panels: (a) m = 2 and (b) m = 3, plotting the number of individuals against the generation for n = 5, 10, 15.]
Fig. 4. Number of solutions within the desirable area over the generations, i.e. solutions located within a distance d = 10 of the Pareto optimal solutions in $f^{(m)}$ found by the proposed method. DTLZ3 problem with m = 2 and m = 3 original objectives and two additional fitness functions.
4 Conclusions
In this work we have proposed a method to search for practically desirable solutions by expanding the objective space with additional fitness functions associated with preferred values of decision variables. The proposed method evolves two populations concurrently, one in the original objective space and the other in the expanded space, using an enhanced ranking and survival selection that favors optimality as well as the practical desirability of solutions. Our experiments show that the proposed method can effectively find a large number of practically desirable solutions around preferred values of variables for DTLZ3 problems with 2 and 3 objectives in their original space and 5, 10, and 15 variables.
In the future we would like to test the approach on other kinds of problems,
including real world applications.
References
1. Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary Algorithms for Solving
Multi-Objective Problems. Kluwer Academic, Boston (2002)
2. Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. KanGAL report 200001 (2000)
3. Deb, K., Thiele, L., Laumanns, M., Zitzler, E.: Scalable multi-objective optimization
test problems. In: Proceedings of the Congress on Evolutionary Computation 2002,
pp. 825–830. IEEE Service Center (2002)
Oversized Populations and Cooperative
Selection: Dealing with Massive Resources
in Parallel Infrastructures
1 Introduction
We seek a more efficient exploitation of massively large infrastructures in parallel EAs by balancing the population size and selection pressure parameters. To that aim, we assume platforms in which the number of resources can always be considered sufficient (see e.g. [5]), i.e. large enough to allow a parallelized population to eventually converge to the problem optima. The challenging issue here is to make efficient use of such resources, since a too large population size can be considered oversized: a parametrization error leading to unnecessary waste of computational time and resources [8]. We show that, in the case of oversized populations, selection pressure can be increased to high values in such a way that
computing time is minimized and the solution quality is not damaged. Hence, the population sizing problem can be redefined as a twofold question that we call the selection dominance criterion: a selection scheme A can be said to dominate another selection scheme B if:
2 Cooperative Selection
The design of new selection schemes is an active topic of research in EC. In
addition to canonical approaches such as ranking, roulette wheel or tournament
selection [3], other selection schemes have been designed to trade off exploration
and exploitation [1] or to be able to self-adapt the selection pressure on-line [2],
just to mention a few. Cooperation has also been considered in the design of co-evolutionary EAs [9], in which sub-populations represent partial solutions to a problem and have to collaborate in order to build up complete solutions. However, to the extent of our knowledge, there have been no attempts at designing selection schemes inspired by cooperation.
Cooperative selection, in this early approach, is no more than a simple extension of the classical tournament selection operator. As in the latter, a set of randomly chosen individuals $\vec{s} = \{\mathrm{random}_1(P), \ldots, \mathrm{random}_s(P)\}$ competes for reproduction in a tournament of size s. The best-ranked individual is then
selected for breeding. The innovation of the new operator consists of each individual having two different fitness measures. The first is the standard fitness function f, calculated in the canonical way, while the second is the cooperative fitness fcoop, which is used for competing. Since we analyze the operator in a generational scheme, at the beginning of every generation fcoop is initialized with the current fitness value (f) of the individual. Therefore, every first tournament within every generation is performed in the same way as in classical tournament selection. The novelty of the approach lies in the subsequent steps: after winning a tournament $\vec{s}$, the fcoop of the fittest individual is modified to be the average of those of the second and the third, which means that, in the following competitions, the winning individual will yield its position to the second. Since each fcoop is restarted with the f value every generation, it is likely that fitter individuals reproduce at least once per generation but without taking over the entire population. The details of the cooperative selection scheme are described in Procedure 1 and sketched below.
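A minimal sketch of the operator, reconstructed from the description above (our own naming, not the authors' Procedure 1):

    import random

    def cooperative_tournament(f_coop, s):
        # f_coop: per-individual cooperative fitness values, reinitialized
        # from the standard fitness f at the start of every generation.
        # Draw s random contenders and rank them, best first (maximizing).
        contenders = random.sample(range(len(f_coop)), s)
        contenders.sort(key=lambda i: f_coop[i], reverse=True)
        winner, second, third = contenders[0], contenders[1], contenders[2]
        # Cooperative step: the winner's f_coop becomes the average of the
        # second- and third-ranked contenders, so in later tournaments of
        # the same generation it tends to yield its place to the runner-up.
        f_coop[winner] = (f_coop[second] + f_coop[third]) / 2.0
        return winner

Note that the averaging step requires s ≥ 3.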
3 Analysis of Results
In order to analyze the performance of the cooperative selection scheme, we conduct simple experiments in a master-slave GA [4] and tackle an instance of length L = 200 of the onemax problem [10]. The GA, in addition to being parallel,
Fig. 1. (a) Scalability of the success rate (SR) as a function of the population size (P); marks in bold show SR ≥ 0.98. (b) Tradeoff between Tsec and Tpar for SR ≥ 0.98 and different population and tournament sizes.
– Tsec: the sequential optimization time, i.e. the number of function evaluations until the first global optimum is evaluated.
– Tpar: the parallel optimization time, i.e. the number of generations it takes to find the problem optimum.
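Both counters can be tracked around the generational loop; a small sketch under our own naming, not the authors' experimental harness:

    def measure_times(generations, is_optimum):
        # generations: an iterable yielding, per generation, the evaluated
        # individuals of that generation. Returns (Tsec, Tpar) at the first
        # global optimum, or None if the run fails (counting against SR).
        evaluations = 0
        for t, offspring in enumerate(generations, start=1):
            for individual in offspring:
                evaluations += 1
                if is_optimum(individual):
                    return evaluations, t
        return None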
Figure 1(a) shows the scalability of the SR with respect to the population
size for the three different selection operators. The SR scales in all cases with a
sigmoid shape, in which smaller population sizes perform poorly and larger ones reach the sufficiency criterion SR ≥ 0.98. The only remarkable difference between parameterizations lies in the transition phase of the sigmoid. Such transitions occur much earlier in s2 and Coop s16 than in s16, which allows smaller populations to be sufficient for the former two but not for the latter.
In Fig. 1(b), the analysis of the trade-offs Tsec/Tpar shows that, whenever a given population size is sufficient in all three settings, the winning strategy is s16, i.e. the maximum selection pressure. Nevertheless, such results do not imply that s16 dominates s2, since s16 requires larger population sizes to achieve better performance. Coop s16, however, outperforms the parallel and sequential times of s2 while having the same population requirements. Therefore, it can be said that Coop s16 dominates s2 under any setting.
1 We follow here the selectorecombinative approach of Lobo and Lima [7] for studying the scalability of the population size.
References
1. Alba, E., Dorronsoro, B.: The exploration/exploitation tradeoff in dynamic cellular
genetic algorithms. IEEE Trans. Evol. Comput. 9(2), 126–142 (2005)
2. Eiben, A.E., Schut, M.C., De Wilde, A.R.: Boosting genetic algorithms with self-
adaptive selection. In: Proceedings of the IEEE Congress on Evolutionary Com-
putation, pp. 1584–1589 (2006)
3. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Springer, Hei-
delberg (2003)
4. Lombraña González, D., Jiménez Laredo, J., Fernández de Vega, F., Merelo
Guervós, J.: Characterizing fault-tolerance of genetic algorithms in desktop grid
systems. In: Cowling, P., Merz, P. (eds.) EvoCOP 2010. LNCS, vol. 6022, pp.
131–142. Springer, Heidelberg (2010)
5. Laredo, J.L.J., Eiben, A.E., van Steen, M., Merelo Guervós, J.J.: EvAg: a scalable peer-to-peer evolutionary algorithm. Genet. Program. Evolvable Mach. 11(2), 227–246 (2010)
6. Lässig, J., Sudholt, D.: General scheme for analyzing running times of parallel evolutionary algorithms. In: Schaefer, R., Cotta, C., Kołodziej, J., Rudolph, G. (eds.) PPSN XI. LNCS, vol. 6238, pp. 234–243. Springer, Heidelberg (2010)
7. Lobo, F., Lima, C.: Adaptive population sizing schemes in genetic algorithms.
In: Lobo, F., Lima, C., Zbigniew, M. (eds.) Parameter Setting in Evolutionary
Algorithms, vol. 54, pp. 185–204. Springer, Heidelberg (2007)
8. Lobo, F.G., Goldberg, D.E.: The parameter-less genetic algorithm in practice. Inf.
Sci. Inf. Comput. Sci. 167(1–4), 217–232 (2004)
9. Potter, M.A., De Jong, K.A.: A cooperative coevolutionary approach to function optimization. In: Davidor, Y., Männer, R., Schwefel, H.-P. (eds.) PPSN III. LNCS, vol. 866, pp. 249–267. Springer, Heidelberg (1994)
10. Schaffer, J.D., Eshelman, L.J.: On crossover as an evolutionarily viable strategy. In: Belew, R.K., Booker, L.B. (eds.) ICGA, pp. 61–68. Morgan Kaufmann, San Francisco (1991)
Effects of Population Size on Selection
and Scalability in Evolutionary
Many-Objective Optimization
1 Introduction
Conventional multi-objective evolutionary algorithms (MOEAs) [1] are known to
scale up poorly to high-dimensional objective spaces [2], particularly dominance-based algorithms. This lack of scalability has been attributed mainly to inappropriate operators for selection and variation. The population size greatly influences the dynamics of the algorithm; however, its effects on high-dimensional objective spaces are not well understood. In this work we set the population size as a fraction of the true Pareto optimal set and analyze its effects on the selection and performance scalability of a conventional MOEA applied to many-objective optimization. In
our study we enumerate small MNK-landscapes with 3–6 objectives, 20 bits, and
observe the number of Pareto optimal solutions that the algorithm is able to find
for various population sizes.
2 Methodology
In our study we use four MNK-landscapes [3] randomly generated with m = 3,
4, 5 and 6 objectives, n = 20 bits, and k = 1 epistatic bit. For each landscape
we enumerate all its solutions and classify them into non-dominated fronts. The exact numbers of true Pareto optimal solutions $POS^T$ found by enumeration are $|POS^T| = 152$, 1554, 6265, and 16845 for m = 3, 4, 5, and 6 objectives, respectively. Similarly, the exact numbers of non-dominated fronts of the landscapes are 258, 76, 29, and 22, respectively.
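Classification into non-dominated fronts can be done by repeatedly peeling off the current non-dominated set; a plain sketch, assuming maximization as is usual for MNK-landscapes:

    def dominates(a, b):
        # a dominates b when it is no worse in every objective and strictly
        # better in at least one (maximization).
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    def nondominated_fronts(points):
        # Peel successive non-dominated fronts from the enumerated objective
        # vectors; O(n^2) per front, adequate only for small enumerations
        # such as the 20-bit landscapes used here.
        remaining, fronts = list(points), []
        while remaining:
            front = [p for p in remaining
                     if not any(dominates(q, p) for q in remaining)]
            fronts.append(front)
            remaining = [p for p in remaining if p not in front]
        return fronts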
Fig. 1. Number of non-dominated solutions $F_1$ and actual number of true Pareto optimal solutions $F_1^T$ in the population over the generations. $|POS^T| = 152$ and 1554 for m = 3 and 4 objectives, respectively. Population sizes |P| = 50, 100, and 200.
Fig. 2. Accumulated and instantaneous number of true Pareto optimal solutions, $AF_1^T$ and $F_1^T$, m = 3, 4, and 5 objectives. Population sizes |P| = 50, 100, and 200.
Fig. 3. Accumulated and instantaneous number of true Pareto optimal solutions, $AF_1^T$ and $F_1^T$, m = 4, 5, and 6 objectives. Population sizes 1/3, 2/3, and 4/3 of $|POS^T|$.
4 Conclusions
In this work we analyzed the effects of population size on selection and scalability
of a conventional dominance-based MOEA for many-objective optimization. We showed that the performance of a conventional MOEA can scale up fairly well to high-dimensional objective spaces if the population size is sufficiently large compared to the size of the true Pareto optimal set.
References
1. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. Wiley,
Chichester (2001)
2. Ishibuchi, H., Tsukamoto, N., Nojima, Y.: Evolutionary many-objective optimiza-
tion: a short review. In: Proceedings of IEEE Congress on Evolutionary Computa-
tion (CEC 2008), pp. 2424–2431. IEEE Press (2008)
3. Aguirre, H., Tanaka, K.: Insights on properties of multi-objective MNK-landscapes.
In: Proceedings of 2004 IEEE Congress on Evolutionary Computation, pp. 196–203.
IEEE Service Center (2004)
A Novel Feature Selection Method
for Classification Using a Fuzzy Criterion
1 Introduction
In many practical situations, the size and dimensionality of datasets are large, and many irrelevant and redundant features are included. In a classification context, learning from huge datasets may not work well, even though in theory more features should provide more discriminant power. In order to cope with this problem, two kinds of algorithms can be used: feature transformation (or extraction) and feature selection. Feature transformation consists in constructing new features (in a lower-dimensional space) from the original ones. These methods include
clustering, basic linear transforms of the input variables (Principal Component
Analysis/Singular Value Decomposition, Linear Discriminant Analysis), spectral
transforms, wavelet transforms or convolution of kernels. The basic idea of a fea-
ture transformation is simply projecting a high-dimensional feature vector onto
a low-dimensional space. Unfortunately, the projection leads to a loss of the measurement units of the features, and the obtained features are not easy to interpret. Feature selection (FS) may overcome these disadvantages.
FS aims at selecting a subset of features that is relevant in terms of discrimination capability. It avoids the drawback of poor output interpretability, because the selected features are a subset of the given ones. FS is used as a preprocessing phase in many contexts. It plays an important role in applications that involve a large number of features and only a few samples. FS enables data
original data set onto a fuzzy space, and the determination of the optimal subset
of fuzzy features by using conventional search techniques. A k-nearest neighbor
(NN) algorithm is used. Pedrycz and Vukovich [13] generalize the feature selection method by introducing a mechanism of fuzzy feature selection. They propose to consider granular features rather than numeric ones. A process of fuzzy feature selection is carried out and numerically quantified in the space of membership values generated by fuzzy clusters. In this case a simple Fuzzy C-Means (FCM)
algorithm is used. More recently, a new heuristic algorithm has been introduced
by Li and Wu [5]. This algorithm is characterized by a new evaluation criterion,
based on a min-max learning rule, and a search strategy for feature selection
from fuzzy feature space. The authors consider the accuracy of k-NN classifier
as the evaluation criterion. Hedjazi et al. [14] introduce a new feature selection
algorithm, MEmbership Margin Based Attribute Selection (MEMBAS). This
approach processes in the same way numerical, qualitative and interval data
based on an appropriate and simultaneous mapping, using fuzzy logic concepts.
They propose to use the Learning Algorithm for Multivariable Data Analysis
(LAMBDA), a fuzzy classification algorithm that aims at getting the global membership degree of a sample to an existing class, taking into account the contribution of each feature. Chen et al. [15] introduce an embedded method: an integrated mechanism to extract fuzzy rules and select useful features simultaneously. They use the Takagi-Sugeno model for classification. Finally, Vieira et al. [16] consider fuzzy criteria in feature selection by using a fuzzy decision making framework. The underlying optimization problem is solved using an ant colony optimization algorithm previously proposed by the same authors. The classification accuracy is computed by means of a fuzzy classifier.
A different approach is considered in the work proposed by Moustakidis and Theocharis [17]. They propose a forward filter FS method based on a Fuzzy Complementary Criterion (FuzCoC). They introduce the notion of a fuzzy partition vector (FPV) associated with each feature. A local fuzzy evaluation measure with respect to patterns is used, which takes advantage of the fuzzy membership degrees of the training patterns (projected on that feature) to their own classes. These grades are obtained using a fuzzy-output kernel-based SVM. The FPV aims at detecting the data discrimination capability provided by each feature. It treats each feature on a pattern-wise basis, thus allowing redundancy between features to be assessed. They obtain subsets of discriminating (highly relevant) and non-redundant features. FuzCoC acts like a minimal-redundancy-maximal-relevance (mRMR) criterion. Once features have been selected, the prediction of class labels is obtained using a 1-NN classifier.
In the present work, we take inspiration from the above methodology and from [18] to devise a novel wrapper FS method. It can be seen as a FuzCoC constructed by means of the ReGEC (Guarracino et al. [19]) classification approach. By means of a binary linear ReGEC, a one-versus-all (OVA) strategy is implemented that allows multiclass problems to be solved. For each feature, the distances between each pattern and the classification hyperplanes are computed, and they are used to construct the membership degree of each pattern to its own class. The sum of these
grades represents the score associated with the feature, that is, its capability to discriminate the classes. In this way, all features are ranked, and the selection process determines the features leading to an increment of the total accuracy on the training set. Hence, only the features with the highest discrimination power are selected.
The advantage of this strategy is that it takes into account the peculiarities of the classification method, providing a set of features consistent with it. We show that this process yields a robust subset of features: a change in the training points produces only a small variation in the selected features. Furthermore, using standard datasets, we show that the classification accuracy obtained with a small percentage of the available features is comparable with that obtained using all features. A sketch of the selection loop is given below.
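A minimal sketch of the ranking-plus-greedy loop just described; train_accuracy is a hypothetical callback that trains the classifier (ReGEC in our method) on a candidate feature subset and returns its accuracy on the training set:

    def forward_selection(ranked_features, train_accuracy):
        # ranked_features: features sorted by score (sum of membership
        # grades), highest first. A feature is kept only when adding it
        # increases the total accuracy on the training set.
        selected, best = [], 0.0
        for feature in ranked_features:
            acc = train_accuracy(selected + [feature])
            if acc > best:
                selected, best = selected + [feature], acc
        return selected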
This paper is organized as follows. In the next section, a description of the
forward filter FS SVM-FuzCoC ([17]) is given. Section 3 contains our proposal,
FFS-ReGEC, where the novel algorithm is described. In order to check the adequacy of the proposed procedure, in Sect. 4 we present a discussion on the SONAR dataset. Some comparative results on real-world datasets are given in
Sect. 5. Finally, Sect. 6 contains some concluding remarks and open problems.
2 SVM-FuzCoC
$$D = \{D_1, \ldots, D_k, \ldots, D_M\}$$
where $D_k = \{x_{i1}, \ldots, x_{iN_k}\}$ denotes the set of class-$k$ patterns and $N_k$ is the number of patterns included in $D_k$, with $\sum_{k=1}^{M} N_k = N$ ($M$ is the number of
classes). Following the OVA methodology, the authors initially train a set of M binary K-SVM classifiers on each single feature, to obtain the fuzzy membership of each pattern to its class. Let $x_{ij}$ denote the feature-$j$ component of pattern $x_i$, $i = 1, \ldots, N$. According to FO-K-SVM, the fuzzy membership value $\mu_k(x_{ij}) \in [0, 1]$ of $x_{ij}$ to class $k$ is computed by
$$\mu_k(x_{ij}) = \begin{cases} 0.5 & \text{if } f_k(x_{ij}) = m_{ijk} = 1 \\[6pt] \dfrac{1}{1 + e^{\ln\left(\frac{1-\gamma}{\gamma}\right) \cdot \frac{f_k(x_{ij}) - m_{ijk}}{|1 - m_{ijk}|}}} & \text{if } m_{ijk} \neq 1 \end{cases} \qquad (1)$$
where $f_k(x_{ij})$ is the decision value of the $k$-th K-SVM binary classifier for $x_{ij}$, $m_{ijk} = \max_{l \neq k} f_l(x_{ij})$ is the maximum decision value obtained by the remaining $M - 1$ K-SVM binary classifiers, and $\gamma$ is the membership degree threshold fixed by the user.
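Read as code, the reconstructed Eq. (1) becomes the following direct transcription (undefined outside the two cases, exactly as above):

    import math

    def membership(f_k, m_ijk, gamma):
        # f_k: decision value of the k-th binary classifier; m_ijk: maximum
        # decision value among the remaining classifiers; gamma: user-fixed
        # membership degree threshold in (0, 1).
        if f_k == m_ijk == 1:
            return 0.5
        z = math.log((1.0 - gamma) / gamma) * (f_k - m_ijk) / abs(1.0 - m_ijk)
        return 1.0 / (1.0 + math.exp(z))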
Fig. 1. SVM-FuzCoC FS
Fig. 2. FFS-ReGEC
4 A Case Study
Fig. 3. The distribution of 1000 hold outs according to the number of selected features,
the average test accuracy and the standard deviation for each set of splits with the
same number of selected features.
In most of the hold outs, the number of selected features is between six and nine. It is also interesting that in all 20 partitions in which a single feature is selected, that feature is always feature 12.
In Fig. 4 we report the number of times (in percentage) each feature is selected in the 1000 hold outs. In the figure, the darkest bars correspond to the 15 most selected features.
In Fig. 5 the absolute value of the correlation, |corr|, among the top 15 most chosen features is shown. We can see that some features are highly correlated (the darkest zones); hence we cluster them by means of the hierarchical clustering derived from the dendrogram depicted in Fig. 6, in which the vertical axis represents the value 1 − |corr|, the complement to one of the absolute value of the correlation, clustering together those features with |corr| > 0.6. We obtain 4 clusters, forming the groups reported in Table 1; a sketch of this clustering step follows below.
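A sketch of this clustering step using SciPy; complete linkage is our assumption (the text does not name the linkage), chosen because cutting its dendrogram at 1 − 0.6 keeps only clusters whose mutual absolute correlation exceeds 0.6:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_correlated(corr, threshold=0.6):
        # Dissimilarity 1 - |corr|, as on the dendrogram's vertical axis.
        dissim = 1.0 - np.abs(np.asarray(corr, dtype=float))
        np.fill_diagonal(dissim, 0.0)
        Z = linkage(squareform(dissim, checks=False), method='complete')
        # Cut the tree at 1 - threshold: features end up in the same cluster
        # only when every pairwise |corr| inside it exceeds the threshold.
        return fcluster(Z, t=1.0 - threshold, criterion='distance')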
[Fig. 5: heat map of |corr| for the 15 most selected features: 9, 10, 11, 12, 13, 20, 21, 22, 34, 35, 36, 37, 45, 48, 49.]
[Fig. 6: dendrogram over the same 15 features (leaf order 9, 10, 11, 12, 13, 20, 21, 22, 45, 48, 49, 34, 35, 36, 37); the vertical axis is 1 − |corr|.]
In each cluster, features are correlated when all data are considered. In the hold out process the discrimination capability might be lower because of the presence/absence of some patterns. In this case, it would be desirable that another feature from the cluster of correlated features is selected, which is also consistent with the idea that an FS procedure should avoid redundancy among the selected features. To verify whether the proposed algorithm has this characteristic, we perform the following test.
Let $A(z_j)$ be the set of hold outs where feature $z_j$ has been selected, with cardinality $|A(z_j)|$, and let $C_k$ be a cluster of features. For each cluster, we compute the following coverage index:
$$CI(k) = \frac{\left|\bigcup_{z_j \in C_k} A(z_j)\right|}{H}$$
where H is the number of hold outs. This index represents the probability that at least one feature of the cluster is selected. Table 1 reports the coverage indexes of the clusters of correlated features, i.e. the percentage of hold outs in which at least one of the correlated features is selected. These percentages are much higher than those we would obtain with a random selection.
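Computed directly from the per-hold-out selections, the index is simply (a small sketch under our own naming):

    def coverage_index(cluster, selected_per_holdout):
        # selected_per_holdout: one set of selected features per hold out
        # (H sets in total). CI(k) is the fraction of hold outs selecting
        # at least one feature of the cluster, i.e. |union of A(z_j)| / H.
        hits = sum(1 for chosen in selected_per_holdout
                   if any(z in chosen for z in cluster))
        return hits / len(selected_per_holdout)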
Fig. 7. Test accuracy of FFS-ReGEC and test accuracy of a random feature selection.
this number of features and we consider the average test accuracy obtained with
FFS-ReGEC and that of the random selector.
We conclude that the proposed strategy always selects a number of features sufficient to discriminate the patterns with almost the same accuracy, whatever patterns are used to train the classifier.
5 Comparative Analysis
Table 3. Test accuracy (TA) rates for Multi-ReGEC, 1-NN, FFS-ReGEC, SVM-
FuzCoC and number of selected features (sf) for the FS procedures
6 Concluding Remarks
In this paper we propose a novel fuzzy feature selection technique. It uses the
ReGEC algorithm to select the most promising set of variables for classification.
In the future, we will devise techniques to weight the contribution of the variables in the computation of the classification model, in order to enhance the discrimination capability of the most promising ones.
Acknowledgment. This work has been partially funded by Italian Flagship project
Interomics and Kauno Technologijos Universitetas (KTU).
References
1. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
2. Battiti, R.: Using mutual information for selecting features in supervised neural
net learning. IEEE Trans. Neural Netw. 5, 537–550 (1994)
3. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information crite-
ria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern
Anal. Mach. Intell. 27, 1226–1238 (2005)
4. Ooi, C.H., Chetty, M., Teng, S.W.: Differential prioritization in feature selection and classifier aggregation for multiclass microarray datasets. Data Min. Knowl. Discov. 14, 329–366 (2007)
5. Li, Y., Wu, Z.F.: Fuzzy feature selection based on min-max learning rule and
extension matrix. Pattern Recogn. 41, 217–226 (2008)
6. Mao, K.Z.: Orthogonal forward selection and backward elimination algorithms for
feature subset selection. IEEE Trans. Syst. Man Cybern. B 34, 629–634 (2004)
7. Fu, X., Wang, L.: Data dimensionality reduction with application to simplifying
RBF network structure and improving classification performance. IEEE Trans.
Syst. Man Cybern. B. 33, 399–409 (2003)