A Racing Algorithm For Configuring Metaheuristics, M Birattari, 2002, (8p)
1 INTRODUCTION

A metaheuristic is a general algorithmic template whose components need to be instantiated and properly tuned in order to yield a fully functioning algorithm. Instantiating such an algorithmic template requires choosing among a set of different possible components and assigning specific values to all free parameters. We will refer to such an instantiation as a configuration. Accordingly, we call configuration problem the problem of selecting the optimal configuration.

Practitioners typically configure their metaheuristics in an iterative process on the basis of some runs of different configurations that are felt to be promising. Usually, such a process is heavily based on personal experience and is guided

This paper makes two main contributions. First, we give a formal definition of the metaheuristic configuration problem. Second, we show that a metaheuristic can be tuned efficiently and effectively by a racing procedure. Our results confirm the general validity of racing algorithms and extend their area of applicability. On a more technical level, leaving aside the specific application to metaheuristics, we contribute to the general class of racing algorithms. In particular, our method adopts a blocking design (Dean and Voss, 1999) in a nonparametric setting. In some sense, therefore, the method fills the gap between Hoeffding race (Maron and Moore, 1994) and BRACE (Moore and Lee, 1994): similarly to Hoeffding race it features a nonparametric test, and similarly to BRACE it considers a

1 Several metaheuristics involve continuous parameters. This would actually lead to an infinite set of candidate configurations. In practice, typically only a finite set of possible parameter values is considered by discretizing the range of continuous parameters.

† This research was carried out while MB was with Intellektik, Technische Universität Darmstadt.
12 ARTIFICIAL LIFE, ADAPTIVE BEHAVIOR, AGENTS AND ANT COLONY OPTIMIZATION
• t : I → ℝ is a function associating to every instance the computation time that is allocated to it.

• c(θ, i) = c(θ, i, t(i)) is a random variable representing the cost of the best solution found by running configuration θ on instance i for t(i) seconds.³

• C ⊂ ℝ is the range of c, that is, the possible values for the cost of the best solution found in a run of a configuration θ ∈ Θ on an instance i ∈ I.

• P_C is a probability measure over the set C: With the notation⁴ P_C(c|θ, i), we indicate the probability that c is the cost of the best solution found by running for t(i) seconds configuration θ on instance i.

• C(θ) = C(θ|Θ, I, P_I, P_C, t) is the criterion that needs to be optimized with respect to θ. In the most general case it measures in some sense the desirability of θ.

On the basis of these concepts, the problem of configuring a metaheuristic can be formally described by the 6-tuple ⟨Θ, I, P_I, P_C, t, C⟩. The solution of this problem is the configuration θ* such that:

\[
\theta^{*} = \arg\min_{\theta} C(\theta). \tag{1}
\]

As far as the criterion C is concerned, different alternatives are possible. In this paper, we consider the optimization of the expected value of the cost c(θ, i). Such a criterion is adopted in many different applications and, besides being quite natural, it is often very convenient from both the theoretical and the practical point of view. Formally:

\[
C(\theta) = E_{I,C}\big[c(\theta, i)\big] = \int_{I}\int_{C} c(\theta, i)\, dP_C(c|\theta, i)\, dP_I(i), \tag{2}
\]

where the expectation is considered with respect to both P_I and P_C, and the integration is taken in the Lebesgue sense (Billingsley, 1986).

The measures P_I and P_C are usually not explicitly available and the analytical solution of the integrals in Equation 2, one for each configuration θ, is not possible. In order to overcome such a limitation, the integrals defined in Equation 2 will be estimated in a Monte Carlo fashion on the basis of a training set of instances, as will be explained in Section 3.

2.3 Further Considerations and Possible Extensions

The formal configuration problem, as described in Section 2.2, assumes that, as far as a given instance is concerned, no information on the performance of the various candidate configurations can be obtained prior to their actual execution on the instance itself. In this sense, the instances are a priori indistinguishable.

In many practical situations, it is known a priori that various types of instances with different characteristics may arise. In such a situation all possible prior knowledge should be used to cluster the instances into homogeneous classes and to find, for each class, the most suitable configuration.

The case mentioned in Section 2.1, in which it is not reasonable to accept that all instances are extracted independently and according to the same probability measure, can possibly be handled in a similar way. Often, some temporal correlation is observed among instances. In other words, temporal patterns can be observed on previous instances that bring a priori information on the characteristics of the current instance. This phenomenon can be handled by assuming that the instances are generated by a process akin to a time series. Also in this case, different configuration problems should be formulated: Each class of instances to be treated separately would be composed of instances that follow in time a given pattern and that are therefore supposed to share similar characteristics. The aim is again to match the hypothesis of a priori indistinguishability of instances within each of the different configuration problems into which the original one is reformulated.

3 A RACING ALGORITHM

Before giving a definition of a racing algorithm for solving the problem given in Equation 1, it is convenient to describe a somewhat naive brute-force approach in order to highlight some of the difficulties associated with the configuration problem.

A brute-force approach to the problem defined in Equation 1 consists in estimating the quantities defined in Equation 2 by means of a sufficiently large number of runs of each candidate on a sufficiently large set of training instances. The candidate configuration with the smallest estimated quantity is then selected.

However, such a brute-force approach presents some drawbacks: First, the size of the training set must be defined prior to any computation. A criterion is missing to avoid considering, on the one hand, too few instances, which could prevent us from obtaining reliable estimates, and on the other hand, too many instances, which would then require a great deal of useless computation. Second, no criterion

to single elements, the correct notation should be P_I({i}). Our notational abuse consists therefore in using the same symbol i both for the element i ∈ I, and for the singleton {i} ⊂ I.

3 In the following, for the sake of a lighter notation, the dependency of c on t will often be implicit.

4 The same remark as in Note 2 applies here.
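The brute-force approach described in Section 3 can be sketched as follows. This is a minimal illustration, assuming that one run of every candidate on every training instance has already been stored in a table `costs[l][j]` (instance l, candidate j); the function name `brute_force_select` and the table layout are hypothetical and do not come from the paper. The sample mean over the training instances serves as the Monte Carlo estimate of the expectation in Equation 2.

```python
from statistics import mean

def brute_force_select(costs):
    """Pick the candidate with the smallest estimated expected cost.

    costs[l][j] is the observed cost of candidate j on training instance l
    (hypothetical layout, one run per pair).
    """
    n = len(costs[0])
    # Monte Carlo estimate of Equation 2: average cost over the training set.
    estimates = [mean(row[j] for row in costs) for j in range(n)]
    # Equation 1: the selected configuration minimizes the estimated criterion.
    best = min(range(n), key=estimates.__getitem__)
    return best, estimates
```

Note that the number of rows in `costs` has to be fixed before any run is made, which is the first drawback mentioned in the text.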
Θ_0 ⊇ Θ_1 ⊇ Θ_2 ⊇ ...,

instances i_l, with 1 ≤ l ≤ k. The Friedman test considers the following statistic (Conover, 1999):

\[
T = \frac{(n-1)\sum_{j=1}^{n}\left(R_j - \dfrac{k(n+1)}{2}\right)^{2}}{\sum_{l=1}^{k}\sum_{j=1}^{n} R_{lj}^{2} - \dfrac{kn(n+1)^{2}}{4}}.
\]

Under the null hypothesis that all possible rankings of the candidates within each block are equally likely, T is approximately χ² distributed with n − 1 degrees of freedom. If the observed T exceeds the 1 − α quantile of such a distribution, the null is rejected, at the approximate level α, in favor of the hypothesis that at least one candidate tends to yield a better performance than at least one other.

If the null is rejected, we are justified in performing pairwise comparisons between individual candidates. Candidates θ_j and θ_h are considered different if

\[
\frac{|R_j - R_h|}{\sqrt{\dfrac{2k\left(1 - \dfrac{T}{k(n-1)}\right)\left(\sum_{l=1}^{k}\sum_{j=1}^{n} R_{lj}^{2} - \dfrac{kn(n+1)^{2}}{4}\right)}{(k-1)(n-1)}}} > t_{1-\alpha/2},
\]

where t_{1−α/2} is the 1 − α/2 quantile of the Student's t distribution (Conover, 1999).

In F-Race, if at step k the null of the aggregate comparison is not rejected, all candidates in Θ_{k−1} pass to Θ_k. On the other hand, if the null is rejected, pairwise comparisons are executed between the best candidate and each other one. All candidates that turn out to be significantly worse than the best are discarded and will not appear in Θ_k.

3.3 Discussion on the Role of Ranking in F-Race

In F-Race, ranking plays an important two-fold role. The first one is connected with the nonparametric nature of a test based on ranking. The main merit of nonparametric analysis is that it does not require formulating hypotheses on the distribution of the observations. Discussions on the relative pros and cons of the parametric and nonparametric approaches can be found in most textbooks on statistics (Larson, 1982). For an organic presentation of the topic, we refer the reader, for example, to Conover (1999). Here we limit ourselves to mentioning some widely accepted facts about parametric and nonparametric hypothesis testing: When the hypotheses they formulate are met, parametric tests have a higher power than nonparametric ones and usually require much less computation. Further, when a large amount of data is available the hypotheses for the application of parametric tests tend to be met in virtue of the central limit theorem. Finally, it is well known that the t-test, the classical parametric test that is of interest here, is robust against departure from some of its hypotheses, namely the normality of data: When the hypothesis of normality is not strictly met, the t-test gracefully loses power.

As far as the metaheuristic configuration problem is concerned, we are in a situation in which these arguments look suspicious. First, since we wish to reduce the number of candidates as soon as possible, we deal with very small samples and it is exactly on these small samples, for which the central limit theorem cannot be invoked, that we wish to have the maximum power. Second, the computational costs are not really relevant since in any case they are negligible compared to the computational cost of executing configurations of the metaheuristic in order to enlarge the available samples. Section 5 shows that the doubts expressed here find some evidential support in our experiments.

A second role played by ranking in F-Race is to implement in a natural way a blocking design (Dean and Voss, 1999). The variation in the observed costs c is due to different sources: Metaheuristics are intrinsically stochastic algorithms, the instances might be very different one from the other, and finally some configurations perform better than others. This last source of variation is the one that is of interest in the configuration problem while the others might be considered as disturbing elements. Blocking is an effective way of normalizing the costs observed on different instances. By focusing only on the ranking of the different configurations within each instance, blocking eliminates the risk that the variation due to the difference among instances washes out the variation due to the difference among configurations.

The work proposed in this paper was openly and largely inspired by some algorithms proposed in the machine learning community (Maron and Moore, 1994; Moore and Lee, 1994) but it is precisely in the adoption of a statistical test based on ranking that it diverges from previously published works. Maron and Moore (1994) proposed Hoeffding Race, which adopts a nonparametric approach but does not consider blocking. In a following paper, Moore and Lee (1994) describe BRACE, which adopts blocking but discards the nonparametric setting in favor of a Bayesian approach. Other relevant work was proposed by Gratch et al. (1993) and by Chien et al. (1995), who consider blocking in a parametric setting.

This paper, to the best of our knowledge, is the first work in which blocking is considered in a nonparametric setting. Further, in all the above mentioned works blocking was always implemented through multiple pairwise paired comparisons (Hsu, 1996), and only in the most recent one (Chien et al., 1995) is correction for multiple tests considered. F-Race is the first racing algorithm to implement blocking through ranking and to adopt an aggregate test over all candidates, to be performed prior to any pairwise test.
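The Friedman statistic T and the pairwise criterion stated earlier in this section can be computed as in the sketch below. It assumes that R_lj denotes the rank of candidate j within block (instance) l, with rank 1 for the lowest cost and average ranks for ties, and that R_j is the sum of R_lj over the k blocks; these definitions are inferred from the notation rather than quoted. The function names and the `costs[l][j]` layout are hypothetical. Critical values (the χ² and Student's t quantiles) are deliberately left to the caller.

```python
from math import sqrt

def block_ranks(row):
    """Ranks within one block: 1 = lowest cost, ties share the average rank."""
    order = sorted(range(len(row)), key=row.__getitem__)
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks

def friedman(costs):
    """Return (T, R, spread) for costs[l][j], k blocks by n candidates.

    spread is the denominator of T; it is zero (and T undefined) only when
    all candidates are tied in every block.
    """
    k, n = len(costs), len(costs[0])
    ranks = [block_ranks(row) for row in costs]
    R = [sum(ranks[l][j] for l in range(k)) for j in range(n)]
    spread = sum(r * r for row in ranks for r in row) - k * n * (n + 1) ** 2 / 4
    T = (n - 1) * sum((Rj - k * (n + 1) / 2) ** 2 for Rj in R) / spread
    return T, R, spread

def pairwise_statistic(R, j, h, T, spread, k, n):
    """Left-hand side of the pairwise criterion, to compare with t_{1-alpha/2}."""
    scale = 2 * k * (1 - T / (k * (n - 1))) * spread / ((k - 1) * (n - 1))
    if scale <= 0:  # all blocks rank the candidates identically
        return float("inf")
    return abs(R[j] - R[h]) / sqrt(scale)
```

Given a χ² quantile for n − 1 degrees of freedom and a Student's t quantile, for example from a statistics table, T and the pairwise statistic can be compared against them as the text describes.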
uration, leading to a total number of 4×4×4×4 = 256 configurations. In our experiments each solution is improved by a 2.5-opt local search procedure (Bentley, 1992).

5 EXPERIMENTAL RESULTS

In this section we propose a Monte Carlo evaluation of F-Race based on a resampling technique (Good, 2001).

For comparison, we consider two other instances of racing algorithms, both based on a paired t-test. They are therefore parametric, and they adopt a blocking design. We refer to them as tn-Race and tb-Race. The first does not adopt any correction for multiple tests, while the second adopts the Bonferroni correction and is therefore not unlike the method described by Chien et al. (1995).

The goal is to select as good a configuration as possible out of the 256 configurations of the MAX–MIN-Ant-System described in Section 4.2.

Each configuration was executed once on each of the 400 instances for 10 s on an Athlon 1.4 GHz CPU with 512 MB of RAM, for a total time of about 12 days, to allow in a following phase the application of the resampling analysis. The costs of the best solution found in each of these experiments were stored in a two-dimensional 400 × 256 array. In the following, when saying that we run configuration j over instance i, we will simply mean that we execute the pseudo-experiment that consists in reading the value in position (i, j) from the array of the results.

From the 400 instances, we extract 1000 pseudo-samples, each of which is obtained by randomly re-ordering the original instances. Each pseudo-sample is used for a pseudo-trial, that is, for simulating a run of a racing algorithm: One after the other the instances are considered and, on the basis of the results of pseudo-experiments, configurations are progressively discarded. Each algorithm stops after executing 5 × 256 pseudo-experiments.⁵ Upon time expiration, the best candidate in the pseudo-trial is selected and it is tested on 10 instances that were not used during the selection itself. The results obtained on these previously unseen instances are recorded and are used for comparing the three racing methods. To summarize, after 1000 pseudo-trials a vector of 10 × 1000 components is obtained for each of F-Race, tn-Race, and tb-Race. It is important to note that the three algorithms face the same pseudo-samples and that the candidates selected in each pseudo-trial by each algorithm are tested on the same unseen instances. The generic i-th component of each of the three 10 × 1000 vectors therefore refers to the results obtained by the champions of the three races respectively, where the three races were conducted on the basis of the same pseudo-sample: We are therefore justified in using paired statistical tests when comparing the three races with one another.

On the basis of a paired Wilcoxon test we can state that F-Race is significantly better, at a significance level of 5%, than both tn-Race and tb-Race.⁶

Some insight on this result can be obtained from the following observation. By dropping the less interesting candidates early, F-Race is able to perform more experiments on the more promising candidates. On the 1000 pseudo-trials considered, at the moment in which the computation time was up and a decision among the surviving candidates had to be taken, the set of survivors was on average composed of 7.9 candidates and such survivors had been tested on average on 77.9 instances. In the case of tn-Race, the average size of the set of survivors upon expiration of computation time was 31.1, while the number of instances seen by such survivors was on average 18.2. For tb-Race the numbers are 253.8 and 5, respectively. In this sense, F-Race proved to be the bravest of the three, while tb-Race appeared to be extremely conservative and on average it dropped only slightly more than 2 candidates before the time limit.

On the basis of our Monte Carlo evaluation, a stronger statement can be made on the quality of the results obtained by F-Race. We have shown above that the performance of F-Race was good in a relative sense: F-Race produced better results than its competitors. We state now that, in a precise sense to be defined presently, the performance of F-Race was good in an absolute sense. We compare F-Race with Cheat, a brute-force method that, rather unfairly, uses in each pseudo-trial the same number of instances used by F-Race and on these instances runs all the candidate configurations. In doing so, Cheat allows itself an enormously large amount of computation time. In our experiments, Cheat performed on average about 19950 experiments per trial, which is equivalent to about 55 hours of computation against the 3.5 hours available to F-Race. The selection operated by Cheat is the optimum that can be obtained from the fixed set of training instances, considering only one run of each configuration on each instance. F-Race can be seen as an approximation of Cheat: The set of experiments performed by F-Race is a proper subset of the experiments performed by Cheat.

Now, in the statistical analysis of the results obtained by our Monte Carlo experiments, we were not able to reject the null that F-Race and Cheat produce equivalent results. Also in this case, we have worked at the significance level of 5%: neither the Wilcoxon test nor the t-test was able to show significance.

5 In such a time, by definition, brute-force would be able to test the 256 candidates on only 5 instances. The 5 × 256 pseudo-experiments simulate 3.5 hours of actual computation on the computer used for producing the results proposed here.

6 The same conclusion can be drawn on the basis of a paired t-test.
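The resampling protocol of Section 5 can be pictured with the sketch below, in which a pseudo-trial replays a race by reading values from a pre-computed cost table. Several details are assumptions made only for illustration: the held-out test instances are taken as the first `n_test` of the shuffled order, the champion is the survivor with the lowest total cost on the instances seen, and `keep_all` and `drop_worst` are toy stand-ins for the statistical tests used by the actual races. All names are hypothetical.

```python
import random

def pseudo_trial(costs, drop_rule, budget, n_test, rng):
    """Replay one race on costs[i][j] (instance i, candidate j)."""
    order = list(range(len(costs)))
    rng.shuffle(order)  # pseudo-sample: random re-ordering of the instances
    test, train = order[:n_test], order[n_test:]  # assumed hold-out scheme
    alive = list(range(len(costs[0])))
    seen, spent = [], 0
    for i in train:
        if spent + len(alive) > budget:  # budget counts pseudo-experiments
            break
        seen.append(i)
        spent += len(alive)  # one pseudo-experiment per surviving candidate
        alive = drop_rule(costs, seen, alive)  # must return a non-empty subset
    # Assumed selection rule: lowest total cost over the instances seen.
    champion = min(alive, key=lambda j: sum(costs[i][j] for i in seen))
    return champion, [costs[i][champion] for i in test], len(seen)

def keep_all(costs, seen, alive):
    """Brute force: never discard a candidate."""
    return alive

def drop_worst(costs, seen, alive):
    """Toy rule (not a statistical test): drop the worst while more than two remain."""
    if len(alive) <= 2:
        return alive
    worst = max(alive, key=lambda j: sum(costs[i][j] for i in seen))
    return [j for j in alive if j != worst]
```

With a budget of 5 pseudo-experiments per candidate, `keep_all` sees exactly 5 instances, as in footnote 5, while any rule that discards candidates early lets the survivors be evaluated on more instances within the same budget.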