Evolution Strategies With Additive Noise: A Convergence Rate Lower Bound
Abstract
We consider the problem of optimizing functions corrupted with additive
noise. It is known that Evolutionary Algorithms can reach a Simple Regret
O(1/√n) within logarithmic factors, where n is the number of function
evaluations. Here, the Simple Regret at evaluation n is the difference
between the value of the function at the current recommendation point of
the algorithm and at the real optimum. We show mathematically that this
bound is tight, for any family of functions that includes sphere functions,
at least for a wide set of Evolution Strategies without large mutations.
1 Introduction
Evolutionary Algorithms (EAs) have received a significant amount of attention
due to their wide applicability in optimization problems. In particular, they
show robustness when confronted with rugged fitness landscapes. This
robustness becomes a strong feature of EAs when facing objective functions
corrupted by noise.
Black-Box Noisy Optimization. When we only have access to an approximate or
noisy value of the objective function, the problem is termed a Noisy
Optimization Problem. Additionally, we can consider that the values of the
objective function are given by a black box which receives a feasible point
as input and outputs the value of the (noisy) objective function at that
point. This is the only information available regarding the objective
function. Such a problem will be termed a Black-Box Noisy Optimization
Problem (BBNOP).
Noise models. To analyze the performance of an algorithm on a BBNOP, several
noise models are considered in the literature. These models are appreciated
for their simple and natural design. If we let f(x) be the objective
function¹, then its noisy version f(x, ω) can be defined following one of
several standard models; the additive model used throughout this paper is
formalized in Section 2.

Evolution Strategies. Evolution Strategies² (ESs) generate new individuals
by mutating a parent through the operator

    R^d → R^d,  x ↦ x + σ N(0, C).
The term σ is called the step-size, and its adaptation has been a subject of
study since the creation of ESs. The first achievement in that direction is
the 1/5-success-rule [22], followed by the study of self-adaptation of the
step-size through a variation process on it. The latter study gave birth to
the so-called SA-ES: Self-Adaptive Evolution Strategies [9]. Additionally,
the work in [18] develops a technique where the whole covariance matrix C is
adapted, leading to the CMA-ES algorithm.
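As an illustration, a minimal numpy sketch of the mutation operator above (σ
and C as in the text; the full adaptation mechanisms of SA-ES or CMA-ES are
beyond this sketch):

```python
import numpy as np

def mutate(x, sigma, C, rng):
    """Sample one offspring: x + sigma * N(0, C)."""
    return x + sigma * rng.multivariate_normal(np.zeros(len(x)), C)

# Example: isotropic mutation (C = identity) in dimension 3.
rng = np.random.default_rng(0)
offspring = mutate(np.zeros(3), sigma=0.5, C=np.eye(3), rng=rng)
```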
For the selection stage, ESs generally use ranking-based operators. Thus,
when we consider a BBNOP, the problem in the selection stage is the
misranking of individuals. In other words, if we consider individuals x1 and
x2 and an additive noise model, then due to the noise perturbation we might
obtain f(x1, ω1) > f(x2, ω2) whereas the real ordering between the
individuals is the opposite, i.e. f(x1) < f(x2). To deal with this problem,
specific methods have been studied, including increasing the population size
[1], using surrogate models [6, 20] and resampling each search point many
times [17, 19]. In this context, resampling means that the query to the
black box is repeated several times for a given search point. Afterwards,
some statistic of the repeated sample (usually the mean) is used as the
objective function value of the point.

¹ More formally, f(x, ·) is a random variable on a probability space (Ω, A)
and ω ∈ Ω is an element of the sample space. So each time we draw an ω we
have a new realization of the random variable f(x, ·).
² Some but not all authors consider that ESs are by definition working in
continuous domains.
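As an illustration, a minimal Python sketch of resampling with averaging (the
noisy sphere and the helper names below are illustrative choices, not imposed
by the text):

```python
import numpy as np

def noisy_sphere(x, rng):
    """f(x, w) = ||x||^2 plus additive standard Gaussian noise."""
    return float(np.dot(x, x)) + rng.standard_normal()

def resampled_value(f, x, r, rng):
    """Query the black box r times at x and return the sample mean."""
    return float(np.mean([f(x, rng) for _ in range(r)]))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.1])
print(resampled_value(noisy_sphere, x, r=100, rng=rng))  # variance shrinks as 1/r
```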
In this context, Fabian [15] and Shamir [25] approximate the tangent of the
objective function through a gradient estimated by finite differences, and
use it in the optimization process. They both obtain linear convergence in
the log-log scale. More precisely, the work in [25] proves that the
convergence occurs with slope −1 in the case of strongly convex quadratic
functions, as detailed later, whereas Fabian [15] proves similar rates
(arbitrarily close to −1) asymptotically but on a wider family of functions.
The tightness of the rates in [15] is proved in [11]. A key feature common
to all these algorithms is that they sample farther from the optimum than
ESs do.
2 Preliminaries
Consider d a positive integer and a domain D ⊂ Rd . Given a function f : D → R,
the noisy version of it is a stochastic process, also denoted f but defined as
f : D × Ω → R, (x, ω) 7→ f (x, ω) where ω represents the realization of a random
variable over some Ω. Henceforward, f (x) will be the exact value of f in x
whereas f (x, ω) denotes a noisy value of f in x. f (x, ω) is supposed to be
unbiased, i.e. Eω f (x, ω) = f (x). We assume that x∗ is the unknown exact and
unique optimum (minimum) of f .
In the present paper, the noise model corresponds to additive noise:
    f(x, ω) = f(x) + N(ω),

where the noise N(ω) does not depend on x and satisfies E_ω N(ω) = 0.
At each time step n, the algorithm evaluates the noisy objective at a search
point xn and proposes a recommendation point x̃n, i.e. its current
approximation of the optimum. The recommendation point can be the same as
the search point, but not necessarily: the recommendation point can be
computed without evaluating the objective function on it (as in [25, 15]).
We denote yn the evaluation of the noisy function at xn. The sequence of
search points and their evaluations on the noisy function is then
(x0, y0), (x1, y1), . . . , (xn, yn).
The Simple Regret and the Cumulative Regret, SRn and CRn respectively, are
defined as follows:

    SRn = f(x̃n) − f(x*),      CRn = Σ_{i=0}^{n} (f(xi) − f(x*)).
We are interested in the linear relationship between the regret and the
number of function evaluations on log-linear or log-log graphs. Therefore,
the rates of convergence are the slopes of the log-linear (Eq. 5) or log-log
(Eq. 6) graphs:

    Log-linear: limsup_n (log SRn)/n = −α < 0.        (5)

    Log-log:    limsup_n (log SRn)/(log n) = −α < 0.  (6)
In this paper we will use the slope in the log-log graph, −α in Eq. 6.
Precisely, we say that the slope −α is verified on a family F of noisy
objective functions if α is such that, for every f ∈ F,

    limsup_n (log SRn)/(log n) ≤ −α.

This means that several different slopes might be verified; note that what
matters is the supremum of such α.
In the following, the inner product in R^d is represented by · and sgn(·)
refers to the sign function.
3 Theoretical analysis
This section consists of the formalization of the algorithms and the main
result of the paper. We present on one side the formalization of the concept
of a black-box noisy optimization algorithm (Algorithm 1, Section 3.1) and
on the other side a formalization of "classical" ESs (Def. 1, Section 3.1).
Notably, the ESs covered by Definition 1 form a wide family, but not all of
them. The condition in Eq. 10 relates the evolution of the step-size to the
approximation of the optimum. This condition holds provably for some ESs,
and probably for many more, but not necessarily for e.g. ESs with surrogate
models or ESs with large mutations [16, 21, 29, 7]. Also, algorithms using
several Gaussian distributions simultaneously, or multimodal random
variables for generating individuals, are not covered. Thus, we mainly
consider here ESs with one Gaussian distribution whose scale is roughly the
distance to the optimum.
In Section 3.2 we prove the theorem stating that the family of evolution
strategies described by this formalization converges at best with rate −1/2
(tightness comes from [13]).
3.1 Formalization of Algorithms
General Optimization Framework
Basically, an optimization algorithm samples some search points, evaluates them
and proposes a recommendation (i.e. an approximation of the optimum) from
this information. We formalize a general optimization algorithm in Algorithm 1.
Perimeter covered by General Optimization Framework
We provide some observations to clarify the scope of Algorithm 1, and how it
covers the usual definitions of black-box optimization algorithms.
In general, a black-box optimization algorithm uses the objective function
as an oracle. Since we consider a black-box setting, there is no access to any
internal characteristic of the objective function.
On the other hand, a black-box optimization algorithm has a state, which is
either its initial state (at the first time step) or a function of its
internal state and of the results of requests to the oracle.
Since the algorithm is designed for optimization, it must provide an
approximation of the optimum. Such an approximation is termed a
"recommendation". We decide here that the approximation of the optimum
should not change between two calls to the objective (i.e. oracle) function.
Therefore, an optimization algorithm is a sequence of internal computations,
which modify an internal state. This sequence is sometimes interrupted by a
call to the oracle function, or by a change in the recommendation.
We can then rewrite the algorithm, hiding all internal transformations of
the internal state I between two calls to the oracle in some function SP.
The algorithm then evaluates the objective function at xn (call to the
oracle). Next, it proposes a new approximation of the optimum; this
computation is encoded in R. We have specified that this does not modify I;
but the procedure R can be duplicated inside SP, which is allowed to modify
the internal state if necessary, so this is not a loss of generality. The
random seed is available to all functions, so there is no limitation
regarding randomization.
We have assumed that the algorithm never spends an infinite computation time
between two calls to the oracle and never stops; if it would, we simply
decide that it reports the same output for R and the same output for SP from
then on.
All the elements discussed in this section allow us to use the general
optimization framework described in Algorithm 1 to represent most black-box
optimization algorithms.
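As an illustration, a minimal Python rendering of this framework (a sketch
only: the signatures are simplified with respect to Algorithm 1, and SP, R
and f_noisy are placeholders for the oracle and the two procedures described
above):

```python
import numpy as np

def optimize(f_noisy, SP, R, budget, seed=0):
    """General framework: alternate oracle calls (search points from SP)
    and recommendations (from R), sharing an internal state I."""
    rng = np.random.default_rng(seed)   # random seed available to all functions
    xs, ys, I = [], [], None            # search points, noisy values, internal state
    recommendations = []
    for n in range(budget):
        x, I = SP(xs, ys, rng, n, I)    # next search point, updated internal state
        y = f_noisy(x, rng)             # call to the oracle
        xs.append(x)
        ys.append(y)
        recommendations.append(R(xs, ys, rng, n, I))  # R does not modify I
    return recommendations
```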
Classical ESs generate the search points by mutation of the current
recommendation:

    xn = x̃n + σ(Zn) Ψ(Zn),    (8)

where, for short, Zn = (x0, . . . , xn−1, y0, . . . , yn−1, r, n, p, I). The
step-size σ(Zn) is usually updated at each generation. Ψ(Zn) is an
independent d-dimensional zero-mean random variable, not necessarily
Gaussian, with finite second moment.
Also, we consider that an ES should satisfy the following condition on the
evolution of the step-size with regard to the recommendation points; the
reasons behind this condition are explained in more detail in the following
section:

    ∃D > 0, ∀n ≥ 0,  E[σ(Zn)²] ≤ D · E[‖x̃n − x*‖²].    (10)
Now we can state the definition of the ESs covered by the theorem in
Section 3.2.

Definition 1 (Simple Evolution Strategy). A Simple Evolution Strategy is an
algorithm that matches the framework of Alg. 1 and satisfies both Eq. 8 and
Eq. 10.
What would be an EA which does not verify Eq. 10? A natural example is an
Evolutionary Algorithm which samples far from the current estimate x̃n of
the optimum, e.g. for building a surrogate model. Interestingly, all
optimization algorithms which are fast in noisy optimization with constant
noise variance in the vicinity of the optimum verify such a property, namely
sampling far from the optimum [15, 25, 12]. This suggests that modified ESs
which include samplings far from the optimum might be faster.
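As a hedged illustration (not part of the original analysis), condition
Eq. 10 can be monitored empirically on logged runs; sigmas and errors below
are hypothetical arrays of σ(Zn) and ‖x̃n − x*‖ values collected over
independent runs:

```python
import numpy as np

def eq10_ratio(sigmas, errors):
    """Empirical check of Eq. 10: max over n of E[sigma_n^2] / E[||x~_n - x*||^2].
    sigmas, errors: arrays of shape (runs, iterations). A ratio bounded in n
    is consistent with the existence of the constant D."""
    num = np.mean(np.asarray(sigmas) ** 2, axis=0)  # E[sigma(Z_n)^2] per n
    den = np.mean(np.asarray(errors) ** 2, axis=0)  # E[||x~_n - x*||^2] per n
    return float(np.max(num / den))
```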
3.2 Lower bound for Simple Evolution Strategies
We now state our main theorem, namely the proof that Evolution Strategies,
in their usual setting without mutations far from the optimum, cannot reach
rates as good as algorithms without such restrictions.
Theorem 1. Let F be the set of quadratic functions f : D → R defined on
D = R^d by f(x) = (1/2)‖x‖² − (x* · x) for some ‖x*‖ ≤ 1/2. Consider a
Simple Evolution Strategy as in Definition 1 and the noisy optimization of
f ∈ F corrupted by some additive noise with variance 1 in the unit ball:
f(x, ω) = f(x) + N(ω) such that E_ω[f(x, ω)] = f(x). Then, for all α > 1/2,
the slope −α is not reached.
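For intuition, a one-line computation shows why this family is convenient:
the unique minimum of f is x*, and the simple regret at a point is exactly
half the squared distance to it:

\[
\nabla f(x) = x - x^*, \qquad
f(x) - f(x^*) = \tfrac12\|x\|^2 - x^*\cdot x + \tfrac12\|x^*\|^2
             = \tfrac12\|x - x^*\|^2 .
\]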
Remark 1 (Tightness in the framework of evolution strategies). The work in
[13] shows that, within logarithmic factors, an evolution strategy with
Bernstein races (with modified sampling in order to avoid huge numbers of
resamplings due to individuals with almost equal fitness values) can reach a
slope −α arbitrarily close to −1/2. To the best of our knowledge, it is not
known whether α = 1/2 can be reached.
The last equation leads to

    E[CRn] = o(√n).    (11)

Shamir [25, Theorem 6] has shown that, for any optimization algorithm as
defined in Section 3.1, there is at least one function f ∈ F for which the
cumulative regret is CRn ≥ 0.02 min(1, d/√n) · n, which contradicts Eq. 11.
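To make the contradiction explicit (a sketch, glossing over the transfer
from recommendation points to search points, which is precisely what Eq. 10
provides): if the regret of the search points decayed with slope −α for some
α > 1/2, the cumulative regret would satisfy

\[
\mathbb{E}\,CR_n \;=\; \sum_{i=1}^{n} \mathbb{E}\bigl[f(x_i) - f(x^*)\bigr]
\;=\; O\!\left(n^{1-\alpha}\right) \;=\; o(\sqrt{n}),
\]

which is incompatible, for large n, with a lower bound of order d√n.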
4 Experimental verification of the Lower Bound
This section is devoted to the verification of the lower bound on the
convergence rate for ESs stated in Theorem 1, and to the comparison with the
convergence rate of a "fast" algorithm: the Shamir algorithm [25]. We show
experimentally that the rate −1 promised by the results in [25] is visible
in experiments, even with moderate budgets in terms of number of function
evaluations. We then show that, consistently with the theory, we could not
do better than slope −1/2 with ESs (Section 4.2). The experimental results
presented in this section use the following approximation of the slope of
the simple regret (Eq. 6):

    slope(SRn) ≈ (log SRn)/(log n).
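A minimal sketch of this estimator (sr and n_eval are hypothetical arrays of
averaged simple regrets and evaluation counts from logged runs):

```python
import numpy as np

def slope_loglog(sr, n_eval):
    """Approximate slope of Eq. 6 at the last point:
    log(SR_n) / log(n), given arrays of simple regrets and budgets."""
    return float(np.log(sr[-1]) / np.log(n_eval[-1]))
```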
Algorithm 2 Shamir Algorithm, written in the general optimization framework.

procedure R(x0, . . . , xn−1, y0, . . . , yn−1, r, n, p, I)
    ▷ I is a vector of n elements in the domain
    if ‖In‖ ≥ B then
        In ← B · In/‖In‖
    end if
    x̃n ← (2/n) Σ_{j=⌈n/2⌉}^{n} Ij
end procedure

procedure SP(x0, . . . , xn−1, y0, . . . , yn−1, r, n, p, I)
    if n = 0 then
        I ← (0)
        return x0 = 0
    end if
    Compute xn = xn−1 + (r/√d) ξn, where ξn ∼ N(0, Id)
    Compute g̃ = (√d yn−1/r) ξn−1
    Compute I ← (I, xn−1 − λn g̃), with λn = 1/n
end procedure
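For concreteness, a hedged Python sketch of the same scheme (one-point
gradient estimates plus averaging of the last half of the iterates; the
Gaussian perturbation, the step-size 1/n and the constants are illustrative
choices, not the exact ones of [25]):

```python
import numpy as np

def shamir_like(f_noisy, d, budget, r=0.1, B=1.0, seed=0):
    """One-point stochastic gradient descent with final-half averaging."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                            # current iterate, kept in ball of radius B
    iterates = []
    for n in range(1, budget + 1):
        xi = rng.standard_normal(d)            # perturbation direction
        x = w + (r / np.sqrt(d)) * xi          # search point, away from the iterate
        y = f_noisy(x, rng)                    # single noisy oracle call
        g = (np.sqrt(d) / r) * y * xi          # one-point gradient estimate
        w = w - (1.0 / n) * g                  # step-size 1/n (strong convexity)
        norm = np.linalg.norm(w)
        if norm >= B:
            w = B * w / norm                   # projection onto the ball
        iterates.append(w.copy())
    half = len(iterates) // 2
    return np.mean(iterates[half:], axis=0)    # recommendation: mean of last half

# Example on a noisy quadratic f(x) = ||x||^2/2 - x* . x + noise:
rec = shamir_like(lambda x, rng: 0.5 * x @ x - 0.1 * x[0]
                  + rng.standard_normal(), d=2, budget=10**4)
```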
[Figure: SHAMIR. Slope of SR versus number of evaluations (×10^5),
dimensions 2, 4, 8 and 16; the observed slope approaches −1.]
Algorithm 3 UH-CMA-ES. N(a, b) stands for a normal random variable of mean a
and covariance b. hparam stands for hidden parameters; it includes the
parameters used in the updates of σ, pσ, C, pC, and in the functions
Generate λ′ and Compute threshold t.

Require: λ ∈ N, αr ∈ R, ασ ∈ R, hparam
1: Initialization: xi = 0 ∈ R^d for all i ∈ {1, . . . , λ}, m = 0, σ = 0.6,
   C = I, pσ, pc, r = 1, ε = 10^−7, t = 0
2: while not terminate do
3:    for i = 1 to λ do
4:        xi ← N(m, σ²C)
5:        yi ← (1/r) Σ_{j=1}^{r} f(xi, ω)
6:    end for
7:    Sort (yi) such that ys(1) ≤ · · · ≤ ys(λ)
8:    (x1, . . . , xλ) ← (xs(1), . . . , xs(λ))
9:    m ← (1/λ) Σ_{i=1}^{λ} xi
10:   Update parameters σ, C, pσ, pc
11:   Generate λ′                              ▷ possibly 0
12:   for i = 1 to λ′ do
13:       x′i ← xi + σN(0, C)
14:       y′i ← (1/r) Σ_{j=1}^{r} f(x′i, ω)
15:   end for
16:   Compute threshold t
17:   y″i ← (yi + y′i)/2 if i ≤ λ′, and y″i ← yi otherwise
18:   Sort (y″i) such that y″s″(1) ≤ · · · ≤ y″s″(λ)
19:   (x1, . . . , xλ) ← (xs″(1), . . . , xs″(λ))
20:   if t > 0 then
21:       r ← αr r, σ ← ασ σ
22:   else
23:       r ← r/αr^0.25
24:   end if
25: end while
26: return x1
Now let us consider a Simple Evolution Strategy, namely the (1+1)-ES with
one-fifth rule [22, 24], with additional revaluations, implemented as shown
in Algorithm 4.

A way to slightly improve Algorithm 4 is to refine the fitness value of the
current best recommendation by averaging the current estimate with the
previous estimates of the same search point, when the mutation has not been
accepted and xn = xn−1. We propose such a modification in Algorithm 5, using
a weighted average in the estimated fitness value of the current best search
point.

Experiments on the noisy sphere function ‖x − x*‖² + N(0, 1), where N(0, 1)
is a standard Gaussian, are provided, using Algorithm 5. Results are
presented in Figure 3 for dimensions 2, 4, 8 and 16. Seemingly, both
exponential and polynomial resamplings lead to a slope −1/2.
Subroutine 1 Generate λ′
Input: gλ = max(0.2, 2/λ)
1: if λ′ = 0 for more than gλ λ² generations then
2:     λ′ ← 1
3: else
4:     λ′ ← ⌊gλ × λ⌋ + 1 with probability gλ × λ − ⌊gλ × λ⌋
5:     λ′ ← ⌊gλ × λ⌋ otherwise
6: end if
return λ′
Subroutine 2 Compute threshold t

t_new ← (1/λ′) Σ_{i=1}^{λ′} [ 2|∆i| − ∆θ^lim( rank(y′i) − 1_{y′i > yi} )
        − ∆θ^lim( rank(yi) − 1_{yi > y′i} ) ]
t ← (1 − ct) t + ct t_new
return t
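A simplified Python sketch of the rank-change idea behind this threshold (an
illustrative reduction, not the exact ∆θ^lim statistic; y holds the first
noisy values of all λ individuals and y_re the reevaluations of the first λ′
of them):

```python
import numpy as np

def rank_change_statistic(y, y_re):
    """Mean absolute rank change of reevaluated individuals.
    Large values indicate that the noise perturbs the ranking,
    suggesting the number of resamplings should increase."""
    deltas = []
    for i in range(len(y_re)):
        others = np.delete(np.asarray(y), i)     # ranks among the other individuals
        rank_old = int(np.sum(others < y[i]))    # rank of the first evaluation
        rank_new = int(np.sum(others < y_re[i])) # rank after reevaluation
        deltas.append(abs(rank_new - rank_old))
    return float(np.mean(deltas))                # to be compared with a threshold t
```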
5 Conclusions
We have shown that Evolution Strategies, at least under their most common
form, cannot reach the same rate as noisy optimization algorithms which use
evaluations of the objective function farther from the approximate optimum
in order to obtain extra information on the function. By contrast, ESs
evaluate the objective function only in the neighborhood of the optimum.
Therefore, usual ESs cannot reach rates like those in [15, 25] (Shamir [25]
in the quadratic case non-asymptotically, Fabian [15] in the general case
but asymptotically). The latter type of algorithms reaches a slope −1,
whereas we have a limit at −1/2 with evolution strategies. This solves the
conjecture proposed in [25] just after its Theorem 1. It also shows the
optimality of the rate −1/2 obtained by R-EDA [23] in the framework of local
sampling only.
It is important to note that the result in this paper covers not only
Evolutionary Algorithms but also, for example, many pattern search methods.
We proved the results for algorithms which sample within some distance of
the approximate optimum, this distance scaling roughly as the distance to
the optimum. This property is known to be satisfied by most ESs (see Section
3.1). However, for many algorithms it is verified only experimentally and
not formally proved.
[Figure: UHCMAES. Slope of SR versus number of evaluations (×10^5),
dimensions 2, 4, 8 and 16.]
ESs with surrogate models are not concerned by our lower bound. More
precisely, if we include strong surrogate modelling with large mutations
(thereby contradicting Eq. 10), then we can recover fast rates with slope
−1. An extreme example of this situation is the case in which the
sampling/surrogate model is exactly the algorithm in [15], [25] or [12].
Using them to build surrogate models within an ES will ensure a fast
convergence rate for the ES. Obviously, it is desirable to verify whether
such a result can also be obtained with more "evolutionary" approaches.

The bound presented in this paper does not cover evolutionary algorithms
that would use very large mutations [7]. Maybe this is a good path to follow
for designing fast evolutionary noisy optimization algorithms.
For all experiments we check convergence rates on the sphere function with
additive noise. We consider an algorithm with a theoretically fast rate, the
Shamir algorithm, and two ESs: UH-CMA-ES and the (1+1)-ES. For the Shamir
algorithm we have achieved a successful implementation⁶ and confirmed
empirically the fast convergence rate proved in [15, 25] (i.e. slope of
SR = −1). For the (1+1)-ES we have shown that ESs can approximate a slope of
SR of −0.5. UH-CMA-ES also reaches linear convergence in the log-log scale,
but with a slower rate (slope of SR around −0.2).

⁶ Shamir [25] delivers only the theoretical analysis of his algorithm and
not an implemented version.
Figure 3: Results of the (1+1)-ES in dimensions 2, 4, 8 and 16. The first
row of plots presents polynomial resampling and the second row exponential
resampling. The maximum standard deviation over all averages presented here
(experiments are averaged over 400 runs) is 0.025. Note that nbEval is the
number of evaluations, n.
Algorithm 4 (1+1)-ES for noisy optimization with resamplings. N(0, 1) is a
standard Gaussian. The function number_of_revaluations depends on the
current iteration and on a parameter p. Typically the number of revaluations
is polynomial, n^p, or exponential, p^n.

1: Initialization: n = 0, σ = 1, x = (0, . . . , 0), p
2: while not terminate do
3:    x′ ← x + σN(0, 1)
4:    n ← n + 1
5:    r ← number_of_revaluations(n, p)
6:    y ← (1/r) Σ_{i=1}^{r} f(x, ω)
7:    y′ ← (1/r) Σ_{i=1}^{r} f(x′, ω)
8:    if y′ < y then
9:        x ← x′ and σ ← 2σ
10:   else
11:       σ ← 0.84σ
12:   end if
13: end while
return x
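A runnable Python transcription of Algorithm 4 on the noisy sphere
‖x‖² + N(0, 1) (the sphere and the budget handling are illustrative
choices; the 2 and 0.84 factors implement the one-fifth rule as in the
pseudocode):

```python
import numpy as np

def one_plus_one_es(d, budget, p=2, schedule="poly", seed=0):
    """(1+1)-ES with resamplings on f(x, w) = ||x||^2 + N(0, 1)."""
    rng = np.random.default_rng(seed)
    f = lambda x: float(x @ x) + rng.standard_normal()   # noisy sphere oracle
    x, sigma, n, spent = np.zeros(d), 1.0, 0, 0
    while spent < budget:
        n += 1
        x_new = x + sigma * rng.standard_normal(d)       # Gaussian mutation
        r = n ** p if schedule == "poly" else p ** n     # number of revaluations
        y = np.mean([f(x) for _ in range(r)])            # resampled parent fitness
        y_new = np.mean([f(x_new) for _ in range(r)])    # resampled offspring fitness
        spent += 2 * r
        if y_new < y:
            x, sigma = x_new, 2.0 * sigma                # success: accept, enlarge step
        else:
            sigma *= 0.84                                # failure: shrink step
    return x

# Example: estimate of the optimum after roughly 10^5 evaluations, dimension 2.
x_hat = one_plus_one_es(d=2, budget=10**5)
```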
Further work
A first further work consists in proving the result in a wider setting, that
is, weakening the assumption in Eq. 10. We might also check criteria other
than the non-asymptotic expected simple regret, e.g. almost sure
convergence. Another further work is investigating which optimization
algorithms, other than Evolution Strategies, are concerned by our result or
by similar results. In the case of strongly convex functions with a lower
bound on the eigenvalues of the Hessian, we conjecture that the asymptotic
rate −1 also cannot be reached by the considered family of evolutionary
algorithms.
Acknowledgements
We are grateful to the Ademe Post project for making this work possible
(post.artelys.com).
Possibilities of algorithms using huge mutations were discussed in the noisy
optimization working group at the Dagstuhl seminar 2014; we are grateful to
N. Hansen, Y. Akimoto, J. Shapiro and A. Prügel-Bennett for fruitful
discussions there.
References
[1] D. Arnold and H.-G. Beyer. Investigation of the (µ, λ)-ES in the presence of
noise. In Proceedings of the IEEE Conference on Evolutionary Computation
(CEC 2001), pages 332–339. IEEE, 2001.
[2] D. Arnold and H.-G. Beyer. Local performance of the (1 + 1)-ES in a
noisy environment. IEEE Transactions on Evolutionary Computation,
6(1):30–41, Feb 2002.
[3] S. Astete-Morales, J. Liu, and O. Teytaud. Log-log convergence for noisy
optimization. In Artificial Evolution, Lecture Notes in Computer Science,
pages 16–28. Springer International Publishing, 2014.
[11] H. Chen. Lower rate of convergence for locating a maximum of a function.
The Annals of Statistics, 16(3):1330–1334, Sep 1988.
[12] R. Coulom. CLOP: Confident local optimization for noisy black-box
parameter tuning. In Advances in Computer Games, Lecture Notes in Computer
Science, pages 146–157. Springer Berlin Heidelberg, 2012.
[19] V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for
selecting policies in evolutionary direct policy search. In Proceedings of
the 26th Annual International Conference on Machine Learning, ICML '09,
pages 401–408, New York, NY, USA, 2009. ACM.
[20] Y. Jin and J. Branke. Evolutionary optimization in uncertain
environments - a survey. IEEE Transactions on Evolutionary Computation,
9(3):303–317, June 2005.
[21] Y. Ong, K.-Y. Lum, P. Nair, D. Shi, and Z. Zhang. Global convergence
of unconstrained and bound constrained surrogate-assisted evolutionary
search in aerodynamic shape design. In Proceedings of the IEEE Congress on
Evolutionary Computation (CEC 2003), pages 1856–1863. IEEE, 2003.
[22] I. Rechenberg. Evolutionsstrategie: Optimierung technischer Systeme
nach Prinzipien der biologischen Evolution. Problemata, 15. Frommann-
Holzboog, 1973.
[23] P. Rolet and O. Teytaud. Adaptive noisy optimization. In Applications
of Evolutionary Computation, Lecture Notes in Computer Science, pages
592–601. Springer Berlin Heidelberg, 2010.
[24] H.-P. Schwefel. Adaptive Mechanismen in der biologischen Evolution und
ihr Einfluss auf die Evolutionsgeschwindigkeit. Technical Report of the
Working Group of Bionics and Evolution Techniques at the Institute for
Measurement and Control Technology Re 215/3, Technical University of
Berlin, July 1974.
[25] O. Shamir. On the complexity of bandit and derivative-free stochastic
convex optimization. CoRR, abs/1209.2388, 2012.