Gradient-Free Optimization of Highly Smooth Functions: Improved Analysis and A New Algorithm
Abstract
This work studies minimization problems with zero-order noisy oracle information under the
assumption that the objective function is highly smooth and possibly satisfies additional properties.
We consider two kinds of zero-order projected gradient descent algorithms, which differ in the form
of the gradient estimator. The first algorithm uses a gradient estimator based on randomization over
the ℓ2 sphere due to Bach and Perchet (2016). We present an improved analysis of this algorithm on
the class of highly smooth and strongly convex functions studied in the prior work, and we derive
rates of convergence for two more general classes of non-convex functions. Namely, we consider
highly smooth functions satisfying the Polyak-Łojasiewicz condition and the class of highly smooth
functions with no additional property. The second algorithm is based on randomization over the
ℓ1 sphere, and it extends to the highly smooth setting the algorithm that was recently proposed
for Lipschitz convex functions in Akhavan et al. (2022). We show that, in the case of noiseless
oracle, this novel algorithm enjoys better bounds on bias and variance than the ℓ2 randomization
and the commonly used Gaussian randomization algorithms, while in the noisy case both ℓ1 and ℓ2
algorithms benefit from similar improved theoretical guarantees. The improvements are achieved
thanks to a new proof technique based on Poincaré type inequalities for uniform distributions on
the ℓ1 or ℓ2 spheres. The results are established under weak (almost adversarial) assumptions on
the noise. Moreover, we provide minimax lower bounds proving optimality or near optimality of
the obtained upper bounds in several cases.
©2024 Arya Akhavan, Evgenii Chzhen, Massimiliano Pontil, and Alexandre B. Tsybakov.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/v25/23-0733.html.
1. Introduction
In this work, we study the problem of gradient-free optimization for certain types of smooth
functions. Let f : Rd → R and Θ ⊆ Rd . We are interested in solving the following optimization
problem
$$f^\star \triangleq \inf_{x \in \Theta} f(x)\,,$$
and we assume that f^⋆ is finite. One main theme of this paper is to exploit higher order smoothness
properties of the underlying function f in order to improve the performance of the optimization
algorithm. We consider that the algorithm has access to a zero-order stochastic oracle, which, given
a point x ∈ Rd returns a noisy value of f (x), under a general noise model.
We study two kinds of zero-order projected gradient descent algorithms, which differ in the
form of the gradient estimator. Both algorithms can be written as an iterative update of the form
$$x_{t+1} = \mathrm{Proj}_{\Theta}(x_t - \eta_t g_t)\,, \qquad t = 1, 2, \ldots,$$
where g_t is a gradient estimator at the point x_t, η_t > 0 is a step size, and Proj_Θ(·) is the Euclidean
projection operator onto the set Θ. In either case, the gradient estimator is built from two noisy
function values, that are queried at two random perturbations of the current guess for the solution,
and it involves an additional randomization step. Also, both algorithms invoke smoothing kernels
in order to take advantage of higher order smoothness, following the approach initially suggested
in (Polyak and Tsybakov, 1990). The first algorithm uses a form of ℓ2 randomization introduced
by Bach and Perchet (2016) and it has been studied in the context of gradient-free optimization of
strongly convex functions in (Akhavan et al., 2020; Novitskii and Gasnikov, 2022). The second
algorithm is an extension, by incorporating smoothing kernels, of the approach proposed and
analysed in Akhavan et al. (2022) for online minimization of Lipschitz convex functions. It is based
on an alternative randomization scheme, which uses the ℓ1-geometry in place of the ℓ2 one.
A principal goal of this paper is to derive upper bounds on the expected optimization error
of both algorithms under different assumptions on the underlying function f . These assumptions
are used to set the step size in the algorithms and the perturbation parameter used in the gradient
estimator. Previous works on gradient-free optimization of highly smooth functions considered
mostly the strongly convex case (Polyak and Tsybakov, 1990; Bach and Perchet, 2016; Akhavan
et al., 2020, 2021; Novitskii and Gasnikov, 2022). In this paper, we provide a refined analysis of
the strongly convex case, improving the dependence on the dimension d and the strong convexity
parameter α for the algorithm with ℓ2 randomization, and showing analogous results for the new
method with ℓ1 randomization. For the special case of strongly convex functions with Lipschitz
gradient, we find the minimax optimal dependence of the bounds on all the three parameters of the
problem (namely, the horizon T, the dimension d and α) and we show that both algorithms attain
the minimax rate, which equals α^{−1} d/√T. This finalizes the line of work starting from (Polyak
and Tsybakov, 1990), where it was proved that the optimal dependence on T is of the order 1/√T, and
papers proving that the optimal dependence on d and T scales as d/√T (Shamir (2013) establishing a
lower bound and Akhavan et al. (2020) giving a complete proof, see the discussion in Section 5).
Furthermore, we complement these results by considering highly smooth but not necessarily convex
functions f , and highly smooth functions f , which additionally satisfy the gradient dominance
(Polyak-Łojasiewicz) condition. To this end, we develop unified tools that can cover a variety of
gradient estimators, and then apply them to the algorithm with ℓ2 randomization and to our new
algorithm based on ℓ1 randomization. We show that, in the case of noiseless oracle, this novel
algorithm enjoys better bounds on bias and variance than its ℓ2 counterpart, while in the noisy case,
both algorithms benefit from similar theoretical guarantees. The improvements in the analysis are
achieved thanks to a new method of evaluating the bias and variance of both algorithms based on
Poincaré type inequalities for uniform distributions on the ℓ1 or ℓ2 spheres. Moreover, we establish
all our upper bounds under very weak (almost adversarial) assumptions on the noise.
Rate of convergence under only smoothness assumption. Assume that f is a β-Hölder function
with Lipschitz continuous gradient, where β ≥ 2. Then, after 2T oracle queries both considered
algorithms provide a point xS satisfying
$$\mathbb{E}\big[\|\nabla f(x_S)\|^2\big] \;\lesssim\; \Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}} \quad\text{under the assumption that } T \ge d^{\frac{1}{\beta}}\,,$$
where S is a random variable with values in {1, . . . , T}, ‖·‖ denotes the Euclidean norm, and the sign
≲ conceals a multiplicative constant that does not depend on T and d. To the best of our knowledge,
this result is the first convergence guarantee for the zero-order stochastic optimization under the
considered noise model. In a related development, Ghadimi and Lan (2013); Balasubramanian and
Ghadimi (2021) study zero-order optimization of non-convex objective function with Lipschitz
gradient, which corresponds to β = 2. They assume querying two function values with identical
noises, in which case the analysis and the convergence rates are essentially analogous to the particular
case of our setting with no noise (see the discussion in Section 2.2 below). The work of Carmon
et al. (2020) studies noiseless optimization of highly smooth functions assuming that the derivatives
up to higher order are observed, and Arjevani et al. (2022) consider stochastic optimization with first
order oracle. These papers cannot be directly compared with our work as the settings are different.
Rate of convergence under smoothness and Polyak-Łojasiewicz assumptions. Assume that f
is a β-Hölder function with Lipschitz continuous gradient, with a Lipschitz constant L̄. Additionally,
let β ≥ 2, and suppose that f satisfies the Polyak-Łojasiewicz inequality with a constant α. Then,
after 2T oracle queries, both considered algorithms provide a point xT , for which the expected
optimization error satisfies
$$\mathbb{E}[f(x_T) - f^\star] \;\lesssim\; \frac{1}{\alpha}\Big(\frac{\mu d^2}{T}\Big)^{\frac{\beta-1}{\beta}} \quad\text{under the assumption that } T \gtrsim d^{2-\frac{\beta}{2}}\,,$$
where μ = L̄/α, and the signs ≲ and ≳ conceal multiplicative constants that do not depend on T, d
and α. The Polyak-Łojasiewicz assumption was introduced in the context of first order optimization
by Polyak (1963) who showed that it implies linear convergence of the gradient descent algorithm.
Years later, this condition received attention in the machine learning and optimization community
following the work of Karimi et al. (2016). To the best of our knowledge, zero-order optimization
under the considered noise model with the Polyak-Łojasiewicz assumption was not previously
studied. Very recently Rando et al. (2022) studied a related problem under the Polyak-Łojasiewicz
assumption when querying two function values with identical noises, which can be compared with
the analysis in the particular case of our setting with no noise (see the discussion in Section 2.2).
Unlike our work, Rando et al. (2022) do not deal with higher order smoothness and do not derive
the dependency of the bounds on d, µ and α.
Rate of convergence under smoothness and strong convexity. Assume that f is a β-Hölder
function with Lipschitz continuous gradient, where β ≥ 2, and satisfies α-strong convexity condition.
Then, after 2T oracle queries, both considered algorithms provide a point x̂T such that
$$\mathbb{E}[f(\hat x_T) - f^\star] \;\lesssim\; \frac{1}{\alpha}\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{\beta}} \quad\text{under the assumption that } T \ge d^{2-\frac{\beta}{2}}\,,$$
where ≲ conceals a multiplicative constant that does not depend on T, d and α. The closest result to
ours is obtained in Akhavan et al. (2020) and it splits into two cases: β = 2 (Lipschitz continuous
gradient) and β > 2 (higher order smoothness). For β = 2, Akhavan et al. (2020) deal with a
compact Θ and prove a bound with optimal dependence on the dimension (linear in d) but sub-
optimal in α, while for β > 2 they derive (for Θ = Rd and for compact Θ) the rate with sub-optimal
dimension factor d². Later, Akhavan et al. (2021) and Novitskii and Gasnikov (2022) improved the
dimension factor to d^{2−1/β} for β > 2, which still does not match the linear dependence as β → 2.
In contrast, by considering a slightly different definition of smoothness, we provide below a unified
analysis leading to the dimension factor d^{2−2/β} for any β ≥ 2, under constrained and unconstrained
Θ; the improvement is both in the rate and in the proof technique.
1.2 Notation
Throughout the paper, we use the following notation. For any k ∈ N we denote by [k] the
set of the first k positive integers. For any x ∈ R^d we denote by x ↦ sign(x) the component-wise
sign function (defined at 0 as 1). We let ⟨·, ·⟩ and ‖·‖ be the standard inner product and
Euclidean norm on R^d, respectively. For every closed convex set Θ ⊂ R^d and x ∈ R^d we denote
by Proj_Θ(x) = argmin{‖z − x‖ : z ∈ Θ} the Euclidean projection of x onto Θ. For any
p ∈ [1, +∞] we let ‖·‖_p be the ℓp-norm in R^d and we introduce the open ℓp-ball and ℓp-sphere,
respectively, as
$$B_p^d \triangleq \big\{x \in \mathbb{R}^d : \|x\|_p < 1\big\} \qquad\text{and}\qquad \partial B_p^d \triangleq \big\{x \in \mathbb{R}^d : \|x\|_p = 1\big\}\,.$$
For any β ≥ 2 we let ⌊β⌋ be the largest integer strictly less than β. Given a multi-index m =
(m_1, . . . , m_d) ∈ N^d, we set m! ≜ m_1! · · · m_d! and |m| ≜ m_1 + · · · + m_d.
2. Preliminaries
For any multi-index m ∈ N^d, any |m| times continuously differentiable function f : R^d → R, and
every h = (h_1, . . . , h_d)^⊤ ∈ R^d we define
$$D^m f(x) \triangleq \frac{\partial^{|m|} f(x)}{\partial^{m_1} x_1 \cdots \partial^{m_d} x_d}\,, \qquad h^m \triangleq h_1^{m_1} \cdots h_d^{m_d}\,.$$
For any k-linear form A : (R^d)^k → R, we define its norm as
Remark 2 Definition 1 of higher order smoothness was used by Bach and Perchet (2016) who
considered only integer β, while Polyak and Tsybakov (1990); Akhavan et al. (2020, 2021) use a
slightly different definition. Namely, they consider a class F′_β(L) defined as the set of all ℓ times
continuously differentiable functions f satisfying, for all x, z ∈ R^d, the condition
$$|f(x) - T_z^{\ell}(x)| \le L\,\|x - z\|^{\beta}\,,$$
where T_z^ℓ(·) is the Taylor polynomial of order ℓ = ⌊β⌋ of f around z. If f ∈ F_β(L), then
f ∈ F′_β(L/ℓ!) (cf. Appendix A). Thus, the results obtained for the classes F′_β hold true for f ∈ F_β
modulo a change of constant. Moreover, if f is convex and β = 2 (Lipschitz continuous gradient) the
properties defining the two classes are equivalent to within constants, cf. (Nesterov, 2018, Theorem
2.1.5).
Since we study the minimization of highly smooth functions, in what follows, we will always
assume that f belongs to Fβ (L) for some β ≥ 2 and L > 0. We additionally require that f ∈ F2 (L̄)
for some L̄ > 0, that is, the gradient of f is Lipschitz continuous.
Assumption A The function f ∈ Fβ (L) ∩ F2 (L̄) for some β ≥ 2 and L, L̄ > 0.
We will start our analysis by providing rates of convergence to a stationary point of f under
Assumption A. The first additional assumption that we consider is the Polyak-Łojasiewicz condition,
which is also referred to as α-gradient dominance. This condition became rather popular since it
leads to linear convergence of the gradient descent algorithm without convexity as shown by Polyak
(1963) and further discussed by Karimi et al. (2016).
which is a set of α-gradient dominant functions on any compact subset of R^d, for some α > 0. A
popular example of such a function appearing in machine learning applications is the logistic loss
defined as g(Ax) = Σ_{i=1}^{n} log(1 + exp(a_i^⊤ x)), where for 1 ≤ i ≤ n, a_i is the i-th row of A, and
x ∈ R^d. For this and more examples, see e.g. (Garrigos et al., 2023) and references therein.
In what follows, we consider three different scenarios: (i) the case of only smoothness assump-
tion on f , (ii) smoothness and α-gradient dominance assumptions, (iii) smoothness and α-strong
convexity assumptions. Let x̃ be an output of any algorithm. For the first scenario, we obtain
stationary point guarantee, that is, a bound on E[k∇f (x̃)k2 ]. For the second and the third scenarios,
we provide bounds for the optimization error E[f (x̃) − f ? ].
Remark 5 Note that under α-strong convexity and the fact that ∇f (x? ) = 0, as well as under
α-gradient dominance (see, e.g., Karimi et al., 2016, Appendix A), for any x ∈ Rd we have
$$f(x) - f^\star \ge \frac{\alpha}{2}\,\|x - x^*\|^2\,, \qquad (2)$$
where x∗ is the Euclidean projection of x onto the solution set arg minx∈Rd f (x) of the considered
optimization problem, which is a singleton in case of strong convexity. Thus, our upper bounds
on E[f (x̃) − f ? ] obtained under α-strong convexity or α-gradient dominance imply immediately
upper bounds for E[kx̃ − x∗ k2 ] with an extra factor 2/α.
where ξ is a random variable. In order to find a minimizer of f , at each step t of the algorithm
one makes two queries and gets outputs of the form (F (xt , ξt ), F (xt + ht ζ t , ξt )), t = 1, 2, . . . ,
where ξt ’s are independent identically distributed (iid) realizations of random variable ξ, and xt ,
xt + ht ζ t are query points at step t. Here, ht > 0 is a perturbation parameter and ζ t is a random or
deterministic perturbation. We emphasize two features of this setting:
(a) the two queries are obtained with the same random variable ξt;
(b) the noises ξt, t = 1, 2, . . . , are assumed to be i.i.d.
Both (a) and (b) are not assumed in our setting. On the other hand, we assume additive noise
structure. That is, at step t, we can only observe the values F (z t , ξt ) = f (z t ) + ξt for any choice
of query points z t depending only on the observations at the previous steps, but we do not assume
that two queries are obtained with the same noise ξt or are iid. We deal with almost adversarial
noise, see Assumption B below. In particular, we do not assume that the noise is zero-mean. Thus,
in general, we can have E[F(x, ξt)] ≠ f(x).
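As a concrete illustration of this noise model, the following minimal Python sketch (our own; the class name `AdditiveNoiseOracle` and the helper `noise_fn` are hypothetical) implements an additive-noise oracle whose noise may be deterministic and non-zero-mean, which the model permits:

```python
import numpy as np

class AdditiveNoiseOracle:
    """Zero-order oracle returning f(z) + xi_t.  The noise may be biased,
    deterministic, and dependent across steps; the model only restricts
    its second moment (Assumption B below)."""

    def __init__(self, f, noise_fn):
        self.f = f
        self.noise_fn = noise_fn  # chosen adversarially, but not as a function of the randomization

    def query(self, z, t):
        return self.f(z) + self.noise_fn(t)

# A deterministic, non-zero-mean noise is allowed: then E[F(x, xi)] != f(x).
f = lambda z: float(np.dot(z, z))
oracle = AdditiveNoiseOracle(f, noise_fn=lambda t: 0.3)
```

Note that such an oracle is excluded by classical settings that assume zero-mean or identical noises in the two queries.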
In the CZSO setting, the values F (xt , ξt ) and F (xt + ht ζ t , ξt ) are used to obtain gradient
approximations involving the divided differences (F (xt + ht ζ t , ξt ) − F (xt , ξt ))/ht . A popular
choice is the gradient estimator with Gaussian randomization suggested by Nesterov (2011):
$$g_t^{\mathsf{G}} \triangleq \frac{1}{h_t}\big(F(x_t + h_t\zeta_t, \xi_t) - F(x_t, \xi_t)\big)\zeta_t\,, \qquad\text{with } \zeta_t \sim \mathcal{N}(0, I_d)\,, \qquad (3)$$
where N (0, Id ) denotes the standard Gaussian distribution in Rd . In the case of additive noise, the
divided differences are equal to (f (xt + ht ζ t ) − f (xt ))/ht , that is, the analysis of these algorithms
reduces to that of noiseless (deterministic) optimization setting. When ξt ’s are not additive, the
assumptions that are often made in the literature on CZSO are such that the rates of convergence are
the same as in the additive noise case, which is equivalent to noiseless case due to the above remark.
We can summarize this discussion as follows:
• the results obtained in the literature on CZSO, as well as some tools (e.g., averaging in
Algorithm 1 of Balasubramanian and Ghadimi, 2021), do not apply in our framework;
1. Some papers, for example, Ghadimi and Lan (2013); Gasnikov et al. (2016) relax this assumption.
• our upper bounds on the expected optimization error in the case of no noise (σ = 0) imply
identical bounds in the CZSO setting when the noise is additive. Under some assumptions
made in the CZSO literature (e.g., under Assumption B of Duchi et al., 2015), these bounds
for σ = 0 also extend to the general CZSO setting, with possible changes only in constants
and not in the rates. In other cases the rates in CZSO setting can only be slower than ours (e.g.,
under Assumptions A1b, A2 of Ghadimi and Lan, 2013).
3. Algorithms
Given a closed convex set Θ ⊆ R^d, we consider the following optimization scheme
$$x_{t+1} = \mathrm{Proj}_{\Theta}(x_t - \eta_t g_t)\,, \qquad t = 1, 2, \ldots, \qquad (4)$$
where g_t is an update direction, approximating the gradient direction ∇f(x_t), and η_t > 0 is a
step-size. Allowing one to perform two function evaluations per step, we consider two gradient
estimators g_t which are based on different randomization schemes. They both employ a smoothing
kernel K : [−1, 1] → R which we assume to satisfy, for β ≥ 2 and ℓ = ⌊β⌋, the conditions
$$\int K(r)\,\mathrm{d}r = 0\,, \quad \int r K(r)\,\mathrm{d}r = 1\,, \quad \int r^{j} K(r)\,\mathrm{d}r = 0\,, \ \ j = 2, \ldots, \ell\,, \quad \kappa_\beta \triangleq \int |r|^{\beta}|K(r)|\,\mathrm{d}r < \infty\,. \qquad (5)$$
In (Polyak and Tsybakov, 1990) it was suggested to construct such kernels employing Legendre
polynomials, in which case κ_β ≤ 2√2 β, cf. (Bach and Perchet, 2016, Appendix A.3).
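The Legendre construction can be sketched numerically as follows (our own illustration): with p_m the Legendre polynomials orthonormalized on [−1, 1], the kernel K(r) = Σ_{m=0}^{ℓ} p_m′(0) p_m(r) reproduces q′(0) for every polynomial q of degree at most ℓ, hence satisfies the moment conditions (5).

```python
import numpy as np
from numpy.polynomial import legendre as leg

ell = 3  # ell = floor(beta); this choice corresponds e.g. to beta = 4

def p(m, r):
    """Legendre polynomial of degree m, orthonormalized on [-1, 1]."""
    c = np.zeros(m + 1)
    c[m] = 1.0
    return np.sqrt((2 * m + 1) / 2) * leg.legval(r, c)

def p_prime_at_zero(m):
    c = np.zeros(m + 1)
    c[m] = 1.0
    return np.sqrt((2 * m + 1) / 2) * leg.legval(0.0, leg.legder(c))

def K(r):
    # K(r) = sum_m p_m'(0) p_m(r): integrating q(r) K(r) over [-1, 1]
    # returns q'(0) for every polynomial q of degree <= ell
    return sum(p_prime_at_zero(m) * p(m, r) for m in range(ell + 1))

# check the moment conditions (5) by Gauss-Legendre quadrature
nodes, weights = leg.leggauss(60)
vals = K(nodes)
moments = [float(np.sum(weights * nodes**j * vals)) for j in range(ell + 1)]
kappa_beta = float(np.sum(weights * np.abs(nodes)**4 * np.abs(vals)))  # beta = 4
```

The computed moments are (numerically) 0, 1, 0, 0, as required by (5), and κ_β is finite.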
We are now in a position to introduce the two estimators. Similarly to earlier works dealing
with ℓ2-randomized methods (see e.g., Nemirovsky and Yudin, 1983; Flaxman et al., 2005; Bach
and Perchet, 2016; Akhavan et al., 2020) we use gradient estimators based on a result, which is
sometimes referred to as Stokes' theorem. A general form of this result, not restricted to the ℓ2
geometry, can be found in (Akhavan et al., 2022, Appendix A).
Gradient estimator based on ℓ2 randomization. At time t ≥ 1, let ζ_t^◦ be distributed uniformly
on the ℓ2-sphere ∂B_2^d, let r_t be uniformly distributed on [−1, 1], and h_t > 0. Query two points:
$$y_t = f(x_t + h_t r_t \zeta_t^{\circ}) + \xi_t \qquad\text{and}\qquad y_t' = f(x_t - h_t r_t \zeta_t^{\circ}) + \xi_t'\,,$$
where ξ_t, ξ_t′ are noises. Using the above feedback, define the gradient estimator as
$$(\ell_2\ \text{randomization}) \qquad g_t^{\circ} \triangleq \frac{d}{2h_t}\,(y_t - y_t')\,\zeta_t^{\circ}\,K(r_t)\,. \qquad (6)$$
We use the superscript ◦ to emphasize the fact that g_t^◦ is based on the ℓ2 randomization.
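A minimal Python sketch of this estimator (function names are ours; the kernel below is an odd linear kernel scaled so that E[rK(r)] = 1 for r uniform on [−1, 1], which makes the sketch exactly unbiased for linear f — the normalization convention for (5) may differ from this by a constant factor):

```python
import numpy as np

def kernel(r):
    # odd linear kernel, scaled so that E[r*K(r)] = 1 for r ~ U[-1, 1]
    return 3.0 * r

def grad_est_l2(f_query, x, h, rng):
    """One draw of the l2-randomized gradient estimator in the spirit of (6).

    f_query returns a (possibly noisy) function value; it is called twice,
    at x + h*r*zeta and at x - h*r*zeta.
    """
    d = x.size
    z = rng.standard_normal(d)
    zeta = z / np.linalg.norm(z)          # uniform on the l2-sphere
    r = rng.uniform(-1.0, 1.0)
    y = f_query(x + h * r * zeta)
    y2 = f_query(x - h * r * zeta)
    return (d / (2.0 * h)) * (y - y2) * zeta * kernel(r)
```

For a linear f the estimator is exactly unbiased, so averaging many draws recovers the gradient; for smooth non-linear f the bias is controlled as in Lemma 6 below.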
Gradient estimator based on ℓ1 randomization. At time t ≥ 1, let ζ_t be distributed uniformly
on the ℓ1-sphere ∂B_1^d, let r_t be uniformly distributed on [−1, 1], and h_t > 0. Query two points:
$$y_t = f(x_t + h_t r_t \zeta_t) + \xi_t \qquad\text{and}\qquad y_t' = f(x_t - h_t r_t \zeta_t) + \xi_t'\,,$$
where ξ_t, ξ_t′ are noises, and define the gradient estimator as
$$(\ell_1\ \text{randomization}) \qquad g_t^{\diamond} \triangleq \frac{d}{2h_t}\,(y_t - y_t')\,\mathrm{sign}(\zeta_t)\,K(r_t)\,. \qquad (7)$$
We use the superscript ⋄, reminiscent of the form of the ℓ1-sphere, in order to emphasize the fact
that g_t^⋄ is based on the ℓ1 randomization. The idea of using an ℓ1 randomization (different from
(7)) was probably first invoked by Gasnikov et al. (2016). We refer to Akhavan et al. (2022) who
highlighted the potential computational and memory gains of another ℓ1 randomization gradient
estimator compared to its ℓ2 counterpart, as well as its advantages in theoretical guarantees. The
estimator of Akhavan et al. (2022) differs from (7) as it does not involve the kernel K, but the same
computational and memory advantages remain true for the estimator (7).
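A matching sketch of the ℓ1-randomized estimator (names and kernel normalization are ours, as in the ℓ2 sketch); a uniform draw from ∂B_1^d can be obtained by normalizing exponential magnitudes to unit ℓ1-norm and attaching independent random signs:

```python
import numpy as np

def kernel(r):
    # same scaled odd kernel as in the l2 sketch: E[r*K(r)] = 1 for r ~ U[-1, 1]
    return 3.0 * r

def sample_l1_sphere(d, rng):
    """Uniform draw from the l1-sphere: exponential magnitudes normalized to
    unit l1-norm (a Dirichlet(1,...,1) vector), with independent random signs."""
    e = rng.exponential(scale=1.0, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * e / e.sum()

def grad_est_l1(f_query, x, h, rng):
    """One draw of the l1-randomized gradient estimator in the spirit of (7)."""
    d = x.size
    zeta = sample_l1_sphere(d, rng)
    r = rng.uniform(-1.0, 1.0)
    y = f_query(x + h * r * zeta)
    y2 = f_query(x - h * r * zeta)
    return (d / (2.0 * h)) * (y - y2) * np.sign(zeta) * kernel(r)
```

As for the ℓ2 sketch, averaging many draws on a linear objective recovers its gradient. Only `sign(zeta)` enters the output, which is the source of the memory gains mentioned above.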
Throughout the paper, we impose the following assumption on the noises ξ_t, ξ_t′ and on the
random variables that we generate in the estimators (6) and (7).
Assumption B For all t ≥ 1:
(i) the random variables ξ_t and ξ_t′ are independent from ζ_t^◦ (resp. ζ_t) and from r_t conditionally
on x_t, and the random variables ζ_t^◦ (resp. ζ_t) and r_t are independent;
(ii) E[ξ_t²] ≤ σ² and E[(ξ_t′)²] ≤ σ², where σ ≥ 0.
Let us emphasize that we do not assume ξt and ξt0 to have zero mean. Moreover, they can be
non-random and no independence between noises on different time steps is required, so that the
setting can be considered as almost adversarial. Nevertheless, the first part of Assumption B does
not permit a completely adversarial setup. Indeed, the oracle is not allowed to choose the noise
variable depending on the current randomization, i.e., ζ ◦t (resp. ζ t ) and rt . However, Assumption B
encompasses the following protocol: at each round, the oracle generates pairs of noise (ξt , ξt0 ) with a
second moment bounded by σ. This is done possibly with full knowledge of the algorithm employed
by the learner, the previous actions, and the past information received by the learner.
In the next two subsections, we study the bias and variance of the two estimators. As we shall
see, the ℓ1 randomization can be more advantageous in the noiseless case than its ℓ2 counterpart (cf.
Remark 11).
Lemma 6 (Bias of ℓ2 randomization) Let Assumption B be fulfilled. Suppose that f ∈ F_β(L) for
some β ≥ 2 and L > 0. Let g_t^◦ be defined in (6) at time t ≥ 1. Let ℓ = ⌊β⌋. Then,
$$\big\|\mathbb{E}[g_t^{\circ} \mid x_t] - \nabla f(x_t)\big\| \le \kappa_\beta\,\frac{L}{(\ell-1)!}\cdot\frac{d}{d+\beta-1}\, h_t^{\beta-1}\,. \qquad (8)$$
Intuitively, the smaller ht is, the more accurately g t estimates the gradient. A result analogous to
Lemma 6 with a bigger constant was claimed in (Bach and Perchet, 2016, Lemma 2). The proof
of Lemma 6 is presented in the appendix. It relies on the fact that g_t^◦ is an unbiased estimator of
some surrogate function, which is strongly related to the original f. The factor in front of h_t^{β−1}
in Lemma 6 is O(1) as a function of d. It should be noted that for β > 2 the bounds on the bias
obtained in Akhavan et al. (2020) and Novitskii and Gasnikov (2022), where the factors scale as
O(d) and O(√d), respectively, cannot be directly compared to Lemma 6. This is due to the fact
that those bounds are proved under a different notion of smoothness, cf. Remark 2. Nevertheless,
if f is convex and β = 2 both notions of smoothness coincide, and Lemma 6 improves upon the
bounds in (Akhavan et al., 2020) and (Novitskii and Gasnikov, 2022) by factors of order d and √d,
respectively.
Lemma 7 (Variance of ℓ2 randomization) Let Assumption B hold and f ∈ F_2(L̄) for some L̄ > 0.
Then, for any d ≥ 2,
$$\mathbb{E}\big[\|g_t^{\circ}\|^2\big] \le \frac{d^2\kappa}{d-1}\,\mathbb{E}\Big[\big(\|\nabla f(x_t)\| + \bar L h_t\big)^2\Big] + \frac{d^2\sigma^2\kappa}{h_t^2}\,,$$
where κ = ∫_{−1}^{1} K²(r) dr.
Since f ∈ F2 (L̄) and (10) holds, then using Wirtinger-Poincaré inequality (see, e.g., Osserman,
1978, (3.1)) we get
$$\mathbb{E}\big[(f(x+hr\zeta^{\circ})-f(x-hr\zeta^{\circ}))^2 \mid x, r\big] \le \frac{h^2}{d-1}\,\mathbb{E}\big[\|\nabla f(x+hr\zeta^{\circ})+\nabla f(x-hr\zeta^{\circ})\|^2 \mid x, r\big]\,. \qquad (12)$$
The fact that f ∈ F2 (L̄) and the triangle inequality imply that
$$\mathbb{E}\big[\|\nabla f(x+hr\zeta^{\circ})+\nabla f(x-hr\zeta^{\circ})\|^2 \mid x, r\big] \le 4\big(\|\nabla f(x)\| + \bar L h\big)^2\,. \qquad (13)$$
Note that Eqs. (12)–(13) imply, in particular, Lemma 9 of Shamir (2017). Our method based
on Poincaré’s inequality yields explicitly the constants in the bound. In this aspect, we improve
upon Shamir (2017), where a concentration argument leads only to non-specified constants.
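As a quick Monte Carlo sanity check of the Poincaré step (our own illustration, not from the paper): for the linear function G(u) = ⟨c, u⟩ with ζ uniform on the ℓ2-sphere, the spherical gradient norm is at most ‖c‖, so the Wirtinger-Poincaré inequality gives Var(G(ζ)) ≤ ‖c‖²/(d − 1), while the exact variance is ‖c‖²/d:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
c = rng.standard_normal(d)

# sample uniformly from the l2-sphere and evaluate G(u) = <c, u>
z = rng.standard_normal((200_000, d))
zeta = z / np.linalg.norm(z, axis=1, keepdims=True)
G = zeta @ c

var_mc = float(G.var())                         # exact value: ||c||^2 / d
poincare_bound = float(np.dot(c, c)) / (d - 1)  # Wirtinger-Poincare bound
```

The empirical variance sits just below the Poincaré bound, with the explicit constant visible, which is precisely the feature exploited in the proofs of Lemmas 7 and 9.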
Lemma 8 (Bias of ℓ1 randomization) Let Assumption B be fulfilled and f ∈ F_β(L) for some
β ≥ 2 and L > 0. Let g_t^⋄ be defined in (7) at time t ≥ 1. Let ℓ = ⌊β⌋. Then,
$$\big\|\mathbb{E}[g_t^{\diamond} \mid x_t] - \nabla f(x_t)\big\| \le L\, c_\beta\, \kappa_\beta\, \ell^{\beta-\ell}\, d^{\frac{1-\beta}{2}}\, h_t^{\beta-1}\,. \qquad (14)$$
When 2 ≤ β < 3, c_β = 2^{(β−1)/2}, and if β ≥ 3 we have c_β = 1.
Notice that Lemma 8 gives the same dependence on the discretization parameter ht as Lemma 6.
However, unlike the bias bound in Lemma 6, which is dimension independent, the result of Lemma 8
depends on the dimension in a favorable way. In particular, the bias is controlled by a decreasing
function of the dimension and this dependence becomes more and more favorable for smoother
functions. Yet, the price for such a favorable control of the bias is an inflated bound on the variance,
which is established below.
Lemma 9 (Variance of ℓ1 randomization) Let Assumption B be fulfilled and f ∈ F_2(L̄) for some
L̄ > 0. Then, for any d ≥ 3,
$$\mathbb{E}\big[\|g_t^{\diamond}\|^2\big] \le \frac{8 d^3 \kappa}{(d+1)(d-2)}\,\mathbb{E}\Bigg[\bigg(\|\nabla f(x_t)\| + \sqrt{\frac{2}{d}}\,\bar L h_t\bigg)^{2}\Bigg] + \frac{d^3\sigma^2\kappa}{h_t^2}\,,$$
where κ = ∫_{−1}^{1} K²(r) dr.
Combined with the facts that a²/((a − 2)(a + 1)) ≤ 2.25 for all a ≥ 3 and (a + b)² ≤ 2a² + 2b²,
the inequality of Lemma 9 can be further simplified as
$$\mathbb{E}\big[\|g_t^{\diamond}\|^2\big] \le 36\, d\kappa\, \mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + 72\,\kappa \bar L^2 h_t^2 + \frac{d^3\sigma^2\kappa}{h_t^2}\,.$$
A key ingredient in the proof of Lemma 9 is a Poincaré-type inequality (Lemma 10) controlling the
variance of an L-Lipschitz function G of a vector ζ distributed uniformly on ∂B_1^d, namely
$$\mathrm{Var}\big(G(\zeta)\big) \le \frac{8L^2}{(d+1)(d-2)} \le \frac{18L^2}{d^2}\,.$$
The proof of Lemma 10 is given in the Appendix.
Proof of Lemma 9 For simplicity we drop the subscript t from all the quantities. Similarly to the
proof of Lemma 7, using Assumption B we deduce that
$$\mathbb{E}\big[\|g^{\diamond}\|^2\big] \le \frac{d^3}{4h^2}\Big(\mathbb{E}\big[(f(x+hr\zeta)-f(x-hr\zeta))^2 K^2(r)\big] + 4\sigma^2\kappa\Big)\,. \qquad (16)$$
Consider G : Rd → R defined for all u ∈ Rd as G(u) = f (x + hru) − f (x − hru). Using the
fact that f ∈ F2 (L̄) we obtain for all u ∈ Rd
$$\|\nabla G(u)\|^2 \le 4h^2\big(\|\nabla f(x)\| + \bar L h\,\|u\|\big)^2\,.$$
Remark 11 (On the advantage of ℓ1 randomization) In the noiseless case (σ = 0) both bias and
variance under the ℓ1 randomization are strictly smaller than under the ℓ2 randomization. Indeed,
if σ = 0,
$$\big\|\mathbb{E}[g_t^{\circ} \mid x_t] - \nabla f(x_t)\big\| \lesssim h_t^{\beta-1}\,, \qquad \mathbb{E}\big[\|g_t^{\circ}\|^2\big] \lesssim d\,\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + (\sqrt{d}\, h_t)^2\,,$$
and
$$\big\|\mathbb{E}[g_t^{\diamond} \mid x_t] - \nabla f(x_t)\big\| \lesssim \Big(\frac{h_t}{\sqrt{d}}\Big)^{\beta-1}\,, \qquad \mathbb{E}\big[\|g_t^{\diamond}\|^2\big] \lesssim d\,\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + h_t^2\,.$$
For comparison, the corresponding bounds for gradient estimators with Gaussian randomization
defined in (3) are proved for β = 2 and have the form (cf. Nesterov (2011) or (Ghadimi and Lan,
2013, Theorem 3.1)):
$$\big\|\mathbb{E}[g_t^{\mathsf{G}} \mid x_t] - \nabla f(x_t)\big\| \lesssim d^{3/2}\, h_t\,, \qquad \mathbb{E}\big[\|g_t^{\mathsf{G}}\|^2\big] \lesssim d\,\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + d^3 h_t^2\,, \qquad (17)$$
where the signs ≲ hide multiplicative constants that do not depend on h_t and d. Setting β = 2
in Remark 11 we see that the dependence on h_t in (17) is of the same order as for the ℓ1 and ℓ2
randomizations with β = 2, but the dimension factors are substantially bigger. Also, the bounds in
(17) for the Gaussian randomization are tight. Thus, the Gaussian randomization is less efficient
than its ℓ1 and ℓ2 counterparts in the noiseless setting.
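This gap can be observed numerically (our own illustration with noiseless queries, f(x) = ‖x‖², and the same scaled kernel as in the earlier sketches): for a moderately large h, the d³h² term dominates the Gaussian estimator's second moment, while the h-term of the ℓ2 estimator stays of lower order in d:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 20, 0.5, 20000
x = np.zeros(d)
x[0] = 1.0                              # evaluation point; grad f(x) = 2*x
f = lambda z: float(np.dot(z, z))

def g_l2():
    z = rng.standard_normal(d)
    zeta = z / np.linalg.norm(z)
    r = rng.uniform(-1.0, 1.0)
    # l2-randomized estimator with the scaled kernel K(r) = 3r
    return (d / (2 * h)) * (f(x + h * r * zeta) - f(x - h * r * zeta)) * 3.0 * r * zeta

def g_gauss():
    zeta = rng.standard_normal(d)
    # Gaussian-randomized estimator in the spirit of (3), noiseless queries
    return (1.0 / h) * (f(x + h * zeta) - f(x)) * zeta

m_l2 = float(np.mean([np.sum(g_l2() ** 2) for _ in range(n)]))
m_gauss = float(np.mean([np.sum(g_gauss() ** 2) for _ in range(n)]))
```

With these parameters the sample second moment of the Gaussian estimator exceeds that of the ℓ2 estimator by an order of magnitude, in line with the d³h² versus dh² comparison above.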
4. Upper bounds
In this section, we present convergence guarantees for the two considered gradient estimators and
for three classes of objective functions f . Each of the following subsections is structured similarly:
first, we define the choice of ηt and ht involved in both algorithms and then, for each class of the
objective functions, we state the corresponding convergence guarantees.
Throughout this section, we assume that f ∈ F2 (L̄) ∩ Fβ (L) for some β ≥ 2. Under this
assumption, in Section 4.1 we establish a guarantee for the stationary point. In Section 4.2 we
additionally assume that f is α-gradient dominant and provide upper bounds on the optimization
error. In Section 4.3 we additionally assume that f is α-strongly convex and provide upper bounds
on the optimization error for both constrained and unconstrained cases. Unless stated otherwise,
the convergence guarantees presented in this section hold under the assumption that the number of
queries T is known before running the algorithms.
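Before the formal statements, here is a compact end-to-end sketch of scheme (4) with an ℓ2-type estimator, noiseless queries and β = 2, on a toy strongly convex quadratic over a Euclidean ball (the step size, horizon and kernel normalization are illustrative choices of ours, not the tuned parameters of the theorems below):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, h, eta = 5, 400, 1e-2, 0.05
a = np.full(d, 0.5)                      # minimizer of f, inside Theta
f = lambda z: float(np.dot(z - a, z - a))

def proj_ball(x, radius=2.0):
    """Euclidean projection onto Theta = {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

x = 1.5 * np.ones(d)                     # deterministic x_1
for t in range(T):
    z = rng.standard_normal(d)
    zeta = z / np.linalg.norm(z)         # uniform on the l2-sphere
    r = rng.uniform(-1.0, 1.0)
    y = f(x + h * r * zeta)              # two noiseless queries (sigma = 0)
    y2 = f(x - h * r * zeta)
    g = (d / (2 * h)) * (y - y2) * 3.0 * r * zeta   # estimator, K(r) = 3r
    x = proj_ball(x - eta * g)           # projected gradient step (4)
```

After T steps the iterate is close to the minimizer a, illustrating the convergence guarantees studied in this section.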
Assumption C Assume that there exist two positive sequences b_t, v_t : N → [0, ∞) and V_1 ≥ 0
such that for all t ≥ 1 it holds almost surely that
$$\big\|\mathbb{E}[g_t \mid x_t] - \nabla f(x_t)\big\| \le b_t \qquad\text{and}\qquad \mathbb{E}\big[\|g_t\|^2\big] \le V_1\,\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + v_t\,.$$
Note that Assumption C holds for the gradient estimators (6) and (7) with bt , vt and V1 specified in
Lemmas 6–9 (see also Assumption D and Table 1 below).
The results of this subsection will be stated on a randomly sampled point along the trajectory
of the algorithm. The distribution over the trajectory is chosen carefully, in order to guarantee the
desired convergence. The distribution that we are going to use is defined in the following lemma.
Lemma 12 Let f ∈ F2 (L̄) for some L̄ > 0, Θ = Rd and f ? > −∞. Let xt be defined by
algorithm (4) with g t satisfying Assumption C. Assume that ηt in (4) is chosen such that L̄ηt V1 < 1.
Let S be a random variable with values in [T ], which is independent from x1 , . . . , xT , g 1 , . . . , g T
and distributed with the law
$$\mathbb{P}(S = t) = \frac{\eta_t\big(1 - \bar L \eta_t V_1\big)}{\sum_{t'=1}^{T} \eta_{t'}\big(1 - \bar L \eta_{t'} V_1\big)}\,, \qquad t \in [T]\,.$$
Then,
Lemma 12 is obtained by techniques similar to Ghadimi and Lan (2013). However, the paper
Ghadimi and Lan (2013) considers only a particular choice of g t defined via a Gaussian random-
ization, and a different setting (cf. the discussion in Section 2.2), under which vt does not increase
as the discretization parameter ht (µ in the notation of Ghadimi and Lan (2013)) decreases. In
our setting, this situation happens only when there is no noise (σ = 0), while in the noisy case vt
increases as ht tends to 0.
Note that the distribution of S in Lemma 12 depends on the choice of ηt and V1 . In the following
results, we are going to specify the exact values of ηt . We also provide the values of V1 for the
gradient estimators (6) and (7). Regarding these two estimators, it will be convenient to use the
following instance of Assumption C.
Assumption D There exist positive numbers b, V_1, V_2, V_3 such that for all t ≥ 1 the gradient
estimators g_t satisfy almost surely the inequalities
$$\big\|\mathbb{E}[g_t \mid x_t] - \nabla f(x_t)\big\| \le L\, b\, h_t^{\beta-1} \qquad\text{and}\qquad \mathbb{E}\big[\|g_t\|^2\big] \le V_1\,\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] + V_2\,\bar L^2 h_t^2 + V_3\,\frac{\sigma^2}{h_t^2}\,.$$
It follows from Lemmas 6–9 that Assumption D holds for gradient estimators (6) and (7) with the
values that are indicated in Table 1. Note that the bounds for the variance in those lemmas do not
cover the case d = 1 for the ℓ2 randomization and d = 1, 2 for the ℓ1 randomization. Nevertheless,
it is straightforward to check that in these cases Assumption D remains valid with Vj ’s given in
Table 1.
Estimator            b                                   V1      V2     V3
ℓ2 randomization     (κ_β/(ℓ−1)!) · d/(d+β−1)            4dκ     4dκ    d²κ
ℓ1 randomization     c_β κ_β ℓ^{β−ℓ} d^{(1−β)/2}         36dκ    72κ    d³κ
Table 1: Factors in the bounds for bias and variance of both gradient estimators, ℓ = ⌊β⌋, d ≥ 1.
The next theorem requires a definition of algorithm-dependent parameters, which are needed as
an input to our algorithms. We set
$$(\mathrm{y}, \mathrm{h}) = \begin{cases} \big((8\kappa\bar L)^{-1},\; d^{\frac{1}{2\beta-1}}\big) & \text{for } \ell_2 \text{ randomization,}\\ \big((72\kappa\bar L)^{-1},\; d^{\frac{2\beta+1}{4\beta-2}}\big) & \text{for } \ell_1 \text{ randomization.} \end{cases} \qquad (18)$$
Theorem 13 Let Assumptions A and B hold, and Θ = Rd . Let xt be defined by algorithm (4) with
gradient estimator (6) or (7), where the parameters ηt and ht are set for t = 1, . . . , T , as
$$\eta_t = \min\Big\{\frac{\mathrm{y}}{d}\,,\; d^{-\frac{2(\beta-1)}{2\beta-1}}\, T^{-\frac{\beta}{2\beta-1}}\Big\} \qquad\text{and}\qquad h_t = \mathrm{h}\, T^{-\frac{1}{2(2\beta-1)}}\,,$$
and the constants y and h are given in (18). Assume that x_1 is deterministic and T ≥ d^{1/β}. Then, for
the random variable S defined in Lemma 12, we have
$$\mathbb{E}\big[\|\nabla f(x_S)\|^2\big] \le \Big(A_1\big(f(x_1) - f^\star\big) + A_2\Big)\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}}\,,$$
where the constants A1 , A2 > 0 depend only on σ, L, L̄, β, and on the choice of the gradient
estimator.
In the case σ = 0, the result of this theorem can be improved. As explained in Section 2.2,
this case is analogous to the CZSO setting, and it is enough to assume that β = 2 since higher
order smoothness does not lead to improvement in the main term of the rates. Due to Remark 11
(or Assumption D) one can set ht for both methods as small as one wishes, and thus sufficiently
small to make the sum over t in the numerator of the inequality of Lemma 12 less than an absolute
constant. Then, choosing η_t = (2V_1 L̄)^{−1} and recalling that, for both algorithms, V_1 scales as d up
to a multiplicative constant (cf. Table 1) we get the following result.
Theorem 14 Let f be a function belonging to F2 (L̄) for some L̄ > 0, Θ = Rd , and let Assumptions
A and B hold with σ = 0. Let xt be defined by algorithm (4) with deterministic x1 , and gradient
estimators (6) or (7) for β = 2, where ηt = (2V1 L̄)−1 and ht is chosen sufficiently small. Then we
have
    E[‖∇f(xS)‖²] ≤ A (f(x1) − f⋆ + 1) L̄d/T ,

where A > 0 is a constant depending only on the choice of the gradient estimator.
The rate O(d/T ) in Theorem 14 coincides with the rate derived in (Nesterov and Spokoiny,
2017, inequality (68)) for β = 2 under the classical zero-order stochastic optimization setting, where
the authors were using Gaussian rather than `1 or `2 randomization.
In a setting with non-additive
noise, Ghadimi and Lan (2013) exhibit a slower rate of O(√(d/T)).
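The whole method of Theorem 14 is then just gradient descent driven by such a randomized estimator (algorithm (4) with Θ = R^d, so the projection is the identity). A self-contained sketch under assumptions of our own: the ℓ2 estimator form (d/2h)(y − y′)ζK(r) with K(r) = 3r, and a hand-picked constant step size in place of the tuned value (2V1L̄)^{−1}.

```python
import numpy as np

rng = np.random.default_rng(1)

def zo_gradient(f, x, h):
    """l2-randomized two-point gradient estimate (form assumed here)."""
    d = x.size
    zeta = rng.standard_normal(d)
    zeta /= np.linalg.norm(zeta)
    r = rng.uniform(-1.0, 1.0)
    return d / (2 * h) * (f(x + h * r * zeta) - f(x - h * r * zeta)) * zeta * (3 * r)

def zo_gd(f, x0, eta, h, T):
    """Algorithm (4) with Theta = R^d: x_{t+1} = x_t - eta * g_t (constant eta)."""
    x = x0.copy()
    for _ in range(T):
        x = x - eta * zo_gradient(f, x, h)
    return x

f = lambda x: float(x @ x)                # beta = 2, noiseless oracle (sigma = 0)
x_final = zo_gd(f, x0=np.ones(5), eta=0.05, h=1e-3, T=500)
grad_norm = np.linalg.norm(2 * x_final)   # ||grad f|| at the last iterate
```

On this quadratic every step is non-expansive in ‖x‖, so the gradient norm at the last iterate is driven essentially to zero, in line with the O(d/T) guarantee.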
17
A KHAVAN , C HZHEN , P ONTIL , T SYBAKOV
Theorem 15 Let f be an α-gradient dominant function, Θ = R^d, and let Assumptions A and B
hold with σ > 0. Let xt be defined by algorithm (4) with g_t satisfying Assumption D, deterministic
x1, and appropriately chosen parameters ηt and ht. Then

    E[f(xT) − f⋆] ≤ A1 (f(x1) − f⋆) L̄V1/(αT)
        + (A2/α) ( V3 ( V3/(b²L²) )^{−1/β} + V2 L̄² ( V3σ²/(b²L²) )^{1/β} ( αT/(L̄σ²) )^{−2/β} ) ( αT/(L̄σ²) )^{−(β−1)/β} .
Theorem 15 provides a general result for any gradient estimator that satisfies Assumption D. By
taking the values Vj from Table 1 we immediately obtain the following corollary for our `1 - and
`2 -randomized gradient estimators.
Corollary 16 Let f be an α-gradient dominant function, Θ = Rd , and let Assumptions A and B hold,
with σ > 0. Let xt be defined by algorithm (4) with deterministic x1 and gradient estimators (6)
or (7). Set the parameters ηt and ht as in Theorem 15, where b, V1 , V2 , V3 are given in Table 1 for
each gradient estimator, respectively. Then for any T ≥ d^{2−β/2} σ²/(αL²) we have

    E[f(xT) − f⋆] ≤ A1 (f(x1) − f⋆) L̄d/(αT) + ( A2 + A3 L̄²/σ² ) ( L̄σ²d²/(αT) )^{(β−1)/β} L^{2/β} α^{−1} ,
where A1 , A2 , A3 > 0 depend only on β and on the choice of the gradient estimator.
Note that here we consider σ and L as numerical constants. The condition T ≳ d^{2−β/2}/α mentioned
in Corollary 16 is satisfied in all reasonable cases since it is weaker than the condition T ≳ d²/α
guaranteeing non-triviality of the bounds.
Recall that, in the context of deterministic optimization with first order oracle, the α-gradient
dominance allows one to obtain the rates of convergence of gradient descent algorithm, which are
similar to the case of strongly convex objective function with Lipschitz gradient (Polyak (1963);
Karimi et al. (2016)). A natural question is whether the same property holds in our setting of
stochastic optimization with zero-order oracle and higher order smoothness. Theorem 15 shows that
the rates are only inflated by a multiplicative factor µ^{(β−1)/β}, where µ = L̄/α, compared to the
α-strongly convex case that will be considered in Section 4.3.
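Gradient dominance indeed reaches well beyond convexity. A standard one-dimensional example from Karimi et al. (2016) is f(x) = x² + 3 sin²(x), which is non-convex but satisfies the Polyak-Łojasiewicz inequality ½ f′(x)² ≥ α(f(x) − f⋆) with f⋆ = 0 and α = 1/32; the sketch below only verifies the inequality numerically on a grid (the grid and tolerance are our choices).

```python
import numpy as np

f = lambda x: x**2 + 3.0 * np.sin(x)**2        # non-convex, gradient dominant
fp = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x) # derivative of f
alpha = 1.0 / 32.0                             # PL constant reported by Karimi et al.

xs = np.linspace(-10.0, 10.0, 200001)
xs = xs[np.abs(xs) > 1e-6]                     # avoid 0/0 at the minimizer x = 0
pl_ratio = 0.5 * fp(xs)**2 / f(xs)             # should dominate alpha everywhere
min_ratio = pl_ratio.min()
```

The minimum of the ratio over this grid stays well above 1/32, while convexity fails: for instance f″(x) = 2 + 6 cos(2x) = −4 at x = π/2.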
Consider now the case σ = 0, which is analogous to the CZSO setting as explained in Section
2.2. In this case, we assume that β = 2 since higher order smoothness does not lead to improvement
in the main term of the rates. We set the parameters ηt , ht as follows:
    ηt = min{ (2L̄V1)^{−1}, 4/(αt) } ,    ht ≤ (2b)^{−1} T^{−1/2} ( L̄ ( 8L̄²V2/(α∧1) + 2(L̄∨1)/α ) )^{−1/2} .    (19)
Theorem 17 Let f be an α-gradient dominant function belonging to F2 (L̄) for some L̄ > 0,
Θ = Rd , and let Assumptions A and B hold with σ = 0. Let xt be defined by algorithm (4) with
deterministic x1 , and gradient estimators (6) or (7) for β = 2. Set the parameters ηt and ht as in
(19). Then we have
    E[f(xT) − f⋆] ≤ A1 ( (f(x1) − f⋆) + A2 ) L̄d/(αT) ,

where A1, A2 > 0 are constants depending only on the choice of the gradient estimator.
Note that, in the CZSO setting, Rando et al. (2022) proved the rate O(T −1 ) for the optimization
error under α-gradient dominance by using an ℓ2 randomization gradient estimator. However, unlike
Theorem 17, the bound obtained in that paper does not make explicit the dependence on the dimension d
and on L̄ and α.
Theorem 18 Let f be an α-strongly convex function, Θ = Rd , and let Assumptions A and B hold.
Let xt be defined by algorithm (4) with g t satisfying Assumption D, deterministic x1 and
    ηt = min{ α/(8L̄²V1), 4/(α(t+1)) } ,    ht = ( 4σ²V3/(b²L²) )^{1/(2β)} · { t^{−1/(2β)}  if ηt = 4/(α(t+1)) ;
                                                                              T^{−1/(2β)}  if ηt = α/(8L̄²V1) } .

Then

    E[f(x̂T) − f⋆] ≤ A1 ( L̄²V1/(αT) ) ‖x1 − x⋆‖²
        + ( A2 (bL)^{2/β} (V3σ²)^{(β−1)/β} + A3 V2 L̄² ( V3σ²/(b²L²) )^{1/β} T^{−2/β} ) T^{−(β−1)/β} α^{−1} ,

where A1, A2, A3 > 0 depend only on β.
Subsequently, in Corollary 19, we customize the above theorem for gradient estimators (6) and (7),
with assignments of ηt , ht that are again selected based on Table 1. We also include a bound for
E[‖x̂T − x⋆‖²], which is an immediate consequence of (2).
Corollary 19 Let f be an α-strongly convex function, Θ = Rd , and let Assumptions A and B hold.
Let xt be defined by algorithm (4) with gradient estimator (6) or (7), and parameters ηt , ht as in
Theorem 18, where b, V1 , V2 , V3 are given in Table 1 for each gradient estimator, respectively. Let
x1 be deterministic. Then for any T ≥ d^{2−β/2} σ²/L² we have

    E[f(x̂T) − f⋆] ≤ A1 ( L̄²d/(αT) ) ‖x1 − x⋆‖² + ( A2 + A3 L̄²/σ² ) ( d²σ²/T )^{(β−1)/β} L^{2/β} α^{−1} ,    (20)

    E[‖x̂T − x⋆‖²] ≤ 2A1 ( L̄²d/(α²T) ) ‖x1 − x⋆‖² + 2 ( A2 + A3 L̄²/σ² ) ( d²σ²/T )^{(β−1)/β} L^{2/β} α^{−2} ,    (21)
where A1 , A2 , A3 > 0 depend only on β and on the choice of the gradient estimator.
With a slightly different definition of smoothness class (which coincides with ours for β = 2, cf.
Remark 2), a result comparable to Corollary 19 is derived in (Akhavan et al., 2020, Theorem 3.2).
However, that result imposes an additional condition on α (i.e., α ≳ √(d/T)) and provides a bound
with the dimension factor d² rather than d^{2−2/β} in Corollary 19. We also note that earlier Bach and
Perchet (2016) analyzed the ℓ2-randomized gradient estimator with integer β > 2 and proved a
bound with a slower (suboptimal) rate T^{−(β−1)/(β+1)}.
Theorem 20 Let Θ ⊂ Rd be a compact convex set. Assume that f is an α-strongly convex function,
Assumptions A and B hold, and maxx∈Θ k∇f (x)k ≤ G. Let xt be defined by algorithm (4) with
gradient estimator g_t satisfying Assumption D and ηt = 4/(α(t+1)), ht = ( σ²V3/(b²L²t) )^{1/(2β)}. Then

    E[f(x̂T) − f⋆] ≤ 4L̄²V1G²/(αT) + (A1/α) ( V3σ² ( V3σ²/(b²L²) )^{−1/β} + V2 L̄² ( V3σ²/(b²L²) )^{1/β} T^{−2/β} ) T^{−(β−1)/β} .
Using the bounds on the variance and bias of gradient estimators (6) and (7) from Section 3,
Remark 5 and the trivial bounds E[f (x̂T ) − f ? ] ≤ GB, E[kx̂T − x? k2 ] ≤ B 2 , where B is the
Euclidean diameter of Θ, we immediately obtain the following corollary.
Corollary 21 Let Θ ⊂ Rd be a compact convex set. Assume that f is an α-strongly convex function,
Assumptions A and B hold, and maxx∈Θ k∇f (x)k ≤ G. Let xt be defined by algorithm (4) with
gradient estimator (6) or (7), and parameters ηt , ht as in Theorem 20, where b, V1 , V2 , V3 are given
in Table 1 for each gradient estimator, respectively. Then for any T ≥ d^{2−β/2} σ²/L² we have

    E[f(x̂T) − f⋆] ≤ min{ GB, 4L̄²V1G²/(αT) } + ( A1 + A2 L̄²/σ² ) ( d²σ²/T )^{(β−1)/β} L^{2/β} α^{−1} ,    (22)

    E[‖x̂T − x⋆‖²] ≤ min{ B², 2GB/α, 8L̄²V1G²/(α²T) } + 2 ( A1 + A2 L̄²/σ² ) ( d²σ²/T )^{(β−1)/β} L^{2/β} α^{−2} ,    (23)

where B is the Euclidean diameter of Θ, and A1, A2 > 0 depend only on β and on the choice of
the gradient estimator.
In a similar setting, but assuming independent zero-mean ξt ’s, Bach and Perchet (2016) considered
the case of ℓ2 randomization and proved, for integer β > 2, a bound with suboptimal rate T^{−(β−1)/(β+1)}.
Corollary 21 can be also compared to Akhavan et al. (2020, 2021) (ℓ2 randomization and coordinate-wise
randomization) and, for β > 2, to Novitskii and Gasnikov (2022) (ℓ2 randomization). However,
those papers use a slightly different definition of β-smoothness class (both definitions coincide if
β = 2, see Remark 2). Their bounds guarantee the rate O( d^{2−1/β} α^{−1} T^{−(β−1)/β} ) for β > 2 (Akhavan et al., 2021,
Corollary 6), (Novitskii and Gasnikov, 2022, Theorem 1) and O( d/√(αT) ) for β = 2 (Akhavan et al.,
2020, Theorem D.4) by using two different approaches for the two cases. In contrast, Corollary 21
yields O( d^{2−2/β} α^{−1} T^{−(β−1)/β} ) and O( d/(α√T) ), respectively, and obtains these rates by a unified approach for all β ≥ 2.
5. Lower bounds
In this section we prove minimax lower bounds on the optimization error over all sequential strategies
with two-point feedback that allow the query points to depend on the past. For t = 1, . . . , T , we
assume that the values yt = f(z_t) + ξt and y′t = f(z′_t) + ξ′t are observed, where (ξt, ξ′t) are random
noises, and (z_t, z′_t) are query points. We consider all strategies of choosing the query points as
z_t = Φt( (z_i, y_i)_{i=1}^{t−1}, (z′_i, y′_i)_{i=1}^{t−1}, τ_t ) and z′_t = Φ′t( (z_i, y_i)_{i=1}^{t−1}, (z′_i, y′_i)_{i=1}^{t−1}, τ_t ) for t ≥ 2, where
Φt's and Φ′t's are measurable functions, z_1, z′_1 ∈ R^d are any random variables, and {τ_t} is a
sequence of random variables with values in a measurable space (Z, U), such that τ_t is independent
of ( (z_i, y_i)_{i=1}^{t−1}, (z′_i, y′_i)_{i=1}^{t−1} ). We denote by ΠT the set of all such strategies of choosing query
points up to t = T. The class ΠT includes the sequential strategy of Algorithm (4) with either of
the two considered gradient estimators (6) and (7). In this case, τ_t = (ζ_t, r_t), z_t = x_t + h_t ζ_t r_t
and z′_t = x_t − h_t ζ_t r_t, where ζ_t = ζ°_t or ζ_t = ζ_t for the ℓ2 and the ℓ1 randomization, respectively.
To state our assumption on the noises (ξt, ξ′t), we introduce the squared Hellinger distance
H²(·, ·), defined for two probability measures P, P′ on a measurable space (Ω, A) as

    H²(P, P′) ≜ ∫ ( √dP − √dP′ )² .

Assumption E
• The cumulative distribution function Ft : R² → R of the random variable (ξt, ξ′t) is such that

    H²( P_{Ft(·,·)}, P_{Ft(·+v, ·+v′)} ) ≤ I0 (v² + v′²)    for all v, v′ ∈ [−v0, v0] ,    (24)

for some 0 < I0 < ∞, 0 < v0 ≤ ∞. Here, P_{F(·,·)} denotes the probability measure
corresponding to the c.d.f. F(·, ·).
Condition (24) is not restrictive and encompasses a large family of distributions. It is satisfied
with small enough v0 for distributions that correspond to regular statistical experiments, (see e.g.,
Ibragimov and Khas'minskii, 1982, Chapter 1). If Ft is a Gaussian c.d.f., condition (24) is satisfied
with v0 = ∞.
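For instance, for a standard Gaussian shifted by v one has the closed form H²(N(0,1), N(v,1)) = 2(1 − e^{−v²/8}) ≤ v²/4, so a quadratic bound of the type appearing in (24) holds for all shifts. The numerical check below only confirms this closed form; the quadrature grid and tolerances are our own choices.

```python
import numpy as np

def hellinger_sq(v, lo=-20.0, hi=20.0, n=600001):
    """H^2(N(0,1), N(v,1)) = int (sqrt(p) - sqrt(q))^2 dx, by a Riemann sum."""
    x = np.linspace(lo, hi, n)
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    q = np.exp(-(x - v)**2 / 2) / np.sqrt(2 * np.pi)
    return float(((np.sqrt(p) - np.sqrt(q))**2).sum() * (x[1] - x[0]))

# numeric value vs. closed form 2*(1 - exp(-v^2/8)) for a few shifts v
checks = {v: (hellinger_sq(v), 2 * (1 - np.exp(-v**2 / 8))) for v in (0.1, 0.5, 1.0, 2.0)}
```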
To state the lower bounds, we consider a subset of the classes of functions, for which we obtained
the upper bounds in Section 4. Let Θ = {x ∈ R^d : ‖x‖ ≤ 1}. For α, L, L̄ > 0, β ≥ 2, let F_{α,β}
denote the set of all α-strongly convex functions f that satisfy Assumption A, attain their minimum
over R^d in Θ, and are such that max_{x∈Θ} ‖∇f(x)‖ ≤ G, where the condition G > α is satisfied.²
Theorem 22 Let Θ = {x ∈ R^d : ‖x‖ ≤ 1} and let Assumption E hold. Then, for any estimator
x̃T based on the observations ((z_t, yt), (z′_t, y′t), t = 1, . . . , T), where ((z_t, z′_t), t = 1, . . . , T) are
obtained by any strategy in the class ΠT, we have

    sup_{f∈F_{α,β}} E[ f(x̃T) − f⋆ ] ≥ C min{ max( α, T^{−1/2+1/β} ), d/√T, (d/α) T^{−(β−1)/β} } ,    (25)

and

    sup_{f∈F_{α,β}} E[ ‖z_T − x⋆(f)‖² ] ≥ C min{ 1, d/T^{1/β}, (d/α²) T^{−(β−1)/β} } ,    (26)

where C > 0 is a constant that does not depend on T, d, and α, and x⋆(f) is the minimizer of f
on Θ.
2. The condition G ≥ α is necessary for the class Fα,β to be non-empty. Indeed, due to (1) and (2), for all f ∈ Fα,β
and x ∈ Θ we have kx − x∗ k ≤ G/α, and thus 2G/α ≥ diam(Θ) = 2.
Some remarks are in order here. First, note that the threshold T −1/2+1/β on the strong convexity
parameter α plays an important role in bounds (25) and (26). Indeed, for α below this threshold,
the bounds start to be independent of α. Intuitively, it seems reasonable that α-strong convexity
should be of no added value for small α. Theorem 22 allows us to quantify exactly how small such
α should be, namely, α ≲ T^{−1/2+1/β}. In particular, for β = 2 the threshold occurs at α ≍ 1. Also,
quite naturally, the threshold becomes smaller when the smoothness β increases.
In the regime below the T^{−1/2+1/β} threshold, the rate of (25) becomes min( T^{1/β}, d )/√T,
which is asymptotically d/√T independently of the smoothness index β and of α. Thus, we obtain
that d/√T is a lower bound over the class of simply convex functions. On the other hand, the
achievable rate for convex functions is shown to be d^{16}/√T in Agarwal et al. (2011) and improved
to d^{4.5}/√T in Lattimore and György (2021) (both results are up to poly-logarithmic factors, and
under sub-Gaussian noise ξt). The gap between our lower bound d/√T and these upper bounds
is only in the dependence on the dimension, but this gap is substantial. In the regime where α is
above the T^{−1/2+1/β} threshold, our results imply that the gap between upper and lower bounds is
much smaller. Thus, our upper bounds in this regime scale as (d^{2−2/β}/α) T^{−(β−1)/β} while the lower bound
of Theorem 22 is of the order (d/α) T^{−(β−1)/β}.
Consider now the case β = 2. Then the lower bounds (25) and (26) are of order d/( max(α, 1) √T )
and d/( max(α², 1) √T ), respectively, under the condition T ≥ d² guaranteeing non-triviality of
the rates. If, in addition, α ≳ 1 (meaning that α is above the threshold α ≍ 1) we obtain the lower
rates d/(α√T) and d/(α²√T), respectively. Comparing this remark to Corollary 21 we obtain the
following result.
Corollary 23 Let β = 2 and let the assumptions of Theorem 22 and Corollary 21 hold. If α ≥ 1
and T ≥ max( d², L̄²d, L̄⁴G⁴ ) then there exist positive constants c, C that do not depend on T, d,
and α such that we have the following bounds on the minimax risks:

    c d/(α√T) ≤ inf_{x̃T} sup_{f∈F_{α,β}} E[ f(x̃T) − f⋆ ] ≤ C d/(α√T) ,    (27)

and

    c d/(α²√T) ≤ inf_{x̃T} sup_{f∈F_{α,β}} E[ ‖x̃T − x⋆(f)‖² ] ≤ C d/(α²√T) ,    (28)
where x? (f ) is the minimizer of f on Θ, and the infimum is over all estimators x̃T based on query
points obtained via strategies in the class ΠT . The minimax rates in (27) and (28) are attained by
the estimator x̃T = x̂T with parameters as in Corollary 21.
Thus, the weighted average estimator x̂T as in Corollary 21 is minimax optimal with respect to all
the three parameters T, d, and α, both in the optimization error and in the estimation risk. Note
that we introduced the condition T ≥ max( L̄²d, L̄⁴G⁴ ) in Corollary 23 to guarantee that the upper
bounds are of the required order, cf. Corollary 21. Thus, G is allowed to be a function of T
that grows not too fast. Since G > α this condition also prevents α from being too large, that is,
Corollary 23 does not hold if α & T 1/4 .
The issue of finding the minimax optimal rates in gradient-free stochastic optimization under
strong convexity and smoothness assumptions has a long history. It was initiated in Fabian (1967);
Polyak and Tsybakov (1990) and more recently developed in Dippon (2003); Jamieson et al. (2012);
Shamir (2013); Bach and Perchet (2016); Akhavan et al. (2020, 2021). It was shown in Polyak
and Tsybakov (1990) that the minimax optimal rate on the class of α-strongly convex and β-Hölder
functions scales as c(α, d)T −(β−1)/β for β ≥ 2, where c(α, d) is an unspecified function of α and d
(for d = 1 and integer β ≥ 2 an upper bound of the same order was earlier derived in Fabian (1967)).
The issue of establishing non-asymptotic fundamental limits as a function of the main parameters
of the problem (α, d and T ) was first addressed in Jamieson et al. (2012), giving a lower bound
of Ω(√(d/T)) for β = 2, without specifying the dependency on α. This was improved to Ω(d/√T)
when α ≍ 1 by Shamir (2013), who also claimed that the rate d/√T is optimal for β = 2, referring
to an upper bound in Agarwal et al. (2010). However, invoking Agarwal et al. (2010) in the setting
with random noise ξt (for which the lower bound of Shamir (2013) was proved) is not legitimate
because in Agarwal et al. (2010) the observations are considered as a Lipschitz function of t. The
complete proof of minimax optimality of the rate d/√T for β = 2 under random noise was later
provided in Akhavan et al. (2020). However, the upper and the lower bounds in Akhavan et al.
(2020) still differ in their dependence on α. Corollary 23 completes this line of work by establishing
the minimax optimality as a function of the whole triplet (T, d, α) for β = 2.
The main lines of the proof of Theorem 22 follow Akhavan et al. (2020). However, in Akhavan
et al. (2020) the assumptions on the noise are much more restrictive – the random variables ξt are
assumed iid and instead of (24) a much stronger condition is imposed, namely, a bound on the
Kullback–Leibler divergence. In particular, in order to use the Kullback–Leibler divergence between
two distributions we need one of them to be absolutely continuous with respect to the other. Using
the Hellinger distance allows us to drop this restriction. For example, if Ft is a distribution with
bounded support then the Kullback–Leibler divergence between Ft (·) and Ft (· + v) is +∞ while
the Hellinger distance is finite.
Acknowledgment: The work of A. Akhavan and A.B. Tsybakov was supported by the grant
of French National Research Agency (ANR) "Investissements d'Avenir" LabEx Ecodec/ANR-11-
LABX-0047.
References
A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with
multi-point bandit feedback. In Proc. 23rd International Conference on Learning Theory, pages
28–40, 2010.
A. Akhavan, M. Pontil, and A.B. Tsybakov. Exploiting higher order smoothness in derivative-free
optimization and continuous bandits. In Advances in Neural Information Processing Systems 33,
2020.
A. Akhavan, E. Chzhen, M. Pontil, and A.B. Tsybakov. A gradient estimator via l1-randomization
for online zero-order optimization with two point feedback. In Advances in Neural Information
Processing Systems 35, 2022.
Y. Arjevani, Y. Carmon, J. Duchi, D. Foster, N. Srebro, and B. Woodworth. Lower bounds for
non-convex stochastic optimization. Mathematical Programming, 2022.
F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. In Proc. 29th Annual
Conference on Learning Theory, 2016.
M. V. Balashov, B. T. Polyak, and A. A. Tremba. Gradient projection and conditional gradient meth-
ods for constrained nonconvex minimization. Numerical Functional Analysis and Optimization,
41(7):822–849, 2020.
F. Barthe, O. Guédon, S. Mendelson, and A. Naor. A probabilistic approach to the geometry of the
Lnp ball. The Annals of Probability, 33(2):480–513, 2005.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine
Learning, 8(3-4):231–257, 2015.
Y. Carmon, J. Duchi, O Hinder, and Aaron Sidford. Lower bounds for finding stationary points I.
Mathematical Programming, 184:71–120, 2020.
J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex
optimization: The power of two function evaluations. IEEE Transactions on Information Theory,
61(5):2788–2806, 2015.
V. Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of
Mathematical Statistics, 38(1):191–200, 1967.
A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting:
gradient descent without a gradient. In Proc. 16th Annual ACM-SIAM Symposium on Discrete
algorithms (SODA), 2005.
G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: Beyond the
worst case with the help of geometry. Mathematical Programming, 198:937–996, 2023.
S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic
programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods
under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in
Databases, 2016.
T. Lattimore and A. György. Improved regret for zeroth-order stochastic convex bandits. In Advances
in Neural Information Processing Systems 34, 2021.
R. Osserman. The isoperimetric inequality. Bulletin of the American Mathematical Society, 84(6):
1182–1238, 1978.
B.T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathe-
matics and Mathematical Physics, 3:864–878, 1963.
F. Qi and Q.-M. Luo. Bounds for the ratio of two gamma functions: from Wendel’s asymptotic
relation to Elezović-Giordano-Pečarić’s theorem. Journal of Inequalities and Applications, 2013.
M. Rando, C. Molinari, S. Villa, and L. Rosasco. Stochastic zeroth order descent with structured
directions. arXiv:2206.05124, 2022.
G. Schechtman and J. Zinn. On the volume of the intersection of two Lnp balls. Proceedings of the
American Mathematical Society, 110(1):217–224, 1990.
O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proc.
30th Annual Conference on Learning Theory, pages 1–22, 2013.
O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point
feedback. Journal of Machine Learning Research, 18(1):1703–1713, 2017.
Appendix
In this appendix we first provide some auxiliary results and then prove the results stated in the
main body of the paper.
Additional notation Let W_1, W_2 be two random variables; we write W_1 =^d W_2 to denote their
equality in distribution. We also denote by Γ : R_+ → R_+ the gamma function defined, for every
z > 0, as Γ(z) = ∫_0^∞ x^{z−1} exp(−x) dx.
Proof The first equality of the remark follows from the definition. For the second one it is sufficient
to show that for each m = (m1 , . . . , md )> ∈ Nd with |m| = k there exist exactly k!/m! distinct
choices of (m1 , . . . , mk ) ∈ (Nd )k with |m1 | = . . . = |mk | = 1 and m1 + . . . + mk = m. To
see this, we map m ∈ N^d into a word containing letters from {a_1, a_2, . . . , a_d} as

    m ↦ W(m) ≜ a_1 . . . a_1 a_2 . . . a_2 . . . a_d . . . a_d ,

where the letter a_j is repeated m_j times.
We observe that the condition m_1 + . . . + m_k = m implies that the word W(m_1) + W(m_2) +
. . . + W (mk ) is a permutation of W (m). A standard combinatorial fact states that the number of
distinct permutations of W (m) is given by the multinomial coefficient, i.e., by k!/m!. Since the
mapping (m1 , . . . , mk ) 7→ W (m1 ) + W (m2 ) + . . . + W (mk ) is invertible, we conclude.
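The counting step above can be checked mechanically for a small multi-index (the encoding of m into a tuple mirrors the map W(m) of the proof; the helper names are ours):

```python
from itertools import permutations
from math import factorial

def word(m):
    """W(m): the j-th letter repeated m_j times, encoded as a tuple."""
    return tuple(j for j, mj in enumerate(m) for _ in range(mj))

def k_fact_over_m_fact(m):
    """k!/m! with k = |m| and m! = m_1! ... m_d!."""
    out = factorial(sum(m))
    for mj in m:
        out //= factorial(mj)
    return out

m = (2, 1, 1)                                   # |m| = k = 4, so k!/m! = 12
n_distinct = len(set(permutations(word(m))))    # distinct permutations of W(m)
```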
Lemma 25 Assume that f ∈ F_β(L) for some β ≥ 2 and L > 0. Let v ∈ R^d with ‖v‖ = 1 and
define the function g_v : R^d → R as g_v(x) ≜ ⟨v, ∇f(x)⟩, x ∈ R^d. Then g_v ∈ F_{β−1}(L).
Proof Set ℓ ≜ ⌊β⌋. Note that since f is ℓ times continuously differentiable, g_v is ℓ − 1 times
continuously differentiable. Furthermore, for any h_1, . . . , h_{ℓ−1} ∈ R^d,

    g_v^{(ℓ−1)}(x)[h_1, . . . , h_{ℓ−1}] = Σ_{|m_1|=...=|m_{ℓ−1}|=1} D^{m_1+...+m_{ℓ−1}} g_v(x) h_1^{m_1} · . . . · h_{ℓ−1}^{m_{ℓ−1}} .

Hence, for any x, z ∈ R^d we can write, by definition of the norm of an (ℓ−1)-linear form,

    ‖g_v^{(ℓ−1)}(x) − g_v^{(ℓ−1)}(z)‖
        = sup{ g_v^{(ℓ−1)}(x)[h_1, . . . , h_{ℓ−1}] − g_v^{(ℓ−1)}(z)[h_1, . . . , h_{ℓ−1}] : ‖h_j‖ = 1, j ∈ [ℓ−1] }
        = sup{ f^{(ℓ)}(x)[h_1, . . . , h_{ℓ−1}, v] − f^{(ℓ)}(z)[h_1, . . . , h_{ℓ−1}, v] : ‖h_j‖ = 1, j ∈ [ℓ−1] } .
Lemma 26 Fix some real β ≥ 2 and assume that f ∈ F_β(L). Then, for all x, z ∈ R^d,

    | f(x) − Σ_{0≤|m|≤ℓ} (1/m!) D^m f(z)(x − z)^m | ≤ (L/ℓ!) ‖x − z‖^β .

Proof Fix some x, z ∈ R^d. Taylor expansion yields that, for some c ∈ (0, 1),

    f(x) = Σ_{0≤|m|≤ℓ−1} (1/m!) D^m f(z)(x − z)^m + Σ_{|m|=ℓ} (1/m!) D^m f(z + c(x − z))(x − z)^m .

Therefore,

    f(x) − Σ_{|m|≤ℓ} (1/m!) D^m f(z)(x − z)^m = Σ_{|m|=ℓ} (1/m!) ( D^m f(z + c(x − z)) − D^m f(z) ) (x − z)^m
        = (1/ℓ!) ( f^{(ℓ)}(z + c(x − z))[x − z]^ℓ − f^{(ℓ)}(z)[x − z]^ℓ )
        ≤ (L/ℓ!) ‖x − z‖^ℓ ‖c(x − z)‖^{β−ℓ} ≤ (L/ℓ!) ‖x − z‖^β .
Lemma 27 Let f : R^d → R be continuously differentiable, let U° be uniformly distributed on B_2^d
and ζ° uniformly distributed on ∂B_2^d, and let r be a random variable with a continuous distribution
on [−1, 1]. Then, for any x ∈ R^d and h > 0,

    E[ ∇f(x + hrU°) r K(r) ] = (d/h) E[ f(x + hrζ°) ζ° K(r) ] .
Proof Fix r ∈ [−1, 1] \ {0}. Define φ : Rd → R as φ(u) = f (x + hru)K(r) and note that
∇φ(u) = hr∇f (x + hru)K(r). Hence, we have
    E[ ∇f(x + hrU°) K(r) | r ] = (1/(hr)) E[ ∇φ(U°) | r ] = (d/(hr)) E[ φ(ζ°) ζ° | r ]
        = (d/(hr)) K(r) E[ f(x + hrζ°) ζ° | r ] ,
where the second equality is obtained from a version of Stokes’ theorem (see e.g., Zorich, 2016,
Section 13.3.5, Exercise 14a). Multiplying by r from both sides, using the fact that r follows a
continuous distribution, and taking the total expectation concludes the proof.
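The divergence-theorem identity above is easy to sanity-check by Monte Carlo. In the sketch below we freeze r = 1, take K ≡ 1, and use a quadratic f, for which both sides equal ∇f(x) = 2x exactly; the sampling scheme (Gaussian directions, radius U^{1/d} for the ball) is a standard construction of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
d, h, n = 4, 0.5, 400000
x = np.array([0.3, -1.0, 0.7, 0.2])
f = lambda z: (z**2).sum(axis=-1)                 # f(z) = ||z||^2, grad f(z) = 2z

g = rng.standard_normal((n, d))
dirs = g / np.linalg.norm(g, axis=1, keepdims=True)
ball = dirs * rng.uniform(0.0, 1.0, (n, 1)) ** (1.0 / d)   # uniform on the ball B_2^d
sphere = dirs                                              # uniform on the sphere

lhs = (2.0 * (x + h * ball)).mean(axis=0)                  # E[grad f(x + h U)]
rhs = d / h * (f(x + h * sphere)[:, None] * sphere).mean(axis=0)
```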
Proof of Lemma 6 Using Lemma 27, the fact that ∫_{−1}^{1} r K(r) dr = 1, and the variational
representation of the Euclidean norm, we can write

    ‖E[g°_t | x_t] − ∇f(x_t)‖ = sup_{v∈∂B_2^d} E[ ( ∇_v f(x_t + h_t r_t U°) − ∇_v f(x_t) ) r_t K(r_t) ] ,    (29)

where we recall that U° is uniformly distributed on B_2^d. Lemma 25 asserts that for any v ∈ ∂B_2^d the
directional gradient ∇_v f(·) is (β − 1, L)-Hölder. Thus, due to Lemma 26 we have the following
Taylor expansion:

    ∇_v f(x_t + h_t r_t U°) = ∇_v f(x_t) + Σ_{1≤|m|≤ℓ−1} ( (r_t h_t)^{|m|}/m! ) D^m ∇_v f(x_t)(U°)^m + R(h_t r_t U°) ,    (30)

where the residual term R(·) satisfies |R(x)| ≤ (L/(ℓ−1)!) ‖x‖^{β−1}.
Substituting (30) in (29) and using the "zeroing-out" properties of the kernel K, we deduce that

    ‖E[g°_t | x_t] − ∇f(x_t)‖ ≤ (L/(ℓ−1)!) κ_β h_t^{β−1} E[‖U°‖^{β−1}] = (L/(ℓ−1)!) κ_β h_t^{β−1} d/(d + β − 1) ,

where the last equality is obtained from the fact that E[‖U°‖^q] = d/(d + q) for any q ≥ 0.
Proof The proof is analogous to that of Lemma 27 using (Akhavan et al., 2022, Theorem 6).
In order to obtain a bound on the bias of the estimator in (31) we need the following result,
which controls the moments of the Euclidean norm of U .
Lemma 29 Let U ∈ R^d be distributed uniformly on B_1^d. Then for any β ≥ 1 it holds that

    E[‖U‖^β] ≤ c^{β+1} d^{β/2} Γ(β + 1) Γ(d + 1) / Γ(d + β + 1) ,

where c > 0 is a numerical constant.
Proof Let W = (W_1, . . . , W_d), W_{d+1} be i.i.d. random variables following the Laplace distribution
with mean 0 and scale parameter 1. Then, following (Barthe et al., 2005, Theorem 1) we have

    U =^d W / ( ‖W‖_1 + |W_{d+1}| ) ,

where the sign =^d stands for equality in distribution. Furthermore, it follows from (Barthe et al.,
2005, Theorem 2) (see also Rachev and Ruschendorf (1991); Schechtman and Zinn (1990)) that

    (W, |W_{d+1}|) / ( ‖W‖_1 + |W_{d+1}| )    and    ‖W‖_1 + |W_{d+1}|

are independent. Note that |W_j| is an exp(1) random variable for any j = 1, . . . , d. Thus, if
1 ≤ β < 2, by Jensen's inequality we can write

    E[‖W‖^β] = E[ ( Σ_{j=1}^d W_j² )^{β/2} ] ≤ ( Σ_{j=1}^d E[W_j²] )^{β/2} = d^{β/2} E[W_1²]^{β/2} = d^{β/2} Γ(3)^{β/2} .    (33)
It remains to provide a suitable expression for E[‖(W, W_{d+1})‖_1^β]. We observe that ‖(W, W_{d+1})‖_1
follows the Erlang distribution with parameters (d + 1, 1) (as a sum of d + 1 i.i.d. exp(1) random
variables). Hence, using the expression for the density of the Erlang distribution we get

    E[‖(W, W_{d+1})‖_1^β] = (1/Γ(d + 1)) ∫_0^∞ x^{d+β} exp(−x) dx = Γ(d + β + 1) / Γ(d + 1) .    (35)
Proof of Lemma 8 Using Lemma 28 and following the same lines as in the proof of Lemma 6 we
deduce that

    ‖E[g_t | x_t] − ∇f(x_t)‖ ≤ (L/(ℓ−1)!) κ_β h_t^{β−1} E[‖U‖^{β−1}] ≤ (L/(ℓ−1)!) κ_β h_t^{β−1} c^β d^{(β−1)/2} Γ(β) Γ(d + 1) / Γ(d + β) ,

where the last inequality is due to Lemma 29. Next, recall that the Gamma function satisfies
Γ(z + 1) = zΓ(z) for any z > 0. Applying this relation iteratively, using (Qi and Luo, 2013,
Remark 1), and using the fact that ℓ = ⌊β⌋, we get

    Γ(d + 1) / Γ(d + β) ≤ d^{1−β} .

Proceeding analogously we obtain that Γ(β)/(ℓ−1)! ≤ ℓ^{β−ℓ}. Combining these bounds with the
first display yields the lemma.
Proof of Lemma 9 Let W = (W_1, . . . , W_d), where the
W_j are i.i.d. Laplace random variables with mean 0 and scale parameter 1. Set T(w) = w/‖w‖_1.
Lemma 1 in Schechtman and Zinn (1990) asserts that, for ζ uniformly distributed on ∂B_1^d,

    T(W) =^d ζ    and    T(W) is independent of ‖W‖_1 ,    (36)
where I is the identity matrix, k · k applied to matrices denotes the spectral norm, and sign(·) applied
to vectors denotes the vector of signs of the coordinates.
From this point, the proof diverges from that of Lemma 3 of Akhavan et al. (2022). Instead of
bounding the spectral norm of I − ab^⊤ by 1 + ‖a‖ ‖b‖ as it was done in that paper, we compute it
exactly, which leads to the main improvement. Namely, Lemma 30 proved below gives

    ‖I − T(W) sign(W)^⊤‖² = d ‖T(W)‖² .
Combining this equality with (37) we obtain the first bound of the lemma. The second bound of the
lemma (regarding Lipschitz functions G) is deduced from the first one by the same argument as
in Akhavan et al. (2022).
where

    A = [ γ² − 1    −γ v̄^⊤ ]
        [ −γ v̄      I       ] ,
33
A KHAVAN , C HZHEN , P ONTIL , T SYBAKOV
with (v̄)_j = ⟨q_{j+1}, v⟩ for j = 1, . . . , d − 1. Let us find the eigenvalues of A. For any λ ∈ R,
using the expression for the determinant of a block matrix we get
    det(A − λI) = (1 − λ)^{d−1} ( γ² − 1 − λ − γ² ‖v̄‖² / (1 − λ) ) .
Thus, ‖I − a sign(a)^⊤‖ = max{γ, 1} = max{ √d ‖a‖, 1 }. We conclude the proof by observing
that √d ‖a‖ ≥ ‖a‖_1 = 1.
Finally, we provide the following auxiliary lemma used in the proof of Lemma 9.
Proof Observe that the vector |ζ| ≜ (|ζ_1|, . . . , |ζ_d|)^⊤ follows the Dirichlet distribution (i.e., the
uniform distribution on the probability simplex on d atoms). In what follows we will make use of
the following expression for the moments of the Dirichlet distribution:

    E[ |ζ|^m ] = ( Γ(d)/Γ(d + |m|) ) Π_{i=1}^d Γ(m_i + 1) = (d − 1)! m! / (d − 1 + |m|)! .    (38)

Direct calculations show that Σ_{|m|=2} (2m)!/m! = 2d(d + 5). Hence, we deduce that, for all d ≥ 1,

    E[‖ζ‖⁴] = 4 d! (d + 5) / (d + 3)! .
Note that d(d + 5)/((d + 2)(d + 3)) ≤ 1 for all d ≥ 1. Thus,

    E[‖ζ‖⁴] = 4 d! (d + 5) / (d + 3)! = 4 (d + 5) / ( (d + 1)(d + 2)(d + 3) ) ≤ 4 / ( d(d + 1) ) .    (40)
Combining this bound with (39) and (40) concludes the proof.
Lemma 32 Let {δ_t}_{t≥1} be a sequence of real numbers such that, for all integers t > t_0 ≥ 1,

    δ_{t+1} ≤ ( 1 − c/t ) δ_t + Σ_{i=1}^N a_i / t^{p_i+1} ,    (41)

where a_i ≥ 0, c ≥ 1, and 0 < p_i < c for i = 1, . . . , N. Then, for all t ≥ t_0,

    δ_t ≤ Σ_{i=1}^N a_i / ( (c − p_i) t^{p_i} ) + 2 (t_0 − 1) δ_{t_0} / t .    (42)
Proof For any fixed t > 0 the convexity of the mapping u ↦ g(u) = (t + u)^{−p} implies
g(1) − g(0) ≥ g′(0), i.e., 1/t^p − 1/(t + 1)^p ≤ p/t^{p+1}. Thus,

    1/(t + 1)^p − ( 1 − c/t )(1/t^p) ≥ (c − p)/t^{p+1} ,

and consequently

    a_i / t^{p+1} ≤ ( a_i / (c − p) ) ( 1/(t + 1)^p − ( 1 − c/t )(1/t^p) ) .    (43)
Letting τ_t = δ_t − Σ_{i=1}^N a_i/((c − p_i)t^{p_i}), we have τ_{t+1} ≤ (1 − c/t)τ_t. Now, if τ_{t_0} ≤ 0 then τ_t ≤ 0 for any
t ≥ t_0 and thus (42) holds. Otherwise, if τ_{t_0} > 0 then for t ≥ t_0 + 1 we have

    τ_t ≤ τ_{t_0} Π_{i=t_0}^{t−1} ( 1 − c/i ) ≤ τ_{t_0} Π_{i=t_0}^{t−1} ( 1 − 1/i ) = (t_0 − 1) τ_{t_0} / (t − 1) ≤ 2 (t_0 − 1) δ_{t_0} / t .
Proof of Lemma 12 We have

    E_t[f(x_{t+1})] ≤ f(x_t) − η_t ⟨∇f(x_t), E_t[g_t]⟩ + (L̄η_t²/2) E_t[‖g_t‖²]
        ≤ f(x_t) − η_t ‖∇f(x_t)‖² + η_t ‖∇f(x_t)‖ ‖E_t[g_t] − ∇f(x_t)‖ + (L̄η_t²/2) E_t[‖g_t‖²] .

Furthermore, invoking the assumption on the bias and the variance of g_t and using the fact that
2ab ≤ a² + b², we deduce

    E_t[f(x_{t+1})] − f(x_t) ≤ −η_t ‖∇f(x_t)‖² + η_t b_t ‖∇f(x_t)‖ + (L̄η_t²/2)( v_t + V1 ‖∇f(x_t)‖² )
        ≤ −η_t ‖∇f(x_t)‖² + (η_t/2)( b_t² + ‖∇f(x_t)‖² ) + (L̄η_t²/2)( v_t + V1 ‖∇f(x_t)‖² )    (44)
        = −(η_t/2)( 1 − L̄η_t V1 ) ‖∇f(x_t)‖² + (η_t/2) b_t² + (L̄η_t²/2) v_t .
Let S be a random variable with values in {1, . . . , T}, independent of x_1, . . . , x_T, g_1, . . . , g_T,
and such that

    P(S = t) = η_t ( 1 − L̄η_t V1 ) / Σ_{t′=1}^T η_{t′} ( 1 − L̄η_{t′} V1 ) .
Assume that η_t in (4) is chosen to satisfy L̄η_t V1 < 1 and that f⋆ > −∞. Taking total expectation in
(44) and summing up these inequalities for t ≤ T, combined with the fact that f(x_{T+1}) ≥ f⋆, we
deduce that

    E[‖∇f(x_S)‖²] ≤ ( 2 ( E[f(x_1)] − f⋆ ) + Σ_{t=1}^T η_t ( b_t² + L̄η_t v_t ) ) / Σ_{t=1}^T η_t ( 1 − L̄η_t V1 ) .
Proof of Theorem 13 The proof will be split into two parts: for gradient estimators (6) and (7),
respectively. Both of these proofs follow from Lemma 12, which states that

    E[‖∇f(x_S)‖²] ≤ ( 2δ_1 + Σ_{t=1}^T η_t ( b_t² + L̄η_t v_t ) ) / Σ_{t=1}^T η_t ( 1 − L̄η_t V1 ) ,    (45)

where δ_1 = E[f(x_1)] − f⋆. Using the corresponding bounds on the bias b_t and variance v_t, we
substitute these values in the above inequality, with η_t and h_t obtained by optimizing the resulting
expressions.
We start with the part of the proof that is common for both gradient estimators. Introduce the
notation

    Ξ_T := d^{−2(β−1)/(2β−1)} T^{−β/(2β−1)} .
Using this notation, we consider algorithm (4) with gradient estimators (6) or (7) such that

    η_t = min{ y/d, Ξ_T }    and    h_t = h T^{−1/(2(2β−1))} ,

where

    (y, h) = { ((8κL̄)^{−1}, d^{1/(2β−1)})            for estimator (6),
               ((72κL̄)^{−1}, d^{(2β+1)/(4β−2)})      for estimator (7).
Given the values of V1 in Table 1, the choice of ηt for both algorithms ensures that
1
≤ 1 − L̄ηt V1 .
2
Thus we get from (45) that both algorithms satisfy
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \Big(\sum_{t=1}^T \eta_t\Big)^{-1}\Big(4\delta_1 + 2\sum_{t=1}^T \eta_t b_t^2 + 2\bar L\sum_{t=1}^T \eta_t^2 v_t\Big). \qquad (46)
$$
Moreover,
$$
\Big(\sum_{t=1}^T \eta_t\Big)^{-1} = \max\Big\{\frac{d}{Ty},\ \frac{1}{T\,\Xi_T}\Big\} \le \frac{d}{Ty} + \frac{1}{T\,\Xi_T}.
$$
Combining the last two displays we obtain
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \Big(\frac{d}{y} + \frac{1}{\Xi_T}\Big)\frac{4\delta_1}{T} + 2\Big(\frac{d\,\Xi_T}{Ty} + \frac{1}{T}\Big)\sum_{t=1}^T\big(b_t^2 + \bar L\,\Xi_T\, v_t\big). \qquad (47)
$$
In the rest of the proof, we use the algorithm-specific bounds on $b_t$ and $v_t$, as well as the particular choice of $y$ and $h$, to obtain the final results.
After substituting the expressions for $\Xi_T$ and $h_T$ into the above bound, the right-hand side of (49) reduces to
$$
\frac{4d}{Ty}\,\delta_1 + \bigg(4\delta_1 + \Big(\Big(\frac{d}{T^\beta}\Big)^{\frac{1}{2\beta-1}} + 1\Big)\Big(A_6 + A_7\big(1 + d^{\frac{5-2\beta}{2\beta-1}}\,T^{-\frac{2}{2\beta-1}}\big)\Big)\bigg)\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}}.
$$
To conclude, note that the assumption $T \ge d^{1/\beta}$ implies that for all $\beta \ge 2$ we have $d^{\frac{5-2\beta}{2\beta-1}}\,T^{-\frac{2}{2\beta-1}} \le 1$ and $\big(d/T^{\beta}\big)^{\frac{1}{2\beta-1}} \le 1$. Therefore, the final bound takes the form
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \frac{4d}{Ty}\,\delta_1 + \big(4\delta_1 + 2(A_6 + 2A_7)\big)\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}} \le \big(A_1\delta_1 + A_2\big)\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}},
$$
where $A_1 = 4(y^{-1}+1)$ and $A_2 = 2(A_6 + 2A_7)$.
$$
b_t^2 \le \big(c_\beta\,\kappa_\beta\,\ell L\big)^2\, h_t^{2(\beta-1)}\, d^{1-\beta}, \qquad v_t = 72\kappa\bar L^2 h_t^2 + \frac{d^3\sigma^2\kappa}{h_t^2}, \qquad V_1 = 36d\kappa,
$$
with $\ell = \lfloor\beta\rfloor$. Using these bounds in (47) we get
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \big(d\,\Xi_T + 1\big)\Big(A_6\, d^{1-\beta} h_T^{2(\beta-1)} + \Xi_T\big(A_7 h_T^2 + A_8\, d^3 h_T^{-2}\big)\Big) + \Big(\frac{d}{y} + \Xi_T^{-1}\Big)\frac{4\delta_1}{T}, \qquad (50)
$$
where the constants are defined as
Proof of Theorem 14 As in the proof of Theorem 13, we use Lemma 12, cf. (45):
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \frac{2\delta_1 + \sum_{t=1}^T \eta_t\big(b_t^2 + \bar L\eta_t v_t\big)}{\sum_{t=1}^T \eta_t\big(1-\bar L\eta_t V_1\big)}.
$$
From this inequality and the fact that, by assumption, $\eta_t = (2\bar LV_1)^{-1}$, we obtain
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \frac{8\bar LV_1\,\delta_1}{T} + \frac{2}{T}\sum_{t=1}^T\big(b_t^2 + (2V_1)^{-1}v_t\big).
$$
Since the gradient estimators (6) and (7) satisfy Assumption D and we consider the case $\sigma = 0$, $\beta = 2$, the values $b_t = bLh_t$ and $v_t = V_2\bar L^2h_t^2$ can be made as small as desired by choosing $h_t$ small enough. Thus, we can take $h_t$ sufficiently small to have $\sum_{t=1}^T\big(b_t^2 + (2V_1)^{-1}v_t\big) \le \bar LV_1$. Under this choice of $h_t$,
$$
\mathbf{E}\big[\|\nabla f(x_S)\|^2\big] \le \big(8\delta_1 + 2\big)\frac{\bar LV_1}{T}.
$$
Using the values of $V_1$ for the gradient estimators (6) and (7) (see Table 1) we obtain the result.
$$
\begin{aligned}
\mathbf{E}_t[f(x_{t+1})] &\le f(x_t) - \eta_t\big\langle\nabla f(x_t),\ \mathbf{E}_t[g_t]\big\rangle + \frac{\bar L\eta_t^2}{2}\,\mathbf{E}_t\big[\|g_t\|^2\big] \\
&\le f(x_t) - \eta_t\|\nabla f(x_t)\|^2 + \eta_t\|\nabla f(x_t)\|\,\big\|\mathbf{E}_t[g_t]-\nabla f(x_t)\big\| + \frac{\bar L\eta_t^2}{2}\,\mathbf{E}_t\big[\|g_t\|^2\big].
\end{aligned}
$$
Next, invoking Assumption D on the bias and variance of $g_t$ and using the elementary inequality $2ab \le a^2 + b^2$, we get that, for the iterative procedure (4) with $\Theta = \mathbb{R}^d$,
$$
\delta_{t+1} \le \delta_t - \frac{\eta_t}{2}\big(1-\bar L\eta_t V_1\big)\mathbf{E}\big[\|\nabla f(x_t)\|^2\big] + \frac{\eta_t}{2}\,b^2L^2 h_t^{2(\beta-1)} + \frac{\bar L\eta_t^2}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big),
$$
where $\delta_t = \mathbf{E}[f(x_t) - f^\star]$. Furthermore, our choice of the step size $\eta_t$ ensures that $1 - \bar L\eta_t V_1 \ge \frac{1}{2}$. Using this inequality and the fact that $f$ is $\alpha$-gradient dominant, we deduce that
$$
\delta_{t+1} \le \delta_t\Big(1 - \frac{\eta_t\alpha}{2}\Big) + \frac{\eta_t}{2}\,b^2L^2 h_t^{2(\beta-1)} + \frac{\bar L\eta_t^2}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big). \qquad (51)
$$
We now analyze this recursion according to the cases $T > T_0$ and $T \le T_0$, where $T_0 := \big\lfloor\frac{8\bar LV_1}{\alpha}\big\rfloor$ is the value of $t$ at which $\eta_t$ switches its regime.
First case: $T > T_0$. In this case, the recursion (51) has two different regimes, depending on the value of $\eta_t$. In the first regime, for any $t = T_0+1, \dots, T$, we have $\eta_t = \frac{4}{\alpha t}$ and (51) takes the form
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{t}\Big) + \frac{2b^2L^2}{\alpha}\cdot\frac{h_t^{2(\beta-1)}}{t} + \frac{8\bar L}{\alpha^2t^2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big). \qquad (52)
$$
Additionally, in this regime of $t$ we have $h_t = \Big(\frac{4\bar L\sigma^2V_3}{b^2L^2\alpha t}\Big)^{\frac{1}{2\beta}}$. Using this expression for $h_t$ in (52) we obtain
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{t}\Big) + \frac{A_3}{\alpha}\Big(\frac{\bar L\sigma^2V_3}{\alpha}\Big)^{1-\frac{1}{\beta}}\big(b^2L^2\big)^{\frac{1}{\beta}}\, t^{-\frac{2\beta-1}{\beta}} + \frac{A_4}{\alpha}\, V_2\bar L^2\Big(\frac{\bar L}{\alpha}\Big)^{1+\frac{1}{\beta}}\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\, t^{-\frac{2\beta+1}{\beta}}, \qquad (53)
$$
where $A_3 = 2^{4-\frac{2}{\beta}}$ and $A_4 = 2^{3+\frac{2}{\beta}}$. Applying Lemma 32 to the above recursion we get
$$
\delta_T \le \frac{2T_0}{T}\,\delta_{T_0+1} + \frac{\beta A_3}{(\beta+1)\alpha}\, V_3\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}} + \frac{\beta A_4}{(3\beta+1)\alpha}\, V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L}\Big)^{-\frac{\beta+1}{\beta}}. \qquad (54)
$$
If $T_0 = 0$, this concludes the proof for the case $T > T_0$. Otherwise, we consider the second regime, which corresponds to $t \in [1, T_0]$. In this regime we have $h_t = \Big(\frac{4\bar L\sigma^2V_3}{b^2L^2\alpha T}\Big)^{\frac{1}{2\beta}}$, $\eta_t = \frac{1}{2\bar LV_1}$, and $\frac{4}{(T_0+1)\alpha} \le \eta_t \le \frac{4}{T_0\alpha}$. Using these expressions for $h_t$ and $\eta_t$ in (51) we get that, for $1 \le t \le T_0$,
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{T_0+1}\Big) + \frac{2^{4-\frac{2}{\beta}}}{T_0\alpha}\Big(\frac{\bar L\sigma^2V_3}{\alpha}\Big)^{1-\frac{1}{\beta}}\big(b^2L^2\big)^{\frac{1}{\beta}}\Big(T^{-\frac{\beta-1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0}\Big) + \frac{2^{3+\frac{2}{\beta}}}{T_0^2\alpha}\, V_2\bar L^2\Big(\frac{\bar L}{\alpha}\Big)^{1+\frac{1}{\beta}}\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\, T^{-\frac{1}{\beta}}.
$$
Using the rough bound $1-\frac{2}{T_0+1} \le 1$ and unfolding the above recursion we obtain
$$
\delta_{T_0+1} \le \delta_1 + \frac{2^{4-\frac{2}{\beta}}}{\alpha}\Big(\frac{\bar L\sigma^2V_3}{\alpha}\Big)^{1-\frac{1}{\beta}}\big(b^2L^2\big)^{\frac{1}{\beta}}\Big(T^{-\frac{\beta-1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0}\Big) + \frac{2^{3+\frac{2}{\beta}}}{T_0\alpha}\, V_2\bar L^2\Big(\frac{\bar L}{\alpha}\Big)^{1+\frac{1}{\beta}}\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\, T^{-\frac{1}{\beta}}.
$$
Taking into account the definition of $T_0$ and the fact that $T_0 \le T$, we further obtain
$$
\frac{2T_0}{T}\,\delta_{T_0+1} \le \frac{16\bar LV_1}{\alpha T}\,\delta_1 + \frac{2A_3}{\alpha}\, V_3\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}} + \frac{2A_4}{\alpha}\, V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L}\Big)^{-\frac{\beta+1}{\beta}}. \qquad (55)
$$
Combining this bound with (54) yields
$$
\delta_T \le A_1\,\frac{\bar LV_1}{\alpha T}\,\delta_1 + \frac{A_2}{\alpha}\bigg(V_3\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}} + V_2\bar L^2\sigma^{-2}\Big(\frac{V_3}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{2}{\beta}}\bigg)\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}},
$$
where $A_1 = 16$ and $A_2 = \big(2+\frac{\beta}{\beta+1}\big)A_3 + \big(2+\frac{\beta}{3\beta+1}\big)A_4$.
Second case: $T \le T_0$. In this case we have $h_t = \Big(\frac{4\bar L\sigma^2V_3}{b^2L^2\alpha T}\Big)^{\frac{1}{2\beta}}$, and thus (51) takes the form
$$
\begin{aligned}
\delta_{T+1} &\le \delta_1\Big(1-\frac{2}{T_0+1}\Big)^{T} + \frac{2^{4-\frac{2}{\beta}}}{T_0\alpha}\Big(\frac{\bar L\sigma^2V_3}{\alpha}\Big)^{1-\frac{1}{\beta}}\big(b^2L^2\big)^{\frac{1}{\beta}}\Big(T^{-\frac{\beta-1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0}\Big)\,T \\
&\qquad + \frac{2^{3+\frac{2}{\beta}}}{T_0^2\alpha}\, V_2\bar L^2\Big(\frac{\bar L}{\alpha}\Big)^{1+\frac{1}{\beta}}\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\, T^{-\frac{1}{\beta}}\,T \\
&\le \delta_1\Big(1-\frac{2}{T_0+1}\Big)^{T} + \frac{A_3}{\alpha}\, V_3\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}} + \frac{A_4}{\alpha}\, V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L}\Big)^{-\frac{\beta+1}{\beta}},
\end{aligned}
$$
where in the last inequality we used $T \le T_0$.
Note that, for any $\rho \in (0,1)$ and $T > 0$, we have $(1-\rho)^T \le \exp(-\rho T) \le \frac{1}{\rho T}$. Using this inequality with $\rho = \frac{2}{T_0+1}$, the definition of $T_0$, and the fact that $T+1 \le 2T$, we obtain
$$
\delta_{T+1} \le A_1\,\frac{\bar LV_1}{\alpha(T+1)}\,\delta_1 + \frac{A_2}{\alpha}\bigg(V_3\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}} + V_2\bar L^2\sigma^{-2}\Big(\frac{V_3}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha(T+1)}{\bar L\sigma^2}\Big)^{-\frac{2}{\beta}}\bigg)\Big(\frac{\alpha(T+1)}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}}.
$$
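The chain $(1-\rho)^T \le \exp(-\rho T) \le \frac{1}{\rho T}$ is elementary (it combines $1-\rho \le e^{-\rho}$ with $e^{-x} \le 1/x$); a quick numeric check:

```python
from math import exp

def chain(rho: float, T: int):
    # returns the three members of (1 - rho)^T <= exp(-rho*T) <= 1/(rho*T)
    return (1 - rho) ** T, exp(-rho * T), 1.0 / (rho * T)

for rho in [0.01, 0.1, 2 / 11, 0.5, 0.9]:
    for T in [1, 2, 5, 10, 100]:
        a, b, c = chain(rho, T)
        assert a <= b <= c
```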
Proof of Theorem 17 As in the proof of Theorem 15, we consider separately the cases $T > T_0$ and $T \le T_0$, where $T_0 := \big\lfloor\frac{8\bar LV_1}{\alpha}\big\rfloor$.
First case: $T > T_0$. First, consider the algorithm at steps $t = T_0, \dots, T$, where we have $\eta_t = \frac{4}{\alpha t}$. Since $\sigma = 0$ and $\beta = 2$, from (51) we have
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{t}\Big) + \frac{2b^2L^2}{\alpha t}\,h_t^2 + \frac{8\bar L^3}{\alpha^2t^2}\,V_2h_t^2. \qquad (56)
$$
Since $\beta = 2$, we have $L = \bar L$. Thus, using the assumption that $h_t \le \Big(\frac{\bar L\vee 1}{\alpha\wedge 1}\, T\Big(2b^2\bar L + \frac{8\bar L^2V_2}{\alpha}\Big)\Big)^{-\frac{1}{2}}$, we deduce from (56) that
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{t}\Big) + \frac{\bar L}{\alpha t^2}. \qquad (57)
$$
Applying Lemma 32 to the above recursion gives
$$
\delta_T \le \frac{2T_0}{T}\,\delta_{T_0+1} + \frac{\bar L}{\alpha T}. \qquad (58)
$$
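The step from (57) to (58) is an instance of Lemma 32; for the recursion $\delta_{t+1} = (1-\frac{2}{t})\delta_t + \frac{c}{t^2}$ the conclusion can be illustrated numerically (with $c$ playing the role of $\bar L/\alpha$; this is a sanity check on a toy instance, not a proof):

```python
def unroll(delta_start: float, t_start: int, T: int, c: float) -> float:
    # iterate delta_{t+1} = (1 - 2/t) * delta_t + c / t^2 for t = t_start, ..., T-1
    delta = delta_start
    for t in range(t_start, T):
        delta = (1 - 2 / t) * delta + c / t**2
    return delta

T0, c, delta_start = 3, 1.7, 5.0  # arbitrary toy values
for T in range(T0 + 2, 300):
    delta_T = unroll(delta_start, T0 + 1, T, c)
    # Lemma-32-type bound: delta_T <= 2*T0*delta_{T0+1}/T + c/T
    assert delta_T <= 2 * T0 * delta_start / T + c / T + 1e-12
```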
If $T_0 = 0$, this concludes the proof for the case $T > T_0$. Otherwise, we consider the algorithm at steps $t = 1, \dots, T_0$, where $\eta_t = \frac{1}{2\bar LV_1}$ and $\frac{4}{(T_0+1)\alpha} \le \eta_t \le \frac{4}{T_0\alpha}$. From (51) with $\sigma = 0$ and $\beta = 2$ we obtain
$$
\delta_{t+1} \le \delta_t\Big(1-\frac{2}{T_0+1}\Big) + \frac{\bar L}{\alpha T_0}\Big(2b^2\bar L + \frac{8}{\alpha}\bar L^2V_2\Big)h_t^2. \qquad (59)
$$
Using here the assumption that $h_t \le \Big(\frac{\bar L\vee 1}{\alpha\wedge 1}\, T\Big(2b^2\bar L + \frac{8\bar L^2V_2}{\alpha}\Big)\Big)^{-\frac{1}{2}}$ and the rough bound $1-\frac{2}{T_0+1} \le 1$, and summing up both sides of the resulting inequality from $t = 1$ to $t = T_0$, we get
$$
\delta_{T_0+1} \le \delta_1 + \frac{\bar L(\alpha\wedge 1)}{\alpha(\bar L\vee 1)T}.
$$
Combining this inequality with (58) and using the definition of $T_0$ and the fact that $T_0 \le T$, we obtain the bound
$$
\delta_T \le \frac{16\bar LV_1}{\alpha T}\,\delta_1 + \frac{16\bar LV_1}{\alpha T^2} + \frac{\bar L}{\alpha T}.
$$
It remains to note that $V_1 \le 36d\kappa$, cf. Table 1. This implies the theorem for the case $T > T_0$ with $A_1 = 576\kappa$ and $A_2 = \frac{1}{T} + \frac{1}{A_1 d}$.
Second case: $T \le T_0$. Using the fact that $h_t \le \Big(\frac{\bar L\vee 1}{\alpha\wedge 1}\, T\Big(2b^2\bar L + \frac{8\bar L^2V_2}{\alpha}\Big)\Big)^{-\frac{1}{2}}$ and unfolding the recursion in (59) gives
$$
\delta_T \le \delta_1\Big(1-\frac{2}{T_0+1}\Big)^{T} + \frac{\bar L}{\alpha T}.
$$
From the elementary inequality $(1-\rho)^T \le \frac{1}{\rho T}$, which is valid for all $\rho \in (0,1)$ and $T > 0$, we obtain with $\rho = \frac{2}{T_0+1}$ that
$$
\delta_T \le \frac{T_0+1}{2T}\,\delta_1 + \frac{\bar L}{\alpha T} \le \frac{T_0}{T}\,\delta_1 + \frac{\bar L}{\alpha T} \le A_1\,\frac{\bar Ld}{\alpha T}\big(\delta_1 + A_2\big),
$$
where $A_1 = 288\kappa$, $A_2 = \frac{1}{A_1 d}$, and the last inequality follows from the facts that $T_0 \le \frac{8\bar LV_1}{\alpha}$ and $V_1 \le 36d\kappa$, cf. Table 1.
Lemma 33 Consider the iterative algorithm defined in (4). Let $f$ be $\alpha$-strongly convex on $\mathbb{R}^d$ and let Assumption D be satisfied. Let the minimizer $x^\star$ of $f$ on $\Theta$ be such that $\nabla f(x^\star) = 0$. Then we have
$$
\mathbf{E}[f(x_t) - f^\star] \le \frac{r_t - r_{t+1}}{2\eta_t} - \Big(\frac{\alpha}{4} - \frac{\eta_t}{2}\bar L^2V_1\Big)r_t + \frac{\big(bLh_t^{\beta-1}\big)^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big), \qquad (60)
$$
where $r_t = \mathbf{E}\big[\|x_t - x^\star\|^2\big]$.

Proof Recall the notation $\mathbf{E}_t[\cdot] = \mathbf{E}[\,\cdot \mid x_t]$. For any $x \in \Theta$, by the definition of the projection,
$$
\|x_{t+1} - x\|^2 = \big\|\mathrm{Proj}_\Theta\big(x_t - \eta_t g_t\big) - x\big\|^2 \le \|x_t - \eta_t g_t - x\|^2. \qquad (61)
$$
Expanding the squares and rearranging the above inequality, we deduce that (61) is equivalent to
$$
bLh_t^{\beta-1}\,\|x_t - x^\star\| \le \frac{\big(bLh_t^{\beta-1}\big)^2}{\alpha} + \frac{\alpha}{4}\,\|x_t - x^\star\|^2. \qquad (66)
$$
Substituting (66) in (65), setting $r_t = \mathbf{E}[a_t]$, and taking the total expectation of both sides of the resulting inequality yields the lemma.
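Inequality (66) is a weighted Young inequality $ax \le a^2/\alpha + (\alpha/4)x^2$, equivalent to $(a/\sqrt{\alpha} - \sqrt{\alpha}\,x/2)^2 \ge 0$; a quick numeric check:

```python
import random

def gap(a: float, x: float, alpha: float) -> float:
    # a^2/alpha + (alpha/4)*x^2 - a*x = (a/sqrt(alpha) - sqrt(alpha)*x/2)^2 >= 0;
    # here a plays the role of b*L*h_t^(beta-1) and x of ||x_t - x*||
    return a**2 / alpha + (alpha / 4) * x**2 - a * x

random.seed(1)
for _ in range(1000):
    a, x = random.uniform(0, 10), random.uniform(0, 10)
    alpha = random.uniform(0.01, 10)
    assert gap(a, x, alpha) >= -1e-12
```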
Proof of Theorem 18 By definition, $\eta_t \le \frac{\alpha}{4\bar L^2V_1}$, so that $\frac{\alpha}{4} - \frac{\eta_t}{2}\bar L^2V_1 \ge \frac{\alpha}{8}$ and (60) implies that
$$
\mathbf{E}[f(x_t) - f^\star] \le \frac{r_t - r_{t+1}}{2\eta_t} - \frac{\alpha}{8}\,r_t + \frac{\big(bLh_t^{\beta-1}\big)^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big). \qquad (67)
$$
Set $T_0 := \max\big\{\big\lfloor\frac{32\bar L^2V_1}{\alpha^2}\big\rfloor - 1,\ 0\big\}$. This is the value of $t$ at which $\eta_t$ switches its regime. We analyze the recursion (67) separately for the cases $T > T_0$ and $T \le T_0$. We will use the fact that, by the convexity of $f$ and Jensen's inequality,
$$
f(\hat x_T) - f^\star \le \frac{2}{T(T+1)}\sum_{t=1}^T t\big(f(x_t) - f^\star\big). \qquad (68)
$$
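Inequality (68) is Jensen's inequality applied to the weighted average $\hat x_T$ with weights $w_t = \frac{2t}{T(T+1)}$, which sum to one; a quick numeric check with an arbitrary convex function (our choice):

```python
def f(x: float) -> float:
    # an arbitrary convex test function; any convex f would do
    return (x - 1.0) ** 2

T = 20
xs = [((-1) ** t) * 3.0 / (t + 1) for t in range(1, T + 1)]  # arbitrary "iterates"
w = [2 * t / (T * (T + 1)) for t in range(1, T + 1)]         # weights w_t = 2t/(T(T+1))
assert abs(sum(w) - 1.0) < 1e-12
x_hat = sum(wi * xi for wi, xi in zip(w, xs))
# Jensen: f(x_hat) <= sum_t w_t * f(x_t); subtracting f* on both sides gives (68)
assert f(x_hat) <= sum(wi * f(xi) for wi, xi in zip(w, xs)) + 1e-12
```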
First case: $T > T_0$. In this case, we decompose the sum in (68) into the sum over $t \in [T_0+1, T]$ and the sum over $t \in [1, T_0]$.

We first evaluate the sum $\sum_{t=T_0+1}^T t\,\mathbf{E}[f(x_t) - f^\star]$. For any $t \in [T_0+1, T]$, we have $\eta_t = \frac{8}{\alpha(t+1)}$ and $h_t = \Big(\frac{4\sigma^2V_3}{b^2L^2t}\Big)^{\frac{1}{2\beta}}$. Using in (67) these values of $\eta_t$ and $h_t$, multiplying by $t$, and summing both sides of the resulting inequality from $T_0+1$ to $T$, we deduce that
$$
\sum_{t=T_0+1}^T t\,\mathbf{E}[f(x_t) - f^\star] \le \underbrace{\frac{\alpha}{16}\sum_{t=T_0+1}^T t\big((r_t - r_{t+1})(t+1) - 2r_t\big)}_{=:I} + \underbrace{\frac{A_4}{\alpha}(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}}\sum_{t=T_0+1}^T t^{\frac{1}{\beta}}}_{=:II} + \underbrace{\frac{A_5}{\alpha}\,\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\sum_{t=T_0+1}^T t^{-\frac{1}{\beta}}}_{=:III},
$$
where $A_4 = 2^{\frac{3\beta-2}{\beta}}$, $A_5 = 2^{\frac{2\beta+2}{\beta}}$, and we defined the terms $I$, $II$, and $III$ that will be evaluated separately.
It is not hard to check that $I \le \frac{\alpha}{16}\,T\,T_0\,r_{T_0+1}$, since the summation in the term $I$ is telescoping. Next, we have
$$
II \le \frac{A_4}{\alpha}(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}}\sum_{t=1}^{T} t^{\frac{1}{\beta}} \le \frac{A_6}{\alpha}(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}}\,T^{\frac{1}{\beta}+1}.
$$
Finally,
$$
III \le \frac{A_5}{\alpha}\,\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\sum_{t=1}^{T} t^{-\frac{1}{\beta}} \le \frac{A_7}{\alpha}\,\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{1-\frac{1}{\beta}},
$$
where $A_6 = \frac{\beta+1}{\beta}A_4$ and $A_7 = \frac{\beta}{\beta-1}A_5$. Combining these bounds on $I$, $II$, and $III$ we obtain
$$
\sum_{t=T_0+1}^T t\,\mathbf{E}[f(x_t) - f^\star] \le \frac{\alpha}{16}\,T\,T_0\,r_{T_0+1} + \Big(A_6(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}} + A_7V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)\frac{T^{\frac{1}{\beta}+1}}{\alpha}. \qquad (69)
$$
If $T_0 = 0$, then combining (68) and (69) proves the theorem. If $T_0 \ge 1$, we additionally need to control the value $r_{T_0+1}$ on the right-hand side of (69). It follows from (67) that, for $1 \le t \le T_0$,
$$
r_{t+1} \le r_t + 2\eta_t\,\frac{\big(bLh_t^{\beta-1}\big)^2}{\alpha} + \eta_t^2\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big).
$$
Moreover, for $1 \le t \le T_0$ we have $\eta_t = \frac{\alpha}{4\bar L^2V_1}$ and $\eta_t \le \frac{8}{\alpha(T_0+1)}$. Therefore, unfolding the above recursion we get
$$
r_{T_0+1} \le r_1 + \sum_{t=1}^{T_0}\Big(\frac{16}{\alpha^2T_0}\big(bLh_t^{\beta-1}\big)^2 + \frac{64}{\alpha^2T_0^2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big)\Big).
$$
For $1 \le t \le T_0$ we have $h_t = \Big(\frac{4\sigma^2V_3}{b^2L^2T}\Big)^{\frac{1}{2\beta}}$, which yields
$$
r_{T_0+1} \le r_1 + 16\Big(A_4(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}} + A_5V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)\frac{T^{\frac{1}{\beta}}}{\alpha^2T_0},
$$
so that
$$
\frac{\alpha}{16}\,T\,T_0\,r_{T_0+1} \le \frac{2\bar L^2TV_1}{\alpha}\,r_1 + \Big(A_4(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}} + A_5V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)\frac{T^{\frac{1}{\beta}+1}}{\alpha}. \qquad (70)
$$
We now evaluate the sum $\sum_{t=1}^{T_0} t\,\mathbf{E}[f(x_t) - f^\star]$. Recall that for $t \in [1, T_0]$ the parameters $h_t, \eta_t$ take constant values: $h_t = \Big(\frac{4\sigma^2V_3}{b^2L^2T}\Big)^{\frac{1}{2\beta}}$ and $\eta_t = \frac{\alpha}{4\bar L^2V_1}$. Omitting in (67) the term $-\alpha r_t/8$, summing the resulting recursion from $1$ to $T_0$, and using the inequality $\eta_t \le \frac{8}{\alpha(T_0+1)}$, we obtain
$$
\begin{aligned}
\sum_{t=1}^{T_0} t\,\mathbf{E}[f(x_t) - f^\star] &\le T\sum_{t=1}^{T_0}\mathbf{E}[f(x_t) - f^\star] \qquad (72) \\
&\le \frac{Tr_1}{2\eta_1} + T\sum_{t=1}^{T_0}\Big(\frac{\big(bLh_t^{\beta-1}\big)^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big)\Big) \\
&\le \frac{Tr_1}{2\eta_1} + T^2\,\frac{\big(bLh_1^{\beta-1}\big)^2}{\alpha} + \frac{2T}{\alpha}\big(V_2\bar L^2h_1^2 + V_3\sigma^2h_1^{-2}\big) \\
&\le \frac{2\bar L^2TV_1}{\alpha}\,r_1 + \Big(A_4(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}} + A_5V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)\frac{T^{\frac{1}{\beta}+1}}{\alpha}.
\end{aligned}
$$
Summing up (71) and (72) and using (68), we obtain the bound of the theorem:
$$
\mathbf{E}[f(\hat x_T) - f^\star] \le A_1\,\frac{\bar L^2V_1}{\alpha T}\,r_1 + \Big(A_2(bL)^{\frac{2}{\beta}}\big(V_3\sigma^2\big)^{\frac{\beta-1}{\beta}} + A_3V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)\frac{T^{-\frac{\beta-1}{\beta}}}{\alpha}.
$$
Multiplying both sides of (73) by $t$, summing up from $t = 1$ to $T$, and using the fact that
$$
\sum_{t=1}^T\Big(\frac{t(r_t - r_{t+1})}{2\eta_t} - \frac{\alpha}{4}\,t\,r_t\Big) \le 0 \qquad\text{if}\qquad \eta_t = \frac{4}{\alpha(t+1)},
$$
we find that
$$
\sum_{t=1}^T t\,\mathbf{E}[f(x_t) - f^\star] \le \frac{1}{\alpha}\sum_{t=1}^T\Big(t\big(bLh_t^{\beta-1}\big)^2 + \frac{2t}{t+1}\big(V_1G^2 + V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big)\Big).
$$
Since $h_t = \Big(\frac{\sigma^2V_3}{b^2L^2t}\Big)^{\frac{1}{2\beta}}$ we obtain
$$
\sum_{t=1}^T t\,\mathbf{E}[f(x_t) - f^\star] \le \frac{2V_1G^2T}{\alpha} + \frac{A_3}{\alpha}\Big(V_3\sigma^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{-\frac{1}{\beta}} + V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}\,T^{-\frac{2}{\beta}}\Big)T^{1+\frac{1}{\beta}}, \qquad (74)
$$
where $A_3 = 2$. To complete the proof, we multiply both sides of (74) by $\frac{2}{T(T+1)}$ and use (68).
With this notation, we have $\mathrm{d}P_f = \prod_{t=1}^T \mathrm{d}F_{f,t}$. Using the definition of the Hellinger distance we obtain
$$
1 - \frac{1}{2}H^2(P_f, P_{f'}) = \int\sqrt{\mathrm{d}P_f\,\mathrm{d}P_{f'}} = \prod_{t=1}^T\int\sqrt{\mathrm{d}F_{f,t}\,\mathrm{d}F_{f',t}} = \prod_{t=1}^T\Big(1 - \frac{H^2\big(F_{f,t}, F_{f',t}\big)}{2}\Big).
$$
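The tensorization identity above — the Hellinger affinity $1 - H^2/2$ factorizes over product measures — can be checked directly on discrete distributions:

```python
from math import sqrt

def hellinger_sq(p, q):
    # H^2(p, q) = 2 - 2 * sum_i sqrt(p_i * q_i) for discrete distributions
    return 2.0 - 2.0 * sum(sqrt(pi * qi) for pi, qi in zip(p, q))

def product(p, q):
    # joint law of two independent coordinates, flattened in a fixed order
    return [pi * qj for pi in p for qj in q]

p1, q1 = [0.3, 0.7], [0.4, 0.6]
p2, q2 = [0.1, 0.5, 0.4], [0.2, 0.2, 0.6]
affinity_joint = 1 - hellinger_sq(product(p1, p2), product(q1, q2)) / 2
affinity_split = (1 - hellinger_sq(p1, q1) / 2) * (1 - hellinger_sq(p2, q2) / 2)
assert abs(affinity_joint - affinity_split) < 1e-12
```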
Proof of Theorem 22 The proof follows the general lines given in Akhavan et al. (2020), so that we omit some details that can be found in that paper. We first assume that $\alpha \ge T^{-1/2+1/\beta}$.

Let $\eta_0 : \mathbb{R} \to \mathbb{R}$ be an infinitely differentiable function such that
$$
\eta_0(x)
\begin{cases}
= 1 & \text{if } |x| \le 1/4,\\
\in (0, 1) & \text{if } 1/4 < |x| < 1,\\
= 0 & \text{if } |x| \ge 1.
\end{cases}
$$
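Such an $\eta_0$ can be realized explicitly via the classical $e^{-1/x}$ bump construction; a sketch (this particular formula is our illustrative choice, not the one used in the paper):

```python
from math import exp

def g(x: float) -> float:
    # C-infinity building block: exp(-1/x) for x > 0, extended by 0 for x <= 0
    return exp(-1.0 / x) if x > 0 else 0.0

def smooth_step(x: float) -> float:
    # equals 0 for x <= 0, equals 1 for x >= 1, strictly between 0 and 1 otherwise
    return g(x) / (g(x) + g(1.0 - x))

def eta0(x: float) -> float:
    # 1 on |x| <= 1/4, 0 on |x| >= 1, in (0, 1) for 1/4 < |x| < 1
    return 1.0 - smooth_step((abs(x) - 0.25) / 0.75)

assert eta0(0.0) == 1.0 and eta0(0.25) == 1.0
assert eta0(1.0) == 0.0 and eta0(-2.0) == 0.0
assert 0.0 < eta0(0.5) < 1.0 and 0.0 < eta0(-0.9) < 1.0
```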
Set $\eta(x) = \int_{-\infty}^x \eta_0(\tau)\,\mathrm{d}\tau$. Let $\Omega = \{-1, 1\}^d$ be the set of binary sequences of length $d$. Consider the finite set of functions $f_\omega : \mathbb{R}^d \to \mathbb{R}$, $\omega = (\omega_1, \dots, \omega_d) \in \Omega$, defined as follows:
$$
f_\omega(u) = \alpha(1+\delta)\,\|u\|^2/2 + \sum_{i=1}^d \omega_i\, r h^\beta \eta\big(u_i h^{-1}\big), \qquad u = (u_1, \dots, u_d),
$$
where $\omega_i \in \{-1, 1\}$, $h = \min\big\{(\alpha^2/d)^{\frac{1}{2(\beta-1)}},\ T^{-\frac{1}{2\beta}}\big\}$, and $r > 0$, $\delta > 0$ are fixed numbers that will be chosen small enough.
It is shown in Akhavan et al. (2020) that if $\alpha \ge T^{-1/2+1/\beta}$ then $f_\omega \in \mathcal{F}'_{\alpha,\beta}$ for $r > 0$ and $\delta > 0$ small enough, and that the minimizers of the functions $f_\omega$ belong to $\Theta$ and are of the form
$$
x^*_\omega = \big(x^\star(\omega_1), \dots, x^\star(\omega_d)\big), \qquad\text{where}\quad x^\star(\omega_i) = -\omega_i\,\alpha^{-1}(1+\delta)^{-1} r h^{\beta-1}.
$$
For any fixed $\omega \in \Omega$, we denote by $P_{\omega,T}$ the probability measure corresponding to the joint distribution of $(A_T, (\tau_i)_{i=2}^T)$, where $y_t = f_\omega(z_t) + \xi_t$ and $y'_t = f_\omega(z'_t) + \xi'_t$ with $(\xi_t, \xi'_t)$'s satisfying Assumption E, and $(z_t, z'_t)$'s chosen by a sequential strategy in $\Pi_T$. Consider the statistic
$$
\hat\omega \in \operatorname*{arg\,min}_{\omega\in\Omega}\ \|\tilde x_T - x^*_\omega\|.
$$
Then
$$
\max_{\omega\in\Omega}\mathbf{E}_{\omega,T}\big[\|\tilde x_T - x^*_\omega\|^2\big] \ge \alpha^{-2}r^2h^{2\beta-2}\,\inf_{\hat\omega}\max_{\omega\in\Omega}\mathbf{E}_{\omega,T}\big[\rho(\hat\omega, \omega)\big].
$$
Moreover, for any $\omega, \omega' \in \Omega$,
$$
\max_{u\in\mathbb{R}^d}\big|f_\omega(u) - f_{\omega'}(u)\big| \le 2rh^\beta\eta(1) \le 2rT^{-1/2}\eta(1).
$$
Thus, choosing $r$ small enough to satisfy $2r\eta(1) < \min(v_0, I_0)$, we ensure $2rT^{-1/2}\eta(1) \le v_0T^{-1/2}$, which allows us to apply Lemma 34 and deduce, for the considered $\omega, \omega' \in \Omega$, that
$$
H^2\big(P_{\omega,T}, P_{\omega',T}\big) \le 2\Big(1 - \big(1 - (2T)^{-1}\big)^T\Big) \le 1,
$$
where we have used the fact that $1 - x \ge 4^{-x}$ for $0 < x \le 1/2$. Applying (Tsybakov, 2009, Theorem 2.12) we deduce that
Therefore, we have proved that if $\alpha \ge T^{-1/2+1/\beta}$ there exist $r > 0$ and $\delta > 0$ such that
$$
\max_{\omega\in\Omega}\mathbf{E}_{\omega,T}\big[\|\tilde x_T - x^*_\omega\|^2\big] \ge 0.3\, d\,\alpha^{-2}r^2h^{2\beta-2} = 0.3\, r^2\min\Big\{1,\ \frac{d}{\alpha^2}\,T^{-\frac{\beta-1}{\beta}}\Big\}. \qquad (75)
$$
This implies (26) for $\alpha \ge T^{-1/2+1/\beta}$. In particular, if $\alpha = \alpha_0 := T^{-1/2+1/\beta}$, the bound (75) is of the order $\min\big\{1,\ dT^{-1/\beta}\big\}$. Then for $0 < \alpha < \alpha_0$ we also have the bound of this order, since the classes $\mathcal{F}_{\alpha,\beta}$ are nested: $\mathcal{F}_{\alpha_0,\beta} \subset \mathcal{F}_{\alpha,\beta}$. This completes the proof of (26).
We now prove (25). From (75) and the $\alpha$-strong convexity of $f$ we get that, for $\alpha \ge T^{-1/2+1/\beta}$,
$$
\max_{\omega\in\Omega}\mathbf{E}_{\omega,T}\big[f(\tilde x_T) - f(x^*_\omega)\big] \ge 0.15\, r^2\min\Big\{\alpha,\ \frac{d}{\alpha}\,T^{-\frac{\beta-1}{\beta}}\Big\}. \qquad (76)
$$
This implies (25) in the zone $\alpha \ge T^{-1/2+1/\beta} = \alpha_0$, since for such $\alpha$ we have
$$
\min\Big\{\alpha,\ \frac{d}{\alpha}\,T^{-\frac{\beta-1}{\beta}}\Big\} = \min\Big\{\max\big(\alpha,\ T^{-\frac{1}{2}+\frac{1}{\beta}}\big),\ \frac{d}{\sqrt{T}\,\alpha},\ \frac{d}{\alpha}\,T^{-\frac{\beta-1}{\beta}}\Big\}.
$$
On the other hand, $\min\big\{\alpha_0,\ \frac{d}{\alpha_0}T^{-\frac{\beta-1}{\beta}}\big\} = \min\big\{T^{-\frac{1}{2}+\frac{1}{\beta}},\ d/\sqrt{T}\big\}$. The same lower bound holds for $0 < \alpha < \alpha_0$ by the nestedness argument that we used to prove (26) in the zone $0 < \alpha < \alpha_0$. Thus, (25) follows.
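The three-term rewriting of the minimum in the zone $\alpha \ge \alpha_0$ can be verified numerically (a sanity check over a grid of toy values; the grid itself is our choice):

```python
def two_term(alpha, d, T, beta):
    return min(alpha, d / alpha * T ** (-(beta - 1) / beta))

def three_term(alpha, d, T, beta):
    alpha0 = T ** (-0.5 + 1.0 / beta)
    return min(max(alpha, alpha0),
               d / (T**0.5 * alpha),
               d / alpha * T ** (-(beta - 1) / beta))

for beta in [2.0, 3.0, 5.0]:
    for d in [1, 10, 1000]:
        for T in [10, 1000, 10**6]:
            alpha0 = T ** (-0.5 + 1.0 / beta)
            for k in [1.0, 2.0, 10.0, 100.0]:
                alpha = k * alpha0  # identity is claimed only in the zone alpha >= alpha0
                lhs, rhs = two_term(alpha, d, T, beta), three_term(alpha, d, T, beta)
                assert abs(lhs - rhs) <= 1e-12 * max(1.0, lhs)
```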