
Journal of Machine Learning Research 25 (2024) 1-50 Submitted 6/23; Published 11/24

Gradient-free optimization of highly smooth functions: improved analysis and a new algorithm

Arya Akhavan    aria.akhavanfoomani@iit.it
CSML, Istituto Italiano di Tecnologia
CMAP, École Polytechnique, IP Paris

Evgenii Chzhen    evgenii.chzhen@cnrs.fr
CNRS, LMO, Université Paris-Saclay

Massimiliano Pontil    massimiliano.pontil@iit.it
CSML, Istituto Italiano di Tecnologia
University College London

Alexandre B. Tsybakov    alexandre.tsybakov@ensae.fr
CREST, ENSAE, IP Paris

Editor: Krishnakumar Balasubramanian

Abstract

This work studies minimization problems with zero-order noisy oracle information under the
assumption that the objective function is highly smooth and possibly satisfies additional properties.
We consider two kinds of zero-order projected gradient descent algorithms, which differ in the form
of the gradient estimator. The first algorithm uses a gradient estimator based on randomization over
the ℓ2 sphere due to Bach and Perchet (2016). We present an improved analysis of this algorithm on
the class of highly smooth and strongly convex functions studied in the prior work, and we derive
rates of convergence for two more general classes of non-convex functions. Namely, we consider
highly smooth functions satisfying the Polyak-Łojasiewicz condition and the class of highly smooth
functions with no additional property. The second algorithm is based on randomization over the
ℓ1 sphere, and it extends to the highly smooth setting the algorithm that was recently proposed
for Lipschitz convex functions in Akhavan et al. (2022). We show that, in the case of a noiseless
oracle, this novel algorithm enjoys better bounds on bias and variance than the ℓ2 randomization
and the commonly used Gaussian randomization algorithms, while in the noisy case both ℓ1 and ℓ2
algorithms benefit from similar improved theoretical guarantees. The improvements are achieved
thanks to a new proof technique based on Poincaré type inequalities for uniform distributions on
the ℓ1 or ℓ2 spheres. The results are established under weak (almost adversarial) assumptions on
the noise. Moreover, we provide minimax lower bounds proving optimality or near optimality of
the obtained upper bounds in several cases.

Keywords: smooth optimization, zero-order oracle, gradient-free optimization, stochastic zero-order algorithms, minimax optimality, Polyak-Łojasiewicz condition

©2024 Arya Akhavan, Evgenii Chzhen, Massimiliano Pontil, and Alexandre B. Tsybakov.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
https://jmlr.org/papers/v25/23-0733.html.

1. Introduction
In this work, we study the problem of gradient-free optimization for certain types of smooth
functions. Let f : Rd → R and Θ ⊆ Rd . We are interested in solving the following optimization
problem

f⋆ := inf_{x∈Θ} f(x),

and we assume that f⋆ is finite. One main theme of this paper is to exploit higher order smoothness
properties of the underlying function f in order to improve the performance of the optimization
algorithm. We assume that the algorithm has access to a zero-order stochastic oracle which, given
a point x ∈ ℝ^d, returns a noisy value of f(x), under a general noise model.
We study two kinds of zero-order projected gradient descent algorithms, which differ in the
form of the gradient estimator. Both algorithms can be written as an iterative update of the form

x_1 ∈ Θ   and   x_{t+1} = Proj_Θ(x_t − η_t g_t),   t = 1, 2, …,

where g t is a gradient estimator at the point xt , ηt > 0 is a step size, and ProjΘ (·) is the Euclidean
projection operator onto the set Θ. In either case, the gradient estimator is built from two noisy
function values, that are queried at two random perturbations of the current guess for the solution,
and it involves an additional randomization step. Also, both algorithms invoke smoothing kernels
in order to take advantage of higher order smoothness, following the approach initially suggested
in (Polyak and Tsybakov, 1990). The first algorithm uses a form of ℓ2 randomization introduced
by Bach and Perchet (2016) and it has been studied in the context of gradient-free optimization of
strongly convex functions in (Akhavan et al., 2020; Novitskii and Gasnikov, 2022). The second
algorithm is an extension, by incorporating smoothing kernels, of the approach proposed and
analysed in Akhavan et al. (2022) for online minimization of Lipschitz convex functions. It is based
on an alternative randomization scheme, which uses the ℓ1-geometry in place of the ℓ2 one.
A principal goal of this paper is to derive upper bounds on the expected optimization error
of both algorithms under different assumptions on the underlying function f . These assumptions
are used to set the step size in the algorithms and the perturbation parameter used in the gradient
estimator. Previous works on gradient-free optimization of highly smooth functions considered
mostly the strongly convex case (Polyak and Tsybakov, 1990; Bach and Perchet, 2016; Akhavan
et al., 2020, 2021; Novitskii and Gasnikov, 2022). In this paper, we provide a refined analysis of
the strongly convex case, improving the dependence on the dimension d and the strong convexity
parameter α for the algorithm with ℓ2 randomization, and showing analogous results for the new
method with ℓ1 randomization. For the special case of strongly convex functions with Lipschitz
gradient, we find the minimax optimal dependence of the bounds on all three parameters of the
problem (namely, the horizon T, the dimension d, and α) and we show that both algorithms attain
the minimax rate, which equals α^{−1}d/√T. This finalizes the line of work starting from (Polyak
and Tsybakov, 1990), where it was proved that the optimal dependence on T is of the order 1/√T, and
papers proving that the optimal dependence on d and T scales as d/√T (Shamir (2013) establishing a


lower bound and Akhavan et al. (2020) giving a complete proof, see the discussion in Section 5).
Furthermore, we complement these results by considering highly smooth but not necessarily convex
functions f , and highly smooth functions f , which additionally satisfy the gradient dominance
(Polyak-Łojasiewicz) condition. To this end, we develop unified tools that can cover a variety of
gradient estimators, and then apply them to the algorithm with ℓ2 randomization and to our new
algorithm based on ℓ1 randomization. We show that, in the case of a noiseless oracle, this novel
algorithm enjoys better bounds on bias and variance than its ℓ2 counterpart, while in the noisy case,
both algorithms benefit from similar theoretical guarantees. The improvements in the analysis are
achieved thanks to a new method of evaluating the bias and variance of both algorithms based on
Poincaré type inequalities for uniform distributions on the ℓ1 or ℓ2 spheres. Moreover, we establish
all our upper bounds under very weak (almost adversarial) assumptions on the noise.

1.1 Summary of the upper bounds


In this subsection, we give a high-level overview of the main contributions of this work. Apart from
the improved guarantees for the previously studied function classes, one of the main novelties of
our work is the analysis in the case of a non-convex highly smooth objective function, for which
we provide a convergence rate to a stationary point. Furthermore, we study the case of α-gradient
dominant f , a popular relaxation of strong convexity, which includes non-convex functions. To the
best of our knowledge, the analysis of noisy zero-order optimization in these two cases is novel. In
Section 5 we derive minimax lower bounds and discuss the optimality or near optimality of our
convergence rates.
In the following we highlight the guarantees that we derive for the two analysed algorithms.
Each of the guarantees differs in the dependency on the main parameters of the problem, which is
a consequence of the different types of available properties of the objective function. Let us also
mention that we mainly deal with the unconstrained optimization case, Θ = Rd . This is largely due
to the fact that the Polyak-Łojasiewicz inequality is mainly used in the unconstrained case and the
extension of this condition to the constrained case is still an active area of research (see e.g., Balashov
et al., 2020, and references therein). Meanwhile, for the strongly convex case, as in previous works
(Bach and Perchet, 2016; Akhavan et al., 2020; Novitskii and Gasnikov, 2022), we additionally treat
the constrained optimization. In this section, we only sketch our results in the case Θ = Rd .
We would like to emphasize that we do not assume the measurement noise to have zero mean.
Moreover, the noise can be non-random and no independence between noises on different time steps
is required, so that the setting can be considered as almost adversarial.

Rate of convergence under only smoothness assumption. Assume that f is a β-Hölder function
with Lipschitz continuous gradient, where β ≥ 2. Then, after 2T oracle queries both considered
algorithms provide a point x_S satisfying

E[‖∇f(x_S)‖²] ≲ (d²/T)^{(β−1)/(2β−1)}   under the assumption that T ≥ d^{2/β},


where S is a random variable with values in {1, …, T}, ‖·‖ denotes the Euclidean norm, and the
sign ≲ conceals a multiplicative constant that does not depend on T and d. To the best of our knowledge,
this result is the first convergence guarantee for the zero-order stochastic optimization under the
considered noise model. In a related development, Ghadimi and Lan (2013); Balasubramanian and
Ghadimi (2021) study zero-order optimization of non-convex objective function with Lipschitz
gradient, which corresponds to β = 2. They assume querying two function values with identical
noises, in which case the analysis and the convergence rates are essentially analogous to the particular
case of our setting with no noise (see the discussion in Section 2.2 below). The work of Carmon
et al. (2020) studies noiseless optimization of highly smooth functions assuming that the derivatives
up to higher order are observed, and Arjevani et al. (2022) consider stochastic optimization with first
order oracle. These papers cannot be directly compared with our work as the settings are different.
Rate of convergence under smoothness and Polyak-Łojasiewicz assumptions. Assume that f
is a β-Hölder function with Lipschitz continuous gradient, with a Lipschitz constant L̄. Additionally,
let β ≥ 2, and suppose that f satisfies the Polyak-Łojasiewicz inequality with a constant α. Then,
after 2T oracle queries, both considered algorithms provide a point xT , for which the expected
optimization error satisfies

E[f(x_T) − f⋆] ≲ (1/α) · (µd²/T)^{(β−1)/β}   under the assumption that T ≳ d^{2−β/2},

where µ = L̄/α, and the signs ≲ and ≳ conceal multiplicative constants that do not depend on T, d
and α. The Polyak-Łojasiewicz assumption was introduced in the context of first order optimization
by Polyak (1963) who showed that it implies linear convergence of the gradient descent algorithm.
Years later, this condition received attention in the machine learning and optimization community
following the work of Karimi et al. (2016). To the best of our knowledge, zero-order optimization
under the considered noise model with the Polyak-Łojasiewicz assumption was not previously
studied. Very recently Rando et al. (2022) studied a related problem under the Polyak-Łojasiewicz
assumption when querying two function values with identical noises, which can be compared with
the analysis in the particular case of our setting with no noise (see the discussion in Section 2.2).
Unlike our work, Rando et al. (2022) do not deal with higher order smoothness and do not derive
the dependency of the bounds on d, µ and α.
Rate of convergence under smoothness and strong convexity. Assume that f is a β-Hölder
function with Lipschitz continuous gradient, where β ≥ 2, and satisfies α-strong convexity condition.
Then, after 2T oracle queries, both considered algorithms provide a point x̂_T such that

E[f(x̂_T) − f⋆] ≲ (1/α) · (d²/T)^{(β−1)/β}   under the assumption that T ≥ d^{2−β/2},

where ≲ conceals a multiplicative constant that does not depend on T, d and α. The closest result to
ours is obtained in Akhavan et al. (2020) and it splits into two cases: β = 2 (Lipschitz continuous
ours is obtained in Akhavan et al. (2020) and it splits into two cases: β = 2 (Lipschitz continuous


gradient) and β > 2 (higher order smoothness). For β = 2, Akhavan et al. (2020) deal with a
compact Θ and prove a bound with optimal dependence on the dimension (linear in d) but sub-optimal
in α, while for β > 2 they derive (for Θ = ℝ^d and for compact Θ) the rate with the sub-optimal
dimension factor d². Later, Akhavan et al. (2021) and Novitskii and Gasnikov (2022) improved the
dimension factor to d^{2−1/β} for β > 2, which still does not match the linear dependence as β → 2.
In contrast, by considering a slightly different definition of smoothness, we provide below a unified
analysis leading to the dimension factor d^{2−2/β} for any β ≥ 2, under constrained and unconstrained
Θ; the improvement is both in the rate and in the proof technique.

1.2 Notation
Throughout the paper, we use the following notation. For any k ∈ ℕ we denote by [k] the
set of the first k positive integers. For any x ∈ ℝ^d we denote by x ↦ sign(x) the component-wise
sign function (defined at 0 as 1). We let ⟨·,·⟩ and ‖·‖ be the standard inner product and
Euclidean norm on ℝ^d, respectively. For every closed convex set Θ ⊂ ℝ^d and x ∈ ℝ^d we denote
by Proj_Θ(x) = argmin{‖z − x‖ : z ∈ Θ} the Euclidean projection of x onto Θ. For any
p ∈ [1, +∞] we let ‖·‖_p be the ℓ_p-norm in ℝ^d and we introduce the open ℓ_p-ball and ℓ_p-sphere,
respectively, as

B_p^d := {x ∈ ℝ^d : ‖x‖_p < 1}   and   ∂B_p^d := {x ∈ ℝ^d : ‖x‖_p = 1}.

For any β ≥ 2 we let ⌊β⌋ be the largest integer strictly less than β. Given a multi-index m =
(m_1, …, m_d) ∈ ℕ^d, we set m! := m_1! ⋯ m_d! and |m| := m_1 + ⋯ + m_d.

1.3 Structure of the paper


The paper is organized as follows. In Section 2, we recall some preliminaries and introduce the
classes of functions considered throughout. In Section 3, we present the two algorithms that are
studied in the paper. In Section 4, we present the upper bounds for both algorithms, and in each of
the considered function classes. In Section 5, we establish minimax lower bounds for the zero-order
optimization problem. The proofs of most of the results are presented in the appendix.

2. Preliminaries
For any multi-index m ∈ ℕ^d, any |m| times continuously differentiable function f : ℝ^d → ℝ, and
every h = (h_1, …, h_d)^⊤ ∈ ℝ^d we define

D^m f(x) := ∂^{|m|} f(x) / (∂^{m_1}x_1 ⋯ ∂^{m_d}x_d),   h^m := h_1^{m_1} ⋯ h_d^{m_d}.

For any k-linear form A : (ℝ^d)^k → ℝ, we define its norm as

‖A‖ := sup{|A[h_1, …, h_k]| : ‖h_j‖ ≤ 1, j ∈ [k]}.


Whenever h_1 = ⋯ = h_k = h we write A[h]^k to denote A[h, …, h]. Given a k times continuously
differentiable function f : ℝ^d → ℝ and x ∈ ℝ^d we denote by f^{(k)}(x) : (ℝ^d)^k → ℝ the following
k-linear form:

f^{(k)}(x)[h_1, …, h_k] = Σ_{|m_1|=⋯=|m_k|=1} D^{m_1+⋯+m_k} f(x) h_1^{m_1} ⋯ h_k^{m_k},   ∀h_1, …, h_k ∈ ℝ^d,

where m_1, …, m_k ∈ ℕ^d. We note that since f is k times continuously differentiable in ℝ^d,
f^{(k)}(x) is symmetric for all x ∈ ℝ^d.

2.1 Classes of functions


We start this section by stating all the relevant definitions and assumptions related to the target
function f . Following (Nemirovski, 2000, Section 1.3), we recall the definition of higher order
Hölder smoothness.
Definition 1 (Higher order smoothness) Fix some β ≥ 2 and L > 0. We denote by F_β(L) the set
of all functions f : ℝ^d → ℝ that are ℓ = ⌊β⌋ times continuously differentiable and satisfy, for all
x, z ∈ ℝ^d, the Hölder-type condition

‖f^{(ℓ)}(x) − f^{(ℓ)}(z)‖ ≤ L ‖x − z‖^{β−ℓ}.

Remark 2 Definition 1 of higher order smoothness was used by Bach and Perchet (2016) who
considered only integer β, while Polyak and Tsybakov (1990); Akhavan et al. (2020, 2021) use a
slightly different definition. Namely, they consider a class F′_β(L) defined as the set of all ℓ times
continuously differentiable functions f satisfying, for all x, z ∈ ℝ^d, the condition

|f(x) − T_z^ℓ(x)| ≤ L ‖x − z‖^β,

where T_z^ℓ(·) is the Taylor polynomial of order ℓ = ⌊β⌋ of f around z. If f ∈ F_β(L), then
f ∈ F′_β(L/ℓ!) (cf. Appendix A). Thus, the results obtained for the classes F′_β hold true for f ∈ F_β
modulo a change of constant. Moreover, if f is convex and β = 2 (Lipschitz continuous gradient) the
properties defining the two classes are equivalent to within constants, cf. (Nesterov, 2018, Theorem
2.1.5).
Since we study the minimization of highly smooth functions, in what follows, we will always
assume that f belongs to Fβ (L) for some β ≥ 2 and L > 0. We additionally require that f ∈ F2 (L̄)
for some L̄ > 0, that is, the gradient of f is Lipschitz continuous.
Assumption A The function f ∈ Fβ (L) ∩ F2 (L̄) for some β ≥ 2 and L, L̄ > 0.
We will start our analysis by providing rates of convergence to a stationary point of f under
Assumption A. The first additional assumption that we consider is the Polyak-Łojasiewicz condition,
which is also referred to as α-gradient dominance. This condition became rather popular since it
leads to linear convergence of the gradient descent algorithm without convexity as shown by Polyak
(1963) and further discussed by Karimi et al. (2016).


Definition 3 (α-gradient dominance) Let α > 0. A function f : ℝ^d → ℝ is called α-gradient
dominant on ℝ^d if f is differentiable on ℝ^d and satisfies the Polyak-Łojasiewicz inequality,

2α(f(x) − f⋆) ≤ ‖∇f(x)‖²,   ∀x ∈ ℝ^d.   (1)

Finally, we consider the second additional condition, which is α-strong convexity.

Definition 4 (α-strong convexity) Let α > 0. A function f : ℝ^d → ℝ is called α-strongly convex
on ℝ^d if it is differentiable on ℝ^d and satisfies

f(x) ≥ f(x′) + ⟨∇f(x′), x − x′⟩ + (α/2)‖x − x′‖²,   ∀x, x′ ∈ ℝ^d.

We recall that α-strong convexity implies (1) (see, e.g., Nesterov, 2018, Theorem 2.1.10), and thus
it is a more restrictive property than α-gradient dominance.
An important example of a family of functions satisfying the α-gradient dominance condition is given
by composing strongly convex functions with a linear transformation. Let n ∈ ℕ, A ∈ ℝ^{n×d} and
define

F(A) = {f : f(x) = g(Ax), g is α-strongly convex}.

Note that if A^⊤A is not invertible then the functions in F(A) are not necessarily strongly convex.
However, it can be shown that any f ∈ F(A) is an αγ-gradient dominant function, where γ is the
smallest non-zero singular value of A (see, e.g., Karimi et al., 2016). Alternatively, we can consider
the following family of functions

F′(A) = {f : f(x) = g(Ax), g ∈ C²(ℝ^d), g is strictly convex},

which is a set of α-gradient dominant functions on any compact subset of ℝ^d, for some α > 0. A
popular example of such a function appearing in machine learning applications is the logistic loss
defined as g(Ax) = Σ_{i=1}^n log(1 + exp(a_i^⊤ x)), where for 1 ≤ i ≤ n, a_i is the i-th row of A, and
x ∈ ℝ^d. For this and more examples, see e.g. (Garrigos et al., 2023) and references therein.
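This composition example is easy to verify numerically. In the sketch below (our illustration, not code from the paper), g(y) = ½‖y − b‖² is 1-strongly convex, A is rank-deficient, and inequality (1) holds with constant equal to the squared smallest non-zero singular value of A, which is what the general statement specializes to in this quadratic case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustration (ours): g(y) = 0.5 * ||y - b||^2 is 1-strongly convex, and
# f(x) = g(Ax) with a rank-deficient A is NOT strongly convex, yet it
# satisfies the Polyak-Lojasiewicz inequality (1).
A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],   # second row is twice the first: rank(A) = 2
              [0.0, 0.0, 3.0]])
x0 = rng.standard_normal(3)
b = A @ x0                        # b in range(A), so min_x f(x) = 0

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
f_star = 0.0

# For this quadratic case the PL constant is the squared smallest
# non-zero singular value of A (here 3^2 = 9).
sv = np.linalg.svd(A, compute_uv=False)
gamma = min(s for s in sv if s > 1e-10) ** 2

pl_holds = all(
    2 * gamma * (f(x) - f_star) <= np.dot(grad_f(x), grad_f(x)) + 1e-9
    for x in rng.standard_normal((100, 3))
)
```

Here the inequality holds exactly (not just up to numerics) because Ax − b always lies in the range of A.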
In what follows, we consider three different scenarios: (i) the case of only a smoothness assumption
on f; (ii) smoothness and α-gradient dominance assumptions; (iii) smoothness and α-strong
convexity assumptions. Let x̃ be an output of any algorithm. For the first scenario, we obtain a
stationary point guarantee, that is, a bound on E[‖∇f(x̃)‖²]. For the second and third scenarios,
we provide bounds for the optimization error E[f(x̃) − f⋆].

Remark 5 Note that under α-strong convexity, using the fact that ∇f(x⋆) = 0, as well as under
α-gradient dominance (see, e.g., Karimi et al., 2016, Appendix A), for any x ∈ ℝ^d we have

f(x) − f⋆ ≥ (α/2)‖x − x∗‖²,   (2)

where x∗ is the Euclidean projection of x onto the solution set argmin_{x∈ℝ^d} f(x) of the considered
optimization problem, which is a singleton in the case of strong convexity. Thus, our upper bounds
on E[f(x̃) − f⋆] obtained under α-strong convexity or α-gradient dominance immediately imply
upper bounds for E[‖x̃ − x∗‖²] with an extra factor of 2/α.


2.2 Classical stochastic optimization versus our setting


The classical zero-order stochastic optimization (CZSO) setting considered by Nemirovsky and
Yudin (1983); Nesterov (2011); Ghadimi and Lan (2013); Duchi et al. (2015); Nesterov and Spokoiny
(2017); Balasubramanian and Ghadimi (2021); Rando et al. (2022) among others assumes that
there is a function F : Rd × R → R such that our target function is its expectation over the second
argument,
f (x) = E[F (x, ξ)],

where ξ is a random variable. In order to find a minimizer of f , at each step t of the algorithm
one makes two queries and gets outputs of the form (F (xt , ξt ), F (xt + ht ζ t , ξt )), t = 1, 2, . . . ,
where ξt ’s are independent identically distributed (iid) realizations of random variable ξ, and xt ,
xt + ht ζ t are query points at step t. Here, ht > 0 is a perturbation parameter and ζ t is a random or
deterministic perturbation. We emphasize two features of this setting:

(a) the two queries are obtained with the same random variable ξt ;

(b) the random variables ξ_t are iid over t.¹

Both (a) and (b) are not assumed in our setting. On the other hand, we assume additive noise
structure. That is, at step t, we can only observe the values F (z t , ξt ) = f (z t ) + ξt for any choice
of query points z t depending only on the observations at the previous steps, but we do not assume
that two queries are obtained with the same noise ξt or are iid. We deal with almost adversarial
noise, see Assumption B below. In particular, we do not assume that the noise is zero-mean. Thus,
in general, we can have E[F(x, ξ_t)] ≠ f(x).
In the CZSO setting, the values F (xt , ξt ) and F (xt + ht ζ t , ξt ) are used to obtain gradient
approximations involving the divided differences (F (xt + ht ζ t , ξt ) − F (xt , ξt ))/ht . A popular
choice is the gradient estimator with Gaussian randomization suggested by Nesterov (2011):

g_t^G := (1/h_t)(F(x_t + h_t ζ_t, ξ_t) − F(x_t, ξ_t)) ζ_t,   with ζ_t ∼ N(0, I_d),   (3)

where N(0, I_d) denotes the standard Gaussian distribution in ℝ^d. In the case of additive noise, the
divided differences are equal to (f (xt + ht ζ t ) − f (xt ))/ht , that is, the analysis of these algorithms
reduces to that of noiseless (deterministic) optimization setting. When ξt ’s are not additive, the
assumptions that are often made in the literature on CZSO are such that the rates of convergence are
the same as in the additive noise case, which is equivalent to noiseless case due to the above remark.
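The reduction to the noiseless case is visible in code: when an additive noise is shared by the two queries, the divided difference in (3) cancels it exactly. The sketch below (our illustration; the test function and all constants are arbitrary choices) implements the Gaussian estimator and checks both the cancellation and that averaging over fresh randomizations recovers the gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
f = lambda x: np.sin(x).sum()      # arbitrary smooth test function (our choice)

def gaussian_estimator(x, h, zeta, xi=0.0):
    # Both queries F(., xi) = f(.) + xi share the SAME realization xi,
    # as in feature (a) of the classical setting: it cancels in the difference.
    y0 = f(x) + xi
    y1 = f(x + h * zeta) + xi
    return ((y1 - y0) / h) * zeta  # estimator (3)

x = rng.standard_normal(d)
zeta = rng.standard_normal(d)
g_noiseless = gaussian_estimator(x, 1e-3, zeta)
g_noisy = gaussian_estimator(x, 1e-3, zeta, xi=5.0)   # identical result

# Averaging over fresh zeta ~ N(0, I_d) recovers grad f(x) = cos(x) up to O(h).
zetas = rng.standard_normal((50_000, d))
diffs = (np.sin(x + 1e-3 * zetas).sum(axis=1) - f(x)) / 1e-3
g_mean = (diffs[:, None] * zetas).mean(axis=0)
```

This is precisely why, with the same-noise feedback, the analysis reduces to the deterministic setting.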
We can summarize this discussion as follows:

• the results obtained in the literature on CZSO, as well as some tools (e.g., averaging in
Algorithm 1 of Balasubramanian and Ghadimi, 2021), do not apply in our framework;

1. Some papers, for example, Ghadimi and Lan (2013); Gasnikov et al. (2016) relax this assumption.


• our upper bounds on the expected optimization error in the case of no noise (σ = 0) imply
identical bounds in the CZSO setting when the noise is additive. Under some assumptions
made in the CZSO literature (e.g., under Assumption B of Duchi et al., 2015), these bounds
for σ = 0 also extend to the general CZSO setting, with possible changes only in constants
and not in the rates. In other cases the rates in CZSO setting can only be slower than ours (e.g.,
under Assumptions A1b, A2 of Ghadimi and Lan, 2013).

3. Algorithms
Given a closed convex set Θ ⊆ ℝ^d, we consider the following optimization scheme

x_1 ∈ Θ   and   x_{t+1} = Proj_Θ(x_t − η_t g_t),   t ≥ 1,   (4)

where g_t is an update direction, approximating the gradient direction ∇f(x_t), and η_t > 0 is a
step-size. Allowing one to perform two function evaluations per step, we consider two gradient
estimators g_t which are based on different randomization schemes. They both employ a smoothing
kernel K : [−1, 1] → ℝ which we assume to satisfy, for β ≥ 2 and ℓ = ⌊β⌋, the conditions

∫ K(r) dr = 0,   ∫ rK(r) dr = 1,   ∫ r^j K(r) dr = 0, j = 2, …, ℓ,   κ_β := ∫ |r|^β |K(r)| dr < ∞.   (5)

In (Polyak and Tsybakov, 1990) it was suggested to construct such kernels employing Legendre
polynomials, in which case κ_β ≤ 2√(2β), cf. (Bach and Perchet, 2016, Appendix A.3).
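For 2 ≤ β ≤ 3 (so that ℓ ≤ 2), a simple odd kernel already satisfies these moment conditions; K(r) = 3r/2 below is our illustrative choice, not the paper's general Legendre construction. The conditions in (5) can be checked by numerical quadrature:

```python
import numpy as np

def integral(y, r):
    # composite trapezoidal rule (avoids the numpy trapz/trapezoid rename)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(r)))

# K(r) = 3r/2 on [-1, 1] is odd, so every even-moment condition in (5)
# holds by symmetry; this choice is valid only for 2 <= beta <= 3 (l <= 2).
r = np.linspace(-1.0, 1.0, 20_001)
K = 1.5 * r

i0 = integral(K, r)                # condition: = 0
i1 = integral(r * K, r)            # condition: = 1
i2 = integral(r**2 * K, r)         # condition: = 0 (j = 2)

beta = 2.0
kappa_beta = integral(np.abs(r)**beta * np.abs(K), r)  # = 3/4 for beta = 2
kappa = integral(K**2, r)                              # = 3/2, used in Lemma 7
```

For β > 3 this kernel fails the condition at j = 3, which is where the higher-degree Legendre construction becomes necessary.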
We are now in a position to introduce the two estimators. Similarly to earlier works dealing
with ℓ2-randomized methods (see e.g., Nemirovsky and Yudin, 1983; Flaxman et al., 2005; Bach
and Perchet, 2016; Akhavan et al., 2020) we use gradient estimators based on a result which is
sometimes referred to as Stokes' theorem. A general form of this result, not restricted to the ℓ2
geometry, can be found in (Akhavan et al., 2022, Appendix A).

Gradient estimator based on ℓ2 randomization. At time t ≥ 1, let ζ_t^◦ be distributed uniformly
on the ℓ2-sphere ∂B_2^d, let r_t be uniformly distributed on [−1, 1], and h_t > 0. Query two points:

y_t = f(x_t + h_t r_t ζ_t^◦) + ξ_t   and   y_t′ = f(x_t − h_t r_t ζ_t^◦) + ξ_t′,

where ξ_t, ξ_t′ are noises. Using the above feedback, define the gradient estimator as

(ℓ2 randomization)   g_t^◦ := (d/(2h_t)) (y_t − y_t′) ζ_t^◦ K(r_t).   (6)

We use the superscript ◦ to emphasize the fact that g_t^◦ is based on the ℓ2 randomization.
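As a concrete sketch, one draw of the ℓ2-randomized estimator (6) can be written as follows. The kernel K(r) = 3r/2 and the quadratic test function are our own illustrative choices, and the two noise arguments are kept explicit to match the two-query feedback.

```python
import numpy as np

def K(r):
    return 1.5 * r   # illustrative kernel satisfying (5) for 2 <= beta <= 3

def l2_gradient_estimator(f, x, h, rng, xi=0.0, xi_prime=0.0):
    d = x.size
    zeta = rng.standard_normal(d)
    zeta /= np.linalg.norm(zeta)              # uniform on the l2-sphere
    r = rng.uniform(-1.0, 1.0)
    y = f(x + h * r * zeta) + xi              # first noisy query
    y_prime = f(x - h * r * zeta) + xi_prime  # second noisy query
    return d / (2.0 * h) * (y - y_prime) * zeta * K(r)   # estimator (6)

f = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, -2.0, 0.5])
g = l2_gradient_estimator(f, x, h=0.01, rng=np.random.default_rng(2))

# A noise common to the two queries cancels, exactly as discussed in Section 2.2.
rng_a, rng_b = np.random.default_rng(7), np.random.default_rng(7)
g_clean = l2_gradient_estimator(f, x, 0.01, rng_a)
g_shift = l2_gradient_estimator(f, x, 0.01, rng_b, xi=3.0, xi_prime=3.0)
```

Normalizing a standard Gaussian vector is the usual way to sample uniformly from ∂B_2^d.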
Gradient estimator based on ℓ1 randomization. At time t ≥ 1, let ζ_t^⋄ be distributed uniformly
on the ℓ1-sphere ∂B_1^d, let r_t be uniformly distributed on [−1, 1], and h_t > 0. Query two points:

y_t = f(x_t + h_t r_t ζ_t^⋄) + ξ_t   and   y_t′ = f(x_t − h_t r_t ζ_t^⋄) + ξ_t′.

Using the above feedback, define the gradient estimator as

(ℓ1 randomization)   g_t^⋄ := (d/(2h_t)) (y_t − y_t′) sign(ζ_t^⋄) K(r_t).   (7)

We use the superscript ⋄, reminiscent of the form of the ℓ1-sphere, in order to emphasize the fact
that g_t^⋄ is based on the ℓ1 randomization. The idea of using an ℓ1 randomization (different from
(7)) was probably first invoked by Gasnikov et al. (2016). We refer to Akhavan et al. (2022), who
highlighted the potential computational and memory gains of another ℓ1 randomization gradient
estimator compared to its ℓ2 counterpart, as well as its advantages in theoretical guarantees. The
estimator of Akhavan et al. (2022) differs from (7) as it does not involve the kernel K, but the same
computational and memory advantages remain true for the estimator (7).
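The estimator (7) needs uniform samples from the ℓ1-sphere. A standard recipe (our sketch, not taken from the paper) normalizes exponential random variables, which gives a uniform point on the simplex, and attaches independent random signs; for p ∈ {1, 2, ∞} the resulting cone measure on ∂B_p^d coincides with the surface measure. With the same illustrative kernel K(r) = 3r/2:

```python
import numpy as np

rng = np.random.default_rng(3)

def K(r):
    return 1.5 * r   # illustrative kernel satisfying (5) for 2 <= beta <= 3

def uniform_l1_sphere(d, rng):
    # Normalized exponentials are uniform on the simplex; independent random
    # signs spread the point over all 2^d faces of the l1-sphere.
    e = rng.exponential(scale=1.0, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * e / e.sum()

def l1_gradient_estimator(f, x, h, rng, xi=0.0, xi_prime=0.0):
    d = x.size
    zeta = uniform_l1_sphere(d, rng)
    r = rng.uniform(-1.0, 1.0)
    y = f(x + h * r * zeta) + xi
    y_prime = f(x - h * r * zeta) + xi_prime
    # estimator (7): note sign(zeta) in place of zeta itself
    return d / (2.0 * h) * (y - y_prime) * np.sign(zeta) * K(r)

f = lambda x: 0.5 * np.dot(x, x)
zeta = uniform_l1_sphere(5, rng)
g = l1_gradient_estimator(f, np.ones(5), h=0.01, rng=rng)
```

The update direction sign(ζ_t^⋄) has entries in {−1, 1}, which is the source of the memory and computational savings mentioned above.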
Throughout the paper, we impose the following assumption on the noises ξ_t, ξ_t′ and on the
random variables that we generate in the estimators (6) and (7).

Assumption B For all t ∈ {1, …, T}, it holds that:

(i) the random variables ξ_t and ξ_t′ are independent from ζ_t^◦ (resp. ζ_t^⋄) and from r_t conditionally
on x_t, and the random variables ζ_t^◦ (resp. ζ_t^⋄) and r_t are independent;

(ii) E[ξ_t²] ≤ σ² and E[(ξ_t′)²] ≤ σ², where σ ≥ 0.

Let us emphasize that we do not assume ξ_t and ξ_t′ to have zero mean. Moreover, they can be
non-random, and no independence between noises on different time steps is required, so that the
setting can be considered as almost adversarial. Nevertheless, the first part of Assumption B does
not permit a completely adversarial setup. Indeed, the oracle is not allowed to choose the noise
variable depending on the current randomization, i.e., on ζ_t^◦ (resp. ζ_t^⋄) and r_t. However, Assumption B
encompasses the following protocol: at each round, the oracle generates a pair of noises (ξ_t, ξ_t′) with
second moments bounded by σ², possibly with full knowledge of the algorithm employed
by the learner, the previous actions, and the past information received by the learner.
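Combining the projection step, the estimator (6) and the kernel, scheme (4) can be sketched as below. All constants (step size, perturbation parameter, horizon, the ball Θ and the quadratic objective) are our own illustrative choices for the noiseless case σ = 0; in the noisy case, η_t and h_t would be taken decreasing, as in Section 4. For this quadratic and this step size, the squared distance to the minimizer never increases along the iterations, so the loop converges pathwise.

```python
import numpy as np

rng = np.random.default_rng(4)

def K(r):
    return 1.5 * r          # illustrative kernel satisfying (5) for 2 <= beta <= 3

def proj_ball(x, radius=10.0):
    # Euclidean projection onto Theta = {x : ||x|| <= radius}
    n = np.linalg.norm(x)
    return x if n <= radius else radius * x / n

f = lambda x: 0.5 * np.dot(x, x)    # f* = 0, minimizer x* = 0

d, h, eta, T = 5, 1e-2, 0.1, 2000   # noiseless case: constant eta and h suffice
x = np.full(d, 3.0)
f0 = f(x)

for _ in range(T):
    zeta = rng.standard_normal(d)
    zeta /= np.linalg.norm(zeta)    # uniform on the l2-sphere
    r = rng.uniform(-1.0, 1.0)
    y = f(x + h * r * zeta)         # sigma = 0: noiseless queries
    y_prime = f(x - h * r * zeta)
    g = d / (2.0 * h) * (y - y_prime) * zeta * K(r)
    x = proj_ball(x - eta * g)      # update (4)

fT = f(x)
```

The projection is inactive here since the iterates stay inside Θ, but it shows where the constrained case plugs in.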
In the next two subsections, we study the bias and variance of the two estimators. As we shall
see, the ℓ1 randomization can be more advantageous in the noiseless case than its ℓ2 counterpart (cf.
Remark 11).

3.1 Bias and variance of `2 randomization


The next results allow us to control the bias and the second moment of the gradient estimators
g_1^◦, …, g_T^◦, and they play a crucial role in our analysis.

Lemma 6 (Bias of ℓ2 randomization) Let Assumption B be fulfilled. Suppose that f ∈ F_β(L) for
some β ≥ 2 and L > 0. Let g_t^◦ be defined in (6) at time t ≥ 1. Let ℓ = ⌊β⌋. Then,

‖E[g_t^◦ | x_t] − ∇f(x_t)‖ ≤ κ_β · (L/(ℓ − 1)!) · (d/(d + β − 1)) · h_t^{β−1}.   (8)


Intuitively, the smaller h_t is, the more accurately g_t^◦ estimates the gradient. A result analogous to
Lemma 6 with a bigger constant was claimed in (Bach and Perchet, 2016, Lemma 2). The proof
of Lemma 6 is presented in the appendix. It relies on the fact that g_t^◦ is an unbiased estimator of
some surrogate function, which is strongly related to the original f. The factor in front of h_t^{β−1}
in Lemma 6 is O(1) as a function of d. It should be noted that for β > 2 the bounds on the bias
obtained in Akhavan et al. (2020) and Novitskii and Gasnikov (2022), where the factors scale as
O(d) and O(√d), respectively, cannot be directly compared to Lemma 6. This is due to the fact
that those bounds are proved under a different notion of smoothness, cf. Remark 2. Nevertheless,
if f is convex and β = 2 both notions of smoothness coincide, and Lemma 6 improves upon the
bounds in (Akhavan et al., 2020) and (Novitskii and Gasnikov, 2022) by factors of order d and √d,
respectively.

Lemma 7 (Variance of ℓ2 randomization) Let Assumption B hold and f ∈ F_2(L̄) for some L̄ > 0.
Then, for any d ≥ 2,

E[‖g_t^◦‖²] ≤ (d²κ/(d − 1)) E[(‖∇f(x_t)‖ + L̄h_t)²] + d²σ²κ/h_t²,

where κ = ∫_{−1}^{1} K²(r) dr.

The result of Lemma 7 can be simplified as

E[‖g_t^◦‖²] ≤ 4dκ E[‖∇f(x_t)‖²] + 4dκL̄²h_t² + d²σ²κ/h_t²,   d ≥ 2.   (9)
Let us provide some remarks about this inequality. First, the leading term of order d²h_t^{−2} in (9) is
the same as in (Akhavan et al., 2020, Lemma 2.4) and in (Bach and Perchet, 2016, Appendix C1,
beginning of the proof of Proposition 3), but we obtain a better constant. The main improvement
w.r.t. both works lies in the lower order term. Indeed, unlike in those papers, the term h_t² is
multiplied by d instead of d². This improvement is crucial for the condition T ≥ d^{2−β/2}, under
which we obtain our main results below. In particular, there is no condition on the horizon T
whenever β ≥ 4. On the contrary, we would need T ≥ d³ if we used the previously known
versions of the variance bounds (Bach and Perchet, 2016; Akhavan et al., 2020). The proof of
Lemma 7 presented below relies on the Poincaré inequality for the uniform distribution on ∂B_2^d.
Proof of Lemma 7 For simplicity, we drop the subscript t from all the quantities. Using Assumption B
and the fact that

E[f(x + hrζ^◦) − f(x − hrζ^◦) | x, r] = 0   (10)

we obtain

E[‖g^◦‖²] = (d²/(4h²)) E[((f(x + hrζ^◦) − f(x − hrζ^◦)) + (ξ − ξ′))² K²(r)]
          ≤ (d²/(4h²)) (E[(f(x + hrζ^◦) − f(x − hrζ^◦))² K²(r)] + 4κσ²).   (11)

Since f ∈ F_2(L̄) and (10) holds, using the Wirtinger-Poincaré inequality (see, e.g., Osserman,
1978, (3.1)) we get

E[(f(x + hrζ^◦) − f(x − hrζ^◦))² | x, r] ≤ (h²/(d − 1)) E[‖∇f(x + hrζ^◦) + ∇f(x − hrζ^◦)‖² | x, r].   (12)

The fact that f ∈ F_2(L̄) and the triangle inequality imply that

E[‖∇f(x + hrζ^◦) + ∇f(x − hrζ^◦)‖² | x, r] ≤ 4(‖∇f(x)‖ + L̄h)².   (13)

Combining (11) – (13) proves the lemma.

Note that Eqs. (12)–(13) imply, in particular, Lemma 9 of Shamir (2017). Our method based
on Poincaré's inequality yields explicit constants in the bound. In this aspect, we improve
upon Shamir (2017), where a concentration argument leads only to non-specified constants.

3.2 Bias and variance of $\ell_1$ randomization

This section analyses the gradient estimator based on the $\ell_1$ randomization. The results below display
a very different bias and variance behavior compared to the $\ell_2$ randomization.

Lemma 8 (Bias of $\ell_1$ randomization) Let Assumption B be fulfilled and $f \in \mathcal F_\beta(L)$ for some
$\beta \ge 2$ and $L > 0$. Let $g^{\diamond}_t$ be defined in (7) at time $t \ge 1$. Let $\ell = \lfloor\beta\rfloor$. Then,
\[
\|\mathbf{E}[g^{\diamond}_t \mid x_t] - \nabla f(x_t)\| \le Lc_\beta\kappa_\beta\,\ell^{\beta-\ell}\,d^{\frac{1-\beta}{2}}h_t^{\beta-1}. \tag{14}
\]
When $2 \le \beta < 3$, then $c_\beta = 2^{\frac{\beta-1}{2}}$, and if $\beta \ge 3$ we have $c_\beta = 1$.

Notice that Lemma 8 gives the same dependence on the discretization parameter ht as Lemma 6.
However, unlike the bias bound in Lemma 6, which is dimension independent, the result of Lemma 8
depends on the dimension in a favorable way. In particular, the bias is controlled by a decreasing
function of the dimension and this dependence becomes more and more favorable for smoother
functions. Yet, the price for such a favorable control of the bias is an inflated bound on the variance,
which is established below.

Lemma 9 (Variance of $\ell_1$ randomization) Let Assumption B be fulfilled and $f \in \mathcal F_2(\bar L)$ for some
$\bar L > 0$. Then, for any $d \ge 3$,
\[
\mathbf{E}[\|g^{\diamond}_t\|^2] \le \frac{8d^3\kappa}{(d+1)(d-2)}\,\mathbf{E}\bigg[\Big(\|\nabla f(x_t)\| + \sqrt{\tfrac{2}{d}}\,\bar Lh_t\Big)^2\bigg] + \frac{d^3\sigma^2\kappa}{h_t^2},
\]
where $\kappa = \int_{-1}^1 K^2(r)\,\mathrm{d}r$.

Gradient-Free Optimization of Highly Smooth Functions

Combined with the facts that $a^2/((a-2)(a+1)) \le 2.25$ for all $a \ge 3$ and $(a+b)^2 \le 2a^2 + 2b^2$,
the inequality of Lemma 9 can be further simplified as
\[
\begin{aligned}
\mathbf{E}[\|g^{\diamond}_t\|^2] &\le \frac{16d^3\kappa}{(d+1)(d-2)}\,\mathbf{E}\big[\|\nabla f(x_t)\|^2\big] + \frac{32d^2\kappa\bar L^2h_t^2}{(d+1)(d-2)} + \frac{d^3\sigma^2\kappa}{h_t^2} \\
&\le 36d\kappa\,\mathbf{E}\big[\|\nabla f(x_t)\|^2\big] + 72\kappa\bar L^2h_t^2 + \frac{d^3\sigma^2\kappa}{h_t^2}, \qquad d \ge 3.
\end{aligned} \tag{15}
\]
The proof of Lemma 9 is based on the following lemma, which gives a version of Poincaré's
inequality improving upon the one previously derived in (Akhavan et al., 2022, Lemma 3). We provide
sharper constants and an easier-to-use expression.
Lemma 10 Let $d \ge 3$. Assume that $G : \mathbb R^d \to \mathbb R$ is a continuously differentiable function, and $\zeta^{\diamond}$ is
distributed uniformly on $\partial B_1^d$. Then
\[
\mathrm{Var}(G(\zeta^{\diamond})) \le \frac{4}{d-2}\,\mathbf{E}\big[\|\nabla G(\zeta^{\diamond})\|^2\|\zeta^{\diamond}\|^2\big].
\]
Furthermore, if $G : \mathbb R^d \to \mathbb R$ is an $L$-Lipschitz function w.r.t. the $\ell_2$-norm then
\[
\mathrm{Var}(G(\zeta^{\diamond})) \le \frac{8L^2}{(d+1)(d-2)} \le \frac{18L^2}{d^2}.
\]
The proof of Lemma 10 is given in the Appendix.
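The Lipschitz part of Lemma 10 is easy to probe numerically. The sketch below samples $\zeta^{\diamond}$ uniformly from $\partial B_1^d$ using the classical representation by signed exponential random variables normalized by their $\ell_1$ norm (this sampling construction is our own illustrative assumption, in the spirit of Barthe et al. (2005), not a step of the proof) and checks the bound for the $1$-Lipschitz function $G(u) = u_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 200_000

# Uniform samples on the l1 sphere: signed exponentials normalized by their l1 norm.
expo = rng.exponential(size=(n, d))
signs = rng.choice([-1.0, 1.0], size=(n, d))
zeta = signs * expo / expo.sum(axis=1, keepdims=True)
assert np.allclose(np.abs(zeta).sum(axis=1), 1.0)   # every point lies on the l1 sphere

# G(u) = u[0] is 1-Lipschitz w.r.t. the l2 norm (L = 1), so Lemma 10 gives
# Var(G(zeta)) <= 8 L^2 / ((d + 1)(d - 2)) <= 18 L^2 / d^2.
var_emp = zeta[:, 0].var()
assert var_emp <= 8.0 / ((d + 1) * (d - 2))
assert 8.0 / ((d + 1) * (d - 2)) <= 18.0 / d**2
```

The empirical variance (about $0.018$ for $d = 10$) sits well below the bound $8/((d+1)(d-2)) \approx 0.09$, as the lemma predicts.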
Proof of Lemma 9 For simplicity we drop the subscript $t$ from all the quantities. Similarly to the
proof of Lemma 7, using Assumption B we deduce that
\[
\mathbf{E}[\|g^{\diamond}\|^2] \le \frac{d^3}{4h^2}\Big(\mathbf{E}\big[(f(x + hr\zeta^{\diamond}) - f(x - hr\zeta^{\diamond}))^2K^2(r)\big] + 4\sigma^2\kappa\Big). \tag{16}
\]
Consider $G : \mathbb R^d \to \mathbb R$ defined for all $u \in \mathbb R^d$ as $G(u) = f(x + hru) - f(x - hru)$. Using the
fact that $f \in \mathcal F_2(\bar L)$ we obtain for all $u \in \mathbb R^d$
\[
\|\nabla G(u)\|^2 \le 4h^2\big(\|\nabla f(x)\| + \bar Lh\|u\|\big)^2.
\]
Applying Lemma 10 to the function $G$ defined above we deduce that
\[
\mathbf{E}\big[(G(\zeta^{\diamond}))^2 \mid x, r\big] \le \frac{16h^2}{d-2}\,\mathbf{E}\Big[\big(\|\nabla f(x)\| + \bar Lh\|\zeta^{\diamond}\|\big)^2\|\zeta^{\diamond}\|^2\Big].
\]
Lemma 31 provided in the Appendix gives upper bounds on the expectations appearing in the above
inequality for all $d \ge 3$. Its application yields:
\[
\mathbf{E}\big[(f(x + hr\zeta^{\diamond}) - f(x - hr\zeta^{\diamond}))^2 \mid x, r\big] \le \frac{32h^2}{(d+1)(d-2)}\Big(\|\nabla f(x)\| + \sqrt{\tfrac{2}{d}}\,\bar Lh\Big)^2.
\]


We conclude the proof by combining this bound with (16).

Modulo absolute constants, the leading term w.r.t. $h_t$ in Lemma 9 is the same as for the $\ell_2$ randomization
in Lemma 7. However, for the $\ell_2$ randomization this term involves only a quadratic
dependence on the dimension $d$, while in the case of $\ell_1$ randomization the dependence is cubic.
Looking at the $h_t^2$ term, which essentially matters only when $\sigma = 0$, we note that the factor in front of
this term in Lemma 9 is bounded as the dimension grows. In contrast, the corresponding term in
Lemma 7 depends linearly on the dimension. We summarize these observations in the following
remark focusing on the noiseless case.

Remark 11 (On the advantage of $\ell_1$ randomization) In the noiseless case ($\sigma = 0$) both bias and
variance under the $\ell_1$ randomization are strictly smaller than under the $\ell_2$ randomization. Indeed,
if $\sigma = 0$,
\[
\begin{cases}
\|\mathbf{E}[g^{\circ}_t \mid x_t] - \nabla f(x_t)\| \lesssim h_t^{\beta-1} \\
\mathbf{E}[\|g^{\circ}_t\|^2] \lesssim d\,\mathbf{E}[\|\nabla f(x_t)\|^2] + (\sqrt d\,h_t)^2
\end{cases}
\quad\text{and}\quad
\begin{cases}
\|\mathbf{E}[g^{\diamond}_t \mid x_t] - \nabla f(x_t)\| \lesssim \Big(\frac{h_t}{\sqrt d}\Big)^{\beta-1} \\
\mathbf{E}[\|g^{\diamond}_t\|^2] \lesssim d\,\mathbf{E}[\|\nabla f(x_t)\|^2] + h_t^2,
\end{cases}
\]
where the signs $\lesssim$ hide multiplicative constants that do not depend on $h_t$ and $d$. Notice that given
an $h_t$ for $\ell_2$ randomization, $\sqrt d\,h_t$ used with $\ell_1$ randomization gives the same order of magnitude
for both bias and variance. This is especially useful in floating-point arithmetic, where very
small values of $h_t$ (on the level of machine precision) can result in high rounding errors. Thus, in
the absence of noise, $\ell_1$ randomization can be seen as a more numerically stable alternative to the
$\ell_2$ randomization.
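The rescaling claim at the end of Remark 11 is a one-line computation: plugging the step $\sqrt d\,h_t$ into the $\ell_1$ bounds reproduces the $\ell_2$ orders exactly. A quick numeric check, with arbitrary illustrative values of $d$, $h_t$ and $\beta$ chosen by us:

```python
import numpy as np

d, h, beta = 50, 1e-3, 3.5
h_l1 = np.sqrt(d) * h                         # rescaled discretization step for l1

# Bias orders from Remark 11: h^{beta-1} for l2 vs (h/sqrt(d))^{beta-1} for l1.
bias_l2 = h ** (beta - 1)
bias_l1 = (h_l1 / np.sqrt(d)) ** (beta - 1)
assert np.isclose(bias_l1, bias_l2)

# Variance h-terms: (sqrt(d) h)^2 for l2 at step h vs h_l1^2 for l1 at step h_l1.
assert np.isclose(h_l1 ** 2, (np.sqrt(d) * h) ** 2)
```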

For comparison, the corresponding bounds for gradient estimators with Gaussian randomization
defined in (3) are proved for $\beta = 2$ and have the form (cf. Nesterov (2011) or (Ghadimi and Lan,
2013, Theorem 3.1)):
\[
\begin{cases}
\|\mathbf{E}[g^{\mathrm G}_t \mid x_t] - \nabla f(x_t)\| \lesssim d^{3/2}h_t \\
\mathbf{E}[\|g^{\mathrm G}_t\|^2] \lesssim d\,\mathbf{E}[\|\nabla f(x_t)\|^2] + d^3h_t^2,
\end{cases} \tag{17}
\]
where the signs $\lesssim$ hide multiplicative constants that do not depend on $h_t$ and $d$. Setting $\beta = 2$
in Remark 11 we see that the dependence on $h_t$ in (17) is of the same order as for $\ell_1$ and $\ell_2$
randomizations with $\beta = 2$, but the dimension factors are substantially bigger. Also, the bounds in
(17) for the Gaussian randomization are tight. Thus, the Gaussian randomization is less efficient
than its $\ell_1$ and $\ell_2$ counterparts in the noiseless setting.

4. Upper bounds
In this section, we present convergence guarantees for the two considered gradient estimators and
for three classes of objective functions f . Each of the following subsections is structured similarly:


first, we define the choice of ηt and ht involved in both algorithms and then, for each class of the
objective functions, we state the corresponding convergence guarantees.
Throughout this section, we assume that f ∈ F2 (L̄) ∩ Fβ (L) for some β ≥ 2. Under this
assumption, in Section 4.1 we establish a guarantee for the stationary point. In Section 4.2 we
additionally assume that f is α-gradient dominant and provide upper bounds on the optimization
error. In Section 4.3 we additionally assume that f is α-strongly convex and provide upper bounds
on the optimization error for both constrained and unconstrained cases. Unless stated otherwise,
the convergence guarantees presented in this section hold under the assumption that the number of
queries T is known before running the algorithms.
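For readers who want a runnable mental model of algorithm (4), here is a minimal sketch of a zero-order projected gradient method of the kind analyzed in this section: $x_{t+1} = \mathrm{Proj}_\Theta(x_t - \eta_t g_t)$ with a two-point $\ell_2$-randomized estimator. Everything below (the kernel $K(r) = 3r$, the ball projection, the step sizes) is a simplified stand-in chosen by us, not the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(2)

def project_ball(x, radius=1.0):
    """Euclidean projection onto Theta = {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

def zero_order_pgd(f, x0, T, eta, h, sigma=0.0):
    """x_{t+1} = Proj_Theta(x_t - eta(t) g_t), g_t an l2-randomized two-point estimate."""
    x = x0.astype(float)
    d = x.size
    for t in range(1, T + 1):
        zeta = rng.standard_normal(d)
        zeta /= np.linalg.norm(zeta)
        r = rng.uniform(-1.0, 1.0)
        K = 3.0 * r                                   # kernel with E[r K(r)] = 1
        y = f(x + h * r * zeta) + sigma * rng.standard_normal()
        y_prime = f(x - h * r * zeta) + sigma * rng.standard_normal()
        g = (d / (2.0 * h)) * (y - y_prime) * zeta * K
        x = project_ball(x - eta(t) * g)
    return x

# Noiseless strongly convex example: minimize 0.5 ||x - c||^2 over the unit ball.
c = np.array([0.3, -0.2, 0.1])
x_T = zero_order_pgd(lambda u: 0.5 * (u - c) @ (u - c), np.zeros(3),
                     T=5_000, eta=lambda t: 1.0 / (t + 10), h=1e-4)
assert np.linalg.norm(x_T - c) < 0.05
```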

4.1 Only smoothness assumptions


In this subsection, we only assume that the objective function f : Rd → R satisfies Assumption A.
In particular, since there is no guarantee of the existence of the minimizer, our goal is modest – we
only want to obtain a nearly stationary point.
The plan of our study is as follows. We first obtain guarantees for algorithm (4) with gradient
estimator g t satisfying some general assumption, and then concretize the results for the gradient
estimators (6) and (7). We use the following assumption.

Assumption C Assume that there exist two positive sequences $b_t, v_t : \mathbb N \to [0, \infty)$ and $V_1 \ge 0$
such that for all $t \ge 1$ it holds almost surely that
\[
\|\mathbf{E}[g_t \mid x_t] - \nabla f(x_t)\| \le b_t \qquad\text{and}\qquad \mathbf{E}[\|g_t\|^2] \le V_1\,\mathbf{E}[\|\nabla f(x_t)\|^2] + v_t.
\]

Note that Assumption C holds for the gradient estimators (6) and (7) with $b_t$, $v_t$ and $V_1$ specified in
Lemmas 6–9 (see also Assumption D and Table 1 below).
The results of this subsection will be stated for a randomly sampled point along the trajectory
of the algorithm. The distribution over the trajectory is chosen carefully, in order to guarantee the
desired convergence. The distribution that we are going to use is defined in the following lemma.

Lemma 12 Let $f \in \mathcal F_2(\bar L)$ for some $\bar L > 0$, $\Theta = \mathbb R^d$ and $f^\star > -\infty$. Let $x_t$ be defined by
algorithm (4) with $g_t$ satisfying Assumption C. Assume that $\eta_t$ in (4) is chosen such that $\bar L\eta_tV_1 < 1$.
Let $S$ be a random variable with values in $[T]$, which is independent from $x_1, \ldots, x_T, g_1, \ldots, g_T$
and distributed with the law
\[
\mathbf{P}(S = t) = \frac{\eta_t(1 - \bar L\eta_tV_1)}{\sum_{t=1}^T \eta_t(1 - \bar L\eta_tV_1)}, \qquad t \in [T].
\]
Then,
\[
\mathbf{E}[\|\nabla f(x_S)\|^2] \le \frac{2(\mathbf{E}[f(x_1)] - f^\star) + \sum_{t=1}^T\big(\eta_tb_t^2 + \bar L\eta_t^2v_t\big)}{\sum_{t=1}^T \eta_t(1 - \bar L\eta_tV_1)}.
\]


Lemma 12 is obtained by techniques similar to Ghadimi and Lan (2013). However, the paper
Ghadimi and Lan (2013) considers only a particular choice of g t defined via a Gaussian random-
ization, and a different setting (cf. the discussion in Section 2.2), under which vt does not increase
as the discretization parameter ht (µ in the notation of Ghadimi and Lan (2013)) decreases. In
our setting, this situation happens only when there is no noise (σ = 0), while in the noisy case vt
increases as ht tends to 0.
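Sampling the index $S$ of Lemma 12 is a direct computation from the weights $\eta_t(1 - \bar L\eta_tV_1)$. A sketch (the numeric values of $\bar L$, $V_1$ and the step schedule below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_S(etas, L_bar, V1, n_draws=1):
    """Sample S with P(S = t) proportional to eta_t (1 - L_bar eta_t V1)."""
    etas = np.asarray(etas, dtype=float)
    weights = etas * (1.0 - L_bar * etas * V1)
    assert np.all(weights > 0), "Lemma 12 requires L_bar * eta_t * V1 < 1"
    probs = weights / weights.sum()
    return rng.choice(len(etas), size=n_draws, p=probs) + 1   # 1-based, S in [T]

# With the constant step eta = (2 L_bar V1)^{-1}, all weights are equal and S is
# uniform on [T].
T, L_bar, V1 = 100, 1.0, 4.0
eta = 1.0 / (2.0 * L_bar * V1)
S = sample_S([eta] * T, L_bar, V1, n_draws=10)
assert np.all((1 <= S) & (S <= T))
```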
Note that the distribution of $S$ in Lemma 12 depends on the choice of $\eta_t$ and $V_1$. In the following
results, we are going to specify the exact values of $\eta_t$. We also provide the values of $V_1$ for the
gradient estimators (6) and (7). Regarding these two estimators, it will be convenient to use the
following instance of Assumption C.
Assumption D There exist positive numbers $b, V_1, V_2, V_3$ such that for all $t \ge 1$ the gradient
estimators $g_t$ satisfy almost surely the inequalities
\[
\|\mathbf{E}[g_t \mid x_t] - \nabla f(x_t)\| \le bLh_t^{\beta-1} \qquad\text{and}\qquad \mathbf{E}[\|g_t\|^2] \le V_1\,\mathbf{E}[\|\nabla f(x_t)\|^2] + V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}.
\]

It follows from Lemmas 6–9 that Assumption D holds for gradient estimators (6) and (7) with the
values that are indicated in Table 1. Note that the bounds for the variance in those lemmas do not
cover the case $d = 1$ for the $\ell_2$ randomization and $d = 1, 2$ for the $\ell_1$ randomization. Nevertheless,
it is straightforward to check that in these cases Assumption D remains valid with the $V_j$'s given in
Table 1.

Estimator                 $b$                                                          $V_1$        $V_2$       $V_3$
$\ell_2$ randomization    $\frac{\kappa_\beta}{(\ell-1)!}\cdot\frac{d}{d+\beta-1}$     $4d\kappa$   $4d\kappa$  $d^2\kappa$
$\ell_1$ randomization    $c_\beta\kappa_\beta\,\ell^{\beta-\ell}\,d^{\frac{1-\beta}{2}}$   $36d\kappa$  $72\kappa$  $d^3\kappa$

Table 1: Factors in the bounds for bias and variance of both gradient estimators, $\ell = \lfloor\beta\rfloor$, $d \ge 1$.

The next theorem requires a definition of algorithm-dependent parameters, which are needed as
an input to our algorithms. We set
\[
(\mathrm y, \mathrm h) =
\begin{cases}
\big((8\kappa\bar L)^{-1},\; d^{\frac{1}{2\beta-1}}\big) & \text{for } \ell_2 \text{ randomization}, \\[2pt]
\big((72\kappa\bar L)^{-1},\; d^{\frac{2\beta+1}{4\beta-2}}\big) & \text{for } \ell_1 \text{ randomization}.
\end{cases} \tag{18}
\]

Theorem 13 Let Assumptions A and B hold, and $\Theta = \mathbb R^d$. Let $x_t$ be defined by algorithm (4) with
gradient estimator (6) or (7), where the parameters $\eta_t$ and $h_t$ are set for $t = 1, \ldots, T$ as
\[
\eta_t = \min\Big\{\frac{\mathrm y}{d},\; d^{-\frac{2(\beta-1)}{2\beta-1}}T^{-\frac{\beta}{2\beta-1}}\Big\} \qquad\text{and}\qquad h_t = \mathrm h\,T^{-\frac{1}{2(2\beta-1)}},
\]
and the constants $\mathrm y$ and $\mathrm h$ are given in (18). Assume that $x_1$ is deterministic and $T \ge d^{\frac{1}{\beta}}$. Then, for
the random variable $S$ defined in Lemma 12, we have
\[
\mathbf{E}[\|\nabla f(x_S)\|^2] \le \big(A_1(f(x_1) - f^\star) + A_2\big)\Big(\frac{d^2}{T}\Big)^{\frac{\beta-1}{2\beta-1}},
\]
where the constants $A_1, A_2 > 0$ depend only on $\sigma, L, \bar L, \beta$, and on the choice of the gradient
estimator.
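The parameter choices of Theorem 13, together with (18), can be packaged as follows. This is a direct transcription of the displayed formulas; the numeric values of $d$, $T$, $\beta$, $\kappa$ and $\bar L$ below are placeholders of our own:

```python
def theorem13_schedule(d, T, beta, kappa, L_bar, randomization="l2"):
    """eta_t and h_t from Theorem 13, with (y, h) given by (18)."""
    if randomization == "l2":
        y, h = 1.0 / (8 * kappa * L_bar), d ** (1.0 / (2 * beta - 1))
    else:  # l1 randomization
        y, h = 1.0 / (72 * kappa * L_bar), d ** ((2 * beta + 1) / (4 * beta - 2))
    eta = min(y / d,
              d ** (-2 * (beta - 1) / (2 * beta - 1)) * T ** (-beta / (2 * beta - 1)))
    h_t = h * T ** (-1.0 / (2 * (2 * beta - 1)))      # constant in t
    return eta, h_t

# Illustrative values; with V1 = 4 d kappa (Table 1) the resulting step satisfies
# L_bar * eta * V1 < 1, as required by Lemma 12.
d, T, beta, kappa, L_bar = 10, 10_000, 3.0, 1.5, 1.0
eta, h_t = theorem13_schedule(d, T, beta, kappa, L_bar)
assert L_bar * eta * (4 * d * kappa) < 1
assert T >= d ** (1.0 / beta)                         # the condition of Theorem 13
```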

In the case $\sigma = 0$, the result of this theorem can be improved. As explained in Section 2.2,
this case is analogous to the CZSO setting, and it is enough to assume that $\beta = 2$ since higher
order smoothness does not lead to improvement in the main term of the rates. Due to Remark 11
(or Assumption D) one can set $h_t$ for both methods as small as one wishes, and thus sufficiently
small to make the sum over $t$ in the numerator of the inequality of Lemma 12 less than an absolute
constant. Then, choosing $\eta_t = (2V_1\bar L)^{-1}$ and recalling that, for both algorithms, $V_1$ scales as $d$ up
to a multiplicative constant (cf. Table 1), we get the following result.

Theorem 14 Let $f$ be a function belonging to $\mathcal F_2(\bar L)$ for some $\bar L > 0$, $\Theta = \mathbb R^d$, and let Assumptions
A and B hold with $\sigma = 0$. Let $x_t$ be defined by algorithm (4) with deterministic $x_1$, and gradient
estimators (6) or (7) for $\beta = 2$, where $\eta_t = (2V_1\bar L)^{-1}$ and $h_t$ is chosen sufficiently small. Then we
have
\[
\mathbf{E}[\|\nabla f(x_S)\|^2] \le A\big(f(x_1) - f^\star + 1\big)\frac{\bar Ld}{T},
\]
where $A > 0$ is an absolute constant depending only on the choice of the gradient estimator.

The rate $O(d/T)$ in Theorem 14 coincides with the rate derived in (Nesterov and Spokoiny,
2017, inequality (68)) for $\beta = 2$ under the classical zero-order stochastic optimization setting, where
the authors were using Gaussian rather than $\ell_1$ or $\ell_2$ randomization. In a setting with non-additive
noise, Ghadimi and Lan (2013) exhibit a slower rate of $O(\sqrt{d/T})$.

4.2 Smoothness and α-gradient dominance


We now provide the analysis of our algorithms under smoothness and α-gradient dominance (Polyak-
Łojasiewicz) conditions.

Theorem 15 Let $f$ be an $\alpha$-gradient dominant function, $\Theta = \mathbb R^d$, and let Assumptions A and B
hold, with $\sigma > 0$. Let $x_t$ be defined by algorithm (4) with $g_t$ satisfying Assumption D, deterministic
$x_1$ and
\[
\eta_t = \min\Big\{(2\bar LV_1)^{-1},\; \frac{4}{\alpha t}\Big\}, \qquad
h_t = \Big(\frac{4\bar L\sigma^2V_3}{b^2L^2\alpha}\Big)^{\frac{1}{2\beta}}\cdot
\begin{cases}
t^{-\frac{1}{2\beta}} & \text{if } \eta_t = \frac{4}{\alpha t}, \\[2pt]
T^{-\frac{1}{2\beta}} & \text{if } \eta_t = \frac{1}{2\bar LV_1}.
\end{cases}
\]
Then
\[
\mathbf{E}[f(x_T) - f^\star] \le A_1\frac{\bar LV_1}{\alpha T}(f(x_1) - f^\star)
+ \frac{A_2}{\alpha}\bigg(V_3\sigma^2\Big(\frac{V_3}{b^2L^2}\Big)^{-\frac{1}{\beta}} + V_2\bar L^2\sigma^2\Big(\frac{V_3}{b^2L^2}\Big)^{\frac{1}{\beta}}\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{2}{\beta}}\bigg)\Big(\frac{\alpha T}{\bar L\sigma^2}\Big)^{-\frac{\beta-1}{\beta}},
\]
where $A_1, A_2 > 0$ depend only on $\beta$.

Theorem 15 provides a general result for any gradient estimator that satisfies Assumption D. By
taking the values $V_j$ from Table 1 we immediately obtain the following corollary for our $\ell_1$- and
$\ell_2$-randomized gradient estimators.

Corollary 16 Let $f$ be an $\alpha$-gradient dominant function, $\Theta = \mathbb R^d$, and let Assumptions A and B hold,
with $\sigma > 0$. Let $x_t$ be defined by algorithm (4) with deterministic $x_1$ and gradient estimators (6)
or (7). Set the parameters $\eta_t$ and $h_t$ as in Theorem 15, where $b, V_1, V_2, V_3$ are given in Table 1 for
each gradient estimator, respectively. Then for any $T \ge d^{2-\frac{\beta}{2}}\frac{\sigma^2}{\alpha L^2}$ we have
\[
\mathbf{E}[f(x_T) - f^\star] \le A_1\frac{\bar Ld}{\alpha T}(f(x_1) - f^\star) + \Big(A_2 + A_3\frac{\bar L^2}{\sigma^2}\Big)\Big(\frac{\bar L\sigma^2d^2}{\alpha T}\Big)^{\frac{\beta-1}{\beta}}L^{\frac{2}{\beta}}\alpha^{-1},
\]
where $A_1, A_2, A_3 > 0$ depend only on $\beta$ and on the choice of the gradient estimator.
Note that here we consider $\sigma$ and $L$ as numerical constants. The condition $T \gtrsim d^{2-\beta/2}/\alpha$ mentioned
in Corollary 16 is satisfied in all reasonable cases since it is weaker than the condition $T \gtrsim d^2/\alpha$
guaranteeing non-triviality of the bounds.
Recall that, in the context of deterministic optimization with first order oracle, the $\alpha$-gradient
dominance allows one to obtain rates of convergence of the gradient descent algorithm which are
similar to the case of a strongly convex objective function with Lipschitz gradient (Polyak (1963);
Karimi et al. (2016)). A natural question is whether the same property holds in our setting of
stochastic optimization with zero-order oracle and higher order smoothness. Theorem 15 shows
that the rates are only inflated by a multiplicative factor $\mu^{(\beta-1)/\beta}$, where $\mu = \bar L/\alpha$, compared to the
$\alpha$-strongly convex case that will be considered in Section 4.3.
Consider now the case $\sigma = 0$, which is analogous to the CZSO setting as explained in Section
2.2. In this case, we assume that $\beta = 2$ since higher order smoothness does not lead to improvement
in the main term of the rates. We set the parameters $\eta_t, h_t$ as follows:
\[
\eta_t = \min\Big\{(2\bar LV_1)^{-1},\; \frac{4}{\alpha t}\Big\}, \qquad
h_t \le \bigg(T\,\frac{\bar L\vee1}{\alpha\wedge1}\Big(2b^2\bar L^2 + \frac{8\bar L^2V_2}{\alpha}\Big)\bigg)^{-\frac{1}{2}}. \tag{19}
\]


Theorem 17 Let $f$ be an $\alpha$-gradient dominant function belonging to $\mathcal F_2(\bar L)$ for some $\bar L > 0$,
$\Theta = \mathbb R^d$, and let Assumptions A and B hold with $\sigma = 0$. Let $x_t$ be defined by algorithm (4) with
deterministic $x_1$, and gradient estimators (6) or (7) for $\beta = 2$. Set the parameters $\eta_t$ and $h_t$ as in
(19). Then we have
\[
\mathbf{E}[f(x_T) - f^\star] \le A_1\big((f(x_1) - f^\star) + A_2\big)\frac{\bar Ld}{\alpha T},
\]
where $A_1, A_2 > 0$ are absolute constants depending only on the choice of the gradient estimator.
Note that, in the CZSO setting, Rando et al. (2022) proved the rate $O(T^{-1})$ for the optimization
error under $\alpha$-gradient dominance by using an $\ell_2$-randomized gradient estimator. However, unlike
Theorem 17, the bound obtained in that paper does not provide the dependence on the dimension $d$
and on the variables $\bar L, \alpha$.

4.3 Smoothness and strong convexity

In this subsection, we additionally assume that $f$ is a strongly convex function and denote by $x^\star$ its
unique minimizer. We provide a guarantee on the weighted average point $\hat x_T$ along the trajectory of
the algorithm, defined as
\[
\hat x_T = \frac{2}{T(T+1)}\sum_{t=1}^T t\,x_t.
\]
We consider separately the cases of unconstrained and constrained optimization.
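The weighted average $\hat x_T$ can be maintained online without storing the trajectory, via the recursion $\hat x_t = \frac{t-1}{t+1}\hat x_{t-1} + \frac{2}{t+1}x_t$, which follows by splitting off the last term of the sum. A short sketch:

```python
import numpy as np

def weighted_average(trajectory):
    """Online computation of x_hat_T = (2 / (T (T + 1))) * sum_t t * x_t."""
    x_hat = None
    for t, x_t in enumerate(trajectory, start=1):
        x_t = np.asarray(x_t, dtype=float)
        x_hat = x_t if x_hat is None else ((t - 1) * x_hat + 2.0 * x_t) / (t + 1)
    return x_hat

# Check against the closed form on a toy trajectory x_t = (t, 2t):
xs = [np.array([t, 2.0 * t]) for t in range(1, 6)]
direct = 2.0 / (5 * 6) * sum(t * x for t, x in enumerate(xs, start=1))
assert np.allclose(weighted_average(xs), direct)
```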

4.3.1 Unconstrained optimization

In this part we assume that $\Theta = \mathbb R^d$ and the horizon $T$ is known to the learner. Similarly to Section 4.2,
we first state a general result that can be applied to any gradient estimator satisfying Assumption D.

Theorem 18 Let $f$ be an $\alpha$-strongly convex function, $\Theta = \mathbb R^d$, and let Assumptions A and B hold.
Let $x_t$ be defined by algorithm (4) with $g_t$ satisfying Assumption D, deterministic $x_1$ and
\[
\eta_t = \min\Big\{\frac{\alpha}{8\bar L^2V_1},\; \frac{4}{\alpha(t+1)}\Big\}, \qquad
h_t = \Big(\frac{4\sigma^2V_3}{b^2L^2}\Big)^{\frac{1}{2\beta}}\cdot
\begin{cases}
t^{-\frac{1}{2\beta}} & \text{if } \eta_t = \frac{4}{\alpha(t+1)}, \\[2pt]
T^{-\frac{1}{2\beta}} & \text{if } \eta_t = \frac{\alpha}{8\bar L^2V_1}.
\end{cases}
\]
Then
\[
\mathbf{E}[f(\hat x_T) - f^\star] \le A_1\frac{\bar L^2V_1}{\alpha T}\|x_1 - x^\star\|^2
+ \bigg\{A_2(bL)^{\frac{2}{\beta}}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + A_3V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}T^{-\frac{2}{\beta}}\bigg\}\frac{T^{-\frac{\beta-1}{\beta}}}{\alpha},
\]
where the constants $A_1, A_2, A_3 > 0$ depend only on $\beta$.


Subsequently, in Corollary 19, we customize the above theorem for gradient estimators (6) and (7),
with assignments of $\eta_t, h_t$ that are again selected based on Table 1. We also include a bound for
$\mathbf{E}[\|\hat x_T - x^\star\|^2]$, which comes as an immediate consequence due to (2).

Corollary 19 Let $f$ be an $\alpha$-strongly convex function, $\Theta = \mathbb R^d$, and let Assumptions A and B hold.
Let $x_t$ be defined by algorithm (4) with gradient estimator (6) or (7), and parameters $\eta_t, h_t$ as in
Theorem 18, where $b, V_1, V_2, V_3$ are given in Table 1 for each gradient estimator, respectively. Let
$x_1$ be deterministic. Then for any $T \ge d^{2-\frac{\beta}{2}}\frac{\sigma^2}{L^2}$ we have
\begin{align}
\mathbf{E}[f(\hat x_T) - f^\star] &\le A_1\frac{\bar L^2d}{\alpha T}\|x_1 - x^\star\|^2 + \Big(A_2 + A_3\frac{\bar L^2}{\sigma^2}\Big)\Big(\frac{d^2\sigma^2}{T}\Big)^{\frac{\beta-1}{\beta}}L^{\frac{2}{\beta}}\alpha^{-1}, \tag{20} \\
\mathbf{E}[\|\hat x_T - x^\star\|^2] &\le 2A_1\frac{\bar L^2d}{\alpha^2T}\|x_1 - x^\star\|^2 + 2\Big(A_2 + A_3\frac{\bar L^2}{\sigma^2}\Big)\Big(\frac{d^2\sigma^2}{T}\Big)^{\frac{\beta-1}{\beta}}L^{\frac{2}{\beta}}\alpha^{-2}, \tag{21}
\end{align}
where $A_1, A_2, A_3 > 0$ depend only on $\beta$ and on the choice of the gradient estimator.

With a slightly different definition of the smoothness class (which coincides with ours for $\beta = 2$, cf.
Remark 2), a result comparable to Corollary 19 is derived in (Akhavan et al., 2020, Theorem 3.2).
However, that result imposes an additional condition on $\alpha$ (namely, $\alpha \gtrsim \sqrt{d/T}$) and provides a bound
with the dimension factor $d^2$ rather than $d^{2-2/\beta}$ as in Corollary 19. We also note that earlier Bach and
Perchet (2016) analyzed the case of the $\ell_2$-randomized gradient estimator with integer $\beta > 2$ and proved a
bound with a slower (suboptimal) rate $T^{-\frac{\beta-1}{\beta+1}}$.

4.3.2 Constrained optimization

We now assume that $\Theta \subset \mathbb R^d$ is a compact convex set. In the present part, we do not need the
knowledge of the horizon $T$ to define the updates $x_t$. We first state the following general theorem,
valid when $g_t$ is any gradient estimator satisfying Assumption D.

Theorem 20 Let $\Theta \subset \mathbb R^d$ be a compact convex set. Assume that $f$ is an $\alpha$-strongly convex function,
Assumptions A and B hold, and $\max_{x\in\Theta}\|\nabla f(x)\| \le G$. Let $x_t$ be defined by algorithm (4) with
gradient estimator $g_t$ satisfying Assumption D, $\eta_t = \frac{4}{\alpha(t+1)}$ and $h_t = \big(\frac{\sigma^2V_3}{b^2L^2t}\big)^{\frac{1}{2\beta}}$. Then
\[
\mathbf{E}[f(\hat x_T) - f^\star] \le \frac{4\bar L^2V_1G^2}{\alpha T} + \frac{A_1}{\alpha}\bigg(V_3\sigma^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{-\frac{1}{\beta}} + V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac{1}{\beta}}T^{-\frac{2}{\beta}}\bigg)T^{-\frac{\beta-1}{\beta}},
\]
where the constant $A_1 > 0$ depends only on $\beta$.

Using the bounds on the variance and bias of gradient estimators (6) and (7) from Section 3,
Remark 5 and the trivial bounds $\mathbf{E}[f(\hat x_T) - f^\star] \le GB$, $\mathbf{E}[\|\hat x_T - x^\star\|^2] \le B^2$, where $B$ is the
Euclidean diameter of $\Theta$, we immediately obtain the following corollary.


Corollary 21 Let $\Theta \subset \mathbb R^d$ be a compact convex set. Assume that $f$ is an $\alpha$-strongly convex function,
Assumptions A and B hold, and $\max_{x\in\Theta}\|\nabla f(x)\| \le G$. Let $x_t$ be defined by algorithm (4) with
gradient estimator (6) or (7), and parameters $\eta_t, h_t$ as in Theorem 20, where $b, V_1, V_2, V_3$ are given
in Table 1 for each gradient estimator, respectively. Then for any $T \ge d^{2-\frac{\beta}{2}}\frac{\sigma^2}{L^2}$ we have
\begin{align}
\mathbf{E}[f(\hat x_T) - f^\star] &\le \min\bigg\{GB,\; \frac{4\bar L^2V_1G^2}{\alpha T} + \Big(A_1 + A_2\frac{\bar L^2}{\sigma^2}\Big)\Big(\frac{d^2\sigma^2}{T}\Big)^{\frac{\beta-1}{\beta}}L^{\frac{2}{\beta}}\alpha^{-1}\bigg\}, \tag{22} \\
\mathbf{E}[\|\hat x_T - x^\star\|^2] &\le \min\bigg\{B^2,\; \frac{2GB}{\alpha},\; \frac{8\bar L^2V_1G^2}{\alpha^2T} + 2\Big(A_1 + A_2\frac{\bar L^2}{\sigma^2}\Big)\Big(\frac{d^2\sigma^2}{T}\Big)^{\frac{\beta-1}{\beta}}L^{\frac{2}{\beta}}\alpha^{-2}\bigg\}, \tag{23}
\end{align}
where $B$ is the Euclidean diameter of $\Theta$, and $A_1, A_2 > 0$ depend only on $\beta$ and on the choice of
the gradient estimator.

In a similar setting, but assuming independent zero-mean $\xi_t$'s, Bach and Perchet (2016) considered
the case of $\ell_2$ randomization and proved, for integer $\beta > 2$, a bound with the suboptimal rate $T^{-\frac{\beta-1}{\beta+1}}$.
Corollary 21 can be also compared to Akhavan et al. (2020, 2021) ($\ell_2$ randomization and coordinate-wise
randomization) and, for $\beta > 2$, to Novitskii and Gasnikov (2022) ($\ell_2$ randomization). However,
those papers use a slightly different definition of the $\beta$-smoothness class (both definitions coincide if
$\beta = 2$, see Remark 2). Their bounds guarantee the rate $O\big(\frac{d^{2-1/\beta}}{\alpha}T^{-\frac{\beta-1}{\beta}}\big)$ for $\beta > 2$ (Akhavan et al., 2021,
Corollary 6), (Novitskii and Gasnikov, 2022, Theorem 1) and $O\big(\frac{d}{\alpha\sqrt T}\big)$ for $\beta = 2$ (Akhavan et al.,
2020, Theorem D.4) by using two different approaches for the two cases. In contrast, Corollary 21
yields $O\big(\frac{d^{2-2/\beta}}{\alpha}T^{-\frac{\beta-1}{\beta}}\big)$ and $O\big(\frac{d}{\alpha\sqrt T}\big)$, respectively, and obtains these rates by a unified approach for all
$\beta \ge 2$, and simultaneously under $\ell_1$ and $\ell_2$ randomizations. Note that, under the condition $T \ge d^2$
guaranteeing non-triviality of the bound, and $\alpha \ge 1$, the rate $O\big(\frac{d}{\alpha\sqrt T}\big)$ that we obtain in Corollary 21
for $\beta = 2$ matches the minimax lower bound (cf. Theorem 22 below) as a function of all the three
parameters $T$, $d$, and $\alpha$.

5. Lower bounds

In this section we prove minimax lower bounds on the optimization error over all sequential strategies
with two-point feedback that allow the query points to depend on the past. For $t = 1, \ldots, T$, we
assume that the values $y_t = f(z_t) + \xi_t$ and $y_t' = f(z_t') + \xi_t'$ are observed, where $(\xi_t, \xi_t')$ are random
noises and $(z_t, z_t')$ are query points. We consider all strategies of choosing the query points as
$z_t = \Phi_t\big((z_i, y_i)_{i=1}^{t-1}, (z_i', y_i')_{i=1}^{t-1}, \tau_t\big)$ and $z_t' = \Phi_t'\big((z_i, y_i)_{i=1}^{t-1}, (z_i', y_i')_{i=1}^{t-1}, \tau_t\big)$ for $t \ge 2$, where
$\Phi_t$'s and $\Phi_t'$'s are measurable functions, $z_1, z_1' \in \mathbb R^d$ are any random variables, and $\{\tau_t\}$ is a
sequence of random variables with values in a measurable space $(\mathcal Z, \mathcal U)$, such that $\tau_t$ is independent
of $\big((z_i, y_i)_{i=1}^{t-1}, (z_i', y_i')_{i=1}^{t-1}\big)$. We denote by $\Pi_T$ the set of all such strategies of choosing query
points up to $t = T$. The class $\Pi_T$ includes the sequential strategy of Algorithm (4) with either of
the two considered gradient estimators (6) and (7). In this case, $\tau_t = (\zeta_t, r_t)$, $z_t = x_t + h_t\zeta_tr_t$
and $z_t' = x_t - h_t\zeta_tr_t$, where $\zeta_t = \zeta_t^{\circ}$ or $\zeta_t = \zeta_t^{\diamond}$.


To state our assumption on the noises $(\xi_t, \xi_t')$, we introduce the squared Hellinger distance
$H^2(\cdot,\cdot)$ defined for two probability measures $\mathbf P, \mathbf P'$ on a measurable space $(\Omega, \mathcal A)$ as
\[
H^2(\mathbf P, \mathbf P') \triangleq \int\big(\sqrt{\mathrm d\mathbf P} - \sqrt{\mathrm d\mathbf P'}\big)^2.
\]
Assumption E For every $t \ge 1$, the following holds:

• The cumulative distribution function $F_t : \mathbb R^2 \to \mathbb R$ of the random variable $(\xi_t, \xi_t')$ is such that
\[
H^2\big(\mathbf P_{F_t(\cdot,\cdot)}, \mathbf P_{F_t(\cdot+v,\cdot+v)}\big) \le I_0v^2, \qquad |v| \le v_0, \tag{24}
\]
for some $0 < I_0 < \infty$, $0 < v_0 \le \infty$. Here, $\mathbf P_{F(\cdot,\cdot)}$ denotes the probability measure
corresponding to the c.d.f. $F(\cdot,\cdot)$.

• The random variable $(\xi_t, \xi_t')$ is independent of $\big((z_i, y_i)_{i=1}^{t-1}, (z_i', y_i')_{i=1}^{t-1}, \tau_t\big)$.

Condition (24) is not restrictive and encompasses a large family of distributions. It is satisfied
with small enough $v_0$ for distributions that correspond to regular statistical experiments (see, e.g.,
Ibragimov and Khas'minskii, 1982, Chapter 1). If $F_t$ is a Gaussian c.d.f., condition (24) is satisfied
with $v_0 = \infty$.
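The Gaussian case can be verified directly: for $\mathcal N(0,\sigma^2)$ versus $\mathcal N(v,\sigma^2)$ one has the closed form $H^2 = 2\big(1 - e^{-v^2/(8\sigma^2)}\big) \le v^2/(4\sigma^2)$, so a bound of the form (24) holds with $I_0 = 1/(4\sigma^2)$ and $v_0 = \infty$. A numerical sketch for a one-dimensional marginal (the grid and tolerances below are our own choices):

```python
import numpy as np

def hellinger_sq_gaussian(v, sigma=1.0):
    """Trapezoid approximation of H^2(N(0,s^2), N(v,s^2)) = int (sqrt p - sqrt q)^2."""
    grid = np.linspace(-20.0, 20.0, 400_001)
    p = np.exp(-grid ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    q = np.exp(-(grid - v) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    integrand = (np.sqrt(p) - np.sqrt(q)) ** 2
    return float(np.sum(0.5 * (integrand[:-1] + integrand[1:]) * np.diff(grid)))

for v in (0.1, 0.5, 1.0):
    h2 = hellinger_sq_gaussian(v)
    assert abs(h2 - 2 * (1 - np.exp(-v ** 2 / 8))) < 1e-6   # matches the closed form
    assert h2 <= v ** 2 / 4                                  # quadratic bound, I0 = 1/4
```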
To state the lower bounds, we consider a subset of the classes of functions for which we obtained
the upper bounds in Section 4. Let $\Theta = \{x \in \mathbb R^d : \|x\| \le 1\}$. For $\alpha, L, \bar L > 0$, $\beta \ge 2$, let $\mathcal F_{\alpha,\beta}$
denote the set of all $\alpha$-strongly convex functions $f$ that satisfy Assumption A, attain their minimum
over $\mathbb R^d$ in $\Theta$, and are such that $\max_{x\in\Theta}\|\nabla f(x)\| \le G$, where the condition $G > \alpha$ is satisfied.²

Theorem 22 Let $\Theta = \{x \in \mathbb R^d : \|x\| \le 1\}$ and let Assumption E hold. Then, for any estimator
$\tilde x_T$ based on the observations $((z_t, y_t), (z_t', y_t'),\ t = 1, \ldots, T)$, where $((z_t, z_t'),\ t = 1, \ldots, T)$ are
obtained by any strategy in the class $\Pi_T$, we have
\[
\sup_{f \in \mathcal F_{\alpha,\beta}} \mathbf{E}\big[f(\tilde x_T) - f^\star\big] \ge C\min\Big\{\max\big(\alpha,\, T^{-1/2+1/\beta}\big),\; \frac{d}{\sqrt T},\; \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\Big\}, \tag{25}
\]
and
\[
\sup_{f \in \mathcal F_{\alpha,\beta}} \mathbf{E}\big[\|\tilde x_T - x^\star(f)\|^2\big] \ge C\min\Big\{1,\; \frac{d}{T^{1/\beta}},\; \frac{d}{\alpha^2}T^{-\frac{\beta-1}{\beta}}\Big\}, \tag{26}
\]
where $C > 0$ is a constant that does not depend on $T, d$, and $\alpha$, and $x^\star(f)$ is the minimizer of $f$
on $\Theta$.
2. The condition $G \ge \alpha$ is necessary for the class $\mathcal F_{\alpha,\beta}$ to be non-empty. Indeed, due to (1) and (2), for all $f \in \mathcal F_{\alpha,\beta}$
and $x \in \Theta$ we have $\|x - x^\star\| \le G/\alpha$, and thus $2G/\alpha \ge \mathrm{diam}(\Theta) = 2$.


Some remarks are in order here. First, note that the threshold $T^{-1/2+1/\beta}$ on the strong convexity
parameter $\alpha$ plays an important role in bounds (25) and (26). Indeed, for $\alpha$ below this threshold,
the bounds become independent of $\alpha$. Intuitively, it seems reasonable that $\alpha$-strong convexity
should be of no added value for small $\alpha$. Theorem 22 allows us to quantify exactly how small such
$\alpha$ should be, namely, $\alpha \lesssim T^{-1/2+1/\beta}$. In particular, for $\beta = 2$ the threshold occurs at $\alpha \asymp 1$. Also,
quite naturally, the threshold becomes smaller when the smoothness $\beta$ increases.

In the regime below the $T^{-1/2+1/\beta}$ threshold, the rate of (25) becomes $\min(T^{1/\beta}, d)/\sqrt T$,
which is asymptotically $d/\sqrt T$ independently of the smoothness index $\beta$ and of $\alpha$. Thus, we obtain
that $d/\sqrt T$ is a lower bound over the class of simply convex functions. On the other hand, the
achievable rate for convex functions is shown to be $d^{16}/\sqrt T$ in Agarwal et al. (2011) and improved
to $d^{4.5}/\sqrt T$ in Lattimore and György (2021) (both results are up to poly-logarithmic factors, and
under sub-Gaussian noise $\xi_t$). The gap between our lower bound $d/\sqrt T$ and these upper bounds
is only in the dependence on the dimension, but this gap is substantial. In the regime where $\alpha$ is
above the $T^{-1/2+1/\beta}$ threshold, our results imply that the gap between upper and lower bounds is
much smaller. Thus, our upper bounds in this regime scale as $\frac{d^{2-2/\beta}}{\alpha}T^{-\frac{\beta-1}{\beta}}$ while the lower bound
of Theorem 22 is of the order $\frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}$.

Consider now the case $\beta = 2$. Then the lower bounds (25) and (26) are of order $d/(\max(\alpha, 1)\sqrt T)$
and $d/(\max(\alpha^2, 1)\sqrt T)$, respectively, under the condition $T \ge d^2$ guaranteeing non-triviality of
the rates. If, in addition, $\alpha \gtrsim 1$ (meaning that $\alpha$ is above the threshold $\alpha \asymp 1$) we obtain the lower
rates $d/(\alpha\sqrt T)$ and $d/(\alpha^2\sqrt T)$, respectively. Comparing this remark to Corollary 21 we obtain the
following result.
Corollary 23 Let $\beta = 2$ and let the assumptions of Theorem 22 and Corollary 21 hold. If $\alpha \ge 1$
and $T \ge \max(d^2, \bar L^2d, \bar L^4G^4)$ then there exist positive constants $c, C$ that do not depend on $T, d$,
and $\alpha$ such that we have the following bounds on the minimax risks:
\[
c\,\frac{d}{\alpha\sqrt T} \le \inf_{\tilde x_T}\sup_{f \in \mathcal F_{\alpha,\beta}} \mathbf{E}\big[f(\tilde x_T) - f^\star\big] \le C\,\frac{d}{\alpha\sqrt T}, \tag{27}
\]
and
\[
c\,\frac{d}{\alpha^2\sqrt T} \le \inf_{\tilde x_T}\sup_{f \in \mathcal F_{\alpha,\beta}} \mathbf{E}\big[\|\tilde x_T - x^\star(f)\|^2\big] \le C\,\frac{d}{\alpha^2\sqrt T}, \tag{28}
\]
where $x^\star(f)$ is the minimizer of $f$ on $\Theta$, and the infimum is over all estimators $\tilde x_T$ based on query
points obtained via strategies in the class $\Pi_T$. The minimax rates in (27) and (28) are attained by
the estimator $\tilde x_T = \hat x_T$ with parameters as in Corollary 21.
Thus, the weighted average estimator $\hat x_T$ as in Corollary 21 is minimax optimal with respect to all
the three parameters $T$, $d$, and $\alpha$, both in the optimization error and in the estimation risk. Note
that we introduced the condition $T \ge \max(\bar L^2d, \bar L^4G^4)$ in Corollary 23 to guarantee that the upper
bounds are of the required order, cf. Corollary 21. Thus, $G$ is allowed to be a function of $T$
that grows not too fast. Since $G > \alpha$, this condition also prevents $\alpha$ from being too large, that is,
Corollary 23 does not hold if $\alpha \gtrsim T^{1/4}$.


The issue of finding the minimax optimal rates in gradient-free stochastic optimization under
strong convexity and smoothness assumptions has a long history. It was initiated in Fabian (1967);
Polyak and Tsybakov (1990) and more recently developed in Dippon (2003); Jamieson et al. (2012);
Shamir (2013); Bach and Perchet (2016); Akhavan et al. (2020, 2021). It was shown in Polyak
and Tsybakov (1990) that the minimax optimal rate on the class of $\alpha$-strongly convex and $\beta$-Hölder
functions scales as $c(\alpha, d)T^{-(\beta-1)/\beta}$ for $\beta \ge 2$, where $c(\alpha, d)$ is an unspecified function of $\alpha$ and $d$
(for $d = 1$ and integer $\beta \ge 2$ an upper bound of the same order was earlier derived in Fabian (1967)).
The issue of establishing non-asymptotic fundamental limits as a function of the main parameters
of the problem ($\alpha$, $d$ and $T$) was first addressed in Jamieson et al. (2012), giving a lower bound of
$\Omega(\sqrt{d/T})$ for $\beta = 2$, without specifying the dependency on $\alpha$. This was improved to $\Omega(d/\sqrt T)$
when $\alpha \asymp 1$ by Shamir (2013), who also claimed that the rate $d/\sqrt T$ is optimal for $\beta = 2$ referring
to an upper bound in Agarwal et al. (2010). However, invoking Agarwal et al. (2010) in the setting
with random noise $\xi_t$ (for which the lower bound of Shamir (2013) was proved) is not legitimate
because in Agarwal et al. (2010) the observations are considered as a Lipschitz function of $t$. The
complete proof of minimax optimality of the rate $d/\sqrt T$ for $\beta = 2$ under random noise was later
provided in Akhavan et al. (2020). However, the upper and the lower bounds in Akhavan et al.
(2020) still differ in their dependence on $\alpha$. Corollary 23 completes this line of work by establishing
the minimax optimality as a function of the whole triplet $(T, d, \alpha)$ for $\beta = 2$.

The main lines of the proof of Theorem 22 follow Akhavan et al. (2020). However, in Akhavan
et al. (2020) the assumptions on the noise are much more restrictive: the random variables $\xi_t$ are
assumed i.i.d. and, instead of (24), a much stronger condition is imposed, namely, a bound on the
Kullback–Leibler divergence. In particular, in order to use the Kullback–Leibler divergence between
two distributions we need one of them to be absolutely continuous with respect to the other. Using
the Hellinger distance allows us to drop this restriction. For example, if $F_t$ is a distribution with
bounded support then the Kullback–Leibler divergence between $F_t(\cdot)$ and $F_t(\cdot + v)$ is $+\infty$ while
the Hellinger distance is finite.

Acknowledgment: The work of A. Akhavan and A.B. Tsybakov was supported by the grant
of the French National Research Agency (ANR) "Investissements d'Avenir" LabEx Ecodec/ANR-11-
LABX-0047.

References
A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with
multi-point bandit feedback. In Proc. 23rd International Conference on Learning Theory, pages
28–40, 2010.

A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization


with bandit feedback. In Advances in Neural Information Processing Systems, volume 25, pages
1035–1043, 2011.


A. Akhavan, M. Pontil, and A.B. Tsybakov. Exploiting higher order smoothness in derivative-free
optimization and continuous bandits. In Advances in Neural Information Processing Systems 33,
2020.

A. Akhavan, M. Pontil, and A. B. Tsybakov. Distributed zero-order optimization under adversarial


noise. In Advances in Neural Information Processing Systems 34, 2021.

A. Akhavan, E. Chzhen, M. Pontil, and A. B. Tsybakov. A gradient estimator via $\ell_1$-randomization for online zero-order optimization with two point feedback. In Advances in Neural Information Processing Systems 35, 2022.

Y. Arjevani, Y. Carmon, J. Duchi, D. Foster, N. Srebro, and B. Woodworth. Lower bounds for
non-convex stochastic optimization. Mathematical Programming, 2022.

F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. In Proc. 29th Annual
Conference on Learning Theory, 2016.

M. V. Balashov, B. T. Polyak, and A. A. Tremba. Gradient projection and conditional gradient meth-
ods for constrained nonconvex minimization. Numerical Functional Analysis and Optimization,
41(7):822–849, 2020.

K. Balasubramanian and S. Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42, 2021.

F. Barthe, O. Guédon, S. Mendelson, and A. Naor. A probabilistic approach to the geometry of the $\ell_p^n$ ball. The Annals of Probability, 33(2):480–513, 2005.

S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine
Learning, 8(3-4):231–257, 2015.

Y. Carmon, J. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184:71–120, 2020.

J. Dippon. Accelerated randomized stochastic optimization. Ann. Statist., 31(4):1260–1281, 2003.

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex
optimization: The power of two function evaluations. IEEE Transactions on Information Theory,
61(5):2788–2806, 2015.

V. Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of
Mathematical Statistics, 38(1):191–200, 1967.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting:
gradient descent without a gradient. In Proc. 16th Annual ACM-SIAM Symposium on Discrete
algorithms (SODA), 2005.


G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: Beyond the
worst case with the help of geometry. Mathematical Programming, 198:937–996, 2023.

A. Gasnikov, A. Lagunovskaya, I. Usmanova, and F. Fedorenko. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Automation and Remote Control, 77(11):2018–2034, 2016.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic
programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

I. A. Ibragimov and R. Z. Khas'minskii. Estimation of the maximum value of a signal in Gaussian white noise. Mat. Zametki, 32(4):746–750, 1982.

K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, volume 26, pages 2672–2680, 2012.

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods
under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in
Databases, 2016.

T. Lattimore and A. György. Improved regret for zeroth-order stochastic convex bandits. In Advances
in Neural Information Processing Systems 34, 2021.

A. Nemirovski. Topics in non-parametric statistics. Ecole d'Eté de Probabilités de Saint-Flour 28, 2000.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.

Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011001, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2011.

Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17:527–566, 2017.

V. Novitskii and A. Gasnikov. Improved exploitation of higher order smoothness in derivative-free optimization. Optimization Letters, 16:2059–2071, 2022.

R. Osserman. The isoperimetric inequality. Bulletin of the American Mathematical Society, 84(6):
1182–1238, 1978.

B. T. Polyak and A. B. Tsybakov. Optimal order of accuracy of search algorithms in stochastic optimization. Problems of Information Transmission, 26(2):45–53, 1990.


B.T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathe-
matics and Mathematical Physics, 3:864–878, 1963.

F. Qi and Q.-M. Luo. Bounds for the ratio of two gamma functions: from Wendel’s asymptotic
relation to Elezović-Giordano-Pečarić’s theorem. Journal of Inequalities and Applications, 2013.

S. T. Rachev and L. Rüschendorf. Approximate independence of distributions on spheres and their stability properties. The Annals of Probability, 19(3):1311–1337, 1991.

M. Rando, C. Molinari, S. Villa, and L. Rosasco. Stochastic zeroth order descent with structured
directions. arXiv:2206.05124, 2022.

G. Schechtman and J. Zinn. On the volume of the intersection of two $\ell_p^n$ balls. Proceedings of the American Mathematical Society, 110(1):217–224, 1990.

O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proc.
30th Annual Conference on Learning Theory, pages 1–22, 2013.

O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point
feedback. Journal of Machine Learning Research, 18(1):1703–1713, 2017.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.

V. A. Zorich. Mathematical analysis II. Springer, 2016.

Appendix
In this appendix we first provide some auxiliary results and then prove the results stated in the
main body of the paper.
Additional notation  Let $W_1, W_2$ be two random variables; we write $W_1 \stackrel{d}{=} W_2$ to denote their equality in distribution. We also denote by $\Gamma : \mathbb{R}_+ \to \mathbb{R}_+$ the gamma function defined, for every $z > 0$, as $\Gamma(z) = \int_0^\infty x^{z-1} \exp(-x) \, \mathrm{d}x$.

Appendix A. Consequences of the smoothness assumption


Let us first provide some immediate consequences of the smoothness assumption that we consider.

Remark 24 For all $k \in \mathbb{N} \setminus \{0\}$ and all $h \in \mathbb{R}^d$ it holds that
$$
f^{(k)}(x)[h]^k \;=\; \sum_{|m_1| = \cdots = |m_k| = 1} D^{m_1 + \cdots + m_k} f(x)\, h^{m_1 + \cdots + m_k} \;=\; \sum_{|m| = k} \frac{k!}{m!}\, D^{m} f(x)\, h^{m} \,.
$$

27
A KHAVAN , C HZHEN , P ONTIL , T SYBAKOV

Proof The first equality of the remark follows from the definition. For the second one it is sufficient to show that for each $m = (m_1, \ldots, m_d)^\top \in \mathbb{N}^d$ with $|m| = k$ there exist exactly $k!/m!$ distinct choices of $(m_1, \ldots, m_k) \in (\mathbb{N}^d)^k$ with $|m_1| = \ldots = |m_k| = 1$ and $m_1 + \ldots + m_k = m$. To see this, we map $m \in \mathbb{N}^d$ into a word containing letters from $\{a_1, a_2, \ldots, a_d\}$ as
$$
m \mapsto W(m) := \underbrace{a_1 \ldots a_1}_{m_1\ \text{times}}\, \underbrace{a_2 \ldots a_2}_{m_2\ \text{times}} \ldots \underbrace{a_d \ldots a_d}_{m_d\ \text{times}} \,.
$$
By construction, each letter $a_j$ is repeated exactly $m_j$ times in $W(m)$. Furthermore, if $|m| = k$, then $W(m)$ contains exactly $k$ letters. From now on, fix an arbitrary $m \in \mathbb{N}^d$ with $|m| = k$. Given $(m_1, \ldots, m_k) \in (\mathbb{N}^d)^k$ such that $|m_1| = \ldots = |m_k| = 1$ and $m_1 + \ldots + m_k = m$, define³
$$
(m_1, \ldots, m_k) \mapsto W(m_1) + W(m_2) + \ldots + W(m_k) \,.
$$
We observe that the condition $m_1 + \ldots + m_k = m$ implies that the word $W(m_1) + W(m_2) + \ldots + W(m_k)$ is a permutation of $W(m)$. A standard combinatorial fact states that the number of distinct permutations of $W(m)$ is given by the multinomial coefficient, i.e., by $k!/m!$. Since the mapping $(m_1, \ldots, m_k) \mapsto W(m_1) + W(m_2) + \ldots + W(m_k)$ is invertible, we conclude.
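The counting argument can be confirmed by brute force for small $d$ and $k$. The sketch below is our own check, not part of the paper: it enumerates all ordered tuples of unit multi-indices summing to a given $m$ and compares the count with $k!/m!$.

```python
from itertools import product
from math import factorial

def count_decompositions(m, k):
    """Count ordered k-tuples (m_1, ..., m_k) of unit multi-indices
    (|m_i| = 1) with m_1 + ... + m_k = m.  A unit multi-index is a
    coordinate vector e_j, so a tuple is a choice of k coordinates."""
    d = len(m)
    count = 0
    for choice in product(range(d), repeat=k):
        if all(choice.count(j) == m[j] for j in range(d)):
            count += 1
    return count

def multinomial(m):
    """k!/m! with k = |m| and m! = m_1! * ... * m_d!."""
    res = factorial(sum(m))
    for mj in m:
        res //= factorial(mj)
    return res

m = (2, 1, 0)  # |m| = 3
print(count_decompositions(m, sum(m)), multinomial(m))  # both equal 3
```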

Lemma 25 Assume that $f \in \mathcal{F}_\beta(L)$ for some $\beta \geq 2$ and $L > 0$. Let $v \in \mathbb{R}^d$ with $\|v\| = 1$ and define the function $g_v : \mathbb{R}^d \to \mathbb{R}$ as $g_v(x) = \langle v , \nabla f(x) \rangle$, $x \in \mathbb{R}^d$. Then $g_v \in \mathcal{F}_{\beta - 1}(L)$.

Proof Set $\ell := \lfloor \beta \rfloor$. Note that since $f$ is $\ell$ times continuously differentiable, $g_v$ is $\ell - 1$ times continuously differentiable. Furthermore, for any $h_1, \ldots, h_{\ell-1} \in \mathbb{R}^d$,
$$
g_v^{(\ell-1)}(x)[h_1, \ldots, h_{\ell-1}] = \sum_{|m_1| = \ldots = |m_{\ell-1}| = 1} D^{m_1 + \ldots + m_{\ell-1}} g_v(x)\, h_1^{m_1} \cdots h_{\ell-1}^{m_{\ell-1}}
= \sum_{|m_1| = \ldots = |m_\ell| = 1} D^{m_1 + \ldots + m_\ell} f(x)\, h_1^{m_1} \cdots h_{\ell-1}^{m_{\ell-1}} v^{m_\ell}
= f^{(\ell)}(x)[h_1, \ldots, h_{\ell-1}, v] \,.
$$
Hence, for any $x, z \in \mathbb{R}^d$ we can write, by definition of the norm of an $(\ell-1)$-linear form,
$$
\begin{aligned}
\big\| g_v^{(\ell-1)}(x) - g_v^{(\ell-1)}(z) \big\|
&= \sup \big\{ g_v^{(\ell-1)}(x)[h_1, \ldots, h_{\ell-1}] - g_v^{(\ell-1)}(z)[h_1, \ldots, h_{\ell-1}] \,:\, \|h_j\| = 1,\ j \in [\ell-1] \big\} \\
&= \sup \big\{ f^{(\ell)}(x)[h_1, \ldots, h_{\ell-1}, v] - f^{(\ell)}(z)[h_1, \ldots, h_{\ell-1}, v] \,:\, \|h_j\| = 1,\ j \in [\ell-1] \big\} \\
&\leq \big\| f^{(\ell)}(x) - f^{(\ell)}(z) \big\| \leq L \|x - z\|^{\beta - \ell} \,.
\end{aligned}
$$

3. The summation of words is defined as concatenation.


Lemma 26 Fix some real $\beta \geq 2$ and assume that $f \in \mathcal{F}_\beta(L)$. Then, for all $x, z \in \mathbb{R}^d$,
$$
\Big| f(x) - \sum_{0 \leq |m| \leq \ell} \frac{1}{m!} D^m f(z) (x - z)^m \Big| \;\leq\; \frac{L}{\ell!} \|x - z\|^\beta \,.
$$

Proof Fix some $x, z \in \mathbb{R}^d$. Taylor expansion yields that, for some $c \in (0, 1)$,
$$
f(x) = \sum_{0 \leq |m| \leq \ell - 1} \frac{1}{m!} D^m f(z) (x - z)^m + \sum_{|m| = \ell} \frac{1}{m!} D^m f(z + c(x - z)) (x - z)^m \,.
$$
Thus, using Remark 24 and the fact that $f \in \mathcal{F}_\beta(L)$, we get
$$
\begin{aligned}
\Big| f(x) - \sum_{|m| \leq \ell} \frac{1}{m!} D^m f(z) (x - z)^m \Big|
&= \Big| \sum_{|m| = \ell} \frac{1}{m!} \big( D^m f(z + c(x - z)) - D^m f(z) \big) (x - z)^m \Big| \\
&= \frac{1}{\ell!} \Big| f^{(\ell)}(z + c(x - z))[x - z]^\ell - f^{(\ell)}(z)[x - z]^\ell \Big| \\
&\leq \frac{L}{\ell!} \|x - z\|^\ell \|c(x - z)\|^{\beta - \ell} \;\leq\; \frac{L}{\ell!} \|x - z\|^\beta \,.
\end{aligned}
$$

Appendix B. Bias and variance of the gradient estimators


B.1 Gradient estimator with $\ell_2$ randomization
In this section we prove Lemma 6 that provides a bound on the bias of the gradient estimator with $\ell_2$ randomization. The variance of this estimator is evaluated in Lemma 7 in the main body of the paper.
We will need the following auxiliary lemma.

Lemma 27 Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function. Let $r, U^\circ, \zeta^\circ$ be uniformly distributed on $[-1, 1]$, $B_2^d$, and $\partial B_2^d$, respectively. Then, for any $h > 0$, we have
$$
E[\nabla f(x + h r U^\circ)\, r K(r)] \;=\; \frac{d}{h}\, E[f(x + h r \zeta^\circ)\, \zeta^\circ K(r)] \,.
$$


Proof Fix $r \in [-1, 1] \setminus \{0\}$. Define $\varphi : \mathbb{R}^d \to \mathbb{R}$ as $\varphi(u) = f(x + h r u) K(r)$ and note that $\nabla \varphi(u) = h r \nabla f(x + h r u) K(r)$. Hence, we have
$$
E[\nabla f(x + h r U^\circ) K(r) \mid r] = \frac{1}{hr}\, E[\nabla \varphi(U^\circ) \mid r] = \frac{d}{hr}\, E[\varphi(\zeta^\circ) \zeta^\circ \mid r] = \frac{d}{hr}\, K(r)\, E[f(x + h r \zeta^\circ) \zeta^\circ \mid r] \,,
$$
where the second equality is obtained from a version of Stokes' theorem (see e.g., Zorich, 2016, Section 13.3.5, Exercise 14a). Multiplying both sides by $r$, using the fact that $r$ follows a continuous distribution, and taking the total expectation concludes the proof.
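The divergence-theorem identity behind this proof can be sanity-checked numerically. The sketch below is our own illustration (the test function $\varphi$ and all parameters are arbitrary choices, not from the paper): it verifies $E[\nabla \varphi(U^\circ)] = d\, E[\varphi(\zeta^\circ) \zeta^\circ]$ by Monte Carlo for a smooth $\varphi$ in dimension $d = 3$, where both sides equal the linear coefficient $b$ by symmetry.

```python
import random
import math

def sample_sphere(d, rng):
    """Uniform point on the unit l2 sphere via normalized Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in g))
    return [x / n for x in g]

def sample_ball(d, rng):
    """Uniform point in the unit l2 ball: sphere direction, radius U^(1/d)."""
    z = sample_sphere(d, rng)
    r = rng.random() ** (1.0 / d)
    return [r * x for x in z]

rng = random.Random(0)
d, N = 3, 200_000
b = [1.0, 2.0, -1.0]
phi = lambda u: 0.5 + sum(bi * ui for bi, ui in zip(b, u)) + u[0] * u[1]
grad_phi = lambda u: [b[0] + u[1], b[1] + u[0], b[2]]

lhs = [0.0] * d  # E[grad phi(U)],       U uniform in the ball
rhs = [0.0] * d  # d * E[phi(zeta) zeta], zeta uniform on the sphere
for _ in range(N):
    u, z = sample_ball(d, rng), sample_sphere(d, rng)
    g, p = grad_phi(u), phi(z)
    for j in range(d):
        lhs[j] += g[j] / N
        rhs[j] += d * p * z[j] / N

print(lhs, rhs)  # both approximate b = [1, 2, -1]
```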

Proof of Lemma 6 Using Lemma 27, the fact that $\int_{-1}^{1} r K(r) \, \mathrm{d}r = 1$, and the variational representation of the Euclidean norm, we can write
$$
\| E[g_t^\circ \mid x_t] - \nabla f(x_t) \| = \sup_{v \in \partial B_2^d} E\big[ \big( \nabla_v f(x + h_t r_t U^\circ) - \nabla_v f(x) \big)\, r_t K(r_t) \big] \,, \quad (29)
$$
where we recall that $U^\circ$ is uniformly distributed on $B_2^d$. Lemma 25 asserts that for any $v \in \partial B_2^d$ the directional gradient $\nabla_v f(\cdot)$ is $(\beta - 1, L)$-Hölder. Thus, due to Lemma 26 we have the following Taylor expansion:
$$
\nabla_v f(x_t + h_t r_t U^\circ) = \nabla_v f(x_t) + \sum_{1 \leq |m| \leq \ell - 1} \frac{(r_t h_t)^{|m|}}{m!} D^m \nabla_v f(x_t)\, (U^\circ)^m + R(h_t r_t U^\circ) \,, \quad (30)
$$
where the residual term $R(\cdot)$ satisfies $|R(x)| \leq \frac{L}{(\ell - 1)!} \|x\|^{\beta - 1}$.
Substituting (30) in (29) and using the "zeroing-out" properties of the kernel $K$, we deduce that
$$
\| E[g_t^\circ \mid x_t] - \nabla f(x_t) \| \leq \frac{L}{(\ell - 1)!} \kappa_\beta h_t^{\beta - 1}\, E\|U^\circ\|^{\beta - 1} = \frac{L}{(\ell - 1)!} \kappa_\beta h_t^{\beta - 1}\, \frac{d}{d + \beta - 1} \,,
$$
where the last equality is obtained from the fact that $E\|U^\circ\|^q = \frac{d}{d + q}$ for any $q \geq 0$.
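The fact $E\|U^\circ\|^q = d/(d+q)$ follows since the radius of a uniform point of $B_2^d$ has density $d r^{d-1}$, i.e., it is distributed as $V^{1/d}$ with $V$ uniform on $(0,1)$. A quick numerical check (our own, with arbitrary $d$ and $q$):

```python
import random

rng = random.Random(1)
d, q = 5, 3.0
N = 200_000
# radius of a uniform point in B_2^d ~ V^(1/d), V uniform on (0, 1),
# so E||U||^q = E[V^(q/d)] = d / (d + q)
mc = sum(rng.random() ** (q / d) for _ in range(N)) / N
exact = d / (d + q)
print(mc, exact)  # both close to 0.625
```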

B.2 Gradient estimator with $\ell_1$ randomization

In this section, we prove Lemma 8, which gives a bound on the bias of our gradient estimator with $\ell_1$ randomization, and Lemma 10, which provides a Poincaré type inequality crucial for the control of its variance.
Let $\zeta$ be a real valued random variable with $E[\zeta^2] \leq 4\sigma^2$ and let $\zeta^\diamond$ be distributed uniformly on $\partial B_1^d$. Assume that $\zeta$ and $\zeta^\diamond$ are independent from each other and from the random variable $r$, which is uniformly distributed on $[-1, 1]$. In order to control the bias and variance of the gradient estimator (7) for any fixed $t$, it is sufficient to do it for the random vector
$$
g_{x,h} = \frac{d}{2h} \big( f(x + h r \zeta^\diamond) - f(x - h r \zeta^\diamond) + \zeta \big)\, \mathrm{sign}(\zeta^\diamond)\, K(r) \,, \quad (31)
$$
where $\zeta = \xi_t - \xi_t'$.
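For concreteness, one draw of the random vector (31) can be simulated as below. This is our illustrative sketch, not the paper's code: the weight $K(r) = 3r$ is a stand-in satisfying $E[r K(r)] = 1$ for $r$ uniform on $[-1,1]$, which suffices in the smoothest case $\beta = 2$; the paper's kernels with higher-order zeroing-out properties are more elaborate. The sanity check at the end uses the fact that for a linear $f$ the estimator is unbiased for the gradient.

```python
import random

def sample_l1_sphere(d, rng):
    """Uniform point on the l1 sphere: normalized i.i.d. Laplace
    coordinates (Schechtman and Zinn, 1990)."""
    w = [rng.expovariate(1.0) * (1 if rng.random() < 0.5 else -1) for _ in range(d)]
    s = sum(abs(x) for x in w)
    return [x / s for x in w]

def g_estimator(f, x, h, rng, sigma=0.0):
    """One draw of the l1-randomized two-point estimator (31).
    K(r) = 3r is an illustrative weight with E[r K(r)] = 1."""
    d = len(x)
    zeta = sample_l1_sphere(d, rng)
    r = rng.uniform(-1.0, 1.0)
    xp = [xi + h * r * zi for xi, zi in zip(x, zeta)]
    xm = [xi - h * r * zi for xi, zi in zip(x, zeta)]
    noise = rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)  # xi_t - xi_t'
    scale = (d / (2 * h)) * (f(xp) - f(xm) + noise) * (3 * r)
    return [scale * (1 if zi > 0 else -1) for zi in zeta]

# sanity check: for linear f(x) = <c, x> the estimator is unbiased for c
rng = random.Random(2)
c = [1.0, -2.0, 3.0]
f = lambda x: sum(ci * xi for ci, xi in zip(c, x))
N = 300_000
avg = [0.0, 0.0, 0.0]
for _ in range(N):
    g = g_estimator(f, [0.5, 0.5, 0.5], 0.1, rng)
    for j in range(3):
        avg[j] += g[j] / N
print(avg)  # approximately [1, -2, 3]
```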


B.2.1 Control of the bias

Lemma 28 Let $U^\diamond$ be uniformly distributed on $B_1^d$ and $\zeta^\diamond$ be uniformly distributed on $\partial B_1^d$. Fix some $x \in \mathbb{R}^d$ and $h > 0$, and let Assumption B be fulfilled. Then the estimator in (31) satisfies
$$
E[g_{x,h}] = E[\nabla f(x + h r U^\diamond)\, r K(r)] \,.
$$
Proof The proof is analogous to that of Lemma 27 using (Akhavan et al., 2022, Theorem 6).

In order to obtain a bound on the bias of the estimator in (31) we need the following result, which controls the moments of the Euclidean norm of $U^\diamond$.

Lemma 29 Let $U^\diamond \in \mathbb{R}^d$ be distributed uniformly on $B_1^d$. Then for any $\beta \geq 1$ it holds that
$$
E[\|U^\diamond\|^\beta] \;\leq\; \frac{c_{\beta+1}\, d^{\beta/2}\, \Gamma(\beta + 1)\, \Gamma(d + 1)}{\Gamma(d + \beta + 1)} \,,
$$
where $c_{\beta+1} = 2^{\beta/2}$ for $1 \leq \beta < 2$ and $c_{\beta+1} = 1$ for $\beta \geq 2$.

Proof Let $W = (W_1, \ldots, W_d)$, $W_{d+1}$ be i.i.d. random variables following the Laplace distribution with mean $0$ and scale parameter $1$. Then, following (Barthe et al., 2005, Theorem 1), we have
$$
U^\diamond \stackrel{d}{=} \frac{W}{\|W\|_1 + |W_{d+1}|} \,,
$$
where the sign $\stackrel{d}{=}$ stands for equality in distribution. Furthermore, it follows from (Barthe et al., 2005, Theorem 2) (see also Rachev and Rüschendorf (1991); Schechtman and Zinn (1990)) that
$$
\frac{(W, |W_{d+1}|)}{\|W\|_1 + |W_{d+1}|} \quad \text{and} \quad \|W\|_1 + |W_{d+1}|
$$
are independent. Hence,
$$
E[\|U^\diamond\|^\beta] = E\left[ \frac{\big( \sum_{j=1}^d W_j^2 \big)^{\beta/2}}{\big( \|W\|_1 + |W_{d+1}| \big)^{\beta}} \right] = \frac{E[\|W\|^\beta]}{E[\|(W, W_{d+1})\|_1^\beta]} \,, \quad (32)
$$
where the equality follows from the independence recalled above. Note that $|W_j|$ is an $\exp(1)$ random variable for any $j = 1, \ldots, d$. Thus, if $1 \leq \beta < 2$, by Jensen's inequality we can write
$$
E[\|W\|^\beta] = E\Big[ \Big( \sum_{j=1}^d W_j^2 \Big)^{\beta/2} \Big] \leq \Big( \sum_{j=1}^d E[W_j^2] \Big)^{\beta/2} = d^{\beta/2}\, E[W_1^2]^{\beta/2} = d^{\beta/2}\, \Gamma(3)^{\beta/2} \,. \quad (33)
$$
If $\beta \geq 2$, again by Jensen's inequality we have
$$
E[\|W\|^\beta] = d^{\beta/2}\, E\Big[ \Big( \frac{1}{d} \sum_{j=1}^d W_j^2 \Big)^{\beta/2} \Big] \leq d^{\beta/2 - 1} \sum_{j=1}^d E[|W_j|^\beta] = d^{\beta/2}\, E[|W_1|^\beta] = d^{\beta/2}\, \Gamma(\beta + 1) \,. \quad (34)
$$
It remains to provide a suitable expression for $E[\|(W, W_{d+1})\|_1^\beta]$. We observe that $\|(W, W_{d+1})\|_1$ follows the Erlang distribution with parameters $(d + 1, 1)$ (as a sum of $d + 1$ i.i.d. $\exp(1)$ random variables). Hence, using the expression for the density of the Erlang distribution we get
$$
E[\|(W, W_{d+1})\|_1^\beta] = \frac{1}{\Gamma(d + 1)} \int_0^\infty x^{d + \beta} \exp(-x) \, \mathrm{d}x = \frac{\Gamma(d + \beta + 1)}{\Gamma(d + 1)} \,. \quad (35)
$$
Combining (32)–(35) proves the lemma.
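The Erlang-moment identity (35) is easy to confirm by simulation. The sketch below is our own check with arbitrary $d$ and $\beta$: it compares a Monte Carlo estimate of the $\beta$-th moment of a sum of $d+1$ i.i.d. $\exp(1)$ variables with $\Gamma(d+\beta+1)/\Gamma(d+1)$.

```python
import random
import math

rng = random.Random(3)
d, beta = 3, 2.5
N = 200_000
# ||(W, W_{d+1})||_1 is a sum of d+1 i.i.d. Exp(1) variables, i.e. Erlang(d+1, 1)
mc = 0.0
for _ in range(N):
    s = sum(rng.expovariate(1.0) for _ in range(d + 1))
    mc += s ** beta / N
exact = math.gamma(d + beta + 1) / math.gamma(d + 1)
print(mc, exact)  # Monte Carlo estimate vs. Gamma-ratio formula (35)
```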

Proof of Lemma 8 Using Lemma 28 and following the same lines as in the proof of Lemma 6, we deduce that
$$
\| E[g_t \mid x_t] - \nabla f(x_t) \| \leq \frac{L}{(\ell - 1)!} \kappa_\beta h_t^{\beta - 1}\, E\|U^\diamond\|^{\beta - 1} \leq \frac{L}{(\ell - 1)!} \kappa_\beta h_t^{\beta - 1}\, \frac{c_\beta\, d^{(\beta - 1)/2}\, \Gamma(\beta)\, \Gamma(d + 1)}{\Gamma(d + \beta)} \,,
$$
where the last inequality is due to Lemma 29. Next, recall that the Gamma function satisfies $\Gamma(z + 1) = z \Gamma(z)$ for any $z > 0$. Applying this relation iteratively and using the fact that $\ell = \lfloor \beta \rfloor$ we get
$$
\frac{\Gamma(d + 1)}{\Gamma(d + \beta)} = \frac{\Gamma(d + 1)}{\Gamma\big(d + (\beta - \ell)\big) \prod_{i=1}^{\ell} (d + \beta - i)} \leq \frac{(d + \beta - \ell)^{1 - (\beta - \ell)}}{\prod_{i=1}^{\ell} (d + \beta - i)} \leq \frac{1}{d^{\beta - 1}} \,,
$$
where the first inequality is obtained from (Qi and Luo, 2013, Remark 1), and the last one uses that $\beta - \ell \in [0, 1)$ and $d + \beta - i \geq d$ for $i \leq \ell$. Proceeding analogously we obtain that $\frac{\Gamma(\beta)}{(\ell - 1)!} \leq \ell^{\beta - \ell}$. Combining this bound with the two preceding displays yields the lemma.

B.2.2 Poincaré inequality for the control of the variance

We now prove the Poincaré inequality of Lemma 10 used to control the variance of the $\ell_1$-randomized estimator.
Proof of Lemma 10 The beginning of the proof is the same as in (Akhavan et al., 2022, Lemma 3). In particular, without loss of generality we assume that $E[G(\zeta)] = 0$, and consider first the case of continuously differentiable $G$. Let $W = (W_1, \ldots, W_d)$ be a vector such that the components $W_j$ are i.i.d. Laplace random variables with mean $0$ and scale parameter $1$. Set $T(w) = w / \|w\|_1$. Lemma 1 in Schechtman and Zinn (1990) asserts that, for $\zeta$ uniformly distributed on $\partial B_1^d$,
$$
T(W) \stackrel{d}{=} \zeta \quad \text{and} \quad T(W) \text{ is independent of } \|W\|_1 \,. \quad (36)
$$
Furthermore, in the proof of Lemma 3 in Akhavan et al. (2022), it is shown that
$$
\mathrm{Var}(G(\zeta)) \leq \frac{4}{d(d - 2)}\, E\Big[ \big\| I - T(W)\, \mathrm{sign}(W)^\top \big\|^2\, \|\nabla G(T(W))\|^2 \Big] \,, \quad (37)
$$
where $I$ is the identity matrix, $\|\cdot\|$ applied to matrices denotes the spectral norm, and $\mathrm{sign}(\cdot)$ applied to vectors denotes the vector of signs of the coordinates.
From this point, the proof diverges from that of Lemma 3 in Akhavan et al. (2022). Instead of bounding the spectral norm of $I - a b^\top$ by $1 + \|a\| \|b\|$ as it was done in that paper, we compute it exactly, which leads to the main improvement. Namely, Lemma 30 proved below gives
$$
\big\| I - T(W)\, \mathrm{sign}(W)^\top \big\|^2 = d\, \|T(W)\|^2 \,.
$$
Combining this equality with (37) we obtain the first bound of the lemma. The second bound of the lemma (regarding Lipschitz functions $G$) is deduced from the first one by the same argument as in Akhavan et al. (2022).

Lemma 30 Let $a \in \mathbb{R}^d$ be such that $\|a\|_1 = 1$. Then,
$$
\big\| I - a\, \mathrm{sign}(a)^\top \big\| = \sqrt{d}\, \|a\| \,.
$$
Proof of Lemma 30 Let $u = a / \|a\|$, $v = \mathrm{sign}(a) / \sqrt{d}$, and $\gamma = \sqrt{d}\, \|a\|$. Then, since $1 = \|a\|_1 = \langle a , \mathrm{sign}(a) \rangle$, we have $\langle u , v \rangle = 1/\gamma$. Consider a matrix $Q = [u, q_2, \ldots, q_d]$ such that $Q^\top Q = Q Q^\top = I$. Let $e_1 = (1, 0, \ldots, 0)^\top$. For any matrix $B \in \mathbb{R}^{d \times d}$ we have $\|B\| = \|Q^\top B Q\|$ and $\|B\|^2 = \|B B^\top\|$. Using these remarks and the facts that $\|Q^\top v\| = 1$ and $Q^\top u = e_1$, we deduce that
$$
\big\| I - a\, \mathrm{sign}(a)^\top \big\|^2 = \big\| \big( I - \gamma e_1 (Q^\top v)^\top \big) \big( I - \gamma e_1 (Q^\top v)^\top \big)^\top \big\| = \big\| I - \gamma e_1 (Q^\top v)^\top - \gamma (Q^\top v) e_1^\top + \gamma^2 e_1 e_1^\top \big\| = \|A\| \,,
$$
where
$$
A = \begin{pmatrix} \gamma^2 - 1 & -\gamma \bar v^\top \\ -\gamma \bar v & I \end{pmatrix} \,,
$$
with $(\bar v)_j = \langle q_{j+1} , v \rangle$ for $j = 1, \ldots, d - 1$. Let us find the eigenvalues of $A$. For any $\lambda \in \mathbb{R}$, using the expression for the determinant of a block matrix, we get
$$
\det(A - \lambda I) = (1 - \lambda)^{d - 1} \left( \gamma^2 - 1 - \lambda - \frac{\gamma^2 \|\bar v\|^2}{1 - \lambda} \right) \,.
$$
Note that $1 = \|Q^\top v\|^2 = \frac{1}{\gamma^2} + \|\bar v\|^2$. Hence,
$$
\det(A - \lambda I) = (1 - \lambda)^{d - 2} \big( (1 - \lambda)(\gamma^2 - 1 - \lambda) - (\gamma^2 - 1) \big) = (1 - \lambda)^{d - 2} (\lambda - \gamma^2)\, \lambda \,.
$$
Thus, $\big\| I - a\, \mathrm{sign}(a)^\top \big\| = \max\{\gamma, 1\} = \max\{\sqrt{d}\, \|a\| , 1\}$. We conclude the proof by observing that $\sqrt{d}\, \|a\| \geq \|a\|_1 = 1$.
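Lemma 30 is straightforward to check numerically. The following sketch (ours, arbitrary dimensions) compares the spectral norm of $I - a\, \mathrm{sign}(a)^\top$ with $\sqrt{d}\, \|a\|$ for random vectors normalized so that $\|a\|_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
results = []
for d in (2, 5, 20):
    a = rng.standard_normal(d)
    a /= np.abs(a).sum()             # normalize so that ||a||_1 = 1
    M = np.eye(d) - np.outer(a, np.sign(a))
    spec = np.linalg.norm(M, 2)      # largest singular value of I - a sign(a)^T
    results.append((spec, np.sqrt(d) * np.linalg.norm(a)))
print(results)  # the two entries of each pair coincide
```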

Finally, we provide the following auxiliary lemma used in the proof of Lemma 9.

Lemma 31 For all $d \geq 3$ and all $c \in \mathbb{R}^d$, $h > 0$ it holds that
$$
E\big[ (\|c\| + h \|\zeta^\diamond\|)^2\, \|\zeta^\diamond\|^2 \big] \;\leq\; \left( \|c\| \sqrt{\frac{2}{d + 1}} + h\, \frac{2}{d} \right)^2 \,.
$$

Proof Observe that the vector $|\zeta^\diamond| := (|\zeta_1^\diamond|, \ldots, |\zeta_d^\diamond|)^\top$ follows the Dirichlet distribution (i.e., the uniform distribution on the probability simplex on $d$ atoms). In what follows we will make use of the following expression for the moments of the Dirichlet distribution:
$$
E[(\zeta^\diamond)^m] = \frac{\Gamma(d)}{\Gamma(d + |m|)} \prod_{i=1}^d \Gamma(m_i + 1) = \frac{(d - 1)!\, m!}{(d - 1 + |m|)!} \,, \quad (38)
$$
for any multi-index $m = (m_1, \ldots, m_d) \in \mathbb{N}^d$ with even coordinates.
Using (38) we get
$$
E\|\zeta^\diamond\|^2 = \frac{2}{d + 1} \,. \quad (39)
$$
Furthermore, using the multinomial identity and the expression for the moments in (38) we find
$$
E\|\zeta^\diamond\|^4 = \sum_{|m| = 2} \frac{2}{m!}\, E[(\zeta^\diamond)^{2m}] = \sum_{|m| = 2} \frac{2}{m!} \cdot \frac{(d - 1)!\, (2m)!}{(d + 3)!} = 2\, \frac{(d - 1)!}{(d + 3)!} \sum_{|m| = 2} \frac{(2m)!}{m!} \,.
$$
Direct calculations show that $\sum_{|m| = 2} \frac{(2m)!}{m!} = 2 d (d + 5)$. Hence, we deduce that, for all $d \geq 1$,
$$
E\|\zeta^\diamond\|^4 = \frac{4\, d!\, (d + 5)}{(d + 3)!} \,.
$$
Note that $\frac{d(d + 5)}{(d + 2)(d + 3)} \leq 1$ for all $d \geq 1$. Thus,
$$
E\|\zeta^\diamond\|^4 = \frac{4\, d!\, (d + 5)}{(d + 3)!} = \frac{4 (d + 5)}{(d + 1)(d + 2)(d + 3)} \leq \frac{4}{d(d + 1)} \,. \quad (40)
$$
Finally, observe that by the Cauchy–Schwarz inequality,
$$
E\big[ (\|c\| + h \|\zeta^\diamond\|)^2 \|\zeta^\diamond\|^2 \big] \leq h^2 E\|\zeta^\diamond\|^4 + 2 h \|c\| \sqrt{E\|\zeta^\diamond\|^2\, E\|\zeta^\diamond\|^4} + \|c\|^2 E\|\zeta^\diamond\|^2 = \left( \|c\| \sqrt{E\|\zeta^\diamond\|^2} + h \sqrt{E\|\zeta^\diamond\|^4} \right)^2 \,.
$$
Combining this bound with (39) and (40) concludes the proof.
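The moment formulas (39) and the exact value in (40) can be verified by sampling the $\ell_1$ sphere via normalized Laplace vectors, as in the proof of Lemma 10. A short check of our own:

```python
import random

rng = random.Random(4)
d, N = 3, 300_000
m2 = m4 = 0.0
for _ in range(N):
    # uniform point on the l1 sphere: normalized i.i.d. Laplace coordinates
    w = [rng.expovariate(1.0) * (1 if rng.random() < 0.5 else -1) for _ in range(d)]
    s = sum(abs(x) for x in w)
    sq = sum((x / s) ** 2 for x in w)   # ||zeta||^2 for one draw
    m2 += sq / N
    m4 += sq * sq / N

exact2 = 2 / (d + 1)                                    # (39)
exact4 = 4 * (d + 5) / ((d + 1) * (d + 2) * (d + 3))    # exact value in (40)
print(m2, exact2, m4, exact4)
```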

Appendix C. A technical lemma

In this section, we provide a lemma which will be useful to handle recursive relations in the main proofs. It is a direct extension of (Akhavan et al., 2020, Lemma D.1).

Lemma 32 Let $\{\delta_t\}_{t \geq 1}$ be a sequence of real numbers such that for all integers $t > t_0 \geq 1$,
$$
\delta_{t+1} \leq \Big( 1 - \frac{c}{t} \Big) \delta_t + \sum_{i=1}^N \frac{a_i}{t^{p_i + 1}} \,, \quad (41)
$$
where $c \geq 1$, $p_i \in (0, c)$ and $a_i \geq 0$ for $i \in [N]$. Then for $t \geq t_0 \geq c + 1$, we have
$$
\delta_t \leq \frac{2 (t_0 - 1)\, \delta_{t_0}}{t} + \sum_{i=1}^N \frac{a_i}{(c - p_i)\, t^{p_i}} \,. \quad (42)
$$

Proof For any fixed $t > 0$ the convexity of the mapping $u \mapsto g(u) = (t + u)^{-p}$ implies that $g(1) - g(0) \geq g'(0)$, i.e., $\frac{1}{t^p} - \frac{1}{(t+1)^p} \leq \frac{p}{t^{p+1}}$. Thus, for any $p \in (0, c)$ and $a \geq 0$,
$$
\frac{a}{t^{p+1}} \leq \frac{a}{c - p} \left( \frac{1}{(t+1)^p} - \Big( 1 - \frac{c}{t} \Big) \frac{1}{t^p} \right) \,. \quad (43)
$$
Using (41), (43) and rearranging terms, for any $t \geq t_0$ we get
$$
\delta_{t+1} - \sum_{i=1}^N \frac{a_i}{(c - p_i)(t+1)^{p_i}} \leq \Big( 1 - \frac{c}{t} \Big) \left\{ \delta_t - \sum_{i=1}^N \frac{a_i}{(c - p_i)\, t^{p_i}} \right\} \,.
$$
Letting $\tau_t = \delta_t - \sum_{i=1}^N \frac{a_i}{(c - p_i) t^{p_i}}$ we have $\tau_{t+1} \leq (1 - \frac{c}{t}) \tau_t$. Now, if $\tau_{t_0} \leq 0$ then $\tau_t \leq 0$ for any $t \geq t_0$ and thus (42) holds. Otherwise, if $\tau_{t_0} > 0$ then for $t \geq t_0 + 1$ we have
$$
\tau_t \leq \tau_{t_0} \prod_{i = t_0}^{t - 1} \Big( 1 - \frac{c}{i} \Big) \leq \tau_{t_0} \prod_{i = t_0}^{t - 1} \Big( 1 - \frac{1}{i} \Big) = \frac{(t_0 - 1)\, \tau_{t_0}}{t - 1} \leq \frac{2 (t_0 - 1)\, \delta_{t_0}}{t} \,.
$$
Thus, (42) holds in this case as well.
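Lemma 32 can be exercised on a concrete recursion: run (41) with equality (the worst admissible case) and check (42) along the whole trajectory. The parameters below are our own arbitrary choices satisfying the lemma's conditions ($c \geq 1$, $p \in (0, c)$, $t_0 \geq c + 1$).

```python
c, p, a = 2.0, 1.0, 1.0      # one term: c >= 1, p in (0, c), a >= 0
t0 = 3                       # t0 >= c + 1

delta = {t0: 1.0}
for t in range(t0, 2000):
    # iterate the recursion (41) with equality
    delta[t + 1] = (1 - c / t) * delta[t] + a / t ** (p + 1)

ok = all(
    delta[t] <= 2 * (t0 - 1) * delta[t0] / t + a / ((c - p) * t ** p)
    for t in range(t0, 2001)
)
print(ok)  # True: the bound (42) holds along the whole trajectory
```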

Appendix D. Upper bounds

D.1 Upper bounds: Only smoothness assumption
Proof of Lemma 12 For brevity we write $E_t[\cdot]$ in place of $E[\cdot \mid x_t]$. Using the Lipschitz continuity of $\nabla f$ and the definition of the algorithm in (4) we can write
$$
E_t[f(x_{t+1})] \leq f(x_t) - \eta_t \langle \nabla f(x_t) , E_t[g_t] \rangle + \frac{\bar L \eta_t^2}{2} E_t\big[ \|g_t\|^2 \big]
\leq f(x_t) - \eta_t \|\nabla f(x_t)\|^2 + \eta_t \|\nabla f(x_t)\|\, \| E_t[g_t] - \nabla f(x_t) \| + \frac{\bar L \eta_t^2}{2} E_t\big[ \|g_t\|^2 \big] \,.
$$
Furthermore, invoking the assumption on the bias and the variance of $g_t$ and using the fact that $2ab \leq a^2 + b^2$, we deduce
$$
\begin{aligned}
E_t[f(x_{t+1})] - f(x_t)
&\leq - \eta_t \|\nabla f(x_t)\|^2 + \eta_t b_t \|\nabla f(x_t)\| + \frac{\bar L \eta_t^2}{2} \big( v_t + V_1 \|\nabla f(x_t)\|^2 \big) \\
&\leq - \eta_t \|\nabla f(x_t)\|^2 + \frac{\eta_t}{2} \big( b_t^2 + \|\nabla f(x_t)\|^2 \big) + \frac{\bar L \eta_t^2}{2} \big( v_t + V_1 \|\nabla f(x_t)\|^2 \big) \quad (44) \\
&= - \frac{\eta_t}{2} \big( 1 - \bar L \eta_t V_1 \big) \|\nabla f(x_t)\|^2 + \frac{\eta_t}{2} \big( b_t^2 + \bar L \eta_t v_t \big) \,.
\end{aligned}
$$
Let $S$ be a random variable with values in $\{1, \ldots, T\}$, which is independent from $x_1, \ldots, x_T, g_1, \ldots, g_T$ and such that
$$
P(S = t) = \frac{\eta_t \big( 1 - \bar L \eta_t V_1 \big)}{\sum_{t=1}^T \eta_t \big( 1 - \bar L \eta_t V_1 \big)} \,.
$$
Assume that $\eta_t$ in (4) is chosen to satisfy $\bar L \eta_t V_1 < 1$ and that $f^\star > -\infty$. Taking total expectation in (44) and summing up these inequalities for $t \leq T$, combined with the fact that $f(x_{T+1}) \geq f^\star$, we deduce that
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \frac{2 \big( E[f(x_1)] - f^\star \big) + \sum_{t=1}^T \eta_t \big( b_t^2 + \bar L \eta_t v_t \big)}{\sum_{t=1}^T \eta_t \big( 1 - \bar L \eta_t V_1 \big)} \,.
$$
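To see the scheme analysed in Lemma 12 in action, the sketch below runs a two-point $\ell_2$-randomized zero-order gradient method on a toy quadratic. This is our illustration, not the paper's implementation: the weight $K(r) = 3r$ (which satisfies $E[r K(r)] = 1$), the step size, the discretization parameter and the horizon are ad-hoc choices that are adequate for this smooth, noiseless example.

```python
import random
import math

def sample_sphere(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(v * v for v in g))
    return [v / n for v in g]

def zo_gradient(f, x, h, rng):
    """Two-point l2-randomized estimate of grad f(x); K(r) = 3r is an
    illustrative weight, sufficient for the beta = 2 case."""
    d = len(x)
    zeta = sample_sphere(d, rng)
    r = rng.uniform(-1.0, 1.0)
    diff = f([xi + h * r * zi for xi, zi in zip(x, zeta)]) - \
           f([xi - h * r * zi for xi, zi in zip(x, zeta)])
    scale = (d / (2 * h)) * diff * (3 * r)
    return [scale * zi for zi in zeta]

rng = random.Random(5)
f = lambda x: 0.5 * sum(xi * xi for xi in x)   # grad f(x) = x, noiseless oracle
d, eta, h, T = 5, 0.05, 0.01, 500
x = [1.0] * d
start = math.sqrt(sum(v * v for v in x))
for _ in range(T):
    g = zo_gradient(f, x, h, rng)
    x = [xi - eta * gi for xi, gi in zip(x, g)]  # unconstrained step (Theta = R^d)
end = math.sqrt(sum(v * v for v in x))
print(start, end)  # the gradient norm ||x|| decreases substantially
```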


Proof of Theorem 13 The proof will be split into two parts: for gradient estimators (6) and (7), respectively. Both parts rely on Lemma 12, which states that
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \frac{2 \delta_1 + \sum_{t=1}^T \eta_t \big( b_t^2 + \bar L \eta_t v_t \big)}{\sum_{t=1}^T \eta_t \big( 1 - \bar L \eta_t V_1 \big)} \,, \quad (45)
$$
where $\delta_1 = E[f(x_1)] - f^\star$. Using the corresponding bounds on the bias $b_t$ and variance $v_t$, we substitute these values in the above inequality with $\eta_t$ and $h_t$ obtained by optimizing the resulting expressions.
We start with the part of the proof that is common for both gradient estimators. Introduce the notation
$$
\Xi_T := d^{-\frac{2(\beta - 1)}{2\beta - 1}}\, T^{-\frac{\beta}{2\beta - 1}} \,.
$$
Using this notation, we consider algorithm (4) with gradient estimators (6) or (7) such that
$$
\eta_t = \min\Big\{ \frac{\mathsf{y}}{d} ,\, \Xi_T \Big\} \quad \text{and} \quad h_t = \mathsf{h}\, T^{-\frac{1}{2(2\beta - 1)}} \,,
$$
where
$$
(\mathsf{y}, \mathsf{h}) =
\begin{cases}
\big( (8 \kappa \bar L)^{-1} ,\, d^{\frac{1}{2\beta - 1}} \big) & \text{for estimator (6)} \,, \\
\big( (72 \kappa \bar L)^{-1} ,\, d^{\frac{2\beta + 1}{4\beta - 2}} \big) & \text{for estimator (7)} \,.
\end{cases}
$$
Given the values of $V_1$ in Table 1, the choice of $\eta_t$ for both algorithms ensures that
$$
\frac{1}{2} \leq 1 - \bar L \eta_t V_1 \,.
$$
Thus we get from (45) that both algorithms satisfy
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \Big( \sum_{t=1}^T \eta_t \Big)^{-1} \left( 4 \delta_1 + 2 \sum_{t=1}^T \eta_t b_t^2 + 2 \bar L \sum_{t=1}^T \eta_t^2 v_t \right) \,. \quad (46)
$$
Furthermore, since $\eta_t = \min\{\mathsf{y}/d, \Xi_T\}$, in both cases we have
$$
\Big( \sum_{t=1}^T \eta_t \Big)^{-1} = \max\Big\{ \frac{d}{T \mathsf{y}} ,\, \frac{1}{T \Xi_T} \Big\} \leq \frac{d}{T \mathsf{y}} + \frac{1}{T \Xi_T} \,.
$$
Using this bound in (46) we deduce that
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \Big( \frac{d}{T \mathsf{y}} + \frac{1}{T \Xi_T} \Big) \left( 4 \delta_1 + 2 \sum_{t=1}^T \eta_t b_t^2 + 2 \bar L \sum_{t=1}^T \eta_t^2 v_t \right) \,.
$$
Finally, by the definition of $\eta_t$ we have $\eta_t \leq \Xi_T$ for all $t = 1, \ldots, T$, which yields
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \Big( \frac{d}{\mathsf{y}} + \frac{1}{\Xi_T} \Big) \frac{4 \delta_1}{T} + 2 \Big( \frac{d \Xi_T}{T \mathsf{y}} + \frac{1}{T} \Big) \sum_{t=1}^T \big( b_t^2 + \bar L \Xi_T v_t \big) \,. \quad (47)
$$
In the rest of the proof, we use the algorithm-specific bounds on $b_t$ and $v_t$ as well as the particular choice of $\mathsf{y}$ and $\mathsf{h}$ in order to get the final results.

D.2 Bounds for the gradient estimator (6): $\ell_2$ randomization

Lemma 6 for the bias and Lemma 7 for the variance imply that
$$
b_t^2 \leq \Big( \frac{\kappa_\beta L}{(\ell - 1)!} \Big)^2 h_t^{2(\beta - 1)} \,, \qquad v_t = 4 d \kappa \bar L^2 h_t^2 + \frac{d^2 \sigma^2 \kappa}{2 h_t^2} \,, \qquad V_1 = 4 d \kappa \,.
$$
Using these bounds in (47) we get
$$
\begin{aligned}
E\big[ \|\nabla f(x_S)\|^2 \big]
&\leq \Big( \frac{d}{\mathsf{y}} + \Xi_T^{-1} \Big) \frac{4 \delta_1}{T} + 2 \Big( \frac{d \Xi_T}{T \mathsf{y}} + \frac{1}{T} \Big) \sum_{t=1}^T \Big( A_3 h_t^{2(\beta - 1)} + \Xi_T d^2 \big( A_4 d^{-1} h_t^2 + A_5 h_t^{-2} \big) \Big) \\
&\leq \Big( \frac{d}{\mathsf{y}} + \Xi_T^{-1} \Big) \frac{4 \delta_1}{T} + \frac{d \Xi_T + 1}{T} \sum_{t=1}^T \Big\{ A_6 h_t^{2(\beta - 1)} + A_7 d^2 \Xi_T \big( d^{-1} h_t^2 + h_t^{-2} \big) \Big\} \,,
\end{aligned} \quad (48)
$$
where $A_3 = \big( \frac{\kappa_\beta L}{(\ell - 1)!} \big)^2$, $A_4 = 4 \kappa \bar L^3$, $A_5 = \frac{\kappa \sigma^2 \bar L}{2}$, and $A_6 = 2 A_3 (\mathsf{y}^{-1} + 1)$, $A_7 = 2 (\mathsf{y}^{-1} + 1)(A_4 + A_5)$. Since $h_t = \mathsf{h}_T := \mathsf{h}\, T^{-\frac{1}{2(2\beta - 1)}}$ for $t = 1, \ldots, T$, inequality (48) takes the form
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \Big( \frac{d}{\mathsf{y}} + \Xi_T^{-1} \Big) \frac{4 \delta_1}{T} + (d \Xi_T + 1) \Big( A_6 \mathsf{h}_T^{2(\beta - 1)} + A_7 d^2 \Xi_T \big( d^{-1} \mathsf{h}_T^2 + \mathsf{h}_T^{-2} \big) \Big) \,. \quad (49)
$$
After substituting the expressions for $\Xi_T$ and $\mathsf{h}_T$ into the above bound, the right hand side of (49) reduces to
$$
\left\{ \frac{4 d}{T \mathsf{y}} \delta_1 + 4 \delta_1 + \Big( \Big( \frac{d}{T^\beta} \Big)^{\frac{1}{2\beta - 1}} + 1 \Big) \Big( A_6 + A_7 \big( 1 + d^{\frac{5 - 2\beta}{2\beta - 1}} T^{-\frac{2}{2\beta - 1}} \big) \Big) \right\} \Big( \frac{d^2}{T} \Big)^{\frac{\beta - 1}{2\beta - 1}} \,.
$$
To conclude, we note that the assumption $T \geq d^{1/\beta}$ implies that for all $\beta \geq 2$ we have $d^{\frac{5 - 2\beta}{2\beta - 1}} T^{-\frac{2}{2\beta - 1}} \leq 1$ and $(d / T^\beta)^{\frac{1}{2\beta - 1}} \leq 1$. Therefore, the final bound takes the form
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \Big( \frac{4 d}{T \mathsf{y}} \delta_1 + 4 \delta_1 + 2 \big( A_6 + 2 A_7 \big) \Big) \Big( \frac{d^2}{T} \Big)^{\frac{\beta - 1}{2\beta - 1}} \leq \big( A_1 \delta_1 + A_2 \big) \Big( \frac{d^2}{T} \Big)^{\frac{\beta - 1}{2\beta - 1}} \,,
$$
where $A_1 = 4 (\mathsf{y}^{-1} + 1)$ and $A_2 = 2 (A_6 + 2 A_7)$.


D.3 Bounds for the gradient estimator (7): $\ell_1$ randomization

Lemma 8 for the bias and the bound (15) for the variance imply that
$$
b_t^2 \leq (c_\beta \kappa_\beta \ell L)^2\, h_t^{2(\beta - 1)}\, d^{1 - \beta} \,, \qquad v_t = 72 \kappa \bar L^2 h_t^2 + \frac{d^3 \sigma^2 \kappa}{h_t^2} \,, \qquad V_1 = 36 d \kappa \,,
$$
with $\ell = \lfloor \beta \rfloor$. Using these bounds in (47) we get
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq (d \Xi_T + 1) \Big( A_6 d^{1 - \beta} \mathsf{h}_T^{2(\beta - 1)} + \Xi_T \big( A_7 \mathsf{h}_T^2 + A_8 d^3 \mathsf{h}_T^{-2} \big) \Big) + \Big( \frac{d}{\mathsf{y}} + \Xi_T^{-1} \Big) \frac{4 \delta_1}{T} \,, \quad (50)
$$
where the constants are defined as
$$
A_6 = 2 (c_\beta \kappa_\beta \ell L)^2 (\mathsf{y}^{-1} + 1) \,, \qquad A_7 = 144 \kappa \bar L^3 (\mathsf{y}^{-1} + 1) \,, \qquad A_8 = 2 \bar L \sigma^2 \kappa (\mathsf{y}^{-1} + 1) \,.
$$
Substituting the expressions for $\Xi_T$ and $\mathsf{h}_T$ in (50), we deduce that
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \left\{ \frac{4 d}{T \mathsf{y}} \delta_1 + 4 \delta_1 + \Big( \Big( \frac{d}{T^\beta} \Big)^{\frac{1}{2\beta - 1}} + 1 \Big) \Big( A_6 + A_8 \Big( \frac{d^{5 - 2\beta}}{T^2} \Big)^{\frac{1}{2\beta - 1}} + A_7 \Big) \right\} \Big( \frac{d^2}{T} \Big)^{\frac{\beta - 1}{2\beta - 1}} \,.
$$
Finally, we assumed that $T \geq d^{1/\beta}$, which implies that both $\frac{d}{T^\beta}$ and $\frac{d^{5 - 2\beta}}{T^2}$ are at most one. Thus, we have
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \big( A_1 \delta_1 + A_2 \big) \Big( \frac{d^2}{T} \Big)^{\frac{\beta - 1}{2\beta - 1}} \,,
$$
where $A_1 = 4 (\mathsf{y}^{-1} + 1)$ and $A_2 = 2 (A_6 + A_7 + A_8)$.

Proof of Theorem 14 As in the proof of Theorem 13 we use Lemma 12, cf. (45):
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \frac{2 \delta_1 + \sum_{t=1}^T \eta_t \big( b_t^2 + \bar L \eta_t v_t \big)}{\sum_{t=1}^T \eta_t \big( 1 - \bar L \eta_t V_1 \big)} \,.
$$
From this inequality and the fact that, by assumption, $\eta_t = (2 \bar L V_1)^{-1}$, we obtain
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \frac{8 \bar L V_1 \delta_1 + 2 \sum_{t=1}^T \big( b_t^2 + (2 V_1)^{-1} v_t \big)}{T} \,.
$$
Since the gradient estimators (6) and (7) satisfy Assumption D and we consider the case $\sigma = 0$, $\beta = 2$, the values $b_t = \mathrm{b} L h_t$ and $v_t = V_2 \bar L^2 h_t^2$ can be made as small as desired by choosing $h_t$ small enough. Thus, we can take $h_t$ sufficiently small to have $\sum_{t=1}^T \big( b_t^2 + (2 V_1)^{-1} v_t \big) \leq \bar L V_1$. Under this choice of $h_t$,
$$
E\big[ \|\nabla f(x_S)\|^2 \big] \leq \frac{\bar L V_1 \big( 8 \delta_1 + 2 \big)}{T} \,.
$$
Using the values of $V_1$ for the gradient estimators (6) and (7) (see Table 1), we obtain the result.

D.4 Upper bounds: Smoothness and α-gradient dominance

Proof of Theorem 15 For brevity, we write $E_t[\cdot]$ in place of $E[\cdot \mid x_t]$. Using the Lipschitz continuity of $\nabla f$ (see e.g. Bubeck, 2015, Lemma 3.4) and the definition of the algorithm in (4) with $\Theta = \mathbb{R}^d$ we have
$$
E_t[f(x_{t+1})] \leq f(x_t) - \eta_t \langle \nabla f(x_t) , E_t[g_t] \rangle + \frac{\bar L \eta_t^2}{2} E_t\big[ \|g_t\|^2 \big]
\leq f(x_t) - \eta_t \|\nabla f(x_t)\|^2 + \eta_t \|\nabla f(x_t)\|\, \| E_t[g_t] - \nabla f(x_t) \| + \frac{\bar L \eta_t^2}{2} E_t\big[ \|g_t\|^2 \big] \,.
$$
Next, invoking Assumption D on the bias and variance of $g_t$ and using the elementary inequality $2ab \leq a^2 + b^2$, we get that, for the iterative procedure (4) with $\Theta = \mathbb{R}^d$,
$$
\delta_{t+1} \leq \delta_t - \frac{\eta_t}{2} \big( 1 - \bar L \eta_t V_1 \big) E\big[ \|\nabla f(x_t)\|^2 \big] + \frac{\eta_t}{2} \Big( \mathrm{b}^2 L^2 h_t^{2(\beta - 1)} + \bar L \eta_t \big( V_2 \bar L^2 h_t^2 + V_3 \sigma^2 h_t^{-2} \big) \Big) \,,
$$
where $\delta_t = E[f(x_t) - f^\star]$. Furthermore, our choice of the step size $\eta_t$ ensures that $1 - \bar L \eta_t V_1 \geq \frac{1}{2}$. Using this inequality and the fact that $f$ is $\alpha$-gradient dominant we deduce that
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{\eta_t \alpha}{2} \Big) + \frac{\eta_t}{2} \Big( \mathrm{b}^2 L^2 h_t^{2(\beta - 1)} + \bar L \eta_t \big( V_2 \bar L^2 h_t^2 + V_3 \sigma^2 h_t^{-2} \big) \Big) \,. \quad (51)
$$
We now analyze this recursion according to the cases $T > T_0$ and $T \leq T_0$, where $T_0 := \big\lfloor \frac{8 \bar L V_1}{\alpha} \big\rfloor$ is the value of $t$ at which $\eta_t$ switches its regime.


First case: $T > T_0$. In this case, the recursion (51) has two different regimes, depending on the value of $\eta_t$. In the first regime, for any $t = T_0 + 1, \ldots, T$, we have $\eta_t = \frac{4}{\alpha t}$ and (51) takes the form
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{t} \Big) + \frac{2 \mathrm{b}^2 L^2 h_t^{2(\beta - 1)}}{\alpha t} + \frac{8 \bar L}{\alpha^2 t^2} \big( V_2 \bar L^2 h_t^2 + V_3 \sigma^2 h_t^{-2} \big) \,. \quad (52)
$$
Additionally, in this regime of $t$ we have $h_t = \big( \frac{4 \bar L \sigma^2 V_3}{\mathrm{b}^2 L^2 \alpha t} \big)^{\frac{1}{2\beta}}$. Using this expression for $h_t$ in (52) we obtain that
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{t} \Big) + \frac{A_3}{\alpha} \cdot V_3^{1 - \frac{1}{\beta}} \big( \mathrm{b}^2 L^2 \big)^{\frac{1}{\beta}} \Big( \frac{\bar L \sigma^2}{\alpha} \Big)^{1 - \frac{1}{\beta}} t^{-\frac{2\beta - 1}{\beta}} + \frac{A_4}{\alpha^2} \cdot V_2 \bar L^3 \Big( \frac{\bar L \sigma^2 V_3}{\alpha\, \mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} t^{-\frac{2\beta + 1}{\beta}} \,, \quad (53)
$$
where $A_3 = 2^{4 - \frac{2}{\beta}}$ and $A_4 = 2^{3 + \frac{2}{\beta}}$. Applying Lemma 32 to the above recursion we get
$$
\delta_T \leq \frac{2 T_0}{T} \delta_{T_0 + 1} + \frac{\beta A_3}{(\beta + 1) \alpha} \cdot V_3 \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{-\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L \sigma^2} \Big)^{-\frac{\beta - 1}{\beta}} + \frac{\beta A_4}{(3\beta + 1) \alpha} \cdot V_2 \bar L^2 \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L} \Big)^{-\frac{\beta + 1}{\beta}} \,. \quad (54)
$$

If $T_0 = 0$, we conclude the proof for the case $T > T_0$. Otherwise, we consider the second regime, which corresponds to $t \in [1, T_0]$. In this regime, we have $h_t = \big( \frac{4 \bar L \sigma^2 V_3}{\mathrm{b}^2 L^2 \alpha T} \big)^{\frac{1}{2\beta}}$, $\eta_t = \frac{1}{2 \bar L V_1}$, and $\frac{4}{(T_0 + 1) \alpha} \leq \eta_t \leq \frac{4}{T_0 \alpha}$. Using these expressions for $h_t$ and $\eta_t$ in (51) we get that, for $1 \leq t \leq T_0$,
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{T_0 + 1} \Big) + \frac{2^{4 - \frac{2}{\beta}}}{T_0 \alpha} \cdot V_3^{1 - \frac{1}{\beta}} \big( \mathrm{b}^2 L^2 \big)^{\frac{1}{\beta}} \Big( \frac{\bar L \sigma^2}{\alpha} \Big)^{1 - \frac{1}{\beta}} \Big( T^{-\frac{\beta - 1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0} \Big) + \frac{2^{3 + \frac{2}{\beta}}}{T_0^2 \alpha} \cdot V_2 \bar L^2 \Big( \frac{\bar L}{\alpha} \Big)^{1 + \frac{1}{\beta}} \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} T^{-\frac{1}{\beta}} \,.
$$
Using the rough bound $1 - \frac{2}{T_0 + 1} \leq 1$ and unfolding the above recursion we obtain
$$
\delta_{T_0 + 1} \leq \delta_1 + \frac{2^{4 - \frac{2}{\beta}}}{\alpha} \cdot V_3^{1 - \frac{1}{\beta}} \big( \mathrm{b}^2 L^2 \big)^{\frac{1}{\beta}} \Big( \frac{\bar L \sigma^2}{\alpha} \Big)^{1 - \frac{1}{\beta}} \Big( T^{-\frac{\beta - 1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0} \Big) + \frac{2^{3 + \frac{2}{\beta}}}{T_0 \alpha} \cdot V_2 \bar L^2 \Big( \frac{\bar L}{\alpha} \Big)^{1 + \frac{1}{\beta}} \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} T^{-\frac{1}{\beta}} \,.
$$


Taking into account the definition of $T_0$ and the fact that $T_0 \leq T$, we further obtain
$$
\frac{2 T_0}{T} \delta_{T_0 + 1} \leq \frac{16 \bar L V_1}{\alpha T} \delta_1 + \frac{2 A_3}{\alpha} \cdot V_3 \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{-\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L \sigma^2} \Big)^{-\frac{\beta - 1}{\beta}} + \frac{2 A_4}{\alpha} \cdot V_2 \bar L^2 \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L} \Big)^{-\frac{\beta + 1}{\beta}} \,. \quad (55)
$$
Finally, combining (54) and (55) yields
$$
\delta_T \leq A_1 \frac{\bar L V_1}{\alpha T} \delta_1 + \frac{A_2}{\alpha} \left( V_3 \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{-\frac{1}{\beta}} + V_2 \bar L^2 \sigma^{-2} \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L \sigma^2} \Big)^{-\frac{2}{\beta}} \right) \Big( \frac{\alpha T}{\bar L \sigma^2} \Big)^{-\frac{\beta - 1}{\beta}} \,,
$$
where $A_1 = 16$ and $A_2 = \big( 2 + \frac{\beta}{\beta + 1} \big) A_3 + \big( 2 + \frac{\beta}{3\beta + 1} \big) A_4$.

Second case: $T \leq T_0$. In this case, we have $h_t = \big( \frac{4 \bar L \sigma^2 V_3}{\mathrm{b}^2 L^2 \alpha T} \big)^{\frac{1}{2\beta}}$ and thus (51) takes the form
$$
\begin{aligned}
\delta_{T+1} &\leq \delta_1 \Big( 1 - \frac{2}{T_0 + 1} \Big)^T + \frac{2^{4 - \frac{2}{\beta}}}{T_0 \alpha} \cdot V_3^{1 - \frac{1}{\beta}} \big( \mathrm{b}^2 L^2 \big)^{\frac{1}{\beta}} \Big( \frac{\bar L \sigma^2}{\alpha} \Big)^{1 - \frac{1}{\beta}} \sum_{t=1}^T \Big( T^{-\frac{\beta - 1}{\beta}} + \frac{T^{\frac{1}{\beta}}}{T_0} \Big) + \frac{2^{3 + \frac{2}{\beta}}}{T_0^2 \alpha} \cdot V_2 \bar L^2 \Big( \frac{\bar L}{\alpha} \Big)^{1 + \frac{1}{\beta}} \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \sum_{t=1}^T T^{-\frac{1}{\beta}} \\
&\leq \delta_1 \Big( 1 - \frac{2}{T_0 + 1} \Big)^T + \frac{A_3}{\alpha} \cdot V_3 \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{-\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L \sigma^2} \Big)^{-\frac{\beta - 1}{\beta}} + \frac{A_4}{\alpha} \cdot V_2 \bar L^2 \Big( \frac{V_3 \sigma^2}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \Big( \frac{\alpha T}{\bar L} \Big)^{-\frac{\beta + 1}{\beta}} \,.
\end{aligned}
$$
Note that, for any $\rho \in (0, 1)$ and $T > 0$, we have $(1 - \rho)^T \leq \exp(-\rho T) \leq \frac{1}{\rho T}$. Using this inequality with $\rho = \frac{2}{T_0 + 1}$, the definition of $T_0$, and the fact that $T + 1 \leq 2T$, we obtain
$$
\delta_{T+1} \leq A_1 \frac{\bar L V_1}{\alpha (T + 1)} \delta_1 + \frac{A_2}{\alpha} \left( V_3 \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{-\frac{1}{\beta}} + V_2 \bar L^2 \sigma^{-2} \Big( \frac{V_3}{\mathrm{b}^2 L^2} \Big)^{\frac{1}{\beta}} \Big( \frac{\alpha (T + 1)}{\bar L \sigma^2} \Big)^{-\frac{2}{\beta}} \right) \Big( \frac{\alpha (T + 1)}{\bar L \sigma^2} \Big)^{-\frac{\beta - 1}{\beta}} \,.
$$

Proof of Theorem 17 As in the proof of Theorem 15, we consider separately the cases $T > T_0$ and $T \leq T_0$, where $T_0 := \big\lfloor \frac{8 \bar L V_1}{\alpha} \big\rfloor$.


The case $T > T_0$. First, consider the algorithm at steps $t = T_0, \ldots, T$, where we have $\eta_t = \frac{4}{\alpha t}$. Since $\sigma = 0$ and $\beta = 2$, from (51) we have
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{t} \Big) + \Big( \frac{2 \mathrm{b}^2 L^2}{\alpha t} + \frac{8 \bar L^3}{\alpha^2 t^2} V_2 \Big) h_t^2 \,. \quad (56)
$$
Since $\beta = 2$, we have $L = \bar L$. Thus, using the assumption that $h_t \leq \Big( \frac{\bar L \vee 1}{\alpha \wedge 1}\, T \Big( 2 \mathrm{b}^2 \bar L + \frac{8 \bar L^2 V_2}{\alpha} \Big) \Big)^{-\frac{1}{2}}$, we deduce from (56) that
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{t} \Big) + \frac{\bar L}{\alpha t^2} \,. \quad (57)
$$
Applying Lemma 32 to the above recursion gives
$$
\delta_T \leq \frac{2 T_0}{T} \delta_{T_0 + 1} + \frac{\bar L}{\alpha T} \,. \quad (58)
$$
If $T_0 = 0$, we conclude the proof for the case $T > T_0$. Otherwise, we consider the algorithm at steps $t = 1, \ldots, T_0$, where $\eta_t = \frac{1}{2 \bar L V_1}$ and $\frac{4}{(T_0 + 1) \alpha} \leq \eta_t \leq \frac{4}{T_0 \alpha}$. From (51) with $\sigma = 0$ and $\beta = 2$ we obtain
$$
\delta_{t+1} \leq \delta_t \Big( 1 - \frac{2}{T_0 + 1} \Big) + \frac{\bar L}{\alpha T_0} \Big( 2 \mathrm{b}^2 \bar L + \frac{8}{\alpha} \bar L^2 V_2 \Big) h_t^2 \,. \quad (59)
$$
Using here the assumption that $h_t \leq \Big( \frac{\bar L \vee 1}{\alpha \wedge 1}\, T \Big( 2 \mathrm{b}^2 \bar L + \frac{8 \bar L^2 V_2}{\alpha} \Big) \Big)^{-\frac{1}{2}}$ and the rough bound $1 - \frac{2}{T_0 + 1} \leq 1$, and summing up both sides of the resulting inequality from $t = 1$ to $t = T_0$, we get
$$
\delta_{T_0 + 1} \leq \delta_1 + \frac{\bar L (\alpha \wedge 1)}{\alpha (\bar L \vee 1) T} \,.
$$
Combining this inequality with (58) and using the definition of $T_0$ and the fact that $T_0 \leq T$, we obtain the bound
$$
\delta_T \leq \frac{16 \bar L V_1}{\alpha T} \delta_1 + \frac{16 \bar L V_1}{\alpha T^2} + \frac{\bar L}{\alpha T} \,.
$$
It remains to note that $V_1 \leq 36 d \kappa$, cf. Table 1. This implies the theorem for the case $T > T_0$ with $A_1 = 576 \kappa$ and $A_2 = 1/T + 1/(A_1 d)$.
Second case: $T \leq T_0$. Using the fact that $h_t \leq \Big( \frac{\bar L \vee 1}{\alpha \wedge 1}\, T \Big( 2 \mathrm{b}^2 \bar L + \frac{8 \bar L^2 V_2}{\alpha} \Big) \Big)^{-\frac{1}{2}}$ and unfolding the recursion in (59) gives
$$
\delta_T \leq \delta_1 \Big( 1 - \frac{2}{T_0 + 1} \Big)^T + \frac{\bar L}{\alpha T} \,.
$$

AKHAVAN, CHZHEN, PONTIL, TSYBAKOV

From the elementary inequality $(1-\rho)^T \le \frac{1}{\rho T}$, which is valid for all $\rho\in(0,1)$ and $T>0$, we obtain with $\rho = \frac{2}{T_0+1}$ that
$$
\delta_T \le \delta_1\,\frac{T_0+1}{2T} + \frac{\bar L}{\alpha T} \le \delta_1\,\frac{T_0}{T} + \frac{\bar L}{\alpha T} \le A_1(\delta_1 + A_2)\,\frac{\bar Ld}{\alpha T},
$$
where $A_1 = 288\kappa$, $A_2 = \frac{1}{A_1d}$, and the last inequality follows from the facts that $T_0 \le \frac{8\bar LV_1}{\alpha}$ and $V_1 \le 36d\kappa$, cf. Table 1.
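The $O(1/T)$ behaviour produced by recursions of the form (57) can be illustrated numerically. A minimal sketch (with the arbitrary normalization $\bar L/\alpha = 1$ and an arbitrary starting value; not part of the proof):

```python
def run_recursion(T, delta_start=1.0, c=1.0):
    # iterate delta_{t+1} = (1 - 2/t) * delta_t + c / t**2,
    # starting at t = 3 so that the factor 1 - 2/t lies in (0, 1)
    delta = delta_start
    for t in range(3, T):
        delta = (1 - 2 / t) * delta + c / t ** 2
    return delta

# T * delta_T stays bounded, i.e. delta_T = O(1/T)
assert all(T * run_recursion(T) < 5 for T in (10 ** 3, 10 ** 4, 10 ** 5))
```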

D.5 Smoothness and α-strong convexity: unconstrained minimization


We will use the following basic lemma.

Lemma 33 Consider the iterative algorithm defined in (4). Let $f$ be $\alpha$-strongly convex on $\mathbb R^d$ and let Assumption D be satisfied. Let the minimizer $x^\star$ of $f$ on $\Theta$ be such that $\nabla f(x^\star) = 0$. Then we have
$$
\mathbf E[f(x_t)-f^\star] \le \frac{r_t-r_{t+1}}{2\eta_t} - \Big(\frac{\alpha}{4}-\frac{\eta_t}{2}\bar L^2V_1\Big)r_t + \frac{(bLh_t^{\beta-1})^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big), \tag{60}
$$

where $r_t = \mathbf E[\|x_t - x^\star\|^2]$.

Proof Recall the notation $\mathbf E_t[\cdot] = \mathbf E[\cdot \mid x_t]$. For any $x \in \Theta$, by the definition of the projection,
$$
\|x_{t+1}-x\|^2 = \big\|\mathrm{Proj}_\Theta\big(x_t-\eta_t g_t\big)-x\big\|^2 \le \|x_t-\eta_tg_t-x\|^2. \tag{61}
$$

Expanding the squares and rearranging the above inequality, we deduce that (61) is equivalent to
$$
\langle g_t, x_t-x\rangle \le \frac{\|x_t-x\|^2-\|x_{t+1}-x\|^2}{2\eta_t} + \frac{\eta_t}{2}\|g_t\|^2. \tag{62}
$$
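The chain (61)–(62) uses only the nonexpansiveness of the Euclidean projection onto the convex set $\Theta$. A small numerical illustration, taking $\Theta$ to be the unit ball purely for the example:

```python
import numpy as np

def proj_ball(u, radius=1.0):
    # Euclidean projection onto the centered ball of the given radius
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

rng = np.random.default_rng(0)
for _ in range(100):
    u = 3.0 * rng.normal(size=5)        # arbitrary point, possibly outside Theta
    x = proj_ball(rng.normal(size=5))   # arbitrary point inside Theta
    # ||Proj(u) - x|| <= ||u - x||, the inequality behind (61)
    assert np.linalg.norm(proj_ball(u) - x) <= np.linalg.norm(u - x) + 1e-12
```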
On the other hand, since $f$ is an $\alpha$-strongly convex function on $\Theta$, we have
$$
f(x_t)-f(x) \le \langle\nabla f(x_t),\, x_t-x\rangle - \frac{\alpha}{2}\|x_t-x\|^2. \tag{63}
$$
Combining (62) with (63) and introducing the notation $a_t = \|x_t-x^\star\|^2$ we deduce that
$$
\mathbf E_t[f(x_t)-f(x^\star)] \le \|\mathbf E_t[g_t]-\nabla f(x_t)\|\,\|x_t-x^\star\| + \frac{1}{2\eta_t}\mathbf E_t[a_t-a_{t+1}] + \frac{\eta_t}{2}\mathbf E_t\|g_t\|^2 - \frac{\alpha}{2}\mathbf E_t[a_t] \tag{64}
$$
$$
\le bLh_t^{\beta-1}\|x_t-x^\star\| + \frac{1}{2\eta_t}\mathbf E_t[a_t-a_{t+1}] + \frac{\eta_t}{2}\big(V_1\mathbf E_t[\|\nabla f(x_t)\|^2] + V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big) - \frac{\alpha}{2}\mathbf E_t[a_t]. \tag{65}
$$


Using the elementary inequality $2ab \le a^2 + b^2$ we have
$$
bLh_t^{\beta-1}\|x_t-x^\star\| \le \frac{(bLh_t^{\beta-1})^2}{\alpha} + \frac{\alpha}{4}\|x_t-x^\star\|^2. \tag{66}
$$
Substituting (66) in (65), setting $r_t = \mathbf E[a_t]$, using the fact that
$$
\|\nabla f(x_t)\|^2 \le \bar L^2\|x_t-x^\star\|^2 = \bar L^2 a_t
$$
(since $\nabla f$ is $\bar L$-Lipschitz and $\nabla f(x^\star) = 0$), and taking the total expectation of both sides of the resulting inequality yields the lemma.

Proof of Theorem 18  By definition, $\eta_t \le \frac{\alpha}{4\bar L^2V_1}$, so that $\frac{\alpha}{4}-\frac{\eta_t}{2}\bar L^2V_1 \ge \frac{\alpha}{8}$ and (60) implies that
$$
\mathbf E[f(x_t)-f^\star] \le \frac{r_t-r_{t+1}}{2\eta_t} - \frac{\alpha}{8}r_t + \frac{(bLh_t^{\beta-1})^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big). \tag{67}
$$
2ηt 8 α 2
Set $T_0 := \max\big\{\big\lfloor \frac{32\bar L^2V_1}{\alpha^2}-1\big\rfloor,\, 0\big\}$. This is the value of $t$ at which $\eta_t$ switches its regime. We analyze the recursion (67) separately for the cases $T > T_0$ and $T \le T_0$. We will use the fact that, by the convexity of $f$ and Jensen's inequality,
$$
f(\hat x_T)-f^\star \le \frac{2}{T(T+1)}\sum_{t=1}^T t\big(f(x_t)-f^\star\big). \tag{68}
$$
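Inequality (68) is Jensen's inequality applied to the weighted average $\hat x_T = \frac{2}{T(T+1)}\sum_{t=1}^T t\,x_t$ (this averaging scheme for the iterate is assumed here for illustration). A quick numerical check with a convex quadratic:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 20, 3
xs = rng.normal(size=(T, d))
f = lambda x: float(np.sum(x ** 2))   # convex test function with f* = 0

w = np.arange(1, T + 1)               # weights t = 1, ..., T
x_hat = (2.0 / (T * (T + 1))) * (w[:, None] * xs).sum(axis=0)

# f(x_hat) - f* <= (2 / (T(T+1))) * sum_t t * (f(x_t) - f*)
rhs = (2.0 / (T * (T + 1))) * sum(t * f(x) for t, x in zip(w, xs))
assert f(x_hat) <= rhs + 1e-12
```

The weights $2t/(T(T+1))$ sum to one, so $\hat x_T$ is a convex combination of the iterates and Jensen's inequality applies.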

First case: $T > T_0$. In this case, we decompose the sum in (68) into the sum over $t \in [T_0+1, T]$ and the sum over $t \in [1, T_0]$.
We first evaluate the sum $\sum_{t=T_0+1}^T t\,\mathbf E[f(x_t)-f^\star]$. For any $t \in [T_0+1, T]$, we have $\eta_t = \frac{8}{\alpha(t+1)}$ and $h_t = \big(\frac{4\sigma^2V_3}{b^2L^2t}\big)^{\frac{1}{2\beta}}$. Using in (67) these values of $\eta_t$ and $h_t$, multiplying by $t$, and summing both sides of the resulting inequality from $T_0+1$ to $T$ we deduce that
$$
\sum_{t=T_0+1}^T t\,\mathbf E[f(x_t)-f^\star] \le \underbrace{\frac{\alpha}{16}\sum_{t=T_0+1}^T \big(t(t+1)(r_t-r_{t+1}) - 2tr_t\big)}_{=:\,\mathrm I} + \underbrace{\frac{A_4}{\alpha}(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}}\sum_{t=T_0+1}^T t^{\frac1\beta}}_{=:\,\mathrm{II}} + \underbrace{\frac{A_5}{\alpha}\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}\sum_{t=T_0+1}^T t^{-\frac1\beta}}_{=:\,\mathrm{III}},
$$
where $A_4 = 2^{\frac{3\beta-2}{\beta}}$, $A_5 = 2^{\frac{2\beta+2}{\beta}}$, and we defined the terms I, II, and III that will be evaluated separately.


It is not hard to check that $\mathrm I \le \frac{\alpha}{16}TT_0r_{T_0+1}$ since the summation in term I is telescoping. Next, we have
$$
\mathrm{II} \le \frac{A_4}{\alpha}(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}}\sum_{t=1}^T t^{\frac1\beta} \le \frac{A_6}{\alpha}(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}}T^{\frac1\beta+1}.
$$

Finally,
$$
\mathrm{III} \le \frac{A_5}{\alpha}\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}\sum_{t=1}^T t^{-\frac1\beta} \le \frac{A_7}{\alpha}\bar L^2V_2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{1-\frac1\beta},
$$
where $A_6 = \frac{\beta+1}{\beta}A_4$ and $A_7 = \frac{\beta}{\beta-1}A_5$. Combining these bounds on I, II, and III we obtain
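The passage from the sums $\sum_{t=1}^T t^{1/\beta}$ and $\sum_{t=1}^T t^{-1/\beta}$ to powers of $T$ rests on standard integral comparisons; they can be confirmed numerically (an illustrative check, assuming the comparison constants $\frac{\beta+1}{\beta}$ and $\frac{\beta}{\beta-1}$):

```python
def sum_bounds_hold(beta, T):
    s_pos = sum(t ** (1 / beta) for t in range(1, T + 1))
    s_neg = sum(t ** (-1 / beta) for t in range(1, T + 1))
    # sum t^{1/beta}  <= ((beta+1)/beta) * T^{(beta+1)/beta}
    # sum t^{-1/beta} <= (beta/(beta-1)) * T^{(beta-1)/beta}
    ok_pos = s_pos <= (beta + 1) / beta * T ** ((beta + 1) / beta)
    ok_neg = s_neg <= beta / (beta - 1) * T ** ((beta - 1) / beta)
    return ok_pos and ok_neg

assert all(sum_bounds_hold(b, T) for b in (2, 3, 5) for T in (1, 10, 1000))
```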

$$
\sum_{t=T_0+1}^T t\,\mathbf E[f(x_t)-f^\star] \le \frac{\alpha}{16}TT_0r_{T_0+1} + \Big(A_6(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + A_7V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big)\frac{T^{\frac1\beta+1}}{\alpha}. \tag{69}
$$
b2 L2 α

If $T_0 = 0$ then combining (68) and (69) proves the theorem. If $T_0 \ge 1$ we additionally need to control the value $r_{T_0+1}$ on the right-hand side of (69). It follows from (67) that, for $1 \le t \le T_0$,
$$
r_{t+1} \le r_t + 2\eta_t\frac{(bLh_t^{\beta-1})^2}{\alpha} + \eta_t^2\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big).
$$
Moreover, for $1 \le t \le T_0$ we have $\eta_t = \frac{\alpha}{4\bar L^2V_1}$ and $\eta_t \le \frac{8}{\alpha(T_0+1)}$. Therefore, unfolding the above recursion we get
$$
r_{T_0+1} \le r_1 + \sum_{t=1}^{T_0}\Big(\frac{16}{\alpha^2T_0}\big(bLh_t^{\beta-1}\big)^2 + \frac{64}{\alpha^2T_0^2}\big(V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big)\Big).
$$
t=1

  1
4σ 2 V3 2β
For 1 ≤ t ≤ T0 we have ht = b2 L2 T
, which yields

 β1 ! 1
V3 σ 2

2
2
β−1
2 − β2 Tβ
rT0 +1 ≤ r1 + 16 A4 (bL) (V3 σ ) β β + A5 V2 L̄ T ,
b2 L2 α2 T0

so that
$$
\frac{\alpha}{16}TT_0r_{T_0+1} \le \frac{2\bar L^2TV_1}{\alpha}r_1 + \Big(A_4(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + A_5V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big)\frac{T^{\frac1\beta+1}}{\alpha}. \tag{70}
$$


It follows from (69) and (70) that
$$
\sum_{t=T_0+1}^T t\,\mathbf E[f(x_t)-f^\star] \le \frac{2\bar L^2TV_1}{\alpha}r_1 + \Big\{(A_4+A_6)(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + (A_5+A_7)V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big\}\frac{T^{\frac1\beta+1}}{\alpha}. \tag{71}
$$
b2 L2 α

We now evaluate the sum $\sum_{t=1}^{T_0} t\,\mathbf E[f(x_t)-f^\star]$. Recall that for $t \in [1, T_0]$ the parameters $h_t$, $\eta_t$ take constant values: $h_t = \big(\frac{4\sigma^2V_3}{b^2L^2T}\big)^{\frac{1}{2\beta}}$ and $\eta_t = \frac{\alpha}{4\bar L^2V_1}$. Omitting in (67) the term $-\alpha r_t/8$, summing the resulting recursion from $1$ to $T_0$, and using the inequality $\eta_t \le \frac{8}{\alpha(T_0+1)}$ we obtain
the resulting recursion from 1 to T0 and using the inequality ηt ≤ α(T0 +1) we obtain

$$
\sum_{t=1}^{T_0} t\,\mathbf E[f(x_t)-f^\star] \le T\sum_{t=1}^{T_0}\mathbf E[f(x_t)-f^\star] \tag{72}
$$
$$
\le \frac{Tr_1}{2\eta_1} + T\sum_{t=1}^{T_0}\Big(\frac{(bLh_t^{\beta-1})^2}{\alpha} + \frac{\eta_t}{2}\big(V_2\bar L^2h_t^2+V_3\sigma^2h_t^{-2}\big)\Big)
$$
$$
\le \frac{Tr_1}{2\eta_1} + T^2\frac{(bLh_1^{\beta-1})^2}{\alpha} + \frac{2T}{\alpha}\big(V_2\bar L^2h_1^2+V_3\sigma^2h_1^{-2}\big)
$$
$$
\le \frac{2\bar L^2TV_1}{\alpha}r_1 + \Big(A_4(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + A_5V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big)\frac{T^{\frac1\beta+1}}{\alpha}.
$$
Summing up (71) and (72) and using (68) we obtain the bound of the theorem:
$$
\mathbf E[f(\hat x_T)-f^\star] \le A_1\frac{\bar L^2V_1}{\alpha T}r_1 + \Big(A_2(bL)^{\frac2\beta}(V_3\sigma^2)^{\frac{\beta-1}{\beta}} + A_3V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big)\frac{T^{-\frac{\beta-1}{\beta}}}{\alpha},
$$
where $A_1 = 8$, $A_2 = 4A_4 + 2A_6$, and $A_3 = 4A_5 + 2A_7$.


Second case: $T \le T_0$. In this scenario, the summation $\sum_{t=1}^T t\,\mathbf E[f(x_t)-f^\star]$ is treated as in (72), with the only difference that $T_0$ is replaced by $T$. As a result, we obtain the same bound as in (72).

D.6 Smoothness and α-strong convexity: constrained minimization


Proof of Theorem 20  Since $\sup_{x\in\Theta}\|\nabla f(x)\| \le G$, we get from (65) that, for any $t = 1,\dots,T$,
$$
\mathbf E[f(x_t)-f^\star] \le \frac{r_t-r_{t+1}}{2\eta_t} - \frac{\alpha}{4}r_t + \frac{(bLh_t^{\beta-1})^2}{\alpha} + \frac{\eta_t}{2}\big(V_1G^2 + V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big). \tag{73}
$$


Multiplying both sides of (73) by $t$, summing up from $t = 1$ to $T$, and using the fact that
$$
\sum_{t=1}^T\Big(\frac{t(r_t-r_{t+1})}{2\eta_t} - \frac{\alpha}{4}tr_t\Big) \le 0 \qquad \text{if } \eta_t = \frac{4}{\alpha(t+1)},
$$

we find that
$$
\sum_{t=1}^T t\,\mathbf E[f(x_t)-f^\star] \le \frac1\alpha\sum_{t=1}^T\Big(t\big(bLh_t^{\beta-1}\big)^2 + \frac{2t}{t+1}\big(V_1G^2 + V_2\bar L^2h_t^2 + V_3\sigma^2h_t^{-2}\big)\Big).
$$
Since $h_t = \big(\frac{\sigma^2V_3}{b^2L^2t}\big)^{\frac{1}{2\beta}}$ we obtain
$$
\sum_{t=1}^T t\,\mathbf E[f(x_t)-f^\star] \le \frac{2V_1G^2T}{\alpha} + \frac{A_3}{\alpha}\Big(V_3\sigma^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{-\frac1\beta} + V_2\bar L^2\Big(\frac{V_3\sigma^2}{b^2L^2}\Big)^{\frac1\beta}T^{-\frac2\beta}\Big)T^{1+\frac1\beta}, \tag{74}
$$
where $A_3 = 2$. To complete the proof, we multiply both sides of (74) by $\frac{2}{T(T+1)}$ and use (68).

Appendix E. Proof of the lower bounds


Set for brevity $A_t = \{(z_i, y_i)_{i=1}^t,\, (z_i', y_i')_{i=1}^t\}$ for $t \ge 1$. Without loss of generality, we assume that $z_1$ and $z_1'$ are fixed and we prove the result for any fixed $(z_1, z_1')$. For any $f:\mathbb R^d\to\mathbb R$ and any sequential strategy in the class $\Pi_T$, such that $z_t = \Phi_t(A_{t-1}, \tau_t)$ with $y_t = f(z_t)+\xi_t$ and $z_t' = \Phi_t'(A_{t-1}, \tau_t)$ for $t \ge 2$, with $y_t' = f(z_t')+\xi_t'$ for $t = 1,\dots,T$, we will denote by $\mathbf P_f$ the joint distribution of $(A_T, (\tau_i)_{i=2}^T)$.
We start with the following lemma that will be used in the proof of Theorem 22.
Lemma 34 Let Assumption E be satisfied. Then, for any functions $f, f':\mathbb R^d\to\mathbb R$ such that $\|f-f'\|_\infty = \max_{u\in\mathbb R^d}|f(u)-f'(u)| \le v_0$, it holds that
$$
\frac12 H^2(\mathbf P_f, \mathbf P_{f'}) \le 1 - \Big(1-\frac{I_0}{2}\|f-f'\|_\infty^2\Big)^T.
$$
Proof Since, for each $t \ge 2$, the noise $(\xi_t, \xi_t')$ is independent of $(A_{t-1}, (\tau_i)_{i=1}^t)$ and $\tau_t$ is independent of $A_{t-1}$, we have
$$
d\mathbf P_f = dF_1\big(y_1-f(z_1),\, y_1'-f(z_1')\big)\prod_{t=2}^T dF_t\big(y_t-f(\Phi_t(A_{t-1},\tau_t)),\, y_t'-f(\Phi_t'(A_{t-1},\tau_t))\big)\,d\mathbf P_t(\tau_t),
$$
where $\mathbf P_t$ is the probability measure corresponding to the distribution of $\tau_t$. Set for brevity $dF_{f,1} := dF_1\big(y_1-f(z_1),\, y_1'-f(z_1')\big)$ and $dF_{f,t} := dF_t\big(y_t-f(z_t),\, y_t'-f(z_t')\big)\,d\mathbf P_t(\tau_t)$, $t \ge 2$.


With this notation, we have $d\mathbf P_f = \prod_{t=1}^T dF_{f,t}$. Using the definition of the Hellinger distance we obtain
$$
1-\frac12H^2(\mathbf P_f, \mathbf P_{f'}) = \int\sqrt{d\mathbf P_f\,d\mathbf P_{f'}} = \prod_{t=1}^T\int\sqrt{dF_{f,t}\,dF_{f',t}} = \prod_{t=1}^T\Big(1-\frac{H^2\big(dF_{f,t},\, dF_{f',t}\big)}{2}\Big).
$$

Finally, invoking Assumption E, we get
$$
\prod_{t=1}^T\Big(1-\frac{H^2\big(dF_{f,t},\, dF_{f',t}\big)}{2}\Big) \ge \Big(\min_{1\le t\le T}\Big(1-\frac{H^2\big(dF_{f,t},\, dF_{f',t}\big)}{2}\Big)\Big)^T \ge \Big(1-\frac{I_0\|f-f'\|_\infty^2}{2}\Big)^T,
$$
which implies the lemma.
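The key step above is that the Hellinger affinity $1-H^2/2$ factorizes over product measures. This tensorization is easy to verify on discrete distributions (a numerical illustration, not part of the proof):

```python
import math
from itertools import product

def affinity(p, q):
    # Hellinger affinity 1 - H^2(p, q)/2 = sum_x sqrt(p(x) * q(x))
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

p1, q1 = [0.2, 0.8], [0.5, 0.5]
p2, q2 = [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]

# product distributions over the two coordinates
P = [a * b for a, b in product(p1, p2)]
Q = [a * b for a, b in product(q1, q2)]

# the affinity of a product measure is the product of the affinities
assert abs(affinity(P, Q) - affinity(p1, q1) * affinity(p2, q2)) < 1e-12
```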

Proof of Theorem 22  The proof follows the general lines given in Akhavan et al. (2020), so that we omit some details that can be found in that paper. We first assume that $\alpha \ge T^{-1/2+1/\beta}$.
Let $\eta_0:\mathbb R\to\mathbb R$ be an infinitely many times differentiable function such that
$$
\eta_0(x)\ \begin{cases} = 1 & \text{if } |x|\le 1/4,\\ \in(0,1) & \text{if } 1/4<|x|<1,\\ = 0 & \text{if } |x|\ge 1.\end{cases}
$$
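A function with exactly these properties can be built from the classical $C^\infty$ transition function $\varphi(t) = e^{-1/t}\mathbf 1(t>0)$; one standard construction (a sketch — the proof uses only the stated properties, not this particular formula):

```python
import math

def phi(t):
    # C^infinity on R, vanishing with all derivatives at t <= 0
    return math.exp(-1.0 / t) if t > 0 else 0.0

def smooth_step(t):
    # 0 for t <= 0, 1 for t >= 1, strictly between 0 and 1 otherwise
    return phi(t) / (phi(t) + phi(1 - t))

def eta0(x):
    # equals 1 on |x| <= 1/4, vanishes on |x| >= 1, lies in (0, 1) in between
    return smooth_step((1 - abs(x)) / 0.75)

assert eta0(0.0) == 1.0 and eta0(0.25) == 1.0
assert eta0(1.0) == 0.0 and eta0(2.0) == 0.0
assert 0.0 < eta0(0.5) < 1.0
```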

Set $\eta(x) = \int_{-\infty}^x \eta_0(\tau)\,d\tau$. Let $\Omega = \{-1,1\}^d$ be the set of binary sequences of length $d$. Consider the finite set of functions $f_\omega:\mathbb R^d\to\mathbb R$, $\omega = (\omega_1,\dots,\omega_d)\in\Omega$, defined as follows:
$$
f_\omega(u) = \alpha(1+\delta)\|u\|^2/2 + \sum_{i=1}^d \omega_i\, r h^\beta\,\eta\big(u_ih^{-1}\big), \qquad u = (u_1,\dots,u_d),
$$
where $\omega_i\in\{-1,1\}$, $h = \min\big\{(\alpha^2/d)^{\frac{1}{2(\beta-1)}},\, T^{-\frac{1}{2\beta}}\big\}$, and $r > 0$, $\delta > 0$ are fixed numbers that will be chosen small enough.
It is shown in Akhavan et al. (2020) that if $\alpha \ge T^{-1/2+1/\beta}$ then $f_\omega\in\mathcal F'_{\alpha,\beta}$ for $r > 0$ and $\delta > 0$ small enough, and the minimizers of the functions $f_\omega$ belong to $\Theta$ and are of the form
$$
x_\omega^* = \big(x^\star(\omega_1),\dots,x^\star(\omega_d)\big), \qquad \text{where } x^\star(\omega_i) = -\omega_i\alpha^{-1}(1+\delta)^{-1}rh^{\beta-1}.
$$
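The form of $x^\star(\omega_i)$ can be read off from the coordinate-wise first-order condition; a short derivation (under the assumption, guaranteed for $r$ small, that the minimizer lies in the region $|u_i|\le h/4$ where $\eta' = \eta_0 \equiv 1$):

```latex
\partial_{u_i} f_\omega(u)
  = \alpha(1+\delta)\,u_i + \omega_i\, r\, h^{\beta-1}\,\eta_0\!\big(u_i h^{-1}\big) = 0 .
```

For $|u_i|\le h/4$ we have $\eta_0(u_ih^{-1}) = 1$, so the unique solution is $u_i = -\omega_i\alpha^{-1}(1+\delta)^{-1}rh^{\beta-1}$, which indeed satisfies $|u_i|\le h/4$ as soon as $rh^{\beta-2} \le \alpha(1+\delta)/4$.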
For any fixed $\omega \in \Omega$, we denote by $\mathbf P_{\omega,T}$ the probability measure corresponding to the joint distribution of $(A_T, (\tau_i)_{i=2}^T)$, where $y_t = f_\omega(z_t)+\xi_t$ and $y_t' = f_\omega(z_t')+\xi_t'$ with $(\xi_t,\xi_t')$'s satisfying Assumption E, and $(z_t, z_t')$'s chosen by a sequential strategy in $\Pi_T$. Consider the statistic
$$
\hat\omega \in \operatorname*{arg\,min}_{\omega\in\Omega}\|\tilde x_T - x_\omega^*\|.
$$


Classical triangle inequality based arguments yield
$$
\max_{\omega\in\Omega}\mathbf E_{\omega,T}\big[\|\tilde x_T - x_\omega^*\|^2\big] \ge \alpha^{-2}r^2h^{2\beta-2}\,\inf_{\hat\omega}\max_{\omega\in\Omega}\mathbf E_{\omega,T}\big[\rho(\hat\omega,\omega)\big],
$$
where $\rho(\hat\omega,\omega) = \sum_{i=1}^d \mathbf 1(\hat\omega_i \ne \omega_i)$ denotes the Hamming distance.

Note that for all $\omega, \omega' \in \Omega$ such that $\rho(\omega, \omega') = 1$ we have
$$
\max_{u\in\mathbb R^d}|f_\omega(u)-f_{\omega'}(u)| \le 2rh^\beta\eta(1) \le 2rT^{-1/2}\eta(1).
$$

Thus, choosing $r$ small enough to satisfy $2r\eta(1) < \min(v_0, I_0^{-1/2})$ we ensure $2rT^{-1/2}\eta(1)\le v_0$, so that Lemma 34 applies and yields, for the considered $\omega, \omega' \in \Omega$,
$$
H^2(\mathbf P_{\omega,T}, \mathbf P_{\omega',T}) \le 2\Big(1-\big(1-(2T)^{-1}\big)^T\Big) \le 1,
$$

where we have used the fact that $1-x\ge 4^{-x}$ for $0<x\le 1/2$. Applying (Tsybakov, 2009, Theorem 2.12) we deduce that
$$
\inf_{\hat\omega}\max_{\omega\in\Omega}\mathbf E_{\omega,T}[\rho(\hat\omega,\omega)] \ge 0.3\, d.
$$

Therefore, we have proved that if $\alpha\ge T^{-1/2+1/\beta}$ there exist $r > 0$ and $\delta > 0$ such that
$$
\max_{\omega\in\Omega}\mathbf E_{\omega,T}\big[\|\tilde x_T-x_\omega^*\|^2\big] \ge 0.3\, d\alpha^{-2}r^2h^{2\beta-2} = 0.3\, r^2\min\Big\{1,\, \frac{d}{\alpha^2}T^{-\frac{\beta-1}{\beta}}\Big\}. \tag{75}
$$
This implies (26) for $\alpha \ge T^{-1/2+1/\beta}$. In particular, if $\alpha = \alpha_0 := T^{-1/2+1/\beta}$ the bound (75) is of the order $\min\{1, dT^{-1/\beta}\}$. Then for $0<\alpha<\alpha_0$ we also have the bound of this order since the classes $\mathcal F_{\alpha,\beta}$ are nested: $\mathcal F_{\alpha_0,\beta}\subset\mathcal F_{\alpha,\beta}$. This completes the proof of (26).
We now prove (25). From (75) and the $\alpha$-strong convexity of $f$ we get that, for $\alpha \ge T^{-1/2+1/\beta}$,
$$
\max_{\omega\in\Omega}\mathbf E_{\omega,T}\big[f(\tilde x_T)-f(x_\omega^*)\big] \ge 0.15\, r^2\min\Big\{\alpha,\, \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\Big\}. \tag{76}
$$
This implies (25) in the zone $\alpha \ge T^{-1/2+1/\beta} = \alpha_0$ since for such $\alpha$ we have
$$
\min\Big\{\alpha,\, \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\Big\} = \min\Big\{\max\big(\alpha,\, T^{-1/2+1/\beta}\big),\, \frac{d}{\sqrt T},\, \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\Big\}.
$$
On the other hand, $\min\big\{\alpha_0,\, \frac{d}{\alpha_0}T^{-\frac{\beta-1}{\beta}}\big\} = \min\big\{T^{-1/2+1/\beta},\, d/\sqrt T\big\}$. The same lower bound holds for $0<\alpha<\alpha_0$ by the nestedness argument that we used to prove (26) in the zone $0<\alpha<\alpha_0$. Thus, (25) follows.
