
Optimal Stochastic Non-smooth Non-convex Optimization through
Online-to-Non-convex Conversion

Ashok Cutkosky (Boston University, Boston, MA), [email protected]
Harsh Mehta (Google Research, Mountain View, CA), [email protected]
Francesco Orabona (Boston University, Boston, MA), [email protected]

arXiv:2302.03775v2 [cs.LG] 11 Feb 2023

Abstract
We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta, \epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.

1 Introduction
Algorithms for non-convex optimization are one of the most important tools in modern machine
learning, as training neural networks requires optimizing a non-convex objective. Given the abun-
dance of data in many domains, the time to train a neural network is the current bottleneck to
having bigger and more powerful machine learning models. Motivated by this need, the past
few years have seen an explosion of research focused on understanding non-convex optimiza-
tion (Ghadimi & Lan, 2013; Carmon et al., 2017; Arjevani et al., 2019; 2020; Carmon et al., 2019;
Fang et al., 2018). Despite significant progress, key issues remain unaddressed.
In this paper, we work to minimize a potentially non-convex objective $F: \mathbb{R}^d \to \mathbb{R}$ which we only access in some stochastic or "noisy" manner. As motivation, consider $F(x) \triangleq \mathbb{E}_z[f(x, z)]$, where $x$ represents the model weights, $z$ a minibatch of i.i.d. examples, and $f$ the loss of a model with parameters $x$ on the minibatch $z$. In keeping with standard empirical practice, we will restrict ourselves to first-order algorithms (gradient-based optimization).
The vast majority of prior analyses of non-convex optimization algorithms impose various
smoothness conditions on the objective (Ghadimi & Lan, 2013; Carmon et al., 2017; Allen-Zhu,
2018; Tripuraneni et al., 2018; Fang et al., 2018; Zhou et al., 2018; Fang et al., 2019; Cutkosky & Orabona,
2019; Li & Orabona, 2019; Cutkosky & Mehta, 2020; Zhang et al., 2020a; Karimireddy et al., 2020;
Levy et al., 2021; Faw et al., 2022; Liu et al., 2022). One motivation for smoothness assumptions is that they allow for a convenient surrogate for global minimization: rather than finding a global minimum of a neural network's loss surface (which may be intractable), we can hope to find an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla F(x)\| \le \epsilon$. By now, the fundamental limits on first-order smooth non-convex optimization are well understood: Stochastic Gradient Descent (SGD) will find an $\epsilon$-stationary point in $O(\epsilon^{-4})$ iterations, which is the optimal rate (Arjevani et al., 2019). Moreover, if $F$ happens to be second-order smooth, SGD requires only $O(\epsilon^{-3.5})$ iterations, which is also optimal (Fang et al., 2019; Arjevani et al., 2020). These optimality results motivate the popularity of SGD and its variants in practice (Kingma & Ba, 2014; Loshchilov & Hutter, 2016; 2018; Goyal et al., 2017; You et al., 2019).
Unfortunately, many standard neural network architectures are non-smooth (e.g., architectures
incorporating ReLUs or max-pools cannot be smooth). As a result, these analyses can only pro-
vide intuition about what might occur when an algorithm is deployed in practice: the theorems
themselves do not apply (see also the discussion in, e.g., Li et al. (2021)). Despite the obvious
need for non-smooth analyses, recent results suggest that even approaching a neighborhood of a
stationary point may be impossible for non-smooth objectives (Kornowski & Shamir, 2022b). Nev-
ertheless, optimization clearly is possible in practice, which suggests that we may need to rethink
our assumptions and goals in order to understand non-smooth optimization.
Fortunately, Zhang et al. (2020b) recently considered an alternative definition of stationarity that is tractable even for non-smooth objectives and which has attracted much interest (Davis et al., 2021; Tian et al., 2022; Kornowski & Shamir, 2022a; Tian & So, 2022; Jordan et al., 2022). Roughly speaking, instead of $\|\nabla F(x)\| \le \epsilon$, we ask that there is a random variable $y$ supported in a ball of radius $\delta$ about $x$ such that $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$. We call such an $x$ a $(\delta, \epsilon)$-stationary point, so that the previous definition ($\|\nabla F(x)\| \le \epsilon$) is a $(0, \epsilon)$-stationary point. The current best-known complexity for identifying a $(\delta, \epsilon)$-stationary point is $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient evaluations.

In this paper, we significantly improve this result: we can identify a $(\delta, \epsilon)$-stationary point with only $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. Moreover, we also show that this rate is optimal. Our primary technique is a novel online-to-non-convex conversion: a connection between non-convex stochastic optimization and online learning, a classical field of learning theory with a deep literature (Cesa-Bianchi & Lugosi, 2006; Hazan, 2019; Orabona, 2019). In particular, we show that an online learning algorithm that satisfies a shifting regret bound can be used to decide the update step, when fed with linear losses constructed from the stochastic gradients of the function $F$. By establishing this connection, we hope to open new avenues for algorithm design in non-convex optimization and also to motivate new research directions in online learning.
In sum, we make the following contributions:

• A reduction from non-convex non-smooth stochastic optimization to online learning: better online learning algorithms result in faster non-convex optimization. Applying this reduction to standard online learning algorithms allows us to identify a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. The previous best-known rate in this setting was $O(\epsilon^{-4}\delta^{-1})$.

• We show that the $O(\epsilon^{-3}\delta^{-1})$ rate is optimal for all $\delta, \epsilon$ such that $\epsilon \le O(\delta)$.

Additionally, we prove important corollaries for smooth $F$:

• The $O(\epsilon^{-3}\delta^{-1})$ complexity implies the optimal $O(\epsilon^{-4})$ and $O(\epsilon^{-3.5})$ respective complexities for finding $(0, \epsilon)$-stationary points of smooth or second-order smooth objectives.

• For deterministic and second-order smooth objectives, we obtain a rate of $O(\epsilon^{-3/2}\delta^{-1/2})$, which implies the best-known $O(\epsilon^{-7/4})$ complexity for finding $(0, \epsilon)$-stationary points.

2 Definitions and Setup
Here, we formally introduce our setting and notation. We are interested in optimizing real-valued functions $F: \mathcal{H} \to \mathbb{R}$ where $\mathcal{H}$ is a real Hilbert space (usually $\mathcal{H} = \mathbb{R}^d$). We assume $F^\star \triangleq \inf_x F(x) > -\infty$. We assume that $F$ is differentiable, but we do not assume that $F$ is smooth. All norms $\|\cdot\|$ are the Hilbert space norm (i.e., the 2-norm) unless otherwise specified. As mentioned in the introduction, the motivating example to keep in mind in our development is the case $F(x) = \mathbb{E}_z[f(x, z)]$.

Our algorithms access information about $F$ through a stochastic gradient oracle $\mathrm{Grad}: \mathcal{H} \times Z \to \mathcal{H}$. Given a point $x$ in $\mathcal{H}$, the oracle will sample an i.i.d. random variable $z \in Z$ and return $\mathrm{Grad}(x, z) \in \mathcal{H}$ such that $\mathbb{E}[\mathrm{Grad}(x, z)] = \nabla F(x)$ and $\mathrm{Var}(\mathrm{Grad}(x, z)) \le \sigma^2$.
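As a concrete illustration, here is a minimal Python sketch of this oracle interface for the motivating example $F(x) = \mathbb{E}_z[f(x,z)]$. The quadratic loss and Gaussian minibatch noise below are invented for the example and are not taken from the paper.

```python
import numpy as np

def grad_oracle(x, rng):
    """Stochastic gradient oracle Grad(x, z): sample z i.i.d. and return an
    unbiased estimate of the gradient of F(x) = E_z[f(x, z)].

    Illustrative choice: f(x, z) = ||x - z||^2 / 2 with z ~ N(x_star, I), so
    grad f(x, z) = x - z and E[Grad(x, z)] = x - x_star = grad F(x)."""
    x_star = np.zeros_like(x)                   # hypothetical minimizer
    z = x_star + rng.standard_normal(x.shape)   # sample a "minibatch" z
    return x - z                                # unbiased: E[x - z] = grad F(x)

rng = np.random.default_rng(0)
g = grad_oracle(np.ones(5), rng)                # one stochastic gradient query
```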
In the following, we only consider functions satisfying the following mild regularity condition.

Definition 1. We define a differentiable function $F: \mathcal{H} \to \mathbb{R}$ to be well-behaved if for all $x, y \in \mathcal{H}$, it holds that
$$F(y) - F(x) = \int_0^1 \langle \nabla F(x + t(y - x)), y - x\rangle\, dt\,.$$

Using this assumption, our results can be applied to improve the past results on non-smooth
stochastic optimization. In fact, Proposition 2 (proof in Appendix A) below shows that for the
wide class of functions that are locally Lipschitz, applying an arbitrarily small perturbation to the
function is sufficient to ensure both differentiability and well-behavedness. It works via standard
perturbation arguments similar to those used previously by Davis et al. (2021) (see also Bertsekas
(1973); Duchi et al. (2012); Flaxman et al. (2005) for similar techniques in the convex setting).

Proposition 2. Let $F: \mathbb{R}^d \to \mathbb{R}$ be locally Lipschitz with stochastic oracle $\mathrm{Grad}$ such that $\mathbb{E}_z[\mathrm{Grad}(x, z)] = \nabla F(x)$ whenever $F$ is differentiable. We have two cases:

• If $F$ is differentiable everywhere, then $F$ is well-behaved.

• If $F$ is not differentiable everywhere, let $p > 0$ be an arbitrary number and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x + pu)]$. Then, $\hat F$ is differentiable and well-behaved, and the oracle $\widehat{\mathrm{Grad}}(x, (z, u)) = \mathrm{Grad}(x + pu, z)$ is a stochastic gradient oracle for $\hat F$. Moreover, $F$ is differentiable at $x + pu$ with probability 1 and, if $F$ is $G$-Lipschitz, then $|\hat F(x) - F(x)| \le pG$ for all $x$.
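A minimal sketch of the perturbed oracle from the second case of Proposition 2, assuming a base oracle `grad_oracle(x)` that returns a stochastic gradient of $F$ at $x$ (the base oracle and the sampling helper are stand-ins, not part of the paper):

```python
import numpy as np

def smoothed_grad_oracle(grad_oracle, x, p, rng):
    """Oracle for F_hat(x) = E_u[F(x + p*u)], u uniform on the unit ball:
    sample u, then query the original oracle at the perturbed point x + p*u."""
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                 # uniform direction on the sphere
    r = rng.uniform() ** (1.0 / x.size)    # radius giving uniform sampling in the ball
    return grad_oracle(x + p * r * u)      # Grad(x + p*u, z) in the paper's notation
```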

Remark 3. We explicitly note that our results cover the case in which $F$ is directionally differentiable and we have access to a stochastic directional gradient oracle, as considered by Zhang et al. (2020b). This is a less standard oracle $\mathrm{Grad}(x, v, z)$ that outputs $g$ such that $\mathbb{E}[\langle g, v\rangle]$ is the directional derivative of $F$ in the direction $v$. This setting is subtly different (although a directional derivative oracle is a gradient oracle at all points for which $F$ is continuously differentiable). In order to keep technical complications to a minimum, in the main text we consider the more familiar stochastic gradient oracle discussed above. In Appendix H, we show that our results and techniques also apply to directional gradient oracles with only superficial modification.

2.1 $(\delta, \epsilon)$-Stationary Points
Now, let us define our notion of a $(\delta, \epsilon)$-stationary point. This definition is essentially the same as that used in Zhang et al. (2020b); Davis et al. (2021); Tian et al. (2022). It is in fact mildly more stringent, since we require an "unbiasedness" condition in order to make eventual connections to second-order smooth objectives easier.

Definition 4. A point $x$ is a $(\delta, \epsilon)$-stationary point of an almost-everywhere differentiable function $F$ if there is a finite subset $S$ of the ball of radius $\delta$ centered at $x$ such that, for $y$ selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$.

As a counterpart to this definition, we also define:

Definition 5. Given a point $x$, a number $\delta > 0$, and an almost-everywhere differentiable function $F$, define
$$\|\nabla F(x)\|_\delta \triangleq \inf_{\substack{S \subset B(x, \delta),\\ \frac{1}{|S|}\sum_{y \in S} y = x}} \left\|\frac{1}{|S|}\sum_{y \in S} \nabla F(y)\right\|\,.$$

Let us also state an immediate corollary of Proposition 2 that converts a guarantee on a randomized smoothed function to one on the original function.

Corollary 6. Let $F: \mathbb{R}^d \to \mathbb{R}$ be $G$-Lipschitz. For $\epsilon > 0$, let $p = \epsilon/G$ and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x + pu)]$. If a point $x$ satisfies $\|\nabla \hat F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\|_\delta \le 2\epsilon$.

Our ultimate goal is to use $N$ stochastic gradient evaluations of $F$ to identify a point $x$ with as small a value of $\mathbb{E}[\|\nabla F(x)\|_\delta]$ as possible. Due to Proposition 2 and Corollary 6, our results immediately extend from differentiable $F$ to those $F$ that are locally Lipschitz and for which $\mathrm{Grad}(x, z)$ returns an unbiased estimate of $\nabla F(x)$ whenever $F$ is differentiable at $x$. We therefore focus our subsequent development on the case that $F$ is differentiable and we have access to a stochastic gradient oracle.

2.2 Online Learning

Here, we very briefly introduce the setting of online linear learning with shifting competitors, which is the core of our online-to-non-convex conversion. We refer the interested reader to Cesa-Bianchi & Lugosi (2006); Hazan (2019); Orabona (2019) for a comprehensive introduction to online learning. In the online learning setting, the learning process proceeds in rounds. In each round the algorithm outputs a point $\Delta_t$ in a feasible set $V$, then receives a linear loss function $\ell_t(\cdot) = \langle g_t, \cdot\rangle$ and pays $\ell_t(\Delta_t)$. The goal of the algorithm is to minimize the static regret over $T$ rounds, defined as the difference between its cumulative loss and that of an arbitrary comparison vector $u \in V$:
$$R_T(u) \triangleq \sum_{t=1}^T \langle g_t, \Delta_t - u\rangle\,.$$
With no stochastic assumptions, it is possible to design online algorithms that guarantee that the regret is upper bounded by $O(\sqrt T)$. We can also define a more challenging objective: minimizing the $K$-shifting regret, that is, the regret with respect to an arbitrary sequence of $K$ vectors $u_1, \dots, u_K \in V$ that changes every $T$ iterations:
$$R_T(u_1, \dots, u_K) \triangleq \sum_{k=1}^K \sum_{n=(k-1)T+1}^{kT} \langle g_n, \Delta_n - u_k\rangle\,. \qquad (1)$$
It should be intuitive that resetting the online algorithm every $T$ iterations can achieve a shifting regret of $O(K\sqrt T)$.
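A minimal Python sketch of this resetting construction, assuming a base learner object with `reset`, `predict`, and `update` methods (a hypothetical interface of ours): restarting it every $T$ rounds turns a static-regret learner into a $K$-shifting-regret learner.

```python
class ResetEvery:
    """Wrap a static-regret online learner and restart it every T rounds,
    so each length-T block competes against its own comparator u_k."""
    def __init__(self, base_learner, T):
        self.base, self.T, self.n = base_learner, T, 0

    def predict(self):
        if self.n % self.T == 0:     # new block: forget the past
            self.base.reset()
        return self.base.predict()   # Delta_n for this round

    def update(self, g):
        self.base.update(g)          # feed the linear loss <g, .>
        self.n += 1
```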

2.3 Related Work


In addition to the papers discussed in the introduction, here we discuss further related work.
In this paper we build on the definition of $(\delta, \epsilon)$-stationary points proposed by Zhang et al. (2020b). There, they prove a complexity of $O(\epsilon^{-4}\delta^{-1})$ for stochastic Lipschitz functions, which we improve to $O(\epsilon^{-3}\delta^{-1})$; we also prove the optimality of this result.

The idea of reducing machine learning to online learning was pioneered by Cesa-Bianchi et al. (2004) with the online-to-batch conversion. There are also a few papers that explore the possibility of transforming non-convex problems into online learning ones. Ghai et al. (2022) provide some conditions under which online gradient descent on non-convex losses is equivalent to a convex online mirror descent. Hazan et al. (2017) define a notion of regret which can be used to find approximate stationary points of smooth objectives. Zhuang et al. (2019) transform the problem of tuning learning rates in stochastic non-convex optimization into an online learning problem. Our proposed approach differs from all of the above in applying to non-smooth objectives. Moreover, we employ online learning algorithms with shifting regret (Herbster & Warmuth, 1998) to generate the updates (i.e., the differences between successive iterates), rather than the iterates themselves.

3 Online-to-non-convex Conversion
In this section, we explain the online-to-non-convex conversion. The core idea transforms the minimization of a non-convex and non-smooth function into the problem of minimizing the shifting regret over linear losses. In particular, consider an optimization algorithm that updates a previous iterate $x_{n-1}$ by moving in a direction $\Delta_n$: $x_n = x_{n-1} + \Delta_n$. For example, SGD sets $\Delta_n = -\eta g_{n-1} = -\eta \cdot \mathrm{Grad}(x_{n-1}, z_{n-1})$ for a learning rate $\eta$. Instead, we let an online learning algorithm $\mathcal{A}$ decide the update direction $\Delta_n$, using linear losses $\ell_n(x) = \langle g_n, x\rangle$.

The motivation behind essentially all first-order algorithms is that $F(x_{n-1} + \Delta_n) - F(x_{n-1}) \approx \langle g_n, \Delta_n\rangle$. This suggests that $\Delta_n$ should be chosen to minimize the inner product $\langle g_n, \Delta_n\rangle$. However, we are faced with two difficulties. The first difficulty is the approximation error in the first-order expansion. The second is the fact that $\Delta_n$ needs to be chosen before $g_n$ is revealed, so that $\Delta_n$ needs, in some sense, to "predict the future". Typical analyses of algorithms such as SGD use the remainder form of Taylor's theorem to address both difficulties simultaneously for smooth objectives, but in our non-smooth case this is not a valid approach. Instead, we tackle these difficulties independently. We overcome the first difficulty using the same randomized scaling trick employed by Zhang et al. (2020b): define $g_n$ to be a gradient evaluated not at $x_{n-1}$ or $x_{n-1} + \Delta_n$, but at a random point along the line segment connecting the two. Then for a well-behaved function we will have $F(x_{n-1} + \Delta_n) - F(x_{n-1}) = \mathbb{E}[\langle g_n, \Delta_n\rangle]$. The second difficulty is where online learning shines: online learning algorithms are specifically designed to predict completely arbitrary sequences of vectors as accurately as possible.
Algorithm 1 Online-to-Non-Convex Conversion
Input: Initial point $x_0$, $K \in \mathbb{N}$, $T \in \mathbb{N}$, online learning algorithm $\mathcal{A}$, scalars $s_n$ for all $n$
Set $M = K \cdot T$
for $n = 1 \dots M$ do
  Get $\Delta_n$ from $\mathcal{A}$
  Set $x_n = x_{n-1} + \Delta_n$
  Set $w_n = x_{n-1} + s_n \Delta_n$
  Sample random $z_n$
  Generate gradient $g_n = \mathrm{Grad}(w_n, z_n)$
  Send $g_n$ to $\mathcal{A}$ as gradient
end for
Set $w_t^k = w_{(k-1)T+t}$ for $k = 1, \dots, K$ and $t = 1, \dots, T$
Set $\bar w^k = \frac{1}{T}\sum_{t=1}^T w_t^k$ for $k = 1, \dots, K$
Return $\{\bar w^1, \dots, \bar w^K\}$
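The following Python sketch mirrors Algorithm 1 directly, using projected online gradient descent reset every $T$ rounds as $\mathcal{A}$ (the instantiation analyzed in Section 3.1). All constants and the oracle signature are illustrative, not prescribed by the paper.

```python
import numpy as np

def o2nc(x0, grad_oracle, K, T, D, eta, rng):
    """Online-to-non-convex conversion (Algorithm 1) with OGD reset every T rounds."""
    x = x0.copy()
    w_bars = []
    for k in range(K):
        delta = np.zeros_like(x)              # reset the online learner: Delta_1 = 0
        ws = []
        for t in range(T):
            x = x + delta                     # x_n = x_{n-1} + Delta_n
            s = rng.uniform()                 # s_n ~ Uniform[0, 1]
            w = x - (1.0 - s) * delta         # w_n = x_{n-1} + s_n * Delta_n
            g = grad_oracle(w, rng)           # g_n = Grad(w_n, z_n)
            ws.append(w)
            delta = delta - eta * g           # OGD step on the linear loss <g_n, .>
            norm = np.linalg.norm(delta)
            if norm > D:                      # project onto the ball of radius D
                delta *= D / norm
        w_bars.append(np.mean(ws, axis=0))    # w_bar^k: average iterate of block k
    return w_bars                             # candidate (delta, eps)-stationary points
```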

The previous intuition is formalized in Algorithm 1 and the following result, which we will elaborate on in Theorem 8 before yielding our main result in Corollary 9.

Theorem 7. Suppose $F$ is well-behaved. Define $\nabla_n = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\, ds$. Then, with the notation in Algorithm 1 and for any sequence of vectors $u_1, \dots, u_M$, we have the equality:
$$F(x_M) = F(x_0) + \sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle + \sum_{n=1}^M \langle \nabla_n - g_n, \Delta_n\rangle + \sum_{n=1}^M \langle g_n, u_n\rangle\,.$$
Moreover, if we let $s_n$ be independent random variables uniformly distributed in $[0, 1]$, then we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right]\,.$$

Proof. By the well-behavedness of $F$, we have
$$F(x_n) - F(x_{n-1}) = \int_0^1 \langle \nabla F(x_{n-1} + s(x_n - x_{n-1})), x_n - x_{n-1}\rangle\, ds = \int_0^1 \langle \nabla F(x_{n-1} + s\Delta_n), \Delta_n\rangle\, ds = \langle \nabla_n, \Delta_n\rangle = \langle g_n, \Delta_n - u_n\rangle + \langle \nabla_n - g_n, \Delta_n\rangle + \langle g_n, u_n\rangle\,.$$
Now, sum over $n$ and telescope to obtain the stated bound.

For the second statement, simply observe that by definition we have $\mathbb{E}[g_n] = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\, ds = \nabla_n$.

3.1 Guarantees for Non-Smooth Non-Convex Functions
The primary value of Theorem 7 is that the term $\sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle$ is exactly the regret of an online learning algorithm: lower regret clearly translates to a smaller bound on $F(x_M)$. Next, by carefully choosing $u_n$, we will be able to relate the term $\sum_{n=1}^M \langle g_n, u_n\rangle$ to the gradient averages that appear in the definition of $(\delta, \epsilon)$-stationarity. Formalizing these ideas, we have the following:

Theorem 8. Assume $F$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. Define
$$u_k = -D\, \frac{\sum_{t=1}^T \nabla F(w_t^k)}{\left\|\sum_{t=1}^T \nabla F(w_t^k)\right\|}$$
for some $D > 0$, for $k = 1, \dots, K$. Finally, suppose $\mathrm{Var}(g_n) = \sigma^2$. Then:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{F(x_0) - F^\star}{DM} + \frac{\mathbb{E}[R_T(u_1, \dots, u_K)]}{DM} + \frac{\sigma}{\sqrt T}\,.$$

Proof. In Theorem 7, set $u_n$ to be equal to $u_1$ for the first $T$ iterations, $u_2$ for the second $T$ iterations, and so on (i.e., $u_n = u_k$ for $n = (k-1)T + 1, \dots, kT$). So, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u_1, \dots, u_K)] + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right]\,.$$
Now, since $u_k = -D\frac{\sum_{t=1}^T \nabla F(w_t^k)}{\|\sum_{t=1}^T \nabla F(w_t^k)\|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_n) = \sigma^2$, we have
$$\mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K \left\langle \sum_{t=1}^T \nabla F(w_t^k), u_k\right\rangle\right] + \mathbb{E}\left[D\sum_{k=1}^K \left\|\sum_{t=1}^T \left(\nabla F(w_t^k) - g_{(k-1)T+t}\right)\right\|\right] \le \mathbb{E}\left[\sum_{k=1}^K \left\langle \sum_{t=1}^T \nabla F(w_t^k), u_k\right\rangle\right] + D\sigma K\sqrt T = \mathbb{E}\left[-DT\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] + D\sigma K\sqrt T\,.$$
Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u_1, \dots, u_K)] + \sigma DK\sqrt T - DT\, \mathbb{E}\left[\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right]\,.$$
Dividing by $KDT = DM$ and rearranging, we have the stated bound.

We now instantiate Theorem 8 with the simplest online learning algorithm: online gradient descent (OGD) (Zinkevich, 2003). OGD takes as input a radius $D$ and a step size $\eta$ and makes the update $\Delta_{n+1} = \Pi_{\|\Delta\| \le D}[\Delta_n - \eta g_n]$ with $\Delta_1 = 0$. The standard analysis shows that if $\mathbb{E}[\|g_n\|^2] \le G^2$ for all $n$, then with $\eta = \frac{D}{G\sqrt T}$, OGD ensures the static regret bound $\mathbb{E}[R_T(u)] \le DG\sqrt T$ for any $u$ satisfying $\|u\| \le D$ (for completeness, a proof of this statement is in Appendix B). Thus, by resetting the algorithm every $T$ iterations, we achieve $\mathbb{E}[R_T(u_1, \dots, u_K)] \le KDG\sqrt T$. This powerful guarantee for all sequences is characteristic of online learning. We are now free to optimize the remaining parameters $K$ and $D$ to achieve our main result, presented in Corollary 9.
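For concreteness, here is a minimal implementation of this projected OGD learner in the reset/predict/update interface sketched in Section 2.2 (the class name and interface are ours, not the paper's):

```python
import numpy as np

class OGDBall:
    """Projected online gradient descent on the L2 ball of radius D,
    with the tuning eta = D / (G * sqrt(T)) described in the text."""
    def __init__(self, dim, D, G, T):
        self.D, self.eta = D, D / (G * np.sqrt(T))
        self.delta = np.zeros(dim)

    def reset(self):
        self.delta = np.zeros_like(self.delta)   # Delta_1 = 0

    def predict(self):
        return self.delta                        # play Delta_n

    def update(self, g):
        self.delta = self.delta - self.eta * g   # gradient step on <g, .>
        norm = np.linalg.norm(self.delta)
        if norm > self.D:
            self.delta *= self.D / norm          # project back onto the ball
```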

Corollary 9. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions of Theorem 8, suppose in addition that $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u_1, \dots, u_K)] \le DGK\sqrt T$ for all $\|u_k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0) - F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Then, for all $k$ and $t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right).$$

Before providing the proof, let us discuss the implications. Notice that if we select $\hat w$ at random from $\{\bar w^1, \dots, \bar w^K\}$, then we clearly have $\mathbb{E}[\|\nabla F(\hat w)\|_\delta] = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right]$. Therefore, the Corollary asserts that for a function $F$ with $F(x_0) - \inf_x F(x) \le \gamma$ and a stochastic gradient oracle whose second moment is bounded by $G^2$, a properly instantiated Algorithm 1 finds a $(\delta, \epsilon)$-stationary point in $N = O(G^2\gamma\epsilon^{-3}\delta^{-1})$ gradient evaluations. In Section 7, we will provide a lower bound showing that this rate is optimal essentially whenever $\delta G^2 \ge \epsilon\gamma$. Together, the Corollary and the lower bound provide a nearly complete characterization of the complexity of finding $(\delta, \epsilon)$-stationary points in the stochastic setting.

It is also interesting to note that the bound does not appear to improve if $\sigma \to 0$. Instead, all $\sigma$ dependency is dominated by the dependency on $G$. This highlights an interesting open question: is it in fact possible to improve in the deterministic setting? It is conceivable that the answer is "no", as in the non-smooth convex optimization setting it is well-known that the optimal rates for stochastic and deterministic optimization are the same.

Remark 10. With fairly straightforward martingale concentration arguments, the above can be extended to identify a $\left(\delta, O\left(\frac{G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}\right)\right)$-stationary point with high probability. We leave this for a future version of this work.

It is also interesting to explicitly write the update of the overall algorithm:
$$x_n = x_{n-1} + \Delta_n$$
$$g_n = \mathrm{Grad}(x_n + (s_n - 1)\Delta_n, z_n)$$
$$\Delta_{n+1} = \mathrm{clip}_D(\Delta_n - \eta g_n)$$
where $\mathrm{clip}_D(x) = x \min\left(\frac{D}{\|x\|}, 1\right)$. In words, the update is reminiscent of SGD with momentum and gradient clipping. The primary difference is that the stochastic gradient is taken at a slightly perturbed $x_n$.
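In code, this three-line update reads as follows (a sketch: the sign convention follows the OGD learner of Section 3.1, and the oracle signature is the illustrative one used earlier):

```python
import numpy as np

def clip_D(v, D):
    """clip_D(x) = x * min(D / ||x||, 1): rescale v into the ball of radius D."""
    norm = np.linalg.norm(v)
    return v * min(D / norm, 1.0) if norm > 0 else v

def o2nc_step(x, delta, grad_oracle, eta, D, rng):
    """One step of the overall update: a momentum-like buffer delta with clipping,
    with the gradient taken at a point randomly perturbed along the last update."""
    x = x + delta
    s = rng.uniform()
    g = grad_oracle(x + (s - 1.0) * delta, rng)  # gradient at w_n = x_n + (s_n - 1) Delta_n
    delta = clip_D(delta - eta * g, D)
    return x, delta
```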

Proof of Corollary 9. Since $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\| = \|x_n - (1 - s_n)\Delta_n - x_{n'-1} - s_{n'}\Delta_{n'}\| \le \left\|\sum_{i=n+1}^{n'-1} \Delta_i\right\| + \|\Delta_n\| + \|\Delta_{n'}\| \le D((n' - 1) - (n + 1) + 1) + 2D = D(n' - n + 1) \le DT\,.$$
Therefore, we clearly have $\|w_t^k - \bar w^k\| \le DT = \delta$.

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, for the second fact, notice that $\mathrm{Var}(g_n) \le G^2$ for all $n$. Thus, applying Theorem 8 in concert with the additional assumption $\mathbb{E}[R_T(u_1, \dots, u_K)] \le DGK\sqrt T$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{DN} + \frac{2KDG\sqrt T}{DN} + \frac{G}{\sqrt T} \le \frac{2T(F(x_0) - F^\star)}{\delta N} + \frac{3G}{\sqrt T} \le \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right) + \frac{2(F(x_0) - F^\star)}{\delta N}\,,$$
where the last inequality is due to the choice of $T$.

Finally, observe that $\|w_t^k - \bar w^k\| \le \delta$ for all $t$ and $k$, and also that $\bar w^k = \frac{1}{T}\sum_{t=1}^T w_t^k$. Therefore $S = \{w_1^k, \dots, w_T^k\}$ satisfies the conditions in the infimum in Definition 5, so that $\|\nabla F(\bar w^k)\|_\delta \le \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|$.

4 Bounds for the L1 norm via per-coordinate updates


It is well-known that running a separate instance of an online learning algorithm on each coordinate of $\Delta$ (e.g., as in AdaGrad (Duchi et al., 2010; McMahan & Streeter, 2010)) yields regret bounds with respect to $L_1$ norms of the linear costs. For example, we can run the online gradient descent algorithm with a separate learning rate for each coordinate: $\Delta_{n+1,i} = \Pi_{[-D_\infty, D_\infty]}[\Delta_{n,i} - \eta_i g_{n,i}]$. The regret of this procedure is simply the sum of the regrets of the individual algorithms. In particular, if $\mathbb{E}[g_{n,i}^2] \le G_i^2$, then setting $\eta_i = \frac{D_\infty}{G_i\sqrt T}$ yields the regret bound $\mathbb{E}[R_T(u)] \le D_\infty\sqrt T\sum_{i=1}^d G_i$ for any $u$ satisfying $\|u\|_\infty \le D_\infty$. By employing such online algorithms with our online-to-non-convex conversion, we can obtain a guarantee on the $L_1$ norm of the gradients.
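A minimal sketch of this per-coordinate learner (a diagonal variant of the earlier OGD class; the interface is ours):

```python
import numpy as np

class CoordinateOGD:
    """One projected OGD instance per coordinate: box constraint ||Delta||_inf <= D_inf,
    with a separate step size eta_i = D_inf / (G_i * sqrt(T)) per coordinate."""
    def __init__(self, D_inf, G, T):
        self.D_inf = D_inf
        self.eta = D_inf / (np.asarray(G, dtype=float) * np.sqrt(T))
        self.delta = np.zeros(len(self.eta))

    def reset(self):
        self.delta = np.zeros_like(self.delta)

    def predict(self):
        return self.delta

    def update(self, g):
        # independent OGD step and clip on each coordinate
        self.delta = np.clip(self.delta - self.eta * g, -self.D_inf, self.D_inf)
```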

Definition 11. A point $x$ is a $(\delta, \epsilon)$-stationary point with respect to the $L_1$ norm of an almost-everywhere differentiable function $F$ if there exists a finite subset $S$ of the $L_\infty$ ball of radius $\delta$ centered at $x$ such that, if $y$ is selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and $\|\mathbb{E}[\nabla F(y)]\|_1 \le \epsilon$.

As a counterpart to this definition, we define:

Definition 12. Given a point $x$, a number $\delta > 0$, and an almost-everywhere differentiable function $F$, define
$$\|\nabla F(x)\|_{1,\delta} \triangleq \inf_{\substack{S \subset B_\infty(x, \delta),\\ \frac{1}{|S|}\sum_{y \in S} y = x}} \left\|\frac{1}{|S|}\sum_{y \in S} \nabla F(y)\right\|_1\,.$$

We can now state a theorem similar to Corollary 9. Given that the proof is also very similar, we defer it to Appendix G.

Theorem 13. Suppose we have a budget of $N$ gradient evaluations. Assume $F: \mathbb{R}^d \to \mathbb{R}$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. Assume that $\mathbb{E}[g_{n,i}^2] \le G_i^2$ for $i = 1, \dots, d$ for all $n$. Assume that $\mathcal{A}$ guarantees $\|\Delta_n\|_\infty \le D_\infty$ for some user-specified $D_\infty$ for all $n$ and ensures the standard worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u_1, \dots, u_K)] \le D_\infty K\sqrt T\sum_{i=1}^d G_i$ for all $\|u_k\|_\infty \le D_\infty$. Let $\delta > 0$ be an arbitrary number. Set $D_\infty = \delta/T$, $T = \min\left(\left\lceil\left(\frac{N\delta\sum_{i=1}^d G_i}{F(x_0) - F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Then, for all $k, t$, $\|\bar w^k - w_t^k\|_\infty \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|_1\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5(\sum_{i=1}^d G_i)^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^d G_i}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_{1,\delta}\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5(\sum_{i=1}^d G_i)^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^d G_i}{\sqrt N}\right).$$

Let us compare this result with Corollary 9. For a fair comparison, we set $G_i$ and $G$ such that $\sum_{i=1}^d G_i^2 = G^2$. Then, we can lower bound $\|\cdot\|_\delta$ with $\frac{1}{\sqrt d}\|\cdot\|_{1,\delta}$. Hence, under the assumption $\mathbb{E}[\|g_n\|^2] = \sum_{i=1}^d \mathbb{E}[g_{n,i}^2] \le \sum_{i=1}^d G_i^2 = G^2$, Corollary 9 implies $\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_{1,\delta} \le O\left(\frac{\sqrt d\, G^{2/3}}{(N\delta)^{1/3}}\right)$.

Now, let us see what would happen if we instead employed Theorem 13. First, observe that $\sum_{i=1}^d G_i \le \sqrt d\sqrt{\sum_{i=1}^d G_i^2} \le \sqrt d\, G$. Substituting this expression into Theorem 13 now gives an upper bound on $\|\cdot\|_{1,\delta}$ that is $O\left(\frac{d^{1/3} G^{2/3}}{(N\delta)^{1/3}}\right)$, which is better than the one we could obtain from Corollary 9 under the same assumptions.

5 From Non-smooth to Smooth Guarantees


Let us now see what our results imply for smooth objectives. The following two propositions show that for smooth $F$, a $(\delta, \epsilon)$-stationary point is automatically a $(0, \epsilon')$-stationary point for an appropriate $\epsilon'$. The proofs are in Appendix E.

Proposition 14. Suppose that $F$ is $H$-smooth (that is, $\nabla F$ is $H$-Lipschitz) and $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + H\delta$.

Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2 F(x) - \nabla^2 F(y)\|_{\mathrm{op}} \le J\|x - y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$.

Now, recall that Corollary 9 shows that we can find a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ iterations. Thus, Proposition 14 implies that by setting $\delta = \epsilon/H$, we can find a $(0, \epsilon)$-stationary point of an $H$-smooth objective $F$ in $O(\epsilon^{-4})$ iterations, which matches the (optimal) guarantee of standard SGD (Ghadimi & Lan, 2013; Arjevani et al., 2019). Further, Proposition 15 shows that by setting $\delta = \sqrt{\epsilon/J}$, we can find a $(0, \epsilon)$-stationary point of a $J$-second-order smooth objective in $O(\epsilon^{-3.5})$ iterations. This matches the performance of more refined SGD variants and is also known to be tight (Fang et al., 2019; Cutkosky & Mehta, 2020; Arjevani et al., 2020). In summary: the online-to-non-convex conversion also recovers the optimal results for smooth stochastic losses.
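The arithmetic behind these two claims is the following substitution into the $O(\epsilon^{-3}\delta^{-1})$ complexity, restated here for convenience:

```latex
% Proposition 14 with delta = eps/H:
O\!\left(\epsilon^{-3}\delta^{-1}\right) = O\!\left(H\epsilon^{-4}\right),
\qquad
% Proposition 15 with delta = sqrt(eps/J):
O\!\left(\epsilon^{-3}\delta^{-1}\right) = O\!\left(\sqrt{J}\,\epsilon^{-3.5}\right).
```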

6 Deterministic and Smooth Case


We now consider the case of a non-stochastic oracle (that is, $\mathrm{Grad}(x, z) = \nabla F(x)$ for all $z, x$) and an $H$-smooth $F$ (i.e., $\nabla F$ is $H$-Lipschitz). We will show that optimistic online algorithms (Rakhlin & Sridharan, 2013; Hazan & Kale, 2010) achieve rates matching the optimal deterministic results. In particular, we consider online algorithms that ensure the static regret bound
$$R_T(u) \le O\left(D\sqrt{\sum_{t=1}^T \|h_t - g_t\|^2}\right), \qquad (2)$$
for some "hint" vectors $h_t$. In Appendix B, we provide an explicit construction of such an algorithm for completeness. The standard setting for the hints is $h_t = g_{t-1}$. As explained in Section 2.2, to obtain a $K$-shifting regret it is enough to reset the algorithm every $T$ iterations.
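A minimal sketch of such an optimistic learner with hints $h_t = g_{t-1}$, following the two-step optimistic mirror descent pattern analyzed in Appendix B (the class name and interface are ours):

```python
import numpy as np

class OptimisticOGD:
    """Optimistic OGD on the ball of radius D with hints h_t = g_{t-1}:
    play a half-step using the hint, then take the real step with g_t."""
    def __init__(self, dim, D, eta):
        self.D, self.eta = D, eta
        self.w_hat = np.zeros(dim)   # the lagged iterate hat{w}_t
        self.h = np.zeros(dim)       # hint; g_0 = 0 as in Theorem 16

    def _proj(self, v):
        n = np.linalg.norm(v)
        return v * self.D / n if n > self.D else v

    def predict(self):
        return self._proj(self.w_hat - self.eta * self.h)   # w_t from the hint

    def update(self, g):
        self.w_hat = self._proj(self.w_hat - self.eta * g)  # hat{w}_{t+1} from g_t
        self.h = g                                          # next hint h_{t+1} = g_t
```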

Theorem 16. Suppose we have a budget of $N$ gradient evaluations, and that we have an online algorithm $\mathcal{A}_{\mathrm{static}}$ that guarantees $\|\Delta_n\| \le D$ for all $n$ and ensures the optimistic regret bound $R_T(u) \le CD\sqrt{\sum_{t=1}^T \|g_t - g_{t-1}\|^2}$ for some constant $C$, where we define $g_0 = 0$. In Algorithm 1, set $\mathcal{A}$ to be $\mathcal{A}_{\mathrm{static}}$ reset every $T$ rounds. Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\frac{(C\delta^2 HN)^{2/5}}{(F(x_0) - F^\star)^{2/5}}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Finally, suppose that $F$ is $H$-smooth and that the gradient oracle is deterministic (that is, $g_n = \nabla F(w_n)$). Then, for all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, with $G_1 = \|\nabla F(x_1)\|$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5} N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N}\,,$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5} N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right).$$

Note that the expectation here encompasses only the randomness in the choice of $s_t^k$, because the gradient oracle is assumed to be deterministic. Theorem 16 finds a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-5/3}\delta^{-1/3})$ iterations. Thus, by setting $\delta = \epsilon/H$, Proposition 14 shows we can find a $(0, \epsilon)$-stationary point in $O(\epsilon^{-2})$ iterations, which matches the standard optimal rate (Carmon et al., 2021).

Proof. The first fact (for all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$) holds for precisely the same reason that it holds in Corollary 9. The final fact follows from the second fact again with the same reasoning, so it remains only to show the second fact.

To show the second fact, observe that for $k = 1$ we have
$$R_T(u_1) \le CD\sqrt{\sum_{t=1}^T \|g_t^1 - g_{t-1}^1\|^2} \le CD\sqrt{G_1^2 + \sum_{t=2}^T \|\nabla F(w_t^1) - \nabla F(w_{t-1}^1)\|^2} \le CD\sqrt{G_1^2 + H^2\sum_{t=2}^T \|w_t^1 - w_{t-1}^1\|^2} \le CD\sqrt{G_1^2 + 4H^2TD^2} \le CDG_1 + 2CD^2H\sqrt T\,.$$

Similarly, for $k > 1$, we observe that:
$$\sum_{t=1}^T \|g_t^k - g_{t-1}^k\|^2 \le \sum_{t=2}^T \|\nabla F(w_t^k) - \nabla F(w_{t-1}^k)\|^2 + \|\nabla F(w_1^k) - \nabla F(w_T^{k-1})\|^2 \le H^2\left(\|w_1^k - w_T^{k-1}\|^2 + \sum_{t=2}^T \|w_t^k - w_{t-1}^k\|^2\right) \le 4TH^2D^2\,.$$

Thus, we have
$$R_T(u_k) \le CD\sqrt{\sum_{t=1}^T \|g_t^k - g_{t-1}^k\|^2} \le CD\sqrt{4H^2TD^2} \le 2CD^2H\sqrt T\,.$$

Now, applying Theorem 8 in concert with the above bounds on $R_T(u_k)$, we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{DN} + \frac{2CDG_1 + 4CKD^2H\sqrt T}{DN} \le \frac{2T(F(x_0) - F^\star)}{\delta N} + \frac{2CG_1}{N} + \frac{4C\delta H}{T^{3/2}} \le \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5}N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N}\,.$$

6.1 Better Results with Second-Order Smoothness


When $F$ is $J$-second-order smooth (i.e., $\nabla^2 F$ is $J$-Lipschitz) we can do even better. First, observe that if $F$ is $J$-second-order-smooth, then by Proposition 15, the $O(\epsilon^{-5/3}\delta^{-1/3})$ iteration complexity of Theorem 16 implies an $O(\epsilon^{-11/6})$ iteration complexity for finding $(0, \epsilon)$-stationary points by setting $\delta = \sqrt{\epsilon/J}$. This already improves upon the $O(\epsilon^{-2})$ result for smooth losses, but we can improve still further. The key idea is to generate more informative hints $h_t$. If we can make $h_t \approx g_t$, then by (2), we can achieve smaller regret and so a better guarantee.

To do so, we abandon randomization: instead of choosing $s_n$ randomly, we just set $s_n = 1/2$. This setting still allows $F(x_n) \approx F(x_{n-1}) + \langle g_n, \Delta_n\rangle$ with very little error when $F$ is second-order-smooth. Then, via careful inspection of the optimistic mirror descent update formula, we can identify an $h_t$ with $\|h_t - g_t\| \le O(1/\sqrt N)$ using $O(\log(N))$ gradient queries. This more advanced online learning algorithm is presented in Algorithm 2 (full analysis in Appendix C).

Overall, Algorithm 2 makes an update of a similar flavor to the "implicit" update:
$$\Delta_n = \Pi_{\|\Delta\| \le D}\left[\Delta_{n-1} - \frac{g_n}{2H}\right], \qquad g_n = \nabla F(x_{n-1} + \Delta_n/2)\,.$$

With this refined online algorithm, we can show the following convergence guarantee, whose proof is in Appendix D.

Theorem 17. In Algorithm 1, assume that $g_n = \nabla F(w_n)$, and set $s_n = \frac{1}{2}$. Use Algorithm 2 restarted every $T$ rounds as $\mathcal{A}$. Let $\delta > 0$ be an arbitrary number. Set $T = \min\left(\left\lceil\frac{(\delta^2(H + J\delta)N)^{1/3}}{(F(x_0) - F^\star)^{1/3}}\right\rceil, N/2\right)$ and $K = \lfloor\frac{N}{T}\rfloor$. In Algorithm 2, set $\eta = 1/(2H)$, $D = \delta/T$, and $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$. Finally, suppose that $F$ is $J$-second-order-smooth. Then, the following facts hold:

1. For all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$.

2. We have the inequality
$$\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\| \le \frac{4G}{N} + \frac{2(F(x_0) - F^\star)}{N\delta} + 3\,\frac{(H + J\delta)^{1/3}(F(x_0) - F^\star)^{2/3}}{\delta^{1/3}N^{2/3}} + 10\,\frac{\delta(H + J\delta)}{N^2}\,.$$

3. With $\delta = \frac{H^{1/7}(F(x_0) - F(x_N))^{2/7}}{J^{3/7}N^{2/7}}$, we have
$$\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\| \le O\left(\frac{J^{1/7}H^{2/7}(F(x_0) - F^\star)^{4/7}}{N^{4/7}}\right)\,.$$

Moreover, the total number of gradient queries consumed is $NQ = O(N\log(N))$.

Algorithm 2 Optimistic Mirror Descent with Careful Hints
Input: Learning rate $\eta$, number $Q$ ($Q$ will be $O(\log N)$), function $F$, horizon $T$, radius $D$
Receive initial iterate $x_0$
Set $\Delta'_1 = 0$
for $t = 1 \dots T$ do
  Set $h_t^0 = \nabla F(x_{t-1})$
  for $i = 1 \dots Q$ do
    Set $h_t^i = \nabla F\left(x_{t-1} + \frac{1}{2}\Pi_{\|\Delta\| \le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)$
  end for
  Set $h_t = h_t^Q$
  Output $\Delta_t = \Pi_{\|\Delta\| \le D}[\Delta'_t - \eta h_t]$
  Receive $t$th gradient $g_t$
  Set $\Delta'_{t+1} = \Pi_{\|\Delta\| \le D}[\Delta'_t - \eta g_t]$
end for

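The hint computation at the heart of Algorithm 2 is a short fixed-point iteration; here is a Python sketch of it (the deterministic gradient function and all arguments are placeholders):

```python
import numpy as np

def careful_hint(grad_F, x_prev, delta_prime, eta, D, Q):
    """Compute the hint h_t of Algorithm 2 by iterating the map
    h -> grad_F(x_{t-1} + 0.5 * proj_D(delta'_t - eta * h)) Q times.
    For eta <= 1/H this map is a (1/2)-contraction (see Appendix C), so
    Q = O(log N) iterations make h_t nearly a fixed point, i.e. h_t ~ g_t."""
    def proj(v):
        n = np.linalg.norm(v)
        return v * D / n if n > D else v

    h = grad_F(x_prev)                                          # h_t^0
    for _ in range(Q):
        h = grad_F(x_prev + 0.5 * proj(delta_prime - eta * h))  # h_t^i
    return h
```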


This result finds a $(\delta, \epsilon)$-stationary point in $\tilde O(\epsilon^{-3/2}\delta^{-1/2})$ iterations. Via Proposition 15, this translates to $\tilde O(\epsilon^{-7/4})$ iterations for finding a $(0, \epsilon)$-stationary point, which matches the best known rates (up to a logarithmic factor) (Carmon et al., 2017). Note that this rate may not be optimal: the best available lower bound is $\Omega(\epsilon^{-12/7})$ (Carmon et al., 2021). Intriguingly, our technique seems rather different from previous work, which usually relies on acceleration and methods for detecting or exploiting negative curvature (Carmon et al., 2017; Agarwal et al., 2016; Carmon et al., 2018; Li & Lin, 2022).

7 Lower Bounds
In this section, we show that the $O(\epsilon^{-3}\delta^{-1})$ complexity achieved in Corollary 9 is tight. We do this by a simple extension of the lower bound for stochastic smooth non-convex optimization of Arjevani et al. (2019). We provide an informal statement and proof sketch below. The formal result (Theorem 28) and proof are provided in Appendix F.

Theorem 18 (informal). There is a universal constant $C$ such that for any $\delta, \epsilon, \gamma$ and $G \ge C\sqrt{\frac{\epsilon\gamma}{\delta}}$, for any first-order algorithm $\mathcal{A}$, there is a $G$-Lipschitz, $C^\infty$ function $F: \mathbb{R}^d \to \mathbb{R}$ for some $d$ with $F(0) - \inf_x F(x) \le \gamma$ and a stochastic first-order gradient oracle for $F$ whose outputs $g$ satisfy $\mathbb{E}[\|g\|^2] \le G^2$, such that $\mathcal{A}$ requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ stochastic oracle queries to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.
Proof sketch. The construction of Arjevani et al. (2019) provides, for any $H, \gamma, \epsilon$, and $\sigma$, a function $F$ and stochastic oracle with variance at most $\sigma^2$ such that $F$ is $H$-smooth and $O(\sqrt{H\gamma})$-Lipschitz, and $\mathcal{A}$ requires $\Omega(\sigma^2 H\gamma/\epsilon^4)$ oracle queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. By setting $H = \frac{\epsilon}{\delta}$ and $\sigma = G/\sqrt 2$, this becomes an $O(\sqrt{\epsilon\gamma/\delta})$-Lipschitz function, and so it is at most $G/\sqrt 2$-Lipschitz. Thus, the second moment of the gradient oracle is at most $G^2/2 + G^2/2 = G^2$. Further, the algorithm requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. Now, if $\|\nabla F(x)\|_\delta \le \epsilon$, then since $F$ is $H = \frac{\epsilon}{\delta}$-smooth, by Proposition 14, $\|\nabla F(x)\| \le \epsilon + \delta H = 2\epsilon$. Thus, we see that we need $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point with $\|\nabla F(x)\|_\delta \le \epsilon$, as desired.
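For convenience, the parameter substitution in the sketch written out:

```latex
% Arjevani et al. (2019): Omega(sigma^2 H gamma / eps^4) queries for an
% H-smooth, O(sqrt(H gamma))-Lipschitz construction. Substituting:
H = \frac{\epsilon}{\delta}, \quad \sigma = \frac{G}{\sqrt 2}
\;\Longrightarrow\;
\Omega\!\left(\frac{\sigma^2 H\gamma}{\epsilon^4}\right)
  = \Omega\!\left(\frac{G^2}{2}\cdot\frac{\epsilon}{\delta}\cdot\frac{\gamma}{\epsilon^4}\right)
  = \Omega\!\left(\frac{G^2\gamma}{\delta\epsilon^3}\right),
\qquad
O\!\left(\sqrt{H\gamma}\right) = O\!\left(\sqrt{\tfrac{\epsilon\gamma}{\delta}}\right) \le \frac{G}{\sqrt 2}.
```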

8 Conclusion
We have presented a new online-to-non-convex conversion technique that applies online learning algorithms to non-convex and non-smooth stochastic optimization. When used with online gradient descent, this achieves the optimal $O(\epsilon^{-3}\delta^{-1})$ complexity for finding $(\delta, \epsilon)$-stationary points.
These results suggest new directions for work in online learning. Much past work is moti-
vated by the online-to-batch conversion relating static regret to convex optimization. We employ
switching regret for non-convex optimization. More refined analysis may be possible via gen-
eralizations such as strongly adaptive or dynamic regret (Daniely et al., 2015; Jun et al., 2017;
Zhang et al., 2018; Jacobsen & Cutkosky, 2022; Cutkosky, 2020; Lu et al., 2022; Luo et al., 2022;
Zhang et al., 2021; Baby & Wang, 2022; Zhang et al., 2022). Moreover, our analysis assumes per-
fect tuning of constants (e.g., D, T, K) for simplicity. In practice, we would prefer to adapt to un-
known parameters, motivating new applications and problems for adaptive online learning, which
is already an area of active current investigation (see, e.g., Orabona & Pál, 2015; Hoeven et al.,
2018; Cutkosky & Orabona, 2018; Cutkosky, 2019; Mhammedi & Koolen, 2020; Chen et al., 2021;
Sachs et al., 2022; Wang et al., 2022). It is our hope that some of this expertise can be applied in
the non-convex setting as well.
Finally, our results leave an important question unanswered: the current best-known algorithm for deterministic non-smooth optimization still requires $O(\epsilon^{-3}\delta^{-1})$ iterations to find a $(\delta, \epsilon)$-stationary point (Zhang et al., 2020b). We achieve this same result even in the stochastic case. Thus it is natural to wonder if one can improve in the deterministic setting. For example, is the $O(\epsilon^{-3/2}\delta^{-1/2})$ complexity we achieve in the smooth setting also achievable in the non-smooth setting?

Acknowledgements
Ashok Cutkosky is supported by the National Science Foundation grant CCF-2211718 as well as a
Google gift. Francesco Orabona is supported by the National Science Foundation under the grants
no. 2022446 “Foundations of Data Science Institute” and no. 2046096 “CAREER: Parameter-free
Optimization Algorithms for Machine Learning”.

References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima
for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. In Advances in neural


information processing systems, pp. 2675–2686, 2018.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds
for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Sekhari, A., and Sridharan, K. Second-order
information in non-convex stochastic optimization: Power and limitations. In Conference on
Learning Theory, pp. 242–299, 2020.

Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex
losses and beyond. In International Conference on Artificial Intelligence and Statistics, pp.
1805–1845. PMLR, 2022.

Bertsekas, D. P. Stochastic optimization problems with nondifferentiable cost functionals. Journal


of Optimization Theory and Applications, 12(2):218–231, 1973.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. “convex until proven guilty”: Dimension-
free acceleration of gradient descent on non-convex functions. In International Conference on
Machine Learning, pp. 654–663. PMLR, 2017.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex opti-
mization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
I. Mathematical Programming, pp. 1–50, 2019.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
II: first-order methods. Mathematical Programming, 185(1-2), 2021.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press,
2006.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning
algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.

Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and
its applications. In Conference on Learning Theory, pp. 1216–1259. PMLR, 2021.

Clarke, F. H. Optimization and nonsmooth analysis. SIAM, 1990.

Cutkosky, A. Combining online learning guarantees. In Proceedings of the Thirty-Second Conference


on Learning Theory, pp. 895–913, 2019.

Cutkosky, A. Parameter-free, dynamic, and strongly-adaptive online learning. In International


Conference on Machine Learning, volume 2, 2020.

Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference


on Machine Learning, 2020.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach
spaces. In Conference On Learning Theory, pp. 1493–1529, 2018.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In


Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International
Conference on Machine Learning, pp. 1405–1411. PMLR, 2015.

Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. A gradient sampling method
with complexity guarantees for lipschitz functions in high and low dimensions. arXiv preprint
arXiv:2112.06969, 2021.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochas-
tic optimization. In Conference on Learning Theory (COLT), pp. 257–269, 2010.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic opti-
mization. SIAM Journal on Optimization, 22(2):674–701, 2012.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via
stochastic path-integrated differential estimator. In Advances in Neural Information Processing
Systems, pp. 689–699, 2018.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex sgd escaping from saddle points. In
Conference on Learning Theory, pp. 1192–1234, 2019.

Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power
of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In
Conference on Learning Theory, pp. 313–355. PMLR, 2022.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit
setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM
symposium on Discrete algorithms, pp. 385–394, 2005.

Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic pro-
gramming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghai, U., Lu, Z., and Hazan, E. Non-convex online learning via algorithmic equivalence. arXiv
preprint arXiv:2205.15235, 2022.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia,
Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in
costs. Machine learning, 80(2-3):165–188, 2010.

Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. In
International Conference on Machine Learning, pp. 1433–1441. PMLR, 2017.

Herbster, M. and Warmuth, M. K. Tracking the best regressor. In Proceedings of the eleventh
annual conference on Computational learning theory, pp. 24–31, 1998.

Hoeven, D., Erven, T., and Kotlowski, W. The many faces of exponential weights in online learning.
In Conference On Learning Theory, pp. 2067–2092. PMLR, 2018.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Loh, P.-L. and Raginsky, M.
(eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of
Machine Learning Research, pp. 4160–4211. PMLR, 2022.

Jordan, M. I., Lin, T., and Zampetakis, M. On the complexity of deterministic nonsmooth and
nonconvex optimization. arXiv preprint arXiv:2209.12463, 2022.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning
using coin betting. In Artificial Intelligence and Statistics, pp. 943–951. PMLR, 2017.

Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh,
A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint
arXiv:2008.03606, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

Kornowski, G. and Shamir, O. On the complexity of finding small subgradients in nonsmooth


optimization. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop),
2022a.

Kornowski, G. and Shamir, O. Oracle complexity in nonsmooth nonconvex optimization. Journal


of Machine Learning Research, 23(314):1–44, 2022b.

Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive SGD with recursive momentum for
nonconvex optimization. Advances in Neural Information Processing Systems, 34:20571–20582,
2021.

Li, H. and Lin, Z. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the $O(\epsilon^{-7/4})$ complexity. In International Conference on Machine Learning. PMLR, 2022.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In
The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR,
2019.

Li, X., Zhuang, Z., and Orabona, F. A second look at exponential and cosine step sizes: Simplicity,
adaptivity, and performance. In International Conference on Machine Learning, pp. 6553–6564.
PMLR, 2021.

Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. META-STORM: Generalized
fully-adaptive variance reduced SGD for unbounded functions. arXiv preprint arXiv:2209.14853,
2022.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983, 2016.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference


on Learning Representations, 2018.

Lu, Z., Xia, W., Arora, S., and Hazan, E. Adaptive gradient methods with local guarantees. arXiv
preprint arXiv:2203.01400, 2022.

Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study
on switching regret for linear bandits. In Conference on Learning Theory, 2022.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In
Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp. 244–256, 2010.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning.


Conference on Learning Theory, pp. 2858–2887, 2020.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Orabona, F. and Pál, D. Scale-free algorithms for online linear optimization. In Chaudhuri, K.,
Gentile, C., and Zilles, S. (eds.), Algorithmic Learning Theory, pp. 287–301. Springer Interna-
tional Publishing, 2015.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on


Learning Theory (COLT), pp. 993–1019, 2013.

Sachs, S., Hadiji, H., van Erven, T., and Guzmán, C. Between stochastic and adversarial online
convex optimization: Improved regret bounds via smoothness. arXiv preprint arXiv:2202.07554,
2022.

Stein, E. M. and Shakarchi, R. Real analysis: measure theory, integration, and Hilbert spaces.
Princeton University Press, 2009.

Tian, L. and So, A. M.-C. No dimension-free deterministic algorithm computes approximate sta-
tionarities of lipschitzians. arXiv preprint arXiv:2210.06907, 2022.

Tian, L., Zhou, K., and So, A. M.-C. On the finite-time complexity and practical computation of
approximate stationarity concepts of lipschitz functions. In International Conference on Machine
Learning, pp. 21360–21379. PMLR, 2022.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization
for fast nonconvex optimization. In Advances in neural information processing systems, pp.
2899–2908, 2018.

Wang, G., Hu, Z., Muthukumar, V., and Abernethy, J. Adaptive oracle-efficient online learning.
arXiv preprint arXiv:2210.09385, 2022.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer,
K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes.
arXiv preprint arXiv:1904.00962, 2019.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive
methods good for attention models? Advances in Neural Information Processing Systems, 33:
15383–15393, 2020a.

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. Complexity of finding stationary points
of nonconvex nonsmooth functions. In International Conference on Machine Learning, 2020b.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Pro-
ceedings of the 32nd International Conference on Neural Information Processing Systems, pp.
1330–1340, 2018.

Zhang, L., Wang, G., Tu, W.-W., Jiang, W., and Zhou, Z.-H. Dual adaptivity: A universal
algorithm for minimizing the adaptive regret of convex functions. Advances in Neural Information
Processing Systems, 34:24968–24980, 2021.

Zhang, Z., Cutkosky, A., and Paschalidis, I. Adversarial tracking control via strongly adaptive
online learning with memory. In International Conference on Artificial Intelligence and Statistics,
pp. 8458–8492. PMLR, 2022.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems,
pp. 3925–3936, 2018.

Zhuang, Z., Cutkosky, A., and Orabona, F. Surrogate losses for online learning of stepsizes in
stochastic non-convex optimization. In International Conference on Machine Learning, pp. 7664–
7672, 2019.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Pro-
ceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936,
2003.

A Proof of Proposition 2
First, we state a technical lemma that will be used to prove Proposition 2.

Lemma 19. Let $F: \mathbb{R}^d \to \mathbb{R}$ be locally Lipschitz. Then, $F$ is differentiable almost everywhere, is Lipschitz on all compact sets, and for all $v \in \mathbb{R}^d$, $x \mapsto \langle \nabla F(x), v\rangle$ is integrable on all compact sets. Finally, for any compact measurable set $D \subset \mathbb{R}^d$, the vector $w = \int_D \nabla F(x)\, dx$ is well-defined and the operator $\rho(v) = \int_D \langle \nabla F(x), v\rangle\, dx$ is linear and equal to $\langle w, v\rangle$.

Proof. First, observe that since $F$ is locally Lipschitz, for every point $x \in \mathbb{R}^d$ with rational coordinates there is a neighborhood $U_x$ of $x$ on which $F$ is Lipschitz. Thus, by Rademacher's theorem, $F$ is differentiable almost everywhere in $U_x$. Since the set of points with rational coordinates is dense in $\mathbb{R}^d$, $\mathbb{R}^d$ is equal to the countable union $\bigcup_x U_x$. Thus, since the set of points of non-differentiability of $F$ in each $U_x$ has measure zero, the total set of points of non-differentiability is a countable union of sets of measure zero and so must have measure zero. Thus $F$ is differentiable almost everywhere. This implies that $F$ is differentiable at $x + pu$ with probability 1.

Next, observe that for any compact set $S \subset \mathbb{R}^d$, for every point $x \in S$ with rational coordinates, $F$ is Lipschitz on some neighborhood $U_x$ containing $x$ with Lipschitz constant $G_x$. Since $S$ is compact, there is a finite set $x_1, \dots, x_K$ such that $S \subseteq \bigcup_i U_{x_i}$. Therefore, $F$ is $\max_i G_{x_i}$-Lipschitz on $S$, and so $F$ is Lipschitz on every compact set.

Now, for almost all $x$, for all $v$ the limit $\lim_{\delta \to 0} \frac{F(x + \delta v) - F(x)}{\delta}$ exists and is equal to $\langle \nabla F(x), v\rangle$ by definition of differentiability. Further, on any compact set $S$ we have $\frac{|F(x + \delta v) - F(x)|}{\delta} \le L$ for some $L$ for all $x \in S$. Therefore, by the bounded convergence theorem (see, e.g., Stein & Shakarchi (2009, Theorem 1.4)), $\langle \nabla F(x), v\rangle$ is integrable on $S$.
Next, we prove the linearity of the operator $\rho$. Observe that for any vectors $v$ and $w$, and scalar $c$, by linearity of integration, we have
$$\int_D \langle \nabla F(x), cv + w\rangle\, dx = \int_D c\langle \nabla F(x), v\rangle + \langle \nabla F(x), w\rangle\, dx = c\int_D \langle \nabla F(x), v\rangle\, dx + \int_D \langle \nabla F(x), w\rangle\, dx\,.$$

For the remaining statement, given that $\mathbb{R}^d$ is finite dimensional, there must exist $w \in \mathbb{R}^d$ such that $\rho(v) = \langle w, v\rangle$ for all $v$. Further, $w$ is uniquely determined by $\langle w, e_i\rangle$ for $i = 1, \dots, d$, where $e_i$ indicates the $i$th standard basis vector. Then, since $\nabla F(x)_i = \langle \nabla F(x), e_i\rangle$ is integrable on compact sets, we have
$$\langle w, e_i\rangle = \int_D \langle \nabla F(x), e_i\rangle\, dx = \int_D \nabla F(x)_i\, dx\,,$$
which is the definition of $\int_D \nabla F(x)\, dx$ when the integral is defined.

We can now prove the Proposition.

Proof of Proposition 2. Since $F$ is locally Lipschitz, by Lemma 19, $F$ is Lipschitz on compact sets. Therefore, $F$ must be Lipschitz on the line segment connecting $x$ and $y$. Thus the function $k(t) = F(x + t(y - x))$ is absolutely continuous on $[0, 1]$. As a result, $k'$ is integrable on $[0, 1]$ and
$$F(y) - F(x) = k(1) - k(0) = \int_0^1 k'(t)\, dt = \int_0^1 \langle \nabla F(x + t(y - x)), y - x\rangle\, dt$$
by the Fundamental Theorem of Calculus (see, e.g., Stein & Shakarchi (2009, Theorem 3.11)).
Now, we tackle the case that $F$ is not differentiable everywhere. Notice that the last statement of the Proposition is an immediate consequence of Lipschitzness. So, we focus on showing the remaining parts.

Now, by Lemma 19, we have that $F$ is differentiable almost everywhere. Further, $g_x = \mathbb{E}_u[\nabla F(x + pu)]$ exists and satisfies, for all $v \in \mathbb{R}^d$:
$$\langle g_x, v\rangle = \mathbb{E}_u[\langle \nabla F(x + pu), v\rangle]\,.$$
Notice also that
$$\mathbb{E}_{z,u}[\widehat{\mathrm{Grad}}(x, (z, u))] = \mathbb{E}_u\left[\mathbb{E}_z[\mathrm{Grad}(x + pu, z)]\right] = \mathbb{E}_u[\nabla F(x + pu)] = g_x\,.$$
So, it remains to show that $\hat F$ is differentiable and $\nabla\hat F(x) = g_x$.

Now, let $x$ be an arbitrary element of $\mathbb{R}^d$ and let $v_1, v_2, \dots$ be any sequence of vectors such that $\lim_{n\to\infty} v_n = 0$ and $\|v_i\| \le p$ for all $i$. Then, since the ball of radius $2p$ centered at $x$ is compact, $F$ is $L$-Lipschitz inside this ball for some $L$. Then, we have
$$\lim_{n\to\infty} \frac{\hat F(x + v_n) - \hat F(x) - \langle g_x, v_n\rangle}{\|v_n\|} = \lim_{n\to\infty} \mathbb{E}_u\left[\frac{F(x + v_n + pu) - F(x + pu) - \langle \nabla F(x + pu), v_n\rangle}{\|v_n\|}\right].$$
Now, observe $\frac{|F(x + v_i + pu) - F(x + pu)|}{\|v_i\|} \le L$. Further, for almost all $u$, $F$ is differentiable at $x + pu$, so that $\lim_{n\to\infty} \frac{F(x + v_n + pu) - F(x + pu) - \langle \nabla F(x + pu), v_n\rangle}{\|v_n\|} = 0$ for almost all $u$. Thus, by the bounded convergence theorem, we have
$$\lim_{n\to\infty} \frac{\hat F(x + v_n) - \hat F(x) - \langle g_x, v_n\rangle}{\|v_n\|} = 0\,,$$
which shows that $g_x = \nabla\hat F(x)$.

Finally, observe that since $F$ is Lipschitz on compact sets, $\hat F$ must be also, and so by the first part of the proposition, $\hat F$ is well-behaved.

B Analysis of (Optimistic) Online Gradient Descent


Optimistic Online Gradient Descent (in its simplest form) is described by Algorithm 3. Here we
collect the standard analysis of the algorithm for completeness. None of this analysis is new, and
more refined versions can be found in a variety of sources (e.g. Chen et al. (2021)).
We will analyze only a simple version of this algorithm: the case when $V$ is an $L_2$ ball of radius
$D$ in some real Hilbert space (such as $\mathbb{R}^d$). Then, Algorithm 3 satisfies the following guarantee.

Proposition 20. Let $V = \{x : \|x\| \le D\} \subset \mathcal{H}$ for some real Hilbert space $\mathcal{H}$. Then, for all
$u \in V$, Algorithm 3 ensures
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta}{2}\|g_t - h_t\|^2.$$

Algorithm 3 Optimistic Mirror Descent
Input: Regularizer function $\psi$, domain $V$, time horizon $T$.
$\hat w_1 = 0$
for $t = 1 \dots T$ do
  Generate “hint” $h_t$
  Set $w_t = \mathrm{argmin}_{x\in V}\ \langle h_t, x\rangle + D_\psi(x, \hat w_t)$
  Output $w_t$ and receive loss vector $g_t$
  Set $\hat w_{t+1} = \mathrm{argmin}_{x\in V}\ \langle g_t, x\rangle + D_\psi(x, \hat w_t)$
end for

Proof. By Chen et al. (2021, Lemma 15), instantiated with the squared Euclidean distance as
Bregman divergence (so $D_\psi(x,y) = \frac{\|x-y\|^2}{2\eta}$), we have
$$\langle g_t, w_t - u\rangle \le \langle g_t - h_t, w_t - \hat w_{t+1}\rangle + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2 - \frac{1}{2\eta}\|\hat w_{t+1} - w_t\|^2 - \frac{1}{2\eta}\|w_t - \hat w_t\|^2.$$
From Young's inequality:
$$\le \frac{\eta\|g_t - h_t\|^2}{2} + \frac{\|w_t - \hat w_{t+1}\|^2}{2\eta} + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2 - \frac{1}{2\eta}\|\hat w_{t+1} - w_t\|^2 - \frac{1}{2\eta}\|w_t - \hat w_t\|^2$$
$$\le \frac{\eta\|g_t - h_t\|^2}{2} + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2.$$
Summing over $t$ and telescoping, we have
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le \frac{\|u - \hat w_1\|^2}{2\eta} - \frac{\|u - \hat w_{T+1}\|^2}{2\eta} + \sum_{t=1}^T \frac{\eta\|g_t - h_t\|^2}{2} \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta\|g_t - h_t\|^2}{2},$$
where the last inequality uses $\hat w_1 = 0$ and $\|u\| \le D$. □
In the case that the hints $h_t$ are not present (that is, $h_t = 0$), the algorithm becomes online gradient
descent (Zinkevich, 2003). In this case, assuming $\mathbb{E}[\|g_t\|^2] \le G^2$ and setting $\eta = \frac{D}{G\sqrt T}$, we obtain
$\mathbb{E}[R_T(u)] \le DG\sqrt T$ for all $u$ such that $\|u\| \le D$.
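To make the updates above concrete, here is a minimal Python sketch of Algorithm 3 specialized to the Euclidean ball (the setting of Proposition 20). It is an illustration of the update rules, not a reference implementation from the paper; hint_fn and loss_fn are hypothetical stand-ins for whatever process supplies the hints $h_t$ and loss vectors $g_t$.

import numpy as np

def project_ball(v, radius):
    # Euclidean projection onto V = {x : ||x|| <= radius}.
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def optimistic_ogd(hint_fn, loss_fn, dim, radius, eta, T):
    # With psi(w) = ||w||^2 / (2 eta), both argmin updates in Algorithm 3
    # reduce to projected gradient steps from the current w_hat.
    w_hat = np.zeros(dim)  # corresponds to \hat{w}_1 = 0
    iterates = []
    for t in range(T):
        h = hint_fn(t)                                   # "hint" h_t
        w = project_ball(w_hat - eta * h, radius)        # w_t
        g = loss_fn(t, w)                                # loss vector g_t
        w_hat = project_ball(w_hat - eta * g, radius)    # \hat{w}_{t+1}
        iterates.append(w)
    return iterates

With hint_fn returning zero vectors, this reduces exactly to the online gradient descent update discussed above.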

C Algorithm 2 and Regret Guarantee


Theorem 21. Let $F$ be an $H$-smooth and $G$-Lipschitz function. Then, when $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$,
Algorithm 2 with $x_t = x_{t-1} + \frac12\Delta_t$ and $g_t = \nabla F(x_t)$ and $\eta \le \frac{1}{2H}$ ensures, for all $\|u\| \le D$,
$$\sum_{t=1}^T \langle g_t, \Delta_t - u\rangle \le \frac{HD^2}{2} + \frac{2GTD}{N}.$$
Furthermore, a total of at most $T\lceil\log_2(\sqrt{NG/(HD)})\rceil$ gradient evaluations are required.

Proof. The count of gradient evaluations is immediate from inspection of the algorithm, so it
remains only to prove the regret bound.
First, we observe that the choices of $\Delta_t$ specified by Algorithm 2 correspond to the values of
$w_t$ produced by Algorithm 3 when $\psi(w) = \frac{1}{2\eta}\|w\|^2$. This can be verified by direct calculation
(recalling that $D_\psi(x,y) = \frac{\|x-y\|^2}{2\eta}$).
Therefore, by Proposition 20, we have
$$\sum_{t=1}^T \langle g_t, \Delta_t - u\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta}{2}\|g_t - h_t\|^2. \qquad (3)$$

So, our primary task is to show that kgt − ht k is small. To this end, recall that gt = ∇F (wt ) =
∇F (xt−1 + ∆t /2).
Now, we define $h_t^{Q+1} = \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^Q\right]\right)$ (which simply continues the recursive definition of $h_t^i$ in Algorithm 2 for one more step). Then, we claim that for all $0 \le i \le Q$,
$\|h_t^{i+1} - h_t^i\| \le \frac{1}{2^i}\|h_t^1 - h_t^0\|$. We establish the claim by induction on $i$. First, for $i = 0$ the claim
holds by definition. Now suppose $\|h_t^i - h_t^{i-1}\| \le \frac{1}{2^{i-1}}\|h_t^1 - h_t^0\|$ for some $i$. Then, we have
$$\|h_t^{i+1} - h_t^i\| = \left\|\nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^i\right]\right) - \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)\right\|.$$
Using the $H$-smoothness of $F$:
$$\le \frac{H}{2}\left\|\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^i\right] - \Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right\|.$$
Using the fact that projection is a contraction:
$$\le \frac{H\eta}{2}\left\|h_t^i - h_t^{i-1}\right\|.$$
Using $\eta \le \frac1H$:
$$\le \frac12\left\|h_t^i - h_t^{i-1}\right\|.$$
From the induction assumption:
$$\le \frac{1}{2^i}\left\|h_t^1 - h_t^0\right\|,$$
so the claim holds.
Now, since $h_t = h_t^Q$, we have $\Delta_t = \Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^Q\right]$. Therefore $g_t = \nabla F(x_t) = \nabla F(x_{t-1} + \Delta_t/2) = h_t^{Q+1}$. Thus,
$$\|g_t - h_t^Q\| = \|h_t^{Q+1} - h_t^Q\| \le \frac{1}{2^Q}\|h_t^1 - h_t^0\| \le \frac{2G}{2^Q},$$
where in the last inequality we used the fact that $F$ is $G$-Lipschitz. So, for $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$,
we have $\|g_t - h_t^Q\| \le 2G\sqrt{\frac{HD}{NG}}$ for all $t$. The result now follows by substituting into equation (3). □
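As a concrete illustration of the recursion analyzed in this proof, the following Python sketch iterates the map $h \mapsto \nabla F(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}[\Delta'_t - \eta h])$ for $Q$ steps. Here grad_F, delta_prime, and the initialization h0 are hypothetical stand-ins for quantities supplied by Algorithm 2; the particular choice of h0 only needs to be bounded for the contraction argument to apply.

import numpy as np

def project_ball(v, radius):
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def hint_recursion(grad_F, x_prev, delta_prime, h0, eta, D, Q):
    # Each iteration halves the distance between consecutive hints when
    # eta <= 1/H, so after Q steps the hint is within O(G / 2^Q) of g_t.
    h = h0
    for _ in range(Q):
        h = grad_F(x_prev + 0.5 * project_ball(delta_prime - eta * h, D))
    return h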
D Proof of Theorem 17
Proof of Theorem 17. Once more, the first part of the result is immediate from the fact that $\|\Delta_n\| \le D$. So, we proceed to show the second part.
Define $\nabla_n = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\,ds$. Then, we have

$$|\langle\nabla_n - g_n, \Delta_n\rangle| = \left|\left\langle\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right)ds,\ \Delta_n\right\rangle\right| \le D\left\|\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right)ds\right\|.$$
Adding and subtracting $\nabla^2 F\left(x_{n-1}+\frac12\Delta_n\right)\Delta_n(s-1/2)$ inside the integral, and observing that $\int_0^1 s - 1/2\,ds = 0$, this equals
$$D\left\|\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right) - \nabla^2 F\left(x_{n-1}+\frac12\Delta_n\right)\Delta_n(s-1/2)\,ds\right\|.$$
Using second-order smoothness:
$$\le D\int_0^1 \frac{J}{2}\|\Delta_n\|^2(s-1/2)^2\,ds \le \frac{JD^3}{48}.$$

In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations,
and so on. In other words, $u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$. So, we have
$$F(x_M) - F(x_0) = R_T(u^1,\dots,u^K) + \sum_{n=1}^M\langle\nabla_n - g_n, \Delta_n\rangle + \sum_{n=1}^M\langle g_n, u_n\rangle \le R_T(u^1,\dots,u^K) + \frac{MJD^3}{48} + \sum_{n=1}^M\langle g_n, u_n\rangle.$$
Now, set $u^k = -D\,\frac{\sum_{t=1}^T\nabla F(w_t^k)}{\left\|\sum_{t=1}^T\nabla F(w_t^k)\right\|}$. Then, by Theorem 21, we have that $R_T(u^k) \le \frac{HD^2}{2} + \frac{2TGD}{N}$.
Therefore:
$$F(x_M) \le F(x_0) + \frac{HD^2K}{2} + \frac{2GDM}{N} + \frac{MJD^3}{48} - DT\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|.$$

Hence, we obtain
$$\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\| \le \frac{F(x_0)-F(x_M)}{MD} + \frac{HD}{2T} + \frac{2G}{N} + \frac{JD^2}{48}.$$

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, using $D = \delta/T$, we
can upper bound the right-hand side with
$$\frac{2T(F(x_0)-F^\star)}{N\delta} + \frac{H\delta}{2T^2} + \frac{4G}{N} + \frac{J\delta^2}{2T^2},$$
and with $T = \min\left(\left\lceil\frac{(\delta^2(H+J\delta)N)^{1/3}}{(F(x_0)-F^\star)^{1/3}}\right\rceil, N/2\right)$:
$$\le 3\frac{(H+J\delta)^{1/3}(F(x_0)-F^\star)^{2/3}}{\delta^{1/3}N^{2/3}} + \frac{4G}{N} + \frac{2(F(x_0)-F^\star)}{N\delta} + 10\frac{\delta(H+J\delta)}{N^2}.$$
Now, the third fact follows by observing that Proposition 15 implies that
$$\frac1K\sum_{k=1}^K\left\|\nabla F(w^k)\right\| \le \frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\| + \frac{J\delta^2}{2}.$$
Now, substituting the specified value of $\delta$ completes the proof. Finally, the count of the number of
gradient evaluations is a direct calculation. □
E Proofs for Section 5


Proposition 14. Suppose that $F$ is $H$-smooth (that is, $\nabla F$ is $H$-Lipschitz) and $x$ also satisfies
$\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + H\delta$.

Proof. Let $S \subset B(x,\delta)$ with $x = \frac{1}{|S|}\sum_{y\in S} y$. By $H$-smoothness, for all $y\in S$, $\|\nabla F(y)-\nabla F(x)\| \le H\|y-x\| \le H\delta$. Therefore, we have
$$\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac1{|S|}\sum_{y\in S}(\nabla F(y)-\nabla F(x))\right\| \ge \|\nabla F(x)\| - H\delta.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$.
Thus, $\|\nabla F(x)\| \le \epsilon + H\delta + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + H\delta$. □

Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2 F(x) - \nabla^2 F(y)\|_{op} \le J\|x-y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$.

Proof. The proof is similar to that of Proposition 14. Let $S \subset B(x,\delta)$ with $x = \frac1{|S|}\sum_{y\in S} y$. By
$J$-second-order-smoothness, for all $y\in S$, we have
$$\|\nabla F(y) - \nabla F(x) - \nabla^2 F(x)(y-x)\| = \left\|\int_0^1 (\nabla^2 F(x+t(y-x)) - \nabla^2 F(x))(y-x)\,dt\right\| \le \int_0^1 tJ\|y-x\|^2\,dt = \frac{J\|y-x\|^2}{2} \le \frac{J\delta^2}{2}.$$
Further, since $\frac1{|S|}\sum_{y\in S} y = x$, we have $\frac1{|S|}\sum_{y\in S}\nabla^2 F(x)(y-x) = 0$. Therefore, we have
$$\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac1{|S|}\sum_{y\in S}(\nabla F(y)-\nabla F(x)-\nabla^2 F(x)(y-x))\right\| \ge \|\nabla F(x)\| - \frac{J\delta^2}{2}.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$.
Thus, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2 + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$. □

F Lower Bounds
Our lower bounds are constructed via a mild alteration to the arguments of Arjevani et al. (2019)
for lower bounds on finding (0, ǫ)-stationary points of smooth functions with a stochastic gradient
oracle. At a high level, we show that since a (δ, ǫ)-stationary point of an H-smooth loss is also a
(0, Hδ + ǫ)-stationary point, a lower bound on the complexity of the latter implies a lower bound on
the complexity of the former. The lower bound of Arjevani et al. (2019) is proved by constructing
a distribution over “hard” functions such that no algorithm can quickly find a (0, ǫ)-stationary point
of a randomly selected function. Unfortunately, these “hard” functions are not Lipschitz. Fortunately,
they take the form F (x) + ηkxk2 where F is Lipschitz and smooth so that the “non-Lipschitz” part
is solely contained in the quadratic term. We show that one can replace the quadratic term kxk2
with a Lipschitz function that is quadratic for sufficiently small x but proportional to kxk for
larger values. Our proof consists of carefully reproducing the argument of Arjevani et al. (2019)
to show that this modification does not cause any problems. We emphasize that almost all of this
development can be found with more detail in Arjevani et al. (2019). We merely restate here the
minimum results required to verify our modification to their construction.

F.1 Definitions and Results from Arjevani et al. (2019)


A randomized first-order algorithm is a distribution PS supported on a set S and a sequence
of measurable mappings Ai (s, g1 , . . . , gi−1 ) → Rd with s ∈ S and gi ∈ Rd . Given a stochastic
gradient oracle Grad : Rd × Z → Rd , a distribution PZ supported on Z and an i.i.d. sample
(z1 , . . . , zn ) ∼ PZ , we define the iterates of A recursively by:

x1 = A1 (s)
xi = Ai (s, Grad(x1 , z1 ), Grad(x2 , z2 ), . . . , Grad(xi−1 , zi−1 )) .

So, xi is a function of s and z1 , . . . , zi−1 . We define Arand to be the set of such sequences of
mappings.
Now, in the notation of Arjevani et al. (2019), we define the “progress function”
$$\mathrm{prog}_c(x) = \max\{i : |x_i| \ge c\}.$$
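For example, for $x = (1, 0.7, 0.1, 0)$ we have $\mathrm{prog}_{1/2}(x) = 2$, since the largest index $i$ with $|x_i| \ge 1/2$ is $i = 2$.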

Further, a stochastic gradient oracle Grad is called a probability-$p$ zero-chain if $\mathrm{prog}_0(\mathrm{Grad}(x,z)) = \mathrm{prog}_{1/4}(x) + 1$ with probability at most $p$, and $\mathrm{prog}_0(\mathrm{Grad}(x,z)) \le \mathrm{prog}_{1/4}(x) + 1$
with probability 1.

Next, let FT : RT → R be the function defined by Lemma 2 of Arjevani et al. (2019). Restating
their Lemma, this function satisfies:
Lemma 22 (Lemma 2 of Arjevani et al. (2019)). There exists a function $F_T : \mathbb{R}^T \to \mathbb{R}$ that satisfies:
1. FT (0) = 0 and inf FT (x) ≥ −γ0 T for γ0 = 12.
2. ∇FT (x) is H0 -Lipschitz, with H0 = 152.
3. For all $x$, $\|\nabla F_T(x)\|_\infty \le G_0$ with $G_0 = 23$.
4. For all x, prog0 (∇FT (x)) ≤ prog1/2 (∇FT (x)) + 1.
5. If prog1 (x) < T , then k∇FT (x)k ≥ |∇FT (x)prog1 (x)+1 | ≥ 1.
We associate with this function $F_T$ the stochastic gradient oracle $\mathrm{Grad}_T : \mathbb{R}^T\times\{0,1\}\to\mathbb{R}^T$,
where $z$ is Bernoulli($p$):
$$\mathrm{Grad}_T(x,z)_i = \begin{cases}\nabla F_T(x)_i & \text{if } i \ne \mathrm{prog}_{1/4}(x)\\ \frac{z\,\nabla F_T(x)_i}{p} & \text{if } i = \mathrm{prog}_{1/4}(x)\end{cases}$$
It is clear that $\mathbb{E}_z[\mathrm{Grad}_T(x,z)] = \nabla F_T(x)$.
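This oracle is mechanical enough to state in code. The Python sketch below is a literal transcription of the display above; grad_FT is a hypothetical stand-in for the gradient of the hard instance $F_T$ (which is not constructed explicitly here), and we adopt the convention, assumed for this sketch, that prog returns 0 when no coordinate qualifies.

import numpy as np

def prog(x, c):
    # prog_c(x) = max{i : |x_i| >= c}, with 1-indexed coordinates.
    idx = np.flatnonzero(np.abs(x) >= c)
    return 0 if idx.size == 0 else int(idx[-1]) + 1

def grad_T(grad_FT, x, p, rng):
    g = np.array(grad_FT(x), dtype=float)
    i = prog(x, 0.25)                  # the special coordinate prog_{1/4}(x)
    if 1 <= i <= g.size:
        z = rng.random() < p           # z ~ Bernoulli(p)
        g[i - 1] = (z / p) * g[i - 1]  # unbiased since E[z/p] = 1
    return g

The single rescaled coordinate is exactly what produces the $G_0^2/p$ variance in Lemma 23 below.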


This construction is so far identical to that in Arjevani et al. (2019), and so we have by their
Lemma 3:
Lemma 23 (Lemma 3 of Arjevani et al. (2019)). $\mathrm{Grad}_T$ is a probability-$p$ zero-chain, has variance
$\mathbb{E}[\|\mathrm{Grad}_T(x,z) - \nabla F_T(x)\|^2] \le G_0^2/p$, and $\|\mathrm{Grad}_T(x,z)\| \le \frac{G_0}{p} + G_0\sqrt T$.

Proof. The probability-$p$ zero-chain and variance statements are directly from Arjevani et al. (2019).
For the bound on $\|\mathrm{Grad}_T\|$, observe that $\mathrm{Grad}_T(x,z) = \nabla F_T(x)$ in all but one coordinate. In
that one coordinate, $\mathrm{Grad}_T(x,z)$ is at most $\frac{\|\nabla F_T(x)\|_\infty}{p} = \frac{G_0}{p}$. Thus, the bound follows by the triangle
inequality. □
Next, for any matrix U ∈ Rd×T with orthonormal columns, we define FT,U : Rd → R by:
FT,U (x) = FT (U ⊤ x) .
The associated stochastic gradient oracle is:
GradT,U (x, z) = U GradT (U ⊤ x, z) .
Now, we restate Lemma 5 of Arjevani et al. (2019):
Lemma 24 (Lemma 5 of Arjevani et al. (2019)). Let $R > 0$ and suppose $A \in A_{\mathrm{rand}}$ is such that
$A$ produces iterates $x_t$ with $\|x_t\| \le R$. Let $d \ge \left\lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\right\rceil$. Suppose $U$ is chosen uniformly at
random from the set of $d\times T$ matrices with orthonormal columns. Let Grad be a probability-$p$ zero-chain and let $\mathrm{Grad}_U(x,z) = U\,\mathrm{Grad}(U^\top x, z)$. Let $x_1, x_2, \dots$ be the iterates of $A$ when provided
the stochastic gradient oracle $\mathrm{Grad}_U$. Then with probability at least $1-c$ (over the randomness of
$U$, the oracle, and also the seed $s$ of $A$):
$$\mathrm{prog}_{1/4}(U^\top x_t) < T \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$

F.2 Defining the “Hard” Instance
Now, we for the first time diverge from the construction of Arjevani et al. (2019) (albeit only
slightly). Their construction uses a “shrinking function” $\rho_{R,d} : \mathbb{R}^d\to\mathbb{R}^d$ given by $\rho_{R,d}(x) = \frac{x}{\sqrt{1 + \|x\|^2/R^2}}$, as well as an additional quadratic term to overcome the limitation of bounded iterates.
We cannot tolerate the non-Lipschitz quadratic term, so we replace it with a Lipschitz version
$q_{B,d}(x) = x^\top\rho_{B,d}(x)$. Intuitively, $q_{B,d}$ behaves like $\|x\|^2$ for small enough $x$, but behaves like $\|x\|$
for large $\|x\|$. This pre-processed function is defined by:
$$\hat F_{T,U}(x) = F_{T,U}(\rho_{R,d}(x)) + \eta q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = F_T(U^\top\rho_{R,d}(x)) + \eta q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = F_T(U^\top\rho_{R,d}(x)) + \eta q_{B,d}(x).$$

To see the last line above, note that since $q_{B,d}$ and $\rho_{R,d}$ are rotationally symmetric, and $U$ has
orthonormal columns, we have $U^\top\rho_{R,d}(x) = \rho_{R,T}(U^\top x)$ and $q_{B,T}(U^\top x) = q_{B,d}(x)$. Thus, we have
$q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = q_{B,d}(x)$. The stochastic gradient oracle associated with $\hat F_{T,U}(x)$ is
$$\widehat{\mathrm{Grad}}_{T,U}(x,z) = J[\rho_{R,d}](x)^\top U\,\mathrm{Grad}_T(U^\top\rho_{R,d}(x), z) + \eta\nabla q_{B,d}(x) = J[\rho_{R,d}](x)^\top U\,\mathrm{Grad}_T(U^\top\rho_{R,d}(x), z) + \eta J[\rho_{R,d}](x)^\top U\,J[\rho_{R,T}^{-1}](U^\top\rho_{R,d}(x))^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))),$$
where $J[f](x)$ indicates the Jacobian of the function $f$ evaluated at $x$.
A description of the relevant properties of qB is provided in Section F.3.
Next we produce a variant on Lemma 6 from Arjevani et al. (2019). This is the most delicate
part of our alteration, although the proof is still almost identical to that of Arjevani et al. (2019).

Lemma 25 (variant on Lemma 6 of Arjevani et al. (2019)). Let $R = B = 10G_0T$. Let $\eta = 1/10$,
$c \in (0,1)$, $p \in (0,1)$, and $T \in \mathbb{N}$. Set $d = \lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\rceil$ and let $U$ be sampled uniformly
from the set of $d\times T$ matrices with orthonormal columns. Define $\hat F_{T,U}$ and $\widehat{\mathrm{Grad}}_{T,U}$ as above.
Suppose $A \in A_{\mathrm{rand}}$ and let $x_1, x_2, \dots$ be the iterates of $A$ when provided with $\widehat{\mathrm{Grad}}_{T,U}$ as input.
Then with probability at least $1-c$:
$$\|\nabla\hat F_{T,U}(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Proof. First, observe that $\mathrm{Grad}_T$ is a probability-$p$ zero-chain. Next, since $q_B$ is rotationally symmetric
and $\rho_R(x) \propto x$, we see that $\nabla q_B(x) \propto x$, so that $\mathrm{prog}_0(J[\rho_{R,T}^{-1}](x)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(x))) = \mathrm{prog}_0(x)$.
Thus, the oracle
$$\widetilde{\mathrm{Grad}}_T(x,z) = \mathrm{Grad}_T(x,z) + \eta J[\rho_{R,T}^{-1}](x)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(x))$$
is a probability-$p$ zero-chain. We define the rotated oracle:
$$\widetilde{\mathrm{Grad}}_{U,T}(y,z) = U\,\widetilde{\mathrm{Grad}}_T(U^\top y, z) = U\,\mathrm{Grad}_T(U^\top y, z) + \eta U\,J[\rho_{R,T}^{-1}](U^\top y)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(U^\top y)).$$
Following Arjevani et al. (2019), we now define $y_i = \rho_R(x_i)$. Notice that:
$$\widehat{\mathrm{Grad}}_{T,U}(x,z) = J[\rho_{R,d}](x)^\top\,\widetilde{\mathrm{Grad}}_{U,T}(\rho_{R,d}(x), z).$$
Therefore, we can view $y_1,\dots,y_T$ as the iterates of some different algorithm $A_y \in A_{\mathrm{rand}}$ applied
to the oracle $\widetilde{\mathrm{Grad}}_{U,T}$ (since they are computed by measurable functions of the oracle outputs
and previous iterates). However, since $\|\rho_R(x)\| \le R$, we have $\|y_i\| \le R$ for all $i$. Thus, since $\widetilde{\mathrm{Grad}}_T$ is
a probability-$p$ zero-chain, by Lemma 24 we have that so long as $d \ge \lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\rceil$, then with
probability at least $1-c$, for all $i \le (T - \log(2/c))/(2p)$,
$$\mathrm{prog}_{1/4}(U^\top y_i) < T.$$
Now, our goal is to show that $\|\nabla\hat F_{T,U}(x_i)\| \ge 1/2$. We consider two cases: either $\|x_i\| > R/2$ or
not.
First, observe that for $x_i$ with $\|x_i\| > R/2$, we have:
$$\|\nabla\hat F_{T,U}(x_i)\| \ge \eta\|\nabla q_B(x_i)\| - \|J[\rho_R](x_i)\|_{op}\,\|\nabla F_{T,U}(y_i)\| \ge \eta\,\frac{\|x_i\|}{\sqrt{1+\|x_i\|^2/B^2}} - G_0\sqrt{T}.$$
Recalling $B = R = 10G_0T$, $\eta = 1/10$, and $G_0 = 23$, the right-hand side is at least $1/2$.

Now, suppose $\|x_i\| \le R/2$. Then, let us set $j = \mathrm{prog}_1(U^\top y_i) + 1 \le T$ (the inequality follows
since $\mathrm{prog}_1 \le \mathrm{prog}_{1/4}$). Then, if $u^j$ indicates the $j$th column of $U$, Lemma 22 implies:
$$|\langle u^j, y_i\rangle| < 1, \qquad |\langle u^j, \nabla F_{T,U}(y_i)\rangle| \ge 1.$$
Next, by direct calculation we have $J[\rho_R](x_i) = \frac{I - \rho_R(x_i)\rho_R(x_i)^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}} = \frac{I - y_iy_i^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}}$, so that:
$$\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle = \langle u^j, J[\rho_R](x_i)^\top\nabla F_{T,U}(y_i)\rangle + \eta\langle u^j, \nabla q_B(x_i)\rangle = \frac{\langle u^j, \nabla F_{T,U}(y_i)\rangle}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\langle u^j, y_i\rangle\langle y_i, \nabla F_{T,U}(y_i)\rangle/R^2}{\sqrt{1+\|x_i\|^2/R^2}} + \eta\langle u^j, \nabla q_B(x_i)\rangle.$$
Now, by Proposition 29, we have $\nabla q_B(x_i) = \left(2 - \frac{\|y_i\|^2}{B^2}\right)y_i$. So (with $R = B$):
$$\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle = \frac{\langle u^j, \nabla F_{T,U}(y_i)\rangle}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\langle u^j, y_i\rangle\langle y_i, \nabla F_{T,U}(y_i)\rangle/R^2}{\sqrt{1+\|x_i\|^2/R^2}} + \eta\left(2 - \frac{\|y_i\|^2}{B^2}\right)\langle u^j, y_i\rangle.$$
Observing that $\|y_i\| \le \|x_i\| \le R/2$ and $|\langle u^j, y_i\rangle| < 1$:
$$|\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle| \ge \frac{|\langle u^j, \nabla F_{T,U}(y_i)\rangle|}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\|\nabla F_{T,U}(y_i)\|}{2R} - 2\eta.$$
Using $|\langle u^j, \nabla F_{T,U}(y_i)\rangle| \ge 1$ and $\|\nabla F_{T,U}(y_i)\| \le G_0T$:
$$\ge \frac{2}{\sqrt5} - \frac{G_0T}{2R} - 2\eta.$$
With $R = 10G_0T$ and $\eta = 1/10$:
$$= \frac{2}{\sqrt5} - \frac1{20} - \frac15 > 1/2. \qquad\square$$
With the settings of parameters specified in Lemma 25, we have:


Lemma 26 (variation on Lemma 7 in Arjevani et al. (2019)). With the settings of $R$, $B$, $\eta$ in
Lemma 25, the function $\hat F_{T,U}$ satisfies:
1. $\hat F_{T,U}(0) - \inf_x \hat F_{T,U}(x) \le \gamma_0T = 12T$.
2. $\hat F_{T,U}$ is $G_0\sqrt T + 3\eta B \le 92\sqrt T$-Lipschitz.
3. $\nabla\hat F_{T,U}(x)$ is $H_0 + 3 + 8\eta \le 156$-Lipschitz.
4. $\|\widehat{\mathrm{Grad}}_{T,U}(x,z)\| \le \frac{G_0}{p} + G_0\sqrt T + 3\eta B \le \frac{23}{p} + 92\sqrt T$ with probability 1.
5. $\widehat{\mathrm{Grad}}_{T,U}$ has variance at most $\frac{G_0^2}{p} \le \frac{23^2}{p}$.

Proof. 1. This property follows immediately from the fact that $F_T(0) - \inf_x F_T(x) \le \gamma_0T$.
2. Since $\rho_R$ is 1-Lipschitz for all $R$ and $q_B$ is $3B$-Lipschitz (see Proposition 29), $\hat F_{T,U}$ is
$G_0\sqrt T + 3\eta B$-Lipschitz.
3. By assumption, $R \ge \max(H_0, 1)$. Thus, by Arjevani et al. (2019, Lemma 16), $\nabla F_T(\rho_R(x))$
is $H_0+3$-Lipschitz, and so $\nabla\hat F_{T,U}$ is $H_0 + 3 + 8\eta$-Lipschitz by Proposition 29.
4. Since $\|\mathrm{Grad}_T\| \le \frac{G_0}{p} + G_0\sqrt T$, and $J[\rho_R](x)^\top U$ has operator norm at most 1, the bound follows.
5. Just as in the previous part, since $\mathrm{Grad}_T$ has variance $G_0^2/p$ and $J[\rho_R](x)^\top U$ has operator
norm at most 1, the bound follows. □
Now, we are finally in a position to prove:



Theorem 27. Given any $\gamma$, $H$, $\epsilon$, and $\sigma$ such that $\frac{H\gamma}{12\cdot 156\cdot 4\epsilon^2} \ge 2$, there exists a distribution over
functions $F$ and stochastic first-order oracles Grad such that with probability 1, $F$ is $H$-smooth,
$F(0) - \inf_x F(x) \le \gamma$, $F$ is $3\sqrt{H\gamma}$-Lipschitz, and Grad has variance $\sigma^2$, and for any algorithm $A$ in
$A_{\mathrm{rand}}$, with probability at least $1-c$, when provided a randomly selected Grad, $A$ requires at least
$\Omega\left(\frac{\gamma H\sigma^2}{\epsilon^4}\right)$ iterations to output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le \epsilon$.

Proof. From Lemma 26 and Lemma 25, we have a distribution over functions $F$ and first-order
oracles such that, with probability 1, $F$ is $G_0\sqrt T + 3G_0T \le 92\sqrt T$-Lipschitz, $F$ is $H_0 + 3 + 8\eta \le 156$-smooth, $F(0) - \inf_x F(x) \le \gamma_0T = 12T$, Grad has variance at most $G_0^2/p = 23^2/p$, and with
probability at least $1-c$,
$$\|\nabla F(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Now, set $\lambda = \frac{156}{H}\cdot 2\epsilon$, $T = \left\lfloor\frac{H\gamma}{12\cdot156\cdot(2\epsilon)^2}\right\rfloor$ and $p = \min((2\cdot23\cdot\epsilon)^2/\sigma^2,\ 1)$. Then, define
$$F_\lambda(x) = \frac{H\lambda^2}{156}F(x/\lambda).$$
Then $F_\lambda$ is $\frac{H\lambda^2}{156}\cdot\frac{1}{\lambda^2}\cdot156 = H$-smooth, $F_\lambda(0) - \inf_x F_\lambda(x) \le 12T\cdot\frac{H\lambda^2}{156} \le \gamma$, and $F_\lambda$ is $92\sqrt T\cdot\frac{H\lambda}{156} \le 3\sqrt{H\gamma}$-Lipschitz. We can construct an oracle $\mathrm{Grad}_\lambda$ from Grad by:
$$\mathrm{Grad}_\lambda(x,z) = \frac{H\lambda}{156}\,\mathrm{Grad}(x/\lambda, z),$$
so that
$$\mathbb{E}[\|\mathrm{Grad}_\lambda(x,z) - \nabla F_\lambda(x)\|^2] \le \frac{H^2\lambda^2}{156^2}\cdot\frac{23^2}{p} = \sigma^2.$$
Further, since an oracle for $F_\lambda$ can be constructed from the oracle for $F$, if we run $A$ on $F_\lambda$, then with
probability at least $1-c$,
$$\|\nabla F_\lambda(x_t)\| = \frac{H\lambda}{156}\|\nabla F(x_t)\| \ge \epsilon \quad\text{for all } t \le \frac{T-\log(2/c)}{2p}, \quad\text{and}\quad \frac{T-\log(2/c)}{2p} \ge 3\cdot10^{-7}\cdot\frac{H\gamma\sigma^2}{\epsilon^4} - 5\cdot10^{-3}\cdot\frac{\log(2/c)}{\epsilon^2}.$$
Thus, there exists a constant $K$ such that
$$\mathbb{E}[\|\nabla F_\lambda(x_t)\|] \ge \epsilon \quad\text{for all } t \le K\frac{H\gamma\sigma^2}{\epsilon^4}. \qquad\square$$
From this result, we have our main lower bound (the formal version of Theorem 18):

Theorem 28. For any $\delta$, $\epsilon$, $\gamma$, and $G \ge 3\sqrt{\frac{2\epsilon\gamma}{\delta}}$, there is a distribution over $G$-Lipschitz $C^\infty$ functions
$F$ with $F(0) - \inf_x F(x) \le \gamma$ and stochastic gradient oracles Grad with $\mathbb{E}[\|\mathrm{Grad}(x,z)\|^2] \le G^2$
such that for any algorithm $A \in A_{\mathrm{rand}}$, if $A$ is provided as input a randomly selected oracle Grad,
$A$ will require $\Omega(G^2\gamma/(\delta\epsilon^3))$ iterations to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.

Proof. From Theorem 27, for any $H$, $\epsilon'$, $\gamma$, and $\sigma$ we have a distribution over $C^\infty$ functions
$F$ and oracles Grad such that $F$ is $H$-smooth, $3\sqrt{H\gamma}$-Lipschitz, and $F(0) - \inf_x F(x) \le \gamma$, and
Grad has variance $\sigma^2$, such that $A$ requires $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations to output a point $x$ such that
$\mathbb{E}[\|\nabla F(x)\|] \le \epsilon'$. Set $\sigma^2 = G^2/2$, $H = \epsilon/\delta$, and $\epsilon' = 2\epsilon$. Then, we see that Grad has variance $G^2/2$, and
$F$ is $3\sqrt{H\gamma} = \frac{3\sqrt{\epsilon\gamma}}{\sqrt\delta} \le G/\sqrt2$-Lipschitz, so that $\mathbb{E}[\|\mathrm{Grad}(x,z)\|^2] \le G^2$. Further, by Proposition 14,
if $\|\nabla F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\| \le \epsilon + H\delta = 2\epsilon$. Therefore, since $A$ cannot output a point $x$ with
$\mathbb{E}[\|\nabla F(x)\|] \le \epsilon' = 2\epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations, we see that $A$ also cannot output a
point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4) = \Omega(\gamma G^2/(\epsilon^3\delta))$ iterations. □

F.3 Definition and Properties of qB
Consider the function $q_{B,d} : \mathbb{R}^d\to\mathbb{R}$ defined by
$$q_{B,d}(x) = \frac{\|x\|^2}{\sqrt{1 + \|x\|^2/B^2}} = x^\top\rho_{B,d}(x).$$
This function has the following properties, all of which follow from direct calculation:

Proposition 29. $q_{B,d}$ satisfies:
1. $$\nabla q_{B,d}(x) = \frac{2x}{\sqrt{1+\|x\|^2/B^2}} - \frac{x\|x\|^2}{B^2(1+\|x\|^2/B^2)^{3/2}} = \left(2 - \frac{\|\rho_B(x)\|^2}{B^2}\right)\rho_B(x).$$
2. $$\nabla^2 q_{B,d}(x) = \frac{1}{\sqrt{1+\|x\|^2/B^2}}\left(2I - \frac{4xx^\top}{B^2(1+\|x\|^2/B^2)} - \frac{\|x\|^2 I}{B^2(1+\|x\|^2/B^2)} + \frac{3\|x\|^2xx^\top}{B^4(1+\|x\|^2/B^2)^2}\right).$$
3. $$\frac{\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le \|\nabla q_{B,d}(x)\| \le \frac{3\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le 3B.$$
4. $$\|\nabla^2 q_{B,d}(x)\|_{op} \le \frac{8}{\sqrt{1+\|x\|^2/B^2}} \le 8.$$
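Since these identities follow from direct calculation, they are easy to sanity-check numerically. The following standalone Python snippet (an illustrative check, not part of any proof) compares the closed form for $\nabla q_{B,d}$ from item 1 against central finite differences at a random point.

import numpy as np

def q(x, B):
    return np.dot(x, x) / np.sqrt(1.0 + np.dot(x, x) / B**2)

def grad_q(x, B):
    rho = x / np.sqrt(1.0 + np.dot(x, x) / B**2)   # rho_{B,d}(x)
    return (2.0 - np.dot(rho, rho) / B**2) * rho

rng = np.random.default_rng(0)
x, B, eps = rng.standard_normal(5), 3.0, 1e-6
fd = np.array([(q(x + eps * e, B) - q(x - eps * e, B)) / (2 * eps)
               for e in np.eye(5)])                # central differences
assert np.allclose(fd, grad_q(x, B), atol=1e-5)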

G Proof of Theorem 13
First, we state and prove a theorem analogous to Theorem 8.

Theorem 30. Assume $F : \mathbb{R}^d\to\mathbb{R}$ is well-behaved. In Algorithm 1, set $s_n$ to be a random
variable sampled uniformly from $[0,1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. For $i = 1,\dots,d$, set
$$u_i^k = -D_\infty\,\frac{\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}}{\left|\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}\right|}$$
for some $D_\infty > 0$. Finally, suppose $\mathrm{Var}(g_{n,i}) = \sigma_i^2$ for $i = 1,\dots,d$. Then, we have
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] \le \frac{F(x_0)-F^\star}{D_\infty M} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{D_\infty M} + \frac{\sum_{i=1}^d\sigma_i}{\sqrt T}.$$

Proof. In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$
iterations, and so on. In other words, $u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$.

From Theorem 7, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$

Now, since $u_i^k = -D_\infty\,\frac{\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}}{\left|\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}\right|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1,\dots,d$, we
have
$$\mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla F(w_t^k), u^k\right\rangle\right] + D_\infty\,\mathbb{E}\left[\sum_{k=1}^K\left\|\sum_{t=1}^T(\nabla F(w_t^k) - g_{(k-1)T+t})\right\|_1\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla F(w_t^k), u^k\right\rangle\right] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i = \mathbb{E}\left[-D_\infty T\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i.$$

Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i - D_\infty T\,\mathbb{E}\left[\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right].$$
Dividing by $KTD_\infty = D_\infty M$ and reordering, we have the stated bound. □
We can now prove Theorem 13.

Proof of Theorem 13. Since $A$ guarantees $\|\Delta_n\|_\infty \le D_\infty$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\|_\infty = \|x_n - (1-s_n)\Delta_n - x_{n'-1} + s_{n'}\Delta_{n'}\|_\infty \le \left\|\sum_{i=n+1}^{n'-1}\Delta_i\right\|_\infty + \|\Delta_n\|_\infty + \|\Delta_{n'}\|_\infty \le D_\infty((n'-1)-(n+1)+1) + 2D_\infty = D_\infty(n'-n+1) \le D_\infty T.$$
Therefore, we clearly have $\|w_t^k - w^k\|_\infty \le D_\infty T = \delta$, which shows the first fact.


Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. For the second fact,
observe that $\mathrm{Var}(g_{n,i}) \le \mathbb{E}[g_{n,i}^2] \le G_i^2$. Thus, applying Theorem 30 in concert with the additional
assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le D_\infty K\sqrt T\sum_{i=1}^dG_i$, we have
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] \le 2\frac{F(x_0)-F^\star}{D_\infty N} + 2\frac{KD_\infty\sqrt T\sum_{i=1}^dG_i}{D_\infty N} + \frac{\sum_{i=1}^dG_i}{\sqrt T} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3\sum_{i=1}^dG_i}{\sqrt T} \le \max\left(\frac{5(\sum_{i=1}^dG_i)^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^dG_i}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last inequality is due to the choice of $T$.


Now, for the final fact, observe that $\|w_t^k - w^k\|_\infty \le \delta$ for all $t$ and $k$, and also that $w^k = \frac1T\sum_{t=1}^T w_t^k$. Therefore $S = \{w_1^k,\dots,w_T^k\}$ satisfies the conditions in the infimum in Definition 12,
so that $\|\nabla F(w^k)\|_{1,\delta} \le \left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1$. □
H Directional Derivative Setting


In the main text, our algorithms make use of a stochastic gradient oracle. However, the prior work
of Zhang et al. (2020b) instead considers a stochastic directional gradient oracle. This is a less
common setup, and other works (e.g., Davis et al. (2021)) have also taken our route of tackling
non-smooth optimization via an oracle that returns gradients at points of differentiability.
Nevertheless, all our results extend easily to the exact setting of Zhang et al. (2020b) in which
F is Lipschitz and directionally differentiable and we have access to a stochastic directional gradient
oracle rather than a stochastic gradient oracle. To quantify this setting, we need a bit more notation
which we copy directly from Zhang et al. (2020b) below:
First, from Clarke (1990) and Zhang et al. (2020b), the generalized directional derivative of a
function $F$ in a direction $d$ is
$$F^\circ(x, d) = \limsup_{y\to x,\ t\downarrow 0}\frac{F(y + td) - F(y)}{t}. \qquad (4)$$

Further, the generalized gradient is the set
$$\partial F(x) = \{g : \langle g, d\rangle \le F^\circ(x, d) \text{ for all } d\}.$$

Finally, $F : \mathbb{R}^d\to\mathbb{R}$ is Hadamard directionally differentiable in the direction $v \in \mathbb{R}^d$ if for any
function $\psi : \mathbb{R}_+\to\mathbb{R}^d$ such that $\lim_{t\to0}\frac{\psi(t)-\psi(0)}{t} = v$ and $\psi(0) = x$, the following limit exists:
$$\lim_{t\to0}\frac{F(\psi(t)) - F(x)}{t}.$$
If F is Hadamard directionally differentiable, then the above limit is denoted F ′ (x, v). When F is
Hadamard directionally differentiable for all x and v, then we say simply that F is directionally
differentiable.
With these definitions, a stochastic directional oracle for a Lipschitz, directionally differentiable,
and bounded from below function F is an oracle Grad(x, v, z) that outputs g ∈ ∂F (x) such that

$\langle g, v\rangle = F'(x, v)$. In this case, Zhang et al. (2020b) shows (Lemma 3) that $F$ satisfies an alternative
notion of well-behavedness:
$$F(y) - F(x) = \int_0^1\langle\mathbb{E}[\mathrm{Grad}(x + t(y-x), y-x, z)], y-x\rangle\,dt. \qquad (5)$$

Next, we define:
Definition 31. A point $x$ is a $(\delta,\epsilon)$-stationary point of $F$ for the generalized gradient if there is a
set of points $S$ contained in the ball of radius $\delta$ centered at $x$ such that for $y$ selected uniformly at
random from $S$, $\mathbb{E}[y] = x$, and for all $y$ there is a choice of $g_y \in \partial F(y)$ such that $\|\mathbb{E}[g_y]\| \le \epsilon$.
Similarly, we have the definition:
Definition 32. Given a point $x$ and a number $\delta > 0$, define:
$$\|\partial F(x)\|_\delta \triangleq \inf_{\substack{S\subset B(x,\delta),\ \frac1{|S|}\sum_{y\in S}y = x,\ g_y\in\partial F(y)}}\left\|\frac1{|S|}\sum_{y\in S}g_y\right\|.$$

In fact, whenever a locally Lipschitz function F is differentiable at a point x, we have that


∇F (x) ∈ ∂F (x), so that k∂F (x)kδ ≤ k∇F (x)kδ . Thus our results in the main text also bound
k∂F (x)kδ . However, while a gradient oracle is also a directional derivative oracle, a directional
derivative oracle is only guaranteed to be a gradient oracle if F is continuously differentiable at the
queried point x. This technical issue means that when we have access to a directional derivative
oracle rather than a gradient oracle, we will instead only bound k∂F (x)kδ rather than k∇F (x)kδ .
Despite this technical complication, our overall strategy is essentially identical. The key obser-
vation is that the only time at which we used the properties of the gradient previously was when
we invoked well-behavedness of F . When we have a directional derivative instead of the gradient,
the alternative notion of well-behavedness in (5) will play an identical role. Thus, our approach is
simply to replace the call to Grad(wn , zn ) in Algorithm 1 with a call instead to Grad(wn , ∆n , zn )
(see Algorithm 4). With this change, all of our analysis in the main text applies almost without
modification. Essentially, we only need to change notation in a few places to reflect the updated
definitions.
To begin this notational update, the counterpart to Theorem 7 is:
Theorem 33. Suppose $F$ is Lipschitz and directionally differentiable. With the notation in Algorithm 4, if we let $s_n$ be independent random variables uniformly distributed in $[0,1]$, then for any
sequence of vectors $u_1,\dots,u_M$ we have the equality:
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, \Delta_n - u_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$

Proof. By the alternative well-behavedness property (5),
$$F(x_n) - F(x_{n-1}) = \int_0^1\langle\mathbb{E}[\mathrm{Grad}(x_{n-1} + s(x_n - x_{n-1}), x_n - x_{n-1}, z_n)], x_n - x_{n-1}\rangle\,ds = \mathbb{E}[\langle g_n, \Delta_n\rangle] = \mathbb{E}[\langle g_n, \Delta_n - u_n\rangle + \langle g_n, u_n\rangle],$$
Algorithm 4 Online-to-Non-Convex Conversion (directional derivative oracle version)
Input: Initial point x0 , K ∈ N, T ∈ N, online learning algorithm A, sn for all n
Set M = K · T
for n = 1 . . . M do
Get ∆n from A
Set xn = xn−1 + ∆n
Set wn = xn−1 + sn ∆n
Sample random zn
Generate directional derivative gn = Grad(wn , ∆n , zn )
Send gn to A as gradient
end for
Set wtk = w(k−1)T +t for k = 1, . . . , K and t = 1, . . . , T
Set $w^k = \frac1T\sum_{t=1}^T w_t^k$ for $k = 1,\dots,K$
Return $\{w^1,\dots,w^K\}$

where in the second equality we have used the definition $g_n = \mathrm{Grad}(x_{n-1} + s_n(x_n - x_{n-1}), x_n - x_{n-1}, z_n)$, the assumption that $s_n$ is uniform on $[0,1]$, and Fubini's theorem (as Grad is bounded
by Lipschitzness of $F$). Now, sum over $n$ and telescope to obtain the stated bound. □
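For reference, Algorithm 4 is only a few lines of code. The Python sketch below is a direct rendering of the pseudocode above; online_learner and dgrad are hypothetical interfaces standing in for the online algorithm A (for instance, the optimistic OGD sketch from Appendix B) and the stochastic directional derivative oracle Grad(x, v, z), with the random z drawn inside dgrad.

import numpy as np

def online_to_nonconvex(online_learner, dgrad, x0, K, T, rng):
    # Online-to-non-convex conversion with a directional derivative oracle.
    x = np.array(x0, dtype=float)
    ws = []
    for n in range(K * T):
        delta = online_learner.get_iterate()   # Delta_n from A
        w = x + rng.random() * delta           # w_n = x_{n-1} + s_n * Delta_n
        x = x + delta                          # x_n = x_{n-1} + Delta_n
        g = dgrad(w, delta)                    # g_n = Grad(w_n, Delta_n, z_n)
        online_learner.update(g)               # send g_n to A as gradient
        ws.append(w)
    # w^k = average of the k-th block of T evaluation points
    return [np.mean(ws[k * T:(k + 1) * T], axis=0) for k in range(K)]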

Next, we have the following analog of Theorem 8:


Theorem 34. With the notation in Algorithm 4, set $s_n$ to be a random variable sampled uniformly
from $[0,1]$. Set $T, K\in\mathbb{N}$ and $M = KT$. Define $\nabla_t^k = \mathbb{E}[g_{(k-1)T+t}]$. Define $u^k = -D\,\frac{\sum_{t=1}^T\nabla_t^k}{\left\|\sum_{t=1}^T\nabla_t^k\right\|}$
for some $D > 0$, for $k = 1,\dots,K$. Finally, suppose $\mathrm{Var}(g_n) = \sigma^2$. Then:
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le \frac{F(x_0)-F^\star}{DM} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{DM} + \frac{\sigma}{\sqrt T}.$$
Proof. The proof is essentially identical to that of Theorem 8. In Theorem 33, set $u_n$ to be
equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations, and so on. In other words,
$u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$. So, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$
Now, since $u^k = -D\,\frac{\sum_{t=1}^T\nabla_t^k}{\left\|\sum_{t=1}^T\nabla_t^k\right\|}$ and $\mathrm{Var}(g_n) = \sigma^2$, we have
$$\mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla_t^k, u^k\right\rangle\right] + \mathbb{E}\left[\sum_{k=1}^K D\left\|\sum_{t=1}^T(\nabla_t^k - g_{(k-1)T+t})\right\|\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla_t^k, u^k\right\rangle\right] + D\sigma K\sqrt T = \mathbb{E}\left[-DT\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] + D\sigma K\sqrt T.$$
Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \sigma DK\sqrt T - DT\,\mathbb{E}\left[\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right].$$
Dividing by $KDT = DM$ and reordering, we have the stated bound. □
Finally, we instantiate Theorem 34 with online gradient descent to obtain the analog of Corollary 9. This result establishes that the online-to-non-convex conversion finds a $(\delta,\epsilon)$-stationary point in
$O(1/(\epsilon^3\delta))$ iterations, even when using a directional derivative oracle. Further, our lower bound
construction makes use of continuously differentiable functions, for which the directional derivative
oracle and the standard gradient oracle must coincide. Thus the $O(1/(\epsilon^3\delta))$ complexity is optimal in
this setting as well.

Corollary 35. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions and notation of Theorem 34, suppose in addition $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $A$ guarantees $\|\Delta_n\| \le D$ for some
user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$ for all $\|u^k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations).
Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0)-F^\star}\right)^{2/3}\right\rceil, N/2\right)$, and $K = \lfloor N/T\rfloor$.
Then, for all $k$ and $t$, $\|w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right),$$
which implies
$$\frac1K\sum_{k=1}^K\|\partial F(w^k)\|_\delta \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right).$$

Proof. Since $A$ guarantees $\|\Delta_n\| \le D$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\| = \|x_n - (1-s_n)\Delta_n - x_{n'-1} + s_{n'}\Delta_{n'}\| \le \left\|\sum_{i=n+1}^{n'-1}\Delta_i\right\| + \|\Delta_n\| + \|\Delta_{n'}\| \le D((n'-1)-(n+1)+1) + 2D = D(n'-n+1) \le DT.$$
Therefore, we clearly have $\|w_t^k - w^k\| \le DT = \delta$.


Note that from the choice of K and T we have M = KT ≥ N − T ≥ N/2. So, for the second
fact, notice that Var(gn ) ≤ G2 for all n. Thus, applying Theorem 34 in concert with the additional


assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$, we have:
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le 2\frac{F(x_0)-F^\star}{DN} + 2\frac{KDG\sqrt T}{DN} + \frac{G}{\sqrt T} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3G}{\sqrt T} \le \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last inequality is due to the choice of $T$.


Finally, observe that $\|w_t^k - w^k\| \le \delta$ for all $t$ and $k$, and also that $w^k = \frac1T\sum_{t=1}^T w_t^k$. Therefore
$S = \{w_1^k,\dots,w_T^k\}$ satisfies the conditions in the infimum in Definition 32, so that $\|\partial F(w^k)\|_\delta \le \left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|$. □