
Optimal Stochastic Non-smooth Non-convex Optimization through
Online-to-Non-convex Conversion

Ashok Cutkosky (Boston University, Boston, MA), [email protected]
Harsh Mehta (Google Research, Mountain View, CA), [email protected]
Francesco Orabona (Boston University, Boston, MA), [email protected]

arXiv:2302.03775v2 [cs.LG] 11 Feb 2023

Abstract
We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta, \epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.

1 Introduction
Algorithms for non-convex optimization are one of the most important tools in modern machine
learning, as training neural networks requires optimizing a non-convex objective. Given the abun-
dance of data in many domains, the time to train a neural network is the current bottleneck to
having bigger and more powerful machine learning models. Motivated by this need, the past
few years have seen an explosion of research focused on understanding non-convex optimiza-
tion (Ghadimi & Lan, 2013; Carmon et al., 2017; Arjevani et al., 2019; 2020; Carmon et al., 2019;
Fang et al., 2018). Despite significant progress, key issues remain unaddressed.
In this paper, we work to minimize a potentially non-convex objective $F: \mathbb{R}^d \to \mathbb{R}$ which we only access in some stochastic or "noisy" manner. As motivation, consider $F(x) \triangleq \mathbb{E}_z[f(x, z)]$, where $x$ represents the model weights, $z$ a minibatch of i.i.d. examples, and $f$ the loss of a model with parameters $x$ on the minibatch $z$. In keeping with standard empirical practice, we will restrict ourselves to first-order algorithms (gradient-based optimization).
The vast majority of prior analyses of non-convex optimization algorithms impose various
smoothness conditions on the objective (Ghadimi & Lan, 2013; Carmon et al., 2017; Allen-Zhu,
2018; Tripuraneni et al., 2018; Fang et al., 2018; Zhou et al., 2018; Fang et al., 2019; Cutkosky & Orabona,
2019; Li & Orabona, 2019; Cutkosky & Mehta, 2020; Zhang et al., 2020a; Karimireddy et al., 2020;
Levy et al., 2021; Faw et al., 2022; Liu et al., 2022). One motivation for smoothness assumptions is that they allow for a convenient surrogate for global minimization: rather than finding a global minimum of a neural network's loss surface (which may be intractable), we can hope to find an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla F(x)\| \le \epsilon$. By now, the fundamental limits on first-order smooth non-convex optimization are well understood: Stochastic Gradient Descent (SGD) will find an $\epsilon$-stationary point in $O(\epsilon^{-4})$ iterations, which is the optimal rate (Arjevani et al., 2019). Moreover, if $F$ happens to be second-order smooth, SGD requires only $O(\epsilon^{-3.5})$ iterations, which is also optimal (Fang et al., 2019; Arjevani et al., 2020). These optimality results motivate the popularity of SGD and its variants in practice (Kingma & Ba, 2014; Loshchilov & Hutter, 2016; 2018; Goyal et al., 2017; You et al., 2019).
Unfortunately, many standard neural network architectures are non-smooth (e.g., architectures
incorporating ReLUs or max-pools cannot be smooth). As a result, these analyses can only pro-
vide intuition about what might occur when an algorithm is deployed in practice: the theorems
themselves do not apply (see also the discussion in, e.g., Li et al. (2021)). Despite the obvious
need for non-smooth analyses, recent results suggest that even approaching a neighborhood of a
stationary point may be impossible for non-smooth objectives (Kornowski & Shamir, 2022b). Nev-
ertheless, optimization clearly is possible in practice, which suggests that we may need to rethink
our assumptions and goals in order to understand non-smooth optimization.
Fortunately, Zhang et al. (2020b) recently considered an alternative definition of stationarity that is tractable even for non-smooth objectives and which has attracted much interest (Davis et al., 2021; Tian et al., 2022; Kornowski & Shamir, 2022a; Tian & So, 2022; Jordan et al., 2022). Roughly speaking, instead of $\|\nabla F(x)\| \le \epsilon$, we ask that there is a random variable $y$ supported in a ball of radius $\delta$ about $x$ such that $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$. We call such an $x$ a $(\delta, \epsilon)$-stationary point, so that the previous definition ($\|\nabla F(x)\| \le \epsilon$) is a $(0, \epsilon)$-stationary point. The current best-known complexity for identifying a $(\delta, \epsilon)$-stationary point is $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient evaluations.

In this paper, we significantly improve this result: we can identify a $(\delta, \epsilon)$-stationary point with only $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. Moreover, we also show that this rate is optimal. Our primary technique is a novel online-to-non-convex conversion: a connection between non-convex stochastic optimization and online learning, a classical field of learning theory with a deep literature (Cesa-Bianchi & Lugosi, 2006; Hazan, 2019; Orabona, 2019). In particular, we show that an online learning algorithm that satisfies a shifting regret bound can be used to decide the update step, when fed with linear losses constructed from the stochastic gradients of the function $F$. By establishing this connection, we hope to open new avenues for algorithm design in non-convex optimization and also to motivate new research directions in online learning.
In sum, we make the following contributions:

• A reduction from non-convex non-smooth stochastic optimization to online learning: better online learning algorithms result in faster non-convex optimization. Applying this reduction to standard online learning algorithms allows us to identify a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. The previous best-known rate in this setting was $O(\epsilon^{-4}\delta^{-1})$.

• We show that the $O(\epsilon^{-3}\delta^{-1})$ rate is optimal for all $\delta, \epsilon$ such that $\epsilon \le O(\delta)$.

Additionally, we prove important corollaries for smooth $F$:

• The $O(\epsilon^{-3}\delta^{-1})$ complexity implies the optimal $O(\epsilon^{-4})$ and $O(\epsilon^{-3.5})$ respective complexities for finding $(0, \epsilon)$-stationary points of smooth or second-order smooth objectives.

• For deterministic and second-order smooth objectives, we obtain a rate of $O(\epsilon^{-3/2}\delta^{-1/2})$, which implies the best-known $O(\epsilon^{-7/4})$ complexity for finding $(0, \epsilon)$-stationary points.

2 Definitions and Setup
Here, we formally introduce our setting and notation. We are interested in optimizing real-valued functions $F: \mathcal{H} \to \mathbb{R}$ where $\mathcal{H}$ is a real Hilbert space (usually $\mathcal{H} = \mathbb{R}^d$). We assume $F^\star \triangleq \inf_x F(x) > -\infty$. We assume that $F$ is differentiable, but we do not assume that $F$ is smooth. All norms $\|\cdot\|$ are the Hilbert space norm (i.e., the 2-norm) unless otherwise specified. As mentioned in the introduction, the motivating example to keep in mind in our development is the case $F(x) = \mathbb{E}_z[f(x, z)]$.

Our algorithms access information about $F$ through a stochastic gradient oracle $\mathrm{Grad}: \mathcal{H} \times Z \to \mathcal{H}$. Given a point $x$ in $\mathcal{H}$, the oracle will sample an i.i.d. random variable $z \in Z$ and return $\mathrm{Grad}(x, z) \in \mathcal{H}$ such that $\mathbb{E}[\mathrm{Grad}(x, z)] = \nabla F(x)$ and $\mathrm{Var}(\mathrm{Grad}(x, z)) \le \sigma^2$.
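As a concrete illustration, here is a minimal Python sketch of this oracle interface for the motivating example $F(x) = \mathbb{E}_z[f(x,z)]$. The quadratic loss and Gaussian minibatch noise below are invented for the example and are not taken from the paper.

```python
import numpy as np

def grad_oracle(x, rng):
    """Stochastic gradient oracle Grad(x, z): sample z i.i.d. and return an
    unbiased estimate of the gradient of F(x) = E_z[f(x, z)].

    Illustrative choice: f(x, z) = ||x - z||^2 / 2 with z ~ N(x_star, I), so
    grad f(x, z) = x - z and E[Grad(x, z)] = x - x_star = grad F(x)."""
    x_star = np.zeros_like(x)                   # hypothetical minimizer
    z = x_star + rng.standard_normal(x.shape)   # sample a "minibatch" z
    return x - z                                # unbiased: E[x - z] = grad F(x)

rng = np.random.default_rng(0)
g = grad_oracle(np.ones(5), rng)                # one stochastic gradient query
```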
In the following, we only consider functions satisfying the following mild regularity condition.

Definition 1. We define a differentiable function $F: \mathcal{H} \to \mathbb{R}$ to be well-behaved if for all $x, y \in \mathcal{H}$, it holds that
$$F(y) - F(x) = \int_0^1 \langle \nabla F(x + t(y - x)), y - x\rangle\, dt\,.$$

Using this assumption, our results can be applied to improve the past results on non-smooth
stochastic optimization. In fact, Proposition 2 (proof in Appendix A) below shows that for the
wide class of functions that are locally Lipschitz, applying an arbitrarily small perturbation to the
function is sufficient to ensure both differentiability and well-behavedness. It works via standard
perturbation arguments similar to those used previously by Davis et al. (2021) (see also Bertsekas
(1973); Duchi et al. (2012); Flaxman et al. (2005) for similar techniques in the convex setting).

Proposition 2. Let $F: \mathbb{R}^d \to \mathbb{R}$ be locally Lipschitz with stochastic oracle $\mathrm{Grad}$ such that $\mathbb{E}_z[\mathrm{Grad}(x, z)] = \nabla F(x)$ whenever $F$ is differentiable. We have two cases:

• If $F$ is differentiable everywhere, then $F$ is well-behaved.

• If $F$ is not differentiable everywhere, let $p > 0$ be an arbitrary number and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x + pu)]$. Then, $\hat F$ is differentiable and well-behaved, and the oracle $\widehat{\mathrm{Grad}}(x, (z, u)) = \mathrm{Grad}(x + pu, z)$ is a stochastic gradient oracle for $\hat F$. Moreover, $F$ is differentiable at $x + pu$ with probability 1 and, if $F$ is $G$-Lipschitz, then $|\hat F(x) - F(x)| \le pG$ for all $x$.
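A minimal sketch of the perturbed oracle from the second case of Proposition 2, assuming a base oracle `grad_oracle(x)` that returns a stochastic gradient of $F$ at $x$ (the base oracle and the sampling helper are stand-ins, not part of the paper):

```python
import numpy as np

def smoothed_grad_oracle(grad_oracle, x, p, rng):
    """Oracle for F_hat(x) = E_u[F(x + p*u)], u uniform on the unit ball:
    sample u, then query the original oracle at the perturbed point x + p*u."""
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                 # uniform direction on the sphere
    r = rng.uniform() ** (1.0 / x.size)    # radius giving uniform sampling in the ball
    return grad_oracle(x + p * r * u)      # Grad(x + p*u, z) in the paper's notation
```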

Remark 3. We explicitly note that our results cover the case in which $F$ is directionally differentiable and we have access to a stochastic directional gradient oracle, as considered by Zhang et al. (2020b). This is a less standard oracle $\mathrm{Grad}(x, v, z)$ that outputs $g$ such that $\mathbb{E}[\langle g, v\rangle]$ is the directional derivative of $F$ in the direction $v$. This setting is subtly different (although a directional derivative oracle is a gradient oracle at all points for which $F$ is continuously differentiable). In order to keep technical complications to a minimum, in the main text we consider the more familiar stochastic gradient oracle discussed above. In Appendix H, we show that our results and techniques also apply to directional gradient oracles with only superficial modification.

2.1 $(\delta, \epsilon)$-Stationary Points
Now, let us define our notion of a $(\delta, \epsilon)$-stationary point. This definition is essentially the same as that used in Zhang et al. (2020b); Davis et al. (2021); Tian et al. (2022). It is in fact mildly more stringent, since we require an "unbiasedness" condition in order to make eventual connections to second-order smooth objectives easier.

Definition 4. A point $x$ is a $(\delta, \epsilon)$-stationary point of an almost-everywhere differentiable function $F$ if there is a finite subset $S$ of the ball of radius $\delta$ centered at $x$ such that, for $y$ selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$.

As a counterpart to this definition, we also define:

Definition 5. Given a point $x$, a number $\delta > 0$, and an almost-everywhere differentiable function $F$, define
$$\|\nabla F(x)\|_\delta \triangleq \inf_{\substack{S \subset B(x, \delta),\\ \frac{1}{|S|}\sum_{y \in S} y = x}} \left\|\frac{1}{|S|}\sum_{y \in S} \nabla F(y)\right\|\,.$$

Let us also state an immediate corollary of Proposition 2 that converts a guarantee on a randomized smoothed function to one on the original function.

Corollary 6. Let $F: \mathbb{R}^d \to \mathbb{R}$ be $G$-Lipschitz. For $\epsilon > 0$, let $p = \epsilon/G$ and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x + pu)]$. If a point $x$ satisfies $\|\nabla \hat F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\|_\delta \le 2\epsilon$.

Our ultimate goal is to use $N$ stochastic gradient evaluations of $F$ to identify a point $x$ with as small a value of $\mathbb{E}[\|\nabla F(x)\|_\delta]$ as possible. Due to Proposition 2 and Corollary 6, our results immediately extend from differentiable $F$ to those $F$ that are locally Lipschitz and for which $\mathrm{Grad}(x, z)$ returns an unbiased estimate of $\nabla F(x)$ whenever $F$ is differentiable at $x$. We therefore focus our subsequent development on the case that $F$ is differentiable and we have access to a stochastic gradient oracle.

2.2 Online Learning

Here, we very briefly introduce the setting of online linear learning with shifting competitors, which is the core of our online-to-non-convex conversion. We refer the interested reader to Cesa-Bianchi & Lugosi (2006); Hazan (2019); Orabona (2019) for a comprehensive introduction to online learning. In the online learning setting, the learning process proceeds in rounds. In each round the algorithm outputs a point $\Delta_t$ in a feasible set $V$, then receives a linear loss function $\ell_t(\cdot) = \langle g_t, \cdot\rangle$ and pays $\ell_t(\Delta_t)$. The goal of the algorithm is to minimize the static regret over $T$ rounds, defined as the difference between its cumulative loss and that of an arbitrary comparison vector $u \in V$:
$$R_T(u) \triangleq \sum_{t=1}^T \langle g_t, \Delta_t - u\rangle\,.$$
With no stochastic assumptions, it is possible to design online algorithms that guarantee that the regret is upper bounded by $O(\sqrt T)$. We can also define a more challenging objective: minimizing the $K$-shifting regret, that is, the regret with respect to an arbitrary sequence of $K$ vectors $u_1, \dots, u_K \in V$ that changes every $T$ iterations:
$$R_T(u_1, \dots, u_K) \triangleq \sum_{k=1}^K \sum_{n=(k-1)T+1}^{kT} \langle g_n, \Delta_n - u_k\rangle\,. \qquad (1)$$
It should be intuitive that resetting the online algorithm every $T$ iterations can achieve a shifting regret of $O(K\sqrt T)$.
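A minimal Python sketch of this resetting construction, assuming a base learner object with `reset`, `predict`, and `update` methods (a hypothetical interface of ours): restarting it every $T$ rounds turns a static-regret learner into a $K$-shifting-regret learner.

```python
class ResetEvery:
    """Wrap a static-regret online learner and restart it every T rounds,
    so each length-T block competes against its own comparator u_k."""
    def __init__(self, base_learner, T):
        self.base, self.T, self.n = base_learner, T, 0

    def predict(self):
        if self.n % self.T == 0:     # new block: forget the past
            self.base.reset()
        return self.base.predict()   # Delta_n for this round

    def update(self, g):
        self.base.update(g)          # feed the linear loss <g, .>
        self.n += 1
```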

2.3 Related Work


In addition to the papers discussed in the introduction, here we discuss further related work.
In this paper we build on the definition of $(\delta, \epsilon)$-stationary points proposed by Zhang et al. (2020b). There, they prove a complexity of $O(\epsilon^{-4}\delta^{-1})$ for stochastic Lipschitz functions, which we improve to $O(\epsilon^{-3}\delta^{-1})$; we also prove the optimality of this result.

The idea of reducing machine learning to online learning was pioneered by Cesa-Bianchi et al. (2004) with the online-to-batch conversion. There are also a few papers that explore the possibility of transforming non-convex problems into online learning ones. Ghai et al. (2022) provide some conditions under which online gradient descent on non-convex losses is equivalent to a convex online mirror descent. Hazan et al. (2017) define a notion of regret which can be used to find approximate stationary points of smooth objectives. Zhuang et al. (2019) transform the problem of tuning learning rates in stochastic non-convex optimization into an online learning problem. Our proposed approach differs from all of the above in applying to non-smooth objectives. Moreover, we employ online learning algorithms with shifting regret (Herbster & Warmuth, 1998) to generate the updates (i.e., the differences between successive iterates), rather than the iterates themselves.

3 Online-to-non-convex Conversion
In this section, we explain the online-to-non-convex conversion. The core idea transforms the minimization of a non-convex and non-smooth function into the problem of minimizing the shifting regret over linear losses. In particular, consider an optimization algorithm that updates a previous iterate $x_{n-1}$ by moving in a direction $\Delta_n$: $x_n = x_{n-1} + \Delta_n$. For example, SGD sets $\Delta_n = -\eta g_{n-1} = -\eta \cdot \mathrm{Grad}(x_{n-1}, z_{n-1})$ for a learning rate $\eta$. Instead, we let an online learning algorithm $\mathcal{A}$ decide the update direction $\Delta_n$, using linear losses $\ell_n(x) = \langle g_n, x\rangle$.

The motivation behind essentially all first-order algorithms is that $F(x_{n-1} + \Delta_n) - F(x_{n-1}) \approx \langle g_n, \Delta_n\rangle$. This suggests that $\Delta_n$ should be chosen to minimize the inner product $\langle g_n, \Delta_n\rangle$. However, we are faced with two difficulties. The first difficulty is the approximation error in the first-order expansion. The second is the fact that $\Delta_n$ needs to be chosen before $g_n$ is revealed, so that $\Delta_n$ needs, in some sense, to "predict the future". Typical analyses of algorithms such as SGD use the remainder form of Taylor's theorem to address both difficulties simultaneously for smooth objectives, but in our non-smooth case this is not a valid approach. Instead, we tackle these difficulties independently. We overcome the first difficulty using the same randomized scaling trick employed by Zhang et al. (2020b): define $g_n$ to be a gradient evaluated not at $x_{n-1}$ or $x_{n-1} + \Delta_n$, but at a random point along the line segment connecting the two. Then for a well-behaved function we will have $F(x_{n-1} + \Delta_n) - F(x_{n-1}) = \mathbb{E}[\langle g_n, \Delta_n\rangle]$. The second difficulty is where online learning shines: online learning algorithms are specifically designed to predict completely arbitrary sequences of vectors as accurately as possible.
Algorithm 1 Online-to-Non-Convex Conversion
Input: Initial point $x_0$, $K \in \mathbb{N}$, $T \in \mathbb{N}$, online learning algorithm $\mathcal{A}$, scalars $s_n$ for all $n$
Set $M = K \cdot T$
for $n = 1 \dots M$ do
  Get $\Delta_n$ from $\mathcal{A}$
  Set $x_n = x_{n-1} + \Delta_n$
  Set $w_n = x_{n-1} + s_n \Delta_n$
  Sample random $z_n$
  Generate gradient $g_n = \mathrm{Grad}(w_n, z_n)$
  Send $g_n$ to $\mathcal{A}$ as gradient
end for
Set $w_t^k = w_{(k-1)T+t}$ for $k = 1, \dots, K$ and $t = 1, \dots, T$
Set $\bar w^k = \frac{1}{T}\sum_{t=1}^T w_t^k$ for $k = 1, \dots, K$
Return $\{\bar w^1, \dots, \bar w^K\}$
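The following Python sketch mirrors Algorithm 1 directly, using projected online gradient descent reset every $T$ rounds as $\mathcal{A}$ (the instantiation analyzed in Section 3.1). All constants and the oracle signature are illustrative, not prescribed by the paper.

```python
import numpy as np

def o2nc(x0, grad_oracle, K, T, D, eta, rng):
    """Online-to-non-convex conversion (Algorithm 1) with OGD reset every T rounds."""
    x = x0.copy()
    w_bars = []
    for k in range(K):
        delta = np.zeros_like(x)              # reset the online learner: Delta_1 = 0
        ws = []
        for t in range(T):
            x = x + delta                     # x_n = x_{n-1} + Delta_n
            s = rng.uniform()                 # s_n ~ Uniform[0, 1]
            w = x - (1.0 - s) * delta         # w_n = x_{n-1} + s_n * Delta_n
            g = grad_oracle(w, rng)           # g_n = Grad(w_n, z_n)
            ws.append(w)
            delta = delta - eta * g           # OGD step on the linear loss <g_n, .>
            norm = np.linalg.norm(delta)
            if norm > D:                      # project onto the ball of radius D
                delta *= D / norm
        w_bars.append(np.mean(ws, axis=0))    # w_bar^k: average iterate of block k
    return w_bars                             # candidate (delta, eps)-stationary points
```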

The previous intuition is formalized in Algorithm 1 and the following result, which we will elaborate on in Theorem 8 before yielding our main result in Corollary 9.

Theorem 7. Suppose $F$ is well-behaved. Define $\nabla_n = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\, ds$. Then, with the notation in Algorithm 1 and for any sequence of vectors $u_1, \dots, u_M$, we have the equality:
$$F(x_M) = F(x_0) + \sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle + \sum_{n=1}^M \langle \nabla_n - g_n, \Delta_n\rangle + \sum_{n=1}^M \langle g_n, u_n\rangle\,.$$
Moreover, if we let $s_n$ be independent random variables uniformly distributed in $[0, 1]$, then we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right]\,.$$

Proof. By the well-behavedness of $F$, we have
$$F(x_n) - F(x_{n-1}) = \int_0^1 \langle \nabla F(x_{n-1} + s(x_n - x_{n-1})), x_n - x_{n-1}\rangle\, ds = \int_0^1 \langle \nabla F(x_{n-1} + s\Delta_n), \Delta_n\rangle\, ds = \langle \nabla_n, \Delta_n\rangle = \langle g_n, \Delta_n - u_n\rangle + \langle \nabla_n - g_n, \Delta_n\rangle + \langle g_n, u_n\rangle\,.$$
Now, sum over $n$ and telescope to obtain the stated bound.

For the second statement, simply observe that by definition we have $\mathbb{E}[g_n] = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\, ds = \nabla_n$.

3.1 Guarantees for Non-Smooth Non-Convex Functions
The primary value of Theorem 7 is that the term $\sum_{n=1}^M \langle g_n, \Delta_n - u_n\rangle$ is exactly the regret of an online learning algorithm: lower regret clearly translates to a smaller bound on $F(x_M)$. Next, by carefully choosing $u_n$, we will be able to relate the term $\sum_{n=1}^M \langle g_n, u_n\rangle$ to the gradient averages that appear in the definition of $(\delta, \epsilon)$-stationarity. Formalizing these ideas, we have the following:

Theorem 8. Assume $F$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. Define
$$u_k = -D\, \frac{\sum_{t=1}^T \nabla F(w_t^k)}{\left\|\sum_{t=1}^T \nabla F(w_t^k)\right\|}$$
for some $D > 0$, for $k = 1, \dots, K$. Finally, suppose $\mathrm{Var}(g_n) = \sigma^2$. Then:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{F(x_0) - F^\star}{DM} + \frac{\mathbb{E}[R_T(u_1, \dots, u_K)]}{DM} + \frac{\sigma}{\sqrt T}\,.$$

Proof. In Theorem 7, set $u_n$ to be equal to $u_1$ for the first $T$ iterations, $u_2$ for the second $T$ iterations, and so on (i.e., $u_n = u_k$ for $n = (k-1)T + 1, \dots, kT$). So, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u_1, \dots, u_K)] + \mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right]\,.$$
Now, since $u_k = -D\frac{\sum_{t=1}^T \nabla F(w_t^k)}{\|\sum_{t=1}^T \nabla F(w_t^k)\|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_n) = \sigma^2$, we have
$$\mathbb{E}\left[\sum_{n=1}^M \langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K \left\langle \sum_{t=1}^T \nabla F(w_t^k), u_k\right\rangle\right] + \mathbb{E}\left[D\sum_{k=1}^K \left\|\sum_{t=1}^T \left(\nabla F(w_t^k) - g_{(k-1)T+t}\right)\right\|\right] \le \mathbb{E}\left[\sum_{k=1}^K \left\langle \sum_{t=1}^T \nabla F(w_t^k), u_k\right\rangle\right] + D\sigma K\sqrt T = \mathbb{E}\left[-DT\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] + D\sigma K\sqrt T\,.$$
Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u_1, \dots, u_K)] + \sigma DK\sqrt T - DT\, \mathbb{E}\left[\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right]\,.$$
Dividing by $KDT = DM$ and rearranging, we have the stated bound.

We now instantiate Theorem 8 with the simplest online learning algorithm: online gradient descent (OGD) (Zinkevich, 2003). OGD takes as input a radius $D$ and a step size $\eta$ and makes the update $\Delta_{n+1} = \Pi_{\|\Delta\| \le D}[\Delta_n - \eta g_n]$ with $\Delta_1 = 0$. The standard analysis shows that if $\mathbb{E}[\|g_n\|^2] \le G^2$ for all $n$, then with $\eta = \frac{D}{G\sqrt T}$, OGD ensures the static regret bound $\mathbb{E}[R_T(u)] \le DG\sqrt T$ for any $u$ satisfying $\|u\| \le D$ (for completeness, a proof of this statement is in Appendix B). Thus, by resetting the algorithm every $T$ iterations, we achieve $\mathbb{E}[R_T(u_1, \dots, u_K)] \le KDG\sqrt T$. This powerful guarantee for all sequences is characteristic of online learning. We are now free to optimize the remaining parameters $K$ and $D$ to achieve our main result, presented in Corollary 9.
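For concreteness, here is a minimal implementation of this projected OGD learner in the reset/predict/update interface sketched in Section 2.2 (the class name and interface are ours, not the paper's):

```python
import numpy as np

class OGDBall:
    """Projected online gradient descent on the L2 ball of radius D,
    with the tuning eta = D / (G * sqrt(T)) described in the text."""
    def __init__(self, dim, D, G, T):
        self.D, self.eta = D, D / (G * np.sqrt(T))
        self.delta = np.zeros(dim)

    def reset(self):
        self.delta = np.zeros_like(self.delta)   # Delta_1 = 0

    def predict(self):
        return self.delta                        # play Delta_n

    def update(self, g):
        self.delta = self.delta - self.eta * g   # gradient step on <g, .>
        norm = np.linalg.norm(self.delta)
        if norm > self.D:
            self.delta *= self.D / norm          # project back onto the ball
```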

Corollary 9. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions of Theorem 8, suppose in addition that $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u_1, \dots, u_K)] \le DGK\sqrt T$ for all $\|u_k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0) - F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Then, for all $k$ and $t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right).$$

Before providing the proof, let us discuss the implications. Notice that if we select $\hat w$ at random from $\{\bar w^1, \dots, \bar w^K\}$, then we clearly have $\mathbb{E}[\|\nabla F(\hat w)\|_\delta] = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right]$. Therefore, the Corollary asserts that for a function $F$ with $F(x_0) - \inf_x F(x) \le \gamma$ and a stochastic gradient oracle whose second moment is bounded by $G^2$, a properly instantiated Algorithm 1 finds a $(\delta, \epsilon)$-stationary point in $N = O(G^2\gamma\epsilon^{-3}\delta^{-1})$ gradient evaluations. In Section 7, we will provide a lower bound showing that this rate is optimal essentially whenever $\delta G^2 \ge \epsilon\gamma$. Together, the Corollary and the lower bound provide a nearly complete characterization of the complexity of finding $(\delta, \epsilon)$-stationary points in the stochastic setting.

It is also interesting to note that the bound does not appear to improve if $\sigma \to 0$. Instead, all $\sigma$ dependency is dominated by the dependency on $G$. This highlights an interesting open question: is it in fact possible to improve in the deterministic setting? It is conceivable that the answer is "no", as in the non-smooth convex optimization setting it is well-known that the optimal rates for stochastic and deterministic optimization are the same.

Remark 10. With fairly straightforward martingale concentration arguments, the above can be extended to identify a $\left(\delta, O\left(\frac{G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}\right)\right)$-stationary point with high probability. We leave this for a future version of this work.

It is also interesting to explicitly write the update of the overall algorithm:
$$x_n = x_{n-1} + \Delta_n$$
$$g_n = \mathrm{Grad}(x_n + (s_n - 1)\Delta_n, z_n)$$
$$\Delta_{n+1} = \mathrm{clip}_D(\Delta_n - \eta g_n)$$
where $\mathrm{clip}_D(x) = x \min\left(\frac{D}{\|x\|}, 1\right)$. In words, the update is reminiscent of SGD with momentum and gradient clipping. The primary difference is that the stochastic gradient is taken at a slightly perturbed $x_n$.
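In code, this three-line update reads as follows (a sketch: the sign convention follows the OGD learner of Section 3.1, and the oracle signature is the illustrative one used earlier):

```python
import numpy as np

def clip_D(v, D):
    """clip_D(x) = x * min(D / ||x||, 1): rescale v into the ball of radius D."""
    norm = np.linalg.norm(v)
    return v * min(D / norm, 1.0) if norm > 0 else v

def o2nc_step(x, delta, grad_oracle, eta, D, rng):
    """One step of the overall update: a momentum-like buffer delta with clipping,
    with the gradient taken at a point randomly perturbed along the last update."""
    x = x + delta
    s = rng.uniform()
    g = grad_oracle(x + (s - 1.0) * delta, rng)  # gradient at w_n = x_n + (s_n - 1) Delta_n
    delta = clip_D(delta - eta * g, D)
    return x, delta
```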

Proof of Corollary 9. Since $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\| = \|x_n - (1 - s_n)\Delta_n - x_{n'-1} - s_{n'}\Delta_{n'}\| \le \left\|\sum_{i=n+1}^{n'-1} \Delta_i\right\| + \|\Delta_n\| + \|\Delta_{n'}\| \le D((n' - 1) - (n + 1) + 1) + 2D = D(n' - n + 1) \le DT\,.$$
Therefore, we clearly have $\|w_t^k - \bar w^k\| \le DT = \delta$.

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, for the second fact, notice that $\mathrm{Var}(g_n) \le G^2$ for all $n$. Thus, applying Theorem 8 in concert with the additional assumption $\mathbb{E}[R_T(u_1, \dots, u_K)] \le DGK\sqrt T$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{DN} + \frac{2KDG\sqrt T}{DN} + \frac{G}{\sqrt T} \le \frac{2T(F(x_0) - F^\star)}{\delta N} + \frac{3G}{\sqrt T} \le \max\left(\frac{5G^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right) + \frac{2(F(x_0) - F^\star)}{\delta N}\,,$$
where the last inequality is due to the choice of $T$.

Finally, observe that $\|w_t^k - \bar w^k\| \le \delta$ for all $t$ and $k$, and also that $\bar w^k = \frac{1}{T}\sum_{t=1}^T w_t^k$. Therefore $S = \{w_1^k, \dots, w_T^k\}$ satisfies the conditions in the infimum in Definition 5, so that $\|\nabla F(\bar w^k)\|_\delta \le \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|$.

4 Bounds for the L1 norm via per-coordinate updates


It is well-known that running a separate instance of an online learning algorithm on each coordinate of $\Delta$ (e.g., as in AdaGrad (Duchi et al., 2010; McMahan & Streeter, 2010)) yields regret bounds with respect to $L_1$ norms of the linear costs. For example, we can run the online gradient descent algorithm with a separate learning rate for each coordinate: $\Delta_{n+1,i} = \Pi_{[-D_\infty, D_\infty]}[\Delta_{n,i} - \eta_i g_{n,i}]$. The regret of this procedure is simply the sum of the regrets of the individual algorithms. In particular, if $\mathbb{E}[g_{n,i}^2] \le G_i^2$, then setting $\eta_i = \frac{D_\infty}{G_i\sqrt T}$ yields the regret bound $\mathbb{E}[R_T(u)] \le D_\infty\sqrt T\sum_{i=1}^d G_i$ for any $u$ satisfying $\|u\|_\infty \le D_\infty$. By employing such online algorithms with our online-to-non-convex conversion, we can obtain a guarantee on the $L_1$ norm of the gradients.
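A minimal sketch of this per-coordinate learner (a diagonal variant of the earlier OGD class; the interface is ours):

```python
import numpy as np

class CoordinateOGD:
    """One projected OGD instance per coordinate: box constraint ||Delta||_inf <= D_inf,
    with a separate step size eta_i = D_inf / (G_i * sqrt(T)) per coordinate."""
    def __init__(self, D_inf, G, T):
        self.D_inf = D_inf
        self.eta = D_inf / (np.asarray(G, dtype=float) * np.sqrt(T))
        self.delta = np.zeros(len(self.eta))

    def reset(self):
        self.delta = np.zeros_like(self.delta)

    def predict(self):
        return self.delta

    def update(self, g):
        # independent OGD step and clip on each coordinate
        self.delta = np.clip(self.delta - self.eta * g, -self.D_inf, self.D_inf)
```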

Definition 11. A point $x$ is a $(\delta, \epsilon)$-stationary point with respect to the $L_1$ norm of an almost-everywhere differentiable function $F$ if there exists a finite subset $S$ of the $L_\infty$ ball of radius $\delta$ centered at $x$ such that, if $y$ is selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and $\|\mathbb{E}[\nabla F(y)]\|_1 \le \epsilon$.

As a counterpart to this definition, we define:

Definition 12. Given a point $x$, a number $\delta > 0$, and an almost-everywhere differentiable function $F$, define
$$\|\nabla F(x)\|_{1,\delta} \triangleq \inf_{\substack{S \subset B_\infty(x, \delta),\\ \frac{1}{|S|}\sum_{y \in S} y = x}} \left\|\frac{1}{|S|}\sum_{y \in S} \nabla F(y)\right\|_1\,.$$

We can now state a theorem similar to Corollary 9. Given that the proof is also very similar, we defer it to Appendix G.

Theorem 13. Suppose we have a budget of $N$ gradient evaluations. Assume $F: \mathbb{R}^d \to \mathbb{R}$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. Assume that $\mathbb{E}[g_{n,i}^2] \le G_i^2$ for $i = 1, \dots, d$ for all $n$. Assume that $\mathcal{A}$ guarantees $\|\Delta_n\|_\infty \le D_\infty$ for some user-specified $D_\infty$ for all $n$ and ensures the standard worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u_1, \dots, u_K)] \le D_\infty K\sqrt T\sum_{i=1}^d G_i$ for all $\|u_k\|_\infty \le D_\infty$. Let $\delta > 0$ be an arbitrary number. Set $D_\infty = \delta/T$, $T = \min\left(\left\lceil\left(\frac{N\delta\sum_{i=1}^d G_i}{F(x_0) - F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Then, for all $k, t$, $\|\bar w^k - w_t^k\|_\infty \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|_1\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5(\sum_{i=1}^d G_i)^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^d G_i}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_{1,\delta}\right] \le \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(\frac{5(\sum_{i=1}^d G_i)^{2/3}(F(x_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^d G_i}{\sqrt N}\right).$$

Let us compare this result with Corollary 9. For a fair comparison, we set $G_i$ and $G$ such that $\sum_{i=1}^d G_i^2 = G^2$. Then, we can lower bound $\|\cdot\|_\delta$ with $\frac{1}{\sqrt d}\|\cdot\|_{1,\delta}$. Hence, under the assumption $\mathbb{E}[\|g_n\|^2] = \sum_{i=1}^d \mathbb{E}[g_{n,i}^2] \le \sum_{i=1}^d G_i^2 = G^2$, Corollary 9 implies $\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_{1,\delta} \le O\left(\frac{\sqrt d\, G^{2/3}}{(N\delta)^{1/3}}\right)$.

Now, let us see what would happen if we instead employed Theorem 13. First, observe that $\sum_{i=1}^d G_i \le \sqrt d\sqrt{\sum_{i=1}^d G_i^2} \le \sqrt d\, G$. Substituting this expression into Theorem 13 now gives an upper bound on $\|\cdot\|_{1,\delta}$ that is $O\left(\frac{d^{1/3} G^{2/3}}{(N\delta)^{1/3}}\right)$, which is better than the one we could obtain from Corollary 9 under the same assumptions.

5 From Non-smooth to Smooth Guarantees


Let us now see what our results imply for smooth objectives. The following two propositions show that for smooth $F$, a $(\delta, \epsilon)$-stationary point is automatically a $(0, \epsilon')$-stationary point for an appropriate $\epsilon'$. The proofs are in Appendix E.

Proposition 14. Suppose that $F$ is $H$-smooth (that is, $\nabla F$ is $H$-Lipschitz) and $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + H\delta$.

Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2 F(x) - \nabla^2 F(y)\|_{\mathrm{op}} \le J\|x - y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$.

Now, recall that Corollary 9 shows that we can find a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ iterations. Thus, Proposition 14 implies that by setting $\delta = \epsilon/H$, we can find a $(0, \epsilon)$-stationary point of an $H$-smooth objective $F$ in $O(\epsilon^{-4})$ iterations, which matches the (optimal) guarantee of standard SGD (Ghadimi & Lan, 2013; Arjevani et al., 2019). Further, Proposition 15 shows that by setting $\delta = \sqrt{\epsilon/J}$, we can find a $(0, \epsilon)$-stationary point of a $J$-second-order smooth objective in $O(\epsilon^{-3.5})$ iterations. This matches the performance of more refined SGD variants and is also known to be tight (Fang et al., 2019; Cutkosky & Mehta, 2020; Arjevani et al., 2020). In summary: the online-to-non-convex conversion also recovers the optimal results for smooth stochastic losses.
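The arithmetic behind these two claims is the following substitution into the $O(\epsilon^{-3}\delta^{-1})$ complexity, restated here for convenience:

```latex
% Proposition 14 with delta = eps/H:
O\!\left(\epsilon^{-3}\delta^{-1}\right) = O\!\left(H\epsilon^{-4}\right),
\qquad
% Proposition 15 with delta = sqrt(eps/J):
O\!\left(\epsilon^{-3}\delta^{-1}\right) = O\!\left(\sqrt{J}\,\epsilon^{-3.5}\right).
```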

6 Deterministic and Smooth Case


We now consider the case of a non-stochastic oracle (that is, $\mathrm{Grad}(x, z) = \nabla F(x)$ for all $z, x$) and an $H$-smooth $F$ (i.e., $\nabla F$ is $H$-Lipschitz). We will show that optimistic online algorithms (Rakhlin & Sridharan, 2013; Hazan & Kale, 2010) achieve rates matching the optimal deterministic results. In particular, we consider online algorithms that ensure the static regret bound
$$R_T(u) \le O\left(D\sqrt{\sum_{t=1}^T \|h_t - g_t\|^2}\right), \qquad (2)$$
for some "hint" vectors $h_t$. In Appendix B, we provide an explicit construction of such an algorithm for completeness. The standard setting for the hints is $h_t = g_{t-1}$. As explained in Section 2.2, to obtain a $K$-shifting regret it is enough to reset the algorithm every $T$ iterations.
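A minimal sketch of such an optimistic learner with hints $h_t = g_{t-1}$, following the two-step optimistic mirror descent pattern analyzed in Appendix B (the class name and interface are ours):

```python
import numpy as np

class OptimisticOGD:
    """Optimistic OGD on the ball of radius D with hints h_t = g_{t-1}:
    play a half-step using the hint, then take the real step with g_t."""
    def __init__(self, dim, D, eta):
        self.D, self.eta = D, eta
        self.w_hat = np.zeros(dim)   # the lagged iterate hat{w}_t
        self.h = np.zeros(dim)       # hint; g_0 = 0 as in Theorem 16

    def _proj(self, v):
        n = np.linalg.norm(v)
        return v * self.D / n if n > self.D else v

    def predict(self):
        return self._proj(self.w_hat - self.eta * self.h)   # w_t from the hint

    def update(self, g):
        self.w_hat = self._proj(self.w_hat - self.eta * g)  # hat{w}_{t+1} from g_t
        self.h = g                                          # next hint h_{t+1} = g_t
```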

Theorem 16. Suppose we have a budget of $N$ gradient evaluations, and that we have an online algorithm $\mathcal{A}_{\mathrm{static}}$ that guarantees $\|\Delta_n\| \le D$ for all $n$ and ensures the optimistic regret bound $R_T(u) \le CD\sqrt{\sum_{t=1}^T \|g_t - g_{t-1}\|^2}$ for some constant $C$, where we define $g_0 = 0$. In Algorithm 1, set $\mathcal{A}$ to be $\mathcal{A}_{\mathrm{static}}$ reset every $T$ rounds. Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\frac{(C\delta^2 HN)^{2/5}}{(F(x_0) - F^\star)^{2/5}}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor\frac{N}{T}\rfloor$. Finally, suppose that $F$ is $H$-smooth and that the gradient oracle is deterministic (that is, $g_n = \nabla F(w_n)$). Then, for all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, with $G_1 = \|\nabla F(x_1)\|$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5} N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N}\,,$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N} + \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5} N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right).$$

Note that the expectation here encompasses only the randomness in the choice of $s_t^k$, because the gradient oracle is assumed to be deterministic. Theorem 16 finds a $(\delta, \epsilon)$-stationary point in $O(\epsilon^{-5/3}\delta^{-1/3})$ iterations. Thus, by setting $\delta = \epsilon/H$, Proposition 14 shows we can find a $(0, \epsilon)$-stationary point in $O(\epsilon^{-2})$ iterations, which matches the standard optimal rate (Carmon et al., 2021).

Proof. The first fact (for all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$) holds for precisely the same reason that it holds in Corollary 9. The final fact follows from the second fact again with the same reasoning, so it remains only to show the second fact.

To show the second fact, observe that for $k = 1$ we have
$$R_T(u_1) \le CD\sqrt{\sum_{t=1}^T \|g_t^1 - g_{t-1}^1\|^2} \le CD\sqrt{G_1^2 + \sum_{t=2}^T \|\nabla F(w_t^1) - \nabla F(w_{t-1}^1)\|^2} \le CD\sqrt{G_1^2 + H^2\sum_{t=2}^T \|w_t^1 - w_{t-1}^1\|^2} \le CD\sqrt{G_1^2 + 4H^2TD^2} \le CDG_1 + 2CD^2H\sqrt T\,.$$

Similarly, for $k > 1$, we observe that:
$$\sum_{t=1}^T \|g_t^k - g_{t-1}^k\|^2 \le \sum_{t=2}^T \|\nabla F(w_t^k) - \nabla F(w_{t-1}^k)\|^2 + \|\nabla F(w_1^k) - \nabla F(w_T^{k-1})\|^2 \le H^2\left(\|w_1^k - w_T^{k-1}\|^2 + \sum_{t=2}^T \|w_t^k - w_{t-1}^k\|^2\right) \le 4TH^2D^2\,.$$

Thus, we have
$$R_T(u_k) \le CD\sqrt{\sum_{t=1}^T \|g_t^k - g_{t-1}^k\|^2} \le CD\sqrt{4H^2TD^2} \le 2CD^2H\sqrt T\,.$$

Now, applying Theorem 8 in concert with the above bounds on $R_T(u_k)$, we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0) - F^\star)}{DN} + \frac{2CDG_1 + 4CKD^2H\sqrt T}{DN} \le \frac{2T(F(x_0) - F^\star)}{\delta N} + \frac{2CG_1}{N} + \frac{4C\delta H}{T^{3/2}} \le \max\left(6\,\frac{(CH)^{2/5}(F(x_0) - F^\star)^{3/5}}{\delta^{1/5}N^{3/5}}, \frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0) - F^\star)}{\delta N}\,.$$

6.1 Better Results with Second-Order Smoothness


When $F$ is $J$-second-order smooth (i.e., $\nabla^2 F$ is $J$-Lipschitz) we can do even better. First, observe that if $F$ is $J$-second-order-smooth, then by Proposition 15, the $O(\epsilon^{-5/3}\delta^{-1/3})$ iteration complexity of Theorem 16 implies an $O(\epsilon^{-11/6})$ iteration complexity for finding $(0, \epsilon)$-stationary points by setting $\delta = \sqrt{\epsilon/J}$. This already improves upon the $O(\epsilon^{-2})$ result for smooth losses, but we can improve still further. The key idea is to generate more informative hints $h_t$. If we can make $h_t \approx g_t$, then by (2), we can achieve smaller regret and so a better guarantee.

To do so, we abandon randomization: instead of choosing $s_n$ randomly, we just set $s_n = 1/2$. This setting still allows $F(x_n) \approx F(x_{n-1}) + \langle g_n, \Delta_n\rangle$ with very little error when $F$ is second-order-smooth. Then, via careful inspection of the optimistic mirror descent update formula, we can identify an $h_t$ with $\|h_t - g_t\| \le O(1/\sqrt N)$ using $O(\log(N))$ gradient queries. This more advanced online learning algorithm is presented in Algorithm 2 (full analysis in Appendix C).

Overall, Algorithm 2 makes an update of a similar flavor to the "implicit" update:
$$\Delta_n = \Pi_{\|\Delta\| \le D}\left[\Delta_{n-1} - \frac{g_n}{2H}\right], \qquad g_n = \nabla F(x_{n-1} + \Delta_n/2)\,.$$

With this refined online algorithm, we can show the following convergence guarantee, whose proof is in Appendix D.

Theorem 17. In Algorithm 1, assume that $g_n = \nabla F(w_n)$, and set $s_n = \frac{1}{2}$. Use Algorithm 2 restarted every $T$ rounds as $\mathcal{A}$. Let $\delta > 0$ be an arbitrary number. Set $T = \min\left(\left\lceil\frac{(\delta^2(H + J\delta)N)^{1/3}}{(F(x_0) - F^\star)^{1/3}}\right\rceil, N/2\right)$ and $K = \lfloor\frac{N}{T}\rfloor$. In Algorithm 2, set $\eta = 1/(2H)$, $D = \delta/T$, and $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$. Finally, suppose that $F$ is $J$-second-order-smooth. Then, the following facts hold:

1. For all $k, t$, $\|\bar w^k - w_t^k\| \le \delta$.

2. We have the inequality
$$\frac{1}{K}\sum_{k=1}^K \left\|\frac{1}{T}\sum_{t=1}^T \nabla F(w_t^k)\right\| \le \frac{4G}{N} + \frac{2(F(x_0) - F^\star)}{N\delta} + 3\,\frac{(H + J\delta)^{1/3}(F(x_0) - F^\star)^{2/3}}{\delta^{1/3}N^{2/3}} + 10\,\frac{\delta(H + J\delta)}{N^2}\,.$$

3. With $\delta = \frac{H^{1/7}(F(x_0) - F(x_N))^{2/7}}{J^{3/7}N^{2/7}}$, we have
$$\frac{1}{K}\sum_{k=1}^K \|\nabla F(\bar w^k)\| \le O\left(\frac{J^{1/7}H^{2/7}(F(x_0) - F^\star)^{4/7}}{N^{4/7}}\right)\,.$$

Moreover, the total number of gradient queries consumed is $NQ = O(N\log(N))$.

Algorithm 2 Optimistic Mirror Descent with Careful Hints
Input: Learning rate $\eta$, number $Q$ ($Q$ will be $O(\log N)$), function $F$, horizon $T$, radius $D$
Receive initial iterate $x_0$
Set $\Delta'_1 = 0$
for $t = 1 \dots T$ do
  Set $h_t^0 = \nabla F(x_{t-1})$
  for $i = 1 \dots Q$ do
    Set $h_t^i = \nabla F\left(x_{t-1} + \frac{1}{2}\Pi_{\|\Delta\| \le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)$
  end for
  Set $h_t = h_t^Q$
  Output $\Delta_t = \Pi_{\|\Delta\| \le D}[\Delta'_t - \eta h_t]$
  Receive $t$th gradient $g_t$
  Set $\Delta'_{t+1} = \Pi_{\|\Delta\| \le D}[\Delta'_t - \eta g_t]$
end for

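The hint computation at the heart of Algorithm 2 is a short fixed-point iteration; here is a Python sketch of it (the deterministic gradient function and all arguments are placeholders):

```python
import numpy as np

def careful_hint(grad_F, x_prev, delta_prime, eta, D, Q):
    """Compute the hint h_t of Algorithm 2 by iterating the map
    h -> grad_F(x_{t-1} + 0.5 * proj_D(delta'_t - eta * h)) Q times.
    For eta <= 1/H this map is a (1/2)-contraction (see Appendix C), so
    Q = O(log N) iterations make h_t nearly a fixed point, i.e. h_t ~ g_t."""
    def proj(v):
        n = np.linalg.norm(v)
        return v * D / n if n > D else v

    h = grad_F(x_prev)                                          # h_t^0
    for _ in range(Q):
        h = grad_F(x_prev + 0.5 * proj(delta_prime - eta * h))  # h_t^i
    return h
```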


This result finds a $(\delta, \epsilon)$-stationary point in $\tilde O(\epsilon^{-3/2}\delta^{-1/2})$ iterations. Via Proposition 15, this translates to $\tilde O(\epsilon^{-7/4})$ iterations for finding a $(0, \epsilon)$-stationary point, which matches the best known rates (up to a logarithmic factor) (Carmon et al., 2017). Note that this rate may not be optimal: the best available lower bound is $\Omega(\epsilon^{-12/7})$ (Carmon et al., 2021). Intriguingly, our technique seems rather different from previous work, which usually relies on acceleration and methods for detecting or exploiting negative curvature (Carmon et al., 2017; Agarwal et al., 2016; Carmon et al., 2018; Li & Lin, 2022).

7 Lower Bounds
In this section, we show that the $O(\epsilon^{-3}\delta^{-1})$ complexity achieved in Corollary 9 is tight. We do this by a simple extension of the lower bound for stochastic smooth non-convex optimization of Arjevani et al. (2019). We provide an informal statement and proof sketch below. The formal result (Theorem 28) and proof are provided in Appendix F.

Theorem 18 (informal). There is a universal constant $C$ such that for any $\delta, \epsilon, \gamma$ and $G \ge C\sqrt{\frac{\epsilon\gamma}{\delta}}$, for any first-order algorithm $\mathcal{A}$, there is a $G$-Lipschitz, $C^\infty$ function $F: \mathbb{R}^d \to \mathbb{R}$ for some $d$ with $F(0) - \inf_x F(x) \le \gamma$ and a stochastic first-order gradient oracle for $F$ whose outputs $g$ satisfy $\mathbb{E}[\|g\|^2] \le G^2$, such that $\mathcal{A}$ requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ stochastic oracle queries to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.
Proof sketch. The construction of Arjevani et al. (2019) provides, for any $H, \gamma, \epsilon$, and $\sigma$, a function $F$ and stochastic oracle with variance at most $\sigma^2$ such that $F$ is $H$-smooth and $O(\sqrt{H\gamma})$-Lipschitz, and $\mathcal{A}$ requires $\Omega(\sigma^2 H\gamma/\epsilon^4)$ oracle queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. By setting $H = \frac{\epsilon}{\delta}$ and $\sigma = G/\sqrt 2$, this becomes an $O(\sqrt{\epsilon\gamma/\delta})$-Lipschitz function, and so it is at most $G/\sqrt 2$-Lipschitz. Thus, the second moment of the gradient oracle is at most $G^2/2 + G^2/2 = G^2$. Further, the algorithm requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. Now, if $\|\nabla F(x)\|_\delta \le \epsilon$, then since $F$ is $H = \frac{\epsilon}{\delta}$-smooth, by Proposition 14, $\|\nabla F(x)\| \le \epsilon + \delta H = 2\epsilon$. Thus, we see that we need $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point with $\|\nabla F(x)\|_\delta \le \epsilon$, as desired.
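For convenience, the parameter substitution in the sketch written out:

```latex
% Arjevani et al. (2019): Omega(sigma^2 H gamma / eps^4) queries for an
% H-smooth, O(sqrt(H gamma))-Lipschitz construction. Substituting:
H = \frac{\epsilon}{\delta}, \quad \sigma = \frac{G}{\sqrt 2}
\;\Longrightarrow\;
\Omega\!\left(\frac{\sigma^2 H\gamma}{\epsilon^4}\right)
  = \Omega\!\left(\frac{G^2}{2}\cdot\frac{\epsilon}{\delta}\cdot\frac{\gamma}{\epsilon^4}\right)
  = \Omega\!\left(\frac{G^2\gamma}{\delta\epsilon^3}\right),
\qquad
O\!\left(\sqrt{H\gamma}\right) = O\!\left(\sqrt{\tfrac{\epsilon\gamma}{\delta}}\right) \le \frac{G}{\sqrt 2}.
```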

8 Conclusion
We have presented a new online-to-non-convex conversion technique that applies online learning algorithms to non-convex and non-smooth stochastic optimization. When used with online gradient descent, this achieves the optimal $O(\epsilon^{-3}\delta^{-1})$ complexity for finding $(\delta, \epsilon)$-stationary points.
These results suggest new directions for work in online learning. Much past work is moti-
vated by the online-to-batch conversion relating static regret to convex optimization. We employ
switching regret for non-convex optimization. More refined analysis may be possible via gen-
eralizations such as strongly adaptive or dynamic regret (Daniely et al., 2015; Jun et al., 2017;
Zhang et al., 2018; Jacobsen & Cutkosky, 2022; Cutkosky, 2020; Lu et al., 2022; Luo et al., 2022;
Zhang et al., 2021; Baby & Wang, 2022; Zhang et al., 2022). Moreover, our analysis assumes per-
fect tuning of constants (e.g., D, T, K) for simplicity. In practice, we would prefer to adapt to un-
known parameters, motivating new applications and problems for adaptive online learning, which
is already an area of active current investigation (see, e.g., Orabona & Pál, 2015; Hoeven et al.,
2018; Cutkosky & Orabona, 2018; Cutkosky, 2019; Mhammedi & Koolen, 2020; Chen et al., 2021;
Sachs et al., 2022; Wang et al., 2022). It is our hope that some of this expertise can be applied in
the non-convex setting as well.
Finally, our results leave an important question unanswered: the current best-known algorithm for deterministic non-smooth optimization still requires $O(\epsilon^{-3}\delta^{-1})$ iterations to find a $(\delta, \epsilon)$-stationary point (Zhang et al., 2020b). We achieve this same result even in the stochastic case. Thus it is natural to wonder if one can improve in the deterministic setting. For example, is the $O(\epsilon^{-3/2}\delta^{-1/2})$ complexity we achieve in the smooth setting also achievable in the non-smooth setting?

Acknowledgements
Ashok Cutkosky is supported by the National Science Foundation grant CCF-2211718 as well as a
Google gift. Francesco Orabona is supported by the National Science Foundation under the grants
no. 2022446 “Foundations of Data Science Institute” and no. 2046096 “CAREER: Parameter-free
Optimization Algorithms for Machine Learning”.

References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima
for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. In Advances in neural


information processing systems, pp. 2675–2686, 2018.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds
for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Sekhari, A., and Sridharan, K. Second-order
information in non-convex stochastic optimization: Power and limitations. In Conference on
Learning Theory, pp. 242–299, 2020.

Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex
losses and beyond. In International Conference on Artificial Intelligence and Statistics, pp.
1805–1845. PMLR, 2022.

Bertsekas, D. P. Stochastic optimization problems with nondifferentiable cost functionals. Journal


of Optimization Theory and Applications, 12(2):218–231, 1973.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. “convex until proven guilty”: Dimension-
free acceleration of gradient descent on non-convex functions. In International Conference on
Machine Learning, pp. 654–663. PMLR, 2017.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex opti-
mization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
I. Mathematical Programming, pp. 1–50, 2019.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
II: first-order methods. Mathematical Programming, 185(1-2), 2021.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press,
2006.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning
algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.

Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and
its applications. In Conference on Learning Theory, pp. 1216–1259. PMLR, 2021.

Clarke, F. H. Optimization and nonsmooth analysis. SIAM, 1990.

Cutkosky, A. Combining online learning guarantees. In Proceedings of the Thirty-Second Conference


on Learning Theory, pp. 895–913, 2019.

Cutkosky, A. Parameter-free, dynamic, and strongly-adaptive online learning. In International


Conference on Machine Learning, volume 2, 2020.

Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference


on Machine Learning, 2020.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach
spaces. In Conference On Learning Theory, pp. 1493–1529, 2018.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In


Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International
Conference on Machine Learning, pp. 1405–1411. PMLR, 2015.

Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. A gradient sampling method
with complexity guarantees for lipschitz functions in high and low dimensions. arXiv preprint
arXiv:2112.06969, 2021.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochas-
tic optimization. In Conference on Learning Theory (COLT), pp. 257–269, 2010.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic opti-
mization. SIAM Journal on Optimization, 22(2):674–701, 2012.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via
stochastic path-integrated differential estimator. In Advances in Neural Information Processing
Systems, pp. 689–699, 2018.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex sgd escaping from saddle points. In
Conference on Learning Theory, pp. 1192–1234, 2019.

Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power
of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In
Conference on Learning Theory, pp. 313–355. PMLR, 2022.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit
setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM
symposium on Discrete algorithms, pp. 385–394, 2005.

Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic pro-
gramming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghai, U., Lu, Z., and Hazan, E. Non-convex online learning via algorithmic equivalence. arXiv
preprint arXiv:2205.15235, 2022.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia,
Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in
costs. Machine learning, 80(2-3):165–188, 2010.

Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. In
International Conference on Machine Learning, pp. 1433–1441. PMLR, 2017.

Herbster, M. and Warmuth, M. K. Tracking the best regressor. In Proceedings of the eleventh
annual conference on Computational learning theory, pp. 24–31, 1998.

Hoeven, D., Erven, T., and Kotlowski, W. The many faces of exponential weights in online learning.
In Conference On Learning Theory, pp. 2067–2092. PMLR, 2018.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Loh, P.-L. and Raginsky, M.
(eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of
Machine Learning Research, pp. 4160–4211. PMLR, 2022.

Jordan, M. I., Lin, T., and Zampetakis, M. On the complexity of deterministic nonsmooth and
nonconvex optimization. arXiv preprint arXiv:2209.12463, 2022.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning
using coin betting. In Artificial Intelligence and Statistics, pp. 943–951. PMLR, 2017.

Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh,
A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint
arXiv:2008.03606, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

Kornowski, G. and Shamir, O. On the complexity of finding small subgradients in nonsmooth


optimization. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop),
2022a.

Kornowski, G. and Shamir, O. Oracle complexity in nonsmooth nonconvex optimization. Journal


of Machine Learning Research, 23(314):1–44, 2022b.

Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive SGD with recursive momentum for
nonconvex optimization. Advances in Neural Information Processing Systems, 34:20571–20582,
2021.

Li, H. and Lin, Z. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the $O(\epsilon^{-7/4})$ complexity. In International Conference on Machine Learning. PMLR, 2022.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In
The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR,
2019.

Li, X., Zhuang, Z., and Orabona, F. A second look at exponential and cosine step sizes: Simplicity,
adaptivity, and performance. In International Conference on Machine Learning, pp. 6553–6564.
PMLR, 2021.

Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. META-STORM: Generalized
fully-adaptive variance reduced SGD for unbounded functions. arXiv preprint arXiv:2209.14853,
2022.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983, 2016.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference


on Learning Representations, 2018.

Lu, Z., Xia, W., Arora, S., and Hazan, E. Adaptive gradient methods with local guarantees. arXiv
preprint arXiv:2203.01400, 2022.

Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study
on switching regret for linear bandits. In Conference on Learning Theory, 2022.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In
Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp. 244–256, 2010.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning.


Conference on Learning Theory, pp. 2858–2887, 2020.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Orabona, F. and Pál, D. Scale-free algorithms for online linear optimization. In Chaudhuri, K.,
Gentile, C., and Zilles, S. (eds.), Algorithmic Learning Theory, pp. 287–301. Springer Interna-
tional Publishing, 2015.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on


Learning Theory (COLT), pp. 993–1019, 2013.

Sachs, S., Hadiji, H., van Erven, T., and Guzmán, C. Between stochastic and adversarial online
convex optimization: Improved regret bounds via smoothness. arXiv preprint arXiv:2202.07554,
2022.

Stein, E. M. and Shakarchi, R. Real analysis: measure theory, integration, and Hilbert spaces.
Princeton University Press, 2009.

Tian, L. and So, A. M.-C. No dimension-free deterministic algorithm computes approximate sta-
tionarities of lipschitzians. arXiv preprint arXiv:2210.06907, 2022.

Tian, L., Zhou, K., and So, A. M.-C. On the finite-time complexity and practical computation of
approximate stationarity concepts of lipschitz functions. In International Conference on Machine
Learning, pp. 21360–21379. PMLR, 2022.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization
for fast nonconvex optimization. In Advances in neural information processing systems, pp.
2899–2908, 2018.

Wang, G., Hu, Z., Muthukumar, V., and Abernethy, J. Adaptive oracle-efficient online learning.
arXiv preprint arXiv:2210.09385, 2022.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer,
K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes.
arXiv preprint arXiv:1904.00962, 2019.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive
methods good for attention models? Advances in Neural Information Processing Systems, 33:
15383–15393, 2020a.

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. Complexity of finding stationary points
of nonconvex nonsmooth functions. In International Conference on Machine Learning, 2020b.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Pro-
ceedings of the 32nd International Conference on Neural Information Processing Systems, pp.
1330–1340, 2018.

Zhang, L., Wang, G., Tu, W.-W., Jiang, W., and Zhou, Z.-H. Dual adaptivity: A universal
algorithm for minimizing the adaptive regret of convex functions. Advances in Neural Information
Processing Systems, 34:24968–24980, 2021.

Zhang, Z., Cutkosky, A., and Paschalidis, I. Adversarial tracking control via strongly adaptive
online learning with memory. In International Conference on Artificial Intelligence and Statistics,
pp. 8458–8492. PMLR, 2022.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems,
pp. 3925–3936, 2018.

Zhuang, Z., Cutkosky, A., and Orabona, F. Surrogate losses for online learning of stepsizes in
stochastic non-convex optimization. In International Conference on Machine Learning, pp. 7664–
7672, 2019.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Pro-
ceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936,
2003.

A Proof of Proposition 2
First, we state a technical lemma that will be used to prove Proposition 2.

Lemma 19. Let $F: \mathbb{R}^d \to \mathbb{R}$ be locally Lipschitz. Then, $F$ is differentiable almost everywhere, is Lipschitz on all compact sets, and for all $v \in \mathbb{R}^d$, $x \mapsto \langle \nabla F(x), v\rangle$ is integrable on all compact sets. Finally, for any compact measurable set $D \subset \mathbb{R}^d$, the vector $w = \int_D \nabla F(x)\, dx$ is well-defined and the operator $\rho(v) = \int_D \langle \nabla F(x), v\rangle\, dx$ is linear and equal to $\langle w, v\rangle$.

Proof. First, observe that since $F$ is locally Lipschitz, for every point $x \in \mathbb{R}^d$ with rational coordinates there is a neighborhood $U_x$ of $x$ on which $F$ is Lipschitz. Thus, by Rademacher's theorem, $F$ is differentiable almost everywhere in $U_x$. Since the set of points with rational coordinates is dense in $\mathbb{R}^d$, $\mathbb{R}^d$ is equal to the countable union $\bigcup_x U_x$. Thus, since the set of points of non-differentiability of $F$ in each $U_x$ has measure zero, the total set of points of non-differentiability is a countable union of sets of measure zero and so must have measure zero. Thus $F$ is differentiable almost everywhere. This implies that $F$ is differentiable at $x + pu$ with probability 1.

Next, observe that for any compact set $S \subset \mathbb{R}^d$, for every point $x \in S$ with rational coordinates, $F$ is Lipschitz on some neighborhood $U_x$ containing $x$ with Lipschitz constant $G_x$. Since $S$ is compact, there is a finite set $x_1, \dots, x_K$ such that $S \subseteq \bigcup_i U_{x_i}$. Therefore, $F$ is $\max_i G_{x_i}$-Lipschitz on $S$, and so $F$ is Lipschitz on every compact set.

Now, for almost all $x$, for all $v$ the limit $\lim_{\delta \to 0} \frac{F(x + \delta v) - F(x)}{\delta}$ exists and is equal to $\langle \nabla F(x), v\rangle$ by definition of differentiability. Further, on any compact set $S$ we have $\frac{|F(x + \delta v) - F(x)|}{\delta} \le L$ for some $L$ for all $x \in S$. Therefore, by the bounded convergence theorem (see, e.g., Stein & Shakarchi (2009, Theorem 1.4)), $\langle \nabla F(x), v\rangle$ is integrable on $S$.
Next, we prove the linearity of the operator $\rho$. Observe that for any vectors $v$ and $w$, and scalar $c$, by linearity of integration, we have
$$\int_D \langle \nabla F(x), cv + w\rangle\, dx = \int_D c\langle \nabla F(x), v\rangle + \langle \nabla F(x), w\rangle\, dx = c\int_D \langle \nabla F(x), v\rangle\, dx + \int_D \langle \nabla F(x), w\rangle\, dx\,.$$

For the remaining statement, given that $\mathbb{R}^d$ is finite dimensional, there must exist $w \in \mathbb{R}^d$ such that $\rho(v) = \langle w, v\rangle$ for all $v$. Further, $w$ is uniquely determined by $\langle w, e_i\rangle$ for $i = 1, \dots, d$, where $e_i$ indicates the $i$th standard basis vector. Then, since $\nabla F(x)_i = \langle \nabla F(x), e_i\rangle$ is integrable on compact sets, we have
$$\langle w, e_i\rangle = \int_D \langle \nabla F(x), e_i\rangle\, dx = \int_D \nabla F(x)_i\, dx\,,$$
which is the definition of $\int_D \nabla F(x)\, dx$ when the integral is defined.

We can now prove the Proposition.

Proof of Proposition 2. Since $F$ is locally Lipschitz, by Lemma 19, $F$ is Lipschitz on compact sets. Therefore, $F$ must be Lipschitz on the line segment connecting $x$ and $y$. Thus the function $k(t) = F(x + t(y - x))$ is absolutely continuous on $[0, 1]$. As a result, $k'$ is integrable on $[0, 1]$ and
$$F(y) - F(x) = k(1) - k(0) = \int_0^1 k'(t)\, dt = \int_0^1 \langle \nabla F(x + t(y - x)), y - x\rangle\, dt$$
by the Fundamental Theorem of Calculus (see, e.g., Stein & Shakarchi (2009, Theorem 3.11)).
Now, we tackle the case that $F$ is not differentiable everywhere. Notice that the last statement of the Proposition is an immediate consequence of Lipschitzness. So, we focus on showing the remaining parts.

Now, by Lemma 19, we have that $F$ is differentiable almost everywhere. Further, $g_x = \mathbb{E}_u[\nabla F(x + pu)]$ exists and satisfies, for all $v \in \mathbb{R}^d$:
$$\langle g_x, v\rangle = \mathbb{E}_u[\langle \nabla F(x + pu), v\rangle]\,.$$
Notice also that
$$\mathbb{E}_{z,u}[\widehat{\mathrm{Grad}}(x, (z, u))] = \mathbb{E}_u\left[\mathbb{E}_z[\mathrm{Grad}(x + pu, z)]\right] = \mathbb{E}_u[\nabla F(x + pu)] = g_x\,.$$
So, it remains to show that $\hat F$ is differentiable and $\nabla\hat F(x) = g_x$.

Now, let $x$ be an arbitrary element of $\mathbb{R}^d$ and let $v_1, v_2, \dots$ be any sequence of vectors such that $\lim_{n\to\infty} v_n = 0$ and $\|v_i\| \le p$ for all $i$. Then, since the ball of radius $2p$ centered at $x$ is compact, $F$ is $L$-Lipschitz inside this ball for some $L$. Then, we have
$$\lim_{n\to\infty} \frac{\hat F(x + v_n) - \hat F(x) - \langle g_x, v_n\rangle}{\|v_n\|} = \lim_{n\to\infty} \mathbb{E}_u\left[\frac{F(x + v_n + pu) - F(x + pu) - \langle \nabla F(x + pu), v_n\rangle}{\|v_n\|}\right].$$
Now, observe $\frac{|F(x + v_i + pu) - F(x + pu)|}{\|v_i\|} \le L$. Further, for almost all $u$, $F$ is differentiable at $x + pu$, so that $\lim_{n\to\infty} \frac{F(x + v_n + pu) - F(x + pu) - \langle \nabla F(x + pu), v_n\rangle}{\|v_n\|} = 0$ for almost all $u$. Thus, by the bounded convergence theorem, we have
$$\lim_{n\to\infty} \frac{\hat F(x + v_n) - \hat F(x) - \langle g_x, v_n\rangle}{\|v_n\|} = 0\,,$$
which shows that $g_x = \nabla\hat F(x)$.

Finally, observe that since $F$ is Lipschitz on compact sets, $\hat F$ must be also, and so by the first part of the proposition, $\hat F$ is well-behaved.

B Analysis of (Optimistic) Online Gradient Descent


Optimistic Online Gradient Descent (in its simplest form) is described by Algorithm 3. Here we
collect the standard analysis of the algorithm for completeness. None of this analysis is new, and
more refined versions can be found in a variety of sources (e.g. Chen et al. (2021)).
We will analyze only a simple version of this algorithm: the case when $V$ is an $L_2$ ball of radius
$D$ in some real Hilbert space (such as $\mathbb{R}^d$). Then, Algorithm 3 satisfies the following guarantee.

Proposition 20. Let $V = \{x : \|x\| \le D\} \subset \mathcal{H}$ for some real Hilbert space $\mathcal{H}$. Then, for all
$u \in V$, Algorithm 3 ensures
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta}{2}\|g_t - h_t\|^2.$$

Algorithm 3 Optimistic Mirror Descent
Input: Regularizer function $\psi$, domain $V$, time horizon $T$.
$\hat w_1 = 0$
for $t = 1 \dots T$ do
  Generate “hint” $h_t$
  Set $w_t = \mathrm{argmin}_{x\in V}\ \langle h_t, x\rangle + D_\psi(x, \hat w_t)$
  Output $w_t$ and receive loss vector $g_t$
  Set $\hat w_{t+1} = \mathrm{argmin}_{x\in V}\ \langle g_t, x\rangle + D_\psi(x, \hat w_t)$
end for

Proof. By Chen et al. (2021, Lemma 15), instantiated with the squared Euclidean distance as
Bregman divergence (so $D_\psi(x,y) = \frac{\|x-y\|^2}{2\eta}$), we have
$$\langle g_t, w_t - u\rangle \le \langle g_t - h_t, w_t - \hat w_{t+1}\rangle + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2 - \frac{1}{2\eta}\|\hat w_{t+1} - w_t\|^2 - \frac{1}{2\eta}\|w_t - \hat w_t\|^2.$$
From Young's inequality:
$$\le \frac{\eta\|g_t - h_t\|^2}{2} + \frac{\|w_t - \hat w_{t+1}\|^2}{2\eta} + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2 - \frac{1}{2\eta}\|\hat w_{t+1} - w_t\|^2 - \frac{1}{2\eta}\|w_t - \hat w_t\|^2$$
$$\le \frac{\eta\|g_t - h_t\|^2}{2} + \frac{1}{2\eta}\|u - \hat w_t\|^2 - \frac{1}{2\eta}\|u - \hat w_{t+1}\|^2.$$
Summing over $t$ and telescoping, we have
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le \frac{\|u - \hat w_1\|^2}{2\eta} - \frac{\|u - \hat w_{T+1}\|^2}{2\eta} + \sum_{t=1}^T \frac{\eta\|g_t - h_t\|^2}{2} \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta\|g_t - h_t\|^2}{2},$$
where the last inequality uses $\hat w_1 = 0$ and $\|u\| \le D$. □
In the case that the hints $h_t$ are not present (that is, $h_t = 0$), the algorithm becomes online gradient
descent (Zinkevich, 2003). In this case, assuming $\mathbb{E}[\|g_t\|^2] \le G^2$ and setting $\eta = \frac{D}{G\sqrt T}$, we obtain
$\mathbb{E}[R_T(u)] \le DG\sqrt T$ for all $u$ such that $\|u\| \le D$.
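To make the updates above concrete, here is a minimal Python sketch of Algorithm 3 specialized to the Euclidean ball (the setting of Proposition 20). It is an illustration of the update rules, not a reference implementation from the paper; hint_fn and loss_fn are hypothetical stand-ins for whatever process supplies the hints $h_t$ and loss vectors $g_t$.

import numpy as np

def project_ball(v, radius):
    # Euclidean projection onto V = {x : ||x|| <= radius}.
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def optimistic_ogd(hint_fn, loss_fn, dim, radius, eta, T):
    # With psi(w) = ||w||^2 / (2 eta), both argmin updates in Algorithm 3
    # reduce to projected gradient steps from the current w_hat.
    w_hat = np.zeros(dim)  # corresponds to \hat{w}_1 = 0
    iterates = []
    for t in range(T):
        h = hint_fn(t)                                   # "hint" h_t
        w = project_ball(w_hat - eta * h, radius)        # w_t
        g = loss_fn(t, w)                                # loss vector g_t
        w_hat = project_ball(w_hat - eta * g, radius)    # \hat{w}_{t+1}
        iterates.append(w)
    return iterates

With hint_fn returning zero vectors, this reduces exactly to the online gradient descent update discussed above.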

C Algorithm 2 and Regret Guarantee


Theorem 21. Let $F$ be an $H$-smooth and $G$-Lipschitz function. Then, when $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$,
Algorithm 2 with $x_t = x_{t-1} + \frac12\Delta_t$ and $g_t = \nabla F(x_t)$ and $\eta \le \frac{1}{2H}$ ensures, for all $\|u\| \le D$,
$$\sum_{t=1}^T \langle g_t, \Delta_t - u\rangle \le \frac{HD^2}{2} + \frac{2GTD}{N}.$$
Furthermore, a total of at most $T\lceil\log_2(\sqrt{NG/(HD)})\rceil$ gradient evaluations are required.

Proof. The count of gradient evaluations is immediate from inspection of the algorithm, so it
remains only to prove the regret bound.
First, we observe that the choices of $\Delta_t$ specified by Algorithm 2 correspond to the values of
$w_t$ produced by Algorithm 3 when $\psi(w) = \frac{1}{2\eta}\|w\|^2$. This can be verified by direct calculation
(recalling that $D_\psi(x,y) = \frac{\|x-y\|^2}{2\eta}$).
Therefore, by Proposition 20, we have
$$\sum_{t=1}^T \langle g_t, \Delta_t - u\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^T \frac{\eta}{2}\|g_t - h_t\|^2. \qquad (3)$$

So, our primary task is to show that kgt − ht k is small. To this end, recall that gt = ∇F (wt ) =
∇F (xt−1 + ∆t /2).
Now, we define $h_t^{Q+1} = \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^Q\right]\right)$ (which simply continues the recursive definition of $h_t^i$ in Algorithm 2 for one more step). Then, we claim that for all $0 \le i \le Q$,
$\|h_t^{i+1} - h_t^i\| \le \frac{1}{2^i}\|h_t^1 - h_t^0\|$. We establish the claim by induction on $i$. First, for $i = 0$ the claim
holds by definition. Now suppose $\|h_t^i - h_t^{i-1}\| \le \frac{1}{2^{i-1}}\|h_t^1 - h_t^0\|$ for some $i$. Then, we have
$$\|h_t^{i+1} - h_t^i\| = \left\|\nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^i\right]\right) - \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)\right\|.$$
Using the $H$-smoothness of $F$:
$$\le \frac{H}{2}\left\|\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^i\right] - \Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right\|.$$
Using the fact that projection is a contraction:
$$\le \frac{H\eta}{2}\left\|h_t^i - h_t^{i-1}\right\|.$$
Using $\eta \le \frac1H$:
$$\le \frac12\left\|h_t^i - h_t^{i-1}\right\|.$$
From the induction assumption:
$$\le \frac{1}{2^i}\left\|h_t^1 - h_t^0\right\|,$$
so the claim holds.
Now, since $h_t = h_t^Q$, we have $\Delta_t = \Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^Q\right]$. Therefore $g_t = \nabla F(x_t) = \nabla F(x_{t-1} + \Delta_t/2) = h_t^{Q+1}$. Thus,
$$\|g_t - h_t^Q\| = \|h_t^{Q+1} - h_t^Q\| \le \frac{1}{2^Q}\|h_t^1 - h_t^0\| \le \frac{2G}{2^Q},$$
where in the last inequality we used the fact that $F$ is $G$-Lipschitz. So, for $Q = \lceil\log_2(\sqrt{NG/(HD)})\rceil$,
we have $\|g_t - h_t^Q\| \le 2G\sqrt{\frac{HD}{NG}}$ for all $t$. The result now follows by substituting into equation (3). □
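As a concrete illustration of the recursion analyzed in this proof, the following Python sketch iterates the map $h \mapsto \nabla F(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}[\Delta'_t - \eta h])$ for $Q$ steps. Here grad_F, delta_prime, and the initialization h0 are hypothetical stand-ins for quantities supplied by Algorithm 2; the particular choice of h0 only needs to be bounded for the contraction argument to apply.

import numpy as np

def project_ball(v, radius):
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def hint_recursion(grad_F, x_prev, delta_prime, h0, eta, D, Q):
    # Each iteration halves the distance between consecutive hints when
    # eta <= 1/H, so after Q steps the hint is within O(G / 2^Q) of g_t.
    h = h0
    for _ in range(Q):
        h = grad_F(x_prev + 0.5 * project_ball(delta_prime - eta * h, D))
    return h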
D Proof of Theorem 17
Proof of Theorem 17. Once more, the first part of the result is immediate from the fact that $\|\Delta_n\| \le D$. So, we proceed to show the second part.
Define $\nabla_n = \int_0^1 \nabla F(x_{n-1} + s\Delta_n)\,ds$. Then, we have

$$|\langle\nabla_n - g_n, \Delta_n\rangle| = \left|\left\langle\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right)ds,\ \Delta_n\right\rangle\right| \le D\left\|\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right)ds\right\|.$$
Adding and subtracting $\nabla^2 F\left(x_{n-1}+\frac12\Delta_n\right)\Delta_n(s-1/2)$ inside the integral, and observing that $\int_0^1 s - 1/2\,ds = 0$, this equals
$$D\left\|\int_0^1 \nabla F(x_{n-1}+s\Delta_n) - \nabla F\left(x_{n-1}+\frac12\Delta_n\right) - \nabla^2 F\left(x_{n-1}+\frac12\Delta_n\right)\Delta_n(s-1/2)\,ds\right\|.$$
Using second-order smoothness:
$$\le D\int_0^1 \frac{J}{2}\|\Delta_n\|^2(s-1/2)^2\,ds \le \frac{JD^3}{48}.$$

In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations,
and so on. In other words, $u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$. So, we have
$$F(x_M) - F(x_0) = R_T(u^1,\dots,u^K) + \sum_{n=1}^M\langle\nabla_n - g_n, \Delta_n\rangle + \sum_{n=1}^M\langle g_n, u_n\rangle \le R_T(u^1,\dots,u^K) + \frac{MJD^3}{48} + \sum_{n=1}^M\langle g_n, u_n\rangle.$$
Now, set $u^k = -D\,\frac{\sum_{t=1}^T\nabla F(w_t^k)}{\left\|\sum_{t=1}^T\nabla F(w_t^k)\right\|}$. Then, by Theorem 21, we have that $R_T(u^k) \le \frac{HD^2}{2} + \frac{2TGD}{N}$.
Therefore:
$$F(x_M) \le F(x_0) + \frac{HD^2K}{2} + \frac{2GDM}{N} + \frac{MJD^3}{48} - DT\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|.$$

Hence, we obtain
$$\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\| \le \frac{F(x_0)-F(x_M)}{MD} + \frac{HD}{2T} + \frac{2G}{N} + \frac{JD^2}{48}.$$

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, using $D = \delta/T$, we
can upper bound the right-hand side with
$$\frac{2T(F(x_0)-F^\star)}{N\delta} + \frac{H\delta}{2T^2} + \frac{4G}{N} + \frac{J\delta^2}{2T^2},$$
and with $T = \min\left(\left\lceil\frac{(\delta^2(H+J\delta)N)^{1/3}}{(F(x_0)-F^\star)^{1/3}}\right\rceil, N/2\right)$:
$$\le 3\frac{(H+J\delta)^{1/3}(F(x_0)-F^\star)^{2/3}}{\delta^{1/3}N^{2/3}} + \frac{4G}{N} + \frac{2(F(x_0)-F^\star)}{N\delta} + 10\frac{\delta(H+J\delta)}{N^2}.$$
Now, the third fact follows by observing that Proposition 15 implies that
$$\frac1K\sum_{k=1}^K\left\|\nabla F(w^k)\right\| \le \frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\| + \frac{J\delta^2}{2}.$$
Now, substituting the specified value of $\delta$ completes the proof. Finally, the count of the number of
gradient evaluations is a direct calculation. □
E Proofs for Section 5


Proposition 14. Suppose that $F$ is $H$-smooth (that is, $\nabla F$ is $H$-Lipschitz) and $x$ also satisfies
$\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + H\delta$.

Proof. Let $S \subset B(x,\delta)$ with $x = \frac{1}{|S|}\sum_{y\in S} y$. By $H$-smoothness, for all $y\in S$, $\|\nabla F(y)-\nabla F(x)\| \le H\|y-x\| \le H\delta$. Therefore, we have
$$\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac1{|S|}\sum_{y\in S}(\nabla F(y)-\nabla F(x))\right\| \ge \|\nabla F(x)\| - H\delta.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$.
Thus, $\|\nabla F(x)\| \le \epsilon + H\delta + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + H\delta$. □

Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2 F(x) - \nabla^2 F(y)\|_{op} \le J\|x-y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$.

Proof. The proof is similar to that of Proposition 14. Let $S \subset B(x,\delta)$ with $x = \frac1{|S|}\sum_{y\in S} y$. By
$J$-second-order-smoothness, for all $y\in S$, we have
$$\|\nabla F(y) - \nabla F(x) - \nabla^2 F(x)(y-x)\| = \left\|\int_0^1 (\nabla^2 F(x+t(y-x)) - \nabla^2 F(x))(y-x)\,dt\right\| \le \int_0^1 tJ\|y-x\|^2\,dt = \frac{J\|y-x\|^2}{2} \le \frac{J\delta^2}{2}.$$
Further, since $\frac1{|S|}\sum_{y\in S} y = x$, we have $\frac1{|S|}\sum_{y\in S}\nabla^2 F(x)(y-x) = 0$. Therefore, we have
$$\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac1{|S|}\sum_{y\in S}(\nabla F(y)-\nabla F(x)-\nabla^2 F(x)(y-x))\right\| \ge \|\nabla F(x)\| - \frac{J\delta^2}{2}.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac1{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$.
Thus, $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2 + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + \frac{J}{2}\delta^2$. □

F Lower Bounds
Our lower bounds are constructed via a mild alteration to the arguments of Arjevani et al. (2019)
for lower bounds on finding (0, ǫ)-stationary points of smooth functions with a stochastic gradient
oracle. At a high level, we show that since a (δ, ǫ)-stationary point of an H-smooth loss is also a
(0, Hδ + ǫ)-stationary point, a lower bound on the complexity of the latter implies a lower bound on
the complexity of the former. The lower bound of Arjevani et al. (2019) is proved by constructing
a distribution over “hard” functions such that no algorithm can quickly find a (0, ǫ)-stationary point
of a randomly selected function. Unfortunately, these “hard” functions are not Lipschitz. Fortunately,
they take the form F (x) + ηkxk2 where F is Lipschitz and smooth so that the “non-Lipschitz” part
is solely contained in the quadratic term. We show that one can replace the quadratic term kxk2
with a Lipschitz function that is quadratic for sufficiently small x but proportional to kxk for
larger values. Our proof consists of carefully reproducing the argument of Arjevani et al. (2019)
to show that this modification does not cause any problems. We emphasize that almost all of this
development can be found with more detail in Arjevani et al. (2019). We merely restate here the
minimum results required to verify our modification to their construction.

F.1 Definitions and Results from Arjevani et al. (2019)


A randomized first-order algorithm is a distribution PS supported on a set S and a sequence
of measurable mappings Ai (s, g1 , . . . , gi−1 ) → Rd with s ∈ S and gi ∈ Rd . Given a stochastic
gradient oracle Grad : Rd × Z → Rd , a distribution PZ supported on Z and an i.i.d. sample
(z1 , . . . , zn ) ∼ PZ , we define the iterates of A recursively by:

x1 = A1 (s)
xi = Ai (s, Grad(x1 , z1 ), Grad(x2 , z2 ), . . . , Grad(xi−1 , zi−1 )) .

So, xi is a function of s and z1 , . . . , zi−1 . We define Arand to be the set of such sequences of
mappings.
Now, in the notation of Arjevani et al. (2019), we define the “progress function”
$$\mathrm{prog}_c(x) = \max\{i : |x_i| \ge c\}.$$
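For example, for $x = (1, 0.7, 0.1, 0)$ we have $\mathrm{prog}_{1/2}(x) = 2$, since the largest index $i$ with $|x_i| \ge 1/2$ is $i = 2$.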

Further, a stochastic gradient oracle Grad is called a probability-$p$ zero-chain if $\mathrm{prog}_0(\mathrm{Grad}(x,z)) = \mathrm{prog}_{1/4}(x) + 1$ with probability at most $p$, and $\mathrm{prog}_0(\mathrm{Grad}(x,z)) \le \mathrm{prog}_{1/4}(x) + 1$
with probability 1.

Next, let FT : RT → R be the function defined by Lemma 2 of Arjevani et al. (2019). Restating
their Lemma, this function satisfies:
Lemma 22 (Lemma 2 of Arjevani et al. (2019)). There exists a function $F_T : \mathbb{R}^T \to \mathbb{R}$ that satisfies:
1. FT (0) = 0 and inf FT (x) ≥ −γ0 T for γ0 = 12.
2. ∇FT (x) is H0 -Lipschitz, with H0 = 152.
3. For all $x$, $\|\nabla F_T(x)\|_\infty \le G_0$ with $G_0 = 23$.
4. For all x, prog0 (∇FT (x)) ≤ prog1/2 (∇FT (x)) + 1.
5. If prog1 (x) < T , then k∇FT (x)k ≥ |∇FT (x)prog1 (x)+1 | ≥ 1.
We associate with this function $F_T$ the stochastic gradient oracle $\mathrm{Grad}_T : \mathbb{R}^T\times\{0,1\}\to\mathbb{R}^T$,
where $z$ is Bernoulli($p$):
$$\mathrm{Grad}_T(x,z)_i = \begin{cases}\nabla F_T(x)_i & \text{if } i \ne \mathrm{prog}_{1/4}(x)\\ \frac{z\,\nabla F_T(x)_i}{p} & \text{if } i = \mathrm{prog}_{1/4}(x)\end{cases}$$
It is clear that $\mathbb{E}_z[\mathrm{Grad}_T(x,z)] = \nabla F_T(x)$.
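This oracle is mechanical enough to state in code. The Python sketch below is a literal transcription of the display above; grad_FT is a hypothetical stand-in for the gradient of the hard instance $F_T$ (which is not constructed explicitly here), and we adopt the convention, assumed for this sketch, that prog returns 0 when no coordinate qualifies.

import numpy as np

def prog(x, c):
    # prog_c(x) = max{i : |x_i| >= c}, with 1-indexed coordinates.
    idx = np.flatnonzero(np.abs(x) >= c)
    return 0 if idx.size == 0 else int(idx[-1]) + 1

def grad_T(grad_FT, x, p, rng):
    g = np.array(grad_FT(x), dtype=float)
    i = prog(x, 0.25)                  # the special coordinate prog_{1/4}(x)
    if 1 <= i <= g.size:
        z = rng.random() < p           # z ~ Bernoulli(p)
        g[i - 1] = (z / p) * g[i - 1]  # unbiased since E[z/p] = 1
    return g

The single rescaled coordinate is exactly what produces the $G_0^2/p$ variance in Lemma 23 below.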


This construction is so far identical to that in Arjevani et al. (2019), and so we have by their
Lemma 3:
Lemma 23 (Lemma 3 of Arjevani et al. (2019)). $\mathrm{Grad}_T$ is a probability-$p$ zero-chain, has variance
$\mathbb{E}[\|\mathrm{Grad}_T(x,z) - \nabla F_T(x)\|^2] \le G_0^2/p$, and $\|\mathrm{Grad}_T(x,z)\| \le \frac{G_0}{p} + G_0\sqrt T$.

Proof. The probability-$p$ zero-chain and variance statements are directly from Arjevani et al. (2019).
For the bound on $\|\mathrm{Grad}_T\|$, observe that $\mathrm{Grad}_T(x,z) = \nabla F_T(x)$ in all but one coordinate. In
that one coordinate, $\mathrm{Grad}_T(x,z)$ is at most $\frac{\|\nabla F_T(x)\|_\infty}{p} = \frac{G_0}{p}$. Thus, the bound follows by the triangle
inequality. □
Next, for any matrix U ∈ Rd×T with orthonormal columns, we define FT,U : Rd → R by:
FT,U (x) = FT (U ⊤ x) .
The associated stochastic gradient oracle is:
GradT,U (x, z) = U GradT (U ⊤ x, z) .
Now, we restate Lemma 5 of Arjevani et al. (2019):
Lemma 24 (Lemma 5 of Arjevani et al. (2019)). Let $R > 0$ and suppose $A \in A_{\mathrm{rand}}$ is such that
$A$ produces iterates $x_t$ with $\|x_t\| \le R$. Let $d \ge \left\lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\right\rceil$. Suppose $U$ is chosen uniformly at
random from the set of $d\times T$ matrices with orthonormal columns. Let Grad be a probability-$p$ zero-chain and let $\mathrm{Grad}_U(x,z) = U\,\mathrm{Grad}(U^\top x, z)$. Let $x_1, x_2, \dots$ be the iterates of $A$ when provided
the stochastic gradient oracle $\mathrm{Grad}_U$. Then with probability at least $1-c$ (over the randomness of
$U$, the oracle, and also the seed $s$ of $A$):
$$\mathrm{prog}_{1/4}(U^\top x_t) < T \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$

F.2 Defining the “Hard” Instance
Now, we for the first time diverge from the construction of Arjevani et al. (2019) (albeit only
slightly). Their construction uses a “shrinking function” $\rho_{R,d} : \mathbb{R}^d\to\mathbb{R}^d$ given by $\rho_{R,d}(x) = \frac{x}{\sqrt{1 + \|x\|^2/R^2}}$, as well as an additional quadratic term to overcome the limitation of bounded iterates.
We cannot tolerate the non-Lipschitz quadratic term, so we replace it with a Lipschitz version
$q_{B,d}(x) = x^\top\rho_{B,d}(x)$. Intuitively, $q_{B,d}$ behaves like $\|x\|^2$ for small enough $x$, but behaves like $\|x\|$
for large $\|x\|$. This pre-processed function is defined by:
$$\hat F_{T,U}(x) = F_{T,U}(\rho_{R,d}(x)) + \eta q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = F_T(U^\top\rho_{R,d}(x)) + \eta q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = F_T(U^\top\rho_{R,d}(x)) + \eta q_{B,d}(x).$$

To see the last line above, note that since $q_{B,d}$ and $\rho_{R,d}$ are rotationally symmetric, and $U$ has
orthonormal columns, we have $U^\top\rho_{R,d}(x) = \rho_{R,T}(U^\top x)$ and $q_{B,T}(U^\top x) = q_{B,d}(x)$. Thus, we have
$q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = q_{B,d}(x)$. The stochastic gradient oracle associated with $\hat F_{T,U}(x)$ is
$$\widehat{\mathrm{Grad}}_{T,U}(x,z) = J[\rho_{R,d}](x)^\top U\,\mathrm{Grad}_T(U^\top\rho_{R,d}(x), z) + \eta\nabla q_{B,d}(x) = J[\rho_{R,d}](x)^\top U\,\mathrm{Grad}_T(U^\top\rho_{R,d}(x), z) + \eta J[\rho_{R,d}](x)^\top U\,J[\rho_{R,T}^{-1}](U^\top\rho_{R,d}(x))^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))),$$
where $J[f](x)$ indicates the Jacobian of the function $f$ evaluated at $x$.
A description of the relevant properties of qB is provided in Section F.3.
Next we produce a variant on Lemma 6 from Arjevani et al. (2019). This is the most delicate
part of our alteration, although the proof is still almost identical to that of Arjevani et al. (2019).

Lemma 25 (variant on Lemma 6 of Arjevani et al. (2019)). Let $R = B = 10G_0T$. Let $\eta = 1/10$,
$c \in (0,1)$, $p \in (0,1)$, and $T \in \mathbb{N}$. Set $d = \lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\rceil$ and let $U$ be sampled uniformly
from the set of $d\times T$ matrices with orthonormal columns. Define $\hat F_{T,U}$ and $\widehat{\mathrm{Grad}}_{T,U}$ as above.
Suppose $A \in A_{\mathrm{rand}}$ and let $x_1, x_2, \dots$ be the iterates of $A$ when provided with $\widehat{\mathrm{Grad}}_{T,U}$ as input.
Then with probability at least $1-c$:
$$\|\nabla\hat F_{T,U}(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Proof. First, observe that $\mathrm{Grad}_T$ is a probability-$p$ zero-chain. Next, since $q_B$ is rotationally symmetric
and $\rho_R(x) \propto x$, we see that $\nabla q_B(x) \propto x$, so that $\mathrm{prog}_0(J[\rho_{R,T}^{-1}](x)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(x))) = \mathrm{prog}_0(x)$.
Thus, the oracle
$$\widetilde{\mathrm{Grad}}_T(x,z) = \mathrm{Grad}_T(x,z) + \eta J[\rho_{R,T}^{-1}](x)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(x))$$
is a probability-$p$ zero-chain. We define the rotated oracle:
$$\widetilde{\mathrm{Grad}}_{U,T}(y,z) = U\,\widetilde{\mathrm{Grad}}_T(U^\top y, z) = U\,\mathrm{Grad}_T(U^\top y, z) + \eta U\,J[\rho_{R,T}^{-1}](U^\top y)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(U^\top y)).$$
Following Arjevani et al. (2019), we now define $y_i = \rho_R(x_i)$. Notice that:
$$\widehat{\mathrm{Grad}}_{T,U}(x,z) = J[\rho_{R,d}](x)^\top\,\widetilde{\mathrm{Grad}}_{U,T}(\rho_{R,d}(x), z).$$
Therefore, we can view $y_1,\dots,y_T$ as the iterates of some different algorithm $A_y \in A_{\mathrm{rand}}$ applied
to the oracle $\widetilde{\mathrm{Grad}}_{U,T}$ (since they are computed by measurable functions of the oracle outputs
and previous iterates). However, since $\|\rho_R(x)\| \le R$, we have $\|y_i\| \le R$ for all $i$. Thus, since $\widetilde{\mathrm{Grad}}_T$ is
a probability-$p$ zero-chain, by Lemma 24 we have that so long as $d \ge \lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\rceil$, then with
probability at least $1-c$, for all $i \le (T - \log(2/c))/(2p)$,
$$\mathrm{prog}_{1/4}(U^\top y_i) < T.$$
Now, our goal is to show that $\|\nabla\hat F_{T,U}(x_i)\| \ge 1/2$. We consider two cases: either $\|x_i\| > R/2$ or
not.
First, observe that for $x_i$ with $\|x_i\| > R/2$, we have:
$$\|\nabla\hat F_{T,U}(x_i)\| \ge \eta\|\nabla q_B(x_i)\| - \|J[\rho_R](x_i)\|_{op}\,\|\nabla F_{T,U}(y_i)\| \ge \eta\,\frac{\|x_i\|}{\sqrt{1+\|x_i\|^2/B^2}} - G_0\sqrt{T}.$$
Recalling $B = R = 10G_0T$, $\eta = 1/10$, and $G_0 = 23$, the right-hand side is at least $1/2$.

Now, suppose $\|x_i\| \le R/2$. Then, let us set $j = \mathrm{prog}_1(U^\top y_i) + 1 \le T$ (the inequality follows
since $\mathrm{prog}_1 \le \mathrm{prog}_{1/4}$). Then, if $u^j$ indicates the $j$th column of $U$, Lemma 22 implies:
$$|\langle u^j, y_i\rangle| < 1, \qquad |\langle u^j, \nabla F_{T,U}(y_i)\rangle| \ge 1.$$
Next, by direct calculation we have $J[\rho_R](x_i) = \frac{I - \rho_R(x_i)\rho_R(x_i)^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}} = \frac{I - y_iy_i^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}}$, so that:
$$\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle = \langle u^j, J[\rho_R](x_i)^\top\nabla F_{T,U}(y_i)\rangle + \eta\langle u^j, \nabla q_B(x_i)\rangle = \frac{\langle u^j, \nabla F_{T,U}(y_i)\rangle}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\langle u^j, y_i\rangle\langle y_i, \nabla F_{T,U}(y_i)\rangle/R^2}{\sqrt{1+\|x_i\|^2/R^2}} + \eta\langle u^j, \nabla q_B(x_i)\rangle.$$
Now, by Proposition 29, we have $\nabla q_B(x_i) = \left(2 - \frac{\|y_i\|^2}{B^2}\right)y_i$. So (with $R = B$):
$$\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle = \frac{\langle u^j, \nabla F_{T,U}(y_i)\rangle}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\langle u^j, y_i\rangle\langle y_i, \nabla F_{T,U}(y_i)\rangle/R^2}{\sqrt{1+\|x_i\|^2/R^2}} + \eta\left(2 - \frac{\|y_i\|^2}{B^2}\right)\langle u^j, y_i\rangle.$$
Observing that $\|y_i\| \le \|x_i\| \le R/2$ and $|\langle u^j, y_i\rangle| < 1$:
$$|\langle u^j, \nabla\hat F_{T,U}(x_i)\rangle| \ge \frac{|\langle u^j, \nabla F_{T,U}(y_i)\rangle|}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\|\nabla F_{T,U}(y_i)\|}{2R} - 2\eta.$$
Using $|\langle u^j, \nabla F_{T,U}(y_i)\rangle| \ge 1$ and $\|\nabla F_{T,U}(y_i)\| \le G_0T$:
$$\ge \frac{2}{\sqrt5} - \frac{G_0T}{2R} - 2\eta.$$
With $R = 10G_0T$ and $\eta = 1/10$:
$$= \frac{2}{\sqrt5} - \frac1{20} - \frac15 > 1/2. \qquad\square$$
With the settings of parameters specified in Lemma 25, we have:


Lemma 26 (variation on Lemma 7 in Arjevani et al. (2019)). With the settings of $R$, $B$, $\eta$ in
Lemma 25, the function $\hat F_{T,U}$ satisfies:
1. $\hat F_{T,U}(0) - \inf_x \hat F_{T,U}(x) \le \gamma_0T = 12T$.
2. $\hat F_{T,U}$ is $G_0\sqrt T + 3\eta B \le 92\sqrt T$-Lipschitz.
3. $\nabla\hat F_{T,U}(x)$ is $H_0 + 3 + 8\eta \le 156$-Lipschitz.
4. $\|\widehat{\mathrm{Grad}}_{T,U}(x,z)\| \le \frac{G_0}{p} + G_0\sqrt T + 3\eta B \le \frac{23}{p} + 92\sqrt T$ with probability 1.
5. $\widehat{\mathrm{Grad}}_{T,U}$ has variance at most $\frac{G_0^2}{p} \le \frac{23^2}{p}$.

Proof. 1. This property follows immediately from the fact that $F_T(0) - \inf_x F_T(x) \le \gamma_0T$.
2. Since $\rho_R$ is 1-Lipschitz for all $R$ and $q_B$ is $3B$-Lipschitz (see Proposition 29), $\hat F_{T,U}$ is
$G_0\sqrt T + 3\eta B$-Lipschitz.
3. By assumption, $R \ge \max(H_0, 1)$. Thus, by Arjevani et al. (2019, Lemma 16), $\nabla F_T(\rho_R(x))$
is $H_0+3$-Lipschitz, and so $\nabla\hat F_{T,U}$ is $H_0 + 3 + 8\eta$-Lipschitz by Proposition 29.
4. Since $\|\mathrm{Grad}_T\| \le \frac{G_0}{p} + G_0\sqrt T$, and $J[\rho_R](x)^\top U$ has operator norm at most 1, the bound follows.
5. Just as in the previous part, since $\mathrm{Grad}_T$ has variance $G_0^2/p$ and $J[\rho_R](x)^\top U$ has operator
norm at most 1, the bound follows. □
Now, we are finally in a position to prove:



Theorem 27. Given any $\gamma$, $H$, $\epsilon$, and $\sigma$ such that $\frac{H\gamma}{12\cdot 156\cdot 4\epsilon^2} \ge 2$, there exists a distribution over
functions $F$ and stochastic first-order oracles Grad such that with probability 1, $F$ is $H$-smooth,
$F(0) - \inf_x F(x) \le \gamma$, $F$ is $3\sqrt{H\gamma}$-Lipschitz, and Grad has variance $\sigma^2$, and for any algorithm $A$ in
$A_{\mathrm{rand}}$, with probability at least $1-c$, when provided a randomly selected Grad, $A$ requires at least
$\Omega\left(\frac{\gamma H\sigma^2}{\epsilon^4}\right)$ iterations to output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le \epsilon$.

Proof. From Lemma 26 and Lemma 25, we have a distribution over functions $F$ and first-order
oracles such that, with probability 1, $F$ is $G_0\sqrt T + 3G_0T \le 92\sqrt T$-Lipschitz, $F$ is $H_0 + 3 + 8\eta \le 156$-smooth, $F(0) - \inf_x F(x) \le \gamma_0T = 12T$, Grad has variance at most $G_0^2/p = 23^2/p$, and with
probability at least $1-c$,
$$\|\nabla F(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Now, set $\lambda = \frac{156}{H}\cdot 2\epsilon$, $T = \left\lfloor\frac{H\gamma}{12\cdot156\cdot(2\epsilon)^2}\right\rfloor$ and $p = \min((2\cdot23\cdot\epsilon)^2/\sigma^2,\ 1)$. Then, define
$$F_\lambda(x) = \frac{H\lambda^2}{156}F(x/\lambda).$$
Then $F_\lambda$ is $\frac{H\lambda^2}{156}\cdot\frac{1}{\lambda^2}\cdot156 = H$-smooth, $F_\lambda(0) - \inf_x F_\lambda(x) \le 12T\cdot\frac{H\lambda^2}{156} \le \gamma$, and $F_\lambda$ is $92\sqrt T\cdot\frac{H\lambda}{156} \le 3\sqrt{H\gamma}$-Lipschitz. We can construct an oracle $\mathrm{Grad}_\lambda$ from Grad by:
$$\mathrm{Grad}_\lambda(x,z) = \frac{H\lambda}{156}\,\mathrm{Grad}(x/\lambda, z),$$
so that
$$\mathbb{E}[\|\mathrm{Grad}_\lambda(x,z) - \nabla F_\lambda(x)\|^2] \le \frac{H^2\lambda^2}{156^2}\cdot\frac{23^2}{p} = \sigma^2.$$
Further, since an oracle for $F_\lambda$ can be constructed from the oracle for $F$, if we run $A$ on $F_\lambda$, then with
probability at least $1-c$,
$$\|\nabla F_\lambda(x_t)\| = \frac{H\lambda}{156}\|\nabla F(x_t)\| \ge \epsilon \quad\text{for all } t \le \frac{T-\log(2/c)}{2p}, \quad\text{and}\quad \frac{T-\log(2/c)}{2p} \ge 3\cdot10^{-7}\cdot\frac{H\gamma\sigma^2}{\epsilon^4} - 5\cdot10^{-3}\cdot\frac{\log(2/c)}{\epsilon^2}.$$
Thus, there exists a constant $K$ such that
$$\mathbb{E}[\|\nabla F_\lambda(x_t)\|] \ge \epsilon \quad\text{for all } t \le K\frac{H\gamma\sigma^2}{\epsilon^4}. \qquad\square$$
From this result, we have our main lower bound (the formal version of Theorem 18):

Theorem 28. For any $\delta$, $\epsilon$, $\gamma$, and $G \ge 3\sqrt{\frac{2\epsilon\gamma}{\delta}}$, there is a distribution over $G$-Lipschitz $C^\infty$ functions
$F$ with $F(0) - \inf_x F(x) \le \gamma$ and stochastic gradient oracles Grad with $\mathbb{E}[\|\mathrm{Grad}(x,z)\|^2] \le G^2$
such that for any algorithm $A \in A_{\mathrm{rand}}$, if $A$ is provided as input a randomly selected oracle Grad,
$A$ will require $\Omega(G^2\gamma/(\delta\epsilon^3))$ iterations to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.

Proof. From Theorem 27, for any $H$, $\epsilon'$, $\gamma$, and $\sigma$ we have a distribution over $C^\infty$ functions
$F$ and oracles Grad such that $F$ is $H$-smooth, $3\sqrt{H\gamma}$-Lipschitz, and $F(0) - \inf_x F(x) \le \gamma$, and
Grad has variance $\sigma^2$, such that $A$ requires $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations to output a point $x$ such that
$\mathbb{E}[\|\nabla F(x)\|] \le \epsilon'$. Set $\sigma^2 = G^2/2$, $H = \epsilon/\delta$, and $\epsilon' = 2\epsilon$. Then, we see that Grad has variance $G^2/2$, and
$F$ is $3\sqrt{H\gamma} = \frac{3\sqrt{\epsilon\gamma}}{\sqrt\delta} \le G/\sqrt2$-Lipschitz, so that $\mathbb{E}[\|\mathrm{Grad}(x,z)\|^2] \le G^2$. Further, by Proposition 14,
if $\|\nabla F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\| \le \epsilon + H\delta = 2\epsilon$. Therefore, since $A$ cannot output a point $x$ with
$\mathbb{E}[\|\nabla F(x)\|] \le \epsilon' = 2\epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations, we see that $A$ also cannot output a
point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4) = \Omega(\gamma G^2/(\epsilon^3\delta))$ iterations. □

F.3 Definition and Properties of qB
Consider the function $q_{B,d} : \mathbb{R}^d\to\mathbb{R}$ defined by
$$q_{B,d}(x) = \frac{\|x\|^2}{\sqrt{1 + \|x\|^2/B^2}} = x^\top\rho_{B,d}(x).$$
This function has the following properties, all of which follow from direct calculation:

Proposition 29. $q_{B,d}$ satisfies:
1. $$\nabla q_{B,d}(x) = \frac{2x}{\sqrt{1+\|x\|^2/B^2}} - \frac{x\|x\|^2}{B^2(1+\|x\|^2/B^2)^{3/2}} = \left(2 - \frac{\|\rho_B(x)\|^2}{B^2}\right)\rho_B(x).$$
2. $$\nabla^2 q_{B,d}(x) = \frac{1}{\sqrt{1+\|x\|^2/B^2}}\left(2I - \frac{4xx^\top}{B^2(1+\|x\|^2/B^2)} - \frac{\|x\|^2 I}{B^2(1+\|x\|^2/B^2)} + \frac{3\|x\|^2xx^\top}{B^4(1+\|x\|^2/B^2)^2}\right).$$
3. $$\frac{\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le \|\nabla q_{B,d}(x)\| \le \frac{3\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le 3B.$$
4. $$\|\nabla^2 q_{B,d}(x)\|_{op} \le \frac{8}{\sqrt{1+\|x\|^2/B^2}} \le 8.$$
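Since these identities follow from direct calculation, they are easy to sanity-check numerically. The following standalone Python snippet (an illustrative check, not part of any proof) compares the closed form for $\nabla q_{B,d}$ from item 1 against central finite differences at a random point.

import numpy as np

def q(x, B):
    return np.dot(x, x) / np.sqrt(1.0 + np.dot(x, x) / B**2)

def grad_q(x, B):
    rho = x / np.sqrt(1.0 + np.dot(x, x) / B**2)   # rho_{B,d}(x)
    return (2.0 - np.dot(rho, rho) / B**2) * rho

rng = np.random.default_rng(0)
x, B, eps = rng.standard_normal(5), 3.0, 1e-6
fd = np.array([(q(x + eps * e, B) - q(x - eps * e, B)) / (2 * eps)
               for e in np.eye(5)])                # central differences
assert np.allclose(fd, grad_q(x, B), atol=1e-5)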

G Proof of Theorem 13
First, we state and prove a theorem analogous to Theorem 8.

Theorem 30. Assume $F : \mathbb{R}^d\to\mathbb{R}$ is well-behaved. In Algorithm 1, set $s_n$ to be a random
variable sampled uniformly from $[0,1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. For $i = 1,\dots,d$, set
$$u_i^k = -D_\infty\,\frac{\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}}{\left|\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}\right|}$$
for some $D_\infty > 0$. Finally, suppose $\mathrm{Var}(g_{n,i}) = \sigma_i^2$ for $i = 1,\dots,d$. Then, we have
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] \le \frac{F(x_0)-F^\star}{D_\infty M} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{D_\infty M} + \frac{\sum_{i=1}^d\sigma_i}{\sqrt T}.$$

Proof. In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$
iterations, and so on. In other words, $u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$.

From Theorem 7, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$

Now, since $u_i^k = -D_\infty\,\frac{\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}}{\left|\sum_{t=1}^T\frac{\partial F(w_t^k)}{\partial x_i}\right|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1,\dots,d$, we
have
$$\mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla F(w_t^k), u^k\right\rangle\right] + D_\infty\,\mathbb{E}\left[\sum_{k=1}^K\left\|\sum_{t=1}^T(\nabla F(w_t^k) - g_{(k-1)T+t})\right\|_1\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla F(w_t^k), u^k\right\rangle\right] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i = \mathbb{E}\left[-D_\infty T\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i.$$

Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + D_\infty K\sqrt T\sum_{i=1}^d\sigma_i - D_\infty T\,\mathbb{E}\left[\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right].$$
Dividing by $KTD_\infty = D_\infty M$ and reordering, we have the stated bound. □
We can now prove Theorem 13.

Proof of Theorem 13. Since $A$ guarantees $\|\Delta_n\|_\infty \le D_\infty$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\|_\infty = \|x_n - (1-s_n)\Delta_n - x_{n'-1} + s_{n'}\Delta_{n'}\|_\infty \le \left\|\sum_{i=n+1}^{n'-1}\Delta_i\right\|_\infty + \|\Delta_n\|_\infty + \|\Delta_{n'}\|_\infty \le D_\infty((n'-1)-(n+1)+1) + 2D_\infty = D_\infty(n'-n+1) \le D_\infty T.$$
Therefore, we clearly have $\|w_t^k - w^k\|_\infty \le D_\infty T = \delta$, which shows the first fact.


Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. For the second fact,
observe that $\mathrm{Var}(g_{n,i}) \le \mathbb{E}[g_{n,i}^2] \le G_i^2$. Thus, applying Theorem 30 in concert with the additional
assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le D_\infty K\sqrt T\sum_{i=1}^dG_i$, we have
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1\right] \le 2\frac{F(x_0)-F^\star}{D_\infty N} + 2\frac{KD_\infty\sqrt T\sum_{i=1}^dG_i}{D_\infty N} + \frac{\sum_{i=1}^dG_i}{\sqrt T} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3\sum_{i=1}^dG_i}{\sqrt T} \le \max\left(\frac{5(\sum_{i=1}^dG_i)^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^dG_i}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last inequality is due to the choice of $T$.


Now, for the final fact, observe that $\|w_t^k - w^k\|_\infty \le \delta$ for all $t$ and $k$, and also that $w^k = \frac1T\sum_{t=1}^T w_t^k$. Therefore $S = \{w_1^k,\dots,w_T^k\}$ satisfies the conditions in the infimum in Definition 12,
so that $\|\nabla F(w^k)\|_{1,\delta} \le \left\|\frac1T\sum_{t=1}^T\nabla F(w_t^k)\right\|_1$. □
H Directional Derivative Setting


In the main text, our algorithms make use of a stochastic gradient oracle. However, the prior work
of Zhang et al. (2020b) instead considers a stochastic directional gradient oracle. This is a less
common setup, and other works (e.g., Davis et al. (2021)) have also taken our route of tackling
non-smooth optimization via an oracle that returns gradients at points of differentiability.
Nevertheless, all our results extend easily to the exact setting of Zhang et al. (2020b) in which
F is Lipschitz and directionally differentiable and we have access to a stochastic directional gradient
oracle rather than a stochastic gradient oracle. To quantify this setting, we need a bit more notation
which we copy directly from Zhang et al. (2020b) below:
First, from Clarke (1990) and Zhang et al. (2020b), the generalized directional derivative of a
function $F$ in a direction $d$ is
$$F^\circ(x, d) = \limsup_{y\to x,\ t\downarrow 0}\frac{F(y + td) - F(y)}{t}. \qquad (4)$$

Further, the generalized gradient is the set
$$\partial F(x) = \{g : \langle g, d\rangle \le F^\circ(x, d) \text{ for all } d\}.$$

Finally, $F : \mathbb{R}^d\to\mathbb{R}$ is Hadamard directionally differentiable in the direction $v \in \mathbb{R}^d$ if for any
function $\psi : \mathbb{R}_+\to\mathbb{R}^d$ such that $\lim_{t\to0}\frac{\psi(t)-\psi(0)}{t} = v$ and $\psi(0) = x$, the following limit exists:
$$\lim_{t\to0}\frac{F(\psi(t)) - F(x)}{t}.$$
If F is Hadamard directionally differentiable, then the above limit is denoted F ′ (x, v). When F is
Hadamard directionally differentiable for all x and v, then we say simply that F is directionally
differentiable.
With these definitions, a stochastic directional oracle for a Lipschitz, directionally differentiable,
and bounded from below function F is an oracle Grad(x, v, z) that outputs g ∈ ∂F (x) such that

$\langle g, v\rangle = F'(x, v)$. In this case, Zhang et al. (2020b) shows (Lemma 3) that $F$ satisfies an alternative
notion of well-behavedness:
$$F(y) - F(x) = \int_0^1\langle\mathbb{E}[\mathrm{Grad}(x + t(y-x), y-x, z)], y-x\rangle\,dt. \qquad (5)$$

Next, we define:
Definition 31. A point $x$ is a $(\delta,\epsilon)$-stationary point of $F$ for the generalized gradient if there is a
set of points $S$ contained in the ball of radius $\delta$ centered at $x$ such that for $y$ selected uniformly at
random from $S$, $\mathbb{E}[y] = x$, and for all $y$ there is a choice of $g_y \in \partial F(y)$ such that $\|\mathbb{E}[g_y]\| \le \epsilon$.
Similarly, we have the definition:
Definition 32. Given a point $x$ and a number $\delta > 0$, define:
$$\|\partial F(x)\|_\delta \triangleq \inf_{\substack{S\subset B(x,\delta),\ \frac1{|S|}\sum_{y\in S}y = x,\ g_y\in\partial F(y)}}\left\|\frac1{|S|}\sum_{y\in S}g_y\right\|.$$

In fact, whenever a locally Lipschitz function F is differentiable at a point x, we have that


∇F (x) ∈ ∂F (x), so that k∂F (x)kδ ≤ k∇F (x)kδ . Thus our results in the main text also bound
k∂F (x)kδ . However, while a gradient oracle is also a directional derivative oracle, a directional
derivative oracle is only guaranteed to be a gradient oracle if F is continuously differentiable at the
queried point x. This technical issue means that when we have access to a directional derivative
oracle rather than a gradient oracle, we will instead only bound k∂F (x)kδ rather than k∇F (x)kδ .
Despite this technical complication, our overall strategy is essentially identical. The key obser-
vation is that the only time at which we used the properties of the gradient previously was when
we invoked well-behavedness of F . When we have a directional derivative instead of the gradient,
the alternative notion of well-behavedness in (5) will play an identical role. Thus, our approach is
simply to replace the call to Grad(wn , zn ) in Algorithm 1 with a call instead to Grad(wn , ∆n , zn )
(see Algorithm 4). With this change, all of our analysis in the main text applies almost without
modification. Essentially, we only need to change notation in a few places to reflect the updated
definitions.
To begin this notational update, the counterpart to Theorem 7 is:
Theorem 33. Suppose $F$ is Lipschitz and directionally differentiable. With the notation in Algorithm 4, if we let $s_n$ be independent random variables uniformly distributed in $[0,1]$, then for any
sequence of vectors $u_1,\dots,u_M$ we have the equality:
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, \Delta_n - u_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$

Proof. By the alternative well-behavedness property (5),
$$F(x_n) - F(x_{n-1}) = \int_0^1\langle\mathbb{E}[\mathrm{Grad}(x_{n-1} + s(x_n - x_{n-1}), x_n - x_{n-1}, z_n)], x_n - x_{n-1}\rangle\,ds = \mathbb{E}[\langle g_n, \Delta_n\rangle] = \mathbb{E}[\langle g_n, \Delta_n - u_n\rangle + \langle g_n, u_n\rangle],$$
Algorithm 4 Online-to-Non-Convex Conversion (directional derivative oracle version)
Input: Initial point x0 , K ∈ N, T ∈ N, online learning algorithm A, sn for all n
Set M = K · T
for n = 1 . . . M do
Get ∆n from A
Set xn = xn−1 + ∆n
Set wn = xn−1 + sn ∆n
Sample random zn
Generate directional derivative gn = Grad(wn , ∆n , zn )
Send gn to A as gradient
end for
Set wtk = w(k−1)T +t for k = 1, . . . , K and t = 1, . . . , T
Set $w^k = \frac1T\sum_{t=1}^T w_t^k$ for $k = 1,\dots,K$
Return $\{w^1,\dots,w^K\}$

where in the second equality we have used the definition $g_n = \mathrm{Grad}(x_{n-1} + s_n(x_n - x_{n-1}), x_n - x_{n-1}, z_n)$, the assumption that $s_n$ is uniform on $[0,1]$, and Fubini's theorem (as Grad is bounded
by Lipschitzness of $F$). Now, sum over $n$ and telescope to obtain the stated bound. □
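For reference, Algorithm 4 is only a few lines of code. The Python sketch below is a direct rendering of the pseudocode above; online_learner and dgrad are hypothetical interfaces standing in for the online algorithm A (for instance, the optimistic OGD sketch from Appendix B) and the stochastic directional derivative oracle Grad(x, v, z), with the random z drawn inside dgrad.

import numpy as np

def online_to_nonconvex(online_learner, dgrad, x0, K, T, rng):
    # Online-to-non-convex conversion with a directional derivative oracle.
    x = np.array(x0, dtype=float)
    ws = []
    for n in range(K * T):
        delta = online_learner.get_iterate()   # Delta_n from A
        w = x + rng.random() * delta           # w_n = x_{n-1} + s_n * Delta_n
        x = x + delta                          # x_n = x_{n-1} + Delta_n
        g = dgrad(w, delta)                    # g_n = Grad(w_n, Delta_n, z_n)
        online_learner.update(g)               # send g_n to A as gradient
        ws.append(w)
    # w^k = average of the k-th block of T evaluation points
    return [np.mean(ws[k * T:(k + 1) * T], axis=0) for k in range(K)]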

Next, we have the following analog of Theorem 8:


Theorem 34. With the notation in Algorithm 4, set $s_n$ to be a random variable sampled uniformly
from $[0,1]$. Set $T, K\in\mathbb{N}$ and $M = KT$. Define $\nabla_t^k = \mathbb{E}[g_{(k-1)T+t}]$. Define $u^k = -D\,\frac{\sum_{t=1}^T\nabla_t^k}{\left\|\sum_{t=1}^T\nabla_t^k\right\|}$
for some $D > 0$, for $k = 1,\dots,K$. Finally, suppose $\mathrm{Var}(g_n) = \sigma^2$. Then:
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le \frac{F(x_0)-F^\star}{DM} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{DM} + \frac{\sigma}{\sqrt T}.$$
Proof. The proof is essentially identical to that of Theorem 8. In Theorem 33, set $u_n$ to be
equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations, and so on. In other words,
$u_n = u^{\lceil n/T\rceil}$ for $n = 1,\dots,M$. So, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right].$$
Now, since $u^k = -D\,\frac{\sum_{t=1}^T\nabla_t^k}{\left\|\sum_{t=1}^T\nabla_t^k\right\|}$ and $\mathrm{Var}(g_n) = \sigma^2$, we have
$$\mathbb{E}\left[\sum_{n=1}^M\langle g_n, u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla_t^k, u^k\right\rangle\right] + \mathbb{E}\left[\sum_{k=1}^K D\left\|\sum_{t=1}^T(\nabla_t^k - g_{(k-1)T+t})\right\|\right] \le \mathbb{E}\left[\sum_{k=1}^K\left\langle\sum_{t=1}^T\nabla_t^k, u^k\right\rangle\right] + D\sigma K\sqrt T = \mathbb{E}\left[-DT\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] + D\sigma K\sqrt T.$$
Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \sigma DK\sqrt T - DT\,\mathbb{E}\left[\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right].$$
Dividing by $KDT = DM$ and reordering, we have the stated bound. □
Finally, we instantiate Theorem 34 with online gradient descent to obtain the analog of Corollary 9. This result establishes that the online-to-non-convex conversion finds a $(\delta,\epsilon)$-stationary point in
$O(1/(\epsilon^3\delta))$ iterations, even when using a directional derivative oracle. Further, our lower bound
construction makes use of continuously differentiable functions, for which the directional derivative
oracle and the standard gradient oracle must coincide. Thus the $O(1/(\epsilon^3\delta))$ complexity is optimal in
this setting as well.

Corollary 35. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions and notation of Theorem 34, suppose in addition $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $A$ guarantees $\|\Delta_n\| \le D$ for some
user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$ for all $\|u^k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations).
Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0)-F^\star}\right)^{2/3}\right\rceil, N/2\right)$, and $K = \lfloor N/T\rfloor$.
Then, for all $k$ and $t$, $\|w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right),$$
which implies
$$\frac1K\sum_{k=1}^K\|\partial F(w^k)\|_\delta \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right).$$

Proof. Since $A$ guarantees $\|\Delta_n\| \le D$, for all $n < n' \le T + n - 1$, we have
$$\|w_n - w_{n'}\| = \|x_n - (1-s_n)\Delta_n - x_{n'-1} + s_{n'}\Delta_{n'}\| \le \left\|\sum_{i=n+1}^{n'-1}\Delta_i\right\| + \|\Delta_n\| + \|\Delta_{n'}\| \le D((n'-1)-(n+1)+1) + 2D = D(n'-n+1) \le DT.$$
Therefore, we clearly have $\|w_t^k - w^k\| \le DT = \delta$.


Note that from the choice of K and T we have M = KT ≥ N − T ≥ N/2. So, for the second
fact, notice that Var(gn ) ≤ G2 for all n. Thus, applying Theorem 34 in concert with the additional


assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$, we have:
$$\mathbb{E}\left[\frac1K\sum_{k=1}^K\left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|\right] \le 2\frac{F(x_0)-F^\star}{DN} + 2\frac{KDG\sqrt T}{DN} + \frac{G}{\sqrt T} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3G}{\sqrt T} \le \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last inequality is due to the choice of $T$.


Finally, observe that $\|w_t^k - w^k\| \le \delta$ for all $t$ and $k$, and also that $w^k = \frac1T\sum_{t=1}^T w_t^k$. Therefore
$S = \{w_1^k,\dots,w_T^k\}$ satisfies the conditions in the infimum in Definition 32, so that $\|\partial F(w^k)\|_\delta \le \left\|\frac1T\sum_{t=1}^T\nabla_t^k\right\|$. □