Optimal Stochastic Non-Smooth Non-Convex Optimization Through
Online-to-Non-convex Conversion
Ashok Cutkosky (Boston University)    Harsh Mehta (Google Research)    Francesco Orabona (Boston University)
Abstract
We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
1 Introduction
Algorithms for non-convex optimization are one of the most important tools in modern machine
learning, as training neural networks requires optimizing a non-convex objective. Given the abun-
dance of data in many domains, the time to train a neural network is the current bottleneck to
having bigger and more powerful machine learning models. Motivated by this need, the past
few years have seen an explosion of research focused on understanding non-convex optimiza-
tion (Ghadimi & Lan, 2013; Carmon et al., 2017; Arjevani et al., 2019; 2020; Carmon et al., 2019;
Fang et al., 2018). Despite significant progress, key issues remain unaddressed.
In this paper, we work to minimize a potentially non-convex objective $F:\mathbb{R}^d\to\mathbb{R}$ which we can only access in some stochastic or "noisy" manner. As motivation, consider $F(x) \triangleq \mathbb{E}_z[f(x,z)]$, where $x$ represents the model weights, $z$ a minibatch of i.i.d. examples, and $f$ the loss of a model with parameters $x$ on the minibatch $z$. In keeping with standard empirical practice, we restrict ourselves to first-order algorithms (gradient-based optimization).
The vast majority of prior analyses of non-convex optimization algorithms impose various
smoothness conditions on the objective (Ghadimi & Lan, 2013; Carmon et al., 2017; Allen-Zhu,
2018; Tripuraneni et al., 2018; Fang et al., 2018; Zhou et al., 2018; Fang et al., 2019; Cutkosky & Orabona,
2019; Li & Orabona, 2019; Cutkosky & Mehta, 2020; Zhang et al., 2020a; Karimireddy et al., 2020;
Levy et al., 2021; Faw et al., 2022; Liu et al., 2022). One motivation for smoothness assumptions is that they allow for a convenient surrogate for global minimization: rather than finding a global minimum of a neural network's loss surface (which may be intractable), we can hope to find an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla F(x)\| \le \epsilon$. By now, the fundamental limits on first-order smooth non-convex optimization are well understood: Stochastic Gradient Descent (SGD) will find an $\epsilon$-stationary point in $O(\epsilon^{-4})$ iterations, which is the optimal rate (Arjevani et al., 2019). Moreover, if $F$ happens to be second-order smooth, SGD requires only $O(\epsilon^{-3.5})$ iterations, which is also optimal (Fang et al., 2019; Arjevani et al., 2020). These optimality results motivate the popularity of SGD and its variants in practice (Kingma & Ba, 2014; Loshchilov & Hutter, 2016; 2018; Goyal et al., 2017; You et al., 2019).
Unfortunately, many standard neural network architectures are non-smooth (e.g., architectures
incorporating ReLUs or max-pools cannot be smooth). As a result, these analyses can only pro-
vide intuition about what might occur when an algorithm is deployed in practice: the theorems
themselves do not apply (see also the discussion in, e.g., Li et al. (2021)). Despite the obvious
need for non-smooth analyses, recent results suggest that even approaching a neighborhood of a
stationary point may be impossible for non-smooth objectives (Kornowski & Shamir, 2022b). Nev-
ertheless, optimization clearly is possible in practice, which suggests that we may need to rethink
our assumptions and goals in order to understand non-smooth optimization.
Fortunately, Zhang et al. (2020b) recently considered an alternative definition of stationarity
that is tractable even for non-smooth objectives and which has attracted much interest (Davis et al.,
2021; Tian et al., 2022; Kornowski & Shamir, 2022a; Tian & So, 2022; Jordan et al., 2022). Roughly
speaking, instead of $\|\nabla F(x)\| \le \epsilon$, we ask that there is a random variable $y$ supported in a ball of radius $\delta$ about $x$ such that $\|\mathbb{E}[\nabla F(y)]\| \le \epsilon$. We call such an $x$ a $(\delta,\epsilon)$-stationary point, so that the previous definition ($\|\nabla F(x)\| \le \epsilon$) is a $(0,\epsilon)$-stationary point. The current best-known complexity for identifying a $(\delta,\epsilon)$-stationary point is $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient evaluations.
In this paper, we significantly improve this result: we can identify a $(\delta,\epsilon)$-stationary point with only $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. Moreover, we also show that this rate is optimal. Our primary technique is a novel online-to-non-convex conversion: a connection between
non-convex stochastic optimization and online learning, which is a classical field of learning theory
that already has a deep literature (Cesa-Bianchi & Lugosi, 2006; Hazan, 2019; Orabona, 2019). In
particular, we show that an online learning algorithm that satisfies a shifting regret bound can be used to decide the update steps, when fed with linear losses constructed from the stochastic gradients of the function $F$. By establishing this connection, we hope to open new avenues for algorithm design
in non-convex optimization and also motivate new research directions in online learning.
In sum, we make the following contributions:
• We show that the $O(\epsilon^{-3}\delta^{-1})$ rate is optimal for all $\delta,\epsilon$ such that $\epsilon \le O(\delta)$.
• The $O(\epsilon^{-3}\delta^{-1})$ complexity implies the optimal $O(\epsilon^{-4})$ and $O(\epsilon^{-3.5})$ respective complexities for finding $(0,\epsilon)$-stationary points of smooth or second-order smooth objectives.
• For deterministic and second-order smooth objectives, we obtain a rate of $O(\epsilon^{-3/2}\delta^{-1/2})$, which implies the best-known $O(\epsilon^{-7/4})$ complexity for finding $(0,\epsilon)$-stationary points.
2 Definitions and Setup
Here, we formally introduce our setting and notation. We are interested in optimizing real-valued functions $F:\mathcal{H}\to\mathbb{R}$ where $\mathcal{H}$ is a real Hilbert space (usually $\mathcal{H}=\mathbb{R}^d$). We assume $F^\star \triangleq \inf_x F(x) > -\infty$. We assume that $F$ is differentiable, but we do not assume that $F$ is smooth. All norms $\|\cdot\|$ are the Hilbert space norm (i.e., the 2-norm) unless otherwise specified. As mentioned in the introduction, the motivating example to keep in mind in our development is the case $F(x) = \mathbb{E}_z[f(x,z)]$.
Our algorithms access information about $F$ through a stochastic gradient oracle $\textsc{Grad}:\mathcal{H}\times Z\to\mathcal{H}$. Given a point $x\in\mathcal{H}$, the oracle samples an i.i.d. random variable $z\in Z$ and returns $\textsc{Grad}(x,z)\in\mathcal{H}$ such that $\mathbb{E}[\textsc{Grad}(x,z)] = \nabla F(x)$ and $\mathrm{Var}(\textsc{Grad}(x,z)) \le \sigma^2$.
In the following, we only consider functions satisfying a mild regularity condition, which we call well-behavedness. Using this assumption, our results can be applied to improve the past results on non-smooth stochastic optimization. In fact, Proposition 2 (proof in Appendix A) below shows that for the wide class of functions that are locally Lipschitz, applying an arbitrarily small perturbation to the function is sufficient to ensure both differentiability and well-behavedness. It works via standard perturbation arguments similar to those used previously by Davis et al. (2021) (see also Bertsekas (1973); Duchi et al. (2012); Flaxman et al. (2005) for similar techniques in the convex setting).
Proposition 2. Let $F:\mathbb{R}^d\to\mathbb{R}$ be locally Lipschitz with stochastic oracle $\textsc{Grad}$ such that $\mathbb{E}_z[\textsc{Grad}(x,z)] = \nabla F(x)$ whenever $F$ is differentiable. We have two cases:
• If $F$ is differentiable everywhere, then $F$ is well-behaved.
• If $F$ is not differentiable everywhere, let $p > 0$ be an arbitrary number and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x+pu)]$. Then, $\hat F$ is differentiable and well-behaved, and the oracle $\widehat{\textsc{Grad}}(x,(z,u)) = \textsc{Grad}(x+pu,z)$ is a stochastic gradient oracle for $\hat F$. Moreover, $F$ is differentiable at $x+pu$ with probability 1 and, if $F$ is $G$-Lipschitz, then $|\hat F(x) - F(x)| \le pG$ for all $x$.
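As an aside, the smoothing construction above is straightforward to realize in code. A minimal sketch, assuming a user-supplied `grad_oracle` that implements Grad(·, z) with z sampled internally (the function name and interface are our own illustrative conventions):

```python
import numpy as np

def smoothed_grad(grad_oracle, x, p, rng):
    """Return one stochastic gradient of F_hat(x) = E_u[F(x + p*u)]
    by querying the original oracle at a uniformly perturbed point."""
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)            # uniform direction on the unit sphere
    u *= rng.uniform() ** (1.0 / d)   # radius law for uniform ball sampling
    return grad_oracle(x + p * u)
```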
Remark 3. We explicitly note that our results cover the case in which $F$ is directionally differentiable and we have access to a stochastic directional gradient oracle, as considered by Zhang et al. (2020b). This is a less standard oracle $\textsc{Grad}(x,v,z)$ that outputs $g$ such that $\mathbb{E}[\langle g,v\rangle]$ is the directional derivative of $F$ in the direction $v$. This setting is subtly different (although a directional derivative oracle is a gradient oracle at all points for which $F$ is continuously differentiable). In order to keep technical complications to a minimum, in the main text we consider the more familiar stochastic gradient oracle discussed above. In Appendix H, we show that our results and techniques also apply using directional gradient oracles with only superficial modification.
2.1 $(\delta,\epsilon)$-Stationary Points
Now, let us define our notion of $(\delta,\epsilon)$-stationary point. This definition is essentially the same as that used in Zhang et al. (2020b); Davis et al. (2021); Tian et al. (2022). It is in fact mildly more stringent, since we require an "unbiasedness" condition in order to make eventual connections to second-order smooth objectives easier. In analogy with Definition 12 below, for a point $x$ and $\delta > 0$ we write
$$\|\nabla F(x)\|_\delta \triangleq \inf_{S\subset B(x,\delta),\ \frac{1}{|S|}\sum_{y\in S}y = x}\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\|,$$
so that $x$ is a $(\delta,\epsilon)$-stationary point when $\|\nabla F(x)\|_\delta \le \epsilon$.
Let's also state an immediate corollary of Proposition 2 that converts a guarantee on a randomized smoothed function to one on the original function.
Corollary 6. Let $F:\mathbb{R}^d\to\mathbb{R}$ be $G$-Lipschitz. For $\epsilon > 0$, let $p = \epsilon/G$ and let $u$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(x) \triangleq \mathbb{E}_u[F(x+pu)]$. If a point $x$ satisfies $\|\nabla\hat F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\|_\delta \le 2\epsilon$.
Our ultimate goal is to use $N$ stochastic gradient evaluations of $F$ to identify a point $x$ with as small a value of $\mathbb{E}[\|\nabla F(x)\|_\delta]$ as possible. Due to Proposition 2 and Corollary 6, our results will immediately extend from differentiable $F$ to those $F$ that are locally Lipschitz and for which $\textsc{Grad}(x,z)$ returns an unbiased estimate of $\nabla F(x)$ whenever $F$ is differentiable at $x$. We therefore focus our subsequent development on the case that $F$ is differentiable and we have access to a stochastic gradient oracle.
2.2 Online Learning
With no stochastic assumptions, it is possible to design online algorithms that guarantee that the regret is upper bounded by $O(\sqrt T)$. We can also define a more challenging objective: minimizing the $K$-shifting regret, that is, the regret with respect to an arbitrary sequence of $K$ vectors $u^1,\dots,u^K\in V$ that changes every $T$ iterations:
$$R_T(u^1,\dots,u^K) \triangleq \sum_{k=1}^{K}\sum_{n=(k-1)T+1}^{kT}\langle g_n,\,\Delta_n - u^k\rangle. \qquad (1)$$
It should be intuitive that resetting the online algorithm every $T$ iterations can achieve a shifting regret of $O(K\sqrt T)$.
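This reset strategy is simple to implement. A minimal sketch, assuming a learner exposing `get_delta()`/`update(g)` methods and a `make_learner` factory (illustrative conventions of ours, not notation from the paper):

```python
class ResetEvery:
    """Restart a fresh copy of a static-regret online learner every T
    rounds: O(sqrt(T)) regret per block sums to O(K*sqrt(T)) K-shifting
    regret over K blocks."""

    def __init__(self, make_learner, T):
        self.make_learner, self.T = make_learner, T
        self.learner = make_learner()
        self.count = 0

    def get_delta(self):
        return self.learner.get_delta()

    def update(self, g):
        self.learner.update(g)
        self.count += 1
        if self.count % self.T == 0:   # comparator may change: start over
            self.learner = self.make_learner()
```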
3 Online-to-non-convex Conversion
In this section, we explain the online-to-non-convex conversion. The core idea transforms the minimization of a non-convex and non-smooth function into the problem of minimizing the shifting regret over linear losses. In particular, consider an optimization algorithm that updates a previous iterate $x_{n-1}$ by moving in a direction $\Delta_n$: $x_n = x_{n-1}+\Delta_n$. For example, SGD sets $\Delta_n = -\eta g_{n-1} = -\eta\cdot\textsc{Grad}(x_{n-1},z_{n-1})$ for a learning rate $\eta$. Instead, we let an online learning algorithm $\mathcal{A}$ decide the update direction $\Delta_n$, using the linear losses $\ell_n(\Delta) = \langle g_n,\Delta\rangle$.
The motivation behind essentially all first-order algorithms is that $F(x_{n-1}+\Delta_n) - F(x_{n-1}) \approx \langle g_n,\Delta_n\rangle$. This suggests that $\Delta_n$ should be chosen to minimize the inner product $\langle g_n,\Delta_n\rangle$. However, we are faced with two difficulties. The first difficulty is the approximation error in the first-order expansion. The second is the fact that $\Delta_n$ needs to be chosen before $g_n$ is revealed, so that $\Delta_n$ needs in some sense to "predict the future". Typical analyses of algorithms such as SGD use the remainder form of Taylor's theorem to address both difficulties simultaneously for smooth objectives, but in our non-smooth case this is not a valid approach. Instead, we tackle these difficulties independently. We overcome the first difficulty using the same randomized scaling trick employed by Zhang et al. (2020b): define $g_n$ to be a gradient evaluated not at $x_{n-1}$ or $x_{n-1}+\Delta_n$, but at a random point along the line segment connecting the two. Then for a well-behaved function
we will have $F(x_{n-1}+\Delta_n) - F(x_{n-1}) = \mathbb{E}[\langle g_n,\Delta_n\rangle]$. The second difficulty is where online learning shines: online learning algorithms are specifically designed to predict completely arbitrary sequences of vectors as accurately as possible.

Algorithm 1 Online-to-Non-Convex Conversion
Input: Initial point $x_0$, $K\in\mathbb{N}$, $T\in\mathbb{N}$, online learning algorithm $\mathcal{A}$, $s_n$ for all $n$
Set $M = K\cdot T$
for $n = 1,\dots,M$ do
  Get $\Delta_n$ from $\mathcal{A}$
  Set $x_n = x_{n-1} + \Delta_n$
  Set $w_n = x_{n-1} + s_n\Delta_n$
  Sample random $z_n$
  Generate gradient $g_n = \textsc{Grad}(w_n, z_n)$
  Send $g_n$ to $\mathcal{A}$ as gradient
end for
Set $w_t^k = w_{(k-1)T+t}$ for $k = 1,\dots,K$ and $t = 1,\dots,T$
Set $\bar w^k = \frac{1}{T}\sum_{t=1}^{T}w_t^k$ for $k = 1,\dots,K$
Return $\{\bar w^1,\dots,\bar w^K\}$
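For concreteness, here is a minimal sketch of Algorithm 1 in Python, assuming a stochastic `grad_oracle(w)` that samples $z$ internally and an online learner with the `get_delta`/`update` interface introduced above:

```python
import numpy as np

def online_to_nonconvex(x0, grad_oracle, learner, K, T, rng):
    """Run Algorithm 1 for M = K*T rounds; return the K block averages
    w_bar^1, ..., w_bar^K of the gradient query points w_n."""
    x = x0.copy()
    ws = []
    for n in range(K * T):
        delta = learner.get_delta()       # Delta_n from the online learner
        s = rng.uniform()                 # s_n ~ Uniform[0, 1]
        w = x + s * delta                 # w_n = x_{n-1} + s_n * Delta_n
        ws.append(w)
        learner.update(grad_oracle(w))    # g_n fed back as the linear loss
        x = x + delta                     # x_n = x_{n-1} + Delta_n
    return np.stack(ws).reshape(K, T, -1).mean(axis=1)
```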
The previous intuition is formalized in Algorithm 1 and the following result, which we will
elaborate on in Theorem 8 before yielding our main result in Corollary 9.
Theorem 7. Suppose $F$ is well-behaved. Define $\nabla_n = \int_0^1\nabla F(x_{n-1}+s\Delta_n)\,ds$. Then, with the notation in Algorithm 1 and for any sequence of vectors $u_1,\dots,u_M$, we have the equality:
$$F(x_M) = F(x_0) + \sum_{n=1}^{M}\langle g_n,\Delta_n - u_n\rangle + \sum_{n=1}^{M}\langle\nabla_n - g_n,\Delta_n\rangle + \sum_{n=1}^{M}\langle g_n,u_n\rangle.$$
Moreover, if we let $s_n$ be independent random variables uniformly distributed in $[0,1]$, then we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,\Delta_n - u_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,u_n\rangle\right].$$
3.1 Guarantees for Non-Smooth Non-Convex Functions
The primary value of Theorem 7 is that the term $\sum_{n=1}^{M}\langle g_n,\Delta_n - u_n\rangle$ is exactly the regret of an online learning algorithm: lower regret clearly translates to a smaller bound on $F(x_M)$. Next, by carefully choosing $u_n$, we will be able to relate the term $\sum_{n=1}^{M}\langle g_n,u_n\rangle$ to the gradient averages that appear in the definition of $(\delta,\epsilon)$-stationarity. Formalizing these ideas, we have the following:
Theorem 8. Assume $F$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0,1]$. Set $T,K\in\mathbb{N}$ and $M = KT$. Define $u^k = -D\frac{\sum_{t=1}^{T}\nabla F(w_t^k)}{\left\|\sum_{t=1}^{T}\nabla F(w_t^k)\right\|}$ for some $D > 0$ for $k = 1,\dots,K$. Finally, suppose $\mathrm{Var}(g_n) \le \sigma^2$. Then:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right] \le \frac{F(x_0)-F^\star}{DM} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{DM} + \frac{\sigma}{\sqrt T}.$$
Proof. In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations, and so on. In other words, $u_n = u^{\lfloor(n-1)/T\rfloor+1}$ for $n = 1,\dots,M$. So, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,u_n\rangle\right].$$
Now, since $u^k = -D\frac{\sum_{t=1}^{T}\nabla F(w_t^k)}{\left\|\sum_{t=1}^{T}\nabla F(w_t^k)\right\|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_n) \le \sigma^2$, we have
$$\mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle\sum_{t=1}^{T}\nabla F(w_t^k),\,u^k\right\rangle\right] + \mathbb{E}\left[\sum_{k=1}^{K}D\left\|\sum_{t=1}^{T}\left(\nabla F(w_t^k) - g_{(k-1)T+t}\right)\right\|\right]$$
$$\le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle\sum_{t=1}^{T}\nabla F(w_t^k),\,u^k\right\rangle\right] + D\sigma K\sqrt T$$
$$= \mathbb{E}\left[-DT\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right] + D\sigma K\sqrt T.$$
Combining with $\mathbb{E}[F(x_M)] \ge F^\star$, rearranging, and dividing by $DM = DKT$ gives the stated bound.
We now instantiate Theorem 8 with the simplest online learning algorithm: online gradient descent (OGD) (Zinkevich, 2003). OGD takes as input a radius $D$ and a step size $\eta$, and makes the update $\Delta_{n+1} = \Pi_{\|\Delta\|\le D}[\Delta_n - \eta g_n]$ with $\Delta_1 = 0$. The standard analysis shows that if $\mathbb{E}[\|g_n\|^2] \le G^2$ for all $n$, then with $\eta = \frac{D}{G\sqrt T}$, OGD will ensure¹ static regret $\mathbb{E}[R_T(u)] \le DG\sqrt T$ for any $u$ satisfying $\|u\| \le D$. Thus, by resetting the algorithm every $T$ iterations, we achieve $\mathbb{E}[R_T(u^1,\dots,u^K)] \le KDG\sqrt T$. This powerful guarantee for all sequences is characteristic of online learning. We are now free to optimize the remaining parameters $K$ and $D$ to achieve our main result, presented in Corollary 9.

¹For completeness, a proof of this statement is in Appendix B.
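A minimal sketch of this projected OGD learner, in the same illustrative interface as before; composing it with the reset wrapper from Section 2.2 (e.g. `ResetEvery(lambda: OGD(dim, D, eta), T)`) then yields the $K$-shifting regret bound used below:

```python
import numpy as np

class OGD:
    """Projected online gradient descent on the ball {||Delta|| <= D};
    with eta = D / (G * sqrt(T)), its static regret is at most DG*sqrt(T)."""

    def __init__(self, dim, D, eta):
        self.D, self.eta = D, eta
        self.delta = np.zeros(dim)        # Delta_1 = 0

    def get_delta(self):
        return self.delta

    def update(self, g):
        step = self.delta - self.eta * g  # gradient step on the linear loss
        norm = np.linalg.norm(step)
        self.delta = step if norm <= self.D else step * (self.D / norm)
```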
Corollary 9. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions of Theorem 8, suppose in addition $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$ for all $\|u^k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0)-F^\star}\right)^{2/3}\right\rceil,\,\frac{N}{2}\right)$, and $K = \lfloor N/T\rfloor$. Then, for all $k$ and $t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6G}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6G}{\sqrt N}\right).$$
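In code, the parameter choices of Corollary 9 are a couple of lines (a sketch; `F_gap` stands for $F(x_0)-F^\star$, which is typically unknown in practice and must be estimated):

```python
import math

def corollary9_parameters(N, delta, G, F_gap):
    """T = min(ceil((G*N*delta/F_gap)^(2/3)), N/2), K = floor(N/T), D = delta/T."""
    T = min(math.ceil((G * N * delta / F_gap) ** (2.0 / 3.0)), N // 2)
    T = max(T, 1)
    K = N // T
    D = delta / T
    return T, K, D
```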
Remark 10. With fairly straightforward martingale concentration arguments, the above can be extended to identify a $\left(\delta,\,O\!\left(\frac{G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}\right)\right)$-stationary point with high probability. We leave this for a future version of this work.
When $\mathcal{A}$ is OGD reset every $T$ iterations, the overall method takes the explicit form
$$x_n = x_{n-1} + \Delta_n$$
$$g_n = \textsc{Grad}(x_n + (s_n - 1)\Delta_n,\,z_n)$$
$$\Delta_{n+1} = \mathrm{clip}_D(\Delta_n - \eta g_n)$$
where $\mathrm{clip}_D(x) = x\min\left(\frac{D}{\|x\|},\,1\right)$. In words, the update is reminiscent of SGD updates with momentum and clipping. The primary difference is that the stochastic gradient is taken at a slightly perturbed $x_n$.
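Written out, one step of this unrolled method takes only a few lines (a sketch with the same assumed `grad_oracle`, `eta`, `D`, and `rng` as before):

```python
import numpy as np

def clip_to_ball(v, D):
    """clip_D(v) = v * min(D / ||v||, 1)."""
    norm = np.linalg.norm(v)
    return v if norm <= D else v * (D / norm)

def unrolled_step(x, delta, grad_oracle, eta, D, rng):
    """One iteration: move by Delta_n, query the gradient at a random point
    on the segment just traversed, then take a clipped gradient step."""
    x = x + delta                              # x_n = x_{n-1} + Delta_n
    s = rng.uniform()
    g = grad_oracle(x + (s - 1.0) * delta)     # = Grad(x_{n-1} + s*Delta_n)
    delta = clip_to_ball(delta - eta * g, D)   # Delta_{n+1}
    return x, delta
```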
Proof of Corollary 9. Since $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$, for all $n < n' \le T + n - 1$ we have $\|x_{n'} - x_n\| \le (n'-n)D \le TD = \delta$; hence $\|\bar w^k - w_t^k\| \le \delta$ for all $k$ and $t$. The remaining claims follow from Theorem 8 with the stated choices of $D$, $T$, and $K$.
Definition 11. A point $x$ is a $(\delta,\epsilon)$-stationary point with respect to the $L^1$ norm of an almost-everywhere differentiable function $F$ if there exists a finite subset $S$ of the $L^\infty$ ball of radius $\delta$ centered at $x$ such that if $y$ is selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and $\|\mathbb{E}[\nabla F(y)]\|_1 \le \epsilon$.
Definition 12. Given a point $x$, a number $\delta > 0$ and an almost-everywhere differentiable function $F$, define
$$\|\nabla F(x)\|_{1,\delta} \triangleq \inf_{S\subset B_\infty(x,\delta),\ \frac{1}{|S|}\sum_{y\in S}y = x}\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\|_1.$$
We can now state a theorem similar to Corollary 9. Given that the proof is also very similar, we defer it to Appendix G.
Theorem 13. Suppose we have a budget of $N$ gradient evaluations. Assume $F:\mathbb{R}^d\to\mathbb{R}$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0,1]$. Set $T,K\in\mathbb{N}$ and $M = KT$. Assume that $\mathbb{E}[g_{n,i}^2] \le G_i^2$ for $i = 1,\dots,d$ for all $n$. Assume that $\mathcal{A}$ guarantees $\|\Delta_n\|_\infty \le D_\infty$ for some user-specified $D_\infty$ for all $n$ and ensures the standard worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u^1,\dots,u^K)] \le D_\infty K\sqrt T\sum_{i=1}^{d}G_i$ for all $\|u^k\|_\infty \le D_\infty$. Let $\delta > 0$ be an arbitrary number. Set $D_\infty = \delta/T$, $T = \min\left(\left\lceil\left(\frac{N\delta\sum_{i=1}^{d}G_i}{F(x_0)-F^\star}\right)^{2/3}\right\rceil,\,\frac N2\right)$, and $K = \lfloor N/T\rfloor$. Then, for all $k,t$, $\|\bar w^k - w_t^k\|_\infty \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|_1\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5\left(\sum_{i=1}^{d}G_i\right)^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6\sum_{i=1}^{d}G_i}{\sqrt N}\right),$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\bar w^k)\|_{1,\delta}\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5\left(\sum_{i=1}^{d}G_i\right)^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6\sum_{i=1}^{d}G_i}{\sqrt N}\right).$$
Let's compare this result with Corollary 9. For a fair comparison, we set $G_i$ and $G$ such that $\sum_{i=1}^{d}G_i^2 = G^2$. Then, we can lower bound $\|\cdot\|_\delta$ with $\frac{1}{\sqrt d}\|\cdot\|_{1,\delta}$. Hence, under the assumption $\mathbb{E}[\|g_n\|^2] = \sum_{i=1}^{d}\mathbb{E}[g_{n,i}^2] \le \sum_{i=1}^{d}G_i^2 = G^2$, Corollary 9 implies $\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\bar w^k)\|_{1,\delta} \le O\left(\frac{G^{2/3}\sqrt d}{(N\delta)^{1/3}}\right)$.
Now, let us see what would happen if we instead employed Theorem 13. First, observe that $\sum_{i=1}^{d}G_i \le \sqrt d\sqrt{\sum_{i=1}^{d}G_i^2} \le \sqrt d\,G$. Substituting this expression into Theorem 13 now gives an upper bound on $\|\cdot\|_{1,\delta}$ that is $O\left(\frac{d^{1/3}G^{2/3}}{(N\delta)^{1/3}}\right)$, which is better than the one we could obtain from Corollary 9 under the same assumptions.
Proposition 14. Suppose that $F$ is $H$-smooth (that is, $\|\nabla F(x) - \nabla F(y)\| \le H\|x-y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + H\delta$.
Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2F(x) - \nabla^2F(y)\|_{op} \le J\|x-y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac J2\delta^2$.
Now, recall that Corollary 9 shows that we can find a $(\delta,\epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ iterations. Thus, Proposition 14 implies that by setting $\delta = \epsilon/H$, we can find a $(0,\epsilon)$-stationary point of an $H$-smooth objective $F$ in $O(\epsilon^{-4})$ iterations, which matches the (optimal) guarantee of standard SGD (Ghadimi & Lan, 2013; Arjevani et al., 2019). Further, Proposition 15 shows that by setting $\delta = \sqrt{\epsilon/J}$, we can find a $(0,\epsilon)$-stationary point of a $J$-second-order smooth objective. This matches the performance of more refined SGD variants and is also known to be tight (Fang et al., 2019; Cutkosky & Mehta, 2020; Arjevani et al., 2020). In summary: the online-to-non-convex conversion also recovers the optimal results for smooth stochastic losses.
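In code, these reductions amount to a choice of $\delta$ (a sketch based on Propositions 14 and 15):

```python
import math

def delta_for_smooth(eps, H):
    """delta = eps/H: Proposition 14 gives ||grad F(x)|| <= eps + H*delta = 2*eps."""
    return eps / H

def delta_for_second_order_smooth(eps, J):
    """delta = sqrt(eps/J): Proposition 15 gives
    ||grad F(x)|| <= eps + (J/2)*delta**2 = 1.5*eps."""
    return math.sqrt(eps / J)
```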
Optimistic online learning algorithms can guarantee regret bounds of the form $R_T(u) \le CD\sqrt{\sum_{t=1}^{T}\|g_t - h_t\|^2}$ for some "hint" vectors $h_t$. In Appendix B, we provide an explicit construction of such an algorithm for completeness. The standard setting for the hints is $h_t = g_{t-1}$. As explained in Section 2.2, to obtain a $K$-shifting regret it will be enough to reset the algorithm every $T$ iterations.
Theorem 16. Suppose we have a budget of $N$ gradient evaluations, and that we have an online algorithm $\mathcal{A}_{static}$ that guarantees $\|\Delta_n\| \le D$ for all $n$ and ensures the optimistic regret bound $R_T(u) \le CD\sqrt{\sum_{t=1}^{T}\|g_t - g_{t-1}\|^2}$ for some constant $C$, where we define $g_0 = 0$. In Algorithm 1, set $\mathcal{A}$ to be $\mathcal{A}_{static}$ reset every $T$ rounds. Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\frac{(C\delta^2HN)^{2/5}}{(F(x_0)-F^\star)^{2/5}}\right\rceil,\,\frac N2\right)$, and $K = \lfloor N/T\rfloor$. Finally, suppose that $F$ is $H$-smooth and that the gradient oracle is deterministic (that is, $g_n = \nabla F(w_n)$). Then, for all $k,t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, with $G_1 = \|\nabla F(x_1)\|$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right] \le \max\left(\frac{6(CH)^{2/5}(F(x_0)-F^\star)^{3/5}}{\delta^{1/5}N^{3/5}},\,\frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0)-F^\star)}{\delta N},$$
which implies
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\bar w^k)\|_\delta\right] \le \frac{2CG_1}{N} + \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{6(CH)^{2/5}(F(x_0)-F^\star)^{3/5}}{\delta^{1/5}N^{3/5}},\,\frac{17C\delta H}{N^{3/2}}\right).$$
Note that the expectation here encompasses only the randomness in the choice of $s_t^k$, because the gradient oracle is assumed to be deterministic. Theorem 16 finds a $(\delta,\epsilon)$-stationary point in $O(\epsilon^{-5/3}\delta^{-1/3})$ iterations. Thus, by setting $\delta = \epsilon/H$, Proposition 14 shows we can find a $(0,\epsilon)$-stationary point in $O(\epsilon^{-2})$ iterations, which matches the standard optimal rate (Carmon et al., 2021).
Proof. The first fact (for all $k,t$, $\|\bar w^k - w_t^k\| \le \delta$) holds for precisely the same reason that it holds in Corollary 9. The final fact follows from the second fact again with the same reasoning, so it remains only to show the second fact.
To show the second fact, observe that for $k = 1$ we have
$$R_T(u^k) \le CD\sqrt{\sum_{t=1}^{T}\|g_t^k - g_{t-1}^k\|^2} \le CD\sqrt{G_1^2 + \sum_{t=2}^{T}\|\nabla F(w_t^k) - \nabla F(w_{t-1}^k)\|^2} \le CD\sqrt{G_1^2 + H^2\sum_{t=2}^{T}\|w_t^k - w_{t-1}^k\|^2} \le CD\sqrt{G_1^2 + 4H^2TD^2} \le CDG_1 + 2CD^2H\sqrt T.$$
For $k > 1$, we similarly have
$$R_T(u^k) \le CD\sqrt{\sum_{t=1}^{T}\|g_t^k - g_{t-1}^k\|^2} \le CD\sqrt{4H^2TD^2} = 2CD^2H\sqrt T.$$
Now, applying Theorem 8 in concert with the above bounds on $R_T(u^k)$, we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right] \le \frac{2(F(x_0)-F^\star)}{DN} + \frac{2CDG_1 + 4CKD^2H\sqrt T}{DN} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{2CG_1}{N} + \frac{4C\delta H}{T^{3/2}}$$
$$\le \max\left(\frac{6(CH)^{2/5}(F(x_0)-F^\star)^{3/5}}{\delta^{1/5}N^{3/5}},\,\frac{17C\delta H}{N^{3/2}}\right) + \frac{2CG_1}{N} + \frac{2(F(x_0)-F^\star)}{\delta N}.$$
With this refined online algorithm, we can show the following convergence guarantee, whose proof is in Appendix D.
Theorem 17. In Algorithm 1, assume that $g_n = \nabla F(w_n)$, and set $s_n = \frac12$. Use Algorithm 2 restarted every $T$ rounds as $\mathcal{A}$. Let $\delta > 0$ be an arbitrary number. Set $T = \min\left(\left\lceil\frac{(\delta^2(H+J\delta)N)^{1/3}}{(F(x_0)-F^\star)^{1/3}}\right\rceil,\,N/2\right)$ and $K = \lfloor N/T\rfloor$. In Algorithm 2, set $\eta = 1/(2H)$, $D = \delta/T$, and $Q = \left\lceil\log_2\sqrt{NG/(HD)}\right\rceil$. Finally, suppose that $F$ is $J$-second-order-smooth. Then, the following facts hold:
Algorithm 2 Optimistic Mirror Descent with Careful Hints
Input: Learning rate $\eta$, number $Q$ ($Q$ will be $O(\log N)$), function $F$, horizon $T$, radius $D$
Receive initial iterate $x_0$
Set $\Delta'_1 = 0$
for $t = 1,\dots,T$ do
  Set $h_t^0 = \nabla F(x_{t-1})$
  for $i = 1,\dots,Q$ do
    Set $h_t^i = \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)$
  end for
  Set $h_t = h_t^Q$
  Output $\Delta_t = \Pi_{\|\Delta\|\le D}[\Delta'_t - \eta h_t]$
  Receive $t$-th gradient $g_t$
  Set $\Delta'_{t+1} = \Pi_{\|\Delta\|\le D}[\Delta'_t - \eta g_t]$
end for
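A minimal sketch of one $T$-round block of Algorithm 2 combined with the conversion step (assuming a deterministic gradient function `grad_F` in place of the oracle, as in Theorem 17 where $s_n = 1/2$):

```python
import numpy as np

def project_ball(v, D):
    """Euclidean projection onto {||v|| <= D}."""
    norm = np.linalg.norm(v)
    return v if norm <= D else v * (D / norm)

def careful_hints_block(x0, grad_F, eta, D, Q, T):
    """The hint h_t is refined Q times toward a fixed point of
    h -> grad_F(x_{t-1} + 0.5 * Pi_D[Delta'_t - eta*h])."""
    x = x0.copy()
    delta_prime = np.zeros_like(x0)
    for _ in range(T):
        h = grad_F(x)                                   # h_t^0
        for _ in range(Q):                              # h_t^1, ..., h_t^Q
            h = grad_F(x + 0.5 * project_ball(delta_prime - eta * h, D))
        delta = project_ball(delta_prime - eta * h, D)  # play Delta_t
        g = grad_F(x + 0.5 * delta)                     # g_t = grad F(w_t)
        delta_prime = project_ball(delta_prime - eta * g, D)
        x = x + delta                                   # conversion update
    return x
```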
7 Lower Bounds
In this section, we show that the $O(\epsilon^{-3}\delta^{-1})$ complexity achieved in Corollary 9 is tight. We do this by a simple extension of the lower bound for stochastic smooth non-convex optimization of Arjevani et al. (2019). We provide an informal statement and proof sketch below. The formal result (Theorem 28) and proof are provided in Appendix F.
Theorem 18 (informal). There is a universal constant $C$ such that for any $\delta,\epsilon,\gamma$ and $G \ge C\frac{\sqrt{\epsilon\gamma}}{\sqrt\delta}$, for any first-order algorithm $A$, there is a $G$-Lipschitz, $C^\infty$ function $F:\mathbb{R}^d\to\mathbb{R}$ for some $d$ with $F(0) - \inf_x F(x) \le \gamma$ and a stochastic first-order gradient oracle for $F$ whose outputs $g$ satisfy $\mathbb{E}[\|g\|^2] \le G^2$, such that $A$ requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ stochastic oracle queries to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.
Proof sketch. The construction of Arjevani et al. (2019) provides, for any $H,\gamma,\epsilon$, and $\sigma$, a function $F$ and a stochastic oracle whose outputs have norm at most $\sigma$ such that $F$ is $H$-smooth, $O(\sqrt{H\gamma})$-Lipschitz, and $A$ requires $\Omega(\sigma^2H\gamma/\epsilon^4)$ oracle queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. By setting $H = \frac\epsilon\delta$ and $\sigma = G/\sqrt2$, this becomes a $\sqrt{\epsilon\gamma/\delta}$-Lipschitz function, and so is at most $G/\sqrt2$-Lipschitz. Thus, the second moment of the gradient oracle is at most $G^2/2 + G^2/2 = G^2$. Further, the algorithm requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point $x$ with $\|\nabla F(x)\| \le 2\epsilon$. Now, if $\|\nabla F(x)\|_\delta \le \epsilon$, then since $F$ is $H = \frac\epsilon\delta$-smooth, by Proposition 14, $\|\nabla F(x)\| \le \epsilon + \delta H = 2\epsilon$. Thus, we see that we need $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point with $\|\nabla F(x)\|_\delta \le \epsilon$, as desired.
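For concreteness, the parameter substitution in this sketch works out as
$$\Omega\left(\frac{\sigma^2H\gamma}{\epsilon^4}\right) = \Omega\left(\frac{(G^2/2)\cdot(\epsilon/\delta)\cdot\gamma}{\epsilon^4}\right) = \Omega\left(\frac{G^2\gamma}{\delta\epsilon^3}\right).$$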
8 Conclusion
We have presented a new online-to-non-convex conversion technique that applies online learning algorithms to non-convex and non-smooth stochastic optimization. When used with online gradient descent, this achieves the optimal $O(\epsilon^{-3}\delta^{-1})$ complexity for finding $(\delta,\epsilon)$-stationary points.
These results suggest new directions for work in online learning. Much past work is moti-
vated by the online-to-batch conversion relating static regret to convex optimization. We employ
switching regret for non-convex optimization. More refined analysis may be possible via gen-
eralizations such as strongly adaptive or dynamic regret (Daniely et al., 2015; Jun et al., 2017;
Zhang et al., 2018; Jacobsen & Cutkosky, 2022; Cutkosky, 2020; Lu et al., 2022; Luo et al., 2022;
Zhang et al., 2021; Baby & Wang, 2022; Zhang et al., 2022). Moreover, our analysis assumes per-
fect tuning of constants (e.g., D, T, K) for simplicity. In practice, we would prefer to adapt to un-
known parameters, motivating new applications and problems for adaptive online learning, which
is already an area of active current investigation (see, e.g., Orabona & Pál, 2015; Hoeven et al.,
2018; Cutkosky & Orabona, 2018; Cutkosky, 2019; Mhammedi & Koolen, 2020; Chen et al., 2021;
Sachs et al., 2022; Wang et al., 2022). It is our hope that some of this expertise can be applied in
the non-convex setting as well.
Finally, our results leave an important question unanswered: the current best-known algorithm for deterministic non-smooth optimization still requires $O(\epsilon^{-3}\delta^{-1})$ iterations to find a $(\delta,\epsilon)$-stationary point (Zhang et al., 2020b). We achieve this same result even in the stochastic case. Thus it is natural to wonder if one can improve in the deterministic setting. For example, is the $O(\epsilon^{-3/2}\delta^{-1/2})$ complexity we achieve in the smooth setting also achievable in the non-smooth setting?
Acknowledgements
Ashok Cutkosky is supported by the National Science Foundation grant CCF-2211718 as well as a
Google gift. Francesco Orabona is supported by the National Science Foundation under the grants
no. 2022446 “Foundations of Data Science Institute” and no. 2046096 “CAREER: Parameter-free
Optimization Algorithms for Machine Learning”.
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima
for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds
for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Sekhari, A., and Sridharan, K. Second-order
information in non-convex stochastic optimization: Power and limitations. In Conference on
Learning Theory, pp. 242–299, 2020.
Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex
losses and beyond. In International Conference on Artificial Intelligence and Statistics, pp.
1805–1845. PMLR, 2022.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. “convex until proven guilty”: Dimension-
free acceleration of gradient descent on non-convex functions. In International Conference on
Machine Learning, pp. 654–663. PMLR, 2017.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex opti-
mization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
I. Mathematical Programming, pp. 1–50, 2019.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points
II: first-order methods. Mathematical Programming, 185(1-2), 2021.
Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press,
2006.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning
algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.
Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and
its applications. In Conference on Learning Theory, pp. 1216–1259. PMLR, 2021.
Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach
spaces. In Conference On Learning Theory, pp. 1493–1529, 2018.
Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International
Conference on Machine Learning, pp. 1405–1411. PMLR, 2015.
Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. A gradient sampling method
with complexity guarantees for lipschitz functions in high and low dimensions. arXiv preprint
arXiv:2112.06969, 2021.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochas-
tic optimization. In Conference on Learning Theory (COLT), pp. 257–269, 2010.
Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic opti-
mization. SIAM Journal on Optimization, 22(2):674–701, 2012.
Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via
stochastic path-integrated differential estimator. In Advances in Neural Information Processing
Systems, pp. 689–699, 2018.
Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex sgd escaping from saddle points. In
Conference on Learning Theory, pp. 1192–1234, 2019.
Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power
of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In
Conference on Learning Theory, pp. 313–355. PMLR, 2022.
Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit
setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM
symposium on Discrete algorithms, pp. 385–394, 2005.
Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic pro-
gramming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Ghai, U., Lu, Z., and Hazan, E. Non-convex online learning via algorithmic equivalence. arXiv
preprint arXiv:2205.15235, 2022.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia,
Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.
Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in
costs. Machine learning, 80(2-3):165–188, 2010.
Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. In
International Conference on Machine Learning, pp. 1433–1441. PMLR, 2017.
Herbster, M. and Warmuth, M. K. Tracking the best regressor. In Proceedings of the eleventh
annual conference on Computational learning theory, pp. 24–31, 1998.
Hoeven, D., Erven, T., and Kotlowski, W. The many faces of exponential weights in online learning.
In Conference On Learning Theory, pp. 2067–2092. PMLR, 2018.
Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Loh, P.-L. and Raginsky, M.
(eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of
Machine Learning Research, pp. 4160–4211. PMLR, 2022.
Jordan, M. I., Lin, T., and Zampetakis, M. On the complexity of deterministic nonsmooth and
nonconvex optimization. arXiv preprint arXiv:2209.12463, 2022.
Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning
using coin betting. In Artificial Intelligence and Statistics, pp. 943–951. PMLR, 2017.
Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh,
A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint
arXiv:2008.03606, 2020.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive SGD with recursive momentum for
nonconvex optimization. Advances in Neural Information Processing Systems, 34:20571–20582,
2021.
Li, H. and Lin, Z. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the $O(\epsilon^{-7/4})$ complexity. In International Conference on Machine Learning. PMLR, 2022.
Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In
The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR,
2019.
Li, X., Zhuang, Z., and Orabona, F. A second look at exponential and cosine step sizes: Simplicity,
adaptivity, and performance. In International Conference on Machine Learning, pp. 6553–6564.
PMLR, 2021.
Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. META-STORM: Generalized
fully-adaptive variance reduced SGD for unbounded functions. arXiv preprint arXiv:2209.14853,
2022.
Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983, 2016.
Lu, Z., Xia, W., Arora, S., and Hazan, E. Adaptive gradient methods with local guarantees. arXiv
preprint arXiv:2203.01400, 2022.
Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study
on switching regret for linear bandits. In Conference on Learning Theory, 2022.
McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In
Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp. 244–256, 2010.
Orabona, F. and Pál, D. Scale-free algorithms for online linear optimization. In Chaudhuri, K.,
Gentile, C., and Zilles, S. (eds.), Algorithmic Learning Theory, pp. 287–301. Springer Interna-
tional Publishing, 2015.
Sachs, S., Hadiji, H., van Erven, T., and Guzmán, C. Between stochastic and adversarial online
convex optimization: Improved regret bounds via smoothness. arXiv preprint arXiv:2202.07554,
2022.
Stein, E. M. and Shakarchi, R. Real analysis: measure theory, integration, and Hilbert spaces.
Princeton University Press, 2009.
Tian, L. and So, A. M.-C. No dimension-free deterministic algorithm computes approximate sta-
tionarities of lipschitzians. arXiv preprint arXiv:2210.06907, 2022.
Tian, L., Zhou, K., and So, A. M.-C. On the finite-time complexity and practical computation of
approximate stationarity concepts of lipschitz functions. In International Conference on Machine
Learning, pp. 21360–21379. PMLR, 2022.
Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization
for fast nonconvex optimization. In Advances in neural information processing systems, pp.
2899–2908, 2018.
Wang, G., Hu, Z., Muthukumar, V., and Abernethy, J. Adaptive oracle-efficient online learning.
arXiv preprint arXiv:2210.09385, 2022.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer,
K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes.
arXiv preprint arXiv:1904.00962, 2019.
Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive
methods good for attention models? Advances in Neural Information Processing Systems, 33:
15383–15393, 2020a.
Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. Complexity of finding stationary points
of nonconvex nonsmooth functions. In International Conference on Machine Learning, 2020b.
Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Pro-
ceedings of the 32nd International Conference on Neural Information Processing Systems, pp.
1330–1340, 2018.
Zhang, L., Wang, G., Tu, W.-W., Jiang, W., and Zhou, Z.-H. Dual adaptivity: A universal
algorithm for minimizing the adaptive regret of convex functions. Advances in Neural Information
Processing Systems, 34:24968–24980, 2021.
Zhang, Z., Cutkosky, A., and Paschalidis, I. Adversarial tracking control via strongly adaptive
online learning with memory. In International Conference on Artificial Intelligence and Statistics,
pp. 8458–8492. PMLR, 2022.
Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems,
pp. 3925–3936, 2018.
Zhuang, Z., Cutkosky, A., and Orabona, F. Surrogate losses for online learning of stepsizes in
stochastic non-convex optimization. In International Conference on Machine Learning, pp. 7664–
7672, 2019.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Pro-
ceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936,
2003.
A Proof of Proposition 2
First, we state a technical lemma that will be used to prove Proposition 2.
Proof. First, observe that since $F$ is locally Lipschitz, for every point $x\in\mathbb{R}^d$ with rational coordinates there is a neighborhood $U_x$ of $x$ on which $F$ is Lipschitz. Thus, by Rademacher's theorem, $F$ is differentiable almost everywhere in $U_x$. Since the set of points with rational coordinates is dense in $\mathbb{R}^d$, $\mathbb{R}^d$ is equal to the countable union $\bigcup U_x$. Thus, since the set of points of non-differentiability of $F$ in $U_x$ has measure zero, the total set of points of non-differentiability is a countable union of sets of measure zero and so must have measure zero. Thus $F$ is differentiable almost everywhere. This implies that $F$ is differentiable at $x + pu$ with probability 1.
Next, observe that for any compact set $S\subset\mathbb{R}^d$, for every point $x\in S$ with rational coordinates, $F$ is Lipschitz on some neighborhood $U_x$ containing $x$ with Lipschitz constant $G_x$. Since $S$ is compact, there is a finite set $x_1,\dots,x_K$ such that $S\subseteq\bigcup U_{x_i}$. Therefore, $F$ is $\max_i G_{x_i}$-Lipschitz on $S$, and so $F$ is Lipschitz on every compact set.
Now, for almost all $x$, for all $v$ we have that the limit $\lim_{\delta\to0}\frac{F(x+\delta v)-F(x)}{\delta}$ exists and is equal to $\langle\nabla F(x),v\rangle$ by definition of differentiability. Further, on any compact set $S$ we have $\frac{|F(x+\delta v)-F(x)|}{\delta} \le L$ for some $L$ for all $x\in S$. Therefore, by the bounded convergence theorem (see e.g., Stein & Shakarchi (2009, Theorem 1.4)), we have that $\langle\nabla F(x),v\rangle$ is integrable on $S$.
Next, we prove the linearity of the operator $\rho$. Observe that for any vectors $v$ and $w$, and scalar $c$, by linearity of integration, we have
$$\int_D\langle\nabla F(x),\,cv + w\rangle\,dx = \int_D c\langle\nabla F(x),v\rangle + \langle\nabla F(x),w\rangle\,dx = c\int_D\langle\nabla F(x),v\rangle\,dx + \int_D\langle\nabla F(x),w\rangle\,dx.$$
For the remaining statement, given that $\mathbb{R}^d$ is finite dimensional, there must exist $w\in\mathbb{R}^d$ such that $\rho(v) = \langle w,v\rangle$ for all $v$. Further, $w$ is uniquely determined by $\langle w,e_i\rangle$ for $i = 1,\dots,d$, where $e_i$ indicates the $i$-th standard basis vector. Then, since $\nabla F(x)_i = \langle\nabla F(x),e_i\rangle$ is integrable on compact sets, we have
$$\langle w,e_i\rangle = \int_D\langle\nabla F(x),e_i\rangle\,dx = \int_D\nabla F(x)_i\,dx,$$
which is the definition of $\int_D\nabla F(x)\,dx$ when the integral is defined.
$k(t) = F(x + t(y-x))$ is absolutely continuous on $[0,1]$. As a result, $k'$ is integrable on $[0,1]$ and $F(y) - F(x) = k(1) - k(0) = \int_0^1 k'(t)\,dt = \int_0^1\langle\nabla F(x + t(y-x)),\,y-x\rangle\,dt$ by the Fundamental Theorem of Calculus (see e.g., Stein & Shakarchi (2009, Theorem 3.11)).
Now, we tackle the case that $F$ is not differentiable everywhere. Notice that the last statement of the Proposition is an immediate consequence of Lipschitzness, so we focus on showing the remaining parts.
By Lemma 19, we have that $F$ is differentiable almost everywhere. Further, $g_x = \mathbb{E}_u[\nabla F(x + pu)]$ exists and satisfies, for all $v\in\mathbb{R}^d$:
$$\langle g_x,v\rangle = \mathbb{E}_u[\langle\nabla F(x + pu),v\rangle],$$
and for any sequence $v_n\to0$,
$$\lim_{n\to\infty}\frac{\hat F(x + v_n) - \hat F(x) - \langle g_x,v_n\rangle}{\|v_n\|} = 0.$$
Algorithm 3 Optimistic Mirror Descent
Input: Regularizer function $\phi$, domain $V$, time horizon $T$
Set $\hat w_1 = 0$
for $t = 1,\dots,T$ do
  Generate "hint" $h_t$
  Set $w_t = \mathrm{argmin}_{x\in V}\ \langle h_t,x\rangle + \frac12\|x - \hat w_t\|^2$
  Output $w_t$ and receive loss vector $g_t$
  Set $\hat w_{t+1} = \mathrm{argmin}_{x\in V}\ \langle g_t,x\rangle + \frac12\|x - \hat w_t\|^2$
end for
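With the Euclidean regularizer $\psi(w) = \frac{1}{2\eta}\|w\|^2$ and domain $V = \{\|w\| \le D\}$, each round of Algorithm 3 reduces to two projected steps. A minimal sketch under those assumptions:

```python
import numpy as np

def project_ball(v, D):
    """Euclidean projection onto {||v|| <= D}."""
    norm = np.linalg.norm(v)
    return v if norm <= D else v * (D / norm)

def omd_round(w_hat, h, g, eta, D):
    """Play w_t using the hint h_t, then update the base point w_hat
    with the observed loss vector g_t."""
    w = project_ball(w_hat - eta * h, D)            # optimistic prediction
    w_hat_next = project_ball(w_hat - eta * g, D)   # standard update
    return w, w_hat_next
```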
Proof. By Chen et al. (2021, Lemma 15) instantiated with the squared Euclidean distance as Bregman divergence, we have
$$\langle g_t,w_t - u\rangle \le \langle g_t - h_t,\,w_t - \hat w_{t+1}\rangle + \frac12\|u - \hat w_t\|^2 - \frac12\|u - \hat w_{t+1}\|^2 - \frac12\|\hat w_{t+1} - w_t\|^2 - \frac12\|w_t - \hat w_t\|^2.$$
The result then follows from Young's inequality.
In the case that the hints $h_t$ are not present, the algorithm becomes online gradient descent (Zinkevich, 2003). In this case, assuming $\mathbb{E}[\|g_t\|^2] \le G^2$ and setting $\eta = \frac{D}{G\sqrt T}$, we obtain $\mathbb{E}[R_T(u)] \le DG\sqrt T$ for all $u$ such that $\|u\| \le D$.
Proof. The count of gradient evaluations is immediate from inspection of the algorithm, so it remains only to prove the regret bound.
First, we observe that the choices of $\Delta_t$ specified by Algorithm 2 correspond to the values of $w_t$ produced by Algorithm 3 when $\psi(w) = \frac{1}{2\eta}\|w\|^2$. This can be verified by direct calculation (recalling that $D_\psi(x,y) = \frac{\|x-y\|^2}{2\eta}$).
Therefore, by Proposition 20, we have
$$\sum_{t=1}^{T}\langle g_t,\Delta_t - u\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^{T}\frac{\eta}{2}\|g_t - h_t\|^2. \qquad (3)$$
So, our primary task is to show that $\|g_t - h_t\|$ is small. To this end, recall that $g_t = \nabla F(w_t) = \nabla F(x_{t-1} + \Delta_t/2)$.
Now, we define $h_t^{Q+1} = \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^Q\right]\right)$ (which simply continues the recursive definition of $h_t^i$ in Algorithm 2 for one more step). Then, we claim that for all $0 \le i \le Q$, $\|h_t^{i+1} - h_t^i\| \le \frac{1}{2^i}\|h_t^1 - h_t^0\|$. We establish the claim by induction on $i$. First, for $i = 0$ the claim holds by definition. Now suppose $\|h_t^i - h_t^{i-1}\| \le \frac{1}{2^{i-1}}\|h_t^1 - h_t^0\|$ for some $i$. Then, we have
$$\|h_t^{i+1} - h_t^i\| \le \left\|\nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^i\right]\right) - \nabla F\left(x_{t-1} + \frac12\Pi_{\|\Delta\|\le D}\left[\Delta'_t - \eta h_t^{i-1}\right]\right)\right\| \le \frac12\|h_t^i - h_t^{i-1}\|,$$
using $H$-smoothness of $F$, the 1-Lipschitzness of the projection, and $\eta = 1/(2H)$; this establishes the claim. Therefore,
$$\|g_t - h_t^Q\| = \|h_t^{Q+1} - h_t^Q\| \le \frac{1}{2^Q}\|h_t^1 - h_t^0\| \le \frac{2G}{2^Q},$$
where in the last inequality we used the fact that $F$ is $G$-Lipschitz. So, for $Q = \left\lceil\log_2\sqrt{NG/(HD)}\right\rceil$, we have $\|g_t - h_t^Q\| \le \frac{2G\sqrt{HD}}{\sqrt{NG}}$ for all $t$. The result now follows by substituting into equation (3).
D Proof of Theorem 17
Proof of Theorem 17. Once more, the first part of the result is immediate from the fact that $\|\Delta_n\| \le D$. So, we proceed to show the second part.
Define $\nabla_n = \int_0^1\nabla F(x_{n-1} + s\Delta_n)\,ds$. Then, we have
$$|\langle\nabla_n - g_n,\Delta_n\rangle| = \left|\left\langle\int_0^1\left[\nabla F(x_{n-1}+s\Delta_n) - \nabla F\!\left(x_{n-1}+\tfrac12\Delta_n\right)\right]ds,\ \Delta_n\right\rangle\right| \le D\left\|\int_0^1\nabla F(x_{n-1}+s\Delta_n) - \nabla F\!\left(x_{n-1}+\tfrac12\Delta_n\right)ds\right\|$$
(observing that $\int_0^1(s-1/2)\,ds = 0$)
$$= D\left\|\int_0^1\nabla F(x_{n-1}+s\Delta_n) - \nabla F\!\left(x_{n-1}+\tfrac12\Delta_n\right) - \nabla^2F\!\left(x_{n-1}+\tfrac12\Delta_n\right)\Delta_n(s-1/2)\,ds\right\|.$$
In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations, and so on. In other words, $u_n = u^{\lfloor(n-1)/T\rfloor+1}$ for $n = 1,\dots,M$. Bounding the integrand above via $J$-second-order-smoothness, we have
$$F(x_M) - F(x_0) = R_T(u^1,\dots,u^K) + \sum_{n=1}^{M}\langle\nabla_n - g_n,\Delta_n\rangle + \sum_{n=1}^{M}\langle g_n,u_n\rangle \le R_T(u^1,\dots,u^K) + \frac{MJD^3}{48} + \sum_{n=1}^{M}\langle g_n,u_n\rangle.$$
Now, set $u^k = -D\frac{\sum_{t=1}^{T}\nabla F(w_t^k)}{\left\|\sum_{t=1}^{T}\nabla F(w_t^k)\right\|}$. Then, by Theorem 21, we have that $R_T(u^k) \le \frac{HD^2}{2} + \frac{2TGD}{N}$. Therefore:
$$F(x_M) \le F(x_0) + \frac{HD^2K}{2} + 2GD + \frac{MJD^3}{48} - DT\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|.$$
Hence, we obtain
$$\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\| \le \frac{F(x_0) - F(x_M)}{MD} + \frac{HD}{2T} + \frac{2G}{M} + \frac{JD^2}{48}.$$
Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, using $D = \delta/T$, we can upper bound the right-hand side with
$$\frac{2T(F(x_0)-F^\star)}{N\delta} + \frac{H\delta}{2T^2} + \frac{4G}{N} + \frac{J\delta^2}{2T^2},$$
and substituting $T = \min\left(\left\lceil\frac{(\delta^2(H+J\delta)N)^{1/3}}{(F(x_0)-F^\star)^{1/3}}\right\rceil,\,N/2\right)$ completes the bound. Finally, the count of the number of gradient evaluations follows by a direct calculation.
Proof of Proposition 14. Let $S\subset B(x,\delta)$ with $\frac{1}{|S|}\sum_{y\in S}y = x$. By $H$-smoothness, $\|\nabla F(y) - \nabla F(x)\| \le H\|y-x\| \le H\delta$ for all $y\in S$. Therefore,
$$\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac{1}{|S|}\sum_{y\in S}\left(\nabla F(y) - \nabla F(x)\right)\right\| \ge \|\nabla F(x)\| - H\delta.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$. Thus, $\|\nabla F(x)\| \le \epsilon + H\delta + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + H\delta$.
Proposition 15. Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2F(x) - \nabla^2F(y)\|_{op} \le J\|x-y\|$ for all $x$ and $y$). Suppose also that $x$ satisfies $\|\nabla F(x)\|_\delta \le \epsilon$. Then, $\|\nabla F(x)\| \le \epsilon + \frac J2\delta^2$.
Proof. The proof is similar to that of Proposition 14. Let $S\subset B(x,\delta)$ with $x = \frac{1}{|S|}\sum_{y\in S}y$. By $J$-second-order-smoothness, for all $y\in S$, we have
$$\|\nabla F(y) - \nabla F(x) - \nabla^2F(x)(y-x)\| = \left\|\int_0^1\left(\nabla^2F(x + t(y-x)) - \nabla^2F(x)\right)(y-x)\,dt\right\| \le \int_0^1 tJ\|y-x\|^2\,dt = \frac{J\|y-x\|^2}{2} \le \frac{J\delta^2}{2}.$$
Further, since $\frac{1}{|S|}\sum_{y\in S}y = x$, we have $\frac{1}{|S|}\sum_{y\in S}\nabla^2F(x)(y-x) = 0$. Therefore, we have
$$\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\| = \left\|\nabla F(x) + \frac{1}{|S|}\sum_{y\in S}\left(\nabla F(y) - \nabla F(x) - \nabla^2F(x)(y-x)\right)\right\| \ge \|\nabla F(x)\| - \frac{J\delta^2}{2}.$$
Now, since $\|\nabla F(x)\|_\delta \le \epsilon$, for any $p > 0$ there is a set $S$ such that $\left\|\frac{1}{|S|}\sum_{y\in S}\nabla F(y)\right\| \le \epsilon + p$. Thus, $\|\nabla F(x)\| \le \epsilon + \frac J2\delta^2 + p$ for any $p > 0$, which implies $\|\nabla F(x)\| \le \epsilon + \frac J2\delta^2$.
F Lower Bounds
Our lower bounds are constructed via a mild alteration to the arguments of Arjevani et al. (2019) for lower bounds on finding $(0,\epsilon)$-stationary points of smooth functions with a stochastic gradient oracle. At a high level, we show that since a $(\delta,\epsilon)$-stationary point of an $H$-smooth loss is also a $(0,H\delta+\epsilon)$-stationary point, a lower bound on the complexity of the latter implies a lower bound on the complexity of the former. The lower bound of Arjevani et al. (2019) is proved by constructing a distribution over "hard" functions such that no algorithm can quickly find a $(0,\epsilon)$-stationary point of a randomly selected function. Unfortunately, these "hard" functions are not Lipschitz. Fortunately, they take the form $F(x) + \eta\|x\|^2$ where $F$ is Lipschitz and smooth, so that the "non-Lipschitz" part is solely contained in the quadratic term. We show that one can replace the quadratic term $\|x\|^2$ with a Lipschitz function that is quadratic for sufficiently small $x$ but proportional to $\|x\|$ for larger values. Our proof consists of carefully reproducing the argument of Arjevani et al. (2019) to show that this modification does not cause any problems. We emphasize that almost all of this development can be found with more detail in Arjevani et al. (2019). We merely restate here the minimum results required to verify our modification to their construction.
A first-order algorithm is specified by a random seed $s$ and a sequence of measurable mappings producing iterates
$$x_1 = A_1(s), \qquad x_i = A_i(s,\,\textsc{Grad}(x_1,z_1),\,\textsc{Grad}(x_2,z_2),\,\dots,\,\textsc{Grad}(x_{i-1},z_{i-1})),$$
so that $x_i$ is a function of $s$ and $z_1,\dots,z_{i-1}$. We define $\mathcal{A}_{rand}$ to be the set of such sequences of mappings.
Now, in the notation of Arjevani et al. (2019), we define the "progress function" $\mathrm{prog}_\gamma(x) \triangleq \max\{i : |x_i| > \gamma\}$ (with $\mathrm{prog}_\gamma(x) = 0$ if no such $i$ exists). Further, a stochastic gradient oracle $\textsc{Grad}$ is called a probability-$p$ zero-chain if $\mathrm{prog}_0(\textsc{Grad}(x,z)) \le \mathrm{prog}_{1/4}(x)$ for all $x$ with probability at least $1-p$, and $\mathrm{prog}_0(\textsc{Grad}(x,z)) \le \mathrm{prog}_{1/4}(x) + 1$ with probability 1.
Next, let $F_T:\mathbb{R}^T\to\mathbb{R}$ be the function defined by Lemma 2 of Arjevani et al. (2019). Restating their Lemma, this function satisfies:
Lemma 22 (Lemma 2 of Arjevani et al. (2019)). There exists a function $F_T:\mathbb{R}^T\to\mathbb{R}$ that satisfies:
1. $F_T(0) = 0$ and $\inf F_T(x) \ge -\gamma_0T$ for $\gamma_0 = 12$.
2. $\nabla F_T(x)$ is $H_0$-Lipschitz, with $H_0 = 152$.
3. For all $x$, $\|\nabla F_T(x)\|_\infty \le G_0$ with $G_0 = 23$.
4. For all $x$, $\mathrm{prog}_0(\nabla F_T(x)) \le \mathrm{prog}_{1/2}(\nabla F_T(x)) + 1$.
5. If $\mathrm{prog}_1(x) < T$, then $\|\nabla F_T(x)\| \ge |\nabla F_T(x)_{\mathrm{prog}_1(x)+1}| \ge 1$.
We associate with this function $F_T$ the stochastic gradient oracle $\textsc{Grad}_T:\mathbb{R}^T\times\{0,1\}\to\mathbb{R}^T$, where $z$ is Bernoulli($p$):
$$\textsc{Grad}_T(x,z)_i = \begin{cases}\nabla F_T(x)_i, & \text{if } i \ne \mathrm{prog}_{1/4}(x)\\ \frac{z\,\nabla F_T(x)_i}{p}, & \text{if } i = \mathrm{prog}_{1/4}(x)\end{cases}$$
Next, for any matrix $U\in\mathbb{R}^{d\times T}$ with orthonormal columns, we define $F_{T,U}:\mathbb{R}^d\to\mathbb{R}$ by:
$$F_{T,U}(x) = F_T(U^\top x).$$
The associated stochastic gradient oracle is:
$$\textsc{Grad}_{T,U}(x,z) = U\,\textsc{Grad}_T(U^\top x,\,z).$$
Now, we restate Lemma 5 of Arjevani et al. (2019):
Lemma 24 (Lemma 5 of Arjevani et al. (2019)). Let $R > 0$ and suppose $A\in\mathcal{A}_{rand}$ is such that $A$ produces iterates $x_t$ with $\|x_t\| \le R$. Let $d \ge \left\lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\right\rceil$. Suppose $U$ is chosen uniformly at random from the set of $d\times T$ matrices with orthonormal columns. Let $\textsc{Grad}$ be a probability-$p$ zero-chain and let $\textsc{Grad}_U(x,z) = U\,\textsc{Grad}(U^\top x,z)$. Let $x_1,x_2,\dots$ be the iterates of $A$ when provided the stochastic gradient oracle $\textsc{Grad}_U$. Then with probability at least $1-c$ (over the randomness of $U$, the oracle, and also the seed $s$ of $A$):
$$\mathrm{prog}_{1/4}(U^\top x_t) < T \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
F.2 Defining the "Hard" Instance
Now, we for the first time diverge from the construction of Arjevani et al. (2019) (albeit only slightly). Their construction uses a "shrinking function" $\rho_{R,d}:\mathbb{R}^d\to\mathbb{R}^d$ given by $\rho_{R,d}(x) = \frac{x}{\sqrt{1+\|x\|^2/R^2}}$, as well as an additional quadratic term to overcome the limitation of bounded iterates. We cannot tolerate the non-Lipschitz quadratic term, so we replace it with a Lipschitz version $q_{B,d}(x) = x^\top\rho_{B,d}(x)$. Intuitively, $q_{B,d}$ behaves like $\|x\|^2$ for small enough $x$, but behaves like $\|x\|$ for large $\|x\|$. The pre-processed function is defined by:
$$\hat F_{T,U}(x) = F_{T,U}(\rho_{R,d}(x)) + \eta\,q_{B,d}(x).$$
Note that since $q_{B,d}$ and $\rho_{R,d}$ are rotationally symmetric, and $U$ has orthonormal columns, we have $U^\top\rho_{R,d}(x) = \rho_{R,T}(U^\top x)$ and $q_{B,T}(U^\top x) = q_{B,d}(x)$. Thus, we have $q_{B,T}(\rho_{R,T}^{-1}(U^\top\rho_{R,d}(x))) = q_{B,d}(x)$. The stochastic gradient oracle associated with $\hat F_{T,U}(x)$ is
$$\widehat{\textsc{Grad}}_{T,U}(x,z) = J[\rho_{R,d}](x)^\top\,\textsc{Grad}_{T,U}(\rho_{R,d}(x),\,z) + \eta\,\nabla q_{B,d}(x),$$
where $J[f](x)$ indicates the Jacobian of the function $f$ evaluated at $x$.
A description of the relevant properties of $q_B$ is provided in Section F.3 (Proposition 29).
Next we produce a variant on Lemma 6 from Arjevani et al. (2019). This is the most delicate part of our alteration, although the proof is still almost identical to that of Arjevani et al. (2019).
Lemma 25 (variant on Lemma 6 of Arjevani et al. (2019)). Let $R = B = 10G_0\sqrt T$. Let $\eta = 1/10$, $c\in(0,1)$, $p\in(0,1)$, and $T\in\mathbb{N}$. Set $d = \left\lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\right\rceil$ and let $U$ be sampled uniformly from the set of $d\times T$ matrices with orthonormal columns. Define $\hat F_{T,U}$ and $\widehat{\textsc{Grad}}_{T,U}$ as above. Suppose $A\in\mathcal{A}_{rand}$ and let $x_1,x_2,\dots$ be the iterates of $A$ when provided with $\widehat{\textsc{Grad}}_{T,U}$ as input. Then with probability at least $1-c$:
$$\|\nabla\hat F_{T,U}(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Proof. First, observe that $\textsc{Grad}_T$ is a probability-$p$ zero-chain. Next, since $q_B$ is rotationally symmetric and $\rho_R(x) \propto x$, we see that $\nabla q_B(x) \propto x$, so that $\mathrm{prog}_0\left(J[\rho_{R,T}^{-1}](x)^\top\nabla q_{B,T}(\rho_{R,T}^{-1}(x))\right) = \mathrm{prog}_0(x)$. Thus, the modified oracle remains a probability-$p$ zero-chain.
Following Arjevani et al. (2019), we now define $y_i = \rho_R(x_i)$. We can view $y_1,\dots,y_T$ as the iterates of some different algorithm $A_y\in\mathcal{A}_{rand}$ applied to the oracle $\widetilde{\textsc{Grad}}_{T,U}$ (since they are computed by measurable functions of the oracle outputs and previous iterates). However, since $\|\rho_R(x)\| \le R$, we have $\|y_i\| \le R$ for all $i$. Thus, since $\widetilde{\textsc{Grad}}_T$ is a probability-$p$ zero-chain, by Lemma 24 we have that so long as $d \ge \left\lceil 18\frac{R^2T}{p}\log\frac{2T^2}{pc}\right\rceil$, then with probability at least $1-c$, for all $i \le (T - \log(2/c))/2p$,
$$\mathrm{prog}_{1/4}(U^\top y_i) < T.$$
Now, our goal is to show that $\|\nabla\hat F_{T,U}(x_i)\| \ge 1/2$. We consider two cases: either $\|x_i\| > R/2$ or not.
First, observe that for $x_i$ with $\|x_i\| > R/2$, we have:
$$\|\nabla\hat F_{T,U}(x_i)\| \ge \eta\|\nabla q_B(x_i)\| - \|J[\rho_R](x_i)\|_{op}\|\nabla F_{T,U}(y_i)\| \ge \eta\frac{\|x_i\|}{\sqrt{1+\|x_i\|^2/B^2}} - G_0\sqrt T \ge 3\eta B - G_0\sqrt T.$$
Recalling $B = R = 10G_0\sqrt T$ and $\eta = 1/10$, this equals $2G_0\sqrt T$, and recalling $G_0 = 23$, this is at least $1/2$.
Now, suppose $\|x_i\| \le R/2$. Then, let us set $j = \mathrm{prog}_1(U^\top y_i) + 1 \le T$ (the inequality follows since $\mathrm{prog}_1 \le \mathrm{prog}_{1/4}$). Then, if $u_j$ indicates the $j$-th column of $U$, Lemma 22 implies:
$$|\langle u_j,y_i\rangle| < 1, \qquad |\langle u_j,\nabla F_{T,U}(y_i)\rangle| \ge 1.$$
Next, by direct calculation we have $J[\rho_R](x_i) = \frac{I - \rho_R(x_i)\rho_R(x_i)^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}} = \frac{I - y_iy_i^\top/R^2}{\sqrt{1+\|x_i\|^2/R^2}}$, so that:
$$\langle u_j,\nabla\hat F_{T,U}(x_i)\rangle = \langle u_j,\,J[\rho_R](x_i)^\top\nabla F_{T,U}(y_i)\rangle + \eta\langle u_j,\nabla q_B(x_i)\rangle = \frac{\langle u_j,\nabla F_{T,U}(y_i)\rangle}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\langle u_j,y_i\rangle\langle y_i,\nabla F_{T,U}(y_i)\rangle/R^2}{\sqrt{1+\|x_i\|^2/R^2}} + \eta\langle u_j,\nabla q_B(x_i)\rangle.$$
Now, by Proposition 29, we have $\nabla q_B(x_i) = \left(2 - \frac{\|y_i\|^2}{B^2}\right)y_i$. So (with $R = B$):
Observing that $\|y_i\| \le \|x_i\| \le R/2$ and $|\langle u_j,y_i\rangle| < 1$:
$$|\langle u_j,\nabla\hat F_{T,U}(x_i)\rangle| \ge \frac{1}{\sqrt{1+\|x_i\|^2/R^2}} - \frac{\|\nabla F_{T,U}(y_i)\|}{2R} - 2\eta.$$
Using $|\langle u_j,\nabla F_{T,U}(y_i)\rangle| \ge 1$ and $\|\nabla F_{T,U}(y_i)\| \le G_0\sqrt T$:
$$\ge \frac{2}{\sqrt5} - \frac{G_0\sqrt T}{2R} - 2\eta.$$
With $R = 10G_0\sqrt T$ and $\eta = 1/10$:
$$= \frac{2}{\sqrt5} - \frac{1}{20} - \frac15 > 1/2.$$
Lemma 26. The function $\hat F_{T,U}$ and oracle $\widehat{\textsc{Grad}}_{T,U}$ satisfy:
1. $\hat F_{T,U}(0) - \inf\hat F_{T,U}(x) \le \gamma_0T$.
2. $\hat F_{T,U}$ is $(G_0\sqrt T + 3\eta B)$-Lipschitz.
3. $\nabla\hat F_{T,U}$ is $(H_0 + 3 + 8\eta)$-Lipschitz.
4. $\|\widehat{\textsc{Grad}}_{T,U}\| \le \frac{G_0}{p} + G_0\sqrt T$.
5. $\widehat{\textsc{Grad}}_{T,U}$ has variance at most $\frac{G_0^2}{p} \le \frac{23^2}{p}$.
Proof. 1. This property follows immediately from the fact that $F_T(0) - \inf F_T(x) \le \gamma_0T$.
2. Since $\rho_R$ is 1-Lipschitz for all $R$ and $q_B$ is $3B$-Lipschitz (see Proposition 29), $\hat F_{T,U}(x)$ is $(G_0\sqrt T + 3\eta B)$-Lipschitz.
3. By assumption, $R \ge \max(H_0, 1)$. Thus, by Arjevani et al. (2019, Lemma 16), $\nabla F_T(\rho_R(x))$ is $(H_0+3)$-Lipschitz, and so $\nabla\hat F_{T,U}$ is $(H_0+3+8\eta)$-Lipschitz by Proposition 29.
4. Since $\|\textsc{Grad}_T\| \le \frac{G_0}{p} + G_0\sqrt T$, and $J[\rho_R](x)^\top U$ has operator norm at most 1, the bound follows.
5. Just as in the previous part, since $\textsc{Grad}_T$ has variance $G_0^2/p$ and $J[\rho_R](x)^\top U$ has operator norm at most 1, the bound follows.
Theorem 27. For any $H,\epsilon,\gamma$, and $\sigma$, there is a distribution over $C^\infty$ functions $F$ and stochastic gradient oracles $\textsc{Grad}$ such that $F$ is $H$-smooth and $3\sqrt{H\gamma}$-Lipschitz, $F(0) - \inf F(x) \le \gamma$, $\textsc{Grad}$ has variance at most $\sigma^2$, and any $A\in\mathcal{A}_{rand}$ requires $\Omega(H\gamma\sigma^2/\epsilon^4)$ oracle queries to output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le \epsilon$.
Proof. From Lemma 26 and Lemma 25, we have a distribution over functions $F$ and first-order oracles such that with probability 1, $F$ is $(G_0\sqrt T + 3G_0\sqrt T) \le 92\sqrt T$-Lipschitz, $F$ is $(H_0 + 3 + 4\eta) \le 156$-smooth, $F(0) - \inf F(x) \le \gamma_0T = 12T$, $\textsc{Grad}$ has variance at most $G_0^2/p = 23^2/p$, and with probability at least $1-c$,
$$\|\nabla F(x_t)\| \ge 1/2 \quad\text{for all } t \le \frac{T - \log(2/c)}{2p}.$$
Now, set $\lambda = \frac{156}{H}\cdot2\epsilon$, $T = \left\lfloor\frac{H\gamma}{12\cdot156\cdot(2\epsilon)^2}\right\rfloor$ and $p = \min\left((2\cdot23\cdot\epsilon)^2/\sigma^2,\,1\right)$. Then, define
$$F_\lambda(x) = \frac{H\lambda^2}{156}F(x/\lambda).$$
Then $F_\lambda$ is $\frac{H\lambda^2}{156}\cdot\frac{1}{\lambda^2}\cdot156 = H$-smooth, $F_\lambda(0) - \inf_x F_\lambda(x) \le 12T\cdot\frac{H\lambda^2}{156} \le \gamma$, and $F_\lambda$ is $92\sqrt T\cdot\frac{H\lambda}{156} \le 3\sqrt{H\gamma}$-Lipschitz. We can construct an oracle $\textsc{Grad}_\lambda$ from $\textsc{Grad}$ by:
$$\textsc{Grad}_\lambda(x,z) = \frac{H\lambda}{156}\textsc{Grad}(x/\lambda,\,z),$$
so that
$$\mathbb{E}[\|\textsc{Grad}_\lambda(x,z) - \nabla F_\lambda(x)\|^2] \le \frac{H^2\lambda^2}{156^2}\cdot\frac{23^2}{p} = \sigma^2.$$
Further, since an oracle for $F_\lambda$ can be constructed from the oracle for $F$, if we run $A$ on $F_\lambda$, with probability at least $1-c$,
$$\mathbb{E}[\|\nabla F_\lambda(x_t)\|] \ge \epsilon \quad\text{for all } t \le K\frac{H\gamma\sigma^2}{\epsilon^4}$$
for a universal constant $K$.
From this result, we have our main lower bound (the formal version of Theorem 18):
Theorem 28. For any $\delta,\epsilon,\gamma$ and $G \ge \frac{3\sqrt2\sqrt{\epsilon\gamma}}{\sqrt\delta}$, there is a distribution over $G$-Lipschitz $C^\infty$ functions $F$ with $F(0) - \inf F(x) \le \gamma$ and stochastic gradient oracles $\textsc{Grad}$ with $\mathbb{E}[\|\textsc{Grad}(x,z)\|^2] \le G^2$ such that for any algorithm $A\in\mathcal{A}_{rand}$, if $A$ is provided as input a randomly selected oracle $\textsc{Grad}$, $A$ will require $\Omega(G^2\gamma/(\delta\epsilon^3))$ iterations to identify a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$.
Proof. From Theorem 27, for any $H,\epsilon',\gamma$, and $\sigma$, we have a distribution over $C^\infty$ functions $F$ and oracles $\textsc{Grad}$ such that $F$ is $H$-smooth and $3\sqrt{H\gamma}$-Lipschitz, $F(0) - \inf F(x) \le \gamma$, and $\textsc{Grad}$ has variance $\sigma^2$, such that $A$ requires $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations to output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le \epsilon'$. Set $\sigma = G/\sqrt2$, $H = \epsilon/\delta$, and $\epsilon' = 2\epsilon$. Then, we see that $\textsc{Grad}$ has variance $G^2/2$, and $F$ is $3\sqrt{H\gamma} = \frac{3\sqrt{\epsilon\gamma}}{\sqrt\delta} \le G/\sqrt2$-Lipschitz, so that $\mathbb{E}[\|\textsc{Grad}(x,z)\|^2] \le G^2$. Further, by Proposition 14, if $\|\nabla F(x)\|_\delta \le \epsilon$, then $\|\nabla F(x)\| \le \epsilon + H\delta = 2\epsilon$. Therefore, since $A$ cannot output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le \epsilon' = 2\epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations, we see that $A$ also cannot output a point $x$ with $\mathbb{E}[\|\nabla F(x)\|_\delta] \le \epsilon$ in fewer than $\Omega(H\gamma\sigma^2/\epsilon'^4) = \Omega(\gamma G^2/(\epsilon^3\delta))$ iterations.
F.3 Definition and Properties of $q_B$
Consider the function $q_{B,d}:\mathbb{R}^d\to\mathbb{R}$ defined by
$$q_{B,d}(x) = \frac{\|x\|^2}{\sqrt{1+\|x\|^2/B^2}} = x^\top\rho_{B,d}(x).$$
This function has the following properties, all of which follow from direct calculation:
Proposition 29.
1. $$\nabla q_{B,d}(x) = \frac{2x}{\sqrt{1+\|x\|^2/B^2}} - \frac{x\|x\|^2}{B^2(1+\|x\|^2/B^2)^{3/2}} = \left(2 - \frac{\|\rho_B(x)\|^2}{B^2}\right)\rho_B(x).$$
2. $$\nabla^2q_{B,d}(x) = \frac{1}{\sqrt{1+\|x\|^2/B^2}}\left(2I - \frac{3xx^\top}{B^2(1+\|x\|^2/B^2)} - \frac{\|x\|^2I}{B^2(1+\|x\|^2/B^2)} + \frac{2\|x\|^2xx^\top}{B^4(1+\|x\|^2/B^2)^2}\right).$$
3. $$\frac{\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le \|\nabla q_{B,d}(x)\| \le \frac{3\|x\|}{\sqrt{1+\|x\|^2/B^2}} \le 3B.$$
4. $$\|\nabla^2q_{B,d}(x)\|_{op} \le \frac{8}{\sqrt{1+\|x\|^2/B^2}} \le 8.$$
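These identities are easy to spot-check numerically. A minimal sketch verifying property 1 with central finite differences:

```python
import numpy as np

def q_B(x, B):
    """q_{B,d}(x) = ||x||^2 / sqrt(1 + ||x||^2 / B^2)."""
    r2 = float(np.dot(x, x))
    return r2 / np.sqrt(1.0 + r2 / B**2)

def grad_q_B(x, B):
    """Closed form from property 1: (2 - ||rho_B(x)||^2 / B^2) * rho_B(x)."""
    r2 = float(np.dot(x, x))
    rho = x / np.sqrt(1.0 + r2 / B**2)
    return (2.0 - float(np.dot(rho, rho)) / B**2) * rho

rng = np.random.default_rng(0)
x, B, eps = rng.normal(size=5), 3.0, 1e-6
numeric = np.array([(q_B(x + eps * e, B) - q_B(x - eps * e, B)) / (2 * eps)
                    for e in np.eye(5)])
assert np.allclose(numeric, grad_q_B(x, B), atol=1e-5)
```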
G Proof of Theorem 13
First, we state and prove a theorem analogous to Theorem 8, under the corresponding coordinate-wise assumptions: with the notation in Algorithm 1 and $s_n$ uniform on $[0,1]$, suppose $\mathrm{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1,\dots,d$, and define $u^k$ coordinate-wise with magnitude $D_\infty$ (as in the proof below). Then we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|_1\right] \le \frac{F(x_0)-F^\star}{D_\infty M} + \frac{\mathbb{E}[R_T(u^1,\dots,u^K)]}{D_\infty M} + \frac{\sum_{i=1}^{d}\sigma_i}{\sqrt T}.$$
Proof. In Theorem 7, set $u_n$ to be equal to $u^1$ for the first $T$ iterations, $u^2$ for the second $T$ iterations, and so on. In other words, $u_n = u^{\lfloor(n-1)/T\rfloor+1}$ for $n = 1,\dots,M$.
From Theorem 7, we have
$$\mathbb{E}[F(x_M)] = F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,u_n\rangle\right].$$
Now, since $u^k_i = -D_\infty\frac{\sum_{t=1}^{T}\frac{\partial F(w_t^k)}{\partial x_i}}{\left|\sum_{t=1}^{T}\frac{\partial F(w_t^k)}{\partial x_i}\right|}$, $\mathbb{E}[g_n] = \nabla F(w_n)$, and $\mathrm{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1,\dots,d$, we have
$$\mathbb{E}\left[\sum_{n=1}^{M}\langle g_n,u_n\rangle\right] \le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle\sum_{t=1}^{T}\nabla F(w_t^k),\,u^k\right\rangle\right] + \mathbb{E}\left[\sum_{k=1}^{K}D_\infty\left\|\sum_{t=1}^{T}\left(\nabla F(w_t^k) - g_{(k-1)T+t}\right)\right\|_1\right]$$
$$\le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle\sum_{t=1}^{T}\nabla F(w_t^k),\,u^k\right\rangle\right] + D_\infty K\sqrt T\sum_{i=1}^{d}\sigma_i$$
$$= \mathbb{E}\left[-D_\infty T\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|_1\right] + D_\infty K\sqrt T\sum_{i=1}^{d}\sigma_i.$$
Combining with $\mathbb{E}[F(x_M)] \ge F^\star$ and rearranging gives the stated bound.
Proof of Theorem 13. Since $\mathcal{A}$ guarantees $\|\Delta_n\|_\infty \le D_\infty$, for all $n < n' \le T + n - 1$ we have $\|x_{n'} - x_n\|_\infty \le (n'-n)D_\infty \le TD_\infty = \delta$, which establishes the first claim. Then, using the theorem above together with the assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le D_\infty K\sqrt T\sum_{i=1}^{d}G_i$, we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|_1\right] \le 2\frac{F(x_0)-F^\star}{D_\infty N} + 2\frac{KD_\infty\sqrt T\sum_{i=1}^{d}G_i}{D_\infty N} + \frac{\sum_{i=1}^{d}G_i}{\sqrt T} = \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3\sum_{i=1}^{d}G_i}{\sqrt T}$$
$$\le \max\left(\frac{5\left(\sum_{i=1}^{d}G_i\right)^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6\sum_{i=1}^{d}G_i}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last step follows from the choice of $T$.
H Directional Derivative Oracles
Following Zhang et al. (2020b), we say $F$ is Hadamard directionally differentiable at $x$ in the direction $v$ if, for any curve $\psi$ with $\psi(0) = x$ and $\psi'(0) = v$, the following limit exists:
$$\lim_{t\to0}\frac{F(\psi(t)) - F(x)}{t}.$$
If $F$ is Hadamard directionally differentiable, then the above limit is denoted $F'(x,v)$. When $F$ is Hadamard directionally differentiable for all $x$ and $v$, then we say simply that $F$ is directionally differentiable.
With these definitions, a stochastic directional oracle for a Lipschitz, directionally differentiable, and bounded-from-below function $F$ is an oracle $\textsc{Grad}(x,v,z)$ that outputs $g\in\partial F(x)$ such that $\mathbb{E}[\langle g,v\rangle] = F'(x,v)$. In this case, Zhang et al. (2020b) shows (Lemma 3) that $F$ satisfies an alternative notion of well-behavedness:
$$F(y) - F(x) = \int_0^1\langle\mathbb{E}[\textsc{Grad}(x + t(y-x),\,y-x,\,z)],\,y-x\rangle\,dt. \qquad (5)$$
Next, we define:
Definition 31. A point $x$ is a $(\delta,\epsilon)$-stationary point of $F$ for the generalized gradient if there is a set of points $S$ contained in the ball of radius $\delta$ centered at $x$ such that for $y$ selected uniformly at random from $S$, $\mathbb{E}[y] = x$ and for all $y$ there is a choice of $g_y\in\partial F(y)$ such that $\|\mathbb{E}[g_y]\| \le \epsilon$.
Similarly, we have the definition:
Definition 32. Given a point $x$ and a number $\delta > 0$, define:
$$\|\partial F(x)\|_\delta \triangleq \inf_{S\subset B(x,\delta),\ \frac{1}{|S|}\sum_{y\in S}y = x,\ g_y\in\partial F(y)}\left\|\frac{1}{|S|}\sum_{y\in S}g_y\right\|.$$
Proof. We have
$$F(x_n) - F(x_{n-1}) = \int_0^1\langle\mathbb{E}[\textsc{Grad}(x_{n-1} + s(x_n - x_{n-1}),\,x_n - x_{n-1},\,z_n)],\,x_n - x_{n-1}\rangle\,ds = \mathbb{E}[\langle g_n,\Delta_n\rangle] = \mathbb{E}[\langle g_n,\Delta_n - u_n\rangle + \langle g_n,u_n\rangle],$$
where in the second equality we have used the definition $g_n = \textsc{Grad}(x_{n-1} + s_n(x_n - x_{n-1}),\,x_n - x_{n-1},\,z_n)$, the assumption that $s_n$ is uniform on $[0,1]$, and Fubini's theorem (as $\textsc{Grad}$ is bounded by Lipschitzness of $F$). Now, sum over $n$ and telescope to obtain the stated bound.

Algorithm 4 Online-to-Non-Convex Conversion (directional derivative oracle version)
Input: Initial point $x_0$, $K\in\mathbb{N}$, $T\in\mathbb{N}$, online learning algorithm $\mathcal{A}$, $s_n$ for all $n$
Set $M = K\cdot T$
for $n = 1,\dots,M$ do
  Get $\Delta_n$ from $\mathcal{A}$
  Set $x_n = x_{n-1} + \Delta_n$
  Set $w_n = x_{n-1} + s_n\Delta_n$
  Sample random $z_n$
  Generate directional derivative $g_n = \textsc{Grad}(w_n, \Delta_n, z_n)$
  Send $g_n$ to $\mathcal{A}$ as gradient
end for
Set $w_t^k = w_{(k-1)T+t}$ for $k = 1,\dots,K$ and $t = 1,\dots,T$
Set $\bar w^k = \frac{1}{T}\sum_{t=1}^{T}w_t^k$ for $k = 1,\dots,K$
Return $\{\bar w^1,\dots,\bar w^K\}$
Putting this all together, we have
$$F^\star \le \mathbb{E}[F(x_M)] \le F(x_0) + \mathbb{E}[R_T(u^1,\dots,u^K)] + \sigma DK\sqrt T - DT\,\mathbb{E}\left[\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(w_t^k)\right\|\right].$$
Finally, we instantiate Theorem 34 with online gradient descent to obtain the analog of Corollary 9. This result establishes that the online-to-batch conversion finds a $(\delta,\epsilon)$ critical point in $O(1/(\epsilon^3\delta))$ iterations, even when using a directional derivative oracle. Further, our lower bound construction makes use of continuously differentiable functions, for which the directional derivative oracle and the standard gradient oracle must coincide. Thus the $O(1/(\epsilon^3\delta))$ complexity is optimal in this setting as well.
Corollary 35. Suppose we have a budget of $N$ gradient evaluations. Under the assumptions and notation of Theorem 34, suppose in addition $\mathbb{E}[\|g_n\|^2] \le G^2$ and that $\mathcal{A}$ guarantees $\|\Delta_n\| \le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$ for all $\|u^k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(x_0)-F^\star}\right)^{2/3}\right\rceil,\,\frac N2\right)$, and $K = \lfloor N/T\rfloor$. Then, for all $k$ and $t$, $\|\bar w^k - w_t^k\| \le \delta$.
Moreover, we have the inequality
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla_t^k\right\|\right] \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6G}{\sqrt N}\right),$$
which implies
$$\frac{1}{K}\sum_{k=1}^{K}\|\partial F(\bar w^k)\|_\delta \le \frac{2(F(x_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6G}{\sqrt N}\right).$$
Proof. The first claim holds exactly as in Corollary 9. Then, using Theorem 34 together with the assumption $\mathbb{E}[R_T(u^1,\dots,u^K)] \le DGK\sqrt T$, we have:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla_t^k\right\|\right] \le 2\frac{F(x_0)-F^\star}{DN} + 2\frac{KDG\sqrt T}{DN} + \frac{G}{\sqrt T} \le \frac{2T(F(x_0)-F^\star)}{\delta N} + \frac{3G}{\sqrt T}$$
$$\le \max\left(\frac{5G^{2/3}(F(x_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\,\frac{6G}{\sqrt N}\right) + \frac{2(F(x_0)-F^\star)}{\delta N},$$
where the last step follows from the choice of $T$. This is the stated bound.