Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization
on $f$, even when the goal is only to compute locally optimal solutions. Indeed, Daskalakis et al. (2021) establish intractability results in the constrained setting of the problem, wherein first-order locally optimal solutions are guaranteed to exist whenever the objective is smooth. Moreover, they show that even the computation of approximate solutions is PPAD-complete and, if the objective function is accessible through value-queries and gradient-queries, exponentially many such queries are necessary (in particular, exponential in at least one of the following: the inverse approximation parameter, the smoothness constant of $f$, or the diameter of the constraint set).

We expect similar intractability results to hold in the unconstrained case, which is the case considered in this paper, even when restricting to smooth objectives that have a non-empty set of optimal solutions.¹ Indeed, fixed-point complexity-based intractability results for the constrained case are typically extendable to the unconstrained case, by embedding the hard instances within an unbounded domain.

Relatedly, we already know that the unconstrained Stampacchia variational inequality (SVI) problem for Lipschitz continuous operators $F: \mathbb{R}^d \to \mathbb{R}^d$ (a problem which includes the unconstrained case of (1.1) by setting $F([x; y]) = [\nabla_x f(x,y);\, -\nabla_y f(x,y)]$) is computationally intractable, even when restricting to operators that have a non-empty set of SVI solutions.² This is because: (i) $F$ is Lipschitz-continuous if and only if the operator $T(u) = u - F(u)$ is Lipschitz-continuous; (ii) for $\epsilon \ge 0$, points $\bar u \in \mathbb{R}^d$ such that $\|F(\bar u)\|_2 \le \epsilon$ satisfy $\|T(\bar u) - \bar u\|_2 \le \epsilon$, i.e. they are $\epsilon$-approximate fixed points of $T$, and vice versa; and (iii) it is known that finding approximate fixed points of Lipschitz operators over $\mathbb{R}^d$ is PPAD-hard, even when the operators are guaranteed to have fixed points (Papadimitriou, 1994). Moreover, if we restrict attention to algorithms that only make value queries to $T$ (i.e. $F$, which corresponds to the type of access that all first-order algorithms have), the query complexity becomes exponential in the dimension (Hirsch et al., 1989). Finally, by the equivalence of norms, these results extend to arbitrary $\ell_p$-normed finite-dimensional real vector spaces. Of course, for these intractability results for SVI to apply to the nonconvex-nonconcave min-max problem (1.1), one would need to prove that these complexity results extend to operators $F$ constructed from a smooth function $f$ by setting $F([x; y]) = [\nabla_x f(x,y);\, -\nabla_y f(x,y)]$.

¹Note that these are stationary points of $f$ in this case.
²We formally define the Stampacchia variational inequality problem, (svi), in Section 2. We also define the harder Minty variational inequality problem, (mvi), in the same section.

Our contributions. Given the aforedescribed intractability results, our goal is to identify structural properties that make it possible to solve min-max optimization problems with smooth objectives. Focusing on the unconstrained setting of (1.1), we view it as a special case (obtained by considering the operator $F([x; y]) = [\nabla_x f(x,y);\, -\nabla_y f(x,y)]$) of the unconstrained variational inequality problem (svi), and consider instead this more general problem. We identify conditions for $F$ under which a generalized version of the extragradient method of Korpelevich (1976), which we propose, converges to a solution of (svi) (or, in the special case of (1.1), to a stationary point of $f$) at a rate of $1/\sqrt{k}$ in the number of iterations $k$. Our condition, presented as Assumption 1, postulates that there exists a solution to (svi) that only violates the stronger (mvi) requirement in a controlled manner that we delineate. Our generalized extragradient method is based on an aggressive interpolation step, as specified by (eg+), and our main convergence result is Theorem 3.2. We additionally show, in Theorems 4.1 and 4.4, that the algorithm converges in non-Euclidean settings, under the stronger condition that an (mvi) solution exists, or when we only have stochastic oracle access to $F$ (or, in the special case of (1.1), to the gradient of $f$).

The condition on $F$ under which our main result applies is weaker than the assumption that a solution to (mvi) exists (Zhou et al., 2017; Mertikopoulos et al., 2019; Malitsky, 2019; Song et al., 2020), an assumption which is already satisfied by several interesting families of min-max objectives, including quasiconvex-concave families or starconvex-concave families. Our significantly weaker condition applies in particular to (min-max objectives $f$ with corresponding) operators $F$ that are negatively comonotone (Bauschke et al., 2020) or positively cohypomonotone (Combettes and Pennanen, 2004). These conditions have been studied in the literature for at least a couple of decades, but only asymptotic convergence results were available prior to our work for computing solutions to (svi). In contrast, our rates are asymptotically identical to the rates that we would get under the stronger assumption that a solution to (mvi) exists, and sidestep the intractability results for (1.1) suggested by Daskalakis et al. (2021) for general smooth objectives.

1.1 Further Related Work

A large number of recent works target identifying practical first-order, low-order, or efficient online learning methods for solving min-max optimization problems in a variety of settings, ranging from the well-behaved setting of convex-concave objectives to the challenging setting of nonconvex-nonconcave objectives.
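To make the reduction described under "Our contributions" concrete, the following sketch (in Python with NumPy) assembles the operator $F([x; y]) = [\nabla_x f(x,y);\, -\nabla_y f(x,y)]$ from a smooth $f$ and checks that it vanishes at a stationary point. The specific quadratic-bilinear objective and its analytic gradients are illustrative assumptions, not an example taken from the paper.

```python
import numpy as np

# Toy smooth objective f(x, y) = x^T A y + (lam/2)(||x||^2 - ||y||^2),
# used only to illustrate how the VI operator F is assembled from f.
# (This specific f is an illustrative choice, not an example from the paper.)
rng = np.random.default_rng(0)
d1, d2 = 3, 2
A = rng.standard_normal((d1, d2))
lam = 0.1

def grad_x(x, y):
    return A @ y + lam * x          # nabla_x f(x, y)

def grad_y(x, y):
    return A.T @ x - lam * y        # nabla_y f(x, y)

def F(u):
    """VI operator F([x; y]) = [nabla_x f(x, y); -nabla_y f(x, y)]."""
    x, y = u[:d1], u[d1:]
    return np.concatenate([grad_x(x, y), -grad_y(x, y)])

u = rng.standard_normal(d1 + d2)
print("||F(u)||_2 =", np.linalg.norm(F(u)))
# At a stationary point of f, F vanishes; for this toy f that point is u* = 0.
print("||F(0)||_2 =", np.linalg.norm(F(np.zeros(d1 + d2))))
```

Any smooth $f$ with computable partial gradients can be plugged into the same template; the min-max structure enters only through the sign flip on $\nabla_y f$.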
Table 1: Comparison of iteration complexities required to find a point $x$ with $\|F(x)\|_{p^*} \le \epsilon$ using deterministic algorithms, where $\epsilon > 0$ and $F: \mathbb{R}^d \to \mathbb{R}^d$ is a Lipschitz operator satisfying Assumption 1 (Section 2) with $\rho \ge 0$. Parameter $p$ determines the $\ell_p$ setup, and $p^* = \frac{p}{p-1}$ is the exponent conjugate to $p$. Only the dependence on $\epsilon$ and possibly the dimension $d$ is shown; the dependence on other problem parameters is comparable for all the results. $\tilde O$ hides logarithmic factors. '—' indicates that the result does not exist/is not known.

Paper \ Setup        | $\rho \in (0, \frac{1}{4L})$, $p = 2$ | $\rho = 0$, $p = 2$      | $\rho = 0$, $p \in (1, 2)$                    | $\rho = 0$, $p > 2$
(Dang and Lan, 2015) | —                                     | $O(1/\epsilon^2)$        | $O\big(\mathrm{poly}(d^{1/p-1/2})/\epsilon^2\big)$ | $O\big(\mathrm{poly}(d^{1/2-1/p})/\epsilon^2\big)$
(Lin et al., 2018)   | —                                     | $\tilde O(1/\epsilon^2)$ | —                                             | —
(Song et al., 2020)  | —                                     | $O(1/\epsilon^2)$        | $O(1/\epsilon^2)$                             | —
This Paper           | $O(1/\epsilon^2)$                     | $O(1/\epsilon^2)$        | $O(1/\epsilon^2)$                             | $O(1/\epsilon^p)$
There has been substantial work for convex-concave and nonconvex-concave objectives, targeting the computation of min-max solutions to (1.1) or, respectively, stationary points of $f$ or of $\phi(x) := \max_y f(x, y)$. This work has focused on attaining improved convergence rates (Kong and Monteiro, 2019; Lin et al., 2020b; Thekumparampil et al., 2019; Nouiehed et al., 2019; Lu et al., 2020; Zhao, 2019; Alkousa et al., 2019; Azizian et al., 2020; Golowich et al., 2020; Lin et al., 2020a; Diakonikolas, 2020) and/or obtaining last-iterate convergence guarantees (Daskalakis et al., 2018; Daskalakis and Panageas, 2018; Mazumdar et al., 2020; Mertikopoulos et al., 2018; Lin et al., 2018; Hamedani and Aybat, 2018; Adolphs et al., 2019; Daskalakis and Panageas, 2019; Liang and Stokes, 2019; Gidel et al., 2019; Mokhtari et al., 2020; Abernethy et al., 2019; Liu et al., 2020).

In the nonconvex-nonconcave setting, research has focused on identifying different notions of local min-max solutions (Daskalakis and Panageas, 2018; Mazumdar et al., 2020; Jin et al., 2020; Mangoubi and Vishnoi, 2021) and on studying the existence and (local) convergence properties of learning methods to these points (Wang et al., 2019; Mangoubi et al., 2020; Mangoubi and Vishnoi, 2021). As already discussed, recent work of Daskalakis et al. (2021) shows that, for general smooth objectives, the computation of even approximate first-order locally optimal min-max solutions is intractable, motivating the identification of structural assumptions on the objective function for which these intractability barriers can be bypassed.

An example of such an assumption, which is closely related to the one made in this work, is that an (mvi) solution exists for the operator $F([x; y]) = [\nabla_x f(x,y);\, -\nabla_y f(x,y)]$, as studied by Zhou et al. (2017); Lin et al. (2018); Mertikopoulos et al. (2019); Malitsky (2019); Liu et al. (2020); Song et al. (2020). As we have already discussed, the assumption we make for our main result in this work is weaker. Table 1 provides a comparison of our results to those of existing works, considering the deterministic setting (i.e. having exact value access to $F$).

In unconstrained Euclidean setups, the best known convergence rates are of the order $1/\sqrt{k}$ (Dang and Lan, 2015; Song et al., 2020), under the assumption that an (mvi) solution exists. We obtain the same rate under our weaker Assumption 1. Moreover, under our weaker assumption, we show that the accumulation points of the sequence of iterates of our algorithm are (svi) solutions. This was previously established for alternative algorithms and under the stronger assumption that an (mvi) solution exists (Mertikopoulos et al., 2019; Malitsky, 2019).

When it comes to more general $\ell_p$ norms, Mertikopoulos et al. (2019) establish the asymptotic convergence of the iterates of an optimistic variant of the mirror descent algorithm, under the assumption that an (mvi) solution exists, but they do not provide any convergence rates. On the other hand, Dang and Lan (2015) prove a $1/\sqrt{k}$ rate of convergence for a variant of the mirror-prox algorithm in general normed spaces. This result, however, requires the regularizing (prox) function to be both smooth and strongly convex w.r.t. the same norm, and the constant in the convergence bound scales at least linearly with the condition number of the prox function. It is well known that no function can be simultaneously smooth and strongly convex w.r.t. an $\ell_p$ norm with $p \ne 2$ and have a condition number independent of the dimension (Borwein et al., 2009). In fact, unless $p$ is trivially close to 2, we only know of functions whose condition number would scale polynomially with the dimension.

Very recent (and independent) work of Song et al. (2020) proposes an optimistic dual extrapolation method with linear convergence for a class of problems that have a "strong" (mvi) solution. (In particular, their assumption is that there exists $u^* \in \mathbb{R}^d$ such that $\langle F(u), u - u^*\rangle \ge m\|u - u^*\|^2$ for all $u \in \mathbb{R}^d$ and some constant $m \ge 0$; the case $m = 0$ recovers the existence of a standard (mvi) solution.) Their result only applies to norms that are strongly convex, which in the case of $\ell_p$ norms is true only for $p \in (1, 2]$. In that case,
our results match those of Song et al. (2020). For the the Stampacchia Variational Inequality problem con-
case of stochastic oracle access to F, our bounds also sists in finding u⇤ 2 Rd such that:
match those of Song et al. (2020) for p 2 (1, 2], and
we also handle the case p > 2 which is not covered by (8u 2 U) : hF (u⇤ ), u u⇤ i 0. (svi)
Song et al. (2020).
In this case, u⇤ is referred to as the strong solution
Finally, it is worth noting that Zhou et al. (2017); to the variational inequality corresponding to F and
Mertikopoulos et al. (2019); Malitsky (2019); Song U . When U ⌘ Rd (the case considered here), it must
et al. (2020) consider constrained optimization setups, be the case that kF (u⇤ )kp⇤ = 0. We will assume that
which are not considered in our work. We believe that there exists at least one (svi) solution, and will denote
generalizing our results to constrained setups is possi- the set of all such solutions by U ⇤ .
ble, and defer such generalizations to future work.
The Minty Variational Inequality problem consists in
finding u⇤ such that:
2 Notation and Preliminaries
(8u 2 U) : hF (u), u⇤ ui 0, (mvi)
d
We consider real d-dimensional spaces (R , k · kp ),
in which case u⇤ is referred to as the weak solution
where k · kp is the p standard `p norm for p 1. In
to the variational inequality corresponding to F and
particular, k · k2 = h·, ·i is the `2 (Euclidean) norm
U . If we assume that F is monotone, then (2.1) im-
and h·, ·i denotes the inner product. When the context
plies that every solution to (svi) is also a solution to
is clear, we omit the subscript 2 and just write k · k
(mvi), and the two solution sets are equivalent. More
for the Euclidean norm k · k2 . Moreover, we denote by
generally, if F is not monotone, all that can be said
p⇤ = p p 1 the exponent conjugate to p.
is that the set of (mvi) solutions is a subset of the set
We are interested in finding stationary points for min- of (svi) solutions. In particular, (mvi) solutions may
max problems of the form: not exist even when (svi) solutions exist. These facts
follow from Minty’s theorem (see, e.g., (Kinderlehrer
min max f (x, y), (P) and Stampacchia, 2000, Chapter 3)).
x2Rd1 y2Rd2
We will not, in general, be assuming that F is mono-
where f is a smooth (possibly nonconvex-nonconcave) tone. Note that the Lipschitzness of F on its own is
function and d1 + d2 = d. In this case, stationary not sufficient to guarantee that the problem is compu-
points can be defined as the points at which the gra- tationally tractable, as discussed in the introduction.
dient of f is the zero vector. As is standard, the ✏- Thus, additional structure is needed, which we intro-
approximate variant of this problem for ✏ > 0 is to find duce in the following.
a point (x, y) 2 Rd1 ⇥ Rd2 such that krf (x, y)kp⇤ ✏.
We will study Problem (P) through the lens of vari- Weak MVI solutions. We define the class of prob-
ational inequalities, described in Section 2.1. To do lems with weak (mvi) solutions as the class of problems
so, we consider⇥ rthe operator
⇤ F : Rd ⇥!⇤ Rd defined in which F satisfies the following assumption.
x f (x,y) x
via F (u) = ry f (x,y) , where u = y and where Assumption 1 (Weak mvi). There exists u⇤ 2 U ⇤
rx f (respectively, ry f ) denotes the gradient of f such that:
w.r.t. x (respectively, y). It is clear that F is Lipschitz- ⇢
continuous⇥ whenever f is smooth and that kF (u)kp⇤ (8u 2 Rd ) : hF (u), u u⇤ i kF (u)k2p⇤ , (a2 )
⇤ 2
✏ for u = yx holds if and only if krf (x, y)kp⇤ ✏. ⇥ 1
for some parameter ⇢ 2 0, 4L .
2.1 Variational Inequalities and Structured
We will only provide results for ⇢ > 0 in the case of
(Possibly Non-Monotone) Operators
the `2 norm. For p 6= 2, we will require a stronger as-
Let F : Rd ! Rd be an operator that is L-Lipschitz- sumption; namely, that an (mvi) solution exists, which
continuous w.r.t. k · kp : holds when ⇢ = 0.
(8u, v 2 Rd ) : kF (u) F (v)kp⇤ Lku vkp . (a1 ) 2.2 Example Settings Satisfying
Assumption 1.
F is said to be monotone if:
The class of problems that have weak (mvi) solutions
(8u, v 2 Rd ) : hF (u) F (v), u vi 0. (2.1) in the sense of Assumption 1 generalizes other struc-
tured non-monotone variational inequality problems,
Given a closed convex set U ✓ Rd and an operator F, as we discuss in this section.
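As a quick numerical illustration of Assumption 1 and of how it relaxes the (mvi) requirement, the sketch below samples random points for a two-dimensional rotation operator. The operator, the sampling scheme, and the value of $\rho$ are illustrative assumptions, not instances from the paper: a plain (mvi) solution does not exist for this operator, yet the weak-MVI inequality holds at $u^* = 0$ for a $\rho$ inside $[0, \frac{1}{4L})$.

```python
import numpy as np

# Empirical check of Assumption 1 (weak MVI): find u* in U* with
#   <F(u), u - u*>  >=  -(rho/2) * ||F(u)||_2^2   for all u.
# The toy operator below (a fixed rotation, an illustrative choice) is
# 1-Lipschitz, has the unique (svi) solution u* = 0, and satisfies
# <F(u), u - u*> = cos(theta) * ||u||^2 < 0, so a plain (mvi) solution
# does not exist.
theta = np.arccos(-0.1)                      # rotation by ~95.7 degrees
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def F(u):
    return M @ u

u_star = np.zeros(2)
L = 1.0                                      # spectral norm of a rotation
rho = 0.22                                   # assumed value; lies in [0, 1/(4L))

rng = np.random.default_rng(0)
samples = 10 * rng.standard_normal((100_000, 2))
Fu = samples @ M.T                           # row i holds F(u_i)
lhs_mvi = np.einsum("ij,ij->i", Fu, samples - u_star)
lhs_weak = lhs_mvi + 0.5 * rho * np.einsum("ij,ij->i", Fu, Fu)

print("min <F(u), u - u*>                      :", lhs_mvi.min())   # negative
print("min <F(u), u - u*> + (rho/2)||F(u)||^2  :", lhs_weak.min())  # >= 0
```

A sampled check of this kind is of course only suggestive; for this particular linear operator the inequality can also be verified in closed form.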
When $\rho = 0$, we recover the class of problems that have an (mvi) solution. This class contains all unconstrained variationally coherent problems studied in, e.g., Zhou et al. (2017); Mertikopoulos et al. (2019), which encompass all min-max problems with objectives that are bilinear, pseudo-convex-concave, quasi-convex-concave, and star-convex-concave.

When $\rho > 0$ and $p = 2$, Assumption 1 is implied by $F$ being $-\frac{\rho}{2}$-comonotone (Bauschke et al., 2020) or $\frac{\rho}{2}$-cohypomonotone (Combettes and Pennanen, 2004), defined as follows:

$$(\forall u, v \in \mathbb{R}^d):\quad \langle F(u) - F(v), u - v\rangle \ge -\frac{\rho}{2}\|F(u) - F(v)\|_2^2.$$  (2.2)

Assumption 1 can also hold for operators that are not comonotone; in particular, it suffices that for some $u^* \in U^*$ we have

$$(\forall u \in \mathbb{R}^d):\quad \langle F(u), u - u^*\rangle \ge -\frac{\lambda}{2}\|u - u^*\|_p^2,$$  (2.3)

and when the operator $F$ does not plateau or become too close to a linear operator around $u^*$; namely, $\|F(u) - F(u^*)\|_{p^*} \ge \mu\|u - u^*\|_p$. (Note that (2.3) is always satisfied with $\lambda = 2L$ for $L$-Lipschitz operators, but we may need $\lambda$ to be smaller than $2L$.) Then Assumption 1 would be satisfied with $\rho = \lambda/\mu^2$. For a min-max problem, assuming $f$ is twice differentiable, this would mean that the lowest eigenvalue of the symmetric part of the Jacobian of $[\nabla_x f(x,y);\, -\nabla_y f(x,y)]$ is bounded below by $-\lambda/2$ in any direction $u - u^*$ and that the function $f$ is sufficiently "curved" (not close to a linear function) around $u^*$.
A differentiable function $\psi: \mathbb{R}^d \to \mathbb{R}$ is said to be $p$-uniformly convex w.r.t. $\|\cdot\|$ and with constant $m$ if, for all $x, y \in \mathbb{R}^d$,

$$\psi(y) \ge \psi(x) + \langle \nabla\psi(x), y - x\rangle + \frac{m}{p}\|y - x\|^p.$$

Observe that when $p = 2$, we recover the standard definition of strong convexity. Thus, uniform convexity is a generalization of strong convexity.

Definition 2.2 (Bregman divergence). Let $\psi: \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. Then its Bregman divergence between points $x, y \in \mathbb{R}^d$ is defined by

$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y\rangle.$$

It is immediate that the Bregman divergence of a convex function is non-negative.

Useful facts for $\ell_p$ setups. We now outline some useful auxiliary results used specifically in Section 4, where we study the case that $p$ is not necessarily equal to 2.

Proposition 2.3. Given $z, u \in \mathbb{R}^d$, $p \in (1, \infty)$, and $q \in \{p, 2\}$, let

$$w = \operatorname*{argmin}_{v \in \mathbb{R}^d}\Big\{\langle z, v\rangle + \frac{1}{q}\|u - v\|_p^q\Big\}.$$

Then, for $p^* = \frac{p}{p-1}$ and $q^* = \frac{q}{q-1}$:

$$w = u - \nabla\Big(\frac{1}{q^*}\|z\|_{p^*}^{q^*}\Big) \qquad\text{and}\qquad \frac{1}{q}\|w - u\|_p^q = \frac{1}{q}\|z\|_{p^*}^{q^*}.$$

Another useful result is the following proposition, which will allow us to relate the Lipschitzness of $F$ to the uniform convexity of the prox mapping $\frac{1}{q}\|\cdot\|_p^q$ in the definition of the algorithm. The ideas used in the proof can be found in the proofs of (d'Aspremont et al., 2018, Lemma 5.7), (Nesterov, 2015, Lemma 2), and in (Devolder et al., 2014, Section 2.3).

Proposition 2.4. For any $L > 0$, $\delta > 0$, $q > 2$, and $t \ge 0$,

$$\frac{L}{2}t^2 \le \frac{\delta}{q}t^q + \frac{\Delta^*_\delta}{2}, \qquad\text{where}\quad \Delta^*_\delta = \frac{q-2}{q}\Big(\frac{1}{\delta}\Big)^{2/(q-2)} L^{q/(q-2)}.$$

3 Generalized Extragradient for Problems with Weak MVI Solutions

In this section, we consider the setup with the Euclidean norm $\|\cdot\| = \|\cdot\|_2$, i.e., $p = 2$. To address the class of problems with weak (mvi) solutions (see Assumption 1), we introduce the following generalization of the extragradient algorithm, to which we refer as Extragradient+ (eg+):

$$\bar u_k = \operatorname*{argmin}_{u \in \mathbb{R}^d}\Big\{\frac{a_k}{\gamma}\langle F(u_k), u - u_k\rangle + \frac{1}{2}\|u - u_k\|^2\Big\},$$
$$u_{k+1} = \operatorname*{argmin}_{u \in \mathbb{R}^d}\Big\{a_k\langle F(\bar u_k), u - u_k\rangle + \frac{1}{2}\|u - u_k\|^2\Big\},$$  (eg+)

where $\gamma \in (0, 1]$ is a parameter of the algorithm and $a_k > 0$ is the step size. When $\gamma = 1$, we recover standard eg.

The analysis relies on the following merit (or gap) function:

$$h_k := a_k\Big(\langle F(\bar u_k), \bar u_k - u^*\rangle + \frac{\rho}{2}\|F(\bar u_k)\|^2\Big),$$  (3.1)

for some $u^* \in U^*$ for which $F$ satisfies Assumption 1. Then Assumption 1 implies that $h_k \ge 0$ for all $k$.

The first (and main) step is to bound all $h_k$'s above, as in the following lemma.

Lemma 3.1. Let $F: \mathbb{R}^d \to \mathbb{R}^d$ be an arbitrary $L$-Lipschitz operator that satisfies Assumption 1 for some $u^* \in U^*$. Given an arbitrary initial point $u_0$, let the sequences of points $\{u_i\}_{i \ge 1}$, $\{\bar u_i\}_{i \ge 0}$ evolve according to (eg+) for some $\gamma \in (0, 1]$ and positive step sizes $\{a_i\}_{i \ge 0}$. Then, for any $\beta > 0$ and any $k \ge 0$, we have:

$$\begin{aligned} h_k \le{}& \frac{1}{2}\|u^* - u_k\|^2 - \frac{1}{2}\|u^* - u_{k+1}\|^2 - a_k\Big(\frac{(1-\gamma)a_k}{\gamma} - \rho - \frac{a_k}{2\beta}\Big)\|F(\bar u_k)\|^2 \\ &+ \frac{\beta a_k^3 L^2}{2\gamma^2}\|F(u_k)\|^2 + \frac{1}{2}\Big(\frac{a_k L}{\gamma} - 1\Big)\|\bar u_k - u_{k+1}\|^2, \end{aligned}$$  (3.2)

where $h_k$ is defined as in Eq. (3.1).

The proof is provided in Appendix B.

Using Lemma 3.1, we can now draw conclusions about the convergence of eg+ by choosing the parameters $\gamma$, $\beta$, and the step sizes $a_k$ to guarantee that $h_k < \frac{1}{2}\|u^* - u_k\|^2 - \frac{1}{2}\|u^* - u_{k+1}\|^2$ as long as $\|F(\bar u_k)\| \ne 0$.

Theorem 3.2. Let $F: \mathbb{R}^d \to \mathbb{R}^d$ be an arbitrary $L$-Lipschitz operator that satisfies Assumption 1 for some $u^* \in U^*$. Given an arbitrary initial point $u_0 \in \mathbb{R}^d$, let the sequences of points $\{u_i\}_{i \ge 1}$, $\{\bar u_i\}_{i \ge 0}$ evolve according to (eg+) for $\gamma = \frac{1}{2}$ and $a_k = \frac{1}{2L}$. Then:

(i) all accumulation points of $\{\bar u_k\}_{k \ge 0}$ are in $U^*$;

(ii) for all $k \ge 1$:

$$\frac{1}{k+1}\sum_{i=0}^{k}\|F(\bar u_i)\|^2 \le \frac{2L\|u_0 - u^*\|^2}{(k+1)\big(1/(4L) - \rho\big)}.$$
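The (eg+) iteration is simple to implement. Below is a minimal sketch in Python with NumPy, using the parameter choices $\gamma = \frac{1}{2}$ and $a_k = \frac{1}{2L}$ from Theorem 3.2 on the same toy rotation operator as above; the operator, the horizon, and the printed diagnostics are illustrative assumptions rather than an experiment from the paper.

```python
import numpy as np

# Minimal sketch of Extragradient+ (eg+) in the Euclidean setup:
#   u_bar_k = u_k - (a_k / gamma) * F(u_k)      (aggressive first step)
#   u_{k+1} = u_k -  a_k          * F(u_bar_k)
# run with gamma = 1/2 and a_k = 1/(2L), as in Theorem 3.2, on the toy
# rotation operator used above (an illustrative choice, not from the paper).
theta = np.arccos(-0.1)
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
F = lambda u: M @ u

L = 1.0                      # Lipschitz constant of F
gamma = 0.5
a = 1.0 / (2.0 * L)

u = np.array([5.0, -3.0])    # arbitrary initial point u_0
norms = []
for k in range(2000):
    u_bar = u - (a / gamma) * F(u)
    Fu_bar = F(u_bar)
    u = u - a * Fu_bar
    norms.append(np.linalg.norm(Fu_bar))

# Theorem 3.2 bounds the running average of ||F(u_bar_k)||^2, so the best
# iterate satisfies min_k ||F(u_bar_k)|| = O(1/sqrt(k)).
print("min_k ||F(u_bar_k)||            :", min(norms))
print("avg of ||F(u_bar_k)||^2 over run:", np.mean(np.square(norms)))
```

Setting `gamma = 1.0` in the same loop recovers the standard extragradient method, which makes the effect of the interpolation parameter easy to explore numerically.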
Here, $\mathcal{F}_k$ and $\bar{\mathcal{F}}_k$ denote the natural filtrations, including all the randomness up to the construction of the points $u_k$ and $\bar u_k$, respectively, and $\bar\sigma_k^2$, $\sigma_{k+1}^2$ are the variance constants. Observe that $\mathcal{F}_k \subseteq \bar{\mathcal{F}}_k$ and $\bar{\mathcal{F}}_k \subseteq \mathcal{F}_{k+1}$. To simplify the notation, we denote:

$$\bar\eta_k = \tilde F(\bar u_k) - F(\bar u_k), \qquad \eta_{k+1} = \tilde F(u_{k+1}) - F(u_{k+1}).$$  (4.2)

The variant of the method we consider here is stated as follows:

$$\bar u_k = \operatorname*{argmin}_{u \in \mathbb{R}^d}\Big\{\frac{a_k}{\gamma}\big\langle \tilde F(u_k), u - u_k\big\rangle + \frac{1}{q}\|u - u_k\|_p^q\Big\},$$
$$u_{k+1} = \operatorname*{argmin}_{u \in \mathbb{R}^d}\Big\{a_k\big\langle \tilde F(\bar u_k), u - u_k\big\rangle + \phi_p(u, u_k)\Big\},$$  (eg$_p$+)

where

$$q = \begin{cases} 2, & \text{if } p \in (1, 2],\\ p, & \text{if } p \in (2, \infty), \end{cases}$$  (4.3)

and

$$\phi_p(u, u_k) = \begin{cases} D_{\frac{1}{2}\|\cdot - u_0\|_p^2}(u, u_k), & \text{if } p \in (1, 2],\\ \frac{1}{p}\|u - u_k\|_p^p, & \text{if } p \in (2, \infty). \end{cases}$$  (4.4)

Notice that for $p = 2$, eg$_p$+ is equivalent to eg+. Thus, eg$_p$+ generalizes eg+ to arbitrary $\ell_p$ norms. However, eg$_p$+ is different from the standard Extragradient or Mirror-Prox, for two reasons. First, as is the case for eg+, the step sizes that determine $\bar u_k$ and $u_{k+1}$ (i.e., $a_k/\gamma$ and $a_k$) are not the same in general, as we could (and will) choose $\gamma \ne 1$. Second, unless $p = q = 2$, the function $\frac{1}{q}\|u - u_k\|_p^q$ in the definition of the algorithm is not a Bregman divergence between points $u$ and $u_k$ of any function $\psi$. Further, when $p > 2$, $\frac{1}{q}\|u - u_k\|_p^q$ is not strongly convex. Instead, it is $p$-uniformly convex with constant 1. Additionally, no function whose gap between the maximum and the minimum value is bounded by a constant on any ball of constant radius can have a constant of strong convexity w.r.t. $\|\cdot\|_p$ that is larger than $O\big(\frac{1}{d^{1-2/p}}\big)$ (d'Aspremont et al., 2018). When $p \in (1, 2]$, $\frac{1}{q}\|u - u_k\|_p^q$ is strongly convex with constant $p - 1$ (Nemirovski, 2004). We let $m_p$ denote the constant of strong/uniform convexity of $\frac{1}{q}\|u - u_k\|_p^q$, that is:

$$m_p = \min\{p - 1, 1\}.$$  (4.5)

Observe that

$$\phi_p(u, u_k) \ge \frac{m_p}{q}\|u - u_k\|_p^q.$$  (4.6)

This is immediate for $p > 2$, by the definition of $\phi_p$ and using that $q = p$ and $m_p = 1$ when $p > 2$. For $p \in (1, 2]$, we have that $q = 2$, and Eq. (4.6) follows by strong convexity of $\frac{1}{2}\|\cdot\|_p^2$.

As in the case of Euclidean norms, the analysis relies on the following merit function:

$$h_k := a_k\Big(\langle F(\bar u_k), \bar u_k - u^*\rangle + \frac{\rho}{2}\|F(\bar u_k)\|_{p^*}^2\Big).$$  (4.7)

Moreover, as before, Assumption 1 guarantees that $h_k \ge 0$ for all $k$. Even though we only handle the case $\rho = 0$ for $p \ne 2$, the analysis is significantly more challenging than in the $\ell_2$ case, and, due to space constraints, we only state the main results here, while all the technical details are provided in Appendix C.

Deterministic oracle access. The main result is summarized in the following theorem.

Theorem 4.1. Let $p > 1$ and let $F: \mathbb{R}^d \to \mathbb{R}^d$ be an arbitrary $L$-Lipschitz operator w.r.t. $\|\cdot\|_p$ that satisfies Assumption 1 with $\rho = 0$ for some $u^* \in U^*$. Assume that we are given oracle access to the exact evaluations of $F$, i.e., $\bar\eta_i = \eta_i = 0$ for all $i$. Given an arbitrary initial point $u_0 \in \mathbb{R}^d$, let the sequences of points $\{u_i\}_{i \ge 1}$, $\{\bar u_i\}_{i \ge 0}$ evolve according to (eg$_p$+) for $\gamma \in (0, 1]$ and step sizes $\{a_i\}_{i \ge 0}$ specified below. Then, we have:

(i) Let $p \in (1, 2]$. If $\gamma = m_p = p - 1$ and $a_k = \frac{m_p^{3/2}}{2L}$, then all accumulation points of $\{u_k\}_{k \ge 0}$ are in $U^*$ and, furthermore, for all $k \ge 0$:

$$\frac{1}{k+1}\sum_{i=0}^{k}\|F(u_i)\|_{p^*}^2 \le \frac{16 L^2 \phi_p(u^*, u_0)}{m_p^2(k+1)} = O\Big(\frac{L^2\|u^* - u_0\|_p^2}{(p-1)^2(k+1)}\Big).$$

In particular, within $k = O\Big(\frac{L^2\|u^* - u_0\|_p^2}{(p-1)^2\epsilon^2}\Big)$ iterations, eg$_p$+ can output a point $u$ with $\|F(u)\|_{p^*} \le \epsilon$.

(ii) Let $p \in (2, \infty)$. If $\gamma = \frac{1}{2}$, $\delta_k = \delta > 0$, $\Delta^* = \Delta^*_\delta$ (as in Proposition 2.4, with $q$ as in Eq. (4.3)), and $a_k = \frac{1}{2\Delta^*} = a$, then, for all $k \ge 0$:

$$\frac{1}{k+1}\sum_{i=0}^{k}\|F(\bar u_i)\|_{p^*}^{p^*} \le \frac{2\|u^* - u_0\|_p^p}{a^{p^*}(k+1)} + \frac{2^p\,\delta}{a^{p^*-1}}.$$

In particular, for any $\epsilon > 0$, there is a choice of $\delta$, determined by $\epsilon$, $L$, and a constant $C_p$ that only depends on $p$, such that eg$_p$+ can output a point $u$ with $\|F(u)\|_{p^*} \le \epsilon$ in at most

$$k = O_p\bigg(\Big(\frac{L\|u^* - u_0\|_p}{\epsilon}\Big)^{p}\bigg)$$

iterations. Here, the $O_p$ notation hides constants that only depend on $p$.
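The extrapolation step of (eg$_p$+) only requires the closed-form $\ell_p$ prox map from Proposition 2.3, namely $w = u - \nabla\big(\frac{1}{q^*}\|z\|_{p^*}^{q^*}\big)$. The sketch below implements that formula in plain NumPy and numerically verifies both the first-order optimality condition and the norm identity from the proposition; the dimension and the choice $p = 3$ are illustrative assumptions.

```python
import numpy as np

# Closed form of the l_p "prox" step used in (eg_p+) (cf. Proposition 2.3):
#   w = argmin_v  <z, v> + (1/q) ||u - v||_p^q
#     = u - grad( (1/q*) ||z||_{p*}^{q*} ),
# where p* = p/(p-1) and q* = q/(q-1).
def lp_prox_step(u, z, p, q):
    p_star, q_star = p / (p - 1.0), q / (q - 1.0)
    z_norm = np.linalg.norm(z, ord=p_star)
    grad = z_norm ** (q_star - p_star) * np.sign(z) * np.abs(z) ** (p_star - 1.0)
    return u - grad

rng = np.random.default_rng(0)
p = 3.0
q = p                        # per Eq. (4.3): q = 2 if p in (1, 2], q = p if p > 2
u, z = rng.standard_normal(5), rng.standard_normal(5)
w = lp_prox_step(u, z, p, q)

# First-order optimality check:  z + grad_v (1/q)||v - u||_p^q |_{v=w} = 0.
s = w - u
residual = z + np.linalg.norm(s, ord=p) ** (q - p) * np.sign(s) * np.abs(s) ** (p - 1.0)
print("optimality residual (should be ~0):", np.max(np.abs(residual)))

# Identity from Proposition 2.3: (1/q)||w - u||_p^q = (1/q)||z||_{p*}^{q*}.
p_star, q_star = p / (p - 1.0), q / (q - 1.0)
print((1.0 / q) * np.linalg.norm(s, ord=p) ** q,
      (1.0 / q) * np.linalg.norm(z, ord=p_star) ** q_star)
```

For $p \in (1, 2]$ the same function is called with $q = 2$; only the update for $u_{k+1}$, which uses the Bregman divergence $\phi_p$ from (4.4), requires a different (and, for $p \ne 2$, implicit) computation.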
Remark 4.2. There are significant technical obstacles in generalizing the results from Theorem 4.1 to settings with $\rho > 0$. In particular, when $p \in (1, 2)$, the proof fails because we take $\phi_p(u^*, u)$ to be the Bregman divergence of $\|\cdot - u_0\|_p^2$, and relating $\|\bar u_k - u_k\|_p$ to $\|F(u_k)\|_{p^*}$ would require $\|\cdot\|_p^2$ to be smooth, which is not true. If we had, instead, used $\|u^* - u\|_p^2$ in place of $\phi_p(u^*, u)$, we would have incurred $\frac{1}{2}\|u^* - u_k\|_p^2 - \frac{m_p}{2}\|u^* - u_{k+1}\|_p^2$ in the upper bound on $h_k$, which would not telescope, as in this case $m_p < 1$. In the case of $p > 2$, the challenges come from a delicate relationship between the step sizes $a_k$ and the error terms $\delta_k$. It turns out that it is possible to guarantee local convergence (in the region where $\|F(\bar u_k)\|^2$ is bounded by a constant less than 1) with $\rho > 0$, but $\rho$ would need to scale with $\mathrm{poly}(\epsilon)$ in this case. As this is a weak result whose usefulness is unclear, we have omitted it.

Stochastic oracle access. To obtain results for the stochastic setups, we mainly need to bound the stochastic error terms which decompose from the analysis of the deterministic setups, as in the following lemma.

Lemma 4.3. Let $E^s = a_k\langle\bar\eta_k, \bar u_k - u^*\rangle - a_k\langle\bar\eta_k - \eta_k, \bar u_k - u_{k+1}\rangle$, where $\eta_k$ and $\bar\eta_k$ are defined as in Eq. (4.2) and all the assumptions of Theorem 4.4 below apply. Then, for $q$ defined by Eq. (4.3) and any $\tau > 0$:

$$\mathbb{E}[E^s] \le \frac{2^{q^*/2}\, a_k^{q^*}\,(\sigma_k^2 + \bar\sigma_k^2)^{q^*/2}}{q^*\,\tau^{q^*}} + \mathbb{E}\Big[\frac{\tau^q}{q}\,\|\bar u_k - u_{k+1}\|_p^q\Big],$$

where the expectation is w.r.t. all the randomness in the algorithm.

Theorem 4.4. Let $p > 1$ and let $F: \mathbb{R}^d \to \mathbb{R}^d$ be an arbitrary $L$-Lipschitz operator w.r.t. $\|\cdot\|_p$ that satisfies Assumption 1 for some $u^* \in U^*$. Given an arbitrary initial point $u_0 \in \mathbb{R}^d$, let the sequences of points $\{u_i\}_{i \ge 1}$, $\{\bar u_i\}_{i \ge 0}$ evolve according to (eg$_p$+) for some $\gamma \in (0, 1]$ and positive step sizes $\{a_i\}_{i \ge 0}$. Let the variance of a single query to the stochastic oracle $\tilde F$ be bounded by some $\sigma^2 < \infty$.

(i) Let $p = 2$ and $\rho \in [0, \bar\rho)$, where $\bar\rho = \frac{1}{4\sqrt{2}L}$. If $\gamma = \frac{1}{2}$ and $a_k = \frac{1}{2\sqrt{2}L}$, then eg$_p$+ can output a point $u$ with $\mathbb{E}[\|\tilde F(u)\|_2] \le \epsilon$ with at most

$$O\bigg(\frac{L\|u^* - u_0\|_2^2}{\epsilon^2(\bar\rho - \rho)}\Big(1 + \frac{\sigma^2}{L\epsilon^2(\bar\rho - \rho)}\Big)\bigg)$$

oracle queries to $\tilde F$.

(ii) Let $p \in (1, 2]$ and $\rho = 0$. If $a_k = \frac{m_p^{3/2}}{2L}$ and $\gamma = m_p$, then eg$_p$+ can output a point $u$ with $\mathbb{E}[\|\tilde F(u)\|_{p^*}] \le \epsilon$ with at most

$$O\bigg(\frac{L^2\|u^* - u_0\|_p^2}{m_p^2\epsilon^2}\Big(1 + \frac{\sigma^2}{m_p\epsilon^2}\Big)\bigg)$$

oracle queries to $\tilde F$, where $m_p = p - 1$.

(iii) Let $p > 2$ and $\rho = 0$. If $\gamma = \frac{1}{2}$ and $a_k = a = \frac{1}{4\Delta^*}$, then eg$_p$+ can output a point $u$ with $\mathbb{E}[\|\tilde F(u)\|_{p^*}] \le \epsilon$ with at most

$$O_p\bigg(\Big(\frac{L\|u^* - u_0\|_p}{\epsilon}\Big)^{p}\Big(1 + \Big(\frac{\sigma}{\epsilon}\Big)^{p^*}\Big)\bigg)$$

oracle queries to $\tilde F$, where $p^* = \frac{p}{p-1}$.

5 Discussion

We introduced a new class of structured nonconvex-nonconcave min-max optimization problems and proposed a new generalization of the extragradient method that provably converges to a stationary point in Euclidean setups. Our algorithmic results guarantee that problems in this class contain at least one stationary point (an (svi) solution; see Remark 3.5). The class we introduced generalizes other important classes of structured nonconvex-nonconcave problems, such as those in which an (mvi) solution exists. We further generalized our results to stochastic setups and to $\ell_p$-normed setups in which an (mvi) solution exists.

An interesting direction for future research is to understand to what extent we can further relax the assumptions about the structure of nonconvex-nonconcave problems, while maintaining computational feasibility of algorithms that can address them.

Acknowledgements

We wish to thank Steve Wright for a useful discussion regarding convergence of sequences. We also wish to thank the Simons Institute for the Theory of Computing, where some of this work was conducted.

JD was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation and by the NSF Award CCF-2007757. CD was supported by NSF Awards IIS-1741137, CCF-1617730, and CCF-1901292, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, and by the DOE PhILMs project (No. DE-AC05-76RL01830). MJ was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
Korpelevich, G. (1976). The extragradient method for finding saddle points and other problems. Matecon, 12:747–756.

Liang, T. and Stokes, J. (2019). Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In Proc. AISTATS'19.

Lin, Q., Liu, M., Rafique, H., and Yang, T. (2018). Solving weakly-convex-weakly-concave saddle-point problems as successive strongly monotone variational inequalities. arXiv preprint arXiv:1810.10207.

Lin, T., Jin, C., and Jordan, M. (2020a). Near-optimal algorithms for minimax optimization. In Proc. COLT'20.

Lin, T., Jin, C., and Jordan, M. I. (2020b). On gradient descent ascent for nonconvex-concave minimax problems. In Proc. ICML'20.

Liu, M., Mroueh, Y., Ross, J., Zhang, W., Cui, X., Das, P., and Yang, T. (2020). Towards better understanding of adaptive gradient algorithms in generative adversarial nets. In Proc. ICLR'20.

Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. (2020). Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. IEEE Transactions on Signal Processing, 68:3676–3691.

Malitsky, Y. (2019). Golden ratio algorithms for variational inequalities. Mathematical Programming.

Mangoubi, O., Sachdeva, S., and Vishnoi, N. K. (2020). A provably convergent and practical algorithm for min-max optimization with applications to GANs. arXiv preprint arXiv:2006.12376.

Mangoubi, O. and Vishnoi, N. K. (2021). Greedy adversarial equilibrium: An efficient alternative to nonconvex-nonconcave min-max optimization. In Proc. ICLR'21.

Mazumdar, E., Ratliff, L. J., and Sastry, S. S. (2020). On gradient-based learning in continuous games. SIAM Journal on Mathematics of Data Science, 2(1):103–131.

Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. (2019). Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In Proc. ICLR'19.

Mertikopoulos, P., Papadimitriou, C. H., and Piliouras, G. (2018). Cycles in adversarial regularized learning. In Proc. ACM-SIAM SODA'18.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. (2020). A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In Proc. AISTATS'20.

Nemirovski, A. (2004). Regular Banach spaces and large deviations of random sums. Unpublished, e-print: https://www2.isye.gatech.edu/~nemirovs/LargeDev2004.pdf.

Nesterov, Y. (2015). Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404.

Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. In Proc. NeurIPS'19.

Papadimitriou, C. H. (1994). On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 48(3):498–532.

Song, C., Zhou, Y., Zhou, Z., Jiang, Y., and Ma, Y. (2020). Optimistic dual extrapolation for coherent non-monotone variational inequalities. In Proc. NeurIPS'20.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. In Proc. NeurIPS'19.

Wang, Y., Zhang, G., and Ba, J. (2019). On solving minimax optimization locally: A follow-the-ridge approach. In Proc. ICLR'19.

Yang, J., Kiyavash, N., and He, N. (2020). Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. In Proc. NeurIPS'20.

Zhao, R. (2019). Optimal algorithms for stochastic three-composite convex-concave saddle point problems. arXiv preprint arXiv:1903.01687.

Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S., and Glynn, P. W. (2017). Stochastic mirror descent in variationally coherent optimization problems. In Proc. NIPS'17.