Solving A Class of Non-Convex Minimax Optimization in Federated Learning
Abstract
1 Introduction
Nonconvex minimax optimization has been actively applied to solve numerous machine learning problems, such as adversarial training [56, 61], generative adversarial networks (GANs) [16, 18], policy evaluation in reinforcement learning [55, 22, 25], robust optimization [11, 60], AUROC (area under the ROC curve) maximization [39], etc. Many single-machine minimax optimization algorithms have been proposed to address these problems.
† Equal contribution
§ This work was partially supported by NSF CNS 2213700 and CCF 2217071 at UVA.
∗ This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775 at Pitt and UMD.
Can we design stochastic gradient descent ascent methods with better sample and communication complexities to match the convergence rate of single-machine counterparts for solving problem (1)?
In this paper, we provide an affirmative answer to the aforementioned question and propose a class of
algorithms to solve the problem (1) under different settings. In particular, we consider three most
common classes of nonconvex minimax optimization problems: 1) NC-C: NonConvex in x, Concave
in y; 2) NC-SC: NonConvex in x, Strongly-Concave in y; 3) NC-PL: NonConvex in x, PL-condition
in y. For each of these problems, we propose a new algorithm with provably better convergence rate
(please see Table 1) and provide a theoretical analysis. Our main contributions are four-fold:
1) NC-C setting. We propose FedSGDA+ and prove that it has a sample complexity of O(N −1 ε−8 ) and a communication complexity of O(ε−6 ). FedSGDA+ exploits the structure of FL by adding a global learning rate at the server, which reduces the communication complexity from O(ε−7 ) in [50] to O(ε−6 ). It also achieves a linear speedup with respect to the number of clients.
2) NC-PL setting. We propose a federated stochastic gradient descent ascent (FedSGDA-M) algorithm with the momentum-based variance reduction technique. It has the best sample complexity of O(κ3 N −1 ε−3 ) and the best communication complexity of O(κ2 ε−2 ). Compared with existing momentum-based variance reduction algorithms, our result employs a novel theoretical analysis framework that yields a tighter convergence rate (i.e., our rate removes a logarithmic term appearing in existing works).
3) NC-SC setting. FedSGDA-M can be directly applied to the NC-SC setting since the PL condition is weaker than strong concavity. Our algorithm is the first to reach the sample complexity of O(κ3 N −1 ε−3 ) and the communication complexity of O(κ2 ε−2 ) in the federated NC-SC setting.
Table 1: Complexity comparison of existing nonconvex federated minimax algorithms for finding an ε-stationary point. Sample complexity is the total number of calls to the incremental first-order oracle (IFO) needed to reach an ε-stationary point. Communication complexity denotes the total number of back-and-forth communication rounds between clients and the server. Here, N is the number of clients and κ = Lf /µ is the condition number.
4) Extensive experimental results on fair classification and AUROC maximization confirm the
effectiveness of our proposed algorithm.
2 Related Work
Nonconvex-Concave (NC-C) setting. [31, 46, 45, 54, 35] proposed various deterministic and
stochastic algorithms to solve the NC-C minimax problems. All of these algorithms, however, have a
double-loop structure and are thus relatively complicated to implement. They decouple the minimax
problem into a minimization problem and a maximization problem and use a nested loop to update
variable y while keeping variable x constant. Subsequently, [38] studied the complexity of the single-loop algorithm SGDA for the NC-C minimax problem and proved that the stochastic algorithm achieves O(ε−8 ) complexity. SGDA is a direct extension of SGD from minimization to minimax optimization problems. Recently, [43] provided a unified analysis for the convergence of OGDA and EG methods in the nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings.
Nonconvex-Strongly-Concave (NC-SC) setting. [38] analyzed the stochastic gradient descent
ascent (SGDA) algorithm and proved that SGDA has O(κ3 ε−4 ) stochastic gradient complexity.
To improve the convergence rate, [42] proposed a stochastic GDA algorithm (i.e., SREDA) with a double-loop structure based on the variance reduction technique of SPIDER [13] and reduced the complexity to O(κ3 ε−3 ). [29] used the momentum-based variance reduction technique of STORM [7] and proposed Acc-MDA, a single-loop algorithm that achieves the same convergence result as SREDA. Furthermore, adaptive minimax algorithms have been introduced [28, 30] to solve the
nonsmooth nonconvex-strongly-concave minimax problems based on dynamic mirror functions.
[65] used a nested adaptive framework to design parameter-agnostic nonconvex minimax algorithm.
[69] proved that the VR-based SAPD+ has a complexity of O(κ2 ε−3 ). However, whether the best convergence result of O(ε−3 ) achieved by single-machine methods can be matched in the federated setting remains an open question. In addition, [23] conducted an in-depth investigation of the limitations of the GDA algorithm (e.g., smaller learning rates, cycling/divergence issues) and gave a systematic analysis of how to improve GDA dynamics.
Nonconvex-Nonconcave (NC-NC) setting. There is extensive research on NC-NC problems [12], and the nonconvex-PL condition defines a special class of functions that interests us the most. The Polyak-Łojasiewicz (PL) condition does not require the objective to be concave, and recent works show that it can hold during the training of overparameterized neural networks with random initialization [1, 5]. Recently, many deterministic methods [45, 64, 14] have been proposed for NC-NC problems under the NC-PL setting. [20] proposed a PDAda method for nonconvex-PL minimax optimization under an additional concavity restriction, and [6] focused on finite-sum nonconvex-PL minimax optimization. The stochastic alternating GDA and stochastic smoothed GDA proposed in [66] achieve complexities of O(κ4 ε−4 ) and O(κ2 ε−4 ), respectively.
Distributed minimax optimization has developed rapidly in recent years, driven by the need to train on large-scale datasets [40]. Under the serverless decentralized setting, algorithms for nonconvex minimax optimization have been studied extensively in the nonconvex-strongly-concave setting [4, 62, 68, 41, 57] and the nonconvex-PL setting [27].
In the FL setting, some works analyzed algorithms for convex-concave problems [11, 37, 26, 52].
However, as nonconvex models (e.g., deep neural networks) become increasingly prevalent, there is a growing need for federated nonconvex minimax optimization, such as federated adversarial training [49], federated deep AUROC maximization [19], and federated GANs [47]. [19] and [67] focus on imbalanced-data tasks; they reformulated the AUROC maximization problem as a minimax problem under the FL setting. However, their analysis relies on the strict assumption that deep models satisfy the PL condition and only covers PL-strongly-concave minimax problems. [49] converted robust federated learning into a minimax problem in which only the model parameters, namely the min variables, are exchanged among clients via the server. [10] proposed Local SGDA and Local SGDA+. Local SGDA is the local-update version of the SGDA algorithm in FL. Different from Local SGDA, in Local SGDA+ the max variable y is updated with a constant min variable x̃, and the snapshot x̃ is updated every S iterations. Afterward, [50] improved the sample and communication complexities of Local SGDA for the NC-SC and NC-PL settings, and of Local SGDA+ for the NC-C setting. [50] also proposed Momentum Local SGDA, which achieves the same theoretical results as Local SGDA for the NC-PL and NC-SC settings. In addition, [53] designed FEDNEST with two nested loops. Although FEDNEST incorporates FEDINN (a federated stochastic variance reduction algorithm), its convergence complexity is not improved over vanilla Local SGDA. Afterward, [63] proposed SAGDA under the NC-PL setting, which yields a better communication complexity (i.e., O(ε−2 )). However, its analysis does not account for the effect of the condition number κ. More recently, [51] considered federated minimax optimization with client heterogeneity in the nonconvex-concave and nonconvex-strongly-concave settings.
Relation to Existing Works. We propose FedSGDA+ for the NC-C setting and FedSGDA-M for the NC-SC and NC-PL settings. In the NC-C setting, we discover that adding a global step size leads to better communication complexity. The theoretical analysis then becomes more challenging, as we must not only account for the complicated structure of the minimax problem but also handle the local and global updates separately. For FedSGDA-M, we relax the requirement on the step size (a specifically designed, unnatural step size is often required in STORM-like approaches [7, 34]), which requires novel proof techniques. As a result, our bound does not contain a logarithmic term and gives a tighter convergence rate (see Contribution 1 in [34]). In addition, through a different theoretical framework, our improved sample complexity does not rely on a large batch size, whereas the single-machine minimax algorithm with variance reduction (Acc-MDA) in [29] (Table 2) needs a large batch to achieve the same sample complexity.
Algorithm 1 FedSGDA+ Algorithm
1: Input: T, local step sizes ĉ, c, global step sizes ηx, ηy, k = 0, number of inner updates Q and outer updates S, mini-batch size b, N clients;
2: Initialize: x_0^i = x̃_0 = x̄_0, y_0^i = ȳ_0;
3: for t = 0, 1, . . . , T − 1 do
4:   for i = 1, 2, . . . , N do
5:     Local Update:
6:     for q = 0, 1, . . . , Q − 1 do
7:       Draw mini-batch samples B_{t,q}^i = {ξ_j^i}_{j=1}^b with |B_{t,q}^i| = b from D_i locally
8:       x_{t,q+1}^i = x_{t,q}^i − ĉ ∇_x f_i(x_{t,q}^i, y_{t,q}^i; B_{t,q}^i)
9:       y_{t,q+1}^i = y_{t,q}^i + c ∇_y f_i(x̃_k, y_{t,q}^i; B_{t,q}^i)
10:      end for
11:    end for
12:   x_{t+1,0}^i = x̄_{t+1} = x̄_t + ηx (1/N) Σ_{i=1}^N (x_{t,Q}^i − x̄_t)
13:   y_{t+1,0}^i = ȳ_{t+1} = ȳ_t + ηy (1/N) Σ_{i=1}^N (y_{t,Q}^i − ȳ_t)
14:   if mod(t + 1, S) = 0 then
15:     k = k + 1
16:     x̃_k = x̄_{t+1}
17:   end if
18: end for
19: Output: x and y chosen uniformly at random from {(x̄_t, ȳ_t)}_{t=1}^T.
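To make the update pattern of Algorithm 1 concrete, the following is a minimal, illustrative sketch on a toy problem (our construction, not part of the paper): each client runs Q local SGDA steps, with the y-ascent evaluated at the stale snapshot x̃, and the server applies the global step sizes ηx, ηy to the averaged local progress.

```python
import numpy as np

def fedsgda_plus(grads, x0, y0, N, T, Q, S, c_hat, c, eta_x, eta_y, rng):
    """Sketch of FedSGDA+ (Algorithm 1).

    grads(i, x, y, x_tilde, rng) returns a stochastic (grad_x, grad_y)
    pair for client i, with grad_y evaluated at the snapshot x_tilde.
    """
    x_bar, y_bar = x0.copy(), y0.copy()
    x_tilde = x0.copy()                       # snapshot used in the y-update
    for t in range(T):
        xs, ys = [], []
        for i in range(N):                    # each client starts from the averages
            x, y = x_bar.copy(), y_bar.copy()
            for _ in range(Q):                # Q local SGDA steps
                gx, gy = grads(i, x, y, x_tilde, rng)
                x = x - c_hat * gx            # descent on x
                y = y + c * gy                # ascent on y (at x_tilde)
            xs.append(x)
            ys.append(y)
        # server: global step sizes scale the averaged local progress
        x_bar = x_bar + eta_x * (np.mean(xs, axis=0) - x_bar)
        y_bar = y_bar + eta_y * (np.mean(ys, axis=0) - y_bar)
        if (t + 1) % S == 0:                  # refresh the snapshot every S rounds
            x_tilde = x_bar.copy()
    return x_bar, y_bar
```

With ηx = ηy = 1 this reduces to plain averaging; the analysis below exploits the global step sizes as extra degrees of freedom to cut the communication complexity.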
(ii) Variance Bound. The following inequalities hold for all ξ (i) ∼ Di and i, j ∈ [N ]:
E∥∇fi (x, y; ξ (i) ) − ∇fi (x, y)∥2 ≤ σ 2 ,  ∥∇fi (x, y) − ∇fj (x, y)∥2 ≤ ζ 2 .
Assumption 3.1 is a standard assumption in stochastic optimization and is used throughout the rest of the paper. In FL algorithms, Assumption 3.1 (ii) is frequently used to bound the variance and the data heterogeneity. The heterogeneity parameter ζ denotes the level of data heterogeneity; in the homogeneous-data configuration, ζ = 0.
Assumption 3.3. (Smoothness). Each local function fi (x, y) has an Lf -Lipschitz continuous gradient, i.e., ∀x1 , x2 and y1 , y2 , we have
∥∇fi (x1 , y1 ) − ∇fi (x2 , y2 )∥ ≤ Lf ∥(x1 , y1 ) − (x2 , y2 )∥ (4)
Assumption 3.3 on smoothness is a standard assumption in stochastic optimization [2, 17].
Assumption 3.4. (Lipschitz continuity in x). For the function F , there exists a constant Gx , such
that for each y ∈ Rd2 , and ∀x, x′ ∈ Rd1 , we have
∥F (x, y) − F (x′ , y)∥ ≤ Gx ∥x − x′ ∥
Under the NC-C setting, the function F (·, ·) is concave in y. Following [9], we define Φ(x) = maxy F (x, y), and the Moreau envelope of Φ(·) is defined below:
Definition 3.5. (Moreau Envelope) A function Φλ (·) is the λ-Moreau envelope of Φ(·), for λ > 0, if ∀x ∈ Rd1 ,
$$\Phi_\lambda(x) = \min_z \; \Phi(z) + \frac{1}{2\lambda}\|z - x\|^2$$
From [38], we know that if a point x is an ε-stationary point of Φλ (·), then x is close to a point x′ that is stationary for Φ(·). Hence, we focus on minimizing ∥∇Φλ (x)∥ under the NC-C setting.
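As a concrete illustration of Definition 3.5 (our construction, not from the paper), the Moreau envelope of the nonsmooth function Φ(x) = |x| can be approximated by a direct grid minimization; for this Φ the envelope is exactly the Huber function, which is smooth even though Φ is not.

```python
import numpy as np

def moreau_envelope(phi, x, lam, z_grid):
    """Phi_lambda(x) = min_z phi(z) + ||z - x||^2 / (2 lam),
    approximated by minimizing over a dense 1-D grid of z values."""
    return float(np.min(phi(z_grid) + (z_grid - x) ** 2 / (2.0 * lam)))

phi = np.abs                        # Phi(x) = |x|, nonsmooth at its minimizer
z_grid = np.linspace(-3.0, 3.0, 20001)
lam = 0.5
# Closed form for comparison: the Moreau envelope of |x| is the Huber function
#   x^2 / (2 lam)   if |x| <= lam,
#   |x| - lam / 2   otherwise.
env_at_1 = moreau_envelope(phi, 1.0, lam, z_grid)   # close to 1 - 0.25 = 0.75
```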
Theorem 3.6. Suppose Assumptions 3.1, 3.2, 3.3, and 3.4 hold, the sequences {xt , yt } are generated by Algorithm 1 with max{cηy , c} ≤ 1/(10QLf ), and ∥ȳt ∥2 ≤ D following [10, 50]. Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\bar x_t)\big\|^2 \le 8L_f\hat c\eta_x\Big(QG_x^2+\frac{\sigma^2}{N}\Big) + 8\,\frac{\mathbb{E}\,\Phi_{1/2L_f}(x_0)-\mathbb{E}\,\Phi_{1/2L_f}(\bar x_T)}{Q\hat c\eta_x T}$$
$$+\,48L_f^2Q[\hat c^2+c^2](\sigma^2+6Q\zeta^2) + 288L_f^2Q^2\hat c^2G_x^2 + 576L_f^3Q^2c^2\Big[\frac{D}{c\eta_yQS}+\frac{c\eta_y\sigma^2}{N}+6L_fQ^2c^2(\sigma^2+6\zeta^2)\Big]$$
$$+\,32L_fG_x\eta_x\hat cSQ\sqrt{G_x^2+\frac{\sigma^2}{N}} + \frac{16L_fD}{c\eta_yQS} + \frac{16L_f(c\eta_y)\sigma^2}{N} + 96L_f^2Q^2c^2(\sigma^2+6\zeta^2)$$
Corollary 3.7. By setting c = ĉ = 1/(10Lf QT 1/3 ), Q = T 1/3 /N , ĉηx = N/(10Lf T ), cηy = 1/(10Lf Q) = N/(10Lf T 1/3 ), and S = T 1/3 , FedSGDA+ has the following convergence rate:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\bar x_t)\big\|^2 \le \frac{80L_f\Delta}{T^{1/3}} + \frac{4(G_x^2+\sigma^2)}{5T^{2/3}} + \frac{24(\sigma^2+6\zeta^2)}{25T^{2/3}} + \frac{72G_x^2}{25T^{2/3}} + \frac{24(\sigma^2+6\zeta^2)}{25T^{2/3}}$$
$$+\,\frac{144L_f}{25T^{2/3}}\Big[\frac{10L_fD}{T^{1/3}} + \frac{\sigma^2}{10L_fT^{1/3}} + \frac{3(\sigma^2+6\zeta^2)}{50L_fT^{2/3}}\Big] + \frac{16G_x}{5T^{1/3}}\sqrt{G_x^2+\frac{\sigma^2}{N}} + \frac{160L_f^2D}{T^{1/3}} + \frac{8\sigma^2}{5T^{1/3}}$$
Remark 3.8. (Complexity) Based on Corollary 3.7, to make (1/T) Σ_{t=0}^{T−1} E∥∇Φ1/2Lf (x̄t )∥2 ≤ ε2 , the communication complexity is T = O(ε−6 ). Choosing b = O(1), the sample complexity is bQT = O(N −1 ε−8 ). This also shows that FedSGDA+ achieves linear speedup with respect to the number of clients.
To solve problem (1) with better convergence complexity under the nonconvex-PL setting, we propose the federated stochastic gradient descent ascent (FedSGDA-M) algorithm with the momentum-based variance reduction technique (see Algorithm 2).
Algorithm 2 FedSGDA-M Algorithm
1: Input: T, step sizes ĉ, c, η; momentum coefficients α, β; number of local updates Q; mini-batch size b; initial mini-batch size B;
2: Initialize: x_0^i = x̄_0 = (1/N) Σ_{i=1}^N x_0^i, y_0^i = ȳ_0 = (1/N) Σ_{i=1}^N y_0^i, u_1^i = ∇_x f(x_0^i, y_0^i; B_0^i) and v_1^i = ∇_y f(x_0^i, y_0^i; B_0^i), where |B_0^i| = B samples are drawn from D_i for i ∈ [N].
3: for t = 1, 2, . . . , T do
4:   for i = 1, 2, . . . , N do
5:     if mod(t, Q) = 0 then
6:       Server Update:
7:       u_t^i = ū_t = (1/N) Σ_{j=1}^N u_t^j
8:       v_t^i = v̄_t = (1/N) Σ_{j=1}^N v_t^j
9:       x_t^i = x̄_t = (1/N) Σ_{j=1}^N (x_{t−1}^j − ĉη u_t^j)
10:      y_t^i = ȳ_t = (1/N) Σ_{j=1}^N (y_{t−1}^j + cη v_t^j)
11:    else
12:      x_t^i = x_{t−1}^i − ĉη u_t^i
13:      y_t^i = y_{t−1}^i + cη v_t^i
14:    end if
15:    Draw mini-batch samples B_t^i = {ξ_j^i}_{j=1}^b with |B_t^i| = b from D_i locally
16:    u_{t+1}^i = ∇_x f_i(x_t^i, y_t^i; B_t^i) + (1 − α)(u_t^i − ∇_x f_i(x_{t−1}^i, y_{t−1}^i; B_t^i))
17:    v_{t+1}^i = ∇_y f_i(x_t^i, y_t^i; B_t^i) + (1 − β)(v_t^i − ∇_y f_i(x_{t−1}^i, y_{t−1}^i; B_t^i))
18:  end for
19: end for
20: Output: x and y chosen uniformly at random from {(x̄_t, ȳ_t)}_{t=1}^T.
In FedSGDA-M, each client initializes the gradient estimators {ui1 , v1i } with a stochastic gradient, as seen in line 2 of Algorithm 2. After that, each client updates the model variables {xit , yti } locally as in the standard stochastic gradient descent ascent method (lines 12-13 of Algorithm 2). Compared with local momentum SGDA [50], the key difference is that clients use the variance-reduced gradient estimators {uit , vti } constructed in lines 15-17 of Algorithm 2. For the update of {ut , vt }, the coefficients should satisfy 0 < α < 1 and 0 < β < 1. Every Q iterations, clients transmit their model parameters and gradient estimators to the server, which computes {x̄t , ȳt , ūt , v̄t }. The server then sends the averaged model variables and gradient estimators back to each client to update the local variables, as shown in lines 5-10 of Algorithm 2.
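The estimator update in lines 16-17 is a STORM-style [7] momentum variance-reduction step. The following isolated sketch (our construction; `grad_fn` and the toy objective are illustrative assumptions) shows the rule u_{t+1} = ∇f(x_t; B) + (1 − α)(u_t − ∇f(x_{t−1}; B)): because the same mini-batch B is evaluated at both the current and previous iterates, most of the sampling noise cancels in the correction term.

```python
import numpy as np

def storm_update(u, grad_fn, x_cur, x_prev, batch, alpha):
    """One momentum-based variance-reduction step (STORM style):
    the same mini-batch is used at x_cur and x_prev, so the
    difference term is nearly noise-free."""
    return grad_fn(x_cur, batch) + (1.0 - alpha) * (u - grad_fn(x_prev, batch))

def run_storm_descent(grad_fn, x0, steps, lr, alpha, batch_size, rng):
    """Plain gradient descent driven by the STORM estimator u."""
    x_prev = x0
    u = grad_fn(x_prev, rng.standard_normal(batch_size))  # initial estimate
    for _ in range(steps):
        x_cur = x_prev - lr * u
        batch = rng.standard_normal(batch_size)
        u = storm_update(u, grad_fn, x_cur, x_prev, batch, alpha)
        x_prev = x_cur
    return x_cur
```

In FedSGDA-M the same rule is applied per client to both u (for x, with coefficient α) and v (for y, with coefficient β).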
Definition 3.10. According to Assumption 3.9, there exists at least one solution to the problem
maxy F (x, y) for any x. Here we define Φ(x) = F (x, y ∗ (x)) = maxy F (x, y). We use ε-stationary
point of Φ(x), i.e. ∥∇Φ(x)∥ ≤ ε as the convergence metric.
From [50], we know that Φ(x) is differentiable and (L + κL)-smooth and that y ∗ (·) is κ-Lipschitz. Given that ∇y F (x̄t , y ∗ (x̄t )) = 0, we have ∇Φ (x̄t ) = ∇x F (x̄t , y ∗ (x̄t )) + ∇y F (x̄t , y ∗ (x̄t )) · ∂y ∗ (x̄t ) = ∇x F (x̄t , y ∗ (x̄t )), which is widely used in the analysis of nonconvex-PL [10] and nonconvex-strongly-concave minimax optimization [62, 38]. We then present the convergence analysis of FedSGDA-M; the proofs are provided in the supplementary materials.
Assumption 3.11. (Lipschitz Smoothness) Each component function fi (x, y; ξ) has an Lf -Lipschitz gradient, i.e., ∀x1 , x2 and y1 , y2 , we have
E∥∇fi (x1 , y1 ; ξ) − ∇fi (x2 , y2 ; ξ)∥ ≤ Lf ∥(x1 , y1 ) − (x2 , y2 )∥ (5)
By Assumption 3.11 and the convexity of the norm, F (x, y) also has an Lf -Lipschitz gradient. Indeed,
$$\|\nabla_x F(x_1,y_1)-\nabla_x F(x_2,y_2)\| = \Big\|\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\nabla_x f_i(x_1,y_1;\xi)-\nabla_x f_i(x_2,y_2;\xi)\big]\Big\|$$
$$\le \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla_x f_i(x_1,y_1;\xi)-\nabla_x f_i(x_2,y_2;\xi)\| \le L_f\|(x_1,y_1)-(x_2,y_2)\|$$
Assumption 3.11 is standard in optimization analysis. Several widely used single-machine stochastic algorithms, such as SPIDER [13] and STORM [7], rely on this assumption. It is also used by numerous FL algorithms, including MIME [32], Fed-GLOMO [8], STEM [34], and FAFED [58].
Theorem 3.12. Suppose that the sequence {x̄t , ȳt }Tt=0 is generated by Algorithm 2. Under Assumptions 3.1, 3.9, and 3.11, given η = 1/(20QL), α = c1 η 2 , β = c2 η 2 , c1 = 30L2 /(bN κ1−ν ), c2 = 30L2 /(bN κ2−2ν ), c = 1/(6κ1−ν ), ĉ = 1/(54κ3−ν ), where ν ∈ [0, 1], we have
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla\Phi(\bar x_t)\|^2 \le \frac{2[\Phi(\bar x_0)-\Phi(\bar x_T)]}{\hat c\eta T} + \frac{3\sigma^2}{\alpha TBN} + \frac{36L_f^2\sigma^2}{\mu^2\beta TBN} + \frac{12L_f^2}{c\eta\mu^2T}\big[\Phi(\bar x_0)-F(\bar x_0,\bar y_0)\big]$$
$$+\,\frac{6\alpha\sigma^2}{Nb} + \frac{72\beta\sigma^2L_f^2}{Nb\mu^2} + \Big[\frac{\sigma^2(c_1^2+c_2^2)}{30bL^2} + \frac{\zeta^2(c_1^2+c_2^2)}{12L^2}\Big]\kappa^2\eta^2$$
Corollary 3.13. By setting b = O(κν ) for ν ∈ [0, 1], c1 = 30L2 /(bN κ1−ν ), c2 = 30L2 /(bN κ2−2ν ), c = 1/(6κ1−ν ), ĉ = 1/(54κ3−ν ), T = κ3−ν T0 , Q = T0^{1/3} /N 2/3 , η = 1/(20QL) = N 2/3 /(20LT0^{1/3} ), B = T0^{1/3} bκ1−ν /N 2/3 , we have α = c1 η 2 = 3N 1/3 /(40T0^{2/3} bκ1−ν ), β = c2 η 2 = 3N 1/3 /(40T0^{2/3} bκ2−2ν ), and
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla\Phi(\bar x_t)\|^2 \le \frac{2160L[\Phi(\bar x_0)-\Phi^*]}{(NT_0)^{2/3}} + \frac{40\sigma^2}{\kappa^{3-\nu}(NT_0)^{2/3}} + \frac{480\sigma^2}{\kappa^2(NT_0)^{2/3}}$$
$$+\,\frac{240L_f}{(NT_0)^{2/3}}\big[\Phi(\bar x_0)-F(\bar x_0,\bar y_0)\big] + \frac{3\sigma^2}{20b\kappa(NT_0)^{2/3}} + \frac{9\sigma^2}{5(NT_0)^{2/3}} + \Big[\frac{27\sigma^2}{20b} + \frac{15\zeta^2}{40}\Big]\frac{1}{(NT_0)^{2/3}}$$
where Φ∗ is the optimal value.
Remark 3.14. (Complexity) To make (1/T) Σ_{t=0}^{T−1} E ∥∇Φ (x̄t )∥2 ≤ ε2 , we get T0 = O(N −1 ε−3 ) and T = O(κ3−ν N −1 ε−3 ). With b = κν , the communication complexity is T/Q = κ3−ν (N T0 )2/3 = κ3−ν ε−2 and the sample complexity is bT = O(κ3 N −1 ε−3 ). When ν = 1, i.e., b = κ, the communication complexity is T/Q = κ2 ε−2 for finding an ε-stationary point. The sample complexity bT = O(κ3 N −1 ε−3 ) matches the complexity achieved by single-machine algorithms such as SREDA and Acc-MDA [42, 29], but unlike these algorithms we do not require a large batch size b. Moreover, O(κ3 N −1 ε−3 ) exhibits a linear speedup over the aforementioned single-machine algorithms.
Assumption 3.15. Each local function fi (x, y) is µ-strongly concave in y ∈ Y, i.e., ∀x ∈ X and y1 , y2 ∈ Y, we have
∥∇y fi (x, y1 ) − ∇y fi (x, y2 )∥ ≥ µ∥y1 − y2 ∥
When the function F (x, y) is strongly concave in y ∈ Y, there exists a unique solution to the problem maxy∈Y F (x, y) for any x. Since the PL condition is weaker than strong concavity, the convergence results of FedSGDA-M in Algorithm 2 under the NC-PL setting also apply to NC-SC problems, and FedSGDA-M has a sample complexity of O(κ3 N −1 ε−3 ) and a communication complexity of O(κ2 ε−2 ) under the nonconvex-strongly-concave setting.
4 Experiments
We conduct experiments on AUROC maximization and fair classification tasks to verify the efficiency of our algorithms under the nonconvex-strongly-concave and nonconvex-concave settings. Experiments are run on a computer cluster with AMD EPYC 7513 processors and NVIDIA RTX A6000 GPUs. The code is available at https://github.com/xidongwu/Federated-Minimax-and-Conditional-Stochastic-Optimization.
Datasets and Models: We test the performance of the algorithms on three typical datasets: Fashion-MNIST, CIFAR-10, and Tiny-ImageNet. Fashion-MNIST has 70,000 gray-scale 28 × 28 images (10 categories; 60,000 training images and 10,000 testing images). CIFAR-10 consists of 50,000 training images and 10,000 testing images; each image is a 3 × 32 × 32 color-image array. Tiny-ImageNet has 200 classes of 64 × 64 color images, and each class has 500 training images, 50 validation images, and 50 test images. For the Fashion-MNIST and CIFAR-10 datasets, we choose the convolutional neural network from [28] (details are given in the supplementary materials). For Tiny-ImageNet, we choose ResNet-18 [24] as the model.
First, we follow [50, 28] and train fair classification networks by minimizing the maximum loss over different categories:
$$\min_x \max_{y\in\mathcal{Y}} \ \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_c L_{ic}(x) \quad \text{s.t.}\ \ \mathcal{Y}=\Big\{y \,\Big|\, y_c \ge 0,\ \sum_{c=1}^{C} y_c = 1\Big\}$$
where Lic denotes the cross-entropy loss corresponding to class c among the C different classes and x denotes the CNN model parameters. Clearly, this problem is nonconvex in x (the deep model parameters) and concave in y. We compare our algorithm (FedSGDA+) with local SGDA+ under varying models, datasets, numbers of local updates, and step sizes. Although the constraint is not considered in the theoretical analysis of local SGDA+ and FedSGDA+, FedSGDA+ still shows better performance than local SGDA+.
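The inner maximization over the simplex can be made concrete (our construction, illustrating the objective above, not the authors' implementation): for fixed per-class losses L_c(x), the maximum over {y ≥ 0, Σ_c y_c = 1} of Σ_c y_c L_c(x) is attained at a vertex, i.e., it equals the worst per-class loss, and projected gradient ascent on y drives the weights toward that vertex.

```python
import numpy as np

def fair_objective(class_losses, y):
    """Weighted objective sum_c y_c * L_c(x) for simplex weights y."""
    return float(np.dot(y, class_losses))

def project_simplex(v):
    """Euclidean projection onto {y >= 0, sum(y) = 1}
    (standard sort-based algorithm), applied after each ascent step on y."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ascend_y(class_losses, steps=50, lr=0.5):
    """Projected gradient ascent on y for fixed per-class losses."""
    y = np.full(len(class_losses), 1.0 / len(class_losses))
    for _ in range(steps):
        y = project_simplex(y + lr * np.asarray(class_losses))
    return y
```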
The network has 20 clients. The datasets are partitioned into disjoint sets across all clients, and each client holds part of the data from all the classes [34]. We initialize the ResNet-18 with pre-trained weights in PyTorch. In the experiments, we run a grid search for step sizes and choose the step size for the primal variable in the set {0.01, 0.03, 0.05, 0.1, 0.3} and that for the dual variable in the set {0.001, 0.01, 0.1}. We choose the global step size in the set {0.1, 0.5, 1, 1.5, 2}. The batch size b is 50, the inner-loop number Q is selected from {20, 50, 100}, and the outer-loop number S is selected from {1, 5, 10} for FedSGDA+ and {1, 5, 10, Q} for local SGDA+.
Figure 1 shows that FedSGDA+ has a better convergence rate than local SGDA+. This confirms that our algorithm can effectively accelerate SGDA by exploiting the structure of federated learning. Due to the page limitation, the ablation analysis of the step size is presented in the supplementary materials.
Figure 1: Test Accuracy vs the number of communication rounds during the training phase.
[39] showed that the AUROC maximization problem can be reformulated as the following nonconvex-strongly-concave minimax optimization problem:
$$\min_{m\in\mathbb{R}^d,\ (a,b)\in\mathbb{R}^2}\ \max_{w\in\mathbb{R}}\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\big[f_i(m, a, b, w; \xi_i)\big] \qquad (6)$$
where
$$f(m, a, b, w; \xi) = (1-p)\,(h(m; x) - a)^2\,\mathbb{I}_{[y=1]} + p\,(h(m; x) - b)^2\,\mathbb{I}_{[y=-1]}$$
$$+\,2(1+w)\big[p\,h(m; x)\,\mathbb{I}_{[y=-1]} - (1-p)\,h(m; x)\,\mathbb{I}_{[y=1]}\big] - p(1-p)w^2$$
Here ξi = (x, y) ∼ Di denotes a random data point, where x represents the data features and y ∈ Y = {−1, +1} is the label. h(m; x) denotes the prediction score of the data point x computed by a model with parameter m, and p = Pr(y = 1) = Ey [I[y=1] ] denotes the prior probability of the positive class.
Figure 2: AUROC scores on the test datasets vs the number of communication rounds during the
training phase.
Following [19, 67], we constructed imbalanced binary-class versions of the datasets as follows: first, the lower half of the classes (0 - 4) in the original Fashion-MNIST and CIFAR-10, and classes (0 - 99) in Tiny-ImageNet, are converted into the negative class, while the remaining classes are treated as the positive class. Then 80% of the negative data points are randomly dropped in each dataset, and the datasets are evenly divided into disjoint sets across 16 clients. In this setting, each client holds a completely different imbalanced dataset. In the experiments, we use Xavier normal initialization for the deep models.
We compare our algorithm (i.e., FedSGDA-M) with local SGDA [10, 50], CODA+ [19, 67], Mo-
mentum SGDA [50], CODASCA [67] and SAGDA [63] as baselines in AUROC maximization. In
experiments, we carefully tune hyperparameters for all methods. We run a grid search for step size,
and choose the step size for the primal variable in the set {0.001, 0.005, 0.01} and that for dual
variable in the set {0.0001, 0.001, 0.01}. We choose the global step size from {0.9, 1, 1.5, 2} for
CODASCA and SAGDA. We choose the momentum parameter in Local Momentum SGDA in the
set {0.1, 0.5, 0.9}. The α and β in FedSGDA-M are chosen from {0.1, 0.5, 0.9}. The batch-size b is
50 and the inner loop number Q ∈ {10, 20, 50}.
As shown in Figure 2, we compare the performance of FedSGDA-M and other baseline methods
against the number of communication rounds. Figure 2 shows that our algorithms consistently
outperform the other baseline algorithms on testing datasets, which validates the efficacy of our
algorithms. Due to space limitation, other test results are provided in the supplementary materials.
Limitation. Minimax optimization has many applications; a more comprehensive study of our proposed algorithms on these tasks is left to future work, since the theoretical analysis is the main contribution of this paper.
5 Conclusion
In this paper, we study a class of federated nonconvex minimax optimization problems (1). We consider the three most common settings (NC-SC, NC-PL, NC-C). Under the NC-C setting, we propose FedSGDA+ and prove that it has the best communication complexity of O(ε−6 ). It also achieves a linear speedup with respect to the number of clients. Under the NC-SC and NC-PL settings, we propose FedSGDA-M with a variance reduction technique and prove that our algorithm (FedSGDA-M) has the best sample complexity (O(κ3 N −1 ε−3 )) and the best communication complexity (O(κ2 ε−2 )). We also prove that FedSGDA-M enjoys linear speedup with respect to the number of clients. Therefore, we improve the existing complexity results for the most common nonconvex minimax optimization problems under the federated learning setting.
References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-
parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
[2] Runxue Bao, Bin Gu, and Heng Huang. An accelerated doubly stochastic gradient method with faster
explicit model identification. In Proceedings of the 31st ACM International Conference on Information &
Knowledge Management, pages 57–66, 2022.
[3] Runxue Bao, Xidong Wu, Wenhan Xian, and Heng Huang. Doubly sparse asynchronous learning for
stochastic composite optimization. In Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, IJCAI, pages 1916–1922, 2022.
[4] Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich,
and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. arXiv
preprint arXiv:2106.08315, 2021.
[5] Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that
converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR,
2018.
[6] Lesi Chen, Boyuan Yao, and Luo Luo. Faster stochastic algorithms for minimax optimization under Polyak-Łojasiewicz condition. Advances in Neural Information Processing Systems, 35:13921–13932, 2022.
[7] Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd.
Advances in neural information processing systems, 32, 2019.
[8] Rudrajit Das, Abolfazl Hashemi, Sujay Sanghavi, and Inderjit S Dhillon. Improved convergence rates for
non-convex federated learning with compression. arXiv e-prints, pages arXiv–2012, 2020.
[9] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions.
SIAM Journal on Optimization, 29(1):207–239, 2019.
[10] Yuyang Deng and Mehrdad Mahdavi. Local stochastic gradient descent ascent: Convergence analysis
and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pages
1387–1395. PMLR, 2021.
[11] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Distributionally robust federated
averaging. arXiv preprint arXiv:2102.12660, 2021.
[12] Jelena Diakonikolas, Constantinos Daskalakis, and Michael I Jordan. Efficient methods for structured
nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and
Statistics, pages 2746–2754. PMLR, 2021.
[13] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex opti-
mization via stochastic path-integrated differential estimator. Advances in Neural Information Processing
Systems, 31, 2018.
[14] Tanner Fiez, Lillian Ratliff, Eric Mazumdar, Evan Faulkner, and Adhyyan Narang. Global convergence
to local minmax equilibrium in classes of nonconvex zero-sum games. Advances in Neural Information
Processing Systems, 34:29049–29063, 2021.
[15] Hongchang Gao, Junyi Li, and Heng Huang. On the convergence of local stochastic compositional gradient
descent with momentum. In International Conference on Machine Learning, pages 7017–7035. PMLR,
2022.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing
systems, pages 2672–2680, 2014.
[17] Bin Gu, Runxue Bao, Chenkang Zhang, and Heng Huang. New scalable and efficient online pairwise
learning algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved
training of wasserstein gans. Advances in neural information processing systems, 30, 2017.
[19] Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, and Tianbao Yang. Communication-
efficient distributed stochastic auc maximization with deep neural networks. In International Conference
on Machine Learning, pages 3864–3874. PMLR, 2020.
[20] Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms
of the adam family and beyond. arXiv preprint arXiv:2104.14840, 2021.
[21] Zhishuai Guo, Rong Jin, Jiebo Luo, and Tianbao Yang. Fedxl: Provable federated learning for deep x-risk
optimization. 2023.
[22] Songyang Han, Sanbao Su, Sihong He, Shuo Han, Haizhao Yang, and Fei Miao. What is the solution for
state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705, 2022.
[23] Huan He, Shifan Zhao, Yuanzhe Xi, Joyce Ho, and Yousef Saad. Gda-am: On the effectiveness of solving minimax optimization via anderson mixing. In International Conference on Learning Representations, 2021.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[25] Sihong He, Songyang Han, Sanbao Su, Shuo Han, Shaofeng Zou, and Fei Miao. Robust multi-agent
reinforcement learning with state uncertainty. Transactions on Machine Learning Research, 2023. ISSN
2835-8856. URL https://openreview.net/forum?id=CqTkapZ6H9.
[26] Charlie Hou, Kiran K Thekumparampil, Giulia Fanti, and Sewoong Oh. Efficient algorithms for federated
saddle point optimization. arXiv preprint arXiv:2102.06333, 2021.
[27] Feihu Huang and Songcan Chen. Near-optimal decentralized momentum method for nonconvex-pl minimax
problems. arXiv preprint arXiv:2304.10902, 2023.
[28] Feihu Huang, Xidong Wu, and Heng Huang. Efficient mirror descent ascent methods for nonsmooth
minimax problems. Advances in Neural Information Processing Systems, 34:10431–10443, 2021.
[29] Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Accelerated zeroth-order and first-order
momentum methods from mini to minimax optimization. J. Mach. Learn. Res., 23:36–1, 2022.
[30] Feihu Huang, Xidong Wu, and Zhengmian Hu. Adagda: Faster adaptive gradient descent ascent methods
for minimax optimization. In International Conference on Artificial Intelligence and Statistics, pages
2365–2389. PMLR, 2023.
[31] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient
descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[32] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich,
and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning.
arXiv preprint arXiv:2008.03606, 2020.
[33] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In Inter-
national conference on machine learning, pages 5132–5143. PMLR, 2020.
[34] Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod
Varshney. Stem: A stochastic two-sided momentum algorithm achieving near-optimal sample and com-
munication complexities for federated learning. Advances in Neural Information Processing Systems, 34,
2021.
[35] Weiwei Kong and Renato DC Monteiro. An accelerated inexact proximal point method for solving
nonconvex-concave min-max problems. SIAM Journal on Optimization, 31(4):2558–2585, 2021.
[36] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated
optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
[37] Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, and Dacheng Tao. Local adagrad-type algorithm for
stochastic convex-concave minimax problems. arXiv preprint arXiv:2106.10022, 2021.
[38] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax
problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
[39] Mingrui Liu, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Stochastic AUC maximization with deep
neural networks. arXiv preprint arXiv:1908.10831, 2019.
[40] Mingrui Liu, Wei Zhang, Youssef Mroueh, Xiaodong Cui, Jarret Ross, Tianbao Yang, and Payel Das. A
decentralized parallel algorithm for training generative adversarial nets. Advances in Neural Information
Processing Systems, 33:11056–11070, 2020.
[41] Zhuqing Liu, Xin Zhang, Songtao Lu, and Jia Liu. Precision: Decentralized constrained min-max learning
with low communication and sample complexities. arXiv preprint arXiv:2303.02532, 2023.
[42] Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. Stochastic recursive gradient descent ascent for
stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing
Systems, 33:20566–20577, 2020.
[43] Pouria Mahdavinia, Yuyang Deng, Haochuan Li, and Mehrdad Mahdavi. Tight analysis of extra-gradient
and optimistic gradient methods for nonconvex minimax problems. Advances in Neural Information
Processing Systems, 35:31213–31225, 2022.
[44] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and
statistics, pages 1273–1282. PMLR, 2017.
[45] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a
class of non-convex min-max games using iterative first order methods. Advances in Neural Information
Processing Systems, 32, 2019.
[46] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: provable
algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.
[47] Mohammad Rasouli, Tao Sun, and Ram Rajagopal. Fedgan: Federated generative adversarial networks for
distributed data. arXiv preprint arXiv:2006.07228, 2020.
[48] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv
Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295,
2020.
[49] Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning:
The case of affine distribution shifts. Advances in Neural Information Processing Systems, 33:21554–21565,
2020.
[50] Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod Varshney. Federated minimax optimization:
Improved convergence analyses and algorithms. In International Conference on Machine Learning, pages
19683–19730. PMLR, 2022.
[51] Pranay Sharma, Rohan Panda, and Gauri Joshi. Federated minimax optimization with client heterogeneity.
arXiv preprint arXiv:2302.04249, 2023.
[52] Zhenyu Sun and Ermin Wei. A communication-efficient algorithm with linear convergence for federated
minimax learning. arXiv preprint arXiv:2206.01132, 2022.
[53] Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated
bilevel, minimax, and compositional optimization. In International Conference on Machine Learning,
pages 21146–21179. PMLR, 2022.
[54] Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for
smooth minimax optimization. Advances in Neural Information Processing Systems, 32, 2019.
[55] Hoi-To Wai, Mingyi Hong, Zhuoran Yang, Zhaoran Wang, and Kexin Tang. Variance reduced policy
evaluation with smooth function approximation. Advances in Neural Information Processing Systems, 32:
5784–5795, 2019.
[56] Jingkang Wang, Tianyun Zhang, Sijia Liu, Pin-Yu Chen, Jiacen Xu, Makan Fardad, and Bo Li. Adversarial
attack generation empowered by min-max optimization. Advances in Neural Information Processing
Systems, 34:16020–16033, 2021.
[57] Xidong Wu, Zhengmian Hu, and Heng Huang. Decentralized riemannian algorithm for nonconvex minimax
problems. arXiv preprint arXiv:2302.03825, 2023.
[58] Xidong Wu, Feihu Huang, Zhengmian Hu, and Heng Huang. Faster adaptive federated learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10379–10387, 2023.
[59] Xidong Wu, Jianhui Sun, Zhengmian Hu, Junyi Li, Aidong Zhang, and Heng Huang. Federated conditional
stochastic optimization. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
[60] Yihan Wu, Hongyang Zhang, and Heng Huang. Retrievalguard: Provably robust 1-nearest neighbor image
retrieval. In International Conference on Machine Learning, pages 24266–24279. PMLR, 2022.
[61] Yihan Wu, Aleksandar Bojchevski, and Heng Huang. Adversarial weight perturbation improves generaliza-
tion in graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37,
pages 10417–10425, 2023.
[62] Wenhan Xian, Feihu Huang, Yanfu Zhang, and Heng Huang. A faster decentralized algorithm for
nonconvex minimax problems. Advances in Neural Information Processing Systems, 34:25865–25877,
2021.
[63] Haibo Yang, Zhuqing Liu, Xin Zhang, and Jia Liu. SAGDA: Achieving O(ϵ⁻²) communication complexity
in federated min-max learning. arXiv preprint arXiv:2210.00611, 2022.
[64] Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of
nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:
1153–1165, 2020.
[65] Junchi Yang, Xiang Li, and Niao He. Nest your adaptive algorithm for parameter-agnostic nonconvex
minimax optimization. arXiv preprint arXiv:2206.00743, 2022.
[66] Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax
optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics,
pages 5485–5517. PMLR, 2022.
[67] Zhuoning Yuan, Zhishuai Guo, Yi Xu, Yiming Ying, and Tianbao Yang. Federated deep AUC maximization
for heterogeneous data with a constant communication complexity. In International Conference on Machine
Learning, pages 12219–12229. PMLR, 2021.
[68] Xin Zhang, Zhuqing Liu, Jia Liu, Zhengyuan Zhu, and Songtao Lu. Taming communication and sam-
ple complexities in decentralized policy evaluation for cooperative multi-agent reinforcement learning.
Advances in Neural Information Processing Systems, 34:18825–18838, 2021.
[69] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. Sapd+: An accelerated stochastic method for
nonconvex-concave minimax problems. arXiv preprint arXiv:2205.15084, 2022.
[70] Zhengmian Hu, Xidong Wu, and Heng Huang. Beyond Lipschitz smoothness: A tighter analysis for
nonconvex optimization. In International Conference on Machine Learning (ICML), 2023.
[71] Hanhan Zhou, Tian Lan, Guru Prasadh Venkataramani, and Wenbo Ding. Federated learning with online
adaptive heterogeneous local models. In Workshop on Federated Learning: Recent Advances and New
Challenges (in Conjunction with NeurIPS 2022), 2022.
[72] Hanhan Zhou, Tian Lan, Guru Prasadh Venkataramani, and Wenbo Ding. Every parameter matters:
Ensuring the convergence of federated learning with dynamic heterogeneous models reduction. In
Advances in Neural Information Processing Systems, volume 36, 2023.