Solving A Class of Non-Convex Minimax Optimization in Federated Learning
Abstract
1 Introduction
Nonconvex minimax optimization has been actively applied to solve numerous machine learning problems, such as adversarial training [56, 61], generative adversarial networks (GANs) [16, 18], policy evaluation in reinforcement learning [55, 22, 25], robust optimization [11, 60], AUROC (area under the ROC curve) maximization [39], etc. Many single-machine minimax optimization algorithms have been proposed to address these problems.
† Equal contribution
§ This work was partially supported by NSF CNS 2213700 and CCF 2217071 at UVA.
∗ This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775 at Pitt and UMD.
Can we design stochastic gradient descent ascent methods with better sample and communication complexities to match the convergence rate of single-machine counterparts for solving problem (1)?
In this paper, we provide an affirmative answer to the aforementioned question and propose a class of
algorithms to solve the problem (1) under different settings. In particular, we consider three most
common classes of nonconvex minimax optimization problems: 1) NC-C: NonConvex in x, Concave
in y; 2) NC-SC: NonConvex in x, Strongly-Concave in y; 3) NC-PL: NonConvex in x, PL-condition
in y. For each of these problems, we propose a new algorithm with provably better convergence rate
(please see Table 1) and provide a theoretical analysis. Our main contributions are four-fold:
1) NC-C setting. We propose FedSGDA+ and prove that it has a sample complexity of O(N −1 ε−8 ) and a communication complexity of O(ε−6 ). FedSGDA+ exploits the structure of FL by adding a global learning rate at the server, which reduces the communication complexity from O(ε−7 ) in [50] to O(ε−6 ). It also achieves a linear speedup with respect to the number of clients.
2) NC-PL setting. We propose a federated stochastic gradient descent ascent (FedSGDA-M) algorithm with the momentum-based variance reduction technique. It has the best sample complexity of O(κ3 N −1 ε−3 ) and the best communication complexity of O(κ2 ε−2 ). Compared with existing momentum-based variance reduction algorithms, our result employs a novel theoretical analysis framework that yields a tighter convergence rate (i.e., our rate removes a logarithmic term appearing in existing works).
3) NC-SC setting. FedSGDA-M can be directly applied to the NC-SC setting since the PL condition is weaker than strong concavity. Our algorithm is the first to reach the sample complexity of O(κ3 N −1 ε−3 ) and the communication complexity of O(κ2 ε−2 ) in the federated NC-SC setting.
Table 1: Complexity comparison of existing nonconvex federated minimax algorithms for finding an ε-stationary point. Sample complexity is the total number of calls to the incremental first-order oracle (IFO) needed to reach an ε-stationary point. Communication complexity denotes the total number of back-and-forth communication rounds between clients and the server. Here, N is the number of clients and κ = Lf /µ is the condition number.
4) Extensive experimental results on fair classification and AUROC maximization confirm the
effectiveness of our proposed algorithm.
2 Related Work
Nonconvex-Concave (NC-C) setting. [31, 46, 45, 54, 35] proposed various deterministic and
stochastic algorithms to solve the NC-C minimax problems. All of these algorithms, however, have a
double-loop structure and are thus relatively complicated to implement. They decouple the minimax
problem into a minimization problem and a maximization problem and use a nested loop to update
variable y while keeping variable x constant. Subsequently, [38] studied the complexity of the single-loop algorithm SGDA for the NC-C minimax problem and proved that the stochastic algorithm achieves O(ε−8 ) complexity. SGDA is a direct extension of SGD from minimization to minimax optimization problems. Recently, [43] provided a unified analysis for the convergence of OGDA and EG methods in the nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings.
Nonconvex-Strongly-Concave (NC-SC) setting. [38] analyzed the stochastic gradient descent
ascent (SGDA) algorithm and proved that SGDA has O(κ3 ε−4 ) stochastic gradient complexity.
To improve the convergence rate, [42] proposed a stochastic GDA algorithm (i.e., SREDA) with a double-loop structure based on the variance reduction technique of SPIDER [13] and reduced the complexity to O(κ3 ε−3 ). [29] used the momentum-based variance reduction technique of STORM [7] and proposed Acc-MDA, a single-loop algorithm that achieves the same convergence result as SREDA. Furthermore, adaptive minimax algorithms have been introduced [28, 30] to solve the
nonsmooth nonconvex-strongly-concave minimax problems based on dynamic mirror functions.
[65] used a nested adaptive framework to design parameter-agnostic nonconvex minimax algorithm.
[69] proved that the VR-based SAPD+ has a complexity of O(κ2 ε−3 ). However, whether the best convergence result of O(ε−3 ) achieved by single-machine methods can be matched in the federated setting remains an open question. In addition, [23] conducted an in-depth investigation of the limitations of the GDA algorithm (e.g., smaller learning rates, cycling/divergence issues) and gave a systematic analysis of how to improve GDA dynamics.
Nonconvex-Nonconcave (NC-NC) setting. There is extensive research on NC-NC problems [12], and the nonconvex-PL condition defines a special class of functions that interests us the most. The Polyak-Łojasiewicz (PL) condition does not require the objective to be concave, and recent works show that it can hold during the training of overparameterized neural networks with random initialization [1, 5]. Recently, many deterministic methods [45, 64, 14] have been proposed for NC-NC problems under the NC-PL setting. [20] proposed a PDAda method for nonconvex-PL minimax optimization under an additional concavity restriction, and [6] focused on finite-sum nonconvex-PL minimax optimization. The stochastic alternating GDA and stochastic smoothed GDA proposed in [66] achieve complexities of O(κ4 ε−4 ) and O(κ2 ε−4 ), respectively.
Distributed minimax optimization has developed rapidly in recent years, driven by the need to train on large-scale datasets [40]. Under the serverless decentralized setting, algorithms for nonconvex minimax optimization have been studied extensively in the nonconvex-strongly-concave setting [4, 62, 68, 41, 57] and the nonconvex-PL setting [27].
In the FL setting, some works analyzed algorithms for convex-concave problems [11, 37, 26, 52].
However, as nonconvex models (e.g., deep neural networks) become increasingly prevalent, there is a growing need for federated nonconvex minimax optimization, such as federated adversarial training [49], federated deep AUROC maximization [19], and federated GANs [47]. [19] and [67] focus on imbalanced-data tasks; they reformulated the AUROC maximization problem as a minimax problem under the FL setting. However, their analysis relies on the strict assumption that deep models satisfy the PL condition and only covers PL-strongly-concave minimax problems. [49] converted robust federated learning into a minimax problem in which only the model parameters, namely the min variables, are exchanged among clients via the server. [10] proposed Local SGDA and Local SGDA+. Local SGDA is the local-update version of the SGDA algorithm in FL. Different from Local SGDA, in Local SGDA+ the max variable y is updated with a constant min variable x̃, and the snapshot x̃ is updated every S iterations. Afterward, [50] improved the sample and communication complexities of Local SGDA for the NC-SC and NC-PL settings, and of Local SGDA+ for the NC-C setting. [50] also proposed Momentum Local SGDA, which achieves the same theoretical results as Local SGDA for the NC-PL and NC-SC settings. In addition, [53] designed FEDNEST with two nested loops. Although FEDNEST incorporates FEDINN (a federated stochastic variance reduction algorithm), its convergence complexity is not improved over vanilla Local SGDA. Afterward, [63] proposed SAGDA under the NC-PL setting, which yields a better communication complexity (i.e., O(ε−2 )). However, its analysis does not account for the effect of the condition number κ. More recently, [51] considered federated minimax optimization with client heterogeneity in the nonconvex-concave and nonconvex-strongly-concave settings.
Relation to Existing Works. We propose FedSGDA+ for the NC-C setting and FedSGDA-M for the NC-SC and NC-PL settings. In the NC-C setting, we discover that adding a global step size leads to better communication complexity. The theoretical analysis then becomes more challenging, as we must not only account for the complicated structure of the minimax problem but also handle the local and global updates separately. For FedSGDA-M, we relax the requirement on the step size (a specifically designed, unnatural step size is often required in STORM-like approaches [7, 34]), which requires novel proof techniques. As a result, our bound does not contain a logarithmic term and gives a tighter convergence rate (see Contribution 1 in [34]). In addition, through a different theoretical framework, our improved sample complexity does not rely on a large batch size, whereas the single-machine minimax algorithm with variance reduction (Acc-MDA) in [29] (Table 2) needs a large batch to achieve the same sample complexity.
Algorithm 1 FedSGDA+ Algorithm
1: Input: T, local step sizes ĉ, c, global step sizes ηx, ηy, k = 0, number of inner updates Q and outer updates S, mini-batch size b, N clients;
2: Initialize: x_0^i = x̃_0 = x̄_0, y_0^i = ȳ_0;
3: for t = 0, 1, . . . , T − 1 do
4:   for i = 1, 2, . . . , N do
5:     Local Update:
6:     for q = 0, 1, . . . , Q − 1 do
7:       Draw mini-batch samples B_{t,q}^i = {ξ_j^i}_{j=1}^b with |B_{t,q}^i| = b from D_i locally
8:       x_{t,q+1}^i = x_{t,q}^i − ĉ ∇_x f_i(x_{t,q}^i, y_{t,q}^i; B_{t,q}^i)
9:       y_{t,q+1}^i = y_{t,q}^i + c ∇_y f_i(x̃_k, y_{t,q}^i; B_{t,q}^i)
10:      end for
11:    end for
12:   x_{t+1,0}^i = x̄_{t+1} = x̄_t + ηx (1/N) Σ_{i=1}^N (x_{t,Q}^i − x̄_t)
13:   y_{t+1,0}^i = ȳ_{t+1} = ȳ_t + ηy (1/N) Σ_{i=1}^N (y_{t,Q}^i − ȳ_t)
14:   if mod(t + 1, S) = 0 then
15:     k = k + 1
16:     x̃_k = x̄_{t+1}
17:   end if
18: end for
19: Output: x and y chosen uniformly at random from {(x̄_t, ȳ_t)}_{t=1}^T.
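To make the update pattern of Algorithm 1 concrete, the following is a minimal, illustrative sketch on a toy problem (our construction, not part of the paper): each client runs Q local SGDA steps, with the y-ascent evaluated at the stale snapshot x̃, and the server applies the global step sizes ηx, ηy to the averaged local progress.

```python
import numpy as np

def fedsgda_plus(grads, x0, y0, N, T, Q, S, c_hat, c, eta_x, eta_y, rng):
    """Sketch of FedSGDA+ (Algorithm 1).

    grads(i, x, y, x_tilde, rng) returns a stochastic (grad_x, grad_y)
    pair for client i, with grad_y evaluated at the snapshot x_tilde.
    """
    x_bar, y_bar = x0.copy(), y0.copy()
    x_tilde = x0.copy()                       # snapshot used in the y-update
    for t in range(T):
        xs, ys = [], []
        for i in range(N):                    # each client starts from the averages
            x, y = x_bar.copy(), y_bar.copy()
            for _ in range(Q):                # Q local SGDA steps
                gx, gy = grads(i, x, y, x_tilde, rng)
                x = x - c_hat * gx            # descent on x
                y = y + c * gy                # ascent on y (at x_tilde)
            xs.append(x)
            ys.append(y)
        # server: global step sizes scale the averaged local progress
        x_bar = x_bar + eta_x * (np.mean(xs, axis=0) - x_bar)
        y_bar = y_bar + eta_y * (np.mean(ys, axis=0) - y_bar)
        if (t + 1) % S == 0:                  # refresh the snapshot every S rounds
            x_tilde = x_bar.copy()
    return x_bar, y_bar
```

With ηx = ηy = 1 this reduces to plain averaging; the analysis below exploits the global step sizes as extra degrees of freedom to cut the communication complexity.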
(ii) Variance Bound. The following inequalities hold for all ξ (i) ∼ Di and i, j ∈ [N ]:
E∥∇fi (x, y; ξ (i) ) − ∇fi (x, y)∥2 ≤ σ 2 ,  ∥∇fi (x, y) − ∇fj (x, y)∥2 ≤ ζ 2 .
Assumption 3.1 is a standard assumption in stochastic optimization and is used throughout the rest of the paper. In FL algorithms, Assumption 3.1 (ii) is frequently used to bound the variance and the data heterogeneity. The heterogeneity parameter ζ denotes the level of data heterogeneity; in the homogeneous-data configuration, ζ = 0.
Assumption 3.3. (Smoothness). Each local function fi (x, y) has an Lf -Lipschitz continuous gradient, i.e., ∀x1 , x2 and y1 , y2 , we have
∥∇fi (x1 , y1 ) − ∇fi (x2 , y2 )∥ ≤ Lf ∥(x1 , y1 ) − (x2 , y2 )∥ (4)
Assumption 3.3 on smoothness is a standard assumption in stochastic optimization [2, 17].
Assumption 3.4. (Lipschitz continuity in x). For the function F , there exists a constant Gx , such
that for each y ∈ Rd2 , and ∀x, x′ ∈ Rd1 , we have
∥F (x, y) − F (x′ , y)∥ ≤ Gx ∥x − x′ ∥
Under the NC-C setting, the function F (·, ·) is concave in y. Following [9], we define Φ(x) = maxy F (x, y), and the Moreau envelope of Φ(·) is defined below:
Definition 3.5. (Moreau Envelope) A function Φλ (·) is the λ-Moreau envelope of Φ(·), for λ > 0, if ∀x ∈ Rd1 ,
$$\Phi_\lambda(x) = \min_z \; \Phi(z) + \frac{1}{2\lambda}\|z - x\|^2$$
From [38], we know that if a point x is an ε-stationary point of Φλ (·), then x is close to a point x′ that is stationary for Φ(·). Hence, we focus on minimizing ∥∇Φλ (x)∥ under the NC-C setting.
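As a concrete illustration of Definition 3.5 (our construction, not from the paper), the Moreau envelope of the nonsmooth function Φ(x) = |x| can be approximated by a direct grid minimization; for this Φ the envelope is exactly the Huber function, which is smooth even though Φ is not.

```python
import numpy as np

def moreau_envelope(phi, x, lam, z_grid):
    """Phi_lambda(x) = min_z phi(z) + ||z - x||^2 / (2 lam),
    approximated by minimizing over a dense 1-D grid of z values."""
    return float(np.min(phi(z_grid) + (z_grid - x) ** 2 / (2.0 * lam)))

phi = np.abs                        # Phi(x) = |x|, nonsmooth at its minimizer
z_grid = np.linspace(-3.0, 3.0, 20001)
lam = 0.5
# Closed form for comparison: the Moreau envelope of |x| is the Huber function
#   x^2 / (2 lam)   if |x| <= lam,
#   |x| - lam / 2   otherwise.
env_at_1 = moreau_envelope(phi, 1.0, lam, z_grid)   # close to 1 - 0.25 = 0.75
```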
Theorem 3.6. Suppose Assumptions 3.1, 3.2, 3.3, and 3.4 hold, the sequences {xt , yt } are generated by Algorithm 1 with max{cηy , c} ≤ 1/(10QLf ), and ∥ȳt ∥2 ≤ D following [10, 50]. Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\bar x_t)\big\|^2 \le 8L_f\hat c\eta_x\Big(QG_x^2+\frac{\sigma^2}{N}\Big) + 8\,\frac{\mathbb{E}\,\Phi_{1/2L_f}(x_0)-\mathbb{E}\,\Phi_{1/2L_f}(\bar x_T)}{Q\hat c\eta_x T}$$
$$+\,48L_f^2Q[\hat c^2+c^2](\sigma^2+6Q\zeta^2) + 288L_f^2Q^2\hat c^2G_x^2 + 576L_f^3Q^2c^2\Big[\frac{D}{c\eta_yQS}+\frac{c\eta_y\sigma^2}{N}+6L_fQ^2c^2(\sigma^2+6\zeta^2)\Big]$$
$$+\,32L_fG_x\eta_x\hat cSQ\sqrt{G_x^2+\frac{\sigma^2}{N}} + \frac{16L_fD}{c\eta_yQS} + \frac{16L_f(c\eta_y)\sigma^2}{N} + 96L_f^2Q^2c^2(\sigma^2+6\zeta^2)$$
Corollary 3.7. By setting c = ĉ = 1/(10Lf QT 1/3 ), Q = T 1/3 /N , ĉηx = N/(10Lf T ), cηy = 1/(10Lf Q) = N/(10Lf T 1/3 ), and S = T 1/3 , FedSGDA+ has the following convergence rate:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\bar x_t)\big\|^2 \le \frac{80L_f\Delta}{T^{1/3}} + \frac{4(G_x^2+\sigma^2)}{5T^{2/3}} + \frac{24(\sigma^2+6\zeta^2)}{25T^{2/3}} + \frac{72G_x^2}{25T^{2/3}} + \frac{24(\sigma^2+6\zeta^2)}{25T^{2/3}}$$
$$+\,\frac{144L_f}{25T^{2/3}}\Big[\frac{10L_fD}{T^{1/3}} + \frac{\sigma^2}{10L_fT^{1/3}} + \frac{3(\sigma^2+6\zeta^2)}{50L_fT^{2/3}}\Big] + \frac{16G_x}{5T^{1/3}}\sqrt{G_x^2+\frac{\sigma^2}{N}} + \frac{160L_f^2D}{T^{1/3}} + \frac{8\sigma^2}{5T^{1/3}}$$
Remark 3.8. (Complexity) Based on Corollary 3.7, to make (1/T) Σ_{t=0}^{T−1} E∥∇Φ1/2Lf (x̄t )∥2 ≤ ε2 , the communication complexity is T = O(ε−6 ). Choosing b = O(1), the sample complexity is bQT = O(N −1 ε−8 ). This also shows that FedSGDA+ achieves linear speedup with respect to the number of clients.
To solve problem (1) with better convergence complexity under the nonconvex-PL setting, we propose the federated stochastic gradient descent ascent (FedSGDA-M) algorithm with the momentum-based variance reduction technique (see Algorithm 2).
Algorithm 2 FedSGDA-M Algorithm
1: Input: T, step sizes ĉ, c, η; momentum coefficients α, β; number of local updates Q; mini-batch size b; initial mini-batch size B;
2: Initialize: x_0^i = x̄_0 = (1/N) Σ_{i=1}^N x_0^i, y_0^i = ȳ_0 = (1/N) Σ_{i=1}^N y_0^i, u_1^i = ∇_x f(x_0^i, y_0^i; B_0^i) and v_1^i = ∇_y f(x_0^i, y_0^i; B_0^i), where |B_0^i| = B samples are drawn from D_i for i ∈ [N].
3: for t = 1, 2, . . . , T do
4:   for i = 1, 2, . . . , N do
5:     if mod(t, Q) = 0 then
6:       Server Update:
7:       u_t^i = ū_t = (1/N) Σ_{j=1}^N u_t^j
8:       v_t^i = v̄_t = (1/N) Σ_{j=1}^N v_t^j
9:       x_t^i = x̄_t = (1/N) Σ_{j=1}^N (x_{t−1}^j − ĉη u_t^j)
10:      y_t^i = ȳ_t = (1/N) Σ_{j=1}^N (y_{t−1}^j + cη v_t^j)
11:    else
12:      x_t^i = x_{t−1}^i − ĉη u_t^i
13:      y_t^i = y_{t−1}^i + cη v_t^i
14:    end if
15:    Draw mini-batch samples B_t^i = {ξ_j^i}_{j=1}^b with |B_t^i| = b from D_i locally
16:    u_{t+1}^i = ∇_x f_i(x_t^i, y_t^i; B_t^i) + (1 − α)(u_t^i − ∇_x f_i(x_{t−1}^i, y_{t−1}^i; B_t^i))
17:    v_{t+1}^i = ∇_y f_i(x_t^i, y_t^i; B_t^i) + (1 − β)(v_t^i − ∇_y f_i(x_{t−1}^i, y_{t−1}^i; B_t^i))
18:  end for
19: end for
20: Output: x and y chosen uniformly at random from {(x̄_t, ȳ_t)}_{t=1}^T.
In FedSGDA-M, each client initializes the gradient estimators {ui1 , v1i } with a stochastic gradient, as seen in line 2 of Algorithm 2. After that, each client updates the model variables {xit , yti } locally as in the standard stochastic gradient descent ascent method (lines 12-13 of Algorithm 2). Compared with local momentum SGDA [50], the key difference is that clients use the variance-reduced gradient estimators {uit , vti } constructed in lines 15-17 of Algorithm 2. For the update of {ut , vt }, the coefficients should satisfy 0 < α < 1 and 0 < β < 1. Every Q iterations, clients transmit their model parameters and gradient estimators to the server, which computes {x̄t , ȳt , ūt , v̄t }. The server then sends the averaged model variables and gradient estimators back to each client to update the local variables, as shown in lines 5-10 of Algorithm 2.
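The estimator update in lines 16-17 is a STORM-style [7] momentum variance-reduction step. The following isolated sketch (our construction; `grad_fn` and the toy objective are illustrative assumptions) shows the rule u_{t+1} = ∇f(x_t; B) + (1 − α)(u_t − ∇f(x_{t−1}; B)): because the same mini-batch B is evaluated at both the current and previous iterates, most of the sampling noise cancels in the correction term.

```python
import numpy as np

def storm_update(u, grad_fn, x_cur, x_prev, batch, alpha):
    """One momentum-based variance-reduction step (STORM style):
    the same mini-batch is used at x_cur and x_prev, so the
    difference term is nearly noise-free."""
    return grad_fn(x_cur, batch) + (1.0 - alpha) * (u - grad_fn(x_prev, batch))

def run_storm_descent(grad_fn, x0, steps, lr, alpha, batch_size, rng):
    """Plain gradient descent driven by the STORM estimator u."""
    x_prev = x0
    u = grad_fn(x_prev, rng.standard_normal(batch_size))  # initial estimate
    for _ in range(steps):
        x_cur = x_prev - lr * u
        batch = rng.standard_normal(batch_size)
        u = storm_update(u, grad_fn, x_cur, x_prev, batch, alpha)
        x_prev = x_cur
    return x_cur
```

In FedSGDA-M the same rule is applied per client to both u (for x, with coefficient α) and v (for y, with coefficient β).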
Definition 3.10. According to Assumption 3.9, there exists at least one solution to the problem
maxy F (x, y) for any x. Here we define Φ(x) = F (x, y ∗ (x)) = maxy F (x, y). We use ε-stationary
point of Φ(x), i.e. ∥∇Φ(x)∥ ≤ ε as the convergence metric.
From [50], we know that Φ(x) is differentiable and (L + κL)-smooth and that y ∗ (·) is κ-Lipschitz. Given that ∇y F (x̄t , y ∗ (x̄t )) = 0, we have ∇Φ (x̄t ) = ∇x F (x̄t , y ∗ (x̄t )) + ∇y F (x̄t , y ∗ (x̄t )) · ∂y ∗ (x̄t ) = ∇x F (x̄t , y ∗ (x̄t )), which is widely used in the analysis of nonconvex-PL [10] and nonconvex-strongly-concave minimax optimization [62, 38]. We then present the convergence analysis of FedSGDA-M; the proofs are provided in the supplementary materials.
Assumption 3.11. (Lipschitz Smoothness) Each component function fi (x, y; ξ) has an Lf -Lipschitz gradient, i.e., ∀x1 , x2 and y1 , y2 , we have
E∥∇fi (x1 , y1 ; ξ) − ∇fi (x2 , y2 ; ξ)∥ ≤ Lf ∥(x1 , y1 ) − (x2 , y2 )∥ (5)
By Assumption 3.11 and the convexity of the norm, F (x, y) also has an Lf -Lipschitz gradient. Indeed,
$$\|\nabla_x F(x_1,y_1)-\nabla_x F(x_2,y_2)\| = \Big\|\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\nabla_x f_i(x_1,y_1;\xi)-\nabla_x f_i(x_2,y_2;\xi)\big]\Big\|$$
$$\le \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla_x f_i(x_1,y_1;\xi)-\nabla_x f_i(x_2,y_2;\xi)\| \le L_f\|(x_1,y_1)-(x_2,y_2)\|$$
Assumption 3.11 is standard in optimization analysis. Several widely used single-machine stochastic algorithms, such as SPIDER [13] and STORM [7], rely on this assumption. It is also used by numerous FL algorithms, including MIME [32], Fed-GLOMO [8], STEM [34], and FAFED [58].
Theorem 3.12. Suppose that the sequence {x̄t , ȳt }Tt=0 is generated by Algorithm 2. Under Assumptions 3.1, 3.9, and 3.11, given η = 1/(20QL), α = c1 η 2 , β = c2 η 2 , c1 = 30L2 /(bN κ1−ν ), c2 = 30L2 /(bN κ2−2ν ), c = 1/(6κ1−ν ), ĉ = 1/(54κ3−ν ), where ν ∈ [0, 1], we have
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla\Phi(\bar x_t)\|^2 \le \frac{2[\Phi(\bar x_0)-\Phi(\bar x_T)]}{\hat c\eta T} + \frac{3\sigma^2}{\alpha TBN} + \frac{36L_f^2\sigma^2}{\mu^2\beta TBN} + \frac{12L_f^2}{c\eta\mu^2T}\big[\Phi(\bar x_0)-F(\bar x_0,\bar y_0)\big]$$
$$+\,\frac{6\alpha\sigma^2}{Nb} + \frac{72\beta\sigma^2L_f^2}{Nb\mu^2} + \Big[\frac{\sigma^2(c_1^2+c_2^2)}{30bL^2} + \frac{\zeta^2(c_1^2+c_2^2)}{12L^2}\Big]\kappa^2\eta^2$$
Corollary 3.13. By setting b = O(κν ) for ν ∈ [0, 1], c1 = 30L2 /(bN κ1−ν ), c2 = 30L2 /(bN κ2−2ν ), c = 1/(6κ1−ν ), ĉ = 1/(54κ3−ν ), T = κ3−ν T0 , Q = T0^{1/3} /N 2/3 , η = 1/(20QL) = N 2/3 /(20LT0^{1/3} ), B = T0^{1/3} bκ1−ν /N 2/3 , we have α = c1 η 2 = 3N 1/3 /(40T0^{2/3} bκ1−ν ), β = c2 η 2 = 3N 1/3 /(40T0^{2/3} bκ2−2ν ), and
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla\Phi(\bar x_t)\|^2 \le \frac{2160L[\Phi(\bar x_0)-\Phi^*]}{(NT_0)^{2/3}} + \frac{40\sigma^2}{\kappa^{3-\nu}(NT_0)^{2/3}} + \frac{480\sigma^2}{\kappa^2(NT_0)^{2/3}}$$
$$+\,\frac{240L_f}{(NT_0)^{2/3}}\big[\Phi(\bar x_0)-F(\bar x_0,\bar y_0)\big] + \frac{3\sigma^2}{20b\kappa(NT_0)^{2/3}} + \frac{9\sigma^2}{5(NT_0)^{2/3}} + \Big[\frac{27\sigma^2}{20b} + \frac{15\zeta^2}{40}\Big]\frac{1}{(NT_0)^{2/3}}$$
where Φ∗ is the optimal value.
Remark 3.14. (Complexity) To make (1/T) Σ_{t=0}^{T−1} E ∥∇Φ (x̄t )∥2 ≤ ε2 , we get T0 = O(N −1 ε−3 ) and T = O(κ3−ν N −1 ε−3 ). With b = κν , the communication complexity is T/Q = κ3−ν (N T0 )2/3 = κ3−ν ε−2 and the sample complexity is bT = O(κ3 N −1 ε−3 ). When ν = 1, i.e., b = κ, the communication complexity is T/Q = κ2 ε−2 for finding an ε-stationary point. The sample complexity bT = O(κ3 N −1 ε−3 ) matches the complexity achieved by single-machine algorithms such as SREDA and Acc-MDA [42, 29], but unlike these algorithms we do not require a large batch size b. Moreover, O(κ3 N −1 ε−3 ) exhibits a linear speedup over the aforementioned single-machine algorithms.
Assumption 3.15. Each local function fi (x, y) is µ-strongly concave in y ∈ Y, i.e., ∀x ∈ X and y1 , y2 ∈ Y, we have
∥∇y fi (x, y1 ) − ∇y fi (x, y2 )∥ ≥ µ∥y1 − y2 ∥
When the function F (x, y) is strongly concave in y ∈ Y, there exists a unique solution to the problem maxy∈Y F (x, y) for any x. Since the PL condition is weaker than strong concavity, the convergence results of FedSGDA-M in Algorithm 2 under the NC-PL setting also apply to NC-SC problems, and FedSGDA-M has a sample complexity of O(κ3 N −1 ε−3 ) and a communication complexity of O(κ2 ε−2 ) under the nonconvex-strongly-concave setting.
4 Experiments
We conduct experiments on AUROC maximization and fair classification tasks to verify the efficiency of our algorithms under the nonconvex-strongly-concave and nonconvex-concave settings. Experiments are run on a computer cluster with AMD EPYC 7513 processors and NVIDIA RTX A6000 GPUs. The code is available at https://github.com/xidongwu/Federated-Minimax-and-Conditional-Stochastic-Optimization.
Datasets and Models: We test the performance of the algorithms on three typical datasets: Fashion-MNIST, CIFAR-10, and Tiny-ImageNet. Fashion-MNIST has 70,000 gray-scale 28 × 28 images (10 categories; 60,000 training images and 10,000 testing images). CIFAR-10 consists of 50,000 training images and 10,000 testing images; each image is a 3 × 32 × 32 color-image array. Tiny-ImageNet has 200 classes of 64 × 64 color images, and each class has 500 training images, 50 validation images, and 50 test images. For the Fashion-MNIST and CIFAR-10 datasets, we choose the convolutional neural network from [28] (details are given in the supplementary materials). For Tiny-ImageNet, we choose ResNet-18 [24] as the model.
First, we follow [50, 28] and train fair classification networks by minimizing the maximum loss over different categories:
$$\min_x \max_{y\in\mathcal{Y}} \ \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_c L_{ic}(x) \quad \text{s.t.}\ \ \mathcal{Y}=\Big\{y \,\Big|\, y_c \ge 0,\ \sum_{c=1}^{C} y_c = 1\Big\}$$
where Lic denotes the cross-entropy loss corresponding to class c among the C different classes and x denotes the CNN model parameters. Clearly, this problem is nonconvex in x (the deep model parameters) and concave in y. We compare our algorithm (FedSGDA+) with local SGDA+ under varying models, datasets, numbers of local updates, and step sizes. Although the constraint is not considered in the theoretical analysis of local SGDA+ and FedSGDA+, FedSGDA+ still shows better performance than local SGDA+.
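The inner maximization over the simplex can be made concrete (our construction, illustrating the objective above, not the authors' implementation): for fixed per-class losses L_c(x), the maximum over {y ≥ 0, Σ_c y_c = 1} of Σ_c y_c L_c(x) is attained at a vertex, i.e., it equals the worst per-class loss, and projected gradient ascent on y drives the weights toward that vertex.

```python
import numpy as np

def fair_objective(class_losses, y):
    """Weighted objective sum_c y_c * L_c(x) for simplex weights y."""
    return float(np.dot(y, class_losses))

def project_simplex(v):
    """Euclidean projection onto {y >= 0, sum(y) = 1}
    (standard sort-based algorithm), applied after each ascent step on y."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ascend_y(class_losses, steps=50, lr=0.5):
    """Projected gradient ascent on y for fixed per-class losses."""
    y = np.full(len(class_losses), 1.0 / len(class_losses))
    for _ in range(steps):
        y = project_simplex(y + lr * np.asarray(class_losses))
    return y
```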
The network has 20 clients. The datasets are partitioned into disjoint sets across all clients, and each client holds part of the data from all the classes [34]. We initialize the ResNet-18 with pre-trained weights in PyTorch. In the experiments, we run a grid search for step sizes and choose the step size for the primal variable in the set {0.01, 0.03, 0.05, 0.1, 0.3} and that for the dual variable in the set {0.001, 0.01, 0.1}. We choose the global step size in the set {0.1, 0.5, 1, 1.5, 2}. The batch size b is 50, the inner-loop number Q is selected from {20, 50, 100}, and the outer-loop number S is selected from {1, 5, 10} for FedSGDA+ and {1, 5, 10, Q} for local SGDA+.
Figure 1 shows that FedSGDA+ has a better convergence rate than local SGDA+. This confirms that our algorithm can effectively accelerate SGDA by exploiting the structure of federated learning. Due to the page limitation, the ablation analysis of the step size is presented in the supplementary materials.
Figure 1: Test Accuracy vs the number of communication rounds during the training phase.
[39] showed that the AUROC maximization problem can be reformulated as the following nonconvex-strongly-concave minimax optimization problem:
$$\min_{m\in\mathbb{R}^d,\ (a,b)\in\mathbb{R}^2}\ \max_{w\in\mathbb{R}}\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\big[f_i(m, a, b, w; \xi_i)\big] \qquad (6)$$
where
$$f(m, a, b, w; \xi) = (1-p)\,(h(m; x) - a)^2\,\mathbb{I}_{[y=1]} + p\,(h(m; x) - b)^2\,\mathbb{I}_{[y=-1]}$$
$$+\,2(1+w)\big[p\,h(m; x)\,\mathbb{I}_{[y=-1]} - (1-p)\,h(m; x)\,\mathbb{I}_{[y=1]}\big] - p(1-p)w^2$$
Here ξi = (x, y) ∼ Di denotes a random data point, where x represents the data features and y ∈ Y = {−1, +1} is the label. h(m; x) denotes the prediction score of the data point x computed by a model with parameter m, and p = Pr(y = 1) = Ey [I[y=1] ] denotes the prior probability of the positive class.
Figure 2: AUROC scores on the test datasets vs the number of communication rounds during the
training phase.
Following [19, 67], we constructed imbalanced binary-class versions of the datasets as follows: first, the lower half of the classes (0 - 4) in the original Fashion-MNIST and CIFAR-10, and classes (0 - 99) in Tiny-ImageNet, are converted into the negative class, while the remaining classes are treated as the positive class. Then 80% of the negative data points are randomly dropped in each dataset, and the datasets are evenly divided into disjoint sets across 16 clients. In this setting, each client holds a completely different imbalanced dataset. In the experiments, we use Xavier normal initialization for the deep models.
We compare our algorithm (i.e., FedSGDA-M) with local SGDA [10, 50], CODA+ [19, 67], Mo-
mentum SGDA [50], CODASCA [67] and SAGDA [63] as baselines in AUROC maximization. In
experiments, we carefully tune hyperparameters for all methods. We run a grid search for step size,
and choose the step size for the primal variable in the set {0.001, 0.005, 0.01} and that for dual
variable in the set {0.0001, 0.001, 0.01}. We choose the global step size from {0.9, 1, 1.5, 2} for
CODASCA and SAGDA. We choose the momentum parameter in Local Momentum SGDA in the
set {0.1, 0.5, 0.9}. The α and β in FedSGDA-M are chosen from {0.1, 0.5, 0.9}. The batch-size b is
50 and the inner loop number Q ∈ {10, 20, 50}.
As shown in Figure 2, we compare the performance of FedSGDA-M and other baseline methods
against the number of communication rounds. Figure 2 shows that our algorithms consistently
outperform the other baseline algorithms on testing datasets, which validates the efficacy of our
algorithms. Due to space limitation, other test results are provided in the supplementary materials.
Limitation. Minimax optimization has many applications; a more comprehensive study of our proposed algorithms on these tasks is left to future work, since the theoretical analysis is the main contribution of this paper.
5 Conclusion
In this paper, we study a class of federated nonconvex minimax optimization problems (1). We consider the three most common settings (NC-SC, NC-PL, NC-C). Under the NC-C setting, we propose FedSGDA+ and prove that it has the best communication complexity of O(ε−6 ). It also achieves a linear speedup with respect to the number of clients. Under the NC-SC and NC-PL settings, we propose FedSGDA-M with a variance reduction technique and prove that our algorithm (FedSGDA-M) has the best sample complexity (O(κ3 N −1 ε−3 )) and the best communication complexity (O(κ2 ε−2 )). We also prove that FedSGDA-M enjoys linear speedup with respect to the number of clients. Therefore, we improve the existing complexity results for the most common nonconvex minimax optimization problems under the federated learning setting.
References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-
parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
[2] Runxue Bao, Bin Gu, and Heng Huang. An accelerated doubly stochastic gradient method with faster
explicit model identification. In Proceedings of the 31st ACM International Conference on Information &
Knowledge Management, pages 57–66, 2022.
[3] Runxue Bao, Xidong Wu, Wenhan Xian, and Heng Huang. Doubly sparse asynchronous learning for
stochastic composite optimization. In Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, IJCAI, pages 1916–1922, 2022.
[4] Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich,
and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. arXiv
preprint arXiv:2106.08315, 2021.
[5] Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that
converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR,
2018.
[6] Lesi Chen, Boyuan Yao, and Luo Luo. Faster stochastic algorithms for minimax optimization under Polyak-Łojasiewicz condition. Advances in Neural Information Processing Systems, 35:13921–13932, 2022.
[7] Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd.
Advances in neural information processing systems, 32, 2019.
[8] Rudrajit Das, Abolfazl Hashemi, Sujay Sanghavi, and Inderjit S Dhillon. Improved convergence rates for
non-convex federated learning with compression. arXiv e-prints, pages arXiv–2012, 2020.
[9] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions.
SIAM Journal on Optimization, 29(1):207–239, 2019.
[10] Yuyang Deng and Mehrdad Mahdavi. Local stochastic gradient descent ascent: Convergence analysis
and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pages
1387–1395. PMLR, 2021.
[11] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Distributionally robust federated
averaging. arXiv preprint arXiv:2102.12660, 2021.
[12] Jelena Diakonikolas, Constantinos Daskalakis, and Michael I Jordan. Efficient methods for structured
nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and
Statistics, pages 2746–2754. PMLR, 2021.
[13] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex opti-
mization via stochastic path-integrated differential estimator. Advances in Neural Information Processing
Systems, 31, 2018.
[14] Tanner Fiez, Lillian Ratliff, Eric Mazumdar, Evan Faulkner, and Adhyyan Narang. Global convergence
to local minmax equilibrium in classes of nonconvex zero-sum games. Advances in Neural Information
Processing Systems, 34:29049–29063, 2021.
[15] Hongchang Gao, Junyi Li, and Heng Huang. On the convergence of local stochastic compositional gradient
descent with momentum. In International Conference on Machine Learning, pages 7017–7035. PMLR,
2022.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing
systems, pages 2672–2680, 2014.
[17] Bin Gu, Runxue Bao, Chenkang Zhang, and Heng Huang. New scalable and efficient online pairwise
learning algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved
training of wasserstein gans. Advances in neural information processing systems, 30, 2017.
[19] Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, and Tianbao Yang. Communication-
efficient distributed stochastic auc maximization with deep neural networks. In International Conference
on Machine Learning, pages 3864–3874. PMLR, 2020.
[20] Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms
of the adam family and beyond. arXiv preprint arXiv:2104.14840, 2021.
[21] Zhishuai Guo, Rong Jin, Jiebo Luo, and Tianbao Yang. Fedxl: Provable federated learning for deep x-risk
optimization. 2023.
[22] Songyang Han, Sanbao Su, Sihong He, Shuo Han, Haizhao Yang, and Fei Miao. What is the solution for
state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705, 2022.
[23] Huan He, Shifan Zhao, Yuanzhe Xi, Joyce Ho, and Yousef Saad. Gda-am: On the effectiveness of solving minimax optimization via anderson mixing. In International Conference on Learning Representations, 2021.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[25] Sihong He, Songyang Han, Sanbao Su, Shuo Han, Shaofeng Zou, and Fei Miao. Robust multi-agent
reinforcement learning with state uncertainty. Transactions on Machine Learning Research, 2023. ISSN
2835-8856. URL https://openreview.net/forum?id=CqTkapZ6H9.
[26] Charlie Hou, Kiran K Thekumparampil, Giulia Fanti, and Sewoong Oh. Efficient algorithms for federated
saddle point optimization. arXiv preprint arXiv:2102.06333, 2021.
[27] Feihu Huang and Songcan Chen. Near-optimal decentralized momentum method for nonconvex-pl minimax
problems. arXiv preprint arXiv:2304.10902, 2023.
[28] Feihu Huang, Xidong Wu, and Heng Huang. Efficient mirror descent ascent methods for nonsmooth
minimax problems. Advances in Neural Information Processing Systems, 34:10431–10443, 2021.
[29] Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Accelerated zeroth-order and first-order
momentum methods from mini to minimax optimization. J. Mach. Learn. Res., 23:36–1, 2022.
[30] Feihu Huang, Xidong Wu, and Zhengmian Hu. Adagda: Faster adaptive gradient descent ascent methods
for minimax optimization. In International Conference on Artificial Intelligence and Statistics, pages
2365–2389. PMLR, 2023.
[31] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient
descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[32] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich,
and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning.
arXiv preprint arXiv:2008.03606, 2020.
[33] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In Inter-
national conference on machine learning, pages 5132–5143. PMLR, 2020.
[34] Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod
Varshney. Stem: A stochastic two-sided momentum algorithm achieving near-optimal sample and com-
munication complexities for federated learning. Advances in Neural Information Processing Systems, 34,
2021.
[35] Weiwei Kong and Renato DC Monteiro. An accelerated inexact proximal point method for solving
nonconvex-concave min-max problems. SIAM Journal on Optimization, 31(4):2558–2585, 2021.
[36] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated
optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
[37] Luofeng Liao, Li Shen, Jia Duan, Mladen Kolar, and Dacheng Tao. Local adagrad-type algorithm for
stochastic convex-concave minimax problems. arXiv preprint arXiv:2106.10022, 2021.
[38] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax
problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
[39] Mingrui Liu, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Stochastic AUC maximization with deep
neural networks. arXiv preprint arXiv:1908.10831, 2019.
[40] Mingrui Liu, Wei Zhang, Youssef Mroueh, Xiaodong Cui, Jarret Ross, Tianbao Yang, and Payel Das. A
decentralized parallel algorithm for training generative adversarial nets. Advances in Neural Information
Processing Systems, 33:11056–11070, 2020.
[41] Zhuqing Liu, Xin Zhang, Songtao Lu, and Jia Liu. Precision: Decentralized constrained min-max learning
with low communication and sample complexities. arXiv preprint arXiv:2303.02532, 2023.
[42] Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. Stochastic recursive gradient descent ascent for
stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing
Systems, 33:20566–20577, 2020.
[43] Pouria Mahdavinia, Yuyang Deng, Haochuan Li, and Mehrdad Mahdavi. Tight analysis of extra-gradient
and optimistic gradient methods for nonconvex minimax problems. Advances in Neural Information
Processing Systems, 35:31213–31225, 2022.
[44] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and
statistics, pages 1273–1282. PMLR, 2017.
[45] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a
class of non-convex min-max games using iterative first order methods. Advances in Neural Information
Processing Systems, 32, 2019.
[46] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: provable
algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.
[47] Mohammad Rasouli, Tao Sun, and Ram Rajagopal. Fedgan: Federated generative adversarial networks for
distributed data. arXiv preprint arXiv:2006.07228, 2020.
[48] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv
Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295,
2020.
[49] Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning:
The case of affine distribution shifts. Advances in Neural Information Processing Systems, 33:21554–21565,
2020.
[50] Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod Varshney. Federated minimax optimization:
Improved convergence analyses and algorithms. In International Conference on Machine Learning, pages
19683–19730. PMLR, 2022.
[51] Pranay Sharma, Rohan Panda, and Gauri Joshi. Federated minimax optimization with client heterogeneity.
arXiv preprint arXiv:2302.04249, 2023.
[52] Zhenyu Sun and Ermin Wei. A communication-efficient algorithm with linear convergence for federated
minimax learning. arXiv preprint arXiv:2206.01132, 2022.
[53] Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated
bilevel, minimax, and compositional optimization. In International Conference on Machine Learning,
pages 21146–21179. PMLR, 2022.
[54] Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for
smooth minimax optimization. Advances in Neural Information Processing Systems, 32, 2019.
[55] Hoi-To Wai, Mingyi Hong, Zhuoran Yang, Zhaoran Wang, and Kexin Tang. Variance reduced policy
evaluation with smooth function approximation. Advances in Neural Information Processing Systems, 32:
5784–5795, 2019.
[56] Jingkang Wang, Tianyun Zhang, Sijia Liu, Pin-Yu Chen, Jiacen Xu, Makan Fardad, and Bo Li. Adversarial
attack generation empowered by min-max optimization. Advances in Neural Information Processing
Systems, 34:16020–16033, 2021.
[57] Xidong Wu, Zhengmian Hu, and Heng Huang. Decentralized riemannian algorithm for nonconvex minimax
problems. arXiv preprint arXiv:2302.03825, 2023.
[58] Xidong Wu, Feihu Huang, Zhengmian Hu, and Heng Huang. Faster adaptive federated learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10379–10387, 2023.
[59] Xidong Wu, Jianhui Sun, Zhengmian Hu, Junyi Li, Aidong Zhang, and Heng Huang. Federated conditional
stochastic optimization. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
[60] Yihan Wu, Hongyang Zhang, and Heng Huang. Retrievalguard: Provably robust 1-nearest neighbor image
retrieval. In International Conference on Machine Learning, pages 24266–24279. PMLR, 2022.
[61] Yihan Wu, Aleksandar Bojchevski, and Heng Huang. Adversarial weight perturbation improves generaliza-
tion in graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37,
pages 10417–10425, 2023.
[62] Wenhan Xian, Feihu Huang, Yanfu Zhang, and Heng Huang. A faster decentralized algorithm for
nonconvex minimax problems. Advances in Neural Information Processing Systems, 34:25865–25877,
2021.
[63] Haibo Yang, Zhuqing Liu, Xin Zhang, and Jia Liu. SAGDA: Achieving O(ϵ⁻²) communication complexity
in federated min-max learning. arXiv preprint arXiv:2210.00611, 2022.
[64] Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of
nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:
1153–1165, 2020.
[65] Junchi Yang, Xiang Li, and Niao He. Nest your adaptive algorithm for parameter-agnostic nonconvex
minimax optimization. arXiv preprint arXiv:2206.00743, 2022.
[66] Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax
optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics,
pages 5485–5517. PMLR, 2022.
[67] Zhuoning Yuan, Zhishuai Guo, Yi Xu, Yiming Ying, and Tianbao Yang. Federated deep AUC maximization
for heterogeneous data with a constant communication complexity. In International Conference on Machine
Learning, pages 12219–12229. PMLR, 2021.
[68] Xin Zhang, Zhuqing Liu, Jia Liu, Zhengyuan Zhu, and Songtao Lu. Taming communication and sam-
ple complexities in decentralized policy evaluation for cooperative multi-agent reinforcement learning.
Advances in Neural Information Processing Systems, 34:18825–18838, 2021.
[69] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. Sapd+: An accelerated stochastic method for
nonconvex-concave minimax problems. arXiv preprint arXiv:2205.15084, 2022.
[70] Zhengmian Hu, Xidong Wu, and Heng Huang. Beyond Lipschitz smoothness: A tighter analysis for
nonconvex optimization. In International Conference on Machine Learning (ICML), 2023.
[71] Hanhan Zhou, Tian Lan, Guru Prasadh Venkataramani, and Wenbo Ding. Federated learning with online
adaptive heterogeneous local models. In Workshop on Federated Learning: Recent Advances and New
Challenges (in Conjunction with NeurIPS 2022), 2022.
[72] Hanhan Zhou, Tian Lan, Guru Prasadh Venkataramani, and Wenbo Ding. Every parameter matters:
Ensuring the convergence of federated learning with dynamic heterogeneous models reduction. In
Advances in Neural Information Processing Systems, volume 36, 2023.