
Federated Minimax Optimization with Client Heterogeneity

Pranay Sharma, Rohan Panda, and Gauri Joshi


Department of Electrical and Computer Engineering,
Carnegie Mellon University, Pittsburgh, PA 15213
{pranaysh, rohanpan, gaurij}@andrew.cmu.edu
February 10, 2023
arXiv:2302.04249v2 [cs.LG] 9 Feb 2023

Abstract
Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is
inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at
multiple edge devices or clients, especially when these clients can have heterogeneous datasets and local computation
capabilities. We propose a general federated minimax optimization framework that subsumes such settings and
several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in
optimizing a mismatched objective function – a phenomenon previously observed in standard federated minimization.
To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between
successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-
concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client
participation, and heterogeneous local computations. Our analysis works under more general assumptions on the
intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes
considered, we significantly improve the existing computation and communication complexity results. Experimental
results support our theoretical claims.

1 Introduction
The massive surge in machine learning (ML) research in the past decade has brought forth new applications that cannot
be modeled as simple minimization problems. Many of these problems, including generative adversarial networks
(GANs) Goodfellow et al. (2014); Arjovsky et al. (2017); Sanjabi et al. (2018), adversarial neural network training
Madry et al. (2018), robust optimization Namkoong & Duchi (2016); Mohajerin Esfahani & Kuhn (2018), distributed
nonconvex optimization Lu et al. (2019), and fair machine learning Madras et al. (2018); Mohri et al. (2019), have an
underlying min-max structure. However, the underlying problem is often nonconvex, while classical minimax theory
deals almost exclusively with convex-concave problems.
Another feature of modern ML applications is the inherently distributed nature of the training data Xing et al. (2016).
The data collection is often outsourced to edge devices or clients. However, the clients may then be unable (due to
resource constraints) or unwilling (due to privacy concerns) to share their data with a central server. Federated Learning
(FL) Konečnỳ et al. (2016); Kairouz et al. (2019) was proposed to alleviate this problem. In exchange for retaining
control of their data, the clients shoulder some of the computational load, and run part of the training process locally,
using only their own data. The communication with the server is infrequent, leading to further resource savings. Since
its introduction, FL has been an active area of research, with some remarkable successes Li et al. (2020); Wang et al.
(2021). Research has shown practical benefits of, and provided theoretical justifications for, commonly used practical techniques, such as multiple local updates at the clients Stich (2018); Khaled et al. (2020); Koloskova et al. (2020); Wang & Joshi (2021), partial client participation Yang et al. (2021), and communication compression Hamer et al. (2020); Chen et al. (2021). Further, the impact of heterogeneity in the clients' local data Zhao et al. (2018); Sattler et al. (2019), as well as in their system capabilities Wang et al. (2020); Mitra et al. (2021), has been studied. However, all this research has focused almost solely on simple minimization problems.

Table 1: Comparison of (per-client) stochastic gradient complexity and the number of communication rounds needed to reach an $\epsilon$-stationary solution (see Definition 1), for different classes of nonconvex minimax problems. Here $n$ is the total number of clients. For a fair comparison with existing works, our results in this table are specialized to the case when all clients (i) have equal weights ($p_i = 1/n$), (ii) perform an equal number of local updates ($\tau_i = \tau$), and (iii) use the same local-update algorithm, SGDA. However, our results (Section 4) apply under more general settings where (i)-(iii) do not hold.

| Work | System Heterogeneity^a | Partial Client Participation | Stochastic Gradient Complexity | Communication Rounds |
|---|---|---|---|---|
| **Nonconvex-Strongly-concave (NC-SC) / Nonconvex-Polyak-Łojasiewicz (NC-PL)** | | | | |
| (n = 1) Lin et al. (2020a) | - | - | $O(1/\epsilon^4)$ | - |
| Sharma et al. (2022) | ✗ | ✗ | $O(1/(n\epsilon^4))$ | $O(1/\epsilon^3)$ |
| Yang et al. (2022a)^b | ✗ | ✓ | $O(1/(n\epsilon^4))$ | $O(1/\epsilon^2)$ |
| Our Work (Theorem 1, Corollary 1.2) | ✓ | ✓ | $O(1/(n\epsilon^4))$ | $O(1/\epsilon^2)$ |
| **Nonconvex-Concave (NC-C)** | | | | |
| (n = 1) Lin et al. (2020a) | - | - | $O(1/\epsilon^8)$ | - |
| Sharma et al. (2022) | ✗ | ✗ | $O(1/(n\epsilon^8))$ | $O(1/\epsilon^7)$ |
| Our Work (Theorem 2, Corollary 2.2) | ✓ | ✓ | $O(1/(n\epsilon^8))$ | $O(1/\epsilon^4)$ |
| **Nonconvex-One-point-concave (NC-1PC)** | | | | |
| Deng & Mahdavi (2021) | ✗ | ✗ | $O(1/\epsilon^{12})$ | $O(n^{1/6}/\epsilon^8)$ |
| Sharma et al. (2022) | ✗ | ✗ | $O(1/\epsilon^8)$ | $O(1/\epsilon^7)$ |
| Our Work (Theorem 3) | ✓ | ✓ | $O(1/(n\epsilon^8))$ | $O(1/\epsilon^4)$ |

^a Individual clients can run an unequal number of local iterations, using different local optimizers (see Section 4).
^b We came across Yang et al. (2022a) during the preparation of this paper. Our algorithm Fed-Norm-SGDA (Algorithm 1) strictly generalizes their algorithm FSGDA.

With its increasing usage in large-scale applications, FL systems must adapt to a wide range of clients. Data
heterogeneity has received significant attention from the community. However, system-level heterogeneity remains
relatively unexplored. The effect of client variability or heterogeneity can be controlled by forcing all the clients to carry
out an equal number of local updates and utilize the same local optimizer Yu et al. (2019); Haddadpour et al. (2019).
However, this approach is inefficient if the client dataset sizes are widely different. Also, it would entail faster clients
sitting idle for long durations Reisizadeh et al. (2022); Tziotis et al. (2022), waiting for stragglers to finish. Additionally,
using the same optimizer might be inefficient or expensive for clients, depending on their system capabilities. Therefore,
adapting to system-level heterogeneity forms a desideratum for real-world FL schemes.

Contributions. We consider a general federated minimax optimization framework, in the presence of both inter-client data and system heterogeneity. We consider the problem

$$\min_{x \in \mathbb{R}^{d_x}} \max_{y \in \mathbb{R}^{d_y}} \Big\{ F(x, y) := \sum_{i=1}^{n} p_i f_i(x, y) \Big\}, \tag{1}$$

where $f_i$ is the local loss of client $i$, $p_i$ is the weight assigned to client $i$ (often the relative sample size at client $i$), and $n$ is the total number of clients. We study several classes of nonconvex minimax problems. Further,
• In our generalized federated minimax algorithm, the clients participating in a round may each perform a different number of local steps, potentially with different local optimizers. In this setting, naive aggregation of local model updates (as done in existing methods like Local Stochastic Gradient Descent Ascent) can lead to convergence in terms of a mismatched global objective. We propose a simple normalization strategy to fix this problem.
• Using independent server and client learning rates, we achieve order-optimal or state-of-the-art computation complexity, and significantly improve the communication complexity of existing methods.

• In the special case where all the clients (i) are assigned equal weights $p_i = 1/n$ in (1), (ii) carry out an equal number of local updates ($\tau_i = \tau$ for all $i$), and (iii) utilize the same local-update algorithm, our results become directly comparable with existing work (see Table 1) and improve upon it as follows.
1. For nonconvex-strongly-concave (NC-SC) and nonconvex-PL (NC-PL) problems, our method has the order-optimal gradient complexity $O(1/(n\epsilon^4))$. Further, we improve the communication cost from $O(1/\epsilon^3)$ in Sharma et al. (2022) to $O(1/\epsilon^2)$.¹
2. For nonconvex-concave (NC-C) and nonconvex-one-point-concave (NC-1PC) problems, we achieve state-of-the-art gradient complexity, while significantly improving the communication costs from $O(1/\epsilon^7)$ in Sharma et al. (2022) to $O(1/\epsilon^4)$. For NC-1PC functions, we prove the linear speedup in gradient complexity with $n$ that was conjectured in Sharma et al. (2022), thereby solving an open problem.
3. As an intermediate result in our proof, we prove the theoretical convergence of Local SGD for one-point-convex function minimization (see Lemma C.5 in Appendix C.4). The achieved convergence rate is the same as that for convex minimization problems. Therefore, we generalize the convergence of Local SGD to a much larger class of functions.

2 Related Work
2.1 Single-client minimax
Nonconvex-Strongly-concave (NC-SC). To our knowledge, Lin et al. (2020a) is the first work to analyze a single-loop algorithm for stochastic (and deterministic) NC-SC problems. Although the $O(\kappa^3/\epsilon^4)$ complexity shown is optimal in $\epsilon$, the algorithm requires an $O(\epsilon^{-2})$ batch size. Qiu et al. (2020) utilized momentum to achieve $O(\epsilon^{-4})$ convergence with $O(1)$ batch size. Recent works Yang et al. (2022c); Sharma et al. (2022) achieve the same rate without momentum. Yang et al. (2022c) also improved the dependence on the condition number $\kappa$. Second-order stationarity for NC-SC has recently been studied in Luo & Chen (2021). Lower bounds for this problem class have appeared in Luo et al. (2020); Li et al. (2021); Zhang et al. (2021).

Nonconvex-Concave (NC-C). Again, Lin et al. (2020a) was the first to analyze a single-loop algorithm for stochastic NC-C problems, proving $O(\epsilon^{-8})$ complexity. In deterministic problems, this has been improved using nested Nouiehed et al. (2019); Thekumparampil et al. (2019) as well as single-loop Xu et al. (2020); Zhang et al. (2020) algorithms. For stochastic problems, Rafique et al. (2021) and the recent work Zhang et al. (2022) improved the complexity to $O(\epsilon^{-6})$. However, both algorithms have a nested structure, solving a simpler subproblem iteratively at every step. Achieving $O(\epsilon^{-6})$ complexity with a single-loop algorithm has so far proved elusive.

2.2 Distributed/Federated Minimax


Recent years have also seen an increasing body of work in distributed minimax optimization. Some of this work is
focused on decentralized settings, as in Rogozin et al. (2021); Beznosikov et al. (2021b,c); Metelev et al. (2022).
Of immediate relevance to us is the federated setting, where clients carry out multiple local updates between
successive communication rounds. The relevant works which focused on convex-concave problems include Reisizadeh
et al. (2020); Hou et al. (2021); Liao et al. (2021); Sun & Wei (2022). Special classes of nonconvex minimax problems
in the federated setting have been studied in recent works, such as, nonconvex-linear Deng et al. (2020), nonconvex-PL
Deng & Mahdavi (2021); Xie et al. (2021), and nonconvex-one-point-concave Deng & Mahdavi (2021). The complexity
guarantees for several function classes considered in Deng & Mahdavi (2021) were further improved in Sharma et al.
(2022). However, all these works consider specialized federated settings, either assuming full-client participation, or
system-wise identical clients, each carrying out an equal number of local updates. As we see in this paper, partial client participation is the most significant source of error in simple FL algorithms. Also, system-level heterogeneity can have crucial implications for algorithm performance.

¹During the preparation of this manuscript, we came across the recent work Yang et al. (2022a), which proposes the FSGDA algorithm and achieves $O(1/\epsilon^2)$ communication cost for NC-PL functions. However, our work is more general, since we allow a heterogeneous number of local updates at the clients.

Differences from Related Existing Work. Wang et al. (2020) was the first work to consider the problem of system
heterogeneity in simple minimization problems, and proposed a normalized averaging scheme to avoid optimizing an
inconsistent objective. Compared to Wang et al. (2020), we consider a more challenging problem and achieve higher
communication savings (Table 1). Yang et al. (2021) analyzed partial client participation in FL and demonstrated
the theoretical benefit of using separate client/server learning rates. Deng & Mahdavi (2021); Sharma et al. (2022)
studied minimax problems in the federated setting but assumed a homogeneous number of local updates, with full client
participation. The very recent work Yang et al. (2022a) considers NC-SC problem and achieves similar communication
savings as ours. However, our work considers a more general minimax FL framework with system-level client
heterogeneity, and partial client participation. We consider several classes of functions and improve the communication
and computation complexity of existing minimax algorithms.

3 Preliminaries
Notations. We let $\|\cdot\|$ denote the Euclidean norm $\|\cdot\|_2$. Given a positive integer $m$, the set $\{1, 2, \ldots, m\}$ is denoted by $[m]$. Vectors at client $i$ are denoted with subscript $i$, e.g., $x_i$, while iteration indices are denoted using superscripts, e.g., $y^{(t)}$ or $y^{(t,k)}$. Given a function $g$, we define its gradient vector as $\big[\nabla_x g(x, y)^\top, \nabla_y g(x, y)^\top\big]^\top$, and its stochastic gradient as $\nabla g(x, y; \xi)$, where $\xi$ denotes the randomness.

Convergence Metrics. In the presence of nonconvexity, we can only prove convergence to an approximate stationary point, defined next.

Definition 1 ($\epsilon$-Stationarity). A point $x$ is an $\epsilon$-stationary point of a differentiable function $g$ if $\|\nabla g(x)\| \le \epsilon$.


Definition 2. Stochastic Gradient (SG) complexity is the total number of gradients computed by a single client during
the course of the algorithm.
Definition 3 (Communication Rounds). During a single communication round, the server sends its global model to a
set of clients, which carry out multiple local updates starting from the same model, and return their local vectors to the
server. The server then aggregates these local vectors to arrive at a new global model. Throughout this paper, we denote
the number of communication rounds by T .
Next, we discuss some assumptions used in the paper.
Assumption 1 (Smoothness). Each local function $f_i$ is differentiable and has Lipschitz continuous gradients. That is, there exists a constant $L_f > 0$ such that at each client $i \in [n]$, for all $x, x' \in \mathbb{R}^{d_1}$ and $y, y' \in \mathbb{R}^{d_2}$,

$$\|\nabla f_i(x, y) - \nabla f_i(x', y')\| \le L_f \|(x, y) - (x', y')\|.$$

Assumption 2 (Local Variance). The stochastic gradient oracle at each client is unbiased. Also, there exist constants $\sigma_L, \beta_L \ge 0$ such that at each client $i \in [n]$, for all $x, y$,

$$\mathbb{E}_{\xi_i}[\nabla f_i(x, y; \xi_i)] = \nabla f_i(x, y),$$
$$\mathbb{E}_{\xi_i}\|\nabla f_i(x, y; \xi_i) - \nabla f_i(x, y)\|^2 \le \beta_L^2 \|\nabla f_i(x, y)\|^2 + \sigma_L^2.$$
Assumption 3 (Global Heterogeneity). For any set of non-negative weights $\{w_i\}_{i=1}^n$ such that $\sum_{i=1}^n w_i = 1$, there exist constants $\beta_G \ge 1$, $\sigma_G \ge 0$ such that for all $x, y$,

$$\sum_{i=1}^n w_i \|\nabla_x f_i(x, y)\|^2 \le \beta_G^2 \Big\| \sum_{i=1}^n w_i \nabla_x f_i(x, y) \Big\|^2 + \sigma_G^2,$$
$$\sum_{i=1}^n w_i \|\nabla_y f_i(x, y)\|^2 \le \beta_G^2 \Big\| \sum_{i=1}^n w_i \nabla_y f_i(x, y) \Big\|^2 + \sigma_G^2.$$

If all the $f_i$'s are identical, we have $\beta_G = 1$ and $\sigma_G = 0$.

Most existing work uses simplified versions of Assumptions 2 and 3, assuming $\beta_L = 0$ and/or $\beta_G = 1$.

4 Algorithm for Heterogeneous Federated Minimax Optimization


In this section, we propose a federated minimax algorithm to handle system heterogeneity across clients.

4.1 Limitations of Local SGDA


Following the success of FedAvg McMahan et al. (2017) in FL, Deng & Mahdavi (2021) was the first to explore Local Stochastic Gradient Descent-Ascent (Local SGDA), a simple extension of FedAvg to minimax problems. Between successive communication rounds, clients take multiple simultaneous descent/ascent steps to update the min-variable $x$ and the max-variable $y$, respectively. Subsequent work in Sharma et al. (2022) improved the convergence results and showed that Local SGDA achieves optimal gradient complexity for several classes of nonconvex minimax problems. However, existing work on Local SGDA also assumes the participation of all $n$ clients in every communication round. More crucially, as observed with simple minimization problems Wang et al. (2020), if clients carry out an unequal number of local updates, or if their local optimizers are not all the same, Local SGDA (like FedAvg) might converge to the stationary point of a different objective. This is discussed further in Sections 5.1 and 5.2, and illustrated in Figure 1, where the learning process gets disproportionately skewed towards the clients carrying out more local updates.

[Figure 1 omitted: schematic of two clients' local SGDA trajectories under heterogeneous numbers of local updates.]

Figure 1: FedAvg with heterogeneous local updates. The green (red) triangle represents the local optimizer of $f_1$ ($f_2$), while $(x^*, y^*)$ is the global optimizer. The number of local updates at client $i$ is $\tau_i$, where $\tau_1 = 2$, $\tau_2 = 5$.

Generalized Local SGDA Update Rule. To understand this mismatched convergence phenomenon with naive aggregation in Local SGDA, recall that the Local SGDA updates are of the form

$$x^{(t+1)} = x^{(t)} + \gamma_x^s \sum_{i=1}^n p_i \Delta_{x,i}^{(t)}, \qquad y^{(t+1)} = y^{(t)} + \gamma_y^s \sum_{i=1}^n p_i \Delta_{y,i}^{(t)},$$

where $\gamma_x^s, \gamma_y^s$ are the server learning rates, and $\Delta_{x,i}^{(t)} = \frac{1}{\eta_x^c}\big(x_i^{(t,\tau_i^{(t)})} - x^{(t)}\big)$, $\Delta_{y,i}^{(t)} = \frac{1}{\eta_y^c}\big(y_i^{(t,\tau_i^{(t)})} - y^{(t)}\big)$ are the scaled local updates. Here $x_i^{(t,\tau_i^{(t)})}$ is the iterate at client $i$ after taking $\tau_i^{(t)}$ local steps, and $\eta_x^c, \eta_y^c$ are the client learning rates. Let us
consider a generalized version of this update rule, where $\Delta_{x,i}^{(t)}, \Delta_{y,i}^{(t)}$ are linear combinations of the local stochastic gradients computed by client $i$, e.g., $\Delta_{y,i}^{(t)} = \sum_{k=0}^{\tau_i^{(t)}-1} a_i^{(t,k)} \nabla_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})$, where $a_i^{(t,k)} \ge 0$. Commonly used client optimizers, such as SGD, local momentum, and variable local learning rates, can be accommodated in this general form (see Appendix A.1 for some examples). For this more general form, we can rewrite the $x, y$ updates at the server as follows:

$$x^{(t+1)} = x^{(t)} - \gamma_x^s \sum_{i=1}^n p_i \|\bar{a}_i^{(t)}\|_1 \frac{G_{x,i}^{(t)} \bar{a}_i^{(t)}}{\|\bar{a}_i^{(t)}\|_1}
= x^{(t)} - \underbrace{\Big(\sum_{j=1}^n p_j \|\bar{a}_j^{(t)}\|_1\Big)}_{\tau_{\mathrm{eff}}^{(t)}}\, \gamma_x^s \sum_{i=1}^n \underbrace{\frac{p_i \|\bar{a}_i^{(t)}\|_1}{\sum_{j=1}^n p_j \|\bar{a}_j^{(t)}\|_1}}_{w_i^{(t)}}\, \underbrace{\frac{G_{x,i}^{(t)} \bar{a}_i^{(t)}}{\|\bar{a}_i^{(t)}\|_1}}_{g_{x,i}^{(t)}}, \tag{2}$$

$$y^{(t+1)} = y^{(t)} + \tau_{\mathrm{eff}}^{(t)} \gamma_y^s \sum_{i=1}^n w_i^{(t)} g_{y,i}^{(t)},$$

where $G_{x,i}^{(t)} = [\nabla_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})]_{k=0}^{\tau_i^{(t)}-1} \in \mathbb{R}^{d_x \times \tau_i^{(t)}}$ contains the $\tau_i^{(t)}$ stochastic gradients stacked column-wise, $\bar{a}_i^{(t)} = [a_i^{(t,0)}, a_i^{(t,1)}, \ldots, a_i^{(t,\tau_i^{(t)}-1)}]^\top$, $g_{x,i}^{(t)}, g_{y,i}^{(t)}$ are the normalized aggregates of the stochastic gradients, and $\tau_{\mathrm{eff}}^{(t)}$ is the effective number of local steps. Similar to the observation for simple minimization problems in Wang et al. (2020),

[Figure 2 omitted: schematic of the generalized update rule.]

Figure 2: Generalized update rule in (2). Note that $(g_{x,i}^{(t)}, g_{y,i}^{(t)}) = \frac{1}{\tau_i}(\Delta_{x,i}^{(t)}, \Delta_{y,i}^{(t)})$. Also, at the server, the weighted sum $\sum_{i=1}^n w_i g_{x,i}^{(t)}$ gets scaled by $\tau_{\mathrm{eff}}$.

we see in Theorems 1 and 2 that the resulting iterates of this general algorithm end up converging to a stationary point of a different objective, $\widetilde{F} = \sum_{i=1}^n w_i f_i$. Further, in Corollary 1.1, we observe that this mismatch is a result of using the weights $w_i$ in (2) to weigh the clients' contributions.
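This mismatch can be reproduced numerically even in a one-dimensional toy problem. The self-contained sketch below (our illustration, not the paper's experiments) has two equally weighted clients minimizing $f_i(x) = \frac{1}{2}(x - a_i)^2$ with unequal numbers of local gradient steps; naive aggregation converges near the minimizer of the $\tau_i$-reweighted objective, while normalizing each update by its number of local steps recovers the true optimum.

```python
# Toy illustration (ours, not the paper's experiments) of the mismatched
# objective: two clients minimize f_i(x) = 0.5*(x - a_i)^2 with unequal
# numbers of local steps. Naive FedAvg-style aggregation drifts toward the
# client doing more local work; normalizing each client's update by its
# number of local steps recovers the true optimum x* = p1*a1 + p2*a2 = 0.5.
a = [0.0, 1.0]          # local minimizers
p = [0.5, 0.5]          # client weights p_i in the global objective
tau = [1, 10]           # heterogeneous numbers of local GD steps
eta = 0.01              # client learning rate

def local_delta(x, a_i, steps):
    """Run `steps` local gradient steps; return the scaled update Delta_i."""
    x_i = x
    for _ in range(steps):
        x_i -= eta * (x_i - a_i)        # gradient of 0.5*(x - a_i)^2
    return (x_i - x) / eta

def run(normalize, rounds=200, gamma=0.1):
    x = 0.0
    tau_eff = sum(p_i * t_i for p_i, t_i in zip(p, tau))
    for _ in range(rounds):
        deltas = [local_delta(x, a_i, t_i) for a_i, t_i in zip(a, tau)]
        if normalize:   # normalized aggregation: divide each update by tau_i
            x += gamma * tau_eff * sum(
                p_i * d / t_i for p_i, d, t_i in zip(p, deltas, tau))
        else:           # naive (Local SGDA / FedAvg-style) aggregation
            x += gamma * sum(p_i * d for p_i, d in zip(p, deltas))
    return x

x_naive = run(normalize=False)   # near 10/11: objective implicitly reweighted by tau_i
x_norm = run(normalize=True)     # near the true optimum 0.5
```

For small client learning rates, the naive fixed point approaches $\sum_i p_i \tau_i a_i / \sum_i p_i \tau_i = 10/11$ here, exactly the minimizer of a $\tau_i$-skewed surrogate objective.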

4.2 Proposed Normalized Federated Minimax Algorithm


From the generalized update rule, we can see that setting the weights $w_i$ equal to $p_i$ ensures that the surrogate objective $\widetilde{F}$ matches the original global objective $F$. Setting $w_i = p_i$ amounts to normalizing the local progress at each client before aggregation at the server. As a result, we preserve convergence to a stationary point of the original objective function $F$, even with heterogeneous $\{\tau_i^{(t)}\}$, as we see in Theorems 1 and 2.

The algorithm follows the steps given in Algorithm 1. In each communication round $t$, the server selects a client set $C^{(t)}$ and communicates its model parameters $(x^{(t)}, y^{(t)})$ to these clients. The selected clients then run multiple local stochastic gradient steps. The number of local steps $\{\tau_i^{(t)}\}$ can vary across clients and across rounds. At the end of $\tau_i^{(t)}$ local steps, client $i$ aggregates its local stochastic gradients into $\{g_{x,i}^{(t)}, g_{y,i}^{(t)}\}$, which are then sent to the server. Note that the gradients at client $i$, $\{\nabla f_i(\cdot, \cdot; \xi_i^{(t,k)})\}_{k=0}^{\tau_i^{(t)}-1}$, are normalized by $\|\bar{a}_i^{(t)}\|_1$, where $\bar{a}_i^{(t)} = [a_i^{(t,0)}, a_i^{(t,1)}, \ldots, a_i^{(t,\tau_i^{(t)}-1)}]^\top$ is
Algorithm 1 Fed-Norm-SGDA and Fed-Norm-SGDA+
1: Input: initialization $x^{(0)}, y^{(0)}$; number of communication rounds $T$; learning rates: client $\{\eta_x^c, \eta_y^c\}$, server $\{\gamma_x^s, \gamma_y^s\}$; #local-updates $\{\tau_i^{(t)}\}_{i,t}$; $S$; $s = -1$
2: for $t = 0$ to $T - 1$ do
3:   Server selects client set $C^{(t)}$; sends them $(x^{(t)}, y^{(t)})$
4:   if $t \bmod S = 0$ then
5:     $s \leftarrow s + 1$
6:     Server sends $\hat{x}^{(s)} = x^{(t)}$ to clients in $C^{(t)}$
7:   end if
8:   $x_i^{(t,0)} = x^{(t)}$, $y_i^{(t,0)} = y^{(t)}$ for $i \in C^{(t)}$
9:   for $k = 0, \ldots, \tau_i^{(t)} - 1$ do
10:    $x_i^{(t,k+1)} = x_i^{(t,k)} - \eta_x^c\, a_i^{(t,k)} \nabla_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})$
11:    $y_i^{(t,k+1)} = y_i^{(t,k)} + \eta_y^c\, a_i^{(t,k)} \nabla_y f_i(\hat{x}^{(s)}, y_i^{(t,k)}; \xi_i^{(t,k)})$   # y-update for Fed-Norm-SGDA+
12:    $y_i^{(t,k+1)} = y_i^{(t,k)} + \eta_y^c\, a_i^{(t,k)} \nabla_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})$   # y-update for Fed-Norm-SGDA
13:  end for
14:  Client $i$ aggregates its gradients to compute $g_{x,i}^{(t)}, g_{y,i}^{(t)}$:
15:    $g_{x,i}^{(t)} = \sum_{k=0}^{\tau_i^{(t)}-1} \frac{a_i^{(t,k)}}{\|\bar{a}_i^{(t)}\|_1} \nabla_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})$
16:    $g_{y,i}^{(t)} = \sum_{k=0}^{\tau_i^{(t)}-1} \frac{a_i^{(t,k)}}{\|\bar{a}_i^{(t)}\|_1} \nabla_y f_i(\hat{x}^{(s)}, y_i^{(t,k)}; \xi_i^{(t,k)})$   # Fed-Norm-SGDA+
17:    $g_{y,i}^{(t)} = \sum_{k=0}^{\tau_i^{(t)}-1} \frac{a_i^{(t,k)}}{\|\bar{a}_i^{(t)}\|_1} \nabla_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)})$   # Fed-Norm-SGDA
18:  Clients $i \in C^{(t)}$ communicate $\{g_{x,i}^{(t)}, g_{y,i}^{(t)}\}$ to the server
19:  Server computes aggregate vectors $\{g_x^{(t)}, g_y^{(t)}\}$ using (3)
20:  Server step: $x^{(t+1)} = x^{(t)} - \tau_{\mathrm{eff}}^{(t)} \gamma_x^s g_x^{(t)}$,  $y^{(t+1)} = y^{(t)} + \tau_{\mathrm{eff}}^{(t)} \gamma_y^s g_y^{(t)}$
21: end for
22: Return: $\bar{x}^{(T)}$ drawn uniformly at random from $\{x^{(t)}\}_{t=1}^T$

the vector of weights assigned to the individual stochastic gradients in the local updates.² The server aggregates these local vectors to compute global direction estimates $g_x^{(t)}, g_y^{(t)}$, which are then used to update the server model parameters $(x^{(t)}, y^{(t)})$.
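A minimal single-machine simulation of Fed-Norm-SGDA might look as follows. This is our illustrative sketch, not the paper's code: it assumes full participation, plain-SGD local steps (so $a_i^{(t,k)} = 1$ and $\|\bar{a}_i\|_1 = \tau_i$), scalar variables, and toy quadratic objectives $f_i(x, y) = \frac{1}{2}(x - i)^2 - \frac{1}{2}y^2$ with additive gradient noise.

```python
import random

# Illustrative single-round sketch of Fed-Norm-SGDA (ours, not the paper's
# code): full participation, plain-SGD local steps, so a_i^{(t,k)} = 1 and
# ||a_i||_1 = tau_i. Toy objectives: f_i(x,y) = 0.5*(x - i)^2 - 0.5*y^2.
def grad_x(i, x, y):
    return (x - i) + 0.1 * random.gauss(0.0, 1.0)   # noisy d/dx of f_i

def grad_y(i, x, y):
    return -y + 0.1 * random.gauss(0.0, 1.0)        # noisy d/dy of f_i

def fed_norm_sgda_round(x, y, p, tau, eta_c, gamma_s):
    gx, gy = [], []
    for i in range(len(p)):
        xi, yi, sx, sy = x, y, 0.0, 0.0
        for _ in range(tau[i]):                     # tau_i local SGDA steps
            dx, dy = grad_x(i, xi, yi), grad_y(i, xi, yi)
            xi, yi = xi - eta_c * dx, yi + eta_c * dy
            sx, sy = sx + dx, sy + dy
        gx.append(sx / tau[i])                      # normalize by ||a_i||_1
        gy.append(sy / tau[i])
    tau_eff = sum(pi * ti for pi, ti in zip(p, tau))
    x = x - tau_eff * gamma_s * sum(pi * g for pi, g in zip(p, gx))
    y = y + tau_eff * gamma_s * sum(pi * g for pi, g in zip(p, gy))
    return x, y

p, tau = [0.5, 0.5], [2, 5]                         # heterogeneous local steps
x, y = 3.0, 2.0
for _ in range(400):
    x, y = fed_norm_sgda_round(x, y, p, tau, eta_c=0.05, gamma_s=0.02)
# x approaches the global saddle point x* = 0.5; y approaches y* = 0
```

Despite $\tau_1 \ne \tau_2$, the normalized aggregation keeps the iterates near the saddle point of the true weighted objective rather than a $\tau_i$-skewed surrogate.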

Client Selection. In each round $t$, the server samples $|C^{(t)}|$ clients uniformly at random without replacement (WOR). While aggregating client updates at the server, client $i$'s update is weighted by $\tilde{w}_i = w_i n / |C^{(t)}|$, i.e.,

$$g_x^{(t)} = \sum_{i \in C^{(t)}} \tilde{w}_i g_{x,i}^{(t)}, \qquad g_y^{(t)} = \sum_{i \in C^{(t)}} \tilde{w}_i g_{y,i}^{(t)}. \tag{3}$$

Note that $\mathbb{E}_{C^{(t)}}[g_x^{(t)}] = \sum_{i=1}^n w_i g_{x,i}^{(t)}$ and $\mathbb{E}_{C^{(t)}}[g_y^{(t)}] = \sum_{i=1}^n w_i g_{y,i}^{(t)}$.
²For Local SGDA Deng & Mahdavi (2021); Sharma et al. (2022), $a_i^{(t,k)} = 1$ for all $i \in [n]$, $t \in [T]$, $k \in [\tau_i^{(t)}]$, and $\|\bar{a}_i^{(t)}\|_1 = \tau_i^{(t)}$. Therefore, $g_{x,i}^{(t)}, g_{y,i}^{(t)}$ are simply the averages of the stochastic gradients computed in the $t$-th round.
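The reweighting $\tilde{w}_i = w_i n/|C^{(t)}|$ is exactly what makes the subsampled aggregate an unbiased estimate of the full one. A quick Monte Carlo sanity check (our sketch; the weights and scalar "updates" below are placeholders):

```python
import random

# Sketch of the client-subsampling estimator in (3): sample P of the n
# clients uniformly without replacement (WOR) and reweight each sampled
# update by w_tilde_i = w_i * n / P. The Monte Carlo average below checks
# that the estimator is unbiased for the full aggregate sum_i w_i * g_i.
n, P = 10, 4
w = [(i + 1) / 55.0 for i in range(n)]      # nonuniform weights, summing to 1
g = [float(i * i) for i in range(n)]        # scalar stand-ins for g_{x,i}

full_aggregate = sum(wi * gi for wi, gi in zip(w, g))

random.seed(0)
trials, acc = 100_000, 0.0
for _ in range(trials):
    C = random.sample(range(n), P)          # uniform WOR sample of P clients
    acc += sum(w[i] * (n / P) * g[i] for i in C)
monte_carlo_mean = acc / trials
# monte_carlo_mean is close to full_aggregate, i.e., E[g^(t)] = sum_i w_i g_{x,i}
```

The same check fails without the $n/P$ factor, which is why partial participation requires the rescaling in (3).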

5 Convergence Results
Next, we present the convergence results for different classes of nonconvex minimax problems. For simplicity, throughout this section we assume the parameters used in Algorithm 1 are fixed across $t$. Therefore, $a_i^{(t,k)} \equiv a_i^{(k)}$, $\bar{a}_i^{(t)} \equiv \bar{a}_i$, $\tau_i^{(t)} \equiv \tau_i$, $\tau_{\mathrm{eff}}^{(t)} \equiv \tau_{\mathrm{eff}}$, and $|C^{(t)}| = P$, for all $t$.

5.1 Non-convex-Strongly-Concave (NC-SC) Case


Assumption 4 ($\mu$-Strong-concavity (SC) in $y$). A function $f$ is $\mu$-strongly concave ($\mu > 0$) in $y$ if for all $x, \bar{y}, y_2$,

$$-f(x, y_2) \ge -f(x, \bar{y}) - \langle \nabla_y f(x, \bar{y}), y_2 - \bar{y} \rangle + \frac{\mu}{2}\|y_2 - \bar{y}\|^2.$$

General Convergence Result. We first show that using the local updates of Algorithm 1, the iterates converge to a stationary point of the surrogate objective $\widetilde{F}(x, y) \triangleq \sum_{i=1}^n w_i f_i(x, y)$. See Appendix B for the full statement.

Theorem 1. Suppose the local loss functions $\{f_i\}_i$ satisfy Assumptions 1, 2, 3, 4, and the server selects $|C^{(t)}| = P$ clients in each round $t$. Given appropriate choices of client and server learning rates, $(\eta_x^c, \eta_y^c)$ and $(\gamma_x^s, \gamma_y^s)$ respectively, the iterates generated by Fed-Norm-SGDA satisfy

$$\min_{t \in [T]} \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 \le \underbrace{O\bigg(\kappa^2 \frac{(\bar{\tau}/\tau_{\mathrm{eff}}) + A_w \sigma_L^2 + B_w \beta_L^2 \sigma_G^2}{\sqrt{P\bar{\tau}T}}\bigg)}_{\text{Error with full synchronization}} + \underbrace{O\bigg(\kappa^2 \frac{C_w \sigma_L^2 + D\sigma_G^2}{\bar{\tau}^2 T}\bigg)}_{\text{Error due to local updates}} + \underbrace{O\bigg(\frac{n-P}{n-1}\cdot\frac{\kappa^2 E_w \tau_{\mathrm{eff}} \sigma_G^2}{\sqrt{P\bar{\tau}T}}\bigg)}_{\text{Partial participation error}}, \tag{4}$$

where $\kappa = L_f/\mu$ is the condition number, $\widetilde{\Phi}(x) \triangleq \max_y \widetilde{F}(x, y)$ is the envelope function, $\bar{\tau} = \frac{1}{n}\sum_{i=1}^n \tau_i$, $A_w \triangleq n\tau_{\mathrm{eff}}\sum_{i=1}^n \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}$, $B_w \triangleq n\tau_{\mathrm{eff}}\max_i \frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}$, $C_w \triangleq \sum_{i=1}^n w_i\big(\|a_i\|_2^2 - [a_i^{(\tau_i-1)}]^2\big)$, $D \triangleq \max_i\big(\beta_L^2\|a_{i,-1}\|_2^2 + \|a_{i,-1}\|_1^2\big)$, where $a_{i,-1} \triangleq [a_i^{(0)}, a_i^{(1)}, \ldots, a_i^{(\tau_i-2)}]^\top$ for all $i$, and $E_w \triangleq n\max_i w_i$.

See Appendix B for the proof. The first term in the bound (4) represents the optimization error for a centralized algorithm (see Appendix C.3 in Lin et al. (2020a)). The second term represents the error incurred if at least one of the clients carries out multiple ($\tau_i > 1$) local updates. The last term results from client subsampling, which also explains its dependence on the data heterogeneity $\sigma_G$.
Theorem 1 states convergence for the surrogate objective $\widetilde{F}$. Next, we see convergence for the true objective $F$.

Corollary 1.1 (Convergence in terms of $F$). Given $\Phi(x) \triangleq \max_y F(x, y)$, under the conditions of Theorem 1,

$$\min_{t\in[T]} \big\|\nabla\Phi(x^{(t)})\big\|^2 \le 2\big(2\chi^2_{p\|w}\beta_H^2 + 1\big)\epsilon_{\mathrm{opt}} + 4\chi^2_{p\|w}\sigma_G^2 + \frac{4L_f^2}{T}\sum_{t=0}^{T-1}\big\|y^*(x^{(t)}) - \widetilde{y}^*(x^{(t)})\big\|^2, \tag{5}$$

where $\chi^2_{p\|w} \triangleq \sum_{i=1}^n \frac{(p_i - w_i)^2}{w_i}$, and $\epsilon_{\mathrm{opt}} \triangleq \frac{1}{T}\sum_{t=0}^{T-1}\|\nabla\widetilde{\Phi}(x^{(t)})\|^2$ denotes the optimization error in (4). If $p_i = w_i$ for all $i \in [n]$, then $\chi^2_{p\|w} = 0$. Also, then $\widetilde{F}(x, y) \equiv F(x, y)$. Therefore, $y^*(x) = \arg\max_y F(x, y)$ and $\widetilde{y}^*(x) = \arg\max_y \widetilde{F}(x, y)$ are identical for all $x$. Hence, (5) yields $\min_{t\in[T]} \|\nabla\Phi(x^{(t)})\|^2 \le 2\epsilon_{\mathrm{opt}}$.
It follows from Corollary 1.1 that if we replace {wi } with {pi } in the server updates in Algorithm 1, we get
convergence in terms of the true objective F .
Remark 1. If clients are weighted equally ($w_i = 1/n$ for all $i$), with each carrying out $\tau$ steps of local SGDA, we get $\bar{\tau} = \tau$, $A_w = B_w = 1$, $C_w = \tau - 1$, and $D = (\tau - 1)(\tau - 1 + \beta_L^2)$. Therefore, the bound in (4) simplifies greatly to

$$O\bigg(\frac{\sigma_L^2 + \beta_L^2\sigma_G^2}{\sqrt{P\tau T}}\bigg) + O\bigg(\frac{\sigma_L^2 + \tau\sigma_G^2}{\tau T}\bigg) + O\bigg(\frac{n-P}{n-1}\cdot\sigma_G^2\sqrt{\frac{\tau}{PT}}\bigg). \tag{6}$$
Several key insights can be derived from (6).

• The partial client participation (PCP) error $O\big(\frac{n-P}{n-1}\cdot\sigma_G^2\sqrt{\tau/(PT)}\big)$ is the most significant component of the convergence error. Further, unlike the other two errors, the error due to PCP actually increases with the number of local updates $\tau$. Consequently, we do not observe communication savings from performing multiple local updates at the clients, except in the special case when $\sigma_G = 0$ (see Table 1). Similar observations have been made for minimization Yang et al. (2021); Jhunjhunwala et al. (2022), and very recently for minimax problems Yang et al. (2022a).
• In the absence of multiple local updates (i.e., $\tau_i = 1$ for all $i$) and with full participation ($P = n$), the resulting error $O\big(\frac{\sigma_L^2 + \beta_L^2\sigma_G^2}{\sqrt{n\bar{\tau}T}}\big)$ depends on the global heterogeneity $\sigma_G$ despite full synchronization. This is owing to the more general local variance bound (Assumption 2). For $\beta_L > 0$, this dependence on $\sigma_G$ is unavoidable. This observation holds for all the results in this paper. See Remark 6 (Appendix A.2) for a justification.
Corollary 1.2 (Improved Communication Savings). Suppose all the clients are weighted equally ($p_i = 1/n$ for all $i$), with each carrying out $\tau$ steps of local SGDA. To reach an $\epsilon$-stationary point, i.e., $x$ such that $\mathbb{E}\|\nabla\Phi(x)\| \le \epsilon$:
• Under full participation, the per-client gradient complexity of Fed-Norm-SGDA is $T\tau = O\big(\frac{\kappa^4}{n\epsilon^4}\big)$. The number of communication rounds required is $T = O\big(\frac{\kappa^2}{\epsilon^2}\big)$.
• Under partial participation, in the special case when the inter-client data heterogeneity $\sigma_G = 0$, the per-client gradient complexity of Fed-Norm-SGDA is $O(\kappa^4/(P\epsilon^4))$, while the communication cost is $O(\kappa^2/\epsilon^2)$.
Remark 2. The gradient complexity in Corollary 1.2 is optimal in $\epsilon$, and achieves linear speedup in the number of participating clients. The communication complexity is also optimal in $\epsilon$, and improves the corresponding results in Deng & Mahdavi (2021); Sharma et al. (2022). We match the communication cost of the very recent work Yang et al. (2022a). However, our work addresses a more realistic FL setting with disparate clients.

Extending the Results to Nonconvex-PL Functions


Assumption 5. A function $f$ satisfies the $\mu$-PL condition in $y$ ($\mu > 0$) if, for any fixed $x$:
1. $\max_{y'} f(x, y')$ has a nonempty solution set;
2. $\|\nabla_y f(x, y)\|^2 \ge 2\mu\big(\max_{y'} f(x, y') - f(x, y)\big)$, for all $y$.
Remark 3. If the local functions {fi } satisfy Assumptions 1, 2, 3, and the global function F satisfies Assumption 5,
then for appropriately chosen learning rates (see Appendix B.5), the bounds in Theorem 1 hold for µ-PL functions as
well.
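For intuition, the PL condition is strictly weaker than strong concavity. A classical example from the minimization literature (Karimi et al. (2016)), flipped here for maximization, is $g(y) = -(y^2 + 3\sin^2 y)$, which satisfies Assumption 5 with $\mu = 1/32$ although it is not concave. The following numerical sketch (ours, for illustration only) checks both facts on a grid:

```python
import math

# Numerical check (our illustration) that Assumption 5 can hold without
# concavity: g(y) = -(y^2 + 3*sin(y)^2), a maximization version of the
# classical PL-but-nonconvex example, satisfies
# |g'(y)|^2 >= 2*mu*(max_y g - g(y)) with mu = 1/32, while g'' changes sign.
mu = 1.0 / 32.0
g_max = 0.0                                     # attained at y = 0

def g(y):
    return -(y * y + 3.0 * math.sin(y) ** 2)

def g_prime(y):
    return -(2.0 * y + 3.0 * math.sin(2.0 * y))

def g_second(y):
    return -(2.0 + 6.0 * math.cos(2.0 * y))

ys = [k * 0.01 - 10.0 for k in range(2001)]     # grid over [-10, 10]
pl_holds = all(
    g_prime(y) ** 2 >= 2.0 * mu * (g_max - g(y)) - 1e-9 for y in ys)
nonconcave = any(g_second(y) > 0.0 for y in ys)  # g'' > 0 somewhere
# pl_holds and nonconcave are both True
```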

5.2 Non-convex-Concave (NC-C) Case


In this subsection, we consider smooth nonconvex functions which satisfy the following assumptions.
Assumption 6 (Concavity). The function $f$ is concave in $y$ if, for any fixed $x \in \mathbb{R}^{d_1}$ and all $y, y' \in \mathbb{R}^{d_2}$,

$$f(x, y) \le f(x, y') + \langle \nabla_y f(x, y'), y - y' \rangle.$$

Assumption 7 (Lipschitz continuity in $x$). Given a function $f$, there exists a constant $G_x$ such that for each $y \in \mathbb{R}^{d_2}$ and all $x, x' \in \mathbb{R}^{d_1}$,

$$|f(x, y) - f(x', y)| \le G_x \|x - x'\|.$$
The envelope function $\Phi(x) = \max_y f(x, y)$ used so far may no longer be smooth in the absence of a unique maximizer. Instead, we use the alternative notion of stationarity proposed in Davis & Drusvyatskiy (2019), utilizing the Moreau envelope of $\Phi$, which is defined next.

Definition 4 (Moreau Envelope). The function $\phi_\lambda$ is the $\lambda$-Moreau envelope of $\phi$, for $\lambda > 0$, if for all $x \in \mathbb{R}^{d_x}$,

$$\phi_\lambda(x) = \min_{x'} \Big\{ \phi(x') + \frac{1}{2\lambda}\|x' - x\|^2 \Big\}.$$
Drusvyatskiy & Paquette (2019) showed that a small $\|\nabla\phi_\lambda(x)\|$ indicates the existence of some point $\widetilde{x}$ in the vicinity of $x$ that is nearly stationary for $\phi$. Hence, in our case, we focus on minimizing $\|\nabla\Phi_\lambda(x)\|$.
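As a concrete numerical sketch (ours, for illustration), the Moreau envelope in Definition 4 can be evaluated by brute force for a simple nonsmooth function. For $\phi(y) = |y|$ it recovers the well-known Huber function, and its gradient $(x - \mathrm{prox}(x))/\lambda$ points toward the nearby near-stationary point:

```python
# Numerical sketch of Definition 4 (ours): for phi(t) = |t| and lambda = 0.5,
# the Moreau envelope is the Huber function, phi_lam(x) = |x| - lam/2 for
# |x| > lam, with prox(x) = x - lam*sign(x) and gradient (x - prox(x))/lam.
# A brute-force grid minimization recovers the closed form at x = 2.
lam = 0.5

def phi(t):
    return abs(t)

def moreau_envelope(x, step=1e-4, width=5.0):
    best_val, best_pt = float("inf"), x
    npts = int(2 * width / step) + 1
    for k in range(npts):                       # grid search for the prox point
        t = x - width + k * step
        v = phi(t) + (t - x) ** 2 / (2.0 * lam)
        if v < best_val:
            best_val, best_pt = v, t
    return best_val, best_pt

x = 2.0
val, prox = moreau_envelope(x)
grad = (x - prox) / lam                         # gradient of the Moreau envelope
# closed form here: val = |x| - lam/2 = 1.75, prox = 1.5, grad = 1.0
```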

Proposed Algorithm. For nonconvex-concave functions, we use Fed-Norm-SGDA+. The $x$-updates are identical to those of Fed-Norm-SGDA. For the $y$-updates, however, the clients compute stochastic gradients $\nabla_y f_i(\hat{x}^{(s)}, y_i^{(t,k)}; \xi_i^{(t,k)})$, keeping the $x$-component fixed at $\hat{x}^{(s)}$ for $S$ communication rounds. This trick, originally proposed in Deng & Mahdavi (2021), gives the analytical benefit of a double-loop algorithm (which updates $y$ several times before updating $x$ once), while also updating $x$ simultaneously.
Theorem 2. Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 2, 3, 6, 7, that $\|y^{(t)}\|^2 \le R$ for all $t$, and that the server selects $|C^{(t)}| = P$ clients for all $t$. With appropriate client and server learning rates, $(\eta_x^c, \eta_y^c)$ and $(\gamma_x^s, \gamma_y^s)$ respectively, the iterates of Fed-Norm-SGDA+ satisfy

$$\min_{t\in[T]} \mathbb{E}\big\|\nabla\widetilde{\Phi}_{1/2L_f}(x^{(t)})\big\|^2 \le \underbrace{O\bigg(\frac{(\bar{\tau}/\tau_{\mathrm{eff}})^{1/4}}{(\bar{\tau}PT)^{1/4}} + \frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}\bigg)}_{\text{Error with full synchronization}} + \underbrace{O\bigg(\frac{C_w\sigma_L^2 + D(G_x^2 + \sigma_G^2)}{\bar{\tau}^2 T^{3/4}}\bigg)}_{\text{Local updates error}} + \underbrace{O\bigg(\Big(\frac{n-P}{n-1}\cdot\frac{E_w}{PT}\Big)^{1/4}\bigg)}_{\text{Partial participation error}}, \tag{7}$$

where $\widetilde{\Phi}_{1/2L_f}$ is the Moreau envelope of $\widetilde{\Phi}$. The constants $C_w, D, \bar{\tau}$ are defined in Theorem 1.

See Appendix C for the proof. Theorem 2 states convergence for a surrogate objective Fe. Next, we see convergence
for the true objective F .
Corollary 2.1 (Convergence in terms of F). Define the envelope functions Φ(x) ≜ max_y F(x, y) and Φ̃(x) ≜ max_y F̃(x, y). Under the conditions of Theorem 2,

    min_{t∈[T]} E‖∇Φ_{1/2L_f}(x^{(t)})‖² ≤ ε_opt + (8L_f²/T) Σ_{t=0}^{T−1} ‖x̃^{(t)} − x̄^{(t)}‖²,

where Φ_{1/2L_f} is the Moreau envelope of Φ, x̃^{(t)} ≜ arg min_{x'} { Φ̃(x') + L_f ‖x' − x^{(t)}‖² } and x̄^{(t)} ≜ arg min_{x'} { Φ(x') + L_f ‖x' − x^{(t)}‖² } for all t, and ε_opt is the error bound in (7).
Similar to Corollary 1.1, if we replace {w_i} with {p_i} for all i ∈ [n] in the server updates in Algorithm 1, then F̃ ≡ F, and x̃^{(t)} and x̄^{(t)} are identical for all t. Consequently, Theorem 2 gives us min_{t∈[T]} E‖∇Φ_{1/2L_f}(x^{(t)})‖² ≤ ε_opt.
Remark 4. Some existing works do not require Assumption 7 for NC-C functions and also achieve better convergence rates. However, these methods either have a double-loop structure (Rafique et al., 2021; Zhang et al., 2022) or only handle deterministic problems (Xu et al., 2020; Zhang et al., 2020). Proposing a single-loop method for stochastic NC-C problems with the same advantages remains an open problem.
Corollary 2.2 (Improved Communication Savings). Suppose all the clients are weighted equally (p_i = 1/n for all i), with each carrying out τ steps of local SGDA. To reach an ε-stationary point, i.e., x such that E‖∇Φ_{1/2L_f}(x)‖ ≤ ε:

• Under full participation, the per-client gradient complexity of Fed-Norm-SGDA+ is Tτ = O(1/(nε⁸)), and the number of communication rounds required is T = O(1/ε⁴).

• Under partial participation, in the special case when the inter-client data heterogeneity σ_G = 0, the per-client gradient complexity of Fed-Norm-SGDA+ is O(1/(Pε⁸)), while the communication cost is O(1/ε⁴).

In terms of communication requirements, we achieve massive savings (compared to O(1/ε⁷) in Sharma et al. (2022)), and our gradient complexity results achieve linear speedup in the number of participating clients.
5.3 Nonconvex-1-Point-Concave (NC-1PC) Case


One-point-convexity has been observed in SGD dynamics during neural network training.
Assumption 8 (One-point-Concavity in y). The function f is said to be one-point-concave in y if, fixing any x ∈ R^{d_1}, for all y ∈ R^{d_2},

    ⟨∇_y f(x, y), y − y*(x)⟩ ≤ f(x, y) − f(x, y*(x)),

where y*(x) ∈ arg max_y f(x, y).
Owing to space limitations, we only state the per-client gradient complexity and communication complexity results in the special case when all the clients are weighted equally (p_i = 1/n for all i), with each carrying out τ steps of local SGDA. See Appendix C.4 for more details.
Theorem 3. Suppose the local loss functions {f_i} satisfy Assumptions 1, 2, 3, 7. Suppose for all x, all the f_i's satisfy Assumption 8 at a common global minimizer y*(x), and that ‖y^{(t)}‖² ≤ R for all t. Then, to reach an ε-accurate point, the stochastic gradient complexity of Fed-Norm-SGDA+ (Algorithm 1) is O(1/(nε⁸)), and the number of communication rounds required is T = O(1/ε⁴).
Remark 5. Theorem 3 proves the conjecture posed in Sharma et al. (2022) that linear speedup is achievable for NC-1PC functions. Further, we improve their communication complexity from O(1/ε⁷) to O(1/ε⁴). As an intermediate result in our proof, we show convergence of Local SGD for one-point-convex functions, extending existing convex minimization bounds to a much larger class of functions.

6 Experiments
In this section, we evaluate the empirical performance of the proposed algorithms. We consider a robust neural training problem (Sinha et al., 2017; Madry et al., 2018) and a fair classification problem (Mohri et al., 2019; Deng et al., 2020). Due to space constraints, additional details of our experiments and some additional results are included in Appendix D. Our experiments were run on a network of n = 15 clients, each equipped with an NVIDIA TitanX GPU. We model data heterogeneity across clients using the Dirichlet distribution (Wang et al., 2019) with parameter α, Dir_n(α); smaller α implies higher heterogeneity across clients.
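As an illustration, below is a minimal sketch of one common way to generate such a Dirichlet split (the function name and implementation details are our own, not from the paper): for each class, client proportions are drawn from Dir_n(α) and that class's samples are allocated accordingly, so smaller α concentrates each class on fewer clients.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    # Split sample indices across clients: for each class, draw client
    # proportions from Dir(alpha) and allocate that class's samples accordingly.
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)   # a toy dataset: 10 classes, 100 samples each
parts = dirichlet_partition(labels, 15, 0.1, rng)
```

With α = 0.1 most clients end up dominated by a handful of classes, whereas α = 1.0 yields much more balanced local label distributions.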

Robust NN training. We consider the following robust neural network (NN) training problem:

    min_x max_{‖y‖² ≤ 1} Σ_{i=1}^N ℓ(h_x(a_i + y), b_i),        (8)

where x denotes the NN parameters, (a_i, b_i) denote the feature and label of the i-th sample, y denotes the adversarially added feature perturbation, and h_x denotes the NN output.
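For intuition, here is a toy numpy sketch (our own; a linear scorer stands in for the NN, and all names and constants are illustrative) of the inner maximization in (8): projected gradient ascent on a shared perturbation y, constrained to the unit ball.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 5))              # features a_i
b = 2 * rng.integers(0, 2, size=32) - 1   # labels b_i in {-1, +1}
x = 0.1 * rng.normal(size=5)              # "model" parameters: scorer h_x(a) = a @ x

def loss(y):
    # average logistic loss on the perturbed inputs a_i + y
    return np.mean(np.log1p(np.exp(-b * ((A + y) @ x))))

def grad_y(y):
    s = -b / (1.0 + np.exp(b * ((A + y) @ x)))   # d loss_i / d score_i
    return np.mean(s) * x                        # chain rule: d score_i / d y = x

y = np.zeros(5)
for _ in range(50):                       # inner ascent on the perturbation
    y = y + 0.5 * grad_y(y)
    nrm = np.linalg.norm(y)
    if nrm > 1.0:                         # project back onto {y : ||y|| <= 1}
        y /= nrm

assert loss(y) >= loss(np.zeros(5))       # the adversary increased the loss
```

In the actual experiments, this inner ascent is performed on y jointly with the descent on the NN parameters x, as in Algorithm 1.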

[Figure: robust test accuracy (y-axis, 0–80%) vs. number of communication rounds (x-axis), for Local SGDA+, Local SGDA+ (M), and Fed-Norm-SGDA+.]
Figure 3: Comparison of the effect of a heterogeneous number of local updates {τ_i} on the performance of Fed-Norm-SGDA+ (Algorithm 1), Local SGDA+, and Local SGDA+ with momentum, while solving (8) on the CIFAR10 dataset with the VGG11 model. The solid (dashed) curves are for E = 5 (E = 7), and α = 0.1.

Impact of system heterogeneity. In Figure 3, we compare the effect of a heterogeneous number of local updates across clients on the performance of our proposed Fed-Norm-SGDA+. We compare with Local SGDA+ (Deng & Mahdavi, 2021) and Local SGDA+ with momentum (Sharma et al., 2022). Clients sample the number of epochs they run locally via τ_i ∼ Unif[2, E]. We observe that Fed-Norm-SGDA+ adapts well to system heterogeneity and outperforms both existing methods.

[Figure: robust test accuracy vs. number of communication rounds, for PCP (P = 5), PCP (P = 10), and FCP (n = 15).]
Figure 4: Comparison of the effects of partial client participation (PCP) on the performance of Fed-Norm-SGDA+, for the robust NN training problem on the CIFAR10 dataset, with the VGG11 model. The figure shows the robust test accuracy. The solid (dashed) curves are for α = 0.1 (α = 1.0).

Impact of partial participation and heterogeneity. Next, we compare the impact of different levels of partial client participation on performance. We compare the full participation setting (n = 15) with P = 5, 10. Clients sample the number of epochs they run locally via τ_i ∼ Unif[2, 5]. We plot the results for two different values of the data heterogeneity parameter, α = 0.1, 1.0. Consistent with our theoretical results, where partial participation contributes the most significant component of the convergence error, smaller values of P result in worse performance. Further, higher inter-client heterogeneity (modeled by smaller values of α) also degrades performance. We further explore the impact of α on performance in Appendix D.

[Figure: worst-distribution test accuracy vs. number of communication rounds, for Local SGDA, Local SGDA (M), and Fed-Norm-SGDA.]
Figure 5: Comparison of Local SGDA, Local SGDA with momentum, and Fed-Norm-SGDA, for the fair classification task on the CIFAR10 dataset, with the VGG11 model. The solid (dashed) curves are for E = 5 (E = 7), α = 0.1.

Fair Classification. We consider the minimax formulation of the fair classification problem (Mohri et al., 2019; Nouiehed et al., 2019):

    min_x max_{y∈Y} Σ_{c=1}^C y_c F_c(x) − (λ/2) ‖y‖²,        (9)

where x denotes the parameters of the NN, {F_c}_{c=1}^C denote the losses corresponding to the C classes, and Y = ∆_C is the C-dimensional probability simplex. In Figure 5, we plot the worst-distribution test accuracy achieved by Fed-Norm-SGDA, Local SGDA (Deng & Mahdavi, 2021), and Local SGDA with momentum (Sharma et al., 2022). As in Figure 3, clients sample τ_i ∼ Unif[2, E]. Again, Fed-Norm-SGDA outperforms existing methods.
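To illustrate the inner maximization in (9), here is a small numpy sketch (our own; the per-class losses are made up): projected gradient ascent on y over the simplex. Since the inner problem is λ-strongly concave, y converges to the Euclidean projection of F/λ onto ∆_C, which for small λ concentrates on the worst (largest-loss) class.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based algorithm)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.max(np.nonzero(u > css / (np.arange(len(v)) + 1.0))[0])
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

lam = 0.1
F = np.array([0.9, 0.4, 0.7])     # per-class losses F_c(x) at the current x
y = np.ones(3) / 3                # start from the uniform distribution
for _ in range(100):              # projected ascent: grad_y = F - lam * y
    y = project_simplex(y + 0.5 * (F - lam * y))

# All weight ends up on the worst class (class 0) since lam is small
assert np.allclose(y, [1.0, 0.0, 0.0], atol=1e-6)
```

In the federated experiments this ascent step on y is interleaved with the descent on x and the worst-case weight vector is maintained at the server.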

7 Conclusion
In this work, we considered nonconvex minimax problems in the federated setting, where, in addition to inter-client data heterogeneity and partial client participation, there is system heterogeneity as well. We observed that existing methods, such as Local SGDA, may converge to a stationary point of an objective quite different from the originally intended one, and we showed that normalizing individual client contributions solves this problem. We analyzed several classes of nonconvex minimax functions and significantly improved existing computation and communication complexity results. Potential future directions include analyzing federated systems with unpredictable client presence (Yang et al., 2022b).

Acknowledgments
This work was supported in part by NSF grants CCF 2045694, CNS-2112471, and ONR N00014-23-1-2149. Jiarui Li
helped with some figures in the paper.

References
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International conference on
machine learning, pp. 214–223. PMLR, 2017.
Beznosikov, A., Richtárik, P., Diskin, M., Ryabinin, M., and Gasnikov, A. Distributed methods with compressed
communication for solving variational inequalities, with theoretical guarantees. arXiv preprint arXiv:2110.03313,
2021a.
Beznosikov, A., Rogozin, A., Kovalev, D., and Gasnikov, A. Near-optimal decentralized algorithms for saddle point
problems over time-varying networks. In International Conference on Optimization and Applications, pp. 246–257.
Springer, 2021b.

Beznosikov, A., Scutari, G., Rogozin, A., and Gasnikov, A. Distributed saddle-point problems under similarity. In
Advances in Neural Information Processing Systems, volume 34, 2021c.
Beznosikov, A., Sushko, V., Sadiev, A., and Gasnikov, A. Decentralized personalized federated min-max problems.
arXiv preprint arXiv:2106.07289, 2021d.

Chen, M., Shlezinger, N., Poor, H. V., Eldar, Y. C., and Cui, S. Communication-efficient federated learning. Proceedings
of the National Academy of Sciences, 118(17):e2024789118, 2021.
Chen, Z., Zhou, Y., Xu, T., and Liang, Y. Proximal gradient descent-ascent: Variable convergence under kł geometry.
In International Conference on Learning Representations, 2020.

Cho, H. and Yun, C. SGDA with shuffling: Faster convergence for nonconvex-PŁ minimax optimization. arXiv preprint arXiv:2210.05995, 2022.
Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on
Optimization, 29(1):207–239, 2019.

Deng, Y. and Mahdavi, M. Local stochastic gradient descent ascent: Convergence analysis and communication
efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 1387–1395. PMLR, 2021.
Deng, Y., Kamani, M. M., and Mahdavi, M. Distributionally robust federated averaging. In Advances in Neural
Information Processing Systems, volume 33, pp. 15111–15122, 2020.
Drusvyatskiy, D. and Paquette, C. Efficiency of minimizing compositions of convex functions and smooth maps.
Mathematical Programming, 178(1):503–558, 2019.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.
Generative adversarial nets. In Advances in neural information processing systems, volume 27, 2014.
Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local sgd with periodic averaging: Tighter analysis
and adaptive synchronization. Advances in Neural Information Processing Systems, 32:11082–11094, 2019.
Hamer, J., Mohri, M., and Suresh, A. T. Fedboost: A communication-efficient algorithm for federated learning. In
International Conference on Machine Learning, pp. 3973–3983. PMLR, 2020.
Hou, C., Thekumparampil, K. K., Fanti, G., and Oh, S. Efficient algorithms for federated saddle point optimization.
arXiv preprint arXiv:2102.06333, 2021.

Jhunjhunwala, D., Sharma, P., Nagarkatti, A., and Joshi, G. FedVARP: Tackling the variance due to partial client
participation in federated learning. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.
Jin, C., Netrapalli, P., and Jordan, M. What is local optimality in nonconvex-nonconcave minimax optimization? In
International Conference on Machine Learning, pp. 4880–4889. PMLR, 2020.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode,
G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the
polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pp. 795–811. Springer, 2016.

Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local sgd on identical and heterogeneous data. In
International Conference on Artificial Intelligence and Statistics, pp. 4519–4529. PMLR, 2020.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized sgd with changing
topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. PMLR, 2020.

Konečnỳ, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for
on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
Lee, S. and Kim, D. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. In
Advances in Neural Information Processing Systems, volume 34, 2021.
Lei, Y., Yang, Z., Yang, T., and Ying, Y. Stability and generalization of stochastic gradient methods for minimax
problems. In International Conference on Machine Learning, pp. 6175–6186. PMLR, 2021.
Li, H., Tian, Y., Zhang, J., and Jadbabaie, A. Complexity lower bounds for nonconvex-strongly-concave min-max
optimization. In Advances in Neural Information Processing Systems, volume 34, 2021.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE
Signal Processing Magazine, 37(3):50–60, 2020.
Liao, L., Shen, L., Duan, J., Kolar, M., and Tao, D. Local adagrad-type algorithm for stochastic convex-concave
minimax problems. arXiv preprint arXiv:2106.10022, 2021.

Lin, T., Jin, C., and Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. In International
Conference on Machine Learning, pp. 6083–6093. PMLR, 2020a.
Lin, T., Jin, C., and Jordan, M. I. Near-optimal algorithms for minimax optimization. In Conference on Learning
Theory, pp. 2738–2779. PMLR, 2020b.
Liu, W., Mokhtari, A., Ozdaglar, A., Pattathil, S., Shen, Z., and Zheng, N. A decentralized proximal point-type method
for saddle point problems. arXiv preprint arXiv:1910.14380, 2019.
Lu, S., Tsaknakis, I., and Hong, M. Block alternating optimization for non-convex min-max problems: algorithms and
applications in signal processing and communications. In ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4754–4758. IEEE, 2019.

Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. Hybrid block successive approximation for one-sided non-convex
min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 68:3676–3691, 2020.
Luo, L. and Chen, C. Finding second-order stationary point for nonconvex-strongly-concave minimax problem. arXiv
preprint arXiv:2110.04814, 2021.
Luo, L., Ye, H., Huang, Z., and Zhang, T. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-
concave minimax problems. In Advances in Neural Information Processing Systems, volume 33, pp. 20566–20577,
2020.
Luo, L., Xie, G., Zhang, T., and Zhang, Z. Near optimal stochastic algorithms for finite-sum unbalanced convex-concave
minimax optimization. arXiv preprint arXiv:2106.01761, 2021.
Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. In
International Conference on Machine Learning, pp. 3384–3393. PMLR, 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial
attacks. In International Conference on Learning Representations, 2018.
McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep
networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
Metelev, D., Rogozin, A., Gasnikov, A., and Kovalev, D. Decentralized saddle-point problems with different constants
of strong convexity and strong concavity. arXiv preprint arXiv:2206.00090, 2022.
Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. Linear convergence in federated learning: Tackling client
heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.

Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the wasserstein metric:
Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.
Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In International Conference on Machine Learning,
pp. 4615–4625. PMLR, 2019.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences.
In Advances in Neural Information Processing Systems, volume 29, 2016.
Nesterov, Y. Lectures on convex optimization, volume 137. Springer, 2018.

Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. Solving a class of non-convex min-max
games using iterative first order methods. In Advances in Neural Information Processing Systems, volume 32, pp.
14934–14942, 2019.
Ouyang, Y. and Xu, Y. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point
problems. Mathematical Programming, 185(1):1–35, 2021.

Qiu, S., Yang, Z., Wei, X., Ye, J., and Wang, Z. Single-timescale stochastic nonconvex-concave optimization for smooth
nonlinear TD learning. arXiv preprint arXiv:2008.10103, 2020.
Rafique, H., Liu, M., Lin, Q., and Yang, T. Weakly-convex–concave min–max optimization: provable algorithms and
applications in machine learning. Optimization Methods and Software, pp. 1–35, 2021.

Reisizadeh, A., Farnia, F., Pedarsani, R., and Jadbabaie, A. Robust federated learning: The case of affine distribution
shifts. In Advances in Neural Information Processing Systems, volume 33, pp. 21554–21565, 2020.
Reisizadeh, A., Tziotis, I., Hassani, H., Mokhtari, A., and Pedarsani, R. Straggler-resilient federated learning:
Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE Journal on Selected Areas in
Information Theory, 2022.

Rogozin, A., Beznosikov, A., Dvinskikh, D., Kovalev, D., Dvurechensky, P., and Gasnikov, A. Decentralized distributed
optimization for saddle point problems. arXiv preprint arXiv:2102.07758, 2021.
Sanjabi, M., Ba, J., Razaviyayn, M., and Lee, J. D. On the convergence and robustness of training gans with regularized
optimal transport. Advances in Neural Information Processing Systems, 31, 2018.

Sattler, F., Wiedemann, S., Müller, K.-R., and Samek, W. Robust and communication-efficient federated learning from
non-iid data. IEEE transactions on neural networks and learning systems, 31(9):3400–3413, 2019.
Sharma, P., Panda, R., Joshi, G., and Varshney, P. Federated minimax optimization: Improved convergence analyses
and algorithms. In International Conference on Machine Learning, pp. 19683–19730. PMLR, 2022.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In
International Conference on Learning Representations, 2017.
Stich, S. U. Local sgd converges fast and communicates little. In International Conference on Learning Representations,
2018.
Sun, Z. and Wei, E. A communication-efficient algorithm with linear convergence for federated minimax learning.
arXiv preprint arXiv:2206.01132, 2022.
Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Efficient algorithms for smooth minimax optimization. In
Advances in Neural Information Processing Systems, volume 32, 2019.
Tran-Dinh, Q., Liu, D., and Nguyen, L. M. Hybrid variance-reduced sgd algorithms for minimax problems with
nonconvex-linear function. In Advances in Neural Information Processing Systems, volume 33, pp. 11096–11107,
2020.
Tziotis, I., Shen, Z., Pedarsani, R., Hassani, H., and Mokhtari, A. Straggler-resilient personalized federated learning.
arXiv preprint arXiv:2206.02078, 2022.
Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In
International Conference on Learning Representations, 2019.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of local-update sgd
algorithms. Journal of Machine Learning Research, 22(213):1–50, 2021.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous
federated optimization. In Advances in Neural Information Processing Systems, volume 33, pp. 7611–7623, 2020.
Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data,
D., et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
Wang, Y. and Li, J. Improved algorithms for convex-concave minimax optimization. In Advances in Neural Information
Processing Systems, volume 33, pp. 4800–4810, 2020.
Woodworth, B. E., Patel, K. K., and Srebro, N. Minibatch vs local sgd for heterogeneous distributed learning. Advances
in Neural Information Processing Systems, 33:6281–6292, 2020.
Xie, G., Luo, L., Lian, Y., and Zhang, Z. Lower complexity bounds for finite-sum convex-concave minimax optimization
problems. In International Conference on Machine Learning, pp. 10504–10513. PMLR, 2020.
Xie, J., Zhang, C., Zhang, Y., Shen, Z., and Qian, H. A federated learning framework for nonconvex-pl minimax
problems. arXiv preprint arXiv:2105.14216, 2021.
Xing, E. P., Ho, Q., Xie, P., and Wei, D. Strategies and principles of distributed machine learning on big data.
Engineering, 2(2):179–195, 2016.
Xu, Z., Zhang, H., Xu, Y., and Lan, G. A unified single-loop alternating gradient projection algorithm for nonconvex-
concave and convex-nonconcave minimax problems. arXiv preprint arXiv:2006.02032, 2020.
Yang, H., Fang, M., and Liu, J. Achieving linear speedup with partial worker participation in non-iid federated learning.
In International Conference on Learning Representations, 2021.
Yang, H., Liu, Z., Zhang, X., and Liu, J. SAGDA: Achieving O(ε⁻²) communication complexity in federated min-max learning. arXiv preprint arXiv:2210.00611, 2022a.
Yang, H., Zhang, X., Khanduri, P., and Liu, J. Anarchic federated learning. In International Conference on Machine
Learning, pp. 25331–25363. PMLR, 2022b.
Yang, J., Zhang, S., Kiyavash, N., and He, N. A catalyst framework for minimax optimization. In Advances in Neural
Information Processing Systems, volume 33, pp. 5667–5678, 2020.
Yang, J., Orvieto, A., Lucchi, A., and He, N. Faster single-loop algorithms for minimax optimization without strong
concavity. In International Conference on Artificial Intelligence and Statistics, pp. 5485–5517. PMLR, 2022c.
Yoon, T. and Ryu, E. K. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k²) rate on squared gradient norm. In International Conference on Machine Learning, pp. 12098–12109. PMLR, 2021.
Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed
non-convex optimization. In International Conference on Machine Learning, pp. 7184–7193. PMLR, 2019.
Yun, C., Rajput, S., and Sra, S. Minibatch vs local sgd with shuffling: Tight convergence bounds and beyond. In
International Conference on Learning Representations, 2022.
Zhang, J., Xiao, P., Sun, R., and Luo, Z. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-
concave min-max problems. In Advances in Neural Information Processing Systems, volume 33, pp. 7377–7389,
2020.
Zhang, S., Yang, J., Guzmán, C., Kiyavash, N., and He, N. The complexity of nonconvex-strongly-concave minimax
optimization. In Conference on Uncertainty in Artificial Intelligence, pp. 482–492. PMLR, 2021.
Zhang, X., Aybat, N. S., and Gurbuzbalaban, M. Sapd+: An accelerated stochastic method for nonconvex-concave
minimax problems. arXiv preprint arXiv:2205.15084, 2022.
Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. arXiv preprint
arXiv:1806.00582, 2018.

Contents
1 Introduction 1

2 Related Work 3
2.1 Single-client minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Distributed/Federated Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Preliminaries 4

4 Algorithm for Heterogeneous Federated Minimax Optimization 5


4.1 Limitations of Local SGDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2 Proposed Normalized Federated Minimax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 6

5 Convergence Results 8
5.1 Non-convex-Strongly-Concave (NC-SC) Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.2 Non-convex-Concave (NC-C) Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.3 Nonconvex-1-Point-Concave (NC-1PC) Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

6 Experiments 11

7 Conclusion 13

A Background 19
A.1 Gradient Aggregation with Different Solvers at Clients . . . . . . . . . . . . . . . . . . . . . . . . . 19
A.2 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
A.3 Useful Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
A.4 Comparison of Convergence Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

B Convergence of Fed-Norm-SGDA for Nonconvex-Strongly-Concave Functions (Theorem 1) 21


B.1 Intermediate Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
B.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
B.3 Proofs of the Intermediate Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
B.4 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
B.5 Convergence under Polyak Łojasiewicz (PL) Condition . . . . . . . . . . . . . . . . . . . . . . . . . 38

C Convergence of Fed-Norm-SGDA+ for Nonconvex Concave Functions (Theorem 2) 39


C.1 Intermediate Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
C.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
C.3 Proofs of the Intermediate Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
C.4 Extending the result for Nonconvex One-Point-Concave (NC-1PC) Functions (Theorem 3) . . . . . . 50

D Additional Experiments 51

Appendix
The appendix is organized as follows. In Appendix A, we collect some basic mathematical results and inequalities used throughout the paper. In Appendix B, we prove the non-asymptotic convergence of Fed-Norm-SGDA (Algorithm 1) for smooth nonconvex-strongly-concave (and nonconvex-PŁ) functions, and derive the gradient complexity and communication cost of the algorithm to achieve an ε-stationary point. In Appendix C, we prove the non-asymptotic convergence of Fed-Norm-SGDA+ (Algorithm 1) for smooth nonconvex-concave and nonconvex-one-point-concave functions. Finally, in Appendix D, we provide the details of the additional experiments we performed.

A Background
A.1 Gradient Aggregation with Different Solvers at Clients
Local SGDA. Suppose τ_i^{(t)} = τ_eff = τ for all i ∈ [n], t ∈ [T]. Also, a_i^{(t,k)} = 1 for all k ∈ [τ], t. Then, the local iterate updates in Algorithm 1 (Fed-Norm-SGDA) reduce to (the updates for Fed-Norm-SGDA+ are analogous)

    x_i^{(t,k+1)} = x_i^{(t,k)} − η_x^c ∇_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}),
    y_i^{(t,k+1)} = y_i^{(t,k)} + η_y^c ∇_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}),

for k ∈ {0, . . . , τ − 1}, and the gradient aggregate vectors (g_{x,i}^{(t)}, g_{y,i}^{(t)}) are simply the averages of the individual gradients:

    g_{x,i}^{(t)} = (1/τ) Σ_{k=0}^{τ−1} ∇_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}),
    g_{y,i}^{(t)} = (1/τ) Σ_{k=0}^{τ−1} ∇_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}).

Note that these are precisely the iterates of Local SGDA proposed in Deng & Mahdavi (2021); Sharma et al. (2022), with the only difference that in Local SGDA, the clients communicate the iterates {x_i^{(t,τ)}, y_i^{(t,τ)}} to the server, which averages them to compute {x^{(t+1)}, y^{(t+1)}}, while here the clients communicate {g_{x,i}^{(t)}, g_{y,i}^{(t)}}. Also, in Fed-Norm-SGDA, the clients and the server use separate learning rates, which results in tighter bounds on the local-updates error.

With Momentum in Local Updates. Suppose each local client uses a momentum buffer with momentum scale ρ. Then, for k ∈ {0, . . . , τ_i^{(t)} − 1},

    d_{x,i}^{t,k+1} = ρ d_{x,i}^{t,k} + ∇_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}),    x_i^{(t,k+1)} = x_i^{(t,k)} − η_x^c d_{x,i}^{t,k+1},
    d_{y,i}^{t,k+1} = ρ d_{y,i}^{t,k} + ∇_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}),    y_i^{(t,k+1)} = y_i^{(t,k)} + η_y^c d_{y,i}^{t,k+1}.

Simple calculations show that the coefficient of ∇_x f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)}) (and of ∇_y f_i(x_i^{(t,k)}, y_i^{(t,k)}; ξ_i^{(t,k)})) in the gradient aggregate vectors (g_{x,i}^{(t)}, g_{y,i}^{(t)}) is

    Σ_{j=k}^{τ_i^{(t)}−1} ρ^{j−k} = 1 + ρ + · · · + ρ^{τ_i^{(t)}−1−k} = (1 − ρ^{τ_i^{(t)}−k}) / (1 − ρ).

Therefore, the aggregation vector is ā_i^{(t)} = (1/(1−ρ)) [1 − ρ^{τ_i^{(t)}}, 1 − ρ^{τ_i^{(t)}−1}, . . . , 1 − ρ], and

    ‖ā_i^{(t)}‖₁ = Σ_{k=0}^{τ_i^{(t)}−1} (1 − ρ^{τ_i^{(t)}−k}) / (1 − ρ) = (1/(1−ρ)) [ τ_i^{(t)} − ρ(1 − ρ^{τ_i^{(t)}})/(1 − ρ) ].

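A quick numeric sanity check of this closed form (our own script, assuming the momentum recursion as written above): unroll the recursion for τ local steps and compare the coefficient of each stochastic gradient in the aggregate against (1 − ρ^{τ−k})/(1 − ρ).

```python
import numpy as np

rho, tau = 0.9, 5
d = np.zeros(tau)        # d[k] = coefficient of gradient g_k in the current buffer
agg = np.zeros(tau)      # coefficient of each g_k in the aggregate sum_k d^{k+1}
for k in range(tau):
    d *= rho             # d^{k+1} = rho * d^k + g_k
    d[k] += 1.0
    agg += d
# Closed form derived in the text
closed = (1.0 - rho ** (tau - np.arange(tau))) / (1.0 - rho)
assert np.allclose(agg, closed)
# l1-norm of the aggregation vector matches the displayed expression
assert np.isclose(agg.sum(), (tau - rho * (1 - rho**tau) / (1 - rho)) / (1 - rho))
```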
A.2 Auxiliary Results
Remark 6 (Impact of heterogeneity σ_G even with τ = 1). Consider two simple minimization problems:

    (P1): min_x (1/n) Σ_{i=1}^n f_i(x)    and    (P2): min_x f(x).

(P1) is a simple distributed minimization problem with n clients, which we solve using synchronous distributed SGD: at iteration t, each client i computes a stochastic gradient ∇f_i(x^{(t)}; ξ_i^{(t)}) and sends it to the server, which averages these and takes a step in the direction (1/n) Σ_{i=1}^n ∇f_i(x^{(t)}; ξ_i^{(t)}). On the other hand, (P2) is a centralized minimization problem, where at each iteration t the agent computes a stochastic gradient estimator with batch size n, (1/n) Σ_{i=1}^n ∇f(x^{(t)}; ξ_i^{(t)}). We compare the variances of the two global gradient estimators. For (P1),

    E‖(1/n) Σ_{i=1}^n ∇f_i(x^{(t)}; ξ_i^{(t)}) − ∇f(x^{(t)})‖²
        ≤ (1/n²) Σ_{i=1}^n [ σ_L² + β_L² E‖∇f_i(x^{(t)})‖² ]
        ≤ σ_L²/n + (β_L²/n) [ β_H² E‖∇f(x^{(t)})‖² + σ_G² ],

while for (P2),

    E‖(1/n) Σ_{i=1}^n ∇f(x^{(t)}; ξ_i^{(t)}) − ∇f(x^{(t)})‖²
        = (1/n²) Σ_{i=1}^n E‖∇f(x^{(t)}; ξ_i^{(t)}) − ∇f(x^{(t)})‖²
        ≤ σ_L²/n + (β_L²/n) E‖∇f(x^{(t)})‖².

Since almost all existing works consider the local variance bound (Assumption 2) with β_L = 0, the global gradient estimators in both synchronous distributed SGD (P1) and single-agent minibatch SGD (P2) satisfy the same σ_L²/n variance bound. Therefore, in most existing federated works on minimization (Wang et al., 2020; Yang et al., 2021) and minimax problems (Sharma et al., 2022), the full-synchronization error only depends on the local variance σ_L². However, as seen above, for β_L > 0 this apparent equivalence breaks down. Koloskova et al. (2020), which considers a similar local variance assumption as ours for minimization problems, also shows a similar dependence on the heterogeneity σ_G.
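A small Monte Carlo illustration of this gap (our own construction with quadratic clients; the constants are arbitrary): when the stochastic-gradient noise scales with the local gradient norm (β_L > 0), the distributed estimator (P1) has strictly larger variance than the centralized minibatch estimator (P2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 8, 3, 20000
x = rng.normal(size=d)
C = rng.normal(size=(n, d))             # client optima: f_i(z) = 0.5*||z - c_i||^2
gf = x - C.mean(axis=0)                 # grad f(x), where f = (1/n) sum_i f_i

def avg_estimator(distributed):
    if distributed:                     # (P1): client i draws a noisy grad f_i(x)
        grads = x - C
    else:                               # (P2): n noisy copies of grad f(x)
        grads = np.tile(gf, (n, 1))
    # noise scale grows with the gradient norm, mimicking beta_L > 0 in Assumption 2
    scale = 0.1 + 0.5 * np.linalg.norm(grads, axis=1, keepdims=True)
    return np.mean(grads + scale * rng.normal(size=(n, d)), axis=0)

def mse(flag):
    return np.mean([np.sum((avg_estimator(flag) - gf) ** 2) for _ in range(trials)])

v1, v2 = mse(True), mse(False)
assert v1 > v2                          # heterogeneity inflates the estimator variance
```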

A.3 Useful Inequalities


Lemma A.1 (Young's inequality). Given two same-dimensional vectors u, v ∈ R^d, the Euclidean inner product can be bounded as follows:

    ⟨u, v⟩ ≤ ‖u‖²/(2γ) + γ‖v‖²/2,    for every constant γ > 0.
Lemma A.2 (Strong Concavity). A function g : X × Y → R is µ-strongly concave in y if there exists a constant µ > 0 such that for all x ∈ X and all y, y' ∈ Y, the following inequality holds:

    g(x, y) ≤ g(x, y') + ⟨∇_y g(x, y'), y − y'⟩ − (µ/2) ‖y − y'‖².
Lemma A.3 (Jensen’s inequality). Given a convex function f and a random variable X, the following holds.
f (E[X]) ≤ E [f (X)] .
Lemma A.4 (Sum of squares). For a positive integer K, and a set of vectors x_1, . . . , x_K, the following holds:

    ‖ Σ_{k=1}^K x_k ‖² ≤ K Σ_{k=1}^K ‖x_k‖².
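A one-line numeric sanity check of Lemma A.4 (ours, with arbitrary random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))             # K = 7 vectors in R^4
lhs = np.sum(np.sum(X, axis=0) ** 2)    # || sum_k x_k ||^2
rhs = X.shape[0] * np.sum(X ** 2)       # K * sum_k ||x_k||^2
assert lhs <= rhs
```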

Lemma A.5 (Quadratic growth condition, Karimi et al. (2016)). If a function g satisfies Assumptions 1, 4, then for all x, the following conditions hold:

    g(x) − min_z g(z) ≥ (µ/2) ‖x_p − x‖²,
    ‖∇g(x)‖² ≥ 2µ ( g(x) − min_z g(z) ),

where x_p is the projection of x onto the set of minimizers of g.

Lemma A.6. For an L-smooth, convex function g, the following inequality holds:

    ‖∇g(y) − ∇g(x)‖² ≤ 2L [ g(y) − g(x) − ⟨∇g(x), y − x⟩ ].    (10)

Lemma A.7 (Proposition 6 in Cho & Yun (2022)). For L-smooth function g which is bounded below by g , the
following inequality holds for all x
2
E k∇g(x)k ≤ 2L [g(x) − g ∗ ] . (11)
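These inequalities are elementary, and two of them can be spot-checked numerically. The following sketch (the dimension, vector count, and the constant γ are arbitrary choices for illustration, not from the paper) verifies Lemmas A.1 and A.4 on random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, gamma = 5, 7, 0.3   # arbitrary dimension, vector count, and Young constant

# Lemma A.1 (Young): <u, v> <= ||u||^2 / (2*gamma) + gamma * ||v||^2 / 2
u, v = rng.normal(size=d), rng.normal(size=d)
lhs_young = float(u @ v)
rhs_young = float(u @ u) / (2 * gamma) + gamma * float(v @ v) / 2

# Lemma A.4 (sum of squares): ||sum_k x_k||^2 <= K * sum_k ||x_k||^2
X = rng.normal(size=(K, d))
lhs_sos = float(np.linalg.norm(X.sum(axis=0)) ** 2)
rhs_sos = K * float((X ** 2).sum())
```

Both bounds hold for every draw (they follow from 2ab ≤ a² + b² and from Jensen's inequality, respectively), so the check never fails.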

A.4 Comparison of Convergence Rates

Table 2: Comparison of the convergence rates for different classes of nonconvex minimax problems. n is the total number of clients, while P is the number of clients sampled in each round under partial client participation. T is the number of communication rounds, τ is the number of local updates per client, and σL², σG² are the stochastic gradient variance and global heterogeneity, respectively (Assumptions 2, 3). κ = Lf/µ is the condition number. For a fair comparison with existing works, our results in this table are specialized to the case when all clients (i) have equal weights (pi = 1/n), (ii) perform an equal number of local updates (τi = τ), and (iii) use the same local update algorithm, SGDA. However, our results (Section 4) apply under more general settings when (i)-(iii) do not hold.

Nonconvex-Strongly-Concave (NC-SC) / Nonconvex-Polyak-Łojasiewicz (NC-PL):
- Sharma et al. (2022) (no partial participation, no system heterogeneity):
  \( \mathcal{O}\Big( \frac{1}{\sqrt{n\tau T}} + \frac{n\tau(\sigma_L^2 + \sigma_G^2)}{T} \Big) \)
- Yang et al. (2022a) (partial participation, no system heterogeneity):
  \( \mathcal{O}\Big( \sqrt{\frac{(n-P)\,\sigma_G^2\,\tau}{n\,PT}} + \frac{1}{\sqrt{P\tau T}} + \frac{\sigma_L^2 + \tau\sigma_G^2}{\tau T} \Big) \)
- Our work, Theorem 1 (partial participation, system heterogeneity):
  \( \mathcal{O}\Big( \sqrt{\frac{(n-P)\,\sigma_G^2\,\tau}{(n-1)\,PT}} + \frac{1}{\sqrt{P\tau T}} + \frac{\sigma_L^2 + \tau\sigma_G^2}{\tau T} \Big) \)

Nonconvex-Concave (NC-C):
- Sharma et al. (2022) (no partial participation, no system heterogeneity):
  \( \mathcal{O}\Big( \frac{1}{(\tau n T)^{1/4}} \Big) + \mathcal{O}\Big( \frac{(n\tau)^{3/2}}{\sqrt{T}} \Big) \)
- Our work, Theorem 2 (partial participation, system heterogeneity):
  \( \mathcal{O}\Big( \sqrt[4]{\frac{n-P}{(n-1)PT}} + \frac{1}{(\tau P T)^{1/4}} + \frac{(P\tau)^{1/4}}{T^{3/4}} \Big) \)

Nonconvex-One-Point-Concave (NC-1PC):
- Sharma et al. (2022) (no partial participation, no system heterogeneity):
  \( \mathcal{O}\Big( \frac{1}{(\tau T)^{1/4}} \Big) + \mathcal{O}\Big( \frac{\tau^{3/2}}{\sqrt{T}} \Big) \)
- Our work, Theorem 3 (partial participation, system heterogeneity):
  \( \mathcal{O}\Big( \sqrt[4]{\frac{n-P}{(n-1)PT}} + \frac{1}{(\tau P T)^{1/4}} + \frac{(P\tau)^{1/4}}{T^{3/4}} \Big) \)

B Convergence of Fed-Norm-SGDA for Nonconvex-Strongly-Concave Functions (Theorem 1)

We organize this section as follows. First, in Appendix B.1 we present some intermediate results, which we use to prove the main theorem. Next, in Appendix B.2, we present the proof of Theorem 1, which is followed by the proofs of the intermediate results in Appendix B.3. Appendix B.4 contains some auxiliary results. Finally, in Appendix B.5 we discuss the convergence result for nonconvex-PL functions.
The problem we solve is
\[
\min_x \max_y \Big\{ \widetilde{F}(x, y) \triangleq \sum_{i=1}^{n} w_i f_i(x, y) \Big\}.
\]

We define \(\widetilde{\Phi}(x) \triangleq \max_y \widetilde{F}(x, y)\) and \(\widetilde{y}^*(x) \in \arg\max_y \widetilde{F}(x, y)\). Since \(\widetilde{F}(x, \cdot)\) is µ-strongly concave, \(\widetilde{y}^*(x)\) is unique. In Fed-Norm-SGDA (Algorithm 1), the client updates are given by
\[
x_i^{(t,k)} = x^{(t)} - \eta_x^c \sum_{j=0}^{k-1} a_i^{(j)}(k)\, \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}; \xi_i^{(t,j)}\big),
\qquad
y_i^{(t,k)} = y^{(t)} + \eta_y^c \sum_{j=0}^{k-1} a_i^{(j)}(k)\, \nabla_y f_i\big(x_i^{(t,j)}, y_i^{(t,j)}; \xi_i^{(t,j)}\big),
\tag{12}
\]
where \(1 \le k \le \tau_i\).

These client updates are then aggregated to compute \(\{g_{x,i}^{(t)}, g_{y,i}^{(t)}\}\):
\[
g_{x,i}^{(t)} = \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)}\big);
\qquad
h_{x,i}^{(t)} = \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big);
\]
\[
g_{y,i}^{(t)} = \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \nabla_y f_i\big(x_i^{(t,k)}, y_i^{(t,k)}; \xi_i^{(t,k)}\big);
\qquad
h_{y,i}^{(t)} = \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \nabla_y f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big).
\]

Remark 7. Note that we have made explicit the dependence on k in \(a_i^{(j)}(k)\) above. This was omitted in the main paper to avoid tedious notation. However, for some local optimizers, such as local momentum at the clients (Appendix A.1), the coefficients \(a_i^{(j)}(k)\) change with k. We assume in our subsequent analysis that \(a_i^{(j)}(k) \le \alpha\) for all \(j \in \{0, 1, \dots, k-1\}\) and for all \(k \in \{1, 2, \dots, \tau_i\}\). Further, we assume that \(\|a_i(k)\|_1 \le \|a_i(k+1)\|_1\) and \(\|a_i(k)\|_2 \le \|a_i(k+1)\|_2\) for feasible k. We also use the shorthand \(a_i \triangleq a_i(\tau_i)\).
At iteration t, the server samples \(|\mathcal{C}^{(t)}|\) clients without replacement (WOR) uniformly at random. While aggregating at the server, client i's update is weighted by \(\widetilde{w}_i = w_i n / |\mathcal{C}^{(t)}|\). The aggregates \((g_x^{(t)}, g_y^{(t)})\) computed at the server are of the form
\[
g_x^{(t)} = \sum_{i\in\mathcal{C}^{(t)}} \widetilde{w}_i\, g_{x,i}^{(t)}, \quad\text{such that}\quad
\mathbb{E}_{\mathcal{C}^{(t)}}\big[g_x^{(t)}\big] = \mathbb{E}_{\mathcal{C}^{(t)}}\Big[ \sum_{i=1}^{n} \mathbb{I}\big(i\in\mathcal{C}^{(t)}\big)\,\widetilde{w}_i\, g_{x,i}^{(t)} \Big] = \sum_{i=1}^{n} w_i\, g_{x,i}^{(t)},
\]
\[
g_y^{(t)} = \sum_{i\in\mathcal{C}^{(t)}} \widetilde{w}_i\, g_{y,i}^{(t)}, \quad\text{such that}\quad
\mathbb{E}_{\mathcal{C}^{(t)}}\big[g_y^{(t)}\big] = \mathbb{E}_{\mathcal{C}^{(t)}}\Big[ \sum_{i=1}^{n} \mathbb{I}\big(i\in\mathcal{C}^{(t)}\big)\,\widetilde{w}_i\, g_{y,i}^{(t)} \Big] = \sum_{i=1}^{n} w_i\, g_{y,i}^{(t)}.
\tag{13}
\]
For simplicity of analysis, unless stated otherwise, we assume that \(|\mathcal{C}^{(t)}| = P\) for all t. Finally, the server updates the x, y variables as
\[
x^{(t+1)} = x^{(t)} - \tau_{\mathrm{eff}}\,\gamma_x^s\, g_x^{(t)}, \qquad
y^{(t+1)} = y^{(t)} + \tau_{\mathrm{eff}}\,\gamma_y^s\, g_y^{(t)}.
\]
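As an illustration of the normalized aggregation and of the unbiasedness claim in (13), the sketch below uses toy sizes and plain local SGD (so that \(a_i^{(k)}(\tau_i) = 1\) and \(\|a_i\|_1 = \tau_i\); all values are hypothetical, and this is not the authors' code). It averages the reweighted partial-participation aggregate over every size-P subset of clients and recovers the full aggregate \(\sum_i w_i g_{x,i}\) exactly:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, P, d = 4, 2, 3                      # toy sizes (hypothetical)
tau = [1, 2, 3, 5]                     # heterogeneous local-step counts tau_i
w = np.full(n, 1.0 / n)                # client weights w_i

# Per-client normalized directions g_{x,i}: with plain local SGD the weights
# are a_i^{(k)}(tau_i) = 1, so ||a_i||_1 = tau_i and the aggregate is a mean.
# Normalizing removes the bias toward clients that take more local steps.
g = np.stack([rng.normal(size=(tau[i], d)).mean(axis=0) for i in range(n)])

full_aggregate = (w[:, None] * g).sum(axis=0)      # sum_i w_i g_{x,i}

# Partial participation: sample P clients WOR and reweight by n/P.
# Averaging over all C(n, P) subsets recovers the full aggregate, as in (13).
subsets = list(itertools.combinations(range(n), P))
sampled = [sum(w[i] * (n / P) * g[i] for i in C) for C in subsets]
subset_mean = np.mean(sampled, axis=0)
```

Each client appears in \(\binom{n-1}{P-1}\) of the \(\binom{n}{P}\) subsets, which is exactly what the n/P reweighting cancels, so the subset average matches the full aggregate.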

B.1 Intermediate Lemmas


We begin with the following result from Nouiehed et al. (2019) about the smoothness of \(\widetilde{\Phi}(\cdot)\).
Lemma B.1. If a function f(·, ·) satisfies Assumptions 1, 4 (\(L_f\)-smoothness and µ-strong concavity in y), then \(\phi(\cdot) \triangleq \max_y f(\cdot, y)\) is \(L_\Phi\)-smooth with \(L_\Phi = \kappa L_f/2 + L_f\), where \(\kappa = L_f/\mu\) is the condition number.
Lemma B.2. Suppose the local client loss functions {fi} satisfy Assumptions 1, 3, and the stochastic oracles for the local functions satisfy Assumption 2. Suppose the server selects P clients in each round. Then the iterates generated by Fed-Norm-SGDA (Algorithm 1) satisfy
\[
\mathbb{E}\big\|g_x^{(t)}\big\|^2 = \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}} \widetilde{w}_i\, g_{x,i}^{(t)}\Big\|^2
\le \frac{n}{P}\frac{P-1}{n-1}\, \mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{n}{P}\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1} \big[a_i^{(k)}(\tau_i)\big]^2 \Big( \sigma_L^2 + \beta_L^2\, \mathbb{E}\big\|\nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big)\big\|^2 \Big)
\]
\[
+ \frac{n(n-P)}{n-1}\Bigg[ \frac{2L_f^2}{P}\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \Delta_{x,y}^{(t,k)}(i)
+ \frac{2}{P}\big(\max_i w_i\big)\Big( \beta_G^2\, \big\|\nabla_x \widetilde{F}(x^{(t)}, y^{(t)})\big\|^2 + \sigma_G^2 \Big) \Bigg],
\tag{14}
\]
where \(\Delta_{x,y}^{(t,k)}(i) \triangleq \mathbb{E}\big[ \|x_i^{(t,k)} - x^{(t)}\|^2 + \|y_i^{(t,k)} - y^{(t)}\|^2 \big]\) is the iterate drift for client i, at local iteration k in the t-th communication round.
Lemma B.3. Suppose the local client loss functions {fi} satisfy Assumptions 1, 3, 4, and the stochastic oracles for the local functions satisfy Assumption 2. Also, the server learning rate \(\gamma_x^s\) satisfies
\(64\,\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi^2 \beta_L^2\beta_G^2\, \frac{n}{P}\big( \max_i w_i\,\|a_i\|_2^2/\|a_i\|_1^2 \big) \le 1\),
\(8\,\tau_{\mathrm{eff}}\gamma_x^s L_\Phi\, (\max_i w_i)\, \frac{n}{P}\frac{n-P}{n-1}\, \max\{8\beta_G^2, 1\} \le 1\), and
\(8\,\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \beta_L^2\, \frac{n}{P}\big( \max_{i,k} w_i\, a_i^{(k)}(\tau_i)/\|a_i\|_1 \big) \le 1\).
Then the iterates generated by Algorithm 1 satisfy
\[
\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{\Phi}(x^{(t)})\big]
\le -\frac{7\tau_{\mathrm{eff}}\gamma_x^s}{16}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
- \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
\]
\[
+ \frac{5}{4}\tau_{\mathrm{eff}}\gamma_x^s L_f^2 \sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ \frac{9\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{4\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\tag{15}
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\Bigg[ \sigma_L^2 \sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( 2(\max_i w_i)\frac{n-P}{n-1} + 2\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg].
\]
Remark. The bound in Equation (15) looks very similar to the corresponding one-step decay bound for simple smooth minimization problems. The major difference is the presence of \(\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]\), which quantifies the inaccuracy of \(y^{(t)}\) in solving the max problem \(\max_y \widetilde{F}(x^{(t)}, y)\). The term \(\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)\) is the client drift and is bounded in Lemma B.4.
Lemma B.4. Suppose the local loss functions {fi} satisfy Assumptions 1, 3, 4, and the stochastic oracles for the local functions satisfy Assumption 2. Further, in Algorithm 1, we choose learning rates \(\eta_x^c, \eta_y^c\) such that \(\max\{\eta_x^c, \eta_y^c\} \le \frac{1}{2L_f (\max_i \|a_i\|_1)\sqrt{2(1+\beta_L^2)}}\). Then, the iterates \(\{x_i^{(t)}, y_i^{(t)}\}\) generated by Fed-Norm-SGDA (Algorithm 1) satisfy
\[
L_f^2 \sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\, \Delta_{x,y}^{(t,k)}(i)
\le 2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big) L_f^2 \sigma_L^2 \sum_{i=1}^{n} w_i \|a_{i,-1}\|_2^2
+ 4 L_f^2 M_{a_{-1}} \big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)\, \sigma_G^2
\]
\[
+ 8 L_f^2 M_{a_{-1}} \beta_G^2 [\eta_x^c]^2\, \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
+ 8 L_f^3 M_{a_{-1}} \beta_G^2 \big( 2\kappa [\eta_x^c]^2 + [\eta_y^c]^2 \big)\, \mathbb{E}\big[ \widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)}) \big],
\]
where \(M_{a_{-1}} \triangleq \max_i \big( \|a_{i,-1}\|_1^2 + \beta_L^2 \|a_{i,-1}\|_2^2 \big)\).

Lemma B.5. Suppose the local loss functions {fi} satisfy Assumptions 1, 3, 4, and the stochastic oracles for the local functions satisfy Assumption 2. The server learning rates \(\gamma_x^s, \gamma_y^s\) satisfy the following conditions:
\[
2\tau_{\mathrm{eff}}\gamma_y^s L_f \le 1, \qquad
\tau_{\mathrm{eff}}\gamma_y^s \kappa L_f \beta_G^2\, \frac{n}{P}\, \max\Big\{ \beta_L^2 \max_i \frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2},\ \frac{n-P}{n-1}\max_i w_i \Big\} \le \frac{1}{64}, \qquad
\gamma_x^s \le \frac{\gamma_y^s}{156\kappa^2},
\]
\[
8\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \beta_L^2\, \frac{n}{P}\, \max\Big\{ \max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1},\ \frac{n-P}{n-1}\max_i w_i \Big\} \le 1, \qquad
\tau_{\mathrm{eff}}\gamma_x^s L_f \beta_G\, \sqrt{\frac{n}{P}}\, \max\Big\{ \sqrt{\frac{n-P}{n-1}\max_i w_i},\ \beta_L \max_i \frac{\sqrt{w_i}\,\|a_i\|_2}{\|a_i\|_1} \Big\} \le \frac{1}{40\kappa}.
\]
The client learning rates \(\eta_x^c, \eta_y^c\) satisfy \(\eta_y^c L_f \beta_G \le \frac{1}{16\sqrt{\kappa M_{a_{-1}}}}\) and \(\eta_x^c L_f \beta_G \le \frac{1}{64\kappa\sqrt{M_{a_{-1}}}}\), respectively. Then the iterates generated by Fed-Norm-SGDA (Algorithm 1) satisfy
\[
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\le \frac{4\big[ \widetilde{\Phi}(x^{(0)}) - \widetilde{F}(x^{(0)}, y^{(0)}) \big]}{\tau_{\mathrm{eff}}\gamma_y^s \mu T}
+ \frac{1}{T}\sum_{t=0}^{T-1}\Bigg[ \frac{1}{12\mu\kappa^2}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 + \frac{8\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_\Phi}{\gamma_y^s \mu}\,\frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2 \Bigg]
\]
\[
+ 6\tau_{\mathrm{eff}}\gamma_y^s \kappa\, \frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + 2\sigma_G^2\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
+ 18\kappa L_f\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2\sigma_G^2 M_{a_{-1}} \Bigg]
\]
\[
+ \frac{8\tau_{\mathrm{eff}}[\gamma_x^s]^2\kappa}{\gamma_y^s}\, \frac{n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2\sigma_G^2 M_{a_{-1}} \Bigg].
\tag{16}
\]

B.2 Proof of Theorem 1

For the sake of completeness, we first state the full statement of Theorem 1 here.
Theorem. Suppose the local loss functions \(\{f_i\}_i\) satisfy Assumptions 1, 2, 3, 4. Suppose the server selects clients using the without-replacement sampling scheme (WOR). Also, the server learning rates \(\gamma_x^s, \gamma_y^s\) and the client learning rates \(\eta_x^c, \eta_y^c\) satisfy the conditions specified in Lemma B.5. Then the iterates generated by Fed-Norm-SGDA (Algorithm 1) satisfy
\[
\min_{t\in[0:T-1]} \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
\le \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
\le \underbrace{\mathcal{O}\Big( \kappa^2\Big[ \frac{\Delta_{\widetilde{\Phi}}}{\tau_{\mathrm{eff}}\gamma_y^s T} + \frac{\gamma_y^s L_f}{P}\big( A_w\sigma_L^2 + B_w\beta_L^2\sigma_G^2 \big) \Big] \Big)}_{\text{Error with full synchronization}}
+ \underbrace{\mathcal{O}\Big( \kappa^2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\big( C_w\sigma_L^2 + D\sigma_G^2 \big) \Big)}_{\text{Error due to local updates}}
+ \underbrace{\mathcal{O}\Big( \kappa^2\,\frac{n-P}{n-1}\,\frac{\gamma_y^s L_f E_w \tau_{\mathrm{eff}}\sigma_G^2}{P} \Big)}_{\text{Partial participation error}},
\]
where \(\kappa = L_f/\mu\) is the condition number, \(\widetilde{\Phi}(x) \triangleq \max_y \widetilde{F}(x, y)\) is the envelope function, \(\Delta_{\widetilde{\Phi}} \triangleq \widetilde{\Phi}(x^{(0)}) - \min_x \widetilde{\Phi}(x)\), \(A_w \triangleq n\tau_{\mathrm{eff}}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\), \(B_w \triangleq n\tau_{\mathrm{eff}}\max_i \frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\), \(C_w \triangleq \sum_{i=1}^{n} w_i\big( \|a_i\|_2^2 - [a_i^{(\tau_i-1)}(\tau_i)]^2 \big)\), \(D \triangleq \max_i\big( \beta_L^2\|a_{i,-1}\|_2^2 + \|a_{i,-1}\|_1^2 \big)\), where \(a_{i,-1} \triangleq [a_i^{(0)}, a_i^{(1)}, \dots, a_i^{(\tau_i-2)}]^\top\) for all i, and \(E_w \triangleq n\max_i w_i\).
Using \(\gamma_y^s = \Theta\big( \frac{1}{L_f}\sqrt{\frac{P}{\bar{\tau}T}} \big)\) and \(\eta_x^c \le \eta_y^c = \Theta\big( \frac{1}{L_f\bar{\tau}\sqrt{T}} \big)\), where \(\bar{\tau} = \frac{1}{n}\sum_{i=1}^{n}\tau_i\), in the bounds above, we get
\[
\min_{t\in[T]} \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
\le \underbrace{\mathcal{O}\Big( \kappa^2\,\frac{(\bar{\tau}/\tau_{\mathrm{eff}}) + A_w\sigma_L^2 + B_w\beta_L^2\sigma_G^2}{\sqrt{P\bar{\tau}T}} \Big)}_{\text{Error with full synchronization}}
+ \underbrace{\mathcal{O}\Big( \kappa^2\,\frac{C_w\sigma_L^2 + D\sigma_G^2}{\bar{\tau}^2 T} \Big)}_{\text{Local updates error}}
+ \underbrace{\mathcal{O}\Big( \frac{n-P}{n-1}\cdot\frac{\kappa^2 E_w\tau_{\mathrm{eff}}\sigma_G^2}{\sqrt{P\bar{\tau}T}} \Big)}_{\text{Partial participation error}}.
\]

Proof. Using Lemma B.3, and substituting in the bound on iterates' drift from Lemma B.4, we can bound
\[
\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)})\big] - \widetilde{\Phi}(x^{(t)})
\le -\frac{7\tau_{\mathrm{eff}}\gamma_x^s}{16}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
- \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{9\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{4\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( 2(\max_i w_i)\frac{n-P}{n-1} + 2\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
+ \frac{5}{2}\tau_{\mathrm{eff}}\gamma_x^s\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2\sigma_G^2 M_{a_{-1}} \Bigg]
\]
\[
+ 10\tau_{\mathrm{eff}}\gamma_x^s L_f^2 M_{a_{-1}}\beta_G^2\Big( [\eta_x^c]^2\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 + L_f\big( 2\kappa[\eta_x^c]^2 + [\eta_y^c]^2 \big)\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big] \Big).
\tag{17}
\]
Summing (17) over \(t = 0, \dots, T-1\), substituting the bound on \(\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]\) from Lemma B.5, and rearranging the terms, we get
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
= \mathcal{O}\Bigg( \frac{\kappa^2\Delta_{\widetilde{\Phi}}}{\tau_{\mathrm{eff}}\gamma_y^s T}
+ \tau_{\mathrm{eff}}\gamma_y^s L_f\kappa^2\,\frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg] \Bigg)
\]
\[
+ \mathcal{O}\Bigg( \kappa^2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\Big( \|a_i\|_2^2 - \big[a_i^{(\tau_i-1)}(\tau_i)\big]^2 \Big) + \sigma_G^2\max_i\Big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \Big) \Bigg] \Bigg).
\tag{18}
\]
Consequently, using the constants \(A_w, B_w, C_w, D, E_w\), (18) can be simplified to
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
\le \mathcal{O}\Big( \kappa^2\frac{\Delta_{\widetilde{\Phi}}}{\tau_{\mathrm{eff}}\gamma_y^s T}
+ \frac{\gamma_y^s L_f}{P}\big( A_w\sigma_L^2 + (B_w\beta_L^2 + E_w\tau_{\mathrm{eff}})\sigma_G^2 \big)
+ \kappa^2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\big( C_w\sigma_L^2 + D\sigma_G^2 \big) \Big),
\]
which completes the proof.

Convergence in terms of F
Proof of Corollary 1.1. According to the definitions of Φ(x) and \(\widetilde{\Phi}(x)\), we have
\[
\nabla\Phi(x) - \nabla\widetilde{\Phi}(x)
= \sum_{i=1}^{n} \big[ p_i\nabla_x f_i\big(x, y^*(x)\big) - w_i\nabla_x f_i\big(x, \widetilde{y}^*(x)\big) \big]
\qquad \big( y^*(x) \in \arg\max_y F(x, y) \big)
\]
\[
= \sum_{i=1}^{n} p_i\big[ \nabla_x f_i\big(x, y^*(x)\big) - \nabla_x f_i\big(x, \widetilde{y}^*(x)\big) \big] + \sum_{i=1}^{n} (p_i - w_i)\,\nabla_x f_i\big(x, \widetilde{y}^*(x)\big)
\]
\[
= \big[ \nabla_x F\big(x, y^*(x)\big) - \nabla_x F\big(x, \widetilde{y}^*(x)\big) \big] + \sum_{i=1}^{n} \frac{p_i - w_i}{\sqrt{w_i}}\cdot\sqrt{w_i}\,\nabla_x f_i\big(x, \widetilde{y}^*(x)\big).
\]
Taking norms, using \(L_f\)-smoothness, and applying the Cauchy-Schwarz inequality, we get
\[
\big\|\nabla\Phi(x) - \nabla\widetilde{\Phi}(x)\big\|^2
\le 2L_f^2\,\big\|y^*(x) - \widetilde{y}^*(x)\big\|^2
+ 2\Bigg[ \sum_{i=1}^{n} \frac{(p_i - w_i)^2}{w_i} \Bigg]\Bigg[ \sum_{i=1}^{n} w_i\big\|\nabla_x f_i\big(x, \widetilde{y}^*(x)\big)\big\|^2 \Bigg]
\le 2L_f^2\,\big\|y^*(x) - \widetilde{y}^*(x)\big\|^2 + 2\chi^2_{p\|w}\Big( \beta_G^2\big\|\nabla\widetilde{\Phi}(x)\big\|^2 + \sigma_G^2 \Big),
\]
where the last inequality uses Assumption 3. Next, note that
\[
\|\nabla\Phi(x)\|^2 \le 2\,\big\|\nabla\Phi(x) - \nabla\widetilde{\Phi}(x)\big\|^2 + 2\,\big\|\nabla\widetilde{\Phi}(x)\big\|^2 .
\]
Therefore, we obtain
\[
\min_{t\in[T]} \big\|\nabla\Phi(x^{(t)})\big\|^2
\le \frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla\Phi(x^{(t)})\big\|^2
\le 2\big[ 2\chi^2_{p\|w}\beta_G^2 + 1 \big]\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
+ 4\Bigg[ \chi^2_{p\|w}\sigma_G^2 + L_f^2\,\frac{1}{T}\sum_{t=0}^{T-1}\big\|y^*(x^{(t)}) - \widetilde{y}^*(x^{(t)})\big\|^2 \Bigg]
\]
\[
= 2\big[ 2\chi^2_{p\|w}\beta_G^2 + 1 \big]\epsilon_{\mathrm{opt}}
+ 4\Bigg[ \chi^2_{p\|w}\sigma_G^2 + L_f^2\,\frac{1}{T}\sum_{t=0}^{T-1}\big\|y^*(x^{(t)}) - \widetilde{y}^*(x^{(t)})\big\|^2 \Bigg],
\]
where \(\epsilon_{\mathrm{opt}}\) denotes the optimization error on the right-hand side of (4) in Theorem 1.

Proof of Corollary 1.2. If clients are weighted equally (wi = pi = 1/n for all i), with each carrying out τ steps of local SGDA, as seen in (6) we get
\[
\min_{t\in[T]} \big\|\nabla\Phi(x^{(t)})\big\|^2
\le \mathcal{O}\Bigg( \frac{\sigma_L^2 + \beta_L^2\sigma_G^2}{\sqrt{P\tau T}} + \frac{\sigma_L^2 + \tau\sigma_G^2}{\tau T} + \sqrt{\frac{n-P}{n-1}\,\frac{\tau}{PT}}\,\sigma_G^2 \Bigg).
\]
• For full client participation, this reduces to
\[
\min_{t\in[T]} \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 \le \mathcal{O}\Big( \frac{1}{\sqrt{n\tau T}} + \frac{1}{T} \Big).
\]
To reach an ε-stationary point, assuming nτ ≤ T, the per-client gradient complexity is \(T\tau = \mathcal{O}\big( \frac{1}{n\epsilon^4} \big)\). Since τ ≤ T/n, the minimum number of communication rounds required is \(T = \mathcal{O}\big( \frac{1}{\epsilon^2} \big)\).
• For partial participation, \(\mathcal{O}\big( \sqrt{\frac{n-P}{n-1}\frac{\tau}{PT}}\,\sigma_G^2 \big)\) is the dominant term, and we do not get any convergence benefit from multiple local updates. Consequently, per-client gradient complexity and the number of communication rounds are both \(T\tau = \mathcal{O}\big( \frac{1}{P\epsilon^4} \big)\), for τ = O(1). However, if the data across clients comes from identical distributions (σG = 0), then we recover a per-client gradient complexity of \(\mathcal{O}\big( \frac{1}{P\epsilon^4} \big)\), and the number of communication rounds \(T = \mathcal{O}\big( \frac{1}{\epsilon^2} \big)\).
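As a rough illustration of the full-participation bullet above (constants and logarithmic factors dropped; this helper is only an order-level illustration, not part of the analysis), the counts can be computed as:

```python
def full_participation_counts(eps: float, n: int):
    """Order-level counts for the NC-SC rate under full participation:
    T = O(1/eps^2) rounds, tau = T/n local steps per round, and
    T * tau = O(1/(n * eps^4)) stochastic gradients per client."""
    T = 1.0 / eps**2            # communication rounds
    tau = T / n                 # local steps per round (uses tau <= T/n)
    per_client_grads = tau * T  # = 1 / (n * eps^4)
    return T, tau, per_client_grads
```

For example, eps = 0.1 and n = 10 give on the order of 100 rounds, 10 local steps per round, and about 1000 stochastic gradients per client.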

B.3 Proofs of the Intermediate Lemmas

Proof of Lemma B.2.
\[
\mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\, g_{x,i}^{(t)}\Big\|^2
= \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\big( g_{x,i}^{(t)} - h_{x,i}^{(t)} + h_{x,i}^{(t)} \big)\Big\|^2
= \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\big( g_{x,i}^{(t)} - h_{x,i}^{(t)} \big)\Big\|^2 + \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\, h_{x,i}^{(t)}\Big\|^2
\]
\[
= \sum_{i\in\mathcal{C}^{(t)}}\mathbb{E}\,\widetilde{w}_i^2\big\| g_{x,i}^{(t)} - h_{x,i}^{(t)} \big\|^2 + \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\, h_{x,i}^{(t)}\Big\|^2 \qquad \text{(sampling scheme)}
\]
\[
= \frac{n}{P}\sum_{i=1}^{n} w_i^2\,\mathbb{E}\big\| g_{x,i}^{(t)} - h_{x,i}^{(t)} \big\|^2 + \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\, h_{x,i}^{(t)}\Big\|^2 \qquad (\because \widetilde{w}_i = w_i n/P)
\]
\[
\le \frac{n}{P}\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1} \big[a_i^{(k)}(\tau_i)\big]^2\Big( \sigma_L^2 + \beta_L^2\,\mathbb{E}\big\|\nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big)\big\|^2 \Big) + \mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i\, h_{x,i}^{(t)}\Big\|^2,
\tag{19}
\]
where the last inequality follows from Assumptions 1 and 2. Further, we can bound the second term as follows:
\[
\mathbb{E}\Big\| \sum_{i\in\mathcal{C}^{(t)}}\widetilde{w}_i h_{x,i}^{(t)} - \sum_{i=1}^{n} w_i h_{x,i}^{(t)} + \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2
= \mathbb{E}\Big\| \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2 + \mathbb{E}\Big\| \sum_{i=1}^{n} \mathbb{I}\big(i\in\mathcal{C}^{(t)}\big)\widetilde{w}_i h_{x,i}^{(t)} - \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2 \qquad \text{((WOR) sampling)}
\]
\[
= \mathbb{E}\Big\| \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2
+ \sum_{i=1}^{n} \mathbb{E}\Big[ \big( \mathbb{I}(i\in\mathcal{C}^{(t)})^2\widetilde{w}_i^2 + w_i^2 - 2\,\mathbb{I}(i\in\mathcal{C}^{(t)})\widetilde{w}_i w_i \big)\big\|h_{x,i}^{(t)}\big\|^2 \Big]
+ \sum_{i\ne j}\mathbb{E}\Big\langle \big( \mathbb{I}(i\in\mathcal{C}^{(t)})\widetilde{w}_i - w_i \big)h_{x,i}^{(t)},\ \big( \mathbb{I}(j\in\mathcal{C}^{(t)})\widetilde{w}_j - w_j \big)h_{x,j}^{(t)} \Big\rangle
\]
\[
= \mathbb{E}\Big\| \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2
+ \Big( \frac{n}{P} - 1 \Big)\sum_{i=1}^{n} w_i^2\,\mathbb{E}\big\|h_{x,i}^{(t)}\big\|^2
+ \sum_{i\ne j} w_i w_j\Big( \frac{n}{P}\,\frac{P-1}{n-1} - 1 \Big)\mathbb{E}\big\langle h_{x,i}^{(t)}, h_{x,j}^{(t)} \big\rangle
\]
\[
= \frac{n}{P}\,\frac{P-1}{n-1}\,\mathbb{E}\Big\| \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\|^2 + \frac{n}{P}\,\frac{n-P}{n-1}\sum_{i=1}^{n} w_i^2\,\mathbb{E}\big\|h_{x,i}^{(t)}\big\|^2 .
\tag{20}
\]
Next, we bound the second term in (20):
\[
\sum_{i=1}^{n} w_i^2\,\mathbb{E}\big\| h_{x,i}^{(t)} - \nabla_x f_i(x^{(t)}, y^{(t)}) + \nabla_x f_i(x^{(t)}, y^{(t)}) \big\|^2
\le 2L_f^2\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ 2(\max_i w_i)\Big( \beta_G^2\big\|\nabla_x\widetilde{F}(x^{(t)}, y^{(t)})\big\|^2 + \sigma_G^2 \Big),
\tag{21}
\]
using Assumption 3, where \(\Delta_{x,y}^{(t,k)}(i) \triangleq \mathbb{E}\big[ \|x_i^{(t,k)} - x^{(t)}\|^2 + \|y_i^{(t,k)} - y^{(t)}\|^2 \big]\). Substituting (21) in (20), and using the resulting bound in (19), we get the bound in (14).
Proof of Lemma B.3. Since the local functions {fi} satisfy Assumption 4, \(\widetilde{F}(x, \cdot)\) is µ-strongly concave (µ-SC) for any x. In the proof, we use the quadratic growth property of the µ-SC function \(\widetilde{F}(x, \cdot)\), i.e., for any given x,
\[
\frac{\mu}{2}\big\|y - y^*(x)\big\|^2 \le \widetilde{F}\big(x, y^*(x)\big) - \widetilde{F}(x, y), \quad \text{for all } y,
\tag{22}
\]
where \(y^*(x) = \arg\max_{y'}\widetilde{F}(x, y')\). Using the \(L_\Phi\)-smoothness of \(\widetilde{\Phi}(\cdot)\),
\[
\mathbb{E}\widetilde{\Phi}(x^{(t+1)})
\le \mathbb{E}\widetilde{\Phi}(x^{(t)}) - \tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\Big\langle \nabla\widetilde{\Phi}(x^{(t)}),\ g_x^{(t)} \Big\rangle + \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\mathbb{E}\big\|g_x^{(t)}\big\|^2
\]
\[
= \mathbb{E}\widetilde{\Phi}(x^{(t)}) - \tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\Big\langle \nabla\widetilde{\Phi}(x^{(t)}),\ \sum_{i=1}^{n} w_i h_{x,i}^{(t)} \Big\rangle + \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\mathbb{E}\big\|g_x^{(t)}\big\|^2 \qquad \text{(using Assumption 2 and (13))}
\]
\[
= \mathbb{E}\widetilde{\Phi}(x^{(t)}) - \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 + \mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2 \Big)
+ \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\,\mathbb{E}\Big\|\nabla\widetilde{\Phi}(x^{(t)}) - \sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\mathbb{E}\big\|g_x^{(t)}\big\|^2 .
\tag{23}
\]
Next,
\[
\mathbb{E}\Big\|\nabla\widetilde{\Phi}(x^{(t)}) - \sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
= \mathbb{E}\Big\|\sum_{i=1}^{n} w_i\Big( \nabla_x f_i\big(x^{(t)}, y^*(x^{(t)})\big) - \nabla_x f_i\big(x^{(t)}, y^{(t)}\big) + \nabla_x f_i\big(x^{(t)}, y^{(t)}\big) - h_{x,i}^{(t)} \Big)\Big\|^2
\qquad \big( \text{since } y^*(x) = \arg\max_{y'}\widetilde{F}(x, y') \big)
\]
\[
\le 2L_f^2\,\mathbb{E}\big\|y^*(x^{(t)}) - y^{(t)}\big\|^2
+ 2\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i\Big( \nabla_x f_i(x^{(t)}, y^{(t)}) - \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big) \Big)\Big\|^2
\qquad \text{(Jensen's inequality; \(L_f\)-smoothness; Young's inequality)}
\]
\[
\le \frac{4L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
+ 2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\mathbb{E}\big\|\nabla_x f_i(x^{(t)}, y^{(t)}) - \nabla_x f_i\big(x_i^{(t,k)}, y_i^{(t,k)}\big)\big\|^2
\qquad \text{(quadratic growth of µ-SC functions (22); Jensen's inequality)}
\]
\[
\le \frac{4L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
+ 2L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
\qquad \text{(\(L_f\)-smoothness)},
\tag{24}
\]
where \(\Delta_{x,y}^{(t,k)}(i) \triangleq \mathbb{E}\big[ \|x_i^{(t,k)} - x^{(t)}\|^2 + \|y_i^{(t,k)} - y^{(t)}\|^2 \big]\). Further, the term containing \(\nabla_x f_i(x_i^{(t,k)}, y_i^{(t,k)})\) in (14) is bounded in Lemma B.7.
Substituting the bounds from (24), (14) and Lemma B.7 into (23), we get
\[
\mathbb{E}\widetilde{\Phi}(x^{(t+1)})
\le \mathbb{E}\widetilde{\Phi}(x^{(t)}) - \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( \mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2 + \mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2 \Big)
+ \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Bigg[ \frac{4L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big] + 2L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i) \Bigg]
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\Bigg[ \frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{n(n-P)}{P(n-1)}\,2L_f^2\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i) \Bigg]
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n\sigma_L^2}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\,\frac{n-P}{n-1}\,2(\max_i w_i)\Big( \beta_G^2\big\|\nabla_x\widetilde{F}(x^{(t)}, y^{(t)})\big\|^2 + \sigma_G^2 \Big)
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\beta_L^2\,\frac{n}{P}\Bigg[ 2L_f^2\sum_{i=1}^{n} \frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1} \big[a_i^{(k)}(\tau_i)\big]^2\,\Delta_{x,y}^{(t,k)}(i) + 2\sigma_G^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Bigg]
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\beta_L^2\,\frac{n}{P}\Big( 4\beta_G^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\Bigg[ \frac{2L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big] + \big\|\nabla_x\widetilde{\Phi}(x^{(t)})\big\|^2 \Bigg]
\]
\[
\le \mathbb{E}\widetilde{\Phi}(x^{(t)}) - \frac{7\tau_{\mathrm{eff}}\gamma_x^s}{16}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
- \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{9\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{4\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
+ \frac{5}{4}\tau_{\mathrm{eff}}\gamma_x^s L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( 2(\max_i w_i)\frac{n-P}{n-1} + 2\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg],
\]
where the coefficients are simplified using the following assumptions on the learning rate \(\gamma_x^s\):
\[
\tau_{\mathrm{eff}}\gamma_x^s L_\Phi\Bigg[ \frac{n-P}{n-1}(\max_i w_i) + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Bigg] \le \frac{P}{4n},
\qquad
\tau_{\mathrm{eff}}\gamma_x^s L_\Phi\Bigg[ \frac{n-P}{n-1}(\max_i w_i) + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Bigg] \le \frac{P}{32\beta_G^2 n}.
\]
This finishes the proof.
Proof of Lemma B.4. We use the client update equations for the individual iterates (12). To bound \(\Delta_{x,y}^{(t,k)}(i)\), we first bound a single component term \(\mathbb{E}\|x_i^{(t,k)} - x^{(t)}\|^2\). For \(1 \le k \le \tau_i\), using the modified variance assumption (Assumption 2),
\[
\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2
= [\eta_x^c]^2\,\mathbb{E}\Big\| \sum_{j=0}^{k-1} a_i^{(j)}(k)\Big( \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}; \xi_i^{(t,j)}\big) - \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) + \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) \Big) \Big\|^2
\]
\[
= [\eta_x^c]^2\Bigg( \mathbb{E}\Big\| \sum_{j=0}^{k-1} a_i^{(j)}(k)\Big( \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}; \xi_i^{(t,j)}\big) - \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) \Big) \Big\|^2
+ \mathbb{E}\Big\| \sum_{j=0}^{k-1} a_i^{(j)}(k)\,\nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) \Big\|^2 \Bigg)
\qquad \text{(using unbiasedness in Assumption 2)}
\]
\[
= [\eta_x^c]^2\Bigg( \sum_{j=0}^{k-1} \big[a_i^{(j)}(k)\big]^2\,\mathbb{E}\big\| \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}; \xi_i^{(t,j)}\big) - \nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) \big\|^2
+ \mathbb{E}\Big\| \sum_{j=0}^{k-1} a_i^{(j)}(k)\,\nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big) \Big\|^2 \Bigg)
\]
\[
\le [\eta_x^c]^2\Bigg( \sum_{j=0}^{k-1} \big[a_i^{(j)}(k)\big]^2\Big( \sigma_L^2 + \beta_L^2\big\|\nabla_x f_i(x_i^{(t,j)}, y_i^{(t,j)})\big\|^2 \Big)
+ \Big( \sum_{j=0}^{k-1} a_i^{(j)}(k) \Big)\sum_{j=0}^{k-1} a_i^{(j)}(k)\,\mathbb{E}\big\|\nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big)\big\|^2 \Bigg),
\tag{25}
\]
where the last inequality follows from Jensen's inequality (Lemma A.3). Next, note that
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{j=0}^{k-1}\big[a_i^{(j)}(k)\big]^2
\le \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{k=0}^{\tau_i-2}\big[a_i^{(k)}(\tau_i)\big]^2
= \sum_{k=0}^{\tau_i-2}\big[a_i^{(k)}(\tau_i)\big]^2
\le \|a_i\|_2^2 - \big[a_i^{(\tau_i-1)}(\tau_i)\big]^2,
\]
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{j=0}^{k-1} a_i^{(j)}(k)
\le \frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{k=0}^{\tau_i-2} a_i^{(k)}(\tau_i)
= \sum_{k=0}^{\tau_i-2} a_i^{(k)}(\tau_i)
\le \|a_i\|_1 - a_i^{(\tau_i-1)}(\tau_i),
\]
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{j=0}^{k-1}\big[a_i^{(j)}(k)\big]^2
\le \alpha\cdot\sum_{k=0}^{\tau_i-2} a_i^{(k)}(\tau_i).
\tag{26}
\]
We define \(\|a_{i,-1}\|_2^2 \triangleq \|a_i\|_2^2 - [a_i^{(\tau_i-1)}(\tau_i)]^2\) and \(\|a_{i,-1}\|_1 \triangleq \|a_i\|_1 - a_i^{(\tau_i-1)}(\tau_i)\) for the sake of brevity. Using (26), we bound the individual terms in (25):
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\sum_{j=0}^{k-1}\big[a_i^{(j)}(k)\big]^2\,\beta_L^2\big\|\nabla_x f_i(x_i^{(t,j)}, y_i^{(t,j)})\big\|^2
\le 2\beta_L^2\alpha L_f^2\sum_{j=0}^{\tau_i-2} a_i^{(j)}(\tau_i)\,\Delta_{x,y}^{(t,j)}(i) + 2\beta_L^2\|a_{i,-1}\|_2^2\,\big\|\nabla_x f_i(x^{(t)}, y^{(t)})\big\|^2.
\tag{27}
\]
Similarly,
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\Big( \sum_{j=0}^{k-1} a_i^{(j)}(k) \Big)\sum_{j=0}^{k-1} a_i^{(j)}(k)\,\mathbb{E}\big\|\nabla_x f_i\big(x_i^{(t,j)}, y_i^{(t,j)}\big)\big\|^2
\le 2\|a_{i,-1}\|_1 L_f^2\sum_{j=0}^{\tau_i-2} a_i^{(j)}(\tau_i)\,\Delta_{x,y}^{(t,j)}(i) + 2\|a_{i,-1}\|_1^2\,\big\|\nabla_x f_i(x^{(t)}, y^{(t)})\big\|^2.
\tag{28}
\]
Substituting (27), (28) in (25), we get
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\mathbb{E}\big\|x_i^{(t,k)} - x^{(t)}\big\|^2
\le [\eta_x^c]^2\sigma_L^2\|a_{i,-1}\|_2^2
+ 2[\eta_x^c]^2 L_f^2\big( \|a_{i,-1}\|_1 + \beta_L^2\alpha \big)\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ 2[\eta_x^c]^2\big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \big)\,\mathbb{E}\big\|\nabla_x f_i(x^{(t)}, y^{(t)})\big\|^2.
\tag{29}
\]
Similarly, we can bound the y-error:
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\mathbb{E}\big\|y_i^{(t,k)} - y^{(t)}\big\|^2
\le [\eta_y^c]^2\sigma_L^2\|a_{i,-1}\|_2^2
+ 2[\eta_y^c]^2 L_f^2\big( \|a_{i,-1}\|_1 + \beta_L^2\alpha \big)\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ 2[\eta_y^c]^2\big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \big)\,\mathbb{E}\big\|\nabla_y f_i(x^{(t)}, y^{(t)})\big\|^2.
\tag{30}
\]
Combining the two bounds in (29) and (30), we get
\[
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\mathbb{E}\Big[ \big\|x_i^{(t,k)} - x^{(t)}\big\|^2 + \big\|y_i^{(t,k)} - y^{(t)}\big\|^2 \Big]
\le \big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)\sigma_L^2\|a_{i,-1}\|_2^2
+ 2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\big( \|a_{i,-1}\|_1 + \beta_L^2\alpha \big)\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
\]
\[
+ 2\big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \big)\Big( [\eta_x^c]^2\,\mathbb{E}\big\|\nabla_x f_i\big(x^{(t)}, y^{(t)}\big)\big\|^2 + [\eta_y^c]^2\,\mathbb{E}\big\|\nabla_y f_i\big(x^{(t)}, y^{(t)}\big)\big\|^2 \Big).
\tag{31}
\]
Define \(A_m \triangleq 2L_f^2\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)\max_i \|a_i\|_1\big( \|a_{i,-1}\|_1 + \beta_L^2\alpha \big)\). Rearranging the terms in (31), and taking the weighted sum over agents, we get
\[
L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
\le \frac{\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\sigma_L^2}{1 - A_m}\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2
+ \frac{2L_f^2}{1 - A_m}\sum_{i=1}^{n} w_i\big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \big)\Big( [\eta_x^c]^2\,\mathbb{E}\big\|\nabla_x f_i\big(x^{(t)}, y^{(t)}\big)\big\|^2 + [\eta_y^c]^2\,\mathbb{E}\big\|\nabla_y f_i\big(x^{(t)}, y^{(t)}\big)\big\|^2 \Big)
\]
\[
\le \frac{\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\sigma_L^2}{1 - A_m}\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2
+ \frac{2L_f^2 M_{a_{-1}}}{1 - A_m}[\eta_x^c]^2\Big( \beta_G^2\,\mathbb{E}\big\|\nabla_x\widetilde{F}\big(x^{(t)}, y^{(t)}\big)\big\|^2 + \sigma_G^2 \Big)
+ \frac{2L_f^2 M_{a_{-1}}}{1 - A_m}[\eta_y^c]^2\Big( \beta_G^2\,\mathbb{E}\big\|\nabla_y\widetilde{F}\big(x^{(t)}, y^{(t)}\big)\big\|^2 + \sigma_G^2 \Big),
\tag{32}
\]
where (32) follows from Assumption 3, and we define \(M_{a_{-1}} \triangleq \max_i\big( \|a_{i,-1}\|_1^2 + \beta_L^2\|a_{i,-1}\|_2^2 \big)\). We bounded \(\mathbb{E}\|\nabla_x\widetilde{F}(x^{(t)}, y^{(t)})\|^2\) in Lemma B.6. Similarly, we can bound \(\mathbb{E}\|\nabla_y\widetilde{F}(x^{(t)}, y^{(t)})\|^2\) as follows:
\[
\mathbb{E}\big\|\nabla_y\widetilde{F}\big(x^{(t)}, y^{(t)}\big)\big\|^2
= \mathbb{E}\big\|\nabla_y\widetilde{F}\big(x^{(t)}, y^{(t)}\big) - \nabla_y\widetilde{F}\big(x^{(t)}, y^*(x^{(t)})\big)\big\|^2
\qquad \big( \because y^*(x) = \arg\max_{y'}\widetilde{F}(x, y') \big)
\]
\[
\le 2L_f\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big],
\tag{33}
\]
using \(L_f\)-smoothness and concavity of \(\widetilde{F}(x, \cdot)\) (Lemma A.6). Also, for the choice of \(\eta_x^c, \eta_y^c\), we get \(A_m \le 1/2\). Consequently, substituting the two bounds in (32), we complete the proof.
Proof of Lemma B.5. Using \(L_f\)-smoothness (Assumption 1) of \(\widetilde{F}(x, \cdot)\),
\[
\mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t)})
\le \mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t+1)}) - \mathbb{E}\Big\langle \nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)}),\ y^{(t+1)} - y^{(t)} \Big\rangle + \frac{L_f}{2}\,\mathbb{E}\big\|y^{(t+1)} - y^{(t)}\big\|^2
\]
\[
= \mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t+1)}) - \tau_{\mathrm{eff}}\gamma_y^s\,\mathbb{E}\Big\langle \nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)}),\ \sum_{i=1}^{n} w_i h_{y,i}^{(t)} \Big\rangle + \frac{\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f}{2}\,\mathbb{E}\big\|g_y^{(t)}\big\|^2
\]
\[
\le \mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t+1)})
- \frac{\tau_{\mathrm{eff}}\gamma_y^s}{2}\,\mathbb{E}\Bigg[ \big\|\nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)})\big\|^2 + \Big\|\sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2 - \Big\|\nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)}) - \sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2 \Bigg]
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f}{2}\,\mathbb{E}\big\|g_y^{(t)}\big\|^2 .
\tag{34}
\]
Next, we bound the individual terms in (34). Using quadratic growth of µ-SC functions (Lemma A.5),
\[
\mathbb{E}\big\|\nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)})\big\|^2 \ge 2\mu\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t)})\big].
\]
Next, we bound \(\mathbb{E}\big\|\nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)}) - \sum_{i=1}^{n} w_i h_{y,i}^{(t)}\big\|^2\), using similar reasoning as in (24):
\[
\mathbb{E}\Big\|\nabla_y\widetilde{F}(x^{(t+1)}, y^{(t)}) - \nabla_y\widetilde{F}(x^{(t)}, y^{(t)}) + \nabla_y\widetilde{F}(x^{(t)}, y^{(t)}) - \sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2
\le 2L_f^2\,\mathbb{E}\big\|x^{(t+1)} - x^{(t)}\big\|^2 + 2\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i\big( \nabla_y f_i(x^{(t)}, y^{(t)}) - h_{y,i}^{(t)} \big)\Big\|^2
\]
\[
\le 2L_f^2\tau_{\mathrm{eff}}^2[\gamma_x^s]^2\,\mathbb{E}\big\|g_x^{(t)}\big\|^2 + 2L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i).
\tag{35}
\]
We can bound \(\mathbb{E}\|g_x^{(t)}\|^2\) using (14) in Lemma B.2 to get
\[
\mathbb{E}\big\|g_x^{(t)}\big\|^2
\le \frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{\sigma_L^2 n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}
+ \frac{2n\sigma_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)
\]
\[
+ \frac{2nL_f^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ \frac{4n\beta_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\big\|\nabla_x\widetilde{\Phi}(x^{(t)})\big\|^2
\]
\[
+ \frac{8n\beta_G^2 L_f\kappa}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big].
\tag{36}
\]
Similarly, we can bound \(\mathbb{E}\|g_y^{(t)}\|^2\) to get
\[
\mathbb{E}\big\|g_y^{(t)}\big\|^2
\le \frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2
+ \frac{\sigma_L^2 n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}
+ \frac{2\sigma_G^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)
\]
\[
+ \frac{2nL_f^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
+ \frac{4\beta_G^2 L_f n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big].
\tag{37}
\]
Substituting (35), (36), (37) and Lemma B.4 in (34), and rearranging the terms, we get
\[
\mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t)})
\le \mathbb{E}\widetilde{F}(x^{(t+1)}, y^{(t+1)})
- \frac{\tau_{\mathrm{eff}}\gamma_y^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_y^s L_f \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2
+ \tau_{\mathrm{eff}}^3[\gamma_x^s]^2\gamma_y^s L_f^2\,\frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
- \tau_{\mathrm{eff}}\gamma_y^s\mu\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t)})\big]
\]
\[
+ \tau_{\mathrm{eff}}^3\gamma_y^s[\gamma_x^s]^2 L_f^2\,\frac{4\beta_G^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\mathbb{E}\big\|\nabla_x\widetilde{\Phi}(x^{(t)})\big\|^2
+ \tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f\Bigg[ \frac{\sigma_L^2 n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \frac{2\sigma_G^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
\qquad \big( \because 4\tau_{\mathrm{eff}}\gamma_x^s L_f \le 1,\ \gamma_x^s\kappa \le \gamma_y^s \big)
\]
\[
+ 2\tau_{\mathrm{eff}}\gamma_y^s L_f^2\sum_{i=1}^{n} \frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1} a_i^{(k)}(\tau_i)\,\Delta_{x,y}^{(t,k)}(i)
\qquad \big( \because 8\tau_{\mathrm{eff}}\gamma_x^s L_\Phi \le P \big)
\]
\[
+ 4\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f^2\,\frac{\beta_G^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big],
\tag{38}
\]
where we simplify some coefficients using \(4\tau_{\mathrm{eff}}\gamma_x^s L_f \le 1\) and \(\gamma_x^s\kappa \le \gamma_y^s\). We rearrange the terms and use the bound in Lemma B.4 to get
\[
\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t+1)})\big]
\le \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t)})\big]
- \frac{\tau_{\mathrm{eff}}\gamma_y^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_y^s L_f \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2
+ \tau_{\mathrm{eff}}^3[\gamma_x^s]^2\gamma_y^s L_f^2\,\frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
\]
\[
+ 4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\Bigg[ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) + 4M_{a_{-1}}[\eta_x^c]^2 \Bigg]\mathbb{E}\big\|\nabla_x\widetilde{\Phi}(x^{(t)})\big\|^2
\]
\[
+ \tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f\Bigg[ \frac{n\sigma_L^2}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \frac{2n\sigma_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
+ 4\tau_{\mathrm{eff}}\gamma_y^s\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2M_{a_{-1}}\sigma_G^2 \Bigg]
\]
\[
+ 4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\Bigg[ \frac{\tau_{\mathrm{eff}}\gamma_y^s n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) + 4L_f M_{a_{-1}}\big( 2\kappa[\eta_x^c]^2 + [\eta_y^c]^2 \big) \Bigg]\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big].
\tag{39}
\]
Next, note that
\[
\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t)})\big]
= \mathbb{E}\big[ \widetilde{\Phi}(x^{(t+1)}) - \widetilde{\Phi}(x^{(t)}) + \widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)}) + \widetilde{F}(x^{(t)}, y^{(t)}) - \widetilde{F}(x^{(t+1)}, y^{(t)}) \big].
\tag{40}
\]
Substituting the bound from Lemma B.4 into Lemma B.3, \(\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{\Phi}(x^{(t)})\big]\) can be bounded as follows:
\[
\mathbb{E}\widetilde{\Phi}(x^{(t+1)}) - \mathbb{E}\widetilde{\Phi}(x^{(t)})
\le -\frac{3\tau_{\mathrm{eff}}\gamma_x^s}{8}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
- \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( 1 - \tau_{\mathrm{eff}}\gamma_x^s L_\Phi\,\frac{n(P-1)}{P(n-1)} \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{5\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\]
\[
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( 2(\max_i w_i)\frac{n-P}{n-1} + 2\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
+ \frac{5}{2}\tau_{\mathrm{eff}}\gamma_x^s\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2M_{a_{-1}}\sigma_G^2 \Bigg],
\tag{41}
\]
where we use \(20L_f^2 M_{a_{-1}}\beta_G^2[\eta_x^c]^2 \le \frac{1}{8}\) and \(40L_f^3 M_{a_{-1}}\beta_G^2\big( 2\kappa[\eta_x^c]^2 + [\eta_y^c]^2 \big) \le \mu\). Next, we bound \(\mathbb{E}\big[ \widetilde{F}(x^{(t)}, y^{(t)}) - \widetilde{F}(x^{(t+1)}, y^{(t)}) \big]\).
Again, using \(L_f\)-smoothness of \(\widetilde{F}(\cdot, y)\),
\[
\mathbb{E}\big[ \widetilde{F}(x^{(t)}, y^{(t)}) - \widetilde{F}(x^{(t+1)}, y^{(t)}) \big]
\le \mathbb{E}\Big\langle -\nabla_x\widetilde{F}(x^{(t)}, y^{(t)}),\ x^{(t+1)} - x^{(t)} \Big\rangle + \frac{L_f}{2}\,\mathbb{E}\big\|x^{(t+1)} - x^{(t)}\big\|^2
\]
\[
\le \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\,\mathbb{E}\Bigg[ \big\|\nabla_x\widetilde{F}(x^{(t)}, y^{(t)}) - \nabla\widetilde{\Phi}(x^{(t)})\big\|^2 + \Big\|\nabla\widetilde{\Phi}(x^{(t)}) + \sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2 \Bigg]
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f}{2}\,\mathbb{E}\big\|g_x^{(t)}\big\|^2
\]
\[
\le \frac{2L_f^2}{\mu}\,\tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
+ \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\,\mathbb{E}\Big\|\nabla\widetilde{\Phi}(x^{(t)}) + \sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f}{2}\,\mathbb{E}\big\|g_x^{(t)}\big\|^2
\qquad \text{(\(L_f\)-smoothness, Lemma A.5)}
\]
\[
\le \frac{3\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{\mu}\,\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
+ 2\tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
+ \frac{\tau_{\mathrm{eff}}\gamma_x^s}{2}\Big( 1 + \tau_{\mathrm{eff}}\gamma_x^s L_f\,\frac{n(P-1)}{P(n-1)} \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
\qquad \text{(using (36), Lemma B.4)}
\]
\[
+ \frac{\sigma_L^2\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f}{2}\Bigg[ \frac{n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + 4\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 \Bigg]
\]
\[
+ \sigma_G^2\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f\,\frac{2n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)
+ \sigma_G^2\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f\,\frac{4n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)L_f^2 M_{a_{-1}}\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big),
\tag{42}
\]
where the coefficients are simplified since the choice of learning rates ensures that
\[
\tau_{\mathrm{eff}}\gamma_y^s L_f\,\frac{4n\beta_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\,4L_f^2 M_{a_{-1}}[\eta_x^c]^2 \le 1,
\qquad
\tau_{\mathrm{eff}}\gamma_y^s L_f\,\frac{4n\beta_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \le 1,
\]
\[
\tau_{\mathrm{eff}}\gamma_x^s\,\frac{4n\beta_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\,4L_f^2 M_{a_{-1}}\big( 2\kappa[\eta_x^c]^2 + [\eta_y^c]^2 \big) \le \frac{1}{\mu},
\qquad
\tau_{\mathrm{eff}}\gamma_x^s L_f\,\frac{8n\beta_G^2}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \le 1.
\]
We substitute the bounds from (41), (42) in (40), and subsequently in (39), to get
\[
\mathbb{E}\big[\widetilde{\Phi}(x^{(t+1)}) - \widetilde{F}(x^{(t+1)}, y^{(t+1)})\big]
\le \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\Bigg[ 1 + \frac{3\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{\mu} + \frac{5\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{2\mu} \Bigg]\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\]
\[
+ 4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\Bigg[ \frac{\tau_{\mathrm{eff}}\gamma_y^s n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) + 4L_f M_{a_{-1}}\big( 2\kappa[\eta_x^c]^2 + [\eta_y^c]^2 \big) \Bigg]\mathbb{E}\big[\widetilde{\Phi}(x^{(t)}) - \widetilde{F}(x^{(t)}, y^{(t)})\big]
\]
\[
+ 4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\Bigg[ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) + 4M_{a_{-1}}[\eta_x^c]^2 \Bigg]\mathbb{E}\big\|\nabla_x\widetilde{\Phi}(x^{(t)})\big\|^2
\]
\[
- \frac{\tau_{\mathrm{eff}}\gamma_y^s}{2}\Big( 1 - \frac{n(P-1)}{P(n-1)}\tau_{\mathrm{eff}}\gamma_y^s L_f \Big)\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{y,i}^{(t)}\Big\|^2
+ \tau_{\mathrm{eff}}^3[\gamma_x^s]^2\gamma_y^s L_f^2\,\frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
\]
\[
+ \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\frac{13\tau_{\mathrm{eff}}\gamma_x^s}{8}\,\mathbb{E}\big\|\nabla\widetilde{\Phi}(x^{(t)})\big\|^2
+ \frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2}{2}\big( L_\Phi + L_f \big)\frac{n(P-1)}{P(n-1)}\,\mathbb{E}\Big\|\sum_{i=1}^{n} w_i h_{x,i}^{(t)}\Big\|^2
\]
\[
+ \tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f\Bigg[ \frac{\sigma_L^2 n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \frac{2\sigma_G^2 n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg]
+ 4\tau_{\mathrm{eff}}\gamma_y^s\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2M_{a_{-1}}\sigma_G^2 \Bigg]
\]
\[
+ \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f}{2}\Bigg[ \frac{\sigma_L^2 n}{P}\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + 4\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 \Bigg]
\]
\[
+ \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\sigma_G^2\,\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f\Bigg[ \frac{n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big)
+ \frac{4n}{P}\Big( \frac{n-P}{n-1}\max_i w_i + \beta_L^2\max_{i,k}\frac{w_i\, a_i^{(k)}(\tau_i)}{\|a_i\|_1} \Big)L_f^2 M_{a_{-1}}\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big) \Bigg]
\]
\[
+ \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\frac{5}{2}\tau_{\mathrm{eff}}\gamma_x^s\big( [\eta_x^c]^2 + [\eta_y^c]^2 \big)L_f^2\Bigg[ \sigma_L^2\sum_{i=1}^{n} w_i\|a_{i,-1}\|_2^2 + 2M_{a_{-1}}\sigma_G^2 \Bigg]
+ \big( 1 - \tau_{\mathrm{eff}}\gamma_y^s\mu \big)\frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi}{2}\,\frac{n}{P}\Bigg[ \sigma_L^2\sum_{i=1}^{n} \frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2} + \sigma_G^2\Big( 2(\max_i w_i)\frac{n-P}{n-1} + 2\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2} \Big) \Bigg].
\tag{43}
\]

Next, we simplify the coefficients of different terms in (43).


• Coefficient of $\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]$: this can be simplified to
\[
\begin{aligned}
&(1-\tau_{\mathrm{eff}}\gamma_y^s\mu)\Big(1+\frac{11\tau_{\mathrm{eff}}\gamma_x^s L_f^2}{2\mu}\Big)\\
&\quad+4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\left[\frac{\tau_{\mathrm{eff}}\gamma_y^s n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)+4L_f^2 M_{a-1}\big(2\kappa[\eta_x^c]^2+[\eta_y^c]^2\big)\right]\le 1-\frac{\tau_{\mathrm{eff}}\gamma_y^s\mu}{4},
\end{aligned}
\]
using $\gamma_x^s\le\frac{\gamma_y^s}{11\kappa^2}$, $\tau_{\mathrm{eff}}\gamma_y^s\kappa L_f\beta_G^2\frac{n}{P}\max\big\{\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2},\ \frac{n-P}{n-1}\max_i w_i\big\}\le\frac{1}{64}$, $\kappa L_f\beta_G\eta_x^c\le\frac{1}{16\sqrt{2M_{a-1}}}$, and $L_f\beta_G\eta_y^c\le\frac{1}{16\sqrt{\kappa M_{a-1}}}$.

• Coefficient of $\mathbb{E}\big\|\sum_{i=1}^n w_i\mathbf{h}_{x,i}^{(t)}\big\|^2$: this can be simplified to
\[
\tau_{\mathrm{eff}}^3[\gamma_x^s]^2\gamma_y^s L_f^2\frac{n(P-1)}{P(n-1)}+\frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2}{2}(L_\Phi+L_f)\frac{n(P-1)}{P(n-1)}\le 2\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_\Phi\frac{n(P-1)}{P(n-1)}.\qquad(\because L_f\le L_\Phi,\ \tau_{\mathrm{eff}}\gamma_y^s L_f\le 1)
\]

• Coefficient of $\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x}^{(t)})\big\|^2$: this can be simplified to
\[
(1-\tau_{\mathrm{eff}}\gamma_y^s\mu)\frac{13\tau_{\mathrm{eff}}\gamma_x^s}{8}+4\tau_{\mathrm{eff}}\gamma_y^s L_f^2\beta_G^2\left[\frac{\tau_{\mathrm{eff}}[\gamma_x^s]^2 n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)+4M_{a-1}[\eta_x^c]^2\right]\le\frac{\tau_{\mathrm{eff}}\gamma_y^s}{48\kappa^2},
\]
using $\gamma_x^s\le\frac{\gamma_y^s}{156\kappa^2}$, $\eta_x^c L_f\beta_G\le\frac{1}{64\kappa\sqrt{M_{a-1}}}$, and $\tau_{\mathrm{eff}}\gamma_x^s L_f\beta_G\sqrt{\frac{n}{P}}\max\Big\{\sqrt{\frac{n-P}{n-1}\max_i w_i},\ \beta_L\max_i\sqrt{\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}}\Big\}\le\frac{1}{40\kappa}$.

• Coefficient of $\sigma_L^2$: this can be simplified to
\[
\begin{aligned}
&2\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_f\frac{n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2\\
&\quad+\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f\frac{n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+\frac{\tau_{\mathrm{eff}}^2[\gamma_x^s]^2(L_\Phi+L_f)}{2}\frac{n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+\tau_{\mathrm{eff}}\Big(4\gamma_y^s+\frac{5}{2}\gamma_x^s\Big)\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2\\
&\le\frac{3}{2}\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f\frac{n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+\frac{9}{2}\tau_{\mathrm{eff}}\gamma_y^s\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2\\
&\quad+2\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_f\frac{n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2.
\end{aligned}
\]

• Coefficient of $\sigma_G^2$: this can be simplified to
\[
\begin{aligned}
&\frac{2\tau_{\mathrm{eff}}^2 n}{P}\big([\gamma_y^s]^2 L_f+[\gamma_x^s]^2(L_f+L_\Phi)\big)\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\\
&\quad+\tau_{\mathrm{eff}}L_f^2 M_{a-1}\big([\eta_x^c]^2+[\eta_y^c]^2\big)\left[8\gamma_y^s+5\gamma_x^s+4\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_f\frac{n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)\right]\\
&\le\frac{3\tau_{\mathrm{eff}}^2[\gamma_y^s]^2 L_f n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)+9\gamma_y^s\tau_{\mathrm{eff}}L_f^2 M_{a-1}\big([\eta_x^c]^2+[\eta_y^c]^2\big)\\
&\quad+4\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_f^3 M_{a-1}\big([\eta_x^c]^2+[\eta_y^c]^2\big)\frac{n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big).
\end{aligned}
\]

Finally, substituting these coefficients in (43), summing over $t=0,\dots,T-1$ and rearranging the terms, we get
\[
\begin{aligned}
&\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]\\
&\le\frac{4}{\tau_{\mathrm{eff}}\gamma_y^s\mu}\left[\frac{\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(0)})-\widetilde F(\mathbf{x}^{(0)},\mathbf{y}^{(0)})\big]}{T}-\frac{\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(T)})-\widetilde F(\mathbf{x}^{(T)},\mathbf{y}^{(T)})\big]}{T}\right]+\frac{1}{12\mu\kappa^2}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x}^{(t)})\big\|^2\\
&\quad+\frac{8\tau_{\mathrm{eff}}[\gamma_x^s]^2 L_\Phi}{\gamma_y^s\mu}\frac{n(P-1)}{P(n-1)}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\|\sum_{i=1}^n w_i\mathbf{h}_{x,i}^{(t)}\Big\|^2+18\kappa L_f\big([\eta_x^c]^2+[\eta_y^c]^2\big)\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2\sigma_G^2 M_{a-1}\right]\\
&\quad+\frac{8\tau_{\mathrm{eff}}[\gamma_x^s]^2\kappa}{\gamma_y^s}\frac{n}{P}\Big(\frac{n-P}{n-1}\max_i w_i^2+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2\sigma_G^2 M_{a-1}\right]\\
&\quad+6\tau_{\mathrm{eff}}\gamma_y^s\kappa\frac{n}{P}\left[\sigma_L^2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+2\sigma_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\right],
\end{aligned}\tag{44}
\]
which concludes the proof.

B.4 Auxiliary Lemmas

Lemma B.6. If the local client functions $f_i(\mathbf{x},\cdot)$ satisfy Assumptions 1, 4 ($L_f$-smoothness and $\mu$-strong concavity in $\mathbf{y}$), then the function $\widetilde F$ satisfies
\[
\mathbb{E}\big\|\nabla_x\widetilde F(\mathbf{x},\mathbf{y})\big\|^2\le 2\,\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x})\big\|^2+\frac{4L_f^2}{\mu}\mathbb{E}\big[\widetilde\Phi(\mathbf{x})-\widetilde F(\mathbf{x},\mathbf{y})\big].
\]
Proof.
\[
\begin{aligned}
\mathbb{E}\big\|\nabla_x\widetilde F(\mathbf{x},\mathbf{y})\big\|^2&\le 2\,\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x})\big\|^2+2\,\mathbb{E}\big\|\nabla_x\widetilde F(\mathbf{x},\mathbf{y})-\nabla\widetilde\Phi(\mathbf{x})\big\|^2\\
&\le 2\,\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x})\big\|^2+2L_f^2\,\mathbb{E}\big\|\mathbf{y}^*(\mathbf{x})-\mathbf{y}\big\|^2 &&\text{($L_f$-smoothness (Assumption 1))}\\
&\le 2\,\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x})\big\|^2+\frac{4L_f^2}{\mu}\mathbb{E}\big[\widetilde\Phi(\mathbf{x})-\widetilde F(\mathbf{x},\mathbf{y})\big]. &&\text{(Assumption 4)}
\end{aligned}
\]
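The inequality in Lemma B.6 can be sanity-checked numerically. Below is a minimal sketch on the toy 1-D function $F(x,y)=a\,xy-\frac{\mu}{2}y^2$ (an illustrative choice, not from the paper), with $L_f$ taken as the spectral norm of its Hessian; here $\Phi(x)=\max_y F(x,y)=a^2x^2/(2\mu)$ in closed form.

```python
import numpy as np

a, mu = 1.5, 0.7                            # coupling and strong-concavity modulus
H = np.array([[0.0, a], [a, -mu]])          # Hessian of F(x, y) = a*x*y - (mu/2)*y^2
Lf = np.linalg.norm(H, 2)                   # smoothness constant L_f (spectral norm)

def F(x, y):   return a * x * y - 0.5 * mu * y ** 2
def Phi(x):    return (a * x) ** 2 / (2 * mu)   # max_y F(x, y), attained at y*(x) = a*x/mu
def dxF(x, y): return a * y
def dPhi(x):   return a ** 2 * x / mu

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x, y = 3 * rng.normal(size=2)
    lhs = dxF(x, y) ** 2
    # right-hand side of Lemma B.6: 2||grad Phi||^2 + (4 L_f^2 / mu) * (Phi - F)
    rhs = 2 * dPhi(x) ** 2 + (4 * Lf ** 2 / mu) * (Phi(x) - F(x, y))
    ok = ok and (lhs <= rhs + 1e-9)
```

The check passes with equality-tight constants here because, for this quadratic, $\Phi(x)-F(x,y)=\frac{\mu}{2}(y-y^*(x))^2$ exactly.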

Lemma B.7. If the local client functions $f_i(\mathbf{x},\cdot)$ satisfy Assumptions 1, 3 and 4, then the iterates $\{\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)}\}_{i,(t,k)}$ generated by Algorithm 1 satisfy
\[
\begin{aligned}
&\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\,\mathbb{E}\big\|\nabla_x f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\big\|^2\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\sigma_G^2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\\
&\quad+4\beta_G^2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\left[\frac{2L_f^2}{\mu}\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]+\mathbb{E}\big\|\nabla_x\widetilde\Phi(\mathbf{x}^{(t)})\big\|^2\right].
\end{aligned}
\]
Proof.
\[
\begin{aligned}
&\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\,\mathbb{E}\big\|\nabla_x f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\pm\nabla_x f_i(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\,\mathbb{E}\Big[\big\|\mathbf{x}_i^{(t,k)}-\mathbf{x}^{(t)}\big\|^2+\big\|\mathbf{y}_i^{(t,k)}-\mathbf{y}^{(t)}\big\|^2\Big]+2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\mathbb{E}\big\|\nabla_x f_i(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2 &&\text{($L_f$-smoothness)}\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\Big(\beta_G^2\,\mathbb{E}\big\|\nabla_x\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2+\sigma_G^2\Big). &&\text{(Assumption 3)}
\end{aligned}
\]
Using Lemma B.6 gives the result.
Lemma B.8. If the local client functions $f_i(\mathbf{x},\cdot)$ satisfy Assumptions 1, 3 and 4, then the iterates $\{\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)}\}_{i,(t,k)}$ generated by Algorithm 1 satisfy
\[
\begin{aligned}
&\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\,\mathbb{E}\big\|\nabla_y f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\big\|^2\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\Big[\sigma_G^2+2\beta_G^2 L_f\,\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]\Big].
\end{aligned}
\]
Proof. Following closely the proof of Lemma B.7,
\[
\begin{aligned}
&\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\,\mathbb{E}\big\|\nabla_y f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\pm\nabla_y f_i(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\,\mathbb{E}\Big[\big\|\mathbf{x}_i^{(t,k)}-\mathbf{x}^{(t)}\big\|^2+\big\|\mathbf{y}_i^{(t,k)}-\mathbf{y}^{(t)}\big\|^2\Big]+2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\mathbb{E}\big\|\nabla_y f_i(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2 &&\text{($L_f$-smoothness)}\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\Big(\beta_G^2\,\mathbb{E}\big\|\nabla_y\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\|^2+\sigma_G^2\Big) &&\text{(Assumption 3)}\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\max_i\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\Big[\sigma_G^2+2\beta_G^2 L_f\,\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]\Big],
\end{aligned}
\]
where the final inequality follows from smoothness and concavity of $\widetilde F$ in $\mathbf{y}$.

B.5 Convergence under the Polyak-Łojasiewicz (PL) Condition

In case the global function satisfies Assumption 5, the results in this section follow with minor modifications. The crucial difference is that Lemma A.6 no longer holds. Lemma B.2 and Lemma B.3 follow exactly. The statement of Lemma B.4 needs some modification, since we use Lemma A.6 in its proof.

Lemma B.9. Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 3, 5, and the stochastic oracles for the local functions satisfy Assumption 2. Under the conditions of Lemma B.4, the iterates $\{\mathbf{x}_i^{(t)},\mathbf{y}_i^{(t)}\}$ generated by Algorithm 1 satisfy
\[
\begin{aligned}
\sum_{i=1}^n\frac{p_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}L_f^2\,a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)&\le 2\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sigma_L^2\sum_{i=1}^n p_i\|a_{i,-1}\|_2^2+4L_f^2 M_{a-1}\big([\eta_x^c]^2+[\eta_y^c]^2\big)\sigma_G^2\\
&\quad+8L_f^2 M_{a-1}\beta_G^2[\eta_x^c]^2\,\mathbb{E}\big\|\nabla\widetilde\Phi(\mathbf{x}^{(t)})\big\|^2+8\kappa L_f^3 M_{a-1}\beta_G^2\big(2[\eta_x^c]^2+[\eta_y^c]^2\big)\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big],
\end{aligned}
\]
where $M_{a-1}\triangleq\max_i\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)$.

The bound in Lemma B.8 also changes to
\[
\begin{aligned}
&\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\,\mathbb{E}\big\|\nabla_y f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\big\|^2\\
&\le 2\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2 L_f^2\Delta_{x,y}^{(t,k)}(i)+2\max_i\frac{p_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big[\sigma_G^2+2\beta_G^2\kappa L_f\,\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]\Big].
\end{aligned}
\]
The same bound as in Lemma B.5 holds, but under more stringent conditions on the learning rates, namely $\eta_y^c L_f\beta_G\le\frac{1}{16\kappa\sqrt{M_{a-1}}}$ and $\tau_{\mathrm{eff}}\gamma_y^s\kappa L_f\beta_G^2\frac{n}{P}\max\big\{\kappa\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2},\ \frac{n-P}{n-1}\max_i w_i\big\}\le\frac{1}{64}$. Consequently, the bounds in Theorem 1 hold, under slightly more stringent conditions on the learning rates.
C Convergence of Fed-Norm-SGDA+ for Nonconvex-Concave Functions (Theorem 2)

We organize this section as follows. First, in Appendix C.1 we present some intermediate results, which we use in the proof of Theorem 2. Next, in Appendix C.2, we present the proof of Theorem 2, which is followed by the proofs of the intermediate results in Appendix C.3. Finally, we discuss the extension of our results to nonconvex-one-point-concave functions in Appendix C.4.

The problem we solve is
\[
\min_{\mathbf{x}}\max_{\mathbf{y}}\Big\{\widetilde F(\mathbf{x},\mathbf{y})\triangleq\sum_{i=1}^n w_i f_i(\mathbf{x},\mathbf{y})\Big\}.
\]
We define $\widetilde\Phi(\mathbf{x})\triangleq\max_{\mathbf{y}}\widetilde F(\mathbf{x},\mathbf{y})$ and $\widetilde{\mathbf{y}}^*(\mathbf{x})\in\arg\max_{\mathbf{y}}\widetilde F(\mathbf{x},\mathbf{y})$. Since $\widetilde F(\mathbf{x},\cdot)$ is no longer strongly concave, $\mathbf{y}^*(\mathbf{x})$ need not be unique. In Algorithm 1 (Fed-Norm-SGDA+), the client updates are given by
\[
\begin{aligned}
\mathbf{x}_i^{(t,k)}&=\mathbf{x}^{(t)}-\eta_x^c\sum_{j=0}^{k-1}a_i^{(j)}(k)\,\nabla_x f_i\big(\mathbf{x}_i^{(t,j)},\mathbf{y}_i^{(t,j)};\xi_i^{(t,j)}\big),\\
\mathbf{y}_i^{(t,k)}&=\mathbf{y}^{(t)}+\eta_y^c\sum_{j=0}^{k-1}a_i^{(j)}(k)\,\nabla_y f_i\big(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,j)};\xi_i^{(t,j)}\big),
\end{aligned}\tag{45}
\]
where $1\le k\le\tau_i$. The server updates are given by
\[
\mathbf{x}^{(t+1)}=\mathbf{x}^{(t)}-\tau_{\mathrm{eff}}\gamma_x^s\,\mathbf{g}_x^{(t)},\qquad \mathbf{y}^{(t+1)}=\mathbf{y}^{(t)}+\tau_{\mathrm{eff}}\gamma_y^s\,\mathbf{g}_y^{(t)},\tag{46}
\]
where $\mathbf{g}_x^{(t)},\mathbf{g}_y^{(t)}$ are defined in (3). The normalized (stochastic) gradient vectors are defined as
\[
\begin{aligned}
\mathbf{g}_{x,i}^{(t)}&=\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\nabla_x f_i\big(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)};\xi_i^{(t,k)}\big);&\quad
\mathbf{h}_{x,i}^{(t)}&=\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\nabla_x f_i\big(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)}\big),\\
\mathbf{g}_{y,i}^{(t)}&=\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\nabla_y f_i\big(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)};\xi_i^{(t,k)}\big);&\quad
\mathbf{h}_{y,i}^{(t)}&=\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\nabla_y f_i\big(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)}\big).
\end{aligned}\tag{47}
\]
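The client updates (45), the normalized aggregation (47), and the server step (46) can be sketched in a few lines. The following is a minimal single-machine simulation, assuming vanilla local-SGDA weights $a_i=(1,\dots,1)$ (so $a_i^{(k)}(\tau_i)=1$ and $\|a_i\|_1=\tau_i$), full participation, a fresh snapshot $\widehat{\mathbf{x}}^{(s)}$ every round, and toy scalar local losses $f_i(x,y)=b_i xy-y^2/2$; all constants are illustrative, not from the paper.

```python
import numpy as np

n, T, tau = 4, 300, 5                      # clients, rounds, local steps tau_i
b = np.array([0.5, -0.3, 0.8, 0.2])        # per-client heterogeneity in f_i
w = np.ones(n) / n                         # aggregation weights w_i
eta_x, eta_y = 0.05, 0.05                  # client step sizes eta_x^c, eta_y^c
gamma_x, gamma_y = 0.1, 0.1                # server step sizes gamma_x^s, gamma_y^s
tau_eff = tau
rng = np.random.default_rng(0)

def grads(i, x, y, noise=0.01):            # stochastic gradients of f_i(x,y) = b_i*x*y - y^2/2
    gx = b[i] * y + noise * rng.normal()
    gy = b[i] * x - y + noise * rng.normal()
    return gx, gy

x, y = 1.0, 1.0
for t in range(T):
    x_hat = x                              # snapshot \hat{x}^{(s)} used by the y-updates
    gx_agg = gy_agg = 0.0
    for i in range(n):
        xi, yi = x, y
        gx_sum = gy_sum = 0.0
        for k in range(tau):               # local steps, as in (45)
            gx, _ = grads(i, xi, yi)       # x-gradient at the local iterate
            _, gy = grads(i, x_hat, yi)    # y-gradient at the snapshot x_hat
            xi -= eta_x * gx
            yi += eta_y * gy
            gx_sum += gx
            gy_sum += gy
        gx_agg += w[i] * gx_sum / tau      # normalization by ||a_i||_1 = tau, as in (47)
        gy_agg += w[i] * gy_sum / tau
    x -= tau_eff * gamma_x * gx_agg        # server updates, as in (46)
    y += tau_eff * gamma_y * gy_agg
# The averaged objective is F(x,y) = 0.3*x*y - y^2/2, whose saddle point is (0, 0).
```

The normalization by $\|a_i\|_1$ is what prevents clients with more local steps from dominating the aggregate, which is the central fix the paper proposes.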

C.1 Intermediate Lemmas


As discussed in Section 5.2, we analyze the convergence of the smoothed envelope function $\widetilde\Phi_{1/2L_f}$. We begin with a bound on the one-step decay of this function.

Lemma C.1 (One-step decay of the smoothed envelope). Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 2, 6, and 7. Then the iterates generated by Algorithm 1 (Fed-Norm-SGDA+) satisfy
\[
\begin{aligned}
\mathbb{E}\big[\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t+1)})\big]&\le\mathbb{E}\big[\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big]+\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f\frac{n}{P}\left[\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)\right]\\
&\quad+2\tau_{\mathrm{eff}}\gamma_x^s L_f^2\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)+2\tau_{\mathrm{eff}}\gamma_x^s L_f\,\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]\\
&\quad-\frac{\tau_{\mathrm{eff}}\gamma_x^s}{8}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2,
\end{aligned}
\]
where $\Delta_{x,y}^{(t,k)}(i)=\mathbb{E}\big[\|\mathbf{x}_i^{(t,k)}-\mathbf{x}^{(t)}\|^2+\|\mathbf{y}_i^{(t,k)}-\mathbf{y}^{(t)}\|^2\big]$ is the drift of client $i\in[n]$ at the $k$-th local step of round $t$.

Between two successive synchronization instants (for example, $t$ and $t+1$), the clients drift apart due to local descent/ascent steps, resulting in the $\{\Delta_{x,y}^{(t,k)}(i)\}_{i,k}$ terms. Also, $\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]$ quantifies the error of the inner maximization. In the subsequent lemmas, we bound both these error terms.
Lemma C.2 (Consensus Error). Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 3, 6, and 7, and the stochastic oracles for the local functions satisfy Assumption 2. Further, in Algorithm 1 (Fed-Norm-SGDA+), we choose the client learning rate $\eta_y^c$ such that $\eta_y^c\le\frac{1}{2L_f(\max_i\|a_i\|_1)\sqrt{2\max\{1,\beta_L^2\}}}$. Then the iterates $\{\mathbf{x}_i^{(t)},\mathbf{y}_i^{(t)}\}$ generated by Algorithm 1 (Fed-Norm-SGDA+) satisfy
\[
\begin{aligned}
\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}L_f^2\,a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)&\le 2\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+4L_f^2 M_{a-1}\big([\eta_x^c]^2 G_x^2+[\eta_y^c]^2\sigma_G^2\big)\\
&\quad+8[\eta_y^c]^2 L_f^3 M_{a-1}\beta_G^2\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big],
\end{aligned}
\]
where $M_{a-1}\triangleq\max_i\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)$.
Note that the consensus error depends on the difference $\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]$. This is different from the term $\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]$ in Lemma C.1. Since in Algorithm 1 (Fed-Norm-SGDA+) the $\mathbf{x}$-component stays fixed at $\widehat{\mathbf{x}}^{(s)}$ for $S$ communication rounds while $\mathbf{y}_i^{(t,k)}$ is updated, the sum
\[
\sum_{t=kS}^{(k+1)S-1}\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]
\]
can be interpreted as the optimization error of maximizing the concave function $\widetilde F(\widehat{\mathbf{x}}^{(s)},\cdot)$ over $S$ communication rounds. Next, we bound this error. The following result essentially extends the analysis of FedNova (Wang et al. (2020)) to concave maximization (analogously, convex minimization) problems. We also generalize the corresponding analyses in Khaled et al. (2020); Koloskova et al. (2020) to heterogeneous local updates.
Lemma C.3 (Local SG updates for Concave Maximization). Suppose the local functions satisfy Assumptions 1, 2, 3 and 6. Further, let $\|\mathbf{y}^{(t)}\|^2\le R$ for all $t$. We run Algorithm 1 (Fed-Norm-SGDA+) with client step-size $\eta_y^c$ such that $64[\eta_y^c]^2 M_{a-1}L_f^2\beta_G^2\frac{n}{P}\le 1$. Further, the server step-size $\gamma_y^s$ satisfies
\[
\begin{aligned}
2\tau_{\mathrm{eff}}\gamma_y^s L_f\max\{\beta_G^2,1\}\frac{n}{P}\max\Big\{\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2},\ \frac{n-P}{n-1}\max_i w_i\Big\}&\le\frac{1}{8},\\
2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\max\Big\{\frac{P-1}{n-1},\ \beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big\}&\le\frac{1}{8}.
\end{aligned}
\]
Then the iterates generated by Algorithm 1 (Fed-Norm-SGDA+) satisfy
\[
\begin{aligned}
\frac{1}{S}\sum_{t=sS}^{(s+1)S-1}\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]&\le\frac{4R}{\tau_{\mathrm{eff}}\gamma_y^s S}+\tau_{\mathrm{eff}}\gamma_y^s\frac{n}{P}\left[\sigma_L^2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+2\sigma_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\right]\\
&\quad+4L_f\big([\eta_x^c]^2+[\eta_y^c]^2\big)\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2M_{a-1}\big(G_x^2+\sigma_G^2\big)\right],
\end{aligned}
\]
where $M_{a-1}\triangleq\max_i\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)$.

Remark 8. It is worth noting that the proof of Lemma C.3 does not require global concavity of the local functions. Rather, given $\mathbf{x}$, we only need concavity of the local functions $\{f_i\}$ at some point $\mathbf{y}^*(\mathbf{x})$. This is precisely the one-point-concavity assumption (Assumption 8) discussed earlier in Deng & Mahdavi (2021); Sharma et al. (2022). Therefore, Lemma C.3 holds for a much larger class of functions. Further, the bound in Lemma C.3 improves the corresponding bounds derived in existing work. As we discuss in Appendix C.4, this helps us achieve improved complexity results for nonconvex-one-point-concave (NC-1PC) functions.

Next, we bound the difference $\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]$.

Lemma C.4. Suppose the local functions satisfy Assumptions 1, 2, 3, 7. Then the iterates generated by Algorithm 1 (Fed-Norm-SGDA+) satisfy
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]&\le\frac{1}{T}\sum_{s=0}^{T/S-1}\sum_{t=sS}^{(s+1)S-1}\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]\\
&\quad+2\tau_{\mathrm{eff}}\gamma_x^s G_x(S-1)\sqrt{\frac{n}{P}}\sqrt{\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)}.
\end{aligned}
\]

C.2 Proof of Theorem 2

For the sake of completeness, we first state the full statement of Theorem 2 here.

Theorem 4. Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 2, 3, 6, 7. Further, let $\|\mathbf{y}^{(t)}\|^2\le R$ for all $t$. We run Algorithm 1 (Fed-Norm-SGDA+) with client step-size $\eta_y^c$ such that $64[\eta_y^c]^2 M_{a-1}L_f^2\max\{\beta_G^2\frac{n}{P},1\}\le 1$. Further, the server step-size $\gamma_y^s$ satisfies
\[
\begin{aligned}
2\tau_{\mathrm{eff}}\gamma_y^s L_f\max\{\beta_G^2,1\}\frac{n}{P}\max\Big\{\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2},\ \frac{n-P}{n-1}\max_i w_i\Big\}&\le\frac{1}{8},\\
2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\max\Big\{\frac{P-1}{n-1},\ \beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big\}&\le\frac{1}{8}.
\end{aligned}
\]
Then the iterates generated by Algorithm 1 (Fed-Norm-SGDA+) satisfy
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
&\le\mathcal{O}\left(\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+\tau_{\mathrm{eff}}\gamma_x^s L_f\Big[\frac{A_w}{P\tau_{\mathrm{eff}}}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{n(P-1)}{P(n-1)}+F_w\Big)\Big]\right)\\
&\quad+\mathcal{O}\left(\tau_{\mathrm{eff}}\gamma_x^s L_f G_x(S-1)\sqrt{\frac{A_w}{P\tau_{\mathrm{eff}}}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{n(P-1)}{P(n-1)}+F_w\Big)}\right)\\
&\quad+\mathcal{O}\left(\frac{L_f R}{\tau_{\mathrm{eff}}\gamma_y^s S}+\frac{\gamma_y^s L_f}{P}\Big[A_w\sigma_L^2+\sigma_G^2\Big(\tau_{\mathrm{eff}}\frac{n-P}{n-1}E_w+B_w\beta_L^2\Big)\Big]\right)\\
&\quad+\mathcal{O}\left(\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D\big(G_x^2+\sigma_G^2\big)\big]\right),
\end{aligned}\tag{48}
\]
where $\widetilde\Phi_{1/2L_f}(\mathbf{x})\triangleq\min_{\mathbf{x}'}\big\{\widetilde\Phi(\mathbf{x}')+L_f\|\mathbf{x}'-\mathbf{x}\|^2\big\}$ is the envelope function, $\bar\Delta_{\widetilde\Phi}\triangleq\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(0)})-\min_{\mathbf{x}}\widetilde\Phi_{1/2L_f}(\mathbf{x})$, $A_w\triangleq n\tau_{\mathrm{eff}}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}$, $B_w\triangleq n\tau_{\mathrm{eff}}\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}$, $E_w\triangleq n\max_i w_i$, $C_w\triangleq\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2$, $D\triangleq\max_i\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)$, and $F_w\triangleq\frac{n(n-P)}{P(n-1)}\sum_{i=1}^n w_i^2$. With the following parameter values:
\[
\eta_x^c=\eta_y^c=\Theta\Big(\frac{1}{L_f\bar\tau T^{3/8}}\Big),\quad \gamma_x^s=\Theta\Big(\frac{P^{1/4}}{(\tau_{\mathrm{eff}}T)^{3/4}}\Big),\quad \gamma_y^s=\Theta\Big(\frac{P^{3/4}}{(\tau_{\mathrm{eff}}T)^{1/4}}\Big),\quad S=\Theta\Big(\sqrt{\frac{T}{\tau_{\mathrm{eff}}P}}\Big),
\]
where $\bar\tau=\frac{1}{n}\sum_{i=1}^n\tau_i$, we can further simplify to
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
\le\underbrace{\mathcal{O}\Big(\frac{(\bar\tau/\tau_{\mathrm{eff}})^{1/4}}{(\bar\tau PT)^{1/4}}\Big)+\mathcal{O}\Big(\frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}\Big)}_{\text{Error with full synchronization}}
+\underbrace{\mathcal{O}\Big(\frac{n-P}{n-1}\cdot\Big(\frac{E_w}{PT}\Big)^{1/4}\Big)}_{\text{Partial participation error}}
+\underbrace{\mathcal{O}\Big(\frac{C_w\sigma_L^2+D(G_x^2+\sigma_G^2)}{\bar\tau^2 T^{3/4}}\Big)}_{\text{Error due to local updates}}.
\]

Proof. We sum the bound in Lemma C.1 over $t=0$ to $T-1$ and rearrange the terms to get
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
&\le\frac{8}{\tau_{\mathrm{eff}}\gamma_x^s T}\sum_{t=0}^{T-1}\mathbb{E}\big[\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})-\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t+1)})\big]\\
&\quad+8\tau_{\mathrm{eff}}\gamma_x^s L_f\frac{n}{P}\left[\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)\right]\\
&\quad+16L_f\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]+\frac{16}{T}\sum_{s=0}^{T/S-1}\sum_{t=sS}^{(s+1)S-1}L_f^2\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)\\
&\le\frac{8\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+8\tau_{\mathrm{eff}}\gamma_x^s L_f\frac{n}{P}\left[\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)\right]\\
&\qquad\qquad\text{(where }\bar\Delta_{\widetilde\Phi}\triangleq\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(0)})-\min_{\mathbf{x}}\widetilde\Phi_{1/2L_f}(\mathbf{x}))\\
&\quad+32\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2M_{a-1}\big(G_x^2+\sigma_G^2\big)\right] &&\text{(from Lemma C.2)}\\
&\quad+32\tau_{\mathrm{eff}}\gamma_x^s L_f G_x(S-1)\sqrt{\frac{n}{P}}\sqrt{\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)} &&\text{(from Lemma C.4)}\\
&\quad+18L_f\left[\frac{4R}{\tau_{\mathrm{eff}}\gamma_y^s S}+\tau_{\mathrm{eff}}\gamma_y^s\frac{n}{P}\Big(\sigma_L^2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+2\sigma_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\Big)\right]\\
&\qquad\qquad\text{(from Lemma C.3; using }A_m\le\min\{\tfrac{1}{2},\tfrac{1}{16\beta_G^2}\})\\
&\quad+72[\eta_y^c]^2 L_f^2\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2\sigma_G^2 M_{a-1}\right].
\end{aligned}\tag{49}
\]
We can simplify the notation using the constants $A_w\triangleq n\tau_{\mathrm{eff}}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}$, $B_w\triangleq n\tau_{\mathrm{eff}}\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}$, $E_w\triangleq n\max_i w_i$, $C_w\triangleq\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2$, $D\triangleq M_{a-1}$, $F_w\triangleq\frac{n}{P}\sum_{i=1}^n w_i^2$, and drop the numerical constants, for simplicity, to get
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
&\lesssim\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+\tau_{\mathrm{eff}}\gamma_x^s L_f\left[\frac{A_w}{P\tau_{\mathrm{eff}}}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{n(P-1)}{P(n-1)}+\frac{n-P}{n-1}F_w\Big)\right]\\
&\quad+\tau_{\mathrm{eff}}\gamma_x^s L_f G_x(S-1)\sqrt{\frac{A_w}{P\tau_{\mathrm{eff}}}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{n(P-1)}{P(n-1)}+\frac{n-P}{n-1}F_w\Big)}\\
&\quad+\frac{L_f R}{\tau_{\mathrm{eff}}\gamma_y^s S}+\frac{\gamma_y^s L_f}{P}\left[A_w\sigma_L^2+\Big(B_w\beta_L^2+\tau_{\mathrm{eff}}\frac{n-P}{n-1}E_w\Big)\sigma_G^2\right]+\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big]\\
&=\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+\tau_{\mathrm{eff}}\gamma_x^s L_f I_1^2+\frac{\gamma_y^s L_f I_2}{P}+\frac{L_f R}{\tau_{\mathrm{eff}}\gamma_y^s S}+L_f\tau_{\mathrm{eff}}\gamma_x^s G_x(S-1)I_1\\
&\quad+\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big],
\end{aligned}\tag{50}
\]
where in (50), to simplify notation, we have defined
\[
I_1\triangleq\sqrt{\frac{A_w}{P\tau_{\mathrm{eff}}}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{n(P-1)}{P(n-1)}+\frac{n-P}{n-1}F_w\Big)},\qquad
I_2\triangleq A_w\sigma_L^2+\Big(B_w\beta_L^2+\tau_{\mathrm{eff}}\frac{n-P}{n-1}E_w\Big)\sigma_G^2.
\]
Next, we optimize the algorithm parameters $S,\gamma_x^s,\gamma_y^s,\eta_x^c,\eta_y^c$ to achieve a tight bound on (50). If $R=0$, we let $S=1$. Else, let $S=\sqrt{\frac{R}{\tau_{\mathrm{eff}}^2\gamma_x^s\gamma_y^s G_x I_1}}$. Substituting this in (50), we get
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\lesssim\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+\tau_{\mathrm{eff}}\gamma_x^s L_f I_1^2+L_f\left[\frac{\gamma_y^s I_2}{P}+\sqrt{\frac{R\gamma_x^s G_x I_1}{\gamma_y^s}}\right]+\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big].\tag{51}
\]
Next, we focus on the terms in (51) containing $\gamma_y^s$: $L_f\big[\frac{\gamma_y^s I_2}{P}+\sqrt{\frac{R\gamma_x^s G_x I_1}{\gamma_y^s}}\big]$. To optimize these, we choose $\gamma_y^s=\big(\frac{P}{2I_2}\big)^{2/3}\big(R\gamma_x^s G_x I_1\big)^{1/3}$. Substituting in (51), we get
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\lesssim\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+\tau_{\mathrm{eff}}\gamma_x^s L_f I_1^2+L_f\Big(\frac{I_2}{P}\Big)^{1/3}\big(R\gamma_x^s G_x I_1\big)^{1/3}+\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big].\tag{52}
\]
Finally, we focus on the terms in (52) containing $\gamma_x^s$: $\frac{\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}\gamma_x^s T}+L_f\big(\frac{I_2}{P}\big)^{1/3}\big(R\gamma_x^s G_x I_1\big)^{1/3}$. We ignore the higher-order linear term. With
\[
\gamma_x^s=\left(\frac{3\bar\Delta_{\widetilde\Phi}}{\tau_{\mathrm{eff}}L_f T}\right)^{3/4}\left(\frac{I_2}{P}RG_x I_1\right)^{-1/4},
\]
and absorbing numerical constants inside $\mathcal{O}(\cdot)$, we get

\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
&\lesssim\frac{(I_1 I_2 RG_x)^{1/4}}{(\tau_{\mathrm{eff}}PT)^{1/4}}+\frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}\big(I_1 I_2 RG_x\big)^{-1/4}I_1^2+\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big]\\
&=\mathcal{O}\Big(\frac{(I_1 I_2)^{1/4}}{(\tau_{\mathrm{eff}}PT)^{1/4}}\Big)+\mathcal{O}\Big(\frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}I_1^{3/4}I_2^{-1/4}\Big)+\mathcal{O}\Big(\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big]\Big)\\
&\le\mathcal{O}\Big(\frac{(\bar\tau/\tau_{\mathrm{eff}})^{1/4}}{(\bar\tau PT)^{1/4}}\Big)+\mathcal{O}\Big(\frac{1}{(PT)^{1/4}}\Big(\frac{n(n-P)}{n-1}\max_i w_i\Big)^{1/4}\sigma_G^{1/2}\Big)+\mathcal{O}\Big(\frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}\Big)\\
&\quad+\mathcal{O}\Big(\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[C_w\sigma_L^2+D(G_x^2+\sigma_G^2)\big]\Big),
\end{aligned}\tag{53}
\]
where in (53), we have shown the dependence only on $\tau, n, P, T$. Lastly, we specify the algorithm parameters in terms of $n, T, \tau_{\mathrm{eff}}, \bar\tau$:
\[
\gamma_x^s=\Theta\Big(\frac{P^{1/4}}{(\tau_{\mathrm{eff}}T)^{3/4}}\Big),\qquad \gamma_y^s=\Theta\Big(\frac{P^{3/4}}{(\tau_{\mathrm{eff}}T)^{1/4}}\Big),\qquad S=\Theta\Big(\sqrt{\frac{T}{\tau_{\mathrm{eff}}P}}\Big).
\]
Finally, choosing the client learning rates $\eta_x^c=\eta_y^c=\frac{1}{L_f\bar\tau T^{3/8}}$, we get
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
\le\mathcal{O}\Big(\frac{(\bar\tau/\tau_{\mathrm{eff}})^{1/4}}{(\bar\tau PT)^{1/4}}\Big)+\mathcal{O}\Big(\frac{1}{(PT)^{1/4}}\Big(\frac{n(n-P)}{n-1}\max_i w_i\Big)^{1/4}\Big)+\mathcal{O}\Big(\frac{(\tau_{\mathrm{eff}}P)^{1/4}}{T^{3/4}}\Big)+\mathcal{O}\Big(\frac{C_w\sigma_L^2+D(G_x^2+\sigma_G^2)}{\bar\tau^2 T^{3/4}}\Big).
\]

Convergence in terms of $\Phi$

Proof of Corollary 2.1. Following Lin et al. (2020a), we define
\[
\begin{aligned}
\widetilde\Phi_{1/2L_f}(\mathbf{x})&\triangleq\min_{\mathbf{x}'}\big\{\widetilde\Phi(\mathbf{x}')+L_f\|\mathbf{x}'-\mathbf{x}\|^2\big\};&\quad \widetilde{\mathbf{x}}&\triangleq\arg\min_{\mathbf{x}'}\big\{\widetilde\Phi(\mathbf{x}')+L_f\|\mathbf{x}'-\mathbf{x}\|^2\big\};\\
\Phi_{1/2L_f}(\mathbf{x})&\triangleq\min_{\mathbf{x}'}\big\{\Phi(\mathbf{x}')+L_f\|\mathbf{x}'-\mathbf{x}\|^2\big\};&\quad \bar{\mathbf{x}}&\triangleq\arg\min_{\mathbf{x}'}\big\{\Phi(\mathbf{x}')+L_f\|\mathbf{x}'-\mathbf{x}\|^2\big\}.
\end{aligned}\tag{54}
\]
Also, it follows from Lemma 2.2 in Davis & Drusvyatskiy (2019) that $\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x})=2L_f(\mathbf{x}-\widetilde{\mathbf{x}})$ and $\nabla\Phi_{1/2L_f}(\mathbf{x})=2L_f(\mathbf{x}-\bar{\mathbf{x}})$. Therefore,
\[
\big\|\nabla\Phi_{1/2L_f}(\mathbf{x})\big\|^2\le 2\big\|\nabla\Phi_{1/2L_f}(\mathbf{x})-\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x})\big\|^2+2\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x})\big\|^2
=8L_f^2\|\widetilde{\mathbf{x}}-\bar{\mathbf{x}}\|^2+2\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x})\big\|^2.
\]
Consequently, we obtain
\[
\min_{t\in[T]}\big\|\nabla\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\le\frac{1}{T}\sum_{t=0}^{T-1}\big\|\nabla\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2
\le\frac{2}{T}\sum_{t=0}^{T-1}\left[\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2+4L_f^2\big\|\widetilde{\mathbf{x}}^{(t)}-\bar{\mathbf{x}}^{(t)}\big\|^2\right],\tag{55}
\]
where $\widetilde{\mathbf{x}}^{(t)},\bar{\mathbf{x}}^{(t)}$ follow the same definitions as in (54), with $\mathbf{x}$ replaced by $\mathbf{x}^{(t)}$.
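The Davis–Drusvyatskiy identity $\nabla\Phi_{1/2L}(\mathbf{x})=2L(\mathbf{x}-\bar{\mathbf{x}})$ used above can be verified numerically. The sketch below uses the simple test function $\Phi(x)=|x|$ (an illustrative choice, not from the paper), whose Moreau envelope is Huber-like and whose proximal point $\bar{x}$ is found here by brute-force grid search.

```python
import numpy as np

L = 2.0

def envelope(x):
    # Phi_{1/2L}(x) = min_{x'} |x'| + L*(x' - x)^2, solved by grid search
    grid = np.linspace(-5.0, 5.0, 200001)
    vals = np.abs(grid) + L * (grid - x) ** 2
    j = int(np.argmin(vals))
    return vals[j], grid[j]            # (envelope value, proximal point xbar)

max_err = 0.0
for x in np.linspace(-3.0, 3.0, 21):
    e_plus, _ = envelope(x + 1e-4)
    e_minus, _ = envelope(x - 1e-4)
    num_grad = (e_plus - e_minus) / 2e-4   # central-difference gradient of the envelope
    _, xbar = envelope(x)
    max_err = max(max_err, abs(num_grad - 2 * L * (x - xbar)))
```

The numerical gradient of the envelope matches $2L(x-\bar{x})$ up to the grid and finite-difference resolution.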
Proof of Corollary 2.2. If clients are weighted equally ($w_i=p_i=1/n$ for all $i$), with each carrying out $\tau$ steps of local SGDA+, then (7) reduces to
\[
\min_{t\in[T]}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\le\mathcal{O}\left(\frac{1}{(\tau PT)^{1/4}}+\frac{(\tau P)^{1/4}}{T^{3/4}}\right)+\mathcal{O}\left(\frac{\sigma_L^2+\tau(G_x^2+\sigma_G^2)}{\tau T^{3/4}}\right)+\mathcal{O}\left(\frac{n-P}{n-1}\cdot\Big(\frac{1}{PT}\Big)^{1/4}\right).
\]

• For full client participation, this reduces to
\[
\min_{t\in[T]}\mathbb{E}\big\|\nabla\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\le\mathcal{O}\left(\frac{1}{(\tau nT)^{1/4}}+\frac{(\tau n)^{1/4}}{T^{3/4}}\right).
\]
To reach an $\epsilon$-stationary point, assuming $n\tau\le T$, the per-client gradient complexity is $T\tau=\mathcal{O}\big(\frac{1}{n\epsilon^8}\big)$. Since $\tau\le T/n$, the minimum number of communication rounds required is $T=\mathcal{O}\big(\frac{1}{\epsilon^4}\big)$.

• For partial participation, $\mathcal{O}\big(\frac{n-P}{n-1}\cdot\big(\frac{1}{PT}\big)^{1/4}\big)$ is the dominant term, and we do not get any convergence benefit from multiple local updates. Consequently, the per-client gradient complexity and the number of communication rounds are both $T\tau=\mathcal{O}\big(\frac{1}{P\epsilon^8}\big)$, for $\tau=\mathcal{O}(1)$. However, if the data across clients come from identical distributions ($\sigma_G=0$), then we recover a per-client gradient complexity of $\mathcal{O}\big(\frac{1}{P\epsilon^8}\big)$, and the number of communication rounds $=\mathcal{O}\big(\frac{1}{\epsilon^4}\big)$.


Special Cases

• Centralized, deterministic case ($\sigma_L=\sigma_G=0$, $\beta_G=1$, $\tau_{\mathrm{eff}}=n=1$): in this case $A_w=B_w=1$, $C_w=D=0$. Also, $I_1=G_x\sqrt{\beta_L^2+1}$, $I_2=0$. The bound in (48) reduces to
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2\le\mathcal{O}\left(\frac{\bar\Delta_\Phi}{\gamma_x^s T}+\gamma_x^s L_f G_x^2\big(\beta_L^2+1\big)+\gamma_x^s L_f G_x^2(S-1)\sqrt{\beta_L^2+1}+\frac{L_f R}{\gamma_y^s S}\right).\tag{56}
\]
For $\beta_L=0$, (56) yields the convergence result in Lin et al. (2020a).

• Single-node, stochastic case ($\sigma_G=0$, $\beta_G=1$, $\tau_{\mathrm{eff}}=n=1$): in this case $A_w=B_w=1$, $C_w=D=0$. Also, $I_1=\sqrt{\sigma_L^2+(\beta_L^2+1)G_x^2}$, $I_2=\sigma_L^2$. The bound in (48) reduces to
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2&\le\Theta\left(\frac{\bar\Delta_\Phi}{\gamma_x^s T}+\gamma_x^s L_f\big(\sigma_L^2+(\beta_L^2+1)G_x^2\big)+\gamma_y^s L_f\sigma_L^2\right)\\
&\quad+\Theta\left(L_f\gamma_x^s G_x(S-1)\sqrt{\sigma_L^2+(\beta_L^2+1)G_x^2}+\frac{R}{\gamma_y^s S}\right).\tag{57}
\end{aligned}
\]
Again, for $\beta_L=0$, (57) yields the convergence result in Lin et al. (2020a).

• Multiple equally weighted ($w_i=1/n$, $\forall\,i\in[n]$) clients, full client participation, stochastic case with synchronous client updates ($\tau_{\mathrm{eff}}=1$): in this case $A_w=B_w=1$, $C_w=D=0$. The bound in (48) reduces to
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2&\le\Theta\left(\frac{\bar\Delta_\Phi}{\gamma_x^s T}+\gamma_x^s L_f\Big(G_x^2+\frac{\sigma_L^2+\beta_L^2 G_x^2}{n}\Big)+\frac{\gamma_y^s L_f}{n}\big(\sigma_L^2+\sigma_G^2\beta_L^2\big)\right)\\
&\quad+\Theta\left(\gamma_x^s L_f G_x(S-1)\sqrt{G_x^2+\frac{\sigma_L^2+\beta_L^2 G_x^2}{n}}+\frac{RL_f}{\gamma_y^s S}\right).\tag{58}
\end{aligned}
\]
Note that, unlike existing analyses of synchronous-update algorithms (Woodworth et al. (2020); Yun et al. (2022); Sharma et al. (2022)), the bound in (58) depends on the inter-client heterogeneity $\sigma_G^2$. This is due to the more general noise assumption (Assumption 2). In the existing works, $\beta_L$ is assumed to be zero, in which case the bound in (58) is also independent of $\sigma_G^2$. See Appendix A.2 for a more detailed explanation.

• Multiple equally weighted ($w_i=1/n$, $\forall\,i\in[n]$) clients, full client participation, multiple but equal numbers of client updates ($\tau_i=\tau_{\mathrm{eff}}=\tau$, $\forall\,i\in[n]$): in this case $A_w=B_w=1$, $C_w=\tau-1$, $D=(\tau-1)(\tau-1+\beta_L^2)$. The bound in (48) then reduces to
\[
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2&\le\Theta\left(\frac{\bar\Delta_\Phi}{\tau\gamma_x^s T}+\tau\gamma_x^s L_f\Big(G_x^2+\frac{\sigma_L^2+\beta_L^2 G_x^2}{n\tau}\Big)+\frac{\gamma_y^s L_f\big(\sigma_L^2+\beta_L^2\sigma_G^2\big)}{n}\right)\tag{59}\\
&\quad+\Theta\left(L_f\tau\gamma_x^s G_x(S-1)\sqrt{G_x^2+\frac{\sigma_L^2+\beta_L^2 G_x^2}{n\tau}}+\frac{R}{\tau\gamma_y^s S}+(\tau-1)\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\big[\sigma_L^2+(\tau-1+\beta_L^2)(G_x^2+\sigma_G^2)\big]\right).
\end{aligned}
\]
For $\beta_L=\beta_G=0$, this setting reduces to the one considered in Sharma et al. (2022). However, as stated earlier, our bound on the local update error is tighter.
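The constants in the last special case can be sanity-checked numerically. The sketch below assumes the vanilla local-SGD weights $a_i=(1,\dots,1)\in\mathbb{R}^\tau$ and takes $a_{i,-1}$ as $a_i$ with the last entry dropped (a reading consistent with the reduced constants $C_w=\tau-1$ and $D=(\tau-1)(\tau-1+\beta_L^2)$ above); all numerical values are illustrative.

```python
import numpy as np

n, tau, beta_L = 8, 6, 0.3
w = np.ones(n) / n                               # equal weights w_i = 1/n
a = [np.ones(tau) for _ in range(n)]             # a_i = (1, ..., 1), tau local steps
a_m1 = [ai[:-1] for ai in a]                     # a_{i,-1}: drop the last entry
tau_eff = tau

l1 = np.array([np.sum(ai) for ai in a])          # ||a_i||_1 = tau
l2sq = np.array([np.sum(ai ** 2) for ai in a])   # ||a_i||_2^2 = tau

A_w = n * tau_eff * np.sum(w ** 2 * l2sq / l1 ** 2)
B_w = n * tau_eff * np.max(w * l2sq / l1 ** 2)
C_w = np.sum(w * np.array([np.sum(ai ** 2) for ai in a_m1]))
# D = M_{a-1} = max_i (||a_{i,-1}||_1^2 + beta_L^2 * ||a_{i,-1}||_2^2)
D = np.max([np.sum(ai) ** 2 + beta_L ** 2 * np.sum(ai ** 2) for ai in a_m1])
```

For these weights the constants collapse to $A_w=B_w=1$, $C_w=\tau-1$, and $D=(\tau-1)(\tau-1+\beta_L^2)$, matching the reduction used in (59).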

C.3 Proofs of the Intermediate Lemmas

Proof of Lemma C.1. Using the definition in (54), $\bar{\mathbf{x}}^{(t)}=\arg\min_{\mathbf{x}}\big\{\widetilde\Phi(\mathbf{x})+L_f\|\mathbf{x}-\mathbf{x}^{(t)}\|^2\big\}$. Also, note that
\[
\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t+1)})\le\widetilde\Phi(\bar{\mathbf{x}}^{(t)})+L_f\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t+1)}\big\|^2.\tag{60}
\]
Using the $\mathbf{x}$-updates in (46),
\[
\begin{aligned}
\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t+1)}\big\|^2&=\mathbb{E}\Big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}+\tau_{\mathrm{eff}}\gamma_x^s\sum_{i\in\mathcal{C}^{(t)}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t)}\Big\|^2\\
&=\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2+\tau_{\mathrm{eff}}^2[\gamma_x^s]^2\,\mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t)}\Big\|^2+2\tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\Big\langle\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)},\sum_{i=1}^n w_i\mathbf{h}_{x,i}^{(t)}\Big\rangle\\
&\le\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2+2\tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\big\langle\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)},\nabla_x\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\rangle+\tau_{\mathrm{eff}}^2[\gamma_x^s]^2\,\mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t)}\Big\|^2\\
&\quad+\tau_{\mathrm{eff}}\gamma_x^s\,\mathbb{E}\left[\frac{L_f}{2}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2+\frac{2}{L_f}\Big\|\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Big(\nabla_x f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})-\nabla_x f_i(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\Big)\Big\|^2\right]\tag{61}\\
&\le\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2+\tau_{\mathrm{eff}}^2[\gamma_x^s]^2\,\mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t)}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t)}\Big\|^2\\
&\quad+\tau_{\mathrm{eff}}\gamma_x^s\left[\frac{L_f}{2}\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2+2L_f\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)+2\,\mathbb{E}\big\langle\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)},\nabla_x\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\rangle\right],\tag{62}
\end{aligned}
\]
where (62) follows from $L_f$-smoothness (Assumption 1) and Jensen's inequality. From (19), (20), we can bound $\mathbb{E}\big\|\sum_{i\in\mathcal{C}^{(t')}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t')}\big\|^2$ as follows:
\[
\begin{aligned}
\mathbb{E}\Big\|\sum_{i\in\mathcal{C}^{(t')}}\widetilde w_i\,\mathbf{g}_{x,i}^{(t')}\Big\|^2&\le\frac{n}{P}\sum_{i=1}^n\frac{w_i^2}{\|a_i\|_1^2}\sum_{k=0}^{\tau_i-1}[a_i^{(k)}(\tau_i)]^2\Big(\sigma_L^2+\beta_L^2\,\mathbb{E}\big\|\nabla_x f_i(\mathbf{x}_i^{(t,k)},\mathbf{y}_i^{(t,k)})\big\|^2\Big)\\
&\quad+\frac{n}{P}\frac{P-1}{n-1}\,\mathbb{E}\Big\|\sum_{i=1}^n w_i\mathbf{h}_{x,i}^{(t)}\Big\|^2+\frac{n}{P}\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\,\mathbb{E}\big\|\mathbf{h}_{x,i}^{(t)}\big\|^2\\
&\le\frac{n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+\frac{n(P-1)}{P(n-1)}G_x^2+\frac{n(n-P)}{P(n-1)}G_x^2\sum_{i=1}^n w_i^2,\tag{63}
\end{aligned}
\]
where the final inequality follows from Assumption 7. Next, we bound the inner-product term in (62). Using $L_f$-smoothness of $\widetilde F$ (Assumption 1):
\[
\begin{aligned}
\mathbb{E}\big\langle\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)},\nabla_x\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big\rangle&\le\mathbb{E}\Big[\widetilde F(\bar{\mathbf{x}}^{(t)},\mathbf{y}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})+\frac{L_f}{2}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2\Big]\\
&\le\mathbb{E}\Big[\widetilde\Phi(\bar{\mathbf{x}}^{(t)})+L_f\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2\Big]-\mathbb{E}\,\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})-\frac{L_f}{2}\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2\\
&\le\mathbb{E}\,\widetilde\Phi(\mathbf{x}^{(t)})-\mathbb{E}\,\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})-\frac{L_f}{2}\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2 &&\text{(by definition of }\bar{\mathbf{x}}^{(t)})\\
&=\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]-\frac{L_f}{2}\mathbb{E}\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2.\tag{64}
\end{aligned}
\]
Substituting the bounds from (62) and (64) into (60), we get
\[
\begin{aligned}
\mathbb{E}\big[\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t+1)})\big]&\le\mathbb{E}\Big[\widetilde\Phi(\bar{\mathbf{x}}^{(t)})+L_f\big\|\bar{\mathbf{x}}^{(t)}-\mathbf{x}^{(t)}\big\|^2\Big]\\
&\quad+\tau_{\mathrm{eff}}^2[\gamma_x^s]^2 L_f\frac{n}{P}\left[\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2+\beta_L^2 G_x^2\big)+G_x^2\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\sum_{i=1}^n w_i^2\Big)\right]\\
&\quad+2\tau_{\mathrm{eff}}\gamma_x^s L_f^2\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)+2\tau_{\mathrm{eff}}\gamma_x^s L_f\,\mathbb{E}\big[\widetilde\Phi(\mathbf{x}^{(t)})-\widetilde F(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\big]-\frac{\tau_{\mathrm{eff}}\gamma_x^s}{8}\mathbb{E}\big\|\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big\|^2,
\end{aligned}
\]
and the first term equals $\mathbb{E}\big[\widetilde\Phi_{1/2L_f}(\mathbf{x}^{(t)})\big]$, where we use $\nabla\widetilde\Phi_{1/2L_f}(\mathbf{x})=2L_f(\mathbf{x}-\bar{\mathbf{x}})$ from (54).

Proof of Lemma C.2. We use the client update equations for the individual iterates in (45). To bound $\Delta_{x,y}^{(t,k)}(i)$, we first bound the $\mathbf{x}$-error $\mathbb{E}\big\|\mathbf{x}_i^{(t,k)}-\mathbf{x}^{(t)}\big\|^2$. Starting from (25), using Assumption 7, for $1\le k\le\tau_i$,
\[
\begin{aligned}
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\mathbb{E}\big\|\mathbf{x}_i^{(t,k)}-\mathbf{x}^{(t)}\big\|^2&\le\frac{[\eta_x^c]^2}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\left[\sum_{j=0}^{k-1}[a_i^{(j)}(k)]^2\big(\sigma_L^2+\beta_L^2 G_x^2\big)+\Big(\sum_{j=0}^{k-1}a_i^{(j)}(k)\Big)^2 G_x^2\right]\\
&\le[\eta_x^c]^2\Big[\sigma_L^2\|a_{i,-1}\|_2^2+G_x^2\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)\Big],\tag{65}
\end{aligned}
\]
where we use (26). Next, we bound $\mathbb{E}\big\|\mathbf{y}_i^{(t,k)}-\mathbf{y}^{(t)}\big\|^2$, using the bound from (30), to get
\[
\begin{aligned}
\frac{1}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\mathbb{E}\big\|\mathbf{y}_i^{(t,k)}-\mathbf{y}^{(t)}\big\|^2&\le[\eta_y^c]^2\sigma_L^2\|a_{i,-1}\|_2^2+2[\eta_y^c]^2 L_f^2\big(\|a_{i,-1}\|_1+\beta_L^2\alpha\big)\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_y^{(t,k)}(i)\\
&\quad+2[\eta_y^c]^2\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)\mathbb{E}\big\|\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big\|^2.\tag{66}
\end{aligned}
\]
Compared to (30), the difference is the presence of $\Delta_y^{(t,k)}(i)$ in (66), rather than $\Delta_{x,y}^{(t,k)}(i)$. Taking a weighted sum over clients in (66), we get
\[
\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}L_f^2\,a_i^{(k)}(\tau_i)\Delta_y^{(t,k)}(i)\le 2[\eta_y^c]^2 L_f^2\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2M_{a-1}\Big(\beta_G^2\,\mathbb{E}\big\|\nabla_y\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big\|^2+\sigma_G^2\Big)\right],\tag{67}
\]
where we choose $\eta_y^c$ such that $A_m\triangleq 2L_f^2[\eta_y^c]^2\max_i\|a_i\|_1\big(\|a_{i,-1}\|_1+\beta_L^2\alpha\big)\le\frac{1}{2}$, and define $M_{a-1}\triangleq\max_i\big(\|a_{i,-1}\|_1^2+\beta_L^2\|a_{i,-1}\|_2^2\big)$.

Next, it follows from $L_f$-smoothness (Assumption 1) and Lemma A.7 that
\[
\mathbb{E}\big\|\nabla_y\widetilde F\big(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)}\big)\big\|^2\le 2L_f\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big].
\]
Subsequently, combining (65) and (67), we get
\[
\begin{aligned}
\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}L_f^2\,a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)&\le 2\big([\eta_x^c]^2+[\eta_y^c]^2\big)L_f^2\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+4L_f^2 M_{a-1}\big([\eta_x^c]^2 G_x^2+[\eta_y^c]^2\sigma_G^2\big)\\
&\quad+8[\eta_y^c]^2 L_f^3 M_{a-1}\beta_G^2\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big],
\end{aligned}
\]
which finishes the proof.

Proof of Lemma C.3. We define $\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\in\arg\max_{\mathbf{y}}\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y})$. Then,
\[
\begin{aligned}
\mathbb{E}\big\|\mathbf{y}^{(t+1)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2&\overset{(46)}{=}\mathbb{E}\big\|\mathbf{y}^{(t)}+\tau_{\mathrm{eff}}\gamma_y^s\mathbf{g}_y^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2\\
&=\mathbb{E}\big\|\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2+\tau_{\mathrm{eff}}^2[\gamma_y^s]^2\,\mathbb{E}\big\|\mathbf{g}_y^{(t)}\big\|^2+2\tau_{\mathrm{eff}}\gamma_y^s\,\mathbb{E}\Big\langle\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)}),\sum_{i=1}^n w_i\mathbf{h}_{y,i}^{(t)}\Big\rangle.\tag{68}
\end{aligned}
\]
$\mathbb{E}\|\mathbf{g}_y^{(t)}\|^2$ is bounded in (37). We only need to further bound $\mathbb{E}\big\|\sum_{i=1}^n w_i\mathbf{h}_{y,i}^{(t)}\big\|^2$, which appears in (37):
\[
\begin{aligned}
\mathbb{E}\Big\|\sum_{i=1}^n w_i\mathbf{h}_{y,i}^{(t)}\Big\|^2&\le\mathbb{E}\Big\|\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Big(\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})-\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})+\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\Big)\Big\|^2\\
&\le 2L_f^2\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_y^{(t,k)}(i)+2\,\mathbb{E}\big\|\nabla_y\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big\|^2 &&\text{(Jensen's inequality)}\\
&\le 2L_f^2\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_y^{(t,k)}(i)+4L_f\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big].\tag{69}
\end{aligned}
\]
Next, we bound the third term in (68):
\[
\begin{aligned}
\mathbb{E}\Big\langle\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)}),\sum_{i=1}^n w_i\mathbf{h}_{y,i}^{(t)}\Big\rangle&=\mathbb{E}\Big\langle\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)}),\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})\Big\rangle\\
&=\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\mathbb{E}\Big[\big\langle\mathbf{y}^{(t)}-\mathbf{y}_i^{(t,k)},\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})\big\rangle+\big\langle\mathbf{y}_i^{(t,k)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)}),\nabla_y f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})\big\rangle\Big]\\
&\le\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\,\mathbb{E}\Big[f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})-f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})+\frac{L_f}{2}\big\|\mathbf{y}^{(t)}-\mathbf{y}_i^{(t,k)}\big\|^2 &&\text{($L_f$-smoothness)}\\
&\hspace{4cm}+f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}_i^{(t,k)})-f_i(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)}))\Big] &&\text{(concavity in $\mathbf{y}$)}\\
&=\frac{L_f}{2}\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_y^{(t,k)}(i)-\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big].\tag{70}
\end{aligned}
\]
Substituting (37), (69), (70) in (68), we get
\[
\begin{aligned}
&\mathbb{E}\big\|\mathbf{y}^{(t+1)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2\\
&\le\mathbb{E}\big\|\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2+\tau_{\mathrm{eff}}^2[\gamma_y^s]^2\left[\frac{\sigma_L^2 n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+\frac{2\sigma_G^2 n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\right]\\
&\quad-2\tau_{\mathrm{eff}}\gamma_y^s\left[1-2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\Big(\frac{P-1}{n-1}+\beta_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\Big)\right]\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]\\
&\quad+\tau_{\mathrm{eff}}\gamma_y^s L_f\left[1+2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)\right]\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i),\tag{71}
\end{aligned}
\]
since $\Delta_y^{(t,k)}\le\Delta_{x,y}^{(t,k)}$. Using the bound on $\Delta_{x,y}^{(t,k)}$ from Lemma C.2,
\[
\begin{aligned}
\sum_{i=1}^n\frac{w_i}{\|a_i\|_1}\sum_{k=0}^{\tau_i-1}a_i^{(k)}(\tau_i)\Delta_{x,y}^{(t,k)}(i)&\le 2\big([\eta_x^c]^2+[\eta_y^c]^2\big)\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+4M_{a-1}\big([\eta_x^c]^2 G_x^2+[\eta_y^c]^2\sigma_G^2\big)\\
&\quad+8[\eta_y^c]^2 L_f M_{a-1}\beta_G^2\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big].\tag{72}
\end{aligned}
\]
We substitute (72) in (71), and simplify the terms using the choice of $\gamma_y^s,\eta_y^c$, to get
\[
\begin{aligned}
\mathbb{E}\big\|\mathbf{y}^{(t+1)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2&\le\mathbb{E}\big\|\mathbf{y}^{(t)}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2+\tau_{\mathrm{eff}}^2[\gamma_y^s]^2\left[\frac{\sigma_L^2 n}{P}\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+\frac{2\sigma_G^2 n}{P}\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\right]\\
&\quad-\tau_{\mathrm{eff}}\gamma_y^s\,\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]+4\tau_{\mathrm{eff}}\gamma_y^s L_f\big([\eta_x^c]^2+[\eta_y^c]^2\big)\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2M_{a-1}\big(G_x^2+\sigma_G^2\big)\right],
\end{aligned}
\]
using $\gamma_y^s,\eta_y^c$ that satisfy
\[
\begin{aligned}
2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\Big(\frac{P-1}{n-1}+\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_{i,k}\frac{w_i\,a_i^{(k)}(\tau_i)}{\|a_i\|_1}\Big)&\le 1,\\
2\tau_{\mathrm{eff}}\gamma_y^s L_f\frac{n}{P}\Big(\frac{P-1}{n-1}+\beta_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\Big)&\le\frac{1}{4},\\
2L_f\cdot 8[\eta_y^c]^2 M_{a-1}L_f\beta_G^2\frac{n}{P}&\le\frac{1}{4}.
\end{aligned}
\]
Then the coefficient of $\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]$ can be bounded by $-\tau_{\mathrm{eff}}\gamma_y^s$. Consequently, by rearranging the terms and summing over $t$, we get the result:
\[
\begin{aligned}
\frac{1}{S}\sum_{t=sS}^{(s+1)S-1}\mathbb{E}\big[\widetilde\Phi(\widehat{\mathbf{x}}^{(s)})-\widetilde F(\widehat{\mathbf{x}}^{(s)},\mathbf{y}^{(t)})\big]&\le\frac{\mathbb{E}\big\|\mathbf{y}^{sS}-\mathbf{y}^*(\widehat{\mathbf{x}}^{(s)})\big\|^2}{\tau_{\mathrm{eff}}\gamma_y^s S}+\tau_{\mathrm{eff}}\gamma_y^s\frac{n}{P}\left[\sigma_L^2\sum_{i=1}^n\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}+2\sigma_G^2\Big(\frac{n-P}{n-1}\max_i w_i+\beta_L^2\max_i\frac{w_i\|a_i\|_2^2}{\|a_i\|_1^2}\Big)\right]\\
&\quad+4L_f\big([\eta_x^c]^2+[\eta_y^c]^2\big)\left[\sigma_L^2\sum_{i=1}^n w_i\|a_{i,-1}\|_2^2+2M_{a-1}\big(G_x^2+\sigma_G^2\big)\right].
\end{aligned}
\]
Proof of Lemma C.4. Let $t \in \{sS, sS+1, \ldots, (s+1)S-1\}$, where $s$ is a non-negative integer, and let $\hat{x}^{(s)}$ be the latest snapshot
iterate for the $y$-update in Algorithm 1-Fed-Norm-SGDA+. Then
$$\mathbb{E}\Big[\widetilde{\Phi}\big(x^{(t)}\big) - \widetilde{F}\big(x^{(t)}, y^{(t)}\big)\Big]$$
$$= \mathbb{E}\Big[\widetilde{F}\big(x^{(t)}, y^*(x^{(t)})\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^*(\hat{x}^{(s)})\big) + \widetilde{F}\big(\hat{x}^{(s)}, y^*(\hat{x}^{(s)})\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^{(t)}\big) + \widetilde{F}\big(\hat{x}^{(s)}, y^{(t)}\big) - \widetilde{F}\big(x^{(t)}, y^{(t)}\big)\Big]$$
$$\le \mathbb{E}\Big[\widetilde{F}\big(x^{(t)}, y^*(x^{(t)})\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^*(x^{(t)})\big)\Big] + \mathbb{E}\Big[\widetilde{F}\big(\hat{x}^{(s)}, y^*(\hat{x}^{(s)})\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^{(t)}\big)\Big] + G_x\,\mathbb{E}\big\|x^{(t)} - \hat{x}^{(s)}\big\|$$
$$\le 2G_x\,\mathbb{E}\big\|x^{(t)} - \hat{x}^{(s)}\big\| + \mathbb{E}\Big[\widetilde{\Phi}\big(\hat{x}^{(s)}\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^{(t)}\big)\Big], \tag{73}$$

where $y^*(\cdot) \in \arg\max_y \widetilde{F}(\cdot, y)$, and (73) follows from the $G_x$-Lipschitz continuity of $\widetilde{F}(\cdot, y)$ (Assumption 7). Next, we
see that
$$\mathbb{E}\big\|\hat{x}^{(s)} - x^{(t)}\big\| \le \sqrt{\mathbb{E}\big\|\hat{x}^{(s)} - x^{(t)}\big\|^2} \qquad \text{(Jensen's inequality)}$$
$$\overset{(46)}{=} \sqrt{\mathbb{E}\bigg\|\tau_{\mathrm{eff}}\gamma_x^s\sum_{t'=sS}^{t-1}\sum_{i\in\mathcal{C}^{(t')}}\widetilde{w}_i\, g_{x,i}^{(t')}\bigg\|^2}$$
$$\le \tau_{\mathrm{eff}}\gamma_x^s\sqrt{(S-1)\sum_{t'=sS}^{t-1}\mathbb{E}\bigg\|\sum_{i\in\mathcal{C}^{(t')}}\widetilde{w}_i\, g_{x,i}^{(t')}\bigg\|^2}$$
$$\le \tau_{\mathrm{eff}}\gamma_x^s (S-1)\sqrt{\frac{n}{P}\left[\sum_{i=1}^{n}\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2 + \beta_L^2 G_x^2\big) + G_x^2\left(\frac{P-1}{n-1} + \frac{n-P}{n-1}\sum_{i=1}^{n} w_i^2\right)\right]}. \qquad \text{(from (63))}$$
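The $(S-1)$ factor in the third step above comes from the standard Cauchy–Schwarz bound on a sum of at most $S-1$ vectors; spelled out,

```latex
\mathbb{E}\Big\| \sum_{t'=sS}^{t-1} z_{t'} \Big\|^2
  \;\le\; (t - sS) \sum_{t'=sS}^{t-1} \mathbb{E}\big\| z_{t'} \big\|^2
  \;\le\; (S-1) \sum_{t'=sS}^{t-1} \mathbb{E}\big\| z_{t'} \big\|^2,
\qquad \text{with } z_{t'} = \sum_{i \in \mathcal{C}^{(t')}} \widetilde{w}_i\, g_{x,i}^{(t')},
```

since $t \le (s+1)S - 1$ implies $t - sS \le S - 1$.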
Using this bound in (73), and summing over $t$, we get
$$\frac{1}{S}\sum_{t=sS}^{(s+1)S-1}\mathbb{E}\Big[\widetilde{\Phi}\big(x^{(t)}\big) - \widetilde{F}\big(x^{(t)}, y^{(t)}\big)\Big] \le \frac{1}{S}\sum_{t=sS}^{(s+1)S-1}\mathbb{E}\Big[\widetilde{\Phi}\big(\hat{x}^{(s)}\big) - \widetilde{F}\big(\hat{x}^{(s)}, y^{(t)}\big)\Big]$$
$$\quad + 2\tau_{\mathrm{eff}}\gamma_x^s G_x (S-1)\sqrt{\frac{n}{P}\left[\sum_{i=1}^{n}\frac{w_i^2\|a_i\|_2^2}{\|a_i\|_1^2}\big(\sigma_L^2 + \beta_L^2 G_x^2\big) + G_x^2\left(\frac{P-1}{n-1} + \frac{n-P}{n-1}\sum_{i=1}^{n} w_i^2\right)\right]}.$$
Finally, summing over $s = 0$ to $T/S - 1$, we get the result.

C.4 Extending the Result to Nonconvex One-Point-Concave (NC-1PC) Functions (Theorem 3)
Carefully revisiting the proof of Theorem 2, we notice that Lemma C.1 and Lemma C.2 do not rely on the concavity
assumption. Lemma C.3 does use concavity of the local functions $\{f_i\}$. However, it is only needed to derive (70), and
this step only requires concavity of the local functions at the global point $y^*(\hat{x}^{(s)})$. Therefore, as mentioned earlier in Remark 8,
it holds even for NC-1PC functions. This is an independent result in itself, since we have extended the existing
convergence result of the local stochastic gradient method for convex minimization (concave maximization) problems to a
much more general one-point-convex minimization (one-point-concave maximization) problem. Therefore, we restate
it here for the more general case.
Lemma C.5 (Local SG updates for One-Point-Concave Maximization). Suppose the local loss functions $\{f_i\}$ satisfy
Assumptions 1, 2, 3, 7. Suppose for all $x$, all the $f_i$'s satisfy Assumption 8 at a common global maximizer $y^*(x)$, and
that $\|y^{(t)}\|^2 \le R$ for all $t$. If we run Algorithm 1-Fed-Norm-SGDA+ with the same conditions on the client and server
step-sizes $\eta_y^c, \gamma_y^s$, respectively, as in Lemma C.3, then the iterates generated by Algorithm 1-Fed-Norm-SGDA+ also
satisfy the bound in Lemma C.3.
Next, Lemma C.4 also holds irrespective of concavity. Therefore, the resulting convergence result in Theorem 2 for
nonconvex-concave minimax problems holds for a much larger class of functions. We restate the modified theorem
statement briefly.

Theorem. Suppose the local loss functions $\{f_i\}$ satisfy Assumptions 1, 2, 3, 7. Suppose for all $x$, all the $f_i$'s satisfy
Assumption 8 at a common global maximizer $y^*(x)$, and that $\|y^{(t)}\|^2 \le R$ for all $t$. If we run Algorithm 1-Fed-Norm-
SGDA+ with the same conditions on the client and server step-sizes $\eta_y^c, \gamma_y^s$, respectively, as in Theorem 4, then the
iterates generated by Algorithm 1-Fed-Norm-SGDA+ also satisfy the bound in Theorem 4.
Remark 9. Again, choosing client weights {wi } the same as in the original global objective {pi }, we get convergence
in terms of the original objective F .
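To make the remark concrete: the surrogate objective analyzed above is the $w$-weighted average of the client losses, so setting $w_i = p_i$ makes it coincide with the original objective,

```latex
\widetilde{F}(x, y) \;=\; \sum_{i=1}^{n} w_i\, f_i(x, y)
\;\overset{w_i = p_i}{=}\; \sum_{i=1}^{n} p_i\, f_i(x, y) \;=\; F(x, y),
```

and hence the bounds stated for $\widetilde{F}$ (and $\widetilde{\Phi}$) transfer directly to $F$ (and $\Phi$).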

D Additional Experiments
For communicating parameters and related information amongst the clients, Ethernet connections were used. Our
algorithm was implemented using the parallel training tools in PyTorch 1.0.0 and Python 3.6.3.
For both the robust NN training and fair classification experiments, we use a batch size of 32 in all the algorithms.
The momentum parameter 0.9 is used only in Momentum Local SGDA(+).

Robust NN Training Here, we further explore the performance of Fed-Norm-SGDA+ on the robust NN training problem.
We use the VGG-11 model to classify the CIFAR10 dataset. In Figure 6, we demonstrate the effect of increasing data
heterogeneity across clients, while in Figure 7 we show the advantage of using multiple clients for the federated minimax
problem. With a k-fold increase in n, we observe an almost k-fold drop in the number of communication rounds needed
to reach a target test accuracy (70% here).
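This near-linear scaling is what a $1/\sqrt{n}$-type error term predicts. As an illustration (assuming, for simplicity, that the dominant term of the bound decays as $c/\sqrt{nT}$ for some constant $c$; this is a stylized rate, not the paper's exact bound), the number of rounds needed for a target error $\epsilon$ scales inversely with $n$:

```latex
\frac{c}{\sqrt{nT}} \le \epsilon
\;\Longleftrightarrow\;
T \;\ge\; \frac{c^2}{n\,\epsilon^2},
```

so a $k$-fold increase in $n$ cuts the required number of communication rounds $T$ by roughly a factor of $k$.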

[Figure: "Varying Heterogeneity" — test accuracy vs. number of communication rounds (0 to 200), for client data distributed as Dir15(α) with α ∈ {0.01, 0.05, 0.1, 0.5, 1.0, 10.0}, and for an IID distribution.]
Figure 6: Effect of inter-client data heterogeneity (quantified by α) on the performance of Fed-Norm-SGDA+.
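The Dir15(α) partitions above follow the standard Dirichlet-based label-splitting recipe for simulating inter-client heterogeneity: for each class, client shares are drawn from Dir(α), so smaller α yields more skewed client datasets. A minimal sketch (the function name and the toy 15-client/10-class setup are our illustration, not the paper's exact code):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices among clients, drawing each class's
    client proportions from Dir(alpha). Smaller alpha -> more skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # proportions of class c assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # cut points splitting idx according to the sampled proportions
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: 15 clients, 10 classes (as in the CIFAR10 experiments)
labels = np.repeat(np.arange(10), 100)  # toy label vector
parts = dirichlet_partition(labels, n_clients=15, alpha=0.1)
assert sum(len(p) for p in parts) == len(labels)  # every sample assigned once
```

With α = 10 the per-client label distributions are close to uniform (near-IID), while α = 0.01 concentrates most of each class on a few clients.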

[Figure: speedup vs. number of clients n ∈ {2, 5, 8, 11, 14, 17, 20}.]
Figure 7: Effect of increasing client-set on the performance of Fed-Norm-SGDA+ in a robust NN training task.

[Figure: test accuracy vs. number of communication rounds (0 to 100), for P ∈ {5, 10, 15} participating clients.]
Figure 8: Effect of partial client participation on the performance of Fed-Norm-SGDA in a fair image classification task.

Fair Classification We also demonstrate the impact of partial client participation in the fair classification problem.
Figure 8 complements the corresponding partial-participation results in the main paper, evaluating the fairness of a VGG11 model on the CIFAR10 dataset. We
plot the test accuracy of the model on the worst distribution. With an increasing number of participating clients,
the performance consistently improves.
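This trend is consistent with the sampling-without-replacement factor $\frac{n}{P}\frac{n-P}{n-1}$ that multiplies the client-heterogeneity terms in the bounds above: it shrinks as $P$ grows and vanishes at full participation. A quick numeric check for the $n = 15$ setting of Figure 8 (illustrative arithmetic only):

```python
def participation_factor(n: int, P: int) -> float:
    """Variance inflation from sampling P of n clients without
    replacement; it multiplies the heterogeneity terms in the bounds."""
    return (n / P) * (n - P) / (n - 1)

for P in (5, 10, 15):
    print(P, round(participation_factor(15, P), 3))
# -> 5 2.143
#    10 0.536
#    15 0.0
```

At P = n the factor is exactly zero, recovering the full-participation bound.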
