GSASG: Global Sparsification with Adaptive Aggregated Stochastic Gradients for Communication-efficient Federated Learning

Journal: IEEE Internet of Things Journal
Manuscript ID: IoT-31697-2023
Manuscript Type: Regular Article
Date Submitted by the Author: 12-Aug-2023
Complete List of Authors: Du, Runmeng (East China Normal University, Software Engineering Institute); He, Daojing (East China Normal University, School of Information Engineering); Ding, Zikang (East China Normal University); Wang, Miao (East China Normal University); Chan, Sammy (City University of Hong Kong, EE); Li, Xuru (East China Normal University, Software Engineering Institute)
Keywords: Federated Learning, Sparse Communication, Adaptive Gradients, Distributed Learning
GSASG: Global Sparsification with Adaptive Aggregated Stochastic Gradients for Communication-efficient Federated Learning

Runmeng Du, Daojing He, Zikang Ding, Miao Wang, Sammy Chan, Xuru Li

Corresponding author: Daojing He. R. Du, D. He, Z. Ding, M. Wang and X. Li are with the School of Software Engineering, East China Normal University, Shanghai, China, 200062 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). S. Chan is with the Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China (e-mail: [email protected]).

Abstract—This paper addresses the challenge of communication efficiency in federated learning through a proposed algorithm called global sparsification with adaptive aggregated stochastic gradients (GSASG). GSASG leverages the advantages of local sparse communication, global sparsification communication, and adaptive aggregated gradients. More specifically, we devise an efficient global top-k′ sparsification operator. By applying this operator to the aggregated gradients obtained from top-k sparsification, the global model parameter is rarefied to reduce the downloaded transmitted bits from O(dMT) to O(k′MT), where d is the dimension of the gradient, M is the number of workers, T is the total number of iterations, and k′ ≤ k < d. Meanwhile, the adaptive aggregated gradient method is adopted to skip meaningless communication and reduce communication rounds. Deep neural network training experiments demonstrate that, compared to previous algorithms, GSASG significantly reduces communication cost without sacrificing model performance. For instance, on the MNIST dataset with k = 1%d and k′ = 0.5%d, in terms of communication rounds, GSASG outperforms sparse communication by 91%, adaptive aggregated gradients by 90%, and the combination of sparse communication with adaptive aggregated gradients by 58%. In terms of communication bits, GSASG requires only 1% of the communication bits needed by previous algorithms.

Index Terms—Federated learning, sparse communication, adaptive gradients, distributed learning.

I. INTRODUCTION

Internet of things (IoT) devices, in general, have constraints on computing power and storage [1], [2]. Federated learning (FL) is designed for decentralized environments and can turn these numerous devices into an edge computing network where learning takes place on the device, reducing the need for data transmission and thus saving energy and communication costs [3]–[5]. Using data parallelism and a larger number of training workers, the total computation cost of training can be significantly reduced. However, the saving in computation pales in comparison to the information exchanged between the server and workers, especially for recurrent neural networks with low computation-communication ratios [6], [7]. Meanwhile, the high communication cost associated with this information exchange poses a scalability challenge for FL [8], [9]. More particularly, workers with limited communication resources may find it difficult to cope with the high communication overhead [10], [11]. These workers would then give up training, resulting in suboptimal model performance. Therefore, the high communication cost is a major bottleneck for FL [12], [13].

In this context, many well-known methods have been proposed to address communication-efficient FL, including sparse communication [14]–[17] and gradient aggregation [18], [19]. Sparse communication methods, such as top-k, mainly choose the k highest-magnitude components of the gradient, while the adaptive aggregated gradient method sets a threshold to skip meaningless communication rounds. Another approach to improving communication efficiency in FL is quantization [20], [21], which works mainly by limiting the communication bits that represent floating-point numbers. Unfortunately, quantization only utilizes a maximum of 32 bits to quantify a floating-point number, and sparse communication outperforms the quantization method when the model parameter's dimension d is high [7].

This paper focuses on communication-efficient FL based on sparse communication and adaptive aggregated gradient methods. Compared to existing algorithms [7], [22]–[24], the main challenges lie in reducing communication rounds while maintaining model performance and extending sparse communication to include the global model parameter. To address these challenges, this paper proposes a novel method called global sparsification with adaptive aggregated stochastic gradients for communication-efficient FL (GSASG). Our contributions are three-fold:

• We devise an efficient global top-k′ sparsification operator, replacing the previous global top-k approach, with the aim of further reducing communication bits. Here, k′ ≤ k. Furthermore, we propose a combination of the top-k and global top-k′ approaches, and we conduct a thorough investigation into the relationship between k and k′. When our focus is on the model's accuracy, we can minimize the difference between k′ and k. Conversely, when our focus is on the model's convergence, we can maximize the difference between k′ and k. Taking both the model's convergence and accuracy into consideration, and considering that commonly k = 1%d, we recommend using k′ = 0.5%d, where 1%d and 0.5%d represent 1 and 0.5 percent of d and d is the model parameter's dimension. This choice strikes a suitable balance and proves to be effective in practice.
• To avoid meaningless communication, GSASG determines which workers (M^t) communicate with the server and which workers (M_c^t) do not, using the adaptive aggregated gradient. Then, by exploring the relationship between the sparse aggregated gradients and the global model parameter ω, this paper defines a changed component ω_changed of ω. Based on ω_changed, the global parameter sparsification and recovery algorithms are provided. During the information exchange, instead of ω with d dimensions, ω_changed with k′ dimensions is transmitted. As a result, the communication cost of GSASG is reduced from 32dMT to 32k′ Σ_{t=1}^T (|M^t| + |M_c^t|), where |M^t| is the number of workers who communicate with the server at the t-th iteration, |M_c^t| is the number of workers who do not communicate with the server at the t-th iteration, T is the total number of iterations, and k′ < d.

• Experiments on training deep neural networks demonstrate that GSASG can significantly reduce communication cost compared to previous algorithms [7], [22]–[24] while ensuring model accuracy. Taking the MNIST dataset as an example (k = 1%d, k′ = 0.5%d): in terms of communication rounds, GSASG improves by 91% over sparse communication [23], 90% over adaptive aggregated gradients [24], and 58% over sparse communication combined with adaptive aggregated gradients [7]. In terms of communication bits, GSASG requires only 1% of the communication bits needed by previous algorithms.

The remainder of this paper is structured as follows. Section II provides an overview of related work on communication-efficient FL and summarizes the related challenges. In Section III, we introduce the necessary background and preliminaries that are relevant to the design of GSASG. Section IV presents the system model and algorithm of GSASG. Section V provides a complexity analysis of GSASG. In Section VI, we evaluate the performance of GSASG and discuss the experimental results. In Section VII, we discuss the impact of the value of k′ on performance and give a further optimization idea for GSASG. Finally, in Section VIII, we provide concluding remarks and summarize the contributions of this paper.

II. RELATED WORK

The prior works in this field can be classified into four categories: i) sparse communication, ii) adaptive aggregated gradients, iii) sparse communication combined with adaptive aggregated gradients, and iv) other combinations.

Sparse communication. The idea of sparse communication is that each worker chooses the k highest-magnitude components (top-k) of a gradient and transmits them to the server in each iteration [14], [25]. Various variants of the top-k operation have been proposed to achieve more aggressive compression or reduce implementation complexity. For example, AGS [26] introduced an adaptive sparse gradient upload mode to determine near-optimal trade-offs between communication and computation. S3GD-MV [27] addressed the sign loss issue in aggregated gradients by proposing a voting mechanism. MIPD [16] determined gradient priority based on the l2-norm of each layer, considering the impact on model convergence. Whether a method chooses a larger gradient or a gradient according to the structure of the neural network, it involves a sorting problem. To avoid GPU-unfriendly sorting operations, improved sparse communication methods were proposed to improve learning performance [28], [29]. However, most of these works [16], [26]–[29] focus on the sparsification operator for the uploaded transmitted gradient, while neglecting the sparsification operator for aggregated gradients. The concept of global sparsity, introduced by the global top-k [14], selects the k largest absolute values of the aggregated gradients to update the global model. FetchSGD [15] used sketch operations S and U for communication-efficient FL, where the momentum and error accumulations could be performed in the sketch S, and the U operation could compute an unbiased estimate of the original gradient. In FetchSGD [15], the top-k operation performed by the server on the original aggregated gradients was equivalent to a sparsification operation on the global gradient. However, after performing a sparsification operation on the aggregated gradients [14], [15], the server will perform a global model update, and the global model parameter transmitted by the server is still d-dimensional (d ≫ k). Other works, such as S3GD-MV [27] and DSFL [30], did not explicitly define the concept of global sparsification, but proposed that the server could only broadcast the top-k sparsified aggregated gradients. In this manner, the direct transmission of the top-k sparsified aggregated gradients would reduce the transmitted bits (≥ k) of the downloaded gradient, but S3GD-MV and DSFL required each worker m to store an additional d-dimensional global model parameter for the global model update.

Adaptive aggregated gradient. The adaptive aggregated gradient approach focuses on developing aggregation rules to avoid unnecessary communication rounds [19], [24], [31], [32]. LAG [32] introduced an adaptive skip-gradient rule but faced challenges due to the non-diminishing variance of stochastic gradients. LASG [18] proposed a new adaptive rule based on different iterations with convergence rates comparable to classic SGD. Furthermore, LAGC [19] combined the benefits of gradient coding and the adaptive rule of LAG into novel strategies. However, gradient coding imposes additional storage cost due to increased redundant gradients on the worker side.

Sparse communication combined with adaptive aggregated gradients. SASG [7] leveraged the benefits of sparse communication and adaptive aggregated gradients to design a distributed algorithm that is communication-efficient. However, SASG primarily focuses on reducing the bits of the uploaded gradient and the communication rounds, while there are additional opportunities to reduce the transmitted bits of the downloaded gradient, further minimize communication rounds, and reduce storage costs on the server or worker side.

Other combinations. LAQ [21] suggested quantization and avoiding meaningless communications between the worker and the server to reduce the communication cost. SparseSecAgg [17] combined the top-k operation with differential privacy to achieve privacy-preserving FL.
Challenge. In conclusion, current challenges in achieving communication-efficient FL can be summarized as follows:

• Most of the existing works predominantly focused on sparsification for the uploaded gradients, overlooking the sparsification operator for aggregated gradients.
• Some research works introduced the concept of global sparsification for aggregated gradients. When applying the global top-k to FL, two key problems arise. Firstly, updating the transmitted global model parameter based on aggregated gradients still results in a d-dimensional representation, leading to increased communication bits. Additionally, directly broadcasting aggregated gradients also incurs higher storage costs per worker. Secondly, the global top-k method requires that the dimension of the aggregated gradients remains fixed at k, consistent with the dimension of the uploaded gradient based on the top-k operation. However, in sparse gradient aggregation, some components may have less significance and do not significantly contribute to the model's convergence.
• Combining sparse communication with adaptive aggregated gradients has been demonstrated to improve communication efficiency in FL, as shown by the SASG algorithm. However, there are still opportunities to further reduce the communication bits of aggregated gradients and the communication rounds by using global sparsification communication. One straightforward extension is to combine global top-k with top-k and adaptive gradients. However, in the adaptive aggregated gradient method, the worker is required to store an old gradient with d dimensions. In the global top-k, the worker is also required to store a new d-dimensional global model parameter to keep the communication bits in dimension k. Combining these two methods directly requires the worker to store two additional global model parameters at each iteration, thereby increasing the storage pressure on the worker. Therefore, the challenge of integrating these two methods is to maintain the communication cost in dimension k while minimizing the storage pressure on the worker.

This paper aims to tackle the communication-efficient FL problem with our proposed GSASG algorithm, which leverages the advantages of local sparse communication, global sparse communication, and adaptive aggregated gradients. Also, the parameter sparsification and recovery algorithms are used to reduce the communication and storage pressure of workers.

TABLE I
List of key notations.

Notation | Description
M        | The number of workers; the corresponding index m starts at 1.
D_m      | The private dataset of worker m.
ξ_m^t    | The independent random sample selected from D_m at the t-th iteration.
M^t      | The set of workers that communicate with the server at the t-th iteration.
M_c^t    | The set of workers that do not communicate with the server at the t-th iteration.
ω_m^t    | The local model parameter of worker m at the t-th iteration.
ω^t      | The global model parameter at the t-th iteration.
g_m^t    | The local gradient at the t-th iteration.
τ_m^t    | The delay count of worker m at the t-th iteration.
d        | The dimension of the gradient or model parameter; the corresponding index i starts at 1.
k        | The dimension of the sparse gradient.
k′       | The dimension of the sparse aggregated gradients.

III. PRELIMINARIES

A. Symbol Definitions

We use [N] to represent the set {1, . . . , N}, where N is a positive integer. Vectors are denoted by bold lowercase letters, such as x. The j-th element of x is represented as x[j], with the index j starting at 1. The set of integers is denoted as Z. For a vector x, ‖x‖_2 denotes its Euclidean norm. The cardinality of a finite set χ is denoted as |χ|, and for a vector x, its cardinality is |x|. The absolute value of a number R is represented as |R|. Other key notations used in this paper are summarized in Table I.

B. Learning Problem

In the context of FL, the learning task can be outlined as follows:

    min_{ω∈R^d} L = (1/M) Σ_{m∈M} [f_m(ω, ξ_m)],    (1)

where M = [M] is the set of workers and ω is a parameter vector with dimension d. The function f_m represents a smooth loss function. The mini-batch SGD method can be iteratively applied to solve Equation (1) by updating the parameter ω^{t+1} step by step using randomly sampled mini-batches ξ_m^t, finally obtaining the minimum value of the loss function L and the model parameter ω^T, where T is the total number of iterations and the corresponding index t starts at 1. The specific updating step is as follows:

    ω^{t+1} = ω^t − γ · (1/M) Σ_{m∈M} ∇f_m(ω^t, ξ_m^t),    (2)

where γ represents the learning rate. The FL implementation involves several processes: during the t-th iteration, the server broadcasts ω^t to worker m; worker m computes the gradient ∇f_m(ω^t, ξ_m^t) locally using ξ_m^t and sends it to the server; the server aggregates the gradients Σ_{m∈M} ∇f_m(ω^t, ξ_m^t) to update the parameter by Equation (2).

C. Previous Algorithms

SASG [7] utilizes top-k sparse communication, with the corresponding sparsification operator denoted as T_k(·). The definition of top-k is as follows.

Definition 1: For a parameter k ∈ [d] and a vector x ∈ R^d, the sparsification operator T_k(·) : R^d → R^k is defined as

    (T_k(x))_{π[i]} = π[i], if i ≤ k;  0, otherwise,    (3)

where π ∈ R^d is a permuted version of x ordered so that |π[k]| ≥ |π[k+1]|.
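To make Definition 1 concrete, the following is a minimal NumPy sketch of the top-k operator T_k(·). The function name, the dense-output representation, and the example values are our own illustrative choices, not code or data from the paper.

import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """T_k(.): keep the k largest-magnitude entries of x, zero out the rest.

    Definition 1 maps R^d to R^k; here the result is kept as a dense d-vector
    with d - k zeros (as Figure 2 displays it). A real implementation would
    transmit only the k (index, value) pairs.
    """
    kept = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x[i]|
    out = np.zeros_like(x)
    out[kept] = x[kept]
    return out

# Illustrative example (made-up values): the three largest-magnitude entries survive.
x = np.array([0.1, -0.5, 0.2, 0.7, 0.05, -0.8, 0.3])
print(top_k(x, 3))   # -> [ 0.  -0.5  0.   0.7  0.  -0.8  0. ]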
Furthermore, SASG employs an adaptive aggregated gradient method to avoid unnecessary communication. During model training, if the gradient error between consecutive uploads by a worker is negligible, redundant uploads are skipped, and the server updates the model by reusing previously stored gradients. In the LAG method [32], a threshold value ∇_m^t is defined to evaluate the gradient error as follows:

    ∇_m^t = ∇f_m(ω^t, ξ_m^t) − ∇f_m(ω^{t−τ_m^t}, ξ^{t−τ_m^t}),
    ‖∇_m^t‖^2 ≤ (1/M) Σ_{i=1}^{d} α_i ‖ω^{t+1−i} − ω^{t−i}‖^2,    (4)

where τ_m^t represents the delay count of worker m, and α_i ≥ 0 for i ∈ [d] are constant weights. The approach used by SASG for adaptive aggregated gradients resembles LASG [18], which can be seen as a direct and straightforward extension of LAG:

    ∇_m^t = ∇f_m(ω^t, ξ_m^t) − ∇f_m(ω^{t−τ_m^t}, ξ_m^t),
    ‖∇_m^t‖^2 ≤ (1/M) Σ_{i=1}^{d} α_i ‖ω^{t+1−i} − ω^{t−i}‖_2^2,    (5)
    ‖∇_m^t‖ ≤ L_m ‖ω^t − ω^{t−τ_m^t}‖,

where L_m represents the Lipschitz constant. This condition is evaluated on the same sample ξ_m^t at two different iterations, t and t − τ_m^t. As shown in Equation (6), SASG divides the workers into M^t and M_c^t:

    ω^{t+1} = ω^t − (1/M) [ Σ_{m∈M^t} T_k(g_m^t) + Σ_{m∈M_c^t} T_k(g_m^{t−τ_m^t}) ].    (6)

Here, we introduce a global top-k (gtop-k) sparsification operator denoted as GT_k(·), which aims to achieve better practical results for maintaining model convergence [14], [33]. The GT_k(·) operator retains the k components of the aggregated gradients with the highest absolute values. The process of combining the top-k and gtop-k is defined for multiple vectors as follows.

Definition 2: For parameters k, k̃ ∈ [d] with k ≤ k̃, vectors x_m ∈ R^d, m ∈ M, a sparsification operator T_k(·) : R^d → R^k, and a sparsification operator GT_k(·) : R^{k̃} → R^k, the process of combining T_k(·) and GT_k(·) is defined as

    y = Σ_{m=1}^{M} (T_k(x_m))_{π_m[i]},    (GT_k(y))_{σ[i]} = σ[i], if i ≤ k;  0, otherwise,    (7)

where σ ∈ R^{k̃} is a permuted version of y with |σ[k]| ≥ |σ[k+1]|.

Intuitively, if the server broadcasts the sparse global aggregated gradients, each worker will store an additional d-dimensional global model parameter ω^t to compute the model update by Equation (2). Note that the global model parameter ω^{t+1} broadcast by the server would still be d-dimensional after performing the sparsification operation on the aggregated gradients, as observed in works such as [30], [34].

IV. DESIGN

In this section, we first present the system model of GSASG. Then, we provide a global top-k′ sparsification operator, which is an improved version of the global top-k. Finally, we describe the GSASG algorithm based on the global top-k′ (gtop-k′). GSASG continues to follow the setting of the adaptive aggregated gradient.

A. The System Model of GSASG

The system model of GSASG is depicted in Figure 1, which involves the server and multiple workers as entities.

• The server. The server determines M^t and M_c^t, and aggregates the gradients sent by M^t along with the old gradients of M_c^t, which are stored locally. The server then updates the global model parameter ω^t and broadcasts the changed component of the global parameter, ω^t_changed (k′-dimensional), to all workers using gtop-k′. If a worker in M_c^t remains lazy up to the set maximum threshold, the server abandons it, as it does not contribute to the model convergence.
• Workers. The workers belonging to M^t compute ∇f_m(ω^t, ξ_m^t) and upload the k-dimensional sparse gradients obtained from ∇f_m(ω^t, ξ_m^t) using the top-k.

Fig. 1. The system model of GSASG, where the blue entity belongs to M^t, which communicates with the server in the t-th iteration; the green entity belongs to M_c^t, which does not communicate with the server in the t-th iteration; and the red entity indicates a worker that no longer communicates with the server.

Fig. 2. An example of the GT_{k′}(·) sparsification operator on 4 workers, where k = 3, k′ = 2.
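Before turning to the GSASG algorithm itself, the adaptive skipping rule of Equations (4)–(6), which GSASG inherits from LASG/SASG and which the system model above uses to split the workers into M^t and M_c^t, can be sketched as follows. This is a minimal sketch under our own assumptions (flat-vector model, illustrative function and argument names), not the authors' implementation.

import numpy as np

def should_skip_upload(grad_new, grad_old, recent_params, alphas, M):
    """LASG-style condition of Equation (5).

    grad_new      : grad f_m(w^t, xi_m^t)
    grad_old      : grad f_m(w^{t - tau_m^t}, xi_m^t), evaluated on the same sample
    recent_params : [w^t, w^{t-1}, ..., w^{t-D}], the D+1 most recent global models
    alphas        : non-negative constant weights alpha_1, ..., alpha_D
    M             : number of workers

    Returns True when the change in worker m's gradient is small compared to the
    recent movement of the global model, i.e. the upload can be skipped and the
    worker joins M_c^t for this iteration.
    """
    lhs = np.linalg.norm(grad_new - grad_old) ** 2
    rhs = sum(a * np.linalg.norm(recent_params[i] - recent_params[i + 1]) ** 2
              for i, a in enumerate(alphas)) / M
    return lhs <= rhs

If the condition fails, worker m belongs to M^t and uploads its top-k sparsified gradient T_k(g_m^t); otherwise it uploads nothing and the server reuses the stored gradient, as in Equation (6).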
B. GSASG Algorithm

1) The Global Top-k′: The gtop-k′ sparsification operator is denoted as GT_{k′}(·). It preserves the k′ components of the aggregated gradient vector with the largest absolute values. The process of combining the top-k and gtop-k′ is defined for multiple vectors as follows.

Definition 3: For parameters k, k̃ ∈ [d], k′ ∈ [d′], with k′ ≤ k ≤ k̃ and d′ ≤ d, vectors x_m ∈ R^d, m ∈ M, a sparsification operator T_k(·) : R^d → R^k, and a sparsification operator GT_{k′}(·) : R^{k̃} → R^{k′}, the process of combining T_k(·) and GT_{k′}(·) is defined as

    y = Σ_{m=1}^{M} (T_k(x_m))_{π_m[i]},    (GT_{k′}(y))_{σ[i]} = σ[i], if i ≤ k′;  0, otherwise,    (8)

where σ ∈ R^{k̃} is a permuted version of y with |σ[k′]| ≥ |σ[k′+1]|.

According to Definition 3, the function GT_{k′}(·) aggregates all the sparse gradients and further compresses the aggregated result. To illustrate this, we provide an example of GT_{k′}(·) applied to four workers, as depicted in Figure 2. In this example, worker 1 sends T_k(g_1) = (0, 0.5, 0, 0.7, 0, 0.8, 0) to the server, worker 2 sends T_k(g_2) = (0, 0, 0.4, 0.3, 0.3, 0, 0), worker 3 sends T_k(g_3) = (0, 0.4, 0.6, 0.2, 0, 0, 0), and worker 4 sends T_k(g_4) = (0, 0.7, 0, 0.3, 0.4, 0, 0). The server computes the aggregated gradients Σ T_3(g_m) = (0, 1.6, 1.0, 1.5, 0.7, 0.8, 0) and then further computes the sparse aggregated gradients GT_2(Σ T_3(g_m)) = (0, 1.6, 0, 1.5, 0, 0, 0). It is worth noting that the efficiency of GT_{k′}(·) may decrease as the number of workers increases. To address this concern, we can utilize a tree structure to ensure efficient computation of GT_{k′}(·) on the server side. The tree structure can help manage and optimize the aggregation process even with a larger number of workers.

2) Algorithm Description: To overcome the convergence issue caused by the biased nature of gtop-k′, which is consistent with top-k, we continue to employ the error feedback technique in GSASG. By accumulating the "discarded" gradient information, the accumulated result contributes to model convergence in future training iterations.

Specifically, in each iteration of GSASG, we perform T_k(g_m) on the uploaded gradient and GT_{k′}(Σ_{m=1}^{M} T_k(g_m)) on the aggregated gradients. Then we compute the compression errors e_1 = g − T_k(g) and e_2 = Σ_{m=1}^{M} T_k(g_m) − GT_{k′}(Σ_{m=1}^{M} T_k(g_m)). In brief, GSASG's iterative strategy can be described as follows:

    ω^{t+1} = ω^t − (1/M) [ GT_{k′}( Σ_{m∈M^t} T_k(g_m^t) ) + GT_{k′}( Σ_{m∈M_c^t} T_k(g_m^{t−τ_m^t}) ) ],    (9)

where

    g_m^t = γ ∇f_m(ω^t, ξ_m^t) + (e_1)_m^t,    (10)

    (e_1)_m^{t+1} = g_m^t − T_k(g_m^t),    (11)

    Σ_{m∈M^t} T_k(g_m^t) = Σ_{m∈M^t} [ T_k( γ ∇f_m(ω^t, ξ_m^t) + (e_1)_m^t ) + (e_2)_m^t ],    (12)

    (e_2)_m^{t+1} = Σ_{m∈M^t} T_k(g_m^t) − GT_{k′}( Σ_{m∈M^t} T_k(g_m^t) ),    (13)

    Σ_{m∈M_c^t} T_k(g_m^t) = Σ_{m∈M_c^t} [ T_k( γ ∇f_m(ω^t, ξ_m^t) + (e_1)_m^t ) + (e_2^c)_m^t ],    (14)

    (e_2^c)_m^{t+1} = Σ_{m∈M_c^t} T_k(g_m^t) − GT_{k′}( Σ_{m∈M_c^t} T_k(g_m^t) ).    (15)

While performing the sparsification operation on the aggregated gradients, the global model parameter ω broadcast by the server remains d-dimensional after being updated by Equation (9). This is consistent with previous works, such as those based on the gtop-k [14], [33]. It is possible to extend this approach to a scenario where the server only broadcasts the non-zero elements of the aggregated gradients Σ T_k(g_m^t). However, this would require the worker to store an additional d-dimensional global model parameter for the update.

Fig. 3. The sparsification and recovery processes of parameter ω^t.

We observe that GT_{k′}(·) sets the insignificant components of the aggregated gradients to 0, which means that the corresponding components of the global model parameter ω^t remain unchanged compared to ω^{t−1}. Therefore, when the server needs to broadcast information, it only needs to broadcast the changed component ω^t_changed of the global model parameter ω^t, where |ω^t_changed| = k′. Worker m needs to restore ω^t based on previously stored data, such as ω^{t−τ} and ω^{t−1}_changed, and the received ω^t_changed, where ω^{t−τ} is used to compute the adaptive gradients. Figure 3 illustrates the sparsification and recovery process of parameter ω^t. The steps for parameter sparsification and recovery are summarized in Algorithm 1 and Algorithm 2, respectively.
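The global top-k′ step of Definition 3 and the server-side error feedback of Equation (13) can be sketched as below; the code reproduces the four-worker example of Figure 2 (k = 3, k′ = 2), and the function and variable names are illustrative assumptions rather than the authors' code.

import numpy as np

def global_top_k(aggregated: np.ndarray, k_prime: int) -> np.ndarray:
    """GT_{k'}(.): keep the k' largest-magnitude entries of the aggregated vector."""
    kept = np.argpartition(np.abs(aggregated), -k_prime)[-k_prime:]
    out = np.zeros_like(aggregated)
    out[kept] = aggregated[kept]
    return out

# Sparse gradients already produced by top-3 on each worker (values from Figure 2).
tk_g = [np.array([0.0, 0.5, 0.0, 0.7, 0.0, 0.8, 0.0]),
        np.array([0.0, 0.0, 0.4, 0.3, 0.3, 0.0, 0.0]),
        np.array([0.0, 0.4, 0.6, 0.2, 0.0, 0.0, 0.0]),
        np.array([0.0, 0.7, 0.0, 0.3, 0.4, 0.0, 0.0])]

aggregated = sum(tk_g)                    # approx. (0, 1.6, 1.0, 1.5, 0.7, 0.8, 0)
sparse_agg = global_top_k(aggregated, 2)  # approx. (0, 1.6, 0, 1.5, 0, 0, 0)

# Server-side error feedback, Equation (13): the components discarded by GT_{k'}
# are accumulated and re-injected into the aggregate at a later iteration.
e2 = aggregated - sparse_agg              # approx. (0, 0, 1.0, 0, 0.7, 0.8, 0)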
Similar to LAG [32], GSASG also partitions the workers into M^t and M_c^t. The delay τ_m^t counts the number of times that worker m has failed to submit its local gradient in succession. At the t-th iteration, a worker m ∈ M^t uploads the local gradient T_k(g_m^t) and the server sets τ_m^t = 1; otherwise, worker m does not upload anything, the server reuses the old locally stored gradient of worker m, and the counter is increased by τ_m^{t+1} = τ_m^t + 1. Unlike LAG, if τ_m^t for a worker m ∈ M_c^t exceeds a certain threshold D, the server will abandon this worker.

Algorithm 1: Parameter Sparsification Algorithm
Input: The global parameter ω^{t−1}, and the global aggregated gradients, which include GT_{k′}(Σ_{m∈M^t} T_k(g_m^{t−1})) and GT_{k′}(Σ_{m∈M_c^t} T_k(g_m^{t−1−τ_m^t})).
Output: The changed components ω^t_changed.
1: ω^t = ω^{t−1} − (1/M) [ GT_{k′}(Σ_{m∈M^t} T_k(g_m^{t−1})) + GT_{k′}(Σ_{m∈M_c^t} T_k(g_m^{t−1−τ_m^t})) ].
2: for i = 1 → d do
3:   if ω^{t−1}[i] == ω^t[i] then
4:     Set ω^t_changed[i] = 0.
5:   else
6:     Set ω^t_changed[i] = ω^t[i].
7:   end if
8: end for

Algorithm 2: Parameter Recovery Algorithm
Input: The global parameter ω^{t−τ} and the changed components ω^{t−1}_changed, ω^t_changed.
Output: The global parameter ω^t.
1: for i = 1 → d do
2:   if ω^t_changed[i] ≠ 0 then
3:     Set ω^t[i] = ω^t_changed[i].
4:   else if ω^{t−1}_changed[i] ≠ 0 then
5:     Set ω^t[i] = ω^{t−1}_changed[i].
6:   else if ω^{t−τ}[i] ≠ 0 then
7:     Set ω^t[i] = ω^{t−τ}[i].
8:   end if
9: end for

Algorithm 3: GSASG Algorithm
Initialize: Compression errors e, delay counters τ_m^1 = 1, m ∈ M.
Input: Learning rate γ > 0, maximum delay D, and constant weights {α_d}_{d=1}^{D}.
Output: The global parameter ω^T.
1: for t = 1 → T do
2:   The server rarefies the global model parameter ω^t to obtain ω^t_changed by Algorithm 1.
3:   The server broadcasts the changed component ω^t_changed to all workers.
4:   for worker m = 1 → M do
5:     Restore the global model parameter ω^t according to Algorithm 2.
6:     Compute the gradients ∇f_m(ω^t, ξ_m^t) and ∇f_m(ω^{t−τ_m^t}, ξ_m^t).
7:     Set g_m^t = γ ∇f_m(ω^t, ξ_m^t) + e_m^t.
8:     Divide the workers into M^t and M_c^t according to Equation (5).
9:     if worker m ∈ M^t then
10:      Worker m uploads T_k(g_m^t) to the server.
11:      e_m^{t+1} = g_m^t − T_k(g_m^t), τ_m^{t+1} = 1.
12:    else
13:      Worker m uploads nothing.
14:      e_m^{t+1} = e_m^t, τ_m^{t+1} = τ_m^t + 1.
15:    end if
16:  end for
17:  for worker m = 1 → |M_c^t|, m ∈ M_c^t do
18:    if τ_m^t > D then
19:      Discard worker m, M = M − 1.
20:    end if
21:  end for
22:  Set Σ_{m∈M^t} T_k(g_m^t) = Σ_{m∈M^t} [ T_k(γ ∇f_m(ω^t, ξ_m^t) + (e_1)_m^t) + (e_2)_m^t ].
23:  Set Σ_{m∈M_c^t} T_k(g_m^t) = Σ_{m∈M_c^t} [ T_k(γ ∇f_m(ω^t, ξ_m^t) + (e_1)_m^t) + (e_2^c)_m^t ].
24:  The server updates the parameter ω according to Equation (9).
25:  (e_2)_m^{t+1} = Σ_{m∈M^t} T_k(g_m^t) − GT_{k′}(Σ_{m∈M^t} T_k(g_m^t)).
26:  (e_2^c)_m^{t+1} = Σ_{m∈M_c^t} T_k(g_m^t) − GT_{k′}(Σ_{m∈M_c^t} T_k(g_m^t)).
27: end for

Fig. 4. The process of GSASG.

The process of GSASG is illustrated in Figure 4 and summarized in Algorithm 3. For t ∈ [T], the GSASG algorithm comprises the following steps:

• According to Algorithm 1, the server broadcasts the changed component ω^t_changed of the global model parameter ω^t to all workers, where |ω^t_changed| = k′.
• According to Algorithm 2, each worker restores ω^t based on previously stored data and the received ω^t_changed. The workers compute the two local gradients ∇f_m(ω^t, ξ_m^t) and ∇f_m(ω^{t−τ_m^t}, ξ_m^t) to determine whether they belong to M^t or M_c^t according to Equation (5).
• Workers in M^t rarefy their local gradients g_m^t by applying T_k(·) and upload the result to the server.
• The server checks whether τ_m^t of each worker in M_c^t exceeds the maximum threshold D and decides whether to discard that worker.
• The server sparsifies the aggregated gradients of the selected workers in M^t and M_c^t. Then it aggregates the sparsified gradients, updates the global model parameter ω using Equation (9), and proceeds to the next iteration.
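For readability, a minimal NumPy sketch of the parameter sparsification and recovery steps of Algorithms 1 and 2 is given below. It assumes the model is a flat vector, uses exact component equality to detect "unchanged" entries exactly as the pseudocode does, and uses illustrative function names of our own.

import numpy as np

def sparsify_parameter(w_prev: np.ndarray, w_new: np.ndarray) -> np.ndarray:
    """Algorithm 1 (lines 2-8): keep only the components of w^t that differ from w^{t-1}."""
    # At most k' entries differ, by construction of GT_{k'} in line 1 of Algorithm 1.
    return np.where(w_prev == w_new, 0.0, w_new)

def recover_parameter(w_old: np.ndarray,            # w^{t-tau}, stored by the worker
                      w_changed_prev: np.ndarray,   # w^{t-1}_changed, stored by the worker
                      w_changed: np.ndarray         # w^t_changed, just received
                      ) -> np.ndarray:
    """Algorithm 2: rebuild w^t from the most recent non-zero information per entry.

    Applying the older update first and the newest last reproduces the if/elif
    precedence of the pseudocode (w^t_changed, then w^{t-1}_changed, then w^{t-tau}).
    """
    w = w_old.copy()
    mask_prev = w_changed_prev != 0
    w[mask_prev] = w_changed_prev[mask_prev]
    mask_now = w_changed != 0
    w[mask_now] = w_changed[mask_now]
    return w

In a real deployment, only the non-zero entries of ω^t_changed (the k′ values counted as 32k′ bits in Table II, plus their indices) would actually be transmitted; the dense vectors here are only for readability.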
TABLE II
The number of communication rounds per iteration (disregarding parallelism), communication bits per upload and download, and total communication cost over T iterations when training a d-dimensional parameter model with M workers. k and k′ are the sparsification levels of the top-k and gtop-k′ operators (Definitions 1 and 3), which satisfy k′ ≤ k < d; τ_m^t denotes the delay count, D the maximum delay, M^t the workers that communicate with the server, and M_c^t the workers that do not.

Algorithm   | Upload rounds | Upload bits | Upload total bits    | Download rounds               | Download bits | Download total bits
SGD [22]    | M             | 32d         | 32dMT                | M                             | 32d           | 32dMT
Sparse [23] | M             | 32k         | 32kMT                | M                             | 32d           | 32dMT
LASG [24]   | |M^t|         | 32d         | 32d Σ_{t=1}^T |M^t|  | M                             | 32d           | 32dMT
SASG [7]    | |M^t|         | 32k         | 32k Σ_{t=1}^T |M^t|  | M                             | 32d           | 32dMT
GSASG       | |M^t|         | 32k         | 32k Σ_{t=1}^T |M^t|  | |M^t| + |M_c^t| (τ_m^t < D)   | 32k′          | 32k′ Σ_{t=1}^T (|M^t| + |M_c^t|)

TABLE III
The computation cost of one worker per iteration, the total computation cost over T iterations when training the model with M workers and one server, and the extra storage cost of one worker (server) per iteration and over T iterations when training a d-dimensional parameter model. Ctr denotes the local training cost of one worker per iteration, Cs the training cost of the server per iteration, Cα the cost of computing the adaptive gradients, Cβ the cost of computing the top-k, and Cγ the cost of computing the gtop-k′.

Algorithm   | Computation    | Total computation              | Worker storage | Worker total storage                    | Server storage | Server total storage
SGD [22]    | Ctr            | Ctr·MT + Cs·T                  | -              | -                                       | -              | -
Sparse [23] | Ctr + Cβ       | (Ctr + Cβ)MT + Cs·T            | -              | -                                       | -              | -
LASG [24]   | Ctr + Cα       | (Ctr + Cα)MT + Cs·T            | 896d           | 896dMT                                  | 896d|M_c^t|    | 896 Σ_{t=1}^T d|M_c^t|
SASG [7]    | Ctr + Cα + Cβ  | (Ctr + Cα + Cβ)MT + Cs·T       | 896d           | 896dMT                                  | 896k|M_c^t|    | 896 Σ_{t=1}^T k|M_c^t|
GSASG       | Ctr + Cα + Cβ  | (Ctr + Cα + Cβ)MT + (Cγ + Cs)T | 896(d + k′)    | 896(d + k′) Σ_{t=1}^T (|M^t| + |M_c^t|) | 896k|M_c^t|    | 896 Σ_{t=1}^T k|M_c^t|

Fig. 5. Experimental results of training loss versus communication rounds. The results show that GSASG achieves faster and better convergence with fewer communication rounds. (a) FC (MNIST). (b) FC (Fashion-MNIST). (c) ResNet18 (CIFAR-100).

V. COMPLEXITY ANALYSIS

In this section, we first provide a complexity analysis of our proposed GSASG algorithm. We then conduct a comparative analysis of several existing algorithms, including distributed SGD [22], Sparse [23], and the adaptive aggregated gradient algorithms SASG [7] and LASG [24]. We compare these algorithms with GSASG in terms of their communication, computation, and storage costs. In our analysis, we consider the number of communication rounds and the bits of information exchanged between a worker and the server as measures of communication cost. We assume that each parameter is represented using 32 bits. Additionally, we use the execution time in seconds as a measure of computation cost. Finally, the storage cost is measured by the extra parameters stored locally, apart from the global model parameter. We assume that each parameter occupies 896 bytes of memory.

A. Complexity Analysis

Communication cost. In Algorithm 3, during the upload phase, the workers in M^t interact with the server and transfer 32k bits of information at the t-th iteration. Therefore, the total communication bits in the upload phase over T iterations is 32k Σ_{t=1}^T |M^t| when training the d-dimensional parameter model with the top-k. In the download phase, the workers in M^t and M_c^t interact with the server and transfer 32k′ bits of information per iteration. The server abandons workers whose delay counters τ_m^t do not meet the condition τ_m^t < D. The total communication bits in the download phase over T iterations is 32k′ Σ_{t=1}^T (|M^t| + |M_c^t|) when training the d-dimensional parameter model with the gtop-k′.
Fig. 6. Experimental results of test accuracy versus communication rounds. The results show that GSASG converges faster on all three datasets while model accuracy is guaranteed. (a) FC (MNIST). (b) FC (Fashion-MNIST). (c) ResNet18 (CIFAR-100).

Computation cost. We use Ctr, Cs, Cα, Cβ and Cγ to represent the local model training cost of one worker per iteration, the training cost of the server per iteration, the cost of computing the adaptive gradients, the cost of computing the top-k, and the cost of computing the gtop-k′, respectively. Worker m needs to compute the adaptive gradients and the top-k in addition to the local model training per iteration, which costs Ctr + Cα + Cβ. The server needs to compute the gtop-k′ and aggregate the local gradients from the M workers, which costs (Cγ + Cs)T over T iterations. Therefore, the total computation cost over T iterations is (Ctr + Cα + Cβ)MT + (Cγ + Cs)T when training the model.

Storage cost. Worker m needs to store the old parameter ω^{t−τ} (d-dimensional) to compute the adaptive gradients per iteration, which requires 896d memory. Worker m also needs to store ω^{t−1}_changed (k′-dimensional) to restore the global parameter per iteration, which requires 896k′ memory. For M workers, the total storage cost over T iterations is 896(d + k′)MT when training the model. If some workers are abandoned by the server during the model training process, the total storage cost becomes 896(d + k′) Σ_{t=1}^T (|M^t| + |M_c^t|). If all workers meet τ_m^t ≤ D, the server will not abandon any workers, and the total storage cost remains 896(d + k′)MT. Additionally, the server needs to store the sparse gradients (k-dimensional) of the workers in the set M_c^t per iteration, requiring 896k|M_c^t| memory. The server's total storage cost over T iterations is 896 Σ_{t=1}^T k|M_c^t| when training the model with M workers.

B. Complexity Comparison

As presented in Table II, we compare the complexities of GSASG with those of other algorithms, namely SGD [22], Sparse [23], LASG [24] and SASG [7].

In terms of communication cost, SGD requires each worker m to upload the gradient (a d-dimensional vector) to the server, and the server downloads the parameter (a d-dimensional vector) to all workers. In Sparse, each worker m uploads the sparse gradients (k-dimensional) to the server, and the server downloads the parameter (d-dimensional) to all workers. In LASG, the selected workers upload the gradients (d-dimensional), and the server downloads the parameter (d-dimensional) to all workers. In the SASG algorithm, the selected workers upload the sparse gradients (k-dimensional), and the server downloads the parameter (d-dimensional) to all workers. In GSASG, the selected workers upload the sparse gradients (k-dimensional), and the server downloads the sparse parameter (k′-dimensional) to all workers whose delay counters satisfy τ_m^t ≤ D. As a result, our approach significantly reduces the communication cost, primarily reflected in the number of communication bits in the download phase due to the use of the gtop-k′ operation.

In terms of computation cost, SGD incurs only the cost of model training. In contrast, Sparse requires additional computation to compute the top-k values, LASG needs to compute adaptive gradients, and SASG requires computing both adaptive gradients and the top-k values. Consequently, GSASG theoretically has the highest computation cost among these algorithms.

In terms of storage cost, SGD and Sparse do not require computing adaptive gradients, resulting in no additional storage requirement. LASG requires workers to store the old parameter ω^{t−τ} (d-dimensional) to compute adaptive gradients. In SASG, workers need to store the old parameter ω^{t−τ} (d-dimensional) to compute adaptive gradients, while the server needs to store the sparse gradients (k-dimensional) from M_c. Thus, for the server's storage cost, SASG outperforms LASG due to the introduction of the top-k operation, while the storage cost for workers in SASG remains the same as in LASG. Compared with SASG, GSASG further reduces the overall storage cost for workers while keeping the total storage cost for the server consistent with SASG. The reason for this additional reduction in GSASG's storage cost is that the server abandons some workers during model training.
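As a small numerical illustration of the totals in Table II, the sketch below evaluates the upload and download bit-count formulas for entirely made-up values of d, M, T, k, k′ and per-round participation. It only shows how the formulas compare; it does not reproduce the paper's measurements.

# All concrete numbers below are invented for illustration.
d, M, T = 500_000, 4, 10_000                 # parameter dimension, workers, iterations
k, k_prime = int(0.01 * d), int(0.005 * d)   # top-k and gtop-k' sparsification levels
participating = [3] * T                      # assumed |M^t| per iteration
lazy = [1] * T                               # assumed |M_c^t| per iteration

sgd_bits    = 32 * d * M * T + 32 * d * M * T                 # upload + download (Table II)
sparse_bits = 32 * k * M * T + 32 * d * M * T
sasg_bits   = 32 * k * sum(participating) + 32 * d * M * T
gsasg_bits  = (32 * k * sum(participating)
               + 32 * k_prime * sum(a + b for a, b in zip(participating, lazy)))

for name, bits in [("SGD", sgd_bits), ("Sparse", sparse_bits),
                   ("SASG", sasg_bits), ("GSASG", gsasg_bits)]:
    print(f"{name:>6}: {bits:.3e} total bits")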
TABLE IV
The number of communication rounds and bits required for the five algorithms to reach the same accuracy baseline, where the accuracy baselines for MNIST, Fashion-MNIST and CIFAR-100 are 97%, 86%, and 92%, respectively.

Model (Dataset)      | Algorithm | Skipped rounds | Upload rounds | Upload bits | Download rounds | Download bits
FC (MNIST)           | SGD       | -     | 23200  | 6.04E+11 | 23200  | 6.04E+11
                     | Sparse    | -     | 22800  | 3.02E+11 | 22800  | 3.02E+11
                     | LASG      | 2096  | 20304  | 2.64E+11 | 20304  | 2.64E+11
                     | SASG      | 1637  | 4428   | 5.76E+08 | 4428   | 5.77E+10
                     | GSASG     | 8941  | 1859   | 2.42E+08 | 1859   | 1.21E+08
FC (Fashion-MNIST)   | SGD       | -     | 11600  | 1.51E+11 | 11600  | 1.51E+11
                     | Sparse    | -     | 12800  | 1.66E+09 | 12800  | 1.66E+11
                     | LASG      | 0     | 11200  | 1.45E+11 | 11200  | 1.45E+11
                     | SASG      | 1772  | 11428  | 1.48E+09 | 11428  | 1.48E+11
                     | GSASG     | 1653  | 5147   | 6.70E+08 | 5147   | 3.35E+08
ResNet18 (CIFAR-100) | SGD       | -     | 151000 | 5.42E+13 | 151000 | 5.42E+13
                     | Sparse    | -     | 154000 | 5.52E+11 | 154000 | 5.52E+13
                     | LASG      | 6744  | 144256 | 5.17E+13 | 144256 | 1.87E+12
                     | SASG      | 19847 | 141153 | 5.06E+11 | 141153 | 1.48E+11
                     | GSASG     | 43336 | 120664 | 4.33E+11 | 120664 | 3.35E+08

Fig. 7. Experimental results of model performance versus epochs and bits. The results show that GSASG has the fastest convergence speed (FC & MNIST). (a) Training loss versus epochs. (b) Test accuracy versus epochs.

TABLE V
Extra average computation time for the five algorithms (recorded every 40 training iterations) and extra storage required by the five algorithms (per iteration).

Algorithm | Extra computation time | Extra worker memory | Extra server memory
SGD       | -     | -         | -
Sparse    | -     | -         | -
LASG      | 0.44s | 347.82 MB | 698.64 MB
SASG      | 0.44s | 347.82 MB | 6.95 MB
GSASG     | 0.44s | 348.73 MB | 6.95 MB

VI. EXPERIMENT RESULTS

For comparison with SGD [22], Sparse [23], LASG [24] and SASG [7], we conduct extensive experiments to showcase the convergence properties and communication efficiency of GSASG.

A. Experimental Environment

We implemented our FL framework using PyTorch as the underlying framework, and we utilized Python 3.9 as the programming language. To accelerate the inference and training processes of the deep neural networks within the FL framework, we leverage the computing power of an NVIDIA GeForce RTX 3060 GPU.

B. Parameter Selection and Dataset Description

Sparse, SASG and GSASG all use the top-k. We adopt top-1%d (i.e., k = 0.01d), as suggested in [25]. Additionally, we set the global sparsification level to k′ = 0.01d and k′ = 0.005d. Our experiments were conducted using M workers, partitioning the given datasets MNIST, Fashion-MNIST, and CIFAR-100. We train the models using both a two-layer fully connected (FC) neural network and ResNet18.

MNIST. The MNIST dataset contains 70000 images categorized into 10 classes. To perform 10-category classification on the MNIST dataset, we utilize a two-layer fully connected (FC) neural network model with 512 neurons in the second layer. For all algorithms (Table II), we train 20 epochs with the learning rate γ = 0.005, a batch size of ξ = 50, and M = 4. For the adaptive gradient methods, namely GSASG, SASG and LASG, we set D = 10 and α_d = 1/2γ for d = 1, 2, ..., 10.

Fashion-MNIST. The Fashion-MNIST dataset contains 70000 images divided into 10 classes. To perform 10-category classification on the Fashion-MNIST dataset, we utilize a two-layer FC neural network model with 512 neurons in the second layer. For all algorithms (Table II), we train 20 epochs with the learning rate γ = 0.005, a batch size of ξ = 50, and M = 4. For the adaptive gradient methods, namely GSASG, SASG and LASG, we set D = 10 and α_d = 1/2γ for d = 1, 2, ..., 10.

CIFAR-100. The CIFAR-100 dataset contains 60000 images divided into 100 classes. To perform 100-category classification on the CIFAR-100 dataset, we utilize a ResNet18 neural network model. For all algorithms (Table II), we train 30 epochs with the learning rate γ = 0.01, a batch size of ξ = 10, and M = 10, where the learning rate decays to 0.1 at epoch 50. For the adaptive gradient methods, namely GSASG, SASG and LASG, we set D = 10 and α_d = 1/2γ for d = 1, 2, ..., 10.
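For reference, the per-dataset settings listed above can be collected into a small configuration sketch; the dictionary structure and key names are our own, and only the numeric values come from the text (the weights α_d = 1/2γ are read here as 1/(2γ)).

# Hyper-parameter settings reported in Section VI-B (structure and keys are illustrative).
CONFIGS = {
    "MNIST":         dict(model="2-layer FC, 512 hidden", epochs=20, lr=0.005,
                          batch_size=50, workers=4,  max_delay=10),
    "Fashion-MNIST": dict(model="2-layer FC, 512 hidden", epochs=20, lr=0.005,
                          batch_size=50, workers=4,  max_delay=10),
    "CIFAR-100":     dict(model="ResNet18",               epochs=30, lr=0.01,
                          batch_size=10, workers=10, max_delay=10),
}
SPARSITY = {"k": "1% of d", "k_prime": ("1% of d", "0.5% of d")}  # both k' settings are evaluated

def alpha_weights(lr: float, D: int = 10):
    # Adaptive-gradient weights used by GSASG/SASG/LASG, our reading of "alpha_d = 1/2*gamma".
    return [1.0 / (2.0 * lr)] * D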
C. Communication Rounds and Bits

The experimental results of training loss and test accuracy versus communication rounds are presented in Figure 5 and Figure 6, respectively. In Figure 5a (training on MNIST) and Figure 5b (training on Fashion-MNIST), it is evident that GSASG converges more rapidly with fewer communication rounds compared to the other algorithms. Figure 5c (training on CIFAR-100) demonstrates that GSASG significantly reduces the number of communication rounds required to complete training while still ensuring convergence.

In Figure 6, GSASG demonstrates good model accuracy. Especially in Figure 6a (training on MNIST) and Figure 6b (training on Fashion-MNIST), GSASG attains higher test accuracy with the same number of communication rounds, indicating that it selects more informative bits for model training. Additionally, GSASG is able to significantly reduce communication rounds after model convergence is achieved.

Next, we compare the communication efficiency of the different algorithms using specific accuracy baselines, which represent the number of communication rounds and bits required for all five algorithms to achieve the same performance, as shown in Table IV. The accuracy baselines for MNIST, Fashion-MNIST and CIFAR-100 are 97%, 86%, and 92%, respectively. According to Table IV, GSASG is more efficient compared to the other algorithms due to its sparse communication and adaptive aggregated gradients. In terms of communication rounds, GSASG improves by 91% over SGD and Sparse, 90% over LASG, and 58% over SASG on the MNIST dataset; GSASG improves by 55% over SGD, 59% over Sparse, 54% over LASG, and 54% over SASG on the Fashion-MNIST dataset; and GSASG improves by 20% over SGD, 21% over Sparse, 16% over LASG, and 14% over SASG on the CIFAR-100 dataset. In terms of communication bits, GSASG improves by 99% over the previous algorithms on MNIST and Fashion-MNIST; and GSASG improves by 99% over SGD and Sparse, 98% over LASG, and 77% over SASG on the CIFAR-100 dataset.

Figure 7 presents the training loss and test accuracy versus the training epochs. It can be observed that the convergence rates of all algorithms show little difference except for GSASG, which aligns with the theoretical analysis in SASG. As shown in Figure 8, GSASG can significantly reduce the number of communication bits required for training. This result is consistent with the findings presented in Table IV.

Fig. 8. Experimental results of model performance versus bits. The results show that GSASG can reduce the number of communication bits (FC & MNIST). (a) Training loss versus bits between 0 and E11. (b) Training loss versus bits between 0 and E9. (c) Test accuracy versus bits between 0 and E11. (d) Test accuracy versus bits between 0 and E9.

D. Computation Time and Storage

The extra average computation time for the five algorithms (recorded every 40 training iterations) and the extra storage required by the five algorithms (per iteration) are shown in Table V.

According to Table III, GSASG theoretically has the highest computation cost in addition to the basic model training cost. However, according to Table V, the extra computation time consumed by LASG, SASG, and GSASG is roughly the same, indicating that the time consumed by the top-k and gtop-k′ operations is negligible, with the computation time mainly spent on calculating the adaptive gradients. In terms of storage cost, GSASG maintains the advantage of server storage over SASG, and the slight increase in worker storage is negligible.

VII. DISCUSSION

In this section, we discuss the impact of the value of k′ on the performance. In addition, we give a further optimization idea for GSASG.

A. Performance Evaluation

We conducted experiments on the MNIST dataset using the FC model with different values of k′ while keeping k = 1%d and ξ = 50. Figure 9a and Figure 9b depict the relationship between model performance and communication rounds for different k′. We also performed experiments with k = 3%d, ξ = 50, and varying values of k′ using the FC model on the MNIST dataset. Figure 9c and Figure 9d show the relationship between model performance and communication rounds for different k′.

As shown in Figure 9, a smaller value of k′ leads to faster model convergence. However, higher sparsification levels result in reduced testing accuracy. Therefore, while higher sparsification can improve model efficiency, we need to be cautious of its upper limit. In Figure 9, we use the (k = 1%d, k′ = 1%d), (k = 1%d, k′ = 0.5%d), (k = 2%d, k′ = 1%d) and (k = 2%d, k′ = 0.5%d) settings in GSASG to strike a balance between convergence speed and testing accuracy. We also tested k′ = 0.0001%d, and the model fails to converge. Hence, exceeding the sparsification lower limit for k′ can harm model performance.
Fig. 9. Experimental results of model performance versus communication round for different k′. The results show that using the itop-1%d (k = 1%d, k′ = 1%d), itop-0.5%d (k = 1%d, k′ = 0.5%d), itop-1%d (k = 3%d, k′ = 1%d) and itop-0.5%d (k = 3%d, k′ = 0.5%d) sparsification operators in GSASG can realize a trade-off between model performance and communication rounds. (a) Training loss versus communication round for different k′ with k = 1%d, ξ = 50. (b) Testing accuracy versus communication round for different k′ with k = 1%d, ξ = 50. (c) Training loss versus communication round for different k′ with k = 3%d, ξ = 50. (d) Testing accuracy versus communication round for different k′ with k = 3%d, ξ = 50.

In [14], experiments showed that increasing ξ can improve the performance of gtop-k. To further explore the impact of the change in ξ on the performance of gtop-k′, we conduct experiments on the MNIST dataset using the FC model with k = 1%d and k′ = 0.5%d. Different values of ξ, namely 20, 30, 40, and 50, are employed. Figure 10 presents the relationship between model performance and communication rounds for these different ξ values. According to Figure 10, it is evident that larger batch sizes ξ contribute to faster and more effective model convergence while ensuring the accuracy of the model.

Fig. 10. Experimental results of model performance versus communication round for different batch sizes ξ with k = 1%d, k′ = 0.5%d. The results show that ξ = 50 is the best value for model performance. (a) Training loss versus communication round for different ξ. (b) Testing accuracy versus communication round for different ξ.

In summary, when prioritizing efficient communication, we can maximize the difference between k′ and k and increase the batch size. When emphasizing model accuracy, we should minimize the difference between k′ and k. Taking into account both the model convergence and the common choice of k = 1%d, we recommend using k′ = 0.5%d or k′ = 1%d.

B. Algorithm Optimization

GSASG uses mini-batch SGD for updating the local model. However, training with only a small batch of data per iteration can impact the model's accuracy. The optimization goal is to improve the model's accuracy while maintaining training speed. The optimization process of GSASG is as follows: (1) Worker m divides its dataset into batches ξ_m^{st}, s ∈ [1, n], and performs gradient computations on each batch ξ_m^{st} using the global model parameters, obtaining the batch gradients g_m^{st}. (2) Worker m runs the adaptive rule on the stochastic batch gradients g_m^{st} to screen out the batch gradients that meet the threshold condition, and checks whether the number of selected batch gradients is greater than n/2 to determine the sets M^t and M_c^t: if the number of gradients meeting the threshold condition is greater than n/2, worker m belongs to M^t; otherwise, it belongs to M_c^t. (3) Each worker belonging to M^t performs the global top-k on its aggregated batch gradients and uploads the sparse local gradient. On the server side, the server performs the global top-k′ operation on the aggregated sparse local gradients, performs the global model update, and broadcasts the global parameter. This process is referred to as HGSASG. The core of HGSASG is performing the global top-k twice; a sketch of the screening step is given at the end of this subsection.

Next, we compare the performance of HGSASG and GSASG using the same settings and two distinct datasets, MNIST and Fashion-MNIST. We set k = k′ = 1%d. Figure 11 shows the relationship between the training losses and communication rounds for the different datasets. As shown in Figure 11a (training on the MNIST dataset) and Figure 11b (training on the Fashion-MNIST dataset), it is evident that the HGSASG algorithm achieves faster and better convergence with fewer communication rounds. Figure 12 illustrates the relationship between model accuracy and communication rounds for the different datasets.

Fig. 11. Experimental results of training loss versus communication round. The results show that HGSASG achieves faster and better convergence with fewer communication rounds. (a) FC (MNIST). (b) FC (Fashion-MNIST).

Figure 12 demonstrates that HGSASG is more efficient than GSASG. In Figure 12a (training on the MNIST dataset) and Figure 12b (training on the Fashion-MNIST dataset), HGSASG achieves the highest test accuracy using the same number of communication rounds, which means that HGSASG selects more meaningful gradients for communication. Furthermore, HGSASG can further reduce the communication rounds required to complete training and reach model convergence.

Fig. 12. Experimental results of test accuracy versus communication round. The results show that HGSASG converges faster on both datasets while model accuracy is guaranteed. (a) FC (MNIST). (b) FC (Fashion-MNIST).
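A minimal sketch of the batch-screening step (2) of HGSASG is given below. The threshold test reuses the adaptive condition of Equation (5) per batch gradient; because the paper's wording is terse, we read "meeting the threshold condition" as the batch gradient changing enough to be worth uploading, and all names are illustrative assumptions rather than the authors' code.

import numpy as np

def informative(grad_new, grad_old, recent_params, alphas, M):
    """Per-batch version of the adaptive test of Equation (5).

    Returns True when the batch gradient has changed by more than the bound,
    i.e. when (under our reading) it 'meets the threshold condition'.
    """
    lhs = np.linalg.norm(grad_new - grad_old) ** 2
    rhs = sum(a * np.linalg.norm(recent_params[i] - recent_params[i + 1]) ** 2
              for i, a in enumerate(alphas)) / M
    return lhs > rhs

def worker_joins_Mt(batch_grads_new, batch_grads_old, recent_params, alphas, M):
    """HGSASG step (2): worker m joins M^t only if more than half of its n batch
    gradients pass the per-batch test; otherwise it stays in M_c^t."""
    n = len(batch_grads_new)
    passed = sum(informative(gn, go, recent_params, alphas, M)
                 for gn, go in zip(batch_grads_new, batch_grads_old))
    return passed > n / 2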
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

1
2 [8] S. Shi, Q. Wang, and X. Chu, “Performance modeling and evaluation
of distributed deep learning frameworks on gpus,” in 2018 IEEE
3 16th Intl Conf on Dependable, Autonomic and Secure Computing,
4 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl
5 Conf on Big Data Intelligence and Computing and Cyber Science
and Technology Congress. IEEE Computer Society, 2018, pp.
6 949–957. [Online]. Available: https://doi.org/10.1109/DASC/PiCom/
7 DataCom/CyberSciTec.2018.000-4
8 [9] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
9 data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–
10 (a) (b) 1282.
11 Fig. 11. Experimental results of training loss versus communication round. [10] Y. Cheng, L. Shi, J. Shao, L. Bai, and K. Chen, “Emergent
dynamics of asynchronous multiagent systems under signed networks
12 The results show that HGSASG achieves faster and better convergence with
and limited communication resources,” IEEE Trans. Netw. Sci.
fewer communication rounds. (a) FC (MNIST). (b) FC (Fashion-MNIST).
13 Eng., vol. 10, no. 1, pp. 477–488, 2023. [Online]. Available:
14 https://doi.org/10.1109/TNSE.2022.3216000
[11] X. Tong, J. Xu, and S. Huang, “An information-theoretic method
15 for collaborative distributed learning with limited communication,”
16 in IEEE Information Theory Workshop, ITW 2022, Mumbai, India,
17 November 1-9, 2022. IEEE, 2022, pp. 49–54. [Online]. Available:
https://doi.org/10.1109/ITW54588.2022.9965863
18 [12] V. C. Gogineni, S. Werner, Y. Huang, and A. Kuh, “Communication-
19 efficient online federated learning strategies for kernel regression,”
20 IEEE Internet Things J., vol. 10, no. 5, pp. 4531–4544, 2023. [Online].
Available: https://doi.org/10.1109/JIOT.2022.3218484
21 [13] Y. Oh, N. Lee, Y. Jeon, and H. V. Poor, “Communication efficient
22 federated learning via quantized compressed sensing,” IEEE Trans.
23 (a) (b) Wirel. Commun., vol. 22, no. 2, pp. 1087–1100, 2023. [Online].
Available: https://doi.org/10.1109/TWC.2022.3201207
24 Fig. 12. Experimental results of test accuracy versus communication round.
[14] S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang,
The results show that HGSASG converges faster on all two datasets while
25 model accuracy is guaranteed. (a) FC (MNIST). (b) FC (Fashion-MNIST).
and X. Chu, “A distributed synchronous sgd algorithm with
26 global top-k sparsification for low bandwidth networks,” in 2019
IEEE 39th International Conference on Distributed Computing
27 Systems (ICDCS). IEEE, 2019, pp. 2238–2247. [Online]. Available:
28 communication cost while ensuring the accuracy of the model. https://doi.org/10.1109/ICDCS.2019.00220
29 We have also suggested suitable values for k ′ and discussed [15] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman,
J. Gonzalez, and R. Arora, “Fetchsgd: Communication-efficient
30 the optimization direction of GSASG. In future work, we plan federated learning with sketching,” in Proceedings of the 37th
31 to investigate secure gradient transmission methods to enhance International Conference on Machine Learning, ICML 2020, 13-18
32 the privacy and security of the FL framework. July 2020, Virtual Event, ser. Proceedings of Machine Learning
IX. BIOGRAPHY SECTION

Runmeng Du is a PhD candidate at East China Normal University. Her research interests include Cryptography, Information Security, and Federated Learning. From 2017 to 2020, she studied at Shaanxi Normal University to obtain a master's degree. In September 2020, she entered the School of Software Engineering of East China Normal University to pursue the PhD. Her e-mail is [email protected].

Daojing He (Member, IEEE) received the B.Eng. (2007) and M.Eng. (2009) degrees from Harbin Institute of Technology (China) and the Ph.D. degree (2012) from Zhejiang University (China), all in computer science. He is currently a professor in the School of Software Engineering, East China Normal University, P.R. China. His research interests include network and systems security. He is on the editorial board of several international journals such as IEEE Network and IEEE Communications Magazine.

Zikang Ding is currently studying for his master's degree at East China Normal University. His current research interests include authenticated encryption, big data privacy and security issues, and federated learning.

Miao Wang is a PhD candidate at East China Normal University. Her research interests include Number Theory and Applied Cryptography Theory, Quantum Computing, and Information Security. From 2017 to 2020, she studied at South China University of Technology and obtained the MS. In September 2020, she entered the School of Software Engineering of East China Normal University to pursue the PhD.

Sammy Chan (Senior Member, IEEE) received the B.E. and M.Eng.Sc. degrees in electrical engineering from the University of Melbourne, Australia, in 1988 and 1990, respectively, and the Ph.D. degree in communication engineering from the Royal Melbourne Institute of Technology, Australia, in 1995. From 1989 to 1994, he was with Telecom Australia Research Laboratories, first as a Research Engineer, and from 1992 to 1994 as a Senior Research Engineer and a Project Leader. Since December 1994, he has been with the Department of Electronic Engineering, City University of Hong Kong, where he is currently an Associate Professor.

Xuru Li is currently a PhD student at East China Normal University after receiving the MS in 2018 from Shanghai Maritime University and the BS in 2016 from Northeast Forestry University. Her current research interests include authenticated encryption, secure computing in distribution systems, and privacy issues for big data.