

Semi-Synchronous Personalized Federated Learning over Mobile Edge Networks

Chaoqun You, Member, IEEE, Daquan Feng, Member, IEEE, Kun Guo, Member, IEEE, Howard H. Yang, Member, IEEE, Chenyuan Feng, and Tony Q. S. Quek, Fellow, IEEE

This paper was supported in part by the National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme, in part by MOE ARF Tier 2 under Grant T2EP20120-0006, in part by the National Science and Technology Major Project under Grant 2020YFB1807601, in part by the Shenzhen Science and Technology Program under Grant JCYJ20210324095209025, in part by the Shanghai Pujiang Program under Grant No. 21PJ1402600, in part by the National Natural Science Foundation of China under Grant 62201504, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LGJ22F010001. (Corresponding author: Daquan Feng)
C. You and T. Quek are with the Wireless Networks and Design Systems Group, Singapore University of Technology and Design, 487372, Singapore (e-mail: chaoqun_you, [email protected]).
D. Feng and C. Feng are with Shenzhen University, Shenzhen 518052, China (e-mail: fdquan, [email protected]).
K. Guo is with East China Normal University, Shanghai 200241, China (e-mail: [email protected]).
H. H. Yang is with the Zhejiang University/University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China (e-mail: [email protected]).

Abstract—Personalized Federated Learning (PFL) is a new Federated Learning (FL) approach that addresses the heterogeneity of the datasets generated by distributed user equipments (UEs). However, most existing PFL implementations rely on synchronous training to ensure good convergence performance, which may lead to a serious straggler problem, where the training time is heavily prolonged by the slowest UE. To address this issue, we propose a semi-synchronous PFL algorithm over mobile edge networks, termed Semi-Synchronous Personalized FederatedAveraging (PerFedS2). By jointly optimizing the wireless bandwidth allocation and the UE scheduling policy, it not only mitigates the straggler problem but also provides convergent training loss guarantees. We derive an upper bound on the convergence rate of PerFedS2 in terms of the number of participants per global round and the number of rounds. On this basis, the bandwidth allocation problem can be solved analytically, and the UE scheduling policy can be obtained by a greedy algorithm. Experimental results verify the effectiveness of PerFedS2 in saving training time as well as guaranteeing the convergence of the training loss, in contrast to synchronous and asynchronous PFL algorithms.

Index Terms—Semi-synchronous implementation, personalized federated learning, mobile edge networks

I. INTRODUCTION

Federated Learning (FL) is a new distributed machine learning paradigm that enables model training across multiple user equipments (UEs) without uploading their raw data to a central parameter server [1]. Since its advent, FL has been widely adopted as a powerful tool to exploit the wealth of data available at end-user devices [2, 3] and to foster new applications such as Artificial Intelligence (AI) medical diagnosis [4] and autonomous vehicles [5]. Training an FL model involves three typical steps: (i) a set of UEs conduct local computing based on their own datasets and upload the resultant parameters to the server, (ii) the server aggregates the UEs' parameters and improves the global model, and (iii) the server feeds the new model back to the UEs for another round of local computing. This procedure repeats until the loss function starts to converge and a certain model accuracy is achieved.

With the substantial improvement in the sensing capabilities and computational power of edge devices, UEs are producing abundant but diverse data [6]. The increasingly diverse datasets breed a demand for customized services on individual UEs. Typical examples of potential applications include Vehicle-to-everything (V2X) communications, where vehicles in the network may experience various road conditions and driving habits, making the local model disparate from the global model [7, 8]; and recommendation systems, where local servers have potentially heterogeneous customers and non-independent and identically distributed (non-i.i.d.) item popularities, thus requiring fine-grained recommendations [9, 10]. However, conventional FL algorithms are designed to learn a common model, which may have mediocre performance on certain UEs, and the situation is exacerbated as ever-developing mobile UEs generate increasingly diverse data. To address this issue, Personalized Federated Learning (PFL) [11, 12] has been proposed. Specifically, PFL provides an initial model that is good enough for the UEs to start with. Using this initial model, each UE can quickly adapt to its local dataset with one or more gradient descent steps using only a few data points. As a result, the UEs (especially those with heterogeneous datasets) are able to enjoy fast personalized models by adapting the global model to their local datasets.

Nonetheless, most PFL implementations adopt synchronous training to ensure good convergence performance [11, 13-16]. In the synchronous setting, the central server has to wait for the arrival of the parameters of the slowest UE before it can update the global model. As a consequence, synchronous training may cause a severe straggler problem in PFL, where the deceleration of any UE can delay all other UEs. On the other hand, parameters of the UEs may arrive at the server at different speeds due to reasons such as varying CPU processing capabilities and different wireless channel conditions. This difference begets another operation mechanism: asynchronous training. The key idea of asynchronous implementation is to allow all UEs to work independently while the server updates the global model every time it receives an update from any UE [17-19]. Although this model updating strategy avoids the waiting time of UEs, the gradient staleness caused by asynchronous updating will further degrade the performance of
the model training. At this point, a semi-synchronous PFL becomes a natural choice to balance the disadvantages of the synchronous and the asynchronous PFL algorithms.

Although there have been several works on semi-synchronous FL algorithms [20-23], the semi-synchronous PFL problem is not well understood. [20] studied a semi-asynchronous protocol for fast FL. [21] proposed a semi-asynchronous FL algorithm for heterogeneous edge computing. [22] introduced a novel energy-efficient semi-asynchronous FL protocol that mixes local models periodically with minimal idle time and fast convergence. At last, [23] proposed a clustered semi-asynchronous FL algorithm that groups UEs by the delay and direction of clients' model updates to make the most of the advantages of both synchronous and asynchronous FL. Designing a semi-synchronous PFL over mobile edge networks, however, is particularly challenging for the following reasons: (1) The convergence rate of a semi-synchronous PFL is unclear. Moreover, the loss function of a deep learning model is usually non-convex, and whether a semi-synchronous PFL can converge, and under what conditions, is of much interest. (2) The practical wireless communication environment needs to be considered. It is non-trivial to decide the UE scheduling policy of a semi-synchronous PFL algorithm while accounting for the wireless bandwidth allocation.

In this paper, we propose a semi-synchronous PFL algorithm over mobile edge networks, named Semi-Synchronous Personalized FederatedAveraging (PerFedS2), that mitigates the straggler problem in PFL. This is done by optimizing a joint bandwidth allocation and UE scheduling problem. To solve this problem, we first analyse the convergence rate of PerFedS2 with non-convex loss functions. Our analysis characterizes the upper bound of the convergence rate in terms of two decision variables: the number of scheduled UEs in each communication round and the number of communication rounds. Based on this upper bound, the joint bandwidth allocation and UE scheduling optimization problem can be solved separately. For the bandwidth allocation problem, we find that for a given UE scheduling policy there exist infinitely many bandwidth solutions that minimize the overall training time. For the UE scheduling problem, facilitated by the results obtained from the convergence analysis, the optimal number of UEs scheduled to update the global model in each communication round and the optimal number of communication rounds can be estimated. These results lead us to design a greedy algorithm that gives the UE scheduling policy. Finally, with the optimal bandwidth allocation and the UE scheduling policy, we are able to implement PerFedS2 over mobile edge networks.

To summarize, in this paper we make the following contributions:
• We propose a new semi-synchronous PFL algorithm, PerFedS2, over mobile edge networks. PerFedS2 strikes a good balance between synchronous and asynchronous PFL algorithms. Particularly, by solving a joint bandwidth allocation and UE scheduling problem, it not only mitigates the straggler problem caused by synchronous training but also alleviates the potential divergence issue in asynchronous training.
• We derive the convergence rate of PerFedS2. Our analysis characterizes the upper bound of the convergence rate as a function of the number of UEs that are scheduled to update the global model in each communication round and the number of communication rounds.
• We solve the optimization problem by decoupling it into two sub-problems: the bandwidth allocation problem and the UE scheduling problem. While the optimal bandwidth allocation is proved to minimize the overall training time within a range of values, the UE scheduling policy can be determined using a greedy online algorithm.
• We conduct extensive experiments using the MNIST, CIFAR-100 and Shakespeare datasets to demonstrate the effectiveness of PerFedS2 in saving the overall training time as well as providing a convergent training loss, compared with four baselines, namely the synchronous and asynchronous FL and PFL algorithms.

The rest of the paper is organized as follows. In Section II we introduce the basic learning process of PerFedS2. Then, in Section III we formulate a joint bandwidth allocation and UE scheduling problem to quantify and maximize the benefits PerFedS2 brings compared with synchronous and asynchronous training. In order to solve the optimization problem, we first analyse the convergence rate of PerFedS2 in Section IV. Then, we solve the joint optimization problem in Section V. At last, we evaluate the performance of PerFedS2 in Section VI.

II. SEMI-SYNCHRONOUS PERSONALIZED FEDERATED LEARNING MECHANISM

In this section, we propose PerFedS2 to mitigate the drawbacks of synchronous and asynchronous PFL algorithms. For a better understanding of the proposed algorithm, we commence with reviewing FL and PFL in Section II-A and Section II-B, respectively. Then, we formally introduce PerFedS2 in Section II-C.

A. Review: Federated Learning

Consider a set of n UEs connected to the server via a BS, where each UE has local data (x, y) ∈ X_i × Y_i. If we define f_i: R^m → R as the loss function corresponding to UE i, and w as the model parameter that the server needs to learn, then the goal of the server is to solve

    min_{w∈R^m} f(w) := (1/n) Σ_{i=1}^n f_i(w),    (1)

where f_i represents the expected loss over the data distribution of UE i, which is formalized as

    f_i(w) := E_{(x,y)∼H_i}[l_i(w; x, y)],    (2)

where l_i(w; x, y) measures the error of model w in predicting the true label y, and H_i is the distribution over X_i × Y_i.
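To make the averaging in (1)-(2) concrete, the following minimal Python sketch estimates each f_i(w) from a handful of local samples and averages the per-UE losses into the global objective. The squared-error loss and the synthetic local data are illustrative assumptions introduced for this sketch only; they are not part of the paper's experiments.

```python
import numpy as np

# Minimal sketch of the FedAvg objective in Eqs. (1)-(2): the global loss is the
# average of per-UE expected losses, each estimated from local samples.
# The squared-error loss l and the synthetic local data are illustrative assumptions.
rng = np.random.default_rng(1)
n, m = 4, 3                                   # number of UEs, model dimension
local_data = [(rng.normal(size=(50, m)), rng.normal(size=50)) for _ in range(n)]

def l(w, x, y):                               # per-sample loss l_i(w; x, y)
    return 0.5 * (x @ w - y) ** 2

def f_i(w, data):                             # empirical estimate of Eq. (2)
    x, y = data
    return l(w, x, y).mean()

def f(w):                                     # Eq. (1): average over the n UEs
    return np.mean([f_i(w, d) for d in local_data])

print("f(w0) =", f(np.zeros(m)))
```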
Because the datasets residing on different UEs are usually non-i.i.d. and unbalanced, while the global model trained by FedAvg concentrates on the average performance of all the UEs, the resultant model may perform very poorly on certain individual UEs. In response, PFL is proposed to capture the statistical heterogeneity among UEs by adapting the global model to local datasets. We review this scheme in the next subsection.

B. Review: Personalized Federated Learning

In contrast to standard FL, PFL approaches the solution of (1) via Model-Agnostic Meta-Learning (MAML). Specifically, the target of PFL is to learn an initial model that adapts quickly to each UE through one or more gradient steps with only a few data points on the UEs. Such an initial model is commonly known as the meta model, and the local model after adaptation is referred to as the fine-tuned model.

Formally, if each UE takes the initial model and updates it via one step of gradient descent using its own loss function, problem (1) can be written as

    min_{w∈R^m} F(w) := (1/n) Σ_{i=1}^n f_i(w − α∇f_i(w)),    (3)

where α ≥ 0 is the learning rate at individual UEs. Note that we use the same learning rate for all UEs in this paper for simplicity. This assumption can easily be extended to the general case where UEs have diverse learning rates α_i, as long as α_i ≥ 0. For each UE i, its optimization objective F_i can be computed as

    F_i(w) := f_i(w − α∇f_i(w)).    (4)

Unlike conventional FL, after receiving the current global model, a UE in PFL first adapts the global model to its local data with one step of gradient descent, and then computes local gradients with respect to the model after the adaptation. This step of local adaptation captures the difference between UEs, and the model learned with the new formulation (3) is proved to be a good initial point for any UE to start with for fast adaptation [24, 25].

Many existing works on PFL are limited to the context of synchronous learning, where the faster UEs have to wait until all the others arrive at the server before moving to the next communication round [11, 13-16]. As a result, synchronous PFL often suffers from the straggler problem due to the prolonged waiting time for the slowest UE. On the other hand, PFL can also be trained in an asynchronous manner, where the server performs the global update as soon as it receives a local model from any UE. In this scenario, some slower UEs will bring stale gradient updates to the server, thereby degrading the convergence performance of the model training. Therefore, in this paper, we propose a semi-synchronous PFL mechanism that seeks a trade-off between synchronous and asynchronous PFL algorithms, which is detailed in the following subsection.

C. Semi-Synchronous Personalized Federated Learning

We propose a semi-synchronous PFL mechanism, which is a trade-off between synchronous and asynchronous PFL. We term this semi-synchronous PFL algorithm Semi-Synchronous Personalized FederatedAveraging (PerFedS2).

Algorithm 1: Semi-Synchronous Personalized Federated Averaging (PerFedS2)
 1  for k = 0, 1, . . . , K − 1 do
 2      Processing at each UE i
 3      if w_k is received from the server then
 4          Compute the local gradient ∇̃F_i(w_k) by Eq. (7); upload ∇̃F_i(w_k) to the server
 5      end
 6      Processing at the parameter server
 7      A_k = ∅
 8      while |A_k| < A do
 9          Receive a local gradient ∇̃F_i(w_k) from some UE i
10          A_k = A_k ∪ {i}
11      end
12      Update the global model to w_{k+1} by Eq. (8)
13      for i ∈ U do
14          if i ∈ A_k or τ_k^i > S then
15              Distribute w_{k+1} to UE i
16          end
17      end
18  end

PerFedS2 is formally described in Alg. 1. At the UE side (Lines 2-5), upon receiving a global model, or equivalently the meta model w_k, the UE adapts w_k to its local dataset to obtain the gradient of its local function, in this case the gradient ∇F_i, given by

    ∇F_i(w_k) = (I − α∇²f_i(w_k)) ∇f_i(w_k − α∇f_i(w_k)).    (5)

At the server side (Lines 6-12), let A_k be the set of UEs participating in the global update in round k, with cardinality |A_k| = A. Let τ_k^i be the interval between the current round k and the round of the last global model version received by UE i. Such an interval reflects the staleness of local updates. With this notion, we can write the gradient received by the BS at round k from UE i as ∇F_i(w_{k−τ_k^i}). Upon receiving A local gradients, the server updates the global model parameter as follows:

    w_{k+1} = w_k − (β/A) Σ_{i∈A_k} ∇F_i(w_{k−τ_k^i}),    (6)

where β > 0 is the global step size. Then, the server distributes the new global model w_{k+1} to either (a) the UEs in A_k or (b) the UEs whose staleness is larger than the staleness threshold S.

Due to the vast volume of the dataset, computing the exact gradient for each UE is costly. Therefore, we use stochastic gradient descent (SGD) [26] as a proxy. Specifically, a generic UE i samples a subset of data points to calculate an unbiased estimate ∇̃f_i(w_k; D_i) of ∇f_i(w_k), where D_i represents a portion of UE i's local dataset with size |D_i| = D_i. Similarly, the Hessian ∇² in (5) can be replaced by its unbiased estimate ∇̃²f_i(w_k; D_i).
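The following minimal NumPy sketch walks through a few PerFedS2 rounds as described in Alg. 1, using the exact meta-gradient (5) and the server update (6). The quadratic local losses, the simulated arrival order, and all parameter values are illustrative assumptions introduced for this sketch only.

```python
import numpy as np

# Illustrative PerFedS2 rounds (Alg. 1, Eqs. (5)-(6)). Quadratic local losses
# f_i(w) = 0.5*||w - c_i||^2 and the arrival-order heuristic are assumptions.
rng = np.random.default_rng(0)
n, m = 4, 5                 # number of UEs, model dimension
A, S = 2, 3                 # aggregated updates per round, staleness threshold
alpha, beta = 0.1, 0.5      # local adaptation rate, global step size
K = 5                       # communication rounds
centers = rng.normal(size=(n, m))

def grad_f(i, w):                       # exact local gradient of the quadratic loss
    return w - centers[i]

def grad_F(i, w):                       # Eq. (5): (I - a*Hess) grad f_i(w - a*grad f_i(w))
    hess = np.eye(m)                    # Hessian of the quadratic loss is the identity
    return (np.eye(m) - alpha * hess) @ grad_f(i, w - alpha * grad_f(i, w))

w = np.zeros(m)                         # global (meta) model w_0
local_model = [w.copy() for _ in range(n)]   # model version each UE currently adapts
last_sync = [0] * n                          # round index of that version

for k in range(K):
    # pretend the A least-stale UEs deliver their gradients first in this round
    order = sorted(range(n), key=lambda i: (-last_sync[i], rng.random()))
    A_k = order[:A]
    agg = sum(grad_F(i, local_model[i]) for i in A_k) / A   # possibly stale meta-gradients
    w = w - beta * agg                                       # Eq. (6)
    for i in range(n):                   # distribution rule of Alg. 1, lines 13-17
        staleness = k + 1 - last_sync[i]
        if i in A_k or staleness > S:
            local_model[i], last_sync[i] = w.copy(), k + 1
    avg_loss = 0.5 * np.mean(np.sum((w - centers) ** 2, axis=1))
    print(f"round {k+1}: A_k={sorted(A_k)}, average local loss {avg_loss:.3f}")
```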
At this point, the actual gradient computed by UE i is the stochastic gradient of the local loss function, ∇̃F_i(w_k), which is given by

    ∇̃F_i(w_k) = (I − α∇̃²f_i(w_k; D_i^h)) ∇̃f_i(w_k − α∇̃f_i(w_k; D_i^in); D_i^o),    (7)

where D_i^in, D_i^o and D_i^h are independently sampled datasets with total size d_i = D_i^in + D_i^o + D_i^h. This stochastic gradient is then uploaded to the central server for the global model update:

    w_{k+1} = w_k − (β/A) Σ_{i∈A_k} ∇̃F_i(w_{k−τ_k^i}).    (8)

III. SYSTEM MODEL AND PROBLEM FORMULATION

In the last section, we introduced the basic learning process of PerFedS2. This alone is not enough to quantify the benefits that a semi-synchronous training manner brings to the implementation, because the communication-related parameters and the training hyperparameters remain unspecified. Therefore, our next step is to formulate an optimization problem for PerFedS2, with the wireless bandwidth allocation and the UE scheduling policy to be determined. In this section, we introduce some notations and concepts in Sections III-A and III-B that are used to formulate the optimization problem in Section III-C.

A. Communication Model

To implement PerFedS2 in mobile edge networks, the wireless communication environment should also be considered in order to maximize the benefit that a semi-synchronous learning manner brings to the learning algorithm. Since in PerFedS2 one local iteration of UE i may last for a few global communication rounds, we focus on describing the wireless communication process of UE i within such a local iteration. The learning time of UE i during one local iteration consists of two parts: communication time and computation time. As for the communication time over mobile edge networks, we consider that UEs access the BS through a channel partitioning scheme, such as orthogonal frequency division multiple access (OFDMA) [27], with total bandwidth B. Meanwhile, the bandwidth allocated to UE i in round k is denoted by b_k^i. The uplink rate of UE i transmitting its local gradients to the BS can be computed as follows [28, 29],

    r_k^i = b_k^i ln(1 + p_i h_k^i ||c_i||^{−κ} / (b_k^i N_0)),    (9)

where p_i is the transmit power of UE i, κ is the path loss exponent, and N_0 is the noise power spectral density. h_k^i ||c_i||^{−κ} is the channel gain between UE i and the BS at round k, with c_i being the distance between UE i and the BS and h_k^i being the small-scale channel coefficient. In this paper, we assume that the small-scale channel coefficients h_k^i across communication rounds follow a Rayleigh distribution [30]. With r_k^i, the uplink transmission delay of UE i can be specified as follows,

    Tcom_k^i = Z_k^i / r_k^i,    (10)

where Z_k^i denotes the number of bits UE i transmits in round k, and Z denotes the total size of the gradient UE i transmits each time. Since the transmit power of the BS is much higher than that of the UEs, the downlink transmission latency is much smaller than the uplink one. Meanwhile, we care more about the transmit power allocation on individual UEs rather than that on the server; hence we ignore the downlink delay for simplicity.

As for the computation time, let c_i denote the number of CPU cycles for UE i to execute one sample of data, ϑ_i denote the CPU-cycle frequency of UE i, and d_i denote the number of sampled data points on UE i. Then the computation time of UE i per local iteration can be expressed as follows [28],

    Tcmp_k^i = c_i d_i / ϑ_i.    (11)

As such, given that in semi-synchronous training each local iteration of UE i may last several global rounds, the total time UE i spends in round k is given by

    T_k^i = Tcom_k^i + Tcmp_k^i, when UE i starts a new local iteration in round k,
    T_k^i = Tcom_k^i, otherwise.    (12)

B. Illustrative Example

We give an example to facilitate the understanding of PerFedS2. Consider the scenario depicted in Fig. 1, where A = 2. This network has four UEs. In the first communication round, UEs 3 and 4 are stragglers. Therefore, once the stochastic gradients uploaded by UEs 1 and 2 arrive at the server in round 1, the server updates the global model from w_0 to w_1, leaving the gradients computed by UEs 3 and 4 to be integrated into the global model in round 2 and round 3, respectively.

Scheduling policy: Let π_k^i ∈ {0, 1} be an indicator of whether the gradient uploaded by UE i arrives at the server in round k. That is, π_k^i = 1 if the update from UE i is included in the global model in round k, and π_k^i = 0 otherwise. Then Π ≜ [Π_1, Π_2, . . . , Π_K] denotes the scheduling decision matrix up to round K, where Π_k ≜ [π_k^1, π_k^2, . . . , π_k^n]. For the example given in Fig. 1, the computation has been carried out for five rounds, and the scheduling decision matrix Π can be written as

    Π = [ 1 1 0 0
          0 1 1 0
          1 0 0 1
          0 1 1 0
          1 0 0 1 ].    (13)

From the above, we can see that the entries in each row of Π satisfy the relationship

    Σ_{i=1}^n π_k^i = A.    (14)

We further introduce a concept, coined the relative participation frequency, to reflect the statistical property of the scheduling policy. Specifically, for UE i, we denote its relative participation frequency by η_i, which represents the fraction of time this UE participates in the global iteration.
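A small Python sketch of the bookkeeping behind (13)-(14) and the relative participation frequency formalized in (15) below, using the Fig. 1 example matrix; nothing in it goes beyond what these equations state.

```python
import numpy as np

# Scheduling matrix of the Fig. 1 example (Eq. (13)); rows are rounds, columns are UEs.
Pi = np.array([[1, 1, 0, 0],
               [0, 1, 1, 0],
               [1, 0, 0, 1],
               [0, 1, 1, 0],
               [1, 0, 0, 1]])

K, n = Pi.shape
assert np.all(Pi.sum(axis=1) == Pi.sum(axis=1)[0])   # Eq. (14): every row sums to A
A = int(Pi.sum(axis=1)[0])

eta = Pi.sum(axis=0) / (A * K)                       # Eq. (15): eta_i = sum_k pi_k^i / (A K)
print("A =", A, " eta =", eta)                       # -> A = 2, eta = [0.3 0.3 0.2 0.2]
```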
[Figure 1 about here: a timeline of five communication rounds for the four UEs, distinguishing computation and communication periods and showing which global model version w each UE adapts in each round.]
Fig. 1: Example of the PerFedS2 mechanism when A = 2.

Such a notion is formally defined as

    η_i = (Σ_{k=1}^K π_k^i) / (Σ_{k=1}^K Σ_{i=1}^n π_k^i) = (Σ_{k=1}^K π_k^i) / (AK).    (15)

Notably, the staleness bound S provides a lower bound on η_i, that is, η_i ≥ S/K (∀i ∈ U).

C. Problem Formulation

PerFedS2 significantly increases the proportion of time UEs spend on computing, as opposed to waiting. Meanwhile, PerFedS2 also upper-bounds the staleness caused by updates from slow UEs. Let T be the overall training time over K communication rounds. Then the objective of PerFedS2 is to minimize the loss function as well as the overall training time. Formally, the optimization problem of PerFedS2 is formulated as follows¹,

    min_{b,Π,A,K} F(w)    (P1)
    s.t. min_b Σ_{k=1}^K max_{i∈A_k} {T_k^i} = T, ∀i ∈ U,    (C1.1)
         Σ_{i=1}^n b_k^i ≤ B, k = 1, 2, . . . , K,    (C1.2)
         Σ_{j=k−τ_k^i}^{k−τ_k^i+S} π_j^i ≥ 1, ∀i ∈ U,    (C1.3)
         Σ_{j=k−τ_k^i}^{k} Z_j^i ≤ Z,    (C1.4)
         K ≥ S/η_i, ∀i ∈ U,    (C1.5)

where b ≜ [b_1, b_2, . . . , b_K] denotes the bandwidth allocation matrix up to round K, and b_k = [b_k^1, b_k^2, . . . , b_k^n]. (C1.1) is the overall training time constraint: for each communication round k, the round time is determined by the maximum of T_k^i over i ∈ A_k, and the total time up to round K equals T. (C1.2) is the bandwidth constraint: the bandwidth allocated to all UEs in every communication round shall not exceed the available bandwidth B. (C1.3) stipulates the staleness constraint on the updates: during any S consecutive communication rounds, UE i must be scheduled to update the global model at least once. (C1.4) limits the number of bits transmitted: note that Z_k^i is determined by b_k^i, and the number of bits transmitted during τ_k^i rounds shall not be larger than the size of the model parameters. Finally, (C1.5) follows from the lower bound derived in the previous subsection.

¹ Besides the bandwidth allocation and UE scheduling policy, other decision variables, such as the transmit power, can also be included in the problem formulation. The logic remains the same, but the parameters that need to be considered might change. Problem (P1) shows the case where we consider the bandwidth allocation and UE scheduling policy as variables, and the researcher is free to extend this general formulation to other forms.

IV. CONVERGENCE ANALYSIS

In this section, we first introduce some definitions and assumptions on the loss functions of PerFedS2 in Section IV-A. Then we analyse its convergence rate in Section IV-B.
A. Preliminaries

We consider non-convex loss functions in this paper. Our goal is to find an ε-approximate first-order stationary point (FOSP) for PerFedS2 [13, 25]. The formal definition of an FOSP is given as follows.

Definition 1. A random vector w_ε ∈ R^m is called an ε-FOSP for PerFedS2 if it satisfies E[||∇F(w_ε)||²] ≤ ε.

To make the convergence analysis consistent with that of Per-FedAvg, we make the following assumptions [13].

Assumption 1 (Bounded Staleness). All delay variables τ_k^i are bounded, i.e., max_{k,i} τ_k^i ≤ S.

Assumption 2. For each UE i ∈ U, its gradient ∇f_i is L-Lipschitz continuous and is bounded by a nonnegative constant C, namely,

    ||∇f_i(w) − ∇f_i(u)|| ≤ L ||w − u||, ∀w, u ∈ R^m,    (17)
    ||∇f_i(w)|| ≤ C, ∀w ∈ R^m.    (18)

Assumption 3. For each UE i ∈ U, the Hessian of f_i is ρ-Lipschitz continuous:

    ||∇²f_i(w) − ∇²f_i(u)|| ≤ ρ ||w − u||, ∀w, u ∈ R^m.    (19)

Assumption 4. For any w ∈ R^m, ∇l_i(w; x, y) and ∇²l_i(w; x, y), computed with respect to a single data point (x, y) ∈ X_i × Y_i, have bounded variance:

    E_{(x,y)∼p_i}[||∇l_i(w; x, y) − ∇f_i(w)||²] ≤ σ_G²,
    E_{(x,y)∼p_i}[||∇²l_i(w; x, y) − ∇²f_i(w)||²] ≤ σ_H².    (20)

Assumption 5. For any w ∈ R^m, the gradient and Hessian of the local loss function f_i(w) and of the average loss function f(w) = (1/n) Σ_{i=1}^n f_i(w) satisfy the following conditions:

    (1/n) Σ_{i=1}^n ||∇f_i(w) − ∇f(w)||² ≤ γ_G²,
    (1/n) Σ_{i=1}^n ||∇²f_i(w) − ∇²f(w)||² ≤ γ_H².    (21)

While Assumption 1 limits the maximum staleness, Assumptions 2 to 5 characterize the properties of the gradient and Hessian of f_i(w), which are necessary to deduce the following lemmas and the convergence rate analysis.

B. Analysis of Convergence Bound

Before delving into the full details of the convergence analysis, we introduce three lemmas inherited from [13] to quantify the smoothness of F_i(w) and F(w), the deviation between ∇F_i(w) and its estimate ∇̃F_i(w), and the deviation between ∇F_i(w) and ∇F(w), respectively.

Lemma 1. If Assumptions 2-4 hold, then F_i is smooth with parameter L_F := 4L + αρC. As a consequence, the average function F(w) = (1/n) Σ_{i=1}^n F_i(w) is also smooth with parameter L_F.

Lemma 2. If Assumptions 2-4 hold, then for any α_i ∈ (0, 1/L] and w ∈ R^m, we have

    E[||∇̃F_i(w) − ∇F_i(w)||] ≤ 2αLσ_G / √(D_in),    (22)
    E[||∇̃F_i(w) − ∇F_i(w)||²] ≤ σ_F²,    (23)

where σ_F² is defined as

    σ_F² := 12 [C² + σ_G² (1/D_o + (αL)²/D_in)] (1 + σ_H² α²/(4 D_h)) − 12 C²,    (24)

and D_in = max_{i∈U} D_i^in, D_o = max_{i∈U} D_i^o and D_h = max_{i∈U} D_i^h.

Lemma 3. Given the loss function F_i(w) in (4) and α ∈ (0, 1/L], if the conditions in Assumptions 2, 3, and 5 are all satisfied, then for any w ∈ R^m, we have

    (1/n) Σ_{i=1}^n ||∇F_i(w) − ∇F(w)||² ≤ γ_F²,    (25)

where γ_F² is defined as

    γ_F² := 3C²α²γ_H² + 192γ_G²,    (26)

and ∇F(w) = (1/n) Σ_{i=1}^n ∇F_i(w).

Based on the three lemmas, we obtain the following theorem.

Theorem 1. If Assumptions 1 to 5 hold and the step size β, together with the smoothness parameter L_F in Lemma 1, satisfies

    L_F β² − β + 2 L_F² β² S² ≤ 1,    (27)

then the following FOSP condition holds,

    (1/K) Σ_{k=0}^{K−1} E[||∇F(w_k)||²] ≤ 2(F(w_0) − F(w*)) / (βK) + 4 (L_F β + 2 L_F² β² S²)(σ_F² + γ_F²) √A.    (28)

Proof: See the Appendix.

Corollary 1. Assume the conditions in Theorem 1 are satisfied. Then, if we set the number of total communication rounds as K = O(ε⁻³), the global learning rate as β = O(ε²), the staleness threshold as S = O(ε⁻¹), and the number of UEs that update the global model as A = O(ε⁻²), Algorithm 1 finds an ε-FOSP for PerFedS2.

Proof: Note that 2(F(w_0) − F(w*)) is constant; hence K = O(ε⁻³) and β = O(ε²) ensure that the first term on the right-hand side of (28) is O(ε). Next we examine the second term of (28). Note that (σ_F² + γ_F²) is constant; hence β = O(ε²) and S = O(ε⁻¹) together make (2L_Fβ + 4L_F²β²S²) = O(ε²). At this point, if A = O(ε⁻²), the second term of (28) is equivalent to O(ε).
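The bound in (28) and the scalings in Corollary 1 can be checked numerically. The snippet below plugs the Corollary-1 parameter choices into the right-hand side of (28); the problem constants are placeholder assumptions, so only the scaling behaviour of the two terms with ε is meaningful.

```python
import math

# Evaluate the right-hand side of Eq. (28) under the parameter scalings of
# Corollary 1. All problem constants below are placeholders, not measured values.
def fosp_bound(K, A, S, beta, L_F=1.0, sigma_F2=1.0, gamma_F2=1.0, F_gap=1.0):
    term1 = 2.0 * F_gap / (beta * K)
    term2 = 4.0 * (L_F * beta + 2.0 * L_F ** 2 * beta ** 2 * S ** 2) \
            * (sigma_F2 + gamma_F2) * math.sqrt(A)
    return term1, term2

for eps in (0.2, 0.1, 0.05):
    K, beta = int(eps ** -3), eps ** 2      # K = O(eps^-3), beta = O(eps^2)
    S, A = int(eps ** -1), int(eps ** -2)   # S = O(eps^-1), A = O(eps^-2)
    t1, t2 = fosp_bound(K, A, S, beta)
    print(f"eps={eps}: K={K}, A={A}, S={S}, term1={t1:.3f}, term2={t2:.3f}")
```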
V. JOINT BANDWIDTH ALLOCATION AND UE SCHEDULING

In this section, we present the steps to solve the optimization problem P1. In particular, we decouple P1 into P2, a bandwidth allocation problem, and P3, a UE scheduling problem. Note that solving the two sub-problems individually is equivalent to solving the original P1, which will be elaborated in the sequel.

A. Problem Decoupling

We begin with the bandwidth allocation problem. Given a scheduling pattern Π, the bandwidth allocation problem can be written as follows:

    min_b T(Π)    (P2)
    s.t. Σ_{k=1}^K max_{i∈A_k} {T_k^i} ≤ T(Π),    (C2.1)
         Σ_{i=1}^n b_k^i ≤ B, k = 1, 2, . . . , K,    (C2.2)
         Σ_{j=k−τ_k^i}^{k} Z_j^i ≤ Z, ∀i ∈ U.    (C2.3)

Then, with the optimal bandwidth allocation and the corresponding minimal overall training time T*(Π), the UE scheduling problem can be written as follows,

    min_{K,A,Π} F(w)    (P3)
    s.t. Σ_{k=1}^K max_{i∈A_k} {T_k^i} = T*(Π), ∀i ∈ U,    (C3.1)
         Σ_{j=k−τ_k^i}^{k−τ_k^i+S} π_j^i ≥ 1, ∀i ∈ U,    (C3.2)
         K ≥ S/η_i, ∀i ∈ U.    (C3.3)

B. Optimal Bandwidth Allocation

In order to solve P2, we introduce the following theorems to explore the relationship between b_k^i and T(Π) step by step.

Theorem 2. If the server updates the global model after receiving A gradients from the UEs in each round, then the optimal bandwidth allocation is achieved if and only if all the scheduled UEs have the same finishing time.

Proof: Recall the expression of r_k^i defined in (9); taking the derivative with respect to b_k^i, we arrive at

    d/db_k^i [ b_k^i ln(1 + p_i h_i ||c_i||^{−κ} / (b_k^i N_0)) ]
      = ln(1 + p_i h_i ||c_i||^{−κ} / (b_k^i N_0)) − p_i h_i ||c_i||^{−κ} / (b_k^i N_0 + p_i h_i ||c_i||^{−κ})    (31)
      > [p_i h_i ||c_i||^{−κ} / (b_k^i N_0)] / [1 + p_i h_i ||c_i||^{−κ} / (b_k^i N_0)] − p_i h_i ||c_i||^{−κ} / (b_k^i N_0 + p_i h_i ||c_i||^{−κ})
      = 0,    (32)

where the inequality follows from the fact that ln(1 + x) > x/(1 + x) for x > 0. Therefore, r_k^i monotonically increases with b_k^i. Since r_k^i > 0, Tcmp_k^i + Tcom_k^i = Tcmp_k^i + Z_k^i / r_k^i monotonically decreases with b_k^i. Therefore, at round k, if any UE i ∈ A_k has finished its whole local model update process earlier than the others, we can decrease its bandwidth allocation to make it up to the other, slower UEs in A_k. As a result, the round latency, which is determined by the slowest UE in A_k, can be reduced. Such bandwidth compensation is performed until all scheduled UEs in A_k finish their local iterations at the same time. Consequently, the optimal bandwidth allocation in round k is achieved when all scheduled UEs in A_k have the same finishing time.

Theorem 3. Given the relative participation frequencies η_i (i ∈ U), the UEs are scheduled in an order with a recurrence pattern. That is, the UEs periodically participate in the global model update.

Proof: Recall the formulation of η_i defined in (15); η_i is computed from the number of times UE i has been scheduled during all K rounds. Therefore, if η_i is settled, then Σ_{k=0}^{K−1} π_k^i is settled. As a result, if the UEs are scheduled periodically, the number of times each UE is involved in the global update can be settled, thus matching the relative participation rate it has been assigned.

Theorem 4. The optimal bandwidth allocation that achieves the minimum learning time is given by

    Σ_{i∈U} b_k^i = B, k = 1, . . . , K,
    b_k^i > B n η_i Z / [ (T*(Π) − Tcmp_i) (W(−Γ_i e^{−Γ_i}) + Γ_i) ],    (33)
    Σ_{i∈A_k} b_k^i ≤ B,

where Γ_i ≜ N_0 Z / [ (T*(Π) − Tcmp_i) p_i h_i ||c_i||^{−κ} ], W(·) is the Lambert-W function, and T(Π) is the objective value of (P2).

Proof: From Theorem 3, we know that all UEs update the global model periodically. Let K_p denote the number of communication rounds in each period. Then, inferring from Theorem 2, all UEs have the same finishing time in each period without any waiting time. That is, we have

    Σ_{k=1}^{K_p} T_k^i = Σ_{k=1}^{K_p} T_k^j, ∀i, j ∈ U, i ≠ j.    (34)

Meanwhile, we have

    Σ_{k=1}^{K_p} Z_k^i = η_i Z A K_p, ∀i ∈ U,    (35)

where Z A K_p denotes the number of bits that need to be transmitted during the K_p rounds. This equation indicates that the number of bits transmitted by UE i during K_p rounds is equal to the product of its relative participation frequency η_i and the total number of bits transmitted during those K_p communication rounds. From equation (35), it is easy to see that

    Σ_{k=1}^{K_p} Z_k^i / η_i = Σ_{k=1}^{K_p} Z_k^j / η_j, ∀i, j ∈ U, i ≠ j.    (36)
[Figure 2 about here: two timelines over two rounds for the four UEs; panel (a) shows the largest bandwidth allocation to the UEs in A_k and panel (b) the least bandwidth allocation to the UEs in A_k.]
Fig. 2: Bandwidth allocation example, where all UEs have the same parameters, and A = 2.

Now combining (34) and (36), we have

    (Σ_{k=1}^{K_p} Z_k^i) / (η_i Σ_{k=1}^{K_p} T_k^i) = (Σ_{k=1}^{K_p} Z_k^j) / (η_j Σ_{k=1}^{K_p} T_k^j), ∀i, j ∈ U, i ≠ j.    (37)

From equation (37) we observe that Σ_{k=1}^{K_p} Z_k^i / Σ_{k=1}^{K_p} T_k^i is the average rate of UE i during the K_p rounds. That is, we have

    E(r_k^i) / η_i = E(r_k^j) / η_j, ∀i, j ∈ U, i ≠ j.    (38)

The above equation states that as long as the average rate of each UE is equalized in this weighted sense, the optimal solution is achieved. Therefore, there exist infinitely many solutions of r_k^i to the above equation. The simplest solution is r_k^i / η_i = r_k^j / η_j in each round k. Note that r_k^i is determined by b_k^i, and thus there exist infinitely many solutions of b_k^i in each round k.

Our next step is to compute the boundary values of b_k^i. To do this, we first divide the UEs into two categories: UEs in A_k and UEs not in A_k.
• In one extreme case, only the UEs in A_k are assigned bandwidth. That is, Σ_{i∈A_k} b_k^i = B. In this case, the PerFedS2 algorithm turns out to be a synchronous Per-FedAvg algorithm where in each round A UEs are selected to update the global model. Meanwhile, the bandwidth is allocated proportionally to the UEs in A_k such that r_k^i / η_i = r_k^j / η_j, ∀i, j ∈ A_k, i ≠ j. This extreme case corresponds to the third inequality of (33).
• In the other extreme case, all UEs in round k share the available bandwidth B at rates satisfying r_k^i / η_i = r_k^j / η_j, ∀i, j ∈ A_k, i ≠ j. This case gives the least bandwidth allocation to the UEs in A_k that still preserves their order of arrival at the server in the scheduling pattern. In this case, Σ_{i∈U} b_k^i = B. Therefore, a closed form of b_k^i is obtained, which corresponds to the lower bound of b_k^i shown in the second inequality of (33).

To better illustrate these two cases, let us take the example in Fig. 2. Assume A = 2 and the four UEs have the same η_i, p_i, h_i, and c_i. We can write the scheduling pattern Π of the four UEs as follows:

    Π = [ 1 1 0 0
          0 0 1 1
          1 1 0 0
          0 0 1 1
          . . . . ].    (39)

The length of the scheduling period is K_p = 2. Meanwhile, according to Theorem 4, we have E(r_k^1) = · · · = E(r_k^4). One extreme case of bandwidth allocation is that UE 1 and UE 2 share the total bandwidth B in the first round, each being assigned B/2. At the same time, UE 3 and UE 4 can complete their local computation during round 1. Then, at round 2, all bandwidth B is allocated to UE 3 and UE 4 for their gradient transmission. In this case, according to Theorem 2, in each round both scheduled UEs finish their gradient transmission at the same time. That is, the duration of round 1 is minimized when UE 1 and UE 2 share the total bandwidth B equally. At this point, the round duration is Z/r(B/2), where r(B/2) = (B/2) ln(1 + 2 p_i h_i ||c_i||^{−κ} / (B N_0)). Similarly, the duration of round 2 is also Z/r(B/2), so the total time of each period is 2Z/r(B/2). The other extreme case of bandwidth allocation is for all four UEs to share the bandwidth equally; then the UEs finish one global update at the same time, which takes Z/r(B/4). Note that we set A = 2, but in this case, if all UEs finish one communication round at the same time, then A = 4; therefore this extreme situation cannot be achieved but can only be approached infinitely closely. It is obvious that Z/r(B/4) = 2Z/r(B/2); this equation indicates that all bandwidth allocation policies between the two extreme cases lead to the same minimized overall training time.
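As a numerical counterpart to the bandwidth boundary in Theorem 4, the sketch below finds the smallest bandwidth that lets a UE finish its upload within a target round time under the rate model (9). Eq. (33) expresses this boundary in closed form via the Lambert-W function, whereas here we simply bisect; all parameter values are illustrative assumptions.

```python
import numpy as np

# Find the smallest bandwidth b that lets UE i finish Z bits within a target
# round time T*, given the uplink rate model of Eq. (9). Eq. (33) gives this
# boundary in closed form via the Lambert-W function; here we bisect instead.
# All parameter values are illustrative assumptions.
p_i, h_i, kappa, c_i = 0.01, 1.0, 3.8, 120.0
N0 = 10 ** (-174 / 10) * 1e-3                # -174 dBm/Hz converted to W/Hz
Z, T_cmp, T_star = 1e6, 0.006, 0.5           # bits, computation time, target time [s]

def finish_time(b):                          # T_cmp + Z / r_k^i(b), with r from Eq. (9)
    rate = b * np.log(1 + p_i * h_i * c_i ** (-kappa) / (b * N0))
    return T_cmp + Z / rate

lo, hi = 1.0, 1e7                            # bandwidth search interval [Hz]
for _ in range(100):                         # bisection: finish_time decreases with b
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if finish_time(mid) > T_star else (lo, mid)
print(f"minimum bandwidth for T* = {T_star} s: about {hi:.3e} Hz")
```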

At this point, according to the features of the optimal bandwidth solutions, we obtain four corollaries. Corollaries 2 and 3 are two direct conclusions derived from Theorem 2, shown as follows.

Corollary 2. From Theorem 2, we find that in each round k, the UEs in A_k finish the communication round at the same time. That is, none of the UEs has to wait for the others under the optimal bandwidth allocation policy. Therefore, we have Σ_{k=1}^K max_{i∈A_k}{T_k^i} = Σ_{k=1}^K T_k^{i*} = T_i^* (∀i ∈ U).

Corollary 3. The optimal overall training time is equivalent to the optimal total training time of any UE i from a long-term perspective when K → +∞. That is, T*(Π) = T_i^* (∀i ∈ U and a large K).

Next, according to Theorem 4, we extract Corollary 4 to characterize the optimal solutions of Z_k^i, which are determined right after the computation of b_k^i.

Corollary 4. There exist infinitely many solutions of Z_k^i as long as the bandwidth allocation follows the results shown in Theorem 4. Meanwhile, Z_k^i lies in a range of values from 0 to Z.

At last, we introduce Corollary 5 to describe the relationship between the relative participation frequency η_i and the optimal overall training time T*(Π).

Corollary 5. There is a tradeoff between the relative participation frequencies η_i (i ∈ U) and the optimal overall training time T*(Π). As long as η is defined or determined, then according to Theorem 3 the circular scheduling pattern Π can be determined. With the scheduling pattern Π, according to Theorem 4, the optimal bandwidth allocation and the corresponding optimal overall training time T*(Π) can be determined.

C. Scheduling Policy

Based on the optimal bandwidth b_k^i obtained from P2, we turn to P3 to solve the UE scheduling problem. From (C3.2) we have

    η_i A K = Σ_{k=1}^K π_k^i ≥ K/S, ∀i ∈ U,    (40)

which can be further simplified to A ≥ 1/(η_i S). Meanwhile, note that the minimization of F(w) can be approximated by minimizing the upper bound of (1/K) Σ_{k=0}^{K−1} E[||∇F(w_k)||²] according to Theorem 1. Therefore, P3 can be approximated by P4 as follows:

    min_{K,A,Π} 2(F(w_0) − F(w*)) / (βK) + 4(L_F β + 2L_F²β²S²)(σ_F² + γ_F²) √A    (P4)
    s.t. T_i^* = T*(Π), ∀i ∈ U,    (C4.1)
         A ≥ 1/(η_i S), ∀i ∈ U,    (C4.2)
         K ≥ S/η_i, ∀i ∈ U,    (C4.3)

where (C4.1) is derived from Corollaries 2 and 3.

The relationship between A and K has been coarsely analysed in Corollary 1, where K = O(ε⁻³) and A = O(ε⁻²). This means that the optimal K* and A* can only be estimated in the implementation. Setting the first term and the second term of the objective of P4 equal to ε, respectively, the optimal solutions of K and A can be approximated by

    K* ≈ min_{i∈U} { 2(F(w_0) − F(w*)) / (βε), S/η_i },    (42)
    A* ≈ min_{i∈U} { ε² / [16 (L_F β + 2L_F²β²S²)² (σ_F² + γ_F²)²], 1/(η_i S) }.    (43)

With the optimal value A*, we use a greedy algorithm to generate the scheduling policy matrix Π, which is shown in Algorithm 2. In each round k, the algorithm always picks the UE i with the smallest current relative participation frequency η̂_i; if η̂_i < η_i, the algorithm sets π_k^i = 1. Then the algorithm picks the second poorest UE j and sets π_k^j = 1. This process repeats until A* UEs are picked in round k. For the next round k + 1, the same process repeats. In this way, the circular scheduling pattern is achieved and Π is obtained.

Algorithm 2: Greedy PerFedS2 Scheduling Algorithm
    Input: η = {η_1, η_2, . . . , η_n}, A*
 1  Initialize Π ← ∅
 2  for k = 1 to K do
 3      for i = 1 to N do
 4          if the total number of global updates sum(Π) = 0 then
 5              η̂_i = 0
 6          else
 7              η̂_i = (number of overall updates of UE i) / (number of overall global updates) = sum(Π[:, i]) / sum(Π)
 8          end
 9          if the current number of updates in round k, sum(Π[k, :]), is less than A* and the current relative participation frequency of UE i satisfies η̂_i ≤ η_i then
10              Set Π[k][i] ← 1
11              if the current number of updates in round k, sum(Π[k, :]), is still less than A* then
12                  Schedule the first A* − sum(Π[k, :]) UEs in the current round k,
13                  i.e., Π[k][0 : A* − sum(Π[k, :])] = 1
14              end
15          else
16              Π[k][i] = 0
17          end
18      end
19  end
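A compact Python rendering of the greedy idea in Algorithm 2 is sketched below: in each round, the UEs whose running participation frequency η̂_i lags their target η_i the most are scheduled first, with a fill-up step if fewer than A* UEs qualify. It paraphrases the pseudocode rather than reproducing it line by line, and the inputs are illustrative assumptions.

```python
import numpy as np

# Greedy scheduling in the spirit of Algorithm 2: in every round, serve the
# "poorest" UEs (smallest running frequency eta_hat) whose eta_hat is still
# below their target eta_i, then fill up to A_star if needed.
def greedy_schedule(eta, A_star, K):
    eta = np.asarray(eta, dtype=float)
    n = len(eta)
    Pi = np.zeros((K, n), dtype=int)
    counts = np.zeros(n)                           # times each UE has been scheduled
    for k in range(K):
        total = counts.sum()
        eta_hat = counts / total if total > 0 else np.zeros(n)
        chosen = []
        for i in np.argsort(eta_hat):              # least-served UEs first
            if len(chosen) == A_star:
                break
            if eta_hat[i] <= eta[i]:
                chosen.append(i)
        for i in np.argsort(eta_hat):              # fill-up step (Alg. 2, lines 11-13)
            if len(chosen) == A_star:
                break
            if i not in chosen:
                chosen.append(i)
        Pi[k, chosen] = 1
        counts[chosen] += 1
    return Pi

eta = [0.3, 0.3, 0.2, 0.2]                         # target relative participation frequencies
Pi = greedy_schedule(eta, A_star=2, K=10)
print(Pi)
print("achieved eta:", Pi.sum(axis=0) / Pi.sum())  # matches the targets for this input
```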

VI. PERFORMANCE EVALUATION

In this section, we conduct extensive experiments to (i) verify the effectiveness of PerFedS2 in saving the overall training time and (ii) examine the effects of different system parameters on the performance of PerFedS2.

A. Setup

1) Datasets and Models: We consider an FL system that contains multiple UEs located in a cell of radius R = 200 m with a BS located at the center. Meanwhile, the Rayleigh distribution parameter of h_k^i across communication rounds is 40. We conduct the experiments using three datasets: MNIST [31], CIFAR-100 [32] and the Shakespeare [33] dataset. The network model used for MNIST is a 2-layer deep neural network (DNN) with a hidden layer of size 100. The network model used for CIFAR-100 is LeNet-5 [34], which contains two convolutional layers and three fully connected layers. The network model used for the Shakespeare dataset is an LSTM classifier. The number of UEs under the MNIST and CIFAR-100 datasets is set to 20, and the number of UEs under the Shakespeare dataset for next-character prediction is 188. The other parameters used in the experiments are summarized in Table I.

TABLE I: System Parameters
    α (MNIST): 0.03          β (MNIST): 0.07
    α (CIFAR-100): 0.02      β (CIFAR-100): 0.06
    α (Shakespeare): 0.03    β (Shakespeare): 0.07
    B: 1 MHz                 κ: 3.8
    N_0: −174 dBm/Hz         p_i: 0.01 W

2) Baselines: We compare PerFedS2 with three benchmarks: synchronous, semi-synchronous, and asynchronous FL algorithms. For the synchronous FL benchmark, we consider three algorithms, FedAvg, FedProx [35], and Per-FedAvg (termed FedAvg-SYN, FedProx-SYN and PerFed-SYN in the figures). FedProx is an FL algorithm that deals with heterogeneous datasets. For the semi-synchronous benchmark, we consider two algorithms besides PerFedS2: semi-synchronous Federated Learning (FedAvgS2), which is a semi-asynchronous FL algorithm, and semi-synchronous FedProx (FedProxS2). For the asynchronous FL benchmark, we consider three algorithms, FedAvg-ASY, FedProx-ASY and PerFed-ASY. These three algorithms are asynchronous FL mechanisms, where the server performs the global update as soon as it receives a local model from any UE.

3) Dataset Participation: The level of divergence in the distribution of UEs' datasets affects the overall performance of the system. To reflect this feature, each UE is allocated a different local data size and holds l = 1, 2, . . . , 10 of the 10 labels, where l denotes the level of data heterogeneity; the higher l is, the more diverse the datasets are.

4) Relative Participation Frequency Setting: The relative participation frequency plays a critical role in the system performance, as it determines not only the scheduling pattern but also the minimal overall training time. In practice, many factors may affect the value of η, for example, the distances from the UEs to the server and the transmit power of each UE. In this paper, we use two settings of η. For the first one, we consider that all the UEs have the same η_i, i.e., η_1 = η_2 = · · · = η_n. For the second one, we consider that the distances from the UEs to the server are uniformly distributed, while the other parameters of the UEs are the same. Under this setting, the values of η_i among the UEs are unbalanced.

B. Evaluation Results

1) Effect of the relative participation frequency η: Fig. 3 shows the convergence performance comparison between PerFedS2 and five other FL and PFL algorithms, where the UEs have the same η_i and A = 5. Fig. 4 then shows the convergence performance comparison of the six algorithms where the η_i of each UE is determined by its distance to the server, with the distance uniformly distributed from 0 to 200 m. Finally, Fig. 5 shows the convergence comparison of the six algorithms using the Shakespeare dataset, where A = 50.

From both figures, we find that for MNIST it generally takes synchronous algorithms the most time to achieve the same convergence performance compared with semi-synchronous and asynchronous algorithms, and asynchronous algorithms behave the best. However, for the CIFAR-100 dataset, semi-synchronous algorithms generally behave the best. We attribute this conflicting behavior to the fact that MNIST is a much simpler dataset than CIFAR-100. Commonly, we use asynchronous algorithms to save waiting time for faster UEs and hope that the convergence performance will not be affected by the update staleness. This only works when the dataset is simple and easy to train. Therefore, as we can see in Fig. 3, for the MNIST dataset with a two-layer DNN model, the asynchronous algorithms do behave the best, semi-synchronous algorithms come second, and synchronous algorithms behave the worst. However, when it comes to the CIFAR-100 dataset with the LeNet-5 model, which is a much larger dataset with a much more complicated model, it is hard for the asynchronous algorithms to converge. In this case, semi-synchronous algorithms behave the best. This evaluation verifies our theoretical result that a proper semi-synchronous algorithm not only mitigates the straggler problem that arises in synchronous algorithms, but also bounds the staleness caused by the stragglers, thereby ensuring the convergence of the learning process. Meanwhile, it is clear that PFL algorithms converge much faster than FL algorithms. This result derives from the fact that PFL algorithms are designed to adapt and converge fast to new datasets.

Most importantly, we find that, compared with Fig. 3, the convergence performance shown in Fig. 4 is poorer. This is because the relative participation frequencies of the UEs in Fig. 4 are not equalized. Since the UEs are uniformly distributed in the cell, their distances to the central server differ. The UEs farther from the server have to transmit their gradients for a longer time to reach the server. Therefore, these UEs are naturally slower than the others, leading to smaller η for participation in the global model updates. Given that the datasets among UEs are heterogeneous, the lower participation of long-distance UEs leads to inadequate training on these UEs, making the global model convergence performance poorer than the one shown in Fig. 3.
[Figure 3 about here. Panels: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy; curves for FedAvg-SYN, PerFed-SYN, FedAvgS2, PerFedS2, FedAvg-ASY and PerFed-ASY versus time (×10² s).]
Fig. 3: Convergence performance comparison of PerFedS2, FedAvgS2, FedAvg-SYN, PerFed-SYN, FedAvg-ASY and PerFed-ASY using MNIST and CIFAR-100 datasets. In this case, η1 = η2 = · · · = ηn. Meanwhile, as for the PerFedS2 and FedAvgS2 algorithms, we set A = 5.

[Figure 4 about here. Panels: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy; same six algorithms versus time (×10² s).]
Fig. 4: Convergence performance comparison of PerFedS2, FedAvgS2, FedAvg-SYN, PerFed-SYN, FedAvg-ASY and PerFed-ASY using MNIST and CIFAR-100 datasets. In this case, the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, as for the PerFedS2 and FedAvgS2 algorithms, we set A = 5.

[Figure 5 about here. Panels: (a), (b) Shakespeare training loss and test accuracy with equal η; (c), (d) Shakespeare training loss and test accuracy with randomly distributed UEs; curves versus time (×10² s).]
Fig. 5: Convergence performance comparison of PerFedS2, FedAvgS2, FedAvg-SYN, PerFed-SYN, FedAvg-ASY and PerFed-ASY using the Shakespeare dataset. For (a) and (b), η1 = η2 = · · · = ηn, and for (c) and (d), the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, as for the PerFedS2 and FedAvgS2 algorithms, we set A = 50.

As for the Shakespeare dataset, we find that all the conclusions about the comparisons between the six algorithms drawn from the above two datasets still hold.
[Figure 6 about here. Panels: (a), (b) MNIST test accuracy; (c), (d) Shakespeare test accuracy; curves for FedAvgS2, FedProxS2 and PerFedS2 versus time (×10² s).]
Fig. 6: Convergence performance comparison of PerFedS2, FedAvgS2 and FedProxS2. For (a), we use the MNIST dataset and η1 = η2 = · · · = ηn. For (b), we use the MNIST dataset and the distance from UEs to the server obeys the random distribution from 0 to 200 m. For (c), we use the Shakespeare dataset and η1 = η2 = · · · = ηn. And for (d), we use the Shakespeare dataset and the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, we set A = 5 for the MNIST dataset and A = 50 for the Shakespeare dataset.

[Figure 7 about here. Panels: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy, for PerFedS2 under different non-i.i.d. levels l, versus time (×10² s).]
Fig. 7: Convergence performance of PerFedS2 with respect to the non-i.i.d. level l of data sampled from the MNIST and CIFAR-100 datasets. We compare the results when l = 2, 4, 6, 8 for data sampled from the MNIST dataset, and l = 3, 5, 7, 9 for data sampled from the CIFAR-100 dataset.

The comparison between FedAvgS2, FedProxS2 and PerFedS2 using the MNIST and Shakespeare datasets is shown in Fig. 6. From the figure it is obvious that PerFedS2 outperforms the other two algorithms. This is reasonable since Per-FedAvg has already been verified in previous works to provide better convergence performance, and PerFedS2 is designed based on Per-FedAvg. Therefore, PerFedS2 inherits this benefit.

2) Effect of the non-i.i.d. level l: Fig. 7 shows the evaluation results of PerFedS2 under different non-i.i.d. levels. It is obvious that for both datasets, the higher the heterogeneity level is, the worse the convergence performance becomes. These results are natural and in line with the theory.

3) Effect of the number of participants in each round A: Fig. 8 and Fig. 9 show the convergence performance of PerFedS2 with respect to different numbers of participating UEs A in each round, where Fig. 8 corresponds to the case where all UEs have the same η_i, whereas Fig. 9 corresponds to the case where the η_i of each UE is determined by its distance to the central server, which follows a random distribution.

As for the MNIST dataset, the results shown in Fig. 8 and Fig. 9 indicate a situation where the larger the number of participating UEs in each round, the poorer the convergence performance is. This conclusion is not always true, given that the relative participation frequency vector η = [η_1, η_2, . . . , η_n] in Fig. 9 is generated randomly according to the distances from the UEs to the central server, and thus the optimal A to minimize the overall training time is random. We can only conclude that, in this very specific case of η, the larger the number of participating UEs in each round, the better. Nevertheless, the benefit gained from a smaller value of A is slight in Fig. 9. This is reasonable because the randomly generated η may result in a scheduling pattern that dilutes the influence of the number of participating UEs in each round.

However, as for the CIFAR-100 dataset, although Fig. 8c and 8d still indicate the same conclusion as that for the MNIST dataset, Fig. 9c and 9d show another situation, where the convergence performance of PerFedS2 is best when A = 10. This result verifies the point mentioned above, namely that the conclusion obtained from the MNIST dataset is not always true. The results shown in Fig. 9c and 9d indicate a specific case where A = 10 approaches the optimal A*.

4) Effect of the staleness threshold S: Finally, we evaluate the effect of the staleness threshold S on the convergence performance of PerFedS2, with the results shown in Fig. 10. Here, in order to make the effect of S clearer, we use the simpler setting in which all UEs have the same η_i and A = 5. Therefore, when S ≥ 5, all the scheduled UEs arrive at the server within S rounds. Consequently, we study the change of the total training time when S = 1, 2, 3, 4, 5.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13

[Fig. 8 plots omitted: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy; x-axes show time (×10²s), curves correspond to PerFedS2 with A = 5, 10, 15.]

Fig. 8: Convergence performance of PerFedS2 with respect to the number of UEs A that participate in the global model update in each round, using the MNIST and CIFAR-100 datasets. In this case, η1 = η2 = · · · = ηn. Meanwhile, we compare the results for A = 5, 10, 15.
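Since PerFedS2 is built on Per-FedAvg, every curve above is driven by a MAML-style local update of the form w ← w − β∇fi(w − α∇fi(w)). Below is a minimal first-order sketch of that step for a single UE; the step sizes, the single inner step, and the identity approximation of the inner Jacobian are assumptions for illustration rather than the paper's exact local routine.

```python
import numpy as np

def perfedavg_local_step(w, grad_fn, alpha, beta):
    """One first-order Per-FedAvg-style local step for a single UE.

    w       : model parameters received from the server (np.ndarray).
    grad_fn : callable returning a stochastic gradient of the UE's loss at a point.
    alpha   : inner (personalization) step size.
    beta    : outer (meta) step size.
    The Jacobian of the inner step is approximated by the identity (first-order MAML).
    """
    w_adapted = w - alpha * grad_fn(w)   # inner step: adapt the model to local data
    meta_grad = grad_fn(w_adapted)       # gradient of the local loss at the adapted point
    return w - beta * meta_grad          # outer step reported back toward the server
```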

[Fig. 9 plots omitted: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy; x-axes show time (×10²s), curves correspond to PerFedS2 with A = 5, 10, 15.]
Fig. 9: Convergence performance of PerFedS2 with respect to the number of UEs A that participate in the global model update in each round, using the MNIST and CIFAR-100 datasets. In this case, the distance from each UE to the server is drawn at random between 0 and 200 m. Meanwhile, we compare the results for A = 5, 10, 15.
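In the setting of Fig. 9, each UE's relative participation frequency ηi is induced by its random distance to the server. The sketch below shows one plausible way to generate such a profile and to draw A participants per round in proportion to ηi; the inverse-square decay and the proportional sampling rule are assumptions for illustration, not the scheduling policy actually derived in the paper.

```python
import numpy as np

def make_participation_profile(n, seed=0):
    """Assign each of n UEs a distance in (0, 200] m and a participation
    frequency eta_i that decays with distance (closer UEs report more often)."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(1.0, 200.0, size=n)   # distance to the server in metres
    rate = d ** (-2.0)                    # assumed path-loss-like decay with distance
    eta = rate / rate.sum()               # normalize so that sum_i eta_i = 1
    return d, eta

def sample_participants(eta, A, seed=None):
    """Draw A distinct UEs for one round with probability proportional to eta."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(eta), size=A, replace=False, p=eta)

# Example: 20 UEs, 5 participants per round.
d, eta = make_participation_profile(20)
round_ues = sample_participants(eta, A=5, seed=1)
```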

Note that in the theoretical analysis we impose the constraint ηi ≥ S/K. This constraint rules out the situations in which the staleness τki exceeds the staleness bound S, so that no updates are dropped by the central server. In practice, however, ηi is determined by a number of factors, for example, the distances from the UEs to the server or the transmit power of individual UEs. Therefore, in practice, the constraint ηi ≥ S/K cannot always be satisfied. When this happens to UE i, in order to keep ηi constant, other UEs may have to wait until the update from UE i finally arrives at the server, thereby prolonging the overall training time. This conclusion is verified by the results shown in Fig. 10, where the larger S is, the better the convergence performance of PerFedS2.

VII. CONCLUSIONS

We have proposed a new semi-synchronous PFL algorithm over mobile edge networks, PerFedS2, that not only mitigates the straggler problem caused by synchronous training, but also ensures a convergent training loss that may not be guaranteed in asynchronous training. This is achieved by optimizing the joint bandwidth allocation and UE scheduling problem. To solve this optimization problem, we first analysed the convergence rate of PerFedS2 and proved that a convergent upper bound on the convergence rate exists. Then, based on the convergence analysis, we solved the optimization problem by decoupling it into two sub-problems: the bandwidth allocation problem and the UE scheduling problem. For a given scheduling policy, the bandwidth allocation problem is shown to have infinitely many solutions; meanwhile, based on the convergence analysis of PerFedS2, the optimal UE scheduling policy can be determined using a greedy algorithm. We have conducted extensive experiments to verify the effectiveness of PerFedS2 in saving training time, compared with synchronous and asynchronous FL and PFL algorithms.

[Fig. 10 plots omitted: (a) MNIST training loss, (b) MNIST test accuracy, (c) CIFAR-100 training loss, (d) CIFAR-100 test accuracy; x-axes show time (×10²s), curves correspond to staleness thresholds S = 1, 2, 3, 4, 5.]
Fig. 10: Convergence performance comparison of PerFedS2 with respect to the staleness threshold S using the MNIST and CIFAR-100 datasets. In this case, η1 = η2 = · · · = ηn and A = 5. Meanwhile, we compare the results for S = 1, 2, 3, 4, 5.

APPENDIX
Proof of Theorem 1

Using Lemma 1, we have
\[
\begin{aligned}
F(w_{k+1}) - F(w_k) &\le \langle \nabla F(w_k),\, w_{k+1}-w_k\rangle + \frac{L_F}{2}\|w_{k+1}-w_k\|^2 \\
&= -\Big\langle \nabla F(w_k),\, \frac{\beta}{A}\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i})\Big\rangle
 + \frac{L_F}{2}\Big\|\frac{\beta}{A}\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i})\Big\|^2 .
\end{aligned}\tag{44}
\]
From the above inequality, it is obvious that the key is to bound the term \(\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i})\). Let
\[
\frac{1}{A}\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i}) = X + Y + \frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i}),\tag{45}
\]
where
\[
X = \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big(\tilde{\nabla} F_i(w_{k-\tau_k^i}) - \nabla F_i(w_{k-\tau_k^i})\big),\qquad
Y = \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big(\nabla F_i(w_{k-\tau_k^i}) - \nabla F(w_{k-\tau_k^i})\big).\tag{46}
\]
Our next step is to upper bound \(\mathbb{E}[\|X\|^2]\) and \(\mathbb{E}[\|Y\|^2]\), respectively. Recall the Cauchy–Schwarz inequality \(\|\sum_{i=1}^n a_i b_i\|^2 \le (\sum_{i=1}^n\|a_i\|^2)(\sum_{i=1}^n\|b_i\|^2)\). As for \(X\), applying the Cauchy–Schwarz inequality with \(a_i = \frac{1}{\sqrt{A}}(\tilde{\nabla} F_i(w_{k-\tau_k^i}) - \nabla F_i(w_{k-\tau_k^i}))\) and \(b_i = \frac{1}{\sqrt{A}}\), we have
\[
\|X\|^2 \le \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big\|\tilde{\nabla} F_i(w_{k-\tau_k^i}) - \nabla F_i(w_{k-\tau_k^i})\big\|^2 .\tag{47}
\]
Let \(\mathcal{F}_k\) denote the information up to round \(k\). Given that the set of scheduled UEs \(\mathcal{A}_k\) is selected according to their relative participation frequencies \(\eta_i\) (\(i\in\mathcal{A}_k\)), by using Lemma 2 along with the tower rule, we have
\[
\mathbb{E}[\|X\|^2] = \mathbb{E}\big[\mathbb{E}[\|X\|^2\mid \mathcal{F}_k]\big] \le \sigma_F^2\sum_{i\in\mathcal{A}_k}\eta_i .\tag{48}
\]
Meanwhile, as for \(Y\), applying the Cauchy–Schwarz inequality with \(a_i = \frac{1}{\sqrt{A}}(\nabla F_i(w_{k-\tau_k^i}) - \nabla F(w_{k-\tau_k^i}))\) and \(b_i = \frac{1}{\sqrt{A}}\), we have
\[
\|Y\|^2 \le \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big\|\nabla F_i(w_{k-\tau_k^i}) - \nabla F(w_{k-\tau_k^i})\big\|^2 .\tag{49}
\]
In a similar way, the mean of \(\|Y\|^2\) is the weighted average of \(\mathbb{E}[\|Y\|^2\mid\mathcal{F}_k]\), where the weight is the relative participation frequency of UE \(i\in\mathcal{A}_k\). By using Lemma 3 along with the tower rule, we have
\[
\mathbb{E}[\|Y\|^2] = \mathbb{E}\big[\mathbb{E}[\|Y\|^2\mid\mathcal{F}_k]\big] \le \gamma_F^2\sum_{i\in\mathcal{A}_k}\eta_i .\tag{50}
\]
Now getting back to inequality (44), from the fact \(\langle a,b\rangle = \frac{1}{2}(\|a\|^2+\|b\|^2-\|a-b\|^2)\), we have
\[
\begin{aligned}
F(w_{k+1}) - F(w_k)
&\le -\frac{\beta}{2}\|\nabla F(w_k)\|^2 - \frac{\beta}{2}\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i})\Big\|^2 \\
&\quad + \frac{\beta}{2}\Big\|\nabla F(w_k) - X - Y - \frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2
 + \frac{L_F\beta^2}{2}\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\tilde{\nabla} F_i(w_{k-\tau_k^i})\Big\|^2 \\
&\le -\frac{\beta}{2}\|\nabla F(w_k)\|^2 + L_F\beta^2\underbrace{\|X+Y\|^2}_{T_1}
 + \beta\underbrace{\Big\|\nabla F(w_k) - \frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2}_{T_2} \\
&\quad + (L_F\beta^2-\beta)\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2 .
\end{aligned}\tag{51}
\]
Our next step is to estimate the upper bounds of \(\mathbb{E}[T_1]\) and \(\mathbb{E}[T_2]\), respectively. As for \(T_1\), we have
\[
\mathbb{E}[T_1] \le 2\mathbb{E}[\|X\|^2] + 2\mathbb{E}[\|Y\|^2] \le 2(\sigma_F^2+\gamma_F^2)\sum_{i\in\mathcal{A}_k}\eta_i .\tag{52}
\]
As for \(T_2\), we have
\[
\begin{aligned}
T_2 &= \Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\big(\nabla F(w_k)-\nabla F(w_{k-\tau_k^i})\big)\Big\|^2
 \le \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big\|\nabla F(w_k)-\nabla F(w_{k-\tau_k^i})\big\|^2 \\
&\le \frac{1}{A}\sum_{i\in\mathcal{A}_k}\big\|L_F(w_k-w_{k-\tau_k^i})\big\|^2
 \le \max_{i\in\mathcal{A}_k}\big\|L_F(w_k-w_{k-\tau_k^i})\big\|^2
 = L_F^2\|w_k-w_{k-\tau_k^\mu}\|^2 ,
\end{aligned}\tag{53}
\]
where \(\mu = \arg\max_{i\in\mathcal{A}_k}\|L_F(w_k-w_{k-\tau_k^i})\|^2\); the first inequality is obtained from the fact that \(\|\sum_{i=1}^n a_i\|^2 \le n\sum_{i=1}^n\|a_i\|^2\), the second inequality is derived from Lemma 1, and the third inequality comes from the fact that \(\frac{1}{n}\sum_{i=1}^n\|a_i\| \le \max_i\|a_i\|\). It follows that
\[
\begin{aligned}
T_2 &\le L_F^2\|w_k-w_{k-\tau_k^\mu}\|^2
 = L_F^2\Big\|\sum_{j=k-\tau_k^\mu}^{k-1}(w_{j+1}-w_j)\Big\|^2
 = L_F^2\beta^2\Big\|\sum_{j=k-\tau_k^\mu}^{k-1}\frac{1}{A}\sum_{i\in\mathcal{A}_j}\tilde{\nabla} F_i(w_{j-\tau_j^i})\Big\|^2 \\
&\le L_F^2\beta^2 S\sum_{j=k-S}^{k-1}\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_j}\tilde{\nabla} F_i(w_{j-\tau_j^i})\Big\|^2
 \le 2L_F^2\beta^2 S^2\|X+Y\|^2 + 2L_F^2\beta^2 S^2\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_j}\nabla F(w_{j-\tau_j^i})\Big\|^2 .
\end{aligned}\tag{54}
\]
Taking expectation on both sides of (54), we have
\[
\mathbb{E}[T_2] \le 4L_F^2\beta^2 S^2(\sigma_F^2+\gamma_F^2)\sum_{i\in\mathcal{A}_k}\eta_i
 + 2L_F^2\beta^2 S^2\,\mathbb{E}\Big[\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2\Big].\tag{55}
\]
Note that \(\sum_{i\in\mathcal{A}_k}\eta_i = \sum_{i\in\mathcal{U}}\pi_k^i\eta_i\); we have
\[
\Big(\sum_{i\in\mathcal{U}}\pi_k^i\eta_i\Big)^2 \le \sum_{i\in\mathcal{U}}(\pi_k^i)^2\sum_{i\in\mathcal{U}}\eta_i^2
 = \sum_{i\in\mathcal{U}}\pi_k^i\sum_{i\in\mathcal{U}}\eta_i^2
 = A\sum_{i\in\mathcal{U}}\eta_i^2 \le A,\tag{56}
\]
where the first equality is derived from the fact that \((\pi_k^i)^2 = \pi_k^i\), the second equality is derived from the fact that \(\sum_{i\in\mathcal{U}}\pi_k^i = A\), and the last inequality is derived from the fact that \(\eta_i < 1\) and \(\sum_{i\in\mathcal{U}}\eta_i = 1\). As a result, we have
\[
\sum_{i\in\mathcal{A}_k}\eta_i \le \sqrt{A}.\tag{57}
\]
Now getting back to (51), we have
\[
\begin{aligned}
\mathbb{E}[F(w_{k+1})] - \mathbb{E}[F(w_k)]
&\le -\frac{\beta}{2}\mathbb{E}[\|\nabla F(w_k)\|^2]
 + (2L_F\beta^2 + 4L_F^2\beta^3 S^2)(\sigma_F^2+\gamma_F^2)\sqrt{A} \\
&\quad + (L_F\beta^2 - \beta + 2L_F^2\beta^2 S^2)\,\mathbb{E}\Big[\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2\Big].
\end{aligned}\tag{58}
\]
Summing this inequality from \(k=0\) to \(k=K-1\), we have
\[
\begin{aligned}
\mathbb{E}[F(w_K)] - F(w_0)
&\le -\frac{\beta}{2}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(w_k)\|^2]
 + K(2L_F\beta^2 + 4L_F^2\beta^3 S^2)(\sigma_F^2+\gamma_F^2)\sqrt{A} \\
&\quad + (L_F\beta^2 - \beta + 2L_F^2\beta^2 S^2)\,\mathbb{E}\Big[\sum_{k=0}^{K-1}\Big\|\frac{1}{A}\sum_{i\in\mathcal{A}_k}\nabla F(w_{k-\tau_k^i})\Big\|^2\Big] \\
&\le -\frac{\beta}{2}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(w_k)\|^2]
 + K(2L_F\beta^2 + 4L_F^2\beta^3 S^2)(\sigma_F^2+\gamma_F^2)\sqrt{A},
\end{aligned}\tag{59}
\]
where the last inequality is due to (27). As a result, the desired result is obtained.
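As an informal numerical sanity check of the chain (56)–(57) (not part of the original proof), the snippet below draws a random participation-frequency vector η on the simplex, picks A scheduled UEs at random, and verifies that (Σ_{i∈Ak} ηi)² ≤ A Σ_i ηi² ≤ A, which implies Σ_{i∈Ak} ηi ≤ √A.

```python
import numpy as np

rng = np.random.default_rng(0)
n, A, trials = 50, 10, 10_000
ok = True
for _ in range(trials):
    eta = rng.dirichlet(np.ones(n))              # eta_i > 0, sum_i eta_i = 1
    scheduled = rng.choice(n, size=A, replace=False)
    lhs = eta[scheduled].sum() ** 2              # (sum_{i in A_k} eta_i)^2
    mid = A * np.sum(eta ** 2)                   # A * sum_i eta_i^2
    ok &= (lhs <= mid + 1e-12) and (mid <= A + 1e-12)
print("chain (56) holds in all trials:", ok)     # implies sum_{i in A_k} eta_i <= sqrt(A)
```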

Chaoqun You (S'13–M'20) is a postdoctoral research fellow at the Singapore University of Technology and Design (SUTD). She received the B.S. degree in communication engineering and the Ph.D. degree in communication and information systems from the University of Electronic Science and Technology of China (UESTC) in 2013 and 2020, respectively. She was a visiting student at the University of Toronto from 2015 to 2017. Her current research interests include mobile edge computing, network virtualization, federated learning, meta-learning, and 6G.

Daquan Feng received the Ph.D. degree in information engineering from the National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu, China, in 2015. From 2011 to 2014, he was a visiting student with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA. After graduation, he was a Research Staff with the State Radio Monitoring Center, Beijing, China, and then a Postdoctoral Research Fellow with the Singapore University of Technology and Design, Singapore. He is now an associate professor with the Shenzhen Key Laboratory of Digital Creative Technology, the Guangdong Province Engineering Laboratory for Digital Creative Technology, the Guangdong-Hong Kong Joint Laboratory for Big Data Imaging and Communication, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His research interests include URLLC communications, MEC, and massive IoT networks. Dr. Feng is an Associate Editor of IEEE COMMUNICATIONS LETTERS, Digital Communications and Networks, and ICT Express.

Kun Guo (Member, IEEE) received the B.E. degree in telecommunications engineering from Xidian University, Xi'an, China, in 2012, where she received the Ph.D. degree in communication and information systems in 2019. From 2019 to 2021, she was a Post-Doctoral Research Fellow with the Singapore University of Technology and Design (SUTD), Singapore. Currently, she is a Zijiang Young Scholar with the School of Communications and Electronics Engineering at East China Normal University, Shanghai, China. Her research interests include edge computing, caching, and intelligence.

Howard H. Yang (S'13–M'17) received the B.E. degree in communication engineering from Harbin Institute of Technology (HIT), China, in 2012, the M.Sc. degree in electronic engineering from Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2013, and the Ph.D. degree in electrical engineering from Singapore University of Technology and Design (SUTD), Singapore, in 2017. He was a Postdoctoral Research Fellow at SUTD from 2017 to 2020, a Visiting Postdoc Researcher at Princeton University from 2018 to 2019, and a Visiting Student at the University of Texas at Austin from 2015 to 2016. Currently, he is an assistant professor with the Zhejiang University/University of Illinois at Urbana-Champaign Institute (ZJU-UIUC Institute), Zhejiang University, Haining, China. He is also an adjunct assistant professor with the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign, IL, USA. Dr. Yang's research interests cover various aspects of wireless communications, networking, and signal processing, currently focusing on the modeling of modern wireless networks, high-dimensional statistics, graph signal processing, and machine learning. He serves as an editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS. He received the IEEE WCSP 10-Year Anniversary Excellent Paper Award in 2019 and the IEEE WCSP Best Paper Award in 2014.

Chenyuan Feng (S'16–M'21) received the B.E. degree in electrical and electronics engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2016, and the Ph.D. degree in information system technology and design from Singapore University of Technology and Design (SUTD), Singapore, in 2021. She is currently a postdoctoral researcher with the Shenzhen Key Laboratory of Digital Creative Technology, Shenzhen University. Her research interests include edge computing, federated learning, graph signal processing, and recommendation systems. She received the IEEE ComComAp Best Paper Award in 2021.

Tony Q. S. Quek (S'98–M'08–SM'12–F'18) received the B.E. and M.E. degrees in electrical and electronics engineering from the Tokyo Institute of Technology in 1998 and 2000, respectively, and the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology in 2008. Currently, he is the Cheng Tsang Man Chair Professor with Singapore University of Technology and Design (SUTD). He also serves as the Director of the Future Communications R&D Programme, the Head of ISTD Pillar, and the Deputy Director of the SUTD-ZJU IDEA. His current research topics include wireless communications and networking, network intelligence, internet-of-things, URLLC, and 6G.

Dr. Quek has been actively involved in organizing and chairing sessions, and has served as a member of the Technical Program Committee as well as symposium chair in a number of international conferences. He is currently serving as an Area Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS and an elected member of the IEEE Signal Processing Society SPCOM Technical Committee. He was an Executive Editorial Committee Member for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, an Editor for the IEEE TRANSACTIONS ON COMMUNICATIONS, and an Editor for the IEEE WIRELESS COMMUNICATIONS LETTERS.

Dr. Quek was honored with the 2008 Philip Yeo Prize for Outstanding Achievement in Research, the 2012 IEEE William R. Bennett Prize, the 2015 SUTD Outstanding Education Awards – Excellence in Research, the 2016 IEEE Signal Processing Society Young Author Best Paper Award, the 2017 CTTC Early Achievement Award, the 2017 IEEE ComSoc AP Outstanding Paper Award, the 2020 IEEE Communications Society Young Author Best Paper Award, the 2020 IEEE Stephen O. Rice Prize, the 2020 Nokia Visiting Professor, and the 2016–2020 Clarivate Analytics Highly Cited Researcher. He is a Fellow of IEEE.
