Semi-Synchronous Personalized Federated Learning
Abstract—Personalized Federated Learning (PFL) is a new Federated Learning (FL) approach that addresses the heterogeneity of the datasets generated by distributed user equipments (UEs). However, most existing PFL implementations rely on synchronous training to ensure good convergence performance, which may lead to a serious straggler problem, where the training time is heavily prolonged by the slowest UE. To address this issue, we propose a semi-synchronous PFL algorithm over mobile edge networks, termed Semi-Synchronous Personalized FederatedAveraging (PerFedS2). By jointly optimizing the wireless bandwidth allocation and the UE scheduling policy, it not only mitigates the straggler problem but also provides convergent training loss guarantees. We derive an upper bound on the convergence rate of PerFedS2 in terms of the number of participants per global round and the number of rounds. On this basis, the bandwidth allocation problem can be solved analytically, and the UE scheduling policy can be obtained by a greedy algorithm. Experimental results verify the effectiveness of PerFedS2 in saving training time as well as guaranteeing the convergence of the training loss, in contrast to synchronous and asynchronous PFL algorithms.

Index Terms—Semi-synchronous implementation, personalized federated learning, mobile edge networks

This paper was supported in part by the National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme, in part by MOE ARF Tier 2 under Grant T2EP20120-0006, in part by the National Science and Technology Major Project under Grant 2020YFB1807601, in part by the Shenzhen Science and Technology Program under Grant JCYJ20210324095209025, in part by the Shanghai Pujiang Program under Grant No. 21PJ1402600, in part by the National Natural Science Foundation of China under Grant 62201504, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LGJ22F010001. (Corresponding author: Daquan Feng)
C. You and T. Quek are with the Wireless Networks and Design Systems Group, Singapore University of Technology and Design, 487372, Singapore (e-mail: chaoqun_you, [email protected]).
D. Feng and C. Feng are with Shenzhen University, Shenzhen 518052, China (e-mail: fdquan, [email protected]).
K. Guo is with East China Normal University, Shanghai 200241, China (e-mail: [email protected]).
H. H. Yang is with the Zhejiang University/University of Illinois at Urbana-Champaign Institute, Zhejiang University, Haining 314400, China (e-mail: [email protected]).

I. INTRODUCTION

FEDERATED Learning (FL) is a new distributed machine learning paradigm that enables model training across multiple user equipments (UEs) without uploading their raw data to a central parameter server [1]. Since its advent, FL has been widely adopted as a powerful tool to exploit the wealth of data available at end-user devices [2, 3] and to foster new applications such as Artificial Intelligence (AI) medical diagnosis [4] and autonomous vehicles [5]. Training an FL model involves three typical steps: (i) a set of UEs conduct local computing based on their own datasets and upload the resultant parameters to the server, (ii) the server aggregates the UEs' parameters and improves the global model, and (iii) the server feeds the new model back to the UEs for another round of local computing. This procedure repeats until the loss function starts to converge and a certain model accuracy is achieved.

With the substantial improvement in the sensing capabilities and computational power of edge devices, UEs are producing abundant but diverse data [6]. The increasingly diverse datasets breed a demand for customized services on individual UEs. Typical examples of potential applications include Vehicle-to-everything (V2X) communications, where vehicles in the network may experience various road conditions and driving habits, making the local model disparate from the global model [7, 8]; and recommendation systems, where local servers have potentially heterogeneous customers and non-independent and identically distributed (non-i.i.d.) item popularities, and thus require fine-grained recommendations [9, 10]. However, conventional FL algorithms are designed to learn a common model, which may have mediocre performance on certain UEs, and the situation is exacerbated as ever-developing mobile UEs generate increasingly diverse data. To address this issue, Personalized Federated Learning (PFL) [11, 12] has been proposed. Specifically, PFL provides an initial model that is good enough for the UEs to start with. Using this initial model, each UE can quickly adapt to its local dataset with one or more gradient descent steps using only a few data points. As a result, the UEs (especially those with heterogeneous datasets) are able to enjoy fast personalized models by adapting the global model to their local datasets.

Nonetheless, most PFL implementations adopt synchronous training to ensure good convergence performance [11, 13-16]. In the synchronous setting, the central server has to wait until the arrival of the parameters of the slowest UE before it can update the global model. As a consequence, synchronous training may cause a severe straggler problem in PFL, where the deceleration of any UE can delay all other UEs. On the other hand, the parameters of the UEs may arrive at the server at different speeds for reasons such as varying CPU processing capabilities and different wireless channel conditions. This difference begets another operation mechanism: asynchronous training. The key idea of the asynchronous implementation is to let all UEs work independently while the server updates the global model every time it receives an update from any UE [17-19]. Although this model updating strategy avoids the waiting time of UEs, the gradient staleness caused by asynchronous updating further degrades the performance of
the model training. At this point, a semi-synchronous PFL becomes a natural choice to balance the disadvantages of the synchronous and the asynchronous PFL algorithms.

Although there have been several works on semi-synchronous FL algorithms [20-23], the semi-synchronous PFL problem is not well understood. [20] studied a semi-asynchronous protocol for fast FL. [21] proposed a semi-asynchronous FL algorithm for heterogeneous edge computing. [22] introduced a novel energy-efficient semi-asynchronous FL protocol that mixes local models periodically with minimal idle time and fast convergence. Finally, [23] proposed a clustered semi-asynchronous FL algorithm that groups UEs by the delay and direction of the clients' model updates to make the most of the advantages of both synchronous and asynchronous FL. Designing a semi-synchronous PFL scheme for mobile edge networks, however, is particularly challenging for the following reasons: (1) The convergence rate of a semi-synchronous PFL algorithm is unclear. Moreover, the loss function of a deep learning model is usually non-convex, and whether a semi-synchronous PFL algorithm can converge, and under what conditions, is of much interest. (2) Practical wireless communication environments need to be considered. It is non-trivial to decide the UE scheduling policy of a semi-synchronous PFL algorithm while accounting for the wireless bandwidth allocation.

In this paper, we propose a semi-synchronous PFL algorithm over mobile edge networks, named Semi-Synchronous Personalized FederatedAveraging (PerFedS2), that mitigates the straggler problem in PFL. This is done by optimizing a joint bandwidth allocation and UE scheduling problem. To solve this problem, we first analyse the convergence rate of PerFedS2 with non-convex loss functions. Our analysis characterizes the upper bound of the convergence rate in terms of two decision variables: the number of scheduled UEs in each communication round and the number of communication rounds. Based on this upper bound, the joint bandwidth allocation and UE scheduling optimization problem can be solved separately. For the bandwidth allocation problem, we find that, for a given UE scheduling policy, there exist infinitely many bandwidth solutions that minimize the overall training time. For the UE scheduling problem, facilitated by the results obtained from the convergence analysis, the optimal number of UEs scheduled to update the global model in each communication round and the optimal number of communication rounds can be estimated. These results lead us to design a greedy algorithm that gives the UE scheduling policy. Finally, with the optimal bandwidth allocation and the UE scheduling policy, we are able to implement PerFedS2 over mobile edge networks.

To summarize, in this paper we make the following contributions:
• We propose a new semi-synchronous PFL algorithm, PerFedS2, over mobile edge networks. PerFedS2 strikes a good balance between synchronous and asynchronous PFL algorithms. Particularly, by solving a joint bandwidth allocation and UE scheduling problem, it not only mitigates the straggler problem caused by synchronous training but also alleviates the potential divergence issue in asynchronous training.
• We derive the convergence rate of PerFedS2. Our analysis characterizes the upper bound of the convergence rate as a function of the number of UEs scheduled to update the global model in each communication round and the number of communication rounds.
• We solve the optimization problem by decoupling it into two sub-problems: a bandwidth allocation problem and a UE scheduling problem. While the optimal bandwidth allocation is proved to minimize the overall training time within a range of values, the UE scheduling policy can be determined using a greedy online algorithm.
• We conduct extensive experiments using the MNIST, CIFAR-100 and Shakespeare datasets to demonstrate the effectiveness of PerFedS2 in saving overall training time as well as providing a convergent training loss, compared with four baselines, namely the synchronous and asynchronous FL and PFL algorithms.

The rest of the paper is organized as follows. In Section II we introduce the basic learning process of PerFedS2. Then, in Section III, we formulate a joint bandwidth allocation and UE scheduling problem to quantify and maximize the benefits that PerFedS2 brings compared with synchronous and asynchronous training. In order to solve the optimization problem, we first analyse the convergence rate of PerFedS2 in Section IV. Then, we solve the joint optimization problem in Section V. Finally, we evaluate the performance of PerFedS2 in Section VI.

II. SEMI-SYNCHRONOUS PERSONALIZED FEDERATED LEARNING MECHANISM

In this section, we propose PerFedS2 to mitigate the drawbacks of synchronous and asynchronous PFL algorithms. For a better understanding of the proposed algorithm, we commence by reviewing FL and PFL in Section II-A and Section II-B, respectively. Then, we formally introduce PerFedS2 in Section II-C.

A. Review: Federated Learning

Consider a set of n UEs connected to the server via a BS, where each UE i has local data (x, y) ∈ X_i × Y_i. If we define f_i : R^m → R as the loss function corresponding to UE i, and w as the model parameter that the server needs to learn, then the goal of the server is to solve

min_{w ∈ R^m} f(w) := (1/n) Σ_{i=1}^{n} f_i(w),   (1)

where f_i represents the expected loss over the data distribution of UE i, which is formalized as

f_i(w) := E_{(x,y)∼H_i}[l_i(w; x, y)],   (2)

where l_i(w; x, y) measures the error of model w in predicting the true label y, and H_i is the distribution over X_i × Y_i.

The datasets residing on different UEs are usually non-i.i.d. and unbalanced, whereas the global model trained by
FedAvg concentrates on the average performance of all the UEs. The resultant model may therefore perform very poorly on certain individual UEs. In response, PFL has been proposed to capture the statistical heterogeneity among UEs by adapting the global model to local datasets. We review this scheme in the next subsection.

B. Review: Personalized Federated Learning

In contrast to standard FL, PFL approaches the solution of (1) via Model-Agnostic Meta-Learning (MAML). Specifically, the target of PFL is to learn an initial model that adapts quickly to each UE through one or more gradient steps with only a few data points on the UEs. Such an initial model is commonly known as the meta model, and the local model after adaptation is referred to as the fine-tuned model.

Formally, if each UE takes the initial model and updates it via one step of gradient descent using its own loss function, problem (1) can be written as

min_{w ∈ R^m} F(w) := (1/n) Σ_{i=1}^{n} f_i(w − α∇f_i(w)),   (3)

where α ≥ 0 is the learning rate at the individual UEs. Note that we use the same learning rate for all UEs in this paper for simplicity. This assumption can be easily extended to the general case where UEs have diverse learning rates α_i, as long as α_i ≥ 0. For each UE i, its optimization objective F_i can be computed as

F_i(w) := f_i(w − α∇f_i(w)).   (4)

Unlike conventional FL, after receiving the current global model, a UE in PFL first adapts the global model to its local data with one step of gradient descent, and then computes local gradients with respect to the model after the adaptation. This step of local adaptation captures the differences between UEs, and the model learned with this new formulation (3) is proved to be a good initial point for any UE to start from for fast adaptation [24, 25].

Many existing works on PFL are limited to the context of synchronous learning, where the faster UEs have to wait until all the others arrive at the server before moving to the next communication round [11, 13-16]. As a result, synchronous PFL often suffers from the straggler problem due to the prolonged waiting time for the slowest UE. On the other hand, PFL can also be trained in an asynchronous manner, where the server performs a global update as soon as it receives a local model from any UE. In this scenario, some slower UEs will bring stale gradient updates to the server, thereby degrading the convergence performance of the model training. Therefore, in this paper, we propose a semi-synchronous PFL mechanism that seeks a trade-off between synchronous and asynchronous PFL algorithms, which is detailed in the following subsection.

C. Semi-Synchronous Personalized Federated Learning

We propose a semi-synchronous PFL mechanism, which is a trade-off between synchronous and asynchronous PFL. We term this semi-synchronous PFL algorithm Semi-Synchronous Personalized FederatedAveraging (PerFedS2).

Algorithm 1: Semi-Synchronous Personalized Federated Averaging (PerFedS2)
1 for k = 0, 1, . . . , K − 1 do
2     Processing at Each UE i
3     if receive w_k from the server then
4         Compute the local gradient ∇F̃_i(w_k) by Eq. (7)
          Upload ∇F̃_i(w_k) to the server
5     end
6     Processing at the Parameter Server
7     A_k = ∅
8     while |A_k| < A do
9         Receive a local gradient ∇F̃_i(w_k) from UE i
10        A_k = A_k ∪ {i}
11    end
12    Update the global model to w_{k+1} by Eq. (8)
13    for i ∈ U do
14        if i ∈ A_k or τ_k^i > S then
15            Distribute w_{k+1} to UE i
16        end
17    end
18 end

PerFedS2 is formally described in Alg. 1. At the UE side (Lines 2-5), upon receiving a global model, or equivalently the meta model w_k, the UE adapts w_k to its local dataset to obtain the gradient of its local function, in this case the gradient ∇F_i, which is given by

∇F_i(w_k) = (I − α∇²f_i(w_k)) ∇f_i(w_k − α∇f_i(w_k)).   (5)

At the server side (Lines 6-12), let A_k be the set of UEs participating in the global update in round k, with cardinality |A_k| = A. Let τ_k^i be the interval between the current round k and the round of the last global model version received by UE i. Such an interval reflects the staleness of local updates. With this notion, we can write the gradient received by the BS at round k from UE i as ∇F_i(w_{k−τ_k^i}). Upon receiving A local gradients, the server updates the global model parameter as follows:

w_{k+1} = w_k − (β/A) Σ_{i∈A_k} ∇F_i(w_{k−τ_k^i}),   (6)

where β > 0 is the global step size. Then, the server distributes the new global model w_{k+1} to either (a) the UEs in A_k or (b) those with a staleness larger than the staleness threshold S.

Due to the vast volume of the datasets, computing the exact gradient at each UE is costly. Therefore, we use stochastic gradient descent (SGD) [26] as a proxy. Specifically, a generic UE i samples a subset of data points to calculate an unbiased estimate ∇̃f_i(w_k; D_i) of ∇f_i(w_k), where D_i represents a portion of UE i's local dataset with size |D_i| = D_i. Similarly, the Hessian ∇² in (5) can be replaced by its unbiased estimate ∇̃²f_i(w_k; D_i). At this point, the actual gradient computed by UE i is the stochastic gradient of the local loss function, ∇F̃_i(w_k), obtained by replacing the gradients and Hessian in (5) with their mini-batch estimates, i.e.,

∇F̃_i(w_k) = (I − α∇̃²f_i(w_k; D_i)) ∇̃f_i(w_k − α∇̃f_i(w_k; D_i); D_i).   (7)
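To make the update rules (5)-(7) concrete, the following minimal Python sketch illustrates one round of a PerFedS2-style computation on toy quadratic losses. It is our own illustration rather than the authors' implementation: the function names, the toy losses, and the use of exact gradients and Hessians in place of the mini-batch estimates in (7) are all simplifying assumptions.

import numpy as np

def local_meta_gradient(w, grad_fn, hess_fn, alpha):
    # Meta-gradient of F_i at w, cf. (5): (I - alpha*Hess f_i(w)) grad f_i(w - alpha*grad f_i(w)).
    g = grad_fn(w)                          # (estimate of) grad f_i(w)
    w_adapted = w - alpha * g               # one-step local adaptation
    H = hess_fn(w)                          # (estimate of) Hessian of f_i at w
    return (np.eye(len(w)) - alpha * H) @ grad_fn(w_adapted)

def server_update(w, received_grads, beta):
    # Semi-synchronous global update, cf. (6): average the A received (possibly stale) gradients.
    return w - beta * np.mean(received_grads, axis=0)

# Toy example: two UEs with quadratic losses f_i(w) = 0.5 * ||w - c_i||^2.
c = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
alpha, beta, w = 0.1, 0.5, np.zeros(2)
grads = [local_meta_gradient(w, lambda v, ci=ci: v - ci, lambda v: np.eye(2), alpha)
         for ci in c]
w = server_update(w, grads, beta)
print(w)

In a full implementation, each UE would replace grad_fn and hess_fn with mini-batch estimates over D_i, and the server would additionally track the staleness τ_k^i to decide which UEs receive w_{k+1}, as in Lines 13-17 of Algorithm 1.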
Fig. 1 (figure): PerFedS2 training timeline, showing the global model versions exchanged by UEs 1-4 over rounds t1-t5, with computation and communication phases.
The relative participation frequency of UE i is formally defined as

η_i = (Σ_{k=1}^{K} π_k^i) / (Σ_{k=1}^{K} Σ_{i=1}^{n} π_k^i) = (Σ_{k=1}^{K} π_k^i) / (AK).   (15)

Notably, the staleness bound S provides a lower bound on η_i, that is, η_i ≥ S/K (∀i ∈ U).

C. Problem Formulation

PerFedS2 significantly increases the proportion of time UEs spend on computing, as opposed to waiting. Meanwhile, PerFedS2 also upper bounds the staleness caused by updates from slow UEs. Let T be the overall training time over K communication rounds. Then the objective of PerFedS2 is to minimize the loss function as well as the overall training time. Formally, the optimization problem of PerFedS2 is formulated as follows¹:

min_{b, Π, A, K}  F(w)   (P1)
s.t.  min_b Σ_{k=1}^{K} max_{i∈A_k} {T_k^i} = T,  ∀i ∈ U,   (C1.1)
      Σ_{i=1}^{n} b_k^i ≤ B,  k = 1, 2, . . . , K,   (C1.2)
      Σ_{j=k−τ_k^i}^{k−τ_k^i+S} π_j^i ≥ 1,  ∀i ∈ U,   (C1.3)
      Σ_{j=k−τ_k^i}^{k} Z_j^i ≤ Z,   (C1.4)
      K ≥ S/η_i,  ∀i ∈ U,   (C1.5)

where b ≜ [b_1, b_2, . . . , b_K] denotes the bandwidth allocation matrix up to round K, and b_k = [b_k^1, b_k^2, . . . , b_k^n]. (C1.1) is the overall training time constraint: for each communication round k, the round time is determined by the maximum of T_k^i over i ∈ A_k, and the total time up to round K equals T. (C1.2) is the bandwidth constraint: the bandwidth allocated to all UEs in every communication round shall not exceed the available bandwidth B. (C1.3) stipulates the staleness constraint on the updates: during any S consecutive communication rounds, UE i must be scheduled to update the global model at least once. (C1.4) limits the number of bits transmitted: note that Z_k^i is determined by b_k^i, and the number of bits transmitted during τ_k^i rounds shall not be larger than the size Z of the model parameters. Finally, (C1.5) follows from the lower bound derived in the previous subsection.

¹ Besides the bandwidth allocation and UE scheduling policy, other decision variables, such as the transmit power, can also be included in the problem formulation. The logic stays the same, but the parameters that need to be considered might change. Problem (P1) shows the case where we consider the bandwidth allocation and the UE scheduling policy as variables, and one is free to extend this general formulation to other forms.

IV. CONVERGENCE ANALYSIS

In this section, we first introduce some definitions and assumptions on the loss functions of PerFedS2 in Section IV-A. Then we analyse its convergence rate in Section IV-B.
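The bookkeeping behind (15) and the staleness constraint (C1.3) above can be illustrated with a short Python sketch; the scheduling pattern below is a hypothetical example of our own, not one produced by the paper's scheduling algorithm.

def participation_frequency(pi):
    # eta_i = (sum_k pi_k^i) / (A*K), cf. (15); the denominator equals the total number
    # of scheduled slots, which is A*K when |A_k| = A in every round.
    K, n = len(pi), len(pi[0])
    total_slots = sum(sum(row) for row in pi)
    return [sum(pi[k][i] for k in range(K)) / total_slots for i in range(n)]

def satisfies_staleness(pi, S):
    # (C1.3): every UE must be scheduled at least once in any window of S consecutive rounds.
    K, n = len(pi), len(pi[0])
    return all(any(pi[k][i] for k in range(start, min(start + S, K)))
               for i in range(n) for start in range(K - S + 1))

# Hypothetical pattern: n = 4 UEs, A = 2 scheduled per round, K = 4 rounds.
pi = [[1, 1, 0, 0],
      [0, 0, 1, 1],
      [1, 1, 0, 0],
      [0, 0, 1, 1]]
print(participation_frequency(pi))   # [0.25, 0.25, 0.25, 0.25]
print(satisfies_staleness(pi, S=2))  # True: no UE is left unscheduled for more than S = 2 rounds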
A. Preliminaries

We consider non-convex loss functions in this paper. Our goal is to find an ε-approximate first-order stationary point (FOSP) for PerFedS2 [13, 25]. The formal definition of an FOSP is given as follows.

Definition 1. A random vector w_ε ∈ R^m is called an ε-FOSP for PerFedS2 if it satisfies E[‖∇F(w_ε)‖²] ≤ ε.

To make the convergence analysis consistent with that of Per-FedAvg, we make the following assumptions [13].

Assumption 1 (Bounded Staleness). All delay variables τ_k^i are bounded, i.e., max_{k,i} τ_k^i ≤ S.

Assumption 2. For each UE i ∈ U, its gradient ∇f_i is L-Lipschitz continuous and is bounded by a nonnegative constant C, namely,

‖∇f_i(w) − ∇f_i(u)‖ ≤ L‖w − u‖,  ∀w, u ∈ R^m,   (17)
‖∇f_i(w)‖ ≤ C,  ∀w ∈ R^m.   (18)

Assumption 3. For each UE i ∈ U, the Hessian of f_i is ρ-Lipschitz continuous:

‖∇²f_i(w) − ∇²f_i(u)‖ ≤ ρ‖w − u‖,  ∀w, u ∈ R^m.   (19)

Assumption 4. For any w ∈ R^m, ∇l_i(w; x, y) and ∇²l_i(w; x, y), computed w.r.t. a single data point (x, y) ∈ X_i × Y_i, have bounded variance:

E_{(x,y)∼p_i}[‖∇l_i(w; x, y) − ∇f_i(w)‖²] ≤ σ_G²,
E_{(x,y)∼p_i}[‖∇²l_i(w; x, y) − ∇²f_i(w)‖²] ≤ σ_H².   (20)

Assumption 5. For any w ∈ R^m, the gradient and Hessian of the local loss function f_i(w) and of the average loss function f(w) = (1/n) Σ_{i=1}^{n} f_i(w) satisfy the following conditions:

(1/n) Σ_{i=1}^{n} ‖∇f_i(w) − ∇f(w)‖² ≤ γ_G²,
(1/n) Σ_{i=1}^{n} ‖∇²f_i(w) − ∇²f(w)‖² ≤ γ_H².   (21)

While Assumption 1 limits the maximum staleness, Assumptions 2 to 5 characterize the properties of the gradient and Hessian of f_i(w), which are necessary to deduce the following lemmas and the convergence rate analysis.

B. Analysis of Convergence Bound

Before delving into the full details of the convergence analysis, we introduce three lemmas inherited from [13] to quantify the smoothness of F_i(w) and F(w), the deviation between ∇F_i(w) and its estimate ∇F̃_i(w), and the deviation between ∇F_i(w) and ∇F(w), respectively.

Lemma 1. If Assumptions 2-4 hold, then F_i is smooth with parameter L_F := 4L + αρC. As a consequence, the average function F(w) = (1/n) Σ_{i=1}^{n} F_i(w) is also smooth with parameter L_F.

Lemma 2. If Assumptions 2-4 hold, then for any α ∈ (0, 1/L] and w ∈ R^m, we have

E[‖∇F̃_i(w) − ∇F_i(w)‖] ≤ 2αLσ_G / √(D_in),   (22)
E[‖∇F̃_i(w) − ∇F_i(w)‖²] ≤ σ_F²,   (23)

where σ_F² is defined as

σ_F² := 12 (C² + σ_G²/D_o) (1 + (αL)²/D_in + α²σ_H²/(4D_h)) − 12C²,   (24)

where D_in = max_{i∈U} D_i^in, D_o = max_{i∈U} D_i^o and D_h = max_{i∈U} D_i^h.

Lemma 3. Given the loss function F_i(w) shown in (4) and α ∈ (0, 1/L], if the conditions in Assumptions 2, 3, and 5 are all satisfied, then for any w ∈ R^m, we have

(1/n) Σ_{i=1}^{n} ‖∇F_i(w) − ∇F(w)‖² ≤ γ_F²,   (25)

where γ_F² is defined as

γ_F² := 3C²α²γ_H² + 192γ_G²,   (26)

and ∇F(w) = (1/n) Σ_{i=1}^{n} ∇F_i(w).

Based on these three lemmas, we obtain the following theorem.

Theorem 1. If Assumptions 1 to 5 hold and the global step size β, together with the smoothness parameter L_F in Lemma 1, satisfies

L_F β² − β + 2L_F² β² S² ≤ 1,   (27)

then the following FOSP condition holds:

(1/K) Σ_{k=0}^{K−1} E[‖∇F(w_k)‖²] ≤ 2(F(w_0) − F(w*)) / (βK) + 4(L_F β + 2L_F² β² S²)(σ_F² + γ_F²) √A.   (28)

Proof: See the Appendix.

Corollary 1. Assume the conditions in Theorem 1 are satisfied. Then, if we set the number of total communication rounds as K = O(ε⁻³), the global learning rate as β = O(ε²), the staleness threshold as S = O(ε⁻¹), and the number of UEs that update the global model as A = O(ε⁻²), Algorithm 1 finds an ε-FOSP for PerFedS2.

Proof: Note that 2(F(w_0) − F(w*)) is constant, so K = O(ε⁻³) and β = O(ε²) ensure that the first term on the right-hand side of (28) is O(ε). Next we examine the second term of (28). Note that (σ_F² + γ_F²) is constant, so β = O(ε²) and S = O(ε⁻¹) together make (2L_F β + 4L_F² β² S²) = O(ε²). At this point, if A = O(ε⁻²), the second term of (28) is also O(ε).
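As a quick numeric illustration of Corollary 1, the sketch below turns a target accuracy ε into the prescribed orders of K, β, S and A; the proportionality constants are arbitrary placeholders of our own, since the corollary only fixes the scaling.

def corollary1_schedule(eps, c_K=1.0, c_beta=1.0, c_S=1.0, c_A=1.0):
    # K = O(eps^-3), beta = O(eps^2), S = O(eps^-1), A = O(eps^-2), cf. Corollary 1.
    return {"K": int(c_K * eps ** -3),      # total communication rounds
            "beta": c_beta * eps ** 2,      # global step size
            "S": int(c_S * eps ** -1),      # staleness threshold
            "A": int(c_A * eps ** -2)}      # UEs aggregated per round

for eps in (0.2, 0.1, 0.05):
    print(eps, corollary1_schedule(eps))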
V. JOINT BANDWIDTH ALLOCATION AND UE SCHEDULING

In this section, we present the steps to solve the optimization problem P1. In particular, we decouple P1 into P2, a bandwidth allocation problem, and P3, a UE scheduling problem. Note that individually solving the two sub-problems is equivalent to solving the original P1, which will be elaborated in the sequel.

A. Problem Decoupling

We begin with the bandwidth allocation problem. Given a scheduling pattern Π, the bandwidth allocation problem can be written as follows:

min_b  T(Π)   (P2)
s.t.  Σ_{k=1}^{K} max_{i∈A_k} {T_k^i} ≤ T(Π),   (C2.1)
      Σ_{i=1}^{n} b_k^i ≤ B,  k = 1, 2, . . . , K,   (C2.2)
      Σ_{j=k−τ_k^i}^{k} Z_j^i ≤ Z,  ∀i ∈ U.   (C2.3)

Then, with the optimal bandwidth allocation and the corresponding minimal overall training time T*(Π), the UE scheduling problem can be written as follows:

min_{K, A, Π}  F(w)   (P3)
s.t.  Σ_{k=1}^{K} max_{i∈A_k} {T_k^i} = T*(Π),  ∀i ∈ U,   (C3.1)
      Σ_{j=k−τ_k^i}^{k−τ_k^i+S} π_j^i ≥ 1,  ∀i ∈ U,   (C3.2)
      K ≥ S/η_i,  ∀i ∈ U.   (C3.3)

B. Optimal Bandwidth Allocation

In order to solve P2, we introduce the following theorems to explore the relationship between b_k^i and T(Π) step by step.

Theorem 2. If the server updates the global model after receiving A gradients from the UEs in each round, then the optimal bandwidth allocation is achieved if and only if all the scheduled UEs have the same finishing time.

Proof: Recall the expression of r_k^i defined in (9); taking its derivative with respect to b_k^i, we arrive at the following:

d/db_k^i [ b_k^i ln(1 + p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0)) ]
= ln(1 + p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0)) − p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0 + p_i h_i ‖c_i‖^(−κ))   (31)
> [p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0)] / [1 + p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0)] − p_i h_i ‖c_i‖^(−κ) / (b_k^i N_0 + p_i h_i ‖c_i‖^(−κ))
= 0,   (32)

where the inequality follows from the fact that ln(1 + x) > x/(1 + x) for x > 0. Therefore, r_k^i monotonically increases with b_k^i. Meanwhile, it is obvious that r_k^i > 0, and thus T_k^{cmp,i} + T_k^{com,i} = T_k^{cmp,i} + Z^i/r_k^i monotonically decreases with b_k^i. Therefore, at round k, if any UE i ∈ A_k has finished its whole local model update process earlier than the others, we can decrease its bandwidth allocation to compensate the other, slower UEs in A_k. As a result, the round latency, which is determined by the slowest UE in A_k, can be reduced. Such a bandwidth compensation is performed until all scheduled UEs in A_k finish their local iterations at the same time. Consequently, the optimal bandwidth allocation in round k is achieved when all scheduled UEs in A_k have the same finishing time.

Theorem 3. Given the relative participation frequencies η_i (i ∈ U), the UEs are scheduled in an order with a recurrence pattern. That is, the UEs periodically participate in the global model update.

Proof: Recall the formulation of η_i defined in (15); it is obvious that η_i is computed from the number of times UE i has been scheduled during all K rounds. Therefore, if η_i is settled, then Σ_{k=0}^{K−1} π_k^i is settled. As a result, if the UEs are scheduled periodically, the number of times each UE is involved in the global update can be settled, thus matching the relative participation rate it has been assigned.

Theorem 4. The optimal bandwidth allocation that achieves the minimum learning time is given by the following:

Σ_{i∈U} b_k^i = B,  k = 1, . . . , K,
b_k^i > B n η_i Z / [ (T*(Π) − T_i^{cmp}) (W(−Γ_i e^{−Γ_i}) + Γ_i) ],   (33)
Σ_{i∈A_k} b_k^i ≤ B,

where Γ_i ≜ N_0 Z / [ (T*(Π) − T_i^{cmp}) p_i h_i ‖c_i‖^(−κ) ], W(·) is the Lambert-W function, and T*(Π) is the objective value of (P2).

Proof: From Theorem 3, we know that all UEs update the global model periodically. Let K_p denote the number of communication rounds in each period; then, inferring from Theorem 2, all UEs have the same finishing time in each period without any waiting time. That is, we have

Σ_{k=1}^{K_p} T_k^i = Σ_{k=1}^{K_p} T_k^j,  ∀i, j ∈ U, i ≠ j.   (34)

Meanwhile, we have

Σ_{k=1}^{K_p} Z_k^i = η_i Z_{AK_p},  ∀i ∈ U,   (35)

where Z_{AK_p} denotes the number of bits that need to be transmitted during the K_p rounds. This equation indicates that the number of bits transmitted by UE i during K_p rounds is equal to the product of its relative participation frequency η_i and the total number of bits transmitted during those K_p communication rounds. From equation (35), it is easy to see that

Σ_{k=1}^{K_p} Z_k^i / η_i = Σ_{k=1}^{K_p} Z_k^j / η_j,  ∀i, j ∈ U, i ≠ j.   (36)
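The equal-finishing-time condition of Theorem 2 can be approximated numerically. The sketch below uses the rate expression appearing in the proof of Theorem 2, r_k^i(b) = b ln(1 + p_i h_i ‖c_i‖^(−κ)/(b N_0)), and bisects on a common finishing time so that the bandwidths of the scheduled UEs sum to B; all parameter values and helper names are toy assumptions of our own, and this is not the closed form of Theorem 4.

import math

def rate(b, p, h, c_norm, kappa, N0):
    # Transmission rate r_k^i(b) = b * ln(1 + p*h*||c||^-kappa / (b*N0)).
    return b * math.log(1.0 + p * h * c_norm ** (-kappa) / (b * N0))

def equal_finish_allocation(ues, Z, B, N0):
    # Split bandwidth B among the scheduled UEs so that T_cmp + Z / r(b) is (nearly)
    # identical for all of them, cf. Theorem 2. Outer bisection on the common time T;
    # inner bisection finds the smallest b that lets a UE finish by T (r increases in b).
    def needed_bandwidth(ue, T):
        lo, hi = 1e-12, B
        if ue["T_cmp"] + Z / rate(hi, ue["p"], ue["h"], ue["c"], ue["kappa"], N0) > T:
            return None                      # cannot finish by T even with the whole band
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            t = ue["T_cmp"] + Z / rate(mid, ue["p"], ue["h"], ue["c"], ue["kappa"], N0)
            lo, hi = (mid, hi) if t > T else (lo, mid)
        return hi

    lo, hi = max(u["T_cmp"] for u in ues) + 1e-6, 1e6
    for _ in range(100):
        T = 0.5 * (lo + hi)
        bs = [needed_bandwidth(u, T) for u in ues]
        if any(b is None for b in bs) or sum(bs) > B:
            lo = T                           # T too small: more bandwidth than B would be needed
        else:
            hi = T                           # feasible: try to finish earlier
    bs = [needed_bandwidth(u, hi) for u in ues]
    return hi, bs

ues = [{"T_cmp": 0.5, "p": 0.2, "h": 1.0, "c": 50.0, "kappa": 2.0},
       {"T_cmp": 0.8, "p": 0.2, "h": 0.7, "c": 120.0, "kappa": 2.0}]
T, bs = equal_finish_allocation(ues, Z=1e5, B=1e6, N0=1e-9)
print(T, bs, sum(bs))

The rebalancing argument in the proof of Theorem 2 corresponds to the outer bisection here: bandwidth is shifted away from early finishers toward the slowest UE until the finishing times coincide.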
(a) The largest bandwidth allocation to the UEs in A_k.  (b) The least bandwidth allocation to the UEs in A_k.
Fig. 2: Bandwidth allocation example, where all UEs have the same parameters, and A = 2.

Now, combining (34) and (36), we have

(Σ_{k=1}^{K_p} Z_k^i) / (η_i Σ_{k=1}^{K_p} T_k^i) = (Σ_{k=1}^{K_p} Z_k^j) / (η_j Σ_{k=1}^{K_p} T_k^j),  ∀i, j ∈ U, i ≠ j.   (37)

From equation (37) we observe that Σ_{k=1}^{K_p} Z_k^i / Σ_{k=1}^{K_p} T_k^i denotes the average rate of UE i during the K_p rounds. That is, we have

E(r_k^i)/η_i = E(r_k^j)/η_j,  ∀i, j ∈ U, i ≠ j.   (38)

The above equation states that as long as the average rate of each UE is weighted-equalized in this way, the optimal solution is achieved. Therefore, there exist infinitely many solutions of r_k^i to the above equation. The simplest solution is r_k^i/η_i = r_k^j/η_j in each round k. Note that r_k^i is determined by b_k^i, and thus there exist infinitely many solutions of b_k^i in each round k.

Our next step is to compute the boundary values of b_k^i. To do this, we first divide the UEs into two categories: UEs in A_k and UEs not in A_k.
• At one extreme, only the UEs in A_k are assigned bandwidth. That is, Σ_{i∈A_k} b_k^i = B. In this case, the PerFedS2 algorithm turns out to be a synchronous PerFedAvg algorithm in which A UEs are selected to update the global model in each round. Meanwhile, the bandwidth is allocated proportionally to the UEs in A_k such that r_k^i/η_i = r_k^j/η_j, ∀i, j ∈ A_k, i ≠ j. This extreme case corresponds to the third inequation of (33).
• At the other extreme, all UEs in round k share the available bandwidth B at rates satisfying r_k^i/η_i = r_k^j/η_j, ∀i, j ∈ A_k, i ≠ j. This case indicates the least bandwidth allocation to the UEs in A_k that still preserves their order of arrival at the server in the scheduling pattern. In this case, Σ_{i∈U} b_k^i = B. Therefore, a closed form of b_k^i is obtained, which corresponds to the lower bound of b_k^i shown in the second inequation of (33).

To better illustrate these approaches, let us take the example in Fig. 2. Assume A = 2 and the four UEs have the same η_i, p_i, h_i, and c_i. We can write the scheduling pattern Π of the four UEs as follows:

1 1 0 0
0 0 1 1
1 1 0 0
0 0 1 1
· · · · · · ·   (39)

The length of the scheduling period is K_p = 2. Meanwhile, according to Theorem 4, we have E(r_k^1) = · · · = E(r_k^4). One extreme case of bandwidth allocation is that UE 1 and UE 2 share the total bandwidth B in the first round, each being assigned B/2. At the same time, UE 3 and UE 4 can complete their local computation during round 1. Then, at round 2, all bandwidth B is allocated to UE 3 and UE 4 for their gradient transmission. In this case, according to Theorem 2, in each round both scheduled UEs will finish their gradient transmission at the same time. That is, the duration of round 1 is minimized when UE 1 and UE 2 share the total bandwidth B equally. At this point, the round duration is Z/r(B/2), where r(B/2) = (B/2) ln(1 + 2 p_i h_i ‖c_i‖^(−κ)/(B N_0)). Similarly, the duration of round 2 is also Z/r(B/2). Then, the total time of each period is 2Z/r(B/2). The other extreme case of bandwidth allocation is for all four UEs to share the bandwidth equally; then the UEs will finish one global update at the same time, which takes Z/r(B/4). Note that we set A = 2, but in this case, if all UEs finish one communication round at the same time, then A = 4; therefore this extreme situation cannot be achieved but can only be approached arbitrarily closely. It is obvious that Z/r(B/4) = 2Z/r(B/2), and this equation indicates that all bandwidth allocation policies between the two extreme cases lead to the same minimized overall training time.
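Theorem 3's periodic scheduling can also be sketched in a few lines of Python: given the relative participation frequencies, the helper below (our own illustration, not the paper's Algorithm 2) finds the shortest period K_p and fills each round with the A UEs that still owe the most appearances, reproducing the pattern in (39) for the example above.

from fractions import Fraction

def periodic_pattern(eta, A):
    # Return (K_p, pattern): the shortest period K_p and a K_p x n 0/1 matrix in which
    # UE i is scheduled eta_i * A * K_p times, with A UEs per round (cf. Theorem 3).
    n = len(eta)
    fr = [Fraction(e).limit_denominator(10 ** 6) for e in eta]
    K_p = 1
    while any((f * A * K_p).denominator != 1 for f in fr):
        K_p += 1                                     # smallest K_p making all quotas integer
    quota = [int(f * A * K_p) for f in fr]           # appearances of UE i per period
    pattern, remaining = [], quota[:]
    for _ in range(K_p):
        # schedule the A UEs with the largest remaining quota (simple greedy fill)
        chosen = sorted(range(n), key=lambda i: -remaining[i])[:A]
        pattern.append([1 if i in chosen else 0 for i in range(n)])
        for i in chosen:
            remaining[i] -= 1
    return K_p, pattern

K_p, pattern = periodic_pattern(eta=[0.25, 0.25, 0.25, 0.25], A=2)
print(K_p)
for row in pattern:
    print(row)   # for this input: [1, 1, 0, 0] then [0, 0, 1, 1], matching (39)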
At this point, according to the features of the optimal bandwidth solutions, we obtain four corollaries. Corollaries 2 and 3 are two direct conclusions derived from Theorem 2, shown as follows.

Corollary 2. From Theorem 2, we find that in each round k, the UEs in A_k finish the communication round at the same time. That is, none of the UEs has to wait for the others under the optimal bandwidth allocation policy. Therefore, we have Σ_{k=1}^{K} max_{i∈A_k}{T_k^i} = Σ_{k=1}^{K} T_k^{i*} = T_i^* (∀i ∈ U).

Algorithm 2: Greedy PerFedS2 Scheduling Algorithm
Input: η = {η_1, η_2, . . . , η_n}, A*
1 Initialize Π ← ∅

(a) MNIST training loss.  (b) MNIST test accuracy.  (c) CIFAR-100 training loss.  (d) CIFAR-100 test accuracy.
Fig. 4: Convergence performance comparison of PerFedS2, FedAvgS2, FedAvg-SYN, PerFed-SYN, FedAvg-ASY and PerFed-ASY using MNIST and CIFAR-100 datasets. In this case, the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, as for the PerFedS2 and FedAvgS2 algorithms, we set A = 5.

(a) Shakespeare training loss.  (b) Shakespeare test accuracy.  (c) Shakespeare training loss.  (d) Shakespeare test accuracy.
Fig. 5: Convergence performance comparison of PerFedS2, FedAvgS2, FedAvg-SYN, PerFed-SYN, FedAvg-ASY and PerFed-ASY using the Shakespeare dataset. For (a) and (b), η_1 = η_2 = · · · = η_n, and for (c) and (d), the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, as for the PerFedS2 and FedAvgS2 algorithms, we set A = 50.

As for the Shakespeare dataset, we find that all the conclusions about the comparisons between the six algorithms drawn from the above two datasets still hold. The comparison between FedAvgS2, FedProxS2 and PerFedS2 using the MNIST and Shakespeare datasets is shown in Fig. 6.
Fig. 6 (figure): Test accuracy comparison of FedAvgS2, FedProxS2 and PerFedS2 with equal and with randomly distributed UEs, using the MNIST and Shakespeare datasets.

(a) MNIST training loss.  (b) MNIST test accuracy.  (c) CIFAR-100 training loss.  (d) CIFAR-100 test accuracy.
Fig. 7: Convergence performance of PerFedS2 with respect to the non-i.i.d level l of data sampled from the MNIST and CIFAR-100 datasets. We compare the results when l = 2, 4, 6, 8 for data sampled from the MNIST dataset, and l = 3, 5, 7, 9 for data sampled from the CIFAR-100 dataset.

From the figure it is obvious that PerFedS2 outperforms the other two algorithms. This is reasonable since Per-FedAvg has already been verified in previous works to provide better convergence performance, and PerFedS2 is designed based on Per-FedAvg. Therefore, PerFedS2 inherits this benefit.

2) Effect of the non-i.i.d. level l: Fig. 7 shows the evaluation results of PerFedS2 under different non-i.i.d. levels. It is obvious that for both datasets, the higher the heterogeneity level is, the worse the convergence performance. These results are natural and in line with the theory.

3) Effect of the number of participants in each round A: Fig. 8 and Fig. 9 show the convergence performance of PerFedS2 with respect to different numbers of participating UEs A in each round, where Fig. 8 covers the case in which all UEs have the same η_i, whereas Fig. 9 covers the case in which the η_i of each UE is determined by its distance to the central server, which follows a random distribution.

As for the MNIST dataset, the results shown in Fig. 8 and Fig. 9 indicate that the larger the number of participating UEs in each round, the poorer the convergence performance. This conclusion is not always true, given that the relative participation frequency vector η = [η_1, η_2, . . . , η_n] in Fig. 9 is generated randomly according to the distances from the UEs to the central server, and thus the optimal A that minimizes the overall training time is random. We can only conclude that in this very specific case of η, the larger the number of participating UEs in each round, the better. Nevertheless, the benefit gained from a smaller value of A is slight in Fig. 9. This is reasonable because the randomly generated η may result in a scheduling pattern that weakens the influence of the number of participating UEs in each round.

However, as for the CIFAR-100 dataset, although Figs. 8c and 8d still indicate the same conclusion as that for the MNIST dataset, Figs. 9c and 9d indicate another situation, in which the convergence performance of PerFedS2 wins when A = 10. This result verifies the conclusion mentioned above, namely that the conclusion obtained from the MNIST dataset is not always true. The results shown in Figs. 9c and 9d indicate a specific case in which A = 10 approaches the optimal A*.

4) Effect of the staleness threshold S: Finally, we evaluate the effect of the staleness threshold S on the convergence performance of PerFedS2, where the results are shown in Fig. 10. Here, in order to make the effect of S clearer, we use the simpler setting in which all UEs have the same η_i and A = 5. Therefore, when S ≥ 5, all the scheduled UEs would arrive at the server within S rounds. Consequently, we study the change of the total training time when S = 1, 2, 3, 4, 5.
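For reference, the non-i.i.d. level l used in the experiments above denotes the number of distinct classes held by each UE. The paper does not list its sampling code, so the following partition routine is merely one plausible construction of our own (the round-robin class assignment and the per-UE subsampling are assumptions) for a labelled dataset such as MNIST.

import random
from collections import defaultdict

def partition_by_level(labels, n_ues, l, seed=0):
    # Assign sample indices to n_ues UEs so that each UE draws from only l classes
    # (an illustration of a "non-i.i.d. level l" split, not the paper's exact procedure).
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    ue_classes = [[classes[(u * l + j) % len(classes)] for j in range(l)]
                  for u in range(n_ues)]                 # round-robin class choice
    shards = {}
    for u, cls in enumerate(ue_classes):
        pool = [i for c in cls for i in by_class[c]]
        rng.shuffle(pool)
        shards[u] = pool[: len(pool) // 2]               # subsample half of the pool per UE
    return ue_classes, shards

# Toy labels standing in for MNIST targets (10 classes).
labels = [i % 10 for i in range(1000)]
ue_classes, shards = partition_by_level(labels, n_ues=5, l=2)
print(ue_classes)                       # each UE sees exactly l = 2 classes
print({u: len(s) for u, s in shards.items()})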
(a) MNIST training loss.  (b) MNIST test accuracy.  (c) CIFAR-100 training loss.  (d) CIFAR-100 test accuracy.
Fig. 8: Convergence performance of PerFedS2 with respect to the number of UEs A that participate in the global model update in each round using MNIST and CIFAR-100 datasets. In this case, η_1 = η_2 = · · · = η_n. Meanwhile, we compare the results when A = 5, 10, 15.

(a) MNIST training loss.  (b) MNIST test accuracy.  (c) CIFAR-100 training loss.  (d) CIFAR-100 test accuracy.
Fig. 9: Convergence performance of PerFedS2 with respect to the number of UEs A that participate in the global model update in each round using MNIST and CIFAR-100 datasets. In this case, the distance from UEs to the server obeys the random distribution from 0 to 200 m. Meanwhile, we compare the results when A = 5, 10, 15.

Note that in the theoretical analysis, we have the constraint that η_i ≥ S/K. This constraint eliminates the situations in which the staleness τ_k^i is larger than the staleness bound S, and thus no updates would be dropped by the central server. However, in practice, η_i is determined by a number of factors, for example, the distances from the UEs to the server or the transmit power of individual UEs. Therefore, in practice, the constraint η_i ≥ S/K cannot always be satisfied. When this happens to UE i, in order to keep η_i constant, other UEs may have to wait until the update from UE i finally arrives at the server, thereby prolonging the overall training time. This conclusion is verified through the results shown in Fig. 10, where the larger S is, the better the convergence performance of PerFedS2.

(a) MNIST training loss.  (b) MNIST test accuracy.  (c) CIFAR-100 training loss.  (d) CIFAR-100 test accuracy.
Fig. 10: Convergence performance comparison of PerFedS2 with respect to the staleness threshold S using the MNIST and CIFAR-100 datasets. In this case, η_1 = η_2 = · · · = η_n and A = 5. Meanwhile, we compare the results for S = 1, 2, 3, 4, 5.

VII. CONCLUSIONS

We have proposed a new semi-synchronous PFL algorithm over mobile edge networks, PerFedS2, which not only mitigates the straggler problem caused by synchronous training, but also ensures a convergent training loss that may not be guaranteed in asynchronous training. This is achieved by optimizing a joint bandwidth allocation and UE scheduling problem. In order to solve this optimization problem, we first analysed the convergence rate of PerFedS2 and proved that there exists a convergent upper bound on the convergence rate. Then, based on the convergence analysis, we solved the optimization problem by decoupling it into two sub-problems: the bandwidth allocation problem and the UE scheduling problem. For a given scheduling policy, the bandwidth allocation problem has been proved to have infinitely many solutions. Meanwhile, based on the convergence analysis of PerFedS2, the optimal UE scheduling policy can be determined using a greedy algorithm. We have conducted extensive experiments to verify the effectiveness of PerFedS2 in saving training time, compared with synchronous and asynchronous FL and PFL algorithms.

APPENDIX
Proof of Theorem 1
Using Lemma 1, we have

F(w_{k+1}) − F(w_k)
≤ ⟨∇F(w_k), w_{k+1} − w_k⟩ + (L_F/2) ‖w_{k+1} − w_k‖²
= −⟨∇F(w_k), (β/A) Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i})⟩ + (L_F β²/2) ‖(1/A) Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i})‖².   (44)

From the above inequality, it is obvious that the key is to bound the term Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i}). Let

(1/A) Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i}) = X + Y + (1/A) Σ_{i∈A_k} ∇F(w_{k−τ_k^i}),   (45)

where

X = (1/A) Σ_{i∈A_k} (∇F̃_i(w_{k−τ_k^i}) − ∇F_i(w_{k−τ_k^i})),
Y = (1/A) Σ_{i∈A_k} (∇F_i(w_{k−τ_k^i}) − ∇F(w_{k−τ_k^i})).   (46)

Our next step is to upper bound E[‖X‖²] and E[‖Y‖²], respectively. Recall the Cauchy-Schwarz inequality ‖Σ_{i=1}^{n} a_i b_i‖² ≤ (Σ_{i=1}^{n} ‖a_i‖²)(Σ_{i=1}^{n} ‖b_i‖²). As for X, consider the Cauchy-Schwarz inequality with a_i = (1/√A)(∇F̃_i(w_{k−τ_k^i}) − ∇F_i(w_{k−τ_k^i})) and b_i = 1/√A; we have

‖X‖² ≤ (1/A) Σ_{i∈A_k} ‖∇F̃_i(w_{k−τ_k^i}) − ∇F_i(w_{k−τ_k^i})‖².   (47)

Let F_k denote the information up to round k. Given that the set of scheduled UEs A_k is selected according to their relative participation frequencies η_i (i ∈ A_k), by using Lemma 2 along with the tower rule, we have

E[‖X‖²] = E[E[‖X‖² | F_k]] ≤ σ_F² Σ_{i∈A_k} η_i.   (48)

Meanwhile, as for Y, consider the Cauchy-Schwarz inequality with a_i = (1/√A)(∇F_i(w_{k−τ_k^i}) − ∇F(w_{k−τ_k^i})) and b_i = 1/√A; we have

‖Y‖² ≤ (1/A) Σ_{i∈A_k} ‖∇F_i(w_{k−τ_k^i}) − ∇F(w_{k−τ_k^i})‖².   (49)

In a similar way, the mean of ‖Y‖² is the weighted average of E[‖Y‖² | F_k], where the weight is the relative participation frequency of UE i ∈ A_k. By using Lemma 3 along with the tower rule, we have

E[‖Y‖²] = E[E[‖Y‖² | F_k]] ≤ γ_F² Σ_{i∈A_k} η_i.   (50)

Now, getting back to inequality (44), from the fact that ⟨a, b⟩ = (1/2)(‖a‖² + ‖b‖² − ‖a − b‖²), we have

F(w_{k+1}) − F(w_k)
≤ −(β/2) ‖∇F(w_k)‖² − (β/2) ‖(1/A) Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i})‖²
  + (β/2) ‖∇F(w_k) − X − Y − (1/A) Σ_{i∈A_k} ∇F(w_{k−τ_k^i})‖²
  + (L_F β²/2) ‖(1/A) Σ_{i∈A_k} ∇F̃_i(w_{k−τ_k^i})‖²
≤ −(β/2) ‖∇F(w_k)‖² + L_F β² T_1 + β T_2 + (L_F β² − β) ‖(1/A) Σ_{i∈A_k} ∇F(w_{k−τ_k^i})‖²,   (51)

where T_1 ≜ ‖X + Y‖² and T_2 ≜ ‖∇F(w_k) − (1/A) Σ_{i∈A_k} ∇F(w_{k−τ_k^i})‖².

Our next step is to estimate the upper bounds of E[T_1] and E[T_2], respectively. As for T_1, we have

E[T_1] ≤ 2E[‖X‖²] + 2E[‖Y‖²] = 2(σ_F² + γ_F²).   (52)

As for T_2, we have

T_2 = ‖(1/A) Σ_{i∈A_k} (∇F(w_k) − ∇F(w_{k−τ_k^i}))‖²
    ≤ (1/A) Σ_{i∈A_k} ‖∇F(w_k) − ∇F(w_{k−τ_k^i})‖²
    ≤ (1/A) Σ_{i∈A_k} L_F² ‖w_k − w_{k−τ_k^i}‖²
REFERENCES

[16] I. Achituve, A. Shamsian, A. Navon, G. Chechik, and E. Fetaya, "Personalized federated learning with gaussian processes," in International Conference on Neural Information Processing Systems (NeurIPS), 2021.
[17] X. Lian, Y. Huang, Y. Li, and J. Liu, "Asynchronous parallel stochastic gradient for nonconvex optimization," in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 2737-2745.
[18] C. Xu, Y. Qu, Y. Xiang, and L. Gao, "Asynchronous federated learning on heterogeneous devices: A survey," arXiv preprint arXiv:2109.04269, 2021.
[19] Y. Chen, Y. Ning, M. Slawski, and H. Rangwala, "Asynchronous online federated learning for edge devices with non-iid data," in IEEE International Conference on Big Data (Big Data), 2020, pp. 15-24.
[20] W. Wu, L. He, W. Lin, R. Mao, C. Maple, and S. Jarvis, "SAFA: A semi-asynchronous protocol for fast federated learning with low overhead," IEEE Transactions on Computers (TOC), vol. 70, no. 5, pp. 655-668, 2020.
[21] Q. Ma, Y. Xu, H. Xu, Z. Jiang, L. Huang, and H. Huang, "FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing," IEEE Journal on Selected Areas in Communications (JSAC), 2021.
[22] D. Stripelis and J. L. Ambite, "Semi-synchronous federated learning," arXiv preprint arXiv:2102.02849, 2021.
[23] Y. Zhang, M. Duan, D. Liu, L. Li, A. Ren, X. Chen, Y. Tan, and C. Wang, "CSAFL: A clustered semi-asynchronous federated learning framework," arXiv preprint arXiv:2104.08184, 2021.
[24] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International Conference on Machine Learning (ICML), 2017, pp. 1126-1135.
[25] A. Fallah, A. Mokhtari, and A. Ozdaglar, "On the convergence theory of gradient-based model-agnostic meta-learning algorithms," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2020, pp. 1082-1092.
[26] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421-436.
[27] H. Yin and S. Alamouti, "OFDMA: A broadband wireless access technology," in IEEE Sarnoff Symposium, 2006, pp. 1-4.
[28] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, "Joint device scheduling and resource allocation for latency constrained wireless federated learning," IEEE Transactions on Wireless Communications (TWC), vol. 20, no. 1, pp. 453-467, 2020.
[29] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Transactions on Wireless Communications (TWC), vol. 20, no. 1, pp. 269-283, 2020.
[30] B. Sklar, "Rayleigh fading channels in mobile digital communication systems. I. Characterization," IEEE Communications Magazine, vol. 35, no. 7, pp. 90-100, 1997.
[31] L. Yann, C. Corinna, and B. Christopher. The MNIST dataset. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[32] K. Alex, N. Vinod, and H. Geoffrey. The CIFAR-10 dataset. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[33] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar, "Leaf: A benchmark for federated settings," arXiv preprint arXiv:1812.01097, 2018.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[35] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," Proceedings of Machine Learning and Systems (MLSys), vol. 2, pp. 429-450, 2020.

Chaoqun You (S'13-M'20) is a postdoctoral research fellow at the Singapore University of Technology and Design (SUTD). She received the B.S. degree in communication engineering and the Ph.D. degree in communication and information systems from the University of Electronic Science and Technology of China (UESTC) in 2013 and 2020, respectively. She was a visiting student at the University of Toronto from 2015 to 2017. Her current research interests include mobile edge computing, network virtualization, federated learning, meta-learning, and 6G.

Daquan Feng received the Ph.D. degree in information engineering from the National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu, China, in 2015. From 2011 to 2014, he was a visiting student with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA. After graduation, he was a Research Staff with the State Radio Monitoring Center, Beijing, China, and then a Postdoctoral Research Fellow with the Singapore University of Technology and Design, Singapore. He is now an associate professor with the Shenzhen Key Laboratory of Digital Creative Technology, the Guangdong Province Engineering Laboratory for Digital Creative Technology, the Guangdong-Hong Kong Joint Laboratory for Big Data Imaging and Communication, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. His research interests include URLLC communications, MEC, and massive IoT networks. Dr. Feng is an Associate Editor of IEEE COMMUNICATIONS LETTERS, Digital Communications and Networks, and ICT Express.

Kun Guo (Member, IEEE) received the B.E. degree in Telecommunications Engineering from Xidian University, Xi'an, China, in 2012, where she received the Ph.D. degree in communication and information systems in 2019. From 2019 to 2021, she was a Post-Doctoral Research Fellow with the Singapore University of Technology and Design (SUTD), Singapore. Currently, she is a Zijiang Young Scholar with the School of Communications and Electronics Engineering at East China Normal University, Shanghai, China. Her research interests include edge computing, caching, and intelligence.

Howard H. Yang (S'13-M'17) received the B.E. degree in Communication Engineering from Harbin Institute of Technology (HIT), China, in 2012, and the M.Sc. degree in Electronic Engineering from Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2013. He earned the Ph.D. degree in Electrical Engineering from the Singapore University of Technology and Design (SUTD), Singapore, in 2017. He was a Postdoctoral Research Fellow at SUTD from 2017 to 2020, a Visiting Postdoc Researcher at Princeton University from 2018 to 2019, and a Visiting Student at the University of Texas at Austin from 2015 to 2016. Currently, he is an assistant professor with the Zhejiang University/University of Illinois at Urbana-Champaign Institute (ZJU-UIUC Institute), Zhejiang University, Haining, China. He is also an adjunct assistant professor with the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign, IL, USA. Dr. Yang's research interests cover various aspects of wireless communications, networking, and signal processing, currently focusing on the modeling of modern wireless networks, high dimensional statistics, graph signal processing, and machine learning. He serves as an editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS. He received the IEEE WCSP 10-Year Anniversary Excellent Paper Award in 2019 and the IEEE WCSP Best Paper Award in 2014.