Computation and Communication Efficient Federated Learning Over Wireless Networks
Abstract—Federated learning (FL) allows model training from local data by edge devices while preserving data privacy. However, the learning accuracy decreases due to the heterogeneity of devices' data, and the computation and communication latency

resources, e.g., bandwidth and energy, may impede devices to effectively contribute to the model aggregation, which results in high transmission latency and has a negative effect on the convergence rate and learning accuracy [6].
• We model the computation and communication latency of the proposed FL framework under a given pruning ratio. Also, we analyze the convergence rate of an upper bound on the l2-norm of gradients for FL with partial model pruning and personalization. Then, we jointly optimize the pruning ratio of the global part and wireless resource allocation to maximize the convergence rate under the latency and bandwidth thresholds.
• In order to derive the optimal solutions of the pruning ratio and wireless resource allocation in each global communication round, we decouple the optimization problem into two sub-problems and deploy Karush–Kuhn–Tucker (KKT) conditions to solve these sub-problems and obtain the desired closed-form solutions.
• The experimental results demonstrate that our proposed FL framework with partial model pruning and personalization achieves comparable learning accuracy to the FL only with model personalization, and leads to a reduction of approximately 50% in computation and communication latency.

The rest of this paper is organized as follows. Section II presents the related works. Section III introduces the system model. The convergence analysis and problem formulation are detailed in Section IV. The pruning ratio and wireless resource optimization are presented in Section V. The simulation results and conclusions are shown in Section VI and Section VII, respectively.

II. RELATED WORKS

In this section, related works on personalized FL, FL with model pruning, and resource allocation and device selection are briefly introduced in the following three subsections.

A. Personalized FL

To address the issue of data heterogeneity of each device, personalized FL (PFL) was considered in [11]–[13]. In [11], personalized FL based on model similarity was proposed to leverage the classifier-based similarity to conduct personalized model aggregation, without incurring extra communication overhead. In [12], a new PFL algorithm called FL with dynamic weight adjustment was proposed to leverage the edge server to calculate personalized aggregation weights based on the models collected from devices, which could capture similarities between devices and improve PFL learning accuracy by encouraging collaborations among devices with similar data distributions. In [13], hierarchical PFL (HPFL) was considered in massive mobile edge computing (MEC) networks. HPFL combined the objectives of training loss minimization and round latency minimization while jointly determining the optimal bandwidth allocation as well as the edge server scheduling policy in the hierarchical learning framework. However, PFL in [11]–[13] still needs to transmit the whole learning model to the server for similarity calculation and model aggregation, which cannot guarantee low computation and communication latency, especially for large-scale learning models.

To address this issue in PFL, partial model personalization was proposed in [7] and [8], where only the global part of the learning model is delivered between the edge server and devices. In [7], two FL algorithms for training partially personalized models were introduced, where the shared and personal parameters were updated either simultaneously or alternately on devices. In [8], the authors proved that, under the right split of parameters, it was possible to find proper global and personalized parameters that allowed each device to fit its local dataset perfectly, and broke the curse of data heterogeneity in several settings, such as training with local steps, asynchronous training, and Byzantine-robust training. However, the communication and computation overheads of partial model personalization in [7] and [8] over wireless networks have rarely been explored.

B. FL with Model Pruning

To improve the communication efficiency of FL, federated pruning was proposed in [9], [10], [14]. In [9], model pruning was adopted before local gradient calculation to reduce both the local model computation and gradient communication latency in FL over wireless networks. By removing the stragglers with low computing power and bad channel conditions, device selection was also considered to save the communication overhead and reduce the model aggregation error caused by model pruning. In [10], PruneFL was proposed to effectively reduce the size of neural networks so that resource-limited devices could train the learning model within a short time. PruneFL included initial pruning at a selected device and further pruning as part of the FL process. It also included a low-complexity adaptive pruning method for efficient FL, which was able to find the desired model size that could achieve a similar prediction accuracy as the original model but with much less time. In [14], model pruning for hierarchical FL in wireless networks was introduced to reduce the neural network scale, which decreased both computation and communication latency while guaranteeing a similar learning accuracy as the original model. However, FL with model pruning in [9], [10], [14] was not suitable for non-IID data settings.

C. Resource Allocation and Device Selection

To adapt to the limited wireless resources, computation and communication resource allocation and edge-device association of wireless FL were investigated in [15]–[18]. Specifically, in [15], by optimizing time allocation, bandwidth allocation, power control, and computation frequency, an iterative algorithm was proposed to minimize the total energy consumption for local computation and wireless transmission of FL. In [16], a probabilistic user selection scheme was proposed to minimize the FL convergence time and the FL training loss in a joint learning, wireless resource allocation, and user selection problem. In [17], the Hungarian algorithm was used to find the optimal device selection and resource block allocation so as to minimize the FL loss function. In [18], by deploying successive convex approximation and Hungarian algorithms to optimize bandwidth, computation frequency, power allocation, and sub-carrier assignment, the sum of system and learning costs was minimized. Although the proposed device selection
and

$$F_n(u_n, v_n) = \frac{1}{D_n}\sum_{i=1}^{D_n} f_n(x_i, y_i, u_n, v_n), \qquad (4)$$

respectively.

B. Learning Process in FL with Partial Model Pruning and Personalization

The proposed FL with partial model pruning and personalization is shown in Fig. 1, where the personalized parts
[Fig. 1. FL with partial model pruning and personalization. Edge server: global part aggregation, global part downloading. Device: local model updating, global part pruning, global part uploading, personalized part.]

The edge devices first update the global part $u_n^e$ by $u_n^e - \eta_u \nabla F_n(u_n^e, v_n^{e,\tau_v}, \xi_n^e)$ for $\hat{\tau}_u$ iterations, and $\hat{\tau}_u$ can be a small value according to [21], [22]. Then, the importance of each weight in the global part is calculated by (6), and the weights of the global part are sorted in descending order. Given a pruning ratio $\rho_n^e$ of the $n$th device, we deploy a pruning mask $m_n^e$ to prune the global part $u_n^e$, which is calculated as

$$u_n^{e,0} = u_G^e \odot m_n^e. \qquad (10)$$
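The masking step in (10) can be illustrated with a short sketch. This is a minimal example in which the importance score of (6), not reproduced in this excerpt, is replaced by absolute weight magnitude as a stand-in; the function name and tensor shapes are illustrative only.

```python
import torch

def prune_mask(global_part: torch.Tensor, pruning_ratio: float) -> torch.Tensor:
    """Return a binary mask that zeroes the lowest-importance weights.

    Importance is approximated here by absolute weight magnitude; the
    paper's score in (6) may differ.
    """
    num_weights = global_part.numel()
    num_pruned = int(pruning_ratio * num_weights)
    mask = torch.ones(num_weights)
    if num_pruned > 0:
        # Indices of the smallest-magnitude weights are set to zero.
        _, idx = torch.topk(global_part.abs().flatten(), num_pruned, largest=False)
        mask[idx] = 0.0
    return mask.view_as(global_part)

# Example: prune 50% of a toy global part, u_n^{e,0} = u^e * m_n^e
u = torch.randn(4, 4)
m = prune_mask(u, pruning_ratio=0.5)
u_pruned = u * m
```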
is the noise power. Then, the uplink transmission latency is calculated as

$$T^{\text{up}}_{n,e} = \frac{\hat q\,(1-\rho^e_n)\, W^e_{n,u_n}}{R^{\text{up}}_{n,e}}, \qquad (15)$$

where $\hat q$ is the quantization bit.

D. Global Part Aggregation

Because of partial model pruning, some model weights of the global part are not in the received local models. Let $\mathcal{N}_e^j$ be the set of devices associated with the edge server and containing the $j$th model weight of the global part in the $e$th global communication round. Then, the global part update of the $j$th model weight is performed by aggregating the global parts with the $j$th model weight available, as calculated in (16), where $|\mathcal{N}_e^j|$ is the number of global parts containing the $j$th model weight.

Then, the edge server delivers $u^{e+1}$ to its associated devices for the next round of local model updating. The edge server does not access the local dataset of each mobile device, which preserves personal data privacy. Since the edge server typically has high computation capability, the computation latency of the global part aggregation is neglected.

The detailed FL with partial model pruning and personalization is presented in Algorithm 1.

Algorithm 1 FL with partial model pruning and personalization
1: Local dataset $D_n$ on $N$ local devices associated with the edge server, learning rates $\eta_u$ and $\eta_v$, numbers of local iterations $\tau_u$, $\hat\tau_u$ and $\tau_v$, number of global communication rounds $E$, edge model parameterized by $u^e$, local model parameterized by $u_n^{e,t}$ and $v_n^{e,t}$.
2: for global communication round $e = 1,\ldots,E$ do
3:   for local device $n = 1,\ldots,N$ do
4:     for iteration $t = 1,2,\ldots,\tau_v$ do
5:       Update $v_n^{e,t}$ as (9).
6:     end for
7:     Update the global part $u_n^e$ by $u_n^e - \eta_u \nabla F_n(u_n^e, v_n^{e,\tau_v}, \xi_n^e)$ for $\hat\tau_u$ iterations and generate mask $m_n^e$ by (6).
8:     Initialize $u_n^{e,0} = u^e \odot m_n^e$.
9:     for iteration $t = 1,2,\ldots,\tau_u$ do
10:      Update $u_n^{e,t}$ as (11).
11:    end for
12:  end for
13:  for parameter $j$ in global part $u_n^{e,\tau_u}$ do
14:    Find $\mathcal{N}_e^j = \{n : m_n^{e,j} = 1\}$.
15:    Update $u^{e,j}$ as (16).
16:  end for
17: end for
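Lines 13–15 of Algorithm 1 aggregate each global-part weight only over the devices whose masks retain it. A minimal sketch follows; since (16) is not reproduced in this excerpt, the sketch assumes an unweighted average over $\mathcal{N}_e^j$ and keeps the previous value of a weight when no device retains it.

```python
import torch

def aggregate_global_part(local_parts, masks, previous_global):
    """Per-weight aggregation over the devices whose masks retain that weight.

    Assumes an unweighted average over N_e^j; when |N_e^j| = 0 the previous
    global value of weight j is kept.
    """
    stacked = torch.stack(local_parts)   # shape (N, ...): pruned local global parts
    mask = torch.stack(masks)            # shape (N, ...): binary masks m_n^e
    counts = mask.sum(dim=0)             # |N_e^j| for every weight j
    summed = (stacked * mask).sum(dim=0)
    return torch.where(counts > 0, summed / counts.clamp(min=1), previous_global)
```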
E. Computation and Communication Latency

Synchronous training is considered in the proposed FL framework over wireless networks and we mainly focus on local computation and uplink transmission latency, which is written as

$$T_n^e = T^{\text{cmp}}_{n,e} + T^{\text{up}}_{n,e} \qquad (17)$$

$$= \frac{\tau_v C_n W^e_{n,v_n}}{f_n} + \frac{(1-\rho^e_n)\,\tau_u C_n W^e_{n,u_n}}{f_n} + \frac{\hat\tau_u C_n W^e_{n,u_n}}{f_n} + \frac{\hat q\,(1-\rho^e_n)\, W^e_{n,u_n}}{R^{\text{up}}_{n,e}}. \qquad (18)$$

Therefore, the latency of the edge server in the $e$th global communication round is written as

$$T_e = \max_{n\in\mathcal{N}}\{T_n^e\}. \qquad (19)$$

Obviously, from (19), we observe that the bottleneck of the computation and communication latency is affected by the last device that finishes all local iterations and uplink transmission after local model updating.
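The latency model in (15)–(19) can be evaluated directly. The sketch below mirrors the paper's symbols ($C_n$, $f_n$, the weight counts, $\hat q$, $R^{\text{up}}_{n,e}$); the interpretation of $C_n$ as CPU cycles per weight per iteration and the units are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Device:
    C_n: float    # CPU cycles per model weight per iteration (assumed unit)
    f_n: float    # CPU frequency in cycles/s
    W_v: float    # number of weights in the personalized part, W^e_{n,v_n}
    W_u: float    # number of weights in the global part, W^e_{n,u_n}
    R_up: float   # uplink rate R^up_{n,e} in bit/s

def round_latency(devices, rho, tau_v, tau_u, tau_u_hat, q_hat):
    """Per-device latency from (17)-(18) and the round latency T_e = max_n T_n^e from (19)."""
    per_device = []
    for dev, rho_n in zip(devices, rho):
        t_cmp = (tau_v * dev.C_n * dev.W_v
                 + (1.0 - rho_n) * tau_u * dev.C_n * dev.W_u
                 + tau_u_hat * dev.C_n * dev.W_u) / dev.f_n
        t_up = q_hat * (1.0 - rho_n) * dev.W_u / dev.R_up   # uplink latency (15)
        per_device.append(t_cmp + t_up)
    return max(per_device), per_device
```

The returned maximum is the straggler-limited round latency discussed after (19).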
IV. CONVERGENCE ANALYSIS AND PROBLEM FORMULATION

In this section, the convergence of FL with partial model pruning and personalization is first analyzed. Subsequently, we formulate an optimization problem to minimize the upper bound of the convergence analysis.

A. Convergence Analysis

In general, the neural network is non-convex; thus, the average l2-norm of gradients is deployed to evaluate the convergence performance [25], [26]. To facilitate analysis, the following assumptions are employed in the convergence analysis of FL with partial model pruning and personalization.

Assumption 1. (Smoothness) All loss functions $F_n(u, v_n)$ are continuously differentiable with respect to $u$ and $v_n$, namely, $\nabla_u F_n(u, v_n)$ is $L_u$-Lipschitz continuous with $u$ and $L_{uv}$-Lipschitz continuous with $v_n$, and $\nabla_v F_n(u, v_n)$ is $L_v$-Lipschitz continuous with $v_n$ and $L_{vu}$-Lipschitz continuous with $u$, which are denoted as

$$\|\nabla_u F_n(u, v_n) - \nabla_u F_n(\hat u, v_n)\| \le L_u \|u - \hat u\|, \qquad (20)$$

and

$$\|\nabla_v F_n(u, v_n) - \nabla_v F_n(\hat u, v_n)\| \le L_{vu} \|u - \hat u\|, \qquad (23)$$

where $L_u$, $L_v$, $L_{uv}$, and $L_{vu}$ are positive constants. The smoothness assumption is a standard one. We assume, without loss of generality, that the cross-Lipschitz coefficients $L_{uv}$ and $L_{vu}$ are equal.

Assumption 2. (Pruning-induced Noise) Different from the other convergence analysis of FL with partial model personalization in [7], [27], we consider the effect of pruning-induced
noise. According to [14] and [28], the model error of the $n$th device under the pruning ratio $\rho_n^e$ is bounded by

$$\mathbb{E}\|u_n^e - u_n^e \odot m_n^e\|^2 \le \rho_n^e D^2, \qquad (24)$$

where $D$ is a positive constant.

Assumption 3. (Bounded Gradient) The second moments of stochastic gradients of the global and personalized parts are bounded [29], [30], which are denoted as

$$\mathbb{E}\|\nabla_u F_n(u_n^{e,t}, v_n^{e,t}, \xi_n^{e,t})\|^2 \le \phi_u^2, \qquad (25)$$

and

$$\mathbb{E}\|\nabla_v F_n(u_n^{e,t}, v_n^{e,t}, \xi_n^{e,t})\|^2 \le \phi_v^2, \qquad (26)$$

respectively. In (25) and (26), $\phi_u$ and $\phi_v$ are positive constants, and $\xi_n^{e,t}$ are mini-batch data samples for any $n, e, t$.

Assumption 4. (Bounded Variance) The stochastic gradients of the global and personalized parts are unbiased and have bounded variance, which are denoted as

$$\mathbb{E}[\nabla_u F_n(u_n^{e,t}, \xi_n^{e,t})] = \nabla_u F_n(u_n^{e,t}), \qquad (27)$$

and

$$\mathbb{E}[\nabla_v F_n(v_n^{e,t}, \xi_n^{e,t})] = \nabla_v F_n(v_n^{e,t}), \qquad (28)$$

respectively. Furthermore, there exist constants $\hat\sigma_u$ and $\hat\sigma_v$ satisfying

$$\mathbb{E}\|\nabla_u F_n(u_n^{e,t}, \xi_n^{e,t}) - \nabla_u F_n(u_n^{e,t})\|^2 \le \hat\sigma_u^2, \qquad (29)$$

and

$$\mathbb{E}\|\nabla_v F_n(v_n^{e,t}, \xi_n^{e,t}) - \nabla_v F_n(v_n^{e,t})\|^2 \le \hat\sigma_v^2, \qquad (30)$$

respectively.

Assumption 5. (Partial Gradient Diversity) There exist $\delta \ge 0$ and $\varphi \ge 0$ for all $u$ and $V = \sum_{n=1}^{N} v_n$ satisfying

$$\frac{1}{N}\sum_{n=1}^{N}\|\nabla_u F_n(u, v_n) - \nabla_u F(u, V)\|^2 \le \delta^2 + \varphi^2\|\nabla_u F(u, V)\|^2. \qquad (31)$$

Partial gradient diversity characterizes how local steps on one device affect convergence globally.

Theorem 1: With the above assumptions, FL with partial model pruning and personalization converges to a small neighborhood of a stationary point of standard FL as follows:

$$\frac{1}{E}\sum_{e=1}^{E}\left[\frac{\eta_v \tau_v}{8}\sum_{n=1}^{N}\|\nabla_v F_n(u^e, v_n^e)\|^2 + \frac{\eta_u \tau_u}{2}\sum_{j=1}^{W}\left\|\nabla_u^j F(u^e, V^{e+1})\right\|^2\right] \le \frac{\mathbb{E}[F(u^0, V^0) - F(u^*, V^*)]}{E} + A_1 + A_2\sum_{e=1}^{E}\sum_{n=1}^{N}\rho_n^e, \qquad (32)$$

where $A_1$ and $A_2$ are expressed as

$$A_1 = \frac{\eta_v^2 \tau_v^2 \hat\sigma_v^2 L_v}{2} + 4\eta_v^3 L_v^2 \hat\sigma_v^2 \tau_v^2(\tau_v - 1) + \frac{3\eta_u^2 W^2 \tau_u^2 \phi_u^2 L_u}{2} + \frac{W \phi_u^2 N \eta_u^3 L_u^2 \tau_u^3 + 3W^2 \eta_u^2 \tau_u^2 N \hat\sigma_u^2 L_u^2 + 3W^2 L_u^3 \tau_u^4 \eta_u^4 N \phi_u^2}{2\Gamma^*}, \qquad (33)$$

and

$$A_2 = \frac{W \eta_u \tau_u L_u^2 D^2 + 3W^2 \eta_u^3 L_u D^2 \tau_u^2}{E\Gamma^*}. \qquad (34)$$

In (32), (33), and (34), $E$ is the number of global communication rounds, $W$ is the total number of model weights in the global part, and $\Gamma^*$ is the minimum occurrence of the parameter in the global part over all rounds.

Proof: Please refer to Appendix A.

B. Problem Formulation

Given the previously mentioned system model, we focus on an optimization problem that aims to minimize the global loss in (3). The optimization problem is formulated as follows:

$$\min_{b_n^e,\,\rho_n^e} \;\sum_{e=1}^{E}\sum_{n=1}^{N} A_2\,\rho_n^e, \qquad (35)$$
$$\text{s.t.}\quad T_e \le T_{\text{th}}, \qquad (36)$$
$$\sum_{n=1}^{N} b_n^e \le 1, \qquad (37)$$
$$0 \le b_n^e \le 1, \qquad (38)$$
$$\rho_n^e \in [0, 1]. \qquad (39)$$

In (36), $T_{\text{th}}$ represents the computation and communication latency constraint. The constraints in (37) and (38) represent the wireless resource thresholds, namely, the bandwidth fraction $b_n^e$ allocated to the $n$th device in the $e$th global communication round cannot be larger than the total bandwidth $B$. The constraint in (39) represents the pruning ratio constraint, which should be carefully selected to prevent a sharp decline in learning accuracy.

To minimize the global loss function, a proper pruning ratio of the global part should be selected based on the latency and wireless resource thresholds. Since it is almost impossible to know the training performance exactly before the model has been trained, we turn to finding an upper bound of the l2-norm of gradients and minimizing it for the global loss minimization. Obviously, the optimization problem in (35) is a mixed integer non-linear programming (MINLP) problem, which is non-convex and impractical to solve directly for the optimal solutions. To address this issue, we decompose the original problem into several sub-problems to obtain sub-optimal solutions.

V. PRUNING RATIO AND WIRELESS RESOURCE OPTIMIZATION

In this section, we decouple the optimization problem in (35) into two sub-problems to obtain the optimal solutions of the pruning ratio and wireless resource allocation.

A. Optimal Pruning Ratio

According to (36), the computation and uplink transmission latency of the $n$th mobile device should satisfy the latency threshold, which is written as

$$\frac{\tau_v C_n W^e_{n,v_n}}{f_n} + (1-\rho_n^e)\frac{\tau_u C_n W^e_{n,u_n}}{f_n} + \frac{\hat q\,(1-\rho_n^e)\,W^e_{n,u_n}}{R^{\text{up}}_{n,e}} \le T_{\text{th}}. \qquad (40)$$
Theorem 2: Based on (40), the pruning ratio of the $n$th device in the $e$th global communication round should satisfy

$$\rho_n^{e*} \ge \left(1 - \frac{T_{\text{th}} - T^{\text{cmp-Per}}_{n,e}}{T^{\text{cmp-G}}_{n,e} + T^{\text{com-G}}_{n,e}}\right)^+, \qquad (41)$$

where $T^{\text{cmp-Per}}_{n,e}$ is the computation latency of the personalized part. In (41), $T^{\text{cmp-G}}_{n,e}$ and $T^{\text{com-G}}_{n,e}$ are the computation and transmission latency of the global part, respectively, and $(z)^+ = \max(z, 0)$.

Proof: Please refer to Appendix B.

Remark 1: Based on Theorem 2 and (71) in Appendix B, the pruning ratio of each device is jointly determined by the computation capability and uplink transmission rate. For a device with a high uplink transmission rate and computation capability, a small pruning ratio is adopted.
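Theorem 2 gives the smallest feasible pruning ratio in closed form, so it can be evaluated directly. The following sketch simply implements (41); the numbers in the usage line are made up for illustration.

```python
def min_pruning_ratio(t_th, t_cmp_per, t_cmp_g, t_com_g):
    """Smallest feasible pruning ratio from (41): (1 - (T_th - T^cmp-Per)/(T^cmp-G + T^com-G))^+."""
    return max(0.0, 1.0 - (t_th - t_cmp_per) / (t_cmp_g + t_com_g))

# Illustrative numbers only (ms): a 20 ms budget, 4 ms personalized-part computation,
# 10 ms global-part computation and 12 ms global-part transmission at rho = 0.
print(min_pruning_ratio(20.0, 4.0, 10.0, 12.0))   # ~0.27, i.e., about 27% of the global part must be pruned
```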
B. Optimal Wireless Resource Allocation

Based on the derived pruning ratio in (41), the optimization problem in (35) is rewritten as

$$\min_{b_n^e} \;\sum_{e=1}^{E}\sum_{n=1}^{N} A_2\left(1 - \frac{T_{\text{th}} - T^{\text{cmp-Per}}_{n,e}}{T^{\text{cmp-G}}_{n,e} + T^{\text{com-G}}_{n,e}}\right), \qquad (42)$$

with the constraints (37) and (38). Given $T^{\text{com-G}}_{n,e} = \hat q W^e_{n,u_n}/R^{\text{up}}_{n,e}$, (42) is further rewritten as

$$\min_{b_n^e} \;\sum_{e=1}^{E}\sum_{n=1}^{N} A_2\left(1 - \frac{R^{\text{up}}_{n,e}\,(T_{\text{th}} - T^{\text{cmp-Per}}_{n,e})}{R^{\text{up}}_{n,e}\,T^{\text{cmp-G}}_{n,e} + \hat q W^e_{n,u_n}}\right). \qquad (43)$$

The optimal wireless resource allocation is achieved by solving the optimization problem in (43). First, based on the following Lemma 1, we prove that the optimization problem in (43) is convex with respect to the bandwidth fraction $b_n^e$.

Lemma 1: The optimization problem in (43) is convex with respect to the uplink transmission rate.

Proof: Please refer to Appendix C.

Based on the Lagrange multiplier method, the optimal bandwidth allocation is achieved in the following theorem.

Theorem 3: The optimal bandwidth allocated to the $n$th device is derived as

$$b_n^{e*} = \frac{\sqrt{\dfrac{(T_{\text{th}} - T^{\text{cmp-Per}}_{n,e})\,\hat q W^e_{n,u_n}\, B\log_2\!\left(1 + \frac{g_n^e p_n}{\sigma^2}\right)}{\lambda^*}} - \hat q W^e_{n,u_n}}{B\log_2\!\left(1 + \frac{g_n^e p_n}{\sigma^2}\right) T^{\text{cmp-G}}_{n,e}}, \qquad (44)$$

where $\lambda^*$ is the optimal Lagrange multiplier.

Proof: Please refer to Appendix D.

Based on Theorem 2 and Theorem 3, the optimal pruning ratio is calculated as

$$\rho_n^{e*} = 1 - \frac{b_n^{e*}\,(T_{\text{th}} - T^{\text{cmp-Per}}_{n,e})\, B\log_2\!\left(1 + \frac{g_n^e p_n}{\sigma^2}\right)}{b_n^{e*}\, T^{\text{cmp-G}}_{n,e}\, B\log_2\!\left(1 + \frac{g_n^e p_n}{\sigma^2}\right) + \hat q W^e_{n,u_n}}. \qquad (45)$$

Remark 2: According to Theorem 3, we can observe that devices with bad channel conditions are allocated more bandwidth to satisfy the transmission latency. In addition, a device with high computation capacity is allocated more bandwidth, which not only decreases computation latency but also improves the convergence rate.
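Theorem 3 fixes $b_n^{e*}$ once $\lambda^*$ is known, and $\lambda^*$ itself can be found by a one-dimensional search on the bandwidth constraint (37). A minimal sketch follows, assuming the uplink rate model $R^{\text{up}}_{n,e} = b_n^e B\log_2(1 + g_n^e p_n/\sigma^2)$ (its defining equation is not reproduced in this excerpt) and clipping each fraction to $[0,1]$ for (38).

```python
import math

def bandwidth_fraction(lam, B, gamma, t_th, t_cmp_per, t_cmp_g, q_hat, W_u):
    """b_n^* from (44) for a given Lagrange multiplier lam.

    gamma stands for g_n^e p_n / sigma^2; the result is clipped to [0, 1].
    """
    rate_per_fraction = B * math.log2(1.0 + gamma)
    root = math.sqrt((t_th - t_cmp_per) * q_hat * W_u * rate_per_fraction / lam)
    b = (root - q_hat * W_u) / (rate_per_fraction * t_cmp_g)
    return min(1.0, max(0.0, b))

def allocate_bandwidth(devices, B, iters=100, lam_lo=1e-12, lam_hi=1e12):
    """Bisection on lam so that the fractions sum to one (constraint (37)).

    `devices` is a list of tuples (gamma, t_th, t_cmp_per, t_cmp_g, q_hat, W_u).
    Each b_n^* in (44) decreases in lam, so the total is monotone as well.
    """
    for _ in range(iters):
        lam_mid = math.sqrt(lam_lo * lam_hi)   # geometric midpoint over a wide bracket
        total = sum(bandwidth_fraction(lam_mid, B, *d) for d in devices)
        if total > 1.0:
            lam_lo = lam_mid                   # too much bandwidth handed out: raise lam
        else:
            lam_hi = lam_mid
    lam = math.sqrt(lam_lo * lam_hi)
    return [bandwidth_fraction(lam, B, *d) for d in devices]
```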
VI. SIMULATION RESULTS

In this section, we examine the effectiveness of our proposed FL with partial model pruning and personalization. In the simulation, we consider a scenario with one edge server and ten devices participating in model training. We use a common CNN model for image classification over the datasets MNIST and Fashion MNIST, which contain 50000 training samples and 10000 testing samples, respectively. The input size of the CNN is 1 × 28 × 28, and the sizes of the first and second convolutional layers are 32 × 28 × 28 and 64 × 14 × 14, respectively. The sizes of the first and second max-pooling layers are 32 × 14 × 14 and 64 × 7 × 7, respectively. The sizes of the first and second fully-connected layers are 3136 and 128, respectively. The size of the output layer is 10. The global parts are transmitted between the edge server and devices by wireless channels. The main simulation parameters are presented in Table I.
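For reference, the layer sizes reported above correspond to a model of the following shape; the kernel size of 3 with padding 1 and the ReLU activations are assumptions, since only the feature-map and fully-connected sizes are reported.

```python
import torch.nn as nn

class CNN(nn.Module):
    """CNN matching the layer sizes reported in Section VI (shapes in comments)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 32 x 28 x 28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32 x 14 x 14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 64 x 14 x 14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64 x 7 x 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 3136 features
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),                  # output layer of size 10
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```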
A. FL with Partial Model Pruning and Personalization

Fig. 2 (a) and (b) plot the loss value of FL with partial model pruning and personalization with different pruning ratios on MNIST and Fashion MNIST, respectively. Fig. 2 (c) plots the testing accuracy of FL with partial model pruning and personalization with different pruning ratios on the non-IID datasets MNIST and Fashion MNIST, respectively. It is observed that the convergence rate decreases and the loss increases with an increasing pruning ratio. This is because more model weights are pruned with a higher pruning ratio, which leads to a higher model aggregation error, and more iterations are required to train the learning models. Fig. 3 plots the comparison of the testing accuracy of FL with partial model pruning and personalization with alternating (FedAlt) and simultaneous (FedSim) local updating on the non-IID datasets MNIST and Fashion MNIST, respectively. It is observed that the testing accuracy of FedAlt is slightly higher than that of FedSim.

B. FL with Partial Model Pruning and Personalization in Wireless Networks

In this subsection, the effect of the latency threshold on FL with partial model pruning and personalization and the joint design of the proposed FL algorithm and wireless resource allocation over the non-IID dataset Fashion MNIST are simulated.

1) Effect of Latency Threshold: Fig. 4 (a) and (b) plot the loss value and testing accuracy of FL with partial model pruning and personalization with different latency thresholds on the non-IID dataset Fashion MNIST, respectively. Fig. 4 (c) shows the required pruning ratio to achieve a given latency constraint on the non-IID dataset Fashion MNIST. Four latency thresholds are considered, which are 15 ms, 20 ms, 25 ms, and 30 ms. It is observed that the testing loss decreases and the testing accuracy increases with an increasing latency threshold. Also, the number of global communication rounds required to achieve convergence is small with a high latency threshold. This is because a small pruning ratio is selected with a large latency threshold. However, for a device with a small latency threshold, a large pruning ratio is considered to satisfy the
TABLE I: Simulation parameters of FL with partial model pruning and personalization and wireless resource allocation.
TABLE II: Computation and communication latency of each global communication round (ms).
and communication latency is about 50% less than that of the scheme only with model personalization. This is because the proposed FL with partial model pruning and personalization is able to dynamically prune the unimportant weights based on the wireless channel, which further decreases the latency for both local model updating and uplink transmission, especially when the model size is large. In addition, it is observed that the learning accuracy of the proposed FL is much better than that of FL only with model pruning in [14]. This is because partial model personalization is able to learn the data heterogeneity of different devices.

VII. CONCLUSIONS

In this paper, a communication and computation efficient FL framework with partial model pruning and personalization over wireless networks was proposed to adapt to data heterogeneity and dynamic wireless environments. Specifically, the convergence analysis of an upper bound on the l2-norm of gradients for the proposed FL framework was derived. Subsequently, the closed-form solutions of the pruning ratio and wireless resource allocation were derived under the latency and bandwidth thresholds by KKT conditions. Simulation results have demonstrated that our proposed FL framework achieved similar learning accuracy compared to FL only with partial model personalization and reduced the computation and communication latency by about 50%.

APPENDIX

A. Appendix A - Proof of Theorem 1

We now analyze the convergence of FL with partial model pruning and personalization. Throughout the proof, we use the following inequalities. From Jensen's inequality, for any $z_k \in \mathbb{R}^d$, $k \in \{1, 2, \ldots, K\}$, we have

$$\left\|\frac{1}{K}\sum_{k=1}^{K} z_k\right\|^2 \le \frac{1}{K}\sum_{k=1}^{K}\|z_k\|^2, \qquad (46)$$

which directly gives

$$\left\|\sum_{k=1}^{K} z_k\right\|^2 \le K\sum_{k=1}^{K}\|z_k\|^2. \qquad (47)$$
$$+ \frac{L_u}{2}\left\|u^{e+1} - u^e\right\|^2. \qquad (53)$$

Before deriving the smoothness bound in (53), several lemmas are first introduced as follows.

Lemma 2: Under Assumptions 2 and 3, for any global communication round $e$, we obtain that

$$\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|u_n^{e,t-1} - u_n^e\|^2 \le \eta_u^2\phi_u^2 N \tau_u^3 + 2\tau_u D^2\sum_{n=1}^{N}\rho_n^e. \qquad (54)$$

Proof: In (54), $u_n^e$ is the received global part from the edge server at the beginning of the $e$th global communication round, and the difference $(u_n^{e,t-1} - u_n^e)$ consists of two parts, namely, the variation because of local global part training $(u_n^{e,t-1} - u_n^{e,0})$ and the variation because of pruning $(u_n^{e,0} - u_n^e)$. Therefore, (54) is rewritten as

$$= 2\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|u_n^e \odot m_n^e - u_n^e\|^2 \le 2\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\rho_n^e D^2 = 2\tau_u D^2\sum_{n=1}^{N}\rho_n^e, \qquad (57)$$

where the second step is obtained from the pruning-induced noise in Assumption 2. By plugging (56) and (57) into (55), we obtain the desired result, which ends the proof of Lemma 2.

Lemma 3: Under Assumptions 1-3, for any global communication round $e$, we derive that

$$\sum_{t=1}^{\tau_u}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 \le \frac{\phi_u^2 N \eta_u^2 L_u^2 \tau_u^4 + 2\tau_u^2 L_u^2 D^2\sum_{n=1}^{N}\rho_n^e}{\Gamma^*}, \qquad (58)$$

where $\Gamma_e^j = |\mathcal{N}_e^j|$ is the number of local models containing parameter $j$ in the $e$th global communication round and $\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})$ is the gradient of the $j$th weight.
Lemma 4: For the bounded variance under Assumption 4, for any global communication round $e$, we obtain that

$$\frac{1}{\Gamma_e^j}\sum_{t=1}^{\tau_u}\mathbb{E}\left\|\sum_{n\in\mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right]\right\|^2 \le \frac{\tau_u^2 N \hat\sigma_u^2}{\Gamma^*}. \qquad (60)$$

Proof:

$$\frac{1}{\Gamma_e^j}\sum_{t=1}^{\tau_u}\mathbb{E}\left\|\sum_{n\in\mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right]\right\|^2$$
$$\le \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\left\|\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right\|^2$$
$$\le \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\left\|\nabla_u F_n(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n(u_n^{e,t-1}, v_n^{e,\tau_v})\right\|^2 \le \frac{\tau_u^2 N \hat\sigma_u^2}{\Gamma^*}. \qquad (61)$$

In the second step, we consider that the l2-norm of the gradient of a vector is no larger than the sum of the norms of all of its sub-vectors, which allows us to consider $\nabla_u F_n$ rather than its sub-vectors. The last step in (61) is obtained from the bounded variance in Assumption 4, which ends the proof of Lemma 4.

$$- \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\Big]\Bigg\|^2 + 3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{t=1}^{\tau_u}\sum_{n\in\mathcal{N}_e^j}\eta_u\left[\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2, \qquad (63)$$

where we split the stochastic gradient $\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1})$ into three parts, namely, $[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})]$, $[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})]$, and $[\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})]$.

The third term of the last step in (63) is derived as

$$3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{t=1}^{\tau_u}\sum_{n\in\mathcal{N}_e^j}\eta_u\left[\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 \le 3\eta_u^2 W \tau_u \sum_{j=1}^{W}\sum_{t=1}^{\tau_u}\mathbb{E}\left\|\nabla_u F_n(u_n^e, v_n^{e,\tau_v})\right\|^2 \le 3\eta_u^2 W^2 \tau_u^2 G^2. \qquad (64)$$

Through plugging (58), (60), and (64) into (63), the upper bound of $\mathbb{E}\|u^{e+1} - u^e\|^2$ is derived as (62), which ends the proof of Lemma 5. Then, by taking expectations on both sides of (53), we obtain

$$\mathbb{E}[F(u^{e+1}, V^{e+1})] - \mathbb{E}[F(u^e, V^{e+1})] \le \mathbb{E}\langle\nabla_u F(u^e, V^{e+1}),\, u^{e+1} - u^e\rangle + \frac{L_u}{2}\mathbb{E}\|u^{e+1} - u^e\|^2. \qquad (65)$$

First, we analyze $\mathbb{E}\langle\nabla_u F(u^e, V^{e+1}),\, u^{e+1} - u^e\rangle$ by considering a sum of inner products over all model weights, which is
C. Appendix C - Proof of Lemma 1

The objective function in (43) is equal to

$$F(b) = \sum_{e=1}^{E}\sum_{n=1}^{N} f(b_n^e) = \sum_{e=1}^{E}\sum_{n=1}^{N}\left(1 - \frac{b_n^e V_1}{b_n^e V_2 + V_3}\right), \qquad (74)$$

where $V_1, V_2, V_3 > 0$ and $0 \le b_n^e \le 1$. To prove Lemma 1, we need to analyze the convexity of the function $f(b_n^e)$. The first derivative of $f(b_n^e)$ is computed as

$$f'(b_n^e) = -\frac{V_1 V_3}{(b_n^e V_2 + V_3)^2}. \qquad (75)$$

Then, the second derivative of $f(b_n^e)$ is derived as

$$f''(b_n^e) = \frac{2V_1 V_2 V_3}{(b_n^e V_2 + V_3)^3} > 0. \qquad (76)$$

As a result, the objective function in (43) is convex. Meanwhile, both constraints in (37) and (38) are convex. Consequently, the optimization problem in (43) is convex, which ends the proof of Lemma 1.
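The sign of (76) can also be double-checked symbolically. This is only a verification sketch of the derivative computation, not part of the proof.

```python
import sympy as sp

b, V1, V2, V3 = sp.symbols('b V1 V2 V3', positive=True)
f = 1 - b * V1 / (b * V2 + V3)          # the summand f(b_n^e) in (74)
second = sp.simplify(sp.diff(f, b, 2))  # -> 2*V1*V2*V3/(V2*b + V3)**3, matching (76)
print(second, second.is_positive)       # positive for all positive V1, V2, V3 and b
```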
D. Appendix D - Proof of Theorem 3

Based on the optimization problem in (43) and the constraint in (37), the Lagrange function is written as

$$\mathcal{L}(b_n^e, \lambda) = \sum_{n=1}^{N}\left(1 - \frac{b_n^e B\log_2\!\left(1+\frac{g_n^e p_n}{\sigma^2}\right)(T_{\text{th}} - T^{\text{cmp-Per}}_{n,e})}{b_n^e B\log_2\!\left(1+\frac{g_n^e p_n}{\sigma^2}\right) T^{\text{cmp-G}}_{n,e} + \hat q W^e_{n,u_n}}\right) + \lambda\left(\sum_{n=1}^{N} b_n^e - 1\right), \qquad (77)$$

where $\lambda$ is a Lagrange multiplier. Then, we consider the Karush-Kuhn-Tucker (KKT) conditions to solve the problem, which are written as

$$\frac{\partial \mathcal{L}}{\partial b_n^e} = \lambda - \frac{(T_{\text{th}} - T^{\text{cmp-Per}}_{n,e})\,\hat q W^e_{n,u_n}\, B\log_2\!\left(1+\frac{g_n^e p_n}{\sigma^2}\right)}{\left[b_n^e B\log_2\!\left(1+\frac{g_n^e p_n}{\sigma^2}\right) T^{\text{cmp-G}}_{n,e} + \hat q W^e_{n,u_n}\right]^2} = 0, \qquad (78)$$

$$\lambda\left(\sum_{n=1}^{N} b_n^e - 1\right) = 0, \quad \lambda \ge 0. \qquad (79)$$

According to the KKT conditions, the optimal bandwidth allocation for each device is obtained as in Theorem 3, which ends the proof of Theorem 3.

REFERENCES

[1] D. C. Nguyen, M. Ding, P. N. Pathirana, A. Seneviratne, J. Li, and H. Vincent Poor, "Federated learning for internet of things: A comprehensive survey," IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1622–1658, 2021.
[2] L. U. Khan, W. Saad, Z. Han, E. Hossain, and C. S. Hong, "Federated learning for internet of things: Recent advances, taxonomy, and open challenges," IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1759–1799, 2021.
[3] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1273–1282, 2017.
[4] X. Liu, Y. Deng, A. Nallanathan, and M. Bennis, "Federated and meta learning over non-wireless and wireless networks: A tutorial," arXiv:2210.13111, 2022.
[5] Y. Mu, N. Garg, and T. Ratnarajah, "Federated learning in massive MIMO 6G networks: Convergence analysis and communication-efficient design," IEEE Trans. Netw. Sci. Eng., vol. 9, no. 6, pp. 4220–4234, 2022.
[6] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
[7] K. Pillutla, K. Malik, A.-R. Mohamed, M. Rabbat, M. Sanjabi, and L. Xiao, "Federated learning with partial model personalization," in Proc. Int. Conf. Mach. Learn. (ICML), vol. 162, 2022.
[8] K. Mishchenko, R. Islamov, E. Gorbunov, and S. Horvath, "Partially personalized federated learning: Breaking the curse of data heterogeneity," arXiv:2305.18285, 2023.
[9] S. Liu, G. Yu, R. Yin, J. Yuan, L. Shen, and C. Liu, "Joint model pruning and device selection for communication-efficient federated edge learning," IEEE Trans. Commun., vol. 70, no. 1, pp. 231–244, Jan. 2022.
[10] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, "Model pruning enables efficient federated learning on edge devices," IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, 2022.
[11] J. Tan, Y. Zhou, G. Liu, J. H. Wang, and S. Yu, "pFedSim: Similarity-aware model aggregation towards personalized federated learning," arXiv:2305.15706, 2023.
[12] J. Liu, J. Wu, J. Chen, M. Hu, Y. Zhou, and D. Wu, "FedDWA: Personalized federated learning with online weight adjustment," arXiv:2305.06124, 2023.
[13] C. You, K. Guo, H. H. Yang, and T. Q. S. Quek, "Hierarchical personalized federated learning over massive mobile edge computing networks," IEEE Trans. Wireless Commun., pp. 1–1, 2023.
[14] X. Liu, S. Wang, Y. Deng, and A. Nallanathan, "Adaptive federated pruning in hierarchical wireless networks," arXiv:2305.09042, 2023.
[15] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy efficient federated learning over wireless communication networks," IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
[16] M. Chen, H. V. Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2021.
[17] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2021.
[18] J. Ren, W. Ni, G. Nie, and H. Tian, "Research on resource allocation for efficient federated learning," arXiv:2104.09177, 2021.
[19] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11264–11272, Jun. 2019.
[20] S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. I. Venieris, and N. D. Lane, "FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021.
[21] S. Narang, G. Diamos, S. Sengupta, and E. Elsen, "Exploring sparsity in recurrent neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
[22] M. H. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," 2018.
[23] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, "HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6535–6548, 2020.
[24] D. Wen, M. Bennis, and K. Huang, "Joint parameter-and-bandwidth allocation for improving the efficiency of partitioned edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 8272–8286, 2020.
[25] S. Ghadimi and G. H. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM J. Optim., vol. 23, no. 4, pp. 2341–2368, 2013.
[26] S. Shi, K. Zhao, Q. Wang, Z. Tang, and X. Chu, "A convergence analysis of distributed SGD with communication-efficient gradient sparsification," in Proc. 28th Int. Joint Conf. Artif. Intell. (IJCAI), pp. 3411–3417, Aug. 2019.
[27] K. Mishchenko, R. Islamov, E. Gorbunov, and S. Horvath, "Partially personalized federated learning: Breaking the curse of data heterogeneity," arXiv:2305.18285, 2023.
[28] S. U. Stich, J. B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 4447–4458, Dec. 2018.
[29] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.
[30] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 901–909, Dec. 2016.