
Computation and Communication Efficient Federated Learning over Wireless Networks

Xiaonan Liu, Member, IEEE, Tharmalingam Ratnarajah, Senior Member, IEEE

Abstract—Federated learning (FL) allows model training from local data by edge devices while preserving data privacy. However, the learning accuracy decreases due to the heterogeneity of devices' data, and the computation and communication latency increase when updating large-scale learning models on devices with limited computational capability and wireless resources. To overcome these challenges, we consider a novel FL framework with partial model pruning and personalization. This framework splits the learning model into a global part with model pruning shared with all devices to learn data representations and a personalized part to be fine-tuned for a specific device. It adapts the model size during FL to reduce both computation and communication overhead, minimizes the overall training time, and increases the learning accuracy for devices with non-independent and identically distributed (non-IID) data. Then, the computation and communication latency and the convergence of the proposed FL framework are mathematically analyzed. Based on the convergence analysis, an optimization problem is formulated to maximize the convergence rate under a latency threshold by jointly optimizing the pruning ratio and wireless resource allocation. By decoupling the optimization problem and deploying Karush-Kuhn-Tucker (KKT) conditions, we derive closed-form solutions for the pruning ratio and wireless resource allocation. Finally, experimental results demonstrate that the proposed FL framework achieves a reduction of approximately 50% in computation and communication latency compared with the scheme with model personalization only.

Index Terms—Adaptive partial model pruning and personalization, communication and computation latency, federated learning, wireless networks.

I. INTRODUCTION

Federated learning (FL) is a promising approach for protecting data privacy and facilitating distributed learning across diverse domains, such as healthcare, finance, and mobile devices [1], [2], where multiple edge devices collaboratively train a learning model without sharing their raw data. Instead, only the model/gradient updates are shared with an edge server [3]. Despite its privacy-preserving ability, FL faces significant computation and communication challenges [4], [5]. Specifically, on the computation side, data heterogeneity across devices may destabilize the training process of FL, resulting in poor generalization of the global model and low learning accuracy. Furthermore, devices with limited computational capabilities suffer high computation latency, especially when updating large-scale learning models. On the communication side, as the number of edge devices grows, limited communication resources, e.g., bandwidth and energy, may impede devices from effectively contributing to the model aggregation, which results in high transmission latency and negatively affects the convergence rate and learning accuracy [6].

To overcome these challenges, on one hand, an innovative approach called FL with partial model personalization has been introduced [7], [8], which strikes a balance between the flexibility of personalization and the cooperativeness of global training. This FL framework splits the learning model into a global part, which is shared with all devices to learn data representations, and a personalized part, which is fine-tuned for a specific device based on the heterogeneity of the local dataset. In each global communication round, the edge server broadcasts the current global part to devices. Then, each device performs one or more steps of stochastic gradient descent to update both the global and personalized parts, and transmits only the updated global part to the edge server for model aggregation, which further reduces the transmission latency. Meanwhile, the updated personalized part is kept locally at the device to serve as the initialization for another update in the next global communication round.

On the other hand, federated pruning is also an effective approach to improve computation and communication efficiency [9], [10], where the model size is adapted during the training phase, the inference phase, or both, to decrease computation and communication overhead and minimize the overall training time while maintaining a learning accuracy similar to that of the original model. However, the FL pruning methods proposed in [9], [10] are unsuitable for devices with non-independent and identically distributed (non-IID) data settings. Therefore, it is essential to design an FL framework with high communication and computation efficiency and good adaptability to non-IID datasets.

Motivated by the above, in this work, we propose a communication and computation efficient FL framework with partial model pruning and personalization over wireless networks. First, the variation of computation and communication latency caused by partial model pruning and personalization is mathematically derived. Second, the convergence analysis of the proposed FL framework is presented. Then, the pruning ratio and wireless resource allocation under latency and bandwidth thresholds are jointly optimized to further improve learning performance. The main contributions are summarized as follows.

• To adapt to data heterogeneity across different devices and dynamical wireless environments, we propose a communication and computation efficient FL framework with partial model pruning and personalization.

X. Liu and T. Ratnarajah are with the Institute for Digital Communications, The University of Edinburgh, U.K. (e-mail: {xliu8, t.ratnarajah}@ed.ac.uk). (Corresponding author: Tharmalingam Ratnarajah).

• We model the computation and communication latency of the proposed FL framework under a given pruning ratio. Also, we analyze the convergence rate via an upper bound on the $\ell_2$-norm of gradients for FL with partial model pruning and personalization. Then, we jointly optimize the pruning ratio of the global part and the wireless resource allocation to maximize the convergence rate under the latency and bandwidth thresholds.

• In order to derive the optimal solutions of the pruning ratio and wireless resource allocation in each global communication round, we decouple the optimization problem into two sub-problems and deploy Karush-Kuhn-Tucker (KKT) conditions to solve these sub-problems and obtain the desired closed-form solutions.

• The experimental results demonstrate that our proposed FL framework with partial model pruning and personalization achieves learning accuracy comparable to FL with model personalization only, and leads to a reduction of approximately 50% in computation and communication latency.

The rest of this paper is organized as follows. Section II presents the related works. Section III introduces the system model. The convergence analysis and problem formulation are detailed in Section IV. The pruning ratio and wireless resource optimization are presented in Section V. The simulation results and conclusions are shown in Section VI and Section VII, respectively.

II. RELATED WORKS

In this section, related works on personalized FL, FL with model pruning, and resource allocation and device selection are briefly introduced in the following three subsections.

A. Personalized FL

To address the issue of data heterogeneity of each device, personalized FL (PFL) was considered in [11]-[13]. In [11], personalized FL based on model similarity was proposed to leverage the classifier-based similarity to conduct personalized model aggregation without incurring extra communication overhead. In [12], a new PFL algorithm called FL with dynamic weight adjustment was proposed to leverage the edge server to calculate personalized aggregation weights based on the models collected from devices, which could capture similarities between devices and improve PFL learning accuracy by encouraging collaborations among devices with similar data distributions. In [13], hierarchical PFL (HPFL) was considered in massive mobile edge computing (MEC) networks. HPFL combined the objectives of training loss minimization and round latency minimization while jointly determining the optimal bandwidth allocation as well as the edge server scheduling policy in the hierarchical learning framework. However, the PFL schemes in [11]-[13] still need to transmit the whole learning model to the server for similarity calculation and model aggregation, which cannot guarantee low computation and communication latency, especially for large-scale learning models.

To address this issue in PFL, partial model personalization was proposed in [7] and [8], where only the global part of the learning model is delivered between the edge server and devices. In [7], two FL algorithms for training partially personalized models were introduced, where the shared and personal parameters were updated either simultaneously or alternately on devices. In [8], the authors proved that under the right split of parameters, it was possible to find proper global and personalized parameters that allowed each device to fit its local dataset perfectly, and broke the curse of data heterogeneity in several settings, such as training with local steps, asynchronous training, and Byzantine-robust training. However, the communication and computation overheads of partial model personalization in [7] and [8] over wireless networks have rarely been explored.

B. FL with Model Pruning

To improve the communication efficiency of FL, federated pruning was proposed in [9], [10], [14]. In [9], model pruning was adopted before local gradient calculation to reduce both the local model computation and gradient communication latency in FL over wireless networks. By removing the stragglers with low computing power and bad channel conditions, device selection was also considered to save the communication overhead and reduce the model aggregation error caused by the model pruning. In [10], PruneFL was proposed to effectively reduce the size of neural networks so that resource-limited devices could train the learning model within a short time. PruneFL included initial pruning at a selected device and further pruning as part of the FL process. Also, it included a low-complexity adaptive pruning method for efficient FL, which was able to find the desired model size that could achieve a prediction accuracy similar to that of the original model but in much less time. In [14], model pruning for hierarchical FL in wireless networks was introduced to reduce the neural network scale, which decreased both computation and communication latency while guaranteeing a learning accuracy similar to that of the original model. However, FL with model pruning in [9], [10], [14] was not suitable for non-IID data settings.

C. Resource Allocation and Device Selection

To adapt to the limited wireless resources, computation and communication resource allocation and edge-device association for wireless FL were investigated in [15]-[18]. Specifically, in [15], by optimizing time allocation, bandwidth allocation, power control, and computation frequency, an iterative algorithm was proposed to minimize the total energy consumption for local computation and wireless transmission of FL. In [16], a probabilistic user selection scheme was proposed to minimize the FL convergence time and the FL training loss in a joint learning, wireless resource allocation, and user selection problem. In [17], the Hungarian algorithm was used to find the optimal device selection and resource block allocation so as to minimize the FL loss function. In [18], by deploying successive convex approximation and Hungarian algorithms to optimize bandwidth, computation frequency, power allocation, and sub-carrier assignment, the sum of system and learning costs was minimized. Although the proposed device selection
and resource allocation approaches in [15]-[18] effectively alleviate communication pressure, uploading the entire learning model still poses a challenge for devices with poor channel conditions.

III. SYSTEM MODEL

In a wireless FL network, we assume an edge server provides wireless connections for $\mathcal{N} = \{1, 2, ..., N\}$ mobile devices. The edge server is equipped with $M$ antennas and each mobile device is equipped with a single antenna. Meanwhile, the $n$th device has a local dataset $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{D_n}$, where $x_i$ is the $i$th input data sample, $y_i$ is the corresponding labeled output of $x_i$, and $D_n$ is the number of data samples. Without loss of generality, we assume that there is no overlap between the datasets of different devices, namely, $\mathcal{D}_n \cap \mathcal{D}_k = \emptyset$ $(\forall n, k \in \mathcal{N})$. Thus, the whole dataset and the total number of data samples are denoted as $\mathcal{D} = \cup_{n=1}^{N} \mathcal{D}_n$ and $D = \sum_{n=1}^{N} D_n$, respectively.

The main objective of traditional FL algorithms, such as FedAvg [3], is to find an optimal global model $w^* = w_n^*$ $(\forall n \in \mathcal{N})$ that minimizes the local loss function $F_n(w)$ of each device, which is denoted as

$F_n(w^*) = \arg\min_{w_n} F_n(w_n)$,  (1)

where the local loss $F_n(w_n)$ of the $n$th device is defined on its local dataset $\mathcal{D}_n$ and is denoted as

$F_n(w_n) = \frac{1}{D_n} \sum_{i=1}^{D_n} f_n(x_i, y_i, w_n)$.  (2)

In (2), $f_n(x_i, y_i, w_n)$ is the loss function (e.g., cross-entropy or mean square error (MSE)) that denotes the difference between the model output and the desired output based on the local model $w_n$.

However, the objective function in (1) is often too restrictive, as there might not exist a global model $w^*$ that fits all devices in real-world FL systems, especially when the data distribution across devices is non-IID or heterogeneous, namely, under statistical heterogeneity. The local optimal models may drift significantly from each other and lead to poor generalization on each device because of statistical data heterogeneity. Fortunately, modern deep learning models usually have a multi-layer architecture, and a general insight is that the lower layers (close to the input) are responsible for feature extraction while the upper layers (close to the output) focus on complex pattern recognition. Therefore, based on the application domain and scenarios, either the input layers or the output layers of the model can be personalized, and the model $w$ is split into a global part $u$ shared with all devices and a personalized part $v$ specific to the device, i.e., $w = [u, v]$. Then, the objective functions in (1) and (2) are rewritten as

$F_n(u^*, v^*) = \arg\min_{u_n, v_n} F_n(u_n, v_n)$,  (3)

and

$F_n(u_n, v_n) = \frac{1}{D_n} \sum_{i=1}^{D_n} f_n(x_i, y_i, u_n, v_n)$,  (4)

respectively.
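To make the split $w = [u, v]$ concrete, the following sketch partitions a model's parameters by layer name into a global part $u$ and a personalized part $v$. It is a minimal PyTorch-style illustration under our own naming assumptions (here the output layer is chosen as the personalized part), not the authors' implementation.

import torch.nn as nn

def split_parameters(model: nn.Module, personalized_prefix: str = "out"):
    """Split w = [u, v]: parameters whose names start with
    `personalized_prefix` form the personalized part v kept on the
    device; the rest form the global part u shared with all devices."""
    u, v = {}, {}
    for name, param in model.named_parameters():
        (v if name.startswith(personalized_prefix) else u)[name] = param
    return u, v

# usage sketch: personalize the output layer of a toy model
model = nn.Sequential()
model.add_module("feat", nn.Linear(128, 64))  # lower layer: representation
model.add_module("out", nn.Linear(64, 10))    # upper layer: personalized
u, v = split_parameters(model, personalized_prefix="out")

Only the tensors in u would be uploaded for aggregation; v never leaves the device.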
A. Model Pruning

In the FL framework with partial model personalization, the scale of the neural network can be very large with increasing requirements on learning performance, such as high learning accuracy. As a result, updating learning models on local devices and delivering them between edge servers and devices may lead to high computation and communication latency, even if only the global part $u$ is shared among devices. In order to solve these problems, the strategy of model pruning is employed to reduce the size of the learning model.

Effectively pruning insignificant neurons or weights may significantly reduce the model size while incurring only a minor decline in learning performance; the learning accuracy only degrades severely at high pruning ratios. As illustrated in [19], the importance of a weight is quantified by the error induced by removing it. This induced error is computed as the squared difference between the prediction errors obtained with and without the $j$th weight $u_{n,j}$ of the $n$th device, which is denoted as

$I_{n,j} = \left(F_n(u_n, v_n) - F_n(u_n, v_n \,|\, u_{n,j} = 0)\right)^2$.  (5)

Weight importance increases as its corresponding error calculated in (5) grows. Nevertheless, calculating $I_{n,j}$ for each weight of the $n$th device in (5) is computationally intensive, particularly when the $n$th device has a substantial number of model weights. In order to alleviate the computational burden of the importance calculation, we calculate the difference between the $j$th local model weight and the updated $j$th local model weight as

$\hat{I}_{n,j} = |u_{n,j} - \hat{u}_{n,j}|$.  (6)

Calculating importance using (6) is straightforward, as the updated local model weight $\hat{u}_{n,j}$ is already available through backpropagation.

The goal of model pruning is to mitigate the high computational demands during the training and inference phases. When the $l$th layer of the learning model is pruned through importance-based model pruning with a given pruning ratio $\rho_{n,l}$, it becomes unnecessary to perform the forward and backward passes or gradient updates on the pruned units. Therefore, model pruning offers benefits in both floating point operation (FLOP) count reduction and model size reduction [20]. In the context of the $l$th fully-connected layer, the number of weights is computed as

$W_{n,l} = \lceil \rho_{n,l} W_{n,l,in} \rceil \lceil \rho_{n,l} W_{n,l,out} \rceil$,  (7)

where $W_{n,l,in}$ and $W_{n,l,out}$ correspond to the number of input and output weights, respectively, and the number of weights is decreased by a factor of $\frac{W_{n,l,in} W_{n,l,out}}{\lceil \rho_{n,l} W_{n,l,in} \rceil \lceil \rho_{n,l} W_{n,l,out} \rceil} \sim \frac{1}{\rho_{n,l}^2}$. Furthermore, the bias terms are reduced by a factor of $\frac{W_{n,l,out}}{\lceil \rho_{n,l} W_{n,l,out} \rceil} \sim \frac{1}{\rho_{n,l}}$. In the following sections, we simply utilize $\rho_n$ to represent the pruning ratio of the $n$th device.
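As a concrete illustration of the importance measure in (6), the sketch below builds a binary mask that removes the fraction $\rho_n$ of global-part weights with the smallest update magnitude. This is a minimal NumPy sketch under our own naming assumptions (flattened weight vectors, a stand-in SGD step), not the authors' code.

import numpy as np

def pruning_mask(u_old, u_new, rho):
    """Build a binary mask that prunes the rho-fraction of global-part
    weights with the smallest update magnitude |u_old - u_new| (Eq. (6))."""
    importance = np.abs(u_old - u_new)           # proxy importance, Eq. (6)
    n_prune = int(rho * importance.size)         # number of weights to drop
    mask = np.ones_like(u_old)
    if n_prune > 0:
        drop = np.argsort(importance)[:n_prune]  # least-important indices
        mask[drop] = 0.0
    return mask

# usage sketch: prune 30% of a toy global part after a local update
u_old = np.random.randn(1000)
u_new = u_old - 0.01 * np.random.randn(1000)     # stand-in for an SGD step
m = pruning_mask(u_old, u_new, rho=0.3)
u_pruned = u_new * m                             # element-wise product, as in Eq. (10) below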
B. Learning Process in FL with Partial Model Pruning and Personalization

The proposed FL with partial model pruning and personalization is shown in Fig. 1, where the personalized parts are only updated on local devices, and the global parts are pruned and updated on local devices and aggregated at the edge server. Therefore, learning model updating and aggregation in a global communication round include global part broadcasting, local model updating, and global part uploading. Meanwhile, to quantify the training overhead of FL with partial model pruning and personalization, we model the computation and communication latency within one global communication round. The learning process is introduced as follows.

Fig. 1. FL with partial model pruning and personalization framework.

1) Global Part Broadcasting: In the $e$th global communication round, the edge server broadcasts the global part $u_G^e$ to its associated devices by downlink transmission. In a practical system, the downlink transmission latency is very small due to sufficient channel bandwidth. Consequently, the transmission latency of global part broadcasting is ignored in this paper.

2) Local Model Updating: Local model updating includes the personalized and global part updates. According to [7], updating the personalized and global parts alternately, so-called LocalAlt, guarantees slightly higher learning accuracy. In LocalAlt, the personalized part $v_n^e$ is first updated for $\tau_v$ iterations while the global part remains fixed. Then, the global part $u_n^e$ is updated for $\tau_u$ iterations based on the updated personalized part $v_n^{e,\tau_v}$.

Based on (3) and (4), when the $n$th device receives the global part from the edge server at the beginning of the $e$th global communication round, the objective of the $n$th device is to find

$v_n^*(u_n^e) = \arg\min_{v_n^e} F_n(u_n^e, v_n^e)$.  (8)

Due to the limited memory capacity of the mobile device, calculating the loss over the whole dataset is time-consuming. As a result, we employ minibatch stochastic gradient descent (SGD), where the $n$th device utilizes a subset of its dataset to compute the loss. Within the $e$th global communication round, the personalized part update in the $t$th iteration is written as

$v_n^{e,t+1} = v_n^{e,t} - \eta_v \nabla_v F_n(u_n^e, v_n^{e,t}, \xi_n^{e,t})$,  (9)

where $\nabla_v F_n(u_n^e, v_n^{e,t}, \xi_n^{e,t})$ is the gradient of the personalized part in the $t$th iteration, $\eta_v$ is the learning rate of the personalized part, and $\xi_n^{e,t} \subseteq \mathcal{D}_n$ is the mini-batch randomly selected from the data samples $\mathcal{D}_n$ of the $n$th device.

After $\tau_v$ iterations of updating the personalized part $v_n^{e,t}$, we begin updating the global part $u_n^e$ of the $n$th device. The edge devices first update the global part $u_n^e$ by $u_n^e - \eta_u \nabla F_n(u_n^e, v_n^{e,\tau_v}, \xi_n^e)$ for $\hat{\tau}_u$ iterations, where $\hat{\tau}_u$ can be a small value according to [21], [22]. Then, the importance of each weight in the global part is calculated by (6), and the weights of the global part are sorted in descending order of importance. Given a pruning ratio $\rho_n^e$ of the $n$th device, we deploy a pruning mask $m_n^e$ to prune the global part $u_n^e$, which is calculated as

$u_n^{e,0} = u_G^e \odot m_n^e$.  (10)

In the pruning mask $m_n^e$, if $m_n^{e,j} = 1$, $u_n^{e,0}$ contains the $j$th model weight; otherwise, $m_n^{e,j} = 0$, and $m_n^{e,j}$ is determined by (6). In (10), the weights whose importance is ranked in the last $\rho_n^e W_{n,u_n}^e$ are pruned, where $W_{n,u_n}^e$ is the size of the global part. Based on the updated personalized part $v_n^{e,\tau_v}$, the global part update in the $t$th iteration is denoted as

$u_n^{e,t+1} = u_n^{e,t} - \eta_u \nabla F_n(u_n^{e,t}, v_n^{e,\tau_v}, \xi_n^{e,t}) \odot m_n^e$,  (11)

where $\nabla F_n(u_n^{e,t}, v_n^{e,\tau_v}, \xi_n^{e,t})$ is the gradient in the $t$th iteration and $\eta_u$ is the learning rate. Furthermore, the number of weights after pruning is calculated as

$W_{\rho_n^e} = W_{n,v_n}^e + (1 - \rho_n^e) W_{n,u_n}^e$,  (12)

where $W_{n,v_n}^e$ and $W_{n,u_n}^e$ are the sizes of the personalized and global parts, respectively.

Then, we calculate the computation latency incurred by the $n$th device. We assume that the number of CPU cycles for the $n$th device to update one model weight is $C_n$; thus, given the sizes of the global and personalized parts $W_{n,u_n}^e$ and $W_{n,v_n}^e$, the total numbers of CPU cycles to run one local iteration are $C_n W_{n,u_n}^e$ and $C_n W_{n,v_n}^e$, respectively. We denote the allocated CPU frequency of the $n$th device for computation as $f_n$ with $f_n \in [f_n^{min}, f_n^{max}]$. Therefore, the total latency of the local iterations is calculated as

$T_{n,e}^{cmp} = \frac{\tau_v C_n W_{n,v_n}^e}{f_n} + \frac{\hat{\tau}_u C_n W_{n,u_n}^e}{f_n} + \frac{(1 - \rho_n^e) \tau_u C_n W_{n,u_n}^e}{f_n}$,  (13)

where $\tau_u$ is the number of iterations for the global part update.
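One device's alternating round in (9)-(11), together with the mask-aware server averaging introduced later in Section III-D (cf. (16)), can be summarized in a short sketch. The gradient oracles grad_u and grad_v below are placeholders we assume are supplied by the training framework; this is an illustrative sketch, not the authors' implementation.

import numpy as np

def local_alt_round(u, v, grad_u, grad_v, mask, eta_u, eta_v, tau_u, tau_v):
    """One device's LocalAlt round: tau_v personalized steps (Eq. (9)),
    then tau_u masked global steps on the unpruned weights (Eq. (11))."""
    for _ in range(tau_v):                    # v-steps, global part frozen
        v = v - eta_v * grad_v(u, v)
    for _ in range(tau_u):                    # u-steps on unpruned weights
        u = u - eta_u * grad_u(u, v) * mask
    return u, v

def aggregate(global_parts, masks):
    """Server side: average each weight over the devices whose mask
    retained it (cf. Eq. (16)); a weight kept by no device stays at 0."""
    g = np.stack(global_parts)
    m = np.stack(masks)
    counts = np.maximum(m.sum(axis=0), 1.0)  # |N_e^j|, guarded against 0
    return (g * m).sum(axis=0) / counts

# toy usage with quadratic losses standing in for F_n
u0, v0 = np.ones(8), np.ones(4)
m0 = np.array([1, 1, 1, 1, 0, 0, 1, 1], dtype=float)
u1, v1 = local_alt_round(u0, v0,
                         grad_u=lambda u, v: u,   # stand-in gradients
                         grad_v=lambda u, v: v,
                         mask=m0, eta_u=0.1, eta_v=0.1, tau_u=5, tau_v=5)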
C. Local Model Uplink Transmission

After finishing local model updating, the $n$th device transmits its updated global part $u_n^e$ to the edge server, which leads to wireless transmission latency.

We consider an orthogonal frequency-division multiple access (OFDMA) protocol for devices in the FL network [23], [24]. The transmission rate between the edge server and the $n$th device in the $e$th global communication round is denoted as

$R_{n,e}^{up} = b_n^e B \log_2\left(1 + \frac{g_n^e p_n}{\sigma^2}\right)$,  (14)

where $b_n^e$ is the bandwidth fraction allocated to the $n$th mobile device in the $e$th global communication round, $B$ is the total bandwidth allocated to the edge server, $g_n^e$ is the channel gain between the edge server and the $n$th mobile device, $p_n$ is the transmission power of the $n$th mobile device, and $\sigma^2$ is the noise power. Then, the uplink transmission latency is calculated as

$T_{n,e}^{up} = \frac{\hat{q}(1 - \rho_n^e) W_{n,u_n}^e}{R_{n,e}^{up}}$,  (15)

where $\hat{q}$ is the quantization bit number.

D. Global Part Aggregation

Because of partial model pruning, some model weights of the global part are not present in the received local models. Let $\mathcal{N}_e^j$ be the set of devices associated with the edge server whose global part contains the $j$th model weight in the $e$th global communication round. Then, the global part update of the $j$th model weight is performed by aggregating the global parts in which the $j$th model weight is available, which is calculated as

$u^{e+1,j} = \frac{1}{|\mathcal{N}_e^j|}\sum_{n \in \mathcal{N}_e^j} u_n^{e,j}$,  (16)

where $|\mathcal{N}_e^j|$ is the number of global parts containing the $j$th model weight.

Then, the edge server delivers $u^{e+1}$ to its associated devices for the next round of local model updating. The edge server does not access the local dataset of any mobile device, which preserves personal data privacy. Since the edge server typically has high computation capability, the computation latency of the global part aggregation is neglected.

The detailed FL with partial model pruning and personalization is presented in Algorithm 1.

Algorithm 1 FL with partial model pruning and personalization
1: Local datasets $\mathcal{D}_n$ on $N$ local devices associated with the edge server; learning rates $\eta_u$ and $\eta_v$; numbers of local iterations $\tau_u$, $\hat{\tau}_u$, and $\tau_v$; number of global communication rounds $E$; edge model parameterized by $u^e$; local models parameterized by $u_n^{e,t}$ and $v_n^{e,t}$.
2: for global communication round $e = 1, ..., E$ do
3:   for local device $n = 1, ..., N$ do
4:     for iteration $t = 1, 2, ..., \tau_v$ do
5:       Update $v_n^{e,t}$ as in (9).
6:     end for
7:     Update the global part $u_n^e$ by $u_n^e - \eta_u \nabla F_n(u_n^e, v_n^{e,\tau_v}, \xi_n^e)$ for $\hat{\tau}_u$ iterations and generate the mask $m_n^e$ by (6).
8:     Initialize $u_n^{e,0} = u^e \odot m_n^e$.
9:     for iteration $t = 1, 2, ..., \tau_u$ do
10:      Update $u_n^{e,t}$ as in (11).
11:    end for
12:  end for
13:  for parameter $j$ in global part $u_n^{e,\tau_u}$ do
14:    Find $\mathcal{N}_e^j = \{n : m_n^{e,j} = 1\}$.
15:    Update $u^{e,j}$ as in (16).
16:  end for
17: end for

E. Computation and Communication Latency

Synchronous training is considered in the proposed FL framework over wireless networks, and we mainly focus on the local computation and uplink transmission latency, which is written as

$T_n^e = T_{n,e}^{cmp} + T_{n,e}^{up}$  (17)
$= \frac{\tau_v C_n W_{n,v_n}^e}{f_n} + \frac{(1 - \rho_n^e)\tau_u C_n W_{n,u_n}^e}{f_n} + \frac{\hat{\tau}_u C_n W_{n,u_n}^e}{f_n} + \frac{\hat{q}(1 - \rho_n^e) W_{n,u_n}^e}{R_{n,e}^{up}}$.  (18)

Therefore, the latency of the edge server in the $e$th global communication round is written as

$T^e = \max_{n \in \mathcal{N}}\{T_n^e\}$.  (19)

Obviously, from (19), we observe that the bottleneck of the computation and communication latency is determined by the last device that finishes all local iterations and the uplink transmission after local model updating.
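To illustrate how (13)-(19) combine, the sketch below evaluates the per-device round latency and the round bottleneck using the parameter values of Table I (28 dBm transmit power, 20 MHz bandwidth, 3 GHz CPU frequency, -110 dBm noise, 32-bit quantization). The model sizes, channel gains, and the cycles-per-weight constant $C_n = 1$ are toy assumptions of ours.

import numpy as np

def round_latency(W_u, W_v, rho, b, g, C=1.0, f=3e9, tau_u=10, tau_v=10,
                  tau_u_hat=1, p_dbm=28.0, B=20e6, noise_dbm=-110.0, q=32):
    """Per-device latency T_n^e = T^cmp (Eq. (13)) + T^up (Eq. (15))."""
    p = 10 ** (p_dbm / 10) * 1e-3               # transmit power in watts
    noise = 10 ** (noise_dbm / 10) * 1e-3       # noise power in watts
    rate = b * B * np.log2(1 + g * p / noise)   # uplink rate, Eq. (14)
    t_cmp = (tau_v * C * W_v + tau_u_hat * C * W_u
             + (1 - rho) * tau_u * C * W_u) / f           # Eq. (13)
    t_up = q * (1 - rho) * W_u / rate                     # Eq. (15)
    return t_cmp + t_up

# toy example: 10 devices, equal bandwidth fractions, random channel gains
N = 10
gains = 10 ** np.random.uniform(-10, -6, N)               # assumed channels
T = [round_latency(W_u=4e5, W_v=1e5, rho=0.3, b=1 / N, g=g) for g in gains]
print("round latency, Eq. (19):", max(T))                 # slowest device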
IV. CONVERGENCE ANALYSIS AND PROBLEM FORMULATION

In this section, the convergence of FL with partial model pruning and personalization is first analyzed. Subsequently, we formulate an optimization problem to minimize the upper bound from the convergence analysis.

A. Convergence Analysis

In general, the neural network is non-convex; thus, the average $\ell_2$-norm of gradients is deployed to evaluate the convergence performance [25], [26]. To facilitate the analysis, the following assumptions are employed in the convergence analysis of FL with partial model pruning and personalization.

Assumption 1 (Smoothness): All loss functions $F_n(u, v_n)$ are continuously differentiable with respect to $u$ and $v_n$; namely, $\nabla_u F_n(u, v_n)$ is $L_u$-Lipschitz continuous in $u$ and $L_{uv}$-Lipschitz continuous in $v_n$, and $\nabla_v F_n(u, v_n)$ is $L_v$-Lipschitz continuous in $v_n$ and $L_{vu}$-Lipschitz continuous in $u$, which are denoted as

$\|\nabla_u F_n(u, v_n) - \nabla_u F_n(\hat{u}, v_n)\| \leq L_u \|u - \hat{u}\|$,  (20)
$\|\nabla_u F_n(u, v_n) - \nabla_u F_n(u, \hat{v}_n)\| \leq L_{uv} \|v_n - \hat{v}_n\|$,  (21)
$\|\nabla_v F_n(u, v_n) - \nabla_v F_n(u, \hat{v}_n)\| \leq L_v \|v_n - \hat{v}_n\|$,  (22)

and

$\|\nabla_v F_n(u, v_n) - \nabla_v F_n(\hat{u}, v_n)\| \leq L_{vu} \|u - \hat{u}\|$,  (23)

where $L_u$, $L_v$, $L_{uv}$, and $L_{vu}$ are positive constants. The smoothness assumption is a standard one. We assume, without loss of generality, that the cross-Lipschitz coefficients $L_{uv}$ and $L_{vu}$ are equal.

Assumption 2 (Pruning-induced Noise): Different from other convergence analyses of FL with partial model personalization in [7], [27], we consider the effect of pruning-induced noise. According to [14] and [28], the model error of the $n$th device under the pruning ratio $\rho_n^e$ is bounded by

$\mathbb{E}\|u_n^e - u_n^e \odot m_n^e\|^2 \leq \rho_n^e D^2$,  (24)

where $D$ is a positive constant.

Assumption 3 (Bounded Gradient): The second moments of the stochastic gradients of the global and personalized parts are bounded [29], [30], which are denoted as

$\mathbb{E}\|\nabla_u F_n(u_n^{e,t}, v_n^{e,t}, \xi_n^{e,t})\|^2 \leq \phi_u^2$,  (25)

and

$\mathbb{E}\|\nabla_v F_n(u_n^{e,t}, v_n^{e,t}, \xi_n^{e,t})\|^2 \leq \phi_v^2$,  (26)

respectively. In (25) and (26), $\phi_u$ and $\phi_v$ are positive constants, and $\xi_n^{e,t}$ are mini-batch data samples for any $n$, $e$, $t$.

Assumption 4 (Bounded Variance): The stochastic gradients of the global and personalized parts are unbiased and have bounded variance, which are denoted as

$\mathbb{E}[\nabla_u F_n(u_n^{e,t}, \xi_n^{e,t})] = \nabla_u F_n(u_n^{e,t})$,  (27)

and

$\mathbb{E}[\nabla_v F_n(v_n^{e,t}, \xi_n^{e,t})] = \nabla_v F_n(v_n^{e,t})$,  (28)

respectively. Furthermore, there exist constants $\hat{\sigma}_u$ and $\hat{\sigma}_v$ satisfying

$\mathbb{E}\|\nabla_u F_n(u_n^{e,t}, \xi_n^{e,t}) - \nabla_u F_n(u_n^{e,t})\|^2 \leq \hat{\sigma}_u^2$,  (29)

and

$\mathbb{E}\|\nabla_v F_n(v_n^{e,t}, \xi_n^{e,t}) - \nabla_v F_n(v_n^{e,t})\|^2 \leq \hat{\sigma}_v^2$,  (30)

respectively.

Assumption 5 (Partial Gradient Diversity): There exist $\delta \geq 0$ and $\varphi \geq 0$ for all $u$ and $V = \sum_{n=1}^{N} v_n$ satisfying

$\frac{1}{N}\sum_{n=1}^{N}\|\nabla_u F_n(u, v_n) - \nabla_u F(u, V)\|^2 \leq \delta^2 + \varphi^2\|\nabla_u F(u, V)\|^2$.  (31)

Partial gradient diversity characterizes how local steps on one device affect convergence globally.

Theorem 1: With the above assumptions, FL with partial model pruning and personalization converges to a small neighborhood of a stationary point of standard FL as follows:

$\frac{1}{E}\sum_{e=1}^{E}\left[\frac{\eta_v \tau_v}{8}\sum_{n=1}^{N}\|\nabla_v F_n(u^e, v_n^e)\|^2 + \frac{\eta_u \tau_u}{2}\sum_{j=1}^{W}\left\|\nabla_u F^j(u^e, V^{e+1})\right\|^2\right] \leq \frac{\mathbb{E}[F(u^0, V^0) - F(u^*, V^*)]}{E} + A_1 + A_2\sum_{e=1}^{E}\sum_{n=1}^{N}\rho_n^e$,  (32)

where $A_1$ and $A_2$ are expressed as

$A_1 = \frac{\eta_v^2\tau_v^2\hat{\sigma}_v^2 L_v}{2} + 4\eta_v^3 L_v^2\hat{\sigma}_v^2\tau_v^2(\tau_v - 1) + \frac{3\eta_u^2 W^2\tau_u^2\phi_u^2 L_u}{2} + \frac{W\phi_u^2 N\eta_u^3 L_u^2\tau_u^3 + 3W^2\eta_u^2\tau_u^2 N\hat{\sigma}_u^2 L_u + 3W^2 L_u^3\tau_u^4\eta_u^4 N\phi_u^2}{2\Gamma^*}$,  (33)

and

$A_2 = \frac{W\eta_u\tau_u L_u^2 D^2 + 3W^2\eta_u^2 L_u^3 D^2\tau_u^2}{\Gamma^*}$.  (34)

In (32), (33), and (34), $E$ is the number of global communication rounds, $W$ is the total number of model weights in the global part, and $\Gamma^*$ is the minimum occurrence of a parameter in the global part over all rounds.

Proof: Please refer to Appendix A.

B. Problem Formulation

Given the previously mentioned system model, we focus on an optimization problem that aims to minimize the global loss in (3). The optimization problem is formulated as follows:

$\min_{b_n^e, \rho_n^e} \sum_{e=1}^{E}\sum_{n=1}^{N} A_2 \rho_n^e$,  (35)
s.t. $T^e \leq T_{th}$,  (36)
$\sum_{n=1}^{N} b_n^e \leq 1$,  (37)
$0 \leq b_n^e \leq 1$,  (38)
$\rho_n^e \in [0, 1]$.  (39)

In (36), $T_{th}$ represents the computation and communication latency constraint. The constraints in (37) and (38) represent the wireless resource thresholds, namely, the total bandwidth allocated to the devices through the fractions $b_n^e$ in the $e$th global communication round cannot exceed the total bandwidth $B$. The constraint in (39) represents the pruning ratio constraint; the pruning ratio should be carefully selected to prevent a sharp decline in learning accuracy.

To minimize the global loss function, a proper pruning ratio of the global part should be selected based on the latency and wireless resource thresholds. Since it is almost impossible to know the training performance exactly before the model has been trained, we instead find an upper bound on the $\ell_2$-norm of gradients and minimize it for global loss minimization. Obviously, the optimization problem in (35) is a mixed integer non-linear programming (MINLP) problem, which is non-convex and impractical to solve directly for the optimal solutions. To address this issue, we decompose the original problem into several sub-problems and obtain sub-optimal solutions.

V. PRUNING RATIO AND WIRELESS RESOURCE OPTIMIZATION

In this section, we decouple the optimization problem in (35) into two sub-problems to obtain the optimal solutions of the pruning ratio and wireless resource allocation.

A. Optimal Pruning Ratio

According to (36), the computation and uplink transmission latency of the $n$th mobile device should satisfy the latency threshold, which is written as

$\frac{\tau_v C_n W_{n,v_n}^e}{f_n} + (1 - \rho_n^e)\left(\frac{\tau_u C_n W_{n,u_n}^e}{f_n} + \frac{\hat{q} W_{n,u_n}^e}{R_{n,e}^{up}}\right) \leq T_{th}$.  (40)

Theorem 2: Based on (40), the pruning ratio of the nth VI. S IMULATION R ESULTS
device in the eth global communication round should satisfy In this section, we examine the effectiveness of our proposed
!+
cmp-Per
Tth − Tn,e FL with partial model pruning and personalization. In the
e∗
ρn ≥ 1 − cmp-G , (41) simulation, we consider a scenario with one edge server and
Tn,e + Tn,e com-G
ten devices participating in model training. We use a common
where Tn,ecmp-Per
is the computation latency of the personal- CNN model for image classification over the datasets MNIST
ized part. In (41), Tn,ecmp-G com-G
and Tn,e are computation and and Fashion MNIST, which contain 50000 training samples
transmission latency of the global part, respectively, and and 10000 testing samples, respectively. The input size of
(z)+ = max(z, 0)+ . CNN is 1 × 28 × 28, and the sizes of the first and second
Proof : Please refer to Appendix B. convolutional layers are 32 × 28 × 28 and 64 × 14 × 14,
Remark 1 : Based on Theorem 2 and (71) in the Appendix respectively. The sizes of the first and second max-pooling
B, the pruning ratio of each device is jointly determined by the layers are 32 × 14 × 14 and 64 × 7 × 7, respectively. The
computation capability and uplink transmission rate. For the sizes of the first and second fully-connected layers are 3136
device with a high uplink transmission rate and computation and 128, respectively. The size of the output layer is 10.
capability, a small pruning ratio is adopted. The global parts are transmitted between the edge server and
devices by wireless channels. The main simulation parameters
B. Optimal Wireless Resource Allocation are presented in Table I.
Based on the derived pruning ratio in (41), the optimization
problem in (35) is rewritten as A. FL with Partial Model Pruning and Personalization
Fig. 2 (a) and (b) plot the loss value of FL with partial model
E X N
!
cmp-Per
X Tth − Tn,e
min A2 1 − cmp-G , (42) pruning and personalization with different pruning ratios on
ben Tn,e + Tn,e com-G
e=1 n=1 MNIST and Fashion MNIST, respectively. Fig. 2 (c) plots the
e
com-G q̂Wn,u testing accuracy of FL with partial model pruning and per-
with the constraints (37) and (38). Given Tn,e = up
Rn,e
n
,
sonalization with different pruning ratios on non-IID datasets
(42) is further rewritten as
! MNIST and Fashion MNIST, respectively. It is observed that
E XN up cmp-Per
X Rn,e (Tth − Tn,e ) the convergence rate decreases and the loss increases with
min A2 1 − up cmp-G . (43) increasing pruning ratio. This is because more model weights
ben Rn,e Tn,e + q̂Wn,u e
e=1 n=1 n
are pruned with a higher pruning ratio, which leads to a higher
The optimal wireless resource allocation is achieved by solving model aggregation error, and more iterations are required to
the optimization problem in (43). First, based on the following train learning models. Fig. 3 plots the comparison of the testing
Lemma 1, we prove that the optimization problem in (43) is accuracy of FL with partial model pruning and personalization
convex with respect to the bandwidth fraction ben . by alternatively (FedAlt) and simultaneously (FedSim) local
Lemma 1: The optimization problem in (43) is convex with updating on non-IID datasets MNIST and Fashion MNIST,
respect to uplink transmission rate. respectively. It is observed that the testing accuracy of FedAlt
Proof : Please refer to Appendix C. is a bit higher than that of FedSim.
Based on the Lagrange multiplier method, the optimal
bandwidth allocation is achieved in the following theorem.
Theorem 3: The optimal bandwidth allocated to the nth B. FL with Paitial Model Pruning and Personazliation in
device is derived as Wireless Networks
In this section, the effect of latency threshold in FL with
r  e 
cmp-Per e gn pn
(Tth −Tn,e )q̂Wn,u B log2 1+ σ2
n e
λ∗ − q̂Wn,un
partial model pruning and personalization and joint design of
be∗
n =  ep
 , (44) the proposed FL algorithm and wireless resource allocation
gn n cmp-G
B log2 1 + σ2 Tn,e over the non-IID dataset Fashion MNIST are simulated.
where λ∗ is the optimal Lagrange multiplier. 1) Effect of Latency Threshold: Fig. 4 (a) and (b) plot
Proof : Please refer to Appendix D. loss value and testing accuracy of FL with partial model
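A compact way to apply Theorems 2 and 3 is to bisect on the multiplier $\lambda$ until the bandwidth fractions in (44) sum to one, then recover the pruning ratios from (45). The sketch below does exactly that; the helper names and the toy inputs are our own assumptions, not the authors' solver.

import numpy as np

def allocate(T_th, T_cmp_per, T_cmp_g, W_u, r, q=32, iters=60):
    """Bisect on lambda so the closed-form fractions of Eq. (44) satisfy
    sum_n b_n = 1, then compute the pruning ratios of Eq. (45).
    `r` holds the per-device values B*log2(1 + g_n p_n / sigma^2)."""
    def b_of(lam):
        num = np.sqrt((T_th - T_cmp_per) * q * W_u * r / lam) - q * W_u
        return np.clip(num / (r * T_cmp_g), 0.0, 1.0)      # Eq. (44)
    lo, hi = 1e-12, 1e12                  # bracket; b_of is decreasing in lambda
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if b_of(mid).sum() > 1 else (lo, mid)
    b = b_of(hi)
    rho = 1 - b * (T_th - T_cmp_per) * r / (b * T_cmp_g * r + q * W_u)  # Eq. (45)
    return b, np.clip(rho, 0.0, 1.0)

# toy example with 10 devices (arbitrary rates and model sizes)
N = 10
r = 2e6 * np.random.uniform(8, 14, N)     # assumed B*log2(1+SNR) values
b, rho = allocate(T_th=25e-3, T_cmp_per=np.full(N, 5e-3),
                  T_cmp_g=np.full(N, 10e-3), W_u=np.full(N, 2e3), r=r)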
VI. SIMULATION RESULTS

In this section, we examine the effectiveness of our proposed FL with partial model pruning and personalization. In the simulation, we consider a scenario with one edge server and ten devices participating in model training. We use a common CNN model for image classification over the datasets MNIST and Fashion MNIST, which contain 50000 training samples and 10000 testing samples, respectively. The input size of the CNN is 1 × 28 × 28, and the sizes of the first and second convolutional layers are 32 × 28 × 28 and 64 × 14 × 14, respectively. The sizes of the first and second max-pooling layers are 32 × 14 × 14 and 64 × 7 × 7, respectively. The sizes of the first and second fully-connected layers are 3136 and 128, respectively. The size of the output layer is 10. The global parts are transmitted between the edge server and devices over wireless channels. The main simulation parameters are presented in Table I.

TABLE I
SIMULATION PARAMETERS OF FL WITH PARTIAL MODEL PRUNING AND PERSONALIZATION AND WIRELESS RESOURCE ALLOCATION

Transmission power of device: 28 dBm | Bandwidth: 20 MHz
CPU frequency of device: 3 GHz | Learning rate: 0.001
AWGN noise power: -110 dBm | Batch size: 128
Quantization bit: 32 | Latency threshold: 25 ms
Number of devices: 10 | Number of local iterations: 10
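The CNN described above can be written down directly; the following PyTorch sketch matches the stated layer sizes (32 × 28 × 28 and 64 × 14 × 14 feature maps, 3136- and 128-unit fully-connected layers, 10 outputs). The kernel size and padding are our assumptions, chosen to reproduce those shapes.

import torch
import torch.nn as nn

class SimCNN(nn.Module):
    """CNN used in the experiments: two conv + max-pool stages, then
    3136 -> 128 -> 10 fully-connected layers (input 1 x 28 x 28)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # 32x28x28
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 64x14x14
        self.pool = nn.MaxPool2d(2)             # halves the spatial size
        self.fc1 = nn.Linear(64 * 7 * 7, 128)   # 3136 -> 128
        self.fc2 = nn.Linear(128, 10)           # output layer

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))   # -> 32x14x14
        x = self.pool(torch.relu(self.conv2(x)))   # -> 64x7x7
        x = x.flatten(1)
        return self.fc2(torch.relu(self.fc1(x)))

logits = SimCNN()(torch.randn(2, 1, 28, 28))  # sanity check: shape (2, 10)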
A. FL with Partial Model Pruning and Personalization

Fig. 2 (a) and (b) plot the loss value of FL with partial model pruning and personalization with different pruning ratios on MNIST and Fashion MNIST, respectively. Fig. 2 (c) plots the testing accuracy of FL with partial model pruning and personalization with different pruning ratios on the non-IID datasets MNIST and Fashion MNIST. It is observed that the convergence rate decreases and the loss increases with an increasing pruning ratio. This is because more model weights are pruned with a higher pruning ratio, which leads to a higher model aggregation error, and more iterations are required to train the learning models.

Fig. 2. (a) Loss value of FL with partial model pruning and personalization with different pruning ratios on non-IID dataset MNIST. (b) Loss value of FL with partial model pruning and personalization with different pruning ratios on non-IID dataset Fashion MNIST. (c) Testing accuracy of FL with partial model pruning and personalization with different pruning ratios on non-IID datasets MNIST and Fashion MNIST.

Fig. 3 plots the comparison of the testing accuracy of FL with partial model pruning and personalization with alternating (FedAlt) and simultaneous (FedSim) local updating on the non-IID datasets MNIST and Fashion MNIST. It is observed that the testing accuracy of FedAlt is a bit higher than that of FedSim.

Fig. 3. Comparison of the testing accuracy of FL with partial model pruning and personalization with alternating and simultaneous local updating on non-IID datasets MNIST and Fashion MNIST.

B. FL with Partial Model Pruning and Personalization in Wireless Networks

In this section, the effect of the latency threshold in FL with partial model pruning and personalization and the joint design of the proposed FL algorithm and wireless resource allocation over the non-IID dataset Fashion MNIST are simulated.

1) Effect of Latency Threshold: Fig. 4 (a) and (b) plot the loss value and testing accuracy of FL with partial model pruning and personalization with different latency thresholds on the non-IID dataset Fashion MNIST, respectively. Fig. 4 (c) shows the pruning ratio required to achieve a given latency constraint on the non-IID dataset Fashion MNIST. Four latency thresholds are considered: 15 ms, 20 ms, 25 ms, and 30 ms. It is observed that the testing loss decreases and the testing accuracy increases with an increasing latency threshold. Also, the number of global communication rounds required to achieve convergence is small with a high latency threshold. This is because a small pruning ratio is selected under a large latency threshold. However, for a device with a small latency threshold, a large pruning ratio must be selected to satisfy the latency requirement, which sacrifices learning performance and requires more global communication rounds to achieve convergence. In the following experiments, we assume that the latency threshold is 25 ms.

Fig. 4. (a) Loss value of FL with partial model pruning and personalization with different latency thresholds on non-IID dataset Fashion MNIST. (b) Testing accuracy of FL with partial model pruning and personalization with different latency thresholds on non-IID dataset Fashion MNIST. (c) Pruning ratio required to achieve a given latency constraint on non-IID dataset Fashion MNIST.

2) Joint Design of FL with Partial Model Pruning and Personalization and Wireless Resource Allocation: We compare the proposed FL with partial model pruning and personalization with three other baseline schemes to demonstrate the joint design of FL with partial model pruning and personalization and wireless resource allocation. The four schemes are presented as follows.

• Proposed FL: Based on our proposed FL with partial model pruning and personalization, both the pruning ratio and the wireless resource allocation are optimized according to Section V.

• Equal Resource Pruning: Based on our proposed FL with partial model pruning and personalization, the pruning ratio is optimized. However, the bandwidth is equally allocated to all devices.

• FL only with Model Personalization: Only FL with model personalization is considered.

• FL only with Model Pruning in [14]: In [14], only FL with optimal model pruning was considered.

Table II shows the computation and communication latency of each global communication round for the four schemes mentioned above. We can observe that the computation and communication latency of the proposed FL is much smaller than that of FL only with model pruning in [14] or FL only with model personalization. This is because the personalized part does not need to be transmitted between the edge server and the devices. Also, it is observed that the computation and communication latency of the proposed FL is smaller than that of the equal resource pruning scheme. This is because the proposed FL is able to select the optimal pruning ratio of the global part based on the dynamic wireless environment, which further reduces the computation and communication latency.

TABLE II
COMPUTATION AND COMMUNICATION LATENCY OF EACH GLOBAL COMMUNICATION ROUND (MS)

Scheme | Proposed FL | Equal Resource Pruning | Only Personalization | FL in [14]
FL with Model Pruning and Personalization | 25 | 38 ± 0.45 | 55 ± 0.55 | 45

Fig. 5. (a) Loss value of FL with partial model pruning and personalization with different latency thresholds on non-IID dataset Fashion MNIST. (b) Testing accuracy of FL with partial model pruning and personalization with different latency thresholds on non-IID dataset Fashion MNIST. (c) Communication costs on different schemes.

Fig. 5 (a) and (b) plot the loss value and testing accuracy of the joint design of FL with partial model pruning and personalization and wireless resource allocation on non-IID Fashion MNIST, respectively. Fig. 5 (c) shows the communication costs of the different schemes. The communication cost quantifies the number of model weights of the global part that must be delivered for model aggregation. We can observe that the proposed FL with partial model pruning and personalization has the ability to adapt to the wireless resource. Also, these figures show that the loss and testing accuracy of the proposed FL with partial model pruning and personalization are close to those of the scheme with model personalization only, while the computation and communication latency is about 50% less.
This is because the proposed FL with partial model pruning and personalization is able to dynamically prune the unimportant weights based on the wireless channel, which further decreases the latency of both local model updating and uplink transmission, especially when the model size is large. In addition, it is observed that the learning accuracy of the proposed FL is much better than that of FL only with model pruning in [14]. This is because partial model personalization is able to learn the data heterogeneity of the different devices.

VII. CONCLUSIONS

In this paper, a communication and computation efficient FL framework with partial model pruning and personalization over wireless networks was proposed to adapt to data heterogeneity and dynamical wireless environments. Specifically, the convergence analysis of an upper bound on the $\ell_2$-norm of gradients for the proposed FL framework was derived. Subsequently, the closed-form solutions of the pruning ratio and wireless resource allocation were derived under latency and bandwidth thresholds by KKT conditions. Simulation results have demonstrated that our proposed FL framework achieved similar learning accuracy compared to FL only with partial model personalization and reduced the computation and communication latency by about 50%.

APPENDIX

A. Appendix A - Proof of Theorem 1

We now analyze the convergence of FL with partial model pruning and personalization. Throughout the proof, we use the following inequalities. From Jensen's inequality, for any $z_k \in \mathbb{R}^d$, $k \in \{1, 2, ..., K\}$, we have

$\left\|\frac{1}{K}\sum_{k=1}^{K} z_k\right\|^2 \leq \frac{1}{K}\sum_{k=1}^{K} \|z_k\|^2$,  (46)

which directly yields

$\left\|\sum_{k=1}^{K} z_k\right\|^2 \leq K \sum_{k=1}^{K} \|z_k\|^2$.  (47)
The Peter-Paul inequality (also known as Young's inequality) gives

$\langle z_1, z_2 \rangle \leq \frac{1}{2}\|z_1\|^2 + \frac{1}{2}\|z_2\|^2$,  (48)

and for any constant $s > 0$ and $z_1, z_2 \in \mathbb{R}^d$, we have

$\|z_1 + z_2\|^2 \leq (1+s)\|z_1\|^2 + \left(1 + \frac{1}{s}\right)\|z_2\|^2$.  (49)

Proof of the Convergence: In FL with partial model pruning and personalization, the objective is to minimize the function

$F(u, V) = \frac{1}{N}\sum_{n=1}^{N} F_n(u, v_n)$,  (50)

where $V = (v_1, ..., v_N)$ is a concatenation of all personalized parts. We use the L-smoothness in Assumption 1 to give the convergence analysis. We begin with

$F(u^{e+1}, V^{e+1}) - F(u^e, V^e) = F(u^e, V^{e+1}) - F(u^e, V^e) + F(u^{e+1}, V^{e+1}) - F(u^e, V^{e+1})$.  (51)

In (51), the first difference corresponds to the effect of the v-step, and its upper bound is easy to obtain with standard techniques. According to [7], we have

$\mathbb{E}\left[F(u^e, V^{e+1}) - F(u^e, V^e)\right] \leq -\frac{\eta_v \tau_v \sum_{n=1}^{N}\|\nabla_v F_n(u^e, v_n^e)\|^2}{8} + \frac{\eta_v^2 \tau_v^2 \hat{\sigma}_v^2 L_v}{2} + 4\eta_v^3 L_v^2 \hat{\sigma}_v^2 \tau_v^2 (\tau_v - 1)$.  (52)

The second difference in (51) corresponds to the effect of the u-step; however, deriving its upper bound is more challenging. In particular, the smoothness bound for the u-step is written as

$F(u^{e+1}, V^{e+1}) - F(u^e, V^{e+1}) \leq \langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e \rangle + \frac{L_u}{2}\|u^{e+1} - u^e\|^2$.  (53)

Before deriving the smoothness bound in (53), several lemmas are first introduced as follows.

Lemma 2: Under Assumptions 2 and 3, for any global communication round $e$, we obtain

$\sum_{t=1}^{\tau_u}\sum_{n=1}^{N} \mathbb{E}\|u_n^{e,t-1} - u_n^e\|^2 \leq \eta_u^2 \phi_u^2 N \tau_u^3 + 2\tau_u D^2 \sum_{n=1}^{N} \rho_n^e$.  (54)

Proof: In (54), $u_n^e$ is the global part received from the edge server at the beginning of the $e$th global communication round, and the difference $(u_n^{e,t-1} - u_n^e)$ consists of two parts, namely, the variation due to local global part training $(u_n^{e,t-1} - u_n^{e,0})$ and the variation due to pruning $(u_n^{e,0} - u_n^e)$. Therefore, (54) is rewritten as

$\sum_{t=1}^{\tau_u}\sum_{n=1}^{N} \mathbb{E}\|u_n^{e,t-1} - u_n^e\|^2 = \sum_{t=1}^{\tau_u}\sum_{n=1}^{N} \mathbb{E}\|(u_n^{e,t-1} - u_n^{e,0}) + (u_n^{e,0} - u_n^e)\|^2 \leq \sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\|u_n^{e,t-1} - u_n^{e,0}\|^2 + \sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\|u_n^{e,0} - u_n^e\|^2$.  (55)

In (55), $u_n^{e,t-1}$ is updated from $u_n^{e,0}$ by $t-1$ iterations on the $n$th device. Through the local gradient updates, we obtain

$\sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\|u_n^{e,t-1} - u_n^{e,0}\|^2 = \sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\left\|\eta_u \sum_{i=0}^{t-2}\nabla_u F_n(u_n^{e,i}, v_n^{e,\tau_v}, \xi_n^{e,i}) \odot m_n^e\right\|^2 \leq 2\eta_u^2 \sum_{t=1}^{\tau_u}\sum_{n=1}^{N}(t-1)\sum_{i=0}^{t-2}\mathbb{E}\|\nabla F_n(u_n^{e,i}, v_n^{e,\tau_v}, \xi_n^{e,i}) \odot m_n^e\|^2 \leq 2\eta_u^2 \phi_u^2 N \sum_{t=1}^{\tau_u}(t-1)^2 = \eta_u^2 \phi_u^2 N \frac{2\tau_u^3 - 3\tau_u^2 + \tau_u}{3} \leq \eta_u^2 \phi_u^2 N \tau_u^3$,  (56)

where the third step in (56) is obtained from the bounded gradient in Assumption 3. Then, the term involving $u_n^{e,0} - u_n^e$ in (55) is calculated as

$\sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\|u_n^{e,0} - u_n^e\|^2 = \sum_{t=1}^{\tau_u}\sum_{n=1}^{N} 2\mathbb{E}\|u_n^e \odot m_n^e - u_n^e\|^2 \leq 2\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\rho_n^e D^2 = 2\tau_u D^2 \sum_{n=1}^{N}\rho_n^e$,  (57)

where the second step is obtained from the pruning-induced noise in Assumption 2. By plugging (56) and (57) into (55), we obtain the desired result, which ends the proof of Lemma 2.

Lemma 3: Under Assumptions 1-3, for any global communication round $e$, we derive

$\sum_{t=1}^{\tau_u} \mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n \in \mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 \leq \frac{\phi_u^2 N \eta_u^2 L_u^2 \tau_u^4 + 2\tau_u^2 L_u^2 D^2 \sum_{n=1}^{N}\rho_n^e}{\Gamma^*}$,  (58)

where $\Gamma_e^j = |\mathcal{N}_e^j|$ is the number of local models containing parameter $j$ in the $e$th global communication round, and $\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})$ is the gradient of the $j$th weight.
Proof:

$\sum_{t=1}^{\tau_u} \mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n \in \mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 \leq \frac{\tau_u}{\Gamma_e^j}\sum_{t=1}^{\tau_u}\sum_{n \in \mathcal{N}_e^j}\mathbb{E}\|\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\|^2 \leq \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\|^2 \leq \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|\nabla_u F_n(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n(u_n^e, v_n^{e,\tau_v})\|^2 \leq \frac{\tau_u}{\Gamma^*} L_u^2 \sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|u_n^{e,t-1} - u_n^e\|^2$,  (59)

where we relax the inequality by selecting the smallest $\Gamma^* = \min \Gamma_e^j$ and changing the summation over $n$ to all devices in the second step. Then, in the third step, we consider that the $\ell_2$-gradient norm of a vector is no larger than the sum of the norms of all its sub-vectors, which allows us to consider $\nabla_u F_n$ rather than its sub-vectors. The last step in (59) is derived from the L-smoothness in Assumption 1; applying Lemma 2 then yields (58), which ends the proof of Lemma 3.

Lemma 4: Under the bounded variance in Assumption 4, for any global communication round $e$, we obtain

$\sum_{t=1}^{\tau_u} \mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n \in \mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right]\right\|^2 \leq \frac{\tau_u^2 N \hat{\sigma}_u^2}{\Gamma^*}$.  (60)

Proof:

$\sum_{t=1}^{\tau_u} \mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n \in \mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right]\right\|^2 \leq \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\|^2 \leq \frac{\tau_u}{\Gamma^*}\sum_{t=1}^{\tau_u}\sum_{n=1}^{N}\mathbb{E}\|\nabla_u F_n(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n(u_n^{e,t-1}, v_n^{e,\tau_v})\|^2 \leq \frac{\tau_u^2 N \hat{\sigma}_u^2}{\Gamma^*}$.  (61)

In the second step, we consider that the $\ell_2$-gradient norm of a vector is no larger than the sum of the norms of all its sub-vectors, which allows us to consider $\nabla_u F_n$ rather than its sub-vectors. The last step in (61) is obtained from the bounded variance in Assumption 4, which ends the proof of Lemma 4.

Lemma 5: The upper bound of $\mathbb{E}\|u^{e+1} - u^e\|^2$ is

$\mathbb{E}\|u^{e+1} - u^e\|^2 \leq 3\eta_u^2 W^2 \tau_u^2 \phi_u^2 + \frac{3W^2 \eta_u^2 \tau_u^2 N \hat{\sigma}_u^2 + 3W^2 L_u^2 \tau_u^4 \eta_u^4 N \phi_u^2}{\Gamma^*} + \frac{6W^2 \eta_u^2 L_u^2 D^2 \tau_u^2 \sum_{n=1}^{N}\rho_n^e}{\Gamma^*}$,  (62)

where $W$ is the number of weights of the global part.

Proof:

$\mathbb{E}\|u^{e+1} - u^e\|^2 = \sum_{j=1}^{W} \mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n \in \mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1})\right\|^2 \leq 3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right]\right\|^2 + 3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 + 3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right\|^2$,  (63)

where we split the stochastic gradient $\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1})$ into three parts, namely, $[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}, \xi_n^{e,t-1}) - \nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})]$, $[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})]$, and $[\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})]$. The third term of the last step in (63) is bounded as

$3W\sum_{j=1}^{W}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right\|^2 \leq 3\eta_u^2 W \tau_u \sum_{j=1}^{W}\sum_{t=1}^{\tau_u}\mathbb{E}\|\nabla_u F_n(u_n^e, v_n^{e,\tau_v})\|^2 \leq 3\eta_u^2 W^2 \tau_u^2 \phi_u^2$.  (64)

By plugging (58), (60), and (64) into (63), the upper bound of $\mathbb{E}\|u^{e+1} - u^e\|^2$ is derived as (62), which ends the proof of Lemma 5. Then, by taking expectations on both sides of (53), we obtain

$\mathbb{E}[F(u^{e+1}, V^{e+1})] - \mathbb{E}[F(u^e, V^{e+1})] \leq \mathbb{E}\langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e\rangle + \frac{L_u}{2}\mathbb{E}\|u^{e+1} - u^e\|^2$.  (65)

First, we analyze $\mathbb{E}\langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e\rangle$ by considering a sum of inner products over all model weights, which is denoted as
$\mathbb{E}\langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e\rangle = \sum_{j=1}^{W} \mathbb{E}\langle \nabla_u F^j(u^e, V^{e+1}), u^{e+1,j} - u^{e,j}\rangle = \sum_{j=1}^{W}\mathbb{E}\left\langle \nabla_u F^j(u^e, V^{e+1}), -\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v})\right\rangle = -\sum_{j=1}^{W}\mathbb{E}\langle \nabla_u F^j(u^e, V^{e+1}), \eta_u\tau_u\nabla_u F^j(u^e, V^{e+1})\rangle - \sum_{j=1}^{W}\mathbb{E}\left\langle \nabla_u F^j(u^e, V^{e+1}), \frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\rangle$,  (66)

where the last step splits the result into two parts with respect to the reference point $\eta_u\tau_u\nabla_u F^j(u^e, V^{e+1})$. The first term in the last step of (66) is derived as

$-\sum_{j=1}^{W}\mathbb{E}\langle \nabla_u F^j(u^e, V^{e+1}), \eta_u\tau_u\nabla_u F^j(u^e, V^{e+1})\rangle = -\eta_u\tau_u\sum_{j=1}^{W}\left\|\nabla_u F^j(u^e, V^{e+1})\right\|^2$.  (67)

The second term in the last step of (66) is derived as

$-\sum_{j=1}^{W}\mathbb{E}\left\langle \nabla_u F^j(u^e, V^{e+1}), \frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\sum_{t=1}^{\tau_u}\eta_u\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\rangle \leq \frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\mathbb{E}\|\nabla_u F^j(u^e, V^{e+1})\|^2 + \frac{\eta_u}{2\tau_u}\sum_{j=1}^{W}\sum_{t=1}^{\tau_u}\mathbb{E}\left\|\frac{1}{\Gamma_e^j}\sum_{n\in\mathcal{N}_e^j}\left[\nabla_u F_n^j(u_n^{e,t-1}, v_n^{e,\tau_v}) - \nabla_u F_n^j(u_n^e, v_n^{e,\tau_v})\right]\right\|^2 \leq \frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\mathbb{E}\|\nabla_u F^j(u^e, V^{e+1})\|^2 + \frac{W\phi_u^2 N \eta_u^3 L_u^2 \tau_u^3 + 2W\eta_u\tau_u L_u^2 D^2\sum_{n=1}^{N}\rho_n^e}{2\Gamma^*}$,  (68)

where the second step is obtained from (48) and the last step is obtained from Lemma 3. By plugging (67) and (68) into (66), $\mathbb{E}\langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e\rangle$ is derived as

$\mathbb{E}\langle \nabla_u F(u^e, V^{e+1}), u^{e+1} - u^e\rangle \leq -\frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\left\|\nabla_u F^j(u^e, V^{e+1})\right\|^2 + \frac{W\phi_u^2 N \eta_u^3 L_u^2\tau_u^3 + 2W\eta_u\tau_u L_u^2 D^2\sum_{n=1}^{N}\rho_n^e}{2\Gamma^*}$.  (69)

Finally, we plug the upper bound of $\mathbb{E}\|u^{e+1} - u^e\|^2$ into (65) and obtain

$\mathbb{E}[F(u^{e+1}, V^{e+1})] - \mathbb{E}[F(u^e, V^{e+1})] \leq -\frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\left\|\nabla_u F^j(u^e, V^{e+1})\right\|^2 + \frac{L_u}{2}\mathbb{E}\|u^{e+1} - u^e\|^2 + \frac{W\phi_u^2 N\eta_u^3 L_u^2\tau_u^3 + 2W\eta_u\tau_u L_u^2 D^2\sum_{n=1}^{N}\rho_n^e}{2\Gamma^*}$.  (70)

By taking expectations on both sides of (51) and plugging (52) and (70) into it, we obtain

$\mathbb{E}[F(u^{e+1}, V^{e+1}) - F(u^e, V^e)] \leq -\frac{\eta_v\tau_v\sum_{n=1}^{N}\|\nabla_v F_n(u^e, v_n^e)\|^2}{8} + \frac{\eta_v^2\tau_v^2\hat{\sigma}_v^2 L_v}{2} + 4\eta_v^3 L_v^2\hat{\sigma}_v^2\tau_v^2(\tau_v - 1) - \frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\left\|\nabla_u F^j(u^e, V^{e+1})\right\|^2 + \frac{W\phi_u^2 N\eta_u^3 L_u^2\tau_u^3 + 2W\eta_u\tau_u L_u^2 D^2\sum_{n=1}^{N}\rho_n^e}{2\Gamma^*} + \frac{3\eta_u^2 W^2\tau_u^2\phi_u^2 L_u}{2} + \frac{3W^2\eta_u^2\tau_u^2 N\hat{\sigma}_u^2 L_u + 3W^2 L_u^3\tau_u^4\eta_u^4 N\phi_u^2}{2\Gamma^*} + \frac{3W^2\eta_u^2 L_u^3 D^2\tau_u^2\sum_{n=1}^{N}\rho_n^e}{\Gamma^*}$.  (71)

Then, we take the sum over the global communication rounds $e = 1, 2, ..., E$ on both sides of (71) and obtain

$\frac{1}{E}\sum_{e=1}^{E}\left[\frac{\eta_v\tau_v}{8}\sum_{n=1}^{N}\|\nabla_v F_n(u^e, v_n^e)\|^2 + \frac{\eta_u\tau_u}{2}\sum_{j=1}^{W}\|\nabla_u F^j(u^e, V^{e+1})\|^2\right] \leq \frac{\mathbb{E}[F(u^0, V^0) - F(u^*, V^*)]}{E} + \frac{\eta_v^2\tau_v^2\hat{\sigma}_v^2 L_v}{2} + 4\eta_v^3 L_v^2\hat{\sigma}_v^2\tau_v^2(\tau_v - 1) + \frac{3\eta_u^2 W^2\tau_u^2\phi_u^2 L_u}{2} + \frac{W\phi_u^2 N\eta_u^3 L_u^2\tau_u^3 + 3W^2\eta_u^2\tau_u^2 N\hat{\sigma}_u^2 L_u + 3W^2 L_u^3\tau_u^4\eta_u^4 N\phi_u^2}{2\Gamma^*} + \frac{\left(W\eta_u\tau_u L_u^2 D^2 + 3W^2\eta_u^2 L_u^3 D^2\tau_u^2\right)\sum_{e=1}^{E}\sum_{n=1}^{N}\rho_n^e}{\Gamma^*}$,  (72)

which completes the proof of Theorem 1.

B. Appendix B - Proof of Theorem 2

According to (40), the pruning ratio $\rho_n^e$ is derived from

$T_{n,e}^{\text{cmp-Per}} + (1 - \rho_n^e)\left(T_{n,e}^{\text{cmp-G}} + T_{n,e}^{\text{com-G}}\right) \leq T_{th}$,  (73)

where $T_{n,e}^{\text{cmp-Per}} = \frac{\tau_v C_n W_{n,v_n}^e}{f_n}$, $T_{n,e}^{\text{cmp-G}} = \frac{\tau_u C_n W_{n,u_n}^e}{f_n}$, and $T_{n,e}^{\text{com-G}} = \frac{\hat{q} W_{n,u_n}^e}{R_{n,e}^{up}}$. Then, $\rho_n^e$ is derived as (41), which ends the proof of Theorem 2.
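As a quick numerical sanity check of Theorem 2, the snippet below verifies that choosing the pruning ratio at the bound (41) makes the latency in (40) meet the threshold (with equality when the bound is active). The values are arbitrary toy numbers of ours.

# toy latency components (seconds)
T_th, T_cmp_per, T_cmp_g, T_com_g = 25e-3, 5e-3, 10e-3, 18e-3

# Eq. (41): smallest feasible pruning ratio
rho = max(1 - (T_th - T_cmp_per) / (T_cmp_g + T_com_g), 0.0)

# Eq. (40): total latency at that pruning ratio
latency = T_cmp_per + (1 - rho) * (T_cmp_g + T_com_g)
assert abs(latency - min(T_th, T_cmp_per + T_cmp_g + T_com_g)) < 1e-12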

C. Appendix C - Proof of Lemma 1

The objective function in (43) is equal to

$F(b) = \sum_{e=1}^{E}\sum_{n=1}^{N} f(b_n^e) = \sum_{e=1}^{E}\sum_{n=1}^{N}\left(1 - \frac{b_n^e V_1}{b_n^e V_2 + V_3}\right)$,  (74)

where $V_1, V_2, V_3 > 0$ and $0 \leq b_n^e \leq 1$. To prove Lemma 1, we need to analyze the convexity of the function $f(b_n^e)$. The first derivative of $f(b_n^e)$ is computed as

$f'(b_n^e) = -\frac{V_1 V_3}{(b_n^e V_2 + V_3)^2}$.  (75)

Then, the second derivative of $f(b_n^e)$ is derived as

$f''(b_n^e) = \frac{2V_1 V_2 V_3}{(b_n^e V_2 + V_3)^3} > 0$.  (76)

As a result, the objective function in (43) is convex. Meanwhile, both constraints in (37) and (38) are convex. Consequently, the optimization problem in (43) is convex, which ends the proof of Lemma 1.

D. Appendix D - Proof of Theorem 3

Based on the optimization problem in (43) and the constraint in (37), the Lagrange function is written as

$\mathcal{L}(b_n^e, \lambda) = \sum_{n=1}^{N}\left(1 - \frac{b_n^e B \log_2\left(1 + \frac{g_n^e p_n}{\sigma^2}\right)(T_{th} - T_{n,e}^{\text{cmp-Per}})}{b_n^e B \log_2\left(1 + \frac{g_n^e p_n}{\sigma^2}\right) T_{n,e}^{\text{cmp-G}} + \hat{q} W_{n,u_n}^e}\right) + \lambda\left(\sum_{n=1}^{N} b_n^e - 1\right)$,  (77)

where $\lambda$ is a Lagrange multiplier. Then, we consider the Karush-Kuhn-Tucker (KKT) conditions to solve the problem, which are written as

$\frac{\partial \mathcal{L}}{\partial b_n^e} = \lambda - \frac{(T_{th} - T_{n,e}^{\text{cmp-Per}})\,\hat{q} W_{n,u_n}^e\, B \log_2\left(1 + \frac{g_n^e p_n}{\sigma^2}\right)}{\left[b_n^e B \log_2\left(1 + \frac{g_n^e p_n}{\sigma^2}\right) T_{n,e}^{\text{cmp-G}} + \hat{q} W_{n,u_n}^e\right]^2} = 0$,  (78)

$\lambda\left(\sum_{n=1}^{N} b_n^e - 1\right) = 0, \quad \lambda \geq 0$.  (79)

According to the KKT conditions, the optimal bandwidth allocation for each device is obtained as in Theorem 3, which ends the proof of Theorem 3.

REFERENCES

[1] D. C. Nguyen, M. Ding, P. N. Pathirana, A. Seneviratne, J. Li, and H. Vincent Poor, "Federated learning for internet of things: A comprehensive survey," IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1622–1658, 2021.
[2] L. U. Khan, W. Saad, Z. Han, E. Hossain, and C. S. Hong, "Federated learning for internet of things: Recent advances, taxonomy, and open challenges," IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1759–1799, 2021.
[3] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282, 2017.
[4] X. Liu, Y. Deng, A. Nallanathan, and M. Bennis, "Federated and meta learning over non-wireless and wireless networks: A tutorial," arXiv:2210.13111, 2022.
[5] Y. Mu, N. Garg, and T. Ratnarajah, "Federated learning in massive MIMO 6G networks: Convergence analysis and communication-efficient design," IEEE Trans. Netw. Sci. Eng., vol. 9, no. 6, pp. 4220–4234, 2022.
[6] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
[7] K. Pillutla, K. Malik, A.-R. Mohamed, M. Rabbat, M. Sanjabi, and L. Xiao, "Federated learning with partial model personalization," in ICML, vol. 162, 2022.
[8] K. Mishchenko, R. Islamov, E. Gorbunov, and S. Horvath, "Partially personalized federated learning: Breaking the curse of data heterogeneity," arXiv:2305.18285, 2023.
[9] S. Liu, G. Yu, R. Yin, J. Yuan, L. Shen, and C. Liu, "Joint model pruning and device selection for communication-efficient federated edge learning," IEEE Trans. Commun., vol. 70, no. 1, pp. 231–244, Jan. 2022.
[10] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, "Model pruning enables efficient federated learning on edge devices," IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, 2022.
[11] J. Tan, Y. Zhou, G. Liu, J. H. Wang, and S. Yu, "pFedSim: Similarity-aware model aggregation towards personalized federated learning," arXiv:2305.15706, 2023.
[12] J. Liu, J. Wu, J. Chen, M. Hu, Y. Zhou, and D. Wu, "FedDWA: Personalized federated learning with online weight adjustment," arXiv:2305.06124, 2023.
[13] C. You, K. Guo, H. H. Yang, and T. Q. S. Quek, "Hierarchical personalized federated learning over massive mobile edge computing networks," IEEE Trans. Wireless Commun., pp. 1–1, 2023.
[14] X. Liu, S. Wang, Y. Deng, and A. Nallanathan, "Adaptive federated pruning in hierarchical wireless networks," arXiv:2305.09042, 2023.
[15] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy efficient federated learning over wireless communication networks," IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
[16] M. Chen, H. V. Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2021.
[17] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2021.
[18] J. Ren, W. Ni, G. Nie, and H. Tian, "Research on resource allocation for efficient federated learning," arXiv:2104.09177, 2021.
[19] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11264–11272, Jun. 2019.
[20] S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. I. Venieris, and N. D. Lane, "FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021.
[21] S. Narang, G. Diamos, S. Sengupta, and E. Elsen, "Exploring sparsity in recurrent neural networks," in ICLR, 2017.
[22] M. H. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," 2018.
[23] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, "HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6535–6548, 2020.
[24] D. Wen, M. Bennis, and K. Huang, "Joint parameter-and-bandwidth allocation for improving the efficiency of partitioned edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 8272–8286, 2020.
[25] S. Ghadimi and G. H. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM J. Optim., vol. 23, no. 4, pp. 2341–2368, 2013.
[26] S. Shi, K. Zhao, Q. Wang, Z. Tang, and X. Chu, "A convergence analysis of distributed SGD with communication-efficient gradient sparsification," in Proc. 28th Int. Joint Conf. Artif. Intell., pp. 3411–3417, Aug. 2019.
[27] K. Mishchenko, R. Islamov, E. Gorbunov, and S. Horvath, "Partially personalized federated learning: Breaking the curse of data heterogeneity," arXiv:2305.18285, 2023.
[28] S. U. Stich, J. B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 4447–4458, Dec. 2018.
[29] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.
[30] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 901–909, Dec. 2016.