Model Pruning Enables Efficient Federated Learning on Edge Devices

Yuang Jiang, Shiqiang Wang, Víctor Valls, Bong Jun Ko, Wei-Han Lee, Kin K. Leung, Leandros Tassiulas

Abstract—Federated learning (FL) allows model training from local data collected by edge/mobile devices while preserving data privacy, which has wide applicability to image and vision applications. A challenge is that client devices in FL usually have much more limited computation and communication resources compared to servers in a datacenter. To overcome this challenge, we propose PruneFL, a novel FL approach with adaptive and distributed parameter pruning, which adapts the model size during FL to reduce both communication and computation overhead and minimize the overall training time, while maintaining a similar accuracy as the original model. PruneFL includes initial pruning at a selected client and further pruning as part of the FL process. The model size is adapted during this process by maximizing the approximate empirical risk reduction divided by the time of one FL round. Our experiments with various datasets on edge devices (e.g., Raspberry Pi) show that: (i) we significantly reduce the training time compared to conventional FL and various other pruning-based methods; (ii) the pruned model with automatically determined size converges to an accuracy that is very similar to the original model, and it is also a lottery ticket of the original model.

Index Terms—Efficient training, federated learning, model pruning.

Y. Jiang, V. Valls, and L. Tassiulas are with Yale University, New Haven, CT, USA. S. Wang and W.-H. Lee are with IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. B. J. Ko was with Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford, CA, USA when contributing to this work. K. K. Leung is with Imperial College London, UK. Contact authors: Y. Jiang ([email protected]) and S. Wang ([email protected]). Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems (TNNLS).

I. INTRODUCTION

The past decade has seen a rapid development of machine learning algorithms and applications, particularly in the area of deep neural networks (DNNs) [1]. However, a huge volume of training data is usually required to train accurate models for complex tasks such as image classification and computer vision. Due to limits in data privacy regulations and communication bandwidth, it is usually infeasible to transmit and store all training data at a central location. To address this problem, federated learning (FL) has emerged as a promising approach for distributed model training from decentralized data [2]–[6]. In a typical FL system, data is collected by client devices (e.g., cameras) at the network edge; the training process includes local model updates using each client's own data and the fusion of all clients' models, typically through a server. In this way, the raw data remains local to the clients.

Client devices in FL are usually much more resource-constrained than server machines in a datacenter, in terms of computation power, communication bandwidth, memory, storage size, etc. Training DNNs that can include millions of parameters (weights) on such resource-limited edge devices can take prohibitively long and consume a large amount of energy. Therefore, a natural question is: how can we perform FL efficiently so that a model is trained within a reasonable amount of time and energy?

Some progress has been made in this direction recently using model/gradient compression techniques, where instead of training the original model with the full parameter vector, either a small model is extracted from the original model for training, or a compressed parameter vector (or its gradient) is transmitted in the fusion stage [7]–[10]. However, the former approach may reduce the accuracy of the final model in undesirable ways, whereas the latter approach only reduces the communication overhead and does not produce a small model for efficient computation. Furthermore, how to adapt the compressed model size for the most efficient training remains a largely unexplored area; it is a challenging problem due to unpredictable training dynamics and the need to obtain a good solution in a short time with minimal overhead.

To overcome these problems, we propose a new FL paradigm called PruneFL, which includes adaptive and distributed parameter pruning as part of the FL procedure. We make the following key contributions.

Distributed pruning. PruneFL includes initial pruning at a selected client followed by further distributed pruning that is intertwined with the standard FL procedure. Our experimental results show that this method outperforms alternative approaches that either prune at a single client only or directly involve multiple FL clients for pruning, especially when the clients have heterogeneous data statistics and computational power.

Adaptive pruning. PruneFL continuously "tracks" a model that is small enough for efficient transmission and computation with a low memory footprint, while maintaining useful connections and their parameters so that the model converges to a similar accuracy as the original model. The importance of model parameters evolves during training, so our method continuously updates which parameters to keep and the corresponding model size. The update follows the objective of minimizing the time to reach intermediate and final loss values. Each FL round operates on a small pruned model, which is efficient. A small model is also obtained at any time during and after the FL process for efficient inference on edge devices, and it is a lottery ticket [11] of the original model, as we show experimentally.
Implementation. We implement FL with model pruning on real edge devices, where we extend a deep learning framework to support efficient sparse matrix computation. Our code is available at: https://github.com/jiangyuang/PruneFL

II. RELATED WORK

Neural network pruning. To reduce the complexity of neural network models, different ways of parameter pruning have been proposed in the literature. Early work considered approximation using a second-order Taylor expansion [12]. However, the computation of the Hessian matrix has high complexity, which is infeasible for modern DNNs. In recent years, magnitude-based pruning has become popular [13], where parameters with small enough magnitudes are removed from the network. A finding that a network pruned by magnitude constitutes an optimal substructure of the original network, known as the "lottery ticket hypothesis", was presented in [11], [14]. It shows that directly training the pruned network can reach a similar accuracy as pruning a pre-trained original network.

In addition to the above approaches, which train until convergence before the next pruning step, there are iterative pruning methods where the model is pruned after every few steps of training [15], [16]. There are also one-shot pruning approaches, including SynFlow [17], which prunes the model at initialization (before training), and SNIP [18], which prunes the model using the first training round's gradient information. A dynamic pruning approach that allows the network to grow and shrink during training was proposed in [19]. Besides these unstructured pruning methods, structured pruning has also been studied [20], which, however, often requires specific network architectures and does not conform to the lottery ticket hypothesis. The lottery ticket is useful for retraining a pruned model on a different yet similar dataset [14]. The use of pruning for efficient model training was discussed in [21], where the optimal choice of pruning rate (or final model size) remained unstudied.

These existing pruning techniques consider the centralized setting with full access to training data, which is fundamentally different from our PruneFL that works with decentralized datasets at local clients. Furthermore, the automatic adaptation of the model size has not been studied before.

Efficient federated learning. The first FL method is known as federated averaging (FedAvg) [2], where each "round" of training includes multiple local gradient computation steps on each client's local data, followed by a parameter averaging step through a server. This method can be shown to converge in various settings, including when the data at different clients are non-identically distributed (non-IID) [22]–[24].

To improve the communication efficiency of FL, methods for optimizing the communication frequency were studied [25]–[27]. An approach of parameter averaging using structured, sketched, and quantized updates was introduced in [7], which belongs to the broader area of gradient compression/sparsification [28]–[33]. These techniques usually consider a fixed degree of sparsity or compression that needs to be configured as a hyperparameter. An online learning approach that determines a near-optimal gradient sparsity was proposed in [9], which includes exploration steps that may slow down the training initially. This body of work does not address computation efficiency.

To reduce both communication and computation costs, efficient FL techniques using lossy compression and dropout were developed [8], [10], where the final model still has the original size and hence provides no benefit for efficient inference after the model is trained. Moreover, because the main goal of pruning is to remove less important weights from the model, it is orthogonal to other acceleration methods such as quantization [34], low-rank decomposition [35], etc., and pruning can be applied together with these other methods. In addition, since our approach considers acceleration for both training and inference, methods that accelerate inference only, e.g., knowledge distillation [36], runtime neural pruning [37], and DNN partitioning and offloading [38], do not serve our purpose. There are other distributed training methods, such as split learning [39], which are beyond our scope since we focus on FL in this paper.

Furthermore, most existing studies on FL are based on simulation. Only a few recent papers considered implementation on real embedded devices [10], [26], but they do not include parameter pruning.

Novelty of our work. The uniqueness of PruneFL is that we jointly address communication and computation efficiency for both the training and inference phases, by extending FedAvg with minimal extra overhead. Our two-stage distributed pruning method is designed to address both data (statistical) and device (system) heterogeneity, including non-IID data distributions. Our adaptive pruning method is uniquely based on gradient information, which does not require sharing clients' local data, so that existing privacy preservation and secure aggregation [40] methods for FL can be directly applied to the gradient. Thus, our approach does not introduce extra privacy concerns.

Roadmap. The remainder of this paper is organized as follows. Section III provides preliminaries on FL and model pruning. Section IV presents the proposed PruneFL approach and its analysis. Implementation challenges are discussed in Section V. Section VI presents the experimental setup and results. Section VII draws conclusions.

III. PRELIMINARIES

Federated learning. We consider an FL system with N clients. Each client $n \in [N] := \{1, 2, \ldots, N\}$ has a local empirical risk $F_n(w) := \frac{1}{D_n}\sum_{i\in\mathcal{D}_n} f_i(w)$ defined on its local dataset $\mathcal{D}_n$ (with $D_n := |\mathcal{D}_n|$) for model parameter vector $w$, where $f_i(w)$ is the loss function (e.g., cross-entropy, mean squared error, etc.) that captures the difference between the model output and the desired output of data sample $i$. The system tries to find a parameter $w$ that minimizes the global empirical risk:

$$\min_w \; F(w) := \sum_{n\in[N]} p_n F_n(w), \qquad (1)$$

where $p_n > 0$ are weights such that $\sum_{n\in[N]} p_n = 1$.
For example, if $\mathcal{D}_n \cap \mathcal{D}_{n'} = \emptyset$ for $n \neq n'$ and $p_n = D_n/D$ with $\mathcal{D} := \bigcup_n \mathcal{D}_n$ and $D := |\mathcal{D}|$, we have $F(w) = \frac{1}{D}\sum_{i\in\mathcal{D}} f_i(w)$. Other ways of configuring $p_n$ may also be used to account for fairness and other objectives [41].

In FL, each client n has a local parameter $w_n(k)$ in iteration k. The aggregation of these local parameters is defined as $w(k) := \sum_{n\in[N]} p_n w_n(k)$. The FL procedure usually involves multiple updates of $w_n(k)$ using stochastic gradient descent (SGD) on the local empirical risk $F_n(w_n(k))$ computed by every client n, followed by a parameter fusion step in which the server collects the clients' local parameters $\{w_n(k) : \forall n \in [N]\}$ and computes the aggregated parameter $w(k)$. After parameter fusion, the local parameters $\{w_n(k) : \forall n \in [N]\}$ are all set equal to the aggregated parameter $w(k)$.

In the following, we call this procedure of multiple local SGD iterations followed by a fusion step a round. We use I to denote the number of local SGD iterations in each round. The main notations in this paper are listed in Table I.

TABLE I — MAIN NOTATIONS
⊙ : element-wise product of two vectors
n, N : client index, total number of clients
k, K : iteration index, total number of iterations
I : number of local iterations
p_n : weight for client n (p_n > 0, ∀n, and Σ_n p_n = 1)
m(k) : weight mask in iteration k (universal for all clients)
w_n(k) : client n's parameter in iteration k
w(k) : w(k) := Σ_{n=1}^{N} p_n w_n(k)
w'_n(k), w'(k) : w'_n(k) = w_n(k) ⊙ m(k), w'(k) = w(k) ⊙ m(k)
g_n(w) : client n's stochastic gradient with parameter w
∇F_n(w) : client n's expected gradient with parameter w
∇F(w) : ∇F(w) = Σ_{n=1}^{N} p_n ∇F_n(w)

It is possible that each round only involves a subset of clients, to avoid excessive delay caused by waiting for all the clients [42]. It has been shown that FedAvg converges even with random client participation, although the convergence rate is related to the degree of such randomness [23].
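To make the round structure above concrete, the following minimal sketch simulates FedAvg rounds with I local SGD iterations per client followed by weighted parameter averaging, using a toy least-squares loss. The function and variable names are illustrative assumptions and are not taken from the PruneFL codebase.

```python
import numpy as np

def local_sgd(w, data, targets, lr=0.1, iters=5):
    """Run I local SGD iterations on a linear least-squares loss."""
    w = w.copy()
    for _ in range(iters):
        grad = data.T @ (data @ w - targets) / len(targets)  # gradient of (1/2n)||Xw - y||^2
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, p, lr=0.1, iters=5):
    """One FL round: local updates on every client, then weighted averaging."""
    local_params = [local_sgd(w_global, X, y, lr, iters) for (X, y) in clients]
    return sum(pn * wn for pn, wn in zip(p, local_params))

rng = np.random.default_rng(0)
d = 10
clients = [(rng.normal(size=(50, d)), rng.normal(size=50)) for _ in range(4)]
p = np.array([len(y) for _, y in clients], dtype=float)
p /= p.sum()                       # p_n = D_n / D
w = np.zeros(d)
for k in range(20):                # 20 rounds
    w = fedavg_round(w, clients, p)
```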
Model pruning. In the iterative training and pruning approaches for the centralized machine learning setting, the model is first trained using SGD for a given number of iterations [13], [15], [16]. Then, a certain percentage (referred to as the pruning rate) of the weights with the smallest absolute values within each layer is removed (set to zero). This training and pruning process is repeated until a desired model size is reached. The benefit of this approach is that training and pruning occur at the same time, so that a trained model with a desired (small) size is obtained in the end. However, existing pruning techniques require the availability of training data at a central location, which is not applicable to FL.

Fig. 1. Illustration and flowchart of PruneFL.

IV. PRUNEFL

Our proposed PruneFL approach includes two stages: initial pruning at a selected client and further pruning involving both the server and clients during the FL process. The initial pruning can be done with biased data at a single client that has a relatively high computational capability, and the further pruning stage will "remove" the bias and refine the model. We use adaptive pruning in both stages. The overall procedure is illustrated in Fig. 1. In the following, we introduce the two pruning stages (Section IV-A) and adaptive pruning (Section IV-B).

A. Two-stage Distributed Pruning

Initial pruning at a selected client. Before FL starts, the system selects a single client to prune the model using its local data. This is important for two reasons. First, it allows us to start the FL process with a small model, which can significantly reduce the computation and communication time of each FL round. Second, when clients have heterogeneous computational capabilities, the selected client for initial pruning can be one that is powerful and trusted, so that the time required for initial pruning is short. We apply the adaptive pruning procedure described in Section IV-B, where we adjust the original model iteratively while training the model on the selected client's local dataset, until the model size remains almost unchanged.
Further pruning during the FL process. The model produced by initial pruning may not be optimal, because it is obtained based on data at a single client. However, it is a good starting point for the FL process involving all clients. During FL, we perform further adaptive pruning together with the standard FedAvg procedure, where the model can either grow or shrink depending on which way makes the training most efficient. In this stage, data from all participating clients are involved.

B. Adaptive Pruning

For our adaptive pruning method, the notion of pruning broadly includes both removing and adding back parameters. Hence, we also refer to such pruning operations as reconfiguration. We reconfigure the model at a given interval of multiple iterations. For initial pruning, the reconfiguration interval can be any number of local iterations at the selected client. For further pruning, reconfiguration is done at the server after receiving parameter updates from clients (i.e., at the boundary between two rounds), and the reconfiguration interval in this case is always an integer multiple of the number of iterations (i.e., I) in each round.

Definitions. Let k denote the iteration index, and let g_n(w(k)) denote the stochastic gradient of F_n(w(k)) evaluated at w(k) and computed on the full parameter space on client n. Also, let m_w(k) denote a mask vector that is zero if the corresponding component in w(k) is pruned and one if it is not pruned, and let ⊙ denote the element-wise product.

In each reconfiguration step, adaptive pruning finds an optimal set of remaining (i.e., not pruned) model parameters. Then, parameters are pruned or added back accordingly, and the resulting model and mask are used for training until the next reconfiguration step. This procedure is illustrated in Algorithm 1, where a | b denotes that a divides b, i.e., b is an integer multiple of a. More details are given in Appendix C.

Algorithm 1: Adaptive pruning
1: for k = 0, ..., K − 1 do
2:   Initialize the set of importance measures on each client: Z_n ← ∅, ∀n;
3:   for each client n, in parallel do
4:     Compute the stochastic gradient g_n(w'_n(k)) := g_n(w_n(k) ⊙ m(k));
5:     Update local parameters: w_n(k + 1) ← w'_n(k) − η g_n(w'_n(k)) ⊙ m(k);
6:     Add the importance measure z_n to Z_n: z_n := g_n(w'_n(k)) ⊙ g_n(w'_n(k)); Z_n ← Z_n ∪ {z_n};
7:   if I | (k + 1) then
8:     Each client n sends w_n(k + 1) to the server;
9:     The server aggregates the parameters from all clients: w(k + 1) ← Σ_{n=1}^{N} p_n w_n(k + 1);
10:    if k + 1 is a reconfiguration iteration then
11:      Each client sends the averaged importance measure z̄_n := (Σ_{z_n ∈ Z_n} z_n)/|Z_n| to the server;
12:      The server aggregates the received importance measures: z ← Σ_{n=1}^{N} p_n z̄_n;
13:      Reconfigure using Algorithm 2: w'(k + 1), m(k + 1) ← reconfigure(w(k + 1), z);
14:      Reset: Z_n ← ∅, ∀n;
15:    else
16:      No reconfiguration: w'(k + 1) ← w(k + 1);
17:    The server sends the new parameters to each client: w'_n(k + 1) ← w'(k + 1), ∀n;

Our goal is to find the subnetwork that learns the "fastest". We do so by estimating the empirical risk reduction divided by the time required to complete an FL round, for any given subset of parameters chosen to be pruned. Note that at the beginning of each round (after averaging the clients' parameters), all clients start with the same parameter vector w, i.e., w_n(k) = w(k) for all n if a new round starts at iteration k (see Section III). For approximation purposes, we first consider the change of empirical risk after one SGD iteration starting with a common parameter w(k), for both the initial and further pruning stages. The full FL procedure will be considered later in Theorem 2.

When the model is reconfigured at the end of iteration k, the parameter update in the next iteration will be done on the reconfigured parameter w'(k), so we have an SGD update step of the form

$$w(k+1) = w'(k) - \eta \sum_{n=1}^{N} p_n\, g_n(w'(k)) \odot m_{w'}(k) = w'(k) - \eta\, g(w'(k)) \odot m(k), \qquad (2)$$

where η is the learning rate. For simplicity, we define $g(w'(k)) := \sum_{n=1}^{N} p_n g_n(w'(k))$, and we omit the subscript w' of m in the following when it is clear from the context. Let M denote the index set of components that are not pruned, which corresponds to the indices of all non-zero values of the mask m(k).

Empirical risk reduction. To analyze the empirical risk reduction, we use a first-order approximation, which is a common practice in the literature [18], [43], [44]. We have

$$F(w(k+1)) \approx F(w'(k)) + \langle \nabla F(w'(k)),\, w(k+1) - w'(k) \rangle \qquad (3)$$
$$= F(w'(k)) - \eta\, \langle \nabla F(w'(k)),\, g(w'(k)) \odot m(k) \rangle \qquad (4)$$
$$\approx F(w'(k)) - \eta\, \| g(w'(k)) \odot m(k) \|^2, \qquad (5)$$

where ⟨·, ·⟩ is the inner product, (3) follows from a Taylor expansion, (4) follows from (2), and (5) is obtained by using the stochastic gradient to approximate the actual gradient, i.e., g(w'(k)) ≈ ∇F(w'(k)). Then, the approximate decrease of the empirical risk after the SGD step (2) is

$$F(w'(k)) - F(w(k+1)) \approx \eta\, \| g(w'(k)) \odot m(k) \|^2 \propto \| g(w'(k)) \odot m(k) \|^2 = \sum_{j\in M} g_j^2 =: \Delta(M), \qquad (6)$$
where g_j is the j-th component of g(w'(k)) and we define the set function Δ(M) in the last line. The learning rate η is omitted since it is independent of the relative importance between parameter components, whether it is constant or varying. We use Δ(M) as the approximate risk reduction, where we ignore the proportionality coefficient because our optimization problem is independent of the coefficient. As Δ(M) is defined as the sum of g_j² in (6), we use g_j² as the importance measure for the j-th component of the parameter vector.

Remark. During the further pruning stage, the stochastic gradient g(w'(k)) is the aggregated stochastic gradient from the clients in FL. Since clients cannot compute g(w'(k)) before receiving w'(k) from the server, they compute g(w(k)) and we use g(w'(k)) ≈ g(w(k)), both of which are denoted by g(w'(k)) with components {g_j} in the following. The additional overhead for clients to compute and transmit gradients on the full parameter space in a reconfiguration is small, because pruning is done once in many FL rounds (the interval between two reconfigurations is 50 rounds in our experiments). Further details are given in Appendix C.
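The per-client computation in Algorithm 1 (a masked SGD step as in (2), plus accumulation of the element-wise squared gradient used as the importance measure g_j²) can be sketched in PyTorch as follows. The toy loss and helper names are assumptions for illustration only; the actual implementation is in the PruneFL repository.

```python
import torch

def local_step(w, mask, x, y, model_loss, lr, z_list):
    """One masked SGD iteration (Algorithm 1, lines 4-6).

    w, mask: flat parameter vector and binary mask m(k) of the same shape.
    model_loss: callable returning the empirical loss at w' = w * mask.
    z_list: running list of importance samples z_n = g(w') * g(w') (element-wise).
    """
    w_pruned = (w * mask).detach().requires_grad_(True)   # w'_n(k) = w_n(k) ⊙ m(k)
    loss = model_loss(w_pruned, x, y)
    g = torch.autograd.grad(loss, w_pruned)[0]            # stochastic gradient on the full space
    w_new = w_pruned - lr * g * mask                      # update only the unpruned entries
    z_list.append(g * g)                                  # importance measure g ⊙ g
    return w_new.detach()

def toy_loss(w, x, y):
    """Toy linear-model loss so the sketch runs end to end."""
    return torch.mean((x @ w - y) ** 2)

torch.manual_seed(0)
w = torch.randn(8)
mask = (torch.rand(8) > 0.3).float()
x, y = torch.randn(32, 8), torch.randn(32)
Z = []
for _ in range(5):                                        # I = 5 local iterations
    w = local_step(w, mask, x, y, toy_loss, lr=0.05, z_list=Z)
z_bar = torch.stack(Z).mean(dim=0)                        # averaged importance sent at reconfiguration
```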
Time of one FL round. We define the (approximate) time of one FL round when the model has remaining parameters M as a set function $T(M) := c + \sum_{j\in M} t_j$, where c ≥ 0 is a fixed constant and t_j > 0 is the time corresponding to the j-th parameter component. Note that this is a linear function, which is sufficient according to our empirical observations (see Appendix D1). In particular, the quantity t_j can depend on the neural network layer, and c captures a constant system overhead. From our experiments, we observed that t_j remains the same for all j that belong to the same neural network layer. Therefore, we can estimate the quantities {t_j} and c by measuring the time of one FL round for a small subset of different model sizes, before the overall pruning and FL procedure starts. An extension to the general case with non-linear T(M) is also discussed in Appendix A.
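Because T(M) is linear and t_j is shared within a layer, the constant c and the per-layer coefficients can be estimated with an ordinary least-squares fit over a few measured round times. The measurements below are hypothetical placeholders and the names are not from the paper.

```python
import numpy as np

# Hypothetical measured average round times for a few pruned model sizes.
# Each row: number of remaining parameters per layer; round_times: seconds per round.
counts = np.array([
    [20_000,  5_000],
    [50_000, 10_000],
    [80_000, 20_000],
    [100_000, 40_000],
], dtype=float)
round_times = np.array([12.1, 18.0, 25.3, 38.9])

# T = c + sum_l t_l * n_l  ->  linear regression with an intercept column.
A = np.hstack([np.ones((len(counts), 1)), counts])
coef, *_ = np.linalg.lstsq(A, round_times, rcond=None)
c, t_per_layer = coef[0], coef[1:]     # fixed overhead c and per-parameter time for each layer

def predict_round_time(remaining_per_layer):
    """Estimate T(M) for a model with the given per-layer parameter counts."""
    return c + float(np.dot(t_per_layer, remaining_per_layer))
```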
Optimization of reconfiguration. We would like to find the set of remaining parameters M that maximizes the empirical risk reduction per unit of training time. However, Δ(M) only captures the risk reduction in the next SGD step when starting from the reconfigured parameter vector w'(k), as defined in (6). It does not capture the change in empirical risk when using w'(k) instead of the original parameter vector w(k) before reconfiguration. In other words, in addition to maximizing Γ(M) := Δ(M)/T(M), we also need to ensure that F(w'(k)) ≈ F(w(k)).

To ensure F(w'(k)) ≈ F(w(k)) after reconfiguration, we define an index set P to denote the parameters that are not allowed to be pruned. Usually, P includes parameters whose magnitudes are larger than a certain threshold, because pruning them can cause F(w'(k)) to become much larger than F(w(k)). Among the remaining parameters that can be pruned (or added back if they are already pruned), denoted by P̄, we find which of them to prune so as to maximize Γ(M). This yields the following optimization problem:

$$\max_{A \subseteq \overline{P}} \; \Gamma(A \cup P), \qquad (7)$$

where A is the set of parameters in P̄ that remain (i.e., are not pruned). The final set of remaining parameters is then M = A ∪ P. Note that P ∪ P̄ is the set of all parameters in the original model.

Algorithm 2: Solving (7)
Input: importance measure g_j² and time coefficient t_j for each parameter index j
Output: the optimal subset of parameters A
1: A ← ∅;
2: S ← argsort_{j∈P̄} g_j²/t_j;   // ordered set
3: for j ∈ S do
4:   if g_j²/t_j ≥ Γ(A ∪ P) then
5:     A ← A ∪ {j};
6:   else
7:     break;
8: return A;   // final result

The algorithm for solving (7) is given in Algorithm 2, where sorting is in non-increasing order and S is an ordered set containing the sorted indices. In essence, this algorithm sorts the ratios of components in the sums of Δ(M) and T(M). When the individual ratio g_j²/t_j is larger than the current overall ratio Γ, adding j to A increases Γ. The bottleneck of this algorithm is the sorting operation. Hence, the overall time complexity of this algorithm is O(|P̄| log |P̄|).

Theorem 1. We have Γ(A ∪ P) ≥ Γ(A' ∪ P), where A is obtained from Algorithm 2 and A' is any subset of P̄ with A' ≠ A.

Theorem 1 shows that the result obtained from our Algorithm 2 is a globally optimal solution to (7).
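Algorithm 2 sorts the prunable components by g_j²/t_j and keeps adding them while the individual ratio is at least the current overall ratio Γ(A ∪ P). A direct Python transcription of this greedy rule is sketched below; the variable names are assumptions, and the constant round overhead c enters only through the denominator.

```python
def reconfigure(g_sq, t, prunable, protected, c=0.0):
    """Greedy solver for (7): choose A ⊆ prunable maximizing Γ(A ∪ protected),
    where Γ(M) = Σ_{j∈M} g_j² / (c + Σ_{j∈M} t_j).

    g_sq[j]: importance measure (squared aggregated gradient component).
    t[j]:    time coefficient of component j; c is the fixed per-round overhead.
    """
    num = sum(g_sq[j] for j in protected)          # Δ of the always-kept set P
    den = c + sum(t[j] for j in protected)
    A = []
    # Visit prunable indices by g_j²/t_j in non-increasing order.
    for j in sorted(prunable, key=lambda i: g_sq[i] / t[i], reverse=True):
        gamma = num / den if den > 0 else 0.0
        if g_sq[j] / t[j] >= gamma:                # keeping j does not lower Γ
            A.append(j)
            num += g_sq[j]
            den += t[j]
        else:
            break                                  # later ratios are no larger
    return A                                       # final remaining set is A ∪ protected

# Example: six prunable components, two protected ones, unit time coefficients.
g_sq = [0.9, 0.5, 0.4, 0.05, 0.01, 0.2, 0.8, 0.3]
t = [1.0] * 8
print(reconfigure(g_sq, t, prunable=[0, 1, 2, 3, 4, 5], protected=[6, 7], c=0.5))
```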
Convergence of adaptive pruning. As adaptive pruning can both increase and decrease the model size over time, a natural question is whether the model parameter vector will converge to a fixed value in a regular FL procedure with I ≥ 1 local SGD iterations in each round. We study this problem in the following.

We first make the following minimal set of assumptions, which are common in the literature [22], [23].

Assumption 1.
(a) Smoothness: $\|\nabla F_n(w_1) - \nabla F_n(w_2)\| \le \beta \|w_1 - w_2\|,\ \forall n, w_1, w_2$, where β is a positive constant.
(b) Lipschitzness: $|F(w_1) - F(w_2)| \le L \|w_1 - w_2\|,\ \forall w_1, w_2$, where L is a positive constant.
(c) Unbiasedness: $\mathbb{E}[g_n(w)] = \nabla F_n(w),\ \forall n, w$.
(d) Bounded variance: $\mathbb{E}\|g_n(w) - \nabla F_n(w)\|^2 \le \sigma^2,\ \forall n, w$.
(e) Bounded divergence: $\|\nabla F(w) - \nabla F_n(w)\|^2 \le \epsilon^2,\ \forall n, w$, where $F(w) := \sum_{n=1}^{N} p_n F_n(w)$ as defined in (1).
(f) Time independence in SGD: The stochastic gradients obtained in different iterations are independent from each other.
(g) Client independence: The stochastic gradients obtained from different clients are always independent from each other, even in the same iteration.

Theorem 2. When Assumption 1 holds and $\eta \le \frac{1}{2\sqrt{6}\, I\beta}$, we have

$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\big\| \nabla F(w'(k)) \odot m_{w'}(k) \big\|^2 \le \frac{2(F_0 - F^*)}{\eta K} + \alpha\eta\beta\sigma^2 + 4\beta^2\big((1-\alpha)I\sigma^2 + 3I^2\epsilon^2\big)\eta^2 + \frac{2L}{\eta K}\sum_{k=0}^{K-1}\mathbb{E}\big\| w(k) - w'(k) \big\|, \qquad (8)$$

where $\alpha := \sum_{n=1}^{N} p_n^2$, $F_0 := F(w(0))$, $F^* := \min_w F(w)$, and I is the number of iterations per round.

Under some conditions, this convergence can be bounded asymptotically by $O\!\big(\tfrac{1}{\sqrt{NK}}\big) + O\!\big(\tfrac{1}{K}\big)$, achieving linear speedup^1 with the number of clients N [31], [45] for sufficiently large K. See Appendix B2 for more details.

^1 The dominant term is $O\!\big(\tfrac{1}{\sqrt{NK}}\big)$ when K is sufficiently large. The notion of linear speedup means that, to reach the same error bound, the total number of rounds K can proportionally decrease as N increases.

If we do not reconfigure in an iteration k, we have w(k) = w'(k). In the right-hand side (RHS) of (8), the first two terms go to zero as K → ∞. The last term is related to how well w' approximates w after pruning. To ensure that the sum in the last term grows more slowly than √K, the number of non-zero prunable parameters (which belong to P̄) should decrease over time. Note that we consider all zero parameters to be prunable and they also belong to P̄, thus the size of P̄ itself may not decrease over time. This convergence result shows that the gradient components corresponding to the remaining (i.e., not pruned) parameters vanish over time, which suggests that we will get a "stable" parameter vector in the end, because when the gradient norm is small, the change of parameters in each iteration is also small. In addition to gradient convergence on the subspace after pruning as suggested in Theorem 2, our experiments show that our pruned model also converges to an accuracy close to that of the full-sized model.

Tracking a small model. By choosing the size of P̄ properly over time, our adaptive pruning algorithm can keep reducing the model size as long as such reduction does not adversely impact further training. Intuitively, the model that we obtain from this process is one that has a small size while maintaining full "trainability" in future iterations. Parameter components for which the corresponding gradient components remain zero (or close to zero) will be pruned.

In cases where a target maximum model size should be reached at convergence (e.g., for efficient inference later), we can also enforce a maximum size constraint in each reconfiguration that starts with the full size and gradually decreases to the target size as training progresses, which allows the model to train quickly in initial rounds while converging to the target size in the end.

V. IMPLEMENTATION

A. Using Sparse Matrices

Although the benefit of model pruning in terms of computation is constantly mentioned in the literature from a theoretical point of view [13], most existing implementations substitute sparse parameters by applying binary masks to dense parameters. Applying masks increases the overhead of computation instead of reducing it. We implement sparse matrices for model pruning, and we show their efficacy in our experiments. We use dense matrices for full-sized models, and sparse matrices for weights in both convolutional and fully-connected layers in pruned models.

B. Complexity Analysis

Storage, memory, and communication. We implement two types of storage for sparse matrices: bitmap and value-index tuple. Bitmap uses one extra bit to indicate whether the specific value is zero. For 32-bit floating-point parameter components, bitmap incurs 1/32 extra storage and communication overhead. Value-index tuple stores the values and both row and column indices of all non-zero entries. In our implementation, we use 16-bit integers to store row and column indices and 32-bit floating-point numbers to store parameter values. Since each parameter component is associated with a row index and a column index, the storage and communication overhead doubles compared to storing the values only. We dynamically choose between the two ways of storage, and thus the ratio of the sparse parameter size to the dense parameter size is min(2×d, 1/32 + d), where d is the model's density (percentage of non-zero parameters). This ratio is further optimized when the matrix sparsity pattern is fixed (in most FL rounds, see Appendix C). In this case, there is no extra cost since only the values of the non-zero entries need to be exchanged.
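The dynamic choice between the two storage formats reduces to evaluating the ratio min(2d, 1/32 + d) stated above. A small helper (illustrative, not the library code) that picks the cheaper format for a given density:

```python
def sparse_to_dense_size_ratio(density: float) -> tuple[str, float]:
    """Return the cheaper sparse format and its size relative to dense storage.

    Assumes 32-bit float values and 16-bit row/column indices, as in Section V:
    value-index tuples cost 2*d (value plus two 16-bit indices per non-zero entry),
    while bitmap costs 1/32 + d (one extra bit per component plus the non-zero values).
    """
    tuple_ratio = 2.0 * density
    bitmap_ratio = 1.0 / 32.0 + density
    if tuple_ratio <= bitmap_ratio:
        return "value-index", tuple_ratio
    return "bitmap", bitmap_ratio

for d in (0.01, 0.05, 0.2, 0.5):
    fmt, ratio = sparse_to_dense_size_ratio(d)
    print(f"density {d:.2f}: use {fmt}, size ratio {ratio:.3f}")
```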
the gradient norm is small, the change of parameters in each values of the non-zero entries need to be exchanged.
iteration is also small. In addition to gradient convergence on Computation. Because dense matrix multiplication is ex-
the subspace after pruning as suggested in Theorem 2, our tremely optimized, sparse matrices will show advantage in
experiments show that our pruned model also converges to an computation time only when the matrix is below a certain
accuracy close to that of the full-sized model. density, where this density threshold depends on specific hard-
Tracking a small model. By choosing the size of P ware and software implementations. In our implementation,
properly over time, our adaptive pruning algorithm can keep we choose either dense or sparse representation depending
reducing the model size as long as such reduction does not on which one is more efficient. The complexity (computation
adversely impact further training. Intuitively, the model that time) of the matrix multiplication between a sparse matrix S
we obtain from this process is one that has a small size while and a dense matrix D is linear to the number of non-zero
maintaining full “trainablity” in future iterations. Parameter entries in S (assuming D is fixed).
components for which the corresponding gradient components
remain zero (or close to zero) will be pruned. C. Implementation Challenges
1 The

dominant term is O √ 1

when K is sufficiently large. The notion
As of today, well-known machine learning frameworks have
NK
of linear speedup means that, to reach the same error bound, the total number limited support of sparse matrix computation. For instance, in
of rounds K can proportionally decrease as N increases. PyTorch version 1.6.0, the persistent storage of a matrix in
C. Implementation Challenges

As of today, well-known machine learning frameworks have limited support for sparse matrix computation. For instance, in PyTorch version 1.6.0, the persistent storage of a matrix in sparse form takes 5× the space of its dense form; computations on sparse matrices are slow; and sparse matrices are not supported for the kernels in convolutional layers, etc. To benefit from using sparse matrices in real systems, we extend the PyTorch library by implementing a more efficient sparse storage and support for sparse convolutional kernels. We only partially improve backward passes due to implementation limitations (more details in Appendix C2). This problem, however, can be addressed in the future by implementing and further optimizing efficient sparse matrix multiplication in low-level software, as well as by developing specialized hardware for this purpose. Nevertheless, the novelty in our implementation is that we use sparse matrices in both fully-connected and convolutional layers in the pruned model.

VI. EXPERIMENTS

In this section, we present the experimental setup and results.

Datasets. We evaluate PruneFL on four image classification tasks:
(a) Conv-2 model on FEMNIST [46],
(b) VGG-11 model [47] on CIFAR-10 [48],
(c) ResNet-18 model [49] on ImageNet-100 [50],
(d) MobileNetV3-Small model [51] on CelebA [46],
all of which represent typical FL tasks. Due to practical considerations of edge devices' training time and storage capacity, we select data corresponding to 193 writers for FEMNIST, and the first 100 classes of the ImageNet dataset (referred to as ImageNet-100). We adapt some layers in VGG-11, ResNet-18, and MobileNetV3-Small to match the number of output labels in our datasets.

When using full client participation, because we only have 10 clients in total, for FEMNIST, we partition all the 193 writers' images into 10 clients (the first 9 clients each have 19 writers' images and the last client has 22 writers' images). For CelebA, we partition all the 9,343 persons' images into 10 clients (the first 9 clients each have 934 persons' images and the last client has 937 persons' images). Note that such partitioning is still non-IID.

Model architectures. The architecture details are presented in Table C.1 in the appendix. VGG-11, ResNet-18, and MobileNetV3-Small are well-known architectures, and we directly acquire Conv-2 from its original work [46].

Platform. To study the performance of our proposed approach, we conduct experiments in (i) a real edge computing prototype, where a personal computer serves as both the server and a client, and the other clients are Raspberry Pi devices, and (ii) a simulated setting with multiple clients and a server, where computation and communication times are obtained from measurements involving either Raspberry Pi devices or Android phones.

Unless otherwise specified, the prototype system includes nine Raspberry Pi (version 4, with 2 GB RAM and a 32 GB SD card) devices as clients and a personal computer without a GPU as both a client and the server (totaling 10 clients). Three of the Raspberry Pis use wireless connections and the remaining six use wired connections. The communication speed is stable and is approximately 1.4 MB/s. The simulated system uses the same settings as the prototype. We use time measurements from Raspberry Pis, except for the ImageNet-100 dataset, where we replace the computation time with measurements from an Android virtual machine (VM).

We consider FL with full client participation in the main paper and present results with random client selection [42] in Appendix D2; the results are similar. FEMNIST and CelebA data are partitioned into clients in a non-IID manner according to writer/person identity, and CIFAR-10 and ImageNet-100 are partitioned into clients in an IID manner.

Baselines. We compare the test accuracy vs. time curve of PruneFL with five baselines: (i) conventional FL [2], (ii) iterative pruning [13], (iii) online learning [9], (iv) SNIP [18], and (v) SynFlow [17]. Because iterative pruning and SNIP cannot automatically determine the model size, we consider an enhanced version of these baselines that obtains the same model size as PruneFL at convergence. Additional baselines are also considered in Section VI-B.

Since our experiments try to minimize the training time using pruning, there is no direct way of comparing with baselines that either are not specifically designed for pruning (the online learning baseline) or do not adapt the pruned model size (all other baselines). We compare with the baselines as follows. In every round, the online learning approach produces a model size for the next round, and we adjust the model accordingly while keeping each layer's density the same. To compare with SNIP, after the first round, we let SNIP prune the original model in a one-shot manner to the same density as the final model found by our adaptive pruning method, and keep the architecture afterwards. Similarly, to compare with SynFlow, we let SynFlow prune the model (before training) to the same density as the final model found by our adaptive pruning method, and keep the architecture afterwards. To compare with iterative pruning, we let the model be pruned with a fixed rate 20 times (at equal intervals) in the first half of the total number of rounds, such that the remaining number of parameter components equals that of the model found by our adaptive pruning method, and the pruning rate is equal across layers. See Fig. 9 for an illustration of the baseline settings.

Pruning configurations. The initial pruning stage is done on the personal computer client. We end the initial pruning stage either when the model size is "stable" or when it exceeds a certain maximum number of iterations. We consider the model size as "stable" when its relative change is below 10% for 5 consecutive reconfigurations.

For adaptive pruning, to ensure convergence of the last term on the RHS of (8) in Theorem 2, we exponentially decrease the number of non-zero prunable parameters in P̄ over rounds. We note that P̄ includes both zero and non-zero parameters, hence the size of P̄ itself may not decrease. For a given size of P̄, the |P̄| parameters with the smallest magnitudes belong to P̄ and can be pruned (or added back), and the rest belong to P and cannot be pruned.
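The magnitude-based split between the protected set P and the prunable set P̄ described above can be sketched as follows; the helper name and the use of a flat weight vector are assumptions for illustration, not the PruneFL implementation.

```python
import torch

def split_prunable(weights: torch.Tensor, num_prunable_nonzero: int) -> torch.Tensor:
    """Return a boolean mask marking the prunable set P̄ (its complement is P).

    All zero components are prunable; among the non-zero components, the
    num_prunable_nonzero entries with the smallest magnitudes are also prunable,
    and the remaining (larger-magnitude) entries form the protected set P.
    """
    flat = weights.flatten().abs()
    nonzero = flat > 0
    prunable = ~nonzero                              # zeros are always prunable
    k = min(num_prunable_nonzero, int(nonzero.sum()))
    if k > 0:
        thresh = flat[nonzero].kthvalue(k).values    # k-th smallest non-zero magnitude
        prunable |= nonzero & (flat <= thresh)
    return prunable.view_as(weights)

w = torch.randn(1_000) * (torch.rand(1_000) > 0.5)   # toy weights, roughly half already zero
mask = split_prunable(w, num_prunable_nonzero=100)
print(int(mask.sum()), "prunable of", w.numel())
```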
Biases (if any) in the DNNs are not pruned. In ResNet-18, BatchNorm layers and downsampling layers are not pruned, since the number of parameters in such layers is negligible compared to the size of the convolutional and fully-connected layers.
TABLE II — EVALUATION CONFIGURATIONS (C.S. STANDS FOR CLIENT SELECTION; LR STANDS FOR LEARNING RATE).
Dataset: FEMNIST | CIFAR-10 | ImageNet-100 | CelebA
SGD parameters in round r (LR): 0.25 | 0.1 · 0.5^{r/10000} | 0.05 · 0.5^{⌊r·0.1⌋/1000} | 0.2
Fraction of non-zero prunable parameters in round r: 0.3 · 0.5^{r/10000} (all datasets)
Number of data samples used in initial pruning: 200 | 200 | 500 | 500
Number of clients (non-C.S., C.S.): 10, 193 | 10, 100 | 10, 100 | 10, 934
Mini-batch size, local iterations in each round: 20, 5 (all datasets)
Reconfiguration: every 50 rounds (all datasets)
Total number of FL rounds: 10,000 | 10,000 | 20,000 | 1,000
Evaluation: prototype (Pi 4), simulation (Pi 4) | simulation (Pi 4) | simulation (Android VM) | simulation (Pi 4)
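The decaying schedules in Table II can be evaluated with simple helper functions; the sketch below implements the prunable-fraction schedule used for all datasets and, as one example, the CIFAR-10 learning-rate schedule. The function names are illustrative assumptions.

```python
def prunable_fraction(r: int, init: float = 0.3, half_life: float = 10_000.0) -> float:
    """Fraction of non-zero parameters that remain prunable in round r (Table II)."""
    return init * 0.5 ** (r / half_life)

def cifar10_lr(r: int, base: float = 0.1, half_life: float = 10_000.0) -> float:
    """Decayed learning rate for CIFAR-10 in round r (Table II)."""
    return base * 0.5 ** (r / half_life)

for r in (0, 5_000, 10_000):
    print(r, round(prunable_fraction(r), 4), round(cifar10_lr(r), 4))
```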

Lottery ticket analysis. To verify whether the final model from adaptive pruning is a lottery ticket [11], [14], we reinitialize this converged model using the original random seed and compare its accuracy vs. round curve with (i) conventional FL, (ii) random reinitialization (same architecture as the lottery ticket but initialized with a different random seed), (iii) SNIP, and (iv) SynFlow.

Hyperparameters. The hyperparameters above are chosen empirically with only coarse tuning based on experience. We observe that our method and the other methods are insensitive to these hyperparameters. Hence, we do not perform fine tuning on any parameter. The detailed evaluation configurations are given in Table II.

A. Time Measurement

We present the time measurements of one FL round on the prototype system to show the effectiveness of model pruning on edge devices. We implement the full-sized Conv-2 model in dense form as well as the pruned models in sparse form at different densities, and measure the average elapsed time of FL on these pruned models, involving both the server and clients, over 10 rounds.

Fig. 2 shows the average total time, computation time, and communication time in one round as we vary the model density. Note that the model is in dense form at 100% on the x-axis and in sparse form elsewhere. We also plot in this figure the actual size of the parameters that are exchanged between the server and clients in one FL round.

Fig. 2. Training time on Raspberry Pi 4 (FEMNIST).
Fig. 3. Inference time on Raspberry Pi 4 (FEMNIST).

Computation time. We see from Fig. 2 that as the model density decreases, the computation time (for five local iterations) decreases from 11.24 seconds per round to 6.34 seconds per round. This reduction in computation time is moderate, since our implementation of sparse computation only partially improves backward passes (see Section V). Additionally, we plot in Fig. 3 the total inference time and the number of floating-point operations (FLOPs) for 200 data samples (see Appendix C3 for details of the FLOPs computation). The inference time result shows a similar trend as in Fig. 2, and the number of FLOPs keeps decreasing as we reduce the model size.

Communication time. Our implementation of sparse matrices reduces the storage requirement significantly (see Section V). Compared to the computation time, the decrease in the communication time is more noticeable: it drops from 35.88 seconds per round to 1.04 seconds per round.

Enabling FL on low-power edge devices. In addition, we ran experiments with the LeNet-300-100 [52] architecture, and we observed that when training the MNIST [52] dataset on the
full-sized, dense-form LeNet-300-100 model on Raspberry Pi version 3 (with 1 GB RAM and a 32 GB SD card), the system dies during the first mini-batch due to resource exhaustion, while the models in sparse form can be trained. Thus, our approach of using sparse models enables model training on low-power edge devices, which is otherwise impossible on Raspberry Pi 3 in this experiment due to the device's resource limitation.

B. Training Cost Reduction

In the following, we study PruneFL's cost reduction in terms of both time and FLOPs for training.

Comparing conventional FL and PruneFL. Fig. 4 shows the test accuracy vs. time results on both the prototype and simulated systems for Conv-2 on FEMNIST. The time for initial pruning of PruneFL is included in this figure; it is negligible (less than 500 seconds) compared to the further pruning stage. We see that PruneFL outperforms conventional FL by a significant margin. Since the prototype and simulation results match closely, we present simulation results in subsequent experiments due to the excessive training time on the prototype system.

Fig. 4. Comparing conventional FL and PruneFL with both prototype and simulation results (FEMNIST).

Training time reduction. In Fig. 5, we compare the test accuracy vs. time results for all datasets, models, and baselines. Clearly, PruneFL demonstrates a consistent advantage in training speed over the baselines. Moreover, PruneFL always converges to an accuracy similar to that achieved by conventional FL (see Appendix D3). Other methods may have suboptimal performance; e.g., SNIP does not converge to conventional FL's accuracy with CIFAR-10. We also observe that some approaches, such as online learning and SynFlow in Fig. 5(b), always stay at the random-guess accuracy. The reason could be that such approaches are unstable and can get stuck in local optimal points at the beginning of training.

Fig. 5. Test accuracy vs. time results for four datasets: (a) Conv-2 on FEMNIST; (b) VGG-11 on CIFAR-10; (c) ResNet-18 on ImageNet-100; (d) MobileNetV3-Small on CelebA.

Training FLOPs reduction. Although we observe that the training time on Raspberry Pis is relatively consistent across different models and tasks, there are still factors that can affect the training time (e.g., environment temperature). To further validate our approach's advantage in accelerating training, we present the results on test accuracy vs. accumulated FLOPs per client for FEMNIST in Fig. 6. We find that this result shares the same characteristics as Fig. 5(a) in terms of acceleration (we present only one set of results here due to the similarity).

Fig. 6. Test accuracy vs. accumulated FLOPs per client (FEMNIST).

Time and FLOPs to reach target accuracy. Table III lists the time and accumulated FLOPs per client at which each algorithm first reaches a certain accuracy on FEMNIST. PruneFL takes less than 1/3 of the time of conventional FL to reach 80% accuracy, and it also saves more than 33% of the time (more than 2 hours) compared to SNIP and SynFlow. The savings in FLOPs are similar.

TABLE III — TIME AND ACCUMULATED FLOPS PER CLIENT TO REACH TARGET ACCURACY (FEMNIST).
Approach: time (FLOPs) to reach 70% accuracy | time (FLOPs) to reach 80% accuracy
Conventional FL: 17,929 s (3.5 TFLOPs) | 52,153 s (10.5 TFLOPs)
PruneFL (ours): 3,187 s (1.6 TFLOPs) | 15,009 s (6.8 TFLOPs)
SNIP: 6,801 s (3.3 TFLOPs) | 22,467 s (11.7 TFLOPs)
SynFlow: 7,132 s (3.6 TFLOPs) | 22,327 s (12.3 TFLOPs)
Online: 18,042 s (3.5 TFLOPs) | 54,593 s (10.7 TFLOPs)
Iterative: 17,495 s (3.5 TFLOPs) | 46,521 s (10.1 TFLOPs)

Comparing with additional baselines. To avoid bottlenecks, our algorithm and implementation ensure that all components in PruneFL, including communication, computation, and reconfiguration, are orchestrated and inexpensive. For this reason, some approaches in the literature that are not specifically designed for the edge computing environment with low-power devices may perform poorly when applied to our system setup, as we illustrate next.

Considering computation time, PruneTrain [21] applies regularization to every input and output channel in every layer. When the same regularization is applied to our system, we find that the computation (using FEMNIST and Conv-2) takes 17.65 seconds per round, which is a 57% increase compared to PruneFL.

Considering communication time, dynamic pruning with feedback (DPF) [19] maintains a full-sized model, and clients have to upload full-sized gradients to the server (but only download a subset of model parameters) in every round.
Thus, assuming unit model size and model density d, the communication cost per round for DPF, including both uploading and downloading, is 1 + d. In comparison, clients in PruneFL only upload the full-sized model to the server at a reconfiguration round (every 50 rounds in our experiments), and always exchange pruned models otherwise. This gives an average cost of ((1 + d) + 2×49×d)/50 = 0.02 + 1.98d, including both uploading and downloading. For instance, when the model density is 10%, DPF incurs about 5× the communication cost of PruneFL. Finally, our reconfiguration algorithm (Algorithm 2) runs in quasi-linear time, making it feasible to implement on edge devices.
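The average-cost comparison above is easy to check numerically; the sketch below simply evaluates the two expressions for a unit model size and density d, and reproduces the roughly 5× gap at d = 10%. The function names are illustrative.

```python
def dpf_cost(d: float) -> float:
    """DPF: upload a full-sized gradient (1) and download a pruned model (d)."""
    return 1.0 + d

def prunefl_cost(d: float, reconfig_interval: int = 50) -> float:
    """PruneFL: one reconfiguration round costs 1 + d, the other rounds cost 2d."""
    total = (1.0 + d) + 2.0 * (reconfig_interval - 1) * d
    return total / reconfig_interval

for d in (0.05, 0.1, 0.2):
    ratio = dpf_cost(d) / prunefl_cost(d)
    print(f"d={d:.2f}: DPF {dpf_cost(d):.3f}, PruneFL {prunefl_cost(d):.3f}, ratio {ratio:.2f}x")
```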
C. Finding a Lottery Ticket

Fig. 7. Lottery ticket results for four datasets: (a) Conv-2 on FEMNIST; (b) VGG-11 on CIFAR-10; (c) ResNet-18 on ImageNet-100; (d) MobileNetV3-Small on CelebA.

Unlike some existing pruning techniques such as SNIP [18], dynamic pruning [19], and SynFlow [17], PruneFL finds a lottery ticket (although not necessarily the smallest one). In Fig. 7, FL with the reinitialized pruned model obtained from PruneFL learns comparably fast as FL with the original model, in terms of test accuracy vs. round, confirming that these models are lottery tickets. In Fig. 7(b), the Random Reinit curve stays at the random-guess accuracy. This is not surprising, since the lottery ticket, i.e., the final pruned model found by PruneFL, needs to be reinitialized to its original values to learn comparably fast as the full-sized model [11]. When reinitialized with different values, the training of the "lottery ticket" can be suboptimal; in this experiment, it is stuck at the beginning of training.

Fig. 8. PruneFL with only one pruning stage (FEMNIST).

Fig. 8 compares the test accuracy vs. round curves of PruneFL with alternative methods that either only include initial pruning (at a single client) or only include further pruning (during FL). It shows that the model obtained from the initial pruning stage does not converge to the optimal accuracy, and only performing further pruning without initial pruning converges more slowly. PruneFL with both stages avoids the drawbacks of methods that include only one stage. Furthermore, the model obtained from initial pruning is not a lottery ticket of the original model, while PruneFL with only further pruning or with both pruning stages finds a lottery ticket. Therefore, one can view PruneFL as a two-stage procedure for finding a lottery ticket of the given model, which is in line with our claims in Section IV. The ability to find lottery tickets is useful when we need to retrain a pruned model on slightly different but similar datasets [14].

D. Model Size Adaptation

Fig. 9. Number of parameters vs. round for four datasets: (a) Conv-2 on FEMNIST; (b) VGG-11 on CIFAR-10; (c) ResNet-18 on ImageNet-100; (d) MobileNetV3-Small on CelebA.

An illustration of the change in model size is shown in Fig. 9. The small negative part of the x-axis shows the initial pruning stage, which is unique to PruneFL. Since there is no notion of a "round" in the initial pruning stage, we consider five local iterations in this stage as one round, which is consistent with our FL setting. Conventional FL always keeps the full model size. The model size produced by online learning is unstable; it fluctuates in initial rounds due to its exploration. SNIP and SynFlow prune the initial model to the target size in a one-shot manner at the beginning of training. Iterative pruning gradually reduces the model size until reaching the target. We notice that PruneFL also discovers the degree of overparameterization. Empirically, Conv-2, an overparameterized model for FEMNIST, converges to a small density (13.4%), while ResNet-18, an underparameterized model for ImageNet-100, converges to a density of around 67.7%.

Fig. 10. SNIP with different densities (FEMNIST).

It is worth mentioning that finding a proper target density for pruning is non-trivial. Usually, foresight pruning methods such as [18] and [43] prune the model to a (manually selected) density before training. Fig. 10 shows two cases where we use SNIP to prune Conv-2 (with FEMNIST) to 30% and 1% density: the training speed becomes slower, and if the density is too small (1%), the sparse model cannot converge to the same accuracy as the original model. In comparison, PruneFL automatically determines a proper density.

E. Training with Limited/Targeted Model Sizes

There are cases where a hard limit on the maximum model size or a targeted final model size (or both) is desired. For example, if some of the client devices have limited memory or storage so that only a partial model can be loaded, then the model size must be constrained after initial pruning. A targeted model size may be needed when the goal of the FL system is to obtain a model with a certain small size at the end of training.
Next, we present an extended PruneFL with limited and targeted model sizes, and show that with reasonable constraints, PruneFL still achieves good results. We use a heuristic to limit the model size: we stop Algorithm 2 early when the number of remaining parameters reaches the maximum allowed size, and we schedule the maximum size of the model to decrease linearly. Assuming d_l is the density limit, d_t is the target model density at the end of the further pruning stage (d_t ≤ d_l), and PruneFL is run for r_max rounds after the initial pruning stage, the maximum density at round r is

$$d_{\max}(r) = \frac{r \cdot d_t + (r_{\max} - r)\cdot d_l}{r_{\max}}.$$

Fig. 11. Training with limited and targeted size (FEMNIST).

The results of selecting d_l = 10% and d_t = 5% for Conv-2 on FEMNIST are given in Fig. 11. We see that if we do not impose these model size constraints, PruneFL exceeds the density limits d_l = 10% and d_t = 5% defined in this example and obtains a model that is much larger than the target density d_t = 5% at the end of training. PruneFL with the size limit and target in effect still achieves fast convergence and a similar convergence accuracy, and the model size always stays below the threshold d_l = 10% and reaches the target density d_t = 5% at the end of training.
includes initial and further pruning stages, which improves the
to limit the model size: we stop Algorithm 2 early when
performance compared to only having a single stage. PruneFL
the number of remaining parameters reaches the maximum
also includes a low-complexity adaptive pruning method for
allowed size, and we schedule the maximum size of the model
efficient FL, which finds a desired model size that can achieve
to decrease linearly. Assuming dl is the density limit, dt is
a similar prediction accuracy as the original model but with
the target model density at the end of the further pruning
much less time. Our experiments on Raspberry Pi devices
stage (dt ≤ dl ), and PruneFL is run for rmax rounds after
confirm that we improve the cost-efficiency of FL while
the initial pruning stage, then the maximum density at round
 obtaining a lottery ticket. Our method can be applied together
r is dmax (r) = r · dt + (rmax − r) · dl /rmax . The results of
with other compression techniques, such as quantization, to
selecting dl = 10% and dt = 5% for Conv-2 on FEMNIST
further reduce the communication overhead.
are given in Fig. 11. We see that if we do not impose these
model size constraints, PruneFL exceeds the density limits
dl = 10% and dt = 5% defined in this example, and obtains ACKNOWLEDGMENT
a model that is much larger than the target density dt = 5% This work was partially supported by the U.S. Office of
at the end of training. We see that PruneFL with effective size Naval Research under Grant N00014-19-1-2566, the U.S.
limit and target still achieves fast convergence and similar National Science Foundation AI Institute Athena under Grant
convergence accuracy, and the model size is always limited CNS-2112562, and the U.S. Army Research Laboratory and
below the threshold dl = 10% and reaches the target density the U.K. Ministry of Defence under Agreement Number
dt = 5% at the end of training. W911NF-16-3-0001. The views and conclusions contained in
this document are those of the authors and should not be inter- [21] S. Lym, E. Choukse, S. Zangeneh, W. Wen, S. Sanghavi, and M. Erez,
preted as representing the official policies, either expressed or “Prunetrain: fast neural network training by dynamic sparse model
reconfiguration,” in Proceedings of the International Conference for
implied, of U.S. Office of Naval Research, the U.S. National High Performance Computing, Networking, Storage and Analysis, 2019,
Science Foundation, the U.S. Army Research Laboratory, the pp. 1–13.
U.S. Government, the U.K. Ministry of Defence or the U.K. [22] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster con-
vergence and less communication: Demystifying why model averaging
Government. The U.S. and U.K. Governments are authorized works for deep learning,” in Proceedings of the AAAI Conference on
to reproduce and distribute reprints for Government purposes Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
notwithstanding any copyright notation hereon. V. Valls has [23] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On
the convergence of fedavg on non-iid data,” in International
also received funding from the European Union’s Horizon Conference on Learning Representations, 2020. [Online]. Available:
2020 research and innovation programme under the Marie https://openreview.net/forum?id=HJxNAnVtDS
Skłodowska-Curie grant agreement No. 795244. [24] J. Wang, S. Wang, R.-R. Chen, and M. Ji, “Local averaging helps: Hi-
erarchical federated learning and convergence analysis,” arXiv preprint
arXiv:2010.12998, 2020.
REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in AISTATS, 2017.
[3] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," arXiv preprint arXiv:1908.07873, 2019.
[4] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
[5] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, p. 12, 2019.
[6] P. Kairouz, H. B. McMahan et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[7] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[8] S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar, "Expanding the reach of federated learning by reducing client resource requirements," arXiv preprint arXiv:1812.07210, 2018.
[9] P. Han, S. Wang, and K. K. Leung, "Adaptive gradient sparsification for efficient federated learning: An online learning approach," in IEEE ICDCS, 2020.
[10] Z. Xu, Z. Yang, J. Xiong, J. Yang, and X. Chen, "ELFISH: Resource-aware federated learning on heterogeneous edge devices," arXiv preprint arXiv:1912.01684, 2019.
[11] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," in ICLR, 2019.
[12] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[13] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[14] A. Morcos, H. Yu, M. Paganini, and Y. Tian, "One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers," in Advances in Neural Information Processing Systems, 2019, pp. 4933–4943.
[15] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," in International Conference on Learning Representations, 2017.
[16] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
[17] H. Tanaka, D. Kunin, D. L. Yamins, and S. Ganguli, "Pruning neural networks without any data by iteratively conserving synaptic flow," Advances in Neural Information Processing Systems, vol. 33, 2020.
[18] N. Lee, T. Ajanthan, and P. Torr, "SNIP: Single-shot network pruning based on connection sensitivity," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=B1VZqjAcYX
[19] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi, "Dynamic model pruning with feedback," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SJem8lSFwB
[20] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, pp. 1–18, 2017.
[21] S. Lym, E. Choukse, S. Zangeneh, W. Wen, S. Sanghavi, and M. Erez, "PruneTrain: Fast neural network training by dynamic sparse model reconfiguration," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–13.
[22] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
[23] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJxNAnVtDS
[24] J. Wang, S. Wang, R.-R. Chen, and M. Ji, "Local averaging helps: Hierarchical federated learning and convergence analysis," arXiv preprint arXiv:2010.12998, 2020.
[25] J. Wang and G. Joshi, "Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD," in Machine Learning and Systems (MLSys), 2019.
[26] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, June 2019.
[27] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for federated learning," arXiv preprint arXiv:1910.06378, 2019.
[28] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, "Error feedback fixes SignSGD and other gradient compression schemes," in International Conference on Machine Learning, vol. 97, Jun. 2019, pp. 3252–3261.
[29] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, "The convergence of sparsified gradient methods," in Advances in Neural Information Processing Systems, 2018, pp. 5973–5983.
[30] S. Shi, K. Zhao, Q. Wang, Z. Tang, and X. Chu, "A convergence analysis of distributed SGD with communication-efficient gradient sparsification," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019, pp. 3411–3417.
[31] P. Jiang and G. Agrawal, "A linear speedup analysis of distributed deep learning with sparse and quantized communication," in Advances in Neural Information Processing Systems 31, 2018, pp. 2525–2536.
[32] W. Du, X. Zeng, M. Yan, and M. Zhang, "Efficient federated learning via variational dropout," 2018.
[33] A. Li, J. Sun, B. Wang, L. Duan, S. Li, Y. Chen, and H. Li, "LotteryFL: Personalized and communication-efficient federated learning with lottery ticket hypothesis on non-IID datasets," arXiv preprint arXiv:2008.03371, 2020.
[34] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning, 2015, pp. 1737–1746.
[35] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[36] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[37] J. Lin, Y. Rao, J. Lu, and J. Zhou, "Runtime neural pruning," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 2178–2188.
[38] C. Hu, W. Bao, D. Wang, and F. Liu, "Dynamic adaptive DNN surgery for inference acceleration on the edge," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 2019, pp. 1423–1431.
[39] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, "Split learning for health: Distributed deep learning without sharing raw patient data," arXiv preprint arXiv:1812.00564, 2018.
[40] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for privacy-preserving machine learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1175–1191.
[41] T. Li, M. Sanjabi, A. Beirami, and V. Smith, "Fair resource allocation in federated learning," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=ByexElSYDr
[42] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al.,
“Towards federated learning at scale: System design,” arXiv preprint
arXiv:1902.01046, 2019.
[43] C. Wang, G. Zhang, and R. Grosse, “Picking winning tickets before
training by preserving gradient flow,” in International Conference on
Learning Representations, 2019.
[44] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning
convolutional neural networks for resource efficient inference,” arXiv
preprint arXiv:1611.06440, 2016.
[45] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communi-
cation efficient momentum sgd for distributed non-convex optimization,”
in International Conference on Machine Learning. PMLR, 2019, pp.
7184–7193.
[46] S. Caldas, P. Wu, T. Li, J. Konecný, H. B. McMahan, V. Smith, and
A. Talwalkar, “LEAF: A benchmark for federated settings,” CoRR, vol.
abs/1812.01097, 2018. [Online]. Available: http://arxiv.org/abs/1812.
01097
[47] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[48] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features
from tiny images,” Citeseer, Tech. Rep., 2009.
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE conference on
computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[51] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang,
Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 1314–1324.
[52] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.
[53] R. Tang, A. Adhikari, and J. Lin, “Flops as a direct optimiza-
tion objective for learning sparse neural networks,” arXiv preprint
arXiv:1811.03060, 2018.
APPENDIX
A. Extension to Non-linear T(M)
For the case where T(M) is non-linear, but a general monotone and positive set function instead, we can still find a locally optimal solution to (7) using Algorithm A.1. The complexity of Algorithm A.1 is O(|P|²).
Algorithm A.1: Solving (7), general T(·)
1: A ← ∅
2: j* ← None
3: repeat
4:   if j* is not None then
5:     A ← A ∪ {j*}
6:   j* ← arg max_{j ∈ P\A} g_j² / t_j(A ∪ P)
7: until g_{j*}² / t_{j*}(A ∪ P) < Γ(A ∪ P)
8: return A   // final result
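The following is a minimal Python transcription of Algorithm A.1 (our own illustration, not the authors' implementation). Here `prunable` plays the role of P, `g_sq[j]` of g_j², and the callables `t_j(j, M)` and `Gamma(M)` of t_j(·) and Γ(·) = Δ(·)/T(·); all four are assumptions supplied by the caller, and the guard for an exhausted candidate set is ours.

def algorithm_a1(prunable, g_sq, t_j, Gamma):
    """Greedy growth of A; stop once the best remaining ratio g_j^2 / t_j(A u P)
    falls below Gamma(A u P)."""
    A = set()
    j_star = None
    while True:
        if j_star is not None:
            A.add(j_star)
        M = A | prunable
        candidates = prunable - A
        if not candidates:
            break  # nothing left to add
        # component with the largest importance-per-unit-time ratio
        j_star = max(candidates, key=lambda j: g_sq[j] / t_j(j, M))
        if g_sq[j_star] / t_j(j_star, M) < Gamma(M):
            break  # adding it would not improve Gamma
    return A

Each outer iteration evaluates at most |P| ratios, so the sketch performs O(|P|²) set-function evaluations, matching the complexity stated above.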

Theorem 3. For general T(M), we have Γ(A ∪ P) ≥ Γ(A′ ∪ P), where A is given by Algorithm A.1 and A′ = A ∪ {j} for any j ∈ P \ A.
Theorem 3 shows that for general T(M), adding another component to A cannot improve the solution to (7).
Remark. Theorem 3 gives a weaker result for general T(·) compared to the global optimality result in Theorem 1 for linear T(·), because when A and A′ differ by more than one element, it is not straightforward to express the change in cost for general T(·). Furthermore, there may exist multiple locally optimal solutions for general T(·).
B. Proofs
1) Proof of Theorems 1 and 3: Recall that Γ(M) := Δ(M)/T(M), where Δ(M) and T(M) are both monotone and positive functions, i.e., for any M ⊆ M′, we have 0 ≤ Δ(M) ≤ Δ(M′) and 0 ≤ T(M) ≤ T(M′).
Lemma 1. For any M and M′, let δ_Δ(M, M′) := Δ(M′) − Δ(M) and δ_T(M, M′) := T(M′) − T(M). We have Γ(M′) ≤ Γ(M) if and only if δ_Δ(M, M′) ≤ Γ(M) · δ_T(M, M′).
Proof.
  Γ(M′) := Δ(M′)/T(M′) ≤ Γ(M)
  ⟺ Δ(M′) ≤ Γ(M) · T(M′)
  ⟺ Δ(M) + δ_Δ(M, M′) ≤ Γ(M) · T(M) + Γ(M) · δ_T(M, M′)   (by definition of δ_Δ(·,·) and δ_T(·,·))
  ⟺ Δ(M) + δ_Δ(M, M′) ≤ Δ(M) + Γ(M) · δ_T(M, M′)   (by definition of Γ(M))
  ⟺ δ_Δ(M, M′) ≤ Γ(M) · δ_T(M, M′) . ∎
We are now ready to prove Theorems 1 and 3.
Proof of Theorem 1. By definition, we have Δ(M) := Σ_{j∈M} g_j² and T(M) := c + Σ_{j∈M} t_j for any M.
In the following, we let M := A ∪ P and M′ := A′ ∪ P. We have
  δ_Δ(M, M′) = Δ(M′) − Δ(M) = Σ_{j∈M′\M} g_j² − Σ_{j∈M\M′} g_j² ,   (B.1)
  δ_T(M, M′) = T(M′) − T(M) = Σ_{j∈M′\M} t_j − Σ_{j∈M\M′} t_j .   (B.2)
For A obtained from Algorithm 2, we can easily see that g_j²/t_j < Γ(M) for any j ∈ M′ \ M and g_{j′}²/t_{j′} ≥ Γ(M) for any j′ ∈ M \ M′. Hence,
  Σ_{j∈M′\M} g_j² < Γ(M) · Σ_{j∈M′\M} t_j ,   (B.3)
  Σ_{j∈M\M′} g_j² ≥ Γ(M) · Σ_{j∈M\M′} t_j .   (B.4)
Combining with (B.1) and (B.2), we have δ_Δ(M, M′) ≤ Γ(M) · δ_T(M, M′). Then, the result follows from Lemma 1. ∎
Proof of Theorem 3. Let M := A ∪ P and M′ := A′ ∪ P in this proof. As A′ := A ∪ {j} for some j ∉ A by definition in this theorem, we note that δ_Δ(M, M′) = g_j² and δ_T(M, M′) = t_j(M). For A obtained from Algorithm A.1, it is easy to see that g_j²/t_j(M) < Γ(M) for j ∉ A. Hence, δ_Δ(M, M′) < Γ(M) · δ_T(M, M′) and the result follows from Lemma 1. ∎
2) Proof of Theorem 2: The analysis in this section is an extension of Theorem 1 in [24] (proof given in Section A.1). Note that Assumptions 1(a)–(e) still hold if we apply the same mask to all gradients or function values in the LHS, since applying masks is equivalent to replacing a subset of the entries of the parameter vector with zeros. For the same reason, Assumptions 1(f) and 1(g) also hold if the gradients are masked. For convenience, we define three shorthand notations, in addition to the notations in Table I, for pruned values: g′_n(w) := g_n(w) ⊙ m(k), ∇′F_n(w) := ∇F_n(w) ⊙ m(k), and ∇′F(w) := ∇F(w) ⊙ m(k), where ⊙ denotes the element-wise product.
We first present a special form of Jensen's inequality, which has the original form
  φ( Σ_i a_i x_i / Σ_i a_i ) ≤ Σ_i a_i φ(x_i) / Σ_i a_i ,
where φ(·) is a convex function and the a_i's are positive weights.
Lemma 2. Assume φ(·) is a convex function and the p_n's are positive weights that sum to 1. We have
  ‖ Σ_n p_n x_n ‖² ≤ Σ_n p_n ‖x_n‖² .
Proof. Writing x_{n,i} for the i-th component of x_n,
  ‖ Σ_n p_n x_n ‖² = Σ_i ( Σ_n p_n x_{n,i} )² ≤ Σ_i Σ_n p_n x_{n,i}² = Σ_n p_n Σ_i x_{n,i}² = Σ_n p_n ‖x_n‖² ,
where the inequality applies Jensen's inequality with φ(z) = z² to each coordinate. ∎
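As a quick numerical illustration of Lemma 2 (our own sanity check, not part of the proof), the following NumPy snippet draws random weights and vectors and verifies the inequality:

import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8
p = rng.random(N); p /= p.sum()          # positive weights summing to 1
x = rng.normal(size=(N, d))              # vectors x_1, ..., x_N

lhs = np.linalg.norm((p[:, None] * x).sum(axis=0)) ** 2   # || sum_n p_n x_n ||^2
rhs = (p * np.linalg.norm(x, axis=1) ** 2).sum()          # sum_n p_n ||x_n||^2
assert lhs <= rhs + 1e-12
print(lhs, rhs)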
The local updating rule is
  w′_n(k + 1) = w_n(k) ⊙ m(k) − η g_n( w_n(k) ⊙ m(k) ) ⊙ m(k) = w′_n(k) − η g′_n(w′_n(k)) .
Consequently, the updating rule for the averaged parameters is
  w(k + 1) = Σ_{n=1}^{N} p_n w′_n(k + 1) = w′(k) − η Σ_{n=1}^{N} p_n g′_n(w′_n(k)) .
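The masked update rules above can be summarized in a short NumPy sketch (our own illustration; the toy `local_gradient` below is an assumption standing in for g_n):

import numpy as np

def local_step(w_n, mask, eta, local_gradient):
    """w'_n(k+1) = w_n(k) * m(k) - eta * g_n(w_n(k) * m(k)) * m(k)."""
    w_masked = w_n * mask
    return w_masked - eta * local_gradient(w_masked) * mask

def aggregate(w_list, p):
    """w(k+1) = sum_n p_n w'_n(k+1)."""
    return sum(p_n * w_n for p_n, w_n in zip(p, w_list))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, N, eta = 10, 3, 0.1
    mask = (rng.random(d) < 0.5).astype(float)   # m(k): kept entries
    grad = lambda w: 2 * w                       # toy gradient of ||w||^2
    clients = [rng.normal(size=d) for _ in range(N)]
    p = np.full(N, 1.0 / N)                      # equal client weights
    updated = [local_step(w, mask, eta, grad) for w in clients]
    print(aggregate(updated, p))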
Note that w(k) and w′(k) are observable only in iterations where the clients send their local parameters to the server for aggregation (see Algorithm 1), and g′_n(w′_n(k)) is dependent on m(k). We know that
  E[F(w(k+1)) | {w_n(k)}, m(k)]
  = E[ F( w′(k) − η Σ_{n=1}^{N} p_n g′_n(w′_n(k)) ) | {w_n(k)}, m(k) ]
  ≤(a) F(w′(k)) − η E[ ⟨∇F(w′(k)), Σ_{n=1}^{N} p_n g′_n(w′_n(k))⟩ | {w_n(k)}, m(k) ] + (η²β/2) E[ ‖ Σ_{n=1}^{N} p_n g′_n(w′_n(k)) ‖² | {w_n(k)}, m(k) ]
  ≤(b) F(w(k)) + L‖w(k) − w′(k)‖ − η E[ ⟨∇F(w′(k)), Σ_{n=1}^{N} p_n g′_n(w′_n(k))⟩ | {w_n(k)}, m(k) ] + (η²β/2) E[ ‖ Σ_{n=1}^{N} p_n g′_n(w′_n(k)) ‖² | {w_n(k)}, m(k) ]
  = F(w(k)) + L‖w(k) − w′(k)‖ − η E[ ⟨∇′F(w′(k)), Σ_{n=1}^{N} p_n g′_n(w′_n(k))⟩ | {w_n(k)}, m(k) ] + (η²β/2) E[ ‖ Σ_{n=1}^{N} p_n g′_n(w′_n(k)) ‖² | {w_n(k)}, m(k) ] ,   (B.5)
where (a) is due to Assumption 1(a) (β-smoothness), (b) is due to Assumption 1(b) (L-Lipschitzness), and (B.5) is because
  ⟨∇F(w′(k)), Σ_n p_n g′_n(w′_n(k))⟩ = ⟨∇F(w′(k)), Σ_n p_n g_n(w′_n(k)) ⊙ m(k)⟩ = ⟨∇F(w′(k)) ⊙ m(k), Σ_n p_n g_n(w′_n(k)) ⊙ m(k)⟩ = ⟨∇′F(w′(k)), Σ_n p_n g′_n(w′_n(k))⟩ .
The third term in (B.5) can be rewritten as
  −η E[ ⟨∇′F(w′(k)), Σ_{n=1}^{N} p_n g′_n(w′_n(k))⟩ | {w_n(k)}, m(k) ]
  =(a) −η ⟨∇′F(w′(k)), Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k))⟩
  = (η/2) ( ‖ ∇′F(w′(k)) − Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖² − ‖ ∇′F(w′(k)) ‖² − ‖ Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖² ) ,   (B.6)
where (a) uses Assumption 1(c) (unbiasedness). Our fourth term in (B.5) is bounded by
  E[ ‖ Σ_{n=1}^{N} p_n g′_n(w′_n(k)) ‖² | {w_n(k)}, m(k) ]
  =(a) E[ ‖ Σ_{n=1}^{N} p_n ( g′_n(w′_n(k)) − ∇′F_n(w′_n(k)) ) ‖² | {w_n(k)}, m(k) ] + ‖ Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖²
  =(b) Σ_{n=1}^{N} p_n² E[ ‖ g′_n(w′_n(k)) − ∇′F_n(w′_n(k)) ‖² | {w_n(k)}, m(k) ] + ‖ Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖²
  ≤(c) Σ_{n=1}^{N} p_n² σ² + ‖ Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖² ,   (B.7)
where (a) uses the definition of variance, i.e., E[‖x‖²] = E[‖x − E[x]‖²] + ‖E[x]‖²; (b) uses Assumption 1(g) (client independence), expands the first term and removes the zero-valued cross-product terms; (c) uses Assumption 1(d) (bounded variance). Substituting (B.6) and (B.7) into (B.5), (B.5) becomes
  E[F(w(k+1)) | {w_n(k)}, m(k)]
  ≤ F(w(k)) + L‖w(k) − w′(k)‖ + (η²β/2) Σ_{n=1}^{N} p_n² σ² − (η/2) ‖∇′F(w′(k))‖² + (η/2) ‖ ∇′F(w′(k)) − Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖² − (η/2 − η²β/2) ‖ Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖² .   (B.8)
Assuming η ≤ 1/β, we have η/2 − η²β/2 ≥ 0, and the last term in (B.8) can be removed. Then we have the following:
  E[F(w(k+1)) | {w_n(k)}, m(k)]
  ≤ F(w(k)) + L‖w(k) − w′(k)‖ + (η²β/2) Σ_{n=1}^{N} p_n² σ² − (η/2) ‖∇′F(w′(k))‖² + (η/2) ‖ ∇′F(w′(k)) − Σ_{n=1}^{N} p_n ∇′F_n(w′_n(k)) ‖²
  ≤(a) F(w(k)) + L‖w(k) − w′(k)‖ + (η²β/2) Σ_{n=1}^{N} p_n² σ² − (η/2) ‖∇′F(w′(k))‖² + (η/2) Σ_{n=1}^{N} p_n ‖ ∇′F_n(w′(k)) − ∇′F_n(w′_n(k)) ‖²
  ≤(b) F(w(k)) + L‖w(k) − w′(k)‖ + (η²β/2) Σ_{n=1}^{N} p_n² σ² − (η/2) ‖∇′F(w′(k))‖² + (ηβ²/2) Σ_{n=1}^{N} p_n ‖ w′(k) − w′_n(k) ‖² ,   (B.9)
where (a) uses Jensen's inequality, and (b) uses Assumption 1(a) (smoothness). Taking expectation on both sides of (B.9), we get
  E[F(w(k+1))] ≤ E[F(w(k))] + L E‖w(k) − w′(k)‖ + (η²β/2) Σ_{n=1}^{N} p_n² σ² − (η/2) E‖∇′F(w′(k))‖² + (ηβ²/2) Σ_{n=1}^{N} p_n E‖ w′(k) − w′_n(k) ‖² .   (B.10)
Taking the average over time on (B.10) and rearranging, we get
  (1/K) Σ_{k=0}^{K−1} E‖∇′F(w′(k))‖²
  ≤ (2/(ηK)) [F(w(0)) − F*] + (2L/(ηK)) Σ_{k=0}^{K−1} E‖w(k) − w′(k)‖ + ηβ Σ_{n=1}^{N} p_n² σ² + (β²/K) Σ_{k=0}^{K−1} Σ_{n=1}^{N} p_n E‖ w′(k) − w′_n(k) ‖² .   (B.11)
Now we bound the last term of (B.11).

N
X 2
pn E w0 (k) − wn0 (k)
n=1
N
X  N
X    2
= pn E w0 (k − 1) − η pi gi0 (wi0 (k − 1)) − wn0 (k − 1) − ηgn0 (wn0 (k − 1))
n=1 i=1
N k−1  N  2
(a) X X X
= η2 pn E gn0 (wn0 (τ )) − pi gi0 (wi0 (τ ))
n=1 τ =I·bk/Ic i=1
N
X k−1
X  N
X N
X 
= η2 pn E gn0 (wn0 (τ )) − ∇0 Fn (wn0 (τ )) + pi ∇0 Fi (wi0 (τ )) − pi gi0 (wi0 (τ ))
n=1 τ =I·bk/Ic i=1 i=1
N 2
 X 
0 0 0 0
+ ∇ Fn (wn (τ )) − pi ∇ Fi (wi (τ ))
i=1
N k−1 N N 2
X X  X X 
≤ 2η 2
pn E gn0 (wn0 (τ )) −∇ 0
Fn (wn0 (τ )) + pi ∇ 0
Fi (wi0 (τ )) − pi gi0 (wi0 (τ )) (B.12)
n=1 τ =I·bk/Ic i=1 i=1
N k−1 N 2
X X  X 
0
+ 2η 2
pn E ∇ Fn (wn0 (τ )) − pi ∇ 0
Fn (wi0 (τ )) , (B.13)
n=1 τ =I·bk/Ic i=1

where in (a), we trace back to the nearest iteration where all local parameters are synchronized. For (B.12),

N k−1 N N 2
X X  X X 
2η 2
pn E gn0 (wn0 (τ )) −∇ 0
Fn (wn0 (τ )) + pi ∇ 0
Fi (wi0 (τ )) − pi gi0 (wi0 (τ ))
n=1 τ =I·bk/Ic i=1 i=1
N k−1 2 N k−1 N 2
(a) X X   X X X  
= 2η 2 pn E gn0 (wn0 (τ )) − ∇0 Fn (wn0 (τ )) − 2η 2 pn E pi gi0 (wi0 (τ )) − ∇0 Fi (wi0 (τ ))
n=1 τ =I·bk/Ic n=1 τ =I·bk/Ic i=1
N k−1 k−1 N 2
(b) X X 2 X X  
= 2η 2
pn E gn0 (wn0 (τ )) −∇ 0
Fn (wn0 (τ )) − 2η 2
E pi gi0 (wi0 (τ )) −∇ 0
Fi (wi0 (τ ))
n=1 τ =I·bk/Ic τ =I·bk/Ic i=1
k−1 N 2 k−1 N   2
(c) X X X X
= 2η 2 pn E gn0 (wn0 (τ )) − ∇0 Fn (wn0 (τ )) − 2η 2 E pn gn0 (wn0 (τ )) − ∇0 Fn (wn0 (τ ))
τ =I·bk/Ic n=1 τ =I·bk/Ic n=1
k−1
X N
X 2
≤ 2η 2 (pn − p2n )E gn0 (wn0 (τ )) − ∇0 Fn (wn0 (τ ))
τ =I·bk/Ic n=1
 N
X 
≤2 1− p2n Iη 2 σ 2 , (B.14)
n=1
P
where (a) is due to the definition of variance; (b) is due to Assumption 1(f) (time independence) and n pn = 1; and (c) is
due to Assumption 1(g) (client independence). For (B.13),

N k−1 N 2
X X  X 
0
2η 2
pn E ∇ Fn (wn0 (τ )) − pi ∇ 0
Fi (wi0 (τ ))
n=1 τ =I·bk/Ic i=1
N
X k−1
X    N
X 
= 2η 2 pn E ∇0 Fn (wn0 (τ )) − ∇0 Fn (w0 (τ )) + ∇0 Fn (w0 (τ )) − pi ∇0 Fi (w0 (τ ))
n=1 τ =I·bk/Ic i=1
N N 2
X X 
0 0 0 0
+ pi ∇ Fi (w (τ )) − pi ∇ Fi (wi (τ ))
i=1 i=1
N
X k−1
X   2 k−1
X N
X   2
≤ 6η 2 pn E ∇0 Fn (wn0 (τ )) − ∇0 Fn (w0 (τ )) + pn E pi ∇0 Fi (w0 (τ )) − ∇0 Fi (wi0 (τ ))
n=1 τ =I·bk/Ic τ =I·bk/Ic i=1
k−1 N
!
X  X  2
+ pn E ∇0 Fn (w0 (τ )) − pi ∇0 Fi (w0 (τ ))
τ =I·bk/Ic i=1
k−1 N N
!
(a) X X 2 X   2
0
≤ 6η I 2
pn E ∇ Fn (wn0 (τ )) 0
− ∇ Fn (w (τ )) 0
+E 0 0
pi ∇ Fi (w (τ )) − ∇ 0
Fi (wi0 (τ ))
τ =I·bk/Ic n=1 i=1
N
X k−1
X N
X 2
+ 6η 2 IE pn ∇0 Fn (w0 (τ )) − pi ∇0 Fi (w0 (τ ))
n=1 τ =I·bk/Ic i=1

(b) k−1
X N
X  2 N
X   2
0
≤ 6η I 2
pn E ∇ Fn (wn0 (τ )) 0
− ∇ Fn (w (τ )) 0
+E 0 0
pi ∇ Fi (w (τ )) − ∇ 0
Fi (wi0 (τ )) + 6I 2 η 2 2
τ =I·bk/Ic n=1 i=1

(c) k−1
X 2
≤ 12η 2 I E ∇0 Fn (wn0 (τ )) − ∇0 Fn (w0 (τ )) + 6I 2 η 2 2
τ =I·bk/Ic
N
X k−1
X 2
≤ 12Iη 2 β 2 E wn0 (τ ) − w0 (τ ) + 6I 2 η 2 2 , (B.15)
n=1 τ =I·bk/Ic

where in (a), we use the fact that, ∀xτ ,


k−1
X 2   k−1
X k−1
X
xτ ≤ (k − 1) − I · bk/Ic · kxτ k2 ≤ I · kxτ k2 .
τ =I·bk/Ic τ =I·bk/Ic τ =I·bk/Ic

In (b), we use Assumption 1(e) (bounded divergence), and in (c), we use Jensen’s inequality. Substituting (B.14) and (B.15)
into (B.12) and (B.13), respectively, we get
K−1 N
1 XX 2
pn E w0 (k) − wn0 (k)
K n=1
k=0
N N K−1 k−1
 X  12Iη 2 β 2 X X X 2
≤2 1− p2n Iη 2 σ 2 + 6η 2 I 2 2 + E wn0 (τ ) − w0 (τ )
n=1
K n=1 k=0 τ =I·bk/Ic
N N K−1 min{I·dk/Ie,K−1}
(a) X  12Iη 2 β 2 X X X 2
≤ 2 1− 2 2 2 2 2 2
pn η Iσ + 6η I  + E wn0 (τ ) − w0 (τ )
n=1
K n=1 k=0 τ =I·bk/Ic
N K−1 N
(b) X  12I 2 η 2 β 2 X X 2
≤ 2 1− p2n η 2 Iσ 2 + 6η 2 I 2 2 + E wn0 (τ ) − w0 (τ ) , (B.16)
n=1
K n=1 k=0

where (a) holds because k − 1 ≤ min{I · dk/Ie, K − 1} is always true, and (b) holds because in the innermost summation in
its last term, we always have
min{I · dk/Ie, K − 1} − I · bk/Ic ≤ I .

Rearranging yields
  (1/K) Σ_{k=0}^{K−1} Σ_{n=1}^{N} p_n E‖ w′(k) − w′_n(k) ‖² ≤ [ 2(1 − Σ_{n=1}^{N} p_n²) η² I σ² + 6 η² I² ε² ] / ( 1 − 12 I² η² β² ) .   (B.17)
Applying (B.17) to (B.11)'s last term, (B.11) becomes
  (1/K) Σ_{k=0}^{K−1} E‖∇′F(w′(k))‖² ≤ (2/(ηK)) [F(w(0)) − F*] + ηβ Σ_{n=1}^{N} p_n² σ² + η² β² [ 2(1 − Σ_{n=1}^{N} p_n²) I σ² + 6 I² ε² ] / ( 1 − 12 I² η² β² ) + (2L/(ηK)) Σ_{k=0}^{K−1} E‖w(k) − w′(k)‖ .   (B.18)
Assume η ≤ 1/(2√6 Iβ); then 1 − 12I²η²β² ≥ 1/2, and
  (1/K) Σ_{k=0}^{K−1} E‖∇′F(w′(k))‖² ≤ 2(F_0 − F*)/(ηK) + αηβσ² + 4β² ( (1 − α) I σ² + 3 I² ε² ) η² + (2L/(ηK)) Σ_{k=0}^{K−1} E‖w(k) − w′(k)‖ ,   (B.19)
where α := Σ_{n=1}^{N} p_n², F_0 := F(w(0)), and F* := min_w F(w). Note that under the assumption that η ≤ 1/(2√6 Iβ), the previous assumption that η ≤ 1/β used for (B.8) is automatically satisfied.
Discussion. Consider the situation where all clients have equal weight, i.e., p_n = 1/N for all n; then α = 1/N. Letting η = 1/√(αK) = √(N/K), (B.19) becomes
  (1/K) Σ_{k=0}^{K−1} E‖∇′F(w′(k))‖² ≤ ( 2(F_0 − F*) + βσ² ) / √(NK) + 4β² ( (N − 1) I σ² + 3 N I² ε² ) / K + (2L/√(NK)) Σ_{k=0}^{K−1} E‖w(k) − w′(k)‖ .   (B.20)
For the last term in (B.20), when k is not a reconfiguration iteration, w(k) = w′(k); when k is a reconfiguration iteration, the difference between w(k) and w′(k) is the subset of parameters that get pruned from w(k). Assuming the norm of w(k) is bounded by B, i.e., ‖w(k)‖ ≤ B for all k, the initial fraction of non-zero prunable parameters is r_0, and this fraction halves every h iterations (as we do in our experiments, see Table II), then the last term in (B.20) is bounded by
  (2L/√(NK)) Σ_{k=0}^{K−1} E‖w(k) − w′(k)‖ ≤(a) (2L/√(NK)) Σ_{k=0}^{K−1} B r_0 2^{−k/h} ≤ (2LBr_0/√(NK)) Σ_{k=0}^{∞} 2^{−k/h} ≤ ( 2^{1/h + 1} / ( 2^{1/h} − 1 ) ) · LBr_0 / √(NK) .   (B.21)
In (B.21), (a) holds for the following reason: the reconfiguration on the parameter vector w(k) includes both adding back parameters and removing parameters, resulting in a new parameter vector w′(k). However, parameters that are added back in w′(k) in reconfigurations are assigned zero values², which are equal to their corresponding values in w(k). Thus, the difference between w(k) and w′(k) is the part that is removed from w(k), whose maximum fraction is bounded by r_0 · 2^{−k/h}. In consequence, ‖w(k) − w′(k)‖ is bounded by B · r_0 · 2^{−k/h}. Plugging (B.21) into (B.20), we get
  (1/K) Σ_{k=0}^{K−1} E‖∇′F(w′(k))‖² ≤ ( 2(F_0 − F*) + βσ² + ( 2^{1/h + 1} / ( 2^{1/h} − 1 ) ) LBr_0 ) · 1/√(NK) + 4β² ( (N − 1) I σ² + 3 N I² ε² ) / K = O( 1/√(NK) ) + O( 1/K ) .   (B.22)
Thus, when the additional assumptions (i) K ≥ 24NI²β², (ii) η = √(N/K), (iii) p_n = 1/N for all n, (iv) ‖w(k)‖ ≤ B for all k, and (v) the fraction of non-zero prunable parameters decreases exponentially all hold, we have a convergence bound provided in (B.22) that is dominated by O(1/√(NK)). This means that using more clients can accelerate the convergence (by a factor of 1/√N).
TABLE C.1
MODEL ARCHITECTURES.

Architecture: Conv-2 | VGG-11 | ResNet-18 | MobileNetV3-Small
Convolutional: 32, pool, 64, pool | 64, pool, 128, pool, 2 × 256, pool, 2 × 512, pool, 2 × 512, pool | 64, pool, 2 × [64, 64], 2 × [128, 128], 2 × [256, 256], 2 × [512, 512] | 16, 16, 8, 16, 16, 72, 72, 24, 88, 88, 24, 96, 96, 24, 96, 40, 240, 240, 64, 240, 40, 240, 240, 64, 240, 40, 120, 120, 32, 120, 48, 144, 144, 40, 144, 48, 288, 288, 72, 288, 96, 576, 576, 144, 576, 96, 576, 576, 144, 576, 96, 576
Fully-connected: 2048, 62 (input: 3136) | 512, 512, 10 (input: 512) | avgpool, 100 (input: 512) | avgpool, 1280, dropout (0.2), 2 (input: 960)
Conv/FC/all params: 52.1K/6.6M/6.6M | 9.2M/530.4K/9.8M | 11.2M/102.6K/11.3M | 927.0K/592.9K/1.5M
C. PruneFL details
1) Model architecture details: The details of the model architectures are listed in Table C.1.
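As an illustration, the Conv-2 column of Table C.1 can be rendered in PyTorch roughly as follows. This is our own sketch, not the authors' code: the 5×5 kernel size with padding 2 is inferred from the 52.1K convolutional-parameter count for 1×28×28 FEMNIST inputs and is an assumption.

import torch
import torch.nn as nn

class Conv2(nn.Module):
    def __init__(self, num_classes: int = 62):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # 64 * 7 * 7 = 3136
            nn.Linear(3136, 2048), nn.ReLU(),
            nn.Linear(2048, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = Conv2()
    print(sum(p.numel() for p in model.parameters()))  # roughly 6.6M in total
    print(model(torch.zeros(1, 1, 28, 28)).shape)      # torch.Size([1, 62])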
2) Gradient Computation: The forward pass in neural networks with sparse matrices is straightforward. Taking an FC layer as an example (convolutional layers are more complex but similar in principle), the input data is multiplied by a sparse weight and produces a dense output to be passed to the next layer. Let u ∈ R^(n_in × n_out) be the layer's (sparse) weight, x ∈ R^(N × n_in) be the (dense) input, and y ∈ R^(N × n_out) be the (dense) output, where n_in, n_out, and N are the number of input neurons, the number of output neurons of the FC layer, and the mini-batch size for SGD, respectively. Assuming there is no bias, the forward pass is given by y = x · u. The backward pass is slightly more complex. By simple calculations, the gradient of u is given by g_u = xᵀ g_y, and the gradient of x (when required) is given by g_x = g_y uᵀ. Here, g_y is the (dense) gradient in backpropagation fed by the next layer.
For the computation of g_x, since the weight u is sparse, we can accelerate the computation using our sparse matrix implementation.
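The forward/backward formulas above can be checked with a small NumPy sketch (our own illustration; sparsity is emulated with a binary mask rather than a true sparse-matrix format):

import numpy as np

rng = np.random.default_rng(0)
N, n_in, n_out = 4, 6, 3
x = rng.normal(size=(N, n_in))                    # dense input
mask = (rng.random((n_in, n_out)) < 0.3)          # sparsity pattern of u
u = rng.normal(size=(n_in, n_out)) * mask         # sparse weight (no bias)

y = x @ u                                         # forward: dense output
g_y = rng.normal(size=(N, n_out))                 # gradient fed by the next layer

g_u = x.T @ g_y                                   # dense gradient of the weight
g_x = g_y @ u.T                                   # gradient of the input
g_u_sparse = g_u * mask                           # entries that are actually updated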
The computation of gu is however different. Note that both x and gy are dense, and thus current implementations (e.g.,
PyTorch) first compute the dense gradient with u’s dense form that has all zero values included, and then select values from
the dense gradient according to u’s sparse pattern. There is currently no better way to accelerate this process as far as we
know. Therefore, this implementation does not improve the backward pass’s speed of weights’ gradient computations. For the
above reason, in our implementation we collect the gradients of zero-valued components of u at the same time with no extra
overhead (although those zero-valued components themselves are not updated). This characteristic is useful in our adaptive
pruning procedure.
3) FLOPs Computation: Following the discussion in Section C2, we now explain the computation of FLOPs in both the forward and backward passes of our models. Using the convention in the literature, we consider that one addition and one multiplication each counts as a FLOP [53]. With the same notations and assumptions as in Section C2, the FLOPs for the forward pass is 2 N n_in n_out · d, where d is the density of this FC layer. In the backward pass, the FLOPs for the gradient computation of the weight u is 2 N n_in n_out, since this computation does not involve sparse matrices, while for the gradient of the input x, the FLOPs is 2 N n_in n_out · d. Therefore, the total FLOPs for the backward pass is 2 N n_in n_out · (1 + d), and the FLOPs for both forward and backward passes is 2 N n_in n_out · (1 + 2d). The FLOPs computation for convolutional layers is similar, so we skip this discussion.
4) Starting Reconfiguration Round: We start the reconfiguration in the initial pruning stage when the training accuracy on
local sample data of the selected client exceeds 1.5 times the random guess accuracy. There are two advantages: (i) if the task
is easy, reconfiguration starts early, which results in early model size decrease that saves training time; and (ii) it also avoids
pruning the model too early when the prediction is still close to random guess, meaning the parameter values are still in the
random initialization stage. In the further pruning stage, reconfiguration happens periodically with a fixed interval.
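The starting rule for the initial pruning stage can be written as a one-line check (illustrative only; treating the random-guess accuracy as 1/number-of-classes assumes roughly balanced classes):

def should_start_reconfiguration(train_accuracy: float, num_classes: int) -> bool:
    random_guess = 1.0 / num_classes
    return train_accuracy > 1.5 * random_guess

# Example: for FEMNIST (62 classes), reconfiguration starts once the selected
# client's training accuracy exceeds 1.5/62, i.e., roughly 2.4%.
assert should_start_reconfiguration(0.05, 62)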
5) Further Details of Adaptive Pruning: In the following, we provide detailed information for the adaptive pruning procedure
described in Section IV-B. In both reconfiguration and non-reconfiguration rounds, the importance measure is summed locally
after every local update, until the sum is sent to the server in the next reconfiguration round. Note that in non-reconfiguration
rounds, only the remaining model parameters are used in computation and exchanged between server and clients. Since the
parameter set is fixed in non-reconfiguration rounds, only the values of the parameters need to be exchanged between the
server and clients, which incurs no extra communication cost. An illustration of the two types of rounds is shown in Fig. C.1.
6) PyTorch on Raspberry Pi devices: To install PyTorch on Raspberry Pi devices, we follow the instructions described at https://bit.ly/3e6I7tG, where acceleration packages such as MKL-DNN and NNPACK are disabled due to possible compatibility issues and their lack of support for sparse computation. We compare our implementation with the plain implementation by PyTorch without accelerations. We expect that similar results can be obtained if acceleration packages could support sparse computation. This is an active area of research on its own, where methods for efficient sparse computation on both CPU and GPU have been developed in recent years. Integrating such methods into our experiments is left for future work.

2 In practice, we use a small perturbation instead of zero values for parameters that are added back, but we use zero values in our analysis for ease of
exposition.
(a) Non-reconfiguration round (b) Reconfiguration round

Fig. C.1. Illustration of adaptive pruning as part of further pruning during FL.

(a) Conv-2’s first fully-connected layer (b) VGG-11’s last convolutional layer

Fig. D.1. Linearity of average computation time vs. number of parameters.

D. Additional Experimental Results


1) Validation of Assumptions: Agreeing with our assumptions and analysis in Sections IV-B and V-B, we observe that the training time of each layer is generally independent of the other layers. Within each layer, the time is approximately linear in the number of parameters of the layer under our sparse implementation. In Fig. D.1, we fix the parameters in the other layers and increase the number of parameters in Conv-2's largest (first) FC layer and VGG-11's largest convolutional layer (the last convolutional layer, with 512 channels), respectively, repeating each measurement 50 times. The R² values of the linear regression for Conv-2 and VGG-11 are 0.997 and 0.994, respectively.
2) Client Selection Results for Section VI: We present results under the same settings as in Section VI, but with random client selection. We partition the IID CIFAR-10 and ImageNet-100 datasets uniformly into 100 equal-sized, non-overlapping clients. FEMNIST and CelebA are intrinsically non-IID datasets. For the FEMNIST dataset, we use its original 193-user partition; for the CelebA dataset, since some users in the original partition have very few images (e.g., 4 images), we merge the 9,343 persons' images into 934 clients (the first 933 clients each hold 10 persons' images and the last client holds 13 persons' images). In each round, we sample 10 clients randomly from the aforementioned partitions when using client selection. Fig. D.2 (corresponding to Fig. 5) shows the training time reduction; Fig. D.3 (corresponding to Fig. 7) shows the lottery ticket result; and Fig. D.4 (corresponding to Fig. 9) shows the model size adaptation. We observe similar behaviors as with full client participation described in the main paper, so we omit further discussion in this section.
3) Convergence Accuracy Results for All Experiments: The convergence accuracies with/without client selection are shown
in Table D.1. The results are taken from the test accuracies in the last five evaluations of each simulation. We see that the
convergence accuracy of PruneFL is similar to that of conventional FL. Conventional FL sometimes shows a slight advantage
because all methods run for the same number of rounds, and full-sized models in conventional FL learn faster when the
accuracy is measured in rounds instead of time.
(a) Conv-2 on FEMNIST (b) VGG-11 on CIFAR-10 (c) ResNet-18 on ImageNet-100 (d) MobileNetV3-Small on CelebA

Fig. D.2. Test accuracy vs. time results of four datasets (client selection).

(a) Conv-2 on FEMNIST (b) VGG-11 on CIFAR-10 (c) ResNet-18 on ImageNet-100 (d) MobileNetV3-Small on CelebA

Fig. D.3. Lottery ticket results of four datasets (client selection).

(a) Conv-2 on FEMNIST (b) VGG-11 on CIFAR-10 (c) ResNet-18 on ImageNet-100 (d) MobileNetV3-Small on CelebA

Fig. D.4. Number of parameters vs. round for four datasets (client selection).

TABLE D.1
AVERAGE OF THE LAST 5 MEASURED ACCURACIES (%). C.S. STANDS FOR CLIENT SELECTION.

Method | FEMNIST (No C.S. / C.S.) | CIFAR-10 (No C.S. / C.S.) | ImageNet-100 (No C.S. / C.S.) | CelebA (No C.S. / C.S.)
Conventional FL | 85.33 ± 0.17 / 84.61 ± 0.62 | 86.30 ± 0.20 / 86.51 ± 0.12 | 76.36 ± 0.56 / 76.62 ± 0.34 | 91.41 ± 0.15 / 91.33 ± 0.17
PruneFL (ours) | 85.07 ± 0.31 / 83.90 ± 0.48 | 85.47 ± 0.19 / 85.50 ± 0.18 | 77.23 ± 0.25 / 76.07 ± 0.31 | 91.38 ± 0.14 / 91.2 ± 0.17
SNIP | 85.16 ± 0.22 / 84.05 ± 0.70 | 81.16 ± 0.22 / 81.44 ± 0.13 | 76.89 ± 0.27 / 76.76 ± 0.60 | 91.48 ± 0.07 / 89.69 ± 0.75
SynFlow | 84.77 ± 0.17 / 84.19 ± 0.21 | 82.41 ± 0.42 / 82.36 ± 0.28 | 76.66 ± 0.28 / 76.60 ± 0.24 | 91.24 ± 0.23 / 90.23 ± 0.11
Online learning | 85.31 ± 0.37 / 84.30 ± 0.27 | 10.00 ± 0.00 / 10.00 ± 0.00 | 75.32 ± 0.18 / 74.97 ± 0.15 | 90.39 ± 0.13 / 88.61 ± 0.53
Iterative pruning | 84.87 ± 0.22 / 84.08 ± 0.39 | 84.45 ± 0.14 / 84.64 ± 0.14 | 76.18 ± 0.30 / 76.96 ± 0.26 | 90.14 ± 0.37 / 88.65 ± 0.33