
Federated Learning with Quantization Constraints

Nir Shlezinger, Mingzhe Chen, Yonina C. Eldar, H. Vincent Poor, and Shuguang Cui
Abstract—Traditional deep learning models are trained on centralized random subset of the gradients can result in dominant distortion.
servers using labeled sample data collected from edge devices. This This motivates the design and analysis of quantization methods for
data often includes private information, which the users may not be
facilitating updated model transfer in FL, which minimize the error
willing to share. Federated learning (FL) is an emerging approach to
train such learning models without requiring the users to share their in the global model.
possibly private labeled data. In FL, each user trains its copy of the Here, we design quantizers for distributed deep network training
learning model locally. The server then collects the individual updates by utilizing quantization theory methods. We first discuss the re-
and aggregates them into a global model. A major challenge that arises quirements which arise in FL setups under quantization constraints.
in this method is the need of each user to efficiently transmit its learned
model over the throughput limited uplink channel. In this work, we We specifically identify the lack of a unified statistical model and
tackle this challenge using tools from quantization theory. In particular, the availability of a source of local randomness as characteristics
we identify the unique characteristics associated with conveying trained of such setups. Based on these properties, we propose a mapping
models over rate-constrained channels, and characterize a suitable scheme following concepts from universal quantization [17], which
quantization scheme for such setups. We show that combining universal
is based on solid information theoretic arguments while meeting the
vector quantization methods with FL yields a decentralized training
system, which is both efficient and feasible. We also derive theoretical aforementioned requirements.
performance guarantees of the system. Our numerical results illustrate We theoretically analyze the ability of the server to accurately
the substantial performance gains of our scheme over FL with previously recover the updated model via conventional FL [4], showing that
proposed quantization approaches. the error induced by the proposed quantization scheme is mitigated
Index terms— Federated learning, edge computing, quantization.
by conventional federated averaging. Specifically, our analysis shows
I. I NTRODUCTION that this error can be bounded by a term which decays exponentially
Machine learning methods have demonstrated unprecedented per-
with the number of users, regardless of the statistical model from
formance in a broad range of applications [1]. This is achieved by
which the data of each user is generated, rigorously proving that
training a deep network model based on a large number of labeled
the quantization distortion can be made arbitrarily small when a
training samples. Often, these samples are gathered on end-devices,
sufficient number of users contribute to the overall model. Then,
such as smartphones, while the deep model is maintained by a
we show that these theoretical gains translate into FL performance
computationally powerful centralized server [2]. Traditionally, the
gains in a numerical study, demonstrating that FL with the proposed
users send their labeled data to the server, who in turn uses this
quantization scheme yields more accurate global models compared
massive amount of samples to train the model. However, the data
to previously proposed quantization approaches for such setups.
often contains private information, which the users may prefer not to The rest of this paper is organized as follows: Section II presents
share. This gives rise to the need to adapt the network on the end- the system model and identifies the requirements of FL quantization.
devices., i.e., train a centralized model in a distributed fashion [3]. Section III details the proposed quantization system and theoretically
Federated learning (FL) proposed in [4], is a method to update such
anaylzes its performance. Numerical examples are presented in Sec-
decentralized models, and is the focus of growing attention in recent
tion IV. Section V concludes the paper.
years. Here, instead of requiring the users to share their possibly
Throughout the paper, we use boldface lower-case letters for
private labeled data, each user trains the network locally, and conveys
vectors, e.g., x; Matrices are denoted with boldface upper-case letters,
its trained model updates to the server. The server then aggregates
e.g., M ; calligraphic letters, such as X , are used for sets. The `2
these updates into a global network in an iterative fashion [4], [5].
norm and vectorization operator are denoted by k · k and vec(·),
This strategy was extended to multi-task learning in [6]. Methods for
respectively. Finally, R and Z are the sets of real numbers and
guaranteeing privacy in FL were studied in [7], and user scheduling
integers, respectively.
policies were analyzed in [8].
One of the major challenges of FL is the transfer of a large number II. S YSTEM M ODEL
of updated model parameters over the uplink communication channel In this section we detail the FL with quantization constraints setup.
from the users to the server, whose throughput is typically constrained To that aim, we first review the conventional FL setup in Subsection
[4], [9], [10]. Several approaches have been proposed to tackle this II-A. Then, in Subsection II-B, we formulate the problem and identify
issue: The work [11] proposed various methods for quantizing the the unique requirements of quantizers utilized in FL systems.
updates sent from the users to the server. These methods include
A. Federated Learning
random masks, subsampling, and probabilistic quantization, which
We consider the FL model detailed in [4], [11], which consists of
is also considered in [12]; Sparsifying masks for compressing the
iterative subsequent updates of a single deep network model. Here, we
gradients were proposed in [13]–[15], while [16] applied ternary
focus on a single iteration of the model update. A centralized server
quantization to the updates. However, these approaches are subop-
is training a model consisting of m1 × m2 hyperparameters based on
timal from a quantization theory perspective, as, e.g., discarding a
labeled samples available at a set of K remote users. To that aim,
This project has received funding from the Benoziyo Endowment Fund the server shares its current model, represented by the matrix W ∈
for the Advancement of Science, the Estate of Olga Klein – Astrachan, the Rm1 ×m2 , with the users. The kth user, k ∈ {1, . . . , K} , K, uses
European Unions Horizon 2020 research and innovation program under grant (k) (k) nk
its set of nk labeled training samples, denoted as {xi , y i }i=1 ,
No. 646804-ERC-COG-BNYQ, from the Israel Science Foundation under
grant No. 0100101, and from the U.S. National Science Foundation under to retrain the model W into an updated model W̃ k ∈ Rm1 ×m2 .
grants CCF-0939370 and CCF-1513915. N. Shlezinger and Y. C. Eldar are For example, W̃ k may be obtained by multiple stochastic gradient
with the Faculty of Math and CS, Weizmann Institute of Science, Rehovot, descent (SGD) steps applied to W , executed over the local data set.
Israel (e-mail: [email protected]; [email protected]). M. Chen Having updated the model weights, the kth user should convey its
and H. V. Poor are with the EE Dept., Princeton University, Princeton, NJ
(e-mail: {mingzhec, poor}@princeton.edu). M. Chen is also with the Chinese
model update, denoted as H k , W̃ k − W , to the server. Due to the
University of Hong Kong, Shenzhen, China. S. Cui is with the Chinese Uni- fact that the internet upload throughput is typically more limited com-
versity of Hong Kong, Shenzhen, China (e-mail: [email protected]) pared to its download counterpart [18], the kth user communicates its

The kth model update H_k is therefore encoded into a digital codeword u_k ∈ {0, . . . , 2^{R_k} − 1} ≜ U_k using an encoding function whose input is H_k, i.e.,

e_k : R^{m1×m2} → U_k.  (1)

The server uses the received codewords {u_k}_{k=1}^K to reconstruct Ĥ ∈ R^{m1×m2}, obtained via a joint decoding function

d : U_1 × · · · × U_K → R^{m1×m2}.  (2)

The recovered Ĥ = d(u_1, . . . , u_K) is an estimate of some weighted average Σ_{k=1}^K α_k H_k, referred to as federated averaging, where Σ_{k=1}^K α_k = 1. Finally, the server updates its global model W̃ via

W̃ = W + Ĥ.  (3)

An illustration of this FL procedure is depicted in Fig. 1.

Fig. 1. Federated learning with bit rate constraints.

We consider a simplified model for the communication link, in which the bit transmissions are error-free as long as the number of bits (transmitted within an operation period) is less than R_k, i.e., the limitation is only on the number of bits conveyed over the uplink channel. Clearly, if the number of allowed bits is sufficiently large, then the distance ‖Ĥ − Σ_{k=1}^K α_k H_k‖² can be made arbitrarily small, allowing the server to update the global model as the desired weighted average:

W̃^opt = Σ_{k=1}^K α_k W̃_k.  (4)

However, in the presence of a limited bit budget, the error due to quantization can severely degrade the ability of the server to accurately update its model. To tackle this issue, the work [11] proposed various methods for quantizing the data sent from the users to the server. These techniques include random masks, subsampling, and probabilistic scalar quantization, and assume that the users and the server share a source of common randomness. These approaches are however suboptimal from a quantization theory perspective [19, Ch. 23], which motivates the proposed research on efficient and practical quantization methods for FL.
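To make the update rule (1)–(4) concrete, the following minimal NumPy sketch mimics one FL round. It is illustrative only: the names fl_round, encode, and decode are ours, not the paper's, and the identity "codec" in the usage example stands in for the bit-constrained encoders developed below, so that the recovered update equals the federated average (4) with α_k = 1/K.

```python
import numpy as np

def fl_round(W, local_updates, encode, decode):
    """One FL iteration: each user conveys an encoded version of its model
    update H_k = W~_k - W as in (1); the server jointly decodes the codewords
    into H_hat as in (2) and updates the global model as in (3)."""
    codewords = [encode(Hk) for Hk in local_updates]
    H_hat = decode(codewords)
    return W + H_hat

# Toy usage with an identity "codec" (no bit constraint): the recovered
# update is exactly the federated average (4) with alpha_k = 1/K.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
updates = [rng.standard_normal((4, 3)) for _ in range(5)]
W_new = fl_round(W, updates, encode=lambda H: H,
                 decode=lambda cws: np.mean(cws, axis=0))
```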
B. Problem Formulation

Our goal is to propose an encoding-decoding system which mitigates the effect of quantization errors on the ability of the server to accurately recover the updated model (4). Distributed lossy source coding theory establishes bounds on the reconstruction fidelity for a given bit budget [20, Ch. 11], which can be approached using Wyner-Ziv coding [21]. However, these coding schemes tend to be computationally complex, and require each of the encoders and the decoder to know the joint distribution of the model updates from all the users, {H_k}_{k=1}^K, which is not likely to be available in practice.

In order to faithfully represent the FL setup, we design our quantization strategy in light of the following requirements and assumptions:

A1 All users share the same encoding function, denoted e_k(·) = e(·) for each k ∈ K. This requirement, which was also considered in [11], significantly simplifies FL implementation.
A2 No a-priori knowledge of the distribution of H_k is assumed.
A3 As in [11], the users and the server share a source of common randomness. This is achieved by, e.g., letting the server share a random seed with each user along with the initial model weights W.

Requirement A2 gives rise to the need for a universal quantization approach, as we propose in the following section.

III. QUANTIZATION FOR FEDERATED LEARNING

Based on the problem formulation detailed in the previous section, we propose to convey the model updates {H_k} from the users to the server over the bit-constrained channel based on the universal quantization scheme of [17]. Broadly speaking, the scheme encodes each model update using subtractive dithered lattice quantization, which operates in the same manner for each user, satisfying A1. This method allows the server to recover the updates with minor average error regardless of the distribution of {H_k}, as required in A2, by exploiting the source of common randomness assumed in A3. In addition to its compliance with the model requirements stated in Subsection II-B, the proposed approach is particularly suitable for FL, as the distortion is mitigated by federated averaging, significantly improving the overall FL capabilities, as demonstrated in our numerical study in Section IV. The proposed quantization method is detailed in Subsection III-A, followed by a theoretical performance analysis and a discussion in Subsections III-B and III-C, respectively.

A. Quantization Scheme

Here, we present the quantization scheme, namely, the encoding and decoding functions, e(·) and d(·). As this method involves lattice quantization, we begin with some definitions required to formulate such mappings: we fix a positive integer L, referred to henceforth as the lattice dimension, and a non-singular L × L matrix G, which denotes the lattice generator matrix. Guidelines for selecting L and G are discussed in Subsection III-C. For simplicity, we assume that M ≜ m1·m2/L is an integer, although the scheme can also be applied when this does not hold by replacing M with ⌈M⌉. Next, we use L to denote the lattice, which is the set of points in R^L that can be written as an integer linear combination of the columns of G, i.e.,

L ≜ {x = G·l : l ∈ Z^L}.  (5)

A lattice quantizer Q_L(·) maps each x ∈ R^L to its nearest lattice point, i.e., Q_L(x) = l_x ∈ L if ‖x − l_x‖ ≤ ‖x − l‖ for every l ∈ L. Finally, let P_0 be the basic lattice cell [22], defined as the set of points in R^L which are closer to 0 than to any other lattice point:

P_0 ≜ {x ∈ R^L : ‖x‖ < ‖x − p‖, ∀p ∈ L\{0}}.  (6)

For example, when G = Δ·I_L for some Δ > 0, then L is the square lattice, for which P_0 is the set of vectors x ∈ R^L whose ℓ∞ norm is not larger than Δ/2. For this setting, Q_L(·) implements entry-wise scalar uniform quantization with spacing Δ [19, Ch. 23].
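To make these definitions concrete, here is a minimal NumPy sketch of Q_L(·). Exact nearest-lattice-point search is nontrivial in general; for the low-dimensional, well-conditioned generator matrices used here, rounding the coefficients G⁻¹x (Babai rounding) followed by a small local search suffices. The function name lattice_quantize is ours, not the paper's.

```python
import itertools
import numpy as np

def lattice_quantize(x, G):
    """Q_L(x): map x in R^L to the nearest point of the lattice {G l : l in Z^L}.
    Babai rounding gives a candidate integer vector; searching its +/-1
    neighborhood recovers the exact nearest point for the small, well-
    conditioned generators considered in this sketch."""
    l0 = np.round(np.linalg.solve(G, x))
    best, best_d = None, np.inf
    for off in itertools.product((-1, 0, 1), repeat=len(x)):
        p = G @ (l0 + np.array(off))
        d = np.linalg.norm(x - p)
        if d < best_d:
            best, best_d = p, d
    return best

# The square lattice G = Delta * I reduces to per-entry uniform quantization:
Delta = 0.5
print(lattice_quantize(np.array([0.3, -0.9]), Delta * np.eye(2)))  # [ 0.5 -1. ]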
The proposed encoding function e(·) includes the following steps (a code sketch follows the list):

E1 Partitioning: The kth user vectorizes H_k and divides vec(H_k) into M distinct vectors of size L × 1, denoted {h_i^{(k)}}_{i=1}^M.
E2 Dithering: The encoder utilizes the source of common randomness, e.g., a shared seed, to generate the set of L × 1 dither vectors {z_i^{(k)}}_{i=1}^M, which are randomized in an i.i.d. fashion, independently of H_k, from a uniform distribution over P_0.
E3 Quantization: The vectors {h_i^{(k)}}_{i=1}^M are discretized by adding the dither vectors and applying lattice quantization, i.e., by computing {Q_L(h_i^{(k)} + z_i^{(k)})}.
E4 Entropy coding: The discrete values {Q_L(h_i^{(k)} + z_i^{(k)})} are encoded into a digital codeword u_k in a lossless manner. This can be achieved using entropy coding schemes, e.g., Huffman coding, Lempel-Ziv methods, and arithmetic codes [23, Ch. 13].
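The following sketch covers encoding steps E1–E3, reusing lattice_quantize from above; the helper names sample_dither and encode are our own, and the lossless entropy coding of E4 is left abstract. Dithers uniform over P_0 are obtained by drawing uniformly over the centered fundamental parallelepiped of G and folding each point into the Voronoi cell, which preserves uniformity.

```python
import numpy as np

def sample_dither(rng, G, n):
    """Draw n dithers i.i.d. uniform over the basic cell P0: sample uniformly
    over the fundamental parallelepiped G*[-1/2,1/2)^L, then fold each point
    into the Voronoi cell by subtracting its nearest lattice point."""
    U = rng.uniform(-0.5, 0.5, size=(n, G.shape[0])) @ G.T
    return np.stack([u - lattice_quantize(u, G) for u in U])

def encode(H, G, seed):
    """Steps E1-E3 for one user; E4 (losslessly compressing the lattice
    points into the codeword u_k) is omitted in this sketch."""
    L = G.shape[0]
    blocks = H.reshape(-1, L)                      # E1: partition vec(H_k)
    rng = np.random.default_rng(seed)              # common randomness (A3)
    dithers = sample_dither(rng, G, len(blocks))   # E2: z_i ~ Uniform(P0)
    return np.stack([lattice_quantize(h + z, G)    # E3: Q_L(h_i + z_i)
                     for h, z in zip(blocks, dithers)])
```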

Note that in order to utilize entropy coding in step E4, the discretized values {Q_L(h_i^{(k)} + z_i^{(k)})} must take values in a finite set. This can be achieved by restricting the quantizer Q_L(·) to output only lattice points whose magnitude is not larger than some given threshold.

The decoding mapping d(·) implements the following steps (a code sketch follows the list):

D1 Entropy decoding: The server first decodes each digital codeword u_k into the discrete values {Q_L(h_i^{(k)} + z_i^{(k)})}. Since the encoding is carried out using a lossless source code, the discrete values are recovered without any errors.
D2 Dither subtraction: Using the source of common randomness, the server generates the dither vectors {z_i^{(k)}}_{i=1}^M, and subtracts the corresponding vector from each lattice point, i.e., computes {Q_L(h_i^{(k)} + z_i^{(k)}) − z_i^{(k)}}.
D3 Collecting: The values {Q_L(h_i^{(k)} + z_i^{(k)}) − z_i^{(k)}} are collected into an m1 × m2 matrix Ĥ_k using the inverse operation of the partitioning in encoding step E1.
D4 Model recovery: The recovered matrices are combined into an updated model via W̃ = W + Σ_{k=1}^K α_k Ĥ_k.
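A matching sketch of steps D2–D4 follows, assuming D1 (lossless entropy decoding) has already recovered the per-user lattice points; the payload format and function names are our own illustrative choices. The round-trip check reflects the subtractive dithering property the scheme exploits: the residual Q_L(h + z) − z − h lies in P_0 regardless of h, so averaging across users drives the aggregate error down.

```python
import numpy as np

def decode(payloads, shape, G, alphas):
    """Steps D2-D4: regenerate the dithers from each user's shared seed,
    subtract them from the lattice points, undo the partitioning, and
    federated-average the recovered updates."""
    H_hat = np.zeros(shape)
    for a_k, (points, seed) in zip(alphas, payloads):
        rng = np.random.default_rng(seed)             # same seed as the encoder
        dithers = sample_dither(rng, G, len(points))  # D2: rebuild {z_i}
        blocks = points - dithers                     # D2: Q_L(h_i+z_i) - z_i
        H_hat += a_k * blocks.reshape(shape)          # D3 collect + D4 average
    return H_hat                                      # server sets W~ = W + H_hat

# Round-trip check with the encode sketch above (hypothetical sizes):
G = 0.25 * np.eye(2)
rng = np.random.default_rng(0)
Hs = [rng.standard_normal((8, 4)) for _ in range(5)]
H_hat = decode([(encode(H, G, seed=k), k) for k, H in enumerate(Hs)],
               (8, 4), G, alphas=[0.2] * 5)
print(np.abs(H_hat - np.mean(Hs, axis=0)).max())      # small residual error
```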
An illustration of the proposed scheme is depicted in Fig. 2.

Fig. 2. Federated learning with the proposed quantization system.

For a given lattice L and model update matrix H_k, the resulting codeword u_k may require more than R_k bits, namely, more bits than supported by the channel. In particular, the number of bits used for encoding u_k depends not only on the lattice L, but also on the distribution of H_k [24], on which we assume no a-priori knowledge by A2. This can be handled by letting the encoder use a lattice with a basic cell P_0 of increased volume if the resulting u_k exceeds the bit limitation R_k, decreasing the number of bits at the cost of a coarser quantization. Letting each user select from a finite set of lattices while conveying the index of the selected lattice as part of the digital codeword allows the server to carry out the aforementioned decoding process while meeting the bit constraints and satisfying A1.
B. Performance Analysis

Next, we analyze the performance of FL with the quantization system detailed in the previous subsection. Recall that, in the absence of bit constraints, the server uses the updates provided by the users to compute W̃^opt in (4). Therefore, performance is determined by the distance between the recovered model W̃ and the desired W̃^opt. To formulate this distance, we use {w̃_i}_{i=1}^M and {w̃_i^opt}_{i=1}^M to denote the partitions of W̃ and W̃^opt into M distinct L × 1 vectors via step E1. In order to analyze the distance between W̃ and W̃^opt, we characterize a bound on the ℓ2 distance between w̃_i and w̃_i^opt which holds for all i ∈ {1, . . . , M} ≜ M. To that aim, define ᾱ ≜ K · max_k α_k. Note that the term ᾱ is a parameter of the federated averaging method [4] which is minimized when the server uses standard averaging as in [11], i.e., α_k = 1/K and ᾱ = 1. Additionally, define μ_L ≜ sup_{x∈P_0} ‖x‖, and let σ_L² be the normalized second order moment of the lattice L, defined as σ_L² ≜ ∫_{P_0} ‖x‖² dx / ∫_{P_0} dx [25]. We can now bound the distance between the desired model and the recovered one. We focus on the case in which all encoders use the same lattice; the case where each encoder chooses from a finite set of lattices can be studied using a similar derivation. Furthermore, as in [17], [22], [26], we assume that the quantizers are not overloaded, namely, that despite restricting the quantizer Q_L(·) to output a finite number of lattice points, each quantized value is mapped into its nearest point in the (infinite) lattice, i.e., that h_i^{(k)} + z_i^{(k)} − Q_L(h_i^{(k)} + z_i^{(k)}) ∈ P_0. The resulting bound is stated in the following theorem, whose proof is omitted due to page limitations.

Theorem 1. For all 0 < η < ᾱσ_L²/μ_L and i ∈ M, it holds that

Pr(‖w̃_i − w̃_i^opt‖ ≥ η) ≤ exp{−K·(√(η²/(8ᾱ²σ_L²) + 1/4) − 1/2)²}.  (7)

Theorem 1 implies that the recovered model can be made arbitrarily close to the desired one under bit constraints by increasing K, namely, the number of users which participate in the update procedure. In particular, if ᾱ does not grow with K, which essentially means that the updated model is not based only on a small part of the participating users, then the effect of the quantization error vanishes in the aggregation process. In fact, (7) is obtained by exploiting the fact that the quantization error in subtractive dithered quantization does not depend on the quantized value. Hence, our ability to rigorously upper bound the distance in Theorem 1 is a direct consequence of applying this universal quantization method.
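The qualitative message of Theorem 1 can be checked empirically. The following Monte Carlo sketch (reusing lattice_quantize and sample_dither from the sketches above; the heavy-tailed update distribution and the lattice scaling are arbitrary choices of ours, not from the paper) averages subtractive dithered quantization errors over K users with α_k = 1/K and shows the aggregate error shrinking as K grows, without any Gaussian assumption on the updates. It illustrates the decay only, not the exact bound (7).

```python
import numpy as np

G = np.array([[2.0, 0.0], [1.0, np.sqrt(3.0)]]) * 0.5   # scaled hexagonal lattice
rng = np.random.default_rng(1)
for K in (2, 8, 32, 128):
    errs = []
    for _ in range(200):                      # Monte Carlo trials
        agg = np.zeros(2)
        for _ in range(K):                    # one block per user, alpha_k = 1/K
            h = rng.standard_t(df=3, size=2)  # arbitrary non-Gaussian updates (A2)
            z = sample_dither(rng, G, 1)[0]
            agg += (lattice_quantize(h + z, G) - z - h) / K
        errs.append(np.linalg.norm(agg))
    print(K, np.mean(errs))                   # average error shrinks with K
```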
C. Discussion

The proposed scheme has several clear advantages. First, while it is based on information theoretic arguments, the resulting architecture is completely practical and rather simple to implement. In particular, both subtractive dithered quantization and entropy coding are concrete and established methods which can be realized with relatively low complexity and feasible hardware requirements. The source of common randomness needed for generating the dither vectors can be obtained by sharing a common seed between the server and the users, as also assumed in [11], [12]. The statistical characterization of the quantization error of such quantizers can be obtained in a universal manner, i.e., regardless of the distribution of the model updates {H_k}. This analytical tractability allows us to rigorously show that its combination with federated averaging mitigates the quantization error. A similar approach was also used in the analysis of probabilistic quantization schemes for average consensus problems [27]. The fact that the updates are quantized for a specific task, i.e., to obtain the global model by federated averaging, implies that the FL with bit constraints setup can be treated as a task-based quantization scenario [26], [28], [29]. This task is accounted for in the selection of the quantization scheme, using one for which the error term vanishes by averaging regardless of the values of {H_k}. In our numerical study in Section IV we demonstrate that the improved reconstruction is translated into performance gains of the overall global model.

The fact that the entropy coding in encoding-decoding steps E4 and D1 involves multiple encoders and a single decoder implies that more efficient compression can be achieved by utilizing distributed lossless source coding methods, e.g., Slepian-Wolf coding [23, Ch. 15.4]. In such cases, the server decodes the received codewords {u_k} into {Q_L(h_i^{(k)} + z_i^{(k)})} in a joint manner, instead of using individual entropy codes where Q_L(h_i^{(k)} + z_i^{(k)}) is decoded separately from its corresponding u_k. However, such distributed coding schemes typically require a-priori knowledge of the joint distribution of {h_i^{(k)}}, and utilize different encoding mappings for each user, thus not meeting requirements A1 and A2.

Finally, we note that the FL performance is affected by the selection of the lattice L and its dimension L via the values of μ_L and σ_L² in (7). In general, lattices of higher dimensions typically result in more accurate representations, at the cost of increased complexity. Methods for designing lattices for quantization can be found in [30].

IV. NUMERICAL EVALUATIONS

In this section we numerically evaluate the proposed quantization scheme for FL. We first compare the quantization error induced by our encoder-decoder pair to competing methods utilized in FL systems. Then, we numerically demonstrate how the reduced distortion induced by quantization translates into FL performance gains.

We begin by focusing only on the compression method, comparing the distortion induced by the following quantizers using the same overall number of bits:

Q1 Dithered quantization with a two-dimensional hexagonal lattice, i.e., L = 2 and G = [2, 0; 1, √3].

Q2 Dithered scalar quantization, i.e., L = 1 and G = 1.
Q3 Uniform quantization with random unitary rotation [11].
Q4 Subsampling by random masks followed by uniform three-bit quantizers applied to the non-masked entries [11].

In Fig. 3 we depict the numerically evaluated average per-entry squared-error of Q1-Q4 in quantizing a 128 × 128 matrix with Gaussian i.i.d. entries versus the quantization rate R, defined as the number of bits per entry. To meet the bit rate constraint when using Q1-Q2 we scaled G such that the resulting codewords use less than 128²·R bits, while for Q3 and Q4 the rate determines the quantization resolution and the subsampling ratio, respectively.

Fig. 3. Quantization schemes comparison: average squared-error versus quantization rate for dithered hexagonal lattice quantization (Q1), dithered scalar quantization (Q2), uniform quantization with random unitary rotation (Q3), and subsampling with 3-bit quantizers (Q4).

Observing Fig. 3, we note that indeed the dithered quantization methods Q1-Q2, which are used in our proposed FL scheme, achieve a more accurate digital representation compared to the methods Q3-Q4 previously proposed for FL. It is also observed that Q1, which uses vector lattice quantization, outperforms its scalar counterpart Q2, and that the gain is more notable when fewer bits are available.

Next, we demonstrate that the improved digital representation achieved by dithered lattice quantization also translates into FL performance gains when utilizing the system detailed in Section III. To that aim, we simulate a basic FL setup in which a regression neural network learns to approximate the mapping f : R → R given by f(α) ≜ sin(6πα). In particular, the neural network consists of fully-connected 1 × 20 and 20 × 1 layers with an intermediate sigmoid activation, and is trained to minimize the mean-squared error loss. In each FL iteration, every user has access to 15 i.i.d. input samples {x_i}, generated from a uniform distribution over [0, 1], and their corresponding labels {y_i = f(x_i)}. The users utilize the gradient descent method to train their weights based on the labeled samples, and forward a quantized version of the model updates to the centralized server. These model updates are recovered into a global model using conventional federated averaging, namely, α_k = 1/K for each k ∈ K. The FL performance is measured by the mean squared-error achieved by the global model after 300 FL iterations, computed by averaging over 500 test input samples uniformly distributed over [0, 1]. A sketch of this simulation setup is given below.
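The following NumPy sketch mirrors the described regression experiment under stated assumptions: the learning rate, weight initialization, and the use of a single local gradient step per iteration are our own arbitrary choices (the paper does not specify them), and the quantization of the updates via the encode/decode sketches above is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda a: np.sin(6 * np.pi * a)                 # target mapping f(alpha)
W1 = rng.standard_normal((1, 20)) * 0.5             # 1 x 20 layer
W2 = rng.standard_normal((20, 1)) * 0.5             # 20 x 1 layer

def forward(x, W1, W2):
    hdn = 1.0 / (1.0 + np.exp(-(x @ W1)))           # intermediate sigmoid
    return hdn, hdn @ W2

K, lr = 10, 0.1                                     # users, learning rate (assumed)
for it in range(300):                               # 300 FL iterations
    d1, d2 = np.zeros_like(W1), np.zeros_like(W2)
    for _ in range(K):                              # each user: 15 local samples
        x = rng.uniform(0, 1, (15, 1)); y = f(x)
        hdn, out = forward(x, W1, W2)
        g_out = 2 * (out - y) / len(x)              # MSE loss gradient
        g2 = hdn.T @ g_out
        g1 = x.T @ ((g_out @ W2.T) * hdn * (1 - hdn))
        d1 += -lr * g1 / K; d2 += -lr * g2 / K      # federated averaging, 1/K
    W1, W2 = W1 + d1, W2 + d2

xt = rng.uniform(0, 1, (500, 1))                    # 500 test samples
print(np.mean((forward(xt, W1, W2)[1] - f(xt)) ** 2))
```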
In Fig. 4 we depict the resulting FL performance versus the number of users K when using the dithered lattice quantizer Q1, compared to that achieved using the quantization schemes Q3 and Q4 proposed in [11], for quantization rates of R ∈ {1, 2} bits per sample.

Fig. 4. FL performance as the number of users varies: average squared-error versus the number of users for dithered hexagonal lattice quantization (Q1), uniform quantization with random unitary rotation (Q3), and subsampling with 3-bit quantizers (Q4), at rates R = 1 and R = 2.

Observing Fig. 4, we note that the FL performance achieved using Q1 notably outperforms that of the previously proposed Q3 and Q4. For example, FL operating with K = 9 users over links constrained to R = 1 bit per sample achieves performance improved by 64.02% and 83.36% using the proposed quantization scheme Q1 compared to Q3 and Q4, respectively. It is also noted that as the number of users increases, the accuracy of the global model obtained using FL with Q1 improves accordingly for both R = 1 and R = 2, indicating that the ability to approach the desired global model proven in Theorem 1 directly translates into improved FL performance.

V. CONCLUSIONS

In this work we studied FL systems operating under uplink bit constraints. We first identified the specific requirements of quantization schemes used in FL setups. Then, we proposed an encoding-decoding strategy based on dithered quantization. We analyzed the proposed scheme and proved that its error term is mitigated by federated averaging, indicating its potential for FL. Our numerical study demonstrates that FL with the suggested quantization scheme notably outperforms previously proposed methods for such setups, indicating that its theoretical benefits are translated into FL performance gains.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proceedings of the IEEE, 2019.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks," in Neural Information Processing Systems, 2012, pp. 1223–1231.
[4] H. B. McMahan, E. Moore, D. Ramage, and S. Hampson, "Communication-efficient learning of deep networks from decentralized data," arXiv preprint arXiv:1602.05629, 2016.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, and H. B. McMahan, "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, 2019.
[6] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Neural Information Processing Systems, 2017, pp. 4424–4434.
[7] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private recurrent language models," arXiv preprint arXiv:1710.06963, 2017.
[8] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," arXiv preprint arXiv:1908.06287, 2019.
[9] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[10] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," arXiv preprint arXiv:1908.07873, 2019.
[11] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[12] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Neural Information Processing Systems, 2017, pp. 1709–1720.
[13] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," arXiv preprint arXiv:1712.01887, 2017.
[14] C. Hardy, E. Le Merrer, and B. Sericola, "Distributed deep learning on edge-devices: Feasibility via adaptive compression," in Proc. IEEE Int. Symp. Network Computing and Applications (NCA), 2017, pp. 1–8.
[15] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," arXiv preprint arXiv:1704.05021, 2017.
[16] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Neural Information Processing Systems, 2017, pp. 1509–1519.
[17] R. Zamir and M. Feder, "On universal quantization by randomized uniform/lattice quantizers," IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 428–436, 1992.
[18] speedtest.net, "Speedtest United States market report," 2019. [Online]. Available: http://www.speedtest.net/reports/united-states/
[19] Y. Polyanskiy and Y. Wu, "Lecture notes on information theory," Lecture Notes for ECE563 (UIUC), 2014.
[20] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[21] A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 1–10, 1976.
[22] R. Zamir and M. Feder, "On lattice quantization noise," IEEE Trans. Inf. Theory, vol. 42, no. 4, pp. 1152–1159, 1996.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[24] J. Ziv, "On universal quantization," IEEE Trans. Inf. Theory, vol. 31, no. 3, pp. 344–347, 1985.
[25] J. Conway and N. Sloane, "Voronoi regions of lattices, second moments of polytopes, and quantization," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 211–226, 1982.
[26] N. Shlezinger, Y. C. Eldar, and M. R. Rodrigues, "Hardware-limited task-based quantization," IEEE Trans. Signal Process., vol. 67, no. 20, pp. 5223–5238, 2019.
[27] T. C. Aysal, M. J. Coates, and M. G. Rabbat, "Distributed average consensus with dithered quantization," IEEE Trans. Signal Process., vol. 56, no. 10, pp. 4905–4918, 2008.
[28] N. Shlezinger, Y. C. Eldar, and M. R. Rodrigues, "Asymptotic task-based quantization with application to massive MIMO," IEEE Trans. Signal Process., vol. 67, no. 15, pp. 3995–4012, 2019.
[29] S. Salamatian, N. Shlezinger, Y. C. Eldar, and M. Médard, "Task-based quantization for recovering quadratic functions using principal inertia components," in Proc. IEEE Int. Symp. Inf. Theory, 2019.
[30] E. Agrell and T. Eriksson, "Optimization of lattices for quantization," IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1814–1828, 1998.
