Federated Learning With Quantization Constraints
Nir Shlezinger, Mingzhe Chen, Yonina C. Eldar, H. Vincent Poor, and Shuguang Cui

This project has received funding from the Benoziyo Endowment Fund for the Advancement of Science, the Estate of Olga Klein – Astrachan, the European Union's Horizon 2020 research and innovation program under grant No. 646804-ERC-COG-BNYQ, from the Israel Science Foundation under grant No. 0100101, and from the U.S. National Science Foundation under grants CCF-0939370 and CCF-1513915. N. Shlezinger and Y. C. Eldar are with the Faculty of Math and CS, Weizmann Institute of Science, Rehovot, Israel (e-mail: [email protected]; [email protected]). M. Chen and H. V. Poor are with the EE Dept., Princeton University, Princeton, NJ (e-mail: {mingzhec, poor}@princeton.edu). M. Chen is also with the Chinese University of Hong Kong, Shenzhen, China. S. Cui is with the Chinese University of Hong Kong, Shenzhen, China (e-mail: [email protected]).
Abstract—Traditional deep learning models are trained on centralized servers using labeled sample data collected from edge devices. This data often includes private information, which the users may not be willing to share. Federated learning (FL) is an emerging approach to train such learning models without requiring the users to share their possibly private labeled data. In FL, each user trains its copy of the learning model locally. The server then collects the individual updates and aggregates them into a global model. A major challenge that arises in this method is the need of each user to efficiently transmit its learned model over the throughput limited uplink channel. In this work, we tackle this challenge using tools from quantization theory. In particular, we identify the unique characteristics associated with conveying trained models over rate-constrained channels, and characterize a suitable quantization scheme for such setups. We show that combining universal vector quantization methods with FL yields a decentralized training system, which is both efficient and feasible. We also derive theoretical performance guarantees of the system. Our numerical results illustrate the substantial performance gains of our scheme over FL with previously proposed quantization approaches.
Index Terms—Federated learning, edge computing, quantization.
I. INTRODUCTION

Machine learning methods have demonstrated unprecedented performance in a broad range of applications [1]. This is achieved by training a deep network model based on a large number of labeled training samples. Often, these samples are gathered on end-devices, such as smartphones, while the deep model is maintained by a computationally powerful centralized server [2]. Traditionally, the users send their labeled data to the server, who in turn uses this massive amount of samples to train the model. However, the data often contains private information, which the users may prefer not to share. This gives rise to the need to adapt the network on the end-devices, i.e., to train a centralized model in a distributed fashion [3].

Federated learning (FL), proposed in [4], is a method to update such decentralized models, and is the focus of growing attention in recent years. Here, instead of requiring the users to share their possibly private labeled data, each user trains the network locally, and conveys its trained model updates to the server. The server then aggregates these updates into a global network in an iterative fashion [4], [5]. This strategy was extended to multi-task learning in [6]. Methods for guaranteeing privacy in FL were studied in [7], and user scheduling policies were analyzed in [8].

One of the major challenges of FL is the transfer of a large number of updated model parameters over the uplink communication channel from the users to the server, whose throughput is typically constrained [4], [9], [10]. Several approaches have been proposed to tackle this issue: The work [11] proposed various methods for quantizing the updates sent from the users to the server. These methods include random masks, subsampling, and probabilistic quantization, which is also considered in [12]; sparsifying masks for compressing the gradients were proposed in [13]–[15], while [16] applied ternary quantization to the updates. However, these approaches are suboptimal from a quantization theory perspective, as, e.g., discarding a random subset of the gradients can result in dominant distortion. This motivates the design and analysis of quantization methods for facilitating updated model transfer in FL, which minimize the error in the global model.

Here, we design quantizers for distributed deep network training by utilizing quantization theory methods. We first discuss the requirements which arise in FL setups under quantization constraints. We specifically identify the lack of a unified statistical model and the availability of a source of local randomness as characteristics of such setups. Based on these properties, we propose a mapping scheme following concepts from universal quantization [17], which is based on solid information theoretic arguments while meeting the aforementioned requirements.

We theoretically analyze the ability of the server to accurately recover the updated model via conventional FL [4], showing that the error induced by the proposed quantization scheme is mitigated by conventional federated averaging. Specifically, our analysis shows that this error can be bounded by a term which decays exponentially with the number of users, regardless of the statistical model from which the data of each user is generated, rigorously proving that the quantization distortion can be made arbitrarily small when a sufficient number of users contribute to the overall model. Then, we show that these theoretical gains translate into FL performance gains in a numerical study, demonstrating that FL with the proposed quantization scheme yields more accurate global models compared to previously proposed quantization approaches for such setups.

The rest of this paper is organized as follows: Section II presents the system model and identifies the requirements of FL quantization. Section III details the proposed quantization system and theoretically analyzes its performance. Numerical examples are presented in Section IV. Section V concludes the paper.

Throughout the paper, we use boldface lower-case letters for vectors, e.g., x; matrices are denoted with boldface upper-case letters, e.g., M; calligraphic letters, such as X, are used for sets. The ℓ2 norm and the vectorization operator are denoted by ‖·‖ and vec(·), respectively. Finally, R and Z are the sets of real numbers and integers, respectively.
II. SYSTEM MODEL

In this section we detail the FL with quantization constraints setup. To that aim, we first review the conventional FL setup in Subsection II-A. Then, in Subsection II-B, we formulate the problem and identify the unique requirements of quantizers utilized in FL systems.

A. Federated Learning

We consider the FL model detailed in [4], [11], which consists of iterative subsequent updates of a single deep network model. Here, we focus on a single iteration of the model update. A centralized server is training a model consisting of m1 × m2 hyperparameters based on labeled samples available at a set of K remote users. To that aim, the server shares its current model, represented by the matrix W ∈ R^(m1×m2), with the users. The kth user, k ∈ {1, . . . , K} ≜ 𝒦, uses its set of n_k labeled training samples, denoted as {x_i^(k), y_i^(k)}_{i=1}^{n_k}, to retrain the model W into an updated model W̃_k ∈ R^(m1×m2). For example, W̃_k may be obtained by multiple stochastic gradient descent (SGD) steps applied to W, executed over the local data set. Having updated the model weights, the kth user should convey its model update, denoted as H_k ≜ W̃_k − W, to the server. Due to the fact that the internet upload throughput is typically more limited compared to its download counterpart [18], the kth user communicates its model update to the server using a limited number of bits, denoted R_k.
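For concreteness, the following is a minimal NumPy sketch of the single model-update iteration described above: each user retrains W on its local samples and reports H_k = W̃_k − W, which the server aggregates into a global model. The toy linear model, the loss, the step sizes, and all function names are illustrative assumptions for this sketch rather than the implementation considered here, and the uplink quantization discussed next is omitted.

```python
import numpy as np

def local_update(W, X, Y, lr=0.1, sgd_steps=5):
    """One user's retraining: a few gradient steps on the local data.

    Returns the model update H_k = W_tilde_k - W (here for a toy linear model,
    standing in for a deep network)."""
    W_k = W.copy()
    for _ in range(sgd_steps):
        grad = X.T @ (X @ W_k - Y) / len(X)   # gradient of the squared-error loss
        W_k -= lr * grad
    return W_k - W

def federated_round(W, user_data, alpha=None):
    """Server side: aggregate the K updates, W_tilde = W + sum_k alpha_k * H_k."""
    K = len(user_data)
    alpha = alpha if alpha is not None else [1.0 / K] * K   # standard federated averaging
    updates = [local_update(W, X, Y) for X, Y in user_data]
    return W + sum(a * H_k for a, H_k in zip(alpha, updates))

# toy usage: K = 4 users, a 3 x 1 model
rng = np.random.default_rng(0)
W = np.zeros((3, 1))
users = [(rng.standard_normal((15, 3)), rng.standard_normal((15, 1))) for _ in range(4)]
W = federated_round(W, users)
```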
In particular, quantizers utilized in FL systems should satisfy the following requirements:

A1 All users share the same encoding function, denoted e_k(·) = e(·) for each k ∈ 𝒦. This requirement, which was also considered in [11], significantly simplifies FL implementation.
A2 No a-priori knowledge of the distribution of H_k is assumed.
A3 As in [11], the users and the server share a source of common randomness. This is achieved by, e.g., letting the server share a random seed with each user along with the initial model weights W.

Requirement A2 gives rise to the need for a universal quantization approach, as we propose in the following section. The final step of the encoding mapping e(·) is:

E4 Entropy coding: The discrete values {Q_L(h_i^(k) + z_i^(k))}_{i=1}^M are encoded into a digital codeword u_k in a lossless manner. This can be achieved using entropy coding schemes, e.g., Huffman coding, Lempel-Ziv methods, and arithmetic codes [23, Ch. 13].
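To make the encoder side concrete, here is a minimal sketch of subtractive dithered quantization with a scalar lattice (L = 1, step size delta), using a seed shared with the server as the source of common randomness (requirement A3). The partitioning of the update and the lossless entropy coding of step E4 are only stubbed out; the choice of a scalar lattice and all names here are simplifying assumptions of this sketch, not the exact construction used in the paper.

```python
import numpy as np

def encode_update(H_k, delta, seed):
    """Subtractive dithered quantization of one model update with a scalar lattice.

    Returns the integer lattice indices that step E4 would entropy-encode into the
    codeword u_k; the dither is reproducible at the server from `seed`."""
    h = H_k.reshape(-1)                               # partition/flatten the m1 x m2 update
    rng = np.random.default_rng(seed)                 # common randomness shared with the server
    z = rng.uniform(-delta / 2, delta / 2, h.shape)   # dither, uniform over the basic cell
    indices = np.round((h + z) / delta).astype(int)   # nearest lattice point of h + z
    return indices                                    # E4: losslessly compress these integers

# toy usage
H_k = np.random.default_rng(1).standard_normal((4, 4))
u_k = encode_update(H_k, delta=0.25, seed=7)
```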
Note that in order to utilize entropy coding in step E4, the discretized values {Q_L(h_i^(k) + z_i^(k))} must take values on a finite set. This can be achieved by restricting the quantizer Q_L(·) to output only lattice points whose magnitude is not larger than some given threshold.

The decoding mapping d(·) implements the following:

D1 Entropy decoding: The server first decodes each digital codeword u_k into the discrete values {Q_L(h_i^(k) + z_i^(k))}. Since the encoding is carried out using a lossless source code, the discrete values are recovered without any errors.
D2 Dither subtraction: Using the source of common randomness, the server generates the dither vectors {z_i^(k)}_{i=1}^M, and subtracts the corresponding vector from each lattice point, i.e., computes {Q_L(h_i^(k) + z_i^(k)) − z_i^(k)}.
D3 Collecting: The values {Q_L(h_i^(k) + z_i^(k)) − z_i^(k)} are collected into an m1 × m2 matrix Ĥ_k using the inverse operation of the partitioning in encoding step E1.
D4 Model recovery: The recovered matrices are combined into an updated model via W̃ = W + Σ_{k=1}^{K} α_k Ĥ_k.

An illustration of the proposed scheme is depicted in Fig. 2.
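A matching sketch of the server side (steps D1-D4), paired with the scalar-lattice encoder sketched earlier: the entropy decoding of D1 is assumed to have already recovered each user's lattice indices, the shared seeds regenerate the dithers for D2, and D3-D4 reshape and average the recovered updates. As before, this is an illustrative NumPy sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def decode_and_aggregate(W, indices_per_user, seeds, alpha, delta):
    """Steps D1-D4 for the scalar-lattice sketch: D1 (entropy decoding) is assumed done,
    so indices_per_user[k] already holds user k's recovered lattice indices."""
    W_tilde = W.copy()
    for idx, seed, a in zip(indices_per_user, seeds, alpha):
        rng = np.random.default_rng(seed)                    # same seed as the encoder (A3)
        z = rng.uniform(-delta / 2, delta / 2, idx.shape)    # D2: regenerate the dither
        H_hat = (idx * delta - z).reshape(W.shape)           # D2-D3: subtract dither, reshape
        W_tilde += a * H_hat                                 # D4: federated averaging
    return W_tilde
```

Pairing this with the encoder sketch above (same delta and seeds) closes the loop: the only error left in each recovered Ĥ_k is the quantization noise, which does not depend on the quantized values and is suppressed by the averaging in D4.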
For a given lattice L and model update matrix H_k, the resulting codeword u_k may require more than R_k bits, namely, more bits than supported by the channel. In particular, the number of bits used for encoding u_k depends not only on the lattice L, but also on the distribution of H_k [24], on which we assume no a-priori knowledge by A2. This can be handled by letting the encoder use a lattice with a basic cell P_0 of increased volume if the resulting u_k exceeds the bit limitation R_k, decreasing the number of bits at the cost of a coarser quantization. Letting each user select from a finite set of lattices, while conveying the index of the selected lattice as part of the digital codeword, allows the server to carry out the aforementioned decoding process while meeting the bit constraints and satisfying A1.
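One simple way to realize this rate adaptation, sketched below under the same scalar-lattice simplification: the user tries a finite, pre-agreed set of step sizes from fine to coarse and keeps the first one whose estimated codeword length fits the bit budget R_k, conveying the chosen index to the server. The empirical-entropy proxy for the codeword length and all names here are assumptions of this sketch rather than the paper's exact mechanism.

```python
import numpy as np

def encode_with_budget(H_k, deltas, R_k, seed):
    """Try step sizes from fine to coarse and keep the first one whose entropy-coded
    size fits the bit budget R_k; the chosen index is conveyed to the server."""
    h = H_k.reshape(-1)
    for idx, delta in enumerate(deltas):
        z = np.random.default_rng(seed).uniform(-delta / 2, delta / 2, h.shape)
        q = np.round((h + z) / delta).astype(int)
        _, counts = np.unique(q, return_counts=True)
        p = counts / counts.sum()
        bits = -(counts * np.log2(p)).sum()      # empirical-entropy estimate of the codeword length
        if bits <= R_k:
            return idx, q
    return len(deltas) - 1, q                    # fall back to the coarsest lattice

# toy usage: a 64-entry update under a 100-bit budget
H_k = np.random.default_rng(3).standard_normal((8, 8))
lattice_idx, code = encode_with_budget(H_k, deltas=[0.1, 0.5, 1.0, 2.0], R_k=100, seed=5)
```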
B. Performance Analysis

Next, we analyze the performance of FL with the quantization system detailed in the previous subsection. Recall that, in the absence of bit constraints, the server uses the updates provided by the users to compute W̃^opt in (4). Therefore, performance is determined by the distance between the recovered model W̃ and the desired W̃^opt. To formulate this distance, we use {w̃_i}_{i=1}^M and {w̃_i^opt}_{i=1}^M to denote the partitions of W̃ and W̃^opt into M distinct L × 1 vectors via step E1. In order to analyze the distance between W̃ and W̃^opt, we characterize a bound on the ℓ2 distance between w̃_i and w̃_i^opt which holds for all i ∈ {1, . . . , M} ≜ ℳ. To that aim, define ᾱ ≜ K · max_k α_k. Note that the term ᾱ is a parameter of the federated averaging method [4], which is minimized when the server uses standard averaging as in [11], i.e., α_k = 1/K and ᾱ = 1. Additionally, define µ_L ≜ sup_{x∈P_0} ‖x‖, and let σ_L² be the normalized second order moment of the lattice L, defined as σ_L² ≜ ∫_{P_0} ‖x‖² dx / ∫_{P_0} dx [25]. We can now bound the distance between the desired model and the recovered one. We focus on the case in which all encoders use the same lattice. The case where each encoder chooses from a finite set of lattices can be studied using a similar derivation. Furthermore, as in [17], [22], [26], we assume that the quantizers are not overloaded, namely, that despite restricting the quantizer Q_L(·) to output a finite number of lattice points, each quantized value is mapped into its nearest point in the (infinite) lattice, i.e., that h_i^(k) + z_i^(k) − Q_L(h_i^(k) + z_i^(k)) ∈ P_0. The resulting bound is stated in the following theorem, whose proof is omitted due to page limitations.
Theorem 1. For all 0 < η < ᾱ σ_L²/µ_L and i ∈ ℳ, it holds that

Pr( ‖w̃_i − w̃_i^opt‖ ≥ η ) ≤ exp( −K η² / (8 ᾱ² σ_L²) + 1/4 ).   (7)
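To see how (7) behaves, here is a small sketch evaluating its right-hand side for a scalar lattice of unit step, for which µ_L = 1/2 and σ_L² = 1/12; the particular η and the use of standard averaging (ᾱ = 1) are illustrative choices, and the formula is taken from the theorem as stated above.

```python
import numpy as np

# scalar lattice with unit step: basic cell P0 = [-1/2, 1/2), so mu_L = 1/2 and sigma_L^2 = 1/12
mu_L, sigma_L2 = 0.5, 1.0 / 12
alpha_bar, eta = 1.0, 0.1
assert eta < alpha_bar * sigma_L2 / mu_L       # condition of Theorem 1

def theorem1_rhs(K):
    """Right-hand side of (7): exp(-K*eta^2/(8*alpha_bar^2*sigma_L^2) + 1/4)."""
    return np.exp(-K * eta**2 / (8 * alpha_bar**2 * sigma_L2) + 0.25)

for K in (100, 1000, 5000):                    # the bound decays exponentially with the number of users
    print(K, theorem1_rhs(K))
```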
Theorem 1 implies that the recovered model can be made arbitrarily close to the desired one under bit constraints by increasing K, namely, the number of users which participate in the update procedure. In particular, if ᾱ does not grow with K, which essentially means that the updated model is not based only on a small part of the participating users, then the effect of the quantization error vanishes in the aggregation process. In fact, (7) is obtained by exploiting the fact that the quantization error in subtractive dithered quantization does not depend on the quantized value. Hence, our ability to rigorously upper bound the distance in Theorem 1 is a direct consequence of applying this universal quantization method.

C. Discussion

The proposed scheme has several clear advantages. First, while it is based on information theoretic arguments, the resulting architecture is completely practical and rather simple to implement. In particular, both subtractive dithered quantization as well as entropy coding are concrete and established methods which can be realized with relatively low complexity and feasible hardware requirements. The source of common randomness needed for generating the dither vectors can be obtained by sharing a common seed between the server and the users, as also assumed in [11], [12]. The statistical characterization of the quantization error of such quantizers can be obtained in a universal manner, i.e., regardless of the distribution of the model updates {H_k}. This analytical tractability allows us to rigorously show that its combination with federated averaging mitigates the quantization error. A similar approach was also used in the analysis of probabilistic quantization schemes for average consensus problems [27]. The fact that the updates are quantized for a specific task, i.e., to obtain the global model by federated averaging, implies that the FL with bit constraints setup can be treated as a task-based quantization scenario [26], [28], [29]. This task is accounted for in the selection of the quantization scheme, using one for which the error term vanishes by averaging regardless of the values of {H_k}. In our numerical study in Section IV we demonstrate that the improved reconstruction is translated into performance gains of the overall global model.

The fact that the entropy coding in encoding-decoding steps E4 and D1 involves multiple encoders and a single decoder implies that more efficient compression can be achieved by utilizing distributed lossless source coding methods, e.g., Slepian-Wolf coding [23, Ch. 15.4]. In such cases, the server decodes the received codewords {u_k} into {Q_L(h_i^(k) + z_i^(k))} in a joint manner, instead of using individual entropy codes where Q_L(h_i^(k) + z_i^(k)) is decoded separately from its corresponding u_k. However, such distributed coding schemes typically require a-priori knowledge of the joint distribution of {h_i^(k)}, and utilize different encoding mappings for each user, thus not meeting requirements A1 and A2.

Finally, we note that the FL performance is affected by the selection of the lattice L and its dimension L via the values of µ_L and σ_L² in (7). In general, lattices of higher dimensions typically result in more accurate representations, at the cost of increased complexity. Methods for designing lattices for quantization can be found in [30].
Fig. 2. Federated learning with proposed quantization system illustration.

Fig. 3. Quantization schemes comparison (average squared-error versus quantization rate).

Fig. 4. FL performance as the number of users varies (average squared-error versus number of users).
IV. NUMERICAL EVALUATIONS

In this section we numerically evaluate the proposed quantization scheme for FL. We first compare the quantization error induced by our encoder-decoder pair to competing methods utilized in FL systems. Then, we numerically demonstrate how the reduced distortion induced by quantization translates into FL performance gains.

We begin by focusing only on the compression method, comparing the distortion induced by the following quantizers using the same overall number of bits:

Q1 Dithered quantization with a two-dimensional hexagonal lattice, i.e., L = 2 and G = [2, 0; 1, √3] (see the sketch after this list).
Q2 Dithered scalar quantization, i.e., L = 1 and G = 1.
Q3 Uniform quantization with random unitary rotation [11].
Q4 Subsampling by random masks followed by uniform three-bit quantizers applied to the non-masked entries [11].
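As referenced in Q1, here is a sketch of the nearest-lattice-point operation for the two-dimensional hexagonal lattice generated by G = [2, 0; 1, √3], using a small search around the rounded (Babai) coefficients. This brute-force neighborhood search is our own simplification, adequate for such a well-conditioned two-dimensional basis, and is not necessarily how the evaluated quantizer was implemented.

```python
import numpy as np

G = np.array([[2.0, 0.0],
              [1.0, np.sqrt(3.0)]])   # rows are the basis vectors of the hexagonal lattice

def nearest_lattice_point(x, G):
    """Map x in R^2 to its closest point of the lattice {n @ G : n integer},
    via a small search around the rounded (Babai) coefficient vector."""
    n0 = np.round(np.linalg.solve(G.T, x))           # x ~ G.T @ n, so n ~ solve(G.T, x)
    best, best_dist = None, np.inf
    for d0 in (-1, 0, 1):
        for d1 in (-1, 0, 1):
            cand = (n0 + np.array([d0, d1])) @ G
            dist = np.linalg.norm(x - cand)
            if dist < best_dist:
                best, best_dist = cand, dist
    return best

# toy usage: quantize a (dithered) pair of model-update entries
print(nearest_lattice_point(np.array([0.7, -1.2]), G))
```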
In Fig. 3 we depict the numerically evaluated average per-entry squared-error of Q1-Q4 in quantizing a 128 × 128 matrix with i.i.d. Gaussian entries versus the quantization rate R, defined as the number of bits per entry. To meet the bit rate constraint when using Q1-Q2 we scaled G such that the resulting codewords use less than 128²·R bits, while for Q3 and Q4 the rate determines the quantization resolution and the subsampling ratio, respectively.

Observing Fig. 3, we note that the dithered quantization methods Q1-Q2, which are used in our proposed FL scheme, indeed achieve a more accurate digital representation compared to the methods Q3-Q4 previously proposed for FL. It is also observed that Q1, which uses vector lattice quantization, outperforms its scalar counterpart Q2, and that the gain is more notable when fewer bits are available.
Next, we demonstrate that the improved digital representation achieved by dithered lattice quantization also translates into FL performance gains, when utilizing the system detailed in Section III. To that aim, we simulate a basic FL setup in which a regression neural network learns to approximate the mapping f : R → R given by f(α) ≜ sin(6πα). In particular, the neural network consists of fully-connected 1 × 20 and 20 × 1 layers with an intermediate sigmoid activation, and is trained to minimize the mean-squared error loss. In each FL iteration, every user has access to 15 i.i.d. input samples {x_i}, generated from a uniform distribution over [0, 1], and their corresponding labels {y_i = f(x_i)}. The users utilize the gradient descent method to train their weights based on the labeled samples, and forward a quantized version of the model updates to the centralized server. These model updates are recovered into a global model using conventional federated averaging, namely, α_k = 1/K for each k ∈ 𝒦. The FL performance is measured by the mean squared-error achieved by the global model after 300 FL iterations, computed by averaging over 500 test input samples uniformly distributed over [0, 1].
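A compact sketch of the toy regression setup described above (a 1 → 20 → 1 network with a sigmoid hidden layer, 15 uniform samples per user per round, labels y = sin 6πx). The initialization, learning rate, and the omission of the uplink quantization are assumptions of this sketch, so it only mirrors the data and architecture, not the full experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(6 * np.pi * x)          # the mapping the network learns to approximate

# 1 x 20 and 20 x 1 fully connected layers with an intermediate sigmoid activation
params = {"W1": rng.standard_normal((1, 20)), "b1": np.zeros(20),
          "W2": rng.standard_normal((20, 1)), "b2": np.zeros(1)}

def forward(p, x):
    a = 1.0 / (1.0 + np.exp(-(x @ p["W1"] + p["b1"])))   # hidden sigmoid layer
    return a @ p["W2"] + p["b2"], a

def local_gradient_step(p, x, y, lr=0.5):
    """One gradient-descent step on the mean-squared-error loss over a user's local samples."""
    yhat, a = forward(p, x)
    e = 2.0 * (yhat - y) / len(x)                        # dMSE/dyhat
    da = e @ p["W2"].T * a * (1 - a)                     # backprop through the sigmoid
    p["W2"] -= lr * (a.T @ e); p["b2"] -= lr * e.sum(0)
    p["W1"] -= lr * (x.T @ da); p["b1"] -= lr * da.sum(0)

# one user's local data in a single FL iteration: 15 samples uniform over [0, 1]
x = rng.uniform(0.0, 1.0, (15, 1))
local_gradient_step(params, x, f(x))
```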
In Fig. 4 we depict the resulting FL performance versus the number of users K when using the dithered lattice quantizer Q1, compared to that achieved using the quantization schemes Q3 and Q4 proposed in [11], for quantization rates of R = {1, 2} bits per sample. Observing Fig. 4, we note that the FL performance achieved using Q1 notably outperforms that of the previously proposed Q3 and Q4. For example, FL operating with K = 9 users over links constrained to R = 1 bit per sample achieves performance improved by 64.02% and 83.36% using the proposed quantization scheme Q1 compared to Q3 and Q4, respectively. It is also noted that as the number of users increases, the accuracy of the global model obtained using FL with Q1 improves accordingly for both R = 1 and R = 2, indicating that the ability to approach the desired global model proven in Theorem 1 directly translates into improved FL performance.

V. CONCLUSIONS

In this work we studied FL systems operating under uplink bit constraints. We first identified the specific requirements of quantization schemes used in FL setups. Then, we proposed an encoding-decoding strategy based on dithered quantization. We analyzed the proposed scheme and proved that its error term is mitigated by federated averaging, indicating its potential for FL. Our numerical study demonstrates that FL with the suggested quantization scheme notably outperforms previously proposed methods for such setups, indicating that its theoretical benefits are translated into FL performance gains.
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proceedings of the IEEE, 2019.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks," in Neural Information Processing Systems, 2012, pp. 1223–1231.
[4] H. B. McMahan, E. Moore, D. Ramage, and S. Hampson,
“Communication-efficient learning of deep networks from decentralized
data,” arXiv preprint arXiv:1602.05629, 2016.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman,
V. Ivanov, C. Kiddon, J. Konečnỳ, S. Mazzocchi, and H. B. McMahan,
“Towards federated learning at scale: System design,” arXiv preprint
arXiv:1902.01046, 2019.
[6] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated
multi-task learning,” in Neural Information Processing Systems, 2017,
pp. 4424–4434.
[7] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learn-
ing differentially private recurrent language models,” arXiv preprint
arXiv:1710.06963, 2017.
[8] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling
policies for federated learning in wireless networks,” arXiv preprint
arXiv:1908.06287, 2019.
[9] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint
learning and communications framework for federated learning over
wireless networks,” arXiv preprint arXiv:1909.07972, 2019.
[10] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learn-
ing: Challenges, methods, and future directions,” arXiv preprint
arXiv:1908.07873, 2019.
[11] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and
D. Bacon, “Federated learning: Strategies for improving communication
efficiency,” arXiv preprint arXiv:1610.05492, 2016.
[12] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD:
Communication-efficient SGD via gradient quantization and encoding,”
in Neural Information Processing Systems, 2017, pp. 1709–1720.
[13] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient
compression: Reducing the communication bandwidth for distributed
training,” arXiv preprint arXiv:1712.01887, 2017.
[14] C. Hardy, E. Le Merrer, and B. Sericola, “Distributed deep learning
on edge-devices: feasibility via adaptive compression,” in 2017 IEEE
16th International Symposium on Network Computing and Applications
(NCA). IEEE, 2017, pp. 1–8.
[15] A. F. Aji and K. Heafield, “Sparse communication for distributed
gradient descent,” arXiv preprint arXiv:1704.05021, 2017.
[16] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad:
Ternary gradients to reduce communication in distributed deep learning,”
in Neural Information Processing Systems, 2017, pp. 1509–1519.
[17] R. Zamir and M. Feder, “On universal quantization by randomized
uniform/lattice quantizers,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp.
428–436, 1992.
[18] speedtest.net, “Speedtest united states market report,” 2019. [Online].
Available: http://www.speedtest.net/reports/united-states/
[19] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture
Notes for ECE563 (UIUC) and, vol. 6, no. 2012-2016, p. 7, 2014.
[20] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge
university press, 2011.
[21] A. Wyner and J. Ziv, “The rate-distortion function for source coding
with side information at the decoder,” IEEE Trans. Inf. Theory, vol. 22,
no. 1, pp. 1–10, 1976.
[22] R. Zamir and M. Feder, “On lattice quantization noise,” IEEE Trans.
Inf. Theory, vol. 42, no. 4, pp. 1152–1159, 1996.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory. John
Wiley & Sons, 2012.
[24] J. Ziv, “On universal quantization,” IEEE Trans. Inf. Theory, vol. 31,
no. 3, pp. 344–347, 1985.
[25] J. Conway and N. Sloane, “Voronoi regions of lattices, second moments
of polytopes, and quantization,” IEEE Trans. Inf. Theory, vol. 28, no. 2,
pp. 211–226, 1982.
[26] N. Shlezinger, Y. C. Eldar, and M. R. Rodrigues, “Hardware-limited
task-based quantization,” IEEE Trans. Signal Process., vol. 67, no. 20,
pp. 5223–5238, 2019.
[27] T. C. Aysal, M. J. Coates, and M. G. Rabbat, “Distributed average
consensus with dithered quantization,” IEEE Trans. Signal Process.,
vol. 56, no. 10, pp. 4905–4918, 2008.
[28] N. Shlezinger, Y. C. Eldar, and M. R. Rodrigues, "Asymptotic task-based quantization with application to massive MIMO," IEEE Trans. Signal Process., vol. 67, no. 15, pp. 3995–4012, 2019.
[29] S. Salamtian, N. Shlezinger, Y. C. Eldar, and M. Medard, "Task-based quantization for recovering quadratic functions using principal inertia components," in Proc. IEEE Int. Symp. Inf. Theory, 2019.
[30] E. Agrell and T. Eriksson, "Optimization of lattices for quantization," IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1814–1828, 1998.