the true update size and the minimal update size (which is given by the entropy). If we assume the size of the model and the number of training iterations to be fixed (e.g., because we want to achieve a certain accuracy on a given task), this leaves us with three options to reduce communication: 1) we can reduce the communication frequency $f$; 2) reduce the entropy of the weight updates $H(\Delta\mathcal{W}^{\mathrm{up/down}})$ via lossy compression schemes; and/or 3) use more efficient encodings to communicate the weight updates, thus reducing $\eta$.
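To make the interplay of these three levers concrete, the toy cost model below (ours, not a formula from this article) treats the per-client upstream traffic as the product of the number of communication rounds, the model size, and the bits spent per parameter; the function name and the numbers in the example are illustrative assumptions only.

```python
def upstream_bits(num_iterations: int, freq: float, num_params: int,
                  entropy_bits: float, overhead_bits: float) -> float:
    """Toy per-client traffic estimate: rounds x |W| x (H + eta).

    freq          -- communication frequency f (rounds per training iteration)
    entropy_bits  -- H, average information content per parameter of one update
    overhead_bits -- eta, extra bits per parameter caused by inefficient encoding
    """
    rounds = num_iterations * freq
    return rounds * num_params * (entropy_bits + overhead_bits)


# Each lever enters multiplicatively: halving f, H, or (H + eta) halves the traffic.
dense_baseline = upstream_bits(20_000, 1.0, 865_482, 32.0, 0.0)   # 32-bit dense updates
compressed     = upstream_bits(20_000, 1.0, 865_482, 0.4, 0.1)    # illustrative sparse coding
print(dense_baseline / compressed)   # rough compression factor of the toy example
```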
performance gap. Sparsification methods have been proposed primarily with the intention to speed up parallel training in the data center. Their convergence properties in the much more challenging federated learning environments have not yet been investigated. Sparsification methods (in their existing form) primarily compress the upstream communication, as the sparsity patterns on the updates from different clients will generally differ. If the number of participating clients is greater than the inverse sparsity rate, which can easily be the case in federated learning, the downstream update will not even be compressed at all.

3) Dense quantization methods reduce the entropy of the weight updates by restricting all updates to a reduced set of values. Bernstein et al. [22] propose signSGD, a compression method with theoretical convergence guarantees on i.i.d. data that quantizes every gradient update to its binary sign, thus reducing the bit size per update by a factor of ×32. signSGD also incorporates download compression by aggregating the binary updates from all clients by means of a majority vote. Other authors propose to stochastically quantize the gradients during upload in an unbiased way (TernGrad [19], quantized stochastic gradient descent (QSGD) [20], ATOMO [21]). These methods are theoretically appealing, as they inherit the convergence properties of regular SGD under relatively mild assumptions. However, their empirical performance and compression rates do not match those of sparsification methods.
Out of all the above-listed methods, only federated averaging and signSGD compress both the upstream and downstream communications. All other methods are of limited utility in the federated learning setting defined in Section II, as they leave the communication from the server to the clients uncompressed.

Notation: In the following, calligraphic $\mathcal{W}$ will refer to the entirety of parameters of a neural network, while regular uppercase $W$ refers to one specific tensor of parameters within $\mathcal{W}$ and lowercase $w$ refers to one single scalar parameter of the network. Arithmetic operations between the neural network parameters are to be understood elementwise.
V. LIMITATIONS OF EXISTING COMPRESSION METHODS

The related work on efficient distributed deep learning almost exclusively considers i.i.d. data distributions among the clients, i.e., they assume unbiasedness of the local gradients with respect to the full-batch gradient according to

$\mathbb{E}_{x \sim p_i}[\nabla_{\mathcal{W}} \ell(x, \mathcal{W})] = \nabla_{\mathcal{W}} R(\mathcal{W}) \quad \forall i = 1, .., n$  (2)

where $p_i$ is the distribution of data on the $i$th client and $R(\mathcal{W})$ is the empirical risk function over the combined training data. While this assumption is reasonable for parallel training, where the distribution of data among the clients is chosen by the practitioner, it is typically not valid in the federated learning setting, where we can generally only hope for unbiasedness in the mean

$\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{x_i \sim p_i}[\nabla_{\mathcal{W}} \ell(x_i, \mathcal{W})] = \nabla_{\mathcal{W}} R(\mathcal{W})$  (3)

while the individual clients' gradients will be biased toward the local data set according to

$\mathbb{E}_{x \sim p_i}[\nabla_{\mathcal{W}} \ell(x, \mathcal{W})] = \nabla_{\mathcal{W}} R_i(\mathcal{W}) \neq \nabla_{\mathcal{W}} R(\mathcal{W}) \quad \forall i = 1, .., n.$  (4)

As it violates assumption (2), a non-i.i.d. distribution of the local data renders existing convergence guarantees, as formulated in [19]–[21] and [29], inapplicable and has dramatic effects on the practical performance of communication-efficient distributed training algorithms, as we will demonstrate in the following experiments.

A. Preliminary Experiments

We run preliminary experiments with a simplified version of the well-studied 11-layer VGG11 network [28], which we train on the CIFAR-10 [30] data set in a federated learning setup using ten clients. (We denote by VGG11* this simplified version of the original VGG11 architecture described in [28], where all dropout and batch normalization layers are removed and the number of convolutional filters and the size of all fully connected layers are reduced by a factor of 2.) For the i.i.d. setting, we split the training data randomly into equally sized shards and assign one shard to every one of the clients. For the "non-i.i.d. (m)" setting, we assign every client samples from exactly m classes of the data set. The data splits are nonoverlapping and balanced, such that every client ends up with the same number of data points. The detailed procedure that generates the split of data is described in Section B of the Appendix in the Supplementary Material. We also perform experiments with a simple logistic regression classifier, which we train on the MNIST data set [31] under the same setup of the federated learning environment. Both models are trained using momentum SGD. To make the results comparable, all compression methods use the same learning rate and batch size.
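For readers who want to reproduce a comparable setup, the sketch below shows one way to generate such a "non-i.i.d. (m)" split. It is our own illustration of the procedure summarized above (the authors' exact procedure is given in Section B of their Appendix), and all helper names are ours.

```python
import numpy as np

def non_iid_split(labels, num_clients, classes_per_client, seed=0):
    """Assign every client samples from exactly m = classes_per_client classes;
    shards are non-overlapping and roughly balanced in size."""
    rng = np.random.default_rng(seed)
    num_classes = int(np.max(labels)) + 1

    # Round-robin over class labels: client i covers classes i*m, i*m+1, ...
    # (mod num_classes), so every client sees exactly m distinct classes and
    # every class is shared by roughly the same number of clients.
    owners = {c: [] for c in range(num_classes)}
    for client in range(num_clients):
        for j in range(classes_per_client):
            owners[(client * classes_per_client + j) % num_classes].append(client)

    client_indices = [[] for _ in range(num_clients)]
    for c, client_list in owners.items():
        if not client_list:
            continue
        idx = rng.permutation(np.where(np.asarray(labels) == c)[0])
        for owner, chunk in zip(client_list, np.array_split(idx, len(client_list))):
            client_indices[owner].extend(chunk.tolist())
    return [np.asarray(ix) for ix in client_indices]
```

Calling, for instance, non_iid_split(train_labels, num_clients=10, classes_per_client=2) would correspond to a "non-i.i.d. (2)" setting with ten clients.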
B. Results

Fig. 2 shows the convergence speed in terms of gradient evaluations for the two models when trained using different methods for communication-efficient federated learning. We observe that while all compression methods achieve comparably fast convergence in terms of gradient evaluations on i.i.d. data, closely matching the uncompressed baseline (black line), they suffer considerably in the non-i.i.d. training settings. As this trend can also be observed for the logistic regression model, we can conclude that the underlying phenomenon is not unique to deep neural networks and also carries over to convex objectives. We will now analyze these results in detail for the different compression methods.

1) Federated Averaging: Most noticeably, federated averaging [10] (see orange line in Fig. 2), although specifically proposed for the federated learning setting, suffers considerably from non-i.i.d. data. This observation is consistent with Zhao et al. [32], who demonstrated that model accuracy can drop by up to 55% in non-i.i.d. learning environments
Fig. 3. Left: distribution of values for $\alpha_w(1)$ for the weight layer of logistic regression over the MNIST data set. Right: development of $\alpha(k)$ for increasing batch sizes. In the i.i.d. case, the batches are sampled randomly from the training data, while in the non-i.i.d. case, every batch contains samples from only exactly one class. For i.i.d. batches, the gradient sign becomes increasingly accurate with growing batch sizes. For non-i.i.d. batches of data, this is not the case. The gradient signs remain highly incongruent with the full-batch gradient, no matter how large the size of the batch.
This ternarization step reduces the entropy of the update from

$H_{\mathrm{sparse}} = -p \log_2(p) - (1-p)\log_2(1-p) + 32p$  (8)

Existing compression frameworks that were proposed for distributed training (see [19], [20], [23], [25]) only compress the communication from clients to the server, which is sufficient for applications where aggregation can be achieved via
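Expression (8) above, the per-parameter entropy of a sparse update whose nonzero entries still carry full 32-bit values, is easy to evaluate numerically; the small helper below is our own illustration, not code from the article.

```python
import numpy as np

def sparse_update_entropy(p: float, value_bits: int = 32) -> float:
    """Bits per parameter of a sparse (not yet ternarized) update as in (8):
    entropy of the sparsity pattern plus `value_bits` for every nonzero entry."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p) + value_bits * p

print(sparse_update_entropy(0.01))   # roughly 0.4 bits per parameter
```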
$A_i^{(t+1)} = A_i^{(t)} + \Delta\mathcal{W}_i^{(t+1)} - \Delta\tilde{\mathcal{W}}_i^{(t+1)}$  (18)

$A^{(t+1)} = A^{(t)} + \Delta\mathcal{W}^{(t+1)} - \Delta\tilde{\mathcal{W}}^{(t+1)}.$  (19)

We can express this new update rule for both upload and download compression (17) as a special case of pure upload compression (15) with generalized filter masks. Let $M_i$, $i = 1, .., n$ be the sparsifying filter masks used by the respective clients during the upload and $M$ be the one used during the download by the server. Then, we could arrive at the same sparse update $\Delta\tilde{\mathcal{W}}^{(t+1)}$ if all clients use filter masks $\tilde{M}_i = M_i \odot M$, where $\odot$ is the Hadamard product. We, thus, predict that training models using this new update rule should behave similarly to regular upstream-only sparsification, but with a slightly increased sparsity rate. We experimentally verify this prediction: Fig. 5 shows the accuracies achieved by VGG11 on CIFAR10, when trained in a federated learning environment with five clients for 10 000 iterations at different rates of upload and download compression. As we can see, for as long as download and upload sparsity are of the same order, sparsifying the download is not very harmful to the convergence and decreases the accuracy by at most 2% in both the i.i.d. and the non-i.i.d. case.
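The equivalence claimed here (a download mask applied to the averaged upload-masked updates equals upload-only sparsification with the combined masks $M_i \odot M$) is a simple algebraic identity, which the short NumPy check below illustrates with random masks. Real STC masks are magnitude-dependent and interact with the residuals, so treat this purely as an illustration of the identity and of the resulting effective sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10_000            # clients, parameters
p_up, p_down = 0.1, 0.1     # upload / download sparsity rates

updates = rng.normal(size=(n, d))           # the clients' updates Delta W_i
M_up = rng.random((n, d)) < p_up            # per-client upload masks M_i
M_down = rng.random(d) < p_down             # server download mask M

# Upload sparsification, averaging, then download sparsification ...
two_stage = M_down * np.mean(M_up * updates, axis=0)
# ... gives the same sparse update as upload-only sparsification with M_i * M.
combined = np.mean((M_up * M_down) * updates, axis=0)

assert np.allclose(two_stage, combined)
print((M_up * M_down).mean())   # effective sparsity, here p_up * p_down = 0.01
```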
C. Weight Update Caching for Partial Client Participation

Thus far, we have only been looking at scenarios in which all of the clients participate throughout the entire training process. However, as elaborated in Section II, in federated learning, typically only a fraction of the entire client population will participate in any particular communication round. As clients do not download the full model $\mathcal{W}^{(t)}$ but only compressed model updates $\Delta\tilde{\mathcal{W}}^{(t)}$, this introduces new challenges when it comes to keeping all clients synchronized.

To solve the synchronization problem and reduce the workload for the clients, we propose to use a caching mechanism on the server. Assume that the last $\tau$ communication rounds have produced the updates $\{\Delta\tilde{\mathcal{W}}^{(t)} \mid t = T-1, \ldots, T-\tau\}$. The server can cache all partial sums of these updates up until a certain point, $\{P^{(s)} = \sum_{t=1}^{s} \Delta\tilde{\mathcal{W}}^{(T-t)} \mid s = 1, .., \tau\}$, together with the global model $\mathcal{W}^{(T)} = \mathcal{W}^{(T-\tau-1)} + \sum_{t=1}^{\tau} \Delta\tilde{\mathcal{W}}^{(T-t)}$. Every client that wants to participate in the next communication round then has to first synchronize itself with the server by either downloading $P^{(s)}$ or $\mathcal{W}^{(T)}$, depending on how many previous communication rounds it has skipped. For general sparse updates, the bound on the entropy

$H(P^{(\tau)}) \leq \tau H(P^{(1)}) = \tau H(\Delta\tilde{\mathcal{W}}^{(T-1)})$  (20)
can be attained. This means that the size of the download will grow linearly with the number of rounds a client has skipped training. The average number of skipped rounds is equal to the inverse participation fraction $1/\eta$. This is usually tolerable, as the downlink typically is cheaper and has far higher bandwidth than the uplink, as already noted in [10] and [19]. Essentially, all compression methods that communicate only parameter updates instead of full models suffer from this same problem. This is also the case for signSGD, although here the size of the downstream update only grows logarithmically with the delay period according to

$H(P_{\mathrm{signSGD}}^{(\tau)}) \leq \log_2(2\tau + 1).$  (21)

Partial client participation also has effects on the convergence speed of federated training, both with delayed and sparsified updates. We will investigate these effects in detail in Section VII-C.
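A server-side cache of this kind can be sketched in a few lines. The class below is our own illustration of the idea: partial sums are formed on demand here rather than precomputed, and a client that has skipped more than $\tau$ rounds simply receives the full model.

```python
from collections import deque
import numpy as np

class UpdateCache:
    """Keeps the last tau compressed global updates so that returning clients
    can catch up by downloading an accumulated update instead of the full model."""

    def __init__(self, tau: int):
        self.updates = deque(maxlen=tau)      # most recent update first

    def push(self, delta_tilde: np.ndarray) -> None:
        self.updates.appendleft(delta_tilde)

    def catch_up(self, rounds_skipped: int, full_model: np.ndarray):
        """Return (payload, is_full_model) for a client that skipped >= 1 rounds."""
        if rounds_skipped <= len(self.updates):
            partial_sum = sum(list(self.updates)[:rounds_skipped])   # P(s)
            return partial_sum, False
        return full_model, True                                      # resync with W(T)
```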
D. Lossless Encoding

To communicate a set of sparse ternary tensors produced by STC, we only need to transfer the positions of the nonzero elements in the flattened tensors, along with one bit per nonzero update to indicate the mean sign $\mu$ or $-\mu$. Instead of communicating the absolute positions of the nonzero elements, it is favorable to communicate the distances between them. Assuming a random sparsity pattern, we know that for big values of $|\mathcal{W}|$ and $k = p|\mathcal{W}|$, the distances are approximately geometrically distributed with success probability equal to the sparsity rate $p$. Therefore, we can optimally encode the distances using the Golomb code [34]. The Golomb encoding reduces the average number of position bits to

$\bar{b}_{\mathrm{pos}} = b^* + \frac{1}{1 - (1-p)^{2^{b^*}}}$  (22)

with $b^* = 1 + \log_2\big(\log(\phi - 1)/\log(1 - p)\big)$ and $\phi = (\sqrt{5} + 1)/2$ being the golden ratio. For a sparsity rate of, e.g., $p = 0.01$, we get $\bar{b}_{\mathrm{pos}} = 8.38$, which translates to ×1.9 compression, compared to a naive distance encoding with 16 fixed bits. Both the encoding and the decoding scheme can be found in Section A of the Appendix (Algorithms A1 and A2) in the Supplementary Material. The updates are encoded both before upload and before download.
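As an illustration of this position encoding, the sketch below Rice/Golomb-encodes the gaps between nonzero entries of a random mask with a power-of-two Golomb parameter $2^{b}$. The exact rounding used for $b^*$ did not survive extraction, so we simply round to the nearest integer here, and the function names are ours rather than the paper's Algorithms A1/A2.

```python
import numpy as np

GOLDEN_RATIO = (5 ** 0.5 + 1) / 2

def rice_parameter(p: float) -> int:
    # b* from (22); the rounding convention is our assumption.
    return max(1, round(1 + np.log2(np.log(GOLDEN_RATIO - 1) / np.log(1 - p))))

def encode_positions(mask: np.ndarray, b: int) -> list:
    """Golomb/Rice-encode the gaps between nonzero positions of a flat mask:
    unary quotient (terminated by a 0) followed by a b-bit remainder."""
    positions = np.flatnonzero(mask)
    gaps = np.diff(np.concatenate(([-1], positions))) - 1   # non-negative gaps
    bits = []
    for g in gaps:
        q, r = divmod(int(g), 1 << b)
        bits += [1] * q + [0]
        bits += [(r >> i) & 1 for i in reversed(range(b))]
    return bits

p = 0.01
mask = np.random.default_rng(0).random(100_000) < p
b = rice_parameter(p)
code = encode_positions(mask, b)
print(b, len(code) / mask.sum())   # average position bits, close to (22)
```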
The complete compression framework that features upstream and downstream compression via sparsification, ternarization, and optimal encoding of the updates is described in Algorithm 2.

Algorithm 2 Efficient Federated Learning With Parameter Server via STC

input: initial parameters W
output: improved parameters W
init: all clients C_i, i = 1, .., [Number of Clients] are initialized with the same parameters W_i ← W. Every client holds a different data set D_i, with |{y : (x, y) ∈ D_i}| = [Classes per Client], of size |D_i| = φ_i |∪_j D_j|. The residuals are initialized to zero: ΔW, R_i, R ← 0.
for t = 1, .., T do
    for i ∈ I_t ⊆ {1, .., [Number of Clients]} in parallel do
        Client C_i does:
            msg ← download_{S→C_i}(msg)
            ΔW ← decode(msg)
            W_i ← W_i + ΔW
            ΔW_i ← R_i + SGD(W_i, D_i, b) − W_i
            ΔW̃_i ← STC_{p_up}(ΔW_i)
            R_i ← ΔW_i − ΔW̃_i
            msg_i ← encode(ΔW̃_i)
            upload_{C_i→S}(msg_i)
    end
    Server S does:
        gather_{C_i→S}(ΔW̃_i), i ∈ I_t
        ΔW ← R + (1/|I_t|) Σ_{i∈I_t} ΔW̃_i
        ΔW̃ ← STC_{p_down}(ΔW)
        R ← ΔW − ΔW̃
        W ← W + ΔW̃
        msg ← encode(ΔW̃)
        broadcast_{S→C_i}(msg), i = 1, .., M
end
return W
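To connect Algorithm 2 to running code, here is a minimal NumPy sketch of one communication round. The stc helper reflects our reading of the sparsify-and-ternarize step (keep the top p-fraction of entries by magnitude and replace them with their signed mean magnitude), local_sgd stands in for b steps of minibatch SGD on the client's data, and encoding/decoding is omitted; treat it as an illustration under those assumptions, not as the authors' reference implementation.

```python
import numpy as np

def stc(delta: np.ndarray, p: float) -> np.ndarray:
    """Sparse ternary compression of a flat update: top-k by magnitude,
    ternarized to {-mu, 0, +mu} with mu the mean magnitude of the kept entries."""
    k = max(1, int(p * delta.size))
    idx = np.argpartition(np.abs(delta), -k)[-k:]
    mu = np.abs(delta[idx]).mean()
    out = np.zeros_like(delta)
    out[idx] = mu * np.sign(delta[idx])
    return out

def client_round(W_i, R_i, broadcast, local_sgd, p_up):
    """Apply the downstream update, train locally, compress the residual-corrected
    update for upload, and keep what the compression discarded as the new residual."""
    W_i = W_i + broadcast
    delta = R_i + (local_sgd(W_i) - W_i)
    delta_tilde = stc(delta, p_up)
    return W_i, delta - delta_tilde, delta_tilde    # synced weights, residual, upload

def server_round(W, R, uploads, p_down):
    """Average the client uploads, add the server residual, and compress again
    for the downstream broadcast."""
    delta = R + np.mean(uploads, axis=0)
    delta_tilde = stc(delta, p_down)
    return W + delta_tilde, delta - delta_tilde, delta_tilde
```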
VII. EXPERIMENTS

We evaluate our proposed communication protocol on four different learning tasks and compare its performance to federated averaging and signSGD in a wide variety of different federated learning environments.

Models and Data Sets: To cover a broad spectrum of learning problems, we evaluate on differently sized convolutional and recurrent neural networks for the relevant federated learning tasks of image classification and speech recognition:

VGG11* on CIFAR: We train a modified version of the popular 11-layer VGG11 network [28] on the CIFAR [30] data set. We simplify the VGG11 architecture by reducing the number of convolutional filters to [32, 64, 128, 128, 128, 128, 128, 128] in the respective convolutional layers and reducing the size of the hidden fully connected layers to 128. We also remove all dropout layers and batch-normalization layers, as the regularization is no longer required. Batch normalization has been observed to perform very poorly with both small batch sizes and non-i.i.d. data [35], and we do not want this effect to obscure the investigated behavior. The resulting VGG11* network still achieves 85.46% accuracy on the validation set after 20 000 iterations of training with a constant learning rate of 0.16 and contains 865 482 parameters.

CNN on KWS: We train the four-layer convolutional neural network (CNN) from [27] on the speech commands data set [36]. The speech commands data set consists of 51 088 different speech samples of specific keywords. There are 30 different keywords in total, and every speech sample is of 1-s duration. Like [32], we restrict ourselves to the subset of the ten most common keywords. For every speech command, we extract the Mel spectrogram from the short-time Fourier transform,
learning environment with different degrees of client participation. To isolate the effects of reduced participation, we keep the absolute number of participating clients and the local batch sizes at constant values of 5 and 40, respectively, throughout all experiments and vary only the total number of clients (and thus the relative participation η). As we can see, reducing the participation rate has negative effects on both federated averaging and STC. The causes for these negative effects, however, are different. In federated averaging, the participation rate is proportional to the effective amount of data that the training is conducted on in any individual communication round. If a nonrepresentative subset of clients is selected to participate in a particular communication round of federated averaging, this can steer the optimization process away from the minimum and might even cause catastrophic forgetting [38] of previously learned concepts. On the other hand, partial participation reduces the convergence speed of STC by causing the clients' residuals to go out of sync and increasing the gradient staleness [25]. The more rounds a client has to wait before it is selected to participate during training again, the more outdated its accumulated gradients become. We can observe this behavior for STC most strongly in the non-i.i.d. situation (see Fig. 8, left), where the accuracy steadily decreases with the participation rate. However, even in the extreme case where only 5 out of 400 clients participate in every round of training, STC still achieves higher accuracy than federated averaging and signSGD. If the clients hold i.i.d. data (see Fig. 8, right), STC suffers much less from a reduced participation rate than federated averaging. If only 5 out of 400 clients participate in every round, STC (without momentum) still manages to achieve an accuracy of 68.2%, while federated averaging stagnates at 42.3% accuracy. signSGD is affected the least by reduced participation, which is unsurprising, as only the absolute number of participating clients would have a direct influence on its performance. Similar behavior can be observed on all other benchmarks, and the results can be found in Fig. A3 in the Appendix in the Supplementary Material. It is noteworthy that in federated learning, it is usually possible for the server to exercise some control over the rate of client participation. For instance, it is typically possible to increase the participation ratio at the cost of a long waiting time for all clients to finish.
3) Unbalancedness: Up until now, all experiments were performed with a balanced split of data in which every client was assigned the same amount of data points. In practice, however, the data sets on different clients will typically vary heavily in size. To simulate different degrees of unbalancedness, we split the data among the clients in a way such that the $i$th out of $n$ clients is assigned a fraction

$\varphi_i(\alpha, \gamma) = \frac{\alpha}{n} + (1 - \alpha)\frac{\gamma^i}{\sum_{j=1}^{n} \gamma^j}$  (23)

of the total data. The parameter α controls the minimum amount of data on every client, while the parameter γ controls the concentration of data. We fix α = 0.1 and vary γ between 0.9 and 1.0 in our experiments. To amplify the effects of unbalanced client data, we also set the client participation to a low value of only 5 out of 200 clients. Fig. 9 shows the final accuracy achieved after 20 000 iterations for different values of γ. Interestingly, the unbalancedness of the data does not seem to have a significant effect on the performance of either of the compression methods. Even if the data are highly concentrated on a few clients (as is the case for γ = 0.9), all methods converge reliably, and for federated averaging, the accuracy even slightly goes down with increased balancedness. Apparently, the rare participation of large clients can balance out several communication rounds with much smaller clients. These results also carry over to all other benchmarks (see Fig. A5 in the Appendix in the Supplementary Material).
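The client fractions defined in (23) are straightforward to compute; the helper below is our own illustration of that formula and of how strongly γ concentrates the data.

```python
import numpy as np

def client_fractions(n: int, alpha: float = 0.1, gamma: float = 0.9) -> np.ndarray:
    """Data fraction per client according to (23): a guaranteed floor of alpha/n
    plus a geometrically concentrated share of the remaining (1 - alpha)."""
    i = np.arange(1, n + 1)
    weights = gamma ** i
    return alpha / n + (1 - alpha) * weights / weights.sum()

phi = client_fractions(200, alpha=0.1, gamma=0.9)
print(phi.sum(), phi.max() / phi.min())   # fractions sum to 1; ratio shows the imbalance
```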
D. Communication Efficiency

Finally, we compare the different compression methods with respect to the number of iterations and communicated bits they require to achieve a certain target accuracy on a federated learning task. As we saw in Section V, both federated averaging and signSGD perform considerably worse if clients hold non-i.i.d. data or use small batch sizes. To still have a meaningful comparison, we, therefore, choose to evaluate this time on an i.i.d. environment where every client holds ten different classes and uses a moderate batch size of 20 during training. This setup favors federated averaging and signSGD to the maximum degree possible! All other parameters of the learning environment are set to the base configuration given in Table III. We train until the target accuracy is achieved or a maximum amount of iterations is exceeded and measure the amount of communicated bits both for upload and download.

Fig. 10 shows the results for VGG11* trained on CIFAR, CNN trained on keyword spotting (KWS), and the LSTM model trained on Fashion-MNIST. We can see that even if all clients hold i.i.d. data, STC still manages to achieve the desired target accuracy within the smallest communication budget out of all methods. STC also converges faster in terms of training iterations than the versions of federated averaging with comparable compression rate. Unsurprisingly, we see that both for federated averaging and STC, we face a tradeoff between the number of training iterations ("computation") and the number of communicated bits ("communication"). On all investigated benchmarks, however, STC is Pareto-superior to federated averaging in the sense that for any fixed iteration complexity, it achieves a lower (upload) communication complexity. Table IV shows the amount of upstream and downstream communication required to achieve the target accuracy for the different methods in megabytes. On the CIFAR learning task, STC at a sparsity rate of p = 0.0025 only communicates 183.9 MB worth of data, which is a reduction in communication by a factor of ×199.5 as compared to the baseline, which requires 36 696 MB, and to federated averaging (n = 100), which still requires 1606 MB. Federated averaging with a delay period of 1000 steps does not achieve the target accuracy within the given iteration budget.

VIII. LESSONS LEARNED

We will now summarize the findings of this article and give general suggestions on how to approach communication-constrained federated learning problems (see our summarizing Fig. 11).
TABLE IV
Bits required for upload and/or download to achieve a certain target accuracy on different learning tasks in an i.i.d. learning environment. A value of "n.a." in the table signifies that the method has not achieved the target accuracy within the iteration budget. The learning environment is configured as described in Table III.
Fig. 11. Left: accuracy achieved by VGG11* on CIFAR after 20 000 iterations of federated training with federated averaging and STC for three different configurations of the learning environment. Right: upstream and downstream communication necessary to achieve a validation accuracy of 84% with federated averaging and STC on the CIFAR benchmark under i.i.d. data and a moderate batch size.

1) If clients hold non-i.i.d. data, sparse communication protocols such as STC distinctively outperform federated averaging across all federated learning environments [see Figs. 6, 7 (left), and 8 (left)].
2) The same holds true if clients are forced to train on small minibatches (e.g., because the hardware is memory constrained). In these situations, STC outperforms federated averaging even if the clients' data are i.i.d. [see Fig. 7 (right)].

IX. CONCLUSION

Federated learning for mobile and IoT applications is a challenging task, as generally little to no control can be exerted over the properties of the learning environment.

In this article, we demonstrated that the convergence behavior of current methods for communication-efficient federated learning is very sensitive to these properties. On a variety of different data sets and model architectures, we observe that the convergence speed of federated averaging drastically decreases in learning environments where the clients either hold non-i.i.d. subsets of data, are forced to train on small minibatches, or where only a small fraction of clients participates in every communication round.

To address these issues, we propose STC, a communication protocol that compresses both the upstream and downstream communications via sparsification, ternarization, error
accumulation, and optimal Golomb encoding. Our experiments show that STC is far more robust to the above-mentioned peculiarities of the learning environment than federated averaging. Moreover, STC converges faster than federated averaging both with respect to the number of training iterations and the amount of communicated bits, even if the clients hold i.i.d. data and use moderate batch sizes during training.

Our approach can be understood as an alternative paradigm for communication-efficient federated optimization that relies on high-frequency low-volume instead of low-frequency high-volume communication. As such, it is particularly well suited for federated learning environments that are characterized by low-latency and low-bandwidth channels between clients and server.

REFERENCES

[1] R. Taylor, D. Baron, and D. Schmidt, "The world in 2025: 8 Predictions for the next 10 years," in Proc. 10th Int. Microsyst., Packag., Assembly Circuits Technol. Conf. (IMPACT), 2015, pp. 192–195.
[2] S. Wiedemann, K.-R. Müller, and W. Samek, "Compact and computationally efficient representation of deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., to be published. doi: 10.1109/TNNLS.2019.2910073.
[3] S. Wiedemann, A. Marban, K.-R. Müller, and W. Samek, "Entropy-constrained training of deep neural networks," in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), 2019, pp. 1–8.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[5] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2015, pp. 3128–3137.
[6] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018.
[7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1725–1732.
[8] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[9] W. Samek, T. Wiegand, and K.-R. Müller, "Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models," ITU J., ICT Discoveries, vol. 1, no. 1, pp. 39–48, 2018.
[10] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," 2016, arXiv:1602.05629. [Online]. Available: https://arxiv.org/abs/1602.05629
[11] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, "How to backdoor federated learning," 2018, arXiv:1807.00459. [Online]. Available: https://arxiv.org/abs/1807.00459
[12] K. Bonawitz et al., "Practical secure aggregation for privacy-preserving machine learning," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 1175–1191.
[13] S. Hardy et al., "Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption," 2017, arXiv:1711.10677. [Online]. Available: https://arxiv.org/abs/1711.10677
[14] M. Abadi et al., "Deep learning with differential privacy," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 308–318.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proc. IEEE CVPR, Jun. 2017, vol. 1, no. 2, p. 3.
[17] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Sparse binary compression: Towards distributed deep learning with minimal communication," in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), 2019, pp. 1–8.
[18] K. Bonawitz et al., "Towards federated learning at scale: System design," 2019, arXiv:1902.01046. [Online]. Available: https://arxiv.org/abs/1902.01046
[19] W. Wen et al., "TernGrad: Ternary gradients to reduce communication in distributed deep learning," 2017, arXiv:1705.07878. [Online]. Available: https://arxiv.org/abs/1705.07878
[20] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1707–1718.
[21] H. Wang, S. Sievert, Z. Charles, D. Papailiopoulos, S. Liu, and S. Wright, "ATOMO: Communication-efficient learning via atomic sparsification," 2018, arXiv:1806.04090. [Online]. Available: https://arxiv.org/abs/1806.04090
[22] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, "signSGD: Compressed optimisation for non-convex problems," 2018, arXiv:1802.04434. [Online]. Available: https://arxiv.org/abs/1802.04434
[23] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," 2017, arXiv:1704.05021. [Online]. Available: https://arxiv.org/abs/1704.05021
[24] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. 16th Annu. Conf. Int. Speech Commun. Assoc., 2015, pp. 1488–1492.
[25] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," 2017, arXiv:1712.01887. [Online]. Available: https://arxiv.org/abs/1712.01887
[26] Y. Tsuzuku, H. Imachi, and T. Akiba, "Variance-based gradient compression for efficient distributed deep learning," 2018, arXiv:1802.06058. [Online]. Available: https://arxiv.org/abs/1802.06058
[27] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016, arXiv:1610.05492. [Online]. Available: https://arxiv.org/abs/1610.05492
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[29] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, "signSGD with majority vote is communication efficient and byzantine fault tolerant," 2018, arXiv:1810.05291. [Online]. Available: https://arxiv.org/abs/1810.05291
[30] A. Krizhevsky, V. Nair, and G. Hinton. (2014). The CIFAR-10 Dataset. [Online]. Available: http://www.cs.toronto.edu/kriz/cifar.html
[31] Y. LeCun. (1998). The MNIST Database of Handwritten Digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[32] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," 2018, arXiv:1806.00582. [Online]. Available: https://arxiv.org/abs/1806.00582
[33] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 4447–4458.
[34] S. Golomb, "Run-length encodings (corresp.)," IEEE Trans. Inf. Theory, vol. 12, no. 3, pp. 399–401, Jul. 1966.
[35] S. Ioffe, "Batch renormalization: Towards reducing minibatch dependence in batch-normalized models," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1945–1953.
[36] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," 2018, arXiv:1804.03209. [Online]. Available: https://arxiv.org/abs/1804.03209
[37] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," 2017, arXiv:1708.07747. [Online]. Available: https://arxiv.org/abs/1708.07747
[38] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," 2013, arXiv:1312.6211. [Online]. Available: https://arxiv.org/abs/1312.6211

Felix Sattler received the B.Sc. degree in mathematics, the M.Sc. degree in computer science, and the M.Sc. degree in applied mathematics from the Technische Universität Berlin, Berlin, Germany, in 2016, 2018, and 2018, respectively.
He is currently with the Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin. His current research interests include distributed machine learning, neural networks, and multitask learning.
Simon Wiedemann received the M.Sc. degree in applied mathematics from the Technische Universität Berlin, Berlin, Germany, in 2017.
He is currently with the Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin. His current research interests include machine learning, neural networks, and information theory.

Dr. Müller was elected to be a member of the German National Academy of Sciences-Leopoldina in 2012 and the Berlin Brandenburg Academy of Sciences in 2017, and an External Scientific Member of the Max Planck Society in 2017. He received the 1999 Olympus Prize awarded by the German Pattern Recognition Society, DAGM. He received the SEL Alcatel Communication Award in 2006, the Science Prize of Berlin awarded by the Governing Mayor of Berlin in 2014, and the Vodafone Innovation Award in 2017.