0% found this document useful (0 votes)
35 views14 pages

Robust and Communication-Efficient Federated Learning From Non-I.i.d. Data

The document discusses federated learning, a privacy-preserving method that allows multiple parties to collaboratively train deep learning models without sharing local data. It introduces a new compression framework called sparse ternary compression (STC) designed to address the communication overhead challenges in federated learning, particularly in non-i.i.d. data scenarios. The proposed STC method significantly outperforms existing techniques by enabling efficient communication while maintaining model accuracy across various learning tasks.

Uploaded by

apogne
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views14 pages

Robust and Communication-Efficient Federated Learning From Non-I.i.d. Data

The document discusses federated learning, a privacy-preserving method that allows multiple parties to collaboratively train deep learning models without sharing local data. It introduces a new compression framework called sparse ternary compression (STC) designed to address the communication overhead challenges in federated learning, particularly in non-i.i.d. data scenarios. The proposed STC method significantly outperforms existing techniques by enabling efficient communication while maintaining model accuracy across various learning tasks.

Uploaded by

apogne
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

3400 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO.

9, SEPTEMBER 2020

Robust and Communication-Efficient Federated


Learning From Non-i.i.d. Data
Felix Sattler , Simon Wiedemann, Klaus-Robert Müller, Member, IEEE,
and Wojciech Samek , Member, IEEE

Abstract— Federated learning allows multiple parties to jointly I. I NTRODUCTION


train a deep learning model on their combined data, without
any of the participants having to reveal their local data to a
centralized server. This form of privacy-preserving collaborative
learning, however, comes at the cost of a significant communica-
T HREE major developments are currently transforming the
ways how data are created and processed: First of all,
with the advent of the Internet of Things (IoT), the number of
tion overhead during training. To address this problem, several intelligent devices in the world has rapidly grown in the last
compression methods have been proposed in the distributed
couple of years. Many of these devices are equipped with var-
training literature that can reduce the amount of required
communication by up to three orders of magnitude. These ious sensors and increasingly potent hardware that allow them
existing methods, however, are only of limited utility in the to collect and process data at unprecedented scales [1]–[3].
federated learning setting, as they either only compress the In a concurrent development, deep learning has revolution-
upstream communication from the clients to the server (leaving ized the ways that information can be extracted from data
the downstream communication uncompressed) or only perform
resources with groundbreaking successes in areas such as com-
well under idealized conditions, such as i.i.d. distribution of the
client data, which typically cannot be found in federated learning. puter vision, natural language processing, or voice recognition,
In this article, we propose sparse ternary compression (STC), among many others [4]–[9]. Deep learning scales well with
a new compression framework that is specifically designed to growing amounts of data and its astounding successes in recent
meet the requirements of the federated learning environment. times can be at least partly attributed to the availability of
STC extends the existing compression technique of top-k gradient
very large data sets for training. Therefore, there lays huge
sparsification with a novel mechanism to enable downstream
compression as well as ternarization and optimal Golomb encod- potential in harnessing the rich data provided by IoT devices
ing of the weight updates. Our experiments on four different for the training and improving deep learning models [10].
learning tasks demonstrate that STC distinctively outperforms At the same time, data privacy has become a growing
federated averaging in common federated learning scenarios. concern for many users. Multiple cases of data leakage and
These results advocate for a paradigm shift in federated optimiza-
tion toward high-frequency low-bitwidth communication, in par-
misuse in recent times have demonstrated that the centralized
ticular in the bandwidth-constrained learning environments. processing of data comes at high risk for the end users privacy.
As IoT devices usually collect data in private environments,
Index Terms— Deep learning, distributed learning, efficient
communication, federated learning, privacy-preserving machine
often even without explicit awareness of the users, these
learning. concerns hold particularly strong. It is, therefore, generally
not an option to share this data with a centralized entity that
could conduct training of a deep learning model. In other
Manuscript received March 6, 2019; revised June 28, 2019; accepted situations, local processing of the data might be desirable for
September 25, 2019. Date of publication November 1, 2019; date of current
version September 1, 2020. This work was supported in part by the Fraunhofer other reasons such as increased autonomy of the local agent.
Society through the MPI-FhG Collaboration Project “Theory & Practice for This leaves us facing the following dilemma: How are we
Reduced Learning Machines,” in part by the German Ministry for Education going to make use of the rich combined data of millions of IoT
and Research as Berlin Big Data Center under Grant 01IS14013A, in part by
the Berlin Center for Machine Learning under Grant 01IS18037I, in part devices for training deep learning models if this data cannot
by DFG under Grant EXC 2046/1 and Grant 390685689, and in part be stored at a centralized location?
by the Information & Communications Technology Planning & Evaluation Federated learning resolves this issue as it allows multiple
(IITP) Grant funded by the Korea Government under Grant 2017-0-00451.
(Corresponding authors: Klaus-Robert Müller; Wojciech Samek.) parties to jointly train a deep learning model on their combined
F. Sattler, S. Wiedemann, and W. Samek are with the Fraunhofer Hein- data, without any of the participants having to reveal their data
rich Hertz Institute, 10587 Berlin, Germany (e-mail: wojciech.samek@ to a centralized server [10]. This form of privacy-preserving
hhi.fraunhofer.de).
K.-R. Müller is with the Technische Universität Berlin, 10587 Berlin, collaborative learning is achieved by following a simple
Germany, with the Max Planck Institute for Informatics, 66123 Saarbrücken, three-step protocol illustrated in Fig. 1. In the first step, all
Germany, and also with the Department of Brain and Cognitive Engineer- participating clients download the latest master model W from
ing, Korea University, Seoul 136-713, South Korea (e-mail: klaus-robert.
[email protected]). the server. Next, the clients improve the downloaded model,
This article has supplementary downloadable material available at based on their local training data using stochastic gradient
http://ieeexplore.ieee.org, provided by the authors. descent (SGD). Finally, all participating clients upload their
Color versions of one or more of the figures in this article are available
online at http://ieeexplore.ieee.org. locally improved models Wi back to the server, where they
Digital Object Identifier 10.1109/TNNLS.2019.2944481 are gathered and aggregated to form a new master model
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3401

the true update size and the minimal update size (which is
given by the entropy). If we assume the size of the model
and number of training iterations to be fixed (e.g., because
we want to achieve a certain accuracy on a given task),
this leaves us with three options to reduce communication:
1) we can reduce the communication frequency f ; 2) reduce
the entropy of the weight updates H (W up/down) via lossy
compression schemes; and/or 3) use more efficient encodings
to communicate the weight updates, thus reducing η.

II. C HALLENGES OF THE F EDERATED


L EARNING E NVIRONMENT
Before we can consider ways to reduce the amount of
communication, we first have to take into account the unique
Fig. 1. Federated learning with a parameter server. Illustrated is one characteristics, which distinguish federated learning from other
communication round of distributed SGD. (a) Clients synchronize with the
server. (b) Clients compute a weight update independently based on their distributed training settings such as parallel training (compare
local data. (c) Clients upload their local weight updates to the server, where also with [10]). In federated learning, the distribution of both
they are averaged to produce the new master model. training data and computational resources is a fundamental
and fixed property of the learning environment. This entails
the following challenges.
(in practice, weight updates W = W new − W old can be
communicated instead of full models W, which is equivalent 1) Unbalanced and non-i.i.d. data: As the training data
as long as all clients remain synchronized). These steps are present on the individual clients is collected by the
repeated until a certain convergence criterion is satisfied. clients themselves based on their local environment and
Observe that when following this protocol, training data never usage pattern, both the size and the distribution of
leave the local devices as only model updates are communi- the local data sets will typically vary heavily between
cated. Although it has been shown that in adversarial settings different clients.
information about the training data can still be inferred from 2) Large number of clients: Federated learning environ-
these updates [11], additional mechanisms, such as homo- ments may constitute of multiple millions of partici-
morphic encryption of the updates [12], [13] or differentially pants [18]. Furthermore, as the quality of the collab-
private training [14], can be applied to fully conceal any oratively learned model is determined by the combined
information about the local data. available data of all clients, collaborative learning envi-
A major issue in federated learning is the massive commu- ronments will have a natural tendency to grow.
nication overhead that arises from sending around the model 3) Parameter server: Once the number of clients grows
updates. When naively following the protocol described ear- beyond a certain threshold, direct communication of
lier, every participating client has to communicate a full model weight updates becomes unfeasible because the work-
update during every training iteration. Every such update is load for both communication and aggregation of updates
of the same size as the trained model, which can be in the grows linearly with the number of clients. In federated
range of gigabytes for modern architectures with millions of learning, it is, therefore, unavoidable to communicate
parameters [15], [16]. Over the course of multiple hundred via an intermediate parameter server. This reduces the
thousands of training iterations on big data sets, the total amount of communication per client and communica-
communication for every client can easily grow to more than tion rounds to one single upload of a local weight
a petabyte [17]. Consequently, if communication bandwidth is update to and one download of the aggregated update
limited or communication is costly (naive), federated learning from the server and moves the workload of aggregation
can become unproductive or even completely unfeasible. away from the clients. Communicating via a parame-
The total amount of bits that have to be uploaded and ter server, however, introduces an additional challenge
downloaded by every client during training is given by to communication-efficient distributed training, as now
both the upload to the server and the download from
bup/down ∈ O(Niter × f × |W| × (H (W up/down) + η)) the server need to be compressed in order to reduce
     
# updates update size
communication time and energy consumption.
(1) 4) Partial participation: In the general federated learning
for IoT setting, it can generally not be guaranteed that
where Niter is the total number of training iterations all clients participate in every communication round.
(forward–backward passes) performed by every client, f is Devices might lose their connection, run out of bat-
the communication frequency, |W| is the size of the tery or seize to contribute to the collaborative training
model, H (W up/down) is the entropy of the weight updates for other reasons.
exchanged during upload and download, respectively, and η is 5) Limited battery and memory: Mobile and embedded
the inefficiency of the encoding, i.e., the difference between devices often are not connected to a power grid.
3402 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

TABLE I existing compression schemes in that it requires both fewer


D IFFERENT M ETHODS FOR C OMMUNICATION -E FFICIENT D ISTRIBUTED gradient evaluations and communicated bits to converge to
D EEP L EARNING P ROPOSED IN THE L ITERATURE . N ONE OF THE
E XISTING M ETHODS S ATISFIES A LL R EQUIREMENTS (R1)–(R3)
a given target accuracy (see Section IX). These results also
OF THE F EDERATED L EARNING E NVIRONMENT. W E C ALL extend to the i.i.d. regime.
A M ETHOD “R OBUST TO N ON - I . I . D . D ATA” IF THE
F EDERATED T RAINING C ONVERGES I NDEPENDENT
OF THE L OCAL D ISTRIBUTION OF C LIENT D ATA . IV. R ELATED W ORK
W E C ALL C OMPRESSION R ATES G REATER
T HAN ×32 “S TRONG ” AND T HOSE In the broader realm of communication-efficient distributed
S MALLER OR E QUAL TO deep learning, a wide variety of methods has been proposed
×32 “W EAK ”
to reduce the amount of communication during the train-
ing process. Using (1) as a reference, we can organize the
substantial existing research body on communication-efficient
distributed deep learning into three different groups.
1) Communication delay methods reduce the communica-
tion frequency f . McMahan et al. [10] propose federated
averaging where instead of communicating after every
iteration, every client performs multiple iterations of
SGD to compute a weight update. McMahan et al.
observe that on different convolutional and recurrent
neural network architectures, communication can be
delayed for up to 100 iterations without significantly
affecting the convergence speed as long as the data are
distributed among the clients in an i.i.d. manner. The
Instead, their capacity to run computations is limited
amount of communication can be reduced even further
by a finite battery. Performing iterations of SGD is
with longer delay periods; however, this comes at the
notoriously expensive for deep neural networks. It is,
cost of an increased number of gradient evaluations.
therefore, necessary to keep the number of gradient
In a follow-up work, Konečnỳ et al. [27] combine
evaluations per client as small as possible. Mobile and
this communication delay with random sparsification
embedded devices also typically have only very limited
and probabilistic quantization. They restrict the clients
memory. As the memory footprint of SGD grows lin-
to learn random sparse weight updates or force ran-
early with the batch size, this might force the devices to
dom sparsity on them afterward (“structured” versus
train on very small batch sizes.
“sketched” updates) and combine this sparsification with
Based on the above-mentioned characterization of the feder- probabilistic quantization. Their method, however, sig-
ated learning environment, we conclude that a communication- nificantly slows down convergence speed in terms of
efficient distributed training algorithm for federated learning SGD iterations. Communication delay methods automat-
needs to fulfil the following requirements. ically reduce both upstream and downstream communi-
(R1): It should compress both upstream and downstream cation and are proven to work with large numbers of
communications. clients and partial client participation.
(R2): It should be robust to non-i.i.d., small batch sizes, and 2) Sparsification methods reduce the entropy H (W) of
unbalanced data. the updates by restricting changes to only a small subset
(R3): It should be robust to large numbers of clients and of the parameters. Strom [24] presents an approach
partial client participation. (later modified by [26]) in which only gradients with
a magnitude greater than a certain predefined threshold
III. C ONTRIBUTION are sent to the server. All other gradients are accumu-
In this article, we will demonstrate that none of the exist- lated in a residual. This method is shown to achieve
ing methods proposed for communication-efficient federated upstream compression rates of up to three orders of
learning satisfies all of these requirements (see Table I). More magnitude on an acoustic modeling task. In practice,
concretely, we will show that the methods that are able to however, it is hard to choose appropriate values for
compress both upstream and downstream communications are the threshold, as it may vary a lot for different archi-
very sensitive to non-i.i.d. data distributions, while the meth- tectures and even different layers. To overcome this
ods that are more robust to this type of data do not compress issue, Aji and Heafield [23] instead fix the sparsity rate
the downstream (see Section V). We will then proceed to and only communicate the fraction p entries with the
construct a new efficient communication protocol for federated biggest magnitude of each gradient while also collecting
learning that resolves these issues and meets all requirements all other gradients in a residual. At a sparsity rate
(R1)–(R3). We provide a convergence analysis of our method of p = 0.001, their method only slightly degrades the
as well as extensive empirical results on four different neural convergence speed and final accuracy of the trained
network architectures and data sets that demonstrate that the model. Lin et al. [25] present minor modifications to the
sparse ternary compression (STC) protocol is superior to the work of Aji and Heafield [23] that even close this small
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3403

performance gap. Sparsification methods have been pro- the practitioner, it is typically not valid in the federated learn-
posed primarily with the intention to speed up parallel ing setting where we can generally only hope for unbiasedness
training in the data center. Their convergence properties in the mean
in the much more challenging federated learning envi-
1
n
ronments have not yet been investigated. Sparsification Ex i ∼ pi [∇W l(x i , W)] = ∇W R(W) (3)
methods (in their existing form) primarily compress the n
i=1
upstream communication, as the sparsity patterns on while the individual client’s gradients will be biased toward
the updates from different clients will generally differ. the local data set according to
If the number of participating clients is greater than the
inverse sparsity rate, which can easily be the case in Ex∼ pi [∇W l(x, W)] = ∇W Ri (W) = ∇W R(W) ∀i = 1, .., n.
federated learning, the downstream update will not even (4)
be compressed at all.
3) Dense quantization methods reduce the entropy of the As it violates assumption (2), a non-i.i.d. distribution of
weight updates by restricting all updates to a reduced set the local data renders existing convergence guarantees, as
of values. Bernstein et al. [22] propose signSGD, a com- formulated in [19]–[21] and [29], inapplicability and has dra-
pression method with theoretical convergence guarantees matic effects on the practical performance of communication-
on i.i.d. data that quantizes every gradient update to its efficient distributed training algorithms as we will demonstrate
binary sign, thus reducing the bit size per update by in the following experiments.
a factor of ×32. signSGD also incorporates download
compression by aggregating the binary updates from A. Preliminary Experiments
all clients by means of a majority vote. Other authors We run preliminary experiments with a simplified version of
propose to stochastically quantize the gradients during the well-studied 11-layer VGG11 network [28], which we train
upload in an unbiased way (TernGrad [19], quantized on the CIFAR-10 [30] data set in a federated learning setup
stochastic gradient descent (QSGD) [20], ATOMO [21]). using ten clients. For the i.i.d. setting, we split the training
These methods are theoretically appealing, as they data randomly into equally sized shards and assign one shard
inherit the convergence properties of regular SGD under to every one of the clients. For the “non-i.i.d. (m)” setting,
relatively mild assumptions. However, their empirical we assign every client samples from exactly m classes of
performance and compression rates do not match those the data set. The data splits are nonoverlapping and balanced,
of sparsification methods. such that every client ends up with the same number of data
Out of all the above-listed methods, only federated aver- points. The detailed procedure that generates the split of data
aging and signSGD compress both the upstream and down- is described in Section B of the Appendix in the Supplemen-
stream communications. All other methods are of limited tary Material. We also perform experiments with a simple
utility in the federated learning setting defined in Section II, logistic regression classifier, which we train on the MNIST
as they leave the communication from the server to the clients data set [31] under the same setup of the federated learning
uncompressed. environment. Both models are trained using momentum SGD.
Notation: In the following, calligraphic W will refer to To make the results comparable, all compression methods use
the entirety of parameters of a neural network, while regular the same learning rate and batch size.
uppercase W refers to one specific tensor of parameters within
W and lowercase w refers to one single scalar parameter of
B. Results
the network. Arithmetic operations between the neural network
parameters are to be understood elementwise. Fig. 2 shows the convergence speed in terms of gradient
evaluations for the two models when trained using differ-
V. L IMITATIONS OF E XISTING C OMPRESSION M ETHODS ent methods for communication-efficient federated learning.
We observe that while all compression methods achieve com-
The related work on efficient distributed deep learning parably fast convergence in terms of gradient evaluations
almost exclusively considers i.i.d. data distributions among the on i.i.d. data, closely matching the uncompressed baseline
clients, i.e., they assume unbiasedness of the local gradients (black line), they suffer considerably in the non-i.i.d. training
with respect to the full-batch gradient according to settings. As this trend can be observed also for the logis-
Ex∼ pi [∇W l(x, W)] = ∇W R(W) ∀i = 1, .., n (2) tic regression model, we can conclude that the underlying
phenomenon is not unique to deep neural networks and also
where pi is the distribution of data on the i th client and R(W) carries over to convex objectives. We will now analyze these
is the empirical risk function over the combined training data. results in detail for the different compression methods.
While this assumption is reasonable for parallel training 1) Federated Averaging: Most noticeably, federated aver-
where the distribution of data among the clients is chosen by aging [10] (see orange line in Fig. 2), although specifically
proposed for the federated learning setting, suffers consid-
0 We denote by VGG11* a simplified version of the original VGG11 archi-
erably from non-i.i.d. data. This observation is consistent
tecture described in [28], where all dropout and batch normalization layers are
removed and the number of convolutional filters and size of all fully connected with Zhao et al. [32] who demonstrated that model accuracy
layers is reduced by a factor of 2. can drop by up to 55% in non-i.i.d. learning environments
3404 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

Fig. 3. Left: distribution of values for αw (1) for the weight layer of
logistic regression over the MNIST data set. Right: development of α(k) for
increasing batch sizes. In the i.i.d. case, the batches are sampled randomly
from the training data, while in the non-i.i.d. case, every batch contains
samples from only exactly one class. For i.i.d. batches, the gradient sign
becomes increasingly accurate with growing batch sizes. For non-i.i.d. batches
of data, this is not the case. The gradient signs remain highly incongruent with
the full-batch gradient, no matter how large the size of the batch.

be the gradient over the entire training data D. Then, we can


Fig. 2. Convergence speed when using different compression methods during define this probability by
the training of VGG11*2 on CIFAR-10 and logistic regression on MNIST and   k
Fashion-MNIST in a distributed setting with ten clients for i.i.d. and non-i.i.d. αw (k) = P sign gw = sign(gw ) . (6)
data. In the non-i.i.d. cases, every client only holds examples from exactly two
respectively one of the ten classes in the data set. All compression methods
suffer from degraded convergence speed in the non-i.i.d. situation, but sparse We can also compute the mean statistic
top-k is affected by far the least.
1 
α(k) = αw (k) (7)
|W|
w∈W
compared to the i.i.d. ones. They attribute the loss in accuracy
to the increased weight divergence between the clients and to estimate the average congruence over all parameters of the
propose to side-step the problem by assigning a shared public network.
i.i.d. data set to all clients. While this approach can indeed cre- Fig. 3 (left) exemplary shows the distribution of values for
ate more accurate models, it also has multiple shortcomings, αw (1) within the weights of logistic regression on MNIST
the most crucial one being that we generally cannot assume at the beginning of training. As we can see, at a batch size
of 1, gw 1 is a very bad predictor of the true gradient sign
the availability of such a public data set. If a public data set
were to exist, one could use it to pretrain a model at the server, with a very high variance and an average congruence of
which is not consistent with the assumptions typically made in α(1) = 0.51 just slightly higher than random. The sensitivity
federated learning. Furthermore, if all clients share (part of) the of signSGD to non-i.i.d. data becomes apparent once we
same public data set, overfitting to this shared data can become inspect the development of the gradient sign congruence for
a serious issue. This effect will be particularly severe in highly increasing batch sizes. Fig. 3 (right) shows this development
distributed settings where the number of data points on every for batches of increasing size sampled from an i.i.d. and non-
client is small. Finally, even when sharing a relatively large i.i.d. distribution. For the latter one, every sampled batch
data set between the clients, the original accuracy achieved in only contains data from exactly one class. As we can see,
the i.i.d. situation cannot be fully restored. For these reasons, for i.i.d. data, α quickly grows with increasing batch size,
we believe that the data-sharing strategy proposed by [32] resulting in increasingly accurate updates. For non-i.i.d. data,
is an insufficient workaround to the fundamental problem of however, the congruence stays low, independent of the size
federated averaging having convergence issues on non-i.i.d. of the batch. This means that if clients hold highly non-i.i.d.
data. subsets of data, signSGD updates will only weakly correlate
2) SignSGD: The quantization method signSGD [29] (see with the direction of steepest descent, no matter how large of
green line in Fig. 2) suffers from even worse stability issues a batch size is chosen for training.
in the non-i.i.d. learning environment. The method completely 3) Top-k Sparsification: Out of all existing compression
fails to converge on the CIFAR benchmark, and even for the methods, top-k sparsification (see blue line in Fig. 2) suffers
convex logistic regression objective, the training plateaus at a least from non-i.i.d. data. For VGG11 on CIFAR the train-
substantially degraded accuracy. ing still converges reliably even if every client only holds
To understand the reasons for these convergence issues, data from exactly one class, and for the logistic regression
we have to investigate how likely it is for a single batch classifier trained on MNIST, the convergence does not slow
gradient to have the “correct” sign. Let down at all. We hypothesize that this robustness to non-i.i.d.
data is due to mainly two reasons. First of all, the frequent
1
k
communication of weight updates between the clients prevents
k
gw = ∇w l(x i , W) (5)
k them from diverging too far from one another, and hence, top-k
i=1
sparsification does not suffer from weight divergence [32] as it
be the batch gradient over a specific minibatch of data D k = is the case for federated averaging. Second, sparsification does
{x 1 , . . . , x k } ⊂ D of size k at parameter w. Let, further, gw not destabilize the training nearly as much as signSGD does
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3405

since the noise in the stochastic gradients is not amplified by to


quantization. Although top-k sparsification shows promising
performance on non-i.i.d. data, its utility is limited in the HSTC = − p log2 ( p) − (1 − p) log2 ( p) + p (9)
federated learning setting as it only directly compresses the when compared to the regular sparsification. At a sparsity
upstream communication. rate of p = 0.01, the additional compression achieved by
Table I summarizes our findings. None of the existing ternarization is Hsparse/HSTC = 4.414. In order to achieve
compression methods supports both download compression the same compression gains by pure sparsification, one would
and properly works with non-i.i.d. data. have to increase the sparsity rate by approximately the same
factor.
VI. S PARSE T ERNARY C OMPRESSION Using a theoretical framework developed by
Top-k sparsification shows the most promising performance Stich et al. [33], we can prove the convergence of STC
in distributed learning environments with non-i.i.d. client data. under standard assumptions on the loss function. The proof
We will use this observation as a starting point to construct relies on bounding the impact of the perturbation caused by
an efficient communication protocol for federated learning. the compression operator. This is formalized in the following
To arrive at this protocol, we will solve three open problems definition.
that prevent the direct application of top-k sparsification to Definition 1 (k-Contraction) [33]: For a parameter
federated learning. 0 < k ≤ d, a k-contraction is an operator comp : Rd → Rd
that satisfies the contraction property
1) We will further increase the efficiency of our method by
employing quantization and optimal lossless coding of k
Ex − comp(x)2 ≤ 1 − x2 ∀x ∈ Rd . (10)
the weight updates. d
2) We will incorporate downstream compression into the We can show that STC indeed is a k-contraction.
method to allow for efficient communication from server Lemma 2: STCk as defined in Algorithm 1 is a
to clients. k̃-contraction, with
3) We will implement a caching mechanism to keep
topk (x)21
the clients synchronized in case of partial client 0 < k̃ = d ≤ d. (11)
participation. kx22
The proof can be found in Appendix E in the Supplementary
A. Ternarizing Weight Updates Material. It then directly follows from [33, Th. 2.4] that for
any L-smooth, μ-strongly convex objective function f with
Regular top-k sparsification, as proposed in [23] and [25], bounded gradients EW2 ≤ G 2 , the update rule
communicates the fraction of largest elements at full precision,  (t )
while all other elements are not communicated at all. In our W (t +1) := W (t ) − STCk A(t ) + ηWit (12)
previous work (Sattler et al. [17]), we already demonstrated (t +1) (t ) (t +1)  (t +1)
A := A + Wit − STCk Wit (13)
that this imbalance in update precision is wasteful in the dis-
tributed training setting and that higher compression gains can converges according to
be achieved when sparsification is combined with quantization ⎛ 2 ⎞ ⎛ 3 ⎞
d
of the nonzero elements. G 2 G 2 μL d
G2
E[ f (WT )]− f ∗ ≤ O +O ⎝ k̃ ⎠ +O ⎝ k̃ ⎠.
2 3
We adopt the method described in [17] to the federated μT μT 2 μT 3
learning setting and quantize the remaining top-k elements
of the sparsified updates to the mean population magnitude, (14)
leaving us with a ternary tensor containing values {−μ, 0, μ}. This means that for T ∈ O((d/k̃)((L/μ))1/2 ), STC converges
The quantization method is formalized in Algorithm 1. at rate O((G 2 /μT )), which is the same as for regular SGD!
Preliminary experiments are in line with our theoretical
Algorithm 1 STC findings. Fig. 4 shows the final accuracy of the VGG11* model
1 input: flattened tensor T ∈ Rn , sparsity p when trained at different sparsity levels with and without
2 output: sparse ternary tensor T ∗ ∈ {−μ, 0, μ}n ternarization. As we can see, additional ternarization does
3 · k ← max(np, 1) only have a negligible effect on the convergence speed and
4 · v ← topk (|T |) sometimes does even increase the final accuracy of the trained
5 · mask ← (|T | ≥ v) ∈ {0, 1}n model. It seems evident that a combination of sparsity and
6 · T masked ← mask T quantization makes more efficient use of the communication
7 · μ ← k1 ni=1 |Timasked | budged than pure sparsification.
8 return T ∗ ← μ × sign(T masked )
B. Extending to Downstream Compression

This ternarization step reduces the entropy of the update Existing compression frameworks that were proposed for
from distributed training (see [19], [20], [23], [25]) only compress
the communication from clients to the server, which is suffi-
Hsparse = − p log2 ( p) − (1 − p) log2 ( p) + 32 p (8) cient for applications where aggregation can be achieved via
3406 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

an all-reduce operation. However, in the federated learning


setting, where the clients have to download the aggregated
weight-updates from the server, this approach is not feasible,
as it will lead to a communication bottleneck.
To illustrate this point, let STCk : Rn → Rn , W → W ˜
be the compression operator that maps a (flattened) weight
update W to a sparsified and ternarized weight update W ˜
(t )
according to Algorithm 1. For local weight updates Wi ,
the update rule for STC can then be written as
1
n
 (t +1) (t )
W (t +1) = STCk Wi + Ai (15)
n
i=1
   Fig. 4. Effects of ternarization at different levels of upload and download
(t+1)
˜ i
W sparsities. Displayed is the difference in final accuracy in % between a model
(t +1)
trained with sparse updates and a model trained with sparse binarized updates.
A(t
i
+1)
= A(t ) (t +1)
i + Wi
˜ i
− W (16) Positive numbers indicate better performance of the model trained with pure
sparsity. VGG11 trained on CIFAR10 for 16 000 iterations with five clients
(0) holding i.i.d. and non-i.i.d. data.
starting with an empty residual Ai = 0 ∈ Rn on all clients.
While the updates W ˜ (t
i
+1)
that are sent from clients to the
server are always sparse, the number of nonzero elements in
the update W (t +1) that is sent downstream grows linearly
with the amount of participating clients in the worst case. If the
participation rate exceeds the inverse sparsity 1/ p, the update
W (t +1) essentially becomes dense.
To resolve this issue, we propose to apply the same com-
pression mechanism that is used on the clients also at the
server side to compress the downstream communication. This
modifies the update rule to
⎛ ⎞
⎜1 n
 ⎟
W˜(t +1) = STCk ⎜
(t +1) (t )
⎝n STCk Wi + Ai +A(t )⎟

Fig. 5. Accuracy achieved by VGG11* when trained on CIFAR in a
i=1
   distributed setting with five clients for 16 000 iterations at different levels
(t+1) of upload and download sparsity. Sparsifying the updates for downstream
˜ i
W communication reduces the final accuracy by at most 3% when compared to
(17) using only upload sparsity.

with a client-side and a server-side residual updates

A(t +1)
= A(t ) (t +1) ˜ i (t +1) C. Weight Update Caching for Partial Client Participation
i i + Wi − W (18)
(t +1) This far we have only been looking at scenarios in which all
˜
A(t +1) = A(t ) + W (t +1) − W . (19) of the clients participate throughout the entire training process.
We can express this new update rule for both upload and However, as elaborated in Section II, in federated learning,
download compression (17) as a special case of pure upload typically only a fraction of the entire client population will
compression (15) with generalized filter masks. Let Mi , i = participate in any particular communication round. As clients
1, .., n be the sparsifying filter masks used by the respective do not download the full model W (t ) , but only compressed
clients during the upload and M be the one used during model updates W̃ (t ); this introduces new challenges when it
the download by the server. Then, we could arrive at the comes to keeping all clients synchronized.
same sparse update W ˜ (t +1) if all clients use filter masks To solve the synchronization problem and reduce the work-
M̃i = Mi M, where is the Hadamard product. We, thus, load for the clients, we propose to use a caching mechanism
on the server. Assume that the last τ communication rounds
predict that training models using this new update rule should
have produced the updates {W ˜ (t )|t = T −1, . . . , T −τ }. The
behave similar to regular upstream-only sparsification but with
a slightly increased sparsity rate. We experimentally verify this server can cache all partial sums of these updates up until a
certain point {P (s) = s ˜ (T −t ) |s = 1, .., τ } together
prediction: t =1 W
Fig. 5 shows the accuracies achieved by VGG11 on with the global model W (T ) = W (T −τ −1) + τt =1 W ˜ (T −t ) .
CIFAR10, when trained in a federated learning environment Every client that wants to participate in the next communica-
with five clients for 10 000 iterations at different rates of tion round then has to first synchronize itself with the server
upload and download compression. As we can see, for as long by either downloading P (s) or W (T ) , depending on how many
as download and upload sparsity are of the same order, spar- previous communication rounds it has skipped. For general
sifying the download is not very harmful to the convergence sparse updates, the bound on the entropy
and decreases the accuracy by at most 2% in both the i.i.d.
(T −1)
and the non-i.i.d. case. ˜
H (P (τ )) ≤ τ H (P (1)) = τ H (W ) (20)
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3407

can be attained. This means that the size of the download will Algorithm 2 Efficient Federated Learning With Parameter
grow linearly with the number of rounds a client has skipped Server Via STC
training. The average number of skipped rounds is equal to the 1 input: initial parameters W
inverse participation fraction 1/η. This is usually tolerable as 2 output: improved parameters W
the downlink typically is cheaper and has far higher bandwidth 3 init: all clients Ci , i = 1, .., [Number of Clients] are
than the uplink, as already noted in [10] and [19]. Essentially, initialized with the same parameters Wi ← W. Every
all compression methods that communicate only parameter Client holds a different data set Di , with
updates instead of full models suffer from this same problem. |{y : (x, y) ∈ Di }| = [Classes per Client] of size
This is also the case for signSGD although here the size of the |Di | = ϕi | ∪ j D j |. The residuals are initialized to zero
downstream update only grows logarithmically with the delay W, Ri , R ← 0.
period according to 4 for t = 1, .., T do
(τ ) 5 for i ∈ It ⊆ {1, .., [Number of Clients]} in parallel do
H (PsignSG D ) ≤ log2 (2τ + 1). (21)
6 Client Ci does:
Partial client participation also has effects on the conver- 7 · msg ← downloadS→Ci (msg)
gence speed of federated training, both with delayed and 8 · W ← decode(msg)
sparsified updates. We will investigate these effects in detail 9 · Wi ← Wi + W
in Section VII-C. 10 · Wi ← Ri + SGD(Wi , Di , b) − Wi
11 · W ˜ i ← STC pu p (Wi )
D. Lossless Encoding 12 · Ri ← Wi − W ˜ i
13 · msgi ← encode(W ˜ i)
To communicate a set of sparse ternary tensors produced
by STC, we only need to transfer the positions of the nonzero 14 · uploadCi →S (msgi )
elements in the flattened tensors, along with one bit per 15 end
nonzero update to indicate the mean sign μ or −μ. Instead of 16 Server S does:
communicating the absolute positions of the nonzero elements, 17 · gatherCi →S (W ˜ i ), i ∈ It
it is favorable to communicate the distances between them. 18 · W ← R + |I1t | i∈It W ˜ i
Assuming a random sparsity pattern we know that for big 19 · W ˜ ← STC pdown (W)
values of |W | and k = p|W |, the distances are approximately 20 · R ← W − W ˜
geometrically distributed with success probability equal to 21 · W ← W + W ˜
the sparsity rate p. Therefore, we can optimally encode the ˜
22 · msg ← encode(W)
distances using the Golomb code [34]. The Golomb encoding
23 · broadcastS→Ci (msg), i = 1, .., M
reduces the average number of position bits to
24 end
1 25 return W
b̄pos = b∗ + b∗
(22)
1 − (1 − p)2

√ b = 1 + log2 ((log(φ − 1)/ log(1 − p))) and φ =
with
( 5 + 1/2) being the golden ratio. For a sparsity rate of e.g., VGG11* on CIFAR: We train a modified version of the
p = 0.01, we get b̄pos = 8.38, which translates to ×1.9 com- popular 11-layer VGG11 network [28] on the CIFAR [30]
pression, compared to a naive distance encoding with 16 fixed data set. We simplify the VGG11 architecture by reducing the
bits. Both the encoding and the decoding scheme can be found number of convolutional filters to [32, 64, 128, 128, 128, 128,
in Section A of the Appendix (Algorithms A1 and A2) in the 128, 128] in the respective convolutional layers and reducing
Supplementary Material. The updates are encoded both before the size of the hidden fully-connected layers to 128. We also
upload and before download. remove all dropout layers and batch-normalization layers as
The complete compression framework that features the regularization is no longer required. Batch normalization
upstream and downstream compression via sparsification, has been observed to perform very poorly with both small
ternarization, and optimal encoding of the updates is described batch sizes and non-i.i.d. data [35], and we do not want
in Algorithm 2. this effect to obscure the investigated behavior. The result-
ing VGG11* network still achieves 85.46% accuracy on the
VII. E XPERIMENTS validation set after 20 000 iterations of training with a constant
We evaluate our proposed communication protocol on four learning rate of 0.16 and contains 865 482 parameters.
different learning tasks and compare its performance to feder- CNN on KWS: We train the four-layer convolutional neural
ated averaging and signSGD in a wide a variety of different network (CNN) from [27] on the speech commands data
federated learning environments. set [36]. The speech commands data set consists of 51 088 dif-
Models and Data Sets: To cover a broad spectrum of ferent speech samples of specific keywords. There are 30 dif-
learning problems, we evaluate on differently sized con- ferent keywords in total, and every speech sample is of 1-s
volutional and recurrent neural networks for the relevant duration. Like [32], we restrict us to the subset of the ten most
federated learning tasks of image classification and speech common keywords. For every speech command, we extract
recognition: the Mel spectrogram from the short-time Fourier transform,
3408 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

TABLE II TABLE III


M ODELS AND H YPERPARAMETERS . T HE L EARNING R ATE BASE C ONFIGURATION OF THE F EDERATED L EARNING
I S K EPT C ONSTANT T HROUGHOUT T RAINING E NVIRONMENT IN O UR E XPERIMENTS

which results in a 32 × 32 feature map. The CNN architecture


achieves 89.12% accuracy after 10 000 training iterations and
has 876 938 parameters in total.
LSTM on Fashion-MNIST : We also train an Long Short-
Term Memory (LSTM) network with two hidden layers of
size 128 on the Fashion-MNIST data set [37]. The Fashion-
MNIST data set contains 60 000 train and 10 000 validation
greyscale images of ten different fashion items. Every 28 × 28
Fig. 6. Robustness of different compression methods to the non-i.i.d.-ness
image is treated as a sequence of 28 features of dimensionality of client data on four different benchmarks. VGG11* trained on CIFAR. STC
28 and fed as such in the many-to-one LSTM network. distinctively outperforms federated averaging on non-i.i.d. data. The learning
After 20 000 training iterations with a learning rate of 0.04, environment is configured as described in Table III. Dashed lines signify that
a momentum of m = 0.9 was used.
the LSTM model achieves 90.21% accuracy on the validation
set. The model contains 216 330 parameters.
Logistic Regression on MNIST : Finally, we also train a sim-
ple logistic regression classifier on the MNIST [31] data set.
The MNIST data set contains 60 000 training and 10 000 test
greyscale images of handwritten digits of size 28 × 28. The
trained logistic regression classifier achieves 92.31% accuracy
on the test set and contains 7850 parameters.
The different learning tasks are summarized in Table II.
In the following, we will primarily discuss the results for
VGG11* trained on CIFAR; however, the described phenom-
ena carry over to all other benchmarks and the supporting
experimental results can be found in the Appendix in the Fig. 7. Maximum accuracy achieved by the different compression methods
Supplementary Material. when training VGG11* on CIFAR for 20 000 iterations at varying batch sizes
in a federated learning environment with ten clients and full participation.
Compression Methods: We compare our proposed STC Left: Every client holds data from exactly two different classes. Right: Every
method at a sparsity rate of p = 1/400 with federated client holds an i.i.d. subset of data.
averaging at an “equivalent” delay period of n = 400 iterations
and signSGD with a coordinatewise step size of δ = 0.0002.
local batch size to 20 and assign every client an equally
At a sparsity rate of p = 1/400, STC compresses updates both
sized subset of the training data containing samples from ten
during upload and download by roughly a factor of ×1050.
different classes. In the following experiments, if not explicitly
A delay period of n = 400 iterations for federated averaging
signified otherwise, all hyperparameters will default to this
results in a slightly smaller compression rate of ×400. Further
base configuration summarized in Table III. We will use the
analysis on the effects of the sparsity rate p and delay period
short notations “Clients: ηN/N” and “Classes: c” to refer to a
n on the convergence speed of STC and federated averaging
setup of the federated learning environment in which a random
can be found in Section C of the Appendix in the Supplemen-
subset of ηN out of a total of N clients participates in every
tary Material. During our experiments, we keep all training
communication round and every client is holding data from
related hyperparameters constant for the different compression
exactly c different classes.
methods. To be able to compare the different methods in a
fair way, all methods are given the same budged of training
iterations in the following experiments (one communication A. Momentum in Federated Optimization
round of federated averaging uses up n iterations, where n is We start out by investigating the effects of momentum
the number of local iterations). optimization on the convergence behavior of the different
Learning Environment: The federated learning environment compression methods. Figs. 6–9 show the final accuracy
described in Algorithm 2 can be fully characterized by five achieved by federated averaging (n = 400), STC ( p = 1/400),
parameters. For the base configuration, we set the number and signSGD after 20 000 training iterations in a variety of
of clients to 100, the participation ratio to 10%, and the different federated learning environments. In Figs. 6–9, dashed
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3409

B. Non-i.i.d.-ness of the Data


Our preliminary experiments in Section V have already
demonstrated that the convergence behavior of both federated
averaging and signSGD is very sensitive to the degree of i.i.d.-
ness of the local client data, whereas sparse communication
seems to be more robust. We will now investigate this behavior
in some more detail. Fig. 6 shows the maximum achieved
generalization accuracy after a fixed number of iterations for
VGG11* trained on CIFAR at different levels of non-i.i.d.-
ness. Additional results on all other benchmarks can be found
in Fig. A2 in the Appendix in the Supplementary Material.
Both at full (left plot) and partial (right plot) client participa-
Fig. 8. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iter-
ations of communication-efficient federated training with different compres- tions, STC outperforms federated averaging across all levels
sion methods. The relative client participation fraction is varied between 100% of i.i.d.-ness. The most distinct difference can be observed in
(5/5) and 5% (5/100). Left: Every client holds data from exactly two different the non-i.i.d. regime, where the individual clients hold less
classes. Right: Every client holds an i.i.d. subset of data.
than five different classes. Here, STC (without momentum)
outperforms both federated averaging and signSGD by a wide
margin. In the extreme case where every client only holds data
from exactly one class, STC still achieves 79.5% and 53.2%
accuracy at full and partial client participations, respectively,
while both federated averaging and signSGD fail to converge
at all.

C. Robustness to Other Parameters of


the Learning Environment
We will now proceed to investigate the effects of other
parameters of the learning environment on the convergence
Fig. 9. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iter- behavior of the different compression methods. Figs. 7–9 show
ations of communication-efficient federated training with different compres-
sion methods. The training data are split among the client at different degrees the maximum achieved accuracy after training VGG11* on
of unbalancedness with γ varying between 0.9 and 1.0. CIFAR for 20 000 iterations in different federated learning
environments. Additional results on the three other bench-
lines refer to experiments where the momentum of m = 0.9 marks can be found in Section D in the Appendix in the
was used during training, while solid lines signify that classical Supplementary Material.
SGD was used. As we can see, momentum has a significant We observe that STC (without momentum) consistently
influence on the convergence behavior of the different meth- dominates federated averaging on all benchmarks and learning
ods. While signSGD always performs distinctively better if environments.
momentum is turned on during the optimization, the picture 1) Local Batch Size: The memory capacity of mobile and
is less clear for STC and federated averaging. We can make IoT devices is typically very limited. As the memory footprint
out three different parameters of the learning environment of SGD is proportional to the batch size used during training,
that determine whether momentum is beneficial or harmful clients might be restricted to train on small minibatches only.
to the performance of STC. If the participation rate is high Fig. 7 shows the influence of the local batch size on the
and the batch size used during training is sufficiently large performance of different communication-efficient federated
(see Fig. 7 left), momentum improves the performance of learning techniques exemplary for VGG11* trained on CIFAR.
STC. Conversely, momentum will deteriorate the training First of all, we notice that using momentum significantly
performance in situations where training is carried out on small slows down the convergence speed of both STC and federated
batches and with low client participation. The latter effect is averaging at batch sizes smaller than 20 independent of the
increasingly strong if clients hold non-i.i.d. subsets of data distribution of data among the clients. As we can see, even if
[see Fig. 6 (right)]. These results are not surprising, as the the training data is distributed among the clients in an i.i.d.
issues with stale momentum described in [25] are enhanced manner (see Fig. 7 right) and all clients participate in every
in these situations. Similar relationships can be observed for training iteration, federated averaging suffers considerably
federated averaging where again the size (see Fig. 7) and the from small batch sizes. STC, on the other hand, demonstrates
heterogeneity (see Fig. 6) of the local minibatches determine to be far more robust to this type of constraint. At an extreme
whether the momentum will have a positive effect on the batch size of one, the model trained with STC still achieves
training performance or not. an accuracy of 63.8%, while the federated averaging model
When we compare federated averaging, signSGD and STC only reaches 39.2% after 20 000 training iterations.
in the following, we will ignore whichever version of these 2) Client Participation Fraction: Fig. 8 shows the conver-
methods (momentum “on” or “off”) performs worse. gence speed of VGG11* trained on CIFAR10 in a federated
3410 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

learning environment with different degrees of client partici- the final accuracy achieved after 20 000 iterations for different
pation. To isolate the effects of reduced participation, we keep values of γ . Interestingly, the unbalancedness of the data does
the absolute number of participating clients and the local batch not seem to have a significant effect on the performance of
sizes at constant values of 5 and 40, respectively, throughout either of the compression methods. Even if the data are highly
all experiments and vary only the total number of clients (and concentrated on a few clients (as is the case for γ = 0.9),
thus the relative participation η). As we can see, reducing the all methods converge reliably, and for federated averaging,
participation rate has negative effects on both federated aver- the accuracy even slightly goes down with increased balanced-
aging and STC. The causes for these negative effects, however, ness. Apparently, the rare participation of large clients can
are different. In federated averaging, the participation rate is balance out several communication rounds with much smaller
proportional to the effective amount of data that the training clients. These results also carry over to all other benchmarks
is conducted on in any individual communication round. If a (see Fig. A5 in the Appendix in the Supplementary Material).
nonrepresentative subset of clients is selected to participate
in a particular communication round of federated averag- D. Communication Efficiency
ing, this can steer the optimization process away from the Finally, we compare the different compression methods
minimum and might even cause catastrophic forgetting [38] with respect to the number of iterations and communicated
of previously learned concepts. On the other hand, partial bits they require to achieve a certain target accuracy on a
participation reduces the convergence speed of STC by causing federated learning task. As we saw in Section V, both federated
the clients residuals to go out sync and increasing the gradient averaging and signSGD perform considerably worse if clients
staleness [25]. The more rounds a client has to wait before hold non-i.i.d. data or use small batch sizes. To still have a
it is selected to participate during training again, the more meaningful comparison, we, therefore, choose to evaluate this
outdated its accumulated gradients become. We can observe time on an i.i.d. environment where every client holds ten
this behavior for STC most strongly in the non-i.i.d. situation different classes and uses a moderate batch size of 20 during
(see Fig. 8 left), where the accuracy steadily decreases with the training. This setup favors federated averaging and signSGD
participation rate. However, even in the extreme case where to the maximum degree possible! All other parameters of the
only 5 out of 400 clients participate in every round of training, learning environment are set to the base configuration given
STC still achieves higher accuracy than federated averaging in Table III. We train until the target accuracy is achieved or a
and signSGD. If the clients hold i.i.d. data (see Fig. 8 maximum amount of iterations is exceeded and measure the
right), STC suffers much less from a reduced participation amount of communicated bits both for upload and download.
rate than federated averaging. If only 5 out of 400 clients Fig. 10 shows the results for VGG11* trained on CIFAR, CNN
participate in every round, STC (without momentum) still trained on keyword spotting (KWS), and the LSTM model
manages to achieve an accuracy of 68.2% while federated trained on Fashion-MNIST. We can see that even if all clients
averaging stagnates at 42.3% accuracy. signSGD is affected the hold i.i.d. data, STC still manages to achieve the desired target
least by reduced participation, which is unsurprising, as only accuracy within the smallest communication budget out of
the absolute number of participating clients would have a all methods. STC also converges faster in terms of training
direct influence on its performance. Similar behavior can be iterations than the versions of federated averaging with com-
observed on all other benchmarks, and the results can be found parable compression rate. Unsurprisingly, we see that both for
in Fig. A3 in the Appendix in the Supplementary Material. It is federated averaging and STC, we face a tradeoff between the
noteworthy that in federated learning, it is usually possible number of training iterations (“computation”) and the number
for the server to exercise some control over the rate of client of communicated bits (“communication”). On all investigated
participation. For instance, it is typically possible to increase benchmarks, however, STC is Pareto-superior to federated
the participation ratio at the cost of a long waiting time for averaging in the sense for any fixed iteration complexity,
all clients to finish. it achieves a lower (upload) communication complexity.
3) Unbalancedness: Up until now, all experiments were Table IV shows the amount of upstream and downstream
performed with a balanced split of data in which every client communications required to achieve the target accuracy for
was assigned the same amount of data points. In practice, the different methods in megabytes. On the CIFAR learning
however, the data sets on different clients will typically vary task, STC at a sparsity rate of p = 0.0025 only communicates
heavily in size. To simulate different degrees of unbalanced- 183.9 MB worth of data, which is a reduction in commu-
ness, we split the data among the clients in a way such that nication by a factor of ×199.5 as compared to the baseline
the i th out of n clients is assigned a fraction with requires 36696 MB and federated averaging (n = 100),
α γi which still requires 1606 MB. Federated averaging with a
ϕi (α, γ ) = + (1 − α) n (23) delay period of 1000 steps does not achieve the target accuracy
j =1 γ
n j
within the given iteration budget.
of the total data. The parameter α controls the minimum
amount of data on every client, while the parameter γ controls VIII. L ESSONS L EARNED
the concentration of data. We fix α = 0.1 and vary γ between We will now summarize the findings of this article and
0.9 and 1.0 in our experiments. To amplify the effects of give general suggestions on how to approach communication-
unbalanced client data, we also set the client participation constrained federated learning problems (see our summarizing
to a low value of only 5 out of 200 clients. Fig. 9 shows Fig. 11).
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3411

TABLE IV
B ITS R EQUIRED FOR Upload and/ Download TO A CHIEVE A C ERTAIN
TARGET A CCURACY ON D IFFERENT L EARNING TASKS IN AN I . I . D .
L EARNING E NVIRONMENT. A VALUE OF “n.a.” IN THE TABLE
S IGNIFIES T HAT THE M ETHOD H AS N OT A CHIEVED THE
TARGET A CCURACY W ITHIN THE I TERATION B UDGET.
T HE L EARNING E NVIRONMENT I S C ONFIGURED
AS D ESCRIBED IN TABLE III

3) STC should also be preferred over federated averaging


if the client participation rate is expected to be low, as it
converges more stable and quickly in both the i.i.d. and
non-i.i.d. regime [see Fig. 8 (right)].
4) STC is generally most advantageous in situations where
the communication is bandwidth-constrained or costly
(metered network, limited battery), as it does achieve a
Fig. 10. Convergence speed of federated learning with compressed com-
munication in terms of training iterations (left) and uploaded bits (right) on certain target accuracy within the minimum amount of
three different benchmarks (top to bottom) in an i.i.d. federated learning communicated bits even on i.i.d. data (see Fig. 10 and
environment with 100 clients and 10% participation fraction. For better Table IV).
readability, the validation error curves are average-smoothed with a step size
of five. On all benchmarks, STC requires the least amount of bits to converge 5) Federated averaging in return should be used if the
to the target accuracy. communication is latency-constrained or if the client
participation is expected to be very low (and 1–3 do
not hold).
6) Momentum optimization should be avoided in federated
learning whenever either clients are training with small
batch sizes or the client data are non-i.i.d. and the
participation rate is low (see Figs. 6–8).

IX. C ONCLUSION
Fig. 11. Left: accuracy achieved by VGG11* on CIFAR after 20 000 iter- Federated learning for mobile and IoT applications is a
ations of federated training with federated averaging and STC for three challenging task, as generally little to no control can be exerted
different configurations of the learning environment. Right: upstream and over the properties of the learning environment.
downstream communication necessary to achieve a validation accuracy of 84%
with federated averaging and STC on the CIFAR benchmark under i.i.d. data In this article, we demonstrated that the convergence
and a moderate batch-size. behavior of current methods for communication-efficient
federated learning is very sensitive to these properties.
1) If clients hold non-i.i.d. data, sparse communication On a variety of different data sets and model architectures,
protocols such as STC distinctively outperform federated we observe that the convergence speed of federated averaging
averaging across all federated learning environments drastically decreases in learning environments where the
[see Figs. 6, 7 (left), and 8 (left)]. clients either hold non-i.i.d. subsets of data are forced to
2) The same holds true if clients are forced to train on train on small minibatches or where only a small fraction
small minibatches (e.g., because the hardware is mem- of clients participates in every communication round.
ory constrained). In these situations, STC outperforms To address these issues, we propose STC, a communication
federated averaging even if the client’s data are i.i.d. protocol that compresses both the upstream and downstream
[see Fig. 7 (right)]. communications via sparsification, ternarization, error
3412 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 9, SEPTEMBER 2020

accumulation, and optimal Golomb encoding. Our experiments [18] K. Bonawitz et al., “Towards federated learning at scale: System
show that STC is far more robust to the above-mentioned design,” 2019, arXiv:1902.01046. [Online]. Available: https://arxiv.
org/abs/1902.01046
peculiarities of the learning environment than federated [19] W. Wen et al., “TernGrad: Ternary gradients to reduce communication in
averaging. Moreover, STC converges faster than federated distributed deep learning,” 2017, arXiv:1705.07878. [Online]. Available:
averaging both with respect to the number of training iterations https://arxiv.org/abs/1705.07878
[20] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD:
and the amount of communicated bits even if the clients hold Communication-efficient SGD via gradient quantization and encoding,”
i.i.d. data and use moderate batch sizes during training. in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1707–1718.
[21] H. Wang, S. Sievert, Z. Charles, D. Papailiopoulos, S. Liu, and
Our approach can be understood as an alternative paradigm S. Wright, “ATOMO: Communication-efficient learning via atomic spar-
for communication-efficient federated optimization that relies sification,” 2018, arXiv:1806.04090. [Online]. Available: https://arxiv.
on high-frequent low-volume instead of low-frequent high- org/abs/1806.04090
[22] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar,
volume communication. As such, it is particularly well suited “signSGD: Compressed optimisation for non-convex problems,” 2018,
for federated learning environments that are characterized by arXiv:1802.04434. [Online]. Available: https://arxiv.org/abs/1802.04434
low latency and low bandwidth channels between clients and [23] A. F. Aji and K. Heafield, “Sparse communication for distributed gradi-
ent descent,” 2017, arXiv:1704.05021. [Online]. Available: https://arxiv.
server. org/abs/1704.05021
[24] N. Strom, “Scalable distributed DNN training using commodity GPU
cloud computing,” in Proc. 16th Annu. Conf. Int. Speech Commun.
R EFERENCES Assoc., 2015, pp. 1488–1492.
[25] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep
[1] R. Taylor, D. Baron, and D. Schmidt, “The world in 2025: 8 Predictions gradient compression: Reducing the communication bandwidth for
for the next 10 years,” in Proc. 10th Int. Microsyst., Packag., Assembly distributed training,” 2017, arXiv:1712.01887. [Online]. Available:
Circuits Technol. Conf. (IMPACT), 2015, pp. 192–195. https://arxiv.org/abs/1712.01887
[2] S. Wiedemann, K.-R. Müller, and W. Samek, “Compact and com- [26] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compres-
putationally efficient representation of deep neural networks,” IEEE sion for efficient distributed deep learning,” 2018, arXiv:1802.06058.
Trans. Neural Netw. Learn. Syst., to be published. doi: 10.1109/TNNLS. [Online]. Available: https://arxiv.org/abs/1802.06058
2019.2910073. [27] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh,
[3] S. Wiedemann, A. Marban, K.-R. Müller, and W. Samek, “Entropy- and D. Bacon, “Federated learning: Strategies for improving com-
constrained training of deep neural networks,” in Proc. IEEE Int. Joint munication efficiency,” 2016, arXiv:1610.05492. [Online]. Available:
Conf. Neural Netw. (IJCNN), 2019, pp. 1–8. https://arxiv.org/abs/1610.05492
[4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks
pp. 436–444, May 2015. for large-scale image recognition,” 2014, arXiv:1409.1556. [Online].
[5] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for Available: https://arxiv.org/abs/1409.1556
generating image descriptions,” in Proc. IEEE Conf. Comput. Vis. [29] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar,
Pattern Recognit. (CVPR), Jul. 2015, pp. 3128–3137. “signSGD with majority vote is communication efficient and byzan-
[6] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, tine fault tolerant,” 2018, arXiv:1810.05291. [Online]. Available:
“Deep neural networks for no-reference and full-reference image quality https://arxiv.org/abs/1810.05291
[30] A. Krizhevsky, V. Nair, and G. Hinton. (2014). The CIFAR-10 Dataset.
assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219,
[Online]. Available: http://www.cs.toronto.edu/kriz/cifar.html
Jan. 2018.
[31] Y. LeCun. (1998). The MNIST Database of Handwritten Digits. [Online].
[7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and Available: http://yann.lecun.com/exdb/mnist/
L. Fei-Fei, “Large-scale video classification with convolutional neural [32] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated
networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), learning with non-IID data,” 2018, arXiv:1806.00582. [Online]. Avail-
Jun. 2014, pp. 1725–1732. able: https://arxiv.org/abs/1806.00582
[8] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning [33] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with
with neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2014, memory,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 4447–4458.
pp. 3104–3112. [34] S. Golomb, “Run-length encodings (corresp.),” IEEE Trans. Inf. Theory,
[9] W. Samek, T. Wiegand, and K.-R. Müller, “Explainable artificial vol. 12, no. 3, pp. 399–401, Jul. 1966.
intelligence: Understanding, visualizing and interpreting deep learning [35] S. Ioffe, “Batch renormalization: Towards reducing minibatch depen-
models,” ITU J., ICT Discoveries, vol. 1, no. 1, pp. 39–48, 2018. dence in batch-normalized models,” in Proc. Adv. Neural Inf. Process.
[10] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and Syst., 2017, pp. 1945–1953.
B. A. y Arcas, “Communication-efficient learning of deep networks [36] P. Warden, “Speech commands: A dataset for limited-vocabulary
from decentralized data,” 2016, arXiv:1602.05629. [Online]. Available: speech recognition,” 2018, arXiv:1804.03209. [Online]. Available:
https://arxiv.org/abs/1602.05629 https://arxiv.org/abs/1804.03209
[11] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How [37] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel
to backdoor federated learning,” 2018, arXiv:1807.00459. [Online]. image dataset for benchmarking machine learning algorithms,” 2017,
Available: https://arxiv.org/abs/1807.00459 arXiv:1708.07747. [Online]. Available: https://arxiv.org/abs/1708.07747
[12] K. Bonawitz et al., “Practical secure aggregation for privacy-preserving [38] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and
machine learning,” in Proc. ACM SIGSAC Conf. Comput. Commun. Y. Bengio, “An empirical investigation of catastrophic forgetting
Secur., 2017, pp. 1175–1191. in gradient-based neural networks,” 2013, arXiv:1312.6211. [Online].
[13] S. Hardy et al., “Private federated learning on vertically partitioned data Available: https://arxiv.org/abs/1312.6211
via entity resolution and additively homomorphic encryption,” 2017,
arXiv:1711.10677. [Online]. Available: https://arxiv.org/abs/1711.10677
[14] M. Abadi et al., “Deep learning with differential privacy,” in Proc. ACM
SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 308–318. Felix Sattler received the B.Sc. degree in mathe-
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for matics, the M.Sc. degree in computer science, and
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. the M.Sc. degree in applied mathematics from the
(CVPR), Jun. 2016, pp. 770–778. Technische Universität Berlin, Berlin, Germany, in
[16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely 2016, 2018, and 2018, respectively.
connected convolutional networks,” in Proc. IEEE CVPR, vol. 1, He is currently with the Machine Learning Group,
Jun. 2017, no. 2, p. 3. Fraunhofer Heinrich Hertz Institute, Berlin. His cur-
[17] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary rent research interests include distributed machine
compression: Towards distributed deep learning with minimal commu- learning, neural networks, and multitask learning.
nication,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), 2019,
pp. 1–8.
SATTLER et al.: ROBUST AND COMMUNICATION-EFFICIENT FEDERATED LEARNING FROM NON-I.I.D. DATA 3413

Simon Wiedemann received the M.Sc. degree in Dr. Müller was elected to be a member of the German National Academy of
applied mathematics from the Technische Univer- Sciences-Leopoldina in 2012, the Berlin Brandenburg Academy of sciences
sität Berlin, Berlin, Germany, in 2017. in 2017, and an External Scientific Member of the Max Planck Society
He is currently with the Machine Learning Group, in 2017. He received the 1999 Olympus Prize by the German Pattern
Fraunhofer Heinrich Hertz Institute, Berlin. His cur- Recognition Society, DAGM. He received the SEL Alcatel Communication
rent research interests include machine learning, Award in 2006, the Science Prize of Berlin awarded by the Governing Mayor
neural networks, and information theory. of Berlin in 2014, and the Vodafone Innovation Award in 2017.

Wojciech Samek (M’13) received the Diploma


Klaus-Robert Müller (M’12) received the Ph.D. degree in computer science from the Humboldt
degree in computer science from the Technische University of Berlin, Berlin, Germany, in 2010, and
Universität Karlsruhe, Karlsruhe, Germany, in 1992, the Ph.D. degree in machine learning from the
where he studied physics from 1984 to 1989. Technische Universität Berlin, Berlin, in 2014.
He has been a Professor of computer science with In 2014, he founded the Machine Learning Group,
the Technische Universität Berlin, Berlin, Germany, Fraunhofer Heinrich Hertz Institute, where he is
since 2006, where he is currently co-directing the currently the Director. He was a Scholar of the
Berlin Big Data Center. After completing a post- German National Academic Foundation and a Ph.D.
doctoral position at GMD FIRST, Berlin, he was Fellow with the Bernstein Center for Computational
a Research Fellow with The University of Tokyo, Neuroscience Berlin, Berlin, where he is also with
Tokyo, Japan, from 1994 to 1995. In 1995, he the Berlin Big Data Center. He was visiting with Heriot-Watt University,
founded the Intelligent Data Analysis Group, GMD-FIRST (later Fraunhofer Edinburgh, U.K., and The University of Edinburgh, Edinburgh, from 2007 to
FIRST), and directed it until 2008. From 1999 to 2006, he was a Professor 2008. In 2009, he was with the Intelligent Robotics Group, NASA Ames
with the University of Potsdam, Potsdam, Germany. His current research Research Center, Mountain View, CA, USA. His current research interests
interests include intelligent data analysis, machine learning, signal processing, include interpretable machine learning, neural networks, federated learning,
and brain–computer interfaces. and computer vision.

You might also like