
DISTRIBUTED, STREAMING MACHINE LEARNING

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith

Federated Learning
Challenges, methods, and future directions

Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.

Introduction
Mobile phones, wearable devices, and autonomous vehicles are just a few of the modern distributed networks generating a wealth of data each day. Due to the growing computational power of these devices, coupled with concerns over transmitting private information, it is increasingly attractive to store data locally and push network computation to the edge.

The concept of edge computing is not a new one. Indeed, computing simple queries across distributed, low-powered devices is a decades-long area of research that has been explored under the purview of query processing in sensor networks, computing at the edge, and fog computing [6], [30].
Recent works have also considered training machine learning models centrally but serving and storing them locally; for example, this is a common approach in mobile user modeling and personalization [23].
However, as the storage and computational capabilities of the devices within distributed networks grow, it is possible to leverage enhanced local resources on each device. In addition, privacy concerns over transmitting raw data require user-generated data to remain on local devices. This has led to a growing interest in federated learning [31], which explores training statistical models directly on remote devices. The term device is used throughout the article to describe entities in the communication network, such as nodes, clients, sensors, or organizations.

As we discuss in this article, learning in such a setting differs significantly from traditional distributed environments, which requires fundamental advances in areas such as privacy, large-scale machine learning, and distributed optimization, and raises new questions at the intersection of diverse fields such as machine learning and systems. Federated learning methods have been deployed in practice by major companies [5], [41] and play a critical role in supporting privacy-sensitive applications where training data are distributed at the edge [8], [19]. In the next sections, we discuss several canonical applications of federated learning.

Smartphones
By jointly learning user behavior across a large pool of mobile phones, statistical models can power applications such as next-word prediction [17]. However, users may not be willing to share their data to protect their personal privacy or to save the limited bandwidth/battery power of their phone. Federated learning has the potential to enable predictive features on smartphones without diminishing the user experience or leaking private information. Figure 1 depicts one such application in which we aim to learn a next-word predictor in a large-scale mobile phone network based on users' historical text data [17].

FIGURE 1. An example application of federated learning for the task of next-word prediction on mobile phones. To preserve the privacy of the text data and reduce strain on the network, we seek to train a predictor in a distributed fashion, rather than sending the raw data to a central server. In this setup, remote devices communicate with a central server periodically to learn a global model. At each communication round, a subset of selected phones performs local training on their nonidentically distributed user data, and sends these local updates to the server. After incorporating the updates, the server then sends back the new global model to another subset of devices. This iterative training process continues across the network until convergence is reached or some stopping criterion is met.

Organizations
Organizations or institutions can also be viewed as "devices" in the context of federated learning. For example, hospitals are organizations that contain a multitude of patient data for predictive health care; however, hospitals operate under strict privacy practices and may face legal, administrative, or ethical constraints that require data to remain local. Federated learning is a promising solution for these applications [19], as it can reduce privacy leakage and naturally eliminate these constraints to enable private learning between various devices/organizations.

The Internet of Things
Modern Internet of Things networks, such as wearable devices, autonomous vehicles, or smart homes, may contain numerous sensors that allow them to collect, react, and adapt to incoming data in real time. For example, a fleet of autonomous vehicles may require an up-to-date model of traffic, construction, or pedestrian behavior to safely operate; however, building aggregate models in these scenarios may be difficult due to the private nature of the data and the limited connectivity of each device. Federated learning methods can help train models that efficiently adapt to changes in these systems, while maintaining user privacy.

Problem formulation
The standard federated learning problem involves learning a single global statistical model from data stored on tens to potentially millions of remote devices. We aim to learn this model under the constraint that device-generated data are stored and processed locally, with only intermediate updates being communicated periodically with a central server. The goal is typically to minimize the following objective function:

    min_w F(w), where F(w) := Σ_{k=1}^{m} p_k F_k(w).   (1)

Here, m is the total number of devices, p_k ≥ 0 and Σ_k p_k = 1, and F_k is the local objective function for the kth device. The local objective function is often defined as the empirical risk over local data, i.e., F_k(w) = (1/n_k) Σ_{j_k=1}^{n_k} f_{j_k}(w; x_{j_k}, y_{j_k}), where n_k is the number of samples available locally. The user-defined term p_k specifies the relative impact of each device, with two natural settings being p_k = 1/n or p_k = n_k/n, where n = Σ_k n_k is the total number of samples. We will reference (1) throughout the article but, as discussed in the next section, we note that other objectives or modeling approaches may be appropriate depending on the application of interest.
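To make the objective in (1) concrete, the following minimal sketch evaluates F(w) for a toy linear-regression setup using the sample-proportional weights p_k = n_k/n. The squared-loss model, data, and sizes are illustrative assumptions, not part of the article's formulation beyond (1) itself.

```python
import numpy as np

def local_objective(w, X_k, y_k):
    """Empirical risk F_k(w) on device k's local data (squared loss here)."""
    residuals = X_k @ w - y_k
    return 0.5 * np.mean(residuals ** 2)

def global_objective(w, device_data):
    """F(w) = sum_k p_k F_k(w), with p_k = n_k / n (sample-proportional weights)."""
    n_total = sum(len(y_k) for _, y_k in device_data)
    return sum((len(y_k) / n_total) * local_objective(w, X_k, y_k)
               for X_k, y_k in device_data)

# Toy network: three devices holding different amounts of local data.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
device_data = []
for n_k in [20, 50, 200]:
    X_k = rng.normal(size=(n_k, 2))
    y_k = X_k @ w_true + 0.1 * rng.normal(size=n_k)
    device_data.append((X_k, y_k))

print("F(0) =", round(global_objective(np.zeros(2), device_data), 4))
```

Note that only local losses and sample counts enter the computation; the raw (X_k, y_k) never need to leave device k when F(w) is evaluated in a federated manner.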

Core challenges
We next describe four of the core challenges associated with solving the distributed optimization problem posed in (1). These challenges make the federated setting distinct from other classical problems, such as distributed learning in data center settings or traditional private data analyses.

Challenge 1: Expensive communication
Communication is a critical bottleneck in federated networks [5] which, when coupled with privacy concerns over sending raw data, necessitates that data generated on each device remain local. Indeed, federated networks potentially comprise a massive number of devices, e.g., millions of smartphones, and communication in the network can be slower than local computation by many orders of magnitude due to limited resources such as bandwidth, energy, and power [46]. To fit a model to data generated by the devices in the federated network, it is therefore important to develop communication-efficient methods that iteratively send small messages or model updates as part of the training process, as opposed to sending the entire data set over the network. To further reduce communication in such a setting, two key aspects to consider are 1) reducing the total number of communication rounds and 2) reducing the size of the transmitted messages at each round.

Challenge 2: Systems heterogeneity
The storage, computational, and communication capabilities of each device in federated networks may differ due to variability in hardware (CPU and memory), network connectivity (3G, 4G, 5G, and Wi-Fi), and power (battery level) [46]. Additionally, the network size and systems-related constraints on each device typically result in only a small fraction of the devices being active at once, e.g., hundreds of active devices in a network with millions of devices [5]. It is also not uncommon for an active device to drop out at a given iteration due to connectivity or energy constraints [5]. These system-level characteristics dramatically exacerbate challenges such as straggler mitigation and fault tolerance. Developed federated learning methods must therefore 1) anticipate a low amount of participation, 2) tolerate heterogeneous hardware, and 3) be robust enough to dropped devices in the communication network.

Challenge 3: Statistical heterogeneity
Devices frequently generate and collect data in a highly nonidentically distributed manner across the network, e.g., mobile phone users have varied use of language in the context of a next-word prediction task. Moreover, the number of data points across devices may vary significantly, and there may be an underlying statistical structure present that captures the relationship among devices and their associated distributions [42]. This data-generation paradigm violates frequently used independent and identically distributed (i.i.d.) assumptions in distributed optimization and may add complexity in terms of problem modeling, theoretical analysis, and the empirical evaluation of solutions. Indeed, although the canonical federated learning problem of (1) aims to learn a single global model, there exist other alternatives such as simultaneously learning distinct local models via multitask learning frameworks (cf. [42]). In this regard, there is also a close connection between leading approaches for federated learning and metalearning [24]. Both the multitask and metalearning perspectives enable personalized or device-specific modeling, which is often a more natural approach to handle the statistical heterogeneity of the data for better personalization.
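As a concrete illustration of statistical heterogeneity, experiments often simulate nonidentically distributed federated data by skewing the label distribution across devices. The sketch below is one such illustrative partition scheme under assumed toy labels; it is not a procedure prescribed by the article.

```python
import numpy as np

def label_skewed_partition(labels, num_devices, classes_per_device, seed=0):
    """Assign samples to devices so each device sees only a few classes,
    a common way to simulate nonidentically distributed federated data."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    device_indices = [[] for _ in range(num_devices)]
    for k in range(num_devices):
        chosen = rng.choice(classes, size=classes_per_device, replace=False)
        for c in chosen:
            idx = np.flatnonzero(labels == c)
            # Each device takes a random slice of its chosen classes.
            take = rng.choice(idx, size=max(1, len(idx) // num_devices), replace=False)
            device_indices[k].extend(take.tolist())
    return device_indices

# Toy labels: 1,000 samples over 10 classes.
labels = np.random.default_rng(1).integers(0, 10, size=1000)
parts = label_skewed_partition(labels, num_devices=5, classes_per_device=2)
for k, idx in enumerate(parts):
    print(f"device {k}: {len(idx)} samples, classes {sorted(set(labels[idx]))}")
```

Each simulated device ends up with a different class mix and a different number of samples, mirroring the two forms of heterogeneity described above.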

Challenge 4: Privacy concerns
Finally, privacy is often a major concern in federated learning applications. Federated learning makes a step toward protecting data generated on each device by sharing model updates, e.g., gradient information, instead of the raw data. However, communicating model updates throughout the training process can nonetheless reveal sensitive information, either to a third party or the central server [32]. Although recent methods aim to enhance the privacy of federated learning using tools such as secure multiparty computation (SMC) or differential privacy, these approaches often provide privacy at the cost of reduced model performance or system efficiency [4], [32]. Understanding and balancing these tradeoffs, both theoretically and empirically, is a considerable challenge in realizing private federated learning systems.

Survey of related and current work
At first glance, the challenges in federated learning resemble classical problems in areas such as privacy, large-scale machine learning, and distributed optimization. For instance, numerous methods have been proposed to tackle expensive communication in the optimization and signal processing communities [28], [40], [43]. However, these methods are typically unable to fully handle the scale of federated networks, much less the challenges of systems and statistical heterogeneity (see the discussions throughout this section). Similarly, even though privacy is an important aspect for many applications, privacy-preserving methods for federated learning can be challenging to rigorously assert as a result of statistical variations in data and may be even more difficult to implement due to systems constraints on each device and across the massive network. In the following section, we explore in greater detail the challenges presented in the "Introduction" section, including a discussion of classical results as well as more recent work focused specifically on federated learning.

Communication efficiency
Communication is a key bottleneck to consider when developing methods for federated networks. Although it is beyond the scope of this article to provide a self-contained review of communication-efficient learning methods, we point out several general directions, which we group into 1) local updating methods, 2) compression schemes, and 3) decentralized training.

Local updating
Minibatch optimization methods, which involve extending classical stochastic methods to process multiple data points at a time, have emerged as a popular paradigm for distributed machine learning in data center environments. In practice, however, they have shown limited flexibility to adapt to communication-computation tradeoffs [53], which would maximally leverage distributed data processing. In response, several recent methods have been proposed to improve communication efficiency in distributed settings by allowing for a variable number of local updates to be applied on each machine in parallel at each communication round (rather than just computing them locally and then applying them centrally) [44]. This makes the amount of computation versus communication substantially more flexible.

For convex objectives, distributed local updating primal-dual methods have emerged as a popular way to tackle such a problem [43]. These approaches leverage duality structures to effectively decompose the global objective into subproblems that can be solved in parallel at each communication round. Several distributed local updating primal methods have also been proposed, which have the added benefit of being applicable to nonconvex objectives [53]. These methods drastically improve performance in practice, and have been shown to achieve orders-of-magnitude speedups over traditional minibatch methods or distributed approaches like the alternating direction method of multipliers in real-world data center environments. We provide an intuitive illustration of local updating methods in Figure 2.

In federated settings, optimization methods that allow for flexible local updating and low client participation have become the de facto solvers [31]. The most commonly used method for federated learning is federated averaging (FedAvg) [31], a method based on averaging local stochastic gradient descent (SGD) updates for the primal problem. FedAvg has been shown to work well empirically, particularly for nonconvex problems, but comes without convergence guarantees and can diverge in practical settings when data are heterogeneous [25]. We discuss methods to handle such statistical heterogeneity in more detail in the "Convergence Guarantees for Non-i.i.d. Data" section.

FIGURE 2. (a) The distributed (minibatch) SGD. Each device, k, locally computes gradients from a minibatch of data points to approximate ∇F_k(w), and the aggregated minibatch updates are applied on the server. (b) The local updating schemes. Each device immediately applies local updates, e.g., gradients, after they are computed, and a server performs a global aggregation after a (potentially) variable number of local updates. Local updating schemes can reduce communication by performing additional work locally.
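In the spirit of FedAvg [31] and the local updating scheme of Figure 2(b), the minimal sketch below runs a few epochs of local SGD on each selected device and averages the resulting models weighted by local sample counts. The linear least-squares model, hyperparameters, and device sampling are illustrative assumptions, not the reference implementation from [31].

```python
import numpy as np

def local_sgd(w_global, X, y, epochs=5, lr=0.05, batch=10, rng=None):
    """Run a few epochs of local SGD on one device, starting from the global model."""
    rng = rng or np.random.default_rng(0)
    w = w_global.copy()
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg_round(w_global, devices, frac=0.5, rng=None):
    """One communication round: sample devices, train locally, average by n_k."""
    rng = rng or np.random.default_rng(0)
    selected = rng.choice(len(devices), size=max(1, int(frac * len(devices))), replace=False)
    sizes, local_models = [], []
    for k in selected:
        X_k, y_k = devices[k]
        local_models.append(local_sgd(w_global, X_k, y_k, rng=rng))
        sizes.append(len(y_k))
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(p * w for p, w in zip(weights, local_models))

# Toy federated network of five devices holding linear-regression data.
rng = np.random.default_rng(2)
w_true = np.array([0.5, -1.0, 2.0])
devices = []
for n_k in [30, 60, 90, 120, 150]:
    X = rng.normal(size=(n_k, 3))
    devices.append((X, X @ w_true + 0.1 * rng.normal(size=n_k)))

w = np.zeros(3)
for t in range(20):
    w = fedavg_round(w, devices, rng=rng)
print("estimated weights:", np.round(w, 2))
```

Only model vectors cross the network in this sketch; each device's data stay local, and the number of local epochs controls the computation-communication tradeoff discussed above.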

Compression schemes
Although local updating methods can reduce the total number of communication rounds, model-compression schemes such as sparsification and quantization can significantly reduce the size of messages communicated at each round. These methods have been extensively studied, both empirically and theoretically, in previous literature for distributed training in data center environments. (We refer readers to [47] for a more complete review.) In federated environments, the low participation of devices, nonidentically distributed local data, and local updating schemes pose novel challenges to these model-compression approaches. Several works have provided practical strategies in federated settings, such as forcing the updating models to be sparse and low rank [22], performing quantization with structured random rotations [22], and using lossy compression and dropout to reduce server-to-device communication [9]. From a theoretical perspective, although prior work has explored convergence guarantees with low-precision training in the presence of nonidentically distributed data (e.g., [45]), the assumptions made do not take into consideration common characteristics of the federated setting, such as low device participation or locally updating optimization methods.
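A minimal sketch of two generic compression ideas mentioned above, top-k sparsification and uniform quantization of a model update, is shown below. The specific schemes in [22] and [9] involve additional machinery (e.g., structured random rotations and dropout) that is omitted here; the values chosen are purely illustrative.

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep only the k largest-magnitude entries of a model update."""
    sparse = np.zeros_like(update)
    idx = np.argsort(np.abs(update))[-k:]
    sparse[idx] = update[idx]
    return sparse

def quantize_8bit(update):
    """Uniformly quantize an update to 8-bit integers plus one float scale."""
    scale = float(np.max(np.abs(update))) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.round(update / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(3)
delta_w = rng.normal(size=1000)          # a device's model update
sparse = top_k_sparsify(delta_w, k=50)   # send 50 values (plus indices) instead of 1,000
q, scale = quantize_8bit(sparse)         # then 1 byte per value instead of 4 or 8
recovered = dequantize(q, scale)
print("relative error:", round(np.linalg.norm(recovered - delta_w) / np.linalg.norm(delta_w), 3))
```

The relative error printed at the end makes the tradeoff explicit: smaller messages come at the cost of a lossier reconstruction of the update.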
Decentralized training
In federated learning, a star network [where a central server is connected to a network of devices, e.g., in Figure 3(a)] is the predominant communication topology; we therefore focus on the star network setting in this article. We briefly discuss decentralized topologies [where devices only communicate with their neighbors, e.g., in Figure 3(b)] as a potential alternative. In data center environments, decentralized training has been demonstrated to be faster than centralized training when operating on networks with low bandwidth or high latency. Some works propose deadline-based approaches where all workers compute the local gradients using a variable number of samples within a fixed global cycle, which helps mitigate the impact of stragglers [16], [39]. (We refer readers to [18] for a more comprehensive review.) Similarly, in federated learning, decentralized algorithms can, in theory, reduce the high communication cost on the central server. Some recent works have investigated decentralized training over heterogeneous data with local updating schemes [18]. However, they are either restricted to linear models [18] or assume full device participation.

FIGURE 3. Centralized versus decentralized topologies. In the typical federated learning setting and as a focus of this article, we assume (a) a star network where a server connects with all the remote devices. (b) Decentralized topologies are a potential alternative when communication to the server becomes a bottleneck.

Systems heterogeneity
In federated settings, there is significant variability in the systems characteristics across the network, as devices may differ in terms of hardware, network connectivity, and battery power. As depicted in Figure 4, these systems characteristics make issues such as stragglers significantly more prevalent than in typical data center environments. We roughly group several key directions used to handle systems heterogeneity into 1) asynchronous communication, 2) active device sampling, and 3) fault tolerance. As mentioned in the "Decentralized Training" section, we assume a star topology for the discussions presented in the following section.

FIGURE 4. Systems heterogeneity in federated learning. Devices may vary in terms of network connection, power, and hardware. Moreover, some of the devices may drop at any time during training. Therefore, federated training methods must tolerate heterogeneous systems environments and low participation of devices, i.e., they must allow for only a small subset of devices to be active at each round.

Asynchronous communication
In traditional data center settings, synchronous (i.e., workers waiting for each other for synchronization) and asynchronous (i.e., workers running independently without synchronization) schemes are both commonly used to parallelize iterative optimization algorithms, with each approach having advantages and disadvantages [37], [53]. Synchronous schemes are simple and guarantee a serial-equivalent computational model, but they are also more susceptible to stragglers in the face of device variability [53]. Asynchronous schemes are an attractive approach used to mitigate stragglers in heterogeneous environments, particularly in shared-memory systems [37]. However, they typically rely on bounded-delay assumptions to control the degree of staleness [37]. For device k, the staleness depends on the number of other devices that have updated since device k pulled from the central server. Even though asynchronous parameter servers have been successful in distributed data centers [37], classical bounded-delay assumptions can be unrealistic in federated settings, where the delay may be on the order of hours to days or completely unbounded.
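The bounded-delay bookkeeping described above can be made concrete with a small sketch: the server tracks a model version, each update arrives tagged with the version its device pulled, and updates whose staleness exceeds a bound are dropped (they could instead be down-weighted). This is a generic illustration of the bounded-staleness idea, not a specific method from the cited works.

```python
import numpy as np

class BoundedStalenessServer:
    """Asynchronous aggregation that only applies updates within a staleness bound."""
    def __init__(self, w_init, max_staleness=10, lr=1.0):
        self.w = w_init.astype(float)
        self.version = 0
        self.max_staleness = max_staleness
        self.lr = lr

    def pull(self):
        """A device fetches the current model and remembers its version."""
        return self.w.copy(), self.version

    def push(self, delta_w, pulled_version):
        """Apply an update if it is not too stale; otherwise drop it."""
        staleness = self.version - pulled_version
        if staleness > self.max_staleness:
            return False  # bounded-delay assumption violated; update rejected
        self.w += self.lr * delta_w
        self.version += 1
        return True

server = BoundedStalenessServer(np.zeros(3), max_staleness=2)
w_k, v_k = server.pull()              # device k pulls the model
for _ in range(4):                    # meanwhile, other devices push updates
    server.push(np.ones(3) * 0.1, server.version)
print(server.push(np.ones(3), v_k))   # device k's update is now too stale -> False
```

In a federated network where devices may reappear after hours or days, the staleness counter can grow without bound, which is exactly why such assumptions become unrealistic.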

Active sampling
In federated networks, typically only a small subset of devices participate at each round of training; however, the vast majority of federated methods, e.g., those described in [5], [25], and [31], are passive in the sense that they do not aim to influence which devices participate. An alternative approach involves actively selecting participating devices at each round. For example, Nishio and Yonetani [36] explore novel device-sampling policies based on systems resources, with the aim being for the server to aggregate as many device updates as possible within a predefined time window. However, these methods assume a static model of the systems characteristics of the network; it remains open how best to extend these approaches to handle real-time, device-specific fluctuations in computation and communication delays. Moreover, although these methods primarily focus on systems variability to perform active sampling, we note that it is also worth considering actively sampling a set of small but sufficiently representative devices based on the underlying statistical structure.
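A toy sketch of resource-aware selection within a time window is given below. It is only an illustration of the general idea, not the algorithm of [36]: the per-device time estimates are assumed inputs, and real systems must also account for estimation error and fluctuating conditions.

```python
import numpy as np

def select_within_deadline(est_times, deadline):
    """Keep the devices whose estimated time to compute and upload an update
    fits within the round deadline; the rest sit this round out."""
    est_times = np.asarray(est_times)
    return np.flatnonzero(est_times <= deadline).tolist()

rng = np.random.default_rng(8)
est_times = rng.uniform(1.0, 30.0, size=10)   # per-device estimates (seconds)
print("estimated round times:", np.round(est_times, 1))
print("selected for this round:", select_within_deadline(est_times, deadline=10.0))
```

As noted above, such deadline-driven selection maximizes the number of aggregated updates per round but can systematically exclude slow devices, which interacts with the statistical concerns discussed next.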
Fault tolerance
Fault tolerance has been extensively studied in the systems community and is a fundamental consideration of classical distributed systems, including formalisms such as Byzantine failures [50]. Recent works have also investigated fault tolerance specifically for machine learning workloads in data center environments. When learning over remote devices, however, fault tolerance becomes more critical, as it is common for some participating devices to drop out at some point before the completion of the given training iteration [5]. One practical strategy is to simply ignore such device failure, as in FedAvg [5], which may introduce bias into the device-sampling scheme if the failed devices have specific data characteristics. For instance, devices from remote areas may be more likely to drop due to poor network connections and thus, the trained federated model will be biased toward devices with favorable network conditions.

Theoretically, although several recent works have investigated convergence guarantees of variants of federated learning methods [52], few analyses allow for low participation [25], [42] or study directly the effect of dropped devices [50]. FedProx tackles systems heterogeneity by allowing for each selected device to perform partial work that conforms to the underlying systems constraints, and safely incorporates those partial updates via a proximal term (see the "Convergence Guarantees for Non-i.i.d. Data" section for a more detailed discussion).

Coded computation is another option used to tolerate device failures by introducing algorithmic redundancy. Recent works have explored using codes to speed up distributed machine learning training [11]. For instance, in the presence of stragglers, gradient coding and its variants [11] carefully replicate data blocks (as well as the gradient computation on those data blocks) across computing nodes to obtain either an exact or inexact recovery of the true gradients. Although this is seemingly a promising approach for the federated setting, these methods face fundamental challenges in federated networks, as sharing data/replication across devices is often infeasible due to privacy constraints and the scale of the network.

Statistical heterogeneity
Challenges arise when training federated models from data that are highly nonidentically distributed across devices, both in terms of modeling heterogeneous data and in terms of analyzing the convergence behavior of associated training procedures. We discuss related work in the next sections.

Modeling heterogeneous data
There exists a large body of literature in machine learning that models statistical heterogeneity via methods such as metalearning and multitask learning; these ideas have been recently extended to the federated setting [12], [14], [21]. For instance, MOCHA [42], an optimization framework designed for the federated setting, can allow for personalization by learning separate but related models for each device, while leveraging a shared representation via multitask learning. This method has provable theoretical convergence guarantees for the considered objectives but is limited in its ability to scale to massive networks and is restricted to convex objectives. Another approach [12] models the star topology as a Bayesian network and performs variational inference during learning. Although this method can handle nonconvex functions, it is expensive to generalize to large federated networks. Khodak et al. [21] provably metalearn a within-task learning rate using multitask information (where each task corresponds to a device) and demonstrate improved empirical performance over vanilla FedAvg. Eichner et al. [14] investigate a pluralistic solution (adaptively choosing between a global model and device-specific models) to address the cyclic patterns in data samples during federated training. Despite these recent advances, key challenges still remain in developing methods for heterogeneous modeling that are robust, scalable, and automated in federated settings.

When modeling federated data, it may also be important to consider issues beyond accuracy, such as fairness. In particular, naively solving an aggregate loss function, such as in (1), may implicitly advantage or disadvantage some of the devices, as the learned model may become biased toward devices with larger amounts of data or (if weighting devices equally) toward commonly occurring groups of devices. Recent works have proposed modified modeling approaches that aim to reduce the variance of the model performance across devices [19], [26], [33]. Some heuristics simply perform a varied number of local updates based on the local loss of the device [19]. Other more principled approaches include agnostic federated learning [33], which optimizes the centralized model for any target distribution formed by a mixture of the client distributions via a minimax optimization scheme. Another more general approach is taken by Li et al. [26], which proposes an objective called q-FFL, in which devices with higher loss are given higher relative weight to encourage less variance in the final accuracy distribution. Beyond issues of fairness, we note that aspects such as accountability and interpretability in federated learning are additionally worth exploring, but may be challenging due to the scale and heterogeneity of the network.
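A rough sketch of the loss-dependent reweighting idea behind objectives such as q-FFL is shown below: devices with a higher current local loss receive a larger relative weight, with q = 0 recovering the baseline weights. The exact q-FFL formulation and solver are given in [26]; this is only an illustrative reweighting, and the loss values are made up.

```python
import numpy as np

def loss_reweighted(local_losses, p, q):
    """Device weights proportional to p_k * F_k(w)^q (normalized to sum to 1).
    q = 0 gives back the baseline weights p_k; larger q upweights high-loss devices."""
    raw = np.asarray(p) * np.asarray(local_losses) ** q
    return raw / raw.sum()

local_losses = np.array([0.2, 0.5, 2.0])     # device 2 is currently underserved
p = np.array([1 / 3, 1 / 3, 1 / 3])          # baseline relative impacts
for q in [0.0, 1.0, 5.0]:
    print(f"q={q}: weights = {np.round(loss_reweighted(local_losses, p, q), 3)}")
```

Increasing q shifts weight toward the worst-off device, which is the mechanism by which the variance of the final accuracy distribution can be reduced.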
Convergence guarantees for non-i.i.d. data
Statistical heterogeneity also presents novel challenges in terms of analyzing the convergence behavior in federated settings, even when learning a single global model. Indeed, when data are not identically distributed across devices in the network, methods such as FedAvg can diverge in practice when the selected devices perform too many local updates [25], [31]. Parallel SGD and related variants, which make local updates similar to FedAvg, have been analyzed in the i.i.d. setting [38], [48], [53]. However, the results rely on the premise that each local solver is a copy of the same stochastic process (due to the i.i.d. assumption), which is not the case in typical federated settings.

To understand the performance of FedAvg in heterogeneous settings, FedProx [25] was recently proposed. The key idea of FedProx is that there is an interplay between systems heterogeneity and statistical heterogeneity. As mentioned previously, simply dropping the stragglers in the network due to systems constraints can implicitly increase statistical heterogeneity. FedProx makes a small modification to the FedAvg method by allowing for partial work to be performed across devices based on the underlying systems constraints and leveraging the proximal term to safely incorporate the partial work. It can be viewed as a reparameterization of FedAvg because tuning the proximal term of FedProx is effectively equivalent to tuning the number of local epochs E in FedAvg. However, it is unrealistic to set E for the devices constrained by systems conditions. The proximal term therefore has two benefits: 1) it encourages more well-behaved local updates by restricting the local updates to be closer to the initial (global) model and 2) it safely incorporates the partial updates from selected devices.
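The proximal idea can be made concrete with a small sketch of the local subproblem: each selected device approximately minimizes its local loss plus a proximal term (μ/2)·||w − w^t||² that keeps the local solution near the current global model. The value of μ, the number of gradient steps (which may vary per device, capturing partial work), and the squared-loss model are illustrative assumptions, not the reference FedProx implementation from [25].

```python
import numpy as np

def fedprox_local_update(w_global, X, y, mu=0.1, lr=0.01, steps=50):
    """Approximately minimize F_k(w) + (mu / 2) * ||w - w_global||^2 on one device.
    The number of gradient steps can differ per device (partial work)."""
    w = w_global.copy()
    for _ in range(steps):
        grad_loss = X.T @ (X @ w - y) / len(y)     # gradient of the local squared loss
        grad_prox = mu * (w - w_global)            # proximal term pulls w toward w_global
        w -= lr * (grad_loss + grad_prox)
    return w

# Toy device: a larger mu keeps the local solution closer to the global model.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = X @ np.array([3.0, -3.0]) + 0.1 * rng.normal(size=40)
w_global = np.zeros(2)
for mu in [0.0, 1.0, 10.0]:
    w_local = fedprox_local_update(w_global, X, y, mu=mu)
    print(f"mu={mu:>4}: distance from global model = {np.linalg.norm(w_local - w_global):.3f}")
```

Setting μ = 0 recovers plain local SGD as in FedAvg, which is the reparameterization view mentioned above.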
Theoretically, FedProx uses a dissimilarity metric to capture the statistical heterogeneity in the network and provides convergence guarantees for both convex and nonconvex functions under the bounded device dissimilarity assumption. The convergence analysis also covers the setting in which each device performs a variable amount of work locally. Several other works [27], [52] have also derived convergence guarantees in the presence of heterogeneous data with different assumptions, e.g., convexity [27] or uniformly bounded gradients [52]. There are also heuristic approaches that aim to tackle statistical heterogeneity, either by sharing local device data or some server-side proxy data [19], [20]. However, these methods may be unrealistic: in addition to imposing burdens on network bandwidth, sending local data to the server violates the key privacy assumption of federated learning, and sending globally shared proxy data to all devices requires effort to carefully generate or collect such auxiliary data.

Privacy
Privacy concerns often motivate the need to keep raw data on each device local in federated settings; however, sharing other information, such as model updates as part of the training process, can also leak sensitive user information. For instance, Carlini et al. [10] demonstrate that one can extract sensitive text patterns, e.g., a specific credit card number, from a recurrent neural network trained on users' language data. Given increasing interest in privacy-preserving learning approaches, in the "Privacy in Machine Learning" section, we first briefly revisit prior work on enhancing privacy in the general (distributed) machine learning setting. We then review recent privacy-preserving methods specifically designed for federated settings in the "Privacy in Federated Learning" section.

Privacy in machine learning
The three main strategies in privacy-preserving machine learning, each of which is briefly reviewed, include differential privacy to communicate noisy data sketches, homomorphic encryption to operate on encrypted data, and secure function evaluation (SFE) or multiparty computation.

Among these various privacy approaches, differential privacy [13] is most widely used due to its strong information-theoretic guarantees, algorithmic simplicity, and relatively small systems overhead. Simply put, a randomized mechanism is differentially private if the change of one input element will not result in too much difference in the output distribution; this means that one cannot draw any conclusions about whether or not a specific sample is used in the learning process. Such sample-level privacy can be achieved in many learning tasks. For gradient-based learning methods, a popular approach is to apply differential privacy by randomly perturbing the intermediate output at each iteration. Before applying the perturbation, e.g., via binomial noise [1], it is common to clip the gradients to bound the influence of each example on the overall update. There exists an inherent tradeoff between differential privacy and model accuracy, as adding more noise results in greater privacy but may compromise accuracy significantly. Despite the fact that differential privacy is the de facto metric for privacy in machine learning, there are many other privacy definitions, such as k-anonymity [15] and d-presence [34], which may be applicable to different learning problems.
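A minimal sketch of the clip-then-perturb recipe described above follows, using Gaussian noise for simplicity (the cited work [1] uses binomial noise and careful privacy accounting, both omitted here). The clipping norm and noise scale are illustrative knobs, not calibrated privacy parameters.

```python
import numpy as np

def clip_and_perturb(per_example_grads, clip_norm=1.0, noise_std=0.5, rng=None):
    """Clip each example's gradient to bound its influence, then add noise
    to the averaged gradient before it leaves the device."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    return avg + rng.normal(scale=noise_std / len(per_example_grads), size=avg.shape)

rng = np.random.default_rng(5)
grads = [rng.normal(size=10) for _ in range(32)]   # per-example gradients on a device
print(np.round(clip_and_perturb(grads, clip_norm=1.0, noise_std=0.5, rng=rng), 3))
```

Raising the noise scale strengthens the privacy protection but degrades the utility of the released update, which is the accuracy tradeoff noted above.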

Beyond differential privacy, homomorphic encryption can be used to secure the learning process by computing on encrypted data, although it has currently been applied in limited settings, e.g., training linear models [35]. When the user-generated data are distributed across different data owners, another natural option is to perform privacy-preserving learning via SFE or SMC. The resulting protocols can enable multiple parties to collaboratively compute an agreed-upon function without leaking input information from any party except for what can be inferred from the output. Thus, although SMC cannot guarantee protection from information leakage, it can be combined with differential privacy to achieve stronger privacy guarantees. However, approaches along these lines may not be applicable to large-scale machine learning scenarios, as they incur substantial additional communication and computation costs. We refer interested readers to [7] for a more comprehensive review of the approaches based on homomorphic encryption and SMC.

Privacy in federated learning
The federated setting poses novel challenges to existing privacy-preserving algorithms. Beyond providing rigorous privacy guarantees, it is necessary to develop methods that are computationally cheap, communication efficient, and tolerant to dropped devices, all without overly compromising accuracy. Although there are a variety of privacy definitions in federated learning, typically, they can be classified into two categories: global privacy and local privacy. As presented in Figure 5, global privacy requires that the model updates generated at each round are private to all untrusted third parties other than the central server, while local privacy further requires that the updates are also private to the server.

[Figure 5 depicts one round of training under each setting: (a) no additional protection, ΔW = aggregate(ΔW1 + ΔW2 + ΔW3); (b) global privacy, ΔW = M(aggregate(ΔW1 + ΔW2 + ΔW3)); (c) local privacy, ΔW′ = aggregate(M(ΔW1) + M(ΔW2) + M(ΔW3)).]
FIGURE 5. An illustration of different privacy-enhancing mechanisms in one round of federated learning. M denotes a randomized mechanism used to privatize the data. (a) Federated learning without additional privacy protection mechanisms, (b) global privacy, where a trusted server is assumed, and (c) local privacy, where the central server may be malicious.
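The placement of the randomized mechanism M in Figure 5 can be mirrored in a few lines of code: M is applied nowhere, at the server after exact aggregation (global privacy), or on each device before its update leaves (local privacy). The Gaussian perturbation standing in for M below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def M(update, noise_std=0.1):
    """Placeholder randomized mechanism (illustrative Gaussian perturbation)."""
    return update + rng.normal(scale=noise_std, size=update.shape)

def aggregate(updates):
    return np.mean(updates, axis=0)

device_updates = [rng.normal(size=5) for _ in range(3)]

no_privacy  = aggregate(device_updates)                    # (a) no extra protection
global_priv = M(aggregate(device_updates))                 # (b) trusted server privatizes the aggregate
local_priv  = aggregate([M(u) for u in device_updates])    # (c) each device privatizes its own update

for name, w in [("none", no_privacy), ("global", global_priv), ("local", local_priv)]:
    print(f"{name:>6}: {np.round(w, 3)}")
```

Because noise is added once in the global setting but once per device in the local setting, local privacy typically pays a larger accuracy cost for the same noise scale, consistent with the discussion that follows.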

Current works that aim to improve the privacy of federated learning typically build upon previous classical cryptographic protocols such as SMC [4] and differential privacy [2], [32]. Bonawitz et al. [4] introduce a secure aggregation protocol to protect individual model updates. The central server is not able to see any local updates but can still observe the exact aggregated results at each round. Secure aggregation is a lossless method and can retain the original accuracy with a very high privacy guarantee; however, the resulting method incurs significant extra communication cost. Other works apply differential privacy to federated learning and offer global differential privacy (e.g., [32]); these approaches have a number of hyperparameters that affect communication and accuracy and must be carefully chosen. In the case where stronger privacy guarantees are required, Bhowmick et al. [2] introduce a relaxed version of local privacy by limiting the power of potential adversaries. It affords stronger privacy guarantees than global privacy and has better model performance than strict local privacy. Li et al. [24] propose locally differentially private algorithms in the context of metalearning, which can be applied to federated learning with personalization, while also providing provable learning guarantees in convex settings. In addition, differential privacy can be combined with model-compression techniques to reduce communication and obtain privacy benefits simultaneously [1].

Future directions
Federated learning is an active and ongoing area of research. Although recent work has begun to address the challenges discussed in the "Survey of Related and Current Work" section, there are a number of critical open directions yet to be explored. In this section, we briefly outline a few promising research directions surrounding the previously discussed challenges (expensive communication, systems heterogeneity, statistical heterogeneity, and privacy concerns) and introduce additional challenges regarding issues such as productionizing and benchmarking in federated settings.

Extreme communication schemes
It remains to be seen how much communication is necessary in federated learning. Indeed, it is well known that optimization methods used for machine learning can tolerate a lack of precision; this error can, in fact, help with generalization [49]. Although one-shot or divide-and-conquer communication schemes have been explored in traditional data center settings [29], the behavior of these methods is not well understood in massive and statistically heterogeneous networks.

Communication reduction and the Pareto frontier
We have discussed several ways to reduce communication in federated training, such as local updating and model compression. It is important to understand how these techniques compose with one another and to systematically analyze the tradeoff between accuracy and communication for each approach. In particular, the most useful techniques will demonstrate improvements at the Pareto frontier, i.e., achieving an accuracy greater than any other approach under the same communication budget and, ideally, across a wide range of communication/accuracy profiles. Similar comprehensive analyses have been performed for efficient neural network inference [3] and are necessary to compare communication-reduction techniques for federated learning in a meaningful way.

Novel models of asynchrony
As discussed in the "Asynchronous Communication" section, the two most commonly studied communication schemes in distributed optimization are bulk synchronous approaches and asynchronous approaches (where it is assumed that the delay is bounded). These schemes are more realistic in data center settings, where worker nodes are typically dedicated to the workload, i.e., they are ready to "pull" their next job from the central node immediately after they "push" the results of their previous job. In contrast, in federated networks, each device is often undedicated to the task at hand and most devices are not active on any given iteration. Therefore, it is worth studying the effects of this more realistic device-centric communication scheme, in which each device can decide when to "wake up," at which point it pulls a new task from the central node and performs some local computation.

Heterogeneity diagnostics
Recent works have aimed to quantify statistical heterogeneity through metrics such as local dissimilarity (as defined in the context of federated learning in [25] and used for other purposes in works such as [51]). However, these metrics cannot be easily calculated over the federated network before training occurs. The importance of these metrics motivates the following open questions:
1) Do simple diagnostics exist to quickly determine the level of heterogeneity in federated networks a priori?
2) Can analogous diagnostics be developed to quantify the amount of systems-related heterogeneity?
3) Can current or new definitions of heterogeneity be exploited to design new federated optimization methods with improved convergence, both empirically and theoretically?
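As a concrete illustration of the local dissimilarity idea referenced above, the sketch below compares per-device gradients with the averaged gradient at a common model on a toy problem. It is an illustrative proxy only; the precise definition used in [25] differs, and computing such quantities over a real federated network before training is exactly the open question raised here.

```python
import numpy as np

def gradient_dissimilarity(w, devices):
    """Illustrative heterogeneity proxy at w:
    E_k[ ||grad F_k(w)||^2 ] / ||grad F(w)||^2 (larger => more heterogeneous)."""
    local_grads = [X_k.T @ (X_k @ w - y_k) / len(y_k) for X_k, y_k in devices]
    global_grad = np.mean(local_grads, axis=0)
    num = np.mean([np.linalg.norm(g) ** 2 for g in local_grads])
    return num / (np.linalg.norm(global_grad) ** 2 + 1e-12)

rng = np.random.default_rng(7)

def make_devices(shift):
    """Each device's data come from its own optimum, shifted away from a common one."""
    devs = []
    for _ in range(5):
        X = rng.normal(size=(50, 2))
        w_k = np.array([1.0, 1.0]) + shift * rng.normal(size=2)
        devs.append((X, X @ w_k))
    return devs

w = np.zeros(2)
print("near-i.i.d. devices  :", round(gradient_dissimilarity(w, make_devices(0.0)), 2))
print("heterogeneous devices:", round(gradient_dissimilarity(w, make_devices(2.0)), 2))
```

Values close to one indicate nearly i.i.d. devices, while larger values signal the kind of heterogeneity that degrades methods relying on i.i.d. assumptions.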
Granular privacy constraints
The definitions of privacy outlined in the "Privacy in Federated Learning" section cover privacy at a local or global level with respect to all devices in the network. However, in practice, it may be necessary to define privacy on a more granular level, as privacy constraints may differ across devices or even across data points on a single device. For instance, Li et al. [24] recently proposed sample-specific (as opposed to user-specific) privacy guarantees, thus providing a weaker form of privacy in exchange for more accurate models. Developing methods to handle mixed (device-specific or sample-specific) privacy restrictions is an interesting and ongoing direction of future work.

Beyond supervised learning
It is important to note that the methods discussed thus far have been developed with the task of supervised learning in mind, i.e., they assume that labels exist for all of the data in the federated network. In practice, much of the data generated in realistic federated networks may be unlabeled or weakly labeled. Furthermore, the problem at hand may not be to fit a model to data, as presented in (1), but instead to perform some exploratory data analysis, determine aggregate statistics, or run a more complex task, such as reinforcement learning. Tackling problems beyond supervised learning in federated networks will likely require addressing similar challenges of scalability, heterogeneity, and privacy.

Productionizing federated learning
Beyond the major challenges discussed in this article, there are a number of practical concerns that arise when running federated learning in production. In particular, issues such as concept drift (when the underlying data-generation model changes over time), diurnal variations (when the devices exhibit different behavior at different times of the day or week) [14], and cold-start problems (when new devices enter the network) must be handled with care. We refer readers to [5], which discusses some of the practical systems-related issues that exist in production federated learning systems.

Benchmarks
Finally, as federated learning is a nascent field, we are at a pivotal time to shape the developments made in this area and must ensure that they are grounded in real-world settings, assumptions, and data sets. It is critical for the broader research communities to further build upon existing benchmarking tools and implementations, such as LEAF [54] and TensorFlow Federated [55], to facilitate both the reproducibility of empirical results and the dissemination of new solutions for federated learning.

Conclusions
In this article, we provided an overview of federated learning, a learning paradigm where statistical models are trained at the edge in distributed networks. We discussed the unique properties and associated challenges of federated learning compared with traditional distributed data center computing and classical privacy-preserving learning. We provided a broad survey on classical results as well as more recent work specifically focused on federated settings. Finally, we outlined a handful of open problems worth future research effort. Providing solutions to these problems will require interdisciplinary effort from a broad set of research communities.

Authors
Tian Li ([email protected]) received her bachelor's degree in computer science at Peking University, Beijing, China. Currently, she is a Ph.D. candidate in the Computer Science Department at Carnegie Mellon University, Pittsburgh, Pennsylvania, advised by Dr. Virginia Smith. Her research interests include large-scale machine learning, distributed optimization, and data-intensive systems.

Anit Kumar Sahu ([email protected]) received his B.Tech. degree in electronics and electrical communication engineering and his M.Tech. degree in telecommunication systems engineering, both from the Indian Institute of Technology, Kharagpur, India, in May 2013 and his Ph.D. degree in electrical and computer engineering from Carnegie Mellon University (CMU), Pittsburgh, Pennsylvania, in 2018. He is currently a machine learning research scientist at the Bosch Center for Artificial Intelligence in Pittsburgh. He is a recipient of the 2019 A.G. Jordan Award from the Department of Electrical and Computer Engineering at CMU. His research interests include distributed optimization and robust deep learning. He is a Member of the IEEE.

Ameet Talwalkar ([email protected]) is an assistant professor in the Machine Learning Department at Carnegie Mellon University, Pittsburgh, Pennsylvania, and is also the cofounder and chief scientist at Determined AI. His current work is motivated by the goal of democratizing machine learning, with a focus on topics related to scalability, automation, fairness, and interpretability. He led the initial development of the MLlib project in Apache Spark, is a coauthor of the textbook Foundations of Machine Learning (MIT Press), and created an award-winning edX massive open online course on distributed machine learning. He also helped create the Conference on Machine Learning and Systems and is currently the president of the MLSys board.

Virginia Smith ([email protected]) received undergraduate degrees in mathematics and computer science from the University of Virginia, Charlottesville, and her Ph.D. degree in computer science from the University of California, Berkeley. Previously, she was a postdoctoral professor at Stanford University, California. Currently, she is an assistant professor in the Machine Learning Department and a courtesy faculty member in the Electrical and Computer Engineering Department at Carnegie Mellon University, Pittsburgh, Pennsylvania. Her research interests include machine learning, optimization, and distributed systems.

References
[1] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, "cpSGD: Communication-efficient and differentially-private distributed SGD," in Proc. Advances in Neural Information Processing Systems, 2018, pp. 7564–7575.
[2] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, Protection against reconstruction and its applications in private federated learning. 2018. [Online]. Available: arXiv:1812.00984
[3] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama, "Adaptive neural networks for efficient inference," in Proc. Int. Conf. Machine Learning, Aug. 2017, vol. 70, pp. 527–536.
[4] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal et al., "Practical secure aggregation for privacy-preserving machine learning," in Proc. Conf. Computer and Communications Security, 2017, pp. 1175–1191. doi: 10.1145/3133956.3133982.

[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny et al., "Towards federated learning at scale: System design," in Proc. Conf. Machine Learning and Systems, 2019.
[6] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, "Fog computing and its role in the Internet of Things," in Proc. SIGCOMM Workshop on Mobile Cloud Computing, 2012. doi: 10.1145/2342509.2342513.
[7] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser, "Machine learning classification over encrypted data," in Proc. Network and Distributed System Security Symp., 2015. doi: 10.14722/ndss.2015.23241.
[8] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, and W. Shi, "Federated learning of predictive models from federated electronic health records," Int. J. Medical Informatics, vol. 112, Apr. 2018, pp. 59–67. doi: 10.1016/j.ijmedinf.2018.01.007.
[9] S. Caldas, J. Konečny, H. B. McMahan, and A. Talwalkar, Expanding the reach of federated learning by reducing client resource requirements. 2018. [Online]. Available: arXiv:1812.07210
[10] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song, "The secret sharer: Evaluating and testing unintended memorization in neural networks," in Proc. USENIX Security Symp., 2019, pp. 267–284.
[11] Z. Charles and D. Papailiopoulos, "Gradient coding using the stochastic block model," in Proc. Int. Symp. Information Theory, 2018, pp. 1998–2002. doi: 10.1109/ISIT.2018.8437887.
[12] L. Corinzia and J. M. Buhmann, Variational federated multi-task learning. 2019. [Online]. Available: arXiv:1906.06268
[13] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Proc. Theory of Cryptography Conf., 2006, pp. 265–284. doi: 10.1007/11681878_14.
[14] H. Eichner, T. Koren, H. B. McMahan, N. Srebro, and K. Talwar, "Semi-cyclic stochastic gradient descent," in Proc. Int. Conf. Machine Learning, 2019, pp. 1764–1773.
[15] K. E. Emam and F. K. Dankar, "Protecting privacy using k-anonymity," J. Amer. Med. Inform. Assoc., vol. 15, no. 5, pp. 627–637, 2008. doi: 10.1197/jamia.M2716.
[16] N. Ferdinand, H. Al-Lawati, S. Draper, and M. Nokleby, "Anytime Minibatch: Exploiting stragglers in online distributed optimization," in Proc. Int. Conf. Learning Representations, 2019.
[17] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, Federated learning for mobile keyboard prediction. 2018. [Online]. Available: arXiv:1811.03604
[18] L. He, A. Bian, and M. Jaggi, "Cola: Decentralized linear learning," in Proc. Advances in Neural Information Processing Systems, 2018, pp. 4541–4551.
[19] L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu, LoAdaBoost: Loss-based adaboost federated machine learning on medical data. 2018. [Online]. Available: arXiv:1811.12629
[20] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. 2018. [Online]. Available: arXiv:1811.11479
[21] M. Khodak, M.-F. Balcan, and A. Talwalkar, "Adaptive gradient-based meta-learning methods," in Proc. Advances in Neural Information Processing Systems, 2019, pp. 5917–5928.
[22] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated learning: Strategies for improving communication efficiency. 2016. [Online]. Available: arXiv:1610.05492
[23] T. Kuflik, J. Kay, and B. Kummerfeld, "Challenges and solutions of ubiquitous user modeling," in Ubiquitous Display Environments, A. Krüger and T. Kuflik, Eds. Berlin: Springer-Verlag, 2012, pp. 7–30.
[24] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, "Differentially private meta-learning," in Proc. Int. Conf. Learning Representations, 2020.
[25] T. Li, A. K. Sahu, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proc. Conf. Machine Learning and Systems, 2020.
[26] T. Li, M. Sanjabi, and V. Smith, "Fair resource allocation in federated learning," in Proc. Int. Conf. Learning Representations, 2020.
[27] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in Proc. Int. Conf. Learning Representations, 2020.
[28] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in Proc. Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
[29] L. W. Mackey, M. I. Jordan, and A. Talwalkar, "Divide-and-conquer matrix factorization," in Proc. Advances in Neural Information Processing Systems, 2011, pp. 1134–1142.
[30] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong, "TinyDB: An acquisitional query processing system for sensor networks," Trans. Database Syst., vol. 30, no. 1, pp. 122–173, 2005.
[31] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Aguera y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th Int. Conf. Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[32] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private recurrent language models," in Proc. Int. Conf. Learning Representations, 2018.
[33] M. Mohri, G. Sivek, and A. T. Suresh, "Agnostic federated learning," in Proc. Int. Conf. Machine Learning, 2019, pp. 4615–4625.
[34] M. E. Nergiz and C. Clifton, "d-presence without complete world knowledge," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 868–883, 2010. doi: 10.1109/TKDE.2009.125.
[35] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft, "Privacy-preserving ridge regression on hundreds of millions of records," in Proc. Symp. Security and Privacy, 2013, pp. 334–348. doi: 10.1109/SP.2013.30.
[36] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in Proc. Int. Conf. Communications, 2019, pp. 1–7. doi: 10.1109/ICC.2019.8761315.
[37] B. Recht, C. Re, S. Wright, and F. Niu, "HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent," in Proc. Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[38] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. 2019. [Online]. Available: arXiv:1909.13014
[39] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, "Robust and communication-efficient collaborative learning," in Proc. Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
[40] A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. 2018. [Online]. Available: arXiv:1809.02920
[41] M. J. Sheller, G. A. Reina, B. Edwards, J. Martin, and S. Bakas, "Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation," in Proc. Int. MICCAI Brainlesion Workshop, 2018, pp. 92–104. doi: 10.1007/978-3-030-11723-8_9.
[42] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated multi-task learning," in Proc. Advances in Neural Information Processing Systems, 2017, pp. 4424–4434.
[43] V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, and M. Jaggi, "CoCoA: A general framework for communication-efficient distributed optimization," J. Mach. Learning Res., vol. 18, no. 1, pp. 8590–8638, 2018.
[44] S. U. Stich, "Local SGD converges fast and communicates little," in Proc. Int. Conf. Learning Representations, 2019.
[45] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, "Communication compression for decentralized training," in Proc. Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.
[46] C. Van Berkel, "Multi-core for mobile phones," in Proc. Conf. Design, Automation and Test in Europe, 2009, pp. 1260–1265.
[47] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, "ATOMO: Communication-efficient learning via atomic sparsification," in Proc. Advances in Neural Information Processing Systems, 2018, pp. 1–12.
[48] J. Wang and G. Joshi, Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. 2018. [Online]. Available: arXiv:1808.07576
[49] Y. Yao, L. Rosasco, and A. Caponnetto, "On early stopping in gradient descent learning," Constr. Approx., vol. 26, no. 2, pp. 289–315, 2007. doi: 10.1007/s00365-006-0663-2.
[50] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," in Proc. Int. Conf. Machine Learning, 2018, pp. 5650–5659.
[51] D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett, "Gradient diversity: A key ingredient for scalable distributed learning," in Proc. Artificial Intelligence and Statistics, 2018, pp. 1998–2007.
[52] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD for non-convex optimization with faster convergence and less communication," in Proc. AAAI Conf. Artificial Intelligence, 2018, pp. 5693–5700. doi: 10.1609/aaai.v33i01.33015693.
[53] S. Zhang, A. E. Choromanska, and Y. LeCun, "Deep learning with elastic averaging SGD," in Proc. Advances in Neural Information Processing Systems, 2015, pp. 685–693.
[54] LEAF: A benchmark for federated settings. Accessed on: Mar. 4, 2020. [Online]. Available: https://leaf.cmu.edu/
[55] TensorFlow. "TensorFlow federated: Machine learning on decentralized data." Accessed on: Mar. 4, 2020. [Online]. Available: https://www.tensorflow.org/federated
