Robust Aggregation For Federated Learning
Abstract—We present a novel approach to federated learning that endows its aggregation process with greater robustness to potential poisoning of local data or model parameters of participating devices. The proposed approach, Robust Federated Aggregation (RFA), relies on the aggregation of updates using the geometric median, which can be computed efficiently using a Weiszfeld-type algorithm. RFA is agnostic to the level of corruption and aggregates model updates without revealing each device's individual contribution. We establish the convergence of the robust federated learning algorithm for the stochastic learning of additive models with least squares. We also offer two variants of RFA: a faster one with one-step robust aggregation, and another one with on-device personalization. We present experimental results with additive models and deep networks for three tasks in computer vision and natural language processing. The experiments show that RFA is competitive with the classical aggregation when the level of corruption is low, while demonstrating greater robustness under high corruption.

Index Terms—Federated learning, robust aggregation, corrupted updates, distributed learning, data privacy.

Manuscript received July 24, 2021; revised December 20, 2021; accepted February 3, 2022. Date of publication February 24, 2022; date of current version March 11, 2022. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Yue M. Lu. This work was supported in part by NSF under Grants CCF-1740551, CCF-1703574, and DMS-1839371, in part by the Washington Research Foundation for innovation in data-intensive discovery, in part by the CIFAR program Learning in Machines and Brains, faculty research awards, and in part by a JP Morgan Ph.D. Fellowship. This work was first presented at the Workshop on Federated Learning and Analytics in June 2019. (Corresponding author: Krishna Pillutla.)

Krishna Pillutla and Zaid Harchaoui are with the University of Washington, Seattle, WA 98195 USA (e-mail: [email protected]; [email protected]). Sham M. Kakade is with Harvard University, Cambridge, MA 02138 USA (e-mail: [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/TSP.2022.3153135, provided by the authors. Digital Object Identifier 10.1109/TSP.2022.3153135

I. INTRODUCTION

FEDERATED learning is a key paradigm for machine learning and analytics on mobile, wearable and edge devices [1], [2] over wireless networks of 5G and beyond as well as edge networks and the Internet of Things. The paradigm has found widespread applications ranging from mobile apps deployed on millions of devices [3], [4], to sensitive healthcare applications [5], [6].

In federated learning, a number of devices with privacy-sensitive data collaboratively optimize a machine learning model under the orchestration of a central server, while keeping the data fully decentralized and private. Recent work has looked beyond supervised learning to domains such as data analytics but also semi-, self- and un-supervised learning, transfer learning, meta learning, and reinforcement learning [2], [7]–[9].

We study a question relevant in all these areas: robustness to corrupted updates. Federated learning relies on aggregation of updates contributed by participating devices, where the aggregation is privacy-preserving. Sensitivity to corrupted updates, caused either by adversaries intending to attack the system or due to failures in low-cost hardware, is a vulnerability of the usual approach. The standard arithmetic mean aggregation in federated learning is not robust to corruptions, in the sense that even a single corrupted update in a round is sufficient to degrade the global model for all devices. In one dimension, the median is an attractive aggregate for its robustness to outliers. We adopt this approach to federated learning by considering a classical multidimensional generalization of the median, known variously as the geometric or spatial or L1 median [10].

Our robust approach preserves the privacy of the device updates by iteratively invoking the secure multi-party computation primitives used in typical non-robust federated learning [11], [12]. A device's updates are information-theoretically protected in that they are computationally indistinguishable from random noise, and the sensitivity of the final aggregate to the contribution of each device is bounded. Our approach is scalable, since the underlying secure aggregation algorithms are implemented in production systems across millions of mobile users across the planet [13]. The approach is communication-efficient, requiring a modest 1-3× the communication cost of the non-robust setting to compute the non-linear aggregate in a privacy-preserving manner.

Contributions: The main take-away message of this work is:

Federated learning can be made robust to corrupted updates by replacing the weighted arithmetic mean aggregation with an approximate geometric median at 1-3 times the communication cost.

To this end, we make the following concrete contributions.
a) Robust Aggregation: We design a novel robust aggregation oracle based on the classical geometric median. We analyze the convergence of the resulting federated learning algorithm, RFA, for least-squares estimation and show that the proposed method is robust to update corruption in up to half the devices in federated learning with bounded heterogeneity. We also describe an extension of the framework to handle arbitrary heterogeneity via personalization.
b) Algorithmic Implementation: We show how to implement this robust aggregation oracle in a practical and privacy-preserving manner. This relies on an alternating minimization algorithm which empirically exhibits rapid convergence. This algorithm can be interpreted as a numerically stable version of the classical algorithm of Weiszfeld [14], thus shedding new light on it.
c) Numerical Simulations: We demonstrate the effectiveness of our framework for data corruption and parameter update corruption, on federated learning tasks from computer vision and natural language processing, with linear models as well as convolutional and recurrent neural networks. In particular, our results show that the proposed RFA algorithm (i) outperforms the standard FedAvg [1] in high corruption and (ii) nearly matches the performance of FedAvg in low corruption, both at 1-3 times the communication cost. Moreover, the proposed algorithm is agnostic to the actual level of corruption in the problem instance.

We open source an implementation of the proposed approach in TensorFlow Federated¹; cf. Appendix II for a template implementation. The Python code and scripts used to reproduce experimental results are publicly available online.²

¹https://github.com/google-research/federated/tree/master/robust_aggregation
²https://github.com/krishnap25/rfa

Overview: Section II describes related work, and Section III describes the problem formulation and tradeoffs of robustness. Section IV proposes a robust aggregation oracle and presents a convergence analysis of the resulting robust federated learning algorithm. Finally, Section V gives comprehensive numerical simulations demonstrating the robustness of the proposed federated learning algorithm compared to standard baselines.
II. RELATED WORK

We now survey some related work.

Federated Learning was introduced in [1] as a distributed optimization approach to handle on-device machine learning, with secure multi-party averaging algorithms given in [11], [17]. Extensions were proposed in [18]–[26]; see also the recent surveys [2], [27]. We address robustness to corrupted updates, which is broadly applicable in these settings.

Distributed optimization has a long history [28]. Recent work includes primal-dual frameworks [29], [30] and variants suited to decentralized [31] and asynchronous [32] settings.

From the lens of learning in networks [33], federated learning comprises a star network where agents (i.e., devices) with private data are connected to a server with no data, which orchestrates the cooperative learning. Further, for privacy, model updates from individual agents cannot be shared directly, but must be aggregated securely.

Robust estimation was pioneered by Huber [34], [35]. Robust median-of-means estimators were introduced in [36], with follow-ups in [37]–[41]. Robust mean estimation, in particular, received much attention [42]–[44]. Robust estimation in networks was considered in [45]–[47]. These works consider the statistics of robust estimation in the i.i.d. case, while we focus on distributed optimization with privacy preservation.

Byzantine robustness, resilience to arbitrary behavior of some devices [48], was studied in distributed optimization with gradient aggregation [49]–[54]. Byzantine robustness of federated learning is a priori not possible without additional assumptions because the secure multi-party computation protocols require faithful participation of the devices. Thus, we consider a more nuanced and less adversarial corruption model where devices participate faithfully in the aggregation loop; see Section III for practical examples. Further, it is unclear how to securely implement the nonlinear aggregation algorithms of these works. Lastly, the use of, e.g., secure enclaves [55] in conjunction with our approach could guarantee Byzantine robustness in federated learning. We aggregate model parameters in a robust manner, which is more suited to the federated setting. We note that [56] also aggregate model parameters rather than gradients by framing the problem in terms of consensus optimization. However, their algorithm requires devices to be always available and participate in multiple rounds, which is not practical in the federated setting [2].

Weiszfeld's algorithm [14] to compute the geometric median has received much attention [57]–[59]. The Weiszfeld algorithm is also known to exhibit asymptotic linear convergence [60]. However, unlike these variants, ours is numerically stable. A theoretical proposal of a near-linear time algorithm for the geometric median was recently explored in [61].

Frameworks to guarantee privacy of user data include differential privacy [62], [63] and homomorphic encryption [64]. These directions are orthogonal to ours, and could be used in conjunction. See [2], [11], [27] for a broader discussion.

III. PROBLEM SETUP: FEDERATED LEARNING WITH CORRUPTIONS

We begin this section by recalling the setup of federated learning (without corruption) and the standard FedAvg algorithm [1] in Section III-A. We then formally set up our corruption model and discuss the trade-offs introduced by requiring robustness to corrupted updates in Section III-B.

A. Federated Learning Setup and Review

Federated learning consists of n client devices which collaboratively train a machine learning model under the orchestration of a central server or a fusion center [1], [2]. The data is local to the client devices while the job of the server is to orchestrate the training.

We consider a typical federated learning setting where each device i has a distribution D_i over some data space such that the data on the client is sampled i.i.d. from D_i. Let the vector w ∈ R^d denote the parameters of a (supervised) learning model and let f(w; z) denote the loss of model w on input-output pair z, such as the mean-squared-error loss. Then, the objective function of device i is F_i(w) = E_{z∼D_i}[f(w; z)].

Federated learning aims to find a model w that minimizes the average objective across all the devices,

min_{w∈R^d} F(w) := Σ_{i=1}^n α_i F_i(w),    (1)

where device i is weighted by α_i > 0. In practice, the weight α_i is chosen proportional to the amount of data on device i. For instance, in an empirical risk minimization setting, each D_i is the uniform distribution over a finite set {z_{i,1}, . . . , z_{i,N_i}} of size N_i. It is common practice to choose α_i = N_i/N where N = Σ_{i=1}^n N_i so that the objective F(w) = (1/N) Σ_{i=1}^n Σ_{j=1}^{N_i} f(w; z_{i,j}) is simply the unweighted average over all samples from all n devices.
Federated Learning Algorithms: Typical federated learning algorithms run in synchronized rounds of communication between the server and the devices, with some local computation on the devices based on their local data, and aggregation of these updates to update the server model. The de facto standard training algorithm is FedAvg [1], which runs as follows.
a) The server samples a set S_t of m clients from [n] and broadcasts the current model w^{(t)} to these clients.
b) Starting from w_{i,0}^{(t)} = w^{(t)}, each client i ∈ S_t makes τ local gradient or stochastic gradient descent steps for k = 0, . . . , τ − 1 with a learning rate γ:

w_{i,k+1}^{(t)} = w_{i,k}^{(t)} − γ∇F_i(w_{i,k}^{(t)}).    (2)

c) Each device i ∈ S_t sends to the server a vector w_i^{(t+1)} which is simply the final iterate, i.e., w_i^{(t+1)} = w_{i,τ}^{(t)}. The server updates its global model using the weighted average

w^{(t+1)} = Σ_{i∈S_t} α_i w_i^{(t+1)} / Σ_{i∈S_t} α_i.    (3)

The federated learning algorithm, and in particular, the choice of aggregation, impacts the following three factors [2], [27], [65]: communication efficiency, privacy, and robustness.
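For concreteness, the following Python sketch mirrors one FedAvg round in the form of steps (a)–(c) above; the function names and the (alpha_i, grad_fn_i) device interface are our own simplification, not the interface of any particular library.

```python
import numpy as np

def local_sgd(w, grad_fn, gamma, tau):
    """Step (b): tau local (stochastic) gradient steps on one device, as in (2)."""
    for _ in range(tau):
        w = w - gamma * grad_fn(w)
    return w

def fedavg_round(w_global, sampled_devices, gamma, tau):
    """Steps (a)-(c): broadcast the model, run local SGD on each sampled device,
    then aggregate with the weighted arithmetic mean of (3).
    `sampled_devices` is a list of (alpha_i, grad_fn_i) pairs."""
    updates, weights = [], []
    for alpha_i, grad_fn_i in sampled_devices:
        updates.append(local_sgd(w_global.copy(), grad_fn_i, gamma, tau))
        weights.append(alpha_i)
    return np.average(np.stack(updates), axis=0, weights=np.asarray(weights))
```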
Communication Efficiency: Besides the computation cost, the communication cost is an important parameter in distributed optimization. While communication is relatively fast in the datacenter, that is not the case in federated learning. The repeated exchange of massive models between the server and client devices over resource-limited wireless networks makes communication over the network more of a bottleneck in federated learning than local computation on the devices. Therefore, training algorithms should be able to trade off more local computation for lower communication, similar to step (b) of FedAvg above. While the exact benefits (or lack thereof) of local steps are an active area of research, local steps have been found empirically to reduce the amount of communication required for a moderately accurate solution [1], [66].

Accordingly, we set aside the local computation cost for a first order approximation, and compare algorithms in terms of their total communication cost [2]. Since typical federated learning algorithms proceed in synchronized rounds of communication, we measure the complexity of the algorithms in terms of the number of communication rounds.

Privacy: While the privacy-sensitive data z ∼ D_i is kept local to the device, the model updates w_i^{(t+1)} might also leak privacy. To add a further layer of privacy protection, the server is not allowed to inspect individual updates w_i^{(t+1)} in the aggregation step (c); it can only access the aggregate w^{(t+1)}.

We make this precise through the notion of a secure average oracle. Given m devices with each device i containing w_i ∈ R^d and a scalar β_i > 0, a secure average oracle computes the average Σ_{i=1}^m β_i w_i / Σ_{i=1}^m β_i at a total communication of O(md + m log m) bits such that no w_i or β_i are revealed to either the server or any other device.

In practice, a secure average oracle is implemented using cryptographic protocols based on secure multi-party computation [11], [12]. These require a communication overhead of O(m log m) in addition to the O(md) cost of sending the m vectors. First, the vector β_i w_i is dimension-wise discretized on the ring Z_M^d of integers modulo M in d dimensions. Then, a noisy version w̃_i is sent to the server, where the noise is designed to satisfy:
• correctness up to discretization, by ensuring Σ_{i=1}^m w̃_i mod M = Σ_{i=1}^m β_i w_i mod M with probability 1, and,
• privacy preservation from honest-but-curious devices and server in the information theoretic sense, by ensuring that w̃_i is computationally indistinguishable from ζ_i ∼ Uniform(Z_M^d), irrespective of w_i and β_i.

As a result, we get the correct average (up to discretization) while not revealing any further information about a w_i or β_i to the server or other devices, beyond what can be inferred from the average. Hence, no further information about the underlying data distribution D_i is revealed either. In this work, we assume for simplicity that the secure average oracle returns the exact update, i.e., we ignore the effects of discretization on the integer ring and modular wraparound. This assumption is reasonable for a large enough value of M.
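To make the masking idea concrete, here is a toy Python sketch under the simplifications stated above (exact modular arithmetic, honest participants, no dropouts). It only illustrates why each submitted vector looks like uniform noise while the modular sum is preserved; it is not the production protocol of [11], [12].

```python
import numpy as np

def masked_sum_demo(int_vectors, modulus=2**32, seed=0):
    """Each pair of devices shares a random mask that one adds and the other
    subtracts (mod M). Individually, every submitted vector is uniformly
    distributed, yet the masks cancel so the server recovers the exact
    modular sum of the discretized inputs."""
    rng = np.random.default_rng(seed)
    masked = [np.asarray(v, dtype=np.int64) % modulus for v in int_vectors]
    m, d = len(masked), masked[0].shape[0]
    for i in range(m):
        for j in range(i + 1, m):
            mask = rng.integers(0, modulus, size=d)
            masked[i] = (masked[i] + mask) % modulus  # device i adds the shared mask
            masked[j] = (masked[j] - mask) % modulus  # device j subtracts it
    return np.sum(np.stack(masked), axis=0) % modulus  # equals sum of inputs mod M
```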
Robustness: We would like a federated learning algorithm to be robust to corrupted updates contributed by malicious devices or hardware/software failures. FedAvg uses an arithmetic mean to aggregate the device updates in (3), which is known to not be robust [34]. This can be made precise by the notion of a breakdown point [67], which is the smallest fraction of the points which need to be changed to cause the aggregate to take on arbitrary values. The breakdown point of the mean is 0, since only one point needs to be changed to arbitrarily change the aggregate [10]. This means in federated learning that a single corrupted update, either due to an adversarial attack or a failure, can arbitrarily change the resulting aggregate in each round. We will give examples of adversarial corruptions in Section III-B.

In the rest of this work, we aim to address the lack of robustness of FedAvg. A popular robust aggregation of scalars is the median rather than the mean. We investigate a multidimensional analogue of the median, while respecting the other two factors: communication efficiency and privacy. While the non-robust mean aggregation can be computed with secure multi-party computation via the secure average oracle, it is unclear if a robust aggregate can also satisfy this requirement. We discuss this as well as other tradeoffs involving robustness in the next section.
We make this precise through the notion of a secure aver- updated model wi,τ from local data as expected by the server.
age oracle. Given m devices with each device i containing Formally, we have,
wi ∈ Rd and a scalar βi >0,ma secure average oracle computes (t)
the average m β
i=1 i iw / i=1 βi at a total communication of wi,τ, if i ∈
/ C,
(t+1)
O(md + m log m) bits such that no wi or βi are revealed to wi = (t) (4)
Hi w(t) , {(wj,τ , Dj )}j∈St if i ∈ C,
either the server or any other device.
Authorized licensed use limited to: Donghua University. Downloaded on June 10,2024 at 07:34:02 UTC from IEEE Xplore. Restrictions apply.
PILLUTLA et al.: ROBUST AGGREGATION FOR FEDERATED LEARNING 1145
TABLE I
EXAMPLES CORRUPTIONS AND CAPABILITY OF AN ADVERSARY THEY REQUIRE, AS MEASURED ALONG THE FOLLOWING AXES: DATA WRITE, WHERE A DEVICE
i ∈ C CAN REPLACE ITS LOCAL DISTRIBUTION Di BY ANY ARBITRARY DISTRIBUTION D̃i ; MODEL READ, WHERE A DEVICE i ∈ C CAN READ THE SERVER
(t)
MODEL w(t) AND REPLACE ITS LOCAL DISTRIBUTION Di BY AN ADAPTIVE DISTRIBUTION D̃i DEPENDING ON w(t) ; MODEL WRITE, WHERE A DEVICE i ∈ C
CAN RETURN AN ARBITRARY VECTOR TO THE SERVER FOR AGGREGATION AS IN (4), AND, AGGREGATION, WHERE A DEVICE i ∈ C CAN BEHAVE ARBITRARILY
DURING THE COMPUTATION OF AN ITERATIVE SECURE AGGREGATE. THE LAST COLUMN INDICATES WHETHER THE PROPOSED RFA ALGORITHM IS ROBUST TO
EACH TYPE OF CORRUPTION
where H_i is an arbitrary R^d-valued function which is allowed to depend on the global model w^{(t)}, the uncorrupted updates w_{j,τ}^{(t)}, as well as the data distributions D_j of each device j ∈ S_t. This encompasses situations where the corrupted devices are individually or collectively trying to "attack" the global model, that is, reduce its predictive power over uncorrupted data. We define the corruption level ρ as the total fraction of the weight of the corrupted devices:

ρ = Σ_{i∈C} α_i / Σ_{i=1}^n α_i.    (5)

Since the corrupted devices can only harm the global model through the updates they contribute in the aggregation step, we aim to robustify the aggregation in federated learning. However, it turns out that robustness is not directly compatible with the two other desiderata of federated learning, namely communication efficiency and privacy.

The Tension Between Robustness, Communication and Privacy: We first argue that any federated learning algorithm can only have two out of the three of robustness, communication and privacy under the existing techniques of secure multi-party computation. The standard approach of FedAvg is communication-efficient and privacy-preserving but not robust, as we discussed earlier. In fact, any aggregation scheme A(w_1, . . . , w_m) which is a linear function of w_1, . . . , w_m is similarly non-robust. Therefore, any robust aggregate A must be a non-linear function of the vectors it aggregates.

The approach of sending the updates to the server at a communication of O(md) and utilizing one of the many robust aggregates studied in the literature [e.g. [50], [52], [53]] has robustness and communication efficiency but not privacy. If we try to make it privacy-preserving, however, we lose communication efficiency. Indeed, the secure multi-party computation primitives based on secret sharing, upon which privacy-preservation is built, are communication efficient only for linear functions of the inputs [68]. The additional O(m log m) overhead of secure averaging for linear functions becomes Ω(md log m) for general non-linear functions required for robustness; this makes it impractical for large-scale systems [11]. Therefore, one cannot have both communication efficiency and privacy preservation along with robustness.

In this work, we strike a compromise between robustness, communication and privacy. We will approximate a non-linear robust aggregate as an iterative secure aggregate, i.e., as a sequence of weighted averages, computed with a secure average oracle with weights being adaptively updated.

Definition 1: A function A : (R^d)^m → R^d is said to be an iterative secure aggregate of w_1, . . . , w_m with R communication rounds and initial iterate v^{(0)} if for r = 0, . . . , R − 1, there exist weights β_1^{(r)}, . . . , β_m^{(r)} such that
i) β_i^{(r)} depends only on v^{(r)} and w_i,
ii) v^{(r+1)} = Σ_{i=1}^m β_i^{(r)} w_i / Σ_{i=1}^m β_i^{(r)}, and,
iii) A(w_1, · · · , w_m) = v^{(R)}.
Further, the iterative secure aggregate is said to be s-privacy preserving for some s ∈ (0, 1) if
iv) β_i^{(r)} / Σ_{j=1}^m β_j^{(r)} ≤ s for all i ∈ [m] and r ∈ [R].

If we have an iterative secure aggregate with R communication rounds which is also robust, we gain robustness at an R-fold increase in communication cost. Condition (iv) ensures privacy preservation because it reveals only weighted averages with weights at most s, so a user's update is only available after being mixed with those from a large cohort of devices.
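A skeletal Python view of Definition 1 (our own abstraction): each round, the weights may depend only on the current iterate and on each w_i, and the only quantity ever revealed is one secure weighted average per round.

```python
import numpy as np

def secure_average(vectors, betas):
    """Stand-in for the secure average oracle; in deployment this single call
    would be realized with secure multi-party computation [11], [12]."""
    return np.average(np.stack(vectors), axis=0, weights=np.asarray(betas, dtype=float))

def iterative_secure_aggregate(vectors, weight_rule, v0, num_rounds):
    """Generic form of Definition 1.  `weight_rule(v, w_i)` returns beta_i^{(r)}
    as a function of the current iterate v and of w_i only."""
    v = v0
    for _ in range(num_rounds):
        betas = [weight_rule(v, w_i) for w_i in vectors]
        v = secure_average(vectors, betas)  # one oracle call per round
    return v
```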
The Tension Between Robustness and Heterogeneity: Heterogeneity is a key property of federated learning. The distribution D_i of device i can be quite different from the distribution D_j of some other device j, reflecting the heterogeneous data generated by a diverse set of users.

To analyze the effect of heterogeneity on robustness, consider the simplified scenario of robust mean estimation in Huber's contamination model [34]. Here, we wish to estimate the mean μ ∈ R^d given samples w_1, . . . , w_m ∼ (1 − ρ)N(μ, σ²I) + ρQ, where Q denotes some outlier distribution that ρ-fraction of the points (designated as outliers) are drawn from. Any aggregate w̄ must satisfy the lower bound ‖w̄ − μ‖² ≥ Ω(σ² max{ρ², d/m}) with constant probability [69, Theorem 2.2]. In the federated learning setting, more heterogeneity corresponds to a greater variance σ² among the inlier points, implying a larger error in mean estimation. This suggests a tension between robustness and heterogeneity, where increasing heterogeneity makes robust mean estimation harder in terms of ℓ2 error.

In this work, we strike a compromise between robustness and heterogeneity by considering a family D of allowed data
Fig. 1. Left two: Convergence of the smoothed Weiszfeld algorithm. Right two: Visualization of the re-weighting β_i/α_i, where β_i is the weight of w_i in GM((w_i), (α_i)) = Σ_i β_i w_i. See Appendix IV-D for details.
challenge in the federated setting is to implement it as an iterative secure aggregate. Our approach, given in Algorithm 2, iteratively computes a new weight β_i^{(r)} ∝ 1/‖v^{(r)} − w_i‖, up to a tolerance ν > 0, whose role is to prevent division by zero. This endows the algorithm with greater stability. We call it the smoothed Weiszfeld algorithm as it is a variation of Weiszfeld's classical algorithm [14]. The smoothed Weiszfeld algorithm satisfies the following convergence guarantee, proved in Appendix III.

Proposition 2: The iterate v^{(R)} of Algorithm 2 with input v^{(0)} ∈ conv{w_1, . . . , w_m} and ν > 0 satisfies

g(v^{(R)}) − g(v*) ≤ 2‖v^{(0)} − v*‖² / (ν̄R) + ν/2,

where v* = arg min g and ν̄ = min_{r∈[R], i∈[m]} ν ∨ ‖v^{(r−1)} − w_i‖ ≥ ν. Furthermore, if 0 < ν ≤ min_{i=1,...,m} ‖v* − w_i‖, then it holds that g(v^{(R)}) − g(v*) ≤ 2‖v^{(0)} − v*‖²/(ν̄R).

For an ε-approximate GM, we set ν = O(ε) to get a O(1/ε²) rate. However, if the GM v* is not too close to any w_i, then the same algorithm automatically enjoys a faster O(1/ε) rate. The algorithm plausibly enjoys an even faster convergence rate locally, and we leave this for future work.

The proof relies on constructing a jointly convex surrogate G : R^d × R_{++}^m → R defined using η = (η_1, . . . , η_m) ∈ R^m as

G(v, η) := (1/2) Σ_{k=1}^m α_k ( ‖v − w_k‖²/η_k + η_k ).

Instead of minimizing g(v) directly using the equality g(v) = inf_{η>0} G(v, η), we impose the constraint η_i ≥ ν instead to avoid division by small numbers. The following alternating minimization leads to Algorithm 2:

η^{(r)} = arg min_{η≥ν} G(v^{(r)}, η), and,
v^{(r+1)} = arg min_{v∈R^d} G(v, η^{(r)}).

Numerically, we find in Fig. 1 that Algorithm 2 is rapidly convergent, giving a high quality solution in 3 iterations. This ensures that the approximate GM as an iterative secure aggregate provides robustness at a modest 3× increase in communication cost over regular mean aggregation in FedAvg.
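A minimal Python sketch of the smoothed Weiszfeld iteration described above (our own re-implementation, not the released code): each round re-weights w_i by β_i = α_i / max(ν, ‖v − w_i‖) and takes one weighted average, which in RFA corresponds to one secure average oracle call.

```python
import numpy as np

def smoothed_weiszfeld(points, alphas, num_rounds=3, nu=1e-6, v0=None):
    """Approximate geometric median of `points` with weights `alphas`."""
    points = np.stack(points)
    alphas = np.asarray(alphas, dtype=float)
    # Start from the weighted mean, which lies in conv{w_1, ..., w_m}.
    v = np.average(points, axis=0, weights=alphas) if v0 is None else v0
    for _ in range(num_rounds):
        distances = np.linalg.norm(points - v, axis=1)
        betas = alphas / np.maximum(nu, distances)      # smoothing: never divide by less than nu
        v = np.average(points, axis=0, weights=betas)   # one secure average per round
    return v
```

With num_rounds = 3, this matches the communication budget used in the simulations of Section V.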
Privacy Preservation: While we can compute the geometric median as an iterative secure aggregate, privacy preservation also requires that the effective weights β_i^{(r)} / Σ_j β_j^{(r)} are bounded away from 1 for each i. We show this holds for m large.

Proposition 3: Consider β^{(r)}, v^{(r)} produced by Algorithm 2 when given w_1, · · · , w_m ∈ R^d with weights α_i = 1/m for each i as inputs. Denote B = max_{i,j} ‖w_i − w_j‖ and ν̄ as in Proposition 2. Then, we have for all i ∈ [m] and r ∈ [R] that

β_i^{(r)} / Σ_{j=1}^m β_j^{(r)} ≤ B / (B + (m − 1)ν̄).

Proof: Since v^{(r)} ∈ conv{w_1, . . . , w_m}, we have ν̄ ≤ ‖v^{(r)} − w_i‖ ≤ B. Hence, α_i/B ≤ β_i^{(r)} ≤ α_i/ν̄ for each i and r, and the proof follows.

A. Convergence Analysis of RFA

We present a convergence analysis of RFA under two simplifying assumptions. First, we focus on least-squares fitting of additive models, as it allows us to leverage sharp analyses of SGD [71]–[73] and focus on the effect of the aggregation. Second, we assume w.l.o.g. that each device is weighted by α_i = 1/n to avoid technicalities of random sums Σ_{i∈S_t} α_i. This assumption can be lifted with standard reductions; see Remark 5.

Setup: We are interested in the supervised learning setting where z_i ≡ (x_i, y_i) ∼ D_i is an input-output pair. Denote the marginal distribution of the input x_i as D_{X,i}. The goal is to estimate the regression function x → E[y_i | x_i = x] from a training sequence of independent copies of (x_i, y_i) ∼ D_i in each device. The corresponding objective is the square loss minimization

F(w) = (1/n) Σ_{i=1}^n F_i(w), where    (7)

F_i(w) = E_{(x,y)∼D_i} (1/2)(y − w^⊤φ(x))² for all i ∈ [n],    (8)

where φ(x) = (φ_1(x), . . . , φ_d(x)) ∈ R^d and φ_1, . . . , φ_d are a fixed basis of measurable, centered functions. The basis functions may be nonlinear, thus encompassing random feature approximations of kernel feature maps and pre-trained deep network feature representations.

We state our results under the following assumptions: (a) the feature maps are bounded as ‖φ(x)‖ ≤ R with probability one under D_{X,i} for each device i; (b) each F_i is μ-strongly convex; (c) the additive model is well-specified on each device: for each device i, there exists w_i* ∈ R^d such that y_i = φ(x_i)^⊤ w_i* + ζ_i where ζ_i ∼ N(0, σ²). The second assumption is equivalent
to requiring that H_i = ∇²F_i(w) = E_{x∼D_{X,i}}[φ(x)φ(x)^⊤], the covariance of x on device i, has eigenvalues no smaller than μ.

Quantifying Heterogeneity: We quantify the heterogeneity in the data distributions D_i across devices in terms of the heterogeneity of the marginals D_{X,i} and of the conditional expectation E[y_i | x_i = x] = φ(x)^⊤ w_i*. Let H = ∇²F(w) = (1/n) Σ_{i=1}^n H_i be the covariance of x under the mixture distribution across devices, where H_i is the covariance of x_i in device i. We measure the dissimilarities Ω_X, Ω_{Y|X} of the marginals and the conditionals respectively as

Ω_X = max_{i∈[n]} λ_max(H^{−1/2} H_i H^{−1/2}),    (9)

Ω_{Y|X} = max_{i,j∈[n]} ‖w_i* − w_j*‖,    (10)

where λ_max(·) denotes the largest eigenvalue. Note that Ω_X ≥ 1 and it is equal to 1 iff each H_i = H. It measures the spectral misalignment between each H_i and H. The second condition is related to the Wasserstein-2 distance [74] between the conditionals D_{Y|X,i} as W_2(D_{Y|X,i}, D_{Y|X,j}) ≤ R Ω_{Y|X}. We define the degree of heterogeneity between the various D_i = D_{X,i} ⊗ D_{Y|X,i} as width(D) = Ω_X Ω_{Y|X} =: Ω. That is, if the conditionals are the same (Ω_{Y|X} = 0), we can tolerate arbitrary heterogeneity in the marginals D_{X,i}.
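As a concrete reading of (9)–(10), the sketch below computes both dissimilarities from per-device covariances H_i and per-device optima w_i*; the function and argument names are ours, for illustration only.

```python
import numpy as np

def heterogeneity_measures(covs, w_stars):
    """Compute Omega_X of (9), Omega_{Y|X} of (10), and their product Omega."""
    H = np.mean(covs, axis=0)                    # mixture covariance across devices
    evals, evecs = np.linalg.eigh(H)             # H assumed positive definite
    H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    omega_x = max(np.linalg.eigvalsh(H_inv_sqrt @ Hi @ H_inv_sqrt)[-1] for Hi in covs)
    omega_y_x = max(np.linalg.norm(wi - wj) for wi in w_stars for wj in w_stars)
    return omega_x, omega_y_x, omega_x * omega_y_x
```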
Convergence: We now analyze RFA where the local SGD updates are equipped with "tail-averaging" [73], so that w_i^{(t+1)} = (2/τ) Σ_{k=τ/2}^{τ} w_{i,k}^{(t)} is averaged over the latter half of the trajectory of iterates instead of line 9 of Algorithm 1. We show that this variant of RFA converges up to the dissimilarity level Ω = Ω_X Ω_{Y|X} when the corruption level ρ < 1/2.

Theorem 4: Consider F defined in (7) and suppose the corruption level satisfies ρ < 1/2. Consider Algorithm 1 run for T outer iterations with a learning rate γ = 1/(2R²), and the local updates are run for τ_t steps in outer iteration t with tail averaging. Fix δ > 0 and θ ∈ (ρ, 1/2), and set the number of devices per iteration, m, as

m ≥ log(T/δ) / (2(θ − ρ)²).    (11)

Define C_θ := (1 − 2θ)^{−2}, w* = arg min F, F* = F(w*), κ := R²/μ and Δ_0 := ‖w^{(0)} − w*‖². Let τ ≥ 4κ log(128 C_θ κ). We have that the event E = ∩_{t=0}^{T−1} {|S_t ∩ C| ≤ θm} holds with probability at least 1 − δ. Further, if τ_t = 2^t τ for each iteration t, then the output w^{(T)} of Algorithm 1 satisfies

E[ ‖w^{(T)} − w*‖² | E ] ≤ Δ_0/2^T + C C_θ ( T dσ²/(2^T μτ) + ε²/m² + Ω² ),

where C is a universal constant. If τ_t = τ instead, then the noise term above reads dσ²/(μτ).

Theorem 4 shows near-linear convergence O(T/2^T) up to two error terms in the case that ρ is bounded away from 1/2 (so that θ and C_θ can be taken to be constants). The increasing local computation τ_t = 2^t τ required by this rate is feasible since local computation is assumed to be cheaper than communication. The first error term is ε²/m², due to the approximation in the GM, which can be made arbitrarily small by increasing the number m of devices sampled per round. The second error term Ω² is due to heterogeneity. Indeed, exact convergence as T → ∞ is not possible in the presence of corruption: lower bounds for robust mean estimation [e.g. 69, Theorem 2.2] imply that ‖w^{(T)} − w*‖² ≥ C ρ² Ω_{Y|X}² w.p. at least 1/2. Consistent with our theory, we find in real heterogeneous datasets in Section V that RFA can lead to marginally worse performance than FedAvg in the corruption-free regime (ρ = 0). Finally, while we focus on the setting of least squares, our results can be extended to the general convex case.

Remark 5: For unequal weights, we can perform the reduction F̃_i(w) = nα_i F_i(w), so the theory applies with the substitution (R², σ², μ, Ω_X) → (c_1 R², c_1 σ², c_2 μ, (c_1/c_2) Ω_X), where c_1 = n max_i α_i and c_2 = n min_i α_i.

We use the following convergence result of SGD [72, Theorem 1], [73, Corollary 2].

Theorem 6 ([72], [73]): Consider an F_k from (7). Then, defining κ := R²/μ, the output v̄_τ of τ steps of tail-averaged SGD starting from v_0 ∈ R^d using learning rate (2R²)^{−1} satisfies

E‖v̄_τ − w_k*‖² ≤ 2κ exp(−τ/(4κ)) ‖v_0 − w_k*‖² + 8dσ²/(μτ).

Proof of Theorem 4: Define the event E_t = {|S_t ∩ C| ≤ θm} so that E = ∩_{t=0}^{T−1} E_t. Hoeffding's inequality gives P(E_t^c) ≤ δ/T for each t, so that P(E^c) ≤ δ using the union bound. Below, let F_t denote the sigma algebra generated by w^{(t)}.

Consider the local updates on an uncorrupted device i ∈ S_t \ C, starting from w^{(t)}. Theorem 6 gives, upon using τ_t ≥ τ ≥ 4κ log(128 C_θ κ),

E[ ‖w_i^{(t+1)} − w_i*‖² | E, F_t ] ≤ (1/(64 C_θ)) ‖w^{(t)} − w_i*‖² + 8dσ²/(μτ_t).

Note that w* = (1/n) Σ_{j=1}^n H^{−1} H_j w_j*, so that

‖w* − w_i*‖ ≤ (1/n) Σ_{j=1}^n ‖H^{−1} H_j (w_j* − w_i*)‖ ≤ Ω.

Using ‖a + b‖² ≤ 2‖a‖² + 2‖b‖², we get

E[ ‖w_i^{(t+1)} − w*‖² | E, F_t ] ≤ 2 E[ ‖w_i^{(t+1)} − w_i*‖² | E, F_t ] + 2Ω²
≤ (1/(32 C_θ)) ‖w^{(t)} − w_i*‖² + 16dσ²/(μτ_t) + 2Ω²
≤ (1/(16 C_θ)) ‖w^{(t)} − w*‖² + 16dσ²/(μτ_t) + 4Ω².

We now apply the robustness property of the GM ([70, Thm. 2.2] or [75, Lem. 3]) to get

E[ ‖w^{(t+1)} − w*‖² | E, F_t ] ≤ (1/2) ‖w^{(t)} − w*‖² + 128 C_θ dσ²/(μτ_t) + Γ,
where Γ = 2 C_θ ( ε²/m² + 16Ω² ). Taking an expectation conditioned on E and unrolling this inequality gives

E[ ‖w^{(T)} − w*‖² | E ] ≤ Δ_0/2^T + (128 C_θ dσ²/μ) Σ_{t=1}^T 1/(2^{T−t} τ_t) + 2Γ.

When τ_t = 2^t τ, the series sums to 2^{−(T−1)} T/τ, while for τ_t = τ, the series is upper bounded by 2/τ.

We now consider RFA in connection with the three factors mentioned in Section III-A.
i) Communication Efficiency: Similar to FedAvg, RFA performs multiple local updates for each aggregation round, to save on the total communication. However, owing to the trade-off between communication, privacy and robustness, RFA requires a modest 3× more communication for robustness per aggregation. In the next section, we present a heuristic to reduce this communication cost to one secure average oracle call per aggregation.
ii) Privacy Preservation: Algorithm 2 computes the aggregation as an iterative secure aggregate. This means that the server only learns the intermediate parameters after being averaged over all the devices, with effective weights bounded away from 1 (Proposition 3). The noisy parameter vectors sent by individual devices are uniformly uninformative in the information theoretic sense with the use of secure multi-party computation.
iii) Robustness: The geometric median has a breakdown point of 1/2 [70, Theorem 2.2], which is the highest possible [70, Theorem 2.1]. In the federated learning context, this means that convergence is still guaranteed by Theorem 4 when up to half the points in terms of total weight are corrupted. RFA is resistant to both data and update poisoning, while being privacy preserving. On the other hand, FedAvg has a breakdown point of 0, where a single corruption in each round can cause the model to become arbitrarily bad.

B. Extensions to RFA

We now discuss two extensions to RFA to reduce the communication cost (without sacrificing privacy) and better accommodate statistical heterogeneity in the data with model personalization.

One-step RFA: Reducing the Communication Cost: Recall that RFA results in a 3-5× increase in the communication cost over FedAvg. Here, we give a heuristic variant of RFA in an extremely communication-constrained setting, where it is infeasible to run multiple iterations of Algorithm 2. We simply run Algorithm 2 with v^{(0)} = 0 and a communication budget of R = 1; see Algorithm 3 for details. We find in Section V-C that one-step RFA retains most of the robustness of RFA.

Algorithm 3: One-step Smoothed Weiszfeld Algorithm.
Input: Same as Algorithm 2
1: Device i sets β_i = α_i/(ν ∨ ‖w_i‖)
2: return (Σ_{i=1}^m β_i w_i) / (Σ_{i=1}^m β_i) using A
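Under the same assumptions as the earlier sketch, Algorithm 3 collapses to a single re-weighted secure average; a minimal version is:

```python
import numpy as np

def one_step_weiszfeld(points, alphas, nu=1e-6):
    """One-step smoothed Weiszfeld (Algorithm 3): with v(0) = 0, the weight of
    w_i is alpha_i / max(nu, ||w_i||), followed by a single secure average."""
    points = np.stack(points)
    betas = np.asarray(alphas, dtype=float) / np.maximum(nu, np.linalg.norm(points, axis=1))
    return np.average(points, axis=0, weights=betas)
```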
Personalized RFA: Offsetting Heterogeneity: We now show RFA can be extended to better handle heterogeneity in the devices with the use of personalization. The key idea is that predictions are made on device i by summing the shared parameters w maintained by the server with personalized parameters U = {u_1, . . . , u_n} maintained individually on-device. In particular, the optimization problem we are interested in solving is

min_{w,U} F(w, U) := Σ_{i=1}^n α_i E_{z∼D_i}[ f(w + u_i; z) ].

We outline the algorithm in Algorithm 4. We train the shared and personalized parameters on each other's residuals, following the residual learning scheme of [76]. Each selected device first updates its personalized parameters u_i while keeping the shared parameters w fixed. Next, the updates to the shared parameter are computed on the residual of the personalized parameters. The updates to the shared parameter are aggregated with the geometric median, identical to RFA. Experiments in Section V-C show that personalization is effective in combating heterogeneity.

Algorithm 4: RFA with Personalization.
Replace lines 4 to 8 of Algorithm 1 with the following:
1: Set u_{i,0}^{(t)} = u_i^{(t)} and w_{i,0}^{(t)} = w^{(t)}
2: for k = 0, . . . , τ − 1 do
3:   u_{i,k+1}^{(t)} = u_{i,k}^{(t)} − γ∇f(w^{(t)} + u_{i,k}^{(t)}; z_{i,k}^{(t)}) with z_{i,k}^{(t)} ∼ D_i
4: for k = 0, . . . , τ − 1 do
5:   w_{i,k+1}^{(t)} = w_{i,k}^{(t)} − γ∇f(w_{i,k}^{(t)} + u_{i,τ}^{(t)}; z̃_{i,k}^{(t)}) with z̃_{i,k}^{(t)} ∼ D_i
6: Set w_i^{(t+1)} = w_{i,τ}^{(t)} and u_i^{(t+1)} = u_{i,τ}^{(t)}
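The two local loops of Algorithm 4 on a single device can be sketched as follows; grad_fn(params, z) and sampler() are our own stand-ins for the gradient of the local loss at the summed parameters and for the stream of local samples z ∼ D_i.

```python
import numpy as np

def personalized_local_update(w_global, u_i, grad_fn, gamma, tau, sampler):
    """Device-side update of Algorithm 4: first the personalized parameters,
    then the shared parameters on the residual of the personalized ones."""
    u = u_i.copy()
    for _ in range(tau):                              # lines 2-3: update u_i, w fixed
        u = u - gamma * grad_fn(w_global + u, sampler())
    w = w_global.copy()
    for _ in range(tau):                              # lines 4-5: update w, u_i fixed
        w = w - gamma * grad_fn(w + u, sampler())
    return w, u    # w is sent for robust aggregation; u stays on the device
```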
V. NUMERICAL SIMULATIONS

We now conduct simulations to compare RFA with other federated learning algorithms. The simulations were run using TensorFlow and the data was preprocessed using LEAF [77]. We first describe the experimental setup in Section V-A, then study the robustness and convergence of RFA in Section V-B. We study the effect of the extensions of RFA in Section V-C. The full details from this section and more simulation results are given in Appendix IV. The code and scripts to reproduce these experiments can be found online [16].

A. Setup

We consider three machine learning tasks. The datasets are described in Table II. As described in Section III-A, we take the weight α_i of device i to be proportional to the number of datapoints N_i on the device.
a) Character Recognition: We use the EMNIST dataset [78], where the input x is a 28 × 28 grayscale image of a handwritten character and the output y is its identification (0-9, a-z, A-Z). Each device is a writer of the handwritten character x. We use two models — a linear model ϕ(x; w) = w^⊤x and a convolutional neural network (ConvNet). We use as objective f(w; (x, y)) = ℓ(y, ϕ(x; w)), where ℓ is the multinomial logistic loss. We evaluate performance using the classification accuracy.
TABLE II
DATASET DESCRIPTION AND STATISTICS
b) Character-Level Language Modeling: We learn a character-level language model over the Complete Works of Shakespeare [79]. We formulate it as a multiclass classification problem, where the input x is a window of 20 characters and the output y is the next (i.e., 21st) character. Each device is a role from a play (e.g., Brutus from The Tragedy of Julius Caesar). We use a long short-term memory model (LSTM) [80] together with the multinomial logistic loss. The performance is evaluated with the classification accuracy of next-character prediction.
c) Sentiment Analysis: We use the Sent140 dataset [81] where the input x is a tweet and the output y = ±1 is its sentiment. Each device is a distinct Twitter user. We use a linear model on the average of the GloVe embeddings [82] of the words of the tweet. It is trained with the binary logistic loss and evaluated with the classification accuracy.

Corruption Models: We consider the following corruption models for corrupted devices C, cf. Section III-B:
a) Data Poisoning: The distribution D_i on a device i ∈ C is replaced by some fixed D̃_i. For EMNIST, we take the negative of an image so that D̃_i(x, y) = D_i(1 − x, y). For the Shakespeare dataset, we reverse the text so that D̃_i(c_1, · · · , c_20, c_21) = D_i(c_21, · · · , c_2, c_1). In both these cases, the labels are unchanged. For the Sent140 dataset, we flip the label while keeping x unchanged.
b) Update poisoning with Gaussian corruption: Each corrupted device i ∈ C returns w_i^{(t+1)} = w_{i,τ}^{(t)} + ζ_i^{(t)}, where ζ_i^{(t)} ∼ N(0, σ²I) and σ² is the variance across the components of w_{i,τ}^{(t)} − w^{(t)}.⁴
c) Update poisoning with omniscient corruption: The parameters w_i^{(t+1)} returned by devices i ∈ C are modified so that the weighted arithmetic mean Σ_{i∈S_t} α_i w_i^{(t+1)} over the selected devices S_t is set to −Σ_{i∈S_t} α_i w_{i,τ}^{(t)}, the negative of what it would have been without the corruption. This is designed to hurt the weighted arithmetic mean aggregation.

⁴Model updates w_i^{(t)} − w^{(t)} are aggregated, not the models w_i^{(t)} directly [2].
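The two update-poisoning models can be simulated in a few lines; the sketch below is our own illustration (in particular, the omniscient corruption spreads the required shift equally over the corrupted devices, which is one of several splits with the same weighted sum).

```python
import numpy as np

def gaussian_corruption(update, w_global, rng):
    """Model (b): add Gaussian noise whose variance matches the empirical
    variance of the components of the clean model update."""
    sigma = np.std(update - w_global)
    return update + rng.normal(0.0, sigma, size=update.shape)

def omniscient_corruption(clean_updates, alphas, corrupt_idx):
    """Model (c): corrupted devices jointly shift their updates so that the
    weighted arithmetic mean of the round becomes the negative of the clean one."""
    alphas = np.asarray(alphas, dtype=float)
    updates = [u.copy() for u in clean_updates]
    clean_sum = sum(a * u for a, u in zip(alphas, clean_updates))
    shift = -2.0 * clean_sum / alphas[corrupt_idx].sum()
    for i in corrupt_idx:
        updates[i] = updates[i] + shift
    return updates
```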
Hyperparameters: The hyperparameters are chosen similar to the defaults of [1]. A learning rate schedule was tuned on a validation set for FedAvg with no corruption. The same schedule was used for RFA. The aggregation in RFA is implemented using the smoothed Weiszfeld algorithm with a budget of R = 3 calls to the secure average oracle, thanks to its rapid empirical convergence (cf. Fig. 1), and ν = 10⁻⁶ for numerical stability. Each simulation was repeated 5 times and the shaded area denotes the minimum and maximum over these runs. Appendix IV gives details on hyperparameters, and a sensitivity analysis of the Weiszfeld communication budget.

B. Robustness and Convergence of RFA

First, we compare the robustness of RFA as opposed to vanilla FedAvg to different types of corruption across different datasets in Fig. 2. We make the following observations.

RFA gives improved robustness to linear models with data corruption: For instance, consider the EMNIST linear model at ρ = 1/4. RFA achieves 52.8% accuracy, over 10% better than FedAvg at 41.2%.

RFA performs similarly to FedAvg in deep nets with data corruption: RFA and FedAvg are within one standard deviation of each other for the Shakespeare LSTM model, and nearly equal for the EMNIST ConvNet model. We note that the behavior of the training of a neural network when the data is corrupted is not well-understood in general [e.g., 83].

RFA gives improved robustness to omniscient corruptions for all models: For the omniscient corruption, the test accuracy of FedAvg is close to 0% for the EMNIST linear model and ConvNet, while RFA still achieves over 40% at ρ = 1/4 for the former and well over 60% for the latter. A similar trend holds for the Shakespeare LSTM model.

RFA almost matches FedAvg in the absence of corruption: Recall from Section III-B that robustness comes at the cost of heterogeneity; this is also reflected in the theory of Section IV. Empirically, we find that the performance hit of RFA due to heterogeneity is quite small: 1.4% for the EMNIST linear model (64.3% vs. 62.9%), under 0.4% for the Shakespeare LSTM, and 0.3% for Sent140 (65.0% vs. 64.7%). Further, we demonstrate in Appendix IV-E that, consistent with the theory, this gap completely vanishes in the i.i.d. case.

RFA is competitive with other robust aggregation schemes while being privacy-preserving: We now compare RFA with: (a) coordinate-wise median [52] and ℓ2 norm clipping [84], which are agnostic to the actual corruption level ρ like RFA, and, (b) trimmed mean [52] and multi-Krum [49], that require exact knowledge of the level of corruption ρ in the problem. We find that RFA is more robust than the two agnostic algorithms, coordinate-wise median and norm clipping. Perhaps surprisingly, RFA is also more robust than the trimmed mean, which uses perfect knowledge of the corruption level ρ.
Fig. 2. Comparison of robustness of RFA and FedAvg under data corruption (top) and update corruption (bottom). The left three plots for update corruption
show omniscient corruption while the rightmost one shows Gaussian corruption. The shaded area denotes minimum and maximum over 5 random seeds.
Fig. 3. Comparison of RFA with other robust aggregation algorithms on Sent140 with data corruption.

We note that multi-Krum is more robust than RFA. That being said, RFA has the advantage that it is fully agnostic to the actual corruption level ρ and is privacy-preserving, while the other robust approaches are not.

Summary: robustness of RFA: Overall, we find that RFA is no worse than FedAvg in the presence of corruption and is often better, while being almost as good in the absence of corruption. Furthermore, RFA degrades more gracefully as the corruption level increases.

RFA requires only 3× the communication of FedAvg: Next, we plot in Fig. 4 the performance versus the number of rounds of communication as measured by the number of calls to the secure average oracle. We note that in the low corruption regime of ρ = 0 or ρ = 10⁻² under data corruption, RFA requires 3× the number of calls to the secure average oracle to reach the same performance. However, it matches the performance of FedAvg when measured in terms of the number of outer iterations, with the additional communication cost coming from the multiple Weiszfeld iterations used to compute the aggregate.

C. Extensions of RFA

We now study the proposed extensions: one-step RFA and personalization.

One-step RFA gives most of the robustness with no extra communication: From Fig. 5, we observe that one-step RFA is quite close in performance to RFA across different levels of corruption, for both data corruption on an EMNIST linear model and omniscient corruption on an EMNIST ConvNet. For instance, in the former, one-step RFA gets 51.4% in accuracy, which is 10% better than FedAvg while being almost as good as full RFA (52.8%) at ρ = 0.25. Moreover, for the latter, we find that one-step RFA (67.9%) actually achieves higher test accuracy than full RFA (63.0%) at ρ = 0.25.

Personalization helps RFA offset effects of heterogeneity: Fig. 6 plots the effect of RFA with personalization. First, we observe that personalization leads to an improvement with no corruption for both FedAvg and RFA. For the EMNIST linear model, we get 70.1% and 69.9% respectively, up from 64.3% and 62.9%. Second, we observe that RFA exhibits greater robustness to corruption with personalization. At ρ = 1/4 with the EMNIST linear model, RFA with personalization gives 66.4% (a reduction of 3.4%) while no personalization gives 52.8% (a reduction of 10.1%). The results for Sent140 are similar, with the exception that FedAvg with personalization is nearly identical to RFA with personalization.
Fig. 4. Comparison of methods plotted against number of calls to the secure average oracle for different corruption settings. For the case of omniscient corruption,
FedAvg and SGD are not shown in the plot if they diverge. The shaded area denotes the maximum and minimum over 5 random seeds.
Fig. 6. Effect of personalization on the robustness of RFA and FedAvg under data corruption.

VI. CONCLUSION

We presented a robust aggregation approach, based on the geometric median and the smoothed Weiszfeld algorithm to efficiently compute it, to make federated learning more robust to settings where a fraction of the devices may be sending corrupted updates to the orchestrating server. The robust aggregation oracle preserves the privacy of participating devices, operating with calls to secure multi-party computation primitives enjoying privacy preservation theoretical guarantees. RFA is available in an open-source implementation in TensorFlow Federated [15].

REFERENCES

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Artif. Intell. Statist., 2017, pp. 1273–1282.
[2] P. Kairouz et al., "Advances and open problems in federated learning," Found. Trends Mach. Learn., vol. 14, no. 1-2, pp. 1–210, 2021.
[3] T. Yang et al., "Applied federated learning: Improving Google keyboard query suggestions," CoRR, vol. abs/1812.02903, 2018. [Online]. Available: http://arxiv.org/abs/1812.02903
[4] M. Ammad-ud-din et al., "Federated collaborative filtering for privacy-preserving personalized recommendation system," CoRR, vol. abs/1901.09888, 2019. [Online]. Available: http://arxiv.org/abs/1901.09888
[5] A. Pantelopoulos and N. G. Bourbakis, "A survey on wearable sensor-based systems for health monitoring and prognosis," IEEE Trans. Syst., Man, Cybern., Part C (Appl. Rev.), vol. 40, no. 1, pp. 1–12, Jan. 2009.
[6] L. Huang, A. L. Shea, H. Qian, A. Masurkar, H. Deng, and D. Liu, "Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records," J. Biomed. Informat., vol. 99, 2019, Art. no. 103291.
[7] J. Ren, H. Wang, T. Hou, S. Zheng, and C. Tang, "Federated learning-based computation offloading optimization in edge computing-supported Internet of Things," IEEE Access, vol. 7, pp. 69194–69201, 2019.
[8] S. Lin, G. Yang, and J. Zhang, "A collaborative learning framework via federated meta-learning," in Proc. IEEE Int. Conf. Distrib. Comput. Syst., 2020, pp. 289–299.
[9] W. Zhuang, X. Gan, Y. Wen, S. Zhang, and S. Yi, "Collaborative unsupervised visual representation learning from decentralized data," in Proc. Int. Conf. Comput. Vis., 2021, pp. 4912–4921.
[10] R. A. Maronna, D. R. Martin, and V. J. Yohai, Robust Statistics: Theory and Methods. Hoboken, NJ, USA: Wiley, 2018.
[11] K. Bonawitz et al., "Practical secure aggregation for privacy-preserving machine learning," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 1175–1191.
[12] J. H. Bell, K. A. Bonawitz, A. Gascón, T. Lepoint, and M. Raykova, "Secure single-server aggregation with (poly)logarithmic overhead," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2020, pp. 1253–1269.
[13] K. A. Bonawitz et al., "Towards federated learning at scale: System design," in Proc. Mach. Learn. Syst., 2019, pp. 374–388.
[14] E. Weiszfeld, "On the point for which the sum of the distances to n given points is minimum," (in French), Tohoku Math. J., First Ser., vol. 43, pp. 355–386, 1937.
[15] 2019. [Online]. Available: https://github.com/google-research/federated/tree/master/robust_aggregation
[16] 2019. [Online]. Available: https://github.com/krishnap25/rfa
[17] B. Balle, G. Barthe, M. Gaboardi, J. Hsu, and T. Sato, "Hypothesis testing interpretations and Rényi differential privacy," in Proc. 23rd Int. Conf. Artif. Intell. Statist., 2020, pp. 2496–2506.
[18] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Proc. Adv. Neural Inf. Process. Syst., 2017, vol. 30, pp. 4424–4434.
[19] M. Mohri, G. Sivek, and A. T. Suresh, "Agnostic federated learning," in Proc. Int. Conf. Mach. Learn., 2019, vol. 97, pp. 4615–4625.
[20] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proc. Mach. Learn. Syst., 2020, pp. 429–450.
[21] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, "Scaffold: Stochastic controlled averaging for federated learning," in Proc. Int. Conf. Mach. Learn., 2020, pp. 5132–5143.
[22] C. T. Dinh, N. H. Tran, and T. D. Nguyen, "Personalized federated learning with Moreau envelopes," in Proc. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 21394–21405.
[23] A. Fallah, A. Mokhtari, and A. E. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 3557–3568.
[24] Y. Laguel, K. Pillutla, J. Malick, and Z. Harchaoui, "A superquantile approach to federated learning with heterogeneous devices," in Proc. IEEE Conf. Inf. Sci. Syst., 2021, pp. 1–6.
[25] D. Avdiukhin and S. P. Kasiviswanathan, "Federated learning under arbitrary communication patterns," in Proc. Int. Conf. Mach. Learn., 2021, vol. 139, pp. 425–435.
[26] S. J. Reddi et al., "Adaptive federated optimization," in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=LkFG3lB13U5
[27] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, May 2020.
[28] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Belmont, MA, USA: Athena Scientific, 1997.
[29] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi, "COCOA: A general framework for communication-efficient distributed optimization," J. Mach. Learn. Res., vol. 18, pp. 230:1–230:49, 2018.
[30] C. Ma et al., "Distributed optimization with arbitrary local solvers," Optim. Methods Softw., vol. 32, no. 4, pp. 813–848, 2017.
[31] L. He, A. Bian, and M. Jaggi, "COLA: Decentralized linear learning," in Proc. Adv. Neural Inf. Process. Syst., 2018, vol. 31, pp. 4541–4551.
[32] R. Leblond, F. Pedregosa, and S. Lacoste-Julien, "Improved asynchronous parallel optimization analysis for stochastic incremental methods," J. Mach. Learn. Res., vol. 19, pp. 81:1–81:68, 2018.
[33] A. H. Sayed, "Adaptation, learning, and optimization over networks," Found. Trends Mach. Learn., vol. 7, no. 4-5, pp. 311–801, 2014.
[34] P. J. Huber, "Robust estimation of a location parameter," Ann. Math. Statist., vol. 35, no. 1, pp. 73–101, Mar. 1964.
[35] P. J. Huber, "Robust statistics," in International Encyclopedia of Statistical Science. Berlin, Heidelberg: Springer, 2011, pp. 1248–1251, doi: 10.1007/978-3-642-04898-2_594.
[36] A. S. Nemirovski and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization. Hoboken, NJ, USA: Wiley, 1983.
[37] S. Minsker, "Geometric median and robust estimation in Banach spaces," Bernoulli, vol. 21, no. 4, pp. 2308–2335, 2015.
[38] D. J. Hsu and S. Sabato, "Loss minimization and parameter estimation with heavy tails," J. Mach. Learn. Res., vol. 17, pp. 18:1–18:40, 2016.
[39] G. Lugosi and S. Mendelson, "Risk minimization by median-of-means tournaments," J. Eur. Math. Soc., vol. 22, no. 3, pp. 925–965, 2019.
[40] G. Lecué and M. Lerasle, "Robust machine learning by median-of-means: Theory and practice," Ann. Statist., vol. 48, no. 2, pp. 906–931, 2020.
[41] G. Lugosi and S. Mendelson, "Regularization, sparse recovery, and median-of-means tournaments," Bernoulli, vol. 25, no. 3, pp. 2075–2106, 2019, doi: 10.3150/18-BEJ1046.
[42] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, "Robust estimators in high dimensions without the computational intractability," in Proc. Symp. Foundations Comput. Sci., 2016, pp. 655–664.
[43] S. Minsker, "Uniform bounds for robust mean estimators," 2019.
[44] Y. Cheng, I. Diakonikolas, and R. Ge, "High-dimensional robust mean estimation in nearly-linear time," in Proc. ACM-SIAM Symp. Discrete Algorithms, 2019, pp. 2755–2771.
[45] S. Al-Sayed, A. M. Zoubir, and A. H. Sayed, "Robust distributed estimation by networked agents," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 3909–3921, Aug. 2017.
[46] Y. Yu, H. Zhao, R. C. de Lamare, Y. Zakharov, and L. Lu, "Robust distributed diffusion recursive least squares algorithms with side information for adaptive networks," IEEE Trans. Signal Process., vol. 67, no. 6, pp. 1566–1581, Mar. 2019.
[47] Y. Chen, S. Kar, and J. M. Moura, "Resilient distributed parameter estimation with heterogeneous data," IEEE Trans. Signal Process., vol. 67, no. 19, pp. 4918–4933, Oct. 2019.
[48] L. Lamport, R. E. Shostak, and M. C. Pease, "The Byzantine generals problem," ACM Trans. Program. Lang. Syst., vol. 4, no. 3, pp. 382–401, 1982.
[49] P. Blanchard, R. Guerraoui, E. M. El Mhamdi, and J. Stainer, "Machine learning with adversaries: Byzantine tolerant gradient descent," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 119–129.
[50] Y. Chen, L. Su, and J. Xu, "Distributed statistical machine learning in adversarial settings: Byzantine gradient descent," Proc. ACM Meas. Anal. Comput. Syst., vol. 1, no. 2, pp. 44:1–44:25, 2017.
[51] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, "DRACO: Byzantine-resilient distributed training via redundant gradients," in Proc. Int. Conf. Mach. Learn., 2018, pp. 902–911.
[52] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," in Proc. Int. Conf. Mach. Learn., 2018, pp. 5636–5645.
[53] D. Alistarh, Z. Allen-Zhu, and J. Li, "Byzantine stochastic gradient descent," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 4618–4628.
[54] X. Cao and L. Lai, "Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers," IEEE Trans. Signal Process., vol. 67, no. 22, pp. 5850–5864, Nov. 2019.
[55] P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia, "A formal foundation for secure remote execution of enclaves," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 2435–2450.
[56] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, "RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 1544–1551.
[57] H. W. Kuhn, "A note on Fermat's problem," Math. Program., vol. 4, no. 1, pp. 98–107, Dec. 1973.
[58] Y. Vardi and C.-H. Zhang, "A modified Weiszfeld algorithm for the Fermat-Weber location problem," Math. Program., vol. 90, no. 3, pp. 559–566, 2001.
[59] A. Beck and S. Sabach, "Weiszfeld's method: Old and new results," J. Optim. Theory Appl., vol. 164, no. 1, pp. 1–40, 2015.
[60] I. N. Katz, "Local convergence in Fermat's problem," Math. Program., vol. 6, no. 1, pp. 89–104, 1974.
[61] M. B. Cohen, Y. T. Lee, G. L. Miller, J. Pachocki, and A. Sidford, "Geometric median in nearly linear time," in Proc. Symp. Theory Comput., 2016, pp. 9–21.
[62] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith, "Calibrating noise to sensitivity in private data analysis," in Proc. Theory Cryptography Conf., Berlin, Heidelberg, Germany: Springer, 2006, vol. 3876, pp. 265–284.
[63] P. Kairouz, Z. Liu, and T. Steinke, "The distributed discrete Gaussian mechanism for federated learning with secure aggregation," in Proc. ICML, 2021, vol. 139, pp. 5201–5212.
[64] C. Gentry, “Computing arbitrary functions of encrypted data,” Commun. ACM, vol. 53, no. 3, pp. 97–105, 2010.
[65] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” CoRR, vol. abs/2103.17150, 2021. [Online]. Available: https://arxiv.org/abs/2103.17150
[66] J. Wang et al., “A field guide to federated optimization,” CoRR, vol. abs/2107.06917, 2021. [Online]. Available: https://arxiv.org/abs/2107.06917
[67] D. L. Donoho and P. J. Huber, “The notion of breakdown point,” in A Festschrift for Erich L. Lehmann, 1983, pp. 157–184.
[68] D. Evans et al., “A pragmatic introduction to secure multi-party computation,” Found. Trends Privacy Secur., vol. 2, no. 2-3, pp. 70–246, 2018.
[69] M. Chen, C. Gao, and Z. Ren, “Robust covariance and scatter matrix estimation under Huber’s contamination model,” Ann. Statist., vol. 46, no. 5, pp. 1932–1960, 2018.
[70] H. P. Lopuhaa and P. J. Rousseeuw, “Breakdown points of affine equivariant estimators of multivariate location and covariance matrices,” Ann. Statist., vol. 19, no. 1, pp. 229–248, Mar. 1991.
[71] F. Bach and E. Moulines, “Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n),” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 773–781.
[72] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, V. K. Pillutla, and A. Sidford, “A Markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares),” in Proc. Conf. Found. Softw. Technol. Theor. Comput. Sci., 2017, vol. 2, pp. 2:1–2:10.
[73] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, “Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification,” J. Mach. Learn. Res., vol. 18, pp. 223:1–223:42, 2017.
[74] V. M. Panaretos and Y. Zemel, An Invitation to Statistics in Wasserstein Space. Berlin, Germany: Springer Nature, 2020.
[75] Z. Wu, Q. Ling, T. Chen, and G. B. Giannakis, “Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks,” IEEE Trans. Signal Process., vol. 68, pp. 4583–4596, 2020.
[76] A. Agarwal, J. Langford, and C.-Y. Wei, “Federated residual learning,” CoRR, vol. abs/2003.12880, 2020. [Online]. Available: https://arxiv.org/abs/2003.12880
[77] S. Caldas et al., “LEAF: A benchmark for federated settings,” CoRR, vol. abs/1812.01097, 2018. [Online]. Available: http://arxiv.org/abs/1812.01097
[78] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: An extension of MNIST to handwritten letters,” CoRR, vol. abs/1702.05373, 2017. [Online]. Available: http://arxiv.org/abs/1702.05373
[79] W. Shakespeare, “The complete works of William Shakespeare.” [Online]. Available: https://www.gutenberg.org/ebooks/100
[80] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[81] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” CS224N Project Report, pp. 1–12, 2009.
[82] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. Empirical Methods Natural Lang. Process., 2014, pp. 1532–1543.
[83] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. Int. Conf. Learn. Representations, 2017. [Online]. Available: https://dblp.org/rec/conf/iclr/ZhangBHRV17.html?view=bibtex
[84] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan, “Can you really backdoor federated learning?,” CoRR, vol. abs/1911.07963, 2019. [Online]. Available: http://arxiv.org/abs/1911.07963
[85] A. Beck and M. Teboulle, “Smoothing and first order methods: A unified framework,” SIAM J. Optim., vol. 22, no. 2, pp. 557–580, 2012.
[86] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: Athena Scientific, 2016.
[87] J. Mairal, “Optimization with first-order surrogate functions,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 783–791.
[88] J. Mairal, “Incremental majorization-minimization optimization with application to large-scale machine learning,” SIAM J. Optim., vol. 25, no. 2, pp. 829–855, 2015.
[89] Y. Nesterov, Introductory Lectures on Convex Optimization. Berlin, Germany: Springer, 2018.
[90] A. Beck, “On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes,” SIAM J. Optim., vol. 25, no. 1, pp. 185–209, 2015.
[91] Y. LeCun et al., “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.