Federated Learning With Differential Privacy: Algorithms and Performance Analysis
Abstract—Federated learning (FL), as a type of distributed machine learning, is capable of significantly preserving clients' private data from being exposed to adversaries. Nevertheless, private information can still be divulged by analyzing uploaded parameters from clients, e.g., weights trained in deep neural networks. In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of differential privacy (DP), in which artificial noise is added to parameters at the clients' side before aggregating, namely, noising before model aggregation FL (NbAFL). First, we prove that the NbAFL can satisfy DP under distinct protection levels by properly adapting different variances of artificial noise. Then we develop a theoretical convergence bound on the loss function of the trained FL model in the NbAFL. Specifically, the theoretical bound reveals the following three key properties: 1) there is a tradeoff between convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) given a fixed privacy protection level, increasing the number N of overall clients participating in FL can improve the convergence performance; and 3) there is an optimal number of aggregation times (communication rounds) in terms of convergence performance for a given protection level. Furthermore, we propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound for the loss function in this case, and the K-client random scheduling strategy also retains the above three properties. Moreover, we find that there is an optimal K that achieves the best convergence performance at a fixed privacy level. Evaluations demonstrate that our theoretical results are consistent with simulations, thereby facilitating the design of various privacy-preserving FL algorithms with different tradeoff requirements on convergence performance and privacy levels.

Index Terms—Federated learning, differential privacy, convergence performance, information leakage, client selection.
Manuscript received December 6, 2019; revised March 20, 2020; accepted April 11, 2020. Date of publication April 17, 2020; date of current version June 16, 2020. This work was supported in part by the National Key Research and Development Program under Grant 2018YFB1004800, in part by the National Natural Science Foundation of China under Grant 61872184 and Grant 61727802, in part by the SUTD Growth Plan Grant for AI, in part by the U.S. National Science Foundation under Grant CCF-1908308, and in part by the Princeton Center for Statistics and Machine Learning under a Data X Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Aris Gkoulalas-Divanis. (Corresponding authors: Jun Li; Chuan Ma.)

Kang Wei and Chuan Ma are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]; [email protected]). Jun Li is with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China, and also with the School of Computer Science and Robotics, National Research Tomsk Polytechnic University, 634050 Tomsk, Russia (e-mail: [email protected]). Ming Ding is with CSIRO Data61, Sydney, NSW 2015, Australia (e-mail: [email protected]). Howard H. Yang and Tony Q. S. Quek are with the Singapore University of Technology and Design, Singapore 487372 (e-mail: [email protected]; [email protected]). Farhad Farokhi was with CSIRO's Data61, Melbourne, VIC 3008, Australia; he is now with the Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, VIC 3010, Australia (e-mail: [email protected]). Shi Jin is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: [email protected]). H. Vincent Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIFS.2020.2988575

I. INTRODUCTION

IT IS anticipated that big data-driven artificial intelligence (AI) will soon be applied in many aspects of our daily life, including medical care, agriculture, transportation systems, etc. At the same time, the rapid growth of Internet-of-Things (IoT) applications calls for data mining and learning securely and reliably in distributed systems [1]–[3]. When integrating AI into a variety of IoT applications, distributed machine learning (ML) is preferred for many data processing tasks by defining parametrized functions from inputs to outputs as compositions of basic building blocks [4], [5]. Federated learning (FL) is a recent advance in distributed ML in which data are acquired and processed locally at the client side, and then the updated ML parameters are transmitted to a central server for aggregation [6]–[8]. The goal of FL is to fit a model generated by an empirical risk minimization (ERM) objective. However, FL also poses several key challenges, such as private information leakage, expensive communication costs between servers and clients, and device variability [9]–[15].

Generally, distributed stochastic gradient descent (SGD) is adopted in FL for training ML models. In [16], [17], bounds for FL convergence performance were developed based on distributed SGD, with a one-step local update before global aggregation. The work in [18] considered partially global aggregation, where after each local update step, parameter aggregation is performed over a non-empty subset of the set of clients.
In order to analyze the convergence, the federated proximal algorithm (FedProx) was proposed [19] by adding regularization on each local loss function. The work in [20] obtained a convergence bound for SGD based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.

At the same time, with the ever increasing awareness of data security of personal information, privacy preservation has become a significant issue, especially for big data applications and distributed learning systems. One prominent advantage of FL is that it enables local training without personal data exchange between the server and clients, thereby protecting clients' data from being eavesdropped upon by hidden adversaries. Nevertheless, private information can still be divulged to some extent by analyzing the differences of parameters trained and uploaded by the clients, e.g., weights trained in neural networks [21]–[23].

A natural approach to preventing information leakage is to add artificial noise, one prominent example of which is differential privacy (DP) [24], [25]. Existing works on DP based learning algorithms include local DP (LDP) [26]–[28], DP based distributed SGD [29], [30] and DP meta learning [31]. In LDP, each client perturbs its information locally and only sends a randomized version to a server, thereby protecting both the clients and the server against private information leakage. The work in [27] proposed solutions to building up an LDP-compliant SGD, which powers a variety of important ML tasks. The work in [28] considered distributed estimation at the server over uploaded data from clients while providing protections on these data with LDP. The work in [32] introduced an algorithm for user-level differentially private training of large neural networks, in particular a complex sequence model for next-word prediction. The work in [33] developed a chain abstraction model on tensors to efficiently override operations (or encode new ones) such as sending/sharing a tensor between workers, and then provided the elements to implement recently proposed DP and multiparty computation protocols using this framework. The work in [29] improved the computational efficiency of DP based SGD by tracking detailed information about the privacy loss, and obtained accurate estimates of the overall privacy loss. The work in [30] proposed novel DP based SGD algorithms and analyzed their performance bounds, which were shown to be related to privacy levels and the sizes of datasets. Also, the work in [31] focused on the class of gradient-based parameter-transfer methods and developed a DP based meta learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.

More specifically, DP based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in [34] proposed an FL algorithm with the consideration on preserving clients' privacy. This algorithm can achieve good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in [35] presented an alternative approach that utilizes both DP and secure multiparty computation (SMC) to prevent differential attacks. However, the above two works on DP-based FL design have not taken into account privacy protection during the parameter uploading stage, i.e., the clients' private information can be potentially intercepted by hidden adversaries when uploading the training results to the server. Moreover, these two works only showed empirical results using simulations, but lacked theoretical analysis on the FL system, such as the tradeoffs between privacy, convergence performance, and convergence rate. To the authors' knowledge, a theoretical analysis on the convergence behavior of FL with privacy-preserving noise perturbations has not yet been considered in existing studies, which will be the major focus of this work. Compared with conventional works, such as [34], [35], which focus mainly on simulation results, our theoretical performance analysis is more efficient for finding the optimal parameters, e.g., the number of chosen clients K and the number of maximum aggregation times T, to achieve the minimum loss function.

In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of DP, in which each client perturbs its trained parameters locally by purposely adding noise before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). To the best of the authors' knowledge, this is the first piece of work of its kind that provides a theoretical analysis of the convergence properties of differentially private FL algorithms.

The main contributions of this paper are summarized as follows:
• We prove that the proposed NbAFL scheme satisfies the requirement of DP in terms of global data under a certain noise perturbation level with Gaussian noise by properly adapting their variances.
• We develop a convergence bound on the loss function of the trained FL model in the NbAFL with artificial Gaussian noise. Our developed bound reveals the following three key properties: 1) there is a tradeoff between the convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and 3) there is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level.
• We propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound on the loss function in this case. From our analysis, the K-client random scheduling strategy retains the above three properties. Also, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level.
• We conduct extensive simulations based on real-world datasets to validate the properties of our theoretical bound in NbAFL. Evaluations demonstrate that our theoretical results are consistent with simulations, thereby facilitating the design of privacy-preserving FL algorithms with different tradeoff requirements on convergence performance and privacy levels.
TABLE I. SUMMARY OF MAIN NOTATION.
happen in the broadcasting (through downlink channels) phase by analyzing the global parameter w.

We also assume that uplink channels are more secure than downlink broadcasting channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically in each uploading time, while downlink channels are broadcasting. Hence, we assume that there are at most L (L ≤ T) exposures of uploaded parameters from each client in the uplink¹ and T exposures of aggregated parameters in the downlink, where T is the number of aggregation times.

¹Here we assume that the adversary cannot know where the parameters come from.

C. Differential Privacy

(ε, δ)-DP provides a strong criterion for privacy preservation of distributed data processing systems. Here, ε > 0 is the distinguishable bound of all outputs on neighboring datasets D_i, D_i′ in a database, and δ represents the probability that the ratio of the probabilities for two adjacent datasets D_i, D_i′ cannot be bounded by e^ε after adding a privacy-preserving mechanism. With an arbitrarily given δ, a privacy-preserving mechanism with a larger ε gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. Now, we will formally define DP as follows.

Definition 1 ((ε, δ)-DP [24]): A randomized mechanism M : X → R with domain X and range R satisfies (ε, δ)-DP, if for all measurable sets S ⊆ R and for any two adjacent databases D_i, D_i′ ∈ X,

    Pr[M(D_i) ∈ S] ≤ e^ε Pr[M(D_i′) ∈ S] + δ.    (3)

For numerical data, a Gaussian mechanism defined in [24] can be used to guarantee (ε, δ)-DP. According to [24], we present the following DP mechanism by adding artificial Gaussian noise. In order to ensure that the given noise distribution n ∼ N(0, σ²) preserves (ε, δ)-DP, where N represents the Gaussian distribution, we choose the noise scale σ ≥ cΔs/ε and the constant c ≥ √(2 ln(1.25/δ)) for ε ∈ (0, 1). In this result, n is the value of an additive noise sample for a data point in the dataset, Δs is the sensitivity of the function s given by Δs = max_{D_i, D_i′} ||s(D_i) − s(D_i′)||, and s is a real-valued function.

Considering the above DP mechanism, choosing an appropriate level of noise remains a significant research problem, which will affect the privacy guarantee of clients and the convergence rate of the FL process.

III. FEDERATED LEARNING WITH DIFFERENTIAL PRIVACY

In this section, we first introduce the concept of global DP and analyze the DP performance in the context of FL. Then we propose the NbAFL scheme that can satisfy the DP requirement by adding proper noisy perturbations at both the clients and the server.

A. Global Differential Privacy

Here, we define a global (ε, δ)-DP requirement for both uplink and downlink channels. From the uplink perspective, using a clipping technique, we can ensure that ||w_i|| ≤ C, where w_i denotes the training parameters from the i-th client without perturbation and C is a clipping threshold for bounding w_i. We assume that the batch size in the local training is equal to the number of training samples and then define the local training process in the i-th client by

    s_U^{D_i} ≜ w_i = arg min_w F_i(w, D_i) = arg min_w (1/|D_i|) Σ_{j=1}^{|D_i|} F_i(w, D_{i,j}),    (4)

where D_i is the i-th client's database and D_{i,j} is the j-th sample in D_i. Thus, the sensitivity of s_U^{D_i} can be expressed as

    Δs_U^{D_i} = max_{D_i, D_i′} || s_U^{D_i} − s_U^{D_i′} || = max_{D_i, D_i′} || arg min_w (1/|D_i|) Σ_{j=1}^{|D_i|} F_i(w, D_{i,j}) − arg min_w (1/|D_i′|) Σ_{j=1}^{|D_i′|} F_i(w, D′_{i,j}) || = 2C/|D_i|,    (5)

where D_i′ is an adjacent dataset to D_i that has the same size but differs in only one sample, and D′_{i,j} is the j-th sample in D_i′. From the above result, a global sensitivity in the uplink channel can be defined by

    Δs_U ≜ max_i Δs_U^{D_i}.    (6)

To achieve a small global sensitivity, the ideal condition is that all the clients use sufficient local datasets for training. Hence, we define the minimum size of the local datasets by m and then obtain Δs_U = 2C/m. To ensure (ε, δ)-DP for each client in the uplink in one exposure, we set the noise scale, represented by the standard deviation of the additive Gaussian noise, as σ_U = cΔs_U/ε. Considering L exposures of local parameters, we need to set σ_U = cLΔs_U/ε due to the linear relation between ε and σ_U in the Gaussian mechanism.

From the downlink perspective, the aggregation operation for D_i can be expressed as

    s_D^{D_i} ≜ w = p_1 w_1 + · · · + p_i w_i + · · · + p_N w_N,    (7)

where 1 ≤ i ≤ N and w is the aggregated parameter at the server to be broadcast to the clients. Regarding the sensitivity of s_D^{D_i}, i.e., Δs_D^{D_i}, we have the following lemma.

Lemma 1 (Sensitivity for the Aggregation Operation): In the FL training process, the sensitivity for D_i after the aggregation operation s_D^{D_i} is given by

    Δs_D^{D_i} = 2C p_i / m.    (8)

Proof: See Appendix A.
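As a concrete illustration of the clipping and uplink noise calibration described above, the following minimal NumPy sketch computes the Gaussian-mechanism noise scale σ_U = cLΔs_U/ε with c = √(2 ln(1.25/δ)) and Δs_U = 2C/m, and applies it to a clipped parameter vector. The function names and the numerical settings (ε = 60, δ = 0.01, C = 20, m = 512 samples, L = 25 exposures, loosely echoing the simulation settings reported later) are illustrative only, not part of the paper's specification.

```python
import numpy as np

def gaussian_noise_scale(epsilon, delta, sensitivity, exposures=1):
    """Noise standard deviation for the Gaussian mechanism:
    sigma = c * L * sensitivity / epsilon with c = sqrt(2 ln(1.25/delta))."""
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    return c * exposures * sensitivity / epsilon

def clip_parameters(w, C):
    """Norm clipping: returns w / max(1, ||w||/C), so that ||w|| <= C."""
    return w / max(1.0, np.linalg.norm(w) / C)

def perturb_local_update(w, C, epsilon, delta, m, L):
    """Clip a local parameter vector and add Gaussian noise calibrated to the
    uplink sensitivity 2C/m over L exposures (Section III-A)."""
    w_clipped = clip_parameters(w, C)
    sigma_u = gaussian_noise_scale(epsilon, delta, 2.0 * C / m, exposures=L)
    return w_clipped + np.random.normal(0.0, sigma_u, size=w_clipped.shape)

# Example call with illustrative settings
w_noisy = perturb_local_update(np.random.randn(100), C=20, epsilon=60,
                               delta=0.01, m=512, L=25)
```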
Algorithm 1 Noising Before Aggregation FL (NbAFL)
Data: T, w^(0), μ, ε and δ
1  Initialization: t = 1 and w_i^(0) = w^(0), ∀i
2  while t ≤ T do
3    Local training process:
4    for each client C_i ∈ {C_1, C_2, . . . , C_N} do
5      Update the local parameters w_i^(t) as
         w_i^(t) = arg min_{w_i} ( F_i(w_i) + (μ/2) ||w_i − w^(t−1)||² )
6      Clip the local parameters: w_i^(t) = w_i^(t) / max( 1, ||w_i^(t)|| / C )
7      Add noise and upload the parameters: w̃_i^(t) = w_i^(t) + n_i^(t)
8    Model aggregating process:
9    Update the global parameters w^(t) as w^(t) = Σ_{i=1}^N p_i w̃_i^(t)
10   The server broadcasts the globally noised parameters: w̃^(t) = w^(t) + n_D^(t)
11   Local testing process:
12   for each client C_i ∈ {C_1, C_2, . . . , C_N} do
13     Test the aggregated parameters w̃^(t) using the local dataset
14   t ← t + 1
Result: w̃^(T)

Remark 1: From the above lemma, to achieve a small global sensitivity in the downlink channel, which is defined by

    Δs_D ≜ max_i Δs_D^{D_i} = max_i 2C p_i / m,    (9)

the ideal condition is that all the clients should use the same size of local datasets for training, i.e., p_i = 1/N.

From the above remark, when setting p_i = 1/N, ∀i, we can obtain the optimal value of the sensitivity Δs_D. So here we should add noise at the client side first and then decide whether or not to add noise at the server to satisfy the (ε, δ)-DP criterion in the downlink channel.

Theorem 1 (DP Guarantee for Downlink Channels): To ensure (ε, δ)-DP in the downlink channels with T aggregations, the standard deviation of the Gaussian noise terms n_D that are added to the aggregated parameter w by the server can be given as

    σ_D = 2cC √(T² − L²N) / (mNε),  if T > L√N,  and  σ_D = 0,  if T ≤ L√N.    (10)

Proof: See Appendix B.

Theorem 1 shows that to satisfy a (ε, δ)-DP requirement for the downlink channels, additional noise terms n_D need to be added by the server. With a certain L, the standard deviation of the additional noise depends on the relationship between the number of aggregation times T and the number of clients N. The intuition is that a larger T can lead to a higher chance of information leakage, while a larger number of clients is helpful for hiding their private information. This theorem also provides the variance value of the noise terms that should be added to the aggregated parameters. Based on the above results, we propose the following NbAFL algorithm.

B. Proposed NbAFL

Algorithm 1 outlines our NbAFL for training an effective model with a global (ε, δ)-DP requirement. We denote by μ the preset constant of the proximal term and by w^(0) the initial global parameter. At the beginning of this algorithm, the server broadcasts the required privacy level parameters (ε, δ) and the initial global parameter w^(0) to the clients. In the t-th aggregation, the N active clients respectively train their parameters by using local databases with preset termination conditions. After completing the local training, the i-th client, ∀i, will add noise to the trained parameters w_i^(t) and upload the noised parameters w̃_i^(t) to the server for aggregation.

Then the server updates the global parameters w^(t) by aggregating the local parameters integrated with different weights. The additive noise terms n_D^(t) are added to this w^(t) according to Theorem 1 before being broadcast to the clients. Based on the received global parameters w̃^(t), each client will estimate the accuracy by using local testing databases and start the next round of the training process based on these received parameters. The FL process completes after the number of aggregations reaches a preset number T and the algorithm returns w̃^(T).

Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all local parameters is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information of the i-th client from its uploaded parameters w̃_i. After the model aggregation, the aggregated parameters w will be sent back to the clients via broadcast channels. This poses threats to clients' privacy, as potential adversaries may reveal sensitive information about individual clients from w. In this case, additive noise may be imposed on w based on Theorem 1.
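The following sketch puts Algorithm 1 together in NumPy. To keep it self-contained and runnable, each client's local objective is replaced by a simple quadratic surrogate F_i(w) = ½||w − x_i||², for which the proximal update has a closed form; a real deployment would run the clients' actual training at that step (the paper's experiments use a neural network, not shown here). The noise scales follow σ_U = cLΔs_U/ε for the clients and Theorem 1 for the server; all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(w, C):
    return w / max(1.0, np.linalg.norm(w) / C)

def nbafl_round(w_global, local_data, C, mu, sigma_u):
    """One NbAFL aggregation: proximal local update, clipping, client-side
    Gaussian noise, and equal-weight aggregation (p_i = 1/N)."""
    uploads = []
    for x in local_data:
        # Proximal step for the quadratic surrogate F_i(w) = 0.5*||w - x||^2:
        # argmin_w F_i(w) + (mu/2)*||w - w_global||^2 has the closed form below.
        w_local = (x + mu * w_global) / (1.0 + mu)
        w_local = clip(w_local, C)
        uploads.append(w_local + rng.normal(0.0, sigma_u, size=w_local.shape))
    return np.mean(uploads, axis=0)

def run_nbafl(local_data, T, C, mu, eps, delta, m, L):
    N, d = len(local_data), local_data[0].shape[0]
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    sigma_u = c * L * (2.0 * C / m) / eps                      # uplink noise (Section III-A)
    sigma_d = (2.0 * c * C * np.sqrt(max(T**2 - L**2 * N, 0))
               / (m * N * eps))                                # downlink noise, Theorem 1, eq. (10)
    w = np.zeros(d)
    for _ in range(T):
        w = nbafl_round(w, local_data, C, mu, sigma_u)
        w = w + rng.normal(0.0, sigma_d, size=d)               # server-side noise n_D
    return w

# Example: N = 50 clients, epsilon = 60, delta = 0.01, T = 25, C = 20, m = 512
data = [rng.normal(size=20) for _ in range(50)]
w_final = run_nbafl(data, T=25, C=20, mu=1.0, eps=60, delta=0.01, m=512, L=25)
```

With these example settings, T ≤ L√N holds and the sketch adds no server-side noise, which matches the case distinction in (10).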
IV. CONVERGENCE ANALYSIS ON NBAFL

In this section, we are ready to analyze the convergence performance of the proposed NbAFL. First, we analyze the expected increment of adjacent aggregations in the loss function with Gaussian noise. Then, we focus on deriving the convergence property under the global (ε, δ)-DP requirement. For the convenience of the analysis, we make the following assumptions on the loss function and network parameters.

Assumption 1: We make assumptions on the global loss function F(·), defined by F(·) ≜ Σ_{i=1}^N p_i F_i(·), and the i-th local loss function F_i(·) as follows:
1) F_i(w) is convex;
2) F_i(w) satisfies the Polyak-Lojasiewicz condition with the positive parameter l, which implies that F(w) − F(w*) ≤ (1/2l) ||∇F(w)||², where w* is the optimal result;
3) F(w^(0)) − F(w*) = Θ;
4) F_i(w) is β-Lipschitz, i.e., ||F_i(w) − F_i(w′)|| ≤ β ||w − w′||, for any w, w′;
5) F_i(w) is ρ-Lipschitz smooth, i.e., ||∇F_i(w) − ∇F_i(w′)|| ≤ ρ ||w − w′||, for any w, w′, where ρ is a constant determined by the practical loss function;
6) For any i and w, ||∇F_i(w) − ∇F(w)|| ≤ ε_i, where ε_i is the divergence metric.

Similar to the gradient divergence, the divergence metric ε_i is the metric to capture the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed at different nodes. Using Assumption 1 and assuming ||∇F(w)|| to be uniformly bounded away from zero, we then have the following lemma.

Lemma 2 (B-Dissimilarity of Various Clients): For a given ML parameter w, there exists B satisfying

    E{ ||∇F_i(w)||² } ≤ ||∇F(w)||² B², ∀i.    (11)

Proof: See Appendix C.

Lemma 2 comes from the assumption of the divergence metric and demonstrates the statistical heterogeneity of all clients. As mentioned earlier, the values of ρ and B are determined by the specific global loss function F(w) in practice and the training parameters w. With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of parameters with artificial noise.

Lemma 3 (Expected Increment in the Loss Function): After receiving updates, from the t-th to the (t+1)-th aggregation, the expected difference in the loss function can be upper-bounded by

    E{ F(w^(t+1)) − F(w^(t)) } ≤ λ_2 E{ ||∇F(w^(t))||² } + λ_1 E{ ||n^(t+1)|| ||∇F(w^(t))|| } + λ_0 E{ ||n^(t+1)||² },    (12)

where λ_0 = ρ/2, λ_1 and λ_2 are constants determined by ρ, B and the proximal parameter μ (see Appendix D), and n^(t) are the equivalent noise terms imposed on the parameters after the t-th aggregation, given by n^(t) = Σ_{i=1}^N p_i n_i^(t) + n_D^(t).

Proof: See Appendix D.

In this lemma, the value of an additive noise sample n in the vector n^(t) satisfies the Gaussian distribution n ∼ N(0, σ_A²). Also, we can obtain σ_A = √(σ_D² + σ_U²/N) from Section III. From the right hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term μ to achieve a low upper bound. It is clear that artificial noise with a large σ_A may improve the DP performance in terms of privacy protection. However, from the RHS of (12), a large σ_A may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of convergence performance.

Furthermore, to satisfy the global (ε, δ)-DP, by using Theorem 1, we have

    σ_A = cTΔs_D/ε,  if T > L√N,  and  σ_A = cLΔs_U/(√N ε),  if T ≤ L√N.    (13)

Next, we will analyze the convergence property of NbAFL with the (ε, δ)-DP requirement.

Theorem 2 (Convergence Upper Bound of the NbAFL): With a required protection level ε, the convergence upper bound of Algorithm 1 after T aggregations is given by

    E{ F(w^(T)) − F(w*) } ≤ P^T Θ + ( κ_1 T/ε + κ_0 T²/ε² ) ( 1 − P^T ),    (14)

where P = 1 + 2lλ_2, κ_1 = ( λ_1 β c C / (m(1 − P)) ) √(2/(Nπ)) and κ_0 = λ_0 c² C² / (m²(1 − P)N).

Proof: See Appendix E.

Theorem 2 reveals an important relationship between privacy and utility by taking into account the protection level ε and the number of aggregation times T. As the number of aggregation times T increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing T as a continuous variable and by writing the RHS of (14) as h(T), we have

    d²h(T)/dT² = ( Θ − κ_1 T/ε − κ_0 T²/ε² ) P^T ln²P − 2 ( κ_1/ε + 2κ_0 T/ε² ) P^T ln P + (2κ_0/ε²) ( 1 − P^T ).    (15)

It can be seen that the second term and the third term on the RHS of (15) are always positive. When N and ε are set to be large enough, we can see that κ_1 and κ_0 are small, and thus the first term can also be positive. In this case, we have d²h(T)/dT² > 0 and the upper bound is convex in T.

Remark 2: As can be seen from this theorem, the expected gap between the achieved loss function F(w^(T)) and the minimum one F(w*) is a decreasing function of ε. By increasing ε, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm will improve. This is reasonable because the variance of the artificial noise terms decreases, thereby improving the convergence performance.

Remark 3: The number of clients N will also affect the iterative convergence performance, i.e., a larger N would achieve a better convergence performance. This is because a larger N leads to a lower variance of the artificial noise terms.

Remark 4: There is an optimal number of maximum aggregation times T in terms of convergence performance for given ε and N. In more detail, a larger T may lead to a higher variance of artificial noise, and thus may pose a negative impact on convergence performance. On the other hand, more iterations can generally boost the convergence performance if noise levels are not large enough. In this sense, there is a tradeoff on choosing a proper T.
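A short numerical sketch of the tradeoff described in Remark 4: it evaluates the right-hand side of (14) as a function of T and reports the minimizing number of aggregation times for several protection levels. The constants Θ, P, κ_1 and κ_0 are arbitrary illustrative placeholders rather than values derived from a trained model; with such placeholders the bound is convex in T and its minimizer grows as ε is relaxed, consistent with the convexity discussion after (15).

```python
import numpy as np

def h(T, eps, theta=1.0, P=0.9, kappa1=0.05, kappa0=0.005):
    """Right-hand side of the convergence bound (14), viewed as a function of T.
    theta, P, kappa1, kappa0 are illustrative placeholder constants."""
    return P**T * theta + (kappa1 * T / eps + kappa0 * T**2 / eps**2) * (1.0 - P**T)

for eps in (50, 60, 100):
    T_grid = np.arange(1, 201)
    bounds = h(T_grid.astype(float), eps)
    T_opt = T_grid[np.argmin(bounds)]          # optimal number of aggregation times
    print(f"eps = {eps:3d}: optimal T ~ {T_opt}, bound = {bounds.min():.4f}")
```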
V. K-CLIENT RANDOM SCHEDULING POLICY

In this section, we consider the case where only K (K < N) clients are selected to participate in the aggregation process, namely K-client random scheduling.

We now discuss how to add artificial noise in the K-client random scheduling to satisfy a global (ε, δ)-DP. It is obvious that in the uplink channels, each of the K scheduled clients should add noise terms with scale σ_U = cLΔs_U/ε for

Fig. 3. The comparison of training loss with various protection levels for 50 clients using ε = 6, ε = 8 and ε = 10, respectively.

Fig. 4. The comparison of training loss with various privacy levels for 50 clients using ε = 50, ε = 60 and ε = 100, respectively.

Fig. 5. The comparison of training loss with various clipping thresholds for 50 clients using ε = 60.

Fig. 6. The value of the loss function with various numbers of clients under ε = 60 under the NbAFL algorithm.

Such observation results are in line with Remark 2. We also choose high protection levels ε = 6, ε = 8 and ε = 10 for this experiment, where each client has 512 training samples locally. We set N = 50, T = 25 and δ = 0.01. From Fig. 3, we can draw a similar conclusion as in Remark 2 that values of the loss function in NbAFL are decreasing as we relax the privacy guarantees.

Considering the K-client random scheduling, in Fig. 4, we investigate the performance with various protection levels ε = 50, ε = 60 and ε = 100. For simulation parameters, we set N = 50, K = 20, T = 25, and δ = 0.01. As shown in Fig. 4, the convergence performance under the K-client random scheduling is improved with an increasing ε.

B. Impact of the Clipping Threshold C

In Fig. 5, we choose various clipping thresholds C = 10, 15, 20 and 25 to show the results of the loss function for 50 clients using ε = 60 in NbAFL. As shown in Fig. 5, when C = 20, the convergence performance of NbAFL obtains the best value. We can note that limiting the parameter norm has two opposing effects. On the one hand, if the clipping threshold C is too small, clipping destroys the intended gradient direction of the parameters. On the other hand, increasing the norm bound C forces us to add more noise to the parameters because of its effect on the sensitivity.

C. Impact of the Number of Clients N

Fig. 6 compares the convergence performance of NbAFL under the required protection level ε = 60 and δ = 10⁻² as a function of the clients' number, N. In this experiment, we set N = 50, N = 60, N = 80 and N = 100. We notice that the performance among different numbers of clients is governed by Remark 3. This is because more clients not only provide larger global datasets for training, but also bring down the standard deviation of the additive noise due to the aggregation.

D. Impact of the Number of Maximum Aggregation Times T

In Fig. 7, we show the experimental results of training loss as a function of the maximum aggregation times with various privacy levels ε = 50, 60, 80 and 100 under the NbAFL algorithm.
Fig. 7. The convergence upper bounds with various privacy levels ε = 50, 60 and 100 under the 50-client NbAFL algorithm.

Fig. 8. The comparison of the loss function between experimental and theoretical results with various aggregation times under the NbAFL algorithm with 50 clients.

Fig. 9. The value of the loss function with various privacy levels ε = 60 and ε = 100 under the NbAFL algorithm with 50 clients.

Fig. 10. The value of the loss function with various numbers of chosen clients K under ε = 50, 60, 100 under the NbAFL algorithm and the non-private approach with 50 clients.

This observation is in line with Remark 4, and the reason comes from the fact that a lower privacy level decreases the standard deviation of the additive noise terms and the server can obtain better-quality ML model parameters from the clients. Fig. 7 also implies that the optimal number of maximum aggregation times increases almost monotonically with respect to the increasing ε.

In Fig. 9, we plot the values of the loss function in the normalized NbAFL using solid lines and the K-client random scheduling based NbAFL using dotted lines with various numbers of maximum aggregation times. This figure shows that the value of the loss function is a convex function of the maximum aggregation times for a given protection level under the NbAFL algorithm, which validates Remark 4. From Fig. 9, we can also see that for a given ε, the K-client random scheduling based NbAFL algorithm has a better convergence performance than the normalized NbAFL algorithm for a larger T. This is because K-client random scheduling can bring down the variance of artificial noise with little performance loss.

E. Impact of the Number of Chosen Clients K

In Fig. 10, we plot values of the loss function with various numbers of chosen clients K under the random scheduling policy in NbAFL. The number of clients is N = 50, and K clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set ε = 50, ε = 60, ε = 100 and δ = 0.01. Meanwhile, we also exhibit the performance of the non-private approach with various numbers of chosen clients K. Note that an optimal K which further improves the convergence performance exists for various protection levels, due to a trade-off between enhanced privacy protection and involving larger global training datasets in each model updating round. This observation is in line with Remark 5. The figure shows that in NbAFL, for a given protection level ε, the K-client random scheduling can obtain a better tradeoff than the normal selection policy.
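A minimal sketch of the scheduling step itself, assuming the clients' uploads have already been clipped and noised as in Algorithm 1: in each round, K of the N clients are drawn uniformly at random and only their parameters are averaged with weights p_i = 1/K. As noted in Section V, the uplink noise scale of a scheduled client stays σ_U = cLΔs_U/ε, while the recalibrated server-side noise for this policy is derived in the appendices; the code below shows only the sampling and aggregation mechanics, with illustrative data.

```python
import numpy as np

rng = np.random.default_rng(1)

def k_client_aggregate(client_uploads, K):
    """One aggregation under K-client random scheduling: draw K of the N clients
    uniformly at random and average only their uploaded (already clipped and
    noised) parameters with equal weights p_i = 1/K."""
    N = len(client_uploads)
    chosen = rng.choice(N, size=K, replace=False)
    return np.mean([client_uploads[i] for i in chosen], axis=0)

# Example: N = 50 noisy client uploads of dimension 20, K = 20 clients per round
uploads = [rng.normal(size=20) for _ in range(50)]
w_agg = k_client_aggregate(uploads, K=20)
```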
VII. CONCLUSIONS

In this paper, we have focused on information leakage in SGD based FL. We have first defined a global (ε, δ)-DP requirement for both uplink and downlink channels, and developed variances of artificial noise terms at the client and server sides. Then, we have proposed a novel framework based on the concept of global (ε, δ)-DP, named NbAFL. We have developed theoretically a convergence bound on the loss function of the trained FL model in the NbAFL. Using this convergence bound, we have obtained the following results: 1) there is a tradeoff between the convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and 3) there is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we have proposed a K-client random scheduling strategy and also developed a corresponding convergence bound on the loss function in this case. In addition to the above three properties, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level. Extensive simulation results confirm the correctness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels. We can note that the size and the distribution of data both greatly affect the quality of the FL training. As future work, it is of great interest to analytically evaluate the convergence performance of NbAFL with varying size and distribution of data at the client sides.

APPENDIX A
PROOF OF LEMMA 1

From the downlink perspective, for all D_i and D_i′ which differ in a single entry, the sensitivity can be expressed as

    Δs_D^{D_i} = max_{D_i, D_i′} || s_D^{D_i} − s_D^{D_i′} ||.    (18)

APPENDIX B
PROOF OF THEOREM 1

The aggregation process with artificial noise added by the clients can be expressed as

    w = Σ_{i=1}^N p_i ( w_i + n_i ) = Σ_{i=1}^N p_i w_i + Σ_{i=1}^N p_i n_i.    (22)

The distribution φ_N(n) of Σ_{i=1}^N p_i n_i can be expressed as

    φ_N(n) = ϕ_1(n) ∗ ϕ_2(n) ∗ · · · ∗ ϕ_N(n),    (23)

where p_i n_i ∼ ϕ_i(n), and ∗ is the convolution operation. When we use the Gaussian mechanism for n_i with noise scale σ_U, the distribution of p_i n_i is also a Gaussian distribution. To obtain a small sensitivity Δs_D, we set p_i = 1/N. Furthermore, the noise scale σ_U/√N of the Gaussian distribution φ_N(n) can be calculated. To ensure a global (ε, δ)-DP in the downlink channels, we know the standard deviation of the additive noise terms can be set to σ_A = cTΔs_D/ε, where Δs_D = 2C/(mN). Hence, we can obtain the standard deviation of the additive noise at the server as

    σ_D = √( σ_A² − σ_U²/N ) = 2cC √(T² − L²N) / (mNε),  if T > L√N,  and  σ_D = 0,  if T ≤ L√N.    (24)

Hence, Theorem 1 has been proved.
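The composition argument behind (24) can be checked numerically: with p_i = 1/N, the aggregated client noise has standard deviation σ_U/√N, so adding server noise σ_D as in (24) makes the total perturbation on the broadcast parameter equal to the target σ_A = cTΔs_D/ε whenever T > L√N (and σ_U/√N otherwise). The settings below are illustrative only.

```python
import numpy as np

# Illustrative settings (not tied to a specific experiment in the paper)
N, T, L, C, m = 50, 200, 25, 20.0, 512
eps, delta = 60.0, 0.01
c = np.sqrt(2.0 * np.log(1.25 / delta))

ds_u, ds_d = 2.0 * C / m, 2.0 * C / (m * N)      # uplink / downlink sensitivities
sigma_u = c * L * ds_u / eps                     # client-side noise scale
sigma_a = c * T * ds_d / eps                     # target scale on the broadcast model

if T > L * np.sqrt(N):
    sigma_d = 2.0 * c * C * np.sqrt(T**2 - L**2 * N) / (m * N * eps)   # eq. (24)
else:
    sigma_d = 0.0

total = np.sqrt(sigma_d**2 + sigma_u**2 / N)     # aggregated client noise + server noise
print(f"target sigma_A = {sigma_a:.6f}, achieved = {total:.6f}")
assert np.isclose(total, max(sigma_a, sigma_u / np.sqrt(N)))
```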
APPENDIX C
PROOF OF LEMMA 2

Due to Assumption 1, we have

    E{ ||∇F_i(w) − ∇F(w)||² } ≤ E{ ε_i² }.    (25)

APPENDIX D
PROOF OF LEMMA 3

Now, let us bound ||w_i^(t+1) − w^(t)||. We know

    ||w_i^(t+1) − w^(t)|| ≤ ||w_i^(t+1) − ŵ_i^(t+1)|| + ||ŵ_i^(t+1) − w^(t)||,    (38)

where ŵ_i^(t+1) = arg min_w J_i(w; w^(t)). Let us define μ̄ = μ + l > 0; then we know J_i(w; w^(t)) is μ̄-convex. Based on this, we can obtain

    ||w_i^(t+1) − ŵ_i^(t+1)|| ≤ (θ/μ̄) ||∇F_i(w^(t))||    (39)

and

    ||ŵ_i^(t+1) − w^(t)|| ≤ (1/μ̄) ||∇F_i(w^(t))||.    (40)

The remaining steps combine (38)–(40) with the ρ-Lipschitz smoothness of F_i and Lemma 2 to obtain the constants λ_2, λ_1 and λ_0 = ρ/2 in (12). This completes the proof.

APPENDIX E
PROOF OF THEOREM 2

We assume that F satisfies the Polyak-Lojasiewicz inequality [38] with positive parameter l, which implies that

    E{ F(w^(t)) − F(w*) } ≤ (1/2l) ||∇F(w^(t))||².    (48)

Moreover, subtracting E{F(w*)} on both sides of (45), we know

    E{ F(w^(t+1)) − F(w*) } ≤ E{ F(w^(t)) − F(w*) } + λ_2 ||∇F(w^(t))||² + λ_1 E{ ||n^(t+1)|| } ||∇F(w^(t))|| + λ_0 E{ ||n^(t+1)||² }.    (49)
Hence, if Gaussian noise terms are added at the client sides, we can obtain the additive noise scale at the server as

    σ_D = √( (cΔs_D T/(bε))² − c²L²Δs_U²/(Kε²) ) = 2cC √(T²/b² − L²K) / (mKε),  if T > bL√K,  and  σ_D = 0,  if T ≤ bL√K.    (66)

Furthermore, considering (60), we can obtain

    σ_D = 2cC √(T²/b² − L²K) / (mKε),  if T > ε/γ,  and  σ_D = 0,  if T ≤ ε/γ,    (67)

where

    γ = −ln( 1 − q + q e^{−ε/(L√K)} ).    (68)

This completes the proof.

APPENDIX G
PROOF OF THEOREM 3

Here we define

    v^(t) = Σ_{i=1}^K p_i w_i^(t),    (69)

    ṽ^(t) = Σ_{i=1}^K p_i ( w_i^(t) + n_i^(t) ) + n_D^(t)    (70)

and

    n^(t+1) = Σ_{i=1}^K p_i n_i^(t+1) + n_D^(t),    (71)

which consider the aggregated parameters under K-client random scheduling. Because F_i(·) and F(·) are β-Lipschitz, we obtain that

    E{ F(v^(t+1)) } − F(w^(t+1)) ≤ β ||v^(t+1) − w^(t+1)||.    (72)

Because β is the Lipschitz continuity constant of the function F, we have

    β ≤ ||∇F(v^(t))|| + ρ ( ||w^(t+1) − v^(t)|| + ||v^(t+1) − v^(t)|| ).    (73)

From (42), we know

    ||w^(t+1) − v^(t)|| ≤ ( B(1 + θ)/μ̄ ) ||∇F(v^(t))||.    (74)

Then, we have

    E{ ||w^(t+1) − v^(t+1)||² } = ||w^(t+1)||² − 2 [w^(t+1)]ᵀ E{ v^(t+1) } + E{ ||v^(t+1)||² }    (75)

and

    E{ ||ṽ^(t+1)||² } = E{ || Σ_{i=1}^K p_i w_i^(t+1) ||² } + E{ || Σ_{i=1}^K p_i n_i^(t+1) ||² } + 2 E{ ( Σ_{i=1}^K p_i w_i^(t+1) )ᵀ n^(t+1) }.    (77)

Note that we set p_i = D_i / Σ_{i=1}^K D_i = 1/K in K-client random scheduling in order to obtain a small sensitivity Δs_D. We have

    E{ || Σ_{i=1}^K p_i w_i^(t+1) ||² } ≤ (1/K²) Σ_{i=1}^K ||w_i^(t+1)||² + ((K − 1)/K) ||w^(t+1)||²    (78)

and

    E{ ||ṽ^(t+1)||² } ≤ (1/K²) Σ_{i=1}^K ||w_i^(t+1)||² + ((K − 1)/K) ||w^(t+1)||² + ||n^(t+1)||² + 2 [w^(t+1)]ᵀ n^(t+1).    (79)

Combining (75) and (79), we can obtain

    E{ ||w^(t+1) − v^(t+1)||² } ≤ (1/K²) Σ_{i=1}^K ||w_i^(t+1) − v^(t)||² + ||n^(t+1)||².    (80)

Using (41), we know

    E{ ||w^(t+1) − v^(t+1)||² } ≤ ||n^(t+1)||² + ( B²(1 + θ)² / (Kμ̄²) ) ||∇F(v^(t))||².    (81)

Moreover,

    E{ ||w^(t+1) − v^(t+1)|| } ≤ ||n^(t+1)|| + ( B(1 + θ) / (μ̄√K) ) ||∇F(v^(t))||.    (82)

Substituting (45), (73) and (82) into (72), and setting θ = 0 and μ̄ = μ, we can obtain an upper bound on E{ F(v^(t+1)) } − F(v^(t)) of the form α_2 ||∇F(v^(t))||² + α_1 E{ ||n^(t+1)|| } ||∇F(v^(t))|| + α_0 E{ ||n^(t+1)||² }, where α_0, α_1 and α_2 are constants determined by β, ρ, B, μ and K.
In this case, we take the expectation E{ F(v^(t+1)) − F(v^(t)) } as follows:

    E{ F(v^(t+1)) − F(v^(t)) } ≤ α_2 ||∇F(v^(t))||² + α_1 E{ ||n^(t+1)|| } ||∇F(v^(t))|| + α_0 E{ ||n^(t+1)||² }.    (86)

For Θ > 0 and F(v^(0)) − F(w*) = Θ, we can obtain

    E{ F(v^(t+1)) − F(w*) } ≤ E{ F(v^(t)) − F(w*) } + α_2 ||∇F(v^(t))||² + α_1 β E{ ||n^(t+1)|| } + α_0 E{ ||n^(t+1)||² }.    (87)

If we select the penalty parameter μ to make α_2 < 0 and use (48), we know

    E{ F(v^(t+1)) − F(w*) } ≤ (1 + 2lα_2) E{ F(v^(t)) − F(w*) } + α_1 β E{ ||n^(t+1)|| } + α_0 E{ ||n^(t+1)||² }.    (88)

Considering the independence of the additive noise terms and applying (88) recursively, we have

    E{ F(v^(T)) − F(w*) } ≤ (1 + 2lα_2)^T E{ F(v^(0)) − F(w*) } + ( (1 − (1 + 2lα_2)^T) / (−2lα_2) ) ( α_1 β E{ ||n|| } + α_0 E{ ||n||² } ) = Q^T Θ + ( (1 − Q^T) / (1 − Q) ) ( α_1 β E{ ||n|| } + α_0 E{ ||n||² } ),    (89)

where Q = 1 + 2lα_2. Substituting (65) into (89), we can obtain

    E{ ||n|| } = ( Δs_D T c / (bε) ) √(2N/π),   E{ ||n||² } = Δs_D² T² c² N / (b²ε²)    (90)

and, substituting these into (89), the convergence upper bound stated in Theorem 3 for the K-client random scheduling policy follows.    (91)

This completes the proof.
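The recursion in (88)–(89) (and its counterpart in Appendix E) is a standard geometric unrolling; the following few lines, with arbitrary illustrative constants, confirm that iterating the one-step bound gap ← Q·gap + r for T rounds reproduces the closed form Q^T Θ + ((1 − Q^T)/(1 − Q)) r used in (89).

```python
import numpy as np

Q, theta, r, T = 0.9, 1.0, 0.02, 25   # illustrative contraction factor, initial gap,
                                      # per-round noise penalty, and round count

gap = theta
for _ in range(T):                    # iterate the one-step bound (88): gap <- Q*gap + r
    gap = Q * gap + r

closed_form = Q**T * theta + (1 - Q**T) / (1 - Q) * r   # unrolled form used in (89)
assert np.isclose(gap, closed_form)
print(gap, closed_form)
```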
REFERENCES

[1] J. Li, S. Chu, F. Shu, J. Wu, and D. N. K. Jayakody, "Contract-based small-cell caching for data disseminations in ultra-dense cellular networks," IEEE Trans. Mobile Comput., vol. 18, no. 5, pp. 1042–1053, May 2019.
[2] Z. Ma, M. Xiao, Y. Xiao, Z. Pang, H. V. Poor, and B. Vucetic, "High-reliability and low-latency wireless communication for Internet of Things: Challenges, fundamentals, and enabling technologies," IEEE Internet Things J., vol. 6, no. 5, pp. 7946–7970, Oct. 2019.
[3] H. Lee, S. H. Lee, and T. Q. S. Quek, "Deep learning for distributed optimization: Applications to wireless resource management," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2251–2266, Oct. 2019.
[4] W. Sun, J. Liu, and Y. Yue, "AI-enhanced offloading in edge computing: When machine learning meets industrial IoT," IEEE Netw., vol. 33, no. 5, pp. 68–74, Sep. 2019.
[5] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, "Deep learning for IoT big data and streaming analytics: A survey," IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, Jun. 2018.
[6] H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, "Communication-efficient learning of deep networks from decentralized data," 2016, arXiv:1602.05629. [Online]. Available: http://arxiv.org/abs/1602.05629
[7] J. Konečný, H. Brendan McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016, arXiv:1610.05492. [Online]. Available: http://arxiv.org/abs/1610.05492
[8] U. Mohammad and S. Sorour, "Adaptive task allocation for asynchronous federated mobile edge learning," 2019, arXiv:1905.01656. [Online]. Available: http://arxiv.org/abs/1905.01656
[9] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, "In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning," IEEE Netw., vol. 33, no. 5, pp. 156–165, Sep. 2019.
[10] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, p. 12, 2019.
[11] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," 2019, arXiv:1908.07873. [Online]. Available: http://arxiv.org/abs/1908.07873
[12] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2019, pp. 1387–1395.
[13] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, Jan. 2020.
[14] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang, "Towards efficient and privacy-preserving federated deep learning," in Proc. IEEE Int. Conf. Commun. (ICC), May 2019, pp. 1–6.
[15] H. H. Yang, A. Arafa, T. Q. S. Quek, and H. V. Poor, "Age-based scheduling policy for federated learning in mobile edge networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Barcelona, Spain, May 2020, pp. 8743–8747.
[16] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. IEEE 51st Conf. Decis. Control (CDC), Maui, HI, USA, Dec. 2012, pp. 5451–5452.
[17] X. Lian, Y. Huang, Y. Li, and J. Liu, "Asynchronous parallel stochastic gradient for nonconvex optimization," in Proc. ACM NIPS, Montreal, QC, Canada, Dec. 2015, pp. 2737–2745.
[18] X. Lian et al., "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in Proc. ACM NIPS, Long Beach, CA, USA, Dec. 2017, pp. 5336–5346.
[19] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," 2018, arXiv:1812.06127. [Online]. Available: http://arxiv.org/abs/1812.06127
[20] S. Wang et al., "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[21] R. Shokri and V. Shmatikov, "Privacy-preserving deep learning," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), Denver, CO, USA, 2015, pp. 1310–1321.
[22] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, "Beyond inferring class representatives: User-level privacy leakage from federated learning," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Paris, France, Apr. 2019, pp. 2512–2520.
[23] C. Ma et al., "On safeguarding privacy and security in the framework of federated learning," 2019, arXiv:1909.06512. [Online]. Available: http://arxiv.org/abs/1909.06512
[24] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Found. Trends Theor. Comput. Sci., vol. 9, nos. 3–4, pp. 211–407, 2013.
[25] A. Blum, C. Dwork, F. McSherry, and K. Nissim, "Practical privacy: The SuLQ framework," in Proc. 24th ACM SIGMOD-SIGACT-SIGART Symp. Princ. Database Syst., Baltimore, MD, USA, Jun. 2005, pp. 128–138.
[26] Ú. Erlingsson, V. Pihur, and A. Korolova, "RAPPOR: Randomized aggregatable privacy-preserving ordinal response," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Scottsdale, AZ, USA, Nov. 2014, pp. 1054–1067.
[27] N. Wang et al., "Collecting and analyzing multidimensional data with local differential privacy," in Proc. IEEE 35th Int. Conf. Data Eng. (ICDE), Macao, China, Apr. 2019, pp. 638–649.
[28] S. Wang et al., "Local differential private data aggregation for discrete distribution estimation," IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 9, pp. 2046–2059, Sep. 2019.
[29] M. Abadi et al., "Deep learning with differential privacy," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), Vienna, Austria, 2016, pp. 308–318.
[30] N. Wu, F. Farokhi, D. Smith, and M. A. Kaafar, "The value of collaboration in convex machine learning with differential privacy," 2019, arXiv:1906.09679. [Online]. Available: http://arxiv.org/abs/1906.09679
[31] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, "Differentially private meta-learning," 2019, arXiv:1909.05830. [Online]. Available: http://arxiv.org/abs/1909.05830
[32] H. Brendan McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private recurrent language models," 2017, arXiv:1710.06963. [Online]. Available: http://arxiv.org/abs/1710.06963
[33] T. Ryffel et al., "A generic framework for privacy preserving deep learning," 2018, arXiv:1811.04017. [Online]. Available: http://arxiv.org/abs/1811.04017
[34] R. C. Geyer, T. Klein, and M. Nabi, "Differentially private federated learning: A client level perspective," 2017, arXiv:1712.07557. [Online]. Available: http://arxiv.org/abs/1712.07557
[35] S. Truex et al., "A hybrid approach to privacy-preserving federated learning," 2018, arXiv:1812.03224. [Online]. Available: http://arxiv.org/abs/1812.03224
[36] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), New York, NY, USA, 2015, pp. 1322–1333.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[38] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 1st ed. Boston, MA, USA: Springer, 2014.