IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 15, 2020

Federated Learning With Differential Privacy: Algorithms and Performance Analysis

Kang Wei, Graduate Student Member, IEEE, Jun Li, Senior Member, IEEE, Ming Ding, Senior Member, IEEE, Chuan Ma, Howard H. Yang, Member, IEEE, Farhad Farokhi, Senior Member, IEEE, Shi Jin, Senior Member, IEEE, Tony Q. S. Quek, Fellow, IEEE, and H. Vincent Poor, Life Fellow, IEEE

Abstract—Federated learning (FL), as a type of distributed machine learning, is capable of significantly preserving clients' private data from being exposed to adversaries. Nevertheless, private information can still be divulged by analyzing uploaded parameters from clients, e.g., weights trained in deep neural networks. In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of differential privacy (DP), in which artificial noise is added to parameters at the clients' side before aggregation, namely, noising before model aggregation FL (NbAFL). First, we prove that the NbAFL can satisfy DP under distinct protection levels by properly adapting different variances of artificial noise. Then we develop a theoretical convergence bound on the loss function of the trained FL model in the NbAFL. Specifically, the theoretical bound reveals the following three key properties: 1) there is a tradeoff between convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) given a fixed privacy protection level, increasing the number N of overall clients participating in FL can improve the convergence performance; and 3) there is an optimal number of aggregation times (communication rounds) in terms of convergence performance for a given protection level. Furthermore, we propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound for the loss function in this case, and the K-client random scheduling strategy also retains the above three properties. Moreover, we find that there is an optimal K that achieves the best convergence performance at a fixed privacy level. Evaluations demonstrate that our theoretical results are consistent with simulations, thereby facilitating the design of various privacy-preserving FL algorithms with different tradeoff requirements on convergence performance and privacy levels.

Index Terms—Federated learning, differential privacy, convergence performance, information leakage, client selection.

Manuscript received December 6, 2019; revised March 20, 2020; accepted April 11, 2020. Date of publication April 17, 2020; date of current version June 16, 2020. This work was supported in part by the National Key Research and Development Program under Grant 2018YFB1004800, in part by the National Natural Science Foundation of China under Grant 61872184 and Grant 61727802, in part by the SUTD Growth Plan Grant for AI, in part by the U.S. National Science Foundation under Grant CCF-1908308, and in part by the Princeton Center for Statistics and Machine Learning under a DataX Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Aris Gkoulalas-Divanis. (Corresponding authors: Jun Li; Chuan Ma.)

Kang Wei and Chuan Ma are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China. Jun Li is with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China, and also with the School of Computer Science and Robotics, National Research Tomsk Polytechnic University, 634050 Tomsk, Russia. Ming Ding is with CSIRO Data61, Sydney, NSW 2015, Australia. Howard H. Yang and Tony Q. S. Quek are with the Singapore University of Technology and Design, Singapore 487372. Farhad Farokhi was with CSIRO's Data61, Melbourne, VIC 3008, Australia; he is now with the Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, VIC 3010, Australia. Shi Jin is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China. H. Vincent Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.

Digital Object Identifier 10.1109/TIFS.2020.2988575

I. INTRODUCTION

IT IS anticipated that big data-driven artificial intelligence (AI) will soon be applied in many aspects of our daily life, including medical care, agriculture, transportation systems, etc. At the same time, the rapid growth of Internet-of-Things (IoT) applications calls for secure and reliable data mining and learning in distributed systems [1]–[3]. When integrating AI into a variety of IoT applications, distributed machine learning (ML) is preferred for many data processing tasks by defining parametrized functions from inputs to outputs as compositions of basic building blocks [4], [5]. Federated learning (FL) is a recent advance in distributed ML in which data are acquired and processed locally at the client side, and then the updated ML parameters are transmitted to a central server for aggregation [6]–[8]. The goal of FL is to fit a model generated by an empirical risk minimization (ERM) objective. However, FL also poses several key challenges, such as private information leakage, expensive communication costs between servers and clients, and device variability [9]–[15].

Generally, distributed stochastic gradient descent (SGD) is adopted in FL for training ML models. In [16], [17], bounds for FL convergence performance were developed based on distributed SGD, with a one-step local update before global aggregation. The work in [18] considered partially global aggregation, where after each local update step, parameter
aggregation is performed over a non-empty subset of the set of clients. In order to analyze the convergence, the federated proximal algorithm (FedProx) was proposed [19] by adding regularization on each local loss function. The work in [20] obtained a convergence bound for SGD-based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.

At the same time, with the ever-increasing awareness of data security of personal information, privacy preservation has become a significant issue, especially for big data applications and distributed learning systems. One prominent advantage of FL is that it enables local training without personal data exchange between the server and clients, thereby protecting clients' data from being eavesdropped upon by hidden adversaries. Nevertheless, private information can still be divulged to some extent by analyzing the differences of parameters trained and uploaded by the clients, e.g., weights trained in neural networks [21]–[23].

A natural approach to preventing information leakage is to add artificial noise, one prominent example of which is differential privacy (DP) [24], [25]. Existing works on DP-based learning algorithms include local DP (LDP) [26]–[28], DP-based distributed SGD [29], [30], and DP meta learning [31]. In LDP, each client perturbs its information locally and only sends a randomized version to a server, thereby protecting both the clients and the server against private information leakage. The work in [27] proposed solutions for building up an LDP-compliant SGD, which powers a variety of important ML tasks. The work in [28] considered distributed estimation at the server over uploaded data from clients while providing protections on these data with LDP. The work in [32] introduced an algorithm for user-level differentially private training of large neural networks, in particular a complex sequence model for next-word prediction. The work in [33] developed a chain abstraction model on tensors to efficiently override operations (or encode new ones) such as sending/sharing a tensor between workers, and then provided the elements to implement recently proposed DP and multiparty computation protocols using this framework. The work in [29] improved the computational efficiency of DP-based SGD by tracking detailed information about the privacy loss, and obtained accurate estimates of the overall privacy loss. The work in [30] proposed novel DP-based SGD algorithms and analyzed their performance bounds, which were shown to be related to privacy levels and the sizes of datasets. Also, the work in [31] focused on the class of gradient-based parameter-transfer methods and developed a DP-based meta learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.

More specifically, DP-based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in [34] proposed an FL algorithm with consideration of preserving clients' privacy. This algorithm can achieve good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in [35] presented an alternative approach that utilizes both DP and secure multiparty computation (SMC) to prevent differential attacks. However, the above two works on DP-based FL design have not taken into account privacy protection during the parameter uploading stage, i.e., the clients' private information can potentially be intercepted by hidden adversaries when uploading the training results to the server. Moreover, these two works only showed empirical results using simulations, but lacked theoretical analysis of the FL system, such as the tradeoffs between privacy, convergence performance, and convergence rate. To the authors' knowledge, a theoretical analysis of the convergence behavior of FL with privacy-preserving noise perturbations has not yet been considered in existing studies, which will be the major focus of this work. Compared with conventional works, such as [34], [35], which focus mainly on simulation results, our theoretical performance analysis is more efficient for finding the optimal parameters, e.g., the number of chosen clients K and the maximum number of aggregation times T, to achieve the minimum loss function.

In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of DP, in which each client perturbs its trained parameters locally by purposely adding noise before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). To the best of the authors' knowledge, this is the first piece of work of its kind that provides a theoretical analysis of the convergence properties of differentially private FL algorithms.

The main contributions of this paper are summarized as follows:
• We prove that the proposed NbAFL scheme satisfies the requirement of DP in terms of global data under a certain noise perturbation level with Gaussian noise by properly adapting their variances.
• We develop a convergence bound on the loss function of the trained FL model in the NbAFL with artificial Gaussian noise. Our developed bound reveals the following three key properties: 1) there is a tradeoff between the convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and 3) there is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level.
• We propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound on the loss function in this case. From our analysis, the K-client random scheduling strategy retains the above three properties. Also, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level.
• We conduct extensive simulations based on real-world datasets to validate the properties of our theoretical bound in NbAFL. Evaluations demonstrate that our theoretical results are consistent with simulations. Therefore,
our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.

The remainder of this paper is organized as follows. In Section II, we introduce background on FL, DP and a conventional DP-based FL algorithm. In Section III, we detail the proposed NbAFL and analyze the privacy performance based on DP. In Section IV, we analyze the convergence bound of NbAFL and reveal the relationship between privacy levels, convergence performance, the number of clients, and the number of global aggregations. In Section V, we propose the K-client random scheduling scheme and develop the convergence bound. We show the analytical results and simulations in Section VI. We conclude the paper in Section VII. A summary of basic concepts and notations is provided in Tab. I.

Fig. 1. An FL training model with hidden adversaries who can eavesdrop trained parameters from both the clients and the server.

TABLE I. SUMMARY OF MAIN NOTATION.

II. PRELIMINARIES

In this section, we will present preliminaries and related background knowledge on FL and DP. Also, we introduce the threat model that will be discussed in our following analysis.

A. Federated Learning

Let us consider a general FL system consisting of one server and N clients, as depicted in Fig. 1. Let D_i denote the local database held by the client C_i, where i ∈ {1, 2, ..., N}. At the server, the goal is to learn a model over data that resides at the N associated clients. An active client, participating in the local training, needs to find a vector w of an AI model to minimize a certain loss function. Formally, the server aggregates the weights received from the N clients as
    w = ∑_{i=1}^{N} p_i w_i,   (1)
where w_i is the parameter vector trained at the i-th client, w is the parameter vector after aggregation at the server, N is the number of clients, p_i = |D_i|/|D| ≥ 0 with ∑_{i=1}^{N} p_i = 1, and |D| = ∑_{i=1}^{N} |D_i| is the total size of all data samples. Such an optimization problem can be formulated as
    w* = arg min_w ∑_{i=1}^{N} p_i F_i(w, D_i),   (2)
where F_i(·) is the local loss function of the i-th client. Generally, the local loss function F_i(·) is given by local empirical risks. The training process of such an FL system usually contains the following four steps:
• Step 1 (Local training): All active clients locally compute training gradients or parameters and send the locally trained ML parameters to the server;
• Step 2 (Model aggregating): The server performs secure aggregation over the uploaded parameters from the N clients without learning local information;
• Step 3 (Parameters broadcasting): The server broadcasts the aggregated parameters to the N clients;
• Step 4 (Model updating): All clients update their respective models with the aggregated parameters and test the performance of the updated models.

In the FL process, the N clients with the same data structure collaboratively learn an ML model with the help of a cloud server. After a sufficient number of local training and update exchanges between the server and its associated clients, the solution to the optimization problem (2) is able to converge to that of the global optimal learning model.
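The weighted aggregation in (1) (Step 2 above) can be made concrete with a short sketch. The snippet below is only a minimal illustration, assuming NumPy parameter vectors and the weights p_i = |D_i|/|D| defined above; the function name is ours and it is not the authors' implementation.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted aggregation of client parameter vectors, as in Eq. (1).

    client_params: list of 1-D numpy arrays w_i (one per client)
    client_sizes:  list of local dataset sizes |D_i|
    """
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()                     # p_i = |D_i| / |D|
    stacked = np.stack(client_params, axis=0)   # shape (N, d)
    return (p[:, None] * stacked).sum(axis=0)   # w = sum_i p_i * w_i

# Toy usage: three clients with different dataset sizes.
w = aggregate([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)], [100, 100, 200])
print(w)  # weighted average, here [2.25 2.25 2.25 2.25]
```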
B. Threat Model

The server in this paper is assumed to be honest. However, there are external adversaries targeting clients' private information. Although the individual dataset D_i of the i-th client is kept locally in FL, the intermediate parameter w_i needs to be shared with the server, which may reveal the clients' private information, as demonstrated by model inversion attacks. For example, the authors in [36] demonstrated a model-inversion attack that recovers images from a facial recognition system. In addition, privacy leakage can also
happen in the broadcasting (through downlink channels) phase by analyzing the global parameter w.

We also assume that uplink channels are more secure than downlink broadcasting channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically at each uploading time, while downlink channels are broadcast. Hence, we assume that there are at most L (L ≤ T) exposures of uploaded parameters from each client in the uplink¹ and T exposures of aggregated parameters in the downlink, where T is the number of aggregation times.

¹Here we assume that the adversary cannot know where the parameters come from.

C. Differential Privacy

(ε, δ)-DP provides a strong criterion for the privacy preservation of distributed data processing systems. Here, ε > 0 is the distinguishable bound of all outputs on neighboring datasets D_i, D_i' in a database, and δ represents the probability that the ratio of the probabilities for two adjacent datasets D_i, D_i' cannot be bounded by e^ε after adding a privacy-preserving mechanism. With an arbitrarily given δ, a privacy-preserving mechanism with a larger ε gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. We now formally define DP as follows.

Definition 1 ((ε, δ)-DP [24]): A randomized mechanism M: X → R with domain X and range R satisfies (ε, δ)-DP, if for all measurable sets S ⊆ R and for any two adjacent databases D_i, D_i' ∈ X,
    Pr[M(D_i) ∈ S] ≤ e^ε Pr[M(D_i') ∈ S] + δ.   (3)

For numerical data, a Gaussian mechanism defined in [24] can be used to guarantee (ε, δ)-DP. According to [24], we present the following DP mechanism obtained by adding artificial Gaussian noise. In order to ensure that the given noise distribution n ∼ N(0, σ²) preserves (ε, δ)-DP, where N represents the Gaussian distribution, we choose the noise scale σ ≥ cΔs/ε and the constant c ≥ √(2 ln(1.25/δ)) for ε ∈ (0, 1). Here, n is the value of an additive noise sample for a data item in the dataset, Δs is the sensitivity of the function s, given by Δs = max_{D_i, D_i'} ||s(D_i) − s(D_i')||, and s is a real-valued function.

Considering the above DP mechanism, choosing an appropriate level of noise remains a significant research problem, which will affect the privacy guarantee of clients and the convergence rate of the FL process.
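As a concrete illustration of the Gaussian mechanism just described, the sketch below calibrates the noise scale σ from (ε, δ) and the sensitivity Δs using σ ≥ cΔs/ε with c ≥ √(2 ln(1.25/δ)); the function names are our own helpers, not part of any library.

```python
import math
import random

def gaussian_noise_scale(epsilon, delta, sensitivity):
    """Noise standard deviation for (epsilon, delta)-DP: sigma >= c * delta_s / epsilon."""
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    return c * sensitivity / epsilon

def gaussian_mechanism(value, epsilon, delta, sensitivity):
    """Release a real-valued quantity with Gaussian noise calibrated as above."""
    sigma = gaussian_noise_scale(epsilon, delta, sensitivity)
    return value + random.gauss(0.0, sigma)

# Example: epsilon = 0.5, delta = 0.01, sensitivity 2C/m with C = 15 and m = 100.
print(gaussian_noise_scale(0.5, 0.01, 2 * 15 / 100))
```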
III. FEDERATED LEARNING WITH DIFFERENTIAL PRIVACY

In this section, we first introduce the concept of global DP and analyze the DP performance in the context of FL. Then we propose the NbAFL scheme that can satisfy the DP requirement by adding proper noise perturbations at both the clients and the server.

A. Global Differential Privacy

Here, we define a global (ε, δ)-DP requirement for both uplink and downlink channels. From the uplink perspective, using a clipping technique, we can ensure that ||w_i|| ≤ C, where w_i denotes the training parameters from the i-th client without perturbation and C is a clipping threshold for bounding w_i. We assume that the batch size in the local training is equal to the number of training samples and then define the local training process at the i-th client by
    s_U^{D_i} ≜ w_i = arg min_w F_i(w, D_i) = (1/|D_i|) arg min_w ∑_{j=1}^{|D_i|} F_i(w, D_{i,j}),   (4)
where D_i is the i-th client's database and D_{i,j} is the j-th sample in D_i. Thus, the sensitivity of s_U^{D_i} can be expressed as
    Δs_U^{D_i} = max_{D_i, D_i'} ||s_U^{D_i} − s_U^{D_i'}|| = max_{D_i, D_i'} ||(1/|D_i|) arg min_w ∑_{j=1}^{|D_i|} F_i(w, D_{i,j}) − (1/|D_i|) arg min_w ∑_{j=1}^{|D_i|} F_i(w, D'_{i,j})|| = 2C/|D_i|,   (5)
where D_i' is a dataset adjacent to D_i that has the same size but differs in only one sample, and D'_{i,j} is the j-th sample in D_i'. From the above result, a global sensitivity in the uplink channel can be defined by
    Δs_U ≜ max{Δs_U^{D_i}}, ∀i.   (6)
To achieve a small global sensitivity, the ideal condition is that all the clients use sufficiently large local datasets for training. Hence, we define the minimum size of the local datasets by m and then obtain Δs_U = 2C/m. To ensure (ε, δ)-DP for each client in the uplink in one exposure, we set the noise scale, represented by the standard deviation of the additive Gaussian noise, as σ_U = cΔs_U/ε. Considering L exposures of local parameters, we need to set σ_U = cLΔs_U/ε due to the linear relation between ε and σ_U in the Gaussian mechanism.
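The uplink-side clipping and perturbation just described can be sketched as follows; this is a minimal illustration only, written under the paper's notation (clipping threshold C, minimum local dataset size m, L uplink exposures) with our own function names.

```python
import numpy as np

def clip_parameters(w, C):
    """Scale w so that ||w|| <= C."""
    norm = np.linalg.norm(w)
    return w / max(1.0, norm / C)

def perturb_uplink(w, C, m, epsilon, delta, L):
    """Clip and add Gaussian noise with sigma_U = c * L * delta_s_U / epsilon,
    where delta_s_U = 2C/m is the uplink sensitivity."""
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    sigma_u = c * L * (2.0 * C / m) / epsilon
    w_clipped = clip_parameters(w, C)
    return w_clipped + np.random.normal(0.0, sigma_u, size=w_clipped.shape)

noisy = perturb_uplink(np.random.randn(10), C=15, m=100, epsilon=50, delta=0.01, L=1)
```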
training process, the sensitivity for Di after the aggregation
Then we propose the NbAFL scheme that can satisfy the DP
operation sDDi is given by
requirement by adding proper noisy perturbations at both the
clients and the server. D 2C pi
sD i = . (8)
1 Here we assume that the adversary cannot know where the parameters
m
come from. Proof: See Appendix A.

Algorithm 1 Noising Before Aggregation FL
Data: T, w^(0), μ, ε and δ
1: Initialization: t = 1 and w_i^(0) = w^(0), ∀i
2: while t ≤ T do
3:   Local training process:
4:   while C_i ∈ {C_1, C_2, ..., C_N} do
5:     Update the local parameters w_i^(t) as
6:       w_i^(t) = arg min_{w_i} { F_i(w_i) + (μ/2)||w_i − w^(t−1)||² }
7:     Clip the local parameters: w_i^(t) = w_i^(t) / max(1, ||w_i^(t)|| / C)
8:     Add noise and upload the parameters: w̃_i^(t) = w_i^(t) + n_i^(t)
9:   Model aggregating process:
10:  Update the global parameters w^(t) as w^(t) = ∑_{i=1}^{N} p_i w̃_i^(t)
11:  The server broadcasts the global noised parameters w̃^(t) = w^(t) + n_D^(t)
12:  Local testing process:
13:  while C_i ∈ {C_1, C_2, ..., C_N} do
14:    Test the aggregated parameters w̃^(t) using the local dataset
15:  t ← t + 1
Result: w̃^(T)

Remark 1: From the above lemma, to achieve a small global sensitivity in the downlink channel, which is defined by
    Δs_D ≜ max{Δs_D^{D_i}} = max{2C p_i / m}, ∀i,   (9)
the ideal condition is that all the clients should use the same size of local datasets for training, i.e., p_i = 1/N.

From the above remark, when setting p_i = 1/N, ∀i, we can obtain the optimal value of the sensitivity Δs_D. So here we should add noise at the client side first and then decide whether or not to add noise at the server to satisfy the (ε, δ)-DP criterion in the downlink channel.

Theorem 1 (DP Guarantee for Downlink Channels): To ensure (ε, δ)-DP in the downlink channels with T aggregations, the standard deviation of the Gaussian noise terms n_D that are added to the aggregated parameter w by the server can be given as
    σ_D = 2cC√(T² − L²N) / (mNε)  if T > L√N,  and  σ_D = 0  if T ≤ L√N.   (10)
Proof: See Appendix B.

Theorem 1 shows that to satisfy an (ε, δ)-DP requirement for the downlink channels, additional noise terms n_D need to be added by the server. With a certain L, the standard deviation of the additional noise depends on the relationship between the number of aggregation times T and the number of clients N. The intuition is that a larger T can lead to a higher chance of information leakage, while a larger number of clients is helpful for hiding their private information. This theorem also provides the variance value of the noise terms that should be added to the aggregated parameters. Based on the above results, we propose the following NbAFL algorithm.
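A small helper, under our own naming, that evaluates the server-side standard deviation in (10); it simply implements the two cases of Theorem 1 and is meant only to make the dependence on T, L and N concrete.

```python
import math

def sigma_downlink(T, L, N, C, m, epsilon, delta):
    """Standard deviation of the server-side Gaussian noise n_D from Theorem 1."""
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    if T > L * math.sqrt(N):
        return 2.0 * c * C * math.sqrt(T**2 - L**2 * N) / (m * N * epsilon)
    return 0.0  # uplink noise already suffices when T <= L * sqrt(N)

# Example: with T = 25, L = 1, N = 50 we have T > L*sqrt(N), so extra noise is added.
print(sigma_downlink(T=25, L=1, N=50, C=15, m=100, epsilon=60, delta=0.01))
```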
B. Proposed NbAFL

Algorithm 1 outlines our NbAFL for training an effective model with a global (ε, δ)-DP requirement. We denote by μ the preset constant of the proximal term and by w^(0) the initial global parameter. At the beginning of this algorithm, the required privacy level parameters (ε, δ) are set and the initial global parameter w^(0) is sent to the clients. In the t-th aggregation, N active clients respectively train the parameters by using local databases with preset termination conditions. After completing the local training, the i-th client, ∀i, will add noise to the trained parameters w_i^(t), and upload the noised parameters to the server for aggregation.

Then the server updates the global parameters w^(t) by aggregating the local parameters integrated with different weights. The additive noise terms n_D^(t) are added to this w^(t) according to Theorem 1 before being broadcast to the clients. Based on the received global parameters, each client will estimate the accuracy by using local testing databases and start the next round of the training process based on these received parameters. The FL process completes after the aggregation time reaches a preset number T and the algorithm returns w̃^(T).
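The following sketch mirrors the flow of Algorithm 1 for one aggregation round, with plain NumPy stand-ins for local training (a gradient step on a quadratic toy loss replaces line 6); all function and variable names are ours and the local solver is deliberately simplified, so this is an illustration of the scheme rather than the authors' code.

```python
import numpy as np

def local_update(w_global, data, mu, lr=0.01, steps=20):
    """Toy stand-in for line 6 of Algorithm 1: minimize a local quadratic loss
    plus the proximal term (mu/2)*||w - w_global||^2 by gradient descent."""
    w = w_global.copy()
    for _ in range(steps):
        grad = (w - data.mean(axis=0)) + mu * (w - w_global)
        w -= lr * grad
    return w

def nbafl_round(w_global, datasets, C, m, eps, delta, L, T, mu):
    c = np.sqrt(2 * np.log(1.25 / delta))
    sigma_u = c * L * (2 * C / m) / eps
    N = len(datasets)
    uploads = []
    for data in datasets:                       # local training, clipping, noising
        w = local_update(w_global, data, mu)
        w = w / max(1.0, np.linalg.norm(w) / C)
        uploads.append(w + np.random.normal(0, sigma_u, w.shape))
    w_new = np.mean(uploads, axis=0)            # aggregation with p_i = 1/N
    if T > L * np.sqrt(N):                      # server-side noise (Theorem 1)
        sigma_d = 2 * c * C * np.sqrt(T**2 - L**2 * N) / (m * N * eps)
        w_new = w_new + np.random.normal(0, sigma_d, w_new.shape)
    return w_new

# Toy usage: 5 clients, 2-D parameters, synthetic local data.
rng = np.random.default_rng(0)
datasets = [rng.normal(i, 1.0, size=(100, 2)) for i in range(5)]
w = np.zeros(2)
for _ in range(10):
    w = nbafl_round(w, datasets, C=15, m=100, eps=60, delta=0.01, L=1, T=10, mu=0.1)
```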
Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all local parameters is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information at the i-th client from its uploaded parameters. After the model aggregation, the aggregated parameters w will be sent back to the clients via broadcast channels. This poses threats to the clients' privacy, as potential adversaries may reveal sensitive information about individual clients from w. In this case, additive noise may be applied to w based on Theorem 1.

IV. CONVERGENCE ANALYSIS ON NBAFL

In this section, we are ready to analyze the convergence performance of the proposed NbAFL. First, we analyze the expected increment of adjacent aggregations in the loss function with Gaussian noise. Then, we focus on deriving the convergence property under the global (ε, δ)-DP requirement. For the convenience of the analysis, we make the following assumptions on the loss function and network parameters.

Assumption 1: We make the following assumptions on the global loss function F(·), defined by F(·) ≜ ∑_{i=1}^{N} p_i F_i(·), and the i-th local loss function F_i(·):
1) F_i(w) is convex;
2) F_i(w) satisfies the Polyak-Lojasiewicz condition with the positive parameter l, which implies that F(w) − F(w*) ≤ (1/(2l))||∇F(w)||², where w* is the optimal result;
3) F(w^(0)) − F(w*) = Θ;
4) F_i(w) is β-Lipschitz, i.e., ||F_i(w) − F_i(w')|| ≤ β||w − w'||, for any w, w';
5) F_i(w) is ρ-Lipschitz smooth, i.e., ||∇F_i(w) − ∇F_i(w')|| ≤ ρ||w − w'||, for any w, w', where ρ is a constant determined by the practical loss function;
6) For any i and w, ||∇F_i(w) − ∇F(w)|| ≤ ε_i, where ε_i is the divergence metric.

Similar to the gradient divergence, the divergence metric ε_i captures the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed at different nodes. Using Assumption 1 and assuming ||∇F(w)||² to be uniformly away from zero, we then have the following lemma.

Lemma 2 (B-Dissimilarity of Various Clients): For a given ML parameter w, there exists B satisfying
    E{||∇F_i(w)||²} ≤ ||∇F(w)||² B², ∀i.   (11)
Proof: See Appendix C.

Lemma 2 follows from the assumption of the divergence metric and reflects the statistical heterogeneity of all clients. As mentioned earlier, the values of ρ and B are determined by the specific global loss function F(w) in practice and the training parameters w. With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of the parameters with artificial noise.

Lemma 3 (Expected Increment in the Loss Function): After receiving updates, from the t-th to the (t+1)-th aggregation, the expected difference in the loss function can be upper-bounded by
    E{F(w^(t+1)) − F(w^(t))} ≤ λ₂ E{||∇F(w^(t))||²} + λ₁ E{||n^(t+1)|| ||∇F(w^(t))||} + λ₀ E{||n^(t+1)||²},   (12)
where λ₀ = ρ/2, λ₁ = 1/μ + ρB/μ, λ₂ = −1/μ + ρB/μ² + ρB²/(2μ²), and n^(t) are the equivalent noise terms imposed on the parameters after the t-th aggregation, given by n^(t) = ∑_{i=1}^{N} p_i n_i^(t) + n_D^(t).
Proof: See Appendix D.

In this lemma, the value of an additive noise sample n in the vector n^(t) follows the Gaussian distribution n ∼ N(0, σ_A²). Also, we can obtain σ_A = √(σ_D² + σ_U²/N) from Section III. From the right-hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term μ to achieve a low upper bound. It is clear that artificial noise with a large σ_A may improve the DP performance in terms of privacy protection. However, from the RHS of (12), a large σ_A may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of convergence performance.

Furthermore, to satisfy the global (ε, δ)-DP, by using Theorem 1, we have
    σ_A = cTΔs_D/ε  if T > L√N,  and  σ_A = cLΔs_U/(√N ε)  if T ≤ L√N.   (13)

Next, we will analyze the convergence property of NbAFL with the (ε, δ)-DP requirement.

Theorem 2 (Convergence Upper Bound of the NbAFL): With required protection level ε, the convergence upper bound of Algorithm 1 after T aggregations is given by
    E{F(w^(T)) − F(w*)} ≤ P^T Θ + (κ₁T/ε + κ₀T²/ε²)(1 − P^T),   (14)
where P = 1 + 2lλ₂, κ₁ = (λ₁βcC/(m(1−P))) √(2/(Nπ)) and κ₀ = λ₀c²C²/(m²(1−P)N).
Proof: See Appendix E.

Theorem 2 reveals an important relationship between privacy and utility by taking into account the protection level ε and the number of aggregation times T. As the number of aggregation times T increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing T as a continuous variable and writing the RHS of (14) as h(T), we have
    d²h(T)/dT² = (Θ − κ₁T/ε − κ₀T²/ε²) P^T ln²P − 2(κ₁/ε + 2κ₀T/ε²) P^T ln P + (2κ₀/ε²)(1 − P^T).   (15)
It can be seen that the second term and the third term on the RHS of (15) are always positive. When N and ε are set large enough, we can see that κ₁ and κ₀ are small, and thus the first term can also be positive. In this case, we have d²h(T)/dT² > 0 and the upper bound is convex in T.

Remark 2: As can be seen from this theorem, the expected gap between the achieved loss function F(w^(T)) and the minimum one F(w*) is a decreasing function of ε. By increasing ε, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm will improve. This is reasonable because the variance of the artificial noise terms decreases, thereby improving the convergence performance.

Remark 3: The number of clients N will also affect the iterative convergence performance, i.e., a larger N would achieve a better convergence performance. This is because a larger N leads to a lower variance of the artificial noise terms.

Remark 4: There is an optimal number of maximum aggregation times T in terms of convergence performance for given ε and N. In more detail, a larger T may lead to a higher variance of artificial noise, and thus may pose a negative impact on convergence performance. On the other hand, more iterations can generally boost the convergence performance if noise levels are not large enough. In this sense, there is a tradeoff in choosing a proper T.
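To make Remark 4 concrete, one can numerically evaluate the right-hand side of (14) over T and pick the minimizer. The sketch below does exactly that, with purely illustrative (not fitted) values for Θ, P, κ₁ and κ₀; in practice these constants follow from Theorem 2.

```python
import numpy as np

def bound_h(T, theta, P, kappa1, kappa0, eps):
    """Right-hand side of Eq. (14) as a function of the number of aggregations T."""
    return P**T * theta + (kappa1 * T / eps + kappa0 * T**2 / eps**2) * (1 - P**T)

# Illustrative constants only.
theta, P, kappa1, kappa0, eps = 5.0, 0.9, 2.0, 1.0, 60.0
Ts = np.arange(1, 101)
values = bound_h(Ts, theta, P, kappa1, kappa0, eps)
T_opt = Ts[np.argmin(values)]
print("optimal number of aggregation times:", T_opt)
```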
V. K-CLIENT RANDOM SCHEDULING POLICY

In this section, we consider the case where only K (K < N) clients are selected to participate in the aggregation process, namely K-client random scheduling.

We now discuss how to add artificial noise in the K-client random scheduling to satisfy a global (ε, δ)-DP. It is obvious that in the uplink channels, each of the K scheduled clients should add noise terms with scale σ_U = cLΔs_U/ε for
achieving (ε, δ)-DP. This is equivalent to the noise scale in the all-clients selection case in Section III, since each client only considers its own privacy for uplink channels in both cases. However, the derivation of the noise scale in the downlink will be different for the K-client random scheduling. As an extension of Theorem 1, we present the following lemma in the case of K-client random scheduling on how to obtain σ_D.

Lemma 4 (DP Guarantee for K-Client Random Scheduling): In the NbAFL algorithm with K-client random scheduling, to satisfy a global (ε, δ)-DP, the standard deviation σ_D of the additive Gaussian noise terms for downlink channels should be set as
    σ_D = 2cC√(T²/b² − L²K) / (mKε)  if T > ε/γ,  and  σ_D = 0  if T ≤ ε/γ,   (16)
where b = −(T/ε) ln(1 − N/K + (N/K) e^{−ε/T}) and γ = −ln(1 − K/N + (K/N) e^{−ε/(L√K)}).
Proof: See Appendix F.
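A sketch (our own helper, using the b and γ of Lemma 4) of how the downlink noise scale could be evaluated under K-client random scheduling. Note that b is defined only when T is large enough relative to ε, as formalized in condition (60) of Appendix F, so the sketch checks this first.

```python
import math

def sigma_downlink_k(T, L, K, N, C, m, epsilon, delta):
    """Server-side noise standard deviation from Lemma 4 under K-client scheduling.

    Requires T > -epsilon / ln(1 - K/N) (condition (60)) so that b is well defined.
    """
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    q = K / N
    assert T > -epsilon / math.log(1.0 - q), "T too small for this epsilon (see Eq. (60))"
    b = -(T / epsilon) * math.log(1.0 - 1.0 / q + (1.0 / q) * math.exp(-epsilon / T))
    gamma = -math.log(1.0 - q + q * math.exp(-epsilon / (L * math.sqrt(K))))
    if T > epsilon / gamma:
        return 2.0 * c * C * math.sqrt(T**2 / b**2 - L**2 * K) / (m * K * epsilon)
    return 0.0  # no extra server-side noise needed

# Example with N = 50 clients and K = 20 scheduled per round.
print(sigma_downlink_k(T=25, L=1, K=20, N=50, C=15, m=100, epsilon=10, delta=0.01))
```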
Lemma 4 recalculates σ_D by considering the number of chosen clients K. Generally, the number of clients N is fixed; we thus focus on the effect of K. Based on the DP analysis in Lemma 4, we can obtain the following theorem.

Theorem 3 (Convergence Under K-Client Random Scheduling): With required protection level ε and the number of chosen clients K, for any ε > 0, the convergence upper bound after T aggregation times is given by
    E{F(v^(T)) − F(w*)} ≤ Q^T Θ + ((1 − Q^T)/(1 − Q)) [ cCα₁β √(2/π) / (−mKε ln(1 − K/N + (N/K)e^{−ε/T})) + c²C²α₀ / (m²K²ε² ln²(1 − K/N + (N/K)e^{−ε/T})) ],   (17)
where Q = 1 + (2l/μ)(ρB²/(2μ) + ρB + ρB²/K + 2ρB²/√K + μB/√K − μ), α₀ = 2ρ√K/N + ρ, α₁ = 1 + 2ρB/μ + 2ρB√K/(μN), and v^(T) = ∑_{i=1}^{K} p_i(w_i^(T) + n_i^(T)) + n_D^(T).
Proof: See Appendix G.

The above theorem provides the convergence upper bound between F(v^(T)) and F(w*) under K-client random scheduling. Using K-client random scheduling, we can obtain an important relationship between privacy and utility by taking into account the protection level ε, the number of aggregation times T and the number of chosen clients K.

Remark 5: From the bound derived in Theorem 3, we conclude that there is an optimal K between 0 and N that achieves the optimal convergence performance. That is, by finding a proper K, the K-client random scheduling policy is superior to the one in which all N clients participate in the FL aggregations.

Fig. 2. The comparison of training loss with various protection levels for 50 clients using ε = 50, ε = 60 and ε = 100, respectively.

VI. SIMULATION RESULTS

In this section, we evaluate the proposed NbAFL by using a multi-layer perceptron (MLP) and real-world federated datasets. In order to characterize the convergence property of NbAFL, we conduct experiments by varying the protection level ε, the number of clients N, the number of maximum aggregation times T and the number of chosen clients K.

We conduct experiments on the standard MNIST dataset for handwritten digit recognition, consisting of 60000 training examples and 10000 testing examples [37]. Each example is a 28 × 28 gray-level image. Our baseline model uses an MLP network with a single hidden layer containing 256 hidden units. In this feed-forward neural network, we use ReLU units and a softmax output over 10 classes (corresponding to the 10 digits). For the network optimizer, we set the learning rate to 0.002. Then, we evaluate this MLP on the multi-class classification task with the standard MNIST dataset, namely, recognizing digits from 0 to 9, where each client has 100 training samples locally. This setting is in line with the ideal condition in Remark 1.
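The baseline classifier described above (one hidden layer of 256 ReLU units, a 10-way softmax output, learning rate 0.002) can be written down as follows. This is a hedged re-creation of that setup in PyTorch rather than the authors' code, and the optimizer choice (plain SGD) is our assumption.

```python
import torch
import torch.nn as nn

# 784-256-10 MLP matching the description in the text.
model = nn.Sequential(
    nn.Flatten(),                # 28 x 28 image -> 784 vector
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),          # softmax over 10 digits is folded into the loss
)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)

# One local training step on a dummy batch of 100 samples (one client's data size).
x, y = torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```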
We note that parameter clipping C is a popular ingredient of SGD and ML for non-privacy reasons. A proper value of the clipping threshold C should be considered for the DP-based FL framework. In the following experiments (except Subsection B), we utilize the method in [29] and choose C by taking the median of the norms of the unclipped parameters over the course of training. The values of ρ, β, l and B are determined by the specific loss function, and we will use estimated values in our simulations [20].

A. Performance Evaluation on Protection Levels

In Fig. 2, we choose various protection levels ε = 50, ε = 60 and ε = 100 to show the results of the loss function in NbAFL. Furthermore, we also include a non-private approach to compare with our NbAFL. In this experiment, we set N = 50, T = 25 and δ = 0.01, and compute the values of the loss function as a function of the aggregation times t. As shown in Fig. 2, values of the loss function in NbAFL are decreasing as we relax the privacy guarantees (increasing ε).
Fig. 3. The comparison of training loss with various protection levels for 50 clients using ε = 6, ε = 8 and ε = 10, respectively.

Fig. 4. The comparison of training loss with various privacy levels for 50 clients using ε = 50, ε = 60 and ε = 100, respectively.

Fig. 5. The comparison of training loss with various clipping thresholds for 50 clients using ε = 60.

Fig. 6. The value of the loss function with various numbers of clients under ε = 60 under the NbAFL algorithm.

Such observations are in line with Remark 2. We also choose high protection levels ε = 6, ε = 8 and ε = 10 for this experiment, where each client has 512 training samples locally. We set N = 50, T = 25 and δ = 0.01. From Fig. 3, we can draw a similar conclusion as in Remark 2, namely that values of the loss function in NbAFL decrease as we relax the privacy guarantees.

Considering the K-client random scheduling, in Fig. 4 we investigate the performance with various protection levels ε = 50, ε = 60 and ε = 100. For the simulation parameters, we set N = 50, K = 20, T = 25, and δ = 0.01. As shown in Fig. 4, the convergence performance under the K-client random scheduling is improved with an increasing ε.

B. Impact of the Clipping Threshold C

In Fig. 5, we choose various clipping thresholds C = 10, 15, 20 and 25 to show the results of the loss function for 50 clients using ε = 60 in NbAFL. As shown in Fig. 5, when C = 20, the convergence performance of NbAFL obtains the best value. We note that limiting the parameter norm has two opposing effects. On the one hand, if the clipping threshold C is too small, clipping destroys the intended gradient direction of the parameters. On the other hand, increasing the norm bound C forces us to add more noise to the parameters because of its effect on the sensitivity.

C. Impact of the Number of Clients N

Fig. 6 compares the convergence performance of NbAFL under the required protection level ε = 60 and δ = 10⁻² as a function of the number of clients, N. In this experiment, we set N = 50, N = 60, N = 80 and N = 100. We notice that the performance among different numbers of clients is governed by Remark 3. This is because more clients not only provide larger global datasets for training, but also bring down the standard deviation of the additive noise due to the aggregation.

D. Impact of the Number of Maximum Aggregation Times T

In Fig. 7, we show the experimental results of the training loss as a function of the maximum aggregation times with various privacy levels ε = 50, 60, 80 and 100 under the NbAFL algorithm. This observation is in line with Remark 4, and the reason
Fig. 7. The convergence upper bounds with various privacy levels ε = 50, 60 and 100 under the 50-client NbAFL algorithm.

Fig. 8. The comparison of the loss function between experimental and theoretical results with various aggregation times under the NbAFL algorithm with 50 clients.

Fig. 9. The value of the loss function with various privacy levels ε = 60 and ε = 100 under the NbAFL algorithm with 50 clients.

Fig. 10. The value of the loss function with various numbers of chosen clients under ε = 50, 60, 100 under the NbAFL algorithm and the non-private approach with 50 clients.

comes from the fact that a lower privacy level decreases the standard deviation of the additive noise terms, and the server can obtain better-quality ML model parameters from the clients. Fig. 7 also implies that the optimal number of maximum aggregation times increases almost monotonically with increasing ε.

In Fig. 9, we plot the values of the loss function in the normalized NbAFL using solid lines and in the K-random scheduling based NbAFL using dotted lines, with various numbers of maximum aggregation times. This figure shows that the value of the loss function is a convex function of the maximum aggregation times for a given protection level under the NbAFL algorithm, which validates Remark 4. From Fig. 9, we can also see that for a given ε, the K-client random scheduling based NbAFL algorithm has a better convergence performance than the normalized NbAFL algorithm for a larger T. This is because K-client random scheduling can bring down the variance of the artificial noise with little performance loss.

E. Impact of the Number of Chosen Clients K

In Fig. 10, we plot values of the loss function with various numbers of chosen clients K under the random scheduling policy in NbAFL. The number of clients is N = 50, and K clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set ε = 50, ε = 60, ε = 100 and δ = 0.01. Meanwhile, we also exhibit the performance of the non-private approach with various numbers of chosen clients K. Note that an optimal K, which further improves the convergence performance, exists for various protection levels, due to a tradeoff between enhanced privacy protection and involving larger global training datasets in each model updating round. This observation is in line with Remark 5. The figure shows that in NbAFL, for a given protection level ε, the K-client random scheduling can obtain a better tradeoff than the normal selection policy.

VII. CONCLUSIONS

In this paper, we have focused on information leakage in SGD-based FL. We have first defined a global (ε, δ)-DP requirement for both uplink and downlink channels, and developed variances of the artificial noise terms at the client and server sides. Then, we have proposed a novel framework based on the concept of global (ε, δ)-DP, named NbAFL.
We have theoretically developed a convergence bound on the loss function of the trained FL model in the NbAFL. Using this convergence bound, we have obtained the following results: 1) there is a tradeoff between the convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and 3) there is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we have proposed a K-client random scheduling strategy and also developed a corresponding convergence bound on the loss function in this case. In addition to the above three properties, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level. Extensive simulation results confirm the correctness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels. We note that the size and the distribution of data both greatly affect the quality of FL training. As future work, it is of great interest to analytically evaluate the convergence performance of NbAFL with varying size and distribution of data at the client sides.

APPENDIX A
PROOF OF LEMMA 1

From the downlink perspective, for all D_i and D_i' which differ in a single entry, the sensitivity can be expressed as
    Δs_D^{D_i} = max_{D_i, D_i'} ||s_D^{D_i} − s_D^{D_i'}||.   (18)
Based on (4) and (7), we have
    s_D^{D_i} = p_1 w_1(D_1) + ... + p_i w_i(D_i) + ... + p_N w_N(D_N)   (19)
and
    s_D^{D_i'} = p_1 w_1(D_1) + ... + p_i w_i(D_i') + ... + p_N w_N(D_N).   (20)
Furthermore, the sensitivity can be given as
    Δs_D^{D_i} = max_{D_i, D_i'} ||p_i w_i(D_i) − p_i w_i(D_i')|| = p_i max_{D_i, D_i'} ||w_i(D_i) − w_i(D_i')|| = p_i Δs_U^{D_i} ≤ 2C p_i / m.   (21)
Hence, we know Δs_D^{D_i} = 2C p_i / m. This completes the proof. ∎

APPENDIX B
PROOF OF THEOREM 1

To ensure a global (ε, δ)-DP in the uplink channels, the standard deviation of the additive noise terms at the client sides can be set to σ_U = cLΔs_U/ε due to the linear relation between ε and σ_U in the Gaussian mechanism, where Δs_U = 2C/m is the uplink sensitivity and m is the data size of each client. We then let the samples of the i-th local noise vector follow the same distribution n_i ∼ ϕ(n) (i.i.d. for all i), because each client is subject to the same global (ε, δ)-DP. The aggregation process with artificial noise added by the clients can be expressed as
    w = ∑_{i=1}^{N} p_i(w_i + n_i) = ∑_{i=1}^{N} p_i w_i + ∑_{i=1}^{N} p_i n_i.   (22)
The distribution φ_N(n) of ∑_{i=1}^{N} p_i n_i can be expressed as
    φ_N(n) = ϕ_1(n) ∗ ϕ_2(n) ∗ ... ∗ ϕ_N(n),   (23)
where p_i n_i ∼ ϕ_i(n) and ∗ is the convolution operation. When we use the Gaussian mechanism for n_i with noise scale σ_U, the distribution of p_i n_i is also Gaussian. To obtain a small sensitivity Δs_D, we set p_i = 1/N. Furthermore, the noise scale σ_U/√N of the Gaussian distribution φ_N(n) can be calculated. To ensure a global (ε, δ)-DP in the downlink channels, the standard deviation of the additive noise terms can be set to σ_A = cTΔs_D/ε, where Δs_D = 2C/(mN). Hence, we can obtain the standard deviation of the additive noise at the server as
    σ_D = √(σ_A² − σ_U²/N) = 2cC√(T² − L²N)/(mNε)  if T > L√N,  and  σ_D = 0  if T ≤ L√N.   (24)
Hence, Theorem 1 has been proved. ∎

APPENDIX C
PROOF OF LEMMA 2

Due to Assumption 1, we have
    E{||∇F_i(w) − ∇F(w)||²} ≤ E{ε_i²}   (25)
and
    E{||∇F_i(w) − ∇F(w)||²} = E{||∇F_i(w)||²} − 2E{∇F_i(w)}ᵀ∇F(w) + ||∇F(w)||² = E{||∇F_i(w)||²} − ||∇F(w)||².   (26)
Considering (25), (26) and ∇F(w) = E{∇F_i(w)}, we have
    E{||∇F_i(w)||²} ≤ ||∇F(w)||² + E{ε_i²} = ||∇F(w)||² B(w)².   (27)
Note that when ||∇F(w)||² ≠ 0, there exists
    B(w) = √(1 + E{ε_i²}/||∇F(w)||²) ≥ 1,   (28)
which satisfies the equation. We note that a smaller value of B(w) implies that the local loss functions are more locally similar. When all the local loss functions are the same, B(w) = 1 for all w. Therefore, we have
    E{||∇F_i(w)||²} ≤ ||∇F(w)||² B², ∀i,   (29)
where B is the upper bound of B(w). This completes the proof. ∎
APPENDIX D
PROOF OF LEMMA 3

Considering the aggregation process with artificial noise added by the clients and the server in the (t+1)-th aggregation, we have
    w^(t+1) = ∑_{i=1}^{N} p_i w_i^(t+1) + n^(t+1),   (30)
where
    n^(t) = ∑_{i=1}^{N} p_i n_i^(t) + n_D^(t).   (31)
Because F_i(·) is ρ-Lipschitz smooth, we know
    F_i(w^(t+1)) ≤ F_i(w^(t)) + ∇F_i(w^(t))ᵀ(w^(t+1) − w^(t)) + (ρ/2)||w^(t+1) − w^(t)||²,   (32)
for all w^(t+1), w^(t). Combining F(w^(t)) = E{F_i(w^(t))} and ∇F(w^(t)) = E{∇F_i(w^(t))}, we have
    E{F(w^(t+1)) − F(w^(t))} ≤ E{∇F(w^(t))ᵀ(w^(t+1) − w^(t))} + (ρ/2)E{||w^(t+1) − w^(t)||²}.   (33)
We define
    J(w_i^(t+1); w^(t)) ≜ F_i(w_i^(t+1)) + (μ/2)||w_i^(t+1) − w^(t)||².   (34)
Then, we know
    ∇J(w_i^(t+1); w^(t)) = ∇F_i(w_i^(t+1)) + μ(w_i^(t+1) − w^(t))   (35)
and
    w^(t+1) − w^(t) = ∑_{i=1}^{N} p_i(w_i^(t+1) + n_i^(t+1)) + n_D^(t+1) − w^(t) = (1/μ)E{∇J(w_i^(t+1); w^(t)) − ∇F_i(w_i^(t+1))} + n^(t+1).   (36)
Because F_i(·) is ρ-Lipschitz smooth, we can obtain
    E{||∇F_i(w_i^(t+1))||} ≤ E{||∇F_i(w^(t))|| + ρ||w_i^(t+1) − w^(t)||} = ||∇F(w^(t))|| + ρE{||w_i^(t+1) − w^(t)||}.   (37)
Now, let us bound ||w_i^(t+1) − w^(t)||. We know
    ||w_i^(t+1) − w^(t)|| ≤ ||w_i^(t+1) − ŵ_i^(t+1)|| + ||ŵ_i^(t+1) − w^(t)||,   (38)
where ŵ_i^(t+1) = arg min_w J_i(w; w^(t)). Let us define μ̄ = μ + l > 0; then J_i(w; w^(t)) is μ̄-convex. Based on this, we can obtain
    ||w_i^(t+1) − ŵ_i^(t+1)|| ≤ (θ/μ̄)||∇F_i(w^(t))||   (39)
and
    ||ŵ_i^(t+1) − w^(t)|| ≤ (1/μ̄)||∇F_i(w^(t))||,   (40)
where θ denotes a θ-inexact solution of min_w J_i(w; w^(t)), as defined in [19]. Now, we can use the inequalities (39) and (40) to obtain
    ||w_i^(t+1) − w^(t)|| ≤ ((1+θ)/μ̄)||∇F_i(w^(t))||.   (41)
Therefore,
    ||w̃^(t+1) − w^(t)|| ≤ ||w^(t+1) − w^(t)|| + ||n^(t+1)|| ≤ E{||w_i^(t+1) − w^(t)||} + ||n^(t+1)|| ≤ ((1+θ)/μ̄)E{||∇F_i(w^(t))||} + ||n^(t+1)|| ≤ (B(1+θ)/μ̄)||∇F(w^(t))|| + ||n^(t+1)||.   (42)
Using (37) and (38), we know
    ||E{∇F_i(w_i^(t+1))} − ∇F(w^(t)) − E{∇J(w_i^(t+1); w^(t))}|| ≤ ρE{||w_i^(t+1) − w^(t)||} + E{||∇J(w_i^(t+1); w^(t))||} ≤ (ρB(1+θ)/μ̄ + Bθ)||∇F(w^(t))||.   (43)
Substituting (37), (42) and (43) into (33), we know
    E{F(w^(t+1)) − F(w^(t))} ≤ E{∇F(w^(t))ᵀ(n^(t+1) − (1/μ)∇F(w^(t)))} + (ρB(1+θ)/(μμ̄) + Bθ/μ)||∇F(w^(t))||² + (ρ/2)E{((B(1+θ)/μ̄)||∇F(w^(t))|| + ||n^(t+1)||)²}.   (44)
Then, using the triangle inequality, we can obtain
    E{F(w^(t+1)) − F(w^(t))} ≤ λ₂E{||∇F(w^(t))||²} + λ₁E{||n^(t+1)||}||∇F(w^(t))|| + λ₀E{||n^(t+1)||²},   (45)
where
    λ₂ = −1/μ + (B/μ)(ρ(1+θ)/μ̄ + θ) + ρB²(1+θ)²/(2μ̄²),   (46)
    λ₁ = 1/μ + ρB(1+θ)/μ̄  and  λ₀ = ρ/2.   (47)
In this convex case, where μ̄ = μ, if θ = 0, all subproblems are solved accurately. We then obtain λ₂ = −1/μ + ρB/μ² + ρB²/(2μ²), λ₁ = 1/μ + ρB/μ and λ₀ = ρ/2. This completes the proof. ∎

APPENDIX E
PROOF OF THEOREM 2

We assume that F satisfies the Polyak-Lojasiewicz inequality [38] with positive parameter l, which implies that
    E{F(w^(t)) − F(w*)} ≤ (1/(2l))||∇F(w^(t))||².   (48)
Moreover, subtracting E{F(w*)} from both sides of (45), we know
    E{F(w^(t+1)) − F(w*)} ≤ E{F(w^(t)) − F(w*)} + λ₂E{||∇F(w^(t))||²} + λ₁E{||n^(t+1)||}||∇F(w^(t))|| + λ₀E{||n^(t+1)||²}.   (49)
Considering ||∇F(w^(t))|| ≤ β and (48), we have
    E{F(w^(t+1)) − F(w*)} ≤ (1 + 2lλ₂)E{F(w^(t)) − F(w*)} + λ₁βE{||n^(t+1)||} + λ₀E{||n^(t+1)||²},   (50)
where F(w*) is the loss function corresponding to the optimal parameters w*. Considering the identical and independent distribution of the additive noise terms, we define E{||n^(t)||} = E{||n||} and E{||n^(t)||²} = E{||n||²}, for 0 ≤ t ≤ T. Applying (50) recursively, we have
    E{F(w^(T)) − F(w*)} ≤ (1 + 2lλ₂)^T E{F(w^(0)) − F(w*)} + (λ₁βE{||n||} + λ₀E{||n||²}) ∑_{t=0}^{T−1}(1 + 2lλ₂)^t
                        = (1 + 2lλ₂)^T E{F(w^(0)) − F(w*)} + (λ₁βE{||n||} + λ₀E{||n||²}) ((1 + 2lλ₂)^T − 1)/(2lλ₂).   (51)
If T ≤ L√N then σ_D = 0, and this case is special. Hence, we consider the condition T > L√N. Based on (13), we have σ_A = cTΔs_D/ε. Hence, we can obtain
    E{||n||} = (cTΔs_D/ε)√(2N/π)  and  E{||n||²} = c²T²Δs_D²N/ε².   (52)
Substituting (52) into (51), setting Δs_D = 2C/(mN) and F(w^(0)) − F(w*) = Θ, we have
    E{F(w^(T)) − F(w*)} ≤ (1 + 2lλ₂)^T Θ + (λ₁TβcC/(εm) √(2/(Nπ)) + λ₀T²c²C²/(ε²m²N)) ((1 + 2lλ₂)^T − 1)/(2lλ₂)
                        = P^T Θ + (κ₁T/ε + κ₀T²/ε²)(1 − P^T),   (53)
where P = 1 + 2lλ₂, κ₁ = (λ₁βcC/(m(1−P)))√(2/(Nπ)) and κ₀ = λ₀c²C²/(m²(1−P)N). This completes the proof. ∎

APPENDIX F
PROOF OF LEMMA 4

We define the sampling parameter q ≜ K/N to represent the probability of each client being selected by the server in an aggregation. Let M_{1:T} denote (M_1, ..., M_T) and similarly let o_{1:T} denote a sequence of outcomes (o_1, ..., o_T). Considering a global (ε, δ)-DP in the downlink channels, we use σ_A to represent the standard deviation of the aggregated Gaussian noise terms. With neighboring datasets D_i and D_i', we are looking at
    |ln( Pr[M_{1:T}(D'_{i,1:T}) = o_{1:T}] / Pr[M_{1:T}(D_{i,1:T}) = o_{1:T}] )|
        = |ln ∏_{t=1}^{T} ((1−q)e^{−n²/(2σ_A²)} + q e^{−(n+Δs_D)²/(2σ_A²)}) / e^{−n²/(2σ_A²)}|
        = |ln ∏_{t=1}^{T} (1 − q + q e^{−(2nΔs_D + Δs_D²)/(2σ_A²)})|.   (54)
This quantity is to be bounded by ε, i.e., we require
    |ln( Pr[M_{1:T}(D'_{i,1:T}) = o_{1:T}] / Pr[M_{1:T}(D_{i,1:T}) = o_{1:T}] )| ≤ ε.   (55)
Considering the independence of the additive noise terms, we know
    T ln(1 − q + q e^{−(2nΔs_D + Δs_D²)/(2σ_A²)}) ≥ −ε.   (56)
We can obtain the result
    n ≤ −(σ_A²/Δs_D) ln((exp(−ε/T) − 1)/q + 1) − Δs_D/2.   (57)
We set
    b = −(T/ε) ln((exp(−ε/T) − 1)/q + 1).   (58)
Hence,
    ln((exp(−ε/T) − 1)/q + 1) = −bε/T.   (59)
Note that ε and T should satisfy
    ε < −T ln(1 − q)  or  T > −ε/ln(1 − q).   (60)
Then,
    n ≤ σ_A²bε/(TΔs_D) − Δs_D/2.   (61)
Using the tail bound Pr[n > η] ≤ (σ_A/η)(1/√(2π))e^{−η²/(2σ_A²)}, we can obtain
    ln(η/σ_A) + η²/(2σ_A²) > ln(√(2/π)·(1/δ)).   (62)
Let us set σ_A = cΔs_D T/(bε); if b/T ∈ (0, 1), the inequality (62) can be solved as
    c² ≥ 2 ln(1.25/δ).   (63)
Meanwhile, ε and T should satisfy
    ε < −T ln(1 − q + q e^{−1})  or  T > −ε/ln(1 − q + q e^{−1}).   (64)
If b/T > 1, we can also obtain σ_A = cΔs_D T/(bε) by adjusting the value of c. The standard deviation of the required noise terms is given as
    σ_A ≥ cΔs_D T/(bε).   (65)
Hence, if Gaussian noise terms are added at the client sides, and
we can obtain the additive noise scale in the server as ⎧ 2 ⎫
 ⎨  K  ⎬
 (t +1) 
csD T 2 c2 L 2 sU2 E{ v(t +1) 2
}=E  p i wi 
σD = − ⎩  ⎭
b K 2 ⎧
i=1
⎧  2 ⎫
⎪ 2 ⎨ K  ⎬
⎨ 2cC Tb2 − L 2 K √  (t +1) 
T > bL K , +E  pi ni 
= (66) ⎩  ⎭

⎩ mK √ i=1
0 T ≤ bL K . ⎧) * ⎫
⎨  K ⎬
Furthermore, considering (60), we can obtain + 2E pi wi(t +1) n(t +1) . (77)
⎧  ⎩ ⎭
i=1

⎪ 2cC T2
− L2 K 
⎨ b2 K
T > , Note that we set pi = Di / i=1 Di = 1/K in K -client
σD = mK γ (67)

⎪  random scheduling in order to a small sensitivity sD . We
⎩0 T ≤ ,
γ have
where ⎧ 2⎫
⎨ K ⎬ 1 K
 (t +1)  (t +1) 2 K − 1
γ = − ln 1 − q + qe L
−

. E  p i wi  ≤ 2 wi + w(t +1) 2
K (68) ⎩ ⎭ K K
i=1 i=1
This completes the proof.  (78)

APPENDIX G
PROOF OF THEOREM 3

Here we define
\[
v^{(t)} = \sum_{i=1}^{K} p_i w_i^{(t)}, \quad (69)
\]
\[
\tilde{v}^{(t)} = \sum_{i=1}^{K} p_i \left(w_i^{(t)} + n_i^{(t)}\right) + n_D^{(t)} \quad (70)
\]
and
\[
n^{(t+1)} = \sum_{i=1}^{K} p_i n_i^{(t+1)} + n_D^{(t)}, \quad (71)
\]
which considers the aggregated parameters under K-random scheduling.
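As a quick sanity check of the unbiasedness of this sampled aggregate, formalized in (76) below, the following sketch (ours, with toy dimensions and illustrative noise scales rather than the calibrated ones) samples K of the N client updates uniformly at random, averages them with p_i = 1/K, and adds client-side and server-side Gaussian noise; the Monte Carlo mean of the noisy sampled aggregate approaches the full average w.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 50, 20, 10                    # total clients, scheduled clients, toy model dimension
sigma_U, sigma_D = 0.05, 0.02           # illustrative client / server noise scales
W = rng.normal(size=(N, d))             # stand-ins for the local parameters w_i^(t+1)

def noisy_sampled_aggregate(W, K, sigma_U, sigma_D, rng):
    """One K-client random scheduling round: sample K clients, weight them by p_i = 1/K,
    add per-client noise n_i and server-side noise n_D (illustrative NbAFL-style step)."""
    idx = rng.choice(len(W), size=K, replace=False)
    noisy_clients = W[idx] + rng.normal(scale=sigma_U, size=(K, W.shape[1]))        # w_i + n_i
    return noisy_clients.mean(axis=0) + rng.normal(scale=sigma_D, size=W.shape[1])  # + n_D

v_tilde_mean = np.mean(
    [noisy_sampled_aggregate(W, K, sigma_U, sigma_D, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(v_tilde_mean - W.mean(axis=0))))   # close to 0 since the noise is zero-mean
```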
Because F_i(\cdot) and F(\cdot) are \beta-Lipschitz, we obtain that
\[
E\{F(\tilde{v}^{(t+1)})\} - F(w^{(t+1)}) \le \beta \left\|\tilde{v}^{(t+1)} - w^{(t+1)}\right\|. \quad (72)
\]
Because \beta is the Lipschitz continuity constant of the function F, we have
\[
\beta \le \left\|\nabla F(v^{(t)})\right\| + \rho\left(\left\|w^{(t+1)} - v^{(t)}\right\| + \left\|\tilde{v}^{(t+1)} - v^{(t)}\right\|\right). \quad (73)
\]
From (42), we know
\[
\left\|w^{(t+1)} - v^{(t)}\right\| \le \frac{B(1+\theta)}{\mu}\left\|\nabla F(v^{(t)})\right\|. \quad (74)
\]
Then, we have
\[
E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|^2\}
= \|w^{(t+1)}\|^2 - 2\,[w^{(t+1)}]^{\top} E\{\tilde{v}^{(t+1)}\} + E\{\|\tilde{v}^{(t+1)}\|^2\}. \quad (75)
\]
Furthermore, we can obtain
\[
E\{\tilde{v}^{(t+1)}\}
= \frac{K}{N}\sum_{i=1}^{N} p_i w_i^{(t+1)} + n^{(t+1)}
= E\{w_i^{(t+1)}\} + n^{(t+1)}
= w^{(t+1)} + n^{(t+1)} \quad (76)
\]
and
\[
E\{\|\tilde{v}^{(t+1)}\|^2\}
= E\left\{\Big\|\sum_{i=1}^{K} p_i w_i^{(t+1)}\Big\|^2\right\}
+ E\{\|n^{(t+1)}\|^2\}
+ 2 E\left\{\Big(\sum_{i=1}^{K} p_i w_i^{(t+1)}\Big)^{\top} n^{(t+1)}\right\}. \quad (77)
\]
Note that we set p_i = D_i / \sum_{i=1}^{K} D_i = 1/K in the K-client random scheduling in order to obtain a small sensitivity s_D. We have
\[
E\left\{\Big\|\sum_{i=1}^{K} p_i w_i^{(t+1)}\Big\|^2\right\}
\le \frac{1}{K^2}\sum_{i=1}^{K}\|w_i^{(t+1)}\|^2 + \frac{K-1}{K}\|w^{(t+1)}\|^2 \quad (78)
\]
and
\[
E\{\|\tilde{v}^{(t+1)}\|^2\}
\le \frac{1}{K^2}\sum_{i=1}^{K}\|w_i^{(t+1)}\|^2 + \frac{K-1}{K}\|w^{(t+1)}\|^2
+ \|n^{(t+1)}\|^2 + 2\,[w^{(t+1)}]^{\top} n^{(t+1)}. \quad (79)
\]
Combining (75) and (79), we can obtain
\[
E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|^2\}
\le \frac{1}{K^2}\sum_{i=1}^{K}\|w_i^{(t+1)} - v^{(t)}\|^2 + \|n^{(t+1)}\|^2. \quad (80)
\]
Using (41), we know
\[
E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|^2\}
\le \|n^{(t+1)}\|^2 + \frac{B^2(1+\theta)^2}{K\mu^2}\|\nabla F(v^{(t)})\|^2. \quad (81)
\]
Moreover,
\[
E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|\}
\le \|n^{(t+1)}\| + \frac{B(1+\theta)}{\mu\sqrt{K}}\|\nabla F(v^{(t)})\|. \quad (82)
\]
Substituting (45), (73) and (82) into (72) and setting \theta = 0, we can obtain
\[
E\{F(\tilde{v}^{(t+1)})\} - F(v^{(t)})
\le F(w^{(t+1)}) - F(v^{(t)})
+ \left(\|\nabla F(v^{(t)})\| + 2\rho\|w^{(t+1)} - v^{(t)}\|\right) E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|\}
+ \rho E\{\|w^{(t+1)} - \tilde{v}^{(t+1)}\|^2\}
\le \alpha_2 \|\nabla F(v^{(t)})\|^2 + \alpha_1 \|n^{(t+1)}\| \|\nabla F(v^{(t)})\| + \alpha_0 \|n^{(t+1)}\|^2, \quad (83)
\]
where
\[
\alpha_2 = \frac{\rho B^2}{2\mu^2} + \frac{2\rho B^2}{\mu^2\sqrt{K}} + \frac{\rho B^2}{\mu^2 K} + \frac{B}{\mu\sqrt{K}} + \rho B - \mu, \quad (84)
\]
\[
\alpha_1 = 1 + \frac{2\rho B}{\mu} + \frac{2\rho B\sqrt{K}}{\mu N}
\quad \text{and} \quad
\alpha_0 = \frac{2\rho\sqrt{K}}{N} + \rho. \quad (85)
\]

In this case, we take the expectation E\{F(v^{(t+1)}) - F(v^{(t)})\} as follows,
\[
E\{F(v^{(t+1)}) - F(v^{(t)})\}
\le \alpha_2 \|\nabla F(v^{(t)})\|^2
+ \alpha_1 E\{\|n^{(t+1)}\|\} \|\nabla F(v^{(t)})\|
+ \alpha_0 E\{\|n^{(t+1)}\|^2\}. \quad (86)
\]
For \Theta > 0 and F(v^{(0)}) - F(w^*) = \Theta, we can obtain
\[
E\{F(v^{(t+1)}) - F(w^*)\}
\le E\{F(v^{(t)}) - F(w^*)\}
+ \alpha_2 \|\nabla F(v^{(t)})\|^2
+ \alpha_1 \beta E\{\|n^{(t+1)}\|\}
+ \alpha_0 E\{\|n^{(t+1)}\|^2\}. \quad (87)
\]
If we select the penalty parameter \mu to make \alpha_2 < 0, then using (48), we know
\[
E\{F(v^{(t+1)}) - F(w^*)\}
\le (1 + 2l\alpha_2) E\{F(v^{(t)}) - F(w^*)\}
+ \alpha_1 \beta E\{\|n^{(t+1)}\|\} + \alpha_0 E\{\|n^{(t+1)}\|^2\}. \quad (88)
\]
Considering the independence of the additive noise terms and applying (88) recursively, we have
\[
E\{F(v^{(T)}) - F(w^*)\}
\le (1 + 2l\alpha_2)^T E\{F(v^{(0)}) - F(w^*)\}
+ \left(\alpha_1 \beta E\{\|n\|\} + \alpha_0 E\{\|n\|^2\}\right)\frac{1 - (1 + 2l\alpha_2)^T}{-2l\alpha_2}
= Q^T \Theta + \left(\alpha_1 \beta E\{\|n\|\} + \alpha_0 E\{\|n\|^2\}\right)\frac{1 - Q^T}{1 - Q}, \quad (89)
\]
where Q = 1 + 2l\alpha_2. Substituting (65) into (89), we can obtain
\[
E\{\|n\|\} = \frac{s_D T c}{b\epsilon}\sqrt{\frac{2N}{\pi}},
\qquad
E\{\|n\|^2\} = \frac{s_D^2 T^2 c^2 N}{b^2 \epsilon^2} \quad (90)
\]
and
\[
E\{F(v^{(T)}) - F(w^*)\}
\le Q^T \Theta + \frac{1 - Q^T}{1 - Q}
\left(\frac{cC\alpha_1\beta\sqrt{2/\pi}}{-mK\ln\!\left(1 - \frac{N}{K} + \frac{N}{K} e^{-\epsilon/T}\right)}
+ \frac{c^2 C^2 \alpha_0}{m^2 K^2 \ln^2\!\left(1 - \frac{N}{K} + \frac{N}{K} e^{-\epsilon/T}\right)}\right). \quad (91)
\]
This completes the proof.

Kang Wei (Graduate Student Member, IEEE) received the B.Sc. degree in information engineering from Xidian University, Xi'an, China, in 2014, and the M.Sc. degree from the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China, in 2018, where he is currently pursuing the Ph.D. degree. His current research interests include data privacy and security, differential privacy, AI and machine learning, information theory, and channel coding theory in NAND flash memory.

Jun Li (Senior Member, IEEE) received the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 2009. From January 2009 to June 2009, he worked as a Research Scientist at the Department of Research and Innovation, Alcatel-Lucent Shanghai Bell Company, Ltd. From June 2009 to April 2012, he was a Post-Doctoral Fellow with the School of Electrical Engineering and Telecommunications, University of New South Wales, Australia. From April 2012 to June 2015, he was a Research Fellow with the School of Electrical Engineering, The University of Sydney, Australia. Since June 2015, he has been a Professor with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. He was a Visiting Professor with Princeton University from 2018 to 2019. His research interests include network information theory, game theory, distributed intelligence, multiple agent reinforcement learning, and their applications in ultradense wireless networks, mobile edge computing, network privacy and security, and the industrial Internet of Things. He has coauthored more than 200 articles in IEEE journals and conferences and holds one U.S. patent and more than ten Chinese patents in these areas. He was a TPC member for several flagship IEEE conferences. He was recognized as an Exemplary Reviewer of the IEEE TRANSACTIONS ON COMMUNICATIONS in 2018 and received the Best Paper Award from the IEEE International Conference on 5G for Future Wireless Networks in 2017. He has served as an Editor for the IEEE COMMUNICATIONS LETTERS.

Ming Ding (Senior Member, IEEE) received the B.S. and M.S. degrees (Hons.) in electronics engineering and the Ph.D. degree in signal and information processing from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2004, 2007, and 2011, respectively. From April 2007 to September 2014, he worked as a Researcher/Senior Researcher/Principal Researcher at the Sharp Laboratories of China, Shanghai. He also served as the Algorithm Design Director and the Programming Director for a system-level simulator of future telecommunication networks at the Sharp Laboratories of China for more than seven years. He is currently a Senior Research Scientist with CSIRO Data61, Sydney, NSW, Australia. His research interests include information technology, data privacy and security, machine learning, and AI. He has authored over 100 articles in IEEE journals and conferences, all in recognized venues, around 20 3GPP standardization contributions, and the Springer book Multi-Point Cooperative Communication Systems: Theory and Applications. He holds 21 U.S. patents and co-invented more than 100 other patents on 4G/5G technologies in CN, JP, KR, and EU. He is an Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS and the IEEE Wireless Communications Letters. Besides, he is or has been a Guest Editor/Co-Chair/Co-Tutor/TPC Member of several IEEE top-tier journals/conferences, such as the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, the IEEE Communications Magazine, and the IEEE GLOBECOM Workshops. He was the Lead Speaker of the industrial presentation on unmanned aerial vehicles at IEEE GLOBECOM 2017, which was recognized as the Most Attended Industry Program of the conference. He was recognized as an Exemplary Reviewer of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS in 2017.

Chuan Ma received the B.S. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2013, and the Ph.D. degree from The University of Sydney, Australia, in 2018. He is currently working as a Lecturer at the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. He has published more than ten journal articles and conference papers, including the Best Paper at WCNC 2018. His research interests include stochastic geometry, wireless caching networks, and machine learning; he now focuses on big data analysis and privacy preservation.

Howard H. Yang (Member, IEEE) received the B.Sc. degree in communication engineering from the Harbin Institute of Technology (HIT), China, in 2012, the M.Sc. degree in electronic engineering from The Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2013, and the Ph.D. degree in electronic engineering from the Singapore University of Technology and Design (SUTD), Singapore, in 2017. From August 2015 to March 2016, he was a Visiting Student with WNCG under the supervision of Prof. J. G. Andrews at The University of Texas at Austin. He is currently a Post-Doctoral Research Fellow with the Wireless Networks and Decision Systems (WNDS) Group, Singapore University of Technology and Design, led by Prof. T. Q. S. Quek. He held a visiting research appointment at Princeton University from September 2018 to April 2019. His research interests cover various aspects of wireless communications, networking, and signal processing, currently focusing on the modeling of modern wireless networks, high-dimensional statistics, graph signal processing, and machine learning. He received the IEEE WCSP 10-Year Anniversary Excellent Paper Award in 2019 and the IEEE WCSP Best Paper Award in 2014.


Farhad Farokhi (Senior Member, IEEE) received the Ph.D. degree from the KTH Royal Institute of Technology in 2014. He is currently a Lecturer (Assistant Professor) with the Department of Electrical and Electronic Engineering, The University of Melbourne. Prior to that, he was a Research Scientist with the Information Security and Privacy Group, CSIRO's Data61, a Research Fellow at The University of Melbourne, and a Post-Doctoral Fellow with the KTH Royal Institute of Technology. During his Ph.D. studies, he was a Visiting Researcher with the University of California at Berkeley and the University of Illinois at Urbana-Champaign. He was a recipient of the VESKI Victoria Fellowship from the Victorian State Government, the McKenzie Fellowship, and the 2015 Early Career Researcher Award from The University of Melbourne. He was a Finalist for the 2014 European Embedded Control Institute (EECI) Ph.D. Award. He has been part of numerous projects on data privacy and cybersecurity funded by the Defence Science and Technology Group (DSTG), the Department of the Prime Minister and Cabinet (PMC), the Department of Environment and Energy (DEE), and CSIRO, Australia.

Shi Jin (Senior Member, IEEE) received the B.S. degree in communications engineering from the Guilin University of Electronic Technology, Guilin, China, in 1996, the M.S. degree from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2003, and the Ph.D. degree in information and communications engineering from Southeast University, Nanjing, in 2007. From June 2007 to October 2009, he was a Research Fellow with the Adastral Park Research Campus, University College London, London, U.K. He is currently with the Faculty of the National Mobile Communications Research Laboratory, Southeast University. His research interests include space-time wireless communications, random matrix theory, and information theory. He and his coauthors received the 2011 IEEE Communications Society Stephen O. Rice Prize Paper Award in the field of communication theory and the 2010 Young Author Best Paper Award from the IEEE Signal Processing Society. He serves as an Associate Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, the IEEE COMMUNICATIONS LETTERS, and IET Communications.

Tony Q. S. Quek (Fellow, IEEE) received the B.E. and M.E. degrees in electrical and electronics engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1998 and 2000, respectively, and the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2008. He is currently the Cheng Tsang Man Chair Professor with the Singapore University of Technology and Design (SUTD). He also serves as the Head of the ISTD Pillar, the Sector Lead of the SUTD AI Program, and the Deputy Director of SUTD-ZJU IDEA. His current research topics include wireless communications and networking, network intelligence, the Internet of Things, URLLC, and big data processing. Dr. Quek has been actively involved in organizing and chairing sessions and has served as a Technical Program Committee member and symposium chair for a number of international conferences. He was an Executive Editorial Committee Member of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS. He was honored with the 2008 Philip Yeo Prize for Outstanding Achievement in Research, the 2012 IEEE William R. Bennett Prize, the 2015 SUTD Outstanding Education Award (Excellence in Research), the 2016 IEEE Signal Processing Society Young Author Best Paper Award, the 2017 CTTC Early Achievement Award, and the 2017 IEEE ComSoc AP Outstanding Paper Award, and he was named a Clarivate Analytics Highly Cited Researcher from 2016 to 2019. He is a Distinguished Lecturer of the IEEE Communications Society. He is serving as an Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, the Chair of the IEEE VTS Technical Committee on Deep Learning for Wireless Communications, and an Elected Member of the IEEE Signal Processing Society SPCOM Technical Committee. He was an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and the IEEE WIRELESS COMMUNICATIONS LETTERS.

H. Vincent Poor (Life Fellow, IEEE) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 to 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton, where he is currently the Michael Henry Strater University Professor of Electrical Engineering. From 2006 to 2016, he was the Dean of Princeton's School of Engineering and Applied Science. He has also held visiting appointments at several other institutions, including most recently at Berkeley and Cambridge. His research interests are in the areas of information theory, signal processing, and machine learning, and their applications in wireless networks, energy systems, and related fields. Among his publications in these areas is the forthcoming book Advanced Data Analytics for Power Systems (Cambridge University Press, 2020). Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, and a foreign member of the Chinese Academy of Sciences, the Royal Society, and other national and international academies. He received the Technical Achievement and Society Awards of the IEEE Signal Processing Society in 2007 and 2011, respectively. Recent recognition of his work includes the 2017 IEEE Alexander Graham Bell Medal, the 2019 ASEE Benjamin Garver Lamme Award, a D.Sc. honoris causa from Syracuse University, awarded in 2017, and a D.Eng. honoris causa from the University of Waterloo, awarded in 2019.
