A Hybrid Approach To Privacy-Preserving Federated Learning
Yi Zhou
[email protected]
IBM Research Almaden
San Jose, California
ABSTRACT
Federated learning facilitates the collaborative training of models without the sharing of raw data. However, recent attacks demonstrate that simply maintaining data locality during training processes does not provide sufficient privacy guarantees. Rather, we need a federated learning system capable of preventing inference over both the messages exchanged during training and the final trained model while ensuring the resulting model also has acceptable predictive accuracy. Existing federated learning approaches either use secure multiparty computation (SMC), which is vulnerable to inference, or differential privacy, which can lead to low accuracy given a large number of parties with relatively small amounts of data each. In this paper, we present an alternative approach that utilizes both differential privacy and SMC to balance these trade-offs. Combining differential privacy with secure multiparty computation enables us to reduce the growth of noise injection as the number of parties increases without sacrificing privacy while maintaining a pre-defined rate of trust. Our system is therefore a scalable approach that protects against inference threats and produces models with high accuracy. Additionally, our system can be used to train a variety of machine learning models, which we validate with experimental results on 3 different machine learning algorithms. Our experiments demonstrate that our approach out-performs state-of-the-art solutions.

CCS CONCEPTS
• Security and privacy → Privacy-preserving protocols; Trust frameworks; • Computing methodologies → Learning settings.

KEYWORDS
Privacy, Federated Learning, Privacy-Preserving Machine Learning, Differential Privacy, Secure Multiparty Computation

ACM Reference Format:
Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. 2019. A Hybrid Approach to Privacy-Preserving Federated Learning. In London '19: ACM Workshop on Artificial Intelligence and Security, November 15, 2019, London, UK. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/1122445.1122456

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
London '19, November 15, 2019, London, UK
© 2019 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM . . . $15.00
https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
In traditional machine learning (ML) environments, training data is centrally held by one organization executing the learning algorithm. Distributed learning systems extend this approach by using a set of learning nodes accessing shared data or having the data sent to the participating nodes from a central node, all of which are fully trusted. For example, MLlib from Apache Spark assumes a trusted central node to coordinate distributed learning processes [28]. Another approach is the parameter server [26], which again requires a fully trusted central node to collect and aggregate parameters from the many nodes learning on their different datasets.

However, some learning scenarios must address less open trust boundaries, particularly when multiple organizations are involved. While a larger dataset improves the performance of a trained model, organizations often cannot share data due to legal restrictions or competition between participants. For example, consider three hospitals with different owners serving the same city. Rather than each hospital creating their own predictive model forecasting cancer risks for their patients, the hospitals want to create a model learned over the whole patient population. However, privacy laws prohibit them from sharing their patients' data. Similarly, a service provider may collect usage data both in Europe and the United States. Due to legislative restrictions, the service provider's data cannot be stored in one central location. When creating a predictive model forecasting service usage, however, all datasets should be used.

The area of federated learning (FL) addresses these more restrictive environments wherein data holders collaborate throughout the learning process rather than relying on a trusted third party to hold
data [6, 39]. Data holders in FL run a machine learning algorithm locally and only exchange model parameters, which are aggregated and redistributed by one or more central entities. However, this approach is not sufficient to provide reasonable data privacy guarantees. We must also consider that information can be inferred from the learning process [30] and that information can be traced back to its source in the resulting trained model [40].

Some previous work has proposed a trusted aggregator as a way to control privacy exposure [1, 32]. FL schemes using local differential privacy also address the privacy problem [39] but entail adding too much noise to the model parameter data from each node, often yielding poor performance of the resulting model.

We propose a novel federated learning system which provides formal privacy guarantees, accounts for various trust scenarios, and produces models with increased accuracy when compared with existing privacy-preserving approaches. Data never leaves the participants, and privacy is guaranteed using secure multiparty computation (SMC) and differential privacy. We account for potential inference from individual participants as well as the risk of collusion amongst the participating parties through a customizable trust threshold. Our contributions are the following:

• We propose and implement an FL system providing formal privacy guarantees and models with improved accuracy compared to existing approaches.
• We include a tunable trust parameter which accounts for various trust scenarios while maintaining the improved accuracy and formal privacy guarantees.
• We demonstrate that it is possible to use the proposed approach to train a variety of ML models through the experimental evaluation of our system with three significantly different ML models: decision trees, convolutional neural networks, and linear support vector machines.
• We include the first federated approach for the private and accurate training of a neural network model.

The rest of this paper is organized as follows. We outline the building blocks in our system. We then discuss the various privacy considerations in FL systems, followed by outlining our threat model and general system. We then provide experimental evaluation and discussion of the system implementation process. Finally, we give an overview of related work and some concluding remarks.

2 PRELIMINARIES
In this section we introduce the building blocks of our approach and explain how various approaches fail to protect data privacy in FL.

2.1 Differential Privacy
Differential privacy (DP) is a rigorous mathematical framework wherein an algorithm may be described as differentially private if and only if the inclusion of a single instance in the training dataset causes only statistically insignificant changes to the algorithm's output. For example, consider private medical information from a particular hospital. The authors in [40] have shown that with access to only a trained ML model, attackers can infer whether or not an individual was a patient at the hospital, violating their right to privacy. DP puts a theoretical limit on the influence of a single individual, thus limiting an attacker's ability to infer such membership. The formal definition of DP is [13]:

Definition 1 (Differential Privacy). A randomized mechanism K provides (ϵ, δ)-differential privacy if for any two neighboring databases D_1 and D_2 that differ in only a single entry, and for all S ⊆ Range(K),

    Pr(K(D_1) ∈ S) ≤ e^ϵ Pr(K(D_2) ∈ S) + δ.    (1)

If δ = 0, K is said to satisfy ϵ-differential privacy.

To achieve DP, noise is added to the algorithm's output. This noise is proportional to the sensitivity of the output, where sensitivity measures the maximum change of the output due to the inclusion of a single data instance.

Two popular mechanisms for achieving DP are the Laplace and Gaussian mechanisms. The Gaussian mechanism is defined by

    M(D) ≜ f(D) + N(0, S_f^2 σ^2),    (2)

where N(0, S_f^2 σ^2) is the normal distribution with mean 0 and standard deviation S_f σ. A single application of the Gaussian mechanism to a function f of sensitivity S_f satisfies (ϵ, δ)-differential privacy if δ ≥ (4/5) exp(−(σϵ)^2/2) and ϵ < 1 [16].

To achieve ϵ-differential privacy, the Laplace mechanism may be used in the same manner by substituting N(0, S_f^2 σ^2) with random variables drawn from Lap(S_f/ϵ) [16].

When an algorithm requires multiple additive noise mechanisms, the evaluation of the privacy guarantee follows from the basic composition theorem [14, 15] or from advanced composition theorems and their extensions [7, 17, 18, 23].
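To make the two mechanisms concrete, the following is a minimal numpy sketch of adding calibrated noise to a scalar query result. The helper names and example values are our own illustration, not part of the paper.

    import numpy as np

    def gaussian_mechanism(value, sensitivity, sigma, rng=None):
        # Eq. (2): release f(D) + N(0, S_f^2 * sigma^2), i.e. Gaussian noise
        # with standard deviation S_f * sigma.
        rng = rng or np.random.default_rng()
        return value + rng.normal(0.0, sensitivity * sigma)

    def laplace_mechanism(value, sensitivity, epsilon, rng=None):
        # Laplace mechanism: noise drawn from Lap(S_f / epsilon) yields
        # epsilon-differential privacy for a query of sensitivity S_f.
        rng = rng or np.random.default_rng()
        return value + rng.laplace(0.0, sensitivity / epsilon)

    # Example: a counting query has sensitivity 1, so Lap(1/epsilon) noise
    # suffices for epsilon-DP (the specific numbers are illustrative).
    noisy_count = laplace_mechanism(1250, sensitivity=1.0, epsilon=0.5)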
2.2 Threshold Homomorphic Encryption
An additively homomorphic encryption scheme is one wherein the following property is guaranteed:

    Enc(m_1) ∘ Enc(m_2) = Enc(m_1 + m_2),

for some predefined function ∘. Such schemes are popular in privacy-preserving data analytics as untrusted parties can perform operations on encrypted values.

One such additive homomorphic scheme is the Paillier cryptosystem [31], a probabilistic encryption scheme based on computations in the group Z*_{n^2}, where n is an RSA modulus. In [11] the authors extend this encryption scheme and propose a threshold variant. In the threshold variant, a set of participants is able to share the secret key such that no subset of the parties smaller than a pre-defined threshold is able to decrypt values.
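The additive property itself can be exercised with the open-source python-paillier (phe) package. Note this is only our sketch of the homomorphic addition the system relies on: phe implements the plain (non-threshold) Paillier scheme, so the threshold key sharing of [11] is assumed to be provided separately.

    from phe import paillier  # pip install phe (python-paillier)

    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    # Two parties encrypt their local values under the shared public key.
    c1 = public_key.encrypt(3)
    c2 = public_key.encrypt(5)

    # Anyone (e.g., an untrusted aggregator) can add the ciphertexts without
    # learning the plaintexts; decryption reveals only the sum.
    aggregate = c1 + c2
    assert private_key.decrypt(aggregate) == 8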
2.3 Privacy in Federated Learning
In centralized learning environments a single party P using a dataset D executes some learning algorithm f_M resulting in a model M, where f_M(D) = M. In this case P has access to the complete dataset D. By contrast, in a federated learning environment, multiple parties P_1, P_2, ..., P_n each have their own dataset D_1, D_2, ..., D_n, respectively. The goal is then to learn a model using all of the datasets. We must consider two potential threats to data privacy in such an FL environment: (1) inference during the learning process and (2) inference over the outputs. Inference during the learning process refers to any participant in the federation inferring information about another participant's private dataset given the data exchanged during the execution of f_M. Inference over the outputs refers to the leakage of any participants' data from intermediate outputs as well as M.

We consider two types of inference attacks: insider and outsider. Insider attacks include those launched by participants in the FL system, including both data holders as well as any third parties, while outsider attacks include those launched both by eavesdroppers to the communication between participants and by users of the final predictive model when deployed as a service.

2.3.1 Inference during the learning process. Let us consider f_M as the combination of computational operations and a set of queries Q_1, Q_2, ..., Q_k. That is, for each step s in f_M requiring knowledge of the parties' data there is a query Q_s. In the execution of f_M each party P_i must respond to each such query Q_s with appropriate information on D_i. The types of queries are highly dependent on f_M. For example, to build a decision tree, a query may request the number of instances in D_i matching a certain criteria. In contrast, to train an SVM or neural network a query would request model parameters after a certain number of training iterations. Any privacy-preserving FL system must account for the risk of inference over the responses to these queries.

Privacy-preserving ML approaches addressing this risk often do so by using secure multiparty computation (SMC). Generally, SMC protocols allow n parties to obtain the output of a function over their n inputs while preventing knowledge of anything other than this output [20]. Unfortunately, approaches exclusively using secure multiparty computation remain vulnerable to inference over the output. As the function output remains unchanged from function execution without privacy, the output can reveal information about individual inputs. Therefore, we must also consider potential inference over outputs.

2.3.2 Inference over the outputs. This refers to intermediate outputs available to participants as well as the predictive model. Recent work shows that given only black-box access to the model through an ML-as-a-service API, an attacker can still make training data inferences [40]. An FL system should prevent such outsider attacks while also considering insiders. That is, participant P_i should not be able to infer information about D_j when i ≠ j, as shown in [30].

Solutions addressing privacy of output often make use of the DP framework discussed in Preliminaries. As a mechanism satisfying differential privacy guarantees that if an individual contained in a given dataset is removed, no outputs would become significantly more or less likely [13], a learning algorithm f_M which is theoretically proven to be ϵ-differentially private is guaranteed to have a certain privacy of output quantified by the ϵ privacy parameter.

In the federated learning setting it is important to note that the definition of neighboring databases is consistent with the usual DP definition – that is, privacy is provided at the individual record level, not the party level (which may represent many individuals).

3 AN END-TO-END APPROACH WITH TRUST

3.1 Threat Model
We propose a system wherein n data parties use an ML service for FL. We refer to this service as the aggregator. Our system is designed to withstand three potential adversaries: (1) the aggregator, (2) the data parties, and (3) outsiders.

3.1.1 Honest-But-Curious Aggregator. The honest-but-curious or semi-honest adversarial model is commonly used in the field of SMC since its introduction in [3] and application to data mining in [27]. Honest-but-curious adversaries follow the protocol instructions correctly but will try to learn additional information. Therefore, the aggregator will not vary from the predetermined ML algorithm but will attempt to infer private information using all data received throughout the protocol execution.

3.1.2 Colluding Parties. Our work also considers the threat of collusion among parties, including the aggregator, through the trust parameter t, which is the minimum number of non-colluding parties. Additionally, in contrast to the aggregator, we consider scenarios in which parties in P may deviate from the protocol execution to achieve additional information on data held by honest parties.

3.1.3 Outsiders. We also consider potential attacks from adversaries outside of the system. Our work ensures that any adversary monitoring communications during training cannot infer the private data of the participants. We also consider users of the final model as potential adversaries. A predictive model output from our system may therefore be deployed as a service, remaining resilient to inference against adversaries who may be users of the service.

We now detail the assumptions made in our system to more concretely formulate our threat model.

3.1.4 Communication. We assume secure channels between each party and the aggregator. This allows the aggregator to authenticate incoming messages and prevents an adversary, whether they be an outsider or malicious data party, from injecting their own responses.

3.1.5 System set up. We additionally make use of the threshold variant of the Paillier encryption scheme from [11], assuming secure key distribution. It is sufficient within our system to say that semantic security of encrypted communication is equivalent to the decisional composite residuosity assumption. For further discussion we direct the reader to [11]. Our use of the threshold variant of the Paillier system ensures that any set of n − t or fewer parties cannot decrypt ciphertexts. Within the context of our FL system, this ensures the privacy of individual messages sent to the aggregator.
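The per-party noise calibration that realizes this trust model appears later in Algorithms 4 and 6 (Gaussian noise of variance σ²/(t − 1) per party). As a rough illustration — our sketch, not the paper's pseudocode — each party can draw noise with standard deviation σ/√(t − 1), so that the noise contributed by the guaranteed honest parties still has variance at least σ², and the extreme setting t = 2 degenerates to local-DP-sized noise per party.

    import numpy as np

    def party_noise_std(sigma, t):
        # Each party draws noise from N(0, sigma^2 / (t - 1)); with at least
        # t - 1 other honest parties contributing, the aggregated noise has
        # variance >= sigma^2, matching the central Gaussian mechanism.
        return sigma / np.sqrt(t - 1)

    sigma = 8.0
    for t in (2, 10, 50):   # t = 2 is the "no trust" extreme discussed in Section 4
        print(t, party_noise_std(sigma, t))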
differential privacy guarantees that if an individual contained in a
3.2 Proposed Approach
given dataset is removed, no outputs would become significantly
more or less likely [13], a learning algorithm f M which is theoreti- We propose an FL system that addresses risk of inference during
cally proven to be ϵ-differentially private is guaranteed to have a the learning process, risk of inference over the outputs, and trust.
certain privacy of output quantified by the ϵ privacy parameter. We combine methods from SMC and DP to develop protocols that
In the federated learning setting it is important to note that the guarantee privacy without sacrificing accuracy.
definition of neighboring databases is consistent with the usual We consider the following scenario. There exists a set of n parties
DP definition – that is, privacy is provided at the individual record P = P1 , P2 , ..., Pn , a set of disjoint datasets D 1 , D 2 , ..., D n belonging
level, not the party level (which may represent many individuals). to the respective parties and adhering to the same structure, and
an aggregator A. Our system takes as additional input three pa-
3 AN END-TO-END APPROACH WITH TRUST rameters: f M , ϵ, and t. f M specifies the training algorithm, ϵ is the
privacy guarantee against inference, and t specifies the minimum
3.1 Threat Model number of honest, non-colluding parties.
We propose a system wherein n data parties use an ML service for The aggregator A runs the learning algorithm f M consisting
FL. We refer to this service as the aggregator. Our system is designed of k or fewer linear queries Q 1 , Q 2 , ..., Q k , each requiring infor-
to withstand three potential adversaries: (1) the aggregator, (2) the mation from the n datasets. This information may include model
data parties, and (3) outsiders. parameters after some local learning on each dataset or may be
Figure 2: Effect of privacy budgets on the overall F1-score for Decision Trees. [Plot: F1-score vs. privacy budget ϵ]

Figure 3: Effect of increasing number of parties on the overall F1-score for Decision Trees. [Plot: F1-score vs. number of data parties]

counts (done at the leaf nodes) or evaluating attributes (done at internal nodes). For internal nodes, each feature is evaluated for potential splitting against the same dataset. The budget allocated to evaluating attributes must therefore be divided amongst each feature (ϵ_2). In all experiments the max depth is set to d = |F|/2.

Dataset. We conduct a number of experiments using the Nursery dataset from the UCI Machine Learning Repository [12]. This dataset contains 8 categorical attributes about 12,960 nursery school applications. The target attribute has five distinct classes with the following distribution: 33.333%, 0.015%, 2.531%, 32.917%, 31.204%.

Comparison Methods. To put model performance into context, we compare with two different random baselines and two current FL approaches. Random baselines enable us to characterize when a particular approach is no longer learning meaningful information while the FL approaches visualize relative performance cost.

(1) Uniform Guess. In this approach, class predictions are randomly sampled with a 1/|C| chance for each class.
(2) Random Guess. Random Guess improves upon Uniform Guess with consideration of class value distribution in the training data. At test time, each prediction is sampled from the set of training class labels.
(3) Local DP. In the local approach, parties add noise to protect the privacy of their own data in isolation.

in degraded performance as the budget decreases, which is expected. It is clear that our approach maintains improved performance over the local DP approach for all budgets (until both approaches converge to the random guessing baseline). Particularly as the budget decreases from 1.0 to 0.4 we see our approach maintaining better resilience to the decrease in the privacy budget.

Number of Parties. Another important consideration for FL systems is the ability to maintain accuracy in highly distributed scenarios, that is, when many parties, each with a small amount of data, such as in an IoT scenario, are contributing to the learning. In Figures 3 and 4 we show the impact that |P| has on performance. The results are for a fixed overall privacy budget of 0.5 and assume no collusion. For each experiment, the overall dataset was divided into |P| equal-sized partitions.

The results in Figure 3 demonstrate the viability of our system for FL in highly distributed environments while highlighting the shortcomings of the local DP approach. As |P| increases, the noise in the local DP approach increases proportionally while our approach maintains consistent accuracy. We can see that with as few as 25 parties, the local DP results begin to approach the baseline and even dip below random guessing by 100 participants.

[Figure 4, plot residue: training time (seconds), unencrypted vs. encrypted]

Figure 5: Query Epsilons in Decision Tree Training with Varying Rate of Trust (50 parties). Epsilon 1 is defined as the privacy budget for count queries while Epsilon 2 is used for class counts. [Plot: query ϵ (log scale) vs. untrusted parties (%)]

Figure 6: Convolutional Neural Network Training with MNIST Data (10 parties and σ = 8, (ϵ, δ) = (0.5, 10^−5)). [Plot: F1 score vs. epoch]

record level, the trust model for adversarial knowledge is considered within the context of the entire system. The trust parameter therefore represents the degree of adversarial knowledge by capturing the maximum number of colluding parties which the system may tolerate. Figure 5 demonstrates how the ϵ values used for both count and distribution queries in private, federated DT learning are impacted by the trust parameter setting when |P| = 50.

In the worst case scenario, where a party P_i ∈ P assumes that all other P_j ∈ P, i ≠ j, are colluding, our approach converges with existing local DP approaches. In all other scenarios the query ϵ values will be increased in our system, leading to more accurate outcomes. Additionally, we believe the aforementioned scenario of no trust is unlikely to exist in real world instances. Let us consider smart phone users as an IoT example. Collusion of all but one party is impractical not only due to scale but also since such a system is likely to be running without many users even knowing. Additionally, on a smaller scale, if there is a set of five parties in the system and one party is concerned that the other four are all colluding, there is no reason for the honest party to continue to participate. We therefore believe that realistic scenarios of FL will see accuracy gains when deploying our system.

4.2 Convolutional Neural Networks
We additionally demonstrate how to use our method to train a distributed differentially private CNN. In our approach, similarly to centrally trained CNNs, each party is sent a model with the same initial structure and randomly initialized parameters. Each party will then conduct one full epoch of learning locally. At the conclusion of each batch, Gaussian noise is introduced according to the norm clipping value c and the privacy parameter σ. Norm clipping allows us to put a bound on the sensitivity of the gradient update. We use the same privacy strategy used in the centralized training approach presented in [1]. Once an entire epoch, or 1/b batches where b is the batch rate, has completed, the final parameters are sent back to A. A then averages the parameters and sends back an updated model for another epoch of learning. After a pre-determined number of epochs E, the final model M is output. This process for the aggregator and data parties is specifically detailed as algorithmic pseudocode in Section 5.2.

Within our private, federated NN learning system, if σ = √(2 · log(1.25/δ))/ϵ then by [16] our approach is (ϵ, δ)-differentially private with respect to each randomly sampled batch. Using the moments accountant in [1], our approach is (O(bϵ√(E/b)), δ)-DP overall.

Dataset and Model Structure. For our CNN experiments we use the publicly available MNIST dataset. This includes 60,000 training instances of handwritten digits and 10,000 testing instances. Each example is a 28x28 grey-scale image of a digit between 0 and 9 [24]. We use a model structure similar to that in [1]. Our model is a feedforward neural network with 2 internal layers of ReLU units and a softmax layer of 10 classes with cross-entropy loss. The first layer contains 60 units and the second layer contains 1000. We set the norm clipping to 4.0, the learning rate to 0.1 and the batch rate to 0.01. We use Keras with a Tensorflow backend.
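For reference, the described architecture corresponds roughly to the following Keras definition. This is our sketch of the stated structure (layer sizes, softmax output, cross-entropy loss, learning rate), not the authors' released code; the plain SGD optimizer is an assumption, and the per-batch clipping and noise of Section 5.2 live in the training loop rather than in the model definition.

    import tensorflow as tf

    def build_model():
        # Feedforward network described in the text: two internal ReLU layers
        # (60 and 1000 units) and a 10-class softmax output for MNIST digits.
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28x28 grey-scale images
            tf.keras.layers.Dense(60, activation="relu"),
            tf.keras.layers.Dense(1000, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # learning rate from the text
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        return model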
Comparison Methods. To the best of our knowledge, this paper presents the first approach to accurately train a neural network in a private federated fashion without reliance on any public or non-protected data. We therefore compare our approach with the following baselines:

(1) Central Data Holder, No Privacy. In this approach all the data is centrally held by one party and no privacy is considered in the learning process.
(2) Central Data Holder, With Privacy. While all the data is still centrally held by one entity, this data holder now conducts privacy-preserving learning. This is representative of the scenario in [1].
(3) Distributed, No Privacy. In this approach the data is distributed to multiple parties, but the parties do not add noise during the learning process.
(4) Local DP. Parties add noise to protect the privacy of their own data in isolation, adapting from [1] and [39].

Figure 6 shows results with 10 parties conducting 100 epochs of training with the privacy parameter σ set to 8.0, the "large noise" setting in [1]. Note that Central Data Holder, No Privacy and Distributed, No Privacy achieve similar results and thus overlap. Our model is able to achieve an F1-score in this setting of 0.9. While this is lower than the central data holder setting, where an F1-score of approximately 0.95 is achieved, our approach again significantly out-performs the local approach, which only reaches 0.723. Additionally, we see a drop-off in the performance of the local approach early on as updates become overwhelmed by noise.

We additionally experiment with σ = 4 and σ = 2 as was done in [1]. When σ = 4 ((ϵ, δ) = (2, 10^−5)) the central data holder with privacy is able to reach an F1 score of 0.960, the local approach reaches 0.864, and our approach results in an F1-score of 0.957.
Algorithm 2 Private Decision Tree Learning
Input: Set of data parties P; minimum number of honest, non-colluding parties t; privacy guarantee ϵ; attribute set F; class attribute C; max tree depth d; public key pk
  t̄ = n − t + 1
  ϵ_1 = ϵ / (2(d + 1))
  Define current splits, S = ∅, for root node
  M = BuildTree(S, P, t, ϵ_1, F, C, d, pk)
  return M
procedure BuildTree(S, P, t, ϵ_1, F, C, d, pk)
  f = max_{F ∈ F} |F|
  Asynchronously query P: counts(S, ϵ_1, t)
  N = decrypted aggregate of noisy counts
  if F = ∅ or d = 0 or N/(f · |C|) < √2/ϵ_1 then
    Asynchronously query P: class_counts(S, ϵ_1, t)
    N_c = vector of decrypted, noisy class counts
    return node labeled with arg max_c N_c
  else
    ϵ_2 = ϵ_1 / (2|F|)
    for each F ∈ F do
      for each f_i ∈ F do
        Update set of split values to send to child node: S_i = S + {F = f_i}
        Asynchronously query P: counts(S_i, ϵ_2, t) and class_counts(S_i, ϵ_2, t)
        N′_{F,i} = aggregate of counts
        N′_{F,i,c} = element-wise aggregate of class_counts
        Recover N_{F,i} from t̄ partial decryptions of N′_{F,i}
        Recover N_{F,i,c} from t̄ partial decryptions of N′_{F,i,c}
      end for
      V_F = Σ_{i=1}^{|F|} Σ_{c=1}^{|C|} N_{F,i,c} · log(N_{F,i,c} / N_{F,i})
    end for
    F̄ = arg max_F V_F
    Create root node M with label F̄
    for each f_i ∈ F̄ do
      S_i = S + {F = f_i}
      M_i = BuildTree(S_i, P, t, ϵ_1, F \ F̄, C, d − 1, pk)
      Set M_i as child of M with edge f_i
    end for
    return M
  end if
end procedure

C4.5 [35] and C5.0 [36] tree training algorithms. Information gain for a candidate feature f quantifies the difference between the entropy of the current data and the weighted sum of the entropy values for each of the data subsets which would be generated if f were to be chosen as the splitting feature. Entropy for a dataset (or subset) D is computed via the following equation:

    Entropy(D) = − Σ_{i=1}^{|C|} p_i log_2 p_i    (3)

where C is the set of potential class values and p_i indicates the probability that a random instance in D is of class i. Therefore, the selection of the "best" feature on which to split can be chosen via determining class probabilities, which in turn may be computed via counts. Queries to the parties from the aggregator are therefore counts and class counts, known to have a sensitivity of 1.
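As an illustration of how the aggregator can turn recovered noisy counts into a splitting decision, the following is a small numpy sketch (ours, not the paper's code) of the V_F score from Algorithm 2. The example arrays are made up, and clipping small or negative noisy counts before taking logarithms is a detail the pseudocode leaves implicit.

    import numpy as np

    def feature_score(class_counts):
        # V_F = sum_i sum_c N_{i,c} * log(N_{i,c} / N_i) for one candidate feature.
        # class_counts has shape (num_feature_values, num_classes) and holds the
        # decrypted, noisy class counts N_{i,c}; N_i is the per-value total.
        counts = np.clip(np.asarray(class_counts, dtype=float), 1e-9, None)
        totals = counts.sum(axis=1, keepdims=True)   # N_i
        return float(np.sum(counts * np.log(counts / totals)))

    # Hypothetical noisy class counts for two candidate features.
    scores = {
        "feature_A": feature_score([[30.2, 4.8], [5.1, 29.7]]),
        "feature_B": feature_score([[17.9, 16.3], [18.4, 17.6]]),
    }
    best_feature = max(scores, key=scores.get)   # arg max_F V_F, as in Algorithm 2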
Given the ability to query parties for class counts, the aggregator may then control the iterative learning process. To ensure this process is differentially private according to a pre-defined privacy budget, we follow the approach from [19] to divide the budget for each iteration and set a fixed number of iterations rather than a purity test as a stopping condition. The algorithm will also stop if counts appear too small relative to the degree of noise to provide meaningful information. The resulting private algorithm deployed in our system is detailed in Algorithm 2.

5.2 Application to Private Neural Network Training
The process of deploying our system for neural network learning is distinct from the process outlined in the previous section for decision tree learning. In central neural network training, after a randomly initialized model of pre-defined structure is created, the following process is used: (1) the dataset D is shuffled and then equally divided into batches, (2) each batch is passed through the model iteratively, (3) a loss function L is used to compute the error of the model on each batch, (4) errors are then propagated back through the network where an optimizer such as Stochastic Gradient Descent (SGD) is used to update network weights before processing the next batch. Steps (1) through (4) constitute one epoch of learning and are repeated until the model converges (stops demonstrating improved performance).

In our system we equate one query to the data parties with one epoch of local learning. That is, each party conducts steps (1) through (4) for one iteration and then sends an updated model to the aggregator. The aggregator then averages the new model weights provided by each party. An updated model is then sent along with a new query for another epoch of learning to each party.

Algorithm 3 Private CNN Learning: Aggregator
Input: Set of data parties P; minimum number of honest, non-colluding parties t; noise parameter σ; learning rate η; sampling probability b; loss function L; clipping value c; number of epochs E; public key pk
  t̄ = n − t + 1
  Initialize model M with random weights θ
  for each e ∈ [E] do
    Asynchronously query P: train_epoch(M, η, b, L, c, σ, t)
    θ_e = decrypted aggregate, noisy parameters from P
    M ← θ_e
  end for
  return M

Each epoch receives the noise parameter σ, and the cost to the overall privacy budget is determined through a separate privacy accountant utility. Just as the decision tree stopping condition was replaced with a pre-set depth, the neural network stopping condition of convergence is replaced with a pre-defined number of epochs E. This process from the aggregator perspective is outlined in Algorithm 3.

At each data party we deploy code to support the process detailed in Algorithm 4. To conduct a complete epoch of learning we follow the approach proposed in [1] for private centralized neural network learning. This requires a number of changes to the traditional learning approach. Rather than shuffling the dataset into equal sized batches, a batch is randomly sampled for processing with sampling probability b. An epoch then becomes defined as the number of batch iterations required to process |D_i| instances. Additionally, parameter updates determined through the loss function L are clipped to define the sensitivity of the neural network learning to individual training instances. Noise is then added to the weight updates. Once an entire epoch is completed, the updated weights can be sent back to the aggregator.
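Putting the two sides of Section 5.2 together, one federated round can be sketched as follows. This is a plaintext illustration under our own naming; in the deployed system the party responses are Paillier ciphertexts that are summed homomorphically, with only the aggregate decrypted from t̄ partial decryptions.

    import numpy as np

    def federated_round(global_weights, party_train_epoch_fns):
        # One aggregator round in the spirit of Algorithm 3: send the current
        # model to every party, collect their noisy locally trained weights,
        # and average them to form the next global model.
        updates = [fn(global_weights.copy()) for fn in party_train_epoch_fns]
        return np.mean(updates, axis=0)

    # Toy usage with two "parties" that simply nudge the weights.
    parties = [lambda w: w + 0.1, lambda w: w - 0.05]
    theta = federated_round(np.zeros(4), parties)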
Algorithm 4 Private CNN Learning: Data Party P_i
procedure train_epoch(M, η, b, L, c, σ, t)
  θ = parameters of M
  for j ∈ {1, 2, ..., 1/b} do
    Randomly sample D_{i,j} from D_i w/ probability b
    for each d ∈ D_{i,j} do
      g_j(d) ← ∇_θ L(θ, d)
      ḡ_j(d) ← g_j(d) / max(1, ||g_j(d)||_2 / c)
    end for
    ḡ_j ← (1/|D_{i,j}|) ( Σ_{∀d} ḡ_j(d) + N(0, c² · σ²/(t − 1)) )
    θ ← θ − η ḡ_j
  end for
  M ← θ
  return Enc_pk(θ)
end procedure
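A runnable sketch of the per-batch computation in Algorithm 4 (per-example gradient clipping, collusion-aware Gaussian noise, and an SGD step) is given below. It is our numpy illustration under the assumption that per-example gradients are available as rows of a matrix; it is not the authors' implementation.

    import numpy as np

    def private_batch_update(theta, per_example_grads, eta, c, sigma, t, rng=None):
        # One noisy SGD step as in Algorithm 4. Each per-example gradient is
        # clipped to L2 norm c, the batch is summed, Gaussian noise with
        # variance c^2 * sigma^2 / (t - 1) is added (the trust parameter t
        # splits the noise across parties), and the result is averaged.
        rng = rng or np.random.default_rng()
        grads = np.asarray(per_example_grads, dtype=float)
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        clipped = grads / np.maximum(1.0, norms / c)           # per-example clipping
        noise = rng.normal(0.0, c * sigma / np.sqrt(t - 1), size=theta.shape)
        noisy_mean = (clipped.sum(axis=0) + noise) / len(grads)
        return theta - eta * noisy_mean                        # SGD step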
5.3 Application to Private Support Vector Machine Training
Finally, we focus on the classic ℓ2-regularized binary linear SVM problem with hinge loss, which is given in the following form:

    L(w) := (1/|D|) Σ_{(x_i, y_i) ∈ D} max{0, 1 − y_i⟨w, x_i⟩} + λ||w||_2^2,    (4)

where (x_i, y_i) ∈ R^d × {−1, 1} is a feature vector and class label pair, w ∈ R^d is the model weight vector, and λ is the regularization coefficient.

From the aggregator perspective, specified in Algorithm 5, the process of SVM training is similar to that of neural network training. Each query to the data parties is defined as K epochs of training. Once query responses are received, model parameters are averaged to generate a new support vector machine model. This new model is then sent to the data parties for another K epochs of training. We again specify a pre-determined number of epochs E to control the number of training iterations.

Algorithm 5 Private SVM Learning: Aggregator
Input: Set of data parties P; minimum number of honest, non-colluding parties t; noise parameter σ; learning rate η; loss function L; clipping value c; number of epochs E; number of epochs per query K; public key pk
  t̄ = n − t + 1
  Initialize model M with random weights w
  for each e ∈ [E/K] do
    Asynchronously query P: train_epoch(M, η, K, L, c, σ, t)
    θ_e = decrypted aggregate, noisy parameters from P
    M ← θ_e
  end for
  return M

To complete an epoch of learning at each data party, we iterate through each instance in the local training dataset D_i. We again deploy a clipping approach to constrain the sensitivity of the updates. The model parameters are then updated according to the loss function L as well as the noise parameter. The process conducted at each data party for K epochs of training in response to an aggregator query is outlined in Algorithm 6.

Algorithm 6 Private SVM Learning: Data Party P_i
procedure train_epoch(M, η, K, L, c, σ, t)
  w = parameters of M
  for each (x_i, y_i) ∈ D do
    x_i ← x_i / max(1, ||x_i||_2 / c)
  end for
  for k ∈ {1, 2, ..., K} do
    g(D) ← ∇_w L(w, D)
    ḡ ← g(D) + N(0, σ²/(t − 1))
    w ← w − η ḡ
    M ← w
  end for
  return Enc_pk(w)
end procedure
5.4 Expanding the Algorithm Repository
Beyond the three models evaluated here, our approach can be used to extend any differentially private machine learning algorithm into a federated learning environment. We demonstrate the flexibility of our system through 3 example algorithms which are of broad interest and significantly different. The task of generating and deploying our system for each algorithm, however, is non-trivial. First, a DP version of the algorithm must be developed. Second, this must be written as a series of queries. Finally, each query must have an appropriate aggregation procedure. Our approach may then be applied for accurate, federated, private results.

Due to our choice to use the threshold Paillier cryptosystem in conjunction with an aggregator, rather than a complex SMC protocol run by the parties themselves, we can provide a streamlined interface between the aggregator and the parties. Parties need only answer data queries with encrypted, noisy responses and decryption queries with partial decryption values. Management of the global model and communication with all other parties falls to the aggregator, therefore decreasing the barrier to entry for parties to engage in our federated learning system. Figure 4 demonstrates the impact of this choice as our approach is able to effectively handle the introduction of more parties into the federated learning system without the introduction of increased encryption overhead.

Another issue in the deployment of new machine learning training algorithms is the choice of algorithmic parameters. Key decisions must be made when using our system and many are domain-specific. We aim to inform such decisions with our analysis of trade-offs between privacy, trust and accuracy in Section 4.1.1, but note that the impact will vary depending on the data and the training algorithm chosen. While our system will reduce the amount of noise required to train any federated ML algorithm, questions surrounding what impact various data-specific features will have on the privacy budget are algorithm-specific. For example, Algorithm 2 demonstrates how, in decision tree training, the number of features and classes impact the privacy budget at each level. Similarly, Algorithms 4 and 6 show the role of norm clipping in neural network and SVM learning. In neural networks, this value not only impacts noise but will also have a different impact on learning depending on the size of the network and number of features.

6 RELATED WORK
Our work relates to both the areas of FL as well as privacy-preserving ML. Existing work can be classified into three categories: trusted aggregator, local DP, and cryptographic.

Trusted Aggregator. Approaches in this area trust the aggregator to obtain data in plaintext or add noise. [1] and [22] propose differentially private ML systems, but do not consider a distributed data scenario, thus requiring a central party. In [41], the authors develop
a distributed data mining system with DP but show significant accuracy loss and require a trusted aggregator to add noise.

Recently, [32] presented PATE, an ensemble approach to private learning wherein several "teacher" models are independently trained over local datasets. A trusted aggregator then provides a DP query interface to a "student" model that has unlabelled public data (but no direct access to private data) and obtains labels through queries to the teachers. While we have proposed a federated learning (FL) approach wherein one global model is learned over the aggregate of the parties' datasets, the PATE method develops an ensemble model with independently trained base models using local datasets. Unlike the methods we evaluate, PATE assumes a fully trusted party to aggregate the teachers' labels; focuses on scenarios wherein each party has enough data to train an accurate model, which might not hold, e.g., for cellphone users training a neural network; and assumes access to publicly available data, an assumption not made in our FL system. Models produced from our FL system learn from all available data, leading to more accurate models than the local models trained by each participant in PATE (Figure 4b in [32] demonstrates the need for a lot of parties to achieve reasonable accuracy in such a setting).

Local Differential Privacy. [39] presents a distributed learning system using DP without a central trusted party. However, the DP guarantee is per-parameter and becomes meaningless for models with more than a small number of parameters.

Cryptographic Approaches. [38] presents a protocol to privately aggregate sums over multiple time periods. Their protocol is designed to allow participants to periodically upload encrypted values to an oblivious aggregator with minimum communication costs. Their approach, however, has participants sending in a stream of statistics and does not address FL or propose an FL system. Additionally, their approach calls for each participant to add noise independently. As our experimental results show, allowing each participant to add noise in this fashion results in models with low accuracy, making this approach unsuitable for FL. In contrast, our approach reduces the amount of noise injected by each participant by taking advantage of the additive properties of DP and the use of threshold-based homomorphic encryption to produce accurate models that protect individual parties' privacy.

In [6, §B] the authors propose the use of multiparty computation to securely aggregate data for FL. The focus of the paper is to present suitable cryptographic techniques to ensure that the aggregation process can take place in mobile environments. While the authors propose FL as motivation, no complete system is developed, with "a detailed study of the integration of differential privacy, secure aggregation, and deep learning" remaining beyond the scope.

[4] provides a theoretical analysis on how differentially private computations could be done in a federated setting for single instance operations using either secure function evaluation or the local model with a semi-trusted curator. By comparison, we consider multiple operations to conduct FL and provide empirical evaluation of the FL system. [29] proposes a system to perform differentially private database joins. This approach combines private set intersection with random padding, but cannot be generally applied to FL. In [33] the authors' protocols are tailored to inner join tables and counting the number of values in an array. In contrast, we propose an accurate, private FL system for predictive model training.

Dwork et al. [14] present a distributed noise generation scheme and focus on methods for generating noise from different distributions. This scheme is based on secret sharing, an MPC mechanism that requires extensive exchange of messages and entails a communication overhead not viable in many federated learning settings.

[10] proposes a method to train neural networks in a private collaborative fashion by combining MPC, DP and secret sharing, assuming non-colluding honest parties. In contrast, our system prevents privacy leakages even if parties actively collude.

Approaches for the private collection of streaming data, including [2, 8, 21, 37], aim to recover computation when one or more parties go down. Our system, however, enables private federated learning which allows for checkpoints in each epoch of training. The use of threshold cryptography also enables our system to decrypt values when only a subset of the participants is available.

7 CONCLUSION
In this paper, we present a novel approach to perform FL that combines DP and SMC to improve model accuracy while preserving provable privacy guarantees and protecting against extraction attacks and collusion threats. Our approach can be applied to train different ML models in a federated learning fashion for varying trust scenarios. Through adherence to the DP framework we are able to guarantee overall privacy from inference of any model output from our system as well as any intermediate result made available to A or P. SMC additionally guarantees that any messages exchanged without DP protection are not revealed and therefore do not leak any private information. This provides end-to-end privacy guarantees with respect to the participants as well as any attackers of the model itself. Given these guarantees, models produced by our system can be safely deployed to production without infringing on privacy guarantees.

We demonstrated how to apply our approach to train a variety of ML models and showed that it out-performs existing state-of-the-art techniques for FL. Our system provides significant gains in accuracy when compared to a naïve application of state-of-the-art differentially private protocols to FL systems.

For a tailored threat model, we propose an end-to-end private federated learning system which uses SMC in combination with DP to produce models with high accuracy. As far as we know, this is the first paper to demonstrate that the application of these combined techniques allows us to maintain this high accuracy at a given level of privacy over different learning approaches. In the light of the ongoing social discussion on privacy, this proposed approach provides a novel method for organizations to use ML in applications requiring high model performance while addressing privacy needs and regulatory compliance.

REFERENCES
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
[2] Gergely Ács and Claude Castelluccia. 2011. I have a dream! (differentially private smart metering). In International Workshop on Information Hiding. Springer, 118–132.
[3] Donald Beaver. 1991. Foundations of secure interactive computing. In Annual International Cryptology Conference. Springer, 377–391.
[4] Amos Beimel, Kobbi Nissim, and Eran Omri. 2008. Distributed Private Data Analysis: Simultaneously Solving How and What. In Advances in Cryptology – CRYPTO 2008, David Wagner (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 451–468.
[5] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. 2005. Practical privacy: the SuLQ framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 128–138.
[6] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1175–1191.
[7] Mark Bun and Thomas Steinke. 2016. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference. Springer, 635–658.
[8] T. H. Hubert Chan, Elaine Shi, and Dawn Song. 2012. Privacy-Preserving Stream Aggregation with Fault Tolerance. In Financial Cryptography and Data Security, Angelos D. Keromytis (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 200–214.
[9] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] Melissa Chase, Ran Gilad-Bachrach, Kim Laine, Kristin E Lauter, and Peter Rindal. 2017. Private Collaborative Neural Network Learning. IACR Cryptology ePrint Archive 2017 (2017), 762.
[11] Ivan Damgård and Mats Jurik. 2001. A Generalisation, a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. In Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography (PKC '01). Springer-Verlag, London, UK, 119–136. http://dl.acm.org/citation.cfm?id=648118.746742
[12] Dua Dheeru and Efi Karra Taniskidou. 2017. UCI Machine Learning Repository. (2017). http://archive.ics.uci.edu/ml
[13] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
[14] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 486–503.
[15] Cynthia Dwork and Jing Lei. 2009. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing. ACM, 371–380.
[16] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[17] Cynthia Dwork and Guy N Rothblum. 2016. Concentrated differential privacy. arXiv preprint arXiv:1603.01887 (2016).
[18] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 51–60.
[19] Arik Friedman and Assaf Schuster. 2010. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 493–502.
[20] Oded Goldreich. 1998. Secure multi-party computation. Manuscript. Preliminary version 78 (1998).
[21] S. Goryczka and L. Xiong. 2017. A Comprehensive Comparison of Multiparty Secure Additions with Differential Privacy. IEEE Transactions on Dependable and Secure Computing 14, 5 (Sep. 2017), 463–477. https://doi.org/10.1109/TDSC.2015.2484326
[22] Geetha Jagannathan, Krishnan Pillaipakkamnatt, and Rebecca N Wright. 2009. A practical differentially private random decision tree classifier. In Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on. IEEE, 114–121.
[23] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2017. The composition theorem for differential privacy. IEEE Transactions on Information Theory 63, 6 (2017), 4037–4049.
[24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[25] Jaewoo Lee and Daniel Kifer. 2018. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1656–1665.
[26] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, Vol. 14. 583–598.
[27] Yehuda Lindell and Benny Pinkas. 2000. Privacy preserving data mining. In Annual International Cryptology Conference. Springer, 36–54.
[28] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
[29] Arjun Narayan and Andreas Haeberlen. 2012. DJoin: Differentially Private Join Queries over Distributed Databases. In OSDI. 149–162.
[30] Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive Privacy Analysis of Deep Learning: Stand-alone and Federated Learning under Passive and Active White-box Inference Attacks. In Security and Privacy (SP), 2019 IEEE Symposium on.
[31] Pascal Paillier. 1999. Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 223–238.
[32] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. 2018. Scalable Private Learning with PATE. arXiv preprint arXiv:1802.08908 (2018).
[33] Martin Pettai and Peeter Laud. 2015. Combining differential privacy and secure multiparty computation. In Proceedings of the 31st Annual Computer Security Applications Conference. ACM, 421–430.
[34] J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81–106.
[35] J. Ross Quinlan. 1993. C4.5: Programming for machine learning. Morgan Kauffmann 38 (1993), 48.
[36] J. Ross Quinlan. 2007. C5. (2007). http://rulequest.com
[37] Vibhor Rastogi and Suman Nath. 2010. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 735–746.
[38] Elaine Shi, HTH Chan, Eleanor Rieffel, Richard Chow, and Dawn Song. 2011. Privacy-preserving aggregation of time-series data. In Annual Network & Distributed System Security Symposium (NDSS). Internet Society.
[39] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1310–1321.
[40] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 3–18.
[41] Ning Zhang, Ming Li, and Wenjing Lou. 2011. Distributed data mining with differential privacy. In Communications (ICC), 2011 IEEE International Conference on. IEEE, 1–5.