

SecureBoost: A Lossless Federated Learning Framework

Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Member, IEEE, Tianjian Chen, Dimitrios Papadopoulos, Qiang Yang, Fellow, IEEE

arXiv:1901.08755v3 [cs.LG] 7 Apr 2021

Abstract—The protection of user privacy is an important concern in machine learning, as evidenced by the rolling out of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks for data sharing that do not violate user privacy. To meet this goal, in this paper we propose a novel lossless privacy-preserving tree-boosting system, known as SecureBoost, in the setting of federated learning. SecureBoost first conducts entity alignment under a privacy-preserving protocol and then constructs boosting trees across multiple parties with a carefully designed encryption strategy. This federated learning system allows the learning process to be jointly conducted over multiple parties with common user samples but different feature sets, which corresponds to a vertically partitioned data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while revealing no information about any private data provider. We show that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that require centralized data, and thus it is highly scalable and practical for industrial applications such as credit risk analysis. To this end, we discuss information leakage during the protocol execution and propose ways to provably reduce it.

Index Terms—Federated Learning, Privacy, Security, Decision Tree

1 INTRODUCTION

Modern society is increasingly concerned with the unlawful use and exploitation of personal data. At the individual level, improper use of personal data may cause potential risk to user privacy. At the enterprise level, data leakage may have grave consequences on commercial interests. Actions are being taken by different societies. For example, the European Union has enacted a law known as the General Data Protection Regulation (GDPR), which is designed to give users more control over their personal data [1], [2], [3], [4]. Many enterprises that rely heavily on machine learning are beginning to make sweeping changes as a consequence.

Despite the difficulty in meeting the goal of user privacy protection, the need for different organizations to collaborate while building machine learning models still stays strong. In reality, many data owners do not have a sufficient amount of data to build high-quality models. For example, retail companies have users' purchase and transaction data, which are highly useful if provided to banks for credit rating applications. Likewise, mobile phone companies have users' usage data, but each company may only have a small number of users, which is not enough to train high-quality user preference models. Such companies have strong motivation to collaboratively exploit the joint data value.

So far, it is still a challenge to allow different data owners to collaboratively build high-quality machine learning models while at the same time protecting user data privacy and confidentiality. In the past, several attempts have been made to address the user privacy issue in machine learning [5], [6]. For example, Apple proposed to use differential privacy (DP) [7], [8] to address the privacy preservation issue. The basic idea of DP is to add properly calibrated noise to data to disambiguate the identity of any individual when data is being exchanged and analyzed by a third party. However, DP only prevents user-data leakage to a certain degree and cannot completely rule out the identity of an individual. In addition, data exchange under DP still requires that data change hands between organizations, which may not be allowed by strict laws like the GDPR. Furthermore, the DP method is lossy in machine learning, in that models built after noise is injected may perform unsatisfactorily in prediction accuracy.

More recently, Google introduced a federated learning (FL) framework [9] and deployed it on Android cloud. The basic idea is to allow individual clients to upload only model updates, but not raw data, to a central server where the models are aggregated. A secure aggregation protocol was further introduced [10] to ensure that the model parameters do not leak user information to the server. This framework is also referred to as horizontal FL [11] or data-partition FL, where each partition corresponds to a subset of data samples collected from one or multiple users.

In this paper, we consider another setting, in which multiple parties collaboratively build their machine learning models while protecting user privacy and data confidentiality. Our setting is shown in Figure 2 and is typically referred to as vertical FL [11], because data are partitioned by features among different parties. This setting has a wide range of real-world applications. For example, financial institutes can leverage alternative data from a third party to enhance users' and small and medium enterprises' credit ratings [12]. Patients' records from multiple hospitals can be used together for diagnoses [13], [14]. We can regard the data located at different parties as a subsection of a virtual big data table obtained by taking the union of all data at different parties.

Corresponding authors: Yang Liu and Qiang Yang. Email: [email protected], [email protected]. Kewei Cheng is with the University of California, Los Angeles, USA. Tao Fan, Yang Liu, and Tianjian Chen are with the Department of Artificial Intelligence, Webank, Shenzhen, China. Yilun Jin and Dimitrios Papadopoulos are with the Hong Kong University of Science and Technology. Qiang Yang is with both Webank and the Hong Kong University of Science and Technology.
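The vertically partitioned setting and the entity alignment it requires can be sketched as follows. This is a minimal toy illustration with hypothetical data and party contents, not the paper's protocol: the plain set intersection only shows the outcome of alignment, which SecureBoost performs under a privacy-preserving intersection protocol.

```python
# Toy illustration of vertical partitioning (hypothetical data).
# Party A holds features x1, x2; Party B holds x3 and the label y;
# they share only some user IDs.
party_a = {"u1": {"x1": 0.3, "x2": 1.2},
           "u2": {"x1": 0.7, "x2": 0.4},
           "u3": {"x1": 0.1, "x2": 0.9}}
party_b = {"u2": {"x3": 5.0, "y": 1},
           "u3": {"x3": 2.5, "y": 0},
           "u4": {"x3": 7.1, "y": 1}}

# Entity alignment: keep only the common users. A plain intersection is used
# here purely for illustration; the real system hides the non-shared IDs.
common = sorted(party_a.keys() & party_b.keys())

# The "virtual big data table": the union of features over aligned users.
# During federated training it is never materialized at any single party.
virtual_table = {u: {**party_a[u], **party_b[u]} for u in common}
print(common)               # ['u2', 'u3']
print(virtual_table["u2"])  # {'x1': 0.7, 'x2': 0.4, 'x3': 5.0, 'y': 1}
```

Note that users u1 and u4 never appear in the joint view, mirroring requirement 3) below that training runs only over the shared users.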
[Figure] Fig. 1: Illustration of the proposed SecureBoost framework (sub-models at the active party and two passive parties, connected via privacy-preserving entity alignment, intermediate computation exchange, and confidential information exchange).

[Figure] Fig. 2: Vertically partitioned data set (Party 1 holds features X1–X2 and labels Y, Party 2 holds features X3–X5; their partially overlapping users are virtually joined into one table with columns X1–X5 and Y).

Then the data at each party have the following properties:

1) The big data table is vertically split, such that the data are split in the feature dimension among parties;
2) Only one data provider has the label information;
3) Parties share a common set of users.

Our goal is then to allow parties to build a prediction model jointly while protecting all parties from leaking data information to other parties. In contrast with most existing work on privacy-preserving data mining and machine learning, the complexity in our setting is significantly increased. Unlike sample-partitioned/horizontal FL, the vertical FL setting requires a more complex mechanism to decompose the loss function at each party [5], [15], [16]. In addition, since only one data provider owns the label information, we need to propose a secure protocol to guide the learning process instead of sharing label information explicitly among all parties. Finally, data confidentiality and privacy concerns prevent parties from exposing their own users. Hence, entity alignment should also be conducted in a sufficiently secure manner.

Tree boosting is a highly effective and widely used machine learning method, which excels in many machine learning tasks due to its high efficiency as well as strong interpretability. For example, XGBoost [17] has been widely used in various applications including credit risk analysis and user behavior studies. In this paper, we propose a novel end-to-end privacy-preserving tree-boosting algorithm and framework known as SecureBoost to enable machine learning in a federated setting. SecureBoost has been implemented in an open-sourced FL project, FATE (https://github.com/FederatedAI/FATE), to enable industrial applications. Our federated learning framework operates in two steps. First, we find the common users among the parties under a privacy-preserving constraint. Then, we collaboratively learn a shared classification or regression model without leaking any user information to each other. We summarize our main contributions as follows:

• We formally define a novel problem of privacy-preserving machine learning over vertically partitioned data in the setting of federated learning.
• We present an approach to train a high-quality tree-boosting model collaboratively while keeping the training data local over multiple parties. Our protocol does not need the participation of a trusted third party.
• Finally and importantly, we prove that our approach is lossless, in the sense that it is as accurate as any centralized non-privacy-preserving method that brings all data to a central location.
• In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.

2 PRELIMINARIES AND RELATED WORK

To protect the privacy of the data used for learning a model, the authors in [18] proposed to take advantage of differential privacy (DP) for learning a deep learning model. Recently, Google introduced a federated learning framework to prevent the data from being transmitted by bringing the model training to each mobile terminal [9], [10], [19]. Its basic idea is that each local mobile terminal trains a local model on its local data with the same model architecture; the global model can then simply be updated by averaging all the local models. Following the same idea, several attempts have been made to adapt different machine learning models to the federated setting, including decision trees [20], [21], linear/logistic regression [22], [23], [24] and neural networks [25], [26].

All the above methods are designed for horizontally partitioned data. Unlike sample-partitioned/horizontal FL, the vertical FL setting requires a more complex mechanism to decompose the loss function at each party. The concept of vertical FL is first proposed in [5], [11], and protocols are proposed for linear models [5], [13] and neural networks [27]. Some previous works have been proposed for privacy-preserving decision trees over vertically partitioned data [16], [28]. However, their proposed
methods have to reveal the class distribution over given attributes, which causes potential security risks. In addition, they can only handle discrete data, which is less practical for real-life scenarios. In contrast, our method guarantees better protection of the data and can easily be applied to continuous data. Another work proposed in [29] jointly performs logistic regression over encrypted vertically partitioned data by approximating the non-linear logistic loss with a Taylor expansion, which inevitably compromises the performance of the model. In contrast to these works, we propose a novel approach that is lossless in nature.

3 PROBLEM STATEMENT

Let $\{X^k \in \mathbb{R}^{n_k \times d_k}\}_{k=1}^{m}$ be the data matrices distributed on $m$ private parties, with each row $X^k_{i*} \in \mathbb{R}^{1 \times d_k}$ being a data instance. We use $F^k = \{f_1, \ldots, f_{d_k}\}$ to denote the feature set of the corresponding data matrix $X^k$. Any two parties $p$ and $q$ have disjoint sets of features, i.e., $F^p \cap F^q = \emptyset$, $\forall p \neq q \in \{1, \ldots, m\}$. Different parties may hold different sets of users as well, allowing some degree of overlap. Only one of the parties holds the class labels $y$.

Definition 1. Active Party: We define the active party as the data provider who holds both a data matrix and the class labels. Since the class label information is indispensable for supervised learning, the active party naturally takes the responsibility as a dominating server in federated learning.

Definition 2. Passive Party: We define a data provider which has only a data matrix as a passive party. Passive parties play the role of clients in the federated learning setting.

The problem of privacy-preserving machine learning over vertically split data in federated learning can be stated as:

Given: a vertically partitioned data matrix $\{X^k\}_{k=1}^{m}$ distributed on $m$ private parties, and the class labels $y$ held by the active party.

Learn: a machine learning model $M$ without giving information about the data matrix of any party to the others in the process. The model $M$ is a function that has a projection $M_i$ at each party $i$, such that $M_i$ takes as input only that party's own features $X^i$.

Lossless Constraint: We require that the model $M$ is lossless, which means that the loss of $M$ under federated learning over the training data is the same as the loss of $M'$ when $M'$ is built on the union of all data.

4 FEDERATED LEARNING WITH SECUREBOOST

As one of the most popular machine learning algorithms, gradient tree boosting excels in many machine learning tasks, such as fraud detection, feature selection and product recommendation. In this section, we propose a novel gradient tree boosting algorithm called SecureBoost in the federated learning setting. It consists of two major steps. First, it aligns the data under the privacy constraint. Second, it collaboratively learns a shared gradient tree boosting model while keeping all the training data secret over multiple private parties. We explain each step below.

Our first goal is to find a common set of data samples at all participating parties so as to build a joint model $M$. When the data is vertically partitioned among parties, different parties hold different but partially overlapping users, which can be identified by their IDs. The problem is how to find the common data samples across the parties without revealing the non-shared parts. To achieve this goal, we align the data samples under a privacy-preserving protocol for inter-database intersections [30].

After aligning the data across different parties under the privacy constraint, we now consider the problem of jointly building tree ensemble models over multiple parties without violating privacy. Before discussing the details of the algorithm, we first introduce the general framework of federated learning. In federated learning, a typical iteration consists of four steps. First, each client downloads the current global model from the server. Second, each client computes an updated model based on its local data and the current global model, which resides within the active party. Third, each client sends the model update back to the server under encryption. Finally, the server aggregates these model updates and constructs the updated global model.

Following the general framework of federated learning, we see that to design a privacy-preserving tree boosting framework in the setting of federated learning, we essentially have to answer the following three questions: (1) How can each client (i.e., a passive party) compute an updated model based on its local data without reference to the class labels? (2) How can the server (i.e., the active party) aggregate all the updated models and obtain a new global model? (3) How can the updated global model be shared among all parties without leaking any information at inference time? To answer these three questions, we start by reviewing a tree ensemble model, XGBoost [31], in a non-federated setting.

Given a data set $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, XGBoost predicts the output by using $K$ regression trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \quad (1)$$

To learn the set of regression trees used in Eq.(1), it greedily adds a tree $f_t$ at the $t$-th iteration to minimize the following loss:

$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \quad (2)$$

where $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \|w\|^2$, $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$.

When constructing the regression tree at the $t$-th iteration, it starts from a tree of depth 0 and adds a split to each leaf node until reaching the maximum depth. In particular, it maximizes the following score to determine the best split, where $I_L$ and $I_R$ are the instance spaces of the left and right tree nodes after the split:

$$L_{split} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \quad (3)$$

After it obtains an optimal tree structure, the optimal weight $w_j^*$ of leaf $j$ can be computed by the following equation, where $I_j$ is the instance space of leaf $j$:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \quad (4)$$
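The scoring arithmetic of Eqs.(3) and (4) can be exercised in plain, non-federated form. The sketch below assumes the square loss (so $g_i = \hat{y}_i - y_i$, $h_i = 1$) and uses toy gradient values; the regularization constants `lam` and `gamma` are illustrative, not values from the paper.

```python
# Non-federated sketch of the split score (Eq. 3) and leaf weight (Eq. 4).
# Square loss assumed: g_i = yhat_i - y_i, h_i = 1. lam/gamma illustrative.
lam, gamma = 1.0, 0.0

def split_score(gl, hl, gr, hr):
    # Eq.(3): 0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - G^2/(H+lam)] - gamma
    g, h = gl + gr, hl + hr
    return 0.5 * (gl ** 2 / (hl + lam)
                  + gr ** 2 / (hr + lam)
                  - g ** 2 / (h + lam)) - gamma

def leaf_weight(gs, hs):
    # Eq.(4): w* = -sum(g_i) / (sum(h_i) + lam)
    return -sum(gs) / (sum(hs) + lam)

y = [1.0, 1.0, 0.0, 0.0]
yhat = [0.5, 0.5, 0.5, 0.5]                 # current predictions
g = [p - t for p, t in zip(yhat, y)]        # [-0.5, -0.5, 0.5, 0.5]
h = [1.0, 1.0, 1.0, 1.0]                    # square loss: h_i = 1

good = split_score(g[0] + g[1], 2.0, g[2] + g[3], 2.0)  # separates classes
bad = split_score(g[0] + g[2], 2.0, g[1] + g[3], 2.0)   # mixes classes
print(good > bad)                  # True
print(leaf_weight(g[:2], h[:2]))   # 0.3333333333333333
```

Because both quantities depend only on sums of $g_i$ and $h_i$, the same arithmetic can be evaluated on encrypted aggregates, which is the key fact the federated algorithm exploits.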
From the above review, we make the following observations:

(1) The evaluation of split candidates and the calculation of the optimal leaf weight depend only on the $g_i$ and $h_i$, which makes the algorithm easily adapted to the setting of federated learning.

(2) The class label can be inferred from $g_i$ and $h_i$. For instance, when we take the square loss as the loss function, we have $g_i = \hat{y}_i^{(t-1)} - y_i$.

With the above observations, we now introduce our federated gradient tree boosting algorithm. Following observation (1), we can see that a passive party can determine its locally optimal split with only its local data and the $g_i$, $h_i$, which motivates us to decompose the learning task at each party in this manner. However, according to observation (2), $g_i$ and $h_i$ should be regarded as sensitive data, since they are able to disclose class label information to passive parties. Therefore, in order to keep $g_i$ and $h_i$ confidential, the active party is required to encrypt $g_i$ and $h_i$ before sending them to passive parties. The remaining challenge is how to determine the locally optimal split with encrypted $g_i$ and $h_i$ at each passive party.

According to Eq.(3), the optimal split can be found if $g_l = \sum_{i \in I_L} g_i$ and $h_l = \sum_{i \in I_L} h_i$ are calculated for every possible split. So next, we show how to obtain $g_l$ and $h_l$ from encrypted $g_i$ and $h_i$ using an additively homomorphic encryption scheme [32]. The Paillier encryption scheme is taken as our encryption scheme. Denoting the encryption of a number $u$ under the Paillier cryptosystem as $\langle u \rangle$, the main property of the Paillier cryptosystem ensures that for arbitrary numbers $u$ and $v$, we have $\langle u \rangle \cdot \langle v \rangle = \langle u + v \rangle$. Therefore, $\langle h_l \rangle = \prod_{i \in I_L} \langle h_i \rangle$ and $\langle g_l \rangle = \prod_{i \in I_L} \langle g_i \rangle$. Consequently, the best split can be found in the following way. First, each passive party computes $\langle g_l \rangle$ and $\langle h_l \rangle$ for all possible splits locally, which are then sent back to the active party. The active party deciphers all $\langle g_l \rangle$ and $\langle h_l \rangle$ and calculates the globally optimal split according to Eq.(3). We adopt the approximation scheme used by [31], so as to alleviate the need of enumerating all possible split candidates and communicating their $\langle g_i \rangle$ and $\langle h_i \rangle$. The details of our secure gradient aggregation algorithm are shown in Algorithm 1.

Algorithm 1 Aggregate Encrypted Gradient Statistics
Input: I, instance space of current node
Input: d, feature dimension
Input: {<g_i>, <h_i>}_{i in I}
Output: G in R^{d x l}, H in R^{d x l}
1: for k = 0 -> d do
2:   Propose S_k = {s_{k1}, s_{k2}, ..., s_{kl}} by percentiles on feature k
3: end for
4: for k = 0 -> d do
5:   G_{kv} = Sum_{i in {i | s_{k,v} >= x_{i,k} > s_{k,v-1}}} <g_i>
6:   H_{kv} = Sum_{i in {i | s_{k,v} >= x_{i,k} > s_{k,v-1}}} <h_i>
7: end for

Following observation (1), the split finding algorithm remains largely the same as in XGBoost, except for minor adjustments to fit the federated learning framework. Due to the separation in features, SecureBoost requires different parties to store certain information for each split, so as to perform prediction for new samples. Passive parties should keep a lookup table as shown in Figure 3. It contains split thresholds [feature id $k$, threshold value $v$] and a unique record id $r$ used to index the table, in order to look up split conditions during inference. In the meantime, because the active party does not have the features located at passive parties, for the active party to know which passive party to deliver an instance to, as well as to instruct the passive party which split condition to use at inference time, it associates every tree node with a pair (party id $i$, record id $r$). Specific details of the split finding algorithm for SecureBoost are summarized in Algorithm 2.

Algorithm 2 Split Finding
Input: I, instance space of current node
Input: {G^i, H^i}_{i=1}^{m}, aggregated encrypted gradient statistics from m parties
Output: partition of the current instance space according to the selected attribute's value
1: /* Conducted on the Active Party */
2: g <- Sum_{i in I} g_i, h <- Sum_{i in I} h_i
3: for i = 0 to m do
4:   for k = 0 to d_i do
5:     g_l <- 0, h_l <- 0
6:     // enumerate all threshold values
7:     for v = 0 to l_k do
8:       get decrypted values D(G^i_{kv}) and D(H^i_{kv})
9:       g_l <- g_l + D(G^i_{kv}), h_l <- h_l + D(H^i_{kv})
10:      g_r <- g - g_l, h_r <- h - h_l
11:      score <- max(score, g_l^2/(h_l+lambda) + g_r^2/(h_r+lambda) - g^2/(h+lambda))
12:    end for
13:  end for
14: end for
15: Return k_opt and v_opt to the passive party i_opt when the max score is obtained.
16: /* Conducted on Passive Party i_opt */
17: Determine the selected attribute's value according to k_opt and v_opt and partition the current instance space.
18: Record the selected attribute's value and return [record id, I_L] back to the active party.
19: /* Conducted on the Active Party */
20: Split the current node according to I_L and associate the current node with [party id, record id].

The remaining problem is the computation of the optimal leaf weights. According to Eq.(4), the optimal weight of leaf $j$ depends only on $\sum_{i \in I_j} g_i$ and $\sum_{i \in I_j} h_i$. Consequently, it follows similar procedures as split finding. When a leaf node is reached, the passive party sends $\langle \sum_{i \in I_j} g_i \rangle$ and $\langle \sum_{i \in I_j} h_i \rangle$ to the active party, which are then deciphered to compute the corresponding weights through Eq.(4).

5 FEDERATED INFERENCE

In this section, we describe how the learned model (distributed among parties) can be used to classify a new instance, even though the features of the instance to be classified are private and distributed among parties. Since each party knows its own features but nothing of the others', we need a secure distributed inference protocol that controls the passes from one party to another, based on the decisions made. To illustrate the inference process, we consider a system with three parties as depicted in Figure 3. Specifically, party 2 is the active party, which collects the user's age, gender and marriage status, as well as the label, whether the user made the payment on time. Party 1 and party 3 are passive parties, holding the features monthly bill payment and education, and amount of given credit, respectively.
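The additive property $\langle u \rangle \cdot \langle v \rangle = \langle u + v \rangle$ that the secure aggregation relies on can be checked with a from-scratch toy Paillier instance. This sketch uses tiny primes and deterministic randomness, so it is insecure and for illustration only; a real deployment uses a full-size implementation, and the float gradients would first be fixed-point encoded.

```python
import math
import random

# Toy Paillier cryptosystem (tiny primes; NOT secure). It demonstrates the
# additive property used by Algorithm 1: <u> * <v> mod n^2 == <u + v>.
p, q = 17, 19
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)      # Carmichael function of n
mu = pow(lam, -1, n)              # decryption helper, valid since g = n + 1
rng = random.Random(0)            # deterministic randomness for the demo

def enc(m):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2   # g^m * r^n mod n^2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# One bucket of g_i values at a passive party (small integers for clarity).
g_bucket = [3, 5, 7]
agg = 1
for gi in g_bucket:
    agg = (agg * enc(gi)) % n2    # multiplying ciphertexts adds plaintexts
print(dec(agg))                   # 15, i.e. 3 + 5 + 7
```

Multiplying the encrypted bucket entries this way is how the $G_{kv}$, $H_{kv}$ sums of Algorithm 1 are formed without any passive party ever seeing an individual $g_i$ or $h_i$.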
[Figure] Fig. 3: An illustration of Federated Inference. (Training examples X1–X5 and the prediction example X6 are split across the three parties; lookup tables: Party 1, record 1: Bill Payment, threshold 5000; Party 2, record 1: Age, threshold 40; Party 3, record 1: Amount of given credit, threshold 800. The tree stores Root [Party ID: 1, Record ID: 1], Node 1 [Party ID: 3, Record ID: 1], Node 2 [Party ID: 2, Record ID: 1] and leaves w1–w4 holding {X5}, {X1}, {X2, X3}, {X4}. Inference steps: (1) Party 1 queries record 1 from its lookup table; (2) Party 1: 4367 < 5000; (3) Party 3 queries record 1 from its lookup table; (4) Party 3: 5500 > 800.)

Suppose we wish to predict whether a user X6 would make payment on time. Then all sites would have to collaborate to make the prediction. The whole process is coordinated by the active party. Starting from the root, by referring to the record [party id: 1, record id: 1], the active party knows that party 1 holds the root node, thereby requiring party 1 to retrieve the corresponding attribute, Bill Payment, from its lookup table based on record id 1. Since the classifying attribute is the bill payment, and party 1 knows that the bill payment for user X6 is 4367, which is less than the threshold 5000, it makes the decision to move down to its left child, node 1. Then, the active party refers to the record [party id: 3, record id: 1] associated with node 1 and requires party 3 to conduct the same operations. This process continues until a leaf is reached.

6 THEORETICAL ANALYSIS FOR LOSSLESS PROPERTY

Theorem 1. SecureBoost is lossless, i.e., the SecureBoost model $M$ and the XGBoost model $M'$ behave identically, provided that the models $M$ and $M'$ are identically initialized and hyper-parameterized.

Proof. According to Eq.(3), $g_l$ and $h_l$ are the only information needed for the calculation of the best split, and they can be obtained from the encrypted $g_i$ and $h_i$ using the Paillier cryptosystem in SecureBoost. In the Paillier cryptosystem, the encryption of a message $m$ is $\langle m \rangle = g^m r^n \bmod n^2$, for some random $r \in \{0, \ldots, n-1\}$. Given this definition, we have $\langle m_1 \rangle \cdot \langle m_2 \rangle = \langle m_1 + m_2 \rangle$ for arbitrary messages $m_1$ and $m_2$ under the Paillier cryptosystem, which can be proved as follows:

$$\langle m_1 \rangle \cdot \langle m_2 \rangle = (g^{m_1} r_1^n)(g^{m_2} r_2^n) \bmod n^2 = g^{m_1+m_2} (r_1 r_2)^n \bmod n^2 = \langle m_1 + m_2 \rangle \quad (5)$$

Therefore, we have $\langle h_l \rangle = \prod_{i \in I_L} \langle h_i \rangle$ and $\langle g_l \rangle = \prod_{i \in I_L} \langle g_i \rangle$. Provided the same initialization, an instance $i$ will have the same values of $g_i$ and $h_i$ under either setting. Thus, models $M$ and $M'$ always achieve the same best split throughout the construction of the tree and result in identical $M$ and $M'$, which ensures the lossless property.

7 SECURITY DISCUSSION

SecureBoost avoids revealing the data records held by each of the parties to the others during training and inference, thus protecting the privacy of individual parties' data. However, we stress that there is some leakage that can be inferred during the protocol execution, and it is quite different for passive vs. active parties.

The active party is in an advantageous position with SecureBoost, as it learns the instance space for each split and which party is responsible for the decision at each node. Also, it learns all the possible values of $g_l$, $g_r$ and $h_l$, $h_r$ during learning. The former seems unavoidable in this setting, unless one is willing to severely increase the overhead during the inference phase. However, the latter can be avoided using secure multi-party computation techniques for comparison of encrypted values (e.g., [33], [34]). In this way, the active party learns only the optimal $g_l$, $g_r$, $h_l$, $h_r$ per party; on the other hand, this significantly affects the efficiency during learning.

Note that instances associated with the same leaf are strongly indicated to belong to the same class. We denote the proportion of samples belonging to the majority class as the leaf purity. The information leakage with respect to passive parties is directly related to the leaf purity of the first tree of SecureBoost. Moreover, the first tree's leaf purity can be inferred from the weights of its leaves.

Theorem 2. Given a learned SecureBoost model, its first tree's leaf purity can be inferred from the weights of the leaves.

Proof. The loss function for the binary classification problem is given as follows:

$$L = y_i \log(1 + e^{-\hat{y}_i}) + (1 - y_i) \log(1 + e^{\hat{y}_i}) \quad (6)$$

Based on the loss function, we have $g_i = \hat{y}_i^{(0)} - y_i$ and $h_i = \hat{y}_i^{(0)} (1 - \hat{y}_i^{(0)})$ during the construction of the decision tree at the first iteration. Specifically, $\hat{y}_i^{(0)}$ is given as the initialized value. Suppose
we initialize all $\hat{y}_i^{(0)}$ to $a$, where $0 < a < 1$. According to Eq.(4), for the instances associated with a specific leaf $j$, $\hat{y}_i^{(1)} = S(w_j^*) = S(-\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda})$, where $S(x)$ is the sigmoid function. Suppose the number of instances associated with leaf $j$ is $n_j$ and the percentage of positive samples is $\theta_j$. When $n_j$ is relatively big, we can ignore $\lambda$ in $-\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ and rewrite the weight of leaf $j$ as

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i} = -\frac{\theta_j n_j (a-1) + (1-\theta_j) n_j a}{n_j a (1-a)} = \frac{a - \theta_j}{a(a-1)}.$$

By reformulating the equation, we have $\theta_j = a - a(a-1) w_j^*$. $\theta_j$ depends on $a$ and $w_j^*$, and $a$ is given at initialization. Thus, $w_j^*$ is the key to determining $\theta_j$. Note that $\theta_j$ can be used to represent the leaf purity of leaf $j$ (i.e., the purity of leaf $j$ can be formally written as $\max(\theta_j, 1-\theta_j)$), so the leaf purity of the first tree can be inferred from the weights of the leaves $(w_j^*)$ given a learned SecureBoost model.

According to Theorem 2, given a SecureBoost model, the weights of the leaves of its first tree can reveal sensitive information. In order to reduce the information leakage with respect to passive parties, we opt to store decision tree leaves at the active party and propose a modified version of our framework, called Reduced-Leakage SecureBoost (RL-SecureBoost). With RL-SecureBoost, the active party learns the first tree independently, based only on its own features, which fully protects the instance space of its leaves. Hence, all the information that passive parties learn is based on residuals. Although the residuals may also reveal information, we prove that as the purity of the first tree increases, this residual information decreases.

Theorem 3. As the purity of the first tree increases, the residual information decreases.

Proof. As mentioned before, for the binary classification problem we have $g_i = \hat{y}_i^{(t-1)} - y_i$ and $h_i = \hat{y}_i^{(t-1)} (1 - \hat{y}_i^{(t-1)})$, where $g_i \in [-1, 1]$. Hence,

$$h_i = g_i (1 - g_i) \ \text{if} \ y_i = 0; \qquad h_i = -g_i (g_i + 1) \ \text{if} \ y_i = 1 \quad (7)$$

When we construct the decision tree at the $t$-th iteration with $k$ leaves to fit the residuals of the previous tree, we in essence split the data into $k$ clusters so as to minimize the following loss:

$$L = -\sum_{j=1}^{k} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i} = -\sum_{j=1}^{k} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j^N} g_i (1 - g_i) + \sum_{i \in I_j^P} -g_i (1 + g_i)} \quad (8)$$

We know $\hat{y}_i^{(t-1)} \in [0, 1]$ and $g_i = \hat{y}_i^{(t-1)} - y_i$. Thus, $g_i \in [-1, 0]$ for positive samples and $g_i \in [0, 1]$ for negative samples. Taking the range of $g_i$ into consideration, we can rewrite the above equation as

$$\sum_{j=1}^{k} \frac{(\sum_{i \in I_j^N} |g_i| - \sum_{i \in I_j^P} |g_i|)^2}{\sum_{i \in I_j^N} |g_i|(|g_i| - 1) + \sum_{i \in I_j^P} |g_i|(|g_i| - 1)} \quad (9)$$

where $I_j^N$ and $I_j^P$ denote the sets of negative and positive samples associated with leaf $j$, respectively. We denote the expectation of $|g_i|$ for positive samples as $\mu_p$ and the expectation of $|g_i|$ for negative samples as $\mu_n$. When we have a large number of samples but a small number of leaf nodes $k$, we can use the following expression to approximate Eq.(9):

$$\sum_{j=1}^{k} \frac{(n_j^n \mu_n - n_j^p \mu_p)^2}{n_j^n \mu_n (\mu_n - 1) + n_j^p \mu_p (\mu_p - 1)} \quad (10)$$

where $n_j^n$ and $n_j^p$ represent the numbers of negative and positive samples associated with leaf $j$. Since $\mu_n \in [0, 1]$ and $\mu_p \in [0, 1]$, we know the numerator has to be positive and the denominator has to be negative; thus, the whole expression has to be negative. Minimizing Eq.(10) amounts to maximizing the numerator while minimizing the denominator. Note that the denominator involves terms of the form $\sum x^2$ and the numerator terms of the form $(\sum x)^2$, where $x \in [0, 1]$, so the expression is dominated by the numerator. Thereby, minimizing Eq.(10) can be regarded as maximizing the numerator $(n_j^n \mu_n - n_j^p \mu_p)^2$. Ideally, we require $n_j^n = n_j^p$ in order to prevent label information from divulging; the bigger $|\mu_n - \mu_p|$ is, the more possible it is to achieve this goal. We know $|g_i| = |\hat{y}_i^{(t-1)} - y_i| = \hat{y}_i^{(t-1)}$ for negative samples and $|g_i| = |\hat{y}_i^{(t-1)} - y_i| = 1 - \hat{y}_i^{(t-1)}$ for positive samples. Thereby, $\mu_n = \frac{1}{N_n} \sum_{j=1}^{k} (1 - \theta_j) n_j \hat{y}_i^{(t-1)}$ and $\mu_p = \frac{1}{N_p} \sum_{j=1}^{k} \theta_j n_j (1 - \hat{y}_i^{(t-1)})$, so $|\mu_n - \mu_p|$ can be calculated as

$$|\mu_n - \mu_p| = \left| \frac{1}{N_n} \sum_{j=1}^{k} (1 - \theta_j) n_j \hat{y}_i^{(t-1)} - \frac{1}{N_p} \sum_{j=1}^{k} \theta_j n_j (1 - \hat{y}_i^{(t-1)}) \right| \quad (11)$$

where $N_n$ and $N_p$ correspond to the total numbers of negative and positive samples, $\theta_j$ is the percentage of positive samples associated with leaf $j$ of the decision tree at the $(t-1)$-th iteration (the previous decision tree), $n_j$ denotes the number of instances associated with leaf $j$ of the previous decision tree, and $\hat{y}_i^{(t-1)} = S(w_j)$, where $w_j$ represents the weight of the $j$-th leaf of the previous decision tree. When the positive and negative samples are balanced, $N_n = N_p$, and we have

$$|\mu_n - \mu_p| = \frac{1}{N_n} \left| \sum_{j=1}^{k} \left( (1 - \theta_j) n_j S(w_j) - \theta_j n_j (1 - S(w_j)) \right) \right| = \frac{1}{N_n} \sum_{j=1}^{k} n_j |S(w_j) - \theta_j| = \frac{1}{N_n} \sum_{j=1}^{k} n_j \left| S\!\left(\frac{a - \theta_j}{a(a-1)}\right) - \theta_j \right| \quad (12)$$

As observed from Eq.(12), the expression achieves its minimum value when $S(\frac{a - \theta_j}{a(a-1)}) = a$. By solving this equation, we obtain the optimal solution $\theta_j^* = a(1 + (1-a)\ln(\frac{a}{1-a}))$. In order to achieve a bigger $|\mu_n - \mu_p|$, we want the deviation of $\theta_j$ from $\theta_j^*$ to be as big as possible. With a proper initialization of $a$, for instance $a = 0.5$, we get $\theta_j^* = 0.5$. In this case, maximizing $|\theta_j - \theta_j^*|$ is the same as maximizing $\max(\theta_j, 1 - \theta_j)$, which is exactly the leaf purity. Therefore, we have proved that high leaf purity guarantees a big difference between $\mu_n$ and $\mu_p$, which finally results in less information leakage. This completes our proof.
7

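The closed form for $\theta_j^*$ and the leaf-weight expression above can be sanity-checked numerically. The sketch below is a plain-Python check, not part of the SecureBoost protocol; `a` denotes the uniform initial prediction assumed in the derivation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_weight(theta, a):
    # First-tree leaf weight under a uniform initial prediction a:
    # g_i = a - y_i and h_i = a(1 - a), so
    # w_j = -sum(g_i)/sum(h_i) = (a - theta)/(a*(a - 1)), as in Eq. (12).
    return (a - theta) / (a * (a - 1.0))

def theta_star(a):
    # Closed-form solution of S((a - theta)/(a(a - 1))) = a for theta.
    return a * (1.0 + (1.0 - a) * math.log(a / (1.0 - a)))

# S(w_j) equals a exactly at theta = theta*(a), for any a in (0, 1).
for a in (0.3, 0.5, 0.7):
    assert abs(sigmoid(leaf_weight(theta_star(a), a)) - a) < 1e-9

# With a = 0.5 we get theta* = 0.5, and the per-leaf term
# |S(w_j) - theta_j| grows monotonically with the leaf purity.
assert abs(theta_star(0.5) - 0.5) < 1e-12
gaps = [abs(sigmoid(leaf_weight(t, 0.5)) - t) for t in (0.5, 0.7, 0.9, 0.99)]
assert all(g1 < g2 for g1, g2 in zip(gaps, gaps[1:]))
```

The last assertion mirrors the final step of the proof: the further $\theta_j$ is from $\theta_j^* = 0.5$ (i.e., the purer the leaf), the larger the per-leaf contribution to $|\mu_n - \mu_p|$.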
Given Theorem 3, we prove that RL-SecureBoost is secure as long as its first tree learns enough information to mask the actual label with residuals. Moreover, as we experimentally demonstrate in Section 8, RL-SecureBoost performs identically to SecureBoost in terms of prediction accuracy.

8 EXPERIMENTS

We conduct experiments on two public datasets.

Credit 1²: It involves the problem of classifying whether a user will suffer from serious financial problems. It contains a total of 150000 instances and 10 attributes.

Credit 2³: It is also a credit scoring dataset, correlated with the task of predicting whether a user will make payments on time. It consists of 30000 instances and 25 attributes in all.

In our experiments, we use 2/3 of each dataset for training and the remainder for testing. We split the data vertically into two halves and distribute them to two parties. To fairly compare different methods, we set the maximum depth of each tree to 3, the fraction of samples used to fit individual regression trees to 0.8, and the learning rate to 0.3 for all methods. The Paillier encryption scheme is taken as our encryption scheme, with a key size of 512 bits. All experiments are conducted on a machine with 8GB RAM and an Intel Core i5-7200U CPU.

8.1 Scalability

Note that the efficiency of SecureBoost may be reflected by the rate of convergence and the runtime, which may be influenced by (1) the maximum depth of individual regression trees and (2) the size of the datasets. In this subsection, we conduct a convergence analysis as well as study the impact of these variables on the runtime of learning. All experiments are conducted on the dataset Credit 2.

First, we are interested in the convergence rate of our proposed system. We compare the convergence behavior of SecureBoost with non-federated tree-boosting counterparts, including GBDT⁴ and XGBoost⁵. As can be observed from Figure 4, SecureBoost shows a learning curve similar to the other non-federated methods on the training dataset and even performs slightly better than the others on the test dataset. In addition, the convergence behavior of the training and test loss of SecureBoost closely resembles that of GBDT and XGBoost.

[Fig. 4: Loss convergence. (a) Learning Curve; (b) Test Error. Loss versus the number of boosting stages for SecureBoost, GBDT, and XGBoost.]

Next, to investigate how the maximum depth of individual trees affects the runtime of learning, we vary the maximum depth of individual trees among {3, 4, 5, 6, 7, 8} and report the runtime of one boosting stage. As depicted in Figure 5 (a), the runtime increases almost linearly with the maximum depth of individual trees, which indicates that we can train deep trees with relatively little additional time. This is very appealing in practice, especially in big-data scenarios.

Finally, we study the impact of data size on the runtime of our proposed system. We augment the feature sets with feature products, fix the maximum depth of individual regression trees to 3, and vary the feature number in {50, 500, 1000, 5000} and the sample number in {5000, 10000, 30000}. We compare the runtime of one boosting stage to investigate how each variable affects the efficiency of the algorithm. We make similar observations on both Figure 5 (b) and Figure 5 (c), which imply that sample and feature numbers contribute equally to the running time. In addition, we can see that our proposed framework scales well even with relatively big data.

8.2 Performance of RL-SecureBoost

To investigate the performance of RL-SecureBoost in terms of both security and prediction accuracy, we aim to answer the following questions: (1) Does the first tree, built upon only the features held by the active party, learn enough information to reduce information leakage? (2) Does RL-SecureBoost suffer from a loss of accuracy compared with SecureBoost?

First, we study the security performance of RL-SecureBoost. Following the analysis in Section 7, we evaluate information leakage in terms of leaf purity; as the leaf purity of the first tree increases, the leaked information is reduced. Thereby, to verify the security of RL-SecureBoost, we have to illustrate that the first tree of RL-SecureBoost performs well enough to reduce the information leaked from the second tree. As shown in Table 1, we compare the mean leaf purity of the first tree with that of the second tree. In particular, the mean leaf purity is the weighted average $\sum_{i=1}^{k} \frac{n_i}{n} p_i$, where $k$ and $n$ represent the number of leaves and the total number of instances, and $p_i$ and $n_i$ denote the leaf purity and the number of instances associated with leaf $i$.

TABLE 1: First Tree vs. Second Tree in terms of Leaf Purity

    Mean Purity    Credit 1    Credit 2
    1st Tree       0.8058      0.7159
    2nd Tree       0.66663     0.638

According to Table 1, the mean leaf purity decreases significantly from the first to the second tree on both datasets, which reflects a great reduction in information leakage. Moreover, the mean leaf purity of the second tree is just over 0.6 on both datasets, which is good enough to ensure a safe protocol.

Next, to investigate the prediction performance of RL-SecureBoost, we compare RL-SecureBoost with SecureBoost with respect to both the first tree's performance and the overall performance. We consider commonly used metrics including accuracy (ACC), area under the ROC curve (AUC), and F1-score. The results are presented in Table 2. As observed, RL-SecureBoost performs equally well compared to SecureBoost in almost all cases. We also conduct a pairwise Wilcoxon signed-rank test between RL-SecureBoost and SecureBoost. The comparison results indicate that RL-SecureBoost is as accurate as SecureBoost at a significance level of 0.05. The lossless property is therefore still guaranteed for RL-SecureBoost.
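For reference, the mean leaf purity used in Table 1 is just the instance-weighted average of per-leaf purities. The sketch below implements that weighted average; the per-leaf (positive, negative) counts are hypothetical illustration data, not values from the experiments:

```python
def mean_leaf_purity(counts):
    """counts: list of (n_pos, n_neg) pairs, one per leaf."""
    # Leaf purity p_i = max(theta_i, 1 - theta_i), where theta_i is the
    # fraction of positives in leaf i; the mean purity is the
    # n_i-weighted average over all k leaves: sum_i (n_i / n) * p_i.
    n = sum(n_pos + n_neg for n_pos, n_neg in counts)
    total = 0.0
    for n_pos, n_neg in counts:
        n_i = n_pos + n_neg
        theta = n_pos / n_i
        total += (n_i / n) * max(theta, 1.0 - theta)
    return total

# A tree with perfectly pure leaves scores 1.0; perfectly mixed
# leaves score 0.5, which is the minimum possible value.
assert mean_leaf_purity([(10, 0), (0, 30)]) == 1.0
assert mean_leaf_purity([(5, 5), (15, 15)]) == 0.5
```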
2. https://www.kaggle.com/c/GiveMeSomeCredit/data
3. https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
4. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
5. https://github.com/dmlc/xgboost

9 CONCLUSION

In this paper, we proposed a lossless privacy-preserving tree-boosting algorithm, SecureBoost, to train a high-quality tree
[Fig. 5: Scalability Analysis of SecureBoost. (a) Runtime w.r.t. maximum depth of individual trees (depths 3 to 8); (b) Runtime w.r.t. feature size; (c) Runtime w.r.t. sample size. Curves shown for sample sizes {5000, 10000, 30000} and feature sizes {50, 500, 1000, 5000}.]
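The near-linear growth in Figure 5 (a) is consistent with level-wise split finding that scans each instance once per tree level and per candidate feature. The toy cost model below encodes that assumption (ours, not a measured model from the paper) and also reproduces the symmetry between sample and feature counts seen in Figure 5 (b) and (c):

```python
def stage_cost(depth, n_samples, n_features):
    # Toy model: one boosting stage with level-wise split finding
    # touches every instance once per level and per candidate feature,
    # i.e. cost ~ depth * n_samples * n_features (arbitrary units).
    return depth * n_samples * n_features

base = stage_cost(3, 30000, 500)
# Linear in depth (cf. Fig. 5 (a)): doubling the depth doubles the cost.
assert stage_cost(6, 30000, 500) == 2 * base
# Samples and features enter symmetrically (cf. Fig. 5 (b) and (c)).
assert stage_cost(3, 500, 30000) == base
```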
TABLE 2: Classification Performance for RL-SecureBoost vs. SecureBoost

                                      Credit 1                    Credit 2
    Model                     ACC     F1-score  AUC      ACC     F1-score  AUC
    1st Tree, SecureBoost     0.9298  0.012     0.7002   0.7806  0         0.6381
    1st Tree, RL-SecureBoost  0.9186  0         0.6912   0.7793  0         0.6320
    Overall, SecureBoost      0.9345  0.2576    0.8461   0.8180  0.4634    0.7701
    Overall, RL-SecureBoost   0.9331  0.2549    0.8423   0.8179  0.4650    0.7682
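The metrics in Table 2 follow their standard definitions. The sketch below gives dependency-free versions (a 0.5 threshold for ACC and F1, and AUC via the pairwise-ranking formulation); the toy labels and scores are illustrative only:

```python
def accuracy(y_true, y_pred):
    # Fraction of instances whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auc(y_true, scores):
    # Probability that a random positive outranks a random negative
    # (ties counted as 1/2), which equals the area under the ROC curve.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
preds = [1 if s >= 0.5 else 0 for s in scores]
assert accuracy(y, preds) == 0.5
assert f1_score(y, preds) == 0.5
assert auc(y, scores) == 0.75
```

Note that an F1-score of 0 (as in the first-tree rows of Table 2) simply means the model produced no true-positive predictions at the 0.5 threshold, even while its ACC and AUC remain informative.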
boosting model with private data split across multiple parties. We theoretically prove that our proposed framework is as accurate as its non-federated gradient tree-boosting counterparts. In addition, we analyze the information leakage during protocol execution and propose provable ways to reduce it.

ACKNOWLEDGMENT

This work was partially supported by the National Key Research and Development Program of China under Grant No. 2018AAA0101100.