Privacy-Preserving Feature Selection With Secure Multiparty Computation

arXiv:2102.03517v1 [cs.CR] 6 Feb 2021

Abstract—Existing work on privacy-preserving machine learning with Secure Multiparty Computation (MPC) is almost exclusively focused on model training and on inference with trained models, thereby overlooking the important data pre-processing stage. In this work, we propose the first MPC based protocol for private feature selection based on the filter method, which is independent of model training, and can be used in combination with any MPC protocol to rank features. We propose an efficient feature scoring protocol based on Gini impurity to this end. To demonstrate the feasibility of our approach for practical data science, we perform experiments with the proposed MPC protocols for feature selection in a commonly used machine-learning-as-a-service configuration where computations are outsourced to multiple servers, with semi-honest and with malicious adversaries. Regarding effectiveness, we show that secure feature selection with the proposed protocols improves the accuracy of classifiers on a variety of real-world data sets, without leaking information about the feature values or even which features were selected. Regarding efficiency, we document runtimes ranging from several seconds to an hour for our protocols to finish, depending on the size of the data set and the security settings.

Xiling Li is with the School of Engineering and Technology, University of Washington, Tacoma, WA, USA. Email: [email protected]
Rafael Dowsley is with the Faculty of Information Technology, Monash University, Clayton, Australia. Email: [email protected]
Martine De Cock is with the School of Engineering and Technology, University of Washington, Tacoma, WA, USA and Ghent University, Ghent, Belgium. Email: [email protected]

I. INTRODUCTION

Machine learning (ML) thrives because of the availability of an abundant amount of data, and of computational resources and devices to collect and process such data. In many effective ML applications, the data that is consumed during ML model training and inference is often of a very personal nature. Protection of user data has become a significant concern in ML model development and deployment, giving rise to laws to safeguard the privacy of users, such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Cryptographic protocols that allow computations on encrypted data are an increasingly important mechanism to enable data science applications while complying with privacy regulations. In this paper, we contribute to the field of privacy-preserving machine learning (PPML), a burgeoning and interdisciplinary research area at the intersection of cryptography and ML that has gained significant traction in tackling privacy issues.

In particular, we use techniques from Secure Multiparty Computation (MPC), an umbrella term for cryptographic approaches that allow two or more parties to jointly compute a specified output from their private information in a distributed fashion, without actually revealing their private information itself [27] to each other [12].

We consider the scenario where different data owners or enterprises are interested in training an ML model over their combined data. There is a lot of potential in training ML models over the aggregated data from multiple enterprises. First of all, training on more data typically yields higher quality ML models. For instance, one could train a more accurate model to predict the length of hospital stay of COVID-19 patients when combining data from multiple clinics. This is an application where the data is horizontally distributed, meaning that each data owner or enterprise has records/rows of the data. Furthermore, being able to combine different data sets enables new applications that pool together data from multiple enterprises, or even from different entities within the same enterprise. An example of this would be an ML model that relies on lab test results as well as healthcare bill payment information about patients, which are usually managed by different departments within a hospital system. This is an example of an application where the data is vertically distributed, i.e. each data owner has their own columns. While there are clear advantages to training ML models over data that is distributed across multiple data owners, often these data owners do not want to disclose their data to each other, because the data in itself constitutes a competitive advantage, or because the data owners need to comply with data privacy regulations. These roadblocks can even affect different departments within the same enterprise, such as different clinics within a healthcare system.

During the last decade, cryptographic protocols designed with MPC have been developed for training of ML models over aggregated data, without the need for the individual data owners or enterprises to reveal their data to anyone in an unencrypted manner. This existing work includes MPC protocols for training of decision tree models [26], [17], [11], [1], linear regression models [29], [15], [2], and neural network architectures [28], [3], [34], [21], [16]. Existing approaches assume that the data sets are pre-processed and clean, with features that have been pre-selected and constructed. In practical data science projects, model building constitutes only a small part of the workflow: real-world data sets must be cleaned and pre-processed, outliers must be removed, training features must be selected, and missing values need to be addressed before model training can begin. Data scientists are estimated to spend 50% to 80% of their time on data wrangling as opposed to model training. PPML solutions will not be adopted in practice if they do not encompass these data preparation steps. Indeed, there is little point in preserving the privacy of clean data sets during model training – which is currently already possible – if the raw data has to be leaked first to arrive at those clean data sets!
Fig. 1. Overview of private feature selection and model training in 3PC setting with computing servers (parties) Alice, Bob, and Carol.
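The secret-sharing mechanics behind Steps 1 and 3 of Fig. 1 (the replicated scheme detailed in Sec. II-B) can be illustrated with a plaintext Python sketch. The modulus and helper names here are illustrative only; a real deployment runs each party on a separate server inside an MPC framework, not in one insecure process:

```python
import random

Q = 2**32  # illustrative ring modulus for Z_q

def share(x):
    """Replicated secret sharing among P1, P2, P3:
    x1 + x2 + x3 = x mod Q, and party Pi holds a pair of shares
    (P1: x1,x2; P2: x2,x3; P3: x3,x1)."""
    x1 = random.randrange(Q)
    x2 = random.randrange(Q)
    x3 = (x - x1 - x2) % Q
    return [(x1, x2), (x2, x3), (x3, x1)]

def reconstruct(shares):
    """Any two parties jointly hold all three additive shares (Step 3)."""
    (x1, _), (x2, x3) = shares[0], shares[1]
    return (x1 + x2 + x3) % Q

def add(sh_x, sh_y):
    """Secure addition is local: each party adds its pairs component-wise."""
    return [((a1 + b1) % Q, (a2 + b2) % Q)
            for (a1, a2), (b1, b2) in zip(sh_x, sh_y)]

def mul(sh_x, sh_y):
    """Each party locally computes one additive share z_i of x*y.
    The resharing step back to replicated form (which needs communication)
    is omitted; for illustration we simply reconstruct the product."""
    z = [(xi * yi + xi * yj + xj * yi) % Q
         for (xi, xj), (yi, yj) in zip(sh_x, sh_y)]
    return sum(z) % Q
```

Note that no single pair of shares reveals anything about the secret; only by combining the views of two parties can a value be opened, matching the honest-majority guarantee described in the text.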
In this paper, we contribute to filling this gap in the open literature by proposing the first MPC based protocol for privacy-preserving feature selection. Feature selection is the process of selecting a subset of relevant features for model training [10]. Using a well chosen subset of features can lead to more accurate models, as well as efficiency gains during model training. A commonly used technique for feature selection is the so-called filter method, in which features are ranked according to a score indicative of their predictive ability, and subsequently the highest ranked features are retained. Despite its known shortcomings, including the fact that it considers each feature in isolation and ignores feature dependencies, the filter method is popular in practical data science because it is computationally very efficient and independent of any specific ML model architecture.

The MPC based protocol πFILTER−FS for private feature selection that we propose in this paper can be used in combination with any MPC protocol to rank features in a privacy-preserving manner. Well-known techniques to score features in terms of their informativeness include mutual information (MI), Gini impurity (GI), and Pearson's correlation coefficient (PCC). We propose an efficient feature scoring protocol πMS−GINI based on Gini impurity, leaving the development of privacy-preserving protocols for other feature scoring techniques as future work. The computation of a GI score for continuous valued features traditionally requires sorting of the feature values to determine candidate split points in the feature value range. As sorting is an expensive operation to perform in a privacy-preserving way, we instead propose a "mean-split Gini score" (MS-GINI) that avoids the need for sorting by selecting the mean of the feature values as the split point. As we show in Sec. V, feature selection with MS-GINI leads to accuracy improvements that are on par with those obtained with GI, PCC, and MI in the data sets used in our experiments. Depending on the application and the data set at hand, one may want to use a different feature scoring technique in combination with our protocol πFILTER−FS for private feature selection.

Fig. 1 illustrates the flow of private feature selection and subsequent model training at a high level in an outsourced "ML as a service" setting with three computing servers, nicknamed Alice, Bob, and Carol (three-party computation, 3PC). 3PC with an honest majority, i.e. with at most one server being corrupted, is a configuration that is often used in MPC because this setup allows for some of the most efficient MPC schemes. In Step 1 of Fig. 1, each of m data owners sends secret shares of their data to the three servers (parties). While the secret shared data can be trivially revealed by combining shares, no information about the data is revealed by the shares received by any single server, meaning that none of the servers by themselves learns anything about the actual values of the data. In Step 2A, the three servers execute protocols πMS−GINI and πFILTER−FS to create a reduced version of the data set that contains only the selected features. Throughout this process, none of the parties learns the values of the data or even which features are selected, as all computations are done over secret shares. Next, in Step 2B, the parties train an ML model over the pre-processed data using existing privacy-preserving training protocols, e.g., a privacy-preserving protocol for logistic regression training [16]. Finally, in Step 3, the servers can disclose the trained model to the intended model owner by revealing their shares. Steps 1 and 3 are trivial as they follow directly from the choice of the underlying MPC scheme (see Sec. II-B). MPC protocols for Step 2B have previously been proposed. The focus of this paper is on Step 2A. Our approach works in scenarios where the data is horizontally partitioned (each data owner has one or more of the rows or instances), scenarios where the data is vertically partitioned (each data owner has some of the columns or attributes), or any other partition.

After presenting preliminaries about Gini impurity and MPC in Sec. II, and discussing related work in Sec. III, we present our main protocol πFILTER−FS for private feature selection and the supporting protocols πGINI−FS and πMS−GINI in Sec. IV. In Sec. V we demonstrate the feasibility of our approach for practical data science in terms of accuracy and runtime results through experiments executed on real-world data sets. In our experiments, we consider honest-majority 3PC settings with semi-honest as well as malicious adversaries. While parties corrupted by semi-honest adversaries follow the protocol instructions correctly but try to obtain additional information, parties corrupted by malicious adversaries can deviate from the protocol instructions. Defending against the latter comes at a higher computational cost which, as we show, can be mitigated by using a recently proposed MPC scheme for 4PC.

II. PRELIMINARIES

A. Feature Selection based on Gini Impurity

Assume that we have a set S of m training examples, where each training example consists of an input feature vector (x1, ..., xp) and a corresponding label y. Throughout this paper, we assume that there are n possible class labels. We wish to induce an ML model from this training data that can
infer, for a previously unseen input feature vector, a label y as accurately as possible. Not all p features may be equally beneficial to this end. In the filter approach to feature selection, all features are first assigned a score that is indicative of their predictive ability. Subsequently, only the best scoring features are retained. A well-known feature scoring criterion is Gini impurity, made popular as part of the classification and regression tree algorithm (CART) [7].

If the jth feature Fj is a discrete feature that can assume ℓ different values, then it induces a partition S1 ∪ S2 ∪ ... ∪ Sℓ of S in which Si is the set of instances that have the ith value for the jth feature. The Gini impurity of Si is defined as:

  G(Si) = Σ_{c=1}^{n} pc · (1 − pc) = 1 − Σ_{c=1}^{n} pc²    (1)

where pc is the probability of a randomly selected instance from Si belonging to the cth class. The Gini score of feature Fj is a weighted average of the Gini impurities of the Si's:

  G(Fj) = Σ_{i=1}^{ℓ} (|Si| / m) · G(Si)    (2)

Conceptually, G(Fj) estimates the likelihood of a randomly selected instance being misclassified based on knowledge of the value of the jth feature. During feature selection, the k features with the lowest Gini scores are retained.

If Fj is a feature with continuous values, then G(Fj) is defined as the weighted average of the Gini impurities of a set S≤θ containing all instances for which the jth feature value is smaller than or equal to θ, and a set S>θ with all instances for which the jth feature value is larger than θ. In the CART algorithm, an optimal threshold θ is determined based on sorting of all the instances on their feature values. Since privacy-preserving sorting is a time-consuming operation in MPC [6], [20], in Sec. IV-B we propose a more straightforward approach for threshold selection which, as we show in Sec. V, yields desirable improvements in accuracy.

B. Secure Multiparty Computation

Protocols for MPC enable a set of parties to jointly compute the output of a function over each of the parties' private inputs, without requiring the parties to disclose their inputs to anyone. MPC is concerned with the protocol execution coming under attack by an adversary which may corrupt parties to learn private information or cause the result of the computation to be incorrect. MPC protocols are designed to prevent such attacks from being successful, and use proven cryptographic techniques to guarantee privacy.

Adversarial Model: An adversary A can corrupt any number of parties. In a dishonest-majority setting, half or more of the parties may be corrupt, while in an honest-majority setting, more than half of the parties are honest (not corrupted). Furthermore, A can be a semi-honest or a malicious adversary. While a party corrupted by a semi-honest or "passive" adversary follows the protocol instructions correctly but tries to obtain additional information, parties corrupted by malicious or "active" adversaries can deviate from the protocol instructions.

The protocols in Sec. IV are sufficiently generic to be used in dishonest-majority as well as honest-majority settings, with passive or active adversaries. This is achieved by changing the underlying MPC scheme to align with the desired security setting. Some of the most efficient MPC schemes have been developed for 3 parties, out of which at most one is corrupted. We evaluate the runtime of our protocols in this honest-majority 3PC setting, which is growing in popularity in the PPML literature, e.g. [14], [24], [31], [34], and we demonstrate how even better runtimes can be obtained with a recently proposed MPC scheme for 4PC with one corruption [13].

In the MPC schemes used in this paper, all computations by the parties (servers) are done over integers in a ring Zq. Raw data in ML applications is often real-valued. As is common in the MPC literature, we convert real numbers to integers using a fixed-point representation [9]. After this conversion, the data owners secret share their values with the parties using a secret sharing scheme, and the parties proceed by performing operations over the secret shares.

For the passive 3PC setting, we follow a replicated secret sharing scheme from Araki et al. ([4]). To share a secret value x ∈ Zq among parties P1, P2 and P3, the shares x1, x2, x3 are chosen uniformly at random in Zq with the constraint that x1 + x2 + x3 = x mod q. P1 receives x1 and x2, P2 receives x2 and x3, and P3 receives x3 and x1. Note that it is necessary to combine the shares available to two parties in order to recover x, and no information about the secret shared value x is revealed to any single party. For short, we denote this secret sharing by [[x]]q. Let [[x]]q, [[y]]q be secret shared values and c be a constant; the following computations can be done locally by the parties without communication:
• Addition (z = x + y): Each party Pi gets shares of z by computing zi = xi + yi and z(i+1 mod 3) = x(i+1 mod 3) + y(i+1 mod 3). This is denoted by [[z]]q ← [[x]]q + [[y]]q.
• Subtraction ([[z]]q ← [[x]]q − [[y]]q) is performed analogously.
• Multiplication by a constant (z = c · x): Each party multiplies its local shares of x by c to obtain shares of z. This is denoted by [[z]]q ← c · [[x]]q.
• Addition of a constant (z = x + c): P1 and P3 add c to their share x1 of x to obtain z1, while the parties set z2 = x2 and z3 = x3. This is denoted by [[z]]q ← [[x]]q + c.

The main advantage of replicated secret sharing compared to other secret sharing schemes is that replicated shares enable a very efficient procedure for multiplying secret shared values. To compute x · y = (x1 + x2 + x3)(y1 + y2 + y3), the parties locally perform the following computations: P1 computes z1 = x1·y1 + x1·y2 + x2·y1, P2 computes z2 = x2·y2 + x2·y3 + x3·y2, and P3 computes z3 = x3·y3 + x3·y1 + x1·y3. By doing so, without any interaction, each Pi obtains a zi such that z1 + z2 + z3 = x · y mod q. After that, the parties are required to convert from this additive secret sharing representation back to the original replicated secret sharing representation (which requires that the parties add a secret sharing of zero and that each party sends one share to one other party, for a total communication of three shares). See [4] for more details.

In the active 3PC setting, we use the MPC scheme SYReplicated2k recently proposed by Dalskov et al. ([13]). In this MPC scheme, the parties are prevented from deviating from the protocol and from gaining knowledge from other parties
through the use of information-theoretic message authentication codes (MACs). In addition to computations over secret shares of the data, the parties also perform computations required for MACs. See [13] for details. Finally, we use the MPC scheme recently proposed by Dalskov et al. ([13]) for the active 4PC setting, where the computations are outsourced to four servers out of which at most one has been corrupted by a malicious adversary.

Building Blocks: Building on the cryptographic primitives listed above for addition and multiplication of secret shared values, MPC protocols for other operations have been developed in the literature. In this paper, we use:
• Secure matrix multiplication πDMM: at the start of this protocol, the parties have secret sharings [[A]] and [[B]] of matrices A and B; at the end, the parties have a secret sharing [[C]] of the product of the matrices, C = A × B. πDMM can be constructed as a direct extension of the secure multiplication protocol for two integers, which we will denote as πDM in the remainder of the paper. Similarly, we use πDP to denote the protocol for the secure dot product of two vectors. In a replicated sharing scheme, dot products can be computed more efficiently than by the direct extension from πDM, and matrix multiplication can use this optimized version of dot products; we refer to Keller ([23]) for details.
• Secure comparison protocol πLT [8]: at the start of this protocol, the parties have secret sharings [[x]] and [[y]] of two integers x and y; at the end, they have a secret sharing of 1 if x < y, and a secret sharing of 0 otherwise.
• Secure argmin protocol πARGMIN: this protocol accepts secret sharings of a vector of integers and returns a secret sharing of the index at which the vector has the minimum value. πARGMIN is straightforwardly constructed using the above mentioned secure comparison protocol.
• Secure equality test protocol πEQ [9]: at the start of this protocol, the parties have secret sharings [[x]] and [[y]] of two integers x and y; at the end, they have a secret sharing of 1 if x = y, and a secret sharing of 0 otherwise.
• Secure division protocol πDIV [9]: at the start of this protocol, the parties have secret sharings [[x]]q and [[y]]q of two integers x and y; at the end, they have a secret sharing [[z]]q of z = x/y.

III. RELATED WORK

[30] proposed a more principled 2PC protocol with Paillier homomorphic encryption for private feature selection with χ2 as filter criterion in the semi-honest setting, without an experimental evaluation of the proposed approach. To the best of our knowledge, private feature selection with malicious adversaries has not yet been proposed or evaluated. The recent approach by [35] is not based on cryptography, does not provide any formal privacy guarantees, and leaks information through disclosure of intermediate representations.

Secure Gini Score Computation: Besides as a technique to score features for feature selection, as we do in this paper, Gini impurity is traditionally used in ML in the CART algorithm for training decision trees [7], and it has been adopted in MPC protocols for privacy-preserving training of decision tree models [17], [11], [1]. Gini score computation for continuous valued features, as we do in this paper, is especially challenging from an MPC point of view, as it requires sorting of feature values to determine candidate split points in the feature range. Abspoel et al. ([1]) put ample effort into performing this sorting process as efficiently as possible in a secure manner. We take a drastically different approach by assuming that the mean of the feature values serves as a good approximation for an optimal split threshold. This has the double advantage that (1) there is no need for oblivious sorting of feature values, and (2) for each feature only one Gini score for one threshold θ has to be computed, as opposed to computing the Gini score for multiple candidate thresholds and then selecting the best one through secure comparisons. This leads to significant efficiency gains, while preserving good accuracy, as we demonstrate in Sec. V.

Protocol 1 Protocol πFILTER−FS for Secure Filter based Feature Selection
Input: A secret shared m × p data matrix [[D]]q, a secret shared p-length score vector [[G]]q, the number k < p of features to be selected, and a constant t that is bigger than the highest possible score in [[G]]q
Output: a secret shared m × k matrix [[D′]]q
 1: for i ← 1 to k do
 2:   [[I[i]]]q ← πARGMIN([[G]]q)
 3:   for j ← 1 to p do
 4:     [[flagk]]q ← πEQ([[I[i]]]q, j)
 5:     [[T[j][i]]]q ← [[flagk]]q
 6:     [[G[j]]]q ← [[G[j]]]q + πDM([[flagk]]q, t − [[G[j]]]q)
 7:   end for
 8: end for
 9: [[D′]]q ← πDMM([[D]]q, [[T]]q)
10: return [[D′]]q
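To make the data flow of Protocol 1 concrete, here is a plaintext (and hence non-private) Python rendering of its logic. Python's min stands in for πARGMIN, a public equality test for πEQ, and ordinary arithmetic for πDM and πDMM; variable names follow the pseudocode, and t is any value larger than every score:

```python
def filter_fs(D, G, k, t):
    """Plaintext sketch of Protocol 1: select the k lowest-scoring
    features via equality flags and a matrix product, mirroring the
    oblivious structure (no data-dependent branching)."""
    m, p = len(D), len(D[0])
    G = list(G)                        # working copy of the score vector
    T = [[0] * k for _ in range(p)]    # p x k one-hot selection matrix
    for i in range(k):
        idx = min(range(p), key=lambda j: G[j])  # stands in for pi_ARGMIN
        for j in range(p):                       # Lines 3-7
            flag = 1 if j == idx else 0          # stands in for pi_EQ
            T[j][i] = flag
            G[j] = G[j] + flag * (t - G[j])      # oblivious overwrite with t
    # Line 9: D' = D x T keeps only the selected columns
    return [[sum(D[r][j] * T[j][i] for j in range(p)) for i in range(k)]
            for r in range(m)]
```

On the data of Example 1 below (G = [65, 26, 83, 14], k = 2), this returns the reduced matrix of Equation (3). In the secure protocol the same arithmetic is carried out over secret shares, so neither the flags nor the selected indices are ever revealed.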
features. At the end of the protocol, the parties have a reduced matrix D′ of size m × k in which only the columns from D corresponding to the lowest scores in G are retained (note that this protocol can be trivially modified to select the k features with the highest scores). The main ideas behind the protocol (which is described in Protocol 1) are to:
1) Determine the indices of the features that need to be selected (these are stored in a secret-shared way in I).
2) Create a matrix T in which the columns are one-hot-encoded representations of these indices.
3) Multiply D with this feature selection matrix T.
Before walking through the pseudocode of Protocol 1, we present a plaintext example to illustrate the notation.

Example 1. Consider the data matrix D at the left of Equation (3), containing values for m = 5 instances (rows) and p = 4 features (columns). Assume that the feature score vector is G = [65, 26, 83, 14] and that we want to select the k = 2 features with the lowest scores in G.

  [ 1  2  3  4]   [0 0]   [ 4  2]
  [ 5  6  7  8]   [0 1]   [ 8  6]
  [ 9 10 11 12] · [0 0] = [12 10]    (3)
  [13 14 15 16]   [1 0]   [16 14]
  [17 18 19 20]           [20 18]
        D           T        D′

The lowest scores in G are 14 and 26, hence the 4th and the 2nd column of D should be selected. The columns of T in Equation (3) are a one-hot-encoding of 4 and 2 respectively, and multiplying D with T will yield the desired reduced data matrix D′. This multiplication takes place on Line 9 of Protocol 1. The bulk of Protocol 1 is about how to construct T based on G. As explained below, this process involves an auxiliary vector which, at the end of the protocol, contains the following values for our example: I = [4, 2].

In the protocol, vector [[I]]q of length k stores the indices of the k selected features out of the p features of [[D]]q, and matrix [[T]]q is a p × k transformation matrix that eventually holds one-hot-encodings of the indices in I. Through executing Lines 1-8 of Protocol 1, the parties construct a feature selection matrix T based on the values in G. On Line 2 the index of the ith smallest value in [[G]]q is identified. To this end, the parties run a secure argmin protocol πARGMIN. The inner for-loop serves two purposes, namely constructing the ith column of matrix T, and overwriting the score in G of the feature that was selected on Line 2 by the upper bound, so that it will not be selected anymore in further iterations of the outer for-loop (such an upper bound t is passed as input to Protocol 1 and is usually very easy to determine in practice, as most common feature scoring techniques range between 0 and 1):
• To construct the ith column of T, the parties loop through rows j = 1 ... p and, on Line 5, update T[j][i] with either a 0 or a 1, depending on the outcome of the secure equality test on Line 4. The outcome of this test will be 1 exactly once, namely when j equals I[i], hence Line 5 results in a one-hot-encoding of I[i] stored in the ith column of T.
• The flag flagk computed on Line 4 is used again on Line 6 to overwrite G[I[i]] with t in an oblivious manner, where t is a value that is larger than the highest possible score that occurs in [[G]]q. This theoretical upper bound t ensures that feature I[i] will not be selected again in later iterations of the outer for-loop.

As is common in MPC protocols, we use multiplication instead of control flow logic for conditional assignments. To this end, a condition-based branch operation such as "if c then a ← b" is rephrased as a ← a + c · (b − a). In this way, the number and the kind of operations executed by the parties do not depend on the actual values of the inputs, so they do not leak information that could be exploited by side-channel attacks. Such a conditional assignment occurs on Line 6 of Protocol 1, where the value of the condition c itself is computed on Line 4. In the final step, on Line 9, the parties multiply matrix D with matrix T in a secure manner to obtain a matrix D′ that contains only the feature columns corresponding to the k best features. Throughout this process, the parties are unaware of which features were actually selected. The secret shared matrix D′ can subsequently be used as input for a privacy-preserving ML model training protocol, e.g. [16].

B. Secure Feature Score Computation

Protocol πFILTER−FS assumes the availability of a feature score vector G and an upper bound t for the values in G. Below we explain how these can be obtained from the data in a secure manner. To this end, we present a protocol πMS−GINI for computation of the score of a feature based on Gini impurity. This protocol is applicable to data sets with continuous features. It is computationally cheaper than previously proposed protocols for Gini impurity that rely on sorting of feature values. Furthermore, as shown in previous work [25] and in Sec. V, the "Mean-Split" Gini score can yield similar accuracy improvements.

Recall that we have a set S of m training examples, where each training example consists of an input feature vector (x1, ..., xp) and a corresponding label y. We propose to split the set of values of the jth feature Fj based on its mean value as a threshold θ. We denote by S≤θ the set of instances that have xj ≤ θ, and by S>θ the set of instances that have xj > θ. Furthermore, for c = 1, ..., n, we denote by Lc the set of examples from S that have class label y = c. Based on this binary split, we define the MS-GINI ("Mean-Split" Gini) score for feature Fj as:

  G(Fj) = (1/m) · (|S≤θ| · G(S≤θ) + |S>θ| · G(S>θ))    (4)

with the Gini impurities of S≤θ and S>θ defined as:

  G(S≤θ) = 1 − Σ_{c=1}^{n} (pc≤θ)² ;  G(S>θ) = 1 − Σ_{c=1}^{n} (pc>θ)²    (5)

and the probabilities defined as:

  pc≤θ = |S≤θ ∩ Lc| / |S≤θ| ;  pc>θ = |S>θ ∩ Lc| / |S>θ|    (6)

Formulas (4), (5) and (6) are consistent with the definition of the Gini score given in Sec. II, and are presented here in more detail to enhance the readability of our secure protocol πMS−GINI for the computation of the Gini score G(F) of a feature F (described in Protocol 2).

At the start of Protocol πMS−GINI, the parties have secret shares of a feature column F (think of this as a column from data matrix D in Example 1), as well as secret shares of a one-hot-encoded version of the label vector. The latter is represented
Protocol 2 Protocol πMS−GINI for Secure MS-GINI Score of 13-14 can be performed locally by the parties, on their own
a Feature shares. Moving the computation of [[A[n]]]q and [[B[n]]]q out
Input: A secret shared feature column [[F ]]q = ([[f1 ]]q ,[[f2 ]]q ,...,[[fm ]]q ), a of the for-loop, reduces the number of secure multiplications
secret shared m × (n − 1) label-class matrix [[L]]q , where m is the number
of instances and n is the number of classes. needed from m × n to m × (n − 1). In the case of a binary
Output: MS-GINI score [[G(F )]]q of the feature F classification problem, i.e. n = 2, this means that the number
1
1: [[θ]]q ← ([[f1 ]]q + [[f2 ]]q + ... + [[fm ]]q ) · m of secure multiplications required is cut down by half.
2: Initialize [[a]]q , [[b]]q , [[A]]q and [[B]]q with zeros. Using the notations for the counters from the pseudocode
3: for i ← 1 to m do
4: [[f lags ]]q ← πLT ([[θ]]q , [[fi ]]q ) of Protocol 2, Equation (4) comes down to:
5: [[b]]q ← [[b]]q + [[f lags ]]q
6: for j ← 1 to n − 1 do 1 Xn
A[j] 2
Xn
B[j] 2
7: [[f lagm ]]q ← πDM ([[f lags ]]q , [[L[i][j]]]q ) G(F ) = · a· 1−
+ b· 1−
m j=1
a b
8: [[B[j]]]q ← [[B[j]]]q + [[f lagm ]]q j=1
1 1 1
9: [[A[j]]]q ← [[A[j]]]q + [[L[i][j]]]q − [[f lagm ]]q = · a− ·A•A + b− ·B•B
10: end for m a b
11: end for
12: [[a]]q ← m − [[b]]q in which A•A and B •B are the dot products of A and B with
13: [[A[n]]]q ← [[a]]q − ([[A[1]]]q + ... + [[A[n − 1]]]q ) themselves, respectively. These computations are performed
14: [[B[n]]]q ← [[b]]q − ([[B[1]]]q + ... + [[B[n − 1]]]q ) by the parties on Lines 15-17 using, among other things, the
15: [[G(S≤θ )]]q ← [[a]]q − πDM ( πDP ([[A]]q , [[A]]q ), πDIV (1, [[a]]q ))
protocol πDP for secure dot product of vectors, and the protocol
16: [[G(S>θ )]]q ← [[b]]q − πDM ( πDP ([[B]]q , [[B]]q ), πDIV (1, [[b]]q ))
17: [[G(F )]]q ← [[G(S≤θ )]]q + [[G(S>θ )]]q πDIV for secure division. We note that the final multiplication
18: return [[G(F )]]q with the factor 1/m is omitted altogether from Protocol 2 as
this will have no effect on the relative ordering of the scores
of the individual features.
as a label-class matrix [[L]]q, in which [[L[i][j]]]q = [[1]]q means that the label of the ith instance is equal to the jth class. Otherwise, [[L[i][j]]]q = [[0]]q. We note that, while there are n classes, it is sufficient for L to contain only n − 1 columns: as there is exactly one value 1 per row, the value of the nth column is implicit from the values of the other columns. We indirectly take advantage of this fact by terminating the loop on Lines 6-10 at n − 1, and performing the calculations for the nth class separately and in a cheaper manner on Lines 13-14, as we explain in more detail below.

On Line 1, the parties compute [[θ]]q as a threshold to split the input feature [[F]]q, namely as the mean of the feature values in the column. To this end, each party first sums up the secret shares of the feature values, and then multiplies the sum with the known constant 1/m locally. Line 2 initializes all counters related to S≤θ and S>θ to zero. After Line 14, these counters will contain the following values:

  a = |S≤θ|
  b = |S>θ|
  A[j] = |S≤θ ∩ Lj|, for j = 1 . . . n
  B[j] = |S>θ ∩ Lj|, for j = 1 . . . n

These counters are needed for the probabilities in Equation (6). For each instance, on Line 4 of Protocol 2, the parties perform a secure comparison to determine whether the instance belongs to S>θ. The outcome of that test is added to b on Line 5. Since the total number of instances is m, a can be straightforwardly computed as m − b after the outer for-loop, i.e. on Line 12. Lines 7-8 check whether the instance belongs to S>θ ∩ Lj, in which case B[j] is incremented by 1. The equivalent operation of Lines 7-8 for A[j] would be [[A[j]]]q ← [[A[j]]]q + πDM((1 − [[flag_s]]q), [[L[i][j]]]q). We have simplified this instruction on Line 9, taking advantage of the fact that πDM([[flag_s]]q, [[L[i][j]]]q) has been precomputed as [[flag_m]]q on Line 7.

On Lines 13-14 the parties compute [[A[n]]]q and [[B[n]]]q, leveraging the fact that the sum of all values in [[A]]q is [[a]]q, and the sum of all values in [[B]]q is [[b]]q. All operations on Lines 15-17 then turn these counters into the MS-GINI score of the feature.

If data are vertically partitioned and all data owners have the label vector, they can compute MS-GINI scores offline without πMS−GINI, and the computing servers would only have to do feature selection based on pre-computed MS-GINI scores with Protocol πFILTER−FS. In practice, however, it is often not reasonable to allow each data owner to have all labels, so we do not assume this scenario in our protocols.

C. Secure Feature Selection with MS-GINI

Protocol πGINI−FS (described in Protocol 3) performs secure filter-based feature selection with MS-GINI, as used for the experiments in this work. It combines the building blocks presented earlier in this section. By executing the loop on Lines 1-3, the parties compute the MS-GINI score of the ith feature from the original data matrix [[D]]q using Protocol πMS−GINI, and store it in [[G[i]]]q. On Line 4, the parties perform filter-based feature selection using Protocol πFILTER−FS to obtain an m × k matrix [[D′]]q with the k selected features from [[D]]q. As the standard GINI score is upper bounded by 1, and πMS−GINI omits the multiplication by 1/m for efficiency reasons, it is safe to use m as the upper bound that is passed to Protocol πFILTER−FS on Line 4.

Protocol 3 Protocol πGINI−FS for Secure Filter-based Feature Selection with MS-GINI
Input: A secret shared m × p data matrix [[D]]q = ([[F1]]q, [[F2]]q, ..., [[Fp]]q), a secret shared m × (n − 1) label-class matrix [[L]]q, where m is the number of instances, p the number of features, n the number of classes, and k the number of features to be selected.
Output: a secret shared m × k matrix [[D′]]q
1: for i ← 1 to p do
2:   [[G[i]]]q ← πMS−GINI([[Fi]]q, [[L]]q, m, n)
3: end for
4: [[D′]]q ← πFILTER−FS([[D]]q, [[G]]q, k, m)
5: return [[D′]]q

V. EXPERIMENTS AND RESULTS

The first four columns of Table I contain details for three data sets corresponding to binary classification tasks with continuous
TABLE I
FEATURE SELECTION ACCURACY AND RUNTIME RESULTS

         |    data set details    |     logistic regression accuracy results     |              runtime
Data set |  m    | p   | k   | #folds | RAW    | MS-GINI | GI     | PCC    | MI     | passive 3PC | active 3PC | active 4PC
CogLoad  |  632  | 120 | 12  |   6    | 50.90% | 52.50%  | 52.70% | 48.57% | 51.59% |   50 sec    |  163 sec   |    79 sec
LSVT     |  126  | 310 | 103 |  10    | 80.09% | 86.15%  | 82.74% | 78.89% | 85.38% |   60 sec    |  254 sec   |    89 sec
SPEED    | 8,378 | 122 | 67  |  10    | 95.24% | 97.26%  | 95.56% | 95.89% | 95.83% |  949 sec    | 3,634 sec  | 1,435 sec
[6] Dan Bogdanov, Sven Laur, and Riivo Talviste. Oblivious sorting of secret-shared data. Technical Report, 2013.
[7] Leo Breiman, Jerome Friedman, Charles Stone, and Richard Olshen. Classification and Regression Trees. Taylor and Francis, 1st edition, 1984.
[8] O. Catrina and S. De Hoogh. Improved primitives for secure multiparty integer computation. In International Conference on Security and Cryptography for Networks, pages 182–199. Springer, 2010.
[9] O. Catrina and A. Saxena. Secure computation with fixed-point numbers. In 14th International Conference on Financial Cryptography and Data Security, volume 6052 of Lecture Notes in Computer Science, pages 35–50. Springer, 2010.
[10] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
[11] C.A. Choudhary, M. De Cock, R. Dowsley, A. Nascimento, and D. Railsback. Secure training of extra trees classifiers over continuous data. In AAAI-20 Workshop on Privacy-Preserving Artificial Intelligence, 2020.
[12] Ronald Cramer, Ivan Bjerre Damgard, and Jesper Buus Nielsen. Secure Multiparty Computation and Secret Sharing. Cambridge University Press, 1st edition, 2015.
[13] A. Dalskov, D. Escudero, and M. Keller. Fantastic four: Honest-majority four-party secure computation with malicious security. Cryptology ePrint Archive, Report 2020/1330, 2020.
[14] A. Dalskov, D. Escudero, and M. Keller. Secure evaluation of quantized neural networks. Proceedings on Privacy Enhancing Technologies, 2020(4):355–375, 2020.
[15] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, and Stacey C. Newman. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In 8th ACM Workshop on Artificial Intelligence and Security (AISec), pages 3–14, 2015.
[16] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, Davis Railsback, Jianwei Shen, and Ariel Todoki. High performance logistic regression for privacy-preserving genome analysis. BMC Medical Genomics, 14(1):23, 2021.
[17] Sebastiaan De Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In International Conference on Financial Cryptography and Data Security, pages 179–194. Springer, 2014.
[18] Raymond Fisman, Sheena S. Iyengar, Emir Kamenica, and Itamar Simonson. Gender differences in mate selection: Evidence from a speed dating experiment. The Quarterly Journal of Economics, 121(2):673–697, 2006.
[19] Martin Gjoreski, Tine Kolenik, Timotej Knez, Mitja Luštrek, Matjaž Gams, Hristijan Gjoreski, and Veljko Pejović. Datasets for cognitive load inference using wearable sensors and psychological traits. Applied Sciences, 10(11):38–43, 2020.
[20] M. Goodrich. Zig-zag sort: A simple deterministic data-oblivious sorting algorithm running in O(n log n) time. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 684–693, 2014.
[21] Chuan Guo, Awni Hannun, Brian Knott, Laurens van der Maaten, Mark Tygert, and Ruiyu Zhu. Secure multiparty computations in floating-point arithmetic. arXiv preprint arXiv:2001.03192, 2020.
[22] Yasser Jafer, Stan Matwin, and Marina Sokolova. A framework for a privacy-aware feature selection evaluation measure. In 13th Annual Conference on Privacy, Security and Trust (PST), pages 62–69. IEEE, 2015.
[23] Marcel Keller. MP-SPDZ: A versatile framework for multi-party computation. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 1575–1590, 2020.
[24] N. Kumar, M. Rathee, N. Chandran, D. Gupta, A. Rastogi, and R. Sharma. CrypTFlow: Secure TensorFlow inference. In 41st IEEE Symposium on Security and Privacy, 2020.
[25] Xiling Li and Martine De Cock. Cognitive load detection from wrist-band sensors. In Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, pages 456–461, 2020.
[26] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Annual International Cryptology Conference, pages 36–54. Springer, 2000.
[27] Steven Lohr. For big-data scientists, 'janitor work' is key hurdle to insights. The New York Times, 2014.
[28] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE Symposium on Security and Privacy (SP), pages 19–38, 2017.
[29] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE Symposium on Security and Privacy (SP), pages 334–348, 2013.
[30] Vanishree Rao, Yunhui Long, Hoda Eldardiry, Shantanu Rane, Ryan A. Rossi, and Frank Torres. Secure two-party feature selection. arXiv preprint arXiv:1901.00832, 2019.
[31] M.S. Riazi, C. Weinert, O. Tkachenko, E.M. Songhori, T. Schneider, and F. Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Asia Conference on Computer and Communications Security, pages 707–721, 2018.
[32] Mina Sheikhalishahi and Fabio Martinelli. Privacy-utility feature selection as a privacy mechanism in collaborative data classification. In IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pages 244–249, 2017.
[33] Athanasios Tsanas, Max A. Little, Cynthia Fox, and Lorraine O. Ramig. Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22(1):181–190, 2014.
[34] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: 3-party secure computation for neural network training. Proceedings on Privacy Enhancing Technologies (PoPETs), 2019(3):26–49, 2019.
[35] Xiucai Ye, Hongmin Li, Akira Imakura, and Tetsuya Sakurai. Distributed collaborative feature selection based on intermediate representation. In International Joint Conference on Artificial Intelligence, pages 4142–4149, 2019.