Abstract—The randomization methods that are applied for privacy-preserving data mining are commonly subject to reconstruction, linkage, and semantic-related attacks. Some existing works employed random noise addition to realize probabilistic anonymity, aiming only at linkage attacks. Random noise addition is vulnerable to reconstruction attacks, and is unable to achieve semantic closeness, particularly on high-dimensional data, to prevent semantic-related attacks. For linkage attacks, the main security vulnerability of their proposed probabilistic anonymity lies in the assumption that the attacker has a priori knowledge of the quasi-identifiers of all individuals. When only some individuals leak their quasi-identifiers, the proposed model becomes incapable, because the attacker can deploy a different linkage attack that has not been studied before. This type of attack is much easier to deploy and is thus very harmful. In this paper, we propose new frameworks of probabilistic (1,k)- and (k,k)-anonymity to defend against all these linkage attacks, and realize the frameworks on a hybrid randomization model. The model is also secure against reconstruction attacks. We further achieve statistical semantic closeness of high-dimensional data to prevent semantic-related attacks on the model. The frameworks also allow us to re-design the traditional K-nearest neighbors algorithm to leverage the introduced data uncertainty and improve the mining results. Our work demonstrates promising applications in large-scale and high-dimensional data mining in clouds, by providing high efficiency and security to protect data privacy, while guaranteeing high data utility for mining purposes, on-time processing and non-interactive data publishing.

Index Terms—randomization, k-anonymity, privacy protection, data mining

Yingpeng Sang is with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, China. E-mail: [email protected].
Hong Shen is with the School of Data and Computer Science, Sun Yat-sen University, China, and the School of Computer Science, The University of Adelaide, Australia. E-mail: [email protected].
Hui Tian is with the School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing, 100044, China. E-mail: [email protected].
Zonghua Zhang is with Institute Mines-Telecom/TELECOM Lille, and CNRS UMR 5157 SAMOVAR Lab, France. E-mail: [email protected].

1 INTRODUCTION

Privacy-preserving data mining (PPDM) consists of implementing data mining tasks such as classification, clustering and association rule mining without leaking the sensitive information of the data owners. It is required in many applications, such as customer classification ([33]), disease outbreak identification ([32]), and financial crime detection. Privacy-preserving data publishing (PPDP), as another related field, is distinct from PPDM in that data owners generally do not know how the data will be used when the data are preprocessed and released. Although many methods (e.g., random noise addition, k-anonymity) have been proposed for PPDP, the two are actually cross fields, and the utility of such methods for data mining has been measured. Therefore, in this paper, we are not constrained by the difference between PPDM and PPDP, and we focus on typical methods for PPDM and also on those that were originally proposed for PPDP but have utility for PPDM.

Randomization methods, including random noise addition and random projection, have been proposed to protect data privacy ([13],[22]), and these methods have also been applied in other relevant fields, such as privacy-preserving queries on encrypted graph data ([4]) and privacy-preserving multimedia retrieval ([23]). With a linear cost, they are very suitable for processing data at collection time, data accumulated at a large scale, and data with high dimensions. However, their security in terms of privacy preservation still lacks a thorough investigation. In general, privacy-preserving methods are subject to the following three types of attacks:

1) Reconstruction attack: without knowing any attribute values in the original record, the attacker attempts to recover them from the corresponding disguised record using methods such as principal component analysis and maximum a posteriori estimation.
2) Linkage attack: after obtaining some attribute values of a victim, the attacker attempts to link a unique disguised record to the victim, from which the other sensitive attribute values of the victim will be disclosed.
3) Semantic-related attack: in cases where more than one disguised record is linked to one victim, the attacker can inspect the records' sensitive values, because the skewness and similarity of these values can leak certain private information. More details about these attacks are discussed in Section 2.

In this paper, we study the enforcement of the security of randomization against these attacks. The randomization model that we use is a hybrid model in which the data are first randomly projected and then random noise is added. Both of these methods are vulnerable to reconstruction attacks when they are applied individually ([13],[34]), whereas the hybrid method is considerably stronger. Our focus is on preventing linkage and semantic-related attacks in this hybrid model, which requires a probabilistic version of k-anonymity. To clarify the concept of this probabilistic anonymity, we briefly compare it with deterministic k-anonymity below. Details of the concept are discussed in Section 3.

Deterministic k-Anonymity. The private record of an individual commonly has four types of attributes: explicit identifiers, quasi-identifiers, sensitive attributes and non-sensitive attributes. Simply removing explicit identifiers, such as passport number, staff ID, and name, hardly makes the individual anonymous. The quasi-identifiers (QIs) can be exploited in combination to re-identify the individual, e.g.,
"marital status", "sex", and "working hours per week" in a patient's record (as shown in Table 1(a)). Some sensitive attributes should be kept intact for data mining purposes, e.g., the Yes/No value for "hypertension" in Table 1(a) is used as a class label. If an attacker can successfully re-identify a patient by his/her QIs, he will be able to infer his/her hypertension status.

k-anonymity processes the QIs and sensitive attributes such that each released record is indistinguishable among at least k individuals. In the example presented in Table 1(b), for a patient's marital status, k-anonymity can generalize it to a value in the new domain {"been married", "never married"} or suppress it by completely removing it (such as replacing it with '*'). At least k original records should be clustered into the same group T and released in the same form on their QIs. In addition, each group should be invulnerable to semantic-related attacks. In the example presented in Table 1(b), if all members in a group have "Yes" for hypertension, the attacker will certainly know this sensitive information for a patient when the latter is uniquely linked with this group.

TABLE 1
An Example

(a) The Original Private Table

Marital Status   Sex   Hours   Hypertension
divorced         M     35      Y
married          M     40      N
...              ...   ...     ...

(b) After 2-anonymity

Marital Status   Sex   Hours   Hypertension
have married     M     *       (1Y,1N)
...              ...   ...     ...
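To make the generalization and suppression step concrete, the following minimal Python sketch produces Table 1(b) from Table 1(a) for k = 2; the generalization map and the helper names are our illustrative assumptions, not the paper's algorithm.

```python
from collections import Counter

# Coarser domain for "marital status", following the example above.
GENERALIZE = {"divorced": "been married", "married": "been married",
              "widowed": "been married", "single": "never married"}

def two_anonymize(records, k=2):
    """Generalize marital status, suppress hours, and verify that each
    released QI combination is shared by at least k individuals."""
    released = [{"marital": GENERALIZE[r["marital"]],
                 "sex": r["sex"],
                 "hours": "*",                      # suppression
                 "hypertension": r["hypertension"]} for r in records]
    groups = Counter((r["marital"], r["sex"], r["hours"]) for r in released)
    assert all(c >= k for c in groups.values()), "a group is smaller than k"
    return released

table_a = [{"marital": "divorced", "sex": "M", "hours": 35, "hypertension": "Y"},
           {"marital": "married",  "sex": "M", "hours": 40, "hypertension": "N"}]
print(two_anonymize(table_a))
```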
Probabilistic Anonymity. Randomization can also make data anonymous, but the anonymity level cannot be deterministic. Suppose that an individual has an m-attribute record X = (x1, ..., xm) on his/her QIs. X can be randomized to a record Z by random noise addition or random projection (i.e., linearly transforming X from a higher-dimensional vector space to a lower one using a random projection matrix). Let Table T′ be a randomization of Table T. One major difference between randomization and k-anonymity is that there is no obvious anonymization group in randomization, and the attacker thus cannot link a victim to any group of size k. Therefore, the attacker must perform a probabilistic linkage attack by inferring the potential fit between the victims and the randomized records. Probabilistic anonymity is proposed for this type of linkage attack.

Suppose that Table T has N records. It is easy to see that the best anonymity is achieved when the potential fit between any victim and any randomized record is 1/N. However, this type of anonymity is difficult to achieve for both randomization and deterministic k-anonymity methods. In practice, we can simply reduce the linkage probability of every true fit and increase the probabilities of some false fits, such that the true fit is disguised among the false fits and the attacker cannot effectively distinguish them. There are two types of probabilistic anonymity, as follows:

• Probabilistic (k,1)-Anonymity: for every true fit, there are at least k − 1 false victims in expectation who have higher linkage probabilities when fitting to the randomized record in the true fit.
• Probabilistic (1,k)-Anonymity: for every true fit, there are at least k − 1 false randomized records in expectation that have higher linkage probabilities when fitting to the victim in the true fit.

Our Contributions. The work reported in [1] only achieved a probabilistic (k, 1)-anonymity in the randomization model of noise addition. In comparison, our contributions in this paper include the following:

1) We propose the framework of probabilistic (1, k)-anonymity and discuss its unique application scenario, to which (k, 1)-anonymity does not apply. That is, when the attacker can only obtain the QIs of a limited number of victims, it is (1, k)-anonymity, not (k, 1)-anonymity, that should be strengthened to prevent linkage attacks.
2) We achieve two types of probabilistic anonymity, (1, k)- and (k, 1)-anonymity, on a hybrid randomization model. This model is secure against reconstruction attacks. Methods for calibrating the randomization parameters in the model are proposed to realize the two types of anonymity. We also analyze the theoretical bounds of the semantic closeness in the anonymity.
3) We re-design the algorithm of K-nearest neighbors classification to suit the randomized data. Current studies generally apply traditional classification algorithms without considering the uncertainty or errors introduced by the privacy-preserving preprocessing.

The remainder of this paper is organized as follows. Section 2 discusses related works on privacy-preserving data mining and k-anonymity. Section 3 presents a detailed definition and explanation of the probabilistic anonymity. Section 4 and Section 5 respectively discuss how to realize the probabilistic (1, k)-anonymity and (k, 1)-anonymity in the hybrid randomization model. Section 6 analyzes the semantic closeness in the randomization model. Section 7 describes how to leverage the noise introduced by the randomization model in K-NN classification. Section 8 conducts experiments using our methods and algorithm and performs comparisons with some other related works. Section 9 concludes the paper.

2 RELATED WORK

A considerable amount of research has been conducted on k-anonymity. A few reviews under different taxonomies can be found in [2], [9], [19], [35]. Regarding the methods of transforming the values, there are generalization, suppression, bucketization, and so forth. Based on whether the transformed values strictly satisfy some information-loss metrics, there are optimal k-anonymity (e.g., [21]) and approximate k-anonymity (e.g., [27]). We primarily review the related works in terms of time complexity, privacy and utility concerns.

Time Complexity. Determining an optimal solution for k-anonymity with minimal information loss is very time consuming. It has been proven to be NP-hard by suppressing the attributes and cells of QIs ([25]), generalizing cells of QIs ([17]), and ensuring an l-diversity on sensitive attributes ([21]). Many approximate solutions have been proposed to improve efficiency, trading off some non-optimal information loss. Given the size of data set |T|, their complexities are generally O(|T|^a). For example, as proven in [25], a is a constant of 3, and in [27], a = k. The Incognito algorithm in [16] has a complexity exponential in the size of QIs. The Mondrian algorithm achieved a complexity
of O(|T| · log |T|) by employing a greedy strategy. The algorithms in [15] achieved a logarithmic approximation ratio with small time costs, but how to strengthen their security toward semantic closeness is still unknown.

(k, 1)-, (1, k)- and (k, k)-anonymities have also been realized in [11] by local recoding on QIs. Local recoding generates indirect equivalence classes by applying different generalization approaches to the same attribute values in different instances, whereas global recoding uses the same approach on all records, and the equivalence classes are obvious. In [11], (k, 1)-anonymity means that each anonymized record is a generalization of at least k original records. (1, k)-anonymity means that at least k anonymized records are generalizations of each original record; these anonymized records form an equivalence class for the corresponding original record, but their values on the same attribute are not necessarily equal. These concepts are very similar to those of this paper, whereas the essential difference is in the anonymization approach and complexity. The algorithms in [11] to achieve these properties have a runtime of O(k · |T|²).

Privacy Models. As commonly assumed in the related work of k-anonymity, an attacker can successfully link a victim with a unique group of size k by checking the victim's QIs. Subsequently, it was noted in [24] that this actually gives the attacker a chance to infer the victim's sensitive attributes when there is little diversity in the sensitive attributes of the linked group. In [18], further concerns were discussed, namely, that a mere diversity in the sensitive values is still not sufficient and that skewness and similarity attacks are still feasible. Diversity may change the skewness of the distribution of a sensitive attribute, e.g., a 2-diversity may show that anyone in the equivalence class would be considered to have a 50% possibility of being positive, compared with the 1% of the overall population. If the sensitive attribute values in an equivalence class are semantically similar (e.g., "gastric ulcer", "gastritis" and "stomach cancer"), then the attacker can still learn important information (e.g., the victim must have some stomach-related problems).

To prevent skewness and similarity attacks, the work of [18] proposed to ensure a semantic closeness inside a group, in which the distribution of a sensitive attribute in any group is enforced to be close to its distribution in the overall table. However, all of these concerns require solutions of higher complexity.

Probabilistic k-anonymity was also defined in [29], based on clustering QIs and local swapping inside each cluster, which is different from our method based on randomizing QIs. Semantic-related vulnerabilities may still exist inside some clusters. The authors updated their method based on differential privacy in [30]. We compare the latter with ours with respect to data utility and time cost in Section 8.4.

For randomization methods, reconstruction attacks should also be prevented. Given only the randomized records, the attacker may attempt to recover the values of the attributes. The attacking techniques include principal component analysis, which can recover the uncorrelated components in the attributes, and maximum a posteriori, which searches for the maximizer X̃ of the posterior probability P(X̃|Z) and makes X̃ the recovery of the randomized record Z. Using these techniques, random additive noise and random projection are vulnerable, as noted by [13] and [34], respectively.

Utility Concerns. As summarized in [9], many anonymization models and tools have been provided for the general purpose of privacy-preserving data publishing, without a specialized and focused aim of data mining. In many cases, these previous studies only applied the traditional mining algorithms, which performed well on the original data, to the anonymized data, without leveraging the uncertainty introduced by the anonymizations. A few exceptions are [14] and [1].

Differential Privacy. Differential privacy is a framework that aims to prevent an attacker from inferring the existence of any target record in the database given the published results on the database, regardless of how much background knowledge the attacker has obtained. For privacy-preserving data publishing, a non-interactive method based on differential privacy was proposed in [26], which utilizes top-down techniques to divide the database into some groups that are similar to the anonymization groups in k-anonymity, while the count of each group is perturbed to satisfy the differential privacy. Recently, how to improve the data utility of differential privacy in non-interactive settings was discussed in [30]. However, the complexity of their approach is still quadratic in the number of records. Specifically, the complexity is O(|T|²/k).

Even under differential privacy, some attacks can be successful, although they do not contradict its claims ([15], [5], [6]). As discussed in [15], when records are highly correlated, the attacker may still infer a record's sensitive value from the published results, even though the attacker cannot infer the existence of this record in the database. The same limitation applies to the method of [26], e.g., when the records in a group are highly correlated and the group is dominated by one sensitive value. The reason is that the attacker can successfully link a record to this group by observing that the QIs of the record belong to the generalized range of the group; then, the attacker will infer that the record may have the dominant sensitive value, and his guess may be correct with high probability due to the high correlation among the members of this group. Therefore, further strengthening of semantic security in these groups is still necessary. In [6], classifiers based on data published under differential privacy can still help attackers infer sensitive information about an individual. Such an inference attack will fail if some method is employed to prevent accurate classifiers from being constructed. In Section 7, we present a viable method.

In contrast to the existing works on deterministic/plain k-anonymity and differential privacy, our work does not support queries on randomized data, but this can be compensated by our advantages of real-time data publication, easy handling of high-dimensional data, high security, and high classification accuracy. These features are highly demanded properties for data-processing techniques in the era of big data, whereas they are not well provided by the existing solutions that support both data queries and data mining. The linear complexity of our solution is achieved by the additive and multiplicative randomization on each attribute value. The strong privacy protection is achieved by effectively preventing reconstruction, linkage, skewness and similarity attacks. The high classification accuracy is achieved by reducing the uncertainty introduced by the randomization model.

A preliminary version of this paper has been published in [28], which only focused on one type of linkage attack. This journal version covers more properties of security and data utility for the proposed method and presents more theoretical proofs.
Fig. 1. Anonymization Model

Fig. 2. Examples of Multiple-to-One and One-to-Multiple Linkage Attacks

Our linear randomization model is shown in Fig. 1. It has the following specific parameters:

• Each random entry in R1 independently and identically follows the Gaussian distribution N(0, σ1²);
• Each random entry in R2 independently and identically follows the Gaussian distribution N(0, σ2²).

Attributes of Mixed Types: In cases where the table to be published has attributes of mixed types, including numerical and categorical values, they can be transformed such that the distance metrics for continuous values, e.g., Euclidean distance, are still applicable. An attribute with M categorical values can be transformed into M binary attributes based on the method in Chapter 7.12 of [12].
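The hybrid model itself is straightforward to implement. The following Python sketch is a minimal illustration of the projection-plus-noise step Z = X·R1 + R2 on one segment; the sizes m, d, N and the variances are arbitrary demonstration values, not recommended settings.

```python
import numpy as np

def randomize_segment(X, d, sigma1, sigma2, rng=None):
    """Hybrid randomization of one segment: Z = X @ R1 + R2.
    X: (N, m) original records; R1: (m, d) projection; R2: (N, d) noise."""
    rng = rng or np.random.default_rng()
    N, m = X.shape
    R1 = rng.normal(0.0, sigma1, size=(m, d))   # entries i.i.d. N(0, sigma1^2)
    R2 = rng.normal(0.0, sigma2, size=(N, d))   # entries i.i.d. N(0, sigma2^2)
    return X @ R1 + R2, R1

# Demo: m = 10 attributes projected to d = 5; sigma1 = sqrt(1/d) keeps
# E(R1 R1^T) = I (see the parameter selection summary in Section 4).
X = np.random.default_rng(0).uniform(size=(100, 10))
Z, R1 = randomize_segment(X, d=5, sigma1=np.sqrt(1 / 5), sigma2=1.0)
print(Z.shape)  # (100, 5)
```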
3.2 Potential Fit and Linkage Attacks

Let nN be the number of records in the entire data set, which is divided into n segments, where each segment has N records. In Section 4.3, we demonstrate that segmenting is not compulsory and when segmenting should be performed. In this section, we use segmenting and focus on a single segment, simply to make this case clear. Our method can be trivially extended to all segments.

Suppose that in a single segment the publisher has N records (X1, C1), ..., (XN, CN) and then randomizes them to (Z1, C1), ..., (ZN, CN) and publishes the latter. The attacker obtains all of the published data. From some public channel, the attacker also obtains the QIs of some victims, i.e., a few Xj (j ∈ {1, ..., N}). The attacker is curious about their sensitive values Cj; therefore, he measures the potential fit between one particular (Zi, Ci) (i ∈ {1, ..., N}) and one particular Xj, denoted by F(Zi → Xj). He determines the potential fit by the posterior probability PF(Xj|Zi).

In a Multiple-to-One (M2O) linkage attack, the attacker targets one randomized record Zi and orders the potential fits of all the victims in X (or in the leaked subset Xsub) to Zi. To compare two potential fits PF(Xj1|Zi) and PF(Xj2|Zi), the attacker needs to compare PX(Xj1)PR(Zi|Xj1) and PX(Xj2)PR(Zi|Xj2) using Bayes' theorem. Therefore, the attacker should be aware of PX(Xj1) and PX(Xj2), which come from the prior probability of X. The attacker may estimate a probability density function by obtaining an Xsub of a sufficient size. The work in [1] only assumes that X has a uniform distribution. However, in many scenarios, it is difficult to obtain X or to have prior knowledge of its probability distribution, which makes the M2O linkage attack impractical.

One-to-Multiple (O2M) Linkage Attack. When the attacker obtains the QIs of a victim, such as Xi, he can perform an O2M linkage attack on Xi by ordering the potential fits of Xi to all the randomized records in Z. The attacker will finally fit Xi to Zj if PF(Xi|Zj) is maximal. For example, in Fig. 2(b), the attacker orders PF(X1|Z1), PF(X1|Z2), PF(X1|Z3), PF(X1|Z4) and finds the maximum PF(X1|Z1) = 0.4.

To compare two potential fits PF(Xi|Zj1) and PF(Xi|Zj2), PX(Xi) is not required, because it is sufficient to compare PR(Zj1|Xi)/PZ(Zj1) and PR(Zj2|Xi)/PZ(Zj2) using Bayes' theorem. In some randomization models, PZ(·) is not difficult to estimate, e.g., in the linear model Z = XR1 + R2, Z approaches a spherical multivariate Gaussian distribution. In cases where the attacker only obtains a small subset Xsub that is not sufficient for probability density estimation, the O2M linkage attack is more applicable than the M2O linkage attack.
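The O2M attack just described can be simulated directly. The sketch below is our simplification, not the paper's procedure: it assumes the attacker knows the variances σ1 and σ2 of the model, marginalizes the Gaussian R1 and R2 so that Z given Xi becomes a zero-mean spherical Gaussian with per-dimension variance ||Xi||²σ1² + σ2², and approximates PZ(·) from Z itself, as discussed above.

```python
import numpy as np

def o2m_scores(Xi, Z, sigma1, sigma2):
    """Rank randomized records Zj for one victim Xi by PR(Zj|Xi)/PZ(Zj).
    Marginalizing R1 and R2, each dimension of Z|Xi is modeled as
    N(0, ||Xi||^2 sigma1^2 + sigma2^2)."""
    d = Z.shape[1]
    s2 = float(Xi @ Xi) * sigma1 ** 2 + sigma2 ** 2  # variance of Z|Xi per dim
    v = Z.var()                                      # estimated variance of PZ
    zz = np.sum(Z ** 2, axis=1)
    log_pr = -zz / (2 * s2) - 0.5 * d * np.log(2 * np.pi * s2)
    log_pz = -zz / (2 * v) - 0.5 * d * np.log(2 * np.pi * v)
    return np.argsort(log_pr - log_pz)[::-1]         # indices, best fit first

# e.g., o2m_scores(X[0], Z, sigma1, sigma2)[0] is the attacker's best guess
```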
The attacker can also combine the above two attacks to perform a bidirectional linkage attack. If some fit has high probabilities in both attacks (e.g., F(Z1 → X1) in Fig. 2(a) and Fig. 2(b)), then the attacker's confidence in the truthfulness of the fit will be increased. More details will be discussed in Section 3.3.
Note that for both M2O and O2M attacks, to calculate the potential fit (or linkage probability), the attacker should ultimately rely on the posterior probability PF(X|Z), not only the conditional probability P(Z|X). In Bayesian inference, the latter is the likelihood of observing Z given certain QIs of X; it does not consider the uncertainty brought by the other QIs in the generation of Z. Posterior probability was also employed in [20] and [31] to measure the linkage probability between some QIs and a sensitive value.

3.3 Probabilistic Anonymity

It is easy to see that the M2O or O2M attack is the most fruitful if the true fit F(Zi → Xi) has the largest probability among all fits F(Zi → Xj) for j = 1, ..., N or among all fits F(Zj → Xi) for j = 1, ..., N, and is the least fruitful if all fits F(Zi → Xj) for i = 1, ..., N and j = 1, ..., N have equal probabilities. Therefore, to protect a record Xi ∈ X (or Xi ∈ Xsub) against an M2O (or O2M) linkage attack, an intuitive approach is to enforce an ordering of the list {PF(X1|Zi), ..., PF(XN|Zi)} (or {PF(Xi|Z1), ..., PF(Xi|ZN)}) in which PF(Xi|Zi) is not the largest. Probabilistic anonymity follows this idea and makes this ordering non-deterministic, whereas the expected order of PF(Xi|Zi) in the list is kept at a stable value k ∈ (1, N]. We have the following two frameworks against M2O and O2M linkage attacks.

Definition 1. Probabilistic (k, 1)-Anonymity: A record Xi has a probabilistic (k, 1)-anonymity if the probability of the true fit F(Zi → Xi) is not greater than the probabilities of at least k − 1 false fits in expectation between Zi and the other records in X, i.e.,

PF(Xj|Zi) ≥ PF(Xi|Zi), ∃J ⊆ {1, ..., N}\i, ∀j ∈ J,   (3a)

E[|J|] ≥ k − 1.   (3b)

In Eq. (3a), J is the index set of all Xj with no less fit to Zi than Xi. In Eq. (3b), E[|J|] is the expectation of the size of J. Definition 1 is the same as the probabilistic k-anonymity defined in [1], whereas the following Definition 2 has not been previously addressed.

Definition 2. Probabilistic (1, k)-Anonymity: A record Xi has a probabilistic (1, k)-anonymity if the probability of the true fit F(Zi → Xi) is not greater than the probabilities of at least k − 1 false fits in expectation between Xi and the other records in Z, i.e.,

PF(Xi|Zj) ≥ PF(Xi|Zi), ∃J ⊆ {1, ..., N}\i, ∀j ∈ J,   (4a)

E[|J|] ≥ k − 1.   (4b)

Even when (k, 1)-anonymity has been realized, the attacker may combine the two attacks. A bidirectional linkage attack proceeds as follows:

1) As shown in Fig. 3(a), an attacker first launches an M2O attack on Z2 and finds some fits in X to Z2 with high linkage probabilities. Because (k, 1)-anonymity favors the conclusion that a high (or low) linkage probability does not necessarily mean a true (or false) fit, the attacker has no confidence regarding which fit is true or false.
2) The attacker has to use auxiliary information to increase his confidence, and an O2M attack may help. Suppose that F(Z2 → X1) is among the fits with high probability in the above M2O attack. As shown in Fig. 3(b), the attacker compares the probability of F(Z2 → X1) with F(Zj → X1) (j ≠ 2). If (1, k)-anonymity does not exist, F(Z2 → X1) may keep a lower probability than many of the other fits; then, the attacker will become confident in the falseness of F(Z2 → X1).
3) The attacker repeats Step 2). The more false fits the attacker eliminates, the closer he approaches the true one.

Fig. 3. An Example of Bidirectional Linkage Attack. (a) M2O Linkage Attack. (b) O2M Linkage Attack.

Therefore, (1, k)-anonymity is also necessary even if (k, 1)-anonymity has been realized. As shown in Fig. 3(b), with (1, k)-anonymity, the attacker cannot confirm the falseness of F(Z2 → X1) after he becomes aware that some fits F(Zj → X1) have higher probabilities than F(Z2 → X1).

4 ACHIEVING THE PROBABILISTIC (1, k)-ANONYMITY

When a small Xsub is leaked to the attacker, the probabilistic (1, k)-anonymity is sufficient for privacy protection since the attacker can only perform an O2M attack. In this section, we discuss how to achieve the probabilistic (1, k)-anonymity on every record in X in the linear randomization model presented in Fig. 1. Because the O2M linkage attack is based on the ordering of the potential fits of all randomized records to one victim Xi, in this section, we first explore the randomness of this ordering, and we then analyze how to calibrate the linear randomization model to achieve the probabilistic (1, k)-anonymity on any Xi ∈ X.
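Operationally, Definition 2 can be checked empirically on randomized data. The following sketch (our illustration, under the same modeling assumptions as the earlier attack simulation; not the paper's procedure) estimates, for each victim Xi, the size of the set J of Eq. (4a), i.e., how many false randomized records outrank the true fit; probabilistic (1, k)-anonymity asks that this count be at least k − 1 in expectation.

```python
import numpy as np

def empirical_1k_level(X, Z, sigma1, sigma2):
    """Average order of the true fit over all records: an empirical
    estimate of the achieved (1,k)-anonymity level. For each Xi, the
    scores are log PR(Zj|Xi) - log PZ(Zj) up to i-constant terms."""
    v = Z.var()
    zz = np.sum(Z ** 2, axis=1)
    counts = []
    for i in range(X.shape[0]):
        s2 = float(X[i] @ X[i]) * sigma1 ** 2 + sigma2 ** 2
        scores = -zz / (2 * s2) + zz / (2 * v)
        counts.append(int(np.sum(scores >= scores[i])) - 1)  # |J| for record i
    return 1.0 + float(np.mean(counts))

# After calibration, empirical_1k_level(X, Z, sigma1, sigma2) should be >= k.
```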
Time Cost. Suppose that the size of the entire data set is nN, which can be divided into n segments, where each segment has N records. In fact, Steps 1)∼3) can be performed off-line, which means that k, λmax and R1 can be configured before the randomization begins. For each segment of size N, the randomization cost is O(mdN); thus, the time cost on the entire data set is O(mdnN), which is linear with regard to the total number of records (i.e., nN).

Fig. 4. Numeric Integration and Examples of the Contour Lines. (a) Numeric integration I(λi, λj). (b) The contour lines for I = 1.26 and I = −1.26 (λmax = 8.77).

The detailed numerical procedure of our method is given in the following.
By Theorem 2, any pair of λi and λj (i ≠ j) selected from {λ1, λ2, ..., λN} should satisfy Eq. (8). To restrict I(λi, λj) within the required range, by Theorem 1, λi and λj need to be bounded, which means that R1 and σ2 should be calibrated. Our method consists of using numeric integration to calculate I, finding the two contour lines at Imin = ((k−1)/(N−1) − 1/2)π and Imax = (1/2 − (k−1)/(N−1))π, and then bounding λi and λj inside the two lines.

It can be observed that regardless of how R1 and σ2 are calibrated, all λ1, λ2, ..., λN are non-negative because λi = Xi R1 R1^T Xi^T / σ2² = ||Xi R1||² / σ2².

Indeed, the monotonicity of the function I(λi, λj) shows that when all λ1, λ2, ..., λN are bounded within the domain (0, λmax], I(λi, λj) on any pair of λi and λj is also bounded within [((k−1)/(N−1) − 1/2)π, (1/2 − (k−1)/(N−1))π]. By the following Lemma 3, if λj is held constant, I is monotonically decreasing in the other variable λi, and if λi is held constant, I is monotonically increasing in λj. Therefore, the contour line at Imin = ((k−1)/(N−1) − 1/2)π starts from the point at λi = λmax,1 and λj = 0, and on this line, λj monotonically increases with λi. Similarly, the contour line at Imax = (1/2 − (k−1)/(N−1))π starts from the point at λj = λmax,2 and λi = 0, and on this line, λi monotonically increases with λj. It is also easy to see that λmax,1 = λmax,2. As in the example presented in Fig. 4(b), let k = 5 and N = 41; then Imin = −1.26 and Imax = 1.26, and the two contour lines are shown in the figure. By the numeric integration, λmax,1 and λmax,2 should be 8.77. It is easy to see that any point in the shaded area of Fig. 4(b) satisfies Eq. (8).

After λmax is found, it is easy to calibrate R1 and σ2. In fact, R1 can be first generated, and then σ2 is bounded from below as in Eq. (9), i.e., σ2 ≥ σ2,min = √(M/λmax), in which M is the maximal value in ||Xi R1||² for i = 1, ..., N.

It should be noticed that in Theorem 2 the anonymity level k < (N − 1)/2 + 1. For the case where k ≥ (N − 1)/2 + 1, it is not easy to make every record Xi satisfy the probabilistic (1, k)-anonymity. The reason is that, by Theorem 1, when ||Xi||² ≥ σRP²/σ1², I(λi, λj) should be within [((k−1)/(N−1) − 1/2)π, π/2), whereas when ||Xi||² < σRP²/σ1², I(λi, λj) should be within (−π/2, (1/2 − (k−1)/(N−1))π]. The intersection of the two ranges is null. For example, when k = 37, N = 41, and ||Xi||² ≥ σRP²/σ1², it is not easy to calibrate σ2 to generate the required λi and λj such that I(λi, λj) ∈ [1.26, π/2). How to perform this calibration is not the focus of this paper.

Parameter selection summary. The dimension d of R1 is suggested to satisfy m ≥ 2d − 1, following the result of [22] (details can be found in their Theorem 4.4). The selection of σ1 can be based on their Lemma 5.2. That is, because E(R1 R1^T) = dσ1² I, we can choose σ1 = √(1/d) such that E(R1 R1^T) equals the identity matrix I and E(Xi R1 R1^T Xi^T) = Xi Xi^T; thus, better utility of the data can be ensured.

The anonymity level k can be selected by practical requirements, e.g., 5~50. The selection of N should make k < (N − 1)/2 + 1 and facilitate an appropriate value of λmax (1~9.2 is suggested). When N is large, we suggest segmenting the records into groups to retain high data utility. The reason is that M in Eq. (9) may be very large without segmenting, and then a large σ2,min is produced. A small variance generally means that small random noise may be added. The large σ2,min is suitable for only a small fraction of records (i.e., those Xi producing large values on Xi R1 R1^T Xi^T) but applied to all records. By segmenting, σ2,min can be determined individually for each group. In addition, the segmenting does not need to consider relationships among the records; thus, the most convenient way is grouping by their collection order.
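Once λmax is known (e.g., the value 8.77 found by numeric integration for k = 5, N = 41), the calibration of σ2 reduces to a simple computation. The sketch below is illustrative only: it takes λmax as given, uses hypothetical helper names, and derives the per-segment lower bound of Eq. (9) as reconstructed above, segmenting by collection order as suggested.

```python
import numpy as np

def sigma2_lower_bound(X_seg, R1, lambda_max):
    """Lower bound on sigma2 for one segment: requiring
    lambda_i = ||Xi R1||^2 / sigma2^2 <= lambda_max for all i gives
    sigma2 >= sqrt(M / lambda_max), with M = max_i ||Xi R1||^2."""
    M = np.max(np.sum((X_seg @ R1) ** 2, axis=1))
    return np.sqrt(M / lambda_max)

def randomize_all(X, R1, lambda_max, n_segments, rng=None):
    """Segment by collection order and calibrate sigma2 per segment."""
    rng = rng or np.random.default_rng()
    out = []
    for X_seg in np.array_split(X, n_segments):
        s2 = sigma2_lower_bound(X_seg, R1, lambda_max)
        noise = rng.normal(0.0, s2, (X_seg.shape[0], R1.shape[1]))
        out.append(X_seg @ R1 + noise)
    return np.vstack(out)
```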
The simulation of successful O2M attacks. When the lower bound on σ2 shown in Eq. (9) is not abided by, O2M attacks could succeed. We only demonstrate the attack for k = 2; the other cases for k > 2 can be trivially derived. Without knowing the bound, a data owner would select an arbitrary σ2 for randomization, and it could be very likely for him to select a value such that σ2² < M/λmax. When k and N are given, λmax is a fixed value that bounds I(λi, λj) on each pair (λi, λj) within [((k−1)/(N−1) − 1/2)π, (1/2 − (k−1)/(N−1))π]. In fact, λmax does not need to be large (say 50∼100) to suit any value of k (k ≥ 2) and commonly used N (e.g., N = 10000). However, M could be very large because it is the maximal value in ||Xi R1||² for i = 1, ..., N. Therefore, it is common for M/λmax > 1, and it is also common to select a σ2 less than 1. Then, in the data set {Xi | i = 1, ..., N}, there is a subset that satisfies σ2²λmax < ||Xi R1||² ≤ M. It is this subset (denoted by X′) that is vulnerable to the O2M attack, because for each Xi ∈ X′, its corresponding λi = ||Xi R1||²/σ2² > λmax. For each record Xi ∈ X′, there could be some other record Xj that makes (λi, λj) fall on a contour line with I < I(λmax, 0) in Fig. 4; then, by Theorem 2 and Lemma 2, the probabilistic (1, 2)-anonymity (k = 2) could not be achieved. In Section 8.1, we demonstrate that when σ2 is too small, the percentage of X′ will be high.

5 ACHIEVING THE PROBABILISTIC (k, 1)-ANONYMITY

As discussed in Section 3.2, it is difficult for the attacker to perform a successful M2O linkage attack. The reason is that the attacker generally possesses only a small subset of X and cannot accurately determine the required prior knowledge on the p.d.f. of X. Similarly, the realization of (k, 1)-anonymity also requires the same prior knowledge on X. In this section, we discuss one representative case used in [1]: X follows a uniform distribution. This case is particularly applicable when the attacker has no prior knowledge on PX(X) and thus assumes that X follows a uniform distribution. For the other types of distributions, the realization method can be similarly derived.

In the same way as for (1, k)-anonymity, to achieve (k, 1)-anonymity on every Zi, we need to reduce the probability of F(Zi → Xi) and increase the probability of F(Zi → Xj) (j ≠ i) to ensure that some false fits always exist that have higher probabilities than the true one. By the same paradigm as in Lemma 2, a sufficient condition for (k, 1)-anonymity is as follows:

PO( PF(Xj|Zi) ≥ PF(Xi|Zi) ) ≥ (k − 1)/(N − 1),

in which PO( PF(Xj|Zi) ≥ PF(Xi|Zi) ) is the occurrence probability of the event PF(Xj|Zi) ≥ PF(Xi|Zi).

When X follows the uniform distribution, i.e., PX(Xi) = PX(Xj), by Bayes' theorem, the event PF(Xj|Zi) ≥ PF(Xi|Zi) is equivalent to PR(Zi|Xj) ≥ PR(Zi|Xi). Then, another sufficient condition for (k, 1)-anonymity is as follows:

PO( PR(Zi|Xj) ≥ PR(Zi|Xi) ) ≥ (k − 1)/(N − 1).

By the following theorem, we need to calibrate the values of σ1 and σ2 to realize the required occurrence probability for (k, 1)-anonymity.

Theorem 3. Let Gσ1,σ2(Xi, Xj) be a benchmark function on any two records Xi, Xj in X whose Euclidean norms are not equal (||Xi||² ≠ ||Xj||², i ≠ j), defined as follows:

Gσ1,σ2(Xi, Xj) = (d/2) · (||Xj||²σ1² + σ2²) / ((||Xi||² − ||Xj||²)σ1²) · log( (||Xi||²σ1² + σ2²) / (||Xj||²σ1² + σ2²) ).

When X follows the uniform distribution, a sufficient condition to achieve (k, 1)-anonymity is that the values σ1 and σ2 should be assigned to make the benchmark function satisfy the following:

min Gσ1,σ2(Xi, Xj) ≥ Qχ²,(k−1)/(N−1),   (10)

max Gσ1,σ2(Xi, Xj) ≤ Qχ²,(N−k)/(N−1),   (11)

in which Qχ²,(k−1)/(N−1) and Qχ²,(N−k)/(N−1) are respectively the (k−1)-th and (N−k)-th (N−1)-quantiles of the chi-squared distribution with d degrees of freedom. Let Xmin_norm and Xmax_norm be the records in X with the smallest and largest Euclidean norms. Then, min Gσ1,σ2(Xi, Xj) is obtained at Xi = Xmax_norm and Xj = Xmin_norm, and max Gσ1,σ2(Xi, Xj) is obtained at Xi = Xmin_norm and Xj = Xmax_norm.

The proof of this theorem is postponed to Appendix C. In brief, this theorem provides the two inequalities shown in Eq. (10) and (11) to constrain σ1 and σ2, and there are only two variables (σ1 and σ2) in the two inequalities, which simplifies the calibration.

Achieving the probabilistic (k, k)-anonymity. σ1 and σ2 can be calibrated to simultaneously satisfy the requirements of probabilistic (1, k)- and (k, 1)-anonymity such that (k, k)-anonymity is achieved.
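Because Eq. (10) and (11) involve only the two extreme-norm records, checking a candidate (σ1, σ2) pair is cheap. The following sketch evaluates the benchmark function at the two extremes and tests both inequalities using scipy's chi-squared quantile function; the grid search in the comment is our naive assumption, not the paper's calibration procedure.

```python
import numpy as np
from scipy.stats import chi2

def G(norm_i_sq, norm_j_sq, d, s1, s2):
    """Benchmark function of Theorem 3, on squared Euclidean norms."""
    a = norm_i_sq * s1 ** 2 + s2 ** 2
    b = norm_j_sq * s1 ** 2 + s2 ** 2
    return (d / 2.0) * b / ((norm_i_sq - norm_j_sq) * s1 ** 2) * np.log(a / b)

def satisfies_k1(X, d, s1, s2, k):
    """Check the sufficient condition Eq. (10)-(11) for (k,1)-anonymity
    (X assumed to follow a uniform distribution, per Theorem 3)."""
    N = X.shape[0]
    norms = np.sum(X ** 2, axis=1)
    lo, hi = norms.min(), norms.max()
    g_min = G(hi, lo, d, s1, s2)   # attained at (X_max_norm, X_min_norm)
    g_max = G(lo, hi, d, s1, s2)   # attained at (X_min_norm, X_max_norm)
    return (g_min >= chi2.ppf((k - 1) / (N - 1), df=d) and
            g_max <= chi2.ppf((N - k) / (N - 1), df=d))

# Naive calibration sketch: grid-search candidate (sigma1, sigma2) pairs.
# valid = [(s1, s2) for s1 in np.linspace(0.1, 2, 20)
#                   for s2 in np.linspace(0.1, 5, 50)
#                   if satisfies_k1(X, 5, s1, s2, 10)]
```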
6 SECURITY ANALYSIS

As mentioned in Section 1, three types of attacks can generally be deployed on randomization-based methods. The previous sections discussed the prevention of linkage-type attacks, and thus, they will not be discussed here.

For reconstruction attacks, there are two types, as summarized in [34]: one is based on principal component analysis (PCA), and the other is based on maximum a posteriori (MAP) estimation. The prevention of PCA-based attacks will be discussed in Section 8.5. The prevention of MAP-based attacks is actually implied in our achievement of probabilistic anonymity. MAP-based attacks aim to recover the record X given its perturbed version Z, by finding the optimum X̃ of the optimization problem max_X P(X|Z) and treating it as an estimate of X. There are two obstacles to finding the optimum X̃. First, the p.d.f. of X should be a priori knowledge and an input for the problem. Second, due to the hybrid randomization of our method, the optimization problem is equivalent to

max_{X,R1,R2} P(X, R1, R2 | Z), s.t. Z = X · R1 + R2,

which is high dimensional and difficult to solve. Even if the optimum X̃ is found by overcoming the obstacles, it is highly likely to be different from the true record X,
because X̃ has the highest posterior probability P(X̃|Z), whereas the true record X has a considerably lower posterior probability P(X|Z). This fact is produced by the probabilistic anonymity, in which the true fit PF(Xi|Zi) is artificially lowered below (k − 1)-on-average false fits PF(Xj|Zi).

In the remainder of this section, we focus on the semantic security of our method against the skewness and similarity attacks. As defined in [18], semantic closeness requires the distribution of a sensitive attribute in any anonymization group on QIs to be close to the distribution of the attribute in the overall table, and a tighter semantic closeness means a higher semantic security. Generally, Earth Mover's Distance is employed to measure semantic closeness. Because the anonymization group in our hybrid randomization method is quite different from the one used in [18], we define a new measurement of semantic closeness and demonstrate that semantic security can be achieved by adjusting some parameters, such as N and k.

Because an obvious anonymization group is required for the skewness and similarity attacks, the M2O linkage attack does not provide any help for the attacker to deploy them. This is because all records in X are linked to a single randomized record in Z, but not to a group of randomized records. Only by the O2M linkage attack is the attacker able to perform these attacks on a group of randomized records, after he links a record in X to the group. We define this group in the following:

Definition 3. Statistical Anonymization Group: through an O2M linkage attack on any Xi ∈ X, the attacker obtains a list of records Z′ arranged in decreasing order of F(Zj → Xi) (j = 1, ..., N). Let Oi be the order of Zi, the true randomized record of Xi. E(Oi) is the expected value of Oi. Then, the statistical anonymization group for Xi consists of the randomized records from the 1st to the E(Oi)-th record in Z′.

Assuming that a statistical anonymization group has been accurately estimated for Xi, the attacker can analyze the sensitive values inside the group. If these values are dominated by a single value or by some semantically similar values, then the attacker will be highly confident that the victim with Xi also has the same single value or the semantically related values. However, by our probabilistic (1, k)-anonymity, these skewness and similarity attacks can rarely be successful, due to the following two reasons.

First, the attacker cannot make an accurate estimation of the order of Zi, denoted Oi, which is random and depends on the number of occurrences of the events F(Zj → Xi) ≥ F(Zi → Xi). Even the mean value E(Oi) is uncertain for the attacker. As we discussed in the proof of Lemma 2, E(Oi) = Σ_{j∈{1,...,N}, j≠i} p_{i,j}, in which p_{i,j} is the occurrence probability of F(Zj → Xi) ≥ F(Zi → Xi). By Theorem 1, p_{i,j} is determined by λi and λj, which are related to many factors, such as Xi, Xj, R1, σRP², σ1 and σ2.

Second, inside the sensitive attributes of the statistical anonymization group, a statistical semantic closeness exists, i.e., the fraction of any class in the group is close to the probability of this class in the overall table. This type of closeness is described in the following Definition 4. Theorem 4 provides the lower and upper bounds of the closeness, which means that the closeness can be calibrated by the parameters k and N.

Definition 4. Statistical Semantic Closeness of (1, k)-anonymity: Suppose that there are t classes for the sensitive attributes in the overall table, denoted by c1, ..., ct. Their probabilities are denoted by pc1, ..., pct. By (1, k)-anonymity, Xi ∈ X has the statistical anonymization group with size Oi. In this group, let c′l denote the subset of records labeled by cl (l = 1, ..., t). Then, the statistical semantic closeness of (1, k)-anonymity for class cl is defined by Pcl/pcl, in which Pcl = E(|c′l|)/E(Oi), and |c′l| is the cardinality of c′l.

Theorem 4. Suppose that there are t classes for the sensitive attributes in the overall table, denoted by c1, ..., ct, with probabilities pc1, ..., pct. The lower bound of the statistical semantic closeness of (1, k)-anonymity to any class cl (l = 1, ..., t) is

(k − 1) / ( pcl(2k − 1 − N) + N(N − k)/(N − 1) ).

The upper bound is

(N − k) / ( pcl(N − 2k + 1) + N(k − 1)/(N − 1) ).

The proof of this theorem is postponed to Appendix D.

To prevent the skewness attack, the minority class in the overall table should not become the majority class in the randomization group. To prevent the similarity attack, the ratio of each class in the randomization group should be close to its probability in the overall table. By Theorem 4, when pcl is very small, pcl(2k − 1 − N) and pcl(N − 2k + 1) can be neglected, and the two bounds become approximately (N − 1)(k − 1)/(N(N − k)) and (N − 1)(N − k)/(N(k − 1)), respectively. Therefore, N (the size of each segment) and k can be adjusted such that the upper bound is not very high (to prevent the skewness attack), or such that the two bounds are close to 1 (to prevent the similarity attack), as we show in the experiments presented in Section 8.3.
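Since the bounds of Theorem 4 are simple closed forms, the trade-off between N, k and the closeness can be tabulated directly. A small illustrative computation (our example, not from the paper):

```python
def closeness_bounds(N, k, p_cl):
    """Lower/upper bounds on the statistical semantic closeness
    P_cl / p_cl of (1,k)-anonymity, per Theorem 4."""
    lower = (k - 1) / (p_cl * (2 * k - 1 - N) + N * (N - k) / (N - 1))
    upper = (N - k) / (p_cl * (N - 2 * k + 1) + N * (k - 1) / (N - 1))
    return lower, upper

# A rare class (p_cl = 0.01) in segments of N = 100: the bounds tighten
# toward 1 (better similarity protection) as k grows.
for k in (5, 10, 25, 50):
    lo, hi = closeness_bounds(100, k, 0.01)
    print(f"k={k:2d}: closeness in [{lo:.2f}, {hi:.2f}]")
```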
7 MINING THE DATA IN THE PROBABILISTIC ANONYMITY

In this section, we re-design the K-nearest neighbors (K-NN) algorithm for data classification. In this algorithm, the training data set is the randomized records Z. The test data consist of all QI attributes of an individual, denoted by the record Xt. His sensitive value is unknown, and it should be the output of the K-NN algorithm as a class label.

In our algorithm, the original record set X is divided into n segments, and each segment has N records. This partition is suitable for parallel processing, in which the corresponding randomized records of different segments are distributed to different processors. It is also suitable for processing data at collection time, with one segment processed at a time. In the randomization model, on different segments, different variances σ²_{1,I} (I = 1, ..., n) are employed for R1, and different variances σ²_{2,I} (I = 1, ..., n) are employed for R2. The parameters σ_{1,I} and σ_{2,I} for the n segments are published to the data miner. However, the anonymity level k in each segment is not published, to increase the difficulty of the attacks.

In the training phase of our K-NN algorithm, PZ(·) is treated as a spherical multivariate Gaussian distribution, which means that each dimension of Z is identically and independently distributed as a Gaussian distribution. The mean of this Gaussian distribution is 0, and the variance can be easily estimated on all data Z.
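The training phase described above amounts to a per-segment variance estimate. A minimal sketch follows (the prediction step of the redesigned K-NN appears on pages not reproduced here, so only the training estimate is shown; the function name is ours):

```python
import numpy as np

def train_pz(Z_segments):
    """Training phase: model PZ(.) of each segment as a zero-mean
    spherical Gaussian and estimate its per-dimension variance."""
    # E[Z] = 0 by construction, so the raw second moment is the variance.
    return [float(np.mean(Z ** 2)) for Z in Z_segments]

# variances = train_pz([Z1, Z2, ...])  # one estimate per published segment
```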
Fig. 5. Experiments on Probabilistic (1,k)-Anonymity for the Synthetic Data Set. (a) Percentage of records achieving (1,k)-anonymity when k is changed. (b) Percentage of records achieving (1,10)-anonymity when σ2 is changed (in multiples of σ2,min).

Fig. 6. Experiments on Probabilistic (1,k)-Anonymity for Two Real Data Sets (Breast Tissue, Transfusion). (a) Percentage of records achieving (1,k)-anonymity when k is changed. (b) Percentage of records achieving (1,10)-anonymity when σ2 is changed (in multiples of σ2,min).

... uniform distribution, with parameters N = 100, n = 500, m = 10, and d = 5. In the randomization model of Eq. (1), values of σ1 and σ2 that simultaneously satisfy the requirements of (1, k)- and (k, 1)-anonymity are employed. Specifically, a group of candidate σ1 and σ2 that satisfy (1, k)-anonymity are first determined and then verified against the requirement of (k, 1)-anonymity shown in Theorem 3. Fig. 7 shows, for each given k, the average percentage of records on which both (1, k)- and (k, 1)-anonymity are achieved. When k is increased, the average percentage is reduced.

Experiments on high-dimensional synthetic data are also performed. m varies from 20 to 50, and d = ⌊(m + 1)/2⌋. The other parameters are k = 10, N = 100, and n = 500. Subsequently, 85%~90% of records can achieve (10, 10)-anonymity.

Fig. 7. Percentage of records achieving (1,k)- and (k,1)-anonymity on the synthetic data set.

Fig. 8 presents the experiments of (k, k)-anonymity on the two real data sets with the parameter N = 100. From Fig. 8(a) and Fig. 8(b), the same conclusions can be drawn, assuming that the data sets follow uniform distributions.

Fig. 8. Experiments on Probabilistic (k,k)-Anonymity for Two Real Data Sets. (a) Breast tissue data set. (b) Blood transfusion data set.

... these data sets. The three data sets are the same as those used in Section 8.2. The records in the synthetic data set are labeled with two class labels with a ratio of 9:1. In Fig. 9(a) and Fig. 10(a), for each given k, the statistical semantic closeness (as defined in Definition 4) of the smallest class (with the fewest members) is shown. These two figures illustrate that the average percentage of the smallest class in the anonymization groups is close to its probability in the overall table, with only 0.5∼1 times of increment.

We also employ Earth Mover's Distance (EMD) to compare our hybrid randomization with SABRE in [3] and DiffGen in [26]. Generally speaking, given the same anonymity level, the smaller the EMD that an anonymization method generates over a group, the better the semantic closeness that it achieves. The comparisons are not very visual because the latter two methods are quite different from ours. In our method, the anonymity level k is a predefined input, and it is easy to obtain the corresponding EMD value on each group. SABRE is a k-anonymity-based method, which groups records into equivalence classes to satisfy a given EMD; therefore, the EMD value is a predefined input. DiffGen is based on ε-differential privacy, and the privacy parameter ε is an input.

In our experiments on SABRE, when the given value of EMD varies from 0.1 to 1, the anonymity level (i.e., the average size of equivalence groups) remains a constant value, that is, 9, 6 and 9 on the synthetic, breast tissue and transfusion data sets, respectively. In our experiments on DiffGen, the average size of equivalence groups varies with the corresponding average EMD. Fig. 9(b) and Fig. 10(b) show these variations. These two figures illustrate that on the same anonymity level, our method achieves better semantic closeness.

Experiments on high-dimensional synthetic data are ... respectively.

Fig. 9. Experiments on Semantic Closeness on the Synthetic Data. (a) Semantic closeness. (b) EMD comparison with DiffGen.
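For reference, with categorical sensitive values and a unit ground distance between any two distinct values (the equal-distance setting of [18]), the EMD between a group's class distribution and the overall one reduces to half the L1 (total variation) distance, and the statistical semantic closeness of Definition 4 is a simple ratio. A sketch of both measurements as we understand them (our illustration, not the paper's evaluation code):

```python
import numpy as np

def emd_equal_distance(p_group, p_table):
    """EMD between two categorical distributions under unit ground
    distance; equals half the L1 (total variation) distance."""
    return 0.5 * float(np.abs(np.asarray(p_group) - np.asarray(p_table)).sum())

def semantic_closeness(frac_cl_in_group, p_cl):
    """Definition 4: ratio of a class's expected fraction in the
    statistical anonymization group to its overall probability."""
    return frac_cl_in_group / p_cl

print(emd_equal_distance([0.5, 0.5], [0.9, 0.1]))  # 0.4
print(semantic_closeness(0.02, 0.01))              # 2.0: class doubled in group
```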
0.65
DiffGen on Breast Tissue Data rates of correct classification for our hybrid method, [30],
[26] and [3] are 82.7%, 77.5%, 71.2%, and 65.1%, respec-
0.6
0.55
tively.
Breast Tissue
1
Transfusion
0.5
We also conduct similar experiments in clustering sce-
0.45 narios. The results are shown in Fig. 11(d), Fig. 11(e) and
0.4 Fig. 11(f). For SABRE, the average rates of correct classifi-
0.5
5 10
Anonymity Level: k
15 20 0.35
5 10 15 20
cation are 70.1%, 57.4%, and 69.2% for the synthetic, blood
Anonymity Level
(a) Semantic closeness transfusion and adult data, respectively. The comparisons
(b) EMD comparison
demonstrate that our hybrid randomization method retains
Fig. 10. Experiments on Semantic Closeness on Two Real Data Sets
higher utility of data for k-means clustering.
Time cost comparisons. For the transfusion and adult
data, the average time costs of our hybrid randomization
8.4 Comparisons of Data Utility are 0.35 seconds and 7.442 seconds, respectively, which is
5% of SABRE, 49.4% of DiffGen, and 53.3% of MDAV-based
We experiment using the K-NN algorithm redesigned in Section 7 on one synthetic and two real data sets. The synthetic training data set has 2 clusters, representing two different classes. In each cluster, 90% of records are assigned to one class and the remaining 10% are assigned to the other class. The parameters n, m, d are the same as in the experiments in Section 8.2. The two real data sets are the blood transfusion and adult data sets from the UCI ML Repository. In the adult set, categorical attributes [...] the 3 training data sets. In Fig. 11, the percentage of correctly classified data is shown when k varies from 5 to 50. When k is increased, N is also changed to generate an appropriate λ_max (say, 8~9). Therefore, the result in Fig. 5(a) and 6(a), i.e., that the larger k is, the less achievable the probabilistic anonymity becomes, [...]
On each data set, to compare with the classical K-NN algorithm, we sanitize the training set and the testing set [...]. To compare with other anonymization methods, we sanitize the data set using SABRE in [3], DiffGen in [26], and MDAV-based differential privacy in [30], and then we also execute the classical K-NN.
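As a concrete illustration of this sanitize-then-classify protocol, the following sketch uses plain Gaussian noise addition as a stand-in for the hybrid randomization and scikit-learn's K-NN as the classifier; the data and parameters are illustrative only:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical 2-cluster synthetic data with 10% label noise, as in the setup above.
n, m = 500, 10
X = np.vstack([rng.normal(0.0, 1.0, size=(n // 2, m)),
               rng.normal(3.0, 1.0, size=(n // 2, m))])
y = np.array([0] * (n // 2) + [1] * (n // 2))
flip = rng.random(n) < 0.1          # 10% of labels flipped to the other class
y = np.where(flip, 1 - y, y)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Sanitize both the training and the testing set before classification.
# Additive noise is only a stand-in for the paper's hybrid randomization.
sigma = 0.5
Ztr = Xtr + rng.normal(0.0, sigma, size=Xtr.shape)
Zte = Xte + rng.normal(0.0, sigma, size=Xte.shape)

clf = KNeighborsClassifier(n_neighbors=5).fit(Ztr, ytr)
print("accuracy on sanitized data:", clf.score(Zte, yte))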
The variations in the rate of correct classification with anonymity level are shown in Fig. 11(a), Fig. 11(b) and Fig. 11(c) for each data set. The results on SABRE are not shown in the figures because its anonymity level (i.e., the average [...]) varies with values of EMD from 0.1 to 1; for SABRE, the average rates are instead reported in the text. The parameter ε in DiffGen and MDAV varies from 0.1 to 5, whereas in the figures the axis of anonymity level for DiffGen and MDAV does not represent ε. It is not applicable to directly align ε with the anonymity level k of our solution on the same axis, because their domains and meanings are quite different. Space limitations also do not allow us to use separate figures for DiffGen and MDAV. Therefore, we compromise by computing the average size of the groups in DiffGen at each value of ε, and the average classification accuracy over various ε values at each cluster size of MDAV, and then aligning the group/cluster size with k in our solution, because their meanings and domains are similar. The corresponding accuracy rates of DiffGen cannot be linked by a curve because the average group size fluctuates with the increase of ε. The combination of our hybrid randomization method and redesigned K-NN algorithm facilitates higher accuracy rates of classification compared with [3], [26], [30] and the classical K-NN. The average rates of correct classification for our hybrid method, [30], [26] and [3] are 82.7%, 77.5%, 71.2%, and 65.1%, respectively.

We also conduct similar experiments in clustering scenarios. The results are shown in Fig. 11(d), Fig. 11(e) and Fig. 11(f). For SABRE, the average rates of correct classification are 70.1%, 57.4%, and 69.2% for the synthetic, blood transfusion and adult data, respectively. The comparisons demonstrate that our hybrid randomization method retains higher utility of data for k-means clustering.

Time cost comparisons. For the transfusion and adult data, the average time costs of our hybrid randomization are 0.35 seconds and 7.442 seconds, respectively, which is 5% of SABRE, 49.4% of DiffGen, and 53.3% of MDAV-based differential privacy.

[Fig. 11 panels: (a) accuracy of classification for synthetic data by K-NN; (b) for the Blood Transfusion data by K-NN; (c) for the Adult data by K-NN; (d) for synthetic data by k-means; (e) for the Blood Transfusion data by k-means; (f) for the Adult data by k-means. Curves compare Hybrid Randomization, DiffGen, and MDAV-based Differential Privacy; the vertical axis is the rate of correct classification.]
Fig. 11. Comparisons of Data Utility

8.5 Analysis of Security Against Reconstruction Attack

We also conduct experiments to analyze the security of the randomized data against reconstruction attacks. There are two types of reconstruction attacks, and one of them (MAP-based) has been analyzed in Section 6. In this section, we focus on the other type, the PCA-based attack. This attack does not require much a priori knowledge and is thus easier for attackers to deploy. It aims to separate a few uncorrelated components from the randomized data, reduce the scaling and permutation ambiguities of the components, and estimate the original data from these components.
Below, we present the attack based on the general PCA method.

The attacker computes the covariance matrix Σ_Z of Z and performs an eigenvalue decomposition Σ_Z = QDQ^T, in which Q is an orthogonal matrix and D is a diagonal matrix whose entries are the eigenvalues of Σ_Z. The attacker then computes the PCA result X̃ = Z · QD^{−1/2}Q^T. It is easy to prove that the covariance matrix of X̃, Σ_X̃, is an identity matrix. In the attacker's view, the j-th attribute x̃_j of X̃ is a recovery of the i-th attribute of X, with some scaling and permutation ambiguity. To reduce the scaling ambiguity, the attacker utilizes the mean (μ_i) and variance (σ_i²) of the i-th attribute of X and computes x̂_j = σ_i x̃_j + μ_i. To reduce the permutation ambiguity, the attacker performs a statistical test on whether x̂_j has a similar distribution to the i-th attribute of X.
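A compact sketch of these steps (the data, dimensions and noise model are illustrative stand-ins, and the statistical test for permutation matching is omitted):

import numpy as np

rng = np.random.default_rng(2)

# Toy original data X (n records, m attributes) and a disguised version
# Z = X @ R1 + R2, a simplified stand-in for the randomization model.
n, m, d = 1000, 8, 8
X = rng.normal(5.0, 2.0, size=(n, m))
R1 = rng.normal(0.0, 1.0, size=(m, d))
R2 = rng.normal(0.0, 1.0, size=(n, d))
Z = X @ R1 + R2

# PCA-based reconstruction: whiten Z so that its covariance becomes identity.
Zc = Z - Z.mean(axis=0)
Sigma_Z = np.cov(Zc, rowvar=False)
eigval, Q = np.linalg.eigh(Sigma_Z)                 # Sigma_Z = Q D Q^T
X_tilde = Zc @ Q @ np.diag(eigval ** -0.5) @ Q.T    # covariance is now ~I

# Reduce the scaling ambiguity with the (assumed known) mean and variance
# of the i-th original attribute; the pairing (i, j) is arbitrary here.
i, j = 0, 0
mu_i, sigma_i = X[:, i].mean(), X[:, i].std()
x_hat = sigma_i * X_tilde[:, j] + mu_i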
The PCA-based attack has some limitations. It can only separate d components, whereas there are m attributes in X. It is not suitable for the cases where there is some correlation among the attributes of X. Minimizing the ambiguities also requires accurate statistical information on X.

Fig. 12 shows the experiments on the 3 data sets, which are the same as those used in the above sections. In Fig. 12(a), the mean absolute error is measured for the reconstructions, which is the expected value of the relative errors, i.e.,

$$\frac{1}{m\,n}\sum_{i=1}^{m}\sum_{j=1}^{n}\Big|\frac{x_{i,j}-\hat{x}_{i,j}}{x_{i,j}}\Big|,$$

where x_{i,j} and x̂_{i,j} are the (i, j)-th elements of X and X̂. In Fig. 12(b), the recovery rate at a relative error of 0.5 is measured, which is the percentage of reconstructed entries whose relative errors are within a threshold ε, i.e.,

$$\frac{\#\big\{\hat{x}_{i,j} : \big|\frac{x_{i,j}-\hat{x}_{i,j}}{x_{i,j}}\big| \le \varepsilon,\ i=1,\dots,m,\ j=1,\dots,n\big\}}{m\,n}.$$

In Fig. 12(a), the MAE of the "breast tissue" data is not shown because the values are too large to fit within the figure (the mean value on all k is 895.5). The two figures show that the PCA-based attack has high MAE and low recovery rate, and that our randomized data are secure against this attack.

[Fig. 12 panels: (a) mean absolute error of the PCA-based attack and (b) recovery rate (%) at relative error of 0.5, versus k, for the synthetic, transfusion, and breast tissue data.]
Fig. 12. Experiments on PCA-based Attack
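Both metrics reduce to a few lines. A sketch, assuming X and X_hat are NumPy arrays of the same shape and X has no zero entries:

import numpy as np

def mean_absolute_relative_error(X, X_hat):
    # (1/(m*n)) * sum of |(x_ij - xhat_ij) / x_ij| over all entries.
    return np.mean(np.abs((X - X_hat) / X))

def recovery_rate(X, X_hat, eps=0.5):
    # Percentage of reconstructed entries whose relative error is within eps.
    return 100.0 * np.mean(np.abs((X - X_hat) / X) <= eps)

X = np.array([[2.0, 4.0], [5.0, 8.0]])
X_hat = np.array([[1.0, 4.4], [9.0, 8.2]])
print(mean_absolute_relative_error(X, X_hat), recovery_rate(X, X_hat))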
9 CONCLUSIONS

We proposed privacy-preserving data transformation methods in response to the requirements of strong security, high efficiency and high data utility. Our methods can achieve probabilistic (1, k)- and (k, 1)-anonymity on a linear and hybrid randomization model to effectively prevent the attacker's one-to-multiple and multiple-to-one linkage attacks. Comparisons are made with the k-anonymity method in [3] and the differential privacy methods in [26] and [30]. Our methods have high semantic closeness to the distribution of sensitive values in the overall anonymized table; thus, they can prevent skewness and similarity attacks. Our methods also have high security against reconstruction attacks. Moreover, the proposed methods run at higher efficiency, thus well suiting the needs of large-scale data, high-dimensional data and collection-time processing. We also re-designed the K-nearest neighbors algorithm to leverage the introduced uncertainty. Experiments show that our K-NN algorithm achieves higher accuracy when classifying anonymized data, and our transformation method also retains higher utility of data for traditional clustering scenarios, such as k-means.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China grant 61202427, the 985 Project funding of Sun Yat-sen University, Australian Research Council Discovery Projects funding DP150104871, the Scientific Research Starting Foundation for the Returned Overseas Chinese Scholars from the Ministry of Education of China, and the Fundamental Research Funds for the Central Universities 2012JBZ017. The corresponding author is Hong Shen. The authors also would like to thank the anonymous reviewers for their suggestions and comments.

APPENDIX A
PROOF OF THEOREM 1

We firstly assume there is an X_i ∈ X on which ||X_i||² ≥ σ_RP²/σ_1². We find what is required in order to achieve probabilistic (1, k)-anonymity on X_i; then we extend our findings to the case that ||X_i||² < σ_RP²/σ_1².

By Lemma 1, if ||X_i||² ≥ σ_RP²/σ_1², then for any j ≠ i, P_O(P_F(X_i|Z_j) ≥ P_F(X_i|Z_i)) equals P_O(||Z_j||² ≥ ||Z_i||²), which is the probability of the event ||Z_j||² ≥ ||Z_i||². Since this probability is related with the original records X_i, X_j and the random parameters R_1, R_2, we need to find their relationship and calibrate these values. Actually, there can be two strategies for the calibration: simultaneously adjusting R_1 and R_2, or only adjusting R_2 based on a pre-determined R_1. For simplicity we employ the second strategy.
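This setting is easy to simulate. A sketch of the model Z = XR_1 + R_2 with a pre-determined R_1 (all sizes and sigma values are illustrative, not the paper's):

import numpy as np

rng = np.random.default_rng(3)

# Z_i = X_i R_1 + R_{2,i}: R_1 is drawn once and held fixed (the
# pre-determined strategy); only the noise R_2 remains random.
m, d = 8, 8
sigma1, sigma2 = 1.0, 2.0
R1 = rng.normal(0.0, sigma1, size=(m, d))

def randomize(x):
    # Disguise one original record x (length m) into a record of length d.
    return x @ R1 + rng.normal(0.0, sigma2, size=d)

Xi = rng.normal(0.0, 1.0, size=m)
Xj = rng.normal(0.0, 1.0, size=m)

# Non-centrality parameters lambda = ||X R_1||^2 / sigma_2^2 (cf. Eq. (14) below).
lam_i = np.sum((Xi @ R1) ** 2) / sigma2 ** 2
lam_j = np.sum((Xj @ R1) ** 2) / sigma2 ** 2
print("lambda_i =", lam_i, " lambda_j =", lam_j)

# Monte Carlo estimate of P_O(||Z_j||^2 >= ||Z_i||^2) over the noise R_2.
trials = 20000
hits = sum(np.sum(randomize(Xj) ** 2) >= np.sum(randomize(Xi) ** 2)
           for _ in range(trials))
print("estimated P(||Z_j||^2 >= ||Z_i||^2):", hits / trials)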
Let Z_i = X_i R_1 + R_{2,i} and Z_j = X_j R_1 + R_{2,j} (j ≠ i), and let z_{i,1}, ..., z_{i,d} be the entries of Z_i. When R_1 is pre-determined and only R_2 is random, the probability density P_Z(z_{i,l} | X_i, R_1) (l = 1, ..., d) is a Gaussian distribution with mean X_i R_{1,c_l} (R_{1,c_l} is the l-th column of R_1) and variance σ_2². Let z_{j,1}, ..., z_{j,d} be the entries of Z_j; similarly, P_Z(z_{j,l} | X_j, R_1) = N(X_j R_{1,c_l}, σ_2²). Then P_O(||Z_j||² ≥ ||Z_i||²) is related with P_Z(z_{i,l} | X_i, R_1) and P_Z(z_{j,l} | X_j, R_1) as follows:

$$P_O\big(\|Z_j\|^2 \ge \|Z_i\|^2\big) = P_O\Big(\sum_{l=1}^{d} z_{j,l}^2 - \sum_{l=1}^{d} z_{i,l}^2 \ge 0\Big) = P_O\Big(\sum_{l=1}^{d}\Big(\frac{z_{j,l}}{\sigma_2}\Big)^2 - \sum_{l=1}^{d}\Big(\frac{z_{i,l}}{\sigma_2}\Big)^2 \ge 0\Big) \qquad (13)$$

In Eq. (13), when R_1 is pre-determined, the distributions of z_{i,l} and z_{j,l} will respectively follow P_Z(z_{i,l} | X_i, R_1) and P_Z(z_{j,l} | X_j, R_1). In the following, we firstly find the characteristic function of the difference between the two random variables $\sum_{l=1}^{d}(z_{j,l}/\sigma_2)^2$ and $\sum_{l=1}^{d}(z_{i,l}/\sigma_2)^2$; then the cumulative distribution of their difference can be found by Gil-Pelaez's inversion formula ([10]).

Suppose that the probability density functions of two variables X and Y are respectively f and g. The probability density of their difference Y − X can be given by the cross-correlation f ⋆ g; therefore, by the convolution theorem, the characteristic function of their difference is denoted $\overline{\varphi_X}\cdot\varphi_Y$, in which φ_X and φ_Y are the characteristic functions (or Fourier transforms) of f and g respectively, and $\overline{\varphi_X}$ is the complex conjugate of φ_X.

In Eq. (13), the random variable $\sum_{l=1}^{d}(z_{j,l}/\sigma_2)^2$ is distributed according to a noncentral χ² distribution with d degrees of freedom and a non-centrality parameter λ_j as follows:

$$\lambda_j = \frac{(X_j R_{1,c_1})^2 + \cdots + (X_j R_{1,c_d})^2}{\sigma_2^2} = \frac{(X_j R_1)(X_j R_1)^T}{\sigma_2^2} = \frac{X_j R_1 R_1^T X_j^T}{\sigma_2^2}. \qquad (14)$$

We use ι to denote the imaginary unit. The corresponding characteristic function of this distribution is:

$$\varphi_j(t) = \frac{\exp\big(\frac{\iota\lambda_j t}{1-2\iota t}\big)}{(1-2\iota t)^{d/2}}. \qquad (15)$$

Similarly, the random variable $\sum_{l=1}^{d}(z_{i,l}/\sigma_2)^2$ is also distributed according to a noncentral χ² distribution, with d degrees of freedom, a non-centrality parameter λ_i = X_i R_1 R_1^T X_i^T / σ_2², and a characteristic function as follows:

$$\varphi_i(t) = \frac{\exp\big(\frac{\iota\lambda_i t}{1-2\iota t}\big)}{(1-2\iota t)^{d/2}}. \qquad (16)$$

Then the characteristic function of the difference of the two random variables is

$$\begin{aligned}\varphi_{i,j}(t) &= \varphi_j(t)\,\overline{\varphi_i(t)} = \varphi_j(t)\,\varphi_i(-t)\\ &= \frac{\exp\big(\frac{\iota\lambda_j t}{1-2\iota t}\big)}{(1-2\iota t)^{d/2}}\cdot\frac{\exp\big(\frac{-\iota\lambda_i t}{1+2\iota t}\big)}{(1+2\iota t)^{d/2}}\\ &= \frac{1}{(1+4t^2)^{d/2}}\cdot\exp\Big(\frac{\iota\lambda_j t - 2\lambda_j t^2}{1+4t^2}\Big)\cdot\exp\Big(\frac{-\iota\lambda_i t - 2\lambda_i t^2}{1+4t^2}\Big)\\ &= \frac{1}{(1+4t^2)^{d/2}}\cdot\exp\Big[-\frac{2(\lambda_j+\lambda_i)t^2}{1+4t^2}\Big]\cdot\Big[\cos\frac{(\lambda_j-\lambda_i)t}{1+4t^2} + \iota\sin\frac{(\lambda_j-\lambda_i)t}{1+4t^2}\Big]\end{aligned} \qquad (17)$$

Let F(x) be the cumulative distribution function of $\sum_{l=1}^{d}(z_{j,l}/\sigma_2)^2 - \sum_{l=1}^{d}(z_{i,l}/\sigma_2)^2$. By the inversion formula in [10], F(0), i.e., the probability that the difference is less than 0, is the following:

$$F(0) = \frac{1}{2} - \frac{1}{\pi}\int_0^{\infty}\frac{\mathrm{Im}[\varphi_{i,j}(t)]}{t}\,dt, \qquad (18)$$

so that

$$P_O\big(P_F(X_i|Z_j) \ge P_F(X_i|Z_i)\big) = 1 - F(0) = \frac{1}{2} + \frac{1}{\pi}\int_0^{\infty}\frac{1}{t}\cdot\frac{1}{(1+4t^2)^{d/2}}\exp\Big[-\frac{2(\lambda_j+\lambda_i)t^2}{1+4t^2}\Big]\sin\Big(\frac{(\lambda_j-\lambda_i)t}{1+4t^2}\Big)\,dt. \qquad (19)$$
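The derivation up to Eq. (19) can be sanity-checked numerically. The sketch below compares a Monte Carlo estimate for the difference of two independent noncentral chi-square variables against the Gil-Pelaez integral built from Eq. (17); d and the non-centralities are illustrative values:

import numpy as np
from scipy import integrate

rng = np.random.default_rng(4)
d, lam_i, lam_j = 8, 3.0, 5.0   # illustrative, not the paper's settings

# Monte Carlo: difference of two independent noncentral chi-square variables.
n = 200000
diff = (rng.noncentral_chisquare(d, lam_j, n)
        - rng.noncentral_chisquare(d, lam_i, n))
print("Monte Carlo P(diff >= 0):", np.mean(diff >= 0))

# Gil-Pelaez at x = 0: P(diff >= 0) = 1/2 + (1/pi) * Int_0^inf Im[phi(t)]/t dt,
# with Im[phi(t)] read off from Eq. (17).
def integrand(t):
    a = 1.0 + 4.0 * t * t
    return (a ** (-d / 2.0)
            * np.exp(-2.0 * (lam_j + lam_i) * t * t / a)
            * np.sin((lam_j - lam_i) * t / a) / t)

val, _ = integrate.quad(integrand, 0.0, np.inf, limit=200)
print("Gil-Pelaez  P(diff >= 0):", 0.5 + val / np.pi)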
We substitute t by (1/2) tan θ, θ ∈ (0, π/2); then dt = (1/2) sec²θ dθ, and Eq. (19) becomes the following:

$$P_O\big(P_F(X_i|Z_j) \ge P_F(X_i|Z_i)\big) = \frac{1}{2} + \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{1}{2}(\lambda_j+\lambda_i)\sin^2\theta}\cdot\csc\theta\cdot\cos^{d-1}\theta\cdot\sin\Big(\frac{\lambda_j-\lambda_i}{4}\sin 2\theta\Big)\,d\theta. \qquad (20)$$

In the case that ||X_i||² < σ_RP²/σ_1², similarly we can get the following:

$$P_O\big(P_F(X_i|Z_j) \ge P_F(X_i|Z_i)\big) = P_O\big(\|Z_j\|^2 \le \|Z_i\|^2\big) = P_O\Big(\sum_{l=1}^{d}\Big(\frac{z_{j,l}}{\sigma_2}\Big)^2 - \sum_{l=1}^{d}\Big(\frac{z_{i,l}}{\sigma_2}\Big)^2 \le 0\Big) = F(0) = \frac{1}{2} - \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{1}{2}(\lambda_j+\lambda_i)\sin^2\theta}\cdot\csc\theta\cdot\cos^{d-1}\theta\cdot\sin\Big(\frac{\lambda_j-\lambda_i}{4}\sin 2\theta\Big)\,d\theta. \qquad (21)$$

APPENDIX B
PROOF OF LEMMA 3

To prove the monotonicity of I(λ_i, λ_j) with respect to either independent variable λ_i or λ_j, holding the other variable constant, we need to prove ∂I/∂λ_i < 0 and ∂I/∂λ_j > 0. In the following we will only prove that ∂I/∂λ_j > 0; the same method can be used to prove ∂I/∂λ_i < 0. From Eq. (7),

$$\frac{\partial I}{\partial \lambda_j} = \frac{1}{2\pi}\int_0^{\pi/2} \cos^{d-1}\theta\cdot e^{-\frac{\lambda_j+\lambda_i}{2}\sin^2\theta}\cdot\cos\Big(\theta + \frac{\lambda_j-\lambda_i}{4}\sin 2\theta\Big)\,d\theta. \qquad (22)$$

Let Δλ = λ_j − λ_i. When λ_j, λ_i ∈ (0, 9.2], Δλ ∈ (−9.2, 9.2). In Eq. (22), the most intricate part is cos(θ + (Δλ/4) sin 2θ), since it is not always positive when θ ∈ [0, π/2]. We will discuss its sign by dividing the domain of Δλ into 3 sub-domains: [−2, 2], (2, 9.2), and (−9.2, −2). Let f(θ, Δλ) = θ + (Δλ/4) sin 2θ. Then,

$$\frac{\partial f}{\partial \theta} = 1 + \frac{\Delta\lambda}{2}\cos 2\theta, \qquad (23)$$

$$\frac{\partial f}{\partial \Delta\lambda} = \frac{1}{4}\sin 2\theta. \qquad (24)$$

Case 1: When Δλ ∈ [−2, 2] and θ ∈ [0, π/2], ∂f/∂θ ≥ 0 and ∂f/∂Δλ ≥ 0. Therefore f ∈ [0, π/2] and cos f ∈ [0, 1], and then it is easy to prove that ∂I/∂λ_j > 0.

Case 2: When Δλ ∈ (2, 9.2), if θ ∈ [0, (1/2) arccos(−2/Δλ)], then ∂f/∂θ ≥ 0 and f ∈ [0, f_max], in which f_max varies depending on Δλ; if θ ∈ ((1/2) arccos(−2/Δλ), π/2], then ∂f/∂θ < 0 and f ∈ [π/2, f_max). It is easy to get that f_max ∈ (π/2, 3.14) [...]
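Independently of the remaining case analysis, the monotonicity claim can be probed numerically. The sketch below evaluates I(λ_i, λ_j) from Eq. (20) by quadrature and checks the finite-difference signs on a grid over the domain considered by Lemma 3; d is illustrative:

import numpy as np
from scipy import integrate

d = 8  # illustrative degrees of freedom

def I(lam_i, lam_j):
    # I(lambda_i, lambda_j): the probability in Eq. (20), by quadrature.
    def g(theta):
        return (np.exp(-0.5 * (lam_j + lam_i) * np.sin(theta) ** 2)
                / np.sin(theta)
                * np.cos(theta) ** (d - 1)
                * np.sin(0.25 * (lam_j - lam_i) * np.sin(2.0 * theta)))
    val, _ = integrate.quad(g, 0.0, np.pi / 2.0)
    return 0.5 + val / np.pi

# Probe both partial derivatives by finite differences on a grid of
# (lambda_i, lambda_j) in (0, 9.2]: I should increase in lambda_j and
# decrease in lambda_i.
grid = np.linspace(0.2, 9.2, 10)
eps = 1e-3
ok = all(I(li, lj + eps) > I(li, lj) > I(li + eps, lj)
         for li in grid for lj in grid)
print("dI/dlambda_j > 0 and dI/dlambda_i < 0 on the grid:", ok)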