0% found this document useful (0 votes)

29 views16 pages

Sang 2016

This article proposes new frameworks of probabilistic (1,k)- and (k,k)-anonymity to defend against linkage attacks on privacy-preserving data mining. It uses a hybrid randomization model combining random projection and random noise addition to anonymize high-dimensional data. This model aims to prevent reconstruction attacks by disclosing attribute values as well as semantic-related attacks through similarity of sensitive values. It also allows redesigning the k-nearest neighbors algorithm to leverage data uncertainty while improving mining accuracy.

Uploaded by

727822TPMB005 ARAVINTHAN.S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views16 pages

Sang 2016

Uploaded by

727822TPMB005 ARAVINTHAN.S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2016.2562605, IEEE
Transactions on Information Forensics and Security

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 1

Achieving Probabilistic Anonymity in a Linear

and Hybrid Randomization Model
Yingpeng Sang, Hong Shen, Hui Tian and Zonghua Zhang

Abstract—The randomization methods that are applied for privacy- by the difference between PPDM and PPDP, and we focus
preserving data mining are commonly subject to reconstruction, link- on typical methods for PPDM and also on those that were
age, and semantic-related attacks. Some existing works employed
random noise addition to realize probabilistic anonymity, aiming only originally proposed for PPDP but have utility for PPDM.
at linkage attacks. Random noise addition is vulnerable to reconstruc- Randomization methods, including random noise addi-
tion attacks, and is unable to achieve semantic closeness, particu- tion and random projection, have been proposed to pro-
larly on high-dimensional data, to prevent semantic-related attacks. tect data privacy ([13],[22]), and these methods have also
For linkage attacks, the main security vulnerability of their proposed
probabilistic anonymity lies in the assumption that the attacker had a been applied in other relevant fields, such as privacy-
priori knowledge of the quasi-identifiers of all individuals. When only preserving queries on encrypted graph data ([4]) and
some individuals leak their quasi-identifiers, the proposed model will privacy-preserving multimedia retrieval ([23]). With a lin-
become incapable because the attacker can deploy a different linkage
attack that has not been studied before. This type of attack is much
ear cost, they are very suitable for processing data at
easier to deploy and is thus very harmful. In this paper, we propose collection time, data accumulated at a large scale, and data
new frameworks of probabilistic (1,k)- and (k,k)-anonymity to defend with high dimensions. However, their security in terms of
against all these linkage attacks, and realize the frameworks on a hybrid privacy preservation is still lacking a thorough investiga-
randomization model. The model is also secure against reconstruction
attacks. We further achieve statistical semantic closeness of high- tion. In general, privacy-preserving methods are subject to
dimensional data to prevent semantic-related attacks on the model. The the following 3 types of attacks:
frameworks also allow us to re-design the traditional K-nearest neigh-
bors algorithm to leverage the introduced data uncertainty and improve 1) Reconstruction attack: without knowing any at-
the mining results. Our work demonstrates promising applications in tribute values in the original record, the attacker
large-scale and high-dimensional data mining in clouds, by providing attempts to recover them from the corresponding
high efficiency and security to protect data privacy, guaranteeing high
data utility for mining purposes, on-time processing and non-interactive disguised record using methods such as principal
data publishing. component analysis and maximum a posteriori.
2) Linkage attack: after obtaining some attribute values
Index Terms—randomization, k-anonymity, privacy protection, data of a victim, the attacker attempts to link a unique
mining
disguised record to the victim, from which the
other sensitive attribute values of the victim will
1 I NTRODUCTION be disclosed.
3) Semantic-related attacks: in cases where more than

P RIVACY -preserving
data mining (PPDM) consists of
implementing data mining tasks such as classifica-
tion, clustering and association rule mining without leak-
one disguised record are linked to one victim, the
attacker can inspect the records’ sensitive values
because the skewness and similarity of these values
ing the sensitive information from the data owners. It is can actually leak some certain privacy. More details
required in many applications, such as customers classi- about these attacks are discussed in Section 2.
fication ([33]), disease outbreak identification ([32]), and
financial crime detection. Privacy-preserving data publish- In this paper, we study the enforcement of security of
ing (PPDP), as another related field, is distinct from PPDM randomization against these attacks. The randomization
in that data owners generally do not know how the data model that we use is a hybrid model in which data are
will be used when the data are preprocessed and released. first randomly projected and then random noise is added.
Although many methods (e.g., random noise addition, k- Both of the these methods are vulnerable to reconstruc-
anonymity) have been proposed for PPDP, they are actually tion attacks when they are applied individually ([13],[34]),
cross fields, and their utility for data mining has been whereas the hybrid method is considerably stronger. Our
measured. Therefore, in this paper, we are not constrained focus is on preventing linkage and semantic-related attacks
in this hybrid model, which requires a probabilistic version
Yingpeng Sang is with the School of Data and Computer Sci- of k -anonymity. To clarify the concept of this probabilistic
ence, Sun Yat-sen University, Guangzhou,510006, China. e-mail:
[email protected].
anonymity, we briefly compare it with the deterministic k -
Hong Shen is with the School of Data and Computer Science, Sun Yat- anonymity. Details of the concept are discussed in Section
sen University, China, and School of Computer Science, The University of 3.
Adelaide, Australia. e-mail: [email protected]. Deterministic k-Anonymity. The private record of an
Hui Tian is with the School of Electronic and Information Engineering, Bei-
jing Jiaotong University, Beijing, 100044, China. e-mail: [email protected]. individual commonly has 4 types of attributes: explicit
Zonghua Zhang is with Institute Mines-Telecom/TELECOM Lille, and identifiers, quasi-identifiers, sensitive attributes and non-
CNRS UMR 5157 SAMOVAR Lab, France. e-mail: zonghua.zhang@telecom- sensitive attributes. Simply removing explicit identifiers,
lille.fr. such as passport number, staff ID, and name, hardly makes
Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be the individual anonymous. The quasi-Identifiers (QIs) can be
obtained from the IEEE by sending a request to [email protected] exploited in combination to re-identify the individual, e.g.,

1556-6013 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2016.2562605, IEEE
Transactions on Information Forensics and Security

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 2

“marital status”, “sex”, and “working hours per week” in tation that have higher linkage possibilities when
a patient’s record (as shown in Table 1(a)). Some sensitive fitting to the victim in the true fit.
attributes should be kept intact for data mining purposes,
e.g., the Yes/No value for “hypertension” in Table 1(a) is Our Contributions. The work reported in [1] only achieved
used as a class label. If an attacker can successfully re- a probabilistic (k, 1)-anonymity in the randomization
identify a patient by his/her QIs, he will be able to infer model of noise addition. In comparison, our contributions
his/her hypertension status. in this paper include the following:
k -anonymity processes the QIs and sensitive attributes 1) We propose the framework of probabilistic (1, k)-
such that each released record is indistinguishable among anonymity and discuss its unique application sce-
at least k individuals. In the example presented in Table nario that (k, 1)-anonymity is not applied for. That
1(b), for a patient’s marital status, k -anonymity can gen- is, when the attacker can only obtain the QIs for
eralize it to a value in the new domain {“been married”, a limited amount of victims, it is (1, k)-anonymity,
“never married”} or suppress it by completely removing it not (k, 1)-anonymity, that should be strengthened
(such as replacing it with ’*’). At least k original records to prevent linkage attacks.
should be clustered into the same group T and released in 2) We achieve two types of probabilistic anonymity,
the same form on their QIs. In addition, each group should (1, k)- and (k, 1)-anonymity, on a hybrid random-
be invulnerable to semantic-related attacks. In the example ization model. This model is secure against re-
presented in Table 1(b), if all members in a group have construction attacks. Methods for calibrating the
“Yes” for hypertension, the attacker will certainly know randomization parameters in the model are pro-
this sensitive information for a patient when the latter is posed to realize the two types of anonymity. We
uniquely linked with this group. also analyze the theoretical bounds of the semantic
closeness in the anonymity.
TABLE 1 3) We re-design the algorithm of K-nearest neigh-
An Example
bors classification to suit the randomized data.
(a) The Original Private Tabel
The current study generally applies traditional
classification algorithms without considering the
Marital Status Sex Hours Hypertension
divorced M 35 Y uncertainty or errors introduced by the privacy-
married M 40 N preserving preprocessing.
... ... ... ...
The remainder of this paper is organized as follows. Sec-
(b) After 2-anonymity tion 2 discusses related works on privacy-preserving data
Marital Status Sex Hours Hypertension mining and k -anonymity. Section 3 presents a detailed
have married M * (1Y,1N) definition and explanation of the probabilistic anonymity.
... ... ... ...
Section 4 and Section 5 respectively discuss how to realize
the probabilistic (1, k)-anonymity and (k, 1)-anonymity in
Probabilistic Anonymity. Randomization can also the hybrid randomization model. Section 6 analyzes the
make data anonymous, but the anonymity level cannot semantic closeness in the randomization model. Section
be deterministic. Suppose that an individual has an m- 7 describes how to leverage the noise introduced by the
attribute record X = (x1 , ..., xm ) on his/her QIs. X can randomization model in K-NN classification. Section 8
be randomized to record Z by random noise addition or conducts experiments using our methods and algorithm
random projection (i.e., linearly transform X from a higher and performs comparisons with some other related works.
dimensional vector space to a lower one using a random Section 9 concludes the paper.
projection matrix). Let Table T 0 be a randomization of Table
T . One major difference between randomization and k -
anonymity is that there is no obvious anonymization group 2 R ELATED W ORK
in randomization, and the attacker thus cannot link a victim A considerable amount of research has been conducted
to any group of size k . Therefore, the attacker must perform on k -anonymity. A few reviews by different taxonomies
a probabilistic linkage attack by inferring the potential fit can be found in [2], [9], [19], [35]. Regarding the meth-
between the victims and randomized records. Probabilistic ods of transforming the values, there are generalization,
anonymity is proposed for this type of linkage attack. suppression, bucketization, and so forth. Based on whether
Suppose that Table T has N records. It is easy to see the transformed values strictly satisfy some information-
that the best anonymity is achieved when the potential fit loss metrics, there are optimal k -anonymity (e.g., [21]) and
between any victim and any randomized record is 1/N . approximate k -anonymity (e.g., [27]). We primarily review
However, this type of anonymity is difficult to achieve the related works in terms of time complexity, privacy and
for both randomization and deterministic k -anonymity utility concerns.
methods. In practice, we can simply reduce the linkage Time Complexity. Determining an optimal solution for
probability of every true fit and increase the probabilities k -anonymity with minimal information loss is very time
of some false fits, such that the true fit is disguised among consuming. It has been proven to be NP-hard by sup-
the false fits and the attacker cannot effectively distinguish pressing the attributes and cells of QIs ([25]), generalizing
them. There are two types of probabilistic anonymities, as cells of QIs ([17]), and ensuring an l-diversity on sensitive
follows: attributes ([21]). Many approximate solutions have been
• Probabilistic (k,1)-Anonymity: for every true fit, there proposed to improve efficiency, trading off some non-
are at least k − 1 false victims in expectation who optimal information loss. Given the size of data set |T |, their
have higher linkage probabilities when fitting to the complexities are generally O(|T |a ). For example, as proven
randomized record in the true fit. in [25] a is a constant of 3, and in [27], a = k . The Incognito
• Probabilistic (1,k)-Anonymity: for every true fit, there algorithm in [16] had a complexity exponential in the size
are at least k − 1 false randomized records in expec- of QIs. The Mondrian algorithm achieved a complexity

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 3

of O(|T | · log |T |) by employing a greedy strategy. The without a specialized and focused aim of data mining.
algorithms in [15] achieved a logarithmic approximation In many cases, these previous studies only applied the
ratio with small time costs, but how to strengthen security traditional mining algorithms, which performed well on the
toward semantic closeness is still unknown. original data to the anonymized data, without leveraging
(k, 1)-, (1, k)- and (k, k)-anonymities have also been the uncertainty introduced by the anonymizations. A few
realized in [11] by local recoding on QIs. Local recoding exceptions are [14] and [1].
generates indirect equivalence classes by applying differ-
ent generalization approaches to the same attribute values Differential Privacy. Differential privacy is a frame-
in different instances, whereas global recoding uses the work that aims to prevent an attacker from inferring
same approach on all records and the equivalence classes the existence of any target record in the database given
are obvious. In [11], (k, 1)-anonymity means that each the published results on the database, regardless of how
anonymized record is a generalization of at least k original much background knowledge the attacker has obtained.
records. (1, k)-anonymity means that at least k anonymized For privacy-preserving data publishing, a non-interactive
records are generalizations of each original record; these method based on differential privacy was proposed in [26],
anonymized records form an equivalence class for the which utilizes top-down techniques to divide the database
corresponding original record, but their values on the same into some groups that are similar to the anonymization
attribute are not necessarily equal. These concepts are very groups in k -anonymity, while the count of each group
similar to those of this paper, whereas the essential dif- is perturbed to satisfy the differential privacy. Recently,
ference is in the anonymization approach and complexity. how to improve the data utility in differential privacy in
The algorithms in [11] to achieve these properties have a non-interactive settings was discussed in [30]. However,
runtime of O(k · |T |2 ). the complexity of their approach is still quadratic in the
|T |2
Privacy Models. As commonly assumed in the related number of records. Specifically, the complexity is O( k ).
work of k -anonymity, an attacker can successfully link a
victim with a unique group of size k by checking the Even under differential privacy, some attacks can be
victim’s QIs. Subsequently, it was noted in [24] that this successful, although they do not contradict its claims ([15],
actually gives the attacker a chance to infer the victim’s [5], [6]). As discussed in [15], when records are highly
sensitive attributes when there is little diversity in the correlated, the attacker may still infer a record’s sensitive
sensitive attributes of the linked group. In [18], further con- value from the published results, even though the attacker
cerns were discussed, namely, that a mere diversity in the cannot infer the existence of this record in the database. The
sensitive values is still not sufficient and that skewness and same limitation applies to the method of [26], e.g., when the
similarity attacks are still feasible. Diversity may change the records in a group highly correlate and the group is domi-
skewness of the distribution of a sensitive attribute, e.g., a nated by one sensitive value. The reason is that the attacker
2-diversity may show that anyone in the equivalence class can successfully link a record to this group by observing
would be considered to have a 50% possibility of being that the QIs of the record belong to the generalized range
positive, compared with the 1% of the overall population. of the group; then, the attacker will infer that the record
If the sensitive attribute values in an equivalence class may have the dominant sensitive value, and his guess may
are semantically similar (e.g., “gastric ulcer”, “gastritis” be correct with high probability due to the high correla-
and “stomach cancer”), then the attacker can still learn tion among the members of this group. Therefore, further
important information (e.g., the victim must have some strengthening of semantic security in these groups is still
stomach-related problems). necessary. In [6], classifiers based on data published under
To prevent skewness and similarity attacks, the work differential privacy can still help attackers infer sensitive
of [18] was proposed to ensure a semantic closeness inside information about an individual. Such an inference attack
a group, in which the distribution of a sensitive attribute will fail if some method is employed to prevent accurate
in any group is enforced to be close to its distribution in classifiers from being constructed. In Section 7, we present
the overall table. However, all of these concerns required a viable method.
solutions of higher complexity. In contrast to the existing works on deterministic/plain
Probabilistic k -anonymity was also defined in [29] based k -anonymity and differential privacy, our work does not
on clustering QIs and local swapping inside each cluster, support queries on randomized data, but this can be com-
which is different from our method based on randomizing pensated by our advantages of real-time data publication,
QIs. Semantic-related vulnerabilities may still exist inside easy handling of high-dimensional data, high security, and
some clusters. The authors updated their method based on high classification accuracy. These features are highly de-
differential privacy in [30]. We compare the latter with ours manded properties for data-processing techniques in the
with respect to data utility and time cost in Section 8.4. era of big data, whereas they are not well provided by the
For randomization methods, reconstruction attacks existing solutions that support both data queries and data
should also be prevented. Given only the randomized mining. The linear complexity of our solution is achieved
records, the attacker may attempt to recover the values of by the additive and multiplicative randomization on each
the attributes. The attacking techniques include principal attribute value. The strong privacy protection is achieved
component analysis, which can recover the uncorrelated by effectively preventing reconstruction, linkage, skewness
components in the attributes, and maximum a posteriori, and similarity attacks. The high classification accuracy is
which searches the maximizer X̃ for the posterior probabil- achieved by reducing the uncertainty introduced by the
ity P (X̃|Z), and makes X̃ the recovery of the randomized randomization model.
record Z . Using these techniques, random additive noise
and random projection are vulnerable, as noted by [13] and A preliminary version of this paper has been published
[34], respectively. in [28], which only focused on one type of linkage attack.
Utility Concerns. As summarized in [9], many This journal version covers more properties of security and
anonymization models and tools have been provided for data utility for the proposed method and presents more
the general purpose of privacy-preserving data publishing, theoretical proofs.

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 4

3 D EFINITIONS whether the victim Xj has a sensitive value of Ci based

3.1 Randomization Model on whether F (Zi → Xj ) is high enough. This potential fit
is actually the posterior probability, PF (Xj |Zi ), of a record
Suppose that a record has m QI attributes, denoted by vec-
Xj to correspond to Zi . By Bayes’ theorem,
tor X = (x1 , ..., xm ). The record also has multiple sensitive
attributes, all of which are denoted C . For classification PX (Xj )PR (Zi |Xj )
scenarios, these sensitive values are used as class labels. PF (Xj |Zi ) = , (2)
PZ (Zi )
In a randomization model, X is transformed to Z , and
(Z, C) is published. The model can be described in two in which PX (Xj ) and PZ (Zi ) are marginal probability
equivalent ways: a function Z = f (X) or a conditional densities at Xj and Zi , respectively, and PR (Zi |Xj ) is
probability PR (Z|X). In [1] the latter was used, and the defined by the randomization model.
model was based on the Gaussian or uniform distribu- We denote the entire set {X1 , ..., XN } by X , a subset of
tion centered at X . Thus, the function for their models it by Xsub , the entire set of {Z1 , ..., ZN } by Z . The attacker
is Z = X + R, in which R = (r1 , ..., rm ) and each ri can employ two strategies of linkage attacks, as follows:
(i = 1, ..., m) is a random number following a Gaussian Multiple-to-One (M2O) Linkage Attack. The attacker
or uniform distribution. can perform an M2O linkage attack on every Zi ∈ Z
In this paper, we focus on the model by ordering the potential fits of all records in X to Zi .
Finally, the attacker will fit Xj to Zi if PF (Xj |Zi ) is
Z = XR1 + R2 , (1) maximal. For example, in Fig. 2(a), the attacker orders
in which R1 is an m×d matrix, R2 is a d-dimensional vector,
PF (X1 |Z1 ), PF (X2 |Z1 ), PF (X3 |Z1 ), PF (X4 |Z1 ) and finds
the maximum PF (X1 |Z1 ) = 0.4.
and each of their entries is a random number that follows a
Gaussian distribution. In particular, we make d < m such
that the random projection into the d-dimensional space
(i.e., XR1 ) approaches a spherical and multi-dimensional
Gaussian distribution to reduce the eccentricity in X ’s
distribution ([7], [8]) and obtain an easily managed data
set.

(a) M2O Linkage Attack (b) O2M Linkage Attack

Fig. 1. Anonymization Model Fig. 2. Examples of Multiple-to-One and One-to-Multiple Linkage At-
tacks
Our linear randomization model is shown in Fig. 1. It
has the following specific parameters: To compare two potential fits PF (Xj1 |Zi )
and PF (Xj2 |Zi ), the attacker needs to compare
• Each random entry in R1 independently and identi- PX (Xj1 )PR (Zi |Xj1 ) and PX (Xj2 )PR (Zi |Xj2 ) using
cally follows the Gaussian distribution N (0, σ12 ); Bayes’ theorem. Therefore, the attacker should be aware
• Each random entry in R2 independently and identi- of PX (Xj1 ) and PX (Xj2 ), which are from the prior
cally follows the Gaussian distribution N (0, σ22 ). probability of X . The attacker may estimate a probability
Attributes of Mixed Types: In cases where the table to be density function by obtaining Xsub of a sufficient size. The
published has attributes of mixed types, including nu- work in [1] only assumes that X has a uniform distribution.
merical and categorical values, they can be transformed However, in many scenarios, it is difficult to obtain X or
such that the distance metrics for continuous values, e.g., have prior knowledge of its probability distribution, which
Euclidean distance, are still applicable. An attribute with makes the M2O linkage attack impractical.
M categorical values can be transformed into M binary One-to-Multiple (O2M) Linkage Attack. When the
attributes based on the method in Chapter 7.12 of [12]. attacker obtains the QIs of a victim, such as Xi , he can
perform an O2M linkage attack on Xi by ordering the
potential fits of Xi to all the randomized records in Z .
3.2 Potential Fit and Linkage Attacks The attacker will finally fit Xi to Zj if PF (Xi |Zj ) is
Let nN be the number of records in the entire data set, maximal. For example, in Fig. 2(b), the attacker orders
which is divided into n segments where each segment has PF (X1 |Z1 ), PF (X1 |Z2 ), PF (X1 |Z3 ), PF (X1 |Z4 ) and finds
N records. In Section 4.3, we demonstrate that segmenting the maximum PF (X1 |Z1 ) = 0.4.
is not compulsory and when segmenting should be per- To compare two potential fits PF (Xi |Zj1 ) and
formed. In this section, we use segmenting and focus on PF (Xi |Zj2 ), PX (Xi ) is not required because it is sufficient
a single segment, simply for making this case clear. Our to compare PR (Zj1 |Xi )/PZ (Zj1 ) and PR (Zj2 |Xi )/PZ (Zj2 )
method can be trivially extended to all segments. using Bayes’ theorem. In some randomization models,
Suppose that in a single segment the publisher has N PZ (·) is not difficult to estimate, e.g., in the linear model
records (X1 , C1 ), ..., (XN , CN ) and then randomizes them Z = XR1 + R2 , Z approaches a spherical multivariate
to (Z1 , C1 ), ..., (ZN , CN ) and publishes the latter. The at- Gaussian distribution. In cases where the attacker only
tacker obtains all of the published data. From some public obtains a small subset Xsub that is not sufficient for prob-
channel, the attacker also obtains the QIs of some victims, ability density estimation, the O2M linkage attack is more
i.e., a few Xj (j ∈ {1, ..., N }). The attacker is curious about applicable than the M2O linkage attack.
their sensitive values Cj ; therefore, he measures the poten- The attacker can also combine the above two attacks to
tial fit between one particular (Zi , Ci ) (i ∈ {1, ..., N }) and perform a bidirectional linkage attack. If some fit has high
one particular Xj , denoted by F (Zi → Xj ). He determines probabilities in both attacks (e.g., F (Z1 → X1 ) in Fig.

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 5

2(a) and Fig. 2(b)), then the attacker’s confidence in the 1) As shown in Fig. 3(a), an attacker first launches
truthfulness of the fit will be increased. More details will be an M2O attack on Z2 and finds some fits in X to
discussed in Section 3.3. Z2 with high linkage probabilities. Because (k, 1)-
Note that for both M2O and O2M attacks, to calculate anonymity favors the conclusion that a high (or
the potential fit (or linkage probability), the attacker should low) linkage probability does not necessarily mean
ultimately rely on the posterior probability PF (X|Z), not a true (or false) fit, the attacker has no confidence
only the conditional probability P (Z|X). In Bayesian in- regarding which fit is true or false.
ference, the latter is the likelihood of observing Z given 2) The attacker has to use auxiliary information to
some certain QIs of X ; it does not consider the uncertainty increase his confidence, and an O2M attack may
brought by the other QIs in the generation of Z . Posterior help. Suppose that F (Z2 → X1 ) is among the fits
probability was also employed in [20] and [31] to measure with high probability in the above M2O attack. As
the linkage probability between some QIs and sensitive shown in Fig. 3(b), the attacker compares the prob-
value. ability of F (Z2 → X1 ) with F (Zj → X1 ) (j 6= 2). If
(1, k)-anonymity does not exist, F (Z2 → X1 ) may
keep a lower probability than many of the other
3.3 Probabilistic Anonymity
fits; then, the attacker will become confident in the
It is easy to see that the M2O or O2M attack is the falseness of F (Z2 → X1 ).
most fruitful if the true fit F (Zi → Xi ) has the largest 3) The attacker repeats Step 2). The more false fits that
probability among all fits F (Zi → Xj ) for j = 1, ..., N the attacker eliminates, the more he approaches the
or among all fits F (Zj → Xi ) for j = 1, ..., N , and is true one.
the least fruitful if all fits F (Zi → Xj ) for i = 1, ..., N
and j = 1, ..., N have equal probabilities. Therefore, to
protect a record Xi ∈ X (or Xi ∈ Xsub ) against an
M2O (or O2M) linkage attack, an intuitive approach is to
enforce an ordering of the list {PF (X1 |Zi ), ..., PF (XN |Zi )}
(or {PF (Xi |Z1 ), ..., PF (Xi |ZN )}) where PF (Xi |Zi ) is not
the largest. Probabilistic anonymity follows this idea and
makes this ordering be undeterministic, whereas the ex-
pected order of PF (Xi |Zi ) in the list is kept as a stable value
k ∈ (1, N ]. We have the following two types of frameworks (a) M2O Linkage Attack (b) O2M Linkage Attack
against M2O and O2M linkage attacks.
Fig. 3. An Example of Bidirectional Linkage Attack
Definition 1. Probabilistic (k, 1)-Anonymity: A record Xi has
a probabilistic (k, 1)-anonymity if the probability of the true Therefore, (1, k)-anonymity is also necessary even if
fit F (Zi → Xi ) is not greater than the probabilities of at (k, 1)-anonymity has been realized. As shown in Fig. 3(b),
least k − 1 false fits in expectation between Zi and the other with (1, k)-anonymity, the attacker cannot confirm the
records in X , i.e., falseness of F (Z2 → X1 ) after he is aware that some fits
F (Zj → X1 ) have higher probabilities than F (Z2 → X1 ).
PF (Xj |Zi ) ≥ PF (Xi |Zi ), ∃J ⊆ {1, ..., N }\i, ∀j ∈ J. (3a)
4 ACHIEVING THE P ROBABILISTIC (1, k)-
A NONYMITY
E[|J|] ≥ k − 1. (3b)
When a small Xsub is leaked to the attacker, the probabilistic
In Eq. (3a), J is the index set of all Xj with no less fit to (1, k)-anonymity is sufficient for privacy protection since
Z i than Xi . In Eq. (3b), E[|J|] is the expectation of the size the attacker can only perform an O2M attack. In this
of J. Definition 1 is the same as the probabilistic k -anonymity section, we discuss how to achieve the probabilistic (1, k)-
defined in [1], whereas the following Definition 2 has not anonymity on every record in X in the linear randomiza-
been previously addressed. tion model presented in Fig. 1. Because the O2M linkage
attack is based on the ordering of the potential fits of all
Definition 2. Probabilistic (1, k)-Anonymity: A record Xi has
randomized records to one victim Xi , in this section, we
a probabilistic (1, k)-anonymity if the probability of the true
first explore the randomness of this ordering, and we then
fit F (Zi → Xi ) is not greater than the probabilities of at
analyze how to calibrate the linear randomization model to
least k −1 false fits in expectation between Xi and the other
achieve the probabilistic (1, k)-anonymity on any Xi ∈ X .
records in Z , i.e.,

4.1 The Randomness of the Ordering List

PF (Xi |Zj ) ≥ PF (Xi |Zi ), ∃J ⊆ {1, ..., N }\i, ∀j ∈ J. (4a) In the randomization model presented in Fig. 1, when a vec-
tor X ∈ X is given, the conditional probability PR (Z|X)
E[|J|] ≥ k − 1. (4b) is a multivariate Gaussian distribution. Each entry of Z
is a linear combination of some independent Gaussian
For scenarios in which the bidirectional linkage at- random variables and follows the Gaussian distribution
tack is applicable, it is ideal to achieve probabilistic (k, k)- N (0, ||X||2 σ12 + σ22 ).
anonymity, which implies both probabilistic (k, 1)- and Without knowing the entire data set X , the marginal
(1, k)-anonymity. The following example illustrates that probability PZ (Z) can be approximately treated as a spheri-
under this attack, when only (k, 1)-anonymity is realized cal multivariate Gaussian distribution when d is sufficiently
but (1, k)-anonymity is not, the anonymity level may be small. According to [8], when a random vector X has mean
reduced. The same vulnerability applies when only (1, k)- zero, the random projection XR1 resembles a scale mixture
anonymity is realized. of spherical Gaussians, whose variance is only determined

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 6

by the distribution of X , not by the projection matrix R1 . ∀j ∈ {1, ..., N } \ i.

For simplicity, we use N (·) to denote the p.d.f of the
Guassian distribution. Then p.d.f of XR1 can be denoted Proof. First, the publisher makes a unique trial of the ran-
by N (0, σRP 2
· Id ), in which Id is an identity matrix of size domization on Xi and publishes Zi . Then, on each Xj
d, the variance σRP 2
can be estimated on multiple trials of (j = 1, ..., N, j 6= i), he makes n independent trials of ran-
randomizing X into XR1 , and the average variance of the domizing Xj to Zj and chooses one of the trials to publish.
d dimensions in XR1 can be treated as σRP 2
. Overall, he publishes N −1 independent trials. We compute
It is easy to see that if X has a mean vector (µ1 , ..., µm ), the mean number of the outcomes PF (Xi |Zj ) ≥ PF (Xi |Zi )
then XR1 has a distribution of N 0, (σRP 2
+ σ12 Σni=1 µ2i ) · Id in these trials as follows. 2
σRP
2 2 n 2 Suppose that ||Xi ||2 ≥ 2 . By Lemma 1,
and XR 1 +R2 has a distribution of N 0, (σRP +σ1 Σi=1 µi +
σ 1
PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) = PO (||Zj ||2 ≥ ||Zi ||2 ). Let

2
σ2 )· Id . In this paper, for simplicity, we suppose that X has
been centered by removing its mean vector. For cases where this probability be pi,j . We denote the order between ||Zj ||2
X has a non-zero mean vector, our results in the following and ||Zi ||2 by a random variable ei,j , which is 1 when
can be easily extended. ||Zj ||2 ≥ ||Zi ||2 and 0 otherwise. Then, {ei,j,1 , ..., ei,j,n } for
the n trials on randomizing Xj is a Bernoulli process. The
Observation 1. ∀i ∈ {1, ..., N }, PX (Xi ) is not required number of its values being 1 follows Bernoulli distribution
to order {PF (Xi |Z1 ), ..., PF (Xi |ZN )}. ∀j ∈ {1, ..., N } \ i, B(n, pi,j ), and the mean of this number is npi,j .
P (Zj |Xi )
by Eq. (2), PF (Xi |Zj ) ≥ PF (Xi |Zi ) ⇐⇒ RPZ (Z j)
≥ There are a total of N − 1 independent trials chosen by
PR (Zi |Xi ) the publisher. In each trial, the probability of ei,j = 1 is
PZ (Zi ) .
pi,j . Therefore, in the N − 1 trials, the mean of the total
The following Lemma 1 elaborates the randomness of j6=i
X
the order between PF (Xi |Zj ) and PF (Xi |Zi ), which is number of ei,j = 1 is pi,j , which is actually the
identical to the randomness of the order between ||Zj ||2 j∈{1,...,N }
and ||Zi ||2 . mean number of times that ||Zj ||2 ≥ ||Zi ||2 or PF (Xi |Zj ) ≥
Lemma 1. The probability of the event PF (Xi |Zj ) ≥ PF (Xi |Zi ) occurs.
PF (Xi |Zi ), i.e., PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) = To achieve the probabilistic (1, k)-anonymity, this mean
number should be at least k − 1, as defined in Eq. (4b). If
σ2
(
PO (||Zj ||2 ≥ ||Zi ||2 ), if||Xi ||2 ≥ σRP 2
every pi,j ≥ (k − 1)/(N − 1), then it is easy to see that
1 (5) j6=i
PO (||Zj ||2 ≤ ||Zi ||2 ), else
X
pi,j ≥ k − 1.
Proof. Because PR (Zj |Xi ) = N 0, (||Xi ||2 σ12 + σ22 ) · Id , in

j∈{1,...,N }
which Id is an identity matrix of size d, log PR (Zj |Xi ) = The above conclusion can be trivially extended to the
||Z ||2 σ2
− 21 ||Xi ||2jσ2 +σ2 + c1 , in which c1 = − d2 log 2π(||Xi ||2 σ12 + case in which ||Xi ||2 < σRP
2 . This concludes the proof.

1
1 2
σ22 ) .

Theorem 1 presents the formula for calculating the
2
+ σ22 ) · Id , log PZ (Zj ) =

Because PZ (Zj ) = N 0, (σRP probability PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) when Xi , Xj ,
||Z ||2 and some parameters in the randomization model ( σ1 , σ2 ,
− 12 σ2 j+σ2 + c2 , in which c2 = − d2 log 2π(σRP 2
+ σ22 ) .

2
RP
||Z ||2 σRP and R1 ) are given.
Similarly, log PR (Zi |Xi ) = − 21 ||Xi ||2iσ2 +σ2 + c1 , and
1 2
2 Theorem 1. When R1 is generated as a ran-
log PZ (Zi ) = − 12 σ||Zi ||
2 +σ 2 + c2 . Then, Xi R1 R1T XiT
RP 2 dom m × d matrix, let λi = σ2
and
2
PR (Zj |Xi ) PR (Zi |Xi ) Xj R1 RT X T
1 j
PF (Xi |Zj ) ≥ PF (Xi |Zi ) ⇐⇒ ≥ λj = σ22
. ∀i ∈ {1, ..., N } and ∀j ∈ {1, ..., N } \ i,
PZ (Zj ) PZ (Zi )
PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) =
⇐⇒ log PR (Zj |Xi ) − log PZ (Zj ) ≥
2
(
log PR (Zi |Xi ) − log PZ (Zi ) 1 1 2 σRP
2 + π I(λi , λj ), if||Xi || ≥ σ12 (6)
2
||Zj || ||Zj ||2 1 1
⇐⇒ − + 2 ≥ 2 − π I(λi , λj ), else
2 2
||Xi || σ1 + σ22 σRP + σ22
in which I(λi , λj ) is a definite integral as follows:
||Zi ||2 ||Zi ||2 ˆ π
− 2 2 + 2 .
2
||Xi || σ1 + σ2 σRP + σ22 I(λi , λj ) =
2 1 2
e− 2 (λj +λi ) sin θ · csc θ · cosd−1 θ·
2 0
Therefore, when ||Xi ||2 ≥
σRP
, PF (Xi |Zj ) ≥ (7)
σ12 λj − λi
2
σRP sin( sin 2θ) · dθ.
PF (Xi |Zi ) ⇐⇒ ||Zj ||2 ≥ ||Zi ||2 . When ||Xi ||2 < σ12
, 4
PF (Xi |Zj ) ≥ PF (Xi |Zi ) ⇐⇒ ||Zj ||2 ≤ ||Zi ||2 . The proof of this theorem is postponed to the Appendix
A.
4.2 A Sufficient Condition for Probabilistic (1, k)- By Theorem 1, the calibration of σ2 (which thus ad-
Anonymity justs the values of λi and λj ) is critical for achieving the
The following Lemma 2 contributes a sufficient condition probabilistic
(1, k)-anonymity
on all records in X . We set
for achieving the probabilistic (1,k)-anonymity on a certain k ∈ 2, (N − 1)/2 + 1 and discuss how to enable the
Xi ∈ X . calibration to achieve this anonymity level. This range of
levels is adequate for practical applications.
Lemma 2. Given an anonymity k ≥ 2, a sufficient condition
for achieving the probabilistic (1,k)-anonymity on Xi ∈ X Theorem 2. Given an anonymity of 2 ≤ k < (N − 1)/2 +
is to make each of the probabilities of events PF (Xi |Zj ) ≥ 1, to achieve the probabilistic (1, k)-anonymity on all Xi ,
PF (Xi |Zi ) no less than (k − 1)/(N − 1), i.e., I(λi , λj ) in Eq. (7) should satisfy the following:
k−1 k−1 1 1 k−1
PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) ≥ , ( − )π ≤ I(λi , λj ) ≤ ( − )π. (8)
N −1 N −1 2 2 N −1

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 7

2
σRP
Proof. By Lemma 2 and Theorem 1, when ||Xi ||2 ≥ σ12
, such that all λ1 , λ2 , ..., λN are bounded within (0, λmax ].
we should have I(λi , λj ) ≥ k−1
(N − 1 Eq. (9) in the following gives the lower bound of σ2 . When
−1 2 )π . Similarly, when
2
σRP σ2 ≥ σ2,min , λi = Xi R1 R1T XiT /σ22 ≤ λmax for i = 1, ..., N .
||Xi || < 2
, we should have I(λi , λj ) ≤ ( 21 − N
σ12
k−1
−1 )π . q
k−1 1 1 k−1
Because k < (N − 1)/2 + 1, then ( N −1 − 2 )π <( 2 − N −1 )π . σ2,min = M/λmax , in which (9a)

M = max(X1 R1 R1T X1T , ..., XN R1 R1T XN

T
). (9b)
4.3 Calibration of the Parameters R1 and σ2
In summary, for a set X consisting of N records, the follow- In the following, Lemma 3 provides a theoretical proof
ing method can be employed to achieve the probabilistic of the monotonicity of I(λi , λj ) when λi and λj are less
(1, k)-anonymity on all of the records: than 9.2. The value of 9.2 is used to provide an easy proof of
the theorem. Lemma 3 applies to most applications because
1) Select a value for the anonymity k ; 9.2 is generally large enough to be λmax . When the two
2) Find λmax from the contour map of the numeric parameters are greater than 9.2, our experimental results
integration in Eq. (7) such that I(λi , λj ) on any pair prove that the same monotonicity still holds, as shown in
k−1 1 1
of λi and λj is bounded within [( N −1 − 2 )π, ( 2 − Fig. 4(a).
k−1
N −1 )π]; Lemma 3. In Eq. (7), if the variable λj is held constant,
3) Select a value for σ1 , and then construct the random
then I is monotonically decreasing with the other variable λi
matrix R1 by generating m × d random variables
(λi ∈ (0, 9.2]). If λi is held constant, then I is monotonically
following N (0, σ12 );
increasing with λj (λj ∈ (0, 9.2]).
4) Compute M = max(X1 R1 R1T X1T , ..., XN R1 R1T XN T
)
2 The proof of Lemma 3 is postponed to the Appendix B.
and set σ2 = M/λmax ;
5) Generate a random vector R2 , randomize each
Xi R1 to Xi R1 + R2 , and then publish it.
15

Time Cost. Suppose that the size of the entire data set 1.5
Contour Line at I=−1.26
is nN , which can be divided into n segments, where each 1
10
Integration I(λi,λj)

0.5
segment has N records. In fact, Step 1) ∼ 3) can be off-line, 0
λmax=8.77

λi
which means that k , λmax and R1 can be configured before −0.5

−1
5
the randomization begins. For each segment of size N , −1.5
15

Steps 4) and 5) are executed, and both of these steps have a 10

10
15
Contour Line at I=1.26
5

time cost of O(mdN ). The total time cost for randomizing λi 0 0

λj
0
0 5
λj
λmax=8.77
10 15

the entire data set is O(mdnN ), which is linear in regards (a) Numeric Integration (b) The Contour Lines for I =
to the total number of records (i.e., nN ). I(λi , λj ) 1.26 and I = −1.26
The detailed numerical procedure of our method is
given in the following. Fig. 4. Numeric Integration and Examples of the Contour Lines
By Theorem 2, any pair of λi and λj (i 6= j ) selected
from {λ1 , λ2 , ..., λN } should satisfy Eq. (8). To restrict It should be noticed that in Theorem 2 the anonymity
I(λi , λj ) within the required range, by Theorem 1, λi and level k < (N − 1)/2 + 1. For the case where k ≥
λj need to be bounded, which means that R1 and σ2 (N − 1)/2 + 1, it is not easy to make every record Xi
should be calibrated. Our method consists of using numeric satisfy the probabilistic (1, k)-anonymity. The reason is, by
σ2
integration to calculate I , finding two contour lines at Theorem 1, when ||Xi ||2 ≥ σRP 2 , I(λi , λj ) should be within
1
k−1 1 1 k−1
−1 − 2 )π and Imax = ( 2 − N −1 )π , and then
Imin = ( N k−1 1 π σ2
( N −1 − 2 )π, 2 , whereas when ||Xi ||2 < σRP

2 , I(λi , λj )
bounding λi and λj inside the two lines. π 1 k−1
1

It can be observed that regardless of how R1 and σ2 should be within − 2 , ( 2 − N −1 )π . The intersection of the
are calibrated, all λ1 , λ2 , ..., λN are non-negative because two ranges is null. For example, when k = 37, N = 41, and
σ2
λi = Xi R1 R1T XiT /σ22 = ||Xi R1 ||2 /σ22 . ||Xi ||2 ≥ σRP
2 , it is not easy to calibrate σ2 to generate the
1
Indeed, the monotonicity of the function I(λi , λj ) required λi and λj such that I(λi , λj ) ∈ [1.26, π/2). How
shows that when all λ1 , λ2 , ..., λN are bounded within the to perform this calibration is not the focus of this paper.
domain (0, λmax ], I(λi , λj ) on any pair of λi and λj is also Parameter selection summary. The dimension d of R1
k−1 1 1 k−1
bounded within [( N −1 − 2 )π, ( 2 − N −1 )π]. By the following is suggested to satisfy m ≥ 2d − 1, following the result
Lemma 3, if λj is held constant, I is monotonically decreasing of [22] (Details can be found in their Theorem 4.4). The
with the other variable λi , and if λi is held constant, I is selection of σ1 can be based on their Lemma 5.2. That p is,
monotonically increasing with λj . Therefore, the contour line because E(R1 R1T ) = dσ12 I , we can choose σ1 = 1/d
k−1 1
at Imin = ( N −1 − 2 )π starts from the point at λi = λmax,1 such that E(R1 R1T ) equals the identity matrix I and
and λj = 0, and on this line, λj monotonically increases E(Xi R1 R1T XiT ) = Xi XiT ; thus, better utility of the data
with λi . Similarly, the contour line at Imax = ( 21 − N k−1
−1 )π can be ensured.
starts from the point at λj = λmax,2 and λi = 0, and on The anonymity level k can be selected by practical
this line, λi monotonically increases with λj . It is also easy requirements, e.g., 5~50. The selection of N should make
to see that λmax,1 = λmax,2 . As in the example presented k < (N − 1)/2 + 1 and facilitate an appropriate value of
in Fig. 4(b), let k = 5 and N = 41, then Imin = −1.26 λmax (1~9.2 is suggested). When N is large, we suggest
and Imax = 1.26, and the two contour lines are shown in segmenting the records into groups to retain high data
the figure. By the numeric integration, λmax,1 and λmax,2 utility. The reason is that M in Eq. (9) may be very large
should be 8.77. It is easy to see that any point in the shaded without segmenting, and then a large σ2,min is produced.
area of Fig. 4(b) satisfies Eq. (8). A small variance generally means that small random noise
After λmax is found, it is easy to calibrate R1 and σ2 . may be added. The large σ2,min is suitable for only a small
In fact, R1 can be first generated and then σ2 is bounded fraction of records (i.e., those Xi producing large values

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 8

on Xi R1 R1T XiT ) but applied to all records. By segmenting, By the following theorem, we need to calibrate the
σ2,min can be determined individually for each group. values of σ1 and σ2 to realize the required occurrence
In addition, the segmenting does not need to consider probability for (k, 1)-anonymity.
relationships among the records; thus, the most convenient
Theorem 3. let Gσ1 ,σ2 (Xi , Xj ) be a benchmark function
way is grouping by their collection order.
on any two records Xi , Xj in X whose Euclidean norms
The simulation of successful O2M attacks. When the
are not equal (||Xi ||2 6= ||Xj ||2 ,i 6= j ), defined as follows:
lower bound on σ2 as shown in Eq. (9) is not abided
by, O2M attacks could succeed. We only demonstrate the
attack for k = 2, and the other cases for k > 2 can d (||Xj ||2 σ12 + σ22 )
Gσ1 ,σ2 (Xi , Xj ) = ·
be trivially derived. Without knowing the bound, a data 2 (||Xi ||2 − ||Xj ||2 )σ12
owner would select an arbitrary σ2 for randomization, but ||Xi ||2 σ12 + σ22
it could be very likely for him to select a value such that · log .
||Xj ||2 σ12 + σ22
σ22 < M/λmax . When k and N are given, λmax is a fixed
value that bounds I(λi , λj ) on each pair (λi , λj ) within When X follows the uniform distribution, a sufficient
k−1 1 1 k−1
−1 − 2 )π, ( 2 − N −1 )π]. In fact, λmax is unnecessary
[( N condition to achieve (k, 1)-anonymity is that the values σ1
to be large (say 50 ∼ 100) to suit any value of k (k ≥ 2) and and σ2 should be assigned to make the benchmark function
commonly used N (e.g., N = 10000). However, M could satisfy the following:
be very large because it is the maximal value in ||Xi R1 ||2
for i = 1, ..., N . Therefore, it is common for M/λmax > 1, min Gσ1 ,σ2 (Xi , Xj ) ≥ Qχ2 , k−1 , (10)
N −1

and it is also common to select a σ2 less than 1. Then, in max Gσ1 ,σ2 (Xi , Xj ) ≤ Qχ2 , N −k , (11)
N −1
the data set {Xi |i = 1, ..., N }, there is a subset that satisfies
σ22 λmax < ||Xi R1 ||2 ≤ M . It is this subset (denoted by in which Qχ2 , k−1 and Qχ2 , N −k are respectively the (k −1)-
N −1 N −1
X 0 ) that is vulnerable for the O2M attack because for each th and (N − k)-th (N − 1)-quantiles of the chi-squared
2
Xi ∈ X 0 , its corresponding λi = ||XiσR21 || > λmax . For distribution with freedom of degree d. Let Xmin_norm
2
each data Xi ∈ X 0 , there could be some other data Xj that and Xmax_norm be records in X with the smallest and
makes (λi , λj ) fall on a contour line with I < I(λmax , 0) largest Euclidean norms. Then, min Gσ1 ,σ2 (Xi , Xj ) is ob-
in Figure 4; then, by Theorem 2 and Lemma 2, the proba- tained at Xi = Xmax_norm and Xj = Xmin_norm , and
bilistic (1, 2)-anonymity (k = 2) could not be achieved. In max Gσ1 ,σ2 (Xi , Xj ) is obtained at Xi = Xmin_norm and
Section 8.1, we demonstrate that when σ2 is too small, the Xj = Xmax_norm .
percentage of X 0 will be high. The proof of this theorem is postponed to the Appendix
C. In brief, this theorem provides two inequalities as shown
5 ACHIEVING THE P ROBABILISTIC (k, 1)- in Eq. (10) and (11) to constrain σ1 and σ2 , and there are
only two variables (σ1 and σ2 ) in the two inequalities,
A NONYMITY which simplifies the calibration.
As discussed in Section 3.2, it is difficult for the attacker Achieving the probabilistic (k, k)-anonymity. σ1 and
to perform a successful M2O linkage attack. The reason σ2 can be calibrated to simultaneously satisfy the require-
is the attacker generally possesses only a small subset ments of probabilistic (1, k)- and (k, 1)-anonymity such
of X and cannot accurately determine the required prior that (k, k)-anonymity is achieved.
knowledge on the p.d.f of X . Similarly, the realization of
(k, 1)-anonymity also requires the same prior knowledge
on X . In this section, we discuss one representative case 6 S ECURITY A NALYSIS
used in [1]: X follows a uniform distribution. This case As mentioned in Section 1, 3 types of attacks can gener-
is particularly applicable when the attacker has no prior ally be deployed on randomization-based methods. The
knowledge on PX (X ) and thus assumes that X follows a previous sections discussed the prevention of linkage-type
uniform distribution. For the other types of distributions, attacks, and thus, they will not be discussed here.
the realization method can be similarly derived. For reconstruction attacks, there are two types, as sum-
In the same way as (1, k)-anonymity, to achieve (k, 1)- marized in [34]: one is based on principal component
anonymity on every Zi , we need to reduce the probability analysis (PCA), and the other is based on maximum a
of F (Zi → Xi ) and increase the probability of F (Zi → Xj ) posteriori (MAP). The prevention of PCA-based attacks
(j 6= i) to ensure that some false fits always exist that will be discussed in Section 8.5. The prevention of MAP-
have higher probabilities than the true one. By the same based attacks is actually implied in our achievement of
paradigm in Lemma 2, a sufficient condition for (k, 1)- probabilistic anonymity. MAP-based attacks aim to recover
anonymity is as follows: the record X given its perturbed version Z , by finding the
optimum X̃ in the optimization problem max P (X|Z) and
X

PO PF (Xj |Zi ) ≥ PF (Xi |Zi ) ≥ (k − 1)/(N − 1), treating it as an estimate of X . There are two obstacles to
finding the optimum X̃ . First, the p.d.f of X should be a
in which PO PF (Xj |Zi ) ≥ PF (Xi |Zi ) is the occurrence priori knowledge and input for the problem. Second, due to
probability of the event PF (Xj |Zi ) ≥ PF (Xi |Zi ). the hybrid randomization of our method, the optimization
When X follows the uniform distribution, i.e., problem is equivalent to
PX (Xi ) = PX (Xj ), by Bayes’ theorem, the event
PF (Xj |Zi ) ≥ PF (Xi |Zi ) is equivalent to PR (Zi |Xj ) ≥
PR (Zi |Xi ). Then, another sufficient condition for (k, 1)- max P (X, R1 , R2 |Z), s.t. Z = X · R1 + R2 ,
X,R1 ,R2
anonymity is as follows:
which is high dimensional and difficult to solve. Even if
the optimum X̃ is found by overcoming the obstacles, it
PO ( PR (Zi |Xj ) ≥ PR (Zi |Xi ) ) ≥ (k − 1)/(N − 1). is highly likely to be different from the true record X ,

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 9

because X̃ is with the highest posterior probability P (X̃|Z) anonymity: Suppose that there are t classes for the sensi-
whereas the true record X is with a considerably lower tive attributes in the overall table, denoted by c1 , ..., ct .
posterior probability P (X|Z). This fact is generated by the Their probabilities are denoted by pc1 , ..., pct . By (1, k)-
probabilistic anonymity, in which the true fit PF (Xi |Zi ) anonymity, Xi ∈ X has the statistical anonymization group
is artificially lowered than (k − 1)-on-average false fits with size Oi . In this group, let c0l denote the subset of
PF (Xj |Zi ). records labeled by cl (l = 1, ..., t). Then, the statistical se-
In the remainder of this section, we focus on the semantic closeness of (1, k)-anonymity for class cl is defined
E(|c0 |)
mantic security of our method against the skewness and by Pcl /pcl , in which Pcl = E(Oli ) , |c0l | is the cardinality of
similarity attacks. As defined in [18], semantic closeness c0l .
requires the distribution of a sensitive attribute in any
anonymization group on QIs to be close to the distribution Theorem 4. Suppose that there are t classes for the sensitive
of the attribute in the overall table, and a tighter seman- attributes in the overall table, denoted by c1 ,...,ct , with
tic closeness means a higher semantic security. Generally, probabilitiespc1 , ..., pct . The lower bound of the statisti-
Earth Mover’s Distance is employed to measure semantic cal semantic closeness of (1, k)-anonymity to any class cl
closeness. Because the anonymization group in our hybrid (l = 1, ..., t) is
randomization method is quite different from the one used (k − 1)
in [18], we define a new measurement of semantic closeness .
pcl (2k − 1 − N ) + N (N − k)/(N − 1)
and demonstrate that semantic security can be achieved by
adjusting some parameters, such as N and k . The upper bound is
Because an obvious anonymization group is required (N − k)
for the skewness and similarity attacks, the M2O linkage .
attack does not provide any help for the attacker to deploy pcl (N − 2k + 1) + N (k − 1)/(N − 1)
them. This is because all records in X are linked to a single The proof of this theorem is postponed to the Appendix
randomized record in Z , but not a group of randomized D.
records. Only by the O2M linkage attack is the attacker able To prevent the skewness attack, the minority class in
to perform these attacks on a group of randomized records the overall table should not become the majority class in
after he links a record in X to the group. We define this the randomization group. To prevent the similarity attack,
group in the following: the ratio of each class in the randomization group should
be close to its probability in the overall table. By Theorem 4,
Definition 3. Statistical Anonymization Group: through an
when pc,l is very small, pc,l (2k −1−N ) and pc,l (N −2k +1)
O2M linkage attack on any Xi ∈ X , the attacker ob-
0 can be neglected, and the two bounds become approxi-
tains a list of records Z arranged in decreasing order of (N −1)(k−1) (N −1)(N −k)
mately N (N −k) and N (k−1) , respectively. Therefore,
F (Zj → Xi ) (j = 1, ..., N ). Let Oi be the order of Zi ,
the true randomized record of Xi . E(Oi ) is the expected N (the size of each segment) and k can be adjusted such
value of Oi . Then, the statistical anonymization group for that the upper bound is not very high (to prevent the
Xi consists of the randomized records from the 1st to the skewness attack), or such that the two bounds are close
0
E(Oi )-th record in Z . to 1 (to prevent the similarity attack), as we show in the
experiments presented in Section 8.3.
Assuming that a statistical anonymization group has
been accurately estimated for Xi , the attacker can analyze
the sensitive values inside the group. If these values are 7 M INING THE DATA IN THE P ROBABILISTIC
dominated by a single value or by some semantically sim- A NONYMITY
ilar values, then the attacker will be highly confident that In this section, we re-design the K -nearest neighbors (K-
the victim with Xi is also with the same single value or the NN) algorithm for data classification. In this algorithm, the
semantically related values. However, by our probabilistic training data set is the randomized records Z . The test data
(1, k)-anonymity, these skewness and similarity attacks can consist of all QI attributes of an individual, denoted by
rarely be successful due to the following two reasons. record Xt . His sensitive value is unknown, which should
First, the attacker cannot make an accurate estimation be the output of the K-NN algorithm as a class label.
of the order of Zi , denoted Oi , which is random and In our algorithm, the original record set X is divided
depends on the occurring number of events F (Zj → Xi ) ≥ into n segments, and each segment has N records. This
F (Zi → Xi ). Even the mean value E(Oi ) is uncertain partition is suitable for parallel processing, in which the
for the attacker. As we discussed in the Proof of Lemma corresponding randomized records of different segments
j6=i
X are distributed to different processors. It is also suitable
2, E(Oi ) = pi,j , in which pi,j is the occurrence
for processing data at collection time, and each time a
j∈{1,...,N }
probability of F (Zj → Xi ) ≥ F (Zi → Xi ). By Theorem 1, segment is processed. In the randomization model, on
2
pi,j is determined by λi and λj , which are related to many different segments, different variances σ1,I (I = 1, ..., n) are
2
2
factors, such as Xi , Xj , R1 , σRP ,σ1 and σ2 . employed for R1 , and different variances σ2,I (I = 1, ..., n)
Second, inside those sensitive attributes of the statisti- are employed for R2 . The parameters σ1,I and σ2,I for the
cal anonymization group, a statistical semantic closeness n segments are published to the data miner. However, the
exists, i.e., the fraction of any class in the group is close anonymous level k in each segment is not published to
to the probability of this class in the overall table. This increase the difficulty of the attacks.
type of closeness is described in the following Definition In the training phase of our K-NN algorithm, PZ (·) is
4. Theorem 4 provides the lower and upper bounds of the treated as a spherical multivariate Gaussian distribution,
closeness, which means that the closeness can be calibrated which means that each dimension of Z is identically and in-
by the parameters k and N . dependently distributed as as a Gaussian distribution. The
mean of this Gaussian distribution is 0, and the variance
Definition 4. Statistical Semantic Closeness of (1, k)- can be easily estimated on all data Z .

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 10

K is a predetermined value of the miner. Suppose that 8 E XPERIMENTS

there are L class labels. In the traditional K-NN algorithm, 8.1 Achieving Probabilistic (1,k)-Anonymity
the miner should find the K -nearest neighbors of Xt by
We use a synthetic data set and two real data sets (“breast
some distance metric (e.g., Euclidean distance) and then
tissue data set” and “blood transfusion service center data
label Xt with the majority class within the K -nearest neigh-
set”) from the UCI Machine Learning Repository. The
bors. In our method, Xt is not sanitized by the method
breast tissue data set has 106 instances and 10 attributes.
of probabilistic anonymity, and the distance metrics are
The blood Transfusion data set has 748 instances and 5
replaced by the potential fit of Xt to the randomized data.
attributes. Although the numbers of instances are not high,
We first find the K -largest potential fits, and then label Xt
the two data sets are appropriate for comparisons because
with the majority class in those potential fits. Specifically,
their attributes are all numerical, and we can employ mul-
the classification is as follows:
tiple independent trials to average out errors introduced
1) In each segment I (I = 1, .., n) of Z , the K -largest by the small sizes of the data sets. For experiments on
potential fits PF (Xt |Zi ) can be found by ordering high-volume and high-dimensional data, we use synthetic
||Zi ||2 , as shown in Lemma 1. If ||Xt ||2 ≥ σRP
2 2
/σ1,I , data. The programming environment is MATLAB. Our
then the K -largest potentials fits are Zi with the method in Section 4 is employed to achieve the probabilistic
K -largest Euclidean norm. If ||Xt ||2 < σRP 2 2
/σ1,I , (1, k)-anonymity over these data sets. All experiments were
then the K -largest potentials fits are Zi with the conducted on an Intel Core i5 3.1 GHz PC with 4 GB RAM.
K -smallest Euclidean norm. σRP can be estimated Fig. 5 presents the experiments for the synthetic data.
by the data miner, and it is not necessary to be The data are generated from a multivariate Gaussian dis-
provided by the publisher to reduce the latter’s tribution. The parameters are set as N = 100, n = 500,
2
overhead. Let σZ,I be the estimated variance of m = 10, d = 5, and σ1 = 0.4; therefore, the data set
2 2 2 size is nN = 50, 000. In Fig. 5(a), the anonymity level
PZ (·); then, σRP = σZ,I − σ2,I .
2) Given nK randomized records consisting of the K - k is increased from 5 to 45. At each level, we compute
largest potential fits in each segment, the K -largest σ2,min by Eq. (9), anonymize the records with R1 and
potential fits PF (Xt |Zi ) among these nK records R2 , and compute the number of records whose anonymity
should be chosen. For the Zi,I in segment I , the level is k . The y-axis is the percentage of records that
selection criterion is based on Bayes’ theorem:
achieve an anonymity of k averaged from 10 trials. The
log PF (Xt |Zi,I ) = log PX (Xt ) + log PR (Zi,I |Xt ) figure shows that the larger k is, the less achievable is the
− log PZ (Zi,I ) probabilistic (1, k)-anonymity. In particular, when k > 25,
||Zi,I ||2 the percentage is not higher than 50%, which is consistent
= log PX (Xt ) + c1 − 2 2 with the situations where k > (N − 1)/2 + 1 illustrated in
||Xt ||2 σ1,I + σ2,I
Section 4.
||Zi,I ||2 The percentage of records that achieve a required level
− c2 + 2 2
,
σRP + σ2,I k is impossible to be 100% in one single trial because the
(12a)
anonymization is random and the result is not human
d controllable. This does not weaken the strength of our
c1 = − log 2π(||Xt ||2 σ1,I
2 2

+ σ2,I ) , (12b)
2 anonymization method because what our method follows
is an average anonymity level out of multiple trials, as
d 2 2
c2 = − log 2π(σRP + σ2,I ) . (12c) shown in Definition 2. In addition, achieving 100% on
2
records for a given k is unnecessary. In fact, the view that
In the selection, log PX (Xt ) can actually be treated the attacker gleans is always obscure: when an ordering
as a constant. list of the potential fits is generated from an M2O or O2M
3) Xt is labeled with the majority class in the K - linkage attack, the order of PF (Xi |Zi ) in the list is still
largest potential fits. Within the K -largest potential random. Its order is neither the maximal nor the k -th
fits to Xt , suppose that CJ has the maximal class largest. The attacker only knows that PF (Xi |Zi ) is within
members among the L classes Ci (i = 1, ..., L); the top-k largest fits in expectation.
then, Xt is classified by a label CJ . In Fig. 5(b), we make k = 10 and compute σ2,min by Eq.
Time cost. The time cost of Step 1) is O(nN ). Step 2) has (9). We change σ2 from 0.1 · σ2,min to 5 · σ2,min and show
a time cost of O(nK). Step 3) has a time cost of O(K). the percentage of records whose anonymity level is 10. This
Thus, considering the entire data set at the size of nN , our figure shows that when σ2 < σ2,min , the percentage of
classification algorithm is very efficient. The algorithm is records that achieve the probabilistic (1, 10)-anonymity is
also easy to be parallelized. not high. This corresponds with our conclusion in Section
Prevention of class label leakage in test data. An 4.3 that when σ2 > σ2,min , any pair (λi , λj ) will fall
attacker cannot implement this classification algorithm be- within the squared range from (0, 0) to (λmax , λmax ) on the
cause he does not know the relevant parameters. However, (λi , λj ) plane, and thus, I(λi , λj ) will be bounded within
the miner may also not be trustworthy, and they may take the required domain.
advantage of knowing the sensitive class of test data Xt Fig. 6 presents the experiments for the two real data sets.
by executing the K-NN algorithm on it. One solution to The parameters are set as N = 100 and σ1 = 0.4. m and n
this problem is to move the crucial steps of determining are determined by the number of attributes and instances
the class label to the data owner. In the above algorithm, in the two data sets, and d = b m+1 2 c. From Fig. 6(a) and
Steps 2 and 3 are moved to the data owner. Step 1 is Fig. 6(b), we can draw the same conclusions as those of Fig.
still executed by the miner, without causing too much 5(a) and Fig. 5(b).
2 2
computation overhead on the owner. Values σRP /σ1,I are
2 2 2
published to the miner, while σRP , σ1,I , σ2,I are not, such 8.2 Achieving Probabilistic (k,k)-Anonymity
that the miner can only execute Step 1 but not Steps 2 and We achieve both (1, k)- and (k, 1)-anonymity on a synthetic
3. data set, as shown in Fig. 7. The data are generated from a

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 11

0.9 0.85 0.9 0.9

(1,k)−anonymity (1,k)−anonymity
0.88 (k,1)−anonymity (k,1)−anonymity

Percentage of records achieving 10−anonymity

0.8
Percentage of records achieving k−anonymity

Percentage of records achieving k−anonymity

0.85
0.8 0.86
0.7
0.8
0.84
0.6
0.75
0.75 0.82
0.5
0.8 0.7
0.4
0.7 0.78
0.65
0.3
0.76
0.6
0.2
0.65 0.74

0.1 0.55
0.72

0 0.7 0.5
5 10 15 20 25 30 35 40 45 0 1 2 3 4 5 5 10 15 5 10 15
k × σ2,min k k

(a) Percentage of records achiev- (b) Percentage of records achiev- (a) Breast tissue data set (b) Blood transfusion data set
ing (1,k)-anonymity when k is ing (1,10)-anonymity when σ2 is
changed changed Fig. 8. Experiments on Probabilistic (k,k)-Anonymity for Two Real Data
Sets
Fig. 5. Experiments on Probabilistic (1,k)-Anonymity for Synthetic Data
Set

these data sets. The three data sets are the same as those
0.9
Breast Tissue
Transfusion
0.85
used in Section 8.2. The records in the synthetic data set
Percentage of records achieving 10−anonymity

0.8
Percentage of records achieving k−anonymity

0.7
0.8 are labeled with two class labels with a ratio of 9:1. In Fig.
0.6
0.75
9(a) and Fig. 10(a), for each given k , the statistical semantic
0.5

0.4
closeness (as defined in Definition 4) of the smallest class
0.3
0.7
Breast Tissue
(with the fewest members) is shown. These two figures
Transfusion
0.2
0.65 illustrate that the average percentage of the smallest class
0.1
in the anonymization groups is close to its probability in
0
5 10 15 20 25
k
30 35 40 45 0 1 2
× σ2,min
3 4 5
the overall table, with only 0.5 ∼ 1 times of increment.
(a) Percentage of records achiev- (b) Percentage of records achiev-
ing (1,k)-anonymity when k is ing (1,10)-anonymity when σ2 is We also employ Earth Mover’s Distance (EMD) to com-
changed changed pare our hybrid randomization with SABRE in [3] and Diff-
Gen in [26]. Generally speaking, given the same anonymity
Fig. 6. Experiments on Probabilistic (1,k)-Anonymity for Two Real Data level, the smaller EMD that an anonymization method
Sets generates over a group, the better semantic closeness that
it achieves. The comparisons are not very visual because
uniform distribution, with parameters N = 100, n = 500, the latter two methods are quite different from ours. In
m = 10, and d = 5. In the randomization model of Eq. (1), our method, the anonymity level k is a predefined input,
σ1 and σ2 , which simultaneously satisfy the requirements and it is easy to obtain the corresponding EMD value
of (1, k)- and (k, 1)-anonymity, are employed. Specifically, a on each group. SABRE is a k -anonymity-based method,
group of candidate σ1 and σ2 that satisfy (1, k)-anonymity which groups records into equivalence classes to satisfy a
are first determined and then verified by the requirement of given EMD; therefore, the EMD value is a predefined input.
(k, 1)-anonymity as shown in Theorem 3. Fig. 7 shows for DiffGen is based on -differential privacy, and the privacy
each given k the average percentage of records on which parameter is an input.
both (1, k)- and (k, 1)-anonymity are achieved. When k is In our experiments on SABRE, when the given value
increased, the average percentage is reduced. of EMD varies from 0.1 to 1, the anonymity level (i.e.,
Experiments on high-dimensional synthetic data are the average size of equivalence groups) remains a constant
also performed. m varies from 20 to 50, and d = b(m + value, that is, as 9, 6 and 9 on the synthetic, breast tissue
1)/2c. The other parameters are k = 10, N = 100, and and transfusion data sets, respectively. In our experiments
n = 500. Subsequently, 85%~90% of records can achieve on DiffGen, the average size of equivalence groups varies
(10, 10)-anonymity. with the corresponding average EMD. Fig. 9(b) and Fig.
10(b) show these variations. These two figures illustrate
0.96
(1,k)−anonymity
that on the same anonymity level, our method achieves
0.94 (k,1)−anonymity
better semantic closeness.
Percentage of records achieving k−anonymity

0.92

0.9
Experiments on high-dimensional synthetic data are
0.88

0.86 also performed. When m varies from 20 to 50, the average

0.84
EMDs of our hybrid method and DiffGen are 0.29 and 0.57,
0.82

0.8
respectively.
0.78

0.76
5 10 15 20
k

8 0.8
Actual Closeness

Fig. 7. Experiments on Probabilistic (k,k)-Anonymity for Synthetic Data 7

Lower Bound
Upper Bound
0.7

Set 6 0.6
Earth Mover’s Distance
Semantic Closeness

5 0.5

Fig. 8 presents the experiments of (k, k)-anonymity on 4 0.4 Hybrid Randomization on Synthetic Data
DiffGen on Synthetic Data

the two real data sets with parameter of N = 100. From 3 0.3

Fig. 8(a) and Fig. 8(b), the same conclusions can be drawn 2 0.2

as those of Fig 7. The (k, 1)-anonymity is achieved by 1 0.1

0 0
assuming that the data sets follow uniform distributions. 5 10
Anonymity Level: k
15 20 4 6 8 10 12 14
Anonymity Level
16 18 20 22

(a) Semantic closeness (b) EMD comparison

8.3 Comparisons of Semantic Closeness
Fig. 9. Experiments on Semantic Closeness on Synthetic Data Set
We analyze the semantic closeness for one synthetic and
two real data sets after (k, k)-anonymity is achieved on

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 12

Experiments are also performed on high-dimensional

0.75
2
Hybrid Randomization on Transfusion Data
Hybrid Randomization on Breast Tissue Data synthetic data with m varying from 20 to 50. The average
0.7 DiffGen on Transfusion Data

0.65
DiffGen on Breast Tissue Data rates of correct classification for our hybrid method, [30],
[26] and [3] are 82.7%, 77.5%, 71.2%, and 65.1%, respec-

Earth Mover’s Distance

1.5
Semantic Closeness

0.6

0.55
tively.
Breast Tissue

1
Transfusion
0.5
We also conduct similar experiments in clustering sce-
0.45 narios. The results are shown in Fig. 11(d), Fig. 11(e) and
0.4 Fig. 11(f). For SABRE, the average rates of correct classifi-
0.5
5 10
Anonymity Level: k
15 20 0.35
5 10 15 20
cation are 70.1%, 57.4%, and 69.2% for the synthetic, blood
Anonymity Level
(a) Semantic closeness transfusion and adult data, respectively. The comparisons
(b) EMD comparison
demonstrate that our hybrid randomization method retains
Fig. 10. Experiments on Semantic Closeness on Two Real Data Sets
higher utility of data for k-means clustering.
Time cost comparisons. For the transfusion and adult
data, the average time costs of our hybrid randomization
8.4 Comparisons of Data Utility are 0.35 seconds and 7.442 seconds, respectively, which is
5% of SABRE, 49.4% of DiffGen, and 53.3% of MDAV-based
We experiment using the K-NN algorithm redesigned in differential privacy.
Section 7 on one synthetic and two real data sets. The
synthetic training data set has 2 clusters, representing
two different classes. In each cluster, 90% of records are 1 1
Hybrid Randomization
DiffGen
MDAV−based Differential Privacy

assigned to the cluster’s class, and 10% of records are 0.9

0.9 Classical KNN

0.8
assigned to the other class. The parameters n, m, d are the

Rate of Correct Classification

0.8
0.7

same as in the experiments in Section 8.2. The two real data 0.7 0.6

sets are the blood transfusion and adult data sets from the 0.6
0.5

UCI ML Repository. In the adult set, categorical attributes Hybrid Randomization on Synthetic Data
0.4

0.5 DiffGen on Synthetic Data

are transformed into binary attributes based on the method MDAV−based Differential Privacy
Classical KNN
0.3

in 3.1. The probabilistic (k, k)-anonymity is achieved on 0.4

0 10 20 30
Anonymity Level
40 50
0.2
5 10 15 20 25 30
Anonymity Level
35 40 45 50

the 3 training data sets. In Fig. 11, the percentage of the (a) Accuracy of classification for (b) Accuracy of classification for
correctly classified data is shown when k varies from 5 to synthetic data by K-NN the Blood Transfusion data by K-
NN
50. When k is increased, N is also changed to generate an
appropriate λmax (say, 8~9). Therefore the result in Fig.5(a) 1 1

and 6(a), i.e., the larger k is, the less achievable the prob- 0.9
0.9

abilistic anonymity, is not applicable here because N does 0.8

Rate of Correct Classification

0.8

not change in these two figures. 0.7

0.6 0.7
On each data set, to compare with the classical K-NN 0.5

algorithm, we sanitize the training set and the testing set 0.4 Hybrid Randomization
0.6

by our probabilistic (k, k)-anonymity, and then we execute 0.3

DiffGen
MDAV−based Differential Privacy
0.5 Hybrid Randomization
DiffGen
Classical KNN MDAV−based Differential Privacy
the classical algorithm. For comparisons with the other 0.2
5 10 15 20 25 30 35 40 45 50
0.4
0 10 20 30 40 50
Anonymity Level Anonymity Level

anonymization methods, we sanitize the data set using (c) Accuracy of classification for (d) Accuracy of classification for
SABRE in [3], DiffGen in [26], and MDAV-based differential the Adult data by K-NN synthetic data by k-means
privacy in [30], and then we also execute the classical K-
NN. The variations in the rate of correct classification with 1 0.95
Hybrid Randomization
DiffGen
anonymity level are shown in Fig. 11(a), Fig. 11(b) and 0.9
0.9
MDAV−based Differential Privacy

0.85
Fig. 11(c) for each data set. The results on SABRE are now
Rate of Correct Classification

Rate of Correct Classification

0.8
0.8

shown here because the anonymity level (i.e., the average 0.7 0.75

size of equivalence groups) does not change when we select 0.6

0.7

values of EMD from 0.1 to 1. For SABRE, the average rates 0.65

0.5 Hybrid Randomization

of correct classification are 66.7%, 76.2% and 67.8% for the DiffGen
MDAV−based Differential Privacy
0.6

synthetic, blood transfusion and adult data, respectively. 0.4

5 10 15 20 25 30
Anonymity Level
35 40 45 50
0.55
5 10 15 20 25 30
Anonymity Level
35 40 45 50

The parameter in DiffGen and MDAV varies from 0.1 (e) Accuracy of classification for (f) Accuracy of classification for
to 5, whereas in the figures, the axis of anonymity level the Blood Transfusion data by k- the Adult data by k-means
means
for DiffGen and MDAV does not represent . It is not
applicable to directly align with the anonymity level k Fig. 11. Comparisons of Data Utility
of our solution on the same axis because their domains and
meanings are quite different. Space limitations also do not
allow us to use separate figures for DiffGen and MDAV.
Therefore, we compromise by computing the average size 8.5 Analysis of Security Against Reconstruction At-
of the groups in DiffGen on each value of , the average tack
classification accuracy on various values on each cluster We also conduct experiments to analyze the security of the
size of MDAV, and then aligning the group/cluster size randomized data against reconstruction attacks. There are
with k in our solution because their meanings and domains two types of reconstruction attacks, and one of them (MAP-
are similar. The corresponding accuracy rates in DiffGen based) has been analyzed in Section 6. In this section, we
cannot be linked by a curve because the average group focus on the other type, PCA-based attack. This attack does
size fluctuates with the increase of . The combination of not require too much a priori knowledge and is thus easier
our hybrid randomization method and redesigned K-NN for the attackers to deploy. It aims to separate a few uncor-
algorithm facilitates higher accuracy rates of classification related components from the randomized data, reduce the
compared with [3], [26], [30] and the classical K-NN. scaling and permutation ambiguities of the components,

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 13

and estimate the original data from these components. higher efficiency, thus well suiting the needs of large-scale
Below, we present the attack based on the general PCA data, high-dimensional data and collection time processing.
method. We also re-designed the K-nearest neighbors algorithm to
The attacker computes the covariance matrix ΣZ of Z leverage the introduced uncertainty. Experiments show that
and performs an eigenvalue decomposition: ΣZ = QDQT , our K-NN algorithm achieves higher accuracy when clas-
in which Q is an orthogonal matrix and D is a diagonal sifying anonymized data, and our transformation method
matrix whose entries are eigenvalues of ΣZ . The attacker also retains higher utility of data for traditional clustering
computes the PCA result X̃ = Z · QD−1/2 QT . It is easy to scenarios, such as k-means.
prove that the covariance matrix of X̃ , ΣX̃ , is an identity
matrix. In the attacker’s view, the j-th attribute x̃j of X̃ is
ACKNOWLEDGMENTS
a recovery of the i-th attribute of X , with some scaling and
permutation ambiguity. To reduce the scaling ambiguity, This work is supported by the National Natural Sci-
the attacker utilizes the mean (µi ) and variance (σi2 ) of ence Foundation of China grant 61202427, the 985 Project
the i-th attribute of X and computes x̂j = σi x̃j + µi . To funding of Sun Yat-sen University, Australian Research
reduce the permutation ambiguity, the attacker performs a Council Discovery Projects funding DP150104871, Scientific
statistical test on whether x̂j has a similar distribution to Research Starting Foundation for the Returned Overseas
the i-th attribute of X . Chinese Scholars from Ministry of Education of China, the
The PCA-based attack has some limitations. It can only Fundamental Research Funds for the Central Universities
separate d components, whereas there are m attributes in 2012JBZ017. The corresponding author is Hong Shen. The
X . It is not suitable for the cases where there is some authors also would like to thank the anonymous reviewers
correlation among the attributes of X . Minimizing the for their suggestions and comments.
ambiguities also requires accurate statistical information
on X . Fig. 12 shows the experiments on the 3 data sets, A PPENDIX A
which are the same as those used in the above sections. P ROOF OF T HEOREM 1
In Fig. 12(a), the mean absolute error is measured for the We firstly assume there is an Xi ∈ X , on which ||Xi ||2 ≥
reconstructions, which is the expected value of the relative 2
σRP /σ12 . We find what is required in order to achieve the
m n
1 X X xi,j − x̂i,j probabilistic (1, k)-anonymity on Xi . Then we extend our
errors, i.e., | |, given the (i, j)-th
m ∗ n i=1 j=1 xi,j findings to the case that ||Xi ||2 < σRP 2
/σ12 .
element of X̂ and X . In Fig. 12(b), the recovery rate at a By Lemma 1, if ||Xi || ≥ σRP /σ12 , for any j 6= i,
2 2

relative error of 0.5 is measured, which is the percentage PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) equals PO ||Zj ||2 ≥
of reconstructed entriesxwhose relative errors are within a ||Zi ||2 , which is the probability of the event ||Zj ||2 ≥
−x̂
i,j
#{x̂i,j :| i,j
x
i,j
|≤,i=1,...,m,j=1,...,n} ||Zi ||2 . Since this probability is related with the original
threshold , i.e., m∗n . In Fig. records Xi , Xj , and random parameters R1 , R2 , we need to
12(a), the MAE of the “breast tissue” data is not shown find their relationship and calibrate these values. Actually
because they are too large to fit within (the mean value on there can be two strategies for the calibration: simultane-
all k is 895.5). The two figures show that the PCA-based ously adjusting R1 and R2 , or only adjusting R2 based on
attack has high MAE and low recovery rate and that our a pre-determined R1 . For simplicity we employ the second
randomized data are secure against this attack. strategy.
Let Zi = Xi R1 + R2,i , Zj = Xj R1 + R2,j (j 6= i).
10
Synthetic Data
15
Synthetic Data
Let zi,1 , ..., zi,d be the entries of Zi . When R1 is pre-
9

8
Transfusion Breast Tissue
Transfusion determined and only R2 is random, the probability density
7 PZ (zi,l |Xi , R1 ) (l = 1, ..., d) is a Gaussian distribution
Mean Absolute Error

10
Recovery Rate (%)

5
with mean Xi R1,cl (R1,ci is the l-th column of R1 ), and
4
5
variance σ22 . Let zj,1 , ..., zj,d be the entries of Zj . Similarly
3

2
PZ (zj,l|Xj , R1 ) = N (Xj R1,cl , σ22 ). Then PO ||Zj ||2 ≥
1
||Zi ||2 is related with PZ (zi,l |Xi , R1 ) and PZ (zj,l |Xj , R1 )
0 0
5 10
k
15 20 5 10
k
15 20
as follows:
(a) Mean absolute error of the (b) Recovery rate at relative error
PCA-based attack of 0.5
d
X d
X
PO ||Zj ||2 ≥ ||Zi ||2 2 2

Fig. 12. Experiments on PCA-based Attack = PO zj,l − zi,l ≥0
l=1 l=1
d d
(13)
X zj,l 2 X zi,l 2
9 C ONCLUSIONS = PO ( ) − ( ) ≥0
σ2 σ2
l=1 l=1
We proposed privacy-preserving data transformation
methods in response to the requirements of strong security, In Eq. (13) when R1 is pre-determined, the distribution
high efficiency and high data utility. Our methods can of zj,l and zi,l will respectively follow PZ (zi,l |Xi , R1 ) and
achieve probabilistic (1, k)- and (k, 1)-anonymity on a lin- PZ (zj,l |Xj , R1 ). In the following we firstly find the char-
ear and hybrid randomization model to effectively prevent acteristic function of the difference between two random
d d
the attacker’s one-to-multiple and multiple-to-one link-
X zj,l 2 X zi,l 2
variables ( ) and ( ) . Then the cumulative
age attacks. Comparisons are made with the k -anonymity l=1
σ 2 l=1
σ2
method in [3] and differential privacy methods in [26] and distribution of their difference can be found by Gil-Pelaez’s
[30]. Our methods have high semantic closeness to the inversion formula ([10]).
distribution of sensitive values in the overall anonymized Suppose that the probability density functions of the
table; thus, they can prevent skewness and similarity at- two variables X and Y are respectively f and g . The
tacks. Our methods also have high security against recon- probability density of their difference Y − X can be given
struction attacks. Moreover, the proposed methods run at by the cross-correlation f ? g , therefore by the convolution

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 14

theorem the characteristic function of their difference is We substitute t by 12 tan θ, θ ∈ (0, π/2). Then dt =
1
denoted φX · φY , in which φX and φY are characteristic 2 sec2 θdθ, Eq. (19) becomes the following:
functions (or Fourier transforms) of f and g respectively, 1 1
φX is the complex conjugate of φX . PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) = + ·
2 π
d
zj,l 2 ˆ π
λj − λi
X 2 1 2
In Eq. (13) the random variable ( ) is distributed e− 2 (λj +λi ) sin θ · csc θ · cosd−1 θ · sin( sin 2θ) · dθ.
l=1
σ2 0 4
according to a noncentral χ2 distribution, with d degrees of (20)
freedom, a non-centrality parameter λj as follows: In the case that ||Xi ||2 < σRP
2
/σ12 , similarly we can get
the following:
(Xj R1,c1 )2 + ... + (Xj R1,cd )2 PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) = PO ( ||Zj ||2 ≤ ||Zi ||2 )

λj =
σ22 d d
(14) X zj,l 2 X zi,l 2
(Xj R1 )(Xj R1 )T Xj R1 R1T XjT = PO ( ) − ( ) ≤0
= 2
= . σ2 σ2
σ2 σ22 l=1 l=1
1 1
= F (0) = − ·
We use ι to denote the imaginary unit. The correspond- 2 π
ˆ π
ing characteristic function of this distribution is: 2 1 2 λj − λi
e− 2 (λj +λi ) sin θ · csc θ · cosd−1 θ · sin( sin 2θ) · dθ.
0 4
j ιλ t (21)
exp( 1−2ιt )
φj (t) = . (15)
(1 − 2ιt)d/2
A PPENDIX B
d
X zi,l 2 P ROOF OF L EMMA 3
Similarly the random variable ( ) also is dis-
l=1
σ2 To prove the monotonicity of I(λi , λj ) with respect to any
tributed according to a noncentral χ2 distribution, with independent variable λi or λj , holding the other variable
d degrees of freedom, a non-centrality parameter λi = ∂I ∂I
constant, we need to prove ∂λ i
< 0 and ∂λ j
> 0. In
Xi R1 R1T XiT ∂I
σ2
and a characteristic function as follows: the following we will only prove that ∂λj > 0. The same
2
∂I
method can be used to prove ∂λ i
< 0. From Eq. (7),
ιλi t
exp( 1−2ιt )
φi (t) = . (16) ˆ π
(1 − 2ιt)d/2 ∂I 1 2 λj +λi
sin2 θ
= cosd−1 θ · e− 2 ·
∂λj 2π 0 (22)
Then the characteristic function of the difference of the λj − λi
two random variables is cos( θ + sin 2θ ) · dθ
4
φi,j (t) = φj (t)φi (t) = φj (t)φi (−t) Let ∆λ = λj − λi . When λj , λi ∈ (0, 9.2], ∆λ ∈
exp 1−2ιt
ιλj t
exp −ιλ it

(−9.2, 9.2). In Eq. (22), the most intricate part is cos( θ +
1+2ιt ∆λ π
= ·
(1 − 2ιt)d/2 (1 + 2ιt)d/2 4 sin 2θ ) since it is not always positive when θ ∈ [0, 2 ].
We will discuss its sign by dividing the domain of ∆λ
1 ιλj t − 2λj t2 −ιλi t − 2λi t2
= · exp( ) · exp( ) into 3 sub-domains: [-2,2], (2,9.2), (-9.2,-2). Let f (θ, ∆λ) =
(1 + 4t2 )d/2 1 + 4t2 1 + 4t2 θ + ∆λ
4 sin 2θ . Then,
1 h 2(λj + λi )t2 i h (λj − λi )t i
= · exp − · exp ι
(1 + 4t ) 2 d/2 1 + 4t2 1 + 4t2 ∂f ∆λ
1 h 2(λj + λi )t 2i =1+ cos 2θ, (23)
= · exp − · ∂θ 2
(1 + 4t2 )d/2 1 + 4t2 ∂f 1
= sin 2θ. (24)
h (λ − λ )t
j i (λj − λi )t i ∂∆λ 4
cos 2
+ ι sin
1 + 4t 1 + 4t2 Case 1: When ∆λ ∈ [−2, 2] and θ ∈ [0, π2 ], ∂f
(17) ∂θ ≥ 0,
∂f
∂∆λ ≥ 0. Therefore, f ∈ [0, π/2] and cos f ∈ [0, 1], then it
∂I
Let F (x) be the cumulative distribution function of is easy to prove that ∂λ j
> 0.
d d
X zj,l 2 X zi,l 2 Case 2: When ∆λ ∈ (2, 9.2), if θ ∈ [0, 21 arccos(− ∆λ 2
)],
( ) − ( ) . By the inversion formula in [10], ∂f
σ2 σ2 then ∂θ ≥ 0, and f ∈ [0, fmax ] in which fmax varies
l=1 l=1
F (0), i.e., the probability that the difference being less than depending on ∆λ; if θ ∈ ( 12 arccos(− ∆λ 2
), π2 ], then ∂f
∂θ < 0,
0, is the following: and f ∈ [ 2 , fmax ). It is easy to get that fmax ∈ ( π2 , 3.14)
π

ˆ +∞ while ∆λ varies, which means cos f is not always positive

1 1 φi,j (−t) − φi,j (t) when θ ∈ [0, π2 ]. Therefore we need to divide the domain of
F (0) = + dt
2 2π 0 ιt θ into two sub-domains: [0, θ1 ], (θ1 , π2 ], in which θ1 satisfies
ˆ (18)
1 1 +∞ Im(φi,j (t)) f (θ1 , ∆λ) = π2 , so that on each sub-domain cos f is always
= − dt, positive or always negative.
2 π 0 t
Let g denote all the functions under the integration sign
∂I
in which Im(φi,j (t)) is the imaginary part of φi,j (t). of ∂λ j
in Eq. (22). Then
By Eq. (17) and (18), ˆ θ1 ˆ π
∂I 1 2
= (F + G), F = g · dθ, G = g · dθ.
1 1 ∂λj 2π 0 θ1
PO PF (Xi |Zj ) ≥ PF (Xi |Zi ) = 1 − F (0) = + ·
ˆ ∞ 2 π
1 h 2(λ + λ )t2 i (λj − λi )t Since F is always positive and G is always negative, we
j i
· exp − 2
· sin dt need to prove that Fmin +Gmin is positive in order to show
0
2
t(1 + 4t ) d/2 1 + 4t 1 + 4t2 ∂I
(19) that ∂λ j
is always positive. It is easy to prove that for any

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 15

δ δ
λj and λi , ∂θ ∂F
1
is non-negative, then the lower bound of is, d2 δi i−δjj log δδji should be always no more
F can be obtained by minimizing θ1 , maximizing ∆λ and than the (N − K)-th (N − 1)-quantile of
λj +λi . That is, when θ1 = 0.294, ∆λ = 9.2, λj +λi = 18.4, PO (||Zi ||2 ).
Fmin can be obtained which varies only depending on d. b) If ||Xi ||2 = ||Xj ||2 , the occurring probabili-
∂G
It is also easy to prove that for any λj and λi , ∂θ 1
is ties of PR (Zi |Xj ) = PR (Zi |Xi ) is 1.
non-negative, then the lower bound of G can be obtained
by minimizing θ1 and λj + λi , maximizing ∆λ. That is, Actually Condition 3) gives on constraint on σ1
when θ1 = 0.294, ∆λ = 9.2, λj + λi = 0, Gmin can be and σ2 . The conjunction of Condition 1) and 2) will
obtained which varies only depending on d. achieve the (k, 1)-anonymity on Xi .
For 5 ≤ d ≤ 40, we can show that Fmin + Gmin are Then we calibrate the parameters σ1 and σ2 in the
positive, which means ∂λ ∂I
> 0. randomization model Z = XR1 +R2 , for any given
j
Xi , so that ||Zi ||2 satisfies both Condition 1) and
Case 3: When ∆λ ∈ (−9.2, −2), if θ ∈
2). Given Xi , every attribute in record Zi indepen-
[0, 12 arccos(− ∆λ 2
)], then ∂f
∂θ < 0, and f ∈ [fmin , 0] in which dently and identically follows a normal distribution
fmin varies depending on ∆λ; if θ ∈ ( 21 arccos(− ∆λ 2
), π2 ],
∂f π ∂f as N (0, ||Xi ||2 σ12 +σ22 ). It is easy to prove that every
then ∂θ > 0, and f ∈ (fmin , 2 ]. Since ∂∆λ ≥ 0 for attribute of √ 1
Zi follows the standard
θ ∈ [0, π2 ], the minimum of fmin is − π2 , which means cos f 2 2 2
||Xi || σ1 +σ2

are always non-negative, and then by Eq. (22) ∂λ ∂I

> 0. normal distribution, and ||Xi ||21σ2 +σ2 ||Zi ||2 follows
j 1 2
the chi-squared distribution with d degrees of free-
A PPENDIX C dom.
P ROOF OF T HEOREM 3 Then we can treat σ1 and σ2 as constants, and
define a benchmark function Gσ1 ,σ2 (δi , δj ) on any
We need to calibrate the two parameters σ1 and σ2 in the two records Xi , Xj in X whose Euclidean norms
randomization model of Fig. 1, so that for every Xi , the are not equal (||Xi ||2 6= ||Xj ||2 ,i, j ∈ {1, ..., N }), as
following condition is satisfied: follows:
1 d δi δj δi
Gσ1 ,σ2 (δi , δj ) = · log
PO ( PF (Xj |Zi ) ≥ PF (Xi |Zi ) ) ≥ (k − 1)/(N − 1) (25) ||Xi ||2 σ12 + σ22 2 δi − δj δj
d δj δi
When X follows the uniform distribution, PX (Xi ) = = log .
2 δi − δj δj
PX (Xj ), then by Bayes theorem, PO ( PF (Xj |Zi ) ≥
PF (Xi |Zi ) ) = PO ( PR (Zi |Xj ) ≥ PR (Zi |Xi ) ). Let Qχ2 , k−1 and Qχ2 , N −k be respectively the
N −1 N −1
When Xj is given, Zi follows a multivariate Gaus- (k − 1)-th and (N − k)-th (N − 1)-quantiles of the
sian distribution, the conditional probability PR (Zi |Xj ) ∼ chi-squared distribution with d degrees of freedom.
N (0, (||Xj ||2 σ12 + σ22 ) · Id ). Similarly, PR (Zi |Xi ) ∼ Then by Condition 1) and 2), the benchmark func-
N (0, (||Xi ||2 σ12 + σ22 ) · Id ). The equivalent forms of tion should satisfy the following:
PR (Zi |Xj ) ≥ PR (Zi |Xi ) can be given as follows:
min Gσ1 ,σ2 (δi , δj ) ≥ Qχ2 , k−1 ,
PR (Zi |Xj ) ≥ PR (Zi |Xi ) N −1

max Gσ1 ,σ2 (δi , δj ) ≤ Qχ2 , N −k ,

1 d
2 ||Zi ||2 N −1
⇐⇒ 2 2
exp(− )
||Xj || σ1 + σ2
2 ||Xj ||2 σ12 + σ22 Then for any given X , we can find that the mini-
1 d
2 ||Zi ||2 mum and maximum of Gσ1 ,σ2 are obtained at two
≥ 2 2
exp(− )
||Xi || σ1 + σ2
2 ||Xi ||2 σ12 + σ22 definite records, i.e., Xmin_norm and Xmax_norm
d ||Zi ||2 with the smallest and largest Euclidean norms in
⇐⇒ (− ) log(||Xj ||2 σ12 + σ22 ) − X , regardless of the values of σ1 and σ2 .
2 ||Xj ||2 σ12 + σ22
d ||Zi ||2 Let a = δi /δj , then G = d/2 · log a/(a − 1), and it
≥ (− ) log(||Xi ||2 σ12 + σ22 ) − 0
is easy to prove that the derivative G < 0 when
2 ||Xi ||2 σ12 + σ22
a > 1 or 0 < a < 1. Therefore, in the2 case that
||Xmax_norm || σ12 +σ22

2 d δi δj δi
if||Xi ||2 > ||Xj ||2 a > 1, at a = amax = ||Xmin_norm ||2 σ2 +σ2 , G
 ||Zi || ≤ 2 δi −δj log δj ,
 1 2

⇐⇒ δi δj obtains its minimum. In the case that 0 < a < 1,

||Zi || ≥ 2 δi −δj log δδji ,
2 d
if||Xi ||2 < ||Xj ||2 (26) 2 2 2
||Xmin_norm || σ1 +σ2
 at a = amin = ||Xmax 2 2 2 , G obtains its
no contraint on ||Zi ||2 , if||Xi ||2 = ||Xj ||2 _norm || σ1 +σ2

maximum.
in which δi = ||Xi ||2 σ12 + σ22 , δj = ||Xj ||2 σ12 + σ22 .
Therefore if we want to anonymize Xi to achieve (k, 1)-
anonymity on it, by Eq. (26), we need to divide those ||Xj ||2 A PPENDIX D
into 3 groups, find how to constrain σ1 and σ2 to satisfy Eq. P ROOF OF T HEOREM 4
(25) in each group, then combine the constraints to obtain Oi is the occurring number of events F (Zj →
the sufficient condition for the (k, 1)-anonymity. Xi ) ≥ F (Zi → Xi ) in the overall table. Let pi,j =
1) If ||Xi ||2 > ||Xj ||2 , the occurring probability of PO ( PF (Xi |Zj ) ≥ PF (Xi |Zi ) ). As shown in the proof
j6=i
δ δ
the event ||Zi ||2 ≤ d2 δi i−δjj log δδji should be at least
X
of Lemma 2, E(Oi ) = pi,j . For records Zj in
δ δ
(K − 1)/(N − 1). That is, d2 δi i−δjj log δδji should be j∈{1,...,N }
always no less than the (K − 1)-th (N − 1)-quantile the class cl (j ∈ cl ), then |c0l | is the occurring number
of PO (||Zi ||2 ). X F (Zj → Xi ) ≥ F (Zi → Xi ) in this class,
of the events
E(|c0l |) = pi,j , and
a) If ||Xi ||2 < ||Xj ||2 , the occurring proba- j∈cl
δ δ
bility of the event ||Zi ||2 ≥ d2 δi i−δjj log δδji E(|c0l |) E(|c0l |)
should be at least (K − 1)/(N − 1). That = ,
E(Oi ) E(|c0l |) + ρ

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 16

in which [15] D. Kifer and A. Machnavajjhala. A rigorous and customizable

framework for privacy. In Proc. the 30th ACM SIGACT-SIGMOD-
j6=i SIGART Symposium on Principles of Database Systems (PODS 2011),
pages 77–88. IEEE Computer Society, May 2012.
X
ρ= pi,j . [16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Effi-
j∈{1,...,N }\cl cient full-domain k-anonymity. In Proc. the 2005 ACM SIGMOD
international conference on Management of data (SIGMOD’05), pages
By Theorem 1 and 2, if ||Xi ||2 ≥ σRP
2
/σ12 , then for j = 49–60. ACM, April 2005.
1, ..., N , [17] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian
k−1 N −k multidimensional k-anonymity. In Proc. the 22nd International
≤ pi,j ≤ . Conference on Data Engineering (ICDE 2006), pages 25–36. IEEE
N −1 N −1 Computer Society, April 2006.
[18] N. Li, T. Li, and S. Venkatasubramanian. Closeness: A new pri-
If ||Xi ||2 < σRP
2
/σ12 , the same constraint as above holds vacy measure for data publishing. IEEE Transactions on Knowledge
on pi,j . and Data Engineering, 22(7):943–956, July 2010.
[19] T. Li, N. Li, and J. Zhang. Modeling and integrating background
Therefore, the constraint on E(|c0l |) can be obtained as
knowledge in data anonymization. In Proc. the 25th International
follows: Conference on Data Engineering (ICDE 2009), pages 6–17. IEEE
Computer Society, April 2009.
k−1 N −k [20] T. Li, N. Li, and J. Zhang. Modeling and integrating background
· |cl | ≤ E(|c0l |) ≤ · |cl |. knowledge in data anonymization. In Data Engineering, 2009.
N −1 N −1 ICDE ’09. IEEE 25th International Conference on, pages 6–17, 2009.
k−1 [21] J. Liu and K. Wang. On optimal anonymization for l+-diversity.
Similarly the constraint on ρ is: N −1 · (N − |cl |) ≤ ρ ≤ In Proc. the 26th International Conference on Data Engineering (ICDE
N −k
N −1 · (N − |cl |). 2010), pages 213–224. IEEE Computer Society, March 2010.
Since pcl = |cl |/(N − 1), then the lower bound for the [22] K. Liu, H. Kargupta, and J. Ryan. Random projection-based mul-
E(|c0l |) tiplicative data perturbation for privacy preserving distributed
statistical semantic closeness of cl , i.e., E(Oi )·p c
, is data mining. Knowledge and Data Engineering, IEEE Transactions
l
on, 18(1):92 – 106, jan. 2006.
0 [23] W. Lu, A. Varna, and M. Wu. Security analysis for privacy
E(|cl |)min 1 (k − 1)
0 · = . preserving search of multimedia. In Image Processing (ICIP), 2010
E(|cl |)min + ρmax pcl pcl (2k − 1 − N ) + N (N − k)/(N − 1) 17th IEEE International Conference on, pages 2093 –2096, sept. 2010.
[24] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy
The upper bound is beyond k-anonymity. In Proc. the 22nd International Conference
0 on Data Engineering (ICDE 2006), pages 24–36. IEEE Computer
E(|cl |)max 1 (N − k) Society, April 2006.
0 · = . [25] A. Meyerson and R. Williams. On the complexity of optimal
E(|cl |)max + ρmin pcl pcl (N − 2k + 1) + N (k − 1)/(N − 1)
k-anonymity. In Proc. the 23rd ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems (PODS 2004), pages
223–228. ACM, June 2004.
[26] B. C. F. Noman Mohammed, Rui Chen and P. S. Yu. Differ-
R EFERENCES entially private data release for data mining. In Proc. the 17th
[1] C. C. Aggarwal. On unifying privacy and uncertain data models. ACM SIGKDD Conference on Knowledge Discovey and Data Mining
In Proc. the 24th International Conference on Data Engineering (ICDE (KDD’11), pages 493–501. ACM, August 2011.
2008), pages 386–395. IEEE, April 2008. [27] H. Park and K. Shim. Approximate algorithms with generalizing
[2] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: attribute values for k-anonymity. Information Systems, 35(8):933–
Models and Algorithms (Chapter 2). Springer Science+Business 955, December 2010.
Media, Spring Street, New York, NY, USA, 2008. [28] Y. Sang, H. Shen, H. Tian, and Z. Zhang. Achieving probabilistic
[3] J. Cao, P. Karras, P. Kalnis, and K.-L. Tan. Sabre: a sensi- anonymity against one-to-multiple linkage attacks. In e-Business
tive attribute bucketization and redistribution framework for t- Engineering (ICEBE), 2013 IEEE 10th International Conference on,
closeness. The VLDB Journal, 20(1):59–81, Feb. 2011. pages 176–183, Sept 2013.
[4] N. Cao, Z. Yang, C. Wang, K. Ren, and W. Lou. Privacy-preserving [29] J. Soria-Comas and J. Domingo-Ferrer. Probabilistic k-anonymity
query over encrypted graph-structured data in cloud computing. through microaggregation and data swapping. In Fuzzy Systems
In Distributed Computing Systems (ICDCS), 2011 31st International (FUZZ-IEEE), 2012 IEEE International Conference on, pages 1–8,
Conference on, pages 393–402, 2011. 2012.
[5] C. Clifton and T. Tassa. On syntactic anonymity and differential [30] J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, and
privacy. In 2013 IEEE 29th International Conference on Data Engi- S. Martínez. Enhancing data utility in differential privacy
neering Workshops (ICDEW), pages 88–93. IEEE, 2013. via microaggregation-based k-anonymity. The VLDB Journal - The
[6] G. Cormode. Personal privacy vs population privacy: Learning International Journal on Very Large Data Bases, 23(5):771–794, 2014.
to attack anonymization. In ACM Knowledge Discovery and Data [31] R.-W. Wong, A.-C. Fu, K. Wang, Y. Xu, J. Pei, and P. Yu. Proba-
Mining, pages 1253–1261, 2011. bilistic inference protection on anonymized data. In Data Mining
[7] S. Dasgupta. Learning mixtures of gaussians. In Proc. 40th Annual (ICDM), 2010 IEEE 10th International Conference on, pages 1127–
Symposium on Foundations of Computer Science, page 634. IEEE 1132, 2010.
Computer Society, March 1999. [32] L. Xiong, S. Chitti, and L. Liu. Mining multiple private databases
[8] S. Dasgupta, D. Hsu, and N. Verma. A concentration theorem for using a knn classifier. In Proc. 2007 ACM symposium on Applied
projections. In Proc. the 22nd Conference in Uncertainty in Artificial computing (SAC 2007), pages 435–440. ACM SIGAPP, March 2007.
Intelligence, pages 1–17. AUAI Press, July 2006. [33] Z. Yang, S. Zhong, and R. N. Wright. Privacy-preserving classifi-
[9] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy- cation of customer data without loss of accuracy. In Proc. the Fifth
preserving data publishing: A survey of recent developments. SIAM International Conference on Data Mining (SDM 2005), pages
ACM Computing Surveys, 42(4):14:1–14:53, June 2010. 1–11. SIAM, April 2005.
[10] J. Gil-Pelaez. Note on the inversion theorem. Biometrika, [34] H. S. Yingpeng Sang and H. Tian. Effective reconstruction of data
38(3):481–482, July 1951. perturbed by random projections. IEEE Transactions on Computers,
61(1):101–117, 2012.
[11] A. Gionis, A. Mazza, and T. Tassa. k-anonymization revisited. In
[35] Z. Zhu, G. Wang, and W. Du. Deriving private information from
Proc. IEEE 24th International Conference on Data Engineering, pages
association rule mining results. In Proc. the 25th International
744–753. IEEE Computer Society, April 2008.
Conference on Data Engineering (ICDE 2009), pages 18–29. IEEE
[12] J. Han and M. Kamber. Data mining: concepts and techniques.
Computer Society, April 2009.
Morgan Kaufmann, 2005.
[13] Z. Huang, W. Du, and B. Chen. Deriving private information
from randomized data. In Proceedings of the 2005 ACM SIGMOD
international conference on Management of data, SIGMOD ’05, pages
37–48. ACM, 2005.
[14] A. Inan, M. Kantarcioglu, and E. Bertino. Using anonymized data
for classification. In Proc. the 25th International Conference on Data
Engineering (ICDE 2009), pages 429–440. IEEE Computer Society,
April 2009.

Exam Preparation Questions ECL-1 2020
No ratings yet
Exam Preparation Questions ECL-1 2020
6 pages
2002 Spring CS525 Lecture 4
No ratings yet
2002 Spring CS525 Lecture 4
47 pages
New Static Data Anonymization On Multidimensional Data 19-02-2024
No ratings yet
New Static Data Anonymization On Multidimensional Data 19-02-2024
71 pages
Pinkas 2002
No ratings yet
Pinkas 2002
8 pages
Varnika Resume Final
No ratings yet
Varnika Resume Final
2 pages
SPEML SS2023-Lecture Anonymisation
No ratings yet
SPEML SS2023-Lecture Anonymisation
101 pages
L-Diversity Privacy Beyond K-Anonymity
No ratings yet
L-Diversity Privacy Beyond K-Anonymity
12 pages
Privacy Preserving Data Mining-"A State of The Art": Mamta Narwaria Suchita Arya
No ratings yet
Privacy Preserving Data Mining-"A State of The Art": Mamta Narwaria Suchita Arya
5 pages
ch06 Anonymization
No ratings yet
ch06 Anonymization
40 pages
02 Synopsis
No ratings yet
02 Synopsis
16 pages
Bba Bba Batchno 75
No ratings yet
Bba Bba Batchno 75
52 pages
Data 102 Fall 2023 Lecture 24 - Privacy in Machine Learning
No ratings yet
Data 102 Fall 2023 Lecture 24 - Privacy in Machine Learning
46 pages
Privacy and
No ratings yet
Privacy and
18 pages
CMT 102
No ratings yet
CMT 102
2 pages
Book 1
No ratings yet
Book 1
6 pages
Differential Privacy
No ratings yet
Differential Privacy
22 pages
Syllabus MCA 2024 2026
No ratings yet
Syllabus MCA 2024 2026
12 pages
2002 Iyengar2002
No ratings yet
2002 Iyengar2002
10 pages
Summary Statistic Privacy in Data Sharing
No ratings yet
Summary Statistic Privacy in Data Sharing
16 pages
Introduction To Privacy Preserving Data Publishing Concepts and Techniques Chapman Hall CRC Data Mining and Knowledge Discovery Series
No ratings yet
Introduction To Privacy Preserving Data Publishing Concepts and Techniques Chapman Hall CRC Data Mining and Knowledge Discovery Series
355 pages
2025 01 15 11 23 34 F4 23H 2024 Trained Graduate Teacher (Female) Physics
No ratings yet
2025 01 15 11 23 34 F4 23H 2024 Trained Graduate Teacher (Female) Physics
7 pages
Research Proposal
No ratings yet
Research Proposal
17 pages
Basu 2015
No ratings yet
Basu 2015
7 pages
Privacy Preserving Data Mining Thesis PDF
100% (3)
Privacy Preserving Data Mining Thesis PDF
4 pages
Team-17 Final
No ratings yet
Team-17 Final
42 pages
Architectural Styles Repositories
No ratings yet
Architectural Styles Repositories
16 pages
Major Project (Finalproject)
No ratings yet
Major Project (Finalproject)
19 pages
Big Data and Social Media Analytics
No ratings yet
Big Data and Social Media Analytics
6 pages
PHD Thesis On Privacy Preserving Data Mining
100% (3)
PHD Thesis On Privacy Preserving Data Mining
5 pages
20PNC102 - Managerial Economics and Indian Economy Module 01-Introduction
No ratings yet
20PNC102 - Managerial Economics and Indian Economy Module 01-Introduction
99 pages
Additive Data Perturbation Approach For Privacy
No ratings yet
Additive Data Perturbation Approach For Privacy
9 pages
OB Module2,3
No ratings yet
OB Module2,3
43 pages
Research On Privacy Protection Based On K-Anonymity
No ratings yet
Research On Privacy Protection Based On K-Anonymity
5 pages
Resume 6'24
No ratings yet
Resume 6'24
1 page
Accounting Concepts and Conventions
No ratings yet
Accounting Concepts and Conventions
38 pages
Conflict and Negotiations
No ratings yet
Conflict and Negotiations
53 pages
Machine Learning For Industry 40 A Systematic Review Using Deep LearningBased Topic ModellingSensors
No ratings yet
Machine Learning For Industry 40 A Systematic Review Using Deep LearningBased Topic ModellingSensors
26 pages
Transport
No ratings yet
Transport
10 pages
ISAM (An Acronym For Indexed Sequential Access Method) Is A Method For Creating, Maintaining, and
No ratings yet
ISAM (An Acronym For Indexed Sequential Access Method) Is A Method For Creating, Maintaining, and
4 pages
A Systematic Overview On Methods To Protect Sensitive Data Provided For Various Analyses
No ratings yet
A Systematic Overview On Methods To Protect Sensitive Data Provided For Various Analyses
14 pages
Chap IV - Cobb Douglas PF
No ratings yet
Chap IV - Cobb Douglas PF
8 pages
1 s2.0 S1877050915020979 Main
No ratings yet
1 s2.0 S1877050915020979 Main
7 pages
Macro Economic Overview
No ratings yet
Macro Economic Overview
37 pages
Rdbms MCQ Quiz
0% (1)
Rdbms MCQ Quiz
5 pages
Privacy-Preserving Data Mining: Methods, Metrics, and Applications
No ratings yet
Privacy-Preserving Data Mining: Methods, Metrics, and Applications
21 pages
DHCP
No ratings yet
DHCP
5 pages
WDS Unit 5 Notes
No ratings yet
WDS Unit 5 Notes
20 pages
A New Scheme On Privacy-Preserving Data Classification: Nan Zhang, Shengquan Wang, and Wei Zhao
No ratings yet
A New Scheme On Privacy-Preserving Data Classification: Nan Zhang, Shengquan Wang, and Wei Zhao
10 pages
Pawar 2018
No ratings yet
Pawar 2018
6 pages
Automated Resume Parsing A Natural Language Processing Approach
No ratings yet
Automated Resume Parsing A Natural Language Processing Approach
6 pages
2008 02 Robust De-Anonymization of Large Sparse Datasets
No ratings yet
2008 02 Robust De-Anonymization of Large Sparse Datasets
15 pages
Hiding Sensitive Predictive Association Rules: Shyue-Liang Wang, Ayat Jafari
No ratings yet
Hiding Sensitive Predictive Association Rules: Shyue-Liang Wang, Ayat Jafari
6 pages
Cit 3200 Oprating Systems
No ratings yet
Cit 3200 Oprating Systems
2 pages
Privacy Preserving Publication of Datasets: S-Tuple Inclusion - A New Method For
No ratings yet
Privacy Preserving Publication of Datasets: S-Tuple Inclusion - A New Method For
7 pages
Chakraborty 2015
No ratings yet
Chakraborty 2015
2 pages
K Anonymity and Cluster Based Methods
No ratings yet
K Anonymity and Cluster Based Methods
37 pages
Big Data Analysis and Perturbation Using Data Mining Algorithm
No ratings yet
Big Data Analysis and Perturbation Using Data Mining Algorithm
10 pages
A Review On K-Anonymization Techniques
No ratings yet
A Review On K-Anonymization Techniques
8 pages
Home Automation With Ai and Iot
No ratings yet
Home Automation With Ai and Iot
7 pages
The Big Data System, Components, Tools, and Technologies A Survey
No ratings yet
The Big Data System, Components, Tools, and Technologies A Survey
100 pages
Unit-2 Lecture Notes
No ratings yet
Unit-2 Lecture Notes
33 pages
Evaluating Clustering Performance For The Protected Data Using Perturbative Masking Techniques in Privacy Preserving Data Mining
100% (1)
Evaluating Clustering Performance For The Protected Data Using Perturbative Masking Techniques in Privacy Preserving Data Mining
5 pages
Privacy Preservation by Anonymization Method Accomplishing Concept of Hierarchical Clustering and DES: A Propose Study
No ratings yet
Privacy Preservation by Anonymization Method Accomplishing Concept of Hierarchical Clustering and DES: A Propose Study
4 pages
A Hybrid Approach of Privacy Preserving Data
No ratings yet
A Hybrid Approach of Privacy Preserving Data
6 pages
Paper 16
No ratings yet
Paper 16
4 pages
1 s2.0 S1877042815004875 Main
No ratings yet
1 s2.0 S1877042815004875 Main
6 pages
GSM Assignment
No ratings yet
GSM Assignment
6 pages
Data Privacy Through Optimal K-Anonymization
No ratings yet
Data Privacy Through Optimal K-Anonymization
12 pages
Literature Survey of Association Rule Based Techniques For Preserving Privacy
No ratings yet
Literature Survey of Association Rule Based Techniques For Preserving Privacy
6 pages
Jun 12 Ijcoa 001
No ratings yet
Jun 12 Ijcoa 001
6 pages
Privacy Preserving Clustering On Distorted Data: Thanveer Jahan Dr.G.Narasimha, Dr.C.V.Guru Rao
No ratings yet
Privacy Preserving Clustering On Distorted Data: Thanveer Jahan Dr.G.Narasimha, Dr.C.V.Guru Rao
5 pages
Differential Privacy
No ratings yet
Differential Privacy
12 pages
An Approach For Privacy Preservation Using XML Distance Measure
No ratings yet
An Approach For Privacy Preservation Using XML Distance Measure
5 pages
Privacy Preserving DDM
No ratings yet
Privacy Preserving DDM
5 pages
Notes Artificial Intelligence Unit 5
No ratings yet
Notes Artificial Intelligence Unit 5
11 pages
Differential Privacy
No ratings yet
Differential Privacy
56 pages
Cloud Defense: Advanced Endpoint Protection and Secure Network Strategies
From Everand
Cloud Defense: Advanced Endpoint Protection and Secure Network Strategies
Rob Botwright
No ratings yet
Big Data Computing Spark Basics and RDD: Ke Yi
No ratings yet
Big Data Computing Spark Basics and RDD: Ke Yi
43 pages
Privacy Preserving Data Mining
No ratings yet
Privacy Preserving Data Mining
10 pages
Business Analytics
No ratings yet
Business Analytics
35 pages
Privacy Preserving Decision Tree Learning PDF
No ratings yet
Privacy Preserving Decision Tree Learning PDF
12 pages
Digital Assignment-1 Literature Review On Twitter Sentiment Analysis Name: G.Tirumala Reg No: 16BCE0202 1)
No ratings yet
Digital Assignment-1 Literature Review On Twitter Sentiment Analysis Name: G.Tirumala Reg No: 16BCE0202 1)
9 pages
An Iterative Classification Scheme
No ratings yet
An Iterative Classification Scheme
6 pages
Linux Essential Commands
No ratings yet
Linux Essential Commands
8 pages
2yrs Mca Sem3
No ratings yet
2yrs Mca Sem3
9 pages
Class-12 Viva Questions
No ratings yet
Class-12 Viva Questions
5 pages
Đề thi cuối kỳ 20212 IT2030 - Techinical Writing and Presentation
No ratings yet
Đề thi cuối kỳ 20212 IT2030 - Techinical Writing and Presentation
3 pages
List
No ratings yet
List
2 pages
IR
No ratings yet
IR
31 pages
Survey On Anonymization Techniques in Big Data and Privacy Models
No ratings yet
Survey On Anonymization Techniques in Big Data and Privacy Models
20 pages
Privacy Preserving On Continuous and Discrete Data Sets - A Novel Approach
No ratings yet
Privacy Preserving On Continuous and Discrete Data Sets - A Novel Approach
7 pages
A Survey On Privacy Preserving Data Mining Techniques
No ratings yet
A Survey On Privacy Preserving Data Mining Techniques
5 pages
A Review On "Privacy Preservation Data Mining (PPDM)
No ratings yet
A Review On "Privacy Preservation Data Mining (PPDM)
6 pages
Survey On Incentive Compatible Privacy Preserving Data Analysis Technique
No ratings yet
Survey On Incentive Compatible Privacy Preserving Data Analysis Technique
5 pages
Siva Sankar
No ratings yet
Siva Sankar
6 pages
Privacy Preservation Techniques in Data Mining
No ratings yet
Privacy Preservation Techniques in Data Mining
5 pages
The Knowledge Graph Cookbook
No ratings yet
The Knowledge Graph Cookbook
228 pages
An Overview On Privacy Preserving Data Mining Methodologies
No ratings yet
An Overview On Privacy Preserving Data Mining Methodologies
5 pages
Class-Xii-It-Database Concepts
No ratings yet
Class-Xii-It-Database Concepts
240 pages
Dataengieer
No ratings yet
Dataengieer
23 pages
Advanced Network Defense: Architectures and Best Practices for Today’s Perimeter
From Everand
Advanced Network Defense: Architectures and Best Practices for Today’s Perimeter
Lawrence Bennier
No ratings yet
IRJCS:: Information Security in Big Data Using Encryption and Decryption
No ratings yet
IRJCS:: Information Security in Big Data Using Encryption and Decryption
6 pages
Fascination: Honeypots and Cybercrime
From Everand
Fascination: Honeypots and Cybercrime
Armin Snyder
No ratings yet
Penetration Testing Fundamentals -1: Penetration Testing Study Guide To Breaking Into Systems
From Everand
Penetration Testing Fundamentals -1: Penetration Testing Study Guide To Breaking Into Systems
Devi Prasad
No ratings yet

Sang 2016

Uploaded by

Sang 2016

Uploaded by

This article has been accepted for publication in a future issue of this journal, but has not been

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 1

Achieving Probabilistic Anonymity in a Linear

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 2

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 3

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 4

3 D EFINITIONS whether the victim Xj has a sensitive value of Ci based

(a) M2O Linkage Attack (b) O2M Linkage Attack

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 5

4.1 The Randomness of the Ordering List

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 6

by the distribution of X , not by the projection matrix R1 . ∀j ∈ {1, ..., N } \ i.

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 7

M = max(X1 R1 R1T X1T , ..., XN R1 R1T XN

Steps 4) and 5) are executed, and both of these steps have a 10

time cost of O(mdN ). The total time cost for randomizing λi 0 0

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 8

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 9

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 10

K is a predetermined value of the miner. Suppose that 8 E XPERIMENTS

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 11

0.9 0.85 0.9 0.9

Percentage of records achieving 10−anonymity

Percentage of records achieving k−anonymity

Percentage of records achieving k−anonymity

0.86 also performed. When m varies from 20 to 50, the average

Fig. 7. Experiments on Probabilistic (k,k)-Anonymity for Synthetic Data 7

as those of Fig 7. The (k, 1)-anonymity is achieved by 1 0.1

(a) Semantic closeness (b) EMD comparison

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 12

Experiments are also performed on high-dimensional

Earth Mover’s Distance

assigned to the cluster’s class, and 10% of records are 0.9

Rate of Correct Classification

Rate of Correct Classification

0.5 DiffGen on Synthetic Data

in 3.1. The probabilistic (k, k)-anonymity is achieved on 0.4

abilistic anonymity, is not applicable here because N does 0.8

Rate of Correct Classification

not change in these two figures. 0.7

by our probabilistic (k, k)-anonymity, and then we execute 0.3

Rate of Correct Classification

size of equivalence groups) does not change when we select 0.6

0.5 Hybrid Randomization

synthetic, blood transfusion and adult data, respectively. 0.4

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 13

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 14

ˆ +∞ while ∆λ varies, which means cos f is not always positive

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 15

are always non-negative, and then by Eq. (22) ∂λ ∂I

max Gσ1 ,σ2 (δi , δj ) ≤ Qχ2 , N −k ,

⇐⇒ δi δj obtains its minimum. In the case that 0 < a < 1,

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 16

in which [15] D. Kifer and A. Machnavajjhala. A rigorous and customizable

You might also like