1. Introduction
Set-valued data is one of the data formats that can be extracted from social networks; it is characterized by associating a set of values with each individual. Set-valued data is common in search engine logs and hashtag searches, and can also be found in databases such as health databases and market basket records. This work is an extension of [1], which studies the privacy-utility trade-off of certain predefined associations in an anonymized set-valued dataset through the disassociation technique, first defined in [2]. The aim of this paper is to continue that investigation and reconstruct the disassociated set-valued dataset so that it remains amenable to future data analysis. Describing, diagnosing, predicting and prescribing are the main uses of data, and set-valued data provides opportunities for a variety of data mining tasks. Using data mining techniques and learning algorithms, machine learning infers models of the underlying data in order to predict possible futures. Extracting knowledge from data, whether to increase business productivity, drive effective decision making or predict trends and behaviors, is done by discovering patterns inherent in the data. To benefit from the large amount of set-valued data, association rule mining is deployed in many fields, from discovering links between diseases to marketing and retail. It does not refer to a single algorithm or application, but rather to a family of methods used to analyze the correlation between data items in order to find relationships between items and item categories. This study investigates general set-valued data, where data temporality and semantics are not examined.
To illustrate the dilemma of data analysis versus privacy preservation in set-valued data, let us consider an example of mobility data, which stores the GPS locations of individuals, where each record corresponds to the set of cities visited by an individual. Finding a very strong association rule between three cities, {Valetta, Bratislava} ⇒ {Salzburg}, might indicate a high demand for mobility solutions between these cities. On the other hand, publishing set-valued data raises the problem of privacy preservation. If an individual, Bob, visited {Salzburg (Austria), Valetta (Malta), Bratislava (Slovakia), Bergamo (Italy)} and an attacker knows that Bob visited both Valetta and Bratislava, he/she can predict with high probability that Bob also visited Salzburg. Publishing the dataset unrefined fails to protect the privacy of Bob's mobility data.
From the above example, we can see that it is necessary to extract knowledge from set-valued datasets but at the same time, the privacy of the individuals should be considered wisely. Publishing raw data is a public disclosure of information collected and puts the privacy of individuals at risk. For that reason, privacy-preserving mechanisms should be applied to the data before publishing it.
Anonymization is the process that imposes a level of privacy on the data in an attempt to protect the privacy of the individuals participating in the dataset. Anonymization techniques that can be deployed to protect the confidentiality of user data form a wide area of research [3,4,5,6,7,8,9,10]. Disassociation is an anonymization technique presented in [2], particularly designed for set-valued data. Basically, the set-valued records are clustered into homogeneous groups (via horizontal partitioning), and then each cluster is split into record chunks (via vertical partitioning), where the items of a set are separated into disconnected subsets to preserve the $k^m$-anonymity privacy constraint.
However, anonymization has two aims: to provide a dataset that respects privacy and to keep the published data valuable for direct analysis. A trade-off must be found between privacy preservation and the benefits of knowledge extraction from datasets.
Guided by the disassociation technique, our investigation started in [1] by studying and improving the probabilistic preservation of associations in a disassociated dataset. After examining the probabilistic preservation results, we presented an optimization of disassociation for a predefined set of associations, which we call utility rules. The goal is to spare the utility rules from vertical partitioning as much as possible, while respecting the $k^m$-anonymity privacy constraint of disassociation. A derivative of the ant-based clustering algorithm has been proposed to group the data whilst respecting the utility rules.
The new problem addressed in this extension of [1] is that, in reality, we cannot define every association present in the original dataset as a utility rule in order to optimize its preservation for future analysis. Therefore, the study of association rules in a disassociated dataset cannot be limited to the items preserved together, without break-up, after vertical partitioning. To run an analysis on a disassociated dataset, we must compute a huge set of pre-images of the disassociated dataset, by reconstructing all possible associations between the disconnected subsets of items. By breaking down a set of items into disconnected subsets, this new data format, the disassociated dataset, makes analysis over it hard, or at least time consuming. Above all, it is useless to publish a dataset that cannot support data analysis.
In this extension, the goal is to publish an anonymized set-valued dataset that falls in the neighborhood of the original one, consequently preserving its original format. In this context, the "neighbor datasets" terminology fits our intention perfectly. In an effort to make this article self-contained, Section 2 recalls and summarizes our previous work [1] on the probabilistic preservation of utility in a disassociated dataset and its implementation as ant clustering. In Section 3, we present our first contribution, where we define "neighbor datasets" as two similar datasets that fall under a certain radius of distance, and present a way to assess the distance between neighbor datasets by formalizing the datasets as trees. Our second contribution, presented in Section 4, is a technique that generates a neighbor dataset of the original from its disassociated result, which can be seen as a statistics-based re-association. We investigate our solution from two perspectives. First, we look at the preservation of association rules, which reflects the synthetic similarity of neighbor datasets. Second, to generalize our approach, we evaluate the distance between the original and reconstructed datasets to get an overview of the created neighborhood, using the distance introduced above. Figure 1 summarizes the whole solution depicted in this article. All experimental results that confirm the applicability of our approach are presented in Section 5. Finally, Section 6 presents concluding remarks.
2. Disassociation and Utility Awareness for Associations
The disassociation technique [2] ensures the $k^m$-anonymity constraint by separating the elements of a record into several chunks within the same group. It thus creates association ambiguity between the separated items, which reduces the utility of the association in question. Disassociation, as defined by Terrovitis, is based on two assumptions. The first is that no association is more significant than another. The second is that the data should not be modified, generalized or deleted. In the next section, we provide an algorithm that preserves better utility for a set of predefined associations, called utility rules, by reducing the number of split-ups a utility rule has to endure in order to preserve $k^m$-anonymity [2].
Table 1 recalls the basic notations used in this paper.
2.1. Disassociation of a Set-Valued Data
This section is dedicated to showing how disassociation works in terms of privacy and utility. We use Figure 2 to illustrate an example of disassociation, applied with fixed values of the privacy parameters $k$ and $m$ and of the maximum cluster size $\delta$.
Horizontal partitioning: is the process of clustering the records of a dataset following a naive similarity function. Iteratively, and until all records are clustered, horizontal partitioning groups at most $\delta$ of the not-yet-clustered records containing the most frequent item at the current iteration. As such, horizontal partitioning fails to take into consideration the associations in the dataset, relying only on one common item to cluster the records. Figure 2b reflects the process of horizontal partitioning, where the records containing the most frequent item $a$ are grouped together within cluster $P_1$, and all the other records within $P_2$. Both clusters have a size of at most $\delta$.
Vertical partitioning: is the process that enforces the $k^m$-anonymity privacy constraint by vertically cutting every cluster into record chunks. A record chunk contains sub-itemsets of the cluster's records that verify $k^m$-anonymity between their items. A term chunk is added to the vertical cut for the items with a support of less than $k$. This process is known as vertical partitioning and is the core of privacy preservation in disassociation.
In our example, vertical partitioning is applied over $P_1$ and $P_2$. Associations in the clusters are split into different record chunks when $k^m$-anonymity is not verified.
Figure 2c represents the final result of horizontal and vertical partitioning.
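To fix ideas, the following Python sketch illustrates a simplified vertical partitioning pass over a single cluster. It is our own illustration under stated assumptions, not the exact algorithm of [2]: items below support $k$ go to the term chunk, and the remaining items are greedily grouped so that every $m$-item combination kept inside a record chunk has support at least $k$ (exact for the default $m = 2$); all function and variable names are ours.

```python
from itertools import combinations
from collections import Counter

def vertical_partition(cluster, k, m=2):
    """Simplified sketch: records are sets of items; returns the record
    chunks (projections of the records) and the term chunk."""
    support = Counter(item for record in cluster for item in record)
    term_chunk = {item for item, s in support.items() if s < k}
    frequent = sorted((i for i in support if i not in term_chunk),
                      key=lambda i: -support[i])

    def itemset_support(items):
        # support of an itemset = number of records containing all its items
        return sum(1 for r in cluster if items <= r)

    chunks, current = [], set()
    for item in frequent:
        candidate = current | {item}
        if all(itemset_support(set(c)) >= k
               for c in combinations(candidate, m)):
            current = candidate          # the item joins the open chunk
        else:
            chunks.append(current)       # close the chunk, open a new one
            current = {item}
    if current:
        chunks.append(current)
    # each record chunk stores the projections of the cluster's records
    record_chunks = [[r & chunk for r in cluster if r & chunk]
                     for chunk in chunks]
    return record_chunks, term_chunk
```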
To illustrate the effect of disassociation on the utility of associations, let us consider that the frequency of the association $\{b, c\}$ is valuable for future analysis. From the result of disassociation in Figure 2c, we can note that the bond between items $b$ and $c$ is ambiguous. In $P_1$, items $b$ and $c$ are dropped into the term chunk, having a support of less than $k$, therefore camouflaging their exact support and the association between them, if it exists. Similarly, the association between $b$ and $c$ is unclear in $P_2$, with only one advantage over $P_1$: the support of item $b$ is known.
Let us suppose that we can guide the horizontal partitioning process in favor of the association $\{b, c\}$ by keeping together all the records that are its supersets, as in Figure 3a. Now, the association $\{b, c\}$ verifies $k^m$-anonymity and is preserved fully associated after vertical partitioning (Figure 3b).
From this example, we deduce that preserving associations depends significantly on horizontal partitioning. When a set of associations is important for future analysis, special treatment in horizontal partitioning must be considered. In what follows, we present a solution that preserves high utility for a set of predefined associations, which we call utility rules, within the disassociated dataset.
2.2. The Privacy-Utility Trade-Off in Disassociation
Giving an exact, general definition of data utility in the domain of anonymization is unrealistic. In this work, a utility rule is an association that is important for accurate analysis, especially for aggregate query answering accuracy.
Let $UR = \{ur_1, \ldots, ur_u\}$ be a set of predefined associations, which we refer to as utility rules and which are important for future analysis.
Let $s_{\mathcal{D}}(ur_i)$ be the support of the utility rule $ur_i$ in the original dataset $\mathcal{D}$.
Let $s_{\mathcal{D}^*}(ur_i)$ be the support of the utility rule $ur_i$ in the disassociated dataset $\mathcal{D}^*$.
To evaluate the probabilistic preservation of a utility rule in a disassociated dataset, the confidence of a utility rule is analyzed.
Definition 1 ($\alpha$-confidence). The $\alpha$-confidence of a utility rule $ur$ is evaluated as follows:
$$\alpha\text{-confidence}(ur) = \frac{s_{\mathcal{D}^*}(ur)}{s_{\mathcal{D}}(ur)}$$
The term confidence is used to determine the strength of the association between the items of a utility rule after disassociation. Statistical queries are based on the support of the associations in question. The $\alpha$-confidence represents the ratio of a utility rule's support that is preserved in the final output of disassociation.
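As a minimal worked instance, with hypothetical supports: if a utility rule $ur$ has support $s_{\mathcal{D}}(ur) = 40$ in the original dataset and only 30 of those occurrences remain intact within single record chunks after disassociation, then:

```latex
\alpha\text{-confidence}(ur) \;=\; \frac{s_{\mathcal{D}^*}(ur)}{s_{\mathcal{D}}(ur)} \;=\; \frac{30}{40} \;=\; 0.75
```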
In [1], the utility of an association in disassociated datasets is evaluated theoretically under the $k^m$-anonymity privacy model, and a lower bound on its $\alpha$-confidence is proven.
From this perspective of the privacy-utility trade-off, we are motivated to contribute a more insightful horizontal partitioning process, tolerant to the predefined utility rules, for future data analysis accuracy.
The next section provides a general description of the clustering problem and then the role of swarm intelligence algorithms for the improvement of data clustering.
2.3. Data Clustering
Clustering, by definition, is the process of grouping objects with similar characteristics together within a cluster. There exists no unified solution to all clustering problems, and clustering has been proven to be NP-hard [11,12]; thus, the problem consists of finding an optimal or near-optimal solution. In what follows, we briefly present general techniques used for data clustering.
Classical clustering algorithms attempt to optimize a fitness function in order to minimize the dissimilarity between items within a cluster and maximize it between clusters. Clustering can be either fuzzy or partitional. In fuzzy clustering, data items may belong to multiple clusters with a fuzzy membership grade, as in the fuzzy $c$-means algorithm [13]. In partitional clustering, clusters are totally disjoint and data objects belong to exactly one cluster, as in the K-means algorithm [14]. In this work, and in accordance with the disassociation principle, we are only interested in partitional clustering.
Bio-inspired algorithms model processes existing in nature to solve optimization problems. Examples of bio-inspired systems used for data clustering are: the ant colony system [15], particle swarm optimization [16], and the artificial bee colony [17]. These algorithms draw inspiration from the collective behavior of decentralized, self-organized social animals. Even though the particles of a swarm may have very limited individual capabilities, they can perform very complex jobs, vital for their survival, when acting as a community. Choosing the right bio-inspired algorithm to cluster data relies on the comparability of the given problem's background with the behavior of the bio-particles.
The ant clustering algorithm (ACA) is a population-based stochastic search process, modeled after the social behavior of ants searching for food, sorting larvae and cleaning corpses. When picking and dropping items, the ants' actions are influenced by the similarity and the density of the data within the local neighborhood. From this behavior, researchers have introduced many variations of clustering algorithms applicable to a wide range of problems [18,19,20,21,22].
In the next section, we define a variant of the ant-based clustering algorithm to cluster the records that are supersets of the predefined utility rules, for a more utility-guided disassociation.
2.4. Framework of the Algorithm
We are motivated by the need to preserve some items associated together to increase their utility value despite disassociation and the ambiguity that it raises. We refer to those associations as utility rules. Accordingly, we transform normal horizontal partitioning for the set of records, which are supersets of at least one utility rule, into a clustering optimization problem. Before describing the algorithm, it is important to identify the challenges of the clustering problem in our context:
A record might enclose multiple utility rules; with partitional clustering, this record should belong to exactly one cluster, satisfying one utility rule.
The items that records have in common, whether or not they belong to a utility rule, affect the distance metrics.
The maximum cluster size constant, $\delta$, limits the number of records allowed in a cluster.
The proposed algorithm benefits from the studied behaviors of natural ants.
Table 2 describes the environment of our clustering problem in the ant colony system terminology.
Let $\mathcal{D}_{UR}$ be the set of records from $\mathcal{D}$ that are supersets of at least one utility rule $ur_i \in UR$:
$$\mathcal{D}_{UR} = \{\, r \in \mathcal{D} \mid \exists\, ur_i \in UR,\ ur_i \subseteq r \,\}$$
Cluster initialization: Every utility rule $ur_i$ has a representative cluster $P_i$ and an expert ant $a_i$ that transports records. The algorithm starts by sending the expert ants in search of records from $\mathcal{D}_{UR}$ containing their representative utility rules, recursively, until $\mathcal{D}_{UR}$ is empty.
Pheromone Trail: A square matrix $A$ represents the pheromone trail of the working ants, with $u$ being the number of predefined utility rules, $u = |UR|$. It is the collective adaptive memory of the expert ants and is initialized with the support of each utility rule $ur_i$ in each cluster $P_j$, such that:
$$A[i][j] = s(ur_i, P_j)$$
We denote by $\rho(ur_i, P_j)$ the ratio of the records representing $ur_i$ in cluster $P_j$:
$$\rho(ur_i, P_j) = \frac{s(ur_i, P_j)}{|P_j|}$$
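For concreteness, here is a minimal Python sketch of these quantities, assuming clusters are lists of records (sets of items) and utility rules are sets of items; the helper names (`support`, `init_pheromone`, `ratio`) are ours:

```python
def support(itemset, cluster):
    """Number of records in the cluster containing the whole itemset."""
    return sum(1 for record in cluster if itemset <= record)

def init_pheromone(utility_rules, clusters):
    """Pheromone trail A: A[i][j] = support of utility rule i in cluster j."""
    return [[support(ur, c) for c in clusters] for ur in utility_rules]

def ratio(ur, cluster):
    """rho(ur, P): fraction of the cluster's records that are supersets of ur."""
    return support(ur, cluster) / len(cluster) if cluster else 0.0
```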
Individual and cooperative work of expert ants: During the clustering process, every ant $a_i$ works for its utility rule $ur_i$ to reach a support threshold $\tau$ in its representative cluster $P_i$, i.e., $s(ur_i, P_i) \geq \tau$. Two cases arise:
• If $s(ur_i, P_i) < \tau$:
Let $d(ur_i, P_j)$ denote the density of a utility rule $ur_i$ in $P_j$, i.e., the concentration of records of $P_j$ that are supersets of $ur_i$. Each ant $a_i$ chooses the cluster $P_j$ whose density of $ur_i$ is the highest. A record $r \supseteq ur_i$ is then moved from this cluster to cluster $P_i$, with $j \neq i$. This process is known as the pick up job.
• If $s(ur_i, P_i) \geq \tau$:
To speed up the convergence of the solution and prevent the ants from moving aimlessly, for the current iteration, ant $a_i$ works for the benefit of another ant $a_l$ that needs the most help to increase its ratio $\rho(ur_l, P_l)$. This process is known as the drop job. We consider that ant $a_l$ is the one most in need of help when reaching the threshold $\tau$ for its utility rule $ur_l$ demands the highest number of iterations.
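A simplified reading of one ant iteration is sketched below, reusing `support()` and `ratio()` from the previous sketch. The neediness criterion (lowest current support) and the tie-breaking are our simplifying assumptions, standing in for the iteration-count criterion above:

```python
import random

def ant_step(i, clusters, utility_rules, tau, k):
    """One iteration of expert ant i: pick up a record for its own rule,
    or, once satisfied, do the drop job for the neediest other ant."""
    if support(utility_rules[i], clusters[i]) < tau or len(clusters[i]) < k:
        t = i                                  # pick up job for itself
    else:
        others = [j for j in range(len(utility_rules)) if j != i]
        if not others:
            return
        # drop job: help the rule currently farthest below its threshold
        t = min(others, key=lambda j: support(utility_rules[j], clusters[j]))
    rule, home = utility_rules[t], clusters[t]
    # fetch a superset record from the densest foreign cluster
    donors = [c for c in clusters
              if c is not home and any(rule <= r for r in c)]
    if not donors:
        return
    donor = max(donors, key=lambda c: ratio(rule, c))
    record = random.choice([r for r in donor if rule <= r])
    donor.remove(record)
    home.append(record)
```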
The next section describes and comments on the algorithm implementing the ant-based clustering methodology.
2.5. Utility Guided Ant-Based Clustering Algorithm (UGAC)
The utility-guided ant-based clustering (UGAC) algorithm, Algorithm 1, is presented and explained in detail in [1]. This section recaps the optimization process elaborated for the preservation of the utility rules. For every utility rule $ur_i$, UGAC creates an expert ant $a_i$ to pick up records that are supersets of $ur_i$ and drop them in the representative cluster $P_i$. The choice of records to pick up is guided by the pheromone trail, represented by the square matrix $A$ (line 3), which reflects the support of each utility rule in every cluster. Actually, in the PickUp procedure, an ant $a_i$ chooses to move a record from the cluster that has the highest density of $ur_i$ (line 2) to $P_i$ (lines 7–9). After every move, the pheromone matrix $A$ is updated with the supports of the utility rules in the clusters (lines 10–15).
Algorithm 1 Utility guided ant-based clustering algorithm (UGAC).
Input: $\mathcal{D}$, $UR$, $k$, $m$, $\delta$, $\tau$
Output: $\mathcal{D}^*$
1: $\mathcal{D}_{UR} \leftarrow \{r \in \mathcal{D} \mid \exists\, ur_i \in UR,\ ur_i \subseteq r\}$
2: $u \leftarrow |UR|$
3: create $u$ clusters, $u$ ants and square matrix $A$
4: $iter \leftarrow 0$
5: while ($\mathcal{D}_{UR} \neq \emptyset$) do
6:   for each expert ant $a_i$ do
7:     select a record $r \in \mathcal{D}_{UR}$ such that $ur_i \subseteq r$
8:     move $r$ from $\mathcal{D}_{UR}$ to cluster $P_i$
9:     for ($j \leftarrow 1$; $j \leq u$; $j{+}{+}$) do
10:      update $A[j][i]$ with the support of $ur_j$ in $P_i$
11:    end for
12:  end for
13: end while
14: while ($iter < maxIter$ or $\exists\, i : s(ur_i, P_i) < \tau$) do
15:   $iter \leftarrow iter + 1$
16:   for each expert ant $a_i$ do
17:     read $s(ur_i, P_i)$ from $A$
18:     if ($s(ur_i, P_i) < \tau$ or $|P_i| < k$) then
19:       PickUp($a_i$, $ur_i$, $P_i$)
20:     else
21:       DropLoad($a_i$)
22:       continue to the next ant
23:     end if
24:   end for
25:   update $A$
26: end while
27: for each cluster $P_i$ do
28:   Split($P_i$, $\delta$)
29: end for
30: vertically partition the resulting clusters
31: disassociate $\mathcal{D} \setminus \mathcal{D}_{UR}$ with normal disassociation
All this pick up job is executed while $s(ur_i, P_i) < \tau$ or if there are fewer than $k$ records in $P_i$ (lines 16–19). Yet, if $s(ur_i, P_i) \geq \tau$, the expert ant can work for the benefit of another ant during the current iteration, to converge to the optimal solution quickly (lines 21–22). Function DropLoad, Algorithm 2, finds the utility rule that still needs the most iterations to reach the threshold $\tau$ (line 2). Then, it calls the PickUp function, Algorithm 3, to find a record that can be transported to the corresponding cluster.
At the end of the iterations, there exist $u$ clusters, each mainly representing one utility rule. The resulting clusters may have sizes greater than the maximum cluster size $\delta$ allowed. Every such cluster is split into smaller clusters, each of size at most $\delta$ (line 28), calling Algorithm 4 when necessary. Algorithm 1 ends by vertically partitioning the clusters resulting from UGAC (line 30) and treats all the records that are not supersets of any utility rule via the normal processes of disassociation (line 31).
Algorithm 2 DropLoad procedure
1: procedure DropLoad($a_i$)
2:   find the utility rule $ur_l$, $l \neq i$, for which reaching the threshold $\tau$ still demands the highest number of iterations
3:   PickUp($a_l$, $ur_l$, $P_l$)
4: end procedure
Algorithm 3 PickUp procedure
1: procedure PickUp($a_i$, $ur_i$, $P_i$)
2:   $P_c \leftarrow$ the cluster, $c \neq i$, with the highest density $d(ur_i, P_c)$
3:   if ($d(ur_i, P_c) = 0$) then
4:     no record of $ur_i$ remains to transport
5:     return
6:   end if
7:   select a record $r \in P_c$ such that $ur_i \subseteq r$ and $r$ shares the most items with $P_i$
8:   $P_c \leftarrow P_c \setminus \{r\}$
9:   $P_i \leftarrow P_i \cup \{r\}$
10:  for ($j \leftarrow 1$; $j \leq u$; $j{+}{+}$) do
11:    if ($ur_j \subseteq r$) then
12:      $A[j][c] \leftarrow A[j][c] - 1$
13:      $A[j][i] \leftarrow A[j][i] + 1$
14:    end if
15:  end for
16: end procedure
2.6. Ant-Based Clustering Effect on Associations
In [1], the efficiency of the UGAC technique in terms of the preservation of the associations represented by the utility rules is evaluated, alongside other experiments analyzing the privacy-utility trade-off. The results are very promising for the preservation of the utility rules. In this section, we investigate the effect of ant-based clustering for the predefined utility rules $UR$ on the associations beyond $UR$.
Algorithm 4 SplitClusters function
1: procedure Split($P$, $\delta$)
2:   if ($|P| > \delta$) then
3:     create new cluster $P'$
4:     for (int $i \leftarrow 0$; $i < \delta$; $i{+}{+}$) do
5:       select a record $r$ from $P$
6:       $P \leftarrow P \setminus \{r\}$
7:       $P' \leftarrow P' \cup \{r\}$
8:     end for
9:     $\mathcal{S} \leftarrow \{P'\}$
10:    $\mathcal{S} \leftarrow \mathcal{S} \cup$ Split($P$, $\delta$)
11:  else
12:    $\mathcal{S} \leftarrow \{P\}$
13:  end if
14:  Return $\mathcal{S}$
15: end procedure
We chose for the experiment the BMS1 dataset, which contains click-stream e-commerce data, and a set of 70 distinct utility rules $UR$, extracted from the dataset with different characteristics: a highest frequency of 1204, representing a frequent association, and a lowest frequency of 2, representing a very rare association. Only the records that are supersets of the utility rules (36,141 records out of the original 149,639 records) are clustered and evaluated.
We compare UGAC (with a fixed support threshold $\tau$) to the normal horizontal partitioning of disassociation and investigate the overall preservation of associations for the two clustering techniques using the Relative Association Error (RAE), defined for an association $A$ as:
$$\mathrm{RAE}(A) = \frac{|s_{\mathcal{D}_{UR}}(A) - s_{\mathcal{D}_{UR}^*}(A)|}{s_{\mathcal{D}_{UR}}(A)}$$
where $A$ is any type of association present in $\mathcal{D}_{UR}$, and $s_{\mathcal{D}_{UR}}(A)$, $s_{\mathcal{D}_{UR}^*}(A)$ represent respectively the support of the association $A$ in $\mathcal{D}_{UR}$ and in its disassociated result $\mathcal{D}_{UR}^*$; the reported RAE is the average of this error over the evaluated association set. In this experiment, two types of associations are evaluated: first, the set of all the couples present in the records related to the utility rules $\mathcal{D}_{UR}$, denoted by $\mathcal{C}$, and second, the set of utility rules $UR$.
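Under the reconstruction above, the metric can be computed directly, as in the following sketch; `support()` is the helper from the sketch in Section 2, and `disassociated_support` is a hypothetical callable returning the support an analyst can still observe for an association in the disassociated output:

```python
def rae(associations, original_records, disassociated_support):
    """Average relative error of association supports (sketch).
    Associations absent from the original records are skipped, since
    the relative error is undefined for a zero original support."""
    errors = []
    for a in associations:
        s_orig = support(a, original_records)
        if s_orig == 0:
            continue
        errors.append(abs(s_orig - disassociated_support(a)) / s_orig)
    return sum(errors) / len(errors) if errors else 0.0
```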
Table 3 shows that the results of UGAC for $\mathcal{C}$ and $UR$ are much better than those of normal horizontal partitioning. This indicates that, for the records treated through UGAC, the associations are not deteriorated at the expense of preserving the utility rules $UR$. Finally, we can say that UGAC is reliable for the analysis of the utility rules in question and beyond them.
The above work, as presented, offers an optimization for disassociating interesting rules in datasets for analysis. However, a disassociated dataset is not the most appropriate data format for knowledge extraction, because it breaks down a set of items into disconnected subsets of items. For this reason, we present in the next section an algorithm that generates a re-associated dataset from the disassociated one, which we intend to call a neighbor dataset, by statistically re-associating the partitioned subsets to restore the original format of a set-valued dataset.
3. Neighbor Datasets: A Road-Map for Distance Evaluation
When a dataset is anonymized, it cannot hold the exact same records as the original dataset. Yet, for the anonymized dataset to be useful, it has to lead analysis in the same direction as the original dataset. In the following, we define a dataset and its anonymized version as two neighbor datasets if they fall under a certain radius of distance and consequently lead to close hypotheses when analyzed.
Let us first define neighboring for two datasets. To be able to describe datasets as neighbors, two questions arise. First, should the definition of neighboring datasets be contextual or general? Second, how can we mathematically assess the neighborhood degree of two datasets? In the following, we address these two dilemmas. From what we briefly described above, we see that the basic definition of neighbor datasets should discriminate neither a contextual nor a general analysis. Taking into consideration both very general branches of data analysis, we surely need a synthetic reproduction of the original dataset $\mathcal{D}$. The values of the items should stay intact, without generalization, to preserve the specificity and context of outliers. At the same time, the neighbor dataset should be a synthetic representation of $\mathcal{D}$, meaning that, statistically, the two datasets should represent almost the same overview of the data, leading to close hypotheses. This section first introduces how set-valued datasets are translated into trees (Section 3.1) and then shows (Section 3.2) how the generated trees are used to compute distances between these datasets; the resulting notion is stated compactly below.
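Writing $t(\mathcal{D})$ for the canonical tree of a dataset $\mathcal{D}$ (Section 3.1) and $\mathrm{TED}$ for the tree edit distance (Section 3.2), the intended notion can be stated as follows; the radius $r$ is an application-dependent parameter, and this compact formalization is our own phrasing of the idea:

```latex
\mathcal{D} \text{ and } \mathcal{D}' \text{ are neighbor datasets}
\iff
\mathrm{TED}\big(t(\mathcal{D}),\, t(\mathcal{D}')\big) \le r .
```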
3.1. Dataset Formalization
This work focuses on unordered records of sets of distinct data: for instance, a summary of the cities visited during the last month by a set of travelers, or the sets of unique words used in web queries by a collection of persons. There is no interpretation of the meaning of the data, neither in terms of semantics nor in terms of values. Data items in records are either equal or distinct; they are thus considered unique. For instance, if the same city is visited twice by a traveler, it will only be counted once in the summary of the cities visited during the period under study. Moreover, order (between records or between the innermost data) is not taken into consideration. There is thus no notion of temporality in the data.
From a mathematical point of view, the considered data are multisets of finite sets of discrete data. A multiset allows duplicated records: two distinct users may indeed have visited the same cities, and thus have the same data.
Multisets of sets can be efficiently represented by forests of trees. Figure 4 gives, for example, two trees representing the same multiset. Each path from the root to a leaf defines a set, and conversely. Storing a multiset of sets as a tree is thus motivated by the objective of computing the distance between two multisets as a distance between two trees.
However, as shown in Figure 4, each multiset of sets can be represented in various ways. In this case, using the tree edit distance (TED) directly would not yield a distance, because the distance between a multiset and itself would not be zero, contradicting the separation property. The next section presents how this problem is tackled.
3.2. Tree Edit Distance for Datasets
Tree edit distance is a measure that estimates the cost of transforming one tree into another. More precisely, this cost depends on the minimum number of elementary operations (such as deletion, insertion and renaming applied to a node of a tree), weighted by their costs, needed to move from one tree to the other. This notion extends to trees the edit distance (or Levenshtein distance) between character strings.
We are then left to translate multisets of sets into ordered trees in a uniform manner. Each set of items (i.e., each record) is translated into an ordered list of items, thanks to a lexicographic ordering. This results in a multiset of words, each of which contains at most one occurrence of each letter. Similarly, all these words are lexicographically ordered, leading to a sequence of words.
Finally, the distance between the two datasets is the distance between the translated ordered trees. Since the lexical-ordering-based translation from a multiset of sets is bijective, and since the tree edit distance is a distance, the proposed metric is a distance on datasets. With the tree edit distance, it is easy to evaluate how far apart two datasets are, and more specifically the original and the anonymized one.
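A minimal sketch of this pipeline in Python is given below. It assumes the open-source `zss` package (an implementation of the Zhang-Shasha tree edit distance); the unshared-path tree encoding is our simplification of the forests of Figure 4 (merging common prefixes would be an equally valid canonical choice):

```python
from zss import Node, simple_distance  # Zhang-Shasha TED, pip install zss

def canonical_tree(dataset):
    """Encode a multiset of sets as a uniquely ordered tree: items are
    sorted inside each record, records are sorted among themselves, and
    each record becomes one root-to-leaf path."""
    root = Node("root")
    for record in sorted(tuple(sorted(map(str, r))) for r in dataset):
        parent = root
        for item in record:
            child = Node(item)
            parent.addkid(child)
            parent = child
    return root

def dataset_distance(d1, d2):
    """Distance between two set-valued datasets via tree edit distance."""
    return simple_distance(canonical_tree(d1), canonical_tree(d2))

# toy usage: two close datasets yield a small distance
d1 = [{"a", "b", "c"}, {"b", "c"}]
d2 = [{"a", "b"}, {"b", "c"}]
print(dataset_distance(d1, d2))
```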
4. Probabilistic Wheel Re-Association for Disassociated Datasets
This section depicts the final motive of our study: how can we publish an anonymized set-valued dataset? First, the process of building a dataset to be published, arising from the privacy constraints of disassociation, is described. Then, the relationship between the result of the developed process and the tree edit distance is examined from a privacy point of view, together with the real usefulness of the metric.
4.1. Probabilistic Wheel Re-Association Algorithm
The problem of publishing anonymized set-valued data is transformed into finding a neighbor dataset that shows true features of the original dataset while being distinct, hence faithful to data analysis. In this section, we propose an algorithm, probabilistic wheel re-association (Algorithm 5, given hereafter), to generate neighbor datasets. Roulette wheel selection is a probabilistic optimization method used for selecting potentially useful solutions for recombination. We profit from the disassociation technique to lead the re-association process. Disassociation by default creates ambiguity between associations present in different record chunks of a cluster, while preserving the accuracy of associations found in the same record chunk. This ambiguity on the level of the associations between the record chunks is our playground for the probabilistic wheel re-association. The following solution respects the result of disassociation on different levels:
What has been disassociated into two distinct clusters should not be re-associated.
Only associations from different record chunks of the same cluster can be re-associated.
The associations preserved in a record chunk should not be altered or re-associated among themselves; they have already passed the $k^m$-anonymity test for anonymization.
Algorithm 5 takes as input the disassociated dataset $\mathcal{D}^*$ and starts by reconstructing a neighbor cluster $C_N$ for each cluster $C$ of $\mathcal{D}^*$ (Line 1). $C_N$ is the cluster containing the result of the gradual re-association of the record chunks of $C$. At first, $C_N$ is loaded with the records of the first record chunk (Line 2). Further, the original number of records in a cluster deeply affects the re-association, where empty records expose the weakness of the associations between the record chunks. To obtain the same size as the initial cluster, we add to $C_N$ a number of empty sets representing what is left of the original cluster size not represented in the first record chunk due to $k^m$-anonymity (Lines 3–5).
After initializing $C_N$, the algorithm applies probabilistic wheel re-association between $C_N$ and, successively, every record chunk $RC$ of $C$ (Lines 6–17). Two records $r_1$ and $r_2$ are respectively chosen from $C_N$ and the record chunk in question following the rules of the SelectRecord function (Lines 9–10). The function SelectRecord, Algorithm 6, takes a record chunk, generates the counts of the distinct itemsets in it (Line 2) and then constructs the array of cumulative probabilities of the records (Line 3). A random number between 1 and 100 is generated (Line 4); the itemset with cumulative probability equal to or immediately greater than the selected random number is returned. The two selected itemsets, $r_1$ and $r_2$, are merged together (Line 11) and moved to the temporary cluster (Line 12), waiting for all the itemsets in $C_N$ to be merged with other itemsets from $RC$ in a similar way (Lines 8–15). The generated itemsets in the temporary cluster are added to the neighbor cluster $C_N$ for the merge with the next record chunk (Line 16).
The union of all the generated clusters forms the neighbor dataset $\mathcal{D}_N$ of the disassociated dataset (Line 18). In Section 5, we present a set of experiments to evaluate the result in terms of neighborhood generation.
Algorithm 5 Probabilistic wheel re-association.
Input: $\mathcal{D}^*$
Output: $\mathcal{D}_N$
1: for each cluster $C$ in $\mathcal{D}^*$ do
2:   $C_N \leftarrow$ records of the first record chunk $RC_1$
3:   for ($i \leftarrow |C_N|$; $i < |C|$; $i{+}{+}$) do
4:     $C_N \leftarrow C_N \cup \{\emptyset\}$
5:   end for
6:   for each record chunk $RC$ of $C$ such that $RC \neq RC_1$, padded with empty records up to $|C|$ do
7:     $C_{tmp} \leftarrow \emptyset$
8:     while ($C_N \neq \emptyset$) do
9:       $r_1 \leftarrow$ SelectRecord($C_N$)
10:      $r_2 \leftarrow$ SelectRecord($RC$)
11:      $r \leftarrow r_1 \cup r_2$
12:      $C_{tmp} \leftarrow C_{tmp} \cup \{r\}$
13:      $C_N \leftarrow C_N \setminus \{r_1\}$
14:      $RC \leftarrow RC \setminus \{r_2\}$
15:    end while
16:    $C_N \leftarrow C_{tmp}$
17:  end for
18:  $\mathcal{D}_N \leftarrow \mathcal{D}_N \cup C_N$
19: end for
Algorithm 6 SelectRecord function
1: function SelectRecord($RC$)
2:   assign counts to the distinct records $r$ in $RC$
3:   construct the array $cp$ of cumulative probabilities of the records
4:   generate a random number $rand$ between $[1, 100]$
5:   from $cp$, select the record $r$ whose cumulative probability is equal to or immediately greater than $rand$
6:   Return $r$
7: end function
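The core of Algorithms 5 and 6 can be sketched compactly in Python. The data layout (each record chunk as a list of itemsets, with chunks no larger than the original cluster) and the helper names are our assumptions; `random.choices` realizes the cumulative-probability draw of Algorithm 6 directly, and padding the chunks is our reading of the empty-record mechanism of Lines 3–5:

```python
import random
from collections import Counter

def select_record(chunk):
    """Roulette wheel selection (Algorithm 6): draw an itemset from the
    chunk with probability proportional to its frequency."""
    counts = Counter(frozenset(r) for r in chunk)
    records, weights = zip(*counts.items())
    return set(random.choices(records, weights=weights, k=1)[0])

def reassociate_cluster(record_chunks, cluster_size):
    """Sketch of Algorithm 5 for one cluster: start from the first record
    chunk, pad with empty records up to the original cluster size, then
    fuse one wheel-selected itemset per remaining chunk."""
    neighbor = [set(r) for r in record_chunks[0]]
    neighbor += [set() for _ in range(cluster_size - len(neighbor))]
    for chunk in record_chunks[1:]:
        # pad the chunk too: absent sub-records count as empty itemsets
        pool = [set(r) for r in chunk]
        pool += [set() for _ in range(cluster_size - len(pool))]
        merged = []
        while neighbor:
            r1, r2 = select_record(neighbor), select_record(pool)
            neighbor.remove(r1)
            pool.remove(r2)
            merged.append(r1 | r2)   # re-associate the two sub-records
        neighbor = merged
    return neighbor
```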
4.2. Analysis of Re-Associated Datasets in Terms of Distance
How can a re-associated dataset be evaluated in terms of neighborhood generation? A suitable distance provides precise indicators of the similarity between objects. As presented in Section 3, the tree edit distance provides a solution for assessing the similarity between set-valued datasets. Using this metric, it is easy to assess, first, the distance between two re-associated datasets and, second, the distance between a re-associated dataset and the original dataset. For two datasets to be neighbors, the distance between them should be small; in other words, few modifications should be needed to make them similar. A neighborhood is generated when multiple re-associated datasets are close to each other. It is interesting to investigate the process of neighborhood initiation by generating re-associated datasets using the same or distinct privacy constraints. However, using the tree edit distance to evaluate the distance between the generated dataset and the original one can cause a privacy breach in practice. Let us consider a very basic example where the distance is very low, almost zero: in this case, the public can be sure that they have the original dataset. Yet, we stress that with a big dataset and domain, it is probabilistically very rare to regenerate the original dataset. Therefore, the distance between a re-associated dataset and the original one should not be leaked in any way to the public, as it increases the awareness of an attacker about the background data.
The next section presents experiments evaluating the neighborhood generation and the preservation of the data utility for analysis. Despite our awareness of the privacy threat that might occur when calculating the distance to the original dataset, we use the tree edit distance with the original dataset to demonstrate the efficiency of the probabilistic wheel re-association algorithm in terms of neighborhood generation.
6. Conclusions
Anonymization is challenging when data must be analyzed, studied and explored afterwards. Unstructured data complicate the anonymization process, since, in a very general way, data items may vary tremendously in structure and value without following a pattern. Set-valued data is a type of unstructured data representing multisets of sets. In this work, we are interested in publishing an anonymized set-valued dataset, ready for future accurate analysis. In this context, set-valued data is considered as isolated data, without temporality and without semantics. This scenario is a generalization over specific data values, which allows a broader investigation of set-valued data. Disassociation is an anonymization technique for set-valued datasets, presented in [2]. It guarantees $k^m$-anonymity for associations without altering the values of the data items. It proceeds by grouping the data records into different clusters and then vertically separating the items of the records when $k^m$-anonymity is not verified, thus creating ambiguity at the level of the associations for the items that are vertically separated within a cluster.
This paper is an extension of the work done in [1], where the utility-privacy trade-off in a disassociated dataset is studied in depth. The loss of associations for aggregate analysis is considered in the theoretical study. We came to the conclusion that the loss of utility is directly linked to the clustering process of disassociation. Driven by this problem, we proposed in [1] UGAC, to drive the clustering process for the set of records representing predefined utility rules.
As a continuation of the previous work, we tackle the problem of knowledge extraction in a disassociated dataset. We know from aggregate analysis that it is hard to evaluate itemsets that are split over multiple record chunks. The problem becomes: how can we re-associate the itemsets from the disassociated result while staying faithful to both anonymization and knowledge extraction in set-valued datasets?
To solve this problem, we define a general notion of similarity between datasets: neighbor datasets. Neighbor datasets are datasets that are not copies but synthetic representations that lead to trustworthy data analysis. To standardize our notion of a neighborhood, we need a distance to assess it. To the best of our knowledge, there exists no specific metric for set-valued data, which is mathematically defined as a multiset of sets. Our first contribution in this work is the formalization of datasets as trees and the use of the tree edit distance, which counts the transformations needed to move from one tree to another. This way, we are able to calculate a distance between two multisets of sets, i.e., datasets. Our second contribution is an algorithm that intuitively generates neighbor datasets. We propose a probabilistic wheel re-association algorithm to generate a re-associated dataset from the result of the disassociation of the original dataset.
Finally, we test our utility guided ant-based clustering and probabilistic wheel re-association algorithms to evaluate their efficiency, especially for knowledge extraction and data analysis in the context of general set-valued data. From the experiments, we can see that re-associated datasets create a neighborhood of the original dataset that depends on the privacy level imposed by disassociation. Despite the perturbation and noise added to the support of itemsets, probabilistic wheel re-association is able to generate synthetic datasets respecting the overall data representation of the itemsets and association rules in the data. This is extremely interesting for prediction and decision-making analysis, as statistical exploration will lead to very close hypotheses. On another side, when applying the same privacy constraints for disassociating the original dataset, the probabilistic wheel re-association algorithm generates re-associated datasets at almost the same distance from the original dataset. This reflects the fact that our approach respects the privacy imposed through the process. We can conclude that probabilistic wheel re-association is a faithful algorithm for knowledge extraction and data analysis over anonymized set-valued datasets. The preservation of predefined utility rules is necessary when we want to ensure their representation above a threshold. We ran a set of experiments with the utility guided ant-based clustering algorithm, UGAC, to see how well it can preserve the utility rules. UGAC is compared with the classical clustering technique, k-means, and with normal horizontal disassociation for various properties of utility rules. The results show that UGAC, compared to the other two solutions, is able to decrease the information loss for the utility rules without increasing the information loss of the other associations in the cluster. Combining the two solutions, we can say that knowledge extraction and data analysis remain exceptionally valid on anonymized datasets when they are transformed into neighbors of the original dataset.
In future work, we intend to use the tree edit distance on the set of generated neighbor datasets to decide which re-association is the most representative for publication, i.e., the centroid, the dataset with the least distance to all the others. We will also investigate our approach with other anonymization techniques for publishing datasets and generalize the mathematical evaluation of the utility-privacy trade-off for future uses of the dataset in machine learning algorithms.