A systematic comparison of measures for k-anonymity in networks

Rachel G. de Jong
Leiden University, Statistics Netherlands (CBS)
[email protected]
Mark P. J. van der Loo
Statistics Netherlands (CBS), Leiden University
[email protected]
Frank W. Takes
Leiden University
[email protected]
Abstract

Privacy-aware sharing of network data is a difficult task due to the interconnectedness of individuals in networks. An important part of this problem is the inherently difficult question of how in a particular situation the privacy of an individual node should be measured. To that end, in this paper we propose a set of aspects that one should consider when choosing a measure for privacy. These aspects include the type of desired privacy and attacker scenario against which the measure protects, utility of the data, the type of desired output, and the computational complexity of the chosen measure. Based on these aspects, we provide a systematic overview of existing approaches in the literature. We then focus on a set of measures that ultimately enables our objective: sharing the anonymized full network dataset with limited disclosure risk. The considered measures, each based on the concept of $k$-anonymity, account for the structure of the surroundings of a certain node and differ in completeness and ‘reach’ of the structural information taken into account. We present a comprehensive theoretical characterization as well as comparative empirical experiments on a wide range of real-world network datasets with up to millions of edges. We find that the choice of the measure has an enormous effect on the aforementioned aspects. Most interestingly, we find that the most effective measures consider a greater node vicinity, yet utilize minimal structural information and thus use minimal computational resources. This finding has important implications for researchers and practitioners, who may, based on the recommendations given in this paper, make an informed choice on how to safely share large-scale network data in a privacy-aware manner.

Keywords Complex networks · Privacy · $k$-Anonymity

1 Introduction

Network science is a research field centered around modelling real-world phenomena using nodes, representing entities, and edges, representing some type of meaningful connection between these entities. Various methods have been introduced to study networks, for example to find important nodes [1], communities [2] or anomalies [3]. These methods can be applied to model disease spread [4], measure segregation [5, 6], or identify suspicious entities in, for example, social networks [7]. For such research purposes it is beneficial to have access to networks that contain realistic information on individual entities, for example about people. While data from social media platforms such as Meta/Facebook or X/Twitter is often used, recently there has been a rise in attention for population-scale network analysis where the set of nodes represents the entire population of a country [8, 9]. Sharing such large-scale social networks of individuals for research purposes comes with considerable privacy risks, as individuals might be identified and sensitive data may be derived. Even after removing identifiers, i.e., pseudonymization, a network is not sufficiently protected, as the structure of the network itself can be used to identify individuals [10, 11, 12]. Consequently, social network data is often not shared, or can only be accessed by a limited number of researchers under strict conditions. This to some extent hinders the transition to open science.

The problem we focus on in this work is that of sharing data while preserving privacy. Many aspects should be taken into account when choosing an approach to address this problem. These include 1) the type of desired output, 2) the ‘attacker scenarios’ against which the data should be protected, and 3) the utility of the data in terms of which data properties should be preserved and how the data can be used. The most commonly used methods for the protection of network data were initially introduced in the context of Statistical Disclosure Control (SDC) [13, 14], which aims to protect tabular data, i.e., data where each entity is described by a number of attributes. In this field, one distinguishes between quasi-identifiers, i.e., attributes that may be known to the attacker, who aims to de-anonymize the network, and that could be used for identification, and sensitive attributes, which are unknown to the attacker and which the attacker aims to obtain.

The most commonly used approaches in SDC are differential privacy [15, 16] and $k$-anonymity [17]. In differential privacy, a mechanism gives noisy answers to user queries such that the privacy of entities is preserved. For $k$-anonymity, the aim is to create a version of the data such that for each entity there are at least $k$ equivalent candidates, so that no single individual can be identified. Typically a value of $k > 1$ is used. This is achieved by generalizing quasi-identifiers, or suppressing sensitive values in the dataset. However, in a $k$-anonymous dataset it would still be possible for an attacker to obtain information if all $k$ entities have the same sensitive attribute(s) or if the distribution in a certain class deviates from the distribution over the entire dataset. This is counteracted by extending $k$-anonymity with $\ell$-diversity [18], $t$-closeness [19] or $(\alpha, k)$-anonymity [20]. Other approaches incorporate synthetic data [21] by either generating a new dataset from scratch or combining synthetic data with the actual data. Another approach to address this problem is anatomy [22], in which the quasi-identifiers and sensitive attributes are shared in separate tables such that it cannot be inferred which set of sensitive attributes corresponds to which set of quasi-identifiers.
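As a minimal sketch of the tabular notion of $k$-anonymity described above, the following Python snippet checks whether every combination of quasi-identifier values occurs at least $k$ times. The function name `is_k_anonymous`, the attribute names and the toy records are hypothetical and only serve as illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if each combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical, already generalized toy table (illustration only).
records = [
    {"age": "30-39", "zip": "23**", "disease": "flu"},
    {"age": "30-39", "zip": "23**", "disease": "asthma"},
    {"age": "40-49", "zip": "24**", "disease": "flu"},
    {"age": "40-49", "zip": "24**", "disease": "flu"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))  # each class has two records
print(is_k_anonymous(records, ["age", "zip"], 3))  # no class has three records
```

Note that the second class in this toy table is 2-anonymous, yet both of its records share the same sensitive value, which is exactly the situation that extensions such as $\ell$-diversity aim to prevent.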

SDC methods originally designed for tabular data have been adjusted to the case of network data. Several generically applicable approaches based on differential privacy [23, 24, 25] as well as $k$-anonymity [26, 27, 11, 28] have been introduced. Other approaches specific to networks are randomization [29, 30, 31] and clustering [32, 27, 33, 34, 35, 36, 37]. As we will see in Section 3, we can further categorize these approaches based on the type of output they generate.

The approach we focus on in this paper is $k$-anonymity. This approach ultimately enables one to share an anonymized version of the full network data while protecting against reidentification risk. When using $k$-anonymity to this end, one needs to choose both an anonymization algorithm and a definition for equivalence, which we refer to as the anonymity measure. The goal of the anonymization algorithm is to achieve a $k$-anonymous network by means of perturbation, the output of which can safely be shared. In this network, each node is equivalent to at least $k-1$ other nodes, given the chosen definition for equivalence. While the anonymization step is an important one, in this paper we specifically focus on the measurement of $k$-anonymity.

Various anonymity measures can be used for node equivalence. These measures differ in the completeness and reach of the structural information they take into account. As a result, they differ in their ‘strictness’ of when two nodes are equivalent. Existing measures are typically based on the degree of a node [26] or nearby nodes [38], the ego-network structure [11, 12] or the exact structural position of a node in the network [28, 39]. Each measure essentially corresponds to an attacker scenario in which we assume the attacker has a certain amount and type of information on an entity. Stricter measures would protect against more attacker scenarios, but at the same time they add stronger requirements for a network to be anonymous. While the anonymization method can be chosen separately from the anonymity measure, in general it is more difficult to obtain an anonymous network for a stricter measure by means of perturbation. As more perturbations might be required to obtain an anonymized network, choosing a stricter measure may in turn result in lower data utility. On the other hand, if the network is protected against a lenient attacker scenario, one takes the risk of underestimating the attacker knowledge and the attacker may still be able to identify entities in the anonymized network.

It is clear that the choice of anonymity measure is of high importance. While many existing works aim to achieve a $k$-anonymous network for various measures [26, 27, 11, 28], there is not yet a systematic comparison of these measures and how the choice of measure impacts the attained privacy of a $k$-anonymous network.

As a key part of this paper, we compare different measures for $k$-anonymity and aim to understand them both theoretically and empirically by comparing them on a wide range of empirical networks. We distinguish between these measures based on how far they ‘reach’, i.e., up to where they take structural information into account, and how complete the structural information is. When we empirically compare the measures, we find that measures with a larger reach, even with minimal structural information, lead to higher risk of disclosure than having perfect structural knowledge over the direct neighborhood of a node.

To summarize, the main contributions of this paper are the following.

  • We provide an overview of aspects that should be taken into account when sharing network data while preserving privacy of entities represented (Section 2).

  • We give a categorization of state-of-the-art approaches in the literature (Section 3).

  • We summarize existing measures for $k$-anonymity in networks (Section 4).

  • We theoretically compare the measures based on reach and the completeness of the information they take into account, creating a formal strictness-based ordering of measures (Section 5).

  • We empirically compare these measures in terms of both anonymity observed and runtime on a wide range of real-world network datasets (Section 6).

The remainder of the paper is structured as follows. In Section 2, we summarize the aspects that should be taken into account when choosing an approach and measure for sharing network data while preserving the privacy of entities represented in it. Next, Section 3 gives an overview and categorization of approaches from the literature. Then, in Section 4 we give an overview of the measures for $k$-anonymity introduced in the literature. We compare a set of representative measures both theoretically and empirically on a wide range of networks in Sections 5 and 6. Lastly, we conclude the paper and indicate possible future work in Section 7.

2 Aspects of sharing sensitive network data

When deciding which approach to use for sharing network data while preserving privacy of entities represented, there are three important aspects to keep in mind: utility, privacy and computation time. We describe all three of these aspects in detail below.

2.1 Utility

For the utility of the data, we distinguish between two aspects. First, how the data can be shared in a useful manner, i.e., what is the desired type of output, and second which structural network properties should be preserved.

Starting with the first, we can distinguish between four types of output: 1) an interactive mechanism that allows users to ask queries about the network, 2) an intermediate representation of the network from which new networks can be generated, 3) a synthetic network, or 4) a perturbed version of the network which can immediately be analyzed. While the first type is interactive, the others are non-interactive [15], as they share a version of the data once and do not require further interaction with the user.

The choice of output relates to the trade-off between data utility and privacy. When using an interactive approach where a user is allowed to submit queries, the information shared is tractable. Additionally, one could choose to not implement queries which may be a threat for privacy. A non-interactive approach would allow the user to run any query. However, this gives an advantage to possible attackers. While methods usually assume a certain attacker scenario, an attacker could come up with an unforeseen method, e.g., twin nodes as proposed in [12], or gather more data than foreseen in the attacker scenario, e.g., about centrality [40], which can be used to extract sensitive information. Moreover, when a non-interactive approach is used, this allows users to get more data from the network as they are not restricted to the queries that are implemented for the interactive mechanism.

A second utility aspect one needs to consider is which properties of the network are important to preserve. These properties could be global, concerning properties considering the graph as a whole such as the diameter; local, concerning properties surrounding the nodes such as clustering and degree; or at the meso level, for example accounting for the community structure. With this in mind, one could choose an approach that better preserves these properties as some methods may by their nature be more likely to destroy or preserve certain properties.

2.2 Privacy

For the privacy aspect, one needs to determine four things: 1) what the sensitive information in the network is, 2) what a realistic amount of structural attacker knowledge is, 3) whether there are any properties of the network that can be used by the attacker, and 4) whether the attacker can play a role in the network to cause alterations. We discuss each in detail below.

2.2.1 Sensitive information

Overall, works distinguish between node privacy and edge privacy [41, 42]. When protecting for node privacy, an attacker should not be able to infer the identity of a node. For edge privacy, the attacker should not be able to infer if there exists a relationship between two entities. Edge privacy would not imply node privacy, but node privacy, which is the focus of this paper, conditionally implies edge privacy. Intuitively, when the identities of the two nodes an edge connects are unknown, the existence of a link cannot be inferred, hence assuring edge privacy. However, edge privacy is not guaranteed if the existence of the connection can be logically inferred, e.g., if for all pairs of candidate nodes there exists an edge. These notions can be linked to identity and attribute disclosure in SDC, as briefly covered in Section 1. Node privacy would prevent an entity from being identified in the dataset and thus prevent identity disclosure, while edge privacy would prevent one from learning about connections between entities, which can be seen as an attribute disclosure.

2.2.2 Attacker knowledge

To estimate the attacker scenario, we need to assess which type of knowledge the attacker can obtain, in which amount, and which properties they can use to obtain more sensitive information. In the context of $k$-anonymity, which we discuss in detail in Section 4, the choice of attacker scenario plays an important role in choosing a measure. Examples are the degree-based attack [26, 43, 44], where the attacker has knowledge of degrees of nodes, or a neighborhood attack [11, 45], where we assume a possible attacker has knowledge of a node’s surrounding neighborhood. Considering the amount of attacker knowledge, assuming more knowledge overall protects against more attacker scenarios, resulting in a stronger privacy guarantee. However, when one assumes an attacker has more knowledge, it may be more difficult to achieve anonymity, as in an anonymization approach based on perturbation more changes might be required, which in turn could result in lower data utility. By choosing a less strict attacker scenario, anonymity may be easier to achieve, which may indirectly result in higher utility. However, it may not ensure sufficient privacy. Hence, it is very important to take into account a realistic attacker scenario.

2.2.3 Types of attacker scenarios

Various types of attacker scenarios have been introduced in literature [46, 47]. These works distinguish between seed-based and seedless attacks. For seed-based attacks, it is assumed that the attacker knows a number of user identities, which are referred to as seeds. Most approaches include a propagation or cascade step which uses the identified entities to de-anonymize more entities [48, 49]. These works assume an attacker has access to different social networks that partially overlap in the people present. These networks can be used to link similar nodes to each other. If this can be done properly, this could be a threat to privacy. However, it has been shown that this approach needs many high-quality seed nodes [49], which may not always be realistic. Seedless attacks assume that an attacker does not need any seeds to start. Instead, structural network properties are used for identification.

2.2.4 Attacker role

Lastly, literature distinguishes between active and passive attacks [41, 50, 47]. For active attacks, also referred to as sybil attacks, the assumption is that an attacker can make changes to the network, e.g., by making friends on a social media platform. After an anonymized version of the network is obtained, the attacker can use these sybil nodes to identify themselves, and possibly cascade this information to de-anonymize more nodes. When this is possible, this is an important aspect to take into account in the attacker scenario. For passive attacks, which is the type of attack assumed in this paper, either a single network is shared, or it is assumed an attacker is not able to alter the network to such a large extent, e.g., for an electric grid network.

2.3 Computation time

When choosing an approach, one should also take into account whether it is computationally feasible to use this on the network data at hand. Some anonymization algorithms are more computationally expensive than others, making them possibly unsuitable for larger networks. For example, the problem of achieving $k$-anonymity with as few changes as possible is proven to be NP-hard when using 1-neighborhood isomorphism as measure [51], hence an approximation algorithm has to be considered. When using the degree as measure, determining the smallest set of alterations required to anonymize the graph by means of perturbation can be done efficiently. For example, the algorithm introduced in [26] runs in quadratic time in terms of the number of nodes. There essentially exists a trade-off between privacy and computational cost which should be taken into account when choosing an approach for a given network dataset.

3 Methods for privacy aware sharing of networks

In this section, we summarize approaches introduced in literature for achieving privacy aware sharing of networks. We categorize the approaches based on the types of output mentioned in Section 2.1 and discuss each approach accordingly. The categories and corresponding properties can be found in Table 1.

| Method | Output: Interactive | Output: Synthetic | Output: Intermediate | Output: Perturbed | Privacy for: Nodes | Privacy for: Edges |
|---|---|---|---|---|---|---|
| Differential privacy: queries [24, 52, 53, 42] | ✓ | | | | ◊ | ✓ |
| Differential privacy: synthetic [23, 25, 54] | | ✓ | | | ◊ | ✓ |
| Clustering [32, 33, 34, 35] | | | ✓ | | ✓ | |
| Injecting uncertainty [55] | | | ✓ | | | ✓ |
| Randomization [29, 30, 31] | | | | ✓ | | ✓ |
| $k$-Anonymity [26, 27, 11, 28] | | | | ✓ | ✓ | |

Table 1: Categorization of methods for privacy-aware sharing of networks. ✓ indicates the method is in this category, ◊ that it could be in this category, but is less commonly studied as such.

3.1 Interactive

The first category that we discuss is that of the interactive approaches, which primarily consists of differential privacy based approaches. These approaches allow users to ask queries about the network dataset [56]. The user receives an answer to which noise is added, such that it preserves either node or edge privacy. The attacker scenario assumed is that the attacker has knowledge about the entire network except for one node or edge. Based on the answer, it should not be possible to infer whether a specific edge, when protecting edge privacy, or node, when protecting node privacy, exists in the dataset. The amount of noise added is based on the sensitivity of the query and ensures the privacy is guaranteed. Most works on differential privacy focus on edge privacy [23, 25, 54] and some on node privacy [53, 57]. As the (non-)existence of a node can potentially have more effect on the query than the (non-)existence of an edge, it is overall more difficult to ensure node privacy. Hence, more noise should be added to achieve node privacy.
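To illustrate how the noise scales with the sensitivity of a query, the sketch below answers an edge-count query with the Laplace mechanism. The text above does not name a specific mechanism, so the Laplace mechanism is an assumption here, as are the function names and the privacy parameter `epsilon`. Under edge privacy, adding or removing one edge changes the edge count by at most one, so the sensitivity is taken to be 1.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_edge_count(edges, epsilon, rng, sensitivity=1.0):
    """Answer the query 'how many edges?' with noise of scale sensitivity / epsilon."""
    return len(edges) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical toy graph: a cycle on four nodes.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
rng = random.Random(0)
answer = noisy_edge_count(edges, epsilon=0.5, rng=rng)
print(answer)  # the true count (4) plus random noise
```

A query with a larger sensitivity, as is typically the case under node privacy, would require a proportionally larger noise scale for the same `epsilon`.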

When applying differential privacy to networks, there are two approaches, each providing different types of output. One aims to answer a specific user query with as little noise as possible [24, 52, 53, 42]. In this case, the approach is interactive and users are dependent on the queries that are implemented. A disadvantage of this approach is that when more queries are posed, more noise should be added to ensure differential privacy [42]. This strongly reduces the quality of the answers and with that the utility of the resulting dataset.

The other set of differential privacy approaches aims to create a synthetic network based on the output of a query [23, 25, 54]. This can for example be done by generating a synthetic dK-graph [58] (see also Section 3.3) based on a differentially private joint degree distribution of the network. This distribution can be obtained from the original graph by determining for each edge the degree of the nodes it connects, and counting the occurrence of each such pair. For this type of approach, the quality of the resulting network depends both on the quality of the answer provided by the differential privacy mechanism and the graph model used, in this case the dK-graph.
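The joint degree distribution described above can be computed directly from an edge list. The following sketch shows only this counting step, without the differentially private noise; the function name `joint_degree_distribution` and the toy edge list are hypothetical.

```python
from collections import Counter

def joint_degree_distribution(edges):
    """Count, for each edge, the (sorted) pair of degrees of the nodes it connects."""
    degree = Counter()
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    jdd = Counter()
    for v, w in edges:
        pair = tuple(sorted((degree[v], degree[w])))
        jdd[pair] += 1
    return dict(jdd)

# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 1.
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
print(joint_degree_distribution(edges))
```

In a differentially private variant, noise would be added to these counts before a graph model is fitted to them.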

3.2 Intermediate representation

The second category we discuss is an intermediate representation of the network. Two forms of intermediate representations have been introduced for privacy aware sharing of networks, being clustering and injecting uncertainty [32, 27, 33, 55].

For the clustering approaches the nodes of the network are first merged into supernodes [32, 27, 33, 34, 35, 36, 37] according to some predefined mechanism. These supernodes additionally contain labels denoting the number of nodes and edges within each cluster. If there is at least one edge between nodes in two clusters, this is modelled using a superedge denoting the number of connections between nodes in the different clusters. The resulting graph can then be used to generate a network to analyze. By ensuring each cluster has a size of at least $k$, anonymity of the entities can be ensured. Due to the clustering, this approach likely preserves global properties, yet local properties may be destroyed as the exact neighborhood structure surrounding a node cannot be inferred from the supernodes and superedges. Different works focus on improving the clusters made [35, 36], and some approaches add additional constraints to account for node labels [37].
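The construction of supernodes and superedges can be sketched as follows, assuming that the assignment of nodes to clusters is already given. How the clusters are chosen is the core of the cited methods and is not covered here; the function name `build_supergraph` and the example clustering are hypothetical.

```python
from collections import Counter

def build_supergraph(edges, cluster_of):
    """Summarize a graph by supernodes (size, internal edges) and weighted superedges."""
    sizes = Counter(cluster_of.values())
    internal = Counter()    # number of edges inside each cluster
    superedges = Counter()  # number of edges between each pair of clusters
    for v, w in edges:
        cv, cw = cluster_of[v], cluster_of[w]
        if cv == cw:
            internal[cv] += 1
        else:
            superedges[tuple(sorted((cv, cw)))] += 1
    supernodes = {c: (sizes[c], internal[c]) for c in sizes}
    return supernodes, dict(superedges)

# Hypothetical toy graph and clustering with two clusters of three nodes each.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 3)]
cluster_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
supernodes, superedges = build_supergraph(edges, cluster_of)
print(supernodes)  # cluster label -> (number of nodes, number of internal edges)
print(superedges)  # pair of cluster labels -> number of connecting edges
```

In this example every cluster contains three nodes, so each node is hidden among at least three candidates.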

Another approach is to inject uncertainty into edges. The work presented in [55] introduces an approach that adds a probability to each possible edge in the graph to hide certain node properties. This resulting graph with edge probabilities can be shared and used to sample new synthetic networks. This approach would hence ensure edge anonymity. However, which network properties would be preserved and destroyed depends heavily on how the probability is assigned to the edges.

3.3 Synthetic data

The third category of approaches generates a synthetic network based on graph properties of the original network. Various generic methods exist to generate network models that aim to capture real-world network properties. Examples are the Barabási–Albert [59] or Watts–Strogatz model [60]. An example of such a model specifically for achieving privacy is the dK-graph model [58]. However, it is found that this in itself may not ensure privacy, as many nodes can still be identified in the resulting dK-graph [61]. Hence, subsequently introduced work first generates a differentially private joint degree distribution, as explained in Section 3.1, which can then be used to generate a dK-graph [25, 54].

3.4 Perturbed network

The last category is perturbation. For this type of approach, the goal is to share an adapted version of the original network which does not reveal a node’s identity based on structure. Overall there are two types of approaches: randomization [29, 30, 31] to ensure edge privacy, and $k$-anonymity [26, 27, 11, 28] to ensure node privacy.

Various operations on edges, such as edge addition, deletion or rewiring, can be used to perturb the network. After the graph is perturbed, an attacker may not be certain about the existence of a connection between two entities, hence ensuring edge anonymity. The work in [29] introduces such an approach, additionally preserving utility by retaining the spectrum, i.e., the set of eigenvalues of the adjacency matrix representing the network. In [30, 31], edges are added based on random walks with the aim of preserving more data utility.

One of the most common methods for ensuring node privacy, which is also commonly used in SDC literature and the focus of the remainder of this paper, is $k$-anonymity [26, 27, 11, 28]. When using this approach, for each node in the graph there should be $k$ equivalent candidates given an amount of structural information, hence ensuring node privacy. This amount of information, i.e., the chosen anonymity measure, corresponds to an attacker scenario where we assume that a possible attacker has this amount of knowledge. Based on this measure we can determine if a given node is $k$-anonymous or not. When choosing the measure, the privacy versus utility trade-off is very important, as discussed in Section 2.

Different anonymization algorithms can be used which could aim to preserve certain topological properties. Inevitably, some properties need to be destroyed in order to make all nodes $k$-anonymous. An example is that perturbation will always change the degree distribution, even if the chosen anonymity measure is only the degree of nodes. In the remainder of this paper we focus on $k$-anonymity by surveying and comparing different measures.

4 An overview of k-anonymity measures

In this section, we first define the required network concepts and notions related to $k$-anonymity. Second, we provide an overview of measures for $k$-anonymity introduced in literature thus far.

4.1 Networks

We define a network or graph $G = (V, E)$ as a set of nodes $V$, and a set of edges $\{v, w\} \in E$, with $v, w \in V$. The degree of a node equals the number of connections it has: $degree(v) = |\{w : \{v, w\} \in E\}|$. We can create a distribution of node degrees in the network. In real-world networks, this often resembles a powerlaw distribution [62] where $occurrence(degree) = c \cdot degree^{-\alpha}$. Here $c$ is a constant and $\alpha$ the powerlaw exponent. The value for $\alpha$ indicates the steepness of the slope of the degree distribution. Hence, a low $\alpha$ indicates a less steep slope and a more diverse degree distribution, while a high value implies a more steep slope and a less diverse degree distribution.
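As a small illustration of these definitions, the sketch below computes node degrees and the degree distribution from an undirected edge list. The function names and the toy graph are hypothetical; nodes without any edge do not appear in an edge list and would have to be added separately.

```python
from collections import Counter

def degrees(edges):
    """Return degree(v) = |{w : {v, w} in E}| for every node that occurs in an edge."""
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(v, set()).add(w)
        neighbors.setdefault(w, set()).add(v)
    return {v: len(ns) for v, ns in neighbors.items()}

def degree_distribution(edges):
    """Return a mapping from degree value to the number of nodes with that degree."""
    return dict(Counter(degrees(edges).values()))

# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 1.
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
print(degrees(edges))
print(degree_distribution(edges))
```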

In real-world networks it often occurs that two neighboring nodes form a triangle with a third node. We measure the tendency to form triangles using the clustering coefficient. For a node, this equals the number of triangles it is part of, divided by the maximum number of triangles it could have, which equals $\frac{1}{2}\, degree(v)\,(degree(v) - 1)$. For a graph, we summarize the clustering coefficient as the average clustering coefficient over all nodes.
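The node-level and graph-level clustering coefficient can be sketched as follows. The text does not specify the value for nodes with fewer than two neighbors, for which the denominator is zero; assigning them a coefficient of 0 is an assumption made here, and the function names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Triangles through v divided by degree(v) * (degree(v) - 1) / 2."""
    deg = len(adj[v])
    if deg < 2:
        return 0.0  # assumption: nodes that cannot form a triangle get coefficient 0
    triangles = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return triangles / (deg * (deg - 1) / 2)

def average_clustering(adj):
    """Average of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Hypothetical toy graph as adjacency sets: triangle 1-2-3 with pendant node 4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(local_clustering(adj, 1))  # one triangle out of three possible ones
print(average_clustering(adj))
```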

Nodes may connect to nodes with a similar degree or a different degree. We capture this tendency by assortativity. A negative value implies nodes tend to connect to dissimilar nodes while a positive value indicates they tend to connect to similar nodes.

We define the distance between two nodes, $distance(v, w)$, as the minimum number of edges that needs to be traversed to reach one node from the other. For simplicity, we focus on undirected graphs. It follows that $distance(v, w) = distance(w, v)$ and $distance(v, v) = 0$. When there is no path, i.e., a sequence of edges, connecting $v$ and $w$, then $distance(v, w) = \infty$. This occurs when the nodes are in different components, i.e., maximal subsets of nodes in which there is a path between all pairs of nodes. The diameter of the graph, $D(G)$, is the length of the longest shortest path between two nodes that is not equal to $\infty$. The average distance equals the average of the length of all shortest paths between all node pairs in the network that are in the same component.
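Distances in an unweighted, undirected graph can be computed with a breadth-first search. The sketch below follows the definitions above, returning infinity for nodes in different components and ignoring such pairs when computing the diameter. The function names and the toy graph are hypothetical.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Return the distance from source to every node in the same component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def distance(adj, v, w):
    """Shortest-path distance, or infinity if v and w are in different components."""
    return bfs_distances(adj, v).get(w, math.inf)

def diameter(adj):
    """Longest finite shortest-path length over all node pairs."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# Hypothetical toy graph: a path 1-2-3-4 and a separate component 5-6.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {6}, 6: {5}}
print(distance(adj, 1, 4))
print(distance(adj, 1, 5))
print(diameter(adj))
```

Running one search per node is sufficient for small graphs; for networks with millions of edges, exact diameter computation typically requires more specialized algorithms.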

We define the $d$-neighborhood of a node $N_d(v) = (V_{N_d(v)}, E_{N_d(v)})$ as the set of nodes that are at most distance $d$ from target node $v$, and the set of all edges between these nodes. Two neighborhoods in a graph are structurally indistinguishable if they are isomorphic, as defined in Definition 1.
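A $d$-neighborhood can be extracted with a breadth-first search that stops expanding at depth $d$, after which all edges among the collected nodes are kept. This is a minimal sketch with a hypothetical function name and toy graph.

```python
from collections import deque

def d_neighborhood(adj, v, d):
    """Return (nodes, edges) of the subgraph induced by nodes within distance d of v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == d:
            continue  # do not expand beyond distance d
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    nodes = set(dist)
    # Keep every edge whose endpoints are both in the neighborhood.
    edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
    return nodes, edges

# Hypothetical toy graph: path 1-2-3 with a triangle 3-4-5 attached.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
nodes1, edges1 = d_neighborhood(adj, 2, 1)
nodes2, edges2 = d_neighborhood(adj, 2, 2)
print(nodes1, len(edges1))
print(nodes2, len(edges2))
```

Note that the 2-neighborhood of node 2 includes the edge between nodes 4 and 5, even though both lie at distance 2, since all edges between nodes in the neighborhood are part of it.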

Definition 1 (Graph isomorphism)

Given two graphs $G = (V, E)$ and $G' = (V', E')$, a graph isomorphism is a bijective function $\phi : V \rightarrow V'$ such that for each $v, w \in V$ it holds that $\{\phi(v), \phi(w)\} \in E'$ precisely when $\{v, w\} \in E$.

We can determine if two graphs are isomorphic by comparing their canonical labeling [63]. This is a label assigned by a function $\mathcal{C}$ such that two graphs have the same label value if and only if they are isomorphic. A special case of isomorphism is automorphism: an isomorphism from the graph onto itself. If two nodes are mapped onto each other by an automorphism, they are in the same orbit. This implies that the nodes are structurally indistinguishable from each other.
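To make the idea of a canonical label concrete, the following brute-force sketch tries every node ordering and keeps the smallest relabeled edge list. It is only an illustration for very small graphs (it enumerates $n!$ orderings) and is not the canonical labeling algorithm of [63]; the function name and example graphs are hypothetical.

```python
from itertools import permutations


def canonical_label(nodes, edges):
    """Brute-force canonical label: (node count, smallest sorted edge list
    over all relabelings of the nodes to 0..n-1)."""
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        form = tuple(sorted(
            tuple(sorted((relabel[u], relabel[w]))) for u, w in edges
        ))
        if best is None or form < best:
            best = form
    return (len(nodes), best)


# Two differently labeled paths on four nodes, and a star on four nodes.
path_a = canonical_label("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
path_b = canonical_label([1, 2, 3, 4], [(3, 1), (1, 4), (4, 2)])
star = canonical_label([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
```

Because the minimum is taken over all relabelings, two graphs receive the same label exactly when some relabeling maps one edge set onto the other, which is the isomorphism condition of Definition 1.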

4.2 k-Anonymity and equivalence

In order for a node to be $k$-anonymous, it should be equivalent to at least $k-1$ other nodes according to a particular definition for equivalence $M$. We refer to $M$ as the anonymity measure. If two nodes $v, w \in V$ are equivalent using a measure $M$, we denote this as $v \cong_M w$. These nodes are said to be in the same equivalence class. We let $EC_M$ denote the partitioning of nodes $V$ into equivalence classes and $EC_M(v)$ the equivalence class of a given node $v$. We say that a node $v$ is $k$-anonymous when $|EC_M(v)| = k$. We summarize the anonymity of a graph given a certain measure by its uniqueness, defined in Definition 4.

Definition 2 (Equivalence class)

Given a graph $G = (V,E)$ and anonymity measure $M$, we define an equivalence class $EC_M$ as a set of nodes, such that $\forall v, w \in EC_M : v \cong_M w$.

Definition 3 (k-Anonymity)

Given a graph $G = (V,E)$ and measure $M$, a node is $k$-anonymous for $M$ if it is in an equivalence class of size $k$. The graph $G$ is $k$-anonymous if all nodes in $V$ are at least $k$-anonymous.

Definition 4 (Uniqueness)

Given a graph $G = (V,E)$, measure $M$ and equivalence classes $EC_M$, a node $v$ is unique if $|EC_M(v)| = 1$. We define the uniqueness of the graph as the fraction of unique nodes in the graph.

$$uniqueness(G) = \frac{|\{v : v \in V, |EC_M(v)| = 1\}|}{|V|}$$
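Definitions 2 to 4 can be sketched in a few lines once each node has been assigned a measure value: nodes with equal values form an equivalence class, the class size gives the node's anonymity, and uniqueness is the fraction of classes of size one. The function names and the toy values below are assumptions for illustration only.

```python
from collections import defaultdict


def equivalence_classes(measure_values):
    """Group nodes by their measure value; each group is one equivalence class."""
    classes = defaultdict(set)
    for node, value in measure_values.items():
        classes[value].add(node)
    return list(classes.values())


def node_anonymity(measure_values):
    """Map each node v to |EC_M(v)|, the size of its equivalence class."""
    sizes = {}
    for cls in equivalence_classes(measure_values):
        for node in cls:
            sizes[node] = len(cls)
    return sizes


def graph_anonymity(measure_values):
    """Largest k such that every node is at least k-anonymous."""
    return min(node_anonymity(measure_values).values())


def uniqueness(measure_values):
    """Fraction of nodes whose equivalence class has size 1."""
    sizes = node_anonymity(measure_values)
    return sum(1 for size in sizes.values() if size == 1) / len(sizes)


# Toy values (illustrative only): degrees in a star with center "a" and three leaves.
degrees = {"a": 3, "b": 1, "c": 1, "d": 1}
```

In this toy case the center is unique and the three leaves are 3-anonymous, so the graph as a whole is only 1-anonymous.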

4.3 k-Anonymity measures

In this section, we discuss measures for $k$-anonymity introduced in literature. We categorize these based on what type of information they take into account: degree based, neighborhood based, automorphism based and hybrid measures. While the selection of a measure can be done separately from the anonymization process, the measure does influence the difficulty of anonymizing a graph. Hence, for each measure we both describe the measure itself and give an estimate of how complex it is to achieve anonymity with as few alterations as possible. A parameterized version for most measures can be created to account for structural information beyond the direct neighborhood of the target node, denoted with parameter $d$. Additionally, we discuss how each of the discussed measures can be extended by a cascading step.

We summarize the measures we focus on throughout the remainder of the manuscript in Table 2. These measures are chosen based on two criteria. First, each of the measures assigns one or more values to each node, which are used to divide the set of all nodes into equivalence classes. Second, the chosen measures represent the set of measures introduced in literature that model feasible attacker scenarios.

| Measure | Reach ($d=1$) | Value ($d=1$) | Reach ($d=2$) | Value ($d=2$) |
|---|---|---|---|---|
| Degree [26, 64, 43, 44] | [image] | 2 | - | - |
| Count [65] | [image] | (3, 3) | [image] | (5, 6) |
| Degree distribution [65] | [image] | $\{2,2,2\}$ | [image] | $\{2,2,2,3,3\}$ |
| $d$-$k$-Anonymity [11, 45, 12, 66] | [image] | [image] | [image] | [image] |
| VRQ [38, 27] | [image] | $\{2,3,3\}$ | [image] | $\{2,2,3,3,3\}$ |
| Hybrid [67] | [image] | [image], $\{2,3,3\}$ | [image] | [image], $\{2,2,3,3,3\}$ |

Table 2: The six anonymity measures compared in this paper. For $d=1$ and $d=2$, we illustrate the reach of the measures (second and fourth column) and the outcome for the given measure (third and fifth column).

4.3.1 Degree based

We distinguish between three measures within the category of degree based measures, being degree, vertex refinement queries and joint degree.

Most of the works using degree based measures focus on the actual degree of the node [26, 68, 69, 70, 64, 43, 44]. A graph is then $k$-anonymous if each degree occurs at least $k$ times in the network. Compared to the other measures we discuss, this is the simplest measure, as it accounts for the least structural information and is trivially easy to compute for a given node. As a result, anonymization algorithms have been introduced that minimize the number of edge deletions. These approaches can be used on networks with over a million nodes and edges [26, 43, 44]. Other works that introduced anonymization methods for this measure specifically aim to preserve more data utility than the other approaches [68, 70, 64].

A parameterized measure that takes into account the degree of nearby nodes is introduced in [38, 27]. This work introduces the notion of vertex refinement queries (VRQ), denoted $\mathcal{H}_i$, for which the distance can be set with parameter $i$. When $\mathcal{H}_1$ is used, only the degree of the node is taken into account. For $\mathcal{H}_i$, with $i > 1$, the degree distribution of the nodes at distance at most $i-1$ is taken into account. The measure is applied recursively, i.e., $\mathcal{H}_i$ is used to compute $\mathcal{H}_{i+1}$ for each $i \geq 1$. This measure is more difficult to anonymize for and requires an approximation algorithm to minimize the number of edge deletions.

A measure that is more strict than degree, but not as strict as VRQ, has been introduced in [71] and accounts for pairs of degrees: the degree of the target node and that of its neighbor. While this measure accounts for only slightly more information than node degree itself, anonymization is already much more difficult. To anonymize a network given this measure, the work proposes an integer programming problem and an approximation algorithm to anonymize larger networks. This measure differs from the remaining measures as it does not compute a value for a node to determine the equivalence classes. Hence we do not include the measure in our comparison. However, it does show similarities to the cascading algorithm we discuss in Sections 4.3.5 and 6.4.

4.3.2 Neighborhood based

For neighborhood based measures, we distinguish between measures that account for complete structural information and measures that are based on meaningful properties of the neighborhood. Most common are measures that account for the complete structure of the 1-neighborhood [11, 45, 51, 72, 73, 66]. Here nodes are said to be equivalent if their neighborhoods are isomorphic, which implies the neighborhoods are not distinguishable based on structure. Other works propose a parameterized version of this measure assuming structural knowledge of the $d$-neighborhoods [66, 73, 65, 12] resulting in a measure referred to as $d$-$k$-anonymity. Similar to VRQ, this measure is computed recursively. Hence, in order for a node to be $d$-$k$-anonymous, it needs to be $(d-1)$-$k$-anonymous.

The work in [65] uses heuristics to speed up the computation of $d$-$k$-anonymity. These heuristics are graph invariants: properties that are equal if two graphs, or in this case $d$-neighborhoods, are isomorphic. If the values are not equal, the $d$-neighborhoods can not be isomorphic and as a result the nodes can not be equivalent. However, the converse does not hold: if the graph invariants are equal it does not imply that the neighborhoods are isomorphic. The heuristics used are count, which equals the number of nodes and edges in the $d$-neighborhood, and degree, which equals the degree distribution of the $d$-neighborhood. From the degree distribution, the number of nodes and edges in the $d$-neighborhood, which equals the value for count, can be derived. As these heuristics can themselves be used as measures for equivalence and are, in terms of the information they measure, elegantly positioned between degree and $d$-$k$-anonymity, we choose to include these in our comparison.

4.3.3 Automorphism based

Several works focus on a very strict scenario where an attacker should not be able to distinguish between two nodes, even with complete structural information [28, 39, 74]. There are two variants. In [28, 39], the aim is that for each node there should be at least $k-1$ nodes in the same orbit. This implies that the nodes have the exact same structural properties and hence can not be distinguished even when an attacker has perfect structural information over the entire network. For [74], the network is $k$-anonymous if it can be partitioned into $k$ subgraphs that are isomorphic to each other. To achieve $k$-anonymity for these measures, symmetry should be introduced to the network. In particular, for a network to be $k$-anonymous, all its components should be symmetric in at least $k-1$ points, which is likely not realistic for most real-world networks. This would moreover have major impact on the utility of an anonymized version of the network. As this approach is computationally expensive and approximated by the distance-parameterized structural neighborhood measures mentioned at the end of Section 4.3.2, we do not include automorphism based measures in our comparison.

4.3.4 Hybrid

The previously mentioned measures can also be combined, which is done in [67]. Here nodes are equivalent if they have isomorphic 1-neighborhoods and if the degree distribution over the neighboring nodes is equal. This is referred to as the $1^*$-neighborhood attack. Hence, this combines the measures of 1-neighborhood isomorphism [11, 45, 65, 12, 73], and VRQ [38, 27]. A parameterized version would combine the $d$-neighborhood with VRQ$(d)$.

4.3.5 Anonymity-cascade

Recent work introduces anonymity-cascade [12]. This algorithm models the scenario that an attacker reuses the nodes that are uniquely identified to identify more nodes in the network. This is similar to the attacker scenarios that include a propagation step [48, 49] as discussed in Section 2.2. In the work of [12], the algorithm starts with all nodes that have a unique 1-neighborhood structure. Then, it continues in a cascading fashion by identifying neighboring nodes with a unique structure as unique. This way the reach of the measure is extended beyond the direct neighborhood.
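The cascading idea can be illustrated with a simplified sketch. It is one possible reading of the description above and not the algorithm of [12]: it assumes that a node becomes identifiable when its measure value, combined with the set of already identified neighbors, is unique among the nodes not yet identified. The function name, this signature rule and the toy graph are all assumptions.

```python
from collections import Counter


def cascade(adj, measure_values):
    """Return the set of nodes identified after cascading (simplified sketch)."""
    value_counts = Counter(measure_values.values())
    # Initial step: nodes whose measure value occurs exactly once.
    unique = {v for v, val in measure_values.items() if value_counts[val] == 1}
    changed = True
    while changed:
        changed = False
        # Assumed rule: signature = own value plus the identified neighbors.
        signature = {
            v: (measure_values[v], frozenset(adj[v] & unique))
            for v in adj if v not in unique
        }
        sig_counts = Counter(signature.values())
        newly = {
            v for v, sig in signature.items()
            if sig[1] and sig_counts[sig] == 1  # needs an identified neighbor
        }
        if newly:
            unique |= newly
            changed = True
    return unique


# Toy graph (illustrative only): path a-b-c-d-e with two extra leaves f, g on b.
adj = {
    "a": {"b"}, "b": {"a", "c", "f", "g"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": {"b"}, "g": {"b"},
}
degrees = {v: len(neighbors) for v, neighbors in adj.items()}
identified = cascade(adj, degrees)
```

With degree as the initial measure only `b` is unique at first; the cascade then reaches `c`, `d` and `e` in turn, while the three leaves attached to `b` remain indistinguishable from each other.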

4.3.6 Other structural knowledge

For completeness of the summary, we end with approaches that take into account different structural information, uncertainty and other graph properties. Several works introduced measures that account for other structural information than discussed thus far. We summarize four such methods: edge facts, hub fingerprints, a method based on centrality of nodes, and one based on twin nodes.

First, we look at approaches introduced in the work of Hay et al. [38, 27]. Besides vrq, it introduces two different scenarios, the first of which is based on knowledge about edge facts. This approach measures anonymity by counting how many candidates there are given that an attacker knows a number of edges surrounding the node. However, computing all possible subgraphs given a number of edges is computationally very expensive and would not scale to large graphs. At the same time, if the edges are limited to the $d$-neighborhood of the node, it can be at most as strict as the distance-parameterized structural neighborhood measures mentioned at the end of Section 4.3.2.

The second approach is the hub fingerprints approach. In this approach, it is assumed that the attacker knows the distance for each node to so-called hub nodes. These are nodes that have a central position in the network and often have a high degree. However, experiments showed that this knowledge helps to identify only few nodes. As this is not an effective attacker scenario, we chose to not include this measure in our comparison.

Third, the work of [40] introduces an attacker scenario where the attacker has additional structural information besides the neighborhood of a node. Namely, this work assumes the attacker can obtain information about the centrality of nodes and explicitly aims to protect against this scenario.

Fourth, as part of a different approach, [12] introduces the notion of twin-uniqueness. Under the assumption that the graph is shared, the attacker can identify so-called twin nodes, which are sets of nodes that are connected to the same nodes. When an attacker finds that all candidates for an entity of interest are twins, they can learn all structural information there is to know about the node representing the entity, including the connections. The notion of twin-uniqueness accounts for this phenomenon when computing anonymity and adds the constraint that for a node to be anonymous at least one equivalent node should not be a twin.

4.3.7 Uncertainty

The work in [75] builds upon the notion of $k$-degree anonymity [26, 68, 69, 70, 64, 43, 44] and assumes there might be uncertainty in the knowledge of the attacker. This is accounted for by deploying a so-called binning approach such that nodes with a similar degree, which are in the same bin, are also equivalent. Since this approach accounts for uncertainty, it can merge equivalence classes that are distinct according to degree, hence it is at most as strict as this measure.
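The effect of binning on equivalence classes can be sketched as follows. The fixed-width bins used here are an assumption made for illustration; the binning scheme in [75] may differ, and the function name and toy degrees are hypothetical.

```python
from collections import defaultdict


def binned_classes(degrees, bin_width):
    """Group nodes whose degrees fall into the same fixed-width bin."""
    classes = defaultdict(set)
    for node, deg in degrees.items():
        classes[deg // bin_width].add(node)  # assumed rule: bin index = degree // width
    return list(classes.values())


# Toy degrees (illustrative only).
degrees = {"a": 1, "b": 2, "c": 3, "d": 7}
exact = binned_classes(degrees, 1)   # bin width 1 reproduces plain degree classes
coarse = binned_classes(degrees, 4)  # wider bins merge nearby degrees
```

With a bin width of 1 every node in the toy example is unique, while a width of 4 merges `a`, `b` and `c` into one class. Binning only ever merges degree classes, which is why it can be at most as strict as degree.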

4.3.8 Network model extensions

Beyond undirected networks, there has been work that accounts for more network properties. Various works extend the notion of $k$-anonymity to include node labels [51, 72, 73, 76], edge labels [77] or edge weights [78]. Another work focuses on weighted dynamic networks with node labels [69]. Since all of these properties add more information, extending an anonymity measure to account for both network structure and such additional properties results in a situation in which the number of unique nodes found will always be higher, or in the best case equal. Hence, accounting for additional properties overall decreases the anonymity.

5 Theoretical comparison of k-anonymity network measures

In the previous section, we summarized the anonymity measures introduced in literature. These measures differ in how complete the structural information is that they take into account, and in their reach, i.e., how far from the target node this structural information is taken into account. In this section we aim to understand the most prominent measures for anonymity introduced in the literature. We do so by summarizing their reach and the structural information they take into account, allowing us to order them based on strictness.

| Measure | Abbreviation | $M(v,d)$ |
|---|---|---|
| Degree [26, 64, 43, 44] | degree | $degree(v)$ |
| Count [65] | count | $(\lvert V_{N_d(v)} \rvert, \lvert E_{N_d(v)} \rvert)$ |
| Degree distribution [65] | degdist | $\{degree(w) : w \in V_{N_d(v)}\}$ |
| $d$-$k$-Anonymity [66, 11, 45, 12] | $d$-$k$-anonymity | $\mathcal{C}(N_d(v))$ |
| Vertex Refinement Query [38, 27] | vrq | $\{degree(w) : w \in V,\ distance(v,w) \leq d\}$ |
| Hybrid [67] | hybrid | $(\mathcal{C}(N_d(v)),\ \{degree(w) : w \in V,\ distance(v,w) \leq d\})$ |

Table 3: The $k$-anonymity measures compared in this paper (left column), their abbreviation (center column) and formula for computing their value $M(v,d)$ given an input node $v$ and distance $d$ (right column).

In our comparison, we account for the measures included in Table 3. All measures included, except for degree, are parameterized by parameter $d$. This value determines up to what distance the structural information is taken into account from the target node. For these parameterized variants, the equivalence classes are computed recursively. Classes are split into new classes as the distance increases. Below, we briefly summarize each of the measures used. An equation and the abbreviation used throughout this manuscript for each measure can be found in Table 3.

  • degree: the number of connections a node has.

  • count: the number of nodes and edges in the $d$-neighborhood of the considered node.

  • degdist: the degree distribution of the $d$-neighborhood of the considered node. Note that this is a multiset.

  • $d$-$k$-anonymity: the exact structure of the $d$-neighborhood of the considered node. The structure is summarized by its canonical labeling.

  • vrq: the degree of all nodes at distance at most $d$ from the considered node. Note that this is a multiset.

  • hybrid: set containing the results of $d$-$k$-anonymity and vrq, i.e., the exact structure of the $d$-neighborhood of the considered node summarized by its canonical labeling and the degree of all nodes at distance at most $d$ from the considered node.
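The four measures that do not require a canonical labeling can be sketched directly from the formulas in Table 3. The code below is a minimal illustration with assumed function names; multisets are represented as sorted tuples, and the recursive refinement of equivalence classes is not modeled. The toy graph was chosen so that the values for target node `t` match those listed in Table 2; the graph shown in the table's figures may differ.

```python
from collections import deque


def build_adjacency(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def nodes_within(adj, v, d):
    """Nodes at distance at most d from v (breadth-first search)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] < d:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return set(dist)


def degree(adj, v):
    return len(adj[v])


def count(adj, v, d):
    """(number of nodes, number of edges) in the d-neighborhood of v."""
    nodes = nodes_within(adj, v, d)
    n_edges = sum(len(adj[u] & nodes) for u in nodes) // 2
    return (len(nodes), n_edges)


def degdist(adj, v, d):
    """Degrees counted inside the d-neighborhood only."""
    nodes = nodes_within(adj, v, d)
    return tuple(sorted(len(adj[u] & nodes) for u in nodes))


def vrq(adj, v, d):
    """Degrees in the full graph of all nodes within distance d of v."""
    return tuple(sorted(len(adj[u]) for u in nodes_within(adj, v, d)))


# Toy graph (assumed): triangle t-a-b, plus a-x, b-y, x-y and a far node z.
adj = build_adjacency([("t", "a"), ("t", "b"), ("a", "b"), ("a", "x"),
                       ("b", "y"), ("x", "y"), ("x", "z")])
```

The difference between degdist and vrq is visible at node `x`: inside the 2-neighborhood of `t` it has degree 2, whereas vrq also counts its edge to `z` at distance 3.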

For some measures nodes can only be equivalent if they are also equivalent for another measure. This is for example the case for count and degree. If the nodes are equivalent for count, they also need to be equivalent for degree. It follows that if the nodes are not equivalent for degree, they can not be equivalent for count. We define this notion as strictness in Definition 5. Overall, when a measure is more strict than another measure, more information should be taken into account compared to the less strict measure. As a result, more nodes may be distinguishable compared to when a more lenient measure is used. Hence, the use of the stricter measure results in more equivalence classes that are smaller in size, and thus lower overall network anonymity.

Definition 5 (Strictness)

Given two measures $M_1$ and $M_2$, we say that $M_2$ is more strict than $M_1$, denoted $M_1 \leq M_2$, if the following holds for each pair of nodes $v, w$:

  • $v \cong_{M_2} w \rightarrow v \cong_{M_1} w$
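On a single graph, the condition in Definition 5 can be checked by testing whether the partition induced by $M_2$ refines the one induced by $M_1$. The sketch below uses an assumed function name and hypothetical toy values. A passing check on one graph does not prove strictness in general, whereas a failing check yields a counterexample.

```python
def refines(values_m2, values_m1):
    """True if every pair of nodes equivalent under M2 is also equivalent
    under M1 on this graph, i.e. each M2 value maps to a single M1 value."""
    seen = {}
    for node, v2 in values_m2.items():
        v1 = values_m1[node]
        if seen.setdefault(v2, v1) != v1:
            return False
    return True


# Toy values (illustrative only): degree and count at d = 1 for four nodes.
degree_values = {"a": 2, "b": 2, "c": 3, "d": 3}
count_values = {"a": (3, 3), "b": (3, 2), "c": (4, 4), "d": (4, 4)}
```

Here nodes equal under count are also equal under degree, but `a` and `b` share a degree while differing in count, so the implication only holds in one direction.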

We can use this notion to order the measures for $k$-anonymity. First, we look at the neighborhood based measures, being degree, count, degdist and $d$-$k$-anonymity. As illustrated in the second and fourth column of Table 2, these measures all have the same reach, namely the $d$-neighborhood of the node. However, the measures do differ in the extent to which they capture particular structural properties. Theorem 1 defines how these measures relate to each other based on strictness. We prove this ordering holds by showing that the stricter measures capture properties of the less strict measures. The ordering resulting from Theorem 1 is visualized in Figure 1.

Figure 1: Ordering of anonymity measures based on strictness. An arrow between two measures ($A \rightarrow B$) implies measure $A$ is more strict than measure $B$ ($B \leq A$).
Theorem 1

For $d \geq 1$, the following holds:

  • degree $\leq$ count$(d)$ $\leq$ degdist$(d)$ $\leq$ $d$-$k$-anonymity $\leq$ hybrid$(d)$

  • vrq$(d)$ $\leq$ hybrid$(d)$

For $d \geq 2$:

  • vrq$(d-1)$ $\leq$ $d$-$k$-anonymity.

  • vrq$(d-1)$ $\leq$ vrq$(d)$.

Proof 1

Given a graph $G = (V,E)$ and nodes $v, w$:

  • degree $\leq$ count. Since the degree of the target node equals the number of nodes in the 1-neighborhood minus one, the degree can be derived from count at $d=1$. As a result, two nodes equivalent according to count are equivalent for degree, ensuring the requirement: $v \cong_{\textsc{count}} w \rightarrow v \cong_{\textsc{degree}} w$.

  • count $\leq$ degdist. Both the number of nodes and edges in the $d$-neighborhood can be derived from degdist. The number of nodes equals the number of degree values accounted for, the number of edges equals the sum of the degrees divided by two. Hence, $v \cong_{\textsc{degdist}} w \rightarrow v \cong_{\textsc{count}} w$.

  • degdist $\leq$ $d$-$k$-anonymity. When two nodes are equivalent for $d$-$k$-anonymity, the $d$-neighborhoods need to be isomorphic. This implies that each node in $V_{N_d(v)}$ should be mapped onto a node in $V_{N_d(w)}$ such that the isomorphism property is satisfied. As a result each node should be mapped onto a node with the same degree, otherwise the isomorphism can not be valid. Hence we can conclude $v \cong_{\textsc{$d$-$k$-anonymity}} w \rightarrow v \cong_{\textsc{degdist}} w$.

  • vrq $\leq$ hybrid and $d$-$k$-anonymity $\leq$ hybrid. This holds because hybrid captures both vrq and $d$-$k$-anonymity.

  • vrq$(d-1)$ $\leq$ $d$-$k$-anonymity, with $d \geq 2$. As the $d$-neighborhood of a node contains all nodes at distance $d-1$ and all edges attached to these nodes, the degree distributions of these nodes must be equal in order for the $d$-neighborhoods to be isomorphic. Hence $v \cong_{\textsc{$d$-$k$-anonymity}} w \rightarrow v \cong_{\textsc{vrq}(d-1)} w$.

  • vrq$(d-1)$ $\leq$ vrq$(d)$ with $d \geq 2$ holds because the measure is computed recursively.
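As a worked check of the count $\leq$ degdist step, the degdist values listed in Table 2 can be plugged into the two derivations used in the proof: the number of nodes is the size of the multiset, and the number of edges is half the sum of its entries.

```latex
\begin{align*}
|V_{N_d(v)}| &= \bigl|\textsc{degdist}(v,d)\bigr|, &
|E_{N_d(v)}| &= \tfrac{1}{2} \sum_{x \in \textsc{degdist}(v,d)} x \\[4pt]
d = 1:\quad \{2,2,2\} &\;\Rightarrow\; |V_{N_1(v)}| = 3, &
|E_{N_1(v)}| &= \tfrac{1}{2}(2+2+2) = 3 \\
d = 2:\quad \{2,2,2,3,3\} &\;\Rightarrow\; |V_{N_2(v)}| = 5, &
|E_{N_2(v)}| &= \tfrac{1}{2}(2+2+2+3+3) = 6
\end{align*}
```

Both results agree with the count values $(3, 3)$ and $(5, 6)$ reported for the same node in Table 2.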

It is not possible to include vrq elsewhere in the ordering of the neighborhood based measures. As illustrated in Table 2, the reach of vrq is slightly further than the $d$-neighborhood, as it accounts for the existence of connections to nodes at distance $d+1$. At the same time, it is less complete than other measures such as count. It is, however, more strict than degree, as the number of degrees counted by vrq at distance 1 equals the degree of the node. In Theorem 2 we prove that we can not place this measure in the order of the neighborhood based measures when using the same value for $d$.

Theorem 2

count$(d)$ $\nleq$ vrq$(d)$, vrq$(d)$ $\nleq$ count$(d)$ and vrq$(d)$ $\nleq$ count$(d+1)$.

Figure 2: Counterexamples accompanying the proof for Theorem 2. This figure shows three examples: 1) an example where vrq is able to distinguish between two nodes and count is not (left), 2) one where count is able to distinguish between two nodes but vrq is not (middle) and 3) an example where vrq is able to distinguish between two nodes but count$(d+1)$ is not (right).
Proof 2

We prove this theorem by use of the counterexamples in Figure 2. The figure shows one example where vrq is able to distinguish between two nodes and count is not, one showing the converse where count is able to distinguish between the nodes and vrq is not, and one where vrq$(d)$ is able to distinguish between two nodes and count$(d+1)$ is not. Hence, we can not conclude that one measure is more strict than the other.

6 Empirical comparison of k-anonymity measures

In the upcoming sections, we empirically compare the anonymity measures in Table 3 on a wide range of network datasets obtained from various online repositories listed in Table 4. First, we summarize the experimental setup. Then we compare uniqueness on the datasets and thereafter the runtime required to compute anonymity for each measure. Lastly, we compare uniqueness of anonymity-cascade.

6.1 Experimental setup for empirical comparison

First, we investigate anonymity in terms of uniqueness, i.e., the fraction of nodes that are 1-anonymous using various measures. We look at results for both $d=1$ and $d=2$, as previous work found a relatively large increase in uniqueness for $d$-$k$-anonymity with $d=2$ and a relatively small increase for $d>2$ [12]. To compute uniqueness, we use the same computational approach as described in [65]. Second, for the runtime of the different measures, we report results averaged over 5 runs with a time limit of 3 hours. Runs longer than 3 hours are not reported. The code used can be found in the following repository: https://github.com/RacheldeJong/ANONET.

| Network | Nodes | Edges | Average degree | Median degree | Max degree | Clustering coefficient | Assortativity | Diameter | Average distance | Alpha |
|---|---|---|---|---|---|---|---|---|---|---|
| Radoslaw emails [79] | 167 | 3,250 | 38.92 | 40 | 139 | 0.69 | -0.30 | 5 | 1.97 | 4.61 |
| Primary school [80] | 236 | 5,899 | 49.99 | 49 | 98 | 0.50 | 0.17 | 3 | 1.86 | 9.08 |
| Moreno innov. [79] | 241 | 923 | 7.66 | 7 | 28 | 0.31 | -0.06 | 5 | 2.47 | 4.62 |
| Gene fusion [79] | 291 | 279 | 1.92 | 1 | 34 | 0.00 | -0.35 | 9 | 3.90 | 2.50 |
| Copnet calls [81] | 536 | 621 | 2.32 | 2 | 18 | 0.25 | 0.17 | 22 | 7.37 | 3.82 |
| Copnet sms [81] | 568 | 697 | 2.45 | 2 | 11 | 0.22 | 0.19 | 20 | 7.32 | 3.90 |
| Copnet FB [81] | 800 | 6,418 | 16.05 | 13 | 101 | 0.32 | 0.18 | 7 | 2.98 | 3.21 |
| FB Reed98 [82] | 962 | 18,812 | 39.11 | 29 | 313 | 0.33 | 0.02 | 6 | 2.46 | 4.38 |
| Arenas email [79] | 1,133 | 5,451 | 9.62 | 7 | 71 | 0.25 | 0.08 | 8 | 3.61 | 6.78 |
| Network science [79] | 1,461 | 2,742 | 3.75 | 3 | 34 | 0.88 | 0.46 | 17 | 5.82 | 3.61 |
| FB Simmons81 [82] | 1,518 | 32,988 | 43.46 | 37 | 300 | 0.33 | -0.06 | 7 | 2.57 | 4.74 |
| DNC emails [79] | 1,893 | 4,385 | 4.63 | 1 | 402 | 0.59 | -0.31 | 8 | 3.37 | 2.01 |
| Moreno health [79] | 2,539 | 10,455 | 8.24 | 8 | 27 | 0.15 | 0.25 | 10 | 4.56 | 8.24 |
| FB Wellesley22 [82] | 2,970 | 94,899 | 63.91 | 52 | 746 | 0.27 | 0.06 | 8 | 2.59 | 4.60 |
| Bitcoin alpha [82] | 3,783 | 14,124 | 7.47 | 2 | 511 | 0.28 | -0.17 | 10 | 3.57 | 2.09 |
| GRQC collab. [83] | 5,242 | 14,484 | 5.53 | 3 | 81 | 0.69 | 0.66 | 17 | 6.05 | 2.11 |
| FB Carnegie49 [82] | 6,637 | 249,967 | 75.33 | 54 | 840 | 0.29 | 0.12 | 8 | 2.74 | 4.98 |
| Pajek Erdős [79] | 6,927 | 11,850 | 3.42 | 1 | 507 | 0.40 | -0.12 | 4 | 3.78 | 2.16 |
| DT interaction [84] | 7,341 | 15,138 | 4.12 | 1 | 584 | 0.00 | -0.12 | 18 | 6.15 | 1.88 |
| DG assoc. [84] | 7,813 | 21,357 | 5.47 | 2 | 485 | 0.00 | -0.29 | 8 | 4.23 | 1.94 |
| FB GWU54 [82] | 12,193 | 469,528 | 77.02 | 60 | 2,002 | 0.22 | 0.03 | 9 | 2.83 | 4.78 |
| Anybeat [82] | 12,645 | 49,132 | 7.77 | 2 | 4,800 | 0.40 | -0.12 | 10 | 3.17 | 1.75 |
| CE-CX [82] | 15,229 | 245,952 | 32.30 | 13 | 375 | 0.23 | 0.34 | 13 | 3.85 | 4.02 |
| Astro Physics [82] | 18,771 | 198,050 | 21.10 | 9 | 504 | 0.68 | 0.21 | 14 | 4.19 | 4.50 |
| FB BU10 [82] | 19,700 | 637,528 | 64.72 | 51 | 1,819 | 0.20 | 0.05 | 9 | 3.03 | 5.44 |
| FB Uillinois [82] | 30,664 | 1,048,574 | 68.39 | 50 | 2,718 | 0.20 | 0.05 | 9 | 3.08 | 5.39 |
| Enron email [83] | 36,692 | 183,831 | 10.02 | 3 | 1,383 | 0.72 | -0.11 | 13 | 4.03 | 1.97 |
| FB Penn94 [82] | 41,536 | 1,362,220 | 65.59 | 48 | 4,410 | 0.22 | 0.00 | 8 | 3.12 | 4.16 |
| FB wall 2009 [79] | 46,952 | 183,412 | 7.81 | 4 | 223 | 0.15 | 0.22 | 18 | 5.60 | 5.24 |
| Brightkite [82] | 58,228 | 214,078 | 7.35 | 2 | 1,134 | 0.27 | 0.01 | 18 | 4.92 | 2.48 |
| The marker cafe [85] | 69,413 | 1,644,843 | 47.39 | 6 | 8,930 | 0.24 | -0.15 | 9 | 3.06 | 2.86 |
| Slashdot zoo [79] | 79,116 | 467,731 | 11.82 | 2 | 2,534 | 0.09 | -0.07 | 12 | 4.04 | 3.46 |

Table 4: Real-world network datasets used in the experiments. For each network, we list the number of nodes, edges, average degree, median degree, largest degree, clustering coefficient, assortativity, diameter, average distance and alpha value.

Third, we measure the uniqueness when using a more generic version of anonymity-cascade [12]. For this generic version, we distinguish between the initial measure and cascading measure. We report results on all 36 combinations that can be made using the six measures in Table 3.

All experiments are conducted on a machine with 1 TB RAM, 64 AMD EPYC 7601 cores, and 128 threads. During the experiments, each run uses one thread that is not shared with other processes.

6.2 Comparing uniqueness

From Theorem 1, we know that some of the considered measures are stricter than others. Hence, uniqueness is expected to be higher for stricter measures and lower for less strict ones. Figure 3 shows how uniqueness varies in empirical networks by reporting this value for each measure at both $d=1$ and $d=2$. In Appendix A, we report for each measure the difference with the previous measure in the legend of Figure 3. To better understand these differences, we compare each of them with various topological network properties; the results can be found in Table 5. For each combination of difference and network property, this table shows the Pearson correlation and $p$-value. Plots of the values for all significantly correlated combinations can be found in Appendix B. We use these findings in the discussion of the results.

Figure 3: Uniqueness in empirical networks, sorted by size from left to right. Fraction of unique nodes (vertical axis) on different datasets (horizontal axis) using different measures for $k$-anonymity: degree (light blue), count (blue), degdist (dark blue), $d$-$k$-anonymity (black line with triangle), vrq (red) and hybrid (pink).
Network property | count vs. degree | degdist vs. count | $d$-$k$-anonymity vs. degdist | vrq vs. $d$-$k$-anonymity | hybrid vs. vrq
(each comparison column reports two values: Pearson correlation, p-value)
Nodes -0.216 2.52E-01 0.184 3.30E-01 0.223 2.36E-01 0.275 1.42E-01 0.002 9.92E-01
Edges 0.223 2.37E-01 0.661 7.11E-05 0.083 6.63E-01 -0.223 2.37E-01 -0.223 2.36E-01
Average degree 0.118 5.33E-01 0.267 1.54E-01 -0.065 7.33E-01 -0.058 7.62E-01 -0.050 7.94E-01
Median degree 0.898 1.69E-11 0.437 1.58E-02 -0.163 3.91E-01 -0.651 9.94E-05 -0.426 1.91E-02
Max degree 0.587 6.56E-04 0.294 1.14E-01 -0.161 3.95E-01 -0.416 2.23E-02 -0.324 8.06E-02
Alpha 0.536 2.25E-03 0.354 5.52E-02 0.419 2.12E-02 -0.074 6.97E-01 -0.046 8.08E-01
Density 0.528 2.68E-03 -0.238 2.05E-01 -0.186 3.25E-01 -0.324 8.10E-02 -0.196 2.98E-01
Transitivity 0.081 6.70E-01 -0.193 3.07E-01 -0.228 2.25E-01 -0.303 1.03E-01 -0.090 6.35E-01
Assortativity 0.018 9.23E-01 0.194 3.04E-01 0.254 1.76E-01 -0.005 9.79E-01 0.343 6.38E-02
Diameter -0.601 4.50E-04 -0.273 1.45E-01 0.030 8.75E-01 0.273 1.44E-01 0.680 3.53E-05
Average distance -0.748 2.01E-06 -0.381 3.78E-02 0.024 8.98E-01 0.355 5.44E-02 0.757 1.27E-06
Table 5: Correlation between network properties (leftmost column) and the difference observed between the use of two measures (second up to last column). For each combination of network property and outcome, both the Pearson correlation and the $p$-value are reported. Values with $p<0.05$ and a Pearson correlation larger than 0.4 or smaller than -0.4 are deemed significant and shown in bold.
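The significance criterion in the caption of Table 5 can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names `pearson_r` and `is_significant` are hypothetical, the listed values are a few entries copied from Table 5, and the $p$-values are taken from the table rather than recomputed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def is_significant(r, p, r_min=0.4, p_max=0.05):
    """Criterion from the Table 5 caption: p < 0.05 and |r| > 0.4."""
    return p < p_max and abs(r) > r_min

# A few (network property, comparison) entries copied from Table 5,
# stored as (Pearson correlation, p-value).
table5_entries = {
    ("Median degree", "count vs. degree"): (0.898, 1.69e-11),
    ("Edges", "degdist vs. count"): (0.661, 7.11e-05),
    ("Average distance", "degdist vs. count"): (-0.381, 3.78e-02),
    ("Nodes", "hybrid vs. vrq"): (0.002, 9.92e-01),
    ("Average distance", "hybrid vs. vrq"): (0.757, 1.27e-06),
}

flags = {key: is_significant(r, p) for key, (r, p) in table5_entries.items()}

# Toy input (not data from the paper) to exercise pearson_r on a perfectly linear relation.
toy_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

Note that an entry such as average distance for degdist vs. count has $p<0.05$ but does not pass the threshold on the correlation itself, so both conditions matter.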

First, let us focus on the results for $d=1$ shown in the top of Figure 3. We observe that degree is not very effective for identifying nodes: in all networks, only a tiny fraction of nodes is unique. However, when moving to count, which additionally accounts for the number of edges in the neighborhood, the uniqueness increases drastically for many networks, sometimes showing results similar to those of $1$-$k$-anonymity. The difference is especially large in networks with a higher degree, a less diverse degree distribution (indicated by a higher alpha value), and a lower diameter or average distance. Uniqueness obtained by degdist is in many cases equal to that of $1$-$k$-anonymity, and otherwise very similar. For networks with a high alpha, the difference is larger. This shows that count and degdist appear to be useful for approximating $d$-$k$-anonymity for the networks included. For these, we find that the difference between the two measures is often larger for graphs with a high number of edges, a higher median degree, or a lower average distance. For the networks with a larger difference, degdist would better approximate $1$-$k$-anonymity.

The vrq measure, which could not be placed in our strictness ordering of neighborhood-based measures, achieves higher uniqueness than $1$-$k$-anonymity. This very clearly indicates that having knowledge beyond the $1$-neighborhood of a target node can have a de-anonymizing effect: it is in many cases even more effective than perfect knowledge of the $1$-neighborhood. We find that the difference is often larger for networks with a high diameter, a high average distance, or a lower median degree. When looking at hybrid, hence adding knowledge of the $1$-neighborhood structure, we see a slight increase in uniqueness for only a few networks. This implies that the sets of nodes that can be identified with the measures tend to overlap largely, and that this additional knowledge of the neighborhood structure only has a small effect on uniqueness.

Turning to the results for $d=2$ at the bottom of Figure 3, we see a very large increase in uniqueness compared to the results for $d=1$. Here the results obtained by degdist are always the same as those of $d$-$k$-anonymity. The count measure achieves similar results to both. The vrq measure always achieves a uniqueness larger than or equal to that of $d$-$k$-anonymity. The generally small difference between vrq and degdist is consistent with the finding in [12] that looking further than the $2$-neighborhood does not have a large effect on uniqueness. Moving to hybrid, we see that this never results in a higher uniqueness when considering the $2$-neighborhood.

Overall, these results demonstrate that having imprecise information, even as little as the number of connections of nodes at distances 1 up to $d$, can be sufficient to uniquely identify a large fraction of nodes. Moreover, this knowledge is more revealing than accounting for complete knowledge using a smaller radius. One may not even need complete structural information to identify many nodes, as it appears that having complete structural information often has a very limited additional effect on uniqueness.
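To make the notion of uniqueness concrete, the following minimal sketch computes the fraction of unique nodes on a toy graph for two signatures. It rests on several assumptions: the graph is a plain adjacency dictionary, `neighbor_degree_signature` is a hypothetical stand-in for a vrq-like measure at $d=1$ (the sorted degrees of the direct neighbors), and neither function reproduces the exact definitions or the implementation used in the experiments.

```python
from collections import Counter

# Toy undirected graph (not one of the datasets in Table 4), as an adjacency dictionary.
adj = {
    0: {1},
    1: {0, 2, 5},
    2: {1, 3},
    3: {2, 4},
    4: {3},
    5: {1},
}

def degree_signature(adj, v):
    """Attacker only knows the degree of the target node."""
    return len(adj[v])

def neighbor_degree_signature(adj, v):
    """Assumed vrq-like knowledge for d=1: the sorted degrees of the direct neighbors."""
    return tuple(sorted(len(adj[u]) for u in adj[v]))

def uniqueness(adj, signature):
    """Fraction of nodes whose signature is shared with no other node."""
    # Part 1: compute the value of the measure for every node.
    values = {v: signature(adj, v) for v in adj}
    # Part 2: book-keeping, i.e. the size of each equivalence class.
    class_sizes = Counter(values.values())
    unique = [v for v in adj if class_sizes[values[v]] == 1]
    return len(unique) / len(adj)

u_degree = uniqueness(adj, degree_signature)
u_reach = uniqueness(adj, neighbor_degree_signature)
```

On this toy graph only node 1 is unique by degree, whereas the neighbor-degree signature separates every node except the two structurally interchangeable leaves 0 and 5, which no structural measure can tell apart.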

Figure 4: Runtime in seconds (vertical axis) on different empirical network datasets (horizontal axis) when accounting for different measures for $k$-anonymity: degree (light blue circle), count (blue square), degdist (dark blue circle), $d$-$k$-anonymity (black triangle), vrq (red circle) and hybrid (pink square).

6.3 Comparing runtimes

As discussed in Section 2.3, a second important aspect of choosing a measure is whether it is feasible to compute for a network of a given size. It is worth noting that computing anonymity consists of two parts: computing the value for a node using the chosen measure, and book-keeping operations that determine, based on the computed values, the right equivalence class for each node. In Figure 4, we show the runtime for each of the measures on the set of considered real-world networks. We first focus on the results for $d=1$ in the top figure.

Overall, degree and vrq are the least expensive to compute. An explanation for this is that for all other measures the $d$-neighborhood of the target node needs to be extracted, which is relatively time-consuming. Instead, for degree merely the degree of the considered node needs to be determined, and for vrq one only needs to iterate over the nodes at a certain distance and determine their degree. For the neighborhood-based measures, computing the degree distribution as required for degdist, or determining the canonical labeling as required for $d$-$k$-anonymity and hybrid, can consume a noticeable share of the total runtime. For most networks count is less time-consuming than the other neighborhood-based measures, but for a few networks, such as “Radoslaw emails” and “FB Wellesley22”, more time appears to be consumed by book-keeping operations. As a result, count achieves runtimes similar to or higher than those of other measures, including $d$-$k$-anonymity and hybrid.

When moving to the results for $d=2$, shown in the bottom of Figure 4, we see that the runtimes are more diverse, up to 15 minutes for some of the denser networks. For several large networks, including “The marker cafe” and “Anybeat”, the values could not be computed for all measures within the set time limit. Here we see again that for some networks the runtime is much larger for count. This is likely due to the lower uniqueness achieved by this measure for $d=1$. As a result, the value has to be determined for more nodes, and more book-keeping operations are required to obtain the equivalence partition for $d=2$. It appears that hybrid is often the most time-consuming measure, as it combines vrq with $d$-$k$-anonymity.

Figure 5: Illustration of anonymity-cascade. The red node marked with a triangle denotes a unique node identified with the initial measure. From this node the cascade process is started using the cascade measure. The numbers denote the level of the cascading process in which the node is identified. Blue nodes are identified as part of the cascading process.

6.4 Anonymity-cascade

The results from the previous section showed that measures that reach further achieve a much higher uniqueness. In this section, we investigate the effect on uniqueness when using a different method to increase the reach even further, namely by cascading the information of unique nodes to identify more nodes. To do so, we measure uniqueness using a generic variant of anonymity-cascade [12], which is explained in Section 4.3.5. The algorithm consists of two steps. First, we identify nodes using the initial measure. Second, we use the cascade measure to identify more nodes in a cascading fashion. This is also illustrated in Figure 5. The cascading process can continue for multiple levels. If only one level of cascading is used, we refer to this as $C_1$. The process can then be repeated until no more nodes can be identified, which we refer to as cascading final, $C_f$.
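The two steps can be sketched as follows. This is only one possible reading of the cascading step, stated here as an assumption: a node is taken to be identified at a given level if it neighbors a node identified at the previous level and no other neighbor of that node has the same value under the cascade measure. The precise definition is the one given in Section 4.3.5 and [12]; the function names, the toy graph and the use of degree for both measures are illustrative choices.

```python
from collections import Counter

# Toy undirected graph (not one of the datasets in Table 4), as an adjacency dictionary.
adj = {
    0: {1},
    1: {0, 2, 5},
    2: {1, 3},
    3: {2, 4},
    4: {3},
    5: {1},
}

def degree(adj, v):
    """Simplest measure: the degree of the node."""
    return len(adj[v])

def unique_nodes(adj, measure):
    """Step 1: nodes whose value under the initial measure is shared with no other node."""
    values = {v: measure(adj, v) for v in adj}
    sizes = Counter(values.values())
    return {v for v in adj if sizes[values[v]] == 1}

def anonymity_cascade(adj, initial, cascade, max_levels=None):
    """Return {node: level at which it is identified}; level 0 is the initial measure.

    max_levels=1 corresponds to C_1, max_levels=None to cascading final (C_f).
    """
    cascade_value = {v: cascade(adj, v) for v in adj}
    level_of = {v: 0 for v in unique_nodes(adj, initial)}
    frontier = set(level_of)
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        level += 1
        newly = set()
        for u in frontier:
            # Assumed rule: a neighbor of an identified node is identified if its
            # cascade value is unique among all neighbors of that node.
            counts = Counter(cascade_value[w] for w in adj[u])
            for w in adj[u]:
                if w not in level_of and counts[cascade_value[w]] == 1:
                    newly.add(w)
        for w in newly:
            level_of[w] = level
        frontier = newly
    return level_of

c1 = anonymity_cascade(adj, degree, degree, max_levels=1)
cf = anonymity_cascade(adj, degree, degree)
```

On the toy graph, degree alone identifies only node 1; one level of cascading adds node 2, and cascading until no more nodes can be identified also reaches nodes 3 and 4, while the interchangeable leaves 0 and 5 remain anonymous.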

The combinations of initial and cascade measure consist of all 36 combinations of the six measures in Table 3. This allows us to investigate all combinations of attacker knowledge. Besides the cases where the levels of knowledge are equal, this also includes the cases where an attacker has little starting knowledge and more local knowledge surrounding the identified nodes, hence more cascade knowledge. We also include scenarios where an attacker has a high level of knowledge on some nodes, which could for example be the high-degree nodes, but little knowledge on the surrounding nodes. While we fully acknowledge that not every combination of measures is equally realistic, we chose to include all of them to observe the cascading effect of further reach on the uniqueness.

Figure 6: Uniqueness using one level of anonymity-cascade. Each figure corresponds to a different initial measure (triangle and line) and is combined with all of the cascade measures (dots). The grey bar indicates the highest obtained uniqueness for the network given the initial measure. Measures used are: degree (light blue), count (blue), degdist (dark blue), $d$-$k$-anonymity (black triangle), vrq (red) and hybrid (pink).

We focus on one level of cascading, as this is a more realistic scenario. The results for cascading final can be found in Appendix C. Figure 6 shows the uniqueness for anonymity-cascade with one level of cascading. Each subfigure shows results for a different initial measure, denoted by the line with triangle, and results of one level of cascading, denoted by the colored dots using each of the six measures as cascading measure. The grey bar is added for readability of the results and shows the highest uniqueness obtained.

First, turning to the results for degree, we see a very low initial uniqueness. However, one level of cascading overall has a large impact on uniqueness. When using degree as cascading measure, this impact is smaller in several cases. In most cases using vrq as cascading measure has the largest impact on the uniqueness.

The results for degree, however, still differ considerably from the results for the other measures. Results show that count, degdist and $d$-$k$-anonymity achieve similar results, where using vrq as cascade measure does increase uniqueness. With vrq as initial measure, we observe both a higher initial uniqueness and a higher uniqueness after cascading. Turning to the results for cascading final in Appendix C, we see a large difference for degree and count. For degdist and $d$-$k$-anonymity there is some visible increase, while for vrq there is no large increase compared to one level of cascading.

Overall, these results show, similar to the results of Section 6.2, that by reaching further, with one or more cascading steps, the uniqueness increases immensely. This effect is especially clear when using degree as initial measure.

7 Conclusions

In this paper, we have investigated the problem of privacy-aware sharing of network data. First, we introduced the aspects of the problem that a user should take into account when choosing an approach for achieving anonymity in networks, namely data utility, privacy and runtime. Then, we provided an overview of the methods introduced for achieving privacy in networks and related these to the different aspects. The approaches can be grouped based on the type of output they produce: an interactive model, synthetic data, an intermediate representation or a perturbed version of the network. Finally, as a particularly suitable approach for privacy-aware sharing of network data, we focused on $k$-anonymity, in which the choice of measure models the attacker scenario to protect against and hence is important for a user to take into account. Therefore, we provided an overview of the different measures introduced in the literature, resulting in a categorization into degree-based, neighborhood-based and automorphism-based measures. Next, we compared the selected $k$-anonymity measures theoretically, showing how they can be distinguished based on how far they reach and on the completeness of the structural information they take into account. This resulted in an ordering of the measures based on strictness.

To better understand how the different measures impact anonymity, we compared the uniqueness obtained by different measures on a wide range of empirical networks. The degree measure often finds only a few unique nodes and proves to be a very ineffective approach for identifying nodes. If an attacker has additional information, such as the degree of the directly neighboring nodes (vrq), or the number of triangles in the 1-neighborhood (count), they are likely able to uniquely identify many more nodes, often close to the number that can be identified by $d$-$k$-anonymity. Most importantly, the results showed that it is more de-anonymizing to have incomplete information that reaches further, beyond the direct neighborhood of the considered node, than complete information concerning its direct neighborhood. Moreover, by means of experiments using the hybrid measure, we showed that having more complete information often has only a small additional effect on the uniqueness if the reach remains the same. At the same time, these less complete measures, even if they reach further, require less time to compute anonymity. Hence, when choosing a measure for anonymity, it would be better to choose a measure that reaches beyond the direct neighborhood (assuming that this is a realistic attacker scenario), rather than a measure that accounts for structural information that is as complete as possible. Lastly, we investigated the effect of extending the reach of the anonymity measure by reusing unique nodes in a cascading fashion to identify more nodes. Just one level of cascading has an immense impact on the anonymity. This effect is especially large when using degree as starting measure, which by itself identifies very few nodes.

There are still many possibilities for future work. First, most measures for anonymity so far focus on the direct neighborhood. At the same time, our results on measuring anonymity show the necessity to look beyond the 1-neighborhood. A crucial next step would be to design anonymization algorithms accounting for measures that look beyond the 1-neighborhood of the node. Second, while this manuscript described how different measures affect certain aspects, some aspects are heavily influenced by the chosen anonymization algorithm. Therefore, another direction for future work, complementing this work on anonymity measurement, would be to make a systematic overview of existing anonymization methods and how they affect runtime and utility.

Acknowledgments

This research was made possible by the Platform Digital Infrastructure SSH (http://www.pdi-ssh.nl). We would also like to thank the POPNET team (https://www.popnet.io) and the Leiden CNS group (https://www.computationalnetworkscience.org) for various helpful suggestions and discussions.

References

  • [1] Akrati Saxena and Sudarshan Iyengar. Centrality measures in complex networks: A survey. arXiv preprint arXiv:2011.07190, 2020.
  • [2] Jure Leskovec, Kevin J. Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, page 631–640, 2010.
  • [3] Monowar H Bhuyan, Dhruba Kumar Bhattacharyya, and Jugal K Kalita. Network anomaly detection: methods, systems and tools. IEEE Communications Surveys & Tutorials, 16(1):303–336, 2013.
  • [4] Asma Azizi, Cesar Montalvo, Baltazar Espinoza, Yun Kang, and Carlos Castillo-Chavez. Epidemics on networks: Reducing disease transmission using health emergency declarations and peer communication. Infectious Disease Modelling, 5:12–22, 2020.
  • [5] Yuliia Kazmina, Eelke M. Heemskerk, Eszter Bokanyi, and Frank W. Takes. Socio-economic segregation in a population-scale social network. 2023.
  • [6] Michał Bojanowski and Rense Corten. Measuring segregation in social networks. Social Networks, 39:14–32, 2014.
  • [7] David Savage, Xiuzhen Zhang, Xinghuo Yu, Pauline Chou, and Qingmai Wang. Anomaly detection in online social networks. Social networks, 39:62–70, 2014.
  • [8] Eszter Bokányi, Eelke M Heemskerk, and Frank W Takes. The anatomy of a population-scale social network. Scientific Reports, 13(1):9209, 2023.
  • [9] Jan Van der Laan. A person network of the Netherlands. Discussion paper, Statistics Netherlands, The Hague, April 2022. https://www.cbs.nl/en-gb/background/2022/20/a-person-network-of-the-netherlands.
  • [10] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x? anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, page 181–190, 2007.
  • [11] Daniele Romanini, Sune Lehmann, and Mikko Kivelä. Privacy and uniqueness of neighborhoods in social networks. Scientific Reports, 11(1):20104, 2021.
  • [12] Rachel G. de Jong, Mark P. J. van der Loo, and Frank W. Takes. The effect of distant connections on node anonymity in complex networks. Scientific Reports, 14(1):1156, 2024.
  • [13] Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter-Paul De Wolf. Statistical Disclosure Control, volume 2. Wiley New York, 2012.
  • [14] Leon Willenborg and Ton de Waal. Elements of statistical disclosure control, volume 155. Springer Science & Business Media, 2001.
  • [15] Cynthia Dwork. Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation, pages 1–19, Berlin, Heidelberg, 2008. Springer.
  • [16] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284, Berlin, Heidelberg, 2006.
  • [17] Latanya Sweeney. k-anonymity: A model for protecting privacy. International journal of uncertainty, fuzziness and knowledge-based systems, 10(05):557–570, 2002.
  • [18] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.
  • [19] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pages 106–115, April 2007. ISSN: 2375-026X.
  • [20] Raymond Chi-Wing Wong, Jiuyong Li, Ada Wai-Chee Fu, and Ke Wang. ($\alpha$, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 754–759, 2006.
  • [21] Jörg Drechsler. Synthetic datasets for statistical disclosure control: theory and implementation, volume 201. Springer Science & Business Media, New York, NY, 2011.
  • [22] Xiaokui Xiao and Yufei Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases, page 139–150. VLDB Endowment, 2006.
  • [23] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. Sharing graphs using differentially private graph models. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, page 81–98. Association for Computing Machinery, 2011.
  • [24] Davide Proserpio, Sharon Goldberg, and Frank McSherry. Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets. Proceedings of the VLDB Endowment, 7(8):637–648, 2014.
  • [25] Yue Wang and Xintao Wu. Preserving differential privacy in degree-correlation based graph generation. Transactions on data privacy, 6(2):127, 2013.
  • [26] Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, page 93–106, 2008.
  • [27] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. Resisting structural re-identification in anonymized social networks. volume 1, page 102–114. VLDB Endowment, aug 2008.
  • [28] Lei Zou, Lei Chen, and M. Tamer Özsu. K-automorphism: a general framework for privacy preserving network publication. In Proceedings of the 35th VLDB Endowment, volume 2, pages 946–957. VLDB Endowment, 2009.
  • [29] Xiaowei Ying and Xintao Wu. Randomizing social networks: a spectrum preserving approach. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 739–750, 2008.
  • [30] Yushan Liu, Shouling Ji, and Prateek Mittal. Smartwalk: Enhancing social network security via adaptive random walks. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, page 492–503, 2016.
  • [31] Prateek Mittal, Charalampos Papamanthou, and Dawn Song. Preserving link privacy in social network based systems. arXiv preprint arXiv:1208.6189, 2012.
  • [32] Alina Campan and Traian Marius Truta. Data and structural k-anonymity in social networks. In Privacy, Security, and Trust in KDD, pages 33–54, Berlin, Heidelberg, 2009. Springer.
  • [33] Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, and Divesh Srivastava. Class-based graph anonymization for social network data. volume 2, page 766–777. VLDB Endowment, aug 2009.
  • [34] Changchang Liu and Prateek Mittal. Linkmirage: Enabling privacy-preserving analytics on social relationships. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium, 2016.
  • [35] Giorgia Minello, Luca Rossi, and Andrea Torsello. k-anonymity on graphs using the szemerédi regularity lemma. IEEE Transactions on Network Science and Engineering, 8(2):1283–1292, 2020.
  • [36] Navid Yazdanjue, Mohammad Fathian, and Babak Amiri. Evolutionary algorithms for k-anonymity in social networks based on clustering approach. The Computer Journal, 63(7):1039–1062, 2020.
  • [37] Roy Ford, Traian Marius Truta, and Alina Campan. P-sensitive k-anonymity for social networks. Proceedings of the 5th International Conference on Data Mining, 9:403–409, 2009.
  • [38] Michael Hay, Gerome Miklau, David Jensen, Philipp Weis, and Siddharth Srivastava. Anonymizing social networks. Computer Science Department Faculty Publication Series, page 180, 2007.
  • [39] Wentao Wu, Yanghua Xiao, Wei Wang, Zhenying He, and Zhihui Wang. k-symmetry model for identity anonymization in social networks. In Proceedings of the 13th International Conference on Extending Database Technology, page 111–122, 2010.
  • [40] Debasis Mohapatra and Manas Ranjan Patra. A level-cut heuristic-based clustering approach for social graph anonymization. Social Network Analysis and Mining, 7:1–13, 2017.
  • [41] Shouling Ji, Prateek Mittal, and Raheem Beyah. Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Communications Surveys & Tutorials, 19(2):1305–1326, 2016.
  • [42] Honglu Jiang, Jian Pei, Dongxiao Yu, Jiguo Yu, Bei Gong, and Xiuzhen Cheng. Applications of differential privacy in social network analysis: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(1):108–127, 2021.
  • [43] Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. An algorithm for k-degree anonymity on large networks. In Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining, pages 671–675, 2013.
  • [44] Xuesong Lu, Yi Song, and Stéphane Bressan. Fast identity anonymization on graphs. In Database and Expert Systems Applications: 23rd International Conference, pages 281–295, 2012.
  • [45] Bin Zhou and Jian Pei. Preserving privacy in social networks against neighborhood attacks. In Proceedings of the 24th IEEE International Conference on Data Engineering, pages 506–515, 2008.
  • [46] Shouling Ji, Weiqing Li, Prateek Mittal, Xin Hu, and Raheem Beyah. SecGraph: A uniform and open-source evaluation system for graph data anonymization and de-anonymization. In 24th USENIX Security Symposium, pages 303–318, 2015.
  • [47] Ghazaleh Beigi and Huan Liu. A survey on privacy in social media: Identification, mitigation, and applications. ACM Transactions on Data Science, 1(1):1–38, 2020.
  • [48] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In Proceedings of the 30th IEEE Symposium on Security and Privacy, pages 173–187, 2009.
  • [49] Hao Fu, Aston Zhang, and Xing Xie. Effective social graph deanonymization based on graph structure and descriptive information. ACM Transactions on Intelligent Systems and Technology, 6(4), 2015.
  • [50] Bin Zhou, Jian Pei, and WoShun Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. ACM Sigkdd Explorations Newsletter, 10(2):12–22, 2008.
  • [51] Bin Zhou and Jian Pei. The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks. Knowledge and Information Systems, 28(1):47–77, 2011.
  • [52] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1–2):1021–1032, 2010.
  • [53] Kamalkumar R Macwan and Sankita J Patel. Node differential privacy in social graph degree publishing. Procedia computer science, 143:786–793, 2018.
  • [54] Qian Xiao, Rui Chen, and Kian-Lee Tan. Differentially private network data release via structural inference. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 911–920, 2014.
  • [55] Paolo Boldi, Francesco Bonchi, Aristides Gionis, and Tamir Tassa. Injecting uncertainty in graphs for identity obfuscation. volume 5, page 1376–1387. VLDB Endowment, 2012.
  • [56] Cynthia Dwork. Differential privacy. In Automata, Languages and Programming, pages 1–12, Berlin, Heidelberg, 2006. Springer.
  • [57] Xun Jian, Yue Wang, and Lei Chen. Publishing graphs under node differential privacy. IEEE Transactions on Knowledge and Data Engineering, 35(4):4164–4177, 2023.
  • [58] Priya Mahadevan, Dmitri Krioukov, Kevin Fall, and Amin Vahdat. Systematic topology analysis and generation using degree correlations. ACM SIGCOMM Computer Communication Review, 36(4):135–146, 2006.
  • [59] Albert László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • [60] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
  • [61] Sameera Horawalavithana and Adriana Iamnitchi. On the privacy of dk-random graphs. CoRR, abs/1907.01695, 2019.
  • [62] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell’s functional organization. Nature reviews genetics, 5(2):101–113, 2004.
  • [63] Brendan D. McKay and Adolfo Piperno. Practical graph isomorphism, ii. Journal of Symbolic Computation, 60:94–112, 2014.
  • [64] Kamalkumar R Macwan and Sankita J Patel. k-degree anonymity model for social network data publishing. Advances in Electrical & Computer Engineering, 17(4):117 – 124, 2017.
  • [65] Rachel G. de Jong, Mark P. J. van der Loo, and Frank W. Takes. Algorithms for efficiently computing structural anonymity in complex networks. ACM Journal of Experimental Algorithmics, 28, 2023.
  • [66] Arash Alavi, Rajiv Gupta, and Zhiyun Qian. When the attacker knows a lot: The gaga graph anonymizer. In Proceedings of the 21st Springer International Conference on Information Security, pages 211–230, 2019.
  • [67] Guojun Wang, Qin Liu, Feng Li, Shuhui Yang, and Jie Wu. Outsourcing privacy-preserving social networks to a cloud. In 2013 Proceedings IEEE INFOCOM, pages 2886–2894, 2013.
  • [68] Sara Rajabzadeh, Pedram Shahsafi, and Mostafa Khoramnejadi. A graph modification approach for k-anonymity in social networks using the genetic algorithm. Social Network Analysis and Mining, 10:1–17, 2020.
  • [69] Xiaolin Zhang, Jiao Liu, Jian Li, and Lixin Liu. Large-scale dynamic social network directed graph k-in&out-degree anonymity algorithm for protecting community structure. IEEE Access, 7:108371–108383, 2019.
  • [70] Debasis Mohapatra and Manas Ranjan Patra. Graph anonymization using hierarchical clustering. In Computational Intelligence in Data Mining, pages 145–154. Springer, 2019.
  • [71] Chih-Hua Tai, Philip S. Yu, De-Nian Yang, and Ming-Syan Chen. Privacy-preserving social network publication against friendship attacks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1262–1270, 2011.
  • [72] BK Tripathy and Anirban Mitra. An algorithm to achieve k-anonymity and l-diversity anonymisation in social networks. In 2012 Fourth International Conference on Computational Aspects of Social Networks, pages 126–131, 2012.
  • [73] Xiangmin Ren, Dexun Jiang, et al. A personalized-anonymity model of social network for protecting privacy. Wireless Communications and Mobile Computing, 2022, 2022.
  • [74] James Cheng, Ada Wai-chee Fu, and Jia Liu. K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 459–470, 2010.
  • [75] Jianliang Gao, Qing Ping, and Jianxin Wang. Resisting re-identification mining on social graph data. World Wide Web, 21:1759–1771, 2018.
  • [76] Mingxuan Yuan, Lei Chen, and Philip S Yu. Personalized privacy protection in social networks. Proceedings of the VLDB Endowment, 4(2):141–150, 2010.
  • [77] Yifan Hao, Huiping Cao, Chuan Hu, Kabi Bhattarai, and Satyajayant Misra. K-anonymity for social networks containing rich structural and textual information. Social Network Analysis and Mining, 4(1):223, 2014.
  • [78] Chuan-Gang Liu, I-Hsien Liu, Wun-Sheng Yao, and Jung-Shian Li. K-anonymity against neighborhood attacks in weighted social networks. Security and Communication Networks, 8(18):3864–3882, 2015.
  • [79] Jérôme Kunegis. Konect: the koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350, 2013.
  • [80] Sociopatterns. Sociopatterns: Datasets, 2021.
  • [81] Piotr Sapiezynski, Arkadiusz Stopczynski, David D. Lassen, and Sune L. Jørgensen. The copenhagen networks study interaction data. figshare. https://doi.org/10.6084/m9.figshare.7267433.v1 (last accessed May 2022), 2019.
  • [82] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 4292–4293, 2015.
  • [83] Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection. http://snap.stanford.edu/data (last accessed May 2022), 2014.
  • [84] Marinka Zitnik, Rok Sosič, Sagar Maheshwari, and Jure Leskovec. Biosnap datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata (last accessed May 2023), 2018.
  • [85] Michael Fire. Data 4 good lab. https://data4goodlab.github.io/MichaelFire/#section3 (last accessed May 2023), 2020.

Appendix A Empirical differences between measures

This appendix accompanies Section 6, where we study the uniqueness obtained by various measures for $k$-anonymity. Figure 7 shows the difference in uniqueness obtained by the various measures for $d=1$ and $d=2$. For all networks and distances, the largest difference is between count and degree, followed by that between vrq and $d$-$k$-anonymity. For the other pairs of measures, the difference is often very small.

Figure 7: Difference in fraction of unique nodes (vertical axis) on different datasets (horizontal axis) using different measures for $k$-anonymity: count vs. degree (blue circle), degdist vs. count (dark blue square), $d$-$k$-anonymity vs. degdist (black circle), vrq vs. $d$-$k$-anonymity (red square) and hybrid vs. vrq (pink circle).
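To make the plotted quantity concrete, the following is a minimal sketch of how a difference in uniqueness between two measures could be computed for $d=1$. It rests on assumptions that this appendix does not restate: uniqueness is taken to be the fraction of nodes whose signature is shared with no other node, degree is taken to be the node degree, and count is taken to be the number of nodes and edges in the 1-neighbourhood. The function names and the toy graph are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def uniqueness(signatures):
    """Fraction of nodes whose signature is shared with no other node."""
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


def degree_signature(adj):
    """Assumed 'degree' measure: the degree of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def count_signature(adj):
    """Assumed 'count' measure for d=1: (#nodes, #edges) in the ego network."""
    sig = {}
    for v, nbrs in adj.items():
        ego = nbrs | {v}
        n_edges = sum(1 for a, b in combinations(sorted(ego), 2) if b in adj[a])
        sig[v] = (len(ego), n_edges)
    return sig


# Hypothetical toy graph: a triangle (0, 1, 2) attached to a small star around node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

u_degree = uniqueness(degree_signature(adj))
u_count = uniqueness(count_signature(adj))
# Difference of the kind plotted as "count vs. degree".
print(u_count - u_degree)
```

In this toy graph, nodes 2 and 3 have the same degree but differ in the number of edges in their 1-neighbourhood, so they are unique only under the assumed count measure.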

Appendix B Empirical differences and network properties

This appendix accompanies Section 6, in which we study the uniqueness obtained by various measures for $k$-anonymity. To understand which network properties cause the differences found between measures, we plot the differences reported in Figure 7 against various network properties. The resulting plots can be found in Figure 8 for the combinations deemed significant, as indicated in Table 5.

Figure 8: Uniqueness differences and network properties. Each subfigure shows a specific combination of network property (horizontal axis) and difference between measures (vertical axis). Included combinations are significantly correlated as indicated in Table 5.

Appendix C Anonymity-cascade

Figure 9 presents results accompanying Section 6.4, showing the uniqueness for anonymity-cascade final.

Figure 9: Uniqueness using anonymity-cascade final. Each figure corresponds to a different initial measure (triangle and line) and is combined with all of the cascade measures (dots). The grey bar indicates the highest obtained uniqueness for the network given the initial measure. Measures used are: degree (light blue), count (blue), degdist (dark blue), $d$-$k$-anonymity (black), vrq (red) and hybrid (pink).