SPEML SS2023-Lecture Anonymisation
Tanja Šarčević
[email protected]
https://www.sba-research.org/team/tanja-sarcevic/
Security, Privacy & Explainability in ML
.................................................
– Secure computation
– Adversarial examples
– Backdoor attacks
– Explainable AI
Outline
.................................................
• Pseudonymisation
➢ Record-Linkage Attack
• Anonymisation
• k-anonymity
• l-diversity
• t-closeness
Privacy-preserving data analysis
.................................................
• Solutions?
– E.g. Data sanitisation to allow privacy-preserving data
publishing (PPDP), privacy-preserving computation
• Depending on requirements: 2
[Figure: a record (ID: Anna Williams; Date of birth: 23/8/79; City of residence: Berkeley, CA) in which the ID is replaced by a pseudonym]
Pseudonymisation
.................................................
• GDPR:
– “…personal data … that can no longer be attributed to a
specific data subject without the use of additional
information”
– Pseudonymised personal data remain personal data nonetheless,
provided the controller or another party has the means to reverse the process
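A minimal sketch of the idea, assuming a keyed hash (HMAC-SHA256) as the pseudonymisation function. The key, the field names and the helper `pseudonymise` are hypothetical and not taken from the lecture. It illustrates why the data stay personal: whoever holds the key can recompute the pseudonym for a known name and re-link the record.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (assumption)


# Hypothetical key held by the controller ("additional information").
SECRET_KEY = b"hypothetical-key-kept-by-the-controller"

record = {"name": "Anna Williams", "date_of_birth": "23/8/79", "city": "Berkeley, CA"}

# Released record: the direct identifier is replaced, everything else is kept.
pseudonymised = {
    "pseudonym": pseudonymise(record["name"], SECRET_KEY),
    "date_of_birth": record["date_of_birth"],
    "city": record["city"],
}

# The key holder can reverse the process for any candidate name.
relinked = pseudonymise("Anna Williams", SECRET_KEY) == pseudonymised["pseudonym"]
```

Note that date of birth and city are untouched, so the record remains open to linkage on those attributes even without the key.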
Record Linkage attacks
.................................................
Record linkage
.................................................
• Steps include
– Preprocessing / normalisation
• Rule based, hidden Markov models, …
• Phonetic algorithms, …
– Some form of identity resolution

DataSet  Name              Date of birth  City of residence
1        William J. Smith  1/2/73         Berkeley, California
2        Smith, W. J.      1973.1.2       Berkeley, CA
3        Bill Smith        Jan 2, 1973    Berkeley, Calif.

Fellegi & Sunter. A Theory for Record Linkage. Journal of the American Statistical Association. 64 (328). 1969
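The preprocessing step can be sketched with a few hand-written rules applied to the three records above. This is an illustrative, rule-based normalisation only: the nickname table, the state table, the list of date formats and all function names are assumptions, and a real system would rather use phonetic or probabilistic matching.

```python
from datetime import datetime

# Hypothetical lookup tables for the rule-based normalisation.
NICKNAMES = {"bill": "william"}
STATES = {"california": "CA", "ca": "CA", "calif": "CA"}
DATE_FORMATS = ("%m/%d/%y", "%Y.%m.%d", "%b %d, %Y")


def normalise_name(name):
    """Reduce a name to (surname, first initial)."""
    name = name.lower().replace(".", "")
    if "," in name:  # "Surname, Given" order
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:            # "Given Surname" order
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    first = given.split()[0]
    first = NICKNAMES.get(first, first)
    return surname, first[0]


def normalise_date(text):
    """Try each known format until one parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unknown date format: {text}")


def normalise_city(city):
    town, state = [part.strip() for part in city.split(",")]
    return town.lower(), STATES.get(state.lower().rstrip("."), state)


records = [
    ("William J. Smith", "1/2/73", "Berkeley, California"),
    ("Smith, W. J.", "1973.1.2", "Berkeley, CA"),
    ("Bill Smith", "Jan 2, 1973", "Berkeley, Calif."),
]

# After normalisation, all three records share one linkage key.
keys = {
    (normalise_name(n), normalise_date(d), normalise_city(c))
    for n, d, c in records
}
```

Exact comparison of normalised keys is the simplest form of identity resolution; it fails as soon as a value contains a typo, which is what motivates fuzzy matching.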
Record linkage
.................................................
Roos & Wajda. Record linkage strategies. Part I: Estimating information and evaluating
approaches. Methods of Information in Medicine. 30 (2). 1991
Probabilistic (fuzzy) record linkage
.................................................
Data Sanitisation: 18 HIPAA Identifiers
.................................................
– Issues?
• List is not complete
• Case-dependent
– Adversary’s background knowledge!
• Dependent on the available other data (present AND future!)
• k-anonymity
– Each released record should be indistinguishable from
at least (k-1) others on its QI attributes
– Or: cardinality of any query result on released data
should be at least k
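The definition can be checked mechanically: group the released records by their quasi-identifier (QI) values and verify that every group has at least k members. A minimal sketch, with a hypothetical toy table and attribute names:

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of QI values occurs at least k times."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(size >= k for size in groups.values())


# Hypothetical released table; "disease" is the sensitive attribute.
released = [
    {"zip": "476**", "age": "20-40", "disease": "Heart Disease"},
    {"zip": "476**", "age": "20-40", "disease": "Flu"},
    {"zip": "479**", "age": "40-60", "disease": "Cancer"},
]

before = is_k_anonymous(released, ["zip", "age"], k=2)  # third record is unique

released.append({"zip": "479**", "age": "40-60", "disease": "Flu"})
after = is_k_anonymous(released, ["zip", "age"], k=2)
```

Only the QI attributes enter the grouping; the sensitive attribute is deliberately ignored, which is the gap that l-diversity addresses later.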
k-Anonymity
.................................................
k-Anonymity: hierarchies
.................................................
[Figure: generalisation step; values are generalised to 10**]
– Location
– Age
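A generalisation hierarchy can be written as a function from a value and a level to a coarser value. The sketch below assumes digit masking for ZIP codes (used here as a stand-in for location) and fixed-width bands for age. The band widths and function names are illustrative choices, not the hierarchies from the lecture.

```python
def generalise_zip(zip_code: str, level: int) -> str:
    """Mask the last `level` digits of a ZIP code with '*'."""
    level = min(level, len(zip_code))
    return zip_code[: len(zip_code) - level] + "*" * level


def generalise_age(age: int, level: int) -> str:
    """Level 0: exact age, 1: 10-year band, 2: 20-year band, 3+: suppressed."""
    if level == 0:
        return str(age)
    if level >= 3:
        return "*"
    width = 10 * level
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


zip_levels = [generalise_zip("53715", lvl) for lvl in range(3)]
age_levels = [generalise_age(36, lvl) for lvl in range(4)]
```

Each step up the hierarchy merges more original values into one released value, which raises group sizes at the cost of information.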
Global vs Local Transformation
.................................................
Minimal generalisation
.................................................
Methods for k-anonymisation
.................................................
• Microaggregation
– Data partitioned based on similarity of records
– Aggregation functions applied on data
• Mean for continuous numerical data
• Median for ordinal categorical data
Domingo-Ferrer, J., and Torra, V. "Ordinal, continuous and heterogeneous k-anonymity through microaggregation." Data Mining and Knowledge Discovery. 2005
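The two steps (partition by similarity, then aggregate) can be sketched for a single numerical attribute. This is a simple fixed-size heuristic, assumed here for illustration and not the algorithm from the cited paper: sort the values, cut them into groups of k (the last group absorbs the remainder), and replace each value by its group mean.

```python
from collections import Counter
from statistics import mean


def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k similar values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = max(len(values) // k, 1)
    result = [None] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group takes all remaining records, so no group is smaller than k.
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        centroid = mean(values[i] for i in group)
        for i in group:
            result[i] = centroid
    return result


ages = [31, 48, 32, 50, 29]       # hypothetical data
aggregated = microaggregate(ages, k=2)
group_sizes = Counter(aggregated)  # how often each released value occurs
```

Because every released value is shared by at least k records, the attribute is k-anonymous on its own, provided the input has at least k records.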
k-anonymity: types of attributes
.................................................
• Direct identifiers
– SSN, driving licence number, …
• Quasi-identifiers
– Personal information that can be combined to identify a
person
– Birthdate, zip code, …
• Sensitive attributes
– Non-identifying sensitive/confidential personal
information
– Health diagnosis, salary, political affiliation …
• Insensitive attributes
k-Anonymity: example results
.................................................
Solving k-anonymity
.................................................
• k-anonymity problem:
– Given a dataset R, find a dataset R’ such that:
• R’ satisfies k-anonymity condition
• R’ has the maximum utility (minimum information loss)
Solving k-anonymity: Algorithms
.................................................
• Datafly
• Incognito
• SaNGreeA
• Mondrian
• Flash
Datafly
.................................................
• Properties:
– Global (full-domain) generalization algorithm
– Heuristics: for generalization selects the attribute with
the greatest number of distinct values (iteratively until
k-anonymity is satisfied)
– Not necessarily minimal generalization
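A simplified sketch of the heuristic described above. Assumptions: the six-record dataset is hypothetical, the hierarchies use '*' as the top level for birthdate and sex and digit masking for zip, and the suppression of outlier records that the original Datafly also performs is left out.

```python
from collections import Counter

MAX_LEVEL = {"birthdate": 1, "sex": 1, "zip": 2}  # height of each hierarchy


def generalise(value, attribute, level):
    if attribute == "zip":  # mask trailing digits
        return value[: len(value) - level] + "*" * level
    return value if level == 0 else "*"


def apply_levels(records, levels):
    return [
        tuple(generalise(r[a], a, lvl) for a, lvl in levels.items())
        for r in records
    ]


def is_k_anonymous(rows, k):
    return all(count >= k for count in Counter(rows).values())


def datafly(records, k):
    """Generalise the attribute with the most distinct values until k-anonymous."""
    levels = {a: 0 for a in MAX_LEVEL}
    while not is_k_anonymous(apply_levels(records, levels), k):
        rows = apply_levels(records, levels)
        candidates = [a for a in levels if levels[a] < MAX_LEVEL[a]]
        if not candidates:
            raise ValueError("hierarchies exhausted before reaching k-anonymity")
        distinct = {a: len({row[i] for row in rows}) for i, a in enumerate(levels)}
        target = max(candidates, key=lambda a: distinct[a])
        levels[target] += 1  # global (full-domain) step: whole column at once
    return levels


records = [  # hypothetical data
    {"birthdate": "21.1.'76", "sex": "Male", "zip": "53715"},
    {"birthdate": "13.4.'86", "sex": "Female", "zip": "53715"},
    {"birthdate": "28.2.'76", "sex": "Male", "zip": "53703"},
    {"birthdate": "21.1.'76", "sex": "Male", "zip": "53703"},
    {"birthdate": "13.4.'86", "sex": "Female", "zip": "53706"},
    {"birthdate": "28.2.'76", "sex": "Female", "zip": "53706"},
]
chosen = datafly(records, k=2)
```

On this data the heuristic generalises every attribute by one level, although leaving zip untouched and fully generalising birthdate and sex would already be 2-anonymous. This illustrates why the result is not necessarily minimal.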
Datafly: example (k=2)
.................................................
Generalisation hierarchies (level 0 = original values):
Sex:        0: Male, Female
Birthdate:  1: *
            0: 21.1.’76, 28.2.’76, 13.4.’86
Zip:        2: 537**
            1: 5371*, 5370*
            0: 53715, 53710, 53706, 53703
[Tables: step-by-step generalisation of the example dataset]
• Intermediate generalisation: 2-anonymous? NO!
• After generalising further: 2-anonymous? YES ☺
• 2-minimal generalization?
• Example rows of a generalised table:
* * 53715 Hepatitis
* * 53703 Bronchitis
• 2-anonymous? YES ☺
Incognito
.................................................
• Properties:
– Generates the set of all possible k-anonymous full-
domain generalizations of the dataset
– Iterative bottom-up breadth-first search
– k-minimal generalization
– Maximizing the number of equivalence classes
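The bottom-up search can be sketched as a breadth-first walk over the lattice of level vectors. This is a simplification and not the full algorithm: the real Incognito builds the lattice incrementally over attribute subsets, whereas this sketch only enumerates the full lattice and uses the fact that any generalisation of a k-anonymous node is itself k-anonymous. The dataset, hierarchies and function names are hypothetical.

```python
from collections import Counter
from itertools import product

MAX_LEVEL = {"birthdate": 1, "sex": 1, "zip": 2}  # height of each hierarchy


def generalise(value, attribute, level):
    if attribute == "zip":  # mask trailing digits
        return value[: len(value) - level] + "*" * level
    return value if level == 0 else "*"


def classes(records, node):
    """Equivalence classes (and their sizes) for one lattice node."""
    return Counter(
        tuple(generalise(r[a], a, lvl) for a, lvl in zip(MAX_LEVEL, node))
        for r in records
    )


def k_anonymous_nodes(records, k):
    # Visit nodes by increasing height, i.e. bottom-up, level by level.
    nodes = sorted(product(*(range(m + 1) for m in MAX_LEVEL.values())), key=sum)
    found = []
    for node in nodes:
        # Pruning: a generalisation of a k-anonymous node needs no check.
        implied = any(all(n >= f for n, f in zip(node, anc)) for anc in found)
        if implied or all(c >= k for c in classes(records, node).values()):
            found.append(node)
    return found


def best_node(records, k):
    """Among all k-anonymous nodes, pick one with the most equivalence classes."""
    return max(k_anonymous_nodes(records, k), key=lambda n: len(classes(records, n)))


records = [  # hypothetical data
    {"birthdate": "21.1.'76", "sex": "Male", "zip": "53715"},
    {"birthdate": "13.4.'86", "sex": "Female", "zip": "53715"},
    {"birthdate": "28.2.'76", "sex": "Male", "zip": "53703"},
    {"birthdate": "21.1.'76", "sex": "Male", "zip": "53703"},
    {"birthdate": "13.4.'86", "sex": "Female", "zip": "53706"},
    {"birthdate": "28.2.'76", "sex": "Female", "zip": "53706"},
]
solutions = k_anonymous_nodes(records, k=2)
best = best_node(records, k=2)  # (birthdate level, sex level, zip level)
```

Nodes are written as level vectors, so `(1, 1, 0)` corresponds to the notation <Birth.1,Sex1,Zip0>.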
• Frequency set of Birth.0: 21.1.’76 : 2, 13.4.’86 : 2, 28.2.’76 : 2 → 2-anonymous with respect to "Birth.0"
• Frequency set of Sex0: Male : 3, Female : 3 → 2-anonymous with respect to "Sex0"
• Frequency set of Zip0: 53715 : 2, 53703 : 2, 53706 : 2 → 2-anonymous with respect to "Zip0"
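• Two-attribute generalisation lattices (nodes shown, most general first):
– Birth, Sex: <Birth.1,Sex1>, <Birth.0,Sex0>
– Sex, Zip: <Sex1,Zip2>, <Sex1,Zip0>, <Sex0,Zip1>, <Sex0,Zip0>
– Birth, Zip: <Birth.1,Zip2>, <Birth.1,Zip1>, <Birth.0,Zip2>, <Birth.1,Zip0>, <Birth.0,Zip1>, <Birth.0,Zip0>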
• Three-attribute generalisation lattice (nodes shown):
– <Birth.1,Sex1,Zip2>
– <Birth.1,Sex1,Zip0>: 3 equivalence classes
• Out: dataset with the greatest number of equivalence classes
SaNGreeA
.................................................
• Properties:
– Greedy clustering algorithm
– User-specified generalization hierarchies for each categorical
attribute
– Numerical attributes are generalized on the fly – no fixed
categories needed
• GIL function – measures the amount of generalization
– N = set of numerical attributes
– For numerical attributes: "how large is the generalised range compared to the total range of the attribute"
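The numerical part of such a measure can be sketched as follows. This is an illustration of the range ratio quoted above and not the exact GIL definition: the weighting by cluster size is an assumption, the categorical (hierarchy-based) part is omitted, and the data and the name `numeric_gil` are hypothetical.

```python
def numeric_gil(cluster, dataset, numeric_attributes):
    """Sum of (generalised range / total range) over numerical attributes,
    weighted by cluster size (assumed weighting)."""
    loss = 0.0
    for attr in numeric_attributes:
        all_values = [record[attr] for record in dataset]
        cluster_values = [record[attr] for record in cluster]
        total_range = max(all_values) - min(all_values)
        cluster_range = max(cluster_values) - min(cluster_values)
        if total_range:  # a constant attribute carries no loss
            loss += cluster_range / total_range
    return len(cluster) * loss


dataset = [{"age": 20}, {"age": 25}, {"age": 40}, {"age": 60}]  # hypothetical

tight = numeric_gil(dataset[:2], dataset, ["age"])    # ages 20 and 25
wide = numeric_gil([dataset[0], dataset[3]], dataset, ["age"])  # ages 20 and 60
```

A greedy clustering algorithm would add to each cluster the record that increases this value the least, so clusters of similar records are preferred.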
SaNGreeA: example (k=2)
.................................................
Solving k-anonymity: Tools
.................................................
• ARX:
– Flash algorithm
– https://arx.deidentifier.org/
• Amnesia:
– https://amnesia.openaire.eu/
Emam et al., Globally Optimal k-Anonymity for De-Identification of Health Data. 2009
Effects on Utility
.................................................
Attacks Against K-Anonymity
.................................................
[Figure: Alan (Zip 47673, Age 36) is matched against the released table; the candidate records (Zip 476**, Age 20-40) all show "Heart Disease"]
Machanavajjhala, Gehrke & Kifer. l-Diversity: Privacy Beyond k-Anonymity. 2006
L-diversity principles
.................................................
• l-diversity principle:
– Each equivalence class has at least l well-represented
sensitive values
• Distinct l-diversity
– Each equivalence class has at least l distinct sensitive values
– An equivalence class can satisfy 3-diversity while an attacker can still affirm that the
target person’s disease is “Cancer” with an accuracy of 80%.
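The weakness of distinct l-diversity can be reproduced with a small, hypothetical equivalence class: ten records, eight of them "Cancer". The class composition is an assumption chosen to match the 80% figure.

```python
from collections import Counter


def distinct_l(sensitive_values):
    """Largest l for which the class satisfies distinct l-diversity."""
    return len(set(sensitive_values))


# Hypothetical equivalence class with a skewed sensitive attribute.
eq_class = ["Cancer"] * 8 + ["Flu", "Bronchitis"]

l_value = distinct_l(eq_class)
top_value, top_count = Counter(eq_class).most_common(1)[0]
attacker_confidence = top_count / len(eq_class)
```

Counting distinct values ignores how often each value occurs, which is what the entropy and recursive variants try to correct.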
l-Diversity: variations
.................................................
• Entropy l-diversity
– Each equivalence class not only must have enough
different sensitive values, but also the different sensitive
values must be distributed evenly enough.
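A sketch of the usual formalisation, assuming the criterion "entropy of the sensitive values in a class is at least log(l)". The example classes are hypothetical, and a small tolerance is added to absorb floating-point rounding.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of the value distribution."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def is_entropy_l_diverse(values, l, tolerance=1e-12):
    return entropy(values) >= math.log(l) - tolerance


skewed = ["Cancer"] * 8 + ["Flu", "Bronchitis"]   # 3 distinct values, uneven
balanced = ["Cancer", "Flu", "Bronchitis"] * 2    # 3 distinct values, even

skewed_ok = is_entropy_l_diverse(skewed, 3)
balanced_ok = is_entropy_l_diverse(balanced, 3)
```

The skewed class has three distinct values but fails even for l = 2, because most of the probability mass sits on one value.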
• Recursive (c,l)-diversity
– Less conservative notion
– “The most frequent value does not appear too frequently”
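A sketch of the commonly used condition, assuming the counts of the sensitive values in a class are sorted in descending order as r1 ≥ r2 ≥ … ≥ rm and the class must satisfy r1 < c · (rl + … + rm). The example classes and the function name are hypothetical.

```python
from collections import Counter


def is_recursive_cl_diverse(values, c, l):
    """Check r1 < c * (r_l + ... + r_m) on the sorted value counts."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:  # fewer than l distinct values: the tail sum is empty
        return False
    return counts[0] < c * sum(counts[l - 1:])


skewed = ["Cancer"] * 8 + ["Flu", "Bronchitis"]   # counts 8, 1, 1
balanced = ["Cancer", "Flu", "Bronchitis"] * 2    # counts 2, 2, 2

skewed_ok = is_recursive_cl_diverse(skewed, c=2, l=3)      # 8 < 2 * 1 ?
balanced_ok = is_recursive_cl_diverse(balanced, c=2, l=3)  # 2 < 2 * 2 ?
```

The constant c controls how dominant the most frequent value may be; a large c makes the requirement weaker.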
Limitations of l-Diversity
.................................................
[Table rows shown (Zip, Age, Salary, Disease):]
476** 3* 7K Bronchitis
476** 3* 9K Pneumonia
t-closeness
.................................................
t-closeness Principle
.................................................
Distance between two distributions
.................................................
• Variational distance:
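A sketch assuming the standard definition D[P, Q] = ½ · Σ |p_i − q_i|, where P is the distribution of the sensitive attribute in one equivalence class and Q its distribution in the whole table. The toy data and function names are hypothetical.

```python
from collections import Counter


def distribution(values, domain):
    """Relative frequency of each domain value, in domain order."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]


def variational_distance(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


domain = ["Flu", "Cancer"]
whole_table = ["Flu", "Flu", "Cancer", "Cancer"]  # overall distribution Q
eq_class = ["Flu", "Flu"]                          # class distribution P

q = distribution(whole_table, domain)
p = distribution(eq_class, domain)
distance = variational_distance(p, q)
```

A class would be accepted for a threshold t only if this distance does not exceed t. Note that the variational distance treats all values as equally far apart and so ignores semantic closeness between them.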
Anonymisation: other limitations
.................................................
Robust De-anonymization of Large Sparse Datasets. Arvind Narayanan and Vitaly Shmatikov.
2008
Anonymisation: conclusion
.................................................
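[Table: two copies of the same four records; single values differ slightly between the copies]
32 64 1 | 33 64 1
31 66 0 | 31 68 0
50 72 1 | 50 72 1
48 70 0 | 47 70 0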
Watermarking vs. fingerprinting
.................................................
Fingerprinting – (a bad) example
.................................................
https://twitter.com/pnikosis/status/1592823543498436611
The workflow
.................................................
The schemes: fingerprinting
.................................................
Watermarking ML/DL models
.................................................
WM & FP: Conclusions
.................................................
Questions?
.................................................