SPEML SS2023 - Lecture Anonymisation

.................................................

Security, Privacy & Explainability


in Machine Learning

March 9th, 2023

Tanja Šarčević
[email protected]
https://www.sba-research.org/team/tanja-sarcevic/
Security, Privacy & Explainability in ML
.................................................

• Overview of the lecture topics


– Privacy preserving data publishing

– Secure computation

– Adversarial examples

– Backdoor attacks

– Explainable AI

2
Security, Privacy & Explainability in ML
.................................................

• Privacy-preserving data publishing:


– Pseudonymity
– k-anonymity
– l-diversity
– t-closeness
– Synthetic data
– Differential privacy

• Other concerns in data publishing:


– Intellectual digital property protection → watermarking &
fingerprinting data

3
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


4
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


5
Privacy definitions
.................................................

• “Privacy is the ability of an individual or group to


seclude themselves, or information about
themselves”
• “The challenge of data privacy is to use data
while protecting an individual's privacy
preferences and their personally identifiable
information”

• Pseudonymity is the use of pseudonyms as IDs


• Anonymity is the state of being not identifiable
within a set of subjects, the anonymity set
Anonymity, Unobservability, and Pseudonymity - A Proposal for Terminology. Pfitzmann &
Köhntopp. 2001
6
Privacy-preserving data analysis
.................................................

• Concerned with micro-data


– Data at the level of an individual

• Macro data describes mainly two subtypes of data:


– Aggregated data
– System-level data

• Meso data: data on collective and cooperative actors


– Commercial companies, organizations or political parties

7
Privacy-preserving data analysis
.................................................

• Large amounts of personal data become available
– Analysis, distribution, and sharing often conflict
with data protection laws (GDPR, ...)

– Especially critical with highly sensitive information


• E.g. health data, financial data, ...

• Solutions?
– E.g. Data sanitisation to allow privacy-preserving data
publishing (PPDP), privacy-preserving computation

8
Privacy-preserving data analysis
.................................................

• Two main approaches


– Privacy-preserving data publishing
• De-identification of information: making sure that the
published data does not contain personally identifiable information;
• k-anonymity
• Differential Privacy
• Synthetic Data
• ...
– Privacy-preserving computation
• Making sure that the computed result doesn’t allow inference on
the data
• Secure Multi-Party Computation
• Homomorphic Encryption
• Differential Privacy
• …
9
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


10
Pseudonymisation
.................................................

• A state of disguised identity

• A pseudonym identifies a holder, that is, one or more human
beings who possess but do not disclose their true names
(legal identities)
• It enables consolidation of a person’s data without revealing
identities
– Data can also mean books, paintings, etc.

ID  Name           Date of birth  City of residence
1   William Smith  1/2/73         Berkeley, California
2   Anna Williams  23/8/79        Berkeley, CA

ID  Pseudonym  Date of birth  City of residence
1   John Doe   1/2/73         Berkeley, California
2   Jane Doe   23/8/79        Berkeley, CA

• Depending on requirements:
– One-way pseudonymisation
– Reversible pseudonymisation – requires a trusted third party!

11
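One-way pseudonymisation is often implemented with a keyed hash, so the mapping cannot be reversed without a lookup table; a minimal sketch (the key and field choice are illustrative assumptions, not part of the lecture):

```python
import hmac
import hashlib

SECRET_KEY = b"assumed-secret-key"   # held by the data controller

def pseudonymise(name: str) -> str:
    """One-way pseudonym: keyed hash of the identifying attribute.
    Reversible pseudonymisation would instead keep a mapping table
    at a trusted third party."""
    return hmac.new(SECRET_KEY, name.encode(), hashlib.sha256).hexdigest()[:12]

print(pseudonymise("William Smith"))   # stable pseudonym for this record
```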
Pseudonymisation
.................................................

• GDPR:
– “…personal data … that can no longer be attributed to a
specific data subject without the use of additional
information”
– pseudonymized personal data
remain personal data nonetheless,
provided the controller or another party
has the means to reverse the process

• Thus the same principles for storing, processing,


sharing, etc. still apply!
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
protection of natural persons with regard to the processing of personal data and on the free
movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)
13
Data Sanitisation
.................................................

• Pseudonymisation: remove directly identifying


information
– That is often not enough!

• Massachusetts Health records of public


employees
– With the birthdate, ZIP Code, gender: Governor of
Massachusetts William Weld uniquely identified
• Linkage attack with public voting records
– Other examples:
Netflix Prize (2006),
AOL search data, ...

14
Record Linkage attacks
.................................................

15
Record Linkage attacks
.................................................

L. Sweeney. Simple Demographics Often Identify People


Uniquely. 2000

16
Record linkage
.................................................

• Finding records that refer to the same entity


– Across data sets from different sources
– May or may not share a common identifier

• Steps include
– Preprocessing / normalisation
• Rule-based, hidden Markov models, …
• Phonetic algorithms, …
– Some form of identity resolution

DataSet  Name              Date of birth  City of residence
1        William J. Smith  1/2/73         Berkeley, California
2        Smith, W. J.      1973.1.2       Berkeley, CA
3        Bill Smith        Jan 2, 1973    Berkeley, Calif.

Fellegi & Sunter. A Theory for Record Linkage. Journal of the American Statistical Association.
64 (328). 1969
17
Record linkage
.................................................

• Deterministic (rules-based) record linkage


– Links based on the number of individual identifiers that
match among the available data sets
– Records match if all or some identifiers (above a certain
threshold) are identical
– Good option when entities in data sets are identified by
a common identifier
• Or when there are several representative identifiers whose
quality of data is relatively high
• (e.g., name, date of birth, and sex when identifying a person)

Roos & Wajda. Record linkage strategies. Part I: Estimating information and evaluating
approaches. Methods of Information in Medicine. 30 (2). 1991
18
Probabilistic (fuzzy) record linkage
.................................................

• Takes wider range of potential identifiers into account


• Computes weights for each identifier based on its
estimated ability to correctly identify a match/non-match
• Uses weights to calculate probability that two given
records refer to the same entity
• Three types of matches
– Pairs with probabilities above a threshold
considered to be matches
– Pairs with probabilities below another
threshold considered to be non-matches
– Pairs between these thresholds are
"possible matches"
• Can be dealt with by e.g. human review
19
Probabilistic (fuzzy) record linkage
.................................................

• Algorithms assign match/non-match weights to identifiers


by two probabilities u and m
• u: probability that identifier in two non-matching records
will agree purely by chance
– What is that for the birth month?
– 1/12 ≈ 0.083
• m: the probability that identifier in matching pairs will agree
– Or sufficiently similar, e.g. strings with low Levenshtein distance
– 1.0 in case of perfect data; estimated in practice
• Based on prior knowledge of the data sets
• By estimation on a large number of matching and non-matching pairs
• By iteratively running the algorithm to obtain closer estimations of m

20
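To make the weighting concrete, here is a minimal sketch of Fellegi–Sunter-style scoring in Python. The u/m estimates, field names, and thresholds are illustrative assumptions, not values from the lecture:

```python
import math

# Assumed u/m estimates per identifier:
# m = P(identifier agrees | records match), u = P(agrees | non-match).
PROBS = {
    "birth_month": {"m": 0.97, "u": 1 / 12},
    "sex":         {"m": 0.99, "u": 0.5},
    "zipcode":     {"m": 0.95, "u": 0.01},
}

def match_weight(field: str, agrees: bool) -> float:
    """log2 likelihood-ratio weight for one identifier comparison."""
    m, u = PROBS[field]["m"], PROBS[field]["u"]
    if agrees:
        return math.log2(m / u)            # agreement: positive weight
    return math.log2((1 - m) / (1 - u))    # disagreement: negative weight

def record_score(agreements: dict) -> float:
    """Sum of per-identifier weights for one candidate record pair."""
    return sum(match_weight(f, a) for f, a in agreements.items())

# Classify with two thresholds: match / possible match / non-match.
UPPER, LOWER = 6.0, 0.0   # assumed thresholds, tuned per application
score = record_score({"birth_month": True, "sex": True, "zipcode": False})
label = ("match" if score > UPPER
         else "non-match" if score < LOWER
         else "possible match")
print(f"score={score:.2f} -> {label}")
```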
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


21
Data Sanitisation
.................................................

• Anonymisation: sanitize also quasi-identifiers (QI)


– Those attributes that can identify when used in
combination
– Birthdate, ZIP Code, gender, occupation, ...

– Issues?
• List is not complete
• Case-dependent
– Adversary’s background knowledge!
• Dependent on the available other data (present AND future!)

• Anonymised data is not subject to


GDPR regulations anymore!
22
Data Sanitisation: HIPAA
.................................................

• The Privacy Rule of the US Health Insurance


Portability and Accountability Act of 1996 (HIPAA)
establishes comprehensive protections for medical
privacy (revised & came into effect 2002)
• Protected health information (PHI) is “identifiable”
health information acquired in the course of serving
patients
– One of the few authoritative sources that lists identifiable
attributes
• Sanitisation standard before data sharing in medical
domains (research and professional)

23
Data Sanitisation: 18 HIPAA Identifiers
.................................................

• Names
• All geographic subdivisions smaller than a State
• All elements of dates (except year)
• Telephone numbers
• Fax numbers
• Electronic mail addresses
• Social security numbers
• Medical record numbers
• Health plan beneficiary numbers
• Account numbers
• Certificate/license numbers
• Vehicle identifiers and serial numbers, including license plate numbers
• Device identifiers and serial numbers
• Web Universal Resource Locators (URLs)
• Internet Protocol (IP) address numbers
• Biometric identifiers, including finger and voice prints
• Full face photographic images and any comparable images
• Any other unique identifying number, characteristic, or code
24
Data Sanitisation: HIPAA
.................................................

• Massachusetts Health records of public


employees
– With the birthdate, ZIP Code, gender: would the
governor be re-identified by applying HIPAA?

• k-anonymity
– Each released record should be indistinguishable from
at least (k-1) others on its QI attributes
– Or: cardinality of any query result on released data
should be at least k

25
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


26
k-Anonymity
.................................................

• Ensures that at least k records have the same QI, via


– Generalisation of values
• Exact age to a range of values, …
– Suppression of values
• To avoid “over-generalisation” due to e.g. outliers
– Microaggregation

27
k-Anonymity: hierarchies
.................................................

• Generalisation is achieved by using a hierarchy


– Example: ZIP code

Level 3: 1***    (highest generalisation level / full generalisation)
Level 2: 10**
Level 1: 101*   102*   103*
Level 0: 1010   1020   1022   1031

Each edge between adjacent generalisation levels is one generalisation step.

30
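Such a hierarchy over ZIP codes can be expressed as a simple masking function; a minimal sketch (the function name is our own):

```python
def generalise_zip(zipcode: str, level: int) -> str:
    """Generalise a ZIP code by masking its last `level` digits,
    i.e. climbing `level` steps up the hierarchy."""
    if not 0 <= level < len(zipcode):
        raise ValueError("level out of range")
    return zipcode[:len(zipcode) - level] + "*" * level

for lvl in range(4):
    print(lvl, generalise_zip("1020", lvl))  # 1020, 102*, 10**, 1***
```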
k-Anonymity: hierarchies
.................................................

– Location

– Age

31
Global vs Local Transformation
.................................................

Global: all values of the attribute are generalised to the same level

Birthdate  Sex     Zipcode  Disease
*          Male    537**    Flu
*          Male    537**    Broken Arm
*          Male    537**    Bronchitis
*          Female  537**    Hepatitis
*          Female  537**    Sprained Ankle
*          Female  537**    Hang Nail

Local: different levels of generalisation within a single attribute

Birthdate  Sex     Zipcode  Disease
21.1.’76   Male    537**    Flu
21.1.’76   Male    537**    Broken Arm
*          Female  537**    Hepatitis
*          Female  537**    Hang Nail
*          *       5370*    Bronchitis
*          *       5370*    Sprained Ankle

32
Minimal generalisation
.................................................

A minimal generalisation is a generalisation that satisfies k-anonymity
such that it is impossible to lower the generalisation level of any
attribute and still satisfy k-anonymity for the database.

33
Methods for k-anonymisation
.................................................

• Microaggregation
– Data partitioned based on similarity of records
– Aggregation functions applied on data
• Mean for continuous numerical data
• Median for categorical data

Age Sex Zipcode Disease


44 Male 53715 Flu
35 Female 53715 Hepatitis
45 Male 53703 Bronchitis
44 Male 53703 Broken Arm
35 Female 53706 Sprained Ankle

45 Female 53706 Hang Nail

Domingo-Ferrer, J., and Vicenç T. "Ordinal, continuous and heterogeneous k-anonymity through microaggregation."

34
Methods for k-anonymisation
.................................................

• Microaggregation
– Data partitioned based on similarity of records
– Aggregation functions applied on data
• Mean for continuous numerical data
• Median for categorical data

Age Sex Zipcode Disease


44 Male 53703 Flu
38 Female 53706 Hepatitis
44 Male 53703 Bronchitis
44 Male 53703 Broken Arm
38 Female 53706 Sprained Ankle

38 Female 53706 Hang Nail

Domingo-Ferrer, J., and Vicenç T. "Ordinal, continuous and heterogeneous k-anonymity through microaggregation."

35
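A minimal sketch of the aggregation step, reproducing the slide’s numbers; the cluster assignment is taken as given rather than computed, and the mode stands in for the median on the categorical attributes:

```python
from statistics import mean, mode

records = [
    {"Age": 44, "Sex": "Male",   "Zipcode": "53715"},
    {"Age": 35, "Sex": "Female", "Zipcode": "53715"},
    {"Age": 45, "Sex": "Male",   "Zipcode": "53703"},
    {"Age": 44, "Sex": "Male",   "Zipcode": "53703"},
    {"Age": 35, "Sex": "Female", "Zipcode": "53706"},
    {"Age": 45, "Sex": "Female", "Zipcode": "53706"},
]
clusters = [[0, 2, 3], [1, 4, 5]]  # similarity-based partition, assumed given

def microaggregate(records, clusters):
    out = [dict(r) for r in records]
    for cluster in clusters:
        # Mean for the numerical attribute, mode for categorical ones.
        avg_age = round(mean(records[i]["Age"] for i in cluster))
        for i in cluster:
            out[i]["Age"] = avg_age
            for attr in ("Sex", "Zipcode"):
                out[i][attr] = mode(records[j][attr] for j in cluster)
    return out

for row in microaggregate(records, clusters):
    print(row)   # ages become 44 / 38, zip codes 53703 / 53706
```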
k-anonymity: types of attributes
.................................................

• Direct identifiers
– SSN, driving licence number, …
• Quasi-identifiers
– Personal information that can be combined to identify a
person
– Birthdate, zip code, …
• Sensitive attributes
– Non-identifying sensitive/confidential personal
information
– Health diagnosis, salary, political affiliation …
• Insensitive attributes

36
k-Anonymity: example results
.................................................

37
Solving k-anonymity
.................................................

• k-anonymity problem:
– Given a dataset R, find a dataset R’ such that:
• R’ satisfies k-anonymity condition
• R’ has the maximum utility (minimum information loss)

• Given some data set R and a QI Q, does R satisfy k-


anonymity over Q?
– Easy to tell in polynomial time
• Finding an optimal anonymization is not easy
– NP-hard: reduction from k-dimensional perfect matching*
➔ Heuristic solutions
A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS’04.

38
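Verifying the k-anonymity condition for a given QI set is indeed polynomial; a minimal sketch (function and column names are ours):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """k-anonymity check: every combination of quasi-identifier
    values must occur in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

rows = [
    {"Birthdate": "*", "Sex": "Male",   "Zipcode": "5371*"},
    {"Birthdate": "*", "Sex": "Female", "Zipcode": "5371*"},
]
print(is_k_anonymous(rows, ["Birthdate", "Sex", "Zipcode"], k=2))  # False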
Solving k-anonymity: Algorithms
.................................................

• Datafly
• Incognito
• SaNGreeA
• Mondrian
• Flash

39
Datafly
.................................................

• Properties:
– Global (full-domain) generalization algorithm
– Heuristics: for generalization selects the attribute with
the greatest number of distinct values (iteratively until
k-anonymity is satisfied)
– Not necessarily minimal generalization

40
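A compact sketch of the Datafly loop, reusing is_k_anonymous from the sketch above; the generalisation hierarchies are assumed to be given as parent maps, and the suppression step is omitted:

```python
def datafly(rows, quasi_ids, hierarchies, k):
    """While the table is not k-anonymous, fully generalise the QI
    attribute with the greatest number of distinct values by one
    hierarchy level. `hierarchies[attr]` maps each value to the
    value one generalisation level up; '*' is the root."""
    rows = [dict(r) for r in rows]
    while not is_k_anonymous(rows, quasi_ids, k):
        attr = max(quasi_ids, key=lambda a: len({r[a] for r in rows}))
        for r in rows:
            r[attr] = hierarchies[attr].get(r[attr], "*")
    return rows
```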
Datafly: example (k=2)
.................................................

While not 2-anonymous: generalise the attribute with the
greatest number of distinct values.
Start → Birthdate (or Zipcode)

Birthdate  Sex     Zipcode  Disease
21.1.’76   Male    53715    Flu
13.4.’86   Female  53715    Hepatitis
28.2.’76   Male    53703    Bronchitis
21.1.’76   Male    53703    Broken Arm
13.4.’86   Female  53706    Sprained Ankle
28.2.’76   Female  53706    Hang Nail

Hierarchies:
Sex:        1: Person
            0: Male   Female
Zipcode:    2: 537**
            1: 5371*  5370*
            0: 53715  53710  53706  53703
Birthdate:  1: *
            0: 21.1.’76  28.2.’76  13.4.’86

41
Datafly: example (k=2)
.................................................

While not 2-anonymous: generalise the attribute with the
greatest number of distinct values.

Birthdate  Sex     Zipcode  Disease
*          Male    53715    Flu
*          Female  53715    Hepatitis
*          Male    53703    Bronchitis
*          Male    53703    Broken Arm
*          Female  53706    Sprained Ankle
*          Female  53706    Hang Nail

2-anonymous? NO! → generalise Zipcode:
Zipcode:    2: 537**
            1: 5371*  5370*
            0: 53715  53710  53706  53703

42
Datafly: example (k=2)
.................................................

While not 2-anonymous: generalise the attribute with the
greatest number of distinct values.

Birthdate  Sex     Zipcode  Disease
*          Male    5371*    Flu
*          Female  5371*    Hepatitis
*          Male    5370*    Bronchitis
*          Male    5370*    Broken Arm
*          Female  5370*    Sprained Ankle
*          Female  5370*    Hang Nail

2-anonymous? NO! → generalise Sex:
Sex:        1: Person
            0: Male   Female

43
Datafly: example (k=2)
.................................................

Birthdate  Sex  Zipcode  Disease
*          *    5371*    Flu
*          *    5371*    Hepatitis
*          *    5370*    Bronchitis
*          *    5370*    Broken Arm
*          *    5370*    Sprained Ankle
*          *    5370*    Hang Nail

2-anonymous? YES ☺

2-minimal generalization?

44
Datafly: example (k=2)
.................................................

Birthdate  Sex  Zipcode  Disease
*          *    5371*    Flu
*          *    5371*    Hepatitis
*          *    5370*    Bronchitis
*          *    5370*    Broken Arm
*          *    5370*    Sprained Ankle
*          *    5370*    Hang Nail

Consider instead:

Birthdate  Sex  Zipcode  Disease
*          *    53715    Flu
*          *    53715    Hepatitis
*          *    53703    Bronchitis
*          *    53703    Broken Arm
*          *    53706    Sprained Ankle
*          *    53706    Hang Nail

45
Datafly: example (k=2)
.................................................

Birthdate  Sex  Zipcode  Disease
*          *    5371*    Flu
*          *    5371*    Hepatitis
*          *    5370*    Bronchitis
*          *    5370*    Broken Arm
*          *    5370*    Sprained Ankle
*          *    5370*    Hang Nail

2-anonymous? YES ☺

2-minimal generalization? NO! The alternative on the previous slide
leaves Zipcode at its lowest level and is already 2-anonymous.

46
Incognito
.................................................

• Properties:
– Generates the set of all possible k-anonymous full-
domain generalizations of the dataset
– Iterative bottom-up breadth-first search
– k-minimal generalization
– Maximizing the number of equivalence classes

47
Incognito
.................................................

Input table:

Birthdate  Sex     Zipcode  Disease
21.1.’76   Male    53715    Flu
13.4.’86   Female  53715    Hepatitis
28.2.’76   Male    53703    Bronchitis
21.1.’76   Male    53703    Broken Arm
13.4.’86   Female  53706    Sprained Ankle
28.2.’76   Female  53706    Hang Nail

Hierarchies:
Sex:        1: Person
            0: Male   Female
Zipcode:    2: 537**
            1: 5371*  5370*
            0: 53715  53710  53706  53703
Birthdate:  1: *
            0: 21.1.’76  28.2.’76  13.4.’86

48
Incognito
.................................................

(Input table as on the previous slide.)

Birth.1
   |
Birth.0

Frequency set for Birth.0:
21.1.’76 : 2    13.4.’86 : 2    28.2.’76 : 2
→ 2-anonymous with respect to <Birth.0>

49
Incognito
.................................................

(Input table as above.)

Sex1
  |
Sex0

Frequency set for Sex0:
Male : 3    Female : 3
→ 2-anonymous with respect to <Sex0>

50
Incognito
.................................................

(Input table as above.)

Zip2
  |
Zip1
  |
Zip0

Frequency set for Zip0:
53715 : 2    53703 : 2    53706 : 2
→ 2-anonymous with respect to <Zip0>

51
Incognito
.................................................

(Input table as above.)

Two-attribute lattice:

        <Birth.1,Sex1>
       /              \
<Birth.1,Sex0>    <Birth.0,Sex1>
       \              /
        <Birth.0,Sex0>

Frequency set for <Birth.0,Sex0>:
<21.1.’76, Male> : 2
<13.4.’86, Female> : 2
<28.2.’76, Male> : 1
<28.2.’76, Female> : 1
→ not 2-anonymous
52
Incognito
.................................................

Generalising Birthdate: 21.1.’76, 28.2.’76, 13.4.’86 → *

Frequency set for <Birth.1,Sex0>:
<*, Male> : 3
<*, Female> : 3

→ 2-anonymous with respect to <Birth.1,Sex0>
53
Incognito
.................................................

Generalising Sex: Male, Female → Person

Frequency set for <Birth.0,Sex1>:
<21.1.’76, Person> : 2
<13.4.’86, Person> : 2
<28.2.’76, Person> : 2

→ 2-anonymous with respect to <Birth.0,Sex1>
54
Incognito
.................................................

Two-attribute lattices:

<Birthdate, Sex>:                 <Sex, Zipcode>:
      <Birth.1,Sex1>                    <Sex1,Zip2>
<Birth.1,Sex0>  <Birth.0,Sex1>    <Sex1,Zip1>  <Sex0,Zip2>
      <Birth.0,Sex0>              <Sex1,Zip0>  <Sex0,Zip1>
                                        <Sex0,Zip0>

<Birthdate, Zipcode>:
      <Birth.1,Zip2>
<Birth.1,Zip1>  <Birth.0,Zip2>
<Birth.1,Zip0>  <Birth.0,Zip1>
      <Birth.0,Zip0>

55
Incognito
.................................................

Candidate 3-attribute generalisations:

              <Birth.1,Sex1,Zip2>
<Birth.1,Sex1,Zip1>  <Birth.1,Sex0,Zip2>  <Birth.0,Sex1,Zip2>
              <Birth.1,Sex1,Zip0>

Birthdate  Sex     Zipcode  Disease
21.1.’76   Male    53715    Flu
13.4.’86   Female  53715    Hepatitis
28.2.’76   Male    53703    Bronchitis
21.1.’76   Male    53703    Broken Arm
13.4.’86   Female  53706    Sprained Ankle
28.2.’76   Female  53706    Hang Nail

56
Incognito
.................................................

<Birth.1,Sex1,Zip0>: 3 equivalence classes

Birthdate  Sex  Zipcode  Disease
*          *    53715    Flu
*          *    53715    Hepatitis
*          *    53703    Bronchitis
*          *    53703    Broken Arm
*          *    53706    Sprained Ankle
*          *    53706    Hang Nail

57
Incognito
.................................................

<Birth.1,Sex0,Zip2>: 2 equivalence classes

Birthdate  Sex     Zipcode  Disease
*          Male    537**    Flu
*          Female  537**    Hepatitis
*          Male    537**    Bronchitis
*          Male    537**    Broken Arm
*          Female  537**    Sprained Ankle
*          Female  537**    Hang Nail

58
Incognito
.................................................

<Birth.0,Sex1,Zip2>: 3 equivalence classes

Output: the dataset with the greatest number of equivalence classes

Birthdate  Sex  Zipcode  Disease
21.1.’76   *    537**    Flu
13.4.’86   *    537**    Hepatitis
28.2.’76   *    537**    Bronchitis
21.1.’76   *    537**    Broken Arm
13.4.’86   *    537**    Sprained Ankle
28.2.’76   *    537**    Hang Nail

59
SaNGreeA
.................................................

• Properties:
– Greedy clustering algorithm
– User-specified generalization hierarchies for each categorical
attribute
– Numerical attributes are generalized on the fly – no fixed
categories needed
• GIL function – measures the amount of generalisation caused by
  adding a record to a cluster:
  – for each numerical attribute (N = set of numerical attributes):
    "how large is the generalised range compared to the total range
    of the attribute"
  – for each categorical attribute (C = set of categorical attributes):
    "how many steps up the hierarchy we need to take, out of the
    total number of hierarchy levels"

  GIL = Σ over A∈N of size(generalised range of A) / size(total range of A)
      + Σ over A∈C of (steps up the hierarchy of A) / (total hierarchy levels of A)

60
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 53715 Flu c1


t2 35 Female 53715 Hepatitis

t3 32 Male 53703 Bronchitis

t4 43 Male 53703 Broken Arm

t5 28 Female 53706 Sprained Ankle

t6 33 Female 53706 Hang Nail

• Initiate the cluster c1 with the record t1


• Add another record t to c1:
• Calculate GIL for each available t and c1
• E.g. if we added t2 to c1, Age would need to be generalised to the range [35-43] and Sex to *
• Hence, GIL(c1,t2) = size of range [35-43] / size of total Age range [28-43]
+ # steps taken in Sex gen. hierarchy / # total steps in Sex gen. hierarchy
+ # steps taken in Zipcode gen. hierarchy / # total steps in Zipcode gen. hierarchy
GIL(c1,t2) = 8/15 + 1/1 + 0/2 = 1.53
• Choose the record with minimum GIL

61
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 53715 Flu c1


t2 35 Female 53715 Hepatitis

t3 32 Male 53703 Bronchitis

t4 43 Male 53703 Broken Arm


c1
t5 28 Female 53706 Sprained Ankle

t6 33 Female 53706 Hang Nail

GIL(c1,t2) = 8/15 + 1/1 + 0/2 = 1.53
GIL(c1,t3) = 11/15 + 0/1 + 2/2 = 1.73
GIL(c1,t4) = 0/15 + 0/1 + 2/2 = 1      ← min GIL
GIL(c1,t5) = 15/15 + 1/1 + 2/2 = 3
GIL(c1,t6) = 10/15 + 1/1 + 2/2 = 2.67

61
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 537** Flu c1


t2 35 Female 53715 Hepatitis
c2
t3 32 Male 53703 Bronchitis

t4 43 Male 537** Broken Arm c1


t5 28 Female 53706 Sprained Ankle

t6 33 Female 53706 Hang Nail

• Initiate the next cluster c2 with the record t2
• Add another record t to c2:
• Calculate GIL for each available t and c2
• E.g. if we added t3 to c2, Age would need to be generalised to the range [32-35], Sex to * and
Zipcode to 537**
• Hence, GIL(c2,t3) = size of range [32-35] / size of total Age range [28-43]
+ # steps taken in Sex gen. hierarchy / # total steps in Sex gen. hierarchy
+ # steps taken in Zipcode gen. hierarchy / # total steps in Zipcode gen. hierarchy
GIL(c2,t3) = 3/15 + 1/1 + 2/2 = 2.2
• Choose the record with minimum GIL

62
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 537** Flu c1


t2 35 Female 53715 Hepatitis
c2
t3 32 Male 53703 Bronchitis

t4 43 Male 537** Broken Arm c1


t5 28 Female 53706 Sprained Ankle

t6 33 Female 53706 Hang Nail


c2

GIL(c2,t3) = 3/15 + 1/1 + 2/2 = 2.2
GIL(c2,t5) = 7/15 + 0/1 + 2/2 = 1.47
GIL(c2,t6) = 2/15 + 0/1 + 2/2 = 1.13   ← min GIL

62
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 537** Flu c1


t2 [33,35] Female 537** Hepatitis
c2
t3 32 Male 53703 Bronchitis
c3
t4 43 Male 537** Broken Arm c1
t5 28 Female 53706 Sprained Ankle
c3
t6 [33,35] Female 537** Hang Nail
c2

63
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

t1 43 Male 537** Flu c1


t2 [33,35] Female 537** Hepatitis
c2
t3 [28,33] * 5370* Bronchitis
c3
t4 43 Male 537** Broken Arm c1
t5 [28,33] * 5370* Sprained Ankle
c3
t6 [33,35] Female 537** Hang Nail
c2

64
SaNGreeA: example (k=2)
.................................................

Age Sex Zipcode Disease

43 Male 537** Flu

43 Male 537** Broken Arm

[33,35] Female 537** Hepatitis

[33,35] Female 537** Hang Nail

[28,33] * 5370* Bronchitis

[28,33] * 5370* Sprained Ankle

65
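The GIL arithmetic from this example can be replayed directly; a minimal sketch with the example’s attribute structure hard-coded (all names are ours):

```python
def gil(age_range, age_total, sex_steps, sex_levels, zip_steps, zip_levels):
    """Generalisation information loss for one candidate merge:
    range ratio for the numerical attribute plus hierarchy-step
    ratios for the categorical ones."""
    return (age_range / age_total
            + sex_steps / sex_levels
            + zip_steps / zip_levels)

# GIL(c1,t2): Age generalised to [35-43] out of total [28-43],
# Sex one step of one level, Zipcode untouched (0 of 2 levels).
print(gil(8, 15, 1, 1, 0, 2))   # 1.53...
# GIL(c1,t4): only Zipcode changes (2 of 2 levels) -> minimum, t4 joins c1.
print(gil(0, 15, 0, 1, 2, 2))   # 1.0
```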
Solving k-anonymity: Tools
.................................................

• ARX:
– Flash algorithm
– https://arx.deidentifier.org/

• Amnesia:
– https://amnesia.openaire.eu/

• UTD Anonymization Toolbox:
– Datafly, Incognito, Mondrian
– http://www.cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php

• Microaggregation tool:
– https://github.com/CrisesUrv/microaggregation-based_anonymization_tool
66
k-Anonymity: Effects on utility
.................................................

• Two main approaches to evaluate the effect of k-


anonymisation on the data utility

– Measured directly on the data (“information loss metric”)


• Precision (steps in the hierarchy), Discernibility Metric (how
many records can be distinguished), non-uniform entropy, ...

– Measured by the effect on utility for a certain task/model


• E.g. Train a machine learning model, and evaluate difference in
effectiveness measures

Emam et al., Globally Optimal k-Anonymity for De-Identification of Health Data. 2009

67
Effects on Utility
.................................................

• As the level of anonymity increases, so does the information loss

• Local (SaNGreeA) vs Global (Flash/ARX) transformation

* Experiments on Adult dataset (target: education-num)


71
Attacks Against K-Anonymity
.................................................

• Complementary Release Attack


– Different releases can be linked together to compromise
k-anonymity
– Solution:
• Consider all previously released tables before releasing a new one,
and try to avoid linking

– Other data holders may release data that can be
used in this kind of attack
• Hard to prevent completely

72
Attacks Against K-Anonymity
.................................................

• k-Anonymity does not provide privacy if
– Sensitive values in an equivalence class lack diversity
– The attacker has background knowledge

A 3-anonymous patient table:

Zipcode  Age  Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
476**    2*   Heart Disease
4790*    ≥40  Flu
4790*    ≥40  Heart Disease
4790*    ≥40  Breast Cancer
476**    3*   Heart Disease
476**    3*   Breast Cancer
476**    3*   Breast Cancer

Homogeneity attack: Bob (Zipcode 47678, Age 27) falls in the first
equivalence class, where everyone has heart disease.

Background knowledge attack: Alan (Zipcode 47673, Age 36) falls in the
last class; background knowledge that heart disease is unlikely for him
points to breast cancer.
73
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


74
L-diversity principles
.................................................

• Each equivalence class has at least l well-
represented sensitive values

A 3-anonymous patient table:

Zipcode  Age    Disease
476**    20-40  Heart Disease
476**    20-40  Heart Disease
476**    20-40  Breast Cancer
4790*    ≥40    Flu
4790*    ≥40    Heart Disease
4790*    ≥40    Breast Cancer
476**    20-40  Heart Disease
476**    20-40  Heart Disease
476**    20-40  Breast Cancer

Bob (Zipcode 47678, Age 27)? Alan (Zipcode 47673, Age 36)?
Their equivalence classes no longer pin down a single disease.

Machanavajjhala, Gehrke & Kifer. l-Diversity: Privacy Beyond k-Anonymity. 2006
75
L-diversity principles
.................................................

• L-diversity principle:

– A q-block (equivalence class) is l-diverse if it contains at
least l "well-represented" values for the sensitive
attribute S

– A table is l-diverse if every q-block is l-diverse

– Different variations: distinct, entropy, recursive
l-diversity

Machanavajjhala, Gehrke & Kifer. l-Diversity: Privacy Beyond k-Anonymity. 2006


75
l-Diversity: variations
.................................................

• Distinct l-diversity
– Each equivalence class has at least l distinct
sensitive values
– Limitation:
• Doesn’t prevent a probabilistic inference attack
• Example:

#   Zipcode  Age  Disease
1   476**    2*   Cancer
2   476**    2*   Flu
3   476**    2*   Cancer
4   476**    2*   Cancer
5   476**    2*   Cancer
6   476**    2*   Cancer
7   476**    2*   Cancer
8   476**    2*   Heart Disease
9   476**    2*   Cancer
10  476**    2*   Cancer

– 10 tuples in one equivalence class
– The "Disease" attribute contains one "Flu",
one "Heart Disease", and eight "Cancer"
– This satisfies 3-diversity, but an attacker can still affirm that the
target person’s disease is "Cancer" with 80% accuracy.

76
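A minimal check for the distinct variant, reproducing the example above (all names are ours):

```python
def is_distinct_l_diverse(eq_classes, l):
    """Distinct l-diversity: every equivalence class contains at
    least l distinct sensitive values."""
    return all(len(set(cls)) >= l for cls in eq_classes)

cls = ["Flu", "Heart Disease"] + ["Cancer"] * 8
print(is_distinct_l_diverse([cls], 3))   # True, yet P(Cancer) = 0.8
```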
l-Diversity: variations
.................................................

• Entropy l-diversity
– Each equivalence class not only must have enough
different sensitive values, but also the different sensitive
values must be distributed evenly enough.

– It means the entropy of the distribution of sensitive


values in each equivalence class is at least log2(l)
H(X) = E[I(X)] = Σ_{i=1..n} p(x_i) I(x_i) = − Σ_{i=1..n} p(x_i) log₂ p(x_i)

– Sometimes too restrictive – when some values are very


common, entropy of the entire table may be very low

77
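A minimal check for the entropy variant, again with our own names:

```python
import math
from collections import Counter

def is_entropy_l_diverse(eq_classes, l):
    """Entropy l-diversity: the entropy of the sensitive-value
    distribution in every equivalence class is at least log2(l)."""
    for cls in eq_classes:
        n = len(cls)
        entropy = -sum((c / n) * math.log2(c / n)
                       for c in Counter(cls).values())
        if entropy < math.log2(l):
            return False
    return True

cls = ["Flu", "Heart Disease"] + ["Cancer"] * 8
print(is_entropy_l_diverse([cls], 3))   # False: H ≈ 0.92 < log2(3) ≈ 1.58
```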
l-Diversity: variations
.................................................

• Recursive (c,l)-diversity
– A less conservative notion
– "The most frequent value does not appear too frequently"

– s1, …, sm: possible values of the sensitive attribute in a q-block
– n(q, si): count of value si in the q-block
• counts sorted in descending order are referred to as r1, …, rm

– A q-block is (c,2)-diverse if, for a specified constant c:
• r1 < c(r2 + … + rm)

– Recursively, for general l (more than two sensitive values):
• r1 < c(rl + rl+1 + … + rm)

78
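And a minimal check for the recursive variant on the same example class:

```python
from collections import Counter

def is_recursive_cl_diverse(cls, c, l):
    """Recursive (c,l)-diversity: with sensitive-value counts sorted
    descending as r1..rm, require r1 < c * (r_l + ... + r_m)."""
    r = sorted(Counter(cls).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

cls = ["Flu", "Heart Disease"] + ["Cancer"] * 8   # r = [8, 1, 1]
print(is_recursive_cl_diverse(cls, c=3, l=2))     # 8 < 3*(1+1)? False
```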
Limitations of l-Diversity
.................................................

• l-diversity may be difficult or unnecessary

• Example: a single sensitive attribute


– Two values: HIV positive (1%) and HIV negative (99%)
– Very different degrees of sensitivity

– l-diversity may be unnecessary


• 2-diversity is unnecessary for an equivalence class that contains
only negative records

– l-diversity is difficult to achieve


• Suppose there are 10000 records in total
• To have distinct 2-diversity, there can be at most 10000*1%=100
equivalence classes
79
Limitations of l-Diversity
.................................................

• l-diversity is insufficient to prevent attribute disclosure

Similarity attack: Bob (Zipcode 47678, Age 27) falls in the first class.

Zipcode  Age  Salary  Disease
476**    2*   3K      Gastric Ulcer
476**    2*   4K      Gastritis
476**    2*   5K      Stomach Cancer
4790*    ≥40  6K      Gastritis
4790*    ≥40  11K     Flu
4790*    ≥40  8K      Bronchitis
476**    3*   7K      Bronchitis
476**    3*   9K      Pneumonia
476**    3*   10K     Stomach Cancer

• Conclusions:
– Bob’s salary is in [3K, 5K], which is relatively low
– Bob has some stomach-related disease
• l-diversity does not consider semantic meanings of
sensitive values
80
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


81
t-closeness
.................................................

• k-anonymity prevents identity disclosure but


not attribute disclosure

• To solve that problem l-diversity requires that


each eq. class has at least l values for each
sensitive attribute

• t-closeness requires that the distribution of a


sensitive attribute in any equivalence class is
close to the distribution of the attribute in the
overall table
t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. Li et al., 2007

82
t-closeness
.................................................

Similarity attack: does it still work on Bob (Zipcode 47678, Age 27)?

Zipcode  Age  Salary  Disease
476**    <40  3K      Gastric Ulcer
476**    <40  9K      Pneumonia
476**    <40  5K      Stomach Cancer
4790*    ≥40  6K      Gastritis
4790*    ≥40  11K     Flu
4790*    ≥40  8K      Bronchitis
476**    <40  7K      Bronchitis
476**    <40  4K      Gastritis
476**    <40  10K     Stomach Cancer

• Privacy = information gain of an observer
• The distribution of the sensitive attribute in each
equivalence class should be similar to the distribution
of the sensitive attribute in the whole table
83
t-closeness
.................................................

• Privacy is measured by the information gain of an


observer

• Information Gain = (Posterior Belief – Prior Belief)

• Q = the distribution of the sensitive attribute in the


whole table

• P = the distribution of the sensitive attribute in


equivalence class

83
t-closeness Principle
.................................................

• An equivalence class is said to have t-closeness


– If the distance between the distribution of a sensitive
attribute in this class and the distribution of the attribute
in the whole table is no more than a threshold t

• A table is said to have t-closeness


– If all equivalence classes have t-closeness.

84
Distance between two distributions
.................................................

• Given two distributions
– P = (p1, p2, ..., pm)
– Q = (q1, q2, ..., qm)

• Variational distance:
D[P, Q] = (1/2) Σ_{i=1..m} |p_i − q_i|

• Earth Mover’s Distance
– (Or another distance measure)


85
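A minimal sketch of both distances; the 1-D EMD form below assumes an ordered, equally spaced attribute domain, which is how numeric attributes such as salary are treated:

```python
def variational_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / 2

def emd_1d(p, q):
    """Earth Mover's Distance over an ordered, equally spaced domain:
    the normalised sum of absolute cumulative differences."""
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total / (len(p) - 1)

# Salary distribution in the whole table (9 values, uniform) vs. the
# first equivalence class of the similarity-attack table {3K, 4K, 5K}:
Q = [1 / 9] * 9
P = [1 / 3] * 3 + [0.0] * 6
print(emd_1d(P, Q))   # 0.375
```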
Similarity Attack Example
.................................................

• 0.167-closeness for Salary and 0.278-closeness


for Disease
86
l-t: Conclusion
.................................................

• l-diversity and t-closeness add additional


guarantees for the privacy of the individuals

• They however further limit the data utility

• Search for k/l/t-minimal distortion more complex

• Adds two more parameters to set – which values?

87
Anonymisation: other limitations
.................................................

• In very high-dimensional spaces


data matrices often get very sparse
➔ Only a few items are
actually similar to each other

Robust De-anonymization of Large Sparse Datasets. Arvind Narayanan and Vitaly Shmatikov.
2008
88
Anonymisation: other limitations
.................................................

• In very high-dimensional spaces


data matrices often get very sparse
– Makes re-identification easier

Robust De-anonymization of Large Sparse Datasets. Arvind Narayanan and Vitaly Shmatikov.
2008
89
Anonymisation: conclusion
.................................................

• Approaches like k/l/t-* prevent certain types of


attacks
– Identification, background, similarity, ....
– Has effects on the data utility
– It is difficult to assess what other data is available
– It is not clear what a required level for k is
• Still
– There aren’t many alternatives around
• Differential privacy is likely the most often mentioned one
– Still a frequently used approach when data needs to be
published to the "public"
– Makes the release more GDPR-compliant
90
Outline
.................................................

• Privacy: definitions and motivation

• Pseudonymisation
➢ Record-Linkage Attack

• Anonymisation
• k-anonymity

• l-diversity

• t-closeness

• Data watermarking and fingerprinting


91
Digital property protection: motivation
.................................................

• Why protect the data?


– Data owner used a lot of resources to collect/create the
data (money, human experts, time…)
– Sensitive data (e.g. medical data) needs to be shared
with researchers
• Privacy implications: only the trusted parties get the data and
should not share it further

• The goal: controlled data sharing


– Share full data
– Trace the unauthorised data re-distribution
92
Data fingerprinting and watermarking
.................................................

• Embedding owner‘s signature into the data


– Applying tailored modifications to the data which only
the owner is able to extract

Original:                          Marked:
Age  Blood Pressure  Diabetes      Age  Blood Pressure  Diabetes
32   64              1             33   64              1
31   66              0             31   68              0
50   72              1             50   72              1
48   70              0             47   70              0

93
Watermarking vs. fingerprinting
.................................................

Watermark: identifies the owner Fingerprint: owner & recipient

94
Fingerprinting – (a bad) example
.................................................

https://twitter.com/pnikosis/status/1592823543498436611
The workflow
.................................................

95
The schemes: fingerprinting
.................................................

• Owner‘s secret key used for


– Fingerprint creation
– Embedding pattern
• Create distinct fingerprint for each data recipient
– Fingerprint = bitstring (output of a hash function seeded by
the secret key)
• Embed the fingerprint bits following the embedding
pattern:
– Pseudorandom number generator seeded by the secret key
outputs the locations in the dataset to be modified with
fingerprint bits

• Fingerprint extraction: reverse insertion (possible only


by knowing the secret key!)
96
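A minimal sketch of this scheme under simplifying assumptions (least-significant-bit marking, SHA-256 for the fingerprint, Python’s random module as the PRNG; all function and variable names are ours). A real scheme would restrict marks to attributes that tolerate small perturbations:

```python
import hashlib
import random

def embed_fingerprint(data, secret_key, recipient, n_marks=8):
    """Fingerprint = hash of (secret key, recipient); embedding
    positions come from a PRNG seeded with the secret key, so only
    the owner can locate and extract the marks."""
    fp = hashlib.sha256((secret_key + recipient).encode()).digest()
    fp_bits = [(fp[i // 8] >> (i % 8)) & 1 for i in range(n_marks)]

    marked = [row[:] for row in data]
    rng = random.Random(secret_key)   # embedding pattern
    for bit in fp_bits:
        r = rng.randrange(len(marked))
        c = rng.randrange(len(marked[0]))
        marked[r][c] = (marked[r][c] & ~1) | bit   # set LSB to the fp bit
    return marked

table = [[32, 64, 1], [31, 66, 0], [50, 72, 1], [48, 70, 0]]
print(embed_fingerprint(table, secret_key="owner-secret", recipient="alice"))
```

Extraction reverses the process: re-seed the PRNG with the secret key, read the LSBs at the same positions, and compare against each recipient’s fingerprint.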
Robustness vs utility
.................................................

• Robustness against attacks -> maximise modifications


• Preserve data utility! -> minimise modifications
– Trade-off!

97
Watermarking ML/DL models
.................................................

• Protecting the ownership of ML/DL models


• The same idea: embed the owner’s signature
into the model
– E.g. modify decision boundary of DNN by learning
specifically tailored input data (adversarial input)

• More about this later in adversarial ML lecture! ☺

98
WM & FP: Conclusions
.................................................

• Watermarking and fingerprinting allow sharing the


data with a possibility of:
– Ownership verification
– Identification of unauthorised usage of data (only
fingerprint)
• Requires modifying the data
• Robustness of a fingerprint vs. data utility:
– Stronger fingerprints decrease the utility more

99
.................................................

Questions?

100
