
Data Science & Ethics

Lecture 4

Data Preprocessing

Prof. David Martens


[email protected]
www.applieddatamining.com
@ApplDataMining
AI Ethics in the News
2
“We ask that all universities make the data on their students
public, but to anonymize the names of the students by hashing
them and not to include home address or other personal
information in the dataset. For each student, we want the
following fields to be included in the dataset: a hashed version of
the student’s name, the courses he or she enrolled in, his/her
grades on these courses, days of absence in 2020 due to COVID-19,
study program, nationality, date of birth, postal code and gender.
In that way social science research can be moved forward, by finding
patterns in this data, and universities could benefit from the
discovered insights.” 3
“Who will end up in good positions?”

4
5
Ethical Data Preprocessing

6
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring fairness of the data
• Methods to make the data fair
Simply removing the name does not anonymize the data
Also, removing the sex attribute does not prevent gender discrimination.

▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
• Discussion Case: online re-identification

7
Input Selection
▪ Privacy
• Personal data, cf. Ethical Data Gathering
• “Why not simply remove personal attributes?”
➢ Proxies! Other correlated variables

▪ Discrimination against sensitive groups


• “Why not simply remove sensitive attributes?”
➢ Proxies!
➢ Gender correlated with?
➢ Race correlated with?
➢ Sexual orientation correlated with?

8
Proxies for discrimination
▪ Against sensitive groups
▪ Simply removing sensitive data is not enough
• Proxies: Redlining

The HOLC maps are part of the records of the FHLBB (RG195) at
the National Archives II
The Color of Money
Bill Dedman 1988
http://powerreporting.com/color/
Data Preprocessing for non-discrimination
▪ Proxies of sensitive attributes
• Removing sensitive attributes won’t remove bias from the data

▪ Measuring Bias
• Statistical Parity or Dependence
➢ P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
• Disparate Impact
➢ P(Ypred = Pos | Sens = 0) / P(Ypred = Pos | Sens = 1)

Sens: “Sex”
Dependence = 80% - 40% = 40%
➔ A woman is 40% less likely (in absolute numbers) to be accepted for a job than a man

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 10
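The two measures above can be computed directly from a labeled dataset. Below is a minimal sketch in Python/pandas (not from the paper); the column names y_pred and sens and the toy numbers are illustrative, chosen to reproduce the 80% vs 40% example.

```python
# Minimal sketch of computing statistical parity (dependence) and disparate
# impact with pandas. The column names y_pred (1 = positive decision) and
# sens (0/1 sensitive group) are illustrative, not from the paper.
import pandas as pd

df = pd.DataFrame({
    "sens":   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # e.g. 0 = male, 1 = female
    "y_pred": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0],   # hiring decision
})

p_pos_s0 = df.loc[df.sens == 0, "y_pred"].mean()   # P(Ypred = Pos | Sens = 0)
p_pos_s1 = df.loc[df.sens == 1, "y_pred"].mean()   # P(Ypred = Pos | Sens = 1)

dependence = p_pos_s0 - p_pos_s1                   # statistical parity difference
disparate_impact = p_pos_s0 / p_pos_s1             # ratio version

print(f"Dependence: {dependence:.0%}")             # 80% - 40% = 40% in this toy data
print(f"Disparate impact: {disparate_impact:.2f}")
```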
Data Preprocessing for non-discrimination
▪ Several preprocessing methods to get fairness
• Methods need access to sensitive attribute!
• Multi-objective goal:
➢ High accuracy: minimal relabeling (minimal effect on accuracy)
➢ Low discrimination: no more dependence:
P(Ypred = Pos | Sens = 0) = P(Ypred = Pos | Sens = 1)

▪ Notations:
• +: the desirable positive class (high income, obtaining loan, getting hired, etc.)
• S: the sensitive attribute
• S = w: white (or S = 0)
• S = b: black (or S = 1)

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 11
Data Preprocessing for non-discrimination
▪ Three methods
• Massaging: changing the class labels
• Reweighing: assign weights to data instances
• Sampling: re-sample the dataset

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 12
Massaging
▪ Relabeling
• Some instances with the sensitive attribute value (S = b): from – to +
• Some instances without the sensitive attribute value (S = w): from + to –
(Note: change some men from + to – and some women from – to +, based on their score, i.e. their probability of getting credit.)

▪ Which ones?
• Closest to the decision boundary: not sure anyway

▪ How many?
• Such that no more discrimination

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 13
Massaging
▪ How many?
• Such that no more discrimination
• Discrimination = P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
                 = pw / |Dw| - pb / |Db|
  where pw (pb) is the number of + labels in the favoured group Dw (deprived group Db)
• If we relabel M data points:
  ▪ change M men (Sens = 0) from the + class to the – class
  ▪ change M women (Sens = 1) from the – class to the + class
• New discrimination = (pw - M) / |Dw| - (pb + M) / |Db| = 0
  ==> M = (pw · |Db| - pb · |Dw|) / (|Dw| + |Db|)
• If M is not an integer, round up (slight negative discrimination); see the code sketch below

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 14
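A minimal sketch of the computation of M, assuming the counts pw, pb and the group sizes are already known; the function name and the toy counts are illustrative, not from the paper.

```python
# Minimal sketch of the "how many to relabel" computation, following the
# zero-discrimination condition above. p_w, p_b are the numbers of positive
# labels in the favoured (w) and deprived (b) groups; n_w, n_b the group sizes.
import math

def n_relabel(p_w: int, n_w: int, p_b: int, n_b: int) -> int:
    """Number of demotions/promotions M so that p_w/n_w == p_b/n_b afterwards."""
    m = (p_w * n_b - p_b * n_w) / (n_w + n_b)
    return math.ceil(m)          # round up: slight negative discrimination

# Toy example: 4 of 5 men positive, 2 of 5 women positive
print(n_relabel(p_w=4, n_w=5, p_b=2, n_b=5))   # -> 1: demote 1 man, promote 1 woman
```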
Massaging
▪ How many?

▪ Demote 1 and promote 1

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 15
Massaging
▪ How many?
• M=1
▪ Which one?
• Closest to the decision boundary / where the model is most uncertain
  – Demote: the male (Sens = 0, class +) with the lowest probability of being +
  + Promote: the female (Sens = 1, class –) with the lowest probability of being –
    (i.e. the highest probability of being +)

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 16
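Putting the two steps together, a massaging sketch could look as follows. It assumes a ranker (e.g. a logistic regression trained on the biased data) has already produced a score column with the probability of the + class; all column names are illustrative.

```python
# Minimal sketch of the massaging step: rank candidates by the ranker's
# probability of the + class, then demote the M "least certain" positives of
# the favoured group and promote the M "most promising" negatives of the
# deprived group. Column names (score, sens, y) are illustrative.
import pandas as pd

def massage(df: pd.DataFrame, m: int) -> pd.DataFrame:
    df = df.copy()
    # Favoured group (sens = 0) positives with the lowest score: demote
    demote = (df[(df.sens == 0) & (df.y == 1)]
              .sort_values("score").head(m).index)
    # Deprived group (sens = 1) negatives with the highest score: promote
    promote = (df[(df.sens == 1) & (df.y == 0)]
               .sort_values("score", ascending=False).head(m).index)
    df.loc[demote, "y"] = 0
    df.loc[promote, "y"] = 1
    return df
```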
Reweighing
▪ Weight measures how different the expected probability is from the observed one: Weight = P(S = s) x P(Class = c) / P(S = s, Class = c) (see the code sketch below)

• Instances 1 – 4 have Sex = M and Class = +


➢ Expected: P(Sex = M) x P(Class = +) = 0,5 x 0,6 = 0,3
➢ Observed: P(Sex = M, Class = +) = 0,4
➢ Weight = 0,75
➢ Positive outcomes and men: have more than enough, lower weight
• Instances 8, 10 have Sex = F and Class = + ➔ Weight = 1,5
➢ Positive outcomes and women: too few, higher weight
• Similar for the four other instances (5-7, 9). What are the weights there?
17
Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. KIS, 33(1), 1-33.
Reweighing
• Requires that the classification algorithm can work with weighted data instances

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 18
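A minimal sketch of the weight computation and of passing the weights to a learner that accepts sample weights (here scikit-learn's decision tree, purely as an example); column names are illustrative.

```python
# Minimal sketch of reweighing: weight = expected / observed joint probability
# of (sensitive value, class), computed per instance.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def reweigh(df: pd.DataFrame, sens: str = "sex", target: str = "y") -> pd.Series:
    p_sens  = df[sens].value_counts(normalize=True)            # P(S = s)
    p_class = df[target].value_counts(normalize=True)          # P(Class = c)
    p_joint = df.groupby([sens, target]).size() / len(df)      # P(S = s, Class = c)
    expected = df.apply(lambda r: p_sens[r[sens]] * p_class[r[target]], axis=1)
    observed = df.apply(lambda r: p_joint[(r[sens], r[target])], axis=1)
    return expected / observed                                 # weight per instance

# Any learner that accepts sample_weight can then use the weights, e.g.:
# weights = reweigh(df)
# DecisionTreeClassifier().fit(df[features], df["y"], sample_weight=weights)
```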
Sampling
▪ Go from weights to frequencies
• Higher weight: more likely to be sampled
• Lower weights: less likely to be sampled

▪ How many to sample?


• | Sex = M and Class = + | = 4, weight = 0,75 ➔ Sample 3
• | Sex = F and Class = + | = 2 , weight = 1,50 ➔ Sample 3
• Similar for the negative class

▪ How to sample?
• Uniform
• Preferential: take scores into account once more
➢ Close to the decision boundary: more likely to be over- or undersampled

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 19
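A minimal sketch of the uniform-sampling variant, reusing the reweighing idea: each (sensitive value, class) group is resampled to its weighted size. Preferential sampling, which also uses the ranker scores, is not shown; column names are illustrative.

```python
# Minimal sketch of uniform resampling from the reweighing weights: the target
# size of each (sensitive value, class) group is its current size times its
# weight, and the group is then up- or down-sampled to that size.
import pandas as pd

def resample(df: pd.DataFrame, sens: str = "sex", target: str = "y",
             seed: int = 0) -> pd.DataFrame:
    p_sens  = df[sens].value_counts(normalize=True)
    p_class = df[target].value_counts(normalize=True)
    parts = []
    for (s, c), group in df.groupby([sens, target]):
        weight = p_sens[s] * p_class[c] / (len(group) / len(df))
        n = round(len(group) * weight)          # e.g. 4 x 0.75 = 3, 2 x 1.5 = 3
        parts.append(group.sample(n=n, replace=n > len(group), random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```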
Data Preprocessing for non-discrimination
▪ Experiments
• German credit
➢ Y: default or not
➢ Sensitive: Sex
• Census income
➢ Y: makes over 50K a year or not
➢ Sensitive: Sex
• Communities and crime
➢ Y: minor or major violent community
➢ Sensitive: Black neighborhood
• Dutch 2001 census
➢ Y: High or low level occupation
➢ Sensitive: Sex
1. Removing sensitive attribute does not remove discrimination! (Proxies)

Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 20
Data Preprocessing for non-discrimination
▪ Experiments
2. Similar results for other datasets and techniques

▪ Conclusions
• Trade-off between accuracy and discrimination (continuum!)
• Massaging and preferential sampling (PS) seem to be the best methods
• Data preprocessing setup allows for use with any classification technique
• Just removing sensitive attribute is not enough
• Sometimes some discrimination can be allowed if it can be explained
➢ For example: have “number of crashes before” as input variable to predict insurance risk,
even if correlated with sex
Kamiran, F. and Calders, T. (2012) Data preprocessing techniques for classification without discrimination. Knowledge and
Information Systems, 33(1), 1-33. 21
Defining Target Variable
▪ Difficult, be transparent
• ‘a target variable must reflect judgements about what really is the problem
at issue.’ (Barocas and Selbst, 2016)
▪ Bias
• Historically biased against sensitive groups
• Predict who to hire, “good” candidates
➢ Y based on hires?
• Predict who to promote, “good” employees
➢ Y based on chosen metrics such as years at the company, sales, productivity?
➢ Y based on more holistic metrics, using evaluation reviews, complaints?
➢ Subjective choice and measurement of Y can lead to bias
➢ If turnover systematically higher for certain sensitive groups, the predictive model
will include this bias as well
• Predict who to accept to university, “good candidates”
➢ Y based on average test score?
➢ College graduation rates: white students 67%, black students 51%
▪ Should we predict this?
• “Predictive privacy harms” (Crawford and Schultz): sensitive or revealing 22
Defining Target Variable

23
Biased Language
▪ Word embeddings
• Reduction techniques: from words to vectors
• Similar words should be “close” to each other
▪ Bolukbasi et al (2016)
• Google News data

24
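A minimal sketch of probing such gender associations, in the spirit of Bolukbasi et al. (2016) but much simplified: it uses a single he–she direction rather than the paper's gender subspace, and assumes the pretrained Google News vectors can be loaded through gensim's downloader (a large download).

```python
# Minimal sketch of measuring gender-direction bias in word embeddings.
# Assumes gensim and its downloadable "word2vec-google-news-300" vectors;
# this is an illustration, not the authors' exact pipeline.
import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")    # Google News embeddings (large download)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

gender_direction = kv["he"] - kv["she"]      # crude one-pair gender axis

for word in ["nurse", "surgeon", "housewife", "shopkeeper", "architect"]:
    score = cosine(kv[word], gender_direction)
    # score > 0: closer to "he"; score < 0: closer to "she"
    print(f"{word:12s} {score:+.3f}")
```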
Biased Language
▪ Bolukbasi et al (2016), other bias stereotypes:
• nurse/surgeon
• housewife/shopkeeper
• interior designer/architect
• diva/superstar
▪ Gender bias in word embeddings
• http://wordbias.umiacs.umd.edu/

25
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring
• Methods
▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
▪ Discussion Case: online re-identification

26
Anonymizing Data
▪ Simply not use personal data?
▪ Can re-identify persons uniquely based on input data
• Cf. location data: NY Times able to identify Lisa Magrin: only person who
commutes daily from her house in upstate NY to the middle school where
she works
• Cf. Census data, Sweeney (MIT) found that 87% of the U.S. population can
be uniquely identified by the combination
<birthdate, sex, zip code>
➔ Can we change the dataset so that individuals cannot be re-identified,
while keeping the data useful? (Sweeney and Samarati, 1998)
➔ K-anonymity: a data set has k-anonymity if the information for each person
(data instance) cannot be distinguished from at least k-1 other individuals in
the dataset

27
Data Preprocessing for privacy
▪ How to measure level of anonymity?
• an equivalence class is defined as a set of records that have the same
values for the quasi-identifiers
• k-anonymity: the smallest equivalence class has size at least k (see the code sketch below)
• NP-hard optimisation problem but good approximation methods exist
▪ Several preprocessing methods to get privacy
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *

28
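A minimal sketch of measuring k: group the records on the quasi-identifiers and take the size of the smallest equivalence class. The toy data mirrors the table on the next slide; column names are illustrative.

```python
# Minimal sketch of measuring k-anonymity: the size of the smallest
# equivalence class over the quasi-identifiers.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "Age":       [41, 46, 22, 28, 29],
    "Gender":    ["M", "M", "F", "F", "F"],
    "ZIP":       [2000, 2600, 1000, 1020, 1000],
    "Diagnosis": ["Cancer", "HIV", "No illness", "Viral infection", "HIV"],
})
print(k_anonymity(df, ["Age", "Gender", "ZIP"]))   # 1: every record is unique
```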
Data Preprocessing for privacy
Name Age Gender ZIP code Diagnosis
Dirk Den 41 M 2000 Cancer
Eric Eel 46 M 2600 HIV
Fling Fan 22 F 1000 No illness
Geo Gen 28 F 1020 Viral infection
Han Hun 29 F 1000 HIV

Quasi-identifiers: attributes available to an adversary (here: Age, Gender, ZIP code)

29
Data Preprocessing for privacy
Name Age Gender ZIP code Diagnosis
Dirk Den 41 M 2000 Cancer
Eric Eel 46 M 2600 HIV
Fling Fan 22 F 1000 No illness
Geo Gen 28 F 1020 Viral infection
Han Hun 29 F 1000 HIV
If I know that Dirk is 41, a man and lives in Antwerp, then I don’t need his name
to observe he has cancer

Name Age Gender Province Diagnosis


* [40-50] M Antwerp Cancer
* [40-50] M Antwerp HIV
* [20-30] F Brussels No illness
* [20-30] F Brussels Viral infection
* [20-30] F Brussels HIV
2-anonymous wrt all non-sensitive attributes (all but diagnosis) 30
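A minimal sketch of the generalisation and suppression steps shown above: suppress the name, bin age into decades and coarsen the ZIP code to a province, then verify that the result is 2-anonymous. The ZIP-to-province mapping is a simplification for this toy table.

```python
# Minimal sketch of generalisation + suppression: suppress Name, bin Age,
# coarsen ZIP code to a province, then verify the resulting k.
import pandas as pd

df = pd.DataFrame({
    "Name":      ["Dirk Den", "Eric Eel", "Fling Fan", "Geo Gen", "Han Hun"],
    "Age":       [41, 46, 22, 28, 29],
    "Gender":    ["M", "M", "F", "F", "F"],
    "ZIP":       [2000, 2600, 1000, 1020, 1000],
    "Diagnosis": ["Cancer", "HIV", "No illness", "Viral infection", "HIV"],
})

anon = df.copy()
anon["Name"] = "*"                                                  # suppression
anon["Age"] = pd.cut(anon["Age"], bins=[20, 30, 40, 50],
                     labels=["[20-30]", "[30-40]", "[40-50]"]).astype(str)
anon["Province"] = ["Antwerp" if 2000 <= z < 3000 else "Brussels"
                    for z in anon.pop("ZIP")]                       # generalisation

quasi = ["Age", "Gender", "Province"]
print(anon.groupby(quasi).size().min())   # 2 -> the table is 2-anonymous
```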
Data Preprocessing for privacy
▪ Issues with k-anonymity
1. Homogeneity: if the sensitive attribute is the same for all k instances
➢ E.g.: I know Dirk is [40-50], male and from Antwerp ➔ Dirk has HIV!

Name Age Gender Province Diagnosis
* [40-50] M Antwerp HIV
* [40-50] M Antwerp HIV
* [20-30] F Brussels No illness
* [20-30] F Brussels Viral infection
* [20-30] F Brussels HIV
2-anonymous wrt all non-sensitive attributes (all but diagnosis)
Data Preprocessing for privacy
▪ Issues with k-anonymity
1. Homogeneity: if the sensitive attribute is the same for all k instances
2. Linkage attack: linking with an additional dataset or background knowledge
➢ E.g.: I know Fling is [20-30], female and from Brussels, and that she appears in both of these datasets (e.g. from 2 hospitals). Only HIV occurs in both equivalence classes ➔ Fling has HIV!

Name Age Gender Province Diagnosis
* [20-30] F Brussels HIV
* [20-30] F Brussels Cancer
* [20-30] F Brussels Heart attack
2-anonymous wrt all non-sensitive attributes (all but diagnosis)

Name Age Gender Province Diagnosis
* [40-50] M Antwerp Cancer
* [40-50] M Antwerp HIV
* [20-30] F Brussels No illness
* [20-30] F Brussels Viral infection
* [20-30] F Brussels HIV
2-anonymous wrt all non-sensitive attributes (all but diagnosis)
Online Re-identification
▪ Netflix Prize: re-identification based on movie ratings
• Dataset with 100 million ratings from 480k users on 17k movies
➢ Names removed
➢ Some ratings changed to fake ones
➢ Narayanan and Shmatikov: combined with data from IMDb to identify persons

➔ Now we know that johndoe90 also watched `Fahrenheit 9/11' on Netflix, potentially revealing his political preference, as
well as `Jesus of Nazareth' and `The Gospel of John', potentially revealing his religious preference.

▪ Netflix Prize 2?
• Lawsuit by Jane Doe, a lesbian Netflix user, whose sexual preference is not a
matter of public knowledge, “including at her children's school”.
• She watched movies in the Netflix category “Lesbian and Gay”
• The lawsuit was settled and the second Netflix Prize was canceled.
33
Online Re-identification
▪ Massachusetts made medical records of public servants public,
removed names: re-identification
➢ Sweeney (1997)
➢ Public medical records + voter registration records: only 1 person with that combination of birth date, sex and ZIP code
➢ Governor Weld: medical records exposed

34
Online Re-identification
▪ AOL research released dataset: re-identification based on search queries
• August 4, 2006
• 20 million search keywords
• 650,000 users
• Only personal identifiers were removed
▪ To foster academic research: to “embrace the vision of an open research community"

▪ De-identified?
• No users explicitly identified in the dataset
• Search queries could be used to identify persons
Online Re-identification
▪ AOL research released dataset
▪ De-identified?
• No users explicitly identified in the dataset
• Search queries could be used to identify persons
• NY Times exposed one of the users, with her explicit permission, as Thelma Arnold, a
62-year-old widow from Georgia, U.S.

Queries from her:


• “landscapers in Lilburn, Ga”
• “homes sold in shadow lake subdivision gwinnett county
Georgia”
• several people with the last name Arnolds
• “60 single men”
• “dog that urinates on everything”
• “school supplies for Iraq children”
“My goodness, it's my whole personal life"

https://www.nytimes.com/2006/08/09/technology/09aol.html
Online Re-identification
▪ AOL research released dataset
▪ De-identified?
• No users explicitly identified in the dataset
• Search queries could be used to identify persons
• NY Times exposed one of the users, with her explicit permission, as Thelma Arnold, a
62-year-old widow from Georgia, U.S.
▪ Revealed disturbing and sensitive thoughts
• “how to kill your wife”
• “fear that spouse contemplating cheating”
• Thousands of sexually oriented queries
▪ Data was removed three days after the release (but can still be found online)
• “It was a mistake, and we apologize.“
• AOL researcher fired, CTO resigned
▪ Use such data?
• “you don’t want to do research on tainted data.” Kleinberg

▪ Have a look at your activity: https://myactivity.google.com


• Would someone be able to identify you? Any sensitive information?
Online Re-identification
▪ re-identification based on location data
▪ Some apps on mobile phones use location data
• For example weather, driving directions, running or biking apps
• Interesting for advertising
• NY Times reporters looked into a 2017 database with data of over 1 million users from
New York area
• at least 75 companies reported to receive `anonymous' precise location data from
such apps
▪ De-identified?
• Re-identified Lisa Magrin: 46-year-old math teacher
• Locations:
➢ Daily commute from house in upstate NY to middle school where she works
➢ Weight Watchers meeting
➢ Former boyfriend’s home
➢ Doctor
▪ Could easily find: work and home locations of all attendees or employees of: AA
meetings, military bases, churches, Planned Parenthood clinics, etc.
▪ Similar revelations followed from 170 million individual taxi trips in NYC.
https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html
Data Preprocessing for privacy
▪ How to measure level of anonymity?
• K-anonymity
▪ Several preprocessing methods to get privacy
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *
• Adding noise
▪ Continuum!

[Privacy–accuracy continuum: differential privacy (decentralized) → differential privacy (centralized) → k-anonymity → removing personal identifiers; privacy decreases and accuracy increases from left to right]

39
l-diversity
▪ Issue with k-anonymity: does not consider the sensitive variable
• Homogeneity attack: if the sensitive attribute is the same for all k instances

▪ Solution: l-diversity
• Also maintain diversity of the sensitive values
• Equivalence class: set of records that have the same values for the quasi-identifiers
• An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive variable
• A dataset is l-diverse if every equivalence class of the dataset has l-diversity
• Well-represented?
➢ distinct l-diversity: easiest version, requiring at least l distinct values (see the code sketch below)
40
Machanavajjhala et al. (2007)
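A minimal sketch of checking distinct l-diversity (the easiest variant above): per equivalence class, count the distinct sensitive values and take the minimum over all classes. The toy table mirrors the example on the next slide.

```python
# Minimal sketch of checking distinct l-diversity: for every equivalence class
# (same quasi-identifier values), count the distinct sensitive values and take
# the minimum.
import pandas as pd

def distinct_l_diversity(df: pd.DataFrame, quasi_identifiers: list[str],
                         sensitive: str) -> int:
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

table = pd.DataFrame({
    "Age":       ["[40-50]", "[40-50]", "[20-30]", "[20-30]", "[20-30]"],
    "Province":  ["Flanders"] * 5,
    "Diagnosis": ["HIV", "Viral infection", "Covid-19", "Bronchitis", "Pneumonia"],
})
print(distinct_l_diversity(table, ["Age", "Province"], "Diagnosis"))   # 2
```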
l-diversity
▪ l-diversity
• An equivalence class is said to have l-diversity: if there are at least
l “well-represented” values for the sensitive attribute.
Name Age Gender Province Diagnosis
* [40-50] * Flanders HIV
* [40-50] * Flanders Viral infection
* [20-30] * Flanders Covid-19
* [20-30] * Flanders Bronchitis
* [20-30] * Flanders Pneumonia

k = 2 (the smallest equivalence class has 2 records), l = 2 (each equivalence class has at least 2 distinct diagnoses). The higher k and l, the better the privacy.

[Privacy–utility continuum: l-diversity (high l … low l) → k-anonymity (high k … low k) → removing personal identifiers]
41
l-diversity
▪ Problem solved?
• Skewness attack:
➢ one class is much more unlikely (skewness in the class distribution) and is more sensitive
➢ For example: positive for a virus test (0,1%) is more sensitive than negative
• Similarity attack:
➢ similarity between the sensitive values

Name Age Gender Province Diagnosis
* [40-50] * Flanders HIV
* [40-50] * Flanders Viral infection
* [20-30] * Flanders Covid-19
* [20-30] * Flanders Bronchitis
* [20-30] * Flanders Pneumonia

➢ Skewness: in the first equivalence class the adversary is 50% certain of HIV, vs 0,1% in the overall population
➢ Similarity: even though the diagnoses in the second equivalence class are different, they are all lung-related, so we know everyone in that class has a lung illness
42
t-closeness
▪ Limit the difference between the distribution of the sensitive attribute per
equivalence class and in the overall population
▪ An equivalence class is said to have t-closeness: if the distance between
1) the distribution of a sensitive attribute in this class
2) the distribution of the attribute in the whole table
is no more than a threshold t (see the code sketch below)
▪ A dataset is t-close if every equivalence class of the dataset has t-closeness
▪ Several possibilities to measure “closeness”
▪ Hard to implement

[Privacy–utility continuum: t-closeness (low t … high t) → l-diversity (high l … low l) → k-anonymity (high k … low k) → removing personal identifiers]
43
Li et al. (2007)
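A minimal sketch of checking t-closeness, using the total variation distance as one simple choice of “closeness” (Li et al. discuss the Earth Mover's Distance); column names follow the toy tables in these slides.

```python
# Minimal sketch of checking t-closeness: the worst-case distance between the
# sensitive-attribute distribution in an equivalence class and in the whole
# table, here measured with the total variation distance.
import pandas as pd

def t_closeness(df: pd.DataFrame, quasi_identifiers: list[str],
                sensitive: str) -> float:
    overall = df[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        local = group[sensitive].value_counts(normalize=True)
        dist = 0.5 * (local.reindex(overall.index, fill_value=0) - overall).abs().sum()
        worst = max(worst, dist)
    return worst          # the dataset is t-close for any t >= this value

# e.g. t_closeness(table, ["Age", "Province"], "Diagnosis") on the toy table above
```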
t-closeness
▪ Reminder: Obtaining
• k-anonymity
• l-diversity
• t-closeness
by preprocessing methods to get privacy: grouping and suppressing

[Privacy–utility continuum: t-closeness (low t … high t) → l-diversity (high l … low l) → k-anonymity (high k … low k) → removing personal identifiers]

44
Presentation and Paper Ideas
▪ Other methods to remove discrimination or improve
privacy in data preprocessing
• https://towardsdatascience.com/algorithmic-solutions-to-
algorithmic-bias-aef59eaf6565
▪ Re-identification based on other types of data
▪ Redlining: additional history and current practices
▪ Applications of k-anonymity

45
