Data Science Ethics - Lecture 4 - Discrimination and Privacy in Data Preprocessing (Re-Identification) v2
Lecture 4
Data Preprocessing
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring fairness of the data
• Methods to make the data fair
Simply removing the name does not anonymize the data.
Also, removing the sex attribute does not prevent gender discrimination.
▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
• Discussion Case: online re-identification
Input Selection
▪ Privacy
• Personal data, cf. Ethical Data Gathering
• “Why not simply remove personal attributes?”
➢ Proxies! Other correlated variables
Proxies for discrimination
▪ Against sensitive groups
▪ Simply removing sensitive data is not enough
• Proxies: redlining
The HOLC maps are part of the records of the FHLBB (RG195) at
the National Archives II
The Color of Money
Bill Dedman 1988
http://powerreporting.com/color/
Data Preprocessing for non-discrimination
▪ Proxies of sensitive attributes
• Removing sensitive attributes won’t remove bias from the data
▪ Measuring Bias
• Statistical Parity or Dependence
➢ P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
• Disparate Impact
➢ P(Ypred = Pos | Sens = 0) / P(Ypred = Pos | Sens = 1)
Sens: “Sex”
Dependence = 80% - 40% = 40%
Kamiran, F. and Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1-33.
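As a minimal sketch (not from the slides), the two bias measures above can be computed directly from predictions. The column names "y_pred" and "sens" are hypothetical placeholders for a binary prediction and a binary sensitive attribute.

```python
# Sketch: statistical parity difference and disparate impact for binary
# predictions, assuming a pandas DataFrame with hypothetical columns
# "y_pred" (1 = positive outcome) and "sens" (0/1 sensitive group).
import pandas as pd

def fairness_metrics(df, pred_col="y_pred", sens_col="sens"):
    p_pos_0 = df.loc[df[sens_col] == 0, pred_col].mean()  # P(Ypred = Pos | Sens = 0)
    p_pos_1 = df.loc[df[sens_col] == 1, pred_col].mean()  # P(Ypred = Pos | Sens = 1)
    return {
        "statistical_parity_diff": p_pos_0 - p_pos_1,  # 0 under independence
        "disparate_impact": p_pos_0 / p_pos_1,         # 1 under independence
    }

# Toy data reproducing the slide's example: 80% vs 40% positive rate
toy = pd.DataFrame({
    "sens":   [0] * 10 + [1] * 10,
    "y_pred": [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6,
})
print(fairness_metrics(toy))  # difference = 0.4, ratio = 2.0
```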
Data Preprocessing for non-discrimination
▪ Several preprocessing methods to get fairness
• Methods need access to sensitive attribute!
• Multi-objective goal:
➢ High accuracy: minimal relabeling (minimal effect on accuracy)
➢ Low discrimination: no more dependence:
P(Ypred = Pos | Sens = 0) = P(Ypred = Pos | Sens = 1)
▪ Notations:
• +: the desirable positive class (high income, obtaining a loan, getting hired, etc.)
• S: the sensitive attribute
• S = w: white (or S = 0)
• S = b: black (or S = 1)
Data Preprocessing for non-discrimination
▪ Three methods
• Massaging: changing the class labels
• Reweighing: assign weights to data instances
• Sampling: re-sample the dataset
Massaging
▪ Relabeling
Change some men to the negative class and some women to the positive class, but based on their score (probability of getting credit).
▪ Which ones?
• Closest to the decision boundary: not sure anyway
▪ How many?
• Such that no more discrimination
Massaging
▪ How many?
• Such that no more discrimination
• Discrimination = P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
= p_w / |D_w| - p_b / |D_b|
(p_w, p_b: number of positive labels in the Sens = 0 and Sens = 1 groups; |D_w|, |D_b|: sizes of those groups)
• If we relabel M datapoints:
▪ change M men (Sens = 0) from the + class to the - class
▪ change M women (Sens = 1) from the - class to the + class
Discrimination = (p_w - M) / |D_w| - (p_b + M) / |D_b| = 0
==> M = (p_w·|D_b| - p_b·|D_w|) / (|D_w| + |D_b|) = Discrimination · |D_w|·|D_b| / (|D_w| + |D_b|)
Massaging
▪ How many?
• M=1
▪ Which one?
• Closest to the decision boundary / where the model is most uncertain
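The relabeling step above can be sketched in a few lines. This is an illustration of the idea, not the paper's exact implementation; the column names ("y", "sens", "score") are hypothetical, and "score" is assumed to be P(y = 1 | x) from any probabilistic classifier (e.g. logistic regression) trained on the data.

```python
# Sketch of massaging: relabel the M most borderline instances so that the
# dependence between the sensitive attribute and the label disappears.
# Assumes hypothetical columns "y" (0/1 label), "sens" (0 = favoured group,
# 1 = deprived group) and "score" = estimated P(y = 1 | x).
import pandas as pd

def massage(df, label_col="y", sens_col="sens", score_col="score"):
    df = df.copy()
    fav, dep = df[df[sens_col] == 0], df[df[sens_col] == 1]
    disc = fav[label_col].mean() - dep[label_col].mean()
    # M = Discrimination * |D_w| * |D_b| / (|D_w| + |D_b|), rounded
    M = int(round(disc * len(fav) * len(dep) / (len(fav) + len(dep))))
    # Demote: positives in the favoured group with the lowest scores
    demote = fav[fav[label_col] == 1].nsmallest(M, score_col).index
    # Promote: negatives in the deprived group with the highest scores
    promote = dep[dep[label_col] == 0].nlargest(M, score_col).index
    df.loc[demote, label_col] = 0
    df.loc[promote, label_col] = 1
    return df
```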
Reweighing
▪ Assign each instance a weight that measures how different the expected probability of its (sensitive value, class) combination (if S and Class were independent) is from the observed probability:
W(x) = P_expected(S = s, Class = c) / P_observed(S = s, Class = c)
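A small sketch of the reweighing idea, with hypothetical column names; the weight (expected over observed probability of the (sensitive value, class) combination) follows the formula above.

```python
# Sketch of reweighing: weight each instance by how much more (or less) often
# its (sensitive value, class) combination would occur if the sensitive
# attribute and the class were independent.
import pandas as pd

def reweigh(df, label_col="y", sens_col="sens"):
    df = df.copy()
    n = len(df)
    p_sens = df[sens_col].value_counts(normalize=True)     # P(S = s)
    p_cls = df[label_col].value_counts(normalize=True)     # P(Class = c)
    p_obs = df.groupby([sens_col, label_col]).size() / n   # P(S = s, Class = c)
    # W(x) = P_expected(s, c) / P_observed(s, c) = P(s) * P(c) / P(s, c)
    df["weight"] = [
        p_sens[s] * p_cls[c] / p_obs[(s, c)]
        for s, c in zip(df[sens_col], df[label_col])
    ]
    return df
```

The resulting weights can then be passed to many classifiers, e.g. via the sample_weight argument of fit in scikit-learn.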
Sampling
▪ Go from weights to frequencies
• Higher weight: more likely to be sampled
• Lower weights: less likely to be sampled
▪ How to sample?
• Uniform
• Preferential: take scores into account once more
➢ Close to the decision boundary: more likely to be over- or under-sampled
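A sketch of the uniform variant, assuming the "weight" column produced by the reweighing sketch above. The cited paper computes expected group sizes per (sensitive value, class) combination and duplicates or drops instances accordingly, so this weight-proportional bootstrap is only an approximation of the idea.

```python
# Sketch of (uniform) sampling: draw a dataset of the same size in which an
# instance's chance of being picked is proportional to its reweighing weight,
# so over-represented (sens, class) combinations are under-sampled and
# under-represented ones are over-sampled. Preferential sampling would in
# addition prefer to duplicate or drop instances close to the decision boundary.
import pandas as pd

def resample_uniform(df_weighted, weight_col="weight", random_state=0):
    return df_weighted.sample(
        n=len(df_weighted),
        replace=True,
        weights=df_weighted[weight_col],
        random_state=random_state,
    )
```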
Data Preprocessing for non-discrimination
▪ Experiments
• German credit
➢ Y: default or not
➢ Sensitive: Sex
• Census income
➢ Y: makes over 50K a year or not
➢ Sensitive: Sex
• Communities and crime
➢ Y: low or high rate of violent crime in the community
➢ Sensitive: Black neighborhood
• Dutch 2001 census
➢ Y: High or low level occupation
➢ Sensitive: Sex
1. Removing sensitive attribute does not remove discrimination! (Proxies)
Data Preprocessing for non-discrimination
▪ Experiments
2. Similar results for other datasets and techniques
▪ Conclusions
• Trade-off between accuracy and discrimination (continuum!)
• Massaging and preferential sampling (PS) seem the best methods
• Data preprocessing setup allows for use with any classification technique
• Just removing sensitive attribute is not enough
• Sometimes some discrimination can be allowed if it can be explained
➢ For example: have “number of crashes before” as input variable to predict insurance risk,
even if correlated with sex
Defining Target Variable
▪ Difficult, be transparent
• ‘a target variable must reflect judgements about what really is the problem
at issue.’ (Barocas and Selbst, 2016)
▪ Bias
• Historically biased against sensitive groups
• Predict who to hire, “good” candidates
➢ Y based on hires?
• Predict who to promote, “good” employees
➢ Y based on chosen metrics as years at company, sales, productivity?
➢ Y based on more holistic metrics, using evaluation reviews, complaints?
➢ Subjective choice and measurement of Y can lead to bias
➢ If turnover systematically higher for certain sensitive groups, the predictive model
will include this bias as well
• Predict who to accept to university, “good candidates”
➢ Y based on average test score?
➢ College graduation rates white students: 67%, black students: 51%
▪ Should we predict this?
• “Predictive privacy harms” (Crawford and Schultz): sensitive or revealing
Biased Language
▪ Word embeddings
[Figure: word embeddings, reduction techniques]
Biased Language
▪ Bolukbasi et al (2016), other bias stereotypes:
• nurse/surgeon
• housewife/shopkeeper
• interior designer/architect
• diva/superstar
▪ Gender bias in word embeddings
• http://wordbias.umiacs.umd.edu/
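A quick probe of this bias, assuming gensim and its downloadable "word2vec-google-news-300" vectors are available. This illustrates the direction-projection idea only, not Bolukbasi et al.'s exact method (which builds a gender subspace with PCA and then debiases).

```python
# Project occupation words onto the "she - he" direction: positive values lean
# "female", negative lean "male" along this single axis.
import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pre-trained word2vec vectors
gender_dir = kv["she"] - kv["he"]
gender_dir /= np.linalg.norm(gender_dir)

for word in ["nurse", "surgeon", "housewife", "shopkeeper", "architect"]:
    vec = kv[word] / np.linalg.norm(kv[word])
    print(f"{word:12s} {vec @ gender_dir:+.3f}")
```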
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring
• Methods
▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
▪ Discussion Case: online re-identification
Anonymizing Data
▪ Simply not use personal data?
▪ Can re-identify persons uniquely based on input data
• Cf. location data: NY Times able to identify Lisa Magrin: only person who
commutes daily from her house in upstate NY to the middle school where
she works
• Cf. Census data, Sweeney (MIT) found that 87% of the U.S. population can
be uniquely identified by the combination
<birthdate, sex, zip code>
➔ Can we change the dataset so that individuals cannot be re-identified,
while keeping the data useful? (Sweeney and Samarati, 1998)
➔ K-anonymity: a data set has k-anonymity if the information for each person
(data instance) cannot be distinguished from at least k-1 other individuals in
the dataset
Data Preprocessing for privacy
▪ How to measure level of anonymity?
• an equivalence class is defined as a set of records that have the same
values for the quasi-identifiers
• k-anonymity: k is the size of the smallest equivalence class (every record is indistinguishable from at least k-1 others)
• Finding an optimal k-anonymisation is NP-hard, but good approximation methods exist
▪ Several preprocessing methods to get privacy
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *
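A minimal sketch of measuring the anonymity level: the k of a table is the size of its smallest equivalence class. The toy rows mirror the generalised example on the following slides.

```python
# k-anonymity level = size of the smallest group of rows that share the same
# values for all quasi-identifiers.
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    return int(df.groupby(quasi_identifiers).size().min())

toy = pd.DataFrame({
    "Age":       ["[40-50]", "[40-50]", "[20-30]", "[20-30]", "[20-30]"],
    "Gender":    ["M", "M", "F", "F", "F"],
    "Province":  ["Antwerp", "Antwerp", "Brussels", "Brussels", "Brussels"],
    "Diagnosis": ["Cancer", "HIV", "No illness", "Viral infection", "HIV"],
})
print(k_anonymity(toy, ["Age", "Gender", "Province"]))  # -> 2
```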
Data Preprocessing for privacy
Name Age Gender ZIP code Diagnosis
Dirk Den 41 M 2000 Cancer
Eric Eel 46 M 2600 HIV
Fling Fan 22 F 1000 No illness
Geo Gen 28 F 1020 Viral infection
Han Hun 29 F 1000 HIV
If I know that Dirk is 41, a man and lives in Antwerp, then I don’t need his name
to observe he has cancer
2-anonymous wrt all non-sensitive attributes (all but diagnosis):
Name Age Gender Province Diagnosis
* [40-50] M Antwerp HIV
* [40-50] M Antwerp HIV
* [20-30] F Brussels No illness
* [20-30] F Brussels Viral infection
* [20-30] F Brussels HIV
Both records in Dirk's equivalence class have the same diagnosis, so Dirk has HIV! (homogeneity)
Data Preprocessing for privacy
▪ Issues with k-anonymity
1. Homogeneity: the sensitive attribute is the same for all k instances
2. Linkage attack: linking with additional dataset or knowledge
➢ Eg: I know Fling is [20-30], female and from Brussels, and she appears in both of these datasets (eg from 2 hospitals)
➢ Famous example: the “anonymized” Netflix Prize ratings could be linked to public IMDb reviews (Narayanan and Shmatikov), re-identifying users.
➔ Now we know that johndoe90 also watched ‘Fahrenheit 9/11’ on Netflix, potentially revealing his political preference, as well as ‘Jesus of Nazareth’ and ‘The Gospel of John’, potentially revealing his religious preference.
▪ Netflix Prize 2?
• Lawsuit by “Jane Doe”, a lesbian Netflix user, whose sexual preference is not a matter of public knowledge, “including at her children's school”.
• She watched movies in the Netflix category “Lesbian and Gay”
• Lawsuit settled and the second Netflix Prize was canceled.
Online Re-identification
▪ Massachusetts made medical records of public servants public,
removed names: re-identification
➢ Sweeney (1997)
➢ Public medical records + voter registration records: only 1 person with that combination of birthdate, sex and ZIP code
➢ Governor Weld: exposed medical records
Online Re-identification
▪ AOL research released a dataset: re-identification based on search queries
• August 4, 2006
• 20 million search keywords
• 650,000 users
• Personal identifiers simply removed
▪ To foster academic research: to “embrace the vision of an open research community"
▪ De-identified?
• No users explicitly identified in the dataset
• Search queries could be used to identify persons
• NY Times exposed one of the users, with her explicit permission, as Thelma Arnold, a 62-year-old widow from Georgia, U.S.
https://www.nytimes.com/2006/08/09/technology/09aol.html
▪ Revealed disturbing and sensitive thoughts
• “how to kill your wife”
• “fear that spouse contemplating cheating”
• Thousands of sexually oriented queries
▪ Data was removed three days after the release (but can still be found online)
• “It was a mistake, and we apologize.“
• AOL researcher fired, CTO resigned
▪ Use such data?
• “You don’t want to do research on tainted data.” (Jon Kleinberg)
l-diversity
▪ Issue with k-anonymity: does not consider sensitive variables
• Homogeneity attack: if the sensitive attribute is the same for all k instances
▪ Solution: l-diversity
• Also maintaining diversity of sensitive values
• Equivalence class: set of records that have the same values for the
quasi-identifiers
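Measuring the l-diversity level follows the same pattern as the k-anonymity sketch earlier: per equivalence class, count distinct sensitive values and take the minimum (a sketch, not a library routine).

```python
# l-diversity level = minimum number of distinct sensitive values within any
# equivalence class (group of rows sharing the quasi-identifier values).
import pandas as pd

def l_diversity(df, quasi_identifiers, sensitive):
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# On the toy table from the k-anonymity sketch, with sensitive = "Diagnosis",
# the male class has {Cancer, HIV} and the female class has
# {No illness, Viral infection, HIV}, so the table is 2-diverse.
```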
l-diversity
▪ Problem solved?
• Skewness attack:
➢ one class is much less likely (skewness in the class distribution) and more sensitive
➢ For example: testing positive for a virus (0.1%) is more sensitive than testing negative
• Similarity attack:
➢ Similarity between the sensitive values
Name Age Gender Province Diagnosis
* [40-50] * Flanders HIV
* [40-50] * Flanders Viral infection
➔ 50% certain of HIV, vs 0.1% in the overall population
[Figure: privacy/utility spectrum: t-closeness < l-diversity < k-anonymity < removing personal identifiers, ordered from most privacy to most utility; higher k, higher l and lower t mean more privacy but less utility]
Li et al. (2007)
t-closeness
▪ t-closeness: the distribution of the sensitive attribute within each equivalence class must be close (within a threshold t) to its distribution over the whole dataset (Li et al., 2007)
▪ Reminder: obtaining
• k-anonymity
• l-diversity
• t-closeness
by preprocessing methods to get privacy: grouping and suppressing
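A sketch of checking t-closeness for a categorical sensitive attribute. Li et al. (2007) use the Earth Mover's Distance; as a simplification this uses the total variation distance between each equivalence class's distribution of the sensitive attribute and the distribution over the whole table.

```python
# Worst-case distance between an equivalence class's sensitive-value
# distribution and the overall distribution; the table satisfies t-closeness
# for any t at or above the returned value.
import pandas as pd

def t_closeness(df, quasi_identifiers, sensitive):
    overall = df[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        local = group[sensitive].value_counts(normalize=True)
        diff = local.reindex(overall.index, fill_value=0) - overall
        worst = max(worst, 0.5 * diff.abs().sum())  # total variation distance
    return worst
```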
Presentation and Paper Ideas
▪ Other methods to remove discrimination or improve
privacy in data preprocessing
• https://towardsdatascience.com/algorithmic-solutions-to-algorithmic-bias-aef59eaf6565
▪ Re-identification based on other types of data
▪ Redlining: additional history and current practices
▪ Applications of k-anonymity