Data Science Ethics - Lecture 4 - Discrimination and Privacy in Data Preprocessing (Re-Identification) v2
Lecture 4
Data Preprocessing
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring fairness of the data
• Methods to make the data fair
Simply removing the name does not anonymize the data.
Also, removing the sex attribute does not prevent gender discrimination.
▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
• Discussion Case: online re-identification
Input Selection
▪ Privacy
• Personal data, cf. Ethical Data Gathering
• “Why not simply remove personal attributes?”
➢ Proxies! Other correlated variables
Proxies for discrimination
▪ Against sensitive groups
▪ Simply removing sensitive data is not enough
• Proxies: redlining
The HOLC maps are part of the records of the FHLBB (RG195) at
the National Archives II
The Color of Money
Bill Dedman 1988
http://powerreporting.com/color/
Data Preprocessing for non-discrimination
▪ Proxies of sensitive attributes
• Removing sensitive attributes won’t remove bias from the data
▪ Measuring Bias
• Statistical Parity or Dependence
➢ P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
• Disparate Impact
➢ P(Ypred = Pos | Sens = 0) / P(Ypred = Pos | Sens = 1)
Sens: “Sex”
Dependence = 80% - 40% = 40%
Kamiran, F. and Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1-33.
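As a minimal sketch (not from the slides), the two bias measures above can be computed directly from predictions. The column names "y_pred" and "sens" are hypothetical placeholders for a binary prediction and a binary sensitive attribute.

```python
# Sketch: statistical parity difference and disparate impact for binary
# predictions, assuming a pandas DataFrame with hypothetical columns
# "y_pred" (1 = positive outcome) and "sens" (0/1 sensitive group).
import pandas as pd

def fairness_metrics(df, pred_col="y_pred", sens_col="sens"):
    p_pos_0 = df.loc[df[sens_col] == 0, pred_col].mean()  # P(Ypred = Pos | Sens = 0)
    p_pos_1 = df.loc[df[sens_col] == 1, pred_col].mean()  # P(Ypred = Pos | Sens = 1)
    return {
        "statistical_parity_diff": p_pos_0 - p_pos_1,  # 0 under independence
        "disparate_impact": p_pos_0 / p_pos_1,         # 1 under independence
    }

# Toy data reproducing the slide's example: 80% vs 40% positive rate
toy = pd.DataFrame({
    "sens":   [0] * 10 + [1] * 10,
    "y_pred": [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6,
})
print(fairness_metrics(toy))  # difference = 0.4, ratio = 2.0
```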
Data Preprocessing for non-discrimination
▪ Several preprocessing methods to get fairness
• Methods need access to sensitive attribute!
• Multi-objective goal:
➢ High accuracy: minimal relabeling (minimal effect on accuracy)
➢ Low discrimination: no more dependence:
P(Ypred = Pos | Sens = 0) = P(Ypred = Pos | Sens = 1)
▪ Notations:
• +: the desirable positive class (high income, obtaining a loan, getting hired, etc.)
• S: the sensitive attribute
• S = w: white (or S = 0)
• S = b: black (or S = 1)
Data Preprocessing for non-discrimination
▪ Three methods
• Massaging: changing the class labels
• Reweighing: assign weights to data instances
• Sampling: re-sample the dataset
Massaging
▪ Relabeling
Change some men to the negative class and some women to the positive class, but based on their score (probability of getting credit).
▪ Which ones?
• Closest to the decision boundary: not sure anyway
▪ How many?
• Such that no more discrimination
Massaging
▪ How many?
• Such that no more discrimination
• Discrimination = P(Ypred = Pos | Sens = 0) - P(Ypred = Pos | Sens = 1)
= p_w / |D_w| - p_b / |D_b|
(p_w, p_b: number of positive labels in the Sens = 0 and Sens = 1 groups; |D_w|, |D_b|: sizes of those groups)
• If we relabel M datapoints:
▪ change M men (Sens = 0) from the + class to the - class
▪ change M women (Sens = 1) from the - class to the + class
Discrimination = (p_w - M) / |D_w| - (p_b + M) / |D_b| = 0
==> M = (p_w·|D_b| - p_b·|D_w|) / (|D_w| + |D_b|) = Discrimination · |D_w|·|D_b| / (|D_w| + |D_b|)
Massaging
▪ How many?
• M=1
▪ Which one?
• Closest to the decision boundary / where the model is most uncertain
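The relabeling step above can be sketched in a few lines. This is an illustration of the idea, not the paper's exact implementation; the column names ("y", "sens", "score") are hypothetical, and "score" is assumed to be P(y = 1 | x) from any probabilistic classifier (e.g. logistic regression) trained on the data.

```python
# Sketch of massaging: relabel the M most borderline instances so that the
# dependence between the sensitive attribute and the label disappears.
# Assumes hypothetical columns "y" (0/1 label), "sens" (0 = favoured group,
# 1 = deprived group) and "score" = estimated P(y = 1 | x).
import pandas as pd

def massage(df, label_col="y", sens_col="sens", score_col="score"):
    df = df.copy()
    fav, dep = df[df[sens_col] == 0], df[df[sens_col] == 1]
    disc = fav[label_col].mean() - dep[label_col].mean()
    # M = Discrimination * |D_w| * |D_b| / (|D_w| + |D_b|), rounded
    M = int(round(disc * len(fav) * len(dep) / (len(fav) + len(dep))))
    # Demote: positives in the favoured group with the lowest scores
    demote = fav[fav[label_col] == 1].nsmallest(M, score_col).index
    # Promote: negatives in the deprived group with the highest scores
    promote = dep[dep[label_col] == 0].nlargest(M, score_col).index
    df.loc[demote, label_col] = 0
    df.loc[promote, label_col] = 1
    return df
```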
Reweighing
▪ Assign each instance a weight that measures how different the expected probability of its (sensitive value, class) combination (if S and Class were independent) is from the observed probability:
W(x) = P_expected(S = s, Class = c) / P_observed(S = s, Class = c)
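A small sketch of the reweighing idea, with hypothetical column names; the weight (expected over observed probability of the (sensitive value, class) combination) follows the formula above.

```python
# Sketch of reweighing: weight each instance by how much more (or less) often
# its (sensitive value, class) combination would occur if the sensitive
# attribute and the class were independent.
import pandas as pd

def reweigh(df, label_col="y", sens_col="sens"):
    df = df.copy()
    n = len(df)
    p_sens = df[sens_col].value_counts(normalize=True)     # P(S = s)
    p_cls = df[label_col].value_counts(normalize=True)     # P(Class = c)
    p_obs = df.groupby([sens_col, label_col]).size() / n   # P(S = s, Class = c)
    # W(x) = P_expected(s, c) / P_observed(s, c) = P(s) * P(c) / P(s, c)
    df["weight"] = [
        p_sens[s] * p_cls[c] / p_obs[(s, c)]
        for s, c in zip(df[sens_col], df[label_col])
    ]
    return df
```

The resulting weights can then be passed to many classifiers, e.g. via the sample_weight argument of fit in scikit-learn.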
Sampling
▪ Go from weights to frequencies
• Higher weight: more likely to be sampled
• Lower weights: less likely to be sampled
▪ How to sample?
• Uniform
• Preferential: take scores into account once more
➢ Close to the decision boundary: more likely to be over- or under-sampled
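A sketch of the uniform variant, assuming the "weight" column produced by the reweighing sketch above. The cited paper computes expected group sizes per (sensitive value, class) combination and duplicates or drops instances accordingly, so this weight-proportional bootstrap is only an approximation of the idea.

```python
# Sketch of (uniform) sampling: draw a dataset of the same size in which an
# instance's chance of being picked is proportional to its reweighing weight,
# so over-represented (sens, class) combinations are under-sampled and
# under-represented ones are over-sampled. Preferential sampling would in
# addition prefer to duplicate or drop instances close to the decision boundary.
import pandas as pd

def resample_uniform(df_weighted, weight_col="weight", random_state=0):
    return df_weighted.sample(
        n=len(df_weighted),
        replace=True,
        weights=df_weighted[weight_col],
        random_state=random_state,
    )
```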
Data Preprocessing for non-discrimination
▪ Experiments
• German credit
➢ Y: default or not
➢ Sensitive: Sex
• Census income
➢ Y: makes over 50K a year or not
➢ Sensitive: Sex
• Communities and crime
➢ Y: low or high rate of violent crime in the community
➢ Sensitive: Black neighborhood
• Dutch 2001 census
➢ Y: High or low level occupation
➢ Sensitive: Sex
1. Removing sensitive attribute does not remove discrimination! (Proxies)
Data Preprocessing for non-discrimination
▪ Experiments
2. Similar results for other datasets and techniques
▪ Conclusions
• Trade-off between accuracy and discrimination (continuum!)
• Massaging and preferential sampling (PS) seem the best methods
• Data preprocessing setup allows for use with any classification technique
• Just removing sensitive attribute is not enough
• Sometimes some discrimination can be allowed if it can be explained
➢ For example: have “number of crashes before” as input variable to predict insurance risk,
even if correlated with sex
Defining Target Variable
▪ Difficult, be transparent
• ‘a target variable must reflect judgements about what really is the problem
at issue.’ (Barocas and Selbst, 2016)
▪ Bias
• Historically biased against sensitive groups
• Predict who to hire, “good” candidates
➢ Y based on hires?
• Predict who to promote, “good” employees
➢ Y based on chosen metrics as years at company, sales, productivity?
➢ Y based on more holistic metrics, using evaluation reviews, complaints?
➢ Subjective choice and measurement of Y can lead to bias
➢ If turnover systematically higher for certain sensitive groups, the predictive model
will include this bias as well
• Predict who to accept to university, “good candidates”
➢ Y based on average test score?
➢ College graduation rates white students: 67%, black students: 51%
▪ Should we predict this?
• “Predictive privacy harms” (Crawford and Schultz): sensitive or revealing
Biased Language
▪ Word embeddings
[Figure: word embeddings, reduction techniques]
Biased Language
▪ Bolukbasi et al (2016), other bias stereotypes:
• nurse/surgeon
• housewife/shopkeeper
• interior designer/architect
• diva/superstar
▪ Gender bias in word embeddings
• http://wordbias.umiacs.umd.edu/
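A quick probe of this bias, assuming gensim and its downloadable "word2vec-google-news-300" vectors are available. This illustrates the direction-projection idea only, not Bolukbasi et al.'s exact method (which builds a gender subspace with PCA and then debiases).

```python
# Project occupation words onto the "she - he" direction: positive values lean
# "female", negative lean "male" along this single axis.
import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pre-trained word2vec vectors
gender_dir = kv["she"] - kv["he"]
gender_dir /= np.linalg.norm(gender_dir)

for word in ["nurse", "surgeon", "housewife", "shopkeeper", "architect"]:
    vec = kv[word] / np.linalg.norm(kv[word])
    print(f"{word:12s} {vec @ gender_dir:+.3f}")
```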
Ethical Data Preprocessing
▪ Discrimination against sensitive groups
• Measuring
• Methods
▪ Privacy
• Measuring
• Methods to include privacy
• Defining Target Variable
▪ Discussion Case: online re-identification
Anonymizing Data
▪ Simply not use personal data?
▪ Can re-identify persons uniquely based on input data
• Cf. location data: NY Times able to identify Lisa Magrin: only person who
commutes daily from her house in upstate NY to the middle school where
she works
• Cf. Census data, Sweeney (MIT) found that 87% of the U.S. population can
be uniquely identified by the combination
<birthdate, sex, zip code>
➔ Can we change the dataset so that individuals cannot be re-identified,
while keeping the data useful? (Sweeney and Samarati, 1998)
➔ K-anonymity: a data set has k-anonymity if the information for each person
(data instance) cannot be distinguished from at least k-1 other individuals in
the dataset
Data Preprocessing for privacy
▪ How to measure level of anonymity?
• an equivalence class is defined as a set of records that have the same
values for the quasi-identifiers
• k-anonymity: k is the size of the smallest equivalence class (every record is indistinguishable from at least k-1 others)
• Finding an optimal k-anonymisation is NP-hard, but good approximation methods exist
▪ Several preprocessing methods to get privacy
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *
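A minimal sketch of measuring the anonymity level: the k of a table is the size of its smallest equivalence class. The toy rows mirror the generalised example on the following slides.

```python
# k-anonymity level = size of the smallest group of rows that share the same
# values for all quasi-identifiers.
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    return int(df.groupby(quasi_identifiers).size().min())

toy = pd.DataFrame({
    "Age":       ["[40-50]", "[40-50]", "[20-30]", "[20-30]", "[20-30]"],
    "Gender":    ["M", "M", "F", "F", "F"],
    "Province":  ["Antwerp", "Antwerp", "Brussels", "Brussels", "Brussels"],
    "Diagnosis": ["Cancer", "HIV", "No illness", "Viral infection", "HIV"],
})
print(k_anonymity(toy, ["Age", "Gender", "Province"]))  # -> 2
```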
Data Preprocessing for privacy
Name Age Gender ZIP code Diagnosis
Dirk Den 41 M 2000 Cancer
Eric Eel 46 M 2600 HIV
Fling Fan 22 F 1000 No illness
Geo Gen 28 F 1020 Viral infection
Han Hun 29 F 1000 HIV
If I know that Dirk is 41, a man and lives in Antwerp, then I don’t need his name
to observe he has cancer
2-anonymous wrt all non-sensitive attributes (all but diagnosis):
Name Age Gender Province Diagnosis
* [40-50] M Antwerp HIV
* [40-50] M Antwerp HIV
* [20-30] F Brussels No illness
* [20-30] F Brussels Viral infection
* [20-30] F Brussels HIV
Both records in Dirk's equivalence class have the same diagnosis, so Dirk has HIV! (homogeneity)
Data Preprocessing for privacy
▪ Issues with k-anonymity
1. Homogeneity: the sensitive attribute is the same for all k instances
2. Linkage attack: linking with additional dataset or knowledge
➢ Eg: I know Fling is [20-30], female and from Brussels, and she appears in both of these datasets (eg from 2 hospitals)
➢ Famous example: the “anonymized” Netflix Prize ratings could be linked to public IMDb reviews (Narayanan and Shmatikov), re-identifying users.
➔ Now we know that johndoe90 also watched ‘Fahrenheit 9/11’ on Netflix, potentially revealing his political preference, as well as ‘Jesus of Nazareth’ and ‘The Gospel of John’, potentially revealing his religious preference.
▪ Netflix Prize 2?
• Lawsuit by “Jane Doe”, a lesbian Netflix user, whose sexual preference is not a matter of public knowledge, “including at her children's school”.
• She watched movies in the Netflix category “Lesbian and Gay”
• Lawsuit settled and the second Netflix Prize was canceled.
Online Re-identification
▪ Massachusetts made medical records of public servants public,
removed names: re-identification
➢ Sweeney (1997)
➢ Public medical records + voter registration records: only 1 person with that combination of birthdate, sex and ZIP code
➢ Governor Weld: exposed medical records
Online Re-identification
▪ AOL research released a dataset: re-identification based on search queries
• August 4, 2006
• 20 million search keywords
• 650,000 users
• Personal identifiers simply removed
▪ To foster academic research: to “embrace the vision of an open research community"
▪ De-identified?
• No users explicitly identified in the dataset
• Search queries could be used to identify persons
• NY Times exposed one of the users, with her explicit permission, as Thelma Arnold, a 62-year-old widow from Georgia, U.S.
https://www.nytimes.com/2006/08/09/technology/09aol.html
▪ Revealed disturbing and sensitive thoughts
• “how to kill your wife”
• “fear that spouse contemplating cheating”
• Thousands of sexually oriented queries
▪ Data was removed three days after the release (but can still be found online)
• “It was a mistake, and we apologize.“
• AOL researcher fired, CTO resigned
▪ Use such data?
• “You don’t want to do research on tainted data.” (Jon Kleinberg)
l-diversity
▪ Issue with k-anonymity: does not consider sensitive variables
• Homogeneity attack: if the sensitive attribute is the same for all k instances
▪ Solution: l-diversity
• Also maintaining diversity of sensitive values
• Equivalence class: set of records that have the same values for the
quasi-identifiers
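Measuring the l-diversity level follows the same pattern as the k-anonymity sketch earlier: per equivalence class, count distinct sensitive values and take the minimum (a sketch, not a library routine).

```python
# l-diversity level = minimum number of distinct sensitive values within any
# equivalence class (group of rows sharing the quasi-identifier values).
import pandas as pd

def l_diversity(df, quasi_identifiers, sensitive):
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# On the toy table from the k-anonymity sketch, with sensitive = "Diagnosis",
# the male class has {Cancer, HIV} and the female class has
# {No illness, Viral infection, HIV}, so the table is 2-diverse.
```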
l-diversity
▪ Problem solved?
• Skewness attack:
➢ one class is much less likely (skewness in the class distribution) and more sensitive
➢ For example: testing positive for a virus (0.1%) is more sensitive than testing negative
• Similarity attack:
➢ Similarity between the sensitive values
Name Age Gender Province Diagnosis
* [40-50] * Flanders HIV
* [40-50] * Flanders Viral infection
➔ 50% certain of HIV, vs 0.1% in the overall population
[Figure: privacy/utility spectrum: t-closeness < l-diversity < k-anonymity < removing personal identifiers, ordered from most privacy to most utility; higher k, higher l and lower t mean more privacy but less utility]
Li et al. (2007)
t-closeness
▪ t-closeness: the distribution of the sensitive attribute within each equivalence class must be close (within a threshold t) to its distribution over the whole dataset (Li et al., 2007)
▪ Reminder: obtaining
• k-anonymity
• l-diversity
• t-closeness
by preprocessing methods to get privacy: grouping and suppressing
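A sketch of checking t-closeness for a categorical sensitive attribute. Li et al. (2007) use the Earth Mover's Distance; as a simplification this uses the total variation distance between each equivalence class's distribution of the sensitive attribute and the distribution over the whole table.

```python
# Worst-case distance between an equivalence class's sensitive-value
# distribution and the overall distribution; the table satisfies t-closeness
# for any t at or above the returned value.
import pandas as pd

def t_closeness(df, quasi_identifiers, sensitive):
    overall = df[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        local = group[sensitive].value_counts(normalize=True)
        diff = local.reindex(overall.index, fill_value=0) - overall
        worst = max(worst, 0.5 * diff.abs().sum())  # total variation distance
    return worst
```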
Presentation and Paper Ideas
▪ Other methods to remove discrimination or improve
privacy in data preprocessing
• https://towardsdatascience.com/algorithmic-solutions-to-algorithmic-bias-aef59eaf6565
▪ Re-identification based on other types of data
▪ Redlining: additional history and current practices
▪ Applications of k-anonymity