Data Science Ethics - Lecture 5 - Privacy in Data Preprocessing and Modeling
Differential Privacy
▪ What’s the issue again?
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of whom were on financial aid”
➔ No personal information revealed
[Figure: dataset D → Algorithm → Result, held by the Curator and released to the Outside Observer]
Differential Privacy
▪ Some examples of analysis with sensitive data
▪ Participating in a survey about our class, where you would answer negatively
Prob(your exam will be more difficult after the survey, without your data) = 1%
e^ε ≈ 1 + ε for small ε, so for ε = 0.01:
Prob(your exam will be more difficult after the survey, with your data) ≈ 1.01% at most (worked out below)
▪ Does not mean the outcome (a premium going up, the exam becoming more difficult) cannot happen: just that the increase in risk caused by your data is limited.
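As a worked check of the numbers above, assuming the standard ε-differential-privacy guarantee, which bounds the ratio of output probabilities with and without your record:

$$\Pr[M(D)\in S] \;\le\; e^{\varepsilon}\,\Pr[M(D')\in S], \qquad e^{0.01}\approx 1.01, \qquad 1\% \times e^{0.01}\approx 1.01\%$$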
Privacy loss parameter ε
▪ Privacy loss parameter ε
▪ Smaller means more privacy (but less accuracy)
▪ If ε = 0
• P(M(D)) = P(M(D’)) for all datasets D, D’
• Total privacy, but the output is pure noise (no utility)
▪ Rule of thumb
• ε between 0.001 and 1
• The proper ε depends on the dataset size and on Prob(M(D)). Intuitively: the larger the dataset,
the smaller the impact of a single data instance.
• “almost no utility is expected from datasets containing 1/ε or fewer records.”
(Nissim et al., 2018)
➢ ε = 0.001 ➔ requires datasets of at least 1000 records
➢ ε = 0.01 ➔ requires datasets of at least 100 records
Differential Privacy
▪ Counting
• How many students disliked the course?
• Add Lap(1/ε) noise to the count ➔ ε-differential privacy
• Run the analysis twice: you get two different answers, but accuracy bounds exist (given ε); a small code sketch follows
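A minimal sketch of such a noisy counting query in Python (illustrative only: the function name, data and ε value are made up for this example):

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import numpy as np

def dp_count(values, epsilon, rng=None):
    """Return a noisy count: true count + Lap(1/epsilon) noise.

    A counting query has sensitivity 1 (adding or removing one student
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical data: 1 = disliked the course, 0 = did not
answers = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
print(dp_count(answers, epsilon=0.5))   # a different noisy answer on each run
```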
Differential Privacy
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of whom were on financial aid”
➔ No personal information revealed
• Combination with background knowledge is troublesome!
➢ Student Sam knows that student Tim dropped out of the class between March and April; comparing the two counts, Sam now knows that Tim was on financial aid.
▪ Differential privacy
• Study 1: “In March 2024, there were 86 students taking the DSE course, approximately 11 of whom were on financial aid”
• Study 2: “In April 2024, there were 85 students taking the DSE course, approximately 8 of whom were on financial aid”
Differential Privacy
▪ Properties
• Quantification of privacy loss
• Compositional: allows complex differentially private techniques to be built from simpler ones (see the statement below)
• Immune to post-processing: the guarantee holds no matter what additional data, technology or computational power is used afterwards
• Transparent: you can reveal which procedure and parameters you used (otherwise there is uncertainty about, for example, the accuracy of the results)
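A one-line statement of basic sequential composition, the standard result behind the “compositional” property above (not specific to these slides):

$$M_1 \text{ is } \varepsilon_1\text{-DP} \;\wedge\; M_2 \text{ is } \varepsilon_2\text{-DP} \;\Rightarrow\; \text{releasing } (M_1(D), M_2(D)) \text{ is } (\varepsilon_1+\varepsilon_2)\text{-DP}$$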
Differential Privacy
▪ Assumption 2: trusted data curator
▪ Needed?
• Local (decentralized) differential privacy
➢ Don’t trust the curator ➔ noise is added before the answer is recorded
➢ Don’t trust the outside observers
• Centralized differential privacy
➢ Trust the data curator ➔ noise is added after the answer is recorded
➢ Don’t trust the outside observers
[Figure: same pipeline D → Algorithm → Result released to the Outside Observer, with two possible noise-injection points (x): before the Curator records the data (local) or at the Curator before release (centralized)]
Differential Privacy
▪ Two types of differential privacy
▪ Randomized Response: local
• For example, I want to know how many students liked the course, so I ask: “Did you like the class?”
➢ Flip a coin:
▪ if heads: write your real answer
▪ if tails: flip again; if heads, write your real answer, if tails, write the inverse
▪ 75% of answers are correct, but every respondent keeps plausible deniability
• If we know 1/3 did not like the class
➢ How many “did not like” answers do we expect in the randomized-response result? 5/12
Why? If p is the fraction of true positive answers, we expect to observe a positive answer
▪ 1 out of 4 times: when it was flipped from a negative answer
▪ 3 out of 4 times: when it is a real positive answer
➔ ¼ × (1−p) + ¾ × p; with p = 2/3 this gives 7/12 positive and hence 5/12 negative answers
➢ Averaged over large numbers the result will be quite accurate, while each individual keeps plausible deniability (a small simulation follows)
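A small simulation of this coin-flip scheme (a sketch; the class size and dislike rate are hypothetical), showing both the expected 7/12–5/12 split and how the true rate can be re-estimated from the noisy answers:

```python
# Simulation of the randomized-response scheme described above.
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(true_answer, rng):
    """Coin-flip scheme: heads -> truth; tails -> flip again,
    heads -> truth, tails -> inverted answer (so P(truth) = 3/4)."""
    if rng.random() < 0.5:          # first flip: heads
        return true_answer
    if rng.random() < 0.5:          # second flip: heads
        return true_answer
    return 1 - true_answer          # second flip: tails -> inverse

# Hypothetical class: 1/3 truly disliked the course (0 = disliked, 1 = liked)
n = 30_000
truth = rng.random(n) < 2/3                      # True = liked
reported = np.array([randomized_response(int(t), rng) for t in truth])

observed_pos = reported.mean()                   # ≈ 1/4*(1-p) + 3/4*p = 7/12
estimated_p = 2 * (observed_pos - 0.25)          # invert the formula above
print(f"observed positive rate ≈ {observed_pos:.3f} (7/12 ≈ 0.583)")
print(f"estimated true 'liked' rate ≈ {estimated_p:.3f} (true p = 2/3)")
```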
Differential Privacy
▪ Two types of differential privacy
▪ Centralized vs decentralized?
• Choice based on risk of hacking, leaks and subpoenas.
Differential Privacy
▪ Differential Privacy (Cynthia Dwork et al. 2006)
• Advantages
➢ Privacy is preserved even if the results are linked with additional data
➢ The learnt patterns do generalize
• Challenges
➢ Group-level properties can still be inferred, with or without your data instance (e.g. Facebook traits or salary)
➢ Secrets about a large group in the dataset can still be learnt (e.g. Strava)
Data Preprocessing for privacy
▪ How to measure the level of anonymity?
• k-anonymity of the dataset: has issues
• Differential privacy of the algorithm: the gold standard
▪ Several preprocessing methods to obtain privacy (a small code sketch follows this list)
• Grouping instances: aggregation
• Grouping variable values: discretisation and generalisation
• Suppressing: replace certain values by *
• Adding noise
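A minimal sketch of two of these steps, suppression and generalisation, on a hypothetical table (column names and values are made up; this is not a full k-anonymity implementation):

```python
# Generalising quasi-identifiers (age -> age band, zip -> prefix)
# and suppressing a direct identifier (name).
import pandas as pd

df = pd.DataFrame({
    "name": ["Tim", "Sam", "Ann", "Bo"],
    "age":  [19, 23, 21, 35],
    "zip":  ["3000", "3001", "3000", "3012"],
    "aid":  [True, False, True, False],   # sensitive attribute
})

# Suppression: mask the direct identifier
df["name"] = "*"

# Generalisation: replace exact age by a coarse band, zip by its prefix
df["age"] = pd.cut(df["age"], bins=[0, 20, 30, 120],
                   labels=["<=20", "21-30", ">30"])
df["zip"] = df["zip"].str[:2] + "**"

print(df)
# Checking k-anonymity would mean verifying that every combination of the
# generalised quasi-identifiers (age band, zip prefix) occurs at least k times.
```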
▪ Continuum!
[Figure: privacy–utility continuum, from most privacy (left) to most utility (right): diff. privacy decentralized → diff. privacy centralized → t-closeness → l-diversity → k-anonymity → removing personal identifiers; along the axis ε runs from low to high, t from low to high, and k and l from high to low]
Conclusion
▪ Differential privacy
▪ The dream of analysing data while preserving privacy
▪ Privacy as a matter of accumulated risk, not a binary property
▪ With a clear parameter ε that quantifies privacy loss: the additional risk to an individual resulting from the use of their data
▪ The loss is always bounded, and the bound is mathematically provable
Differential Privacy
▪ Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis.
Presentation and Paper Ideas
▪ Differential privacy in action (its use by Apple, Google, LinkedIn, etc.)
▪ The trade-offs in value
▪ Practical implementations and examples of k-anonymity / l-diversity / t-closeness