Data Science Ethics - Lecture 5 - Privacy in Data Preprocessing and Modeling
Differential Privacy
▪ What’s the issue again?
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of whom were on financial aid”
➔ No personal information revealed
[Figure: dataset D → Algorithm → Result, held by the Curator and released to the Outside Observer]
Differential Privacy
▪ Some examples of analysis with sensitive data
▪ Participating in a survey about our class, where you would answer negatively
Prob(your exam will be more difficult after the survey, without your data) = 1%
e^ε ≈ 1 + ε for small ε, so for ε = 0.01:
Prob(your exam will be more difficult after the survey, with your data) ≈ 1.01% at most (worked out below)
▪ Does not mean the outcome (a premium going up, the exam becoming more difficult) cannot happen: just that the increase in risk caused by your data is limited.
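As a worked check of the numbers above, assuming the standard ε-differential-privacy guarantee, which bounds the ratio of output probabilities with and without your record:

$$\Pr[M(D)\in S] \;\le\; e^{\varepsilon}\,\Pr[M(D')\in S], \qquad e^{0.01}\approx 1.01, \qquad 1\% \times e^{0.01}\approx 1.01\%$$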
Privacy loss parameter ε
▪ Privacy loss parameter ε
▪ Smaller means more privacy (but less accuracy)
▪ If ε = 0
• P(M(D)) = P(M(D’)) for all datasets D, D’
• Total privacy, but the output is pure noise (no utility)
▪ Rule of thumb
• ε between 0.001 and 1
• The proper ε depends on the dataset size and on Prob(M(D)). Intuitively: the larger the dataset,
the smaller the impact of a single data instance.
• “almost no utility is expected from datasets containing 1/ε or fewer records.”
(Nissim et al., 2018)
➢ ε = 0.001 ➔ requires datasets of at least 1000 records
➢ ε = 0.01 ➔ requires datasets of at least 100 records
Differential Privacy
▪ Counting
• How many students disliked the course?
• Add Lap(1/ε) noise to the count ➔ ε-differential privacy
• Run the analysis twice: you get two different answers, but accuracy bounds exist (given ε); a small code sketch follows
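A minimal sketch of such a noisy counting query in Python (illustrative only: the function name, data and ε value are made up for this example):

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import numpy as np

def dp_count(values, epsilon, rng=None):
    """Return a noisy count: true count + Lap(1/epsilon) noise.

    A counting query has sensitivity 1 (adding or removing one student
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical data: 1 = disliked the course, 0 = did not
answers = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
print(dp_count(answers, epsilon=0.5))   # a different noisy answer on each run
```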
Differential Privacy
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of whom were on financial aid”
➔ No personal information revealed
• Combination with background knowledge is troublesome!
➢ Student Sam knows that student Tim dropped out of the class between March and April; comparing the two counts, Sam now knows that Tim was on financial aid.
▪ Differential privacy
• Study 1: “In March 2024, there were 86 students taking the DSE course, approximately 11 of whom were on financial aid”
• Study 2: “In April 2024, there were 85 students taking the DSE course, approximately 8 of whom were on financial aid”
Differential Privacy
▪ Properties
• Quantification of privacy loss
• Compositional: allows complex differentially private techniques to be built from simpler ones (see the statement below)
• Immune to post-processing: the guarantee holds no matter what additional data, technology or computational power is used afterwards
• Transparent: you can reveal which procedure and parameters you used (otherwise there is uncertainty about, for example, the accuracy of the results)
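A one-line statement of basic sequential composition, the standard result behind the “compositional” property above (not specific to these slides):

$$M_1 \text{ is } \varepsilon_1\text{-DP} \;\wedge\; M_2 \text{ is } \varepsilon_2\text{-DP} \;\Rightarrow\; \text{releasing } (M_1(D), M_2(D)) \text{ is } (\varepsilon_1+\varepsilon_2)\text{-DP}$$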
Differential Privacy
▪ Assumption 2: trusted data curator
▪ Needed?
• Local (decentralized) differential privacy
➢ Don’t trust the curator ➔ noise is added before the answer is recorded
➢ Don’t trust the outside observers
• Centralized differential privacy
➢ Trust the data curator ➔ noise is added after the answer is recorded
➢ Don’t trust the outside observers
[Figure: same pipeline D → Algorithm → Result released to the Outside Observer, with two possible noise-injection points (x): before the Curator records the data (local) or at the Curator before release (centralized)]
Differential Privacy
▪ Two types of differential privacy
▪ Randomized Response: local
• For example, I want to know how many students liked the course, so I ask: “Did you like the class?”
➢ Flip a coin:
▪ if heads: write your real answer
▪ if tails: flip again; if heads, write your real answer, if tails, write the inverse
▪ 75% of answers are correct, but every respondent keeps plausible deniability
• If we know 1/3 did not like the class
➢ How many “did not like” answers do we expect in the randomized-response result? 5/12
Why? If p is the fraction of true positive answers, we expect to observe a positive answer
▪ 1 out of 4 times: when it was flipped from a negative answer
▪ 3 out of 4 times: when it is a real positive answer
➔ ¼ × (1−p) + ¾ × p; with p = 2/3 this gives 7/12 positive and hence 5/12 negative answers
➢ Averaged over large numbers the result will be quite accurate, while each individual keeps plausible deniability (a small simulation follows)
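A small simulation of this coin-flip scheme (a sketch; the class size and dislike rate are hypothetical), showing both the expected 7/12–5/12 split and how the true rate can be re-estimated from the noisy answers:

```python
# Simulation of the randomized-response scheme described above.
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(true_answer, rng):
    """Coin-flip scheme: heads -> truth; tails -> flip again,
    heads -> truth, tails -> inverted answer (so P(truth) = 3/4)."""
    if rng.random() < 0.5:          # first flip: heads
        return true_answer
    if rng.random() < 0.5:          # second flip: heads
        return true_answer
    return 1 - true_answer          # second flip: tails -> inverse

# Hypothetical class: 1/3 truly disliked the course (0 = disliked, 1 = liked)
n = 30_000
truth = rng.random(n) < 2/3                      # True = liked
reported = np.array([randomized_response(int(t), rng) for t in truth])

observed_pos = reported.mean()                   # ≈ 1/4*(1-p) + 3/4*p = 7/12
estimated_p = 2 * (observed_pos - 0.25)          # invert the formula above
print(f"observed positive rate ≈ {observed_pos:.3f} (7/12 ≈ 0.583)")
print(f"estimated true 'liked' rate ≈ {estimated_p:.3f} (true p = 2/3)")
```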
Differential Privacy
▪ Two types of differential privacy
▪ Centralized vs decentralized?
• Choice based on risk of hacking, leaks and subpoenas.
Differential Privacy
▪ Differential Privacy (Cynthia Dwork et al. 2006)
• Advantages
➢ Privacy is preserved even if the results are linked with additional data
➢ The learnt patterns do generalize
• Challenges
➢ Group-level properties can still be inferred, with or without your data instance (e.g. Facebook traits or salary)
➢ Secrets about a large group in the dataset can still be learnt (e.g. Strava)
Data Preprocessing for privacy
▪ How to measure the level of anonymity?
• k-anonymity of the dataset: has issues
• Differential privacy of the algorithm: the gold standard
▪ Several preprocessing methods to obtain privacy (a small code sketch follows this list)
• Grouping instances: aggregation
• Grouping variable values: discretisation and generalisation
• Suppressing: replace certain values by *
• Adding noise
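A minimal sketch of two of these steps, suppression and generalisation, on a hypothetical table (column names and values are made up; this is not a full k-anonymity implementation):

```python
# Generalising quasi-identifiers (age -> age band, zip -> prefix)
# and suppressing a direct identifier (name).
import pandas as pd

df = pd.DataFrame({
    "name": ["Tim", "Sam", "Ann", "Bo"],
    "age":  [19, 23, 21, 35],
    "zip":  ["3000", "3001", "3000", "3012"],
    "aid":  [True, False, True, False],   # sensitive attribute
})

# Suppression: mask the direct identifier
df["name"] = "*"

# Generalisation: replace exact age by a coarse band, zip by its prefix
df["age"] = pd.cut(df["age"], bins=[0, 20, 30, 120],
                   labels=["<=20", "21-30", ">30"])
df["zip"] = df["zip"].str[:2] + "**"

print(df)
# Checking k-anonymity would mean verifying that every combination of the
# generalised quasi-identifiers (age band, zip prefix) occurs at least k times.
```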
▪ Continuum!
[Figure: privacy–utility continuum, from most privacy (left) to most utility (right): diff. privacy decentralized → diff. privacy centralized → t-closeness → l-diversity → k-anonymity → removing personal identifiers; along the axis ε runs from low to high, t from low to high, and k and l from high to low]
Conclusion
▪ Differential privacy
▪ The dream of analysing data while preserving privacy
▪ Privacy as a matter of accumulated risk, not a binary property
▪ With a clear parameter ε that quantifies privacy loss: the additional risk to an individual resulting from the use of their data
▪ The loss is always bounded, and the bound is mathematically provable
Differential Privacy
▪ Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis.
Presentation and Paper Ideas
▪ Differential privacy in action (its use by Apple, Google, LinkedIn, etc.)
▪ The trade-offs in value
▪ Practical implementations and examples of k-anonymity / l-diversity / t-closeness