CCST 9047 Lecture 8
Tutorial
We only have four tutorials in total; the remaining two are this week and next week.
Example application: social recommendations
‣ Data collector: Facebook
‣ Private information: friend links
‣ Analyst: another user
‣ Function: recommend other users based on the social network
Statistical Database Privacy
‣ A server answers queries f(DB) computed over the database.
‣ Individuals do not want the server to infer their records.
‣ Credentials
‣ Individual information
‣ Healthcare Data
‣ A hospital must securely store patients' medical records and only share them with authorized personnel, ensuring confidentiality under laws like HIPAA (U.S.).
‣ Financial Data
‣ A bank must encrypt customers' credit card details and transaction history to prevent unauthorized access or data breaches.
‣ Then even if an encrypted data item is leaked, the sensitive information remains unknown to the attacker.
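To illustrate that last point, here is a minimal sketch (not from the lecture) of symmetric encryption with Python's cryptography package; the record fields are hypothetical.

```python
# Minimal sketch: store only encrypted card details, so a leaked record is unreadable.
# Assumes the `cryptography` package is installed; field names are illustrative only.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # kept in a secure key store, never with the data
cipher = Fernet(key)

card_number = b"4111 1111 1111 1111"
stored_record = {
    "customer_id": 42,
    "card_number_enc": cipher.encrypt(card_number),  # ciphertext goes to the database
}

# If stored_record leaks without the key, the attacker sees only random-looking bytes.
# An authorized service holding the key can still recover the plaintext:
print(cipher.decrypt(stored_record["card_number_enc"]))
```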
Data Anonymization
Medical list of different patients
‣ Anonymization: removing personally identifiable information before publishing data.
‣ First solution: strip attributes that uniquely identify an individual (e.g., name, social security number, ...).
‣ Now we cannot know that William Weld has cancer!
‣ The privacy of William Weld has been protected.
“ANONYMIZATION” IS NOT SAFE
Is Data Anonymization safe?
‣ Problem: susceptible to linkage attacks, i.e., uniquely linking a record in the anonymized dataset to an identified record in a public dataset.
‣ For instance, an estimated 87% of the US population is uniquely identified by the combination of their gender, birthdate, and zip code.
‣ In the late 90s, L. Sweeney leveraged a public voter list to re-identify the medical record of the governor of Massachusetts.
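As a concrete, hedged illustration of such a linkage attack (not from the lecture), the following pandas sketch joins a made-up anonymized medical table with a made-up public voter list on the quasi-identifiers zip code, birthdate, and sex:

```python
# Sketch of a linkage attack: the anonymized medical data has no names,
# but joining on quasi-identifiers re-identifies the records.
# All data below is made up for illustration.
import pandas as pd

medical = pd.DataFrame({          # "anonymized": names stripped
    "zip": ["02138", "02139"],
    "birthdate": ["1945-07-31", "1960-01-02"],
    "sex": ["M", "F"],
    "diagnosis": ["cancer", "flu"],
})

voters = pd.DataFrame({           # public voter registration list
    "name": ["William Weld", "Jane Doe"],
    "zip": ["02138", "02139"],
    "birthdate": ["1945-07-31", "1960-01-02"],
    "sex": ["M", "F"],
})

# Linking on (zip, birthdate, sex) attaches a name to each diagnosis.
reidentified = medical.merge(voters, on=["zip", "birthdate", "sex"])
print(reidentified[["name", "diagnosis"]])
```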
Some Public Data Can Be Used to Re-identify People
• Better now?
K-Anonymity
[Table: a k-anonymized medical dataset with quasi-identifiers Zip (130**, 1485*), Age (<30, 30-40, >40), Nationality (*) and the sensitive attribute Disease (Heart, Cancer, Flu); each quasi-identifier combination is shared by several generalized rows.]
‣ Problem: background (prior) knowledge about Umeko. Knowing Umeko's zip code and age group places her in a group of the table whose records all list Cancer, so the attacker still learns that Umeko has cancer.
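A small sketch (not from the lecture) of how one could check k-anonymity with pandas; the column names and the value of k are assumptions:

```python
# Sketch: a table is k-anonymous w.r.t. the quasi-identifiers if every
# combination of quasi-identifier values appears in at least k rows.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Illustrative generalized records (values made up, in the style of the slide's table).
records = pd.DataFrame({
    "zip":         ["130**", "130**", "1485*", "1485*", "130**", "130**"],
    "age":         ["<30",   "<30",   ">40",   ">40",   "30-40", "30-40"],
    "nationality": ["*",     "*",     "*",     "*",     "*",     "*"],
    "disease":     ["Heart", "Heart", "Cancer", "Flu",  "Cancer", "Cancer"],
})

print(is_k_anonymous(records, ["zip", "age", "nationality"], k=2))  # True
# Note: k-anonymity alone does not stop the background-knowledge attack above,
# because a whole group may share the same sensitive value (e.g., Cancer).
```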
• Problem 3: reconstruction attack, i.e., inferring (part of) the dataset from the outputs of many aggregate queries.
• Modern AI models can also reconstruct data from only a fragment of it.
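A minimal differencing sketch (my illustration, not from the lecture) shows the simplest form of reconstruction: two exact aggregate answers that differ by one person expose that person's value.

```python
# Sketch of the simplest reconstruction/differencing attack:
# exact answers to two aggregate queries expose one individual's record.
salaries = {"alice": 52_000, "bob": 61_000, "carol": 58_000}  # made-up private data

q_all = sum(salaries.values())                                     # "total salary of everyone"
q_without_bob = sum(v for k, v in salaries.items() if k != "bob")  # "total excluding Bob"

print(q_all - q_without_bob)  # 61000 -> Bob's salary is reconstructed exactly
```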
Privacy-Preserving Machine Learning
‣ Definition of privacy
[Figure: two data points x mapped to outputs y = f(x); these two data points can be exactly inferred from the model’s output. From Module 2 of the tutorial “Differential Privacy in the Wild”.]
‣ Given only the outputs, you cannot be 100% sure of the corresponding input.
‣ The answer/output is made noisy, which reduces the leakage of information about the dataset.
[Figure: noisy outputs y1, y2, …, yn for the two inputs.]
Differential Privacy
‣ For different inputs, the outputs could be similar, but their corresponding distributions are different (otherwise the model is not that useful).
[Figure (inspired by R. Bassily): a randomized algorithm A runs on a dataset D = {x1, x2, …, xn} and on a neighboring dataset D′, producing the distributions of A(D) and A(D′); over the output range of A, the ratio of the two probabilities is bounded.]
‣ Two neighboring datasets: D = {x1, x2, …, xn} and D′ = {x1, x3, …, xn}.
‣ Requirement: A(D) and A(D′) should have “close” distributions.
‣ Closer distributions imply worse utility: the utility-privacy tradeoff.
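A small simulation (my sketch, not from the lecture) makes the picture concrete: running a Laplace-noised count on two neighboring datasets gives two output distributions that overlap heavily.

```python
# Sketch: the randomized algorithm A is "count the records, then add Laplace noise".
# On neighboring datasets D and D' (differing in one record) the two output
# distributions overlap heavily, so a single output reveals little about x2.
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.5                       # assumed privacy parameter
sensitivity = 1.0                   # a count changes by at most 1 between neighbors

def A(dataset):
    return len(dataset) + rng.laplace(scale=sensitivity / epsilon)

D  = ["x1", "x2", "x3", "x4"]       # toy dataset
Dp = ["x1",       "x3", "x4"]       # neighbor: x2 removed

samples_D  = np.array([A(D)  for _ in range(10_000)])
samples_Dp = np.array([A(Dp) for _ in range(10_000)])

# The two histograms are centered near 4 and 3 but overlap heavily; the probability
# of any output region differs between them only by a bounded ratio.
print(samples_D.mean(), samples_Dp.mean())
```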
Differential Privacy
‣ Pure DP, or ϵ-DP: a mechanism A satisfies ϵ-DP iff, for all inputs D and D′ that differ in one entry and for all output sets S,
(1 − ϵ) ⋅ Pr(A(D′) ∈ S) ≤ Pr(A(D) ∈ S) ≤ (1 + ϵ) ⋅ Pr(A(D′) ∈ S)
‣ If the output distributions of A(D) and A(D′) are similar, it is difficult for the adversary to tell whether a data point x is in the dataset or not.
‣ ϵ quantifies the similarity between the outputs corresponding to different datasets.
‣ Using more models leads to worse privacy, as we issue more statistical queries against the data.
‣ Mathematically, if we have models M1, …, Mk that each guarantee ϵ-level privacy, then their sequential combination can only guarantee kϵ-level privacy.
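As a hedged sketch of this composition cost (not from the lecture), the snippet below splits a privacy budget across k Laplace-noised queries; the budget value and the query are assumptions:

```python
# Sketch of sequential composition: each of k epsilon-DP queries spends epsilon
# of the privacy budget, so answering all of them is only (k * epsilon)-DP.
import numpy as np

rng = np.random.default_rng(1)

def laplace_count(dataset, epsilon, sensitivity=1.0):
    """One epsilon-DP noisy count (Laplace mechanism)."""
    return len(dataset) + rng.laplace(scale=sensitivity / epsilon)

D = list(range(100))                 # toy dataset of 100 records
total_budget = 1.0                   # assumed overall privacy budget
k = 5                                # number of models/queries we want to release
per_query_eps = total_budget / k     # split the budget so the total stays at 1.0

answers = [laplace_count(D, per_query_eps) for _ in range(k)]
print(answers)
# Each answer alone is (total_budget / k)-DP; together they are total_budget-DP.
# Releasing k queries at the full epsilon each would instead cost k * epsilon.
```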
Properties of Post-processing
‣ The sensitivity S of a query q bounds how much its answer can change between neighboring databases: | q(D) − q(D′) | ≤ S.
‣ The Laplace mechanism adds noise scaled to this sensitivity, so noisy answers are close to the original ones.
‣ But since we add Laplace noise to the sums with sensitivity proportional to |dom|, k-means cannot distinguish small clusters that are close by (a drop in utility).
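A brief sketch (my illustration, not from the lecture) of the post-processing idea: once a noisy sum and noisy count are released, any further computation on them, such as a mean or a k-means update, costs no extra privacy budget. The sensitivity values are assumptions.

```python
# Sketch of post-processing: release DP noisy sum and noisy count once,
# then any downstream computation (e.g., a mean, or a k-means update step)
# is still differentially private at no extra privacy cost.
import numpy as np

rng = np.random.default_rng(2)
epsilon = 1.0

ages = np.array([23, 35, 41, 29, 52, 61, 38])   # made-up private values in [0, 100]

# Laplace mechanism on two queries (budget split between them).
sum_sensitivity = 100.0       # assumed bound: one person changes the sum by at most 100
count_sensitivity = 1.0
noisy_sum = ages.sum() + rng.laplace(scale=sum_sensitivity / (epsilon / 2))
noisy_count = len(ages) + rng.laplace(scale=count_sensitivity / (epsilon / 2))

# Post-processing: computed only from the released noisy values.
noisy_mean = noisy_sum / noisy_count
print(noisy_mean)
# With such a tiny dataset the noise dominates (utility drops); real deployments
# rely on much larger counts and sums.
```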
Federated Learning: Decentralization
[Figure: a central server coordinates many users; the data stays on the users’ devices.]
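A minimal federated-averaging sketch (my illustration, not from the lecture, with made-up local data): each user computes an update on its own data and sends only the model weights to the server, which averages them.

```python
# Sketch of one round-based federated averaging loop for a 1-D linear model y = w * x:
# raw data never leaves the users; only locally computed weights are sent up.
import numpy as np

def local_update(global_w, x, y, lr=0.01, steps=50):
    """Each user refines the global weight on its own private data."""
    w = global_w
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)   # gradient of mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(3)
users = []                                     # made-up private datasets, one per user
for _ in range(5):
    x = rng.normal(size=20)
    users.append((x, 3.0 * x + rng.normal(scale=0.1, size=20)))

global_w = 0.0
for round_ in range(10):
    local_ws = [local_update(global_w, x, y) for x, y in users]   # happens on-device
    global_w = float(np.mean(local_ws))        # the server only sees the updates
print(global_w)                                # approaches the true slope 3.0
```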
Summary
‣ Privacy Issue: