
CCST9047: The Age of Big Data
Lecture 8
Tutorial

We only have four tutorials. The remaining two take place this week and next week.

Tutorial this week: AI for healthcare


Negative Impacts of Big Data

‣ Data privacy
‣ Data security


Privacy and Security

‣ Privacy: share my data, but protect my sensitive data (personally identifiable data)

‣ Security: protect my data from illegal access (next lecture)
Threat to Data Privacy

‣ Direct and intentional leakage:

‣ Unauthorized access to data, and data breaches

‣ Massive data collection

‣ Indirect and unintentional leakage:

‣ Metadata: data about data

‣ Data correlated with other data

‣ Computations on data
Privacy in the Data Science Pipeline

(Pipeline: Data Collection → Data Pre-processing → Data Mining/Machine Learning → Application)

‣ Massive collection of personal data by companies and governments.

‣ Different types of private data are collected:

‣ Browsing history, personal images, address, IP, email, location

‣ Social posts, speech, purchasing history
Privacy in the Data Science Pipeline

(Pipeline: Data Collection → Data Pre-processing → Data Mining/Machine Learning → Application)

‣ Laws govern the collection and use of personal data by companies.

‣ Google and iOS require developers to obtain users’ permission before collecting their sensitive data.
Database Privacy Examples

Application            | Data collector  | Private information | Analyst        | Function
Medical                | Hospital        | Disease             | Epidemiologist | Correlation between disease and geography
Genome analysis        | Hospital        | Genome              | Statistician   | Correlation between genome and disease
Advertising            | Google/Facebook | Clicks/Browsing     | Advertiser     | Number of clicks on an ad by age/region/gender
Social recommendations | Facebook        | Friend links        | Another user   | Recommend other users based on the social network
Statistical Database Privacy

(Pipeline: Data Collection → Data Pre-processing → Data Mining/Machine Learning → Application)

Statistical database privacy (untrusted collector):

‣ The server wants to compute a function f(DB) over the database.

‣ The individuals (Person 1, …, Person N, with records r1, …, rN) do not want the server to infer their individual records.
Risks of Privacy Breaches

‣ Improper disclosure of data can have adverse consequences:

‣ Credentials

‣ Credit card number, home access code, password

‣ Risk: stealing personal property

‣ Identification information:

‣ Name, bank information, biometric data

‣ Risk: identity theft

‣ Individual information

‣ Medical status, political opinions, friendships

‣ Risk: discrimination, blackmailing, public shame…


Data Privacy Examples

‣ Personal Data Protection


‣ A social media platform must obtain user consent before collecting their location data,
and users should have the option to disable tracking.

‣ Healthcare Data
‣ A hospital must securely store patients' medical records and only share them with
authorized personnel, ensuring confidentiality under laws like HIPAA (U.S.).

‣ Financial Data
‣ A bank must encrypt customers' credit card details and transaction history to prevent
unauthorized access or data breaches.

‣ Children’s Online Privacy


‣ A gaming app targeting kids under 13 must obtain parental consent before collecting
any personal information, as required by COPPA (U.S.).
Data Preprocessing

(Pipeline: Data Collection → Data Pre-processing → Data Mining/Machine Learning → Application)

‣ We can perform some data pre-processing to protect privacy.

‣ Then, if a data item is leaked, the sensitive information is still unknown to the attacker.

‣ Manipulate the dataset so that personal identification cannot be discovered.
Data Anonymization

Medical list of different patients:

Name            | Birth date | Zip code | Gender | Diagnosis       | ...
Ewen Jordan     | 1993-09-15 | 13741    | M      | Asthma          | ...
Lea Yang        | 1999-11-07 | 13440    | F      | Type-1 diabetes | ...
William Weld    | 1945-07-31 | 02110    | M      | Cancer          | ...
Clarice Mueller | 1950-03-13 | 02061    | F      | Cancer          | ...

‣ We want to preserve as much information as possible, but we do not want to release the diagnosis information of the patients. For example:

‣ We can reveal that someone (with a given gender and age) has cancer or asthma, so that analysis is still possible.

‣ We cannot reveal that William is a patient and that he has cancer.

‣ Anonymization: removing personally identifiable information before publishing data.

‣ First solution: strip attributes that uniquely identify an individual (e.g., name, social security number, ...).
Data Anonymization

Name    | Birth date | Zip code | Gender | Diagnosis       | ...
******* | 1993-09-15 | 13741    | M      | Asthma          | ...
******* | 1999-11-07 | 13440    | F      | Type-1 diabetes | ...
******* | 1945-07-31 | 02110    | M      | Cancer          | ...
******* | 1950-03-13 | 02061    | F      | Cancer          | ...

‣ First solution: strip attributes that uniquely identify an individual (e.g., name, social security number, ...).

‣ Now we cannot directly see that William Weld has cancer.

‣ William’s privacy appears to be protected.
Is Data Anonymization Safe?

Name    | Birth date | Zip code | Gender | Diagnosis       | ...
******* | 1993-09-15 | 13741    | M      | Asthma          | ...
******* | 1999-11-07 | 13440    | F      | Type-1 diabetes | ...
******* | 1945-07-31 | 02110    | M      | Cancer          | ...
******* | 1950-03-13 | 02061    | F      | Cancer          | ...

‣ Now think about the setting where we also have an additional dataset that collects voter information, including names, birth dates, zip codes, genders, etc.

‣ We can recover William’s identity based on the birth date and zip code.

‣ Privacy is NOT protected once such an additional dataset is available.
Privacy Is Still Leaked with Additional Data
(Figure inspired from C. Palamidessi)

Dataset 1, anonymized medical data: visit date, diagnosis, medication, procedure, doctor seen, plus ZIP, birth date, gender, address.
Dataset 2, public voter list: name, date last voted, date registered, party affiliation, plus ZIP, birth date, gender, address.

‣ Problem: susceptible to linkage attacks, i.e., uniquely linking a record in the anonymized dataset to an identified record in a public dataset (a minimal sketch follows below).

‣ An estimated 87% of the US population is uniquely identified by the combination of their gender, birth date, and zip code.

‣ In the late 90s, L. Sweeney leveraged a public voter list to re-identify the medical record of the governor of Massachusetts.
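The linkage attack can be sketched in a few lines of code. This is a minimal illustration with made-up tables; the column names (birth_date, zip_code, gender) and the pandas-based join are assumptions for the example, not part of the original study.

```python
import pandas as pd

# Hypothetical "anonymized" medical data: names stripped, quasi-identifiers kept.
medical = pd.DataFrame({
    "birth_date": ["1945-07-31", "1993-09-15"],
    "zip_code":   ["02110", "13741"],
    "gender":     ["M", "M"],
    "diagnosis":  ["Cancer", "Asthma"],
})

# Hypothetical public voter list: identities plus the same quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["William Weld", "Ewen Jordan"],
    "birth_date": ["1945-07-31", "1993-09-15"],
    "zip_code":   ["02110", "13741"],
    "gender":     ["M", "M"],
})

# Linkage attack: join the two tables on the shared quasi-identifiers.
reidentified = medical.merge(voters, on=["birth_date", "zip_code", "gender"])
print(reidentified[["name", "diagnosis"]])
# If (birth_date, zip_code, gender) is unique, each diagnosis is re-attached to a name.
```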
Some Public Data Can Be Used to Re-identify People

‣ Voter List (for some countries)

‣ Birthday, email, phone number (from social media)

‣ Secondary school class (with names and ages)


Data “Anonymization” Is Not Safe

Quasi-identifiers: Age, Zip code, Gender. Sensitive attribute: Diagnosis.

Name | Age   | Zip code | Gender | Diagnosis       | ...
     | 20-30 | 13***    |        | Asthma          | ...
     | 20-30 | 13***    |        | Type-1 diabetes | ...
     | 70-80 | 02***    |        | Cancer          | ...
     | 70-80 | 02***    |        | Cancer          | ...

‣ Second solution: k-anonymity [Sweeney]

1. Define a set of attributes as quasi-identifiers (QIs).

2. Suppress/generalize attributes and/or add dummy records, so that every record in the dataset is indistinguishable from at least k − 1 other records with respect to the QIs.

‣ Hide partial information of the attributes, so that many records look similar to each other (see the sketch below).

‣ However, if we know that William is a patient, we can still infer that he has cancer.

‣ Better now?
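A minimal sketch of the generalization step and a k-anonymity check, assuming toy records and a hand-written generalization function (the age bins and zip masking below are illustrative choices, not the lecture's):

```python
from collections import Counter

# Toy records: (age, zip_code); quasi-identifiers only.
records = [(28, "13053"), (29, "13068"), (21, "13068"), (23, "13053"),
           (50, "14853"), (55, "14853"), (47, "14850"), (59, "14850")]

def generalize(age, zip_code):
    """Generalize quasi-identifiers: bucket the age, mask the last digits of the zip."""
    if age < 30:
        age_range = "<30"
    elif age <= 40:
        age_range = "30-40"
    else:
        age_range = ">40"
    return (age_range, zip_code[:3] + "**")

def is_k_anonymous(rows, k):
    """Every generalized quasi-identifier combination must appear at least k times."""
    counts = Counter(generalize(age, z) for age, z in rows)
    return all(c >= k for c in counts.values())

print(is_k_anonymous(records, k=4))  # True for this toy generalization
```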
K-Anonymity

Original table:

Zip   | Age | Nationality | Disease
13053 | 28  | Russian     | Heart
13068 | 29  | American    | Heart
13068 | 21  | Japanese    | Flu
13053 | 23  | American    | Flu
14853 | 50  | Indian      | Cancer
14853 | 55  | Russian     | Heart
14850 | 47  | American    | Flu
14850 | 59  | American    | Flu
13053 | 31  | American    | Cancer
13053 | 37  | Indian      | Cancer
13068 | 36  | Japanese    | Cancer
13068 | 32  | American    | Cancer

Anonymized table (here k = 4):

Zip   | Age   | Nationality | Disease
130** | <30   | *           | Heart
130** | <30   | *           | Heart
130** | <30   | *           | Flu
130** | <30   | *           | Flu
1485* | >40   | *           | Cancer
1485* | >40   | *           | Heart
1485* | >40   | *           | Flu
1485* | >40   | *           | Flu
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer
K-Anonymity Can Still Be Attacked

Problem: background knowledge [MKGV 06]

Anonymized table:

Zip   | Age   | Nationality | Disease
130** | <30   | *           | Heart
130** | <30   | *           | Heart
130** | <30   | *           | Cancer
130** | <30   | *           | Cancer
1485* | >40   | *           | Cancer
1485* | >40   | *           | Heart
1485* | >40   | *           | Flu
1485* | >40   | *           | Flu
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer
130** | 30-40 | *           | Cancer

‣ The adversary knows prior knowledge about Umeko:

Name  | Zip   | Age | Nat.
Umeko | 13053 | 25  | Japan

‣ Umeko’s quasi-identifiers place her record in the group (Zip 130**, Age <30).

‣ Combining this with background knowledge (e.g., diseases that are unlikely for Umeko), the adversary learns: Umeko has cancer (a toy version of this attack is sketched below).
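A toy version of the background-knowledge attack, using the group contents shown in the table above; which diseases the adversary rules out is an illustrative assumption for the example:

```python
# Toy anonymized table: (zip_prefix, age_range, disease).
anonymized = [
    ("130**", "<30", "Heart"), ("130**", "<30", "Heart"),
    ("130**", "<30", "Cancer"), ("130**", "<30", "Cancer"),
    ("1485*", ">40", "Cancer"), ("1485*", ">40", "Heart"),
    ("1485*", ">40", "Flu"), ("1485*", ">40", "Flu"),
]

# Adversary's prior knowledge about the target (quasi-identifiers only).
target_group = ("130**", "<30")      # Umeko: zip 13053, age 25
background_knowledge = {"Heart"}     # diseases the adversary believes are ruled out

# Step 1: restrict to the target's equivalence class.
candidates = {d for z, a, d in anonymized if (z, a) == target_group}
# Step 2: remove diseases excluded by background knowledge.
remaining = candidates - background_knowledge
print(remaining)  # {'Cancer'} -> the sensitive value is revealed
```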


Privacy Attack Using Auxiliary Knowledge

• Different variants of anonymization methods require modifying the original data even more, which often destroys utility.

• In high-dimensional and sparse datasets, any combination of attributes can potentially be exploited using appropriate auxiliary knowledge.

• De-anonymization of the Netflix dataset using the IMDB dataset

• De-anonymization of the Twitter graph using Flickr

• 4 spatio-temporal points uniquely identify most people

Data cannot be fully anonymized and remain useful!
Other Approach: Aggregate Statistics

Other solution: How about releasing aggregate statistics about many individuals?

• Mean, variance, histogram, frequency, etc.

• The statistics can be useful for data analysis and training AI models, and seem able to protect individuals’ information.

• Problem 1: differencing attacks, i.e., combining aggregate queries to obtain precise information about specific individuals:

• Average salary in a company before and after an employee joins (see the sketch below).
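A minimal sketch of the salary differencing attack, with made-up names and numbers:

```python
# Toy salary table; the attacker can only ask for aggregate statistics (here: averages).
salaries_before = {"alice": 80_000, "bob": 95_000, "carol": 70_000}
salaries_after = dict(salaries_before, dave=120_000)  # dave joins the company

def average_salary(salaries):
    return sum(salaries.values()) / len(salaries)

# Two "harmless" aggregate queries...
avg_before = average_salary(salaries_before)
avg_after = average_salary(salaries_after)

# ...combined, they reveal dave's exact salary.
dave_salary = avg_after * len(salaries_after) - avg_before * len(salaries_before)
print(round(dave_salary))  # 120000
```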
Other Approach: Aggregate Statistics

Other solution: How about releasing aggregate statistics about many


individuals?
• Mean, variance, histogram, frequency, etc.

• Problem 2: membership inference attacks, i.e., inferring the presence of a known individual in a dataset from (high-dimensional) aggregate statistics
• Statistics about genomic variants
• Infer whether someone’s sensitive health information contributes to a medical
model’s prediction.
Other Approach: Aggregate Statistics

Other solution: How about releasing aggregate statistics about many


individuals?
• Mean, variance, histogram, frequency, etc.

• Problem 3: reconstruction attacks, i.e., inferring (part of) the dataset from the output of many aggregate queries

• Modern AI models have the ability to reconstruct the data from a piece of it.

• Dinur-Nissim result: a majority of the records in a database of size n can be reconstructed when n log²(n) queries are answered. So an attacker can brute-force over many combinations to infer the data (see the sketch below). See this video.
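A toy brute-force reconstruction attack, assuming exact (noise-free) subset-count queries for simplicity; the Dinur-Nissim result covers the harder case where the answers are noisy:

```python
from itertools import combinations, product

# Toy secret database: one private bit per person (e.g., has a disease or not).
secret = (1, 0, 1, 1, 0)
n = len(secret)

# The curator answers subset-count queries: "how many people in S have the disease?"
queries = list(combinations(range(n), 3))               # many overlapping subsets
answers = {S: sum(secret[i] for i in S) for S in queries}

# Reconstruction attack: brute-force over all candidate databases consistent with the answers.
consistent = [cand for cand in product((0, 1), repeat=n)
              if all(sum(cand[i] for i in S) == answers[S] for S in queries)]
print(consistent)  # [(1, 0, 1, 1, 0)] -> only the true database remains
```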
AI Models Are Not Safe

‣ ML models are elaborate kinds of aggregate statistics (e.g., a linear model)!

‣ As such, they are susceptible to membership inference attacks, i.e., inferring the presence of a known individual in the training set.

‣ For instance, one can exploit the confidence of model predictions [Shokri et al.], [Carlini et al.] (a toy threshold attack is sketched below).

AI Models Are Not Safe

‣ ML models are also susceptible to reconstruction attacks.

‣ For instance, one can extract sensitive text from large language models [Carlini et al.] or run differencing attacks on ML models [Paige et al.].
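A toy threshold-based membership inference attack on prediction confidence. This is a simplified illustration of the idea behind such attacks, not the exact method from the cited papers; the threshold value and the made-up probability vectors are assumptions:

```python
import numpy as np

# Models are often more confident on examples they were trained on,
# so unusually high confidence suggests membership in the training set.

def max_confidence(probs):
    """Highest predicted class probability."""
    return float(np.max(probs))

def membership_guess(probs, threshold=0.95):
    """Guess 'was in the training set' when the model is very confident."""
    return max_confidence(probs) >= threshold

# Toy usage with made-up prediction vectors:
train_example_probs = np.array([0.98, 0.01, 0.01])  # seen during training: very confident
test_example_probs = np.array([0.55, 0.30, 0.15])   # unseen: less confident
print(membership_guess(train_example_probs))  # True
print(membership_guess(test_example_probs))   # False
```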
Data Privacy in AI

‣ To mitigate privacy risks, data scientists use various techniques to manipulate users’ private data.

‣ Then, these data, after certain manipulations, are used to develop data science and AI models/applications.

(Figure: private data → privacy manipulation → AI model)

‣ However, privacy manipulation may downgrade the quality of the data and thus affect the utility of AI models.
Privacy Versus Utility

‣ Privacy comes at a cost in the utility of the analysis, but ideally it should not destroy it.

‣ The goal of privacy research is to find a good trade-off between utility and privacy.
Privacy-Preserving Machine Learning

(Pipeline: Data Collection → Data Pre-processing → Data Mining/Machine Learning → Application)

‣ Definition of privacy

‣ To develop privacy-preserving ML algorithms, we need a quantitative measurement of privacy.

‣ Intuition: one cannot infer the presence/absence of an individual in the dataset, or anything specific about an individual.
Privacy-Preserving Machine Learning

(Pipeline: Training Data → Data Mining/Machine Learning → Output/Prediction)

‣ Goal of privacy-preserving machine learning:

‣ The AI model is trained on the training dataset.

‣ We hope that the output/prediction of the AI model is good (utility).

‣ We hope that the output/prediction will not reveal information about any individual data point (privacy).
Consider a Deterministic Algorithm

‣ For a deterministic algorithm, each input is mapped to exactly one output: x → y = f(x).

‣ Non-trivial deterministic algorithms do not satisfy differential privacy (non-trivial: at least 2 outputs in the image).

‣ Some data points can be exactly inferred based on the model’s output.

‣ We see the model/algorithm’s output y.

‣ Then, if the function is deterministic, for some data, as long as we know y, we may be able to infer that x = f⁻¹(y).
Randomization Is Necessary

‣ We need to introduce randomization, making the model output random rather than a fixed value.

‣ One input may lead to different outputs, each with some probability.

‣ Then, given the output, we are not 100% sure of the corresponding input.

‣ The answer/output is somewhat noisy, which reduces the leakage of information about the dataset.

(Figure: output probability distributions over y1, y2, …, yn for a non-private algorithm versus a private one.)
Differential Privacy
(Figure inspired from R. Bassily: a randomized algorithm A is applied to D and to D′, each producing an output distribution A(D) and A(D′).)

‣ When using a randomized algorithm A …

‣ Each input leads to a set of possible outputs.

‣ For different inputs, the outputs could be similar, but their corresponding distributions are different (otherwise the model is not that useful).

‣ Consider the distributions of the output for two neighboring datasets:

‣ Neighboring datasets D = {x1, x2, …, xn} and D′ = {x1, x3, …, xn}

‣ Requirement: A(D) and A(D′) should have “close” distributions.

‣ This simulates the presence or absence of a single entry.
Differential Privacy
(Figure inspired from R. Bassily: the output distributions of A(D) and A(D′) over the output range of A, with their probability ratio bounded.)

‣ Two neighboring datasets D = {x1, x2, …, xn} and D′ = {x1, x3, …, xn}

‣ Requirement: A(D) and A(D′) should have “close” distributions, i.e., the probability ratio is bounded over the output range of A.

‣ Closer distributions imply better privacy.

‣ Closer distributions imply worse utility.

‣ Utility-privacy trade-off!
Differential Privacy

‣ Pure DP, or ε-DP: a mechanism A satisfies DP iff, for all pairs of inputs D and D′ that differ in one entry and for all sets of outputs S, the bound on the next slide holds.

‣ If the output distributions of A(D) and A(D′) are similar, it will be difficult for the adversary to know whether a data point x is in the dataset or not.
Differential Privacy

‣ Pure DP, or ε-DP: a mechanism A satisfies DP iff, for all pairs of inputs D and D′ that differ in one entry and for all sets of outputs S:

(1 − ϵ) ⋅ Pr(A(D′) ∈ S) ≤ Pr(A(D) ∈ S) ≤ (1 + ϵ) ⋅ Pr(A(D′) ∈ S)

(see also the standard formulation below)

‣ ϵ quantifies the similarity between the outputs corresponding to different (neighboring) datasets.

‣ Smaller ϵ implies better privacy protection.
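For reference, the standard formulation of pure ε-DP uses a multiplicative e^ε bound; the (1 ± ϵ) form above can be read as its small-ϵ approximation, since e^{±ϵ} ≈ 1 ± ϵ:

```latex
% Pure \epsilon-DP (standard form): for all neighboring D, D' and all output sets S,
\Pr[A(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[A(D') \in S].
% For small \epsilon, e^{\epsilon} \approx 1 + \epsilon and e^{-\epsilon} \approx 1 - \epsilon,
% which recovers the (1 \pm \epsilon) bounds stated above.
```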
Properties of DP: Basic Composition

‣ Using more models leads to worse privacy, as we are issuing more statistical queries against the data.

‣ Mathematically, if we have models M1, …, Mk that each guarantee ϵ-level privacy, then their sequential combination can only be guaranteed kϵ-level privacy (running them in parallel on disjoint parts of the data composes more favorably). A minimal budget-accounting sketch follows.
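A minimal budget-accounting sketch of sequential composition; the class and its interface are illustrative, not a real library API:

```python
# Basic (sequential) composition: running k mechanisms that are each eps-DP
# on the same data costs k * eps of the total privacy budget.

class BudgetAccountant:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record one eps-DP query; refuse it if the budget would be exceeded."""
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

accountant = BudgetAccountant(total_epsilon=1.0)
for _ in range(4):              # four queries at eps = 0.25 each ...
    accountant.spend(0.25)
print(accountant.spent)         # ... use up the whole eps = 1.0 budget
# accountant.spend(0.25)        # a fifth query would raise: budget exhausted
```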
Properties of Post-processing

‣ If M is ϵ-level private, then after any post-processing, i.e., F(M(D)), it is also ϵ-level private.

‣ Post-processing does not add more information, and thus never helps the adversary to better infer private information about the data.
How to Achieve Privacy: Add Noise to Introduce Randomization
Output Randomization

(Figure: the researcher sends a query to the database; noise is added to the true answer before it is returned.)

‣ Add noise to answers such that:

‣ Each answer does not leak too much information about the database.

‣ Noisy answers are close to the original answers.

‣ The noise level controls a trade-off between privacy and utility.
How to Achieve DP: Laplace Mechanism (Add Noise)

‣ Add Laplacian noise to the true answer: release q(D) + η, where η is a Laplace random variable with mean 0 and variance 2λ².

‣ Privacy depends on the λ parameter.

How Much Noise for Privacy?

‣ Assume that for any neighboring datasets D and D′:

| q(D) − q(D′) | ≤ S

‣ Then we can pick λ = S/ϵ to guarantee ϵ-level privacy (a minimal sketch follows below).

‣ In order to make the two distributions close, the variance needs to be large.

‣ Extreme case: if λ → ∞, then the distributions will be the same.

‣ Too much noise affects the utility!
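A minimal sketch of the Laplace mechanism for a counting query, using NumPy's Laplace sampler; the query, the sensitivity S = 1, and the ϵ value are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release true_answer + Laplace noise with scale lambda = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many records satisfy some condition?").
# Adding or removing one individual changes a count by at most 1, so S = 1.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(noisy_count)  # something near 42; smaller epsilon -> noisier answer
```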
Case Study: Differentially Private K-Means Clustering

K-means: a clustering algorithm
K-Means: Original Algorithm

1. Initialize the K cluster centers (randomly or using prior


information)
2. Decide the class memberships of the data points by assigning them to the nearest cluster center.
3. Re-estimate the K cluster centers, assuming the memberships found above are correct.
4. Repeat steps 2-3 until no data point changed membership in the last iteration.
K-Means: DP Algorithm

1. Initialize the K cluster centers (randomly or using prior


information)
2. Assign the points to the nearest cluster center.
3. Noisily compute the size of each cluster.
4. Noisily estimate the K cluster centers, assuming the memberships found above are correct.
5. Repeat steps 2-4 until no data point changed membership in the last iteration.
K-Means: DP Algorithm

Use an ϵ/T privacy budget in each iteration, where T is the total number of iterations.

1. Assign the points to the nearest cluster center. (No expense of privacy budget.)

2. Noisily compute the size of each cluster. (Expends privacy budget.)

3. Noisily estimate the K cluster centers, assuming the memberships found above are correct. (Expends privacy budget.)

A sketch of one such noisy iteration is given below.
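A minimal sketch of one noisy iteration, assuming the data have been rescaled so that every coordinate lies in [-1, 1] and the per-iteration budget is split evenly between the noisy counts and the noisy sums; these choices are illustrative, not the lecture's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_kmeans_step(points, centers, epsilon_iter, bound=1.0):
    """One DP k-means iteration: assign points, then add Laplace noise to counts and sums."""
    k, d = centers.shape
    # Step 1: assign each point to the nearest center (does not consume budget).
    labels = np.argmin(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    new_centers = np.zeros_like(centers, dtype=float)
    for j in range(k):
        cluster = points[labels == j]
        # Step 2: noisy cluster size (adding/removing one point changes a count by 1).
        noisy_count = len(cluster) + rng.laplace(scale=1.0 / (epsilon_iter / 2))
        # Step 3: noisy coordinate sums; one point shifts each coordinate by at most
        # `bound`, so the L1 sensitivity of the d-dimensional sum is d * bound.
        noisy_sum = cluster.sum(axis=0) + rng.laplace(scale=(d * bound) / (epsilon_iter / 2), size=d)
        new_centers[j] = noisy_sum / max(noisy_count, 1.0)
    return new_centers

# Usage: centers = dp_kmeans_step(points, centers, epsilon_iter=total_epsilon / T)
```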
Results (T = 10 iterations, random initialization)

(Figure: original k-means algorithm vs. Laplace k-means algorithm.)

‣ Even though we compute the centers noisily, Laplace k-means can still distinguish clusters that are far apart.

‣ But since we add noise to the sums with sensitivity proportional to |dom|, Laplace k-means cannot distinguish small clusters that are close by (a drop in utility).
Federated Learning: Decentralization

(Figure: a central server exchanging model parameters with many users’ devices.)

‣ The AI model is no longer trained on a training dataset collected from all users.

‣ Each user leverages only his/her own data to train the model at the edge.

‣ Each user transmits only the model parameters to the server, rather than their private data.

‣ In this way, we can to some extent ensure inter-user privacy; intra-user data privacy is not necessary, since each user’s data never leaves their own device. A sketch of one training round is given below.
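A minimal sketch of a federated-averaging-style training round with a toy linear model; the loss, learning rate, and data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, local_data, lr=0.1):
    """One local gradient step on the user's own data (toy linear regression loss)."""
    X, y = local_data
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, users_data):
    """Each user trains locally; only model parameters are sent back and averaged."""
    local_models = [local_update(global_weights.copy(), data) for data in users_data]
    return np.mean(local_models, axis=0)   # the server never sees the raw user data

# Toy setup: three users, each with a small private dataset.
d = 3
users_data = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(3)]
weights = np.zeros(d)
for _ in range(10):                        # a few communication rounds
    weights = federated_round(weights, users_data)
print(weights)
```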
Summary

‣ Privacy Issue:

‣ Data collection can reveal private information.

‣ Simple anonymization may not help.

‣ Different attack methods.

‣ Privacy-preserving machine learning:

‣ Definition of differential privacy.

‣ Methods to achieve differential privacy: add noise.


Slides from
http://researchers.lille.inria.fr/abellet/teaching/ppml_lectures/lec1.pdf
https://sigmod2017.org/wp-content/uploads/2017/03/04-Differential-Privacy-in-the-wild-1.pdf
https://hongyanz.github.io/slides/cs480680_s23_Lecture17.pdf