Data Science and Ethical Issues
Dereje Teferi
[email protected]
Data Privacy in Big Data
A large amount of data is being collected about
almost every aspect of our lives.
Technology has made the collection of these data
easier than ever before.
As such, data privacy has become one of the main
ethical issues with this data (big data).
Identity theft, fraud, and discrimination are just a
few of the severe problems that people may face as
a result of the collection and storage of personal
information.
The Rise of Privacy Concerns
Science:
benefits of sharing clinical patient records
patients shall control access to their records
patients found to be altruistic/selfless:
willing to grant access for purpose of advancing science
Government:
government and commercial use of data mining raises concerns about
appropriate use of private citizen information,
e.g., data collected for the purpose of airline passenger screening
should not be used for the enforcement of other criminal laws
Open Web:
many users are happy to share private details on social media,
but would be rightfully upset if this data were used for other purposes
Sensitive Data
[Table: example records showing identifying values alongside a sensitive attribute]
Sensitive Data and Privacy
Sensitive Data: data about individuals and organizations that should not be freely disseminated or publicized
Examples: health, education, finance, demography, criminal record, location, behavior, family, etc.
Privacy: the desire to limit the dissemination of sensitive data
Lots of technology exists, but:
Unclear requirements
Unclear behaviors
Unclear laws
A Case:
Hunter College High School
On a September morning in 2013, the students at
Hunter College High School filed past security and into
the hallways, to find that their school had been labeled
the saddest place in Manhattan
https://medium.com/memo-random/turning-data-around-7acea1f7479c
Five months earlier…
Researchers in Cambridge, Massachusetts had pulled
more than six hundred thousand tweets from Twitter’s
public API and fed them through sentiment analysis
routines.
If a tweet contained words that were deemed to be sad
(maybe “cry” or “frown” or “miserable”), an emotional
mark for sadness would be placed on a map of the city.
https://medium.com/memo-random/turning-data-around-7acea1f7479c
Turning Data Around
The world flows in one direction: data comes from us, but it
rarely returns to us.
The systems that we’ve created are designed to be
unidirectional: data is gathered from people, it’s processed by an
assembly line of algorithmic machinery, and spit out to an
audience of different people — surveillors and investors and
academics and data scientists.
Data is not collected for high school students, but for people
who want to know how high school students feel. This new data
reality is from us, but it isn’t for us.
How can we turn data around? How can we build new data
systems that start as two-way streets, and that put the
individuals from whom the data comes first?
https://medium.com/memo-random/turning-data-around-7acea1f7479c
OECD’s Eight Principles of Fair Information Practices
A framework for privacy protection:
Collection for a purpose
Use only for the authorized purpose
Protect use
Accountability throughout these principles

Institutional Review Board
Provisions for the collection, storage, processing, and dissemination of sensitive data

Consent
State the purpose/use
Data quality
Allow corrections

[Diagram: the data science lifecycle (define questions, collect/find data, extract data, pre-process, analyze data, present results, store data, publish data), annotated with privacy provisions at each stage]

Security measures:
Physical safety
Personnel training
Access control
Encryption
Data anonymization
Data anonymization is the process of protecting private
or sensitive information by erasing or encrypting identifiers
that connect an individual to stored data.
(https://www.imperva.com/)
Data anonymization has been defined as a "process by which
personal data is altered in such a way that a data subject can
no longer be identified directly or indirectly, either by the
data controller alone or in collaboration with any other party"
(Wikipedia)
For example, we can run Personally Identifiable Information
(PII) through a data anonymization process that retains the
data but keeps the source anonymous.
Techniques for Data Anonymization
Several data anonymization techniques exist, and
others are still under research.
The most common ones are:
Data masking
Pseudonymization
Generalization
Data Swapping
Data perturbation
Synthetization of data
Data Masking
Masking hides data by replacing it with altered values.
First create a mirror version of a database and apply
modification techniques such as character shuffling,
encryption, and word or character substitution.
For example, you can replace a value character with a
symbol such as “*” or “x”.
Data masking aims to make reverse engineering or detection
of the original values infeasible.
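A minimal sketch of these masking operations in Python (the function names and the sample phone number are illustrative, not from any particular library):

```python
import random

def mask_value(value, keep_last=0):
    """Substitution: replace all but the last `keep_last` characters with '*'."""
    visible = value[-keep_last:] if keep_last else ""
    return "*" * (len(value) - keep_last) + visible

def shuffle_chars(value, seed=42):
    """Character shuffling: scramble the characters of a value."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

print(mask_value("0911223344", keep_last=2))  # ********44
print(shuffle_chars("Dereje"))                # same characters, scrambled order
```

In practice a masked mirror database would apply such transformations column by column, keeping the production data untouched.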
Pseudonymization
Pseudonymization is a data management and de-identification
method that replaces private identifiers with fake identifiers,
or pseudonyms,
for example replacing the identifier “Dereje Teferi”
with “Alex J”.
Pseudonymization preserves statistical accuracy and
data integrity,
It allows the modified data to be used for training,
model development, testing, and analytics while
protecting data privacy.
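The replacement can be sketched as a consistent mapping from real identifiers to pseudonyms (the `Pseudonymizer` class below is a hypothetical illustration, not a standard API):

```python
class Pseudonymizer:
    """Replace real identifiers with consistent pseudonyms.
    The mapping is kept separately, so it can be stored securely
    or discarded once it is no longer needed."""

    def __init__(self):
        self._mapping = {}  # real identifier -> pseudonym

    def pseudonym(self, identifier):
        if identifier not in self._mapping:
            self._mapping[identifier] = f"Person{len(self._mapping) + 1}"
        return self._mapping[identifier]

p = Pseudonymizer()
print(p.pseudonym("Dereje Teferi"))  # Person1
print(p.pseudonym("Alex J"))         # Person2
print(p.pseudonym("Dereje Teferi"))  # Person1 again: the mapping is consistent
```

Because the same identifier always maps to the same pseudonym, statistical relationships in the data are preserved, which is what makes the result usable for analytics.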
Generalization
It deliberately removes some of the data to make it less
identifiable.
Data can be modified into a set of ranges or a broad
area with appropriate boundaries.
You can remove the house number in an address, but
make sure you don’t remove the area (Woreda).
The purpose of generalization is to eliminate some of
the identifiers while retaining a measure of data
accuracy.
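A sketch of the two generalizations mentioned above, assuming addresses are comma-separated with the house number first (both function names are illustrative):

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_address(address):
    """Drop the house number but keep the broader area (e.g. the Woreda)."""
    parts = address.split(",")
    return ",".join(parts[1:]).strip() if len(parts) > 1 else address

print(generalize_age(27))  # 20-29
print(generalize_address("House 123, Woreda 5, Addis Ababa"))  # Woreda 5, Addis Ababa
```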
Data swapping
Data swapping, also known as shuffling or permutation, is a
technique used to rearrange dataset attribute values so that
they no longer correspond with the original records.
It is most effective to swap attributes (columns) that contain
identifying values, such as date of birth, since these have more
impact on anonymization than, say, membership-type values.
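The idea can be sketched by shuffling one column independently of the rest (a toy illustration with an assumed record layout):

```python
import random

def swap_column(records, column, seed=0):
    """Shuffle one attribute across all records so its values no
    longer line up with their original rows."""
    values = [r[column] for r in records]
    random.Random(seed).shuffle(values)
    return [dict(r, **{column: v}) for r, v in zip(records, values)]

people = [
    {"name": "A", "dob": "1990-01-01"},
    {"name": "B", "dob": "1985-06-15"},
    {"name": "C", "dob": "2001-12-31"},
]
swapped = swap_column(people, "dob")
# The same dates of birth still appear, but attached to different people
```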
Data perturbation
Data perturbation modifies the original dataset slightly by
applying techniques that round numbers and add random
noise.
The range of values needs to be in proportion to the
perturbation.
A small base may lead to weak anonymization while a large
base can reduce the utility of the dataset.
For example:
use a base of 5 for rounding values like age or house number,
because it is proportional to the original values.
However, a higher base like 15 may make age values seem
fake, although it can still work for values like house numbers.
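Rounding plus random noise can be sketched as follows (the parameters are illustrative):

```python
import random

def perturb(value, base=5, noise=2, seed=None):
    """Round to the nearest multiple of `base`, then add a small
    random shift of at most +/- `noise`."""
    rng = random.Random(seed)
    rounded = base * round(value / base)
    return rounded + rng.randint(-noise, noise)

print(perturb(27, base=5, noise=0))  # 25: rounding only
# With noise=2 the result lands somewhere in the range 23..27
print(perturb(27, base=5, noise=2, seed=1))
```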
Synthetizing data
Synthetic data is algorithmically manufactured
information that has no connection to real events.
Synthetic data is used to create artificial datasets
instead of altering the original dataset (to avoid using
the original as is and risking privacy and security)
The process involves creating statistical models based
on patterns found in the original dataset.
The algorithm can use standard deviations, medians,
linear regression or other statistical techniques to
generate the synthetic data.
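As a minimal sketch, the "statistical model" can be as simple as a fitted mean and standard deviation from which new values are sampled (real synthetic-data tools model much richer structure):

```python
import random
import statistics

def synthesize(values, n, seed=0):
    """Fit a simple normal model (mean, stdev) to the original values
    and sample a fresh synthetic dataset from it."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_ages = [21, 25, 28, 27, 18, 24, 30, 22]
fake_ages = synthesize(real_ages, n=100)
# fake_ages follows the approximate distribution of real_ages,
# but none of its entries corresponds to a real individual
```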
Anonymization examples
Replace identifiers with randomly generated values
Eg: “Jane Krakowski” -> “Patient6479”
Abstraction: replace values by ranges
Eg: Check-in date: 3/1/16 -> Check-in date: Spring 2016
Eg: Replace zip code by state
Cluster data points and replace individuals by their cluster centroid
Eg: Ages 21, 25, 28, 27, 18 -> 5 individuals with a nominal age of 24
Remove values
Eg: Omit birth date
etc.
Disadvantages of
Data Anonymization
In some cases when data is anonymized, it becomes
unusable for the intended purpose
For example, the GDPR states that websites must obtain
consent from users to collect personal information such
as IP addresses, device IDs, and cookies.
Collecting anonymous data and deleting identifiers from
the database limit your ability to derive value and insight
from your data.
For example, anonymized data cannot be used for
marketing efforts or to personalize the user experience
(search engines, recommender systems, etc.).
Problems with
Anonymization Techniques
Limited use for research
Too coarse-grained
Re-identification
Re-identification is often trivial
E.g., an anonymized list of admitted students showing
undergraduate university and average GPA
Re-identification is possible with high certainty in many cases
by linking the anonymized dataset with other public data
Examples of Re-Identification through
Linking Data: (III) Behavior Patterns
Four spatiotemporal points are enough to uniquely re-identify 90% of individuals
Even data sets that provide coarse information for all dimensions provide little
anonymity
http://science.sciencemag.org/content/347/6221/536.full
Addressing the Problems of
Simple Anonymization Techniques
Addressing Anonymization Problems:
k-Anonymity
A dataset has k-anonymity if every combination of
identifying (quasi-identifier) values is shared by at least k individuals
[Example table with k=2: each combination of identifying values appears in at least two records]
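Checking k-anonymity amounts to finding the smallest group of records that share the same quasi-identifier values (a toy illustration with assumed column names):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    the same values for the quasi-identifying attributes."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

data = [
    {"zip": "1000", "age": "20-29", "disease": "flu"},
    {"zip": "1000", "age": "20-29", "disease": "cancer"},
    {"zip": "2000", "age": "30-39", "disease": "flu"},
    {"zip": "2000", "age": "30-39", "disease": "flu"},
]
print(k_anonymity(data, ["zip", "age"]))  # 2: every group has at least 2 records
```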
Addressing Anonymization Problems:
l-Diversity
A dataset has l-diversity if the individuals that share the same
identifying values have at least l distinct values for the sensitive
attribute
[Example table with l=1: a group that shares identifying values has only a single distinct sensitive value]
Addressing Anonymization Problems:
t-Closeness
A dataset has t-closeness if, for the individuals that share the
same identifying values, the distribution of the sensitive
attribute is within a threshold t of its distribution in the
dataset as a whole
Differential Privacy
Differential privacy is the only approach that provides formal
mathematical guarantees of privacy
Main problem addressed: Taking an individual I off a
dataset reveals their sensitive attribute information
Eg: retrieving aggregate data before removal, then retrieving
aggregate data after removal, and then comparing the
difference will give us the sensitive attribute of I
Main idea: Differential privacy adds “noise” to the
retrieval process so that such comparisons do not give us
the actual sensitive attribute information
The “noise” should be mathematically defined for the data
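A common concrete instance is the Laplace mechanism for counting queries, sketched below (the helper name and dataset are illustrative):

```python
import random

def private_count(records, predicate, epsilon, seed=None):
    """Laplace mechanism: true count plus Laplace noise of scale
    sensitivity/epsilon. A counting query has sensitivity 1, since
    adding or removing one individual changes the count by at most 1."""
    rng = random.Random(seed)
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(1) samples is Laplace(0, 1)
    noise = (rng.expovariate(1.0) - rng.expovariate(1.0)) / epsilon
    return true_count + noise

ages = [23, 31, 45, 52, 29, 38, 41, 27]
noisy = private_count(ages, lambda a: a > 30, epsilon=1.0, seed=42)
# The true count is 5; the released value is randomly perturbed,
# so comparing results before and after removing one person
# no longer reveals that person's value
print(noisy)
```

A smaller epsilon means more noise and stronger privacy, at the cost of less accurate answers.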
Summary: Threats to Privacy
Privacy requirements are not well articulated
People want benefits in exchange for data
Unclear that we are able to limit collection and
publication
Unique behavior of people (we don’t read legal contracts)
Human error, not without consequences
Large amount of sensitive data about individuals is
readily available in the open web
Open web already contains sensitive information that should
not be available and violates privacy acts
Lots of commercial data with personal information is for sale
Limited understanding of anonymization and other
privacy technologies
Linking to public datasets can re-identify individuals
Societal Value of Data and
Data Science
Granting Access to Private Records:
Health Information
Anonymized information is often not useful for research
Too coarse grained
Private information has great value
Tradeoff with quality of treatment
Incentivized through first access to new treatments
Altruism
Giving up privacy for pre-specified uses
Eg: for specific medical study, not for insurance purposes,
not for employers, not for social studies
There is zero privacy anyway, get over it
Although you can upload your data using a pseudonym, there is no way to
anonymously submit data. Statistically speaking, it is really unlikely that your
medical and genetic information matches that of someone else. By uploading,
you do not only disclose information about yourself, but also about your next
of kin (parents and siblings), who share half of a genome with you. Before
uploading any genetic data you should make sure that those people approve of
you doing so. This is especially important if you have a monozygotic twin, who
shares all of your genome!
Thank You