
Data Science and Ethical Issues
Dereje Teferi
[email protected]
Data Privacy in Big Data
Large amounts of data are being collected about almost every aspect of our lives.
Technology has made the collection of these data easier than ever before.
As such, data privacy has become one of the main ethical issues with this data (big data).
Identity theft, fraud, and discrimination are just a few of the severe problems that people may face as a result of the collection and storage of personal information.
The Rise of Privacy Concerns
Science:
 benefits of sharing clinical patient records
 patients should control access to their records
 patients are found to be altruistic/selfless: willing to grant access for the purpose of advancing science
Government:
 government and commercial use of data mining raises concerns about appropriate use of private citizen information
 e.g., data collected for the purpose of airline passenger screening should not be used for the enforcement of other criminal laws
Open Web:
 many users are happy to share private details on social webs
 but would be rightfully upset if this data were used for other purposes
 content is shared between networks in ways that are not very transparent to the user
 users need to be reassured about appropriate use of their data
Sensitive Data (PII)
Sensitive data, to begin with, includes Personally Identifiable Information (PII).
PII covers crucial data like names, social security numbers, ID numbers, addresses, and telephone numbers, which are essential for the identification of individuals, whether they are customers, employees, or patients.
Sensitive Data
[Example table: each record pairs identifying values with a sensitive attribute]
Sensitive Data and Privacy
Sensitive data: data about individuals and organizations that should not be freely disseminated and publicized:
 Health
 Education
 Finance
 Demography
 Criminal
 Location
 Behavior
 Family, etc.
Privacy: the desire to limit the dissemination of sensitive data.
Lots of technology, but:
 Unclear requirements
 Unclear behaviors
 Unclear laws
A Case: Hunter College High School
On a September morning in 2013, the students at Hunter College High School filed past security and into the hallways, to find that their school had been labeled the saddest place in Manhattan.

https://medium.com/memo-random/turning-data-around-7acea1f7479c
Five months earlier…
Researchers in Cambridge, Massachusetts had pulled more than six hundred thousand tweets from Twitter’s public API and fed them through sentiment analysis routines.
If a tweet contained words that were deemed to be sad (maybe “cry” or “frown” or “miserable”), an emotional mark for sadness would be placed on a map of the city.

https://medium.com/memo-random/turning-data-around-7acea1f7479c
Turning Data Around
The world flows in one direction: data comes from us, but it rarely returns to us.
The systems that we’ve created are designed to be unidirectional: data is gathered from people, processed by an assembly line of algorithmic machinery, and spit out to an audience of different people: surveillors and investors and academics and data scientists.
Data is not collected for high school students, but for people who want to know how high school students feel. This new data reality is from us, but it isn’t for us.
How can we turn data around? How can we build new data systems that start as two-way streets, and put the individuals from whom the data comes first?

https://medium.com/memo-random/turning-data-around-7acea1f7479c
OECD’s Eight Principles of Fair Information Practices
A framework for privacy protection that protects use:
 Collection for a purpose
 Use only for the authorized purpose
 Accountability throughout these principles

Yolanda Gil [email protected]

[Diagram: the data science lifecycle: Define questions → Collect/find data → Store data → Extract data → Pre-process → Analyze data → Present results → Publish data]
[Diagram: the data lifecycle, annotated at the collection stage]
Institutional Review Board:
 Provisions for collection, storage, processing, and dissemination of sensitive data
[Diagram: the data lifecycle, annotated at the collection stage]
 Consent
 State purpose/use
 Decent quality
 Allow corrections
[Diagram: the data lifecycle, annotated at the storage stage]
 Physical safety
 Personnel training
 Access control
 Encryption
[Diagram: the data lifecycle, annotated at the use/analysis stage]
 Limit data use based on the purpose expressed in the original consent
 Secure data transmission
 Anonymization
General Data Protection Regulation
https://gdpr.eu/what-is-gdpr/
The GDPR, which became effective in 2018, is considered by many to be the world’s most comprehensive data privacy regulation because of its wide scope of application.
Many organizations have chosen to implement GDPR as their global data privacy standard.
Key points of GDPR include:
 Establish data privacy as a fundamental human right, including the individual’s right to access, correct, erase, or port his or her personal data.
 Strengthen baseline requirements and define roles and responsibilities for ensuring personal data protection.
 Provide standardized application of data protection rules across the EU, thereby facilitating the legitimate flow of personal data within and beyond the EU and European Economic Area (EEA).
Personal Data as per GDPR
 Personal data is anything, alone or in combination with something else, which can identify a living individual.
 Some examples of personal data include:
 Personally identifiable information (PII), such as name, national identification number, social security number, email address, telephone number, or home address.
 Online identifiers such as IP addresses.
 Device identifiers such as MAC addresses.
 Location data.
 GDPR further identifies special categories of personal data which, when processed, require additional protections:
 Biometric data such as DNA, fingerprints, or facial recognition images.
 Genetic characteristics.
 Health data, including records of physical/mental conditions and healthcare codes.
GDPR Rights: What are a Data Subject’s Rights?
The GDPR grants data subjects the following basic rights:
 The right to be informed
 The right of access, including the ability to request a copy of the data
 The right to rectification (correction)
 The right to erasure, including the “right to be forgotten”
 The right to restrict processing
 The right to data portability
 The right to object
 The right not to be subject to automated decision making
Data anonymization
Data anonymization is the process of protecting private or sensitive information by erasing or encrypting identifiers that connect an individual to stored data. (https://www.imperva.com/)
Data anonymization has been defined as a “process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party”. (Wikipedia)
For example, we can run Personally Identifiable Information (PII) through a data anonymization process that retains the data but keeps the source anonymous.
Techniques for Data Anonymization
Several data anonymization techniques exist, and more are still being researched.
The most common ones are:
Data masking
Pseudonymization
Generalization
Data swapping
Data perturbation
Synthetization of data
Data Masking
Masking hides data behind altered values.
First create a mirror version of the database, then apply modification techniques such as character shuffling, encryption, and word or character substitution.
For example, you can replace a value character with a symbol such as “*” or “x”.
Done properly, data masking makes reverse engineering or detection of the original values infeasible.
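The character-substitution variant of masking can be sketched in a few lines of Python (a minimal illustration; the function name, masking rules, and record are invented for this example, not a standard API):

```python
def mask_value(value: str, keep_last: int = 2, mask_char: str = "*") -> str:
    """Replace all but the last `keep_last` characters with a mask symbol."""
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]

# Mask fields in a mirrored record before sharing it.
record = {"name": "Jane Krakowski", "phone": "0911556677"}
masked = {k: mask_value(v) for k, v in record.items()}
print(masked["phone"])  # ********77
```

Keeping the last couple of characters visible is a common compromise: the value stays recognizable enough for support or testing work, while the bulk of it is hidden.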
Pseudonymization
Pseudonymization is a data management and de-identification method that replaces private identifiers with fake identifiers or pseudonyms, for example replacing the identifier “Dereje Teferi” with “Alex J”.
Pseudonymization preserves statistical accuracy and data integrity.
It allows the modified data to be used for training, model development, testing, and analytics while protecting data privacy.
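One common way to generate stable pseudonyms is a salted hash, sketched below (the function name, salt, and "Subject-" prefix are assumptions for illustration; note that whoever holds the salt can re-link pseudonyms, which is why pseudonymized data is still personal data under GDPR):

```python
import hashlib

def pseudonymize(identifier: str, salt: str) -> str:
    """Derive a stable pseudonym from an identifier using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return "Subject-" + digest[:8]

# The same input always maps to the same pseudonym, so joins across
# tables still work even though the real name never appears.
alias = pseudonymize("Dereje Teferi", salt="secret-pepper")
assert alias == pseudonymize("Dereje Teferi", salt="secret-pepper")
print(alias)
```

Determinism is the point: it is what preserves the statistical structure of the data, but it is also why a leaked salt (or a guessable identifier space) undoes the protection.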
Generalization
Generalization deliberately removes some of the data to make it less identifiable.
Data can be modified into a set of ranges or a broad area with appropriate boundaries.
For example, you can remove the house number in an address, but make sure you don’t remove the area (Woreda).
The purpose of generalization is to eliminate some of the identifiers while retaining a measure of data accuracy.
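Both forms of generalization mentioned above, ranges and coarser areas, can be sketched as follows (the function names, bin width, and address format are assumptions for illustration):

```python
def generalize_age(age: int, bin_width: int = 10) -> str:
    """Replace an exact age with a range, e.g. 27 -> '20-29'."""
    low = (age // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"

def generalize_address(address: str) -> str:
    """Drop the leading house-number part but keep the area (Woreda)."""
    parts = address.split(",")
    return ",".join(parts[1:]).strip() if len(parts) > 1 else address

print(generalize_age(27))                                     # 20-29
print(generalize_address("House 12, Woreda 5, Addis Ababa"))  # Woreda 5, Addis Ababa
```

The bin width controls the privacy/utility tradeoff: wider bins hide more but blur analyses that depend on fine-grained values.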
Data swapping
Data swapping, also known as shuffling or permutation, rearranges the dataset’s attribute values so they no longer correspond with the original records.
Swapping attributes (columns) that contain identifying values, such as date of birth, has more impact on anonymization than swapping values such as membership type.
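Swapping one column can be sketched like this (the function name and record layout are invented; a fixed seed is used only to keep the example reproducible):

```python
import random

def swap_column(records, column, seed=0):
    """Shuffle one attribute across records so its values no longer
    line up with their original rows (swapping/permutation)."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [dict(r, **{column: v}) for r, v in zip(records, values)]

people = [
    {"name": "A", "dob": "1990-01-01"},
    {"name": "B", "dob": "1985-06-15"},
    {"name": "C", "dob": "2001-11-30"},
]
swapped = swap_column(people, "dob")
# The multiset of birth dates is preserved, only the row pairing is broken.
```

Because the column’s overall distribution is unchanged, column-level statistics survive, but any analysis that relies on the link between name and date of birth is deliberately destroyed.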
Data perturbation
Data perturbation modifies the original dataset slightly by applying techniques that round numbers and add random noise.
The range of values needs to be in proportion to the perturbation:
 a base that is too small leads to weak anonymization, while a base that is too large can reduce the utility of the dataset.
For example:
 use a base of 5 for rounding values like age or house number, because it is proportional to the original values.
 multiply a house number by 15 and the value may retain its credence.
 however, using a higher base like 15 for ages can make the values seem fake.
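A minimal rounding-plus-noise sketch (the function name and parameters are assumptions; fixed seeds keep the example reproducible):

```python
import random

def perturb(value: float, base: int = 5, noise: float = 1.0, seed=None) -> int:
    """Round to the nearest multiple of `base`, then add small random noise."""
    rng = random.Random(seed)
    rounded = base * round(value / base)
    return rounded + round(rng.uniform(-noise, noise))

ages = [21, 25, 28, 27, 18]
perturbed = [perturb(a, base=5, noise=1.0, seed=i) for i, a in enumerate(ages)]
# Each value stays within base/2 + noise of the original, so aggregates
# (mean, rough distribution shape) remain approximately intact.
```

With base 5 and noise ±1 every perturbed age is at most 3.5 away from the true value, which illustrates the tradeoff on the slide: a larger base would hide more but distort the data further.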
Synthetizing data
Synthetic data is algorithmically manufactured information that has no connection to real events.
Synthetic data is used to create artificial datasets instead of altering the original dataset (to avoid using the original as-is and risking privacy and security).
The process involves creating statistical models based on patterns found in the original dataset.
The algorithm can use standard deviations, medians, linear regression, or other statistical techniques to generate the synthetic data.
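The simplest version of this idea fits a distribution to the real data and samples from it. The sketch below assumes an approximately normal distribution, which is a strong simplification; real synthetic-data tools model joint distributions across many columns:

```python
import random
import statistics

def synthesize(real_values, n, seed=0):
    """Generate n synthetic values from the mean/stdev of the real data,
    assuming an approximately normal distribution."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_ages = [21, 25, 28, 27, 18, 33, 40, 22]
fake_ages = synthesize(real_ages, n=100)
# The synthetic sample tracks the original distribution, but no synthetic
# record corresponds to any real individual.
```

Because the output is drawn from a model rather than copied from records, releasing it avoids the re-identification risks of releasing (even masked) originals, at the cost of fidelity to patterns the model did not capture.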
Anonymization examples
Replace identifiers with randomly generated values
 Eg: “Jane Krakowski” -> “Patient6479”
Abstraction: replace values by ranges
 Eg: Check-in date: 3/1/16 -> Check-in date: Spring 2016
 Eg: Replace zip code by state
Cluster data points and replace individuals by their cluster centroid
 Eg: Ages 21, 25, 28, 27, 18 -> 5 individuals with nominal age of 24
Remove values
 Eg: Omit birth date
etc.
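The cluster-centroid example can be reproduced in a few lines (a minimal sketch; a real pipeline would first group records into clusters, e.g. with k-means, rather than treating the whole list as one cluster):

```python
# Replace each individual's age with the (rounded) centroid of their cluster.
ages = [21, 25, 28, 27, 18]
centroid = round(sum(ages) / len(ages))  # 119 / 5 = 23.8 -> 24
anonymized = [centroid] * len(ages)
print(anonymized)  # [24, 24, 24, 24, 24]
```

Every individual in the cluster becomes indistinguishable on this attribute, while the cluster-level average is preserved exactly.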
Disadvantages of Data Anonymization
In some cases, once data is anonymized it becomes unusable for the intended purpose.
For example, GDPR states that websites must obtain consent from users to collect personal information such as IP addresses, device IDs, and cookies.
Collecting anonymous data and deleting identifiers from the database limits your ability to derive value and insight from your data.
For example, anonymized data cannot be used for marketing efforts, or to personalize the user experience (search engines, recommender systems, etc.).
Problems with Anonymization Techniques
Limited use for research
 Too coarse-grained
Re-identification
 Re-identification is often trivial
 E.g., an anonymized list of admitted students showing undergraduate university and average GPA
 Re-identification is possible with high certainty in many cases by linking the anonymized dataset with other public data that is not anonymized
Examples of Re-Identification through Linking Data: (III) Behavior Patterns
 Four spatiotemporal points are enough to uniquely re-identify 90% of individuals
 Even data sets that provide coarse information for all dimensions provide little anonymity

http://science.sciencemag.org/content/347/6221/536.full
Addressing the Problems of Simple Anonymization Techniques
Provide guarantees that re-identification will not be possible within some bounds
 Eg: a given individual can only be mapped to a set of 50 individuals
The main approaches are:
1. k-anonymization
2. l-diversity
3. t-closeness
4. Differential privacy
Addressing Anonymization Problems: k-Anonymity
A dataset has k-anonymity if every combination of identifying values is shared by at least k individuals.
[Example table: k = 2, since each combination of identifying values appears at least twice]
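The property can be checked mechanically: group records by their quasi-identifier values and take the smallest group size. A minimal sketch (field names and the toy dataset are invented for illustration):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k for which the dataset is k-anonymous: the size of the
    smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

data = [
    {"age_range": "20-29", "zip": "100**", "disease": "flu"},
    {"age_range": "20-29", "zip": "100**", "disease": "cold"},
    {"age_range": "30-39", "zip": "101**", "disease": "flu"},
    {"age_range": "30-39", "zip": "101**", "disease": "asthma"},
]
print(k_anonymity(data, ["age_range", "zip"]))  # 2
```

Note that the sensitive attribute (`disease`) is deliberately excluded from the grouping: k-anonymity only constrains the identifying columns.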
Addressing Anonymization Problems: l-Diversity
A dataset has l-diversity if the individuals that share the same identifying values have at least l distinct values for the sensitive attribute.
[Example table: l = 1, since one group of matching records has only a single sensitive value]
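A checker for this property is equally short (field names and dataset invented for illustration). Here the second group has only one distinct sensitive value, so despite being 2-anonymous the dataset is merely 1-diverse, which is exactly the attack l-diversity was introduced to prevent:

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Return the l for which the dataset is l-diverse: the smallest number
    of distinct sensitive values within any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

data = [
    {"age_range": "20-29", "zip": "100**", "disease": "flu"},
    {"age_range": "20-29", "zip": "100**", "disease": "cold"},
    {"age_range": "30-39", "zip": "101**", "disease": "flu"},
    {"age_range": "30-39", "zip": "101**", "disease": "flu"},
]
print(l_diversity(data, ["age_range", "zip"], "disease"))  # 1
```

An attacker who knows someone falls in the 30-39/101** group learns their diagnosis with certainty, even though two records share those identifying values.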
Addressing Anonymization Problems: t-Closeness
A dataset has t-closeness if, for the individuals that share the same identifying values, the distribution of the sensitive attribute is within a threshold t of its distribution in the whole dataset.
• The threshold is mathematically defined for the data
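One way to make the threshold concrete is to compare each group's sensitive-value distribution against the overall one. The sketch below uses total-variation distance as a simple stand-in for the Earth Mover's Distance used in the original t-closeness formulation (field names and dataset are invented for illustration):

```python
from collections import Counter, defaultdict

def t_closeness(records, quasi_identifiers, sensitive):
    """Largest total-variation distance between any group's sensitive-value
    distribution and the overall distribution (a simplification of EMD)."""
    overall = Counter(r[sensitive] for r in records)
    n = len(records)
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive])
    worst = 0.0
    for values in groups.values():
        group_counts = Counter(values)
        tv = 0.5 * sum(abs(group_counts[v] / len(values) - overall[v] / n)
                       for v in overall)
        worst = max(worst, tv)
    return worst

data = [
    {"age_range": "20-29", "disease": "flu"},
    {"age_range": "20-29", "disease": "flu"},
    {"age_range": "30-39", "disease": "flu"},
    {"age_range": "30-39", "disease": "cold"},
]
print(t_closeness(data, ["age_range"], "disease"))  # 0.25
```

A dataset satisfies t-closeness when this worst-case distance stays below the chosen threshold t; smaller t means group-level distributions reveal less beyond what the whole dataset already shows.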
Differential Privacy
This is the only method among these that provides mathematical guarantees of anonymity.
Main problem addressed: taking an individual I off a dataset reveals their sensitive attribute information.
 Eg: retrieving aggregate data before removal, then retrieving aggregate data after removal, and comparing the difference will give us the sensitive attribute of I.
Main idea: differential privacy adds “noise” to the retrieval process so that such comparisons do not give us the actual sensitive attribute information.
 The “noise” is mathematically calibrated to the data and the query.
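The noisy-retrieval idea can be illustrated with the Laplace mechanism for a counting query (the function name and toy dataset are invented; real deployments should use vetted differential-privacy libraries rather than hand-rolled noise):

```python
import math
import random

def dp_count(records, predicate, epsilon, seed=None):
    """Noisy count with Laplace noise of scale 1/epsilon, matching the
    sensitivity of a counting query (one record changes the count by 1)."""
    rng = random.Random(seed)
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) by inverse-CDF from Uniform(-0.5, 0.5).
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

patients = [{"hiv": True}, {"hiv": False}, {"hiv": True}, {"hiv": False}]
noisy = dp_count(patients, lambda r: r["hiv"], epsilon=0.5, seed=42)
# Comparing noisy counts before and after removing one individual no
# longer reveals that individual's sensitive value with certainty.
```

Smaller epsilon means larger noise and stronger privacy; the guarantee is that the answer's distribution changes very little whether or not any single individual is in the dataset, which is exactly what defeats the before/after comparison attack described above.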
Summary: Threats to Privacy
Privacy requirements are not well articulated.
People want benefits in exchange for data.
It is unclear that we are able to limit collection and publication:
 Unique behavior of people (we don’t read legal contracts)
 Human error, not without consequences
A large amount of sensitive data about individuals is readily available on the open web:
 The open web already contains sensitive information that should not be available and that violates privacy acts
 Lots of commercial data with personal information is for sale
Limited understanding of anonymization and other privacy technologies:
 Linking to public datasets can re-identify individuals
Societal Value of Data and Data Science
Granting Access to Private Records: Health Information
Anonymized information is often not useful for research:
 Too coarse-grained
Private information has great value:
 Tradeoff with quality of treatment
 Incentivized through first access to new treatments
 Altruism
Giving up privacy for pre-specified uses:
 Eg: for a specific medical study, not for insurance purposes, not for employers, not for social studies
“There is zero privacy anyway, get over it”:
 Although you can upload your data using a pseudonym, there is no way to anonymously submit data. Statistically speaking, it is really unlikely that your medical and genetic information matches that of someone else. By uploading, you disclose information not only about yourself but also about your next of kin (parents and siblings), who share half of a genome with you. Before uploading any genetic data you should make sure that those people approve of you doing so. This is especially important if you have a monozygotic twin, who shares all of your genome!
41
Thank You
