Data Science and Ethical Issues
Dereje Teferi
[email protected]
Data Privacy in Big Data
A large amount of data is being collected about
almost every aspect of our lives.
Technology has made the collection of these data
easier than ever before.
As such, data privacy has become one of the main
ethical issues with this data (big data).
Identity theft, fraud, and discrimination are just a
few of the severe problems that people may face as
a result of the collection and storage of personal
information.
The Rise of Privacy Concerns
Science:
benefits of sharing clinical patient records
patients shall control access to their records
patients found to be altruistic/selfless:
willing to grant access for purpose of advancing science
Government:
government and commercial use of data mining raises concerns about
appropriate use of private citizen information,
e.g., data collected for the purpose of airline passenger screening
should not be used for the enforcement of other criminal laws
Open Web:
many users are happy to share private details on social media,
but would be rightfully upset if this data were used for other purposes
Sensitive Data
[Table: example records showing identifying values alongside a sensitive attribute]
Sensitive Data and Privacy
Sensitive Data: data about individuals and organizations that should not be freely disseminated or publicized
Examples: health, education, finance, demography, criminal record, location, behavior, family, etc.
Privacy: the desire to limit the dissemination of sensitive data
Lots of technology exists, but:
Unclear requirements
Unclear behaviors
Unclear laws
A Case:
Hunter College High School
On a September morning in 2013, the students at
Hunter College High School filed past security and into
the hallways, to find that their school had been labeled
the saddest place in Manhattan
https://medium.com/memo-random/turning-data-around-7acea1f7479c
Five months earlier…
Researchers in Cambridge, Massachusetts had pulled
more than six hundred thousand tweets from Twitter’s
public API and fed them through sentiment analysis
routines.
If a tweet contained words that were deemed to be sad
(maybe “cry” or “frown” or “miserable”), an emotional
mark for sadness would be placed on a map of the city.
https://medium.com/memo-random/turning-data-around-7acea1f7479c
Turning Data Around
The world flows in one direction: data comes from us, but it
rarely returns to us.
The systems that we’ve created are designed to be
unidirectional: data is gathered from people, it’s processed by an
assembly line of algorithmic machinery, and spit out to an
audience of different people — surveillors and investors and
academics and data scientists.
Data is not collected for high school students, but for people
who want to know how high school students feel. This new data
reality is from us, but it isn’t for us.
How can we turn data around? How can we build new data
systems that start as two-way streets, and that put the
individuals from whom the data comes first?
https://medium.com/memo-random/turning-data-around-7acea1f7479c
OECD’s Eight Principles of Fair Information Practices
A framework for privacy protection:
Collection for a purpose
Use only for the authorized purpose
Protect use
Accountability throughout these principles

Institutional Review Board
Provisions for the collection, storage, processing, and dissemination of sensitive data

Consent
State the purpose/use
Data quality
Allow corrections

[Diagram: the data science lifecycle (define questions, collect/find data, extract data, pre-process, analyze data, present results, store data, publish data), annotated with privacy provisions at each stage]

Security measures:
Physical safety
Personnel training
Access control
Encryption
Data anonymization
Data anonymization is the process of protecting private
or sensitive information by erasing or encrypting identifiers
that connect an individual to stored data.
(https://www.imperva.com/)
Data anonymization has been defined as a "process by which
personal data is altered in such a way that a data subject can
no longer be identified directly or indirectly, either by the
data controller alone or in collaboration with any other party"
(Wikipedia)
For example, we can run Personally Identifiable Information
(PII) through a data anonymization process that retains the
data but keeps the source anonymous.
Techniques for Data Anonymization
Several data anonymization techniques exist, and
others are still under research.
The most common ones are:
Data masking
Pseudonymization
Generalization
Data Swapping
Data perturbation
Synthetization of data
Data Masking
Masking hides data by replacing it with altered values.
First create a mirror version of a database and apply
modification techniques such as character shuffling,
encryption, and word or character substitution.
For example, you can replace a value character with a
symbol such as “*” or “x”.
Data masking aims to make reverse engineering or detection
of the original values infeasible.
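A minimal sketch of these masking operations in Python (the function names and the sample phone number are illustrative, not from any particular library):

```python
import random

def mask_value(value, keep_last=0):
    """Substitution: replace all but the last `keep_last` characters with '*'."""
    visible = value[-keep_last:] if keep_last else ""
    return "*" * (len(value) - keep_last) + visible

def shuffle_chars(value, seed=42):
    """Character shuffling: scramble the characters of a value."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

print(mask_value("0911223344", keep_last=2))  # ********44
print(shuffle_chars("Dereje"))                # same characters, scrambled order
```

In practice a masked mirror database would apply such transformations column by column, keeping the production data untouched.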
Pseudonymization
Pseudonymization is a data management and de-identification
method that replaces private identifiers with fake identifiers,
or pseudonyms,
for example replacing the identifier “Dereje Teferi”
with “Alex J”.
Pseudonymization preserves statistical accuracy and
data integrity,
It allows the modified data to be used for training,
model development, testing, and analytics while
protecting data privacy.
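The replacement can be sketched as a consistent mapping from real identifiers to pseudonyms (the `Pseudonymizer` class below is a hypothetical illustration, not a standard API):

```python
class Pseudonymizer:
    """Replace real identifiers with consistent pseudonyms.
    The mapping is kept separately, so it can be stored securely
    or discarded once it is no longer needed."""

    def __init__(self):
        self._mapping = {}  # real identifier -> pseudonym

    def pseudonym(self, identifier):
        if identifier not in self._mapping:
            self._mapping[identifier] = f"Person{len(self._mapping) + 1}"
        return self._mapping[identifier]

p = Pseudonymizer()
print(p.pseudonym("Dereje Teferi"))  # Person1
print(p.pseudonym("Alex J"))         # Person2
print(p.pseudonym("Dereje Teferi"))  # Person1 again: the mapping is consistent
```

Because the same identifier always maps to the same pseudonym, statistical relationships in the data are preserved, which is what makes the result usable for analytics.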
Generalization
It deliberately removes some of the data to make it less
identifiable.
Data can be modified into a set of ranges or a broad
area with appropriate boundaries.
You can remove the house number in an address, but
make sure you don’t remove the area (Woreda).
The purpose of generalization is to eliminate some of
the identifiers while retaining a measure of data
accuracy.
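A sketch of the two generalizations mentioned above, assuming addresses are comma-separated with the house number first (both function names are illustrative):

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_address(address):
    """Drop the house number but keep the broader area (e.g. the Woreda)."""
    parts = address.split(",")
    return ",".join(parts[1:]).strip() if len(parts) > 1 else address

print(generalize_age(27))  # 20-29
print(generalize_address("House 123, Woreda 5, Addis Ababa"))  # Woreda 5, Addis Ababa
```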
Data swapping
Data swapping, also known as shuffling or permutation, is a
technique used to rearrange dataset attribute values so that
they no longer correspond with the original records.
It is most effective to swap attributes (columns) that contain
identifying values, such as date of birth, since these have more
impact on anonymization than, say, membership-type values.
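The idea can be sketched by shuffling one column independently of the rest (a toy illustration with an assumed record layout):

```python
import random

def swap_column(records, column, seed=0):
    """Shuffle one attribute across all records so its values no
    longer line up with their original rows."""
    values = [r[column] for r in records]
    random.Random(seed).shuffle(values)
    return [dict(r, **{column: v}) for r, v in zip(records, values)]

people = [
    {"name": "A", "dob": "1990-01-01"},
    {"name": "B", "dob": "1985-06-15"},
    {"name": "C", "dob": "2001-12-31"},
]
swapped = swap_column(people, "dob")
# The same dates of birth still appear, but attached to different people
```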
Data perturbation
Data perturbation modifies the original dataset slightly by
applying techniques that round numbers and add random
noise.
The range of values needs to be in proportion to the
perturbation.
A small base may lead to weak anonymization while a large
base can reduce the utility of the dataset.
For example:
use a base of 5 for rounding values like age or house number,
because it is proportional to the original values.
However, a higher base like 15 may make age values seem
fake, although it can still work for values like house numbers.
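Rounding plus random noise can be sketched as follows (the parameters are illustrative):

```python
import random

def perturb(value, base=5, noise=2, seed=None):
    """Round to the nearest multiple of `base`, then add a small
    random shift of at most +/- `noise`."""
    rng = random.Random(seed)
    rounded = base * round(value / base)
    return rounded + rng.randint(-noise, noise)

print(perturb(27, base=5, noise=0))  # 25: rounding only
# With noise=2 the result lands somewhere in the range 23..27
print(perturb(27, base=5, noise=2, seed=1))
```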
Synthetizing data
Synthetic data is algorithmically manufactured
information that has no connection to real events.
Synthetic data is used to create artificial datasets
instead of altering the original dataset (to avoid using
the original as is and risking privacy and security)
The process involves creating statistical models based
on patterns found in the original dataset.
The algorithm can use standard deviations, medians,
linear regression or other statistical techniques to
generate the synthetic data.
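As a minimal sketch, the "statistical model" can be as simple as a fitted mean and standard deviation from which new values are sampled (real synthetic-data tools model much richer structure):

```python
import random
import statistics

def synthesize(values, n, seed=0):
    """Fit a simple normal model (mean, stdev) to the original values
    and sample a fresh synthetic dataset from it."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_ages = [21, 25, 28, 27, 18, 24, 30, 22]
fake_ages = synthesize(real_ages, n=100)
# fake_ages follows the approximate distribution of real_ages,
# but none of its entries corresponds to a real individual
```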
Anonymization examples
Replace identifiers with randomly generated values
Eg: “Jane Krakowski” -> “Patient6479”
Abstraction: replace values by ranges
Eg: Check-in date: 3/1/16 -> Check-in date: Spring 2016
Eg: Replace zip code by state
Cluster data points and replace individuals by their cluster centroid
Eg: Ages 21, 25, 28, 27, 18 -> 5 individuals with a nominal age of 24
Remove values
Eg: Omit birth date
etc.
Disadvantages of
Data Anonymization
In some cases when data is anonymized, it becomes
unusable for the intended purpose
For example, the GDPR states that websites must obtain
consent from users to collect personal information such
as IP addresses, device IDs, and cookies.
Collecting anonymous data and deleting identifiers from
the database limit your ability to derive value and insight
from your data.
For example, anonymized data cannot be used for
marketing efforts or to personalize the user experience
(search engines, recommender systems, etc.).
Problems with
Anonymization Techniques
Limited use for research
Too coarse-grained
Re-identification
Re-identification is often trivial
E.g., an anonymized list of admitted students showing
undergraduate university and average GPA
Re-identification is possible with high certainty in many cases
by linking the anonymized dataset with other public data
Examples of Re-Identification through
Linking Data: (III) Behavior Patterns
Four spatiotemporal points are enough to uniquely re-identify 90% of individuals
Even data sets that provide coarse information for all dimensions provide little
anonymity
http://science.sciencemag.org/content/347/6221/536.full
Addressing the Problems of
Simple Anonymization Techniques
Addressing Anonymization Problems:
k-Anonymity
A dataset has k-anonymity if every combination of
identifying (quasi-identifier) values is shared by at least k individuals
[Example table with k=2: each combination of identifying values appears in at least two records]
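Checking k-anonymity amounts to finding the smallest group of records that share the same quasi-identifier values (a toy illustration with assumed column names):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    the same values for the quasi-identifying attributes."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

data = [
    {"zip": "1000", "age": "20-29", "disease": "flu"},
    {"zip": "1000", "age": "20-29", "disease": "cancer"},
    {"zip": "2000", "age": "30-39", "disease": "flu"},
    {"zip": "2000", "age": "30-39", "disease": "flu"},
]
print(k_anonymity(data, ["zip", "age"]))  # 2: every group has at least 2 records
```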
Addressing Anonymization Problems:
l-Diversity
A dataset has l-diversity if the individuals that share the same
identifying values have at least l distinct values for the sensitive
attribute
[Example table with l=1: a group that shares identifying values has only a single distinct sensitive value]
Addressing Anonymization Problems:
t-Closeness
A dataset has t-closeness if, for the individuals that share the
same identifying values, the distribution of the sensitive
attribute is within a threshold t of its distribution in the
dataset as a whole
Differential Privacy
Differential privacy is the only approach that provides formal
mathematical guarantees of privacy
Main problem addressed: Taking an individual I off a
dataset reveals their sensitive attribute information
Eg: retrieving aggregate data before removal, then retrieving
aggregate data after removal, and then comparing the
difference will give us the sensitive attribute of I
Main idea: Differential privacy adds “noise” to the
retrieval process so that such comparisons do not give us
the actual sensitive attribute information
The “noise” should be mathematically defined for the data
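A common concrete instance is the Laplace mechanism for counting queries, sketched below (the helper name and dataset are illustrative):

```python
import random

def private_count(records, predicate, epsilon, seed=None):
    """Laplace mechanism: true count plus Laplace noise of scale
    sensitivity/epsilon. A counting query has sensitivity 1, since
    adding or removing one individual changes the count by at most 1."""
    rng = random.Random(seed)
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(1) samples is Laplace(0, 1)
    noise = (rng.expovariate(1.0) - rng.expovariate(1.0)) / epsilon
    return true_count + noise

ages = [23, 31, 45, 52, 29, 38, 41, 27]
noisy = private_count(ages, lambda a: a > 30, epsilon=1.0, seed=42)
# The true count is 5; the released value is randomly perturbed,
# so comparing results before and after removing one person
# no longer reveals that person's value
print(noisy)
```

A smaller epsilon means more noise and stronger privacy, at the cost of less accurate answers.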
Summary: Threats to Privacy
Privacy requirements are not well articulated
People want benefits in exchange for data
Unclear that we are able to limit collection and
publication
Unique behavior of people (we don’t read legal contracts)
Human error, not without consequences
Large amount of sensitive data about individuals is
readily available in the open web
Open web already contains sensitive information that should
not be available and violates privacy acts
Lots of commercial data with personal information is for sale
Limited understanding of anonymization and other
privacy technologies
Linking to public datasets can re-identify individuals
Societal Value of Data and
Data Science
Granting Access to Private Records:
Health Information
Anonymized information is often not useful for research
Too coarse grained
Private information has great value
Tradeoff with quality of treatment
Incentivized through first access to new treatments
Altruism
Giving up privacy for pre-specified uses
Eg: for specific medical study, not for insurance purposes,
not for employers, not for social studies
There is zero privacy anyway, get over it
Although you can upload your data using a pseudonym, there is no way to
anonymously submit data. Statistically speaking, it is really unlikely that your
medical and genetic information matches that of someone else. By uploading,
you do not only disclose information about yourself, but also about your next
of kin (parents and siblings), who share half of a genome with you. Before
uploading any genetic data you should make sure that those people approve of
you doing so. This is especially important if you have a monozygotic twin, who
shares all of your genome!
Thank You