
Data Science & Ethics

Lecture 3

Data Gathering: privacy (continued), bias and experimentation

Prof. David Martens


[email protected]
www.applieddatamining.com
@ApplDataMining
AI Ethics in the News
2
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

3
Symmetric encryption
▪ One key used for encryption and decryption

[Diagram: Alice encrypts the plain text “Hello Bob” with the shared secret key (1),
the cipher text “Sà!3Lksd(” is sent to Bob (2), who decrypts it with the same secret key (3);
eavesdropper Eve only ever sees the cipher text.]
4
Symmetric encryption
▪ One key used for encryption and decryption
• Caesar cipher: 3-shift right
• Weakness: frequency of letters and starting/ending words
(“Dear”, “Yours sincerely”, etc.) or brute force attack

[Diagram: same setup with the Caesar cipher; Alice encrypts “Hello Bob” to the cipher text
“Khoor Ere”, Bob shifts back to recover “Hello Bob”, and Eve only sees “Khoor Ere”.]
5
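A minimal Python sketch of the 3-shift Caesar cipher above (illustrative only; real symmetric encryption uses algorithms such as AES):

```python
# Caesar cipher: shift every letter by a fixed number of positions in the alphabet.
def caesar(text: str, shift: int) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)          # leave spaces and punctuation untouched
    return "".join(result)

cipher = caesar("Hello Bob", 3)        # -> "Khoor Ere"
plain = caesar(cipher, -3)             # shifting back decrypts -> "Hello Bob"
print(cipher, "|", plain)
```

Because letter frequencies are preserved, the weaknesses mentioned above (frequency analysis, brute-forcing all 26 shifts) apply directly to this sketch.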
Symmetric encryption
▪ DES: Data Encryption Standard
• One of the first major standards in symmetric key encr.
• 56 bit key
• 2^56 ≈ 7 x 10^16 possible keys
• Flaw: too small as brute force attack would find key
▪ AES: Advanced Encryption Standard
• By Belgians Vincent Rijmen and Joan Daemen (1998)
• 128, 192 or 256 bit keys
• 2^128 ≈ 3 x 10^38 possible keys, considered safe in the current age
• New standard since late 90s
▪ Challenges
• How to share keys: insecure, or a lot of overhead
• How to manage keys: if u users need to communicate with one
another → need for (u-1) + (u-2) + … + 1 = u x (u-1) / 2 keys to be
shared before communicating 6
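A minimal sketch of symmetric encryption in practice, assuming the third-party Python `cryptography` package (its Fernet recipe is built on AES):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()                       # the single secret key Alice and Bob must share
cipher_text = Fernet(key).encrypt(b"Hello Bob")
plain_text = Fernet(key).decrypt(cipher_text)     # only works with the same key
print(plain_text)                                 # b'Hello Bob'

# Key-management overhead: every pair of users needs its own shared key.
u = 100
print(u * (u - 1) // 2)                           # 4950 keys for 100 users
```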
Asymmetric encryption
▪ Two keys: public and private key
• Public key: revealed to the world
• Private key: kept secret at one party

[Diagram: Alice encrypts the plain text “Hello Bob” with Bob’s public key (1),
the cipher text “Sà!3Lksd(” is sent to Bob (2), who decrypts it with his private key (3);
eavesdropper Eve only ever sees the cipher text.]
7
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Popular algorithm for asymmetric encryption
• Principle:
➢ Multiplying two large numbers is easy and fast
➢ Decomposing a large number into prime numbers: very difficult
➢ For example: 19 x 13 = ? Decompose 391 in 2 prime numbers?
➢ If the numbers are large enough: no efficient (non-quantum) integer
factorisation algorithm exists.
• Given p and q large prime numbers
➢ n = p x q, make this the public key, p and q are the private keys
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1)
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e mod ( (p-1)x(q-1) ) = 1 8
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1), e.g. 35
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e = 1 mod [ (p-1)x(q-1) ]
▪ d also private, as you need to know p and q to compute it
▪ m = c^d mod n
▪ Advantages
• Sharing keys: only public ones need to be shared (no need for secrecy)
• Manage keys: share only u keys among u users
▪ Disadvantage
• Takes more time than symmetric encryption 9
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1), e.g. 5
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e = 1 mod [ (p-1)x(q-1) ], so d x e = [k x (p-1)x(q-1) ] + 1
▪ m = c^d mod n
(can only be calculated if you know the decomposition into the prime factors p and q)
▪ Example
• p = 7, q = 3 ➔ n = ?
• Message is the letter “l” => 12th letter in the alphabet: so m = 12
• c = ? [3]
• For a k = 2, d = ?
• m = ?
10
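A minimal Python sketch working through the toy example above; the values follow directly from the formulas on the slide (real RSA uses primes of hundreds of digits plus padding):

```python
p, q = 7, 3
n = p * q                      # 21, the public modulus
phi = (p - 1) * (q - 1)        # 12, only computable if you can factor n
e = 5                          # public exponent, relatively prime to phi

m = 12                         # the message: "l", 12th letter of the alphabet
c = pow(m, e, n)               # encryption: c = m^e mod n  -> 3

d = (2 * phi + 1) // e         # for k = 2: d = (k*phi + 1)/e = 5, so d*e = 1 mod phi
m_back = pow(c, d, n)          # decryption: m = c^d mod n  -> 12
print(n, c, d, m_back)         # 21 3 5 12
```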
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key (slow)
• Subsequently symmetric encryption with agreed secret key
(fast)

[Diagram, client–server key exchange:]
1. The client generates a random number r
2. The client encrypts r with the public key of the server to c and sends c
3. The server decrypts c with its own private key to recover r
4. All subsequent communication uses fast symmetric encryption, using secret key r

11
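A rough sketch of this handshake, assuming the third-party Python `cryptography` package; real TLS negotiates keys in a more elaborate way, but the principle (slow asymmetric exchange of a fast symmetric key) is the same:

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

# 1. The server owns an asymmetric key pair; only the public key is shared.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public = server_private.public_key()

# 2. The client generates a random symmetric key r and encrypts it for the server.
r = Fernet.generate_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
c = server_public.encrypt(r, oaep)

# 3. The server recovers r with its private key.
r_server = server_private.decrypt(c, oaep)

# 4. All further traffic uses fast symmetric encryption with the shared key r.
token = Fernet(r_server).encrypt(b"all subsequent messages")
print(Fernet(r).decrypt(token))
```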
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key
• Subsequently symmetric encryption with agreed secret key

▪ Used in SSL and TLS protocols, widely used online


• Public key infrastructure: 3rd party Certificate Authority (CA),
such as Comodo, Let’s Encrypt
• https:// indicates this type of encryption working in the
background

12
Encryption for data protection
▪ Whenever you need to communicate online:
• Use TLS protocol
• Even taken into account as a ranking signal by search engines
▪ Whenever you store personal data:
• Encrypt the data
• Beyond USBs, laptops, PCs and smartphones: bikes, cars, etc.
• Cars:
➢ Personal data: address, routes, contacts, etc.
➢ FTC (US): advice to clean out personal data before selling car
➢ Tesla:
▪ Personal data is hidden when given the keys to a valet (valet mode)
▪ Revealed that some data is stored unencrypted

https://www.cnbc.com/2019/03/29/tesla-model-3-keeps-data-like-crash-videos-location-phone-contacts.html
14
Hashing
▪ Another useful cryptographic function
• Input: i
• Output: hash o, of always same length
▪ One way function:
• easy to go from input to hash, very hard to do the reverse
• Main difference with encryption
▪ MD5: designed by Ron Rivest (1992), based on the Merkle–Damgård construction
• Popular hashing algorithm
• Input: any string, output: 128 bit hash
https://www.md5hashgenerator.com/

▪ SHA-3
• Proposed by NIST (2015), output of up to 512 bits
• Developers: Bertoni, Daemen (him again), Peeters and Van Assche
15
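A quick illustration with Python's standard hashlib (the MD5 value of “123456” matches the password table a few slides further on):

```python
import hashlib

print(hashlib.md5(b"123456").hexdigest())       # e10adc3949ba59abbe56e057f20f883e (128-bit hash)
print(hashlib.sha3_256(b"123456").hexdigest())  # 256-bit SHA-3 digest
print(hashlib.sha3_256(b"123457").hexdigest())  # one character changed: a completely different hash
# The output always has the same fixed length, and there is no practical way
# back from the hash to the input (one-way function).
```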
Hashing
▪ Main use: integrity
• Input: mail m
• Output: hash h
• If someone changes a word: the modified mail will produce a
different hash

16
Hashing
▪ Simple example for text
• Take the position of each letter in the alphabet
• Sum all the integers
• Take the last digit of the sum as hash
• Try with “ball” and “bell”
➢ One-way, fixed length hash (of size 1)
➢ Many hash collisions

17
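A minimal sketch of this toy hash:

```python
# Sum the alphabet positions of the letters and keep only the last digit of the sum.
def toy_hash(word: str) -> int:
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower() if ch.isalpha())
    return total % 10                # fixed-length output: a single digit

print(toy_hash("ball"))              # 2+1+12+12 = 27 -> 7
print(toy_hash("bell"))              # 2+5+12+12 = 31 -> 1
# Easy to compute forwards, hard to invert, but many words share a digit: hash collisions.
```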
Hashing
▪ Hashing passwords
▪ Widely used to send passwords over the Internet
• Naive:
➢ if website doesn’t use TLS, or on a public wifi: password sent in
plain text (visible to eavesdropping Eve).
➢ In database: store username and password in plain text.
▪ If ever data leak: all passwords public
▪ Employees can snoop in the database

User Password

David Martens 123456

Jennifer Johnsen 123456

Latifa Jenkins p@ssword5

18
Hashing
▪ Hashing passwords
• Better:
➢ Hash password client side (so on the laptop/PC)
➢ Send the hashed value to server, and store this in database
➢ If ever leaked: hard to know what password was (one way)
➢ BUT: people tend to use the same passwords, so frequently used
passwords can still be identified

User Hash
David Martens e10adc3949ba59abbe56e057f20f883e
Jennifer Johnsen e10adc3949ba59abbe56e057f20f883e
Latifa Jenkins f3f092cd075b3e050451239611a9e1e9

19
Hashing for data protection
▪ Most popular passwords
Popular passwords Hash (MD5)
123456 e10adc3949ba59abbe56e057f20f883e
password cc3a0280e4fc1415930899896574e118
123456789 6ca5ac1929f1d27498575d75abbeb4d1
12345678 25d55ad283aa400af464c76d713c07ad
12345 827ccb0eea8a706c4c34a16891f84e7b
111111 96e79218965eb72c92a549dd5a330112
1234567 fcea920f7412b5da7be0cf42b8c93759
sunshine 0571749e2ac330a7455809c6b0e7af90
qwerty d8578edf8458ce06fbc5bb76a58c5ca4
iloveyou f25a2fc72690b780b2a14e140ef6a9e0
princess 8afa847f50a716e64932d995c8e7435a

▪ Eavesdropping Eve can check if the hash she observes occurs in this list
• e10adc3949ba59abbe56e057f20f883e: found! password found: 123456
• She can only check against a precomputed subset of all possible (popular) passwords: a “rainbow table” 20
Hashing for data protection

Interesting cultural links


https://en.wikipedia.org/wiki/List_of_the_most_common_passwords 21
Hashing for data protection
▪ Eavesdropping Eve can check if the hash she observes occurs in this list,
if so: password found
▪ Improve further by generating a random string per user: “salt”
▪ Hash the concatenation of salt + password
• Remaining issue: in case of a data leak, the salt is also revealed
• Hacker could create a rainbow table per user (salt + popular password)
• Eavesdropping Eve can check if the hash she observes occurs in this list
➢ 79e514abbfb414c5ae58b553fbb0ff00: found! password found: 123456, with salt
t0mR1aoPdp
➢ Needs many (x number of users) more rainbow tables than without a salt

User Salt Hash

David Martens t0mR1aoPdp 79e514abbfb414c5ae58b553fbb0ff00

Jennifer Johnsen rmsP5dof9y 896cab0975703f599fe0f7491c90062b

Latifa Jenkins 8LyqGp4cPm 8a945d114467eee63485fba9aa0bbaf0


22
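A minimal sketch of salted password hashing (illustrative only; production systems use dedicated password-hashing functions such as bcrypt, scrypt or Argon2 rather than plain MD5):

```python
import hashlib
import secrets

def hash_password(password, salt=None):
    """Hash a password together with a per-user random salt."""
    if salt is None:
        salt = secrets.token_urlsafe(8)          # random salt, stored next to the hash
    digest = hashlib.md5((salt + password).encode()).hexdigest()
    return salt, digest

salt, stored = hash_password("123456")
print(salt, stored)

# Eve now needs a separate rainbow table per salt; a dictionary check still works per user:
for guess in ["123456", "password", "qwerty", "iloveyou"]:
    if hash_password(guess, salt)[1] == stored:
        print("password found:", guess)
```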
Hashing for data protection
▪ Hashing passwords
• A standard, expected to be used
• Knuddels.de:
➢ German chat community which stored passwords in plain text,
not hashed
➢ Hacked: 800,000 email addresses and passwords published online
➢ GDPR fine of 20,000 €

23
Hashing for data protection
▪ To ensure no personal data is copied throughout data
processing system
• Many copies, local downloads
➢ If someone asks to remove all their personal data: cumbersome task,
never 100% sure
• Personal data often not explicitly needed
(e.g. password, exact name)
➔ Use personal data in one table, with hashed personal ID
➔ Use the hashed personal ID in all other parts of the system

24
Hashing for data protection
▪ To ensure no personal data is copied throughout data
processing system
▪ Using hashed values of personal data: pseudonymisation
• Keep hash table secure

25
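A minimal sketch of this idea: replace the personal ID by its hash everywhere, and keep the single mapping table secure (names and fields below are purely illustrative):

```python
import hashlib

def pseudonym(personal_id: str) -> str:
    """Hashed personal ID used everywhere downstream instead of the real identity."""
    return hashlib.sha256(personal_id.encode()).hexdigest()

# The only table linking pseudonym to person: keep it separate and protected.
secure_lookup = {pseudonym(name): name
                 for name in ["David Martens", "Jennifer Johnsen", "Latifa Jenkins"]}

# All other parts of the system work with the pseudonym only:
record = {"user": pseudonym("David Martens"), "purchase": 42.0}
print(record)
# Note: a plain hash of a guessable ID can be brute-forced, hence "keep hash table secure";
# a keyed hash (HMAC) or random surrogate ID offers stronger protection.
```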
Encryption vs hashing
▪ Both useful for storing and analysing data

26
Discussion Case 2

27
Quantum Computing
▪ Bit: 0 or 1
▪ Qubit: 0 and 1 at the same time

[Diagram: a bit stores either 0 or 1; a qubit stores 0 and 1 at the same time.]

28
Quantum Computing
▪ Schrödinger’s cat
• Cat in a closed box, 50% probability it dies
• Both dead and alive, until someone opens the box and
observes the cat
➢ This is what actually happens at the atomic level

30
https://www.youtube.com/watch?v=QuR969uMICM 31
Quantum Computing
▪ Superposition
• Two states at the same time
• Superposition collapses when the state is observed
• E.g. photon or spin of electron
▪ Bit: can take 2 values: 0 or 1
▪ Qubits:
• two qubits can be in 4 states at the same time: 00, 01, 10 and 11
• n qubits can store 2^n states at the same time
• Qubits are “entangled”
➢ Observing one qubit, instantaneously reveals the other, even if at
the other side of the galaxy
➢ “spooky action at a distance” (Albert Einstein)
➢ Proven, 2017: entangled qubits 1,200 km apart
32
Quantum Computing
▪ Quantum computing
• Represent a problem and add a way to assess the answer
• Quantum decoherence of qubits such that only answers that
pass the test survive
▪ For personal data protection?
• Shor’s algorithm for factoring large numbers
• Would “crack” the popular RSA algorithm for asymmetric
encryption
▪ Current quantum computers
• Few dozen qubits
• Need thousands of qubits to break RSA
• IBM SVP of cloud and cognitive software (2019):
“within a decade”
33
Quantum Computing
▪ For personal data protection? (bis)
• Data in quantum state cannot be copied unnoticed
• Copying requires observing the data: eavesdropping detected

▪ And beyond: Quantum Machine Learning

34
Quantum computing poses a potential threat to many of the encryption methods currently used to secure sensitive data. Here’s an overview of why quantum
computing is such a game-changer for encryption:

### 1. How Current Encryption Works


Most encryption methods rely on the difficulty of solving certain mathematical problems with classical computers. Here are two common types:
- **Symmetric Encryption** (e.g., AES): Data is encrypted and decrypted with the same key. Its security relies on brute-force resistance, where it would take
classical computers an infeasible amount of time to try every possible key.
- **Asymmetric Encryption** (e.g., RSA, ECC): Uses a pair of public and private keys. It relies on hard-to-solve mathematical problems like factoring large
numbers (RSA) or solving discrete logarithms (ECC), which would take classical computers an extremely long time to crack.

### 2. How Quantum Computing Affects Encryption


Quantum computers process information differently than classical computers. Two main quantum algorithms have implications for encryption:
- **Shor’s Algorithm**: Efficiently factors large numbers and solves discrete logarithms, which are the foundations of RSA and ECC encryption. A sufficiently
powerful quantum computer running Shor’s algorithm could break these encryption methods quickly by solving problems that classical computers would take
years (or even centuries) to solve.
- **Grover’s Algorithm**: Speeds up the search through an unstructured database, allowing quantum computers to perform brute-force attacks faster. For
symmetric encryption like AES, Grover’s algorithm effectively halves the security strength (e.g., AES-256 would be as secure as AES-128 against a quantum
brute-force attack).

### 3. Why This Jeopardizes Current Encryption


With quantum algorithms, encryption methods we use today could become vulnerable:
- **Asymmetric Encryption at Risk**: RSA and ECC encryption, used in everything from internet traffic to digital signatures, would be easily broken by a
powerful enough quantum computer running Shor’s algorithm. This means that data protected with RSA or ECC could be decrypted if intercepted and stored
by attackers until a suitable quantum computer becomes available.
- **Symmetric Encryption is Safer, But Not Immune**: Grover’s algorithm means that symmetric keys need to be longer to maintain the same level of
security (e.g., moving from AES-128 to AES-256).

### 4. The Timeline and Current Limits


Although quantum computers have made significant advancements, they are not yet powerful enough to break most encryption. Fully operational, large-scale
quantum computers are still a few years or even decades away. But since data is often stored for long periods, even current encrypted data could be
vulnerable in the future if attackers store it and wait for quantum computing to advance.

### 5. Solutions: Post-Quantum Cryptography


To prepare for this threat, researchers are developing **post-quantum cryptography** (PQC), which includes encryption algorithms designed to be secure
against quantum attacks. The U.S. National Institute of Standards and Technology (NIST) is working on standardizing these algorithms, which rely on
mathematical problems thought to be hard for quantum computers, like lattice-based cryptography.

### Summary
Quantum computing could:
- Break widely-used encryption methods like RSA and ECC once sufficiently powerful quantum computers are available.
- Halve the effective security of symmetric encryption (e.g., AES), necessitating longer keys.

In response, **post-quantum cryptography** is being developed to provide quantum-resistant encryption methods that could secure data even in a future with
quantum computers.
Government Backdoor
Feb. 2016
Government Backdoor
Pro

▪ Privacy is not absolute

“Encryption isn't just a technical feature; it's a marketing


pitch. ... Sophisticated criminals will come to count on these
means of evading detection. It's the equivalent of a closet
that can't be opened. ... And my question is, at what cost?“

James Comey (2014)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom

“Those who would give up essential Liberty, to purchase a


little temporary Safety, deserve neither Liberty nor Safety.“

Benjamin Franklin (1755)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom


▪ Security versus Security

“No-one, I don't believe, would want a master key built that


would turn hundreds of millions of locks. Even if that key
were in the possession of the person that you trust the
most. That key could be stolen.”

Tim Cook (2016)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom


▪ Security versus Security
▪ Futility of backdoors

“The bottom line is, if you look at both the terrorists in San
Bernardino and the Boston Marathon bombers, they were
family members. Most family members talk to each other
face to face. The government doesn't
have access to that after the fact.“

Michael Chertoff (2016)


Government Backdoor
Oct. 2019
Government Backdoor
Feb. 2019

41
Discussion Case 3

42
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

43
Obfuscation
▪ Encryption of personal data:
• Explicitly hiding secret information
• End user has little control
▪ Obfuscation: “the deliberate addition of ambiguous,
confusing, or misleading information to interfere with
surveillance and data collection” (Brunton and Nissenbaum,
2015)
• Hide information by adding noise to the system
• End user control
• e.g. automatically generate a large number of search queries
(TrackMeNot)
• e.g. automatically click on all ads (AdNauseam)

44
Obfuscation
▪ Can be driven by a technical system, but also human
• #BrusselsLockdown
▪ Is it ethical?

45
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

46
Public Data
▪ Public Data is not Free-to-Copy Data
• Database right
• Policies

47
Public Data
▪ What is a Database:
• “collection of independent works, data or other materials which are
arranged in a systematic or methodical way and are individually
accessible by electronic or other means” (Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal protection of
databases)
• Broad definition: datasets, mailing lists, telephone directories, etc.
▪ Database right:
• You are not allowed to copy or extract substantial parts of a database
without the owner’s consent.
• Even if the content is not copyright protected
• 15 years
• When there is a substantial investment in obtaining, verifying and
presenting the content (Europe and UK)
• Not in, for example, the United States

48
Public Data
▪ Policies of individual data sources (such as websites) tell you what you
can do
▪ Some concepts
• Web scraping: extracting data from websites
• robots.txt
➢ Tells a crawler (bot) what pages it is allowed to visit/request.
➢ https://www.facebook.com/robots.txt
# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php

User-agent: *
Disallow: /

➢ Means Facebook disallows all automated scraping


• Application Programming Interfaces (APIs)
➢ Many companies provide APIs to access their data
▪ Facebook became very restrictive in who can use the API after Cambridge Analytica
▪ Twitter API allows to extract data on large scale
▪ Remember: often also includes personal data (GDPR!)
▪ Public does not imply free to copy!
51
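A small sketch, using only the Python standard library, of how a crawler can check robots.txt before requesting pages (the Facebook URL is the one shown above; fetching it requires network access):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()

# "User-agent: *" with "Disallow: /" means a generic crawler may not request any page.
print(rp.can_fetch("*", "https://www.facebook.com/some_page"))   # expected: False
```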
Clearview.AI
▪ American company that uses face recognition for law enforcement
▪ App: take a picture, upload it; it returns publicly available pictures of that
person and the links to where those pictures appear online
▪ Data scraped from Facebook, Twitter, Instagram, Youtube, and “millions
of other websites”
▪ Uses
• Law enforcement:
➢ identify suspects, terrorists
➢ Example Indiana State Police:
identified shooter from a video
(no mug shots or driver’s license)
• Business goals:
➢ help with shoplifting,
➢ identity theft, credit card fraud, etc.

https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html 52
https://www.youtube.com/watch?v=-JkBM8n8ixI 53
Clearview.AI
▪ Several ethical issues
1. Data gathering: public data is not free-to-copy data
2. Data gathering: encryption
▪ Client list was leaked, so not stored securely; are the images?
3. Deployment: access to system
▪ Law enforcement: 600+ law enforcement agencies, such as FBI, US Immigration
and Customs Enforcement
▪ Companies: Walmart, NBA, Macy’s, 46 financial institutions, among others
▪ Persons?
▪ 30 day free trial

▪ Discussion:
• Misuse? By 1) law enforcement, 2) businesses, 3) persons.
• Why ok with 1) mug shots and drivers licenses, and 2) with fingerprints, not
with faces?

54
https://www.buzzfeednews.com/article/ryanmac/clearview-ai-fbi-ice-global-law-enforcement
Clearview.AI

55
https://www.standaard.be/cnt/dmf20200228_04869700
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

56
Bias
▪ Systematic prejudice for or against a certain group
▪ An overloaded term
• Bias in data sample: non-representative for the population
➢ This type of bias occurs when the sample of data used for training a model does not
accurately represent the broader population that the model will be applied to.
• Bias in data or model against a sensitive group: cf. fairness
➢ This bias arises when a model or dataset disproportionately favors or disadvantages
specific groups, often defined by sensitive characteristics like race, gender, or
socioeconomic status. In the context of fairness, this type of bias can lead to
discriminatory outcomes for certain groups.
• Bias / variance trade-off: bias error (assumptions of the algorithm) + variance error (limited sample size)
➢ Definition: the bias/variance trade-off describes the balance between two types of errors.
Bias error: error due to overly simplistic assumptions in the model, which can lead to
underfitting; high bias means the model does not capture underlying patterns well.
Variance error: error due to high sensitivity to fluctuations in the training data, which
can lead to overfitting; high variance means the model is too finely tuned to the specific
training data and may not generalize well to new data.
➢ Example: a linear regression model might have high bias if the true relationship is
nonlinear. Conversely, a complex model like a deep neural network may have high variance
if trained on a small dataset, leading it to memorize rather than generalize.
➢ Impact: finding the right balance between bias and variance is crucial to building models
that generalize well. Too much bias leads to consistently inaccurate predictions, while
too much variance leads to instability and poor performance on new data.
• Bias in a linear model: the intercept
➢ Definition: in linear models, bias can refer to the intercept term, the constant added to
the prediction. The intercept is the model’s prediction when all feature values are zero,
effectively serving as the model’s baseline.
➢ Example: in a linear regression model predicting housing prices, the intercept might
represent the base price of a house when all other features (e.g., size, location) are zero.
If the intercept is too high or too low, it can skew the entire prediction.
➢ Impact: an incorrect intercept can cause biased predictions across all data points, leading
the model to consistently overestimate or underestimate the target variable. Adjusting the
intercept can improve the accuracy and interpretability of the model.
Sample Bias
▪ What is a sample?
• Part of the population
▪ What is sampling
• Act, process or technique of selecting a suitable sample or a
representative part of the population for the purpose of determining
parameters
▪ Why Sample?
• Economic advantage
• Time factor
• Large Populations
• Partly accessible populations
• Computation Power Required

▪ Can lead to wrong conclusions or impacting certain groups negatively.


Sample Bias
▪ Example 1: Twitter data to predict elections
• Twitter users overrepresented in densely populated regions
• Predominantly male
• Non-representative race distribution
• Younger, Highly educated and Higher income

[Diagram: population = potential voters, split into “on Twitter” and “not on Twitter”;
the sample = potential voters on Twitter, labelled democratic (Y = 0) or republican (Y = 1);
for voters not on Twitter, Y = ?]

https://www.pewinternet.org/2019/04/24/sizing-up-twitter-users
Sample Bias
▪ Example 2: bullet hole data to predict where to place armor

Section                Bullet holes per square foot
Engine                 1.1
Main body section      1.7
Fuel system            1.6
Rest of the plane      1.8
Sample Bias
▪ Example 2: bullet hole data to predict where to place armor
• Mathematician Abraham Wald came to another solution: “The armor
doesn't go where the bullet holes are. It goes where the bullet holes aren't:
on the engines."

[Diagram: population = airplanes shot at, split into “return to base” and “crash”;
the sample = airplanes returning to base, where the wings are hit most;
for the planes that crashed, the most-hit sections are unknown.]


Sample Bias
▪ Example 3: credit scoring
• Too optimistic data
• “Reject inference” problem

[Diagram: population = persons applying for credit, split into “granted credit” and “denied credit”;
the sample = persons having been granted credit, labelled credit paid back (Y = 0) or default (Y = 1);
for denied applicants, Y = ?]


Sample Bias
▪ Example 4: words on resume to predict recruitment
• HR Analytics, prediction model to review job applicants’ resumes to
automate the search for top talent
• Trained on resumes from past (10 year period), biased data as most were
from men
• Model trained to prefer male candidates, for example:
➢ Penalized use of the word “women’s” (e.g. women’s chess club president)
➢ Penalized all-women’s colleges
Sample Bias
▪ Example 5: image data to predict object
• Not enough pictures of certain groups

A Google spokesperson confirmed that “gorilla” was


censored from searches and image tags after the 2015
incident, and that “chimp,” “chimpanzee,” and “monkey” are
also blocked today. “Image labeling technology is still early
and unfortunately it’s nowhere near perfect”
https://medium.com/@eirinimalliaraki/toward-ethical-transparent-and-fair-ai-ml-a-critical-reading-list-d950e70a70ea
https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
Sample Bias
▪ Example 6: movement data to predict pothole locations
▪ StreetBump app
• App for residents of Boston
• Detects bumps with location while driving
• Identifies neighborhoods for which to improve the road
infrastructure

• Representative sample?
➢ Need to have a smartphone
➢ Neighborhoods with elderly people and lower incomes underrepresented
Sample Bias
▪ An algorithm does not eliminate human bias, it is only as
good as the data it works with
▪ Under- or over-representation of sensitive group can lead
to disparate impact
Bias
▪ Bias in our language
• Word embeddings
• Boy is to girl as man is to [?]

• Two-dimensional representations of word embeddings using Google news


➢ Find parallelogram
➢ man – woman ≈ king – queen
man – woman ≈ computer programmer – homemaker
man – woman ≈ surgeon – [?]
man – woman ≈ architect – [?]
man – woman ≈ superstar – [?]

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings


https://arxiv.org/pdf/1607.06520.pdf
Bias
▪ Bias in our language
• Word embeddings
• Boy is to girl as man is to [?]

• Two-dimensional representations of word embeddings using Google news


➢ Find parallelogram
➢ man – woman ≈ king – queen
man – woman ≈ computer programmer – homemaker
man – woman ≈ surgeon – nurse
man – woman ≈ architect – interior designer
man – woman ≈ superstar – diva

• Be aware of biases or these will be amplified

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings


https://arxiv.org/pdf/1607.06520.pdf
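A sketch of how such analogies are computed, assuming the gensim package and a local copy of pretrained word2vec vectors (the GoogleNews file name and the phrase token are assumptions; this is the embedding used in the paper above):

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional word2vec vectors (large download, loaded from disk here).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# "man" is to "computer_programmer" as "woman" is to ...?
# Vector arithmetic: computer_programmer - man + woman, then find the nearest words.
print(kv.most_similar(positive=["computer_programmer", "woman"], negative=["man"], topn=3))
```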
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

69
Human experiments

70
Experimentation
▪ Nuremberg Code for ethical rules for research involving
human subjects
• informed consent without coercion
• the ability to withdraw from the experiment at any time
• avoiding all unnecessary physical and mental suffering and
injury

▪ go beyond the initial ethical approval of an ethics board


▪ follow-up and challenge the ethical implications during a
data science project and add ethical reflection in each
report of the study

71
Experimentation
▪ A/B testing
▪ Randomize experiment with two variants: A and B
▪ Vary one variable and assess the effect

[Screenshots of two page variants: registration rate 45% vs. registration rate 15%]

▪ Different treatment of different groups: potential human impact and ethical implications!
▪ If there is a potential for negative impact: informed consent needed
▪ C/D testing: intentionally deceiving the users
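A hedged sketch of how one might check whether such a difference in registration rates is statistically meaningful, using hypothetical visitor counts (not from the slide) and scipy:

```python
from scipy.stats import chi2_contingency

registered_a, total_a = 450, 1000    # variant A: 45% registration rate (hypothetical n)
registered_b, total_b = 150, 1000    # variant B: 15% registration rate (hypothetical n)

# 2x2 contingency table: registered vs. not registered, per variant.
table = [[registered_a, total_a - registered_a],
         [registered_b, total_b - registered_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # a small p-value suggests the difference is unlikely to be due to chance
```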
Experimentation
▪ Facebook/university experiment
• Goal: show emotional contagion (Kramer et al, 2014)
• One week in 2012, changed the news feed content of a randomly selected sample of
689,003 Facebook users
➢ Group A: removed articles with positive words
➢ Group B: removed articles with negative words
➢ Change in emotional content of status updates? It did.
Experimentation
▪ OK Cupid
• Goal: Test the performance of their prediction model in practice
• Dating website, provides a match (0-100%) with candidates, based on prediction
model
➢ Group A: Bad match → Told they were a bad match
➢ Group B: Bad match → Told they were very good match
➢ Did this change whether and how often they talked to each other?
It did: group B was more likely to send each other messages

▪ “If you use the Internet, you're the subject of hundreds of experiments at any given time,
on every site. That's how websites work.“
OKCupid president.
▪ “If you're lying to your users to improve your service, what's the line between A/B testing
and fraud?“
Washington Post’s Brian Fung

www.washingtonpost.com/news/the-switch/wp/2014/07/28/okcupid-reveals-its-been-lying-to-some-of-its-users-just-to-see-whatll-happen/
Summary data gathering
▪ Fair
• to the data subject and model applicant: privacy (GDPR)
• to the model applicant: representative sample
▪ Transparent
• to the data subject and model applicant: is it clear what data is
used, for what purposes and for how long?
• to the model applicant: A/B testing with informed consent,
minimal risks and ethical oversight
• to the data scientist and manager: how is the data gathered,
over- or undersampling
▪ Accountable
• are appropriate and effective measures put in place to comply
with the answers to the above questions? Who is responsible?
Presentation and Paper Ideas
Privacy
▪ Fingerprints on ID cards and passports
▪ The use of synthetic data to learn models
Fairness
▪ A/B testing in the medical domain
▪ A/B tests gone (ethically) wrong
Face recognition
▪ Face recognition discussions in anticipation of the AI Act
▪ Bias in commercial face recognition software

76
