
Data Science & Ethics

Lecture 3

Data Gathering: privacy (continued), bias and experimentation

Prof. David Martens


[email protected]
www.applieddatamining.com
@ApplDataMining
AI Ethics in the News
2
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

3
Symmetric encryption
▪ One key used for encryption and decryption

[Diagram: Alice encrypts the plain text “Hello Bob” with the shared secret key (1),
the cipher text “Sà!3Lksd(” is sent to Bob (2), who decrypts it with the same secret key (3);
eavesdropper Eve only ever sees the cipher text.]
4
Symmetric encryption
▪ One key used for encryption and decryption
• Caesar cipher: 3-shift right
• Weakness: frequency of letters and starting/ending words
(“Dear”, “Yours sincerely”, etc.) or brute force attack

[Diagram: same setup with the Caesar cipher; Alice encrypts “Hello Bob” to the cipher text
“Khoor Ere”, Bob shifts back to recover “Hello Bob”, and Eve only sees “Khoor Ere”.]
5
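A minimal Python sketch of the 3-shift Caesar cipher above (illustrative only; real symmetric encryption uses algorithms such as AES):

```python
# Caesar cipher: shift every letter by a fixed number of positions in the alphabet.
def caesar(text: str, shift: int) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)          # leave spaces and punctuation untouched
    return "".join(result)

cipher = caesar("Hello Bob", 3)        # -> "Khoor Ere"
plain = caesar(cipher, -3)             # shifting back decrypts -> "Hello Bob"
print(cipher, "|", plain)
```

Because letter frequencies are preserved, the weaknesses mentioned above (frequency analysis, brute-forcing all 26 shifts) apply directly to this sketch.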
Symmetric encryption
▪ DES: Data Encryption Standard
• One of the first major standards in symmetric key encr.
• 56 bit key
• 2^56 ≈ 7 x 10^16 possible keys
• Flaw: too small as brute force attack would find key
▪ AES: Advanced Encryption Standard
• By Belgians Vincent Rijmen and Joan Daemen (1998)
• 128, 192 or 256 bit keys
• 2^128 ≈ 3 x 10^38 possible keys, considered safe in the current age
• New standard since late 90s
▪ Challenges
• How to share keys: insecure, or a lot of overhead
• How to manage keys: if u users need to communicate with one
another → need for (u-1) + (u-2) + … + 1 = u x (u-1) / 2 keys to be
shared before communicating 6
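A minimal sketch of symmetric encryption in practice, assuming the third-party Python `cryptography` package (its Fernet recipe is built on AES):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()                       # the single secret key Alice and Bob must share
cipher_text = Fernet(key).encrypt(b"Hello Bob")
plain_text = Fernet(key).decrypt(cipher_text)     # only works with the same key
print(plain_text)                                 # b'Hello Bob'

# Key-management overhead: every pair of users needs its own shared key.
u = 100
print(u * (u - 1) // 2)                           # 4950 keys for 100 users
```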
Asymmetric encryption
▪ Two keys: public and private key
• Public key: revealed to the world
• Private key: kept secret at one party

[Diagram: Alice encrypts the plain text “Hello Bob” with Bob’s public key (1),
the cipher text “Sà!3Lksd(” is sent to Bob (2), who decrypts it with his private key (3);
eavesdropper Eve only ever sees the cipher text.]
7
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Popular algorithm for asymmetric encryption
• Principle:
➢ Multiplying two large numbers is easy and fast
➢ Decomposing a large number into prime numbers: very difficult
➢ For example: 19 x 13 = ? Decompose 391 in 2 prime numbers?
➢ If the numbers are large enough: no efficient (non-quantum) integer
factorisation algorithm exists.
• Given p and q large prime numbers
➢ n = p x q, make this the public key, p and q are the private keys
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1)
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e mod ( (p-1)x(q-1) ) = 1 8
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1), e.g. 35
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e = 1 mod [ (p-1)x(q-1) ]
▪ d also private, as you need to know p and q to compute it
▪ m = c^d mod n
▪ Advantages
• Sharing keys: only public ones need to be shared (no need for secrecy)
• Manage keys: share only u keys among u users
▪ Disadvantage
• Takes more time than symmetric encryption 9
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1983)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen such that it is relatively prime to (p-1) x (q-1), e.g. 5
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private keys p and q
▪ d chosen such that d x e = 1 mod [ (p-1)x(q-1) ], so d x e = [k x (p-1)x(q-1) ] + 1
▪ m = c^d mod n
(can only be calculated if you know the decomposition into the prime factors p and q)
▪ Example
• p = 7, q = 3 ➔ n = ?
• Message is the letter “l” => 12th letter in the alphabet: so m = 12
• c = ? [3]
• For a k = 2, d = ?
• m = ?
10
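A minimal Python sketch working through the toy example above; the values follow directly from the formulas on the slide (real RSA uses primes of hundreds of digits plus padding):

```python
p, q = 7, 3
n = p * q                      # 21, the public modulus
phi = (p - 1) * (q - 1)        # 12, only computable if you can factor n
e = 5                          # public exponent, relatively prime to phi

m = 12                         # the message: "l", 12th letter of the alphabet
c = pow(m, e, n)               # encryption: c = m^e mod n  -> 3

d = (2 * phi + 1) // e         # for k = 2: d = (k*phi + 1)/e = 5, so d*e = 1 mod phi
m_back = pow(c, d, n)          # decryption: m = c^d mod n  -> 12
print(n, c, d, m_back)         # 21 3 5 12
```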
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key (slow)
• Subsequently symmetric encryption with agreed secret key
(fast)

[Diagram, client–server key exchange:]
1. The client generates a random number r
2. The client encrypts r with the public key of the server to c and sends c
3. The server decrypts c with its own private key to recover r
4. All subsequent communication uses fast symmetric encryption, using secret key r

11
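A rough sketch of this handshake, assuming the third-party Python `cryptography` package; real TLS negotiates keys in a more elaborate way, but the principle (slow asymmetric exchange of a fast symmetric key) is the same:

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

# 1. The server owns an asymmetric key pair; only the public key is shared.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public = server_private.public_key()

# 2. The client generates a random symmetric key r and encrypts it for the server.
r = Fernet.generate_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
c = server_public.encrypt(r, oaep)

# 3. The server recovers r with its private key.
r_server = server_private.decrypt(c, oaep)

# 4. All further traffic uses fast symmetric encryption with the shared key r.
token = Fernet(r_server).encrypt(b"all subsequent messages")
print(Fernet(r).decrypt(token))
```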
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key
• Subsequently symmetric encryption with agreed secret key

▪ Used in SSL and TLS protocols, widely used online


• Public key infrastructure: 3rd party Certificate Authority (CA),
such as Comodo, Let’s Encrypt
• https:// indicates this type of encryption working in the
background

12
Encryption for data protection
▪ Whenever you need to communicate online:
• Use TLS protocol
• Even taken into account as a ranking signal by search engines
▪ Whenever you store personal data:
• Encrypt the data
• Beyond USBs, laptops, PCs and smartphones: bikes, cars, etc.
• Cars:
➢ Personal data: address, routes, contacts, etc.
➢ FTC (US): advice to clean out personal data before selling car
➢ Tesla:
▪ Personal data is hidden when given the keys to a valet (valet mode)
▪ Revealed that some data is stored unencrypted

https://www.cnbc.com/2019/03/29/tesla-model-3-keeps-data-like-crash-videos-location-phone-contacts.html
14
Hashing
▪ Another useful cryptographic function
• Input: i
• Output: hash o, of always same length
▪ One way function:
• easy to go from input to hash, very hard to do the reverse
• Main difference with encryption
▪ MD5: designed by Ron Rivest (1992), based on the Merkle–Damgård construction
• Popular hashing algorithm
• Input: any string, output: 128 bit hash
https://www.md5hashgenerator.com/

▪ SHA-3
• Proposed by NIST (2015), output of up to 512 bits
• Developers: Bertoni, Daemen (him again), Peeters and Van Assche
15
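A quick illustration with Python's standard hashlib (the MD5 value of “123456” matches the password table a few slides further on):

```python
import hashlib

print(hashlib.md5(b"123456").hexdigest())       # e10adc3949ba59abbe56e057f20f883e (128-bit hash)
print(hashlib.sha3_256(b"123456").hexdigest())  # 256-bit SHA-3 digest
print(hashlib.sha3_256(b"123457").hexdigest())  # one character changed: a completely different hash
# The output always has the same fixed length, and there is no practical way
# back from the hash to the input (one-way function).
```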
Hashing
▪ Main use: integrity
• Input: mail m
• Output: hash h
• If someone changes a word: the modified mail will produce a
different hash

16
Hashing
▪ Simple example for text
• Take the position of each letter in the alphabet
• Sum all the integers
• Take the last digit of the sum as hash
• Try with “ball” and “bell”
➢ One-way, fixed length hash (of size 1)
➢ Many hash collisions

17
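A minimal sketch of this toy hash:

```python
# Sum the alphabet positions of the letters and keep only the last digit of the sum.
def toy_hash(word: str) -> int:
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower() if ch.isalpha())
    return total % 10                # fixed-length output: a single digit

print(toy_hash("ball"))              # 2+1+12+12 = 27 -> 7
print(toy_hash("bell"))              # 2+5+12+12 = 31 -> 1
# Easy to compute forwards, hard to invert, but many words share a digit: hash collisions.
```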
Hashing
▪ Hashing passwords
▪ Widely used to send passwords over the Internet
• Naive:
➢ if website doesn’t use TLS, or on a public wifi: password sent in
plain text (visible to eavesdropping Eve).
➢ In database: store username and password in plain text.
▪ If ever data leak: all passwords public
▪ Employees can snoop in the database

User Password

David Martens 123456

Jennifer Johnsen 123456

Latifa Jenkins p@ssword5

18
Hashing
▪ Hashing passwords
• Better:
➢ Hash password client side (so on the laptop/PC)
➢ Send the hashed value to server, and store this in database
➢ If ever leaked: hard to know what password was (one way)
➢ BUT: people tend to use the same passwords, so frequently used
passwords can still be identified

User Hash
David Martens e10adc3949ba59abbe56e057f20f883e
Jennifer Johnsen e10adc3949ba59abbe56e057f20f883e
Latifa Jenkins f3f092cd075b3e050451239611a9e1e9

19
Hashing for data protection
▪ Most popular passwords
Popular passwords Hash (MD5)
123456 e10adc3949ba59abbe56e057f20f883e
password cc3a0280e4fc1415930899896574e118
123456789 6ca5ac1929f1d27498575d75abbeb4d1
12345678 25d55ad283aa400af464c76d713c07ad
12345 827ccb0eea8a706c4c34a16891f84e7b
111111 96e79218965eb72c92a549dd5a330112
1234567 fcea920f7412b5da7be0cf42b8c93759
sunshine 0571749e2ac330a7455809c6b0e7af90
qwerty d8578edf8458ce06fbc5bb76a58c5ca4
iloveyou f25a2fc72690b780b2a14e140ef6a9e0
princess 8afa847f50a716e64932d995c8e7435a

▪ Eavesdropping Eve can check if the hash she observes occurs in this list
• e10adc3949ba59abbe56e057f20f883e: found! password found: 123456
• She can only check against a precomputed subset of all possible (popular) passwords: a “rainbow table” 20
Hashing for data protection

Interesting cultural links


https://en.wikipedia.org/wiki/List_of_the_most_common_passwords 21
Hashing for data protection
▪ Eavesdropping Eve can check if the hash she observes occurs in this list,
if so: password found
▪ Improve further by generating a random string per user: “salt”
▪ Hash the concatenation of salt + password
• Remaining issue: in case of a data leak, the salt is also revealed
• Hacker could create a rainbow table per user (salt + popular password)
• Eavesdropping Eve can check if the hash she observes occurs in this list
➢ 79e514abbfb414c5ae58b553fbb0ff00: found! password found: 123456, with salt
t0mR1aoPdp
➢ Needs many (x number of users) more rainbow tables than without a salt

User Salt Hash

David Martens t0mR1aoPdp 79e514abbfb414c5ae58b553fbb0ff00

Jennifer Johnsen rmsP5dof9y 896cab0975703f599fe0f7491c90062b

Latifa Jenkins 8LyqGp4cPm 8a945d114467eee63485fba9aa0bbaf0


22
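A minimal sketch of salted password hashing (illustrative only; production systems use dedicated password-hashing functions such as bcrypt, scrypt or Argon2 rather than plain MD5):

```python
import hashlib
import secrets

def hash_password(password, salt=None):
    """Hash a password together with a per-user random salt."""
    if salt is None:
        salt = secrets.token_urlsafe(8)          # random salt, stored next to the hash
    digest = hashlib.md5((salt + password).encode()).hexdigest()
    return salt, digest

salt, stored = hash_password("123456")
print(salt, stored)

# Eve now needs a separate rainbow table per salt; a dictionary check still works per user:
for guess in ["123456", "password", "qwerty", "iloveyou"]:
    if hash_password(guess, salt)[1] == stored:
        print("password found:", guess)
```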
Hashing for data protection
▪ Hashing passwords
• A standard, expected to be used
• Knuddels.de:
➢ German chat community which stored passwords in plain text,
not hashed
➢ Hacked: 800,000 email addresses and passwords published online
➢ GDPR fine of 20,000 €

23
Hashing for data protection
▪ To ensure no personal data is copied throughout data
processing system
• Many copies, local downloads
➢ If someone asks to remove all their personal data: cumbersome task,
never 100% sure
• Personal data often not explicitly needed
(e.g. password, exact name)
➔ Use personal data in one table, with hashed personal ID
➔ Use the hashed personal ID in all other parts of the system

24
Hashing for data protection
▪ To ensure no personal data is copied throughout data
processing system
▪ Using hashed values of personal data: pseudonymisation
• Keep hash table secure

25
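A minimal sketch of this idea: replace the personal ID by its hash everywhere, and keep the single mapping table secure (names and fields below are purely illustrative):

```python
import hashlib

def pseudonym(personal_id: str) -> str:
    """Hashed personal ID used everywhere downstream instead of the real identity."""
    return hashlib.sha256(personal_id.encode()).hexdigest()

# The only table linking pseudonym to person: keep it separate and protected.
secure_lookup = {pseudonym(name): name
                 for name in ["David Martens", "Jennifer Johnsen", "Latifa Jenkins"]}

# All other parts of the system work with the pseudonym only:
record = {"user": pseudonym("David Martens"), "purchase": 42.0}
print(record)
# Note: a plain hash of a guessable ID can be brute-forced, hence "keep hash table secure";
# a keyed hash (HMAC) or random surrogate ID offers stronger protection.
```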
Encryption vs hashing
▪ Both useful for storing and analysing data

26
Discussion Case 2

27
Quantum Computing
▪ Bit: 0 or 1
▪ Qubit: 0 and 1 at the same time

[Diagram: a bit stores either 0 or 1; a qubit stores 0 and 1 at the same time.]

28
Quantum Computing
▪ Schrödinger’s cat
• Cat in a closed box, 50% probability it dies
• Both dead and alive, until someone opens the box and
observes the cat
➢ This is what actually happens at the atomic level

30
https://www.youtube.com/watch?v=QuR969uMICM 31
Quantum Computing
▪ Superposition
• Two states at the same time
• Superposition collapses when the state is observed
• E.g. photon or spin of electron
▪ Bit: can take 2 values: 0 or 1
▪ Qubits:
• two qubits can be in 4 states at the same time: 00, 01, 10 and 11
• n qubits can store 2^n states at the same time
• Qubits are “entangled”
➢ Observing one qubit, instantaneously reveals the other, even if at
the other side of the galaxy
➢ “spooky action at a distance” (Albert Einstein)
➢ Proven, 2017: entangled qubits 1,200 km apart
32
Quantum Computing
▪ Quantum computing
• Represent a problem and add a way to assess the answer
• Quantum decoherence of qubits such that only answers that
pass the test survive
▪ For personal data protection?
• Shor’s algorithm for factoring large numbers
• Would “crack” the popular RSA algorithm for asymmetric
encryption
▪ Current quantum computers
• Few dozen qubits
• Need thousands of qubits to break RSA
• IBM SVP of cloud and cognitive software (2019):
“within a decade”
33
Quantum Computing
▪ For personal data protection? (bis)
• Data in quantum state cannot be copied unnoticed
• Copying requires observing the data: eavesdropping detected

▪ And beyond: Quantum Machine Learning

34
Quantum computing poses a potential threat to many of the encryption methods currently used to secure sensitive data. Here’s an overview of why quantum
computing is such a game-changer for encryption:

### 1. How Current Encryption Works


Most encryption methods rely on the difficulty of solving certain mathematical problems with classical computers. Here are two common types:
- **Symmetric Encryption** (e.g., AES): Data is encrypted and decrypted with the same key. Its security relies on brute-force resistance, where it would take
classical computers an infeasible amount of time to try every possible key.
- **Asymmetric Encryption** (e.g., RSA, ECC): Uses a pair of public and private keys. It relies on hard-to-solve mathematical problems like factoring large
numbers (RSA) or solving discrete logarithms (ECC), which would take classical computers an extremely long time to crack.

### 2. How Quantum Computing Affects Encryption


Quantum computers process information differently than classical computers. Two main quantum algorithms have implications for encryption:
- **Shor’s Algorithm**: Efficiently factors large numbers and solves discrete logarithms, which are the foundations of RSA and ECC encryption. A sufficiently
powerful quantum computer running Shor’s algorithm could break these encryption methods quickly by solving problems that classical computers would take
years (or even centuries) to solve.
- **Grover’s Algorithm**: Speeds up the search through an unstructured database, allowing quantum computers to perform brute-force attacks faster. For
symmetric encryption like AES, Grover’s algorithm effectively halves the security strength (e.g., AES-256 would be as secure as AES-128 against a quantum
brute-force attack).

### 3. Why This Jeopardizes Current Encryption


With quantum algorithms, encryption methods we use today could become vulnerable:
- **Asymmetric Encryption at Risk**: RSA and ECC encryption, used in everything from internet traffic to digital signatures, would be easily broken by a
powerful enough quantum computer running Shor’s algorithm. This means that data protected with RSA or ECC could be decrypted if intercepted and stored
by attackers until a suitable quantum computer becomes available.
- **Symmetric Encryption is Safer, But Not Immune**: Grover’s algorithm means that symmetric keys need to be longer to maintain the same level of
security (e.g., moving from AES-128 to AES-256).

### 4. The Timeline and Current Limits


Although quantum computers have made significant advancements, they are not yet powerful enough to break most encryption. Fully operational, large-scale
quantum computers are still a few years or even decades away. But since data is often stored for long periods, even current encrypted data could be
vulnerable in the future if attackers store it and wait for quantum computing to advance.

### 5. Solutions: Post-Quantum Cryptography


To prepare for this threat, researchers are developing **post-quantum cryptography** (PQC), which includes encryption algorithms designed to be secure
against quantum attacks. The U.S. National Institute of Standards and Technology (NIST) is working on standardizing these algorithms, which rely on
mathematical problems thought to be hard for quantum computers, like lattice-based cryptography.

### Summary
Quantum computing could:
- Break widely-used encryption methods like RSA and ECC once sufficiently powerful quantum computers are available.
- Halve the effective security of symmetric encryption (e.g., AES), necessitating longer keys.

In response, **post-quantum cryptography** is being developed to provide quantum-resistant encryption methods that could secure data even in a future with
quantum computers.
Government Backdoor
Feb. 2016
Government Backdoor
Pro

▪ Privacy is not absolute

“Encryption isn't just a technical feature; it's a marketing


pitch. ... Sophisticated criminals will come to count on these
means of evading detection. It's the equivalent of a closet
that can't be opened. ... And my question is, at what cost?“

James Comey (2014)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom

“Those who would give up essential Liberty, to purchase a


little temporary Safety, deserve neither Liberty nor Safety.“

Benjamin Franklin (1755)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom


▪ Security versus Security

“No-one, I don't believe, would want a master key built that


would turn hundreds of millions of locks. Even if that key
were in the possession of the person that you trust the
most. That key could be stolen.”

Tim Cook (2016)


Government Backdoor
Pro Con

▪ Privacy is not absolute ▪ Security versus Freedom


▪ Security versus Security
▪ Futility of backdoors

“The bottom line is, if you look at both the terrorists in San
Bernardino and the Boston Marathon bombers, they were
family members. Most family members talk to each other
face to face. The government doesn't
have access to that after the fact.“

Michael Chertoff (2016)


Government Backdoor
Oct. 2019
Government Backdoor
Feb. 2019

41
Discussion Case 3

42
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

43
Obfuscation
▪ Encryption of personal data:
• Explicitly hiding secret information
• End user has little control
▪ Obfuscation: “the deliberate addition of ambiguous,
confusing, or misleading information to interfere with
surveillance and data collection” (Brunton and Nissenbaum,
2015)
• Hide information by adding noise to the system
• End user control
• e.g. automatically generate a large number of search queries
(TrackMeNot)
• e.g. automatically click on all ads (AdNauseam)

44
Obfuscation
▪ Can be driven by a technical system, but also human
• #BrusselsLockdown
▪ Is it ethical?

45
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

46
Public Data
▪ Public Data is not Free-to-Copy Data
• Database right
• Policies

47
Public Data
▪ What is a Database:
• “collection of independent works, data or other materials which are
arranged in a systematic or methodical way and are individually
accessible by electronic or other means” (Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal protection of
databases)
• Broad definition: datasets, mailing lists, telephone directories, etc.
▪ Database right:
• You are not allowed to copy or extract substantial parts of a database
without the owner’s consent.
• Even if the content is not copyright protected
• 15 years
• When there is a substantial investment in obtaining, verifying and
presenting the content (Europe and UK)
• Not in, for example, the United States

48
Public Data
▪ Policies of individual data sources (such as websites) tell you what you
can do
▪ Some concepts
• Web scraping: extracting data from websites
• robots.txt
➢ Tells a crawler (bot) what pages it is allowed to visit/request.
➢ https://www.facebook.com/robots.txt
# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php

User-agent: *
Disallow: /

➢ Means Facebook disallows all automated scraping


• Application Programming Interfaces (APIs)
➢ Many companies provide APIs to access their data
▪ Facebook became very restrictive in who can use the API after Cambridge Analytica
▪ Twitter API allows to extract data on large scale
▪ Remember: often also includes personal data (GDPR!)
▪ Public does not imply free to copy!
51
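A small sketch, using only the Python standard library, of how a crawler can check robots.txt before requesting pages (the Facebook URL is the one shown above; fetching it requires network access):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()

# "User-agent: *" with "Disallow: /" means a generic crawler may not request any page.
print(rp.can_fetch("*", "https://www.facebook.com/some_page"))   # expected: False
```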
Clearview.AI
▪ American company that uses face recognition for law enforcement
▪ App: take a picture, upload it; it returns publicly available pictures of that
person and the links to where those pictures appear online
▪ Data scraped from Facebook, Twitter, Instagram, Youtube, and “millions
of other websites”
▪ Uses
• Law enforcement:
➢ identify suspects, terrorists
➢ Example Indiana State Police:
identified shooter from a video
(no mug shots or driver’s license)
• Business goals:
➢ help with shoplifting,
➢ identity theft, credit card fraud, etc.

https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html 52
https://www.youtube.com/watch?v=-JkBM8n8ixI 53
Clearview.AI
▪ Several ethical issues
1. Data gathering: public data is not free-to-copy data
2. Data gathering: encryption
▪ Client list was leaked, so not stored securely; are the images?
3. Deployment: access to system
▪ Law enforcement: 600+ law enforcement agencies, such as FBI, US Immigration
and Customs Enforcement
▪ Companies: Walmart, NBA, Macy’s, 46 financial institutions, among others
▪ Persons?
▪ 30 day free trial

▪ Discussion:
• Misuse? By 1) law enforcement, 2) businesses, 3) persons.
• Why ok with 1) mug shots and drivers licenses, and 2) with fingerprints, not
with faces?

54
https://www.buzzfeednews.com/article/ryanmac/clearview-ai-fbi-ice-global-law-enforcement
Clearview.AI

55
https://www.standaard.be/cnt/dmf20200228_04869700
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

56
Bias
▪ Systematic prejudice for or against a certain group
▪ An overloaded term
• Bias in data sample: non-representative for the population
➢ This type of bias occurs when the sample of data used for training a model does not
accurately represent the broader population that the model will be applied to.
• Bias in data or model against a sensitive group: cf. fairness
➢ This bias arises when a model or dataset disproportionately favors or disadvantages
specific groups, often defined by sensitive characteristics like race, gender, or
socioeconomic status. In the context of fairness, this type of bias can lead to
discriminatory outcomes for certain groups.
• Bias / variance trade-off: bias error (assumptions of the algorithm) + variance error (limited sample size)
➢ Definition: the bias/variance trade-off describes the balance between two types of errors.
Bias error: error due to overly simplistic assumptions in the model, which can lead to
underfitting; high bias means the model does not capture underlying patterns well.
Variance error: error due to high sensitivity to fluctuations in the training data, which
can lead to overfitting; high variance means the model is too finely tuned to the specific
training data and may not generalize well to new data.
➢ Example: a linear regression model might have high bias if the true relationship is
nonlinear. Conversely, a complex model like a deep neural network may have high variance
if trained on a small dataset, leading it to memorize rather than generalize.
➢ Impact: finding the right balance between bias and variance is crucial to building models
that generalize well. Too much bias leads to consistently inaccurate predictions, while
too much variance leads to instability and poor performance on new data.
• Bias in a linear model: the intercept
➢ Definition: in linear models, bias can refer to the intercept term, the constant added to
the prediction. The intercept is the model’s prediction when all feature values are zero,
effectively serving as the model’s baseline.
➢ Example: in a linear regression model predicting housing prices, the intercept might
represent the base price of a house when all other features (e.g., size, location) are zero.
If the intercept is too high or too low, it can skew the entire prediction.
➢ Impact: an incorrect intercept can cause biased predictions across all data points, leading
the model to consistently overestimate or underestimate the target variable. Adjusting the
intercept can improve the accuracy and interpretability of the model.
Sample Bias
▪ What is a sample?
• Part of the population
▪ What is sampling
• Act, process or technique of selecting a suitable sample or a
representative part of the population for the purpose of determining
parameters
▪ Why Sample?
• Economic advantage
• Time factor
• Large Populations
• Partly accessible populations
• Computation Power Required

▪ Can lead to wrong conclusions or impacting certain groups negatively.


Sample Bias
▪ Example 1: Twitter data to predict elections
• Twitter users overrepresented in densely populated regions
• Predominantly male
• Non-representative race distribution
• Younger, Highly educated and Higher income

[Diagram: population = potential voters, split into “on Twitter” and “not on Twitter”;
the sample = potential voters on Twitter, labelled democratic (Y = 0) or republican (Y = 1);
for voters not on Twitter, Y = ?]

https://www.pewinternet.org/2019/04/24/sizing-up-twitter-users
Sample Bias
▪ Example 2: bullet hole data to predict where to place armor

Section                Bullet holes per square foot
Engine                 1.1
Main body section      1.7
Fuel system            1.6
Rest of the plane      1.8
Sample Bias
▪ Example 2: bullet hole data to predict where to place armor
• Mathematician Abraham Wald came to another solution: “The armor
doesn't go where the bullet holes are. It goes where the bullet holes aren't:
on the engines."

[Diagram: population = airplanes shot at, split into “return to base” and “crash”;
the sample = airplanes returning to base, where the wings are hit most;
for the planes that crashed, the most-hit sections are unknown.]


Sample Bias
▪ Example 3: credit scoring
• Too optimistic data
• “Reject inference” problem

[Diagram: population = persons applying for credit, split into “granted credit” and “denied credit”;
the sample = persons having been granted credit, labelled credit paid back (Y = 0) or default (Y = 1);
for denied applicants, Y = ?]


Sample Bias
▪ Example 4: words on resume to predict recruitment
• HR Analytics, prediction model to review job applicants’ resumes to
automate the search for top talent
• Trained on resumes from past (10 year period), biased data as most were
from men
• Model trained to prefer male candidates, for example:
➢ Penalized use of the word “women’s” (e.g. women’s chess club president)
➢ Penalized all-women’s colleges
Sample Bias
▪ Example 5: image data to predict object
• Not enough pictures of certain groups

A Google spokesperson confirmed that “gorilla” was


censored from searches and image tags after the 2015
incident, and that “chimp,” “chimpanzee,” and “monkey” are
also blocked today. “Image labeling technology is still early
and unfortunately it’s nowhere near perfect”
https://medium.com/@eirinimalliaraki/toward-ethical-transparent-and-fair-ai-ml-a-critical-reading-list-d950e70a70ea
https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
Sample Bias
▪ Example 6: movement data to predict pothole locations
▪ StreetBump app
• App for residents of Boston
• Detects bumps with location while driving
• Identifies neighborhoods for which to improve the road
infrastructure

• Representative sample?
➢ Need to have a smartphone
➢ Neighborhoods with elderly people and lower incomes underrepresented
Sample Bias
▪ An algorithm does not eliminate human bias, it is only as
good as the data it works with
▪ Under- or over-representation of sensitive group can lead
to disparate impact
Bias
▪ Bias in our language
• Word embeddings
• Boy is to girl as man is to [?]

• Two-dimensional representations of word embeddings using Google news


➢ Find parallelogram
➢ man – woman ≈ king – queen
man – woman ≈ computer programmer – homemaker
man – woman ≈ surgeon – [?]
man – woman ≈ architect – [?]
man – woman ≈ superstar – [?]

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings


https://arxiv.org/pdf/1607.06520.pdf
Bias
▪ Bias in our language
• Word embeddings
• Boy is to girl as man is to [?]

• Two-dimensional representations of word embeddings using Google news


➢ Find parallelogram
➢ man – woman ≈ king – queen
man – woman ≈ computer programmer – homemaker
man – woman ≈ surgeon – nurse
man – woman ≈ architect – interior designer
man – woman ≈ superstar – diva

• Be aware of biases or these will be amplified

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings


https://arxiv.org/pdf/1607.06520.pdf
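A sketch of how such analogies are computed, assuming the gensim package and a local copy of pretrained word2vec vectors (the GoogleNews file name and the phrase token are assumptions; this is the embedding used in the paper above):

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional word2vec vectors (large download, loaded from disk here).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# "man" is to "computer_programmer" as "woman" is to ...?
# Vector arithmetic: computer_programmer - man + woman, then find the nearest words.
print(kv.most_similar(positive=["computer_programmer", "woman"], negative=["man"], topn=3))
```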
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation

69
Human experiments

70
Experimentation
▪ Nuremberg Code for ethical rules for research involving
human subjects
• informed consent without coercion
• the ability to withdraw from the experiment at any time
• avoiding all unnecessary physical and mental suffering and
injury

▪ go beyond the initial ethical approval of an ethics board


▪ follow-up and challenge the ethical implications during a
data science project and add ethical reflection in each
report of the study

71
Experimentation
▪ A/B testing
▪ Randomize experiment with two variants: A and B
▪ Vary one variable and assess the effect

[Screenshots of two page variants: registration rate 45% vs. registration rate 15%]

▪ Different treatment of different groups: potential human impact and ethical implications!
▪ If there is a potential for negative impact: informed consent needed
▪ C/D testing: intentionally deceiving the users
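A hedged sketch of how one might check whether such a difference in registration rates is statistically meaningful, using hypothetical visitor counts (not from the slide) and scipy:

```python
from scipy.stats import chi2_contingency

registered_a, total_a = 450, 1000    # variant A: 45% registration rate (hypothetical n)
registered_b, total_b = 150, 1000    # variant B: 15% registration rate (hypothetical n)

# 2x2 contingency table: registered vs. not registered, per variant.
table = [[registered_a, total_a - registered_a],
         [registered_b, total_b - registered_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # a small p-value suggests the difference is unlikely to be due to chance
```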
Experimentation
▪ Facebook/university experiment
• Goal: show emotional contagion (Kramer et al, 2014)
• One week in 2012, changed the news feed content of a randomly selected sample of
689,003 Facebook users
➢ Group A: removed articles with positive words
➢ Group B: removed articles with negative words
➢ Change in emotional content of status updates? It did.
Experimentation
▪ OK Cupid
• Goal: Test the performance of their prediction model in practice
• Dating website, provides a match (0-100%) with candidates, based on prediction
model
➢ Group A: Bad match → Told they were a bad match
➢ Group B: Bad match → Told they were very good match
➢ Did this change whether and how often they talked to each other?
It did: group B was more likely to send each other messages

▪ “If you use the Internet, you're the subject of hundreds of experiments at any given time,
on every site. That's how websites work.“
OKCupid president.
▪ “If you're lying to your users to improve your service, what's the line between A/B testing
and fraud?“
Washington Post’s Brian Fung

www.washingtonpost.com/news/the-switch/wp/2014/07/28/okcupid-reveals-its-been-lying-to-some-of-its-users-just-to-see-whatll-happen/
Summary data gathering
▪ Fair
• to the data subject and model applicant: privacy (GDPR)
• to the model applicant: representative sample
▪ Transparent
• to the data subject and model applicant: is it clear what data is
used, for what purposes and for how long?
• to the model applicant: A/B testing with informed consent,
minimal risks and ethical oversight
• to the data scientist and manager: how is the data gathered,
over- or undersampling
▪ Accountable
• are appropriate and effective measures put in place to comply
with the answers to the above questions? Who is responsible?
Presentation and Paper Ideas
Privacy
▪ Fingerprints on ID cards and passports
▪ The use of synthetic data to learn models
Fairness
▪ A/B testing in the medical domain
▪ A/B tests gone (ethically) wrong
Face recognition
▪ Face recognition discussions in anticipation of the AI Act
▪ Bias in commercial face recognition software

76
