Data Science Ethics - Lecture 3
Symmetric encryption
▪ One key used for encryption and decryption
[Diagram: (1) Alice encrypts “Secret” with the shared key into ciphertext “Sà!3Lksd(”, (2) sends it, (3) Bob decrypts it back to “Secret” with the same key; eavesdropper Eve sees only the ciphertext]
Symmetric encryption
▪ One key used for encryption and decryption
• Caesar cipher: shift each letter 3 positions to the right (see the sketch below)
• Weakness: letter frequencies and predictable starting/ending words
(“Dear”, “Yours sincerely”, etc.), or a brute force attack over all 25 shifts
[Diagram: Alice encrypts with a 3-shift: “Hello Bob” becomes “Khoor Ere”; Bob shifts back to decrypt; Eve sees only “Khoor Ere”]
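To make this concrete, a minimal Python sketch of the shift cipher (illustrative code, not from the original slides):

def caesar(text, shift):
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation untouched
    return "".join(result)

assert caesar("Hello Bob", 3) == "Khoor Ere"   # encryption (matches the slide)
assert caesar("Khoor Ere", -3) == "Hello Bob"  # decryption with the same key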
Symmetric encryption
▪ DES: Data Encryption Standard
• One of the first major standards in symmetric-key encryption
• 56 bit key
• 2^56 ≈ 7 x 10^16 possible keys
• Flaw: too small as brute force attack would find key
▪ AES: Advanced Encryption Standard
• By Belgians Vincent Rijmen and Joan Daemen (1998)
• 128, 192 or 256 bit keys
• 2^128 ≈ 3 x 10^38 possible keys, considered safe in the current age
• NIST standard since 2001 (winner of the late-90s AES competition)
▪ Challenges
• How to share keys: insecure channels, or significant overhead
• How to manage keys: if u users need to communicate with one
another → (u-1) + (u-2) + … + 1 = u x (u-1) / 2 keys need to be
shared before communicating
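For illustration, a minimal AES-256 sketch, assuming the third-party Python package cryptography is installed (the package and the message are my choice, not part of the slides):

from os import urandom
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # the one shared secret key
nonce = urandom(12)                        # must be unique per message

ciphertext = AESGCM(key).encrypt(nonce, b"Secret", None)  # Alice encrypts
plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)  # Bob decrypts
assert plaintext == b"Secret"   # the same key works both ways: symmetric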
Asymmetric encryption
▪ Two keys: public and private key
• Public key: revealed to the world
• Private key: kept secret at one party
[Diagram: (1) Alice encrypts “Secret” with Bob’s public key into cipher text “Sà!3Lksd(”, (2) sends it, (3) Bob decrypts with his private key; Eve sees only the cipher text]
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1977)
• Popular algorithm for asymmetric encryption
• Principle:
➢ Multiplying two large numbers is easy and fast
➢ Decomposing a large number into prime numbers: very difficult
➢ For example: 17 x 23 = ? Conversely: decompose 391 into 2 prime numbers?
➢ If the numbers are large enough: no efficient (non-quantum) integer
factorisation algorithm is known
• Given p and q large prime numbers
➢ n = p x q; make n the public key, p and q the private key
➢ Encrypt message m, using some number e, also made public
▪ e chosen to be coprime with (p-1) x (q-1)
▪ c = m^e mod n
➢ Decrypt c: possible for whoever knows the private key p and q
▪ d chosen such that d x e mod ( (p-1) x (q-1) ) = 1
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1977)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen to be coprime with (p-1) x (q-1), e.g. 35
▪ c = m^e mod n
➢ Decrypt c: possible for whoever knows the private key p and q
▪ d chosen such that d x e = 1 mod [ (p-1) x (q-1) ]
▪ d is also private, as computing it requires knowing p and q
▪ m = c^d mod n
▪ Advantages
• Sharing keys: only public ones need to be shared (no need for secrecy)
• Manage keys: share only u keys among u users
▪ Disadvantage
• Takes more time than symmetric encryption
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1977)
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen to be coprime with (p-1) x (q-1), e.g. 5
▪ c = m^e mod n
➢ Decrypt c, who knows private key p and q
▪ d chosen such that d x e = 1 mod [ (p-1) x (q-1) ], so d x e = [ k x (p-1) x (q-1) ] + 1
▪ m = c^d mod n
(can only be calculated if you know the decomposition into the prime factors p and q)
▪ Example (see the worked sketch below)
• p = 7, q = 3 ➔ n = ?
• Message is the letter “l” => 12th letter in the alphabet: so m = 12
• c = ? [3]
• For k = 2, d = ?
• m = ?
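The example can be checked with Python’s built-in modular exponentiation, pow(base, exp, mod); this sketch just reproduces the slide’s numbers:

p, q = 7, 3
n = p * q                  # n = 21, the public modulus
phi = (p - 1) * (q - 1)    # 12
e = 5                      # public exponent, coprime with 12

m = 12                     # the letter "l"
c = pow(m, e, n)           # encryption: 12^5 mod 21
print(c)                   # 3

k = 2
d = (k * phi + 1) // e     # d x e = k x (p-1)(q-1) + 1  ->  d = 5
print(pow(c, d, n))        # decryption: 3^5 mod 21 = 12, the original m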
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key (slow)
• Subsequently symmetric encryption with agreed secret key
(fast)
[Diagram, client–server handshake:
1. Client generates random number r
2. Client encrypts r with the public key of the server, giving c, and sends c
3. Server decrypts c with its own private key, recovering r
4. All subsequent communication uses fast symmetric encryption with secret key r]
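A simplified sketch of this hybrid handshake (real TLS is considerably more involved), again assuming the third-party cryptography package:

from os import urandom
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Server: RSA key pair; only the public key is shared with clients.
server_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Client (steps 1-2): generate secret r, encrypt with the server's public key.
r = AESGCM.generate_key(bit_length=128)
c = server_key.public_key().encrypt(r, oaep)

# Server (step 3): decrypt c with the private key to recover r.
assert server_key.decrypt(c, oaep) == r

# Step 4: all further traffic uses fast symmetric encryption under r.
nonce = urandom(12)
message = AESGCM(r).encrypt(nonce, b"all subsequent communication", None)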
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key
• Subsequently symmetric encryption with agreed secret key
Encryption for data protection
▪ Whenever you need to communicate online:
• Use the TLS protocol
• Even considered by search engines when determining ranking
▪ Whenever you store personal data:
• Encrypt the data
• Beyond USBs, laptops, PCs and smartphones: bikes, cars, etc.
• Cars:
➢ Personal data: address, routes, contacts, etc.
➢ FTC (US): advises cleaning out personal data before selling a car
➢ Tesla:
▪ Personal data is hidden when the keys are given to a valet (valet mode)
▪ 2018: revealed that some data is stored unencrypted
https://www.cnbc.com/2019/03/29/tesla-model-3-keeps-data-like-crash-videos-location-phone-contacts.html
Hashing
▪ Another useful cryptographic function
• Input: i
• Output: hash o, always of the same fixed length
▪ One way function:
• easy to go from input to hash, very hard to do the reverse
• Main difference with encryption
▪ MD5: Rivest (1992), built on the Merkle-Damgård construction
• Popular hashing algorithm
• Input: any string, output: 128 bit hash
https://www.md5hashgenerator.com/
▪ SHA-3
• Proposed by NIST (2015), output of up to 512 bits
• Developers: Bertoni, Daemen (him again), Peeters and Van Assche
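Both algorithms are available in Python’s standard hashlib module; a quick illustration:

import hashlib

msg = b"Data Science Ethics"
print(hashlib.md5(msg).hexdigest())       # 128-bit hash, 32 hex characters
print(hashlib.sha3_512(msg).hexdigest())  # SHA-3 with 512-bit output

# Same input always gives the same fixed-length hash;
# even a one-character change produces a completely different hash.
print(hashlib.md5(b"data science ethics").hexdigest())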
Hashing
▪ Main use: integrity
• Input: mail m
• Output: hash h
• If someone changes a word: the modified mail hashes to a
completely different value
Hashing
▪ Simple example for text
• Take the position of each letter in the alphabet
• Sum all the integers
• Take the last digit of the sum as hash
• Try with “ball” and “bell” (see the sketch below)
➢ One-way, fixed length hash (of size 1)
➢ Many hash collisions
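A sketch of this toy hash in Python:

def toy_hash(text):
    """Sum the alphabet positions of the letters; the hash is the last digit."""
    total = sum(ord(ch) - ord("a") + 1 for ch in text.lower() if ch.isalpha())
    return total % 10

print(toy_hash("ball"))  # 2+1+12+12 = 27 -> 7
print(toy_hash("bell"))  # 2+5+12+12 = 31 -> 1
print(toy_hash("lbal"))  # 7 again: any anagram of "ball" collides

With only ten possible outputs, collisions are unavoidable; real hash functions keep the one-way, fixed-length properties while making collisions astronomically unlikely.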
Hashing
▪ Hashing passwords
▪ Widely used to send passwords over the Internet
• Naive:
➢ if website doesn’t use TLS, or on a public wifi: password sent in
plain text (visible to eavesdropping Eve).
➢ In database: store username and password in plain text.
▪ If ever data leak: all passwords public
▪ Employees can snoop in the database
[Table: User and Password columns, both stored in plain text]
Hashing
▪ Hashing passwords
• Better:
➢ Hash password client side (so on the laptop/PC)
➢ Send the hashed value to server, and store this in database
➢ If ever leaked: hard to know what password was (one way)
➢ BUT: people tend to use same password: frequently used
passwords can be identified
User Hash
David Martens e10adc3949ba59abbe56e057f20f883e
Jennifer Johnsen e10adc3949ba59abbe56e057f20f883e
Latifa Jenkins f3f092cd075b3e050451239611a9e1e9
…
Hashing for data protection
▪ Most popular passwords
Popular passwords Hash (MD5)
123456 e10adc3949ba59abbe56e057f20f883e
password 5f4dcc3b5aa765d61d8327deb882cf99
123456789 25f9e794323b453885f5181f1b624d0b
12345678 25d55ad283aa400af464c76d713c07ad
12345 827ccb0eea8a706c4c34a16891f84e7b
111111 96e79218965eb72c92a549dd5a330112
1234567 fcea920f7412b5da7be0cf42b8c93759
sunshine 0571749e2ac330a7455809c6b0e7af90
qwerty d8578edf8458ce06fbc5bb76a58c5ca4
iloveyou f25a2fc72690b780b2a14e140ef6a9e0
princess 8afa847f50a716e64932d995c8e7435a
…
▪ Eavesdropping Eve can check whether a hash she observes occurs in this list
• e10adc3949ba59abbe56e057f20f883e: found! Password found: 123456
• Only a subset of all possible (popular) passwords needs to be checked: a “rainbow table”
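A sketch of Eve’s lookup attack: precompute the hashes of popular passwords once, then match any observed hash against that table (a much-simplified stand-in for a real rainbow table):

import hashlib

popular = ["123456", "password", "123456789", "12345678", "qwerty"]
table = {hashlib.md5(pw.encode()).hexdigest(): pw for pw in popular}

observed = "e10adc3949ba59abbe56e057f20f883e"  # hash seen in a leak
print(table.get(observed, "not in table"))     # -> 123456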
Hashing for data protection
…
Hashing for data protection
▪ Hashing passwords
• Standard practice, expected to be used
• Knuddels.de:
➢ German chat community which stored passwords in plain text,
not hashed
➢ Hacked: 800,000 email addresses and passwords published online
➢ GDPR fine of 20,000 €
Hashing for data protection
▪ To ensure no personal data is copied throughout the data
processing system
• Many copies, local downloads
➢ If someone asks to remove all their personal data: cumbersome task,
never 100% sure
• Personal data often not explicitly needed
(e.g. password, exact name)
➔ Use personal data in one table, with hashed personal ID
➔ Use the hashed personal ID in all other parts of the system
Hashing for data protection
▪ To ensure no personal data is copied throughout the data
processing system
▪ Using hashed values of personal data: pseudonymisation
• Keep hash table secure
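A sketch of this pseudonymisation idea using a keyed hash (HMAC), so pseudonyms cannot be recomputed without the secret key; all names and values here are made up for illustration:

import hmac, hashlib

SECRET_KEY = b"keep-this-in-a-secure-vault"  # hypothetical key; store securely

def pseudonym(person_id: str) -> str:
    """Keyed hash of a personal ID: stable pseudonym, hard to reverse."""
    return hmac.new(SECRET_KEY, person_id.encode(), hashlib.sha256).hexdigest()

# One secured table links real IDs to pseudonyms...
secure_table = {"alice@example.com": pseudonym("alice@example.com")}

# ...all other parts of the system only ever see the pseudonym.
analytics_row = {"user": pseudonym("alice@example.com"), "purchases": 3}
print(analytics_row)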
Encryption vs hashing
▪ Both useful for storing and analysing data
▪ Key difference: encryption is reversible for whoever holds the key; hashing is one way
Discussion Case 2
Quantum Computing
▪ Bit: 0 or 1
▪ Qubit: 0 and 1 at the same time
[Figure: a bit holds either 0 or 1; a qubit holds 0 and 1 at the same time]
Quantum Computing
▪ Schrödinger’s cat
• Cat in a closed box, 50% probability it dies
• Both dead and alive, until someone opens the box and
observes the cat
➢ This is what actually happens at the atomic level
https://www.youtube.com/watch?v=QuR969uMICM
Quantum Computing
▪ Superposition
• Two states at the same time
• Superposition collapses when the state is observed
• E.g. photon or spin of electron
▪ Bit: can take 2 values: 0 or 1
▪ Qubits:
• two qubits together can be in 4 states at once: 00, 01, 10 and 11
• n qubits can store 2^n states at the same time
• Qubits are “entangled”
➢ Observing one qubit, instantaneously reveals the other, even if at
the other side of the galaxy
➢ “spooky action at a distance” (Albert Einstein)
➢ Proven, 2017: entangled qubits 1,200 km apart
Quantum Computing
▪ Quantum computing
• Represent a problem and add a way to assess the answer
• Let the qubits decohere (collapse) such that only answers that
pass the test survive
▪ For personal data protection?
• Shor’s algorithm for factoring large numbers
• Would “crack” the popular RSA algorithm for asymmetric
encryption
▪ Current quantum computers
• Few dozen qubits
• Need thousands of qubits to break RSA
• IBM SVP of cloud and cognitive software (2019):
“within a decade”
Quantum Computing
▪ For personal data protection? (bis)
• Data in quantum state cannot be copied unnoticed
• Copying requires observing the data: eavesdropping detected
Quantum Computing
▪ Why quantum computing is a game-changer for encryption
• Could break widely-used asymmetric methods (RSA, ECC) once
sufficiently powerful quantum computers are available
• Halves the effective security of symmetric encryption (e.g. AES),
necessitating longer keys
• In response, post-quantum cryptography is being developed:
quantum-resistant methods to secure data even in a future with
quantum computers
Government Backdoor
Feb. 2016
Government Backdoor
Pro
“The bottom line is, if you look at both the terrorists in San
Bernardino and the Boston Marathon bombers, they were
family members. Most family members talk to each other
face to face. The government doesn't
have access to that after the fact.“
Discussion Case 3
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation
Obfuscation
▪ Encryption of personal data:
• Explicitly hiding secret information
• End user has little control
▪ Obfuscation: “the deliberate addition of ambiguous,
confusing, or misleading information to interfere with
surveillance and data collection” (Brunton and Nissenbaum,
2015)
• Hide information by adding noise to the system
• End user control
• e.g. automatically generate a large number of search queries
(TrackMeNot)
• e.g. automatically click on all ads (AdNauseam)
Obfuscation
▪ Can be driven by a technical system, but also human
• #BrusselsLockdown
▪ Is it ethical?
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation
Public Data
▪ Public Data is not Free-to-Copy Data
• Database right
• Policies
Public Data
▪ What is a Database:
• “collection of independent works, data or other materials which are
arranged in a systematic or methodical way and are individually
accessible by electronic or other means” (Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal protection of
databases)
• Broad definition: datasets, mailing lists, telephone directories, etc.
▪ Database right:
• You are not allowed to copy or extract substantial parts of a database
without the owner’s consent.
• Even if the content is not copyright protected
• 15 years
• When there is a substantial investment in obtaining, verifying and
presenting the content (Europe and UK)
• Not in, for example, the United States
Public Data
▪ Policies of individual data sources (such as websites) tell you what you
can do
▪ Some concepts
• Web scraping: extracting data from websites
• robots.txt
➢ Tells a crawler (bot) what pages it is allowed to visit/request.
➢ https://www.facebook.com/robots.txt
# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php
…
User-agent: *
Disallow: /
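A small sketch using Python’s standard urllib.robotparser to check what a crawler is allowed to request:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

# "User-agent: * / Disallow: /" means: no page may be crawled.
print(rp.can_fetch("*", "https://www.facebook.com/any_page"))  # False

Note that robots.txt only states the site’s policy; honouring it is up to the crawler.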
https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html
https://www.youtube.com/watch?v=-JkBM8n8ixI
Clearview.AI
▪ Several ethical issues
1. Data gathering: public data is not free-to-copy data
2. Data gathering: encryption
▪ Client list leaked: it was not saved securely; are the images?
3. Deployment: access to system
▪ Law enforcement: 600+ law enforcement agencies, such as FBI, US Immigration
and Customs Enforcement
▪ Companies: Walmart, NBA, Macy’s, 46 financial institutions, among others
▪ Persons?
▪ 30 day free trial
▪ Discussion:
• Misuse? By 1) law enforcement, 2) businesses, 3) persons.
• Why are we OK with 1) mug shots and driver’s licenses, and 2)
fingerprints, but not with faces?
https://www.buzzfeednews.com/article/ryanmac/clearview-ai-fbi-ice-global-law-enforcement
Clearview.AI
https://www.standaard.be/cnt/dmf20200228_04869700
Ethical Data Gathering
▪ Privacy as a Human Right
▪ Regulation: GDPR
▪ Privacy Mechanisms: Encryption and Hashing
▪ Obfuscation
▪ Public Data
▪ Bias
▪ Experimentation
Bias
▪ Systematic prejudice for or against a certain group
▪ An overloaded term
• Bias in data or model against a sensitive group: cf. fairness
➢ This type of bias occurs when the sample of data used for training a
model is not representative and favors or disadvantages specific groups,
often defined by sensitive characteristics like race, gender, or
socioeconomic status. In the context of fairness, this type of bias can
lead to discriminatory outcomes for certain groups.
• Bias / variance trade-off: bias error (assumptions of the algorithm) +
variance error (sensitivity to the training data)
➢ Bias error: error due to overly simplistic assumptions in the model, which
can lead to underfitting. High bias means the model does not capture the
underlying patterns well.
➢ Variance error: error due to high sensitivity to fluctuations in the training
data, which can lead to overfitting. High variance means the model is too
finely tuned to the specific training data and may not generalize well to
new data.
➢ Example: a linear regression model might have high bias if the true
relationship is nonlinear. Conversely, a complex model like a deep neural
network may have high variance if trained on a small dataset, leading it
to memorize rather than generalize.
➢ Impact: finding the right balance between bias and variance is crucial to
building models that generalize well. Too much bias leads to consistently
inaccurate predictions, while too much variance leads to instability and
poor performance on new data.
• Bias as the intercept term of a linear model
➢ Definition: in linear models, bias can refer to the intercept term, the
constant added to the prediction. The intercept is the model’s prediction
when all feature values are zero, effectively serving as the model’s baseline.
➢ Example: in a linear regression model predicting housing prices, the
intercept might represent the base price of a house when all other features
(e.g. size, location) are zero. If the intercept is too high or too low, it
can skew the entire prediction.
➢ Impact: an incorrect intercept can cause biased predictions across all data
points, leading the model to consistently overestimate or underestimate the
target variable. Adjusting the intercept can improve the accuracy and
interpretability of the model.
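To make the bias/variance trade-off concrete, a small sketch on synthetic data, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)  # nonlinear truth + noise

x_test = np.linspace(0, 1, 200)
y_truth = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 15):
    coefs = np.polyfit(x, y, degree)       # fit a polynomial of this degree
    y_hat = np.polyval(coefs, x_test)
    mse = np.mean((y_hat - y_truth) ** 2)
    print(f"degree {degree:2d}: test error = {mse:.3f}")

# degree 1: high bias (underfits the sine); degree 15: high variance
# (chases the noise in only 20 points); degree 3 balances the two.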
Sample Bias
▪ What is a sample?
• Part of the population
▪ What is sampling?
• Act, process or technique of selecting a suitable sample or a
representative part of the population for the purpose of determining
parameters
▪ Why Sample?
• Economic advantage
• Time factor
• Large Populations
• Partly accessible populations
• Computation Power Required
https://www.pewinternet.org/2019/04/24/sizing-up-twitter-users
Sample Bias
▪ Example 2: bullet hole data to predict where to place armor
• Representative sample? Only planes that returned are observed
▪ Example 3: data collected via a smartphone app
• Representative sample?
➢ Need to have a smartphone
➢ Neighborhoods with elderly people and lower income underrepresented
Sample Bias
▪ An algorithm does not eliminate human bias, it is only as
good as the data it works with
▪ Under- or over-representation of sensitive group can lead
to disparate impact
Bias
▪ Bias in our language
• Word embeddings
• Boy is to girl as man is to [?] (see the sketch below)
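A sketch of the analogy test, assuming the third-party gensim package and its downloadable pretrained word2vec vectors (a large download; results depend on the corpus):

import gensim.downloader as api

# Pretrained word2vec vectors trained on Google News.
model = api.load("word2vec-google-news-300")

# "boy is to girl as man is to ?"  ->  vector(girl) - vector(boy) + vector(man)
print(model.most_similar(positive=["girl", "man"], negative=["boy"], topn=1))

# The same arithmetic can surface gender bias learned from text, e.g.:
print(model.most_similar(positive=["woman", "doctor"], negative=["man"], topn=3))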
Human experiments
Experimentation
▪ Nuremberg Code for ethical rules for research involving
human subjects
• informed consent without coercion
• the ability to withdraw from the experiment at any time
• avoiding all unnecessary physical and mental suffering and
injury
Experimentation
▪ A/B testing
▪ Randomized experiment with two variants: A and B
▪ Vary one variable and assess the effect
▪ “If you use the Internet, you're the subject of hundreds of experiments at any given time,
on every site. That's how websites work.“
OKCupid president.
▪ “If you're lying to your users to improve your service, what's the line between A/B testing
and fraud?“
Washington Post’s Brian Fung
www.washingtonpost.com/news/the-switch/wp/2014/07/28/okcupid-reveals-its-been-lying-to-some-of-its-users-just-to-see-whatll-happen/
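For illustration, a minimal sketch of how the effect in an A/B test might be assessed, here with a two-proportion z-test on made-up conversion counts:

import math

# Hypothetical outcomes: conversions / visitors per variant (assumed numbers).
conv_a, n_a = 120, 2400   # variant A
conv_b, n_b = 156, 2400   # variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value under the normal approximation.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"lift: {p_b - p_a:+.2%}, z = {z:.2f}, p = {p_value:.3f}")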
Summary data gathering
▪ Fair
• to the data subject and model applicant: privacy (GDPR)
• to the model applicant: representative sample
▪ Transparent
• to the data subject and model applicant: is it clear what data is
used, for what purposes and for how long?
• to the model applicant: A/B testing with informed consent,
minimal risks and ethical oversight
• to the data scientist and manager: how is the data gathered,
over- or undersampling
▪ Accountable
• are appropriate and effective measures put in place to comply
with the answers to the above questions? Who is responsible?
Presentation and Paper Ideas
Privacy
▪ Fingerprints on ID cards and passports
▪ The use of synthetic data to learn models
Fairness
▪ A/B testing in the medical domain
▪ A/B tests gone (ethically) wrong
Face recognition
▪ Face recognition discussions in anticipation of the AI Act
▪ Bias in commercial face recognition software