Data Science Ethics - Lecture 9 - Ethical Reporting
Lecture 9
Ethical Reporting
3
Ethical Measurement
▪ Two components
1. Being a good data scientist
2. Being good
4
Being a good data scientist
▪ Correct evaluation: evaluate on a test set, not on the training set
▪ Always use a test set that is as representative as possible
• Out of time versus out of sample
• Large enough
▪ Issues with accuracy as a metric
• Class imbalance: if 99% of the instances belong to class 1,
always predicting the majority class already gives 99% accuracy...
• No regard for the different misclassification costs of false positives (FP) and false negatives (FN)
• Solutions: AUC, lift (curve), profit (curve), etc.
➢ Better evaluation metrics (a quick illustration follows below)
5
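A minimal Python sketch of the class-imbalance problem above; the 99%/1% split and the toy score are illustrative assumptions, not course data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.01).astype(int)      # ~1% positives, ~99% negatives

# "Majority voting": always predict the majority class (0)
majority_pred = np.zeros_like(y)
print(accuracy_score(y, majority_pred))          # ~0.99, yet the model is useless
print(roc_auc_score(y, np.zeros(len(y))))        # 0.5: no ranking ability at all

# A score that only partly separates the classes still shows up in the AUC
score = rng.random(len(y)) + 0.3 * y
print(roc_auc_score(y, score))                   # ~0.75, clearly above chance
```

Accuracy rewards the useless majority-class predictor, while AUC reflects whether the model can rank positives above negatives at all.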
Being a good data scientist
▪ Correct evaluation
• Always use a test set that is as representative as possible
• Issues with accuracy as a metric
▪ Danny: “I trained the model on data from January to November, and
then used four weeks of December as a test set. The prediction model
correctly predicted whether the NASDAQ went up or down on 15
out of the 20 days. Hence a test accuracy of 15/20 = 75%.”
• Representative?
➢ ‘Sell in May and go away. Don’t come back until November’
➢ 20 days enough? (a quick significance check is sketched below)
6
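As a back-of-the-envelope check on Danny's claim (not part of the original slide), one can ask how likely 15 or more correct calls out of 20 would be if the model were just guessing:

```python
from math import comb

n, k = 20, 15
# P(at least k correct out of n) for a coin-flip predictor (p = 0.5)
p_luck = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"P(>= {k}/{n} correct by luck) = {p_luck:.3f}")   # about 0.021
```

So 15/20 would be unlikely by pure luck, but with only 20 test days drawn from a single, possibly unrepresentative month, the accuracy estimate remains fragile.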
Being Good (FAT)
▪ Evaluating FAT
• Personal, sensitive data?
• Privacy considerations? Did you use personal data?
• Explanations needed?
• Transparently reported?
7
Being Good (Robustness)
▪ Robustness: AI systems should be developed with a preventative
approach to risks and in a manner such that they reliably
behave as intended while minimising unintentional and
unexpected harm, and preventing unacceptable harm.
(European ‘Ethics Guidelines for Trustworthy AI’)
• resiliency to attacks, fallback plans, accuracy, reliability, and
reproducibility
• Adversarial attacks
8
Being Good: Robustness in AI
What is Robustness?
Robustness means that an AI system reliably behaves as intended, resists attempts to manipulate it, and keeps working under unexpected conditions.
Key Aspects of Robustness:
Resiliency to Attacks:
The AI system should resist attempts to manipulate it or make it fail.
Example: Adversarial attacks: small, often invisible changes to input data that confuse AI models (like slightly changing an image to trick an AI into misclassifying it).
Example: Changing a few pixels on a picture of a cat makes an AI think it's a dog.
Fallback Plans:
If the AI system fails, there must be a backup plan to handle errors or unexpected issues.
Example: If a self-driving car's sensors fail, the system could safely stop the car or hand over control to the driver.
Reliability and Reproducibility:
The system should produce the same results when given the same inputs, no matter when or where it is run.
Example: A credit-scoring AI should give the same score for a person today and tomorrow if their data remains the same.
Adversarial Attacks
What are they?
Adversarial attacks are attempts to trick an AI system by feeding it manipulated inputs.
Why do they matter?
These attacks highlight weaknesses in AI systems and challenge their robustness.
Example of an Adversarial Attack:
A slightly altered image of a "stop sign" confuses a self-driving car's AI, making it think the sign says "speed limit 60." This could lead to dangerous outcomes.
In short, robustness ensures AI systems are secure, reliable, and resilient to risks, making them safe and trustworthy in real-world scenarios.
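To make the adversarial-attack idea concrete, here is a minimal sketch of one well-known attack, the fast gradient sign method (FGSM). It assumes a differentiable PyTorch classifier `model` and a labelled input `(x, y)`; these names are illustrative and do not come from the slides.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.01):
    """Perturb x by a tiny step in the direction that increases the loss most."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # A change of size epsilon per pixel is often invisible to humans,
    # yet can be enough to flip the model's prediction (cat -> dog).
    return (x + epsilon * x.grad.sign()).detach()
```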
Being Good (Sustainable)
▪ Sustainable: energy consumption
• How much energy is consumed to get a prediction?
• GPT-3 language model from OpenAI: trained on roughly
500 billion tokens, 175 billion parameters
• CO2 emissions of training the GPT-2 model = lifetime carbon
footprint of 5 cars
• Possible to consume less energy? Less retraining, optimized
deployment, limiting the hyperparameter search space, limiting
wasted experiments, etc. (a rough emissions estimate is sketched below)
https://mlco2.github.io/impact/
11
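A back-of-the-envelope estimate in the spirit of the ML CO2 Impact calculator linked above; the GPU power draw, PUE, and grid carbon intensity below are illustrative assumptions, not measured values:

```python
def training_co2_kg(gpu_hours, gpu_power_kw=0.3, pue=1.5, kg_co2_per_kwh=0.4):
    """Energy drawn by the GPUs, scaled by the data centre's power usage
    effectiveness (PUE), times the grid's carbon intensity."""
    energy_kwh = gpu_hours * gpu_power_kw * pue
    return energy_kwh * kg_co2_per_kwh

# Example: 1,000 GPU-hours on a ~300 W accelerator
print(f"{training_co2_kg(1000):.0f} kg CO2e")   # ~180 kg under these assumptions
```

Fewer retrainings, a smaller hyperparameter search space, and fewer wasted experiments all reduce the GPU-hours, the dominant factor in this estimate.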
Being Good (Sustainable)
▪ Large-scale data science (DL) consumes substantially more power
than common data science (cf. RF, decision trees, logit, etc.)
13
Ethical Interpretation
▪ Are the results significantly better than some baseline?
• Statistical test
14
p-Hacking
▪ p-value: “probability of obtaining results as extreme as the
observed results of a statistical hypothesis test, assuming
that the null hypothesis is correct.”
• Hypothesis test often: Acc(M1) = Acc(M2) (or “not larger”)
• Significant if p-value < α (mostly 5%): what is the probability of
this result occurring if the null hypothesis is true? (see the sketch below)
https://bitssblog.files.wordpress.com/2014/02/nelson-presentation.pdf
15
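A minimal sketch of such a test for H0: Acc(M1) = Acc(M2) on the same test set, using the exact McNemar test (only the cases where the two models disagree carry information). The labels and predictions are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
pred1 = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)  # ~80% accurate
pred2 = np.where(rng.random(500) < 0.75, y_true, 1 - y_true)  # ~75% accurate

m1_ok, m2_ok = pred1 == y_true, pred2 == y_true
b = int(np.sum(m1_ok & ~m2_ok))   # M1 right where M2 is wrong
c = int(np.sum(~m1_ok & m2_ok))   # M2 right where M1 is wrong
# Under H0 the discordant cases split 50/50, so an exact binomial test applies
p_value = binomtest(b, b + c, 0.5).pvalue
print(f"b={b}, c={c}, p-value={p_value:.3f}")
```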
p-Hacking
▪ Suppose these results
Fig.5-10 from Provost and Fawcett’s book
17
Keep adding data or getting rid of data until the result is desirable
▪ The more tests, the higher the probability some rare event occurs
• If I throw 1,000 one-euro coins, some cluster of them will happen to land all heads
• Suppose I build m models and test whether the performance is
significantly better than random guessing, at 5% level.
P(at least 1 significant result) = 1 – P(no significant result)
= 1 – (1 – 0.05)^m
➢ m = 1 ➔ P(at least 1 significant result) = 5% (larger m computed below)
20
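The formula above can be evaluated for a few values of m to see how quickly the risk of a spurious "significant" finding grows (m = 49 matches the number of tests reported in the Twitter case later in this lecture):

```python
# Probability of at least one significant result at the 5% level when m
# independent models are tested, even if none beats random guessing
for m in (1, 5, 10, 20, 49):
    p_any = 1 - (1 - 0.05) ** m
    print(f"m = {m:3d}: P(at least one significant result) = {p_any:.1%}")
# m = 49 already gives a ~92% chance of a spurious "discovery"
# without a multiple-testing correction
```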
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel
21
Ethical Reporting (2)
▪ Report transparently: the good and the bad
• Report why you chose that sample size
• What variables were included, excluded and why?
• What models were tried, and why?
• What evaluation metrics were (not) used and why?
• Etc.
▪ Reproducibility crucial in scientific research and industry
• Make the code publicly available (GitHub)
• Make the data publicly available (if possible) or use as many
publicly available data sources (UCI, Kaggle) as possible
• Similarly important in industry: for other data scientists
▪ “Here we investigate whether measurements of collective mood states derived from large-
scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA)
over time”
▪ “Our results indicate that the accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others. We find an accuracy of
87.6% in predicting the daily up and down changes in the closing values of the DJIA and a
reduction of the Mean Average Percentage Error by more than 6%”
▪ Reported on by The Telegraph, The Daily Mail, USA Today, The Atlantic, Wired Magazine,
Time Magazine, CNBC, CNN
24
Case 1: Too good to be true?
▪ “The junk science behind the ‘Twitter Hedge Fund’”
▪ 49 statistical tests conducted (reported): “The authors make a much
more basic statistical blunder, that of not correcting for multiple
hypothesis testing.”
▪ “accuracy of 87.6% in predicting the daily up and down changes in the
closing values of the DJIA” (Bollen et al, 2010)
• Based on only 15 days of test data (a very small number of days)
• They got the number wrong: 13/15 = 86.7%
• Best result chosen from 8 models
▪ “The hedge fund set up in late 2010 to implement the Twitter mood
strategy, Derwent Capital Markets, failed and closed in early 2012.”
https://sellthenews.tumblr.com/post/21067996377/noitdoesnot
25
https://econjwatch.org/articles/shy-of-the-character-limit-twitter-mood-predicts-the-stock-market-revisited
Case 2
26
Ethical Academic Reporting
▪ Reproducibility, or the lack of it, as a warning sign
▪ Often not reproducible
▪ 2012 Reuters and Science story looked at basic cancer
research
• Global cancer research at a biotech company (Amgen), led by Begley
➢ Aimed to reproduce 53 published “landmark” studies
➢ Could only replicate 6; the other 47 (89%) could not be reproduced
➢ Reached out to the authors. “We went through the paper line by
line, figure by figure. I explained that we re-did their experiment 50
times and never got their result. He said they’d done it six times
and got this result once, but put it in the paper because it made
the best story.”
▪ 2016 Nature survey among 1,576 researchers: 70% have
tried and failed to reproduce experiments of another group.
27
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel
28
Diederik Stapel
▪ Well-respected professor in social psychology and dean at
the University of Tilburg, who won several awards
▪ Much of his research included survey experiments.
• Eg: “meat eaters are more selfish and less social than
vegetarians“
▪ Much of it was fabricated: he made up experiments, entered the data
himself, and gave the data to his PhD students
▪ 50+ papers retracted
▪ Later published a book: driven by power and the desire to
be honoured, as well as by the messiness of experimental
results
29
30
https://nl.wikipedia.org/wiki/Marc_Hooghe
31
Conclusion
▪ Report correctly; think of FAT FLOW.
▪ Don’t be tempted: you don’t want to appear in the media
for unethical data science practices (cf. Danny or
Stapel)
32
Presentation and Paper Ideas
▪ How many stock price prediction papers get the data science
right?
▪ P-hacking cases
▪ Other fallen academics and retracted papers
▪ Unethical reporting in finance, advertising, etc.
33