
Data Science & Ethics

Lecture 9

Ethical Reporting

Prof. David Martens


[email protected]
www.applieddatamining.com
@ApplDataMining
AI Ethics in the News
2
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel

3
Ethical Measurement
▪ Two components
1. Being a good data scientist
2. Being good

4
Being a good data scientist
▪ Correct evaluation: on a test set, not the training set
▪ Always use a test set that is as representative as possible
• Out of time versus out of sample
• Large enough
▪ Issues with accuracy as a metric
• Class imbalance: with 99% of instances in class 1, always
predicting the majority class already gives 99% accuracy...
• No regard for misclassification costs between FP and FN
• Solutions: AUC, lift (curve), profit (curve), etc.
➢ Better evaluation metrics
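A minimal sketch of the imbalance problem (scikit-learn assumed; the 99:1 ratio and all settings are illustrative): a trivial majority-class baseline is nearly indistinguishable from a trained model on accuracy, while AUC exposes the difference.

```python
# Sketch: why accuracy misleads on imbalanced data (99:1 ratio is illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
majority = np.zeros_like(y_te)  # baseline: always predict the majority class

print("majority-class accuracy:", accuracy_score(y_te, majority))  # ~0.99
print("model accuracy:         ", accuracy_score(y_te, model.predict(X_te)))
print("model AUC:              ", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
# The majority baseline's AUC is 0.5: it never ranks a positive above a negative.
```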

5
Being a good data scientist
▪ Correct evaluation
• Always use a test set that is as representative as possible
• Issues with accuracy as a metric
▪ Danny: “I trained the model on data from January to November, and
then used four weeks of December as a test set. The prediction model
was correct in predicting whether the NASDAQ went up or down in 15
out of the 20 days. Hence a test accuracy of 15/20 = 75%.”
• Representative?
➢ ‘Sell in May and go away. Don’t come back until November’
➢ 20 days enough?

• Is accuracy the right metric?


➢ Better baselines
➢ Better evaluation metrics
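Two quick sanity checks Danny skipped, sketched below (scipy assumed; the “14 up-days” figure is invented purely for illustration): how surprising is 15/20 under pure chance, and what does a trivial always-up baseline already score?

```python
# Sketch: sanity-check Danny's 15/20 before trusting it.
from scipy.stats import binom

# How likely is >= 15/20 from blind guessing (p = 0.5)? About 2.1%:
print("P(>=15/20 by coin flip):", binom.sf(14, 20, 0.5))

# But a coin flip is a weak baseline: if, say, 14 of the 20 December days
# simply closed up (an assumed figure), the trivial rule "always predict up"
# already scores 70%, and the model's 75% barely beats it.
print("always-up baseline accuracy (assumed 14 up-days):", 14 / 20)
```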

6
Being Good (FAT)
▪ Evaluating FAT
• Personal, sensitive data?
• Privacy considerations? Did you use personal data?
• Explanations needed?
• Transparently reported?

7
Being Good (Robustness)
▪ Robustness: AI systems should be developed with a preventative
approach to risks and in a manner such that they reliably
behave as intended while minimising unintentional and
unexpected harm, and preventing unacceptable harm.
(European ‘Ethics Guidelines for Trustworthy AI’)
• Resiliency to attacks, fallback plans, accuracy, reliability, and
reproducibility
• Adversarial attacks

What is robustness? Robustness means that an AI system:
• Is resilient to risks and unexpected issues
• Behaves as it is intended to, even under unusual or stressful conditions
• Minimizes unintended and unexpected harm to people or processes
• Avoids causing unacceptable harm
This concept is part of the European Union’s Ethics Guidelines for Trustworthy AI.

Key aspects of robustness:
1. Resiliency to attacks: the AI system should resist attempts to manipulate
it or make it fail.
• Example: adversarial attacks, small (often invisible) changes to input data
that confuse AI models, like slightly changing an image to trick an AI into
misclassifying it. Changing a few pixels on a picture of a cat makes an AI
think it’s a dog.
2. Fallback plans: if the AI system fails, there must be a backup plan to
handle errors or unexpected issues.
• Example: if a self-driving car’s sensors fail, the system could safely stop
the car or hand over control to the driver.
3. Accuracy: the AI must consistently provide correct and precise results.
• Example: a medical diagnosis system should give accurate predictions,
even when the data is noisy or incomplete.
4. Reliability: the system should work consistently under different conditions
and provide stable results.
• Example: a voice assistant should recognize commands in various
environments, like quiet or noisy rooms.
5. Reproducibility: the system should produce the same results when given
the same inputs, no matter when or where it is run.
• Example: a credit-scoring AI should give the same score for a person today
and tomorrow if their data remains the same.

Adversarial attacks:
• What are they? Adversarial attacks are attempts to trick an AI system by
feeding it manipulated inputs.
• Why do they matter? These attacks highlight weaknesses in AI systems and
challenge their robustness.
• Example: a slightly altered image of a “stop sign” confuses a self-driving
car’s AI, making it think the sign says “speed limit 60.” This could lead to
dangerous outcomes.
In short, robustness ensures AI systems are secure, reliable, and resilient to
risks, making them safe and trustworthy in real-world scenarios.

8
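To make the adversarial-attack idea concrete, here is a minimal NumPy sketch of a fast-gradient-sign-style (FGSM) attack on a toy logistic-regression scorer; the weights, input, and epsilon are all illustrative values, not a real system.

```python
# Minimal FGSM-style sketch on a toy logistic-regression scorer:
# nudge each input feature by epsilon in the direction that increases the loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0   # toy model weights
x, y = rng.normal(size=10), 1     # one input, true label 1

# For logistic loss, the gradient of the loss w.r.t. the input x is (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

eps = 0.25                          # perturbation budget (illustrative)
x_adv = x + eps * np.sign(grad_x)   # FGSM step

print("score before:", sigmoid(w @ x + b))
print("score after: ", sigmoid(w @ x_adv + b))  # pushed toward the wrong class
```

Real attacks on deep networks work the same way, only with the input gradient obtained by backpropagation instead of this closed form.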
Being Good (Sustainable)
▪ Sustainable: energy consumption. How much energy is consumed to
get the prediction?
• GPT-3 language processing model from OpenAI: trained on
roughly 500 billion tokens, 175 billion parameters
• CO2 emissions of GPT-2 model training = lifetime carbon
footprint of 5 cars
• Possible to consume less energy? Less retraining, optimized
deployment, limiting the search space of hyperparameters, limiting
wasted experiments, etc.

https://mlco2.github.io/impact/

11
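Calculators like the ML CO2 Impact tool linked above boil down to one multiplication: energy drawn, times datacenter overhead, times grid carbon intensity. A sketch with purely illustrative inputs, not measured values:

```python
# Back-of-the-envelope training-emissions estimate. All inputs are
# illustrative assumptions.
def training_co2_kg(gpu_power_kw, n_gpus, hours, pue, grid_kgco2_per_kwh):
    energy_kwh = gpu_power_kw * n_gpus * hours * pue  # PUE: datacenter overhead
    return energy_kwh * grid_kgco2_per_kwh

# e.g. 8 GPUs at 0.3 kW for 72 h, PUE 1.5, grid at 0.25 kgCO2/kWh
print(training_co2_kg(0.3, 8, 72, 1.5, 0.25), "kg CO2")  # ~64.8 kg
```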
Being Good (Sustainable)
▪ Large-scale data science (deep learning) consumes substantially
more power than common data science (cf. random forests,
decision trees, logit, etc.)

“How sustainable is ‘common’ data science in terms of power
consumption?” B. Meulemeester, D. Martens (2022),
arXiv preprint arXiv:2207.01934

12
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel

13
Ethical Interpretation
▪ Are the results significantly better than some baseline?
• Statistical test

▪ Two potential ethical issues:


1. p-hacking
2. Multiple comparisons

14
p-Hacking
▪ p-value: “probability of obtaining results as extreme as the
observed results of a statistical hypothesis test, assuming
that the null hypothesis is correct”
• Hypothesis test often: Acc(M1) = Acc(M2) (or “not larger”)
• p-value < α (mostly 5%): what is the probability of this happening?

▪ p-Hacking: researchers collect or select data or statistical
analyses until non-significant results become significant
• Similar to data dredging: you keep changing the data until you
get the desired result

https://bitssblog.files.wordpress.com/2014/02/nelson-presentation.pdf

15
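“Keep changing the data until you get the desired result” can be simulated directly. The sketch below (scipy assumed; the peeking schedule is an arbitrary choice) shows that a researcher who re-tests after every few new data points and stops at the first p < 0.05 gets far more than 5% false positives, even when the null hypothesis is true.

```python
# Sketch: optional stopping ("keep testing, stop once p < 0.05") inflates
# the false-positive rate even when the null hypothesis is TRUE.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
false_positives = 0
n_sims = 2000

for _ in range(n_sims):
    data = list(rng.normal(0, 1, size=10))  # true mean is 0: null is true
    for _ in range(20):                     # peek after every 5 new points
        if ttest_1samp(data, 0).pvalue < 0.05:
            false_positives += 1            # "significant": stop and publish
            break
        data.extend(rng.normal(0, 1, size=5))

print("false-positive rate:", false_positives / n_sims)  # well above 0.05
```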
p-Hacking
▪ Suppose these results
[Fig. 5-10 from Provost and Fawcett’s book: accuracy per cross-validation fold]

▪ Quite some variance. How about creating another 10
random folds, and rerunning the experiments?
▪ How about replacing folds 3, 4, 5 and 8?
▪ How about keep replacing folds till the best test result is
obtained?
➢ This is changing the evaluation until you get the desired result
▪ Unethical! The test set exists to obtain an independent evaluation

16
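Re-drawing folds until the score looks good can also be simulated. A sketch (scikit-learn assumed; dataset and model are arbitrary stand-ins): the cherry-picked best split is, by construction, never worse than an honestly pre-registered one.

```python
# Sketch: re-drawing random CV folds until the score looks good biases the
# estimate upward. Dataset and model are arbitrary stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = []
for seed in range(50):  # 50 different random 10-fold splits
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=cv).mean())

print("honest estimate (first split):", scores[0])
print("cherry-picked best split:     ", max(scores))  # always looks better
```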
p-Hacking
➢ Keep adding data or getting rid of data until the result is desirable

17

p-Hacking
➢ Or use many measures, or create hundreds of models and show
only the best one

▪ Six ways of p-Hacking (similar for “data dredging”)


• Stop collecting data once p < 0.05
• Exclude data instances to get p < 0.05
• Transform the data to get p < 0.05
• Use covariates to get p < 0.05
• Analyze many measures, but report only those with p < 0.05
• Create and analyze many models, but only report those with
p < 0.05

▪ Why do many studies, if you can p-hack?


• Publications
• Fame and glory
• Potential catastrophe for scientific inference if we allow p-hacking
18
https://bitssblog.files.wordpress.com/2014/02/nelson-presentation.pdf
Multiple comparisons
▪ The more tests you do, the more likely you get the desired result,
but purely by chance
▪ The more tests, the higher the probability some rare event occurs
• If I throw 1000 1-euro coins, some will land clustered together and all heads
• Suppose I build m models and test whether the performance is
significantly better than random guessing, at the 5% level:
P(at least 1 significant result) = 1 – P(no significant result)
= 1 – (1 – 0.05)^m
➢ m = 1 ➔ P(at least 1 significant result) = 5%
➢ m = 10 ➔ P(at least 1 significant result) = 40%
➢ m = 20 ➔ P(at least 1 significant result) = 64%
➢ m = 100 ➔ P(at least 1 significant result) = 99%
➢ Doing many statistical tests increases the chance that one of
them comes out significant purely by chance
• If I test 1000 different Twitter-based measures, some will be correlated
with the stock price of Colruyt
▪ Bonferroni correction: method to correct for multiple comparisons
• For m hypotheses being tested, use α/m instead of α: divide α by
the number of tests instead of using the predetermined level on its own
➢ m = 10 ➔ P(at least 1 significant result) = 1 – (1 – 0.005)^10 ≈ 5%
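The slide’s arithmetic, transcribed into a few lines of Python (nothing assumed beyond the formula above; the last line previews Danny’s 500-model example on the next slide):

```python
# Family-wise error rate for m independent tests at level alpha,
# with and without the Bonferroni correction.
def p_at_least_one(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 10, 20, 100):
    print(m, round(p_at_least_one(m), 2))           # 0.05, 0.40, 0.64, 0.99

# Bonferroni: test each hypothesis at alpha/m to keep the family-wise rate ~alpha
print(round(p_at_least_one(10, alpha=0.05 / 10), 3))  # ~0.049

# Danny's 500 models at the 1% level: 1 - 0.99**500
print(round(p_at_least_one(500, alpha=0.01), 4))      # 0.9934
```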
19
Multiple comparisons
▪ The more tests, the higher the probability some rare event
occurs
▪ Danny built 500 models: “one of these prediction scores is
significantly better than using a baseline model of predicting
for the next day what the NASDAQ did the day before, even at
a 1% level”
• Of course!
• P(at least one significant result) = 1 – 0.99^500 = 99.34%

20
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel

21
Ethical Reporting (2)
▪ Report transparently: the good and the bad
• Report why you chose that sample size
• What variables were included, excluded and why?
• What models were tried, and why?
• What evaluation metrics were (not) used and why?
• Etc.
▪ Reproducibility is crucial in scientific research and industry
• Make the code publicly available (GitHub)
• Make the data publicly available (if possible), or use publicly
available data sources (UCI, Kaggle) as much as possible
• Similarly important in industry: for other data scientists
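A minimal sketch of what “reproducible” means in code: pin every source of randomness and record the environment next to the results. The file name and fields below are illustrative choices, not a standard.

```python
# Pin all randomness and record the environment alongside the results.
import json
import random
import sys

import numpy as np
import sklearn

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# For scikit-learn estimators, also pass random_state=SEED explicitly.

run_metadata = {
    "seed": SEED,
    "python": sys.version,
    "numpy": np.__version__,
    "sklearn": sklearn.__version__,
}
with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)  # commit this next to the results
```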

▪ One approach: Google’s Model Card

22


Model Card

Mitchell, Margaret, Wu, Simone, Zaldivar, Andrew, Barnes, Parker,
Vasserman, Lucy, Hutchinson, Ben, Spitzer, Elena, Raji, Inioluwa Deborah,
and Gebru, Timnit (2019, Jan). Model cards for model reporting.
Proceedings of the Conference on Fairness, Accountability, and Transparency.

23
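As an illustration of what such reporting covers, here is a skeleton with the nine section headings proposed by Mitchell et al. (2019); every field value is a hypothetical placeholder, not a prescribed format.

```python
# Illustrative model-card skeleton following Mitchell et al. (2019).
# All field values are hypothetical placeholders.
model_card = {
    "model_details": "Logistic regression churn model, v1.2, trained 2024-01",
    "intended_use": "Rank customers for retention offers; not for credit decisions",
    "factors": "Performance reported per age group and gender",
    "metrics": "AUC and lift at 10%, with confidence intervals",
    "evaluation_data": "Out-of-time holdout (Q4); source and preprocessing documented",
    "training_data": "12 months of CRM data; personal identifiers removed",
    "quantitative_analyses": "Disaggregated AUC per subgroup",
    "ethical_considerations": "Behavioural data used; no protected attributes as inputs",
    "caveats_and_recommendations": "Retrain quarterly; scores degrade after 6 months",
}
```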
Case 1: Twitter to predict stock market

▪ “Here we investigate whether measurements of collective mood states derived from large-
scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA)
over time”
▪ “Our results indicate that the accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others. We find an accuracy of
87.6% in predicting the daily up and down changes in the closing values of the DJIA and a
reduction of the Mean Average Percentage Error by more than 6%”
▪ Reported on by The Telegraph, The Daily Mail, USA Today, The Atlantic, Wired Magazine,
Time Magazine, CNBC, CNN

24
Case 1: Too good to be true?
▪ “The junk science behind the ‘Twitter Hedge Fund’”
▪ 49 statistical tests conducted (reported): “The authors make a much
more basic statistical blunder, that of not correcting for multiple
hypothesis testing.”
▪ “accuracy of 87.6% in predicting the daily up and down changes in the
closing values of the DJIA” (Bollen et al., 2010)
• Based on only 15 days of test data: a very small test set
• They got the number wrong: 13/15 = 86.7%
• Best result chosen from 8 models

▪ “The hedge fund set up in late 2010 to implement the Twitter mood
strategy, Derwent Capital Markets, failed and closed in early 2012.”

https://sellthenews.tumblr.com/post/21067996377/noitdoesnot
https://econjwatch.org/articles/shy-of-the-character-limit-twitter-mood-predicts-the-stock-market-revisited

25
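The critique is easy to reproduce numerically. A sketch (scipy assumed; the 8-models and 49-tests counts come from the slide, and the best-of-8 calculation treats models as independent, which is an approximation):

```python
# Sketch: how significant is 13/15 correct, and what do 8 models / 49 tests
# do to that significance?
from scipy.stats import binom

p_single = binom.sf(12, 15, 0.5)         # P(>=13/15) under pure chance
print("single-test p-value:", p_single)  # ~0.0037: looks impressive...

# ...but the reported accuracy was the best of 8 models. Chance that at
# least one of 8 chance-level models hits >=13/15 (independence assumed):
print("best-of-8 by chance:", 1 - (1 - p_single) ** 8)  # ~0.029

# And across 49 hypothesis tests at alpha = 0.05, without correction:
print("P(>=1 'significant' of 49):", 1 - 0.95 ** 49)    # ~0.92
```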
Case 2

26
Ethical Academic Reporting
▪ Lack of reproducibility is the writing on the wall
▪ Research is often not reproducible
▪ A 2012 Reuters and Science story looked at basic cancer
research
• Begley, head of global cancer research at the biotech company Amgen
➢ Aimed to reproduce 53 published “landmark” studies
➢ Could only replicate 6; the other 47 (89%) could not be replicated
➢ Reached out to the authors: “We went through the paper line by
line, figure by figure. I explained that we re-did their experiment 50
times and never got their result. He said they’d done it six times
and got this result once, but put it in the paper because it made
the best story.”
▪ A 2016 Nature survey among 1,576 researchers: 70% have
tried and failed to reproduce another group’s experiments

27
Lecture 9
▪ Ethical Measurement
▪ Ethical Interpretation of the Results
▪ Ethical Reporting
▪ Cautionary Tale of Diederik Stapel

28
Diederik Stapel
▪ Well-respected professor in social psychology and dean at
Tilburg University, who won several awards
▪ Much of his research included survey experiments
• E.g., “meat eaters are more selfish and less social than
vegetarians”
▪ Much of it was fabricated: he made up experiments, entered the
data himself, and gave the data to PhD students
▪ 50+ papers retracted
▪ Later published a book: driven by power and the desire to
be honoured, as well as the messiness of experimental
results

29
30
https://nl.wikipedia.org/wiki/Marc_Hooghe
31
Conclusion
▪ Report correctly; think of the FAT Flow
▪ Don’t be tempted: you don’t want to appear in the media
with your unethical data science practices (cf. Danny or
Stapel)

32
Presentation and Paper Ideas
▪ How many stock price prediction papers do the data science
right?
▪ P-hacking cases
▪ Other fallen academics and retracted papers
▪ Unethical reporting in finance, advertising, etc.

33
