5th Unit Data Science 24
Ethical Issue
5 ethical questions in data science, amid the growing concern of its ethical use
by organisations.
With the rapid growth in data science, there has been a growing concern around its
ethical use by organizations. For example, concerns such as the following have arisen:
Data science algorithms are used to approve or deny bank loans and to set
insurance premiums. However, the question arises: What is the social cost
of a wrong decision on a bank loan or an insurance policy?
Companies use data science to scan resumes and recommend the best
candidate for a role. However, the question arises: How likely is a
bias towards gender or age in the hiring algorithm if that algorithm is based
on past data?
1. Unfair discrimination
The incorrect and unchecked use of data science can lead to unfair discrimination
against individuals based on their gender, demographics and socio-economic
conditions.
‘If you have really large data sets, you might not even realize that the data are
slightly biased towards gender or whatever you’re analyzing … It might be that
you’ve overtrained on those characteristics.’
Gartner (‘Gartner Says Nearly Half of CIOs Are Planning to Deploy Artificial
Intelligence’, 2020) predicted that by 2022, 85 percent of data science projects would
deliver erroneous outcomes due to bias in data, algorithms or the teams responsible
for managing them.
2. Reinforcing human biases
Data science algorithms use past data to predict future outcomes. Those data were
generated by human decisions made in the past, so training an algorithm
purely on past data can carry the biases in those decisions over into the
algorithm.
Algorithms are also influenced by analysts’ biases, as they may choose data and
hypotheses that seem important to them.
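One simple way to look for this kind of inherited bias is to compare how often each group received a favourable decision in the historical data before training on it. The sketch below is only an illustration under assumptions: the function names, the group labels and the (group, accepted) record layout are hypothetical and do not come from the notes. A ratio well below 1.0 does not prove discrimination, but it signals that the past decisions deserve a closer look.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Return the acceptance rate per group.

    `decisions` is an iterable of (group, accepted) pairs, where
    `accepted` is True or False. This layout is an assumption made
    for the example.
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        accepted[group] += int(ok)
    return {group: accepted[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest to the highest acceptance rate (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was accepted, so the rates are trivially equal
    return min(rates.values()) / highest


# Hypothetical past hiring decisions: group A accepted 6 of 10, group B 3 of 10.
history = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 3 + [("B", False)] * 7
)

rates = selection_rates(history)
print(rates)
print(rate_ratio(rates))
```

A model trained only on `history` would tend to learn the same gap between the groups, which is the concern described above.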
3. Lack of transparency
Data science algorithms can sometimes be a black box where the model predicts an
outcome but does not explain the rationale behind the result.
Numerous recent machine learning algorithms fall into this category. With black
box solutions, it is not easy for a business to understand and explain the reason for
a business decision.
As Andrews notes, ‘Whether an AI system produces the right answer is not the
only concern… Executives need to understand why it is effective and offer insights
into its reasoning when it’s not.’
4. Privacy
Data privacy has become a major focus in the past few years. Sensitive data are
stored by various organisations and are subject to hacking and misuse.
During the 2016 United States presidential election, Cambridge Analytica, a data
analytics firm that worked on Donald Trump’s election campaign, used Facebook
data to influence voters’ behaviour in the US election.
There has been an increase in data breaches across the world. Rules and
regulations, such as the General Data Protection Regulation (GDPR), have been
introduced to monitor the way companies store and use sensitive data.
5. Consent and Power
Organisations are often not transparent about what data they collect or how they use
it to make decisions. Most web browsers and websites capture enormous amounts of
user data without the users’ knowledge or consent.
For example, Google (Chrome and Gmail) and Facebook store individual browsing
data and monetise it by selling insights from users’ data for advertising.
The human side of analytics is the biggest challenge to implementing big data
The term “Data Science” was created in the early 1960s to describe a new
profession that would support the understanding and interpretation of the large
amounts of data that were being amassed at the time.
(At the time, there was no way of predicting the truly massive amounts of data
that would accumulate over the next fifty years.)
While Data Science is used in areas such as astronomy and medicine, it is also
used in business to help make smarter decisions.
Data Science started with statistics and has evolved to include concepts/practices
such as artificial intelligence, machine learning, and the Internet of Things, to
name a few.
As more and more data has become available, first by way of recorded shopping
behaviors and trends, businesses have been collecting and storing it in ever greater
amounts. With the growth of the Internet, the Internet of Things, and the
exponential growth of data volumes available to enterprises, there has been a flood
of new information or big data.
In 1962, John Tukey wrote a paper titled The Future of Data Analysis and
described a shift in the world of statistics, saying, “… as I have watched
mathematical statistics evolve, I have had cause to wonder and to doubt…I have
come to feel that my central interest is in data analysis…”
In 1974, Peter Naur authored the Concise Survey of Computer Methods, using the
term “Data Science” repeatedly. Naur presented his own convoluted definition of
the new concept:
“The usefulness of data and data processes derives from their application in
building and handling models of reality.”
In 2015, Bloomberg’s Jack Clark wrote that it had been a landmark year for
artificial intelligence (AI). Within Google, the number of software projects using AI
increased from “sporadic usage” to more than 2,700 projects over the year.
In the past 30 years, Data Science has quietly grown to include businesses
and organizations worldwide. It is now being used by governments, geneticists,
engineers, and even astronomers. During its evolution, Data Science’s use of big
data was not simply a “scaling up” of the data, but included shifting to new
systems for processing data and the ways data gets studied and analyzed.
As they delve into analytics across the business, data leaders have a front row seat
to nearly every operation and function. This provides them with a unique vantage
point for both solving problems and identifying new ones. In a hotel example,
management found that guests who rated check-in poorly were less likely to return
to the hotel. An employee then suggested they look at customer surveys that had
been collected on a rolling basis. Natural language text analytics teased out
some themes.
The Takeaway: Solving the problem that is in front of you can mean missing out
on opportunities to help the business improve in other ways. Those who work with
data often have access to deep, unique insights into numerous aspects of the
business. To become adept at problem-spotting, data leaders need to embrace that
big-picture view and gain deeper insights, with greater transparency around what
matters most to business leaders. In this way, data leaders can add value by
identifying problems that otherwise escape notice.
Once a problem has been spotted, the next step is determining its scope — that is,
gaining clarity into the nature of the problem and how analytics can help solve it.
This is especially important if a business leader has approached the data team with
a vague concern or challenge.
Such a request might sound like this:
“It could be a pipeline issue, but we just don’t have alignment. I think we’re playing
in the right sandboxes, now we just need to know the who and the why. Sound
good?”
Once the problem is identified and scoped out, many data analysts go into isolation
and only emerge when they have found a solution. This approach is highly
problematic. To be most effective, the process requires a great deal of information sharing.
This approach runs counter to how some data scientists prefer to work. Sometimes
they get enamored with their models and their creative problem-solving
techniques, and they can’t wait for the big reveal. Surprising results often prompt
people to start questioning the underlying data and methods.
However, when data scientists bring the business team into decision-making along
the way, the team is more likely to buy into the results and trust them.
At this point, we transition from problem to solution, the success of which depends
on how well data leaders and their teams have executed on the first three steps.
More than determining a final answer, the data team must also deliver a solution
that’s understandable and, therefore, actionable.
This isn’t just about putting the data in a chart or another visual display. The
solution must be conveyed in language the business team can understand. One tool I’ve
recommended is the two-page data analytics memo, which highlights the most
important elements of the problem to be solved.
The two-page limit can avoid the temptation to go on and on about details of the
data analysis and encourage focus on the recommendations being made and the
evidence for them.
The Takeaway: Solution translation requires data leaders to step back and
consider how to make the most impact with their analyses and recommendations.
By using simple language, while not compromising the complexity, data leaders
who excel in this area can deliver the equivalent of an elevator speech to engage
business leaders with compelling and understandable solutions.
Teaching Notes for Data Science Essentials
Privacy
Anonymization Techniques
Replace identifiers with randomly generated identifiers
    Eg: “Jane Krakowski” -> “Patient6479”
Abstraction: Replace values by ranges
    Eg: Check-in date: 3/1/16 -> Check-in date: Spring 2016
    Eg: Replace zip code by state
Cluster data points and replace individuals by their cluster centroid
    Eg: Ages 21, 25, 28, 27, 18 -> 5 individuals with nominal age of 24
Remove values
    Eg: Omit birth date
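The four techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a production anonymizer: the function names, the "Patient" prefix with a four-digit number, and the month-to-season mapping (meteorological seasons, Northern Hemisphere) are assumptions chosen to mirror the examples in these notes.

```python
import secrets
from datetime import date
from statistics import mean


def pseudonymize(names, prefix="Patient"):
    """Replace each distinct name with a randomly generated identifier."""
    mapping = {}
    used = set()
    for name in names:
        if name in mapping:
            continue
        # Draw random 4-digit numbers until an unused one is found
        # (fine for a small example; at most 9000 distinct names).
        while True:
            candidate = f"{prefix}{secrets.randbelow(9000) + 1000}"
            if candidate not in used:
                break
        used.add(candidate)
        mapping[name] = candidate
    return mapping


def season_of(day):
    """Abstraction: replace an exact date with a season and year."""
    seasons = ["Winter", "Spring", "Summer", "Fall"]
    # Dec-Feb -> Winter, Mar-May -> Spring, Jun-Aug -> Summer, Sep-Nov -> Fall
    return f"{seasons[(day.month % 12) // 3]} {day.year}"


def centroid_age(ages):
    """Clustering: replace every age in a cluster by the rounded cluster mean."""
    return round(mean(ages))


def drop_fields(record, fields):
    """Removal: return a copy of the record without the listed fields."""
    return {key: value for key, value in record.items() if key not in fields}


print(pseudonymize(["Jane Krakowski"]))       # identifier is random on each run
print(season_of(date(2016, 3, 1)))            # Spring 2016
print(centroid_age([21, 25, 28, 27, 18]))     # 24
print(drop_fields({"name": "Patient6479", "birth_date": "1990-01-01"}, {"birth_date"}))
```

Note that the pseudonym mapping itself is sensitive: anyone who holds it can reverse the replacement, so it has to be stored separately from the released data or discarded.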
Privacy-Aware Workflows
P1: No personal ID information can leave the data source
P2: Sensitive data must be k-anonymized
[Figure: a distributed workflow compliant with the policies (steps: Anonymization,
Abstraction, Analysis; locations Loc1 .. Loc n) contrasted with a centralized workflow
not compliant with the policies (steps: Aggregation, Analysis; locations Loc1, Loc2,
Loc3). Credit: Yolanda Gil]
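Policy P2 can be checked mechanically. A data set is k-anonymous when every combination of quasi-identifier values (attributes that could be linked to a person, such as age range and state) is shared by at least k records. The sketch below assumes records are plain dictionaries; the field names and sample values are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already abstracted records (age ranges instead of ages,
# states instead of zip codes).
records = [
    {"age_range": "20-29", "state": "CA", "diagnosis": "flu"},
    {"age_range": "20-29", "state": "CA", "diagnosis": "asthma"},
    {"age_range": "30-39", "state": "NY", "diagnosis": "flu"},
]

# The single ("30-39", "NY") record stands alone, so the data is not 2-anonymous.
print(is_k_anonymous(records, ["age_range", "state"], k=2))  # False
print(is_k_anonymous(records, ["age_range", "state"], k=1))  # True
```

In a distributed workflow, each location could run a check like this locally and release only data that passes, so that no personal ID information leaves the source.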
Reproducibility
There is zero privacy anyway, get over it
Although you can upload your data using a pseudonym, there is no way to anonymously submit
data. Statistically speaking, it is really unlikely that your medical and genetic information
matches that of someone else. By uploading, you disclose information not only about yourself
but also about your next of kin (parents and siblings), who share half of their genome with you.
Before uploading any genetic data, you should make sure that those people approve of you doing
so. This is especially important if you have a monozygotic twin, who shares all of your genome!