Strategic Data Science
Creating Value from Data, Big and Small
Peter Prevos
This book is for sale at http://leanpub.com/strategic_data_science
Preface
I am not a data scientist. These might be strange words to open a
book about data science, so please allow me to explain. This book is
the result of a twenty-five-year career in civil engineering, building
and managing structures in Europe, Africa, Asia and Australia.
Most of my tasks involve managing and analysing large amounts
of data. Cost estimates, volume calculations, modelling river flows,
structural calculations, Monte Carlo simulations and many other
types of number crunching are integral to my work as an engineer.
The raw text of this book and R code for the data visualisations is
available on GitHub. I encourage anyone who discovers mistakes
or who would like to enhance the information in these pages to
contact me.
The LeanPub publishing system provides flexible opportunities
to publish new versions when updated information is available.
Anyone purchasing this book through LeanPub can register to be
informed of future editions.
The idea that data can be used to understand the world is thus
almost as old as humanity itself and has gradually evolved into
what we now call data science. We can use some basic data science
to review the development of this term over time.
Figure 2: Frequency of the bi-gram 'data science' in literature and Google searches,
as a percentage of the highest occurrence.
³Davis, P.J., & Hersh, R. (1990). Descartes’ Dream. The World According to
Mathematics. London: Penguin.
The combination of the words data and science might seem rel-
atively new, but the Google N-gram Viewer shows that this ‘bi-
gram’ has been in use since the middle of the last century. An n-
gram is a sequence of words, with a bi-gram being any combination
of two words. Google’s n-gram viewer is a searchable database of
millions of scanned books published before 2008. This database is
a source for predictive text algorithms as it contains a fantastic
amount of knowledge about how people use various languages.⁴
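The idea is simple enough to demonstrate in a few lines of R, the
language used for the visualisations in this book. This minimal
sketch splits a sentence into words and pastes consecutive pairs
together to form bi-grams:

    # Split a sentence into words and form all bi-grams
    sentence <- "data science creates value from data"
    words <- strsplit(tolower(sentence), "\\s+")[[1]]
    bigrams <- paste(head(words, -1), tail(words, -1))
    bigrams
    # "data science" "science creates" "creates value" "value from" "from data"

Counting such pairs across millions of books yields the frequencies
that the N-gram Viewer visualises.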
The n-gram database shows that the term data science emerged in
the middle of the last century when electronic computation became
a topic of study. In those days, the discipline was a science about
storing and manipulating data. The current definition has drifted
away from this initial academic activity towards a business activity.
Figure 2 visualises these two trends. The horizontal axis shows the
years from 1960 until recently. The vertical axis shows the relative
number of occurrences compared to the maximum, which is how
Google reports search numbers. In an absolute sense, the number of
occurrences in books was much lower than current search volumes.
While attention has risen steeply since 2012, the term started its
journey towards becoming a business buzzword in the 1960s. Although
we can speak of a recent hype, data science is in essence a slow
evolution with a recent spike in interest.
There are some signals that the excitement of the past few years is
waning. Data science blogger Matt Tucker has declared the death
of the data scientist.⁸ For many business problems, the hardcore
methods of machine learning result in over-analysing the issue.
Tucker provides an anecdote of a group of data scientists who
spent a lot of time fine-tuning a complex neural network. The
data scientists gave up when a young graduate with subject-matter
expertise produced a linear regression that was more accurate
than their neural network.
This book looks at data science as the strategic and systematic ap-
proach to the fine art of analysing data to solve business problems.
This conceptualisation of data science is not a complete definition.
Computational analysis of data is also practised as a science in itself
⁸Tucker, M. (2018). The Death of the Data Scientist. Retrieved 9 February 2019
from Data Science Central.
reality.¹¹
Firstly, businesses have much more data available than ever be-
fore. The move to electronic transactions means that almost every
process leaves a digital footprint. Collecting and storing this data
has become exponentially cheaper than in the days of pencil and
paper. Many organisations collect this data without maximising the
value they extract from it. After the data is used for its intended
purpose, it becomes ‘dark data’, stored on servers but languishing
in obscurity. This data provides opportunities to optimise how an
organisation operates by recycling and analysing it to learn about
the past to create a better future.
and cleaning data to prepare it for the next step of the data science
process.
Conway defines the danger zone as the area where domain knowl-
edge and computing skills combine, without a good grounding in
mathematics. Somebody might have sufficient computing skills to
push buttons on a business intelligence platform or spreadsheet.
The user-friendliness of some analysis platforms can be
detrimental to the outcomes of the analysis because they create
the illusion of accuracy. Point-and-click analysis hides the inner
workings from the user, creating a black-box result. Although the
data might be perfectly structured, valid and reliable, a wrongly-
applied analytical method leads to useless outcomes.
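Anscombe's quartet, included in R as the anscombe data set, is the
classic demonstration of this illusion of accuracy. Its four pairs of
variables share almost identical summary statistics but have radically
different shapes, which only a plot reveals:

    # Four x-y pairs with near-identical summary statistics
    sapply(anscombe, mean)         # all x means are 9, all y means about 7.5
    cor(anscombe$x1, anscombe$y1)  # about 0.82 for every one of the four pairs
    # Only visualising the data exposes the differences
    plot(anscombe$x2, anscombe$y2) # a smooth curve, not a linear trend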
Data scientists have since proposed more complex models, but they
all originate with Conway's basic idea. A quick search on the
DuckDuckGo search engine reveals several variants.
It could be argued that the so-called soft skills are missing from this
picture. However, communication, managing people, facilitating
change and so on are competencies that belong to every professional
who works in a complex environment, not just the data scientist.
Some critics of this idea point out that these people are unicorns.
Data scientists that possess all these skills are mythical employees
that don’t exist in the real world. Most data scientists start from
either mathematics or computer science, after which it is hard to
become a domain expert. This book is written from the point of
view that we can breed unicorns by teaching domain experts how to
write code and, where required, enhance their mathematical skills.
The analogy between data and oil is only partially correct because
data is a renewable resource. The same data can be used many
times, sometimes for purposes that were never originally intended. The ability
to use data for more than one goal is one of the reasons data
science has gained popularity around board tables. Senior managers
are seeking ways to extract value from so-called ‘dark data’. Data
scientists use these forgotten data sources to create new knowledge,
make better decisions and spawn innovations.
Buildings must have utility so they can be used for their intended
purpose. A house needs to be functional and comfortable; a theatre
needs to be designed so that everybody can see the stage. Each type
of building has its functional requirements. Secondly, buildings
must be sound in that they are firm enough to withstand the forces
that act upon them. Last but not least, buildings need to be aesthetic.
In the words of Vitruvius, buildings need to look like Venus, the
Roman goddess of beauty and seduction.
The Vitruvian rules for architecture can also be applied to the
products of data science.¹ Great data science needs to have utility;
¹Lankow, J., Ritchie, J., & Crooks, R. (2012). Infographics: The Power of Visual
Storytelling. Hoboken, N.J: John Wiley & Sons, Inc.
2.2.1 Reality
The complex relationship between the data and the reality it seeks
to improve emphasises the need for subject-matter expertise about
the problem under consideration. Data should never be seen as
merely an abstract series of numbers or a corpus of text and images,
but should always be interpreted in its context to do justice to the
reality it describes.
2.2.2 Data
Data is the main ingredient of data science, but not all data sources
provide the same opportunities to create useful data products.
The quality and quantity of the data determine its value to the
organisation. This mechanism is just another way of stating the
classic Garbage-In-Garbage-Out (GIGO) principle. This principle
derives from the fact that computers blindly follow instructions,
irrespective of truthfulness, usefulness or ethical consequences of
the outcomes. An amazing algorithm with low-quality or insufficient
data will be unable to deliver value to an organisation. On the
other hand, great data with an invalid algorithm will also result in
garbage, instead of valuable information.
The quality of data relates to the measurement processes used to
collect the information and the relationship of this process to the
reality it describes. The quality of the data and the outcome of the
analysis are expressed in their validity and reliability. Section 2.3
discusses the soundness of data and information in more detail.
The next step is to decide how much data to collect. Determining
the appropriate amount of data is a balancing act between the cost
of collecting, storing and analysing the data and the potential
usefulness of the outcome. In some instances, the cost of collecting
the required data can outweigh the benefits it provides.
The recent steep reduction in the cost of collecting and storing data
seems to render the need to be selective in data gathering a moot
point. Data scientists might claim that we should collect everything
because a time machine is much more expensive than collecting
and storing more data than strictly necessary. Not measuring parts
of a process has an opportunity cost because future benefits might
not be realised. This opportunity cost needs to be balanced with the
estimated cost of collecting and storing data.
This revolution in data gathering is mainly related to physical
measurements and the so-called Internet of Things, mobile phones
2.2.3 Information
2.2.4 Knowledge
The last and most important part of this data science model is the
feedback loop from knowledge back to reality. The arrow signifies
2.3.1 Validity
The validity of a data set and the information derived from it relates
to the extent to which the data matches the reality it describes. The
validity of data and information depends on how this information
was collected and how it was analysed.
For physical measurements, validity relates to the technology used
to measure the world and is determined by physics or chemistry. If,
for example, our variable of interest is temperature, we use the fact
that materials expand, or their electrical resistance increases, when
the temperature rises. Measuring length relies on comparing it with
a calibrated unit or the time it takes light to travel in a vacuum. Each
type of physical measurement uses a physical or chemical process,
and the laws of nature define validity. When measuring pH, for
2.3.2 Reliability
Bennett and Baird with the Ig Nobel Prize. This annual prize
celebrates unusual and trivial achievements in science.⁷
2.3.3 Reproducibility
2.3.4 Governance
The same principle also applies to managing data. Each data source
in an organisation needs to have an owner and a custodian who
understands the relationship between this data and the reality from
which it is extracted. Large organisations have formal processes
that ensure each data set is governed and that all employees use a
single source of truth to safeguard the soundness of data products.
2.4.1 Visualisation
straight lines and primary colours. Using art to explain data visual-
isation is not an accidental metaphor because visual art represents
how the artist perceives reality. This comparison between Pollock
and Mondrian is not a judgement of their artistic abilities. For
Pollock, reality was chaotic and messy, while Mondrian saw a
geometric order behind the perceived world.
Although visualising data has some parallels with art, it is very
different. All art is basically a form of deception. The artist paints
a three-dimensional image on a flat canvas, and although we see
people, we are just looking at blobs of paint. Data visualisation as an
art form needs to be truthful and not deceive, either intentionally
or accidentally. The purpose of any visualisation is to validly and
reliably reflect reality.
Aesthetic data science is not so much an art as it is a craft. Following
some basic rules will prevent confusing the consumers of data
products. Firstly, a visualisation needs to have a straightforward
narrative. Secondly, visualising data should be as simple as possible,
minimising elements that don’t add to the story.
2.4.1.1 Storytelling
Now that we are in the paperless era, we can use the data-pixel
ratio as a generic measure for the aesthetics of visualisations.
The principle is the same as in the analogue days. Remove any
redundant information in your visualisation. Unnecessary lines,
multiple colours or multiple narratives risk confusing the user of
the report.
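In R, the spirit of this rule translates into choosing a sparse theme
and dropping decoration. A minimal sketch with the ggplot2 package,
using hypothetical monthly sales figures:

    library(ggplot2)
    # Hypothetical data: twelve months of sales volumes
    sales <- data.frame(month = 1:12,
                        volume = c(42, 45, 50, 48, 55, 60,
                                   58, 62, 65, 63, 70, 75))
    ggplot(sales, aes(month, volume)) +
      geom_line() +
      theme_minimal()  # removes the grey panel and other redundant pixels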
schemes are designed for choropleth maps, but can also be used for
non-spatial visualisations.¹⁰ The Color Brewer system consists of
three types of colour palettes: sequential, diverging and qualitative
(Figure 10).
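The Color Brewer palettes are available in R through the RColorBrewer
package. A short sketch showing one palette of each type:

    library(RColorBrewer)
    brewer.pal(5, "Blues")   # sequential: ordered data such as concentrations
    brewer.pal(5, "RdYlGn")  # diverging: deviations from a neutral midpoint
    brewer.pal(5, "Set2")    # qualitative: unordered categories
    display.brewer.all()     # preview every available palette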
2.4.2 Reports
use it. This knowledge improves the reality from which the data
was extracted. The relationship between reality and data is critical.
Data science is sound when it delivers valid and reliable results and
can be reviewed by other experts.
The next chapter delves into the strategic aspects of data science by
presenting a five-phase framework that organisations can follow to
enhance the value they extract from data.
3. Strategic Data Science
Managers often claim to be “data rich but information poor”. This
statement is in many cases only partially correct because it hides a
misconception about the life cycle of data. Being rich in data but
poor in information suggests that previously untapped data sources
are waiting to be mined and used.
It is highly unlikely that any organisation collects data for no partic-
ular purpose. Data is in most cases collected to manage operational
processes. Collecting data without purpose is a waste of resources.
After the data is used, it is stored and becomes ‘dark data’. Because
almost all business processes are recorded electronically, data is
now everywhere. Managers rightly ask themselves what to do with
this information after it is archived. A strategic approach to data
science helps an organisation to unlock the unrealised value of these
stores of data to better understand their strategic and operational
context.
Whereas the framework in the previous chapter is normative and
defines ideal data science, the model in this chapter describes
the path that an organisation can take to increase the value they
extract from data. The data science continuum is a hierarchical five-
phase framework towards becoming a data-driven organisation.
This chapter discusses the phases of this continuum as a strategy
map for data science.
The salient point is that data is rarely used for the purpose for
which it was originally collected. This means that all data needs to
be transformed into a format that suits the objectives of the project.
Cleaning data is the least exciting activity of data science, but it
often consumes much of the available resources. The informal terms
used in the industry for cleaning data are wrangling, munging or
data jujitsu. These less-than-flattering terms illustrate the effort
and frustration associated with cleaning data.
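What this wrangling looks like depends on the source, but a minimal
sketch in base R gives the flavour. The data frame below is
hypothetical: a raw extract with text where numbers should be, a
duplicated date and a missing value:

    # Hypothetical raw extract from an operational system
    raw <- data.frame(date = c("2019-01-02", "2019-01-03", "2019-01-03"),
                      flow = c("12.3", "n/a", "15.1"),
                      stringsAsFactors = FALSE)
    raw$date <- as.Date(raw$date)                       # text to dates
    raw$flow <- suppressWarnings(as.numeric(raw$flow))  # "n/a" becomes NA
    raw <- raw[complete.cases(raw), ]                   # drop incomplete records
    clean <- raw[!duplicated(raw$date), ]               # one record per day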
Lastly, big data has a high level of veracity. Traditional data sources
about human behaviour ask people what they feel or what they
might do in the future. This approach has a low level of veracity
because it measures behaviour indirectly. Big data has a high level
of reliability because it measures actual conduct.
The first step in any analytical process is to describe the data under
consideration. Descriptive statistics are methods, such as the aver-
age, median, range, standard deviation, and so on, that summarise
a data set. Business intelligence and exploratory analysis are the
most common uses of descriptive statistics.
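In R, these summaries need nothing beyond built-in functions. A
quick sketch using the airquality data set that ships with R:

    # Summarise the daily ozone readings in the built-in airquality data
    mean(airquality$Ozone, na.rm = TRUE)    # average
    median(airquality$Ozone, na.rm = TRUE)  # median
    range(airquality$Ozone, na.rm = TRUE)   # minimum and maximum
    sd(airquality$Ozone, na.rm = TRUE)      # standard deviation
    summary(airquality$Ozone)               # five-number summary plus the mean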
These are wise words, but measuring too many variables often leads
to confusion instead of enlightenment. It seems that the managers
who quote these words did not read the whole book. Welch also
moderates his insistence on measurement and writes: "Too often
we measure everything and understand nothing". Business reporting,
it seems, hides a paradox.
3.4 Diagnostics
the case, then managers adjust current practice to create the desired
future. If a retail store can predict whether consumers are likely to
purchase a particular product because of their lifestyle preferences,
then they can target them in their marketing communication to
rattle their cage. If an engineer can predict that a piece of equipment
is likely to fail soon, then it can be replaced to avoid problems.
Predictive analysis uses information about the past and applies logic
to determine a possible future state of the variables in the model.
A water utility can, for example, develop a predictive model to
estimate the amount of water that their customers use on a given
day using their property size, affluence, weather and so on. A
retailer could develop a predictive model of the amount of store
traffic as a function of advertising, public holidays, sales, location
and whatever else might be predictive of this behaviour.
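As a sketch of the water utility example, the code below fits a linear
model to invented data; the variable names and numbers are
hypothetical:

    # Hypothetical daily water consumption data
    water <- data.frame(consumption = c(412, 380, 455, 512, 398, 430),
                        property_size = c(650, 520, 710, 890, 540, 600),
                        max_temp = c(24, 19, 28, 33, 21, 25))
    # Model consumption as a function of property size and temperature
    model <- lm(consumption ~ property_size + max_temp, data = water)
    # Predict consumption for a 700 square metre property on a 30-degree day
    predict(model, newdata = data.frame(property_size = 700, max_temp = 30))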
All reliable methods to predict the future use information from the
past, combined with a specified logic, to estimate what could happen
next. This future only eventuates when all boundary conditions
under which the prediction was cast remain the same. No mathematical
model can predict the future; it can only provide insight
into possible futures. The purpose of predicting these possible
futures is to determine the best course of action to create the desired
future.
Once these theories are accepted and used, further observations will
raise new questions, and the same process is followed again. The
critical aspect to realise is that the insight of the researcher guides
this process. Formulating a reasonable hypothesis requires a deep
understanding of the subject matter under consideration.
The last step in the continuum is where machines take over our
world, and human beings can relax and be served by their robot
slaves. This vision might seem science fiction, but automated
processes have been part of our lives for many decades.
Industrial systems need to be controlled by operators to ensure
they produce the outcomes we need. When, for example, a tank of
chemicals reaches a certain low level, a pump needs to be started
to fill it again, which means an operator has to monitor the tank.
Manual systems require a human operator to review measurements
of the state of the system and take appropriate action. The first
level of automation uses a control loop that measures the level of
the tank and operates an automated valve.
Almost all contemporary manufacturing processes use these
first-generation control loops. The problem with these systems is
that they rely on preset conditions that might not be suitable when
external circumstances change. Traditional control systems measure
the present and take action when the measured value starts to trend
towards a set value. In effect, a control system consists of a network
of if-then statements to manage all the standard conditions. When
the environment deviates from the boundary conditions within
which the system was designed to operate, the system will fail.
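The tank example reduces to a handful of if-then rules. A minimal
sketch of such a first-generation control loop as an R function, with
arbitrary threshold values:

    # First-generation control loop: preset if-then rules
    pump_state <- function(level, low = 20, high = 80) {
      if (level < low) return("on")    # tank nearly empty: start the pump
      if (level > high) return("off")  # tank nearly full: stop the pump
      "unchanged"                      # between thresholds: leave the pump as is
    }
    pump_state(15)  # "on"
    pump_state(95)  # "off"

Such rules only work within their preset boundaries, which is
precisely the limitation described above.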
Decision makers sometimes ignore even the most useful and aes-
thetic visualisations, even when the analysis is sound. Best practice
data science, as described in chapter two, is only the starting
point for creating a value-driven organisation. A critical aspect of
ensuring that managers use the results is to foster a data-driven
culture, which requires managing people.
next three sections discuss these aspects of data science. This book
closes with some deliberations about the theoretical and ethical
limitations of data science.
4.1 People
Data science does not happen exclusively within the specialised
team. Each data project has an internal or external customer with
a problem in need of an answer. The data science team and the
users of their products work together to improve the organisation.
This implies that a data scientist needs to understand the basic
principles of organisational behaviour and change management
and be a good communicator. Conversely, the recipients of data
science need to have sufficient data literacy to understand how to
interpret and use the results.
Data science consumers are not only the colleagues of the data
scientist or the clients of a consultant, but they are also the con-
sumers of the products and services of the organisation. Businesses
and government agencies love to publish statistics to show their
customers or the community how awesome they are. Communi-
cating accurate statistics is an effective way to foster trust between
the organisation and its stakeholders. This method only works,
however, when the targeted people have sufficient data literacy to
understand the information. Most data communication is produced
⁴Australian Public Service Commission. (2018). Data literacy skills. Retrieved 2
March 2019, from apsc.gov.au.
The data revolution has made some organisations realise that they
are data rich but information poor. Managers in these organisations
realise that they hold large amounts of data that is only used once.
The popularity of data science is to a large extent motivated by a
desire to use this dark data and become more data-driven.
The most complicated aspect of implementing a data science strat-
egy is to integrate the results of our analysis with everyday
business activities. A data-driven organisation is one where using
information to solve problems forms part of its culture. On the
surface, this can be achieved by updating the existing operating
procedure to include analysing data, but there is also a strong
human component to this transition.
⁵Carretero, S., Vuorikari, R., Punie, Y., European Commission, & Joint Research
Centre. (2017). DigComp 2.1 The Digital Competence Framework for Citizens with
Eight Proficiency Levels and Examples of Use.
4.2 Systems
Just like any other profession, a data scientist needs a suitable set of
tools to create value from data. A plethora of data science solutions
is available on the market, many of which are open source software.
Specialised tools exist for each aspect of the data science workflow.
Each language has its own methods to combine text with code.
RMarkdown, Jupyter Notebooks and Org Mode are popular sys-
tems to undertake literate programming. Once the code is written,
at the push of a button the computer generates a new report with
updated statistics and graphics. You can choose to include or
exclude the code from the final result, depending on the expertise
of the reader.
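A minimal R Markdown sketch shows the principle; the turbidity
variable is a hypothetical vector of measurements loaded earlier in
the document:

    ---
    title: "Water Quality Report"
    ---
    The average turbidity was `r round(mean(turbidity), 1)` NTU.

    ```{r echo=FALSE}
    hist(turbidity, main = "Turbidity readings")
    ```

Knitting this file reruns the code, so both the number in the sentence
and the histogram update whenever the underlying data changes.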
4.3 Process
three phases in the data science workflow and present a case study
about reporting water quality to a board of directors.⁸
4.3.1 Define
The first step of a data science project is to define the problem. This
first step describes the problem under consideration and the desired
future state. The problem definition should not make specific
reference to available data or possible methods but be limited to
the issue at hand. An organisation could seek to optimise produc-
tion facilities, reduce energy consumption, monitor effectiveness,
understand customers and so on. A concise problem definition is
necessary to ensure that a project does not deviate from its original
purpose or is cancelled when it becomes apparent that the problem
cannot be solved.
For the case study, the analyst decided to use a performance index
for each aspect of the water supply system. A water
quality index is a dimensionless number that reflects the level of
performance compared to an ideal situation.
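Index definitions vary between utilities. A minimal hypothetical
version scores the percentage of samples that meet a regulatory
target:

    # Hypothetical water quality index: percentage of samples meeting
    # a turbidity target of 1 NTU (lower turbidity is better)
    turbidity <- c(0.4, 0.8, 1.6, 0.5, 0.9, 2.1, 0.3)
    index <- 100 * mean(turbidity <= 1)
    index  # about 71: five of the seven samples meet the target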
4.3.2 Prepare
For the water utility case study, several data sources are available
as shown in the table. The data scientist needs to decide which of
these sources solves the problem. In this example, the data from
the catchments is only available on paper, which makes it difficult
to analyse it algorithmically. A separate project would be required
to convert this source to electronic data. Other data sources are
available electronically and can be used for the project.
4.3.3 Understand
4.3.3.1 Explore
4.3.3.2 Model
After the analyst has a good grasp of the variables under con-
sideration, the actual analysis can commence. Modelling involves
transforming the problem statement into mathematics and code, as
described in the previous chapter.
Every model of the world is bounded by the assumptions contained
within it. Statistician George Box is famous for stating that "all
models are wrong, but some are useful". Since data science is not a
science in the sense that we are seeking the truth, a useful model is
all we need.
When modelling the data, the original research question always
needs to be kept in mind. Exploring and analysing data with-
out a specific purpose can quickly lead to wrong conclusions.
Just because two variables correlate does not imply that there is
a logical relationship. A clearly defined problem statement and
method prevent data dredging. The availability of data and the ease
of extracting information make it easy for anyone to find spurious
relationships between different sources of information.
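Data dredging is easy to demonstrate by brute force. The sketch
below correlates pure random noise; with a hundred variables there
are 4,950 possible pairs, so roughly five per cent of them will appear
'significant' by chance alone:

    set.seed(42)
    noise <- matrix(rnorm(50 * 100), ncol = 100)  # 100 variables of pure noise
    r <- cor(noise)                               # all pairwise correlations
    # For 50 observations, |r| > 0.28 corresponds to p < 0.05
    sum(abs(r[upper.tri(r)]) > 0.28)              # roughly 250 spurious 'relationships'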
4.3.3.3 Reflect
was to provide the board with salient information so they can ask
targeted questions during meetings. The reflection phase needs to
revisit the original purpose of the project and verify that it has
been achieved.
4.3.4 Communicate
The last, and arguably the hardest, phase of a data science project
is to communicate the results to the users. In most cases, the users
of the data product are not specialists with a deep understanding
of data and mathematics. The difference in skills between the data
scientist and the user of their products requires careful communi-
cation of the results.
The first limitation relates to the fact that our collection of data will
always be an incomplete description of reality. In a physical system,
we choose which points to measure, at which frequency, by which
method and so on. We need to make many choices before we can
access data. Even more so with social data, all our measurements
are only indirect expressions of the reality we seek to explain.
section 2.2. Algorithms can have a real impact on the world, which
implies that they are subject to ethics.
This section provides some ethical guidelines that managers can use
in any discussion about whether a specific use of data is ethically
justified. Chapter two defined good data science as being useful,
sound and aesthetic. Perhaps we need to add a fourth aspect to this
trivium and insist that data science also needs to be ethical.
• Informed consent
• Avoiding harm in collecting data
• Doing justice to participants in analysing data
5. References
Bennett, C.M., Baird, A.A., Miller, M.B., & Wolford, G.L. (2009).
Neural correlates of interspecies perspective taking in the post-mortem
Atlantic Salmon: An argument for multiple comparisons
correction. NeuroImage, 47, S125. DOI 10.1016/S1053-8119(09)71202-9.
Caffo, B., Peng, R., & Leek, J.T. (2018). Executive Data Science.
A Guide to Training and Managing the Best Data Scientists.
LeanPub.
Davenport, T.H., & Patil, D.J. (2012). Data scientist: The sexiest job
of the 21st century. Harvard Business Review, 90(10), 70–76.
Davis, P.J., & Hersh, R. (1990). Descartes' Dream. The World According
to Mathematics. London: Penguin.
Jones, G.E. (2007). How to Lie with Charts (2nd ed.). Santa Monica,
Calif.: LaPuerta.
Lankow, J., Ritchie, J., & Crooks, R. (2012). Infographics: The Power
of Visual Storytelling. Hoboken, N.J.: John Wiley & Sons, Inc.
Prevos, P. (2017). Lifting the 'Big Data' Veil. Creating Value through
Applied Data Science. Water E-Journal, 2(1), 1–5. DOI 10.21139/wej.2017.008.