Elements of Data Strategy
Boyan Angelov
This book is for sale at http://leanpub.com/elementsofdatastrategy
Find out what other people are saying about the book by clicking on this
link to search for this hashtag on Twitter:
#elementsofdatastrategy
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Why Did I Write This Book? . . . . . . . . . . . . . . . . . . . . . . . . iv
Who Is This Book For? . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
How Is This Book Written? . . . . . . . . . . . . . . . . . . . . . . . . . xi
How to Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Systems Thinking for Data Strategy . . . . . . . . . . . . . . . . . . . xv
How Technical Does a Data Strategist Need to be? . . . . . . . . . xxviii
Defining Data Strategy With StratOps Principles . . . . . . . . . . . xxx
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Data Strategy Analogies . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
The Influence Cascade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Nicolas Averseng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
List of Acronyms and Abbreviations . . . . . . . . . . . . . . . . . . . 259
Architecture and Technology Definitions . . . . . . . . . . . . . . . . 261
Ethics and Privacy Checklist . . . . . . . . . . . . . . . . . . . . . . . . 268
Data Job Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Example of Definition of Done . . . . . . . . . . . . . . . . . . . . . . 272
Example Design Document . . . . . . . . . . . . . . . . . . . . . . . . . 273
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Acknowledgments
Writing a book is a long journey, at the start of which you don’t necessarily
know what its end will look like. I am writing these words after finishing
the manuscript, and seeing the journey from this vantage point allows me
to appreciate that I didn’t walk it alone. There are so many people I want to
thank. Here they are in no particular order.
Nicolas Averseng was the first person I interviewed. Thank you for answer-
ing my cold e-mail, for the brilliant insights, and for all the conversations
we had. You taught me a new way of measuring data strategy’s impact, and
I can’t wait to see what YOOI does in the future. Tom Davenport, for two
things. First, thank you for turning data strategy into a proper discipline.
Your articles inspired me and a whole new generation of data strategists.
Second, for giving me the confidence that this book is needed and that I’m
on the right path. Martin Szugat for teaching me data thinking, convincing
me I should write this book, and reviewing the manuscript. Datentreiber
is an inspiration! June Dershewitz - thank you for helping me look at
data strategy differently and for the continuous encouragement. Noah Gift
for the fresh take on agile in organizations regarding data projects. And
for teaching me to be more opinionated! Amadeus Tunis for helping me
deepen my understanding of holistic data strategy. I also learned a ton
about measuring the value of data projects. Doug Laney: Infonomics was
one of the first books I read as a data strategist, and it shaped my views
tremendously. Also, thank you for the inspiration of having strong opinions
and, of course, for making me look at data as an asset. Stephanie Wagenaar,
for the deep insights on data governance, workshopping, and for reviewing
the manuscript. Your energy was vital for me to push the book through to
completion! Alexander Thamm, for the brilliant conversation and insights.
I thank you not only for the ideas on consulting, data, and AI but for your
views on the future of Europe in this regard. Thomas Varekamp for the great
conversations and continuous support with energy and encouragement.
Thank you for being an excellent example of a true data strategist. I also
want to thank all my former and present colleagues and clients. So many
conversations during the years formed my thinking about data strategy.
Also, as always, thank you to my family. For the constant encouragement
and support. Even if you are unsure what data strategy is, you still believed
in my spending a year and a half defining it. Thank you for everything.
And finally, thank you - the reader. Deciding to have this book in front of
you is another sign that data strategy is a truly growing discipline. It is a
privilege for me to guide you on this journey.
Preface
“Software eats the world, but AI will eat software.”
–Jensen Huang
Those are the words of NVIDIA’s CEO. While they are full of conviction,
most of us agree that we are still far from this lofty goal. This issue
becomes even more apparent if we look at the current state of enterprise
data efforts. Despite the tremendous potential new data technologies offer,
few organizations reap the benefits of data-driven digital transformation.
Adoption and deployment of data and AI technologies remain rare, con-
trasting with big words from executives and their significant financial
commitments. But why? At this point, progress in the field should not be
hampered by the lack of technical talent anymore - or the lack of tools and
frameworks available. Why is it so hard to follow in the footsteps of digital
native organizations and take full advantage of data?
To do great data strategy, you need to possess a diverse set of skills and
experiences. It took me a while to appreciate that being a jack of all trades
can provide tremendous advantages in a new role, one growing beyond the
confines of pure data science. DJ Patil and Tom Davenport (the latter of
whom I also had the pleasure of interviewing for this book), in their 2022
article “Is Data Scientist Still the Sexiest Job of the 21st Century?”* support
this view: while data science saw meteoric growth (and will continue to do
so in the foreseeable future), it spawned even more fast-growing disciplines,
such as data engineering, machine learning engineering, and most impor-
tantly for us: data strategy.
I have had some diverse experiences that have helped me so far on this path:
I worked on metagenomic data at the Max Planck Institute, built machine
learning models and architectures for startups and large organizations, led
my teams to do that at scale as a CTO, and have been guiding industry
leaders in helping their organizations do the same. One thing stands out
when looking back at all those different areas: the increased complexity of
modern-day work. Each of those experiences required its own set of skills,
tools, associated frameworks and ways of working, and best practices. What
kept me going was the vast array of tutorials, courses, books, and articles
at my disposal - I rarely felt starved for help. This is why I was surprised to
find a massive lack of resources on data strategy. I was mostly left to my
own devices when I started that role.
* This is a follow-up to the original article.
I remember those first days as a data strategist. They were full of questions
and confusion. When do we do a CSA*? Before the gap analysis or after?
What other elements are necessary, and how do they relate? What are the
actual deliverables of such work? The more I thought about those questions,
the more I realized that I couldn’t even answer the most fundamental
of them: what does it mean to be a data strategist? What are the skills
and experience necessary to get the job done? The issue wasn’t limited
to forming a brand new vocabulary; I was used to working with many new
terms as a data scientist. I searched, read, and asked - yet I could still not find
confident and conclusive answers. While the why was clear to me from the
beginning, I expected a manual on the what and the how of data strategy. I
resorted to learning the hard way by listening to experienced leaders in the
field and combining their knowledge into tangible and concrete concepts.
Slowly the different ideas and definitions clicked together, and the answers
to my questions became visible. Soon other people started approaching me
with the same questions I had, and that’s when I decided it was time to share
what I learned, shaping it into the book you’re reading now.
“Elements of Data Strategy”† is the book I wish I had when starting out in
data strategy.
* Current State Analysis. More on this in DUE DILIGENCE.
† The name was inspired by one of my favorite books on data: “Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
About the cover: the book’s cover represents two fundamental ideas.
First: data strategy should be at the core of the business strategy. Second:
this work showcases a framework: a cohesive collection of concrete build-
ing blocks that guide and inform the design of your strategy. It is still up to
you, the strategist, to fill in the missing pieces that fit your organization.
Will reading this book be sufficient to make you a data strategist? I’ll be
honest - it won’t. I came to recognize that this role requires vast experience
across many fields; it’s inherently transdisciplinary. But I can make you a
promise that I’ll try to keep. This book will teach you three essential things
to keep at your side as you gain practical experience: the core concepts of
data strategy, a holistic way of thinking about them, and the tools to deliver.
You can apply this framework to any strategic work in data. I won’t go into
detail on technical terms. There are more than enough resources on data
lakes, warehouses, and algorithms - often written by their creators. I reserve
the right to provide a few sentences (thanks to GPT-3!) in the Appendix on
the most important ones. Still, more often than not, I’ll point to a better
resource. The focus gained will allow me to deliver on my promise.
Here’s another thing I won’t do: convince you that data strategy and, by
extension, data science, and engineering are important. Many such books
start with endless factual explanations of the low adoption of AI and the
need for planning data projects. I assume you would not have this book in
front of you if you thought otherwise, and if you still require convincing -
there are more than enough materials.
By the time this book turns its first birthday, many things will have changed. This
rapid pace of active development is why I stayed away from using the term
“modern” anywhere in the book (it would be arrogant to assume everything
I write will hold for years into the future). Treat this book as what it is - a
framework, and I encourage its adaptation and reworking in the wild! The
core concepts and way of thinking should remain constant, and hopefully,
this book’s usefulness will also persist into the future.
As the first step in understanding data strategy, we should define the skills
of a data strategist. In the early days of data science, there was quite a
popular article called “The Data Science Venn Diagram” by Drew Conway.
Data Science was initially dismissed as a fad. It took a while for the skills
and job descriptions to become established, culminating in further
specialization. I would argue that data strategy will follow a similar
trajectory as it spreads*.
A fundamental first step is normalizing the key skills. A good start is to make
a Venn diagram:
* More on this in my wide-ranging discussion with June Dershewitz, Data Strategist at Amazon, in
INTERVIEWS.
The grey circle, Data, contains all the (technical) domain knowledge re-
quired. The term itself is quite broad, but we can at least specify its higher-
level internal components: Data Science, Data Engineering, and Business
Intelligence (BI). In other words, in this circle, we have everything hands-on
about the job. One of the most common questions I’ve received here is
to what extent a data strategist needs to be familiar with those topics. There
are different answers to this, but a data strategist with a solid grasp of the
other circles of the Venn diagram will be able to compensate for any lack in
this area. Still, as a rule of thumb, they should have a sufficient technical
understanding of the three data areas I mentioned to the extent that it
allows them to manage a team of technical people. I’ll answer this question
in more detail later in this chapter.
Finally, those two circles support the integrative skill, which binds them
into a complete package - Systems thinking (ST). I’ll go into more detail later
in SYSTEMS THINKING FOR DATA STRATEGY, but to get us started, here’s a thorough
definition of ST from a paper by Arnold and Wade1:
“Systems thinking is a set of synergistic analytic skills used to improve
the capability of identifying and understanding systems, predicting
their behaviors, and devising modifications to them in order to produce
desired effects.”
There are overlaps between those different sections, resulting in the jobs
of Data Architect, Data Advocate, and Design Thinker. All of those are
* This term is described in detail on dbt’s blog.
important on their own, but our focus is on the middle of the diagram - the
data strategist. What is essential to consider when thinking about the skills
of a data strategist is that the three circles are never evenly balanced. Every
individual data strategist possesses a unique combination of those three
areas, and while, with sufficient dedication, you can perfect this balance,
it’s almost impossible to be perfectly capable in all three. This concept is
commonly referred to as T-shaped skill proficiency. For example, if you
dedicate extensive efforts to acquiring technical skills, you also pay an
opportunity cost for not learning business skills. There are twenty-four
hours a day for all of us, and we also have limited energy available. Thus a
tradeoff between the three elements of the Venn diagram occurs, and since
such tradeoffs are typical in ST, this becomes a pattern. I call it a
“tension diagram” (we’ll see many such patterns later in the book). Have a
look at a classic example below:
Even if I don’t cover the technical skills, there are also quite a few strategic
details that I’ll skip to avoid becoming a reference manual. Other excellent
works provide much more detail, such as Driving Digital Transformation
Through Data and AI: A Practical Guide to Delivering Data Science and
Machine Learning Products by A. Borek and N. Prill2, which I’ll also refer to
frequently. Again, to use the well-known analogy - I don’t want to give
you fish; I want to teach you how to fish. You’ll still need to gain practical
experience by applying the framework. If this is how you think, this is the
book for you.
This is a systems book, first and foremost. I provide a holistic* model of data strategy.
Naming the elements of data strategy (giving them logical boundaries)
and identifying the relationships between them lays the foundation for
any hands-on work. Such a structure allows the strategist to explore the
framework inductively, communicate their ideas about it, and enable further
information gathering. Still, remember the Zen saying, where the Zen
master said to their student: “don’t mistake my finger pointing at the moon
for the moon itself.” I believe the only way for a successful data strategy
model to be designed is by letting go of our desire to perfectly model every
situation and the arrogance of believing we actually can. Instead, we should
take this for what it is - a model; the rest of the work we need to do ourselves.
Remember - all models are wrong, but some are useful3. I certainly hope this
one is!
The book comprises three phases, capped by a final chapter with interviews
with thought leaders in the field of data strategy worldwide. Each phase
contains a set of logically related elements. I’ll be referring to them often
in the text, so to make reading easier, they will always be CAPITALIZED AND
MONOSPACED. The main deliverables of each element are presented at the
beginning of each element (except for the elements in DELIVERY, which is
more focused on the process). I called this setup the 3D model of data
strategy:
* Holistic: “characterized by the belief that the parts of something are intimately interconnected and explicable only by reference to the whole”.
Phase I: (DUE DILIGENCE): In this first phase, you’ll gather the information
the strategy needs to stand on: the current state of the organization, its
data maturity, and the gap between its ambitions and its reality.
Phase II: (DESIGN): Here, you’ll put the information gathered so far to use by
designing the strategy itself. You’ll first learn how to ideate, prioritize and
select use cases. Those need to be supported by optimal target architecture
and technology stacks, and finally, the data strategy is capped by designing
an appropriate data governance and operating model.
Phase III: (DELIVERY): Even the best strategy is useless unless delivered
successfully. In this part, you’ll learn how to ensure the data strategy is used
by applying DataOps. The showcased approaches are inspired by data-specific
flavors of two methodologies: soft agile and lean data.
Many books on data are written in a reference format. With such books,
you can pick any exciting topic and dive straight into it - without paying
attention to any assigned order. I structured this book differently. Creating
a data strategy is, by necessity, a sequential process. The phases and their
elements are building on top of each other; the outputs of a phase become
the inputs of the following one (you’ll understand this in SYSTEMS THINKING
FOR DATA STRATEGY). Those elements might not make sense if consumed out
of order. I suggest reading the book following the designated order first and
only afterward using it as a reference manual in case you want to refresh
your memory and knowledge on a specific element.
A second point to remember is that no two data strategies are identical. They
should all follow a similar structure, but you, as a strategist, will always need
to shift focus and order where necessary for your specific case. That Zen
teaching rings valid once again. It would be arrogant to assume that one
strategy template is all you need to make any large or small organization
data-driven.
Additionally, there are several types of blurbs that you’ll encounter through-
out the book. I’ve listed them here:
Warning: I have learned the hard way how a journey through data
strategy can be sidetracked. I’ll emphasize those situations here.
Tip: Here, I’ll provide practical advice that can give you the edge in
tricky situations.
Asides and sidebars: Some topics don’t fit perfectly well in the scope of
the book but can be useful. I will mention those here.
Since working with complex systems is a core skill for a data strategist, I’ll
cover the most important terms.
Definition setting is essential, but suffice it to say that working with vague
terms during already complex work, such as data strategy, can only add
unnecessary confusion.
The three phases of data strategy don’t exist in isolation. The output of
an element within DUE DILIGENCE can become an input to an element in
DESIGN, or DELIVERY. For example, the DATA DICTIONARY is relied upon heavily
during ARCHITECTURE AND TECHNOLOGY. The loops are moving forward and
backward, providing essential feedback. For example, during DELIVERY, the
strategist may realize that an adjustment is needed in the design -
the project’s successful implementation might need to be supported by a
different operating model. The 3D model demonstrates how the concepts
of boundaries, feedback loops, abstraction levels, and other related ones are
essential for a systems thinker. Let’s go through those concepts next.
Boundaries
The first essential concept is probably the most abstract one: boundaries.
As you’ll see later in the book (in CHANGE MANAGEMENT), the primary challenge
for a data strategist is communication. So many abstract terms need to be
explained, and it’s easy to make vague definitions. For a data strategy to
be successfully implemented down the line, the communication between
the strategist and the client or in the strategy document itself needs to be
spot on. Having concise and clear words for the concepts we discuss, and a
shared understanding, ultimately enables us to be productive and focus on
the actual work.
This is also one of the main reasons I wrote this book - I lacked the common
understanding of terms to discuss how to do data strategy. Now: every time
we communicate complex topics, we set boundaries. Knowing where those
are in different systems is also an essential task in CSA.
The reason I cover the concept of boundaries first is that it is the
defining feature of a system. A boundary is where one system ends and
another one begins. Philosophically, everything is a system - one within
another, like matryoshka figures. Almost all the data strategy activities start
with explicitly defining the systems we are working with and understanding
their boundaries. With my colleagues, I used to joke that the number one
skill of a successful data strategist is to draw boxes! For example, we might
draw the organization’s different departments as different systems. We can
then proceed to draw the boundaries of the teams. Within those boundaries,
we can fit other elements, enabling us to talk concretely about otherwise
abstract terms.
MECE
Two common sources of error appear when we break down a concept into
parts. The first one is that two (or more) elements have sub-elements in
common, blurring the separation between them. The second source of error
is that we are not presenting the whole picture; some elements are missing.
If we want to avoid both, we should always attempt to break down a larger
concept into a complete set of parts without anything missing. Those parts
should have clear borders between them delineated, so there’s no ambiguity.
MECE (Mutually Exclusive, Collectively Exhaustive) is an essential tool in
reducing complexity and the fundamental principle in designing the
elements of data strategy. All of the elements
of data strategy in this book are designed to be MECE.
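The MECE principle can be made concrete with a small check. Here is a minimal sketch that models each part of a breakdown as a set of sub-elements; the skills and groupings are invented purely for illustration:

```python
# A sketch of checking whether a breakdown is MECE, modeling each part
# as a set of sub-elements. The skills below are invented for illustration.

def is_mece(whole, parts):
    """True if `parts` are mutually exclusive (no overlaps) and
    collectively exhaustive (their union covers `whole`)."""
    seen = set()
    for part in parts:
        if part & seen:      # a sub-element appears twice: not mutually exclusive
            return False
        seen |= part
    return seen == whole     # anything missing: not collectively exhaustive

skills = {"SQL", "statistics", "budgeting", "negotiation"}
breakdown = [{"SQL", "statistics"}, {"budgeting", "negotiation"}]
overlapping = [{"SQL", "statistics"}, {"statistics", "budgeting", "negotiation"}]

print(is_mece(skills, breakdown))    # → True
print(is_mece(skills, overlapping))  # → False: "statistics" is in two parts
```

The two failure modes in the code mirror the two sources of error above: an overlap breaks mutual exclusivity, and a missing sub-element breaks collective exhaustiveness.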
Complexity
One of the most overused and dangerous words in data strategy (and
management in general) is complexity. To paint a picture of how frustrating
it is to define this word, I’ll share what I heard at a conference once:
“Complexity is so hard to define that even the eponymous book, Complexity,
doesn’t define it in any of its 600 pages”. Complexity is enemy number
one for a data strategist (or any knowledge worker, for that matter). And
for such an important term, it’s mind-boggling to realize that neither a
definition nor a universally agreed-upon measurement method is available. For
our purposes, I believe it’s more beneficial to define a complex system
instead. A good working definition of that is available from the Santa
Fe Institute: complex systems have nonlinearity, randomness, collective
dynamics, hierarchy, and emergence as their main properties. Examples of
such systems are our nervous system, cities, software and many others.
Feedback Loops
The easiest way to understand what feedback loops are (you can also read
Donella Meadows’s book6) is with an example from our daily life. We all know
what we talk about when we say, “I got off the wrong side of the bed today”.
Suppose you have a negative experience first thing in the morning. In that
case, things can cascade further down - your already bad mood will likely
attract more negative experiences and, consequently - an even worse mood.
This predicament is an example of a positive feedback loop. Positive, not
because it’s a nice thing to happen but because the system continues to grow
with time. Have a look at the following diagram:
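A positive feedback loop can also be sketched numerically: the state feeds back into its own rate of change, so the system keeps growing on its own. A minimal sketch, where the 20% amplification rate is an arbitrary illustrative assumption:

```python
# Minimal sketch of a positive (reinforcing) feedback loop: each step's
# state amplifies the next one. The 20% rate is an arbitrary assumption
# used purely for illustration.

def positive_feedback(initial, rate, steps):
    """Simulate x(t+1) = x(t) * (1 + rate): the output feeds back as input."""
    values = [initial]
    for _ in range(steps):
        values.append(values[-1] * (1 + rate))
    return values

bad_mood = positive_feedback(initial=1.0, rate=0.2, steps=5)
print([round(v, 2) for v in bad_mood])  # → [1.0, 1.2, 1.44, 1.73, 2.07, 2.49]
```

Each value is the previous one multiplied by 1.2 - this compounding is exactly what makes the loop “positive”, regardless of whether the outcome is pleasant.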
Black Boxes
Black box systems are common in data strategy, and the DUE DILIGENCE phase
aims to get rid of them as much as possible.
System Types
Let’s focus on a more common type of data strategy system (and planning
systems in general) - the rigid one. While nobody wants to build a fragile
system, a rigid one can be tempting - especially one designed by high-paid
external experts. The most common cognitive bias is the illusion that we can
predict the future with a significant degree of accuracy. This, together with
our domain expertise, can lead to having supreme confidence in crafting
plans. This scenario is even more dangerous because, in the short term,
such a strategy can work and inspire confidence. Everybody wants to follow
a leader with a clear plan, and nobody wants to hear or present thoughts of
uncertainty. Still, such systems are bound to fail in the mid-to-long term
since they collide with the complexity of the real world.
And finally, the third type of system, and the one we should aim to design,
is the antifragile one - a viable system. Such systems are built in a way to
respond effectively to feedback and learn. Instead of breaking under stress,
they adapt and grow stronger.
Abstraction Levels
One of the best ways to deal with complex systems is by using abstractions
(in systems and complexity science, this is related to the concept of scale7 ).
The human mind is uniquely capable of reasoning through the same prob-
lem in various ways. Since, by definition, we can’t understand a black box
(complex system), we need to apply abstractions. You can imagine them
as mental maps we sketch to think differently and be productive. This also
allows us to look at problems with only the necessary level of detail. After all, we don’t
need to understand a system’s inner workings, but mostly its outputs and
inputs - whatever is enough for us to work with them. Let’s define this
concept:
To drive the point home, I’ll illustrate with an example from arguably the
most strategic game invented: chess. Look at the boards below (you’ll need
just a basic understanding of chess, don’t worry):
balance of the board, perhaps planning their attacks in one area only. At the
same time, any decisions they make on this second strategic abstraction
layer need to consider the overall game phase - the long-term picture - the
highest abstraction level.
Once you get used to thinking in abstraction levels, you’ll see them every-
where. For example, when you need to decide on a target data architecture
stack, it’s rarely required for the data strategist to inspect all the data
pipelines currently in operation (this would be the “pieces” level). They
should be able to comfortably make a recommendation based on more
general observations (for example, the different tools used to build the
pipelines and their limitations, the strategy level). Using this ST method
will allow the strategist to operate in black box scenarios under constraints
more effectively and efficiently.
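The pipeline example can be sketched in code. Rather than inspecting each pipeline (the “pieces” level), the strategist aggregates the estate by tool (the strategy level); all pipeline and tool names below are invented for illustration:

```python
# Sketch of operating one abstraction level up: summarize the pipeline
# estate by tool (the "strategy" level) instead of inspecting each
# pipeline (the "pieces" level). All names are invented for illustration.
from collections import Counter

pipelines = [
    {"name": "orders_daily",    "tool": "Airflow", "sla_breaches": 4},
    {"name": "crm_sync",        "tool": "Airflow", "sla_breaches": 1},
    {"name": "clickstream_etl", "tool": "cron",    "sla_breaches": 9},
    {"name": "finance_export",  "tool": "cron",    "sla_breaches": 7},
]

counts = Counter(p["tool"] for p in pipelines)
breaches = Counter()
for p in pipelines:
    breaches[p["tool"]] += p["sla_breaches"]

for tool, n in counts.items():
    print(f"{tool}: {n} pipelines, {breaches[tool]} SLA breaches")
# A recommendation (e.g., migrate off cron) can be made without
# reading a single pipeline's code.
```

The summary is the abstraction: the strategist reasons about tools and their reliability, not about individual pipelines.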
As a natural first step, we must define what “technical” means. This term is
represented in the grey circle in the data strategist Venn diagram I presented
earlier in the chapter - Data domain knowledge. This is the ability to guide
the implementation of a data (software) system and its connection to other
systems (which can be nontechnical). There’s one thing we should already
get out of the way - it is certain that the more hands-on experience you
have with software and data technologies, the better. Still, as discussed
previously, even with all the technical talent and experience, a lack of
communication and systems skills will limit the effectiveness of the data
strategist.
Now that we know that balance needs to be kept, what is a good level of
technical and systems expertise? Instead of providing a giant list of pro-
gramming languages, frameworks, databases, and cloud services to master,
I’ll illustrate with cases from the real world.
Let’s say that as a part of the data strategy, all the data in the organization
needs to be organized and stored in a centralized repository. On top of
this, good access policies must be implemented (DATA GOVERNANCE), striking
a balance between restrictive and open. For this, a data strategist should
understand the use cases, the data formats, quality, and size, and different
data storage terms (data lake, warehouse, relational and non-relational
databases) enough to guide the implementers. This should be enough tech-
nical expertise and needs to be coupled with their business understanding
(regulations and internal policies, for example).
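To make that level of understanding tangible, here is a minimal sketch of such an access policy as a role-based allow-list. The roles, datasets, and rules are invented for illustration and are not a recommendation for any real governance design:

```python
# Illustrative sketch of a role-based access policy balancing restrictive
# and open defaults. Roles, datasets, and rules are invented, not a
# recommendation for a real governance design.

POLICY = {
    # dataset -> roles allowed to read it
    "sales_aggregates": {"analyst", "data_scientist", "executive"},
    "customer_pii":     {"data_steward"},   # sensitive: restrictive by default
    "web_logs":         {"analyst", "data_scientist"},
}

def can_read(role, dataset):
    """Deny by default; allow only if the dataset's policy names the role."""
    return role in POLICY.get(dataset, set())

print(can_read("analyst", "sales_aggregates"))  # → True
print(can_read("analyst", "customer_pii"))      # → False (deny by default)
```

The deny-by-default design choice is the “restrictive” end of the balance; widening a dataset’s allowed roles moves it toward the “open” end. The strategist needs to understand this tradeoff well enough to guide the implementers, not to build it themselves.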
Needless to say, we should choose the StratOps approach! Now, let’s get
started with doing data strategy!
Part I: Due Diligence
Why due diligence is hard—Alignment with business strategy—Steering
group—Current State Analysis—Via negativa—Maturity assessment—
Pilotitis—Systems audit—Competitor analysis—Futures thinking—Ambition
level setting—Gap analysis
Overview
“Give me six hours to chop down a tree and I will spend the first
four sharpening the axe.”
–Abraham Lincoln
How can we know where to go if we don’t know where we are? This philo-
sophical question applies not only to our personal lives but to data strategy
as well. Assessing the current state might not feel like making inroads to-
ward digital transformation (and, in some cases, might even open unhealed
wounds). However, it still is a foundational element of any successful data
strategy project. I group all such activities focused on information gathering
in the first phase of the 3D model under the umbrella term DUE DILIGENCE.
You might have previously encountered this term in other contexts, such as
corporate audits. There it roughly translates to “collection of information,
in preparation for action”; in other words: before we can strategize, we need
information. It’s irresponsible to make decisions or provide a strategy in the
dark. Moreover, how would members of our organization or clients trust us
if we immediately jump to conclusions before learning the details? We need
to get the lay of the land first.
Before diving into the details of conducting DUE DILIGENCE for data strategy,
let’s have a few words about mindset and honesty - two essential qualities
of a data strategist for this part. Keep in mind that of all the elements of
data strategy, DUE DILIGENCE is the most commonly assigned to an external
party (such as a management consulting firm). There are several reasons
for this. External consultants are primarily hired to alleviate pain points;
any organization would prefer to solve its issues (especially those of a more
“strategic” nature) internally. What’s behind this lack of trust in internal
teams can be summarized in three reasons:
Practical: It’s expected that a fresh look, combined with specific expertise,
can break through barriers that are too complicated to overcome for the
internal team. In Zen philosophy, there’s a term called “beginner’s mind”:
a viewpoint free of prejudice and baggage can provide a new, previously
unseen perspective.
Economic: External consultants are indeed expensive (at least more so than
regular, in-house employees), but the cost can pay for itself in the
mid-to-long term, especially if those consultants are deployed at critical
junctures and are involved in strategic decisions and planning.
Political: Radical honesty is difficult to deliver from within. Why not hire external
consultants, with the explicit intent to be as direct as possible, just for the
DUE DILIGENCE part of the data strategy and then let them leave the project?
In this way, we would avoid the potential fallout of radical honesty. In
military circles (where strategy has a central role), this is known as a “red
team”: a group of experts hired specifically to poke holes in your theories
and work to improve them afterward.
Now that we are familiar with the rationale behind DUE DILIGENCE, let's
look at its elements:
This first element might sound obvious, but for large organizations with
multi-layered internal structures (divisions, departments, and teams),
identifying accurate business plans and objectives can prove more challenging
than one might think. This activity requires a thorough look at the
company's ongoing and planned strategic initiatives, at many levels of the
organization. We should not limit this investigation to the business units
interfacing directly with data. The best way to approach this challenge is to
use a top-down approach, as illustrated in the figure below (this is also a
tension diagram; imagine the individual trees at the bottom and the forest
on top):
We want to start this way because the complexity of business goals tends
to increase from the top to the bottom layers of the organization, as goals
become less strategic and more operational. We can approach this work from a
consultant's perspective: in many cases, you are an outsider to the goals of
other organizational units. To flatten the learning curve of grasping the
whole business strategy, we start gathering information at the top, by talking
to the C-suite. Most organizations run several strategic
projects simultaneously. Depending on their data maturity, this number
can be between one and more than ten* . Those initiatives tend to be
long-term focused (longer than one year). An example of a strategic business goal
driving such an initiative can be “increase the market share of our products
in the EMEA region from 7% to 10%” (Borek and Prill provide many specific
examples8 ). If we were to stop the alignment between data and business
strategy on this level, we would probably miss the mark - this is too vague
and not actionable enough. How can we connect such a goal directly to our
data strategy?
Next, we climb down a level and uncover the business goals of the individual
departments. Let’s take marketing as an example. Their goals should be
informed by the C-level ones - but are probably more specific and focused on
marketing projects and related operations. We need to talk to the functional
leadership. These are the leaders of non-data departments and teams. In
this case, the marketing leaders think about how they can support the
overall strategic goal to increase the market share of products sold within
a specific region. While doing this, they come up with their own version of
the overall business objective: “Deploy a marketing campaign in EMEA which
* I’ll show you how to establish this for any organization during the CSA element.
increases the conversion rate for the region by 2%”. Successfully achieving
this goal contributes directly to the overarching one. Working directly with
functional leadership dramatically reduces the complexity of the task since
they are at the right level of familiarity with the work - not too strategic, but
also not too operational. Functional leaders are valuable allies for the data
strategist (and can even become data champions, see aside).
And finally, we can also go to the most fundamental level of the organiza-
tional pyramid and look at what individual teams have selected as objectives.
Aligning at this level is more relevant when we design a data strategy for
a smaller organization, where the employee headcount is in the low hundreds.
For a larger one, the information obtained by working with specific teams is
probably redundant. That could change in the use cases part
of the data strategy design, where goals and targets become more specific.
We can repeat the exercise we did for the middle section and discover
individual team goals.
Now that we have gone through the different levels of the organization to
discover their goals, I would advise positioning yourself at the right level
of the business strategy pyramid - where the goals are not too operational but,
at the same time, also not too strategic. With this information, we can proceed with the
other parts of the data strategy. We need to refer to the organization’s or the
separate functional parts’ goals in several elements of data strategy, most
notably in USE CASES.
The second alignment step specifies the internal team leading the data
strategy efforts and ensures that the necessary functions are actively and
continuously involved in this process. As I previously mentioned, having a
dedicated working group participating in the data strategy at all levels is es-
sential for progress. This process can be complicated due to political reasons
and the number of touchpoints (stakeholders) affecting data and technology
within even a middle-sized organization. I recommend approaching this by
filling out a Responsibility Assignment (RACI) matrix. Have a look at
the template below:
The RACI matrix should involve a mix of the people designing the data
strategy and the steering group members. In the first column, we add all
the planned activities (elements) to create the data strategy, such as a DATA
MATURITY ASSESSMENT or AMBITION LEVEL SETTING. At this stage, it’s crucial not
to zoom into details; it’s enough to specify the major components only. This
way, we reserve space for adjustments (which will always be necessary) in
advance. The strategist can then use them during the data strategy design
work. On the horizontal axis, we can then add the participants. We can label
them in one of the five categories below:
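Once participants and roles are mapped, the matrix can also be sanity-checked in code. Below is a minimal sketch - the activities, participants, and the classic four RACI letters are illustrative assumptions, not the book's template:

```python
# A RACI matrix as a nested dict: activity -> {participant: role}.
# Roles: R = Responsible, A = Accountable, C = Consulted, I = Informed.
raci = {
    "Data Maturity Assessment": {
        "Data Strategist": "R", "CDO": "A", "Head of Marketing": "C", "CTO": "I",
    },
    "Ambition Level Setting": {
        "Data Strategist": "R", "CDO": "A", "CEO": "C", "Head of Marketing": "I",
    },
}

def validate(matrix):
    """Return a list of problems; a healthy matrix yields an empty list."""
    problems = []
    for activity, assignments in matrix.items():
        accountable = [p for p, role in assignments.items() if role == "A"]
        if len(accountable) != 1:
            problems.append(f"{activity}: needs exactly one Accountable, found {len(accountable)}")
        if not any(role == "R" for role in assignments.values()):
            problems.append(f"{activity}: nobody is Responsible")
    return problems

print(validate(raci))  # → []
```

The single-Accountable check mirrors standard RACI practice: shared accountability usually means no accountability.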
relate to the feeling of frustration when exposed to such advice. The natural
immediate reaction is to roll our eyes and think: "Sure, thanks for
the advice, but isn't this obvious? How is this actionable?"
Now, how about the second dimension of CSA? What if we take the snapshot
at the wrong resolution? This means we’re missing the forest for the trees
(again, remember the tension diagram from the INTRODUCTION). This mistake
can be easier to make since a careful balance is required. On one extreme,
the strategist might spend too much time on the operational end of the work
- such as investigating daily practices and technical implementation details.
On the other extreme lies conducting a CSA with the executive suite only -
focusing on the more strategic and high-level view and ignoring the rest.
A successful CSA requires a balance between both views, hence the constant
need for the data strategist to adjust the focus of their attention.
This is what a CSA is all about - taking a picture of the organization at the
right time and the right resolution. Now let’s dive into the components of a
thorough CSA.
Systems Audit
You can look at the current state of an organization as an onion whose layers
we need to peel off one at a time until the complete picture is revealed. As
with many other concepts in this book, those layers are closely intertwined,
and the borders between them can be blurry, complicating our work.
We can split the CSA target systems into three discrete types: use cases, data,
and architecture and technology. A specific dataset, architecture, and
technology support each use case. Perhaps differently from how we would
peel an onion, we first need to start with the center. This is because the
use case is the fundamental value-generating unit of any data project, and
if we put it at the center of our work at all times, we can keep the focus on
delivering value. A second reason is that any change in the use case can have
cascading (and sometimes unpredictable) effects on the other two system
types (a topic covered in more detail in INFLUENCE CASCADE in DESIGN).
As a first step, the data strategist must uncover what data (and related)
use cases currently exist across the organization. Those, together
with new ones, will represent the organization's data portfolio (I'll go
deeper into how to manage this in PORTFOLIO MANAGEMENT in DELIVERY). It's not
guaranteed that those same use cases will be worked on as part of the data
strategy - the scope often changes during DESIGN. Still, knowing what
is currently being pursued is crucial, since it might need to be stopped (to
conserve resources), improved (if it's a valid use case), or mined for
information and code useful to new use cases. Here's an example use case audit
(FTE stands for full-time equivalent* ):
Question                                          Example
What use cases are pursued?                       Topic modeling for customer support data
What are the business goals behind it?            Topic modeling for customer support data
What technologies are used in each?               Python, LDA
What architecture components are used in each?    S3, Glue
What datasets are used?                           CRM data from SalesForce
Which people are involved?                        J. from marketing, A. as a customer support manager
What are the budgets?                             EUR 35,000, 2 FTE over two quarters
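Collected across the organization, such audit records add up to a first portfolio view. A minimal sketch of aggregating them (the use case names and figures below are illustrative, not taken from the audit above):

```python
# Each record mirrors the audit questions: what, budget, and staffing.
use_cases = [
    {"name": "Topic modeling for customer support", "budget_eur": 35_000, "fte": 2, "quarters": 2},
    {"name": "Churn prediction",                    "budget_eur": 50_000, "fte": 3, "quarters": 2},
    {"name": "Marketing dashboard",                 "budget_eur": 15_000, "fte": 1, "quarters": 1},
]

total_budget = sum(uc["budget_eur"] for uc in use_cases)
# Person-quarters: a rough measure of committed capacity.
person_quarters = sum(uc["fte"] * uc["quarters"] for uc in use_cases)

print(f"{len(use_cases)} use cases, EUR {total_budget}, {person_quarters} person-quarters")
```

Even this crude roll-up answers a question executives will ask immediately: how much are we already spending on data work?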
Even in the worst scenario, there will almost always be technical components
that can be reused for the data strategy implementation. Gathering a list of
those can be helpful for the target architecture and technology element in DESIGN.
Data Audit
Now we can dive into the oil that powers the use cases: the data. A word
of caution: if you remember what we discussed about the data strategist’s
attention earlier, this particular activity can drain it tremendously if left
unchecked. There is so much information here that if the data strategist
wants to compile an exhaustive report on it, they can waste much time -
even in the case of relatively small and homogenous datasets. The challenge
then lies in deciding the appropriate assessment for your specific case. Also
note: this work is essential for setting up good data governance in DESIGN.
The data strategist needs to ensure that necessary data is not only present
but is also of sufficient quality for use in the use cases. Fortunately for
us, this is not an entirely new problem - there are quite a few established
frameworks focused on auditing data. The two popular frameworks for data
audits are FAIR and the 4Vs. I’ll refer you to read them separately and
instead provide an example of a data audit below:
Question                              Example
Who are the data custodians?          Marketing department
Who are the data consumers?           Operations
What is the data lineage?             Salesforce → data lake → data warehouse
Is there a data dictionary present?   Yes, but only partially - it covers just 40% of the fields
What is the data quality?             30% of the rows in the aggregated dataset are missing
The answers in the example are simplified for illustration purposes. Typically,
especially for the data quality part, you will have much more to fill in, and
the deliverable probably won't fit in a table like this but will be a whole
document instead.
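Parts of such an audit can be automated. Here is a rough sketch of computing the two metrics from the example table - dictionary coverage and rows with missing values. The field names and rows are made up to reproduce the 40% and 30% figures:

```python
# Fields present in the dataset vs. fields documented in the data dictionary.
fields = ["customer_id", "region", "channel", "spend", "segment"]
documented = {"customer_id", "region"}
coverage = len(documented & set(fields)) / len(fields)

# Toy rows: three of ten have a missing value somewhere.
rows = [{"customer_id": i, "region": None if i < 3 else "EMEA"} for i in range(10)]
missing_rate = sum(any(v is None for v in row.values()) for row in rows) / len(rows)

print(f"dictionary coverage: {coverage:.0%}, rows with missing values: {missing_rate:.0%}")
# → dictionary coverage: 40%, rows with missing values: 30%
```

Automating these checks is also a first step toward the continuous data quality monitoring that good governance requires.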
The second lens through which we can do our data audit is looking at how
data moves through different systems in the organization: data lineage. A
common phrase in data science and engineering is “garbage in - garbage
out”11 . It’s primarily used in the machine learning context. In the case
of training ML models, data scientists often spend an inordinate amount
of time fine-tuning a model or trying to solve the problem by changing
the algorithm completely (usually going for a shiny new one or a different
framework altogether). While such changes might improve the model per-
formance by a few percent (in the best case), the highest performance gains
occur when better quality and quantity of data are supplied to the algorithm
(this is reflected in the data-centric AI movement spearheaded by Andrew
Ng12 ). The best algorithm on the planet will not save the project if the data
is garbage. In non-ML applications, erroneous data can also have disastrous
consequences - weekly financial reports based on the wrong numbers can
spell catastrophe. Data lineage is where these issues are dealt with.
A tipping point for deriving value from data is changing how it’s viewed
in the organization. Much is said about being “data-driven”: the organi-
zation can only achieve this if we view data as an asseta . Doug Laney, in
his excellent book Infonomics: Monetise, Manage & Measure Information13,
and in INTERVIEWS, tells a compelling story about how modern enterprises treat
things like printers, monitors, and window blinds as "assets", but not
their data. This is why we should always look at our datasets with a use
case in mind; if we have done our work well in the previous section on use
cases, it should become clear how this view can yield monetary gains - one
of the main motivations for becoming a truly data-driven organization. To sum up:
a In my conversation with Amadeus Tunis, he mentioned that data can only have a relative
Data lineage
This diagram is MECE, meaning any data in the organization should fall into
one of those categories at any point in its existence. Our job is to document
how it travels through the organization.
One helpful way to think about this problem is to classify the different
datasets using the BSG method. It's a simple classification scheme that can
help in designing architecture and technology strategies. It stands for the
different classes of data assets - Bronze, Silver, and Gold. Using metals to
describe the state of data is already established in data engineering, and
you can learn more about it from articles by leading practitioners.
Bronze: This is the data that enters the system. It's data in its rawest, atomic
form - without any processing. It is usually large in quantity, low in quality,
and of limited business use. Focusing on just collecting this data is dangerous,
since it's of limited use in this form - such a strategy is called "boiling the
ocean", and is exactly as productive as it sounds.
Silver: This is the data resulting from processing and enrichment steps. This
data is also the input to ML algorithms. It is usually smaller in size and of
intermediate value. It can also be the input to self-service analytics systems,
where data analysts can build the presentation layer.
Gold: This is the refined, aggregated data ready for direct business use -
powering reports, dashboards, and decisions. It is usually the smallest in
volume but carries the highest value.
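Because the BSG classes form an ordered scale, the audit is easy to script. A minimal sketch (the dataset names and tier assignments are illustrative assumptions):

```python
from collections import Counter

TIERS = ["bronze", "silver", "gold"]  # ordered by increasing refinement and value

datasets = {
    "raw_crm_events":    "bronze",
    "cleaned_customers": "silver",
    "revenue_by_region": "gold",
    "raw_clickstream":   "bronze",
}

def promote(tier):
    """Move a dataset one refinement step up; gold stays gold."""
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]

# Audit view: how much of the data estate sits in each tier?
print(Counter(datasets.values()))  # e.g. two bronze, one silver, one gold
print(promote("bronze"))           # → silver
```

A skewed count - say, everything stuck in bronze - is a quantitative symptom of the "boiling the ocean" anti-pattern described above.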
By using those ideas, we can go through the following steps to complete the
audit:
The final audit we need to conduct to complete the CSA is that of the
technology and architecture. Those are the elements that power the use
cases, and many make-or-break issues occur here.
Question                                                  Example
What are the data-generating systems?                     Salesforce
What are the data-consuming systems?                      PowerBI
What are the data storage systems?                        Flat files on S3
How much is on cloud versus on-premise?                   40% on cloud
What languages and frameworks are used?                   Python, Scala, Java, Keras
What are potential issues with security and compliance?   RPO and RTO are not set; only weekly backups
Like the other audits, you usually would go much deeper here. For each of
those items, it’s also important to note any potential concerns or upcoming
plans associated. When working through the use cases in DESIGN, you’ll need
to have those at hand.
Finally, all those audits can help us conduct the next element, the DATA
MATURITY ASSESSMENT.
It’s not surprising that the journey is not a straight line, but why this shape
in particular? I’ll show you by describing what the five stages stand for. We’ll
go through the lowest maturity level to the highest:
Waiting (1)
Starting (2)
Toiling (3)
Accelerating (4)
Leading (5)
DMAs have been around for quite some time, and many consultancies offer
comprehensive solutions. My recommendation is appliedAI's tool. Another
great source is the DMBOK book15, specifically its Data Management Maturity
Assessment chapter.
Based on those criteria, you can score and classify the organization. In this
work, you’ll need copious amounts of trust once again. In most cases, the
organization won’t be a leading one - if it were, it would already have a
designed data strategy at its disposal. Thus you’ll find yourself in a situation
where you need to communicate that the organization falls short of the
ideal. Nobody wants to hear this, least of all senior executives - and those
are often your target audience. To cushion this potential blow, you’ll need
to demonstrate how being honest about the situation and looking at the facts
is the first step to making progress. With the right strategy, even giant
corporations can achieve dramatic turnarounds - so it’s not all doom and
gloom. Those goals are your North Star* .
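The scoring step can be sketched as mapping per-criterion scores onto the five stages listed above. The criteria and the simple averaging rule here are my illustrative assumptions, not a prescribed DMA methodology:

```python
STAGES = {1: "Waiting", 2: "Starting", 3: "Toiling", 4: "Accelerating", 5: "Leading"}

def maturity_stage(scores):
    """Average the 1-5 criterion scores and map to the nearest stage."""
    mean = sum(scores.values()) / len(scores)
    return STAGES[max(1, min(5, round(mean)))]

# Hypothetical assessment criteria and scores for one organization.
scores = {"data quality": 2, "tooling": 3, "skills": 2, "governance": 2}
print(maturity_stage(scores))  # → Starting
```

Real DMA tools weight criteria differently and report per-dimension profiles rather than a single number, but a single label is often what sticks in the executive conversation.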
Pilotitis: A term I'll often refer to in this book, pilotitis describes the
propensity of organizations to commit only to individual, isolated pilot
projects, without contemplating the need to build on a solid foundation and
scale. Data products are worked on in isolation. So what are the causes
of this condition? First, it’s the easy way (more on this in the LEAN DATA
and SOFT AGILE in DELIVERY). No need to establish a complex organizational
structure; make a small team of three to five people, and off you go. Also
no need for complex IT or data architecture changes. Second - it’s cheap.
Doing things at scale requires an investment, both in people and technol-
ogy. And third (perhaps most insidiously): it’s fun. It plays to the human
* In the old days, before better technologies were widespread, travelers relied on the celestial bodies for
navigation, the North Star being one of them. It is used as a metaphor for a long-distance goal, which we
must constantly pay attention to while traveling to ensure we don’t sway from the right course.
Pilot projects are often low in effort (and also low in impact). Perhaps more
crucially, as time goes on, such projects do not build on top of each other.
Granted, it requires a more significant upfront investment to structure the
work properly. Still, the impact will also be much higher - to the long-term
satisfaction of the implementation teams and the executives. Let's illustrate
this point with a practical example. Imagine that in the first pilot project,
you create a data pipeline. Perhaps you decided to use AWS S3 to store the raw
data and AWS Athena to query it. Even if this pilot project ultimately proves
unsuccessful, in the future, you might still reuse much of the same
architecture (or at least with minor ad hoc adjustments) for the next pilot
project. This is much more productive than reinventing everything from scratch
each time. Such project management needs to be strategic - taking a wider-angle
view of the use cases, looking further into the future, and with a better
understanding of the available resources.
Let's set some signposts for the road ahead from different angles. This
element will require us to move beyond the technical and work with our
stakeholders on a strategic level again. We must be mindful here and strike
a good balance between optimism and realism. Remember that data and AI work
starts with the burden of high expectations, fear of the unknown, and
escalating costs. We have our work cut out for us! The gap analysis aims to
set the ambition level for the organization and determine how far off that
goal is. Two methods are essential for this work; let's tackle them one at a time.
Competitor Analysis
analysis, the strategist can use the DMA we covered previously to see which
group the competitor organizations fall into. After this initial binning of
the competitors into their maturity states, you can go deeper and try to answer
questions like these:
Knowing the state of the art, we can go back to our organization and talk
to the senior stakeholders to determine the right ambition level. This is
another vague and politically charged topic. A common thread in this book
is that we are dealing with complex, evolving systems. This distinction is
crucial because it helps us avoid making static, oversimplified assessments
that would lead to quickly outdated decisions. This pattern continues -
circumstances are bound to change. Humans are notoriously bad at predicting
the future, especially mid-to-long term, but valuable tools can help with
that. Artificial Intelligence (AI) is one of those technologies whose impact
is overestimated in the short term and underestimated in the long one (this
is known as Amara’s Law).
There are two parts to ambition level setting. The first is having follow-up
conversations with senior management, with the results of the DMA in hand.
The second is anticipating future trends and scenarios, both internally and
externally. For this purpose, we can use a tool called the futures thinking
toolkit (there is a multitude of other great tools, including McKinsey’s
Horizon Innovation Framework). This is a whole field of its own. Still, we’ll
take its highest-level tool and fit it to our purpose. Have a look at the
diagram below:
Probable: Discovering what is the most likely outcome can be very hard. It
requires careful consideration of many factors and a deep understanding of
the context.
Plausible: This is where the scope widens. You can deduce those scenarios
by extrapolating from the preferable and probable scenarios, albeit with
minor adjustments.
Possible: Here, we enter a more creative territory, and this scenario has
the most extensive scope. The purpose here is to discover potential missed
opportunities (similar to the Data Thinking exercises we’ll cover in the USE
CASES section) and black swan events17 .
Now, armed with a solid understanding of where the future can take us, we
can have another internal discussion with the senior stakeholders about
where the organization’s ambition lies regarding data. You can organize
several ambition levels on a timeline. Such increased resolution can help
the strategy become more specific and move beyond "we want to be the Google
of restaurant chains" - a typical example of goals masquerading as strategies,
explained very convincingly in Good Strategy, Bad Strategy by Richard P.
Rumelt18 (also recommended in my interview with Martin Szugat). This step
requires much input from the other elements of
the CSA - the competitor analysis, maturity assessment, the data audit and
dictionary, and the futures thinking work. It can be helpful to position the
company around the competitors based on different assessments.
Now we can close the circle of the DUE DILIGENCE phase and reveal the whole
picture:
Here we can see how the CSA and the ambition level setting elements enable
the GAP ANALYSIS. And guess what’s inside of it? The actual data strategy that
we’ll start designing in the next phase.
Summary
Let’s recap what we’ve learned so far.
Now we know where we are and where we want to go. Everything is in place
for us to design the data strategy in the next phase.
Part II: Design
Useful analogies—The Influence cascade—Use case ideation, feasibility study,
and prioritization—Data architecture and technology—Data governance—
Operating model—Roadmap
Overview
“A system is never the sum of its parts. It’s the product of their
interaction.”
–Russell Ackoff
So far, most of our work has been more passive than active. We have learned
much about the organization’s ability (or lack thereof) to deliver value from
data. This groundwork can already prove helpful to the organization: it can
be used to take initial tentative steps in the right direction - even without
further recommendations from the data strategist or having a data strategy
in place. Still, at this point, there's little in the way of actionable advice.
This situation can be frustrating for the data strategist, since we always
strive to deliver value as fast as possible. But however tempting it is to jump
directly into designing the strategy, doing so would be ill-advised.
Without essential elements such as CSA, DMA and GA, any recommendations
and plans would only be based on gut feeling rather than facts about the
organization’s particular circumstances and aspirations. Shooting in the
dark is never a viable strategy - we might just as well copy a data strategy
made for another organization.
With the deliverables from DUE DILIGENCE in hand, we can now look at the
essential elements of data strategy DESIGN, MECE as always:
This view of data strategy is different from the currently popular ones. Many
data strategy whitepapers represent data strategy as a house - where the
use cases sit on top of elements such as operating model and enabling
culture or data architecture. While such a layout can seem logical at first
Before starting to design the data strategy, I want to show you one essential
tool that data strategists often use for such work: analogies. They come in
handy when we must explain complex topics and their interconnections to
a diverse audience.
Talking about data projects can be confusing, even for experts. This is
primarily due to the extensive technical terminology used in the field.
Walking into a technical data meeting, you’ll often hear opaque concepts
such as data lakes, ingestion layers, data warehouses, anomaly detection
models, and everything in between. Finding good analogies for such a wide
range of terms might seem challenging, but here are some good ones - tried
and tested throughout various consulting engagements.
You might have heard this one: "data is the new oil" (there's also the opposite
view, that data is the new water: while it's true that organizations are
sitting on piles of data, making that data actionable is as big a challenge as
it's ever been* ).
There are several good reasons why this analogy has become widespread in
the community. First, it relates to the idea that data can be seen as even
more valuable than oil. It powers our digital lives, workspaces, governments,
and vital infrastructure. This influence is on par with that of oil during the
industrial revolution. Second, it maps well to data pipelines, containing
* See the vanity projects from Joe Reis and Matt Housley’s “Fundamentals of Data Engineering”.
data processing and enrichment steps. Like oil needs refinement before
becoming useful, data requires a similar upfront investment of work before
we can reap its benefits. Data also travels through pipelines, gradually
improving in quality and suitability for business use (becoming fit-for-
purpose). This analogy is excellent at explaining data architecture.
This second analogy became popular more recently. It has been advocated
for by Google’s Chief Decision Scientist, Cassie Kozyrkov* . This one can be
more playful (and less environmentally disturbing) than the oil one. I find
it particularly helpful in describing more AI-centric data terminology and
processes. Let’s see if GPT knows about this:
* Her original article introducing this analogy is called “Why Businesses fail at Machine Learning,” and
The arrow represents both the passage of time and the associated effort to
reach the final goal, the peak of value. At the beginning of this journey, we
start at our well-built home city, where we design the strategy. Imagine
everything here is neatly organized in different buildings, connected by
straight paved roads. A well-designed data strategy should resemble this
- it should be clear, concise, and structured. Nevertheless, to reach our final
destination, we must leave the safe confines of our homes and venture out
into the wilderness. Between us and the peak of value, we must traverse
obstacles such as the river and forest (the complexity of implementation).
This operational barrier is where data projects and products fail. This can
be the most challenging task in data strategy, and the whole DELIVERY part
of the 3D model is dedicated to how to overcome it.
I’ll refer to those analogies throughout the text, but now let’s focus on
understanding why we should always start data strategy design by working
on the use cases.
During the CSA, I covered how interrelated data strategy elements can be.
Modifying the design in one area can have dramatic effects on another. This was the
main reason the SYSTEMS AUDIT focuses on use cases first. The pattern repeats
in DESIGN. If we don’t consider the consequences of changing use cases, our
strategy might quickly end up disjointed and unfocused. It will also probably
become obsolete in the first weeks of implementation. Another potential
issue is the inherently cyclical nature of developing data products. It’s often
difficult to determine what approach will work beforehand, and we must
remain flexible. Any assumptions made here will be challenged later* . In my
conversation with Datentreiber’s Martin Szugat, he advocates for running
an experimentation phase before committing to implementation.
Imagine we’re developing a use case for detecting abusive content on social
media. Our initial idea might be to create a text classifier, which is trained
on a corpus of tweets and outputs a predicted class (for example, abusive
and non-abusive). We then evaluate and commit to a technology stack
supporting this product. In this case, a good approach would be to use
Python (since there’s a large number of Natural Language Processing (NLP)
related open source packages in Python’s ecosystem) and use a Support
Vector Machine (SVM) as a classification algorithm. Once this is decided,
we can design the appropriate backend architecture for this tech stack to
run on. For example, we can store text data in a NoSQL database (such as MongoDB).
* This is also described in DELIVERY, where I talk about DATAOPS.
So far, so good! But suddenly, the business development team realizes that
there are too many similar solutions on the market and that we need to
pivot. There’s a niche available for the same product but focused on video
data rather than social media texts. We quickly become excited again but
will eventually realize that the whole stack needs to be redesigned (even if
the problem statement and target audience both remain unchanged). The
fact that we now need to work on video data makes the technology tooling
completely different (an SVM is probably no longer a good choice of algorithm,
and neither is MongoDB for data storage). Now, storing the raw video
files on AWS S3 and using Convolutional Neural Networks (CNNs) sounds
much better. With such changes, the work can come crashing down like a row
of dominoes.
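To make the cascade concrete, here is a toy bag-of-words perceptron standing in for the text-classification stack described above (a production build would use something like scikit-learn's LinearSVC; the training examples are invented for illustration). Notice that every line assumes text input - none of it survives a pivot to video data:

```python
from collections import defaultdict

# Tiny labeled corpus: 1 = abusive, 0 = not abusive (invented examples).
docs = [
    ("you are an idiot", 1), ("shut up idiot", 1), ("total idiot behaviour", 1),
    ("have a nice day", 0), ("thanks for the help", 0), ("what a nice result", 0),
]

weights, bias = defaultdict(float), 0.0

def classify(text):
    # Bag-of-words score: sum the learned weight of each token.
    score = sum(weights[tok] for tok in text.split()) + bias
    return 1 if score > 0 else 0

# Perceptron training: nudge weights whenever a document is misclassified.
for _ in range(10):
    for text, label in docs:
        error = label - classify(text)
        if error:
            for tok in text.split():
                weights[tok] += error
            bias += error

print(classify("you idiot"), classify("have a nice day"))  # → 1 0
```

The tokenizer, the weight table, even the evaluation examples are modality-specific - which is exactly why the pivot to video rips through the entire stack.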
You can see how any changes in high-level decisions regarding data prod-
ucts can have significant (sometimes unforeseen) consequences down the
line. At every step in the data strategy design, we need to be mindful of
the INFLUENCE CASCADE, understanding that any fundamental changes to
requirements need to be accounted for in other elements of data strategy.
We need to budget for this risk, and this is yet another motivating factor for
having a StratOps approach.
Now it should be clear why the USE CASES is the starting point of data strategy
design. Let’s get to work!
Use Cases: Designing Data Products
innovative solutions to prototype and test. They define five common verbs
associated with the process: Empathize, Define, Ideate, Prototype, and
Test. I would go a step further and argue that design thinking is more than
just a process: it’s a way of thinking, which is particularly powerful when
reasoning about difficult-to-grasp problems (the black boxes we covered
in SYSTEMS THINKING FOR DATA STRATEGY), which the field terms "wicked problems".
Next, how do we proceed with gathering ideas? Doing this should be fun, but
as mentioned before, we should be wary of pilotitis. While it's easy to come
up with many ideas, the challenge lies in selecting the ones with the highest
impact that are also feasible with the available resources. We could build any
product we wanted with unlimited resources (including time). Unfortunately,
this is not our reality, and we always need to operate under constraints
(which can be even more limiting in the case of data products, due to
heightened expectations and often larger budgets).
The first issue we face is the cold start problem. Since the scope of a data
strategy is often expansive, getting started can be daunting. Stakeholders
might be reluctant to make an extensive investment before seeing tangible
results. In such situations, running a “lighthouse project” can be a good
idea. The task is to start with a simple yet impactful use case to demonstrate the potential value of data. If the project is a success, the data strategist can use the learnings and newfound
trust to proceed with other use cases. Additionally, there’s a high likelihood
that the organization can reuse some of the components developed for the lighthouse project.
The data strategist should execute the steps sequentially (if there are too
many ideas, it might be useful to do a rough prioritization round before
the feasibility assessment), but as you can see from the feedback arrows
between them, sometimes we need to return to the drawing board. This can
happen when we come up with a seemingly great idea that is also feasible from a technical perspective but, due to a resource constraint, still falls short of the final prioritized list. In that case, we must go back to the
ideation phase to gather new ideas. While it may seem like a setback, we
save ourselves much pain down the line (data products tend to be dangerous
when not appropriately executed) and avoid falling victim to the sunk cost fallacy.
Ideation
While it might be true that there’s no shortage of ideas in data work, good
ideas are hard to come by. What do I mean by good? And, perhaps more
importantly, how can we compare two (or more) data use case ideas? This
question might initially seem vague and subjective, but it turns out to be
surprisingly quantifiable. But before we go on to select ideas, let's first learn the best way to gather them: workshops.
Since this is one of the essential tools in the arsenal of a data strategist, it
makes sense to spend some time describing its basic elements and defining
attributes. What is a workshop? A colleague from my consulting days used
to joke that this term has been so overused recently that it mostly just means
a more extended meeting. To me, a workshop is a collaborative meeting with
a clear agenda, objectives, and facilitation focused on a topic that can’t be
resolved in any other way. Let’s have a look at the most important workshop
features.
Participants and key roles: As a rule of thumb, there should be around five to seven attendees. Any more, and the session can become challenging to manage and schedule. Any fewer, and we might not gather all potential viewpoints and ideas. We aim for a diversity of both opinions and skills (the more
T-shaped people* in the room, the better). The role usually assumed by the
data strategist is that of a facilitator (see GPT’s definition below). You can
think of this role as a referee in a football match - their job can be deemed
successfully executed when their presence goes mostly unnoticed. Since this
person is mainly engaged in leading the workshop, an additional person is
required to take notes and document the proceedings (you will refer to those
often later). For client projects this role is oftentimes assigned to the data
strategist, while the facilitator is nominated internally. The final key seat at
the table is reserved for a decision maker. You should always try to get this
person in the room, or at least have someone with a clear decision-making
mandate present. The reason is that workshops are often decision-making
focused, and can be disruptive (hopefully in a positive way) activities - thus
* People with a broad skill set and deeper specialization within a single area.
Now that we know a workshop’s essential features, let’s learn how to apply
a data ideation flavor. Here we’ll use many of the DUE DILIGENCE deliverables,
most notably the SYSTEMS AUDIT. There are many approaches to ideating data
projects, and from my experience, methods from design thinking transfer
very well into the data domain* .
Goals and deliverables: We should start with the end in mind. What do we
* This has now grown into the “data thinking” field.
need to run our use cases? We’ll need a list of prioritized potential products.
The deliverables need to be informative enough for technical requirements
gathering and generation of design documents (more on this in DELIVERY).
potentials and their respective values over this journey. Start automating
a human-led process first. One word of caution: ensure you optimize the
process in the right place. For example, it would be hard to see tangible
results if you automate a process coupled with a human action that is still
very manual. Imagine you are building a Computer Vision (CV) model to
check the quality of a particular material but still have a human manually
photographing each sample. You must look at the whole human process
first and ensure the absence of bottlenecks and redundancies.
After the warm-up, the participants gather ideas with sticky notes. This
activity should take around 15 minutes - it’s safe to assume the central
ideas should come out rather quickly. If you attempt to spend more time
on this, the participants’ focus can wander. The third session is where the
participants vote on the ideas gathered. There are different voting methods,
but I would recommend doing this anonymously (to ensure there’s no
political bias in the procedure), with each participant having three votes at
their disposal. This step can take around ten minutes. In the final stage, the
ideas are clustered into similar topics or common challenges they address
- here the data strategist needs to be quite active and ask many clarifying questions to ensure the meaning behind the ideas is clear and eliminate
possible redundancy. Finally, the clusters get prioritized using a tool such
as a two-by-two Impact-Effort matrix (more on this later in PRIORITIZATION).
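As a rough sketch of the anonymous voting step, assuming each participant gets three votes, the tally could look like this (the idea names and ballots are made up for illustration):

```python
from collections import Counter

def tally_votes(ballots, votes_per_person=3):
    """Count anonymous dot-votes; each ballot is a list of idea names."""
    counts = Counter()
    for ballot in ballots:
        # Enforce the three-votes-per-participant rule.
        assert len(ballot) <= votes_per_person, "too many votes on a ballot"
        counts.update(ballot)
    # Return ideas ranked by vote count, highest first.
    return counts.most_common()

# Hypothetical ballots from a five-person workshop.
ballots = [
    ["churn model", "data lake", "dashboard"],
    ["churn model", "dashboard", "lead scoring"],
    ["data lake", "churn model"],
    ["dashboard", "data lake", "churn model"],
    ["churn model", "lead scoring"],
]
ranked = tally_votes(ballots)
```

Because the ballots are anonymous lists rather than named submissions, the political bias mentioned above is kept out of the procedure by construction.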
This section provides just the boilerplate for a data ideation workshop. If
you need a bit more detail, have a look at three great approaches:
strategy.
• Data Thinking: A Canvas for Data-Driven Ideation Workshops (link): A
flexible, holistic method of gathering ideas.
• AI Ideation Cards (link): A collection of cards to help business users
quickly come to terms with what AI can do in a business setting.
Feasibility Study
Data product ideation is easily one of data strategy’s most exciting and
enjoyable elements. We should have a lot of excitement building up - and
we’ll need it now since the next step is trickier. Here the ideas meet reality,
and the data strategist needs to estimate the feasibility of those ideas,
eliminating many in the process.
Note here how architecture and technology go hand in hand. All too often, feasibility studies conducted within organizations neglect to take architectural constraints into account.
might be suitable components that the data strategist can reuse in new use
cases.
Let me explain how A&T constraints affected a real-world use case. Despite
considerable advances in cloud-first infrastructure and associated tooling,
making advanced use cases scale remains a challenge. This is showcased by the abundance and high salaries of data engineer, cloud engineer, and data architect positions - their skills are probably the most sought-after in the whole data industry. Regarding technology, we can be constrained by the off-the-shelf tools currently available on the market. A
good example of this is trying to build self-driving cars in the 90s. At that
point, people had already started to see the promise of neural networks for
CV tasks, but nobody could apply them to large volumes of images, let alone to real-time video object detection in a moving vehicle. Other technological
constraints also affected this use case, for example, the lack of good internet
speeds and coverage to transmit all the data (and support the latencies
required by the use case) or the raw computing power in the car itself. With 5G networks and edge computing, many of these constraints have been removed, and the self-driving car use case has become feasible.
* RTO stands for Recovery Time Objective (how long can the application be down), and RPO for Recovery Point Objective (how much data loss is acceptable).
Resource Constraints
The cloud itself can be a significant cost center. More on this later
in BUDGETING.
People can constrain what use cases we do in several ways. Beyond their
salaries, the most important factors to consider are their skills and experi-
ence.
Some use cases, especially ones that require a complex orchestration of ser-
vices, and managing mission-critical infrastructure (such as the aforemen-
tioned self-driving cars), can require not only knowledge of technologies
and frameworks but also pure on-the-job experience. More senior engineers
would not only know what needs to be done but, more importantly, how.
By working with experienced engineers, you could avoid technical debt
and unnecessary complexity in the code. The lack of such people can be
a tremendous constraint on what projects are feasible. The same goes
for skills themselves. If the current stack in the organization (that we
discovered in the CSA) is built on top of an arcane or niche technology, such
as COBOL or Elixir (as is often the case with legacy banking systems), it would not be feasible to switch to Python and readily take advantage of cloud services for that matter, since most of the SDKs* for AWS, GCP, or Azure
* Software Development Kit: a packaged collection of software tools that help developers write software
Here are the types of questions you can ask in this step:
• What are the skillsets of the people we have available? How do they fit
with the use case ideas?
• What are the experience levels of the data team members? Can they
cope with the requirements of a brownfield project, for example?
Data Constraints
In my book Python and R for the Modern Data Scientist: The Best of Both
Worlds20 , I argued that the data format is essential for what we can do with
it. Let’s see several different data formats to understand how they constrain
us. Note that this list is by no means exhaustive, but these are by far the
most popular data formats (beyond the standard tabular format):
Image data: Satellite imagery, celebrity photos, animal photos. Videos can
also be seen as part of this category.
All those different formats require very different technologies and sup-
porting A&T (another example of the INFLUENCE CASCADE). For example,
you might use R when working with time series data, while for image data you might reach for Python. Additionally, many algorithms will not perform similarly for more advanced use cases, such as machine learning. For example, for time series it might be more beneficial to use a Long Short-Term Memory (LSTM) neural network, and for text data an SVM.
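The cascade from data format to tooling can be made explicit as a simple lookup table. The pairings below follow the examples in the text plus common defaults; treat them as illustrative starting points, not prescriptions:

```python
# Illustrative mapping from data format to a typical stack;
# the entries follow the examples in the text, not a fixed rule.
FORMAT_STACK = {
    "tabular":     {"language": "Python or R", "storage": "data warehouse",
                    "model": "gradient boosting"},
    "time series": {"language": "R",           "storage": "time-series DB",
                    "model": "LSTM"},
    "text":        {"language": "Python",      "storage": "document DB",
                    "model": "SVM"},
    "image":       {"language": "Python",      "storage": "object storage (e.g. S3)",
                    "model": "CNN"},
}

def suggest_stack(data_format):
    """Look up a starting-point stack for a given data format."""
    try:
        return FORMAT_STACK[data_format]
    except KeyError:
        raise ValueError(f"no suggestion for format: {data_format!r}")
```

Encoding the mapping this way makes the INFLUENCE CASCADE tangible: change the data format key, and every downstream technology choice changes with it.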
The questions we need to ask here mirror those in CSA, but they should all
have the added information of new use cases we want to pursue. Armed with
the knowledge of those constraints, we can use them to prioritize the use
cases.
Prioritization
Now we can finally trim the use case list down. The hardest thing about
prioritization is choosing the right metrics. The format is straightforward;
any two-by-two matrix will do (there are other interesting methods, such as
PICK* ). Typically one of the metrics indicates the importance of the use case
- we can also substitute this for “impact”, “business value,” or something
similar. On the other axis, we can use “urgency”, “effort required,” or
“feasibility”. Then we can group the use cases into quadrants (again, it's a good idea to do this in a workshop setting). The use cases we should focus on first will appear in one of the four quadrants - for example, the very important and urgent (top right) - and we should start with those.
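The two-by-two sorting itself is mechanical once every use case has a score on each axis; here is a minimal sketch (the 1-10 scale, threshold, and quadrant labels are assumptions for illustration):

```python
def quadrant(importance, urgency, threshold=5):
    """Place a use case in an Importance/Urgency quadrant (1-10 scales)."""
    high_imp = importance >= threshold
    high_urg = urgency >= threshold
    if high_imp and high_urg:
        return "do first"   # top right: important and urgent
    if high_imp:
        return "schedule"   # important, not yet urgent
    if high_urg:
        return "quick win"  # urgent, less important
    return "drop"           # neither

# Hypothetical scored use cases.
use_cases = {
    "churn model": (9, 8),
    "logo color dashboard": (3, 2),
}
placed = {name: quadrant(i, u) for name, (i, u) in use_cases.items()}
```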
Here it’s good to jump ahead and look at the various measurement
metrics for data products described in DELIVERY. Ideally, the data strategist should measure the business impact discussed here using those metrics.
Data Architecture and Technology:
Establishing Foundations
First, based on the CSA, we should determine the type of architectural setup
we have (similar to what we did when estimating constraints): greenfield
or legacy. In the former, our job is to help design an architecture and technology strategy from scratch - there's none in place. In the latter, we need to ensure the new recommendations can be properly embedded in
or connected to existing systems. Unfortunately, there will almost always
be some legacy system (whether part of the data stack or another system
with which the data stack needs to interact) to take into account. This
second scenario is arguably harder to operate since it adds more complexity
and constraints to our target A&T design. The work we did in CSA should be
sufficient in deciding which situation we find ourselves in.
Next, we should do some definition setting. We hear and use the words
architecture and technology all the time but rarely pause to reflect on what
they mean. This leads to miscommunication and issues with “selling” our
work. If we can’t explain why we need a significant investment in rebuilding
or extending the data architecture, we probably won’t get the funds to do
so. If you ask five different people in tech to provide you with definitions,
you will get five different answers. This reflects the broad nature of those
concepts. To help here, we can use some of the analogies from the start of
this part:
Programming languages, data cleaning scripts, glue code pulling data from
third-party APIs* , enrichment queries to knowledge bases, processing for
ML training, the ML algorithms, and others.
Why AWS? The technical examples in this book are mostly focused on
Amazon Web Services (AWS). While there are other cloud providers, and
some of them have other benefits, I selected AWS since, in my opinion
(at the time of writing), they have the most diverse portfolio of services,
supporting many use cases. Almost any service in AWS has a correspond-
ing alternative in Google Cloud Platform (GCP) or Microsoft Azure. As
mentioned, I don’t want to go into detail about all the services and how
to use them, but if you need more information, an excellent resource is
the AWS Cookbook - it’s full to the brim with recipes you can use for your
data work in the cloud.
first, then open-sourced later. For example, Apache Airflow was first incubated at Airbnb.
of the building, which supports the elements inside. Or, for a more medical
analogy - this is the skeleton of the technology’s muscles.
Cloud native and Modern Data Stack are terms used to de-
scribe software or applications designed to run on cloud computing
platforms. These applications are built using cloud-native tech-
nologies, such as microservices and containers, which allow them
to be easily deployed, scaled, and managed on cloud infrastruc-
ture. Cloud-native applications are designed to be highly scalable,
resilient, and responsive, making them well-suited for modern,
cloud-based environments. They are also intended to be built and
deployed using a DevOps approach, which enables teams to rapidly
iterate and release new features and updates (such as serverless
computing, see APPENDIX). Overall, the goal of cloud-native design is
to enable organizations to take full advantage of the flexibility and
agility of cloud computing to drive innovation and improve their
operations* .
With the definitions out of the way, we can confidently proceed to design
the architecture and technology elements of the data strategy. Here we’ll
use the word target. As you remember in GAP ANALYSIS, in data strategy, we’re
always striving to achieve a goal, a target state. This is essential to always
keep in mind when working on architecture and technology since it’s easy to
* An excellent repository of stacks is available on moderndatastack.xyz.
get bogged down in unnecessary details. The most important outcome from
this part of a data strategy is preparing the deliverables and the shortest
path to achieving them.
Let’s start with the why21 (Amadeus Tunis explains why this is an essential
skill in INTERVIEWS): why do we care about data architecture? There’s no
better way to visualize this than Shopify’s Data Science Hierarchy of Needs.
Here’s a simplified version:
This diagram puts the three pillars of technical data work in context. With-
out a solid data collection, storage, and processing foundation, the whole
pyramid would break down long before we successfully deploy more ad-
vanced data science use cases (such as prediction) at scale. As we go
up the pyramid, more data engineering tasks such as transformation and
enrichment build upon the data architecture, enabling various data science
use cases.
As with many common complex topics, the best way to manage the com-
plexity of a larger task is to break it down into smaller, more manageable
chunks - this will also help us get started more quickly. We can break any
architecture (the stack) into different components. There are two ways to
do it:
The second option is to design in layers. Contrasting to the first one, this is
more appropriate when we want to support many use cases and have many
different sources available. This is the case for larger organizations. For this
approach, have a look at the layers below* :
* Here, the concept of abstraction layers we discussed in the SYSTEMS THINKING FOR DATA STRATEGY
Ingestion
This layer represents data entering the system. In large organizations, data
rarely originates from one source but rather from a set of many different
ones. This part of data architecture should be automated and orchestrated
(see the orchestrator diagram below). Examples of data sources include
third-party APIs, databases, data from IT systems (such as CRMs), scraped
web data, and others. The main challenges in this phase are orchestrating
the data collection scripts and working on the database schema (more on
this in APPENDIX). This layer depends heavily on the Storage one: depending
on the storage type, adjustments in ingestion are required.
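A toy version of the ingestion layer can be sketched as follows: each source is a fetch function, and a small orchestrator runs them and hands records to raw storage. The source names and record shapes are invented, and a real setup would use a scheduler such as Airflow rather than a plain loop:

```python
def ingest(sources, sink):
    """Run each source's fetch function and write its records to the sink.

    `sources` maps a source name to a zero-argument callable returning
    a list of records; `sink` is a dict standing in for raw storage.
    """
    for name, fetch in sources.items():
        records = fetch()
        # Store raw records per source, exactly as they arrived.
        sink.setdefault(name, []).extend(records)
    return sink

# Stub fetchers standing in for a CRM API and a web scraper.
sources = {
    "crm": lambda: [{"id": 1, "name": "Acme"}],
    "web": lambda: [{"url": "https://example.com", "text": "hello"}],
}
raw_storage = ingest(sources, {})
```

Injecting the fetchers as callables keeps the orchestration logic independent of any one source - the same pattern an orchestrator applies at scale.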
Storage
The data from the ingestion layer needs to be stored internally. The “single
source of truth” (SSOT) is an essential motivating factor for good data
storage. This means we should always hold the raw data somewhere in
the system, and any downstream processing should be stored separately.
Through this, we can ensure data quality and add some redundancy to the
system in case of errors in processing - in that case, we can always retrieve
the data. In the old days, when data volumes were smaller, and most of the
data was generated by internal systems, databases were the most common
place to store data. Nowadays, the volume, velocity, and variety (as covered
We can list the different options at our disposal for storing data. Those are
explained in detail in the APPENDIX.
Keywords: AWS S3, Azure Data Lake Storage, Google Cloud Storage, Ama-
zon Redshift, Azure Synapse Analytics, Google BigQuery
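The single-source-of-truth rule can be restated as: raw data is written once and never mutated, and every processing step writes its output somewhere else. A minimal in-memory illustration (the store layout is invented for the sketch):

```python
class DataStore:
    """Toy store separating immutable raw data from derived outputs."""

    def __init__(self):
        self._raw = {}        # single source of truth, write-once
        self.processed = {}   # derived data, safe to rebuild at any time

    def write_raw(self, key, records):
        if key in self._raw:
            raise ValueError(f"raw data {key!r} is immutable")
        self._raw[key] = list(records)

    def read_raw(self, key):
        # Hand out a copy so callers cannot mutate the raw layer.
        return list(self._raw[key])

    def derive(self, key, out_key, transform):
        # If a transform turns out to be buggy, we just re-derive from raw.
        self.processed[out_key] = transform(self.read_raw(key))

store = DataStore()
store.write_raw("events", [{"user": "a"}, {"user": "a"}, {"user": "b"}])
store.derive("events", "unique_users", lambda rs: sorted({r["user"] for r in rs}))
```

The redundancy mentioned above falls out naturally: because the raw layer is untouched, any processed dataset can be rebuilt after an error.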
For this large amount of unstructured data to be useful for the business
(let’s say as an input to an ML model), it needs to be processed. Example
processing steps include deduplication, feature engineering, dimensional-
ity reduction, and others. In most systems, this is done by orchestrating
different scripts and services. There are two modes of processing the ingested data: batch and streaming (expanded in the APPENDIX).
This layer is well represented by two paradigms, ETL (and more recently,
ELT):
Extract: The first step in the ETL process is to extract data from
various sources, such as transactional databases, flat files, or other
data warehouses. The extracted data is typically unstructured and
raw and may need to be cleaned and filtered to remove irrelevant
or duplicate information.
Transform: The second step in the ETL process is to transform the extracted data into a consistent format suitable for analysis. This typically involves cleaning, deduplication, standardization, and restructuring so that the data conforms to the schema of the target repository.
Load: The final step in the ETL process is to load the transformed
data into a central repository, such as a data warehouse, for storage
and access. This typically involves loading the data into a database
or other structured data storage system and then creating indexes
and other access structures to enable efficient querying and analy-
sis.
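The ETL steps above can be sketched end to end with nothing but the standard library; the source records and the `sales` schema are made up, and sqlite3 stands in for the warehouse:

```python
import sqlite3

def extract():
    """Pretend source: raw records with duplicates and string-typed values."""
    return [
        {"id": 1, "amount": "10.5"},
        {"id": 1, "amount": "10.5"},   # duplicate to be removed
        {"id": 2, "amount": "3.0"},
    ]

def transform(records):
    """Deduplicate on id and cast amounts to floats."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        out.append((r["id"], float(r["amount"])))
    return out

def load(rows, conn):
    """Load cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Swapping the order of `transform` and `load` calls would give the ELT variant: load raw rows first, then run the transformations inside the warehouse.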
Data architecture is meaningless if it does not support a good use case. And
each good use case has a goal - the results of this data ingestion, storage, and
processing need to be consumed by another system. This system can be a
backend application part of the standard IT architecture or an external user
who needs to use the results of the data product. An example of the latter is
an ML model endpoint which is then exposed via a frontend or a dashboard
of business-critical information. There is usually a CI/CD pipeline (more on
Examples: AWS ECS, Kubernetes, AWS API Gateway, AWS Quicksight, AWS
CodeCommit
Here S3 serves as a data lake for data coming from another system (i.e., a
CRM). The data is transformed with Glue, which automatically detects the
schema and makes it queryable in Athena. Dashboards showing the data
are built with Quicksight. And finally, a subset of the data valid for ML is
stored in Redshift as a data warehouse and consumed by Sagemaker for ML
* Version Control Systems, such as Git or Mercurial are used for writing code collaboratively.
Target Technology
Keeping those points in mind, let’s look at three ways we can look at
technology selection.
Data View
Remember the data format constraints from FEASIBILITY STUDY. Those data
types influence the technology selection deeply:
Let’s take the text data format as an example. An example use case would be
to deploy a sentiment prediction model whose output can be used for churn
modeling. In INFLUENCE CASCADE, I wrote how this could impact the type of
database we use (i.e., MongoDB). In this case, we should go open-source
since there are many great tools for sentiment prediction, including spaCy
and fastText. If we are in a hurry, we can also use a vendor; in that case, a
good option could be AWS Comprehend - that would allow us to be fast since
it integrates well with the rest of AWS, but we can have other issues, such
as limited language support and price (more on this in BUDGETING).
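To make the open-source path concrete, here is a deliberately naive lexicon-based sentiment scorer; a real project would reach for spaCy or fastText instead, and the word lists here are invented for the sketch:

```python
# Toy sentiment lexicons - stand-ins for a trained model.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "cancel"}

def sentiment(text):
    """Score text as positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# A churn-modeling feature could then be the share of negative messages.
messages = ["I love this product", "terrible support, I want to cancel"]
negative_share = sum(sentiment(m) == "negative" for m in messages) / len(messages)
```

The interface is the point here: whether the scorer inside is a lexicon, spaCy, or a vendor API such as AWS Comprehend, the downstream churn model only sees the label.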
Depending on the use case we have, there are different technology options
available for us. The following diagram shows the different groups of possi-
ble use cases:
ML use cases fall in the predictive category. Here we might go for Python
with its packages, such as tensorflow, pytorch, xgboost and scikit-learn to
name a few. On the other hand, if we have more of a prescriptive use case,
R, with its ecosystem of statistical tools, can be the best option. Finally, if we
have the descriptive product, even using a self-service analytics tool such
as PowerBI or Tableau can get the job done.
Workflow View
The final way in which we can look at the technology is the workflow view:
This maps neatly to the type of work that different people are doing. We fall into this view when an organization is in the legacy scenario, where the data strategy is focused more on improving existing use cases than on designing new ones. Here, for example, if a more exploratory workflow is required, we might go for a combination of R with self-service analytics tools.
Many choices that need to be made during a data strategy have a markedly
“qualitative” rather than “quantitative” flavor. Those decisions are grayer
and require a nuanced approach. In terms of selecting technologies, here
are recommended selection criteria:
Fit for purpose: This one should go without saying, but the tech we use
should fit the goal (see the “use case-driven” point).
Ease of use: The learning curve ideally should be low. You’d be surprised
how quickly this can become an issue if you work within deadlines and
other constraints or have a less experienced data team.
Open source vs. Proprietary: Nowadays, this is hardly even a discussion. Except for some niche cases (such as when working with GIS software), you should always prefer an open-source solution for data projects.
As the organization grows and data usage becomes more established and widespread, a new source of complexity arises: managing access to the data, and provisioning the technology resources different teams need to operate on it. This falls in the domain of data governance. Here's GPT-3.5's definition:
• Data classification: Establish a system for classifying data based on sensitivity, value, and risk. This might include categories such as public, internal, confidential, and highly confidential.
• Data ownership: Determine who is responsible for managing and protecting different data types within the organization.
• Data access: Set rules for who can access different data types and under what circumstances.
• Data retention: Determine how long different types of data should be retained and establish a plan for securely disposing of data that is no longer needed.
• Data security: Implement measures to protect data from unauthorized access, such as encryption, secure networks, and access controls.
• Data privacy: Establish policies and procedures to protect personal data and ensure compliance with relevant privacy regulations, such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA).
Data Governance: Managing Data Assets at Scale 94
• Data quality: Set standards for data accuracy, completeness, and timeliness, and establish processes for ensuring that data meets these standards.
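A sliver of these policies can be encoded directly in code. The sketch below implements a toy classification-based access check; the clearance levels come from the classification topic above, while the role-to-clearance mapping is invented for illustration:

```python
# Classification levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "highly confidential"]

# Hypothetical clearance per role - in practice this comes from policy,
# not from a hardcoded dict.
ROLE_CLEARANCE = {
    "analyst": "internal",
    "data engineer": "confidential",
    "dpo": "highly confidential",
}

def can_access(role, data_classification):
    """Allow access when the role's clearance meets or exceeds the data's level."""
    clearance = ROLE_CLEARANCE.get(role, "public")  # unknown roles get the minimum
    return LEVELS.index(clearance) >= LEVELS.index(data_classification)
```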
Data teams differ from traditional software teams in how they operate. For software teams, it's more accepted to function in relative isolation, with limited interactions with other functions; for data teams, this is often not enough. Data teams need to understand
Operating Model: Setting up the Organization for Success 98
the business use case even more, and the products they develop are often
consumed internally as well as externally. There are several models that an
organization can use to structure how its data teams interact with the wider
organization.
The optimal solution, as usual, lies in the middle: a hybrid between the
two. This has been pioneered by Spotify’s model of squads and tribes. While
there’s a centralized team, with all the support and organization that it
provides, data team members still are active within other organizational
units. This can be seen as “in-house consulting” or Center of Excellence
(CoE) format. This organizational model takes the most effort to pull off
successfully since there’s a need to balance the workloads and culture
carefully. The different models are shown in the figure below:
Center of Excellence
A fresh start: A brand new structure is created, and those people haven’t
worked together before. This often has positive effects, since many people bring a fresh way of viewing things, and office politics are minimal.
Any operating model is better than none. A clear structure can tremen-
dously improve productivity because the relationships and responsibilities
of data team members are transparent; what operating model you choose
for the organization depends on its data maturity.
Change Management
Even with a solid operating model in place, our work as data strategists is
not done. What’s missing is ensuring the right culture is in place. Since
“culture” is a subjective term and depends on many factors, such as industry,
geographical region, and others - I’ll approach it from a more practical
perspective by describing change management methods.
Improve skills and diversity: Everybody knows that diverse teams are more successful than homogeneous ones, but why? The fundamental reason
is that diversity allows for different viewpoints; thus, many more paths to
solve a complex problem are available. The same concept applies to skills.
You can see data strategy as planting valuable seeds that need fertile ground.
The line “culture eats strategy for breakfast” is often attributed to Peter Drucker, and I couldn't agree more. Even if you design and deliver a comprehensive data strategy, it would
be pointless if not accepted by the organization. This difficulty in accepting
the strategy is the most challenging task for a data strategist to address. This
is not a book on social psychology - unfortunately, there’s no substitute for
real-world experience. Still, I’ll attempt to give you a few starting points so
you can start making solid progress.
Before starting this task, we need to think about why changing a working
culture is so complex so that we can address those challenges head-on.
Unsurprisingly, we can probably reduce the main reasons for the difficulty
to the human factor. Any organization comprising people is an excellent
example of a Complex Adaptive System (CAS from SYSTEMS THINKING FOR
DATA STRATEGY). The complexity of this system increases non-linearly (that
is - very fast) with an increase in the number of workers. The composition
of the said workforce further influences this - knowledge work tends to
be even more complex (more demanding to learn, teach and automate).
Any complex system is hard to change, primarily because it functions as
a black box - its inner workings are difficult to understand and, therefore,
influence and change. This is a core topic of DELIVERY, but let’s visualize it
here, together with another concept I’ll cover in a second, activation energy:
Product demos. Data team members must serve as evangelists for data
in the wider organization. The work those teams produce is often seen
as opaque, and results and impact are not easy to understand - this un-
derscores the importance of making frequent product demos to a wider
audience.
excitement for the technology, and sometimes deliver real business value
by piloting the best of the projects developed in this format.
Use case ideation sessions. Used to foster collaboration between the data
teams and the rest of the organization. The data strategist can also use them
to make everyone more involved in the decision process on what should be
done with data.
Roadmap: Preparing for Delivery
We are now at the finish line of the data strategy design. Before we proceed
to DELIVERY, the final element is to scope the work, prepare a budget and
order the work items on a timeline.
Preparing a Budget
Starting with the software costs, I’ll give an example of an NLP project.
The only missing information is an assumption on the volume and type
of data we expect the system to ingest and process. The best way to do
this extrapolation is to base it on existing datasets. Let’s say we have
Now we can estimate the second part of the data strategy implementation
budget: the people cost. Beyond the obvious things a tech company needs
* For the AWS pricing calculator go here, for GCP here and for Azure here.
to budget for (such as laptops and other hardware), most of the cost is based
on salary. And the pay itself is based on three things: seniority, skillset, and
market (location). Now you can appreciate why it makes sense to make a
budget for the headcount only at this stage of the data strategy. We need to
be sure of the use cases and the architecture and technology to determine
the types of people we need. A good rule of thumb for the number of people
necessary is the concept that I have termed “atomic team” (see aside below).
Almost any use case is achievable by a diverse team of 4 to 7 people
(number based on what I mentioned in change management - more than
this, and you need an additional manager). To estimate the human resource
cost for your budget, you can couple the roles, skills, and experience levels
of such a team together with adjacent cost factors such as recruitment and
onboarding.
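The calculation described above can be sketched in a few lines. This is a minimal illustration, not a pricing model - every salary figure, multiplier, and role name below is a made-up placeholder:

```python
# A minimal budgeting sketch. All figures are placeholders - real salaries
# depend on seniority, skillset, and market, as discussed above.
BASE_SALARY = {"data engineer": 90_000, "data scientist": 85_000,
               "data analyst": 65_000, "product manager": 95_000}
SENIORITY = {"junior": 0.8, "mid": 1.0, "senior": 1.3}
MARKET = {"berlin": 1.0, "london": 1.2, "remote": 0.9}
OVERHEAD = 1.25  # recruitment, onboarding, hardware, and similar adjacent costs

def annual_cost(role, seniority, market, fte=1.0):
    """Yearly cost of one role, scaled by seniority, market, and FTE share."""
    return BASE_SALARY[role] * SENIORITY[seniority] * MARKET[market] * OVERHEAD * fte

# An "atomic team" of four, with a part-time Product Manager:
atomic_team = [
    ("data engineer", "senior", "berlin", 1.0),
    ("data scientist", "mid", "berlin", 1.0),
    ("data analyst", "mid", "berlin", 1.0),
    ("product manager", "senior", "berlin", 0.5),
]
team_budget = sum(annual_cost(*member) for member in atomic_team)
```

Even a toy model like this makes the budget conversation concrete: stakeholders can argue about individual multipliers instead of one opaque number.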
Some roles are not required full-time, for example, the Product Manager.
Such calculations are easier if you have a CoE type of setup and the same
team works on several use cases in parallel.
A good idea is to think in project chunks that a small team can accomplish.
I call this the “atomic team”. The term “atomic” (from the ancient Greek
atomos, “indivisible”) describes the smallest possible abstraction level.
Here’s an example:
It comprises four members, covering the prominent roles in data (there are
many resources detailing what different roles are responsible for; I
recommend Borek and Prill23 and the APPENDIX). It also has the needed hierarchy
(ownership) role. We can break down most data product development into
tasks that a single team of this size can accomplish. If the work cannot be
broken down into tasks of this size, you might need to rethink the project
planning and decompose it further.
We can also adjust the atomic team further to fit our specific purposes:
Finally, we can attach the budget to the data strategy. More often than
not, there will be further discussions which will depend on the particular
situation. Still, those methods should provide a good rule of thumb for any
budget calculations in your work.
Timeline Boilerplate
The chart consists of a timeline and several swim lanes underneath. This
allows several use cases to be run simultaneously, which is often the case.
The timeline scale can vary, but the most common is quarterly since it
coincides with regular reporting for larger organizations. Essential elements
are quality gates, a topic expanded on in IMPACT ASSESSMENT in DELIVERY.
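The structure of such a chart can be sketched as data: quarterly swim lanes per use case, plus quality gates. All names and quarters below are hypothetical:

```python
# A minimal sketch of the roadmap structure: quarterly swim lanes per use
# case, plus quality gates. All names and quarters are hypothetical.
roadmap = {
    "churn model": {"Q1": "setup", "Q2": "prototype", "Q3": "deployment"},
    "self-service BI": {"Q1": "tool selection", "Q2": "rollout"},
}
quality_gates = {"Q2": "baseline vs. prototype metrics", "Q3": "business KPI review"}

def lanes_active_in(quarter):
    """Which swim lanes (use cases) run in a given quarter."""
    return [use_case for use_case, lane in roadmap.items() if quarter in lane]
```

In Q2, both use cases run simultaneously and a quality gate is due - exactly the situation the chart is meant to make visible.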
This roadmap can be further customized if needed. For example, one thing
I did with a client in a rapid growth phase was to add Human Resources
(HR) to the timeline. The client wanted to see when additions would be
necessary (with their roles, of course). This was very useful for the HR
department.
We followed a MECE approach and covered the key elements. First came the
element that generates value (remember, we should always focus on the end
goal) - USE CASES. We gathered ideas and prioritized them based on feasibility.
After this, we designed the target DATA ARCHITECTURE and TECHNOLOGY to
support those use cases. Their fuel, the data, is managed in the policies we
defined in DATA GOVERNANCE. The non-technical element of a data strategy
is addressed in designing the teams and processes in the OPERATING MODEL.
Finally, we prepared a ROADMAP for implementation, supported by a budget
assessment.
By finishing this section, we have completed the “static” part of the data
strategy. In the next one, we’ll ensure that it gets delivered successfully!
Part III: Delivery
Strategy to value journey—Soft agile—Implementation forest—Lean data—The
knowledge factory—DataOps—Impact assessment—Portfolio management
Overview
“Well done is better than well said.”
–Benjamin Franklin
A good plan that’s easy to act on is better than a perfect one that nobody
wants to follow. Even if in the first two parts of the 3D data strategy
process (DUE DILIGENCE and DESIGN) we managed to formulate and prepare
an informed and holistic strategy, these efforts can eventually prove to be
in vain if the result is not adopted and acted upon within the broader or-
ganization. Much of this is because any organization is a complex adaptive
system, full of operational and communication complexity. When executing
a data strategy, we must rely heavily on many concepts and methods from
SYSTEMS THINKING FOR DATA STRATEGY. As the rubber meets the road, we will
continuously challenge our designs and assumptions. We’ll need to adjust
while staying focused on delivering value - in line with StratOps principles.
DELIVERY connects the designed data strategy and value generation. The
arrows in both directions show that this is a two-way process of constantly
adjusting based on experience gathered while implementing. Naturally, the
concept of “value” needs to be made explicit for such a setup to work; this
is why I dedicated a whole section to this - IMPACT ASSESSMENT. This element
is used to measure value generation at different quality gates* . DELIVERY
relies on two popular software implementation framework flavors - SOFT
AGILE and LEAN DATA, forming DATAOPS. Explaining them and how to apply
them to our purposes is the primary goal of this phase. Finally, PORTFOLIO
MANAGEMENT provides a high-level view of all initiatives so that decision-
makers can also have an opportunity to monitor and adjust the strategy as
its implementation is set in motion.
* It’s not easy to measure data project impacts, but options are available. I’ll cover them in IMPACT
ASSESSMENT.
DELIVERY is when the data strategist needs the most flexibility since the
complexity of applying data strategy can be staggering. This can also be a
frustrating experience, since it is here that we often discover any mistakes
and wrong assumptions made during the first two phases. It takes courage
to admit to those, but this is essential to adapt and carry on successfully.
Remember that no two organizations are the same, and all are continuously
evolving. Your impact won’t stay constant either, despite any preparation.
We can use an analogy for what we are trying to do here to help illustrate
the challenges better. Moreover, this analogy maps well to the data maturity
levels from CSA. Let’s take advantage of our journey analogy once again: on
our way to value, we must cross a river to collect it on the other side.
There are three different ways to cross this river. Some are
quick to set up but less efficient, while others are more so at a higher cost
(and time to build).
Our first option is to attempt to cross the river in a small boat. This is how
organizations of low data maturity go about it. We are likely to capsize,
rely on crew muscle power alone, and can be a victim of the whims of the
winds - with little control over where we end up on the other shore. Use
cases developed in such delivery scenarios rarely see the light of day (this is
what Harvinder Atwal calls “laptop data science”, or in my words pilotitis),
and team members eventually burn out and leave. Still, remember that this
book focuses on larger organizations - for smaller startups with less baggage
(smaller rivers to cross), this can be a viable strategy.
The final option is the gold standard for large organizations: we need a
bridge. Successful data-driven companies at the summit of their digital
transformation have bridges between their strategy and implementation
efforts. These organizations are so data-centric that even this analogy
would do them a disservice. A more appropriate way to term what they
have achieved is “data highways”. A recommended reading on how they
accomplish this is O’Reilly’s “Software Engineering at Google”24 . In this
scenario, the separation between strategy and implementation becomes
blurry, and most data product initiatives (especially those that are not core
innovation projects) are seamlessly designed and delivered.
Now that we know what we need, how do we get there? As mentioned in the
chapter’s introduction, we will use lean and agile in combination* . Instead
of providing definitions as usual, a better way is to show how they look in a
gold standard (ideal target state):
Lean: The data teams operate as a factory. Before them, there are clear
targets, automated processes, highly specialized yet overlapping roles, and
other characteristics of a productive factory. Development proceeds at
a reasonable rate continuously, measurably delivering value. In case of
issues, replacement parts are readily available, bottlenecks are rare, and
operational waste is reduced to a minimum.
Agile: The factory has processes to communicate with and adapt to a chang-
ing environment and requirements. If the need arises to change a feature of
the product developed by the factory or even replace it with a completely
different one - agile allows the factory to do so without decreasing value
output - on the contrary, minor adjustments can vastly scale the positive
* If you want to go deeper into them, good references are “Agile Project Management with Kanban” (Eric
Brechner) and “The Lean Six Sigma Pocket Toolbook” (Michael L. George).
value generated.
The following diagram shows how the two methodologies relate to each
other. While lean allows the factory to be productive, agile ensures it
responds to the environment successfully:
Cargo Cults
Similar to the disease pilotitis that we covered in DUE DILIGENCE, cargo
cults are the second most common disease among large (and small) orga-
nizations, preventing them from executing their strategies. Earlier in the
chapter, I mentioned the large tech companies and their data highways
- while those are good inspirations, we can’t copy their methods blindly.
Let’s see why with a historical example.
Towards the end of the Second World War, many Pacific islands were
adversely affected by the armed engagements between the US and Japanese
armies. At some point, there were enough resources available to the US
military that they started to supply these islands with food and essential
items. For this purpose, the Americans constructed basic makeshift
airstrips. This went on for quite some time, but after the end of the war,
the supply lines trickled down to zero. The islanders knew little about
aircraft, and they associated them with the delivery of supplies. Thus they
promptly created mock airstrips and airplanes from materials they could
find. Of course, those creations were far from functional, but the locals
believed their presence would automatically lead to supply deliveries.
Unfortunately for them, this did not occur.
We might feel far away from the Pacific islands, but those dangers are all
around us, especially when we adopt new methodologies. Instead of being
frustrated that our large organizations can’t adopt shiny new frameworks
readily, it’s helpful to remember that tech companies have been focusing on
digital products from day one, thereby significantly reducing the complexity
of the task. They did not have many of the communication issues that any
non-tech organization is bound to have, which hinder work between technical
and non-technical teams. And finally, they have vast amounts of money at
their disposal.
Does agile work for data projects? There are different points of view on this,
but most in the data community are wary of creating a cargo cult. For me
and many leading data strategists* , the correct answer is yes, but with soft
adjustments. I term this flavor: SOFT AGILE.
It seems we can transplant most of those ideas well to data products, but
how about the specific principles of agile (also a fundamental part of the
manifesto)? Here the situation changes, and the real difference with SOFT
AGILE begins to appear:
Soft Agile: Moving Fast Without Breaking Too Much 126
Breaking big work down into smaller tasks that can be completed
quickly. Useful advice for any knowledge work; we’ll take it.
Now you should see why we need a softer approach to agile. Let’s use via
negativa once again, and see how the rigidity of traditional agile fails data
strategy delivery.
We have the roles on the left and time progressing from left to right. Deliver-
ables are shown in white circles; the red arrows correspond to conflict, and
finally, the bombs, appropriately, for project failure. The team receives the
brief and begins a two-week sprint. Now we can play the scenario forward
and see what happens, step by step.
Setup
Data projects can fail even before they start, often by skipping the
preparation of the design document for the product and the definitions of
done (see aside below) for tasks. I have
seen a whole CoE setup provided with only the initial requirement: “let’s
do an AI for X”. You can substitute X for anything, such as improving the
patient outcome, warehouse operations, or new sales funnel performance.
If you don’t specify your goals correctly (see definitions of done and design
document aside below), even with the best of teams, the most you can hope
to get out of the data initiative is for them to dig a perfect hole in the wrong
yard. Also, pivoting mid-way through the project is not a viable strategy.
As you have seen in the INFLUENCE CASCADE, any changes have significant
consequences, and the project setup (in terms of people, hardware, software,
and access to data) needs to take care of that. Another issue that can occur
in the setup phase is not setting a good baseline* . How can you measure the
success of your churn model product in reducing churn when you have no
data on what the churn was before deploying it?
In our scenario, the team is oblivious to this issue because it will only
become apparent as the project progresses. The team is excited since they
can start with minimal fuss.
* This is also for data strategy. Despite improvements, I have seen it fail because the strategist did not
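The churn baseline mentioned above can be made concrete: record the pre-deployment churn rate and persist it before the model ever ships. A minimal sketch, with hypothetical record fields:

```python
# Sketch: record the churn baseline BEFORE the model ships, so impact can
# later be measured against it. The record fields are hypothetical.
def churn_rate(customers):
    """Fraction of customers who churned in the observation window."""
    churned = sum(1 for c in customers if c["churned"])
    return churned / len(customers)

pre_launch = [{"id": 1, "churned": True}, {"id": 2, "churned": False},
              {"id": 3, "churned": False}, {"id": 4, "churned": True}]
baseline = churn_rate(pre_launch)  # persist this value before deployment
```

Without that stored `baseline`, any later claim that the model "reduced churn" is unverifiable.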
Data dump
The first common hurdle is the lack of data. The data scientist has set up
their local development environment and is ready to go, but they realize
they don’t have access to data. Now they need to find the right responsible
person or team who owns that data, and even if they do, they can find
out they first need to ask their project manager and negotiate for it. Data
governance policies are rarely set up flexibly in large organizations - and
data is often siloed, so this can take weeks. Even if this is not the case,
it is rare that the data would be easy to download (because of size, format,
and so on).
* An example definition of done is provided in the Appendix.
† An example design document outline for a data product is provided in the Appendix.
Data access
At the same time, the business owner has asked the data analyst to provide
some initial reports on the data. Similarly to the data scientist, they realize
they don’t have access to the data and get in touch with the data engineer.
Unfortunately for the data engineer, a data dump is not enough for them.
The analyst needs real-time access via a self-service analytics tool, which
was not set up before. More ad-hoc work for the data engineer. Moreover,
they would often require approval for the BI tool (those can also be
expensive, especially in larger organizations where hundreds of users need to use
it). This can result in further conflict with other departments already
using a similar tool (let’s say Power BI) that is not suitable for the
specific needs of the churn project.
Prototype
After obtaining a data dump from the engineer, the data scientist has been
working on creating a prototype model. They developed locally on their
machine and are ready to provide the code for deployment. Usually, the data
engineer is also responsible for this, and unfortunately, again (at this point,
you should start to see why good data engineers are so sought after), they
realize they need to recode the modeling code from R to Python. Harvinder27
refers to this issue as the “throwing over the fence problem”. Technical debt
Deployment
Finally, the data engineer has the modeling code written in the language
they need. After deployment, the team realizes that the data coming into the
ML system is different in terms of format and quality. Now there’s a need to
set up the MLOps infrastructure and database connections (the scientist has
been working with flat files) to ensure continuous learning and performance
monitoring. The product team also complains that the users might notice
the several seconds’ prediction latency and that it needs to drop below one
second. This change requires a complete rework of the architecture
and development process.
Now we are reaching the final and perhaps most challenging situation where
data projects fail, even if they managed to survive the previous hurdles.
The team has built and deployed a good model but made it accessible
only via an API endpoint. The product team realizes that they don’t have
enough technical people on their side to consume this API (they didn’t
know about this because they weren’t kept in the loop from the beginning).
An additional problem is that it becomes unclear which system needs
to use this product. Finally, business users distrust the model’s accuracy
and begin to ask about details to understand its inner workings. The data
scientist realizes they should have created an XAI layer on top of the model
to convince them of its operation. You can see that even with an excellent
finished product, there are new challenges that can arise that can derail
everything, and we pay the price for the lousy setup - and the lack of data
strategy.
This story is admittedly rather bleak, and many of you will have had similar
experiences. Still, with the right tools most of those problems are avoidable.
I’ll provide them in DATAOPS, but before this let’s have a look at the other
framework for delivery.
Lean Data: Eliminating Waste
Lean Data Theory
One easy way to look at the lean framework is to see that, at its core, it’s
mostly about reducing waste. This is inherently in line with via negativa:
instead of an optimistic goal - “how can we have more resources”, we are
focused on a more pessimistic yet actionable one - “how can we do more
with the limited resources that we do have”.
Now that we know its background, how will it help us build the bridge
between data strategy and value? To answer this question, we need to
understand what a lean organization focused on knowledge work is. Instead
of looking at a negative example, as in the IMPLEMENTATION FOREST, I’ll show
you the gold standard - THE KNOWLEDGE FACTORY.
Much of the inspiration behind the lean methodology comes from the
industrial sector. To understand how this can apply to data projects, we need
to see the differences between knowledge work and manual labor in terms of
complexity. This is a vast field of research, but I’ll provide vital contrasting
points in several paragraphs. Let’s first ask GPT-3 to define knowledge work:
This definition already sets the stage for our view on complexity. Data
strategy implementation is a product of largely conceptual work and enor-
mous mental effort delivered over a sustained period by a large group of
people. All this while the environment is changing* . Those are not deal-
breaking issues, but they result in the number one enemy of data project
management: inefficient communication.
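One way to see why communication dominates is the classic pairwise-channel count: in a fully connected group, communication paths grow quadratically with headcount. A minimal sketch (the function name is my own):

```python
def communication_channels(n_people: int) -> int:
    """Number of pairwise communication channels in a fully
    connected group: n * (n - 1) / 2."""
    return n_people * (n_people - 1) // 2

# Doubling the group size roughly quadruples the coordination overhead:
growth = {n: communication_channels(n) for n in (5, 10, 20, 40)}
```

A five-person atomic team has 10 channels to maintain; a 40-person department has 780 - one reason inefficient communication becomes the number one enemy at scale.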
those new fields bring their own terminology, and their work can become
harder for a specialist to explain even to colleagues in adjacent
application areas, let alone widely distant ones.
Communication complexity
Person A communicates to Person B that the team should deploy the churn
model. From the first person’s perspective, this is a reasonably straightfor-
ward task. Unfortunately, this is an illusion - because in a simple instruction
like that, much complexity is hidden in the person’s context (the devil lies in
the details). What “deployment” means for Person A can differ from Person
B. The noise on the diagram represents this issue. The way to mitigate this is to
start with two simple things everyone should measure: how long a data task
should take and how much value its successful completion should generate.
Now we can proceed with the concrete methods that allow us to combine
SOFT AGILE and LEAN DATA to ensure value-driven data strategy implementation.
DataOps: Methods for Value Delivery
The first Ops concept I mentioned in the book was StratOps in INTRODUCTION.
The second one was MLOps in DESIGN (this one is a subcategory of DataOps).
Some frown at this explosion of terms, but I see it as a net positive -
this represents a growing awareness and desire to automate and optimize
knowledge work.
and projects. Now we can continue with a walkthrough of tried and tested
frameworks that support DATAOPS.
One of the oldest efforts to standardize the process of doing analytical work
was the Cross-Industry Standard Process for Data Mining (CRISP-DM); here’s how
it looks together with a basic explanation:
CRISP-DM (modified)
This is useful as a start, but it’s not enough for most modern use cases -
most of those steps are common sense for any practitioner. Microsoft has
improved on this by creating the Team Data Science Process (TDSP)* :
* You can read in detail on the TDSP here.
There is more detail here. For example, the data acquisition part is broken
down into the required activities (mirroring many architecture layers in
DESIGN). The deployment step is also modernized, with some more detail on
required tasks. But most importantly, there is the “customer acceptance”
step. This can mean an external customer, but perhaps more frequently in
data projects - an internal stakeholder. This connects to our “definition of
done” activity. One weak spot of this model is its machine learning focus
in the modeling step. Not every useful data science workflow fits here; for example, a
Data Platform
Building a data platform is one of the best mid- and long-term investments
a large organization can make in terms of data (especially after successful
PoCs). While this term has a very general meaning, for our purposes, it is an
abstraction layer on top of the data:
Data platform
The data platform contains all services, interfaces, and governance (access)
policies required by all different groups of people in the organization. Such
MLOps
There are different tools for the different aspects; for example, for CI/CD
you might use GitHub Actions; for model monitoring, MLflow. A good starting
point is to use the specialized services that all major cloud providers have -
for example, SageMaker in the case of AWS or Machine Learning Studio for
Azure. Those contain everything necessary (but can be expensive).
MLOps is a fast-moving field, and there are many other useful concepts
available, such as feature stores:
communication and integration issues. This also helps with onboarding new
people to the development process.
The same applies to the generation of reports. For example, a good idea
is to use shared YAML files to specify common project settings (such as
themes, database connections, etc.) that can be reused across projects.
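The shared-settings idea can be sketched as follows. In practice the defaults would live in a shared YAML file (loaded with, e.g., PyYAML); plain dictionaries keep this example dependency-free, and all keys are hypothetical:

```python
# Sketch of reusable project settings. In practice these would live in a
# shared YAML file; plain dictionaries keep the example dependency-free.
# All keys are hypothetical.
SHARED_DEFAULTS = {"theme": "corporate", "db_host": "analytics-db.internal",
                   "report_format": "pdf"}

def project_settings(overrides):
    """Merge project-specific overrides over the shared defaults."""
    return {**SHARED_DEFAULTS, **overrides}

churn_report = project_settings({"report_format": "html"})
```

Each project then only declares what differs from the organization-wide defaults, which is exactly the reuse lean is after.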
Tools that enable Infrastructure as Code (IaC) include Terraform, Ansible,
Puppet, Chef, and AWS CloudFormation.
There are many software tools to assist with this, including Confluence and
SharePoint.
Kanban
One of the most widely used frameworks for project management, this one
also originates from Toyota:
By limiting the amount of WIP and making the work visible on the
Kanban board, teams can identify bottlenecks and inefficiencies in
their process and take steps to address them. This can help teams to
deliver software more quickly and efficiently while also improving
the quality of their work.
When taken to data project management, this framework has benefits and
shortcomings. The positive is that it allows teams to get started quickly,
especially in projects which are hard to estimate (as mentioned before,
something common in data). The project manager would rarely be able
to create a detailed roadmap more than four weeks in advance. In this case,
teams can tackle tasks as they come instead of distributing the work in
sprints.
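The mechanics of a WIP-limited board can be sketched in a few lines. Column names and limits below are illustrative:

```python
# A minimal Kanban board with work-in-progress (WIP) limits. Column names
# and limits are illustrative.
WIP_LIMITS = {"todo": None, "in_progress": 3, "review": 2, "done": None}
board = {"todo": ["task-4", "task-5"], "in_progress": ["task-1", "task-2"],
         "review": ["task-3"], "done": []}

def can_pull(board, column):
    """A card may enter a column only if its WIP limit is not yet reached."""
    limit = WIP_LIMITS[column]
    return limit is None or len(board[column]) < limit
```

When `can_pull` returns false for a column, that column is the bottleneck - and making it visible is the whole point of the board.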
Scrum
Throughout the sprint, the team completes the tasks and pro-
gresses toward the goal. The team holds daily meetings called
stand-ups, where they discuss their progress and any challenges
they face.
At the end of the sprint, the team holds a review meeting to demon-
strate their completed work and gather feedback. This feedback is
used to inform the next sprint, and the process repeats until the
project is completed.
Shotgun MVP
Running many use cases at the same time with Shotgun MVP
Let’s say that we have several ways of achieving the same operational goal -
increasing the profitability of the sales department. One approach would be
* You can read more about Scrum of Scrums here
† You can read more about SAFe here.
I have named this approach “closing the loop”. Let’s illustrate with an
example of a complex project. We can visualize many digital products in a
sequential graph. Some things need to be done in various stages to complete
the product. For example, first, we might spend time on comprehensive EDA,
building a dictionary, and feature engineering. Only then would we proceed
with modeling. We can add all those different elements together* :
* This process is often called “Value Stream Mapping”.
We can then implement this chain before working on all the other optional
components. Through this, we can quickly validate our idea and see if
it brings measurable success. Then we can return and finish the other
components, resulting in incremental improvements.
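"Closing the loop" can be sketched as separating the required chain of the product graph from its optional components and building the chain first. The step names below are illustrative, not the only valid decomposition:

```python
# Sketch of "closing the loop": separate the minimal required chain of the
# product graph from its optional components, and implement the chain first.
# Step names are illustrative.
steps = [
    ("EDA", True), ("data dictionary", False), ("feature engineering", True),
    ("modeling", True), ("XAI layer", False), ("deployment", True),
]

def minimal_loop(steps):
    """The shortest end-to-end chain that can validate the idea."""
    return [name for name, required in steps if required]
```

Delivering `minimal_loop(steps)` end to end validates the idea early; the optional components then arrive as incremental improvements.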
Sources of Waste
Partially done work: pilotitis. It’s easy to start new projects, but making
them usable requires much more effort.
and automation.
Defects: No code is perfect, and there are always edge cases and unexpected
bottlenecks in data-intensive software systems.
Multitasking: This relates to the idea that human productivity drops dras-
tically if attention is split between different tasks simultaneously. This is
often the result of “scope creep”.
Waiting: This can translate almost one-to-one from the industrial setting
to data. Especially in larger organizations, it’s common to have tedious
processes and inefficient data governance, resulting in long waiting times
for the data team to even get proper tooling, let alone access to the data itself.
Lack of knowledge sharing: The result of siloing data team work and
members. In larger and less efficient organizations, the worst form of this
issue appears - different teams working on the same thing, reinventing the
wheel in isolation.
Extra processes: This concept relates to the idea of SOFT AGILE. It occurs
when the balance between rituals and work leans too far toward the former
over the latter, generally because of the cargo cult effect.
These sources of waste are a great fit for a workshop with a data team.
After explaining the concepts, the stakeholders can gather examples from
their work. Then clustering and prioritization steps can follow.
Heuristics
Simple and easy-to-follow rules often prove more effective than a big
strategy, especially for organizations in a low data maturity stage. A Harvard
Business Review article on strategic heuristics explains this idea well. The
authors explain how Napoleon innovated the art of managing a war during
the conflict with Russia in the 19th century. At that time, the lines of
communication between generals and front-line soldiers were virtually non-
existent (this was some time ago, before the invention of the telephone).
The relay of orders and information to and from the generals was primarily
accomplished by human messengers, often on horseback. It’s not difficult to
imagine how such inefficient communication channels could cause confusion.
Often, by the time the information arrived, it was already outdated -
irrelevant at best and dangerous at worst. To avoid this, Napoleon issued
his troops with a set of simple heuristics: for example, in the case of
total communication breakdown, march in the direction of gunfire and take
the high ground.
Heuristics can be helpful for people at the executive level as well. I have of-
ten observed analytics leaders applying basic rules to their work, especially
regarding time management. For example, when asked how they prioritize
different work areas, such as hiring, outreach, and development work, they
would often come up with a rule of thumb percentage, such as 10-15-75%.
A data strategist might expect the ideal case shown on the left: directions
and tasks flow in a structured way, delivering value. Unfortunately, in an actual project
situation, the reality is represented better in the right graph. Most obstacles
and value are hidden and can only be reached indirectly by different people.
In this case, the data team is on its own and can use simple heuristics to
guide its work. For example - if less than 10% of the data is missing - proceed
with dropping it.
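The missing-data heuristic from the text is simple enough to write down directly; the 10% threshold is the example value and can of course be tuned per project:

```python
# The missing-data heuristic as a one-line rule; the 10% threshold is
# the example value from the text and can be tuned.
def drop_missing(n_missing, n_total, threshold=0.10):
    """Proceed with dropping missing rows only if they are a small fraction."""
    return n_missing / n_total < threshold
```

The point is not sophistication but that the team can apply the rule without waiting for a decision from above.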
Impact Assessment: Measuring Our
Success
Let’s remind ourselves one more time why it’s essential to measure the
results of our data strategy:
In a nutshell - sustained success in data takes time - and DataOps helps get
there faster. This is the opposite of the pilotitis graph, but it brings its
own frustrations. There will be a point in any large data strategy
implementation initiative where the investment has been at a maximum, yet
the results are few. We have already covered the best way to alleviate this
- for example, by doing Lighthouse projects (see my conversations with
Alexander Thamm or Amadeus Tunis on those). Here we will focus on how and
when to measure
Impact assessment
The answer to the “when” question is simple. First - before the start. This
establishes the baseline that we will measure against. Afterward - at regular
intervals, depending on the roadmap. This is typically done in a meeting
with the RACI steering group members.
How about the measurable metrics for data projects? There are quite a few,
but this table provides the most common ones:
Metric - Description
- Time to Market: How quickly the product is provided to customers
- ROI: How much money the product makes as a return on investment
- Ramp time: Time for a new hire to become productive
- Deployment number: How frequent deployments are
- Actionable insights delivered: self-explanatory
- Value-add time: How much of the time spent contributes to value
- Proportion of reusable artifacts: self-explanatory
- Proportion of monitored ML deployments: self-explanatory
- Model accuracy: self-explanatory
- Speed of deployment: self-explanatory
Another common way to generate good KPIs and ensure they are aligned
with the organization is to construct a KPI value tree. It takes work, but the
results pay off for large projects:
KPI value trees can be useful for various purposes, such as helping
managers prioritize KPIs, identify performance gaps, and develop
strategies for improving performance. They can also help commu-
nicate the importance of different KPIs to the broader organization
and demonstrate the value of data-driven decision-making.
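As a rough sketch, a KPI value tree is just a hierarchy that traces a top-level business KPI down to the operational metrics a data team can influence. The KPIs below are hypothetical examples, not a recommended set:

```python
# A hypothetical KPI value tree: each node links a KPI to the
# lower-level metrics that drive it.
kpi_tree = {
    "Revenue growth": {
        "Conversion rate": {
            "Recommendation model accuracy": {},
            "Page load time": {},
        },
        "Customer retention": {
            "Churn-model precision": {},
            "Time to resolve support tickets": {},
        },
    },
}

def leaf_metrics(tree):
    """Collect the operational (leaf) metrics that ultimately feed the
    top-level KPI - useful for checking every metric maps to value."""
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_metrics(children))
        else:
            leaves.append(name)
    return leaves
```

Walking the tree top-down is what aligns the KPIs; walking it bottom-up is what demonstrates the value of each metric.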
Portfolio Management: A
360-Degree View
In DUE DILIGENCE, while gathering information in CSA, a central focal point
was to identify all existing data-related use cases in the organization.
Even small organizations have a diverse pool of current use cases (data-
generating or consuming systems), with larger ones sometimes having
hundreds. Our work in USE CASES will most probably have expanded this list
by adding new use cases to the roadmap. This expansive landscape of use
cases represents the “analytics portfolio”. To manage it, we need structure.
This falls in the scope of “portfolio management”:
Before I show an example, let me list the main motivating factors for
assembling a data portfolio:
There are other fields you can add here depending on your priorities - for
example, remaining budget, deadlines, model performance metrics, and so on. Each
use case should have at least a basic table. While you can use plain old
Excel to track this, a better way may be to use newer generic products
such as Airtable or Confluence, or even more specific tools such as YOOI and
Casebase.ai.
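Whatever tool you pick, each portfolio entry reduces to a small structured record. A minimal sketch in Python (the fields and values are illustrative assumptions; extend them with budgets, deadlines, or model metrics as suggested above):

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """One entry in the analytics portfolio (illustrative fields only)."""
    name: str
    owner: str
    status: str            # e.g. "idea", "pilot", "in production", "retired"
    business_goal: str
    data_sources: list = field(default_factory=list)

portfolio = [
    UseCase("Churn prediction", "Marketing", "pilot",
            "Reduce customer churn", ["CRM", "billing"]),
    UseCase("Demand forecasting", "Supply Chain", "in production",
            "Lower inventory costs", ["ERP"]),
]

# A 360-degree view is then just a matter of filtering and counting.
in_production = [u.name for u in portfolio if u.status == "in production"]
```

Filtering such records is enough to answer the basic portfolio questions: how many use cases are live, who owns what, and which data sources are reused.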
Summary
In the final phase of the 3D model, we ensured the DELIVERY of the data
strategy. The main obstacle in front of us was the jump from strategy to value.
For this, we needed to build a bridge, supported by the two pillars of SOFT
AGILE and LEAN DATA. You learned how to use DATAOPS methods to manage
the execution of the data products and projects we recommended in DESIGN.
This work would be impossible without adding quality measurement gates
in the IMPACT ASSESSMENT, following best StratOps practices. Finally, we
established the tooling and processes to monitor our ever-expanding
portfolio of data products in PORTFOLIO MANAGEMENT.
Interviews
Nicolas Averseng
Nicolas Averseng is the founder and CEO of YOOI, a data analytics and
management platform. He has extensive experience as a CTO and in other
leadership positions across a variety of industries, solving challenges with data.
BA: Nicolas, it’s such an honor to be able to discuss data strategy with you.
I think you have one of the most forward-thinking visions on the field, and
I’m sure the readers can learn a lot from our conversation! Let’s start with a
simple question. How did you end up being involved in data strategy?
NA: I started out in monitoring business processes, which is now known under
the fancy term "operational intelligence". Four years ago, I joined a consulting company as
CTO. This company was mostly staffed by a large number of data scientists,
and I saw my job as reinforcing and developing the technology side further.
Eventually, we wanted to help people with not only the data science part of
the work, but also the other elements of data strategy: the data platform,
architecture and production environments, pipelines, and so on. I built
the technical teams there while working with large enterprises - also with
the same angle as what I mentioned - helping them avoid the mistakes
of thinking too much about the “how” without having a clear view on the
“why”. With this experience, I started YOOI, aiming to solve this problem.
BA: This is a very interesting point. I would agree that the most frustrating
thing in data I see is when clients have invested a ton into people and
resources - and spend it on digging a perfect hole, but in the wrong place.
NA: Yes - and this relates to why there are so few companies
successful with ML. They don't manage to deal with the most basic - but
also hardest - part, starting with the why. Nothing else matters if you don’t
start with what you want to achieve and the business question. It might
seem obvious to many of us in the field, but this is a very common issue.
People would often come to me as a technical person: “We need to do X.”
And my first reaction is always, “Why?”. The first thing in a data strategy is
aligning the people on the ultimate goal and success criteria. Only when we
have that can we engage people around that same objective. Only then can
you work on the other elements, such as technology, processes and culture.
This potential misalignment is indeed the leading cause of failure. A good
illustration is what has often happened with data lakes: enterprises invest
a lot to build infrastructure, expect it to solve all their issues, and then
struggle to actually build and deploy ML use cases. To make an analogy: if
you have a big enough hammer, in which you have invested, everything starts
to look like a nail. And in the end, you might find out what you really need
is a screwdriver.
NA: Data strategy is the way to define what to do with data, in order to
support the business strategy of the company. It should always be focused
on supporting the business goals. A data project should have the same goals
and success metrics as a regular software project. Of course, there are
additional dimensions to data projects, such as the data itself, which add
to the complexity and uncertainty.
BA: So, do you see some significant differences between software and data
projects?
NA: There are some, but the biggest one is the uncertainty of data - this
proves to be challenging to many people and organizations, as it makes the
whole value chain more complex, more fragile. Another related source of
uncertainty is how to involve people in the process. This relates to one of my
favorite concepts in software, the three U’s: usable, useful, and used. While
this concept applies to all software projects, it is especially challenging
for data products. In the beginning, let’s say people do sales forecasting
“manually”. They do a great job, but it does not scale, and they cannot
focus on multiple product segments. The job, in this case, is to build a good
enough model that can automate this part of the work. But once done, they
discover that they don’t know how to deploy and use this model effectively
because they did not set a measurable goal or think about how to make it
actionable within their business process. They forgot why they were doing
this work in the first place.
BA: Couldn’t agree more. If you have a solid “why,” the rest of the data
strategy work takes care of itself. Let’s now talk about YOOI and the purpose
of your offering.
NA: I want to expand on the three U's first; this will help explain
the purpose of YOOI. The reason I like this concept is that it's so simple. Often
people forget that somebody needs to use whatever they built. How is all of this
going to be used by people? How will it be integrated into real processes and
drive real decisions? Even if you put a model in production, people may
still not use it if it behaves like a black box. This is also why you need
to build trust and metrics into your data projects, and engage users all along
to make sure they understand and buy into them. This is why we built YOOI
- we see it as a cockpit for data strategy. A place for the team to align all
those different dimensions, connect the dots and make sure the technology
is meaningful. Of course, you can always hire a great consultant like you -
BA: laughs
NA: - but all the great work might end up sitting somewhere on SharePoint,
gathering dust and not being used. People will then lose track of why the
work is done, and also, you can’t repeat this data strategy work every year.
It has to be a living, continuously learning system. What happens to the
data strategy in a month when a new technical requirement comes up?
The world is changing, and this is something we need to accept. At the
same time, we have to keep everyone aligned on the same goals, and this
is why you need a tool for this. Our tool is a combination of process and
visibility elements. It makes sure that when ideas are proposed,
there is sufficient information to make decisions on selection, budgeting,
and technology. This is a view of the complete value chain. Now there are so
many different tools available from the cloud that make doing data projects
easy. What is still missing is this monitoring part, bridging the gap with
project execution.
You can learn more about YOOI on the official website. Follow Nicolas on
LinkedIn.
Summary
Noah Gift
Noah Gift is the founder of Pragmatic A.I. Labs. He lectures at the MSDS
program at Northwestern, the Duke MIDS Graduate Data Science Program, the
Graduate Data Science program at UC Berkeley, the UC Davis Graduate School of
Management MSBA program, the UNC Charlotte Data Science Initiative, and the
University of Tennessee (as part of the Tennessee Digital Jobs Factory). He
teaches and designs graduate courses in machine learning, MLOps, A.I., and
data science, and consults on machine learning and cloud architecture for
students and faculty. These responsibilities include leading a multi-cloud
certification initiative for students (source: noahgift.com).
BA: I was listening to your recent podcast on DataFramed, and I loved it.
While listening, I was thinking - this person certainly has unique opinions
that need to be heard. I like your skepticism, and I believe this is even more
important in a field such as ours, where we have more than enough buzzwords.
Maybe we can start with your background. How did you end up in data?
NG: I first started in TV and film. They offered me a full-time job, and that
was it, the start of my career. But I just wanted a little more than that, and
I always wanted to make sure I got a degree, and I decided to go to Cal Poly
San Luis Obispo. I was interested in being a professional athlete, even perhaps
going to the Olympics, playing professional basketball. This is why I studied
nutritional science. I thought that was a good degree to learn more about
performance. And what was good about it is that nutrition science really is
a form of data science. All the courses you take, such as organic chemistry,
are very architectural in nature. Anatomy, physiology, and even dissection
can be seen as data science. You are inspecting the body and looking at
the parts, and seeing what they do. I also did experimentation on my own
body, centrifuged my blood, took doses of Vitamin C. After this, I briefly
pursued being a professional athlete. I was in the process of training after
college to play basketball, not NBA level, but a lower-end tier. But then I also
applied for a job at Caltech in Information Technology, since I thought that
I probably wouldn't make too much money doing the sport. I spent a few years
there, learning a bunch of stuff about Unix and Linux, learned Python. Right
after I spent three years there, I decided to go back to the film industry. A
lot of the stuff I learned at Cal Poly helped me make film pipelines. Film
pipelines and data engineering are essentially the same thing! After this, I
moved to the Bay Area and worked at startups for roughly ten years.
Since then, I've been consulting, teaching, and writing books.
NG: Yes! And even machine learning is very similar to film because we’ve
been doing distributed computing in film for a very long time. You have to
set up each job on a different node, and each does a separate piece, and in
the end, you must combine them.
NG: I think the issue is that many organizations hire academics - research-
oriented people. With research, you are not focused on production. You are
essentially hiring the wrong people to work for the company. They might
even be solving the wrong problem. Don’t get me wrong - it’s great to
have researchers available in some situations, but most companies need
operations. I think it is important to have a data strategist. Ultimately with
MLOps and data engineering, the thing you’re building is not a model or
data but a pipeline. It’s almost like the name of the discipline itself is
incorrect - if you say, “Hey, I’m doing data engineering”, ultimately you
mean you’re building a data pipeline. And it’s the same with machine
learning. So what is a pipeline? It’s dynamic; it can expand and contract or
react to different things. So you have to build the capability to respond to
things dynamically. That’s the opposite of a researcher. Research operates
on a fixed problem that’s constrained to a lab environment. The pipeline
should constantly improve and produce results.
BA: I know of two common analogies about data work that I also reference in
the book. One is the oil processing one, where we talk about data gathering
and enrichment. The other one is the kitchen analogy, where you have
the raw data in terms of ingredients, and you have the recipe. What I find
strange in the approach of larger companies is hiring some people on high
salaries and telling them - let’s do AI in X, where X can be anything. They
make the team, set up the roles, provide some data, and say, “let’s go, talk
to you when the results are ready”. But do you think we can do better than
this when planning data projects in a large organization? How would you
even start to think about such complex work?
NG: I would say that the oil pipeline analogy is pretty good. I think oil
processing exploded in significance in the 1920s when cars as a means
of transportation started to become more popular. Imagine a geologist in
Texas in those early days. They come to the site and start drilling - oil
squirting out of the ground. That could be good enough - if you are a
researcher. You could say: “Look at this, we found some oil!”. They’re just
taking the oil in a bucket and throwing the thing in the pot, spilling dirt
everywhere. The result of this is that maybe enough oil for running a car
will be produced once a day. That’s a metaphor for how data science works
nowadays. Now, what's the opposite? We can build an oil refinery. That is a
complete platform for oil production, with people and components working
on different subsystems. The result is that we can produce many
more units per day to run our cars. It's the complete opposite. If we look
at an organization that wants to be successful, the first part is they need to
realize that the majority of people will be like that person, digging the hole
in the ground and making a mess with the oil spilling everywhere. Instead,
you should use a platform, like AWS Sagemaker, Azure ML Studio, Google
Vertex AI, maybe a third-party tool. Use that platform and have everything
standardized. The goal isn’t to play around with the oil. The goal is how
many gallons of gas we produce per day. Similarly, if you force the data
scientist to work inside a platform that has strict rules, then you’ve already
made it much more likely they’ll be successful. A second component is also
if your oil refinery is producing diesel, and all your vehicles are unleaded
gas, well, you’re making the wrong type of fuel. That’s the other part of the
problem. There must be requirements that are mapped to the executives in
the company. Even if you’re using a platform, you have to make sure the
people build the right thing.
BA: This reminds me of another idea I’m developing - starting with the end
goal in mind. One of the worst ways for data science projects to fail is when
a bunch of intelligent and hardworking people spend a ton of effort on
the wrong thing. Nobody asked why. We can use all the platform tools, let's
say AWS Sagemaker. We can make a pipeline - but in the end, the business
unit often says - ok cool, we have this AI, but so what? How are we going
to use this? What happens now? Sometimes we would find out that instead
of exposing our model through an API, all that was needed for the business
was to provide a scored Excel file. A good example of perfectly building the
wrong thing. Another issue here is not planning for the integration part of a
data project: how it will fit into the overall IT infrastructure and who
the consumers of the solution will be. How does the end result connect to
other systems? Do you think those issues are a failure of planning, a lack of
skilled workers, or a strategic problem?
NG: Well, let’s go back to what data and machine learning engineering are.
They represent the ability to respond to change by getting and reacting to
feedback loops. The issue is when you are just building one thing without
the capacity for change. For example, I like to do jiu-jitsu. This is a pretty
interesting martial art. In theory, the goal there is when someone attacks
you, you submit them. To achieve this goal, you have to respond to events
dynamically. Let’s say someone jumps on top of you, then you get out and do
something else. The ability to react dynamically to any situation is essential.
I think that’s the issue with the data field. The goal isn’t to do something
static - it is to have a feedback loop to respond to the business. The feedback
should happen much quicker than it does now. A good example solution is
to show prototypes once a week.
BA: Another idea for tackling such projects is building a very basic pipeline
first, but end to end. Then you get feedback from the stakeholders
and commit to further work on the different components of this pipeline.
Another follow-up question I have is an admittedly very tricky one, about
project management and data science. What do you think of the combination
of agile development and data? Is there a better way? How do you think
data projects should be planned and executed?
BA: Let’s dig deeper into this. Even if we do a lightweight agile, how
would we communicate our process to senior stakeholders in a big company,
who might be expecting something else, who think our processes are not
rigid enough? Do you have some advice on that? How can this fit with a
traditional business?
NG: I think the three components of a successful project are a weekly demo,
tasks assigned in a lightweight ticket system, and a spreadsheet showing
the quarter’s plan. That’s it. And the demo is what the product managers
would show to the CEO. This demo should just be good enough (like you
said end to end). Then you can get feedback immediately. In this case, you
can quickly fix significant issues, avoiding unnecessary work.
BA: Another question I have is the word “pragmatic”. I heard you use it on
several occasions. Could you elaborate on it? How can we be more pragmatic
in this work?
NG: I think pragmatic means being ruthless about efficiency. For example,
let’s say we have a system that barely works that took several years to build.
The person who did most of the building would very much like to keep it the
same. The right thing to do is to clean as much as possible - imagine a pull
request where 25% of the codebase is deleted while the system continues to
work. This is pragmatic. Nothing’s precious; whatever is needed to improve
the system should be done. Working only on things that matter - that’s
pragmatism.
BA: Do you think that knowledge work can be automated? Where does the
future go of our field? Do we lose creativity in what we do? What skills do
you think are most important right now for data people to remain relevant?
NG: I would say that it’s surprising that people think that AutoML and such
tools won’t get better with time. Even very famous people would think that
it doesn’t work. Let’s look at anything that happened in the last 50 years.
Once you start automating anything, it will always get better with time. A
great example is the film industry. When we first started editing, we had
3/4 inch tapes, and they were analog. You had to dissolve with three decks,
using three different machines. You have the source tape, the destination
tape, and the black tape. And now, with my laptop, I can just click a button
that says “dissolve”. Of course, everything gets automated! Still, there’s art.
Editing is very creative. Such work will remain - the creator must provide
their signature. If you’re talented as a data person, you should be excited
about all of this happening because you’ll become more impactful with the
work you do.
Summary
June Dershewitz
June Dershewitz is a Data Strategist at Amazon Music. Before this, she spent 20+
years driving data and analytics strategies for industry-leading companies,
including Fortune 500 corporations and tech startups. She also serves as
Board Chair at the Digital Analytics Association. You can connect with her on
LinkedIn.
BA: Let's start with your story. How did you end up in data? It's a question I
always ask since there are so many diverse backgrounds in our field.
JD: I got my start a very long time ago, with a bachelor’s degree in theoretical
math. That was in the very beginning of the internet. After that, I got a job
working for a mathematician who was building a website for math teachers,
students, and professors to talk to each other. I got to do many things on
that research project - essentially becoming a front-end engineer. I got the
chance to understand how the internet works, which was really exciting and
new at the time. Eventually, I decided it was time to move into the corporate
world in San Francisco.
JD: Well, I’m originally from Oregon, and I love the west coast. I had been
living in Philadelphia, so I felt the need to go to a large city again. It was
in 1999, the middle of the first dot-com boom. And I figured I could get
a job! I started applying to front-end engineer positions. At one company
I was asked whether I would like to become a data analyst. They told me I
had the combination of skills necessary to become a great analyst - software
engineering and math. I accepted! I realized that I loved it, even though it
wasn't the vision I had for myself originally. That was the start of a very long
career in data. Since 1999, I've worked with data as an individual contributor
and built and led teams of data people, both on the brand side and as a
consultant.
BA: Those were very early days in data science. I assume there were no data
scientists back then?
JD: No, they were called statisticians! Indeed, I ran across quite a few people
who would consider themselves statisticians, who today probably would call
themselves data scientists.
BA: Being in the field for such a long time, do you think companies
know more now how to do data projects than before? The technology has
advanced quite a bit, but how about the more strategic part?
JD: It’s frustrating to see the same problems over and over that we keep
repeating and not figuring out. But I think we can build on ideas much faster
than before and iterate on them. An example of this would be A/B testing,
which a company would employ to optimize business outcomes. We’d like
to think that the dot-com organizations figured this out already, and any
competitive company out there is maximizing their investment. Well, that
can be true to a certain extent, but they certainly didn’t invent it. These
methods have existed even before the internet. For example, it was being
done by advertising companies to measure the effectiveness of direct mail.
Now we can just do it with much more ease, and we can do it faster.
BA: Interesting. Operationally, we probably still have the same problems re-
garding how people understand data. It might even be harder nowadays. My
next question is on the title of a “data strategist”. What do you think about
that? I know some companies use similar titles, such as data translator. Is
this a widely accepted and understood role at this point?
JD: Not really. I think that data-related job titles have always been somewhat
of a pain point. Throughout my career, I've at times cared more or less
about the job title. The job title a person holds sometimes is important and
sometimes not. Early in my career, I was making a move from an individual
contributor to a consultant. Before I started, the company’s co-founder
called me, telling me he was working on the business cards and would like
to know my job title. I was thinking - perhaps VP of something? He said,
ok - vice president of analytics. And that was my job title from then on.
But when you work for a 14-person company and you have the job title
VP of Analytics, it means you're going to do everything. And I generally
feel that way about data-related job titles as they have evolved over time,
especially with the "data scientist" one. Usually when I talk to other leaders
of data organizations about their staff makeup - the data
scientists, data engineers, and others - they usually admit those roles
mean different things in different organizations. On one extreme, you might
have a company where any person who touches data is called a data scientist.
And then in another one, you might have so many specific different job
titles where you’ll have a data scientist, research scientist, ML engineer
or data analyst. There’s no right or wrong. I think that data strategist and
data strategy are malleable terms that we can use to mean different things.
I don’t think they will become standard terms to describe a specific job
function in the company. I can contrast them with a title such as a “data
engineer”, which I think is very specific and tangible.
BA: Yes, “data engineer” is already quite an established one. But it’s safe
to assume that the role of data strategist has always existed before as well,
probably under a different name. Someone must have been taking care of
the “translator” duties in the organization.
BA: This is a perfect time for me to ask you for a definition of data strategy.
Do you have one?
JD: I’ve found several that I would mash together into one: data strategy
is a vision for how a company will manage and use data to generate value
for the business and the customer. This is still broad but could be broken
down even further. For example, what data do we need? How are we going
to source it? How are we going to collect it? How are we going to store it?
What technology will we use? Who will we share it with? What are the policies
for data? How will we use the strategy, in what areas of our business, and
to what ends? How would we know it’s working? And if we’re doing it right,
what kind of value is it generating? How do we describe and quantify this
value to the business or the value to the customer of all of the time and
money that our data teams spend on working with data, trying to serve the
business?
BA: This is great. Now we start to talk about the specific elements of data
strategy. I now have a question on whether a data strategy is something
static, such as a PowerPoint deck, or is it more of a continuous function that
someone is performing? I’ve had clients ordering giant slide decks, only for
them to be buried somewhere, never to be seen again.
JD: It depends. Let’s say you are a data person at a company that isn’t yet
sold on the value of data. You have a tough task in front of you because
it’s all about education and convincing executives to fund your efforts.
Because if you don’t have any funding, you’re always going to be at the
bottom of the barrel. Your work can be an afterthought, and that’s not
where you want to be. Let’s say there are a few people who do data work
throughout the organization, but they’re doing it at a really low level, mostly
on unconnected pilot projects. But if you can show results, you can use
this base to form a team or even multiple teams. And the more you do
with data, the more you can say you’re using data successfully across the
organization. Say we’re using data successfully in Marketing, but we haven’t
necessarily gotten a full value out of what we’re doing with the data in
Product. So you decide to build a Product Analytics team. And then perhaps
you can see how you can support the Sales team with data or insight. And
then, at the more advanced stage, you would be looking across the entire
company, and you’re collecting and managing all the kinds of data that
matter to the business. In the end you’ll be able to turn around and use that
to generate value for the business and the customer in all the ways where it
matters. I think that depending on the stage the company is at, you’re going
to see different variations of this process.
BA: Who do you think the customer of a data strategy is? How far down, up,
or sideways does this document need to be used in the organization?
JD: It depends on the org structure. It’s never going to be perfect. I’m sure we
can spend a whole hour just talking about different kinds of org structures
for data people and the pros and cons of each choice. But I think as long
as you understand what you’re striving for, you can compensate for the
weaknesses of any kind of org structure. I believe data strategy can work
best when incorporated into company-wide strategic planning. So if you set
annual goals about what you want to accomplish, hopefully, some of them
will be quantitative and require support and participation from data people.
Even if it’s basic business optimization, it’s meaningful as long as it helps
grow the business.
BA: I agree. I don’t think you can separate data strategy from business
strategy and hope for good results. How iterative should a data strategy be?
Should it be more of a living document or more of a static roadmap?
JD: People could discuss the value of long-term planning versus the effort
spent on implementation, but I see value in it - as long as it’s combined
with shorter-term plans that are directly related to execution. I think that
a well-thought-out three to five-year plan is a great idea. This can show
where the organization’s data efforts are today and the vision for where it
wants to be way off in the future. Still, I don't think you can just set it once
and forget about it. The strategy will get stale after a while, but it should be
able to serve you well long enough so that you can generate annual plans
under the umbrella of the larger, three to five-year one. And from there, you
can set tangible and specific quarterly targets. You always need the five-year
north star guiding you. I’ve found, especially with data science and machine
learning projects, that they can easily meander. You need to reinforce the
focus, even if it shifts over time. You can plan quarterly and build on top
of your knowledge, but everything should be aligned with the longer-term
plan.
BA: What is the most important thing for a company with low data maturity
to tackle first?
JD: I think that, as a business, they should have a clear understanding
of where they're going to get the most business value from their investment.
This is a good starting point to do the first proof-of-concept project.
BA: How do you think about managing data projects? Does agile work for
data? How do we go about estimating tasks and resources?
JD: I do think agile works. Of course, estimating how long something will
take is always difficult. And especially if you’ve got something big and
ambiguous and have nothing built yet. In that case, you’re not going to be
very good at estimating how long things will take. As a project develops,
you’ll better understand what is worth pursuing and what is not. This skill
will take time to learn. At some point, you can refer to your experience - for
example, the team compositions and skills, knowing who to involve, and
it does become easier. I think in the beginning, you’ll be able to estimate
things that are only one to three months out. And then, when it comes to
six months or a year out into the future, you really might not have much of
a clue. You might know the result you’re after, but you wouldn’t have a good
amount of information to estimate how long this will take or even who needs
to be involved at what level. I’ve seen in the past chronic underestimation
of data engineering effort for data science work. Also some confusion about
roles - for example, what should a data analyst do? How about a data
scientist? You often won’t have the luxury of bringing in people with all the
right skillsets to contribute at all times. This might slow you down because
you might have a data scientist who’s also asked to be a data engineer, and it
might not be their core skillset, or they might be doing it, but as a result, they
are not writing high-quality code. And so then you’ll need to have someone
come in later on to fix the problems that were made because they weren’t an
expert. I have also seen a lot of issues with trust-building with leaders.
BA: I agree. Doing cool things just for the sake of learning can backfire. As a
data scientist, one might think this is smart, but as long as the work delivers
no value, it’s useless. Can you tell me the biggest reason for data projects to
fail nowadays?
JD: I think it all started with the whole “sexiest job of the 21st century”
article. I think this oversold the field and made it seem like snake oil.
How will you actually set data scientists up for success when you don’t clearly
understand the value they will deliver? And I think modern data science in
terms of how it fits into a larger organization is better understood now. It’s
been around long enough so that people can ask and answer the question,
“what have you done for me lately”.
BA: Yeah, so to paraphrase a little bit: you would say that a lot of the issues
with data project success come down to high expectations? Leaders think they
can just put a data science wizard on the project, and everything will fall
into place.
JD: Yeah, exactly. One approach I’ve seen that I think works fairly well: start
with a small proof of concept project with a short turnaround time. Then
show its value. If you don’t do this before committing to a longer-term
investment, you might end up with a wasteland.
BA: In my book, I call it pilotitis - the disease of doing pilot projects only.
How do we ensure such smaller projects are successful in the medium term?
JD: This is not easy. One thing you can do is set goals for pilots. For example,
we’ll finish it by this date, and it will do those exact things. This way, you
keep its scope limited. After this, you can show that it has all the features
you feel are essential and there’s widespread usage on the receiving end. I
don’t think a data scientist could do this alone. Having a product manager
involved is important for scoping, gathering requirements, user acceptance,
testing, and keeping a backlog of feature requests. So it’s not really data
science work, but this is necessary for creating something like a long-term
program.
BA: It sounds like we do need this person in the middle. It doesn’t matter if
they’re called a data strategist or a product manager.
BA: What specific skills would you say this person should have?
JD: Product manager is a broad, generalized job right now. And the product
manager for something directly facing the customers of a business might
be different from a product manager for something else, such as a
recommendation engine. They can be one step removed from the end customer
who is receiving the recommendations. But, still, I think some of the same
skills apply. I think it’s about having an excellent understanding of why a
product is being built and being able to articulate that. Always have a solid customer focus, know for
whom the product is designed, and ensure users can use it successfully. This
person should also know where the project will be in the next quarter and
align on the long-term vision.
BA: Exactly. The most frustrating thing I’ve seen in my career is brilliant
people building the wrong product that nobody wants to use.
JD: Yeah. And people might make different choices. It’s often a case of
taking the product in direction A or direction B, with trade-offs in each
case. If the only person involved cares about the novelty and
complexity of the system they’re building, they may make a choice that is
not necessarily what the customers need. Choosing instead a simpler
approach that is not as technically sophisticated but leads to a better
business outcome may be the right choice.
Summary:
Martin Szugat
Martin Szugat is the managing director of the data strategy consultancy Daten-
treiber. He is also a lecturer at Hochschule für Wirtschaft Zürich and Program
Director of Predictive Analytics World & Deep Learning World Europe.
BA: Let’s begin at the beginning - how did you end up in data?
MS: I started my career already during school. After many manual jobs, I
decided I would prefer to use my brain more (laughs). Since I liked playing
video games, the next obvious step was to start programming. My father
had a client looking for programmers, and this is how I started coding. I
also started writing for magazines, such as the Visual Basic magazine. I dove
deep into the .NET area and started teaching other people. Around that time,
I was also one of the first people in Germany to become an expert in the
whole XML topic, which would be the origin of data in my career. For my
studies, I initially studied computer linguistics and philosophy but switched
to bioinformatics.
I also wrote some books. One of them was about social software. Social
media didn’t exist then, and people mainly meant blogs and wikis by this
term. I also had the idea to start a company with my bioinformatics pro-
fessor but decided against doing that and joined UnternehmerTUM instead,
intending to meet like-minded people. We created a social media agency
with one of them, doing digital collectible games (now you would call those
NFTs, so that was way ahead of its time) and Facebook apps. After several
years of this, I wanted to go back to doing data because of my background
in bioinformatics. I had never had the chance to apply those skills before,
and nobody was talking about ML or AI at the time. At that time, I started
Datentreiber with the idea of putting all my experience into one venture -
combining data and business. I also saw how many companies fail in the
topic and saw an opportunity to help and improve their processes.
BA: It seems like you had a very diverse experience. Which of the skills
you gained during this time are most valuable to you now? What did you learn
from doing bioinformatics or running your own company?
MS: The skill which stands out to me is learning the design thinking
approach while working with IDEO. Discovering design thinking was a
life-changing experience: afterward, I applied this mindset to all my
ventures and projects, and I still apply it in my consulting work today.
From bioinformatics, I learned something quite important when thinking
about models. The people in bioinformatics also got this wrong. Back then,
Support Vector Machines were trendy, and the scientists wanted to solve
everything with them, including how genes worked and other topics like
that. But the biochemists proved that most of those models were wrong.
They did real-world experiments and tested the modeling work against real-
world data. This was called the “ivory tower” syndrome - bioinformaticians
at the time were rarely working with someone in biochemistry or molecular
biology. Bioinformaticians wrote software for bioinformaticians. Avoiding
this condition is something I learned the hard way.
BA: I’ve seen nowadays that people try to put PhDs in a room with MBAs
and see what kind of ideas come from it. Not sure if that’s such a great idea,
but it sounds like a better approach.
MS: Yes, and this is why I decided not to create a company with that
professor. I’ve seen companies full of PhDs. If you ask them who will make
the sales, they have no answer - they think they are different and don’t need
it. In reality, you need some sales, marketing, and HR.
BA: Let’s now talk about the title of a “data strategist.” What do you think
about that? Is it necessary? Are there better titles?
MS: You always have to distinguish between the title and the responsibility.
Different titles can have a similar responsibility - whether they are
a data strategist, an analytics strategist, a Head of Data, or a Chief Data
Officer. There should always be someone, especially in bigger companies,
who carries that responsibility.
BA: Exactly. Depending on the organization’s size, you might need different
people at different levels. Especially at the very top, you need someone
with this analytical skillset who manages all use cases. As you said before,
miscommunication between technical and business people is common, and
that’s why you need a responsible person to translate between the two.
MS: Yes. Another essential responsibility this person must carry is ensuring
the data strategy is aligned with the business strategy. There should be a
strategy for all data and analytics initiatives and investments, and this person
must also be responsible for killing projects or use cases that are not contributing to
the business objectives. This goes more into project portfolio management.
For example, there are a lot of projects that fail because of issues with data
quality or availability. A data strategist needs to take the responsibility to
check the data sourcing, collection, and quality initiatives and ensure that
down the line, let’s say in three years, the data is available so that they can
implement the use case. One client project ran into exactly this problem a few
years ago: the use case could not be implemented since the data were simply
missing, and nobody had paid attention to this.
BA: It sounds like there was just no plan, no strategy. Sometimes executives
believe you can just hire some people, give them a broad target and let them
work. All of this is done without doing the essential homework - checking
that everything is in place. I agree this is one of the most critical attributes
of a data strategist.
MS: I think the best data strategists have a technical background in data
science, analytics, or a related field. If people just come from the business
perspective, they lack the skills and analytical thinking. If they studied
economics or something similar, they simply have a different way of seeing
and perceiving things. Data scientists from physics or biology have this
analytical thinking trained, which is very hard to get.
BA: Can you elaborate further? Do you mean a scientific mindset, experi-
mental and hands-on thinking?
MS: Yes, but not only. Most importantly, they realize that everything is
simply an assumption. A strategy itself is one big assumption. A great book
on the topic I recommend is “Good Strategy, Bad Strategy.” The author
has a lot of strategy consulting experience and is a professor at UCLA. His
first advice was to keep in mind that strategy is just an assumption and
always needs to be tested. You first design the strategy, and after this, check
whether it works out.
BA: This reminds me of the saying that all models are wrong, but some are
useful.
BA: I want to play the devil’s advocate here. While a data strategist needs
to have a scientific mindset, I think it’s equally important to be good
at dealing with ambiguity. This skill set is essential for communication
and dealing with more political issues, which are common with clients.
Ambiguity is also a part of any data strategy since, as you said, no strategy
is perfect, and many assumptions need to be made during data strategy
design. For example, when estimating budgets and resources, you need to
be comfortable providing concrete numbers, even if it’s not clear what they
are for.
MS: Yes, and there are multiple levels of assumptions. An essential element
of a data strategy is defining the data products you want to build. Each is
also based on assumptions, and you must have an experimental approach
to making them. With such a mindset, you can become a data strategist.
Still, there are a few other very useful skills such as mediation, moderation,
and communication skills. I would still say you can train all those skills, but
changing the mindset of people is hard.
BA: This connects nicely to my next question. How do you train people for
such work? I know this is a big topic for Datentreiber.
MS: Yeah, as I mentioned before, design thinking is the most crucial method
to be learned. At the beginning of the training, we are primarily focused
on teaching the basic topics. For example, what’s the difference between
descriptive and predictive analytics, and what’s machine learning. What’s
AI, and what’s not AI. It’s essential to focus on the fundamentals first.
There’s too much buzzwordy content out there, and you can tell when
people have spent too much time on LinkedIn. So this is the first level. At a
second level, we train people in our data strategy design kit and other
methods we have developed based on our experience.
BA: Do you also train people to teach others themselves? It’s
an essential part of the job of a data strategist to “train” C-level and business
people. After all, they also spend some time on LinkedIn and probably need
to be “un-trained” a bit first.
MS: Yes! We’ve learned a lot, especially in the past year, that one thing
you should do before you work on the data strategy design in a series of
workshops is to have a training session. You can introduce the business
people to the basics first (such as the difference between a metric and a KPI).
We noticed that the following workshops work much better if the people had
training before, because otherwise during the workshops you’ll have a lot of
discussions about the definitions of things. Sometimes people talk about
the same things with different words. Another issue that can also arise if
people have no training before is unrealistic expectations. In one case, I
remember one of the clients wanted to build an Alexa-like system for a car
workshop. I already knew this would be hard. The Alexa team at Amazon
numbers in the hundreds, and still, the product has issues. This is closer to
science fiction than reality!
MS: I think this depends on the company. When some companies talk about data
strategy, they really mean data architecture design. Or they just might
want a PowerPoint presentation for the management board to get the data
team financed. Others don’t want to do a data strategy but want a “very
concrete plan of how to create value with data and analytics” instead. But
then, I would call this a data strategy (laughs)! Some others don’t even
want to call it a data strategy. They would say that “strategy” is reserved
for the management board. It should also not be named “data” since the
IT department should be responsible for it. So, in that case, we would call
it a “MarTech Concept” or something like that. But that’s, in fact, a data
strategy.
BA: Right. People still want it and understand why it’s necessary, even if it’s
hidden under different names. Do people think of data strategy as a PowerPoint
deck? Do you believe organizations know what needs to happen after that -
how to implement things and measure success?
MS: I think this happens only in organizations with low data maturity.
Executing a data strategy is the only way to know whether it is good. Those
organizations should focus on doing more pilots and experimentation. I have seen this
issue quite a few times. For example, one client requested we build a
so-called KPI driver tree (also known as a value driver tree) with them. This would help them
understand the relevant metrics and set the objectives correctly. We did this
for several months, and after finishing, they were pretty happy and realized
its value. Still, I had to remind them that while this was a good start, it was
still just the first step of a longer data strategy journey.
In my experience, the smaller and more focused the data strategy, the
higher the likelihood it will be successful. We have also advised clients on
an overall company-wide data strategy. Still, we encountered the problem
that it became too superficial - that PowerPoint deck with a lot of
vague text, such as “employees should treat data as an asset,” or “all our
departments should utilize data in a way which is aligned with our business
objectives.” While those statements are undoubtedly true, they are valid for
any company - you can just copy and paste this text. What most companies
struggle with is creating a holistic, long-term data strategy. It needs to be
executable and have checkpoints where you can measure its success and
adjust if necessary - to see if it works out. That is a real strategy.
BA: How do you ensure the clients trust you with an expensive data strategy
and that it delivers results? One way is to run a prototype and show results
as quickly as possible. But still, a data strategy costs money, and the benefits
can become apparent much later.
MS: What helps, again, is to ensure that the data strategy is not superficial.
BA: Yes, this makes sense. Another one of my favorite questions is how
do you plan for things you can’t plan for? How do you ensure that a data
strategy does not get derailed, for example, when one of the prioritized use
cases doesn’t work out? And how do we deal with expectations relating to
this? Do you have any advice here?
MS: There are multiple things you can do. After designing the strategy or
product, the next phase should not be to start the execution or implementa-
tion immediately. You should have this experimental phase instead. There
you build prototypes, research, and try to falsify your assumptions. This is
another critical thing I learned during my training - always identify your
most critical beliefs and ruthlessly test them. A term for this is RAT - the
Riskiest Assumption Test. You can run one for any specific product but also for
the overall strategy. Then you make sure all your RATs are eliminated. Only after
that do you start working on the engineering part.
A second thing you can do is just accept that many of your assumptions
will just be wrong. If you understand this, the logical consequence is that a
strategy is never done. It’s something fluid: after the first draft, you need to
test it and perhaps throw it away entirely. Or maybe just modify it a little
and then retest. It doesn’t work if you just hire someone to do the data
strategy.
BA: Can we now discuss an important concept - data assets and products.
Can you define what a data asset is?
MS: I use the term data asset to describe a data source with a precise value
for the business. This implies a data product - some form of analytics, or
whatever you apply to this data to extract and analyze information from
it. This information then leads to better decisions, actions, and results -
ultimately reaching the objective.
BA: That’s a great definition! How about data products versus data projects?
What’s the difference between the two? Data products must be different
from other products, such as clothing.
MS: The answer to this question depends on who you’re talking to. If
you’re talking to business people, they might think about data products
as packaging the data itself and selling it to other companies. When you
speak to old-school data scientists like me, a data product can be defined
as the outcome of a data mining process, where you apply analytics to data.
Even an ad-hoc research paper can be considered a data product with this
definition. The definition is different nowadays: a data product is closer
to a software product - it’s the data plus the analysis software, whether
automated or semi-automated, ideally scalable and reusable.
BA: So, by that definition, a machine learning model exposed via an API to
serve predictions would be a data product?
MS: Yes, but it can also have a graphical interface. It can be a dashboard or
an application generating business reports. It’s shocking how often, even
nowadays, generating business reports is done manually, by hand. We have
this one client, and they have so many people generating reports, and that’s
their whole job. After generating them, the reports just sit on a SharePoint
somewhere, and it’s not clear who uses them.
BA: Why do you think such inefficiencies are still so widespread? Is building
data products a challenge in larger companies rather than startups?
MS: Yes, of course. In larger organizations, you already have a lot of people
who have been doing such manual work in the past. In one case, we were
working with a pharma company, and they had to create a study on how
time influences the effectiveness of drugs and whether that can lead to
potential side effects. Each year they would have thousands of new drugs,
with hundreds of people doing this analysis. Many of them would have
ad-hoc scripts and just copy and paste from each other without centralized
solutions or templates. Startups rarely have this since they have too much
pressure to survive, have fewer people, and often have the luxury of devel-
oping greenfield data products, which is much easier.
BA: In this case, if you were to automate such a process and create a data
product, how do you ensure that the people trust it? Especially in such
sectors, this is a big topic. I can imagine that even if it’s inefficient, it can
still be perceived as more trustworthy since many humans are involved,
rather than a centralized black box.
MS: Now we’re getting back to the whole design thinking topic, which is
why it’s so important - not only when you’re designing a data product, but the
data strategy itself. A central theme in design thinking is using the users’ or
stakeholders’ point of view. The best way to do this in data strategy is to
involve them in the design process. If they have a seat at the table and can
share their point of view with you on a whiteboard (making it more tangible
and visual), they can express themselves so that other people understand
it. People with a higher degree of understanding will trust and accept the
strategy.
This goes in both directions. It is also vital for data scientists to understand
the business process. Otherwise, they might design a solution for the wrong
problem. There are a lot of examples out there of a perfect solution to the
wrong problem. A survey by Eric Siegel confirmed that many
models are not deployed just because they don’t fit the business processes.
This happens because the technical people have no understanding of the
business. If you had no idea how a car works, you wouldn’t enter it.
MS: Of course. It depends on the person. Some people are very comfortable
when they just have a rough understanding. All they need to know is that
the car is secure. They can just enter the vehicle and feel safe. Others need a
much deeper understanding of the car’s inner workings to feel safe. The only
way to know what kind of people you are dealing with is to start working with them
in a workshop. Many potential problems can be avoided if you co-design and
co-develop data products and strategies.
BA: So, how can one go about learning design thinking? I think it’s still a
skill not widely known beyond certain circles.
MS: It was only during a project with IDEO, where I saw much more clearly how this mindset is
shaped, that I truly embraced it. I think it’s much more important to work
with other people who have already applied design thinking to projects and
exercise together. In my experience, many of the workshops we did were
much more successful if the people had already done some design thinking
beforehand. Otherwise, they might be in for a hard time - especially the part
where you need to think from the user perspective. Exercising this empathy
might sound trivial, but it’s the hard part!
Summary:
Amadeus Tunis
BA: What did your journey into data strategy look like?
AT: I’ve been in tech and media for almost 20 years. About half of it was on
the tech side, and half on the consulting side. I can trace my initial work in
data back to around 2005 when I joined a startup in New York which had a
focus on in-game advertising. Funny enough, this practice is coming back in
a huge way now, but it was quite revolutionary back then. We had developed
proprietary technology allowing us to integrate geo-targeted advertising in
Xbox and PC games globally - anywhere in the game where it was relevant
and wasn’t too intrusive. The company was bought by Microsoft a few
months after I joined. I then worked for Microsoft for about five years
as a product lead, integrating global advertising campaigns for hundreds
of companies across different industries like consumer products, finance,
automotive, entertainment and more. Everyone wanted to advertise in this
new way. Our analytics at the time were mostly focused on ad performance,
interactions, impressions and so on but also the actual performance of our
integration tools and creative assets.
AT: Yes, 2005 to 2010. To connect analytics with my work: at that time,
we looked into ways to improve our own tools & products as well as how to
get continuously more player interactions with the ads. We also wanted to
speed up integration time and actually managed to do so over time, going
from about two weeks for a global campaign to under two hours. We tried to
automate as much as possible and get more efficiencies from the integration
processes. Around 2010 there was a massive ramp-up in the mobile space,
and with Microsoft not being a big player yet at the time, I joined a startup
called Applico. I wanted to dive deep into mobile product development. I
joined as a creative director, but as you know, in a startup, everybody wears
many hats. We started with seven people and eventually scaled to about 80
in the first year. I was responsible for everything besides writing code.
That time was a hot market for people who knew about app development
and optimizing apps, so I joined CBS Interactive, a large corporate, to focus
on a single product but at massive scale. Focusing on their TV Guide brand,
we developed a powerful cross-platform mobile app and web products for
so-called “second screen” experiences. CBS owned the world’s largest data
pool of TV and streaming programming. What we built there would, later,
become a new standard, quite similar to features you now see on Netflix
or Hulu – lists and recommendations of what you want to watch. This was
something we built across every channel and every platform. Those lists also
contained other data points, such as information about your favorite stars,
channels, studios, genres etc. and could be used to link and collect more
relevant content.
Making this all accessible on mobile was a complex undertaking but also
really fun. There, I started diving deeper into data. I worked hand in hand
with the director of analytics on customer interactions, looking at specific
journeys and user preferences over time and at specific key moments. We
tried to figure out how to optimize and best personalize the user experience.
We didn’t have a specific term for it, but we were thinking about valuable
data assets. We started developing strategies and opportunities to monetize
product features better. We moved beyond simple advertising and used
historical data to optimize content, experimented with geo-specific timing
to get more signups, tested whether we should have a premium version of
the product, and tailored more refined audiences. We were also
looking into what data to ask from people and when - with the focus on
creating value for our advertisers and our business, building a strong first
party data pool. We had dedicated analysts creating the reports, and it was
exciting to convert those insights into a strategy helping to make more
money from data and learn more about user behavior. That was just the
beginning of this domain, I mean strategic thinking about data, at least for
me. There was not a lot of thought leadership around data strategy that you
could find online, so we experimented a lot.
focused unit, which became a spearhead for Analytics & AI projects for
Deloitte in Germany. Other organizations started doing the same thing,
but we were perhaps among the first ones to grow so big, at least in the
consulting space. We had an influx of great talent as data science and
data engineering jobs became highly desired, so we could pick the top 2%
of the talent applying. Eventually, the unit became quite sizable and was
distributed across the organization - moving from the centralized structure
to a more decentralized one, with Analytics & AI becoming a part of many
offerings and domains. Nevertheless, it was an amazing journey and allowed
me to work with a variety of global clients on cutting edge topics and
implementation across the entire data domain, from Analytics to ML/AI, Big
Data & Cloud to data strategy.
I wanted to keep progressing in the data realm, with a renewed focus on data
strategy, especially around customer data & analytics, so I joined Publicis
Sapient. I knew some great people there already and the organization’s
focus on comprehensive digital business transformation and its long history
in tech was very intriguing. I led the data team for the DACH region with a
personal domain focus on data strategy and have recently relocated to the
US, where I remain a lead in the data practice continuing my data strategy
journey with a great team and interesting projects.
BA: It sounds like you gathered experience from both sides - from the data
product side in industry as well as strategy consulting. Do you believe this
gives you a unique perspective?
AT: Yes, definitely, even though there can be a lot of overlap. Many
consulting companies build products for clients. So you’re not excluded from
implementation – in the end you are a service provider and that’s what
makes companies like Publicis Sapient unique: we strategize with our client
and then implement as well. I think that’s very valuable since you have skin
in the game. Doing that from both the product side in industry and the
consulting side is hugely beneficial.
In consulting, you do have the benefit that over time you have the op-
portunity to work across a number of different industries. While I have
a cross-industry focus, which is very helpful and interesting, I’ve worked
extensively with the automotive industry in the last eight years. It’s still very interesting
to me because recently, there’ve been a lot of changes and new trends in
that particular industry, such as connected vehicles, direct to consumer
sales and a focus on digital services. You have datasets used in entirely
new ways. We’re also capturing the behavior of how customers are using
the car - adding new data pools to online behavior. There are also other
exciting industries, such as healthcare and pharma, which operate within
their own unique, mainly legal, constraints. You can contrast that with the
retail sector, which in my opinion has been leading in the productization of
data use cases as they concern customer intelligence. Selling things online
is the primordial soup of insight gathering. Hence, you can see how the field
of data is obviously highly relevant across all industries.
There’s no client that I’ve spoken to in recent years who hasn’t done some
work in the data space, so it’s really a personal decision whether you prefer
working in industry or in consulting. The opportunities are vast. The good thing
is that nowadays there is a ton of information available for the data space
- in terms of online resources, published papers or books, and many take
advantage of that, so you rarely start at ground zero with clients and with
people, independent of industry or domain.
BA: Since you mentioned the deeper domain knowledge required for the
pharma (or really any) industry, what are the essential skills for a data
strategist?
AT: This is a unique job title, and nailing down the exact skills required
can be confusing for someone starting out. It also tends to vary somewhat
for each company and team. I would say the barrier to entry is probably lower
for somebody who has worked hands-on with data and has spent some
time in a cross-functional team with data scientists, data analysts, and
data engineers. They can take that step towards data strategy more naturally
than somebody who comes purely from the business side, such as management
consulting, which often stops at the tech boundary. From my perspective, data
strategy has a strong focus on how to offset purely technical cost and unlock
value on the business side, but you have to also know and anticipate where
the technical debt occurs or is necessary, hence some proximity to the data
itself is helpful. Sure, nobody knows more about value generation than the
business side but still, when it comes to feasibility estimation, nobody has
more practical knowledge than the people who have worked with the data
hands-on. These are constant balancing acts. We are answering questions
such as: what do we want to do for the business, what do we want to achieve?
Also, what can we do with our current capabilities and technology set-up?
Moreover, it’s also sequencing the work: where do you start and what do
you do next? And if you don’t have at least a high-level understanding
of technical feasibility and requirements, you will struggle to unlock that
value, find that balance and initiate activities in the right order.
side, you want to have the one that gives you the most benefit, including
the most profitable one, e.g. the least expensive overall. There’s the data
strategist somewhere in the middle, trying to balance that out between the
two, trying to get a grasp on attributable value on an asset that is very fluid
across the entire business.
BA: You have been in the field for so long. What is your impression of the
changes over the years? Do you think that after the initial hype of data
science, things have quieted down, and there’s some skepticism? Are orga-
nizations hungrier for concrete results, adopting a more StratOps approach
to implementation?
AT: Nowadays you can do a lot more than ever before of course - you can
efficiently gather much greater information from many systems distributed
in your organization. You can start stitching a lot more data together. In
In the mid-2010s the big data wave happened, with tools & platforms such
as Hadoop, Cloudera, Hortonworks, and others. The idea was that merging
a ton of data together would provide much richer insight. But it also
created new challenges, such as how to manage the opening of these
floodgates. There is a huge cost to doing this and it was not easy to solve
for. I worked with some automakers at the time who quickly realized they
needed governance on top of all the data. They were focused on specific sets
of data use cases, such as only focusing on production data for predictive
maintenance. That eventually proved a successful strategy, as they managed
to initially avoid dealing with privacy challenges that came from sensitive
customer data while learning how to handle the new tech from scaling first
MVPs leveraging huge data sets and machine learning.
I would say there always was a mixture of both successes and failures, but
things have definitely not quieted down. We are at a new stage now: a
massive uptick in the availability of ML packages, tools, and cloud platform
services combined with a new appreciation for data strategy and governance.
Before, open-source platforms were unreliable and hard to manage - it was
hard to get support. Tools have become much more structured, allowing
modern architectures to combine different datasets, packages, and ways of
working. This stabilized the technical part of data programs, and we found
a resurgence in the usage of machine learning. This was especially true for
everyday use cases such as audience segmentation and recommendation
engines. These are things that work very well at scale. They provide concrete
outcomes and fuel further interest in the field. We now have a handle on the
infrastructure issue in gathering massive datasets. We also made progress
on more advanced ML tools and scaling ML in a governed fashion through
MLOps. Data stitching became easier with tools such as Alteryx and Dataiku,
which make it easy to run on a cloud platform and manage data efficiently.
Now, as those efforts start to plateau in terms of cost, we need to ensure
they can be integrated and leveraged effectively within the business.
This is where data strategy comes in - and there has never been a better
time to start working on it. The Bernard Marr book (at least the first edition)
came way early; it was ahead of its time. It is just now that the broader
mass of business is beginning to work on problems such as how you define
success, meaning defining clear business KPIs that can be enabled through
the proper use of data. Also, while it’s very tempting to put on a slide that
80% of your data projects don’t succeed, I don’t believe that’s true. Many
do succeed, if only in a contained environment, but there is a gap when
it comes to moving the needle on the business and operations side. The
missing piece that makes the wheels turn is data strategy as it is key to
help identify and attain that expected value associated with data. Also, data
strategy is not something you do once. I’ve encountered companies who
said “we’ve developed a data strategy 2 years ago” but are still struggling to
extract that ROI on their data assets. I would argue that you have to refresh
your data strategy continuously to keep your data ambitions on track. The
roadmap always needs to be groomed. Most companies don’t have the same
business strategy for decades. They will always check where they are and
make adjustments, amendments or initiate resets. There’s always more
strategic work to be done. As organizations run out of excuses for why their
data programs don’t work - since everything has been done more or less
“great” from a technical perspective - it has become obvious that the
problems are more organizational, and that’s where data strategy comes
in.
BA: This is a great way to describe the changes. Maybe developments in the
data field mirror those in software, with a bit of a delay. Software products
have been around for a longer time, and probably they are easier to build
than data ones (also easier to measure their success). Now it’s much easier
to start, and the questions are more about designing data products than
architecture or data. We are now primarily focusing on how to provide value.
AT: That’s a good point. Data is not limited to tech, even though data
capabilities are most often placed into the tech side of a business. When
a data strategist looks at a dataset, they don’t necessarily think of a set of
numbers in a table - they think about it as an information packet. Through
this lens, we could think of data strategy as “information strategy”. From that
perspective, it’s easier to understand that while the technical part is of
course foundational, value delivery should be a priority. This often comes up
in conversations with my clients: you can’t build a data strategy without
having a digital strategy, which is derived from a business strategy. The data
strategy allows you to home in on the value pools defined by such strategic
guidelines. Without value being the driver, you won’t be able to know what
information packets you need to make that happen. The key challenges lie
in breaking that further and further down to the operational level and then
managing the complexity associated with the execution, achieving these
milestones, using what you have, scaling that, becoming self-funding and
then profitable. Those things also take time; you need to start with what
you have and then scale out, not harvest as much data as possible and only
then get started. Just be aware that until scale can be achieved, a lot of data
you have is a heavy cost factor, so though starting small and learning as you
grow is great, there needs to be the ambition to march your data initiative
through to a certain advanced level. This is what a data strategist is hunting
for - defining the key data assets and developing the roadmap of what
it would take to make them exponentially valuable. That is how you shift
into thinking about data the right way, from an asset perspective. That is
why the communication with many stakeholders is so important. You’re
highly dependent on many of them across the business, but it will be you
who keeps the focus on extracted value and the ROI of each information
asset.
AT: This is one of my favorite topics and something where I recently spent
time with my team developing methods to attribute specific value, that is,
real dollar amounts, to individual data assets. There are established economic
and scientific methods for data valuation. They exist but are not yet commonly
applied in P&Ls. There are three approaches. First, there’s the market
approach: this is a method based on the market price of a similar product
or data asset - what users are willing to pay for it. Second, we have the cost
approach, a calculation method that accounts for the costs related to
creation, management, and utilization of the data. That is taken from cost
calculations for software and technology initiatives fed by production and
replacement costs. And finally, there’s the income approach. So that’s the
calculation method where the effects of productivity, revenue, cost savings,
and efficiencies of data utilization are estimated. If possible, it’s best to use
all three simultaneously. This will give you a range since the numbers will
differ.
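The three approaches described above can be sketched as a small calculation. The following is a minimal illustration; the asset, all dollar figures, and the discount rate are invented for the example, not taken from the interview:

```python
# Hypothetical valuation of a single data asset using the three approaches
# described above (market, cost, income). All numbers are invented.

def market_value(comparable_price_per_record, num_records):
    """Market approach: the market price of a similar data asset."""
    return comparable_price_per_record * num_records

def cost_value(creation_cost, annual_mgmt_cost, years_held):
    """Cost approach: creation plus management/utilization costs."""
    return creation_cost + annual_mgmt_cost * years_held

def income_value(annual_revenue_uplift, annual_cost_savings, years, discount_rate):
    """Income approach: discounted revenue and cost-saving effects of the data."""
    return sum(
        (annual_revenue_uplift + annual_cost_savings) / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )

estimates = {
    "market": market_value(0.50, 2_000_000),           # $0.50 per record
    "cost":   cost_value(300_000, 150_000, 3),         # built three years ago
    "income": income_value(400_000, 100_000, 3, 0.10), # 3-year horizon, 10% rate
}

# Applying all three simultaneously yields a range, since the numbers differ.
low, high = min(estimates.values()), max(estimates.values())
print(estimates)
print(f"valuation range: ${low:,.0f} - ${high:,.0f}")
```

The spread between the low and high estimates is itself informative: a wide range signals that the asset’s value depends heavily on which lens the business applies.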
Let’s say you have your use cases defined. You need to break those into their
components to define the individual data assets. Then you can apply those
three methods to get the potential value attributed to each data asset. This
is one of the most immediate challenges we cover with many customers and
one of the hottest questions in the industry. It is important to consider
that there are certain underlying principles regarding the value of data
that are just different from principles of other goods. For example, more
recent information is more valuable than older information, data does not
deteriorate from a technical perspective, data gains value when combined
with other data, data gains value when it is heavily used and so forth -
you must consider those for your calculations. Hence traditional models to
calculate intrinsic value of business assets cannot be applied.
BA: It’s interesting how you talk about data in terms of information pack-
ages. It almost blurs the distinction with use cases. How useful is it to talk
about those separately? There are always situations such as having several
different datasets supporting a single use case. Moreover, the products
won’t deliver value even with good data if the use case is not designed
correctly. What do you think about this separation?
AT: For me, it’s simple. There is no absolute value of data. It just does not
exist. There is only a relative value of data.
AT: Let me explain what I mean, with a client example. I was working with
this global automaker and the CIO was very proud of the number of millions
of customer records. That was a critical KPI - how many individual customer
data sets they had. But this perspective did not attach any business value
to them beyond operational or transactional uses. The CEO looked at that
and eventually asked, “so what?”. How much money are we making with
these millions of records? What are we saving? What benchmarks do we
have to surpass? How are we converting on these? The overall value of
the platforms and people used (which were very expensive) to gather and
maintain these customer records was not transparent. It was thus important
to tether specific value to each data asset. The aforementioned approach
finally allowed us to do just that.
transparency you can acquire regarding what data to pursue and where there
is just huge technical debt. Once you know the ROI of your assets, doors tend
to fly open, and you can make use cases happen. But this is one of the most
challenging conversations, and IT can rarely answer it by itself. The
business side also does not have enough technical understanding to gauge
how much has to be spent on tech to build these use cases. This is where
data strategists come in - they translate back and forth between the two
sides and anchor relative value to each data asset.
BA: What is the sequence of activities in a data strategy for a big client? How
do you sequence topics like governance, architecture, and others and focus
on the most important ones for the client?
AT: The next focus area is the future state strategy. What are you trying to
achieve? So that’s where you look at the business and the digital strategies
and try to understand as much as possible which direction the business is
aiming to go. What’s the commercial impact they’re trying to drive? And it’s
not you as the data strategist going it alone. You work closely with
people who know that information. Again, you’re going to come into contact
with a lot of stakeholders, both in business and IT.
So, now that you know where the client is situated, what they’re doing today,
and then what they’re trying to achieve, you try to get a sense of the gap
between these two states.
In order to close that gap, you have to always look into how you can support
value identification and realization, the third focus area. How is value
defined? What use cases contribute to what and what are the priorities?
Sometimes, the future and the current state are not that far apart in terms
of realization. It’s just that the setup is not structured and organized in a
way that the data can contribute to making value happen. Hence, you focus
in on enabling this. How do you get there? What’s the route to deploy to
production? What do you need to prototype? How do you prioritize the
backlog to get there? Most often they’re not going to be able to make tons
of money with data tomorrow, but there should be a route to becoming self-
funding, getting an initial process going that does not require never-ending
investments. And at that moment, it’s not just the technical components
that make that happen. It’s also the governance aspect. Who’s going to own
the data products? What’s the target operating model? How do you procure
and leverage capabilities that are already in your business? And who’s going
to help manage that change?
And then you look, in a fourth focus area, at the enablers: the data itself, the
roles and responsibilities required and the technology to achieve the future
state ambition. What do you need to improve the data quality? Do you need
a new exchange layer to help get data from A to B? Are there any security or
privacy issues? Do you need to hire or train people?
Of course, the technology layer is always underlying all of this. The hosting,
the infrastructure, DataOps, microservices, etc. It’s not wrong to look at
these enablers relatively early, but often they tend to be the only aspects of
data programs that are focused on.
This might sound overwhelming at first, but don’t worry, you don’t spend
a year building an enterprise data strategy. This is something you can do
in 8-12 weeks and if it’s a smaller company or just a certain part of the
business, perhaps in 4-6 weeks. If you prepare it well, you can develop
such a plan relatively quickly. At least to get to the point where you can
start implementing and then be agile, continuously experimenting and
optimizing. Even though this might sound overwhelming initially, it’s a very
valuable exercise because it creates transparency for everyone. People from
the tech and IT side can look at it and understand “this is why we’re doing
this”. Stakeholders from the business can look at it and have that “aha”
moment: this is what’s required for me to achieve this specific value from
the data that we have or can procure. There is a transparent ROI assessment
of the data assets, which allows stakeholders to have a clear understanding
of what can be achieved when, and everybody can participate in making this
happen.
AT: Many companies are, as I mentioned earlier, still realizing that there
is a need for data strategy in the first place. Once you have started the
process, it’s not helpful to be overly critical of expected outcomes and
shut things down when they aren’t achieved right away. It’s like investing
in mutual funds and checking your investment portfolio daily. It can be
interesting, but it’s not helpful, especially if you cash out the moment things
go a bit south. You must give yourself some time to evaluate (and improve)
the performance. It’s a learning process which necessitates that you adjust
over time. I would argue that it’s challenging to build a long-term data
strategy that’s more than 24-36 months out. Just like with a prediction
model, it gets fuzzy after a certain point. In that case, it makes sense to
revisit and adjust the strategy at regular intervals.
While you may have a team of data strategists help you develop the data
strategy, you should continuously keep some data strategists, if only one
or two, on your data initiative permanently. They should be proponents
of the developed plan and roadmap and hold up the banner for bringing
everybody back to the table on a regular basis. They will constantly aggre-
gate information on current status, successes, necessary improvements and
so forth. Have we moved from POCs to MVPs yet? Do we have anything in
production? The strategists communicate progress and can then adjust the
ROI models based on respective measurements. This should be, again, a
continuous process. You can then bring back a focused team twice a year to
walk through the full data strategy framework, even if it’s just a 2-3-week
exercise as a checkup, to aggregate the necessary information and see if the
strategy implementation is on track.
BA: Awesome! A final question from my side. Do you have a structured way to learn data strategy?
AT: There are several data strategy courses online, offered by Udemy, MIT,
and others, but I cannot speak to the quality or content they offer. I do
believe that they have their merit, though I also think that data strategy
is similar to mixed martial arts. You’re best off practicing various disci-
plines and then combining them as needed. Also there is not really a
single approach, because as with any strategy it will differ from company
to company. They are supposed to be different because clients of course
want to differentiate themselves in the market. So I believe, what you
can do is you can teach the basics around the main topics involved. For
example, data analytics, data science, some data engineering, business
strategy and eventually digital product development. Then my advice is
to get into projects quickly, shadow people who have done it before, and
participate in the activities as much as possible. Anybody working on data
strategy should have a clear framework and approach to the development
process. Helpful, supporting tools are templates like a data use case canvas,
interview guidelines, and a scoring matrix to assess relevant maturities.
There are many preexisting frameworks out there, so test them out and
see which ones help you with what you’re trying to achieve. For example,
you can start with a data asset map. It allows you to place all the use cases
you have on one axis and all the available data on the other. Then, mark
off which data can power which use case and what the status of that data
is, with red, yellow, and green, Harvey balls, or whatever you prefer. This will
generate an easy-to-understand heat map of what is immediately possible
and what is not. Then you can further investigate the quality, availability,
recency, frequency, latency, etc. of the data to get more granular and then
make an informed decision which use cases are ready to go and where you
might require further foundational activities like improving data quality or
acquiring further data. In the end, you sequence these activities out on a
roadmap. What are you going to do first? What are you going to do next? A
few preexisting templates can thus go a long way.
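The data asset map described above can be prototyped in a few lines. In this minimal sketch the use cases, data assets, and traffic-light statuses are all hypothetical placeholders:

```python
# A minimal data asset map: use cases on one axis, available data on the
# other, with a traffic-light status per cell. All entries are hypothetical.

use_cases   = ["churn prediction", "predictive maintenance", "recommendations"]
data_assets = ["CRM records", "sensor telemetry", "web clickstream"]

# Status of each (use case, data asset) pair: "green" = ready to power the
# use case, "yellow" = needs work (quality, recency, latency), absent = N/A.
status = {
    ("churn prediction", "CRM records"): "green",
    ("churn prediction", "web clickstream"): "yellow",
    ("predictive maintenance", "sensor telemetry"): "green",
    ("recommendations", "web clickstream"): "red",
}

# Render the heat map as a plain-text grid.
print(f"{'':<24}" + "".join(f"{d:<18}" for d in data_assets))
for uc in use_cases:
    row = f"{uc:<24}"
    for d in data_assets:
        row += f"{status.get((uc, d), '-'):<18}"
    print(row)

# Use cases with at least one green data asset are candidates to start with;
# the rest need foundational work (quality fixes, data acquisition) first.
ready = [uc for uc in use_cases
         if any(status.get((uc, d)) == "green" for d in data_assets)]
print("immediately possible:", ready)
```

In practice this lives in a spreadsheet rather than code, but the structure is the same: the matrix makes sequencing decisions visible at a glance.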
In the end, it’s not a single tool or the most beautiful model that gets you
there, but the interaction of the various components, people, and activities
we’ve been discussing.
You will do a great job if you have this comprehensive mindset, thinking
about the value of data from all perspectives, especially the business side.
Data strategists come from all walks of life, and this is what makes the
field so great: I’ve worked with economists, psychologists, management
consultants, data scientists, you name it. All of them data strategists today!
And that is what is most fun about data strategy - it’s collaborative and
cross-functional across so many domains. It’s definitely not “yet another
data trend”, but an evolving, hardening set of skills, becoming a full-
fledged practice that will not disappear any time soon.
Summary:
Tom Davenport
BA: Let’s get this started by learning about your career. I know you have had
a long one! But how did it all start?
BA: In your experience, how has the field changed during this time? In my
conversations with other data leaders, it feels like some of the questions
we’re trying to answer with data have already been around for quite a long
time - even if we tend to think of data products as “modern.”
TD: When I started my career, people didn’t pay much attention to data.
Some people were focused on data quality, of course. But data was seen as
just a technical subject. The company I worked for focused on what came to
be called business intelligence. We called it “executive support systems” at
the time. To do such work successfully, you need to get data in better shape
than you typically find. But it didn’t receive a lot of attention, I would say,
until the last ten years. People have focused much more on it in recent years
because of the rise of data science and the need to analyze data much more
than we have in the past—the rise of analytics, big data, and AI.
BA: Do you also feel that the initial hype around data science has decreased
somewhat? I remember around 2010, it was on top of the hype curve, and
now in recent years, it became obvious that doing data science successfully
is not so easy. And hopefully, that continues to be the case, that data science
becomes more “boring” so that we can focus on delivering value, not hype.
What are your thoughts on this? Is there a shift in how people perceive such
tech?
BA: Exactly. It’s not enough just to bring in a data scientist and expect
magic to happen. Another reason for the declining salaries could be that
the field is like a pie, which data scientists share with other new roles, such
as data engineers and ML engineers, which are more specialized.
TD: Yes. I wrote a piece back when data science was just gaining in popu-
larity. It was in the Harvard Business Review with DJ Patil, who ended up
being the first chief data scientist of the United States. It was called “Data
Scientist: The Sexiest Job of the 21st Century”. We also wrote a ten-year
follow-up a few months ago. We discussed how the job has changed in one
major way during this time: people have realized data scientists can’t do
it all. You do need all those different roles you mentioned. Also, new ones,
such as data translators and data product managers - are both fast-growing
jobs. I think you are right; this may have diminished some of the demand
for data scientists. At least the types of data scientists who only develop
models. I think it’s a bad idea to just focus all your attention on finding the
right algorithm because if the project as a whole doesn’t get implemented,
what’s the point?
BA: Yes! These roles are super important, the ones which are focused on
data and product thinking. I have a few questions on this later, but let’s go
back to data strategy. When did this term first appear and gain traction?
TD: I think it coincided with the rise of Chief Data Officers over the last 20
years - initially in the financial services sector. I wrote an article, “What’s
Your Data Strategy?” with Leandro Dalle Mule. I worked with him at Deloitte
before he became a Chief Data Officer at AIG. I give him all the credit for this
idea of offense versus defense in data strategy. I was more familiar with how
much commonality we need to have in our data - data federalism and related
topics.
Since then, I think many other issues have come up, including what aspect of
the data supply chain you should focus on; does the whole product idea offer
any value in terms of making data effective. Also, what type of data will you
focus on - is it only internal or external? I think the field broadened quite a
bit from the early days when we used terms like “master data management.”
BA: How about data strategy today? I think data strategy unfortunately is
often seen as just a static document - and also has become quite vague.
TD: There was a previous generation of initiatives with similar goals. They
were named “integrated master high-quality data” or “information
engineering”. A lot of money was being spent on this, and it was a
very technical discipline. Organizations such as the Oxford Martin Institute
were founded. I was very distrustful of it since they created models nobody
could understand. Business people didn’t know what to do with those - they
looked like those schematic diagrams of the latest Intel microprocessor -
unnecessarily complex. Business people struggled to find where the data
is in all of this. I think in the 1980s, Michael Hammer wrote an article
about a principles-based approach to data. That was one of the first data
strategy-oriented pieces that ever appeared in HBR. Some companies used
it quite successfully. Occasionally, I still find somebody who supports the
principles idea. It was similar to data strategy because it demonstrated how
you should simplify goals in different areas of IT, including data - to ensure
business people would not just understand it but be the primary creators
of data strategy. They should be actively involved in it and related aspects
of technology, such as architecture and planning. I still believe it’s very
important to have simpler business-oriented data strategy approaches.
It is important to look at the companies that are very aggressive in their use
of AI. Some of the best examples are European - such as Shell, Unilever, or
Airbus. I think this is largely because their senior business executives have
learned enough about AI technology to understand what it can do for them.
I wrote a piece in Sloan Management Review with Piyush Gupta (the CEO of
DBS Bank in Singapore). While doing this, I asked him how he got into all
those topics. He told me he worked for Citigroup with John Reed. And he was
probably the first banker in the world to realize the importance of data and
technology for banking. Piyush Gupta then developed the data strategy and
architecture for Citigroup in Asia.
BA: Data strategy inspiration can come from many directions! Let’s take a
step back. What do you think about the data strategist title?
TD: You need some sort of intermediary between highly technical people
(whether in data science or data management) and business people. You
won’t have many people like Piyush Gupta - CEOs - doing data strategy. You
could call those roles data strategists, translators, or data product managers.
The translator role is a job in itself. You’ll always need someone like that
to manage a project from start to finish, interface with the stakeholders and
persuade them it’s a good idea to develop data products.
BA: It is not an easy job; you always have to look at the forest and the trees.
Those are skills probably hard to find in one single person.
TD: That’s true, although I believe the situation is improving. When I was
working on the 29 case studies of people who work with AI daily, my co-
author and I discovered many people who were those business-technology
hybrids. I do believe more and more are emerging all the time. Nowadays,
when I sit in a meeting at big companies, there are often both business and
IT people - and it’s sometimes hard to say who is who!
BA: Do you think the learning curve for understanding data stuff has become less steep?
TD: Yes. I teach my students how to do it, and most of them don’t have any
technical background. There are all these great automated ML tools now.
BA: And their eyes light up immediately, I’m sure. Let’s take an example
company that is not very successful with AI. How do we get started there?
TD: I always say: with AI, you have to think big but start small. You must
develop a vision of how your company can evolve with AI capabilities. Will
it treat customers differently or make money differently? What products or
services might it sell? After you have identified that, you can look at the
little pieces. AI is, for the most part, an incremental technology. It works
on individual tasks, not even entire business processes or jobs. Some of
those small steps might seem boring, but together they will eventually help
revolutionize a whole business line, such as customer service, marketing, or
product development.
TD: Yeah, I think that’s generally true. Last week I talked with the head of
data science at AT&T. They have done much to empower business people
to use AI-oriented tools. The whole organization has agreed on how to
have reusable data sets and how to use them. I think data strategy should
contribute a lot to empowering those citizen types to do more of the work.
We just don’t have enough data scientists, engineers, and analysts.
services into Excel, where you can get predictions in the Excel sheet. People
like to focus on shiny use cases such as object detection, but there’s a lot of
value in such “simpler” automation.
TD: Yes, agreed. Those large vendors can help democratize such tools
because everybody can access them.
BA: How about data product thinking? How do we ensure we don’t work in
isolation and the algorithms get integrated within a product?
TD: I talk to a lot of chief data and analytics officers. When you’re coming
to a company that is not that data-oriented and not doing a lot with AI, it’s
essential to start with a small number of use cases to build consensus within
the organization. You need to get business leaders on board as stakeholders
quickly and then ensure those use cases are a success. After this, you can build on
them with more sophisticated projects and infrastructure.
BA: Interesting. Would you then say that only mature organizations need a
complete data strategy?
TD: Yes. I think in the beginning, you can start case by case without
restructuring the whole organization.
BA: Now, I want to ask you one of the hardest questions in data strategy.
How do we measure its success? How do we come up with accurate numbers
that the business requires?
TD: What makes this difficult is that data is seen as an abstraction. This leads to the short careers of chief data officers - it's hard for them to demonstrate value. I recently surveyed CDOs for an AWS report, together with the MIT CDO Symposium. There I discovered that the ones who are successful are obsessed with measurable value. They ensure they have a baseline before measuring anything they have achieved. You cannot do
that for everything, so you might need to prioritize a few critical projects. I
have seen this done successfully at Capital One, a very data and analytics-
oriented bank in the US. They were also the first ones to name a CDO in
1992.
BA: Yes, and use this measured value to fund further activities. Another question I have is about who develops the data strategy more often in your experience: external consultants or internal people? And which is better?
TD: I think it’s certainly good to get facilitation help from external consultants, but in the end, the organization needs to be on board since they
are most in tune with how the business and the business environment are
rapidly changing.
TD: I think this is a confusing term. When people hear “asset,” they expect
it to have value in itself, but this is rarely true of data. I like the data product
idea better since it’s clear how it’s valuable when analytics and AI are doing
something useful. I know some companies are perfectly successful using the
data asset concept, but I’m not a big fan of it.
In the AWS survey I just did, data monetization was one of the least popular activities. I think it’s just too hard and brings many new issues, such as data ownership and privacy. I think it gets customers upset.
TD: Yeah. This is a value-laden term, particularly when you start talking
about monetizing customer data.
BA: What mistakes do you see people making when developing data prod-
ucts?
TD: You have to take a lot of ideas from standard product development. There are a lot of interesting ideas coming from lean-startup thinking about MVPs. What is essential is to have stage gates where you decide if you proceed. What is more challenging is to find people who have product thinking - many are just focused on developing algorithms and less interested in the business and people issues around the products.
I think you can take a lot of use cases from other industries. The first
employee attrition model I saw developed was at Sprint, the mobile telecom
company. They initially used it to prevent customer churn but then used it
for employee churn.
BA: How about the domain knowledge you need for the different industries?
Also, do you think data strategy differs, or can you reuse it easily between
industries, such as pharma or automotive?
TD: I do think there are some substantial differences. For example, financial
services tend to be more defense-oriented than any other sector. The
pharma domain, particularly in drug development and genomics, is so
complex, with a big focus on external data, called real-world evidence. In
automotive, there’s a bigger and bigger focus on connected vehicle data. For
retail, other issues and opportunities exist, such as customer loyalty data
and shopping behavior over time. Every industry has some key decisions
that need to be made in the context of data strategy.
TD: That’s Amara’s law, right? Yes, I completely agree. I wrote about this in
the AI Advantage book.
BA: Where do you think things are going in twenty or thirty years?
BA: Sometimes I feel like we’re monkeys sitting on a branch, cutting it from under us. We develop tools to automate ourselves away - everything is
evolving.
TD: Yes, one of the interesting things is that data scientists have generally not embraced those tools - even in industries where models are very important.
BA: What would be your parting advice for people just starting?
TD: Make sure you understand that it’s everybody’s job to think about how
we analyze data and how that relates to business!
Summary:
• Embrace new job titles in the field and the growing specialization;
• Have a relentless focus on measuring and providing value;
• Learn from traditional product development;
• Respect the differences between industries;
• Embrace new productivity tools for data.
Stephanie Wagenaar
BA: What has your career looked like so far, and what are you up to now?
BA: What are the essential skills for your type of role?
BA: A strategy friend once told me that data strategy is the “art of drawing
boxes.”
SW: Yes! You have to make sure that everything is cohesive. I see a lot of companies doing many projects simultaneously that build on different foundations despite being aligned to the same business goal.
BA: What do you think about the data strategist title? Will it become
established?
SW: Well, I can imagine it will. We need to keep the organization focused
on the long-term vision and make it pragmatic. I sometimes call them “data
specialists” since many different skills are required. This makes sense to
people who know nothing about data.
BA: You mentioned that it might be easier for business people to get into
this. Do you know how to train them to be good at the role?
SW: Few things can replace learning on the job. I would start by talking
to different people from different departments in the organization and
learning how they use data and what’s essential for them. Then you can
combine this into a cohesive overall plan, explaining it to the management
- ensuring their initiatives, goals, and needs are met.
BA: What are your impressions of data strategy in the Netherlands? What
type of organizations are interested in it, and what do they expect?
SW: Everybody wants something pragmatic. Many are not yet aware of what they need. They might look at data governance and need help understanding how it might help them achieve their goals. Part of the problem is that those concepts are so broad that it’s hard to have a single approach that fits each client.
BA: Yes, I agree. It can be smarter to order only the individual elements that are most urgent.
SW: Yes, this is how we often do it. The clients should always have the basics
fixed first.
I usually focus on data management and data governance - those are
essential. Many other topics come after, such as data privacy, but you can
solve many issues with a data management and governance plan.
BA: There’s a lot of confusion between terms like data management and data governance. How do you explain them to clients?
cars because they first have to learn how to drive a regular one.
BA: How do you ensure clients are still on board with topics like this? Often
the results of such work become visible only much later, which could lead
to frustration.
SW: Here, communication is essential: making sure they are always involved in the process and being as transparent as possible. I do as
many in-person workshops as possible and keep them interactive, ensuring
nobody is left out. This attracts and excites even more people who want to
be a part of this. Having good energy and focusing on the bigger picture
and digital transformation results go a long way to generating excitement
- but it can still be challenging. Some of those workshops last for three
hours. What helps is to ensure that people learn from each other and keep
each other updated on progress and things achieved - keeping everything
interactive.
BA: This is a very positive way of motivating people. How about the
negative? If you focus on opportunities missed or costs if no governance
is implemented?
SW: This is more suitable for management - when you want such initiatives
funded. Still, I prefer not to focus on the negative - most people know those
things already. It’s better to focus proactively on positive change.
BA: How do you measure the success of such initiatives? Do you set some
baselines for data governance?
quality. This makes them even more excited - if they can understand the
results of their efforts. As the data management and governance building
proceeds, they must regularly see the outcomes. We need a group feeling -
that we are all working together towards a common goal. Through this, they
can feel empowered and eventually proceed without external assistance.
Just don’t forget to make sure all departments feel heard.
BA: This is a very time-consuming process, with all the due diligence, right?
SW: Yes. It can take six weeks with, let’s say, two days a week. There are a lot of technical tests too, and we see who in the organization struggles with what and assist them. They have to see you as an ally.
BA: What are some common traits that clients who are successful in this initiative share?
SW: I’ve noticed that successful clients already have quite a few internal
people interested in data—those motivated to take the next step. We always
need to work with such people because they understand the business inside
and out - we do the work together as specialists and domain experts. After
this, I also feel comfortable knowing the organization is in the right hands,
with people leading the work further. Those people can be from the data
team and others who are very interested. Again, the most important thing
is they understand the business. You can say I try to turn them into mini data strategists!
SW: Good idea. The number one problem in companies is everyone working
in their own silo. I have seen companies falling apart, and some departments
still claim everything is in order just because their datasets are fine. We need
to change this, and I’m working hard on that.
SW: In most cases, we start with the ERP system. After this, we focus on the
master data - so customers, suppliers, items, and transaction data. I tend to
focus less there since fixing the basics is most important. Of course, CRM
data is essential, and sometimes custom supply chain programs. In such
cases, we need the business users to help us assess the priorities. Understanding how the business makes money also helps. For example, suppose
you are working with perishable goods businesses. In that case, expiration
data is essential, and if those are more content-focused businesses - the
critical data can look very different and be in a separate silo altogether.
BA: How do you balance quick wins with more time-consuming work?
SW: I try to use tools for the quick ones. I’m a big fan of IBM Watson, and
one quick win we made was using their translation service. That helped us to
move quite fast. I know people like flashy things, so sometimes, we support
this with quick dashboards, showing business users new ways to look at and
use the data. I agree you need those quick wins to build trust for the bigger initiatives (and to help us get assignments) - it’s not easy to sell the complete data management proposition without that.
BA: How about the more advanced use cases? Do you feel companies have now realized it’s just not enough to throw engineers at the problem?
SW: Yes - it is like building a house, with the foundation first. For AI, you need data management and governance in place. I do, of course, take those use cases into account when designing the solutions.
BA: How about new tools, the cloud, for example - have they made your
work easier or harder?
SW: Mostly, they are beneficial. In the Watson case, for example - we could
build a solution very fast, at a low cost. I am very interested in the low-code
movement for such work.
BA: There must also be hidden dangers in such tools, like giving too much power to users who haven’t been properly educated for it?
SW: Yes, this also happens. But the problems we face are so diverse that we should use all our tools. It’s hard to have a one-size-fits-all approach. What is also important here is to know what you don’t know. In such cases, you should have the right skills to pick up new topics quickly and ask the right experts for help; otherwise, it can get difficult.
BA: What are new exciting things you see in the data field?
SW: There is just so much! A lot of cool NLP projects are gaining traction. Things tend to stay the same in data governance, but the field is evolving there as well. It’s hard to make a 20-year prediction with the speed of new changes. It’s essential to see this as a positive and keep yourself and the clients enthusiastic!
Summary:
Doug Laney
Doug Laney is a Data & Analytics Strategy Innovation Fellow at West Monroe.
He is also the author of “Infonomics: How to Monetize, Manage, and Measure
Information for Competitive Advantage” and “Data Juice: 101 Stories of How
Organizations Are Squeezing Value from Available Data Assets”. Before that he
was a VP at Gartner and led the Deloitte Analytics Institute.
BA: Where do you think we are right now regarding realizing the value of
data, and what are the reasons for that?
DL: Regarding data assets, we’re still at a point where most organizations
and leaders will talk about data as an asset but not treat it like one.
They don’t apply asset management, measurement, and asset deployment
principles that have been well-defined and honed in other industries -
for different types of assets. Assets such as financial, physical, or even
human capital have well-defined methods and standards for management,
for example, ISO standards. We, as data professionals, have done ourselves
and the world a disservice by inventing new terminologies and ways of
management that just fail instead of paying homage to how other assets are
managed. This is the main point in my Infonomics book. People would say
that infonomics is a bold new idea, but it’s applying existing ideas about
asset management, valuation, and economics to data - treating it as an
actual asset. This is where we are now - people are starting to talk about data assets
but still do not know what that means.
BA: Would you say we are too focused on tools and frameworks instead of
the basics?
DL: Yes. We all love new shiny objects. Everybody’s focused on tools. There are two types of data strategy. Big-S data strategy is about managing data in the organization as a whole - culture, leadership, organizational alignment, data architecture, integration, and so forth. The other is focused on the policies and procedures around how to manage and leverage data.
BA: Do you think a part of the problem is that many people disagree on basic
definitions?
DL: Yes. Even the definition of data asset is something people can’t agree
on. I recently ran a poll on the differences between data assets, products,
and datasets. The best answer came from someone who was not even in the
data field but an engineer. He said a dataset is a collection of data organized
in a specific format. This can be anything from a simple text file to a large
database. A data asset includes all the information a company owns and
controls, including both digital and physical data. This can include cus-
tomer, financial, and employee data. A company’s data assets are important
because they can provide insights into the business, help decision-making, and drive growth. A dataset is more of a physical manifestation of data, while a data asset is more of a logical grouping of data you own and control for a common purpose. And finally, a data product adds a layer of functionality on top.
BA: We are talking about data as an asset, but how about data as a liability?
DL: When people talk about liabilities, they often mean risk. If we’re talking about data as an asset, we use financial vernacular. An asset is something that is owned and controlled - exchangeable for cash and generating what accountants call probable future economic benefits. Finally, it’s also separable from other resources, and data meets those criteria.
Does data meet the requirements of an accounting liability - something that
you owe somebody else? I think it’s rare to see data being owed to someone.
Still, data can certainly introduce risk. One of the things I suggested in
the Infonomics book was a set of generally accepted information principles
similar to those for accounting. One of the ways to reduce data risk is
the replicability principle. It suggests we need to be economically cautious
every time we copy or move data since we increase the attack surface for
hackers every time we do so. We are increasing the risk. We must be careful
when doing things just out of convenience and recognize the risks.
Another thing you can do is look at the savings that various assets provide as part of solutions. Other contributors to those solutions are labor and physical and financial assets. The subjectivity arises in estimating how much value is allocated to the data asset. Data assets are non-rivalrous and non-depleting. They can be used for many purposes simultaneously, over and over again. Data assets are also progenitive: using them generates more and more data, which becomes cyclical.
BA: Who, in your opinion, should be doing this valuation of data assets?
DL: This should be in the purview of the chief data officer. I have long
advocated for the bifurcation of IT into separate “I” and “T” organizations.
The CDO is responsible and accountable for the “information” part. In some
organizations, it might work to provide this to the CFO.
BA: What are the important skills for a CDO in this case? What kind of background should this person have?
DL: Most of the successful ones come from a business background. They
can transfer the skills from managing other assets. While there are certainly
technical aspects of it, there’s a lot of difficulty in the cultural, organizational, and governance topics. How to create a data-driven culture and
foster data literacy. How to define use cases and leverage data in new
innovative ways. You need somebody with business sense.
DL: Of course, there you need technical people. But again, I think we should
be separating “I” and “T.” For example, instead of having a CIO, we can have
a CTO and CDO working independently yet synchronizing because of the
necessary overlap.
BA: What would be the first thing an enterprise can do as a first step? What
are some ideas for quick fixes?
DL: Appointing a CDO is a great first step. Then try to find a good balance
between the defensive and offensive parts of the data strategy.
BA: What are some topics in data assets that the people you teach have
difficulty grasping?
DL: There are many complex topics, such as data privacy and security. Those are very difficult because those fields are ever-changing and differ by geography, industry, culture, and even customer. I am not focused on
those topics, but data asset valuation is tricky. Once you start peeling the
onion on that, it can be challenging. There are real nuances to the methods
there, as we discussed. Someone also needs to source the inputs required
to do the valuation. There are questions about probable future economic
value and the probability of successfully delivering those use cases. Also, as
we discussed, someone needs to assign the contribution of the assets to that value.
BA: You have been in the field for quite some time. In what ways has it
changed, especially with the new tools we have, such as the cloud?
DL: It has been great to see technology and architecture keep up with the
volume and velocity of data, thanks to the movement to the cloud. Still, big
problems remain to be solved, such as how we handle the variety of data
and the increasing range of data sources. How do we integrate them and
align the meaning of the data? The three V’s remain essential. One thing
which is often misunderstood is the “veracity” of data. People don’t realize
this is a bigger problem for smaller datasets since those are often compiled
manually.
BA: How about the term “data is the new oil” and contrasting it with something I hear people say, “data is the new water”?
DL: Michael Saylor used the initial term some twenty or thirty years ago.
He used to talk about data being like water in terms of being able to turn a
faucet and get a flow of data. This is undoubtedly the case at the moment. I
certainly appreciate that, at the macro level, the comparison of data to oil works. Data is the driver of the economy today, much like oil was a century ago.
Still, it misses the point that data has unique economic characteristics. If
you consume a drop of oil, you can do so just once. It dissipates and turns into heat and pollutants. Consuming it does not create more oil - data, by contrast, is progenitive and can be used in many ways simultaneously. This is what
successful business models for companies look like nowadays - they take
advantage of those foundational economic characteristics of data.
BA: What are your thoughts on describing data as bronze, silver, and gold?
DL: I’ve seen it and think it’s a handy way to express different levels of
data availability, usage, cleanliness, and governance to a business user.
I don’t like to make things more complicated, and I’m not in favor of
discriminating data from information - they are synonymous if you look
at the dictionary definition. I also think there are rarely state changes to
data, and we have more of a continuum as data becomes more consumable,
usable, and integrated.
BA: Still, how do you think about having “good” and “bad” data assets?
actual value is off by a certain percentage. There are plenty of data assets with poor accuracy or timeliness that remain fit for a particular purpose.
BA: What would be the first thing a data strategist does to improve the data
situation in a large company?
DL: First, they should understand the culture and leadership around data.
How is it perceived? Then the governance topics - how individuals in the
organization use and measure data, and do they have defined metrics?
Are the KPIs aligned across the organization? Do they have a separate
organization or a traditional IT? How do people coordinate and collaborate
in such initiatives? After this, I would look at the architecture and see how
source data moves through it and is integrated with other systems. Data
governance and quality are also important here. It’s essential to assess
the data assets’ accuracy, timeliness, completeness, and integrity. Finally,
I would look at the operations and see how data is consumed and made
available to people and processes. I think we are still spending too much
time delivering data to eyeballs. Increasingly, and going into the future,
I think we’ll see more data consumed by systems and applications rather
than people. Data strategies should reflect this increase in business process
automation, AI, and similar, instead of just descriptive statistics.
BA: What are your thoughts on the term “data strategy” itself?
BA: Also, we should measure its impact and establish some baseline.
DL: Right. The data strategy itself should have goals and tools to measure them.
Summary:
Alexander Thamm
Alexander Thamm is the founder and CEO of the consultancy of the same name, focused on data and AI products. He is also the author of the book “The Ultimate Data and AI Guide”.
BA: Tell me about your story. How did you end up in data?
ATH: I think there are two important aspects to my start. First, when I was 18
years old, I opened an internet café. It eventually went bankrupt because, at
some point, everybody had an internet connection at home. We did a lot of
tech stuff: coding, making websites, and other software. While this got me
hooked on tech, it also helped me learn how to do business – the bankruptcy
experience was not so nice. This was when I started studying economics with
minors in statistics and psychology.
BA: Had they already promoted you, or were you still a working student at this point?
ATH: I was intrigued by the topic and accepted it, and I also started doing my
Ph.D. in the field while travelling to Ingolstadt. My Ph.D. was focused on
Bayesian statistics since, in this type of problem, there were often issues
with missing data - and I was trying to augment the missing information.
At the time, this was pretty new to everyone, and I had a lot of fun. I had
to hack together many systems because the business was still working with
Excel and IT with rigid old databases. There was no data science as a field
- everyone was just talking about data mining - trying to find patterns in a
heap of data - hoping for crazy results that could change a whole business.
IBM promised a lot at the time with Watson. All those initiatives were rarely
actionable and provided redundant common sense patterns instead. At the
time, I was doing more advanced projects, and people noticed I could get
fast results from data, and eventually, I decided to leave my Ph.D. and start
my own company (so I could work easier with clients such as BMW).
BA: It sounds like you were successful in the field quite early. What are
non-technical traits and skills you think helped you along the journey?
Specifically, as a working student, to deliver the impact you did?
ATH: You can’t do without some core skills, like being hard-working and
bold. Every entrepreneur will tell you that. But there are several more
special ones: passion, curiosity, and being value-driven. You should be
interested in finding the truth in data, much like a detective who wouldn’t
let go, even if the odds are stacked against them. I was also always focused
on seeing an impact, which satisfied me. I loved it when people told me -
“Alex, this is so cool. We are now 20% more efficient”. I always wanted to go
for the value, not just the insight.
BA: Speaking of value, this is one of the most challenging questions. How
did you show measurable value? How do you measure the impact you
provide? This is a difficult task since data is often seen as a cost center rather
than a profit one. Also, while the impact of data initiatives on the bottom
line can help improve the KPI of other teams and departments, it’s often not
measured directly.
ATH: This point has always been very relevant. We built the “Data Compass”.
This is a framework similar to CRISP-DM but more adapted to what we
think data scientists should be able to do. It starts with a business problem
and defining the core KPI we want to improve. Data strategy has to be a
derivative of the corporate and digital strategy. We often see that data departments and CDOs build a data strategy that is not interwoven with the company targets. If you work against your company’s goals - you
have already lost. While there are many different KPIs, balancing capability-
building use cases with data products is crucial. It’s often not a good idea
to start with an extended data lake, data catalog, or governance project
without ensuring first they are connected to the business. It’s also hard
to keep the excitement going (we call this happy honeymooning) for long
periods. If you connect straight away to the specific use cases, such as sales
forecasting and its impact on targets, you will have higher chances of getting
funding for the project.
ATH: No, definitely, data strategy is not static. The term itself is misleading
since it implies you do something just on paper and then execute it. We
were recently with a client who was just starting, and they thought they
should do this as any other regular project—beginning with a strategy
assessment, three to four months of interviews, and so on. I agree it’s good
to have a good starting point, but the problem is that such projects are very
new to them. I remember the saying from Henry Ford - “if I asked people
what they wanted, they would say faster horses”. The problem is that if they
are not experienced in the topic, it’s hard for them to develop a data strategy.
BA: It’s a chicken and egg problem, right? You need to have a data strategy
in place to make one.
And the last stream we have is the governance one. This one is trickier
since it depends a lot on which industry you work in and where you operate
geographically. You must touch this from the beginning, but you shouldn’t
overwork it. I have been in meetings where the client had ten lawyers
assembled with a 500-page policy document outlining what they shouldn’t
do. And then I asked them - have you done anything already?
BA: This is very interesting. But how do you balance those capability-
building topics with doing pilot projects without falling into pilotitis?
ATH: This is how we do it. We prioritize our use cases that follow the data
journey process. This process has three phases - lab, factory, and Ops. All
those topics are often interwoven as well. We have applied this process
successfully to more than 1500 projects. A good analogy to use here is a gym.
You are building a specific training plan for a particular person. You need to
balance out all the different ways the organization works. And even topics
like Data Governance can be exciting if they work well. If you look at Zalando - the co-author of my book (The Ultimate Data and AI Guide), Alex Borek, is doing great work on the topic there as Head of Data. They initially had
a radically agile approach - everyone could build anything. It was chaotic
but worked well then. Eventually, they wanted to leverage more synergies
and develop a coherent platform - compliant by design. As the company
has grown, many people are happier since they can just use the platform,
and everything is already set up for them. They don’t need to worry about
managing costs and about data protection. Like this, they can get going with
PoCs quite quickly. This reminds me of a Dilbert joke where Doc buys the
first video phone but has nobody to call since he is the only one with it!
You must have a strategy and resources allocated to a central unit that
gets the momentum going. After this, you create this upward spiral with
more use cases, more data, more ROI, and more money to invest into
capability. But the idea is not to put all the data in the data lake. We call it “Türmchen bauen” (“building little towers”) in German - an ivory tower strategy. We need to work
holistically instead of building a whole house from the basement up, which
is, unfortunately, typical for many cases. Organizations are then stuck with
the view from the basement instead of being agile - and nobody likes that
view.
BA: How do you deal with the issue of significant upfront investment in data products?
ATH: Yes, this happens often. You just have negative ROI for some time,
and only when the product is finished does that change. There are two
things you can do. First, you should have a very balanced portfolio of use
cases. Instead of focusing on what smart data scientists want - complex
PhD-level deep learning projects, which are not easy for most European
organizations to take advantage of yet, focus on easier use cases where you
can achieve positive ROI within several months. Of course, you also want
some moonshot projects to get media coverage or new talent in - but, for
example, you wouldn’t invest all your money in crypto, right?
On the other hand, you do want to start fundraising and acquiring commitment from the board for at least one to three years. You can use your early
success to “hook” more investment. People overlook that successful data
organizations such as Amazon or Palantir haven’t been profitable for a long
time, despite huge investments. You have to start at the start and manage
expectations. You need to make people fall in love with data and prepare
them for reality afterward as well.
BA: Let’s talk about our situation in Germany. What are current issues you
see in data strategy implementation in German companies?
ATH: It’s important to remember that we have some cool things that work.
For example, Deutsche Bahn is now saving a lot of money with data and
AI. Together with them, we were able to build a reinforcement learning
solution to distribute trains in real time much better. We also have a very cool project with the German automobile club, ADAC, where we built a recommender app for points of interest while traveling. This is something
that companies such as Booking or Airbnb still don’t have on the same
level. In other examples, Daimler is the first company to have level-three autonomous driving - ahead of US companies such as Tesla and Waymo.
I do think we have something like a marketing problem in Europe. We are
perfectionists, while in other places, people release stuff even before it’s
ready and tell everyone how cool it is.
Many companies here have invested a lot of money into capability building.
Now they are coming back with more use-case-driven approaches - companies
such as Allianz, BMW, and Porsche spring to mind. The real problem lies in the
Mittelstand. Those companies form the backbone of the economy here and
represent many human-centric European values. They can’t hire and retain
the same talent as the more prominent companies, but if we manage to give
them the proper tooling - similar to something like Shopify, but for AI - we
can be successful.
BA: You just mentioned the talent problem. Who do you think are the best
people to drive such innovation forward? What do you think of the title
“data strategist”? Do you have this role in your organization, and how do
you train such people? We need a broad view to get such innovative work
done.
ATH: Yes, I agree - you need people with a T-shaped skill profile. We
started as an engineering company because we saw a gap that the big
consultancies didn’t fill. I always liked the FUBU approach - for us, by
us. Being technical, we knew what technical work needed to be done and
how. This is why we also decided to do the strategy ourselves - maybe the
slides weren’t that great initially, but the contents were good.
Now we have a whole data strategy practice. It contains different roles
depending on the work - and they all function as a bridge between the
technical and the business sides. These people can be technical or have
a business background, like an economist who can code a bit. We always
ensure interdisciplinary teams in client engagements - for example, one
strategist, two scientists, and five engineers. We also have other roles,
such as data visualization engineers. And finally, we have data product
owners and product managers who are responsible for the outcomes and for
how to get there.
ATH: Correct. It’s like different strains of the same organism on different
seniority levels. We also have people who specialize in different cloud
providers. One of our newest practices is the Ops practice, where we have
even more roles.
ATH: We had to specialize. We started with just the four of us, when you
would get the job if you could spell “data science” correctly. But now the
whole field has specialized. It’s not enough to say you are an AI
consultancy; you have to say what type of AI - text, image, etc. We also
differentiate ourselves from the larger consultancies with a more boutique
approach, which is more specialized and tailored to the customer.
ATH: Since we don’t have an investor and are bootstrapped, we always have
to stay lean. We took a lot of inspiration from the Spotify model and made
it work. It’s not easy to balance all the different cultures, technical and
non-technical. What also benefits us greatly is a strong network of partners
with whom we work, and events such as the Data Festival. It’s essential to
stay at the forefront of the field and be seen as a thought leader.
ATH: I just don’t see it that way. It’s like being a doctor: if you heal a
patient’s pain, they will tell everyone you are a good doctor, and they will
come back to you when they have new pain. This is the reputation we have
built over the last ten years. Just deliver excellent results.
BA: One of my final questions: how do you keep so many people learning all
the new things happening in our field, and avoid becoming a legacy company?
ATH: There are many ways to do it, but you must sacrifice some time now for
benefits later. We always give people time to develop their skills, and we
have a whole people organization to support the employees and teams.
Summary
As you embark on this journey, I have just one ask of you. As the last years
have shown, there are still significant challenges facing us - pandemics,
global conflict, and inequality. All against the backdrop of climate change.
If you can, try to use your data work for good and for the benefit of the
rest of us.
Boyan Angelov,
Berlin 2023
Appendix
AI: Artificial Intelligence
API: Application Programming Interface
AWS: Amazon Web Services
BI: Business Intelligence
CAS: Complex Adaptive System
CNN: Convolutional Neural Network
CoE: Center of Excellence
CRISP-DM: Cross-Industry Standard Process for Data Mining
CSA: Current State Analysis
CV: Computer Vision
DD: Due Diligence
DMA: Data Maturity Assessment
EDA: Exploratory Data Analysis
FTE: Full-Time Equivalent
GA: Gap Analysis
GCP: Google Cloud Platform
GIS: Geographic Information System
GPT: Generative Pre-trained Transformer
IaC: Infrastructure as Code
LSTM: Long Short-Term Memory
ML: Machine Learning
NLP: Natural Language Processing
RACI: Responsible, Accountable, Consulted, Informed (a responsibility
assignment matrix)
SDK: Software Development Kit
SSOT: Single Source of Truth
ST: Systems Thinking
SVM: Support Vector Machine
TDSP: Team Data Science Process
VCS: Version Control System
XAI: Explainable Artificial Intelligence
Database schema
A database schema describes how the data in a database is organized: the
tables, the fields they contain, their types, and the relationships between
them.
Data warehouse
A data warehouse is a central repository of integrated data from one or
more sources, structured for analysis and reporting.
Data warehouses are typically designed to support fast querying and anal-
ysis using SQL or other query languages and may be optimized for per-
formance using techniques such as indexing and partitioning. Data teams
may also integrate them with business intelligence tools and visualization
software to enable users to create reports and dashboards.
Data lake
Data lakes are often used with big data technologies, such as Hadoop, to
store and process large amounts of data. They may also be integrated with
data management and analysis tools, such as SQL or Spark, to enable users
to query and analyze the data in various ways.
Examples: AWS S3, Azure Data Lake Storage, Google Cloud Storage
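To illustrate the idea of raw files in many formats, the following Python sketch treats a small dictionary as a stand-in for an object-storage bucket; the object keys and contents are invented:

```python
import csv
import io
import json

# A data lake holds raw files in mixed formats; here two in-memory
# "objects" stand in for files in a bucket.
objects = {
    "events/2023/01/events.json": '[{"user": "a", "clicks": 3}]',
    "exports/users.csv": "user,country\na,DE\n",
}

# Downstream tools (such as Spark or a SQL engine) parse each format on
# read; this sketch does the same with the standard library.
events = json.loads(objects["events/2023/01/events.json"])
users = list(csv.DictReader(io.StringIO(objects["exports/users.csv"])))
print(events[0]["clicks"], users[0]["country"])  # 3 DE
```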
Data lakehouse
Data lakehouses are often used in big data environments to enable the
efficient querying and analysis of large amounts of data from multiple
sources. They may also be integrated with data management and analysis
tools, such as Spark or SQL, to enable users to query and analyze the data
in various ways.
Examples: Databricks (Delta Lake), Snowflake, Azure Synapse Analytics
Serverless computing
Serverless computing offers a flexible and scalable solution for running
applications and services, allowing users to pay only for the resources
they use. However, it can also require a different mindset and development
approach, as it focuses on individual functions rather than on traditional
application architectures.
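To illustrate the function-centric model, here is a minimal sketch of a serverless-style handler in Python. The (event, context) signature mirrors a common convention (for example, AWS Lambda), but the exact interface varies by provider, and the event shape below is invented:

```python
# A serverless platform invokes one function per event; all state comes
# in through the event payload rather than a long-running application.
def handler(event, context=None):
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

# Locally we can simulate an invocation by passing an event dictionary.
print(handler({"name": "data team"})["body"])  # Hello, data team!
```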
Data mesh
A data mesh is a decentralized approach to data architecture that
distributes the ownership of data across an organization. The core
principles of data mesh include the following:
• Domain-oriented ownership: data is owned and served by the teams closest
to it.
• Data as a product: each dataset is treated as a product with defined
consumers and quality standards.
• Self-serve data platform: shared infrastructure lets domain teams publish
and consume data independently.
• Federated governance: global standards for security and interoperability
are agreed centrally but applied by the domains.
The goal of data mesh is to create a flexible and agile data architecture
that can support the changing needs and priorities of the organization
and enable teams to quickly and easily access the data they need to drive
innovation and business value. This is undoubtedly very useful for some
organizations, but beware of a cargo cult. “Decentralized” often sounds
good, but in some cases a monolithic, centralized setup is easier to manage
(for security or maintenance purposes). Here are some potential drawbacks:
• Cultural change: data mesh requires decentralized data ownership and
management. This can be difficult for organizations that are not used to
working in this way and may require significant effort to change long-
standing practices and mindsets.
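To make decentralized ownership concrete, here is a hypothetical Python sketch. The domain names, data product names, and the find_owner helper are all invented for illustration:

```python
# In a data mesh, each domain team owns and serves its own data products
# instead of routing everything through a central data team.
domain_products = {
    "marketing": ["campaign_performance", "web_analytics"],
    "logistics": ["shipment_tracking"],
}

def find_owner(product: str) -> str:
    """Return the domain team responsible for a given data product."""
    for domain, products in domain_products.items():
        if product in products:
            return domain
    raise KeyError(f"No domain owns {product!r}")

print(find_owner("shipment_tracking"))  # logistics
```

A real implementation would be a data catalog rather than a dictionary, but the ownership question - "which team serves this dataset?" - is the same.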
• Human in the loop: make sure that the results of prescriptive systems are
audited or have active participation by humans.
• Ethical design: ensure that prescriptive systems are built in a
representative way. The most common issue here is the use of biased
datasets.
• Clearly define the purpose and objectives of the data project and ensure
that they align with ethical principles.
Additionally, we can split the essential fields further, for example for data science.
A definition of done for a churn model task might include the following
criteria:
• The churn model has been tested in the production environment and
is functioning as expected.
• The churn model has been documented, including a description of the
features used, the training process, and the model performance.
• The results of the churn model have been reviewed by the relevant
stakeholders and approved for use.
• Any necessary changes or updates to the model have been made and
tested.
• The code for the churn model has been reviewed, and any necessary
changes have been made.
• Any necessary updates to the deployment or testing processes have
been made.
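A definition of done like the one above can even be encoded explicitly, so a task is only closed once every criterion is checked off. The criterion names and the is_done helper below are invented for illustration:

```python
# Hypothetical definition-of-done checklist for a churn model task,
# mirroring the criteria listed above.
CHURN_MODEL_DOD = [
    "tested_in_production",
    "documented",
    "reviewed_by_stakeholders",
    "model_updates_tested",
    "code_reviewed",
    "deployment_process_updated",
]

def is_done(completed: set) -> bool:
    """A task is done only when every criterion has been completed."""
    return all(criterion in completed for criterion in CHURN_MODEL_DOD)

print(is_done(set(CHURN_MODEL_DOD)))  # True
print(is_done({"documented"}))        # False
```

In practice teams track this in a ticketing system, but making the checklist explicit is what keeps "done" from meaning different things to different people.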
• User interface: This section describes the user interface for the system,
including screen mock-ups and detailed descriptions of the user flows
and interactions.
• Testing: This section describes the testing strategy for the system,
including the types of tests that will be performed (e.g., unit tests,
integration tests, etc.) and the criteria that must be met for the system
to be considered complete.
• Deployment: This section describes the deployment strategy for the
system, including the environments where it will be deployed (e.g.,
staging, production) and the process for deploying updates and bug
fixes.
• Maintenance: This section describes the system’s ongoing mainte-
nance and support plan, including the procedures for monitoring and
responding to issues and the process for making and implementing
updates and enhancements.
Notes
Introduction
3 Box, G. E. “All models are wrong, but some are useful.” Robustness in
Statistics 202.1979 (1979): 549.
4 Arnold, Ross D., and Jon P. Wade. “A complete set of systems thinking
skills.” Insight 20.3 (2017): 9-17.
Through Data and AI: A Practical Guide to Delivering Data Science and
Machine Learning Products. Kogan Page Publishers, 2020.
9 Taleb, Nassim Nicholas. Antifragile: Things that gain from disorder. Vol.
3. Random House, 2012.
10 Maister, David H., Robert Galford, and Charles Green. The trusted
advisor. Free Press, 2021.
17 Taleb, Nassim Nicholas. The Black Swan: The Impact of the Highly
Improbable. Random House, 2007.
20 Scavetta, Rick, and Boyan Angelov. Python and R for the Modern Data
Scientist. O’Reilly Media, 2021.
21 Sinek, Simon. Start with why: How great leaders inspire everyone to take
action. Penguin, 2009.
22 Peng, Roger D., and Elizabeth Matsui. The Art of Data Science: A guide
for anyone who works with Data. Skybrude consulting LLC, 2016.
Overview
25 Petersen, Kai, Claes Wohlin, and Dejan Baca. “The waterfall model
in large-scale development.” International Conference on Product-
Focused Software Process Improvement. Springer, Berlin, Heidelberg,
2009.
30 Ries, Eric. The Lean Startup. New York: Crown Business, 2011.
32 Kim, Gene, Kevin Behr, and George Spafford. The Phoenix Project: A Novel
About IT, DevOps, and Helping Your Business Win. IT Revolution, 2014.
33 Knapp, Jake, John Zeratsky, and Braden Kowitz. Sprint: How to solve big
problems and test new ideas in just five days. Simon and Schuster, 2016.