Elements of Data Strategy
Boyan Angelov
This book is for sale at http://leanpub.com/elementsofdatastrategy
Find out what other people are saying about the book by clicking on this
link to search for this hashtag on Twitter:
#elementsofdatastrategy
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Why Did I Write This Book? . . . . . . . . . . . . . . . . . . . . . . . . iv
Who Is This Book For? . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
How Is This Book Written? . . . . . . . . . . . . . . . . . . . . . . . . . xi
How to Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Systems Thinking for Data Strategy . . . . . . . . . . . . . . . . . . . xv
How Technical Does a Data Strategist Need to be? . . . . . . . . . xxviii
Defining Data Strategy With StratOps Principles . . . . . . . . . . . xxx
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Data Strategy Analogies . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
The Influence Cascade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Nicolas Averseng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
List of Acronyms and Abbreviations . . . . . . . . . . . . . . . . . . . 259
Architecture and Technology Definitions . . . . . . . . . . . . . . . . 261
Ethics and Privacy Checklist . . . . . . . . . . . . . . . . . . . . . . . . 268
Data Job Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Example of Definition of Done . . . . . . . . . . . . . . . . . . . . . . 272
Example Design Document . . . . . . . . . . . . . . . . . . . . . . . . . 273
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Acknowledgments
Writing a book is a long journey, at the start of which you don’t necessarily
know what its end will look like. I am writing these words after finishing
the manuscript, and seeing the journey from this vantage point allows me
to appreciate that I didn’t walk it alone. There are so many people I want to
thank. Here they are in no particular order.
Nicolas Averseng was the first person I interviewed. Thank you for answer-
ing my cold e-mail, for the brilliant insights, and for all the conversations
we had. You taught me a new way of measuring data strategy’s impact, and
I can’t wait to see what YOOI does in the future. Tom Davenport, for two
things. First, thank you for turning data strategy into a proper discipline.
Your articles inspired me and a whole new generation of data strategists.
Second, for giving me the confidence that this book is needed and that I’m
on the right path. Martin Szugat for teaching me data thinking, convincing
me I should write this book, and reviewing the manuscript. Datentreiber
is an inspiration! June Dershewitz - thank you for helping me look at
data strategy differently and for the continuous encouragement. Noah Gift
for the fresh take on agile in organizations regarding data projects. And
for teaching me to be more opinionated! Amadeus Tunis for helping me
deepen my understanding of holistic data strategy. I also learned a ton
about measuring the value of data projects. Doug Laney: Infonomics was
one of the first books I read as a data strategist, and it shaped my views
tremendously. Also, thank you for the inspiration of having strong opinions
and, of course, for making me look at data as an asset. Stephanie Wagenaar,
for the deep insights on data governance, workshopping, and for reviewing
the manuscript. Your energy was vital for me to push the book through to
completion! Alexander Thamm, for the brilliant conversation and insights.
I thank you not only for the ideas on consulting, data, and AI but for your
views on the future of Europe in this regard. Thomas Varekamp for the great
conversations and continuous support with energy and encouragement.
Thank you for being an excellent example of a true data strategist. I also
want to thank all my former and present colleagues and clients. So many
conversations during the years formed my thinking about data strategy.
Also, as always, thank you to my family. For the constant encouragement
and support. Even if you are unsure what data strategy is, you still believed
in my spending a year and a half defining it. Thank you for everything.
And finally, thank you - the reader. Deciding to have this book in front of
you is another sign that data strategy is a truly growing discipline. It is a
privilege for me to guide you on this journey.
Preface
“Software eats the world, but AI will eat software.”
–Jensen Huang
Those are the words of NVIDIA’s CEO. While they are full of conviction,
most of us agree that we are still far from this lofty goal. This issue
becomes even more apparent if we look at the current state of enterprise
data efforts. Despite the tremendous potential new data technologies offer,
few organizations reap the benefits of data-driven digital transformation.
Adoption and deployment of data and AI technologies remain rare, con-
trasting with big words from executives and their significant financial
commitments. But why? At this point, progress in the field should not be
hampered by the lack of technical talent anymore - or the lack of tools and
frameworks available. Why is it so hard to follow in the footsteps of digital
native organizations and take full advantage of data?
To do great data strategy, you need to possess a diverse set of skills and
experiences. It took me a while to appreciate that being a jack of all trades
can provide tremendous advantages in a new role, one growing beyond the
confines of pure data science. DJ Patil and Tom Davenport (the latter of
whom I also had the pleasure of interviewing for this book), in their 2022
article “Is Data Scientist Still the Sexiest Job of the 21st Century?”* support
this view: while data science saw meteoric growth (and will continue to do
so in the foreseeable future), it spawned even more fast-growing disciplines,
such as data engineering, machine learning engineering, and most impor-
tantly for us: data strategy.
I have had some diverse experiences that have helped me so far on this path:
I worked on metagenomic data at the Max Planck Institute, built machine
learning models and architectures for startups and large organizations, led
my teams to do that at scale as a CTO, and have been guiding industry
leaders in helping their organizations do the same. One thing stands out
when looking back at all those different areas: the increased complexity of
modern-day work. Each of those experiences required its own set of skills,
tools, associated frameworks and ways of working, and best practices. What
kept me going was the vast array of tutorials, courses, books, and articles
at my disposal - I rarely felt starved for help. This is why I was surprised to
find a massive lack of resources on data strategy. I was mostly left to my
own devices when I started that role.
* This is a follow-up to the original article.
I remember those first days as a data strategist. They were full of questions
and confusion. When do we do a CSA*? Before the gap analysis or after?
What other elements are necessary, and how do they relate? What are the
actual deliverables of such work? The more I thought about those questions,
the more I realized that I couldn’t even answer the most fundamental
of them: what does it mean to be a data strategist? What are the skills
and experience necessary to get the job done? The issue wasn’t limited
to forming a brand new vocabulary; I was used to working with many new
terms as a data scientist. I searched, read, and asked - yet I could still not find
confident and conclusive answers. While the why was clear to me from the
beginning, I expected a manual on the what and the how of data strategy. I
resorted to learning the hard way by listening to experienced leaders in the
field and combining their knowledge into tangible and concrete concepts.
Slowly the different ideas and definitions clicked together, and the answers
to my questions became visible. Soon other people started approaching me
with the same questions I had, and that’s when I decided it was time to share
what I learned, shaping it into the book you’re reading now.
“Elements of Data Strategy”† is the book I wish I had when starting out in
data strategy.
* Current State Analysis. More on this in DUE DILIGENCE.
† The name was inspired by one of my favorite books on data: “Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
About the cover: the book’s cover represents two fundamental ideas.
First: data strategy should be at the core of the business strategy. Second:
this work showcases a framework: a cohesive collection of concrete build-
ing blocks that guide and inform the design of your strategy. It is still up to
you, the strategist, to fill in the missing pieces that fit your organization.
Will reading this book be sufficient to make you a data strategist? I’ll be
honest - it won’t. I came to recognize that this role requires vast experience
across many fields; it’s inherently transdisciplinary. But I can make you a
promise that I’ll try to keep. This book will teach you three essential things
to keep at your side as you gain practical experience: the core concepts of
data strategy, a holistic way of thinking about them, and the tools to deliver.
You can apply this framework to any strategic work in data. I won’t go into
detail on technical terms. There are more than enough resources on data
lakes, warehouses, and algorithms - often written by their creators. I reserve
the right to provide a few sentences (thanks to GPT-3!) in the Appendix on
the most important ones. Still, more often than not, I’ll point to a better
resource. The focus gained will allow me to deliver on my promise.
Here’s another thing I won’t do: convince you that data strategy and, by
extension, data science, and engineering are important. Many such books
start with endless factual explanations of the low adoption of AI and the
need for planning data projects. I assume you would not have this book in
front of you if you thought otherwise, and if you still require convincing -
there are more than enough materials.
By the time this book turns its first birthday, many things will have changed. This
rapid pace of active development is why I stayed away from using the term
“modern” anywhere in the book (it would be arrogant to assume everything
I write will hold for years into the future). Treat this book as what it is - a
framework, and I encourage its adaptation and reworking in the wild! The
core concepts and way of thinking should remain constant, and hopefully,
this book’s usefulness will also persist into the future.
As the first step in understanding data strategy, we should define the skills
of a data strategist. In the early days of data science, there was quite a
popular article called “The Data Science Venn Diagram” by Drew Conway.
Data Science was initially dismissed as a fad. It took a while for the skills
and job descriptions to become established, culminating in further
specialization. I would argue that data strategy will follow a similar
trajectory as it spreads*.
A fundamental first step is normalizing the key skills. A good start is to make
a Venn diagram:
* More on this in my wide-ranging discussion with June Dershewitz, Data Strategist at Amazon, in
INTERVIEWS.
The grey circle, Data, contains all the (technical) domain knowledge re-
quired. The term itself is quite broad, but we can at least specify its higher-
level internal components: Data Science, Data Engineering, and Business
Intelligence (BI). In other words, in this circle, we have everything hands-on
about the job. One of the most common questions I’ve received here is
to what extent a data strategist needs to be familiar with those topics. There
are different answers to this, but a data strategist with a solid grasp of the
other circles of the Venn diagram will be able to compensate for any lack in
this area. Still, as a rule of thumb, they should have a sufficient technical
understanding of the three data areas I mentioned to the extent that it
allows them to manage a team of technical people. I’ll answer this question
in more detail later in this chapter.
Finally, those two circles support the integrative skill, which binds them
into a complete package - Systems thinking (ST). I’ll go into more detail later
in SYSTEMS THINKING FOR DATA STRATEGY, but to get us started, here’s a thorough
definition of ST from a paper by Arnold and Wade1:
“Systems thinking is a set of synergistic analytic skills used to improve
the capability of identifying and understanding systems, predicting
their behaviors, and devising modifications to them in order to produce
desired effects.”
There are overlaps between those different sections, resulting in the jobs
of Data Architect, Data Advocate, and Design Thinker. All of those are
* This term is described in detail on dbt’s blog.
important on their own, but our focus is on the middle of the diagram - the
data strategist. What is essential to consider when thinking about the skills
of a data strategist is that the three circles are never evenly balanced. Every
individual data strategist possesses a unique combination of those three
areas, and while, with sufficient dedication, you can perfect this balance,
it’s almost impossible to be perfectly capable in all three. This concept is
commonly referred to as T-shaped skill proficiency. For example, if you
dedicate extensive efforts to acquiring technical skills, you also pay an
opportunity cost for not learning business skills. There are twenty-four
hours a day for all of us, and we also have limited energy available. Thus a
tradeoff between the three elements of the Venn diagram occurs, and since
such tradeoffs are typical in ST, this becomes a pattern. I call it a
“tension diagram” (we’ll see many such patterns later in the book). Have a
look at a classic example below:
Even if I don’t cover the technical skills, there are also quite a few strategic
details that I’ll skip to avoid becoming a reference manual. Other excellent
works provide much more detail, such as Driving Digital Transformation
Through Data and AI: A Practical Guide to Delivering Data Science and
Machine Learning Products by A. Borek and N. Prill2, which I’ll also refer to
frequently. Again, to use the well-known analogy - I don’t want to give
you fish; I want to teach you how to fish. You’ll still need to gain practical
experience by applying the framework. If this is how you think, this is the
book for you.
This is a systems book, first and foremost. I provide a holistic* model of data strategy.
Naming the elements of data strategy (giving them logical boundaries)
and identifying the relationships between them lays the foundation for
any hands-on work. Such a structure allows the strategist to explore the
framework inductively, communicate their ideas about it, and enable further
information gathering. Still, remember the Zen saying, where the Zen
master said to their student: “don’t mistake my finger pointing at the moon
for the moon itself.” I believe the only way for a successful data strategy
model to be designed is by letting go of our desire to perfectly model every
situation and the arrogance of believing we actually can. Instead, we should
take this for what it is - a model; the rest of the work we need to do ourselves.
Remember - all models are wrong, but some are useful3. I certainly hope this
one is!
The book comprises three phases, capped by a final chapter with interviews
with thought leaders in the field of data strategy worldwide. Each phase
contains a set of logically related elements. I’ll be referring to them often
in the text, so to make reading easier, they will always be CAPITALIZED AND
MONOSPACED. The main deliverables of each element are presented at the
beginning of each element (except for the elements in DELIVERY, which is
more focused on the process). I called this setup the 3D model of data
strategy:
* Holistic: “characterized by the belief that the parts of something are intimately interconnected and explicable only by reference to the whole”.
Phase I: (DUE DILIGENCE): In this first phase, you’ll gather the information
the strategy needs to stand on: the current state of the organization, its
data maturity, and the gap between its ambitions and its reality.
Phase II: (DESIGN): Here, you’ll put the information gathered so far to use by
designing the strategy itself. You’ll first learn how to ideate, prioritize and
select use cases. Those need to be supported by optimal target architecture
and technology stacks, and finally, the data strategy is capped by designing
an appropriate data governance and operating model.
Phase III: (DELIVERY): Even the best strategy is useless unless delivered
successfully. In this part, you’ll learn how to ensure the data strategy is used
by applying DataOps. The showcased approaches are inspired by data-specific
flavors of two methodologies: soft agile and lean data.
Many books on data are written in a reference format. With such books,
you can pick any exciting topic and dive straight into it - without paying
attention to any assigned order. I structured this book differently. Creating
a data strategy is, by necessity, a sequential process. The phases and their
elements are building on top of each other; the outputs of a phase become
the inputs of the following one (you’ll understand this in SYSTEMS THINKING
FOR DATA STRATEGY). Those elements might not make sense if consumed out
of order. I suggest reading the book following the designated order first and
only afterward using it as a reference manual in case you want to refresh
your memory and knowledge on a specific element.
A second point to remember is that no two data strategies are identical. They
should all follow a similar structure, but you, as a strategist, will always need
to shift focus and order where necessary for your specific case. That Zen
teaching rings valid once again. It would be arrogant to assume that one
strategy template is all you need to make any large or small organization
data-driven.
Additionally, there are several types of blurbs that you’ll encounter through-
out the book. I’ve listed them here:
Warning: I have learned the hard way how a journey through data
strategy can be sidetracked. I’ll emphasize those situations here.
Tip: Here, I’ll provide practical advice that can give you the edge in
tricky situations.
Asides and sidebars: Some topics don’t fit perfectly well in the scope of
the book but can be useful. I will mention those here.
Since working with complex systems is a core skill for a data strategist, I’ll
cover the most important terms.
Definition setting is essential, but suffice it to say that working with vague
terms during already complex work, such as data strategy, can only add
unnecessary confusion.
The three phases of data strategy don’t exist in isolation. The output of
an element within DUE DILIGENCE can become an input to an element in
DESIGN, or DELIVERY. For example, the DATA DICTIONARY is relied upon heavily
during ARCHITECTURE AND TECHNOLOGY. The loops are moving forward and
backward, providing essential feedback. For example, during DELIVERY, the
strategist may realize that an adjustment is needed in the design -
the project’s successful implementation might need to be supported by a
different operating model. The 3D model demonstrates how the concepts
of boundaries, feedback loops, abstraction levels, and other related ones are
essential for a systems thinker. Let’s go through those concepts next.
Boundaries
The first essential concept is probably the most abstract one: boundaries.
As you’ll see later in the book (in CHANGE MANAGEMENT), the primary challenge
for a data strategist is communication. So many abstract terms need to be
explained, and it’s easy to make vague definitions. For a data strategy to
be successfully implemented down the line, the communication between
the strategist and the client or in the strategy document itself needs to be
spot on. Having concise and clear words for the concepts we discuss, and a
shared understanding, ultimately enables us to be productive and focus on
the actual work.
This is also one of the main reasons I wrote this book - I lacked the common
understanding of terms to discuss how to do data strategy. Now: every time
we communicate complex topics, we set boundaries. Knowing where those
are in different systems is also an essential task in CSA.
The reason I cover the concept of boundaries first is that it is the
defining feature of a system. A boundary is where one system ends and
another one begins. Philosophically, everything is a system - one within
another, like matryoshka figures. Almost all the data strategy activities start
with explicitly defining the systems we are working with and understanding
their boundaries. With my colleagues, I used to joke that the number one
skill of a successful data strategist is to draw boxes! For example, we might
draw the organization’s different departments as different systems. We can
then proceed to draw the boundaries of the teams. Within those boundaries,
we can fit other elements, enabling us to talk concretely about otherwise
abstract terms.
MECE
Two common sources of error appear when we break down a concept into
parts. The first one is that two (or more) elements have sub-elements in
common, blurring the separation between them. The second source of error
is that we are not presenting the whole picture; some elements are missing.
If we want to avoid both, we should always attempt to break down a larger
concept into a complete set of parts without anything missing. Those parts
should have clear borders between them delineated, so there’s no ambiguity.
MECE (Mutually Exclusive, Collectively Exhaustive) is an essential tool in
reducing complexity and the fundamental principle in designing the
elements of data strategy. All of the elements
of data strategy in this book are designed to be MECE.
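The MECE principle can be made concrete with a small check. Here is a minimal sketch that models each part of a breakdown as a set of sub-elements; the skills and groupings are invented purely for illustration:

```python
# A sketch of checking whether a breakdown is MECE, modeling each part
# as a set of sub-elements. The skills below are invented for illustration.

def is_mece(whole, parts):
    """True if `parts` are mutually exclusive (no overlaps) and
    collectively exhaustive (their union covers `whole`)."""
    seen = set()
    for part in parts:
        if part & seen:      # a sub-element appears twice: not mutually exclusive
            return False
        seen |= part
    return seen == whole     # anything missing: not collectively exhaustive

skills = {"SQL", "statistics", "budgeting", "negotiation"}
breakdown = [{"SQL", "statistics"}, {"budgeting", "negotiation"}]
overlapping = [{"SQL", "statistics"}, {"statistics", "budgeting", "negotiation"}]

print(is_mece(skills, breakdown))    # → True
print(is_mece(skills, overlapping))  # → False: "statistics" is in two parts
```

The two failure modes in the code mirror the two sources of error above: an overlap breaks mutual exclusivity, and a missing sub-element breaks collective exhaustiveness.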
Complexity
One of the most overused and dangerous words in data strategy (and
management in general) is complexity. To paint a picture of how frustrating
it is to define this word, I’ll share what I heard at a conference once:
“Complexity is so hard to define that even the eponymous book, Complexity,
doesn’t define it in any of its 600 pages”. Complexity is enemy number
one for a data strategist (or any knowledge worker, for that matter). And
for such an important term, it’s mind-boggling to realize that neither a
definition nor a universally agreed-upon measurement method is available. For
our purposes, I believe it’s more beneficial to define a complex system
instead. A good working definition of that is available from the Santa
Fe Institute: complex systems have nonlinearity, randomness, collective
dynamics, hierarchy, and emergence as their main properties. Examples of
such systems are our nervous system, cities, software and many others.
Feedback Loops
The easiest way to understand what feedback loops are (you can also read
Donella Meadows’s book6) is with an example from our daily life. We all know
what we talk about when we say, “I got off the wrong side of the bed today”.
Suppose you have a negative experience first thing in the morning. In that
case, things can cascade further down - your already bad mood will likely
attract more negative experiences and, consequently - an even worse mood.
This predicament is an example of a positive feedback loop. Positive, not
because it’s a nice thing to happen but because the system continues to grow
with time. Have a look at the following diagram:
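A positive feedback loop can also be sketched numerically: the state feeds back into its own rate of change, so the system keeps growing on its own. A minimal sketch, where the 20% amplification rate is an arbitrary illustrative assumption:

```python
# Minimal sketch of a positive (reinforcing) feedback loop: each step's
# state amplifies the next one. The 20% rate is an arbitrary assumption
# used purely for illustration.

def positive_feedback(initial, rate, steps):
    """Simulate x(t+1) = x(t) * (1 + rate): the output feeds back as input."""
    values = [initial]
    for _ in range(steps):
        values.append(values[-1] * (1 + rate))
    return values

bad_mood = positive_feedback(initial=1.0, rate=0.2, steps=5)
print([round(v, 2) for v in bad_mood])  # → [1.0, 1.2, 1.44, 1.73, 2.07, 2.49]
```

Each value is the previous one multiplied by 1.2 - this compounding is exactly what makes the loop “positive”, regardless of whether the outcome is pleasant.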
Black Boxes
Black box systems are common in data strategy, and the DUE DILIGENCE phase
aims to get rid of them as much as possible.
System Types
Let’s focus on a more common type of data strategy system (and planning
systems in general) - the rigid one. While nobody wants to build a fragile
system, a rigid one can be tempting - especially one designed by high-paid
external experts. The most common cognitive bias is the illusion that we can
predict the future with a significant degree of accuracy. This, together with
our domain expertise, can lead to having supreme confidence in crafting
plans. This scenario is even more dangerous because, in the short term,
such a strategy can work and inspire confidence. Everybody wants to follow
a leader with a clear plan, and nobody wants to hear or present thoughts of
uncertainty. Still, such systems are bound to fail in the mid-to-long term
since they collide with the complexity of the real world.
And finally, the third type of system, and the one we should aim to design,
is the antifragile one - a viable system. Such systems are built in a way to
respond effectively to feedback and learn. Instead of breaking under stress,
they adapt and grow stronger.
Abstraction Levels
One of the best ways to deal with complex systems is by using abstractions
(in systems and complexity science, this is related to the concept of scale7 ).
The human mind is uniquely capable of reasoning through the same prob-
lem in various ways. Since, by definition, we can’t understand a black box
(complex system), we need to apply abstractions. You can imagine them
as mental maps we sketch to think differently and be productive. This also
allows us to look at problems with only the necessary level of detail. After all, we don’t
need to understand a system’s inner workings, but mostly its outputs and
inputs - whatever is enough for us to work with them. Let’s define this
concept:
To drive the point home, I’ll illustrate with an example from arguably the
most strategic game invented: chess. Look at the boards below (you’ll need
just a basic understanding of chess, don’t worry):
balance of the board, perhaps planning their attacks in one area only. At the
same time, any decisions they make on this second strategic abstraction
layer need to consider the overall game phase - the long-term picture - the
highest abstraction level.
Once you get used to thinking in abstraction levels, you’ll see them every-
where. For example, when you need to decide on a target data architecture
stack, it’s rarely required for the data strategist to inspect all the data
pipelines currently in operation (this would be the “pieces” level). They
should be able to comfortably make a recommendation based on more
general observations (for example, the different tools used to build the
pipelines and their limitations, the strategy level). Using this ST method
will allow the strategist to operate in black box scenarios under constraints
more effectively and efficiently.
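The pipeline example can be sketched in code. Rather than inspecting each pipeline (the “pieces” level), the strategist aggregates the estate by tool (the strategy level); all pipeline and tool names below are invented for illustration:

```python
# Sketch of operating one abstraction level up: summarize the pipeline
# estate by tool (the "strategy" level) instead of inspecting each
# pipeline (the "pieces" level). All names are invented for illustration.
from collections import Counter

pipelines = [
    {"name": "orders_daily",    "tool": "Airflow", "sla_breaches": 4},
    {"name": "crm_sync",        "tool": "Airflow", "sla_breaches": 1},
    {"name": "clickstream_etl", "tool": "cron",    "sla_breaches": 9},
    {"name": "finance_export",  "tool": "cron",    "sla_breaches": 7},
]

counts = Counter(p["tool"] for p in pipelines)
breaches = Counter()
for p in pipelines:
    breaches[p["tool"]] += p["sla_breaches"]

for tool, n in counts.items():
    print(f"{tool}: {n} pipelines, {breaches[tool]} SLA breaches")
# A recommendation (e.g., migrate off cron) can be made without
# reading a single pipeline's code.
```

The summary is the abstraction: the strategist reasons about tools and their reliability, not about individual pipelines.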
As a natural first step, we must define what “technical” means. This term is
represented in the grey circle in the data strategist Venn diagram I presented
earlier in the chapter - Data domain knowledge. This is the ability to guide
the implementation of a data (software) system and its connection to other
systems (which can be nontechnical). There’s one thing we should already
get out of the way - it is certain that the more hands-on experience you
have with software and data technologies, the better. Still, as discussed
previously, even with all the technical talent and experience, a lack of
communication and systems skills will limit the effectiveness of the data
strategist.
Now that we know that balance needs to be kept, what is a good level of
technical and systems expertise? Instead of providing a giant list of pro-
gramming languages, frameworks, databases, and cloud services to master,
I’ll illustrate with cases from the real world.
Let’s say that as a part of the data strategy, all the data in the organization
needs to be organized and stored in a centralized repository. On top of
this, good access policies must be implemented (DATA GOVERNANCE), striking
a balance between restrictive and open. For this, a data strategist should
understand the use cases, the data formats, quality, and size, and different
data storage terms (data lake, warehouse, relational and non-relational
databases) enough to guide the implementers. This should be enough tech-
nical expertise and needs to be coupled with their business understanding
(regulations and internal policies, for example).
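To make that level of understanding tangible, here is a minimal sketch of such an access policy as a role-based allow-list. The roles, datasets, and rules are invented for illustration and are not a recommendation for any real governance design:

```python
# Illustrative sketch of a role-based access policy balancing restrictive
# and open defaults. Roles, datasets, and rules are invented, not a
# recommendation for a real governance design.

POLICY = {
    # dataset -> roles allowed to read it
    "sales_aggregates": {"analyst", "data_scientist", "executive"},
    "customer_pii":     {"data_steward"},   # sensitive: restrictive by default
    "web_logs":         {"analyst", "data_scientist"},
}

def can_read(role, dataset):
    """Deny by default; allow only if the dataset's policy names the role."""
    return role in POLICY.get(dataset, set())

print(can_read("analyst", "sales_aggregates"))  # → True
print(can_read("analyst", "customer_pii"))      # → False (deny by default)
```

The deny-by-default design choice is the “restrictive” end of the balance; widening a dataset’s allowed roles moves it toward the “open” end. The strategist needs to understand this tradeoff well enough to guide the implementers, not to build it themselves.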
Needless to say, we should choose the StratOps approach! Now, let’s get
started with doing data strategy!
Part I: Due Diligence
Why due diligence is hard—Alignment with business strategy—Steering
group—Current State Analysis—Via negativa—Maturity assessment—
Pilotitis—Systems audit—Competitor analysis—Futures thinking—Ambition
level setting—Gap analysis
Overview
“Give me six hours to chop down a tree and I will spend the first
four sharpening the axe.”
–Abraham Lincoln
How can we know where to go if we don’t know where we are? This philo-
sophical question applies not only to our personal lives but to data strategy
as well. Assessing the current state might not feel like making inroads to-
ward digital transformation (and, in some cases, might even open unhealed
wounds). However, it still is a foundational element of any successful data
strategy project. I group all such activities focused on information gathering
in the first phase of the 3D model under the umbrella term DUE DILIGENCE.
You might have previously encountered this term in other contexts, such as
corporate audits. There it roughly translates to “collection of information,
in preparation for action”; in other words: before we can strategize, we need
information. It’s irresponsible to make decisions or provide a strategy in the
dark. Moreover, how would members of our organization or clients trust us
if we immediately jump to conclusions before learning the details? We need
to get the lay of the land first.
Before diving into the details of conducting DUE DILIGENCE for data strategy,
let’s have a few words about mindset and honesty - two essential qualities
of a data strategist for this part. Keep in mind that of all the elements of
data strategy, DUE DILIGENCE is the most commonly assigned to an external
party (such as a management consulting firm). There are several reasons
for this. External consultants are primarily hired to alleviate pain points;
any organization would prefer to solve its issues (especially those of a more
“strategic” nature) internally. What’s behind this lack of trust in internal
teams can be summarized in three reasons:
Practical: It’s expected that a fresh look, combined with specific expertise,
can break through barriers that are too complicated to overcome for the
internal team. In Zen philosophy, there’s a term called “beginner’s mind”:
a viewpoint free of prejudice and baggage can provide a new, previously
unseen perspective.
Economic: External consultants are indeed expensive (at least more so than
regular, in-house employees), but the cost can pay for itself in the
mid-to-long term, especially if those consultants are deployed at critical
junctures and are involved in strategic decisions and planning.
Political: Radical honesty is difficult to deliver from within. Why not hire external
consultants, with the explicit intent to be as direct as possible, just for the
DUE DILIGENCE part of the data strategy and then let them leave the project?
In this way, we would avoid the potential fallout of radical honesty. In
military circles (where strategy has a central role), this is known as a “red
team”: a group of experts hired specifically to poke holes in your theories
and work to improve them afterward.
Now that we are familiar with the rationale behind DUE DILIGENCE, let's
look at its elements:
This first element might sound obvious, but for large organizations with
multi-layered internal structures (divisions, departments, and teams),
identifying accurate business plans and objectives can prove more challenging
than one might think. This activity requires a thorough look at the
company's ongoing and planned strategic initiatives, at many levels of the
organization. We should not limit this investigation to the business units
interfacing directly with data. The best way to approach this challenge is to
use a top-down approach, as illustrated in the figure below (this is also a
tension diagram; imagine the individual trees at the bottom and the forest
on top):
We want to start this way because the complexity of business goals tends
to increase from the top to the bottom layers of the organization, as goals
become less strategic and more operational. We can approach this work from a
consultant's perspective: in many cases, you are an outsider to the goals of
other organizational units. To flatten the learning curve of grasping the
whole business strategy, we start gathering information at the top, by talking
to the C-suite. Most organizations run several strategic
projects simultaneously. Depending on their data maturity, this number
can be between one and more than ten* . Those initiatives tend to be
long-term focused (longer than one year). An example of a strategic business goal
driving such an initiative can be “increase the market share of our products
in the EMEA region from 7% to 10%” (Borek and Prill provide many specific
examples8 ). If we were to stop the alignment between data and business
strategy on this level, we would probably miss the mark - this is too vague
and not actionable enough. How can we connect such a goal directly to our
data strategy?
Next, we climb down a level and uncover the business goals of the individual
departments. Let’s take marketing as an example. Their goals should be
informed by the C-level ones - but are probably more specific and focused on
marketing projects and related operations. We need to talk to the functional
leadership. These are the leaders of non-data departments and teams. In
this case, the marketing leaders think about how they can support the
overall strategic goal to increase the market share of products sold within
a specific region. While doing this, they come up with their own version of
the overall business objective: “Deploy a marketing campaign in EMEA which
* I’ll show you how to establish this for any organization during the CSA element.
increases the conversion rate for the region by 2%”. Successfully achieving
this goal contributes directly to the overarching one. Working directly with
functional leadership dramatically reduces the complexity of the task since
they are at the right level of familiarity with the work - not too strategic, but
also not too operational. Functional leaders are valuable allies for the data
strategist (and can even become data champions, see aside).
And finally, we can also go to the most fundamental level of the organiza-
tional pyramid and look at what individual teams have selected as objectives.
Aligning at this level is more relevant when we design a data strategy for
a smaller organization, where the employee headcount is in the low hundreds.
For a larger one, the information obtained by working with specific teams is
probably redundant. That could change in the use cases part
of the data strategy design, where goals and targets become more specific.
We can repeat the exercise we did for the middle section and discover
individual team goals.
Now that we have gone through the different levels of the organization to
discover their goals, I would advise positioning yourself at the right level
of the business strategy pyramid - where the goals are not too operational but,
at the same time, also not too strategic. With this information, we can proceed with the
other parts of the data strategy. We need to refer to the organization’s or the
separate functional parts’ goals in several elements of data strategy, most
notably in USE CASES.
The second alignment step specifies the internal team leading the data
strategy efforts and ensures that the necessary functions are actively and
continuously involved in this process. As I previously mentioned, having a
dedicated working group participating in the data strategy at all levels is es-
sential for progress. This process can be complicated due to political reasons
and the number of touchpoints (stakeholders) affecting data and technology
within even a middle-sized organization. I recommend approaching this by
filling out a Responsibility Assignment (RACI) matrix. Have a look at
the template below:
The RACI matrix should involve a mix of the people designing the data
strategy and the steering group members. In the first column, we add all
the planned activities (elements) to create the data strategy, such as a DATA
MATURITY ASSESSMENT or AMBITION LEVEL SETTING. At this stage, it’s crucial not
to zoom into details; it’s enough to specify the major components only. This
way, we reserve space for adjustments (which will always be necessary) in
advance. The strategist can then use them during the data strategy design
work. On the horizontal axis, we can then add the participants. We can label
them in one of the five categories below:
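Once participants and roles are mapped, the matrix can also be sanity-checked in code. Below is a minimal sketch - the activities, participants, and the classic four RACI letters are illustrative assumptions, not the book's template:

```python
# A RACI matrix as a nested dict: activity -> {participant: role}.
# Roles: R = Responsible, A = Accountable, C = Consulted, I = Informed.
raci = {
    "Data Maturity Assessment": {
        "Data Strategist": "R", "CDO": "A", "Head of Marketing": "C", "CTO": "I",
    },
    "Ambition Level Setting": {
        "Data Strategist": "R", "CDO": "A", "CEO": "C", "Head of Marketing": "I",
    },
}

def validate(matrix):
    """Return a list of problems; a healthy matrix yields an empty list."""
    problems = []
    for activity, assignments in matrix.items():
        accountable = [p for p, role in assignments.items() if role == "A"]
        if len(accountable) != 1:
            problems.append(f"{activity}: needs exactly one Accountable, found {len(accountable)}")
        if not any(role == "R" for role in assignments.values()):
            problems.append(f"{activity}: nobody is Responsible")
    return problems

print(validate(raci))  # → []
```

The single-Accountable check mirrors standard RACI practice: shared accountability usually means no accountability.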
relate to the feeling of frustration when exposed to such advice. The natural
immediate reaction is to roll our eyes and think: "Sure, thanks for
the advice, but isn't this obvious? How is this actionable?"
Now, how about the second dimension of CSA? What if we take the snapshot
at the wrong resolution? This means we’re missing the forest for the trees
(again, remember the tension diagram from the INTRODUCTION). This mistake
can be easier to make since a careful balance is required. On one extreme,
the strategist might spend too much time on the operational end of the work
- such as investigating daily practices and technical implementation details.
On the other extreme lies conducting a CSA with the executive suite only -
focusing on the more strategic and high-level view and ignoring the rest.
A successful CSA requires a balance between both views, hence the constant
need for the data strategist to adjust the focus of their attention.
This is what a CSA is all about - taking a picture of the organization at the
right time and the right resolution. Now let’s dive into the components of a
thorough CSA.
Systems Audit
You can look at the current state of an organization as an onion whose layers
we need to peel off one at a time until the complete picture is revealed. As
with many other concepts in this book, those layers are closely intertwined,
and the borders between them can be blurry, complicating our work.
We can split the CSA target systems into three discrete types: use cases, data,
and architecture and technology. A specific dataset, architecture, and
technology support each use case. Perhaps differently from how we would
peel an onion, we first need to start with the center. This is because the
use case is the fundamental value-generating unit of any data project, and
if we put it at the center of our work at all times, we can keep the focus on
delivering value. A second reason is that any change in the use case can have
cascading (and sometimes unpredictable) effects on the other two system
types (a topic covered in more detail in INFLUENCE CASCADE in DESIGN).
As a first step, the data strategist must uncover what data (and related)
use cases currently exist across the organization. Those, together
with new ones, will represent the organization's data portfolio (I'll go
deeper into how to manage this in PORTFOLIO MANAGEMENT in DELIVERY). It's not
guaranteed that those same use cases will be worked on as part of the data
strategy - the scope often changes during DESIGN. Still, knowing what
is currently being pursued is crucial, since it might need to be stopped (to
conserve resources), improved (if it's a valid use case), or mined for
information and code useful to new use cases. Here's an example use case audit
(FTE stands for full-time equivalent* ):
Question                                          Example
What use cases are pursued?                       Topic modeling for customer support data
What are the business goals behind it?            Topic modeling for customer support data
What technologies are used in each?               Python, LDA
What architecture components are used in each?    S3, Glue
What datasets are used?                           CRM data from SalesForce
Which people are involved?                        J. from marketing, A. as a customer support manager
What are the budgets?                             EUR 35,000, 2 FTE over two quarters
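Collected across the organization, such audit records add up to a first portfolio view. A minimal sketch of aggregating them (the use case names and figures below are illustrative, not taken from the audit above):

```python
# Each record mirrors the audit questions: what, budget, and staffing.
use_cases = [
    {"name": "Topic modeling for customer support", "budget_eur": 35_000, "fte": 2, "quarters": 2},
    {"name": "Churn prediction",                    "budget_eur": 50_000, "fte": 3, "quarters": 2},
    {"name": "Marketing dashboard",                 "budget_eur": 15_000, "fte": 1, "quarters": 1},
]

total_budget = sum(uc["budget_eur"] for uc in use_cases)
# Person-quarters: a rough measure of committed capacity.
person_quarters = sum(uc["fte"] * uc["quarters"] for uc in use_cases)

print(f"{len(use_cases)} use cases, EUR {total_budget}, {person_quarters} person-quarters")
```

Even this crude roll-up answers a question executives will ask immediately: how much are we already spending on data work?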
Even in the worst scenario, there will almost always be technical components
that can be reused for the data strategy implementation. Gathering a list of
those can be helpful for the target architecture and technology element in DESIGN.
Data Audit
Now we can dive into the oil that powers the use cases: the data. A word
of caution: if you remember what we discussed about the data strategist’s
attention earlier, this particular activity can drain it tremendously if left
unchecked. There is so much information here that if the data strategist
wants to compile an exhaustive report on it, they can waste much time -
even in the case of relatively small and homogenous datasets. The challenge
then lies in deciding the appropriate assessment for your specific case. Also
note: this work is essential for setting up good data governance in DESIGN.
The data strategist needs to ensure that necessary data is not only present
but is also of sufficient quality for use in the use cases. Fortunately for
us, this is not an entirely new problem - there are quite a few established
frameworks focused on auditing data. The two popular frameworks for data
audits are FAIR and the 4Vs. I’ll refer you to read them separately and
instead provide an example of a data audit below:
Question                              Example
Who are the data custodians?          Marketing department
Who are the data consumers?           Operations
What is the data lineage?             Salesforce → data lake → data warehouse
Is there a data dictionary present?   Yes, but only partially - it covers just 40% of the fields
What is the data quality?             30% of the rows in the aggregated dataset are missing
The answers in the example are simplified for illustration purposes. Typically,
especially for the data quality part, you will have much more to fill in, and
the deliverable probably won't fit in a table like this but will be a whole
document instead.
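Parts of such an audit can be automated. Here is a rough sketch of computing the two metrics from the example table - dictionary coverage and rows with missing values. The field names and rows are made up to reproduce the 40% and 30% figures:

```python
# Fields present in the dataset vs. fields documented in the data dictionary.
fields = ["customer_id", "region", "channel", "spend", "segment"]
documented = {"customer_id", "region"}
coverage = len(documented & set(fields)) / len(fields)

# Toy rows: three of ten have a missing value somewhere.
rows = [{"customer_id": i, "region": None if i < 3 else "EMEA"} for i in range(10)]
missing_rate = sum(any(v is None for v in row.values()) for row in rows) / len(rows)

print(f"dictionary coverage: {coverage:.0%}, rows with missing values: {missing_rate:.0%}")
# → dictionary coverage: 40%, rows with missing values: 30%
```

Automating these checks is also a first step toward the continuous data quality monitoring that good governance requires.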
The second lens through which we can do our data audit is looking at how
data moves through different systems in the organization: data lineage. A
common phrase in data science and engineering is “garbage in - garbage
out”11 . It’s primarily used in the machine learning context. In the case
of training ML models, data scientists often spend an inordinate amount
of time fine-tuning a model or trying to solve the problem by changing
the algorithm completely (usually going for a shiny new one or a different
framework altogether). While such changes might improve the model per-
formance by a few percent (in the best case), the highest performance gains
occur when better quality and quantity of data are supplied to the algorithm
(this is reflected in the data-centric AI movement spearheaded by Andrew
Ng12 ). The best algorithm on the planet will not save the project if the data
is garbage. In non-ML applications, erroneous data can also have disastrous
consequences - weekly financial reports based on the wrong numbers can
spell catastrophe. Data lineage is where these issues are dealt with.
A tipping point for deriving value from data is changing how it’s viewed
in the organization. Much is said about being “data-driven”: the organi-
zation can only achieve this if we view data as an asseta . Doug Laney, in
his excellent book Infonomics: Monetise, Manage & Measure Information13,
and in INTERVIEWS, tells a compelling story about how modern enterprises treat
things like printers, monitors, and window blinds as "assets", but not
their data. This is why we should always look at our datasets with a use
case in mind; if we have done our work well in the previous section on use
cases, it should become clear how this view can yield monetary gains - one
of the main motivations for becoming a truly data-driven organization. To sum up:
a In my conversation with Amadeus Tunis, he mentioned that data can only have a relative
Data lineage
This diagram is MECE, meaning any data in the organization should fall into
one of those categories at any point in its existence. Our job is to document
how it travels through the organization.
One helpful way to think about this problem is to classify the different
datasets using the BSG method. It's a simple classification scheme that can
help in designing architecture and technology strategies. It stands for the
different classes of data assets - Bronze, Silver, and Gold. Using metals to
describe the state of data is already established in data engineering, and
you can learn more about it from articles by leading practitioners.
Bronze: This is the data that enters the system. It's data in its rawest, atomic
form - without any processing. It is usually large in quantity, low in quality,
and of limited business use. Focusing on just collecting this data is dangerous,
since it's of limited use in this form - such a strategy is called "boiling the
ocean", and is exactly as productive as it sounds.
Silver: This is the data resulting from processing and enrichment steps. This
data is also the input to ML algorithms. It is usually smaller in size and of
intermediate value. It can also be the input to self-service analytics systems,
where data analysts can build the presentation layer.
Gold: This is the refined, aggregated data ready for direct business use -
powering reports, dashboards, and decisions. It is usually the smallest in
volume but carries the highest value.
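Because the BSG classes form an ordered scale, the audit is easy to script. A minimal sketch (the dataset names and tier assignments are illustrative assumptions):

```python
from collections import Counter

TIERS = ["bronze", "silver", "gold"]  # ordered by increasing refinement and value

datasets = {
    "raw_crm_events":    "bronze",
    "cleaned_customers": "silver",
    "revenue_by_region": "gold",
    "raw_clickstream":   "bronze",
}

def promote(tier):
    """Move a dataset one refinement step up; gold stays gold."""
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]

# Audit view: how much of the data estate sits in each tier?
print(Counter(datasets.values()))  # e.g. two bronze, one silver, one gold
print(promote("bronze"))           # → silver
```

A skewed count - say, everything stuck in bronze - is a quantitative symptom of the "boiling the ocean" anti-pattern described above.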
By using those ideas, we can go through the following steps to complete the
audit:
The final audit we need to conduct to complete the CSA is that of the
technology and architecture. Those are the elements that power the use
cases, and many make-or-break issues occur here.
Question                                                  Example
What are the data-generating systems?                     Salesforce
What are the data-consuming systems?                      PowerBI
What are the data storage systems?                        Flat files on S3
How much is on cloud versus on-premise?                   40% on cloud
What languages and frameworks are used?                   Python, Scala, Java, Keras
What are potential issues with security and compliance?   RPO and RTO are not set; only weekly backups
Like the other audits, you usually would go much deeper here. For each of
those items, it’s also important to note any potential concerns or upcoming
plans associated. When working through the use cases in DESIGN, you’ll need
to have those at hand.
Finally, all those audits can help us conduct the next element, the DATA
MATURITY ASSESSMENT.
It’s not surprising that the journey is not a straight line, but why this shape
in particular? I’ll show you by describing what the five stages stand for. We’ll
go through the lowest maturity level to the highest:
Waiting (1)
Starting (2)
Toiling (3)
Accelerating (4)
Leading (5)
DMAs have been around for quite some time, and many consultancies offer
comprehensive solutions. My recommendation is appliedAI's tool. Another
great source is the DMBOK book15, specifically its Data Management Maturity
Assessment chapter.
Based on those criteria, you can score and classify the organization. In this
work, you’ll need copious amounts of trust once again. In most cases, the
organization won’t be a leading one - if it were, it would already have a
designed data strategy at its disposal. Thus you’ll find yourself in a situation
where you need to communicate that the organization falls short of the
ideal. Nobody wants to hear this, least of all senior executives - and those
are often your target audience. To cushion this potential blow, you’ll need
to demonstrate how being honest about the situation and looking at the facts
is the first step to making progress. With the right strategy, even giant
corporations can achieve dramatic turnarounds - so it’s not all doom and
gloom. Those goals are your North Star* .
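The scoring step can be sketched as mapping per-criterion scores onto the five stages listed above. The criteria and the simple averaging rule here are my illustrative assumptions, not a prescribed DMA methodology:

```python
STAGES = {1: "Waiting", 2: "Starting", 3: "Toiling", 4: "Accelerating", 5: "Leading"}

def maturity_stage(scores):
    """Average the 1-5 criterion scores and map to the nearest stage."""
    mean = sum(scores.values()) / len(scores)
    return STAGES[max(1, min(5, round(mean)))]

# Hypothetical assessment criteria and scores for one organization.
scores = {"data quality": 2, "tooling": 3, "skills": 2, "governance": 2}
print(maturity_stage(scores))  # → Starting
```

Real DMA tools weight criteria differently and report per-dimension profiles rather than a single number, but a single label is often what sticks in the executive conversation.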
Pilotitis: A term I'll often refer to in this book, pilotitis describes the
propensity of organizations to commit only to individual, isolated pilot
projects, without contemplating the need to build on a solid foundation and
scale. Data products are worked on in isolation. So what are the causes
of this condition? First, it’s the easy way (more on this in the LEAN DATA
and SOFT AGILE in DELIVERY). No need to establish a complex organizational
structure; make a small team of three to five people, and off you go. Also
no need for complex IT or data architecture changes. Second - it’s cheap.
Doing things at scale requires an investment, both in people and technol-
ogy. And third (perhaps most insidiously): it’s fun. It plays to the human
* In the old days, before better technologies were widespread, travelers relied on the celestial bodies for
navigation, the North Star being one of them. It is used as a metaphor for a long-distance goal, which we
must constantly pay attention to while traveling to ensure we don’t sway from the right course.
Pilot projects are often low in effort (and also low in impact). Perhaps more
crucially, as time goes on, such projects do not build on top of each other.
Granted, it requires a more significant upfront investment to structure the
work properly. Still, the impact will also be much higher - to the long-term
satisfaction of the implementation teams and the executives. Let's illustrate
this point with a practical example. Imagine that in the first pilot project,
you create a data pipeline. Perhaps you decided to use AWS S3 to store the raw
data and AWS Athena to query it. Even if this pilot project ultimately proves
unsuccessful, in the future, you might still reuse much of the same
architecture (or at least with minor ad hoc adjustments) for the next pilot
project. This is much more productive than reinventing everything from scratch
each time. Such project management needs to be strategic - taking a wider-angle
view of the use cases, looking further into the future, and with a better
understanding of the available resources.
Let's set some signposts for the road ahead from different angles. This
element will require us to move beyond the technical and work with our
stakeholders on a strategic level again. We must be mindful here and strike
a good balance between optimism and realism. Remember that data and AI work
starts with the burden of high expectations, fear of the unknown, and
escalating costs. We have our work cut out for us! The gap analysis aims to
set the ambition level for the organization and determine how far off that
goal is. Two methods are essential for this work; let's tackle them one at a time.
Competitor Analysis
analysis, the strategist can use the DMA we covered previously to see which
group the competitor organizations fall into. After this initial binning of
the competitors into their maturity states, you can go deeper and try to answer
questions like these:
Knowing the state of the art, we can go back to our organization and talk
to the senior stakeholders to determine the right ambition level. This is
another vague and politically charged topic. A common thread in this book
is that we are dealing with complex, evolving systems. This distinction is
crucial because it helps us avoid making static, oversimplified assessments
that would lead to quickly outdated decisions. This pattern continues -
circumstances are bound to change. Humans are notoriously bad at predicting
the future, especially mid-to-long term, but valuable tools can help with
that. Artificial Intelligence (AI) is one of those technologies whose impact
is overestimated in the short term and underestimated in the long one (this
is known as Amara’s Law).
There are two parts to ambition level setting. The first is having follow-up
conversations with senior management, with the results of the DMA in hand.
The second is anticipating future trends and scenarios, both internally and
externally. For this purpose, we can use a tool called the futures thinking
toolkit (there is a multitude of other great tools, including McKinsey’s
Horizon Innovation Framework). This is a whole field of its own. Still, we’ll
take its highest-level tool and fit it to our purpose. Have a look at the
diagram below:
Probable: Discovering what is the most likely outcome can be very hard. It
requires careful consideration of many factors and a deep understanding of
the context.
Plausible: This is where the scope widens. You can deduce those scenarios
by extrapolating from the preferable and probable scenarios, albeit with
minor adjustments.
Possible: Here, we enter a more creative territory, and this scenario has
the most extensive scope. The purpose here is to discover potential missed
opportunities (similar to the Data Thinking exercises we’ll cover in the USE
CASES section) and black swan events17 .
Now, armed with a solid understanding of where the future can take us, we
can have another internal discussion with the senior stakeholders about
where the organization’s ambition lies regarding data. You can organize
several ambition levels on a timeline. Such increased resolution can help
the strategy become more specific and move beyond "we want to be the Google
of restaurant chains" - a typical example of goals masquerading as strategies,
explained very convincingly in Good Strategy, Bad Strategy by Richard P.
Rumelt18 (also recommended in my interview with Martin Szugat). This step
requires much input from the other elements of
the CSA - the competitor analysis, maturity assessment, the data audit and
dictionary, and the futures thinking work. It can be helpful to position the
company around the competitors based on different assessments.
Now we can close the circle of the DUE DILIGENCE phase and reveal the whole
picture:
Here we can see how the CSA and the ambition level setting elements enable
the GAP ANALYSIS. And guess what’s inside of it? The actual data strategy that
we’ll start designing in the next phase.
Summary
Let’s recap what we’ve learned so far.
Now we know where we are and where we want to go. Everything is in place
for us to design the data strategy in the next phase.
Part II: Design
Useful analogies—The Influence cascade—Use case ideation, feasibility study,
and prioritization—Data architecture and technology—Data governance—
Operating model—Roadmap
Overview
“A system is never the sum of its parts. It’s the product of their
interaction.”
–Russell Ackoff
So far, most of our work has been more passive than active. We have learned
much about the organization’s ability (or lack thereof) to deliver value from
data. This groundwork can already prove helpful to the organization: it can
be used to take initial tentative steps in the right direction - even without
further recommendations from the data strategist or having a data strategy
in place. Still, at this point, there's little in the way of actionable advice.
This situation can be frustrating for the data strategist, since we always
strive to deliver value as fast as possible. But however tempting it is to jump
directly into designing the strategy, doing so would be ill-advised.
Without essential elements such as CSA, DMA and GA, any recommendations
and plans would only be based on gut feeling rather than facts about the
organization’s particular circumstances and aspirations. Shooting in the
dark is never a viable strategy - we might just as well copy a data strategy
made for another organization.
With the deliverables from DUE DILIGENCE in hand, we can now look at the
essential elements of data strategy DESIGN, MECE as always:
This view of data strategy is different from the currently popular ones. Many
data strategy whitepapers represent data strategy as a house - where the
use cases sit on top of elements such as operating model and enabling
culture or data architecture. While such a layout can seem logical at first
Before starting to design the data strategy, I want to show you one essential
tool that data strategists often use for such work: analogies. They come in
handy when we must explain complex topics and their interconnections to
a diverse audience.
Talking about data projects can be confusing, even for experts. This is
primarily due to the extensive technical terminology used in the field.
Walking into a technical data meeting, you’ll often hear opaque concepts
such as data lakes, ingestion layers, data warehouses, anomaly detection
models, and everything in between. Finding good analogies for such a wide
range of terms might seem challenging, but here are some good ones - tried
and tested throughout various consulting engagements.
You might have heard this one: "data is the new oil" (there's also the opposite
view, that data is the new water: while it's true that organizations are
sitting on piles of data, making that data actionable is as big a challenge as
it's ever been* ).
There are several good reasons why this analogy has become widespread in
the community. First, it relates to the idea that data can be seen as even
more valuable than oil. It powers our digital lives, workspaces, governments,
and vital infrastructure. This influence is on par with that of oil during the
industrial revolution. Second, it maps well to data pipelines, containing
* See the vanity projects from Joe Reis and Matt Housley’s “Fundamentals of Data Engineering”.
data processing and enrichment steps. Like oil needs refinement before
becoming useful, data requires a similar upfront investment of work before
we can reap its benefits. Data also travels through pipelines, gradually
improving in quality and suitability for business use (becoming fit-for-
purpose). This analogy is excellent at explaining data architecture.
This second analogy became popular more recently. It has been advocated
for by Google’s Chief Decision Scientist, Cassie Kozyrkov* . This one can be
more playful (and less environmentally disturbing) than the oil one. I find
it particularly helpful in describing more AI-centric data terminology and
processes. Let’s see if GPT knows about this:
* Her original article introducing this analogy is called “Why Businesses fail at Machine Learning,” and
The arrow represents both the passage of time and the associated effort to
reach the final goal, the peak of value. At the beginning of this journey, we
start at our well-built home city, where we design the strategy. Imagine
everything here is neatly organized in different buildings, connected by
straight paved roads. A well-designed data strategy should resemble this
- it should be clear, concise, and structured. Nevertheless, to reach our final
destination, we must leave the safe confines of our homes and venture out
into the wilderness. Between us and the peak of value, we must traverse
obstacles such as the river and forest (the complexity of implementation).
This operational barrier is where data projects and products fail. This can
be the most challenging task in data strategy, and the whole DELIVERY part
of the 3D model is dedicated to how to overcome it.
I’ll refer to those analogies throughout the text, but now let’s focus on
understanding why we should always start data strategy design by working
on the use cases.
During the CSA, I covered how interrelated data strategy elements can be.
Modifying the design in one area can have dramatic effects on another. This was the
main reason the SYSTEMS AUDIT focuses on use cases first. The pattern repeats
in DESIGN. If we don’t consider the consequences of changing use cases, our
strategy might quickly end up disjointed and unfocused. It will also probably
become obsolete in the first weeks of implementation. Another potential
issue is the inherently cyclical nature of developing data products. It’s often
difficult to determine what approach will work beforehand, and we must
remain flexible. Any assumptions made here will be challenged later* . In my
conversation with Datentreiber’s Martin Szugat, he advocates for running
an experimentation phase before committing to implementation.
Imagine we’re developing a use case for detecting abusive content on social
media. Our initial idea might be to create a text classifier, which is trained
on a corpus of tweets and outputs a predicted class (for example, abusive
and non-abusive). We then evaluate and commit to a technology stack
supporting this product. In this case, a good approach would be to use
Python (since there’s a large number of Natural Language Processing (NLP)
related open source packages in Python’s ecosystem) and use a Support
Vector Machine (SVM) as a classification algorithm. Once this is decided,
we can design the appropriate backend architecture for this tech stack to
run on. For example, we can store text data in a NoSQL database (such as MongoDB).
* This is also described in DELIVERY, where I talk about DATAOPS.
So far, so good! But suddenly, the business development team realizes that
there are too many similar solutions on the market and that we need to
pivot. There’s a niche available for the same product but focused on video
data rather than social media texts. We quickly become excited again but
will eventually realize that the whole stack needs to be redesigned (even if
the problem statement and target audience both remain unchanged). The
fact that we now need to work on video data makes the technology tooling
completely different (an SVM is probably no longer a good choice of algorithm,
and neither is MongoDB for data storage). Now, storing the raw video
files on AWS S3 and using Convolutional Neural Networks (CNNs) sounds
much better. With such changes, the work can come crashing down like a row
of dominoes.
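To make the cascade concrete, here is a toy bag-of-words perceptron standing in for the text-classification stack described above (a production build would use something like scikit-learn's LinearSVC; the training examples are invented for illustration). Notice that every line assumes text input - none of it survives a pivot to video data:

```python
from collections import defaultdict

# Tiny labeled corpus: 1 = abusive, 0 = not abusive (invented examples).
docs = [
    ("you are an idiot", 1), ("shut up idiot", 1), ("total idiot behaviour", 1),
    ("have a nice day", 0), ("thanks for the help", 0), ("what a nice result", 0),
]

weights, bias = defaultdict(float), 0.0

def classify(text):
    # Bag-of-words score: sum the learned weight of each token.
    score = sum(weights[tok] for tok in text.split()) + bias
    return 1 if score > 0 else 0

# Perceptron training: nudge weights whenever a document is misclassified.
for _ in range(10):
    for text, label in docs:
        error = label - classify(text)
        if error:
            for tok in text.split():
                weights[tok] += error
            bias += error

print(classify("you idiot"), classify("have a nice day"))  # → 1 0
```

The tokenizer, the weight table, even the evaluation examples are modality-specific - which is exactly why the pivot to video rips through the entire stack.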
You can see how any changes in high-level decisions regarding data prod-
ucts can have significant (sometimes unforeseen) consequences down the
line. At every step in the data strategy design, we need to be mindful of
the INFLUENCE CASCADE, understanding that any fundamental changes to
requirements need to be accounted for in other elements of data strategy.
We need to budget for this risk, and this is yet another motivating factor for
having a StratOps approach.
Now it should be clear why the USE CASES is the starting point of data strategy
design. Let’s get to work!
Use Cases: Designing Data Products
innovative solutions to prototype and test. They define five common verbs
associated with the process: Empathize, Define, Ideate, Prototype, and
Test. I would go a step further and argue that design thinking is more than
just a process: it’s a way of thinking, which is particularly powerful when
reasoning about difficult-to-grasp problems (the black boxes we covered
in SYSTEMS THINKING FOR DATA STRATEGY), which the field terms "wicked problems".
Next, how do we proceed with gathering ideas? Doing this should be fun, but
as mentioned before, we should be wary of pilotitis. While it's easy to come
up with many ideas, the challenge lies in selecting the ones with the highest
impact that are also feasible with the available resources. We could build any
product we wanted with unlimited resources (including time). Unfortunately,
this is not our reality, and we always need to operate under constraints
(which can be even more limiting in the case of data products, due to
heightened expectations and often larger budgets).
The first issue we face is the cold start problem. Since the scope of a data
strategy is often expansive, getting started can be daunting. Stakeholders
might be reluctant to make an extensive investment before seeing tangible
results. In such situations, running a “lighthouse project” can be a good
idea. The task is to start with a simple yet impactful use case to demonstrate the potential value of data. If the project is a success, the data strategist can use the learnings and newfound
trust to proceed with other use cases. Additionally, there’s a high likelihood
that the organization can reuse some of the components developed for the lighthouse project.
The data strategist should execute the steps sequentially (if there are too
many ideas, it might be useful to do a rough prioritization round before
the feasibility assessment), but as you can see from the feedback arrows
between them, sometimes we need to return to the drawing board. This can
happen when we come up with a seemingly great idea that is also feasible from a technical perspective but, due to a resource constraint, still falls short of the final prioritized list. In that case, we must go back to the
ideation phase to gather new ideas. While it may seem like a setback, we
save ourselves much pain down the line (data products tend to be dangerous
when not appropriately executed) and avoid falling victim to the sunk cost fallacy.
Ideation
While it might be true that there’s no shortage of ideas in data work, good
ideas are hard to come by. What do I mean by good? And, perhaps more
importantly, how can we compare two (or more) data use case ideas? This
question might initially seem vague and subjective, but it turns out to be
surprisingly quantifiable. But before we go on to select ideas, let's first learn the best way to gather them: workshops.
Since this is one of the essential tools in the arsenal of a data strategist, it
makes sense to spend some time describing its basic elements and defining
attributes. What is a workshop? A colleague from my consulting days used
to joke that this term has been so overused recently that it mostly just means
a more extended meeting. To me, a workshop is a collaborative meeting with
a clear agenda, objectives, and facilitation focused on a topic that can’t be
resolved in any other way. Let’s have a look at the most important workshop
features.
Participants and key roles: As a rule of thumb, there should be around five to seven attendees. Any more, and the session can become challenging to manage and schedule. Any fewer, and we might not gather all potential viewpoints and ideas. We aim for a diversity of both opinions and skills (the more
T-shaped people* in the room, the better). The role usually assumed by the
data strategist is that of a facilitator (see GPT’s definition below). You can
think of this role as a referee in a football match - their job can be deemed
successfully executed when their presence goes mostly unnoticed. Since this
person is mainly engaged in leading the workshop, an additional person is
required to take notes and document the proceedings (you will refer to those
often later). For client projects this role is oftentimes assigned to the data
strategist, while the facilitator is nominated internally. The final key seat at
the table is reserved for a decision maker. You should always try to get this
person in the room, or at least have someone with a clear decision-making
mandate present. The reason is that workshops are often decision-making
focused, and can be disruptive (hopefully in a positive way) activities - thus
* People with a broad skill set and deeper specialization within a single area.
Now that we know a workshop’s essential features, let’s learn how to apply
a data ideation flavor. Here we’ll use many of the DUE DILIGENCE deliverables,
most notably the SYSTEMS AUDIT. There are many approaches to ideating data
projects, and from my experience, methods from design thinking transfer
very well into the data domain* .
Goals and deliverables: We should start with the end in mind. What do we
* This has now grown into the “data thinking” field.
need to run our use cases? We’ll need a list of prioritized potential products.
The deliverables need to be informative enough for technical requirements
gathering and generation of design documents (more on this in DELIVERY).
potentials and their respective values over this journey. Start automating
a human-led process first. One word of caution: ensure you optimize the
process in the right place. For example, it would be hard to see tangible
results if you automate a process coupled with a human action that is still
very manual. Imagine you are building a Computer Vision (CV) model to
check the quality of a particular material but still have a human manually
photographing each sample. You must look at the whole human process
first and ensure the absence of bottlenecks and redundancies.
After the warm-up, the participants gather ideas with sticky notes. This
activity should take around 15 minutes - it’s safe to assume the central
ideas should come out rather quickly. If you attempt to spend more time
on this, the participants’ focus can wander. The third session is where the
participants vote on the ideas gathered. There are different voting methods,
but I would recommend doing this anonymously (to ensure there’s no
political bias in the procedure), with each participant having three votes at
their disposal. This step can take around ten minutes. In the final stage, the
ideas are clustered into similar topics or common challenges they address
- here the data strategist needs to be quite active and ask many clarifying questions to ensure the meaning behind the ideas is clear and eliminate
possible redundancy. Finally, the clusters get prioritized using a tool such
as a two-by-two Impact-Effort matrix (more on this later in PRIORITIZATION).
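As a rough sketch of the anonymous voting step, assuming each participant gets three votes, the tally could look like this (the idea names and ballots are made up for illustration):

```python
from collections import Counter

def tally_votes(ballots, votes_per_person=3):
    """Count anonymous dot-votes; each ballot is a list of idea names."""
    counts = Counter()
    for ballot in ballots:
        # Enforce the three-votes-per-participant rule.
        assert len(ballot) <= votes_per_person, "too many votes on a ballot"
        counts.update(ballot)
    # Return ideas ranked by vote count, highest first.
    return counts.most_common()

# Hypothetical ballots from a five-person workshop.
ballots = [
    ["churn model", "data lake", "dashboard"],
    ["churn model", "dashboard", "lead scoring"],
    ["data lake", "churn model"],
    ["dashboard", "data lake", "churn model"],
    ["churn model", "lead scoring"],
]
ranked = tally_votes(ballots)
```

Because the ballots are anonymous lists rather than named submissions, the political bias mentioned above is kept out of the procedure by construction.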
This section provides just the boilerplate for a data ideation workshop. If
you need a bit more detail, have a look at three great approaches:
strategy.
• Data Thinking: A Canvas for Data-Driven Ideation Workshops (link): A
flexible, holistic method of gathering ideas.
• AI Ideation Cards (link): A collection of cards to help business users
quickly come to terms with what AI can do in a business setting.
Feasibility Study
Data product ideation is easily one of data strategy’s most exciting and
enjoyable elements. We should have a lot of excitement building up - and
we’ll need it now since the next step is trickier. Here the ideas meet reality,
and the data strategist needs to estimate the feasibility of those ideas,
eliminating many in the process.
Note here how architecture and technology go hand in hand. All too often, feasibility studies conducted within organizations neglect to take architectural constraints into account.
might be suitable components that the data strategist can reuse in new use
cases.
Let me explain how A&T constraints affected a real-world use case. Despite
considerable advances in cloud-first infrastructure and associated tooling,
making advanced use cases scale remains a challenge. This is showcased by the abundance and high salaries of data engineer, cloud engineer, and data architect positions - their skills are probably the most sought-after in the whole data industry. Regarding technology, we can be constrained by the off-the-shelf tools currently available on the market. A
good example of this is trying to build self-driving cars in the 90s. At that
point, people had already started to see the promise of neural networks for
CV tasks, but nobody could apply them to large volumes of images, let alone to real-time video object detection in a moving vehicle. Other technological
constraints also affected this use case, for example, the lack of good internet
speeds and coverage to transmit all the data (and support the latencies
required by the use case) or the raw computing power in the car itself. With 5G networks and edge computing, many of these constraints have been removed, and the self-driving car use case has become feasible.
* RTO stands for Recovery Time Objective (how long can the application be down), and RPO for Recovery Point Objective (how much data loss is acceptable).
Resource Constraints
The cloud itself can be a significant cost center. More on this later
in BUDGETING.
People can constrain what use cases we do in several ways. Beyond their
salaries, the most important factors to consider are their skills and experi-
ence.
Some use cases, especially ones that require a complex orchestration of ser-
vices, and managing mission-critical infrastructure (such as the aforemen-
tioned self-driving cars), can require not only knowledge of technologies
and frameworks but also pure on-the-job experience. More senior engineers
would not only know what needs to be done but, more importantly, how.
By working with experienced engineers, you could avoid technical debt
and unnecessary complexity in the code. The lack of such people can be
a tremendous constraint on what projects are feasible. The same goes
for skills themselves. If the current stack in the organization (that we
discovered in the CSA) is built on top of an arcane or niche technology, such
as COBOL or Elixir (as is often the case with legacy banking systems), it would not be feasible to switch to Python and readily take advantage of cloud services for that matter, since most of the SDKs* for AWS, GCP, or Azure
* Software Development Kit: a packaged collection of software tools that help developers write software
Here are the types of questions you can ask in this step:
• What are the skillsets of the people we have available? How do they fit
with the use case ideas?
• What are the experience levels of the data team members? Can they
cope with the requirements of a brownfield project, for example?
Data Constraints
In my book Python and R for the Modern Data Scientist: The Best of Both
Worlds20 , I argued that the data format is essential for what we can do with
it. Let’s see several different data formats to understand how they constrain
us. Note that this list is by no means exhaustive, but these are by far the
most popular data formats (beyond the standard tabular format):
Image data: Satellite imagery, celebrity photos, animal photos. Videos can
also be seen as part of this category.
All those different formats require very different technologies and sup-
porting A&T (another example of the INFLUENCE CASCADE). For example,
you might use R when working with time series data, while for image data you might reach for Python. Additionally, many algorithms will not perform similarly for more advanced use cases, such as machine learning. For example, for time series it might be more beneficial to use a Long Short-Term Memory (LSTM) neural network, and for text data an SVM.
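The cascade from data format to tooling can be made explicit as a simple lookup table. The pairings below follow the examples in the text plus common defaults; treat them as illustrative starting points, not prescriptions:

```python
# Illustrative mapping from data format to a typical stack;
# the entries follow the examples in the text, not a fixed rule.
FORMAT_STACK = {
    "tabular":     {"language": "Python or R", "storage": "data warehouse",
                    "model": "gradient boosting"},
    "time series": {"language": "R",           "storage": "time-series DB",
                    "model": "LSTM"},
    "text":        {"language": "Python",      "storage": "document DB",
                    "model": "SVM"},
    "image":       {"language": "Python",      "storage": "object storage (e.g. S3)",
                    "model": "CNN"},
}

def suggest_stack(data_format):
    """Look up a starting-point stack for a given data format."""
    try:
        return FORMAT_STACK[data_format]
    except KeyError:
        raise ValueError(f"no suggestion for format: {data_format!r}")
```

Encoding the mapping this way makes the INFLUENCE CASCADE tangible: change the data format key, and every downstream technology choice changes with it.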
The questions we need to ask here mirror those in CSA, but they should all
have the added information of new use cases we want to pursue. Armed with
the knowledge of those constraints, we can use them to prioritize the use
cases.
Prioritization
Now we can finally trim the use case list down. The hardest thing about
prioritization is choosing the right metrics. The format is straightforward;
any two-by-two matrix will do (there are other interesting methods, such as
PICK* ). Typically one of the metrics indicates the importance of the use case
- we can also substitute this for “impact”, “business value,” or something
similar. On the other axis, we can use “urgency”, “effort required,” or
“feasibility”. Then we can group the use cases into quadrants (again, it's a good idea to do this in a workshop setting). The use cases we should focus on first will appear in one of the four quadrants - for example, the very important and urgent (top right) - and we should start with those.
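The two-by-two sorting itself is mechanical once every use case has a score on each axis; here is a minimal sketch (the 1-10 scale, threshold, and quadrant labels are assumptions for illustration):

```python
def quadrant(importance, urgency, threshold=5):
    """Place a use case in an Importance/Urgency quadrant (1-10 scales)."""
    high_imp = importance >= threshold
    high_urg = urgency >= threshold
    if high_imp and high_urg:
        return "do first"   # top right: important and urgent
    if high_imp:
        return "schedule"   # important, not yet urgent
    if high_urg:
        return "quick win"  # urgent, less important
    return "drop"           # neither

# Hypothetical scored use cases.
use_cases = {
    "churn model": (9, 8),
    "logo color dashboard": (3, 2),
}
placed = {name: quadrant(i, u) for name, (i, u) in use_cases.items()}
```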
Here it’s good to jump ahead and look at the various measurement
metrics for data products described in DELIVERY. Ideally, the data strategist should measure the business impact discussed here using those metrics.
Data Architecture and Technology:
Establishing Foundations
First, based on the CSA, we should determine the type of architectural setup
we have (similar to what we did when estimating constraints): greenfield
or legacy. In the former, our job is to help design an architecture and technology strategy from scratch - there's none in place. In the latter, we need to ensure the new recommendations can be properly embedded in
or connected to existing systems. Unfortunately, there will almost always
be some legacy system (whether part of the data stack or another system
with which the data stack needs to interact) to take into account. This
second scenario is arguably harder to operate since it adds more complexity
and constraints to our target A&T design. The work we did in CSA should be
sufficient in deciding which situation we find ourselves in.
Next, we should do some definition setting. We hear and use the words
architecture and technology all the time but rarely pause to reflect on what
they mean. This leads to miscommunication and issues with “selling” our
work. If we can’t explain why we need a significant investment in rebuilding
or extending the data architecture, we probably won’t get the funds to do
so. If you ask five different people in tech to provide you with definitions,
you will get five different answers. This reflects the broad nature of those
concepts. To help here, we can use some of the analogies from the start of
this part:
Programming languages, data cleaning scripts, glue code pulling data from
third-party APIs* , enrichment queries to knowledge bases, processing for
ML training, the ML algorithms, and others.
Why AWS? The technical examples in this book are mostly focused on
Amazon Web Services (AWS). While there are other cloud providers, and
some of them have other benefits, I selected AWS since, in my opinion
(at the time of writing), they have the most diverse portfolio of services,
supporting many use cases. Almost any service in AWS has a correspond-
ing alternative in Google Cloud Platform (GCP) or Microsoft Azure. As
mentioned, I don’t want to go into detail about all the services and how
to use them, but if you need more information, an excellent resource is
the AWS Cookbook - it’s full to the brim with recipes you can use for your
data work in the cloud.
first, then open-sourced later. For example, Apache Airflow was first incubated at Airbnb.
of the building, which supports the elements inside. Or, for a more medical
analogy - this is the skeleton of the technology’s muscles.
Cloud native and Modern Data Stack are terms used to de-
scribe software or applications designed to run on cloud computing
platforms. These applications are built using cloud-native tech-
nologies, such as microservices and containers, which allow them
to be easily deployed, scaled, and managed on cloud infrastruc-
ture. Cloud-native applications are designed to be highly scalable,
resilient, and responsive, making them well-suited for modern,
cloud-based environments. They are also intended to be built and
deployed using a DevOps approach, which enables teams to rapidly
iterate and release new features and updates (such as serverless
computing, see APPENDIX). Overall, the goal of cloud-native design is
to enable organizations to take full advantage of the flexibility and
agility of cloud computing to drive innovation and improve their
operations* .
With the definitions out of the way, we can confidently proceed to design
the architecture and technology elements of the data strategy. Here we’ll
use the word target. As you remember in GAP ANALYSIS, in data strategy, we’re
always striving to achieve a goal, a target state. This is essential to always
keep in mind when working on architecture and technology since it’s easy to
* An excellent repository of stacks is available on moderndatastack.xyz.
get bogged down in unnecessary details. The most important outcome from
this part of a data strategy is preparing the deliverables and the shortest
path to achieving them.
Let’s start with the why21 (Amadeus Tunis explains why this is an essential
skill in INTERVIEWS): why do we care about data architecture? There’s no
better way to visualize this than Shopify’s Data Science Hierarchy of Needs.
Here’s a simplified version:
This diagram puts the three pillars of technical data work in context. With-
out a solid data collection, storage, and processing foundation, the whole
pyramid would break down long before we successfully deploy more ad-
vanced data science use cases (such as prediction) at scale. As we go
up the pyramid, more data engineering tasks such as transformation and
enrichment build upon the data architecture, enabling various data science
use cases.
As with many common complex topics, the best way to manage the com-
plexity of a larger task is to break it down into smaller, more manageable
chunks - this will also help us get started more quickly. We can break any
architecture (the stack) into different components. There are two ways to
do it:
The second option is to design in layers. Contrasting to the first one, this is
more appropriate when we want to support many use cases and have many
different sources available. This is the case for larger organizations. For this
approach, have a look at the layers below* :
* Here, the concept of abstraction layers we discussed in the SYSTEMS THINKING FOR DATA STRATEGY
Ingestion
This layer represents data entering the system. In large organizations, data
rarely originates from one source but rather from a set of many different
ones. This part of data architecture should be automated and orchestrated
(see the orchestrator diagram below). Examples of data sources include
third-party APIs, databases, data from IT systems (such as CRMs), scraped
web data, and others. The main challenges in this phase are orchestrating
the data collection scripts and working on the database schema (more on
this in APPENDIX). This layer depends heavily on the Storage one: depending
on the storage type, adjustments in ingestion are required.
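A toy version of the ingestion layer can be sketched as follows: each source is a fetch function, and a small orchestrator runs them and hands records to raw storage. The source names and record shapes are invented, and a real setup would use a scheduler such as Airflow rather than a plain loop:

```python
def ingest(sources, sink):
    """Run each source's fetch function and write its records to the sink.

    `sources` maps a source name to a zero-argument callable returning
    a list of records; `sink` is a dict standing in for raw storage.
    """
    for name, fetch in sources.items():
        records = fetch()
        # Store raw records per source, exactly as they arrived.
        sink.setdefault(name, []).extend(records)
    return sink

# Stub fetchers standing in for a CRM API and a web scraper.
sources = {
    "crm": lambda: [{"id": 1, "name": "Acme"}],
    "web": lambda: [{"url": "https://example.com", "text": "hello"}],
}
raw_storage = ingest(sources, {})
```

Injecting the fetchers as callables keeps the orchestration logic independent of any one source - the same pattern an orchestrator applies at scale.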
Storage
The data from the ingestion layer needs to be stored internally. The “single
source of truth” (SSOT) is an essential motivating factor for good data
storage. This means we should always hold the raw data somewhere in
the system, and any downstream processing should be stored separately.
Through this, we can ensure data quality and add some redundancy to the
system in case of errors in processing - in that case, we can always retrieve
the data. In the old days, when data volumes were smaller, and most of the
data was generated by internal systems, databases were the most common
place to store data. Nowadays, the volume, velocity, and variety (as covered
We can list the different options at our disposal for storing data. Those are
explained in detail in the APPENDIX.
Keywords: AWS S3, Azure Data Lake Storage, Google Cloud Storage, Ama-
zon Redshift, Azure Synapse Analytics, Google BigQuery
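The single-source-of-truth rule can be restated as: raw data is written once and never mutated, and every processing step writes its output somewhere else. A minimal in-memory illustration (the store layout is invented for the sketch):

```python
class DataStore:
    """Toy store separating immutable raw data from derived outputs."""

    def __init__(self):
        self._raw = {}        # single source of truth, write-once
        self.processed = {}   # derived data, safe to rebuild at any time

    def write_raw(self, key, records):
        if key in self._raw:
            raise ValueError(f"raw data {key!r} is immutable")
        self._raw[key] = list(records)

    def read_raw(self, key):
        # Hand out a copy so callers cannot mutate the raw layer.
        return list(self._raw[key])

    def derive(self, key, out_key, transform):
        # If a transform turns out to be buggy, we just re-derive from raw.
        self.processed[out_key] = transform(self.read_raw(key))

store = DataStore()
store.write_raw("events", [{"user": "a"}, {"user": "a"}, {"user": "b"}])
store.derive("events", "unique_users", lambda rs: sorted({r["user"] for r in rs}))
```

The redundancy mentioned above falls out naturally: because the raw layer is untouched, any processed dataset can be rebuilt after an error.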
For this large amount of unstructured data to be useful for the business
(let’s say as an input to an ML model), it needs to be processed. Example
processing steps include deduplication, feature engineering, dimensional-
ity reduction, and others. In most systems, this is done by orchestrating
different scripts and services. There are two modes of processing the ingested data: batch and streaming (expanded in the APPENDIX).
This layer is well represented by two paradigms, ETL (and more recently,
ELT):
Extract: The first step in the ETL process is to extract data from
various sources, such as transactional databases, flat files, or other
data warehouses. The extracted data is typically unstructured and
raw and may need to be cleaned and filtered to remove irrelevant
or duplicate information.
Transform: The second step in the ETL process is to transform the extracted data into a consistent format suitable for analysis. This typically involves cleaning, deduplication, standardization, and restructuring so that the data conforms to the schema of the target repository.
Load: The final step in the ETL process is to load the transformed
data into a central repository, such as a data warehouse, for storage
and access. This typically involves loading the data into a database
or other structured data storage system and then creating indexes
and other access structures to enable efficient querying and analy-
sis.
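The ETL steps above can be sketched end to end with nothing but the standard library; the source records and the `sales` schema are made up, and sqlite3 stands in for the warehouse:

```python
import sqlite3

def extract():
    """Pretend source: raw records with duplicates and string-typed values."""
    return [
        {"id": 1, "amount": "10.5"},
        {"id": 1, "amount": "10.5"},   # duplicate to be removed
        {"id": 2, "amount": "3.0"},
    ]

def transform(records):
    """Deduplicate on id and cast amounts to floats."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        out.append((r["id"], float(r["amount"])))
    return out

def load(rows, conn):
    """Load cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Swapping the order of `transform` and `load` calls would give the ELT variant: load raw rows first, then run the transformations inside the warehouse.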
Data architecture is meaningless if it does not support a good use case. And
each good use case has a goal - the results of this data ingestion, storage, and
processing need to be consumed by another system. This system can be a
backend application part of the standard IT architecture or an external user
who needs to use the results of the data product. An example of the latter is
an ML model endpoint which is then exposed via a frontend or a dashboard
of business-critical information. There is usually a CI/CD pipeline (more on
Examples: AWS ECS, Kubernetes, AWS API Gateway, AWS Quicksight, AWS
CodeCommit
Here S3 serves as a data lake for data coming from another system (i.e., a
CRM). The data is transformed with Glue, which automatically detects the
schema and makes it queryable in Athena. Dashboards showing the data
are built with Quicksight. And finally, a subset of the data valid for ML is
stored in Redshift as a data warehouse and consumed by Sagemaker for ML
* Version Control Systems, such as Git or Mercurial are used for writing code collaboratively.
Target Technology
Keeping those points in mind, let’s look at three ways we can look at
technology selection.
Data View
Remember the data format constraints from FEASIBILITY STUDY. Those data
types influence the technology selection deeply:
Let’s take the text data format as an example. An example use case would be
to deploy a sentiment prediction model whose output can be used for churn
modeling. In INFLUENCE CASCADE, I wrote how this could impact the type of
database we use (i.e., MongoDB). In this case, we should go open-source
since there are many great tools for sentiment prediction, including spaCy
and fastText. If we are in a hurry, we can also use a vendor; in that case, a
good option could be AWS Comprehend - that would allow us to be fast since
it integrates well with the rest of AWS, but we can have other issues, such
as limited language support and price (more on this in BUDGETING).
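To make the open-source path concrete, here is a deliberately naive lexicon-based sentiment scorer; a real project would reach for spaCy or fastText instead, and the word lists here are invented for the sketch:

```python
# Toy sentiment lexicons - stand-ins for a trained model.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "cancel"}

def sentiment(text):
    """Score text as positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# A churn-modeling feature could then be the share of negative messages.
messages = ["I love this product", "terrible support, I want to cancel"]
negative_share = sum(sentiment(m) == "negative" for m in messages) / len(messages)
```

The interface is the point here: whether the scorer inside is a lexicon, spaCy, or a vendor API such as AWS Comprehend, the downstream churn model only sees the label.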
Depending on the use case we have, there are different technology options
available for us. The following diagram shows the different groups of possi-
ble use cases:
ML use cases fall in the predictive category. Here we might go for Python
with its packages, such as tensorflow, pytorch, xgboost and scikit-learn to
name a few. On the other hand, if we have more of a prescriptive use case,
R, with its ecosystem of statistical tools, can be the best option. Finally, if we
have the descriptive product, even using a self-service analytics tool such
as PowerBI or Tableau can get the job done.
Workflow View
The final way in which we can look at the technology is the workflow view:
This maps neatly to the type of work that different people are doing. We fall into this view when an organization is in the legacy scenario, where the data strategy is focused more on improving existing use cases than on designing new ones. Here, for example, if a more exploratory workflow is required, we might go for a combination of R with self-service analytics tools.
Many choices that need to be made during a data strategy have a markedly
“qualitative” rather than “quantitative” flavor. Those decisions are grayer
and require a nuanced approach. In terms of selecting technologies, here
are recommended selection criteria:
Fit for purpose: This one should go without saying, but the tech we use
should fit the goal (see the “use case-driven” point).
Ease of use: The learning curve ideally should be low. You’d be surprised
how quickly this can become an issue if you work within deadlines and
other constraints or have a less experienced data team.
Open source vs. Proprietary: Nowadays, this is hardly even a discussion. Except for some niche cases (such as when working with GIS software), you should always prefer an open-source solution for data projects.
As the organization grows and data usage becomes more established and widespread, a new source of complexity arises: managing access to the data, and provisioning the technology resources different teams need to operate on it. This falls in the domain of data governance. Here's GPT-3.5's definition:
• Data classification: Establish a system for classifying data based on sensitivity, value, and risk. This might include categories such as public, internal, confidential, and highly confidential.
• Data ownership: Determine who is responsible for managing and protecting different data types within the organization.
• Data access: Set rules for who can access different data types and under what circumstances.
• Data retention: Determine how long different types of data should be retained and establish a plan for securely disposing of data that is no longer needed.
• Data security: Implement measures to protect data from unauthorized access, such as encryption, secure networks, and access controls.
• Data privacy: Establish policies and procedures to protect personal data and ensure compliance with relevant privacy regulations, such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA).
Data Governance: Managing Data Assets at Scale 94
• Data quality: Set standards for data accuracy, completeness, and timeliness, and establish processes for ensuring that data meets these standards.
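A sliver of these policies can be encoded directly in code. The sketch below implements a toy classification-based access check; the clearance levels come from the classification topic above, while the role-to-clearance mapping is invented for illustration:

```python
# Classification levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "highly confidential"]

# Hypothetical clearance per role - in practice this comes from policy,
# not from a hardcoded dict.
ROLE_CLEARANCE = {
    "analyst": "internal",
    "data engineer": "confidential",
    "dpo": "highly confidential",
}

def can_access(role, data_classification):
    """Allow access when the role's clearance meets or exceeds the data's level."""
    clearance = ROLE_CLEARANCE.get(role, "public")  # unknown roles get the minimum
    return LEVELS.index(clearance) >= LEVELS.index(data_classification)
```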
Data teams differ from traditional software teams in how they operate. For software teams, it's more accepted to function in relative isolation, with limited interactions with other functions; for data teams, this is often not enough. Data teams need to understand
Operating Model: Setting up the Organization for Success 98
the business use case even more, and the products they develop are often
consumed internally as well as externally. There are several models that an
organization can use to structure how its data teams interact with the wider
organization.
The optimal solution, as usual, lies in the middle: a hybrid between the
two. This has been pioneered by Spotify’s model of squads and tribes. While
there’s a centralized team, with all the support and organization that it
provides, data team members still are active within other organizational
units. This can be seen as “in-house consulting” or Center of Excellence
(CoE) format. This organizational model takes the most effort to pull off
successfully since there’s a need to balance the workloads and culture
carefully. The different models are shown in the figure below:
Center of Excellence
A fresh start: A brand new structure is created, and those people haven’t
worked together before. This often has positive effects, since many people bring a fresh way of viewing things, and office politics are minimal.
Any operating model is better than none. A clear structure can tremen-
dously improve productivity because the relationships and responsibilities
of data team members are transparent; what operating model you choose
for the organization depends on its data maturity.
Change Management
Even with a solid operating model in place, our work as data strategists is
not done. What’s missing is ensuring the right culture is in place. Since
“culture” is a subjective term and depends on many factors, such as industry,
geographical region, and others - I’ll approach it from a more practical
perspective by describing change management methods.
Improve skills and diversity: Everybody knows that diverse teams are more successful than homogeneous ones, but why? The fundamental reason
is that diversity allows for different viewpoints; thus, many more paths to
solve a complex problem are available. The same concept applies to skills.
You can see data strategy as planting valuable seeds that need fertile ground.
The line “culture eats strategy for breakfast” is often attributed to Peter Drucker, and I couldn't agree more. Even if you design and deliver a comprehensive data strategy, it would
be pointless if not accepted by the organization. This difficulty in accepting
the strategy is the most challenging task for a data strategist to address. This
is not a book on social psychology - unfortunately, there’s no substitute for
real-world experience. Still, I’ll attempt to give you a few starting points so
you can start making solid progress.
Before starting this task, we need to think about why changing a working
culture is so complex so that we can address those challenges head-on.
Unsurprisingly, we can probably reduce the main reasons for the difficulty
to the human factor. Any organization comprising people is an excellent
example of a Complex Adaptive System (CAS from SYSTEMS THINKING FOR
DATA STRATEGY). The complexity of this system increases non-linearly (that
is - very fast) with an increase in the number of workers. The composition
of the said workforce further influences this - knowledge work tends to
be even more complex (more demanding to learn, teach and automate).
Any complex system is hard to change, primarily because it functions as
a black box - its inner workings are difficult to understand and, therefore,
influence and change. This is a core topic of DELIVERY, but let’s visualize it
here, together with another concept I’ll cover in a second, activation energy:
Product demos. Data team members must serve as evangelists for data
in the wider organization. The work those teams produce is often seen
as opaque, and results and impact are not easy to understand - this un-
derscores the importance of making frequent product demos to a wider
audience.
excitement for the technology, and sometimes deliver real business value
by piloting the best of the projects developed in this format.
Use case ideation sessions. Used to foster collaboration between the data
teams and the rest of the organization. The data strategist can also use them
to make everyone more involved in the decision process on what should be
done with data.
Roadmap: Preparing for Delivery
We are now at the finish line of the data strategy design. Before we proceed
to DELIVERY, the final element is to scope the work, prepare a budget and
order the work items on a timeline.
Preparing a Budget
Starting with the software costs, I’ll give an example of an NLP project.
The only missing information is an assumption on the volume and type
of data we expect the system to ingest and process. The best way to do
this extrapolation is to base it on existing datasets. Let’s say we have
Now we can estimate the second part of the data strategy implementation
budget: the people cost. Beyond the obvious things a tech company needs
* For the AWS pricing calculator go here, for GCP here and for Azure here.
to budget for (such as laptops and other hardware), most of the cost is based
on salary. And the pay itself is based on three things: seniority, skillset, and
market (location). Now you can appreciate why it makes sense to make a
budget for the headcount only at this stage of the data strategy. We need to
be sure of the use cases and the architecture and technology to determine
the types of people we need. A good rule of thumb for the number of people
necessary is the concept that I have termed “atomic team” (see aside below).
Almost any use case is achievable by a diverse team of 4 to 7 people
(number based on what I mentioned in change management - more than
this, and you need an additional manager). To estimate the human resource
cost for your budget, you can couple the roles, skills, and experience levels
of such a team together with adjacent cost factors such as recruitment and
onboarding.
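The calculation described above can be sketched in a few lines. This is a minimal illustration, not a pricing model - every salary figure, multiplier, and role name below is a made-up placeholder:

```python
# A minimal budgeting sketch. All figures are placeholders - real salaries
# depend on seniority, skillset, and market, as discussed above.
BASE_SALARY = {"data engineer": 90_000, "data scientist": 85_000,
               "data analyst": 65_000, "product manager": 95_000}
SENIORITY = {"junior": 0.8, "mid": 1.0, "senior": 1.3}
MARKET = {"berlin": 1.0, "london": 1.2, "remote": 0.9}
OVERHEAD = 1.25  # recruitment, onboarding, hardware, and similar adjacent costs

def annual_cost(role, seniority, market, fte=1.0):
    """Yearly cost of one role, scaled by seniority, market, and FTE share."""
    return BASE_SALARY[role] * SENIORITY[seniority] * MARKET[market] * OVERHEAD * fte

# An "atomic team" of four, with a part-time Product Manager:
atomic_team = [
    ("data engineer", "senior", "berlin", 1.0),
    ("data scientist", "mid", "berlin", 1.0),
    ("data analyst", "mid", "berlin", 1.0),
    ("product manager", "senior", "berlin", 0.5),
]
team_budget = sum(annual_cost(*member) for member in atomic_team)
```

Even a toy model like this makes the budget conversation concrete: stakeholders can argue about individual multipliers instead of one opaque number.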
Some roles are not required full-time, for example, the Product Manager.
Such calculations are easier if you have a CoE type of setup and the same
team works on several use cases in parallel.
A good idea is to think in project chunks that a small team can accomplish.
I call this the “atomic team”. The term “atomic” (from the ancient Greek
atomos, “indivisible”) describes the smallest possible abstraction level.
Here’s an example:
It comprises four members, covering the prominent roles in data (there are
many resources detailing what different roles are responsible for; I
recommend Borek and Prill23 and the APPENDIX). It also has the needed hierarchy
(ownership) role. We can break down most data product development into
tasks that a single team of this size can accomplish. If the work cannot be
broken down into tasks of this size, you might need to rethink the project
planning and decompose it further.
We can also adjust the atomic team further to fit our specific purposes:
Finally, we can attach the budget to the data strategy. More often than
not, there will be further discussions which will depend on the particular
situation. Still, those methods should provide a good rule of thumb for any
budget calculations in your work.
Timeline Boilerplate
The chart consists of a timeline and several swim lanes underneath. This
allows several use cases to be run simultaneously, which is often the case.
The timeline scale can vary, but the most common is quarterly since it
coincides with regular reporting for larger organizations. Essential elements
are quality gates, a topic expanded on in IMPACT ASSESSMENT in DELIVERY.
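The structure of such a chart can be sketched as data: quarterly swim lanes per use case, plus quality gates. All names and quarters below are hypothetical:

```python
# A minimal sketch of the roadmap structure: quarterly swim lanes per use
# case, plus quality gates. All names and quarters are hypothetical.
roadmap = {
    "churn model": {"Q1": "setup", "Q2": "prototype", "Q3": "deployment"},
    "self-service BI": {"Q1": "tool selection", "Q2": "rollout"},
}
quality_gates = {"Q2": "baseline vs. prototype metrics", "Q3": "business KPI review"}

def lanes_active_in(quarter):
    """Which swim lanes (use cases) run in a given quarter."""
    return [use_case for use_case, lane in roadmap.items() if quarter in lane]
```

In Q2, both use cases run simultaneously and a quality gate is due - exactly the situation the chart is meant to make visible.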
This roadmap can be further customized if needed. For example, one thing
I did with a client in a rapid growth phase was to add Human Resources
(HR) to the timeline. The client wanted to see when additions would be
necessary (with their roles, of course). This was very useful for the HR
department.
We followed a MECE approach and covered the key elements. First came the
element that generates value (remember, we should always focus on the end
goal) - USE CASES. We gathered ideas and prioritized them based on feasibility.
After this, we designed the target DATA ARCHITECTURE and TECHNOLOGY to
support those use cases. Their fuel, the data, is managed in the policies we
defined in DATA GOVERNANCE. The non-technical element of a data strategy
is addressed in designing the teams and processes in the OPERATING MODEL.
Finally, we prepared a ROADMAP for implementation, supported by a budget
assessment.
By finishing this section, we have completed the “static” part of the data
strategy. In the next one, we’ll ensure that it gets delivered successfully!
Part III: Delivery
Strategy to value journey—Soft agile—Implementation forest—Lean data—The
knowledge factory—DataOps—Impact assessment—Portfolio management
Overview
“Well done is better than well said.”
–Benjamin Franklin
A good plan that’s easy to act on is better than a perfect one that nobody
wants to follow. Even if in the first two parts of the 3D data strategy
process (DUE DILIGENCE and DESIGN) we managed to formulate and prepare
an informed and holistic strategy, these efforts can eventually prove to be
in vain if the result is not adopted and acted upon within the broader or-
ganization. Much of this is because any organization is a complex adaptive
system, full of operational and communication complexity. When executing
a data strategy, we must rely heavily on many concepts and methods from
SYSTEMS THINKING FOR DATA STRATEGY. As the rubber meets the road, we will
continuously challenge our designs and assumptions. We’ll need to adjust
while staying focused on delivering value - in line with StratOps principles.
DELIVERY connects the designed data strategy and value generation. The
arrows in both directions show that this is a two-way process of constantly
adjusting based on experience gathered while implementing. Naturally, the
concept of “value” needs to be made explicit for such a setup to work; this
is why I dedicated a whole section to this - IMPACT ASSESSMENT. This element
is used to measure value generation at different quality gates* . DELIVERY
relies on two popular software implementation framework flavors - SOFT
AGILE and LEAN DATA, forming DATAOPS. Explaining them and how to apply
them to our purposes is the primary goal of this phase. Finally, PORTFOLIO
MANAGEMENT provides a high-level view of all initiatives so that decision-
makers can also have an opportunity to monitor and adjust the strategy as
its implementation is set in motion.
* It’s not easy to measure data project impacts, but options are available. I’ll cover them in IMPACT
ASSESSMENT.
DELIVERY is when the data strategist needs the most flexibility since the
complexity of applying data strategy can be staggering. This can also be a
frustrating experience, since it is here that we often discover any mistakes
and wrong assumptions made during the first two phases. It takes courage
to admit to those, but this is essential to adapt and carry on successfully.
Remember that no two organizations are the same, and all are continuously
evolving. Your impact won’t stay constant either, despite any preparation.
We can use an analogy for what we are trying to do here to help illustrate
the challenges better. Moreover, this analogy maps well to the data maturity
levels from CSA. Let’s take advantage of our journey analogy once again: on
our way to value, we must cross a river to collect it on the other side.
There are three different ways to cross this river. Some are
quick to set up but less efficient, while others are more so at a higher cost
(and time to build).
Our first option is to attempt to cross the river in a small boat. This is how
organizations of low data maturity go about it. We are likely to capsize,
rely on crew muscle power alone, and can be a victim of the whims of the
winds - with little control over where we end up on the other shore. Use
cases developed in such delivery scenarios rarely see the light of day (this is
what Harvinder Atwal calls “laptop data science”, or in my words pilotitis),
and team members eventually burn out and leave. Still, remember that this
book focuses on larger organizations - for smaller startups with less baggage
(smaller rivers to cross), this can be a viable strategy.
The final option is the gold standard for large organizations: we need a
bridge. Successful data-driven companies at the summit of their digital
transformation have bridges between their strategy and implementation
efforts. These organizations are so data-centric that even this analogy
would do them a disservice. A more appropriate way to term what they
have achieved is “data highways”. A recommended reading on how they
accomplish this is O’Reilly’s “Software Engineering at Google”24 . In this
scenario, the separation between strategy and implementation becomes
blurry, and most data product initiatives (especially those that are not core
innovation projects) are seamlessly designed and delivered.
Now that we know what we need, how do we get there? As mentioned in the
chapter’s introduction, we will use lean and agile in combination* . Instead
of providing definitions as usual, a better way is to show how they look in a
gold standard (ideal target state):
Lean: The data teams operate as a factory. Before them, there are clear
targets, automated processes, highly specialized yet overlapping roles, and
other characteristics of a productive factory. Development proceeds at
a reasonable rate continuously, measurably delivering value. In case of
issues, replacement parts are readily available, bottlenecks are rare, and
operational waste is reduced to a minimum.
Agile: The factory has processes to communicate with and adapt to a chang-
ing environment and requirements. If the need arises to change a feature of
the product developed by the factory or even replace it with a completely
different one - agile allows the factory to do so without decreasing value
output - on the contrary, minor adjustments can vastly scale the positive
* If you want to go deeper into them, good references are “Agile Project Management with Kanban” (Eric
Brechner) and “The Lean Six Sigma Pocket Toolbook” (Michael L. George).
value generated.
The following diagram shows how the two methodologies relate to each
other. While lean allows the factory to be productive, agile ensures it
responds to the environment successfully:
Cargo Cults
Similar to the disease pilotitis that we covered in DUE DILIGENCE, cargo
cults are the second most common disease among large (and small) orga-
nizations, preventing them from executing their strategies. Earlier in the
chapter, I mentioned the large tech companies and their data highways
- while those are good inspirations, we can’t copy their methods blindly.
Let’s see why with a historical example.
Towards the end of the Second World War, many Pacific islands were
adversely affected by the armed engagements between the US and Japanese
armies. At some point, there were enough resources available to the US
military that they started to supply these islands with food and essential
items. For this purpose, the Americans constructed basic makeshift
airstrips. This went on for quite some time, but after the end of the war,
the supply lines trickled down to zero. The islanders knew little about
aircraft, and they associated them with the delivery of supplies. Thus they
promptly created mock airstrips and airplanes from materials they could
find. Of course, those creations were far from functional, but the locals
believed their presence would automatically lead to supply deliveries.
Unfortunately for them, this did not occur.
We might feel far away from the Pacific islands, but those dangers are all
around us, especially when we adopt new methodologies. Instead of being
frustrated that our large organizations can’t adopt shiny new frameworks
readily, it’s helpful to remember that tech companies have been focusing on
digital products from day one, thereby significantly reducing the complexity
of the task. They did not have many of the communication issues that any
non-tech organization is bound to have, which hinder work between technical
and non-technical teams. And finally, they have vast amounts of money at
their disposal.
Does agile work for data projects? There are different points of view on this,
but most in the data community are wary of creating a cargo cult. For me
and many leading data strategists* , the correct answer is yes, but with soft
adjustments. I term this flavor: SOFT AGILE.
It seems we can transplant most of those ideas well to data products, but
how about the specific principles of agile (also a fundamental part of the
manifesto)? Here the situation changes, and the real difference with SOFT
AGILE begins to appear:
Soft Agile: Moving Fast Without Breaking Too Much 126
Breaking big work down into smaller tasks that can be completed
quickly. Useful advice for any knowledge work; we’ll take it.
Now you should see why we need a softer approach to agile. Let’s use via
negativa once again, and see how the rigidity of traditional agile fails data
strategy delivery.
We have the roles on the left and time progressing from left to right. Deliver-
ables are shown in white circles; the red arrows correspond to conflict, and
finally, the bombs, appropriately, for project failure. The team receives the
brief and begins a two-week sprint. Now we can play the scenario forward
and see what happens, step by step.
Setup
Data projects can fail even before they start, often by skipping the
preparation of the design document for the product and the definitions of
done (see aside below) for tasks. I have
seen a whole CoE setup provided with only the initial requirement: “let’s
do an AI for X”. You can substitute X for anything, such as improving the
patient outcome, warehouse operations, or new sales funnel performance.
If you don’t specify your goals correctly (see definitions of done and design
document aside below), even with the best of teams, the most you can hope
to get out of the data initiative is for them to dig a perfect hole in the wrong
yard. Also, pivoting mid-way through the project is not a viable strategy.
As you have seen in the INFLUENCE CASCADE, any changes have significant
consequences, and the project setup (in terms of people, hardware, software,
and access to data) needs to take care of that. Another issue that can occur
in the setup phase is not setting a good baseline* . How can you measure the
success of your churn model product in reducing churn when you have no
data on what the churn was before deploying it?
In our scenario, the team is oblivious to this issue because it will only
become apparent as the project progresses. The team is excited since they
can start with minimal fuss.
* This is also for data strategy. Despite improvements, I have seen it fail because the strategist did not
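The churn baseline mentioned above can be made concrete: record the pre-deployment churn rate and persist it before the model ever ships. A minimal sketch, with hypothetical record fields:

```python
# Sketch: record the churn baseline BEFORE the model ships, so impact can
# later be measured against it. The record fields are hypothetical.
def churn_rate(customers):
    """Fraction of customers who churned in the observation window."""
    churned = sum(1 for c in customers if c["churned"])
    return churned / len(customers)

pre_launch = [{"id": 1, "churned": True}, {"id": 2, "churned": False},
              {"id": 3, "churned": False}, {"id": 4, "churned": True}]
baseline = churn_rate(pre_launch)  # persist this value before deployment
```

Without that stored `baseline`, any later claim that the model "reduced churn" is unverifiable.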
Data dump
The first common hurdle is the lack of data. The data scientist has set up
their local development environment and is ready to go, but they realize
they don’t have access to data. Now they need to find the right responsible
person or team who owns that data, and even if they do, they can find
out they first need to ask their project manager and negotiate for it. Data
governance policies are rarely set up flexibly in large organizations - and
data is often siloed, so this can take weeks. Even if this is not the case,
it is rare that the data would be easy to download (because of size, format,
and so on).
* An example definition of done is provided in the Appendix.
† An example design document outline for a data product is provided in the Appendix.
Data access
At the same time, the business owner has asked the data analyst to provide
some initial reports on the data. Similarly to the data scientist, they realize
they don’t have access to the data and get in touch with the data engineer.
Unfortunately for the data engineer, a data dump is not enough for them.
The analyst needs real-time access via a self-service analytics tool, which
was not set up before. More ad-hoc work for the data engineer. Moreover,
they would often require approval for the BI tool (those can also be
expensive, especially in larger organizations where hundreds of users need to use
it). This can result in further conflict with other departments already
using a similar tool (let’s say Power BI) that is not suitable for the
specific needs of the churn project.
Prototype
After obtaining a data dump from the engineer, the data scientist has been
working on creating a prototype model. They developed locally on their
machine and are ready to provide the code for deployment. Usually, the data
engineer is also responsible for this, and unfortunately, again (at this point,
you should start to see why good data engineers are so sought after), they
realize they need to recode the modeling code from R to Python. Harvinder27
refers to this issue as the “throwing over the fence problem”. Technical debt
Deployment
Finally, the data engineer has the modeling code written in the language
they need. After deployment, the team realizes that the data coming into the
ML system is different in terms of format and quality. Now there’s a need to
set up the MLOps infrastructure and database connections (the scientist has
been working with flat files) to ensure continuous learning and performance
monitoring. The product team also complains that the users might notice
the several seconds’ prediction latency and that it needs to drop below one
second. This change requires a complete rework of the architecture
and development process.
Now we are reaching the final and perhaps most challenging situation where
data projects fail, even if they managed to survive the previous hurdles.
The team has built and deployed a good model but made it accessible
only via an API endpoint. The product team realizes that they don’t have
enough technical people on their side to consume this API (they didn’t
know about this because they weren’t kept in the loop from the beginning).
An additional problem is that it becomes unclear which system needs
to use this product. Finally, business users distrust the model’s accuracy
and begin to ask about details to understand its inner workings. The data
scientist realizes they should have created an XAI layer on top of the model
to convince them of its operation. You can see that even with an excellent
finished product, there are new challenges that can arise that can derail
everything, and we pay the price for the lousy setup - and the lack of data
strategy.
This story is admittedly rather bleak, and many of you will have had similar
experiences. Still, with the right tools most of those problems are avoidable.
I’ll provide them in DATAOPS, but before this let’s have a look at the other
framework for delivery.
Lean Data: Eliminating Waste
Lean Data Theory
One easy way to look at the lean framework is to see that, at its core, it’s
mostly about reducing waste. This is inherently in line with via negativa:
instead of an optimistic goal - “how can we have more resources”, we are
focused on a more pessimistic yet actionable one - “how can we do more
with the limited resources that we do have”.
Now that we know its background, how will it help us build the bridge
between data strategy and value? To answer this question, we need to
understand what a lean organization focused on knowledge work is. Instead
of looking at a negative example, as in the IMPLEMENTATION FOREST, I’ll show
you the gold standard - THE KNOWLEDGE FACTORY.
Much of the inspiration behind the lean methodology comes from the
industrial sector. To understand how this can apply to data projects, we need
to see the differences between knowledge work and manual labor in terms of
complexity. This is a vast field of research, but I’ll provide vital contrasting
points in several paragraphs. Let’s first ask GPT-3 to define knowledge work:
This definition already sets the stage for our view on complexity. Data
strategy implementation is a product of largely conceptual work and enor-
mous mental effort delivered over a sustained period by a large group of
people. All this while the environment is changing* . Those are not deal-
breaking issues, but they result in the number one enemy of data project
management: inefficient communication.
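One way to see why communication dominates is the classic pairwise-channel count: in a fully connected group, communication paths grow quadratically with headcount. A minimal sketch (the function name is my own):

```python
def communication_channels(n_people: int) -> int:
    """Number of pairwise communication channels in a fully
    connected group: n * (n - 1) / 2."""
    return n_people * (n_people - 1) // 2

# Doubling the group size roughly quadruples the coordination overhead:
growth = {n: communication_channels(n) for n in (5, 10, 20, 40)}
```

A five-person atomic team has 10 channels to maintain; a 40-person department has 780 - one reason inefficient communication becomes the number one enemy at scale.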
those new fields bring their own terminology, and their work can become
harder for a specialist to explain even to colleagues in adjacent
application areas, let alone widely distant ones.
Communication complexity
Person A communicates to Person B that the team should deploy the churn
model. From the first person’s perspective, this is a reasonably straightfor-
ward task. Unfortunately, this is an illusion - because in a simple instruction
like that, much complexity is hidden in the person’s context (the devil lies in
the details). What “deployment” means for Person A can differ from Person
B. The noise on the diagram represents this issue. The way to mitigate this is to
start with two simple things everyone should measure: how long a data task
should take and how much value its successful completion should generate.
Now we can proceed with the concrete methods that allow us to combine
SOFT AGILE and LEAN DATA to ensure value-driven data strategy implementation.
DataOps: Methods for Value Delivery
The first Ops concept I mentioned in the book was StratOps in INTRODUCTION.
The second one was MLOps in DESIGN (this one is a subcategory of DataOps).
Some frown at this explosion of terms, but I see it as a net positive -
this represents a growing awareness and desire to automate and optimize
knowledge work.
and projects. Now we can continue with a walkthrough of tried and tested
frameworks that support DATAOPS.
One of the oldest efforts to standardize the process of doing analytical work
was the Cross-Industry Standard Process for Data Mining (CRISP-DM); here’s how
it looks together with a basic explanation:
CRISP-DM (modified)
This is useful as a start, but it’s not enough for most modern use cases -
most of those steps are common sense for any practitioner. Microsoft has
improved on this by creating the Team Data Science Process (TDSP)* :
* You can read in detail on the TDSP here.
There is more detail here. For example, the data acquisition part is broken
down into the required activities (mirroring many architecture layers in
DESIGN). The deployment step is also modernized, with some more detail on
required tasks. But most importantly, there is the “customer acceptance”
step. This can mean an external customer, but perhaps more frequently in
data projects - an internal stakeholder. This connects to our “definition of
done” activity. One weak spot of this model is its machine learning focus
in the modeling step. Not every useful data science workflow fits here; for example, a
Data Platform
Building a data platform is one of the best mid- and long-term investments
a large organization can make in terms of data (especially after successful
PoCs). While this term has a very general meaning, for our purposes, it is an
abstraction layer on top of the data:
Data platform
The data platform contains all services, interfaces, and governance (access)
policies required by all different groups of people in the organization. Such
MLOps
There are different tools for the different aspects; for example, for CI/CD
you might use GitHub Actions; for model monitoring, MLflow. A good starting
point is to use the specialized services that all major cloud providers have -
for example, SageMaker in the case of AWS or Machine Learning Studio for
Azure. Those contain everything necessary (but can be expensive).
MLOps is a fast-moving field, and there are many other useful concepts
available, such as feature stores:
communication and integration issues. This also helps with onboarding new
people to the development process.
The same applies to the generation of reports. For example, a good idea
is to use shared YAML files to specify common project settings (such as
themes, database connections, etc.) that can be reused across projects.
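The shared-settings idea can be sketched as follows. In practice the defaults would live in a shared YAML file (loaded with, e.g., PyYAML); plain dictionaries keep this example dependency-free, and all keys are hypothetical:

```python
# Sketch of reusable project settings. In practice these would live in a
# shared YAML file; plain dictionaries keep the example dependency-free.
# All keys are hypothetical.
SHARED_DEFAULTS = {"theme": "corporate", "db_host": "analytics-db.internal",
                   "report_format": "pdf"}

def project_settings(overrides):
    """Merge project-specific overrides over the shared defaults."""
    return {**SHARED_DEFAULTS, **overrides}

churn_report = project_settings({"report_format": "html"})
```

Each project then only declares what differs from the organization-wide defaults, which is exactly the reuse lean is after.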
Tools that enable Infrastructure as Code (IaC) include Terraform, Ansible,
Puppet, Chef, and AWS CloudFormation.
There are many software tools to assist with this, including Confluence and
SharePoint.
Kanban
One of the most widely used frameworks for project management, this one
also originates from Toyota:
By limiting the amount of WIP and making the work visible on the
Kanban board, teams can identify bottlenecks and inefficiencies in
their process and take steps to address them. This can help teams to
deliver software more quickly and efficiently while also improving
the quality of their work.
When taken to data project management, this framework has benefits and
shortcomings. The positive is that it allows teams to get started quickly,
especially in projects which are hard to estimate (as mentioned before,
something common in data). The project manager would rarely be able
to create a detailed roadmap more than four weeks in advance. In this case,
teams can tackle tasks as they come instead of distributing the work in
sprints.
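The mechanics of a WIP-limited board can be sketched in a few lines. Column names and limits below are illustrative:

```python
# A minimal Kanban board with work-in-progress (WIP) limits. Column names
# and limits are illustrative.
WIP_LIMITS = {"todo": None, "in_progress": 3, "review": 2, "done": None}
board = {"todo": ["task-4", "task-5"], "in_progress": ["task-1", "task-2"],
         "review": ["task-3"], "done": []}

def can_pull(board, column):
    """A card may enter a column only if its WIP limit is not yet reached."""
    limit = WIP_LIMITS[column]
    return limit is None or len(board[column]) < limit
```

When `can_pull` returns false for a column, that column is the bottleneck - and making it visible is the whole point of the board.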
Scrum
Throughout the sprint, the team completes the tasks and pro-
gresses toward the goal. The team holds daily meetings called
stand-ups, where they discuss their progress and any challenges
they face.
At the end of the sprint, the team holds a review meeting to demon-
strate their completed work and gather feedback. This feedback is
used to inform the next sprint, and the process repeats until the
project is completed.
Shotgun MVP
Running many use cases at the same time with Shotgun MVP
Let’s say that we have several ways of achieving the same operational goal -
increasing the profitability of the sales department. One approach would be
* You can read more about Scrum of Scrums here
† You can read more about SAFe here.
I have named this approach “closing the loop”. Let’s illustrate with an
example of a complex project. We can visualize many digital products in a
sequential graph. Some things need to be done in various stages to complete
the product. For example, first, we might spend time on comprehensive EDA,
building a dictionary, and feature engineering. Only then would we proceed
with modeling. We can add all those different elements together* :
* This process is often called “Value Stream Mapping”.
We can then implement this chain before working on all the other optional
components. Through this, we can quickly validate our idea and see if
it brings measurable success. Then we can return and finish the other
components, resulting in incremental improvements.
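"Closing the loop" can be sketched as separating the required chain of the product graph from its optional components and building the chain first. The step names below are illustrative, not the only valid decomposition:

```python
# Sketch of "closing the loop": separate the minimal required chain of the
# product graph from its optional components, and implement the chain first.
# Step names are illustrative.
steps = [
    ("EDA", True), ("data dictionary", False), ("feature engineering", True),
    ("modeling", True), ("XAI layer", False), ("deployment", True),
]

def minimal_loop(steps):
    """The shortest end-to-end chain that can validate the idea."""
    return [name for name, required in steps if required]
```

Delivering `minimal_loop(steps)` end to end validates the idea early; the optional components then arrive as incremental improvements.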
Sources of Waste
Partially done work: pilotitis. It’s easy to start new projects, but making
them usable requires much more effort.
and automation.
Defects: No code is perfect, and there are always edge cases and unexpected
bottlenecks in data-intensive software systems.
Multitasking: This relates to the idea that human productivity drops dras-
tically if attention is split between different tasks simultaneously. This is
often the result of “scope creep”.
Waiting: This can translate almost one-to-one from the industrial setting
to data. Especially in larger organizations, it’s common to have tedious
processes and inefficient data governance, resulting in long waiting times
for the data team to even get proper tooling, let alone access to the data itself.
Lack of knowledge sharing: The result of siloing data team work and
members. In larger and less efficient organizations, the worst form of this
issue appears - different teams working on the same thing, reinventing the
wheel in isolation.
Extra processes: This concept relates to the idea of SOFT AGILE. It occurs
when the balance between rituals and work leans too far toward the former
over the latter, generally because of the cargo cult effect.
These sources of waste are a great fit for a workshop with a data team.
After explaining the concepts, the stakeholders can gather examples from
their work. Then clustering and prioritization steps can follow.
Heuristics
Simple and easy-to-follow rules often prove more effective than a big
strategy, especially for organizations in a low data maturity stage. A Harvard
Business Review article on strategic heuristics explains this idea well. The
authors explain how Napoleon innovated the art of managing a war during
the conflict with Russia in the 19th century. At that time, the lines of
communication between generals and front-line soldiers were virtually non-
existent (this was some time ago, before the invention of the telephone).
The relay of orders and information to and from the generals was primarily
accomplished by human messengers, often on horseback. It’s not difficult to
imagine how such inefficient communication channels could cause confusion.
Often, by the time the information arrived, it was already outdated -
irrelevant at best and dangerous at worst. To avoid this, Napoleon issued
his troops with a set of simple heuristics: for example, in the case of
total communication breakdown, march in the direction of gunfire and take
the high ground.
Heuristics can be helpful for people at the executive level as well. I have of-
ten observed analytics leaders applying basic rules to their work, especially
regarding time management. For example, when asked how they prioritize
different work areas, such as hiring, outreach, and development work, they
would often come up with a rule of thumb percentage, such as 10-15-75%.
A data strategist might expect the ideal case shown on the left: directions
and tasks flow in a structured way, delivering value. Unfortunately, in an actual project
situation, the reality is represented better in the right graph. Most obstacles
and value are hidden and can only be reached indirectly by different people.
In this case, the data team is on its own and can use simple heuristics to
guide its work. For example - if less than 10% of the data is missing - proceed
with dropping it.
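The missing-data heuristic from the text is simple enough to write down directly; the 10% threshold is the example value and can of course be tuned per project:

```python
# The missing-data heuristic as a one-line rule; the 10% threshold is
# the example value from the text and can be tuned.
def drop_missing(n_missing, n_total, threshold=0.10):
    """Proceed with dropping missing rows only if they are a small fraction."""
    return n_missing / n_total < threshold
```

The point is not sophistication but that the team can apply the rule without waiting for a decision from above.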
Impact Assessment: Measuring Our
Success
Let’s remind ourselves one more time why it’s essential to measure the
results of our data strategy:
In a nutshell - sustained success in data takes time - and DataOps helps get
there faster. This is the opposite of the pilotitis graph, but it brings its
own frustrations. There will be a point in any large data strategy
implementation initiative where the investment has been at a maximum, yet
the results are few. We have already covered the best way to alleviate this
- for example, by doing Lighthouse projects (see my conversations with
Alexander Thamm or Amadeus Tunis on those). Here we will focus on how and
when to measure
Impact assessment
The answer to the “when” question is simple. First - before the start. This
establishes the baseline that we will measure against. Afterward - at regular
intervals, depending on the roadmap. This is typically done in a meeting
with the RACI steering group members.
How about the measurable metrics for data projects? There are quite a few,
but this table provides the most common ones:
Metric - Description
- Time to Market: How quickly the product is provided to customers
- ROI: How much money the product makes as a return on investment
- Ramp time: Time for a new hire to become productive
- Deployment number: How frequent deployments are
- Actionable insights delivered: self-explanatory
- Value-add time: How much of the time spent contributes to value
- Proportion of reusable artifacts: self-explanatory
- Proportion of monitored ML deployments: self-explanatory
- Model accuracy: self-explanatory
- Speed of deployment: self-explanatory
Another common way to generate good KPIs and ensure they are aligned
with the organization is to construct a KPI value tree. It takes work, but the
results pay off for large projects:
KPI value trees can be useful for various purposes, such as helping
managers prioritize KPIs, identify performance gaps, and develop
strategies for improving performance. They can also help commu-
nicate the importance of different KPIs to the broader organization
and demonstrate the value of data-driven decision-making.
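As a rough sketch, a KPI value tree is just a hierarchy that traces a top-level business KPI down to the operational metrics a data team can influence. The KPIs below are hypothetical examples, not a recommended set:

```python
# A hypothetical KPI value tree: each node links a KPI to the
# lower-level metrics that drive it.
kpi_tree = {
    "Revenue growth": {
        "Conversion rate": {
            "Recommendation model accuracy": {},
            "Page load time": {},
        },
        "Customer retention": {
            "Churn-model precision": {},
            "Time to resolve support tickets": {},
        },
    },
}

def leaf_metrics(tree):
    """Collect the operational (leaf) metrics that ultimately feed the
    top-level KPI - useful for checking every metric maps to value."""
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_metrics(children))
        else:
            leaves.append(name)
    return leaves
```

Walking the tree top-down is what aligns the KPIs; walking it bottom-up is what demonstrates the value of each metric.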
Portfolio Management: A
360-Degree View
In DUE DILIGENCE, while gathering information in CSA, a central focal point
was to identify all existing data-related use cases in the organization.
Even small organizations have a diverse pool of current use cases (data-
generating or consuming systems), with larger ones sometimes having
hundreds. Our work in USE CASES will most probably have expanded this list
by adding new use cases to the roadmap. This expansive landscape of use
cases represents the “analytics portfolio”. To manage it, we need structure.
This falls in the scope of “portfolio management”:
Before I show an example, let me list the main motivating factors for
assembling a data portfolio:
There are other fields you can add here depending on your priorities - for
example, remaining budget, deadlines, model performance metrics, and so on. Each
use case should have at least a basic table. While you can use plain old
Excel to track this, a better way may be to use newer generic products
such as Airtable or Confluence, or even more specific tools such as YOOI and
Casebase.ai.
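Whatever tool you pick, each portfolio entry reduces to a small structured record. A minimal sketch in Python (the fields and values are illustrative assumptions; extend them with budgets, deadlines, or model metrics as suggested above):

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """One entry in the analytics portfolio (illustrative fields only)."""
    name: str
    owner: str
    status: str            # e.g. "idea", "pilot", "in production", "retired"
    business_goal: str
    data_sources: list = field(default_factory=list)

portfolio = [
    UseCase("Churn prediction", "Marketing", "pilot",
            "Reduce customer churn", ["CRM", "billing"]),
    UseCase("Demand forecasting", "Supply Chain", "in production",
            "Lower inventory costs", ["ERP"]),
]

# A 360-degree view is then just a matter of filtering and counting.
in_production = [u.name for u in portfolio if u.status == "in production"]
```

Filtering such records is enough to answer the basic portfolio questions: how many use cases are live, who owns what, and which data sources are reused.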
Summary
In the final phase of the 3D model, we ensured the DELIVERY of the data
strategy. The main obstacle in front of us was the jump from strategy to value.
For this, we needed to build a bridge, supported by the two pillars of SOFT
AGILE and LEAN DATA. You learned how to use DATAOPS methods to manage
the execution of the data products and projects we recommended in DESIGN.
This work would be impossible without adding quality measurement gates
in the IMPACT ASSESSMENT, following best StratOps practices. Finally, we
established the tooling and processes to monitor our ever-expanding
portfolio of data products in PORTFOLIO MANAGEMENT.
Interviews
Nicolas Averseng
Nicolas Averseng is the founder and CEO of YOOI, a data analytics and
management platform. He has extensive experience as a CTO and in other
leadership positions across a variety of industries, solving challenges with data.
BA: Nicolas, it’s such an honor to be able to discuss data strategy with you.
I think you have one of the most forward-thinking visions on the field, and
I’m sure the readers can learn a lot from our conversation! Let’s start with a
simple question. How did you end up being involved in data strategy?
NA: I started out in monitoring business processes, which is now known under
the fancy term "operational intelligence". Four years ago, I joined a consulting company as
CTO. This company was mostly staffed by a large number of data scientists,
and I saw my job as reinforcing and developing the technology side further.
Eventually, we wanted to help people with not only the data science part of
the work, but also the other elements of data strategy: the data platform,
architecture and production environments, pipelines, and so on. I built
the technical teams there while working with large enterprises - also with
the same angle as what I mentioned - helping them avoid the mistakes
of thinking too much about the “how” without having a clear view on the
“why”. With this experience, I started YOOI, aiming to solve this problem.
BA: This is a very interesting point. I would agree that the most frustrating
thing in data I see is when clients have invested a ton into people and
resources - and spend it on digging a perfect hole, but in the wrong place.
NA: Yes - and this relates to why there are so few companies
successful with ML. They don't manage to deal with the most basic - but
also hardest - part, starting with the why. Nothing else matters if you don’t
start with what you want to achieve and the business question. It might
seem obvious to many of us in the field, but this is a very common issue.
People would often come to me as a technical person: “We need to do X.”
And my first reaction is always, “Why?”. The first thing in a data strategy is
aligning the people on the ultimate goal and success criteria. Only when we
have that can we engage people around that same objective. Only then can
you work on the other elements, such as technology, processes and culture.
This potential misalignment is indeed the leading cause of failure. A good
illustration is what has often happened with data lakes: enterprises invest
a lot to build infrastructure, expect it to solve all their issues, and then
struggle to actually build and deploy ML use cases. To make an analogy: if
you have a big enough hammer, in which you have invested, everything starts
to look like a nail. And in the end, you might find out what you really need
is a screwdriver.
NA: Data strategy is the way to define what to do with data, in order to
support the business strategy of the company. It should always be focused
on supporting the business goals. A data project should have the same goals
and success metrics as a regular software project. Of course, there are
additional dimensions to data projects, such as the data itself, which add
to the complexity and uncertainty.
BA: So, do you see some significant differences between software and data
projects?
NA: There are some, but the biggest one is the uncertainty of data - this
proves to be challenging to many people and organizations, as it makes the
whole value chain more complex, more fragile. Another related source of
uncertainty is how to involve people in the process. This relates to one of my
favorite concepts in software, the three U’s: usable, useful, and used. While
this concept applies to all software projects, it is especially challenging
for data products. In the beginning, let’s say people do sales forecasting
“manually”. They do a great job, but it does not scale, and they cannot
focus on multiple product segments. The job, in this case, is to build a good
enough model that can automate this part of the work. But once done, they
discover that they don’t know how to deploy and use this model effectively
because they did not set a measurable goal or think about how to make it
actionable within their business process. They forgot why they were doing
this work in the first place.
BA: Couldn’t agree more. If you have a solid “why,” the rest of the data
strategy work takes care of itself. Let’s now talk about YOOI and the purpose
of your offering.
NA: I want to expand on the three U's first; this will help explain
the purpose of YOOI. The reason I like this concept is that it's so simple. Often
people forget that somebody needs to use whatever they built. How is all of this
going to be used by people? How will it be integrated into real processes and
drive real decisions? Even if you put a model in production, people may
still not use it if it behaves like a black box. This is also why you need
to build trust and metrics into your data projects, and engage users all along
to make sure they understand and buy into them. This is why we built YOOI
- we see it as a cockpit for data strategy. A place for the team to align all
those different dimensions, connect the dots and make sure the technology
is meaningful. Of course, you can always hire a great consultant like you -
BA: laughs
NA: - but all the great work might end up sitting somewhere on SharePoint,
gathering dust and not being used. People will then lose track of why the
work is done, and also, you can’t repeat this data strategy work every year.
It has to be a living, continuously learning system. What happens to the
data strategy in a month when a new technical requirement comes up?
The world is changing, and this is something we need to accept. At the
same time, we have to keep everyone aligned on the same goals, and this
is why you need a tool for this. Our tool is a combination of process and
visibility elements. It makes sure that when ideas are proposed,
there is sufficient information to make decisions on selection, budgeting,
and technology. This is a view of the complete value chain. Now there are so
many different tools available from the cloud that make doing data projects
easy. What is still missing is this monitoring part, bridging the gap with
project execution.
You can learn more about YOOI on the official website. Follow Nicolas on
LinkedIn.
Summary
Noah Gift
Noah Gift is the founder of Pragmatic A.I. Labs. He lectures at the MSDS
program at Northwestern, the Duke MIDS Graduate Data Science Program, the
Graduate Data Science program at UC Berkeley, the UC Davis Graduate School of
Management MSBA program, the UNC Charlotte Data Science Initiative, and the
University of Tennessee (as part of the Tennessee Digital Jobs Factory). He
teaches and designs graduate courses in machine learning, MLOps, A.I., and
data science, and consults on machine learning and cloud architecture for
students and faculty. These responsibilities include leading a multi-cloud
certification initiative for students (source: noahgift.com).
BA: I was listening to your recent podcast on DataFramed, and I loved it.
While listening, I was thinking - this person certainly has unique opinions
that need to be heard. I like your skepticism, and I believe this is even more
important in a field such as ours, where we have more than enough buzzwords.
Maybe we can start with your background. How did you end up in data?
NG: I first started in TV and film. They offered me a full-time job, and that
was it, the start of my career. But I just wanted a little more than that, and
I always wanted to make sure I got a degree, and I decided to go to Cal Poly
San Luis Obispo. I was interested in being a professional athlete, even perhaps
going to the Olympics, playing professional basketball. This is why I studied
nutritional science. I thought that was a good degree to learn more about
performance. And what was good about it is that nutrition science really is
a form of data science. All the courses you take, such as organic chemistry,
are very architectural in nature. Anatomy, physiology, and even dissection
can be seen as data science. You are inspecting the body and looking at
the parts, and seeing what they do. I also did experimentation on my own
body, centrifuged my blood, took doses of Vitamin C. After this, I briefly
pursued being a professional athlete. I was in the process of training after
college to play basketball, not NBA level, but a lower-end tier. But then I also
applied for a job at Caltech in Information Technology, since I thought that
I probably wouldn't make too much money doing the sport. I spent a few years
there, learning a bunch of stuff about Unix and Linux, learned Python. Right
after I spent three years there, I decided to go back to the film industry. A
lot of the stuff I learned at Cal Poly helped me make film pipelines. Film
pipelines and data engineering are essentially the same thing! After this, I
moved to the Bay Area and worked at startups for roughly ten years.
Since then, I've been consulting, teaching, and writing books.
NG: Yes! And even machine learning is very similar to film because we’ve
been doing distributed computing in film for a very long time. You have to
set up each job on a different node, and each does a separate piece, and in
the end, you must combine them.
NG: I think the issue is that many organizations hire academics - research-
oriented people. With research, you are not focused on production. You are
essentially hiring the wrong people to work for the company. They might
even be solving the wrong problem. Don’t get me wrong - it’s great to
have researchers available in some situations, but most companies need
operations. I think it is important to have a data strategist. Ultimately with
MLOps and data engineering, the thing you’re building is not a model or
data but a pipeline. It’s almost like the name of the discipline itself is
incorrect - if you say, “Hey, I’m doing data engineering”, ultimately you
mean you’re building a data pipeline. And it’s the same with machine
learning. So what is a pipeline? It’s dynamic; it can expand and contract or
react to different things. So you have to build the capability to respond to
things dynamically. That’s the opposite of a researcher. Research operates
on a fixed problem that’s constrained to a lab environment. The pipeline
should constantly improve and produce results.
BA: I know of two common analogies about data work that I also reference in
the book. One is the oil processing one, where we talk about data gathering
and enrichment. The other one is the kitchen analogy, where you have
the raw data in terms of ingredients, and you have the recipe. What I find
strange in the approach of larger companies is hiring some people on high
salaries and telling them - let’s do AI in X, where X can be anything. They
make the team, set up the roles, provide some data, and say, “let’s go, talk
to you when the results are ready”. But do you think we can do better than
this when planning data projects in a large organization? How would you
even start to think about such complex work?
NG: I would say that the oil pipeline analogy is pretty good. I think oil
processing exploded in significance in the 1920s when cars as a means
of transportation started to become more popular. Imagine a geologist in
Texas in those early days. They come to the site and start drilling - oil
squirting out of the ground. That could be good enough - if you are a
researcher. You could say: “Look at this, we found some oil!”. They’re just
taking the oil in a bucket and throwing the thing in the pot, spilling dirt
everywhere. The result of this is that maybe enough oil for running a car
will be produced once a day. That’s a metaphor for how data science works
nowadays. Now, what's the opposite? We can build an oil refinery. That is a
complete platform for oil production, with people and components working
on different subsystems. The result is that we can produce many
more units per day to run our cars. It's the complete opposite. If we look
at an organization that wants to be successful, the first part is they need to
realize that the majority of people will be like that person, digging the hole
in the ground and making a mess with the oil spilling everywhere. Instead,
you should use a platform, like AWS Sagemaker, Azure ML Studio, Google
Vertex AI, maybe a third-party tool. Use that platform and have everything
standardized. The goal isn’t to play around with the oil. The goal is how
many gallons of gas we produce per day. Similarly, if you force the data
scientist to work inside a platform that has strict rules, then you’ve already
made it much more likely they’ll be successful. A second component is also
if your oil refinery is producing diesel, and all your vehicles are unleaded
gas, well, you’re making the wrong type of fuel. That’s the other part of the
problem. There must be requirements that are mapped to the executives in
the company. Even if you’re using a platform, you have to make sure the
people build the right thing.
BA: This reminds me of another idea I’m developing - starting with the end
goal in mind. One of the worst ways for data science projects to fail is when
a bunch of intelligent and hardworking people spend a ton of effort on
the wrong thing. Nobody asked why. We can use all the platform tools, let's
say AWS Sagemaker. We can make a pipeline - but in the end, the business
unit often says - ok cool, we have this AI, but so what? How are we going
to use this? What happens now? Sometimes we would find out that instead
of exposing our model through an API, all that was needed for the business
was to provide a scored Excel file. A good example of perfectly building the
wrong thing. Another issue here is not planning for the integration part of a
data project: how it will fit into the overall IT infrastructure and who
the consumers of the solution will be. How does the end result connect to
other systems? Do you think those issues are a failure of planning, a lack of
skilled workers, or a strategic problem?
NG: Well, let’s go back to what data and machine learning engineering are.
They represent the ability to respond to change by getting and reacting to
feedback loops. The issue is when you are just building one thing without
the capacity for change. For example, I like to do jiu-jitsu. This is a pretty
interesting martial art. In theory, the goal there is when someone attacks
you, you submit them. To achieve this goal, you have to respond to events
dynamically. Let’s say someone jumps on top of you, then you get out and do
something else. The ability to react dynamically to any situation is essential.
I think that’s the issue with the data field. The goal isn’t to do something
static - it is to have a feedback loop to respond to the business. The feedback
should happen much quicker than it does now. A good example solution is
to show prototypes once a week.
BA: Another idea for tackling such projects is building a very basic pipeline
first, but end to end. Then you get feedback from the stakeholders
and commit to further work on the different components of this pipeline.
Another follow-up question I have is an admittedly very tricky one, about
project management and data science. What do you think of the combination
of agile development and data? Is there a better way? How do you think
data projects should be planned and executed?
BA: Let’s dig deeper into this. Even if we do a lightweight agile, how
would we communicate our process to senior stakeholders in a big company,
who might be expecting something else, who think our processes are not
rigid enough? Do you have some advice on that? How can this fit with a
traditional business?
NG: I think the three components of a successful project are a weekly demo,
tasks assigned in a lightweight ticket system, and a spreadsheet showing
the quarter’s plan. That’s it. And the demo is what the product managers
would show to the CEO. This demo should just be good enough (like you
said end to end). Then you can get feedback immediately. In this case, you
can quickly fix significant issues, avoiding unnecessary work.
BA: Another question I have is the word “pragmatic”. I heard you use it on
several occasions. Could you elaborate on it? How can we be more pragmatic
in this work?
NG: I think pragmatic means being ruthless about efficiency. For example,
let’s say we have a system that barely works that took several years to build.
The person who did most of the building would very much like to keep it the
same. The right thing to do is to clean as much as possible - imagine a pull
request where 25% of the codebase is deleted while the system continues to
work. This is pragmatic. Nothing’s precious; whatever is needed to improve
the system should be done. Working only on things that matter - that’s
pragmatism.
BA: Do you think that knowledge work can be automated? Where does the
future go of our field? Do we lose creativity in what we do? What skills do
you think are most important right now for data people to remain relevant?
NG: I would say that it’s surprising that people think that AutoML and such
tools won’t get better with time. Even very famous people would think that
it doesn’t work. Let’s look at anything that happened in the last 50 years.
Once you start automating anything, it will always get better with time. A
great example is the film industry. When we first started editing, we had
3/4 inch tapes, and they were analog. You had to dissolve with three decks,
using three different machines. You have the source tape, the destination
tape, and the black tape. And now, with my laptop, I can just click a button
that says “dissolve”. Of course, everything gets automated! Still, there’s art.
Editing is very creative. Such work will remain - the creator must provide
their signature. If you’re talented as a data person, you should be excited
about all of this happening because you’ll become more impactful with the
work you do.
Summary
June Dershewitz
June Dershewitz is a Data Strategist at Amazon Music. Before this, she spent 20+
years driving data and analytics strategies for industry-leading companies,
including Fortune 500 corporations and tech startups. She also serves as
Board Chair at the Digital Analytics Association. You can connect with her on
LinkedIn.
BA: Let's start with your story. How did you end up in data? It's a question I
always ask since there are so many diverse backgrounds in our field.
JD: I got my start a very long time ago, with a bachelor’s degree in theoretical
math. That was in the very beginning of the internet. After that, I got a job
working for a mathematician who was building a website for math teachers,
students, and professors to talk to each other. I got to do many things on
that research project - essentially becoming a front-end engineer. I got the
chance to understand how the internet works, which was really exciting and
new at the time. Eventually, I decided it was time to move into the corporate
world in San Francisco.
JD: Well, I’m originally from Oregon, and I love the west coast. I had been
living in Philadelphia, so I felt the need to go to a large city again. It was
in 1999, the middle of the first dot-com boom. And I figured I could get
a job! I started applying to front-end engineer positions. At one company
I was asked whether I would like to become a data analyst. They told me I
had the combination of skills necessary to become a great analyst - software
engineering and math. I accepted! I realized that I loved it, even though it
wasn't the vision I had for myself originally. That was the start of a very long
career in data. Since 1999, I've worked with data as an individual contributor
and built and led teams of data people, both on the brand side and as a
consultant.
BA: Those were very early days in data science. I assume there were no data
scientists back then?
JD: No, they were called statisticians! Indeed, I ran across quite a few people
who would consider themselves statisticians, who today probably would call
themselves data scientists.
BA: Being in the field for such a long time, do you think companies
know more now how to do data projects than before? The technology has
advanced quite a bit, but how about the more strategic part?
JD: It’s frustrating to see the same problems over and over that we keep
repeating and not figuring out. But I think we can build on ideas much faster
than before and iterate on them. An example of this would be A/B testing,
which a company would employ to optimize business outcomes. We’d like
to think that the dot-com organizations figured this out already, and any
competitive company out there is maximizing their investment. Well, that
can be true to a certain extent, but they certainly didn’t invent it. These
methods have existed even before the internet. For example, it was being
done by advertising companies to measure the effectiveness of direct mail.
Now we can just do it with much more ease, and we can do it faster.
BA: Interesting. Operationally, we probably still have the same problems re-
garding how people understand data. It might even be harder nowadays. My
next question is on the title of a “data strategist”. What do you think about
that? I know some companies use similar titles, such as data translator. Is
this a widely accepted and understood role at this point?
JD: Not really. I think that data-related job titles have always been somewhat
of a pain point. Throughout my career, I've at times cared more or less
about the job title. The job title a person holds sometimes is important and
sometimes not. Early in my career, I was making a move from an individual
contributor to a consultant. Before I started, the company’s co-founder
called me, telling me he was working on the business cards and would like
to know my job title. I was thinking - perhaps VP of something? He said,
ok - vice president of analytics. And that was my job title from then on.
But when you work for a 14-person company and you have the job title
VP of Analytics, it means you're going to do everything. And I generally
feel that way about data-related job titles as they have evolved over time,
especially with the "data scientist" one. Usually when I talk to other leaders
of data organizations about their staff makeup - the data
scientists, data engineers, and others - they usually admit those roles
mean different things in different organizations. On one extreme, you might
have a company where any person who touches data is called a data scientist.
And then in another one, you might have so many specific different job
titles where you’ll have a data scientist, research scientist, ML engineer
or data analyst. There’s no right or wrong. I think that data strategist and
data strategy are malleable terms that we can use to mean different things.
I don’t think they will become standard terms to describe a specific job
function in the company. I can contrast them with a title such as a “data
engineer”, which I think is very specific and tangible.
BA: Yes, “data engineer” is already quite an established one. But it’s safe
to assume that the role of data strategist has always existed before as well,
probably under a different name. Someone must have been taking care of
the “translator” duties in the organization.
BA: This is a perfect time for me to ask you for a definition of data strategy.
Do you have one?
JD: I’ve found several that I would mash together into one: data strategy
is a vision for how a company will manage and use data to generate value
for the business and the customer. This is still broad but could be broken
down even further. For example, what data do we need? How are we going
to source it? How are we going to collect it? How are we going to store it?
What technology will we use? Who will we share it with? What are the policies
for data? How will we use the strategy, in what areas of our business, and
to what ends? How would we know it’s working? And if we’re doing it right,
what kind of value is it generating? How do we describe and quantify this
value to the business or the value to the customer of all of the time and
money that our data teams spend on working with data, trying to serve the
business?
BA: This is great. Now we start to talk about the specific elements of data
strategy. I now have a question on whether a data strategy is something
static, such as a PowerPoint deck, or is it more of a continuous function that
someone is performing? I’ve had clients ordering giant slide decks, only for
them to be buried somewhere, never to be seen again.
JD: It depends. Let’s say you are a data person at a company that isn’t yet
sold on the value of data. You have a tough task in front of you because
it’s all about education and convincing executives to fund your efforts.
Because if you don’t have any funding, you’re always going to be at the
bottom of the barrel. Your work can be an afterthought, and that’s not
where you want to be. Let’s say there are a few people who do data work
throughout the organization, but they’re doing it at a really low level, mostly
on unconnected pilot projects. But if you can show results, you can use
this base to form a team or even multiple teams. And the more you do
with data, the more you can say you’re using data successfully across the
organization. Say we’re using data successfully in Marketing, but we haven’t
necessarily gotten a full value out of what we’re doing with the data in
Product. So you decide to build a Product Analytics team. And then perhaps
you can see how you can support the Sales team with data or insight. And
then, at the more advanced stage, you would be looking across the entire
company, and you’re collecting and managing all the kinds of data that
matter to the business. In the end you’ll be able to turn around and use that
to generate value for the business and the customer in all the ways where it
matters. I think that depending on the stage the company is at, you’re going
to see different variations of this process.
BA: Who do you think the customer of a data strategy is? How far down, up,
or sideways does this document need to be used in the organization?
JD: It depends on the org structure. It’s never going to be perfect. I’m sure we
can spend a whole hour just talking about different kinds of org structures
for data people and the pros and cons of each choice. But I think as long
as you understand what you’re striving for, you can compensate for the
weaknesses of any kind of org structure. I believe data strategy can work
best when incorporated into company-wide strategic planning. So if you set
annual goals about what you want to accomplish, hopefully, some of them
will be quantitative and require support and participation from data people.
Even if it’s basic business optimization, it’s meaningful as long as it helps
grow the business.
BA: I agree. I don’t think you can separate data strategy from business
strategy and hope for good results. How iterative should a data strategy be?
Should it be more of a living document or more of a static roadmap?
JD: People could discuss the value of long-term planning versus the effort
spent on implementation, but I see value in it - as long as it’s combined
with shorter-term plans that are directly related to execution. I think that
a well-thought-out three to five-year plan is a great idea. This can show
where the organization’s data efforts are today and the vision for where it
wants to be way off in the future. Still, I don't think you can just set it once
and forget about it. The strategy will get stale after a while, but it should be
able to serve you well long enough so that you can generate annual plans
under the umbrella of the larger, three to five-year one. And from there, you
can set tangible and specific quarterly targets. You always need the five-year
north star guiding you. I’ve found, especially with data science and machine
learning projects, that they can easily meander. You need to reinforce the
focus, even if it shifts over time. You can plan quarterly and build on top
of your knowledge, but everything should be aligned with the longer-term
plan.
BA: What is the most important thing for a company with low data maturity
to tackle first?
JD: I think that, as a business, they should have a clear understanding
of where they're going to get the most business value from their investment.
This is a good starting point to do the first proof-of-concept project.
BA: How do you think about managing data projects? Does agile work for
data? How do we go about estimating tasks and resources?
JD: I do think agile works. Of course, estimating how long something will
take is always difficult. And especially if you’ve got something big and
ambiguous and have nothing built yet. In that case, you’re not going to be
very good at estimating how long things will take. As a project develops,
you’ll better understand what is worth pursuing and what is not. This skill
will take time to learn. At some point, you can refer to your experience - for
example, the team compositions and skills, knowing who to involve, and
it does become easier. I think in the beginning, you’ll be able to estimate
things that are only one to three months out. And then, when it comes to
six months or a year out into the future, you really might not have much of
a clue. You might know the result you’re after, but you wouldn’t have a good
amount of information to estimate how long this will take or even who needs
to be involved at what level. I’ve seen in the past chronic underestimation
of data engineering effort for data science work. Also some confusion about
roles - for example, what should a data analyst do? How about a data
scientist? You often won’t have the luxury of bringing in people with all the
right skillsets to contribute at all times. This might slow you down because
you might have a data scientist who’s also asked to be a data engineer, and it
might not be their core skillset, or they might be doing it, but as a result, they
are not writing high-quality code. And so then you’ll need to have someone
come in later on to fix the problems that were made because they weren’t an
expert. I have also seen a lot of issues with trust-building with leaders.
BA: I agree. Doing cool things just for the sake of learning can backfire. As a
data scientist, one might think this is smart, but as long as the work delivers
no value, it’s useless. Can you tell me the biggest reason for data projects to
fail nowadays?
JD: I think it all started with the whole “sexiest job of the 21st century”
article. I think this oversold the field and made it seem like snake oil.
How will you actually set data scientists up for success when you don’t clearly
understand the value they will deliver? And I think modern data science in
terms of how it fits into a larger organization is better understood now. It’s
been around long enough so that people can ask and answer the question,
“what have you done for me lately”.
BA: Yeah, so to paraphrase a little bit: you would say that a lot of the issues
with data project success come down to high expectations? Leaders think they
can just put a data science wizard on the project, and everything will fall
into place.
JD: Yeah, exactly. One approach I’ve seen that I think works fairly well: start
with a small proof of concept project with a short turnaround time. Then
show its value. If you don’t do this before committing to a longer-term
investment, you might end up with a wasteland.
BA: In my book, I call it pilotitis - the disease of doing pilot projects only.
How do we ensure such smaller projects are successful in the medium term?
JD: This is not easy. One thing you can do is set goals for pilots. For example,
we’ll finish it by this date, and it will do those exact things. This way, you
keep its scope limited. After this, you can show that it has all the features
you feel are essential and there’s widespread usage on the receiving end. I
don’t think a data scientist could do this alone. Having a product manager
involved is important for scoping, gathering requirements, user acceptance,
testing, and keeping a backlog of feature requests. So it’s not really data
science work, but this is necessary for creating something like a long-term
program.
BA: It sounds like we do need this person in the middle. It doesn’t matter if
they’re called a data strategist or a product manager.
BA: What specific skills would you say this person should have?
JD: Product manager is a broad, generalized job right now. And the product
manager for something directly facing the customers of a business might
be different from a product manager for something else, such as a
recommendation engine. They can be one step removed from the end customer
who is receiving the recommendations. But, still, I think some of the same
skills apply. I think it’s about having an excellent understanding of why a
product is being built and being able to articulate that. Always have a solid customer focus, know for
whom the product is designed, and ensure users can use it successfully. This
person should also know where the project will be in the next quarter and
align on the long-term vision.
BA: Exactly. The most frustrating thing I’ve seen in my career is brilliant
people building the wrong product that nobody wants to use.
JD: Yeah. And people might make different choices. It’s often a case of
taking the product in direction A or direction B, with trade-offs in each
case. If the only person involved cares about the novelty and
complexity of the system they’re building, they may make a choice that is
not necessarily what the customers need. Choosing instead a simpler
approach that is not as technically sophisticated but leads to a better
business outcome may be the right choice.
Summary:
Martin Szugat
Martin Szugat is the managing director of the data strategy consultancy Daten-
treiber. He is also a lecturer at Hochschule für Wirtschaft Zürich and Program
Director of Predictive Analytics World & Deep Learning World Europe.
BA: Let’s begin at the beginning - how did you end up in data?
MS: I started my career already during school. After many manual jobs, I
decided I would prefer to use my brain more (laughs). Since I liked playing
video games, the next obvious step was to start programming. My father
had a client looking for programmers, and this is how I started coding. I
also started writing for magazines, such as the Visual Basic magazine. I dove
deep into the .NET area and started teaching other people. Around that time,
I was also one of the first people in Germany to become an expert in the
whole XML topic, which would be the origin of data in my career. For my
studies, I initially studied computer linguistics and philosophy but switched
to bioinformatics.
I also wrote some books. One of them was about social software. Social
media didn’t exist then, and people mainly meant blogs and wikis by this
term. I also had the idea to start a company with my bioinformatics pro-
fessor but decided against doing that and joined UnternehmerTUM instead,
intending to meet like-minded people. We created a social media agency
with one of them, doing digital collectible games (now you would call those
NFTs, so that was way ahead of its time) and Facebook apps. After several
years of this, I wanted to go back to doing data because of my background
in bioinformatics. I had never had the chance to apply those skills before,
and nobody was talking about ML or AI at the time. At that time, I started
Datentreiber with the idea of putting all my experience into one venture -
combining data and business. I also saw how many companies fail in the
topic and saw an opportunity to help and improve their processes.
BA: It seems like you had a very diverse experience. Which of the skills
you gained during this time are most valuable to you now? What did you learn
from doing bioinformatics or running your own company?
MS: The skill which stands out to me is learning the design thinking
approach while working with IDEO. Discovering design thinking was a
life-changing experience: afterward, I applied this mindset to all my
ventures and projects, and I still apply it in my consulting work today.
From bioinformatics, I learned something quite important when thinking
about models. The people in bioinformatics also got this wrong. Back then,
Support Vector Machines were trendy, and the scientists wanted to solve
everything with them, including how genes worked and other topics like
that. But the biochemists proved that most of those models were wrong.
They did real-world experiments and tested the modeling work against real-
world data. This was called the “ivory tower” syndrome - bioinformaticians
at the time were rarely working with someone in biochemistry or molecular
biology. Bioinformaticians wrote software for bioinformaticians. Avoiding
this condition is something I learned the hard way.
BA: I’ve seen nowadays that people try to put PhDs in a room with MBAs
and see what kind of ideas come from it. Not sure if that’s such a great idea,
but it sounds like a better approach.
MS: Yes, and this is why I decided not to create a company with that
professor. I’ve seen companies full of PhDs. If you ask them who will make
the sales, they have no answer - they think they are different and don’t need
it. In reality, you need some sales, marketing, and HR.
BA: Let’s now talk about the title of a “data strategist.” What do you think
about that? Is it necessary? Are there better titles?
MS: You always have to distinguish between the title and the responsibility.
Different titles can have a similar responsibility - whether they are
a data strategist, an analytics strategist, a Head of Data, or a Chief Data
Officer. There should always be someone, especially in bigger companies,
who carries that responsibility.
BA: Exactly. Depending on the organization’s size, you might need different
people at different levels. Especially at the very top, you need someone
with this analytical skillset who manages all use cases. As you said before,
miscommunication between technical and business people is common, and
that’s why you need a responsible person to translate between the two.
MS: Yes. Another essential responsibility this person must carry is ensuring
the data strategy is aligned with the business strategy. There should be a
strategy for all data and analytics initiatives and investments, and this person
must also be responsible for killing projects or use cases that are not contributing to
the business objectives. This goes more into project portfolio management.
For example, there are a lot of projects that fail because of issues with data
quality or availability. A data strategist needs to take the responsibility to
check the data sourcing, collection, and quality initiatives and ensure that
down the line, let’s say in three years, the data is available so that they can
implement the use case. One client project ran into exactly this problem a few
years ago: the use case could not be implemented since the data were simply
missing, and nobody had paid attention to this.
BA: It sounds like there was just no plan, no strategy. Sometimes executives
believe you can just hire some people, give them a broad target and let them
work. All of this is done without doing the essential homework - checking
that everything is in place. I agree this is one of the most critical attributes
of a data strategist.
MS: I think the best data strategists have a technical background in data
science, analytics, or a related field. If people just come from the business
perspective, they lack the skills and analytical thinking. If they studied
economics or something similar, they simply have a different way of seeing
and perceiving things. Data scientists from physics or biology have this
analytical thinking trained, which is very hard to get.
BA: Can you elaborate further? Do you mean a scientific mindset, experi-
mental and hands-on thinking?
MS: Yes, but not only. Most importantly, they realize that everything is
simply an assumption. A strategy itself is one big assumption. A great book
on the topic I recommend is “Good Strategy, Bad Strategy.” The author
has a lot of strategy consulting experience and is a professor at UCLA. His
first advice was to keep in mind that strategy is just an assumption and
always needs to be tested. You first design the strategy, and after this, check
whether it works out.
BA: This reminds me of the saying that all models are wrong, but some are
useful.
BA: I want to play the devil’s advocate here. While a data strategist needs
to have a scientific mindset, I think it’s equally important to be good
at dealing with ambiguity. This skill set is essential for communication
and dealing with more political issues, which are common with clients.
Ambiguity is also a part of any data strategy since, as you said, no strategy
is perfect, and many assumptions need to be made during data strategy
design. For example, when estimating budgets and resources, you need to
be comfortable providing concrete numbers, even if it’s not clear what they
are for.
MS: Yes, and there are multiple levels of assumptions. An essential element
of a data strategy is defining the data products you want to build. Each is
also based on assumptions, and you must have an experimental approach
to making them. With such a mindset, you can become a data strategist.
Still, there are a few other very useful skills such as mediation, moderation,
and communication skills. I would still say you can train all those skills, but
changing the mindset of people is hard.
BA: This connects nicely to my next question. How do you train people for
such work? I know this is a big topic for Datentreiber.
MS: Yeah, as I mentioned before, design thinking is the most crucial method
to be learned. At the beginning of the training, we are primarily focused
on teaching the basic topics. For example, what’s the difference between
descriptive and predictive analytics, and what’s machine learning. What’s
AI, and what’s not AI. It’s essential to focus on the fundamentals first.
There’s too much buzzwordy content out there, and you can tell when
people have spent too much time on LinkedIn. So this is the first level. At a
second level, we train people in our data strategy design kit and other
methods we have developed based on our experience.
BA: Do you also train people to teach others themselves? It’s
an essential part of the job of a data strategist to “train” C-level and business
people. After all, they also spend some time on LinkedIn and probably need
to be “un-trained” a bit first.
MS: Yes! We’ve learned a lot, especially in the past year, that one thing
you should do before you work on the data strategy design in a series of
workshops is to have a training session. You can introduce the business
people to the basics first (such as the difference between a metric and a KPI).
We noticed that the following workshops work much better if the people had
training before, because otherwise during the workshops you’ll have a lot of
discussions about the definitions of things. Sometimes people talk about
the same things with different words. Another issue that can also arise if
people have no training before is unrealistic expectations. In one case, I
remember one of the clients wanted to build an Alexa-like system for a car
workshop. I already knew this would be hard. The Alexa team at Amazon
numbers in the hundreds, and still, the product has issues. This is closer to
science fiction than reality!
MS: I think this depends on the company. When some companies talk about data
strategy, they really mean data architecture design. Or they just might
want a PowerPoint presentation for the management board to get the data
team financed. Others don’t want to do a data strategy but want a “very
concrete plan of how to create value with data and analytics” instead. But
then, I would call this a data strategy (laughs)! Some others don’t even
want to call it a data strategy. They would say that “strategy” is reserved
for the management board. It should also not be named “data” since the
IT department should be responsible for it. So, in that case, we would call
it a “MarTech Concept” or something like that. But that’s, in fact, a data
strategy.
BA: Right. People still want it and understand why it’s necessary, even if it’s
hidden under different names. Do people think of data strategy as a PowerPoint
deck? Do you believe organizations know what needs to happen after that -
how to implement things and measure success?
MS: I think this happens only in organizations with low data maturity.
Executing a data strategy is the only way to know whether it is good. Those
organizations should focus on doing more pilots and experimentation. I have seen this
issue quite a few times. For example, one client requested we build a
so-called KPI driver tree (also known as a value driver tree) with them. This would help them
understand the relevant metrics and set the objectives correctly. We did this
for several months, and after finishing, they were pretty happy and realized
its value. Still, I had to remind them that while this was a good start, it was
still just the first step of a longer data strategy journey.
In my experience, the smaller and more focused the data strategy, the
higher the likelihood it will be successful. We have also advised clients on
an overall company-wide data strategy. Still, we encountered the problem
that it became too superficial - that PowerPoint deck with a lot of
vague text, such as “employees should treat data as an asset,” or “all our
departments should utilize data in a way which is aligned with our business
objectives.” While those statements are undoubtedly true, they are valid for
any company - you can just copy and paste this text. What most companies
struggle with is creating a holistic, long-term data strategy. It needs to be
executable and have checkpoints where you can measure its success and
adjust if necessary - to see if it works out. That is a real strategy.
BA: How do you ensure the clients trust you with an expensive data strategy
and that it delivers results? One way is to run a prototype and show results
as quickly as possible. But still, a data strategy costs money, and the benefits
can become apparent much later.
MS: What helps, again, is to ensure that the data strategy is not superficial.
BA: Yes, this makes sense. Another one of my favorite questions is how
do you plan for things you can’t plan for? How do you ensure that a data
strategy does not get derailed, for example, when one of the prioritized use
cases doesn’t work out? And how do we deal with expectations relating to
this? Do you have any advice here?
MS: There are multiple things you can do. After designing the strategy or
product, the next phase should not be to start the execution or implementa-
tion immediately. You should have this experimental phase instead. There
you build prototypes, research, and try to falsify your assumptions. This is
another critical thing I learned during my training - always identify your
most critical beliefs and ruthlessly test them. A term for this is RAT - the
Riskiest Assumption Test. You can run one for any specific product but also for
the overall strategy. Then you make sure all your RATs are eliminated. Only after
that do you start working on the engineering part.
A second thing you can do is just accept that many of your assumptions
will just be wrong. If you understand this, the logical consequence is that a
strategy is never done. It’s something fluid: after the first draft, you need to
test it and perhaps throw it away entirely. Or maybe just modify it a little
and then retest. It doesn’t work if you just hire someone to do the data
strategy.
BA: Can we now discuss an important concept - data assets and products.
Can you define what a data asset is?
MS: I use the term data asset to describe a data source with a precise value
for the business. This implies a data product - some form of analytics, or
whatever you apply to this data to extract and analyze information from
it. This information then leads to better decisions, actions, and results -
ultimately reaching the objective.
BA: That’s a great definition! How about data products versus data projects?
What’s the difference between the two? Data products must be different
from other products, such as clothing.
MS: The answer to this question depends on who you’re talking to. If
you’re talking to business people, they might think about data products
as packaging the data itself and selling it to other companies. When you
speak to old-school data scientists like me, a data product can be defined
as the outcome of a data mining process, where you apply analytics to data.
Even an ad-hoc research paper can be considered a data product with this
definition. The definition is different nowadays: a data product is closer
to a software product - it’s the data plus the analysis software, whether
automated or semi-automated, ideally scalable and reusable.
BA: So, by that definition, a machine learning model exposed via an API to
serve predictions would be a data product?
MS: Yes, but it can also have a graphical interface. It can be a dashboard or
an application generating business reports. It’s shocking how often, even
nowadays, generating business reports is done manually, by hand. We have
this one client, and they have so many people generating reports, and that’s
their whole job. After generating them, the reports just sit on a SharePoint
somewhere, and it’s not clear who uses them.
BA: Why do you think such inefficiencies are still so widespread? Is building
data products a challenge in larger companies rather than startups?
MS: Yes, of course. In larger organizations, you already have a lot of people
who have been doing such manual work in the past. In one case, we were
working with a pharma company, and they had to create a study on how
time influences the effectiveness of drugs and whether that can lead to
potential side effects. Each year they would have thousands of new drugs,
with hundreds of people doing this analysis. Many of them would have
ad-hoc scripts and just copy and paste from each other without centralized
solutions or templates. Startups rarely have this since they have too much
pressure to survive, have fewer people, and often have the luxury of devel-
oping greenfield data products, which is much easier.
BA: In this case, if you were to automate such a process and create a data
product, how do you ensure that the people trust it? Especially in such
sectors, this is a big topic. I can imagine that even if it’s inefficient, it can
still be perceived as more trustworthy since many humans are involved,
rather than a centralized black box.
MS: Now we’re getting back to the whole design thinking topic, which is
why it’s so important - not only when you’re designing a data product, but the
data strategy itself. A central theme in design thinking is using the users’ or
stakeholders’ point of view. The best way to do this in data strategy is to
involve them in the design process. If they have a seat at the table and can
share their point of view with you on a whiteboard (making it more tangible
and visual), they can express themselves so that other people understand
it. People with a higher degree of understanding will trust and accept the
strategy.
This goes in both directions. It is also vital for data scientists to understand
the business process. Otherwise, they might design a solution for the wrong
problem. There are a lot of examples out there of a perfect solution to the
wrong problem. A survey by Eric Siegel confirmed that many
models are not deployed just because they don’t fit the business processes.
This happens because the technical people have no understanding of the
business. If you had no idea how a car works, you wouldn’t enter it.
MS: Of course. It depends on the person. Some people are very comfortable
when they just have a rough understanding. All they need to know is that
the car is secure. They can just enter the vehicle and feel safe. Others need a
much deeper understanding of the car’s inner workings to feel safe. The only
way to know what kind of people you are dealing with is to start working with them
in a workshop. Many potential problems can be avoided if you co-design and
co-develop data products and strategies.
BA: So, how can one go about learning design thinking? I think it’s still a
skill not widely known beyond certain circles.
MS: It was only during a project with IDEO, where I saw much more clearly how this mindset is
shaped, that I truly embraced it. I think it’s much more important to work
with other people who have already applied design thinking to projects and
exercise together. In my experience, many of the workshops we did were
much more successful if the people had already done some design thinking
beforehand. Otherwise, they might be in for a hard time - especially the part
where you need to think from the user perspective. Exercising this empathy
might sound trivial, but it’s the hard part!
Summary:
Amadeus Tunis
BA: What did your journey into data strategy look like?
AT: I’ve been in tech and media for almost 20 years. About half of it was on
the tech side, and half on the consulting side. I can trace my initial work in
data back to around 2005 when I joined a startup in New York which had a
focus on in-game advertising. Funny enough, this practice is coming back in
a huge way now, but it was quite revolutionary back then. We had developed
proprietary technology allowing us to integrate geo-targeted advertising in
Xbox and PC games globally - anywhere in the game where it was relevant
and wasn’t too intrusive. The company was bought by Microsoft a few
months after I joined. I then worked for Microsoft for about five years
as a product lead, integrating global advertising campaigns for hundreds
of companies across different industries like consumer products, finance,
automotive, entertainment and more. Everyone wanted to advertise in this
new way. Our analytics at the time were mostly focused on ad performance,
interactions, impressions and so on but also the actual performance of our
integration tools and creative assets.
AT: Yes, 2005 to 2010. To connect analytics with my work: at that time,
we looked into ways to improve our own tools & products as well as how to
get continuously more player interactions with the ads. We also wanted to
speed up integration time and actually managed to do so over time, going
from about two weeks for a global campaign to under two hours. We tried to
automate as much as possible and get more efficiencies from the integration
processes. Around 2010 there was a massive ramp-up in the mobile space,
and with Microsoft not being a big player yet at the time, I joined a startup
called Applico. I wanted to dive deep into mobile product development. I
joined as a creative director, but as you know, in a startup, everybody wears
many hats. We started with seven people and eventually scaled to about 80
in the first year. I was responsible for everything besides writing code.
That time was a hot market for people who knew about app development
and optimizing apps, so I joined CBS Interactive, a large corporate, to focus
on a single product but at massive scale. Focusing on their TV Guide brand,
we developed a powerful cross-platform mobile app and web products for
so-called “second screen” experiences. CBS owned the world’s largest data
pool of TV and streaming programming. What we built there would, later,
become a new standard, quite similar to features you now see on Netflix
or Hulu – lists and recommendations of what you want to watch. This was
something we built across every channel and every platform. Those lists also
contained other data points, such as information about your favorite stars,
channels, studios, genres etc. and could be used to link and collect more
relevant content.
Making this all accessible on mobile was a complex undertaking but also
really fun. There, I started diving deeper into data. I worked hand in hand
with the director of analytics on customer interactions, looking at specific
journeys and user preferences over time and at specific key moments. We
tried to figure out how to optimize and best personalize the user experience.
We didn’t have a specific term for it, but we were thinking about valuable
data assets. We started developing strategies and opportunities to monetize
product features better. We moved beyond simple advertising and used
historical data to optimize content, experimented with geo-specific timing
to get more signups, tested whether we should have a premium version of
the product, and tailored more refined audiences. We were also
looking into what data to ask from people and when - with the focus on
creating value for our advertisers and our business, building a strong first
party data pool. We had dedicated analysts creating the reports, and it was
exciting to convert those insights into a strategy helping to make more
money from data and learn more about user behavior. That was just the
beginning of this domain, I mean strategic thinking about data, at least for
me. There was not a lot of thought leadership around data strategy that you
could find online, so we experimented a lot.
focused unit, which became a spearhead for Analytics & AI projects for
Deloitte in Germany. Other organizations started doing the same thing,
but we were perhaps among the first ones to grow so big, at least in the
consulting space. We had an influx of great talent as data science and
data engineering jobs became highly desired, so we could pick the top 2%
of the talent applying. Eventually, the unit became quite sizable and was
distributed across the organization - moving from the centralized structure
to a more decentralized one, with Analytics & AI becoming a part of many
offerings and domains. Nevertheless, it was an amazing journey and allowed
me to work with a variety of global clients on cutting edge topics and
implementation across the entire data domain, from Analytics to ML/AI, Big
Data & Cloud to data strategy.
I wanted to keep progressing in the data realm, with a renewed focus on data
strategy, especially around customer data & analytics, so I joined Publicis
Sapient. I knew some great people there already and the organization’s
focus on comprehensive digital business transformation and its long history
in tech was very intriguing. I led the data team for the DACH region with a
personal domain focus on data strategy and have recently relocated to the
US, where I remain a lead in the data practice continuing my data strategy
journey with a great team and interesting projects.
BA: It sounds like you gathered experience from both sides - from the data
product side in industry as well as strategy consulting. Do you believe this
gives you a unique perspective?
AT: Yes, definitely, even though there can be a lot of overlap. Many
consulting companies build products for clients. So you’re not excluded from
implementation – in the end you are a service provider and that’s what
makes companies like Publicis Sapient unique: we strategize with our client
and then implement as well. I think that’s very valuable since you have skin
in the game. Doing that from both the product side in industry and the
consulting side is hugely beneficial.
In consulting, you do have the benefit that over time you have the op-
portunity to work across a number of different industries. While I have
a cross-industry focus, which is very helpful and interesting, I’ve worked
extensively with the automotive industry in the last eight years. It’s still very interesting
to me because recently, there’ve been a lot of changes and new trends in
that particular industry, such as connected vehicles, direct to consumer
sales and a focus on digital services. You have datasets used in entirely
new ways. We’re also capturing the behavior of how customers are using
the car - adding new data pools to online behavior. There are also other
exciting industries, such as healthcare and pharma, which operate within
their own unique, mainly legal, constraints. You can contrast that with the
retail sector, which in my opinion has been leading in the productization of
data use cases as they concern customer intelligence. Selling things online
is the primordial soup of insight gathering. Hence, you can see how the field
of data is obviously highly relevant across all industries.
There’s no client that I’ve spoken to in recent years who hasn’t done some
work in the data space, so it’s really a personal decision whether you prefer
working in industry or in consulting. The opportunities are vast. The good thing
is that nowadays there is a ton of information available for the data space
- in terms of online resources, published papers or books, and many take
advantage of that, so you rarely start at ground zero with clients and with
people, independent of industry or domain.
BA: Since you mentioned the deeper domain knowledge required for the
pharma (or really any) industry, what are the essential skills for a data
strategist?
AT: This is a unique job title, and nailing down the exact skills required
can be confusing for someone starting out. It also tends to vary somewhat
for each company and team. I would say the barrier to entry is probably lower
for somebody who has worked hands-on with data and has spent some
time in a cross-functional team with data scientists, data analysts, and
data engineers. They can take that step towards data strategy more naturally
than somebody who comes purely from the business side, such as management
consulting, which often stops at the tech boundary. From my perspective, data
strategy has a strong focus on how to offset purely technical cost and unlock
value on the business side, but you have to also know and anticipate where
the technical debt occurs or is necessary, hence some proximity to the data
itself is helpful. Sure, nobody knows more about value generation than the
business side but still, when it comes to feasibility estimation, nobody has
more practical knowledge than the people who have worked with the data
hands-on. These are constant balancing acts. We are answering questions
such as: what do we want to do for the business, what do we want to achieve?
Also, what can we do with our current capabilities and technology set-up?
Moreover, it’s also sequencing the work: where do you start and what do
you do next? And if you don’t have at least a high-level understanding
of technical feasibility and requirements, you will struggle to unlock that
value, find that balance and initiate activities in the right order.
side, you want to have the one that gives you the most benefit, including
the most profitable one, e.g. the least expensive overall. There’s the data
strategist somewhere in the middle, trying to balance that out between the
two, trying to get a grasp on attributable value on an asset that is very fluid
across the entire business.
BA: You have been in the field for so long. What is your impression of the
changes over the years? Do you think that after the initial hype of data
science, things have quieted down, and there’s some skepticism? Are orga-
nizations hungrier for concrete results, adopting a more StratOps approach
to implementation?
AT: Nowadays you can do a lot more than ever before of course - you can
efficiently gather much greater information from many systems distributed
in your organization. You can start stitching a lot more data together. In
In the mid-2010s the big data wave happened, with tools & platforms such
as Hadoop, Cloudera, Hortonworks, and others. The idea was that merging
a ton of data together would provide much richer insight. But it also
created new challenges, such as how to manage the opening of these
floodgates. There is a huge cost to doing this and it was not easy to solve
for. I worked with some automakers at the time who quickly realized they
needed governance on top of all the data. They were focused on specific sets
of data use cases, such as only focusing on production data for predictive
maintenance. That eventually proved a successful strategy, as they managed
to initially avoid dealing with privacy challenges that came from sensitive
customer data while learning how to handle the new tech from scaling first
MVPs leveraging huge data sets and machine learning.
I would say there always was a mixture of both successes and failures, but
things have definitely not quieted down. We are at a new stage now: a
massive uptick in the availability of ML packages, tools, and cloud platform
services combined with a new appreciation for data strategy and governance.
Before, open-source platforms were unreliable and hard to manage - it was
hard to get support. Tools have become much more structured, allowing
modern architectures to combine different datasets, packages, and ways of
working. This stabilized the technical part of data programs, and we found
a resurgence in the usage of machine learning. This was especially true for
everyday use cases such as audience segmentation and recommendation
engines. These are things that work very well at scale. They provide concrete
outcomes and fuel further interest in the field. We now have a handle on the
infrastructure issue in gathering massive datasets. We also made progress
on more advanced ML tools and scaling ML in a governed fashion through
MLOps. Data stitching became easier with tools such as Alteryx and Dataiku,
which make it easy to run on a cloud platform and manage data efficiently.
Now, as those efforts start to plateau in terms of cost, we need to ensure
they can be integrated and leveraged effectively within the business.
This is where data strategy comes in - and there has never been a better
time to start working on it. The Bernard Marr book (at least the first edition)
came way early; it was ahead of its time. It is just now that the broader
mass of business is beginning to work on problems such as how you define
success, meaning defining clear business KPIs that can be enabled through
the proper use of data. Also, while it’s very tempting to put on a slide that
80% of your data projects don’t succeed, I don’t believe that’s true. Many
do succeed, if only in a contained environment, but there is a gap when
it comes to moving the needle on the business and operations side. The
missing piece that makes the wheels turn is data strategy as it is key to
help identify and attain that expected value associated with data. Also, data
strategy is not something you do once. I’ve encountered companies who
said “we’ve developed a data strategy 2 years ago” but are still struggling to
extract that ROI on their data assets. I would argue that you have to refresh
your data strategy continuously to keep your data ambitions on track. The
roadmap always needs to be groomed. Most companies don’t have the same
business strategy for decades. They will always check where they are and
make adjustments, amendments or initiate resets. There’s always more
strategic work to be done. As organizations run out of excuses for why their
data programs don’t work - since everything has been done more or less
“great” from a technical perspective - it has become obvious that the
problems are more organizational, and that’s where data strategy comes
in.
BA: This is a great way to describe the changes. Maybe developments in the
data field mirror those in software, with a bit of a delay. Software products
have been around for a longer time, and probably they are easier to build
than data ones (also easier to measure their success). Now it’s much easier
to start, and the questions are more about designing data products than
architecture or data. We are now primarily focusing on how to provide value.
AT: That’s a good point. Data is not limited to tech, even though data
capabilities are most often placed into the tech side of a business. When
a data strategist looks at a dataset, they don’t necessarily think of a set of
numbers in a table - they think about it as an information packet. Through
this lens, we could think of data strategy as “information strategy”. From that
perspective, it’s easier to understand that while the technical part is of
course foundational, value delivery should be a priority. This often comes up
in conversations with my clients: you can’t build a data strategy without
having a digital strategy, which is derived from a business strategy. The data
strategy allows you to home in on the value pools defined by such strategic
guidelines. Without value being the driver, you won’t be able to know what
information packets you need to make that happen. The key challenges lie
in breaking that further and further down to the operational level and then
managing the complexity associated with the execution, achieving these
milestones, using what you have, scaling that, becoming self-funding and
then profitable. Those things also take time; you need to start with what
you have and then scale out, not harvest as much data as possible and only
then get started. Just be aware that until scale can be achieved, a lot of data
you have is a heavy cost factor, so though starting small and learning as you
grow is great, there needs to be the ambition to march your data initiative
through to a certain advanced level. This is what a data strategist is hunting
for - defining the key data assets and developing the roadmap of what
it would take to make them exponentially valuable. That is how you shift
into thinking about data the right way, from an asset perspective. That is
why the communication with many stakeholders is so important. You’re
highly dependent on many of them across the business, but it will be you
who keeps the focus on extracted value and the ROI of each information
asset.
AT: This is one of my favorite topics and something where I recently spent
time with my team developing methods to attribute specific value, that is,
real dollar amounts, to individual data assets. There are established economic
and scientific methods for data valuation. They exist but are not yet commonly
applied in P&Ls. There are three approaches. First, there’s the market
approach: this is a method based on the market price of a similar product
or data asset - what users are willing to pay for it. Second, we have the cost
approach, a calculation method that accounts for the costs related to
creation, management, and utilization of the data. That is taken from cost
calculations for software and technology initiatives fed by production and
replacement costs. And finally, there’s the income approach. So that’s the
calculation method where the effects of productivity, revenue, cost savings,
and efficiencies of data utilization are estimated. If possible, it’s best to use
all three simultaneously. This will give you a range since the numbers will
differ.
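The three approaches described above can be sketched as a small calculation. The following is a minimal illustration; the asset, all dollar figures, and the discount rate are invented for the example, not taken from the interview:

```python
# Hypothetical valuation of a single data asset using the three approaches
# described above (market, cost, income). All numbers are invented.

def market_value(comparable_price_per_record, num_records):
    """Market approach: the market price of a similar data asset."""
    return comparable_price_per_record * num_records

def cost_value(creation_cost, annual_mgmt_cost, years_held):
    """Cost approach: creation plus management/utilization costs."""
    return creation_cost + annual_mgmt_cost * years_held

def income_value(annual_revenue_uplift, annual_cost_savings, years, discount_rate):
    """Income approach: discounted revenue and cost-saving effects of the data."""
    return sum(
        (annual_revenue_uplift + annual_cost_savings) / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )

estimates = {
    "market": market_value(0.50, 2_000_000),           # $0.50 per record
    "cost":   cost_value(300_000, 150_000, 3),         # built three years ago
    "income": income_value(400_000, 100_000, 3, 0.10), # 3-year horizon, 10% rate
}

# Applying all three simultaneously yields a range, since the numbers differ.
low, high = min(estimates.values()), max(estimates.values())
print(estimates)
print(f"valuation range: ${low:,.0f} - ${high:,.0f}")
```

The spread between the low and high estimates is itself informative: a wide range signals that the asset’s value depends heavily on which lens the business applies.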
Let’s say you have your use cases defined. You need to break those into their
components to define the individual data assets. Then you can apply those
three methods to get the potential value attributed to each data asset. This
is one of the most immediate challenges we cover with many customers and
one of the hottest questions in the industry. It is important to consider
that there are certain underlying principles regarding the value of data
that are just different from principles of other goods. For example, more
recent information is more valuable than older information, data does not
deteriorate from a technical perspective, data gains value when combined
with other data, data gains value when it is heavily used and so forth -
you must consider those for your calculations. Hence traditional models to
calculate intrinsic value of business assets cannot be applied.
BA: It’s interesting how you talk about data in terms of information pack-
ages. It almost blurs the distinction with use cases. How useful is it to talk
about those separately? There are always situations such as having several
different datasets supporting a single use case. Moreover, the products
won’t deliver value even with good data if the use case is not designed
correctly. What do you think about this separation?
AT: For me, it’s simple. There is no absolute value of data. It just does not
exist. There is only a relative value of data.
AT: Let me explain what I mean, with a client example. I was working with
this global automaker and the CIO was very proud of the number of millions
of customer records. That was a critical KPI - how many individual customer
data sets they had. But this perspective did not attach any business value
to them beyond operational or transactional uses. The CEO looked at that
and eventually asked, “so what?”. How much money are we making with
these millions of records? What are we saving? What benchmarks do we
have to surpass? How are we converting on these? The overall value of
the platforms and people used (which were very expensive) to gather and
maintain these customer records was not transparent. It was thus important
to tether specific value to each data asset. The aforementioned approach
finally allowed us to do just that.
transparency you can acquire regarding what data to pursue and where there
is just huge technical debt. Once you know the ROI of your assets, doors tend
to fly open, and you can make use cases happen. But this is one of the most
challenging conversations, and IT can rarely answer it by itself. The
business side also does not have enough technical understanding to gauge
how much has to be spent on tech to build these use cases. This is where
data strategists come in - they translate back and forth between the two
sides and anchor relative value to each data asset.
BA: What is the sequence of activities in a data strategy for a big client? How
do you sequence topics like governance, architecture, and others and focus
on the most important ones for the client?
AT: The next focus area is the future state strategy. What are you trying to
achieve? So that’s where you look at the business and the digital strategies
and try to understand as much as possible which direction the business is
aiming to go. What’s the commercial impact they’re trying to drive? And it’s
not you as the data strategist going it alone. You work closely with
people who know that information. Again, you’re going to come into contact
with a lot of stakeholders, both in business and IT.
So, now that you know where the client is situated, what they’re doing today,
and then what they’re trying to achieve, you try to get a sense of the gap
between these two states.
In order to close that gap, you have to always look into how you can support
value identification and realization, the third focus area. How is value
defined? What use cases contribute to what and what are the priorities?
Sometimes, the future and the current state are not that far apart in terms
of realization. It’s just that the setup is not structured and organized in a
way that the data can contribute to making value happen. Hence, you focus
in on enabling this. How do you get there? What’s the route to deploy to
production? What do you need to prototype? How do you prioritize the
backlog to get there? Most often they’re not going to be able to make tons
of money with data tomorrow, but there should be a route to becoming self-
funding, getting an initial process going that does not require never-ending
investments. And at that moment, it’s not just the technical components
that make that happen. It’s also the governance aspect. Who’s going to own
the data products? What’s the target operating model? How do you procure
and leverage capabilities that are already in your business? And who’s going
to help manage that change?
And then you look, in a fourth focus area, at the enablers: the data itself, the
roles and responsibilities required and the technology to achieve the future
state ambition. What do you need to improve the data quality? Do you need
a new exchange layer to help get data from A to B? Are there any security or
privacy issues? Do you need to hire or train people?
Of course, the technology layer is always underlying all of this. The hosting,
the infrastructure, DataOps, microservices, etc. It’s not wrong to look at
these enablers relatively early, but often they tend to be the only aspects of
data programs that are focused on.
This might sound overwhelming at first, but don’t worry, you don’t spend
a year building an enterprise data strategy. This is something you can do
in 8-12 weeks and if it’s a smaller company or just a certain part of the
business, perhaps in 4-6 weeks. If you prepare it well, you can develop
such a plan relatively quickly. At least to get to the point where you can
start implementing and then be agile, continuously experimenting and
optimizing. Even though this might sound overwhelming initially, it’s a very
valuable exercise because it creates transparency for everyone. People from
the tech and IT side can look at it and understand “this is why we’re doing
this”. Stakeholders from the business can look at it and have that “aha”
moment: this is what’s required for me to achieve this specific value from
the data that we have or can procure. There is a transparent ROI assessment
of the data assets, which allows stakeholders to have a clear understanding
of what can be achieved when, and everybody can participate in making this
happen.
AT: Many companies are, as I mentioned earlier, still realizing that there
is a need for data strategy in the first place. Once you have started the
process, it’s not helpful to be overly critical of expected outcomes and
shut things down when they aren’t achieved right away. It’s like investing
in mutual funds and checking your investment portfolio daily. It can be
interesting, but it’s not helpful, especially if you cash out the moment things
go a bit south. You must give yourself some time to evaluate (and improve)
the performance. It’s a learning process which necessitates that you adjust
over time. I would argue that it’s challenging to build a long-term data
strategy that’s more than 24-36 months out. Just like with a prediction
model, it gets fuzzy after a certain point. In that case, it makes sense to
revisit and adjust the strategy at regular intervals.
While you may have a team of data strategists help you develop the data
strategy, you should continuously keep some data strategists, if only one
or two, on your data initiative permanently. They should be proponents
of the developed plan and roadmap and hold up the banner for bringing
everybody back to the table on a regular basis. They will constantly aggre-
gate information on current status, successes, necessary improvements and
so forth. Have we moved from POCs to MVPs yet? Do we have anything in
production? The strategists communicate progress and can then adjust the
ROI models based on respective measurements. This should be, again, a
continuous process. You can then bring back a focused team twice a year to
walk through the full data strategy framework, even if it’s just a 2-3-week
exercise as a checkup, to aggregate the necessary information and see if the
strategy implementation is on track.
BA: Awesome! A final question from my side. Do you have a structured way to learn data strategy?
AT: There are several data strategy courses online, offered by Udemy, MIT,
and others, but I cannot speak to the quality or content they offer. I do
believe that they have their merit, though I also think that data strategy
is similar to mixed martial arts. You’re best off practicing various disci-
plines and then combining them as needed. Also there is not really a
single approach, because as with any strategy it will differ from company
to company. They are supposed to be different because clients of course
want to differentiate themselves in the market. So I believe, what you
can do is you can teach the basics around the main topics involved. For
example, data analytics, data science, some data engineering, business
strategy and eventually digital product development. Then my advice is
to get into projects quickly, shadow people who have done it before, and
participate in the activities as much as possible. Anybody working on data
strategy should have a clear framework and approach to the development
process. Helpful, supporting tools are templates like a data use case canvas,
interview guidelines, and a scoring matrix to assess relevant maturities.
There are many preexisting frameworks out there, so test them out and
see which ones help you with what you’re trying to achieve. For example,
you can start with a data asset map. It allows you to place all the use cases
you have on one axis and all the available data on the other. Then, mark
off which data can power which use case and what the status of that data
is, with red, yellow, and green, Harvey balls, or whatever you prefer. This will
generate an easy-to-understand heat map of what is immediately possible
and what is not. Then you can further investigate the quality, availability,
recency, frequency, latency, etc. of the data to get more granular and then
make an informed decision which use cases are ready to go and where you
might require further foundational activities like improving data quality or
acquiring further data. In the end, you sequence these activities out on a
roadmap. What are you going to do first? What are you going to do next? A
few preexisting templates can thus go a long way.
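The data asset map described above can be prototyped in a few lines. In this minimal sketch the use cases, data assets, and traffic-light statuses are all hypothetical placeholders:

```python
# A minimal data asset map: use cases on one axis, available data on the
# other, with a traffic-light status per cell. All entries are hypothetical.

use_cases   = ["churn prediction", "predictive maintenance", "recommendations"]
data_assets = ["CRM records", "sensor telemetry", "web clickstream"]

# Status of each (use case, data asset) pair: "green" = ready to power the
# use case, "yellow" = needs work (quality, recency, latency), absent = N/A.
status = {
    ("churn prediction", "CRM records"): "green",
    ("churn prediction", "web clickstream"): "yellow",
    ("predictive maintenance", "sensor telemetry"): "green",
    ("recommendations", "web clickstream"): "red",
}

# Render the heat map as a plain-text grid.
print(f"{'':<24}" + "".join(f"{d:<18}" for d in data_assets))
for uc in use_cases:
    row = f"{uc:<24}"
    for d in data_assets:
        row += f"{status.get((uc, d), '-'):<18}"
    print(row)

# Use cases with at least one green data asset are candidates to start with;
# the rest need foundational work (quality fixes, data acquisition) first.
ready = [uc for uc in use_cases
         if any(status.get((uc, d)) == "green" for d in data_assets)]
print("immediately possible:", ready)
```

In practice this lives in a spreadsheet rather than code, but the structure is the same: the matrix makes sequencing decisions visible at a glance.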
In the end, it’s not a single tool or the most beautiful model that gets you
there, but the interaction of the various components, people, and activities
we’ve been discussing.
You will do a great job if you have this comprehensive mindset, thinking
about the value of data from all perspectives, especially the business side.
Data strategists come from all walks of life, and this is what makes the
field so great: I’ve worked with economists, psychologists, management
consultants, data scientists, you name it. All of them data strategists today!
And that is what is most fun about data strategy - it’s collaborative and
cross-functional across so many domains. It’s definitely not “yet another
data trend”, but an evolving, hardening set of skills, becoming a full-
fledged practice that will not disappear any time soon.
Summary:
Tom Davenport
BA: Let’s get this started by learning about your career. I know you have had
a long one! But how did it all start?
BA: In your experience, how has the field changed during this time? In my
conversations with other data leaders, it feels like some of the questions
we’re trying to answer with data have already been around for quite a long
time - even if we tend to think of data products as “modern.”
TD: When I started my career, people didn’t pay much attention to data.
Some people were focused on data quality, of course. But data was seen as
just a technical subject. The company I worked for focused on what came to
be called business intelligence. We called it “executive support systems” at
the time. To do such work successfully, you need to get data in better shape
than you typically find. But it didn’t receive a lot of attention, I would say,
until the last ten years. People have focused much more on it in recent years
because of the rise of data science and the need to analyze data much more
than we have in the past—the rise of analytics, big data, and AI.
BA: Do you also feel that the initial hype around data science has decreased
somewhat? I remember around 2010, it was on top of the hype curve, and
now in recent years, it became obvious that doing data science successfully
is not so easy. And hopefully, that continues to be the case, that data science
becomes more “boring” so that we can focus on delivering value, not hype.
What are your thoughts on this? Is there a shift in how people perceive such
tech?
BA: Exactly. It’s not enough just to bring in a data scientist and expect
magic to happen. Another reason for the declining salaries could be that
the field is like a pie, which data scientists share with other new roles, such
as data engineers and ML engineers, which are more specialized.
TD: Yes. I wrote a piece back when data science was just gaining in popu-
larity. It was in the Harvard Business Review with DJ Patil, who ended up
being the first chief data scientist of the United States. It was called “Data
Scientist: The Sexiest Job of the 21st Century”. We also wrote a ten-year
follow-up a few months ago. We discussed how the job has changed in one
major way during this time: people have realized data scientists can’t do
it all. You do need all those different roles you mentioned. Also, new ones,
such as data translators and data product managers - are both fast-growing
jobs. I think you are right; this may have diminished some of the demand
for data scientists. At least the types of data scientists who only develop
models. I think it’s a bad idea to just focus all your attention on finding the
right algorithm because if the project as a whole doesn’t get implemented,
what’s the point?
BA: Yes! These roles are super important, the ones which are focused on
data and product thinking. I have a few questions on this later, but let’s go
back to data strategy. When did this term first appear and gain traction?
TD: I think it coincided with the rise of Chief Data Officers over the last 20
years - initially in the financial services sector. I wrote an article, “What’s
Your Data Strategy?” with Leandro Dalle Mule. I worked with him at Deloitte
before he became a Chief Data Officer at AIG. I give him all the credit for this
idea of offense versus defense in data strategy. I was more familiar with how
much commonality we need to have in our data - data federalism and related
topics.
Since then, I think many other issues have come up, including what aspect of
the data supply chain you should focus on; does the whole product idea offer
any value in terms of making data effective. Also, what type of data will you
focus on - is it only internal or external? I think the field broadened quite a
bit from the early days when we used terms like “master data management.”
BA: How about data strategy today? I think data strategy unfortunately is
often seen as just a static document - and also has become quite vague.
TD: There was a previous generation of initiatives with similar goals. They
were named “integrated master high-quality data” or “information
engineering”. A lot of money was being spent on this, and it was a
very technical discipline. Organizations such as the Oxford Martin Institute
were founded. I was very distrustful of it since they created models nobody
could understand. Business people didn’t know what to do with those - they
looked like those schematic diagrams of the latest Intel microprocessor -
unnecessarily complex. Business people struggled to find where the data
is in all of this. I think in the 1980s, Michael Hammer wrote an article
about a principles-based approach to data. That was one of the first data
strategy-oriented pieces that ever appeared in HBR. Some companies used
it quite successfully. Occasionally, I still find somebody who supports the
principles idea. It was similar to data strategy because it demonstrated how
you should simplify goals in different areas of IT, including data - to ensure
business people would not just understand it but be the primary creators
of data strategy. They should be actively involved in it and related aspects
of technology, such as architecture and planning. I still believe it’s very
important to have simpler business-oriented data strategy approaches.
It is important to look at the companies that are very aggressive in their use
of AI. Some of the best examples are European - such as Shell, Unilever, or
Airbus. I think this is largely because their senior business executives have
learned enough about AI technology to understand what it can do for them.
I wrote a piece in Sloan Management Review with Piyush Gupta (the CEO of
DBS Bank in Singapore). While doing this, I asked him how he got into all
those topics. He told me he worked for Citigroup with John Reed. And he was
probably the first banker in the world to realize the importance of data and
technology for banking. Piyush Gupta then developed the data strategy and
architecture for Citigroup in Asia.
BA: Data strategy inspiration can come from many directions! Let’s take a
step back. What do you think about the data strategist title?
TD: You need some sort of intermediary between highly technical people
(whether in data science or data management) and business people. You
won’t have many people like Piyush Gupta - CEOs - doing data strategy. You
could call those roles data strategists, translators, or data product managers.
The translator role is a job in itself. You’ll always need someone like that
to manage a project from start to finish, interface with the stakeholders and
persuade them it’s a good idea to develop data products.
BA: It is not an easy job; you always have to look at the forest and the trees.
Those are skills probably hard to find in one single person.
TD: That’s true, although I believe the situation is improving. When I was
working on the 29 case studies of people who work with AI daily, my co-
author and I discovered many people who were those business-technology
hybrids. I do believe more and more are emerging all the time. Nowadays,
when I sit in a meeting at big companies, there are often both business and
IT people - and it’s sometimes hard to say who is who!
BA: Do you think the learning curve for understanding data stuff has become less steep?
TD: Yes. I teach my students how to do it, and most of them don’t have any
technical background. There are all these great automated ML tools now.
BA: And their eyes light up immediately, I’m sure. Let’s take an example
company that is not very successful with AI. How do we get started there?
TD: I always say: with AI, you have to think big but start small. You must
develop a vision of how your company can evolve with AI capabilities. Will
it treat customers differently or make money differently? What products or
services might it sell? After you have identified that, you can look at the
little pieces. AI is, for the most part, an incremental technology. It works
on individual tasks, not even entire business processes or jobs. Some of
those small steps might seem boring, but together they will eventually help
revolutionize a whole business line, such as customer service, marketing, or
product development.
TD: Yeah, I think that’s generally true. Last week I talked with the head of
data science at AT&T. They have done much to empower business people
to use AI-oriented tools. The whole organization has agreed on how to
have reusable data sets and how to use them. I think data strategy should
contribute a lot to empowering those citizen types to do more of the work.
We just don’t have enough data scientists, engineers, and analysts.
services into Excel, where you can get predictions in the Excel sheet. People
like to focus on shiny use cases such as object detection, but there’s a lot of
value in such “simpler” automation.
TD: Yes, agreed. Those large vendors can help democratize such tools
because everybody can access them.
BA: How about data product thinking? How do we ensure we don’t work in
isolation and the algorithms get integrated within a product?
TD: I talk to a lot of chief data and analytics officers. When you’re coming
to a company that is not that data-oriented and not doing a lot with AI, it’s
essential to start with a small number of use cases to build consensus within
the organization. You need to get business leaders on board as stakeholders
quickly and then ensure those use cases are a success. After this, you can build on
them with more sophisticated projects and infrastructure.
BA: Interesting. Would you then say that only mature organizations need a
complete data strategy?
TD: Yes. I think in the beginning, you can start case by case without
restructuring the whole organization.
BA: Now, I want to ask you one of the hardest questions in data strategy.
How do we measure its success? How do we come up with accurate numbers
that the business requires?
TD: What makes this difficult is that data is seen as an abstraction. This leads to the short careers of chief data officers - it's hard for them to demonstrate value. I recently surveyed CDOs for an AWS report, together with the MIT CDO Symposium. There I discovered that the ones who are successful are obsessed with measurable value. They ensure they have a baseline before measuring anything they have achieved. You cannot do
that for everything, so you might need to prioritize a few critical projects. I
have seen this done successfully at Capital One, a very data and analytics-
oriented bank in the US. They were also the first ones to name a CDO in
1992.
BA: Yes, and use this measured value to fund further activities. Another question I have is about who develops the data strategy more often in your experience: external consultants or internal people? And which is better?
TD: I think it’s certainly good to get facilitation help from external consultants, but in the end, the organization needs to be on board since they
are most in tune with how the business and the business environment are
rapidly changing.
TD: I think this is a confusing term. When people hear “asset,” they expect
it to have value in itself, but this is rarely true of data. I like the data product
idea better since it’s clear how it’s valuable when analytics and AI are doing
something useful. I know some companies are perfectly successful using the
data asset concept, but I’m not a big fan of it.
In the AWS survey I just did, data monetization was one of the least popular activities. I think it’s just too hard and brings many new issues, such as data ownership and privacy. I think it gets customers upset.
TD: Yeah. This is a value-laden term, particularly when you start talking
about monetizing customer data.
BA: What mistakes do you see people making when developing data prod-
ucts?
TD: You have to take a lot of ideas from standard product development. There are a lot of interesting ideas coming from lean-startup thinking about MVPs. What is essential is to have stage gates where you decide if you proceed. What is more challenging is to find people who have product thinking - many are just focused on developing algorithms and less interested in the business and people issues around the products.
I think you can take a lot of use cases from other industries. The first
employee attrition model I saw developed was at Sprint, the mobile telecom
company. They initially used it to prevent customer churn but then used it
for employee churn.
BA: How about the domain knowledge you need for the different industries?
Also, do you think data strategy differs, or can you reuse it easily between
industries, such as pharma or automotive?
TD: I do think there are some substantial differences. For example, financial
services tend to be more defense-oriented than any other sector. The
pharma domain, particularly in drug development and genomics, is so
complex, with a big focus on external data, called real-world evidence. In
automotive, there’s a bigger and bigger focus on connected vehicle data. For
retail, other issues and opportunities exist, such as customer loyalty data
and shopping behavior over time. Every industry has some key decisions
that need to be made in the context of data strategy.
TD: That’s Amara’s law, right? Yes, I completely agree. I wrote about this in
the AI Advantage book.
BA: Where do you think things are going in twenty or thirty years?
BA: Sometimes I feel like we’re monkeys sitting on a branch, cutting it from under us. We develop tools to automate ourselves away - everything is
evolving.
TD: Yes, one of the interesting things is that data scientists have generally not embraced those tools - even in industries where models are very important.
BA: What would be your parting advice for people just starting?
TD: Make sure you understand that it’s everybody’s job to think about how
we analyze data and how that relates to business!
Summary:
• Embrace new job titles in the field and the growing specialization;
• Have a relentless focus on measuring and providing value;
• Learn from traditional product development;
• Respect the differences between industries;
• Embrace new productivity tools for data.
Stephanie Wagenaar
BA: What has your career looked like so far, and what are you up to now?
BA: What are the essential skills for your type of role?
BA: A strategy friend once told me that data strategy is the “art of drawing
boxes.”
SW: Yes! You have to make sure that everything is cohesive. I see a lot of companies doing many projects simultaneously that build on different foundations despite being aligned to the same business goal.
BA: What do you think about the data strategist title? Will it become
established?
SW: Well, I can imagine it will. We need to keep the organization focused
on the long-term vision and make it pragmatic. I sometimes call them “data
specialists” since many different skills are required. This makes sense to
people who know nothing about data.
BA: You mentioned that it might be easier for business people to get into
this. Do you know how to train them to be good at the role?
SW: Few things can replace learning on the job. I would start by talking
to different people from different departments in the organization and
learning how they use data and what’s essential for them. Then you can
combine this into a cohesive overall plan, explaining it to the management
- ensuring their initiatives, goals, and needs are met.
BA: What are your impressions of data strategy in the Netherlands? What
type of organizations are interested in it, and what do they expect?
SW: Everybody wants something pragmatic. Many are not yet aware of what they need. They might look at data governance and need help understanding how it might help them achieve their goals. Part of the problem is that those concepts are so broad that it’s hard to have a single approach that fits each client.
BA: Yes, I agree. It can be smarter to order only the individual elements that are most urgent.
SW: Yes, this is how we often do it. The clients should always have the basics
fixed first.
I usually focus on data management and data governance - those are
essential. Many other topics come after, such as data privacy, but you can
solve many issues with a data management and governance plan.
BA: There’s a lot of confusion between terms like data management and data governance. How do you explain them to clients?
cars because they first have to learn how to drive a regular one.
BA: How do you ensure clients are still on board with topics like this? Often
the results of such work become visible only much later, which could lead
to frustration.
SW: Here, communication is essential: making sure they are always involved in the process and being as transparent as possible. I do as
many in-person workshops as possible and keep them interactive, ensuring
nobody is left out. This attracts and excites even more people who want to
be a part of this. Having good energy and focusing on the bigger picture
and digital transformation results go a long way to generating excitement
- but it can still be challenging. Some of those workshops last for three
hours. What helps is to ensure that people learn from each other and keep
each other updated on progress and things achieved - keeping everything
interactive.
BA: This is a very positive way of motivating people. How about the
negative? If you focus on opportunities missed or costs if no governance
is implemented?
SW: This is more suitable for management - when you want such initiatives
funded. Still, I prefer not to focus on the negative - most people know those
things already. It’s better to focus proactively on positive change.
BA: How do you measure the success of such initiatives? Do you set some
baselines for data governance?
quality. This makes them even more excited - if they can understand the
results of their efforts. As the data management and governance building
proceeds, they must regularly see the outcomes. We need a group feeling -
that we are all working together towards a common goal. Through this, they
can feel empowered and eventually proceed without external assistance.
Just don’t forget to make sure all departments feel heard.
BA: This is a very time-consuming process, with all the due diligence, right?
SW: Yes. It can take six weeks with, let’s say, two days a week. There are a lot of technical tests too, and we see who in the organization struggles with what and assist them. They have to see you as an ally.
BA: What are some common traits that clients who are successful in this initiative share?
SW: I’ve noticed that successful clients already have quite a few internal
people interested in data—those motivated to take the next step. We always
need to work with such people because they understand the business inside
and out - we do the work together as specialists and domain experts. After
this, I also feel comfortable knowing the organization is in the right hands,
with people leading the work further. Those people can be from the data
team and others who are very interested. Again, the most important thing
is they understand the business. You can say I try to turn them into mini data strategists!
SW: Good idea. The number one problem in companies is everyone working
in their own silo. I have seen companies falling apart, and some departments
still claim everything is in order just because their datasets are fine. We need
to change this, and I’m working hard on that.
SW: In most cases, we start with the ERP system. After this, we focus on the
master data - so customers, suppliers, items, and transaction data. I tend to
focus less there since fixing the basics is most important. Of course, CRM
data is essential, and sometimes custom supply chain programs. In such
cases, we need the business users to help us assess the priorities. Understanding how the business makes money also helps. For example, suppose
you are working with perishable goods businesses. In that case, expiration
data is essential, and if those are more content-focused businesses - the
critical data can look very different and be in a separate silo altogether.
BA: How do you balance quick wins with more time-consuming work?
SW: I try to use tools for the quick ones. I’m a big fan of IBM Watson, and
one quick win we made was using their translation service. That helped us to
move quite fast. I know people like flashy things, so sometimes, we support
this with quick dashboards, showing business users new ways to look at and
use the data. I agree you need those quick wins to build trust for the bigger initiatives (and to help us get assignments) - it’s not easy to sell the complete data management proposition without that.
BA: How about the more advanced use cases? Do you feel companies have now realized it’s just not enough to throw engineers at the problem?
SW: Yes - it is like building a house, with the foundation first. For AI, you need data management and governance in place. I do, of course, take those use cases into account when designing the solutions.
BA: How about new tools, the cloud, for example - have they made your
work easier or harder?
SW: Mostly, they are beneficial. In the Watson case, for example - we could
build a solution very fast, at a low cost. I am very interested in the low-code
movement for such work.
BA: There must also be hidden dangers in such tools, like giving too much power to users who haven’t been properly educated for it?
SW: Yes, this also happens. But the problems we face are so diverse that we should use all our tools. It’s hard to have a one-size-fits-all approach. What is also important here is to know what you don’t know. In such cases, you should have the right skills to pick up new topics quickly and ask the right experts for help; otherwise, it can get difficult.
BA: What are new exciting things you see in the data field?
SW: There is just so much! A lot of cool NLP projects are gaining traction. Things tend to stay the same in data governance, but the field is evolving there as well. It’s hard to make a 20-year prediction with the speed of new changes. It’s essential to see this as a positive and keep yourself and the clients enthusiastic!
Summary:
Doug Laney
Doug Laney is a Data & Analytics Strategy Innovation Fellow at West Monroe.
He is also the author of “Infonomics: How to Monetize, Manage, and Measure
Information for Competitive Advantage” and “Data Juice: 101 Stories of How
Organizations Are Squeezing Value from Available Data Assets”. Before that he
was a VP at Gartner and led the Deloitte Analytics Institute.
BA: Where do you think we are right now regarding realizing the value of
data, and what are the reasons for that?
DL: Regarding data assets, we’re still at a point where most organizations
and leaders will talk about data as an asset but not treat it like one.
They don’t apply asset management, measurement, and asset deployment
principles that have been well-defined and honed in other industries -
for different types of assets. Assets such as financial, physical, or even
human capital have well-defined methods and standards for management,
for example, ISO standards. We, as data professionals, have done ourselves
and the world a disservice by inventing new terminologies and ways of
management that just fail instead of paying homage to how other assets are
managed. This is the main point in my Infonomics book. People would say
that infonomics is a bold new idea, but it’s applying existing ideas about
asset management, valuation, and economics to data - treating it as an
actual asset. This is where we are now - people are starting to talk about data assets
but still do not know what that means.
BA: Would you say we are too focused on tools and frameworks instead of
the basics?
DL: Yes. We all love new shiny objects. Everybody’s focused on tools. There are two types of data strategy. Big-S data strategy is about managing data in the organization as a whole - culture, leadership, organizational alignment, data architecture, integration, and so forth. The other is focused on the policies and procedures around how to manage and leverage data.
BA: Do you think a part of the problem is that many people disagree on basic
definitions?
DL: Yes. Even the definition of data asset is something people can’t agree
on. I recently ran a poll on the differences between data assets, products,
and datasets. The best answer came from someone who was not even in the
data field but an engineer. He said a dataset is a collection of data organized
in a specific format. This can be anything from a simple text file to a large
database. A data asset includes all the information a company owns and
controls, including both digital and physical data. This can include cus-
tomer, financial, and employee data. A company’s data assets are important
because they can provide insights into the business, help decision-making, and drive growth. A dataset is more of a physical manifestation of data, while a data asset is more of a logical grouping of data you own and control for a common purpose. And finally, a data product adds a layer of functionality on top.
BA: We are talking about data as an asset, but how about data as a liability?
DL: When people talk about liabilities, they often mean risk. If we’re talking about data as an asset, we use financial vernacular. An asset is something that is owned and controlled - exchangeable for cash and generating what accountants call probable future economic benefits. Finally, it’s also separable from other resources, and data meets those criteria.
Does data meet the requirements of an accounting liability - something that
you owe somebody else? I think it’s rare to see data being owed to someone.
Still, data can certainly introduce risk. One of the things I suggested in
the Infonomics book was a set of generally accepted information principles
similar to those for accounting. One of the ways to reduce data risk is
the replicability principle. It suggests we need to be economically cautious
every time we copy or move data since we increase the attack surface for
hackers every time we do so. We are increasing the risk. We must be careful
when doing things just out of convenience and recognize the risks.
Another thing you can do is look at the savings that various assets provide as part of solutions. Other contributors to those solutions are labor and physical and financial assets. The subjectivity arises in estimating how much value is allocated to the data asset. Data assets are non-rivalrous and non-depleting. They can be used for many purposes simultaneously, over and over again. Data assets are also progenitive: using them generates more and more data, which becomes cyclical.
BA: Who, in your opinion, should be doing this valuation of data assets?
DL: This should be in the purview of the chief data officer. I have long
advocated for the bifurcation of IT into separate “I” and “T” organizations.
The CDO is responsible and accountable for the “information” part. In some
organizations, it might work to provide this to the CFO.
BA: What are the important skills for a CDO in this case? What kind of background should this person have?
DL: Most of the successful ones come from a business background. They
can transfer the skills from managing other assets. While there are certainly
technical aspects of it, there’s a lot of difficulty in the cultural, organizational, and governance topics. How to create a data-driven culture and
foster data literacy. How to define use cases and leverage data in new
innovative ways. You need somebody with business sense.
DL: Of course, there you need technical people. But again, I think we should
be separating “I” and “T.” For example, instead of having a CIO, we can have
a CTO and CDO working independently yet synchronizing because of the
necessary overlap.
BA: What would be the first thing an enterprise can do as a first step? What
are some ideas for quick fixes?
DL: Appointing a CDO is a great first step. Then try to find a good balance
between the defensive and offensive parts of the data strategy.
BA: What are some topics in data assets that the people you teach have
difficulty grasping?
DL: There are many complex topics, such as data privacy and security. Those are very difficult because those fields are ever-changing and differ by geography, industry, culture, and even customer. I am not focused on
those topics, but data asset valuation is tricky. Once you start peeling the
onion on that, it can be challenging. There are real nuances to the methods
there, as we discussed. Someone also needs to source the inputs required
to do the valuation. There are questions about probable future economic
value and the probability of successfully delivering those use cases. Also, as
we discussed, someone needs to assign the contribution of the assets to that value.
BA: You have been in the field for quite some time. In what ways has it
changed, especially with the new tools we have, such as the cloud?
DL: It has been great to see technology and architecture keep up with the
volume and velocity of data, thanks to the movement to the cloud. Still, big
problems remain to be solved, such as how we handle the variety of data
and the increasing range of data sources. How do we integrate them and
align the meaning of the data? The three V’s remain essential. One thing
which is often misunderstood is the “veracity” of data. People don’t realize
this is a bigger problem for smaller datasets since those are often compiled
manually.
BA: How about the term “data is the new oil” and contrasting it with something I hear people say, “data is the new water”?
DL: Michael Saylor used the initial term some twenty or thirty years ago.
He used to talk about data being like water in terms of being able to turn a
faucet and get a flow of data. This is undoubtedly the case at the moment. I
certainly appreciate that, at the macro level, the comparison of data to oil works. Data is the driver of the economy today, much like oil was a century ago.
Still, it misses the point that data has unique economic characteristics. If
you consume a drop of oil, you can do so just once. It dissipates and turns into heat and pollutants. Consuming it does not create more oil - data, by contrast, is progenitive and can be used in many ways simultaneously. This is what
successful business models for companies look like nowadays - they take
advantage of those foundational economic characteristics of data.
BA: What are your thoughts on describing data as bronze, silver, and gold?
DL: I’ve seen it and think it’s a handy way to express different levels of
data availability, usage, cleanliness, and governance to a business user.
I don’t like to make things more complicated, and I’m not in favor of
discriminating data from information - they are synonymous if you look
at the dictionary definition. I also think there are rarely state changes to
data, and we have more of a continuum as data becomes more consumable,
usable, and integrated.
BA: Still, how do you think about having “good” and “bad” data assets?
actual value is off by a certain percentage. There are plenty of data assets with poor accuracy or timeliness that remain fit for a particular purpose.
BA: What would be the first thing a data strategist does to improve the data
situation in a large company?
DL: First, they should understand the culture and leadership around data.
How is it perceived? Then the governance topics - how individuals in the
organization use and measure data, and do they have defined metrics?
Are the KPIs aligned across the organization? Do they have a separate
organization or a traditional IT? How do people coordinate and collaborate
in such initiatives? After this, I would look at the architecture and see how
source data moves through it and is integrated with other systems. Data
governance and quality are also important here. It’s essential to assess
the data assets’ accuracy, timeliness, completeness, and integrity. Finally,
I would look at the operations and see how data is consumed and made
available to people and processes. I think we are still spending too much
time delivering data to eyeballs. Increasingly, and going into the future,
I think we’ll see more data consumed by systems and applications rather
than people. Data strategies should reflect this increase in business process
automation, AI, and similar, instead of just descriptive statistics.
BA: What are your thoughts on the term “data strategy” itself?
BA: Also, we should measure its impact and establish some baseline.
DL: Right. The data strategy itself should have goals and tools to measure them.
Summary:
Alexander Thamm
Alexander Thamm is the founder and CEO of the consultancy of the same name, focused on data and AI products. He is also the author of the book “The Ultimate Data and AI Guide”.
BA: Tell me about your story. How did you end up in data?
ATH: I think there are two important aspects to my start. First, when I was 18
years old, I opened an internet café. It eventually went bankrupt because, at
some point, everybody had an internet connection at home. We did a lot of
tech stuff: coding, making websites, and other software. While this got me
hooked on tech, it also helped me learn how to do business – the bankruptcy
experience was not so nice. This was when I started studying economics with
minors in statistics and psychology.
BA: Had they already promoted you, or were you still a working student at this point?
ATH: I was intrigued by the topic and accepted it, and I also started doing my
Ph.D. in the field while travelling to Ingolstadt. My Ph.D. was focused on
Bayesian statistics since, in this type of problem, there were often issues
with missing data - and I was trying to augment the missing information.
At the time, this was pretty new to everyone, and I had a lot of fun. I had
to hack together many systems because the business was still working with
Excel and IT with rigid old databases. There was no data science as a field
- everyone was just talking about data mining - trying to find patterns in a
heap of data - hoping for crazy results that could change a whole business.
IBM promised a lot at the time with Watson. All those initiatives were rarely
actionable and provided redundant common sense patterns instead. At the
time, I was doing more advanced projects, and people noticed I could get
fast results from data, and eventually, I decided to leave my Ph.D. and start
my own company (so I could work easier with clients such as BMW).
BA: It sounds like you were successful in the field quite early. What are
non-technical traits and skills you think helped you along the journey?
Specifically, as a working student, to deliver the impact you did?
ATH: You can’t do without some core skills, like being hard-working and
bold. Every entrepreneur will tell you that. But there are several more
special ones: passion, curiosity, and being value-driven. You should be
interested in finding the truth in data, much like a detective who wouldn’t
let go, even if the odds are stacked against them. I was also always focused
on seeing an impact, which satisfied me. I loved it when people told me -
“Alex, this is so cool. We are now 20% more efficient”. I always wanted to go
for the value, not just the insight.
BA: Speaking of value, this is one of the most challenging questions. How
did you show measurable value? How do you measure the impact you
provide? This is a difficult task since data is often seen as a cost center rather
than a profit one. Also, while the impact of data initiatives on the bottom
line can help improve the KPI of other teams and departments, it’s often not
measured directly.
ATH: This point has always been very relevant. We built the “Data Compass”.
This is a framework similar to CRISP-DM but more adapted to what we
think data scientists should be able to do. It starts with a business problem
and defining the core KPI we want to improve. Data strategy has to be a
derivative of the corporate and digital strategy. We often see that data departments and CDOs build a data strategy that is not interwoven with the company targets. If you work against your company’s goals - you
have already lost. While there are many different KPIs, balancing capability-
building use cases with data products is crucial. It’s often not a good idea
to start with an extended data lake, data catalog, or governance project
without ensuring first they are connected to the business. It’s also hard
to keep the excitement going (we call this happy honeymooning) for long
periods. If you connect straight away to the specific use cases, such as sales
forecasting and its impact on targets, you will have higher chances of getting
funding for the project.
ATH: No, definitely, data strategy is not static. The term itself is misleading
since it implies you do something just on paper and then execute it. We
were recently with a client who was just starting, and they thought they
should do this as any other regular project—beginning with a strategy
assessment, three to four months of interviews, and so on. I agree it’s good
to have a good starting point, but the problem is that such projects are very
new to them. I remember the saying from Henry Ford - “if I asked people
what they wanted, they would say faster horses”. The problem is that if they
are not experienced in the topic, it’s hard for them to develop a data strategy.
BA: It’s a chicken and egg problem, right? You need to have a data strategy
in place to make one.
And the last stream we have is the governance one. This one is trickier
since it depends a lot on which industry you work in and where you operate
geographically. You must touch this from the beginning, but you shouldn’t
overwork it. I have been in meetings where the client had ten lawyers
assembled with a 500-page policy document outlining what they shouldn’t
do. And then I asked them - have you done anything already?
BA: This is very interesting. But how do you balance those capability-
building topics with doing pilot projects without falling into pilotitis?
ATH: This is how we do it. We prioritize our use cases that follow the data
journey process. This process has three phases - lab, factory, and Ops. All
those topics are often interwoven as well. We have applied this process
successfully to more than 1500 projects. A good analogy to use here is a gym.
You are building a specific training plan for a particular person. You need to
balance out all the different ways the organization works. And even topics
like Data Governance can be exciting if they work well. If you look at Zalando - the co-author of my book (The Ultimate Data and AI Guide), Alex Borek, is doing great work on the topic there as Head of Data. They initially had
a radically agile approach - everyone could build anything. It was chaotic
but worked well then. Eventually, they wanted to leverage more synergies
and develop a coherent platform - compliant by design. As the company
has grown, many people are happier since they can just use the platform,
and everything is already set up for them. They don’t need to worry about
managing costs and about data protection. Like this, they can get going with
PoCs quite quickly. This reminds me of a Dilbert joke where Doc buys the
first video phone but has nobody to call since he is the only one with it!
You must have a strategy and resources allocated to a central unit that
gets the momentum going. After this, you create this upward spiral with
more use cases, more data, more ROI, and more money to invest into
capability. But the idea is not to put all the data in the data lake. We call it “Türmchen bauen” (“building little towers”) in German - an ivory tower strategy. We need to work
holistically instead of building a whole house from the basement up, which
is, unfortunately, typical for many cases. Organizations are then stuck with
the view from the basement instead of being agile - and nobody likes that
view.
BA: How do you deal with the issue of significant upfront investment in data products?
ATH: Yes, this happens often. You just have negative ROI for some time,
and only when the product is finished does that change. There are two
things you can do. First, you should have a very balanced portfolio of use
cases. Instead of focusing on what smart data scientists want - complex
PhD-level deep learning projects, which are not easy for most European
organizations to take advantage of yet, focus on easier use cases where you
can achieve positive ROI within several months. Of course, you also want
some moonshot projects to get media coverage or new talent in - but, for
example, you wouldn’t invest all your money in crypto, right?
On the other hand, you do want to start fundraising and acquiring commitment from the board for at least one to three years. You can use your early
success to “hook” more investment. People overlook that successful data
organizations such as Amazon or Palantir haven’t been profitable for a long
time, despite huge investments. You have to start at the start and manage
expectations. You need to make people fall in love with data and prepare
them for reality afterward as well.
BA: Let’s talk about our situation in Germany. What are current issues you
see in data strategy implementation in German companies?
ATH: It’s important to remember that we have some cool things that work.
For example, Deutsche Bahn is now saving a lot of money with data and
AI. Together with them, we were able to build a reinforcement learning
solution to distribute trains in real time much better. We also have a very cool project with the German automobile club, ADAC, where we built a recommender app for points of interest while traveling. This is something
that companies such as Booking or Airbnb still don’t have on the same
level. In other examples, Daimler is the first company to have level-three autonomous driving - ahead of US companies such as Tesla and Waymo.
I do think we have something like a marketing problem in Europe. We are
perfectionists, while in other places, people release stuff even before it’s
ready and tell everyone how cool it is.
Many companies here have invested a lot of money into capability building.
Now they are coming back with more use-case-driven approaches - companies
such as Allianz, BMW, and Porsche spring to mind. The real problem lies in the
Mittelstand. Those companies form the backbone of the economy here and
represent many human-centric European values. They can’t hire and retain
the same talent as the more prominent companies, but if we manage to give
them the proper tooling - similar to something like Shopify, but for AI - we
can be successful.
BA: You just mentioned the talent problem. Who do you think are the best
people to drive such innovation forward? What do you think of the title
“data strategist”? Do you have this role in your organization, and how do
you train such people? We need a broad view to get such innovative work
done.
ATH: Yes, I agree - you need people with a T-shaped skill profile. We
started as an engineering company because we saw a gap that the big
consultancies didn’t fill. I always liked the FUBU approach - for us, by
us. Being technical, we knew what technical work needed to be done and
how. This is why we also decided to do the strategy ourselves - maybe the
slides weren’t that great initially, but the contents were good.
Now we have a whole data strategy practice. It contains different roles
depending on the work - and they all function as a bridge between the
technical and the business sides. These people can be technical or have
a business background, like an economist who can code a bit. We always
ensure interdisciplinary teams in client engagements - for example, one
strategist, two scientists, and five engineers. We also have other roles,
such as data visualization engineers. And finally, we have data product
owners and product managers who are responsible for the outcomes and for
how to get there.
ATH: Correct. It’s like different strains of the same organism on different
seniority levels. We also have people who specialize in different cloud
providers. One of our newest practices is the Ops practice, where we have
even more roles.
ATH: We had to specialize. We started with just the four of us, when you
would get the job if you could spell “data science” correctly. But now the
whole field has specialized. It’s not enough to say you are an AI
consultancy; you have to say what type of AI - text, image, etc. We also
differentiate ourselves from the larger consultancies with a more boutique
approach, which is more specialized and tailored to the customer.
ATH: Since we don’t have an investor and are bootstrapped, we always have
to stay lean. We took a lot of inspiration from the Spotify model and made
it work. It’s not easy to balance all the different cultures, technical and
non-technical. What also benefits us greatly is a strong network of partners
with whom we work, and events such as the Data Festival. It’s essential to
stay at the forefront of the field and be seen as a thought leader.
ATH: I just don’t see it that way. It’s like being a doctor: if you heal a
patient’s pain, they will tell everyone you are a good doctor, and they will
come back to you when they have new pain. This is the reputation we have
built over the last ten years. Just deliver excellent results.
BA: One of my final questions: how do you keep so many people learning all
the new things happening in our field, and avoid becoming a legacy company?
ATH: There are many ways to do it, but you must sacrifice some time now for
benefits later. We always give people time to develop their skills, and we
have a whole people organization to support the employees and teams.
Summary
As you embark on this journey, I have just one ask of you. As the last years
have shown, there are still significant challenges facing us - pandemics,
global conflict, and inequality. All against the backdrop of climate change.
If you can, try to use your data work for good and for the benefit of the
rest of us.
Boyan Angelov,
Berlin 2023
Appendix
AI: Artificial Intelligence
API: Application Programming Interface
AWS: Amazon Web Services
BI: Business Intelligence
CAS: Complex Adaptive System
CNN: Convolutional Neural Network
CoE: Center of Excellence
CRISP-DM: Cross-Industry Standard Process for Data Mining
CSA: Current State Analysis
CV: Computer Vision
DD: Due Diligence
DMA: Data Maturity Assessment
EDA: Exploratory Data Analysis
FTE: Full-Time Equivalent
GA: Gap Analysis
GCP: Google Cloud Platform
GIS: Geographic Information System
GPT: Generative Pre-trained Transformer
IaC: Infrastructure as Code
LSTM: Long Short-Term Memory
ML: Machine Learning
NLP: Natural Language Processing
RACI: Responsible, Accountable, Consulted, Informed (a responsibility
assignment matrix)
SDK: Software Development Kit
SSOT: Single Source of Truth
ST: Systems Thinking
SVM: Support Vector Machine
TDSP: Team Data Science Process
VCS: Version Control System
XAI: Explainable Artificial Intelligence
Database schema
A database schema describes how the data in a database is organized: the
tables, the fields they contain, their types, and the relationships between
them.
Data warehouse
A data warehouse is a central repository of integrated data from one or
more sources, structured for analysis and reporting.
Data warehouses are typically designed to support fast querying and anal-
ysis using SQL or other query languages and may be optimized for per-
formance using techniques such as indexing and partitioning. Data teams
may also integrate them with business intelligence tools and visualization
software to enable users to create reports and dashboards.
Data lake
Data lakes are often used with big data technologies, such as Hadoop, to
store and process large amounts of data. They may also be integrated with
data management and analysis tools, such as SQL or Spark, to enable users
to query and analyze the data in various ways.
Examples: AWS S3, Azure Data Lake Storage, Google Cloud Storage
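To illustrate the idea of raw files in many formats, the following Python sketch treats a small dictionary as a stand-in for an object-storage bucket; the object keys and contents are invented:

```python
import csv
import io
import json

# A data lake holds raw files in mixed formats; here two in-memory
# "objects" stand in for files in a bucket.
objects = {
    "events/2023/01/events.json": '[{"user": "a", "clicks": 3}]',
    "exports/users.csv": "user,country\na,DE\n",
}

# Downstream tools (such as Spark or a SQL engine) parse each format on
# read; this sketch does the same with the standard library.
events = json.loads(objects["events/2023/01/events.json"])
users = list(csv.DictReader(io.StringIO(objects["exports/users.csv"])))
print(events[0]["clicks"], users[0]["country"])  # 3 DE
```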
Data lakehouse
Data lakehouses are often used in big data environments to enable the
efficient querying and analysis of large amounts of data from multiple
sources. They may also be integrated with data management and analysis
tools, such as Spark or SQL, to enable users to query and analyze the data
in various ways.
Examples: Databricks (Delta Lake), Snowflake, Azure Synapse Analytics
Serverless computing
Serverless computing offers a flexible and scalable solution for running
applications and services, allowing users to pay only for the resources
they use. However, it can also require a different mindset and development
approach, as it focuses on individual functions rather than on traditional
application architectures.
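To illustrate the function-centric model, here is a minimal sketch of a serverless-style handler in Python. The (event, context) signature mirrors a common convention (for example, AWS Lambda), but the exact interface varies by provider, and the event shape below is invented:

```python
# A serverless platform invokes one function per event; all state comes
# in through the event payload rather than a long-running application.
def handler(event, context=None):
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

# Locally we can simulate an invocation by passing an event dictionary.
print(handler({"name": "data team"})["body"])  # Hello, data team!
```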
Data mesh
A data mesh is a decentralized approach to data architecture that
distributes the ownership of data across an organization. The core
principles of data mesh include the following:
• Domain-oriented ownership: data is owned and served by the teams closest
to it.
• Data as a product: each dataset is treated as a product with defined
consumers and quality standards.
• Self-serve data platform: shared infrastructure lets domain teams publish
and consume data independently.
• Federated governance: global standards for security and interoperability
are agreed centrally but applied by the domains.
The goal of data mesh is to create a flexible and agile data architecture
that can support the changing needs and priorities of the organization
and enable teams to quickly and easily access the data they need to drive
innovation and business value. This is undoubtedly very useful for some
organizations, but beware of a cargo cult. “Decentralized” often sounds
good, but in some cases a monolithic, centralized setup is easier to manage
(for security or maintenance purposes). Here are some potential drawbacks:
• Cultural change: data mesh requires decentralized data ownership and
management. This can be difficult for organizations that are not used to
working in this way and may require significant effort to change long-
standing practices and mindsets.
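To make decentralized ownership concrete, here is a hypothetical Python sketch. The domain names, data product names, and the find_owner helper are all invented for illustration:

```python
# In a data mesh, each domain team owns and serves its own data products
# instead of routing everything through a central data team.
domain_products = {
    "marketing": ["campaign_performance", "web_analytics"],
    "logistics": ["shipment_tracking"],
}

def find_owner(product: str) -> str:
    """Return the domain team responsible for a given data product."""
    for domain, products in domain_products.items():
        if product in products:
            return domain
    raise KeyError(f"No domain owns {product!r}")

print(find_owner("shipment_tracking"))  # logistics
```

A real implementation would be a data catalog rather than a dictionary, but the ownership question - "which team serves this dataset?" - is the same.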
• Human in the loop: make sure that the results of prescriptive systems are
audited or have active participation by humans.
• Ethical design: ensure that prescriptive systems are built in a
representative way. The most common issue here is the use of biased
datasets.
• Clearly define the purpose and objectives of the data project and ensure
that they align with ethical principles.
Additionally, we can split the essential fields further, for example for data science.
A definition of done for a churn model task might include the following
criteria:
• The churn model has been tested in the production environment and
is functioning as expected.
• The churn model has been documented, including a description of the
features used, the training process, and the model performance.
• The results of the churn model have been reviewed by the relevant
stakeholders and approved for use.
• Any necessary changes or updates to the model have been made and
tested.
• The code for the churn model has been reviewed, and any necessary
changes have been made.
• Any necessary updates to the deployment or testing processes have
been made.
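A definition of done like the one above can even be encoded explicitly, so a task is only closed once every criterion is checked off. The criterion names and the is_done helper below are invented for illustration:

```python
# Hypothetical definition-of-done checklist for a churn model task,
# mirroring the criteria listed above.
CHURN_MODEL_DOD = [
    "tested_in_production",
    "documented",
    "reviewed_by_stakeholders",
    "model_updates_tested",
    "code_reviewed",
    "deployment_process_updated",
]

def is_done(completed: set) -> bool:
    """A task is done only when every criterion has been completed."""
    return all(criterion in completed for criterion in CHURN_MODEL_DOD)

print(is_done(set(CHURN_MODEL_DOD)))  # True
print(is_done({"documented"}))        # False
```

In practice teams track this in a ticketing system, but making the checklist explicit is what keeps "done" from meaning different things to different people.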
• User interface: This section describes the user interface for the system,
including screen mock-ups and detailed descriptions of the user flows
and interactions.
• Testing: This section describes the testing strategy for the system,
including the types of tests that will be performed (e.g., unit tests,
integration tests, etc.) and the criteria that must be met for the system
to be considered complete.
• Deployment: This section describes the deployment strategy for the
system, including the environments where it will be deployed (e.g.,
staging, production) and the process for deploying updates and bug
fixes.
• Maintenance: This section describes the system’s ongoing mainte-
nance and support plan, including the procedures for monitoring and
responding to issues and the process for making and implementing
updates and enhancements.
Notes
Introduction
3 Box, G. E. “All models are wrong, but some are useful.” Robustness in
Statistics 202.1979 (1979): 549.
4 Arnold, Ross D., and Jon P. Wade. “A complete set of systems thinking
skills.” Insight 20.3 (2017): 9-17.
Through Data and AI: A Practical Guide to Delivering Data Science and
Machine Learning Products. Kogan Page Publishers, 2020.
9 Taleb, Nassim Nicholas. Antifragile: Things that gain from disorder. Vol.
3. Random House, 2012.
10 Maister, David H., Robert Galford, and Charles Green. The trusted
advisor. Free Press, 2021.
17 Taleb, Nassim Nicholas. The Black Swan: The Impact of the Highly
Improbable. Random House, 2007.
20 Scavetta, Rick, and Boyan Angelov. Python and R for the Modern Data
Scientist. O’Reilly Media, 2021.
21 Sinek, Simon. Start with why: How great leaders inspire everyone to take
action. Penguin, 2009.
22 Peng, Roger D., and Elizabeth Matsui. The Art of Data Science: A guide
for anyone who works with Data. Skybrude consulting LLC, 2016.
Overview
25 Petersen, Kai, Claes Wohlin, and Dejan Baca. “The waterfall model
in large-scale development.” International Conference on Product-
Focused Software Process Improvement. Springer, Berlin, Heidelberg,
2009.
30 Ries, Eric. The Lean Startup. New York: Crown Business, 2011.
32 Kim, Gene, Kevin Behr, and George Spafford. The Phoenix Project: A Novel
About IT, DevOps, and Helping Your Business Win. IT Revolution, 2014.
33 Knapp, Jake, John Zeratsky, and Braden Kowitz. Sprint: How to solve big
problems and test new ideas in just five days. Simon and Schuster, 2016.