OReilly Report - What Is Data Observability
What Is Data Observability?
Building Trust with Real-Time Visibility
Andy Petrella
REPORT
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. What Is Data
Observability?, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
The views expressed in this work are those of the author and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Kensu, Inc. See our
statement of editorial independence.
978-1-098-12096-2
[LSI]
CHAPTER 1
It's Time to Rethink Data Management
Much of this growth is being driven by the advancement of open source machine learning, which relies on large datasets for training models.
To manage this exploding data growth, data teams have risen in importance and size. Small data teams with few stakeholders have been replaced by large teams that must answer to executive pressure. Yet, despite the increased value and presence of data within organizations, little thought has been given to how to monitor the quality of the data itself.
While IT and DevOps have numerous quality control measures (such as development practices like continuous integration and testing) to protect against application downtime, most organizations don't have similar measures in place to protect against data issues. But just as organizations depend on high levels of reliability from their applications, they also depend on the reliability of their data.
When data issues—such as partial, erroneous, missing, or inaccurate data—occur, the impact can rapidly multiply and escalate in complex data systems. These data incidents can have significant negative repercussions for the business, including loss of trust and loss of revenue. A lack of control or visibility into data quality can also produce faulty insights, which can lead the organization to make poor decisions that result in lost revenue or a poor customer experience.
Resolving data incidents can be time-consuming and difficult. Identifying the root cause of this type of failure requires knowing where the data flowed from and, therefore, what applications were participating in creating value from that data. But early identification of data issues, which in most cases are unknown to the data owners, is inherently challenging in complex data systems. IT and data teams must become archeologists, trying to discover what's gone wrong. Yet each team is siloed and doesn't have complete visibility. The further downstream the issue, the harder it is to figure out what happened and the more information gets lost in translation. Patches may fix the issue, but if the root cause isn't identified, there's little confidence that the issue won't reoccur. This creates a lack of trust in the data and in the team responsible for shepherding that data to the data user. This, in turn, mutes the data's value.
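The archeology described above—walking back through the applications and datasets a result flowed through—can be sketched as a simple upstream traversal of a lineage graph. The dataset names and the `lineage` structure below are illustrative assumptions, not part of any real system:

```python
# Minimal illustration: trace a data issue upstream through a lineage graph.
# The graph maps each dataset to the datasets it was derived from.
lineage = {
    "dashboard_kpis": ["sales_aggregates"],
    "sales_aggregates": ["clean_orders"],
    "clean_orders": ["raw_orders"],
    "raw_orders": [],
}

def upstream_of(dataset, graph):
    """Return every dataset upstream of `dataset`, nearest first."""
    seen, order = set(), []
    queue = list(graph.get(dataset, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.add(current)
            order.append(current)
            queue.extend(graph.get(current, []))
    return order

# A broken KPI could originate in any of these upstream candidates:
print(upstream_of("dashboard_kpis", lineage))
# -> ['sales_aggregates', 'clean_orders', 'raw_orders']
```

Without a recorded graph like this, each hop in the traversal becomes a manual conversation between siloed teams, which is why the incident archeology is so slow.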
The lesson? No matter how vast an organization's data assets are or how advanced the technologies that support the management and analysis of data, they cannot be useful without reliable data.
In one case, the data isn't there at all, so fixing the data requires fixing the application upstream to ensure it ingests the entire dataset. In another case, the data may simply no longer be available and, therefore, the producer of that data needs to be made aware so that they can adapt their analysis to this new reality. Further, when you're performing an analysis of the data, certain assumptions are made that may no longer be true at a later point in time. Here's an example of what I mean.
When an analyst initially builds a model to analyze a retail organization's previous year's sales data, they can see that the organization collected data every month for the prior year. Thus, the model is trained on 12 months of data from the last year. However, while the data was complete when the model was initially built, that may no longer be the case. Perhaps, when the analysis is deployed six months later, the assumption of having 12 months of data (with the seasonality of four quarters) is no longer true because the system can no longer handle this amount of data. As a result, the previous year's data is removed from the analysis. Since Q4 tends to include some of a retailer's highest-grossing months, this change can have a major impact on whether the insights generated from the data are accurate, especially when the assumption is that this data is present. However, because the analyst hasn't shared their assumptions with the producer, and the producer doesn't have that level of visibility into the data models, the producer is not aware of this "implicit" constraint coming from the data usage. Hence, they also don't know that the output from the model (their analysis) is inaccurate.
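The analyst's implicit assumption—12 consecutive months of data—is exactly the kind of expectation that can be made explicit and checked before each run. Here is a minimal sketch of such a check; the function name, the `(year, month)` representation, and the 12-month rule are assumptions for illustration, not a specific tool's API:

```python
def check_monthly_coverage(observed_months, expected_count=12):
    """Verify a dataset covers `expected_count` consecutive months.

    `observed_months` is a set of (year, month) tuples actually present.
    Returns the sorted list of missing (year, month) pairs, or None if
    there is nothing to check against.
    """
    if not observed_months:
        return None
    # Walk backward from the most recent observed month.
    year, month = max(observed_months)
    expected = set()
    for _ in range(expected_count):
        expected.add((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return sorted(expected - observed_months)

# Jan-Sep 2021 plus Q4 2020: the full 12-month window is present.
months = {(2021, m) for m in range(1, 10)} | {(2020, m) for m in (10, 11, 12)}
print(check_monthly_coverage(months))                  # -> []
# The system silently trimmed December 2020, breaking the assumption:
print(check_monthly_coverage(months - {(2020, 12)}))   # -> [(2020, 12)]
```

Run before the model is retrained or scored, a check like this turns the "implicit" constraint into an explicit, alertable one, so the producer learns about the broken assumption before the inaccurate analysis ships.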
This is where data observability comes into play. You need to be able to observe and be aware of these types of silent changes to the data so that you can fix them preemptively—before your CEO comes to tell you that the numbers look wrong.
Data observability is a solution that gives an organization the ability to measure the health and usage of its data within its system, as well as health indicators of the overall system. By using automated logging and tracing information that allows an observer to interpret the health of datasets and pipelines, data observability enables data engineers and data teams to identify and be alerted about data quality issues across the overall system. This makes it much faster and easier to resolve data-induced issues at their root rather than just patching issues as they arise.
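In practice, the "automated logging and tracing" in this definition often starts with each pipeline step emitting a few health metrics about the data it touches. The following framework-free sketch shows the idea; the metric names and the list-of-dicts dataset shape are illustrative assumptions, not any particular product's API:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("data-observability")

def observe(dataset_name, rows):
    """Emit basic health metrics for a dataset given as a list of dicts."""
    fields = sorted({key for row in rows for key in row})
    metrics = {
        "dataset": dataset_name,
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        # Fraction of rows where each field is missing or None.
        "null_rate": {
            f: sum(1 for r in rows if r.get(f) is None) / max(len(rows), 1)
            for f in fields
        },
    }
    log.info(json.dumps(metrics))  # structured log line an observer can parse
    return metrics

orders = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},
]
m = observe("clean_orders", orders)
# m["row_count"] == 2; m["null_rate"]["amount"] == 0.5
```

Because every step logs the same structured shape, an observer (human or automated) can compare runs over time and alert when a metric drifts, which is what makes root-cause resolution faster than patching symptoms downstream.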
Without data observability, not only would the data quality issue be outside the pharmaceutical company's control, but it would also be very difficult and time-consuming even to pinpoint which external input (lab) was creating the problem. And even once the company could identify the lab causing the issue, without observability into the data itself, it still would not have a clear understanding of the root cause. But, by being able to trace the issue within its own systems and datasets, the company can provide a much more informative and detailed report of the issue to the lab. In turn, this improves the affected lab's ability to resolve the issue and speeds the time it takes to do so. Thus, using circles of influence, you can observe and identify data quality issues, build greater trust in the entire system, and find and fix the root cause of a data issue more quickly.
While this structure allows each team to maintain its own quality controls and outcomes, this process also creates silos. Each team has limited visibility into the functions of the other teams and how the data or pipelines might need to be changed (or not) to fit those functions. This makes it difficult for the IT team to anticipate any changes and take preventative measures to avert any issues experienced by the data or analytic teams.
Figure 2-2. Data mesh planes supporting the domain's data products and their usage3
• Validating rules before and continuously in production to generate notifications about specific data events and their context
• Reviewing or refactoring your delivery upon triggered rules
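The two steps above—continuously validated rules that generate notifications with context—can be sketched as a rule function plus an alert payload. Everything here (the rule name, the 50% tolerance, and the `notify` stub) is an illustrative assumption rather than a prescribed implementation:

```python
def rule_row_count_stable(current, previous, tolerance=0.5):
    """Flag a run whose row count deviates from the previous run by more
    than `tolerance` (0.5 = 50%). Returns an alert dict with context,
    or None when the rule passes or there is no baseline."""
    if previous == 0:
        return None  # no baseline to compare against
    change = abs(current - previous) / previous
    if change > tolerance:
        return {
            "rule": "row_count_stable",
            "current": current,
            "previous": previous,
            "change_pct": round(change * 100, 1),
        }
    return None

def notify(alert):
    # Stand-in for a real channel (email, chat, incident tooling).
    print(f"DATA ALERT: {alert}")

# Evaluated continuously in production, after each pipeline run:
alert = rule_row_count_stable(current=1_200, previous=10_000)
if alert:
    notify(alert)  # an 88% drop triggers a notification with context
```

The alert carries the context (rule, values, magnitude) needed for the review-or-refactor step, so the team triggered by the rule can decide whether the delivery itself must change.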
1 NewVantage Partners, Big Data and AI Executive Survey 2021: The Journey to Becoming Data-Driven—A Progress Report on the State of Corporate Data Initiatives, 2021, https://c6abb8db-514c-4f5b-b5a1-fc710f1e464e.filesusr.com/ugd/e5361a_d59b4629443945a0b0661d494abb5233.pdf.
2 Randy Bean and Thomas H. Davenport, "Companies Are Failing in Their Efforts to Become Data-Driven," Harvard Business Review, February 5, 2019, https://hbr.org/2019/02/companies-are-failing-in-their-efforts-to-become-data-driven.
3 Dun & Bradstreet, The Past, Present, and Future of Data, 2019, https://www.dnb.com/content/dam/english/dnb-data-insight/DNB_Past_Present_and_Future_of_Data_Report.pdf.
4 NewVantage Partners, Big Data and AI Executive Survey 2021.