
The Ultimate Guide to Data Quality

What is data quality and why does it matter?

Data quality is defined as the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

As data becomes increasingly important to modern companies, it’s crucial that teams can trust their data to be accurate and reliable. Data quality is the preeminent way to measure whether or not your data can be trusted to deliver value to the business.

When bad data is served to your data consumers—whether internal users or external customers—the consequences can be downright disastrous. The cost of bad data shows up in wasted time, lost revenue, missed opportunities, and damaged trust. And these issues are widespread: many data leaders tell us their data scientists and engineers spend 40 percent or more of their time troubleshooting or firefighting data problems. Gartner estimates companies spend upwards of $15 million annually on data downtime. And over 88 percent of U.S. businesses have lost money because of data quality issues.

Fortunately, with the right approach, data quality can be measured, tracked, and improved over time. In fact, some of the best data teams have charted the course for others when it comes to managing data quality at scale.

CASE STUDY: How Yotpo fixes data quality
1 in 5 companies have lost a customer due to data quality issues. For Yotpo, Monte Carlo’s approach to data observability empowered them to solve data issues fast so they could start trusting their data to deliver reliable, actionable insights for the business.
LEARN MORE: How Yotpo Fixes Data Quality at Scale

Bottom line: Even in today’s day and age, data quality is not
given the diligence it deserves—and that’s a serious problem.
How to measure data quality
Traditional methods of measuring data quality are often time- and resource-intensive, spanning several variables, from accuracy and completeness to validity and timeliness. But these approaches are often subjective and difficult to communicate to stakeholders who don’t handle the data day to day. Fortunately, there’s a better way: measuring data downtime.

The Data Engineer’s KPI for Data Quality:
Data Downtime = Number of data incidents × (Time-to-Detection + Time-to-Resolution)

Data downtime refers to periods of time when data is missing, erroneous, or otherwise inaccurate, and often suggests a broken data pipeline. By measuring data downtime, you can determine the reliability of your data and ensure the confidence necessary to use it or lose it.

Further Reading: Data Quality — You’re Measuring It Wrong

Overall, data downtime is a function of:

● Number of data incidents (N) — This factor is not always in your control, given that you rely on data sources “external” to your team, but it’s certainly a driver of data uptime.
● Time-to-detection (TTD) — In the event of an incident, how quickly are you alerted? In extreme cases, this quantity can be measured in months if you don’t have the proper methods for detection in place. Silent errors made by bad data can result in costly decisions, with repercussions for both your company and your customers.
● Time-to-resolution (TTR) — Following a known incident, how quickly are you able to resolve it? A minimal sketch of how this KPI can be computed follows the list below.
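To make the KPI concrete, here is a minimal sketch, in Python, of how a team might compute data downtime from an incident log. The Incident fields and example timestamps are assumptions for illustration; summing TTD and TTR per incident is equivalent to N × (TTD + TTR) when TTD and TTR are averages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # Hypothetical incident record; the field names are illustrative, not a real schema.
    occurred_at: datetime   # when the data actually broke
    detected_at: datetime   # when the team was alerted (drives TTD)
    resolved_at: datetime   # when the issue was fixed (drives TTR)

def data_downtime(incidents: list[Incident]) -> timedelta:
    """Sum (TTD + TTR) over all incidents; equivalent to N x (average TTD + average TTR)."""
    total = timedelta()
    for i in incidents:
        ttd = i.detected_at - i.occurred_at
        ttr = i.resolved_at - i.detected_at
        total += ttd + ttr
    return total

incidents = [
    Incident(datetime(2023, 1, 3, 2), datetime(2023, 1, 3, 9), datetime(2023, 1, 3, 15)),
    Incident(datetime(2023, 1, 12, 0), datetime(2023, 1, 14, 0), datetime(2023, 1, 14, 6)),
]
print(data_downtime(incidents))  # 2 days, 19:00:00 of data downtime this period
```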
The new rules of data quality
There are two types of data quality issues in this world: those you can predict (known unknowns) and those you can’t (unknown unknowns).

● Data quality issues that can be easily predicted. For these known unknowns, automated data testing and manual threshold setting should cover your bases.
● Data quality issues that cannot be easily predicted. These are your unknown unknowns. And as data pipelines become increasingly complex, this number will only grow.

If data testing can cover what you know might happen to your data, you need a way to monitor and alert for what we don’t know might happen to your data (our unknown unknowns).

In the same way that application engineering teams don’t exclusively use unit and integration testing to catch buggy code, data engineering teams need to take a similar approach by making data observability a key component of their stack. Just like software, data requires both testing and observability to ensure consistent reliability. In fact, modern data teams at Intuit, PagerDuty, and Fox, among other companies, think about data as a dynamic, ever-changing entity, and apply not just rigorous testing, but also continual observability. Given the millions of ways that data can break (or, the unknown unknowns), we can use the same DevOps principles to cover these edge cases.

For most, a robust and holistic approach to data observability incorporates:

● Metadata aggregation & discovery. If you don’t know what data you have, you certainly won’t know whether it’s useful. Data discovery is incorporated into the best data observability platforms, offering a centralized perspective into your data ecosystem.
● Automatic monitoring & alerting for data issues. A great data observability approach will ensure you’re the first to know and solve data issues, allowing you to address data quality issues right when they happen, as opposed to several months down the road. On top of that, such a solution requires minimal configuration and practically no threshold-setting.
● Lineage to track upstream and downstream dependencies. Automated, end-to-end lineage empowers data teams to track the flow of their data from A (ingestion) to Z (analytics), incorporating transformations, modeling, and other steps in the process.
● Both custom & automatically generated rules. Most data teams need an approach that leverages the best of both worlds: using machine learning to identify abnormalities in your data based on historical behavior, as well as the ability to set rules unique to the specs of your data.
● Collaboration between data analysts, data engineers, and data scientists. Data teams should be able to easily and quickly collaborate to resolve issues, set new rules, and better understand the health of their data.
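As a hedged illustration of the “known unknowns” half of this split, the sketch below shows the kind of explicit, manually thresholded checks a team might write against a table before it ships. The orders table, column names, and limits are assumptions; the unpredictable failures described above still require automated monitoring on top of tests like these.

```python
import sqlite3  # stand-in for a warehouse connection; swap in your own driver

def run_known_unknown_checks(conn: sqlite3.Connection) -> list[str]:
    """Hand-written tests for predictable failure modes: nulls, duplicates, accepted ranges."""
    failures = []

    # Completeness: no NULL keys in the (hypothetical) orders table.
    nulls = conn.execute("SELECT COUNT(*) FROM orders WHERE order_id IS NULL").fetchone()[0]
    if nulls:
        failures.append(f"{nulls} rows with NULL order_id")

    # Uniqueness: order_id should behave like a primary key.
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicated order_id values")

    # Validity: amounts must sit inside a manually chosen range.
    out_of_range = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount < 0 OR amount > 100000"
    ).fetchone()[0]
    if out_of_range:
        failures.append(f"{out_of_range} rows with out-of-range amount")

    return failures
```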
3 key steps to building a great data quality program
Dror Engel, Head of Product, Data Services, at eBay, and Barr Moses, CEO and co-founder of Monte Carlo, share 3 essential steps for establishing a
successful data quality strategy.

1. Get leadership & stakeholder buy-in early:

● How do you measure the data quality of the assets your company collects and stores?
● What are the key KPIs or goals you’re going to hold your data quality strategy accountable for meeting?
● Do you have cross-functional involvement from leadership and data users in other parts of the company?
● Who at the company will be held accountable for meeting your strategy’s KPIs and goals?
● What checks and balances do you have to ensure KPIs are measured correctly and goals can be met?

2. Spearhead a data stewardship program

● To make sure that data users across the company are aware of why data quality matters, we suggest
developing a program for data quality champions to carry the torch and shepherd others through data access,
use, and storage best practices.
● Focus on short-term or quick wins to get traction while promoting and executing on the long-term strategy.

3. Automate your lineage and data governance tooling

● Simply put, with increasingly stringent compliance measures around data access and applications, taking a manual approach to monitoring your organization’s data quality is not the answer. Not only is the process tedious and time-consuming, but the tooling available cannot keep up with the speed of innovation across your data stack.
● Instead, we suggest investing in automated tools that can quickly validate, monitor, and alert for data quality issues as they arise. Add the ability to set custom rules, and these technologies can truly unlock the potential of data for your organization. A rough sketch of what such custom rules might look like follows below.
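As a rough sketch of what “custom rules plus automation” could look like in practice (the rule format, table name, and thresholds are invented for illustration and are not taken from any particular tool):

```python
from typing import Callable

# A custom rule is a human-readable name plus a predicate over observed metrics.
Rule = tuple[str, Callable[[dict], bool]]

# Hypothetical rules for a hypothetical table; thresholds are illustrative only.
CUSTOM_RULES: dict[str, list[Rule]] = {
    "analytics.orders": [
        ("row_count_above_minimum", lambda m: m["rows"] >= 1_000),
        ("null_rate_below_1_percent", lambda m: m["null_rate"] < 0.01),
    ],
}

def evaluate(table: str, metrics: dict) -> list[str]:
    """Run every custom rule registered for a table and return the violations."""
    return [
        f"{table}: rule '{name}' failed for metrics {metrics}"
        for name, check in CUSTOM_RULES.get(table, [])
        if not check(metrics)
    ]

# A scheduler (Airflow, cron, etc.) would run this after each load and route any
# violations to the team's alerting channel.
print(evaluate("analytics.orders", {"rows": 250, "null_rate": 0.002}))
```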
How to measure ROI for your data quality program
To measure the potential ROI of a data quality program, we’ve found that the following metrics (borrowed from the DevOps world) offer a good start: Time-to-Detection and Time-to-Resolution.

Tools for decreasing TTD:

● Machine learning-powered anomaly detection: Testing your data before it goes into production is P0, but for tracking those unknown unknowns, it’s helpful to implement automated anomaly detection and custom rules.
● Relevant incident feeds and notifications: Integrating a communication layer (likely an API) between your data platform and PagerDuty, Slack, or any other incident management solutions you use is critical for conducting root cause analysis, setting SLAs/SLIs, and triaging data downtime as it arises. A minimal Slack-notification sketch appears after these lists.

Tools for decreasing TTR:

● Statistical root cause analysis: As we discussed in a previous article, root cause analysis is a common practice among Site Reliability Engineering teams when it comes to identifying why and how applications break in production. Similarly, data teams can leverage statistical root cause analysis and other intelligent insights about their data to understand why these issues arose in the first place.
● End-to-end lineage: Robust lineage at each stage of the data lifecycle empowers teams to track the flow of their data from A (ingestion) to Z (analytics), incorporating transformations, modeling, and other steps in the process, and it is critical for supplementing the often narrow-sighted insights (no pun intended) with statistical RCA approaches. The OpenLineage standard for metadata and lineage collection is a great place to start.
● Data discovery to understand data access patterns: While many data catalogs have a UI-focused workflow, data engineers need the flexibility to interact with their catalogs programmatically, through data discovery.
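To make the “communication layer” idea concrete, here is a minimal sketch of pushing a detected incident into Slack through an incoming webhook. The webhook URL, table name, and incident details are placeholders; PagerDuty and other incident management tools expose similar HTTP APIs.

```python
import json
import urllib.request

# Placeholder URL; Slack generates a real one when you create an incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_incident(table: str, issue: str, detected_at: str) -> None:
    """Post a data-incident message so detection is visible and resolution can start immediately."""
    payload = {
        "text": f":rotating_light: Data incident on `{table}`: {issue} (detected {detected_at})"
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # Slack responds with "ok" on success

notify_incident("analytics.orders", "row count dropped from ~2,000 to 50", "2023-01-14 06:00 UTC")
```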
Notes from the field: the role of data quality
We spoke with data leaders across industries about the role data quality and trust play at their companies. And although nearly every arm
of every organization at every company relies on data, not every data team has the same structure, and various industries have different
requirements. Here’s what they had to say:
On reliable metrics:
“At our weekly All Hands meeting, there’s time dedicated to reviewing metrics. Data is something that’s prevalent in our culture and that everyone has the ability to see and take in. This openness reinforces the importance of data to our operations.”
Annie Tran, Director of Data Science

On communicating the value of data:
“Think about your company’s data needs at a high level. What do they care about? What do they need to understand better, and what data will give them those insights?”
Amy Smith, Staff Business Data Analyst

On building user trust in data:
“Facilitating user confidence that they have found the right information or data to act on is paramount. We needed to ensure that the data driving our business decisions could be trusted, and that we knew whether or not that data can be applied to a given use case.”
Eli Sebastian Brumbaugh, Data Product Design Lead
The 5 top data quality challenges - and how to overcome them
1. Maintaining trust with end users
○ Building and maintaining trust with users is difficult. It only takes one costly mistake to damage the trust of users, and once trust is lost, it is hard to regain. Data needs to be trustworthy and reliable for stakeholders to use it; otherwise, they rely on their intuition, which is often worse than working with “bad” data.
○ One way businesses maintain trust with end users is by setting data reliability SLAs (see the sketch after this list). By doing so, you can build trust and strengthen relationships between your data, data teams, and downstream consumers, whether that’s your customers or cross-functional teams at your company.
2. Developing a clear framework for handling data quality issues
○ Data testing alone is insufficient, particularly at scale. Data changes — a lot — and even moderately sized datasets introduce a lot of complexity and variability. Since testing your data cannot find or anticipate every possible issue, data teams blend testing with continual monitoring and observability across the entire pipeline.
3. Investing in automation
○ Automate as much work as possible when it comes to data quality. A good rule of thumb is to limit the amount of human interaction, as that increases the likelihood of errors being made.
○ Of course, it is always better to improve data quality through a proactive approach rather than a reactive one, which often wastes valuable engineering time that could have been spent improving product features or building critical business dashboards.
4. Establishing ownership around data quality
○ We recommend defining product champions for different aspects of tooling in your data stack. When an issue arises, the designated person is responsible for alerting other team members.
○ Integration with Slack is crucial for owning data quality. Users can see there is a data quality issue, and data team members can respond in the thread to let others know they are actively resolving it.
5. Creating a culture around data quality
○ Ensure leadership and stakeholders understand the importance of data quality to the organization. Sometimes it’s best to drive data quality from the top down and to reward users for their efforts early on.
○ Educating employees about their role in maintaining and improving data quality is also crucial. The best way to accomplish this is through storytelling.
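As referenced in challenge 1, here is a minimal sketch of an SLI that could back a data reliability SLA. The 99.5% target, the six-hour freshness limit, and the update timestamps are assumptions for illustration; real SLAs would be negotiated with downstream consumers.

```python
from datetime import datetime, timedelta

SLA_TARGET = 0.995                     # e.g. "orders data is fresh 99.5% of the time"
FRESHNESS_LIMIT = timedelta(hours=6)   # the table must refresh at least every six hours

def freshness_sli(update_times: list[datetime], window_start: datetime, window_end: datetime) -> float:
    """Fraction of the reporting window during which the table counted as fresh."""
    stale = timedelta()
    previous = window_start
    for t in sorted(update_times) + [window_end]:
        gap = t - previous
        if gap > FRESHNESS_LIMIT:
            stale += gap - FRESHNESS_LIMIT   # time spent past the freshness limit
        previous = t
    return 1 - stale / (window_end - window_start)

sli = freshness_sli(
    update_times=[datetime(2023, 1, 1, h) for h in (0, 5, 11, 23)],
    window_start=datetime(2023, 1, 1),
    window_end=datetime(2023, 1, 2),
)
print(f"SLI = {sli:.1%}, SLA met: {sli >= SLA_TARGET}")  # SLI = 75.0%, SLA met: False
```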

Further reading: Data quality at Airbnb


The way forward: Data Observability
As companies increasingly leverage data-driven insights to drive innovation and maintain their competitive edge, it’s important that their data is accurate and trustworthy. With data observability, data teams can now identify and prevent inaccurate, missing, or erroneous data from breaking their analytics dashboards, delivering more reliable insights.

Data observability automatically monitors key features of your data ecosystem, including data freshness, distribution, volume, schema, and lineage. Without the need for manual threshold setting, data observability answers such questions as:

● When was my table last updated?
● Is my data within an accepted range?
● Is my data complete? Did 2,000 rows suddenly turn into 50?
● Who has access to our marketing tables and made changes to them?
● Where did my data break? Which tables or reports were affected?

With the right approach to data observability, data teams can trace field-level lineage across entire data workflows,
facilitating greater visibility into the health of their data and the insights those pipelines deliver.
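As a hedged sketch of what monitoring without manual threshold setting can look like in practice, the monitor below learns a normal row-count range from recent history and flags loads that fall far outside it. The daily row counts and the simple z-score heuristic are assumptions for illustration; commercial observability tools use considerably richer models.

```python
import statistics

def volume_anomaly(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    """Flag the latest row count if it deviates sharply from the learned baseline."""
    if len(history) < 7:
        return False  # not enough history to learn a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against a perfectly flat history
    return abs(latest - mean) / stdev > z_threshold

# Example: a table that normally lands ~2,000 rows per day suddenly receives 50.
daily_row_counts = [1980, 2010, 1995, 2040, 2005, 1990, 2025]
print(volume_anomaly(daily_row_counts, 50))    # True  -> open a data incident
print(volume_anomaly(daily_row_counts, 2015))  # False -> normal load
```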

CASE STUDY:
Manchester-based Auto Trader is the largest digital automotive marketplace in the United Kingdom and Ireland. The company sees 235 million advertising views and 50 million cross-platform visitors per month, with thousands of interactions per minute—all data points the Auto Trader team can analyze and leverage to improve efficiency, customer experience, and, ultimately, revenue. Like other startups, Auto Trader treats ensuring data quality across complex pipelines as a top priority. With data observability in tow, achieving end-to-end data quality and reliability became a reality as the team scaled their decentralized data platform.

Learn more: Scaling Data Trust: How AutoTrader Migrated to a Decentralized Data Platform
The future of data quality
To unlock the true value of your data, we need to go beyond data quality. We need to ensure our data is reliable and trustworthy wherever it is. And the only way to get there is by creating observability — all the way from source to consumption. Data observability, an organization’s ability to fully understand the health of the data in its systems, takes the guesswork out of data reliability by applying the best practices of DevOps to data pipelines. With automation and machine learning, observability helps reliably deliver fresh, accurate, complete data along every step of its complex lifecycle. Data quality may be a significant challenge, but data observability is a powerful solution.

Looking for more insights? Dive into our recommended resources below:

● The New Rules of Data Quality
● Improve Data Engineering Workflows
● Automated Data Quality Testing at Scale with SQL & Machine Learning

Curious about how data observability can help your company achieve high-quality data?

Request a Demo
