
The Ultimate Guide to Data Quality

What is data quality and why does it matter?

Data quality is defined as the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

As data becomes increasingly important to modern companies, it’s crucial that teams can trust their data to be accurate and reliable. Data quality is the preeminent way to measure whether or not your data can be trusted to deliver value to the business.

When bad data is served to your data consumers—whether internal users or external customers—the consequences can be downright disastrous. The cost of bad data shows up in wasted time, lost revenue, missed opportunities, and damaged trust. And these issues are widespread: many data leaders tell us their data scientists and engineers spend 40 percent or more of their time troubleshooting or firefighting data problems. Gartner estimates companies spend upwards of $15 million annually on data downtime. And over 88 percent of U.S. businesses have lost money because of data quality issues.

Fortunately, with the right approach, data quality can be measured, tracked, and improved over time. In fact, some of the best data teams have charted the course for others when it comes to managing data quality at scale.

CASE STUDY: How Yotpo fixes data quality
1 in 5 companies have lost a customer due to data quality issues. For Yotpo, Monte Carlo’s approach to data observability empowered them to solve data issues fast so they could start trusting their data to deliver reliable, actionable insights for the business.
LEARN MORE: How Yotpo Fixes Data Quality at Scale

Bottom line: Even in today’s day and age, data quality is not
given the diligence it deserves—and that’s a serious problem.
How to measure data quality
Traditional methods of measuring data quality are often time- and resource-intensive, spanning several variables, from accuracy and completeness to validity and timeliness. But these approaches are often subjective and difficult to communicate to stakeholders who don’t handle the data day to day. Fortunately, there’s a better way: measuring data downtime.

The Data Engineer’s KPI for Data Quality:
Data Downtime = Number of data incidents × (Time-to-Detection + Time-to-Resolution)

Data downtime refers to periods of time when data is missing, erroneous, or otherwise inaccurate, and often suggests a broken data pipeline. By measuring data downtime, you can determine the reliability of your data and ensure the confidence necessary to use it or lose it.

Further Reading: Data Quality — You’re Measuring It Wrong

Overall, data downtime is a function of:

● Number of data incidents (N) — This factor is not always in your control, given that you rely on data sources “external” to your team, but it’s certainly a driver of data uptime.
● Time-to-detection (TTD) — In the event of an incident, how quickly are you alerted? In extreme cases, this quantity can be measured in months if you don’t have the proper methods for detection in place. Silent errors made by bad data can result in costly decisions, with repercussions for both your company and your customers.
● Time-to-resolution (TTR) — Following a known incident, how quickly are you able to resolve it? A minimal sketch of how this KPI can be computed follows the list below.
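To make the KPI concrete, here is a minimal sketch, in Python, of how a team might compute data downtime from an incident log. The Incident fields and example timestamps are assumptions for illustration; summing TTD and TTR per incident is equivalent to N × (TTD + TTR) when TTD and TTR are averages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # Hypothetical incident record; the field names are illustrative, not a real schema.
    occurred_at: datetime   # when the data actually broke
    detected_at: datetime   # when the team was alerted (drives TTD)
    resolved_at: datetime   # when the issue was fixed (drives TTR)

def data_downtime(incidents: list[Incident]) -> timedelta:
    """Sum (TTD + TTR) over all incidents; equivalent to N x (average TTD + average TTR)."""
    total = timedelta()
    for i in incidents:
        ttd = i.detected_at - i.occurred_at
        ttr = i.resolved_at - i.detected_at
        total += ttd + ttr
    return total

incidents = [
    Incident(datetime(2023, 1, 3, 2), datetime(2023, 1, 3, 9), datetime(2023, 1, 3, 15)),
    Incident(datetime(2023, 1, 12, 0), datetime(2023, 1, 14, 0), datetime(2023, 1, 14, 6)),
]
print(data_downtime(incidents))  # 2 days, 19:00:00 of data downtime this period
```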
The new rules of data quality
There are two types of data quality issues in this world: those you can predict (known unknowns) and those you can’t (unknown unknowns).

● Data quality issues that can be easily predicted. For these known unknowns, automated data testing and manual threshold setting should cover your bases.
● Data quality issues that cannot be easily predicted. These are your unknown unknowns. And as data pipelines become increasingly complex, this number will only grow.

If data testing can cover what you know might happen to your data, you need a way to monitor and alert for what we don’t know might happen to your data (our unknown unknowns).

In the same way that application engineering teams don’t exclusively use unit and integration testing to catch buggy code, data engineering teams need to take a similar approach by making data observability a key component of their stack. Just like software, data requires both testing and observability to ensure consistent reliability. In fact, modern data teams at Intuit, PagerDuty, and Fox, among other companies, think about data as a dynamic, ever-changing entity, and apply not just rigorous testing, but also continual observability. Given the millions of ways that data can break (or, the unknown unknowns), we can use the same DevOps principles to cover these edge cases.

For most, a robust and holistic approach to data observability incorporates:

● Metadata aggregation & discovery. If you don’t know what data you have, you certainly won’t know whether it’s useful. Data discovery is incorporated into the best data observability platforms, offering a centralized perspective into your data ecosystem.
● Automatic monitoring & alerting for data issues. A great data observability approach will ensure you’re the first to know and solve data issues, allowing you to address data quality issues right when they happen, as opposed to several months down the road. On top of that, such a solution requires minimal configuration and practically no threshold-setting.
● Lineage to track upstream and downstream dependencies. Automated, end-to-end lineage empowers data teams to track the flow of their data from A (ingestion) to Z (analytics), incorporating transformations, modeling, and other steps in the process.
● Both custom & automatically generated rules. Most data teams need an approach that leverages the best of both worlds: using machine learning to identify abnormalities in your data based on historical behavior, as well as the ability to set rules unique to the specs of your data.
● Collaboration between data analysts, data engineers, and data scientists. Data teams should be able to easily and quickly collaborate to resolve issues, set new rules, and better understand the health of their data.
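As a hedged illustration of the “known unknowns” half of this split, the sketch below shows the kind of explicit, manually thresholded checks a team might write against a table before it ships. The orders table, column names, and limits are assumptions; the unpredictable failures described above still require automated monitoring on top of tests like these.

```python
import sqlite3  # stand-in for a warehouse connection; swap in your own driver

def run_known_unknown_checks(conn: sqlite3.Connection) -> list[str]:
    """Hand-written tests for predictable failure modes: nulls, duplicates, accepted ranges."""
    failures = []

    # Completeness: no NULL keys in the (hypothetical) orders table.
    nulls = conn.execute("SELECT COUNT(*) FROM orders WHERE order_id IS NULL").fetchone()[0]
    if nulls:
        failures.append(f"{nulls} rows with NULL order_id")

    # Uniqueness: order_id should behave like a primary key.
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicated order_id values")

    # Validity: amounts must sit inside a manually chosen range.
    out_of_range = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount < 0 OR amount > 100000"
    ).fetchone()[0]
    if out_of_range:
        failures.append(f"{out_of_range} rows with out-of-range amount")

    return failures
```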
3 key steps to building a great data quality program
Dror Engel, Head of Product, Data Services, at eBay, and Barr Moses, CEO and co-founder of Monte Carlo, share 3 essential steps for establishing a
successful data quality strategy.

1. Get leadership & stakeholder buy-in early:

● How do you measure the data quality of the assets your company collects and stores?
● What are the key KPIs or goals you’re going to hold your data quality strategy accountable for meeting?
● Do you have cross-functional involvement from leadership and data users in other parts of the company?
● Who at the company will be held accountable for meeting your strategy’s KPIs and goals?
● What checks and balances do you have to ensure KPIs are measured correctly and goals can be met?

2. Spearhead a data stewardship program

● To make sure that data users across the company are aware of why data quality matters, we suggest
developing a program for data quality champions to carry the torch and shepherd others through data access,
use, and storage best practices.
● Focus on short-term or quick wins to get traction while promoting and executing on the long-term strategy.

3. Automate your lineage and data governance tooling

● Simply put, with increasingly stringent compliance measures around data access and applications, taking a manual approach to monitoring your organization’s data quality is not the answer. Not only is the process tedious and time-consuming, but the tooling available cannot keep up with the speed of innovation across your data stack.
● Instead, we suggest investing in automated tools that can quickly validate, monitor, and alert for data quality issues as they arise. Add the ability to set custom rules, and these technologies can truly unlock the potential of data for your organization. A rough sketch of what such custom rules might look like follows below.
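As a rough sketch of what “custom rules plus automation” could look like in practice (the rule format, table name, and thresholds are invented for illustration and are not taken from any particular tool):

```python
from typing import Callable

# A custom rule is a human-readable name plus a predicate over observed metrics.
Rule = tuple[str, Callable[[dict], bool]]

# Hypothetical rules for a hypothetical table; thresholds are illustrative only.
CUSTOM_RULES: dict[str, list[Rule]] = {
    "analytics.orders": [
        ("row_count_above_minimum", lambda m: m["rows"] >= 1_000),
        ("null_rate_below_1_percent", lambda m: m["null_rate"] < 0.01),
    ],
}

def evaluate(table: str, metrics: dict) -> list[str]:
    """Run every custom rule registered for a table and return the violations."""
    return [
        f"{table}: rule '{name}' failed for metrics {metrics}"
        for name, check in CUSTOM_RULES.get(table, [])
        if not check(metrics)
    ]

# A scheduler (Airflow, cron, etc.) would run this after each load and route any
# violations to the team's alerting channel.
print(evaluate("analytics.orders", {"rows": 250, "null_rate": 0.002}))
```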
How to measure ROI for your data quality program
To measure the potential ROI of a data quality program, we’ve found that the following metrics (borrowed from the DevOps world) offer a good start: Time-to-Detection and Time-to-Resolution.

Tools for decreasing TTD:

● Machine learning-powered anomaly detection: Testing your data before it goes into production is P0, but for tracking those unknown unknowns, it’s helpful to implement automated anomaly detection and custom rules.
● Relevant incident feeds and notifications: Integrating a communication layer (likely an API) between your data platform and PagerDuty, Slack, or any other incident management solutions you use is critical for conducting root cause analysis, setting SLAs/SLIs, and triaging data downtime as it arises. A minimal Slack-notification sketch appears after these lists.

Tools for decreasing TTR:

● Statistical root cause analysis: As we discussed in a previous article, root cause analysis is a common practice among Site Reliability Engineering teams when it comes to identifying why and how applications break in production. Similarly, data teams can leverage statistical root cause analysis and other intelligent insights about their data to understand why these issues arose in the first place.
● End-to-end lineage: Robust lineage at each stage of the data lifecycle empowers teams to track the flow of their data from A (ingestion) to Z (analytics), incorporating transformations, modeling, and other steps in the process, and it is critical for supplementing the often narrow-sighted insights (no pun intended) with statistical RCA approaches. The OpenLineage standard for metadata and lineage collection is a great place to start.
● Data discovery to understand data access patterns: While many data catalogs have a UI-focused workflow, data engineers need the flexibility to interact with their catalogs programmatically, through data discovery.
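To make the “communication layer” idea concrete, here is a minimal sketch of pushing a detected incident into Slack through an incoming webhook. The webhook URL, table name, and incident details are placeholders; PagerDuty and other incident management tools expose similar HTTP APIs.

```python
import json
import urllib.request

# Placeholder URL; Slack generates a real one when you create an incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_incident(table: str, issue: str, detected_at: str) -> None:
    """Post a data-incident message so detection is visible and resolution can start immediately."""
    payload = {
        "text": f":rotating_light: Data incident on `{table}`: {issue} (detected {detected_at})"
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # Slack responds with "ok" on success

notify_incident("analytics.orders", "row count dropped from ~2,000 to 50", "2023-01-14 06:00 UTC")
```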
Notes from the field: the role of data quality
We spoke with data leaders across industries about the role data quality and trust play at their companies. And although nearly every arm
of every organization at every company relies on data, not every data team has the same structure, and various industries have different
requirements. Here’s what they had to say:
On reliable metrics:
“At our weekly All Hands meeting, there’s time dedicated to reviewing metrics. Data is something that’s prevalent in our culture and that everyone has the ability to see and take in. This openness reinforces the importance of data to our operations.”
Annie Tran, Director of Data Science

On communicating the value of data:
“Think about your company’s data needs at a high level. What do they care about? What do they need to understand better, and what data will give them those insights?”
Amy Smith, Staff Business Data Analyst

On building user trust in data:
“Facilitating user confidence that they have found the right information or data to act on is paramount. We needed to ensure that the data driving our business decisions could be trusted, and that we knew whether or not that data can be applied to a given use case.”
Eli Sebastian Brumbaugh, Data Product Design Lead
The 5 top data quality challenges - and how to overcome them
1. Maintaining trust with end users
○ Building and maintaining trust with users is difficult. It only takes one costly mistake to damage the trust of users, and once trust is lost, it is hard to regain. Data needs to be trustworthy and reliable for stakeholders to use it; otherwise, they rely on their intuition, which is often worse than working with “bad” data.
○ One way businesses maintain trust with end users is by setting data reliability SLAs (see the sketch after this list). By doing so, you can build trust and strengthen relationships between your data, data teams, and downstream consumers, whether that’s your customers or cross-functional teams at your company.
2. Developing a clear framework for handling data quality issues
○ Data testing alone is insufficient, particularly at scale. Data changes — a lot — and even moderately sized datasets introduce a lot of complexity and variability. Since testing your data cannot find or anticipate every possible issue, data teams blend testing with continual monitoring and observability across the entire pipeline.
3. Investing in automation
○ Automate as much work as possible when it comes to data quality. A good rule of thumb is to limit the amount of human interaction, as that increases the likelihood of errors being made.
○ Of course, it is always better to improve data quality through a proactive approach rather than a reactive one, which often wastes valuable engineering time that could have been spent improving product features or building critical business dashboards.
4. Establishing ownership around data quality
○ We recommend defining product champions for different aspects of tooling in your data stack. When an issue arises, the designated person is responsible for alerting other team members.
○ Integration with Slack is crucial for owning data quality. Users can see there is a data quality issue, and data team members can respond in the thread to let others know they are actively resolving it.
5. Creating a culture around data quality
○ Ensure leadership and stakeholders understand the importance of data quality to the organization. Sometimes it’s best to drive data quality from the top down and to reward users for their efforts early on.
○ Educating employees about their role in maintaining and improving data quality is also crucial. The best way to accomplish this is through storytelling.
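As referenced in challenge 1, here is a minimal sketch of an SLI that could back a data reliability SLA. The 99.5% target, the six-hour freshness limit, and the update timestamps are assumptions for illustration; real SLAs would be negotiated with downstream consumers.

```python
from datetime import datetime, timedelta

SLA_TARGET = 0.995                     # e.g. "orders data is fresh 99.5% of the time"
FRESHNESS_LIMIT = timedelta(hours=6)   # the table must refresh at least every six hours

def freshness_sli(update_times: list[datetime], window_start: datetime, window_end: datetime) -> float:
    """Fraction of the reporting window during which the table counted as fresh."""
    stale = timedelta()
    previous = window_start
    for t in sorted(update_times) + [window_end]:
        gap = t - previous
        if gap > FRESHNESS_LIMIT:
            stale += gap - FRESHNESS_LIMIT   # time spent past the freshness limit
        previous = t
    return 1 - stale / (window_end - window_start)

sli = freshness_sli(
    update_times=[datetime(2023, 1, 1, h) for h in (0, 5, 11, 23)],
    window_start=datetime(2023, 1, 1),
    window_end=datetime(2023, 1, 2),
)
print(f"SLI = {sli:.1%}, SLA met: {sli >= SLA_TARGET}")  # SLI = 75.0%, SLA met: False
```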

Further reading: Data quality at Airbnb


The way forward: Data Observability
As companies increasingly leverage data-driven insights to drive innovation and maintain their competitive edge, it’s important that their data is accurate and trustworthy. With data observability, data teams can now identify and prevent inaccurate, missing, or erroneous data from breaking their analytics dashboards, delivering more reliable insights.

Data observability automatically monitors key features of your data ecosystem, including data freshness, distribution, volume, schema, and lineage. Without the need for manual threshold setting, data observability answers such questions as:

● When was my table last updated?
● Is my data within an accepted range?
● Is my data complete? Did 2,000 rows suddenly turn into 50?
● Who has access to our marketing tables and made changes to them?
● Where did my data break? Which tables or reports were affected?

With the right approach to data observability, data teams can trace field-level lineage across entire data workflows,
facilitating greater visibility into the health of their data and the insights those pipelines deliver.
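As a hedged sketch of what monitoring without manual threshold setting can look like in practice, the monitor below learns a normal row-count range from recent history and flags loads that fall far outside it. The daily row counts and the simple z-score heuristic are assumptions for illustration; commercial observability tools use considerably richer models.

```python
import statistics

def volume_anomaly(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    """Flag the latest row count if it deviates sharply from the learned baseline."""
    if len(history) < 7:
        return False  # not enough history to learn a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against a perfectly flat history
    return abs(latest - mean) / stdev > z_threshold

# Example: a table that normally lands ~2,000 rows per day suddenly receives 50.
daily_row_counts = [1980, 2010, 1995, 2040, 2005, 1990, 2025]
print(volume_anomaly(daily_row_counts, 50))    # True  -> open a data incident
print(volume_anomaly(daily_row_counts, 2015))  # False -> normal load
```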

CASE STUDY:
Manchester-based Auto Trader is the largest digital automotive marketplace in the United Kingdom and Ireland. The company sees 235 million advertising views and 50 million cross-platform visitors per month, with thousands of interactions per minute—all data points the Auto Trader team can analyze and leverage to improve efficiency, customer experience, and, ultimately, revenue. Like other startups, Auto Trader treats ensuring data quality across complex pipelines as a top priority. With data observability in tow, achieving end-to-end data quality and reliability became a reality as the team scaled their decentralized data platform.

Learn more: Scaling Data Trust: How AutoTrader Migrated to a Decentralized Data Platform
The future of data quality
To unlock the true value of your data, we need to go beyond data quality. We need to ensure our data is reliable and trustworthy wherever it is. And the only way to get there is by creating observability — all the way from source to consumption. Data observability, an organization’s ability to fully understand the health of the data in its systems, takes the guesswork out of data reliability by applying the best practices of DevOps to data pipelines. With automation and machine learning, observability helps reliably deliver fresh, accurate, complete data along every step of its complex lifecycle. Data quality may be a significant challenge, but data observability is a powerful solution.

Looking for more insights? Dive into our recommended resources below:

● The New Rules of Data Quality
● Improve Data Engineering Workflows
● Automated Data Quality Testing at Scale with SQL & Machine Learning

Curious about how data observability can help your company achieve high-quality data?

Request a Demo
