Data Quality
Bottom line: Even in this day and age, data quality is not given the diligence it deserves, and that's a serious problem.
How to measure data quality
Traditional methods of measuring data quality are often time- and resource-intensive, spanning several variables, from accuracy and completeness to validity and timeliness. But these approaches are often subjective and difficult to communicate to stakeholders who don't handle the data day to day. Fortunately, there's a better way: measuring data downtime.

Data downtime refers to periods of time when data is missing, erroneous, or otherwise inaccurate, and often suggests a broken data pipeline. By measuring data downtime, you can determine the reliability of your data and build the confidence necessary to use it or lose it.

Further Reading: Data Quality — You're Measuring It Wrong

Overall, data downtime is a function of the number of data incidents and how long each one takes to detect and resolve. The data engineer's KPI for data quality:

Data Downtime = Number of data incidents × (Time-to-Detection + Time-to-Resolution)
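To make this KPI concrete, here is a minimal sketch in Python of how a team might roll up data downtime for a reporting period. The Incident class, the hour-based durations, and the sample numbers are all hypothetical; real figures would come from your incident management tooling.

    from dataclasses import dataclass

    @dataclass
    class Incident:
        # Hypothetical incident record; durations are in hours.
        time_to_detection: float   # hours the issue went unnoticed (TTD)
        time_to_resolution: float  # hours to fix it once detected (TTR)

    def data_downtime(incidents: list[Incident]) -> float:
        # Data Downtime = number of incidents x (TTD + TTR).
        # Summing (TTD + TTR) per incident is equivalent to
        # N x (average TTD + average TTR).
        return sum(i.time_to_detection + i.time_to_resolution for i in incidents)

    monthly_incidents = [Incident(4.0, 2.5), Incident(12.0, 6.0), Incident(1.0, 0.5)]
    print(f"Data downtime this month: {data_downtime(monthly_incidents)} hours")  # 26.0 hours

Tracked month over month, a single downtime number like this gives stakeholders who never touch the pipelines a clear read on data reliability.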
As you build out your data quality strategy, ask yourself:
● How do you measure the data quality of the assets your company collects and stores?
● What are the key KPIs or goals you’re going to hold your data quality strategy accountable for meeting?
● Do you have cross-functional involvement from leadership and data users in other parts of the company?
● Who at the company will be held accountable for meeting your strategy’s KPIs and goals?
● What checks and balances do you have to ensure KPIs are measured correctly and goals can be met?
● To make sure that data users across the company are aware of why data quality matters, we suggest
developing a program for data quality champions to carry the torch and shepherd others through data access,
use, and storage best practices.
● Focus on short-term or quick wins to get traction while promoting and executing on the long-term strategy.
● Simply put, with increasingly stringent compliance measures around data access and applications, taking a manual approach to monitoring your organization's data quality is not the answer. Not only is the process tedious and time-consuming, but the tooling available cannot keep up with the speed of innovation across your data stack.
● Instead, we suggest investing in automated tools that can quickly validate, monitor, and alert on data quality issues as they arise (a simple custom rule is sketched below). Add the ability to set custom rules, and these technologies can truly unlock the value of data for your organization.
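As a hedged illustration of the kind of custom rule such automated tooling can evaluate, the sketch below flags a table whose null rate drifts past a team-defined threshold. The customers table, email column, 5% threshold, and in-memory SQLite stand-in for a warehouse are all hypothetical; a production monitor would run against your actual warehouse and ideally learn thresholds from history rather than hard-coding them.

    import sqlite3

    NULL_RATE_THRESHOLD = 0.05  # hypothetical rule: alert if more than 5% of emails are null

    def null_rate(conn: sqlite3.Connection, table: str, column: str) -> float:
        # Share of rows where the column is NULL.
        query = f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
        result = conn.execute(query).fetchone()[0]
        return result or 0.0

    # In-memory stand-in for a warehouse table, so the rule is runnable end to end.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "a@example.com"), (2, None), (3, "c@example.com")])

    rate = null_rate(conn, "customers", "email")
    if rate > NULL_RATE_THRESHOLD:
        print(f"Custom rule violated: customers.email null rate is {rate:.1%}")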
How to measure ROI for your data quality program
To measure the potential ROI of a data quality program, we've found that the following metrics (borrowed from the DevOps world) offer a good start: Time-to-Detection (TTD) and Time-to-Resolution (TTR).

Tools for decreasing TTD:
● Machine learning-powered anomaly detection: Testing your data before it goes into production is P0, but for tracking those unknown unknowns, it's helpful to implement automated anomaly detection and custom rules (sketched below).
● Relevant incident feeds and notifications: Integrating a communication layer (likely an API) between your data platform and PagerDuty, Slack, or any other incident management solutions you use is critical for conducting root cause analysis, setting SLAs/SLIs, and triaging data downtime as it arises.

Tools for decreasing TTR:
● End-to-end lineage: Robust lineage at each stage of the data lifecycle empowers teams to track the flow of their data from A (ingestion) to Z (analytics), incorporating transformations, modeling, and other steps in the process, and it is critical for supplementing often narrow-sighted insights (no pun intended) with statistical RCA approaches. The OpenLineage standard for metadata and lineage collection is a great place to start.
● Data discovery to understand data access patterns: While many data catalogs have a UI-focused workflow, data engineers need the flexibility to interact with their catalogs programmatically, through data discovery.
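As a rough sketch of how these two TTD levers fit together, the code below compares today's row count against recent history and, if the value looks anomalous, posts an alert to a Slack incoming webhook. The row counts, webhook URL, and three-standard-deviation rule are hypothetical stand-ins; a production system would rely on learned thresholds and whatever incident management tooling you already use.

    import json
    import statistics
    import urllib.request

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # hypothetical webhook

    def is_anomalous(history: list[int], today: int, sigmas: float = 3.0) -> bool:
        # Simple stand-in for ML-powered detection: flag values more than
        # `sigmas` standard deviations away from the recent mean.
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        return abs(today - mean) > sigmas * stdev

    def notify_slack(message: str) -> None:
        # Post the alert to an incoming webhook so on-call engineers see it fast.
        payload = json.dumps({"text": message}).encode("utf-8")
        req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    daily_row_counts = [10_120, 9_980, 10_240, 10_060, 10_190]  # hypothetical history
    todays_count = 4_300
    if is_anomalous(daily_row_counts, todays_count):
        notify_slack(f"Possible data downtime: orders table loaded {todays_count} rows today "
                     f"vs. a recent average of {int(statistics.mean(daily_row_counts))}.")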
Data observability automatically monitors key features of your data ecosystem, including data freshness, distribution, volume, schema, and lineage. Without the need for manual threshold setting, data observability answers questions such as: Is the data fresh and up to date? Did the expected volume of data arrive? Are values distributed the way they usually are? Has the schema changed unexpectedly?
With the right approach to data observability, data teams can trace field-level lineage across entire data workflows,
facilitating greater visibility into the health of their data and the insights those pipelines deliver.
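Building on the end-to-end lineage tooling described above, here is a hedged sketch of the kind of run event the OpenLineage specification describes, built as a plain dictionary so the example stays self-contained. The job, dataset, and producer names are hypothetical; in practice you would emit events through the OpenLineage client or your orchestrator's integration rather than by hand.

    import json
    import uuid
    from datetime import datetime, timezone

    # Hypothetical job and dataset names; the shape follows the OpenLineage
    # run-event structure (eventType, run, job, inputs, outputs, producer).
    run_event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "analytics", "name": "orders_daily_rollup"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "marts.daily_orders"}],
        "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
    }

    # In practice this JSON is sent to a lineage backend; printing it here
    # keeps the sketch self-contained.
    print(json.dumps(run_event, indent=2))

Emitting an event like this at each pipeline step is what lets a team walk lineage backward from a broken dashboard to the upstream job that caused it.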
CASE STUDY:
Manchester-based Auto Trader is the largest digital automotive marketplace in the United Kingdom and Ireland. The company sees 235 million advertising views and 50 million cross-platform visitors per month, with thousands of interactions per minute. All of these are data points the Auto Trader team can analyze and leverage to improve efficiency, customer experience, and, ultimately, revenue. Like many data-driven companies, Auto Trader treats data quality across its complex pipelines as a top priority, and with data observability in tow, achieving end-to-end data quality and reliability became a reality as the team scaled their decentralized data platform.
Learn more: Scaling Data Trust: How AutoTrader Migrated to a Decentralized Data Platform
The future of data quality
To unlock the true value of data, we need to go beyond data quality. We need to ensure our data is reliable and trustworthy wherever it lives, and the only way to get there is by creating observability all the way from source to consumption. Data observability, an organization's ability to fully understand the health of the data in its systems, takes the guesswork out of pipeline health by applying the best practices of DevOps to data pipelines. With automation and machine learning, observability helps teams reliably deliver fresh, accurate, and complete data at every step of its complex lifecycle. Data quality may be a significant challenge, but data observability is a powerful solution.
Looking for more insights? Dive into our recommended resources below:
Request a Demo