Data Quality Management
Introduction
Data quality is a fundamental concern that every data governance program must address. At its core, it means ensuring that data meets business needs: no process or modernization effort can succeed without reliable, high-quality data. This applies to business applications, digital transformation initiatives, and the deployment of AI/ML technologies.
Gartner describes data quality (DQ) solutions as “the set of processes and technologies
for identifying, understanding, preventing, escalating and correcting issues in data that
supports effective decision making and governance across all business processes.”
An improved definition might describe data quality as ensuring that data is suitable for
business purposes, facilitating operational activities, analytics, and decision-making in
a way that enhances trust and efficiency.
Breaking down Gartner's definition, data quality consists of the following components:
Identifying: Conducting a series of checks or applying rules to detect data that fails to meet business requirements.
Understanding: Determining the reasons why data is unreliable so that those causes can be addressed.
Preventing: Implementing controls to stop poor-quality data from being created during manual data entry, business processes, data pipelines, or other data manipulation activities.
Escalating: Routing identified data issues to the people or teams accountable for resolving them.
Correcting: Fixing flawed data so that it meets business requirements.
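As a rough illustration of the identifying step, the sketch below applies a few field-level rules to sample records and reports which checks each record fails. The fields, rules, and thresholds are illustrative assumptions rather than prescribed checks; a real implementation would typically run inside a data quality tool or pipeline.

# A minimal sketch of the "identifying" step: applying simple rules to flag
# records that fail business requirements. Field names and rules are
# hypothetical examples, not taken from any specific system.

records = [
    {"customer_id": "C001", "email": "jane@example.com", "credit_limit": 5000},
    {"customer_id": "",     "email": "not-an-email",     "credit_limit": -100},
]

rules = {
    "customer_id must not be blank": lambda r: bool(r["customer_id"].strip()),
    "email must contain '@'":        lambda r: "@" in r["email"],
    "credit_limit must be >= 0":     lambda r: r["credit_limit"] >= 0,
}

for record in records:
    failures = [name for name, check in rules.items() if not check(record)]
    if failures:
        print(record["customer_id"] or "<missing id>", "failed:", failures)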
It is crucial to understand that data quality isn't solely about analytics; it's also about maintaining accurate data in operational systems. Poor data in your data platform can lead to bad business decisions, but poor data quality in operational systems can have dire consequences. Examples of operational disruptions caused by bad data include:
Halting product manufacturing due to raw material shortages
Preventing a logistics provider from delivering products due to payment problems
Sending products to the wrong company, or delivering the wrong product to the intended customer
Crediting the wrong account for a customer refund
Offering the incorrect product to a potential customer
Granting access to online content for which a user is not licensed
DAMA (Data Management Association) references a widely used set of data quality dimensions known as the Strong-Wang framework, originally proposed by Wang and Strong, designed to address a broad range of situations. Its contextual data quality dimensions, for example, include relevancy, value-added, timeliness, completeness, and appropriate amount of data.
The good news is that thinking about data quality dimensions has evolved significantly over time. DAMA Netherlands recently released a research paper detailing the extensive range of data quality dimensions a company could adopt. In practice, each organization develops its own dimensions to suit its specific needs; there is no single, universal approach.
Second, analyzing data quality rules and aggregating them by data quality dimension lets us count the issues within each dimension, by field, table, functional area, and across the data overall. This analysis provides a comprehensive view of the current state of data quality. Tracking these metrics over time also reveals trends of improvement or decline, giving senior leadership valuable visualizations that showcase progress in enhancing data quality.
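As a simple sketch of this aggregation, the snippet below rolls up hypothetical rule results, each tagged with a dimension, into a pass rate per dimension. The rule names, dimensions, and counts are assumptions for illustration only.

# A minimal sketch of aggregating rule results by data quality dimension.
# Each rule result carries a dimension tag (Completeness, Validity, ...);
# the rules, dimensions, and counts are illustrative assumptions.

from collections import defaultdict

rule_results = [
    {"rule": "customer_id not null", "dimension": "Completeness", "failed": 12, "checked": 1000},
    {"rule": "email format valid",   "dimension": "Validity",     "failed": 45, "checked": 1000},
    {"rule": "order_date <= today",  "dimension": "Timeliness",   "failed": 3,  "checked": 1000},
]

by_dimension = defaultdict(lambda: {"failed": 0, "checked": 0})
for result in rule_results:
    agg = by_dimension[result["dimension"]]
    agg["failed"] += result["failed"]
    agg["checked"] += result["checked"]

for dimension, agg in by_dimension.items():
    pass_rate = 100 * (1 - agg["failed"] / agg["checked"])
    print(f"{dimension}: {pass_rate:.1f}% passing")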
Lastly, summarizing data quality by dimensions helps demonstrate to end users why
they should trust the data or how it can address business needs.
Turning to implementation, the first requirement is a solid foundation of data quality tools and processes, including:
Utilizing tools such as Talend or SAS, along with techniques for addressing data quality in front-end applications, such as data quality firewalls and front-end edits
Implementing data pipelines, data correction, improvement, and exception handling within standard operating procedures
Creating standard data quality dashboards and reports, including specifications for new data products
A solid foundation of tools and processes is essential for data quality initiatives to
succeed. Without it, these initiatives will struggle to progress.
Next, core policies at the field level must be established to operationalize data quality. Common policy standards cover data definitions, formats, ranges, and lifecycle management.
Third, define data quality rules (data policies) and document them in a policy
repository. It's crucial to provide a data quality definition that is easily understandable
for any business user. Additionally, specify the technical requirements for these rules
separately so they can be implemented using either a code-heavy approach or the
available data quality tools.
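A policy repository entry might pair the business-friendly definition with a separate technical specification along the lines of the sketch below. The structure, field names, and SQL are illustrative assumptions, not a standard schema.

# A minimal sketch of one entry in a policy repository: a plain-language
# definition for business users alongside a separate technical specification
# that a tool or pipeline could implement. The schema is hypothetical.

data_quality_rule = {
    "rule_id": "DQ-0042",
    "name": "Customer email must be valid",
    "business_definition": (
        "Every active customer record must have an email address "
        "we can actually use to contact the customer."
    ),
    "dimension": "Validity",
    "technical_specification": {
        "table": "crm.customers",
        "field": "email",
        "condition": "email is not null and matches a basic address pattern",
        "check_sql": "SELECT COUNT(*) FROM crm.customers "
                     "WHERE email IS NULL OR email NOT LIKE '%_@_%.__%'",
        "severity": "high",
    },
}

print(data_quality_rule["business_definition"])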
Fourth, develop a set of data quality reports using your preferred BI tool to detail data
quality metrics. Creating a data quality dashboard that offers a comprehensive
overview of data quality across the organization is also beneficial.
While ensuring data is fit for business purposes is a significant aspect of data quality, it
is only one component. To enhance your data governance program, consider
leveraging the following capabilities commonly found in top-tier data quality tools:
Data profiling
Deduplication
Data quality checks within data pipelines
Exception processing
Data quality firewalls
Monitoring data decay and data drift
Dashboards and reporting
Profiling
Profiling involves using data quality tools to automatically run numerous queries and
generate a detailed report about the data set being examined. This report typically
includes inferred and defined data types, minimum and maximum values, most
common value, average, number of nulls, low and high values, duplicates, patterns, and
sample data. Inferred data types and patterns are particularly valuable insights from
profiles. Modern data observability tools may replace traditional profiling with AI-driven scans; these serve a similar purpose and can be viewed as the next generation of profiling techniques.
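For a rough sense of what profiling computes, the pandas-based sketch below derives a few of the statistics listed above for a tiny sample data set; it is a simplification of what dedicated profiling tools produce, and the sample data is an illustrative assumption.

# A minimal profiling sketch: counts, nulls, distinct values, duplicates,
# min/max, most common value, and inferred type per column.

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C002", None],
    "credit_limit": [5000, 12000, 12000, 700],
})

profile = {}
for column in df.columns:
    series = df[column]
    profile[column] = {
        "inferred_type": str(series.dtype),
        "nulls": int(series.isna().sum()),
        "distinct": int(series.nunique(dropna=True)),
        "duplicates": int(series.duplicated(keep=False).sum()),
        "min": series.dropna().min(),
        "max": series.dropna().max(),
        "most_common": series.mode(dropna=True).iloc[0] if not series.dropna().empty else None,
    }

print(pd.DataFrame(profile).T)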
Deduplication
Deduplication involves standardizing data, applying algorithms to identify potential
duplicates, generating cross-references, and incorporating business user input to
confirm duplicate groupings. The process aims to eliminate duplicate records, ensure
accurate householding, and enhance data integrity.
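The sketch below illustrates the first part of that flow in simplified form: names are standardized, compared with a similarity score, and pairs above a threshold are surfaced for business users to confirm. The standardization rules, threshold, and sample records are assumptions for illustration; production matching engines use far richer logic.

# A minimal deduplication sketch: standardize names, then flag record pairs
# whose standardized names are highly similar as candidate duplicates.

from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Corp.",       "city": "Berlin"},
    {"id": 2, "name": "ACME Corporation", "city": "Berlin"},
    {"id": 3, "name": "Globex GmbH",      "city": "Munich"},
]

def standardize(name: str) -> str:
    # Uppercase, drop punctuation, normalize a common legal-form suffix.
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return cleaned.replace("CORPORATION", "CORP").strip()

candidates = []
for i, a in enumerate(records):
    for b in records[i + 1:]:
        score = SequenceMatcher(None, standardize(a["name"]), standardize(b["name"])).ratio()
        if score > 0.85:  # threshold chosen for illustration only
            candidates.append((a["id"], b["id"], round(score, 2)))

print("Candidate duplicate pairs for review:", candidates)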
Exception Processing
This technique, used within ELT, ETL, or code-based solutions, involves executing data quality checks integrated with a data quality tool. Any data that fails the specified rules and checks is diverted rather than loaded. The exception data is then rectified, reprocessed, and subsequently loaded, ensuring a robust data processing workflow. This approach helps identify and correct data issues early, maintaining high data quality standards.
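As a minimal sketch of this pattern, the snippet below checks incoming records, passes the clean ones on for loading, and routes the failures to an exception list with the reasons attached so they can be corrected and reprocessed. The checks and sample records are illustrative assumptions.

# A minimal exception-processing sketch: clean records flow on to loading,
# failing records are diverted to an exception list for correction.

incoming = [
    {"order_id": "A-100", "quantity": 5,  "currency": "EUR"},
    {"order_id": "A-101", "quantity": -2, "currency": "EUR"},
    {"order_id": "",      "quantity": 1,  "currency": "XXX"},
]

def check(record):
    """Return the reasons a record fails; an empty list means it passes."""
    reasons = []
    if not record["order_id"]:
        reasons.append("missing order_id")
    if record["quantity"] <= 0:
        reasons.append("non-positive quantity")
    if record["currency"] not in {"EUR", "USD", "GBP"}:
        reasons.append("unknown currency")
    return reasons

clean, exceptions = [], []
for record in incoming:
    reasons = check(record)
    (exceptions if reasons else clean).append({**record, "reasons": reasons})

print("loaded:", [r["order_id"] for r in clean])
print("routed to exceptions:", [(r["order_id"], r["reasons"]) for r in exceptions])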
Together, these capabilities tackle common challenges, help ensure the acquisition of high-quality data, foster a shared understanding of data quality concepts, and keep data aligned with business objectives. It is also crucial to establish a framework that embeds data quality into the data lifecycle. One such framework is POSMAD.
POSMAD, introduced by Danette McGilvray in her book "Executing Data Quality Projects" (widely recognized as a seminal work in the field of data quality), stands for:
(P) Plan: Identify objectives, plan data architecture, and establish standards and
definitions.
(O) Obtain: Document data acquisition methods and procedures.
(S) Store and Share: Define data storage locations, storage methods, and data
accessibility protocols.
(M) Maintain: Outline data maintenance procedures, including cleansing, matching,
merging, deduplication, and enhancement.
(A) Apply: Determine data usage methods and tools for accessing data.
(D) Dispose: Address data retirement, archiving, movement, or deletion as part of the
data lifecycle.
[Figure: The POSMAD information lifecycle (Plan, Obtain, Store and Share, Maintain, Apply, Dispose)]
Correction of Data
Another crucial aspect is data correction, involving corrective measures on data
flagged with issues. When utilizing data quality tools, it's essential not only to detect
data anomalies but also to pinpoint the data requiring correction, with precise
instructions on remedial actions. Simply identifying outliers isn't sufficient; we must
rectify data through various means, including:
Logging change requests and tasking a business user with manual data
rectification (though this is the least preferable option).
Incorporating rules into data pipelines to rectify data (a highly efficient approach; see the sketch after this list).
Implementing exception processes wherein erroneous data is diverted from regular data flows to exception files. Remedial actions are then automated to rectify the data before reintegrating it into normal data flows.
Integrating front-end edits to prevent the entry of flawed data; these edits may
include validation checks, warnings, or other assistive features. Data is only saved
once all edits are successfully passed.
Establishing a data quality firewall, which represents a more advanced solution than
edits. This firewall verifies data against various web services and function modules,
assisting end-users in preventing duplicate records and ensuring higher-quality
data.
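As a sketch of correction rules embedded in a pipeline, the snippet below applies a short chain of rules (trimming, code standardization, defaulting) to a record before it is loaded or reintegrated. The rules and field names are illustrative assumptions rather than a prescribed rule set.

# A minimal sketch of pipeline correction rules: each rule takes a record
# and returns a corrected copy; rules are applied in sequence before loading.

def trim_strings(record):
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def standardize_country(record):
    mapping = {"GERMANY": "DE", "DEUTSCHLAND": "DE", "U.S.A.": "US"}
    record["country"] = mapping.get(record["country"].upper(), record["country"].upper())
    return record

def default_missing_segment(record):
    record["segment"] = record["segment"] or "UNASSIGNED"
    return record

correction_rules = [trim_strings, standardize_country, default_missing_segment]

record = {"customer_id": " C001 ", "country": "Germany", "segment": ""}
for rule in correction_rules:
    record = rule(record)

print(record)  # {'customer_id': 'C001', 'country': 'DE', 'segment': 'UNASSIGNED'}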
Why Technoforte?
Unmatched Data Quality: Meticulously curate and maintain your data
Contact us today to learn how we can enhance your data quality and drive your
business to new heights.