EBOOK

Getting Data Quality Right: A Guide for CDOs and Data Executives
Introduction
Data quality isn’t only important for the practitioners building machine learning (ML) models and
those deeply involved in ML projects. At the end of the day, it impacts the business — from C-suite
executives to line of business managers to analysts — just as much as it impacts technical users and
stakeholders. Not convinced? In reality:

2 Gartner Press Release - How to Improve Your Data Quality, 14 July 2021,

EBOOK 2 DATAIKU
Data is a critical element in ML success at the organizational level and, therefore, needs to be
valuable (i.e., high quality, labeled, organized) in order to help teams achieve business objectives
and hit their KPIs. Data quality, then, should matter just as much to less technical stakeholders
on the business side. When data scientists have “garbage in, garbage out,” the flawed results and
algorithms will undoubtedly lead to — you guessed it — flawed business decisions, which will
ultimately impact business stakeholders.

Without attributes such as reliability and accuracy in data quality, executives cannot trust the data
or make informed decisions, which can, in turn, drive up operational costs and create havoc for
users downstream. Line of business managers will be on the hook for industry-specific projects
impacted by bad data quality (e.g., our marketing example), and analysts will end up relying
on inaccurate reports and, therefore, may draw faulty conclusions from them.

Traditionally, IT owns data quality, but IT teams typically know far less about business data than
the business stakeholders described above. Therefore, as tempting as it may be for IT to
wholly own data quality, centralization without a larger goal or purpose (one that has the business’s
buy-in) won’t actually generate business value and, ultimately, will result in data quality efforts
falling flat.

So, what is the role of these less technical people in controlling data quality? Most organizations
don’t have a robust repository of high-quality, trusted datasets. And when they do, those datasets
are often siloed or fragmented rather than easily accessible and available for constant reuse.
The problem of data quality, therefore, isn’t always a technological one but an organizational one
that requires synergy across all teams. Whether you are a C-suite executive, data executive, or
line of business manager, you have a role to play.

Key Considerations for Controlling Data Quality

“A lack of quality data is probably the single biggest reason that
organizations fail in their data efforts.”

- Jeff McMillan, Chief Data and Analytics Officer,
Morgan Stanley Wealth Management

Before we dive into how data quality issues manifest, it’s important to distinguish between
ingesting data into the data warehouse and preparing it for consumption within a data
science platform like Dataiku. This ebook focuses on the latter and on how data quality
issues can hinder and, sometimes, fully halt data science initiatives. While addressing these
issues falls to data practitioners (such as data scientists), business stakeholders should be
aware of the most common ones:

1. Unlabeled data
2. Poorly labeled data
3. Inconsistent or disorganized data (e.g., a flood of data sources with no organization
and/or data redundancy)
4. Out-of-date or purely inaccurate data
5. Incomplete or missing values
6. A lack of tools to properly address data quality issues
7. Process bottlenecks

While data quality certainly matters to any industry, it’s critical to understand and illustrate its
applicability across different business domains. Examples include:

In marketing, prospects and customers can become easily annoyed if they receive
the same campaign more than once (e.g., with their name or address spelled slightly
differently). This can often be traced back to duplicate records within the same database
or across a variety of internal and external sources.

In industries that are highly reliant on supply chain logistics (e.g., manufacturing, retail,
and CPG), you may not have reliable location information to automate processes
or, worse, you may send products to the wrong addresses, which can lower customer
satisfaction, loyalty, and advocacy.

Additionally, out-of-date customer information may result in missed opportunities for
upselling and cross-selling products and services.

For banking and financial services, you might have inconsistent data (e.g., from using
error-prone spreadsheets to generate financial reports), varying freshness of data, and muddled
data definitions, which can cause different answers to be given to the same question.

Relatedly, data quality is often key when it comes to meeting compliance requirements
such as GDPR and other privacy regulations. At the drop of a hat, organizations need
to be able to locate an individual’s information — without missing any of the collected
data due to inaccuracies or inconsistencies.
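
For data practitioners, the marketing example above (the same person stored under slightly different spellings) can be illustrated with a small near-duplicate check. This is only a sketch using Python's standard library: the field names ("id", "name", "address"), the sample contacts, and the 0.85 similarity threshold are assumptions made for illustration, not values from this ebook or from any particular tool.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_likely_duplicates(contacts, threshold=0.85):
    """Return (id, id, score) for contact pairs whose name + address look alike.

    `contacts` is a list of dicts with the hypothetical keys "id", "name", "address".
    The threshold is an illustrative assumption and would need tuning on real data.
    """
    pairs = []
    for left, right in combinations(contacts, 2):
        score = similarity(
            f"{left['name']} {left['address']}",
            f"{right['name']} {right['address']}",
        )
        if score >= threshold:
            pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


# Made-up sample records: contacts 1 and 2 differ only by spelling and punctuation.
contacts = [
    {"id": 1, "name": "Jon Smith", "address": "12 Main St."},
    {"id": 2, "name": "John Smith", "address": "12 Main St"},
    {"id": 3, "name": "Maria Lopez", "address": "48 Oak Avenue"},
]
duplicates = find_likely_duplicates(contacts)
```

Comparing every pair is fine for a small list but grows quadratically, so a real deduplication effort would likely group records first (for example, by postal code) before comparing them.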

Best Practices to Control Data Quality
We’ve compiled several best practices to help organizations ensure that they have a scalable
process for controlling data quality.

1. SECURE LEADERSHIP AND STAKEHOLDER BUY-IN.


In order to successfully accomplish this, it’s important to come to those discussions with the initial
answers to questions such as the ones below:

• How do you measure the data quality of the assets your company collects and stores?
• What are the KPIs or business objectives you plan to hold your data quality strategy
accountable for achieving?
• Do you plan to have cross-functional involvement from leadership and data users in other
parts of the company? If so, which ones?
• Who specifically at the company will be responsible for meeting your data quality strategy
KPIs and objectives?
• What checks and balances will you have in place to ensure the KPIs are measured
regularly and accurately so that goals can be met?

You should also be prepared with a list of the existing data quality issues plaguing the organization
and how they are impacting revenue and other business KPIs. According to an article from Deloitte,
“It’s easy to be overwhelmed by the challenge of turning data from an afterthought into a core
facet of business operations. But CDOs can take comfort in knowing that change doesn’t happen
overnight.”4 In fact, Jeff McMillan shared that the data quality efforts at Morgan Stanley Wealth
Management (a Dataiku customer) have taken about five years to implement in a meaningful way
and today make up one of the company’s competitive advantages.

4 https://www2.deloitte.com/us/en/insights/industry/public-sector/chief-data-officer-government-playbook/data-as-an-asset.html

Finally, data quality needs to be supported and promoted at each level of management, including
the C-suite. If executives and other business leaders don’t prioritize good data quality, data
managers and data teams may not either. To make sure that everyone is in the loop, from C-suite
to lines of business to data teams themselves, organizations can establish a data stewardship
program (more on this later) to champion data access, use, and storage best practices. When it
comes to the various lines of business, for example, be sure to explain how data quality impacts
their functions and clearly outline how they can share best data practices and enforce them with
their teams.

2. INVEST IN A DATA QUALITY INFRASTRUCTURE.

“The most important thing we do every day is ensure the accuracy of
the input. If you are not investing in a data quality/data governance
infrastructure, you’re going to fail.”

- Jeff McMillan, Chief Data and Analytics Officer,
Morgan Stanley Wealth Management

Data governance, in its simplest terms, involves the policy and oversight for collecting, storing, and
sharing data. By establishing this infrastructure, business teams will be able to take in any data
quality problem that arises, evaluate it, and determine which action must be taken as well as the
appropriate resources to address it. According to Gartner’s “How to Improve Your Data Quality,” data
and analytics leaders should “Include DQ as an agenda item at D&A governance board meetings.
D&A leaders need to link DQ initiatives to business outcomes, which will help track the investments
in DQ improvement against the business objectives.”5

5 Gartner Press Release - How to Improve Your Data Quality, 14 July 2021,

A sound data quality infrastructure includes improved access to data, as most organizations don’t
have a single curated, high-quality data source, but rather siloed and disparate data sources. The
infrastructure and a data quality/data governance strategy will help bring that together. With
Dataiku, teams can easily compile their data in one place, keep it centralized and accessible, and
ensure they can seamlessly connect to every source.

3. ESTABLISH METRICS AROUND ACCURACY.


Because data quality so heavily impacts the business side of the organization, data quality metrics
need to be linked to the general KPIs for overall business performance. These metrics and
KPIs can be related to key data quality dimensions such as data uniqueness/deduplication, data
completeness, data orderliness, data accuracy, data consistency, and data timeliness.

Data Quality Attributes

ATTRIBUTE                 WHAT DOES IT MEAN?
Accuracy                  The information within your data is correct and corresponds to the real-world scenario at hand.
Auditability              The data is accessible and traceable.
Completeness              The available elements of the data are in one place (e.g., database, data platform).
Consistency               The data does not contain contradictions and matches across multiple instances.
Orderliness               The data is listed in the required format and structure.
Timeliness                The data is up to date and corresponds to reality within a reasonable period of time.
Uniqueness/Deduplication  There are no data records that contain specific details that appear more than once in the database/data platform.
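
To make a few of these attributes concrete, here is a minimal sketch of how a data team might turn completeness, uniqueness, and timeliness into scores between 0 and 1. The function names, the sample customer records, and the 365-day freshness window are all illustrative assumptions. The ebook does not prescribe formulas, and the completeness score here uses a common "filled-in fields" reading rather than the exact wording in the table.

```python
from datetime import date


def completeness(records, required_fields):
    """Share of required cells that are filled in (neither None nor an empty string)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def uniqueness(records, key_fields):
    """Number of distinct keys divided by number of records (1.0 means no duplicates).

    A missing key value (None) is treated as a key of its own.
    """
    if not records:
        return 1.0
    keys = {tuple(record.get(f) for f in key_fields) for record in records}
    return len(keys) / len(records)


def timeliness(records, date_field, as_of, max_age_days):
    """Share of records updated within max_age_days of the as_of date."""
    if not records:
        return 1.0
    fresh = sum(
        1
        for record in records
        if record.get(date_field) is not None
        and (as_of - record[date_field]).days <= max_age_days
    )
    return fresh / len(records)


# Made-up sample data: one repeated email, one empty city, one missing email,
# and one record that has not been updated in over a year.
customers = [
    {"email": "a@example.com", "city": "Paris", "updated": date(2023, 5, 1)},
    {"email": "a@example.com", "city": "", "updated": date(2021, 1, 15)},
    {"email": "b@example.com", "city": "Lyon", "updated": date(2023, 4, 20)},
    {"email": None, "city": "Nice", "updated": date(2023, 3, 2)},
]

report = {
    "completeness": completeness(customers, ["email", "city"]),
    "uniqueness": uniqueness(customers, ["email"]),
    "timeliness": timeliness(customers, "updated", date(2023, 6, 1), 365),
}
```

Scores like these are easy to track over time and to report alongside business KPIs, although attributes such as accuracy usually require comparison against a trusted reference source and cannot be computed from the data alone.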

4. ALIGN ON A CLEAR DEFINITION OF WHAT “QUALITY” MEANS TO YOUR ORGANIZATION.

When managing data quality activities as part of a data governance framework, it’s imperative
that this framework not only set the data policies, standards, and roles needed, but also provide a
business glossary defining what the organization considers good data quality. This glossary should
serve as the basis for metadata (data about data) management, which supplies the common data
definitions that current and future business applications link to.

When establishing what is “best fit” for the organization, data and analytics leaders must
align expectations with line of business managers and executives. For example, different lines of
business using the same data (e.g., marketing and finance both use customer data) may have different
standards and, therefore, different requirements and expectations for data quality initiatives.
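
One way to picture this is a per-team set of minimum scores checked against the same measured data. The sketch below is purely illustrative: the team names come from the example above, but the dimensions each team cares about and every threshold and measured value are hypothetical.

```python
# Hypothetical minimum scores (0..1) that each team requires of the shared customer data.
REQUIREMENTS = {
    "marketing": {"completeness": 0.90, "timeliness": 0.80},
    "finance": {"completeness": 0.99, "accuracy": 0.99},
}


def unmet_requirements(measured, requirements):
    """Return, per team, the dimensions where the measured score is below that team's minimum.

    A dimension that was not measured at all counts as unmet.
    """
    gaps = {}
    for team, minimums in requirements.items():
        failing = sorted(
            dimension
            for dimension, minimum in minimums.items()
            if measured.get(dimension, 0.0) < minimum
        )
        if failing:
            gaps[team] = failing
    return gaps


# Made-up measurements for one shared dataset.
measured = {"completeness": 0.95, "timeliness": 0.85, "accuracy": 0.97}
gaps = unmet_requirements(measured, REQUIREMENTS)
```

With these made-up numbers, the same dataset is good enough for marketing but falls short of what finance needs, which is why expectations have to be aligned per line of business.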

Dataiku customer Bankers’ Bank uses Dataiku to ensure data quality
across an array of financial analytics and, as a result, the team has
been able to reduce the time to prepare analyses and deploy insights
by 87%. They do transactional reporting on the different volumes of
transactions they process, which was previously very manual.

Data was pulled in from various sources (including a CRM), and
validation included extensive backtracking to pinpoint where any
errors occurred without visibility into the entire data flow, which was,
at times, nearly impossible. The team has been able to reduce the time
associated with pulling that data while simultaneously improving
data quality and reliability.

5. ASSIGN PEOPLE WHO ARE ACCOUNTABLE FOR DATA ACCURACY AND IN CHARGE OF
MONITORING DATA QUALITY ON A DAILY BASIS.

Teams need to decide who will be in charge of what and assign the role of setting clear definitions,
metrics, categorization rules, and goals to specific individuals. For example, who will evaluate data
quality and will the evaluation be based on completeness, validity, timeliness, etc.? The first step to
accuracy and consistency is to clearly define these roles and responsibilities.

Many organizations assign these tasks to a data steward, someone responsible for the
management and oversight of the organization’s data assets who provides business users with
high-quality data that is not only easily accessible in a consistent manner, but also compliant with
policy and/or regulatory obligations. The responsibility is usually shared among IT, line of business
data owners, and the central data office, if it exists.

The next step is to put in place additional efforts to systematize the use of data,
starting with data centralization. With a centralized data repository such as Dataiku, teams (that
may be distributed or remote) can work more efficiently from one clear data resource point,
which increases accessibility (while also managing consistency and accuracy).

6. ENSURE A PROCESS FOR ISSUES MANAGEMENT CONTROL.


For each data quality issue found, there should be a clear process for reporting it to the appropriate
team(s). For example, the assigned data stewards can maintain a data quality issue log where
people can submit entries for each data quality issue, its impact, and its eventual resolution.
Aim to implement issue management solutions that pinpoint issues as close to the data
onboarding point as possible, rather than relying on downstream data
cleansing (potentially after that data has been used in other pipelines, workflows, and use cases).
In the featured story below, check out how data quality can benefit from cross-team collaboration
and reuse.
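
As a rough illustration of the data quality issue log described above, the sketch below models one log entry (the issue, its impact, and its eventual resolution) and a log that stewards could use to submit, resolve, and list open issues. The class names, fields, and sample entries are assumptions for illustration. In practice this could just as well be a shared spreadsheet or a ticketing system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DataQualityIssue:
    """One log entry: what was found, its impact, and (eventually) its resolution."""
    dataset: str
    description: str
    impact: str
    reported_by: str
    reported_on: date
    resolution: Optional[str] = None  # stays None until the issue is resolved

    @property
    def is_open(self) -> bool:
        return self.resolution is None


@dataclass
class IssueLog:
    """A steward-maintained list of issues with submit/resolve/report helpers."""
    issues: List[DataQualityIssue] = field(default_factory=list)

    def submit(self, issue: DataQualityIssue) -> int:
        """Add an issue and return its position in the log as a simple identifier."""
        self.issues.append(issue)
        return len(self.issues) - 1

    def resolve(self, issue_id: int, resolution: str) -> None:
        """Record how the issue with the given identifier was fixed."""
        self.issues[issue_id].resolution = resolution

    def open_issues(self) -> List[DataQualityIssue]:
        """Return the issues that have no recorded resolution yet."""
        return [issue for issue in self.issues if issue.is_open]


# Made-up entries for illustration.
log = IssueLog()
first = log.submit(DataQualityIssue(
    dataset="crm_contacts",
    description="Duplicate customer records with differently spelled names",
    impact="Customers receive the same campaign twice",
    reported_by="marketing analyst",
    reported_on=date(2023, 6, 1),
))
second = log.submit(DataQualityIssue(
    dataset="shipping_addresses",
    description="Postal codes missing for some recent orders",
    impact="Deliveries sent to the wrong address",
    reported_by="logistics manager",
    reported_on=date(2023, 6, 3),
))
log.resolve(first, "Added a deduplication step at data onboarding")
```

Keeping the impact next to each issue makes it easier to prioritize fixes and to show business stakeholders what poor data quality actually costs.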

Conclusion

Data quality management is not a turnkey initiative that is handled all at once. Rather, it’s an
ongoing process that needs to involve the business from the beginning in order to succeed.
From assessing current data quality to selecting metrics and KPIs to establishing rules and
processes for implementation and measurement, data quality issues won’t be solved overnight.

However, the gravity of data quality should not be understated, as data quality issues can compound
and, ultimately, undermine data and analytics efforts. We hope this ebook helped outline how,
while hands-on data quality efforts are often tackled directly by data practitioners, transparency
and involvement from business stakeholders (from the C-suite to line of business managers
and beyond) are imperative for tangible results.

Everyday AI,
Extraordinary People

[Figure: Dataiku platform overview, “Elastic Architecture Built for the Cloud.” Capabilities shown: Machine Learning, Visualization, Data Preparation, DataOps, Governance & MLOps, Applications. The Data Preparation panel shows a sample table with Name, Sex, and Age columns.]

Dataiku is the platform for Everyday AI, enabling data experts and domain experts to work
together to build AI into their daily operations. Together, they design, develop and deploy
new AI capabilities, at all scales and in all industries.

©2023 dataiku | dataiku.com