Data Quality
EBOOK 2 DATAIKU
Data is a critical element in ML success at the organizational level and, therefore, needs to be
valuable (i.e., high quality, labeled, organized) in order to help teams achieve business objectives
and hit their KPIs. Data quality, then, should matter just as much to less technical stakeholders
on the business side. When data scientists are fed “garbage in,” they produce “garbage out”: flawed
results and algorithms that will undoubtedly lead to — you guessed it — flawed business decisions,
which will ultimately impact business stakeholders.
Without attributes such as reliability and accuracy, executives cannot trust the data or make
informed decisions, which can, in turn, drive up operational costs and create havoc for users
downstream. Line of business managers will be on the hook for industry-specific projects
impacted by bad data quality (i.e., our marketing example above), and analysts will end up relying
on inaccurate reports and may, therefore, draw faulty conclusions from those findings.
Traditionally, IT owns data quality, but IT teams rarely have deep knowledge of business data;
only the aforementioned business stakeholders do. Therefore, as tempting as it may be for IT to
wholly own data quality, centralization without a larger goal or purpose (one that has the business’s
buy-in) won’t actually generate business value and will ultimately result in data quality efforts
falling flat.
So, what is the role of these less technical people in controlling data quality? When it comes
down to it, most organizations don’t have a robust repository of high-quality, trusted datasets.
And when they do, those datasets are often not accessible in any simple way or available for
constant reuse; instead, they are commonly siloed or fragmented. The problem of data quality,
therefore, isn’t always a technological one but an organizational one that requires synergy
across all teams. Whether you are a C-suite executive, data executive, or line of business
manager, you have a role to play.
Key Considerations for Controlling
Data Quality
Now, before we dive into how data quality issues manifest, it’s important to distinguish between
ingestion into the data warehouse and preparation for consumption within a data science
platform like Dataiku. The scope of this ebook focuses on the latter and on how data quality
issues can hinder and, sometimes, fully halt data science initiatives. While addressing data quality
issues is for data practitioners (such as data scientists) to handle, it’s important for business
stakeholders to be aware of them:
1. Unlabeled data
2. Poorly labeled data
3. Inconsistent or disorganized data (i.e., inundation of data sources with no organization
and/or data redundancy)
4. Out-of-date or simply inaccurate data
5. Incomplete or missing values
6. A lack of tools to properly address data quality issues
7. Process bottlenecks
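Several of these issues (missing values, data redundancy, inconsistent values) lend themselves to simple automated checks. Below is a minimal sketch using pandas on a made-up customer table; all column names and rules are illustrative assumptions, not a prescription:

```python
# Minimal data quality checks on a toy customer table (illustrative only).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email": ["a@x.com", "b@x.com", "b@x.com", None, "d@x.com"],
    "country": ["US", "us", "us", "FR", "DE"],
})

issues = {
    # Incomplete or missing values (issue 5)
    "missing_email": int(customers["email"].isna().sum()),
    # Data redundancy: exact duplicate rows (issue 3)
    "duplicate_rows": int(customers.duplicated().sum()),
    # Inconsistent values: mixed-case country codes (issue 3)
    "inconsistent_country": int(
        (customers["country"] != customers["country"].str.upper()).sum()
    ),
}
print(issues)
```

In practice, checks like these would run against every incoming dataset, and their counts would feed the metrics and monitoring practices discussed in the rest of this ebook.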
While data quality certainly matters to any industry, it’s critical to understand and illustrate its
applicability across different business domains. Examples include:
In marketing, prospects and customers can become easily annoyed if they receive
the same campaign more than once (i.e., with their name or address spelled slightly
differently). This could link back to duplicates within the same database and across a
variety of internal and external sources.
In industries that are highly reliant on supply chain logistics (i.e., manufacturing, retail,
and CPG), you may not have reliable location information to automate processes
or, worse, may send products to the wrong addresses, which can lower customer
satisfaction, loyalty, and advocacy.
For banking and financial services, you might have inconsistent data (i.e., using error-
prone spreadsheets to generate financial reports), varying freshness of data, and muddled
data definitions which can cause different answers to be given to the same question.
Relatedly, data quality is often key when it comes to meeting compliance requirements
such as GDPR and other privacy regulations. At the drop of a hat, organizations need
to be able to locate an individual’s information — without missing any of the collected
data due to inaccuracies or inconsistencies.
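To make the marketing example above concrete: duplicates with slightly different spellings often survive naive deduplication, but normalizing fields first catches many of them. Here is a minimal sketch in plain Python, with made-up records and a deliberately simple normalization rule (a real pipeline would use fuzzier matching):

```python
# Collapse near-duplicate prospect records by normalizing name and address
# before deduplication (toy data; the normalization rule is an assumption).
def normalize(record):
    # Lowercase and keep only alphanumerics, so "J. Smith" matches "j smith".
    return tuple(
        "".join(ch for ch in record[field].lower() if ch.isalnum())
        for field in ("name", "address")
    )

prospects = [
    {"name": "J. Smith", "address": "12 Main St."},
    {"name": "j smith", "address": "12 main st"},
    {"name": "A. Jones", "address": "9 Oak Ave."},
]

seen, unique = set(), []
for rec in prospects:
    key = normalize(rec)
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(unique))  # the two "J. Smith" variants collapse into one record
```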
Best Practices to Control Data Quality
We’ve compiled several best practices to help organizations ensure that they have a scalable
process for controlling data quality.
• How do you measure the data quality of the assets your company collects and stores?
• What are the key KPIs or business objectives you plan to hold your data quality strategy
accountable for achieving?
• Do you plan to have cross-functional involvement from leadership and data users in other
parts of the company? If so, which ones?
• Who specifically at the company will be responsible for meeting your data quality strategy
KPIs and objectives?
• What checks and balances will you have in place to ensure the KPIs are measured
regularly and accurately so that goals can be met?
You should also be prepared with a list of the existing data quality issues plaguing the organization
and how they are impacting revenue and other business KPIs. According to an article from Deloitte,
“It’s easy to be overwhelmed by the challenge of turning data from an afterthought into a core
facet of business operations. But CDOs can take comfort in knowing that change doesn’t happen
overnight.”4 In fact, Jeff McMillan shared that the data quality efforts at Morgan Stanley Wealth
Management (a Dataiku customer) have taken about five years to implement in a meaningful way
and today make up one of the company’s competitive advantages.
4 https://www2.deloitte.com/us/en/insights/industry/public-sector/chief-data-officer-government-playbook/data-as-an-asset.html
Finally, data quality needs to be supported and promoted at each level of management, including
the C-suite. If executives and other business leaders don’t prioritize good data quality, data
managers and data teams may not either. To make sure that everyone is in the loop, from C-suite
to lines of business to data teams themselves, organizations can establish a data stewardship
program (more on this later) to champion data access, use, and storage best practices. When it
comes to the various lines of business, for example, be sure to explain how data quality impacts
their functions and clearly outline how they can share best data practices and enforce them with
their teams.
Data governance, in its simplest terms, involves the policy and oversight for collecting, storing, and
sharing data. By establishing this infrastructure, business teams will be able to take in any data
quality problem that arises, evaluate it, and determine which action must be taken as well as the
appropriate resources to address it. According to Gartner’s “How to Improve Your Data Quality,” data
and analytics leaders should “Include DQ as an agenda item at D&A governance board meetings.
D&A leaders need to link DQ initiatives to business outcomes, which will help track the investments
in DQ improvement against the business objectives.”5
5 Gartner Press Release, “How to Improve Your Data Quality,” 14 July 2021.
A sound data quality infrastructure includes improved access to data, as most organizations don’t
have a single curated, high-quality data source, but rather siloed and disparate data sources. The
infrastructure and a data quality/data governance strategy will help bring that together. With
Dataiku, teams can easily compile their data in one place, keep it centralized and accessible, and
ensure they can seamlessly connect to every source.
Accuracy: The information within your data is correct and corresponds to the real-world scenario at hand.
Completeness: The available elements of the data are in one place (i.e., database, data platform).
Consistency: The data does not contain contradictions and matches across multiple instances.
Uniqueness/Deduplication: There are no data records containing specific details that appear more than once in the database/data platform.
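Dimensions like these can be turned into simple, trackable scores. Below is a minimal sketch in plain Python on a toy table (field names and scoring rules are illustrative assumptions); accuracy is omitted because measuring it requires trusted reference data to compare against:

```python
# Score completeness, uniqueness, and consistency on a toy table (0.0 to 1.0).
records = [
    {"id": 1, "country": "US", "signup": "2023-01-04"},
    {"id": 2, "country": "FR", "signup": None},
    {"id": 2, "country": "fr", "signup": "2023-02-11"},
]

def completeness(records, field):
    # Share of rows where the field is populated.
    return sum(r[field] is not None for r in records) / len(records)

def uniqueness(records, key):
    # Share of rows that carry a distinct key value.
    return len({r[key] for r in records}) / len(records)

def consistency(records, field):
    # Here "consistent" simply means the field uses one canonical casing.
    return sum(r[field] == r[field].upper() for r in records) / len(records)

scores = {
    "completeness(signup)": completeness(records, "signup"),
    "uniqueness(id)": uniqueness(records, "id"),
    "consistency(country)": consistency(records, "country"),
}
print(scores)
```

Scores like these become meaningful once they are tracked over time and tied to thresholds that the business has agreed on, which is exactly what the practices below help establish.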
4. ALIGN ON A CLEAR DEFINITION OF WHAT “QUALITY” MEANS TO YOUR ORGANIZATION.
When managing data quality activities as part of a data governance framework, it’s imperative
that this framework not only sets the data policies, standards, and roles needed, but also provides
a business glossary for what constitutes good data quality for the organization. This glossary
should serve as the basis for metadata (data about data) management, which establishes common
data definitions that link to current and future business applications.
When establishing what is “best fit” for the organization, it’s critical for data and analytics leaders
to align expectations with line of business managers and executives. For example, different lines of
business using the same data (i.e., marketing and finance both use customer data) may have different
standards and, therefore, different requirements and expectations for data quality initiatives.
5. ASSIGN PEOPLE WHO ARE ACCOUNTABLE FOR DATA ACCURACY AND IN CHARGE OF
MONITORING DATA QUALITY ON A DAILY BASIS.
Teams need to decide who will be in charge of what and assign the role of setting clear definitions,
metrics, categorization rules, and goals to specific individuals. For example, who will evaluate data
quality and will the evaluation be based on completeness, validity, timeliness, etc.? The first step to
accuracy and consistency is to clearly define these roles and responsibilities.
Many organizations impart these tasks to a data steward: someone responsible for the
management and oversight of the organization’s data assets, whose goal is to provide business
users with high-quality data that is not only easily and consistently accessible, but also compliant
with policy and/or regulatory obligations. The responsibility is usually a joint effort between IT,
line of business data owners, and the central data office, if it exists.
The next step revolves around putting in place additional efforts to systematize the use of data,
starting with data centralization. With a centralized data repository such as Dataiku, teams (which
may be distributed or remote) can work more efficiently: a single, clear data resource point
increases accessibility while also supporting consistency and accuracy.
Conclusion
Data quality management is not a turnkey initiative that is handled all at once. Rather, it’s an
ongoing process that needs to involve the business from the beginning in order to ensure success
— from assessing current data quality to selecting metrics and KPIs to establishing rules and
processes for implementation and measurement, data quality issues won’t be solved overnight.
However, the importance of data quality should not be understated, as data quality issues can
compound and, ultimately, undermine data and analytics efforts. We hope this ebook has helped
outline how, while hands-on data quality efforts are often tackled directly by data practitioners,
transparency and involvement from business stakeholders (from the C-suite to line of business
managers, executives, and beyond) are imperative for tangible results.
Everyday AI,
Extraordinary People
Dataiku is the platform for Everyday AI, enabling data experts and domain experts to work
together to build AI into their daily operations. Together, they design, develop and deploy
new AI capabilities, at all scales and in all industries.