Data Quality and Its Parameters
Data are facts and statistics collected together for reference or analysis; they are the
values of qualitative or quantitative variables belonging to a set of subjects. Data and
information are often used interchangeably; however, the extent to which a set of
data is informative to someone depends on the extent to which it is unexpected by
that person.
Data quality -
Poor-quality data is often pegged as the source of inaccurate reporting and ill-
conceived strategies in a variety of companies, and some have attempted to quantify
the damage done. Economic damage due to data quality problems can range from
added miscellaneous expenses when packages are shipped to wrong addresses, all
the way to steep regulatory compliance fines for improper financial reporting.
An oft-cited estimate originating from IBM suggests the yearly cost of data quality
issues in the U.S. during 2016 alone was about $3.1 trillion. Lack of trust by business
managers in data quality is commonly cited among chief impediments to decision-
making.
The problem of poor data quality was particularly common in the early days of
corporate computing, when most data was entered manually. Even as more
automation took hold, data quality issues rose in prominence. For a number of years,
the image of deficient data quality was represented in stories of meetings at which
department heads sorted through differing spreadsheet numbers that ostensibly
described the same activity.
As a first step toward data quality, organizations typically perform data asset
inventories in which the relative value, uniqueness and validity of data can
undergo baseline studies. Established baseline ratings for known good data sets are
then used for comparison against data in the organization going forward.
Methodologies for such data quality projects include the Data Quality Assessment
Framework (DQAF), which was created by the International Monetary Fund (IMF)
to provide a common method for assessing data quality. The DQAF provides
guidelines for measuring data dimensions that include timeliness, in which actual
times of data delivery are compared to anticipated data delivery schedules.
Data quality management -
Several steps typically mark data quality efforts. In a data quality management cycle
identified by data expert David Loshin, data quality management begins with
identifying and measuring the effect that poor-quality data has on business outcomes. Rules are defined,
performance targets are set, and quality improvement methods as well as specific
data cleansing, or data scrubbing, and enhancement processes are put in place.
Results are then monitored as part of ongoing measurement of the use of the data in
the organization. This virtuous cycle of data quality management is intended to
ensure that overall data quality continues to improve after the initial data quality
effort is completed.
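As a rough sketch of the measurement step in such a cycle, the Python/pandas example below applies a small set of validation rules to a table and reports whether each rule's pass rate meets its target; the column names, rules and targets are hypothetical and would come from the rule-definition step described above.

import pandas as pd

# Hypothetical rule set: each rule returns a boolean Series marking rows that pass.
RULES = {
    "email_present": lambda df: df["email"].notna(),
    "age_in_range": lambda df: df["age"].between(0, 120),
}

# Illustrative pass-rate targets agreed with the business.
TARGETS = {"email_present": 0.99, "age_in_range": 0.995}

def measure_quality(df):
    """Return each rule's pass rate and whether it meets its target."""
    report = {}
    for name, check in RULES.items():
        rate = float(check(df).mean())
        report[name] = {"pass_rate": rate, "meets_target": rate >= TARGETS[name]}
    return report

# Example usage against a hypothetical extract:
# print(measure_quality(pd.read_csv("customers.csv")))

Results like these would be tracked over time as part of the ongoing monitoring the cycle calls for.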
Software tools specialized for data quality management match records, delete
duplicates, establish remediation policies and identify personally identifiable data.
Management consoles for data quality support the creation of rules for data handling
to maintain data integrity, the discovery of data relationships, and the automated
data transformations that may be part of quality control efforts.
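As a minimal illustration of the record-matching and duplicate-removal step such tools automate, the following Python/pandas sketch normalizes two match keys and drops exact duplicates; the file and column names are assumptions for illustration.

import pandas as pd

contacts = pd.read_csv("contacts.csv")  # hypothetical input with "name" and "email" columns

# Normalize the fields used as match keys so trivial formatting differences do not hide duplicates.
contacts["email_key"] = contacts["email"].str.strip().str.lower()
contacts["name_key"] = contacts["name"].str.strip().str.lower()

# Keep the first occurrence of each matched record and count what was removed.
deduped = contacts.drop_duplicates(subset=["email_key", "name_key"], keep="first")
print(f"Removed {len(contacts) - len(deduped)} duplicate records; {len(deduped)} remain.")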
Collaborative views and workflow enablement tools have become more common,
giving data stewards, who are charged with maintaining data quality, views into
corporate data repositories. These tools and related processes are often closely linked
with master data management (MDM) systems that have become part of many data
governance efforts.
Data quality management tools include IBM InfoSphere Information Server for Data
Quality, Informatica Data Quality, Oracle Enterprise Data Quality, Pitney Bowes
Spectrum Technology Platform, SAP Data Quality Management and Data Services,
SAS DataFlux and others.
While many organizations boast of having good data or improving the quality of
their data, the real challenge is defining what those qualities represent. What some
consider good quality others might view as poor. Judging the quality of data requires
an examination of its characteristics and then weighing those characteristics
according to what is most important to the organization and the application(s) for
which the data is being used.
Accuracy -
This characteristic refers to the exactness of the data. It cannot have any erroneous
elements and must convey the correct message without being misleading. Accuracy
and precision also have a component that relates to the data's intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision
could be off-target or more costly than necessary. For example, accuracy in
healthcare might be more important than in another industry (which is to say,
inaccurate data in healthcare could have more serious consequences) and, therefore,
justifiably worth higher levels of investment.
Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses. Data should be captured at the point of
activity.
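One practical way to measure accuracy is to compare captured values against a trusted reference source. The Python/pandas sketch below assumes hypothetical files and column names: an operational shipments table and a verified address reference sharing a customer_id key.

import pandas as pd

shipments = pd.read_csv("shipments.csv")            # hypothetical operational data
reference = pd.read_csv("verified_addresses.csv")   # hypothetical trusted reference

# Join on the shared key and compare the audited field against its reference value.
merged = shipments.merge(reference, on="customer_id", suffixes=("", "_ref"))
accuracy = (merged["postal_code"] == merged["postal_code_ref"]).mean()
print(f"Postal code accuracy against the reference: {accuracy:.1%}")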
Validity -
Requirements governing data set the boundaries of this characteristic. For example,
on surveys, items such as gender, ethnicity, and nationality are typically limited to a
set of options and open answers are not permitted. Any answers other than these
would not be considered valid or legitimate based on the survey’s requirement. This
is the case for most data and must be carefully considered when determining its
quality. The people in each department of an organization understand which data is
valid for their purposes, so those requirements must be taken into account when
evaluating data quality.
Data should be recorded and used in compliance with relevant requirements,
including the correct application of any rules or definitions. This will ensure
consistency between periods and with similar organisations, measuring what is
intended to be measured.
Relevant guidance and definitions are provided for all statutory performance
indicators. Service Heads are informed of any revisions and amendments
within 24 hours of receipt from the relevant government department. Local
performance indicators comply with locally agreed guidance and definitions.
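A validity check of this kind reduces to testing each value against its permitted set. A minimal Python/pandas sketch follows, where the file name, column and option list are assumptions standing in for the survey's actual requirements.

import pandas as pd

responses = pd.read_csv("survey.csv")   # hypothetical survey export

# Permitted options defined by the survey's requirements (illustrative).
ALLOWED_GENDER = {"female", "male", "non-binary", "prefer not to say"}

# Any value outside the permitted set is invalid under those requirements.
invalid = responses[~responses["gender"].isin(ALLOWED_GENDER)]
print(f"{len(invalid)} responses contain a gender value outside the permitted options.")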
Reliability -
Many systems in today’s environments use and/or collect the same source data.
Regardless of what source collected the data or where it resides, it cannot contradict
a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without
contradiction or unwarranted variance.
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real
changes rather than variations in data collection approaches or methods.
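A simple way to test for such contradictions is to compare the same attribute across two systems. The sketch below assumes hypothetical CRM and billing extracts that share a customer_id key and both carry a phone number.

import pandas as pd

crm = pd.read_csv("crm_customers.csv")          # hypothetical source A
billing = pd.read_csv("billing_customers.csv")  # hypothetical source B

# Join on the shared key; any mismatch in the compared field is a reliability issue to investigate.
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["phone_crm"] != merged["phone_billing"]]
print(f"{len(conflicts)} customers have contradictory phone numbers across the two systems.")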
Timeliness -
There must be a valid reason to collect the data to justify the effort required, which
also means it has to be collected at the right moment in time. Data collected too soon
or too late could misrepresent a situation and drive inaccurate decisions.
Data should be captured as quickly as possible after the event or activity and must
be available for the intended use within a reasonable time period. Data must be
available quickly and frequently enough to support information needs and to
influence service or management decisions.
Performance data is requested to be available within one calendar month
from the end of the previous quarter and is subsequently reported to the
respective Policy and Scrutiny Panel on a quarterly basis. As a part of the
ongoing development of PerformancePlus it is intended that performance
information will be exported through custom reporting and made available
via the Three Rivers DC website. This will improve access to information and
eliminate delays in publishing information through traditional methods.
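Timeliness can be measured as the lag between an event and its capture, flagging records that fall outside an agreed window. The Python/pandas sketch below assumes a hypothetical event log with occurred_at and recorded_at timestamps and a 24-hour window.

import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["occurred_at", "recorded_at"])  # hypothetical log

# Lag between the activity and its capture; flag anything outside the agreed window.
lag = events["recorded_at"] - events["occurred_at"]
late = events[lag > pd.Timedelta(hours=24)]
print(f"{len(late)} of {len(events)} records were captured more than 24 hours after the event.")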
Completeness -
Data requirements should be clearly specified based on the information needs of the
organisation, and data collection processes should be matched to these requirements.
Completeness refers to the relationship between the objects in a database and the
abstract universe of all such objects: a data set is complete when every object that
should be represented is present.
● Completeness includes the selection criteria, definitions and other mapping rules
used to create the database.
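A basic completeness measure is the share of required fields that are actually populated. The sketch below uses a hypothetical dataset and field list.

import pandas as pd

requests = pd.read_csv("service_requests.csv")   # hypothetical dataset

# Share of non-missing values for each required field; low coverage signals incomplete capture.
REQUIRED_FIELDS = ["request_id", "category", "opened_date", "postcode"]
coverage = requests[REQUIRED_FIELDS].notna().mean()
print(coverage.sort_values())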
Relevance -
Data captured should be relevant to the purposes for which it is to be used. This
will require a periodic review of requirements to reflect changing needs.
There are many elements that determine data quality, and each can be prioritized
differently by different organizations. The prioritization could change depending on
the stage of growth of an organization or even its current business cycle. The key is
to remember you must define what is most important for your organization when
evaluating data. Then, use these characteristics to define the criteria for high-quality,
accurate data. Once those criteria are defined, you will have a clearer understanding
of your data and be better positioned to achieve your goals.
For a long time, data quality efforts centered on the governance of relational
data in organizations, but that began to change as web and cloud computing
architectures came into prominence.
Unstructured data, text, natural language processing and object data became part of
the data quality mission. The variety of data was such that data experts began to
assign different degrees of trust to various data sets, forgoing approaches that took a
single, monolithic view of data quality.
Also, the classic issues of garbage in/garbage out that drove data quality efforts in
early computing resurfaced with artificial intelligence (AI) and machine
learning applications, in which data preparation often became the most demanding
of data teams' resources.
The higher volume and speed of arrival of new data also became a greater challenge
for the data quality steward.
Expansion of data's use in digital commerce, along with ubiquitous online activity,
has only intensified data quality concerns. While errors from rekeying data are a
thing of the past, dirty data is still a common nuisance.
Protecting the privacy of individuals' data became a mild concern for data quality
teams beginning in the 1970s, growing to become a major issue with the spread of
data acquired via social media in the 2010s. With the formal implementation of the
General Data Protection Regulation (GDPR) in the European Union (EU) in 2018,
the demands for data quality expertise were expanded yet again.
With GDPR and the risks of data breaches, many companies find themselves in a
situation where they must fix data quality issues.
The first step toward fixing data quality requires identifying all the problem data.
Software can be used to perform a data quality assessment to verify data sources are
accurate, determine how much data there is and the potential impact of a data breach.
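A first pass at such an assessment can be scripted. The sketch below profiles a hypothetical extract: how many rows and columns it holds, how much is missing, and which columns appear to hold personal data (a simple keyword heuristic, not a substitute for proper data classification).

import pandas as pd

extract = pd.read_csv("export.csv")   # hypothetical extract from a source under assessment

# Basic profile: volume and missing-value rate per column.
profile = {
    "rows": len(extract),
    "columns": len(extract.columns),
    "null_rate": extract.isna().mean().round(3).to_dict(),
}

# Columns whose names suggest personally identifiable data (illustrative keyword list).
PII_KEYWORDS = ("name", "email", "phone", "address", "dob")
likely_pii = [c for c in extract.columns if any(k in c.lower() for k in PII_KEYWORDS)]

print(profile)
print("Columns that may hold personal data:", likely_pii)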
From there, companies can build a data quality program, with the help of data
stewards, data protection officers or other data management professionals. These
data management experts will help implement business processes that ensure future
data collection and use meet regulatory guidelines and provide the value that
businesses expect from the data they collect.