Data Profiling
Like it or not, many of the assumptions you have about your data are probably not
accurate. Despite our best efforts, gremlins inevitably find their way into our systems.
The end result, poor data quality, has a host of negative consequences. This brief
article provides an introduction to data quality concepts and illustrates how data
profiling can be used to improve data quality.
Two questions frame any assessment of data quality:
Is the data of sufficient quality to support the business purpose(s) for which it is
being used?
Are any specific issues within the data decreasing its suitability for these
business purposes?
Do most organizations actually have such issues? The short answer is yes; a study by
Gartner estimated that more than 25 percent of critical data within Fortune 1000
enterprises is flawed.
With the myriad of ways that data is captured (online transactions, automated
device capture, manual screen entry, spreadsheet uploads, direct database
changes), there are many opportunities for flawed data to enter source systems.
A report from The Data Warehousing Institute concluded that data quality
problems cost U.S. businesses more than $600 billion a year, and that poor data
quality leads to the failure and delay of many high-profile IT projects.
Lack of trust in the data due to poor data quality leads to reduced or
discontinued BI usage among information consumers.
Poor data quality also has legal and regulatory implications, especially in the wake of
Sarbanes-Oxley, as accurate data is required for accurate financial
reporting.
Data Profiling is a systematic analysis of the content of a data source (Ralph Kimball).
You must look at the data; you can't trust copybooks, data models, or source
system experts.
It is systematic in the sense that it is thorough and looks in all the nooks and
crannies of the data.
You have to know your data before you can fix it. Typical profiling analyses include
the following (sketched in SQL after the list):
Completeness Analysis
o How often is a given attribute populated, versus blank or null?
Uniqueness Analysis
o How many unique (distinct) values are found for a given attribute
across all records? Are there duplicates? Should there be?
Values Distribution Analysis
o What is the distribution of records across different values for a
given attribute?
Range Analysis
o What are the minimum, maximum, average and median values
found for a given attribute?
Pattern Analysis
o What formats were found for a given attribute, and what is the
distribution of records across these formats?
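To make these analyses concrete, the queries below sketch each one in plain SQL
against a hypothetical customer table; the table and column names are illustrative,
and some details (such as the TRANSLATE-based pattern query) vary by database.

    -- Completeness: how often is phone_number populated versus null?
    SELECT COUNT(*)                        AS total_rows,
           COUNT(phone_number)             AS populated_rows,
           COUNT(*) - COUNT(phone_number)  AS null_rows
    FROM customer;

    -- Uniqueness: distinct values versus populated rows (a gap implies duplicates).
    SELECT COUNT(phone_number)          AS populated_rows,
           COUNT(DISTINCT phone_number) AS distinct_values
    FROM customer;

    -- Values distribution: how records spread across the values of status_code.
    SELECT status_code, COUNT(*) AS record_count
    FROM customer
    GROUP BY status_code
    ORDER BY record_count DESC;

    -- Range: minimum, maximum, and average of a numeric attribute
    -- (median syntax differs widely between databases).
    SELECT MIN(credit_limit) AS min_value,
           MAX(credit_limit) AS max_value,
           AVG(credit_limit) AS avg_value
    FROM customer;

    -- Pattern: map every digit to '9' to see which formats occur and how often
    -- (TRANSLATE as in Oracle/PostgreSQL; other databases offer similar functions).
    SELECT TRANSLATE(phone_number, '0123456789', '9999999999') AS pattern,
           COUNT(*) AS record_count
    FROM customer
    GROUP BY TRANSLATE(phone_number, '0123456789', '9999999999')
    ORDER BY record_count DESC;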
Data profiling can add value in a wide variety of situations. The basic litmus test is:
"Is the quality of data important for this initiative?" If the answer is yes, then data
profiling can help, as it can quickly and thoroughly unveil the true content and
structure of your data.
Traditionally, data profiling required a skilled technical resource who could manually
query the data source using Structured Query Language (SQL). There is often a
disconnect between the business analyst who knows what the data should be, and the
technical programmer who knows SQL.
Available Tools
A variety of options exist in the marketplace to help ease the challenge of data profiling.
They range in capabilities and price. Tools like Datiris Profiler and Informatica Data
Quality have been successfully deployed by a myriad of organizations. Implemented in
the right way, such tools can reshape the data profiling landscape by reducing effort,
broadening scope, and improving consistency across data quality initiatives.
A complete data quality effort has three parts: data profiling, data correction, and
data monitoring. Data profiling is the act of analyzing your data content. Data
correction is the act of correcting your data content when it falls below your
standards. And data monitoring is the ongoing act of establishing data quality standards
in a set of metrics meaningful to the business, reviewing the results on a recurring
basis, and taking corrective action whenever we fall outside the acceptable thresholds
of quality.
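As a minimal sketch of what a monitoring check might look like, the query below
computes a completeness metric for a single column and flags it when it drops below an
assumed 98 percent threshold; the table, column, and threshold are all illustrative.

    -- Monitoring sketch: completeness of customer.email_address against a 98% threshold.
    SELECT COUNT(email_address) * 1.0 / COUNT(*) AS completeness_ratio,
           CASE WHEN COUNT(email_address) * 1.0 / COUNT(*) < 0.98
                THEN 'BELOW THRESHOLD'
                ELSE 'OK'
           END AS quality_status
    FROM customer;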
Today, I want to focus on data profiling. Data profiling is the analysis of data content in
conjunction with every new and existing application effort. We can profile batch data,
near-real-time or real-time data, structured and unstructured data, or any data asset
meaningful to the organization. Data profiling gives your organization a methodical,
repeatable, consistent, and metrics-based means to analyze large amounts of data
quickly, and you should evaluate your data continually given its dynamic nature.
Profiling itself can take several forms:
Column Profiling, where all the values are analyzed within each column or
attribute. The objective is to discover the true metadata and uncover data content
quality problems.
Security Profiling, where it is determined who (or what roles) have access to the
data and what they are authorized to do with it (add, update, delete, etc.); see the
sketch after this list.
Custom Profiling, where our data is analyzed in a fashion that is meaningful to our
organization. For example, an organization might want to analyze data
consumption to determine whether data is accessed more through web services, direct
queries, or some other channel. One large organization improved system
throughput after determining how the business and its customers accessed their
information.
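As one illustration of security profiling, the query below lists table-level grants from
the ANSI information_schema, which PostgreSQL, SQL Server, and MySQL all expose with
minor differences; the table name is illustrative.

    -- Security profiling sketch: who (or what role) can do what to the customer table?
    SELECT grantee, privilege_type, is_grantable
    FROM information_schema.table_privileges
    WHERE table_name = 'customer'
    ORDER BY grantee, privilege_type;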
Most times, you'll find IT and the business have a few false assumptions concerning data
content and its quality. I believe the cost to the business is the risk to its future
solvency, or its failure to reach its maximum revenue potential. Sometimes leadership
has difficulty assessing the need for a data quality program due to an inability to
assess that cost. Sometimes action is taken only after a bug is discovered at midnight
or a customer feels their report is wrong. Data profiling allows your organization to be
proactive and creates self-awareness.
There are two methods of data profiling: one based on sampling and another based on
profiling data in place. Sample-based profiling involves performing your analysis on a
random sample of data. For example, I might want to profile a 100-million-row table; in
an effort to be efficient, my sample might be roughly a third of the rows, taking every
third row. Sample-based profiling requires me to store my sample in some temporary
medium, and it requires you to ensure you have a representative sample of your data.
From a statistical standpoint, if my samples are too small, I can easily miss data
patterns or fail to properly identify a column's domain.
The second method involves profiling my data in place. It's treated as just another
query of my database. Generally, you will be profiling PROD, and given the contention
for resources, you'll want to run your queries when they have the least impact on the
database.
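To make the two approaches concrete, here is a hedged sketch of each; the table and
column names are illustrative, and the row-numbering syntax follows PostgreSQL-style SQL.

    -- Sample-based: persist roughly every third row into a working table, then profile the copy.
    CREATE TABLE order_detail_sample AS
    SELECT *
    FROM (
        SELECT t.*, ROW_NUMBER() OVER (ORDER BY order_id) AS rn
        FROM order_detail t
    ) numbered
    WHERE rn % 3 = 0;

    -- In-place: profiling is just another query against the source table.
    SELECT COUNT(*)         AS total_rows,
           COUNT(ship_date) AS populated_ship_dates
    FROM order_detail;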
You might be asking what toolsets are available to perform data profiling. You have lots
of options. Most ETL toolsets, such as Informatica and DataStage, offer built-in data
profilers. There are stand-alone data profiling alternatives. And if your budget is zero,
you can write your own scripts to perform the analysis.
What data should I profile first? I like to focus on mission-critical data first, like
customer or product information. If I have a data warehouse, data mart, or OLAP cube,
I'll focus on their data sources. Your OLTP environment is a good starting point, since
most analytic data stores pull from these sources.
Once you have performed your data profiling effort, what next? I like to map the results
to my outstanding application bug-fix reports. You will often find a high correlation
between the known errors and what your data profile tells you, and you can be proactive
in discovering errors that may reside in your data right now. If I know my data contents,
I can create better and smaller test data sets for QA purposes. I like to share my
findings with QA to develop a better test database and improve our test plans.
I can also be proactive in my transformations, identifying data misalignments where my
data sources contain values that are not being handled properly. And if there are data
anomalies, where we have the same set of values stored in multiple locations, we can
address our data structures if needed.
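A simple way I can surface those misalignments is an anti-join between a source column
and the reference values my transformations actually handle; the table and column names
below are illustrative.

    -- Consistency sketch: source values with no match in the reference (i.e., unhandled values).
    SELECT s.country_code, COUNT(*) AS affected_rows
    FROM source_customer s
    LEFT JOIN reference_country r
           ON s.country_code = r.country_code
    WHERE r.country_code IS NULL
    GROUP BY s.country_code
    ORDER BY affected_rows DESC;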
Another useful insight comes from the data model structure. Do my tables reflect the
business at hand? Every organization has tables that are processed each night and not
used by anyone. When I profile, I like to match my data to my business intelligence
environment. When I identify a set of tables and reports that are not used by anyone, we
can remove them from PROD to improve performance. I can also match my data sources to
my staging area to determine whether my processes are optimal.
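If the database happens to be PostgreSQL, its statistics views offer one rough way to
spot candidate unused tables; this is an assumption about the platform, and other
databases expose similar usage counters under different names.

    -- Usage sketch (PostgreSQL): tables with no recorded sequential or index scans.
    SELECT schemaname, relname, seq_scan, idx_scan, n_live_tup
    FROM pg_stat_user_tables
    WHERE COALESCE(seq_scan, 0) = 0
      AND COALESCE(idx_scan, 0) = 0
    ORDER BY n_live_tup DESC;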
There are so many great uses for data profiling. To start, I recommend looking at your
business strategy and assessing your data quality cost. Once you've assessed the cost,
determine whether your current data quality strategy aligns with your business needs. A
good data profiling strategy should complement your business strategy and provide the
business with tangible bottom-line results.
What issues have you overcome in data profiling? How did you work through them?