Data Profiling
Like it or not, many of the assumptions you have about your data are probably not
accurate. Despite our best efforts, gremlins inevitably find their way into our systems.
The end result, poor data quality, has a host of negative consequences. This brief
article provides an introduction to data quality concepts and illustrates how data
profiling can be used to improve data quality.
Two questions frame any assessment of data quality:
Is the data of sufficient quality to support the business purpose(s) for which it is
being used?
Are any specific issues within the data decreasing its suitability for these
business purposes?
Do most organizations actually have such issues? The short answer is yes; a study by
Gartner estimated that more than 25 percent of critical data within Fortune 1000
enterprises is flawed.
With the myriad of ways that data is captured (online transactions, automated
device capture, manual screen entry, spreadsheet uploads, direct database
changes), there are many opportunities for flawed data to enter source systems.
A report from The Data Warehousing Institute concluded that data quality
problems cost U.S. businesses more than $600 billion a year, and that poor data
quality leads to the failure and delay of many high-profile IT projects.
Lack of trust in the data due to poor data quality leads to reduced or
discontinued BI usage among information consumers.
Poor data quality also has legal and regulatory implications, especially in the wake of
Sarbanes-Oxley, as accurate data is required for accurate financial
reporting.
Data Profiling is a systematic analysis of the content of a data source (Ralph Kimball).
You must look at the data; you can't trust copybooks, data models, or source
system experts.
It is systematic in the sense that it is thorough and looks in all the nooks and
crannies of the data.
You have to know your data before you can fix it. Typical profiling analyses include
the following (sketched in SQL after the list):
Completeness Analysis
o How often is a given attribute populated, versus blank or null?
Uniqueness Analysis
o How many unique (distinct) values are found for a given attribute
across all records? Are there duplicates? Should there be?
Values Distribution Analysis
o What is the distribution of records across different values for a
given attribute?
Range Analysis
o What are the minimum, maximum, average and median values
found for a given attribute?
Pattern Analysis
o What formats were found for a given attribute, and what is the
distribution of records across these formats?
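To make these analyses concrete, the queries below sketch each one in plain SQL
against a hypothetical customer table; the table and column names are illustrative,
and some details (such as the TRANSLATE-based pattern query) vary by database.

    -- Completeness: how often is phone_number populated versus null?
    SELECT COUNT(*)                        AS total_rows,
           COUNT(phone_number)             AS populated_rows,
           COUNT(*) - COUNT(phone_number)  AS null_rows
    FROM customer;

    -- Uniqueness: distinct values versus populated rows (a gap implies duplicates).
    SELECT COUNT(phone_number)          AS populated_rows,
           COUNT(DISTINCT phone_number) AS distinct_values
    FROM customer;

    -- Values distribution: how records spread across the values of status_code.
    SELECT status_code, COUNT(*) AS record_count
    FROM customer
    GROUP BY status_code
    ORDER BY record_count DESC;

    -- Range: minimum, maximum, and average of a numeric attribute
    -- (median syntax differs widely between databases).
    SELECT MIN(credit_limit) AS min_value,
           MAX(credit_limit) AS max_value,
           AVG(credit_limit) AS avg_value
    FROM customer;

    -- Pattern: map every digit to '9' to see which formats occur and how often
    -- (TRANSLATE as in Oracle/PostgreSQL; other databases offer similar functions).
    SELECT TRANSLATE(phone_number, '0123456789', '9999999999') AS pattern,
           COUNT(*) AS record_count
    FROM customer
    GROUP BY TRANSLATE(phone_number, '0123456789', '9999999999')
    ORDER BY record_count DESC;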
Data profiling can add value in a wide variety of situations. The basic litmus test is:
"Is the quality of data important for this initiative?" If the answer is yes, then data
profiling can help, as it can quickly and thoroughly unveil the true content and
structure of your data.
Traditionally, data profiling required a skilled technical resource who could manually
query the data source using Structured Query Language (SQL). There is often a
disconnect between the business analyst who knows what the data should be, and the
technical programmer who knows SQL.
Available Tools
A variety of options exist in the marketplace to help ease the challenge of data profiling.
They range in capabilities and price. Tools like Datiris Profiler and Informatica Data
Quality have been successfully deployed by a myriad of organizations. Implemented in
the right way, such tools can reshape the data profiling landscape by reducing effort,
broadening scope, and improving consistency across data quality initiatives.
A complete data quality effort has three parts: data profiling, data correction, and
data monitoring. Data profiling is the act of analyzing your data content. Data
correction is the act of correcting your data content when it falls below your
standards. And data monitoring is the ongoing act of establishing data quality standards
in a set of metrics meaningful to the business, reviewing the results on a recurring
basis, and taking corrective action whenever we fall outside the acceptable thresholds
of quality.
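As a minimal sketch of what a monitoring check might look like, the query below
computes a completeness metric for a single column and flags it when it drops below an
assumed 98 percent threshold; the table, column, and threshold are all illustrative.

    -- Monitoring sketch: completeness of customer.email_address against a 98% threshold.
    SELECT COUNT(email_address) * 1.0 / COUNT(*) AS completeness_ratio,
           CASE WHEN COUNT(email_address) * 1.0 / COUNT(*) < 0.98
                THEN 'BELOW THRESHOLD'
                ELSE 'OK'
           END AS quality_status
    FROM customer;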
Today, I want to focus on data profiling. Data profiling is the analysis of data content in
conjunction with every new and existing application effort. We can profile batch data,
near-real-time or real-time data, structured and unstructured data, or any data asset
meaningful to the organization. Data profiling gives your organization a methodical,
repeatable, consistent, and metrics-based means to analyze large amounts of data
quickly, and you should evaluate your data continually given its dynamic nature.
Profiling itself can take several forms:
Column Profiling, where all the values are analyzed within each column or
attribute. The objective is to discover the true metadata and uncover data content
quality problems.
Security Profiling, where it is determined who (or what roles) have access to the
data and what they are authorized to do with it (add, update, delete, etc.); see the
sketch after this list.
Custom Profiling, where our data is analyzed in a fashion that is meaningful to our
organization. For example, an organization might want to analyze data
consumption to determine whether data is accessed more through web services, direct
queries, or some other channel. One large organization improved system
throughput after determining how the business and its customers accessed their
information.
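As one illustration of security profiling, the query below lists table-level grants from
the ANSI information_schema, which PostgreSQL, SQL Server, and MySQL all expose with
minor differences; the table name is illustrative.

    -- Security profiling sketch: who (or what role) can do what to the customer table?
    SELECT grantee, privilege_type, is_grantable
    FROM information_schema.table_privileges
    WHERE table_name = 'customer'
    ORDER BY grantee, privilege_type;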
Most times, you'll find IT and the business have a few false assumptions concerning data
content and its quality. I believe the cost to the business is the risk to its future
solvency, or its failure to reach its maximum revenue potential. Sometimes leadership
has difficulty assessing the need for a data quality program due to an inability to
assess that cost. Sometimes action is taken only after a bug is discovered at midnight
or a customer feels their report is wrong. Data profiling allows your organization to be
proactive and creates self-awareness.
There are two methods of data profiling: one based on sampling and another based on
profiling data in place. Sample-based profiling involves performing your analysis on a
random sample of data. For example, I might want to profile a 100-million-row table; in
an effort to be efficient, my sample might be roughly a third of the rows, taking every
third row. Sample-based profiling requires me to store my sample in some temporary
medium, and it requires you to ensure you have a representative sample of your data.
From a statistical standpoint, if my samples are too small, I can easily miss data
patterns or fail to properly identify a column's domain.
The second method involves profiling my data in place. It's treated as just another
query of my database. Generally, you will be profiling PROD, and given the contention
for resources, you'll want to run your queries when they have the least impact on the
database.
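To make the two approaches concrete, here is a hedged sketch of each; the table and
column names are illustrative, and the row-numbering syntax follows PostgreSQL-style SQL.

    -- Sample-based: persist roughly every third row into a working table, then profile the copy.
    CREATE TABLE order_detail_sample AS
    SELECT *
    FROM (
        SELECT t.*, ROW_NUMBER() OVER (ORDER BY order_id) AS rn
        FROM order_detail t
    ) numbered
    WHERE rn % 3 = 0;

    -- In-place: profiling is just another query against the source table.
    SELECT COUNT(*)         AS total_rows,
           COUNT(ship_date) AS populated_ship_dates
    FROM order_detail;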
You might be asking what toolsets are available to perform data profiling. You have lots
of options. Most ETL toolsets, such as Informatica and DataStage, offer built-in data
profilers. There are stand-alone data profiling alternatives. And if your budget is zero,
you can write your own scripts to perform the analysis.
What data should I profile first? I like to focus on mission-critical data first, like
customer or product information. If I have a data warehouse, data mart, or OLAP cube,
I'll focus on their data sources. Your OLTP environment is a good starting point, since
most analytic data stores pull from these sources.
Once you have performed your data profiling effort, what next? I like to map the results
to my outstanding application bug-fix reports. You will often find a high correlation
between the known errors and what your data profile tells you, and you can be proactive
in discovering errors that may reside in your data right now. If I know my data contents,
I can create better and smaller test data sets for QA purposes. I like to share my
findings with QA to develop a better test database and improve our test plans.
I can also be proactive in my transformations, identifying data misalignments where my
data sources contain values that are not being handled properly. And if there are data
anomalies, where we have the same set of values stored in multiple locations, we can
address our data structures if needed.
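A simple way I can surface those misalignments is an anti-join between a source column
and the reference values my transformations actually handle; the table and column names
below are illustrative.

    -- Consistency sketch: source values with no match in the reference (i.e., unhandled values).
    SELECT s.country_code, COUNT(*) AS affected_rows
    FROM source_customer s
    LEFT JOIN reference_country r
           ON s.country_code = r.country_code
    WHERE r.country_code IS NULL
    GROUP BY s.country_code
    ORDER BY affected_rows DESC;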
Another useful insight comes from the data model structure. Do my tables reflect the
business at hand? Every organization has tables that are processed each night and not
used by anyone. When I profile, I like to match my data to my business intelligence
environment. When I identify a set of tables and reports that are not used by anyone, we
can remove them from PROD to improve performance. I can also match my data sources to
my staging area to determine whether my processes are optimal.
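If the database happens to be PostgreSQL, its statistics views offer one rough way to
spot candidate unused tables; this is an assumption about the platform, and other
databases expose similar usage counters under different names.

    -- Usage sketch (PostgreSQL): tables with no recorded sequential or index scans.
    SELECT schemaname, relname, seq_scan, idx_scan, n_live_tup
    FROM pg_stat_user_tables
    WHERE COALESCE(seq_scan, 0) = 0
      AND COALESCE(idx_scan, 0) = 0
    ORDER BY n_live_tup DESC;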
There are so many great uses for data profiling. To start, I recommend looking at your
business strategy and assessing your data quality cost. Once you've assessed the cost,
determine whether your current data quality strategy aligns with your business needs. A
good data profiling strategy should complement your business strategy and provide the
business with tangible bottom-line results.
What issues have you overcome in data profiling? How did you work through them?