Data Explorer: - A Data Profiling Tool
Data Explorer: - A Data Profiling Tool
Ihr Logo
Agenda
Introduction
Existing System
Limitations of Existing System
Proposed Solution
Project Scope
Block Diagram
Implementation
Technology
Hardware and Software Requirements
Features and Benefits
Future Enhancement
DataYour
Explorer
Logo – A Data Profiling Tool
Introduction (1/2)
Data Profiling
Data profiling is the process of examining the data available in an existing data source
(e.g. a database or a file) and collecting statistics and information about that data.
Data profiling is an analysis of the candidate data sources for a data warehouse to
clarify the structure, content, relationships and derivation rules of the data. Profiling
helps to understand anomalies and to assess data quality, but also to discover, register,
and assess enterprise metadata.
The purpose of data profiling is both to validate metadata when it is available and to
discover metadata when it is not.
The result of the analysis is used both strategically, to determine suitability of the
candidate source systems and give the basis for an early go/no-go decision, and
tactically, to identify problems for later solution design, and to level sponsors’
expectations.
DataYour
Explorer
Logo – A Data Profiling Tool
Introduction (2/2)
Pourpose of Data Profiling
Find out whether existing data can easily be used for other purposes
Improve the ability to search the data by tagging it with keywords, descriptions, or
assigning it to a category
Give metrics on data quality, including whether the data conforms to particular
standards or patterns
Assess the risk involved in integrating data for new applications, including the
challenges of joins
Assess whether metadata accurately describes the actual values in the source
database
Understanding data challenges early in any data intensive project, so that late project
surprises are avoided. Finding data problems late in the project can lead to delays and
cost overruns.
DataYour
Explorer
Logo – A Data Profiling Tool
Existing System
Initially the data Profiling activities used to be done by writing complicated SQL queries
This would be comfortable for analyst or user who knows to write SQL queries
Many of us do not know the proper syntax and format for writing SQL queries
To overcome this, Data Profiling tools were introduced
Data Profiling Tools, to a some extent overcome the limitations for writing complex
queries
All types of profiling activities were not supported by the tools
User has to understand and learn how to use the tool
DataYour
Explorer
Logo – A Data Profiling Tool
Limitations of Existing System
SQL Queries Existing Tools
Traditional Approach
Minimum Server Requirements
DataYour
Explorer
Logo – A Data Profiling Tool
Proposed Solution
Developing an Application performing all the types of profiling
Easy to use interface
Minimum system requirements
Feature to export the profiling results data to excel
Additional feature to indicate the Data Quality i.e. Data Quality Indicator
Supporting multiple Databases like Oracle 10g, Oracle 11g, MS SQL Server 2005, MS
SQL Server 2008, My SQL etc
DataYour
Explorer
Logo – A Data Profiling Tool
Project Scope
Keeping the Time Line and other factors in mind, the project will currently support only
MS SQL Server
Also the project will have following types of Profiling:
Column Profiling
Empty Column Analysis
Null Rule Analysis
Constant Analysis
Frequency Analysis
Uniqueness Analysis
Primary/Composite Key Analysis
DataYour
Explorer
Logo – A Data Profiling Tool
Architecture Diagram
Analysis Team Business Users Management
Data Explorer
Data Central Capture Reporting
Profiling Metadata Issues
Repository and Notes
DataYour
Explorer
Logo – A Data Profiling Tool
Implementation
The project will be implemented module wise.
Project will be having different modules. Each module will be developed individually
and Unit Tested
After completion of all the modules and unit testing, all the modules will be integrated
and System Integration Testing will be performed
There will be separate modules for Databases retrieval from server, Tables retrieval
after selecting a database, Columns retrieval after selecting a Table
There will be separate module for each type of profiling discussed.
DataYour
Explorer
Logo – A Data Profiling Tool
Implementation - Profiling Details
Column Profiling
This will help in discovering total no of records, null percentage, unique
percentage, minimum and maximum value in the column, documented data type
etc.
Constant Analysis
This will help in discovering those columns which has less than 4 and greater than
0 distinct values.
Null Rule Analysis
This will help in finding all the columns in a table which has 100% NULL values
DataYour
Explorer
Logo – A Data Profiling Tool
Implementation - Profiling Details
Unique Analysis
This will help in finding all the columns in table which has 100% uniqueness.
Primary Key / Composite Key Analysis
It will help us to find out the possible primary or composite key columns which can
be have unique combination.
Frequency Analysis
This will help in finding the no. of distinct values in the columns and the no. of time
the value is repeated in a column.
DataYour
Explorer
Logo – A Data Profiling Tool
Technology
Data Explorer will be developed on .NET platform using C# as a coding language.
.NET is Microsoft platform for developing advanced and Robust applications
.NET supports a wide range of library classes which eases the development efforts
and hence more time can be utilized in other activities
.NET is called Language Independent Platform as it support 4 native languages and 21
non-native languages.
Native Languages are a Microsoft created languages i.e. C#. VB.Net. J#, VC++
Non-Native or Non Microsoft Languages supported are Pearl, Ruby on Rails etc
DataYour
Explorer
Logo – A Data Profiling Tool
Hardware and Software Requirements
Data Explorer
HARDWARE SOFTWARE
• Pentium Core 2 • Windows 2000/
Duo processor or Windows XP/
above Windows Vista/
• 2 GB RAM Windows 7
• 20 GB HDD • Microsoft .NET
• Printer Framework 3.5
• Router for Internet • Microsoft Visual
Connection Studio 2008
DataYour
Explorer
Logo – A Data Profiling Tool
Features
Supports multiple databases like MS SQL Server, Oracle
Different types of profiling like
Column Profiling
Constant Analysis
Unique Analysis
Null Rule Analysis
Frequency Analysis
Empty Column Analysis
Primary / Composite Key Analysis
Quickly Analyze and validate data issues
DataYour
Explorer
Logo – A Data Profiling Tool
Benefits
Quick discovery of data issues
No more writing of queries to profile data
Time efficient
Shorten the implementation cycle of major projects
Improve understanding of data for the users
Discovering business knowledge
Improves data accuracy in corporate databases
DataYour
Explorer
Logo – A Data Profiling Tool
Future Enhancement
Data Explorer can be further extended to support unstructured or semi-structured data
like flat files, .csv files
It can also be extended to support other relation data bases like MS Access, MySQL,
Sybase etc Time efficient
It can also be enhanced by including Data Quality reports on top of Data Quality
Results
There can be mechanism to store the profiling results so that it can be used or referred
later at any point of time
DataYour
Explorer
Logo – A Data Profiling Tool
Thank You
Data Ihr
Explorer
Logo – A Data Profiling Tool