OracleDataQuality_DataProfiling20140514
Day 1:
Introduction to Trillium and Preparing the Data Load
1. Basic information about Trillium: version, Trillium architecture, equivalent Oracle
software (ODQ & ODP)
2. Entity creation
3. Verification of entity creation
4. Loading different types of files into the Trillium environment (flat file, fixed-width file,
COBOL files)
5. Applying a filter during entity creation
6. Profile project creation
7. How to add an entity to a Profile project, deletion, rename, etc.
Day 2:
Investigate Data Content:
1. Value Range: minimum and maximum values
2. Document findings
3. Bookmarks: public and private bookmarks
4. Notes
5. Examine data types
6. Examine nulls
7. Misspellings and data entry issues: soundexes, metaphones
8. Format deviations: masks, patterns
9. Numerical statistics for an attribute
10. Examine value sets and duplicate values
Day 3:
Define Standards and Recode Values:
1. Attribute phrase analysis
2. Value recoding
3. Creation of recode tables
Key Analysis and Dependency Analysis:
4. Keys: discover keys, create keys
5. Dependencies: discover dependencies, create dependencies
Day4:
Define Standards:
1.Simple compliance tests:
Predominant Data type check
Sum check
Schema data type check
Spaces check
Pattern check
Value check
Null check
Range check
Schema Length check
Unique check
2. Business Rules:
To create a Business Rule
To review Results from Business Rule analysis
To edit a Business Rule
Day 5:
Investigate Relationships:
1. Find potential joins
2. Join analysis: review join analysis results
3. Validate expected joins
4. Entity Relationship Diagram (ERD)
5. View ERD for multiple joins
6. Export DDLs from ERD
7. Generate DDLs from an entity
Day 6:
Metabase Manager Admin Tasks:
1. Create a Metabase
2. Create users
3. Add users to a Metabase
4. Create loader connections
5. Enable and disable key analysis and dependency analysis
6. Create a project, add an entity to a project, delete a project
7. Create notes on attributes and create bookmarks on issue attributes
Introduction About Trillium Software:
Trillium Software is a leading software and service provider for data quality, data
governance, data cleansing, and data profiling.
Oracle Data Profiling and Quality share a single user interface through which
you can monitor and manage:
■ Data resources and connections
■ Core data functions and services
■ User-defined projects
■ Data business and governance rules
■ Repository data objects
■ Batch and real-time data process results
About data quality:
Good quality data means that all master data is complete, consistent, accurate,
timestamped, and conforms to industry standards.
Data Profiling:
-->Data profiling is the process of examining the data available in an existing data source
-->A data source is usually a database or a file
-->By doing data profiling we can collect statistics and information about the data
Here in the above example, if we look into the CUSTID field, data is missing; this
is critical.
Business data standards for Customer ID
Now we will see a few business data standard examples for CUSTID:
1. Data type of customer ID should be varchar
2. Length of the customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. Format of customer ID should be 'A999'
5. Customer ID value should only be in the range A100 to A400
6. Customer ID cardinality should be 100%
Now click on the Failing tab to see the values that are null in this attribute.
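These checks can also be expressed in SQL. Below is a minimal Oracle SQL sketch of standards 2 through 6, assuming the CUSTOMER table and CUSTID column used in the queries later in this document (the data type check itself belongs to the schema, not to the rows):

SELECT CUSTID
FROM CUSTOMER
WHERE CUSTID IS NULL                          -- standard 3: no nulls
   OR TRIM(CUSTID) IS NULL                    -- standard 3: in Oracle an all-space value trims to NULL
   OR LENGTH(CUSTID) <> 4                     -- standard 2: length 4
   OR NOT REGEXP_LIKE(CUSTID, '^A[0-9]{3}$')  -- standard 4: format 'A999'
   OR CUSTID NOT BETWEEN 'A100' AND 'A400';   -- standard 5: range A100 to A400

-- standard 6: 100% cardinality means no CUSTID occurs more than once
SELECT CUSTID, COUNT(*) FROM CUSTOMER GROUP BY CUSTID HAVING COUNT(*) > 1;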
Features of Oracle Data Quality and Profiling:
1. Attribute analysis
2. Phrase analysis
3. Key analysis
4. Dependency analysis
5. Join analysis
6. Business rules
Terms and Terminology Used in Oracle Data Quality and Data Profiling:
Repository:
A Repository contains a collection of one or more Metabases
A repository has its own group of Users, Loader Connections, and security and
performance settings
Entity:
An Entity is a file or table stored in your Metabase and associated with a data
source that you've identified
Attribute:
An Attribute is a field or column in an Entity.
An Attribute cannot exist in a Metabase without an Entity
Metadata:
Data objects have additional properties and information associated with them
called metadata
Dependency:
A Dependency is a data relationship in which one or more Attributes
(fields/columns) determine the value in another Attribute
Key:
A Key is an Attribute that can uniquely identify and associate data, binding the
data together.
Join:
A Join identifies the intersection between two Entities. To identify a Join,
Attributes in one Entity(A)
are compared to Attributes in a second Entity(B). If an Attribute in Entity(A)
is suitable for merging or joining with an Attribute in Entity(B), Profiling
identifies a Join.
Findings:
Findings is the general term given to the documented results of a data discovery
activity.
Metabase:
A Metabase stores data objects. It also stores any information related to the
stored data, called Metadata.
The type of information you can discover about your data includes:
Data compliance with business rules and Data Standard Definitions (DSD)
Note:
1.You must create at least one Metabase for each Repository, but you may create
as many Metabases as you choose within the same Repository.
2.Metabases can be created by any person who has administrative access to the
Metabase Manager tool.
3.This person is referred to as the Metabase Administrator and manages the
Repository installed on your system.
Metabase objects:
attribute
dependency
entity
finding
join
key
metadata
project
1. Projects
2. Entities
3. Analysis
4. Findings
1. Profiling
2. Time Series
3. Data Quality
About the Entity Tab:
The Entity tab contains the list of entities.
Step 2: Select DEL FLATFILES; next, in the Change filter text field, type *.txt and
click on Change filter; then we will get all the txt files.
Step 3: Next select the CUST_INFO text file, because now we are going to create the entity
for this source file; click Next and we will get the below screen.
Step 4: Now in the Delimiter list choose the actual delimiter in your source file
CUST_INFO; here the delimiter is a comma. Next, similarly check whether the
first line contains the column names in the source file; if the first line is the column names, select
the Names on first line radio button. Before clicking Next, click on the Preview
tab and check whether the data is now aligned properly.
Step 5: Click Next once you have viewed the data on the Preview tab, and choose the options for the load
parameters.
Step 6: Click Next and we will get the below screen; next click Finish and we will get
the below second screen; there select Run Later and we will get the third screen; there click
on Cancel. Now the dynamic entity will be created.
Once the dynamic entity is created, first check whether the entity we created above is
dynamic or real.
Step 1: Go to Entities on the Explorer tab, select the Cust Info entity, expand it, choose
Metadata and expand it; now in that list see the Entity Type: Dynamic.
Step 2: Now select the Cust Info entity from Entities in the Explorer and right click on the Cust
Info entity; there you will find the Data load option. Select it and we will get the below
screen.
Step 3: Next keep the file CUST_INFO as it is and click OK; it will give the
below screen. There it asks Run Now; click on Run Now.
Step 4: Once you click on Run Now, check the status of that entity to see whether the data load was
done or not.
Step 5: Next go to Analysis and click on Background Tasks; check that all rows loaded for that
entity.
Step 6: Check whether the status of the entity is Dynamic or Real; usually once the data load is done,
the entity changes from Dynamic to Real.
Step 7: Now follow the below process to check whether the entity is real or dynamic:
go to Entities on the Explorer tab, select the Cust Info entity, expand it, choose
Metadata and expand it; now in that list see the Entity Type: Real.
Investigate Data:
Maximum value:
Step 1: From the Explorer, expand the TrillMedTech project.
Step 2: Expand Order Id under the entity Uk Orders2004.
Step 3: Locate the label named Max.
Example:
The value in parentheses on the right represents the percentage of rows containing that data
type.
Step 5: Double click on Integers to see the integer values for the Phone.
Step 6: Double click on Strings to see the string values for the Phone.
Note: The string values contain a separator (dash or space) between the numbers, and the integer
values do not.
Examine Nulls:
The definition of null is "absence of a value". With TSS, you can locate nulls in your data
at the entity or attribute level.
Soundexes:
The purpose of examining the soundexes is to identify data that contains sound-alike
values; data values that sound alike could be incorrect and indicate a data entry
problem.
Soundex maps characters to a short, 4-character identifier and also helps to identify common
record types. Soundex is particularly useful for grouping short strings, such as first or last
names.
Step 2: From the list of soundexes, double click on the third row, soundex E351.
The List view displays 5 values in the attribute matching E351.
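For readers who want to reproduce this outside TSS: Oracle SQL has a built-in SOUNDEX function, and a hedged sketch like the one below groups sound-alike values the same way (LAST_NAME and CUSTOMER are assumed names, not taken from this walkthrough):

SELECT SOUNDEX(LAST_NAME) AS SOUNDEX_CODE,
       COUNT(DISTINCT LAST_NAME) AS DISTINCT_VALUES,
       COUNT(*) AS ROW_COUNT
FROM CUSTOMER
GROUP BY SOUNDEX(LAST_NAME)
HAVING COUNT(DISTINCT LAST_NAME) > 1   -- sound-alike but spelled differently
ORDER BY DISTINCT_VALUES DESC;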
Metaphones:
A metaphone is similar to a soundex, but it provides much finer granularity, resulting in a
multi-character phonetic pattern.
Notice that for the soundex value of E351, there are 3 different metaphone values.
To Examine Suspicious Metaphones:
You could go through every metaphone that is listed for an attribute, or only look for those
metaphones that look suspicious.
Step 1: In the List view, sort the list by value count (in descending order).
Step 2: Double click on the first row, displaying the metaphone WSTNSPRMR.
The List view displays the values in the attribute that match that metaphone value.
Format Deviations:
Examine patterns and masks to discover format deviations in the data.
Masks:
Masks display character representations of data values in long-hand notation by defining
each character in the data value as a character class (for example, a letter or a digit).
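As a rough SQL approximation of what the Masks view computes, the sketch below maps every digit to '9' and every letter to 'A' and counts rows per resulting mask; ACCOUNT_NUMBER and CUSTOMER_MASTER are assumed names:

SELECT TRANSLATE(UPPER(ACCOUNT_NUMBER),
                 '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                 '9999999999AAAAAAAAAAAAAAAAAAAAAAAAAA') AS MASK,
       COUNT(*) AS ROW_COUNT
FROM CUSTOMER_MASTER
GROUP BY TRANSLATE(UPPER(ACCOUNT_NUMBER),
                 '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                 '9999999999AAAAAAAAAAAAAAAAAAAAAAAAAA')
ORDER BY ROW_COUNT DESC;   -- rare masks at the bottom are likely format deviations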
To List Masks for an Attribute:
Step 1: From Customer Master, expand Account Number.
Step 2: Double click on the label Masks.
Step 3: Right click in the List view and select Bookmark.
Step 4: Type the bookmark name as Issue: Out of Range Order Id.
Step 7: Click OK if your bookmark looks like the following screenshot.
Now verify that in the Customer Master entity the Country attribute has the values
USA, UK, DEU, CAN.
Step 1: Under Customer Master, expand the Attributes folder.
Step 2: Locate the attribute named Country.
During data migration, data quality defects may need to be repaired, or values must be
standardized to meet corporate requirements; this can be achieved through phrase analysis,
value recodes, and generation of recode tables, with which you can standardize data to meet
corporate needs.
Step 3: Place a check next to the option Phrases between, keeping the default settings of
between 1 and 5 words per phrase.
Step 5: Click OK.
Step 7: Check the Background Tasks to know when the task completes.
A corporate decision should be made on how to standardize the data, and the data entry
application should only accept the three values for Account Status. Based on the most
occurring phrase you will standardize these values as PENDING, ACTIVE, INACTIVE, as the
sketch after this paragraph shows.
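A minimal sketch of that standardization decision as a SQL CASE expression; the variant spellings shown are illustrative assumptions, not the actual phrase analysis output:

SELECT ACCT_STATUS,
       CASE
         WHEN UPPER(TRIM(ACCT_STATUS)) IN ('PEND', 'PENDING', 'PNDG') THEN 'PENDING'
         WHEN UPPER(TRIM(ACCT_STATUS)) IN ('ACT', 'ACTIVE')           THEN 'ACTIVE'
         WHEN UPPER(TRIM(ACCT_STATUS)) IN ('INACT', 'INACTIVE')       THEN 'INACTIVE'
         ELSE ACCT_STATUS   -- leave unrecognized values for manual review
       END AS ACCT_STATUS_RECODED
FROM CUSTOMER_MASTER;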
Recode Values:
We have already examined the results of the phrase analysis above; now we need to specify
the standard values and generate a recode table.
Step 1: Under Customer Table, Account Status, double-click Unique values.
If we take a look at the above screenshot, under Value we are able to see the different
values, so now we are going to recode the below values to the recode value.
Step 2: Right click on the highlighted rows and select Recode values.
Similarly, follow the same process and do the recode wherever a recode is required.
Step 6: Examine the Value Recode and Value Recoded? columns and make sure that the
recoded values are correct for each value listed.
Step 7: If there are any discrepancies in the above data, you can correct them with the same
technique as before.
Generate a Recode Value Look-Up Table:
You will generate a look-up table of recoded values that will be referenced from a
Quality project created later in this class to standardize Acct Status values to PENDING,
ACTIVE, INACTIVE.
Generate a recode value look-up table:
You can save your recode values as a physical file external to TSS, ready for use by TSS
Quality projects or any other applications, as the sketch below illustrates.
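A hedged sketch of what such an external recode look-up table and its use might look like in SQL; the table layout is an assumption, not the exact TSQ export format:

CREATE TABLE ACCT_STATUS_RECODE (
  RAW_VALUE    VARCHAR2(30) PRIMARY KEY,
  RECODE_VALUE VARCHAR2(30) NOT NULL
);

SELECT c.ACCT_STATUS,
       NVL(r.RECODE_VALUE, c.ACCT_STATUS) AS ACCT_STATUS_STD
FROM CUSTOMER_MASTER c
LEFT JOIN ACCT_STATUS_RECODE r
  ON r.RAW_VALUE = c.ACCT_STATUS;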
Step 1: From Unique values for Acct Status, right-click on any row and select Export as
TSQ Recodes...
Note: The above file will be created in the export directory beneath the metabase directory
on the TSS Server.
Step 3: Next to Export, make sure Values is selected.
Step 4: Next to Field width, select Auto.
Step 5: Next to Encoding, select US-ASCII.
Step 6: Make sure that your selections look like the next screenshot.
Step 7: Click OK.
Step 8: Click Run Now.
Here in the above path two files should have been created; examine each file using
Notepad.
Investigate: Keys & Dependencies:
Key Analysis:
-->Key analysis is applicable only to real entities
-->Key analysis can be done in two ways:
1. Discover keys
2. Create keys
Method 1: Discover keys
1. Performed on data load, on a sample of the first 10,000 rows
2. Attribute(s) identified as a potential key if >= 98% unique (see the sketch below)
3. Can be disabled for entity creation;
to disable, we need to log in to Metabase Manager, select Control Admin, right
click on the loader connection, and select Edit loader settings.
Now select the check box which appears in the above screen.
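The >= 98% uniqueness test behind Discover Keys can be approximated in SQL as below; CUSTOMER_MASTER and ACCOUNT_NUMBER are assumed names:

SELECT COUNT(DISTINCT ACCOUNT_NUMBER) AS DISTINCT_VALUES,
       COUNT(*) AS TOTAL_ROWS,
       ROUND(100 * COUNT(DISTINCT ACCOUNT_NUMBER) / COUNT(*), 2) AS PCT_UNIQUE
FROM CUSTOMER_MASTER;
-- a PCT_UNIQUE of 98 or more would flag ACCOUNT_NUMBER as a potential key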
Step 2: Select the option Dependencies and Customer Master from the drop-down
menu and click Next.
Step 7: Now expand Customer Master and its sub-folder Metadata.
Now look for the dependency that you just created between Post Code and City; notice
that Post Code predicts City 68.44% (the sketch below shows one way to compute such a
figure). Next, explore the conflicting values to find data problems.
Step 9: From the List view, double click on the dependency between Postcode and City.
Step 11: Scroll through the list and find the post code W1C2PW.
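A figure like "Postcode predicts City 68.44%" can be approximated as the share of rows whose postcode maps to that postcode's most frequent city; the rough sketch below assumes a CUSTOMER_MASTER table with POSTCODE and CITY columns:

SELECT ROUND(100 * SUM(TOP_CITY_ROWS) / SUM(TOTAL_ROWS), 2) AS PCT_PREDICTED
FROM (SELECT POSTCODE,
             MAX(CNT) AS TOP_CITY_ROWS,   -- rows agreeing with the most common city
             SUM(CNT) AS TOTAL_ROWS
      FROM (SELECT POSTCODE, CITY, COUNT(*) AS CNT
            FROM CUSTOMER_MASTER
            GROUP BY POSTCODE, CITY)
      GROUP BY POSTCODE);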
A DSD tests data compliance at the attribute level and highlights data that does not comply
with your corporate standards.
Quality Requirement:
Now we will see how to define all DSDs for the attribute Order Id.
Step 1: Go to the Uk Orders2004 entity, right click Order Id, and select Edit DSD.
To verify the number of rows passed and failed, click on Failing; it will display the
rows which did not meet the condition defined.
To see the complete row for a failing value, just double click on the selected row.
Similarly, next go to the Pattern tab in the same window, enable the pattern test, and
define the pattern which we are expecting;
to see the complete row for the above value, double click on the value.
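Hedged SQL equivalents of these attribute-level tests for Order Id; the 30000-99999 range comes from the notes section at the end of this document, and UK_ORDERS2004/ORDER_ID are assumed physical names:

SELECT ORDER_ID
FROM UK_ORDERS2004
WHERE ORDER_ID IS NULL                        -- null check
   OR ORDER_ID NOT BETWEEN 30000 AND 99999;   -- range check (assumes a numeric column;
                                              -- the format itself is what the Pattern tab tests)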
Data Quality Problems:
Here in the above example, if we look into the CUSTID field, data is missing; this
is critical.
Now we will see a few business data standard examples for CUSTID:
1. Data type of customer ID should be varchar
2. Length of the customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. Format of customer ID should be 'A999'
5. Customer ID value should only be in the range A100 to A400
6. Customer ID cardinality should be 100%
Now click on the Failing tab to see the values that are null in this attribute.
SELECT CUSTID,
       CUSTOMER_NAME,
       CUSTOMER_STATE,
       COUNTRY,
       ZIPCODE,
       EMAIL_ADDRESS
FROM   CUSTOMER
WHERE  CUSTOMER_NAME LIKE '%$%'
   OR  CUSTOMER_NAME LIKE '%#%';
Here in the above example one record did not comply with the data standards.
Step 1: To define a business rule, first go to the entity Cust Info, expand it,
select Business Rules, right click, and click on Add Business Rule; then the below
screen will come up.
Here, in the Name text field, CUSTSTATE_VALIDITY is the business rule name, and in the
right-side text box an arrow mark indicates the logic which we used in this business rule.
Logic Used:
IF (Country = "INDIA") THEN ([Cust State] IN ("AP", "AC", "AS", "BI", "CH", "CG",
"DN", "DD", "DL", "GO", "GU", "HA", "HP", "JK", "JH", "KA", "KE", "LK", "MP",
"MH", "MN", "ME", "MZ", "NA", "OR", "PD", "PB", "RJ", "SI", "TN", "TR", "UP",
"UA", "WB"))
Step 2: Once you write the business rule, save and run it; it takes several
minutes for this business rule job to finish running.
Step 3: Once the job run completes, select Business Rules Passed from the entity,
check whether the result status is fail or pass, and check the passing fraction.
Step 4: If you want to see the actual rows which fail the business condition we
defined, simply double click on the business rule name, i.e. CUSTSTATE_VALIDITY;
then one more new window will be opened, and in that window's List view we will see the
rows which failed.
Actually, here in the above screenshot, if we look at the Cust State value, it is TX;
this state code does not exist at all in the Indian state code list, which is why this record
failed.
SELECT CUSTID,
       CUSTOMER_NAME,
       CUSTOMER_STATE,
       COUNTRY,
       ZIPCODE,
       EMAIL_ADDRESS
FROM   CUSTOMER
WHERE  COUNTRY = 'INDIA';
Here, in the Name text field, CNTRYCD_VALIDITY is the business rule name, and in the
right-side text box an arrow mark indicates the logic which we used in this business rule.
Logic Used:
Country = "INDIA"
Step 2: Once you write the business rule, save and run it; it takes several
minutes for this business rule job to finish running.
Step 3: Once the job run completes, select Business Rules Passed from the entity,
check whether the result status is fail or pass, and check the passing fraction.
Step 4: If you want to see the actual rows which fail the business condition we
defined, simply double click on the business rule name, i.e. CNTRYCD_VALIDITY;
then one more new window will be opened, and in that window's List view we will see the
rows which failed.
Actually, here in the above screenshot, we did not get any rows, the reason being that the
passing fraction is 100%.
Here in this example also 100% of the data complies with the data standards defined.
Note: if we check this address data against reference data from a reference DB, we might
find a few data issues.
Step 2: Once you write the business rule, save and run it; it takes several
minutes for this business rule job to finish running.
Step 3: Once the job run completes, select Business Rules Passed from the entity,
check whether the result status is fail or pass, and check the passing fraction.
Step 4: If you want to see the actual rows which fail the business condition we
defined, simply double click on the business rule name, i.e. ZIPCODE_FORMATCHECK;
then one more new window will be opened, and in that window's List view we will
see the rows which failed.
Actually, here in the above screenshot, we did not get any rows, the reason being that the
passing fraction is 100%.
Create a Business Rule:
Quality requirement: when CC On File is Y, the Payment Method value is CREDIT
CARD.
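Before building the rule in the UI, here is a hedged SQL sketch of the rows it should flag; CC_ON_FILE and PAYMENT_METHOD are hypothetical column names standing in for the attributes shown in the screenshots:

SELECT *
FROM UK_ORDERS2004
WHERE CC_ON_FILE = 'Y'
  AND (PAYMENT_METHOD IS NULL OR PAYMENT_METHOD <> 'CREDIT CARD');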
Step 1: Select the entity Uk Orders2004 and its sub-folder Metadata.
Step 4: Refer to the next screenshot to enter a Business Rule name, description, and the
percentage of rows that should pass this test.
Step 5: In the expression elements, select Attributes from the first column.
Step 9: Click the AND button to insert the AND clause, and your expression should
look like this.
Step 3: From the List view, double click the Business Rule to view all failed rows.
Oracle Data Profiling: Day 5 Training Document:
Join Analysis:
Trillium Software has two types of join analysis:
Discover Joins: Allows you to find potential joins by selecting the entities that you want to
join together along with the attributes on which to join them, and to assess the suitability
of the data for integration or cleansing needs; the join can be performed on all attributes,
regardless of whether they are keys or not.
Step 3: Expand UkOrders2004.
Note: Notice that all of the attributes have a check next to them, indicating that TSS will
try to find joins between all attributes.
Step 6: If your choices look like the below screenshot, click OK.
Step 3: Expand the folder Joins and the sub-folder Join Analysis Results.
Step 4: Expand the Join Analysis Results named Uk Orders1 2004 and Products1.
There was 1 join discovered.
The List view displays all Product Ids that have not been ordered in the UK.
Step 10: Double click on (3) to see the corresponding rows, i.e. the details (rows) of
unordered products.
Step 12: In the List view, right click on the discovered join and select Set Permanent.
Note: Once we set it as Permanent, we can see the Entity Relationship Diagram of this join
analysis.
Step 4: Because Uk Orders1 2004 only contains order information for the UK, you must
filter Customer Master1 prior to joining.
Step 9: Click Account Number on the left and Account Id on the right.
Step 11: Ensure that the join looks like the picture below and click Finish.
Step 12: Complete the screen using the picture below and click Finish.
Step 13: Click Run Now to schedule the job.
Step 14: View the Background Tasks to determine when the Check Join and Execute Join
tasks complete.
Step 5: Right click and select Venn Diagram to examine the results.
Here in the below Venn diagram there are 62 unique Account Numbers in Customer
Master1 that are not contained in the Uk Orders1 2004 file.
Step 6: Double click on 62 on the left-hand side of the Venn diagram to examine those
Account Numbers not found in Uk Orders1 2004.
Step 7: Highlight a few of the rows, right click, and select Drill down to Highlighted Left
Not Matching Rows.
Step 8: From the Venn diagram, double click 2 on the right side; these Account Ids do not
have a corresponding Account Number specified in Customer Master1.
Step 9: Double click on each value to examine the corresponding row; the value that starts
with ZZZ might indicate that this was a test record inadvertently left in the data.
Step 10: To view all of the matching rows, double click (170) in the center segment of the
Venn diagram.
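The three Venn segments (62 left-only, 2 right-only, 170 matching in this example) can be reconstructed with a FULL OUTER JOIN; the table and key names below are assumptions based on this walkthrough:

SELECT COUNT(CASE WHEN o.ACCOUNT_ID IS NULL THEN 1 END)     AS LEFT_ONLY,   -- 62 here
       COUNT(CASE WHEN c.ACCOUNT_NUMBER IS NULL THEN 1 END) AS RIGHT_ONLY,  -- 2 here
       COUNT(CASE WHEN c.ACCOUNT_NUMBER IS NOT NULL
                   AND o.ACCOUNT_ID IS NOT NULL THEN 1 END) AS MATCHING     -- 170 here
FROM (SELECT DISTINCT ACCOUNT_NUMBER FROM CUSTOMER_MASTER1) c
FULL OUTER JOIN (SELECT DISTINCT ACCOUNT_ID FROM UK_ORDERS1_2004) o
  ON c.ACCOUNT_NUMBER = o.ACCOUNT_ID;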
Step 2: Right click on the join Customer Master1 >-< Uk Orders1 2004 and select Entity
Relationship Diagram.
To View the ERD for Multiple Joins:
Step 1: From the Analysis tab, double click on the Permanent Joins folder.
Step 1: While you are still in the ERD, right click anywhere in the white space of the diagram
and select Export.
Step 2: Select the save-in location as Desktop.
Now we will follow the above process and generate the DDL for the below diagram
DDL'S of above ERD:
-- DDL Export File Generated by TSS.
-- Mon May 12 9:32:12 AM US Eastern Daylight Time 2014
Here in the above example, if we look into the CUSTID field, data is missing; this
is critical.
Now we will see a few business data standard examples for CUSTID:
1. Data type of customer ID should be varchar
2. Length of the customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. Format of customer ID should be 'A999'
5. Customer ID value should only be in the range A100 to A400
6. Customer ID cardinality should be 100%
Input:
Output:
JOB Design:
Validation:
Business data standards for Customer Name:
Now we will see a few business data standard examples for CUSTNAME:
1. Data type of customer name should be char
2. Length of the customer name should be 10
3. In Customer Name the following characters {$,#,?,*,&,@} will not be acceptable
SELECT CUSTID,
       CUSTOMER_NAME,
       CUSTOMER_STATE,
       COUNTRY,
       ZIPCODE,
       EMAIL_ADDRESS
FROM   CUSTOMER
WHERE  CUSTOMER_NAME LIKE '%$%'
   OR  CUSTOMER_NAME LIKE '%#%';
Business data standards for State Code in INDIA:
1. Data type of customer state should be char
2. Length of the customer state should be 2
3. If country code = INDIA then state code should be in {AP, AC, AS, BI, CH, CG, DN,
DD, DL, GO, GU, HA, HP, JK, JH, KA, KE, LK, MP, MH, MN, ME, MZ, NA, OR, PD,
PB, RJ, SI, TN, TR, UP, UA, WB}
Here in the above example one record did not comply with the data standards.
Problem Identification through a DataStage job (a SQL sketch of the same check follows the placeholders below):
Input:
Reference data:
State rejected data:
Output data:
Job:
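A SQL sketch of the reference-data check this DataStage job performs; REF_STATE_CODES is an assumed reference table holding the valid state codes listed above:

SELECT c.*
FROM CUSTOMER c
WHERE c.COUNTRY = 'INDIA'
  AND c.CUSTOMER_STATE NOT IN (SELECT r.STATE_CODE FROM REF_STATE_CODES r);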
Business data standards for COUNTRY:
1. Data type of country should be char
2. Length of the country should be 5
3. Format should be 'AAAAA'
4. All values in the country field should be 'INDIA'
SELECT CUSTID,
       CUSTOMER_NAME,
       CUSTOMER_STATE,
       COUNTRY,
       ZIPCODE,
       EMAIL_ADDRESS
FROM   CUSTOMER
WHERE  COUNTRY = 'INDIA';
Here in this example also 100% of the data complies with the data standards defined.
Note: if we check this address data against reference data from a reference DB, we might
find a few data issues.
Problem Identification through IA column analysis:
Here in Information Analyzer the field has only one format, i.e. 999999.
Here in the above example email_address 2 has 2 @ signs, which does not comply with the
business data standards; similarly, the third email address has two periods, and the 4th
email address contains an invalid domain.
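Hedged SQL versions of these email findings; "two periods" is interpreted here as consecutive periods, and the final regular expression is only a basic format test, not a full RFC check:

SELECT CUSTID, EMAIL_ADDRESS
FROM CUSTOMER
WHERE LENGTH(EMAIL_ADDRESS) - LENGTH(REPLACE(EMAIL_ADDRESS, '@')) <> 1   -- must contain exactly one '@'
   OR EMAIL_ADDRESS LIKE '%..%'                                          -- consecutive periods
   OR NOT REGEXP_LIKE(EMAIL_ADDRESS, '^[^@]+@[^@]+\.[A-Za-z]{2,}$');     -- basic shape check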
Problem Identification through the QualityStage Standardize stage using the email rule set:
Input:
Output:
Invalid email addresses:
Job:
There are a number of tools and techniques that can help prevent data quality issues from
leading to big consequences:
2. Select Data Quality and go to Tasks; under Tasks find New Data Rule Definition.
3. Click on New Data Rule Definition; then the below window will pop up.
4. Click on Overview and provide the data rule name in the Name text field and a short
description; the long description is optional.
1. Condition:
IF
THEN
ELSE
AND
OR
NOT
2. Parentheses: (, ((, (((
Example:
Once we have written the logic, we have to validate whether the logic is syntactically
correct or not; if the logic is correct, then we have to click on Save and Exit.
3. Source Data
Here source data is a field name.
4. Condition
Not
5. Type of check
=
>
<
>=
<=
<>
Contains --> string containment
Exists --> null value test
Matches_Format --> if country = India then zip code format = '999999'
Matches_Regex
occurs
occurs>
occurs>=
occurs<
occurs<=
In_Reference_Column
In_Reference_List
Unique
Is_Numeric
Is_Date
6. Reference Data:
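The Matches_Format example above ('999999' when country is India) can be sketched in SQL as below, reusing the CUSTOMER table from earlier sections:

SELECT CUSTID, ZIPCODE
FROM CUSTOMER
WHERE COUNTRY = 'INDIA'
  AND NOT REGEXP_LIKE(ZIPCODE, '^[0-9]{6}$');   -- six digits expected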
Steps for Getting Started with Oracle Data Profiling and Quality:
Before we start working with Oracle Data Profiling and Quality, we need to follow
the below steps.
Step 2: Log on to the Oracle Data Profiling and Quality user interface and familiarize
yourself with the user interface.
Step 3: Prepare for importing your data by deciding whether all or only part of the data
source will be imported.
Step 4: Create an entity.
Step 5: Create a project.
Step 4: Add a Metabase named Test, with the default pattern (e.g. $Abc pa4.) and a
Public Cache Size of 200 MB, and then click OK.
Step 3: Add the user bhaskar with the password bhaskar123 as shown below, and click OK.
Step 5: To verify whether the user which we added was created or not, go to Users and
double click on Users; the List view will show you all the users created by the
Metabase Administrator.
Step 3: Right click on Metabase Users; the List view will display the list of users
created by the Metabase Administrator.
Step 4: Now select the user which we created above, "bhaskar", and right click on the
user; there you will find Add. Click on Add; one page will be opened. Select the user
"bhaskar" in the UserName list, assign the Metabase "test" to the above user, and
then click OK.
Now your settings should look like the below screenshot.
Create Loader Connections:
Step 1: Log into Metabase Manager using Metabase Administrator privileges.
Step 2: Go to Loader Connections and right click on Loader Connections; there you will
find Add Loader Connection. Click on Add Loader Connection; one window will be
opened. Provide the loader connection name in the Name text field ("Flatfile"), provide
the description in the Description text field ("Loader connection for flat files"), and
provide the Data Directory, Data Extension, Schema Directory, and Schema Extension
information. Now your settings should look like below.
Step: Click OK.
Usually these are all the different data files and schema files that will be used for entity
creation for different data sources.
Step 2: Double click on Loader Connections; in the List view we will see all the loader
connections created by the administrator.
Step 3: Right click in the List view and select Bookmark.
Step 4: Type the bookmark name as Issue: Out of Range Order Id.
Step 7: Click OK if your bookmark looks like the following screenshot.
Process to Create Notes:
Note: Here in Order Id the value '0' is out of range; the actual expected value range is
30000 to 99999, so here I found one issue, namely that the value of one Order Id is '0'.
Now for this issue we need to create a note.
Step 4: Click OK.
Step 1: Go to Findings in the Explorer and select Notes; there we will see all the notes
which we created for the attributes.