
Oracle Data Profiling Training Document:

Day1:
Introduction to Trillium and preparing the data load
1. Basic information about Trillium: version, Trillium architecture, and the equivalent Oracle software (ODQ & ODP)
2. Entity creation
3. Verification of entity creation
4. Loading different types of files into the Trillium environment (flat files, fixed-width files, COBOL files)
5. Applying a filter during entity creation
6. Profile project creation
7. How to add an entity to a profile project; deletion, renaming, etc.

Day2:
Investigate data content:
1. Value range: minimum and maximum values
2. Document findings
3. Bookmarks: public and private bookmarks
4. Notes
5. Examine data types
6. Examine nulls
7. Misspellings and data entry issues: Soundexes, Metaphones
8. Format deviations: masks, patterns
9. Numerical attributes for an attribute
10. Examine value sets and duplicate values

Day3:
Define standards and recode values:
1. Attribute phrase analysis
2. Value recoding
3. Creation of recode tables
Key analysis and dependency analysis:
4. Keys: discover keys, create keys
5. Dependencies: discover dependencies, create dependencies

Day4:
Define Standards:
1.Simple compliance tests:
Predominant Data type check
Sum check
Schema data type check
Spaces check
Pattern check
Value check
Null check
Range check
Schema Length check
Unique check

2. Business Rules:
To create a business rule
To review results from business rule analysis
To edit a business rule

Day5:
Investigate relationships:
1. Find potential joins
2. Join analysis: review join analysis results
3. Validate expected joins
4. Entity Relationship Diagram (ERD)
5. View ERD for multiple joins
6. Export DDLs from an ERD
7. Generate DDLs from an entity

Day6:
Metabase Manager admin tasks:

1. Create a Metabase
2. Create users
3. Add users to a Metabase
4. Create loader connections
5. Enable and disable key analysis and dependency analysis
6. Create a project, add an entity to a project, delete a project
7. Create notes on attributes and create bookmarks for issue attributes
Introduction About Trillium Software:
Trillium Software is a leading software and service provider for data quality, data governance, data cleansing, and data profiling.

Trillium Equivalent software:


Oracle data quality products:
1. Oracle Data Profiling
2. Oracle Data Quality for Data Integrator

Architecture of Oracle Data Quality:

Oracle Data Profiling and Quality share a single user interface through which
you can monitor and manage:
■Data resources and connections
■Core data functions and services
■User-defined projects
■Data business and governance rules
■Repository data objects
■Batch and real-time data process results
About data quality:
Good-quality data means that all master data is complete, consistent, accurate, time-stamped, and conforms to industry standards.

Defining data quality:

1. Completeness
2. Standards-based
3. Consistency
4. Accuracy
5. Time-stamped

Completeness: all required values are electronically recorded

Standards-based: data conforms to industry standards

Consistent: data values are aligned across systems

Accurate: data values are right, at the right time

Time-stamped: the validity time frame of the data is clear

Data Profiling:
-->Data profiling is the process of examining the data available in an existing data source
-->A data source is usually a database or a file
-->By doing data profiling we can collect statistics and information about the data
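
For example, many of these statistics can be gathered with plain SQL even before a profiling tool is involved. A minimal sketch, assuming a CUSTOMER table with a CUSTID column (illustrative names only):

-- Basic profile of one column: row count, nulls, distinct values, min/max
SELECT COUNT(*) AS row_count,
       COUNT(*) - COUNT(CUSTID) AS null_count,
       COUNT(DISTINCT CUSTID) AS unique_count,
       MIN(CUSTID) AS min_value,
       MAX(CUSTID) AS max_value
FROM CUSTOMER;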

Data Profiling Tools:


1) Informatica Data Explorer 8x
2) Informatica PowerCenter 8x (profiling option in Source Analyzer)
3) Trillium Discovery
4) Oracle Data Quality and Data Profiling
5) SQL Server Integration Services (Data Profiling Task)
6) IBM InfoSphere (Information Analyzer)

Data Quality Problems:

In the above example, if we look into the CUSTID field, data is missing; this is critical.
Business data standards for Customer ID

Now we will see a few example business data standards for CUSTID:
1. The data type of Customer ID should be varchar
2. The length of Customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. The format of Customer ID should be 'A999'
5. Customer ID values should be only in the range A100 to A400
6. Customer ID cardinality should be 100%
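
Each of these standards can be expressed as a profiling query. A sketch in SQL, assuming the CUSTOMER table used in the examples below (REGEXP_LIKE is Oracle syntax):

-- Rows violating standard 3 (null, empty, or spaces;
-- in Oracle, TRIM of an all-blank string yields NULL)
SELECT * FROM CUSTOMER
WHERE CUSTID IS NULL OR TRIM(CUSTID) IS NULL;

-- Rows violating standard 4 (format 'A999') or standard 5 (range A100 to A400)
SELECT * FROM CUSTOMER
WHERE NOT REGEXP_LIKE(CUSTID, '^A[0-9]{3}$')
   OR CUSTID NOT BETWEEN 'A100' AND 'A400';

-- Standard 6 (cardinality): this ratio should be 1, i.e. 100% unique
SELECT COUNT(DISTINCT CUSTID) / COUNT(*) AS cardinality FROM CUSTOMER;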

Problem Identification through Trillium Discovery:


● Select the attribute Cust Id, right click, and choose DSD; the DSD window opens at the Null check tab
● Now click on the Failing tab to see the values that are null in this attribute
Features of Oracle Data Quality and Profiling:
1. Attribute analysis
2. Phrase analysis
3. Key analysis
4. Dependency analysis
5. Join analysis
6. Business rules
Terms and Terminology used in Oracle Data quality and Data Profiling:
Repository:
A Repository contains a collection of one or more Metabases.
A Repository has its own group of users, loader connections, and security and performance settings.
Entity:
An Entity is a file or table stored in your Metabase and associated with a data source that you've identified.
Attribute:
An Attribute is a field or column in an Entity.
An Attribute cannot exist in a Metabase without an Entity.
Metadata:
Data objects have additional properties and information associated with them, called metadata.

Dependency:
A Dependency is a data relationship in which one or more Attributes (fields/columns) determine the value in another Attribute.
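
In SQL terms, a dependency Postcode -> City holds when no postcode maps to more than one city. A sketch of a violation check (table and column names are illustrative, mirroring the Customer Master example used later):

-- Postcodes that determine more than one city violate the dependency
SELECT POSTCODE, COUNT(DISTINCT CITY) AS city_count
FROM CUSTOMER_MASTER
GROUP BY POSTCODE
HAVING COUNT(DISTINCT CITY) > 1;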

Key:
A Key is an Attribute that can uniquely identify and associate data, binding the
data together.
Join:
A Join identifies the intersection between two Entities. To identify a Join, Attributes in one Entity (A) are compared to Attributes in a second Entity (B). If an Attribute in Entity (A) is suitable for merging or joining with an Attribute in Entity (B), Profiling identifies a Join.
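
Conceptually, a discovered Join corresponds to the overlap between the value sets of two attributes; the Venn diagrams shown later in this document visualize exactly this. A sketch in SQL (entity and attribute names are illustrative, echoing the Day 5 example):

-- Intersection: values present in both Entities
SELECT DISTINCT A.PRODUCT_ID
FROM UK_ORDERS A
JOIN PRODUCT_MASTER B ON A.PRODUCT_ID = B.ITEM_NUMBER;

-- Right-only segment: values in Entity B with no match in Entity A
SELECT B.ITEM_NUMBER
FROM PRODUCT_MASTER B
LEFT JOIN UK_ORDERS A ON A.PRODUCT_ID = B.ITEM_NUMBER
WHERE A.PRODUCT_ID IS NULL;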

Findings:
Findings is the general term given to the documented results of a data discovery
activity.

Metabase:
A Metabase stores data objects. It also stores any information related to the stored data, called metadata.

The type of information you can discover about your data includes:

Data structures, contents, and relationships

Data compliance with business rules and Data Standard Definitions (DSD)

Data statistics, drill-down details, and data patterns

Data trends and changes over time

Data quality processing and results

Documentation of data observations, compliance issues, and more

Note:
1. You must create at least one Metabase for each Repository, but you may create as many Metabases as you choose within the same Repository.
2. Metabases can be created by any person who has administrative access to the Metabase Manager tool.
3. This person is referred to as the Metabase Administrator and manages the Repository installed on your system.

Metabase objects:
attribute
dependency
entity
finding
join
key
metadata
project

About the Explorer tab in Oracle Data Quality and Data Profiling:

List of tabs in the Explorer

There are 4 tabs in the Explorer:

1.Projects
2.Entities
3.Analysis
4.Findings

About Project Tab:


The Projects tab gives more information about the types of projects:

1.Profiling
2.Time Series
3.Data Quality
About Entity Tab:
The Entities tab contains the list of entities.

About Analysis Tab:


The Analysis tab gives more information about the different types of analysis:
1.Join
2.Keys
3.Dependency

About Finding tab:


The Findings tab contains notes and bookmarks.
1.Create an Entity from a Delimited Source:

Step1: Go to Analysis and click on Create Entity.

Step2: Select DEL FLATFILES; in the Change filter text field type *.txt and click on Change filter. All the .txt files are listed.

Step3: Select the CUST_INFO text file, since we are going to create the entity for this source file, then click Next to get the next screen.

Step4: In the Delimiter list, choose the actual delimiter in your source file CUST_INFO; here the delimiter is a comma. Similarly, check whether the first line of the source file contains the column names; if it does, select the Names on first line radio button. Before clicking Next, click on the Preview tab and check whether the data is aligned properly.

Step5: Click Next once you have viewed the data on the Preview tab, and choose the options under load parameters.

Step6: Click Next to get the next screen, then click Finish; on the screen that follows select Run Later, and on the third screen click Cancel. The dynamic entity is now created.

2. Run the Dynamic entity:

Once the dynamic entity is created, first check whether the entity we created above is dynamic or real.
Step1: Go to Entities on the Explorer tab, select the Cust Info entity, expand it, choose Metadata and expand it; in that list see the Entity Type: Dynamic.

Step2: Now select the Cust Info entity under Entities on the Explorer tab and right click on it; select the Data load option to get the next screen.

Step3: Keep the file CUST_INFO as it is and click OK; on the screen that follows, click Run Now.

Step4: Once you click Run Now, check the status of that entity to see whether the data load completed.
Step5: Next go to Analysis, click on Background Tasks, and check that all rows were loaded for that entity.

Step6: Check whether the status of the entity is Dynamic or Real; usually, once the data load is done, the entity changes from Dynamic to Real.
Step7: Follow this process to check whether the entity is real or dynamic:
go to Entities on the Explorer tab, select the Cust Info entity, expand it, choose Metadata and expand it; in that list see the Entity Type: Real.

Create a profiling project:

Follow the navigation below to create a profiling project:
right click on Profiling --> click on Create Project, then in the screen that appears provide the project name in the Name text field (example: Test) and provide a description for the project.
Add Entity to Project:
Follow the navigation below to add an entity to the profile project:
right click on Profiling --> click on Create Project, then in the screen that appears provide the project name and description, select an entity from the entity list, and click OK.
Oracle Data Profiling: Day2 Training Document:

Investigate Data:

Minimum and Maximum Values:

Step1: From the Explorer, expand the TrillMedTech project.
Step2: Expand Order Id under the entity Uk Orders2004.

Step3: Locate the label named Min.

To Find Minimum Values:
Step1: Select Order Id and double click on Min.
Step2: In the list view, locate the column named Frequency.

To View rows containing the Min Values:

Step1: In the list view, double click on the row with a Value of 0. This is the list of all rows that have '0' as the Order Id.
Note: Let's add a note as a reminder that this issue exists.

Maximum value:
Step1: From the Explorer, expand the TrillMedTech project.
Step2: Expand Order Id under the entity Uk Orders2004.
Step3: Locate the label named Max.

To Find Maximum Values:

Step1: Select Order Id and double click on Max.
Step2: In the list view, locate the column named Frequency.

To View rows containing the Max Values:

Step1: In the list view, double click on the row with a Value of 33207. This is the list of all rows that have '33207' as the Order Id.
Examine Data Types:
Each Attribute has at least one data type. Because TSS analyzes and compares data from a variety of data sources, TSS relies on three generic terms to describe data types: Integer, String, and Decimal.

Example:

To Examine the Data types of an Attribute:

Step1: Open the project and select Uk Orders2004.
Step2: Expand the Attributes folder and the attribute Phone.
Step3: Look for the labels named Strings, Integers, and Decimals.
The value to the right of the data type label is the count of unique values corresponding to that data type.

The value in parentheses on the right represents the percentage of rows containing that data type.
Step4: Double click on Integers to see the integer values for Phone.

Step5: Double click on Strings to see the string values for Phone.
Note: the strings contain a separator (dash or space) between the numbers, and the integer values do not.

Examine Nulls:
The definition of null is "absence of a value". With TSS, you can locate nulls in your data at the entity or attribute level.

To View the Number of Nulls In an Attribute:

Business requirement: Phone is a required field and should be populated with a value.
Step1: Open the attribute Phone under the Customer Master entity and look for the label Null count.
Step2: Double click on Null count to see the rows that contain nulls in the Phone field.

To View rows in an Entity that contain Nulls:

A row may contain a null for any attribute value. The entity metadata row value counts can help you pinpoint those rows containing nulls.

Step1: Under Customer Master, expand the folder Metadata.

Step2: Double click on Row value count.
Misspellings and Data Entry Issues:
You can identify potential data entry issues and misspellings through Soundexes and Metaphones.

Soundexes:
The purpose of examining soundexes is to identify data that contains sound-alike values; data values that sound alike could be incorrect and indicate a data entry problem.

Soundex maps characters to a short, 4-character identifier and also helps to identify common record types. Soundex is particularly useful for grouping short strings, such as first or last names.
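
Oracle SQL exposes the same idea through its SOUNDEX function, so the check can be cross-verified outside the tool. A sketch that groups sound-alike city values (assuming a CUSTOMER_MASTER table as in the exercise below):

-- Soundex codes shared by more than one distinct value are worth inspecting
SELECT SOUNDEX(CITY) AS soundex_code,
       COUNT(DISTINCT CITY) AS value_count
FROM CUSTOMER_MASTER
GROUP BY SOUNDEX(CITY)
HAVING COUNT(DISTINCT CITY) > 1
ORDER BY value_count DESC;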

To List Soundexes for an attribute:

Step1: Go to the TrilMedTech project on the Projects tab.
Step2: Expand Customer Master and the attribute City.

Step3: Locate the metadata label named Soundexes.

Here City has 178 Soundexes.
Step4: Double click on Soundexes.
To Examine Suspicious Soundexes:
You could go through every soundex that is listed for an attribute, or only look for those soundexes that look potentially suspicious: those with a value count greater than 1.
Step1: In the list view, sort on value count in descending order.

Step2: From the list of soundexes, double click on the third row, soundex E351.
The list view displays 5 values in the attribute matching E351.

Metaphones:
A metaphone is similar to a soundex, but it provides much finer granularity, resulting in a multi-character phonetic pattern.
Notice that for the soundex value of E351, there are 3 different metaphone values.
To Examine Suspicious Metaphones:
You could go through every metaphone that is listed for an attribute, or only look for those metaphones that look suspicious.

Step1: In the list view, sort by value count (in descending order).

Step2: Double click on the first row displaying the metaphone WSTNSPRMR.
The list view displays the values in the attribute that match that metaphone value.

Format Deviations:
Examine patterns and masks to discover format deviations in the data.
Masks:
Masks display character representations of a data value in long-hand notation by defining a code for each character in the data value.
To List Masks for an Attribute:
Step1: From Customer Master, expand Account Number.
Step2: Double click on the label Masks.

To Examine Suspicious Masks:

Masks that have a low frequency indicate potential format deviations.
Patterns:
Patterns display character representations of a data value in short-hand notation by counting the number of characters represented by the code and displaying that count next to the code.
Notes:
Examining patterns is extremely helpful for identifying format deviations in your data. If you have an attribute that should conform to a fixed format, such as dates or financial information, examine the patterns to find the anomalies.
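
A comparable long-hand mask can be derived in Oracle SQL by mapping every letter to 'A' and every digit to '9'. A sketch (REGEXP_REPLACE is Oracle syntax; the START_DATE column is taken from the exercise below):

-- Derive a mask per value and count how often each mask occurs;
-- 'A' and '9' are used as placeholders so the two passes do not interfere
SELECT mask, COUNT(*) AS frequency
FROM (SELECT REGEXP_REPLACE(
               REGEXP_REPLACE(START_DATE, '[A-Za-z]', 'A'),
               '[0-9]', '9') AS mask
      FROM CUSTOMER_MASTER)
GROUP BY mask
ORDER BY frequency;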

To List Patterns for an attribute:

Step1: From Customer Master, examine Start Date.
Step2: Double click on the label Patterns.
There are 6 patterns for this attribute.

The list view displays this information for each pattern.

To Examine Suspicious Patterns:
Patterns that have a very low frequency may indicate a potential format deviation.
Now look at the lowest-frequency pattern by double clicking the row with pattern d2pd6.
Document Findings:
When you find data issues, learn additional information about your data, or want to create a checkpoint while carrying out data profiling, document it. This can be done with bookmarks and notes.
Bookmarks are of two types: private and public.
Note: a bookmark that only you can view is called private.
A bookmark that others can also view is public.

To Create a Private Bookmark:

Step1: From the Explorer, double click on Min under Uk Orders2004, Order Id.

Step2: Look at the list view displaying the minimum value of '0'.

Step3: Right click in the list view and select Bookmark.
Step4: Type the bookmark name as Issue: Out of Range Order Id.

Step5: Type the description as "order id is 0".

Step6: Place a check next to Public bookmark.

Step7: Click OK if your bookmark looks like the following screenshot.

Examine Value Sets and Duplicate Values:

You can find value sets and duplicate values for an attribute through the metadata named Unique values.
Quality requirement: acceptable values in Country are:
CAN
DEU
UK
USA

Now verify that the Country attribute in the Customer Master entity has the values USA, UK, DEU, CAN.
Step1: Under Customer Master, expand the Attributes folder.
Step2: Locate the attribute named Country.

Step3: Expand Country and locate the label Unique values.

Step4: Double click Unique values to view the 4 values in the list view.

To Examine potential data duplication:

Quality requirement: Item Number is the unique identifier (key) for the product records, so each record should have a unique Item Number.
Step1: Examine Item Number in Product Master.
Step2: Look at the label Dist= next to Item Number.

Step3: Double click on Unique values for Item Number.

Step4: Examine the Frequency column in the list view.

Any frequency value greater than 1 indicates a duplicate value; in the screen below, 4 duplicate values exist.
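
The same duplicate check can be written directly in SQL. A sketch, assuming a PRODUCT_MASTER table with an ITEM_NUMBER column as in this exercise:

-- Item numbers occurring more than once are duplicates
SELECT ITEM_NUMBER, COUNT(*) AS frequency
FROM PRODUCT_MASTER
GROUP BY ITEM_NUMBER
HAVING COUNT(*) > 1;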
Oracle Data Profiling: Day3 Training Document:
Standardize data through attribute phrase analysis, value recoding, and the creation of reusable recode tables.

During data migration, data quality defects may need to be repaired, or values may need to be standardized to meet corporate requirements. This can be achieved through phrase analysis, value recodes, and the generation of recode tables.

Perform Phrase analysis on an attribute:

Step1: In the Explorer, find Acct Status under the Customer table.

Step2: Right click on Acct Status and select Edit attribute.

Step3: Place a check next to the option Phrases and keep the default setting of between 1 and 5 words per phrase.

Step4: Place a check next to the option Analyze Now.

Step5: Click OK.

Step6: Click Run Now.

Step7: View the Background Tasks to know when the task completes.

Step8: In the Explorer, expand Acct Status; if Acct Status was already expanded, refresh.

Step9: Look for the new piece of metadata named Phrases.

If you do not see Phrases, try refreshing the Explorer.

Step10: Double click Phrases.

Step11: Examine the phrase values in the list view.
The list view presents all of the phrases (between 1 and 5 words) that were found in the Acct Status values.

Step12: Sort descending on the column named Value count.

Here 3 Acct Status values contain the phrase Account:pending.

Step13: Double click on the row containing the phrase Account:pending.

These values in Acct Status contain the phrase Account:pending.

Step14: Continue exploring the remaining phrases.

The frequencies of the phrases, along with the values, should lead you to conclude that there are really only three states of an account:
● Pending
● Active
● Inactive

A corporate decision should be made to standardize the data, and the data entry application should only accept the three values for Account Status. Based on the most frequently occurring phrases, you will standardize these values as PENDING, ACTIVE, INACTIVE.

Recode Values:
We have already examined the results of the phrase analysis above; now we need to specify the standard values and generate a recode table.
Step1: Under the Customer table, Acct Status, double-click Unique values.

If we look into the above screenshot, in the Value column we can see the different values; now we are going to recode these values to a recode value.
Step2: Right click on the highlighted rows and select Recode values.

Step3: Highlight the values, right click, and select New.

Step4: Type Pending and click OK.

Step5: Next, in the list view, two additional columns appear:

1. Value Recode contains the results of the recoding, and
2. Value Recoded? marks which values were recoded.

Similarly, follow the same process and recode wherever a recode is required.

Step6: Examine the Value Recode and Value Recoded? columns and make sure that the recoded values are correct for each value listed.

Step7: If there are any discrepancies in the above data, you can correct them with the same technique as before.
Generate a Recode Value Look-Up Table:
You will generate a lookup table of recoded values that will be referenced from a Quality project created later in this class to standardize Acct Status values to PENDING, ACTIVE, INACTIVE.
Generate a recode value lookup table:
You can save your recode values as a physical file external to TSS, ready for use by TSS Quality projects or any other applications.
Step1: From Unique values for Acct Status, right-click on any row and select Export as TSQ Recodes.

Step2: Next to File Name, type /export/recode_acct_status

Note: the above file will be created in the export directory beneath the metabase directory on the TSS server.
Step3: Next to Export, make sure Values is selected.
Step4: Next to Field width, select Auto.
Step5: Next to Encoding, select US-ASCII.
Step6: Make sure that your selections look like the next screenshot.
Step7: Click OK.
Step8: Click Run Now.

Step9: With Windows Explorer, go to

C:\OraHome_1\oracledq\metabase_data\export

Two files should have been created in the above path; examine each file using Notepad.

Here recode_acct_status is a comma-delimited data file

and recode_acct_status.ddx is the data definition file.
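
Once exported, a recode table is simply a value-to-standard-value lookup. Applying it in SQL could look like the following sketch (it assumes the exported file has been loaded into a table RECODE_ACCT_STATUS with ORIGINAL_VALUE and RECODE_VALUE columns; those names are illustrative):

-- Standardize Acct Status via the recode lookup; fall back to the raw value
SELECT C.ACCT_STATUS,
       COALESCE(R.RECODE_VALUE, C.ACCT_STATUS) AS ACCT_STATUS_STD
FROM CUSTOMER_MASTER C
LEFT JOIN RECODE_ACCT_STATUS R
       ON C.ACCT_STATUS = R.ORIGINAL_VALUE;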

Investigate: Keys & Dependencies:
Key Analysis:
--> Key analysis is applicable only to real Entities
--> Key analysis can be done in two ways:
1. Discover keys
2. Create keys
Method 1: Discover keys
- Performed on data load, on a sample of the first 10,000 rows
- Attribute(s) identified as a potential key if >= 98% unique
- Can be disabled for entity creation:
to disable, log in to Metabase Manager, select Control Admin, right click on the loader connection, and select Edit loader settings.
Now select the check box that appears in the above screen.

- Can be run after entity creation to find additional keys

Method 2: Create keys
- Manually invoked to validate a known key
- Analysis performed on the full volume of data
To review potential keys (found during import):

Step1: From the TrilMedTech project, expand Customer Master.

Step2: Expand the folder named Metadata.

Step3: Locate the metadata Keys (Discovered).

There are two types of keys:

Discovered: these are the potential keys found during the load process, represented by the number in parentheses; in the above case, 16.
Permanent: these are the keys that you mark as being important to your project, represented by the number; in the above case, '0'.

Validate expected keys:

Quality requirement: in the Customer Master file, Account Number is a primary key.
Now we will validate whether Account Number is 100% unique.

Step1: Go to Analysis and select Create Key or Dependency.

Step2: Select the option Keys.

Step3: Select Customer Master in the drop-down menu.

Step4: Click Next.

Step5: Enter a job name and place a check next to Account Number.

Step6: Click Finish.

Step7: Click Run Now to schedule the job.

Note: once the key check is complete (State = Completed), Account Number will be marked as a Permanent Key.

Step8: Now check under the Analysis tab, Keys, Permanent keys folder.

Step9: Refresh and examine Keys (Discovered) for Customer Master.
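
The same validation can be cross-checked in SQL; this also mirrors the >=98% heuristic that Discover Keys applies to its sample. A sketch (illustrative names):

-- Percentage uniqueness of Account Number; 100 means it is a valid key
SELECT 100 * COUNT(DISTINCT ACCOUNT_NUMBER) / COUNT(*) AS pct_unique
FROM CUSTOMER_MASTER;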


Validate expected Dependency:
Quality requirement:
In Customer Master, there is an expected dependency between Postcode and City: a postcode should determine the city.

Now we will validate this expected dependency.

Step1: From the main menu, select Analysis, Create Key or Dependency.

Step2: Select the option Dependencies and Customer Master from the drop-down menu, and click Next.

Step3: Enter a job name.

Step4: On the left side, place a check next to Postcode.

Step5: On the right side, place a check next to City.

Step6: Click Finish and then click Run Now to schedule the job.

Step7: Now expand Customer Master and its subfolder Metadata.

Step8: Double click on the label Dependencies (Discovered).

Now look for the dependency that you just created between Postcode and City; notice that Postcode predicts City 68.44%. Next, explore the conflicting values to find data problems.

Step9: From the list view, double click on the dependency between Postcode and City.

Step10: Sort the list view on Postcode in descending order.

Step11: Scroll through the list and find the postcode W1C2PW.

Step12: Highlight all rows with postcode W1C2PW.

Step13: Right click and select Drill Down to Matching Rows.

It looks like these cities should have been London.

Step14: At this point, you could resolve the conflicts and bookmark them.

Oracle Data Profiling: Day4 Training Document:
Define Standard Definition:

A DSD tests data compliance at the attribute level and highlights data that does not comply with your corporate standards.

There are many DSD tests that can be applied, such as:

● Predominant Data type check


● Sum check
● Schema data type check
● Spaces check
● Pattern check
● Value check
● Null check
● Range check
● Schema Length check
● Unique check

To Check Attribute Completeness (Null Check):

Quality requirement:

In Uk Orders2004, Payment Method is a required field and should not be null.

Step1:
From the TrilMedTech project, find the Payment Method attribute under Uk Orders2004.

Step2: Right-click and select Edit DSD.

Step3: When the DSD pane displays, click on the tab Null Check.

Step4: Examine the information on the Null Check tab.

You should see the results of the Null Check.
Step5: Click Failing values to see the rows where Payment Method is null.

To Specify more DSD tests On an Attribute:

Quality requirement:
TrilMedTech has the following data standards for the attribute Order Id in the entity Uk Orders2004:

Condition 1: should not contain nulls

Condition 2: allowed value range is 30000 to 99999

Condition 3: pattern should be d5

Now we will see how to define all these DSDs for the attribute Order Id.

Step1: Go to the Uk Orders2004 entity, right click Order Id, and select Edit DSD.

Step2: Click on the tab Range check.

Step3: Enable the test by clicking here.

To verify the number of rows that passed and failed, click on Failing; it will display the rows that did not meet the defined condition.
To see the complete row for a failing value, just double click on the selected row.

Similarly, next go to the Pattern tab in the same window, enable the pattern test, and define the pattern which we are expecting.

Now click on Failing to see the failing rows,

and to see the complete row for the above value, double click on the value.
Data Quality Problems:

In the above example, if we look into the CUSTID field, data is missing; this is critical.

Business data standards for Customer ID

Now we will see a few example business data standards for CUSTID:
1. The data type of Customer ID should be varchar
2. The length of Customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. The format of Customer ID should be 'A999'
5. Customer ID values should be only in the range A100 to A400
6. Customer ID cardinality should be 100%

Problem Identification through Trillium Discovery:

● Select the attribute Cust Id, right click, and choose DSD; the DSD window opens at the Null check tab
● Now click on the Failing tab to see the values that are null in this attribute

Business data standards for Customer Name:

Now we will see a few example business data standards for CUSTNAME:
1. The data type of customer name should be varchar
2. The length of the customer name should be 10
3. The following characters are not acceptable in Customer Name: {$,#,?,*,&,@}
Problem Identification through SQL Query:

SELECT CUSTID
,CUSTOMER_NAME
,CUSTOMER_STATE
,COUNTRY
,ZIPCODE
,EMAIL_ADDRESS FROM CUSTOMER
WHERE CUSTOMER_NAME LIKE('%$%') OR
CUSTOMER_NAME LIKE('%#%') OR
CUSTOMER_NAME LIKE('%?%') OR
CUSTOMER_NAME LIKE('%*%') OR
CUSTOMER_NAME LIKE('%&%') OR
CUSTOMER_NAME LIKE('%@%')

Business data standards for State Code in INDIA

1. The data type of customer state should be char
2. The length of the state code should be 2
3. If country code = INDIA, then the state code should be in {AP, AC, AS, BI, CH, CG, DN, DD, DL, GO, GU, HA, HP, JK, JH, KA, KE, LK, MP, MH, MN, ME, MZ, NA, OR, PD, PB, RJ, SI, TN, TR, UP, UA, WB}

In the above example, one record does not comply with the data standards.

Problem Identification through a Trillium Discovery business rule:

Step1: Define a business rule. First go to the entity Cust Info, expand it, select Business rules, right click, and click on Add business rule; the next screen will come up.
In the Name text field, CUSTSTATE_VALIDITY is the business rule name, and in the right-side text box the arrow mark indicates the logic used in this business rule.

Logic Used:
IF (Country = "INDIA") THEN ([Cust State] IN ("AP", "AC", "AS", "BI", "CH", "CG",
"DN", "DD", "DL", "GO", "GU", "HA", "HP", "JK", "JH", "KA", "KE", "LK",
"MP", "MH", "MN", "ME", "MZ", "NA", "OR", "PD", "PB", "RJ", "SI", "TN",
"TR", "UP", "UA", "WB"))

Step2: Once you have written the business rule, save and run it; the business rule job may take several minutes to finish.

Step3: Once the job completes, select Business rules (Passed) from the entity, check whether the result status is fail or pass, and check the passing fraction.

Step4: If you want to see the actual rows that fail the business condition we defined, simply double click on the business rule name, i.e. CUSTSTATE_VALIDITY; a new window opens, and in its list view we will see the rows that failed.

In the above screenshot, the Cust State value is TX; this state code does not exist in the Indian state code list at all, which is why this record failed.
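
For comparison, the failing rows of this rule could also be pulled with plain SQL. A sketch against the CUSTOMER table used earlier (the state list is abbreviated here; the full list is the one in the rule above):

-- Indian rows whose state code is outside the allowed list fail the rule
SELECT CUSTID, CUSTOMER_STATE, COUNTRY
FROM CUSTOMER
WHERE COUNTRY = 'INDIA'
  AND CUSTOMER_STATE NOT IN ('AP','AS','BI','DL','KA','TN','UP','WB' /* ...full list as above */);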

Business data standards for COUNTRY:

1. The data type of country should be char
2. The length of the country should be 5
3. The format should be 'AAAAA'
4. All values in the country field should be 'INDIA'
Here 100% of the data complies with the data standards.

Problem Identification through SQL Query:

SELECT CUSTID
,CUSTOMER_NAME
,CUSTOMER_STATE
,COUNTRY
,ZIPCODE
,EMAIL_ADDRESS FROM CUSTOMER
WHERE COUNTRY='INDIA'

The same problem identified through a Trillium Discovery business rule:

Step1: Define a business rule. First go to the entity Cust Info, expand it, select Business rules, right click, and click on Add business rule; the next screen will come up.

In the Name text field, CNTRYCD_VALIDITY is the business rule name, and in the right-side text box the arrow mark indicates the logic used in this business rule.
Logic Used:
Country = "INDIA"

Step2: Once you have written the business rule, save and run it; the business rule job may take several minutes to finish.
Step3: Once the job completes, select Business rules (Passed) from the entity, check whether the result status is fail or pass, and check the passing fraction.

Step4: If you want to see the actual rows that fail the business condition we defined, simply double click on the business rule name, i.e. CNTRYCD_VALIDITY; a new window opens, and in its list view we will see the rows that failed.

In the above screenshot we did not get any rows, because the passing fraction is 100%.

Business data standards for ZIPCODE

1. The data type of zipcode should be varchar
2. The length of the zipcode should be 10
3. The format of an Indian zip code should be '999999'

In this example also, 100% of the data complies with the defined data standards.

Note: if we check this address data against reference data from a reference DB, we might find a few more data issues.

The same problem identified through a Trillium Discovery business rule:

Step1: Define a business rule. First go to the entity Cust Info, expand it, select Business rules, right click, and click on Add business rule; the next screen will come up.
In the Name text field, ZIPCODE_VALIDITY is the business rule name, and in the right-side text box the arrow mark indicates the logic used in this business rule.
Logic Used:
PATTERN(Zipcode,"rich") like "d6"

Step2: Once you have written the business rule, save and run it; the business rule job may take several minutes to finish.

Step3: Once the job completes, select Business rules (Passed) from the entity, check whether the result status is fail or pass, and check the passing fraction.

Step4: If you want to see the actual rows that fail the business condition we defined, simply double click on the business rule name, i.e. ZIPCODE_VALIDITY; a new window opens, and in its list view we will see the rows that failed.

In the above screenshot we did not get any rows, because the passing fraction is 100%.
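
The PATTERN(Zipcode,"rich") like "d6" test is a six-digit format check; the equivalent could be sketched in SQL as follows (REGEXP_LIKE is Oracle syntax):

-- Zip codes that do not match the 'd6' (six digits) pattern
SELECT CUSTID, ZIPCODE
FROM CUSTOMER
WHERE NOT REGEXP_LIKE(ZIPCODE, '^[0-9]{6}$');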
Create A Business Rule:
Quality requirement: when CC on File is Y, the Payment Method value is CREDIT CARD.

Step1: Select the entity Uk Orders2004 and its subfolder Metadata.

Step2: Right click on the label named Business Rules.

Step3: Select Add Business Rule.

Step4: Refer to the next screenshot to enter a business rule name, description, and the percentage of rows that should pass this test.
Step5: In the expression elements, select Attributes from the first column.

Step6: In the second column, double click on CC on File.

Step7: Click the "is equal to" button (=) to insert an equals sign.

Step8: Type "Y".

Step9: Click the AND button to insert the AND clause; your expression should now look like this.

Step10: From the Attributes list, double-click Payment Method.

Step11: Click the "is equal to" button (=) again.

Step12: Type "CREDIT CARD".

Step13: If your business rule looks like the one above, click Create.
Step14: When prompted to schedule a job, click OK.

Step15: In the Schedule Job window, click Run Now.

To Review results from a Business Rule:

Once the business rule analysis completes, follow the steps below.

Step1: Double click on Business Rules (Passed) for Uk Orders2004.

Step2: A summary of the business rule analysis results appears.

Step3: From the list view, double click the business rule to view all failed rows.
Oracle Data Profiling: Day5 Training Document:

Join Analysis:
Trillium Software has two types of join analysis:

Discover Joins: allows you to find potential joins by selecting the Entities that you want to join together, along with the attributes on which to join them, and to assess the suitability of the data for integration or cleansing needs. The join can be performed on all attributes, regardless of whether they are keys or not.

Create Join: allows you to validate known joins.

To Find Potential Joins:

Quality requirement:
Find all suitable joins between Uk Orders2004 and Product Master.

Step1: From the main menu, select Analysis, Discover Joins.

Step2: Next to Findings Name, type Uk Orders2004 and Products.

Step3: Expand Uk Orders2004.
Note: notice that all of the attributes have a check next to them, indicating that TSS will try to find joins between all attributes.

Step4: Place a check in the box next to Product Master.

Step5: If your choices look like the screenshot below, click OK.

Step6: Click Run Now to schedule the job.


To review Join Analysis Results:
Step1: Once the Discover Joins job completes, click the Explorer tab Analysis.

Step2: Right click on the tab and select Refresh.

Step3: Expand the folder Joins and the subfolder Join Analysis Results.

Step4: Expand the Join Analysis Results named Uk Orders1 2004 and Products1.
There was 1 join discovered.

Step5: From the Explorer, double click on Joins Discovered.

Step6: Look at the join found between Product_Id and Item_Number.

This is the join that seems to make sense based on the type of data held in each attribute.
Step7: Right click on the discovered join and select Venn Diagram.

Step8: Double click on (0) on the left-hand side of the Venn diagram.

All rows in Uk Orders1 2004 that match a Product Id in Product Master1 are listed.

Step9: Double click on 3 on the right side of the Venn diagram.

The list view displays all Product Ids that have not been ordered in the UK.

Step10: Double click on (3) to see the corresponding rows, i.e. the details (rows) of unordered products.

Step11: From the Explorer, double click on Joins Discovered again.

Step12: In the list view, right click on the discovered join and select Set Permanent.
Note: once we set it as Permanent, we can see the Entity Relationship Diagram of this join analysis.

Validate Expected Joins:

Quality requirement: there is an expected join between Customer Master1 and Uk Orders1 2004.

Step1: From the main menu, select Analysis, Create Join.

Step2: From the LHS Entity drop-down list, select Customer Master1.
Step3: From the RHS Entity drop-down list, select Uk Orders1 2004.

Step4: Because Uk Orders1 2004 only contains order information for the UK, you must filter Customer Master1 prior to joining.

Step5: Next to the Customer selection, click Filter.

Step6: Write the following expression.

Step7: Click Apply.
Step8: Click Next.

Step9: Click Account Number on the left and Account Id on the right.

Step10: Click Add Join.

Step11: Ensure that the join looks like the picture below and click Finish.

Step12: Complete the screen using the picture below and click Finish.
Step13: Click Run Now to schedule the job.

Step14: View the Background Tasks to determine when the Check Join and Execute Join tasks complete.

To Review Join Analysis Results:

Step1: Once the Execute Join job is complete, click the tab Analysis.

Step2: Right click on the Analysis tab and select Refresh.

Step3: Double click on the folder named Permanent Joins.

Step4: From the list view, highlight the join between Customer Master1 and Uk Orders1 2004.

Step5: Right click and select Venn Diagram to examine the results.
In the Venn diagram below, there are 62 unique Account Numbers in Customer Master1 that are not contained in the Uk Orders1 2004 file.

Step6: Double click on 62 on the left-hand side of the Venn diagram to examine those Account Numbers not found in Uk Orders1 2004.

Step7: Highlight a few of the rows, right click, and select Drill Down to Highlighted Left Not Matching Rows.

Step8: From the Venn diagram, double click 2 on the right side; these Account Ids do not have a corresponding Account Number specified in Customer Master1.

Step9: Double click on each value to examine the corresponding row; the value that starts with ZZZ might indicate that this was a test record inadvertently left in the data.

Step10: To view all of the matching rows, double click (170) in the center segment of the Venn diagram.

Step11: Close the Venn Diagram.

To View the ERD of the above Customer Master1 and Uk Orders1 2004 Join:

Step1: From the Analysis tab, expand the Permanent Joins folder.

Step2: Right click on the join Customer Master1 >-< Uk Orders1 2004 and select Entity Relationship Diagram.
To View the ERD for multiple Joins:
Step1: From the Analysis tab, double click on the Permanent Joins folder.

Step2: From the list view, highlight multiple joins.

Step3: Right click and select Entity Relationship Diagram.

Generate DDL from the Entity Relationship Diagram:

TSS allows you to generate all related DDLs from the ERD in one step.

Step1: While you are still in the ERD, right click anywhere in the white space of the diagram and select Export.
Step2: Select the Save In location as Desktop.

Step3: Type the file name as Export_Erd.

Step4: From Save as type, select DDL File (*.ddl).

Step5: Click Save.

Step6: Go to your desktop and open the exported file.

The file contains ANSI DDL for all Entities displayed in the ERD.

Now we will follow the above process and generate the DDL for the diagram below.
DDLs of the above ERD:
-- DDL Export File Generated by TSS.
-- Mon May 12 9:32:12 AM US Eastern Daylight Time 2014

ALTER TABLE UK_ORDERS1_2004 DROP CONSTRAINT DSC_1;

DROP TABLE CUSTOMER_MASTER1;


CREATE TABLE CUSTOMER_MASTER1 (
BUS_NAME VARCHAR(35),
TITLE VARCHAR(3),
FIRSTNAME VARCHAR(11),
MIDDLE VARCHAR(6),
PHONE VARCHAR(12),
LASTNAME VARCHAR(16),
ADDRESS1 VARCHAR(39),
CITY VARCHAR(20),
STATE VARCHAR(4),
POSTCODE VARCHAR(8),
COUNTRY VARCHAR(3) NOT NULL,
ACCOUNT_NUMBER VARCHAR(8) NOT NULL,
START_DATE VARCHAR(10) NOT NULL,
ACCT_STATUS VARCHAR(8) NOT NULL,
ACCT_REP VARCHAR(3),
CLRECID VARCHAR(6) NOT NULL,
LAST_CONTACT_DATE VARCHAR(10) NOT NULL,
GOLD_MEMBER VARCHAR(1),
REFERRED_BY VARCHAR(12) NOT NULL );

DROP TABLE UK_ORDERS1_2004;


CREATE TABLE UK_ORDERS1_2004 (
ACCOUNT_ID VARCHAR(8) NOT NULL,
ORDER_ID NUMERIC(5) NOT NULL,
INVOICE_ID VARCHAR(10),
PRODUCT_ID VARCHAR(6) NOT NULL,
LINE_ITEM NUMERIC(1) NOT NULL,
ORDER_DATE VARCHAR(10) NOT NULL,
SHIP_DATE VARCHAR(10),
PAYMENT_METHOD VARCHAR(11),
CC_ON_FILE VARCHAR(1) NOT NULL,
PAID VARCHAR(1) NOT NULL,
SHIPPING_METHOD NUMERIC(1) NOT NULL,
QUANTITY_ORDERED NUMERIC(2) NOT NULL,
QUANTITY_SHIPPED VARCHAR(3) NOT NULL );

ALTER TABLE PRODUCT_MASTER1 DROP PRIMARY KEY;

DROP TABLE PRODUCT_MASTER1;


CREATE TABLE PRODUCT_MASTER1 (
ITEM_NUMBER VARCHAR(6) NOT NULL,
PRODUCT_NAME VARCHAR(69) NOT NULL,
PRODUCT_TYPE VARCHAR(21) NOT NULL,
MSRP NUMERIC(5) NOT NULL,
QTY_AVAIL VARCHAR(5) NOT NULL );

ALTER TABLE PRODUCT_MASTER1


ADD PRIMARY KEY (PRODUCT_NAME);

ALTER TABLE UK_ORDERS1_2004


ADD CONSTRAINT DSC_1
FOREIGN KEY (PRODUCT_ID)
REFERENCES PRODUCT_MASTER1 (ITEM_NUMBER);

To Export a Schema from TSS Explorer:

Step1: Click on the Explorer tab Entities.
Step2: From the Explorer, highlight the entity Customer Master1.
Step3: Right click and select Generate DDL, SQL.
Step4: Select the Save In location as Desktop.
Step5: Type the file name as Cust_Master1.
Step6: Click Save.
The file contains a schema describing the entity highlighted in the Explorer.
Step7: Go to your desktop and open the exported file; the file contains ANSI DDL for the entity Customer Master1.

Generated DDLs for Customer Master1:


-- DDL Export File Generated by TSS.
-- Mon May 12 9:42:45 AM US Eastern Daylight Time 2014

DROP TABLE CUSTOMER_MASTER1;


CREATE TABLE CUSTOMER_MASTER1 (
BUS_NAME VARCHAR(35),
TITLE VARCHAR(3),
FIRSTNAME VARCHAR(11),
MIDDLE VARCHAR(6),
PHONE VARCHAR(12),
LASTNAME VARCHAR(16),
ADDRESS1 VARCHAR(39),
CITY VARCHAR(20),
STATE VARCHAR(4),
POSTCODE VARCHAR(8),
COUNTRY VARCHAR(3) NOT NULL,
ACCOUNT_NUMBER VARCHAR(8) NOT NULL,
START_DATE VARCHAR(10) NOT NULL,
ACCT_STATUS VARCHAR(8) NOT NULL,
ACCT_REP VARCHAR(3),
CLRECID VARCHAR(6) NOT NULL,
LAST_CONTACT_DATE VARCHAR(10) NOT NULL,
GOLD_MEMBER VARCHAR(1),
REFERRED_BY VARCHAR(12) NOT NULL );
Data problems can also be identified using DataStage (DS), QualityStage (QS), and Information Analyzer (IA):
Data Quality Problems:

In the above example, if we look into the CUSTID field, data is missing; this is critical.

Business data standards for Customer ID

Now we will see a few example business data standards for CUSTID:
1. The data type of Customer ID should be varchar
2. The length of Customer ID should be 4
3. Customer ID should not be null, empty, or spaces
4. The format of Customer ID should be 'A999'
5. Customer ID values should be only in the range A100 to A400
6. Customer ID cardinality should be 100%

Problem Identification through the Data Rules stage in QS:

Input:

Output:
Job Design:

Data Rules stage:

Validation:
Business data standards for Customer Name:
Now we will see a few example business data standards for CUSTNAME:
1. The data type of customer name should be char
2. The length of the customer name should be 10
3. The following characters are not acceptable in Customer Name: {$,#,?,*,&,@}

Problem Identification through SQL Query:

SELECT CUSTID
,CUSTOMER_NAME
,CUSTOMER_STATE
,COUNTRY
,ZIPCODE
,EMAIL_ADDRESS FROM CUSTOMER
WHERE CUSTOMER_NAME LIKE('%$%') OR
CUSTOMER_NAME LIKE('%#%') OR
CUSTOMER_NAME LIKE('%?%') OR
CUSTOMER_NAME LIKE('%*%') OR
CUSTOMER_NAME LIKE('%&%') OR
CUSTOMER_NAME LIKE('%@%')
Business data standards for State Code in INDIA
1. The data type of customer state should be char
2. The length of the state code should be 2
3. If country code = INDIA, then the state code should be in {AP, AC, AS, BI, CH, CG, DN, DD, DL, GO, GU, HA, HP, JK, JH, KA, KE, LK, MP, MH, MN, ME, MZ, NA, OR, PD, PB, RJ, SI, TN, TR, UP, UA, WB}

In the above example, one record does not comply with the data standards.
Problem Identification through a DataStage job:
Input:

Reference data:
State rejected data:

Output data:

Job:
Business data standards for COUNTRY:
1. The data type of country should be char
2. The length of the country should be 5
3. The format should be 'AAAAA'
4. All values in the country field should be 'INDIA'

Here 100% of the data complies with the data standards.

Problem Identification through SQL Query:

SELECT CUSTID
,CUSTOMER_NAME
,CUSTOMER_STATE
,COUNTRY
,ZIPCODE
,EMAIL_ADDRESS FROM CUSTOMER
WHERE COUNTRY='INDIA'

Business data standards for ZIPCODE

1. The data type of zipcode should be varchar
2. The length of the zipcode should be 10
3. The format of an Indian zip code should be '999999'

In this example also, 100% of the data complies with the defined data standards.

Note: if we check this address data against reference data from a reference DB, we might find a few more data issues.
Problem Identification through IA column analysis:

Here in Information Analyzer, the format analysis shows only one format, i.e. 999999.

Business data standards for email address

1. The data type of the email address should be varchar
2. The length of the email address should be 25
3. There should be at least one character before the @ sign
4. Each email address should end with either 2 or 3 characters
5. An email id should contain one @ sign
6. An email id should contain at least one period (.)
7. The domain should be a valid domain

In the above example, email address 2 has two @ signs, so it does not comply with the business data standards; similarly, email 3 has two periods, and the 4th email address contains an invalid domain.
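
Standards 3 to 6 can be approximated with a single regular expression. A sketch in SQL (a deliberately simple pattern, not a full RFC-compliant check; REGEXP_LIKE is Oracle syntax):

-- Failing addresses: must be text@domain.tld with a 2-3 character ending,
-- exactly one @ (enforced by [^@]+ on both sides) and at least one period
SELECT CUSTID, EMAIL_ADDRESS
FROM CUSTOMER
WHERE NOT REGEXP_LIKE(EMAIL_ADDRESS, '^[^@]+@[^@]+\.[A-Za-z]{2,3}$');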

Problem Identification through the QualityStage Standardize stage using the email rule set:
Input:

Output:
Invalid email addresses:

Job:

There are a number of tools and techniques that can help prevent data quality issues from leading to big consequences:

Data Rules Creation Process:
1. Go to Develop in the Workspace Navigator menu.

2. Select Data Quality; under Tasks, find New Data Rule Definition.

3. Click on New Data Rule Definition; a new window will pop up.
4. Click on Overview and provide the data rule name in the Name text field; the short description and long description are optional.

5. Go to Rule Logic, where we have to write the logic.

Write this logic: source_data exists and len(trim(source_data))<>0
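
This logic is the standard completeness test: the value must exist and must not be blank. The equivalent check in SQL could be sketched as follows (SOURCE_TABLE and SOURCE_DATA stand for the table and column under test, as in the rule logic):

-- Rows failing the completeness rule: null, empty, or spaces only
-- (in Oracle, TRIM of an all-blank string yields NULL)
SELECT *
FROM SOURCE_TABLE
WHERE SOURCE_DATA IS NULL
   OR TRIM(SOURCE_DATA) IS NULL;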

The rule builder offers the following expression elements:

1. Condition:
 IF
 THEN
 ELSE
 AND
 OR
 NOT

2. Opening parentheses:
 (
 ((
 (((
Example:

Once we have written the logic, we have to validate whether the logic is syntactically correct; if it is, click Save and Exit.
3. Source Data:
Here, source data is a field name.

4. Condition:
 Not
5. Type of check:
 =
 >
 <
 >=
 <=
 <>
 Contains --> string containment
 Exists --> null value test
 Matches_Format --> e.g. if country = India then zip code format = '999999'
 Matches_Regex
 occurs
 occurs>
 occurs>=
 occurs<
 occurs<=
 In_Reference_Column
 In_Reference_List
 Unique
 Is_Numeric
 Is_Date
6. Reference Data:

Here, in reference data we give the reference column name.

7. Closing parentheses:
 )
 ))
 )))
Oracle Data Profiling: Day6 Training Document:

Steps for Getting Started with Oracle Data Profiling and Quality:
Before we start working with Oracle Data Profiling and Quality, we need to follow the steps below.

Step1: Create a Metabase and loader connections with the Administrator.

Step2: Log on to the Oracle Data Profiling and Quality user interface and familiarize yourself with the user interface.

Step3: Prepare for importing your data by deciding whether all or only part of the data source will be imported.

Step4: Create an Entity.

Step5: Create a Project.

Step6: Open a Project and start to work.

How to Configure the Metabase and Connections:

Step1: Make sure Oracle Data Quality, Data Profiling, and Oracle Data Integrator are installed and working first.

Step2: Select Start > All Programs > Oracle > Oracle Data Profiling and Quality > Metabase Manager to log in to the Metabase Manager as the Metabase Administrator (madmin).

Step3: Select Tools > Add Metabase from the menu.

Step4: Add a Metabase named Test, with the default pattern (e.g. $Abc pa4) and a Public Cache Size of 200 MB, then click OK.

Now your settings should look like the screen above.

Step5: To verify that the Metabase Test was created, double click on Metabases.

The list view then displays the list of Metabases we created.

Add Users in Metabase Manager:

Step1: Log in with Metabase Admin privileges to add a user to the Metabase.

Step2: Select Tools > Add User from the menu.

Step3: Add the user bhaskar with the password bhaskar123 as shown below, and click OK.
Step4: To verify that the user was added, go to Users and double click on Users; the list view will show all the users created by the Metabase Administrator.

Add Users to Metabase 'Test':

Step1: Go to Metabase Users in Metabase Manager.

Step2: Click on Metabase Users.

Step3: Right click on Metabase Users; the list view will display the list of users created by the Metabase Administrator.

Step4: Now select the user we created above, "bhaskar", and right click on the user; click Add in the menu that appears. A page will open; select the user "bhaskar" in the UserName list, select the Metabase "Test" for this user, and click OK.
Now your settings should look like the screenshot below.
Create Loader Connections:
Step1: Log in to Metabase Manager using Metabase Administrator privileges.

Step2: Go to Loader Connections and right click on Loader Connections; click Add Loader Connection. A window opens: provide the loader connection name in the Name text field ("Flatfile"), provide the description in the Description text field ("Loader connection for flat files"), and provide the Data Directory, Data Extension, Schema Directory, and Schema Extension information. Now your settings should look like the screen below.

Step3: Click OK.

These data files and schema files will be used for entity creation from the different data sources.

To See the List of Loader Connections in the Metabase Manager:

Step1: Go to Loader Connections in the Metabase Manager.

Step2: Double click on Loader Connections; the list view then shows all the loader connections created by the administrator.

To Create a Private Bookmark or Public Bookmark:

Step1: From the Explorer, double click on Min under Uk Orders2004, Order Id.
Step2: Look at the list view displaying the minimum value of '0'.

Step3: Right click in the list view and select Bookmark.
Step4: Type the bookmark name as Issue: Out of Range Order Id.

Step5: Type the description as "order id is 0".

Step6: Place a check next to Public bookmark.

Step7: Click OK if your bookmark looks like the following screenshot.
Process to Create Notes:

Step1: Go to the Uk Orders2004 entity and select the Order Id attribute.

Step2: Right click the Order Id attribute and select Notes.

Note: here the Order Id value '0' is out of range; the actual expected value range is 30000 to 99999. So we found one issue, a value of '0' for one Order Id, and we now need to create a note for this issue.

Step3: Now enter the following information into the note:

Step4: Click OK.

To View All the Notes and Bookmarks:

To see all the notes and bookmarks we created, follow the steps below.

Step1: Go to Findings in the Explorer and select Notes; there we will see all the notes we created for the attributes.
