Data Profiling
Defining metrics for data quality can be difficult because they are specific
to the domain, the application, or the business. One common approach to
defining data quality is data profiling.
2. Run the profile session. You can run the session when you finish the
Profile Wizard, or you can run it from the Profile Manager.
The Integration Service runs the session and loads the profile results to
the Data Profiling warehouse.
3. View the reports. View the Data Profiling report associated with the
profile session. Based on the type of profile report, you can view reports
from the Profile Manager or from Data Analyzer.
Custom profile: A data profile you define with the functions you need
to profile source data. Use a custom profile during mapping development to
validate documented business rules about the source data. You can also use a
custom profile to monitor data quality or validate the results of BI reports.
You use the Designer to create a data profile. When you create a profile, the
Designer generates a mapping and a session based on the profile information.
You can configure a data profile to write verbose data to the Data Profiling
warehouse during a profile session.
Verbose data provides more details about the data that results from a
profile function. For example, for a function that validates business rules,
verbose data may include the invalid rows in the source. For a function that
determines the number of distinct values, verbose data can include a list of
distinct values.
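The difference between summary results and verbose data can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Integration Service is implemented. The function names, the `verbose` flag, and the result keys are hypothetical.

```python
def distinct_value_count(values, verbose=False):
    """Count distinct values; in verbose mode, also return the values themselves."""
    distinct = sorted(set(values))
    result = {"distinct_count": len(distinct)}
    if verbose:
        # Verbose data: the list of distinct values behind the count.
        result["verbose_rows"] = distinct
    return result


def business_rule_validation(rows, rule, verbose=False):
    """Count rows that satisfy a rule; in verbose mode, also return the invalid rows."""
    invalid = [row for row in rows if not rule(row)]
    result = {
        "valid_count": len(rows) - len(invalid),
        "invalid_count": len(invalid),
    }
    if verbose:
        # Verbose data: the source rows that failed the business rule.
        result["verbose_rows"] = invalid
    return result
```

Without the verbose flag, only the counts are returned. With it, the result also carries the underlying rows, which is why verbose data can grow large for large sources.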
After you create a data profile, you can view profile details from the
Profile Manager. You can also edit and delete the data profile.
Source Analyzer. Click Sources > Profiling > Create Auto Profile.
Mapplet Designer. Click Mapplets > Profiling > Create Auto Profile.
Loading verbose data for large sources may impact system performance.
Note: If you load verbose data for columns with a precision greater than
1,000 characters, the Integration Service writes truncated data to the
Data Profiling warehouse during the profile session.
6. Click Next.
The Profile Settings dialog box displays the default domain inference
tuning and structure inference settings.
10. Optionally, modify the default profile settings and click OK.
The Designer generates a data profile and profile mapping based on the
profile functions.
If you selected Run Session, the Profile Manager starts the session.
Source Analyzer. Click Sources > Profiling > Create Custom Profile.
Mapplet Designer. Click Mapplets > Profiling > Create Custom Profile.
Profile Manager. Click Profile > Create Custom.
Designer tool: If you create a custom profile this way, you can only
profile that source. If you need to include multiple sources in the profile,
or if you want to create an intersource function, use the Designer menu
commands.
You can also edit or delete a data profile.
If you want to profile multiple sources, you can create a mapplet that
combines multiple sources and create a data profile based on the mapplet
output data.
Note: If you use a source as a lookup source within a data profile, it cannot
be used as a non-lookup source within the same data profile. For example,
when you create a Domain Validation function using a Column Lookup domain,
the source you use for the column lookup cannot be a profiled source in the
same data profile. If two profile sources attempt to validate data against
each other, the Designer creates an invalid mapping.
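The restriction in this note can be checked before a profile is built. The sketch below is a hypothetical pre-check written in Python. It is not a Designer feature, and the function name and source names are assumptions made for illustration.

```python
def find_lookup_conflicts(profiled_sources, lookup_sources):
    """Return source names that appear both as profiled sources and as lookup sources."""
    # A source used for a column lookup cannot also be profiled in the same data profile.
    return sorted(set(profiled_sources) & set(lookup_sources))
```

An empty list means the profile respects the restriction. Any name in the returned list is a source that would have to be removed from one of the two roles.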
Add functions: When you add functions to the data profile, the Profile
Wizard opens the Function Details page for you to configure details about the
functions.
Edit functions: You can edit existing functions for the data profile.
Delete functions: You can remove functions from the data profile.
Organize functions: Use the Up and Down arrows to organize the
functions in a data profile. The order of the functions does not affect the
data profile results.
Select columns to load in verbose mode: When you
configure a function, you can select the columns to load in verbose mode.
Enable session configuration: When you enable session configuration, the
Profile Wizard prompts you to configure the profile session for the mapping.
If you configured the default data profile options to always run profile
sessions interactively, this option is selected by default.
If you finish adding functions to the data profile and you have not enabled
session configuration, click Finish. The Profile Wizard generates the profile
mapping.
If you finish adding functions to the data profile and you enabled session
configuration, click Next. The Profile Wizard prompts you to configure the
profile session.
If you added multiple sources to the data profile, you must select the source
you want to apply the function to.
If you select an intersource function, you must select at least two sources
or two groups from different sources to apply the function to.
After you select the function type and function, click Next. The Profile
Wizard prompts you to specify the function details for the function. The
Function Details window and available options change depending on the type of
function you select.
Each function type has a subset of functionality you can configure to perform
calculations on the source data.
When you finish configuring the function, the Profile Wizard returns to the
Function Level Operations page. From the Function Level Operations page, you
can continue to add and configure functions for the data profile.
For example, if you created a Business Rule Validation function for the
column Agreement_Status, you cannot select this column to group by.
5. Click OK.
Configuring a Function for Verbose Mode
When you configure a function for verbose mode, the Integration Service
writes verbose data to the Data Profiling warehouse during a profile session.
You can configure verbose mode for the following functions:
Source-level Business Rule Validation
Column-level Business Rule Validation
Domain Validation
Distinct Value Count
Row Uniqueness
The type of verbose data the Integration Service can load to the target
depends on the function for which you configure verbose mode.
You can view profile mappings in the Designer. The Designer denotes profile
mappings in the Repository Navigator with a Profile Mappings icon. The
profile mapping name is based on the data profile name. By default, the
mapping name contains the prefix m_DP_.
For example, if you name the data profile SalaryValidation, the mapping name
for the data profile is m_DP_SalaryValidation.
You can change the naming convention for profile mappings in the default data
profile options.
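The default naming convention is a simple prefix rule, sketched here in Python. The helper name and the `prefix` parameter are hypothetical. Only the m_DP_ prefix and the SalaryValidation example come from the text above.

```python
DEFAULT_PREFIX = "m_DP_"  # default profile mapping prefix described above


def profile_mapping_name(profile_name, prefix=DEFAULT_PREFIX):
    """Build the profile mapping name from the data profile name."""
    return f"{prefix}{profile_name}"


print(profile_mapping_name("SalaryValidation"))  # prints m_DP_SalaryValidation
```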
Working with Functions Overview
You include functions in a data profile to perform calculations on sources
during a profile session. When you create an auto profile, the Designer adds
a predefined set of functions to the data profile. When you create a custom
profile, you create functions that meet your business needs, and add them to
the data profile. You can add the following types of functions to a data
profile:
Source-level functions: Perform calculations on two or more
columns of a source, source group, or mapplet group.
Column-level functions: Perform calculations on one column in a
source.
Intersource functions: Perform calculations on two or more
sources, source groups, or mapplet groups.
Source-Level Functions
Source-level functions perform calculations on two or more columns of a
source, source group, or mapplet group.
Column-Level Functions
Column-level functions perform a calculation on one column in a source. You
can add the following column-level functions to a data profile:
Business Rule Validation: Calculates the number of rows in a
single source column that satisfy and do not satisfy a specified
business rule, and evaluates those rows that do satisfy the business
rule.
Domain Validation: Calculates the number of values in the
profile source column that fall within a specified domain and the
number of values that do not. When you create a Domain Validation
function, you include domains in the function.
Domain Inference: Reads all values in the column and infers a
pattern that fits the data.
Aggregate Functions: Calculates an aggregate value for a
numeric or string value in a column.
Distinct Value Count: Returns the number of distinct values for
the column.
When you specify a column-level function on the Function Details page of the
Profile Wizard, the Profile Wizard prompts you to configure the function. The
options available on the Function Role Details page for column-level
functions depend on the function you select.
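To make the column-level functions more concrete, the following Python sketch shows one plausible reading of three of them: Domain Validation, Domain Inference, and an aggregate. It does not reproduce the product's algorithms. In particular, the pattern alphabet used for inference (9 for a digit, X for a letter) and all function names are assumptions chosen for illustration.

```python
import re
from collections import Counter


def domain_validation(values, domain):
    """Count the values that fall inside and outside a set of allowed values."""
    inside = sum(1 for v in values if v in domain)
    return {"in_domain": inside, "out_of_domain": len(values) - inside}


def infer_pattern(value):
    """Reduce a string to a pattern: digits become 9, ASCII letters become X."""
    digits_masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "X", digits_masked)


def domain_inference(values):
    """Return the most common pattern in the column and how many values match it."""
    patterns = Counter(infer_pattern(v) for v in values)
    return patterns.most_common(1)[0]


def aggregate(values):
    """Compute simple aggregates for a non-empty numeric column."""
    return {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}
```

For example, `domain_inference(["12345", "98765", "ABC"])` reports the pattern `99999` with two matching values, which suggests that the column mostly holds five-digit codes.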
Intersource Functions
Intersource functions perform calculations on two or more sources, source
groups from different sources, or mapplet output groups, and generate
information about their relationship. You can add the following intersource
functions to a data profile:
Referential Integrity Analysis: Compares the values of columns
in two sources to determine orphan values.
Join Complexity Evaluation: Measures the columns in multiple
sources that satisfy a join condition.
Intersource Structure Analysis: Determines the primary key-foreign key
relationships between sources.
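As a rough illustration of the first two intersource functions, the Python sketch below finds orphan values and counts join matches between two key columns. It is a simplified interpretation with hypothetical function names. The actual calculations and report contents in Data Profiling may differ.

```python
from collections import Counter


def orphan_values(child_values, parent_values):
    """Return the distinct child values that have no match in the parent column."""
    parents = set(parent_values)
    return sorted({v for v in child_values if v not in parents})


def join_match_counts(left_keys, right_keys):
    """For each distinct left key, count the right rows that satisfy an equality join."""
    right_counts = Counter(right_keys)
    # A Counter returns 0 for keys it has not seen, so unmatched keys map to 0.
    return {key: right_counts[key] for key in sorted(set(left_keys))}
```

A count of 0 in the second function marks a key with no match, and counts greater than 1 indicate a one-to-many relationship for that key.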