Data Profiling Overview
By PenchalaRaju.Yanamala
Data profiling is a technique used to analyze the content, quality, and structure of
source data. Use PowerCenter Data Profiling to detect patterns and exceptions
in source data during mapping development and during production. Use data
profiling to make the following types of analyses:
Make initial assessments. You can make initial assessments about data
patterns and exceptions during mapping development. As a result, you can
design mappings and workflows on actual data, rather than make theoretical
assumptions about sources.
Validate business rules. You can validate documented business rules about
the source data. For example, if you have a business rule requiring columns in a
source table to contain U.S. zip codes, you can profile the source data to verify
that the rows in this table contain the proper values.
Verify assumptions. You can verify that the initial assumptions you made
about source data during project development are still valid. For example, you
may want to view statistics about how many rows satisfied a business rule and
how many did not.
Verify report validity. You can use data profiling to verify the validity of the
Business Intelligence (BI) reports.
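The zip code example above can be sketched outside PowerCenter to show what validating a business rule amounts to. This is a minimal Python illustration, not the PowerCenter implementation: the function name, the sample rows, and the assumption that a valid U.S. zip code is five digits with an optional four-digit extension are all hypothetical.

```python
import re

# Assumed rule: five digits, optionally followed by a hyphen and four digits (ZIP+4).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def profile_zip_codes(rows, column):
    """Split rows into those that satisfy the zip code rule and those that do not."""
    valid, invalid = [], []
    for row in rows:
        value = row.get(column)
        if isinstance(value, str) and ZIP_PATTERN.fullmatch(value):
            valid.append(row)
        else:
            invalid.append(row)
    return {
        "valid_count": len(valid),
        "invalid_count": len(invalid),
        "invalid_rows": invalid,
    }


# Hypothetical sample rows, not taken from any real source table.
sample_rows = [
    {"customer_id": 1, "zip": "94105"},
    {"customer_id": 2, "zip": "94105-1234"},
    {"customer_id": 3, "zip": "9410"},   # too short
    {"customer_id": 4, "zip": None},     # missing value
]

result = profile_zip_codes(sample_rows, "zip")
print(result["valid_count"], result["invalid_count"])
```

The counts of rows that satisfied and did not satisfy the rule are the kind of statistic the analyses above rely on.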
PowerCenter Data Profiling includes the following components:
PowerCenter Client. Use the PowerCenter Client to create and manage data
profiles.
PowerCenter Data Profile. Metadata that you generate in the PowerCenter
Client that defines the types of statistics you want to collect for a source. It
consists of a source definition, a profile mapping, and a profile session.
Data Profiling warehouse. The Data Profiling warehouse stores results from
profile sessions and reports that you run to view the results.
Data Profiling reports. View data and metadata in Data Profiling reports.
PowerCenter Client
Use the following PowerCenter Client tools to create and manage data profiles:
Designer. Create data profiles from the Source Analyzer or the Mapplet
Designer. When you create a data profile, the Designer generates a profile
mapping based on the profile functions. The PowerCenter repository stores the
profile mappings and metadata. If the repository is versioned, profile mappings
are versioned in the same way other PowerCenter mappings are versioned.
Profile Manager. A tool in the PowerCenter Designer that you use to manage
data profiles. You can edit and regenerate profiles, run profile sessions, and
view profile results.
A data profile contains the source definitions, the functions and function
parameters, and the profile session run parameters. To create a data profile, you
run the Profile Wizard from the PowerCenter Designer. When you create a data
profile, you create the following repository objects:
While profiles are not versioned, the profile mappings and profile sessions are
versioned objects.
The Data Profiling warehouse is a set of tables that stores the results from profile
sessions. It also contains reports that you run to view the profile session results.
You can create a Data Profiling warehouse on any relational database that
PowerCenter supports as a source or target database. Create a Data Profiling
warehouse for each PowerCenter repository you want to store data profiles in.
You can view the results of each function configured in the data profile. Based on
the type of metadata you want to view, you can view reports from the following
tools:
TCP/IP. The PowerCenter Client and the Integration Service use native protocol
to communicate with the Repository Service.
Native. The Integration Service uses native database connectivity to connect to
the Data Profiling warehouse when it loads target data from the profiling
sessions.
ODBC. The PowerCenter Client uses ODBC to connect to the Data Profiling
warehouse when you run data profiling reports from the Profile Manager.
JDBC. Data Analyzer uses JDBC to connect to the Data Profiling warehouse
when you run data profiling reports.
The following steps describe the data profiling process shown in Figure 1-2:
1. Create a data profile. Use the Profile Wizard in the Designer to create a data
profile based on a source definition and a set of functions. The Profile Wizard
generates a mapping and a session based on criteria that you provide.
2. Run the profile session. You can choose to run the profile session when you
finish the Profile Wizard, or you can run it from the Profile Manager. The
Integration Service runs the session and loads the profile results to the Data
Profiling warehouse.
3. View the reports. View the Data Profiling report associated with the profile
session. Based on the type of profile report, you can view reports from the
Profile Manager or from Data Analyzer.
After you create the Data Profiling warehouse, you create data profiles in
PowerCenter. A data profile contains functions that perform calculations on the
source data. When you create a data profile, the Designer generates a profile
mapping and a profile session.
You can run profile sessions against the mapping to gather information about
source data. The Data Profiling warehouse stores the results of profile sessions.
After you run profile sessions, you can view reports that display the session
results.
The Designer provides a Profile Manager and Profile Wizard to complete these
tasks.
To profile source data, you create a data profile based on a source or mapplet in
the repository. Data profiles contain functions that perform calculations on the
source data. For example, you can use a function to validate business rules in a
data profile. You can apply profile functions to a column within a source, to a
single source, or to multiple sources.
Auto profile. Contains a predefined set of functions for profiling source data.
Use an auto profile during mapping development to learn more about source
data.
Custom profile. A data profile you define with the functions you need to profile
source data. Use a custom profile during mapping development to validate
documented business rules about the source data. You can also use a custom
profile to monitor data quality or validate the results of BI reports.
You use the Designer to create a data profile. When you create a profile, the
Designer generates a mapping and a session based on the profile information.
You can configure a data profile to write verbose data to the Data Profiling
warehouse during a profile session. Verbose data provides more details about
the data that results from a profile function. For example, for a function that
validates business rules, verbose data may include the invalid rows in the
source. For a function that determines the number of distinct values, verbose
data can include a list of distinct values.
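The difference between summary and verbose results can be illustrated with a small Python sketch of a distinct-value count. The function name, the `verbose` flag, and the sample rows are hypothetical. They mimic the idea described above and do not reflect how PowerCenter stores results in the Data Profiling warehouse.

```python
from collections import Counter


def distinct_value_count(rows, column, verbose=False):
    """Return the number of distinct values in a column.

    When verbose is True, also return each distinct value with its frequency.
    """
    counts = Counter(row[column] for row in rows)
    result = {"distinct_count": len(counts)}
    if verbose:
        # Verbose output: the distinct values themselves, sorted for readability.
        result["distinct_values"] = sorted(counts.items())
    return result


# Hypothetical sample rows.
sample_rows = [
    {"state": "CA"},
    {"state": "NY"},
    {"state": "CA"},
    {"state": "TX"},
]

summary = distinct_value_count(sample_rows, "state")
detailed = distinct_value_count(sample_rows, "state", verbose=True)
print(summary)
print(detailed)
```

The summary answers "how many distinct values are there", while the verbose result also shows which values they are.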
After you create a data profile, you can view profile details from the Profile
Manager. You can also edit and delete the data profile.
After you create a data profile, you can run the profile session. The Integration
Service writes the profile session results to the Data Profiling warehouse.
Profile Manager. You can create and run temporary and persistent profile
sessions from the Profile Manager. A temporary session runs on demand and is
not stored in the repository. A persistent session can run on demand and is
stored in the repository.
Workflow Manager. If you create a persistent profile session when you create
the data profile, you can edit and run the profile workflow from the Workflow
Manager.
When you run a profile session, the Integration Service loads the session results
to the Data Profiling warehouse. You can view the session results using
PowerCenter Data Profiling reports. You can view reports on the profile session
results in the following ways:
The Profile Manager is a tool in the Designer that helps you manage data
profiles. Use the Profile Manager to set default data profile options, work with
data profiles in the repository, run profile sessions, view profile results, and view
sources and mapplets with at least one profile defined for them. When you
launch the Profile Manager, you can access profile information for the open
folders in the repository.
Profile View. The Profile View tab displays the data profiles in the open folders
in the repository.
Source View. The Source View tab displays the source definitions in the open
folders in the repository for which you have defined data profiles.
Note: If the repository folder is read-only, you can view and run data profiles in
the Profile View. You can also view Data Profiling reports. You cannot edit or
delete data profiles.
From the Profile View and the Source View, you can complete the following tasks
to manage, run, and view data profiles:
Source Analyzer. Click Sources > Profiling > Launch Profile Manager.
Mapplet Designer. Click Mapplets > Profiling > Launch Profile Manager.
Repository Navigator. Open a folder and select a source definition. Right-click
the source definition and select Launch Profile Manager.
Tip: If you do not want the Profile Manager to launch immediately after you
create a data profile, you can change the default data profile options in the Profile
Manager.
Profile View
The Profile View tab displays all of the data profiles in the open folder in the
repository. Use the Profile View to determine the data profiles that exist for a
particular repository folder.
Source View
The Source View displays the source definitions with data profiles in the open
folder in the repository. A folder must be open before you can launch the Profile
Manager. Use the Source View to determine if a source definition already has
data profiles defined. The Source View shows if the data profile is an auto profile
or custom profile.
You can also use the Source View when you want to work with a data profile but
are more familiar with the source name than the data profile name. For example,
you want to run a profile session, and you know the source definition name but
not the data profile name.
When you select the Source View tab in the Profile Manager, the Profile
Navigator displays data profiles as nodes under the source definition for which
you defined the data profile.
If you change or delete a data profile or a source or mapplet with a data profile,
you can click View > Refresh to refresh the Source View.
Complete the following steps to extend data profiling functionality with mapplets:
When you want to profile aggregate data, use a mapplet to aggregate the data
before you create the data profile.
Use the mapplet to aggregate data when you want to use a column-level function
on aggregate data. For example, you have an Employee Expenses flat file
source that provides information on employee expenditures.
The following example shows the data from the Employee Expenses flat file
source:
To test data consistency, you want to create a data profile that shows employees
who spent over $1,000 in the last six months. To get this information, aggregate
and filter the spending amounts before you create a data profile.
Create a mapplet to aggregate and filter the data. Later, use the Profile Manager
to create a data profile for the mapplet.
When you create the mapplet, add a Filter transformation to filter out purchases
older than six months. Add the following condition to the Filter transformation:
DATE_DIFF(SYSDATE, SPENDING_DATE, 'MM') < 6
After you filter the data, add an Aggregator transformation to aggregate the
cumulative spending for each employee.
In the Aggregator transformation, add the cumulative_emp_spending output port
to aggregate the total amount of money spent by each employee. Group the
results by employee ID (EID) to see the total for each employee. Connect the
EID port and the cumulative_emp_spending port to the Output transformation.
You can then use the Profile Manager to profile the mapplet output data.
Note: You do not need to run a session to generate the correct mapplet output
data. When you run a profile session, the Profile Manager processes the mapplet
before it profiles the mapplet output data.
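The filter-then-aggregate logic of the mapplet can be sketched in Python to make the data flow concrete. This is an illustration under stated assumptions, not what the Integration Service executes: `months_between` is only an approximation standing in for DATE_DIFF with the 'MM' format, a fixed `as_of` date replaces SYSDATE so the result is repeatable, and the AMOUNT column name and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def months_between(later, earlier):
    """Approximate difference in months (assumed stand-in for DATE_DIFF(..., 'MM'))."""
    whole_months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    return whole_months + (later.day - earlier.day) / 31


def aggregate_recent_spending(rows, as_of):
    """Filter out purchases six months old or older, then total spending per EID."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        # Filter step: keep rows where the month difference is less than 6.
        if months_between(as_of, row["SPENDING_DATE"]) < 6:
            # Aggregator step: cumulative spending grouped by employee ID.
            totals[row["EID"]] += row["AMOUNT"]
    return dict(totals)


# Hypothetical sample rows; AMOUNT is an assumed column name.
expenses = [
    {"EID": 12, "SPENDING_DATE": date(2024, 5, 10), "AMOUNT": Decimal("200.00")},
    {"EID": 12, "SPENDING_DATE": date(2023, 11, 1), "AMOUNT": Decimal("500.00")},
    {"EID": 19, "SPENDING_DATE": date(2024, 3, 15), "AMOUNT": Decimal("840.42")},
    {"EID": 19, "SPENDING_DATE": date(2024, 6, 1), "AMOUNT": Decimal("500.00")},
]

cumulative_emp_spending = aggregate_recent_spending(expenses, as_of=date(2024, 6, 30))
print(cumulative_emp_spending)
```

Filtering before aggregating matters here: the November purchase for employee 12 is excluded before any totals are computed, so it never contributes to that employee's cumulative spending.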
After you create the mapplet to filter and aggregate the data, you can profile the
mapplet output data. From the Profile Manager, create a custom profile.
The custom profile locates the employees who spent more than $1,000 in the last
six months using the source-level Business Rule Validation function. Create a
Business Rule Validation function using the following expression:
When you specify the type of verbose data to load to the Data Profiling
warehouse, select valid rows only. This ensures that you can view the verbose
data for the employees who spent over $1,000 in the last six months.
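The effect of selecting valid rows only can be sketched in Python. The rule below is one reading of the prose (cumulative spending greater than $1,000) and is not the expression from the original example. The function name, the `verbose_mode` parameter, and the sample rows are hypothetical.

```python
from decimal import Decimal


def validate_business_rule(rows, rule, verbose_mode="valid"):
    """Count rows that satisfy a rule and keep verbose rows of the chosen kind."""
    valid = [row for row in rows if rule(row)]
    invalid = [row for row in rows if not rule(row)]
    report = {"valid_count": len(valid), "invalid_count": len(invalid)}
    if verbose_mode == "valid":
        report["verbose_rows"] = valid
    elif verbose_mode == "invalid":
        report["verbose_rows"] = invalid
    else:
        report["verbose_rows"] = []
    return report


# Hypothetical mapplet output: one row per employee with cumulative spending.
mapplet_output = [
    {"EID": 12, "cumulative_emp_spending": Decimal("200.00")},
    {"EID": 19, "cumulative_emp_spending": Decimal("1340.42")},
    {"EID": 23, "cumulative_emp_spending": Decimal("999.99")},
]


def over_limit(row):
    # Assumed rule: the employee spent more than $1,000.
    return row["cumulative_emp_spending"] > Decimal("1000")


report = validate_business_rule(mapplet_output, over_limit, verbose_mode="valid")
print(report["valid_count"], [row["EID"] for row in report["verbose_rows"]])
```

Because the rule is written so that the rows of interest are the valid ones, keeping only valid rows in the verbose data yields exactly the employees who exceeded the limit.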
After you create the data profile, you can run a profile session.
After you run a profile session, you can view the profile results in a PowerCenter
Data Profiling report or a Data Analyzer Data Profiling report. For example, when
you view a PowerCenter Data Profiling report, you can view the verbose data for
the rows that satisfy the Business Rule Validation function. You can see that the
employee with employee ID 19 has spent $1,340.42. All other employees have
spent under $1,000, and therefore do not appear in the Verbose Report
Summary.
To profile two related sources with one or more matching ports, you can create a
mapplet with a Joiner transformation to join the sources. Then, profile the
mapplet output data.
For example, you have an Items relational source and a Manufacturers relational
source. The Items relational source contains information about items, such as
item description, wholesale cost, and item price. The Manufacturers relational
source contains information about the manufacturers who manufacture the items.
You want to find the manufacturers whose items sell with a markup that is
greater than 50 percent of the wholesale cost.
You need to join the two sources before you can create a data profile. Create a
mapplet using a Joiner transformation to join the two sources.
Use the following join condition:
MANUFACTURER_ID1 = MANUFACTURER_ID
After you join the sources, you can profile the mapplet output data to find the
manufacturers whose items sell with a markup that is greater than 50 percent of
the wholesale cost.
You create a custom profile with a source-level Business Rule Validation function
and enter the following expression in the Rule Editor:
When you specify the type of verbose data to load to the Data Profiling
warehouse, select valid rows only. You can see the verbose data for the rows
that meet the business rule.
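The join and the markup rule can be sketched together in Python. This is an illustration only: it assumes MANUFACTURER_ID1 is the key on the Items side and MANUFACTURER_ID is the key on the Manufacturers side, and the ITEM_PRICE, WHOLESALE_COST, ITEM_NAME, and MANUFACTURER_NAME column names and all sample rows are hypothetical. The markup test is one reading of the prose and is not the expression entered in the Rule Editor.

```python
from decimal import Decimal


def join_items_to_manufacturers(items, manufacturers):
    """Inner join on MANUFACTURER_ID1 = MANUFACTURER_ID (assumed key placement)."""
    by_id = {m["MANUFACTURER_ID"]: m for m in manufacturers}
    joined = []
    for item in items:
        manufacturer = by_id.get(item["MANUFACTURER_ID1"])
        if manufacturer is not None:
            joined.append({**item, **manufacturer})
    return joined


def high_markup_manufacturers(joined_rows):
    """Return manufacturers with an item whose markup exceeds 50% of wholesale cost."""
    names = set()
    for row in joined_rows:
        markup = row["ITEM_PRICE"] - row["WHOLESALE_COST"]
        if markup > Decimal("0.5") * row["WHOLESALE_COST"]:
            names.add(row["MANUFACTURER_NAME"])
    return sorted(names)


# Hypothetical sample data.
manufacturers = [
    {"MANUFACTURER_ID": 100, "MANUFACTURER_NAME": "Acme"},
    {"MANUFACTURER_ID": 200, "MANUFACTURER_NAME": "Globex"},
]
items = [
    {"ITEM_NAME": "A", "MANUFACTURER_ID1": 100,
     "WHOLESALE_COST": Decimal("10.00"), "ITEM_PRICE": Decimal("16.00")},
    {"ITEM_NAME": "B", "MANUFACTURER_ID1": 200,
     "WHOLESALE_COST": Decimal("20.00"), "ITEM_PRICE": Decimal("25.00")},
    {"ITEM_NAME": "C", "MANUFACTURER_ID1": 100,
     "WHOLESALE_COST": Decimal("8.00"), "ITEM_PRICE": Decimal("9.00")},
]

joined = join_items_to_manufacturers(items, manufacturers)
flagged = high_markup_manufacturers(joined)
print(flagged)
```

Item A has a markup of 6.00 on a wholesale cost of 10.00, which exceeds the 5.00 threshold, so its manufacturer is flagged. Items B and C fall below their thresholds.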
Once you create a data profile, you can run a profile session. After you run a
profile session, you can view the profile results in a PowerCenter Data Profiling
report or Data Analyzer Data Profiling report. For example, when you view a
PowerCenter Data Profiling report, you can see the information for the
manufacturers whose items sell with a markup greater than 50 percent of the
wholesale cost.