Data Profiling Overview

This document provides an overview of data profiling: 1) Data profiling analyzes source data content, quality, and structure to make initial assessments, validate rules, and verify assumptions and reports. 2) The profiling process involves creating a profile, running a session to generate results stored in a warehouse, and viewing reports on the results. 3) Profiles define functions applied to source data, and sessions generate metadata written to the warehouse, where different tools access reports.

Uploaded by

ypraju
Copyright
© Attribution Non-Commercial (BY-NC)

Data Profiling Overview

By PenchalaRaju.Yanamala

This chapter includes the following topics:

Understanding Data Profiling
Steps for Profiling Source Data
Using the Profile Manager

Understanding Data Profiling

Data profiling is a technique used to analyze the content, quality, and structure of
source data. Use PowerCenter Data Profiling to detect patterns and exceptions
of source data during mapping development and during production. Use data
profiling to make the following types of analyses:

Make initial assessments. You can make initial assessments about data
patterns and exception data during mapping development. As a result, you can
design mappings and workflows based on actual data rather than on theoretical
assumptions about sources.
Validate business rules. You can validate documented business rules about
the source data. For example, if you have a business rule requiring columns in a
source table to contain U.S. zip codes, you can profile the source data to verify
that the rows in this table contain the proper values.
Verify assumptions. You can verify that the initial assumptions you made
about source data during project development are still valid. For example, you
may want to view statistics about how many rows satisfied a business rule and
how many did not.
Verify report validity. You can use data profiling to verify the validity of the
Business Intelligence (BI) reports.
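
To illustrate the kind of check behind the zip code example above, the following is a minimal sketch in Python, not PowerCenter code. It assumes that a valid U.S. zip code is five digits with an optional four-digit extension. The function name profile_zip_codes and the sample values are hypothetical.

```python
import re

# Assumption: a valid U.S. zip code is five digits, optionally followed by
# a hyphen and four more digits (ZIP+4).
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def profile_zip_codes(values):
    """Count the values that satisfy the zip code rule and collect the exceptions."""
    valid = 0
    exceptions = []
    for value in values:
        if value is not None and ZIP_PATTERN.fullmatch(value.strip()):
            valid += 1
        else:
            exceptions.append(value)
    return {
        "rows": len(values),
        "valid": valid,
        "invalid": len(exceptions),
        "exceptions": exceptions,
    }


# Hypothetical column values, for illustration only.
sample = ["02134", "94105-1234", "9410", "ABCDE", None, "10001"]
result = profile_zip_codes(sample)
print(result)
```

The counts correspond to the statistics about how many rows satisfied a business rule and how many did not. The exceptions list corresponds to the rows you would investigate further.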

Data Profiling Components

To understand data profiling, you need to be familiar with the following
components:

PowerCenter Client. Use the PowerCenter Client to create and manage data
profiles.
PowerCenter Data Profile. Metadata that you generate in the PowerCenter
Client that defines what types of statistics you want to collect for a source. It
consists of a source definition, a profile mapping, and a profile session.
Data Profiling warehouse. The Data Profiling warehouse stores results from
profile sessions and reports that you run to view the results.
Data Profiling reports. View data and metadata in Data Profiling reports.

PowerCenter Client

Use the following PowerCenter Client tools to create and manage data profiles:

Designer. Create data profiles from the Source Analyzer or the Mapplet
Designer. When you create a data profile, the Designer generates a profile
mapping based on the profile functions. The PowerCenter repository stores the
profile mappings and metadata. If the repository is versioned, profile mappings
are versioned in the same way other PowerCenter mappings are versioned.
Profile Manager. A tool in the PowerCenter Designer that you use to manage
data profiles. You can edit and regenerate profiles, run profile sessions, and
view profile results.

PowerCenter Data Profile

A data profile contains the source definitions, the functions and function
parameters, and the profile session run parameters. To create a data profile, you
run the Profile Wizard from the PowerCenter Designer. When you create a data
profile, you create the following repository objects:

Profile. A profile is a repository object that represents all the metadata
configured in the wizard. You create the profile based on a mapplet or source
definition and a set of functions.
Profile mapping. When you create a data profile, the Profile Wizard generates
a profile mapping. Select functions in the wizard to help determine the
content, structure, and quality of the profile source. You can use pre-defined or
custom functions. The Profile Wizard creates transformations and adds targets
based on the functions that you supply. You can view the profile mapping in the
Mapping Designer.
Profile session. After the Profile Wizard generates a profile mapping, you
provide basic session information such as Integration Service name and
connection information to the source and the Data Profiling warehouse. The
Profile Wizard creates a profile session and a profile workflow. You can
choose to run the profile session when the wizard completes, or you can run it
later. When you run a profile session, the Integration Service writes profile
results to the Data Profiling warehouse.

While profiles are not versioned, the profile mappings and profile sessions are
versioned objects.

Data Profiling Warehouse

The Data Profiling warehouse is a set of tables that stores the results from profile
sessions. It also contains reports that you run to view the profile session results.
You can create a Data Profiling warehouse on any relational database that
PowerCenter supports as a source or target database. Create a Data Profiling
warehouse for each PowerCenter repository you want to store data profiles in.

Data Profiling Reports

You can view the results of each function configured in the data profile. Based on
the type of metadata you want to view, you can view reports from the following
tools:

Profile Manager. PowerCenter Data Profiling reports provide information about
the latest session run. View them from the Profile Manager.
Data Analyzer. Data Analyzer Data Profiling reports provide composite,
metadata, and summary reports. View them from the Data Profiling dashboard
in Data Analyzer. You can also customize the reports in Data Analyzer.

Data Profiling Connectivity


PowerCenter Data Profiling uses the following types of connectivity:

TCP/IP. The PowerCenter Client and the Integration Service use native protocol
to communicate with the Repository Service.
Native. The Integration Service uses native database connectivity to connect to
the Data Profiling warehouse when it loads target data from the profiling
sessions.
ODBC. The PowerCenter Client uses ODBC to connect to the Data Profiling
warehouse when you run data profiling reports from the Profile Manager.
JDBC. Data Analyzer uses JDBC to connect to the Data Profiling warehouse
when you run data profiling reports.

The following steps describe the data profiling process shown in Figure 1-2:

1. Create a data profile. Use the Profile Wizard in the Designer to create a data
   profile based on a source definition and a set of functions. The Profile Wizard
   generates a mapping and a session based on criteria that you provide.
2. Run the profile session. You can choose to run the profile session when you
   finish the Profile Wizard, or you can run it from the Profile Manager. The
   Integration Service runs the session and loads the profile results to the Data
   Profiling warehouse.
3. View the reports. View the Data Profiling report associated with the profile
   session. Based on the type of profile report, you can view reports from the
   Profile Manager or from Data Analyzer.

Steps for Profiling Source Data

After you create the Data Profiling warehouse, you create data profiles in
PowerCenter. A data profile contains functions that perform calculations on the
source data. When you create a data profile, the Designer generates a profile
mapping and a profile session.
You can run profile sessions against the mapping to gather information about
source data. The Data Profiling warehouse stores the results of profile sessions.
After you run profile sessions, you can view reports that display the session
results.

Complete the following tasks to profile a source, mapplet, or groups in a source
or mapplet:

1. Create a data profile.
2. Run a profile session.
3. View profile reports.

The Designer provides a Profile Manager and Profile Wizard to complete these
tasks.

Step 1. Create a Data Profile

To profile source data, you create a data profile based on a source or mapplet in
the repository. Data profiles contain functions that perform calculations on the
source data. For example, you can use a function to validate business rules in a
data profile. You can apply profile functions to a column within a source, to a
single source, or to multiple sources.

You can create the following types of data profiles:

Auto profile. Contains a predefined set of functions for profiling source data.
Use an auto profile during mapping development to learn more about source
data.
Custom profile. A data profile you define with the functions you need to profile
source data. Use a custom profile during mapping development to validate
documented business rules about the source data. You can also use a custom
profile to monitor data quality or validate the results of BI reports.

You use the Designer to create a data profile. When you create a profile, the
Designer generates a mapping and a session based on the profile information.

You can configure a data profile to write verbose data to the Data Profiling
warehouse during a profile session. Verbose data provides more details about
the data that results from a profile function. For example, for a function that
validates business rules, verbose data may include the invalid rows in the
source. For a function that determines the number of distinct values, verbose
data can include a list of distinct values.
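
The difference between summary results and verbose data can be sketched in Python as follows. This is an illustration only and not how PowerCenter implements profile functions. The function name distinct_value_count and its verbose parameter are hypothetical.

```python
from collections import Counter


def distinct_value_count(values, verbose=False):
    """Return the number of distinct values; with verbose=True, also list them."""
    counts = Counter(values)
    result = {"distinct_count": len(counts)}
    if verbose:
        # Verbose data: the distinct values themselves, sorted for stable output.
        result["distinct_values"] = sorted(counts)
    return result


# Hypothetical column values, for illustration only.
reasons = [
    "Acquired new books.",
    "Purchased ticket to Boston.",
    "Acquired new books.",
]
summary = distinct_value_count(reasons)
detailed = distinct_value_count(reasons, verbose=True)
print(summary)
print(detailed)
```

The summary result holds only the count. The verbose result also carries the list of distinct values, which is the extra detail that a verbose profile writes to the warehouse.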

After you create a data profile, you can view profile details from the Profile
Manager. You can also edit and delete the data profile.

Step 2. Run the Profile Session

After you create a data profile, you can run the profile session. The Integration
Service writes the profile session results to the Data Profiling warehouse.

You can run profile sessions from the following places:

Profile Manager. You can create and run temporary and persistent profile
sessions from the Profile Manager. A temporary session runs on demand and is
not stored in the repository. A persistent session can run on demand and is
stored in the repository.
Workflow Manager. If you create a persistent profile session when you create
the data profile, you can edit and run the profile workflow from the Workflow
Manager.

Step 3. View Data Profiling Reports

When you run a profile session, the Integration Service loads the session results
to the Data Profiling warehouse. You can view the session results using
PowerCenter Data Profiling reports. You can view reports on the profile session
results in the following ways:

View PowerCenter Data Profiling reports from the Profile Manager.
View customizable reports in Data Analyzer.

Using the Profile Manager

The Profile Manager is a tool in the Designer that helps you manage data
profiles. Use the Profile Manager to set default data profile options, work with
data profiles in the repository, run profile sessions, view profile results, and view
sources and mapplets with at least one profile defined for them. When you
launch the Profile Manager, you can access profile information for the open
folders in the repository.

There are two views in the Profile Manager:

Profile View. The Profile View tab displays the data profiles in the open folders
in the repository.
Source View. The Source View tab displays the source definitions in the open
folders in the repository for which you have defined data profiles.

Note: If the repository folder is read-only, you can view and run data profiles in
the Profile View. You can also view Data Profiling reports. You cannot edit or
delete data profiles.

From the Profile View and the Source View, you can complete the following tasks
to manage, run, and view data profiles:

Create a custom profile.
View data profile details.
Edit a data profile.
Delete a data profile.
Run a session.
Regenerate a profile mapping.
Check in profile mappings.
Configure default data profile options.
Configure domains for profile functions.
Purge the Data Profiling warehouse.
Display the status of interactive sessions.
Display PowerCenter Data Profiling reports.

The Profile Manager launches immediately after you create a data profile. You
can manually launch the Profile Manager from the following Designer tools:

Source Analyzer. Click Sources > Profiling > Launch Profile Manager.
Mapplet Designer. Click Mapplets > Profiling > Launch Profile Manager.
Repository Navigator. Open a folder and select a source definition. Right-click
on the source definition and select Launch Profile Manager.

Tip: If you do not want the Profile Manager to launch immediately after you
create a data profile, you can change the default data profile options in the Profile
Manager.

Profile View

The Profile View tab displays all of the data profiles in the open folder in the
repository. Use the Profile View to determine the data profiles that exist for a
particular repository folder.

Source View

The Source View displays the source definitions with data profiles in the open
folder in the repository. A folder must be open before you can launch Profile
Manager. Use the Source View to determine if a source definition already has
data profiles defined. The Source View shows if the data profile is an auto profile
or custom profile.

You can also use the Source View when you want to work with a data profile but
are more familiar with the source name than the data profile name. For example,
you want to run a profile session, and you know the source definition name but
not the data profile name.

When you select the Source View tab in the Profile Manager, the Profile
Navigator displays data profiles as nodes under the source definition for which
you defined the data profile.

If you change or delete a data profile or a source or mapplet with a data profile,
you can click View > Refresh to refresh the Source View.

Using Mapplets to Extend Data Profiling Functions

A function can operate on a column, source, or multiple sources. Sometimes, you
need to combine data from multiple sources or multiple columns before you can
apply a particular function to it. Or, you may need to aggregate data to get the
profiling results you want. For example, you want to create a Business Rule
Validation function that operates on aggregate values from a source. You need to
aggregate the values before you can profile the data using the Business Rule
Validation function.

Use a mapplet when you want to profile the following information:

Aggregate data from a single source
Data from two or more sources with one or more matching ports
Data from two sources with all matching ports

Extending Data Profiling Functionality with Mapplets

Complete the following steps to extend data profiling functionality with mapplets:

1. Create the mapplet. Create a mapplet to aggregate data or join or merge
   sources.
2. Create a data profile using the mapplet output data as a source. Create an
   auto profile based on the mapplet output data. Or, create a custom profile
   based on the mapplet output data and add functions to the data profile.
3. Run the data profile. Run the profile session. When you run the profile
   session from the Profile Manager, it processes the mapplet data as it would if
   you were running a workflow. You do not need to run a workflow to aggregate
   or join the data.
4. View the Data Profiling report. Open the Data Profiling report in the Profile
   Manager or Data Analyzer to view the results.

Profiling Aggregate Data

When you want to profile aggregate data, use a mapplet to aggregate the data
before you create the data profile.

Use the mapplet to aggregate data when you want to use a column-level function
on aggregate data. For example, you have an Employee Expenses flat file
source that provides information on employee expenditures.

The following example shows the data from the Employee Expenses flat file
source:

EID   Spending Date   Amount   Reason
12    12/3/2003       123.22   Acquired new books.
19    4/09/2004       600.21   Purchased ticket to Boston.
213   6/29/2004       215.61   Purchased new software.
12    6/12/2004       921.56   Acquired new books.
19    6/16/2004       740.21   Purchased ticket to New York.
21    7/21/2004       712.88   Purchased a new computer.

To test data consistency, you want to create a data profile that shows employees
who spent over $1,000 in the last six months. To get this information, aggregate
and filter the spending amounts before you create a data profile.

Creating the Mapplet

Create a mapplet to aggregate and filter the data. Later, use the Profile Manager
to create a data profile for the mapplet.

When you create the mapplet, add a Filter transformation to filter out purchases
older than six months. Add the following condition to the Filter transformation:

DATE_DIFF (SYSDATE,SPENDING_DATE,'MM')<6

After you filter the data, add an Aggregator transformation to aggregate the
cumulative spending for each employee.
In the Aggregator transformation, add the cumulative_emp_spending output port
to aggregate the total amount of money spent by each employee. Group the
results by employee ID (EID), to see the total for each employee. Connect the
EID port and the cumulative_emp_spending port to the Output transformation.
You can then use the Profile Manager to profile the mapplet output data.

Note: You do not need to run a session to generate the correct mapplet output
data. When you run a profile session, the Profile Manager processes the mapplet
before it profiles the mapplet output data.
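
The filter-then-aggregate logic of this mapplet can be sketched in Python as follows. This is an illustration of the data flow only, not a PowerCenter mapplet. It makes three assumptions: the spending dates in the example are in month/day/year order, the run date is August 1, 2004 (a hypothetical stand-in for SYSDATE), and the helper months_between counts whole calendar months, which may differ from how DATE_DIFF computes a month difference.

```python
from collections import defaultdict
from datetime import date

# Rows from the Employee Expenses example: (EID, spending date, amount).
# Assumption: the source dates are in month/day/year order.
EXPENSES = [
    (12, date(2003, 12, 3), 123.22),
    (19, date(2004, 4, 9), 600.21),
    (213, date(2004, 6, 29), 215.61),
    (12, date(2004, 6, 12), 921.56),
    (19, date(2004, 6, 16), 740.21),
    (21, date(2004, 7, 21), 712.88),
]


def months_between(later, earlier):
    """Whole calendar months from earlier to later (an approximation of DATE_DIFF)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return months


def cumulative_spending(rows, as_of):
    """Drop purchases six or more months old, then total the amounts per EID."""
    totals = defaultdict(float)
    for eid, spending_date, amount in rows:
        if months_between(as_of, spending_date) < 6:  # Filter step
            totals[eid] += amount  # Aggregator step, grouped by EID
    return {eid: round(total, 2) for eid, total in totals.items()}


# Hypothetical run date standing in for SYSDATE.
totals = cumulative_spending(EXPENSES, as_of=date(2004, 8, 1))
print(totals)
```

With this run date, the December 2003 purchase falls outside the six-month window, so only the remaining five rows contribute to the per-employee totals.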

Creating the Data Profile

After you create the mapplet to filter and aggregate the data, you can profile the
mapplet output data. From the Profile Manager, create a custom profile.

The custom profile locates the employees who spent more than $1,000 in the last
six months using the source-level Business Rule Validation function. Create a
Business Rule Validation function using the following expression:

cumulative_emp_spending > 1000

When you specify the type of verbose data to load to the Data Profiling
warehouse, select valid rows only. This ensures that you can view the verbose
data for the employees who spent over $1,000 in the last six months.
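
As a rough illustration of what the rule and the valid-rows-only setting produce, the following Python sketch applies the expression to aggregated totals. It is not PowerCenter code. The function validate_business_rule and its verbose parameter are hypothetical, and the totals are assumed mapplet output chosen to be consistent with this example, in which only employee 19 exceeds $1,000.

```python
def validate_business_rule(rows, rule, verbose="valid"):
    """Count rows that pass or fail the rule and keep the requested verbose rows."""
    valid = [row for row in rows if rule(row)]
    invalid = [row for row in rows if not rule(row)]
    result = {"valid_count": len(valid), "invalid_count": len(invalid)}
    if verbose == "valid":
        result["verbose_rows"] = valid
    elif verbose == "invalid":
        result["verbose_rows"] = invalid
    return result


# Assumed mapplet output: one row per employee with the six-month total.
mapplet_output = [
    {"EID": 12, "cumulative_emp_spending": 921.56},
    {"EID": 19, "cumulative_emp_spending": 1340.42},
    {"EID": 21, "cumulative_emp_spending": 712.88},
    {"EID": 213, "cumulative_emp_spending": 215.61},
]

# Python equivalent of the expression: cumulative_emp_spending > 1000
report = validate_business_rule(
    mapplet_output,
    rule=lambda row: row["cumulative_emp_spending"] > 1000,
    verbose="valid",
)
print(report)
```

Because only valid rows are kept as verbose data, the result lists the employees who satisfy the rule and omits everyone else.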

After you create the Data Profile, you can run a profile session.

Viewing the Data Profiling Report

After you run a profile session, you can view the profile results in a PowerCenter
Data Profiling report or a Data Analyzer Data Profiling report. For example, when
you view a PowerCenter Data Profiling report, you can view the verbose data for
the rows that satisfy the Business Rule Validation function. You can see that the
employee with employee ID 19 has spent $1,340.42. All other employees have
spent under $1,000, and therefore do not appear in the Verbose Report
Summary.

Profiling Multiple Sources with One Matching Port

To profile two related sources with one or more matching ports, you can create a
mapplet with a Joiner transformation to join the sources. Then, profile the
mapplet output data.

Creating the Mapplet

For example, you have an Items relational source and a Manufacturers relational
source. The Items relational source contains information about items, such as
item description, wholesale cost, and item price. The Manufacturers relational
source contains information about the manufacturers who manufacture the items.
You want to find the manufacturers whose items sell with a markup that is
greater than 50 percent of the wholesale cost.

You need to join the two sources before you can create a data profile. Create a
mapplet using a Joiner transformation to join the two sources.
Use the following join condition:

MANUFACTURER_ID1 = MANUFACTURER_ID

Creating the Data Profile

After you join the sources, you can profile the mapplet output data to find the
manufacturers whose items sell with a markup that is greater than 50 percent of
the wholesale cost.

You create a custom profile with a source-level Business Rule Validation function
and enter the following expression in the Rule Editor:

PRICE > (WHOLESALE_COST +(WHOLESALE_COST * .50))

When you specify the type of verbose data to load to the Data Profiling
warehouse, select valid rows only. You can see the verbose data for the rows
that meet the business rule.
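
The join and the markup rule can be sketched together in Python as follows. This is an illustration only, not a PowerCenter mapplet. All rows are hypothetical sample data, the MANUFACTURER_NAME and ITEM fields are invented for the example, and the sketch assumes that MANUFACTURER_ID1 is the port from the Items source.

```python
# Hypothetical Items rows. Assumption: MANUFACTURER_ID1 is the Items port.
ITEMS = [
    {"ITEM": "Desk lamp", "MANUFACTURER_ID1": 1, "WHOLESALE_COST": 10.00, "PRICE": 18.00},
    {"ITEM": "Stapler", "MANUFACTURER_ID1": 1, "WHOLESALE_COST": 4.00, "PRICE": 5.00},
    {"ITEM": "Office chair", "MANUFACTURER_ID1": 2, "WHOLESALE_COST": 80.00, "PRICE": 120.00},
]

# Hypothetical Manufacturers rows.
MANUFACTURERS = [
    {"MANUFACTURER_ID": 1, "MANUFACTURER_NAME": "Acme"},
    {"MANUFACTURER_ID": 2, "MANUFACTURER_NAME": "Globex"},
]


def join_sources(items, manufacturers):
    """Inner join on MANUFACTURER_ID1 = MANUFACTURER_ID."""
    by_id = {m["MANUFACTURER_ID"]: m for m in manufacturers}
    joined = []
    for item in items:
        manufacturer = by_id.get(item["MANUFACTURER_ID1"])
        if manufacturer is not None:
            joined.append({**item, **manufacturer})
    return joined


def exceeds_markup(row):
    """PRICE > (WHOLESALE_COST + (WHOLESALE_COST * .50))"""
    return row["PRICE"] > row["WHOLESALE_COST"] + row["WHOLESALE_COST"] * 0.50


joined = join_sources(ITEMS, MANUFACTURERS)
valid_rows = [row for row in joined if exceeds_markup(row)]
high_markup_manufacturers = sorted({row["MANUFACTURER_NAME"] for row in valid_rows})
print(high_markup_manufacturers)
```

Note that an item priced at exactly 150 percent of its wholesale cost does not satisfy the rule, because the comparison is strictly greater than.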

Viewing the Data Profiling Report

Once you create a data profile, you can run a profile session. After you run a
profile session, you can view the profile results in a PowerCenter Data Profiling
report or Data Analyzer Data Profiling report. For example, when you view a
PowerCenter Data Profiling report, you can see the information for the
companies whose items sell with a markup greater than 50 percent of the
wholesale cost.
