
Velocity 2007

Best Practices
- Configuration Management and Security
  - Configuring Security
  - Data Analyzer Security
  - Database Sizing
  - Deployment Groups
  - Migration Procedures - PowerCenter
  - Migration Procedures - PowerExchange
  - Running Sessions in Recovery Mode
  - Using PowerCenter Labels
- Data Analyzer Configuration
  - Deploying Data Analyzer Objects
  - Installing Data Analyzer
- Data Connectivity
  - Data Connectivity using PowerCenter Connect for BW Integration Server
  - Data Connectivity using PowerCenter Connect for MQSeries
  - Data Connectivity using PowerCenter Connect for SAP
  - Data Connectivity using PowerCenter Connect for Web Services
- Data Migration
  - Data Migration Principles
  - Data Migration Project Challenges
  - Data Migration Velocity Approach
- Data Quality and Profiling
  - Build Data Audit/Balancing Processes
  - Data Cleansing
  - Data Profiling
  - Data Quality Mapping Rules
  - Data Quality Project Estimation and Scheduling Factors
  - Effective Data Matching Techniques
  - Effective Data Standardizing Techniques
  - Managing Internal and External Reference Data
  - Testing Data Quality Plans
  - Tuning Data Quality Plans
  - Using Data Explorer for Data Discovery and Analysis
  - Working with Pre-Built Plans in Data Cleanse and Match
- Development Techniques
  - Designing Data Integration Architectures
  - Development FAQs
  - Event Based Scheduling
  - Key Management in Data Warehousing Solutions
  - Mapping Design
  - Mapping Templates
  - Naming Conventions
  - Performing Incremental Loads
  - Real-Time Integration with PowerCenter
  - Session and Data Partitioning
  - Using Parameters, Variables and Parameter Files
  - Using PowerCenter with UDB
  - Using Shortcut Keys in PowerCenter Designer
  - Working with JAVA Transformation Object
- Error Handling
  - Error Handling Process
  - Error Handling Strategies - Data Warehousing
  - Error Handling Strategies - General
  - Error Handling Techniques - PowerCenter Mappings
  - Error Handling Techniques - PowerCenter Workflows and Data Analyzer
- Integration Competency Centers and Enterprise Architecture
  - Planning the ICC Implementation
  - Selecting the Right ICC Model
- Metadata and Object Management
  - Creating Inventories of Reusable Objects & Mappings
  - Metadata Reporting and Sharing
  - Repository Tables & Metadata Management
  - Using Metadata Extensions
  - Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance
- Metadata Manager Configuration
  - Configuring Standard XConnects
  - Custom XConnect Implementation
  - Customizing the Metadata Manager Interface
  - Estimating Metadata Manager Volume Requirements
  - Metadata Manager Load Validation
  - Metadata Manager Migration Procedures
  - Metadata Manager Repository Administration
  - Upgrading Metadata Manager
- Operations
  - Daily Operations
  - Data Integration Load Traceability
  - High Availability
  - Load Validation
  - Repository Administration
  - Third Party Scheduler
  - Updating Repository Statistics
- Performance and Tuning
  - Determining Bottlenecks
  - Performance Tuning Databases (Oracle)
  - Performance Tuning Databases (SQL Server)
  - Performance Tuning Databases (Teradata)
  - Performance Tuning UNIX Systems
  - Performance Tuning Windows 2000/2003 Systems
  - Recommended Performance Tuning Procedures
  - Tuning and Configuring Data Analyzer and Data Analyzer Reports
  - Tuning Mappings for Better Performance
  - Tuning Sessions for Better Performance
  - Tuning SQL Overrides and Environment for Better Performance
  - Using Metadata Manager Console to Tune the XConnects
- PowerCenter Configuration
  - Advanced Client Configuration Options
  - Advanced Server Configuration Options
  - Causes and Analysis of UNIX Core Files
  - Domain Configuration
  - Managing Repository Size
  - Organizing and Maintaining Parameter Files & Variables
  - Platform Sizing
  - PowerCenter Admin Console
  - Understanding and Setting UNIX Resources for PowerCenter Installations
- PowerExchange Configuration
  - PowerExchange CDC for Oracle
  - PowerExchange Installation (for Mainframe)
- Project Management
  - Assessing the Business Case
  - Defining and Prioritizing Requirements
  - Developing a Work Breakdown Structure (WBS)
  - Developing and Maintaining the Project Plan
  - Developing the Business Case
  - Managing the Project Lifecycle
  - Using Interviews to Determine Corporate Data Integration Requirements
- Upgrades
  - Upgrading Data Analyzer
  - Upgrading PowerCenter
  - Upgrading PowerExchange
Configuring Security
Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders, sessions,
workflows, repositories, and data in order to ensure system integrity and data confidentiality.
Description
Security is an often overlooked area within the Informatica ETL domain. However, without paying close attention
to the repository security, one ignores a crucial component of ETL code management. Determining an optimal
security configuration for a PowerCenter environment requires a thorough understanding of business
requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security
functionality and facilities is also a prerequisite to security design.
Implement security with the goals of easy maintenance and scalability. When establishing repository security, keep it simple: although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:
- Create users and groups
- Define access requirements
- Grant privileges and permissions
Before implementing security measures, ask and answer the following questions:
- Who will administer the repository?
- How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
- How many environments will be supported in the repository?
- Who needs access to the repository? What do they need the ability to do?
- How will the metadata be organized in the repository? How many folders will be required?
- Where can we limit repository privileges by granting folder permissions instead?
- Who will need Administrator or Super User-type access?
After you evaluate the needs of the repository users, you can create appropriate user groups and assign repository privileges and folder permissions. In most implementations, the administrator takes care of maintaining the
repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a
development/unit test environment, it is critical for protecting the production environment.
Repository Security Overview
A security system needs to properly control access to all sources, targets, mappings, reusable transformations,
tasks, and workflows in both the test and production repositories. A successful security model needs to support
all groups in the project lifecycle and also consider the repository structure.
Informatica offers multiple layers of security, which enables you to customize the security within your data
warehouse environment. Metadata level security controls access to PowerCenter repositories, which contain
objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group
of users and the access permissions granted on each folder. Some privileges are not folder-specific; they apply to repository-level tasks and are granted by privilege alone.
Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to
the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The particular
database account and password are specified at installation and during the configuration of the Repository Service. Developers need not have knowledge of this database account and password; they should only use
their individual repository user ids and passwords. This information should be restricted to the administrator.
Other forms of security available in PowerCenter include permissions for connections. Connections include
database, FTP, and external loader connections. These permissions are useful when you want to limit access to
schemas in a relational database and can be set up in the Workflow Manager when source and target
connections are defined.
Occasionally, you may want to restrict changes to source and target definitions in the repository. A common
way to approach this security issue is to use shared folders, which are owned by an Administrator or Super
User. Granting read access to developers on these folders allows them to create read-only copies in their work
folders.
Informatica Security Architecture
The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the
repository, Repository Service, Integration Service and the command-line utilities pmrep and pmcmd.
As shown in the diagram, the repository service is the central component when using default security. It sits
between the PowerCenter repository and all client applications, including GUI tools, command line tools, and
the Integration Service. Each application must be authenticated against metadata stored in several tables within
the repository. Each Repository Service manages a single repository database where all security data is stored
as part of its metadata; this is a second layer of security. Only the Repository Service has access to this
database; it authenticates all client applications against this metadata.
Repository Service Security
Connection to the PowerCenter repository database is one level of security. The Repository Service uses native
drivers to communicate with the repository database. PowerCenter Client tools and the Integration Service
communicate with the Repository Service over TCP/IP. When a client application connects to the repository, it
connects directly to the Repository Service process. You can configure a Repository Service to run on multiple
machines, or nodes, in the domain. Each instance running on a node is called a Repository Service process.
This process accesses the database tables and performs most repository-related tasks.
When the Repository Service is installed, the database connection information is entered for the metadata
repository. At this time you need to know the database user id and password to access the metadata repository.
The database user id must be able to read and write to all tables in the database. As a developer creates, modifies, and executes mappings and sessions, this activity continuously updates the metadata in the
repository. Actual database security should be controlled by the DBA responsible for that database, in
conjunction with the PowerCenter Repository Administrator. After the Repository Service is installed and
started, all subsequent client connectivity is automatic. The database id and password are transparent at this
point.
Integration Service Security
Like the Repository Service, the Integration Service communicates with the metadata repository when it
executes workflows or when users are using Workflow Monitor. During configuration of the Integration Service,
the repository database is identified with the appropriate user id and password. Connectivity to the repository is
made using native drivers supplied by Informatica.
Certain permissions are also required to use the pmrep and pmcmd command line utilities.
Encrypting Repository Passwords
You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For example,
you can encrypt the repository and database passwords for pmrep to maintain security when using pmrep in
scripts. In addition, you can create an environment variable to store the encrypted password.
Use the following steps as a guideline to use an encrypted password as an environment variable:
1. Use the command line program pmpasswd to encrypt the repository password.
2. Configure the password environment variable to set the encrypted value.
To configure a password as an environment variable on UNIX:
1. At the command line, type:
pmpasswd <repository password>
pmpasswd returns the encrypted password.
2. In a UNIX C shell environment, type:
setenv <Password_Environment_Variable> <encrypted password>
In a UNIX Bourne shell environment, type:
<Password_Environment_Variable>=<encrypted password>
export <Password_Environment_Variable>
You can assign the environment variable any valid UNIX name.
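For example, the two UNIX steps above might be combined along the following lines. This is only a sketch: the variable name PMPASSWORD and the encrypted string shown are placeholders, and the actual encrypted value must be copied from the output that pmpasswd prints.

# Run pmpasswd once and copy the encrypted string it prints:
pmpasswd MyRepoPassword
# Bourne shell: store the copied string (the value shown is a placeholder):
PMPASSWORD=s7Fj2kQw9LxP
export PMPASSWORD
# C shell equivalent:
setenv PMPASSWORD s7Fj2kQw9LxP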
To configure a password as an environment variable on Windows:
1. At the command line, type:
pmpasswd <repository password>
pmpasswd returns the encrypted password.
2. Enter the password environment variable in the Variable field. Enter the encrypted password in the
Value field.
Setting the Repository User Name
For pmcmd and pmrep, you can create an environment variable to store the repository user name.
To configure a user name as an environment variable on UNIX:
1. In a UNIX C shell environment, type:
setenv <User_Name_Environment_Variable> <user name>
2. In a UNIX Bourne shell environment, type:
<User_Name_Environment_Variable>=<user name>
export <User_Name_Environment_Variable>
You can assign the environment variable any valid UNIX name.
To configure a user name as an environment variable on Windows:
1. Enter the user name environment variable in the Variable field.
2. Enter the repository user name in the Value field.
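With both variables set, scripts can pass them to the command-line utilities instead of clear-text credentials. The sketch below assumes pmcmd's -uv (user name variable) and -pv (password variable) options and uses placeholder service, domain, folder, and workflow names; confirm the exact option names against the pmcmd reference for your PowerCenter version.

# Sketch: start a workflow using the user name and password environment variables.
# PMUSER and PMPASSWORD are the variables configured above; all other names are placeholders.
pmcmd startworkflow -sv INT_SVC_DEV -d Domain_Dev -uv PMUSER -pv PMPASSWORD -f DEVELOPMENT -wait wf_load_customers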
Connection Object Permissions
Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all
types of connection objects. This controls who can create, view, change, and execute workflow tasks that use
those specific connections, providing another level of security for these global repository objects.
Users with the Use Workflow Manager privilege can create and modify connection objects. Connection objects allow the PowerCenter server to read and write to source and target databases. Any database the server can access requires a connection definition. Connection information is stored in the repository. Users executing workflows need execute permission on all connections used by the workflow. The
PowerCenter server looks up the connection information in the repository, and verifies permission for the
required action. If permissions are properly granted, the server reads and writes to the defined databases, as
specified by the workflow.
Users
Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the
PowerCenter repository should have a unique user account. Informatica does not recommend creating shared
accounts; unique accounts should be created for each user. Each repository user needs a user name and
password, provided by the PowerCenter Repository Administrator, to access the repository.
Users are created and managed through Repository Manager. Users should change their passwords from the
default immediately after receiving the initial user id from the Administrator. Passwords can be reset by the user
if they are granted the privilege Use Repository Manager.
When you create the repository, the repository automatically creates two default users:
- Administrator - The default password for Administrator is Administrator.
- Database user - The username and password used when you created the repository.
These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor can their group affiliation be changed.
To administer repository users, you must have one of the following privileges:
- Administer Repository
- Super User
LDAP (Lightweight Directory Access Protocol)
In addition to default repository user authentication, LDAP can be used to authenticate users. Using LDAP
authentication, the repository maintains an association between the repository user and the external login
name. When a user logs into the repository, the security module authenticates the user name and password
against the external directory. The repository maintains a status for each user. Users can be enabled or
disabled by modifying this status.
Prior to implementing LDAP, the administrator must know:
- Repository server username and password
- An administrator or superuser user name and password for the repository
- An external login name and password
To configure LDAP, follow these steps:
1. Edit ldap_authen.xml and modify the following attributes:
   - NAME: the .dll that implements the authentication
   - OSTYPE: the host operating system
2. Register ldap_authen.xml in the Repository Server Administration Console.
3. In the Repository Server Administration Console, configure the authentication module.
User Groups
When you create a repository, the Repository Manager creates two repository user groups. These two groups
exist so you can immediately create users and begin developing repository objects. These groups cannot be
deleted from the repository nor have their configured privileges changed. The default repository user groups are:
- Administrators - which has super-user access
- Public - which has a subset of default repository privileges
You should create custom user groups to manage users and repository privileges effectively. The number and
types of groups that you create should reflect the needs of your development teams, administrators, and
operations group. Informatica recommends minimizing the number of custom user groups that you create in
order to facilitate the maintenance process.
A starting point is to create a group for each distinct combination of privileges needed to support the development cycle and production process. This is the recommended method for assigning privileges. After creating a user group, you assign a set of privileges to that group. Each repository user must be assigned to at least one user group. When you assign a user to a group, the user:
- Receives all group privileges.
- Inherits any changes to group privileges.
- Loses and gains privileges if you change the user's group membership.
You can also assign users to multiple groups, which grants the user the privileges of each group. Use the
Repository Manager to create and edit repository user groups.
Folder Permissions
When you create or edit a folder, you define permissions for the folder. The permissions can be set at three
different levels:
- owner
- owner's group
- repository - the remainder of the users within the repository
First, choose an owner (i.e., user) and group for the folder. If the owner belongs to more than one group, you
must select one of the groups listed. Once the folder is defined and the owner is selected, determine what level
of permissions you would like to grant to the users within the group. Then determine the permission level for the
remainder of the repository users. The permissions that can be set include: read, write, and execute. Any
combination of these can be granted to the owner, group or repository.
Be sure to consider folder permissions very carefully: they offer the easiest way to restrict user and/or group access to folders. The following examples show typical folders, their type, and recommended ownership.
- DEVELOPER_1 (initial development, temporary work area, unit test): individual developer
- DEVELOPMENT (integrated development): development lead, Administrator or Super User
- UAT (integrated User Acceptance Test): UAT lead, Administrator or Super User
- PRODUCTION (production): Administrator or Super User
- PRODUCTION SUPPORT (production fixes and upgrades): production support lead, Administrator or Super User
Repository Privileges
Repository privileges work in conjunction with folder permissions to give a user or group authority to perform
tasks. Repository privileges are the most granular way of controlling a user's activity. Consider the privileges
that each user group requires, as well as folder permissions, when determining the breakdown of users into
groups. Informatica recommends creating one group for each distinct combination of folder permissions and
privileges.
When you assign a user to a user group, the user receives all privileges granted to the group. You can also
assign privileges to users individually. When you grant a privilege to an individual user, the user retains that
privilege, even if his or her user group affiliation changes. For example, you have a user in a Developer group
who has limited group privileges, and you want this user to act as a backup administrator when you are not
available. For the user to perform every task in every folder in the repository, and to administer the Integration
Service, the user must have the Super User privilege. For tighter security, grant the Super User privilege to the
individual user, not the entire Developer group. This limits the number of users with the Super User privilege,
and ensures that the user retains the privilege even if you remove the user from the Developer group.
The Repository Manager grants a default set of privileges to each new user and group for working within the
repository. You can add or remove privileges from any user or group except:
- Administrators and Public (the default read-only repository groups)
- Administrator and the database user who created the repository (the users automatically created in the Administrators group)
The Repository Manager automatically grants each new user and new group the default privileges. These
privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow Manager, and Workflow
Monitor. The default repository privileges, together with the folder and connection object permissions required for each task, are listed below:

Default Repository Privileges

Use Designer
- With no folder or connection object permission:
  - Connect to the repository using the Designer.
  - Configure connection information.
- With Read permission on the folder:
  - View objects in the folder.
  - Change folder versions.
  - Create shortcuts to objects in the folder.
  - Copy objects from the folder.
  - Export objects.
- With Read/Write permission on the folder:
  - Create or edit metadata.
  - Create shortcuts from shared folders.
  - Copy objects into the folder.
  - Import objects.

Browse Repository
- With no folder or connection object permission:
  - Connect to the repository using the Repository Manager.
  - Add and remove reports.
  - Import, export, or remove the registry.
  - Search by keywords.
  - Change your user password.
- With Read permission on the folder:
  - View dependencies.
  - Unlock objects, versions, and folders locked by your username.
  - Edit folder properties for folders you own.
  - Copy a version. (You must also have the Administer Repository or Super User privilege in the target repository and write permission on the target folder.)
  - Copy a folder. (You must also have the Administer Repository or Super User privilege in the target repository.)

Use Workflow Manager
- With no folder or connection object permission:
  - Connect to the repository using the Workflow Manager.
  - Create database, FTP, and external loader connections in the Workflow Manager.
  - Run the Workflow Monitor.
- With Read/Write permission on the connection object (no folder permission required):
  - Edit database, FTP, and external loader connections in the Workflow Manager.
- With Read permission on the folder:
  - Export sessions.
  - View workflows.
  - View sessions.
  - View tasks.
  - View session details and session performance details.
- With Read/Write permission on the folder:
  - Create and edit workflows and tasks.
  - Import sessions.
  - Validate workflows and tasks.
- With Read/Write permission on the folder and Read permission on the connection object:
  - Create and edit sessions.
- With Read/Execute permission on the folder:
  - View the session log.
- With Read/Execute permission on the folder and Execute permission on the connection object:
  - Schedule or unschedule workflows.
  - Start workflows immediately.
- With Execute permission on the folder:
  - Restart workflow.
  - Stop workflow.
  - Abort workflow.
  - Resume workflow.

Use Repository Manager
- With no folder or connection object permission:
  - Remove label references.
- With Write permission on the deployment group:
  - Delete from a deployment group.
- With Write permission on the folder:
  - Change an object's version comments if not the owner.
  - Change the status of an object.
  - Check in.
  - Check out/undo a check-out.
  - Delete objects from the folder.
  - Mass validation (needs write permission if options are selected).
  - Recover after delete.
- With Read permission on the folder:
  - Export objects.
- With Read/Write permission on the folder and permission on the deployment group:
  - Add to a deployment group.
- With Read/Write permission on the original folders and permission on the target folder:
  - Copy objects.
  - Import objects.
- With Read/Write/Execute permission on the folder and permission on the label:
  - Apply a label.
Extended Privileges
In addition to the default privileges listed above, Repository Manager provides extended privileges that you can
assign to users and groups. These privileges are granted to the Administrator group by default. The following
outline lists the extended repository privileges and the permissions required for each task:

Extended Repository Privileges

Admin Repository
- With no folder or connection object permission:
  - Create, upgrade, backup, delete, and restore the repository.
  - Manage passwords, users, groups, and privileges.
  - Start, stop, enable, disable, and check the status of the repository.
- With Write permission on the folder:
  - Check in or undo check out for other users.
  - Purge (in a version-enabled repository).

Admin Integration Service
- With no folder or connection object permission:
  - Disable the Integration Service using the infacmd program.
  - Connect to the Integration Service from PowerCenter client applications when running the Integration Service in safe mode.

Super User
- With no folder or connection object permission:
  - Perform all tasks, across all folders in the repository.
  - Manage connection object permissions.
  - Manage global object permissions.
  - Perform mass validate.

Workflow Operator
- With no folder or connection object permission:
  - Connect to the Integration Service.
- With Read permission on the folder:
  - View the session log.
  - View the workflow log.
  - View session details and performance details.
- With Execute permission on the folder:
  - Abort workflow.
  - Restart workflow.
  - Resume workflow.
  - Stop workflow.
  - Schedule and unschedule workflows.
- With Read permission on the folder and Execute permission on the connection:
  - Start workflows immediately.
- With Execute permission on the folder and Execute permission on the connection:
  - Use pmcmd to start workflows in folders for which you have execute permission.

Manage Connection
- With no folder or connection object permission:
  - Create and edit connection objects.
  - Delete connection objects.
  - Manage connection object permissions.

Manage Label
- With no folder or connection object permission:
  - Create labels.
  - Delete labels.
Extended privileges allow you to perform more tasks and expand the access you have to repository objects.
Informatica recommends that you reserve extended privileges for individual users and grant default privileges to
groups.
Audit Trails
You can track changes to Repository users, groups, privileges, and permissions by selecting the
SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration
Console. When you enable the audit trail, the Repository Service logs security changes to the Repository
Service log.
The audit trail logs the following operations:
- Changing the owner, owner's group, or permissions for a folder.
- Changing the password of another user.
- Adding or removing a user.
- Adding or removing a group.
- Adding or removing users from a group.
- Changing global object permissions.
- Adding or removing user and group privileges.
Sample Security Implementation
The following steps provide an example of how to establish users, groups, permissions, and privileges in your
environment. Again, the requirements of your projects and production systems should dictate how security is
established.
1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production,
Production Support, etc.).
2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed
in Step 1; for example, Development, UAT, QA, Production, etc.).
3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter Repository.
5. Assign users to groups.
6. Define privileges for each group.
The following table provides an example of groups and privileges that may exist in the PowerCenter repository.
This example assumes one PowerCenter project with three environments co-existing in one PowerCenter
repository.

- ADMINISTRATORS: all folders, all permissions. Privileges: Super User (all privileges).
- DEVELOPERS: individual development folder and integrated development folder, with Read, Write, Execute. Privileges: Use Designer, Browse Repository, Use Workflow Manager.
- DEVELOPERS: UAT folder, with Read. Privileges: Use Designer, Browse Repository, Use Workflow Manager.
- UAT: UAT working folder, with Read, Write, Execute. Privileges: Use Designer, Browse Repository, Use Workflow Manager.
- UAT: Production folder, with Read. Privileges: Use Designer, Browse Repository, Use Workflow Manager.
- OPERATIONS: Production folder, with Read, Execute. Privileges: Browse Repository, Workflow Operator.
- PRODUCTION SUPPORT: production maintenance folders, with Read, Write, Execute. Privileges: Use Designer, Browse Repository, Use Workflow Manager.
- PRODUCTION SUPPORT: Production folder, with Read. Privileges: Browse Repository.
Informatica PowerCenter Security Administration
As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual is
responsible for a number of tasks in the Informatica environment, including security. To summarize, here are
the security-related tasks an administrator is responsible for:
- Creating user accounts.
- Defining and creating groups.
- Defining and granting folder permissions.
- Defining and granting repository privileges.
- Enforcing changes in passwords.
- Controlling requests for changes in privileges.
- Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
- Working with the operations group to ensure tight security in the production environment.
Remember, you must have one of the following privileges to administer repository users:
- Administer Repository
- Super User
Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:
- Create groups with limited privileges.
- Do not use shared accounts.
- Limit user and group access to multiple repositories.
- Customize user privileges.
- Limit the Super User privilege.
- Limit the Administer Repository privilege.
- Restrict the Workflow Operator privilege.
- Follow a naming convention for user accounts and group names.
- For more secure environments, turn Audit Trail logging on.


Last updated: 05-Feb-07 15:33
Data Analyzer Security
Challenge
Using Data Analyzer's sophisticated security architecture to establish a robust security system that safeguards valuable business information across a range of technologies and security models, and ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.
Description
Four main architectural layers must be completely secure: the user layer, the transmission layer, the application layer, and the data layer.
User Layer
Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the following LDAP-compliant directory servers:
- SunOne/iPlanet Directory Server 4.1
- Sun Java System Directory Server 5.2
- Novell eDirectory Server 8.7
- IBM SecureWay Directory 3.2
- IBM SecureWay Directory 4.1
- IBM Tivoli Directory Server 5.2
- Microsoft Active Directory 2000
- Microsoft Active Directory 2003
In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing
authentication and access control for the various web applications in the organization.
Transmission Layer
The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security
protocol Secure Sockets Layer (SSL) to provide a secure environment.
Application Layer
Only appropriate application functionality should be provided to users with associated privileges. Data
Analyzer provides three basic types of application-level security:
- Report, Folder and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
- Column-level Security. Restricts users and groups to particular metric and attribute columns.
- Row-level Security. Restricts users to specific attribute values within an attribute column of a table.
Components for Managing Application Layer Security
Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data
Analyzer provides the following components for managing application layer security:
- Roles. A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.
- Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.
- Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.
Types of Roles

- System roles - Data Analyzer provides a set of roles when the repository is created. Each role has a set of privileges assigned to it.
- Custom roles - The end user can create these roles and assign privileges to them.
Managing Groups
Groups allow you to classify users according to a particular function. You may organize users into groups
based on their departments or management level. When you assign roles to a group, you grant the same
privileges to all members of the group. When you change the roles assigned to a group, all users in the
group inherit the changes. If a user belongs to more than one group, the user has the privileges from all
groups. To organize related users into related groups, you can create group hierarchies. With hierarchical
groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you
edit a group, all subgroups contained within it inherit the changes.
For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead
group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager
group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer
role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but
group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the
object.

Preventing Data Analyzer from Updating Group Information
If you use Windows Domain or LDAP authentication, you typically maintain both the users and the groups in the Windows Domain or LDAP directory service rather than in Data Analyzer. However, some organizations keep only user accounts in the Windows Domain or LDAP directory service and set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep the groups in Data Analyzer.
Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory
service, it updates the users and groups in the repository and deletes users and groups that are not found
in the Windows Domain or LDAP directory service.
To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the
web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and
manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.
The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR
file, use the EAR Repackager utility provided with Data Analyzer.
Note: Be sure to back-up the web.xml file before you modify it.
To prevent Data Analyzer from updating group information in the repository:
1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the
following directory:
/custom/properties
2. Open the web.xml file with a text editor and locate the line containing the following property:
enableGroupSynchronization
The enableGroupSynchronization property determines whether Data Analyzer updates the groups
in the repository.
3. To prevent Data Analyzer from updating group information in the Data Analyzer repository,
change the value of the enableGroupSynchronization property to false:
<init-param>
<param-name>
InfSchedulerStartup.com.informatica.ias.
scheduler.enableGroupSynchronization
</param-name>
<param-value>false</param-value>
</init-param>
When the value of enableGroupSynchronization property is false, Data Analyzer does not
synchronize the groups in the repository with the groups in the Windows Domain or LDAP
directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
5. Restart Data Analyzer.
When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer
updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows
Domain or LDAP authentication server. You must create and manage groups, and assign users to
groups in Data Analyzer.
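If you prefer to repackage the EAR by hand rather than with the EAR Repackager utility, generic jar commands along the lines of the sketch below can extract, edit, and replace the web.xml entry. The EAR file name and paths shown are placeholders; use the locations from your own installation.

# Sketch: pull web.xml out of the Data Analyzer EAR, edit it, and put it back.
mkdir /tmp/da_ear
cd /tmp/da_ear
jar xvf /opt/informatica/dataanalyzer/ias.ear custom/properties/web.xml
# ... edit custom/properties/web.xml, setting enableGroupSynchronization to false ...
jar uvf /opt/informatica/dataanalyzer/ias.ear custom/properties/web.xml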
Managing Users
Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a
user must have the appropriate privileges. You can assign privileges to a user with roles or groups.
Data Analyzer creates a System Administrator user account when you create the repository. The default
user name for the System Administrator user account is admin. The system daemon, ias_scheduler/
padaemon, runs the updates for all time-based schedules. System daemons must have a unique user
name and password in order to perform Data Analyzer system functions and tasks. You can change the
password for a system daemon, but you cannot change the system daemon user name via the GUI. Data
Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to
system daemons or assign them to groups.
To change the password for a system daemon, complete the following steps:
1. Change the password in the Administration tab in Data Analyzer
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.
Access LDAP Directory Contacts
To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings
page. After you set up the connection to the LDAP directory service, users can email reports and shared
documents to LDAP directory contacts.
When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property.
In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished
name entries define the type of information that is stored in the LDAP directory. If you do not know the
value for BaseDN, contact your LDAP system administrator.
Customizing User Access
You can customize Data Analyzer user access with the following security options:
- Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.
- Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.
- Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.
When you create an object in the repository, every user has default read and write permissions for that
object. By customizing access permissions for an object, you determine which users and/or groups can
read, write, delete, or change access permissions for that object.
When you set data restrictions, you determine which users and groups can view particular attribute
values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to
that user.
Types of Access Permissions
Access permissions determine the tasks that you can perform for a specific repository object. When you
set access permissions, you determine which users and groups have access to the folders and repository
objects. You can assign the following types of access permissions to repository objects:
- Read. Allows you to view a folder or object.
- Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
- Delete. Allows you to delete a folder or an object from the repository.
- Change permission. Allows you to change the access permissions on a folder or object.
By default, Data Analyzer grants read and write access permissions to every user in the repository. You
can use the General Permissions area to modify default access permissions for an object, or turn off
default access permissions.
Data Restrictions
You can restrict access to data based on the values of related attributes. Data restrictions are set to keep
sensitive data from appearing in reports. For example, you may want to restrict data related to the
performance of a new store from outside vendors. You can set a data restriction that excludes the store
ID from their reports.
You can set data restrictions using one of the following methods:
- Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, and real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.
- Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups.
Types of Data Restrictions
You can set two kinds of data restrictions:
- Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an IN 2001 rule.
- Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a NOT IN 2001 rule.
Restricting Data Access by User or Group
You can edit a user or group profile to restrict the data the user or group can access in reports. When you
edit a user profile, you can set data restrictions for any schema in the repository, including operational
schemas and fact tables.
You can set a data restriction to limit user or group access to data in a single schema based on the
attributes you select. If the attributes apply to more than one schema in the repository, you can also
restrict the user or group access from related data across all schemas in the repository. For example, you
may have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one
data restriction that applies to both the Sales and Salary fact tables based on the region you select.
To set data restrictions for a user or group, you need the following role or privilege:
- System Administrator role
- Access Management privilege
When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the
data restrictions for the report owner. However, if the reports have consumer-based security, the Data
Analyzer Server creates a separate report for each unique security profile.
The following information applies only to changing the admin user on WebLogic.
To change the Data Analyzer system administrator user name on WebLogic 8.1 (Data Analyzer 8.1):

- Repository authentication. Use the Update System Accounts utility to change the system administrator account name in the repository.
- LDAP or Windows Domain authentication. Set up the new system administrator account in the Windows Domain or LDAP directory service, then use the Update System Accounts utility to change the system administrator account name in the repository.
To change the Data Analyzer default users admin and ias_scheduler/padaemon:
1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class
4. Extract the file "InfChangeSystemUserNames.class" into a temporary directory (example: d:\temp)
5. This extracts the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\
REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the -Dias_scheduler and -Dadmin values to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE
7. Make changes in the batch file as directed in the remarks [REM lines]
8. Save the file and open up a command prompt window and navigate to D:\Temp\Repository Utils
\Refresh\
9. At the prompt, type change_sys_user.bat and press Enter.
The user "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin",
respectively.
10. Modify web.xml, and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias
\WEB-INF) by replacing ias_scheduler with 'pa_scheduler'
11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml.
This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\
To edit the file, make a copy of the iasEjb.jar:
- mkdir \tmp
- cd \tmp
- jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
- cd META-INF
- Update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
- cd \
- jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .
Note: There is a trailing period at the end of the command above.
12. Restart the server.


Last updated: 05-Feb-07 15:39
Database Sizing
Challenge
Database sizing involves estimating the types and sizes of the components of a data
architecture. This is important for determining the optimal configuration for the
database servers in order to support the operational workloads. Individuals involved in
a sizing exercise may be data architects, database administrators, and/or business
analysts.
Description
The first step in database sizing is to review system requirements to define such things
as:
- Expected data architecture elements (will there be staging areas? operational data stores? a centralized data warehouse and/or master data? data marts?)
  Each additional database element requires more space. This is especially true where data is replicated across multiple systems, such as a data warehouse that also maintains an operational data store: the same data in the ODS is present in the warehouse as well, albeit in a different format.
- Expected source data volume
  It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month), so each row in the source translates to 12 rows in the target, and a source table with one million rows ends up as a 12-million-row table.
- Data granularity and periodicity
  Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated increases or decreases the table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is aggregated at a monthly level or at a quarterly level. The granularity of a fact table is determined by the dimensions linked to that table; the number of dimensions connected to the fact table affects its granularity and hence its size.
- Load frequency and method (full refresh? incremental updates?)
  Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data in each run, so more space is required by the staging areas. A full refresh requires more space for the same reason.
- Estimated growth rates over time and retained history.
Determining Growth Projections
One way to estimate projections of data growth over time is to use scenario analysis.
As an example, for scenario analysis of a sales tracking data mart you can use the
number of sales transactions to be stored as the basis for the sizing estimate. In the
first year, 10 million sales transactions are expected; this equates to 10 million fact-
table records.
Next, use the sales growth forecasts for the upcoming years for database growth
calculations. That is, an annual sales growth rate of 10 percent translates into 11
million fact table records for the next year. At the end of five years, the fact table is
likely to contain about 60 million records. You may want to calculate other estimates
based on five-percent annual sales growth (case 1) and 20-percent annual sales
growth (case 2). Multiple projections for best and worst case scenarios can be very
helpful.
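Such scenario calculations are easy to script. The following sketch (not part of the original practice) reuses the figures from the example above: a 10-million-row first year and 5, 10, and 20 percent annual growth rates. Substitute your own volumes and rates.

#!/bin/sh
# Project new and cumulative fact-table rows over five years for several growth rates.
awk 'BEGIN {
  start = 10000000                      # year-1 fact rows (from the example above)
  n = split("5 10 20", rates, " ")      # annual growth scenarios, in percent
  for (i = 1; i <= n; i++) {
    rows = start; total = 0
    printf("Annual growth %s%%:\n", rates[i])
    for (year = 1; year <= 5; year++) {
      total += rows
      printf("  Year %d: %.0f new rows, %.0f cumulative\n", year, rows, total)
      rows *= 1 + rates[i] / 100
    }
  }
}'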
Oracle Table Space Prediction Model
Oracle (10g and onwards) provides a mechanism to predict the growth of a database.
This feature can be useful in predicting table space requirements.
Oracle incorporates a table space prediction model in the database engine that
provides projected statistics for space used by a table. The following Oracle 10g query
returns projected space usage statistics:
SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend ('schema','tablename','TABLE'))
ORDER BY timepoint;
The results of this query are shown below:
TIMEPOINT SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- --------------------
11-APR-04 02.55.14.116000 PM 6372 65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM 6372 65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM 6372 65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM 6372 65536 PROJECTED
14-MAY-04 02.55.14.116000 PM 6372 65536 PROJECTED
15-MAY-04 02.55.14.116000 PM 6372 65536 PROJECTED
16-MAY-04 02.55.14.116000 PM 6372 65536 PROJECTED
The QUALITY column indicates the quality of the output as follows:
- GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.
- INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.
- PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.
Baseline Volumetric
Next, use the physical data models for the sources and the target architecture to
develop a baseline sizing estimate. The administration guides for most DBMSs contain
sizing guidelines for the various database structures such as tables, indexes, sort
space, data files, log files, and database cache.
Develop a detailed sizing using a worksheet inventory of the tables and indexes from
the physical data model, along with field data types and field sizes. Various database
products use different storage methods for data types. For this reason, be sure to use
the database manuals to determine the size of each data type. Add up the field sizes to
determine row size. Then use the data volume projections to determine the number of
rows to multiply by the table size.
The default estimate for index size is to assume the same size as the table. Also
estimate the temporary space for sort operations. For data warehouse applications
where summarizations are common, plan on large temporary spaces. The temporary
space can be as much as 1.5 times larger than the largest table in the database.
Another approach that is sometimes useful is to load the data architecture with
representative data and determine the resulting database sizes. This test load can be a
fraction of the actual data and is used only to gather basic sizing statistics. You then
need to apply growth projections to these statistics. For example, after loading ten
thousand sample records to the fact table, you determine the size to be 10MB. Based
on the scenario analysis, you can expect this fact table to contain 60 million records
after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB *
(60,000,000/10,000)]. Don't forget to add indexes and summary tables to the
calculations.
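A quick way to turn the test-load measurement into an estimate is a small script like the sketch below, which reuses the example figures from this section: a 10 MB size for a 10,000-row sample, a 60-million-row projection, index space equal to table space, and temporary space of up to 1.5 times the largest table. The numbers are illustrative only.

#!/bin/sh
# Extrapolate table, index, and temporary space from a representative test load.
awk 'BEGIN {
  sample_mb   = 10          # size measured after loading the sample
  sample_rows = 10000       # rows in the sample load
  target_rows = 60000000    # projected rows after five years
  table_gb = (sample_mb * target_rows / sample_rows) / 1024
  index_gb = table_gb                # default estimate: index space roughly equals table space
  temp_gb  = table_gb * 1.5          # sort/temp space up to 1.5x the largest table
  printf("table %.1f GB, indexes %.1f GB, temp %.1f GB, total %.1f GB\n",
         table_gb, index_gb, temp_gb, table_gb + index_gb + temp_gb)
}'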
Guesstimating
When there is not enough information to calculate an estimate as described above, use
educated guesses and rules of thumb to develop as reasonable an estimate as
possible.
G If you dont have the source data model, use what you do know of the source
data to estimate average field size and average number of fields in a row to
determine table size. Based on your understanding of transaction volume over
time, determine your growth metrics for each type of data and calculate out
your source data volume (SDV) from table size and growth metrics.
G If your target data architecture is not completed so that you can determine
table sizes, base your estimates on multiples of the SDV:
H If it includes staging areas, add another SDV for any source subject area
that you will stage, multiplied by the number of loads you'll retain in
staging.
H If you intend to consolidate data into an operational data store, add the
SDV multiplied by the number of loads to be retained in the ODS for
historical purposes (e.g., keeping one year's worth of monthly loads = 12
x SDV).
H Data warehouse architectures are based on the periodicity and
granularity of the warehouse; this may be another SDV + (0.3n x SDV),
where n = the number of time periods loaded in the warehouse over time.
H If your data architecture includes aggregates, add a percentage of the
warehouse volumetrics based on how much of the warehouse data will be
aggregated and to what level (e.g., if the rollup level represents 10
percent of the dimensions at the details level, use 10 percent).
H Similarly, for data marts add a percentage of the data warehouse based
on how much of the warehouse data is moved into the data mart.
H Be sure to consider the growth projections over time and the history to be
retained in all of your calculations.
And finally, remember that there is always much more data than you expect, so you
may want to add a reasonable fudge-factor to the calculations for a margin of safety.


Last updated: 01-Feb-07 18:52
Deployment Groups
Challenge
When selectively migrating objects from one repository folder to another, there is a need
for a versatile and flexible mechanism that is not confined to a single source folder.
Description
Deployment Groups are containers that hold references to objects that need to be
migrated. This includes objects such as mappings, mapplets, reusable transformations,
sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the
repository folders). Deployment groups are faster and more flexible than folder moves
for incremental changes. In addition, they allow for migration rollbacks if necessary.
Migrating a deployment group allows you to copy objects in a single copy operation
from across multiple folders in the source repository into multiple folders in the target
repository. Copying a deployment group also allows you to specify individual objects to
copy, rather than the entire contents of a folder.
There are two types of deployment groups: static and dynamic.
G Static deployment groups contain direct references to versions of objects
that need to be moved. Users explicitly add the version of the object to be
migrated to the deployment group. Create a static deployment group if you do
not expect the set of deployment objects to change between the deployments.
G Dynamic deployment groups contain a query that is executed at the time of
deployment. The results of the query (i.e., object versions in the repository) are
then selected and copied to the target repository/folder. Create a dynamic
deployment group if you expect the deployment objects to change frequently
between deployments.
Dynamic deployment groups are generated from a query. While any available criteria
can be used, it is advisable to have developers use labels to simplify the query. See the
Best Practice on Using PowerCenter Labels, Strategies for Labels section, for further
information. When generating a query for deployment groups with mappings and
mapplets that contain non-reusable objects, you must use a query condition in addition
to the specific selection criteria. The query must include a condition on Is Reusable with
a qualifier that covers both Reusable and Non-Reusable objects. Without this qualifier, the
deployment may encounter errors if there are non-reusable objects held within the
mapping or mapplet.
A deployment group exists in a specific repository. It can be used to move items to any
other accessible repository/folder. A deployment group maintains a history of all
migrations it has performed. It tracks what versions of objects were moved from which
folders in which source repositories, and into which folders in which target repositories
those versions were copied (i.e., it provides a complete audit trail of all migrations
performed). Because the deployment group records what it moved and where, an
administrator can, if necessary, have it undo the most recent deployment, reverting the
target repository to its pre-deployment state. Using labels (as
described in the Using PowerCenter Labels Best Practice) allows objects in the
subsequent repository to be tracked back to a specific deployment.
It is important to note that the deployment group only migrates the objects it contains to
the target repository/folder. It does not, itself, move to the target repository. It still
resides in the source repository.
Deploying via the GUI
You can perform migrations via the GUI or the command line (pmrep). To migrate
objects via the GUI, simply drag a deployment group from the repository it resides in
onto the target repository where the objects it references are to be moved. The
Deployment Wizard appears to step you through the deployment process. You can
match folders in the source and target repositories so objects are moved into the
proper target folders, reset sequence generator values, etc. Once the wizard is
complete, the migration occurs, and the deployment history is created.
Deploying via the Command Line
Alternatively, you can use the PowerCenter pmrep command to automate both Folder
Level deployments (e.g., in a non-versioned repository) and deployments using
Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in
pmrep are used respectively for these purposes. Whereas deployment via the GUI
requires you to step through a wizard and answer a series of questions to deploy,
command-line deployment requires you to provide an XML control file, which contains
the same information that the wizard requests. This file must be present before the
deployment is executed.
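For example, a deployment group migration might be invoked as follows. This is a sketch only: the repository, domain, user, deployment group, and control file names are illustrative, and the exact option list and the layout of the control file (which follows the deployment control DTD shipped with PowerCenter) should be verified against the pmrep Command Line Reference for your version.

pmrep connect -r INFA_TEST -d Domain_Dev -n deployuser -x deploypwd
pmrep deploydeploymentgroup -p MARKETING_REL1 -c deploy_control.xml -r INFA_PROD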
Considerations for Deployment and Deployment Groups
Simultaneous Multi-Phase Projects
If multiple phases of a project are being developed simultaneously in separate folders,
it is possible to consolidate them by mapping folders appropriately through the
deployment group migration wizard. When migrating with deployment groups in this
way, the override buttons in the migration wizard are used to select specific folder
mapping.
Rolling Back a Deployment
Deployment groups help to ensure that you have a back-out methodology. You can
roll back the latest version of a deployment. To do this:
In the target repository (where the objects were migrated to), go to
Versioning>>Deployment>>History>>View History>>Rollback.
The rollback purges the latest versions of all of the objects that were in the deployment
group. You can initiate a rollback on a deployment only as long as you roll back the
latest versions of the objects; the rollback verifies that the check-in time for the
repository objects is the same as the deploy time.
Managing Repository Size
As you check in objects and deploy objects to target repositories, the number of object
versions in those repositories increases, and thus, the size of the repositories also
increases.
In order to manage repository size, use a combination of Check-in Date and Latest
Status (both are query parameters) to purge the desired versions from the repository
and retain only the very latest version. You may also choose to purge all the deleted
versions of the objects, which reduces the size of the repository.
If you want to keep more than the latest version, you can also include labels in your
query. These labels are ones that you have applied to the repository for the specific
purpose of identifying objects for purging.
Off-Shore, On-Shore Migration
When migrating from an off-shore development environment to an on-shore
environment, other aspects of the computing environment may make it desirable to
generate a dynamic deployment group. Instead of migrating the group itself to the next
repository, you can use a query to select the objects for migration and save them to a
single XML file, which can then be transmitted to the on-shore environment through
alternative methods. If the on-shore repository is versioned, importing the file activates
the import wizard as if a deployment group was being received.
Migrating to a Non-Versioned Repository
In some instances, it may be desirable to migrate to a non-versioned repository from a
versioned repository. Note that this changes the wizards used when migrating in this
manner, and that the export from the versioned repository must take place using XML
export. Also be aware that certain repository objects (e.g., connections) cannot be
automatically migrated, which may invalidate objects such as sessions. To resolve this
issue, first set up the objects/connections in the receiving repository; the XML import
wizard will advise of any invalidations that occur.


Last updated: 01-Feb-07 18:52
Migration Procedures - PowerCenter
Challenge
Develop a migration strategy that ensures clean migration between development, test, quality assurance
(QA), and production environments, thereby protecting the integrity of each of these environments as the
system evolves.
Description
Ensuring that an application has a smooth migration process between development, QA, and production
environments is essential for the deployment of an application. Deciding which migration strategy works best
for a project depends on two primary factors.
G How is the PowerCenter repository environment designed? Are there individual repositories for
development, QA, and production, or are there just one or two environments that share one or all of
these phases?
G How has the folder architecture been defined?
Each of these factors plays a role in determining the migration procedure that is most beneficial to the project.
PowerCenter offers flexible migration options that can be adapted to fit the need of each
application. PowerCenter migration options include repository migration, folder migration, object migration,
and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic
deployment groups for migration, which provides the capability to migrate any combination of objects within
the repository with a single command.
This Best Practice is intended to help the development team decide which technique is most appropriate for
the project. The following sections discuss various options that are available, based on the environment and
architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.
Repository Environments
The following section outlines the migration procedures for standalone and distributed repository
environments. The distributed environment section touches on several migration architectures, outlining the
pros and cons of each. Also, please note that any methods described in the Standalone section may also be
used in a Distributed environment.
Standalone Repository Environment
In a standalone environment, all work is performed in a single PowerCenter repository that serves as the
metadata store. Separate folders are used to represent the development, QA, and production workspaces
and segregate work. This type of architecture within a single repository ensures seamless migration from
development to QA, and from QA to production.
The following example shows a typical architecture. In this example, the company has chosen to create
separate development folders for each of the individual developers for development and unit test purposes. A
single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common
objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA
purposes. The first contains all of the unit-tested mappings from the development folder. The second is a
common or shared folder that contains all of the tested shared objects. Eventually, as the following
paragraphs explain, two production folders will also be built.


Proposed Migration Process Single Repository
DEV to TEST Object Level Migration
Now that we've described the repository architecture for this organization, let's discuss how it will migrate
mappings to test, and then eventually to production.
After all mappings have completed their unit testing, the process for migration to test can begin. The first step
in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to
the SHARED_MARKETING_TEST folder. This can be done using one of two methods:
G The first, and most common method, is object migration via an object copy. In this case, a user
opens the SHARED_MARKETING_TEST folder and drags the object from the
SHARED_MARKETING_DEV folder into the appropriate workspace (e.g., Source Analyzer, Warehouse
Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.
G The second approach is object migration via object XML import/export. A user can export each of the
objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into the
SHARED_MARKETING_TEST via XML import. With the XML import/export, the XML files can be
uploaded to a third-party versioning tool, if the organization has standardized on such a tool.
Otherwise, versioning can be enabled in PowerCenter. Migration with versioned PowerCenter
repositories is covered later in this document.
After you've copied all common or shared objects, the next step is to copy the individual mappings from each
development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level
migration methods described above to copy the mappings to the folder, although the XML import/export
method is the most intuitive for resolving shared object conflicts. However, the migration method is
slightly different here: when copying the mappings, you must ensure that the shortcuts in the mappings
are associated with the SHARED_MARKETING_TEST folder. Designer prompts you to choose the
correct shortcut folder created in the previous step, which points to
SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all
mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple
objects into a single XML file, and then import them at the same time.
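If this XML-based migration needs to be scripted rather than performed through the Designer GUI, the pmrep ObjectExport and ObjectImport commands can be used. The commands below are a rough sketch only; the repository, folder, mapping, and file names are hypothetical, and the full option list and the import control file format should be confirmed in the pmrep documentation before use.

pmrep connect -r MARKETING_REPO -d Domain_Dev -n migrator -x migratorpwd
pmrep objectexport -n m_load_customer_dim -o mapping -f DEVELOPER1_DEV -u m_load_customer_dim.xml
pmrep objectimport -i m_load_customer_dim.xml -c import_control.xml

The import control file specifies, among other things, how the source folder is resolved to the target folder and how conflicts with existing objects are handled.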

The final step in the process is to migrate the workflows that use those mappings. Again, the object-level
migration can be completed either through drag-and-drop or by using XML import/export. In either case, this
process is very similar to the steps described above for migrating mappings, but differs in that the Workflow
Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the
full process for successfully copying a workflow and all of its associated tasks.
1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the
destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a
default name is used. Then click Next to continue the copy process.
2. Next, the Wizard checks whether each task already exists in the destination folder (as shown below). If
the task is present, you can rename or replace the current one. If it does not exist, the default name is
used (see below). Then click Next.



3. Next, the Wizard prompts you to select the mapping associated with each session task in the
workflow. Select the mapping and continue by clicking "Next".

4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for
the source and target. If no connections exist, the default settings are used. When this step is
completed, click "Finish" and save the work.
Initial Migration New Folders Created
The initial move to production is very different from subsequent changes to mappings and
workflows. Since the repository only contains folders for development and test, we need to create two new
folders to house the production-ready objects. Create these folders after testing of the objects in
SHARED_MARKETING_TEST and MARKETING_TEST has been approved.
The following steps outline the creation of the production folders and, at the same time, address the initial test
to production migration.
1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST
folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.

4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced
options. In this example, we'll use the advanced options.
5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that
appears on this screen is the folder name followed by the date. In this case, enter the name as
SHARED_MARKETING_PROD.
6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you
are transporting the folder, you won't need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.
Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as
the original to copy and associate the shared objects with the SHARED_MARKETING_PROD folder
that you just created.
At the end of the migration, you should have two additional folders in the repository environment for
production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These
folders contain the initially migrated objects. Before you can actually run the workflow in these
production folders, you need to modify the session source and target connections to point to the
production environment.

Incremental Migration Object Copy Example
Now that the initial production migration is complete, let's take a look at how future changes will be migrated
into the folder.
Any time an object is modified, it must be re-tested and migrated into production for the actual change to
take effect. These changes typically take place on a case-by-case or periodically scheduled basis.
The following steps outline the process of moving these objects individually.
1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on
the object to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder,
Designer prompts you to choose whether to Rename or Replace the object (as shown below). Choose
the option to Replace the object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any
object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes
that you are making are what you intend. See below for an example of the mapping compare window.

4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can
update itself with the changes.
Standalone Repository Example
In this example, we look at moving development work to QA and then from QA to production, using multiple
development folders for each developer, with the test and production folders divided into the data mart they
represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to
move objects and mappings from each individual folder to the test folder and then how to move tasks,
worklets, and workflows to the new area.
Follow these steps to copy a mapping from Development to QA:
1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2
H Copy the tested objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder.
H Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to
MARKETING_TEST.
H Save your changes.
2. Copy the mapping from Development into Test.
H In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the
mapping from each development folder into the MARKETING_TEST folder.
H When copying each mapping in PowerCenter, Designer prompts you to either
Replace, Rename, or Reuse the object, or Skip for each reusable object, such as source and
target definitions. Choose to Reuse the object for all shared objects in the mappings copied
into the MARKETING_TEST folder.
H Save your changes.
3. If a reusable session task is being used, follow these steps. Otherwise, skip to step 4.
H In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and
drop each reusable session from the developers' folders into the MARKETING_TEST folder.
A Copy Session Wizard guides you through the copying process.
H Open each newly copied session and click on the Source tab. Change the source to point to
the source database for the Test environment.
H Click the Target tab. Change each connection to point to the target database for the Test
environment. Be sure to double-check the workspace from within the Target tab to ensure
that the load options are correct.
H Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
H Drag each workflow from the development folders into the MARKETING_TEST folder. The
Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to
the new folder.
H As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to
compare conflicts from within Workflow Manager to ensure that the correct migrations are
being made.
H Save your changes.
5. Implement the appropriate security.
H In Development, the owner of the folders should be a user(s) in the development group.
H In Test, change the owner of the test folder to a user(s) in the test group.
H In Production, change the owner of the folders to a user in the production group.
H Revoke all rights to Public other than Read for the production folders.
Disadvantages of a Single Repository Environment
The most significant disadvantage of a single repository environment is performance. Having a development,
QA, and production environment within a single repository can cause degradation in production performance
as the production environment shares CPU and memory resources with the development and test
environments. Although these environments are stored in separate folders, they all reside within the same
database table space and on the same server.
For example, if development or test loads are running simultaneously with production loads, the server
machine may reach 100 percent utilization and production performance is likely to suffer.
A single repository structure can also create confusion as the same users and groups exist in all
environments and the number of folders can increase exponentially.
Distributed Repository Environment
A distributed repository environment maintains separate, independent repositories, hardware, and software
for development, test, and production environments. Separating repository environments is preferable for
handling development to production migrations. Because the environments are segregated from one another,
work performed in development cannot impact QA or production.
With a fully distributed approach, separate repositories function much like the separate folders in a standalone
environment. Each repository has a similar name, like the folders in the standalone environment. For
instance, in our Marketing example we would have three repositories, INFADEV, INFATEST, and
INFAPROD. In the following example, we discuss a distributed repository architecture.
There are four techniques for migrating from development to production in a distributed repository
architecture, with each involving some advantages and disadvantages.
G Repository Copy
G Folder Copy
G Object Copy
G Deployment Groups

Repository Copy
So far, this document has covered object-level migrations and folder migrations through drag-and-drop object
copying and object XML import/export. This section discusses migrations in a distributed repository
environment through repository copies.
The main advantages of this approach are:
G The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformation, etc.) at
once from one environment to another.
G The ability to automate this process using pmrep commands, thereby eliminating many of the manual
processes that users typically perform.
G The ability to move everything without breaking or corrupting any of the objects.
This approach also involves a few disadvantages.
G The first is that everything is moved at once (which is also an advantage). The problem is that
everything is moved, whether it is ready or not. For example, we may have 50 mappings in QA, but only 40
of them are production-ready. The 10 untested mappings are moved into production along with the
40 production-ready mappings, which leads to the second disadvantage.
G Significant maintenance is required to remove any unwanted or excess objects.
G There is also a need to adjust server variables, sequences, parameters/variables, database
connections, etc. Everything must be set up correctly before the actual production runs can take
place.
G Lastly, the repository copy process requires that the existing Production repository be deleted, and
then the Test repository can be copied. This results in a loss of production environment operational
metadata such as load statuses, session run times, etc. High-performance organizations leverage
the value of operational metadata to track trends over time related to load success/failure and
duration. This metadata can be a competitive advantage for organizations that use this information to
plan for future growth.
Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the
Repository Copy method:
G Copying the Repository
G Repository Backup and Restore
G PMREP
Copying the Repository
Copying the Test repository to Production through the GUI client tools is the easiest of all the migration
methods. First, ensure that all users are logged out of the destination repository, then connect to the
PowerCenter Repository Administration Console (as shown below).
If the Production repository already exists, you must delete the repository before you can copy the Test
repository. Before you can delete the repository, you must run the Repository Service in exclusive mode.
1. Click on the INFA_PROD Repository on the left pane to select it and change the running mode to
exclusive mode by clicking on the edit button on the right pane under the properties tab.
2. Delete the Production repository by selecting it and choosing Delete from the context menu.
3. Click on the Action drop-down list and choose Copy Contents From.
4. In the new window, choose the domain name and the repository service INFA_TEST from the drop-down
menus. Enter the username and password of the Test repository.
5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit from the PowerCenter
Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid
username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the
MARKETING_TEST folder to MARKETING_PROD, and the SHARED_MARKETING_TEST to
SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders
before beginning the actual testing process.
10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify
the server information and all connections so they are updated to point to the new Production
locations for all existing tasks and workflows.
Repository Backup and Restore
Backup and Restore Repository is another simple method of copying an entire repository. This process backs
up the repository to a binary file that can be restored to any new location. This method is preferable to the
repository copy process because the backup remains available as a binary file on the repository server, so
if any error occurs during the migration, the backup can simply be restored again.
The following steps outline the process of backing up and restoring the repository for migration.
1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service.
Select Action -> Backup Contents from the drop-down menu.

2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator
username and password. The file is saved to the Backup directory within the Repository Server's home
directory.
3. After you've selected the location and file name, click OK to begin the backup process.

4. The backup process creates a .rep file containing all repository information. Stay logged into the
Manage Repositories screen. When the backup is complete, select the repository connection to which
the backup will be restored (i.e., the Production repository).
5. The system will prompt you to supply a username, password, and the name of the file to be restored.
Enter the appropriate information and click OK.
When the restoration process is complete, you must repeat the steps listed in the copy repository option in
order to delete all of the unused objects and rename the folders.
PMREP
Using the PMREP commands is essentially the same as the Backup and Restore Repository method except
that it is run from the command line rather than through the GUI client tools. PMREP utilities can be used from
the Informatica Server or from any client machine connected to the server.
Refer to the Repository Manager Guide for a list of PMREP commands.
The following is a sample of the command syntax used within a Windows batch file to connect to and back up
a repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform
functions such as connect, backup, restore, etc:
backupproduction.bat
@echo off
REM This batch file uses pmrep to connect to and back up the repository Production on the server Central.
REM Adjust the pmrep path below to match your PowerCenter installation directory.
echo Connecting to Production repository...
"C:\Program Files\Informatica PowerCenter 7.x\RepositoryServer\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
"C:\Program Files\Informatica PowerCenter 7.x\RepositoryServer\bin\pmrep" backup -o c:\backup\Production_backup.rep
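For example, the batch file above could be scheduled to run nightly with the Windows task scheduler; the task name, path, and time below are arbitrary and shown only to illustrate the idea.

schtasks /create /tn "Nightly PMREP Backup" /tr c:\scripts\backupproduction.bat /sc daily /st 02:00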
Post-Repository Migration Cleanup
After you have used one of the repository migration procedures to migrate into Production, follow these steps
to convert the repository to Production:
1. Disable workflows that are not ready for Production or simply delete the mappings, tasks, and
workflows.
H Disable the workflows not being used in the Workflow Manager by opening the workflow
properties, then checking the Disabled checkbox under the General tab.
H Delete the tasks not being used in the Workflow Manager and the mappings in the Designer
2. Modify the database connection strings to point to the production sources and targets.
H In the Workflow Manager, select Relational connections from the Connections menu.
H Edit each relational connection by changing the connect string to point to the production
sources and targets.
H If you are using lookup transformations in the mappings and the connect string is anything
other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.
3. Modify the pre- and post-session commands and SQL as necessary.
H In the Workflow Manager, open the session task properties, and from the Components tab
make the required changes to the pre- and post-session scripts.
4. Implement appropriate security, such as:
H In Development, ensure that the owner of the folders is a user in the development group.
H In Test, change the owner of the test folders to a user in the test group.
H In Production, change the owner of the folders to a user in the production group.
H Revoke all rights to Public other than Read for the Production folders.
Folder Copy
Although deployment groups are becoming a very popular migration method, the folder copy method has
historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows
you to quickly promote all of the objects located within that folder. All source and target objects, reusable
transformations, mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this,
however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not
valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows
from the new folder after the folder is copied.
The three advantages of using the folder copy method are:
G The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and
all the objects located within it.
G If the project uses a common or shared folder and this folder is copied first, then all shortcut
relationships are automatically converted to point to this newly copied common or shared folder.
G All connections, sequences, mapping variables, and workflow variables are copied automatically.
The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is
being performed. Therefore, it is necessary to schedule this migration task during a time when the repository
is least utilized. Remember that a locked repository means that no jobs can be launched during this process.
This can be a serious consideration in real-time or near real-time environments.
The following example steps through the process of copying folders from each of the different environments.
The first example uses three separate repositories for development, test, and production.
1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
G Open the Repository Manager client tool.
G Connect to both the Development and Test repositories.
G Highlight the folder to copy and drag it to the Test repository.
G The Copy Folder Wizard appears to step you through the copy process.
G When the folder copy process is complete, open the newly copied folder in both the
Repository Manager and Designer to ensure that the objects were copied properly.
2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:
G Open the Repository Manager client tool.
G Connect to both the Development and Test repositories.
G Highlight the folder to copy and drag it to the Test repository.
The Copy Folder Wizard will appear.

3. Follow these steps to ensure that all shortcuts are reconnected.
G Use the advanced options when copying the folder across.
G Select Next to use the default name of the folder.
4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.

In a situation where the folder names do not match, a folder compare will take place. The Copy
Folder Wizard then completes the folder copy process. Rename the folder as appropriate and
implement the security.
5. When testing is complete, repeat the steps above to migrate to the Production repository.
When the folder copy process is complete, log onto the Workflow Manager and change the connections to
point to the appropriate target location. Ensure that all tasks were updated correctly and that folder and repository
security is modified for test and production.
Object Copy
Copying mappings into the next stage in a networked environment involves many of the same advantages
and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the
networked environment. For additional information, see the earlier description of Object Copy for the
standalone environment.
One advantage of Object Copy in a distributed environment is that it provides more granular control over
objects.
Two distinct disadvantages of Object Copy in a distributed environment are:
G Much more work to deploy an entire group of objects
G Shortcuts must exist prior to importing/copying mappings
Below are the steps to complete an object copy in a distributed repository environment:
1. If using shortcuts, follow these sub-steps, otherwise skip to step 2:
G In each of the distributed repositories, create a common folder with the exact same name and case.
G Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact
same name.
2. Copy the mapping from the Test environment into Production.
G In the Designer, connect to both the Test and Production repositories and open the appropriate
folders in each.
G Drag-and-drop the mapping from Test into Production.
G During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this
mapping to an existing copy of the mapping already in Production. Note that the ability to compare
objects is not limited to mappings, but is available for all repository objects including workflows,
sessions, and tasks.
3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the
mapping (first ensure that the mapping exists in the current repository).
G If copying the workflow, follow the Copy Wizard.
G If creating the workflow, add a session task that points to the mapping and enter all the appropriate
information.
4. Implement appropriate security.
G In Development, ensure the owner of the folders is a user in the development group.
G In Test, change the owner of the test folders to a user in the test group.
G In Production, change the owner of the folders to a user in the production group.
G Revoke all rights to Public other than Read for the Production folders.
Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between distributed environments
allows the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as
you would in an object copy migration, but can also have the convenience of a repository- or folder-level
migration as all objects are deployed at once. The objects included in a deployment group have no
restrictions and can come from one or multiple folders. Additionally, for additional convenience, you can set
up a dynamic deployment group that allows the objects in the deployment group to be defined by a repository
query, rather than being added to the deployment group manually. Lastly, because deployment groups are
available on versioned repositories, they also have the ability to be rolled back, reverting to the previous
versions of the objects, when necessary.
Advantages of Using Deployment Groups

G Backup and restore of the Repository needs to be performed only once.
G Copying a Folder replaces the previous copy.
G Copying a Mapping allows for different names to be used for the same object.
G Uses for Deployment Groups
H Deployment Groups are containers that hold references to objects that need to be migrated.
H Allows for version-based object migration.
H Faster and more flexible than folder moves for incremental changes.
H Allows for migration rollbacks
H Allows specifying individual objects to copy, rather than the entire contents of a folder.
Types of Deployment Groups

G Static
H Contain direct references to versions of objects that need to be moved.
H Users explicitly add the version of the object to be migrated to the deployment group.
G Dynamic
H Contain a query that is executed at the time of deployment.
H The results of the query (i.e. object versions in the repository) are then selected and copied to
the target repository
Pre-Requisites
Create required folders in the Target Repository
Creating Labels
A label is a versioning object that you can associate with any versioned object or group of versioned objects
in a repository.
G Advantages
H Tracks versioned objects during development.
H Improves query results.
H Associates groups of objects for deployment.
H Associates groups of objects for import and export.
G Create label
H Create labels through the Repository Manager.
H After creating the labels, go to edit mode and lock them.
H The "Lock" option is used to prevent other users from editing or applying the label.
H This option can be enabled only when the label is edited.
H Some Standard Label examples are:
I Development
I Deploy_Test
I Test
I Deploy_Production
I Production
G Apply Label
H Create a query to identify the objects that need to be labeled.
H Run the query and apply the label to the results.
Note: By default, the latest version of the object gets labeled.
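Labels can also be created and applied from the command line, which is useful when labeling a release as part of a scripted build. The commands below are a sketch; the repository, label, folder, and object names are examples, and the exact ApplyLabel options (particularly those controlling dependency handling) should be checked against the pmrep documentation for your version.

pmrep connect -r INFA_DEV -d Domain_Dev -n devuser -x devpwd
pmrep createlabel -a RELEASE_20050130
pmrep applylabel -a RELEASE_20050130 -n m_load_customer_dim -o mapping -f MARKETING_DEV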
Queries
A query is an object used to search for versioned objects in the repository that meet specific conditions.
G Advantages
H Tracks objects during development
H Associates a query with a deployment group
H Finds deleted objects you want to recover
H Finds groups of invalidated objects you want to validate
G Create a query
H The Query Browser allows you to create, edit, run, or delete object queries
G Execute a query
H Execute through Query Browser
H EXECUTE QUERY: ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
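For example, a query created to drive a deployment could be executed from the command line as follows; the query and output file names are illustrative.

pmrep executequery -q Deploy_Test_Query -t shared -u query_results.txt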
Creating a Deployment Group
Follow these steps to create a deployment group:
1. Launch the Repository Manager client tool and log in to the source repository.

2. Expand the repository, right-click on Deployment Groups and choose New Group.

3. In the dialog window, give the deployment group a name, and choose whether it should be static or
dynamic. In this example, we are creating a static deployment group. Click OK.

Adding Objects to a Static Deployment Group
Follow these steps to add objects to a static deployment group:
1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to
the deployment group and choose Versioning -> View History. The View History window appears.

2. In the View History window, right-click the object and choose Add to Deployment Group.
3. In the Deployment Group dialog window, choose the deployment group that you want to add the
object to, and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you
will want to add dependent objects to the deployment group so that they will be migrated as well.
Click OK.

NOTE: The All Dependencies option should be used for any new code that is migrating forward. However,
this option can cause issues when moving existing code forward because All Dependencies also flags
shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts. This does not work,
and causes the deployment to fail.
The object will be added to the deployment group at this time.
Although the deployment group allows the most flexibility, the task of adding each object to the deployment
group is similar to the effort required for an object copy migration. To make deployment groups easier to use,
PowerCenter provides the capability to create dynamic deployment groups.
Adding Objects to a Dynamic Deployment Group
Dynamic Deployment groups are similar in function to static deployment groups, but differ in the way that
objects are added. In a static deployment group, objects are manually added one by one. In a dynamic
deployment group, the contents of the deployment group are defined by a repository query. Don't worry about
the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI interface.
Follow these steps to add objects to a dynamic deployment group:
1. First, create a deployment group, just as you did for a static deployment group, but in this case,
choose the dynamic option. Also, select the Queries button.

2. The Query Browser window appears. Choose New to create a query for the dynamic deployment
group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects
that should be migrated. The drop-down list of parameters lets you choose from 23 predefined
metadata categories. In this case, the developers have assigned the RELEASE_20050130 label to
all objects that need to be migrated, so the query is defined as Label Is Equal To
RELEASE_20050130. The creation and application of labels are discussed in Using PowerCenter
Labels.

4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the
Deployment Group editor window.
Executing a Deployment Group Migration
A Deployment Group migration can be executed through the Repository Manager client tool, or through the
pmrep command line utility. With the client tool, you simply drag the deployment group from the source
repository and drop it on the destination repository. This opens the Copy Deployment Group Wizard, which
guides you through the step-by-step options for executing the deployment group.
Rolling Back a Deployment
To roll back a deployment, you must first locate the deployment via the target repository's menu bar
(i.e., Deployments -> History -> View History -> Rollback).
Automated Deployments
For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep
DeployDeploymentGroup command, which can execute a deployment group migration without human
intervention. This is ideal since the deployment group allows ultimate flexibility and convenience as the script
can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter
administrator. You can also use the pmrep utility to automate importing objects via XML.
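A minimal sketch of such a script is shown below, assuming a versioned source repository, a deployment group named MARKETING_REL1, and an XML control file prepared in advance. Names, paths, and options are illustrative and should be validated against the pmrep Command Line Reference.

REM deploy_marketing.bat - executed nightly by the scheduler
@echo off
echo Connecting to the source repository...
pmrep connect -r INFA_TEST -d Domain_Prod -n deployuser -x deploypwd
echo Deploying the MARKETING_REL1 deployment group...
pmrep deploydeploymentgroup -p MARKETING_REL1 -c deploy_control.xml -r INFA_PROD
echo Deployment complete.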
Recommendations
Informatica recommends using the following process when running in a three-tiered environment with
development, test, and production servers.

Non-Versioned Repositories
For migrating from development into test, Informatica recommends using the Object Copy method. This
method gives you total granular control over the objects that are being moved. It also ensures that the latest
development mappings can be moved over manually as they are completed. For recommendations on
performing this copy procedure correctly, see the steps listed in the Object Copy section.
Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups method for repository
migration in a distributed repository environment. This method provides the greatest flexibility in that you can
promote any object from within a development repository (even across folders) into any destination
repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility,
the use of the deployment group migration method results in automated migrations that can be executed
without manual intervention.
Third-Party Versioning
Some organizations have standardized on third-party version control software. PowerCenter's XML import/
export functionality offers integration with such software and provides a means to migrate objects. This
method is most useful in a distributed environment because objects can be exported into an XML file from
one repository and imported into the destination repository.
The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets,
reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7,
the export/import functionality allows the export/import of multiple objects to a single XML
file. This can significantly cut down on the work associated with object level XML import/export.
The following steps outline the process of exporting the objects from source repository and importing them
into the destination repository:
Exporting
1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the
object to be exported.
2. Select Repository -> Export Objects
3. The system prompts you to select a directory location on the local workstation. Choose the directory to
save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the PowerCenter client installation directory (e.g., C:\Program
Files\Informatica PowerCenter 7.x\Client; this may vary depending on where you installed the client tools).
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved
the XML file.
6. Together, these files are now ready to be added to the version control software
Importing
Log in to Designer or the Workflow Manager client tool and connect to the destination repository. Open the folder
where the object is to be imported.
1. Select Repository -> Import Objects.
2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.

5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping
will now point to the new shortcuts and their parent folder.
6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and
later versions, allowing the activities associated with XML import/export to be automated through
pmrep.
7. Click on the destination repository service in the left pane and choose Action -> Restore. (Remember,
if the destination repository has content, it has to be deleted prior to restoring.)


Last updated: 05-Feb-07 17:22
Migration Procedures - PowerExchange
Challenge
To facilitate the migration of PowerExchange definitions from one environment to another.
Description
There are two approaches to performing a migration:
G Using the DTLURDMO utility
G Using the PowerExchange client tool (Detail Navigator)
DTLURDMO Utility
Step 1: Validate connectivity between the client and listeners

G Test communication between clients and all listeners in the production environment with:
dtlrexe prog=ping loc=<nodename>.
G Run selected jobs to exercise data access through PowerExchange data maps.
Step 2: Run DTLURDMO to copy PowerExchange objects.
At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than
existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility
DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO
does have the ability to copy selectively, however, and the full functionality of the utility is documented in
the PowerExchange Utilities Guide.
The types of definitions that can be managed with this utility are:
G PowerExchange data maps
G PowerExchange capture registrations
G PowerExchange capture extraction data maps
On MVS, the input statements for this utility are taken from SYSIN.
On non-MVS platforms, the input argument points to a file containing the input definition. If no input
argument is provided, it looks for a file dtlurdmo.ini in the current path.
The utility runs on all capture platforms.
Windows and UNIX Command Line
Syntax: DTLURDMO <dtlurdmo definition file>
For example: DTLURDMO e:\powerexchange\bin\dtlurdmo.ini
G DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. If no definition file is specified, it looks for a file dtlurdmo.ini in the current path.
MVS DTLURDMO job utility
Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.
G DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates and is read from the SYSIN card.
AS/400 utility
Syntax: CALL PGM(<location and name of DTLURDMO executable file>)
For example: CALL PGM(dtllib/DTLURDMO)
G DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib
library.
If you want to create a separate DTLURDMO definition file rather than use the default location, you must
give the library and filename of the definition file as a parameter. For example: CALL PGM(dtllib/DTLURDMO) parm('datalib/deffile(dtlurdmo)')
Running DTLURDMO
The utility should be run so that it extracts information from the local files and then writes the datamaps out
through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format
required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again
for the registrations, and then again for the extract maps if this is a capture environment (a sample
registration definition appears after the datamap example below). Commands for datamaps, registrations,
and extract maps cannot be mixed in a single run.
If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then
selective copies can be carried out. Details of performing selective copies are documented fully in the
PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the
existing environment to the new V8.x.x format.
Definition File Example
The following example shows a definition file to copy all datamaps from the existing local datamaps (the
local datamaps are defined in the DATAMAP DD card in the MVS JCL or by the path on Windows or
UNIX) to the V8.x.x listener (defined by the TARGET location node1):
USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;
Note: The encrypted password (EPWD) is generated using the File > Encrypt Password option in the
PowerExchange Navigator.
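When the registrations are copied in a subsequent run, the definition file follows the same general layout but uses the REG_COPY command in place of DM_COPY (and XM_COPY for extraction maps). The sketch below is illustrative only and deliberately omits the selection statement, because the SELECT keywords for registrations differ from those for datamaps; confirm the exact statements in the PowerExchange Utilities Guide before running the utility.

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
REG_COPY;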
PowerExchange Client Tool (Detail Navigator)
Step 1: Validate connectivity between the client and listeners

G Test communication between clients and all listeners in the production environment with:
dtlrexe prog=ping loc=<nodename>.
G Run selected jobs to exercise data access through PowerExchange data maps.
Step 2: Start the PowerExchange Navigator

G Select the datamap that is going to be promoted to production.
G On the menu bar, select the option to send the file to the remote node.
G On the drop-down list box, choose the appropriate location (in this case, mvs_prod).
G Supply the user name and password and click OK.
G A confirmation message for successful migration is displayed.


Last updated: 06-Feb-07 11:39
Running Sessions in Recovery Mode
Challenge
Use the Load Manager architecture for manual error recovery by suspending and
resuming the workflows and worklets when an error is encountered.
Description
When a task in the workflow fails at any point, one option is to truncate the target and
run the workflow again from the beginning. The Load Manager architecture offers an
alternative: the workflow can be suspended so that the user can fix the error and
resume, rather than re-processing the portion of the workflow that completed without
errors. This option, "Suspend on Error", results in accurate and complete target data,
as if the session had completed successfully in a single run.
Configure Mapping for Recovery
For consistent recovery, the mapping needs to produce the same result, in the same
order, in the recovery execution as in the failed execution. This can be achieved by
sorting the input data, either with the sorted ports option in the Source Qualifier (or
Application Source Qualifier) or with a Sorter transformation (using the distinct rows
option) immediately after the Source Qualifier transformation. Additionally, ensure that
all targets receive data from transformations that produce repeatable data.
Configure Session for Recovery
Enable the session for recovery by selecting one of the following three Recovery
Strategies:
G Resume from the last checkpoint
H The Integration Service saves session recovery information and updates recovery tables in the target database.
H If the session is interrupted, the Integration Service uses the saved recovery information to recover it.
G Restart task
H The Integration Service does not save session recovery information.
H If the session is interrupted, the Integration Service reruns the session during recovery.
G Fail task and continue workflow
H The session is not recovered (default).
Configure Workflow for Recovery
The Suspend on Error option directs the Integration Service to suspend the workflow
while the user fixes the error, and then to resume the workflow.
The Integration Service suspends the workflow when any of the following tasks fail:
G Session
G Command
G Worklet
G Email
When a task fails in the workflow, the Integration Service stops running tasks in the
path. The Integration Service does not evaluate the output link of the failed task. If no
other task is running in the workflow, the Workflow Monitor displays the status of the
workflow as "Suspended."
If one or more tasks are still running in the workflow when a task fails, the Integration
Service stops running the failed task and continues running tasks in other paths. The
Workflow Monitor displays the status of the workflow as "Suspending."
When the status of the workflow is "Suspended" or "Suspending," you can fix the error,
such as a target database error, and recover the workflow in the Workflow Monitor.
When you recover a workflow, the Integration Service restarts the failed tasks and
continues evaluating the rest of the tasks in the workflow. The Integration Service does
not run any task that already completed successfully.
Note: You can no longer recover individual sessions in a workflow. To recover a
session, you recover the workflow.
Truncate Target Table
If the truncate table option is enabled in a recovery-enabled session, the target table is
not truncated during the recovery process.
Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log
when it resumes the workflow from the point of suspension. However, the earlier runs
that caused the suspension are recorded in the historical run information in the
repository.
Suspension Email
The workflow can be configured to send an email when the Integration Service
suspends the workflow. When a task fails, the Integration Service suspends the
workflow and sends the suspension email; the user can then fix the error and resume
the workflow. If another task fails while the Integration Service is suspending the
workflow, it does not send another suspension email. The Integration Service sends
another suspension email only if a task fails after the workflow resumes. Use the
"Browse Emails" button on the General tab of the workflow Edit sheet in the Workflow
Designer to configure the suspension email.
Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration
Service also suspends the worklet if a task within the worklet fails. When a task in the
worklet fails, the server stops executing the failed task and other tasks in its path. If no
other task is running in the worklet, the status of the worklet is "Suspended". If other
tasks are still running in the worklet, the status of the worklet is "Suspending". The
parent workflow is also suspended when the worklet is "Suspended" or "Suspending".
Starting Recovery
The recovery process can be started from the Workflow Manager or Workflow Monitor
client tools. Alternatively, it can be started with pmcmd in command line mode or from
a script.
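For example, a workflow enabled for recovery can be recovered with the pmcmd recoverworkflow command. The sketch below assumes a PowerCenter 8 domain; the service, domain, folder, workflow, and credential values are placeholders, and the option flags should be verified against the Command Line Reference for your release.
pmcmd recoverworkflow -sv IS_PROD -d Domain_PROD -u Administrator -p <password> -f DW_LOAD wf_load_customer_dim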
Recovery Tables and Recovery Process
When the Integration Service runs a session that has a resume recovery strategy, it
writes to recovery tables on the target database system. When the Integration Service
recovers the session, it uses information in the recovery tables to determine where to
begin loading data to target tables.
If you want the Integration Service to create the recovery tables, grant table creation
privilege to the database user name for the target database connection. If you do not
want the Integration Service to create the recovery tables, create the recovery tables
manually.
The Integration Service creates the following recovery tables in the target database:
G PM_RECOVERY. Contains target load information for the session run. The
Integration Service removes the information from this table after each
successful session and initializes the information at the beginning of
subsequent sessions.
G PM_TGT_RUN_ID. Contains information the Integration Service uses to
identify each target on the database. The information remains in the table
between session runs. If you manually create this table, you must create a row
and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the
session recovers successfully.
Do not edit or drop the recovery tables before you recover a session. If you disable
recovery, the Integration Service does not remove the recovery tables from the target
database. You must manually remove the recovery tables.
Unrecoverable Sessions
The following options affect whether the session is incrementally recoverable:
G Output is deterministic. A property that determines if the transformation
generates the same set of data for each session run. You can set this property
for SDK sources and Custom transformations.
G Output is repeatable. A property that determines if the transformation
generates the data in the same order for each session run. You can set this
property for Custom transformations.
G Lookup source is static. A Lookup transformation property that determines if
the lookup source is the same between the session and recovery. The
Integration Service uses this property to determine if the output is
deterministic.
Inconsistent Data During Recovery Process
For recovery to be effective, the recovery session must produce the same set of rows,
in the same order, as the original run. Any change made after the initial failure (to the
mapping, the session, or the server) that affects the ability to produce repeatable data
results in inconsistent data during the recovery process.
The following cases may produce inconsistent data during a recovery session:
G The session performs incremental aggregation and the server stops unexpectedly.
G The mapping uses a Sequence Generator transformation.
G The mapping uses a Normalizer transformation.
G The source and/or target changes after the initial session failure.
G The data movement mode changes after the initial session failure.
G A code page (server, source, or target) changes after the initial session failure.
G The mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.
G The session configuration is not supported by PowerCenter for session recovery.
G The mapping uses a lookup table and the data in the lookup table changes between session runs.
G The session sort order changes while the server is running in Unicode mode.
HA Recovery
Highly-available recovery allows the workflow to resume automatically if the Integration
Service fails over. The following options are available on the Properties tab of the
workflow:
G Enable HA recovery - Allows the workflow to be configured for high availability.
G Automatically recover terminated tasks - Recovers terminated Session or Command tasks without user intervention. You must have high availability and the workflow must still be running.
G Maximum automatic recovery attempts - When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. Default is 5.
Note: To run a workflow in HA recovery, you must have HA License for the Repository
Service.
Complex Mappings and Recovery
In the case of complex mappings that load more than one related target (i.e., targets
with a primary key-foreign key relationship), a session failure and subsequent recovery
may lead to data integrity issues. In such cases, the integrity of the target tables must
be checked and fixed prior to starting the recovery process.


Last updated: 01-Feb-07 18:52
Using PowerCenter Labels
Challenge
Using labels effectively in a data warehouse or data integration project to assist with
administration and migration.
Description
A label is a versioning object that can be associated with any versioned object or group of
versioned objects in a repository. Labels provide a way to tag a number of object versions with
a name for later identification. Therefore, a label is a named object in the repository, whose
purpose is to be a pointer or reference to a group of versioned objects. For example, a label
called Project X version X can be applied to all object versions that are part of that project and
release.
Labels can be used for many purposes:
G Track versioned objects during development
G Improve object query results.
G Create logical groups of objects for future deployment.
G Associate groups of objects for import and export.
Note that labels apply to individual object versions, and not objects as a whole. So if a mapping
has ten versions checked in, and a label is applied to version 9, then only version 9 has that
label. The other versions of that mapping do not automatically inherit that label. However,
multiple labels can point to the same object for greater flexibility.
The Use Repository Manager privilege is required in order to create or edit labels. To create a
label, choose Versioning > Labels from the Repository Manager.

When creating a new label, choose a name that is as descriptive as possible. For example, a
suggested naming convention for labels is: Project_Version_Action. Include comments for
further meaningful description.
Locking the label is also advisable. This prevents anyone from accidentally associating
additional objects with the label or removing object references for the label.
Labels, like other global objects such as Queries and Deployment Groups, can have user and
group privileges attached to them. This allows an administrator to create a label that can only
be used by specific individuals or groups. Only those people working on a specific project
should be given read/write/execute permissions for labels that are assigned to that project.

Once a label is created, it should be applied to related objects. To apply the label to objects,
invoke the Apply Label wizard from the Versioning > Apply Label menu option in the
Repository Manager (as shown in the following figure).

Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent and/or
child objects. For example, to group dependencies for a workflow, apply a label to all children
objects. The Repository Server applies labels to sources, targets, mappings, and tasks
associated with the workflow. Use the Move label property to point the label to the latest
version of the object(s).
Note: Labels can be applied to any object version in the repository except checked-out
versions. Execute permission is required for applying labels.
After the label has been applied to related objects, it can be used in queries and deployment
groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the
size of the repository (i.e., to purge object versions).
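Labels can also be created and applied from the command line with pmrep, which is useful when scripting migrations. The sketch below is illustrative only: the repository, domain, label, folder, and object names are placeholders, and the option flags should be verified against the pmrep Command Line Reference for your PowerCenter version.
pmrep connect -r DEV_REPO -d Domain_DEV -n Administrator -x <password>
pmrep createlabel -a LBL_PROJECTX_V1_READY -c "Objects ready for QA migration"
pmrep applylabel -a LBL_PROJECTX_V1_READY -n m_load_customer_dim -o mapping -f DW_LOAD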
Using Labels in Deployment
An object query can be created using the existing labels (as shown below). Labels can be
associated only with a dynamic deployment group. Based on the object query, objects
associated with that label can be used in the deployment.
Strategies for Labels
Repository Administrators and other individuals in charge of migrations should develop their
own label strategies and naming conventions in the early stages of a data integration project.
Be sure that developers are aware of the uses of these labels and when they should apply
labels.
For each planned migration between repositories, choose three labels for the development and
subsequent repositories:
G The first is to identify the objects that developers can mark as ready for migration.
G The second should apply to migrated objects, thus developing a migration audit trail.
G The third is to apply to objects as they are migrated into the receiving repository,
completing the migration audit trail.
When preparing for the migration, use the first label to construct a query to build a dynamic
deployment group. The second and third labels in the process are optionally applied by the
migration wizard when copying folders between versioned repositories. Developers and
administrators do not need to apply the second and third labels manually.
Additional labels can be created with developers to allow the progress of mappings to be
tracked if desired. For example, when an object is successfully unit-tested by the developer, it
can be marked as such. Developers can also label the object with a migration label at a later
time if necessary. Using labels in this fashion along with the query feature allows complete or
incomplete objects to be identified quickly and easily, thereby providing an object-based view of
progress.


Last updated: 12-Feb-07 15:17
Deploying Data Analyzer Objects
Challenge
To understand the methods for deploying Data Analyzer objects among repositories
and the limitations of such deployment.
Description
Data Analyzer repository objects can be exported to and imported from Extensible
Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer
repository and deploying Data Analyzer Dashboards and reports from development to
production.
The following repository objects in Data Analyzer can be exported and imported:
G Schemas
G Reports
G Time Dimensions
G Global Variables
G Dashboards
G Security profiles
G Schedules
G Users
G Groups
G Roles
The XML file created by exporting objects should not be modified. Any change might
invalidate the XML file and cause the import of the objects into a Data Analyzer
repository to fail.
For more information on exporting objects from the Data Analyzer repository, refer to
the Data Analyzer Administration Guide.
Exporting Schema(s)
To export the definition of a star schema or an operational schema, you need to select
a metric or folder from the Metrics system folder in the Schema Directory. When you
export a folder, you export the schema associated with the definitions of the metrics in
that folder and its subfolders. If the folder you select for export does not contain any
objects, Data Analyzer does not export any schema definition and displays the
following message:
There is no content to be exported.
There are two ways to export metrics or folders containing metrics:
G Select the Export Metric Definitions and All Associated Schema Table and
Attribute Definitions option. If you select to export a metric and its associated
schema objects, Data Analyzer exports the definitions of the metric and the
schema objects associated with that metric. If you select to export an entire
metric folder and its associated objects, Data Analyzer exports the definitions
of all metrics in the folder, as well as schema objects associated with every
metric in the folder.
G Alternatively, select the Export Metric Definitions Only option. When you
choose to export only the definition of the selected metric, Data Analyzer does
not export the definition of the schema table from which the metric is derived
or any other associated schema object.
1. Login to Data Analyzer as a System Administrator.
2. Click on the Administration tab > XML Export/Import > Export Schemas.
3. All the metric folders in the schema directory are displayed. Click Refresh
Schema to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click Export as
XML option.
5. Enter XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting Report(s)
To export the definitions of more than one report, select multiple reports or folders.
Data Analyzer exports only report definitions. It does not export the data or the
schedule for cached reports. As part of the report definition export, Data Analyzer
exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table
indicators), custom metrics, and all reports in an analytic workflow, including links to
similar reports.
Reports can have public or personal indicators associated with them. By default, Data
Analyzer exports only public indicators associated with a report. To export the personal
indicators as well, select the Export Personal Indicators check box.
To export an analytic workflow, you need to export only the originating report. When
you export the originating report of an analytic workflow, Data Analyzer exports the
definitions of all the workflow reports. If a report in the analytic workflow has similar
reports associated with it, Data Analyzer exports the links to the similar reports.
Data Analyzer does not export the alerts, schedules, or global variables associated with
the report. Although Data Analyzer does not export global variables, it lists all global
variables it finds in the report filter. You can, however, export these global variables
separately.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Reports.
3. Select the folder or report to be exported.
4. Click Export as XML.
5. Enter XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting Global Variables
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Global Variables.
3. Select the Global variable to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting a Dashboard
Whenever a dashboard is exported, Data Analyzer exports the reports, indicators,
shared documents, and gauges associated with the dashboard. Data Analyzer does
not, however, export the alerts, access permissions, attributes or metrics in the
report(s), or real-time objects. You can export any of the public dashboards defined in the
repository, and can export more than one dashboard at one time.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Dashboards.
3. Select the Dashboard to be exported.
4. Click Export as XML.
5. Enter XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting a User Security Profile
Data Analyzer maintains a security profile for each user or group in the repository. A
security profile consists of the access permissions and data restrictions that the system
administrator sets for a user or group.
When exporting a security profile, Data Analyzer exports access permissions for
objects under the Schema Directory, which include folders, metrics, and attributes.
Data Analyzer does not export access permissions for filtersets, reports, or shared
documents.
Data Analyzer allows you to export only one security profile at a time. If a user or group
security profile you export does not have any access permissions or data restrictions,
Data Analyzer does not export any object definitions and displays the following
message:
There is no content to be exported.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Security Profile.
3. Click Export from users and select the user whose security profile is to be exported.
4. Click Export as XML.
5. Enter XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting a Schedule
You can export a time-based or event-based schedule to an XML file. Data Analyzer
runs a report with a time-based schedule on a configured schedule. Data Analyzer runs
a report with an event-based schedule when a PowerCenter session completes. When
you export a schedule, Data Analyzer does not export the history of the schedule.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Schedules.
3. Select the Schedule to be exported.
4. Click Export as XML.
5. Enter XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.
Exporting Users, Groups, or Roles
Exporting Users
You can export the definition of any user defined in the repository. However, you
cannot export the definitions of system users defined by Data Analyzer. If you have
more than one thousand users defined in the repository, Data Analyzer allows you to
search for the users that you want to export. You can use the asterisk (*) or the percent
symbol (%) as wildcard characters to search for users to export.
You can export the definitions of more than one user, including the following
information:
G Login name
G Description
G First, middle, and last name
G Title
G Password
G Change password privilege
G Password never expires indicator
G Account status
G Groups to which the user belongs
G Roles assigned to the user
G Query governing settings
Data Analyzer does not export the email address, reply-to address, department, or
color scheme assignment associated with the exported user(s).
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the user(s) to be exported.
5. Click Export as XML.
6. Enter XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.
Exporting Groups
You can export any group defined in the repository, and can export the definitions of
multiple groups. You can also export the definitions of all the users within a selected
group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for
groups to export. Each group definition includes the following information:
G Name
G Description
G Department
G Color scheme assignment
G Group hierarchy
G Roles assigned to the group
G Users assigned to the group
G Query governing settings
Data Analyzer does not export the color scheme associated with an exported group.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the group to be exported.
5. Click Export as XML.
6. Enter XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.
Exporting Roles
You can export the definitions of the custom roles defined in the repository. However,
you cannot export the definitions of system roles defined by Data Analyzer. You can
export the definitions of more than one role. Each role definition includes the name and
description of the role and the permissions assigned to each role.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the role to be exported.
5. Click Export as XML.
6. Enter XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.
Importing Objects
You can import objects into the same repository or a different repository. If you import
objects that already exist in the repository, you can choose to overwrite the existing
objects. However, you can import only global variables that do not already exist in the
repository.
When you import objects, you can validate the XML file against the DTD provided by
Data Analyzer. Informatica recommends that you do not modify the XML files after you
export from Data Analyzer. Ordinarily, you do not need to validate an XML file that you
create by exporting from Data Analyzer. However, if you are not sure of the validity of
an XML file, you can validate it against the Data Analyzer DTD file when you start the
import process.
To import repository objects, you must have the System Administrator role or the
Access XML Export/Import privilege.
When you import a repository object, you become the owner of the object as if you
created it. However, other system administrators can also access imported repository
objects. You can limit access to reports for users who are not system administrators. If
you select to publish imported reports to everyone, all users in Data Analyzer have
read and write access to them. You can change the access permissions to reports after
you import them.
Importing Schemas
When importing schemas, if the XML file contains only the metric definition, you must
make sure that the fact table for the metric exists in the target repository. You can
import a metric only if its associated fact table exists in the target repository or the
definition of its associated fact table is also in the XML file.
When you import a schema, Data Analyzer displays a list of all the definitions contained
in the XML file. It then displays a list of all the object definitions in the XML file that
already exist in the repository. You can choose to overwrite objects in the repository. If
you import a schema that contains time keys, you must import or create a time
dimension.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schema.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Reports
A valid XML file of exported report objects can contain definitions of cached or on-
demand reports, including prompted reports. When you import a report, you must make
sure that all the metrics and attributes used in the report are defined in the target
repository. If you import a report that contains attributes and metrics not defined in the
target repository, you can cancel the import process. If you choose to continue the
import process, you may not be able to run the report correctly. To run the report, you
must import or add the attribute and metric definitions to the target repository.
You are the owner of all the reports you import, including the personal or public
indicators associated with the reports. You can publish the imported reports to all Data
Analyzer users. If you publish reports to everyone, Data Analyzer provides read-access
to the reports to all users. However, it does not provide access to the folder that
contains the imported reports. If you want another user to access an imported report,
you can put the imported report in a public folder and have the user save or move the
imported report to his or her personal folder. Any public indicator associated with the
report also becomes accessible to the user.
If you import a report and its corresponding analytic workflow, the XML file contains all
workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites
the workflow reports. Also, when importing multiple workflows, note that Data Analyzer
does not import analytic workflows containing the same workflow report names. Thus,
ensure that all imported analytic workflows have unique report names prior to being
imported.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Report.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Global Variables
You can import global variables that are not defined in the target repository. If the XML
file contains global variables already in the repository, you can cancel the process. If
you continue the import process, Data Analyzer imports only the global variables not in
the target repository.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Global Variables.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Dashboards
Dashboards display links to reports, shared documents, alerts, and indicators. When
you import a dashboard, Data Analyzer imports the following objects associated with
the dashboard:
G Reports
G Indicators
G Shared documents
G Gauges
Data Analyzer does not import the following objects associated with the dashboard:
G Alerts
G Access permissions
G Attributes and metrics in the report
G Real-time objects
If an object already exists in the repository, Data Analyzer provides an option to
overwrite it. Data Analyzer does not import the attributes and metrics in the reports
associated with the dashboard. If the attributes or metrics in a report associated with
the dashboard do not exist, the report does not display on the imported dashboard.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Dashboard.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Security Profile(s)
To import a security profile, you must begin by selecting the user or group to which you
want to assign the security profile. You can assign the same security profile to more
than one user or group.
When you import a security profile and associate it with a user or group, you can either
overwrite the current security profile or add to it. When you overwrite a security profile,
you assign the user or group only the access permissions and data restrictions found in
the new security profile. Data Analyzer removes the old restrictions associated with the
user or group. When you append a security profile, you assign the user or group the
new access permissions and data restrictions in addition to the old permissions and
restrictions.
When exporting a security profile, Data Analyzer exports the security profile for objects
in Schema Directory, including folders, attributes, and metrics. However, it does not
include the security profile for filtersets.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Security Profile.
3. Click Import to Users.
4. Select the user with which you want to associate the security profile you import.
H To associate the imported security profiles with all the users on the page, select the "Users" check box at the top of the list.
H To associate the imported security profiles with all the users in the repository, select Import to All.
H To overwrite the selected user's current security profile with the imported security profile, select Overwrite.
H To append the imported security profile to the selected user's current security profile, select Append.
5. Click Browse to choose an XML file to import.
6. Select Validate XML against DTD.
7. Click Import XML.
8. Verify all attributes on the summary page, and choose Continue.

Importing Schedule(s)
A time-based schedule runs reports based on a configured schedule. An event-based
schedule runs reports when a PowerCenter session completes. You can import a time-based
or event-based schedule from an XML file. When you import a schedule, Data
Analyzer does not attach the schedule to any reports.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schedule.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Users, Groups, or Roles
When you import a user, group, or role, you import all the information associated with
each user, group, or role. The XML file includes definitions of roles assigned to users or
groups, and definitions of users within groups. For this reason, you can import the
definition of a user, group, or role in the same import process.
When importing a user, you import the definitions of roles assigned to the user and the
groups to which the user belongs. When you import a user or group, you import the
user or group definitions only. The XML file does not contain the color scheme
assignments, access permissions, or data restrictions for the user or group. To import
the access permissions and data restrictions, you must import the security profile for
the user or group.
1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import User/Group/Role.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML option.
6. Verify all attributes on the summary page, and choose Continue.

Tips for Importing/Exporting
G Schedule Importing/Exporting of repository objects for a time of minimal
Data Analyzer activity, when most of the users are not accessing the Data
Analyzer repository. This should help to prevent users from experiencing
timeout errors or degraded response time. Only the System Administrator
should perform import/export operations.
G Take a backup of the Data Analyzer repository prior to performing an import/
export operation. This backup should be completed using the Repository
Backup Utility provided with Data Analyzer.
G Manually add user/group permissions for the report. These permissions are not
exported as part of exporting reports and should be added manually after the report is
imported on the target server.
G Use a version control tool. Prior to importing objects into a new environment,
it is advisable to check the XML documents with a version-control tool such as
Microsoft's Visual Source Safe, or PVCS. This facilitates the versioning of
repository objects and provides a means for rollback to a prior version of an
object, if necessary.
G Attach cached reports to schedules. Data Analyzer does not import the
schedule with a cached report. When you import cached reports, you must
attach them to schedules in the target repository. You can attach multiple
imported reports to schedules in the target repository in one process
immediately after you import them.
G Ensure that global variables exist in the target repository. If you import a
report that uses global variables in the attribute filter, ensure that the global
variables already exist in the target repository. If they are not in the target
repository, you must either import the global variables from the source
repository or recreate them in the target repository.
G Manually add indicators to the dashboard. When you import a dashboard,
Data Analyzer imports all indicators for the originating report and workflow
reports in a workflow. However, indicators for workflow reports do not display
on the dashboard after you import it until added manually.
G Check with your System Administrator to understand what level of LDAP
integration has been configured (if any). Users, groups, and roles need to
be exported and imported during deployment when using repository
authentication. If Data Analyzer has been integrated with an LDAP
(Lightweight Directory Access Protocol) tool, then users, groups, and/or roles
may not require deployment.
When you import users into a Microsoft SQL Server or IBM DB2 repository, Data
Analyzer blocks all user authentication requests until the import process is complete.


Installing Data Analyzer
Challenge
Installing Data Analyzer on new or existing hardware, either as a dedicated application on a physical machine (as
Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications
on the same application server.
Description
Consider the following questions when determining what type of hardware to use for Data Analyzer:
If the hardware already exists:
1. Are the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements
of the other applications?
If the hardware does not already exist:
1. Has the organization standardized on hardware or operating system vendor?
2. What type of operating system is preferred and supported? (e.g., Solaris, Windows, AIX, HP-UX, Redhat AS,
SuSE)
3. What database and version is preferred and supported for the Data Analyzer repository?
Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the
reporting response time requirements for Data Analyzer. The following questions should be answered in order to estimate
the size of a Data Analyzer server:
1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?
The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, types of
reports being used (i.e., interactive vs. static), average number of records in a report, application server and operating
system used, among other factors. The following table should be used as a general guide for hardware recommendations
for a Data Analyzer installation. Actual results may vary depending upon exact hardware configuration and user volume.
For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline
Architecture engagement.
Windows
# of Concurrent Users | Avg. # of Rows per Report | Avg. # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50 | 1000 | 2 | 2 | 1 GB | 1
100 | 1000 | 2 | 3 | 2 GB | 1 - 2
200 | 1000 | 2 | 6 | 3.5 GB | 3
400 | 1000 | 2 | 12 | 6.5 GB | 6
100 | 1000 | 2 | 3 | 2 GB | 1 - 2
100 | 2000 | 2 | 3 | 2.5 GB | 1 - 2
100 | 5000 | 2 | 4 | 3 GB | 2
100 | 10000 | 2 | 5 | 4 GB | 2 - 3
100 | 1000 | 2 | 3 | 2 GB | 1 - 2
100 | 1000 | 5 | 3 | 2 GB | 1 - 2
100 | 1000 | 7 | 3 | 2.5 GB | 1 - 2
100 | 1000 | 10 | 3 - 4 | 3 GB | 1 - 2
Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 on Windows 2000 with a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the total number of users by the percentage of concurrent users (a worked example follows this list). In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There will be an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be minimized by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
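As a worked example of note 3 above (the user counts are illustrative assumptions, not measurements): an organization with 2,000 total Data Analyzer users and a typical 10 percent concurrency rate would plan for about 200 concurrent users.
2,000 total users x 10% concurrency = 200 concurrent users, which the Windows table above maps to approximately 6 CPUs, 3.5 GB of RAM for Data Analyzer, and 3 application server instances.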
IBM AIX
# of Concurrent Users | Avg. # of Rows per Report | Avg. # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50 | 1000 | 2 | 2 | 1 GB | 1
100 | 1000 | 2 | 2 - 3 | 2 GB | 1
200 | 1000 | 2 | 4 - 5 | 3.5 GB | 2 - 3
400 | 1000 | 2 | 9 - 10 | 6 GB | 4 - 5
100 | 1000 | 2 | 2 - 3 | 2 GB | 1
100 | 2000 | 2 | 2 - 3 | 2 GB | 1 - 2
100 | 5000 | 2 | 2 - 3 | 3 GB | 1 - 2
100 | 10000 | 2 | 4 | 4 GB | 2
100 | 1000 | 2 | 2 - 3 | 2 GB | 1
100 | 1000 | 5 | 2 - 3 | 2 GB | 1
100 | 1000 | 7 | 2 - 3 | 2 GB | 1 - 2
100 | 1000 | 10 | 2 - 3 | 2.5 GB | 1 - 2
Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the total number of users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application
server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one
physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be minimized by 10 to 25 percent by using SVG charts, otherwise known as interactive
charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
Data Analyzer Installation
The Data Analyzer installation process involves two main components: the Data Analyzer Repository and the Data
Analyzer Server, which is an application deployed on an application server. A Web server is necessary to support these
components and is included with the installation of the application servers. This section discusses the installation process
for JBOSS, BEA WebLogic and IBM WebSphere. The installation tips apply to both Windows and UNIX environments.
This section is intended to serve as a supplement to the Data Analyzer Installation Guide.
Before installing Data Analyzer, be sure to complete the following steps:
G Verify that the hardware meets the minimum system requirements for Data Analyzer. Ensure that the combination
of hardware, operating system, application server, repository database, and, optionally, authentication software
are supported by Data Analyzer. Ensure that sufficient space has been allocated to the Data Analyzer repository.
G Apply all necessary patches to the operating system and database software.
G Verify connectivity to the data warehouse database (or other reporting source) and repository database.
G If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server
or the NT primary domain controller.
G The Data Analyzer license file has been obtained from technical support.
G On UNIX/Linux installations, the OS user that is running Data Analyzer must have execute privileges on all Data
Analyzer installation executables.
In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager.
With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data
Analyzer documentation for detailed information for these components.
Changes to Installation Process
Beginning with Data Analyzer version 7.1.4, Data Analyzer is packaged with PowerCenter Advance Edition. To install only
the Data Analyzer portion, during the installation process choose the Custom Installation option. On the following screen,
uncheck all of the check boxes except the Data Analyzer check box and then click Next.
Repository Configuration
To properly install Data Analyzer you need to have connectivity information for the database server where the repository is
going to reside. This information includes:
G Database URL
G Repository username
G Password for repository username
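For example, for a repository hosted on Oracle, the database URL typically takes the JDBC thin-driver form shown below; the host name, port, and SID are placeholders for your environment.
jdbc:oracle:thin:@dbhost.company.com:1521:ORCL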
Installation Steps: JBOSS
The following are the basic installation steps for Data Analyzer on JBOSS:
1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the
repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to
installation.
2. Install Data Analyzer. The Data Analyzer installation process will install JBOSS if a version does not already exist,
or an existing instance can be selected.
3. Apply the Data Analyzer license key.
4. Install the Data Analyzer Online Help.
Installation Tips: JBOSS
The following are the basic installation tips for Data Analyzer on JBOSS:
G Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of
JBOSS. Also, other applications can coexist with Data Analyzer on a single instance of JBOSS. Although this
architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.
G For JBOSS installations on UNIX, the JBOSS Server installation program requires an X-Windows server. If
JBOSS Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must
be installed on another machine in order to render graphics for the GUI-based installation program. For more
information on installing on UNIX, please see the UNIX Servers section of the installation and configuration tips
below.
G If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
G To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
G During the Data Analyzer installation process, the user will be prompted to choose an authentication method for
Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the
configuration parameters available during installation as the installer will configure all properties files at
installation.
G The Data Analyzer license file must be applied prior to starting Data Analyzer.
Configuration Screen

Installation Steps: BEA WebLogic
The following are the basic installation steps for Data Analyzer on BEA WebLogic:
1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the
repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to
installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.

TIP
When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too
large. Since many target tablespaces are initially set for very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive
amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly.
The following example shows how to set the recommended storage parameters, assuming the repository is stored in the REPOSITORY tablespace:
ALTER TABLESPACE REPOSITORY DEFAULT STORAGE ( INITIAL 10K NEXT 10K MAXEXTENTS UNLIMITED PCTINCREASE 50 );
Installation Tips: BEA WebLogic
The following are the basic installation tips for Data Analyzer on BEA WebLogic:
G Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of
WebLogic. Also, other applications can coexist with Data Analyzer on a single instance of WebLogic. Although
this architecture should be factored in during hardware sizing estimates, it allows greater flexibility during
installation.
G With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for
WebLogic installations.
G If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
G To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
G During the Data Analyzer installation process, the user will be prompted to choose an authentication method for
Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the
configuration parameters available during installation since the installer will configure all properties files at
installation.
G The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.
Configuration Screen

Installation Steps: IBM WebSphere
The following are the basic installation steps for Data Analyzer on IBM WebSphere:
1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the
repository tables, but the empty database schema needs to exist and be able to be connected to via JDBC prior to
installation.
2. Install IBM WebSphere and apply the WebSphere patches. WebSphere can be installed in its Base configuration
or Network Deployment configuration if clustering will be utilized. In both cases, patchsets will need to be applied.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
6. Configure the PowerCenter Integration Utility. See the section "Configuring the PowerCenter Integration Utility for
WebSphere" in the PowerCenter Installation and Configuration Guide.
Installation Tips: IBM WebSphere
G Starting in Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere. Also, other applications can coexist with Data Analyzer on a single instance of WebSphere. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.
G With Data Analyzer 8 there is a console version of the installation available. X-Windows is no longer required for
WebSphere installations.
G For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation, and the root account should be added to both of these groups (example commands follow this list).
G For WebSphere on Windows installations, ensure that Data Analyzer is installed under the padaemon local Windows user ID, which is in the Administrators group and has the advanced user rights "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account will need to be added to the mqm group.
G If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
G To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).
G During the WebSphere installation process, the user will be prompted to enter a directory for the application
server and the HTTP (web) server. In both instances, it is advisable to keep the default installation directory.
Directory names for the application server and HTTP server that include spaces may result in errors.
G During the Data Analyzer installation process, the user will be prompted to choose an authentication method for
Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, have
the configuration parameters available during installation as the installer will configure all properties files at
installation.
G The Data Analyzer license file must be applied prior to starting Data Analyzer.
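The commands below sketch the mqm and mqbrkrs group setup described above for a Linux host; they are illustrative only (AIX, for example, uses mkgroup and chuser instead), so adapt them to your platform's administration tools.
groupadd mqm
groupadd mqbrkrs
usermod -a -G mqm,mqbrkrs root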
Configuration Screen
Installation and Configuration Tips: UNIX Servers
With Data Analyzer 8 there is a console version of the installation available. For previous versions of Data Analyzer, a
graphics display server is required for a Data Analyzer installation on UNIX.
On UNIX, the graphics display server is typically an X-Windows server, although an X-Window Virtual Frame Buffer
(XVFB) or personal computer X-Windows software such as WRQ Reflection-X can also be used. In any case, the X-
Windows server does not need to exist on the local machine where Data Analyzer is being installed, but does need to be
accessible. A remote X-Windows, XVFB, or PC-X Server can be used by setting the DISPLAY to the appropriate IP
address, as discussed below.
If the X-Windows server is not installed on the machine where Data Analyzer will be installed, Data Analyzer can be
installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-
Windows server on another UNIX machine.
To redirect the host output, define the environment variable DISPLAY. On the command line, type the following command
and press Enter:
C shell:
setenv DISPLAY <TCP/IP node of X-Windows server>:0
Bourne/Korn shell:
export DISPLAY=<TCP/IP node of X-Windows server>:0
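If an X-Windows Virtual Frame Buffer (XVFB) is used instead of a full X-Windows server, a minimal sketch is shown below; the display number and screen geometry are arbitrary choices.
Xvfb :1 -screen 0 1024x768x24 &
export DISPLAY=:1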
Configuration

G Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports. For Data Analyzer installations using an application server with JDK 1.4 or later, the java.awt.headless=true setting can be added to the application server startup scripts to facilitate graphics rendering for Data Analyzer (see the startup-script sketch after this list). If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.
G The application server heap size is the memory allocation for the JVM. The recommended heap size depends on
the memory available on the machine hosting the application server and on the server load, but the recommended
starting point is 512MB. This is the first setting that should be examined when tuning a Data Analyzer
instance (see the example startup settings below).
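For example, a minimal sketch of both settings in an application server startup script. The variable name JAVA_OPTIONS is typical of BEA WebLogic startup scripts; WebSphere exposes the equivalent values as generic JVM arguments and initial/maximum heap size in its administrative console, so adjust the mechanism to your environment:
JAVA_OPTIONS="$JAVA_OPTIONS -Djava.awt.headless=true -Xms512m -Xmx512m"
export JAVA_OPTIONS
Raise the -Xms and -Xmx values beyond 512MB only after confirming that sufficient physical memory is available on the host.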


Last updated: 06-Feb-07 11:55
Data Connectivity using PowerCenter
Connect for BW Integration Server
Challenge
Understanding how to use PowerCenter Connect for SAP NetWeaver - BW Option to
load data into the SAP BW (Business Information Warehouse).
Description
The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP
Business Information Warehouse as both a source and target.
Extracting Data from BW
PowerCenter Connect for SAP NetWeaver - BW Option lets you extract data from SAP
BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP
NetWeaver - BW Option integrates with the Open Hub Service (OHS), SAP's
framework for extracting data from BW. OHS uses data from multiple BW data sources,
including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke
programs, which extract data from BW and write the output to SAP transparent tables.
Loading Data into BW
PowerCenter Connect for SAP NetWeaver - BW Option lets you import BW target
definitions into the Designer and use the target in a mapping to load data into BW.
PowerCenter Connect for SAP NetWeaver - BW Option uses the Business Application
Program Interface (BAPI) to exchange metadata and load data into BW.
PowerCenter can use SAP's business content framework to provide a high-volume
data warehousing solution, or SAP's Business Application Program Interface (BAPI),
SAP's strategic technology for linking components into the Business Framework, to
exchange metadata with BW.
PowerCenter extracts and transforms data from multiple sources and uses SAP's high-speed
bulk BAPIs to load the data into BW, where it is integrated with industry-specific
models for analysis through the SAP Business Explorer tool.
Using PowerCenter with PowerCenter Connect to Populate BW
The following paragraphs summarize some of the key differences in using
PowerCenter with the PowerCenter Connect to populate a SAP BW rather than
working with standard RDBMS sources and
targets.
G BW uses a pull model. The BW must request data from a source system
before the source system can send data to the BW. PowerCenter must first
register with the BW using SAP's Remote Function Call (RFC) protocol.
G The native interface to communicate with BW is the Staging BAPI, an API
published and supported by SAP. Three products in the PowerCenter suite
use this API. PowerCenter Designer uses the Staging BAPI to import
metadata for the target transfer structures; PowerCenter Integration Server for
BW uses the Staging BAPI to register with BW and receive requests to run
sessions; and the PowerCenter Server uses the Staging BAPI to perform
metadata verification and load data into BW.
G Programs communicating with BW use the SAP standard saprfc.ini file to
communicate with BW. The saprfc.ini file is similar to the tnsnames file in
Oracle or the interface file in Sybase. The PowerCenter Designer reads
metadata from BW and the PowerCenter Server writes data to BW.
G BW requires that all metadata extensions be defined in the BW Administrator
Workbench. The definition must be imported to Designer. An active structure is
the target for PowerCenter mappings loading BW.
G Because of the pull model, BW must control all scheduling. BW invokes the
PowerCenter session when the InfoPackage is scheduled to run in BW.
G BW only supports insertion of data into BW. There is no concept of updates or
deletes through the Staging BAPI.

Steps for Extracting Data from BW
The process of extracting data from SAP BW is quite similar to extracting data from
SAP. Similar transports are used on the SAP side, and data type support is the same
as that supported for SAP PowerCenter Connect.
The steps required for extracting data are:
1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from
the BW database and write it to either a database table or a file output target.
2. Import the ABAP program. Import the Informatica-provided ABAP program,
which calls the workflow created in the Workflow Manager.
3. Create a mapping. Create a mapping in the Designer that uses the database
table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session
task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run
in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs
together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate
data extraction.
Steps To Load Data into BW
1. Install and Configure PowerCenter Components.
The installation of the PowerCenter Connect for SAP NetWeaver - BW Option
includes both a client and a server component. The Connect server must be
installed in the same directory as the PowerCenter Server. Informatica
recommends installing the Connect client tools in the same directory as the
PowerCenter Client. For more details on installation and configuration refer to
the PowerCenter and the PowerCenter Connect installation guides.
Note: On SAP Transports for PowerConnect version 8.1 and above, it is crucial
to install or upgrade PowerCenter 8.1 transports on the appropriate SAP
system, when installing or upgrading PowerCenter Connect for SAP NetWeaver
- BW Option. If you are extracting data from BW using OHS, you must also
configure the mySAP option. If the BW system is separate from the SAP
system, install the designated transports on the BW system. It is also important
to note that there are now three categories of transports (as compared to two in
previous versions). These are as follows:
G Transports for SAP versions 3.1H and 3.1I.
G Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.
G Transports for SAP Unicode versions 4.7 and above; this category has been added for Unicode extraction support which was not previously available in SAP versions 4.6 and earlier.
2. Build the BW Components.
To load data into BW, you must build components in both BW and
PowerCenter. You must first build the BW components in the Administrator
Workbench:
G Define PowerCenter as a source system to BW. BW requires an external
source definition for all non-R/3 sources.
G Create the InfoObjects in BW (this is similar to a database table).
G The InfoSource represents a provider structure. Create the InfoSource in the
BW Administrator Workbench and import the definition into the PowerCenter
Warehouse Designer.
G Assign the InfoSource to the PowerCenter source system. After you create an
InfoSource, assign it to the PowerCenter source system.
G Activate the InfoSource. When you activate the InfoSource, you activate the
InfoObjects and the transfer rules.
3. Configure the saprfc.ini file.
Required for PowerCenter and Connect to connect to BW.
PowerCenter uses two types of entries to connect to BW through the saprfc.ini
file:
G Type A. Used by PowerCenter Client and PowerCenter Server. Specifies the
BW application server.
G Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option.
Specifies the external program, which is registered at the SAP gateway.
Note: Do not use Notepad to edit the saprfc.ini file because Notepad can
corrupt the file. Set the RFC_INI environment variable on all Windows NT,
Windows 2000, and Windows 95/98 machines that use a saprfc.ini file. RFC_INI is
used to locate the saprfc.ini file. (An example saprfc.ini with both entry types appears below.)
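The following is an illustrative saprfc.ini showing one Type A and one Type R entry. The destination names, host name, program ID, and system number are hypothetical, and the exact parameter set should be confirmed against the PowerCenter Connect installation guide and your BASIS administrator:
DEST=BWSOURCE
TYPE=A
ASHOST=bwapp01
SYSNR=00

DEST=BWLISTENER
TYPE=R
PROGID=INFA_BW
GWHOST=bwapp01
GWSERV=sapgw00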
4. Start the Connect for BW server
Start Connect for BW server after you start PowerCenter Server and before you
create InfoPackage in BW.
5. Build mappings
Import the InfoSource into the PowerCenter repository and build a mapping
using the InfoSource as a target.
The following restrictions apply to building mappings with BW InfoSource target:
G You cannot use BW as a lookup table.
G You can use only one transfer structure for each mapping.
G You cannot execute stored procedure in a BW target.
G You cannot partition pipelines with a BW target.
G You cannot copy fields that are prefaced with /BIC/ from the InfoSource
definition into other transformations.
G You cannot build an update strategy in a mapping. BW supports only inserts; it
does not support updates or deletes. You can use Update Strategy
transformation in a mapping, but the Connect for BW Server attempts to insert
all records, even those marked for update or delete.
6. Load data
To load data into BW from PowerCenter, both PowerCenter and the BW system
must be configured.
Use the following steps to load data into BW:
G Configure a workflow to load data into BW. Create a session in a workflow that
uses a mapping with an InfoSource target definition.
G Create and schedule an InfoPackage. The InfoPackage associates the
PowerCenter session with the InfoSource.
When the Connect for BW Server starts, it communicates with the BW to
register itself as a server. The Connect for BW Server waits for a request from
the BW to start the workflow. When the InfoPackage starts, the BW
communicates with the registered Connect for BW Server and sends the
workflow name to be scheduled with the PowerCenter Server. The Connect for
BW Server reads information about the workflow and sends a request to the
PowerCenter Server to run the workflow.
The PowerCenter Server validates the workflow name in the repository and the
workflow name in the InfoPackage. The PowerCenter Server executes the
session and loads the data into BW. You must start the Connect for BW Server
after you restart the PowerCenter Server.
Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation
datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server
converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one
byte for a continuation flag.
BW receives data until it reads the continuation flag set to zero. Within the transfer
structure, BW then converts the data to the BW datatype. Currently, BW only supports
the following datatypes in transfer structures assigned to BAPI source systems
(PowerCenter ): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT.
All other datatypes result in the following error in BW:
Invalid data type (data type name) for source system of type BAPI.
Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If
you import a date/time value that includes milliseconds, the PowerCenter Server
truncates to seconds. If you write a date/time value to a target column that supports
milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the
date.
Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore,
you cannot load binary data from PowerCenter into BW.
Numeric Datatypes
PowerCenter does not support the INT1 datatype.
Performance Enhancement for Loading into SAP BW
If you see a performance slowdown for sessions that load into SAP BW, set the default
buffer block size to 15MB to 20MB to enhance performance. You can put 5,000 to
10,000 rows per block, so you can calculate the buffer block size needed with the
following formula:
Row size x Rows per block = Default Buffer Block size
For example, if your target row size is 2KB: 2 KB x 10,000 = 20MB.


Last updated: 01-Feb-07 18:52
Data Connectivity using PowerCenter Connect for MQSeries
Challenge
Understanding how to use IBM MQSeries applications in PowerCenter mappings.
Description
MQSeries applications communicate by sending messages asynchronously rather than by calling each other directly.
Applications can also request data using a "request message" on a message queue. Because no open connection is
required between systems, they can run independently of one another. MQSeries enforces no structure on the content or
format of the message; this is defined by the application.
With more and more requirements for on-demand or real-time data integration, as well as the development of Enterprise
Application Integration (EAI) capabilities, MQ Series has become an important vehicle for providing information to data
warehouses in a real-time mode.
PowerCenter provides data integration for transactional data generated by online, continuously running messaging systems (such
as MQ Series). For these types of messaging systems, PowerCenter's Zero Latency (ZL) Engine provides immediate
processing of trickle-feed data, allowing real-time data flow to be processed in both a uni-directional and a bi-directional
manner.
TIP
In order to enable PowerCenter's ZL engine to process MQ messages in real-time, the workflow must be configured to run continuously and a real-time MQ
filter needs to be applied to the MQ source qualifier (such as idle time, reader time limit, or message count).
MQSeries Architecture
IBM MQSeries is a messaging and queuing application that permits programs to communicate with one another across
heterogeneous platforms and network protocols using a consistent application-programming interface.
MQSeries architecture has three parts:
1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
3. MQSeries Message, which incorporates a header and a data component
Queue Manager

G PowerCenter connects to Queue Manager to send and receive messages.
G A Queue Manager may publish one or more MQ queues.
G Every message queue belongs to a Queue Manager.
G Queue Manager administers queues, creates queues, and controls queue operation.
Message Queue

G PowerCenter connects to Queue Manager to send and receive messages to one or more message queues.
G PowerCenter is responsible for deleting the message from the queue after processing it.
TIP
There are several ways to maintain transactional consistency (i.e., clean up the queue after reading). Refer to the Informatica Webzine article on Transactional
Consistency for details on the various ways to delete messages from the queue.
MQSeries Message
An MQSeries message is composed of two distinct sections:
G MQSeries header. This section contains data about the queue message itself. Message header data includes the
message identification number, message format, and other message descriptor data. In PowerCenter, MQSeries
sources and dynamic MQSeries targets automatically incorporate MQSeries message header fields.
G MQSeries message data block. A single data element that contains the application data (sometime referred to as
the "message body"). The content and format of the message data is defined by the application that puts the
message on the queue.

Extracting Data from a Queue
Reading Messages from a Queue
In order for PowerCenter to extract from the message data block, the source system must define the data in one of the
following formats:
G Flat file (fixed width or delimited)
G XML
G COBOL
G Binary
When reading a message from a queue, the PowerCenter mapping must contain an MQ Source Qualifier (MQSQ). If the
mapping also needs to read the message data block, then an Associated Source Qualifier (ASQ) is also needed. When
developing an MQ Series mapping, the MESSAGE_DATA block is re-defined by the ASQ. Based on the format of the
source data, PowerCenter will generate the appropriate transformation for parsing the MESSAGE_DATA. Once
associated, the MSG_ID field is linked within the associated source qualifier transformation.
Applying Filters to Limit Messages Returned
Filters can be applied to the MQ Source Qualifier to reduce the number of messages read.
Filters can also be added to control the length of time PowerCenter reads the MQ queue.
If no filters are applied, PowerCenter reads all messages in the queue and then stops reading.
Example:
PutDate >= 20040901 && PutDate <= 20040930
TIP
In order to leverage reading a single MQ queue to process multiple record types, have the source application populate an MQ header field and then filter the
value set in this field (Example: ApplIdentityData = TRM).
Using MQ Functions
PowerCenter provides built-in functions that can also be used to filter message data.
G Functions can be used to control the end-of-file of the MQSeries queue.
G Functions can be used to enable PowerCenter real-time data extraction.
Available Functions:
Function Description
Idle(n) Time RT remains idle before stopping.
MsgCount(n) Number of messages read from the queue before stopping.
StartTime(time) GMT time when RT begins reading queue.
EndTime(time) GMT time when RT stops reading queue.
FlushLatency(n) Time period RT waits before committing messages read from the queue.
ForcedEOQ(n) Time period RT reads messages from the queue before stopping.
RemoveMsg(TRUE) Removes messages from the queue.
TIP
In order to enable real-time message processing, use the FlushLatency() or ForcedEOQ() MQ functions as part of the filter expression in the MQSQ.
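As an illustration only (verify the exact syntax against the PowerCenter Connect for MQSeries documentation), a filter condition in the MQ Source Qualifier might combine a message-count limit, a flush latency, and destructive reads; the values here are arbitrary:
MsgCount(10000) && FlushLatency(5) && RemoveMsg(TRUE)
With such a filter, the reader stops after 10,000 messages, commits messages read from the queue every 5 seconds, and removes processed messages from the queue.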
Loading Message to a Queue
PowerCenter supports two types of MQ targeting: Static and Dynamic.
G Static MQ Targets. Used for loading message data (instead of header data) to the target. A Static target does not
load data to the message header fields. Use the target definition specific to the format of the message data (i.e.,
flat file, XML, or COBOL). Design the mapping as if it were not using MQ Series, then configure the target
connection to point to a MQ message queue in the session when using MQSeries.
G Dynamic. Used for binary targets only, and when loading data to a message header. Note that certain message
headers in an MQSeries message require a predefined set of values assigned by IBM.
Dynamic MQSeries Targets
Use this type of target if message header fields need to be populated from the ETL pipeline.
MESSAGE_DATA field data type is binary only.
Certain fields cannot be populated by the pipeline (i.e., set by the target MQ environment):
G UserIdentifier
G AccountingToken
G ApplIdentityData
G PutApplType
G PutApplName
G PutDate
G PutTime
G ApplOriginData
Static MQSeries Targets
Unlike dynamic targets, where an MQ target transformation exists in the mapping, static targets use existing target
transformations.
G Flat file
G XML
G COBOL
G RT can only write to one MQ queue per target definition.
G XML targets with multiple hierarchies can generate one or more MQ messages (configurable).
Creating and Configuring MQSeries Sessions
After you create mappings in the Designer, you can create and configure sessions in the Workflow Manager.
Configuring MQSeries Sources
The MQSeries source definition represents the metadata for the MQSeries source in the repository. Unlike other source
definitions, you do not create an MQSeries source definition by importing the metadata from the MQSeries source. Since
all MQSeries messages contain the same message header and message data fields, the Designer provides an MQSeries
source definition with predefined column names.
MQSeries Mappings
MQSeries mappings cannot be partitioned if an associated source qualifier is used.
For MQ Series sources, set the Source Type to the following:
G Heterogeneous - when there is an associated source definition in the mapping. This indicates that the source data
is coming from an MQ source, and the message data is in flat file, COBOL or XML format.
G Message Queue - when there is no associated source definition in the mapping.
Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two
pages to set configurations for each.
Configuring MQSeries Targets
For Static MQSeries targets, select File Target type from the list. When the target is an XML file or XML message data for a
target message queue, the target type is automatically set to XML.
G If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.
G On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.
G Be sure to select the MQ checkbox in Target Options for the Associated file type. Then click Edit Object Properties
and type:
H the connection name of the target message queue.
H the format of the message data in the target queue (ex. MQSTR).
H the number of rows per message (only applies to flat file MQ targets).
Considerations when Working with MQSeries
The following features and functions are not available to PowerCenter when using MQSeries:
G Lookup transformations can be used in an MQSeries mapping, but lookups on MQSeries sources are not allowed.
G No Debug "Sessions". You must run an actual session to debug a queue mapping.
G Certain considerations are necessary when using AEPs, Aggregators, Joiners, Sorters, Rank, or Transaction
Control transformations because they can only be performed on one queue, as opposed to a full data set.
G The MQSeries mapping cannot contain a flat file target definition if you are trying to target an MQSeries queue.
G PowerCenter version 6 and earlier performs a browse of the MQ queue. PowerCenter version 7 provides the
ability to perform a destructive read of the MQ queue (instead of a browse).
G PowerCenter version 7 also provides support for active transformations (i.e., Aggregators) in an MQ source
mapping.
G PowerCenter version 7 provides MQ message recovery on restart of failed sessions.
G PowerCenter version 7 offers enhanced XML capabilities for mid-stream XML parsing.

Appendix Information
PowerCenter uses the following datatypes in MQSeries mappings:
G IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a
mapping.
G Native datatypes. Flat file, XML, or COBOL datatypes associated with an MQSeries message data. Native
datatypes appear in flat file, XML and COBOL source definitions. Native datatypes also appear in flat file and XML
target definitions in the mapping.
G Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the
transformation process. They appear in all the transformations in the mapping.

IBM MQSeries Datatypes
MQSeries Datatypes    Transformation Datatypes
MQBYTE                BINARY
MQCHAR                STRING
MQLONG                INTEGER
MQHEX

Values for Message Header Fields in MQSeries Target Messages
MQSeries Message Header    Description
StrucId Structure identifier
Version Structure version number
Report Options for report messages
MsgType Message type
Expiry Message lifetime
Feedback Feedback or reason code
Encoding Data encoding
CodedCharSetId Coded character set identifier
Format Format name
Priority Message priority
Persistence Message persistence
MsgId Message identifier
CorrelId Correlation identifier
BackoutCount Backout counter
ReplyToQ Name of reply queue
ReplyToQMgr Name of reply queue manager
UserIdentifier Defined by the environment. If the MQSeries server cannot determine this value, the
value for the field is null.
AccountingToken Defined by the environment. If the MQSeries server cannot determine this value, the
value for the field is MQACT_NONE.
ApplIdentityData Application data relating to identity. The value for ApplIdentityData is null.
PutApplType Type of application that put the message on queue. Defined by the environment.
PutApplName Name of application that put the message on queue. Defined by the environment. If the
MQSeries server cannot determine this value, the value for the field is null.
PutDate Date when the message arrives in the queue.
PutTime Time when the message arrives in queue.
ApplOriginData Application data relating to origin. Value for ApplOriginData is null.
GroupId Group identifier
MsgSeqNumber Sequence number of logical messages within group.
Offset Offset of data in physical message from start of logical message.
MsgFlags Message flags
OriginalLength Length of original message


Last updated: 01-Feb-07 18:52
Data Connectivity using PowerCenter Connect for SAP
Challenge
Understanding how to install PowerCenter Connect for SAP R/3, extract data from SAP R/3, and load data into SAP R/3.
Description
SAP R/3 is an ERP software that provides multiple business applications/modules, such as financial accounting, materials
management, sales and distribution, human resources, CRM and SRM. The CORE R/3 system (BASIS layer) is programmed
in Advance Business Application Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP.
PowerCenter Connect for SAP R/3 can write/read/change data in R/3 via BAPI/RFC and IDoc interfaces. The ABAP interface
of PowerCenter Connect can only read data from SAP R/3.
PowerCenter Connect for SAP R/3 provides the ability to extract SAP R/3 data into data warehouses, data
integration applications, and other third-party applications. All of this is accomplished without writing complex ABAP code.
PowerCenter Connect for SAP R/3 generates ABAP programs and is capable of extracting data from transparent tables, pool
tables, and cluster tables.
When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect for SAP R/3 can also extract data
from R/3 using outbound IDocs (Intermediate Documents) in near real-time. The ALE concept available in R/3 Release 3.0
supports the construction and operation of distributed applications. It incorporates controlled exchange of business data
messages while ensuring data consistency across loosely-coupled SAP applications. The integration of various applications
is achieved by using synchronous and asynchronous communication, rather than by means of a central database.
The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A
transparent table definition on the application server is represented by a single physical table on the database server. Pool
and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical
table on the database server.
Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:
Common Program Interface-Communications (CPI-C). CPI-C communication protocol enables online data exchange and
data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires
information such as the host name of the application server and the SAP gateway. This information is stored on the
PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to
execute ABAP stream mode sessions.
Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote
Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and
the service name and gateway on the application server. This information is stored on the PowerCenter Client and
PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing
source definitions, installing ABAP programs and running ABAP file mode sessions.
Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another
system. Transport system is primarily used to migrate code and configuration from development to QA and production
systems. It can be used in the following cases:
G PowerCenter Connect for SAP R/3 installation transports
G PowerCenter Connect generated ABAP programs
Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to
production. Ensure you have a transportable development class/package for the ABAP mappings.
Security. You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs
to create authorizations, profiles, and users for PowerCenter users.
Integration Feature Authorization Object Activity
Import Definitions, Install Programs S_DEVELOP All activities. Also need to set Development Object ID to
PROG
Extract Data S_TABU_DIS READ
Run File Mode Sessions S_DATASET WRITE
Submit Background Job S_PROGRAM BTCSUBMIT, SUBMIT
Release Background Job S_BTCH_JOB DELE, LIST, PLAN, SHOW
Also need to set Job Operation to RELE
Run Stream Mode Sessions S_CPIC All activities
Authorize RFC privileges S_RFC All activities
You also need access to the SAP GUI, as described in following SAP GUI Parameters table:
Parameter Feature references to this variable Comments
User ID $SAP_USERID Identify the username that connects to the SAP GUI
and is authorized for read-only access to the following
transactions:
- SE12
- SE15
- SE16
- SPRO
Password $SAP_PASSWORD Identify the password for the above user
System Number $SAP_SYSTEM_NUMBER Identify the SAP system number
Client Number $SAP_CLIENT_NUMBER Identify the SAP client number
Server $SAP_SERVER Identify the server on which this instance of SAP is
running
Key Capabilities of PowerCenter Connect for SAP R/3
Some key capabilities of PowerCenter Connect for SAP R/3 include:
G Extract data from SAP R/3 using ABAP BAPI /RFC and IDoc interfaces.
G Migrate/load data from any source into R/3 using IDoc, BAPI/RFC and DMI interfaces.
G Generate DMI files ready to be loaded into SAP via SXDA TOOLS or LSMW or SAP standard delivered programs.
G Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerCenter Connect
for SAP R/3 can make BAPI and RFC function calls dynamically from mappings to extract or load.
G Capture changes to the master and transactional data in SAP R/3 using ALE. PowerCenter Connect for SAP R/3
can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs
in real time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.
G Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP
R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and
transformations.
G Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound
IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files
and write data to inbound IDoc files.
G Insert ABAP Code Block to add functionality to the ABAP program flow and use static/dynamic filters to reduce
return rows.
G Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table =
table1-field1 = table2-field2 where the qualifying table is the last table in the condition based on the join order
including outer joins.
G Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
G Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
G Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium). You can install PowerCenter
Connect for SAP R/3 for the PowerCenter Server and Repository Server on SuSe Linux or on Red Hat Linux.
Installation and Configuration Steps
PowerCenter Connect for SAP R/3 setup programs install components for PowerCenter Server, Client, and repository server.
These programs install drivers, connection files, and a repository plug-in XML file that enables integration between
PowerCenter and SAP R/3. Setup programs can also install PowerCenter Connect for SAP R/3 Analytic Business
Components, and PowerCenter Connect for SAP R/3 Metadata Exchange.
The Power Center Connect for SAP R/3 repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be
registered in the PowerCenter repository.
For SAP R/3
Informatica provides a group of customized objects required for R/3 integration in the form of transport files. These objects
include tables, programs, structures, and functions that PowerCenter Connect for SAP exports to data files. The R/3 system
administrator must use the transport control program, tp import, to transport these object files on the R/3 system. The
transport process creates a development class called ZERP. The SAPTRANS directory contains data and co files. The
data files are the actual transport objects. The co files are control files containing information about the transport request.
The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3
for integration involves the following tasks:
G Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it
makes a request to the R/3 system.
G Run the transport program that generates unique Ids.
G Establish profiles in the R/3 system for PowerCenter users.
G Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.
For PowerCenter
The PowerCenter server and client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter
for integration involves the following tasks:
G Run installation programs on PowerCenter Server and Client machines.
G Configure the connection files:

H The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system.
Following are the required parameters for sideinfo :
DEST logical name of the R/3 system
TYPE set to A to indicate connection to specific R/3 system.
ASHOST host name of the SAP R/3 application server.
SYSNR system number of the SAP R/3 application server.
H The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system
as an RFC client. The required parameters for saprfc.ini are (example entries for both files follow this list):
DEST logical name of the R/3 system
LU host name of the SAP application server machine
TP set to sapdp<system number>
GWHOST host name of the SAP gateway machine.
GWSERV set to sapgw<system number>
PROTOCOL set to I for TCP/IP connection.
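The entries below illustrate the parameters described above. The destination, host names, and system number are hypothetical, and the exact file layout should be verified against the PowerCenter Connect for SAP R/3 installation guide.
sideinfo:
DEST=SAPR3
TYPE=A
ASHOST=sapapp01
SYSNR=00
saprfc.ini:
DEST=SAPR3
LU=sapapp01
TP=sapdp00
GWHOST=sapgw01
GWSERV=sapgw00
PROTOCOL=I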
Following is the summary of required steps:
1. Install PowerCenter Connect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
Configuring the Services File
Windows
If SAPGUI is not installed, you must make entries in the Services file to run stream mode sessions. This is found in the
\WINNT\SYSTEM32\drivers\etc directory. Entries should be similar to the following:
sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp
Note: SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system
to use the ABAP workbench and to view table contents.
UNIX
The Services file is located in /etc:
G sapdp<system number> <port# of dispatcher service>/tcp
G sapgw<system number> <port# of gateway service>/tcp
The system number and port numbers are provided by the BASIS administrator.
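For example, for system number 00, the entries would typically be the following (SAP's usual convention is port 32<nn> for the dispatcher and 33<nn> for the gateway, but confirm the actual ports with the BASIS administrator):
sapdp00 3200/tcp
sapgw00 3300/tcp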
Configure Connections to Run Sessions
Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server.
G Streaming Mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.
G File Mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.
If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the
PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not
running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly unlikely.
If you want to use File mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:
G Provide the login and password for the UNIX account used to run the SAP R/3 system.
G Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.
G Create a directory on the machine running SAP R/3, and run chmod g+s on that directory. Provide the login and password for the account used to create this directory.
Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then
configure an FTP connection to access staging files through FTP.
Extraction Process
R/3 source definitions can be imported from the logical tables using RFC protocol. Extracting data from R/3 is a four-step
process:
Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer
calls a function in the R/3 system to import source definitions.
Note: If you plan to join two or more tables in SAP, be sure you have the optimized join conditions. Make sure you have
identified your driving table (e.g., if you plan to extract data from bkpf and bseg accounting tables, be sure to drive your
extracts from bkpf table). There is a significant difference in performance if the joins are properly defined.
Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the
ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data.
You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP
program.
Generate and install ABAP program. You can install two types of ABAP programs for each mapping:
G File mode. Extract data to file. The PowerCenter Server accesses the file through FTP or NFS mount. This mode is used for large extracts as there are timeouts set in SAP for long running queries.
G Stream Mode. Extract data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short running extracts.
You can modify the ABAP program block and customize according to your requirements (e.g., if you want to get data
incrementally, create a mapping variable/parameter and use it in the ABAP program).
Create Session and Run Workflow

G Stream Mode. In stream mode, the installed ABAP program creates buffers on the application server. The program
extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the
PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data when it is received.
G File Mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.
Data Integration Using RFC/BAPI Functions
PowerCenter Connect for SAP R/3 can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3,
change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter
Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn't have to generate and install the ABAP
program for data extraction.
Data Integration Using ALE
PowerCenter Connect for SAP R/3 can integrate PowerCenter with SAP R/3 using ALE. With PowerCenter Connect for SAP
R/3, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also
generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound
mapping in a workflow to process data in SAP R/3 using ALE, it doesn't have to generate and install the ABAP program for
data extraction.
Analytical Business Components
Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3
data. It works in conjunction with PowerCenter and PowerCenter Connect for SAP R/3 to extract master data, perform
lookups, provide documents, and other fact and dimension data from the following R/3 modules:
G Financial Accounting
G Controlling
G Materials Management
G Personnel Administration and Payroll Accounting
G Personnel Planning and Development
G Sales and Distribution
Refer to the ABC Guide for complete installation and configuration information.


Last updated: 01-Feb-07 18:52
Data Connectivity using PowerCenter
Connect for Web Services
Challenge
Understanding PowerCenter Connect for Web Services and configuring PowerCenter
to access a secure web service.
Description
PowerCenter Connect for Web Services (WebServices Consumer) allows PowerCenter
to act as a web services client to consume external web services. PowerCenter
Connect for Web Services uses the Simple Object Access Protocol (SOAP) to
communicate with the external web service provider. An external web service can be
invoked from PowerCenter in three ways:
G Web Service source
G Web Service transformation
G Web Service target
Web Service Source Usage
PowerCenter supports a request-response type of operation using Web Services
source. You can use the web service as a source if the input in the SOAP request
remains fairly constant since input values for a web service source can only be
provided at the source transformation level.
The following steps serve as an example for invoking a temperature web service to
retrieve the current temperature for a given zip code:
1. In Source Analyzer, click Import from WSDL(Consumer).
2. Specify URL http://www.xmethods.net/sd/2001/TemperatureService.wsdl and
pick operation getTemp.
3. Open the Web Services Consumer Properties tab and click Populate SOAP
request and populate the desired zip code value.
4. Connect the output port of the web services source to the target.
Web Service Transformation Usage
PowerCenter also supports a request-response type of operation using Web Services
transformation. You can use the web service as a transformation if your input data is
available midstream and you want to capture the response values from the web
service.
The following steps serve as an example for invoking a Stock Quote web service to
learn the price for each of the ticker symbols available in a flat file:
1. In transformation developer, create a web service consumer transformation.
2. Specify URL http://services.xmethods.net/soap/urn:xmethods-delayed-quotes.
wsdl and pick operation getQuote.
3. Connect the input port of this transformation to the field containing the ticker
symbols.
4. To invoke the web service for each input row, change to source-based commit
and the interval to 1. Also change the Transaction Scope to Transaction in the
web services consumer transformation.
Web Service Target Usage
PowerCenter supports a one-way type of operation using Web Services target. You
can use the web service as a target if you only need to send a message (i.e., and do
not need a response). PowerCenter only waits for the web server to start processing
the message; it does not wait for the web server to finish processing the web service
operation.
The following provides an example for invoking a sendmail web service:
1. In Warehouse Designer, click Import from WSDL(Consumer)
2. Specify URL http://webservices.matlus.com/scripts/emailwebservice.dll/wsdl/
IEmailService and pick operation SendMail
3. In the mapping, connect the input ports of the web services target to the ports
containing appropriate values.
PowerCenter Connect for Web Services and Web Services Provider
Informatica also offers a product called Web Services Provider that differs from
PowerCenter Connect for Web Services. The advantage of this feature is that it
decouples the web service being consumed from the client. Using Informatica
PowerCenter as the glue, you can make changes that are transparent to the client.
This is helpful because Informatica Professional Services will most likely not have
access to the client code or the web service.
G In Web Services Provider, PowerCenter acts as a Service Provider and
exposes many key functionalities as web services.
G In PowerCenter Connect for Web Services, PowerCenter acts as a web
service client and consumes external web services.
G It is not necessary to install or configure Web Services Provider in order to use
PowerCenter Connect for Web Services.
G Web services exposed through PowerCenter have two formats:
H Real-Time: In real-time format, web-enabled workflows are exposed. The Web
Services Provider must be used and pointed to the web service that the
mapping is going to consume. Workflows can be started and protected.
H Batch: In batch mode, a preset group of services is exposed to run and
monitor workflows in your system. This is useful for reporting engines and similar clients.

Configuring PowerCenter to Invoke a Secure Web Service
Secure Sockets Layer (SSL) is used to provide such security features as authentication
and encryption to web services applications. The authentication certificates follow the
Public Key Infrastructure (PKI) standard, a system of digital certificates provided by
certificate authorities to verify and authenticate parties of Internet communications or
transactions. These certificates are managed in the following two keystore files:
G Truststore. Truststore holds the public keys for the entities it can trust.
PowerCenter uses the entries in the Truststore file to authenticate the external
web services servers.
G Keystore (Clientstore). Clientstore holds both the entity's public and private
keys. PowerCenter sends the entries in the Clientstore file to the web services
server so that the web services server can authenticate the PowerCenter
server.
By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/
security directory are used for Truststores. You can also create new keystore files and
configure the TrustStore and ClientStore parameters in the PowerCenter Server setup
to point to these files. Keystore files can contain multiple certificates and are managed
using utilities like keytool.
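To verify the contents of a keystore after adding certificates, the standard keytool utility can be used; for example (the file name and password here are illustrative):
keytool -list -v -keystore trust.jks -storepass changeit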
SSL authentication can be performed in three ways:
G Server authentication
G Client authentication
G Mutual authentication
Server authentication:
When establishing an SSL session in server authentication, the web services server
sends its certificate to PowerCenter and PowerCenter verifies whether the server
certificate can be trusted. Only the truststore file needs to be configured in this case.
Assumptions:
Web Services Server certificate is stored in server.cer file
PowerCenter Server(Client) public/private key pair is available in keystore client.jks
Steps:
1. Import the server's certificate into the PowerCenter Server's truststore file. You
can use either the default keystores jssecacerts and cacerts, or create your own
keystore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts
-storepass changeit
3. At the prompt for trusting this certificate, type yes.
4. Configure PowerCenter to use this truststore file. Open the PowerCenter Server
setup -> JVM options tab and, in the value for Truststore, give the full path and
name of the keystore file (i.e., c:\trust.jks).
Client authentication:
When establishing an SSL session in client authentication, PowerCenter sends its
certificate to the web services server. The web services server then verifies whether
the PowerCenter Server can be trusted. In this case, you need only the clientstore file.
Steps:
1. Keystore containing the private/public key pair is called client.jks. Be sure the
client private key password and the keystore password are the same, (e.g.,
changeit)
INFORMATICA CONFIDENTIAL BEST PRACTICE 135 of 702
2. Configure PowerCenter to use this clientstore file. Open the PowerCenter
Server setup-> JVM options tab and in the value for Clientstore, type the full
path and name of the keystore file (i.e., c:\client.jks)
3. Add an additional JVM parameter in the PowerCenter Server setup and give the
value as Djavax.net.ssl.keyStorePassword=changeit
Mutual authentication:
When establishing an SSL session in mutual authentication, both PowerCenter Server
and the Web Services server send their certificates to each other and both verify if the
other one can be trusted. You need to configure both the clientstore and the truststore
files.
Steps:
1. Import the server's certificate into the PowerCenter Server's truststore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts
-storepass changeit
3. Configure PowerCenter to use this truststore file. Open the PowerCenter Server
setup -> JVM options tab and, in the value for Truststore, type the full path and
name of the keystore file (i.e., c:\trust.jks).
4. The keystore containing the client public/private key pair is called client.jks. Be sure
the client private key password and the keystore password are the same (e.g.,
changeit).
5. Configure PowerCenter to use this clientstore file. Open the PowerCenter
Server setup -> JVM options tab and, in the value for Clientstore, type the full
path and name of the keystore file (i.e., c:\client.jks).
6. Add an additional JVM parameter in the PowerCenter Server setup and type
the value as -Djavax.net.ssl.keyStorePassword=changeit
Note: If your client private key is not already present in the keystore file, you cannot
use the keytool command to import it. Keytool can only generate a private key; it cannot
import a private key into a keystore. In this case, use an external Java utility such as
utils.ImportPrivateKey (WebLogic) or KeystoreMove (to convert PKCS#12 format to JKS) to
move it into the JKS keystore.
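If the private key and its certificate exist as separate PEM files, one approach is to first bundle them into a PKCS#12 file with openssl (the file names below are hypothetical) and then convert that file to JKS with a utility such as KeystoreMove:
openssl pkcs12 -export -in client_cert.pem -inkey client_key.pem -out client.pfx -name client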
Converting Other Formats of Certificate Files
There are a number of formats of certificate files available: DER format (.cer and .der
extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12
extension). You can convert from one format of certificate to another using openssl.
Refer to the openssl documentation for complete information on such conversions. A
few examples are given below:
To convert from PEM to DER, assuming that you have a PEM file called server.pem:
G openssl x509 -in server.pem -inform PEM -out server.der -outform DER
To convert a PKCS12 file, you must first convert to PEM, and then from PEM to
DER:
Assuming that your PKCS12 file is called server.pfx, the two commands are:
G openssl pkcs12 -in server.pfx -out server.pem
G openssl x509 -in server.pem -inform PEM -out server.der -outform DER


Last updated: 01-Feb-07 18:52
Data Migration Principles
Challenge
A successful Data Migration effort is often critical to a system implementation. These
implementations can be a new or upgraded ERP package, integration due to merger
and acquisition activity or the development of a new operational system. The effort and
criticality of the Data Migration as part of the larger system implementation project is
often overlooked, underestimated or given a lower priority in the scope of the full
implementation project. As an end result, implementations are often delayed at a great
cost to the organization while Data Migration issues are addressed. Informatica's suite
of products provides functionality and processes to minimize the cost of the migration,
lower risk, and increase the probability of success (i.e., completing the project on time
and on budget).
In this Best Practice we discuss basic principles for data migration that lower
project time, staff development time, risk, and the total cost of
ownership of the project. These principles include:
1. Leverage staging strategies
2. Utilize table driven approaches
3. Develop via Modular Design
4. Focus On Re-Use
5. Common Exception Handling Processes
6. Multiple Simple Processes versus Few Complex Processes
7. Take advantage of metadata
Description
Leverage Staging Strategies
As discussed elsewhere in Velocity, in data migration it is recommended to employ
both a legacy staging and pre-load staging area. The reason for this is simple, it
provides the ability to pull data from the production system and use it for data cleaning
and harmonization activities without interfering with the production systems. By
leveraging this type of strategy you are able to see real production data sooner and
follow the guiding principle of Convert Early, Convert Often, and with Real Production
Data'.
Utilize Table Driven Approaches
Developers frequently find themselves in positions where they need to perform a large
amount of cross-referencing, hard-coding of values, or other repeatable
transformations during a Data Migration. These transformations are often likely to
change over time. Without a table-driven approach this causes code
changes, bug fixes, re-testing, and re-deployments during the development effort. This
work is often unnecessary and can be avoided with the use of
configuration or reference data tables. It is recommended to use table-driven approaches
such as these whenever possible. Some common table-driven approaches include:
G Default Values - hard-coded values for a given column, stored in a table
where the values can be changed whenever a requirement changes. For
example, if you have a hard-coded value of 'NA' for any value not populated
and then want to change that value to 'NV', you can simply change the value
in a default value table rather than change numerous hard-coded values.
G Cross-Reference Values - frequently in data migration projects there is a
need to take values from the source system and convert them to the values of
the target system. These values are usually identified up-front, but as the
source system changes, additional values are also needed. In a typical
mapping development situation this would require adding additional values to
a series of IIF or DECODE statements. With a table-driven approach, new data
could be added to a cross-reference table and no coding, testing, or
deployment would be required (see the sketch after this list).
G Parameter Values - by using a table-driven parameter file you can reduce the
need for scripting and accelerate the development process.
G Code-Driven Table - in some instances a set of well-understood rules is known.
By taking those rules and building code against them, a table-driven/code
solution can be very productive. For example, if you had a rules table that was
keyed by table/column/rule id, then whenever that combination was found a
pre-set piece of code would be executed. If at a later date the rules change to
a different set of pre-determined rules, the rule table could change for the
column and no additional coding would be required.
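As a sketch of the cross-reference approach described above (the port, table, and transformation names are hypothetical), compare a hard-coded PowerCenter expression with a table-driven lookup. Hard-coded, every new source value forces a code change:
DECODE(SRC_STATUS, 'A', 'ACTIVE', 'I', 'INACTIVE', 'NA')
Table-driven, an unconnected Lookup transformation (here called LKP_XREF) resolves the value from a cross-reference table keyed on source system, field name, and source value, so new values are handled by inserting a row into that table rather than changing the mapping:
:LKP.LKP_XREF('CRM', 'STATUS', SRC_STATUS)
A default-value column in the same table can supply the 'NA'-style fallback, keeping that rule table-driven as well.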
Develop Via Modular Design
As part of the migration methodology, modular design is encouraged. Modular design is
the act of developing a standard way of how similar mappings should function. These
are then published as templates and developers are required to build similar mappings
in that same manner. This provides rapid development, increases efficiency for testing,
and increases ease of maintenance. The result is a dramatically
lower total cost of ownership and reduced project cost.
Focus On Re-Use
Re-use should always be considered during Informatica development. However, because data migration projects have such a high degree of repeatability, re-use is paramount to their success. There is often tremendous opportunity for re-use of mappings, strategies, processes, scripts, and testing documents. This reduces the staff time required for migration projects and lowers project costs.
Common Exception Handling Processes
The Velocity Data Migration Methodology is iterative by design: new data quality rules are added as problems are found in the data. Because of this, it is critical to find data exceptions and write appropriate rules to correct them throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions. This common method should then be deployed for all data migration processes.
Multiple Simple Processes versus Few Complex Processes
For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, such complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.
Take Advantage of Metadata
The Informatica data integration platform is highly metadata driven; take advantage of those capabilities on data migration projects. This can be done via a host of reports against the data integration repository (an illustrative query follows the list below), such as:
1. Illustrate how the data is being transformed (i.e., lineage reports)
2. Illustrate who has access to what data (i.e., security group reports)
3. Illustrate what source or target objects exist in the repository
4. Identify how many mappings each developer has created
5. Identify how many sessions each developer has run during a given time period
6. Identify how many successful/failed sessions have been executed
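As an illustration of the last two reports, the following query sketches how session-run figures might be pulled from the repository's MX views. The view and column names used here (REP_SESS_LOG with SUBJECT_AREA, SESSION_NAME, SUCCESSFUL_ROWS, FAILED_ROWS, and ACTUAL_START) are assumptions based on typical MX documentation and must be verified against the MX views available in your repository version.

SELECT SUBJECT_AREA,
       SESSION_NAME,
       COUNT(*)             AS SESSION_RUNS,
       SUM(SUCCESSFUL_ROWS) AS ROWS_LOADED,
       SUM(FAILED_ROWS)     AS ROWS_REJECTED
FROM   REP_SESS_LOG
WHERE  ACTUAL_START >= ADD_MONTHS(SYSDATE, -1)   -- last month of runs (Oracle syntax)
GROUP BY SUBJECT_AREA, SESSION_NAME;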
In summary, these design principles provide significant benefits to data migration
projects and add to the large set of typical best practice items that are available in
Velocity. The key to Data Migration projects is architect well, design better, and
execute best.
Last updated: 01-Feb-07 18:52
Data Migration Project Challenges
Challenge
A successful Data Migration effort is often critical to a system implementation. These implementations
can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the
development of a new operational system. The effort and criticality of the Data Migration within the larger system implementation project are often overlooked, underestimated, or given a lower priority in the scope of the full implementation. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and processes to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget).
In this best practice, the three main data migration project challenges are discussed. These are:
1. Specifications that are incomplete, inaccurate, or not completed on time.
2. Data quality problems impacting project timelines.
3. Difficulties in project management when executing the data migration project.
Description
Unlike other Velocity Best Practices, this one does not specify a full solution for each challenge. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.
Migration Specifications
A challenge that data migration projects consistently encounter is problems with the migration specifications. Projects require the completion of functional specs to identify what is required of each migration interface.
Definitions:
G A migration interface is defined as one or more mappings/sessions/workflows or scripts used to migrate a data entity from one source system to one target system.
G A Functional Requirements Specification normally comprises a document covering details such as security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.
Many projects attempt to complete these migrations without these types of specifications. Such projects have little to no chance of completing on time or on budget. Time and subject-matter expertise are needed to complete this analysis; it is the baseline for project success.
Projects are also disadvantaged when functional specifications are not completed on time. Developers can be left in a wait mode for extended periods when specs are not completed by the time specified in the project plan.
Another project risk arises when the wrong individuals are asked to write these specs, or when insufficient importance is attached to the exercise. These situations produce inaccurate or incomplete specifications, which prevent data integration developers from successfully building the migration processes.
To address this challenge, migration projects must have specifications that are accurate and delivered on time.
Data Quality
Most projects are affected by data quality due to the need to address problems in the source data that
fit into the six dimensions of data quality:
Data Quality Dimension Description
Completeness What data is missing or unusable?
Conformity What data is stored in a non-standard format?
Consistency What data values give conflicting information?
Accuracy What data is incorrect or out of date?
Duplicates What data records or attributes are repeated?
Integrity What data is missing or not referenced?
Data quality problems in data migration are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems BEFORE loading the data into the new target system.
Informatica's data integration platform provides data quality capabilities that can help to identify data quality problems efficiently, but Subject-Matter Experts are required to determine how these problems should be addressed within the business context and process.
Project Management
Project managers are often disadvantaged on these types of projects because they are typically much larger, more expensive, and more complex than any prior project they have been involved with. They need to understand early in the project the importance of correctly completed specs and of addressing data quality, and they need to establish a set of tools to plan the project accurately and objectively, with the ability to evaluate progress.
Informatica's Velocity Migration Methodology, its tool sets, and its metadata reporting capabilities are key to addressing these project challenges. The essential task is to fully understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address them, and how metadata reporting can provide objective information on project status.
In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity Methodology's focus on data migration and how Informatica's products can address these challenges, they can be minimized and the migration made successful.
Last updated: 01-Feb-07 18:52
Data Migration Velocity Approach
Challenge
A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or
upgraded ERP package, integration due to merger and acquisition activity or the development of a new operational
system. The effort and criticality of the Data Migration within the larger system implementation project are often overlooked, underestimated, or given a lower priority in the scope of the full implementation. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and processes to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget).
To meet these objectives, a set of best practices focused on Data Migration has been provided in Velocity. This Best Practice provides an overview of how to use Informatica's products in an iterative methodology to expedite a data migration project. The keys to the methodology are further discussed in the Best Practice Data Migration Principles.
Description
The Velocity approach to data migration is outlined below. While it is possible to migrate data in one step, it is more productive to break the process up into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations will have three to four trial cutovers or mock runs before the final Go-Live. The mantra for an Informatica-based migration is to 'Convert Early, Convert Often, and Convert with Real Production Data'. To do this, the following approach is encouraged:
Analysis
In the analysis phase the migration specifications are completed; these include both the functional specs and the target-source matrix.
See the Best Practice Data Migration Project Challenges for related information.
Acquire
In the acquire phase the target-source matrix is reviewed and all source systems/tables are identified. These tables are used to develop one mapping per source table to populate a mirrored structure in a legacy staging schema. For example, if 50 source tables were identified across all the Target-Source Matrix documents, 50 legacy tables would be created and one mapping would be developed for each table.
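A minimal sketch of the mirroring concept follows; the schema and table names are hypothetical and the syntax is Oracle-style. In practice the load itself is a straightforward PowerCenter acquire mapping rather than a direct INSERT, but the effect is the same.

-- One mirrored staging table per identified source table, with the same structure:
CREATE TABLE LGCY_STG.CUSTOMER AS
SELECT * FROM ERP_A.CUSTOMER WHERE 1 = 0;   -- copies structure only, loads no rows

-- The corresponding acquire mapping is then a simple one-to-one load:
INSERT INTO LGCY_STG.CUSTOMER
SELECT * FROM ERP_A.CUSTOMER;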
It is recommended to perform the initial development against test data but, once development is complete, to run a single extract of the current production data. This assists in addressing data quality problems without impacting production systems. It is recommended to run these extracts during low-use time periods and with the cooperation of the operations group responsible for these systems.
It is also recommended to take advantage of the Visio Generation Option if it is available. These mappings are very straightforward, and the use of auto-generation can increase consistency and lower the staff time required for the project.
Convert
In this phase data is extracted from the legacy stage tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process, a standard exception process should be developed to detect exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and the appropriate data quality scorecards should be reviewed.
During the convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction (a sketch of such checks follows the table below). The basic exception tests include:
1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values
Exception Type | Exception Description
Data Type | Will the source data value load correctly to the target data type, such as a numeric date loading into an Oracle date type?
Data Size | Will a numeric value from the source load correctly to the target column, or will a numeric overflow occur?
Data Length | Is the input value too long for the target column? (This is appropriate for all data types but of particular interest for string data types. For example, in one system a field could be char(256) but most of the values are char(10). If the target field is varchar(20), any value over char(20) should raise an exception.)
Range of Values | Is the input value within a tolerable range for the new system? (For example, does the birth date for an insurance subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails, the date is unreasonable and should be addressed.)
Valid Values | Is the input value in the list of valid values for the target system? (For example, does the state code on an input record match the list of states in the new target system? If not, the data should be corrected prior to entry into the new system.)
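A minimal SQL sketch of how some of these tests can be expressed follows. In practice the checks are usually implemented in the convert mappings (for example, with Expression and Router transformations) and the failing rows written to the common exception table; all table, column, and limit values below are illustrative assumptions, and the syntax is Oracle/ANSI-style.

-- Collect length, range, and valid-value exceptions for review and correction
SELECT 'CUSTOMER'      AS ENTITY,
       c.CUSTOMER_ID   AS RECORD_KEY,
       CASE
           WHEN LENGTH(TRIM(c.SHORT_NAME)) > 20
                THEN 'DATA LENGTH: SHORT_NAME exceeds 20 characters'
           WHEN c.BIRTH_DATE NOT BETWEEN DATE '1900-01-01' AND DATE '2006-01-01'
                THEN 'RANGE OF VALUES: BIRTH_DATE is unreasonable'
           WHEN v.STATE_CODE IS NULL
                THEN 'VALID VALUES: STATE_CODE not found in target list'
       END             AS EXCEPTION_DESC
FROM   LGCY_STG.CUSTOMER c
LEFT OUTER JOIN TGT_REF.STATE_CODES v
       ON v.STATE_CODE = c.STATE_CODE
WHERE  LENGTH(TRIM(c.SHORT_NAME)) > 20
   OR  c.BIRTH_DATE NOT BETWEEN DATE '1900-01-01' AND DATE '2006-01-01'
   OR  v.STATE_CODE IS NULL;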
Once profiling exercises, exception reports and data quality scorecards are complete a list of data quality issues should be
created.
This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data.
These details should be added to the spec and the original convert process should be modified with the new data quality
rules.
The convert process should then be re-executed, along with the profiling, exception reporting, and data scorecarding, until the data is correct and ready for load to the target application.
Migrate
In the migrate phase the data from the convert phase is loaded to the target application. The expectation is that there should be no failures on these loads; the data should already have been corrected in the convert phase prior to loading the target application.
Once the migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation; this is discussed in the Best Practice Build Data Audit/Balancing Processes.
Additional detail about these steps is provided in the Best Practice Data Migration Principles.
Last updated: 06-Feb-07 12:08
Build Data Audit/Balancing Processes
Challenge
Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete; more specifically, to verify that all the appropriate data was extracted from a source system and propagated to its final target. This best practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated, internally or externally, or that have to comply with a host of government regulations such as Sarbanes-Oxley, Basel II, HIPAA, the Patriot Act, and many others.
Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold
various control metrics regarding the data integration process. Ultimately, business intelligence reports
provide insight at a glance to verify that the correct data has been pulled from the source and completely
loaded to the target. Each control measure that is being tracked will require development of a
corresponding PowerCenter process to load the metrics to the Audit/Balancing Detail table.
To implement this type of solution, execute the following tasks:
1. Work with business users to identify which audit/balancing processes are needed. Some examples may be:
a. Customers (Number of Customers or Number of Customers by Country)
b. Orders (Qty of Units Sold or Net Sales Amount)
c. Deliveries (Number of shipments or Qty of units shipped or Value of all shipments)
d. Accounts Receivable (Number of Accounts Receivable Shipments or Total Accounts
Receivable Outstanding)
2. Define for each process defined in #1 which columns should be used for tracking purposes for both
the source and target system.
3. Develop a data integration process that will read from the source system and populate the detail
audit/balancing table with the control totals.
4. Develop a data integration process that will read from the target system and populate the detail
audit/balancing table with the control totals.
5. Develop a reporting mechanism that queries the audit/balancing table and identifies whether the source and target entries match or whether there is a discrepancy.
An example audit/balance table definition looks like this (an example of the load that populates it follows the definitions):
Audit/Balancing Details
Column Name Data Type Size
AUDIT_KEY NUMBER 10
CONTROL_AREA VARCHAR2 50
CONTROL_SUB_AREA VARCHAR2 50
CONTROL_COUNT_1 NUMBER 10
CONTROL_COUNT_2 NUMBER 10
CONTROL_COUNT_3 NUMBER 10
CONTROL_COUNT_4 NUMBER 10
CONTROL_COUNT_5 NUMBER 10
CONTROL_SUM_1 NUMBER (p,s) 10,2
CONTROL_SUM_2 NUMBER (p,s) 10,2
CONTROL_SUM_3 NUMBER (p,s) 10,2
CONTROL_SUM_4 NUMBER (p,s) 10,2
CONTROL_SUM_5 NUMBER (p,s) 10,2
UPDATE_TIMESTAMP TIMESTAMP
UPDATE_PROCESS VARCHAR2 50
Control Column Definition by Control Area/Control Sub Area
Column Name Data Type Size
CONTROL_AREA VARCHAR2 50
CONTROL_SUB_AREA VARCHAR2 50
CONTROL_COUNT_1 VARCHAR2 50
CONTROL_COUNT_2 VARCHAR2 50
CONTROL_COUNT_3 VARCHAR2 50
CONTROL_COUNT_4 VARCHAR2 50
CONTROL_COUNT_5 VARCHAR2 50
CONTROL_SUM_1 VARCHAR2 50
CONTROL_SUM_2 VARCHAR2 50
CONTROL_SUM_3 VARCHAR2 50
CONTROL_SUM_4 VARCHAR2 50
CONTROL_SUM_5 VARCHAR2 50
UPDATE_TIMESTAMP TIMESTAMP
UPDATE_PROCESS VARCHAR2 50
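As a hedged sketch of steps 3 and 4 above, loading the detail table amounts to one aggregate query per control area, run once against the source side and once against the target side. The physical table name, the source table and column names, and the convention used to distinguish legacy rows from target rows (here the UPDATE_PROCESS value) are all assumptions; in practice these loads are built as PowerCenter mappings, typically with an Aggregator transformation.

-- Source-side control totals for the Orders control area (illustrative names).
-- AUDIT_KEY is omitted here; in a mapping it would come from a Sequence Generator
-- transformation (or a database sequence/trigger).
INSERT INTO AUDIT_BALANCING_DETAIL
      (CONTROL_AREA, CONTROL_SUB_AREA,
       CONTROL_COUNT_1, CONTROL_SUM_1,
       UPDATE_TIMESTAMP, UPDATE_PROCESS)
SELECT 'ORDERS',
       'ALL',
       COUNT(*),                -- number of orders
       SUM(o.NET_SALES_AMT),    -- net sales amount
       SYSTIMESTAMP,
       'm_AUDIT_ORDERS_SRC'     -- identifies the legacy-side load
FROM   LGCY_STG.ORDERS o;

The matching target-side load is identical except that it reads the target application's order tables and tags the row with a target-side process name (for example, 'm_AUDIT_ORDERS_TGT').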
The following screenshot shows a single mapping that populates both the source and target control values:
The following two screenshots show how two mappings could be used to provide the same results:
Note: One key challenge is how to capture the appropriate control values from the source system if it is
continually being updated. The first example with one mapping will not work due to the changes that occur
in the time between the extraction of the data from the source and the completion of the load to the target
application. In those cases you may want to take advantage of an aggregator transformation to collect the
appropriate control totals as illustrated in this screenshot:
The following straw-man example shows an Audit/Balancing Report, which is the end result of this type of process (a query that could produce such a report follows the example):
Data Area | Leg Count | TT Count | Diff | Leg Amt | TT Amt | Diff
Customer | 11000 | 10099 | 1 | 0 | |
Orders | 9827 | 9827 | 0 | 11230.21 | 11230.21 | 0
Deliveries | 1298 | 1288 | 0 | 21294.22 | 21011.21 | 283.01
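A report like this can be produced with conditional aggregation over the detail table. The sketch below assumes the convention from the earlier load example, in which legacy-side and target-side figures are stored as separate rows distinguished by the UPDATE_PROCESS suffix; the physical table name is also an assumption.

SELECT CONTROL_AREA AS DATA_AREA,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%SRC' THEN CONTROL_COUNT_1 END) AS LEG_COUNT,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%TGT' THEN CONTROL_COUNT_1 END) AS TT_COUNT,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%SRC' THEN CONTROL_COUNT_1 END)
     - SUM(CASE WHEN UPDATE_PROCESS LIKE '%TGT' THEN CONTROL_COUNT_1 END) AS COUNT_DIFF,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%SRC' THEN CONTROL_SUM_1 END)   AS LEG_AMT,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%TGT' THEN CONTROL_SUM_1 END)   AS TT_AMT,
       SUM(CASE WHEN UPDATE_PROCESS LIKE '%SRC' THEN CONTROL_SUM_1 END)
     - SUM(CASE WHEN UPDATE_PROCESS LIKE '%TGT' THEN CONTROL_SUM_1 END)   AS AMT_DIFF
FROM   AUDIT_BALANCING_DETAIL
GROUP BY CONTROL_AREA;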
In summary, there are two big challenges in building audit/balancing processes:
1. Identifying what the control totals should be
2. Building processes that will collect the correct information at the correct granularity
There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions while still reducing risk by having this type of solution in place.
Last updated: 06-Feb-07 12:37
Data Cleansing
Challenge
Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005
study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer
limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of
attention to data quality.
Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues.
Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer
relationship management. It is essential that data quality issues are tackled during any large-scale data
project to enable project success and future organizational success.
Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure
that all data entering the organizational data stores provides for consistent and reliable decision-making.
Description
A significant portion of time in the project development process should be dedicated to data quality,
including the implementation of data cleansing processes. In a production environment, data quality
reports should be generated after each data warehouse implementation or when new source systems are
integrated into the environment. There should also be provision for rolling back if data quality testing
indicates that the data is unacceptable.
Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE)
and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data
integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ
has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a
complete solution for identifying and resolving all types of data quality problems and preparing data for the
consolidation and load processes.
Concepts
Following are some key concepts in the field of data quality. These data quality concepts provide a
foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and
effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to
consolidation.
Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in
Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily
concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can
discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes.
Note: The remaining items in this document are therefore discussed in the context of IDQ usage.
Parsing - the process of extracting individual elements within the records, files, or data entry forms in
order to check the structure and content of each field and to create discrete fields devoted to specific
information types. Examples may include: name, title, company name, phone number, and SSN.
Cleansing and Standardization - refers to arranging information in a consistent manner or preferred
format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see
the Best Practice Effective Data Standardizing Techniques.
Enhancement - refers to adding useful, but optional, information to existing data, or completing partial data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.
Validation - the process of correcting data using algorithmic components and secondary reference data
sources, to check and validate information. Example: validating addresses with postal directories.
Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality
records where high-quality records of the same information exist. Use matching components and business
rules to identify records that may refer, for example, to the same customer. For more information, see the
Best Practice Effective Data Matching Techniques.
Consolidation - using the data sets defined during the matching process to combine all cleansed or
approved data into a single, consolidated view. Examples are building best record, master record, or
house-holding.
Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality
issues, including data cleansing. The suite comprises the following elements:
G IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality
functionality on a single computer (Windows only).
G IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
G IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling
PowerCenter users to embed data quality procedures defined in IDQ in their mappings.
G IDQ stores all its processes as XML in the Data Quality Repository (MySQL). IDQ Server
enables the creation and management of multiple repositories.
Using IDQ in Data Projects
IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its
own applications or to provide them for addition to PowerCenter transformations.
Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is,
Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input
components, output components, and operational components. Plans can perform analysis, parsing,
standardization, enhancement, validation, matching, and consolidation operations on the specified data.
Plans are saved into projects that can provide a structure and sequence to your data quality endeavors.
The following figure illustrates how data quality processes can function in a project setting:
In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the
business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile
and easy to use dashboards to communicate data quality metrics to all interested parties.
In stage 2, you verify the target levels of quality for the business according to the data quality
measurements taken in stage 1, and in accordance with project resourcing and scheduling.
In stage 3, you use Workbench to design the data quality plans and projects to achieve the
targets. Capturing business rules and testing the plans are also covered in this stage.
In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy
plans and resources to remote repositories and file systems through the user interface. If you are running
Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which
data cleansing and other data quality tasks are performed on the project data.
In stage 5, you'll test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.
In a large data project, you may find that data quality processes of varying sizes and impact are necessary
at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at
a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design
Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the
level of unit testing required.
Using the IDQ Integration
Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality
repository and import data quality plans to a PowerCenter transformation. With the Integration component,
you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ
Workbench or Server.
The Integration interacts with PowerCenter in two ways:
G On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plan's functional details are saved as XML in the PowerCenter repository.
G On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to
send data quality plan XML to the Data Quality engine for execution.
The Integration requires that at least the following IDQ components are available to PowerCenter:
G Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
G Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan
instructions.
An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by
Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North
American name and postal address records.
The Integration component enables the following process:
G Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality
repository.
G The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
G The PowerCenter Designer user saves the transformation and the mapping containing it to the
PowerCenter repository. The plan information is saved with the transformation as XML.
The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant
source data and plan information will be sent to the Data Quality engine, which processes the data (in
conjunction with any reference data files used by the plan) and returns the results to PowerCenter.
Last updated: 06-Feb-07 12:43
Data Profiling
Challenge
Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data
profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This
Best Practice is intended to provide new users with an introduction to its usage.
Bear in mind that Informatica's Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:
G Data Cleansing
G Using Data Explorer for Data Discovery and Analysis
Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems and enables users to measure changes
in the source data over time. This information can help to improve the quality of the source data.
An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good
overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source
level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level.
Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short
amount of time.
A custom data profile is useful when there is a specific question about a source, for example when you have a business rule that you want to validate or want to test whether data matches a particular pattern.
Setting Up the Profile Wizard
To customize the profile wizard for your preferences:
G Open the Profile Manager and choose Tools > Options.
G If you are profiling data using a database user that is not the owner of the tables to be sourced, check the Use
source owner name during profile mapping generation option.
G If you are in the analysis phase of your project, choose Always run profile interactively since most of your data-
profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data
profiles are useful in these phases.)
Running and Monitoring Profiles
Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking Configure Session on the Function-Level Operations tab of the wizard.
G Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration
parameters.
G For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow
Manager and configure and schedule them appropriately.
Generating and Viewing Profile Reports
Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.
For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer
schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client
installation.
You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can
also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.
Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling option:
Technique | Description | Usage
No sampling | Uses all source data | Relatively small data sources
Automatic random sampling | PowerCenter determines the appropriate percentage to sample, then samples random rows | Larger data sources where you want a statistically significant data analysis
Manual random sampling | PowerCenter samples random rows of the source data based on a user-specified percentage | Samples more or fewer rows than the automatic option chooses
Sample first N rows | Samples the number of user-selected rows | Provides a quick readout of a source (e.g., first 200 rows)
Profile Warehouse Administration
Updating Data Profiling Repository Statistics
The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be
sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script
that is generated and run it.
ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';
Microsoft SQL Server
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
INFORMIX
select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'
IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;' from syscat.tables where tabname like 'PMDP%'
TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and
databasename = 'database_name'
where database_name is the name of the repository database.
Purging Old Data Profiles
Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and
connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.
Last updated: 01-Feb-07 18:52
Data Quality Mapping Rules
Challenge
Use PowerCenter to create data quality mapping rules to enhance the usability of the
data in your system.
Description
The issue of poor data quality is one that frequently hinders the success of data
integration projects. It can produce inconsistent or faulty results and ruin the credibility
of the system with the business users.
This Best Practice focuses on techniques for use with PowerCenter and third-party or
add-on software. Comments that are specific to the use of PowerCenter are enclosed
in brackets.
Bear in mind that you can augment or supplant the data quality handling capabilities of
PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite
dedicated to data quality issues. Data analysis and data enhancement processes, or
plans, defined in IDQ can deliver significant data quality improvements to your project
data. A data project that has built-in data quality steps, such as those described in the
Analyze and Design phases of Velocity, enjoys a significant advantage over a project
that has not audited and resolved issues of poor data quality. If you have added these
data quality steps to your project, you are likely to avoid the issues described below.
A description of the range of IDQ capabilities is beyond the scope of this document. For
a summary of Informatica's data quality methodology, as embodied in IDQ, consult the
Best Practice Data Cleansing.
Common Questions to Consider
Data integration/warehousing projects often encounter general data problems that may
not merit a full-blown data quality project, but which nonetheless must be addressed.
This document discusses some methods to ensure a base level of data quality; much
of the content discusses specific strategies to use with PowerCenter.
The quality of data is important in all types of projects, whether it be data warehousing,
data synchronization, or data migration. Certain questions need to be considered for all
of these projects, with the answers driven by the project's requirements and the business users being serviced. Ideally, these questions should be addressed
during the Design and Analyze Phases of the project because they can require a
significant amount of re-coding if identified later.
Some of the areas to consider are:
Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users
want to see data in its raw format without any capitalization, trimming, or formatting
applied to it. This is easily achievable as it is the default behavior, but there is danger in
taking this requirement literally since it can lead to duplicate records when some of
these fields are used to identify uniqueness and the system is combining data from
various source systems.
One solution to this issue is to create additional fields that act as a unique key to a
given table, but which are formatted in a standard way. Since the raw data is stored in
the table, users can still see it in this format, but the additional columns mitigate the risk
of duplication.
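A minimal sketch of this approach follows, with hypothetical table and column names. The raw value is kept for display, while a standardized companion column is populated and used for uniqueness and duplicate checks; in a PowerCenter mapping the same logic would typically live in an Expression transformation using UPPER and LTRIM/RTRIM.

-- Add a standardized companion column alongside the raw value
ALTER TABLE CUSTOMER ADD (CUST_NAME_STD VARCHAR2(100));

UPDATE CUSTOMER
SET    CUST_NAME_STD = UPPER(TRIM(CUST_NAME));

-- Duplicate detection is then performed on the standardized column, not the raw one
SELECT CUST_NAME_STD, COUNT(*)
FROM   CUSTOMER
GROUP BY CUST_NAME_STD
HAVING COUNT(*) > 1;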
Another possibility is to explain to the users that raw data in unique, identifying fields
is not as clean and consistent as data in a common format. In other words, push back
on this requirement.
This issue can be particularly troublesome in data migration projects where matching
the source data is a high priority. Failing to trim leading/trailing spaces from data can
often lead to mismatched results since the spaces are stored as part of the data value.
The project team must understand how spaces are handled from the source systems to
determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which
then stores the data with trailing blanks. These blanks need to be trimmed before
matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
Note that many fixed-width files pad fields with spaces rather than nulls. Therefore, developers must enter a single space beside the text radio button, and also tell the product that the space character repeats to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. (In PowerCenter, avoid embedding database text-manipulation functions in lookup transformations: the SQL override forces the developer to cache the lookup table, and on very large tables caching is not always realistic or feasible.)
Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular
data value.
[In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is
performed, and 15 digits are carried forward, even when they are not needed or
desired. PowerCenter can handle some conversions without function calls (these are
detailed in the product documentation), but this may cause subsequent support or
maintenance headaches.]
Dates
Dates can cause many problems when moving and transforming data from one place
to another because an assumption must be made that all data values are in a
designated format.
[Informatica recommends first checking a piece of data to ensure it is in the proper
format before trying to convert it to a Date data type. If the check is not performed first,
then a developer increases the risk of transformation errors, which can cause data to
be lost].
An example piece of code would be: IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)
If the majority of the dates coming from a source system arrive in the same format, then
it is often wise to create a reusable expression that handles dates, so that the proper
checks are made. It is also advisable to determine if any default dates should be
defined, such as a low date or high date. These should then be used throughout the
system for consistency. However, do not fall into the trap of always using default dates
as some are meant to be NULL until the appropriate time (e.g., birth date or death date).
The NULL in the example above could be changed to one of the standard default dates
described here.
Decimal Precision
With numeric data columns, developers must determine the expected or required
precisions of the columns. (By default, to increase performance, PowerCenter treats all
numeric columns as 15 digit floating point decimals, regardless of how they are defined
in the transformations. The maximum numeric precision in PowerCenter is 28 digits.)
If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be
aware that enabling this option can slow performance by as much as 15 percent. The
Enable Decimal Arithmetic option must be enabled when comparing two numbers for
equality.
Techniques for Trapping Poor Data Quality
The most important technique for ensuring good data quality is to prevent incorrect,
inconsistent, or incomplete data from ever reaching the target system. This goal may
be difficult to achieve in a data synchronization or data migration project, but it is very
relevant when discussing data warehousing or ODS. This section discusses techniques
that you can use to prevent bad data from reaching the system.
Checking Data for Completeness Before Loading
When requesting a data feed from an upstream system, be sure to request an audit file
or report that contains a summary of what to expect within the feed. Common requests
here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available.
Assuming that the metrics can be obtained from the source system, it is advisable to
then create a pre-process step that ensures your input source matches the audit file. If
the values do not match, stop the overall process from loading into your target system.
The source system can then be alerted to verify where the problem exists in its feed.
Enforcing Rules During Mapping
Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and, where they fail, routed to an Error (or Bad) table for further re-processing. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, the record is flagged as an Error record and written to the Error table.
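As a hedged sketch of the country-code example (table and column names are assumptions; in PowerCenter this is typically a Lookup transformation followed by a Router that sends failures to the error target):

-- Route records whose country code is not in the valid values table to an error table
INSERT INTO CUSTOMER_ERROR
      (CUSTOMER_ID, ERROR_COLUMN, ERROR_VALUE, ERROR_DESC, LOAD_TIMESTAMP)
SELECT s.CUSTOMER_ID,
       'COUNTRY_CODE',
       s.COUNTRY_CODE,
       'Country code not found in valid values table',
       SYSTIMESTAMP
FROM   STG_CUSTOMER s
WHERE  NOT EXISTS (SELECT 1
                   FROM   VALID_COUNTRY_CODES v
                   WHERE  v.COUNTRY_CODE = s.COUNTRY_CODE);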
A pitfall of this method is that you must determine what happens to the record once it
has been loaded to the Error table. If the record is pushed back to the source system to
be fixed, then a delay may occur until the record can be successfully loaded to the
target system. In fact, if the proper governance is not in place, the source system may
refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the
data manually and risk not matching with the source system; or 2) relax the business
rule to allow the record to be loaded.
Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person's responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as to help make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences of data decisions for the user community.
Another solution applicable in cases with a small number of code values is to try to
anticipate any mistyped error codes and translate them back to the correct codes. The
cross-reference translation data can be accumulated over time. Each time an error is
corrected, both the incorrect and correct values should be put into the table and used to
correct future errors automatically.
Dimension Not Found While Loading Fact
The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle it so that data is not improperly exposed to, or hidden from, the users.
One solution is to continue to load the data to the fact table, but assign the foreign key
a value that represents Not Found or Not Available in the dimension. These keys must
also exist in the dimension tables to satisfy referential integrity, but they provide a clear
and easy way to identify records that may need to be reprocessed at a later date.
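A minimal sketch of this first solution follows, assuming a reserved surrogate key of -1 for the 'Not Found' dimension row; all object names are illustrative, and in a mapping this is usually just the default value returned by the dimension Lookup when no match is found.

-- Reserved row that must exist in the dimension to satisfy referential integrity
INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME)
VALUES (-1, 'N/A', 'Not Found');

-- The fact load falls back to the reserved key when the dimension row is missing
INSERT INTO FACT_SALES (PRODUCT_KEY, SALE_DATE, SALE_AMT)
SELECT NVL(d.PRODUCT_KEY, -1),   -- Oracle syntax; use COALESCE elsewhere
       s.SALE_DATE,
       s.SALE_AMT
FROM   STG_SALES s
LEFT OUTER JOIN DIM_PRODUCT d
       ON d.PRODUCT_CODE = s.PRODUCT_CODE;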
Another solution is to filter the record from processing since it may no longer be
relevant to the fact table. The team will most likely want to flag the row through the use
of either error tables or process codes so that it can be reprocessed at a later time.
A third solution is to use dynamic caches and load the dimensions when a record is not
found there, even while loading the fact table. This should be done very carefully since
it may add unwanted or junk values to the dimension table. One occasion when this
may be advisable is in cases where dimensions are simply made up of the distinct
combination values in a data set. Thus, this dimension may require a new record if a
new combination occurs.
It is imperative that all of these solutions be discussed with the users before making
any decisions since they will eventually be the ones making decisions based on the
reports.
Last updated: 01-Feb-07 18:52
Data Quality Project Estimation and Scheduling Factors
Challenge
This Best Practice is intended to assist project managers who must estimate the time and resources
necessary to address data quality issues within data integration or other data-dependent projects.
Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage
to your data project. However, it also examines the factors that determine when, or whether, you need to
build a larger data quality element into your project.
Description
At a high level, there are three ways to add data quality to your project:
G Add a discrete and self-contained data quality stage, such as that enabled by using pre-built
Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse
and Match.
G Add an expanded but finite set of data quality actions to the project, for example in cases where
pre-built plans do not fit the project parameters.
G Incorporate data quality actions throughout the project.
This document should help you decide which of these methods best suits your project and assist in
estimating the time and resources needed for the first and second methods.
Using Pre-Built Plans with Informatica Data Cleanse and Match
Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add
data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the
following components:
G Data Quality Workbench, a user-interface application for building and executing data quality
processes, or plans.
G Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and
IDQ.
G At least one set of reference data files that can be read by data quality plans to validate and enrich
certain types of project data. For example, Data Cleanse and Match can be used with the North
America Content Pack, which includes pre-built data quality plans and complete address
reference datasets for the United States and Canada.
Data Quality Engagement Scenarios
Data Cleanse and Match delivers its data quality capabilities out of the box; a PowerCenter user can
select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In
this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and
discrete stage.
In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans
or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the
data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.
The Project Manager may decide to implement a more thorough approach to data quality and integrate
data quality actions throughout the project plan. In many cases, a convincing case can be made for
enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and
subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not
realize the extent to which their business and project goals depend on the quality of their data.
The project impact of these three types of data quality activity can be summarized as follows:
DQ Approach | Estimated Project Impact
Simple stage | 10 days, 1-2 Data Quality Developers
Expanded data quality stage | 15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project | Duration of data project, 2 or more project roles, impact on business and project objectives
Note: The actual time that should be allotted to the data quality stages noted above depends on the factors
discussed in the remainder of this document.
Factors Influencing Project Estimation
The factors influencing project estimation for a data quality stage range from high-level project parameters
to lower-level data characteristics. The main factors are listed below and explained in detail later in this
document.
G Base and target levels of data quality
G Overall project duration/budget
G Overlap of sources/complexity of data joins
G Quantity of data sources
G Matching requirements
G Data volumes
G Complexity and quantity of data rules
G Geography
Determine which scenario (out-of-the-box Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project's overall objectives and its mix of factors.
The Simple Data Quality Stage
Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple
scenario with a predictable number of function points that can be added to the project plan as a single
package.
You can add the North America Content Pack plans to your project if the project meets most of the
following criteria. Similar metrics apply to other types of pre-built plans:
G Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
G Complexity of data rules is relatively low.
G Business rules present in pre-built plans need minimum fine-tuning.
G Target data quality level is achievable (i.e., <100 percent).
G Quantity of data sources is relatively low.
G Overlap of data sources/complexity of database table joins is relatively low.
G Matching requirements and targets are straightforward.
G Overall project duration is relatively short.
G The project relates to a single country.
Note that the source data quality level is not a major concern.
Implementing the Simple Data Quality Stage
The out-of-the-box scenario is designed to deliver significant increases in data quality in those areas for
which the plans were designed (i.e., North American name and address data) in a short time frame. As
indicated above, it does not anticipate major changes to the underlying data quality plans. It involves the
following three steps:
1. Run pre-built plans.
2. Review plan results.
3. Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.
While every project is different, a single iteration of the simple model may take approximately five days, as
indicated below:
G Run pre-built plans (2 days)
G Review plan results (1 day)
G Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)
Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week simple data quality stage.
Step - Simple Stage | Days, week 1 | Days, week 2
Run pre-built plans | 2 |
Review plan results; fine-tune pre-built plans if necessary | 1 |
Re-run pre-built plans | 2 |
Review plan results with stakeholders; add plans to PowerCenter transformations and define mappings | | 2
Run PowerCenter workflows | | 1
Review results/obtain approval from stakeholders | | 1
Approve and pass all files to the next project stage | | 1
Expanding the Simple Data Quality Stage
Although the simple scenario above treats the data quality components as a black box, it does allow for modifications to the data quality plans. The types of plan tuning that developers can undertake in
this time frame include changing the reference dictionaries used by the plans, editing these dictionaries,
and re-selecting the data fields used by the plans as keys to identify data matches. The above time frame
does not guarantee that a developer can build or re-build a plan from scratch.
The gap between base and target levels of data quality is an important area to consider when expanding
the data quality stage. The Developer and Project Manager may decide to add a data analysis step in this
stage, or even decide to split these activities across the project plan by conducting a data quality audit early
in the project, so that issues can be revealed to the business in advance of the formal data quality stage.
The schedule should allow for sufficient time for testing the data quality plans and for contact with the
business managers in order to define data quality expectations and targets.
In addition:
G If a data quality audit is added early in the project, the data quality stage grows into a project-
length endeavor.
G If the data quality audit is included in the discrete data quality stage, the expanded, three-week
Data Quality stage may look like this:
Step - Enhanced DQ Stage | Days, week 1 | Days, week 2 | Days, week 3
Set up and run data analysis plans; review plan results | 1-2 | |
Conduct advance tuning of pre-built plans; run pre-built plans | 2 | |
Review plan results with stakeholders | 1 | |
Modify pre-built plans or build new plans from scratch | | 2 |
Re-run the plans | | 2 |
Review plan results/obtain approval from stakeholders | | 1 |
Add approved plans to PowerCenter transformations, define mappings | | | 2
Run PowerCenter workflows | | | 1
Review results/obtain approval from stakeholders | | | 1
Approve and pass all files to the next project stage | | | 1
Sizing Your Data Quality Initiatives
The following section describes the factors that affect the estimated time that the data quality endeavors
may add to a project. Estimating the specific impact that a single factor is likely to have on a project plan is
difficult, as a single data factor rarely exists in isolation from others. If one or two of these factors apply to
your data, you may be able to treat them within the scope of a discrete DQ stage. If several factors apply,
you are moving into a complex scenario and must design your project plan accordingly.
Base and Target Levels of Data Quality
The rigor of your data quality stage depends in large part on the current (i.e., base) levels of data quality
in your dataset and the target levels that you want to achieve. As part of your data project, you should run a
set of data analysis plans and determine the strengths and weaknesses of the proposed project data. If
your data is already of a high quality relative to project and business goals, then your data quality stage is
likely to be a short one!
If possible, you should conduct this analysis at an early stage in the data project (i.e., well in advance of the
data quality stage). Depending on your overall project parameters, you may have already scoped a Data
Quality Audit into your project. However, if your overall project is short in duration, you may have to tailor
your data quality analysis actions to the time available.
Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.)
If the aggregated data quality percentage for your project's source data is greater than 60 percent, and
your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of
effectiveness for Data Cleanse and Match.
Note: You can assess data quality according to at least six criteria. Your business may need to improve
data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.
Overall Project Duration/Budget
A data project with a short duration may not have the means to accommodate a complex data quality
stage, regardless of the potential or need to enhance the quality of the data involved. In such a case, you
may have to incorporate a finite data quality stage.
Conversely, a data project with a long time line may have scope for a larger data quality initiative. In large
data projects with major business and IT targets, good data quality may be a significant issue. For
example, poor data quality can affect the ability to cleanly and quickly load data into target systems. Major
data projects typically have a genuine need for high-quality data if they are to avoid unforeseen problems.
Action: Evaluate the project schedule parameters and expectations put forward by the business and
evaluate how data quality fits into these parameters.
You must also determine if there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community. If not, they should be raised with management. Bear in mind that data quality is not simply concerned with the accuracy of data values; it can encompass the project metadata as well.
Overlap of Sources/Complexity of Data Joins
When data sources overlap, data quality issues can be spread across several sources. The relationships
among the variables within the sources can be complex, difficult to join together, and difficult to resolve, all
adding to project time.
If the joins between the data are simple, then this task may be straightforward. However, if the data joins
use complex keys or exist over many hierarchies, then the data modeling stage can be time-consuming,
and the process of resolving the indices may be prolonged.
Action: You can tackle complexity in data sources and in required database joins within a data quality
stage, but in doing so, you step outside the scope of the simple data quality stage.
Quantity of Data Sources
This issue is similar to that of data source overlap and complexity (above). The greater the quantity of
sources, the greater the opportunity for data quality issues to arise. The number of data sources has a
particular impact on the time required to set up the data quality solution. (The source data setup in
PowerCenter can facilitate the data setup in the data quality stage.)
Action: You may find that the number of data sources correlates with the number of data sites covered by
the project. If your project includes data from multiple geographies, you step outside the scope of a simple
data quality stage.
Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching
plans are often coupled to a type of data standardization plan (i.e., grouping plan) that prepares the data for
match analysis.
Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows quadratically with the volume of data records passed through the plan, because records are compared in pairs. (Specifically, the time taken is proportional to the size and number of data groups created in the grouping plans.)
Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your
matching plans may take to run.
Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows quadratically with the volume of data records passed through it. In other types of plans, this quadratic relationship does not exist. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute.
Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5
million records is considered larger than average. If your dataset is measurable in millions of records, and
high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching
Techniques.
Complexity and Quantity of Data Rules
This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans (as may be the case if data quality target levels are very high or relate to precise data objectives), then the project is de facto moving beyond Data Cleanse and Match capability, and you need to add rule-creation and rule-review elements to the data quality effort.
Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, in addition to being written and added to the data quality plans, the rules must be tested and approved by the business.
Geography
Geography affects the project plan in two ways:
G First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.
G Secondly, project data that is sourced from several countries typically means multiple data
sources, with opportunities for data quality issues to arise that may be specific to the country or the
division of the organization providing the data source.
There is also a high correlation between the scale of the data project and the scale of the enterprise in
which the project will take place. For multi-national corporations, there is rarely such a thing as a small data
project!
Action: Consider the geographical spread of your source data. If the data sites are spread across several
time zones or countries, you may need to factor in time lags to your data quality planning.
Last updated: 01-Feb-07 18:52
Effective Data Matching Techniques
Challenge
Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource management
initiatives, and is an increasingly important driver of cost-efficient compliance with regulatory initiatives such as KYC (Know
Your Customer).
Once duplicate records are identified, you can remove them from your dataset, and can better recognize key relationships
among data records (such as customer records from a common household). You can also match records or values against
reference data to ensure data accuracy and validity.
This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching approach. It
has two high-level objectives:
G To identify the key performance variables that affect the design and execution of IDQ matching plans.
G To describe plan design and plan execution actions that will optimize plan performance and results.
To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.
Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or
prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or
product ID fields) that, if present, would allow clear joins between the datasets and improve business knowledge.
Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of
customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to
the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy
users of a product or service.
Data can be enriched by matching across production data and reference data sources. Business intelligence operations
can be improved by identifying links between two or more systems to provide a more complete picture of how customers
interact with a business.
IDQ's matching capabilities can help to resolve dataset duplications and deliver business results. However, a user's ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document.
An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall
quality of the matches. The following table outlines the processes in each step.
Step | Description
Profiling | Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
Standardization | Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
Grouping | A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
Matching | The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
Consolidation | The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.
The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the
matches identified. They also outline the best practices that ensure that each matching plan is implemented with the
highest probability of success. (This document does not make any recommendations on profiling, standardization or
consolidation strategies. Its focus is grouping and matching.)
The following table identifies the key variables that affect matching plan performance and the quality of matches identified.
Factor | Impact | Impact summary
Group size | Plan performance | The number and size of groups have a significant impact on plan execution speed.
Group keys | Quality of matches | The proper selection of group keys ensures that the maximum number of possible matches are identified in the plan.
Hardware resources | Plan performance | Processors, disk performance, and memory require consideration.
Size of dataset(s) | Plan performance | This is not a high-priority issue. However, it should be considered when designing the plan.
Informatica Data Quality components | Plan performance | The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
Time window and frequency of execution | Plan performance | The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.
Match identification | Quality of matches | The plan designer must weigh deterministic versus probabilistic approaches.
Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed
in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan
compares the records within each group with one another. When grouping is implemented properly, plan execution speed
is increased significantly, with no meaningful effect on match quality.
The most important determinant of plan execution speed is the size of the groups to be processed, that is, the number of data records in each group.
For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If 9,999 of
these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on
this one large group, the matching plan would require 87 days to complete, processing 1,000,000 comparisons a minute!
In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly
distributed.
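The arithmetic behind this example can be sketched in a few lines of Python (not an IDQ artifact); the throughput of 1,000,000 comparisons per minute is the nominal rate quoted below, and the group sizes mirror the example.

```python
# Estimate matching workload from a grouping strategy.
# Assumes a nominal throughput of 1,000,000 comparisons per minute;
# group sizes are illustrative, not taken from any real dataset.

def pairwise_comparisons(group_size: int) -> int:
    """Record-to-record comparisons within one group: n * (n - 1) / 2."""
    return group_size * (group_size - 1) // 2

def estimated_minutes(group_sizes, comparisons_per_minute=1_000_000) -> float:
    total = sum(pairwise_comparisons(n) for n in group_sizes)
    return total / comparisons_per_minute

# 9,999 well-balanced groups of roughly 50 records each ...
balanced = [50] * 9_999
# ... versus one residual group holding the remaining 500,050 records.
skewed = [500_050]

print(f"Balanced groups: ~{estimated_minutes(balanced):.0f} minutes")
print(f"Single large group: ~{estimated_minutes(skewed) / (60 * 24):.0f} days")
```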
Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups perform
more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups: as groups
get smaller, fewer comparisons are possible, and the potential for missing good matches is increased. Therefore, groups
must be defined intelligently through the use of group keys.
Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations.
Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The
selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one
another.
When selecting a group key, two main criteria apply:
G Candidate group keys should represent a logical separation of the data into distinct units where there is a low
probability that matches exist between records in different units. This can be determined by profiling the data and
uncovering the structure and quality of the content prior to grouping.
G Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior to grouping.
For example, geography is a logical separation criterion when comparing name and address data. A record for a person
living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group
key. However, if you are working with national data (e.g. Swiss data), duplicate data may exist for an individual living in
Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name,
records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names
are standardized.
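As an illustration of why such standardization matters for grouping, the following Python sketch (the records and the city-variant mapping are invented, and this is not IDQ functionality) shows raw city values splitting likely duplicates across groups, while standardized values keep them together.

```python
# Illustrative only: un-standardized group keys split true duplicates into
# different groups. The city-variant mapping mimics a reference dictionary.
from collections import defaultdict

CITY_STANDARD = {"geneva": "Geneva", "genf": "Geneva", "geneve": "Geneva"}

records = [
    {"name": "A. Muller", "city": "Geneva"},
    {"name": "A. Mueller", "city": "Genf"},
    {"name": "B. Rochat", "city": "Lausanne"},
]

def group_key(record, standardize=True):
    city = record["city"]
    return CITY_STANDARD.get(city.lower(), city) if standardize else city

for standardize in (False, True):
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec, standardize)].append(rec["name"])
    print("standardized" if standardize else "raw", dict(groups))
# Raw keys put the two Muller records in different groups, so they are never
# compared; standardized keys place them in the same group.
```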
Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition
of the groups within the plan. However, in general terms, the larger the dataset, the more time is required to produce a matching plan, both in terms of preparing the data and executing the plan.
IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However,
there are performance implications for certain component types, combinations of components, and the quantity of
components used in the plan.
Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational
components. In tests comparing file-based matching against database matching, file-based matching outperformed
database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans
that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher
component performed more slowly than plans without a Mixed Field Matcher.
Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different
components serve different needs and may offer advantages in a given scenario.
Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the completion of a matching plan can have a significant impact on whether the plan is perceived to be performing correctly.
Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping strategy,
and the IDQ components to employ.
Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned
to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the execution time will
have to be considered.
Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for
assessing matches are:
G deterministic matching
G probabilistic matching
Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ's fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check if the last
name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is
found, it then checks the first name. If a 90 percent match is found on the first name, then the entire record is
considered successfully matched.
The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and
(2) it is similar to the methods employed when manually checking for matches. The disadvantages of this method are its
rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several
different rule checks to cover all likely combinations.
Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to
calculate a weighted average that indicates the degree of similarity between two pieces of information.
The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on
certain data elements matching in order for a full match to be found. Weights assigned to individual components can place
emphasis on different fields or areas in a record. However, even if a heavily-weighted score falls below a defined
threshold, match scores from less heavily-weighted components may still produce a match.
The disadvantages of this method are a higher degree of required tweaking on the user's part to get the right balance of weights in order to optimize successful matches. This can be difficult for users to understand and communicate to one another.
Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching plan with
95 to 100 percent success may have found all good matches, but matching plan success between 90 and 94 percent may
map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to only 65 percent
genuine matches, and so on. The following table illustrates this principle.
Close analysis of the match results is required because of the relationship between match quality and the match threshold scores assigned, since there may not be a one-to-one mapping between the plan's weighted score and the number of records that can be considered genuine matches.
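The contrast between the two approaches can be sketched as follows. The field scores, thresholds, and weights are illustrative assumptions rather than IDQ defaults; the deterministic branch mirrors the 85/80/90 percent checks described above, and the probabilistic branch reduces the same scores to one weighted average.

```python
# Field-level similarity scores (0.0 to 1.0) as fuzzy matching components might
# produce them. Thresholds and weights are illustrative, not IDQ defaults.
scores = {"last_name": 0.88, "address": 0.82, "first_name": 0.93}

def deterministic_match(s: dict) -> bool:
    # Each check depends on the previous one passing (85 / 80 / 90 percent rules).
    return s["last_name"] > 0.85 and s["address"] > 0.80 and s["first_name"] > 0.90

WEIGHTS = {"last_name": 0.5, "address": 0.3, "first_name": 0.2}

def probabilistic_match(s: dict, threshold: float = 0.85):
    # Single weighted average compared against one overall threshold.
    weighted = sum(s[field] * weight for field, weight in WEIGHTS.items())
    return weighted >= threshold, round(weighted, 3)

print("Deterministic:", deterministic_match(scores))   # True
print("Probabilistic:", probabilistic_match(scores))   # (True, 0.872)
```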
Best Practice Operations
The following section outlines best practices for matching with IDQ.
Capturing Client Requirements
Capturing client requirements is key to understanding how successful and relevant your matching plans are likely to be. As
a best practice, be sure to answer the following questions, as a minimum, before designing and implementing a matching
plan:
G How large is the dataset to be matched?
G How often will the matching plans be executed?
G When will the match process need to be completed?
G Are there any other dependent processes?
G What are the rules for determining a match?
G What process is required to sign-off on the quality of match results?
G What processes exist for merging records?
Test Results
Performance tests demonstrate the following:
G IDQ has near-linear scalability in a multi-processor environment.
G Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will
eventually level off.
Performance is the key to success in high-volume matching solutions. IDQ's architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQ's ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution.
Managing Group Sizes
As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons is captured. Keep the following parameters in mind when designing a grouping plan.
Condition | Best practice | Exceptions
Maximum group size | 5,000 records; minimize the number of groups containing more than 5,000 records | Large datasets over 2M records with uniform data
Minimum number of single-record groups | 1,000 groups per one million record dataset |
Optimum number of comparisons | 500,000,000 comparisons per 1 million records | +/- 20 percent
In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best
practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these
requirements as far as is practicable.
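Before matching begins, a grouping plan's output can be checked against the guideline figures above. The following Python sketch is illustrative only: the group-size distribution is invented, and the thresholds are simply those quoted in the table.

```python
# Audit a grouping strategy against the guideline figures quoted above.
# `group_sizes` would come from profiling the output of the grouping plan.

def audit_groups(group_sizes, total_records):
    comparisons = sum(n * (n - 1) // 2 for n in group_sizes)
    # Guideline: roughly 500,000,000 comparisons per 1 million records, +/- 20 percent.
    target = 500_000_000 * (total_records / 1_000_000)
    return {
        "groups_over_5000_records": sum(1 for n in group_sizes if n > 5_000),
        "single_record_groups": sum(1 for n in group_sizes if n == 1),
        "total_comparisons": comparisons,
        "within_comparison_guideline": 0.8 * target <= comparisons <= 1.2 * target,
    }

# Illustrative distribution for a dataset of roughly one million records.
sizes = [1_000] * 990 + [1] * 800 + [6_200] * 2
print(audit_groups(sizes, total_records=sum(sizes)))
```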
Group Key Identification
Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about to be
matched has been profiled and standardized to identify candidate keys.
Group keys act as a first pass or high-level summary of the shape of the dataset(s). Remember that only data records
within a given group are compared with one another. Therefore, it is vital to select group keys that have high data quality
scores for completeness, conformity, consistency, and accuracy.
Group key selection depends on the type of data in the dataset, for example whether it contains name and address data or
other data types such as product codes.
Hardware Specifications
Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the
effect of hardware on a matching plan: processor speed, disk performance, and memory.
The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million
comparisons per minute, the speed can range from as low as 250,000 comparisons to 6.5 million comparisons per minute,
depending on the hardware specification, background processes running, and components used. As a best practice,
higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.
Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes
data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data
can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be
temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored
in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million
records and a minimum of 512MB of memory should be available.
The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications
for UNIX-based systems vary.
Match volumes | Suggested hardware specification
< 1,500,000 records | 1.5 GHz computer, 512MB RAM
1,500,000 to 3 million records | Multi-processor server, 1GB RAM
> 3 million records | Multi-processor server, 2GB RAM, RAID 5 hard disk
Single Processor vs. Multi-Processor
With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or
database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however,
that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across
four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching
may not significantly improve performance in every case.
The following table can help you to estimate the execution time between a single and multi-processor match plan.
Plan Type | Single Processor | Multiprocessor
Standardization/grouping | Depends on operations and size of data set (time equals Y) | Single-processor time plus 20 percent (time equals Y * 1.20)
Matching | Estimated 1 million comparisons a minute (time equals X) | Single-processor matching time divided by the number of processors (NP), plus 25 percent (time equals (X / NP) * 1.25)
For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 20 minutes to group and standardize and two and a half hours to match. The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus roughly three hours and 50 minutes for the quad-processor plan).
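The rules of thumb in the table can be expressed as a small estimator. The sketch below assumes the 20 percent grouping overhead and 25 percent matching overhead quoted above; actual times will vary with hardware and data.

```python
# Rough runtime estimator based on the rules of thumb in the table above.
# y_hours = single-processor standardization/grouping time,
# x_hours = single-processor matching time, processors = NP.
# Figures are estimates, not measurements.

def estimate_hours(y_hours: float, x_hours: float, processors: int) -> float:
    if processors <= 1:
        return y_hours + x_hours
    grouping = y_hours * 1.20                  # overhead of splitting into subgroups
    matching = (x_hours / processors) * 1.25   # overhead of consolidating match output
    return grouping + matching

print(estimate_hours(1.0, 8.0, processors=1))  # 9.0 hours
print(estimate_hours(1.0, 8.0, processors=4))  # 3.7 hours, close to the example above
```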
Deterministic vs. Probabilistic Comparisons
No best-practice research has yet been completed on which type of comparison is most effective at determining a match.
Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic
comparisons since they remove the burden of identifying a universal match threshold from the user.
Bear in mind that IDQ supports deterministic matching operations only. However, IDQ's Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.
Database vs. File-Based Matching
File-based matching and database matching perform essentially the same operations. The major differences between
the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete.
With regard to selecting one method or the other, there are no best-practice recommendations, since the choice is largely defined by requirements.
The following table outlines the strengths and weaknesses of each method:
Criterion | File-Based Method | Database Method
Ease of implementation | Easy to implement | Requires SQL knowledge
Performance | Fastest method | Slower than file-based method
Space utilization | Requires more hard-disk space | Lower hard-disk space requirement
Operating system restrictions | Possible limit to the number of groups that can be created | None
Ability to control/manipulate output | Low | High
High-Volume Data Matching Techniques
This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.
Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In
order to detect matching information, a record must be compared against every other record in a dataset. For a single data
source, the quantity of comparisons required to check an entire dataset increases quadratically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is the product of the volumes of data in each dataset.
When the volume of data increases into the tens of millions, the number of comparisons required to identify matches (and, consequently, the amount of time required to check for matches) reaches impractical levels.
Approaches to High-Volume Matching
Two key factors control the time it takes to match a dataset:
G The number of comparisons required to check the data.
G The number of comparisons that can be performed per minute.
The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct
elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of
the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy.
IDQ affects the number of comparisons per minute in two ways:
G Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
G IDQ's architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ's scalability with regard to high-volume matching problems.
The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.
Multi-Processor Matching: Solution Overview
IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a
multi-processor environment, the plan designer must develop multiple plans for execution in parallel.
To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability
comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan
being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are
executed in parallel.
The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.
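IDQ leaves the subgrouping step to the plan designer. One possible approach, sketched below as a hypothetical helper rather than product functionality, is to assign whole groups to the parallel match plans so that each processor receives a roughly equal share of the comparison workload.

```python
# Assign groups to N parallel matching plans so the comparison workload is balanced.
# Greedy longest-processing-time heuristic; group IDs and sizes are illustrative.
import heapq

def assign_subgroups(group_sizes: dict, processors: int):
    # Min-heap of (current comparison load, processor index).
    heap = [(0, p) for p in range(processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(processors)}
    # Largest groups first, each going to the least-loaded processor.
    for group_id, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(group_id)
        comparisons = size * (size - 1) // 2
        heapq.heappush(heap, (load + comparisons, p))
    return assignment

groups = {"CH-GE": 4200, "CH-ZH": 3900, "CH-BE": 2500, "CH-VD": 2400, "CH-TI": 900}
print(assign_subgroups(groups, processors=2))
```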
Informatica Corporation Match Plan Tests
Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003
(Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively
provided four CPUs on which to run the tests.
Several tests were performed using file-based and database-based matching methods and single and multiple processor
methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of
comparisons to approximately 500,000,000.
Test results using file-based and database-based methods showed near-linear scalability as the number of available
processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the
processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional
processor capacity. This is demonstrated in the graph below.
Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.
Last updated: 07-Feb-07 17:24
Effective Data Standardizing Techniques
Challenge
To enable users to streamline their data cleansing and standardization processes (or plans) with
Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a
consistent and methodological approach to cleansing and standardizing project data.
Description
Data cleansing refers to operations that remove non-relevant information and noise from the
content of the data. Examples of cleansing operations include the removal of person names, care-of information, excess character spaces, or punctuation from postal addresses.
Data standardization refers to operations related to modifying the appearance of the data so that it takes on a more uniform structure, and to enriching the data by deriving additional details from existing content.
Cleansing and Standardization Operations
Data can be transformed into a standard format appropriate for its business type. This is typically
performed on complex data types such as name and address or product data. A data
standardization operation typically profiles data by type (e.g., word, number, code) and parses
data strings into discrete components. This reveals the content of the elements within the data as
well as standardizing the data itself.
For best results, the Data Quality Developer should carry out these steps in consultation with a
member of the business. Often, this individual is the data steward, the person who best
understands the nature of the data within the business scenario.
G Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the
correct fields. However, when using the Profile Standardizer, be aware that there is a
finite number of profiles (500) that can be contained within a cleansing plan. Users can
extend the number of profiles by using the first 500 profiles within one component and
then feeding the data overflow into a second Profile Standardizer via the Token Parser
component.
After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to
further standardize the data. It may take several iterations of dictionary construction and review
before the data is standardized to an acceptable level. Once acceptable standardization has been
achieved, data quality scorecard or dashboard reporting can be introduced. For information on
dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User
Guide.
Discovering Business Rules
At this point, the business user may discover and define business rules applicable to the data.
These rules should be documented and converted to logic that can be contained within a data
quality plan. When building a data quality plan, be sure to group related business rules together in
a single rules component whenever possible; otherwise the plan may become very difficult to read.
If there are rules that do not lend themselves easily to regular IDQ components (e.g., when standardizing product data information), it may be necessary to perform some custom scripting using IDQ's Scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.
Standard and Third-Party Reference Data
Reference data can be a useful tool when standardizing data. Terms with variant formats or
spellings can be standardized to a single form. IDQ installs with several reference dictionary files
that cover common name and address and business terms. The illustration below shows part of a
dictionary of street address suffixes.
Common Issues when Cleansing and Standardizing Data
If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize
the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure
that the customer develops reasonable expectations of what can be achieved with the data set
within an agreed-upon timeframe.
Standardizing Ambiguous Data
Data values can often appear ambiguous, particularly in name and address data where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important. "ST" can be a suffix for "Street" or a prefix for "Saint", and sometimes both can occur in the same string.
The address string "St Patrick's Church, Main St" can reasonably be interpreted as "Saint Patrick's Church, Main Street". In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array. "St" in position 1 within the string would be standardized to meaning_1, whereas "St" in position 5 would be standardized to meaning_2. Each data value can then be compared to a discrete prefix and suffix dictionary.
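Expressed as pseudo-code, the position-based rule might look like the following sketch; Python stands in here for the IDQ Scripting component's own syntax, and the prefix and suffix mappings are illustrative.

```python
# Position-sensitive standardization of an ambiguous token, treating the address
# string as an array of tokens. The dictionaries below are illustrative.

PREFIX_MEANINGS = {"st": "Saint"}    # meaning_1: token at the start of the string
SUFFIX_MEANINGS = {"st": "Street"}   # meaning_2: token at the end of the string

def standardize(address: str) -> str:
    tokens = address.replace(",", " ").replace(".", " ").split()
    out = []
    for position, token in enumerate(tokens):
        key = token.lower()
        if position == 0 and key in PREFIX_MEANINGS:
            out.append(PREFIX_MEANINGS[key])
        elif position == len(tokens) - 1 and key in SUFFIX_MEANINGS:
            out.append(SUFFIX_MEANINGS[key])
        else:
            out.append(token)
    return " ".join(out)

print(standardize("St Patrick's Church, Main St"))
# -> Saint Patrick's Church Main Street
```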
Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help
an organization to recognize the value of incorporating IDQ into their development
methodology. Because data quality is an iterative process, the business rules initially developed
may require ongoing modification, as the results produced by IDQ will be affected by the starting
condition of the data and the requirements of the business users.
When data arrives in multiple languages, it is worth creating similar IDQ plans for each country
and applying the same rules across these plans. The data would typically be staged in a database,
and the plans developed using a SQL statement as input, with a WHERE country_code = 'DE' clause, for example. Country dictionaries are identifiable by country code to facilitate such
statements. Remember that IDQ installs with a large set of reference dictionaries and additional
dictionaries are available from Informatica.
IDQ provides several components that focus on verifying and correcting the accuracy of name and
postal address data. These components leverage address reference data that originates from
national postal carriers such as the United States Postal Service. Such datasets enable IDQ to
validate an address to premise level. Please note, the reference datasets are licensed and
installed as discrete Informatica products, and thus it is important to discuss their inclusion in the
project with the business in advance so as to avoid budget and installation issues. Several types of
reference data, with differing levels of address granularity, are available from Informatica. Pricing
for the licensing of these components may vary and should be discussed with the Informatica
Account Manager.
Last updated: 01-Feb-07 18:52
Managing Internal and External Reference Data
Challenge
To provide guidelines for the development and management of the reference data sources that can be
used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition
from development to production for reference data files and the plans with which they are associated.
Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan.
A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those
terms. It may be a list of employees, package measurements, or valid postal addresses: any data set
that provides an objective reference against which project data sources can be checked or corrected.
Reference files are essential to some, but not all data quality processes.
Reference data can be internal or external in origin.
Internal data is specific to a particular project or client. Such data is typically generated from internal
company information. It may be custom-built for the project.
External data has been sourced or purchased from outside the organization. External data is used when
authoritative, independently-verified data is needed to provide the desired level of data quality to a
particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal
address data sets that have been verified as current and complete by a national postal carrier, such as
United States Postal Service, or company registration and identification information from an industry-
standard source such as Dun & Bradstreet.
Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that
requires intermediary (third-party) software in order to be read by Informatica applications.
Internal data files, as they are often created specifically for data quality projects, are typically saved in the
dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases
can also be used as a source for internal data.
External files are more likely to remain in their original format. For example, external data may be
contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal
discrete data values.
Working with Internal Data
Obtaining Reference Data
Most organizations already possess much information that can be used as reference data for example,
employee tax numbers or customer names. These forms of data may or may not be part of the project
source data, and they may be stored in different parts of the organization.
The question arises: are internal data sources sufficiently reliable for use as reference? Bear in mind that
in some cases the reference data does not need to be 100 percent accurate. It can be good enough to
compare project data against reference data and to flag inconsistencies between them, particularly in
cases where both sets of data are highly unlikely to share common errors.
Saving the Data in .DIC File Format
IDQ installs with a set of reference dictionaries that have been created to handle many types of business
data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from
dictionary, and dictionary files are essentially comma delimited text files.
You can create a new dictionary in three ways:
G You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of
your IDQ (client or server) installation.
G You can use the Dictionary Manager within Data Quality Workbench. This method allows you to
create text and database dictionaries.
G You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).
The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text
editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct
or standardized form of each datum from the dictionary's perspective. The Item columns contain versions
of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore,
each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration
below). A dictionary can have multiple Item columns.
To edit a dictionary value, open the DIC file and make your changes. You can make changes either
through a text editor or by opening the dictionary in the Dictionary Manager.
To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row,
and add a Label string and at least one Item string. You can also add values in a text editor by placing the
cursor on a new line and typing Label and Item values separated by commas.
Once saved, the dictionary is ready for use in IDQ.
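For illustration, a hand-built street-suffix dictionary might look like this when its .DIC file is opened in a text editor; the values are examples only, and the layout assumes the Label value appears first on each line, followed by one or more Item values.

```
STREET,ST,STR,STREET
AVENUE,AVE,AV,AVENUE
BOULEVARD,BLVD,BOUL,BOULEVARD
```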
Note: IDQ users with database expertise can create and specify dictionaries that are linked to database
tables, and that thus can be updated dynamically when the underlying data is updated. Database
dictionaries are useful when the reference data has been originated for other purposes and is likely to
change independently of data quality. By making use of a dynamic connection, data quality plans can
always point to the current version of the reference data.
Sharing Reference Data Across the Organization
As you can publish or export plans from a local Data Quality repository to server repositories, so you can
copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like
mechanism for moving files to other machines across the network.
Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when
running a plan. By default, Data Quality relies on dictionaries being located in the following locations:
G The Dictionaries folders installed with Workbench and Server.
G The user's file space in the Data Quality service domain.
IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file
when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will
fail.
This is most relevant when you publish or export a plan to another machine on the network. When the plan is copied to the server-side repository, you must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain: either in the user space on the server, or at a location in the server's Dictionaries folders that corresponds to the dictionaries' location in Workbench.
Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file.
However, this is the master configuration file for the product and you should not edit it without consulting
Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.
Version Controlling Updates and Managing Rollout from Development to Production
Plans can be version-controlled during development in Workbench and when published to a domain
repository. You can create and annotate multiple versions of a plan, and review/roll back to earlier versions
when necessary.
Dictionary files are not version controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.
Database reference data can also be version controlled, although this presents difficulties if the database
is very large in size. Bear in mind that third-party reference data, such as postal address data, should not
ordinarily be changed, and so the need for a versioning strategy for these files is debatable.
Working with External Data
Formatting Data into Dictionary Format
External data may or may not permit the copying of data into text format (for example, external data contained in a database or in library files). Currently, third-party postal address validation data is provided
to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The
third-party software has a very small footprint.) However, some software files can be amenable to data
extraction to file.
Obtaining Updates for External Reference Data
External data vendors produce regular data updates, and it's vital to refresh your external reference data when updates become available. The key advantage of external data, its reliability, is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept
up to date with the latest data as it becomes available for as long as your data subscription warrants. You
can check that you possess the latest versions of third-party data by contacting your Informatica Account
Manager.
Managing Reference Updates and Rolling Out Across the Organization
If your organization has a reference data subscription, you will receive either regular data files on compact
disc or regular information on how to download data from Informatica or vendor web sites. You must
develop a strategy for distributing these updates to all parties who run plans with the external data. This
may involve installing the data on machines in a service domain.
Bear in mind that postal address data vendors update their offerings every two or three months, and that a
significant percentage of postal addresses can change in such time periods.
You should plan for the task of obtaining and distributing updates in your organization at frequent intervals.
Depending on the number of IDQ installations that must be updated, updating your organization with third-
party reference data can be a sizable task.
Strategies for Managing Internal and External Reference Data
Experience working with reference data leads to a series of best practice tips for creating and managing
reference data files.
Using Workbench to Build Dictionaries
With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-
compatible format.
Let's say you have designed a data quality plan that identifies invalid or anomalous records in a customer
database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file
to create a dictionary-compatible file.
For example, let's say you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the Label column of your dictionary.
By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into
Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns.
Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the
dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account
numbers that you can use in any plans checking the validity of the organization's account records.
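The manual spreadsheet step can also be scripted. The following Python sketch is hypothetical (the file names and the column name are assumptions); it writes each parsed serial number to a two-column dictionary file in which the Label and Item1 values are identical.

```python
# Automates the manual Excel step described above: take the account-number column
# from an exception file and write a two-column (Label, Item1) .DIC file.
# File names and the "account_number" column are illustrative.
import csv

def write_dictionary(exception_csv: str, dic_path: str, column: str = "account_number"):
    with open(exception_csv, newline="") as src, open(dic_path, "w", newline="") as dic:
        reader = csv.DictReader(src)
        writer = csv.writer(dic)
        seen = set()
        for row in reader:
            value = row[column].strip()
            if value and value not in seen:     # keep each bad account number once
                writer.writerow([value, value])  # Label and Item1 are identical
                seen.add(value)

write_dictionary("bad_account_records.csv", "bad_account_numbers.dic")
```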
Using Report Viewer to Build Dictionaries
The IDQ Report Viewer allows you to create exception files and dictionaries on the fly from report data. The figure below illustrates how you can drill down into report data, right-click on a column, and save the
column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to
the column data.
In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically,
records containing bad zip codes). The plan designer can now create plans to check customer databases
against these serial numbers. You can also append data to an existing dictionary file in this manner.
As a general rule, it is a best practice to follow the dictionary organization structure installed by the
application, adding to that structure as necessary to accommodate specialized and supplemental
dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible
modifications, thereby lowering the risk of accidental errors during migration. When following the original
dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.
Since external data may be obtained from third parties and may not be in file format, the most efficient way
to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically,
this is the machine that hosts the Execution Service.)
Moving Dictionary Files After IDQ Plans are Built
This is a similar issue to that of sharing reference data across the organization. If you must move or
relocate your reference data files post-plan development, you have three options:
G You can reset the location to which IDQ looks by default for dictionary files.
G You can reconfigure the plan components that employ the dictionaries to point to the new
location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
G If deploying plans in a batch or scheduled task, you can append the new location to the plan
execution command. You can do this by appending a parameter file to the plan execution
instructions on the command line. The parameter file is an xml file that can contain a simple
command to use one file path instead of another.
Last updated: 08-Feb-07 17:09
Testing Data Quality Plans
Challenge
To provide a guide to testing data quality processes or plans created using Informatica
Data Quality (IDQ), and to manage some of the unique complexities associated with
data quality plans.
Description
Testing data quality plans is an iterative process that occurs as part of the Design
Phase of Velocity. That is, plan testing often precedes the project's main testing
activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not
necessary to formally test the plans used in the Analyze Phase of Velocity.
The development of data quality plans typically follows a prototyping methodology of
create, execute, analyze. Testing is performed as part of the third step, in order to
determine that the plans are being developed in accordance with design and project
requirements. This method of iterative testing helps support rapid identification and
resolution of bugs.
Bear in mind that data quality plans are designed to analyze and resolve data content
issues. These are not typically cut-and-dried problems, but more often represent a
continuum of data improvement issues where it is possible that every data instance is
unique and there is a target level of data quality rather than a right or wrong answer.
Data quality plans tend to resolve problems in terms of percentages and probabilities
that a problem is fixed. For example, the project may set a target of 95 percent
accuracy in its customer addresses.
Common Questions in Data Quality Plan Testing
G What dataset will you use to test the plans? While the ideal situation is to
use a data set that exactly mimics the project production data, you may not
gain access to this data. If you obtain a full cloned set of the project data for
testing purposes, bear in mind that some plans (specifically, some data
matching plans) can take several hours to complete. Consider testing data
matching plans overnight.
G Are the plans using reference dictionaries? Reference dictionary
management is an important factor since it is possible to make changes to a
reference dictionary independently of IDQ and without making any changes to
the plan itself. When you pass an IDQ plan as tested, you must ensure that no
additional work is carried out on any dictionaries referenced in the plan.
Moreover, you must ensure that the dictionary files reside in locations that
are valid for IDQ.
G How will the plans be executed? Will they be executed on a remote IDQ
Server, and/or via a scheduler? In cases like these, it's vital to ensure that your
plan resources, including source data files and reference data files, are in valid
locations for use by the Data Quality engine. For details on the local and
remote locations to which IDQ looks for source and reference data files, refer
to the Informatica Data Quality 3.1 User Guide.
G Will the plans be integrated into a PowerCenter transformation? If so, the
plans must have realtime-enabled data source and sink components.
Strategies for Testing Data Quality Plans
The best practice steps for testing plans can be grouped under two headings.
Testing to Validate Rules
1. Identify a small, representative sample of source data.
2. Manually process the data, based on the rules for profiling, standardization or
matching that the plans will apply, to determine the results expected when the
plans are run.
3. Execute the plans on the test dataset, and validate the plan results against the
manually-derived results.
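Step 3 can be partly automated by comparing the plan output against the manually derived results. The Python sketch below assumes both result sets are available as CSV files keyed on a record identifier; the file and column names are illustrative only.

import csv

def load_results(path, key_field):
    # Load a CSV file into a dictionary keyed on the given identifier column.
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

# Illustrative file and column names.
expected = load_results("expected_results.csv", "record_id")
actual = load_results("plan_output.csv", "record_id")

mismatches = []
for record_id, expected_row in expected.items():
    actual_row = actual.get(record_id)
    if actual_row is None:
        mismatches.append((record_id, "missing from plan output"))
    elif any(actual_row.get(field) != value
             for field, value in expected_row.items() if field != "record_id"):
        mismatches.append((record_id, "field values differ"))

print(f"{len(expected) - len(mismatches)} of {len(expected)} records match the expected results")
for record_id, reason in mismatches:
    print(record_id, reason)

Any mismatches should be traced back either to a defect in the plan or to an error in the manually derived results before the plan is passed as tested.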
Testing to Validate Plan Effectiveness
This process is concerned with establishing that a data enhancement plan has been
properly designed; that is, that the plan delivers the required improvements in data
quality.
This is largely a matter of comparing the business and project requirements for data
quality and establishing if the plans are on course to deliver these. If not, the plans may
need a thorough redesign or the business and project targets may need to be revised.


Last updated: 01-Feb-07 18:52
Tuning Data Quality Plans
Challenge
This document gives insight into the types of considerations and issues a user needs
to be aware of when making changes to data quality processes defined in Informatica
Data Quality (IDQ). In IDQ, data quality processes are called plans.
The principal focus of this best practice is to know how to tune your plans without
adversely affecting the plan logic. This best practice is not intended to replace training
materials, but to serve as a guide for decision-making in the areas of adding, removing, or
changing the operational components that comprise a data quality plan.
Description
You should consider the following questions prior to making changes to a data quality
plan:
G What is the purpose of changing the plan? You should consider changing a
plan if you believe it is not optimally configured, if it is not functioning properly
at execution time, or if it is not delivering the expected results as defined by the
plan design principles.
G Are you trained to change the plan? Data quality plans can be complex.
You should not alter a plan unless you have been trained or are highly
experienced with IDQ methodology.
G Is the plan properly documented? You should ensure all plan
documentation on the data flow and the data components are up-to-date. For
guidelines on documenting IDQ plans, see the Sample Deliverable Data
Quality Plan Design.
G Have you backed up the plan before editing? If you are using IDQ in a
client-server environment, you can create a baseline version of the plan using
IDQ version control functionality. In addition, you should copy the plan to a
new project folder (viz., Work_Folder) in the Workbench for changing and
testing, and leave the original plan untouched during testing.
G Is the plan operating directly on production data? This applies especially
to standardization plans. When editing a plan, always work on staged data
(database or flat-file). You can later migrate the plan to the production
environment after complete and thorough testing.
You should have a clear goal whenever you plan to change an existing plan. An event
may prompt the change: for example, input data changing (in format or content), or
changes in business rules or business/project targets. You should take into account all
current change-management procedures, and the updated plans should be thoroughly
tested before production processes are updated. This includes integration and
regression testing too. (See also Testing Data Quality Plans.)
Bear in mind that at a high level there are two types of data quality plans: data analysis
and data enhancement plans.
G Data analysis plans produce reports on data patterns and data quality across
the input data. The key objective in data analysis is to determine the levels of
completeness, conformity, and consistency in the dataset. In pursuing these
objectives, data analysis plans can also identify cases of missing, inaccurate
or noisy data.
G Data enhancement plans correct completeness, conformity, and consistency
problems; they can also identify duplicate data entries and fix accuracy issues
through the use of reference data.
Your goal in a data analysis plan is to discover the quality and usability of your data. It
is not necessarily your goal to obtain the best scores for your data. Your goal in a data
enhancement plan is to resolve the data quality issues discovered in the data analysis.
Adding Components
In general, simply adding a component to a plan is not likely to directly affect results if
no further changes are made to the plan. However, once the outputs from the new
component are integrated into existing components, the data process flow is changed
and the plan must be re-tested and results reviewed in detail before migrating the plan
into production.
Bear in mind, particularly in data analysis plans, that improved plan statistics do not
always mean that the plan is performing better. It is possible to configure a plan that
moves beyond the point of truth by focusing on certain data elements and excluding
others.
When added to existing plans, some components have a larger impact than others. For
example, adding a To Upper component to convert text into upper case may not
cause the plan results to change meaningfully, although the presentation of the output
data will change. However, adding and integrating a Rule Based Analyzer component
(designed to apply business rules) may cause a severe impact, as the rules are likely to
change the plan logic.
As well as adding a new component (that is, a new icon) to the plan, you can add a
new instance to an existing component. This can have the same effect as adding and
integrating a new component icon. To avoid overloading a plan with too many
components, it is a good practice to add multiple instances to a single component,
within reason. Good plan design suggests that instances within a single component
should be logically similar and work on the selected inputs in similar ways. If you add a
new instance to a component, and that instance behaves very differently from the other
instances in that component (for example, if it acts on an unrelated set of outputs or
performs an unrelated type of action on the data), you should probably add a new
component for this instance. This will also help you keep track of your changes
onscreen.
To avoid making plans over-complicated, it is often a good practice to split tasks into
multiple plans where a large number of data quality measures need to be checked. This
makes plans and business rules easier to maintain and provides a good framework for
future development. For example, in an environment where a large number of attributes
must be evaluated against the six standard data quality criteria (i.e., completeness,
conformity, consistency, accuracy, duplication, and consolidation), using one plan per
data quality criterion may be a good way to move forward. Alternatively, splitting plans
up by data entity may be advantageous. Similarly, during standardization, you can
create plans for specific function areas (e.g., address, product, or name) as opposed to
adding all standardization tasks to a single large plan.
For more information on the six standard data quality criteria, see Data Cleansing.
Removing Components
Removing a component from a plan is likely to have a major impact since, in most
cases, data flow in the plan will be broken. If you remove an integrated component,
configuration changes will be required to all components that use the outputs from the
component. The plan cannot run without these configuration changes being completed.
The only exceptions to this case are when the output(s) of the removed component are
used solely by a CSV Sink component or by a frequency component. However, in these
cases, you must note that the plan output changes since the column(s) no longer
appear in the result set.
Editing Component Configurations
Changing the configuration of a component can have a comparable impact on the
overall plan as adding or removing a component: the plan's logic changes, and
therefore, so do the results that it produces. However, although adding or removing a
component may make a plan non-executable, changing the configuration of a
component can impact the results in more subtle ways. For example, changing the
reference dictionary used by a parsing component does not break a plan, but may
have a major impact on the resulting output.
Similarly, changing the name of a component instance output does not break a plan. By
default, component output names cascade through the other components in the plan,
so when you change an output name, all subsequent components automatically update
with the new output name. It is not necessary to change the configuration of dependent
components.


Last updated: 01-Feb-07 18:52
Using Data Explorer for Data Discovery and Analysis
Challenge
To understand and make full use of Informatica Data Explorer's potential to profile and define mappings for your
project data.
Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration,
consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise
application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate
understanding of the true structure of the source data in order to correctly transform the data for a given target
database design. However, the data's actual form rarely coincides with its documented or supposed form.
The key to success for data-related projects is to fully understand the data as it actually is, before attempting to
cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this
purpose.
This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.
Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality,
content and structure of project data sources. Data profiling analyzes several aspects of data structure and
content, including characteristics of each column or field, the relationships between fields, and the commonality
of data values between fields, often an indicator of redundant data.
Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics
against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a
field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality
standards may either be the native rules expressed in the source data's metadata, or an external standard (e.g.,
corporate, industry, or government) to which the source data must be mapped in order to be assessed.
Data profiling in IDE is based on two main processes:
G Inference of characteristics from the data
G Comparison of those characteristics with specified standards, as an assessment of data quality
Data mapping involves establishing relationships among data elements in various data structures or sources, in
terms of how the same information is expressed or stored in different ways in different sources. By performing
these processes early in a data project, IT organizations can preempt the code/load/explode syndrome, wherein
a project fails at the load stage because the data is not in the anticipated form.
Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure
summarizes and abstracts these scenarios into a single depiction of the IDE solution.
The overall process flow for the IDE Solution is as follows:
1. Data and metadata are prepared and imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents
cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data
sources and then modifying the model as required to meet current business and performance
requirements. In this scenario, IDE is used to develop the normalized schema into a target database.

The normalized and target schemas are then exported to IDE's FTM/XML tool, which documents
transformation requirements between fields in the source, normalized, and target schemas.

OR
5. In a fixed-target scenario, the design of the target database is a given (i.e., because another
organization is responsible for developing it, or because an off-the-shelf package or industry standard is
to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to
map the source data fields to the corresponding fields in an externally-specified target schema, and to
document transformation requirements between fields in the normalized and target schemas. FTM is
used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based
metadata structures. Externally specified targets are typical for ERP package migrations, business-to-
business integration projects, or situations where a data modeling team is independently designing the
target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and
loading or formatting specs developed with IDE applications.
IDE's Methods of Data Profiling
IDE employs three methods of data profiling:
Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely
metadata and alternate metadata which is consistent with the data.
Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This
process can discover primary and foreign keys, functional dependencies, and sub-tables.
Cross-Table profiling - determines the overlap of values across a set of columns, which may come from
multiple tables.
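As a simple illustration of the first of these methods, column profiling can be thought of as computing candidate metadata from the values actually present in a column. The Python sketch below infers a handful of such characteristics; it is a conceptual illustration only and is not IDE's inference algorithm.

def profile_column(values):
    # Infer basic metadata for one column from its observed values.
    non_null = [v for v in values if v not in ("", None)]
    profile = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "all_numeric": all(str(v).strip().lstrip("-").replace(".", "", 1).isdigit()
                           for v in non_null) if non_null else False,
    }
    # A column whose distinct count equals its non-null count is a candidate key.
    profile["candidate_key"] = bool(non_null) and profile["distinct_count"] == len(non_null)
    return profile

print(profile_column(["94063", "10017", "94063", "", "73301"]))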
Profiling against external standards requires that the data source be mapped to the standard before being
assessed (as shown in the following figure). Note that the mapping is performed by IDE's Fixed Target Mapping
tool (FTM). IDE can also be used in the development and application of corporate standards, making them
relevant to existing systems as well as to new systems.
Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the
quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality
should be considered as an alternative tool for data cleansing.
IDE and Fixed-Target Migration
Fixed-target migration projects involve the conversion and migration of data from one or more sources to an
externally defined or fixed-target. IDE is used to profile the data and develop a normalized schema representing
the data source(s), while IDE's Fixed Target Mapping tool (FTM) is used to map from the normalized schema to
the fixed target.
The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:
1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents
cleansing and transformation requirements based on the source and normalized schemas. The cleansing
requirements can be reviewed and modified by the Data Quality team.
3. The resultant metadata are exported to and managed by the IDE Repository.
4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and
documents transformation requirements between fields in the normalized and target schemas. Externally-
specified targets are typical for ERP migrations or projects where a data modeling team is independently
designing the target schema.
5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and
loading or formatting specs developed with IDE and FTM.
6. The cleansing, transformation, and formatting specs can be used by the application development or Data
Quality team to cleanse the data, implement any required edits and integrity management functions, and
develop the transforms or configure an ETL product to perform the data conversion and migration.
The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may
discover hidden tables within tables.
Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to
establish several of the staging databases between the sources and target, as shown below:
Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a
target database defined by the migration team. IDE is used to profile the data and develop a normalized schema
representing the data source(s), and to further develop the normalized schema into a target schema by adding
tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or
denormalizing the schema to enhance performance. When the target schema is developed from the normalized
schema within IDE, the product automatically maintains the mappings from the source to normalized schema,
and from the normalized to target schemas.
The figure below shows that the general sequence of activities for a derived-target migration project is as follows:
1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and
document cleansing and transformation requirements based on the source and normalized schemas. The
cleansing requirements can be reviewed and modified by the Data Quality team.
3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves
removing obsolete or spurious data elements, incorporating new business requirements and data
elements, adapting to corporate data standards, and denormalizing to enhance performance.
4. The resultant metadata are exported to and managed by the IDE Repository.
5. FTM is used to develop and document transformation requirements between the normalized and target
schemas. The mappings between the data elements are automatically carried over from the IDE-based
schema development process.
6. The IDE Repository is used to export an XSLT document containing the transformation and the formatting
specs developed with IDE and FTM/XML.
7. The cleansing, transformation, and formatting specs are used by the application development or Data
Quality team to cleanse the data, implement any required edits and integrity management functions, and
develop the transforms or configure an ETL product to perform the data conversion and migration.


Last updated: 09-Feb-07 12:55
Working with Pre-Built Plans in Data Cleanse and Match
Challenge
To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data
Cleanse and Match (DC&M) product offering.
Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter
system:
G Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be
designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until
needed.
G Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in
adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users
can connect to the Data Quality repository and read data quality plan information into this transformation.
Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document
focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components
of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication
functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address
reference data files.
This document focuses on the following areas:
G when to use one plan vs. another for data cleansing.
G what behavior to expect from the plans.
G how best to manage exception data.
Description
The North America Content Pack installs several plans to the Data Quality Repository:
G Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
G Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual
source matching operations (identifying matching records between two datasets).
The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.
Plans 01-04: Parsing, Cleansing, and Validation
These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-
structured data sources. The level of structure contained in a given data set determines the plan to be used.
The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and
validate an address.
In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields,
only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically
labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans
is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address
standardization, and validation plans may be required to obtain meaning from the data.
The purpose of making the plans modular is twofold:
G It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in
sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input
addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on
the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.
G Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the
seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation
and extremely complex plan logic that would be difficult to modify and maintain.
01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example,
consider data stored in the following format:
Field1 | Field2 | Field3 | Field4 | Field5
100 Cardinal Way | Informatica Corp | CA 94063 | [email protected] | Redwood City
Redwood City | 38725 | 100 Cardinal Way | CA 94063 | [email protected]
While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such
as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered
throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-
specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses,
depending on the profile of the content. As a result, the above data will be parsed into the following format:
Address1 | Address2 | Address3 | E-mail | Date | Company
100 Cardinal Way | CA 94063 | Redwood City | [email protected] | | Informatica Corp
Redwood City | 100 Cardinal Way | CA 94063 | [email protected] | 08/01/2006 |
The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by
information type. As demonstrated with the address fields in the above example, the address fields are labeled as addresses; the
contents are not arranged in a standard address format; they are flagged as addresses in the order in which they were processed
in the file.
The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because
they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.
The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and
address element are contained in the same field, the General Parser would label the entire field either a name or an address - or
leave it unparsed - depending on the elements in the field it can identify first (if any).
While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data
that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form
containing unparsed data.
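The underlying approach, a combination of dictionary lookups and pattern recognition, can be illustrated with a simplified sketch. The Python code below is not the plan's actual logic; the patterns and the dictionary stand-ins are reduced to a bare minimum for illustration.

import re

# Minimal illustrative patterns; the real plan relies on dictionaries and richer rules.
PATTERNS = {
    "E-mail": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "Telephone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}
COMPANY_SUFFIXES = {"CORP", "INC", "LLC", "LTD", "CO"}          # stand-in for a dictionary file
STATE_CODES = {"CA", "NY", "TX"}                                # stand-in for a dictionary file
STREET_SUFFIXES = {"WAY", "ST", "STREET", "AVE", "RD", "ROAD"}  # stand-in for a dictionary file
CITIES = {"REDWOOD CITY", "AUSTIN"}                             # stand-in for a dictionary file

def classify(value):
    # Assign a field value to an information type, in the spirit of the General Parser.
    for info_type, pattern in PATTERNS.items():
        if pattern.match(value):
            return info_type
    tokens = value.upper().replace(".", "").split()
    if any(t in COMPANY_SUFFIXES for t in tokens):
        return "Company"
    if (value.upper() in CITIES or any(t in STATE_CODES for t in tokens)
            or any(t in STREET_SUFFIXES for t in tokens) or re.search(r"\b\d{5}\b", value)):
        return "Address"
    return "Unparsed"

row = ["100 Cardinal Way", "Informatica Corp", "CA 94063", "name@example.com", "Redwood City"]
print([(value, classify(value)) for value in row])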
The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that
data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan.
Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g.,
telephone and email in the same contact field), or in cases where the data has been badly managed, such as when several files of
differing structures have been merged into a single file.
02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and
standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.
The plan input fields include two inputs for company names. Data that is entered in these fields are assumed to be valid company
names, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters,
numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to
validate a company name are not likely to yield usable results.
Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the
Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is
matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.
The second track for name standardization is person names standardization. While this track is dedicated to standardizing person
names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow
a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company
suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is
applied to identify people whose last name is similar (or equal) to a valid company name (for example John Sears); inputs that
contain an identified first name and a company name are treated as a person name.
If the company name track inputs are already fully populated for the record in question, then any company name detected in a
person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name (e.
g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining
data is accepted as being a valid person name and parsed as such.
North American person names are typically entered in one of two different styles: either in a firstname middlename surname
format or surname, firstname middlename format. Name parsing algorithms have been built using this assumption.
Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name
prefixes, name suffixes, firstnames, and any extraneous data (noise) present. Any remaining details are assumed to be middle
name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, best guess
parsing is applied to the field based on the possible assumed formats.
When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details
including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated.
In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate.
The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality
plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).
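The gender and salutation step can indeed be reproduced outside the plan. The Python sketch below is a simplified illustration of the logic described above, not the plan's actual rules; the first-name-to-gender lookup stands in for the plan's dictionaries, and prefix, suffix, and noise handling are omitted.

# Stand-in for a first-name dictionary that carries gender information.
FIRST_NAME_GENDER = {"STEVEN": "M", "SHANNON": "F", "EUGENE": "M"}

def parse_name(raw):
    # Split a name using the two assumed North American formats:
    # "firstname middlename surname" or "surname, firstname middlename".
    raw = raw.strip()
    if "," in raw:
        last, _, rest = raw.partition(",")
        parts = rest.split()
    else:
        parts = raw.split()
        last = parts.pop() if parts else ""
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return first, middle, last.strip()

def gender_and_salutation(first, last):
    # Derive gender from the first name, then build a salutation from it.
    gender = FIRST_NAME_GENDER.get(first.upper(), "")   # blank when indeterminate
    prefix = {"M": "Mr.", "F": "Ms."}.get(gender, "")
    salutation = f"Dear {prefix} {last}" if prefix else f"Dear {first} {last}"
    return gender, salutation

for name in ["Steven King", "Prince, Shannon C."]:
    first, middle, last = parse_name(name)
    print(name, "->", (first, middle, last), gender_and_salutation(first, last))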
Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed
according to person name processing rules. Likewise, some person names may be identified as companies and standardized
according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when
working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.
Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the
fields. For example, an address datum such as Corporate Parkway may be standardized as a business name, as Corporate is
also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on
whether or not the field contains a recognizable company suffix in the text.
To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-
execution analysis of the data.
Based on the following input:
ROW ID IN NAME1
1 Steven King
2 Chris Pope Jr.
3 Shannon C. Prince
4 Dean Jones
5 Mike Judge
6 Thomas Staples
7 Eugene F. Sears
8 Roy Jones Jr.
9 Thomas Smith, Sr
10 Eddie Martin III
11 Martin Luther King, Jr.
12 Staples Corner
13 Sears Chicago
14 Robert Tyre
15 Chris News
The following outputs are produced by the Name Standardization plan:
The last entry (Chris News) is identified as a company in the current plan configuration; such results can be refined by changing
the underlying dictionary entries used to identify company and person names.
03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United
States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where
processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key
search elements into discrete fields, thereby speeding up the validation process.
The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All
remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that
cannot be parsed into the remaining fields is merged into the non-address data field.
The plan makes a number of assumptions that may or may not suit your data:
G When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are
spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where
town names are commonly misspelled, the standardization plan may not correctly parse the information.
G Zip codes are all assumed to be five-digit. In some files, zip codes that begin with 0 may lack this first number and so
appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not
recommended, as these will conflict with the Plus 4 element of a zip code. Zip codes may also be confused with other
five-digit numbers in an address line such as street numbers.
G City names are also commonly found in street names and other address elements. For example, United is part of a
country (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates
from right to left across the data, so that country name and zip code fields are analyzed before city names and street
addresses. Therefore, the word United may be parsed and written as the town name for a given address before the
actual town name datum is reached.
G The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there
is no need to include any country code field in the address inputs when configuring the plan.
Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding
some pre-processing logic to a workflow prior to passing the data into the plan.
The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been
parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as
address lines 1-3.
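A simplified sketch of this style of right-to-left parsing is shown below. It illustrates the approach only; the real plan relies on full city, state/province, and zip/postal code dictionaries rather than the minimal stand-ins used here.

import re

STATE_PROVINCE = {"CA", "NY", "TX", "ON", "BC"}      # stand-in for a dictionary file
CITIES = {"REDWOOD CITY", "TORONTO", "AUSTIN"}       # stand-in for a dictionary file
ZIP_OR_POSTAL = re.compile(r"\d{5}(-\d{4})?|[A-Z]\d[A-Z] ?\d[A-Z]\d")

def standardize(address_fields):
    # Scan up to six generic address fields from right to left, pulling out
    # zip/postal code, state/province, and city; keep the rest as address lines 1-3.
    out = {"zip": "", "state": "", "city": "", "address_lines": []}
    remaining = []
    for field in reversed([f.strip() for f in address_fields if f and f.strip()]):
        if not out["zip"] and ZIP_OR_POSTAL.fullmatch(field.upper()):
            out["zip"] = field
        elif not out["state"] and field.upper() in STATE_PROVINCE:
            out["state"] = field.upper()
        elif not out["city"] and field.upper() in CITIES:
            out["city"] = field.upper()
        else:
            remaining.append(field)
    out["address_lines"] = list(reversed(remaining))[:3]   # original order, lines 1-3
    return out

print(standardize(["100 Cardinal Way", "Redwood City", "CA", "94063", "", ""]))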
04 NA Address Validation
The purposes of the North America Address Validation plan are:
G To match input addresses against known valid addresses in an address database, and
G To parse, standardize, and enrich the input addresses.
Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address
Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in
discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into
discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times.
The address validation APIs store specific area information in memory and continue to use that information from one record to the
next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to
maximize the usage of data in memory.
In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1
User Guide for information on how to interpret them.
Plans 05-07: Pre-Match Standardization, Grouping, and Matching
These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs: either plans 05 and 06, or plans
05 and 07. These plans work as follows:
G 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on
the data prior to matching.
G 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
G 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.
Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed
directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English
data. Although they work with datasets in other languages, the results may be sub-optimal.
Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and
group the data.
The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings,
abbreviations, etc. are as similar to each other as possible to return a better match set. For example, 123 Main Rd. and 123 Main
Road will obtain an imperfect match score, although they clearly refer to the same street address.
Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a
matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records
within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing
time while minimizing the likelihood of missed matches in the dataset.
Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data
columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to
facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group
keys.
In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)
Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of
additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John
Smith with the record for Angela Murphy as they are obviously not going to be considered as duplicate entries. The type of
grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data
are location-based (e.g. city name, zip codes) or person/company based (surname and company name composites). For more
information on grouping strategies for best result/performance relationship, see the Best Practice Effective Data Matching
Techniques.
Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It
offers a number of grouping options. The plan generates the following group keys:
G OUT_ZIP_GROUP: first 5 digits of ZIP code
G OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
G OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
G OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
G OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name
The grouping output used depends on the data contents and data volume.
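The group keys listed above are simple composites of cleansed field fragments, so their construction is easy to visualize. The Python sketch below shows one way such keys could be derived; it is illustrative only and is not the plan's internal implementation.

def group_keys(zip_code, last_name, company_name):
    # Build candidate group keys from cleansed inputs, mirroring the outputs listed above.
    zip5 = (zip_code or "").strip()[:5]
    last = (last_name or "").strip().upper()
    company = (company_name or "").strip().upper()
    return {
        "OUT_ZIP_GROUP": zip5,
        "OUT_ZIP_NAME3_GROUP": zip5 + last[:3],
        "OUT_ZIP_NAME5_GROUP": zip5 + last[:5],
        "OUT_ZIP_COMPANY3_GROUP": zip5 + company[:3],
        "OUT_ZIP_COMPANY5_GROUP": zip5 + company[:5],
    }

print(group_keys("94063-1234", "Smith", "Informatica Corp"))

Records that share a key value fall into the same group, and the matching plan then compares records only within each group.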
Plans 06 Single Source Matching and 07 Dual Source Matching
Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used.
However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression
transform upstream in the PowerCenter mapping.
A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-
based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ
matching components, consult the Informatica Data Quality 3.1 User Guide.
By default the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The
Data Quality Developer can easily adjust this figure in each plan.
PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.
To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts
data according to the group key to be used during matching. This transformation should follow standardization and grouping
operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality
Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the
same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active
transformation.
The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if these are
not present in the source data. (Note that a unique identifier is not required for matching processes.)
When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for
the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively.
The data from the two sources is then joined together using a Union transformation, before being passed to the Integration
transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single
source version.


Last updated: 09-Feb-07 13:18
Designing Data Integration Architectures
Challenge
Develop a sound data architecture that can serve as a foundation for a data integration solution
that may evolve over many years.
Description
Historically, organizations have approached the development of a "data warehouse" or "data mart"
as a departmental effort, without considering an enterprise perspective. The result has been silos
of corporate data and analysis, which very often conflict with each other in terms of both detailed
data and the business conclusions implied by it.
Taking an enterprise-wide, architectural stance in developing data integration solutions provides many
advantages, including:
G A sound architectural foundation ensures the solution can evolve and scale with the
business over time.
G Proper architecture can isolate the application component (business context) of the data
integration solution from the technology.
G Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.
As the evolution of data integration solutions (and the corresponding nomenclature) has
progressed, the necessity of building these solutions on a solid architectural framework has
become more and more clear. To understand why, a brief review of the history of data
integration solutions and their predecessors is warranted.
Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed,
transaction-oriented view of an organization's data. While this view was indispensable for the day-
to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for
management decision-making, was severely limited. Initial attempts to address this problem took
several directions:
G Reporting directly against the production system. This approach minimized the effort
associated with developing management reports, but introduced a number of significant issues:
The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the
year, month, or even the day, were inconsistent with each other.
Ad hoc queries against the production database introduced uncontrolled performance issues,
resulting in slow reporting results and degradation of OLTP system performance.
Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the
OLTP systems.
G Mirroring the production system in a reporting database . While this approach
alleviated the performance degradation of the OLTP system, it did nothing to address the
other issues noted above.
G Reporting databases . To address the fundamental issues associated with reporting
against the OLTP schema, organizations began to move toward dedicated reporting
databases. These databases were optimized for the types of queries typically run by
analysts, rather than those used by systems supporting data entry clerks or customer
service representatives. These databases may or may not have included pre-aggregated
data, and took several forms, including traditional RDBMS as well as newer technology
Online Analytical Processing (OLAP) solutions.
The initial attempts at reporting solutions were typically point solutions; they were developed
internally to provide very targeted data to a particular department within the enterprise. For
example, the Marketing department might extract sales and demographic data in order to infer
customer purchasing habits. Concurrently, the Sales department was also extracting sales data for
the purpose of awarding commissions to the sales force. Over time, these isolated silos of
information became irreconcilable, since the extracts and business rules applied to the data during
the extract process differed for the different departments.
The result of this evolution was that the Sales and Marketing departments might report completely
different sales figures to executive management, resulting in a lack of confidence in both
departments' "data marts." From a technical perspective, the uncoordinated extracts of the same
data from the source systems multiple times placed undue strain on system resources.
The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would
be supported by a single set of periodic extracts of all relevant data into the data warehouse (or
Operational Data Store), with the data being cleansed and made consistent as part of the extract
process. The problem with this solution was its enormous complexity, typically resulting in project
failure. The scale of these failures led many organizations to abandon the concept of the enterprise
data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these
solutions still had all of the issues discussed previously, they had the clear advantage of providing
individual departments with the data they needed without the unmanageability of the enterprise
solution.
As individual departments pursued their own data and data integration needs, they not only created
data stovepipes, they also created technical islands. The approaches to populating the data marts
and performing the data integration tasks varied widely, resulting in a single enterprise evaluating,
purchasing, and being trained on multiple tools and adopting multiple methods for performing these
tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to
face the daunting challenge of integrating the disparate data as well as the widely varying
technologies. To deal with these issues, organizations began developing approaches that
considered the enterprise-level requirements of a data integration solution.
Centralized Data Warehouse
The first approach to gain popularity was the centralized data warehouse. Designed to solve the
decision support needs for the entire enterprise at one time, with one effort, the data integration
process extracts the data directly from the operational systems. It transforms the data according to
the business rules and loads it into a single target database serving as the enterprise-wide data
warehouse.


Advantages
The centralized model offers a number of benefits to the overall architecture, including:
G Centralized control . Since a single project drives the entire process, there is centralized
control over everything occurring in the data warehouse. This makes it easier to manage a
production system while concurrently integrating new components of the warehouse.
G Consistent metadata . Because the warehouse environment is contained in a single
database and the metadata is stored in a single repository, the entire enterprise can be
queried whether you are looking at data from Finance, Customers, or Human Resources.
G Enterprise view . Developing the entire project at one time provides a global view of how
data from one workgroup coordinates with data from others. Since the warehouse is highly
integrated, different workgroups often share common tables such as customer, employee,
and item lists.
G High data integrity . A single, integrated data repository for the entire enterprise would
naturally avoid all data integrity issues that result from duplicate copies and versions of the
same business data.
Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks, including:
G Lengthy implementation cycle. With the complete warehouse environment developed
simultaneously, many components of the warehouse become daunting tasks, such as
analyzing all of the source systems and developing the target data model. Even minor
tasks, such as defining how to measure profit and establishing naming conventions,
snowball into major issues.
G Substantial up-front costs . Many analysts who have studied the costs of this approach
agree that this type of effort nearly always runs into the millions. While this level of
investment is often justified, the problem lies in the delay between the investment and the
delivery of value back to the business.
G Scope too broad . The centralized data warehouse requires a single database to satisfy
the needs of the entire organization. Attempts to develop an enterprise-wide warehouse
using this approach have rarely succeeded, since the goal is simply too ambitious. As a
result, this wide scope has been a strong contributor to project failure.
G Impact on the operational systems . Different tables within the warehouse often read
data from the same source tables, but manipulate it differently before loading it into the
targets. Since the centralized approach extracts data directly from the operational
systems, a source table that feeds into three different target tables is queried three times
to load the appropriate target tables in the warehouse. When combined with all the other
loads for the warehouse, this can create an unacceptable performance hit on the
operational systems.
Independent Data Mart
The second warehousing approach is the independent data mart, which gained popularity in 1996
when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the
same principles as the centralized approach, but it scales down the scope from solving the
warehousing needs of the entire company to the needs of a single department or workgroup.
Much like the centralized data warehouse, an independent data mart extracts data directly from the
operational sources, manipulates the data according to the business rules, and loads a single
target database serving as the independent data mart. In some cases, the operational data may be
staged in an Operational Data Store (ODS) and then moved to the mart.

Advantages
The independent data mart is the logical opposite of the centralized data warehouse. The
disadvantages of the centralized approach are the strengths of the independent data mart:
G Impact on operational databases localized . Because the independent data mart is
trying to solve the DSS needs of a single department or workgroup, only the few
operational databases containing the information required need to be analyzed.
G Reduced scope of the data model . The target data modeling effort is vastly reduced
since it only needs to serve a single department or workgroup, rather than the entire
company.
G Lower up-front costs . The data mart is serving only a single department or workgroup;
thus hardware and software costs are reduced.
G Fast implementation . The project can be completed in months, not years. The process
of defining business terms and naming conventions is simplified since "players from the
same team" are working on the project.
Disadvantages
Of course, independent data marts also have some significant disadvantages:
G Lack of centralized control . Because several independent data marts are needed to
solve the decision support needs of an organization, there is no centralized control. Each
data mart or project controls itself, but there is no central control from a single location.
G Redundant data . After several data marts are in production throughout the organization,
all of the problems associated with data redundancy surface, such as inconsistent
definitions of the same data object or timing differences that make reconciliation
impossible.
G Metadata integration . Due to their independence, the opportunity to share metadata - for
example, the definition and business rules associated with the Invoice data object - is lost.
Subsequent projects must repeat the development and deployment of common data
objects.
G Manageability . The independent data marts control their own scheduling routines and
therefore store and report their metadata differently, with a negative impact on the
manageability of the data warehouse. There is no centralized scheduler to coordinate the
individual loads appropriately or metadata browser to maintain the global metadata and
share development work among related projects.
Dependent Data Marts (Federated Data Warehouses)
The third warehouse architecture is the dependent data mart approach supported by the hub-and-
spoke architecture of PowerCenter and PowerMart. After studying more than one hundred different
warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of the
centralized data warehouse and independent data mart.
The more general term being adopted to describe this approach is the "federated data warehouse."
Industry analysts have recognized that, in many cases, there is no "one size fits all" solution.
Although the goal of true enterprise architecture, with conformed dimensions and strict standards,
is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated
data warehouse was born. It allows for the relatively independent development of data marts, but
leverages a centralized PowerCenter repository for sharing transformations, source and target
objects, business rules, etc.
Recent literature describes the federated architecture approach as a way to get closer to the goal
of a truly centralized architecture while allowing for the practical realities of most organizations. The
centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the
organization can develop semi-autonomous data marts, so long as they subscribe to a common
view of the business. This common business model is the fundamental, underlying basis of the
federated architecture, since it ensures consistent use of business terms and meanings throughout
the enterprise.
With the exception of the rare case of a truly independent data mart, where no future growth is
planned or anticipated, and where no opportunities for integration with other business areas exist,
the federated data warehouse architecture provides the best framework for building a data
integration solution.
Informatica's PowerCenter and PowerMart products provide an essential capability for supporting
the federated architecture: the shared Global Repository. When used in conjunction with one or
more Local Repositories, the Global Repository serves as a sort of "federal" governing body,
providing a common understanding of core business concepts that can be shared across the semi-
autonomous data marts. These data marts each have their own Local Repository, which typically
include a combination of purely local metadata and shared metadata by way of links to the Global
Repository.

This environment allows for relatively independent development of individual data marts, but also
supports metadata sharing without obstacles. The common business model and names described
above can be captured in metadata terms and stored in the Global Repository. The data marts use
the common business model as a basis, but extend the model by developing departmental
metadata and storing it locally.
A typical characteristic of the federated architecture is the existence of an Operational Data Store
(ODS). Although this component is optional, it can be found in many implementations that extract
data from multiple source systems and load multiple targets. The ODS was originally designed to
extract and hold operational data that would be sent to a centralized data warehouse, working as a
time-variant database to support end-user reporting directly from operational systems. A typical
ODS had to be organized by data subject area because it did not retain the data model from the
operational system.
Informatica's approach to the ODS, by contrast, retains virtually the same data model as the
operational system, so it need not be organized by subject area. The ODS does not permit direct
end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of
the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation
functions than a traditional ODS.
Advantages
The Federated architecture brings together the best features of the centralized data warehouse
and independent data mart:
G Room for expansion. While the architecture is designed to quickly deploy the initial data
mart, it is also easy to share project deliverables across subsequent data marts by
migrating local metadata to the Global Repository. Reuse is built in.
G Centralized control. A single platform controls the environment from development to test
to production. Mechanisms to control and monitor the data movement from operational
databases into the data integration environment are applied across the data marts, easing
the system management task.
G Consistent metadata. A Global Repository spans all the data marts, providing a
consistent view of metadata.
G Enterprise view. Viewing all the metadata from a central location also provides an
enterprise view, easing the maintenance burden for the warehouse administrators.
Business users can also access the entire environment when necessary (assuming that
security privileges are granted).
G High data integrity. Using a set of integrated metadata repositories for the entire
enterprise removes data integrity issues that result from duplicate copies of data.
G Minimized impact on operational systems. Frequently accessed source data, such as
customer, product, or invoice records, is moved into the decision support environment
once, leaving the operational systems unaffected by the number of target data marts.
Disadvantages
Disadvantages of the federated approach include:
G Data propagation. This approach moves data twice: first to the ODS, then into the individual
data marts. This requires extra database space to store the staged data, as well as extra
time to move the data. However, the disadvantage can be mitigated by not saving the data
permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or
a rolling three months of data can be saved.
G Increased development effort during initial installations. For each target table,
a load must be developed from the ODS to the target, in addition to the
loads from the operational sources into the ODS.
Operational Data Store
Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is
not organized by subject area and is not customized for viewing by end users or even for reporting.
The primary focus of the ODS is to provide a clean, consistent set of operational data for creating
and refreshing data marts. Separating out this function allows the ODS to provide more reliable
and flexible support.
Data from the various operational sources is staged for subsequent extraction by target systems in
the ODS. In the ODS, data is cleaned and remains normalized, tables from different databases are
joined, and a refresh policy is carried out (a change-data-capture facility may be used to schedule ODS
refreshes, for instance).
The ODS and the data marts may reside in a single database or be distributed across several
physical databases and servers.
Characteristics of the Operational Data Store are:
G Normalized
G Detailed (not summarized)
G Integrated
G Cleansed
G Consistent
Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number
of ways:
G Normalizes data where necessary (such as non-relational mainframe data), preparing it for
storage in a relational system.
G Cleans data by enforcing commonalities in dates, names, and other data types that appear
across multiple systems.
G Maintains reference data to help standardize other formats; references might range from
zip codes and currency conversion rates to product-code-to-product-name translations.
The ODS may apply fundamental transformations to some database tables in order to
reconcile common definitions, but the ODS is not intended to be a transformation
processor for end-user reporting requirements.
Its role is to consolidate detailed data within common formats. This enables users to create a wide
variety of data integration reports, with confidence that those reports will be based on the same
detailed data, using common definitions and formats.
The following table compares the key differences in the three architectures:

Feature                 Centralized Data Warehouse   Independent Data Mart   Federated Data Warehouse
Centralized Control     Yes                          No                      Yes
Consistent Metadata     Yes                          No                      Yes
Cost Effective          No                           Yes                     Yes
Enterprise View         Yes                          No                      Yes
Fast Implementation     No                           Yes                     Yes
High Data Integrity     Yes                          No                      Yes
Immediate ROI           No                           Yes                     Yes
Repeatable Process      No                           Yes                     Yes
The Role of Enterprise Architecture
The federated architecture approach allows for the planning and implementation of an enterprise
architecture framework that addresses not only short-term departmental needs, but also the long-
term enterprise requirements of the business. This does not mean that the entire architectural
investment must be made in advance of any application development. However, it does mean that
development is approached within the guidelines of the framework, allowing for future growth
without significant technological change. The remainder of this chapter will focus on the process of
designing and developing a data integration solution architecture using PowerCenter as the
platform.
Fitting Into the Corporate Architecture
Very few organizations have the luxury of creating a "green field" architecture to support their
decision support needs. Rather, the architecture must fit within an existing set of corporate
guidelines regarding preferred hardware, operating systems, databases, and other software. The
Technical Architect, if not already an employee of the organization, should ensure that he or she has
a thorough understanding of the existing technical infrastructure and of its future vision. Doing so
helps avoid developing an elegant technical solution that will never be implemented
because it defies corporate standards.


Last updated: 12-Feb-07 15:22
Development FAQs
Challenge
Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution.
While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions
that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup
Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with
PowerCenter for additional information.
Description
The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.
Mapping Design
Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)
In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-
width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform
intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which
allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and
custom SQL SELECTs where appropriate.
Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by
a single map?)
With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you
can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of
complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can
also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to
multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple
targets, and to multiple sessions running simultaneously.
Q: What are some considerations for determining how many objects and transformations to include in a single mapping?
The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement.
Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging
and better understandability, as well as to create potential partition points. This should be balanced against the fact that more
objects means more overhead for the DTM process.
It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to
use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the
WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to
increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.
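To illustrate, the filter condition entered on the Source Qualifier becomes part of the WHERE clause of the SELECT statement that PowerCenter generates against the source database. A simplified sketch of the effect (table names, column names, and the date format are hypothetical and database-dependent):

-- Default query with no source filter: every row is read into the mapping
SELECT order_id, customer_id, order_date, order_amount
FROM orders

-- With the filter condition ORDER_DATE >= '2007-01-01', unneeded rows never leave the database
SELECT order_id, customer_id, order_date, order_amount
FROM orders
WHERE order_date >= '2007-01-01'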
Log File Organization
Q: How does PowerCenter handle logs?
The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To
perform the logging function, the Service Manager runs a Log Manager and a Log Agent.
The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain
operations and application services. The log events contain operational and error messages for a domain. The Service
Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it
generates log event files, which can be viewed in the Administration Console.
The Log Agent runs on the nodes to collect and process log events for session and workflows. Log events for workflows include
information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for
sessions include information about the tasks performed by the Integration Service, session errors, and load summary and
transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the
Workflow Monitor.
Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log
events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or
application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.
Q: Where can I view the logs?
Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays
domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error
messages.
Q: Where is the best place to maintain Session Logs?
One often-recommended location is a shared directory that is accessible to the gateway node. If you have more than
one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed
in the Administration Console.
If you have more than one PowerCenter domain, you must configure a different directory path for each domain's Log Manager.
Multiple domains cannot use the same shared directory path.
For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.
Q: What documentation is available for the error codes that appear within the error log files?
Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting
Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific
errors, consult your Database User Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?
Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the
warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group
can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the
targets.
Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either.
G A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure
that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2
when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next
session only if the previous session was successful, or to stop on errors, etc.
G A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at
one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multi-
processing (SMP) architecture.
Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse.
This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.
Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?
No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is
possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow
fails, first recover and then restart that workflow from the Workflow Monitor.
Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?
Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow
From Task."
Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?
Workflow execution needs to be planned around two main constraints:
G Available processors
G Available memory
The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The
load manager is always running as a process. If bottlenecks with regards to I/O and network are addressed, a session will be
compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive,
so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a
session needs about 120 percent of a processor for the DTM, reader, and writer in total.
For concurrent sessions:
One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what
number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.
If possible, sessions should run at "off-peak" hours to have as many available resources as possible.
Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory
usage is more difficult than the processors calculation; it tends to vary according to system load and number of PowerCenter
sessions running.
The first step is to estimate memory usage, accounting for:
G Operating system kernel and miscellaneous processes
G Database engine
G Informatica Load Manager
Next, each session being run needs to be examined with regard to the memory usage, including the DTM buffer size and any
cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters and joiners.
At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the
production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may
be able to run only one large session, or several small sessions concurrently.
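As a simple illustration of the arithmetic (all figures here are hypothetical and should be replaced with measured values): on a server with 8GB of physical memory, reserving roughly 1GB for the operating system kernel and miscellaneous processes, 2GB for a co-located database engine, and 0.5GB for the Informatica server processes leaves about 4.5GB for sessions. If a typical session uses a 24MB DTM buffer plus 200MB of lookup and aggregator caches (roughly 225MB in total), then about 20 such sessions fit in memory; on a four-processor server, however, the one-session-per-processor guideline above is likely to be the tighter constraint.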
Load-order dependencies are also an important consideration because they often create additional constraints. For example,
load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may
become saturated if overloaded; and some target tables may need to be available to end users earlier than others.
Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify
the Server Administrator?
The application level of event notification can be accomplished through post-session email. Post-session email allows you to
create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session
fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics
about the session. You can use the following variables in the text of your post-session email:
Email Variable Description
%s Session name
%l Total records loaded
%r Total records rejected
%e Session status
%t Table details, including read throughput in bytes/second and write throughput in
rows/second
%b Session start time
%c Session completion time
%i Session elapsed time (session completion time-session start time)
%g Attaches the session log to the message
%m Name and version of the mapping used in the session
%d Name of the folder containing the session
%n Name of the repository containing the session
%a<filename> Attaches the named file. The file must be local to the Informatica Server. The
following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>
On Windows NT, you can attach a file of any type.
On UNIX, you can only attach text files. If you attach a non-text file, the send may
fail.
Note: The filename cannot include the Greater Than character (>) or a line break.
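For example, a success-message body can combine several of these variables (illustrative only; any combination can be used):

Session %s completed with status %e.
Rows loaded: %l, rows rejected: %r.
Started %b, completed %c, elapsed time %i.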
The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter
server must have the rmail tool installed in the path in order to send email.
To verify the rmail tool is accessible:
1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail
resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.
The output should look like the following:
Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)   Table Name   Status
1             0               30                           1                            t_Q3_sales   No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to any text pager that accepts email.
Backup Strategy Recommendation
Q: Can individual objects within a repository be restored from the backup or from a prior version?
At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you
can only restore the entire repository). However, it is possible to restore the backup repository into a different database and then
manually copy the individual objects back into the main repository.
It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To
correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.
An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of
individual objects, mappings, tasks, workflows, etc.
Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA,
and production environments.
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other
significant event occurs?
The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the
Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by
another user. Notification messages are received through the PowerCenter Client tools.
Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?
The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.
Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility
provides the following information:
G CPID - Creator PID (process ID)
G LPID - Last PID that accessed the resource
G Semaphores - used to sync the reader and writer
G 0 or 1 - shows slot in LM shared memory
A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT
documentation.
Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?
If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this
is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started
correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.
Custom Transformations
Q: What is the relationship between the Java or SQL transformation and the Custom transformation?
Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations
operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality.
Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator,
and many others. Below is a summary of noticeable differences.
Transformation   # of Input Groups   # of Output Groups   Type
Custom           Multiple            Multiple             Active/Passive
HTTP             One                 One                  Passive
Java             One                 One                  Active/Passive
SQL              One                 One                  Active/Passive
Union            Multiple            One                  Active
XML Parser       One                 Multiple             Active
XML Generator    Multiple            One                  Active
For further details, please see the Transformation Guide.
Q: What is the main benefit of a Custom transformation over an External Procedure transformation?
A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation
handles both the input and output simultaneously. Additionally, an External Procedure transformation's parameters consist of all
the ports of the transformation.
The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to
be processed before any output rows are produced.
Q: How do I change a Custom transformation from Active to Passive, or vice versa?
After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type,
delete and recreate the transformation.
Q: What is the difference between active and passive Java transformations? When should one be used over the other?
An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive
Java transformation only allows for the generation of one output row per input row.
Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports
that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use
passive when you need one output row for each input.
Q: What are the advantages of a SQL transformation over a Source Qualifier?
A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete,
update, and retrieve rows from a database. For example, you might need to create database tables before adding new
transactions. The SQL transformation allows for the creation of these tables from within the workflow.
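As an illustration, the kind of statement the SQL transformation might issue in this scenario is ordinary DDL; the table and column names below are hypothetical:

CREATE TABLE txn_stage_daily
(
    txn_id     INTEGER,
    txn_date   DATE,
    txn_amount DECIMAL(12,2)
)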
Q: What is the difference between the SQL transformations Script and Query modes?
Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a
query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters.
For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may
be extracted from the PowerCenter repository and used in others?
With PowerCenter, you can enter description information for all repository objects, sources, targets, transformations, etc, but the
amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column
level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and
primary keys are stored in the repository.
The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to
enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. Therefore, this
decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata.
There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party
metadata software and, for sources and targets, data modeling tools.
Q: What procedures exist for extracting metadata from the repository?
Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store,
retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata
Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.
Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and
MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because the structure of the
repository tables can change between PowerCenter releases, resulting in a maintenance task for you. Rather, views have
been created to provide access to the metadata stored in the repository.
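For example, a simple query against the MX views might list the mappings in each folder. The view and column names below are assumptions used for illustration; verify them against the Repository Guide for your PowerCenter version:

SELECT subject_area, mapping_name
FROM rep_all_mappings
ORDER BY subject_area, mapping_name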
Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository
database and are able to present reports to the end user and/or management.
Versioning
Q: How can I keep multiple copies of the same object within PowerCenter?
A: With PowerCenter, you can use version control to maintain previous copies of every changed object.
You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an
object, control development of the object, and track changes. You can configure a repository for versioning when you create it,
or you can upgrade an existing repository to support versioned objects.
When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object
has an active status.
You can perform the following tasks when you work with a versioned object:
G View object version properties. Each versioned object has a set of version properties and a status. You can also
configure the status of a folder to freeze all objects it contains or make them active for editing.
G Track changes to an object. You can view a history that includes all versions of a given object, and compare any
version of the object in the history to any other version. This allows you to determine changes made to an object over
time.
G Check the object version in and out. You can check out an object to reserve it while you edit the object. When you
check in an object, the repository saves a new version of the object and allows you to add comments to the version.
You can also find objects checked out by yourself and other users.
G Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You
can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from
the repository.
Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time
on making a list of all changed/affected objects?
A: Yes there is.
You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can
create the following types of deployment groups:
G Static. You populate the deployment group by manually selecting objects.
G Dynamic. You use the result set from an object query to populate the deployment group.
To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment
group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of
deployment. You can associate an object query with a deployment group when you edit or create a deployment group.
If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another.
Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source
repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects
to copy, rather than the entire contents of a folder.
Performance
Q: Can PowerCenter sessions be load balanced?
A: Yes, if the grid option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to
Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the
best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes.
Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels
to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console's domain
properties.
For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.
Web Services
Q: How does Web Services Hub work in PowerCenter?
A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients
that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and
Repository Service through the Web Services Hub.
The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter
installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For
more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide.
The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log
in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name
and password. You can use the Web Services Hub console to view service information and download Web Services
Description Language (WSDL) files necessary for running services and workflows.


Last updated: 01-Feb-07 18:53
Event Based Scheduling
Challenge
In an operational environment, the beginning of a task often needs to be triggered by
some event, either internal or external to the Informatica environment. In versions of
PowerCenter prior to version 6.0, this was achieved through the use of indicator files. In
PowerCenter 6.0 and forward, it is achieved through the use of the Event-Raise and
Event-Wait Workflow and Worklet tasks, as well as indicator files.
Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved
through the use of indicator files. Users specified the indicator file configuration in the
session configuration under advanced options. When the session started, the
PowerCenter Server looked for the specified file name; if it wasn't there, it waited until it
appeared, then deleted it, and triggered the session.
In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and
Event-Raise tasks. These tasks can be used to define task execution order within a
workflow or worklet. They can even be used to control sessions across workflows.
G An Event-Raise task triggers a user-defined event at a specific point in a workflow or worklet.
G An Event-Wait task waits for an event to occur within a workflow. After the
event triggers, the PowerCenter Server continues executing the workflow from
the Event-Wait task forward.
The following paragraphs describe the events that an Event-Wait task can wait for.
Waiting for Pre-Defined Events
To use a pre-defined event, you need a session, shell command, script, or batch file to
create an indicator file. You must create the file locally or send it to a directory local to
the PowerCenter Server. The file can be any format recognized by the PowerCenter
Server operating system. You can choose to have the PowerCenter Server delete the
indicator file after it detects the file, or you can manually delete the indicator file. The
PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot
delete the indicator file.
When you specify the indicator file in the Event-Wait task, specify the directory in which
the file will appear and the name of the indicator file. Do not use either a source or
target file name as the indicator file name. You must also provide the absolute path for
the file and the directory must be local to the PowerCenter Server. If you only specify
the file name, and not the directory, Workflow Manager looks for the indicator file in the
system directory. For example, on Windows NT, the system directory is C:\winnt\system32.
You can enter the actual name of the file or use server variables to specify
the location of the files. The PowerCenter Server writes the time the file appears in the
workflow log.
Follow these steps to set up a pre-defined event in the workflow:
1. Create an Event-Wait task and double-click the Event-Wait task to open the
Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects
the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
Pre-defined Event
A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait
task to instruct the PowerCenter Server to wait for the specified indicator file to appear
before continuing with the rest of the workflow. When the PowerCenter Server locates
the indicator file, it starts the task downstream of the Event-Wait.
User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise
task triggers the event at one point of the workflow/worklet. If an Event-Wait task is
configured in the same workflow/worklet to listen for that event, then execution will
continue from the Event-Wait task forward.
The following is an example of using user-defined events:
Assume that you have four sessions that you want to execute in a workflow. You want
P1_session and P2_session to execute concurrently to save time. You also want to
execute Q3_session after P1_session completes. You want to execute Q4_session
only when P1_session, P2_session, and Q3_session complete. Follow these steps:
1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session
3. Declare an event called P1Q3_Complete in the Events tab of the workflow
properties
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This
allows the Event-Raise task to trigger the event when P1_session and
Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task. When the PowerCenter Server
processes the Event-Wait task, it waits until the Event-Raise task triggers
P1Q3_Complete before it executes Q4_session.
The PowerCenter Server executes the workflow in the following order:
1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event,
P1Q3_Complete, has been triggered.
Be sure to take care in setting the links, though. If they are left at the default and Q3_session
fails, the Event-Raise will never happen. The Event-Wait will then wait forever and the
workflow will run until it is stopped. To avoid this, enable the workflow option "Suspend
on Error". With this option, if a session fails, the whole workflow goes into suspended
mode and can send an email to notify developers.


Last updated: 01-Feb-07 18:53
Key Management in Data Warehousing
Solutions
Challenge
Key management refers to the technique that manages key allocation in a decision
support RDBMS to create a single view of reference data from multiple sources.
Informatica recommends a key management approach that ensures that everything
extracted from a source system is loaded into the data warehouse.
This Best Practice provides some tips for employing the Informatica-recommended
approach to key management. This approach deviates from many traditional data
warehouse solutions, which apply logical and data warehouse (surrogate) key strategies
under which transactions with referential integrity problems are rejected and routed to
error handling.
Description
Key management in a decision support RDBMS comprises three techniques for
handling the following common situations:
G Key merging/matching
G Missing keys
G Unknown keys
All three techniques are applicable to a Reference Data Store, whereas only the missing-key
and unknown-key techniques are relevant for an Operational Data Store (ODS). Key management
should be handled at the data integration level, thereby making it transparent to the
Business Intelligence layer.
Key Merging/Matching
When companies source data from more than one transaction system of a similar type,
the same object may have different, non-unique legacy keys. Additionally, a single key
may have several descriptions or attributes in each of the source systems. The
independence of these systems can result in incongruent coding, which poses a
greater problem than records being sourced from multiple systems.
A business can resolve this inconsistency by undertaking a complete code
standardization initiative (often as part of a larger metadata management effort) or by
applying a Universal Reference Data Store (URDS). Code standardization requires each
object to be uniquely represented in the new system. Alternatively, a URDS contains
universal codes for common reference values. Most companies adopt this pragmatic
approach while embarking on the longer-term solution of code standardization.
The bottom line is that nearly every data warehouse project encounters this issue and
needs to find a solution in the short term.
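As a sketch of the URDS idea (the structures and names below are hypothetical), a cross-reference table maps each source system's legacy key to a single universal code, which is then carried through the warehouse:

SELECT x.universal_cust_code,
       s.source_system,
       s.legacy_cust_code,
       s.cust_name
FROM   src_customer s
JOIN   cust_key_xref x
  ON   x.source_system    = s.source_system
 AND   x.legacy_cust_code = s.legacy_cust_code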
Missing Keys
A problem arises when a transaction is sent through without a value in a column where
a foreign key should exist (i.e., a reference to a key in a reference table). This normally
occurs during the loading of transactional data, although it can also occur when loading
reference data into hierarchy structures. In many older data warehouse solutions, this
condition would be identified as an error and the transaction row would be rejected.
The row would have to be processed through some other mechanism to find the correct
code and loaded at a later date. This is often a slow and cumbersome process that
leaves the data warehouse incomplete until the issue is resolved.
The more practical way to resolve this situation is to allocate a special key in place of
the missing key, which links it with a dummy 'missing key' row in the related table. This
enables the transaction to continue through the loading process and end up in the
warehouse without further processing. Furthermore, the row ID of the bad transaction
can be recorded in an error log, allowing the addition of the correct key value at a later
time.
The major advantage of this approach is that any aggregate values derived from the
transaction table will be correct because the transaction exists in the data warehouse
rather than being in some external error processing file waiting to be fixed.
Simple Example:
PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224               1          35,000
In the transaction above, there is no code in the SALES REP column. As this row is
processed, a dummy sales rep key (9999999) is added to the record to link it to the
'Missing Rep' record in the SALES REP table. A data warehouse key (8888888) is also
added to the transaction.
PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   9999999     1          35,000       8888888
The related sales rep record may look like this:
REP CODE   REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep
An error log entry to identify the missing key on this transaction may look like:
ERROR CODE   TABLE NAME   KEY NAME    KEY
MSGKEY       ORDERS       SALES REP   8888888
This type of error reporting is not usually necessary because the transactions with
missing keys can be identified using standard end-user reporting tools against the data
warehouse.
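Expressed in SQL terms purely for illustration (in a PowerCenter mapping this logic is normally implemented with a lookup and an expression; the table and column names are hypothetical), the substitution amounts to:

SELECT o.product,
       o.customer,
       COALESCE(o.sales_rep, 9999999) AS sales_rep,  -- substitute the dummy 'Missing Rep' key
       o.quantity,
       o.unit_price
FROM   orders_feed o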
Unknown Keys
Unknown keys need to be treated much like missing keys except that the load process
has to add the unknown key value to the referenced table to maintain integrity rather
than explicitly allocating a dummy key to the transaction. The process also needs to
make two error log entries: the first logs the fact that a new, unknown key has
been added to the reference table; the second records the transaction in which the
unknown key was found.
Simple example:
The sales rep reference data record might look like the following:
DWKEY     REP NAME      REP MANAGER
1234567   David Jones   Mark Smith
7654321   Mark Smith
9999999   Missing Rep
A transaction comes into the ODS with the record below:
PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224   2424242     1          35,000
In the transaction above, the code 2424242 appears in the SALES REP column. As
this row is processed, a new row has to be added to the Sales Rep reference table.
This allows the transaction to be loaded successfully.
DWKEY     REP NAME   REP MANAGER
2424242   Unknown
A data warehouse key (8888889) is also added to the transaction.
PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   2424242     1          35,000       8888889
Some warehouse administrators like to have an error log entry generated to identify the
addition of a new reference table entry. This can be achieved simply by adding the
following entry to an error log.
ERROR CODE   TABLE NAME   KEY NAME    KEY
NEWROW       SALES REP    SALES REP   2424242
A second log entry can be added with the data warehouse key of the transaction in
which the unknown key was found.
ERROR CODE   TABLE NAME   KEY NAME    KEY
UNKNKEY      ORDERS       SALES REP   8888889
As with missing keys, error reporting is not essential because the unknown status is
clearly visible through the standard end-user reporting.
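Again expressed in SQL terms purely for illustration (in practice this is handled with a lookup and a separate insert path in the mapping; names are hypothetical), the load process adds a placeholder row for any key it has not seen before:

INSERT INTO sales_rep (dwkey, rep_name, rep_manager)
SELECT DISTINCT t.sales_rep, 'Unknown', NULL
FROM   orders_feed t
WHERE  t.sales_rep IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM sales_rep r WHERE r.dwkey = t.sales_rep)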
Moreover, regardless of the error logging, the system is self-healing because the newly
added reference data entry will be updated with full details as soon as these changes
appear in a reference data feed.
This would result in the reference data entry looking complete.
DWKEY     REP NAME      REP MANAGER
2424242   David Digby   Mark Smith
Employing the Informatica-recommended key management strategy produces the
following benefits:
G All rows can be loaded into the data warehouse
G All objects are allocated a unique key
G Referential integrity is maintained
G Load dependencies are removed


Last updated: 01-Feb-07 18:53
Mapping Design
Challenge
Optimizing PowerCenter to create an efficient execution environment.
Description
Although PowerCenter environments vary widely, most sessions and/or mappings can
benefit from the implementation of common objects and optimization procedures.
Follow these procedures and rules of thumb when creating mappings to help ensure
optimization.
General Suggestions for Optimizing
1. Reduce the number of transformations. There is always overhead involved in
moving data between transformations.
2. Consider more shared memory for a large number of transformations. Session
shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times.

H Avoid calculating or testing the same value over and over.
H Calculate it once in an expression, and set a True/False flag.
H Within an expression, use variable ports to calculate a value that can
be used multiple times within that transformation.
4. Only connect what is used.

H Delete unnecessary links between transformations to minimize the
amount of data moved, particularly in the Source Qualifier.
H This is also helpful for maintenance. If a transformation needs to be
reconnected, it is easier when only the necessary ports are set as input
and output.
H In lookup transformations, change unused ports to be neither input nor
output. This makes the transformations cleaner looking. It also keeps
the generated lookup SQL as small as possible, which cuts down on
the amount of cache necessary and thereby improves performance
(see the illustration following this list).
5. Watch the data types.

H The engine automatically converts compatible types; data types are
converted whenever types differ between connected ports.
H This implicit conversion can become excessive. Minimize data type
changes between transformations by planning the data flow prior to
developing the mapping.
6. Facilitate reuse.

H Plan for reusable transformations upfront.
H Use variables. Use both mapping variables and ports that are
variables. Variable ports are especially beneficial when they can be
used to calculate a complex expression or perform a disconnected
lookup call only once instead of multiple times.
H Use mapplets to encapsulate multiple reusable transformations.
H Use mapplets to leverage the work of critical developers and minimize
mistakes when performing similar functions.
7. Only manipulate data that needs to be moved and transformed.

H Reduce the number of non-essential records that are passed through
the entire mapping.
H Use active transformations that reduce the number of records as early
in the mapping as possible (i.e., placing filters, aggregators as close to
source as possible).
H Select the appropriate driving/master table when using joins. The table
with the smaller number of rows should be the driving/master table for a
faster join.
8. Utilize single-pass reads.

H Redesign mappings to utilize one Source Qualifier to populate multiple
targets. This way the server reads this source only once. If you have
different Source Qualifiers for the same source (e.g., one for delete
and one for update/insert), the server reads the source for each Source
Qualifier.
H Remove or reduce field-level stored procedures.
9. Utilize Pushdown Optimization.
H Design mappings so they can take advantage of the Pushdown
Optimization feature. This improves performance by allowing the
source and/or target database to perform the mapping logic.
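As referenced in item 4 above, disconnecting unused lookup ports keeps the SQL that PowerCenter generates for the lookup cache as small as possible. A sketch of the difference, with hypothetical table and column names:

-- All ports left active: every column is selected and cached
SELECT cust_key, cust_name, cust_address, cust_city, cust_region, cust_phone, cust_segment
FROM customer_dim

-- Only the needed ports active: a much smaller SELECT and cache
SELECT cust_key, cust_segment
FROM customer_dim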
Lookup Transformation Optimizing Tips
1. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10 to 20
percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true
if the standard row byte count is 1,024 or less. If the row byte count is more
than 1,024, then you need to adjust the 500K-row standard down as the
number of bytes increases (e.g., a 2,048-byte row can drop the cache row count
to between 250K and 300K, so the lookup table should not be cached in this
case). This is just a general rule, though. Try running the session with the large
lookup both cached and uncached; caching is often faster even on very large lookup
tables.
3. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator = first in the list of conditions
under the condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20
percent of the lookup table rows. For a smaller number of lookup calls, do not
cache if the number of lookup table rows is large. For small lookup tables (i.e.,
fewer than 5,000 rows), cache when there are more than 5 to 10 lookup calls.
5. Replace lookup with decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable the
"update else insert" option on the dynamic cache and the engine never has to
go back to the database to read data from this table. You can also partition this
persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting
whenever possible. Also use this setting if the lookup is being performed to
determine that a match exists, but the value returned is irrelevant. The lookup
creates an index based on the key ports rather than all lookup transformation
ports. This simplified indexing process can improve performance.
9. Review complex expressions.
H Examine mappings via Repository Reporting and Dependency
Reporting within the mapping.
H Minimize aggregate function calls.
H Replace Aggregate Transformation object with an Expression
Transformation object and an Update Strategy Transformation for
certain types of Aggregations.
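As referenced in tip 5 above, a small, static lookup can often be replaced with a DECODE (or nested IIF) expression. The following sketch assumes a hypothetical COUNTRY_CODE port and illustrative code values; it is not taken from a specific project:

o_COUNTRY_NAME = DECODE(COUNTRY_CODE,
                        'US', 'United States',
                        'UK', 'United Kingdom',
                        'DE', 'Germany',
                        'Unknown')

This avoids the lookup cache entirely, which is usually the faster option when the set of values is small and stable.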
Operations and Expression Optimizing Tips
1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing; see the sketch at the end of this list).
3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookup; replace with string.
6. Test expression timing by replacing with constant.
7. Use flat files.
H Using flat files located on the server machine loads faster than reading
from a database located on the server machine.
H Fixed-width files are faster to load than delimited files because
delimited files require extra parsing.
H If processing intricate transformations, consider first loading the
source flat file into a relational database, which allows the
PowerCenter mappings to access the data in an optimized fashion by
using filters and custom SQL SELECTs where appropriate.
8. If working with sources that cannot return sorted data (e.g., web logs),
consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter
Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an
Aggregator Transformation to optimize the aggregate. With a Sorter
Transformation, the Sorted Ports option can be used even if the original source
cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider
filtering before the update strategy if retaining these rows is not critical because
logging causes extra overhead on the engine. Choose the option in the update
strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the
smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup
transformation just in front of the target to retrieve the primary key. The primary
key update is much faster than the non-indexed lookup override.
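To illustrate tips 2 and 3 above, the following Expression-transformation snippet assumes hypothetical string ports CUST_NAME, CUST_NAME_SRC, FIRST_NAME, and LAST_NAME; the port names are placeholders only:

-- trim trailing spaces before comparing CHAR and VARCHAR values
v_NAMES_MATCH = IIF(RTRIM(CUST_NAME) = RTRIM(CUST_NAME_SRC), 'Y', 'N')
-- the || operator is evaluated faster than an equivalent CONCAT() call
o_FULL_NAME = FIRST_NAME || ' ' || LAST_NAME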
Suggestions for Using Mapplets
A mapplet is a reusable object that represents a set of transformations. It allows you to
reuse transformation logic and can contain as many transformations as necessary. Use
the Mapplet Designer to create mapplets.

Mapping Templates
Challenge
Mapping Templates demonstrate proven solutions for tackling challenges that
commonly occur during data integration development efforts. Mapping Templates can
be used to make the development phase of a project more efficient. Mapping
Templates can also serve as a medium to introduce development standards into the
mapping development process that developers need to follow.
A wide array of Mapping Template examples can be obtained for the most current
PowerCenter version from the Informatica Customer Portal. As "templates," each of the
objects in Informatica's Mapping Template Inventory illustrates the transformation logic
and steps required to solve specific data integration requirements. These sample
templates, however, are meant to be used as examples, not as a means to implement
development standards.
Description

Reuse Transformation Logic
Templates can be heavily used in a data integration and warehouse environment, when
loading information from multiple source providers into the same target structure, or
when similar source system structures are employed to load different target instances.
Using templates guarantees that transformation logic that has been developed and tested
correctly once can be applied successfully across multiple mappings as needed. In
some instances, when the source/target structures have the same attributes, the process
can be simplified further by creating multiple instances of the session, each with its own
connection/execution attributes, instead of duplicating the mapping.
Implementing Development Techniques
When the goal is not simply to duplicate transformation logic that loads the same target,
Mapping Templates can still help to reproduce transformation techniques. In this case, the
implementation process requires more than just replacing source/target transformations.
This scenario is most useful when certain logic (i.e., a logical group of transformations) is
employed across mappings. In many instances this can be simplified further by making use
of mapplets. Additionally, user-defined functions can be utilized to reuse expression logic
and build complex expressions using the transformation language.
Transport Mechanism
Once Mapping Templates have been developed, they can be distributed by any of the
following procedures:
G Copy mapping from development area to the desired repository/folder
G Export mapping template into XML and import to the desired repository/folder.
Mapping Template Examples
The following Mapping Templates can be downloaded from the Informatica Customer
Portal and are listed by subject area:
Common Data Warehousing Techniques
G Aggregation using Sorted Input
G Tracking Dimension History
G Constraint-Based Loading
G Loading Incremental Updates
G Tracking History and Current
G Inserts or Updates
Transformation Techniques
G Error Handling Strategy
G Flat File Creation with Headers and Footers
G Removing Duplicate Source Records
G Transforming One Record into Multiple Records
G Dynamic Caching
G Sequence Generator Alternative
G Streamline a Mapping with a Mapplet
G Reusable Transformations (Customers)
G Using a Sorter
G Pipeline Partitioning Mapping Template
G Using Update Strategy to Delete Rows
G Loading Heterogeneous Targets
G Load Using External Procedure
Advanced Mapping Concepts
G Aggregation Using Expression Transformation
G Building a Parameter File
G Best Build Logic
G Comparing Values Between Records
G Transaction Control Transformation
Source-Specific Requirements
G Processing VSAM Source Files
G Processing Data from an XML Source
G Joining a Flat File with a Relational Table
Industry-Specific Requirements
G Loading SWIFT 942 Messages.htm
G Loading SWIFT 950 Messages.htm


Last updated: 01-Feb-07 18:53
Naming Conventions
Challenge
Choosing a good naming standard for use in the repository and adhering to it.
Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the
former. Choosing a convention and sticking with it is the key.
Having a good naming convention facilitates smooth migration and improves readability for anyone reviewing or carrying out
maintenance on the repository objects by helping them understand the processes being affected. If consistent names and descriptions
are not used, significant time may be needed to understand the working of mappings and transformation objects. If no description is
provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.
The following pages offer some suggestions for naming conventions for various repository objects. Whatever convention is chosen, it is
important to make the selection very early in the development cycle and communicate the convention to project staff working on the
repository. The policy can be enforced by peer review and at test phases by adding process to check conventions to test plans and test
execution documents.
Suggested Naming Conventions
Designer Objects and Suggested Naming Conventions
Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table.
Mapplet: mplt_{DESCRIPTION}
Target: {update_type(s)}_{TARGET_NAME}. This naming convention should only occur within a mapping, as the name of the actual target object affects the actual table that PowerCenter will access.
Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Represents data from an application source.
Custom Transformation: CT_{TRANSFORMATION}, a name that describes the processing being done.
Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
External Procedure Transformation: EXT_{PROCEDURE_NAME}
Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Joiner Transformation: JNR_{DESCRIPTION}
Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple look-ups on a single table. For unconnected look-ups, use ULKP in place of LKP.
Mapplet Input Transformation: MPLTI_{DESCRIPTOR} indicating the data going into the mapplet.
Mapplet Output Transformation: MPLTO_{DESCRIPTOR} indicating the data coming out of the mapplet.
MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR} defining the messages being selected.
Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done.
Router Transformation: RTR_{DESCRIPTOR}
Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if generating keys for a target table entity, refer to that entity.
Sorter Transformation: SRT_{DESCRIPTOR}
Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product: SQ_SALES_INSURANCE_PRODUCTS.
Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}
Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR} indicating the function of the transaction control.
Union Transformation: UN_{DESCRIPTOR}
Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping (e.g., UPD_UPDATE_EXISTING_EMPLOYEES).
XML Generator Transformation: XMG_{DESCRIPTOR} defining the target message.
XML Parser Transformation: XMP_{DESCRIPTOR} defining the messages being selected.
XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR} defining the data being selected.
Port Names
Ports names should remain the same as the source unless some other action is performed on the port. In that case, the port should be
prefixed with the appropriate name.
When the developer brings a source port into a lookup, the port should be prefixed with in_. This helps the user immediately identify
the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port
is transformed in an output port with the same name, prefix the input port with in_.
Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many
other transformations. If the autolink feature based on names is to be used, however, outputs may be better left with the name
of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix v, v_, or
var_ plus a meaningful name.
With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the
Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data
from the database.
Other transformations that are not applicable to the port standards are:
G Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
G Sequence Generator - The ports are reserved words.
G Router - Because output ports are created automatically, prefixing the input ports with an I_ prefixes the output ports with I_
as well. Port names should not have any prefix.
G Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to
rename them unless they are prefixed. Prefixed port names should be removed.
G Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in
both the input and output. The port names should not have any prefix.
All other transformation object ports can be prefixed or suffixed with:
G in_ or i_ for Input ports
G o_ or _out for Output ports
G io_ for Input/Output ports
G v, v_, or var_ for variable ports
G lkp_ for returns from lookups
G mplt_ for returns from mapplets
Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for
longer port names.
Transformation object ports can also:
G Have the Source Qualifier port name.
G Be unique.
G Be meaningful.
G Be given the target port name.
Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the Designer.
G Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select.
Should also indicate if any overrides are used. If so, it should describe the filters or settings used. Some projects prefer items
such as the SQL statement to be included in the description as well.
G Lookup Transformation Descriptions. Describe the lookup along the lines of the [lookup attribute] obtained from [lookup
table name] to retrieve the [lookup attribute name].
Where:
H Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
H Lookup table name is the table on which the lookup is being performed.
H Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition
when the lookup is actually executed.
It is also important to note lookup features such as persistent cache or dynamic lookup.
G Expression Transformation Descriptions. Must adhere to the following format:
This expression [explanation of what transformation does].
Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions
being performed.
Within each Expression, transformation ports have their own description in the format:
This port [explanation of what the port is used for].
G Aggregator Transformation Descriptions. Must adhere to the following format:
This Aggregator [explanation of what transformation does].
Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions
being performed.
Within each Aggregator, transformation ports have their own description in the format:
This port [explanation of what the port is used for].
G Sequence Generators Transformation Descriptions. Must adhere to the following format:
This Sequence Generator provides the next value for the [column name] on the [table name].
Where:
H Table name is the table being populated by the sequence number, and the
H Column name is the column within that table being populated.

G Joiner Transformation Descriptions. Must adhere to the following format:
This Joiner uses [joining field names] from [joining table names].
Where:
H
Joining field names are the names of the columns on which the join is done, and the
H
Joining table names are the tables being joined.

G Normalizer Transformation Descriptions. Must adhere to the following format:
This Normalizer [explanation].
Where:
H explanation describes what the Normalizer does.

G Filter Transformation Descriptions. Must adhere to the following format:
This Filter processes [explanation].
Where:
H explanation describes what the filter criteria are and what they do.

G Stored Procedure Transformation Descriptions. Explain the stored procedure's functionality within the mapping (i.e., what
does it return in relation to the input ports?).

G Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.

G Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an
example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions such as: Is the
currency fixed or based on other data? What kind of rate is used: a fixed inter-company rate, an inter-bank rate, a business
rate, or a tourist rate? Has the conversion gone through an intermediate currency?

G Update Strategies Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or
determined by a calculation.

G Sorter Transformation Descriptions. Explanation of the port(s) that are being sorted and their sort direction.

G Router Transformation Descriptions. Describes the groups and their functions.

G Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if
any) is expected to take place in later transformations in the mapping.

G Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of
the control to commit or rollback.

G Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure
which is used.

G External Procedure Transformation Descriptions. Describe the function of the external procedure and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the
procedure that is used.

G Java Transformation Descriptions. Describe the function of the java code and what data is expected as input and what data
is generated as output. Also indicate whether the java code determines the object to be an Active or Passive transformation.

G Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the
rank, the rank direction, and the purpose of the transformation.

G XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the
purpose of the XML being generated.

G XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser and indicate the
purpose of the transformation.

Mapping Comments
These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember
to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues
arise that need to be discussed with business analysts.
Mapplet Comments
These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions
for the input and output transformation.
Repository Objects
Repositories, as well as repository-level objects, should also have meaningful names. Repositories should be prefixed with either L_ for
local or G_ for global, plus a descriptor. Descriptors usually include information about the project and/or the level of the environment (e.g.,
PROD, TEST, DEV).
Folders and Groups
Working folder names should be meaningful and include the project name and, if there are multiple folders for one project, a
descriptor. User groups should also include the project name and descriptors, as necessary. For example, folders DW_SALES_US and
DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should be
prefixed with z_ so that they are grouped together and not confused with working production folders.
Shared Objects and Folders
Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to
facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of
copies.
Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across
the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created
by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared
before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the
maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.
If the developer has an object that he or she wants to use in several mappings or across multiple folders, such as an Expression
transformation that calculates sales tax, the developer can place the object in a shared folder and then use the object in other folders by
creating a shortcut to it. In this case, the naming convention is sc_ (e.g., sc_EXP_CALC_SALES_TAX). The folder should be
prefixed with SC_ to identify it as a shared folder and to keep all shared folders grouped together in the repository.
Workflow Manager Objects
Workflow Objects and Suggested Naming Conventions
Session: s_{MappingName}
Command Object: cmd_{DESCRIPTOR}
Worklet: wk or wklt_{DESCRIPTOR}
Workflow: wkf or wf_{DESCRIPTOR}
Email Task: email_ or eml_{DESCRIPTOR}
Decision Task: dcn_ or dt_{DESCRIPTOR}
Assign Task: asgn_{DESCRIPTOR}
Timer Task: timer_ or tmr_{DESCRIPTOR}
Control Task: ctl_{DESCRIPTOR}. Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.
Event Wait Task: wait_ or ew_{DESCRIPTOR}. Waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
Event Raise Task: raise_ or er_{DESCRIPTOR}. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.
ODBC Data Source Names
All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines.
PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the
ODBC DSN since the PowerCenter Client talks to all databases through ODBC.
Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach reduces the
chance of a discrepancy occurring when users work on different (i.e., colleagues') machines and have to recreate a DSN there.
If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example,
machine1 has ODBC DSN Name0 that points to database1. TableA gets analyzed on machine1 and is uniquely identified as
Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that points to database1. TableA gets analyzed on machine2 and is
uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by
multiple names, creating confusion for developers, testers, and potentially end users.
Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev,
to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names
should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.
Database Connection Information
Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for
developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or
environment tokens in the database connection name. Database connection names must be very generic to be understandable and
ensure a smooth migration.
The naming convention should be applied across all development, test, and production environments. This allows seamless migration
of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information
is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also
copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually
wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change
connection names, user names, passwords, and possibly even connect strings.
Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the
development environment to the test environment, the sessions automatically use the existing connection in the test repository. With
the right naming convention, you can migrate sessions from the test to production repository without manual intervention.
TIP
At the beginning of a project, have the Repository Administrator or DBA setup all connections in all environments based on the issues discussed in this Best Practice. Then
use permission options to protect these connections so that only specified individuals can modify them. Whenever possible, avoid having developers create their own
connections using different conventions and possibly duplicating connections.
Administration Console Objects
Administration console objects such as domains, nodes, and services should also have meaningful names.
Object | Recommended Naming Convention | Example
Domain | DOM_ or DMN_[PROJECT]_[ENVIRONMENT] | DOM_PROCURE_DEV
Node | NODE[#]_[SERVER_NAME]_[optional_descriptor] | NODE02_SERVER_rs_b (backup node for the Repository Service)
Integration Service | INT_SVC_[ENVIRONMENT]_[optional descriptor] | INT_SVC_DEV_primary
Repository Service | REPO_SVC_[ENVIRONMENT]_[optional descriptor] | REPO_SVC_TEST
Web Services Hub | WEB_SVC_[ENVIRONMENT]_[optional descriptor] | WEB_SVC_PROD

PowerCenter PowerExchange Application/Relational Connections
Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager.
When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target
databases. Connections are saved in the repository.
For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you
configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_
[Instance_Name]). The following table shows some examples.
Source Type/Extraction Mode | Application Connection/Relational Connection | Connection Type | Recommended Naming Convention
DB2/390 Bulk Mode | Relational | PWX DB2390 | PWXB_DB2_Instance_Name
DB2/390 Change Mode | Application | PWX DB2390 CDC Change | PWXC_DB2_Instance_Name
DB2/390 Real Time Mode | Application | PWX DB2390 CDC Real Time | PWXR_DB2_Instance_Name
IMS Batch Mode | Application | PWX NRDB Batch | PWXB_IMS_Instance_Name
IMS Change Mode | Application | PWX NRDB CDC Change | PWXC_IMS_Instance_Name
IMS Real Time | Application | PWX NRDB CDC Real Time | PWXR_IMS_Instance_Name
Oracle Change Mode | Application | PWX Oracle CDC Change | PWXC_ORA_Instance_Name
Oracle Real Time | Application | PWX Oracle CDC Real Time | PWXR_ORA_Instance_Name
PowerCenter PowerExchange Target Connections
The connection you configure depends on the type of target data you want to load.
Target Type | Connection Type | Recommended Naming Convention
DB2/390 | PWX DB2390 relational database connection | PWXT_DB2_Instance_Name
DB2/400 | PWX DB2400 relational database connection | PWXT_DB2_Instance_Name


Last updated: 01-Feb-07 18:53
Performing Incremental Loads
Challenge
Data warehousing incorporates very large volumes of data. The process of loading the
warehouse in a reasonable timescale without compromising its functionality is
extremely difficult. The goal is to create a load strategy that can minimize downtime for
the warehouse and allow quick and robust data management.
Description
As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact on the overall system.
This Best Practice describes several possible load strategies.
Incremental Aggregation
Incremental aggregation is useful for applying incrementally-captured changes in the
source to aggregate calculations in a session.
If the source changes only incrementally, and you can capture those changes, you can
configure the session to process only those changes with each run. This allows the
PowerCenter Integration Service to update the target incrementally, rather than forcing
it to process the entire source and recalculate the same calculations each time you run
the session.
If the session performs incremental aggregation, the PowerCenter Integration Service
saves index and data cache information to disk when the session finishes. The next
time the session runs, the PowerCenter Integration Service uses this historical
information to perform the incremental aggregation. To utilize this functionality set the
Incremental Aggregation Session attribute. For details see Chapter 24 in the
Workflow Administration Guide.
Use incremental aggregation under the following conditions:
G Your mapping includes an aggregate function.
G The source changes only incrementally.
G You can capture incremental changes (i.e., by filtering source data by
timestamp).
G You get only delta records (i.e., you may have implemented the CDC (Change
Data Capture) feature of PowerExchange).
Do not use incremental aggregation in the following circumstances:
G You cannot capture new source data.
G Processing the incrementally-changed source significantly changes the target.
If processing the incrementally-changed source alters more than half the
existing target, the session may not benefit from using incremental
aggregation.
G Your mapping contains percentile or median functions.
Some conditions that may help in making a decision on an incremental strategy include:
G Error handling, loading and unloading strategies for recovering, reloading, and
unloading data.
G History tracking requirements for keeping track of what has been loaded and
when
G Slowly-changing dimensions. Informatica Mapping Wizards are a good start to
an incremental load strategy. The Wizards generate generic mappings as a
starting point (refer to Chapter 15 in the Designer Guide)
Source Analysis
Data sources typically fall into the following possible scenarios:
G Delta records. Records supplied by the source system include only new or
changed records. In this scenario, all records are generally inserted or updated
into the data warehouse.
G Record indicator or flags. Records that include columns that specify the
intention of the record to be populated into the warehouse. Records can be
selected based upon this flag for all inserts, updates, and deletes.
G Date stamped data. Data is organized by timestamps, and loaded into the
warehouse based upon the last processing date or the effective date range.
G Key values are present. When only key values are present, data must be
checked against what has already been entered into the warehouse. All values
must be checked before entering the warehouse.
G No key values present. When no key values are present, surrogate keys are
created and all data is inserted into the warehouse based upon validity of the
records.
Identify Records for Comparison
After the sources are identified, you need to determine which records need to be
entered into the warehouse and how. Here are some considerations:
G Compare with the target table. When source delta loads are received,
determine if the record exists in the target table. The timestamps and natural
keys of the record are the starting point for identifying whether the record is
new, modified, or should be archived. If the record does not exist in the target,
insert the record as a new row. If it does exist, determine if the record needs to
be updated, inserted as a new record, or removed (deleted from target) or
filtered out and not added to the target.
G Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for update
or delete scenarios, or does not exist for successful inserts. Some design
effort may be needed to manage errors in these situations.
Determine Method of Comparison
There are four main strategies in mapping design that can be used as a method of
comparison:
G Joins of sources to targets. Records are directly joined to the target using
Source Qualifier join conditions or using Joiner transformations after the
Source Qualifiers (for heterogeneous sources). When using Joiner
transformations, take care to ensure the data volumes are manageable and
that the smaller of the two datasets is configured as the Master side of the join.
G Lookup on target. Using the Lookup transformation, lookup the keys or
critical columns in the target relational database. Consider the caches and
indexing possibilities.
G Load table log. Generate a log table of records that have already been
inserted into the target system. You can use this table for comparison with
lookups or joins, depending on the need and volume. For example, store keys
in a separate table and compare source records against this log table to
determine load strategy. Another example is to store the dates associated with
the data already loaded into a log table.
G MD5 checksum function. Generate a unique checksum value for each row of data and
then compare the previous and current checksum values to determine
whether the record has changed (see the sketch after this list).
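A hedged sketch of the MD5 approach follows. It assumes hypothetical ports CUST_NAME, ADDRESS, and PHONE, and a lookup return port lkp_MD5_VALUE holding the checksum stored with the target row; none of these names come from this Best Practice:

-- checksum over the columns that matter for change detection
v_MD5_CURRENT = MD5(CUST_NAME || '|' || ADDRESS || '|' || PHONE)
-- 'NEW' = not in target, 'CHANGED' = checksum differs, 'UNCHANGED' = identical
o_CHANGE_FLAG = IIF(ISNULL(lkp_MD5_VALUE), 'NEW',
                IIF(v_MD5_CURRENT != lkp_MD5_VALUE, 'CHANGED', 'UNCHANGED'))

A downstream Router or Update Strategy can then insert the 'NEW' rows, update the 'CHANGED' rows, and drop the 'UNCHANGED' rows.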
Source-Based Load Strategies
Complete Incremental Loads in a Single File/Table
The simplest method for incremental loads is from flat files or a database in which all
records are going to be loaded. This strategy requires bulk loads into the warehouse
with no overhead on processing of the sources or sorting the source records.
Data can be loaded directly from the source locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.
Date-Stamped Data
This method involves data that has been stamped using effective dates or sequences.
The incremental load can be determined by dates greater than the previous load date
or data that has an effective key greater than the last key processed.
With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date are loaded into the warehouse. Views
can also be created to perform the selection criteria. This way, the processing does not
have to be incorporated into the mappings but is kept on the source component.
Placing the load strategy into the other mapping components is more flexible and
controllable by the Data Integration developers and the associated metadata.
To compare the effective dates, you can use mapping variables to provide the previous
date processed (see the description below). An alternative to Repository-maintained
mapping variables is the use of control tables to store the dates and update the control
table after each load.
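A minimal sketch of the control-table alternative, assuming a hypothetical ETL_LOAD_CONTROL table and an AUDIT_DATE column on the source table (both names are assumptions for illustration). The selection is placed in a Source Qualifier SQL override or a database view, and the control table is updated after each successful load:

SELECT s.*
FROM   orders s
WHERE  s.audit_date > (SELECT last_load_date
                       FROM   etl_load_control
                       WHERE  table_name = 'ORDERS')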
Non-relational data can be filtered as records are loaded based upon the effective
dates or sequenced keys. A Router transformation or filter can be placed after the
Source Qualifier to remove old records.
Changed Data Based on Keys or Record Information
Data that is uniquely identified by keys can be sourced according to selection criteria.
For example, records that contain primary keys or alternate keys can be used to
determine if they have already been entered into the data warehouse. If they exist, you
can also check to see if you need to update these records or discard the source record.
It may be possible to perform a join with the target tables in which new data can be
selected and loaded into the target. It may also be feasible to lookup in the target to
see if the data exists.
Target-Based Load Strategies
G Loading directly into the target. Loading directly into the target is possible
when the data is going to be bulk loaded. The mapping is then responsible for
error control, recovery, and update strategy.
G Load into flat files and bulk load using an external loader. The
mapping loads data directly into flat files. You can then invoke an external
loader to bulk load the data into the target. This method reduces the load times
(with less downtime for the data warehouse) and provides a means of
maintaining a history of data being loaded into the target. Typically, this
method is only used for updates into the warehouse.
G Load into a mirror database. The data is loaded into a mirror database to
avoid downtime of the active data warehouse. After data has been loaded, the
databases are switched, making the mirror the active database and the active
the mirror.
Using Mapping Variables
You can use a mapping variable to perform incremental loading. By referencing a date-
based mapping variable in the Source Qualifier or join condition, it is possible to select
only those rows with greater than the previously captured date (i.e., the newly inserted
source data). However, the source system must have a reliable date to use.
The steps involved in this method are:
Step 1: Create mapping variable
In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create
variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet
Designer.
Click Add and enter the name of the variable (e.g., $$INCREMENT_DATE). In this case,
make your variable a date/time. For the Aggregation option, select MAX.
In the same screen, state your initial value. This date is used during the initial run of the
session and as such should represent a date earlier than the earliest desired data. The
date can use any one of these formats:
G MM/DD/RR
G MM/DD/RR HH24:MI:SS
G MM/DD/YYYY
G MM/DD/YYYY HH24:MI:SS
Step 2: Reference the mapping variable in the Source Qualifier
The select statement should look like the following:
SELECT * FROM table_A
WHERE
CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')
Step 3: Refresh the mapping variable for the next session run using
an Expression Transformation
Use an Expression transformation and the pre-defined variable functions to set and use
the mapping variable.
In the expression transformation, create a variable port and use the
SETMAXVARIABLE variable function to capture the maximum source date selected
during each run.
SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)
CREATE_DATE in this example is the date field from the source that should be used to
identify incremental rows.
You can use the variables in the following transformations:
G Expression
G Filter
G Router
G Update Strategy
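For example, a Filter transformation condition referencing the variable might look like the following (assuming the CREATE_DATE port from the source); this is useful when the selection cannot be pushed into the Source Qualifier:

CREATE_DATE > $$INCREMENT_DATE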
As the session runs, the variable is refreshed with the max date value encountered
between the source and variable. So, if one row comes through with 9/1/2004, then the
variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is
preserved.
Note: This behavior has no effect on the date used in the source qualifier. The initial
select always contains the maximum date value encountered during the previous
successful session run.
When the mapping completes, the PERSISTENT value of the mapping variable is
stored in the repository for the next run of your session. You can view the value of the
mapping variable in the session log file.
The advantage of the mapping variable and incremental loading is that it allows the
session to use only the new rows of data. No table is needed to store the max(date)
since the variable takes care of it.
After a successful session run, the PowerCenter Integration Service saves the final
value of each variable in the repository. So when you run your session the next time,
only new data from the source system is captured. If necessary, you can override the
value saved in the repository with a value saved in a parameter file.
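As a hedged example, such an override could be supplied in a parameter file section like the one below; the folder, workflow, and session names are hypothetical, and the date value should use one of the formats listed earlier:

[MyFolder.WF:wf_incremental_load.ST:s_m_load_orders]
$$INCREMENT_DATE=01/01/2004 00:00:00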
Using PowerExchange Change Data Capture
PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the
identification, extraction, and loading of change records. It supports all key mainframe
and midrange database systems, requires no changes to the user application, uses
vendor-supplied technology where possible to capture changes, and eliminates the
need for programming or the use of triggers. Once PWX CDC collects changes, it
places them in a change stream for delivery to PowerCenter. Included in the change
data is useful control information, such as the transaction type (insert/update/delete)
and the transaction timestamp. In addition, the change data can be made available
immediately (i.e., in real time) or periodically (i.e., where changes are condensed).
The native interface between PowerCenter and PowerExchange is PowerExchange
Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change
data from the PWX change stream if real-time consumption is needed or from PWX
condense files if periodic consumption is required. The changes are applied directly. So
if the action flag is I, the record is inserted. If the action flag is U, the record is
updated. If the action flag is D, the record is deleted. There is no need for change
detection logic in the PowerCenter mapping.
In addition, by leveraging group source processing, where multiple sources are
placed in a single mapping, the PowerCenter session reads the committed changes for
multiple sources in a single efficient pass, and in the order they occurred. The changes
are then propagated to the targets, and upon session completion, restart tokens
(markers) are written out to a PowerCenter file so that the next session run knows the
point to extract from.
Tips for Using PWX CDC

G After installing PWX, ensure the PWX Listener is up and running and that
connectivity is established to the Listener. For best performance, the Listener
should be co-located with the source system.
G In the PWX Navigator client tool, use metadata to configure data access. This
means creating data maps for the non-relational to relational view of
mainframe sources (such as IMS and VSAM) and capture registrations for all
sources (mainframe, Oracle, DB2, etc). Registrations define the specific tables
and columns desired for change capture. There should be one registration per
source. Group the registrations logically, for example, by source database.
G For an initial test, make changes in the source system to the registered
sources. Ensure that the changes are committed.
G Still working in PWX Navigator (and before using PowerCenter), perform Row
Tests to verify the returned change records, including the transaction action
flag (the DTL__CAPXACTION column) and the timestamp. Set the required
access mode: CAPX for change and CAPXRT for real time. Also, if desired,
edit the PWX extraction maps to add the Change Indicator (CI) column. This
CI flag (Y or N) allows for field level capture and can be filtered in the
PowerCenter mapping.
G Use PowerCenter to materialize the targets (i.e., to ensure that sources and
targets are in sync prior to starting the change capture process). This can be
accomplished with a simple pass-through batch mapping. This same bulk
mapping can be reused for CDC purposes, but only if specific CDC columns
are not included, and by changing the session connection/mode.
G Import the PWX extraction maps into Designer. This requires the PWXPC
component. Specify the CDC Datamaps option during the import.
G Use group sourcing to create the CDC mapping by including multiple sources
in the mapping. This enhances performance because only one read/
connection is made to the PWX Listener and all changes (for the sources in
the mapping) are pulled at one time.
G Keep the CDC mappings simple. There are some limitations; for instance, you
cannot use active transformations. In addition, if loading to a staging area,
store the transaction types (i.e., insert/update/delete) and the timestamp for
subsequent processing downstream. Also, if loading to a staging area, include
an Update Strategy transformation in the mapping with DD_INSERT or
DD_UPDATE in order to override the default behavior and store the action
flags.
G Set up the Application Connection in Workflow Manager to be used by the
CDC session. This requires the PWXPC component. There should be one
connection and token file per CDC mapping/session. Set the UOW (unit of
work) to a low value for faster commits to the target for real-time sessions.
Specify the restart token location and file on the PowerCenter Integration
Service (within the infa_shared directory) and specify the location of the PWX
Listener.
G In the CDC session properties, enable session recovery (i.e., set the Recovery
Strategy to Resume from last checkpoint).
G Use post-session commands to archive the restart token files for restart/
recovery purposes. Also, archive the session logs.


Last updated: 01-Feb-07 18:53
Real-Time Integration with PowerCenter
Challenge
Configure PowerCenter to work with PowerCenter Connect to process real-time data. This Best
Practice discusses guidelines for establishing connections and setting up real-time sessions
in PowerCenter.
Description
PowerCenter with real-time option can be used to process data from real-time data sources.
PowerCenter supports the following types of real-time data:
G Messages and message queues. PowerCenter with the real-time option can be used
to integrate third-party messaging applications using a specific version of PowerCenter
Connect. Each PowerCenter Connect version supports a specific industry-standard
messaging application, such as IBM MQSeries, JMS, MSMQ, SAP NetWeaver mySAP
Option, TIBCO, and webMethods. You can read from messages and message queues
and write to messages, messaging applications, and message queues. IBM MQ Series
uses a queue to store and exchange data. Other applications, such as TIBCO and
JMS, use a publish/subscribe model. In this case, the message exchange is identified
using a topic.
G Web service messages. PowerCenter can receive a web service message from a
web service client through the Web Services Hub, transform the data, and load the
data to a target or send a message back to a web service client. A web service
message is a SOAP request from a web service client or a SOAP response from the
Web Services Hub. The Integration Service processes real-time data from a web
service client by receiving a message request through the Web Services Hub and
processing the request. The Integration Service can send a reply back to the web
service client through the Web Services Hub or write the data to a target.
G Changed source data. PowerCenter can extract changed data in real time from a
source table using the PowerExchange Listener and write data to a target. Real-time
sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400,
DB2/UDB, IDMS, IMS, MS SQL Server, Oracle and VSAM.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the third-
party messaging application and message itself. Each version of PowerCenter Connect
supplies its own connection attributes that need to be configured properly before running a real-
time session.
Setting Up Real-Time Session in PowerCenter
The PowerCenter real-time option uses a zero latency engine to process data from the
messaging system. Depending on the messaging systems and the application that sends and
receives messages, there may be a period when there are many messages and, conversely,
there may be a period when there are no messages. PowerCenter uses the attribute Flush
Latency to determine how often the messages are being flushed to the target. PowerCenter
also provides various attributes to control when the session ends.
The following reader attributes determine when a PowerCenter session should end:
G Message Count - Controls the number of messages the PowerCenter Server reads
from the source before the session stops reading from the source.
G Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive
before it stops reading from the source.
G Time Slice Mode - Indicates a specific range of time that the server read messages
from the source. Only PowerCenter Connect for MQSeries uses this option.
G Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends
reading messages from the source.
The specific filter conditions and options available to you depend on which Real-Time source is
being used.
For example: Attributes for PowerExchange real-time CDC for DB2/400
Set the attributes that control how the reader ends. One or more attributes can be used to
control the end of session.
For example, set the Reader Time Limit attribute to 3600; the reader will end after 3600
seconds. If the Idle Time limit is set to 500 seconds, the reader will end if it doesn't process any
changes for 500 seconds (i.e., it remains idle for 500 seconds).
If more than one attribute is selected, the first attribute that satisfies the condition is used to
control the end of session.
Note: The real-time attributes can be found in the Reader Properties for PowerCenter Connect
for JMS, TIBCO, webMethods, and SAP IDoc. For PowerCenter Connect for MQSeries, the real-
time attributes must be specified as a filter condition.
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how
often PowerCenter should flush messages, expressed in milliseconds.
For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages
every two seconds. The messages will also be flushed from the reader buffer if the Source
Based Commit condition is reached. The Source Based Commit condition is defined in the
Properties tab of the session.
The message recovery option can be enabled to ensure that no messages are lost if a session
fails as a result of an unpredictable error, such as a power loss. This is especially important for real-
time sessions because some messaging applications do not store messages after they have
been consumed by another application.
A unit of work (UOW) is a collection of changes within a single commit scope made by a
transaction on the source system from an external application. Each UOW may consist of a
different number of rows depending on the transaction to the source system. When you use the
UOW Count Session condition, the Integration Service commits source data to the target when
it reaches the number of UOWs specified in the session condition.
For example, if the value for UOW Count is 10, the Integration Service commits all data read
from the source after the 10th UOW enters the source. The lower you set the value, the faster
the Integration Service commits data to the target. The lower value also causes the system to
consume more resources.
Executing a Real-Time Session
A real-time session often has to be up and running continuously to listen to the messaging
application and to process messages immediately after the messages arrive. Set the reader
attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all
PowerExchange and PowerCenter Connect versions except PowerCenter Connect for MQSeries,
where the session continues to run and flush the messages to the target using the specified flush
latency interval.
Another scenario is the ability to read data from another source system and immediately send it
to a real-time target. For example, reading data from a relational source and writing it to MQ
Series. In this case, set the session to run continuously so that every change in the source
system can be immediately reflected in the target.
A real-time session may run continuously until a condition is met to end the session. In some
situations it may be required to periodically stop the session and restart it. This is sometimes
necessary to execute a post-session command or run some other process that is not part of the
session. To stop the session and restart it, it is useful to deploy continuously running workflows.
The Integration Service starts the next run of a continuous workflow as soon as it completes the
first.
To set a workflow to run continuously, edit the workflow and select the Scheduler tab. Edit the
Scheduler and select Run Continuously from Run Options. A continuous workflow starts
automatically when the Integration Service initializes. When the workflow stops, it restarts
immediately.
Real-Time Sessions and Active Transformations
Some of the transformations in PowerCenter are active transformations, which means that the
number of input rows and output rows of the transformation are not the same. In most cases,
an active transformation requires all of the input rows to be processed before passing output
rows to the next transformation or target. For a real-time session, the flush latency will be
ignored if the DTM needs to wait for all the rows to be processed.
Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used
in a real-time session by setting the Transaction Scope property of the active transformation to
Transaction. This signals the session to process the data in the transformation once per
transaction. For example, if a real-time session uses an Aggregator that sums an input field,
the summation is done per transaction, as opposed to across all rows; the result may or
may not be correct depending on the requirement. Use an active transformation in a real-time
session if you want to process the data per transaction.
Custom transformations can also be defined to handle data per transaction so that they can be
used in a real-time session.
PowerExchange Real Time Connections
PowerExchange NRDB CDC Real Time connections can be used to extract changes from
ADABAS, DATACOM, IDMS, IMS and VSAM sources in real time.
The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400
connection to extract from AS/400. There is a separate connection to read from DB2 UDB in
real time.
The NRDB CDC connection requires the application name and the restart token file name to be
overridden for every session. When the PowerCenter session completes, the PowerCenter
Server writes the last restart token to a physical file called the RestartToken File. The next time
the session starts, the PowerCenter Server reads the restart token from the file and then starts
reading changes from the point where it last left off. Every PowerCenter session needs to have
a unique restart token filename.
Informatica recommends archiving the file periodically. The reader timeout or the idle timeout
can be used to stop a real-time session. A post-session command can be used to archive the
RestartToken file.
The encryption mode for this connection can slow down the read performance and increase
resource consumption. Compression mode can help in situations where the network is a
bottleneck; using compression also increases the CPU and memory usage on the source
system.
Archiving PowerExchange Tokens
When the PowerCenter session completes, the Integration Service writes the last restart token
to a physical file called the RestartToken File. The token in the file indicates the end point
where the read job ended. The next time the session starts, the PowerCenter Server reads the
restart token from the file and then starts reading changes from the point where it left off. The
token file is overwritten each time the session has to write a token out. PowerCenter does not
implicitly maintain an archive of these tokens.
If, for some reason, the changes from a particular point in time have to be replayed, you need
the PowerExchange token from that point in time.
To enable such a process, it is a good practice to periodically copy the token file to a backup
folder. This procedure is necessary to maintain an archive of the PowerExchange tokens. A
real-time PowerExchange session may be stopped periodically, using either the reader time
limit or the idle time limit. A post-session command is used to copy the restart token file to an
archive folder. The session will be part of a continuous running workflow, so when the session
completes after the post session command, it automatically restarts again. From a data
processing standpoint very little changes; the process pauses for a moment, archives the token,
and starts again.
The following are examples of post-session commands that can be used to copy a restart token
file (session.token) and append the current system date/time to the file name for archive
purposes:
UNIX:
cp session.token session`date '+%m%d%H%M'`.token
Windows:
copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token
PowerCenter Connect for MQ
1. In the Workflow Manager, connect to a repository and choose Connection > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.
You need to specify three attributes in the Connection Object Definition dialog box:
G Name - the name for the connection. Use <queue_name>_<QM_name> to uniquely identify
the connection (see the example below).
G Queue Manager - the Queue Manager name for the message queue. (In Windows, the
default Queue Manager name is QM_<machine name>.)
G Queue Name - the Message Queue name.
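For example, assuming a hypothetical Queue Manager named QM_SERVER01 that hosts a
queue named ORDERS.IN (both names are illustrative only), the naming convention would give:
Name: ORDERS.IN_QM_SERVER01
Queue Manager: QM_SERVER01
Queue Name: ORDERS.IN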
To obtain the Queue Manager and Message Queue names:
G Open the MQ Series Administration Console. The Queue Manager should appear in the left
panel.
G Expand the Queue Manager icon. A list of the queues for the queue manager appears in the
left panel.
Note that the Queue Manager name and Queue Name are case-sensitive.
PowerCenter Connect for JMS
PowerCenter Connect for JMS can be used to read messages from or write messages to various
JMS providers, such as IBM MQSeries JMS, BEA WebLogic Server, and IBM WebSphere.
There are two types of JMS application connections:
G JNDI Application Connection, which is used to connect to a JNDI server during a
session run.
G JMS Application Connection, which is used to connect to a JMS provider during a
session run.
JNDI Application Connection Attributes are:
G Name
G JNDI Context Factory
G JNDI Provider URL
G JNDI UserName
G JNDI Password
JMS Application Connection Attributes are:
G Name
G JMS Destination Type
G JMS Connection Factory Name
G JMS Destination
G JMS UserName
G JMS Password
Configuring the JNDI Connection for IBM MQ Series
The JNDI settings for MQ Series JMS can be configured using a file system service or LDAP
(Lightweight Directory Access Protocol).
The JNDI settings are stored in a file named JMSAdmin.config, which should be installed in the
MQSeries Java installation/bin directory.
If you are using a file system service provider to store your JNDI settings, remove the number
sign (#) before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
Or, if you are using the LDAP service provider to store your JNDI settings, remove the number
sign (#) before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory
Find the PROVIDER_URL settings.
If you are using a file system service provider to store your JNDI settings, remove the number
sign (#) before the following provider URL setting and provide a value for the JNDI directory.
PROVIDER_URL=file:/<JNDI directory>
<JNDI directory> is the directory where you want JNDI to store the .binding file.
Or, if you are using the LDAP service provider to store your JNDI settings, remove the number
sign (#) before the provider URL setting and specify a hostname.
#PROVIDER_URL=ldap://<hostname>/context_name
For example, you can specify:
PROVIDER_URL=ldap://<localhost>/o=infa,c=rc
If you want to provide a user DN and password for connecting to JNDI, you can remove the #
from the following settings and enter a user DN and password:
PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test
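For reference, a minimal file-system-based JMSAdmin.config might then contain entries such as
the following (the JNDI directory shown is illustrative only):
INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
PROVIDER_URL=file:/var/mqm/jndi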
The following table shows the JMSAdmin.config settings and the corresponding attributes in the
JNDI application connection in the Workflow Manager:
JMSAdmin.config Setting        JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY        JNDI Context Factory
PROVIDER_URL                   JNDI Provider URL
PROVIDER_USERDN                JNDI UserName
PROVIDER_PASSWORD              JNDI Password
Configuring the JMS Connection for IBM MQ Series
The JMS connection is defined using a tool called JMSAdmin, which is available in the MQSeries
Java installation/bin directory. Use this tool to configure the JMS Connection Factory.
The JMS Connection Factory can be a Queue Connection Factory or a Topic Connection Factory.
G When a Queue Connection Factory is used, define a JMS queue as the destination.
G When a Topic Connection Factory is used, define a JMS topic as the destination.
The command to define a queue connection factory (qcf) is:
def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)
The command to define a JMS queue is:
def q(<JMS_queue_name>) qmgr(queue_manager_name) queue(queue_manager_queue_name)
The command to define a JMS topic connection factory (tcf) is:
def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)
The command to define the JMS topic is:
def t(<JMS_topic_name>) topic(pub/sub_topic_name)
The topic name must be unique. For example: topic(application/infa)
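As a minimal sketch of a JMSAdmin session, assuming a hypothetical queue manager
QM_SERVER01 running on host mqhost01, port 1414, and an underlying MQ queue ORDERS.IN
(all names and values are illustrative only):
def qcf(infaQCF) qmgr(QM_SERVER01) hostname(mqhost01) port(1414)
def q(infaOrdersQueue) qmgr(QM_SERVER01) queue(ORDERS.IN)
In the JMS application connection, infaQCF would then be entered as the JMS Connection
Factory Name and infaOrdersQueue as the JMS Destination, with JMS Destination Type set to
Queue.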
The following table shows the JMS object types and the corresponding attributes in the JMS
application connection in the Workflow Manager:
JMS Object Type                                     JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory    JMS Connection Factory Name
JMS Queue Name or JMS Topic Name                    JMS Destination
Configure the JNDI and JMS Connection for IBM Websphere
Configure the JNDI settings for IBM WebSphere to use IBM WebSphere as a provider for JMS
sources or targets in a PowerCenterRT session.
JNDI Connection
Add the following option to the JMSAdmin.bat file to configure JMS properly:
-Djava.ext.dirs=<WebSphere Application Server directory>\bin
For example: -Djava.ext.dirs=WebSphere\AppServer\bin
The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series
Java/bin directory.
INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/
For example:
PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test
JMS Connection
The JMS configuration is similar to the JMS Connection for IBM MQ Series.
Configure the JNDI and JMS Connection for BEA Weblogic
Configure the JNDI settings for BEA Weblogic to use BEA Weblogic as a provider for JMS
sources or targets in a PowerCenterRT session.
PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the
same machine; PowerCenter Connect for JMS only needs a URL that points to the correct
server.
JNDI Connection
The Weblogic Server automatically provides a context factory and URL during the JNDI set-up
configuration for WebLogic Server. Enter these values to configure the JNDI connection for
JMS sources and targets in the Workflow Manager.
Enter the following value for JNDI Context Factory in the JNDI Application Connection in the
Workflow Manager:
weblogic.jndi.WLInitialContextFactory
Enter the following value for JNDI Provider URL in the JNDI Application Connection in the
Workflow Manager:
t3://<WebLogic_Server_hostname>:<port>
where WebLogic Server hostname is the hostname or IP address of the WebLogic Server and
port is the port number for the WebLogic Server.
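For example, assuming a hypothetical WebLogic Server running on host wlhost01 and listening
on its default port of 7001, the JNDI Provider URL would be:
t3://wlhost01:7001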
JMS Connection
The JMS connection is configured from the BEA WebLogic Server console. Select JMS ->
Connection Factory.
The JMS Destination is also configured from the BEA Weblogic Server console.
From the Console pane, select Services > JMS > Servers > <JMS Server name> >
Destinations under your domain.
Click Configure a New JMSQueue or Configure a New JMSTopic.
The following table shows the WebLogic Server JMS object settings and the corresponding
attributes in the JMS application connection in the Workflow Manager:
WebLogic Server JMS Object Setting         JMS Application Connection Attribute
Connection Factory Settings: JNDIName      JMS Connection Factory Name
Destination Settings: JNDIName             JMS Destination
In addition to the JNDI and JMS settings, BEA WebLogic also offers a feature called JMS Store,
which can be used for persistent messaging when reading and writing JMS messages. The
JMS Stores configuration is available from the Console pane: select Services > JMS > Stores
under your domain.
Configuring the JNDI and JMS Connection for TIBCO
TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter
Connect for JMS can't connect directly to the Rendezvous Server. TIBCO Enterprise Server,
which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and
TIBCO Rendezvous Server. Configure a connection-bridge between TIBCO Rendezvous
Server and TIBCO Enterprise Server for PowerCenter Connect for JMS to be able to read
messages from and write messages to TIBCO Rendezvous Server.
To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous
Server, follow these steps:
1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise
Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.
Configure the following information in your JNDI application connection:
G JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
G Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and
port number of the Enterprise Server.
To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise
Server:
1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the
example below, so that TIBCO Enterprise Server can communicate with TIBCO
Rendezvous messaging systems:
tibrv_transports = enabled
2. Enter the following transports in the transports.conf file:
[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server
The transports in the transports.conf configuration file specify the communication protocol
between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export
properties on a destination can list one or more transports to use to communicate with the
TIBCO Rendezvous system.
3. Optionally, specify the name of one or more transports for reliable and certified
message delivery in the export property in the topics.conf file, as in the following
example:
topicname export="RV"
The export property allows messages published to a topic by a JMS client to be exported to the
external systems with configured transports. Currently, you can configure transports for TIBCO
Rendezvous reliable and certified messaging protocols.
PowerCenter Connect for webMethods
When importing webMethods sources into the Designer, be sure the webMethods host file
doesn't contain a period (.) character. You can't use fully-qualified names for the connection
when importing webMethods sources. You can use fully-qualified names for the connection
when importing webMethods targets because PowerCenter doesn't use the same grouping
method for importing sources and targets. To get around this, modify the host file to resolve the
name to the IP address.
For example:
Host File:
crpc23232.crp.informatica.com crpc23232
Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing
webMethods source definition. This step is only required for importing PowerCenter Connect for
webMethods sources into the Designer.
If you are using the request/reply model in webMethods, PowerCenter needs to send an
appropriate document back to the broker for every document it receives. PowerCenter
populates some of the envelope fields of the webMethods target to enable the webMethods
broker to recognize that the published document is a reply from PowerCenter. The envelope
fields 'destid' and 'tag' are populated for the request/reply model: 'destid' should be populated
from the 'pubid' of the source document, and 'tag' should be populated from the 'tag' of the
source document. Use the option Create Default Envelope Fields when importing webMethods
sources and targets into the Designer in order to make the envelope fields available in
PowerCenter.
Configuring the PowerCenter Connect for webMethods Connection
To create or edit PowerCenter Connect for webMethods connection select Connections >
Application > webMethods Broker from the Workflow Manager.
PowerCenter Connect for webMethods connection attributes are:
G Name
G Broker Host
G Broker Name
G Client ID
G Client Group
G Application Name
G Automatic Reconnect
G Preserve Client State
Enter the connection to the Broker Host in the following format: <hostname>:<port>.
If you are using the request/reply method in webMethods, you have to specify a client ID in the
connection. Be sure that the client ID used in the request connection is the same as the client
ID used in the reply connection. Note that if you are using multiple request/reply document
pairs, you need to set up a different webMethods connection for each pair because they cannot
share a client ID.
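As an illustration only (all values shown are hypothetical), a request connection might be
configured as:
Broker Host: wmbroker01:6849
Client ID: INFA_REQ_REPLY_1
The matching reply connection would then reuse the same Client ID (INFA_REQ_REPLY_1), as
required above.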


Last updated: 01-Feb-07 18:53
Session and Data Partitioning
Challenge
Improving performance by identifying strategies for partitioning relational tables, XML,
COBOL and standard flat files, and by coordinating the interaction between sessions,
partitions, and CPUs. These strategies take advantage of the enhanced partitioning
capabilities in PowerCenter.
Description
On hardware systems that are under-utilized, you may be able to improve performance
by processing partitioned data sets in parallel in multiple threads of the same session
instance running on the PowerCenter Server engine. However, parallel execution may
impair performance on over-utilized systems or systems with smaller I/O capacity.
In addition to hardware, consider these other factors when determining if a session is
an ideal candidate for partitioning: source and target database setup, target type,
mapping design, and certain assumptions that are explained in the following
paragraphs. Use the Workflow Manager client tool to implement session partitioning.
Assumptions
The following assumptions pertain to the source and target systems of a session that is
a candidate for partitioning. These factors can help to maximize the benefits that can
be achieved through partitioning.
G Indexing has been implemented on the partition key when using a relational
source.
G Source files are located on the same physical machine as the PowerCenter
Server process when partitioning flat files, COBOL, and XML, to reduce
network overhead and delay.
G All possible constraints are dropped or disabled on relational targets.
G All possible indexes are dropped or disabled on relational targets.
G Table spaces and database partitions are properly managed on the target
system.
G Target files are written to the same physical machine that hosts the PowerCenter Server
process in order to reduce network overhead and delay.
G Oracle External Loaders are utilized whenever possible.
First, determine if you should partition your session. Parallel execution benefits systems that
have the characteristics described below.
Check idle time and busy percentage for each thread. This gives high-level information about
the bottleneck point or points. To do this, open the session log and look for messages starting
with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL
messages give the following details for the reader, transformation, and writer threads:
G Total Run Time
G Total Idle Time
G Busy Percentage
Under-utilized or intermittently-used CPUs. To determine if this is the case, check
the CPU usage of your machine:
G Windows 2000/2003 - check the Task Manager Performance tab.
G UNIX - type vmstat 1 10 on the command line. The id column displays the percentage of
time the CPU was idle during the specified interval without any I/O wait.
If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's
performance may be improved by adding a partition.
Sufficient I/O. To determine the I/O statistics:
G Windows 2000/2003 - check the Task Manager Performance tab.
G UNIX - type iostat on the command line. The %iowait column displays the percentage of
CPU time spent idle while waiting for I/O requests. The %idle column displays the total
percentage of the time that the CPU spends idle (i.e., the unused capacity of the CPU).
Sufficient memory. If too much memory is allocated to your session, you will receive a
memory allocation error. Check to see that you're using as much memory as you can. If
the session is paging, increase the memory. To determine if the session is paging:
G Windows 2000/2003 - check the Task Manager Performance tab.
G UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages
swapped in from the page space during the specified interval, and the po column displays
the number of pages swapped out to the page space during the specified interval. If these
values indicate that paging is occurring, it may be necessary to allocate more memory, if
possible.
If you determine that partitioning is practical, you can begin setting up the partition.
Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance
can be improved by adding partitions at various pipeline partition points. When you
configure the partitioning information for a pipeline, you must specify a partition type.
The partition type determines how the PowerCenter Server redistributes data across
partition points. The Workflow Manager allows you to specify the following partition
types:
Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin
partitioning when you need to distribute rows evenly and do not need to group data
among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin
partitioning. For example, consider a session based on a mapping that reads data from
three flat files of different sizes.
G Source file 1: 100,000 rows
G Source file 2: 5,000 rows
G Source file 3: 20,000 rows
In this scenario, the recommended best practice is to set a partition point after the
Source Qualifier and set the partition type to round-robin. The PowerCenter Server
distributes the data so that each partition processes approximately one third of the
data.
Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among
partitions.
Use hash partitioning where you want to ensure that the PowerCenter Server
processes groups of rows with the same partition key in the same partition. For
example, in a scenario where you need to sort items by item ID, but do not know the
number of items that have a particular ID number. If you select hash auto-keys, the
PowerCenter Server uses all grouped or sorted ports as the partition key. If you select
hash user keys, you specify a number of ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to
ensure that groups of data based on a primary key are processed in the same partition.
Key Range Partitioning
With this type of partitioning, you specify one or more ports to form a compound
partition key for a source or target. The PowerCenter Server then passes data to each
partition depending on the ranges you specify for each port.
Use key range partitioning where the sources or targets in the pipeline are partitioned
by key range. Refer to Workflow Administration Guide for further directions on setting
up Key range partitions.
For example, with key range partitioning set at End range = 2020, the PowerCenter
Server passes in data where values are less than 2020. Similarly, for Start range =
2020, the PowerCenter Server passes in data where values are equal to or greater than
2020. Null values or values that do not fall in either partition are passed through the
first partition.
Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition
point to the next partition point without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to
improve performance, but do not want to (or cannot) change the distribution of data
across partitions. The Data Transformation Manager spawns a master thread on each
session run, which in turn creates three threads (reader, transformation, and writer
threads) by default. Each of these threads can, at the most, process one data set at a
time and hence, three data sets simultaneously. If there are complex transformations in
the mapping, the transformation thread may take a longer time than the other threads,
which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another
pipeline stage and reduces the overhead of a single transformation thread.
When you have considered all of these factors and selected a partitioning strategy, you
can begin the iterative process of adding partitions. Continue adding partitions to the
session until you meet the desired performance threshold or observe degradation in
performance.
Tips for Efficient Session and Data Partitioning
G Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before adding additional partitions.
Refer to Workflow Administrator Guide, for more information on Restrictions on
the Number of Partitions.
G Set DTM buffer memory. For a session with n partitions, set this value to at
least n times the original value for the non-partitioned session.
G Set cached values for sequence generator. For a session with n partitions,
there is generally no need to use the Number of Cached Values property of
the sequence generator. If you must set this value to a value greater than
zero, make sure it is at least n times the original value for the non-partitioned
session.
G Partition the source data evenly. The source data should be partitioned into
equal sized chunks for each partition.
G Partition tables. A notable increase in performance can also be realized when
the actual source and target tables are partitioned. Work with the DBA to
discuss the partitioning of source and target tables, and the setup of
tablespaces.
G Consider using external loader. As with any session, using an external
loader may increase session performance. You can only use Oracle external
loaders for partitioning. Refer to the Session and Server Guide for more
information on using and setting up the Oracle external loader for partitioning.
G Write throughput. Check the session statistics to see if you have increased
the write throughput.
G Paging. Check to see if the session is now causing the system to page. When
you partition a session and there are cached lookups, you must make sure
that DTM memory is increased to handle the lookup caches. When you
partition a source that uses a static lookup cache, the PowerCenter Server
creates one memory cache for each partition and one disk cache for each
transformation. Thus, memory requirements grow for each partition. If the
memory is not bumped up, the system may start paging to disk, causing
degradation in performance.
When you finish partitioning, monitor the session to see if the partition is degrading or
improving session performance. If the session performance is improved and the
session meets your requirements, add another partition.
Session on Grid and Partitioning Across Nodes
Session on Grid provides the ability to run a session on multi-node integration services.
This is most suitable for large-size sessions. For small and medium size sessions, it is
more practical to distribute whole sessions to different nodes using Workflow on Grid.
Session on Grid leverages existing partitions of a session by executing threads in
multiple DTMs. The Log Service can be used to get the cumulative log.
Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single
parameter can determine the number of partitions. With the Session on Grid option,
more partitions can be added when more resources are available. The number of partitions in a
session can also be tied to the partitions in the database, so that PowerCenter partitioning
leverages database partitioning and is easier to maintain.
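A minimal sketch of driving the partition count from a parameter file, assuming the session is
configured for dynamic partitioning based on the number of partitions and using hypothetical
folder, workflow, and session names:
[MyFolder.WF:wf_DailyLoad.ST:s_LoadOrders]
$DynamicPartitionCount=4
Changing this single value between runs changes the number of partitions the Integration
Service creates for the session.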


Last updated: 01-Feb-07 18:53
Using Parameters, Variables and Parameter Files
Challenge
Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.
Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were defined to specific
transformations and to those server variables that were global in nature. Transformation variables were defined as
variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression,
Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect
the subdirectories for source files, target files, log files, and so forth.
More current versions of PowerCenter made variables and parameters available across the entire mapping rather
than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow
Manager. Using parameter files, these values can change from session-run to session-run. With the addition of
workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility
and reducing parameter file maintenance. Other important functionality that has been added in recent releases is
the ability to dynamically create parameter files that can be used in the next session in a workflow or in other
workflows.
Parameters and Variables
Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or
session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or
variables and their values in the parameter file. Parameter files can contain the following types of parameters and
variables:
G Workflow variables
G Worklet variables
G Session parameters
G Mapping parameters and variables
When using parameters or variables in a workflow, worklet, mapping, or session, the PowerCenter Server checks
the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize
workflow variables, worklet variables, mapping parameters, and mapping variables. If not defining start values for
these parameters and variables, the PowerCenter Server checks for the start value of the parameter or variable in
other places.
Session parameters must be defined in a parameter file. Because session parameters do not have default values,
if the PowerCenter Server cannot locate the value of a session parameter in the parameter file, it fails to initialize
the session. To include parameter or variable information for more than one workflow, worklet, or session in a
single parameter file, create separate sections for each object within the parameter file.
Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks
use, as necessary. To specify the parameter file that the PowerCenter Server uses with a workflow, worklet, or
session, do either of the following:
G Enter the parameter file name and directory in the workflow, worklet, or session properties.
G Start the workflow, worklet, or session using pmcmd and enter the parameter filename and directory in
the command line.
If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd
command line, the PowerCenter Server uses the information entered in the pmcmd command line.
Parameter File Format
The format for parameter files changed with the addition of the Workflow Manager. When entering values in a
parameter file, precede the entries with a heading that identifies the workflow, worklet, or session whose
parameters and variables are to be assigned. Assign individual parameters and variables directly below this
heading, entering each parameter or variable on a new line. List parameters and variables in any order for each
task.
The following heading formats can be defined:
G Workflow variables - [folder name.WF:workflow name]
G Worklet variables -[folder name.WF:workflow name.WT:worklet name]
G Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet
name...]
G Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:
session name] or [folder name.session name] or [session name]
Below each heading, define parameter and variable values as follows:
G parameter name=value
G parameter2 name=value
G variable name=value
G variable2 name=value
For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping
parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time.
$$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value
needs to be overridden to 10/1/2000 00:00:00. The
session also uses session parameters to connect to source files and target databases, as well as to write session
log to the appropriate session log file.
The following table shows the parameters and variables that can be defined in the parameter file:
Parameter and Variable Type                Parameter and Variable Name    Desired Definition
String Mapping Parameter                   $$State                        MA
Datetime Mapping Variable                  $$Time                         10/1/2000 00:00:00
Source File (Session Parameter)            $InputFile1                    Sales.txt
Database Connection (Session Parameter)    $DBConnection_Target           Sales (database connection)
Session Log File (Session Parameter)       $PMSessionLogFile              d:/session logs/firstrun.txt
The parameter file for the session includes the folder and session name, as well as each parameter and variable:
G [Production.s_MonthlyCalculations]
G $$State=MA
G $$Time=10/1/2000 00:00:00
G $InputFile1=sales.txt
G $DBConnection_target=sales
G $PMSessionLogFile=D:/session logs/firstrun.txt
The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable.
This allows the PowerCenter Server to use the value for the variable that was set in the previous session run.
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and
Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a
variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to
creating a port in most transformations (See the second figure, below).
Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions to affect
change to mapping variables:
G SetVariable
G SetMaxVariable
G SetMinVariable
G SetCountVariable
A mapping variable can store the last value from a session run in the repository to be used as the starting value
for the next session run.
G Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily
identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
G Aggregation type. This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository at the end of each
session run would be the maximum value across ALL records until the value is deleted.
G Initial value. This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value
is identified, then a data-type specific default value is used.
Variable values are not stored in the repository when the session:
G Fails to complete.
G Is configured for a test load.
G Is a debug session.
G Runs in debug mode and is configured to discard session output.
Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the
parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined
initial value for the variable, or the default value based on the variable data type.
The PowerCenter Server looks for the start value in the following order:
1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value
Mapping Parameters and Variables
Since parameter values do not change over the course of the session run, the value used is based on:
G Value in session parameter file
G Initial value
G Default value
Once defined, mapping parameters and variables can be used in the Expression Editor section of the following
transformations:
G Expression
G Filter
G Router
G Update Strategy
G Aggregator
Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined
join, and source filter sections, as well as in a SQL override in the lookup transformation.
The lookup SQL override is similar to entering a custom query in a Source Qualifier transformation. When entering
a lookup SQL override, enter the entire override, or generate and edit the default SQL statement. When the
Designer generates the default SQL statement for the lookup SQL override, it includes the lookup/output ports in
the lookup condition and the lookup/return port.
Note: Although you can use mapping parameters and variables when entering a lookup SQL override, the
Designer cannot expand mapping parameters and variables in the query override and does not validate the
lookup SQL override. When running a session with a mapping parameter or variable in the lookup SQL override,
the PowerCenter Integration Service expands mapping parameters and variables and connects to the lookup
database to validate the query override.
Also note that the Workflow Manager does not recognize variable connection parameters (such as
$DBConnection) with Lookup transformations. At this time, Lookups can use $Source, $Target, or exact
database connections.
Guidelines for Creating Parameter Files
Use the following guidelines when creating parameter files:
G Capitalize folder and session names as necessary. Folder and session names are case-sensitive in
the parameter file.
G Enter folder names for non-unique session names. When a session name exists more than once in a
repository, enter the folder name to indicate the location of the session.
G Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions
individually. Specify the same parameter file for all of these tasks or create several parameter files.
G If including parameter and variable information for more than one session in the file, create a new
section for each session as follows. The folder name is optional.
[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
G Specify headings in any order. Place headings in any order in the parameter file. However, if defining
the same parameter or variable more than once in the file, the PowerCenter Server assigns the
parameter or variable value using the first instance of the parameter or variable.
G Specify parameters and variables in any order. Below each heading, the parameters and variables
can be specified in any order.
G When defining parameter values, do not use unnecessary line breaks or spaces. The PowerCenter
Server may interpret additional spaces as part of the value.
G List all necessary mapping parameters and variables. Values entered for mapping parameters and
variables become the start value for parameters and variables in a mapping. Mapping parameter and
variable names are not case sensitive.
G List all session parameters. Session parameters do not have default values. An undefined session
parameter can cause the session to fail. Session parameter names are not case sensitive.
G Use correct date formats for datetime values. When entering datetime values, use the following date
formats:
MM/DD/RR
MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
G Do not enclose parameters or variables in quotes. The PowerCenter Server interprets everything after
the equal sign as part of the value.
G Do enclose parameters in single quotes in a Source Qualifier SQL Override if the parameter
represents a string or date/time value to be used in the SQL Override.
G Precede parameters and variables created in mapplets with the mapplet name as follows:
mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
Sample: Parameter Files and Session Parameters
Parameter files, along with session parameters, allow you to change certain values between sessions. A
commonly-used feature is the ability to create user-defined database connection session parameters to reuse
sessions for different relational sources or targets. Use session parameters in the session properties, and then
define the parameters in a parameter file. To do this, name all database connection session parameters with the
prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and
parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping
need to be changed.
Using Parameters in Source Qualifiers
Another commonly used feature is the ability to create parameters in the source qualifiers, which allows you to
reuse the same mapping, with different sessions, to extract specified data from the parameter files the session
references.
Moreover, there may be times when it is necessary to create one mapping that generates a parameter file
and a second mapping that uses it. The second mapping pulls the data using a parameter in the Source
Qualifier transformation, and the session reads that parameter from the parameter file created by the first
mapping. The idea is to build the first mapping so that it writes a flat file that can serve as the parameter file
for another session, as sketched below.
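A minimal sketch, assuming a hypothetical folder MyFolder and a downstream session s_DailyLoad: the
first mapping writes a flat file target containing a section heading and one parameter assignment, for
example:
[MyFolder.s_DailyLoad]
$$Extract_Start_Date=04/21/2001 00:00:00
The heading row can be a hard-coded string, and the value row can be built in an Expression
transformation, for example:
'$$Extract_Start_Date=' || TO_CHAR(MAX_DATE_ENTERED, 'MM/DD/YYYY HH24:MI:SS')
where MAX_DATE_ENTERED is an illustrative port holding the date to be passed to the second session.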
Note: Server variables cannot be modified by entries in the parameter file. For example, there is no way to set the
Workflow log directory in a parameter file. The Workflow Log File Directory can only accept an actual directory or
the $PMWorkflowLogDir variable as a valid entry. The $PMWorkflowLogDir variable is a server variable that is set
at the server configuration level, not in the Workflow parameter file.
Sample: Variables and Parameters in an Incremental Strategy
Variables and parameters can enhance incremental strategies. The following example uses a mapping variable,
an expression transformation object, and a parameter file for restarting.
Scenario
Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new
information. The environment data has an inherent Post_Date that is defined within a column named
Date_Entered that can be used. The process will run once every twenty-four hours.
Sample Solution
Create a mapping with source and target objects. From the menu, create a new mapping variable named
$$Post_Date with the following attributes:
G TYPE Variable
G DATATYPE Date/Time
G AGGREGATION TYPE MAX
G INITIAL VALUE 01/01/1900
Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is
used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g.,
TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute:
DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS') [please be aware that this sample
refers to Oracle as the source RDBMS]. Also note that the initial value 01/01/1900 will be expanded by the
PowerCenter Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.
The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the
function for setting the variable will reside. An output port named Post_Date is created with a data type of date/
time. In the expression code section, place the following function:
SETMAXVARIABLE($$Post_Date,DATE_ENTERED)
The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be
passed forward. For example:
DATE_ENTERED Resultant POST_DATE
9/1/2000 9/1/2000
10/30/2001 10/30/2001
9/2/2000 10/30/2001
Consider the following with regard to the functionality:
1. In order for the function to assign a value, and ultimately store it in the repository, the port must be
connected to a downstream object. It need not go to the target, but it must go to another Expression
Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream
transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an
update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not
work. In this case, make the session Data Driven and add an Update Strategy after the transformation
containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an
ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is
preserved.
The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900
providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it
encounters. Upon successful completion of the session, the variable is updated in the repository for use in the
next session run. To view the current value for a particular variable associated with the session, right-click on the
session in the Workflow Monitor and choose View Persistent Values.
The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this
session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered >
02/03/1998 will be processed.
Resetting or Overriding Persistent Values
To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow
Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing
the Order of Evaluation to use the Initial Value declared from the mapping.
If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:
G Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A
session may (or may not) have a variable, and the parameter file need not have variables and
parameters defined for every session using the parameter file. To override the variable, either change,
uncomment, or delete the variable in the parameter file.
G Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.
Configuring the Parameter File Location
Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in
the workflow or session properties:
G Select either the Workflow or Session, choose, Edit, and click the Properties tab.
G Enter the parameter directory and name in the Parameter Filename field.
G Enter either a direct path or a server variable directory. Use the appropriate delimiter for the PowerCenter
Server operating system.
The following graphic shows the parameter filename and location specified in the session task.
The next graphic shows the parameter filename and location specified in the Workflow.
In this example, after the initial session is run, the parameter file contents may look like:
[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the
subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a
simple Perl script or manual change can update the parameter file to:
[Test.s_Incremental]
$$Post_Date=04/21/2001
Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value
and uses that value for the session run. After successful completion, run another script to reset the parameter file.
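As a sketch only (the file path shown is hypothetical), a Perl one-liner such as the first command below could
set the override before the run, and the second could re-comment it afterwards:
perl -pi -e 's|^;?\$\$Post_Date=.*|\$\$Post_Date=04/21/2001|' /opt/infa/parmfiles/incremental.txt
perl -pi -e 's|^\$\$Post_Date=.*|;\$\$Post_Date=|' /opt/infa/parmfiles/incremental.txt
The first command uncomments the variable and assigns the desired date; the second restores the
commented-out entry so that the stored repository value is used on the next run.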
Sample: Using Session and Mapping Parameters in Multiple Database Environments
Reusable mappings that can source a common table definition across multiple databases, regardless of differing
environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.
Scenario
Company X maintains five Oracle database instances. All instances have a common table definition for sales
orders, but each instance has a unique instance name, schema, and login.
DB Instance Schema Table User Password
ORC1 aardso orders Sam max
ORC99 environ orders Help me
HALC hitme order_done Hi Lois
UGLY snakepit orders Punch Judy
GORF gmer orders Brer Rabbit
Each sales order table has a different name, but the same definition:
ORDER_ID NUMBER (28) NOT NULL,
DATE_ENTERED DATE NOT NULL,
DATE_PROMISED DATE NOT NULL,
DATE_SHIPPED DATE NOT NULL,
EMPLOYEE_ID NUMBER (28) NOT NULL,
CUSTOMER_ID NUMBER (28) NOT NULL,
SALES_TAX_RATE NUMBER (5,4) NOT NULL,
STORE_ID NUMBER (28) NOT NULL
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the strings are named according
to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then
create a Mapping Parameter named $$Source_Schema_Table with the following attributes:
Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required
since this solution uses parameter files.
Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.
Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.
Override the table names in the SQL statement with the mapping parameter.
Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-
down box, choose the following parameter:
$DBConnection_Source.
Point the target to the corresponding target and finish.
Now create the parameter files. In this example, there are five separate parameter files.
Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source= ORC1
Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source= ORC99
Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source= HALC
Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source= UGLY
Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table= gmer.orders
$DBConnection_Source= GORF
Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular
parameter file is as follows:
pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename
s_Incremental
You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual
password.
Notes on Using Parameter Files with Startworkflow
When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter
Integration Service runs the workflow using the parameters in the file specified.
For UNIX shell users, enclose the parameter file name in single quotes:
-paramfile '$PMRootDir/myfile.txt'
For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the
name includes spaces, enclose the file name in double quotes:
-paramfile "$PMRootDir\my file.txt"
Note: When writing a pmcmd command that includes a parameter file located on another machine, use the
backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the
server variable.
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\
$PMRootDir/myfile.txt'
In the event that it is necessary to run the same workflow with different parameter files, use the following five
separate commands:
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -
paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -
paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -
paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -
paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -
paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1
Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script
can change the parameter file for the next session.
Using PowerCenter with UDB
Challenge
Universal Database (UDB) is a database platform that can be used to host PowerCenter
repositories and to act as a source or target database for PowerCenter mappings. Like
any software, it has its own way of doing things. It is important to understand these
behaviors so as to configure the environment correctly for implementing PowerCenter
and other Informatica products with this database platform. This Best Practice offers a
number of tips for using UDB with PowerCenter.
Description

UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run
on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX.
UDB supports two independent types of parallelism: symmetric multi-processing (SMP)
and massively parallel processing (MPP).
Enterprise-Extended Edition (EEE) is the most common UDB edition used in
conjunction with the Informatica product suite. UDB EEE introduces a dimension of
parallelism that can be scaled to very high performance. A UDB EEE database can be
partitioned across multiple machines that are connected by a network or a high-speed
switch. Additional machines can be added to an EEE system as application
requirements grow. The individual machines participating in an EEE installation can be
either uniprocessors or symmetric multiprocessors.
Connection Setup
You must set up a remote database connection to connect to DB2 UDB via
PowerCenter. This is necessary because DB2 UDB sets a very small limit on the
number of attachments per user to the shared memory segments when the user is
using the local (or indirect) connection/protocol. The PowerCenter server runs into this
limit when it is acting as the database agent or user. This is especially apparent when
the repository is installed on DB2 and the target data source is on the same DB2
database.
The local protocol limit will definitely be reached when using the same connection node
for the repository via the PowerCenter Server and for the targets. This occurs when the
session is executed and the server sends requests for multiple agents to be launched.
Whenever the limit on number of database agents is reached, the following error
occurs:
CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be
started to service a request, or was terminated as a result of a database system
shutdown or a force command. SQLSTATE=55032]
The following recommendations may resolve this problem:
G Increase the number of connections permitted by DB2.
G Catalog the database as if it were remote. (For information of how to catalog
database with remote node refer Knowledgebase id 14745 at my.Informatica.
com support Knowledgebase)
G Be sure to close connections when programming exceptions occur.
G Verify that connections obtained in one method are returned to the pool via
close()
G (The PowerCenter Server is very likely already doing this).
G Verify that your application does not try to access pre-empted connections (i.
e., idle connections that are now used by other resources).
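For example, a database on the same machine as the PowerCenter Server can be cataloged a second time as if it were remote by looping back over TCP/IP, and the remote alias is then used in the PowerCenter connection. The node name, alias, host, and port below are illustrative only:

db2 catalog tcpip node LOOPNODE remote 127.0.0.1 server 50000
db2 catalog database MYDB as MYDB_RMT at node LOOPNODE
db2 terminate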

DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-
character format, as follows:
YYYY-MM-DD-HH.MI.SS.MICROS (where MICROS after the last period represents six
decimal places of fractional seconds)
The PowerCenter Date/Time datatype only supports precision to the second (using a
19 character format), so under normal circumstances when a timestamp source is read
into PowerCenter, the six decimal places after the second are lost. This is sufficient for
most data warehousing applications but can cause significant problems where this
timestamp is used as part of a key.
If the MICROS need to be retained, this can be accomplished by changing the format
of the column from a timestamp data type to a character 26 in the source and target
definitions. When the timestamp is read from DB2, the timestamp will be read in and
converted to character in the YYYY-MM-DD-HH.MI.SS.MICROS format. Likewise,
when writing to a timestamp, pass the date as a character in the YYYY-MM-DD-HH.MI.SS.MICROS
format. If this format is not retained, the records are likely to be rejected
due to an invalid date format error.
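If the character-based approach is used, the read-side conversion can also be pushed down to DB2 in a source qualifier SQL override; DB2's CHAR() function returns the full 26-character timestamp representation. The table and column names below are hypothetical:

SELECT ORDER_ID,
       CHAR(CREATE_TS) AS CREATE_TS_CHAR
FROM   DWH.ORDERS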
It is also possible to maintain the timestamp correctly using the timestamp data type
itself. Setting a flag at the PowerCenter Server level does this; the technique is
described in Knowledge Base article 10220 at my.Informatica.com.
Importing Sources or Targets
If the value of the DB2 system variable APPLHEAPSZ is too small when you use the
Designer to import sources/targets from a DB2 database, the Designer reports an error
accessing the repository. The Designer status bar displays the following message:
SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in
the application heap to process the statement.
If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2
operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each
process using the database.
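The heap size can be increased from the DB2 command line; the database name and value below are illustrative only (APPLHEAPSZ is expressed in 4KB pages):

db2 update db cfg for MYDB using APPLHEAPSZ 2048
db2 terminate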
Unsupported Datatypes
PowerMart and PowerCenter do not support the following DB2 datatypes:
G Dbclob
G Blob
G Clob
G Real

DB2 External Loaders
The DB2 EE and DB2 EEE external loaders can both perform insert and replace
operations on targets. Both can also restart or terminate load operations.
G The DB2 EE external loader invokes the db2load executable located in the
PowerCenter Server installation directory. The DB2 EE external loader can
load data to a DB2 server on a machine that is remote to the PowerCenter
Server.
G The DB2 EEE external loader invokes the IBM DB2 Autoloader program to
load data. The Autoloader program uses the db2atld executable. The DB2
EEE external loader can partition data and load the partitioned data
simultaneously to the corresponding database partitions. When you use the
DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server
must be on the same machine.
The DB2 external loaders load from a delimited flat file. Be sure that the target table
columns are wide enough to store all of the data. If you configure multiple targets in the
same pipeline to use DB2 external loaders, each loader must load to a different
tablespace on the target database. For information on selecting external loaders, see
"Configuring External Loading in a Session" in the PowerCenter User Guide.
Setting DB2 External Loader Operation Modes
DB2 operation modes specify the type of load the external loader runs. You can
configure the DB2 EE or DB2 EEE external loader to run in any one of the following
operation modes:
G Insert. Adds loaded data to the table without changing existing table data.
G Replace. Deletes all existing data from the table, and inserts the loaded data.
The table and index definitions do not change.
G Restart. Restarts a previously interrupted load operation.
G Terminate. Terminates a previously interrupted load operation and rolls back
the operation to the starting point, even if consistency points were passed. The
tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions
When you load data to a DB2 database using either the DB2 EE or DB2 EEE external
loader, you must have the correct authority levels and privileges to load data into the
database tables.
DB2 privileges allow you to create or access database resources. Authority levels
provide a method of grouping privileges and higher-level database manager
maintenance and utility operations. Together, these functions control access to the
database manager and its database objects. You can access only those objects for
which you have the required privilege or authority.
To load data into a table, you must have one of the following authorities:
G SYSADM authority
G DBADM authority
G LOAD authority on the database, with INSERT privilege
In addition, you must have proper read access and read/write permissions:
G The database instance owner must have read access to the external loader
input files.
G If you run DB2 as a service on Windows, you must configure the service
start account with a user account that has read/write permissions to use LAN
resources, including drives, directories, and files.
G If you load to DB2 EEE, the database instance owner must have write access
to the load dump file and the load temporary file.
Remember, the target file must be delimited when using the DB2 AutoLoader.
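As an illustration (the user name and table name below are hypothetical), the minimum authority and privilege described above could be granted as follows:

GRANT LOAD ON DATABASE TO USER pcenter_usr
GRANT INSERT ON TABLE DWH.SALES_FACT TO USER pcenter_usr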
Guidelines for Performance Tuning
You can achieve numerous performance improvements by properly configuring the
database manager, database, and tablespace container and parameter settings. For
example, MAXFILOP is one of the database configuration parameters that you can
tune. The default value for MAXFILOP is far too small for most databases. When this
value is too small, UDB spends a lot of extra CPU processing time closing and opening
files. To resolve this problem, increase MAXFILOP value until UDB stops closing files.
You must also have enough DB2 agents available to process the workload based on
the number of users accessing the database. Incrementally increase the value of
MAXAGENTS until agents are not stolen from another application. Moreover, sufficient
memory allocated to the CATALOGCACHE_SZ database configuration parameter also
benefits the database. If the value of catalog cache heap is greater than zero, both
DBHEAP and CATALOGCACHE_SZ should be proportionally increased.
In UDB, the LOCKTIMEOUT default value is -1 (an indefinite wait). In a data warehouse database, set this
value to 60 seconds. Remember to define TEMPSPACE tablespaces so that they have
at least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a
multiple of EXTENTSIZE, where the multiplier is equal to the number of containers.
Doing so will enable parallel I/O for larger sorts, joins, and other database functions
requiring substantial TEMPSPACE space.
In UDB, the default LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set
INTRA_PARALLEL to YES for CPU parallelism. The database configuration
parameter DFT_DEGREE should be set to a specific degree between 1 and ANY, depending on
the number of CPUs available and the number of processes that will be running
simultaneously. Setting DFT_DEGREE to ANY can monopolize CPU resources, since a
single process can consume all available processing power with this setting, while setting it
to 1 disables intra-query parallelism altogether.
Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE DB.
Data warehouse databases perform numerous sorts, many of which can be very large.
SORTHEAP memory is also used for hash joins, which a surprising number of DB2
users fail to enable. To do so, use the db2set command to set environment variable
DB2_HASH_JOIN=ON.
For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to
between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If
real memory is available, some clients use even larger values for these configuration
parameters.
SQL is very complex in a data warehouse environment and often consumes large
quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.
UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate
tables, index creations, and more. Set NUM_IO_CLEANERS equal to the number of
CPUs on the UDB server, and focus your remaining effort on the disk layout strategy.
Lastly, for RAID devices where several disks appear as one to the operating system,
be sure to do the following:
1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating
tablespaces or before a redirected restore)
2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces
residing on the RAID devices for example
DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)
3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID
devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.
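A sketch of how several of these settings might be applied from the DB2 command line follows; the database name and the specific values are illustrative only and should be derived from the sizing of the actual environment:

db2 update dbm cfg using SHEAPTHRES 50000 MAXAGENTS 400 INTRA_PARALLEL YES
db2 update db cfg for MYDWDB using SORTHEAP 8192 MAXFILOP 256 LOCKTIMEOUT 60 LOGBUFSZ 128 DFT_QUERYOPT 7
db2set DB2_HASH_JOIN=ON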
Database Locks and Performance Problems
When working in an environment with many users that target a DB2 UDB database,
you may experience slow and erratic behavior resulting from the way UDB handles
database locks. Out of the box, DB2 UDB database and client connections are
configured on the assumption that they will be part of an OLTP system and place
several locks on records and tables. Because PowerCenter typically works with OLAP
systems where it is the only process writing to the database and users are primarily
reading from the database, this default locking behavior can have a significant impact
on performance.
Connections to DB2 UDB databases are set up using the DB2 Client Configuration
utility. To minimize problems with the default settings, make the following changes to all
remote clients accessing the database for read-only purposes. To help replicate these
settings, you can export the settings from one client and then import the resulting file
into all the other clients.
G Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the
configuration settings and make sure the Enable Cursor Hold option is not
checked.
G Connection Mode should be Shared, not Exclusive
G Isolation Level should be Read Uncommitted (the minimum level) or Read
Committed (if updates by other applications are possible and dirty reads must
be avoided)
To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag
in the PowerCenter configuration file. For details on this process, refer to
Knowledge Base article 13575 in the my.Informatica.com support Knowledge Base.
If you're not sure how to adjust these settings, launch the IBM DB2 Client Configuration
utility, then highlight the database connection you use and select Properties. In
Properties, select Settings and then select Advanced. You will see these options and
their settings on the Transaction tab.
To export the settings from the main screen of the IBM DB2 client configuration utility,
highlight the database connection you use, then select Export and all. Use the same
process to import the settings on another client.
If users run hand-coded queries against the target table using DB2's Command Center,
be sure they know to use script mode and avoid interactive mode (by choosing the
script tab instead of the interactive tab when writing queries). Interactive mode can
lock returned records while script mode merely returns the result and does not hold
them.
If your target DB2 table is partitioned and resides across different nodes in DB2, you
can use the Database Partitioning partition type for the target in the PowerCenter session properties.
When DB partitioning is selected, separate connections are opened directly to each
node and the load starts in parallel. This improves performance and scalability.


Last updated: 13-Feb-07 17:14
Using Shortcut Keys in PowerCenter Designer
Challenge
Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter Mapping Designer and Workflow Manager.
Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety
of shortcuts to speed up their operation.
PowerCenter provides two types of shortcuts:
G keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as
efficiently as possible, and
G shortcuts that simplify the maintenance of repository objects.
General Suggestions
Maneuvering the Navigator Window
Follow these steps to open a folder with its workspace open as well:
1. While highlighting the folder, click the Open folder icon.
Note: Double-clicking the folder name only opens the folder if it has not yet been opened or connected to.
2. Alternatively, right-click the folder name, then click on Open.
Working with the Toolbar and Menubar
The toolbar contains commonly used features and functions within the various client tools. Using the toolbar is often faster than
selecting commands from within the menubar.
G To add more toolbars, select Tools | Customize.
G Select the Toolbar tab to add or remove toolbars.
Follow these steps to use drop-down menus without the mouse:
1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.
2. Press the underlined letter for the desired drop-down menu. For example, press 'r' for the 'Repository' drop-down menu.

3. Press the underlined letter to select the command/operation you want. For example, press 't' for 'Close All Tools'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left arrows to navigate across the menubar, and up/down
arrows to expand and navigate through the drop-down menu. Press Enter when the desired command is highlighted.
G To create a customized toolbar for the functions you frequently use, press <Alt> <T> (expands the Tools drop-down
menu) then <C> (for Customize).
G To delete customized icons, select Tools | Customize, and then remove the icons by dragging them directly off the toolbar
G To add an icon to an existing (or new) toolbar, select Tools | Customize and navigate to the Commands tab. Find your
desired command, then "drag and drop" the icon onto your toolbar.
G To rearrange the toolbars, click and drag the toolbar to the new location. You can insert more than one toolbar at the top
of the designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can position the toolbars
at the bottom, side, or between the workspace and the message windows.
G To dock or undock a window (e.g., Repository Navigator), double-click on the window's title bar. If you are
having trouble docking the window again, right-click somewhere in the white space of the runaway window (not the
title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper
place and, when an outline of where the window used to be appears, release the window.

Keyboard Shortcuts
Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.
To: Press:
Cancel editing in an object Esc
Check and uncheck a check box Space Bar
Copy text from an object onto a clipboard Ctrl+C
Cut text from an object onto the clipboard Ctrl+X.
Edit the text of an object F2. Then move the cursor to the desired location
Find all combination and list boxes Type the first letter of the list
Find tables or fields in the workspace Ctrl+F
Move around objects in a dialog box (when no objects are selected, this pans within the workspace) Ctrl+directional arrows
Paste copied or cut text from the clipboard into an object Ctrl+V
Select the text of an object F2
To start help F1
Mapping Designer
Navigating the Workspace
When using the "drag & drop" approach to create Foreign Key/Primary Key relationships between tables, be sure to start in the
Foreign Key table and drag the key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to dragging.
Follow these steps to quickly select multiple transformations:
1. Hold the mouse down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will have a distinctive outline around them.
3. If you miss one or have an extra, you can hold down the <Shift> or <Ctrl> key and click the offending transformations one
at a time. They will alternate between being selected and deselected each time you click on them.
Follow these steps to copy and link fields between transformations:
1. You can select multiple ports when you are trying to link to the next transformation.
2. When you are linking multiple ports, they are linked in the same order as they are in the source transformation. You need
to highlight the fields you want in the source transformation and hold the mouse button over the port name in the target
transformation that corresponds to the source transformation port.
3. Use the Autolink function whenever possible. It is located under the Layout menu (or accessible by right-clicking
somewhere in the workspace) of the Mapping Designer.
4. Autolink can link by name or position. PowerCenter version 6 and later gives you the option of entering prefixes or suffixes
(when you click the 'More' button). This is especially helpful when you are trying to autolink from a Router transformation to
some target transformation. For example, each group created in a Router has a distinct suffix number added to the port/
field name. To autolink, you need to choose the proper Router and Router group in the 'From Transformation' space. You
also need to click the 'More' button and enter the appropriate suffix value. You must do both to create a link.
5. Autolink does not work if any of the fields in the 'To' transformation are already linked to another group or another stream.
No error appears; the links are simply not created.
Sometimes, a shared object is very close to (but not exactly) what you need. In this case, you may want to make a copy of the
object with some minor alterations to suit your purposes. If you try to simply click and drag the object, it will ask you if you want to
make a shortcut or it will be reusable every time. Follow these steps to make a non-reusable copy of a reusable object:
1. Open the target folder.
2. Select the object that you want to make a copy of, either in the source or target folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (the crosshairs symbol '+' will appear in a white box)
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window appears.
7. The newly created transformation no longer says that it is reusable and you are free to make changes without affecting the
original reusable object.
Editing Tables/Transformations
Follow these steps to move one port in a transformation:
1. Double-click the transformation and make sure you are in the "Ports" tab. (You go directly to the Ports tab if you double-
click a port instead of the colored title bar.)
2. Highlight the port and click the up/down arrow button to reposition the port.
3. Or, highlight the port and then press <Alt><w> to move the port down or <Alt> <u> to move the port up.
Note: You can hold down the <Alt> and hit the <w> or <u> multiple times to reposition the currently highlighted port
downwards or upwards, respectively.
Alternatively, you can accomplish the same thing by following these steps:
1. Highlight the port you want to move by clicking the number beside the port.
2. Grab onto the port by its number and continue holding down the left mouse button.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location.
4. When the red line is pointing to the desired location, release the mouse button.
Note: You cannot move more than one port at a time with this method. See below for instructions on moving more than
one port at a time.
If you are using PowerCenter version 6.x, 7.x, or 8.x and the ports you are moving are adjacent, you can follow these steps to
move more than one port at a time:
1. Highlight the ports you want to move by clicking the number beside the port while holding down the <Ctrl> key.
2. Use the up/down arrow buttons to move the ports to the desired location.
G To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
G To validate a defined default value, first highlight the port you want to validate, and then press <Alt><v>. A message box
will confirm the validity of the default value.
G After creating a new port, simply begin typing the name you wish to call the port. There is no need to remove the
default "NEWFIELD" text prior to labeling the new port. This method can also be applied when modifying existing port
names: simply highlight the existing port by clicking on the port number, and begin typing the modified name of the
port. To prefix a port name, press <Home> to bring the cursor to the beginning of the port name; to add a
suffix to a port name, press <End> to bring the cursor to the end of the port name.
G Checkboxes can be checked (or unchecked) by highlighting the desired checkbox and pressing the SPACE bar to toggle the
checkmark on and off.
Follow either of these steps to quickly open the Expression Editor of an output or variable port:
1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the expression, then press <F2>.
G To cancel an edit in the grid, press <Esc> so the changes are not saved.
G For all combo/drop-down list boxes, type the first letter on the list to select the item you want. For example, you can
highlight a port's Data type box without displaying the drop-down. To change it to 'binary', type <b>. Then use the arrow
keys to go down to the next port. This is very handy if you want to change all fields to string for example because using
the up and down arrows and hitting a letter is much faster than opening the drop-down menu and making a choice each
time.
G To copy a selected item in the grid, press <Ctrl><c>.
G To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.
G To delete a selected field or port from the grid, press <Alt><c>.
G To copy a selected row from the grid, press <Alt><o>.
G To paste a selected row from the grid, press <Alt><p>.
You can use either of the following methods to delete more than one port at a time.
G You can repeatedly hit the cut button; or
G You can highlight several records and then click the cut button. Use <Shift> to highlight many items in a row or <Ctrl> to
highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name while you are
holding <Shift> or <Ctrl>.
Editing Expressions
Follow either of these steps to expedite validation of a newly created expression:
G Click on the <Validate> button or press <Alt> and <v>.
Note: This validates and leaves the Expression Editor open.
G Or, press <OK> to initiate parsing/validating of the expression. The system closes the Expression Editor if the validation is
successful. If you click OK once again in the "Expression parsed successfully" pop-up, the Expression Editor remains
open.
There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If
you want an item to appear in the Formula box, just double-click on it in the appropriate list on the left. This helps to avoid
typographical errors and mistakes (such as including an output-only port name in an expression formula).
In version 6.x and later, if you change a port name, PowerCenter automatically updates any expression that uses that port with the
new name.
Be careful about changing data types. Any expression using the port with the new data type may remain valid, but not perform as
expected. If the change invalidates the expression, it will be detected when the object is saved or if the Expression Editor is active
for that expression.
The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:
To: Press
Add a new field or port Alt + F
Copy a row Alt + O
Cut a row Alt + C
Move current row down Alt + W
Move current row up Alt + U
Paste a row Alt + P
Validate the default value in a transformation Alt + V
Open the Expression Editor from the expression field F2, then press F3
To start the debugger F9
Repository Object Shortcuts
A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the
referenced object).
Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing
objects allows sharing complex mappings, mapplets or reusable transformations across folders, saves space in the repository,
and reduces maintenance.
Follow these steps to create a repository object shortcut:
1. Expand the shared folder.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon appears along with a small curve (the shortcut indicator).
4. A dialog box appears to confirm that you want to create a shortcut.
If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the
object into the workspace.
Workflow Manager
Navigating the Workspace
When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the
operation you are performing:
To: Press:
Create links Press Ctrl+F2 to select first task you want to link.
Press Tab to select the rest of the tasks you want to link
Press Ctrl+F2 again to link all the tasks you selected
Edit task names in the workspace F2
Expand a selected node and all its children SHIFT + * (use asterisk on numeric keypad)
Move across to select tasks in the workspace Tab
Select multiple tasks Ctrl + Mouseclick
Repository Object Shortcuts
Mappings that reside in a shared folder can be reused within workflows by creating shortcut mappings.
A set of workflow logic can be reused within workflows by creating a reusable worklet.


Last updated: 13-Feb-07 17:25
Working with JAVA Transformation Object
Challenge
Occasionally special processing of data is required that is not easy to accomplish using
existing PowerCenter transformation objects. Transformation tasks such as looping through
data 1 to x times are not native functionality of the existing PowerCenter
transformation objects. For these situations, the Java Transformation provides the
ability to develop Java code with unlimited possibilities for transformation
capabilities. This Best Practice addresses questions that are commonly raised about
using JTX and how to make effective use of it, and supplements the existing
PowerCenter documentation on the JTX.
Description
The Java Transformation (JTX) introduced in PowerCenter 8.0 provides a uniform
means of entering and maintaining program code written in Java to be executed for
every record being processed during a session run. The Java code is maintained,
entered, and viewed within the PowerCenter Designer tool.
Below is a summary of some typical questions about the JTX.
Is a JTX a passive or an active transformation?
A JTX can be either passive or active. When defining a JTX you must choose one or
the other type. Once you make this choice you will not be able to change it without
deleting the JTX, saving the repository and recreating the object.
Hint: If you are working with a versioned repository, you will have to purge the deleted
JTX from the repository before you can recreate it with the same name.
What parts of a typical Java class can be used in a JTX?
The following standard features can be used in a JTX:
G static initialization blocks can be defined on the tab Helper Code.
G import statements can be listed on the tab Import Packages.
G static variables of the Java class as a whole (i.e., counters for instances of
this class) as well as non-static member variables (for every single instance)
can be defined on the tab Helper Code.
G Auxiliary member functions or static functions may be declared and defined
on the tab Helper Code.
G static final variables may be defined on the tab Helper Code. However, they
are private by nature; no object of any other Java class will be able to utilize
these.
G Auxiliary functions (static and dynamic) can be defined on the tab Helper
Code.
Important Note:
Before trying to start a session that utilizes additional import clauses in the Java code,
make sure that the CLASSPATH environment variable contains the necessary .jar files
or directories; CLASSPATH must be set before the PowerCenter Integration Service is started.
All non-static member variables declared on the tab Helper Code are automatically
available to every partition of a partitioned session without any precautions. In other
words, one object of the respective Java class that is generated by PowerCenter will be
instantiated for every single instance of the JTX and for every session partition. For
example, if you utilize two instances of the same reusable JTX and have set the
session to run with three partitions, then six individual objects of that Java class will be
instantiated for this session run.
What parts of a typical Java class cannot be utilized in a JTX?
The following standard features of Java are not available in a JTX:
G Standard and user-defined constructors
G Standard and user-defined destructors
G Any kind of direct user-interface, be it a Swing GUI or a console-based user
interface
What else cannot be done in a JTX?
One important note for a JTX is that you cannot retrieve, change, or utilize an existing
DB connection in a JTX (such as a source connection, a target connection, or a
relational connection to a LKP). If you would like to establish a database connection,
use JDBC in the JTX. Make sure in this case that you provide the necessary
parameters by other means.
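For illustration only, a JDBC connection might be opened from code on the Helper Code tab along the following lines. The driver URL, credentials, and method name are hypothetical; java.sql.* is assumed to be listed on the Import Packages tab, the JDBC driver .jar must be on the CLASSPATH before the Integration Service is started, and in practice the connection details would be supplied through input ports or parameter files rather than hard-coded:

// Declared on the Helper Code tab (all names and values are illustrative).
private Connection jdbcConn = null;

private Connection getJdbcConnection() throws SQLException
{
    // Open the connection lazily on first use and reuse it afterwards.
    if (jdbcConn == null || jdbcConn.isClosed())
    {
        jdbcConn = DriverManager.getConnection(
            "jdbc:db2://dbhost:50000/MYDB", "db_user", "db_password");
    }
    return jdbcConn;
}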
How can I substitute constructors and the like in a JTX?
User-defined constructors are mainly used to pass certain initialization values to a
Java class that you want to process only once. The only way in a JTX to get this work
done is to pass those parameters into the JTX as a normal port; then you define a
boolean variable (initial value is true). For example, the name might
be constructMissing on the Helper Code tab. The very first block in the On Input Row
block will then look like this:
if (constructMissing)
{
// do whatever you would do in the constructor
constructMissing = false;
}
Interaction with users is mainly done to provide input values to some member
functions of a class. This usually is not appropriate in a JTX because all input values
should be provided by means of input records.
If there is a need to enable immediate interaction with a user for one or several or all
input records, use an inter-process communication mechanism (i.e., IPC) to establish
communication between the Java class associated with the JTX and an environment
available to a user. For example, if the actual check to be performed can only be
determined at runtime, you might want to establish a JavaBeans communication
between the JTX and the classes performing the actual checks. Beware, however, that
this sort of mechanism causes great overhead and subsequently may decrease
performance dramatically. In many cases, such requirements also indicate that the
analysis process and the mapping design process have not been executed optimally.
How do I choose between an active and a passive JTX?
Use the following guidelines to identify whether you need an active or a passive JTX in
your mapping:
G As a general rule of thumb, a passive JTX will usually execute faster
than an active JTX.
G If one input record equals one output record of the JTX, you will
probably want to use a passive JTX.
G If you have to produce a varying number of output records per input
record (i.e., for some input values the JTX will generate one output
record, for some values it will generate no output records, for some
values it will generate two or even more output records) you will
have to utilize an active JTX. There is no other choice.
G If you have to accumulate one or more input records before
generating one or more output records, you will have to utilize an
active JTX. There is no other choice.
G If you have to do some initialization work before processing the
first input record, then this fact does in no way determine whether to
utilize an active or a passive JTX.
G If you have to do some cleanup work after having processed the
last input record, then this fact does in no way determine whether to
utilize an active or a passive JTX.
G If you have to generate one or more output records after the last
input record has been processed, then you have to use an active
JTX. There is no other choice except changing the mapping
accordingly to produce these additional records by other means.
How do I set up a JTX and use it in a mapping?
As with most standard transformations you can either define a reusable JTX or an
instance directly within a mapping. The following example will describe how to define a
JTX in a mapping. For this example assume that the JTX has one input port of data
type String and three output ports of type String, Integer, and Smallint.
Note: As of version 8.1.1 the PowerCenter Designer is extremely sensitive regarding
the port structure of a JTX; make sure you read and understand the Notes section
below before designing your first JTX, otherwise you will encounter issues when trying
to run a session associated to your mapping.
1. Click the button showing the Java icon, then click on the background in the main
window of the Mapping Designer. Choose whether to generate a passive or an
active JTX (see How do I choose between an active and a passive JTX
above). Remember, you cannot change this setting later.
2. Rename the JTX accordingly (i.e., rename it to JTX_SplitString).
3. Go to the Ports tab; define all input-only ports in the Input Group, define all
output-only and input-output ports in the Output Group. Make sure that every
output-only and every input-output port is defined correctly.
4. Make sure you define the port structure correctly from the onset as changing
data types of ports after the JTX has been saved to the repository will not
always work.
5. Click Apply.
6. On the Properties tab you may want to change certain properties. For example,
the setting "Is Partitionable" is mandatory if this session will be partitioned.
Follow the hints in the lower part of the screen form that explain the selection
lists in detail.
7. Activate the tab Java Code. Enter code pieces where necessary. Be aware
that all ports marked as input-output ports on the Ports tab are automatically
processed as pass-through ports by the Integration Service. You do not have
to (and should not) enter any code referring to pass-through ports. See the
Notes section below for more details.
8. Click the Compile link near the lower right corner of the screen form to compile
the Java code you have entered. Check the output window at the lower border
of the screen form for all compilation errors and work through each error
message encountered; then click Compile again. Repeat this step as often
as necessary until you can compile the Java code without any error messages.
9. Click OK.
10. Only connect ports of the same data type to every input-only or input-output
port of the JTX. Connect output-only and input-output ports of the JTX only to
ports of the same data type in transformations downstream. If any downstream
transformation expects a different data type than the type of the respective
output port of the JTX, insert an EXP to convert data types. Refer to the Notes
below for more detail.
11. Save the mapping.
Notes:
G The primitive Java data types available in a JTX that can be used for
ports of the JTX to connect to other transformations are Integer,
Double, and Date/Time. Date/time values are delivered to or by a
JTX by means of a Java long value which indicates the difference of
the respective date/time value to midnight, Jan 1st, 1970 (the so-called
Epoch) in milliseconds; to interpret this value, utilize the appropriate
methods of the Java class GregorianCalendar. Smallint values cannot
be delivered to or by a JTX.
G The Java object data types available in a JTX that can be used for
ports are String, byte arrays (for Binary ports), and BigDecimal (for
Decimal values of arbitrary precision).
G In a JTX you check whether an input port has a NULL value by calling
the function isNull("name_of_input_port"). If an input value is NULL,
then you should explicitly set all depending output ports to NULL by
calling setNull("name_of_output_port"). Both functions take the
name of the respective input / output port as a string.
G You retrieve the value of an input port (provided this port is not
NULL, see previous paragraph) simply by referring to the name of
this port in your Java source code. For example, if you have two
input ports i_1 and i_2 of type Integer and one output port o_1 of
type String, then you might set the output value with a statement
like this one: o_1 = "First value = " + i_1 + ", second value = " + i_2;
G In contrast to a Custom Transformation, it is not possible to retrieve
the names, data types, and/or values of pass-through ports except if
these pass-through ports have been defined on the Ports tab in
advance. In other words, it is impossible for a JTX to adapt to its port
structure at runtime (which would be necessary, for example, for
something like a Sorter JTX).
G If you have to transfer 64-bit values into a JTX, deliver them to the
JTX by means of a string representing the 64-bit number and convert
this string into a Java long variable using the static method
Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to
downstream transformations, convert the long variable to a string
which will be an output port of the JTX (e.g. using the statement
o_Int64 = "" + myLongVariable).
G As of version 8.1.1, the PowerCenter Designer is very sensitive
regarding data types of ports connected to a JTX. Supplying a JTX
with not exactly the expected data types or connecting output ports
to other transformations expecting other data types (i.e., a string
instead of an integer) may cause the Designer to invalidate the
mapping such that the only remedy is to delete the JTX, save the
mapping, and re-create the JTX.
G Initialization Properties and Metadata Extensions can neither be
defined nor retrieved in a JTX.
G The code entered on the Java Code sub-tab On Input Row is
inserted into some other code; only this complete code constitutes
the method execute() of the resulting Java class associated to the
JTX (see output of the link "View Code" near the lower-right corner of
the Java Code screen form). The same holds true for the code
entered on the tabs On End Of Data and On Receiving
Transactions with regard to the methods. This fact has a couple of
implications which will be explained in more detail below.
G If you connect input and/or output ports to transformations with
differing data types, you might get error messages during mapping
validation. One such error message occurring quite often indicates
that the byte code of the class cannot be retrieved from the
repository. In this case, rectify port connections to all input and/or
output ports of the JTX and edit the Java code (inserting one blank
comment line usually suffices) and recompile the Java code again.
G The JTX (Java Transformation) doesn't currently allow pass-through
ports. Thus they have to be simulated by splitting them up into one
input port and one output port, then the values of all input ports
have to be assigned to the respective output port. The key here
is the input port of every pair of ports has to be in the Input Group
while the respective output port has to be in the Output Group. If you
do not do this, there is no warning in designer but it will not function
correctly.
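Drawing on the notes above, the following is a minimal, purely illustrative On Input Row sketch for a passive JTX. The port names (in_Code, out_Code, in_Qty, out_Qty) and the parsing rule are hypothetical; the first pair simulates a pass-through port as described in the last note, and the output row is emitted automatically because the JTX is passive:

// Simulated pass-through: copy the input port straight to its output twin.
if (isNull("in_Code"))
{
    setNull("out_Code");
}
else
{
    out_Code = in_Code;
}
// Derived value: parse a quantity delivered as a String into an Integer port.
// A NumberFormatException thrown here would be caught by the surrounding
// framework code generated by the Designer.
if (isNull("in_Qty"))
{
    setNull("out_Qty");
}
else
{
    out_Qty = Integer.parseInt(in_Qty.trim());
}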
Where and how to insert what pieces of Java code into a JTX?
A JTX always contains a code skeleton that is generated by the Designer. Every piece
of code written by a mapping designer is inserted into this skeleton at designated
places. Because all these code pieces do not constitute the sole content of the
respective functions, there are certain rules and recommendations as to how to write
such code.
As mentioned previously, a mapping designer can neither write his or her own
constructor nor insert any code into the default constructor or the default destructor
generated by the Designer. All initialization and cleanup work can instead be done in the
following ways:
G as part of the static{} initialization block,
G by inserting code that in a standalone class would be part of the
destructor into the tab On End Of Data,
G by inserting code that in a standalone class would be part of the
constructor into the tab On Input Row.
The last case (constructor code being part of the On Input Row code) requires a little
trick: constructor code is supposed to be executed once only, namely before the first
method is called. In order to resemble this behavior, follow these steps:
1. On the tab Helper Code, define a boolean variable (i.e., constructorMissing)
and initialize it to true.
2. At the beginning of the On Input Row code, insert code that looks like the following:
if( constructorMissing)
{
// do whatever the constructor should have done
constructorMissing = false;
}
This will ensure that this piece of code is executed only once, namely directly before
the very first input row is processed.
The code pieces on the tabs On Input Row, On End Of Data, and On Receiving
Transaction are embedded in other code. There is code that runs before the code
entered here will execute, and there is more code to follow; for example, exceptions
raised within code written by a developer will be caught here. As a mapping developer
you cannot change this order, so you need to be aware of the following important
implication.
Suppose you are writing a Java class that performs some checks on an input record
and, if the checks fail, issues an error message and then skips processing to the next
record. Such a piece of code might look like this:
if (!firstCheckPerformed( inputRecord) ||
    !secondCheckPerformed( inputRecord))
{   logMessage( "ERROR: one of the two checks failed!");
    return;
}
// else
insertIntoTarget( inputRecord);
countOfSucceededRows ++;
This code will not compile in a JTX because it would lead to unreachable code. Why?
Because the return at the end of the if statement might enable the respective
function (in this case, the method will have the name execute()) to ignore the
subsequent code that is part of the framework created by the Designer.
In order to make this code work in a JTX, change it to look like this:
if (!firstCheckPerformed( inputRecord) ||
    !secondCheckPerformed( inputRecord))
{   logMessage( "ERROR: one of the two checks failed!");
}
else
{   insertIntoTarget( inputRecord);
    countOfSucceededRows ++;
}
The same principle (never use return in these code pieces) applies to all three tabs
On Input Row, On End Of Data, and On Receiving Transaction.
Another important point is that the code entered on the On Input Row tab is
embedded in a try-catch block. So never include any try-catch code on this tab.
How fast does a JTX perform?
A JTX communicates with PowerCenter by means of JNI (the Java Native Interface).
This mechanism has been defined by Sun Microsystems in order to allow Java code to
interact with dynamically linkable libraries. Though JNI has been designed to perform
fast, it still creates some overhead to a session due to:
G the additional process switches between the PowerCenter Integration
Service and the Java Virtual Machine (JVM) that executes as another
operating system process
G Java not being compiled to machine code but to portable byte code
(although this has been largely remedied in the past years due to the
introduction of Just-In-Time compilers) which is interpreted by the
JVM
G The inherent complexity of the genuine object model in Java (except
for most sorts of number types and characters everything in Java is
an object that occupies space and execution time).
So it is obvious that a JTX cannot perform as fast as, for example, a carefully written
Custom Transformation.
The rule of thumb is for simple JTX to require approximately 50% more total running
time than an EXP of comparable functionality. It can also be assumed that Java code
utilizing several of the fairly complex standard classes will need even more total
runtime when compared to an EXP performing the same tasks.
When should I use a JTX and when not?
As with any other standard transformation, a JTX has its advantages as well as
disadvantages. The most significant disadvantages are:
G The Designer is very sensitive in regards to the data types of ports
that are connected to the ports of a JTX. However, most of the
troubles arising from this sensitivity can be remedied rather easily
by simply recompiling the Java code.
G Working with long values representing days and time within, for
example, the GregorianCalendar can be extremely difficult to do and
demanding in terms of runtime resources (memory, execution time).
Date/time ports in PowerCenter are by far easier to use. So it is
advisable to split up date/time ports into their individual components,
such as year, month, and day, and to process these singular
attributes within a JTX if needed.
G In general a JTX can reduce performance simply by the nature of the
architecture. Only use a JTX when necessary.
G A JTX always has one input group and one output group. For
example, it is impossible to write a Joiner as a JTX.
Significant advantages to using a JTX are:
G Java knowledge and experience are generally easier to find than
comparable skills in other languages.
G Prototyping with a JTX can be very fast. For example, setting up a
simple JTX that calculates the calendar week and calendar year for a
given date takes approximately 10-20 minutes. Writing Custom
Transformations (even for easy tasks) can take several hours.
G Not every data integration environment has access to a C compiler
used to compile Custom Transformations in C. Because PowerCenter
is installed with its own JDK, this problem will not arise with a JTX.
In Summary
G If you need a transformation that adapts its processing behavior to
its ports, a JTX is not the way to go. In such a case, write a Custom
Transformation in C, C++, or Java to perform the necessary tasks.
The CT API is considerably more complex than the JTX API, but it is
also far more flexible.
G Use a JTX for development whenever a task cannot be easily
completed using other standard options in PowerCenter (as long
as performance requirements do not dictate otherwise).
G If performance measurements are slightly below expectations, try
optimizing the Java code and the remainder of the mapping in order
to increase processing speed.


Last updated: 01-Feb-07 18:53
Error Handling Process
Challenge
For an error handling strategy to be implemented successfully, it must be integral to the load process as a
whole. The method of implementation for the strategy will vary depending on the data integration
requirements for each project.
The resulting error handling process should, however, always involve the following three steps:
1. Error identification
2. Error retrieval
3. Error correction
This Best Practice describes how each of these steps can be facilitated within the PowerCenter
environment.
Description
A typical error handling process leverages the best-of-breed error management technology available in
PowerCenter, such as:
G Relational database error logging
G Email notification of workflow failures
G Session error thresholds
G The reporting capabilities of PowerCenter Data Analyzer
G Data profiling
These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in
the flow chart below:


Error Identification
The first step in the error handling process is error identification. Error identification is often achieved
through the use of the ERROR() function within mappings, enablement of relational error logging in
PowerCenter, and referential integrity constraints at the database.
This approach ensures that row-level issues such as database errors (e.g., referential integrity failures),
transformation errors, and business rule exceptions for which the ERROR() function was called are captured
in relational error logging tables.
Enabling the relational error logging functionality automatically writes row-level data to a set of four error
handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can
be centralized in the PowerCenter repository and store information such as error messages, error data, and
source row data. Row-level errors trapped in this manner include any database errors, transformation errors,
and business rule exceptions for which the ERROR() function was called within the mapping.
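As a simple illustration (the port name and business rule are hypothetical), such an exception can be raised from an Expression transformation output port; the ERROR() call skips the row and its details are written to the error logging tables:

IIF( ISNULL(CUST_ID),
     ERROR('Business rule violation: CUST_ID is required'),
     CUST_ID )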
Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the
PowerCenter repository, it is important to make their retrieval simple and automated so that the process is
as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the
information stored in the PowerCenter repository. A typical error report prompts a user for the folder and
workflow name, and returns a report with information such as the session, error message, and data that
caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved
through a Data Analyzer report, or an email alert that identifies a user when a certain threshold is crossed
(such as number of errors is greater than zero).
Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of
error identification, and Data Analyzer can be used to simplify error retrieval, error correction is
straightforward. After retrieving an error through Data Analyzer, the error report (which contains information
such as workflow name, session name, error date, error message, error data, and source row data) can
be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of
an error, the error report can be extracted into a supported format and emailed to a developer or DBA to
resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface
supports emailing a report directly through the web-based interface to make the process even easier.
For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be
set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the
error, a fix for the error can be implemented. The exact method of data correction depends on various
factors such as the number of records with errors, data availability requirements per SLA, the level of data
criticality to the business unit(s), and the type of error that occurred. Considerations made during error
correction include:
G The owner of the data should always fix the data errors. For example, if the source data is
coming from an external system, then the errors should be sent back to the source system to be
fixed.
G In some situations, a simple re-execution of the session will reprocess the data.
G Partial data that has already been loaded into the target systems may need to be backed out in
order to avoid duplicate processing of rows.
G Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of
errors is low, the rejected data can be easily exported to Microsoft Excel or CSV format and
corrected in a spreadsheet from the Data Analyzer error reports. The corrected data can then be
manually inserted into the target table using a SQL statement.
Any approach to correct erroneous data should be precisely documented and followed as a standard.
If the data errors occur frequently, then reprocessing can be automated by designing a special
mapping or session to correct the errors and load the corrected data into the ODS or staging area.
Data Profiling Option
For organizations that want to identify data irregularities post-load but do not want to reject such rows at load
time, the PowerCenter Data Profiling option can be an important part of the error management solution. The
PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that
provides profile reporting such as orphan record identification, business rule violation, and data irregularity
identification (such as NULL or default values). The Data Profiling option comes with a license to use Data
Analyzer reports that source the data profile warehouse to deliver data profiling information through an
intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports
can be delivered to users through the same easy-to-use application.
Integrating Error Handling, Load Management, and Metadata
Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the
load management process and the load metadata; it is the integration of all these approaches that ensures
the system is sufficiently robust for successful operation and management. The flow chart below illustrates
this in the end-to-end load process.

Error handling underpins the data integration system from end-to-end. Each of the load
components performs validation checks, the results of which must be reported to the operational team.
These components are not just PowerCenter processes such as business rule and field validation, but cover
the entire data integration architecture, for example:
G Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity
to source systems)?
G Source File Validation. Is the source file datestamp later than the previous load?
G File Check. Does the number of rows successfully loaded match the source rows read?

Last updated: 09-Feb-07 13:42

Error Handling Strategies - Data Warehousing
Challenge
A key requirement for any successful data warehouse or data integration project is that it attains credibility
within the user community. At the same time, it is imperative that the warehouse be as up-to-date as
possible since the more recent the information derived from it is, the more relevant it is to the business
operations of the organization, thereby providing the best opportunity to gain an advantage over the
competition.
Transactional systems can manage to function even with a certain amount of error since the impact of an
individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can
be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse
systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number
of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse
"until someone notices" because business decisions may be driven by such information.
Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors
occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them
from the warehouse immediately (i.e., before the business tries to use the information in error).
The types of error to consider include:
Source data structures
Sources presented out-of-sequence
Old sources represented in error
Incomplete source files
Data-type errors for individual fields
Unrealistic values (e.g., impossible dates)
Business rule breaches
Missing mandatory data
O/S errors
RDBMS errors
These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or
column-related errors) concerns.
Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you
can be sure that every source element was populated correctly, with meaningful values, never missing a
value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed
structure, are always available on time (and in the correct order), and are never corrupted during transfer to
the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and
privileges change.
Realistically, however, the operational applications are rarely able to cope with every possible business
scenario or combination of events; operational systems crash, networks fall over, and users may not use the
transactional systems in quite the way they were designed. The operational systems also typically need
some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is
a risk that the source data does not match what the data warehouse expects.
Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by
the business managers. If erroneous data does reach the warehouse, it must be identified and removed
immediately (before the current version of the warehouse can be published). Preferably, error data should
be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous
source data should be identified before a load even begins, so that no resources are wasted trying to load it.
As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors
within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point
on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific
chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those
responsible for the source data into the warehouse process; source data staff understand that their
professionalism directly affects the quality of the reports, and end-users become owners of their data.
As a final consideration, error management (the implementation of an error handling strategy) complements
and overlaps load management, data quality and key management, and operational processes and
procedures.
Load management processes record at a high-level if a load is unsuccessful; error management records the
details of why the failure occurred.
Quality management defines the criteria whereby data can be identified as in error; and error management
identifies the specific error(s), thereby allowing the source data to be corrected.
Operational reporting shows a picture of loads over time, and error management allows analysis to identify
systematic errors, perhaps indicating a failure in operational procedure.
Error management must therefore be tightly integrated within the data warehouse load process. This is
shown in the high level flow chart below:
[Flow chart: error management integrated within the data warehouse load process]
Error Management Considerations
High-Level Issues
From previous discussion of load management, a number of checks can be performed before any attempt is
made to load a source data set. Without load management in place, it is unlikely that the warehouse process
will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot
(in so far as nearly all maintenance and development resources will be working full time to manually correct
bad data in the warehouse). The following assumes that you have implemented load management
processes similar to Informatica's best practices.
Process Dependency checks in the load management can identify when a source data set is
missing, duplicates a previous version, or has been presented out of sequence, and where the
previous load failed but has not yet been corrected.
Load management prevents this source data from being loaded. At the same time, error
management processes should record the details of the failed load; noting the source instance, the
load affected, and when and why the load was aborted.
Source file structures can be compared to expected structures stored as metadata, either from
header information or by attempting to read the first data row.
Source table structures can be compared to expectations; typically this can be done by
interrogating the RDBMS catalogue directly (and comparing to the expected structure held in
metadata), or by simply running a describe command against the table (again comparing to a pre-
stored version in metadata).
Control file totals (for file sources) and row number counts (table sources) are also used to
determine if files have been corrupted or truncated during transfer, or if tables have no new data in
them (suggesting a fault in an operational application).
In every case, information should be recorded to identify where and when an error occurred, what
sort of error it was, and any other relevant process-level details.
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load
to abort), further error management processes need to be applied to the individual source rows and fields.
Individual source fields can be compared to expected data-types against standard metadata within
the repository, or additional information added by the development. In some instances, this is
enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the
source data set as a whole either cannot be processed at all or (more worryingly) is likely to be
processed unpredictably.
Data conversion errors can be identified on a field-by-field basis within the body of a mapping.
Built-in error handling can be used to spot failed date conversions, conversions of string to
numbers, or missing required data. In rare cases, stored procedures can be called if a specific
conversion fails; however this cannot be generally recommended because of the potentially
crushing impact on performance if a particularly error-filled load occurs.
Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to implement such a process.
Missing Key/Unknown Key issues have already been defined in their own best practice document Key Management in Data Warehousing Solutions with suggested management techniques for identifying and handling them. However, from an error handling perspective, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.
Inter-row errors may also have to be considered. These may occur when a business process
expects a certain hierarchy of events (e.g., a customer query, followed by a booking request,
followed by a confirmation, followed by a payment). If the events arrive from the source system in
the wrong order, or where key events are missing, it may indicate a major problem with the source
system, or the way in which the source system is being used.
An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a sourced data set if we already know it is in error; however, since the row needs to be corrected at source, then reprocessed subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first, re-running, and then identifying a second error (which halts the load for a second time).
OS and RDBMS Issues
Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference information is missing).
However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur; changes to schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table space extents available, missing partitions and the like.
Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax, and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who, from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept up to speed.
In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even the log files) may not be accessible.
Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a load to complete should be absolutely the last step in a given process, any failure before, or including, that point leaves the load in an incomplete state. Subsequent runs should note this, and enforce correction of the last load before beginning the new one.
The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and DBAs have proper and working communication with the data warehouse management to allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if they occur.
Auto-Correction vs. Manual Correction
Load management and key management best practices (Key Management in Data Warehousing Solutions) have already defined auto-correcting processes; the former to allow loads themselves to launch, roll back, and reload without manual intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as the source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely from scratch.
In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user training.
The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.
Error Management Techniques
Simple Error Handling Structure
The following data structure is an example of the error metadata that should be captured as a minimum
within the error handling strategy.
[Diagram: example error handling data structure (ERROR_DEFINITION, ERROR_HEADER, ERROR_DETAIL)]
The example defines three main sets of information:
The ERROR_DEFINITION table, which stores descriptions for the various types of errors,
including:

o process-level (e.g., incorrect source file, load started out-of-sequence)
o row-level (e.g., missing foreign key, incorrect data-type, conversion errors) and
o reconciliation (e.g., incorrect row numbers, incorrect file total etc.).

The ERROR_HEADER table provides a high-level view on the process, allowing a quick
identification of the frequency of error for particular loads and of the distribution of error types. It is
linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which
other process-level information can be gathered.
The ERROR_DETAIL table stores information about actual rows with errors, including how to
identify the specific row that was in error (using the source natural keys and row number) together
with a string of field identifier/value pairs concatenated together. It is not expected that this
information will be deconstructed as part of an automatic correction load, but if necessary this can
be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent
reporting.
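A minimal DDL sketch of this structure is given below. Only SRC_INST_ID and PROC_INST_ID come from the load management description above; all other column names and data types are illustrative assumptions and should be adapted to local standards.
  -- Illustrative sketch of the error metadata tables described above.
  CREATE TABLE ERROR_DEFINITION (
      ERROR_TYPE_ID     INTEGER PRIMARY KEY,
      ERROR_CATEGORY    VARCHAR(30),       -- process-level, row-level, or reconciliation
      ERROR_DESC        VARCHAR(255)
  );
  CREATE TABLE ERROR_HEADER (
      ERROR_HDR_ID      INTEGER PRIMARY KEY,
      SRC_INST_ID       INTEGER,           -- link to the load management source instance
      PROC_INST_ID      INTEGER,           -- link to the load management process instance
      ERROR_TYPE_ID     INTEGER,           -- link to ERROR_DEFINITION
      ERROR_DATE        DATE
  );
  CREATE TABLE ERROR_DETAIL (
      ERROR_HDR_ID      INTEGER,           -- link to ERROR_HEADER
      SRC_NATURAL_KEY   VARCHAR(100),      -- natural key of the failed source row
      SRC_ROW_NUMBER    INTEGER,
      FIELD_VALUE_PAIRS VARCHAR(2000)      -- concatenated field identifier/value pairs
  );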

Last updated: 01-Feb-07 18:53

Error Handling Strategies - General
Challenge
The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes
various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for
handling data errors, and alternatives for addressing the most common types of problems. For the most part, these
strategies are relevant whether your data integration project is loading an operational data structure (as with data
migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing
structure.
Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the
business. When the source system data does not meet these rules, the process needs to handle the exceptions in
an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to
enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide
what is acceptable and prioritize two conflicting goals:
G The need for accurate information.
G The ability to analyze or process the most complete information available with the understanding that errors
can exist.
Data Integration Process Validation
In general, there are three methods for handling data errors detected in the loading process:
G Reject All. This is the simplest to implement since all errors are rejected from entering the target when they
are detected. This provides a very reliable target that the users can count on as being correct, although it
may not be complete. Both dimensional and factual data can be rejected when any errors are encountered.
Reports indicate what the errors are and how they affect the completeness of the data.
Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key
relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a
subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and
loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the
users need to take into account that the data they are looking at may not be a complete picture of the
operational systems until the errors are fixed. For an operational system, this delay may affect downstream
transactions.
The development effort required to fix a Reject All scenario is minimal, since the rejected data can be
processed through existing mappings once it has been fixed. Minimal additional code may need to be written
since the data will only enter the target if it is correct, and it would then be loaded into the data mart using
the normal process.
G Reject None. This approach gives users a complete picture of the available data without having to consider
data that was not available due to it being rejected during the load process. The problem is that the data
may not be complete or accurate. All of the target data structures may contain incorrect information that
can lead to incorrect decisions or faulty transactions.
With Reject None, the complete set of data is loaded, but the data may not support correct transactions or
aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total
numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with
detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are corrected, a new loading
process needs to correct all of the target data structures, which can be a time-consuming effort based on the
delay between an error being detected and fixed. The development strategy may include removing
information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once
the target is fixed, these changes need to be propagated to all downstream data structures or data marts.
G Reject Critical. This method provides a balance between missing information and incorrect information. It
involves examining each row of data and determining the particular data elements to be rejected. All
changes that are valid are processed into the target to allow for the most complete picture. Rejected
elements are reported as errors so that they can be fixed in the source systems and loaded on a
subsequent run of the ETL process.
This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts
or updates.

Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be
summarized at various levels in the organization. Attributes provide additional descriptive information per
key element.

Inserts are important for dimensions or master data because subsequent factual data may rely on the
existence of the dimension data row in order to load properly. Updates do not affect the data integrity as
much because the factual data can usually be loaded with the existing dimensional data unless the update is
to a key element.

The development effort for this method is more extensive than Reject All since it involves classifying fields
as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The
effort also incorporates some tasks from the Reject None approach, in that processes must be developed to
fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target.
By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to
enter the target on each run of the ETL process, while at the same time screening out the unverifiable data
fields. However, business management needs to understand that some information may be held out of the
target, and also that some of the information in the target data structures may be at least temporarily
allocated to the wrong hierarchies.
Handling Errors in Dimension Profiles
Profiles are tables used to track history changes to the source data. As the source systems change, profile records
are created with date stamps that indicate when the change took place. This allows power users to review the target
data using either current (As-Is) or past (As-Was) views of the data.
A profile record should occur for each change in the source data. Problems occur when two fields change in the
source system and one of those fields results in an error. The first value passes validation, which produces a new
profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it
would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this
UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the
correction process is complicated further.
The following example represents three field values in a source system. The first row on 1/1/2000 shows the original
values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is
invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field
2 is finally fixed to Red.
Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | Open Sunday | BRed | Open 9-5
1/10/2000 | Open Sunday | BRed | Open 24hrs
1/15/2000 | Open Sunday | Red | Open 24hrs
Three methods exist for handling the creation and update of profiles:
1. The first method produces a new profile record each time a change is detected in the source. If a field value
was invalid, then the original field value is maintained.
Date | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | 1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | 1/5/2000 | Open Sunday | Black | Open 9-5
1/10/2000 | 1/10/2000 | Open Sunday | Black | Open 24hrs
1/15/2000 | 1/15/2000 | Open Sunday | Red | Open 24hrs
By applying all corrections as new profiles in this method, we simplify the process by directly applying all
changes to the source system directly to the target. Each change -- regardless if it is a fix to a previous error
-- is applied as a new change that creates a new profile. This incorrectly shows in the target that two
changes occurred to the source information when, in reality, a mistake was entered on the first change and
should be reflected in the first profile. The second profile should not have been created.
2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3.
If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile
information. If the third field changes before the second field is fixed, we show the third field changed at the
same time as the first. When the second field was fixed, it would also be added to the existing profile, which
incorrectly reflects the changes in the source system.
3. The third method creates only two new profiles, but then causes an update to the profile records on
1/15/2000 to fix the Field 2 value in both.
Date | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | 1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | 1/5/2000 | Open Sunday | Black | Open 9-5
1/10/2000 | 1/10/2000 | Open Sunday | Black | Open 24hrs
1/15/2000 | 1/5/2000 (Update) | Open Sunday | Red | Open 9-5
1/15/2000 | 1/10/2000 (Update) | Open Sunday | Red | Open 24hrs
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create
complex algorithms that handle the process correctly. It involves being able to determine when an error occurred
and examining all profiles generated since then and updating them appropriately. And, even if we create the
algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If
an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error,
causing an automated process to update old profile records, when in reality a new profile record should have been
entered.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters
a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile
records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile
records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is
decided. Once an action is decided, another process examines the existing Profile records and corrects them as
necessary. This method only delays the As-Was analysis of the data until the correction method is determined
because the current information is reflected in the new Profile.
Data Quality Edits
Quality indicators can be used to record definitive statements regarding the quality of the data received and stored
in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary
key. Quality indicators can be used to:
G Show the record and field level quality associated with a given record at the time of extract.
G Identify data sources and errors encountered in specific records.
G Support the resolution of specific record error types via an update and resubmission process.
Quality indicators can be used to record several types of errors e.g., fatal errors (missing primary key value),
missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error,
data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data
quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors
were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the
original file name and record number. These records cannot be loaded to the target because they lack a primary
key field to be used as a unique record identifier in the target.
The following types of errors cannot be processed:
G A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be
saved and used to generate a notice to the sending system indicating that x number of invalid records were
received and could not be processed. However, in the absence of a primary key, no tracking is possible to
determine whether the invalid record has been replaced or not.
G The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating
that x number of invalid records were received and could not be processed may or may not be available for
a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is
possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it
is likely that individual unique records within the file are not identifiable. While information can be provided
to the source system site indicating there are file errors for x number of records, specific problems may not
be identifiable on a record-by-record basis.
In these error types, the records can be processed, but they contain errors:
G A required (non-key) field is missing.
G The value in a numeric or date field is non-numeric.
G The value in a field does not fall within the range of acceptable values identified for the field. Typically, a
reference table is used for this validation.
When an error is detected during ingest and cleansing, the identified error type is recorded.
Quality Indicators (Quality Code Table)
The requirement to validate virtually every data element received from the source data systems mandates the
development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality
of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues, business process problems and
information technology breakdowns.
The quality indicators: 0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/
Format, 4-Invalid Data Value and 5-Outdated Reference Table in Use, apply a concise indication of the quality of
the data within specific fields for every data type. These indicators provide the opportunity for operations staff, data
quality analysts and users to readily identify issues potentially impacting the quality of the data. At the same time,
these indicators provide the level of detail necessary for acute quality problems to be remedied in a timely manner.
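For example, a simple aggregation over the quality indicator codes can highlight the fields that repeatedly fail validation. The sketch below assumes the separate-table variant described above and uses hypothetical names (DQ_INDICATOR, FIELD_NAME, DQ_CODE, LOAD_DATE); in practice the query would run against however the DQ fields are actually stored.
  -- Hypothetical trend query: count occurrences of each quality code
  -- by field and load date to expose systemic data quality problems.
  SELECT LOAD_DATE,
         FIELD_NAME,
         DQ_CODE,                 -- codes 0-5 as defined above
         COUNT(*) AS ERROR_COUNT
  FROM   DQ_INDICATOR
  WHERE  DQ_CODE <> 0             -- exclude 'No Error'
  GROUP  BY LOAD_DATE, FIELD_NAME, DQ_CODE
  ORDER  BY ERROR_COUNT DESC;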
Handling Data Errors
The need to periodically correct data in the target is inevitable. But how often should these corrections be
performed?
The correction process can be as simple as updating field information to reflect actual values, or as complex as
deleting data from the target, restoring previous loads from tape, and then reloading the information correctly.
Although we try to avoid performing a complete database restore and reload from a previous point in time, we
cannot rule this out as a possible solution.
Reject Tables vs. Source System
As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data
and the related error messages indicating the causes of error. The business needs to decide whether analysts
should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors
are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility
problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source
systems, then these fixes must be applied correctly to the target data.
Attribute Errors and Default Values
Attributes provide additional descriptive information about a dimension concept. Attributes include things like the
color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate
characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the
target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data, (e.g. to find
specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected
and reapplied to the data in the target.
When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new
record enter the target. Some rules that have been proposed for handling defaults are as follows:
Value Types | Description | Default
Reference Values | Attributes that are foreign keys to other tables | Unknown
Small Value Sets | Y/N indicator fields | No
Other | Any other type of attribute | Null or Business provided value
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value
does not translate into a reference table value, we use the Unknown value. (All reference tables contain a value of
Unknown for this purpose.)
The business should provide default values for each identified attribute. Fields that are restricted to a limited domain
of values (e.g., On/Off or Yes/No indicators), are referred to as small-value sets. When errors are encountered in
translating these values, we use the value that represents off or No as the default. Other values, like numbers, are
handled on a case-by-case basis. In many cases, the data integration process is set to populate Null into these
fields, which means undefined in the target. After a source system value is corrected and passes validation, it is
corrected in the target.
Primary Key Errors
The business also needs to decide how to handle new dimensional values such as locations. Problems occur when
the new key is actually an update to an old key in the source system. For example, a location number is assigned
and the new location is transferred to the target using the normal process; then the location number is changed due
to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the
change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error
causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An
analyst would be unable to get a complete picture.
Fixing this type of error involves integrating the two records in the target data, along with the related facts.
Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of
the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as
to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together
and the originals deleted in order to correct the data.
The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same
target data ID really represent two different IDs). In this case, it is necessary to restore the source information for
both dimensions and facts from the point in time at which the error was introduced, deleting affected records from
the target and reloading from the restore to correct the errors.
DM Facts Calculated from EDW Dimensions
If information is captured as dimensional data from the source, but used as measures residing on the fact records in
the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until
the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time
consuming and difficult to implement.
If we let the facts enter downstream target structures, we need to create processes that update them after the
dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes
simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to dimensional data, then when
we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the
following night. This nightly reprocessing continues until the data successfully enters the target data structures.
Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new
entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data
and translation tables enable the target data architecture to maintain consistent descriptions across multiple source
systems, regardless of how the source system stores the data. New entities in dimensional data include new
locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different
data for the same dimensional entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a
short code value as a primary key and a long description for reporting purposes. A translation table is associated
with each reference table to map the codes to the source system values. Using both of these tables, the ETL
process can load data from the source systems into the target structures.
The translation tables contain one or more rows for each source value and map the value to a matching row in the
reference table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data
steward would be responsible for entering in the translation table the following values:
Source Value | Code Translation
O | OFFICE
S | STORE
W | WAREHSE
These values are used by the data integration process to correctly load the target. Other source systems that
maintain a similar field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the
following entries into the translation table to maintain consistency across systems:
Source Value | Code Translation
OF | OFFICE
ST | STORE
WH | WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into
descriptions. The ETL process uses the reference table to populate the following values into the target:
Code Translation | Code Description
OFFICE | Office
STORE | Retail Store
WAREHSE | Distribution Warehouse
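Putting the translation and reference tables together, the lookup logic the ETL process performs can be pictured with the SQL sketch below. The table and column names (STG_LOCATION, LOC_TYPE_XLATE, LOC_TYPE_REF and their columns) are hypothetical, but the values correspond to the examples above, and unmatched source values default to Unknown as described earlier.
  -- Illustrative translation of a source code to its target description,
  -- defaulting to 'Unknown' when no translation or reference entry exists.
  SELECT s.LOCATION_ID,
         COALESCE(r.CODE_DESCRIPTION, 'Unknown') AS LOCATION_TYPE_DESC
  FROM   STG_LOCATION s
  LEFT JOIN LOC_TYPE_XLATE x ON x.SOURCE_VALUE     = s.SOURCE_TYPE       -- e.g., 'OF', 'ST', 'WH'
  LEFT JOIN LOC_TYPE_REF   r ON r.CODE_TRANSLATION = x.CODE_TRANSLATION; -- e.g., 'OFFICE'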
Error handling results when the data steward enters incorrect information for these mappings and needs to correct
them after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered
ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore
and reload source data from the first time the mistake was entered. Processes should be built to handle these types
of situations, including correction of the entire target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include Locations
and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These
translation tables map the source system value to the target value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same product in the target. (Other similar
translation issues may also exist, but Products serves as a good example for error handling.)
There are two possible methods for loading new dimensional entities. Either require the data steward to enter the
translation data before allowing the dimensional data into the target, or create the translation data through the ETL
process and force the data steward to review it. The first option requires the data steward to create the translation
for new entities, while the second lets the ETL process create the translation, but marks the record as Pending
Verification until the data steward reviews it and changes the status to Verified before any facts that reference it
can be loaded.
When the dimensional value is left as Pending Verification however, facts may be rejected or allocated to dummy
values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to
this issue is to generate an email each night if there are any translation table entries pending verification. The data
steward then opens a report that lists them.
A problem specific to Product is that when it is created as new, it is really just a changed SKU number. This causes
additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is
fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would
also have to be merged, requiring manual intervention.
The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same
product, but really represent two different products). In this case, it is necessary to restore the source information for
all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from
the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split
to generate correct profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs
to be established for manually entering fixed data and applying it correctly to the entire target data architecture,
including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further,
a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of
the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources
contain subsets of the required information. For example, one system may contain Warehouse and Store
information while another contains Store and Hub information. Because they share Store information, it is difficult to
decide which source contains the correct information.
When this happens, both sources have the ability to update the same row in the target. If both sources are allowed
to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared
information on only one source system, the two systems then contain different information. If the changed system is
loaded into the target, it creates a new profile indicating the information changed. When the second system is
loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another
new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be
loaded every day until the two source systems are synchronized with the same information.
To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source
that should be considered primary for the field. Then, only if the field changes on the primary source would it be
changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can
provide information toward the one profile record created for that day.
One solution to this problem is to develop a system of record for all sources. This allows developers to pull the
information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to
indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can
use the field level information to update only the fields that are marked as primary. However, this requires additional
effort by the data stewards to mark the correct source fields as primary and by the data integration team to
customize the load process.


Last updated: 01-Feb-07 18:53
Error Handling Techniques - PowerCenter Mappings
Challenge
Identifying and capturing data errors using a mapping approach, and making such errors available for
further processing or correction.
Description
Identifying errors and creating an error handling strategy is an essential part of a data integration project. In
the production environment, data must be checked and validated prior to entry into the target system. One
strategy for catching data errors is to use PowerCenter mappings and error logging capabilities to catch
specific data validation errors and unexpected transformation or database constraint errors.
Data Validation Errors
The first step in using a mapping to trap data validation errors is to understand and identify the error
handling requirements.
Consider the following questions:
What types of data errors are likely to be encountered?
Of these errors, which ones should be captured?
What process can capture the possible errors?
Should errors be captured before they have a chance to be written to the target database?
Will any of these errors need to be reloaded or corrected?
How will the users know if errors are encountered?
How will the errors be stored?
Should descriptions be assigned for individual errors?
Can a table be designed to store captured errors and the error descriptions?
Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log and the reject/bad file, thus improving performance.
Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them. This approach can be effective for many types of data content error, including: date conversion, null values intended for not null target fields, and incorrect data formats or data types.
Sample Mapping Approach for Data Validation Errors
In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from the data flow and logged in an error table. One solution is to implement a mapping similar to the one shown below:
[Figure: sample data validation error handling mapping]
An expression transformation can be employed to validate the source data, applying rules and flagging
records with one or more errors.
A router transformation can then separate valid rows from those containing the errors. It is good practice to
append error rows with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID,
for example. The MAPPING_ID would refer to the mapping name and the ROW_ID would be created by a
sequence generator.
The composite key is designed to allow developers to trace rows written to the error tables that store
information useful for error reporting and investigation. In this example, two error tables are suggested,
namely: CUSTOMER_ERR and ERR_DESC_TBL.

The table ERR_DESC_TBL, is designed to hold information about the error, such as the mapping name, the
ROW_ID, and the error description. This table can be used to hold all data validation error descriptions for
all mappings, giving a single point of reference for reporting.
The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two
additional columns: ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The
CUSTOMER_ERR table stores the entire row that was rejected, enabling the user to trace the error rows
back to the source and potentially build mappings to reprocess them.
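A sketch of the two error tables is shown below. The CUSTOMER columns shown are only those that appear in the example that follows (NAME, DOB, ADDRESS), and the data types are assumptions; rejected values are kept as strings so that invalid content can still be stored.
  -- Illustrative error tables for the sample mapping.
  CREATE TABLE CUSTOMER_ERR (
      NAME        VARCHAR(100),
      DOB         VARCHAR(20),     -- string so that invalid dates can be captured
      ADDRESS     VARCHAR(255),
      ROW_ID      INTEGER,         -- generated by the sequence generator
      MAPPING_ID  VARCHAR(50)      -- name of the mapping that rejected the row
  );
  CREATE TABLE ERR_DESC_TBL (
      FOLDER_NAME VARCHAR(50),
      MAPPING_ID  VARCHAR(50),
      ROW_ID      INTEGER,
      ERROR_DESC  VARCHAR(255),    -- e.g., 'NAME is NULL'
      LOAD_DATE   DATE,
      SOURCE      VARCHAR(50),     -- rename if SOURCE/TARGET clash with reserved words
      TARGET      VARCHAR(50)      -- in the chosen RDBMS
  );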
The mapping logic must assign a unique description for each error in the rejected row. In this example, any
null value intended for a not null target field could generate an error message such as NAME is NULL or
DOB is NULL. This step can be done in an expression transformation (e.g., EXP_VALIDATION in the
sample mapping).
After the field descriptions are assigned, the error row can be split into several rows, one for each possible
error using a normalizer transformation. After a single source row is normalized, the resulting rows can be
filtered to leave only errors that are present (i.e., each record can have zero to many errors). For example, if
a row has three errors, three error rows would be generated with appropriate error descriptions
(ERROR_DESC) in the table ERR_DESC_TBL.
The following table shows how the error data produced may look.
Table Name: CUSTOMER_ERR
NAME | DOB | ADDRESS | ROW_ID | MAPPING_ID
NULL | NULL | NULL | 1 | DIM_LOAD
Table Name: ERR_DESC_TBL
FOLDER_NAME | MAPPING_ID | ROW_ID | ERROR_DESC | LOAD_DATE | SOURCE | TARGET
CUST | DIM_LOAD | 1 | Name is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST | DIM_LOAD | 1 | DOB is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST | DIM_LOAD | 1 | Address is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER
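Because the two tables share ROW_ID and MAPPING_ID, an error report can be produced with a join such as the sketch below (table structures as assumed in the DDL sketch above).
  -- Illustrative error report joining rejected rows to their error descriptions.
  SELECT e.FOLDER_NAME,
         e.MAPPING_ID,
         e.ROW_ID,
         e.ERROR_DESC,
         c.NAME,
         c.DOB,
         c.ADDRESS
  FROM   ERR_DESC_TBL e
  JOIN   CUSTOMER_ERR c
         ON  c.ROW_ID     = e.ROW_ID
         AND c.MAPPING_ID = e.MAPPING_ID
  ORDER  BY e.MAPPING_ID, e.ROW_ID;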

The efficiency of a mapping approach can be increased by employing reusable objects. Common logic
should be placed in mapplets, which can be shared by multiple mappings. This improves productivity in
implementing and managing the capture of data validation errors.
Data validation error handling can be extended by including mapping logic to grade error severity. For
example, flagging data validation errors as soft or hard.
A hard error can be defined as one that would fail when being written to the database, such as a
constraint error.
A soft error can be defined as a data content error.
A record flagged as hard can be filtered from the target and written to the error tables, while a record
flagged as soft can be written to both the target system and the error tables. This gives business analysts
an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for
end-user reporting.
Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in
the source systems. The advantage of the mapping approach is that all errors are identified as either data
errors or constraint errors and can be properly addressed. The mapping approach also reports errors based
on projects or categories by identifying the mappings that contain errors. The most important aspect of the
mapping approach however, is its flexibility. Once an error type is identified, the error handling logic can be
placed anywhere within a mapping. By using the mapping approach to capture identified errors, the
operations team can effectively communicate data quality issues to the business users.
Constraint and Transformation Errors
Perfect data can never be guaranteed. In implementing the mapping approach described above to detect
errors and log them to an error table, how can we handle unexpected errors that arise in the load? For
example, PowerCenter may apply the validated data to the database; however the relational database
management system (RDBMS) may reject it for some unexpected reason. An RDBMS may, for
example, reject data if constraints are violated. Ideally, we would like to detect these database-level errors
automatically and send them to the same error table used to store the soft errors caught by the mapping
approach described above.
In some cases, the stop on errors session property can be set to 1 to stop source data for which
unhandled errors were encountered from being loaded. In this case, the process will stop with a failure, the
data must be corrected, and the entire source may need to be reloaded or recovered. This is not always an
acceptable approach.
An alternative might be to have the load process continue in the event of records being rejected, and then
reprocess only the records that were found to be in error. This can be achieved by configuring the stop on
errors property to 0 and switching on relational error logging for a session. By default, the error-messages
from the RDBMS and any un-caught transformation errors are sent to the session log. Switching on
relational error logging redirects these messages to a selected database in which four tables are
automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS.
The PowerCenter Workflow Administration Guide contains detailed information on the structure of these
tables. However, the PMERR_MSG table stores the error messages that were encountered in a session.
The following four columns of this table allow us to retrieve any RDBMS errors:
SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange
(MX) View REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.
TRANS_NAME: Name of the transformation where an error occurred. When a RDBMS error
occurs, this is the name of the target transformation.
TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the
row number at the target when the error occurred.
ERROR_MSG: Error message generated by the RDBMS
With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load
session (i.e., an additional PowerCenter session) can be implemented to read the PMERR_MSG table, join
it with the MX View REP_LOAD_SESSIONS in the repository, and insert the error details into
ERR_DESC_TBL. When the post process ends, ERR_DESC_TBL will contain both soft errors and hard
errors.
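The core of such a post-load session can be sketched as the statement below. The PMERR_MSG columns are those listed above, but the REP_LOAD_SESSIONS column names and the join condition are assumptions and must be verified against the MX view definitions for the repository version in use; the error table structure follows the earlier sketch.
  -- Hypothetical post-load extract of RDBMS/transformation errors into ERR_DESC_TBL.
  -- Verify the REP_LOAD_SESSIONS column names and join key for your repository version.
  INSERT INTO ERR_DESC_TBL
         (FOLDER_NAME, MAPPING_ID, ROW_ID, ERROR_DESC, LOAD_DATE, SOURCE, TARGET)
  SELECT ls.SUBJECT_AREA,
         ls.MAPPING_NAME,
         em.TRANS_ROW_ID,
         em.ERROR_MSG,
         CURRENT_DATE,
         NULL,                    -- source key resolved via the translation table described below
         em.TRANS_NAME            -- target transformation where the error occurred
  FROM   PMERR_MSG em
  JOIN   REP_LOAD_SESSIONS ls
         ON ls.SESSION_ID = em.SESS_INST_ID;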
One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide
lineage. This can be difficult when the source and target rows are not directly related (i.e., one source row
can actually result in zero or more rows at the target). In this case, the mapping that loads the source must
write translation data to a staging table (including the source key and target row number). The translation
table can then be used by the post-load session to identify the source key by the target row number
retrieved from the error log. The source key stored in the translation table could be a row number in the case
of a flat file, or a primary key in the case of a relational data source.
Reprocessing
After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed
by members of the business or operational teams. The rows listed in this table have not been loaded into the
target database. The operations team can, therefore, fix the data in the source that resulted in soft errors
and may be able to explain and remediate the hard errors.
Once the errors have been fixed, the source data can be reloaded. Ideally, only the rows resulting in errors
during the first run should be reprocessed in the reload. This can be achieved by including a filter and a
lookup in the original load mapping and using a parameter to configure the mapping for an initial load or for
a reprocess load. If the mapping is reprocessing, the lookup searches for each source row number in the
error table, while the filter removes source rows for which the lookup has not found errors. If initial loading,
all rows are passed through the filter, validated, and loaded.
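In SQL terms, the reprocess-mode behaviour is roughly equivalent to the selection sketched below. The staging table STG_CUSTOMER and its SOURCE_ROW_ID column are assumptions; in the actual mapping this logic is implemented with a lookup on the error table and a filter transformation driven by a mapping parameter rather than a source qualifier override.
  -- Illustrative reprocess-mode selection: pick up only source rows whose
  -- row identifiers were previously written to the error table.
  SELECT s.*
  FROM   STG_CUSTOMER s
  WHERE  EXISTS (SELECT 1
                 FROM   ERR_DESC_TBL e
                 WHERE  e.MAPPING_ID = 'DIM_LOAD'        -- mapping being reprocessed
                 AND    e.ROW_ID     = s.SOURCE_ROW_ID); -- assumed source row identifier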
With this approach, the same mapping can be used for initial and reprocess loads. During a reprocess run,
the records successfully loaded should be deleted (or marked for deletion) from the error table, while any
new errors encountered should be inserted as if an initial run. On completion, the post-load process is
executed to capture any new RDBMS errors. This ensures that reprocessing loads are repeatable and result
in reducing numbers of records in the error table over time.
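Expressed as SQL for clarity, the row selection behind the filter and lookup might look like the sketch below.
In the mapping itself this is implemented with a Lookup transformation against the error table, a Filter
transformation, and a mapping parameter; $$LOAD_MODE, SRC_CUSTOMER_STG, and SOURCE_KEY are
illustrative names, not objects defined in this document.

-- Illustrative reprocess selection. In INITIAL mode every source row is passed;
-- in REPROCESS mode only rows recorded in the error table are passed.
SELECT s.*
FROM   SRC_CUSTOMER_STG s
WHERE  '$$LOAD_MODE' = 'INITIAL'
   OR  EXISTS (SELECT 1
               FROM   ERR_DESC_TBL e
               WHERE  e.source_key = s.source_key);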
Last updated: 01-Feb-07 18:53
Error Handling Techniques - PowerCenter Workflows and Data
Analyzer
Challenge
Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and
reprocess the corrected data.
Description
Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in
an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the
standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.
The first step in implementing an error handling strategy is to understand and define the error handling requirement.
Consider the following questions:
G What tools and methods can help in detecting all the possible errors?
G What tools and methods can help in correcting the errors?
G What is the best way to reconcile data across multiple systems?
G Where and how will the errors be stored? (i.e., relational tables or flat files)
A robust error handling strategy can be implemented using PowerCenter's built-in error handling capabilities along with
Data Analyzer as follows:
G Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process
failures.
G Data Errors: Set up the ETL process to:
H Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables
for analysis, correction, and reprocessing.
H Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
H Set up customized Data Analyzer reports and dashboards at the project level to provide information on
failed sessions, sessions with failed rows, load time, etc.
Configuring an Email Task to Handle Process Failures
Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the
event of a session failure. Create a reusable email task and use it in the On Failure Email property settings in the
Components tab of the session, as shown in the following figure.
When you configure the subject and body of a post-session email, use email variables to include information about the
session run, such as session name, mapping name, status, total number of records loaded, and total number of records
rejected. The following table lists all the available email variables:
Email Variables for Post-Session Email
Email Variable   Description
%s               Session name.
%e               Session status.
%b               Session start time.
%c               Session completion time.
%i               Session elapsed time (session completion time - session start time).
%l               Total rows loaded.
%r               Total rows rejected.
%t               Source and target table details, including read throughput in bytes per second and write
                 throughput in rows per second. The PowerCenter Server includes all information displayed
                 in the session detail dialog box.
%m               Name of the mapping used in the session.
%n               Name of the folder containing the session.
%d               Name of the repository containing the session.
%g               Attach the session log to the message.
%a<filename>     Attach the named file. The file must be local to the PowerCenter Server. The following are
                 valid file names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>.
                 Note: The file name cannot include the greater than character (>) or a line break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include
these variables in the email message only.
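As an illustration, a post-session notification might be configured along the following lines. The wording is
only a suggestion, and %g is placed in the message body as required by the note above:

Email Subject: Session %s failed in folder %n (repository %d)
Email Text:    Session %s, mapping %m, completed with status %e.
               Start: %b   End: %c   Elapsed: %i
               Rows loaded: %l   Rows rejected: %r
               %g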
Configuring Row Error Logging in PowerCenter
PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these
tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when
compared with a custom error handling solution.
When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the
PowerCenter Server logs error information that allows you to determine the cause and source of the error. The
PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error
code, error message, repository name, folder name, session name, and mapping information. This error metadata is
logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR()
function, such as business rule violations.
Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you
enable error logging and choose the Relational Database Error Log Type, the PowerCenter Server offers you the
following features:
G Generates the following tables to help you track row errors:
H PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source
row.
H PMERR_MSG. Stores metadata about an error and the error message.
H PMERR_SESS. Stores metadata about the session.
H PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype,
when a transformation error occurs.
G Appends error data to the same tables cumulatively, if they already exist, for subsequent runs of the
session.
G Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors
to go to one set of error tables, you can specify the prefix as EDW_.
G Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do
this, you specify the same error log table name prefix for all sessions.
Example:
In the following figure, the session s_m_Load_Customer loads Customer Data into the EDW Customer table. The
Customer Table in EDW has the following structure:
CUSTOMER_ID NOT NULL NUMBER (PRIMARY KEY)
CUSTOMER_NAME NULL VARCHAR2(30)
CUSTOMER_STATUS NULL VARCHAR2(10)
There is a primary key constraint on the column CUSTOMER_ID.
To take advantage of PowerCenter's built-in error handling features, you would set the session properties as shown
below:
The session property Error Log Type is set to Relational Database, and the Error Log DB Connection and Error
Log Table Name Prefix values are set accordingly.
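For reference, the error-handling settings on such a session would resemble the following sketch (the
connection name is an example only):

Error Log Type              : Relational Database
Error Log DB Connection     : ERROR_DB_CONN        (example relational connection)
Error Log Table Name Prefix : EDW_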
When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes
information into the Error Tables as shown below:
EDW_PMERR_DATA
Columns: WORKFLOW_RUN_ID, WORKLET_RUN_ID, SESS_INST_ID, TRANS_NAME, TRANS_ROW_ID, TRANS_ROW_DATA,
SOURCE_ROW_ID, SOURCE_ROW_TYPE, SOURCE_ROW_DATA, LINE_NO
Row 1: 8, 0, 3, Customer_Table, 1, D:1001:000000000000|D:Elvis Pres|D:Valid, -1, -1, N/A, 1
Row 2: 8, 0, 3, Customer_Table, 2, D:1002:000000000000|D:James Bond|D:Valid, -1, -1, N/A, 1
Row 3: 8, 0, 3, Customer_Table, 3, D:1003:000000000000|D:Michael Ja|D:Valid, -1, -1, N/A, 1

EDW_PMERR_MSG
Columns: WORKFLOW_RUN_ID, SESS_INST_ID, SESS_START_TIME, REPOSITORY_NAME, FOLDER_NAME, WORKFLOW_NAME,
TASK_INST_PATH, MAPPING_NAME, LINE_NO
Row 1: 6, 3, 9/15/2004 18:31, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 2: 7, 3, 9/15/2004 18:33, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 3: 8, 3, 9/15/2004 18:34, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1

EDW_PMERR_SESS
Columns: WORKFLOW_RUN_ID, SESS_INST_ID, SESS_START_TIME, REPOSITORY_NAME, FOLDER_NAME, WORKFLOW_NAME,
TASK_INST_PATH, MAPPING_NAME, LINE_NO
Row 1: 6, 3, 9/15/2004 18:31, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 2: 7, 3, 9/15/2004 18:33, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 3: 8, 3, 9/15/2004 18:34, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1

EDW_PMERR_TRANS
Columns: WORKFLOW_RUN_ID, SESS_INST_ID, TRANS_NAME, TRANS_GROUP, TRANS_ATTR, LINE_NO
Row 1: 8, 3, Customer_Table, Input, "Customer_Id:3, Customer_Name:12, Customer_Status:12", 1
By looking at the workflow run ID and other fields, you can analyze the errors, fix the underlying data, and reprocess the affected rows.
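Using the columns shown in the tables above, a query along these lines pulls the rejected row data together
with its session context for analysis; verify the join columns against the error table definitions for your
PowerCenter version.

-- Illustrative error-analysis query against the centralized error tables.
SELECT s.folder_name,
       s.workflow_name,
       s.mapping_name,
       d.trans_name,
       d.trans_row_data
FROM   EDW_PMERR_DATA d
       JOIN EDW_PMERR_SESS s
         ON  s.workflow_run_id = d.workflow_run_id
         AND s.sess_inst_id    = d.sess_inst_id
WHERE  d.workflow_run_id = 8;    -- example run from the tables above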
Error Detection and Notification using Data Analyzer
Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer
is Informatica's business intelligence tool that is used to provide insight into the PowerCenter repository
metadata.
You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into
production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best
practices:
G Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever there is an
entry made into the error tables PMERR_DATA or PMERR_TRANS.
G Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter
folders for easy analysis.
G Configure reports to provide detailed information of the row level errors for each session. This can be
accomplished by using the four error tables as sources of data for the reports
Data Reconciliation Using Data Analyzer
Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS
to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing
tedious queries, comparing two separately produced reports, or using constructs such as DBLinks.
Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your
company's data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provide a reliable and
reusable way to accomplish data reconciliation. Using Data Analyzer's reporting capabilities, you can select data from
various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers
through aggregate reports. You can further schedule the reports to run automatically every time the relevant
PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users in case of any
discrepancies.
For example, a report can be created to ensure that the same number of customers exist in the ODS in comparison to a
data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by
comparing key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation
reports can be run automatically after PowerCenter loads the data, or they can be run by technical or business users on
demand. This process allows users to verify the accuracy of data and builds confidence in the data warehouse solution.
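A reconciliation metric of this kind usually reduces to a pair of simple aggregate queries that the report can
compare; the schema and table names below are placeholders for illustration.

-- Illustrative reconciliation check: customer counts should match across tiers.
SELECT 'ODS'            AS data_store, COUNT(*) AS customer_count FROM ods.customer
UNION ALL
SELECT 'DATA WAREHOUSE' AS data_store, COUNT(*) AS customer_count FROM dw.dim_customer;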


Last updated: 09-Feb-07 14:22
Planning the ICC Implementation
Challenge
The first stage in the creation of an Integration Competency Center (ICC) is the selection of the
appropriate organizational model for the services to be provided; this process is documented in the
Best Practice Selecting the Right ICC Model.
This Best Practice is concerned with planning the construction of the ICC itself. There are several
challenges in describing this process in a single document; the resources required obviously vary
according to the organizational model selected.
Description
At the end of the ICC selection process, one of the organizational models described in the Best Practice
Selecting the Right ICC Model will have been selected for implementation.
The following stages need to be undertaken to implement the ICC:
G Select the initial project
G Identify the resource needs
G Establish the 30/60/90-day plan
Choosing Projects for the ICC
The first step is the considered selection of a pilot project for the ICC; it may be advisable to
choose a moderately challenging project so that the ICC can build on success. However, the
project that is selected should be representative of the projects to be undertaken in the first year of
the ICC's existence.
Another criterion for selecting the first project is its contribution to the ICC resource pool. The
deliverables created by the first projects can serve as examples and templates for the processes to
be adopted by the ICC as standards. Documents that fall into this category are: naming standards,
sizing guides, performance tuning guidelines, deployment processes, low level design documents,
project plans, etc. Supplementary resources can be found in the Velocity sample deliverables on
Informatica's customer portal at http://my.informatica.com.
Identify the Resource Needs
The resources required by an ICC fall into two broad categories.
First, a resource is needed to implement an ICC infrastructure and drive organizational change.
Typically, one FTE is required to architect, design, and build the technical infrastructure that will
support the ICC. This involves architecting the physical hardware (i.e., servers, SAN, network, etc)
as well as the software (i.e., PowerCenter, PowerExchange, options, etc) to support any projects
that are likely to use the shared resources of the ICC. The resource selected should have the
required technical skills for this complex task.
The second type of resource is required to provide whatever development/operational support
services are within the remit of the ICC organizational model selected. Once again, the calibre of
the staff selected should reflect the importance of achieving success in the ICCs first projects.
Establishing the 30/60/90/120 Day Plan
The purpose of the 30/60/90 day plan is to ensure that incremental deliverables are accomplished
with respect to the implementation of the ICC. This also provides opportunities to enjoy successes
at each step of the way and to communicate those successes to the ICC executive sponsor as
each milestone is achieved during the 30/60/90 day plan.
It is also important to note that, because the central services ICC model is estimated at 6+ months to
fully implement, Informatica has provided a 120+ Day Plan category for deliverables that may be
associated with a central services model, as these fall outside the scope of the 30/60/90 day plan.
30 Day Plan
The following plan outlines the people, process, and technology steps that should occur during the
first 30 days of the ICC rollout:
People
G Identify, assemble, and budget for the human resources necessary to support the ICC
rollout; typically, one technical FTE is a good place to start.
G Identify, estimate, and budget for the necessary technical resources (e.g., hardware,
software). Note: To encourage projects to utilize the ICC model, it can often be effective to
provide hardware and software resources without any internal chargeback for the first year
of the ICC's existence. Alternatively, the hardware and software costs can be funded by
the projects that are likely to leverage the ICC.
Processes
G Identify and start planning for the initial projects that will utilize the ICC shared services.
Technology
G Implement a short-term technical infrastructure for the ICC. This includes implementing
the hardware and software required to support the initial five projects (or projects within
the scope of the first year of the ICC) in both a development and production capacity.
Typically, this technical infrastructure is not the end-goal configuration, but it should
include a hardware and software configuration that can easily meld into the end-goal
configuration. The hardware and software requirements of the short-term technical
infrastructure are generally limited to the components required for the projects that will
leverage the infrastructure during the first year.
60 Day Plan
People
G Provide the shared resources to support the ongoing projects that are utilizing the ICC
shared services in development. These resources need to support the deployment of
objects between environments (dev/test/production) and support the monitoring of ongoing
production processes (a.k.a production support).
Processes
G Start building, establishing, and communicating processes that are going to be required to
support the ICC. These include:
H Naming standards
H Code/mapping reviews
H Deployment processes
H Performance tuning techniques
H Detailed design documents
Technology
G Build out additional features into the short-term technical infrastructure that can improve
service levels of the ICC and reduce costs. Examples include:
H PowerCenter Metadata Reporter
H PowerCenter Team Based Development Model
H Metadata Manager
H Data Profiling and Cleansing options
H Various connectivity options, including PowerExchange and PowerCenter Connects
90 Day Plan
People
G Continue to provide production support shared services to projects leveraging the ICC
infrastructure.
G Provide training to project teams on additional ICC capabilities available to them (i.e.,
implemented during the 60 day plan).
Processes
G Finalize and fully communicate all ICC processes (i.e., the processes listed in the 30 day
plan)
G Develop a governance plan such that all objects/code developed for projects leveraging
the ICC are reviewed by a governing board of architects and senior developers before
being migrated into production. This governance ensures that only high-quality projects
are placed in production.
G Establish SLAs between the projects leveraging the ICC shared services and the ICC
itself.
G Begin work on a chargeback model such that projects that join the ICC after the first
year provide an internal transfer of funds to support the ICC based on their usage of the
ICC shared services. Typically, chargeback models are based upon the CPU utilization consumed
in production by the project on a monthly basis.
H PowerCenter 8.1 logs metadata in the repository regarding the amount of CPU used
by a particular process. For this reason, PowerCenter 8.1 is a key technology that
should be leveraged for ICC implementations.
H The ICC chargeback model should be flexible in that the project manager can choose
between a number of options for levels of support. For example, projects have
different SLA requirements and a project that requires 24/7 high availability and
dedicated hardware should have a different, more expensive, chargeback than a
similar project that does not require high availability or dedicated hardware.
H The ICC chargeback model should reflect costs that are lower than the costs that the
project would otherwise have to pay to various hardware, software, and services
vendors if they were to go down the path of a project silo approach.
Technology
G As projects join the ICC that have disaster recovery/failover needs, the appropriate
implementation of DR/Failover should be completed for the ICC infrastructure. This usually
happens in the first 90 days of the ICC.
120+ Day Plan
Assuming a central services ICC model is chosen, the following plan outlines the people, process,
and technology steps that should occur during the first six to nine months of the ICC rollout.
G Implement a long-term technical infrastructure, including both hardware and software. This
long-term technical infrastructure can generally provide cost-effective options for
horizontal scaling such as leveraging Informaticas Enterprise Grid capabilities with a
relatively inexpensive hardware platform, such as Linux or Windows.
G Proactively implement additional software components that can be leveraged by ICC
customers/projects. Examples include:
H High Availability
H Enterprise Grid
H Unstructured Data Option
G After the initial project successes leveraging the ICC shared services model, establish the
ICC as the enterprise standard for all data integration project needs.
G Provide additional chargeback models offering greater flexibility to the ICC customers/
projects
G The ICC should expand its services offerings beyond simple development and production
support to include shared services resources that can be shared across projects during
the development and testing phases of the project. Examples of such resources include
Data Architects, Data Modelers, Development resources, and/or QA resources.
G Establish an ICC Help Desk that provides 24x7 production support similar to an
operator in the mainframe environment.
G Consider negotiating with hardware vendors for more flexible offerings.


Last updated: 09-Feb-07 14:09
Selecting the Right ICC Model
Challenge
With increased pressure on IT productivity, many companies are rethinking the "independence" of
data integration projects that has resulted in an inefficient, piecemeal, or silo-based approach to each
new project. Furthermore, as each group within a business attempts to integrate its data, it
unknowingly duplicates effort the company has already invested -- not just in the data integration
itself, but also the effort spent on developing practices, processes, code, and personnel expertise.
An alternative to this expensive redundancy is to create some type of "integration competency
center" (ICC). An ICC is an IT approach that provides teams throughout an organization with best
practices in integration skills, processes, and technology so that they can complete data integration
projects consistently, rapidly, and cost-efficiently.
What types of services should an ICC offer? This Best Practice provides an overview to help you
consider the appropriate structure for your ICC.
More information is available in the following publication: Integration Competency Center: An
Implementation Methodology by John Schmidt and David Lyle, Copyright 2005 Informatica
Corporation.
Description
Objectives
Typical ICC objectives include:
G Promoting data integration as a formal discipline.
G Developing a set of experts with data integration skills and processes, and leveraging their
knowledge across the organization.
G Building and developing skills, capabilities, and best practices for integration processes
and operations.
G Monitoring, assessing, and selecting integration technology and tools.
G Managing integration pilots.
G Leading and supporting integration projects with the cooperation of subject matter experts.
G Reusing development work such as source definitions, application interfaces, and codified
business rules.
Benefits
Although a successful project that shares its lessons with other teams can be a great way to begin
developing organizational awareness of the value of an ICC, setting up a more formal ICC requires
upper management buy-in and funding. Here are some of the typical benefits that can be realized
from doing so:
G Rapid development of in-house expertise through coordinated training and shared
knowledge.
G Leverage shared resources and "best practice" methods and solutions.
G More rapid project deployments.
G Higher quality/reduced risk data integration projects.
G Reduced costs of project development and maintenance.
When examining the move toward an ICC model that optimizes and (in certain situations)
centralizes integration functions, consider two things: the problems, costs and risks associated with
a project silo-based approach, and the potential benefits of an ICC environment.
What Services Should be in an ICC?
The common services provided by ICCs can be divided into four major categories:
G Knowledge Management
G Environment
G Development Support
G Production Support
Having considered the service categories, the appropriate ICC Organizational Model can be
selected.
Knowledge Management
Training
G Standards Training (Training Coordinator) - Training of best practices, including but not
limited to, naming conventions, unit test plans, configuration management strategy, and
project methodology.
G Product Training (Training Coordinator) - Co-ordination of vendor-offered or internally-
sponsored training of specific technology products.
Standards
G Standards Development (Knowledge Coordinator) - Creating best practices, including but not
limited to, naming conventions, unit test plans, and coding standards.
G Standards Enforcement (Knowledge Coordinator) - Enforcing development teams to use documented
best practices through formal development reviews, metadata reports, project audits or other means.
G Methodology (Knowledge Coordinator) - Creating methodologies to support development
initiatives. Examples include methodologies for rolling out data warehouses and data
integration projects. Typical topics in a methodology include, but are not limited to:
H Project Management
H Project Estimation
H Development Standards
H Operational Support
G Mapping Patterns (Knowledge Coordinator) - Developing and maintaining mapping
patterns (templates) to speed up development time and promote mapping standards
across projects.
Technology
G Emerging Technologies (Technology Leader) - Assessing emerging technologies and
determining if/where they fit in the organization and policies around their adoption/use.
G Benchmarking (Technology Leader) - Conducting and documenting tests on hardware and
software in the organization to establish performance benchmarks.
Metadata
G Metadata Standards (Metadata Administrator) - Creating standards for capturing and
maintaining metadata. For example, database column descriptions can be captured in
ErWin and pushed to PowerCenter via Metadata Exchange.
G Metadata Enforcement (Metadata Administrator) - Enforcing development teams to
conform to documented metadata standards.
G Data Integration Catalog (Metadata Administrator) - Tracking the list of systems involved
in data integration efforts, the integration between systems, and the use of/subscription to
data integration feeds. This information is critical to managing the interconnections in the
environment in order to avoid duplication of integration efforts. The Catalog can also assist
in understanding when particular integration feeds are no longer needed.
Environment
Hardware
G Vendor Selection and Management (Vendor Manager) - Selecting vendors for the
hardware tools needed for integration efforts that may span Servers, Storage and network
facilities.
G Hardware Procurement (Vendor Manager) - Responsible for the purchasing process for
hardware items that may include receiving and cataloging the physical hardware items.
G Hardware Architecture (Technical Architect) - Developing and maintaining the physical
layout and details of the hardware used to support the Integration Competency Center.
G Hardware Installation (Product Specialist) - Setting up and activating new hardware as it
becomes part of the physical architecture supporting the Integration Competency Center.
G Hardware Upgrades (Product Specialist) - Managing the upgrade of hardware including
operating system patches, additional CPU/memory upgrades, replacing old technology, etc.
Software
G Vendor Selection and Management (Vendor Manager) - Selecting vendors for the
software tools needed for integration efforts. Activities may include formal RFPs, vendor
presentation reviews, software selection criteria, maintenance renewal negotiations and all
activities related to managing the software vendor relationship.
G Software Procurement (Vendor Manager) - Responsible for the purchasing process for
software packages and licenses.
G Software Architecture (Technical Architect) - Developing and maintaining the architecture
of the software package(s) used in the competency center. This may include flowcharts
and decision trees of what software to select for specific tasks.
G Software Installation (Product Specialist) - Setting up and installing new software as it
becomes part of the physical architecture supporting the Integration Competency Center.
G Software Upgrade (Product Specialist) - Managing the upgrade of software including
patches and new releases. Depending on the nature of the upgrade, significant planning
and rollout efforts may be required during upgrades. (Training, testing, physical installation
on client machines etc.)
G Compliance (Licensing) (Vendor Manager) - Monitoring and ensuring proper licensing
compliance across development teams. Formal audits or reviews may be scheduled.
Physical documentation should be kept matching installed software with purchased
licenses.
Professional Services
G Vendor Selection and Management (Vendor Manager) - Selecting vendors for professional
services efforts related to integration efforts. Activities may include managing vendor
rates and bulk discount negotiations, payment of vendors, reviewing past vendor work
efforts, managing list of "preferred" vendors etc.
G Vendor Qualification (Vendor Manager) - Conducting formal vendor interviews as
consultants/ contracts are proposed for projects, checking vendor references and
certifications, formally qualifying selected vendors for specific work tasks (i.e., Vendor A is
qualified for Java development while Vendor B is qualified for ETL and EAI work.)
Security
G Security Administration (Security Administrator) - Providing access to the tools and
technology needed to complete data integration development efforts including software
user id's, source system user id/passwords, and overall data security of the integration
efforts. Ensures enterprise security processes are followed.
G Disaster Recovery (Technical Architect) - Performing risk analysis in order to develop and
execute a plan for disaster recovery including repository backups, off-site backups,
failover hardware, notification procedures and other tasks related to a catastrophic failure
(i.e., server room fire destroys dev/prod servers).
Financial
G Budget (ICC Manager) - Yearly budget management for the Integration Competency
Center. Responsible for managing outlays for services, support, hardware, software and
other costs.
G Departmental Cost Allocation (ICC Manager) - For clients where shared services costs are
to be spread across departments/ business units for cost purposes. Activities include
defining metrics uses for cost allocation, reporting on the metrics, and applying cost
factors for billing on a weekly/monthly or quarterly basis as dictated.
Scalability/Availability
G High Availability (Technical Architect) - Designing and implementing hardware, software
and procedures to ensure high availability of the data integration environment.
G Capacity Planning (Technical Architect) - Designing and planning for additional integration
capacity to address the growth in size and volume of data integration in the future for the
organization.
Development Support
Performance
G Performance and Tuning (Product Specialist) - Providing targeted performance and tuning
assistance for integration efforts. Providing on-going assessments of load windows and
schedules to ensure service level agreements are being met.
Shared Objects
G Shared Object Quality Assurance (Quality Assurance) - Providing quality assurance
services for shared objects so that objects conform to standards and do not adversely
affect the various projects that may be using them.
G Shared Object Change Management (Change Control Coordinator) - Managing the
migration to production of shared objects which may impact multiple project
teams. Activities include defining the schedule for production moves, notifying teams of
changes, and coordinating the migration of the object to production.
G Shared Object Acceptance (Change Control Coordinator) - Defining and documenting the
criteria for a shared object and officially certifying an object as one that will be shared
across project teams.
G Shared Object Documentation (Change Control Coordinator) - Defining the standards for
documentation of shared objects and maintaining a catalog of all shared objects and their
functions.
Project Support
G Development Helpdesk (Data Integration Developer) - Providing a helpdesk of expert
product personnel to support project teams. This will provide project teams new to
developing data integration routines with a place to turn to for experienced guidance.
G Software/Method Selection (Technical Architect) - Providing a workflow or decision tree to
use when deciding which data integration technology to use for a given technology
request.
G Requirements Definition (Business/Technical Analyst) - Developing the process to gather
and document integration requirements. Depending on the level of service, activity may
include assisting or even fully gathering the requirements for the project.
G Project Estimation (Project Manager) - Developing project estimation models and provide
estimation assistance for data integration efforts.
G Project Management (Project Manager) - Providing full time management resources
experienced in data integration to ensure successful projects.
G Project Architecture Review (Data Integration Architect) - Providing project level
architecture review as part of the design process for data integration projects. Helping
ensure standards are met and the project architecture fits within the enterprise
architecture vision.
G Detailed Design Review (Data Integration Developer) - Reviewing design specifications in
detail to ensure conformance to standards and identifying any issues upfront before
development work is begun.
G Development Resources (Data Integration Developer) - Providing product-skilled
resources for completion of the development efforts.
G Data Profiling (Data Integration Developer) - Providing data profiling services to identify
data quality issues. Develop plans for addressing issues found in data profiling.
G Data Quality (Data Integration Developer) - Defining and meeting data quality levels and
thresholds for data integration efforts.
Testing
G Unit Testing (Quality Assurance ) - Defining and executing unit testing of data integration
processes. Deliverables include documented test plans, test cases and verification against
end-user acceptance criteria.
G System Testing (Quality Assurance) - Defining and performing system testing to ensure
that data integration efforts work seamlessly across multiple projects and teams.
Cross Project Integration
G Schedule Management/Planning (Data Integration Developer) - Providing a single point for
managing load schedules across the physical architecture to make best use of available
resources and appropriately handle integration dependencies.
G Impact Analysis (Data Integration Developer) - Providing impact analysis on proposed and
scheduled changes that may impact the integration environment. Changes include, but are
not limited to, system enhancements, new systems, retirement of old systems, data
volume changes, shared object changes, hardware migration and system outages.
Production Support
Issue Resolution
G Operations Helpdesk (Production Operator) - First line of support for operations issues providing high
level issue resolution. Helpdesk should provide field support for cases and issues related to scheduled
jobs, system availability and other production support tasks.
G Data Validation (Quality Assurance) - Providing data validation on integration load
tasks. Data may be "held" from end-user access until some level of data validation has
been performed. This may range from a manual review of load statistics to an automated review of record
counts including grand total comparisons, expected size thresholds, or any other metric an
organization may define to catch potential data inconsistencies before reaching end users.
Production Monitoring
G Schedule Monitoring (Production Operator) - Nightly/daily monitoring of the data integration load jobs.
Ensuring jobs are properly initiated, are not being delayed, and ensuring successful completion. May
provide first level support to the load schedule while escalating issues to the appropriate support teams.
G Operations Metadata Delivery (Production Operator) - Responsible for providing metadata
to system owners and end users regarding the production load process including load
times, completion status, known issues and other pertinent information regarding the
current state of the integration job stream.
Change Management
G Object Migration (Change Control Coordinator) - Coordinating movement of development
objects and processes to production. May even physically control migration such that all
migration is scheduled, managed, and performed by the ICC.
G Change Control Review (Change Control Coordinator) - Conducting formal and informal
reviews of production changes before migration is approved. At this time, standards may
be enforced, system tuning reviewed, production schedules updated, and formal sign off
to production changes is issued.
G Process Definition (Change Control Coordinator) - Developing and documenting the
change management process such that development objects are efficiently and flawlessly
migrated into the production environment. This may include notification rules, schedule
migration plans, emergency fix procedures etc.
Choosing an ICC Model
The organizational options for developing multiple integration applications are shown below:
The higher the degree of centralization, the greater the potential cost savings. Some organizations
have the flexibility to easily move toward central services, while others don't, either due to
organizational or regulatory constraints. There is no ideal model, just one that is appropriate to the
environment in which it operates.
To assist the selection of the appropriate ICC model, the Services described above are mapped to
the Organizational Models below:

The adoption of the Central Services model does not necessarily mandate the inclusion of all applications
within the orbit of the ICC. Some projects require very specific SLAs (Service Level Agreements) that are
much more stringent than those of other projects, and as such they may require a less stringent ICC model.
Last updated: 09-Feb-07 14:51
Creating Inventories of Reusable Objects &
Mappings
Challenge
Successfully identify the need and scope of reusability. Create inventories of reusable objects
within a folder or shortcuts across folders (Local shortcuts) or shortcuts across repositories
(Global shortcuts).
Successfully identify and create inventories of mappings based on business rules.
Description
Reusable Objects
Prior to creating an inventory of reusable objects or shortcut objects, be sure to review the
business requirements and look for any common routines and/or modules that may appear in
more than one data movement. These common routines are excellent candidates for reusable
objects or shortcut objects. In PowerCenter, these objects can be created as:
G single transformations (i.e., lookups, filters, etc.)
G a reusable mapping component (i.e., a group of transformations - mapplets)
G single tasks in workflow manager (i.e., command, email, or session)
G a reusable workflow component (i.e., a group of tasks in workflow manager - worklets).
Please note that shortcuts are not supported for workflow level objects (Tasks).
Identify the need for reusable objects based on the following criteria:

G Is there enough usage and complexity to warrant the development of a common object?
G Are the data types of the information passing through the reusable object the same from
case to case, or is it simply the same high-level steps with different fields and data?

Identify the Scope based on the following criteria:

G Do these objects need to be shared within the same folder? If so, then create re-usable
objects within the folder.
G Do these objects need to be shared in several other PowerCenter repository folders? If
so, then create local shortcuts
G Do these objects need to be shared across repositories? If so, then create a global
repository and maintain these re-usable objects in the global repository. Create global
shortcuts to these reusable objects from the local repositories.

Note: Shortcuts cannot be created for workflow objects.


PowerCenter Designer objects:

Creating and testing common objects does not always save development time or facilitate future
maintenance. For example, if a simple calculation like subtracting a current rate from a budget
rate is going to be used for two different mappings, carefully consider whether the effort to
create, test, and document the common object is worthwhile. Often, it is simpler to add the
calculation to both mappings. However, if the calculation were to be performed in a number of
mappings, if it was very difficult, and if all occurrences would be updated following any change or
fix, then the calculation would be an ideal case for a reusable object. When you add instances of
a reusable transformation to mappings, be careful that the changes do not invalidate the
mapping or generate unexpected data. The Designer stores each reusable transformation as
metadata, separate from any mapping that uses the transformation.

The second criterion for a reusable object concerns the data that will pass through the reusable
object. Developers often encounter situations where they may perform a certain type of high-
level process (i.e., a filter, expression, or update strategy) in two or more mappings. For
example, if you have several fact tables that require a series of dimension keys, you can create
a mapplet containing a series of lookup transformations to find each dimension key. You can
then use the mapplet in each fact table mapping, rather than recreating the same lookup logic in
each mapping. This seems like a great candidate for a mapplet. However, after performing half
of the mapplet work, the developers may realize that the actual data or ports passing through the
high-level logic are totally different from case to case, thus making the use of a mapplet
impractical. Consider whether there is a practical way to generalize the common logic so that it
can be successfully applied to multiple cases. Remember, when creating a reusable object, the
actual object will be replicated in one to many mappings. Thus, in each mapping using the
mapplet or reusable transformation object, the same size and number of ports must pass into
and out of the mapping/reusable object.

Document the list of the reusable objects that pass this criteria test, providing a high-level
description of what each object will accomplish. The detailed design will occur in a future
subtask, but at this point the intent is to identify the number and functionality of reusable objects
that will be built for the project. Keep in mind that it will be impossible to identify one hundred
percent of the reusable objects at this point; the goal here is to create an inventory of as many
as possible, and hopefully the most difficult ones. The remainder will be discovered while
building the data integration processes.


PowerCenter Workflow Manager Objects:

In some cases, we may have to read data from different sources and go through the same
transformation logic and write the data to either one destination database or multiple destination
databases. Also, sometimes, depending on the availability of the source, these loads have to be
scheduled at different times. This is an ideal case for creating a re-usable session and
performing session overrides at the session instance level for the database connections, pre-session
commands, and post-session commands.

Logging load statistics, failure criteria and success criteria are usually common pieces of code
that would be executed for multiple loads in most projects. Some of these common tasks include:

G Notification when the number of rows loaded is less than expected
G Notification when there are any reject rows, using email tasks and link conditions
G Successful completion notification based on success criteria, such as the number of rows loaded,
using email tasks and link conditions
G Fail the load based on failure criteria, such as load statistics or the status of some critical session,
using a control task
G Stop/Abort a workflow based on some failure criteria, using a control task
G Based on some previous session completion times, calculate the amount of time the downstream
session has to wait before it can start, using worklet variables, a timer task and an assignment task
Re-usable worklets can be developed to encapsulate the above-mentioned tasks and can be
used in multiple loads. By passing workflow variable values to the worklets and assigning them to
worklet variables, one can easily encapsulate common workflow logic.


Mappings

A mapping is a set of source and target definitions linked by transformation objects that define
the rules for data transformation. Mappings represent the data flow between sources and
targets. In a simple world, a single source table would populate a single target table. However, in
practice, this is usually not the case. Sometimes multiple sources of data need to be combined
to create a target table, and sometimes a single source of data creates many target tables. The
latter is especially true for mainframe data sources where COBOL OCCURS statements litter the
landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes
to a separate table.

The goal here is to create an inventory of the mappings needed for the project. For this exercise,
the challenge is to think in individual components of data movement. While the business may
consider a fact table and its three related dimensions as a single object in the data mart or
warehouse, five mappings may be needed to populate the corresponding star schema with data
(i.e., one for each of the dimension tables and two for the fact table, each from a different source
system).

Typically, when creating an inventory of mappings, the focus is on the target tables, with an
assumption that each target table has its own mapping, or sometimes multiple mappings. While
often true, if a single source of data populates multiple tables, this approach yields multiple
mappings. Efficiencies can sometimes be realized by loading multiple tables from a single
source. By simply focusing on the target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a
spreadsheet listing all of the target tables. Create a column with a number next to each target
table. For each of the target tables, in another column, list the source file or table that will be
used to populate the table. In the case of multiple source tables per target, create two rows for
the target, each with the same number, and list the additional source(s) of data.

The table would look similar to the following:
Number Target Table Source
1 Customers Cust_File
2 Products Items
3 Customer_Type Cust_File
4 Orders_Item Tickets
4 Orders_Item Ticket_Items
When completed, the spreadsheet can be sorted either by target table or source table. Sorting
by source table can help determine potential mappings that create multiple targets.

When using a source to populate multiple tables at once for efficiency, be sure to keep
restartability and reloadability in mind. The mapping will always load two or more target tables
from the source, so there will be no easy way to rerun a single table. In this example, potentially
the Customers table and the Customer_Type tables can be loaded in the same mapping.

When merging targets into one mapping in this manner, give both targets the same number.
Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets,
merge the data back into a single row to generate the inventory of mappings, with each number
representing a separate mapping.

The resulting inventory would look similar to the following:
Number  Target Table(s)            Source(s)
1       Customers, Customer_Type   Cust_File
2       Products                   Items
4       Orders_Item                Tickets, Ticket_Items
At this point, it is often helpful to record some additional information about each mapping to help
with planning and maintenance.

First, give each mapping a name. Apply the naming standards generated in 3.2 Design
Development Architecture. These names can then be used to distinguish mappings from one
other and also can be put on the project plan as individual tasks.

Next, determine for the project a threshold for a high, medium, or low number of target rows. For
example, in a warehouse where dimension tables are likely to number in the thousands and fact
tables in the hundred thousands, the following thresholds might apply:
G Low - 1 to 10,000 rows
G Medium - 10,000 to 100,000 rows
G High - 100,000 rows or more
Assign a likely row volume (high, medium or low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates will help to
determine how many mappings are of high volume; these mappings will be the first candidates
for performance tuning.

Add any other columns of information that might be useful to capture about each mapping, such
as a high-level description of the mapping functionality, resource (developer) assigned, initial
estimate, actual completion time, or complexity rating.



Last updated: 01-Feb-07 18:53
Metadata Reporting and Sharing
Challenge
Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.
Description
The Informatica tool suite can capture extensive levels of metadata but the amount of metadata that is
entered depends on the metadata strategy. Detailed information or metadata comments can be entered for
all repository objects (e.g. mapping, sources, targets, transformations, ports etc.). Also, all information about
column size and scale, data types, and primary keys are stored in the repository. The decision on how much
metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter
detailed descriptions of each column, expression, variable, etc., it also requires extra time and effort to do
so. Once that information is fed to the Informatica repository, the same information can be retrieved using
Metadata Reporter at any time. There are several out-of-the-box reports, and customized reports can
also be created to view that information. There are several options available to export these reports (e.g.
Excel spreadsheet, Adobe .pdf file etc.). Informatica offers two ways to access the repository metadata:
G Metadata Reporter, a web-based application that allows you to run reports against the repository
metadata. This is a comprehensive tool powered by the functionality of Informatica's BI reporting tool,
Data Analyzer. It is included on the PowerCenter CD.
G Metadata Exchange (MX) views. Because Informatica does not support or recommend direct reporting
access to the repository, even for Select Only queries, the second way of reporting on repository
metadata is through the use of views written using Metadata Exchange (MX).
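For example, a simple query against the MX views, rather than the underlying repository tables, can list the
sessions in each folder and the mappings they run. REP_LOAD_SESSIONS is the view referenced elsewhere in
this document; the column names below are assumptions and should be verified against the MX view
definitions for your PowerCenter release.

-- Illustrative MX view query (column names are assumptions).
SELECT subject_area  AS folder_name,
       session_name,
       mapping_name
FROM   REP_LOAD_SESSIONS
ORDER  BY subject_area, session_name;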

Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and
complete metadata reports from their repositories. Metadata Reporter is based on the Data Analyzer and
PowerCenter products. It provides Data Analyzer dashboards and metadata reports to help you administer
your day-to-day PowerCenter operations, reports to access every Informatica object stored in the
repository, and even reports to access objects in the Data Analyzer repository. The architecture of the
Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on
Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata
Reporter setup.
Metadata Reporter setup includes the following .XML files to be imported from the PowerCenter CD in the
same sequence as they are listed below:
G Schemas.xml
G Schedule.xml
G GlobalVariables_Oracle.xml (This file is database-specific; Informatica provides GlobalVariable files
for DB2, SQL Server, Sybase, and Teradata. You need to select the appropriate file based on your
PowerCenter repository environment.)
G Reports.xml
G Dashboards.xml
Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have
no problem importing these files. However, if you are using an existing instance of Data Analyzer which you
currently use for some other reporting purpose, be careful while importing these files. Some of the files (e.g.,
global variables, schedules, etc.) may already exist with the same name. You can rename the conflicting
objects.
The following are the folders that are created in Data Analyzer when you import the above-listed files:
G Data Analyzer Metadata Reporting - contains reports for the Data Analyzer repository itself (e.g.,
Today's Login, Reports Accessed by Users Today, etc.)
G PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize
reports based on their functionality, these reports are further grouped into subfolders as follows:
H Configuration Management - contains a set of reports that provide detailed information on
configuration management, including deployment and label details. This folder contains the
following subfolders:
o Deployment
o Label
o Object Version
H Operations - contains a set of reports that enable users to analyze operational statistics including
server load, connection usage, run times, load times, number of runtime errors, etc. for workflows,
worklets and sessions. This folder contains the following subfolders:
o Session Execution
o Workflow Execution
H PowerCenter Objects - contains a set of reports that enable users to identify all types of
PowerCenter objects, their properties, and interdependencies on other objects within the
repository. This folder contains the following subfolders:
o Mappings
o Mapplets
o Metadata Extension
o Server Grids
o Sessions
o Sources
o Targets
o Transformations
o Workflows
o Worklets
H Security - contains a set of reports that provide detailed information on the users, groups and their
association within the repository.
Informatica recommends retaining this folder organization, adding new folders if necessary.
The Metadata Reporter provides 44 standard reports which can be customized with the use of parameters
and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the
web server where the Metadata Reporter is installed, even without the other Informatica client tools being
installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC
drivers. Be sure the proper JDBC drivers are installed for your database platform.
(Note: You can also use the JDBC to ODBC bridge to connect to the repository; e.g., Syntax -
jdbc:odbc:<data_source_name>)
G Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide
information about all types of metadata objects.
G Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can
generate reports from any machine that has access to the web server. The reports in the Metadata
Reporter are customizable. The Metadata Reporter allows you to set parameters for the metadata
objects to include in the report.
G The Metadata Reporter allows you to go easily from one report to another. The name of any
metadata object that displays on a report links to an associated report. As you view a report, you
can generate reports for objects on which you need more information.
The following table shows the list of reports provided by the Metadata Reporter, along with their location and a
brief description:
Reports For PowerCenter Repository
Sr No Name Folder Description
1 Deployment Group Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Deployment>Deployment
Group
Displays deployment groups by
repository
2 Deployment Group
History
Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Deployment>Deployment
Group History
Displays, by group, deployment
groups and the dates they were
deployed. It also displays the
source and target repository
names of the deployment group
for all deployment dates. This is
a primary report in an analytic
workflow.
3 Labels Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Labels>Labels
Displays labels created in the
repository for any versioned
object by repository.
4 All Object Version
History
Public Folders>PowerCenter Metadata
Reports>Configuration Management>Object
Version>All Object Version History
Displays all versions of an
object by the date the object is
saved in the repository. This is
a standalone report.
5 Server Load by Day
of Week
Public Folders>PowerCenter Metadata
Reports>Operations>Session
Execution>Server Load by Day of Week
Displays the total number of
sessions that ran, and the total
session run duration for any
day of week in any given month
of the year by server by
repository. For example, all
Mondays in September are
represented in one row if that
month had 4 Mondays
6 Session Run Details Public Folders>PowerCenter Metadata
Reports>Operations>Session
Execution>Session Run Details
Displays session run details for
any start date by repository by
folder. This is a primary report
in an analytic workflow.
7 Target Table Load
Analysis (Last
Month)
Public Folders>PowerCenter Metadata
Reports>Operations>Session
Execution>Target Table Load Analysis (Last
Month)
Displays the load statistics for
each table for last month by
repository by folder. This is a
primary report in an analytic
workflow.
8 Workflow Run
Details
Public Folders>PowerCenter Metadata
Reports>Operations>Workflow
Execution>Workflow Run Details
Displays the run statistics of all
workflows by repository by
folder. This is a primary report
in an analytic workflow.
9 Worklet Run Details Public Folders>PowerCenter Metadata
Reports>Operations>Workflow
Execution>Worklet Run Details
Displays the run statistics of all
worklets by repository by folder.
This is a primary report in an
analytic workflow.
10 Mapping List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mappings>Mapping List
Displays mappings by
repository and folder. It also
displays properties of the
mapping such as the number of
sources used in a mapping, the
number of transformations, and
the number of targets. This is a
primary report in an analytic
workflow.
11 Mapping Lookup Public Folders>PowerCenter Metadata Displays Lookup
Transformations Reports>PowerCenter
Objects>Mappings>Mapping Lookup
Transformations
transformations used in a
mapping by repository and
folder. This report is a
standalone report and also the
first node in the analytic
workflow associated with the
Mapping List primary report.
12 Mapping Shortcuts Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mappings>Mapping Shortcuts
Displays mappings defined as a
shortcut by repository and
folder.
13 Source to Target
Dependency
Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mappings>Source to Target
Dependency
Displays the data flow from the
source to the target by
repository and folder. The
report lists all the source and
target ports, the mappings in
which the ports are connected,
and the transformation
expression that shows how
data for the target port is
derived.
14 Mapplet List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mapplets>Mapplet List
Displays mapplets available by
repository and folder. It displays
properties of the mapplet such
as the number of sources used
in a mapplet, the number of
transformations, or the number
of targets. This is a primary
report in an analytic workflow.
15 Mapplet Lookup
Transformations
Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mapplets>Mapplet Lookup
Transformations
Displays all Lookup
transformations used in a
mapplet by folder and
repository. This report is a
standalone report and also the
first node in the analytic
workflow associated with the
Mapplet List primary report.
16 Mapplet Shortcuts Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mapplets>Mapplet Shortcuts
Displays mapplets defined as a
shortcut by repository and
folder.
17 Unused Mapplets in
Mappings
Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Mapplets>Unused Mapplets in
Mappings
Displays mapplets defined in a
folder but not used in any
mapping in that folder.
18 Metadata
Extensions Usage
Public Folders>PowerCenter Metadata
Reports>PowerCenter Objects>Metadata
Extensions>Metadata Extensions Usage
Displays, by repository by
folder, reusable metadata
extensions used by any object.
Also displays the counts of all
objects using that metadata
extension.
19 Server Grid List Public Folders>PowerCenter Metadata
Reports>PowerCenter Objects>Server
Grid>Server Grid List
Displays all server grids and
servers associated with each
grid. Information includes host
name, port number, and
internet protocol address of the
servers.
20 Session List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Sessions>Session List
Displays all sessions and their
properties by repository by
folder. This is a primary report
in an analytic workflow.
21 Source List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Sources>Source List
Displays relational and non-
relational sources by repository
and folder. It also shows the
source properties. This report is
a primary report in an analytic
workflow.
22 Source Shortcuts Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Sources>Source Shortcuts
Displays sources that are
defined as shortcuts by
repository and folder
23 Target List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Targets>Target List
Displays relational and non-
relational targets available by
repository and folder. It also
displays the target properties.
This is a primary report in an
analytic workflow.
24 Target Shortcuts Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Targets>Target Shortcuts
Displays targets that are
defined as shortcuts by
repository and folder.
25 Transformation List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Transformations>Transformation
List
Displays transformations
defined by repository and
folder. This is a primary report
in an analytic workflow.
26 Transformation
Shortcuts
Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Transformations>Transformation
Shortcuts
Displays transformations that
are defined as shortcuts by
repository and folder.
27 Scheduler
(Reusable) List
Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Workflows>Scheduler (Reusable)
List
Displays all the reusable
schedulers defined in the
repository and their description
and properties by repository by
folder. This is a primary report
in an analytic workflow.
28 Workflow List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Workflows>Workflow List
Displays workflows and
workflow properties by
repository by folder. This report
is a primary report in an
analytic workflow.
29 Worklet List Public Folders>PowerCenter Metadata
Reports>PowerCenter
Objects>Worklets>Worklet List
Displays worklets and worklet
properties by repository by
folder. This is a primary report
in an analytic workflow.
30 Users By Group Public Folders>PowerCenter Metadata
Reports>Security>Users By Group
Displays users by repository
and group.
Reports For Data Analyzer Repository
Sr No Name Folder Description
1 Bottom 10 Least
Accessed Reports
this Year
Public Folders>Data Analyzer Metadata
Reporting>Bottom 10 Least Accessed
Reports this Year
Displays the ten least accessed
reports for the current year. It
has an analytic workflow that
provides access details such as
user name and access time.
2 Report Activity
Details
Public Folders>Data Analyzer Metadata
Reporting>Report Activity Details
Part of the analytic workflows
"Top 10 Most Accessed
Reports This Year", "Bottom 10
Least Accessed Reports this
Year" and "Usage by Login
(Month To Date)".
3 Report Activity
Details for Current
Month
Public Folders>Data Analyzer Metadata
Reporting>Report Activity Details for Current
Month
Provides information about
reports accessed in the current
month until current date.
4 Report Refresh
Schedule
Public Folders>Data Analyzer Metadata
Reporting>Report Refresh Schedule
Provides information about the
next scheduled update for
scheduled reports. It can be
used to decide schedule timing
for various reports for optimum
system performance.
5 Reports Accessed
by Users Today
Public Folders>Data Analyzer Metadata
Reporting>Reports Accessed by Users
Today
Part of the analytic workflow for
"Today's Logins". It provides
detailed information on the
reports accessed by users
today. This can be used
independently to get
comprehensive information
about today's report activity
details.
6 Todays Logins Public Folders>Data Analyzer Metadata
Reporting>Todays Logins
Provides the login count and
average login duration for users
who logged in today.
7 Todays Report
Usage by Hour
Public Folders>Data Analyzer Metadata
Reporting>Todays Report Usage by Hour
Provides information about the
number of reports accessed
today for each hour. The
analytic workflow attached to it
provides more details on the
reports accessed and users
who accessed them during the
selected hour.
8 Top 10 Most
Accessed Reports
this Year
Public Folders>Data Analyzer Metadata
Reporting>Top 10 Most Accessed Reports
this Year
Shows the ten most accessed
reports for the current year. It
has an analytic workflow that
provides access details such as
user name and access time.
9 Top 5 Logins (Month
To Date)
Public Folders>Data Analyzer Metadata
Reporting>Top 5 Logins (Month To Date)
Provides information about
users and their corresponding
login count for the current
month to date. The analytic
workflow attached to it provides
more details about the reports
accessed by a selected user.
10 Top 5 Longest
Running On-
Demand Reports
(Month To Date)
Public Folders>Data Analyzer Metadata
Reporting>Top 5 Longest Running On-
Demand Reports (Month To Date)
Shows the five longest running
on-demand reports for the
current month to date. It
displays the average total
response time, average DB
response time, and the average
Data Analyzer response time
(all in seconds) for each report
shown.
11 Top 5 Longest
Running Scheduled
Reports (Month To
Date)
Public Folders>Data Analyzer Metadata
Reporting>Top 5 Longest Running
Scheduled Reports (Month To Date)
Shows the five longest running
scheduled reports for the
current month to date. It
displays the average response
time (in seconds) for each
report shown.
12 Total Schedule
Errors for Today
Public Folders>Data Analyzer Metadata
Reporting>Total Schedule Errors for Today
Provides the number of errors
encountered during execution
of reports attached to
schedules. The analytic
workflow "Scheduled Report
Error Details for Today" is
attached to it.
13 User Logins (Month
To Date)
Public Folders>Data Analyzer Metadata
Reporting>User Logins (Month To Date)
Provides information about
users and their corresponding
login count for the current
month to date. The analytic
workflow attached to it provides
more details about the reports
accessed by a selected user.
14 Users Who Have
Never Logged On
Public Folders>Data Analyzer Metadata
Reporting>Users Who Have Never Logged
On
Provides information about
users who exist in the
repository but have never
logged in. This information can
be used to make administrative
decisions about disabling
accounts.
Customizing a Report or Creating New Reports
Once you select the report, you can customize it by setting the parameter values and/or creating new
attributes or metrics. Data Analyzer includes simple steps to create new reports or modify existing ones.
Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report
templates and export them as Excel files, which can be refreshed as necessary. For more information on the
attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.
Wildcards
The Metadata Reporter supports two wildcard characters:
Percent symbol (%) - represents any number of characters and spaces.
Underscore (_) - represents one character or space.
You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank
returns all values and is the same as using %. The following examples show how you can use the wildcards
to set parameters.
Suppose you have the following values available to select:
items, items_in_promotions, order_items, promotions
The following list shows the return values for some wildcard combinations you can use:
Wildcard Combination Return Values
% items, items_in_promotions, order_items, promotions
<blank> items, items_in_promotions, order_items, promotions
%items items, order_items
item_ items
item% items, items_in_promotions
___m% items, items_in_promotions, promotions
%pr_mo% items_in_promotions, promotions
A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce
such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use
Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document.
For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the
PowerCenter documentation.
Security Awareness for Metadata Reporter
Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter /Data Analyzer repository. Data
Analyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups,
roles, and/or privileges for users based on their profiles. Since the information in PowerCenter repository
does not change often after it goes to production, the Administrator can create some reports and export
them to files that can be distributed to the user community. If the number of users for Metadata Reporter
is limited, you can implement security using report filters or the data restriction feature. For example, if a user
in PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it
to the user's profile. For more information on the ways in which you can implement security in Data
Analyzer, refer to the Data Analyzer documentation.
Metadata Exchange: the Second Generation (MX2)
The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data
warehouse and display the warehouse metadata through their own products. The result was a set of
relational views that encapsulated the underlying repository tables while exposing the metadata in several
categories that were more suitable for external parties. Today, Informatica and several key vendors,
including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to report
and query the Informatica metadata.
Informatica currently supports the second generation of Metadata Exchange called MX2. Although the
overall motivation for creating the second generation of MX remains consistent with the original intent, the
requirements and objectives of MX2 supersede those of MX.
The primary requirements and features of MX2 are:
Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism
for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural
programming tasks that can be achieved by C, C++, J ava, or Visual Basic. Furthermore, the increasing
popularity and use of object-oriented software tools require interfaces that can fully take advantage of the
object technology. MX2 is implemented in C++and offers an advanced object-based API for accessing and
manipulating the PowerCenter Repository from various programming languages.
Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are
part of the repository database and thus can be used independent of any of the Informatica software
products. The same requirement also holds for MX2, thus leading to the development of a self-contained
API Software Development Kit that can be used independently of the client or server products.
Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and
upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies,
levels, and various relationships. This type of metadata was specifically designed and implemented in the
repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.
Ability to write (push) metadata into the repository. Because of the limitations associated with relational
views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such
tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2
interfaces provide metadata write capabilities along with the appropriate verification and validation features
to ensure the integrity of the metadata in the repository.
Complete encapsulation of the underlying repository organization by means of an API. One of the
main challenges with MX views and the interfaces that access the repository tables is that they are directly
exposed to any schema changes of the underlying repository database. As a result, maintaining the MX
views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates
this problem by offering a set of object-based APIs that are abstracted away from the details of the
underlying relational tables, thus providing an easier mechanism for managing schema evolution.
Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more
sophisticated procedural programs that can tightly integrate the repository with the third-party data
warehouse modeling and query/reporting tools.
Synchronization of metadata based on changes from up-stream and down-stream tools. Given that
metadata is likely to reside in various databases and files in a distributed software environment,
synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based
technology used in MX2 provides the infrastructure needed to implement automatic metadata
synchronization and change propagation across different tools that access the PowerCenter Repository.
Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with
Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future
program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of
MX2.
Last updated: 01-Feb-07 18:53
Repository Tables & Metadata Management
Challenge
Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.
Description
Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted information
from the repository keep the repository performing well.
Managing Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The
role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and
managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating
statistics.
Repository backup
Repository backup can be performed using the client tool Repository Server Admin Console or the command line
program pmrep. Backups using pmrep can be automated and scheduled to run regularly.
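A minimal sketch of such a backup script is shown below. The repository name, user, password, host/port, and backup directory are all placeholders, and the pmrep connect options follow the PowerCenter 7 host/port style; PowerCenter 8 connects through a domain instead, so verify the options against the pmrep command reference for your version.

#!/bin/sh
# backup_repo.sh - nightly PowerCenter repository backup via pmrep (placeholder values)
BACKUP_DIR=/opt/infa/repo_backups
STAMP=`date +%Y%m%d`

# connect to the repository, then write a date-stamped backup file
pmrep connect -r MY_REPO -n Administrator -x admin_pwd -h repo_host -o 5001 || exit 1
pmrep backup -o ${BACKUP_DIR}/MY_REPO_${STAMP}.rep || exit 1

# compress the backup to save space
gzip ${BACKUP_DIR}/MY_REPO_${STAMP}.rep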
This shell script can be scheduled to run as a cron job for regular backups. Alternatively, this shell script can be
called from PowerCenter via a command task. The command task can be placed in a workflow and scheduled to
run daily.
The following paragraphs describe some useful practices for maintaining backups:
Frequency: Backup frequency depends on the activity in the repository. For production repositories, backup is
recommended once a month or prior to a major release. For development repositories, backup is recommended
once a week or once a day, depending upon the team size.
Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a
utility such as winzip or gzip.
Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than
the repository itself.
Move backups offline: Review the backups on a regular basis to determine how long they need to remain
online. Any that are not required online should be moved offline, to tape, as soon as possible.
Restore repository
Although the Repository restore function is used primarily as part of disaster recovery, it can also be useful for
testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica
recommends testing the backup files and recovery process at least once each quarter. The repository can be
restored using the client tool, Repository Server Administrator Console, or the command line program
pmrepagent.
Restore folders
There is no easy way to restore only one particular folder from backup: first restore the backup into a new
repository, and then use the client tool, Repository Manager, to copy the entire folder from the restored
repository into the target repository.
Remove older versions
Use the purge command to remove older versions of objects from the repository. To purge a specific version of an
object, view the history of the object, select the version, and purge it.
Finding deleted objects and removing them from repository
If a PowerCenter repository is enabled for versioning through the Team-Based Development option, objects that
have been deleted from the repository are not visible in the client tools. To list or view deleted objects, use
either the Find Checkouts command in the client tools, a query generated in the Repository Manager, or a
specific query.
After an object has been deleted from the repository, you cannot create another object with the same name
unless the deleted object has been completely removed from the repository. Use the purge command to
completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions
of a deleted object to completely remove it from the repository.
Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository either by using the
Repository Manager or the pmrep command line program. Logs can be truncated for the entire repository or for a
particular folder.
Options allow truncating all log entries or selected entries based on date and time.
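For example, the pmrep truncatelog command can be scripted. The repository name, credentials, folder, and cut-off date below are placeholders, and the exact option letters can vary slightly by PowerCenter version, so verify them against the pmrep command reference.

# remove workflow/session log entries older than a given date from one folder
pmrep connect -r MY_REPO -n Administrator -x admin_pwd -h repo_host -o 5001
pmrep truncatelog -t "04/01/2007 00:00:00" -f MY_FOLDER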
Repository Performance
Analyzing repository tables (or updating their statistics) can help to improve repository performance.
Because this process should be carried out for all tables in the repository, a script offers the most efficient means.
You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a
command task to call the script.
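The sketch below shows one possible form for such a script on an Oracle repository; the connection details are placeholders, and gathering schema statistics with DBMS_STATS is only one option, so adapt the approach to your database platform and DBA standards.

#!/bin/sh
# refresh_repo_stats.sh - gather optimizer statistics for the PowerCenter repository schema (Oracle)
REPO_USER=repo_user          # placeholder repository schema owner
REPO_PWD=repo_pwd            # placeholder password
REPO_SID=REPO_DB             # placeholder Oracle net service name

sqlplus -s ${REPO_USER}/${REPO_PWD}@${REPO_SID} <<EOF
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => UPPER('${REPO_USER}'), cascade => TRUE);
EXIT
EOF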
Repository Agent and Repository Server performance
Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on
repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various
causes should be analyzed and the repository server (or agent) configuration file modified to improve
performance.
Managing Metadata
The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The
queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7.
Minor changes in the queries may be required for PowerCenter repositories residing on other databases.
Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last n days, replace
SYSDATE-1 with SYSDATE - n
SELECT Subject_Area AS Folder,
Session_Name,
Last_Error AS Error_Message,
DECODE (Run_Status_Code,3,'Failed',4,'Stopped',5,'Aborted') AS Status,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code != 1
AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
Long running Sessions
The following query lists sessions that ran longer than 10 minutes in the last day. To make it work for the last n days, replace
SYSDATE-1 with SYSDATE - n
SELECT Subject_Area AS Folder,
Session_Name,
Successful_Source_Rows AS Source_Rows,
Successful_Rows AS Target_Rows,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code = 1
AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
AND (Session_TimeStamp - Actual_Start) > (10/(24*60))
ORDER BY Session_timeStamp
Invalid Tasks
The following query lists the folder name, task name, version number, and last-saved date for all invalid tasks.
SELECT SUBJECT_AREA AS FOLDER_NAME,
DECODE(IS_REUSABLE,1,'Reusable',' ') || ' ' ||TASK_TYPE_NAME AS TASK_TYPE,
TASK_NAME AS OBJECT_NAME,
VERSION_NUMBER, -- comment out for V6
LAST_SAVED
FROM REP_ALL_TASKS
WHERE IS_VALID=0
AND IS_ENABLED=1
--AND CHECKOUT_USER_ID = 0 -- Comment out for V6
--AND is_visible=1 -- Comment out for V6
ORDER BY SUBJECT_AREA,TASK_NAME
Load Counts
The following query lists the load counts (number of rows loaded and failed) and the run status for sessions run in the last day.
SELECT
subject_area,
workflow_name,
session_name,
DECODE (Run_Status_Code,1,'Succeeded',3,'Failed',4,'Stopped',5,'Aborted') AS Session_Status,
successful_rows,
failed_rows,
actual_start
FROM
REP_SESS_LOG
WHERE
TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
ORDER BY
subject_area,
workflow_name,
session_name,
Session_status
Using Metadata Extensions
Challenge
To provide for efficient documentation and achieve extended metadata reporting
through the use of metadata extensions in repository objects.
Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in
the repository by associating information with individual objects in the repository.
Informatica Client applications can contain two types of metadata extensions: vendor-
defined and user-defined.
G Vendor-defined. Third-party application vendors create vendor-defined
metadata extensions. You can view and change the values of vendor-defined
metadata extensions, but you cannot create, delete, or redefine them.
G User-defined. You create user-defined metadata extensions using
PowerCenter clients. You can create, edit, delete, and view user-defined
metadata extensions. You can also change the values of user-defined
extensions.
You can create reusable or non-reusable metadata extensions. You associate reusable
metadata extensions with all repository objects of a certain type. So, when you create a
reusable extension for a mapping, it is available for all mappings. Vendor-defined
metadata extensions are always reusable.
Non-reusable extensions are associated with a single repository object. Therefore, if
you edit a target and create a non-reusable extension for it, that extension is available
only for the target you edit. It is not available for other targets. You can promote a non-
reusable metadata extension to reusable, but you cannot change a reusable metadata
extension to non-reusable.
Metadata extensions can be created for the following repository objects:
G Source definitions
G Target definitions
G Transformations (Expressions, Filters, etc.)
G Mappings
G Mapplets
G Sessions
G Tasks
G Workflows
G Worklets
Metadata Extensions offer a very easy and efficient method of documenting important
information associated with repository objects. For example, when you create a
mapping, you can store the mapping owner's name and contact information with the
mapping; or, when you create a source definition, you can enter the name of the
person who created/imported the source.
The power of metadata extensions is most evident in the reusable type. When you
create a reusable metadata extension for any type of repository object, that metadata
extension becomes part of the properties of that type of object. For example, suppose
you create a reusable metadata extension for source definitions called SourceCreator.
When you create or edit any source definition in the Designer, the SourceCreator
extension appears on the Metadata Extensions tab. Anyone who creates or edits a
source can enter the name of the person that created the source into this field.
You can create, edit, and delete non-reusable metadata extensions for sources,
targets, transformations, mappings, and mapplets in the Designer. You can create,
edit, and delete non-reusable metadata extensions for sessions, workflows, and
worklets in the Workflow Manager. You can also promote non-reusable metadata
extensions to reusable extensions using the Designer or the Workflow Manager. You
can also create reusable metadata extensions in the Workflow Manager or Designer.
You can create, edit, and delete reusable metadata extensions for all types of
repository objects using the Repository Manager. If you want to create, edit, or delete
metadata extensions for multiple objects at one time, use the Repository Manager.
When you edit a reusable metadata extension, you can modify the properties Default
Value, Permissions and Description.
Note: You cannot create non-reusable metadata extensions in the Repository
Manager. All metadata extensions created in the Repository Manager are reusable.
Reusable metadata extensions are repository wide.
You can also migrate Metadata Extensions from one environment to another. When
you do a copy folder operation, the Copy Folder Wizard copies the metadata extension
values associated with those objects to the target repository. A non-reusable metadata
extension will be copied as a non-reusable metadata extension in the target repository.
A reusable metadata extension is copied as reusable in the target repository, and the
object retains the individual values. You can edit and delete those extensions, as well
as modify the values.
Metadata Extensions provide for extended metadata reporting capabilities. Using the
Informatica MX2 API, you can create useful reports on metadata extensions. For
example, you can create and view a report on all the mappings owned by a specific
team member. You can use various programming environments such as Visual Basic,
Visual C++, C++ and Java SDK to write API modules. The Informatica Metadata
Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++
applications.
Additionally, Metadata Extensions can also be populated via data modeling tools such
as ERWin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange
for Data Models. With the Informatica Metadata Exchange for Data Models, the
Informatica Repository interface can retrieve and update the extended properties of
source and target definitions in PowerCenter repositories. Extended Properties are the
descriptive, user defined, and other properties derived from your Data Modeling tool
and you can map any of these properties to the metadata extensions that are already
defined in the source or target object in the Informatica repository.
Last updated: 01-Feb-07 18:53
Using PowerCenter Metadata Manager and Metadata
Exchange Views for Quality Assurance
Challenge
The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and under-
appreciated. This repository is essentially a database about the transformation process and the software
developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes.
To address the above challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter
Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports
provide a wealth of useful information about PowerCenter object metadata and operational metadata that can be used
for quality assurance.
Description
Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built
in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is
probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate
development and enforce consistency through the use of the following aids:
G Shared template for each type of mapping.
G Checklists to guide the developer through the process of adapting the template to the mapping
requirements.
G Macros/scripts to generate productivity aids such as SQL overrides etc.
It is easier to ensure quality from a standardized base rather than relying on developers to repeat accurately the
same basic keystrokes.
Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards which
categorize components. By running the appropriate query on the repository, it is possible to identify those
components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate
some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide
logical access paths into the information in the repository; names can be used to identify patterns and/or categories
and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the
repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata Manager, this opens the
door to an automated QA strategy.
For example, consider the following situation: it is possible that the EXTRACT mapping/session should always
truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a
target.
Possible code errors in this respect can be identified as follows:
G Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
G Develop a query on the repository to search for sessions named EXTRACT, which do not have the
truncate target option set.
G Develop a query on the repository to search for sessions named TRANSFORM or LOAD, which do have
the truncate target option set.
G Provide a facility to allow developers to run both queries before releasing code to the test environment.
Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such
as expressions) in a mapping. These can be very easily identified from the MX View
REP_MAPPING_UNCONN_PORTS.
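A query sketch along the following lines can surface them; the SUBJECT_AREA filter and the folder name are illustrative assumptions, so check the MX Views reference for the exact column list in your PowerCenter release.

-- list unconnected transformation output ports for one folder
SELECT *
FROM REP_MAPPING_UNCONN_PORTS
WHERE SUBJECT_AREA = 'MY_FOLDER'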
The following bullets represent a high-level overview of the steps involved in automating QA:
G Review the transformations/mappings/sessions/workflows and allocate to broadly representative
categories.
G Identify the key attributes of each category.
G Define naming standards to identify the category for transformations/mappings/sessions/workflows.
G Analyze the MX Views to source the key attributes.
G Develop the query to compare actual and expected attributes for each category.
After you have completed these steps, it is possible to develop a utility that compares actual and expected
attributes for developers to run before releasing code into any test environment. Such a utility may incorporate the
following processing stages (a minimal sketch follows the list):
G Execute a profile to assign environment variables (e.g., repository schema user, password, etc).
G Select the folder to be reviewed.
G Execute the query to find exceptions.
G Report the exceptions in an accessible format.
G Exit with failure if exceptions are found.
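The shell sketch below illustrates one way to wire these stages together for an Oracle repository. The credentials, the folder argument, and the choice of exception query (here, the unconnected-ports check described earlier, with the SUBJECT_AREA column assumed) are all placeholders to adapt to your own standards and repository release.

#!/bin/sh
# qa_exceptions.sh - fail the release step if the exception query returns rows
# usage: qa_exceptions.sh <folder_name>
FOLDER=$1

ROWS=`sqlplus -s repo_user/repo_pwd@REPO_DB <<EOF
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT COUNT(*) FROM REP_MAPPING_UNCONN_PORTS WHERE SUBJECT_AREA = '${FOLDER}';
EXIT
EOF`

if [ ${ROWS} -gt 0 ]; then
   echo "QA exceptions found in folder ${FOLDER}: ${ROWS} unconnected port(s)"
   exit 1
fi
exit 0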
TIP
Remember that any queries on the repository that bypass the MX views will require modification whenever
PowerCenter is upgraded; such direct queries are therefore not recommended by Informatica.
The principal objective of any QA strategy is to ensure that developed components adhere to standards and to
identify defects before incurring overhead during the migration from development to test/production environments.
Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this
process.
Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance
The need for the Informatica Metadata Reporter was identified from a number of clients requesting custom and
complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and
metadata reports to help you administer your day-to-day PowerCenter operations. In this section, we focus
primarily on how these reports and custom reports can help ease the QA process.
The following reports can help identify regressions in load performance:
G Session Run Details
G Workflow Run Details
G Worklet Run Details
G Server Load by Day of the Week can help determine the load on the server before and after QA
migrations and may help balance the loads through the week by modifying the schedules.
G The Target Table Load Analysis can help identify any data regressions with the number of
records loaded in each target (if a baseline was established before the migration/upgrade).
G The Failed Session report lists failed sessions at a glance, which is very helpful after a major
QA migration or QA of an Informatica upgrade process.
During large deployments to QA, the code review team can look at the following reports to determine if the
standards (i.e., naming standards, comments for repository objects, metadata extensions usage, etc.) were
followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for
review because the reviewer doesn't need to open each mapping and check for these details. All of the following
are out-of-the-box reports provided by Informatica:
G Label report
G Mappings list
G Mapping shortcuts
G Mapping lookup transformation
G Mapplet list
G Mapplet shortcuts
G Mapplet lookup transformation
G Metadata extensions usage
G Sessions list
G Worklets list
G Workflows list
G Source list
G Target list
G Custom reports based on the review requirements
In addition, note that the following reports are also useful during migration and upgrade processes:
G Invalid object reports and deployment group report in the QA repository help to determine which
deployments caused the invalidations.
G Invalid object report against Development repository helps to identify the invalid objects that are part of
deployment before QA migration.
G Invalid object report helps in QA of an Informatica upgrade process.
The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports
installation:
Report Name Description
1 Deployment Group Displays deployment groups by repository
2 Deployment Group History Displays, by group, deployment groups and the dates they were deployed. It also displays the
source and target repository names of the deployment group for all deployment dates.
3 Labels Displays labels created in the repository for any versioned object by repository.
4 All Object Version History Displays all versions of an object by the date the object is saved in the repository.
5 Server Load by Day of
Week
Displays the total number of sessions that ran, and the total session run duration for any day
of week in any given month of the year by server by repository. For example, all Mondays in
September are represented in one row if that month had 4 Mondays
6 Session Run Details Displays session run details for any start date by repository by folder.
7 Target Table Load
Analysis (Last Month)
Displays the load statistics for each table for last month by repository by folder
8 Workflow Run Details Displays the run statistics of all workflows by repository by folder.
9 Worklet Run Details Displays the run statistics of all worklets by repository by folder.
10 Mapping List Displays mappings by repository and folder. It also displays properties of the mapping such as
the number of sources used in a mapping, the number of transformations, and the number of
targets.
11 Mapping Lookup
Transformations
Displays Lookup transformations used in a mapping by repository and folder.
12 Mapping Shortcuts Displays mappings defined as a shortcut by repository and folder.
13 Source to Target
Dependency
Displays the data flow from the source to the target by repository and folder. The report lists all
the source and target ports, the mappings in which the ports are connected, and the
transformation expression that shows how data for the target port is derived.
14 Mapplet List Displays mapplets available by repository and folder. It displays properties of the mapplet
such as the number of sources used in a mapplet, the number of transformations, or the
number of targets.
15 Mapplet Lookup
Transformations
Displays all Lookup transformations used in a mapplet by folder and repository.
16 Mapplet Shortcuts Displays mapplets defined as a shortcut by repository and folder.
17 Unused Mapplets in
Mappings
Displays mapplets defined in a folder but not used in any mapping in that folder.
18 Metadata Extensions
Usage
Displays, by repository by folder, reusable metadata extensions used by any object. Also
displays the counts of all objects using that metadata extension.
19 Server Grid List Displays all server grids and servers associated with each grid. Information includes host
name, port number, and internet protocol address of the servers.
20 Session List Displays all sessions and their properties by repository by folder. This is a primary report in a
data integration workflow.
21 Source List Displays relational and non-relational sources by repository and folder. It also shows the
source properties. This report is a primary report in a data integration workflow.
22 Source Shortcuts Displays sources that are defined as shortcuts by repository and folder
23 Target List Displays relational and non-relational targets available by repository and folder. It also
displays the target properties. This is a primary report in a data integration workflow.
24 Target Shortcuts Displays targets that are defined as shortcuts by repository and folder.
25 Transformation List Displays transformations defined by repository and folder. This is a primary report in a data
integration workflow.
26 Transformation Shortcuts Displays transformations that are defined as shortcuts by repository and folder.
27 Scheduler (Reusable) List Displays all the reusable schedulers defined in the repository and their description and
properties by repository by folder.
28 Workflow List Displays workflows and workflow properties by repository by folder.
29 Worklet List
Displays worklets and worklet properties by repository by folder.
Last updated: 01-Feb-07 18:53
Configuring Standard XConnects
Challenge
Metadata that is derived from a variety of sources and tools is often disparate and fragmented. To be of
value, metadata needs to be consolidated into a central repository. Informatica's Metadata Manager
provides a central repository for the capture and analysis of critical metadata.
Description
Metadata Manager Console Settings
Logging into the Metadata Manager Warehouse
You can use the Metadata Manager console to access one Metadata Manager Warehouse repository at
a time. When logging in to the Metadata Manager console for the first time, you need to set up the
connection information along with the data source for the Integration Repository. In subsequent logins,
you need to enter only the Metadata Manager Warehouse database password.
Setting up Connections to the PowerCenter Components
Before you run any XConnects, be sure that the Metadata Manager Console has valid connections to
the following PowerCenter components for Metadata Manager:
G Integration Repository Service
G Domain
G Integration Service
To verify the PowerCenter settings, click the Administration tab.
Specifying the PowerCenter Source Files Directory
Metadata Manager stores the following files in the PowerCenter source files directory:
1. IME files - Some XConnects extract the source repository metadata and reformat it into an IME-
based format. The reformatted metadata is stored in new files, referred to as IME files. The
workflows extract the transformed metadata from the IME files and then load the metadata into
the Metadata Manager Warehouse.
2. Parameter files - The integration workflows use parameters to control the sessions, worklets,
and workflows.
3. Date files - The integration workflows use date files to load dates into the Metadata Manager
Warehouse
To configure the Source file directory, click the Administration tab, then the File Transfer Configuration
tab.
For Windows:
G \\Informatica Server Name\PM SrcFiles directory.
G Click Save when you are finished.
For UNIX:
G Select ftp option
G FTP Server Name: Integration Service Host Name
G Port Number: 21 (default)
G User Name: UNIX login name to Integration Server
G ftp directory: /Integration service Home directory/SrcFiles
Note: Metadata Manager 8.1 does not support secure ftp connections to the Integration server.
Configuring Standard XConnects
SQL Server XConnect
Specify a user name and password to access the SQL Server database metadata. Be sure that the user
has access to all system tables. One XConnect is needed per SQL Server database.
To extract metadata from SQL Server, perform the following steps:
G Add a new SQL repository from Metadata Manager (Web interface).
G Log in to the Metadata Manager Console. Click the Source Repository Management tab. The
new SQL Server XConnect added above should show up in the console. Select the SQL
Server XConnect and click the Configuration Properties tab. Enter the following information
related to the XConnect:
Properties Description
User Name/Password Database user name and password to access SQL
Server data dictionary
Data Source Name ODBC connection name to connect to SQL Server
data dictionary
Database Type Microsoft SQL Server
Connection String For default instance: SQL Server Name@Database
Name
For named instance: Server Name\Instance
Name@Database Name
G Click Save when you have finished entering this information.
G To configure the list of user schemas to load, click the Parameter Setup tab and select the list
of schemas to load (these are listed in the Included Schemas). Click Save when you are
finished.
G The XConnect is ready to be loaded.
G After a successful load of the SQL Server metadata, you can see the metadata in the Web
interface.
To configure SQL Server out-of-the-box XConnects to run on the PowerCenter server in a UNIX
environment, follow these steps:
G Install DataDirect ODBC drivers on the PowerCenter server location.
G Configure .odbc.ini just as you would for any other ODBC data source (a sample entry follows this list).
G Create a repository of type Microsoft SQL Server using the Metadata Manager browser.
G When configuring the repository in the Configuration Console, specify a connect string in the form
<SQLserverhost>@DBname and save the configuration.
G Using Workflow Manager, delete the connection it created, R<RepoUID>, and create an ODBC
connection with the same name as R<RepoUID> (specify the same connect string as the one
configured in the .odbc.ini).
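The entry below is purely illustrative of a DataDirect-style .odbc.ini data source; the data source name, host, port, database, and especially the driver library path depend on your DataDirect release and installation layout, so copy the Driver value from the sample odbc.ini shipped with the drivers.

[MM_SQLSERVER]
Driver=/opt/odbc/lib/ivmsss22.so
Description=DataDirect SQL Server Wire Protocol
Address=sqlserver_host,1433
Database=SalesDW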
Oracle XConnect
Specify a user name and password to access the Oracle database metadata. Be sure that the user has
the Select Any Table privilege and Select Permissions on the following objects in the specified
schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms.
Also ensure that the user has Select permission on SYS.v_$instance. One XConnect is needed for
each Oracle instance.
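As a sketch, the required grants can be issued as follows; MM_USER is a placeholder for the extraction user, and the statements should be run by a DBA (the v_$instance grant as SYS) against the source Oracle instance.

-- grants for the Metadata Manager extraction user (placeholder name MM_USER)
GRANT SELECT ANY TABLE TO mm_user;
GRANT SELECT ON sys.v_$instance TO mm_user;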
To extract metadata from Oracle, perform the following steps:
G Add a new SQL repository from Metadata Manager (Web interface).
G Log in to the Metadata Manager Console. Click the Source Repository Management tab. The
Oracle XConnect added above should show up in the console. Select the Oracle XConnect
and click the Configuration Properties tab. Enter the following information related to the
XConnect:
Properties Description
User Name/Password Database user name and password to access the
Oracle instance data dictionary
Data Source Name ODBC connection name to connect to the Oracle
instance data dictionary
Database Type Oracle
Connect String Oracle instance name
G Click Save when you have finished entering this information.
G To configure the list of schemas to load, click the Parameter Setup tab and select the list of
schemas to load (these are listed in the Included Schemas). Click Save when finished.
G The XConnect is ready to be loaded.
G After a successful load of the Oracle metadata, you can see the metadata in the Web interface.
Teradata XConnect
Specify a user name and password to access the Teradata metadata. Be sure that the user has access
to all the system DBC tables.
To extract metadata from Teradata Server, perform the following steps:
G Add a new SQL repository from Metadata Manager (Web interface).
G Log in to the Metadata Manager Console and click the Source Repository Management tab.
The new Teradata XConnect added above should show up in the console. Select the Teradata
XConnect and click the Configuration Properties tab. Enter the following information
related to the XConnect:
Properties Description
User Name/Password Database user name and password to access the Teradata
data dictionary (DBC tables)
Data Source Name ODBC connection name to connect to the Teradata
data dictionary
Database Type Teradata
Connection String ODBC connection name in PowerCenter repository to
Teradata
G Click Save when you have finished entering this information.
G To configure the list of user databases to load, click the Parameter Setup tab. Select the list of
databases to load (these are listed in the Included Schemas). Click Save when you are
finished.
G The XConnect is ready to be loaded.
G After a successful load of the Teradata metadata, you can see the metadata in the Web
interface.
ERwin XConnect
The following formats are required to extract metadata from ERwin:
G For ERwin 3.5, save the data model in ER1 format.
G For ERwin 4.x, save the ERwin model that you want to load into Metadata Manager in XML format.
To extract metadata from ERwin, perform the following steps:
G Log in to Metadata Manager (Web interface) and select the Administration tab. Under
Repository Management, select Repositories. Click Add to add a new repository. Enter all the
information related to the ERwin XConnect (Repository Type and Name are mandatory fields).
G Log into the Metadata Manager Console and click the Source Repository Management tab.
The ERwin XConnect added above should show up in the console. Select the ERwin XConnect
and click the Configuration Properties tab.
H Each XConnect allows you to add multiple files.
H Source System Version = Select the appropriate option.
H Click Add to add the ERwin file. Browse to the location of the ERwin file. The directory
path of the file is stored locally. To load a new ERwin file, select the current file, then
click Delete and add the new file.
H Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to
update the metadata from a particular metadata file (i.e., if the file does not contain any
changes since the last metadata load), uncheck this box.
H Click Save when you are finished.
G If you select Edit/assigned Connections for Lineage Report, set the connection assignments
between the ERwin model and the underlying database schemas. Click OK when you are
finished.
G The XConnect is ready to be loaded. After a successful load of the ERwin metadata, you can
see the metadata in the Web interface.
ER-Studio XConnect
The following format is required to extract metadata from ER-Studio:
G The ER-Studio model saved in DM1 format
To extract metadata from ERStudio, perform the following steps:
G Log in to Metadata Manager (Web interface) and select the Administration tab. Under
Repository Management, select Repositories and click Add to add a new repository. Enter all
the information related to the ERStudio XConnect (Repository Type and Name are mandatory fields).
G Log in to the Metadata Manager Console and click the Source Repository Management tab.
The ERStudio XConnect added above should show up in the console. Select the ERStudio
XConnect and click the Configuration Properties tab.
H Each XConnect allows you to add multiple files
H Source System Version = Select the appropriate option.
H Click Add to add the ERStudio file. Browse to the location of the ERStudio file. The
directory path of the file is stored locally. To load a new ERStudio file, select the current
file, and click Delete, then add the new file.
H Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to
update the metadata from a particular metadata file (i.e., if the file does not contain any
changes since the last metadata load), uncheck this box.
H Click Save when you are finished.
G If you select Edit/assigned Connections for Lineage Report, set the connection assignments
between the ERStudio model and the underlying database schemas. Click OK when you are
finished.
G The XConnect is ready to be loaded. After a successful load of the ERStudio metadata, you
can see the metadata in the Web interface.
PowerCenter XConnect
Specify a user name and password to access the PowerCenter database metadata. Be sure that the
user has the Select Any Table privilege and the ability to drop and create views. If you are using a
different Oracle user to pull PowerCenter metadata into the metadata warehouse than is used by
PowerCenter to create the metadata, you need to create synonyms in the new user's schema for all
tables and views in the PowerCenter user's schema. When the XConnect runs, it can then successfully
create the views it needs in the new user's schema.
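One way to generate those synonyms is sketched below; PCREPO is a placeholder for the PowerCenter repository schema owner, and the generated statements would be spooled and then run as the new extraction user.

-- generate CREATE SYNONYM statements for every table and view owned by the repository user
SELECT 'CREATE SYNONYM ' || object_name || ' FOR pcrepo.' || object_name || ';'
FROM all_objects
WHERE owner = 'PCREPO'
AND object_type IN ('TABLE', 'VIEW');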
To extract metadata from PowerCenter, perform the following steps:
G Log into Metadata Manager (Web interface) and select Add to add a new repository. Enter all
the information related to the PowerCenter XConnect.
G Log into the Metadata Manager Console and click the Source Repository Management tab.
The PowerCenter XConnect added above should show up in the console. Select the
PowerCenter XConnect and click the Configuration Properties tab.
G Enter the following information related to the XConnect:
Properties Description
User Name/Password Database user name and password to access the
PowerCenter repository tables
Data Source Name ODBC connection name to connect to the database
(provides information about how to connect to the
machine containing the source repository database)
Database Type Database type of PowerCenter database
Connect String Refer to the appropriate RDBMS XConnect above, based on the
database type.
G Click Save when you have finished entering this information.
G To configure the list of folders to load, click the Parameter Setup tab, and select the list of
folders to load (these are listed in the Included Folders).
G Select Enable Operational Metadata Extraction to extract operational metadata (e.g., run
details, including times and statuses for workflow, worklet, and session runs, etc.)
G Leave the Source Incremental Extract Window (in days) at its default value of 4000. (To ensure
a full extract during the initial workflow run, the workflow is configured to extract records that
have been inserted or updated within the past 4000 days of the extract.)
G Click Save when you are finished.
Configure Parameterized Connection
Use the Assign Source Parameter Files button located under the Enable Operational Metadata text-box
to assign connection parameters to a PowerCenter XConnect.
G Browse to the Parameter File Directory. Click the Add button to select the appropriate
parameter file for each workflow that is being used. Click Save when you are finished selecting
parameter files.
G The XConnect is ready to be loaded.
G After a successful load of the PowerCenter metadata, you can see the metadata in the Web
interface.
Business Objects XConnect
The Business Objects XConnect requires you to install Business Objects Designer on the machine
hosting the Metadata Manager Console and to provide a user name and password to access the Business
Objects repository.
To extract metadata from Business Objects, perform the following steps:
G Add a new SQL repository from Metadata Manager (Web interface).
G Log into the Metadata Manager Console and click the Source Repository Management tab.
The new Business Objects XConnect added above should show up in the console. Select the
Business Objects XConnect and click the Configuration Properties tab.
To configure the Business Objects repository connection setup for the first time:
G Click Configure to set up the Business Objects configuration file. The Metadata Manager Administrator
needs to define the Business Objects configuration the first time.
G Select the Business Objects repository, then enter the Business Objects login name and
password to connect to the Business Objects repository.
G Select the list of universes you need to extract.
G Select the list of documents.
G Click Save to create the Business Objects configuration file to extract metadata from Business
Objects.
G Browse to select the path and file name for the Business Objects connection configuration file.
G Click Save when you are finished.
G The XConnect is now ready to be loaded.
G After a successful load of the Business Objects metadata, you can see the metadata in the
Web interface.


Last updated: 01-Feb-07 18:53
Custom XConnect Implementation
Challenge
Metadata Manager uses XConnects to extract source repository metadata and load it
into the Metadata Manager Warehouse. The Metadata Manager Configuration Console
is used to run each XConnect. A Custom XConnect is needed to load metadata from a
source repository for which Metadata Manager does not prepackage an out-of-the box
XConnect.
Description
This document organizes all steps into phases, where each phase and step must be
performed in the order presented. To integrate custom metadata, complete tasks for
the following phases:
G Design the Metamodel.
G Implement the Metamodel Design.
G Set up and run the custom XConnect.
G Configure the reports and schema.
Prerequisites for Integrating Custom Metadata
To integrate custom metadata, install Metadata Manager and the other required
applications. The custom metadata integration process assumes knowledge of the
following topics:
G Common Warehouse Metamodel (CWM) and Informatica-Defined
Metamodels. The CWM metamodel includes industry-standard packages,
classes, and class associations. The Informatica metamodel components
supplement the CWM metamodel by providing repository-specific packages,
classes, and class associations. For more information about CWM, see http://
www.omg.org/cwm. For more information about the Informatica-defined
metamodel components, run and review the metamodel reports.
G PowerCenter Functionality. During the metadata integration process,
XConnects are configured and run. The XConnects run PowerCenter
workflows that extract custom metadata and load it into the Metadata Manager
Warehouse.
G Data Analyzer Functionality. Metadata Manager embeds Data Analyzer
functionality to create, run, and maintain a metadata reporting environment.
Knowledge of creating, modifying, and deleting reports, dashboards, and
analytic workflows in Data Analyzer is required. Knowledge of creating,
modifying, and deleting table definitions, metrics, and attributes is required to
update the schema with new or changed objects.
Design the Metamodel
During this planning phase, the metamodel is designed; the metamodel will be
implemented in the next phase.
A metamodel is the logical structure that classifies the metadata from a particular
repository type. Metamodels consist of classes, class associations, and packages,
which group related classes and class associations.
An XConnect loads metadata into the Metadata Manager Warehouse based on classes
and class associations. This task consists of the following steps:
1. Identify Custom Classes. To identify custom classes, determine the various
types of metadata in the source repository that need to be loaded into the
Metadata Manager Warehouse. Each type of metadata corresponds to one
class.
2. Identify Custom Class Properties. After identifying the custom classes, each
custom class must be populated with properties (i.e., attributes) in order for
Metadata Manager to track and report values belonging to class instances.
3. Map Custom Classes to CWM Classes. Metadata Manager prepackages all
CWM classes, class properties, and class associations. To quickly develop a
custom metamodel and reduce redundancy, reuse the predefined class
properties and associations instead of recreating them. To determine which
custom classes can inherit properties from CWM classes, map custom classes
to the packaged CWM classes. For all properties that cannot be inherited,
define them in Metadata Manager.
4. Determine the Metadata Tree Structure. Configure the way the metadata tree
displays objects. Determine the groups of metadata objects in the metadata
tree, then determine the hierarchy of the objects in the tree. Assign the
TreeElement class as a base class to each custom class.
5. Identify Custom Class Associations. The metadata browser uses class
associations to display metadata. For each identified class association,
determine if you can reuse a predefined association from a CWM base class or
if you need to manually define an association in Metadata Manager.
6. Identify Custom Packages. A package contains related classes and class
associations. Multiple packages can be assigned to a repository type to define
the structure of the metadata contained in the source repositories of the given
repository type. Create packages to group related classes and class
associations.
To see an example of sample metamodel design specifications, see Appendix A in the
Metadata Manager Custom Metadata Integration Guide.
Implement the Metamodel Design
Using the metamodel design specifications from the previous task, implement the
metamodel in Metadata Manager. This task includes the following steps:
1. Create the originator (aka owner) of the metamodel. When creating a new
metamodel, specify the originator of each metamodel. An originator is the
organization that creates and owns the metamodel. When defining a new
custom originator in Metadata Manager, select Customer as the originator
type.
G Go to the Administration tab.
G Click Originators under Metamodel Management.
G Click Add to add a new originator.
G Fill out the requested information (Note: Domain Name, Name, and
Type are mandatory fields).
G Click OK when you are finished.
2. Create the packages that contain the classes and associations of the
subject metamodel. Define the packages to which custom classes and
associations are assigned. Packages contain classes and their class
associations. Packages have a hierarchical structure, where one package can
be the parent of another package. Parent packages are generally used to group
child packages together.
G Go to the Administration tab.
G Click Packages under Metamodel Management.
G Click Add to add a new package.
G Fill out the requested information (Note: Name and Originator are mandatory fields). Choose the originator created above.
G Click OK when you are finished.
3. Create Custom Classes. In this step, create custom classes identified in the
metamodel design task.
G Go to the Administration tab.
G Click Classes under Metamodel Management.
G From the drop-down menu, select the package that you created in the step above.
G Click Add to create a new class.
G Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields).
G Base Classes: In order to see the metadata in the Metadata Manager metadata browser, you need to at least add the base class, TreeElement. To do this:
a. Click Add under Base Classes.
b. Select the package.
c. Under Classes, select TreeElement.
d. Click OK (you should now see the class properties in the properties section).
G To add custom properties to your class, click Add. Fill out the property
information (Name, Data Type, and Display Label are mandatory
fields). Click OK when you are done.
G Click OK at the top of the page to create the class.
Repeat the above steps for additional classes.
4. Create Custom Class Associations. In this step, implement the custom class
associations identified in the metamodel design phase. In the previous step,
CWM classes are added as base classes. Any of the class associations from
the CWM base classes can be reused. Define those custom class associations
that cannot be reused. If you only need the ElementOwnership association, skip
this step.
G Go to the Administration tab.
G Click Associations under Metamodel Management.
G Click Add to add a new association.
G Fill out the requested information (all bold fields are required).
G Click OK when you are finished.
5. Create the Repository Type. Each type of repository contains unique
metadata. For example, a PowerCenter data integration repository type
contains workflows and mappings, but a Data Analyzer business intelligence
repository type does not. Repository types maintain the uniqueness of each
repository.
G Go to the Administration tab.
G Click Repository Types under Metamodel Management.
G Click Add to add a new repository type.
G Fill out the requested information (Note: Name and Product Type are mandatory fields).
G Click OK when you are finished.
6. Configure a Repository Type Root Class. Root classes display under the
source repository in the metadata tree. All other objects appear under the root
class. To configure a repository root class:
G Go to the Administration tab.
G Click Custom Repository Type Root Classes under Metamodel Management.
G Select the custom repository type.
G Optionally, select a package to limit the number of classes that display.
G Select the Root Class option for all applicable classes.
G Click Apply to apply the changes.
Set Up and Run the XConnect
The objective of this task is to set up and run the custom XConnect. Custom
XConnects involve a set of mappings that transform source metadata into the required
format specified in the Informatica Metadata Extraction (IME) files. The custom
XConnect extracts the metadata from the IME files and loads it into the Metadata
Manager Warehouse. This task includes the following steps:
1. Determine which Metadata Manager Warehouse tables to load. Although
you do not have to load all Metadata Manager Warehouse tables, you must
load the following Metadata Manager Warehouse tables:
G IMW_ELEMENT: The IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that element is used generically to mean packages, classes, or properties.
G IMW_ELMNT_ATTR: The IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.
G IMW_ELMNT_ASSOC: The IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.
To stop the metadata load into particular Metadata Manager Warehouse
tables, disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so
that it conforms to the format specified in each required IME interface file. (The
IME files are packaged with the Metadata Manager documentation.) Present
the reformatted metadata in a valid source type format. To extract the
reformatted metadata, the integration workflows require that the reformatted
metadata be in one or more of the following source type formats: database
table, database view, or flat file. Note that you can load metadata into a
Metadata Manager Warehouse table using more than one of the accepted
source type formats.
3. Register the Source Repository Instance in Metadata Manager. Before the
Custom XConnect can extract metadata, the source repository must be
registered in Metadata Manager. When registering the source repository, the
Metadata Manager application assigns a unique repository ID that identifies the
source repository. Once registered, Metadata Manager adds an XConnect in
the Configuration Console for the source repository. To register the source
repository, go to the Metadata Manager web interface. Register the repository
under the custom repository type created above. All packages, classes, and
class associations defined for the custom repository type apply to all repository
instances registered to the repository type. When defining the repository,
provide descriptive information about the repository instance. Once the
repository is registered in Metadata Manager, Metadata Manager adds an
XConnect in the Configuration Console for the repository.
Create the Repository that will hold the metadata extracted from the source
system:
G Go to the Administration tab.
G Click Repositories under Repository Management.
G Click Add to add a new repository.
G Fill out the requested information (Note: Name and Repository Type
are mandatory fields). Choose the repository type created above.
G Click OK when finished.
4. Configure the Custom Parameter Files. Custom XConnects require that the
parameter file be updated by specifying the following information:
G The source type (database table, database view, or flat file).
G The name of the database views or tables used to load the Metadata Manager Warehouse, if applicable.
G The list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable.
G The worklets you want to enable and disable.
Understanding Metadata Manager Workflows for Custom Metadata
G wf_Load_IME. Custom workflow to extract and transform metadata from the
source repository into IME format. This is created by a developer.
Metadata Manager prepackages the following integration workflows for
custom metadata. These workflows read the IME files mentioned above
and load them into the Metadata Manager Warehouse.
H WF_STATUS: Extracts and transforms statuses from any
source repository and loads them into the Metadata Manager
Warehouse. To resolve status IDs correctly, the workflow is
configured to run before the WF_CUSTOM workflow.
H WF_CUSTOM: Extracts and transforms custom metadata from
IME files and loads that metadata into the Metadata Manager
Warehouse.

5. Configure the Custom XConnect. The XConnect loads metadata into the
Metadata Manager Warehouse based on classes and class associations
specified in the custom metamodel.
When the custom repository type is defined, Metadata Manager registers the
corresponding XConnect in the Configuration Console. The following
information in the Configuration Console configures the XConnect:
G Under the Administration Tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs.
G Workflows to load the metadata:
H CustomXConnect - wf_Load_IME workflow
H Metadata Manager - WF_CUSTOM workflow (prepackages all worklets and sessions required to populate all Metadata Manager Warehouse tables, except the IMW_STATUS table)
H Metadata Manager - WF_STATUS workflow (populates the IMW_STATUS table)
Note: Metadata Manager Server does not load Metadata Manager
Warehouse tables that have disabled worklets.
G Under the Administration Tab, select Custom Workflow Configuration
and choose the parameter file used by the workflows to load the
metadata (the parameter file name is assigned at first data load). This
parameter file name has the form nnnnn.par, where nnnnn is a five
digit integer assigned at the time of the first load of this source
repository. The script promoting Metadata Manager from the
development environment to test and from the test environment to
production preserves this file name.
6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata
load, reset the $$SRC_INCR_DATE parameter to extract metadata over shorter
intervals, such as every few days (an example follows this list). The value depends on how often the
Metadata Manager Warehouse needs to be updated. If the source does not provide the
date when the records were last updated, records are extracted regardless of
the $$SRC_INCR_DATE parameter setting.
7. Run the Custom XConnect. Using the Configuration Console, Metadata
Manager Administrators can run the custom XConnect and ensure that the
metadata loads correctly.
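As an illustration of step 6, setting $$SRC_INCR_DATE=30 in the XConnect's parameter file limits each
subsequent extract to records inserted or updated within the last 30 days. The value 30 is purely
illustrative; place the assignment in whichever section of the assigned parameter file already holds the
parameter, since the section headings depend on the workflow and session names used by the custom
XConnect.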
Note: When loading metadata with Effective From and Effective To Dates, Metadata
Manager does not validate whether the Effective From Date is less than the Effective
To Date. Ensure that each Effective To Date is greater than the Effective From Date. If
you do not supply Effective From and Effective To Dates, Metadata Manager sets the
Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.
To Run a Custom XConnect

G Log in to the Configuration Console.
G Click Source Repository Management.
G Click Load next to the custom XConnect you want to run.
Configure the Reports and Schema
The objective of this task is to set up the reporting environment, which is needed to run
reports on the metadata stored in the Metadata Manager Warehouse. The setup of the
reporting environment depends on the reporting requirements. The following options
are available for creating reports:
G Use the existing schema and reports. Metadata Manager contains
prepackaged reports that can be used to analyze business intelligence
metadata, data integration metadata, data modeling tool metadata, and
database catalog metadata. Metadata Manager also provides impact analysis
and lineage reports that provide information on any type of metadata.
G Create new reports using the existing schema. Build new reports using the
existing Metadata Manager metrics and attributes.
G Create new Metadata Manager Warehouse tables and views to support
the schema and reports. If the prepackaged Metadata Manager schema
does not meet the reporting requirements, create new Metadata Manager
Warehouse tables and views. Prefix the name of custom-built tables with
Z_IMW_. Prefix custom-built views with Z_IMA_. If you build new Metadata
Manager Warehouse tables or views, register the tables in the Metadata
Manager schema and create new metrics/attributes in the Metadata Manager
schema. Note that the Metadata Manager schema is built on the Metadata
Manager views. (A sketch of this naming convention appears after this list.)
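As an illustration of the naming convention only, the following sketch creates a hypothetical custom table
and its matching view; the object and column names are invented and are not part of the packaged
Metadata Manager schema (Oracle syntax shown).

    -- Hypothetical custom warehouse table following the Z_IMW_ prefix rule.
    CREATE TABLE Z_IMW_APP_OWNER (
        ELEMENT_UID  VARCHAR2(255) NOT NULL,
        OWNER_NAME   VARCHAR2(80),
        COST_CENTER  VARCHAR2(20)
    );

    -- Matching reporting view following the Z_IMA_ prefix rule.
    CREATE VIEW Z_IMA_APP_OWNER AS
        SELECT ELEMENT_UID, OWNER_NAME, COST_CENTER
          FROM Z_IMW_APP_OWNER;

After creating such objects, register the view in the Metadata Manager schema and define the metrics and
attributes on it, as described above.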
After the environment setup is complete, test all schema objects, such as dashboards,
analytic workflows, reports, metrics, attributes, and alerts.


Last updated: 01-Feb-07 18:53
Customizing the Metadata Manager Interface
Challenge
Customizing the Metadata Manager Presentation layer to meet specific business needs.
Description
Configuring Metamodels
You may need to configure metamodels for a repository type in order to integrate additional metadata into a Metadata
Manager Warehouse and/or to adapt to changes in metadata reporting and browsing requirements. For more information
about creating a metamodel for a new repository type, see the Metadata Manager Custom Metadata Integration Guide.
Use Metadata Manager to define a metamodel, which consists of the following objects:
G Originator - the party that creates and owns the metamodel.
G Packages - contain related classes that model metadata for a particular application domain or specific
application. Multiple packages can be defined under the newly defined originator. Each package stores classes
and associations that represent the metamodel.
G Classes and Class Properties - define a type of object, with its properties, contained in a repository. Multiple
classes can be defined under a single package. Each class has multiple properties associated to it. These
properties can be inherited from one or many base classes already available. Additional properties can be
defined directly under the new class.
G Associations - define the relationship among classes and their objects. Associations help define relationships
across individual classes. The cardinality helps define 1-1, 1-n, or n-n relationships. These relationships mirror
real life associations of logical, physical, or design-level building blocks of systems and processes.
For more information about metamodels, originators, packages, classes, and associations, see Metadata Manager
Concepts in the Metadata Manager Administration Guide.
After you define the metamodel, you need to associate it with a repository type. When registering a repository under a
repository type, all classes and associations assigned to the repository type through packages apply to the repository.
Repository Types
You can configure types of repositories for the metadata you want to store and manage in the Metadata Manager
Warehouse.
You must configure a repository type when you develop an XConnect. You can modify some attributes for existing
XConnects and XConnect repository types.
Displaying Objects of an Association in the Metadata Tree
Metadata Manager displays many objects in the metadata tree by default because of the predefined associations among
metadata objects. Associations determine how objects display in the metadata tree.
To display an object that doesn't already display in the metadata tree, add an association between the objects in the IMM.properties
file. For example, Object A displays in the metadata tree but Object B does not. To display Object B under
Object A in the metadata tree, perform the following actions:
G Create an association from Object B to Object A. 'From Objects' in an association display as parent objects; 'To
Objects' display as child objects. The 'To Object' displays in the metadata tree only if the 'From Object' in the
association already displays in the metadata tree. For more information about adding associations, refer to
Adding Object Associations in the Metadata Manager User Guide.
G Add the association to the IMM.properties file. Metadata Manager only displays objects in the metadata tree if
the corresponding association between their classes is included in the IMM.properties file.
Note: Some associations are not explicitly defined among the classes of objects. Some objects reuse associations
based on the ancestors of the classes. The metadata tree displays objects that have explicit or reused associations.
To Add an Association to the IMM.properties File
1. Open the IMM.properties file. The file is located in the following directory:
G For WebLogic: <WebLogic_home>\user_projects\domains\<domain>
G For WebSphere: <WebSphere_home>\DeploymentManager
G For JBoss: <JBoss_home>\bin
2. Add the association ID under the findtab.parentChildAssociations parameter (see the example after these steps).
To determine the ID of an association, click Administration > Metamodel Management > Associations, and then click the
association on the Associations page.
3. Save and close the IMM.properties file.
4. Stop and then restart the Metadata Manager Server to apply the changes.
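As an illustration of step 2, if the Associations page shows an ID of 12345 for the new association (an
invented value), 12345 is what goes under the findtab.parentChildAssociations parameter. If the parameter
already lists other association IDs, append the new ID using the same delimiter the installed file already
uses rather than replacing the existing entries.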
Customizing Metadata Manager Metadata Browser
The Metadata Browser, on the Metadata Directory page, is used for browsing source repository metadata stored in the
Metadata Manager Warehouse. The following figure shows a sample metadata directory page on the 'Find Tab' of
Metadata Manager.
The Metadata Directory page consists of the following areas:
G Query Task Area - allows you to search for metadata objects stored in the Metadata Manager Warehouse.
G Metadata Tree Task Area - allows you to navigate to a metadata object in a particular repository.
G Results Task Area - displays metadata objects based on an object search in the Query Task area or based on
the object selected in the Metadata Tree Task area.
G Details Task Area - displays properties about the selected object. You can also view associations between the
object and other objects, and run related reports from the Details Task area.
For more information about the Metadata Directory page on the Find tab, refer to the Accessing Source Repository
Metadata chapter in the Metadata Manager User Guide.
You can perform the following customizations while browsing the source repository metadata:
Configure Display Properties
Metadata Manager displays a set of default properties for all items in the 'Results Task' area. The default properties are
generic properties that apply to all metadata objects stored in the Metadata Manager Warehouse.
By default, Metadata Manager displays the following properties in the 'Results Task' area for each source repository
object:
G Class - Displays an icon that represents the class of the selected object. The class name appears when you
place the pointer over the icon.
G Label - Label of the object.
G Source Update Date - Date the object was last updated in the source repository.
G Repository Name - Name of the source repository from which the object originates.
G Description - Describes the object.
The default properties that appear in the 'Results Task' area can, however, be rearranged, added, and/or removed for a
Metadata Manager user account. For example, you can remove the default Class and Source Update Date properties,
move the Repository Name property to precede the Label property, and add a different property, such as the Warehouse
Insertion Date, to the list.
Additionally, you can add other properties that are specific to the class of the selected object. With the exception of
Label, all other default properties can be removed. You can select up to ten properties to display in the 'Results Task'
area. Metadata Manager displays them in the order specified while configuring.
If there are more than ten properties to display, Metadata Manager displays the first ten, displaying common properties
first in the order specified and then all remaining properties in alphabetical order based on the property display label.
Applying Favorite Properties for Multiple Classes of Objects
The modified property display settings can be applied to any class of objects displayed in the 'Results Task' area. When
selecting an object in the metadata tree, multiple classes of objects can appear in the 'Results Task' area. The following
figure shows how to apply the modified display settings for each class of objects in the 'Results Task' area:
The same settings can be applied to the other classes of objects that currently display in the 'Results Task' area.
If the settings are not applied to the other classes, then the settings apply to the objects of the same class as the object
selected in the metadata tree.
Configuring Object Links
Object links are created to link related objects without navigating the metadata tree or searching for the object. Refer to
the Metadata Manager User Guide to configure the object link.
Configuring Report Links
Report links can be created to run reports on a particular metadata object. When creating a report link, assign a
Metadata Manager report to a specific object. While creating a report link, you can also create a run report button to run
the associated report. The run report button appears in the top, right corner of the 'Details Task' area. When you create
the run report button, you also have the option of applying it to all objects of the same class. You can create a maximum
of three run report buttons per object.
Customizing Metadata Manager Packaged Reports, Dashboards, and Indicators
You can create new reporting elements and attributes under Schema Design. These elements can be used in new
reports or existing report extensions. You can also extend or customize out-of-the-box reports, indicators, or dashboards.
Informatica recommends using the Save As new report option for such changes in order to avoid any conflicts during
upgrades.
Further, you can use Data Analyzer's 1-2-3-4 report creation wizard to create new reports. Informatica recommends
saving such reports in a new report folder to avoid conflict during upgrades.
Customizing Metadata Manager ODS Reports
Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although
these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box
reports can be used as a guideline for creating reports for other types of source repositories, such as a repository for
which Metadata Manager does not prepackage an XConnect.


Last updated: 01-Feb-07 18:53
Estimating Metadata Manager Volume Requirements
Challenge
Understanding the relationship between various inputs for the Metadata Manager solution in order to
estimate volumes for the Metadata Manager Warehouse.
Description
The size of the Metadata Manager warehouse is directly proportional to the size of metadata being loaded
into it. The size is also dependent on the number of element attributes being captured in source metadata
and the associations defined in the metamodel.
When estimating volume requirements for a Metadata Manager implementation, consider the
following Metadata Manager components:
G Metadata Manager Server
G Metadata Manager Console
G Metadata Manager Integration Repository
G Metadata Manager Warehouse
Note: Refer to the Metadata Manager Installation Guide for complete information on minimum system
requirements for server, console and integration repository.
Considerations
Volume estimation for Metadata Manager is an iterative process. Use the Metadata Manager development
environment to get accurate size estimates for the Metadata Manager production environment. The required
steps are as follows:
1. Identify the source metadata that needs to be loaded in the Metadata Manager production
warehouse.
2. Size the Metadata Manager Development warehouse based on the initial sizing estimates (as
explained in the next section of this document).
3. Run the XConnects and monitor the disk usage. The development data loaded during the initial run
of the XConnects should be used as a baseline for further sizing estimates.
4. If an XConnect run fails due to lack of disk space, add the additional disk space and restart the XConnect.
Repeat steps 1 through 4 until the XConnect run is successful.
Initial sizing estimates for a typical Metadata Manager implementation are illustrated for the following
components: Metadata Manager Server, Metadata Manager Console, Metadata Manager Integration
Repository, and Metadata Manager Warehouse.
The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial
estimate. For increased input sizes, consider the expected Metadata Manager warehouse target size to
increase in direct proportion.

XConnect                      Input Size    Expected Metadata Manager Warehouse Target Size
Metamodel and other tables    -             50MB
PowerCenter                   1MB           10MB
Data Analyzer                 1MB           4MB
Database                      1MB           5MB
Other XConnect                1MB           4.5MB
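Taken at face value, these ratios scale linearly. For example, a site loading 200MB of PowerCenter
metadata, 50MB of Data Analyzer metadata, and 100MB of database catalog metadata could expect roughly
50MB + (200MB x 10) + (50MB x 4) + (100MB x 5) = 2,750MB, or about 2.7GB, in the Metadata Manager
warehouse, before indexes and temporary space are considered. Treat such a figure only as a starting
point; the baseline measured from the development loads in step 3 above remains the authoritative input
for production sizing.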

Last updated: 01-Feb-07 18:53

Metadata Manager Load Validation
Challenge
Just as it is essential to know that all data for the current load cycle has loaded correctly, it is important to
ensure that all metadata extractions (XConnects) loaded correctly into the Metadata Manager warehouse. If
metadata extractions do not execute successfully, the Metadata Manager warehouse will not be current with
the most up-to-date metadata.
Description
The process for validating Metadata Manager metadata loads is very simple using the Metadata Manager
Configuration Console. In the Metadata Manager Configuration Console, you can view the run history for
each of the XConnects. For those who are familiar with PowerCenter, the Run History portion of the
Metadata Manager Configuration Console is similar to the Workflow Monitor in PowerCenter.
To view XConnect run history, first log into the Metadata Manager Configuration Console.

After logging into the console, click Administration > Repositories. The XConnect Repositories are
displayed with their last load date and status.

The XConnect run history is displayed (see below) on the Source Repository Management screen. A
Metadata Manager Administrator should log into the Metadata Manager Configuration Console on a regular
basis and verify that all XConnects that were scheduled ran to successful completion.

If any XConnect has a status of Failed in the Last Refresh Status column (as shown above), the issue
should be investigated and corrected, and the XConnect should be re-executed. XConnects can fail for a
variety of reasons common in IT, such as unavailability of the database, network failure, improper
configuration, etc.
More detailed error messages can be found in the activity log or in the workflow log files. By clicking on the
Output tab of the selected XConnect in the Metadata Manager Console, you can view the output for the
most recent run of the selected XConnect. In most cases, the logging is set up to write to the <PowerCenter
installation directory>\client\Console\ActivityLog file.

After investigating and correcting the issue, the XConnect that failed should be re-executed at the next
available time in order to load the most recent metadata.
Last updated: 01-Feb-07 18:53
Metadata Manager Migration Procedures
Challenge

This Best Practice describes the processes that need to be followed (as part of the Metadata
Manager deployment in multiple environments) whenever out of the box Metadata Manager components are customized
or configured, or when new components are added to Metadata Manager.

Because the Metadata Manager product consists of multiple components, the steps apply to individual product
components. The deployment processes are divided into the following four categories:

G Reports: This would include changes to the reporting schema and the out-of-the-box reports. In addition, this would also include any new reports or schema elements created to cater to the custom reporting needs at the specific implementation instance of the product.
G Metamodel: This would include the creation of new metamodel components to help associate any custom metadata against repository types and domains that are not covered by the out-of-the-box Metadata Manager repository types.
G Metadata: This would include the creation of new metadata objects, their properties, or associations against repository instances configured within Metadata Manager. These repository instances could either belong to the repository types supported out of the box by Metadata Manager or any new repository types configured through custom additions to the metamodels.
G Integration Repository: This would include changes to the out-of-the-box PowerCenter workflows or mappings. In addition, this would also include any new PowerCenter objects (mappings, transformations, etc.) or associated workflows.
Description
Report Changes

The following chart depicts the various scenarios related to the reporting area and the actions that need to be taken as
relates to the deployment of the changed components. It is always advisable to create new schema elements (metrics,
attributes etc.) or new reports in a new Data Analyzer folder to facilitate exporting or importing the Data Analyzer objects
across development, test and production.
Nature of Report Change: Modify a schema component (metric, attribute, etc.)
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed components.
G Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components.
G Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components.

Nature of Report Change: Modify an existing report (add or delete metrics, attributes, filters, change formatting, etc.)
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed report.
G Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed report.
G Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed report.

Nature of Report Change: Add a new schema component (metric, attribute, etc.)
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new schema components.
G Test: Import the XML exported in the development environment.
G Production: Import the XML exported in the development environment.

Nature of Report Change: Add a new report
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new report.
G Test: Import the XML exported in the development environment.
G Production: Import the XML exported in the development environment.
Metamodel Changes
The following chart depicts the various scenarios related to the metamodel area and the actions that need to be taken as
relates to the deployment of the changed components.
Nature of the Change: Add a new metamodel component
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new metamodel components (the export can be done at three levels: Originators, Repository Types, and Entry Points) using the Export Metamodel option.
G Test: Import the XML exported in the development environment using the Import Metamodel option.
G Production: Import the XML exported in the development environment using the Import Metamodel option.

Integration Repository Changes
The following chart depicts the various scenarios related to the integration repository area and the actions that need to be
taken as relates to the deployment of the changed components. It is always advisable to create new mappings,
transformations, workflows, etc. in a new PowerCenter folder so that it becomes easy to export the ETL objects across
development, test and production.
Nature of the Change: Modify an existing mapping, transformation, and/or the associated workflows
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed objects.
G Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed object.
G Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed object.

Nature of the Change: Add a new ETL object (mapping, transformation, etc.) and create an associated workflow
G Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new objects.
G Test: Import the XML exported in the development environment.
G Production: Import the XML exported in the development environment.




Last updated: 01-Feb-07 18:53
Metadata Manager Repository
Administration
Challenge
The task of administering the Metadata Manager Repository involves taking care of
both the integration repository and the Metadata Manager warehouse. This requires a
knowledge of both PowerCenter administrative features (i.e., the integration repository
used in Metadata Manager) and Metadata Manager administration features.
Description
A Metadata Manager administrator needs to be involved in the following areas to
ensure that the Metadata Manager metadata warehouse is fulfilling the end-user needs:
G Migration of Metadata Manager objects created in the Development
environment to QA or the Production environment
G Creation and maintenance of access and privileges of Metadata
Manager objects
G Repository backups
G Job monitoring
G Metamodel creation.
Migration from Development to QA or Production
In cases where a client has modified out-of-the-box objects provided in Metadata
Manager or created a custom metamodel for custom metadata, the objects must be
tested in the Development environment prior to being migrated to the QA or Production
environments. The Metadata Manager Administrator needs to do the following to
ensure that the objects are in sync between the two environments:
G Install a new Metadata Manager instance for the QA/Production environment.
This involves creating a new integration repository and Metadata
Manager warehouse
G Export the metamodel from the Development environment and import it to QA
or production via XML Import/Export functionality (in the Metadata
Manager Administration tab) or via the Metadata Manager command line
utility.
G Export the custom or modified reports created or configured in the
Development environment and import them to QA or Production via the
XML Import/Export functionality in the Metadata Manager Administration tab. This
functionality is identical to the function in Data Analyzer; refer to the
Data Analyzer Administration Guide for details on the import/export
function.
Providing Access and Privileges
Users can perform a variety of Metadata Manager tasks based on their privileges.
The Metadata Manager Administrator can assign privileges to users by assigning them
roles. Each role has a set of privileges that allow the associated users to perform
specific tasks. The Administrator can also create groups of users so that all users in a
particular group have the same functions. When an Administrator assigns a role to a
group, all users of that group receive the privileges assigned to the role. For more
information about privileges, users, and groups, see the Data Analyzer Administrator
Guide.
The Metadata Manager Administrator can assign privileges to users to enable users to
perform any of the following tasks in Metadata Manager:
G Configure reports. Users can view particular reports, create reports, and/or
modify the reporting schema.
G Configure the Metadata Manager Warehouse. Users can add, edit, and
delete repository objects using Metadata Manager.
G Configure metamodels. Users can add, edit, and delete metamodels.
Metadata Manager also allows the Administrator to create access permissions on
specific source repository objects for specific users. Users can be restricted to reading,
writing, or deleting source repository objects that appear in Metadata Manager.
Similarly, the Administrator can establish access permissions for source repository
objects in the Metadata Manager warehouse. Access permissions determine the tasks
that users can perform on specific objects. When the Administrator sets access
permissions, he or she determines which users have access to the source repository
objects that appear in Metadata Manager. The Administrator can assign the following
types of access permissions to objects:
G Read - Grants permission to view the details of an object and the names of
any objects it contains.
G Write - Grants permission to edit an object and create new repository objects
in the Metadata Manager warehouse.
G Delete - Grants permission to delete an object from a repository.
G Change permission - Grants permission to change the access permissions for
an object.
When a repository is first loaded into the Metadata Manager warehouse, Metadata
Manager provides all permissions to users with the System Administrator role. All other
users receive read permissions. The Administrator can then set inclusive and exclusive
access permissions.
Metamodel Creation
In cases where a client needs to create custom metamodels for sourcing custom
metadata, the Metadata Manager Administrator needs to create new packages,
originators, repository types and class associations. For details on how to create new
metamodels for custom metadata loading and rendering in Metadata Manager, refer to
the Metadata Manager Installation and Administration Guide.
Job Monitoring
When Metadata Manager XConnects are running in the Production environment,
Informatica recommends monitoring loads through the Metadata Manager console. The
Configuration Console Activity Log in the Metadata Manager console can identify
the total time it takes for an XConnect to complete. The console maintains a history of
all runs of an XConnect, enabling a Metadata Manager Administrator to ensure that
load times are meeting the SLA agreed upon with end users and that the load times are
not increasing inordinately as data increases in the Metadata Manager warehouse.
The Activity Log provides the following details about each repository load:
G Repository Name- name of the source repository defined in Metadata
Manager
G Run Start Date- day of week and date the XConnect run began
G Start Time- time the XConnect run started
G End Time- time the XConnect run completed
G Duration- number of seconds the XConnect run took to complete
G Ran From- machine hosting the source repository
G Last Refresh Status- status of the XConnect run, and whether it completed
successfully or failed
Repository Backups
When Metadata Manager is running in either the Production or QA environment,
Informatica recommends taking periodic backups of the following areas:
G Database backups of the Metadata Manager warehouse
G Integration repository; Informatica recommends either of two methods for this
backup:
H The PowerCenter Repository Server Administration Console or pmrep
command line utility
H The traditional, native database backup method.
The native PowerCenter backup is required but Informatica recommends using both
methods because, if database corruption occurs, the native PowerCenter backup
provides a clean backup that can be restored to a new database.


Last updated: 01-Feb-07 18:53
Upgrading Metadata Manager
Challenge
This best practices document summarizes the instructions for a Metadata Manager
upgrade and should not be used for upgrading a PowerCenter repository. Refer to the
PowerCenter Installation and Configuration Guide for detailed instructions on the
PowerCenter or Metadata Manager upgrade process.
Before you start the upgrade process, be sure to check the Informatica support
information for the Metadata Manager upgrade path. For instance, Metadata Manager
2.1 or 2.2 should first be upgraded to Metadata Manager 8.1 and then to Metadata
Manager 8.1.1.
Also verify the requirements for the following Metadata Manager 8.1.1 components:
G Metadata Manager and Metadata Manager Client
G Web browser
G Databases
G Third-party software
G Code pages
G Application server
For more information about requirements for each component, see Chapter 3
PowerCenter Prerequisites in the PowerCenter Installation and Configuration Guide.
Metadata Manager is made up of various components. Except for the Metadata
Manager Repository, all other components (i.e., Metadata Manager Server,
PowerCenter Repository, PowerCenter Clients, and Metadata Manager Clients) are
uninstalled and then reinstalled with the latest version as part of the Metadata
Manager upgrade process.
Keep in mind that all modifications and/or customizations to the standard version of
Metadata Manager will be lost and will need to be re-created and re-tested after the
upgrade process.
Description
Upgrade Steps
1. Set up new repository database and user account.
G Set up new database/schema for the PowerCenter Metadata Manager
repository. For Oracle, set the appropriate storage parameters. For IBM DB2,
use a single node tablespace to optimize PowerCenter performance. For IBM
DB2, configure the system temporary table spaces and update the heap sizes.
G Create a database user account for the PowerCenter Metadata Manager
repository. The database user must have permissions to create and drop
tables and indexes, and to select, insert, update, and delete data from tables
(an example follows this step). For more information, see the PowerCenter
Installation and Configuration Guide.
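For an Oracle repository database, the account setup might look like the following sketch. The user name,
password, and tablespace names are placeholders, and the privilege list should be confirmed against the
PowerCenter Installation and Configuration Guide.

    -- Placeholder names throughout; size the quota according to your estimates.
    CREATE USER mm_repo IDENTIFIED BY change_me
        DEFAULT TABLESPACE mm_data
        TEMPORARY TABLESPACE temp
        QUOTA UNLIMITED ON mm_data;

    -- Schema ownership plus these privileges covers creating and dropping its own
    -- tables, views, and indexes, and all DML on its own objects.
    GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO mm_repo;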
2. Make a copy of the existing Metadata Manager repository.
G You can use any backup or copy utility provided with the database to make a
copy of the working Metadata Manager repository prior to upgrading the
Metadata Manager. Use the copy of the Metadata Manager repository for the
new Metadata Manager installation.
3. Back up the existing parameter files.
G Make a copy of the existing parameter files. If you have custom XConnects
and the parameter, attribute, and data files of these custom XConnects are in a
different location, back them up as well. You may need to
refer to these files when you later configure the parameters for the custom
XConnects as part of the Metadata Manager client upgrade.
For PowerCenter 8.0, you can find the parameter files in the following directory:
PowerCenter_Home\server\infa_shared\SrcFiles
For Metadata Manager, you can find the parameter files in the following
directory:
PowerCenter_Home\Server\SrcFiles
4. Export the Metadata Manager mappings that you customized or created for your
environment.
G If you made any changes to the standard Metadata Manager mappings, or
created new mappings within the Metadata Manager Integration
repository, export these mappings, workflows, and/or sessions.
G If you created any additional reports, export these reports too.
5. Install Metadata Manager.
G Select the Custom installation set and install Metadata Manager. The installer
creates a Repository Service and Integration Service in the PowerCenter
domain and creates a PowerCenter repository for Metadata Manager. For
more information about installing Metadata Manager, see the PowerCenter
Installation and Configuration Guide.
6. Stop the Metadata Manager server.
G You must stop the Metadata Manager server before you upgrade the Metadata
Manager repository contents. For more information about stopping Metadata
Manager, see Appendix C Starting and Stopping Application Servers in the
PowerCenter Installation and Configuration Guide.
7. Upgrade the Metadata Manager repository.
G Use the Metadata Manager upgrade utility shipped with the latest version of
Metadata Manager to upgrade the Metadata Manager repository. For
instructions on running the Metadata Manager upgrade utility, see the
PowerCenter Installation and Configuration Guide.
8. Complete the Metadata Manager post-upgrade tasks.
After you upgrade the Metadata Manager repository, perform the following tasks:
G Update metamodels for Business Objects and Cognos ReportNet Content
Manager.
G Delete obsolete Metadata Manager objects.
G Refresh Metadata Manager views.
G For a DB2 Metadata Manager repository, import metamodels.
For more information about the post-upgrade tasks, see the PowerCenter Installation
and Configuration Guide.
9. Upgrade the Metadata Manager Client.
G For instructions on upgrading the Metadata Manager Client, see
the PowerCenter Installation and Configuration Guide.
G After you complete the upgrade steps, verify that all dashboards and reports
are working correctly in Metadata Manager. When you are sure that the new
version is working properly, you can delete the old instance of Metadata
Manager and switch to the new version.
10. Compare and redeploy the exported Metadata Manager mappings that were
customized or created for your environment.
G If you had any modified Metadata Manager mappings in the previous release
of Metadata Manager, check whether the modifications are still necessary. If
the modifications are still needed, override or rebuild the changes in the new
PowerCenter mappings.
G Import the customized reports into the new environment and check that the
reports are still working with the new Metadata Manager environment. If not,
make the necessary modifications to make them compatible with the new
structure.
11. Upgrade the Custom XConnects
G If you have any custom XConnects in your environment, you need to
regenerate the XConnect mappings that were generated by the previous
version of the custom XConnect configuration wizard. Before starting the
regeneration process, ensure that the absolute paths to the .csv files are the
same as the previous version. If all the paths are the same, no further actions
are required after the regeneration of the workflows and mappings.
12. Uninstall the previous version of Metadata Manager.
G Verify that the browser and all reports are working correctly in Metadata
Manager 8.1. If the upgrade is successful, you can uninstall the previous
version of Metadata Manager.
Daily Operations
Challenge
Once the data warehouse has been moved to production, the most important task is keeping the system
running and available for the end users.
Description
In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production
Support team. This team is typically involved with the support of other systems and has expertise in
database systems and various operating systems. The Data Warehouse Development team becomes, in
effect, a customer to the Production Support team. To that end, the Production Support team needs two
documents, a Service Level Agreement and an Operations Manual, to help in the support of the production
data warehouse.
Monitoring the System
Monitoring the system is useful for identifying any problems or outages before the users notice. The
Production Support team must know what failed, where it failed, when it failed, and who needs to be working
on the solution. Identifying outages and/or bottlenecks can help to identify trends associated with various
technologies. The goal of monitoring is to reduce downtime for the business user. Comparing the
monitoring data against threshold violations, service level agreements, and other organizational
requirements helps to determine the effectiveness of the data warehouse and any need for changes.
Service Level Agreement
The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained.
This is a high-level document that discusses system maintenance and the components of the system, and
identifies the groups responsible for monitoring the various components. The SLA should be able to be
measured against key performance indicators. At a minimum, it should contain the following information:
G Times when the system should be available to users.
G Scheduled maintenance window.
G Who is expected to monitor the operating system.
G Who is expected to monitor the database.
G Who is expected to monitor the PowerCenter sessions.
G How quickly the support team is expected to respond to notifications of system failures.
G Escalation procedures that include data warehouse team contacts in the event that the support
team cannot resolve the system failure.
Operations Manual
The Operations Manual is crucial to the Production Support team because it provides the information
needed to perform the data warehouse system maintenance. This manual should be self-contained,
providing all of the information necessary for a production support operator to maintain the system and
resolve most problems that can arise. This manual should contain information on how to maintain all data
warehouse system components. At a minimum, the Operations Manual should contain:
Information on how to stop and re-start the various components of the system.
IDs and passwords (or how to obtain passwords) for the system components.
Information on how to re-start failed PowerCenter sessions and recovery procedures.
A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run
times.
INFORMATICA CONFIDENTIAL BEST PRACTICE 443 of 702
Error handling strategies.
Who to call in the event of a component failure that cannot be resolved by the Production Support
team.
PowerExchange Operations Manual
The need to maintain archive logs and listener logs, use started tasks, perform recovery, and other operation
functions on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are
not cleaned up on a regular basis, operations is likely to face space issues. Setting up archive logs on MVS
requires datasets to be allocated and sized. Recovery after failure requires operations intervention to restart
workflows and set the restart tokens. For Change Data Capture, operations are required to start the started
tasks in a scheduler and/or after an IPL. There are certain commands that need to be executed by operations.
The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information
on the operation of PowerExchange Change Data Capture.
Archive/Listener Log Maintenance
The archive log should be controlled by using the Retention Period specified in the EDMUPARM
ARCHIVE_OPTIONS in parameter ARCHIVE_RTPD=. The default supplied in the Install (in RUNLIB
member SETUPCC2) is 9999. This is generally longer than most organizations need. To change it,
just rerun the first step (and only the first step) in SETUPCC2 after making the appropriate changes. Any
new archive log datasets will be created with the new retention period. This does not, however, fix the
archive datasets already created with the old retention period; to do that, use SMS to override the
specification, removing the need to change the EDMUPARM.
The listener default log is part of the joblog of the running listener. If the listener job runs
continuously, there is a potential risk of the spool file reaching the maximum and causing issues with the
listener. For example, if the listener started task is scheduled to restart every weekend, the log will be
refreshed and a new spool file will be created. If necessary, change the started task listener jobs from
//DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG, which will log the file to the member LOG
in the HLQ..RUNLIB.
Recovery After Failure
The last resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC
process from the new initial load start point. Fortunately, there are other solutions. In any case, if you do
need every change, re-initializing may not be an option.
Application ID
PowerExchange documentation talks about consuming applications: the processes that extract changes,
whether they are realtime or change (periodic batch extraction). Each consuming application must identify
itself to PowerExchange. Realistically, this means that each session must have an application id parameter
containing a unique label.
Restart Tokens
PowerExchange remembers each time that a consuming application successfully extracts changes. The
end-point of the extraction (Address in the database Log RBA or SCN) is stored in a file on the server
hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens)
is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file,
to force the next extraction to restart from any of these points. If you are using the ODBC interface for
PowerExchange, this is the best solution to implement.
If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is
a good approach to recovering back to a previous extraction. You simply choose the recovery point from the list
and re-use it.
There are more likely scenarios, though. If you are running realtime extractions, potentially never-ending or
running until there is a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may
already have processed and committed many changes. You can't afford to miss any changes and you
don't want to reapply the same changes you've just processed, but the previous restart token does not
correspond to the reality of what you've processed.
If you are using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery
problem lies with PowerCenter, which has historically been able to deal with restarting this type of process
through Guaranteed Message Delivery (GMD). This functionality is applicable to both realtime and change CDC options.
The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run
for each Application Id in files on the PowerCenter Server. The directory and file name are required
parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly
simplifies recovery procedures compared to using the ODBC interface to PowerExchange.
To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration
tab in the session properties. During normal session execution, PowerCenter Server stores recovery
information in cache files in the directory specified for $PMCacheDir.
Normal CDC Execution
If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the
GMD cache is purged.
If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token
corresponding to the point in time of the last of the unprocessed changes. This information is useful for
recovery.
Recovery
If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode
either from the PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this
assumes that you are able to identify that the session failed previously.
1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges
the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends cleanly.
The CDC session is now ready for you to execute in normal mode again.
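For reference, a recovery restart issued from the command line might look like the following. This is a sketch only: the domain, service, folder, and workflow names are hypothetical, and the exact pmcmd verbs and options should be confirmed against the Command Line Reference for your PowerCenter release.
pmcmd recoverworkflow -sv INT_SVCS_PROD -d Domain_Prod -u Operator -p ****** -f CDC_FOLDER wf_cdc_orders
Once the recovery run ends cleanly, the same workflow can be started again in normal mode with pmcmd startworkflow using the same connection options.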
Recovery Using PWX ODBC Interface
You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you
have to build in some things yourself: coping with processing all the changes from the last restart token,
even if you've already processed some of them.
INFORMATICA CONFIDENTIAL BEST PRACTICE 445 of 702
When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange
restart token. Your session has to cope with processing some of the same changes you already
processed at the start of the failed execution, either by using lookups/joins to the target to see if you've
already applied the change you are processing, or by simply ignoring database error messages such as
those produced by trying to delete a record you already deleted.
If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction
and save the results, you can use the generated restart token to force a recovery at a more recent point in
time than the last session-end restart token. This is especially useful if you are running realtime extractions
using ODBC; otherwise you may find yourself re-processing several days of changes you've already
processed.
Finally, you can always re-initialize the target and the CDC processing:
Take an image copy of the tablespace containing the table to be captured, with QUIESCE option.
Monitor the EDMMSG output from the PowerExchange Logger job.
Look for message DTLEDM172774I which identifies the PowerExchange Logger sequence number
corresponding to the QUIESCE event.
The Logger output shows the detail in the following format:
DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185
EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000
Sequence number . . . . . . . . . : 000000084E0000000000
Edition number . . . . . . . . . : B93C4F9C2A79B000
Source EDMNAME(s) . . . . . . . . : DB2DSN1CAPTNAME1
Take note of the log sequence number.
Repeat for all tables that form part of the same PowerExchange Application.
Run the DTLUAPPL utility specifying the application name and the registration name for each table
in the application.
Alter the SYSIN as follows:
MOD APPL REGDEMO DSN1 (where REGDEMO is Registration name from Navigator)
add RSTTKN CAPDEMO (where CAPDEMO is Capture name from Navigator)
SEQUENCE 000000084E0000000000000000084E0000000000
RESTART D5D3D3D34040000000084E0000000000
END APPL REGDEMO (where REGDEMO is Registration name from Navigator)
Note how the sequence number is a repeated string from the sequence number found in the
Logger messages after the Copy/Quiesce.
Note that the Restart parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the
same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at
which the QUIESCE was done above.
The image copy obtained above can be used for the initial materialization of the target tables.
PowerExchange Tasks: MVS Start and Stop Command Summary
Listener
Start command: /S DTLLST
Stop commands: /F DTLLST,CLOSE (preferred method); /F DTLLST,CLOSE,FORCE (if CLOSE does not work); /P DTLLST (if FORCE does not work); /C DTLLST (if STOP does not work)
Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture.

Agent
Start command: /S DTLA
Stop command: /DTLA shutdown
Notes: /DTLA DRAIN and SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.
Description: The PowerExchange Agent is used to manage connections to the PowerExchange Logger and to handle repository and other tasks. This must be started before the Logger.

Logger
Start command: /S DTLL
Stop commands: /P DTLL or /F DTLL,STOP
Notes: If you are installing, you need to run setup2 here prior to starting the Logger. /F DTLL,DISPLAY displays Logger information.
Description: The PowerExchange Logger is used to manage the linear datasets and hiperspace that hold change capture data.

ECCR (DB2)
Start command: /S DTLDB2EC
Stop commands: /F DTLDB2EC,STOP or /F DTLDB2EC,QUIESCE or /P DTLDB2EC
Notes: The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. /F DTLDB2EC,DISPLAY publishes stats into the ECCR sysout.
Description: There must be registrations present prior to bringing up most adaptor ECCRs.

Condense
Start command: /S DTLC
Stop command: /F DTLC,SHUTDOWN
Description: The PowerExchange Condenser is used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Apply
Start command: Submit JCL or /S DTLAPP
Stop commands: (1) F <Listener job>,D A to identify all tasks running through a certain listener; (2) F DTLLST,STOPTASK name to stop the Apply, where name = DBN2 (apply name). If the CAPX access and apply is running locally, not through a listener, then issue the following command: <Listener job>,CLOSE.
Description: The PowerExchange Apply process is used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.
Notes:
1. /P is an MVS STOP command; /F is an MVS MODIFY command.
2. Remove the / if the command is issued from the console rather than from SDSF.
If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active
ECCRs and that the Logger will come down AFTER the ECCRs go away. Instead, shut the Listener and
the ECCR(s) down at the same time.
The Listener:
1. F <Listener_job>,CLOSE
2. If this isn't coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn't coming down fast enough, issue C <Listener_job>
Note that these commands are listed in order from the most to the least desirable method for bringing the
Listener down.
The DB2 ECCR:
1. F <DB2 ECCR>,QUIESCE - this waits for all OPEN UOWs to finish, which can be a while if
a long-running batch job is running.
2. F <DB2 ECCR>,STOP - this terminates immediately.
3. P <DB2 ECCR> - this also terminates immediately.
Once the ECCR(s) are down, you can then bring the Logger down.
The Logger: P <Logger job_name>
The Agent: CMDPREFIX SHUTDOWN
If you know that you are headed for an IPL, you can issue all these commands at the same time. The
Listener and ECCR(s) should start coming down; if you are looking for speed, issue F <Listener_job>,CLOSE
FORCE to shut down the Listener, then issue F <DB2 ECCR>,STOP to terminate the DB2 ECCR, then shut
down the Logger and the Agent.
Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new
file/DB2 table/IMS database is being updated during this shutdown process and the Agent is not available,
the call to see if the source is registered returns a "Not being captured" answer. The update, therefore,
occurs without you capturing it, leaving your target in a broken state (which you won't know about until it is
too late!).
Sizing the Logger
When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size
requirements. The information in this section can help to determine if you need to increase the size of the
data sets, and if you should allocate additional log data sets. When you define your active log data sets,
consider your systems capacity and your changed data requirements, including archiving and performance
issues.
After the PWX Logger is active, you can change the log data set configuration as necessary. In general,
remember that you must balance the following variables:
Data set size
Number of data sets
Amount of archiving
The choices you make depend on the following factors:
Resource availability requirements
Performance requirements
Whether you are running near-realtime or batch replication
Data recovery requirements
An inverse relationship exists between the size of the log data sets and the frequency of archiving
required. Larger data sets need to be archived less often than smaller data sets.
Note: Although smaller data sets require more frequent archiving, the archiving process requires less time.
Use the following formulas to estimate the total space you need for each active log data set. For an example
of the calculated data set size, refer to the PowerExchange Reference Guide.
active log data set size in bytes = (average size of captured change record * number of changes
captured per hour * desired number of hours between archives) * (1 + overhead rate)
active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder
When determining the average size of your captured change records, note the following information:
PWX Change Capture captures the full object that is changed. For example, if one field in an IMS
segment has changed, the product captures the entire segment.
The PWX header adds overhead to the size of the change record. Per record, the overhead is
approximately 300 bytes plus the key length.
The type of change transaction affects whether PWX Change Capture includes a before-image,
after-image, or both:
o DELETE includes a before-image.
o INSERT includes an after-image.
o UPDATE includes both.
Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:
Overhead for control information
Overhead for writing recovery-related information, such as system checkpoints.
You have some control over the frequency of system checkpoints when you define your PWX Logger
parameters. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information
about this parameter.
DASD Capacity Conversion Table
Space Information Model 3390 Model 3380
usable bytes per track 49,152 40,960
tracks per cylinder 15 15
This example is based on the following assumptions:
estimated average size of a changed record = 600 bytes
estimated rate of captured changes = 40,000 changes per hour
desired number of hours between archives = 12
overhead rate = 5 percent
DASD model = 3390
The estimated size of each active log data set in bytes is calculated as follows:
600 * 40,000 * 12 * 1.05 = 302,400,000
The number of cylinders to allocate is calculated as follows:
302,400,000 / 49,152 = approximately 6152 tracks
6152 / 15 = approximately 410 cylinders
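The same arithmetic can be scripted so it is easy to rerun with site-specific estimates. The following Python sketch is illustrative only (it is not part of the PowerExchange tooling); the inputs are the example assumptions above.

def active_log_size(avg_record_bytes, changes_per_hour, hours_between_archives,
                    overhead_rate, usable_bytes_per_track, tracks_per_cylinder):
    # Apply the sizing formula: raw change volume plus overhead for control
    # and recovery-related information such as system checkpoints.
    size_bytes = (avg_record_bytes * changes_per_hour * hours_between_archives) * (1 + overhead_rate)
    tracks = size_bytes / usable_bytes_per_track
    cylinders = tracks / tracks_per_cylinder
    return size_bytes, tracks, cylinders

# Example assumptions: 600-byte records, 40,000 changes per hour, 12 hours between
# archives, 5 percent overhead, model 3390 DASD (49,152 bytes/track, 15 tracks/cylinder).
size_bytes, tracks, cylinders = active_log_size(600, 40000, 12, 0.05, 49152, 15)
print(size_bytes, tracks, cylinders)   # approximately 302,400,000 bytes, 6152 tracks, 410 cylinders

In practice, round the cylinder figure up to a whole number before coding the allocation.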
The following example shows an IDCAMS DEFINE statement that uses the above calculations:
DEFINE CLUSTER -
  (NAME (HLQ.EDML.PRILOG.DS01) -
  LINEAR -
  VOLUMES(volser) -
  SHAREOPTIONS(2,3) -
  CYL(410) ) -
  DATA -
  (NAME(HLQ.EDML.PRILOG.DS01.DATA) )
The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
Additional Logger Tips
The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the
Logger does not use secondary allocation. This includes Candidate Volumes and Space, such as that
allocated by SMS when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should
be defined through IDCAMS with:
No secondary allocation.
A single VOLSER in the VOLUME parameter.
An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.
PowerExchange Agent Commands
You can use commands from the MVS system to control certain aspects of PowerExchange Agent
processing. To issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix
(as specified by CmdPrefix in your configuration parameters), followed by the command. For example, if
CmdPrefix=AG01, issue the following command to close the Agent's message log:
AG01 LOGCLOSE
The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in
the agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any
PowerExchange Agent commands that you issue. If the PowerExchange Agent has not been started during
the current IPL, or if you issue the command with the wrong prefix, MVS generates the following message:
IEE305I command COMMAND INVALID
See PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.
PowerExchange Logger Commands
The PowerExchange Logger uses two types of commands: interactive and batch
You run interactive commands from the MVS console when the PowerExchange logger is running. You can
use PowerExchange Logger interactive commands to:
Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer
connections.
Resolve in-doubt UOWs.
Stop a PowerExchange Logger.
Print the contents of the PowerExchange active log file (in hexadecimal format).
You use batch commands primarily in batch change utility jobs to make changes to parameters and
configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands
to:
Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange
Logger names, archive log options, buffer options, and mode (single or dual).
Add log definitions to the restart data set.
Delete data set records from the restart data set.
Display log data sets, UOWs, and reader/writer connections.
INFORMATICA CONFIDENTIAL BEST PRACTICE 451 of 702
See PowerExchange Reference Guide (8.1.1) for detailed information on Logger Commands (Chapter 4,
Page 59)

Last updated: 01-Feb-07 18:53

INFORMATICA CONFIDENTIAL BEST PRACTICE 452 of 702
Data Integration Load Traceability
Challenge
Load management is one of the major difficulties facing a data integration or data warehouse operations
team. This Best Practice tries to answer the following questions:
How can the team keep track of what has been loaded?
What order should the data be loaded in?
What happens when there is a load failure?
How can bad data be removed and replaced?
How can the source of data be identified?
When it was loaded?
Description
Load management provides an architecture to allow all of the above questions to be answered with minimal
operational effort.
Benefits of a Load Management Architecture
Data Lineage
The term Data Lineage is used to describe the ability to track data from its final resting place in the target
back to its original source. This requires the tagging of every row of data in the target with an ID from the
load management metadata model. This serves as a direct link between the actual data in the target and the
original source data.
To give an example of the usefulness of this ID, a data warehouse or integration competency center
operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back
to see when it was loaded, where it came from, any other metadata about the set it was loaded with,
validation check results, number of other rows loaded at the same time, and so forth.
It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time.
This can be useful when a data issue is detected in one row and the operations team needs to see if the
same error exists in all of the other rows. More than this, it is the ability to easily identify the source data for
a specific row in the target, enabling the operations team to quickly identify where a data issue may lie.
It is often assumed that data issues are produced by the transformation processes executed as part of the
target schema load. Using the source ID to link back the source data makes it easy to identify whether the
issues were in the source data when it was first encountered by the target schema load processes or if
those load processes caused the issue. This ability can save a huge amount of time, expense, and
frustration -- particularly in the initial launch of any new subject areas.
Process Lineage
Tracking the order that data was actually processed in is often the key to resolving processing and data
issues. Because choices are often made during the processing of data based on business rules and logic,
the order and path of processing differs from one run to the next. Only by actually tracking these processes
as they act upon the data can issue resolution be simplified.
INFORMATICA CONFIDENTIAL BEST PRACTICE 453 of 702
Process Dependency Management
Having a metadata structure in place provides an environment to facilitate the application and maintenance
of business dependency rules. Once a structure is in place that identifies every process, it becomes very
simple to add the necessary metadata and validation processes required to ensure enforcement of the
dependencies among processes. Such enforcement resolves many of the scheduling issues that operations
teams typically faces.
Process dependency metadata needs to exist because it is often not possible to rely on the source systems
to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple
systems and must be loaded into the target schema in a specific order. This is usually difficult to manage
because the various source systems have no way of coordinating the release of data to the target schema.
Robustness
Using load management metadata to control the loading process also offers two other big advantages, both
of which fall under the heading of robustness because they allow for a degree of resilience to load failure.
Load Ordering
Load ordering is a set of processes that use the load management metadata to identify the order in which
the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence
it arrives, or as complex as having a pre-defined load sequence planned in the metadata.
There are a number of techniques used to manage these processes. The most common is an automated
process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list
after the load is complete. This process can use embedded data in file names or can read header records to
identify the correct ordering of the data. Alternatively the correct order can be pre-defined in the load
management metadata using load calendars.
Either way, load ordering should be employed in any data integration or data warehousing implementation
because it allows the load process to be automatically paused when there is a load failure, and ensures that
the data that has been put on hold is loaded in the correct order as soon as possible after a failure.
The essential part of the load management process is that it operates without human intervention, helping to
make the system self healing!
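The file-based load-ordering technique described above can be sketched in a few lines of Python. This is a simplified illustration only: the directory names are placeholders, and it assumes inbound files end in .dat and carry the load order in their names (for example, a YYYYMMDD stamp), which is a site convention rather than a PowerCenter requirement.

import os
import shutil

def build_load_list(inbound_dir, list_file):
    """Write a PowerCenter file list; a lexical sort gives chronological order
    when file names carry a YYYYMMDD stamp."""
    files = sorted(f for f in os.listdir(inbound_dir) if f.endswith('.dat'))
    with open(list_file, 'w') as out:
        for name in files:
            out.write(os.path.join(inbound_dir, name) + '\n')
    return files

def archive_loaded_files(inbound_dir, archive_dir, files):
    # Run only after the load session completes successfully, so a failed load
    # leaves the files in place and the process pauses automatically.
    for name in files:
        shutil.move(os.path.join(inbound_dir, name), os.path.join(archive_dir, name))

The list produced by build_load_list is typically referenced as an indirect source file in the session, and archive_loaded_files is typically called from a post-session success command.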
Rollback
If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove
all of the data loaded as one set. Load management metadata allows the operations team to selectively roll
back a specific set of source data, the data processed by a specific process, or a combination of both. This
can be done using manual intervention or by a developed automated feature.
INFORMATICA CONFIDENTIAL BEST PRACTICE 454 of 702
Simple Load Management Metadata Model


As you can see from the simple load management metadata model above, there are two sets of data linked
to every transaction in the target tables. These represent the two major types of load management
metadata:
Source tracking
Process tracking

Source Tracking
Source tracking looks at how the target schema validates and controls the loading of source data. The aim is
to automate as much of the load processing as possible and track every load from the source through to the
target schema.
Source Definitions
Most data integration projects use batch load operations for the majority of data loading. The sources for
these come in a variety of forms, including flat file formats (ASCII, XML etc), relational databases, ERP
systems, and legacy mainframe systems.
The first control point for the target schema is to maintain a definition of how each source is structured, as
well as other validation parameters.
These definitions should be held in a Source Master table like the one shown in the data model above.
These definitions can and should be used to validate that the structure of the source data has not changed.
A great example of this practice is the use of DTD files in the validation of XML feeds.
INFORMATICA CONFIDENTIAL BEST PRACTICE 455 of 702
In the case of flat files, it is usual to hold details like:
Header information (if any)
How many columns
Data types for each column
Expected number of rows
For RDBMS sources, the Source Master record might hold the definition of the source tables or store the
structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses).
These definitions can be used to manage and understand the initial validation of the source data structures.
Quite simply, if the system is validating the source against a definition, there is an inherent control point at
which problem notifications and recovery processes can be implemented. It's better to catch a bad data
structure than to start loading bad data.
Source Instances
A Source Instance table (as shown in the load management metadata model) is designed to hold one record
for each separate set of data of a specific source type being loaded. It should have a direct key link back to
the Source Master table which defines its type.
The various source types may need slightly different source instance metadata to enable optimal control
over each individual load.
Unlike the source definitions, this metadata will change every time a new extract and load is performed. In
the case of flat files, this would be a new file name and possibly date / time information from its header
record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used
for each specific extract, and the date and time it was executed.
This metadata needs to be stored in the source tracking tables so that the operations team can identify a
specific set of source data if the need arises. This need may arise if the data needs to be removed and
reloaded after an error has been spotted in the target schema.
Process Tracking
Process tracking describes the use of load management metadata to track and control the loading
processes rather than the specific data sets themselves. There can often be many load processes acting
upon a single source instance set of data.
While it is not always necessary to be able to identify when each individual process completes, it is very
beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all
sessions are tracked this way because, in most cases, the individual processes are simply storing data into
temporary tables that will be flushed at a later date. Since load management process IDs are intended to
track back from a record in the target schema to the process used to load it, it only makes sense to generate
a new process ID if the data is being stored permanently in one of the major staging areas.
Process Definition
Process definition metadata is held in the Process Master table (as shown in the load management
metadata model ). This, in its basic form, holds a description of the process and its overall status. It can also
be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as
processing holidays.
INFORMATICA CONFIDENTIAL BEST PRACTICE 456 of 702
Process Instances
A process instance is represented by an individual row in the load management metadata Process Instance
table. This represents each instance of a load process that is actually run. This holds metadata about when
the process started and stopped, as well as its current status. Most importantly, this table allocates a unique
ID to each instance.
The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then
stored with each row of data in the target table.
Integrating Source and Process Tracking
Integrating source and process tracking can produce an extremely powerful investigative and control tool for
the administrators of data warehouses and integrated schemas. This is achieved by simply linking every
process ID with the source instance ID of the source it is processing. This requires that a write-back facility
be built into every process to update its process instance record with the ID of the source instance being
processed.
The effect is that there is a one to one/many relationship between the source instance table and the process
instance table containing several rows for each set of source data loaded into a target schema. For
example, in a data warehousing project, a row for loading the extract into a staging area, a row for the move
from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.
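As a concrete illustration of the model, the four tables and their relationships can be sketched as follows. The class and column names here are hypothetical; Velocity does not prescribe a physical design, only the relationships described above.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class SourceMaster:           # one row per source definition (file layout, SQL template, etc.)
    source_master_id: int
    description: str

@dataclass
class SourceInstance:         # one row per set of source data actually loaded
    source_instance_id: int
    source_master_id: int     # link to SourceMaster
    file_name_or_criteria: str
    extracted_at: datetime

@dataclass
class ProcessMaster:          # one row per defined load process
    process_master_id: int
    description: str

@dataclass
class ProcessInstance:        # one row per execution of a load process
    process_instance_id: int  # this ID tags every row written to the target
    process_master_id: int    # link to ProcessMaster
    source_instance_id: int   # written back by the process, linking data to its source
    started_at: datetime
    ended_at: datetime
    status: str

Each target row carries a process instance ID, which links through the process instance record to both the process definition and the originating source instance, supporting the data lineage and rollback capabilities described above.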
Integrated Load Management Flow Diagram

INFORMATICA CONFIDENTIAL BEST PRACTICE 457 of 702
Tracking Transactions
This is the simplest data to track since it is loaded incrementally and not updated. This means that the
process and source tracking discussed earlier in this document can be applied as is.
Tracking Reference Data
This task is complicated by the fact that reference data, by its nature, is not static. This means that if you
simply update the data in a row any time there is a change, there is no way that the change can be backed
out using the load management practice described earlier. Instead, Informatica recommends always using
slowly changing dimension processing on every reference data and dimension table to accomplish source
and process tracking. Updating the reference data as a slowly changing table retains the previous versions
of updated records, thus allowing any changes to be backed out.
Tracking Aggregations
Aggregation also causes additional complexity for load management because the resulting aggregate row
very often contains the aggregation across many source data sets. As with reference data, this means that
the aggregated row cannot be backed out in the same way as transactions.
This problem is managed by treating the source of the aggregate as if it was an original source. This means
that rather than trying to track the original source, the load management metadata only tracks back to the
transactions in the target that have been aggregated. So, the mechanism is the same as used for
transactions but the resulting load management metadata only tracks back from the aggregate to the fact
table in the target schema.

Last updated: 01-Feb-07 18:54

INFORMATICA CONFIDENTIAL BEST PRACTICE 458 of 702
High Availability
Challenge
An increasing number of customers find their Data Integration implementation must be available
24x7 without interruption or failure. This Best Practice describes the High Availability (HA)
capabilities incorporated in PowerCenter and explains why it is critical to address both architectural
(i.e., systems, hardware, firmware) and procedural (i.e., application design, code implementation,
session/workflow features) recovery with HA.
Description
When considering HA recovery, be sure to explore the following two components of HA that exist
on all enterprise systems:
External Resilience
External resilience has to do with the integration and specification of domain name servers,
database servers, FTP servers, network access servers in a defined, tested 24x7
configuration. The nature of Informaticas DI setup places it at many interface points in system
integration. Before placing and configuring PowerCenter within an infrastructure that has an HA
expectation, the following questions should be answered:
G Is the pre-existing set of servers already in a sustained HA configuration? Is there a
schematic with applicable settings to use for reference? If so, is there a unit test or system
test to exercise before installing PowerCenter products? It is important to remember that
the external systems must be HA before the PowerCenter architecture they support can
be.
G What are the bottlenecks or perceived failure points of the existing system? Are these
bottlenecks likely to be exposed or heightened by placing PowerCenter in the
infrastructure? (e.g., five times the amount of Oracle traffic, ten times the amount of DB2
traffic, a UNIX server that always shows 10% idle may now have twice as many processes
running).
G Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for
Windows) has been implemented with success at a customer site, this sets a different
expectation. The customer may merely want the grid capability of multiple PowerCenter
nodes to splay/recover Informatica tasks, and expect their back-end system (such as
those listed above) to provide file system or server bootstrap recovery upon a fundamental
failure of those back-end systems. If these back-end systems have a script/command
capability to, for example, restart a repository service, PowerCenter can be installed in this
fashion. However, PowerCenter's HA capability extends as far as the PowerCenter
components.
Internal Resilience
INFORMATICA CONFIDENTIAL BEST PRACTICE 459 of 702
In an HA PowerCenter environment key elements to keep in mind are:
G Rapid and constant connectivity to the repository metadata.
G Rapid and constant network connectivity between all gateway and worker nodes in the
PowerCenter domain.
G A common highly-available storage system accessible to all PowerCenter domain nodes
with one service name and one file protocol. Only domain nodes on the same operating
system can share gateway and log files (see Admin Console->Domain->Properties->Log
and Gateway Configuration).
Internal resilience occurs within the PowerCenter environment among PowerCenter services, the
PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal
resilience can be configured at the following levels:
G Domain. Configure service connection resilience at the domain level in the general
properties for the domain. The domain resilience timeout determines how long services
attempt to connect as clients to application services or the Service Manager. The domain
resilience properties are the default values for all services in the domain.
G Service. It is possible to configure service connection resilience in the advanced
properties for an application service. When configuring connection resilience for an
application service, this overrides the resilience values from the domain settings.
G Gateway. The master gateway node maintains a connection to the domain configuration
database. If the domain configuration database becomes unavailable, the master gateway
node tries to reconnect. The resilience timeout period depends on user activity and
whether the domain has one or multiple gateway nodes:
H Single gateway node. If the domain has one gateway node, the gateway node tries
to reconnect until a user or service tries to perform a domain operation. When a user
tries to perform a domain operation, the master gateway node shuts down.
H Multiple gateway nodes. If the domain has multiple gateway nodes and the master
gateway node cannot reconnect, then the master gateway node shuts down. If a user
tries to perform a domain operation while the master gateway node is trying to
connect, the master gateway node shuts down. If another gateway node is available,
the domain elects a new master gateway node. The domain tries to connect to the
domain configuration database with each gateway node. If none of the gateway
nodes can connect, the domain shuts down and all domain operations fail.
Process
Be aware that your implementation has a dependency on the installation environment. For
example, you may want to combine multiple disparate ETL repositories onto a single upgraded
PowerCenter platform. This has the benefit of:
G Single point of access/administration from the Admin Console.
G A group of repositories that now can become a repository domain.
G A group of repositories that can be shaped into common processing/backup/schedule
patterns for optimal performance and administration.
HA items of concern are now:
G Single point of failure of one PowerCenter domain.
G One repository, possibly heavy in processing or poorly designed, degrading that
entire PowerCenter domain.
Common Elements of Concern in an HA Configuration
Restart and Failover
Restart and failover apply to the Domain Services (Integration and Repository). Obviously, if
these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP, etc.), and
artifacts of your ETL cannot be highly available.
If a service process becomes unavailable, the Service Manager can restart the process or fail it
over to a backup node based on the availability of the node. When a service process restarts or
fails over, the service restores the state of operation and begins recovery from the point of
interruption.
You can configure backup nodes for services if you have the high availability option. If you
configure an application service to run on primary and backup nodes, one service process can run
at a time. The following situations describe restart and failover for an application service:
G If the primary node running the service process becomes unavailable, the service fails
over to a backup node. The primary node may be unavailable if it shuts down or if the
connection to the node becomes unavailable.
G If the primary node running the service process is available, the domain tries to restart the
process based on the restart options configured in the domain properties. If the process
does not restart, the Service Manager can mark the process as failed. The service then
fails over to a backup node and starts another process. If the Service Manager marks the
process as failed, the administrator must enable the process after addressing any
configuration problem.
If a service process fails over to a backup node, it does not fail back to the primary node when the
node becomes available. You can disable the service process on the backup node to cause it to
fail back to the primary node.
Recovery
Recovery is the completion of operations after an interrupted service is restored. When a service
recovers, it restores the state of operation and continues processing the job from the point of
interruption.
INFORMATICA CONFIDENTIAL BEST PRACTICE 461 of 702
The state of operation for a service contains information about the service process. The
PowerCenter services include the following states of operation:
G Service Manager. The Service Manager for each node in the domain maintains the state
of service processes running on that node. If the master gateway shuts down, the newly
elected master gateway collects the state information from each node to restore the state
of the domain.
G Repository Service. The Repository Service maintains the state of operation in the
repository. This includes information about repository locks, requests in progress, and
connected clients.
G Integration Service. The Integration Service maintains the state of operation in the
shared storage configured for the service. This includes information about scheduled,
running, and completed tasks for the service. The Integration Service maintains session
and workflow state of operation based on the recovery strategy you configure for the
session and workflow.
When designing a system that has HA recovery as a core component, be sure to include
architectural and procedural recovery.
Architectural recovery for a PowerCenter domain involves the three bulleted items above
restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository
Service recover but the Integration Service cannot, the restart is not successful and has
little value to a production environment.
Field experience with PowerCenter has yielded these key items in planning a proper recovery upon
a systemic failure:
G A PowerCenter domain cannot be established without at least one gateway node
running. Even if you have established a domain with ten worker nodes and one gateway
node, none of the worker nodes can run ETL jobs without a gateway node managing the
domain.
G An Integration Service cannot run without its associated Repository Service being started
and connected to its metadata repository.
G A Repository Service cannot run without its metadata repository DBMS being started and
accepting database connections. Often database connections are established on periodic
windows that expire, which puts the repository offline.
G If the installed domain configuration is running from Authentication Module Configuration
and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter
repository access is lost. If the installation uses any additional authentication
outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.
Procedural recovery is supported with many features of PowerCenter 8. Consider the following
very simple mapping that might run in production for many ETL applications:
INFORMATICA CONFIDENTIAL BEST PRACTICE 462 of 702
Suppose there is a situation where the ftp server sending this ff_customer file is inconsistent. Many
times the file is not there, but the processes depending on this must always run. The process is
always insert only. You do not want the succession of ETL that follows this small process to fail -
they can run to customer_stg with current records only. This setting in the Workflow Manager,
Session, Properties would fit your need:
Since it is not critical that the ff_customer records run each time, record the failure but continue the
process.
Now say the situation has changed. Sessions are failing on a PowerCenter server due to target
database timeouts. A requirement is given that the session must recover from this:
Resuming from last checkpoint restarts the process from its prior commit, allowing no loss of ETL
work.
To finish this second case, consider three basic items on the workflow side when HA
is incorporated in your environment:
INFORMATICA CONFIDENTIAL BEST PRACTICE 464 of 702
An Integration Service in an HA environment can only recover those workflows marked with
Enable HA recovery. For all critical workflows, this should be considered.
For a mature set of ETL code running in QA or Production, you may consider the following
workflow property:
INFORMATICA CONFIDENTIAL BEST PRACTICE 465 of 702
This would automatically recover tasks from where they failed in a workflow upon an application or
system wide failure. Consider carefully the use of this feature, however. Remember, automated
restart of critical ETL processes without interaction can have vast unintended side effects. For
instance, if a database alias or synonym was dropped, all ETL targets may now refer to different
objects than the original intent. Only PowerCenter environments with HA, mature production
support practices, and a complete operations manual per Velocity, should expect complete
recovery with this feature.
In an HA environment, certain components of the Domain can go offline while the Domain stays
up to execute ETL jobs. This is a time to use the Suspend On Error feature from the General tab
of the Workflow settings. The backup Integration Service would then pick up this workflow and resume
processing based on the resume settings of this workflow:
INFORMATICA CONFIDENTIAL BEST PRACTICE 466 of 702
Features
A variety of HA features exist in PowerCenter 8. Specifically, they include:
G Integration Service HA option
G Integration Service Grid option
G Repository Service HA option
First, proceed from an assumption that nodes have been provided to you such that a basic HA
configuration of PowerCenter 8 can take place. A lab-tested version completed by Informatica is
configured as below with an HP solution. Your solution can be completed with any reliable
clustered file system. Your first step would always be implementing and thoroughly exercising a
clustered file system:
INFORMATICA CONFIDENTIAL BEST PRACTICE 467 of 702
Now, let's address the options in order:
Integration Service HA Option
You must have the HA option on the license key for this to be available on install. Note that once
the base PowerCenter 8 install is configured, all nodes are available from the Admin Console-
>Domain->Integration Services->Grid/Node Assignments. From the above example, you would see
Node 1, Node 2, Node 3 as dropdown options on that browse page. With the HA (Primary/Backup)
install complete, Integration Services are then displayed with both "P" and "B" in a configuration,
with the current operating node highlighted:
INFORMATICA CONFIDENTIAL BEST PRACTICE 468 of 702
If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would
poll the Domain: Domain_Corp_RD for another Gateway Node, then assign INT_SVCS_DEV over
to that Node, in this case Node_Corp_RD02. Then the "B" button would highlight, showing this
Node as providing INT_SVCS_DEV.
A vital component of configuring the Integration Service for HA is making sure the Integration
Service files are stored in a shared persistent environment. You must specify the paths for
Integration Service files for each Integration Service process. Examples of Integration Service files
include run-time files, state of operation files, and session log files.
Each Integration Service process uses run-time files to process workflows and sessions. If you
configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must
be stored in a shared location. Each node must have access to the run-time files used to process a
session or workflow. This includes files such as parameter files, cache files, input files, and output
files.
State of operation files must be accessible by all Integration Service processes. When you enable
an Integration Service, it creates files to store the state of operations for the service. The state of
operations includes information such as the active service requests, scheduled tasks, and
completed and running processes. If the service fails, the Integration Service can restore the state
and recover operations from the point of interruption.
All Integration Service processes associated with an Integration Service must use the same shared
location. However, each Integration Service can use a separate location.
By default, the installation program creates a set of Integration Service directories in the server
\infa_shared directory. You can set the shared location for these directories by configuring the
process variable $PMRootDir to point to the same location for each Integration Service
process. The key HA concern of this is $PMRootDir should be on the highly-available clustered file
system mentioned above.
Integration Service Grid Option
You must have the Server Grid option on the license key for this to be available on install. In
configuring the $PMRootDir files for the Integration Service, retain the methodology
described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration,
the log and directory paths should be on the clustered file system mentioned above. A grid must be
created before it can be used in a PowerCenter 8 domain. It is essential to remember that a grid
can only be created from machines running the same operating system.
Be sure to remember these key points:
G PowerCenter supports nodes from heterogeneous operating systems, bit modes, and
others to be used within the same domain. However, if there are heterogeneous nodes for a
grid, then you can only run a workflow on the Grid, not a session.
G A session on grid does not support heterogeneous operating systems. This is because a
session may have a sharing cache file and other objects that may not be compatible with
all of the operating systems. For session on a grid, you need a homogeneous grid.
In short, scenarios such as a production failure are the worst possible time to find out that a multi-
OS grid does not meet your needs.
If you have a large volume of disparate hardware, it is certainly possible to make perhaps two grids
centered on two different operating systems. In either case, the performance of your clustered file
system is going to affect the performance of your server grid, and should be considered as part of
your performance/maintenance strategy.
Repository Service HA Option
You must have the HA option on the license key for this to be available on install. There are two
ways to include the Repository Service HA capability when configuring PowerCenter 8:
G The first is during install. When the Install Program prompts for your nodes to do a
Repository install (after answering Yes to Create Repository), you can enter a second
node where the Install Program can create and invoke the PowerCenter service and
Repository Service for a backup repository node. Keep in mind that all of the database,
OS, and server preparation steps referred to in the PowerCenter Installation and
Configuration Guide still hold true for this backup node. When the install is complete, the
Repository Service displays a P/B link similar to that illustrated above for the
INT_SVCS_DEV example Integration Service.
G A second method for configuring Repository Service HA allows for measured, incremental
implementation of HA from a tested base configuration. After ensuring that your initial
Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and
the DBMS repository containing the metadata are running and stable, you can add a
second node and make it the Repository Backup. Install the PowerCenter Service on this
second server following the PowerCenter Installation and Configuration Guide. In
particular, skip creating Repository Content or an Integration Service on the
node. Following this, go to Admin Console->Domain and select:
Create->Node. The server to contain this node should be of the exact same configuration/
clustered file system/OS as the Primary Repository Service.
The following dialog should appear:
Assign a logical name to the node to describe its place, and select Create. The node should now
be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line
Reference with the infaservice and infacmd commands to ensure the node is running on the
domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and
the browser window displays:
INFORMATICA CONFIDENTIAL BEST PRACTICE 471 of 702
Click OK and the Repository Service is now configured in a Primary/Backup setup for the
domain. To ensure the P/B setting, test the following elements of the configuration:
1. Be certain the same version of the DBMS client is installed on the server and can access
the metadata.
2. Both nodes must be on the same clustered file system.
3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway
Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5
seconds).
4. Take the Primary Repository Service Node offline and validate that the polling, failover,
restart process takes place in a methodical, traceable manner for the Repository Service on
the Domain. This should be clearly visible from the node logs on the Primary and
Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin
Console->Repository->Logs.
Note: Remember that when a node is taken offline, you cannot access Admin Console from it.
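Steps 3 and 4 above can also be supported from the command line. The following is a sketch only: the domain and node names are taken from the earlier example, the Repository Service name is a placeholder, and the exact infaservice and infacmd options should be verified against the PowerCenter Command Line Reference for your release.

infaservice.sh startup
    (run on the backup node to start the PowerCenter service there)

infacmd ping -dn Domain_Corp_RD -sn REP_SVCS_DEV -nn Node_Corp_RD02
    (confirms the service process is reachable on the named node in the domain)

A scheduled script wrapping commands such as these can feed monitoring so that a failover is detected and reported without waiting for a user to notice.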


Last updated: 09-Feb-07 15:34
INFORMATICA CONFIDENTIAL BEST PRACTICE 472 of 702
Load Validation
Challenge
Knowing that all data for the current load cycle has loaded correctly is essential for effective data
warehouse management. However, the need for load validation varies depending on the extent of
error checking, data validation, and data cleansing functionalities inherent in your mappings. For
large data integration projects with thousands of mappings, the task of reporting load statuses
becomes overwhelming without a well-planned load validation process.
Description
Methods for validating the load process range from simple to complex. Use the following steps to
plan a load validation process:
1. Determine what information you need for load validation (e.g., work flow names, session
names, session start times, session completion times, successful rows and failed rows).
2. Determine the source of the information. All of this information is stored as metadata in the
PowerCenter repository, but you must have a means of extracting it.
3. Determine how you want the information presented to you. Should the information be
delivered in a report? Do you want it emailed to you? Do you want it available in a
relational table so that history is easily preserved? Do you want it stored as a flat file?
Weigh all of these factors to find the correct solution for your project.
Below are descriptions of five possible load validation solutions, ranging from fairly simple to
increasingly complex:
1. Post-session Emails on Success or Failure
One practical application of the post-session email functionality is the situation in which a key
business user waits for completion of a session to run a report. Email is configured to notify the
user when the session was successful so that the report can be run. Another practical application
is the situation in which a production support analyst needs to be notified immediately of any
failures. Configure the session to send an email to the analyst upon failure. For round-the-clock
support, a pager number that has the ability to receive email can be used in place of an email
address.
Post-session email is configured in the session, under the General tab and Session Commands.
A number of variables are available to simplify the text of the email:
- %s Session name
- %e Session status
- %b Session start time
- %c Session completion time
- %i Session elapsed time
- %l Total records loaded
- %r Total records rejected
- %t Target table details
- %m Name of the mapping used in the session
- %n Name of the folder containing the session
- %d Name of the repository containing the session
- %g Attach the session log to the message
- %a <file path> Attach a file to the message
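For example, the body text of a success notification might be built from these variables (the wording below is illustrative; substitute whatever message suits your operations team):

Session %s completed with status %e.
Start time: %b   Completion time: %c   Elapsed time: %i
Records loaded: %l   Records rejected: %r
%g

At run time, the Integration Service substitutes the actual session name, status, times, and row counts, and %g attaches the session log to the message.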

2. Other Workflow Manager Features
In addition to post session email messages, there are other features available in the Workflow
Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the
features that can be used to place multiple controls on the behavior of loads. Another solution is to
place conditions within links. Links are used to connect tasks within a workflow or worklet. Use the
pre-defined or user-defined variables in the link conditions. For example, upon the
successful completion of both sessions A and B, the PowerCenter Server executes session C, as sketched below.
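A minimal sketch of the link conditions for this pattern, assuming the tasks are named s_Session_A, s_Session_B, and s_Session_C (the names are illustrative):

Link from s_Session_A to s_Session_C:   $s_Session_A.Status = SUCCEEDED
Link from s_Session_B to s_Session_C:   $s_Session_B.Status = SUCCEEDED

Because tasks treat multiple input links as AND by default, s_Session_C starts only when both link conditions evaluate to true.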
3. PowerCenter Reports (PCR)
The PowerCenter Reports (PCR) is a web-based business intelligence (BI) tool that is included
with every PowerCenter license to provide visibility into metadata stored in the PowerCenter
repository in a manner that is easy to comprehend and distribute. The PCR includes more than
130 pre-packaged metadata reports and dashboards delivered through Data Analyzer,
Informatica's BI offering. These pre-packaged reports enable PowerCenter customers to extract
extensive business and technical metadata through easy-to-read reports including:
- Load statistics and operational metadata that enable load validation.
- Table dependencies and impact analysis that enable change management.
- PowerCenter object statistics to aid in development assistance.
- Historical load statistics that enable planning for growth.
In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you
can develop custom reports and dashboards under the PCR limited-use license, which allows you
to source reports from the PowerCenter repository. Examples of custom components that can be
created include:
- Repository-wide reports and/or dashboards with indicators of daily load success/failure.
- Customized project-based dashboards with visual indicators of daily load success/failure.
- Detailed daily load statistics reports for each project that can be exported to Microsoft Excel or PDF.
- Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.
A custom dashboard of this kind can give instant insight into load validation across an entire
repository through a handful of custom indicators.
4. Query Informatica Metadata Exchange (MX) Views
Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access
to the PowerCenter repository. The Repository Manager generates these views when you create or
upgrade a repository. Almost any query can be put together to retrieve metadata related to the load
execution from the repository. The MX view, REP_SESS_LOG, is a great place to start. This view
is likely to contain all the information you need. The following sample query shows how to extract
folder name, session name, session end time, successful rows, and session duration:
select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from   rep_sess_log a
where  session_timestamp = (select max(session_timestamp)
                            from   rep_sess_log
                            where  session_name = a.session_name)
order by subject_area, session_name
The output contains one row per session, showing the folder name, session name, timestamp of the
most recent run, the number of successful rows, and the run duration in seconds.
TIP
Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are
likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository
tables.
5. Mapping Approach
A more complex approach, and the most customizable, is to create a PowerCenter mapping to
populate a table or a flat file with desired information. You can do this by sourcing the MX view
REP_SESS_LOG and then performing lookups to other repository tables or views for additional
information.
A sample mapping of this kind selects data from REP_SESS_LOG and performs lookups to retrieve
the absolute minimum and maximum run times for each session. This enables you to compare the
current execution time with the minimum and maximum durations.
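The lookups can be based on an aggregate query against the same MX view. The following sketch mirrors the date arithmetic of the earlier query (the alias names are illustrative) and returns the historical minimum and maximum durations per session:

select session_name,
       min((session_timestamp - actual_start) * 24 * 60 * 60) as min_duration_secs,
       max((session_timestamp - actual_start) * 24 * 60 * 60) as max_duration_secs
from   rep_sess_log
group by session_name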
Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be
a source for a PCR report. However, you can use a business intelligence tool of your choice
instead.


Last updated: 09-Feb-07 15:47
Repository Administration
Challenge
Defining the role of the PowerCenter Administrator and understanding
the tasks required to properly manage the domain and repository.
Description
The PowerCenter Administrator has many responsibilities. In addition to regularly
backing up the domain and repository, truncating logs, and updating the database
statistics, he or she also typically performs the following functions:
- Determines metadata strategy
- Installs/configures client/server software
- Migrates development to test and production
- Maintains PowerCenter servers
- Upgrades software
- Administers security and folder organization
- Monitors and tunes environment
Note: The Administrator is also typically responsible for maintaining domain and
repository passwords; changing them on a regular basis and keeping a record of them
in a secure place.
Determine Metadata Strategy
The PowerCenter Administrator is responsible for developing the structure and
standard for metadata in the PowerCenter Repository. This includes developing
naming conventions for all objects in the repository, creating a folder organization, and
maintaining the repository. The Administrator is also responsible for modifying the
metadata strategies to suit changing business needs or to fit the needs of a particular
project. Such changes may include new folder names and/or a different security setup.
Install/Configure Client/Server Software
This responsibility includes installing and configuring the application servers in all
applicable environments (e.g., development, QA, production, etc.). The Administrator
must have a thorough understanding of the working environment, along with access to
resources such as a Windows 2000/2003 or UNIX Admin and a DBA.
The Administrator is also responsible for installing and configuring the client
tools. Although end users can generally install the client software, the configuration of
the client tool connections benefits from being consistent throughout the repository
environment. The Administrator, therefore, needs to enforce this consistency in order to
maintain an organized repository.
Migrate Development to Test and Production
When the time comes for content in the development environment to be moved to
the test and production environments, it is the responsibility of the Administrator to
schedule, track, and copy folder changes. Also, it is crucial to keep track of the
changes that have taken place. It is the role of the Administrator to track these
changes through a change control process. The Administrator should be the only
individual able to physically move folders from one environment to another.
If a versioned repository is used, the Administrator should set up labels and instruct the
developers on the labels that they must apply to their repository objects (i.e., reusable
transformations, mappings, workflows and sessions). This task also requires close
communication with project staff to review the status of items of work to ensure, for
example, that only tested or approved work is migrated.
Maintain PowerCenter Servers
The Administrator must also be able to understand and troubleshoot the server
environment. He or she should have a good understanding of PowerCenter's Service
Oriented Architecture and how the domain and application services interact with each
other. The Administrator should also understand what the Integration Service does
when a session is running and be able to identify those processes. Additionally, certain
mappings may produce files in addition to the standard session and workflow logs. The
Administrator should be familiar with these files and know how and where to maintain
them.
Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for
overseeing the installation and upgrade process.
Security and Folder Administration
Security administration consists of both the PowerCenter domain and repository. For
the domain, it involves creating, maintaining, and updating all domain users and their
associated rights and privileges to services and alerts. For the repository, it involves
creating, maintaining, and updating all users within the repository, including creating
and assigning groups based on new and changing projects and defining which folders
are to be shared, and at what level. Folder administration involves creating and
maintaining the security of all folders. The Administrator should be the only user with
privileges to edit folder properties.
Monitor and Tune Environment
Proactively monitoring the domain and user activity helps ensure a healthy functioning
PowerCenter environment. The Administrator should review user activity for the domain
to verify that the appropriate rights and privileges have been applied. Reviewing domain
activity also helps confirm correct CPU and license usage.
The Administrator should have sole responsibility for implementing performance
changes to the server environment. He or she should observe server performance
throughout development so as to identify any bottlenecks in the system. In the
production environment, the Repository Administrator should monitor the jobs and any
growth (e.g., increases in data volume or throughput time) and communicate such changes to the
appropriate staff in order to address bottlenecks, accommodate growth, and ensure that the
required data is loaded within the prescribed load window.



Last updated: 01-Feb-07 18:54
Third Party Scheduler
Challenge
Successfully integrate a third-party scheduler with PowerCenter. This Best Practice
describes various levels to integrate a third-party scheduler.
Description
Tasks such as getting server and session properties, session status, or starting or
stopping a workflow or a task can be performed either through the Workflow Monitor or
by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be
integrated with PowerCenter at any of several levels. The level of integration depends
on the complexity of the workflow/schedule and the skill sets of production support
personnel.
Many companies want to automate the scheduling process by using scripts or third-
party schedulers. In some cases, they are using a standard scheduler and want to
continue using it to drive the scheduling process.
A third-party scheduler can start or stop a workflow or task, obtain session statistics,
and get server details using the pmcmd commands. Pmcmd is a program used to
communicate with the PowerCenter server.
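For example, a scheduler job can start a workflow and wait for it to complete using the same connection options shown in the sample script later in this Best Practice (the host, port, credentials, folder, and workflow names below are illustrative placeholders; other pmcmd commands, such as those that return session statistics, take the same connection options and are documented in the Command Line Reference):

pmcmd startworkflow -s myhost:4001 -u Administrator -p mypassword -f MY_FOLDER -wait wf_daily_load
echo "pmcmd return code: $?"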
Third Party Scheduler Integration Levels
In general, there are three levels of integration between a third-party scheduler and
PowerCenter: Low, Medium, and High.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter
workflow. This process subsequently kicks off the rest of the tasks or sessions. The
PowerCenter scheduler handles all processes and dependencies after the third-party
scheduler has kicked off the initial workflow. In this level of integration, nearly all control
lies with the PowerCenter scheduler.
This type of integration is very simple to implement because the third-party scheduler
kicks off only one process. Often it is used simply to satisfy a corporate mandate for
a standard scheduler. This type of integration also takes advantage of the robust
functionality offered by the Workflow Monitor.
Low-level integration requires production support personnel to have a thorough
knowledge of PowerCenter. Because Production Support personnel in many
companies are only knowledgeable about the company's standard scheduler, one of
the main disadvantages of this level of integration is that if a batch fails at some point,
the Production Support personnel may not be able to determine the exact breakpoint.
Thus, the majority of the production support burden falls back on the Project
Development team.
Medium Level
With Medium-level integration, a third-party scheduler kicks off some, but not all,
workflows or tasks. Within the tasks, many sessions may be defined with
dependencies. PowerCenter controls the dependencies within the tasks.
With this level of integration, control is shared between PowerCenter and the third-party
scheduler, which requires more integration between the third-party scheduler and
PowerCenter. Medium-level integration requires Production Support personnel to have
a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not
have in-depth knowledge about the tool, they may be unable to fix problems that arise,
so the production support burden is shared between the Project Development team
and the Production Support team.
High Level
With High-level integration, the third-party scheduler has full control of scheduling and
kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible
for controlling all dependencies among the sessions. This type of integration is the
most complex to implement because there are many more interactions between the
third-party scheduler and PowerCenter.
Production Support personnel may have limited knowledge of PowerCenter but must
have thorough knowledge of the scheduling tool. Because Production Support
personnel in many companies are knowledgeable only about the company's standard
scheduler, one of the main advantages of this level of integration is that if the batch
fails at some point, the Production Support personnel are usually able to determine the
exact breakpoint. Thus, the production support burden lies with the Production Support
team.
Sample Scheduler Script
There are many independent scheduling tools on the market. The following is an
example of an AutoSys script that can be used to start tasks; it is included here simply
as an illustration of how a scheduler can be implemented in the PowerCenter
environment. This script can also capture the return codes, and abort on error,
returning a success or failure (with associated return codes to the command line or the
Autosys GUI monitor).
# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------
. jobstart $0 $*
# set variables
ERR_DIR=/tmp
# Temporary file will be created to store all the Error Information
# The file format is TDDHHMISS<PROCESS-ID>.lst
CurDayTime=`date +%d%H%M%S`
FName=T$CurDayTime$$.lst
if [ $STEP -le 1 ]
then
echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
cd /dbvol03/vendor/informatica/pmserver/
# Either start the entire workflow:
# pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
# or start a single task and wait for it to complete:
pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG \
  -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
# Edit the lines above to include the names of the folder, workflow, and task you are attempting to start.
# Checking whether to abort the Current Process or not
RetVal=$?
echo "Status = $RetVal"
if [ $RetVal -ge 1 ]
then
jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
exit 1
fi
echo "Step 1: Successful"
fi
jobend normal
exit 0


Last updated: 01-Feb-07 18:54
Updating Repository Statistics
Challenge
The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up
queries. Most databases use column distribution statistics to determine which index to use to optimize
performance. It can be important, especially in large or high-use repositories, to update these statistics
regularly to avoid performance degradation.
Description
For PowerCenter, statistics are updated during copy, backup or restore operations. In addition, the
PMREP command has an option to update statistics that can be scheduled as part of a regularly-run
script.
For PowerCenter 6 and earlier, specific strategies for Oracle, Sybase, SQL Server, DB2, and
Informix are discussed below. Each example shows how to extract the information from the PowerCenter
repository and incorporate it into a custom script.
Features in PowerCenter version 7 and later
Copy, Backup and Restore Repositories
PowerCenter automatically identifies and updates all statistics of all repository tables and indexes when a
repository is copied, backed-up, or restored. If you follow a strategy of regular repository back-ups, the
statistics will also be updated.
PMREP Command
PowerCenter also has a command line option to update statistics in the database. This allows this
command to be put in a Windows batch file or Unix Shell script to run. The format of the command is:
pmrep updatestatistics {-s filelistfile}
The -s option allows you to skip tables whose statistics you do not want to update.
Example of Automating the Process
One approach to automating this is to use a UNIX shell script that calls the pmrep command
updatestatistics, incorporate the script into a special workflow in PowerCenter, and run it on a scheduled
basis. Note: Workflow Manager supports command tasks as well as scheduling.
A command task in a dedicated maintenance workflow can call such a script, as sketched below.
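The following is a minimal shell sketch. It assumes a PowerCenter 8-style domain connection, and the repository, domain, user, and password values are illustrative placeholders; check the Command Line Reference for the exact pmrep connect options in your release.

#!/bin/sh
# Connect to the repository, then refresh the repository table and index statistics
pmrep connect -r Dev_Repository -d Dev_Domain -n Administrator -x AdminPassword
pmrep updatestatistics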

In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis.
This allows the statistics to be updated regularly so that performance does not degrade.
Tuning Strategies for PowerCenter version 6 and earlier
The following are strategies for generating scripts to update distribution statistics. Note that all
PowerCenter repository tables and index names begin with "OPB_" or "REP_".
Oracle
Run the following queries:
select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like
'OPB_%'
select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where
INDEX_NAME like 'OPB_%'
This will produce output like:
'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;
Save the output to a file. Then, edit the file and remove all the headers (i.e., the lines that look like
'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;').
Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like :
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and
add a 'go' at the end of the file.
Run this as a SQL script. This updates statistics for the repository tables.

Sybase
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the
end of the file.
Run this as a SQL script. This updates statistics for the repository tables.

Informix
Run the following query:
select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or
tabname like 'OPB_%';
This will produce output like :
(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;
Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks
like:
(constant) tabname (constant)
Run this as a SQL script. This updates statistics for the repository tables.

DB2
Run the following query :
select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;'
from sysstat.tables where tabname like 'OPB_%'
This will produce output like:
runstats on table PARTH.OPB_ANALYZE_DEP
and indexes all;
runstats on table PARTH.OPB_ATTR
and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT
and indexes all;
Save the output to a file.
Run this as a SQL script to update statistics for the repository tables.


Last updated: 12-Feb-07 15:29
Determining Bottlenecks
Challenge
Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully
consider the following five areas to determine where bottlenecks exist; using a process
of elimination, investigate each area in the order indicated:
1. Target
2. Source
3. Mapping
4. Session
5. System
Best Practice Considerations
Use Thread Statistics to Identify Target, Source, and Mapping
Bottlenecks
Use thread statistics to identify source, target or mapping (transformation) bottlenecks.
By default, an Integration Service uses one reader, one transformation, and one target
thread to process a session. Within each session log, the following thread statistics are
available:
- Run time: Amount of time the thread was running.
- Idle time: Amount of time the thread was idle due to other threads within the
application or Integration Service. This value does not include time the thread
is blocked due to the operating system.
- Busy: Percentage of the overall run time the thread is not idle. This
percentage is calculated using the following formula:
(run time - idle time) / run time x 100
By analyzing the thread statistics found in an Integration Service session log, it is
possible to determine which thread is being used the most.
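For example, a writer thread reporting a run time of 600 seconds and an idle time of 30 seconds is (600 - 30) / 600 x 100 = 95 percent busy, which points to the target as the likely bottleneck.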
If a transformation thread is 100 percent busy and there are additional resources (e.g.,
CPU cycles and memory) available on the Integration Service server, add a partition
point in the segment.
If the reader or writer thread is 100 percent busy, consider using string data types in source
or target ports, since non-string ports require more processing.
Use the Swap Method to Test Changes in Isolation
Attempt to isolate performance problems by running test sessions. You should be able
to compare the session's original performance with the tuned session's performance.
The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:
1. Make a temporary copy of the mapping, session and/or workflow that is to be
tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance
improvements to gauge which tuning methods work most effectively in the
environment.
3. Document the change made to the mapping, session and/or workflow and the
performance metrics achieved as a result of the change. The actual execution
time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of
performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.
Evaluating the Five Areas of Consideration
Target Bottlenecks
Relational Targets
The most common performance bottleneck occurs when the Integration Service writes
to a target database. This type of bottleneck can easily be identified with the following
procedure:
1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.
If session performance increases significantly when writing to a flat file, you have a
write bottleneck. Consider performing the following tasks to improve performance:
- Drop indexes and key constraints
- Increase checkpoint intervals
- Use bulk loading
- Use external loading
- Minimize deadlocks
- Increase database network packet size
- Optimize target databases
Flat file targets
If the session targets a flat file, you probably do not have a write bottleneck. If the
session is writing to a SAN or a non-local file system, performance may be slower than
writing to a local file system. If possible, a session can be optimized by writing to a flat
file target local to the Integration Service. If the local flat file is very large, you can
optimize the write process by dividing it among several physical drives.
If the SAN or non-local file system is significantly slower than the local file system, work
with the appropriate network/storage group to determine if there are configuration
issues within the SAN.
Source Bottlenecks
Relational sources
If the session reads from a relational source, you can use a filter transformation, a read
test mapping, or a database query to identify source bottlenecks.
Using a Filter Transformation.
Add a filter transformation in the mapping after each source qualifier. Set the filter
condition to false so that no data is processed past the filter transformation. If the time it
takes to run the new session remains about the same, then you have a source
bottleneck.
Using a Read Test Session.
You can create a read test mapping to identify source bottlenecks. A read test mapping
isolates the read query by removing any transformation logic from the mapping. Use
the following steps to create a read test mapping:
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.
Use the read test mapping in a test session. If the test session performance is similar to
the original session, you have a source bottleneck.
Using a Database Query
You can also identify source bottlenecks by executing a read query directly against the
source database. To do so, perform the following steps:
- Copy the read query directly from the session log.
- Run the query against the source database with a query tool such as SQL*Plus.
- Measure the query execution time and the time it takes for the query to return
the first row.
If there is a long delay between the two time measurements, you have a source
bottleneck.
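As a simple way to capture the elapsed time, SQL*Plus can time the pasted query; the table and filter below are placeholders for the read query copied from the session log:

SET TIMING ON
SELECT * FROM customer_stage WHERE load_date = TRUNC(SYSDATE);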
If your session reads from a relational source and is constrained by a source
bottleneck, review the following suggestions for improving performance:
- Optimize the query.
- Create tempdb as an in-memory database.
- Use conditional filters.
- Increase database network packet size.
- Connect to Oracle databases using the IPC protocol.
Flat file sources
If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the line sequential buffer length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may improve
performance when reading flat file sources. Also, ensure the flat file source is local to
the Integration Service.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a
mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping.
Begin by adding a Filter transformation in the mapping immediately before each target
definition. Set the filter condition to false so that no data is loaded into the target tables.
If the time it takes to run the new session is the same as the original session, you have
a mapping bottleneck. You can also use the performance details to identify mapping
bottlenecks: high Rowsinlookupcache and High Errorrows counters indicate mapping
bottlenecks.
Follow these steps to identify mapping bottlenecks:
Create a test mapping without transformations
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.
Check for High Rowsinlookupcache counters
Multiple lookups can slow the session. You may improve session performance by
locating the largest lookup tables and tuning those lookup expressions.
Check for High Errorrows counters
Transformation errors affect session performance. If a session has large numbers in
any of the Transformation_errorrows counters, you may improve performance by
eliminating the errors.
For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create
performance details by selecting Collect Performance Data in the session properties
before running the session.
View the performance details through the Workflow Monitor as the session runs, or
view the resulting file. The performance details provide counters about each source
qualifier, target definition, and individual transformation within the mapping to help you
understand session and mapping efficiency.
To view the performance details during the session run:
- Right-click the session in the Workflow Monitor.
- Choose Properties.
- Click the Properties tab in the details dialog box.
To view the resulting performance data file, look for the file session_name.perf in the
same directory as the session log and open the file in any text editor.
All transformations have basic counters that indicate the number of input rows, output
rows, and error rows. Source qualifiers, normalizers, and targets have additional
counters indicating the efficiency of data moving into and out of buffers. Some
transformations have counters specific to their functionality. When reading performance
details, the first column displays the transformation name as it appears in the mapping,
the second column contains the counter name, and the third column holds the resulting
number or efficiency percentage.
Low buffer input and buffer output counters
If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all
sources and targets, increasing the session DTM buffer pool size may improve
performance.
Aggregator, Rank, and Joiner readfromdisk and writetodisk counters
If a session contains Aggregator, Rank, or Joiner transformations, examine each
Transformation_readfromdisk and Transformation_writetodisk counter. If these
counters display any number other than zero, you can improve session performance by
increasing the index and data cache sizes.
If the session performs incremental aggregation, the Aggregator_readfromdisk and
writetodisk counters display a number other than zero because the Integration Service
reads historical aggregate data from the local disk during the session and writes to disk
when saving historical data. Evaluate the incremental Aggregator_readfromdisk and
writetodisk counters during the session. If the counters show any numbers other than
zero during the session run, you can increase performance by tuning the index and
data cache sizes.
Note: PowerCenter versions 6.x and above include the ability to assign memory
allocation per object. In versions earlier than 6.x, aggregators, ranks, and joiners were
assigned at a global/session level.
For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning
Sessions for Better Performance and Tuning SQL Overrides and Environment for
Better Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the
system hosting the Integration Service.
The Integration Service uses system resources to process transformations, session
execution, and the reading and writing of data. The Integration Service also uses
system memory for other data tasks such as creating aggregator, joiner, rank, and
lookup table caches.
You can use system performance monitoring tools to monitor the amount of system
resources the Server uses and identify system bottlenecks.
- Windows NT/2000. Use system tools such as the Performance and
Processes tabs in the Task Manager to view CPU usage and total memory
usage. You can also view more detailed performance information by using the
Performance Monitor in the Administrative Tools on Windows.
- UNIX. Use the following system tools to monitor system performance
and identify system bottlenecks:
    - lsattr -E -l sys0 - to view current system settings
    - iostat - to monitor the load on every disk attached to the database server
    - vmstat or sar -w - to monitor disk swapping activity
    - sar -u - to monitor CPU loading
For further information regarding system tuning, refer to the Best
Practices: Performance Tuning UNIX Systems and Performance Tuning Windows
2000/2003 Systems.


Last updated: 01-Feb-07 18:54
Performance Tuning Databases (Oracle)
Challenge
Database tuning can result in a tremendous improvement in loading performance. This
Best Practice covers tips on tuning Oracle.
Description
Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar
with these tools, so we've included only a short description of some of the major ones
here.
V$ Views
V$ views are dynamic performance views that provide real-time information on
database activity, enabling the DBA to draw conclusions about database performance.
Because SYS is the owner of these views, only SYS can query them by default. Keep in mind that
querying these views impacts database performance; each query carries an
immediate cost. With this in mind, carefully consider which users should be granted the
privilege to query these views. You can grant viewing privileges with either the
SELECT privilege, which allows a user to view individual V$ views, or the SELECT
ANY TABLE privilege, which allows the user to view all V$ views. Using the SELECT
ANY TABLE option requires the O7_DICTIONARY_ACCESSIBILITY parameter to be
set to TRUE, which allows the ANY keyword to apply to SYS-owned objects.
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks
and developing a strategy to avoid them.
Explain Plan allows the DBA or developer to determine the execution path of a block of
SQL code. The SQL in a source qualifier or in a lookup that is running for a long time
should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid
inefficient execution of these statements. Review the PowerCenter session log for long
initialization time (an indicator that the source qualifier may need tuning) and the time it
takes to build a lookup cache to determine if the SQL for these transformations should
be tested.
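As a minimal sketch, the generated SQL can be tested as follows (the statement shown is a placeholder for the SQL copied from the source qualifier or lookup; DBMS_XPLAN is available in Oracle 9i and later, while older releases query PLAN_TABLE directly):

EXPLAIN PLAN FOR
SELECT * FROM order_fact WHERE customer_id = 100;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);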
SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information
about the SQL statements executed in a session that has tracing enabled. This utility is
run for a session with the ALTER SESSION SET SQL_TRACE = TRUE statement.
TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF
formats this dump file into a more understandable report.
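A typical sequence, assuming an illustrative trace file name, is to enable tracing for the session, run the statements of interest, and then format the resulting trace file with TKPROF from the operating system prompt:

ALTER SESSION SET SQL_TRACE = TRUE;
-- run the long-running source qualifier or lookup SQL here
ALTER SESSION SET SQL_TRACE = FALSE;

tkprof ora_12345.trc ora_12345_report.txt sys=no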
UTLBSTAT & UTLESTAT
Executing UTLBSTAT creates tables to store dynamic performance statistics and
begins the statistics collection process. Run this utility after the database has been up
and running (for hours or days). Accumulating statistics may take time, so you need to
run this utility for a long while and through several operations (i.e., both loading and
querying).
UTLESTAT ends the statistics collection process and generates an output file called
report.txt. This report should give the DBA a fairly complete idea about the level of
usage the database experiences and reveal areas that should be addressed.
Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most
systems. Database files should be separated and identified. Rollback files should be
separated onto their own disks because they have significant disk I/O. Co-locate tables
that are heavily used with tables that are rarely used to help minimize disk contention.
Separate indexes from their tables so that queries accessing both are not fighting for
the same resource. Also be sure to implement disk striping; this, or RAID technology,
can help immensely in reducing disk contention. While this type of planning is time
consuming, the payoff is well worth the effort in terms of performance gains.
Dynamic Sampling
Dynamic sampling enables the server to improve performance by:
- Estimating single-table predicate statistics where available statistics are
missing or may lead to bad estimations.
- Estimating statistics for tables and indexes with missing statistics.
- Estimating statistics for tables and indexes with out-of-date statistics.
Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter,
which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of
"2". At compile-time, Oracle determines if dynamic sampling can improve query
performance. If so, it issues recursive statements to estimate the necessary statistics.
Dynamic sampling can be beneficial when:
- The sample time is small compared to the overall query execution time.
- Dynamic sampling results in a better performing query.
- The query can be executed multiple times.
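As an illustrative sketch, the sampling level can be raised for a session, or for a single statement with a hint (the table, alias, and filter are placeholders):

ALTER SESSION SET OPTIMIZER_DYNAMIC_SAMPLING = 4;

SELECT /*+ DYNAMIC_SAMPLING(f 4) */ *
FROM   order_fact f
WHERE  f.load_date = TRUNC(SYSDATE);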
Automatic SQL Tuning in Oracle Database 10g
In its normal mode, the query optimizer needs to make decisions about execution plans
in a very short time. As a result, it may not always be able to obtain enough information
to make the best decision. Oracle 10g allows the optimizer to run in tuning mode,
where it can gather additional information and make recommendations about how
specific statements can be tuned further. This process may take several minutes for a
single statement so it is intended to be used on high-load, resource-intensive
statements.

In tuning mode, the optimizer performs the following analysis:
- Statistics Analysis. The optimizer recommends the gathering of statistics on
objects with missing or stale statistics. Additional statistics for these objects
are stored in an SQL profile.
- SQL Profiling. The optimizer may be able to improve performance by
gathering additional statistics and altering session-specific parameters such as
the OPTIMIZER_MODE. If such improvements are possible, the information is
stored in an SQL profile. If accepted, this information can then be used by the
optimizer when running in normal mode. Unlike a stored outline, which fixes
the execution plan, an SQL profile may still be of benefit when the contents of
the table alter drastically. Even so, it is sensible to update profiles periodically.
SQL profiling is not performed when the tuning optimizer is run in limited
mode.
- Access Path Analysis. The optimizer investigates the effect of new or
modified indexes on the access path. Because its index recommendations
relate to a specific statement, where practical, it also suggests the use of the
SQL Access Advisor to check the impact of these indexes on a representative
SQL workload.
- SQL Structure Analysis. The optimizer suggests alternatives for SQL
statements that contain structures that may affect performance. Be aware that
implementing these suggestions requires human intervention to check their
validity.
TIP
The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page
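A tuning task can also be created and run from SQL*Plus with the DBMS_SQLTUNE package; the statement, task name, and time limit below are illustrative:

DECLARE
  l_task_name VARCHAR2(30);
BEGIN
  l_task_name := DBMS_SQLTUNE.CREATE_TUNING_TASK(
                   sql_text   => 'SELECT * FROM order_fact WHERE customer_id = 100',
                   time_limit => 300,
                   task_name  => 'tune_order_fact_qry');
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'tune_order_fact_qry');
END;
/

SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_order_fact_qry') FROM dual;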

Useful Views
Useful views related to automatic SQL tuning include:
- DBA_ADVISOR_TASKS
- DBA_ADVISOR_FINDINGS
- DBA_ADVISOR_RECOMMENDATIONS
- DBA_ADVISOR_RATIONALE
- DBA_SQLTUNE_STATISTICS
- DBA_SQLTUNE_BINDS
- DBA_SQLTUNE_PLANS
- DBA_SQLSET
- DBA_SQLSET_BINDS
- DBA_SQLSET_STATEMENTS
- DBA_SQLSET_REFERENCES
- DBA_SQL_PROFILES
- V$SQL
- V$SQLAREA
- V$ACTIVE_SESSION_HISTORY
Memory and Processing
Memory and processing configuration is performed in the init.ora file. Because each
database is different and requires an experienced DBA to analyze and tune it for
optimal performance, a standard set of parameters to optimize PowerCenter is not
practical and is not likely to ever exist.
TIP
Changes made in the init.ora file take effect after a restart of the instance. Use svrmgr to issue the
commands shutdown and startup (or shutdown immediate, if necessary) to the instance. Note that svrmgr is
no longer available as of Oracle 9i, as Oracle moved to a web-based server management model in Oracle 10g. If
you are using Oracle 9i, install Oracle client tools and log onto Oracle Enterprise Manager. Some other tools
like DBArtisan expose the initialization parameters.

The settings presented here are those used in a four-CPU AIX server running Oracle
7.3.4, set to make use of the parallel query option to facilitate parallel
processing of queries and indexes. We've also included the descriptions and
documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle)
systems determine what the commands do in the Oracle environment to facilitate
setting their native database commands and settings in a similar fashion.
HASH_AREA_SIZE = 16777216
- Default value: 2 times the value of SORT_AREA_SIZE
- Range of values: any integer
- This parameter specifies the maximum amount of memory, in bytes, to be
used for the hash join. If this parameter is not set, its value defaults to twice
the value of the SORT_AREA_SIZE parameter.
- The value of this parameter can be changed without shutting down the Oracle
instance by using the ALTER SESSION command. (Note: ALTER SESSION
refers to the Database Administration command issued at the svrmgr
command prompt.)
- HASH_JOIN_ENABLED
    - In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true.
    - In Oracle 8i and above, hash_join_enabled=true is the default value.
- HASH_MULTIBLOCK_IO_COUNT
    - Allows multiblock reads against the TEMP tablespace.
    - It is advisable to set the NEXT extent size to greater than the value for
hash_multiblock_io_count to reduce disk I/O.
    - This is the same behavior seen when setting the
db_file_multiblock_read_count parameter for data tablespaces, except this
one applies only to multiblock access of segments of the TEMP tablespace.
- STAR_TRANSFORMATION_ENABLED
    - Determines whether a cost-based query transformation will be applied to star queries.
    - When set to TRUE, the optimizer will consider performing a cost-based
query transformation on the n-way join table.
- OPTIMIZER_INDEX_COST_ADJ
    - Numeric parameter set between 0 and 1000 (default 1000).
    - This parameter lets you tune the optimizer behavior for access path
selection to be more or less index friendly.
Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost
functions. The default of 0 means that the optimizer chooses the best serial plan. A
value of 100 means that the optimizer uses each object's degree of parallelism in
computing the cost of a full-table scan operation.
The value of this parameter can be changed without shutting down the Oracle instance
by using the ALTER SESSION command. Low values favor indexes, while high values
favor table scans.
Cost-based optimization is always used for queries that reference an object with a
nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or
goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero
setting of OPTIMIZER_PERCENT_PARALLEL.
parallel_max_servers=40
- Used to enable parallel query.
- Initially not set on install.
- Maximum number of query servers or parallel recovery processes for an instance.
Parallel_min_servers=8
- Used to enable parallel query.
- Initially not set on install.
- Minimum number of query server processes for an instance. Also the number
of query-server processes Oracle creates when the instance is started.
SORT_AREA_SIZE=8388608
- Default value: operating system-dependent
- Minimum value: the value equivalent to two database blocks
- This parameter specifies the maximum amount, in bytes, of program global
area (PGA) memory to use for a sort. After the sort is complete, and all that
remains to do is to fetch the rows out, the memory is released down to the size
specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out,
all memory is freed. The memory is released back to the PGA, not to the
operating system.
- Increasing SORT_AREA_SIZE improves the efficiency of large sorts.
Multiple allocations never exist; there is only one memory area of
SORT_AREA_SIZE for each user process at any time.
- The default is usually adequate for most database operations. However, if very
large indexes are created, this parameter may need to be adjusted. For
example, if one process is doing all database access, as in a full database
import, then an increased value for this parameter may speed the import,
particularly the CREATE INDEX statements.
Automatic Shared Memory Management in Oracle 10g
Automatic Shared Memory Management puts Oracle in control of allocating memory
within the SGA. The SGA_TARGET parameter sets the amount of memory available to
the SGA. This parameter can be altered dynamically up to a maximum of the
SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to
TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will
control the memory pools that would otherwise be controlled by the following
parameters:
- DB_CACHE_SIZE (default block size)
- SHARED_POOL_SIZE
- LARGE_POOL_SIZE
- JAVA_POOL_SIZE
If these parameters are set to a non-zero value, they represent the minimum size for
the pool. These minimum values may be necessary if you experience application errors
when certain pool sizes drop below a specific threshold.
The following parameters must be set manually and take memory from the quota
allocated by the SGA_TARGET parameter:
- DB_KEEP_CACHE_SIZE
- DB_RECYCLE_CACHE_SIZE
- DB_nK_CACHE_SIZE (non-default block size)
- STREAMS_POOL_SIZE
- LOG_BUFFER
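A minimal sketch of enabling automatic shared memory management (the sizes are illustrative and must fit the memory available on your server):

ALTER SYSTEM SET SGA_MAX_SIZE = 2G SCOPE=SPFILE;   -- upper limit; takes effect after an instance restart
ALTER SYSTEM SET STATISTICS_LEVEL = TYPICAL;
ALTER SYSTEM SET SGA_TARGET = 1600M;               -- Oracle then sizes the buffer cache, shared pool, large pool, and Java pool automatically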
IPC as an Alternative to TCP/IP on UNIX
On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same
box), using an IPC connection can significantly reduce the time it takes to build a
lookup cache. In one case, a fact mapping that was using a lookup to get five columns
(including a foreign key) and about 500,000 rows from a table was taking 19 minutes.
Changing the connection type to IPC reduced this to 45 seconds. In another mapping,
the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000
row write (array inserts), and primary key with unique index in place. Performance went
from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).
A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:
DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )
Make a new entry in the tnsnames like this, and use it for connection to the local Oracle
instance:
DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = ipc)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )
Improving Data Load Performance
Alternative to Dropping and Reloading Indexes
Experts often recommend dropping and reloading indexes during very large loads to a
data warehouse but there is no easy way to do this. For example, writing a SQL
statement to drop each index, then writing another SQL statement to rebuild it, can be
a very tedious process.
Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by
allowing you to disable and re-enable existing indexes. Oracle stores the name of each
index in a table that can be queried. With this in mind, it is an easy matter to write a
SQL statement that queries this table. then generate SQL statements as output to
disable and enable these indexes.
Run the following to generate output to disable the foreign keys in the data warehouse:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE
CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT
SYS_C0011077 ;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT
SYS_C0011075 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT
SYS_C0011060 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT
SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE
CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE
CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE
CONSTRAINT SYS_C0011131 ;
Dropping or disabling primary keys also speeds loads. Run the results of this SQL
statement after disabling the foreign key constraints:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE
PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY
KEY ;
Finally, disable any unique constraints with the following:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE
CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT
SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE
CONSTRAINT SYS_C0011071 ;
Save the results in a single file and name it something like DISABLE.SQL
To re-enable the indexes, rerun these queries after replacing DISABLE with
ENABLE. Save the results in another file with a name such as ENABLE.SQL and run
it as a post-session command.
Re-enable constraints in the reverse order that you disabled them. Re-enable the
unique constraints first, and re-enable primary keys before foreign keys.
TIP
Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates.
If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement
to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that
will be used for the lookup from your script. You may want to experiment to determine which method is faster.

Optimizing Query Performance
Oracle Bitmap Indexing
With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree
index. A b-tree index can greatly improve query performance on data that has high
cardinality or contains mostly unique values, but is not much help for low cardinality/
highly-duplicated data and may even increase query time. A typical example of a low
cardinality field is gender: it is either male or female (or possibly unknown). This kind
of data is an excellent candidate for a bitmap index, and can significantly improve query
performance.
Keep in mind, however, that b-tree indexing is still the Oracle default. If you don't
specify an index type when creating an index, Oracle defaults to b-tree. Also note that
for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree
index on the same column.
Bitmap indexes are suited to data warehousing because of their performance, size, and
ability to create and drop very quickly. Since most dimension tables in a warehouse
have nearly every column indexed, the space savings is dramatic. But it is important to
note that when a bitmap-indexed column is updated, every row associated with that
bitmap entry is locked, making bit-map indexing a poor choice for OLTP database
tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after
each DML statement (e.g., inserts and updates), which can make loads very slow. For
this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-
create or re-enable them after the load.
The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query processes by joining all the
Dimension tables in a Cartesian product based on the WHERE clause, then joins back
to the Fact table. With a bitmapped index on the Fact table, a star query may be
created that accesses the Fact table first followed by the Dimension table joins,
avoiding a Cartesian product of all possible Dimension attributes. This star query
access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is
equal to TRUE in the init.ora file and if there are single column bitmapped indexes on
the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree
indexes. To specify a bitmap index, add the word bitmap between create and index.
All other syntax is identical.
Bitmap Indexes
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);
B-tree Indexes
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);
Information for bitmap indexes is stored in the data dictionary in dba_indexes,
all_indexes, and user_indexes with the word BITMAP in the Uniqueness column
rather than the word UNIQUE. Bitmap indexes cannot be unique.
To enable bitmap indexes, you must set the following items in the instance initialization
file:
- compatible = 7.3.2.0.0 # or higher
- event = "10111 trace name context forever"
- event = "10112 trace name context forever"
- event = "10114 trace name context forever"
Also note that the parallel query option must be installed in order to create bitmap
indexes. If you try to create bitmap indexes without the parallel query option, a syntax
error appears in the SQL statement; the keyword bitmap won't be recognized.
TIP
To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is
installed, the word parallel appears in the banner text.

Index Statistics
Table method
Index statistics are used by Oracle to determine the best method to access tables and
should be updated periodically as part of normal DBA procedures. Updating the table and
index statistics for the data warehouse should improve query performance on Fact and
Dimension tables (including appending and updating records):
The following SQL statement can be used to analyze the tables in the database:
SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following result:
ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;
The following SQL statement can be used to analyze the indexes in the database:
SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following results:
ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;
Save these results as a SQL script to be executed before or after a load.
Schema method
Another way to update index statistics is to compute indexes by schema rather than by
table. If data warehouse indexes are the only indexes located in a single schema, you
can use the following command to update the statistics:
EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');
In this example, BDB is the schema for which the statistics should be updated. Note
that the DBA must grant the execution privilege for dbms_utility to the database user
executing this command.
TIP
These SQL statements can be very resource intensive, especially for very large tables. For this
reason, Informatica recommends running them at off-peak times when no other process is using the database.
If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the
statistics rather than compute them. Use estimate instead of compute in the above examples.
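For example, reusing the table, index, and schema names from the statements above (the 10 percent sample is an arbitrary illustration), estimated statistics can be gathered as:
ANALYZE TABLE CUSTOMER_DIM ESTIMATE STATISTICS SAMPLE 10 PERCENT;
ANALYZE INDEX SYS_C0011125 ESTIMATE STATISTICS SAMPLE 10 PERCENT;
EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'estimate', NULL, 10);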
Parallelism
Parallel execution can be implemented at the SQL statement, database object, or
instance level for many SQL operations. The degree of parallelism should be identified
based on the number of processors and disk drives on the server, with the number of
processors being the minimum degree.
SQL Level Parallelism
Hints are used to define parallelism at the SQL statement level. The following examples
demonstrate how to utilize four processors:
SELECT /*+ PARALLEL(order_fact,4) */ ;
SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ ;
TIP
When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not
be used, and you will not receive an error message.
Example of improper use of alias:
SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME
FROM EMP A
Here, the parallel hint will not be used because the alias A is defined for table EMP. The
correct way is:
SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME
FROM EMP A
Table Level Parallelism
Parallelism can also be defined at the table and index level. The following example
demonstrates how to set a table's degree of parallelism to four for all eligible SQL
statements on this table:
ALTER TABLE order_fact PARALLEL 4;
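The degree of parallelism can be set for an index in the same way; for example, using the index name from the earlier hint example:
ALTER INDEX order_fact_ixl PARALLEL 4;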
Ensure that Oracle is not contending with other processes for these resources or you
may end up with degraded performance due to resource contention.
Additional Tips
Executing Oracle SQL Scripts as Pre- and Post-Session Commands
on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX
environment, the format of the command is:
sqlplus -s user_id/password@database @script_name.sql
For example, to execute the ENABLE.SQL file created earlier (assuming the data
warehouse is on a database named infadb), you would execute the following as a post-
session command:
sqlplus -s user_id/password@infadb @enable.sql
In some environments, this may be a security issue since both the username and
password are hard-coded and unencrypted. To avoid this, use the operating system's
authentication to log onto the database instance.
In the following example, the Informatica id pmuser is used to log onto the Oracle
database. Create the Oracle user pmuser with the following SQL statement:
CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .
In the following pre-session command, pmuser (the id Informatica is logged onto the
operating system as) is automatically passed from the operating system to the
database and used to execute the script:
sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL
You may want to use the init.ora parameter os_authent_prefix to distinguish between
normal Oracle users and externally-identified ones.
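For example (a sketch only; the Oracle default prefix is OPS$, and changing this static parameter requires an instance restart), setting an empty prefix lets the operating-system user name map directly to the externally-identified user created above:
os_authent_prefix = "" # in init.ora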
DRIVING_SITE Hint
If the source and target are on separate instances, the Source Qualifier transformation
should be executed on the target instance.
For example, you want to join two source tables (A and B) together, which may reduce
the number of selected rows. However, Oracle fetches all of the data from both tables,
moves the data across the network to the target instance, then processes everything
on the target instance. If either data source is large, this causes a great deal of network
traffic. To force the Oracle optimizer to process the join on the source instance, use the
Generate SQL option in the source qualifier and include the driving_site hint in the
SQL statement as:
SELECT /*+ DRIVING_SITE */ ;
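In practice, the hint names the table (or alias) whose instance should drive the query. A sketch with hypothetical objects, where orders_remote is reached over a database link:
SELECT /*+ DRIVING_SITE(r) */ r.order_id, d.customer_name
FROM orders_remote@src_link r, customer_dim d
WHERE r.customer_id = d.customer_id;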
Last updated: 01-Feb-07 18:54
Performance Tuning Databases (SQL
Server)
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice offers tips on tuning SQL Server.
Description
Proper tuning of the source and target database is a very important consideration in the
scalability and usability of a business data integration environment. Managing
performance on an SQL Server involves the following:
G Manage system memory usage (RAM caching).
G Create and maintain good indexes.
G Partition large data sets and indexes.
G Monitor disk I/O subsystem performance.
G Tune applications and queries.
G Optimize active data.
Taking advantage of grid computing is another option for improving the overall SQL
Server performance. To set up a SQL Server cluster environment, you need to set up a
cluster where the databases are split among the nodes. This provides the ability to
distribute the load across multiple nodes. To achieve high performance, Informatica
recommends using a fibre-attached SAN device for shared storage.
Manage RAM Caching
Managing RAM buffer cache is a major consideration in any database server
environment. Accessing data in RAM cache is much faster than accessing the same
information from disk. If database I/O can be reduced to the minimal required set of
data and index pages, the pages stay in RAM longer. Too much unnecessary data and
index information flowing into buffer cache quickly pushes out valuable pages. The
primary goal of performance tuning is to reduce I/O so that buffer cache is used
effectively.
Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM
usage:
G Max async I/O is used to specify the number of simultaneous disk I/O
operations that SQL Server can submit to the operating system. Note that this
setting is automated in SQL Server 2000.
G SQL Server allows several selectable models for database recovery; these
include:
H Full Recovery
H Bulk-Logged Recovery
H Simple Recovery
Create and Maintain Good Indexes
Creating and maintaining good indexes is key to maintaining minimal I/O for all
database queries.
Partition Large Data Sets and Indexes
To reduce overall I/O contention and improve parallel operations, consider partitioning
table data and indexes. Multiple techniques for achieving and managing partitions
using SQL Server 2000 are addressed in this document.
Tune Applications and Queries
Tuning applications and queries is especially important when a database server is
likely to be servicing requests from hundreds or thousands of connections through a
given application. Because applications typically determine the SQL queries that are
executed on a database server, it is very important for application developers to
understand SQL Server architectural basics and know how to take full advantage of
SQL Server indexes to minimize I/O.
Partitioning for Performance
The simplest technique for creating disk I/O parallelism is to use hardware partitioning
and create a single "pool of drives" that serves all SQL Server database files except
transaction log files, which should always be stored on physically-separate disk drives
dedicated to log files. (See Microsoft documentation for installation procedures.)
Objects For Partitioning Consideration
The following areas of SQL Server activity can be separated across different hard
drives, RAID controllers, and PCI channels (or combinations of the three):
G Transaction logs
G Tempdb
G Database
G Tables
G Nonclustered Indexes
Note: In SQL Server 2000, Microsoft introduced enhancements to distributed
partitioned views that enable the creation of federated databases (commonly referred
to as scale-out), which spread resource load and I/O activity across multiple servers.
Federated databases are appropriate for some high-end online transaction processing
(OLTP) applications, but this approach is not recommended for addressing the needs
of a data warehouse.
Segregating the Transaction Log
Transaction log files should be maintained on a storage device that is physically
separate from devices that contain data files. Depending on your database recovery
model setting, most update activity generates both data device activity and log activity.
If both are set up to share the same device, the operations to be performed compete
for the same limited resources. Most installations benefit from separating these
competing I/O activities.
Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the
server as a shared working area for various activities, including temporary tables,
sorting, processing subqueries, building aggregates to support GROUP BY or ORDER
BY clauses, queries using DISTINCT (temporary worktables have to be created to
remove duplicate rows), cursors, and hash joins.
To move the tempdb database, use the ALTER DATABASE command to change the
physical file location of the SQL Server logical file name associated with tempdb. For
example, to move tempdb and its associated log to the new file locations E:\mssql7 and
C:\temp, use the following commands:
alter database tempdb modify file (name = 'tempdev', filename =
'e:\mssql7\tempnew_location.mdf')
alter database tempdb modify file (name = 'templog', filename =
'c:\temp\tempnew_loglocation.mdf')
The master, msdb, and model databases are not used much during
production (as compared to user databases), so it is generally not necessary to
consider them in I/O performance tuning. The master database is
usually used only for adding new logins, databases, devices, and other system objects.
Database Partitioning
Databases can be partitioned using files and/or filegroups. A filegroup is simply a
named collection of individual files grouped together for administration purposes. A file
cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and
image data can all be associated with a specific filegroup. This means that all their
pages are allocated from the files in that filegroup. The three types of filegroups are:
G Primary filegroup. Contains the primary data file and any other files not
placed into another filegroup. All pages for the system tables are allocated
from the primary filegroup.
G User-defined filegroup. Any filegroup specified using the FILEGROUP
keyword in a CREATE DATABASE or ALTER DATABASE statement, or on
the Properties dialog box within SQL Server Enterprise Manager.
G Default filegroup. Contains the pages for all tables and indexes that do not
have a filegroup specified when they are created. In each database, only one
filegroup at a time can be the default filegroup. If no default filegroup is
specified, the default is the primary filegroup.
Files and filegroups are useful for controlling the placement of data and indexes
and eliminating device contention. Quite a few installations also leverage files and
filegroups as a mechanism that is more granular than a database in order to exercise
more control over their database backup/recovery strategy.
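As an illustration (the database, filegroup, file, and table names are hypothetical), a user-defined filegroup can be added and a table placed on it as follows:
ALTER DATABASE Sales ADD FILEGROUP DIM_FG;
ALTER DATABASE Sales ADD FILE
(NAME = 'SalesDim1', FILENAME = 'E:\mssql\SalesDim1.ndf', SIZE = 500MB)
TO FILEGROUP DIM_FG;
CREATE TABLE CUSTOMER_DIM (customer_key INT NOT NULL, customer_name VARCHAR(100)) ON DIM_FG;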
Horizontal Partitioning (Table)
Horizontal partitioning segments a table into multiple tables, each containing the same
number of columns but fewer rows. Determining how to partition tables horizontally
depends on how data is analyzed. A general rule of thumb is to partition tables so
queries reference as few tables as possible. Otherwise, excessive UNION queries,
used to merge the tables logically at query time, can impair performance.
When you partition data across multiple tables or multiple servers, queries accessing
only a fraction of the data can run faster because there is less data to scan. If the
tables are located on different servers, or on a computer with multiple processors, each
table involved in the query can also be scanned in parallel, thereby improving query
performance. Additionally, maintenance tasks, such as rebuilding indexes or backing
up a table, can execute more quickly.
By using a partitioned view, the data still appears as a single table and can be queried
as such without having to reference the correct underlying table manually.
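A sketch of a local partitioned view (table, column, and view names are hypothetical); the CHECK constraints on the partitioning column let the optimizer skip member tables a query does not need:
CREATE TABLE ORDERS_2006 (order_id INT NOT NULL, order_year INT NOT NULL CHECK (order_year = 2006),
amount MONEY, CONSTRAINT pk_orders_2006 PRIMARY KEY (order_id, order_year));
CREATE TABLE ORDERS_2007 (order_id INT NOT NULL, order_year INT NOT NULL CHECK (order_year = 2007),
amount MONEY, CONSTRAINT pk_orders_2007 PRIMARY KEY (order_id, order_year));
GO
CREATE VIEW ORDERS AS
SELECT order_id, order_year, amount FROM ORDERS_2006
UNION ALL
SELECT order_id, order_year, amount FROM ORDERS_2007;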
Cost Threshold for Parallelism Option
Use this option to specify the threshold where SQL Server creates and executes
parallel plans. SQL Server creates and executes a parallel plan for a query only when
the estimated cost to execute a serial plan for the same query is higher than the value
set in cost threshold for parallelism. The cost refers to an estimated elapsed time in
seconds required to execute the serial plan on a specific hardware configuration. Only
set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max Degree of Parallelism Option
Use this option to limit the number of processors (from a maximum of 32) to use in
parallel plan execution. The default value is zero, which uses the actual number of
available CPUs. Set this option to one to suppress parallel plan generation. Set the
value to a number greater than one to restrict the maximum number of processors
used by a single query execution.
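Both parallelism settings are server-wide options that can be changed with sp_configure; the values below are examples only:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 10;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;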
Priority Boost Option
Use this option to specify whether SQL Server should run at a higher scheduling
priority than other processes on the same computer. If you set this option to one, SQL
Server runs at a priority base of 13. The default is zero, which is a priority base of
seven.
Set Working Set Size Option
Use this option to reserve physical memory space for SQL Server that is equal to the
server memory setting. The server memory setting is configured automatically by SQL
Server based on workload and available resources. It can vary dynamically between
minimum server memory and maximum server memory. Setting set working set size
means the operating system does not attempt to swap out SQL Server pages, even if
they can be used more readily by another process when SQL Server is idle.
Optimizing Disk I/O Performance
When configuring a SQL Server that contains only a few gigabytes of data and does
not sustain heavy read or write activity, you need not be particularly concerned with the
subject of disk I/O and balancing of SQL Server I/O activity across hard drives for
optimal performance. For larger SQL Server databases, however, which can
contain hundreds of gigabytes or even terabytes of data and/or sustain heavy read/
write activity (as in a DSS application), the configuration should be designed to
maximize SQL Server disk I/O performance by load-balancing across multiple hard
drives.
Partitioning for Performance
For SQL Server databases that are stored on multiple disk drives, performance can be
improved by partitioning the data to increase the amount of disk I/O parallelism.
Partitioning can be performed using a variety of techniques. Methods for creating and
managing partitions include configuring the storage subsystem (i.e., disk, RAID
partitioning) and applying various data configuration mechanisms in SQL Server such
as files, file groups, tables and views. Some possible candidates for partitioning include:
G Transaction log
G Tempdb
G Database
G Tables
G Non-clustered indexes
Using bcp and BULK INSERT
Two mechanisms exist inside SQL Server to address the need for bulk movement of
data: the bcp utility and the BULK INSERT statement.
G Bcp is a command prompt utility that copies data into or out of SQL Server.
G BULK INSERT is a Transact-SQL statement that can be executed from within
the database environment. Unlike bcp, BULK INSERT can only pull data into
SQL Server. An advantage of using BULK INSERT is that it can copy data
into instances of SQL Server using a Transact-SQL statement, rather than
having to shell out to the command prompt.
TIP
Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with
small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none
is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load
1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row
number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database
before you attempt to reload the data. By specifying a batch size of 10,000, you could have saved significant
recovery time, because SQL Server would have only had to roll back 9,999 rows instead of 999,999.
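A sketch of a BULK INSERT with an explicit batch size (the table name, file path, and delimiters are hypothetical):
BULK INSERT dbo.ORDER_FACT
FROM 'E:\loads\order_fact.dat'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', BATCHSIZE = 10000, TABLOCK);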
General Guidelines for Initial Data Loads
While loading data:
G Remove indexes.
G Use Bulk INSERT or bcp.
G Parallel load using partitioned data files into partitioned tables.
G Run one load stream for each available CPU.
G Set the Bulk-Logged or Simple Recovery model (see the example after this list).
G Use the TABLOCK option.
G Create indexes.
G Switch to the appropriate recovery model.
G Perform backups.
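For example (the database name is hypothetical), the recovery model can be switched around the load window as follows:
ALTER DATABASE Sales SET RECOVERY BULK_LOGGED;
-- ... perform the bulk load ...
ALTER DATABASE Sales SET RECOVERY FULL;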
General Guidelines for Incremental Data Loads
G Load data with indexes in place.
G Use performance and concurrency requirements to determine locking
granularity (sp_indexoption).
G Change from Full to Bulk-Logged Recovery mode unless there is an overriding
need to preserve point-in-time recovery, such as online users modifying the
database during bulk loads. Read operations should not affect bulk loads.
Performance Tuning Databases (Teradata)
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice provides tips on tuning Teradata.
Description
Teradata offers several bulk load utilities including:
G MultiLoad which supports inserts, updates, deletes, and upserts to any
table.
G FastExport which is a high-performance bulk export utility.
G BTEQ which allows you to export data to a flat file but is suitable for smaller
volumes than FastExport.
G FastLoad which is used for loading inserts into an empty table.
G TPump which is a light-weight utility that does not lock the table that is being
loaded.
Tuning MultiLoad
There are many aspects to tuning a Teradata database. Several aspects of tuning can
be controlled by setting MultiLoad parameters to maximize write throughput. Other
areas to analyze when performing a MultiLoad job include estimating space
requirements and monitoring MultiLoad performance.
MultiLoad parameters
Below are the MultiLoad-specific parameters that are available in PowerCenter:
G TDPID. A client based operand that is part of the logon string.
G Date Format. Ensure that the date format used in your target flat file is
equivalent to the date format parameter in your MultiLoad script. Also validate
that your date format is compatible with the date format specified in the
Teradata database.
G Checkpoint. A checkpoint interval is similar to a commit interval for other
databases. When you set the checkpoint value to less than 60, it represents
the interval in minutes between checkpoint operations. If the checkpoint is set
to a value greater than 60, it represents the number of records to write before
performing a checkpoint operation. To maximize write speed to the database,
try to limit the number of checkpoint operations that are performed.
G Tenacity. Interval in hours between MultiLoad attempts to log on to the
database when the maximum number of sessions are already running.
G Load Mode. Available load methods include Insert, Update, Delete, and
Upsert. Consider creating separate external loader connections for each
method, selecting the one that will be most efficient for each target table.
G Drop Error Tables. Allows you to specify whether to drop or retain the three
error tables for a MultiLoad session. Set this parameter to 1 to drop error
tables or 0 to retain error tables.
G Max Sessions. This parameter specifies the maximum number of sessions
that are allowed to log on to the database. This value should not exceed one
per working AMP (Access Module Processor).
G Sleep. This parameter specifies the number of minutes that MultiLoad waits
before retrying a logon operation.
Estimating Space Requirements for MultiLoad Jobs
Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the space
that may be required by target tables, each MultiLoad job needs permanent space for:
G Work tables
G Error tables
G Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for the
MultiLoad tables, data is preserved for restart operations after a system failure. Work
tables, in particular, require a lot of extra permanent space. Also remember to account
for the size of error tables since error tables are generated for each target table.
Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:
PERM = (using data size + 38) x (number of rows processed) x (number of
apply conditions satisfied) x (number of Teradata SQL statements within the
applied DML)
Make adjustments to your preliminary space estimates according to the requirements
and expectations of your MultiLoad job.
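As a purely illustrative calculation (all figures are hypothetical): with a using data size of 200 bytes, 1,000,000 rows processed, one apply condition satisfied, and one SQL statement in the applied DML:
PERM = (200 + 38) x 1,000,000 x 1 x 1 = 238,000,000 bytes, or approximately 238 MB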
Monitoring MultiLoad Performance
Below are tips for analyzing MultiLoad performance:
1. Determine which phase of the MultiLoad job is causing poor performance.
G If the performance bottleneck is during the acquisition phase, as data is
acquired from the client system, then the issue may be with the client system.
If it is during the application phase, as data is applied to the target tables, then
the issue is not likely to be with the client system.
G The MultiLoad job output lists the job phases and other useful information.
Save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a
separate NUSI change row to be applied to each NUSI sub-table after all of the
rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables
are performed at normal SQL speed, which is much slower than normal
MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems.
8. Poor performance can occur when the input data is skewed with respect to
the Primary Index of the table. Teradata depends upon random and well-
distributed data for data input and retrieval. For example, a file containing a
million rows with a single value 'AAAAAA' for the Primary Index will take an
extremely long time to load.
9. One common tool used for determining load issues/skewed data/locks is
Performance Monitor (PMON). PMON requires MONITOR access on the
Teradata system. If you do not have Monitor access, then the DBA can help
you to look at the system.
10. SQL against the system catalog can also be used to determine any
performance bottlenecks. The following query is used to see if the load is
inserting data into the system. Spool space (a type of work space) is built up as
data is transferred to the database, so if the load is going well, the spool grows
rapidly in the database. Use the following query to check:
SELECT SUM(currentspool) FROM dbc.diskspace
WHERE databasename = '<userid loading the database>';
After the spool has reached its peak, it falls rapidly as data is
inserted from spool into the table. If the spool grows slowly, then the input data
is probably skewed.
FastExport
FastExport is a bulk export Teradata utility. One way to pull up data for Lookups/
Sources is by using ODBC since there is no native connectivity to Teradata. However,
ODBC is slow. For higher performance, use FastExport if the number of rows to be
pulled is on the order of a million rows. FastExport writes to a file. The lookup or source
qualifier then reads this file. FastExport is integrated with PowerCenter.
BTEQ
BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you
to export data to a flat file, but is suitable for smaller volumes of data. This provides
faster performance than ODBC but doesn't tax Teradata system resources the way
FastExport can. A possible use for BTEQ with PowerCenter is to export smaller
volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by
PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-
session script.
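A minimal BTEQ export script (the logon string, file path, and query are hypothetical) that could be invoked from a pre-session command:
.LOGON tdpid/pmuser,password
.EXPORT REPORT FILE = /informatica/loads/customer_lkp.dat
SELECT customer_id, customer_name FROM edw.customer_dim;
.EXPORT RESET
.LOGOFF
.QUIT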
TPump
TPump is a load utility primarily intended for streaming data (think of loading bundles
of messages arriving from MQ using PowerCenter Real Time). TPump can also load
from a file or a named pipe.
While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility.
Another important difference between MultiLoad and TPump is that TPump locks at the
row-hash level instead of the table level, thus providing users read access to fresher
data. Teradata indicates that it has improved the speed of TPump for loading files so
that it is comparable to MultiLoad, so try a test load using TPump first. Also, be
cautious with the use of TPump to load streaming data if the data throughput is large.
Pushdown Optimization
PowerCenter embeds a powerful engine with its own memory management
system and efficient algorithms for performing transformation operations such as
aggregation, sorting, joining, and lookups. This is typically referred to as an ETL
architecture, where Extracts, Transformations, and Loads are performed: data is
extracted from the data source to the PowerCenter engine (which can be on the same
machine as the source or a separate machine), where all the transformations are
applied, and is then pushed to the target. Some of the performance considerations for
this type of architecture are:
G Is the network fast enough and tuned effectively to support the
necessary data transfer?
G Is the hardware on which PowerCenter is running sufficiently robust,
with high processing capability and high memory capacity?
ELT (Extract, Load, Transform) is a relatively new design or runtime paradigm
that became popular with the advent of high-performance RDBMS platforms for DSS
and OLTP workloads. Because Teradata typically runs on well-tuned operating systems
and well-tuned hardware, the ELT paradigm tries to push as much of the transformation
logic as possible onto the Teradata system.
The ELT design paradigm can be achieved through the Pushdown Optimization option
offered with PowerCenter.
ETL or ELT
Because many database vendors and consultants advocate using ELT (Extract, Load
and Transform) over ETL (Extract, Transform and Load), the use of Pushdown
Optimization can be somewhat controversial. Informatica advocates using Pushdown
Optimization as an option to solve specific performance situations rather than as the
default design of a mapping.
The following scenarios can help in deciding on when to use ETL with PowerCenter
and when to use ELT (i.e., Pushdown Optimization):
1. When the load needs to look up only dimension tables then there may be no
need to use Pushdown Optimization. In this context, PowerCenter's ability to
build dynamic, persistent caches is significant. If a daily load involves tens or
hundreds of fact files to be loaded throughout the day, then dimension surrogate
keys can be easily obtained from PowerCenter's cache in memory. Compare
this with the cost of running the same dimension lookup queries on the
database.
2. In many cases large Teradata systems contain only a small amount of data. In
such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied on the data then
there may be no need to push down. The special case is that of applying filters
or expression logic to non-unique columns in incoming data in PowerCenter.
Compare this to loading the same data into the database and then applying a
WHERE clause on a non-unique column, which is highly inefficient for a large
table.
The principle here is: Filter and resolve the data AS it gets loaded instead of
loading it into a database, querying the RDBMS to filter/resolve and re-loading it
into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data
needs to be merged or queried to arrive at your final load set.
Maximizing Performance using Pushdown Optimization
You can push transformation logic to either the source or target database using
pushdown optimization. The amount of work you can push to the database depends on
the pushdown optimization configuration, the transformation logic, and the mapping
and session configuration.
When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and processes any transformation logic
that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
Pushdown Optimization.
Known Issues with Teradata
You may encounter the following problems using ODBC drivers with a Teradata
database:
G Teradata sessions fail if the session requires a conversion to a numeric data
type and the precision is greater than 18.
G Teradata sessions fail when you use full pushdown optimization for a session
containing a Sorter transformation.
G A sort on a distinct key may give inconsistent results if the sort is not case
sensitive and one port is a character port.
G A session containing an Aggregator transformation may produce different
results from PowerCenter if the group by port is a string data type and it is not
case-sensitive.
G A session containing a Lookup transformation fails if it is configured for target-
side pushdown optimization.
G A session that requires type casting fails if the casting is from x to date/time.
G A session that contains a date to string conversion fails.
Working with SQL Overrides
You can configure the Integration Service to perform an SQL override with Pushdown
Optimization. To perform an SQL override, you configure the session to create a view.
When you use a SQL override for a Source Qualifier transformation in a session
configured for source or full Pushdown Optimization with a view, the Integration Service
creates a view in the source database based on the override. After it creates the view
in the database, the Integration Service generates a SQL query that it can push to the
database. The Integration Service runs the SQL query against the view to perform
Pushdown Optimization.
Note: To use an SQL override with pushdown optimization, you must configure the
session for pushdown optimization with a view.
Running a Query
If the Integration Service did not successfully drop the view, you can run a query
against the source database to search for the views generated by the Integration
Service. When the Integration Service creates a view, it uses a prefix of PM_V. You
can search for views with this prefix to locate the views created during pushdown
optimization.
Teradata specific SQL:
SELECT TableName FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\';
Rules and Guidelines for SQL Override
Use the following rules and guidelines when you configure pushdown optimization for a
session containing an SQL override:
Last updated: 01-Feb-07 18:54
Performance Tuning UNIX Systems
Challenge
Identify opportunities for performance improvement within the complexities of the UNIX
operating environment.
Description
This section provides an overview of the subject area, followed by discussion of the use
of specific tools.
Overview
All system performance issues are fundamentally resource contention issues. In any
computer system, there are three essential resources: CPU, memory, and I/O - namely
disk and network I/O. From this standpoint, performance tuning for PowerCenter
means ensuring that PowerCenter and its sub-processes have adequate resources
to execute in a timely and efficient manner.
Each resource has its own particular set of problems. Resource problems are
complicated because all resources interact with each other. Performance tuning is
about identifying bottlenecks and making trade-offs to improve the situation. Your best
approach is to initially take a baseline measurement and obtain a good
understanding of how the system behaves, then evaluate any bottleneck revealed on each
system resource during your load window and remove whichever
resource contention offers the greatest opportunity for performance enhancement.
Here is a summary of each system resource area and the problems it can have.
CPU
G On any multiprocessing and multi-user system, many processes want to use
the CPUs at the same time. The UNIX kernel is responsible for allocation of a
finite number of CPU cycles across all running processes. If the total demand
on the CPU exceeds its finite capacity, then all processing is likely to reflect a
negative impact on performance; the system scheduler puts each process in a
queue to wait for CPU availability.
G An average of the count of active processes in the system for the last 1, 5, and
15 minutes is reported as load average when you execute the command
uptime. The load average provides you a basic indicator of the number of
contenders for CPU time. Likewise vmstat command provides an average
usage of all the CPUs along with the number of processes contending for
CPU (the value under the r column).
G On SMP (symmetric multiprocessing) architecture servers, watch for even
utilization of all the CPUs. How well all the CPUs are utilized depends on how
well an application can be parallelized. If a process is incurring a high degree
of involuntary context switches by the kernel, binding the process to a specific
CPU may improve performance.
Memory
G Memory contention arises when the memory requirements of the active
processes exceed the physical memory available on the system; at this point,
the system is out of memory. To handle this lack of memory, the system starts
paging, or moving portions of active processes to disk in order to reclaim
physical memory. When this happens, performance decreases dramatically.
Paging is distinguished from swapping, which means moving entire processes
to disk and reclaiming their space. Paging and excessive swapping indicate
that the system can't provide enough memory for the processes that are
currently running.
G Commands such as vmstat and pstat show whether the system is paging; ps,
prstat and sar can report the memory requirements of each process.
Disk I/O
G The I/O subsystem is a common source of resource contention problems. A
finite amount of I/O bandwidth must be shared by all the programs (including
the UNIX kernel) that currently run. The system's I/O buses can transfer only
so many megabytes per second; individual devices are even more limited.
Each type of device has its own peculiarities and, therefore, its own problems.
G Tools are available to evaluate specific parts of the I/O subsystem
H iostat can give you information about the transfer rates for each disk
drive. ps and vmstat can give some information about how many
processes are blocked waiting for I/O.
H sar can provide voluminous information about I/O efficiency.
H sadp can give detailed information about disk access patterns.
Network I/O
G The source data, the target data, or both the source and target data are likely
to be connected through an Ethernet channel to the system where
PowerCenter resides. Be sure to consider the number of Ethernet channels
and bandwidth available to avoid congestion.
H netstat shows packet activity on a network; watch for a high collision rate of
output packets on each interface.
H nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not
from the NFS server); watch for a high timeout rate relative to total calls and
for 'not responding' messages.
Given that these issues all boil down to access to some computing resource, mitigation
of each issue consists of making some adjustment to the environment to provide more
(or preferential) access to the resource; for instance:
G Adjusting execution schedules to allow leverage of low usage times may
improve availability of memory, disk, network bandwidth, CPU cycles, etc.
G Migrating other applications to other hardware is likely to reduce demand on
the hardware hosting PowerCenter.
G For CPU intensive sessions, raising CPU priority (or lowering priority for
competing processes) provides more CPU time to the PowerCenter sessions.
G Adding hardware resources, such as adding memory, can make more
resource available to all processes.
G Re-configuring existing resources may provide for more efficient usage, such
as assigning different disk devices for input and output, striping disk devices,
or adjusting network packet sizes.
Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips are likely to be more helpful than others in a particular
environment, all are worthy of consideration.
The availability, syntax, and format of each command vary across UNIX versions.
Running ps -axu
Run ps -axu to check for the following items:
G Are there any processes waiting for disk access or for paging? If so check the I/
O and memory subsystems.
G What processes are using most of the CPU? This may help to distribute the
workload better.
G What processes are using most of the memory? This may help to distribute the
workload better.
G Does ps show that your system is running many memory-intensive jobs? Look
for jobs with a large set (RSS) or a high storage integral.
Identifying and Resolving Memory Issues
Use vmstat or sar to check for paging/swapping actions. Check the system to
ensure that excessive paging/swapping does not occur at any time during the session
processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/
swapping. If paging or excessive swapping does occur at any time, increase memory to
prevent it. Paging/swapping, on any database system, causes a major performance
decrease and increased I/O. On a memory-starved and I/O-bound server, this can
effectively shut down the PowerCenter process and any databases running on the
server.
Some swapping may occur normally regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.
Run vmstat 5 (sar -wpgr on SunOS) or vmstat -S 5 to detect and confirm memory
problems, and check for the following (see the example after this list):
G Are page-outs occurring consistently? If so, you are short of memory.
G Are there a high number of address translation faults? (System V only). This
suggests a memory shortage.
G Are swap-outs occurring consistently? If so, you are extremely short of
memory. Occasional swap-outs are normal; BSD systems swap out inactive
jobs. Long bursts of swap-outs mean that active jobs are probably falling victim
and indicate extreme memory shortage. If you don't have vmstat -S, look at the
w and de fields of vmstat. These should always be zero.
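The commands referenced above can be run as follows; exact flags and column names vary by UNIX flavor:
vmstat 5 10     # watch the paging columns (pi/po on Solaris, si/so on Linux)
sar -g 5 10     # Solaris/System V: page-out activity per interval
swap -l         # Solaris: list swap areas and available swap space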
If memory seems to be the bottleneck, try the following remedial steps:
G Reduce the size of the buffer cache (if your system has one) by decreasing
BUFPAGES.
G If you have statically allocated STREAMS buffers, reduce the number of large
(e.g., 2048- and 4096-byte) buffers. This may reduce network performance,
but netstat-m should give you an idea of how many buffers you really need.
G Reduce the size of your kernel's tables. This may limit the system's capacity
(i.e., number of files, number of processes, etc.).
G Try running jobs requiring a lot of memory at night. This may not help the
memory problems, but you may not care about them as much.
G Try running jobs requiring a lot of memory in a batch queue. If only one
memory-intensive job is running at a time, your system may perform
satisfactorily.
G Try to limit the time spent running sendmail, which is a memory hog.
G If you don't see any significant improvement, add more memory.
Identifying and Resolving Disk I/O Issues
Use iostat to check I/O load and utilization as well as CPU load. Iostat can be used
to monitor the I/O load on the disks on the UNIX server. Using iostat permits monitoring
the load on specific disks. Take notice of how evenly disk activity is distributed among
the system disks. If it is not, are the most active disks also the fastest disks?
Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area
of the disk (good), spread evenly across the disk (tolerable), or in two well-defined
peaks at opposite ends (bad)?
G Reorganize your file systems and disks to distribute I/O activity as evenly as
possible.
G Using symbolic links helps to keep the directory structure the same throughout
while still moving the data files that are causing I/O contention.
G Use your fastest disk drive and controller for your root file system; this almost
certainly has the heaviest activity. Alternatively, if single-file throughput is
important, put performance-critical files into one file system and use the fastest
drive for that file system.
G Put performance-critical files on a file system with a large block size: 16KB or
32KB (BSD).
G Increase the size of the buffer cache by increasing BUFPAGES (BSD). This
may hurt your system's memory performance.
G Rebuild your file systems periodically to eliminate fragmentation (i.e., backup,
build a new file system, and restore).
G If you are using NFS and using remote files, look at your network situation.
You don't have local disk I/O problems.
G Check memory statistics again by running vmstat 5 (sar -rwpg). If your system
is paging or swapping consistently, you have memory problems; fix the memory
problem first. Swapping makes performance worse.
If your system has a disk capacity problem and is constantly running out of disk space,
try the following actions:
G Write a find script that detects old core dumps, editor backup and auto-save
files, and other trash and deletes it automatically. Run the script through cron.
G Use the disk quota system (if your system has one) to prevent individual users
from gathering too much storage.
G Use a smaller block size on file systems that are mostly small files (e.g.,
source code files, object modules, and small data files).
Identifying and Resolving CPU Overload Issues
Use uptime or sar -u to check for CPU loading. sar provides more detail, including
%usr (user), %sys (system), %wio (waiting on I/O), and %idle (% of idle time). A target
goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10.
If %wio is higher, disk and I/O contention should be investigated to eliminate the I/O
bottleneck on the UNIX server. If the system shows a heavy %sys load together with a
high %idle, this is indicative of memory contention and swapping/paging problems. In
this case, it is necessary to make memory changes to reduce the load on the server.
When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without
letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the
time, work must be piling up somewhere. This points to CPU overload.
G Eliminate unnecessary daemon processes. rwhod and routed are particularly
likely to be performance problems, but any savings will help.
G Get users to run jobs at night with at or any queuing system that's available.
You may not care if the CPU (or the memory or I/O system) is overloaded at
night, provided the work is done in the morning.
G Using nice to lower the priority of CPU-bound jobs improves interactive
performance. Also, using nice to raise the priority of CPU-bound
jobs expedites them but may hurt interactive performance (see the example
after this list). In general, though, using nice is really only a temporary solution.
If your workload grows, it will soon become insufficient. Consider upgrading your
system, replacing it, or buying another system to share the load.
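A brief illustration (the script name and process id are hypothetical):
nice -n 10 ./nightly_extract.sh &   # start a CPU-bound batch job at reduced priority
renice -n 5 -p 12345                # reduce the priority of an already-running process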
Identifying and Resolving Network I/O Issues
Suspect problems with network capacity or with data integrity if users experience
slow performance when they are using rlogin or when they are accessing files via NFS.
Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If
the number of input or output errors is large, suspect hardware problems. A large
number of input errors indicates problems somewhere on the network. A large number
of output errors suggests problems with your system and its interface to the network.
If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system. If
the number of dropped packets is large, the remote system most likely cannot respond
to incoming data fast enough. Look to see if there are CPU, memory or disk I/O
problems on the remote system. If not, the system may just not be able to tolerate
heavy network workloads. Try to reorganize the network so that this system isn't a file
server.
A large number of dropped packets may also indicate data corruption. Run netstat -s
on the remote system, then spray the remote system from the local system and run
netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is
equal to or greater than the number of dropped packets that spray reports, the remote
system is a slow network server. If the increase in socket full drops is less than the
number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent
of calls, the network or an NFS server is overloaded. If timeout is high, at least one
NFS server is overloaded, the network may be faulty, or one or more servers may have
crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If
timeout and retrans are high, but badxid is low, some part of the network between the
NFS client and server is overloaded and dropping packets.
Try to prevent users from running I/O-intensive programs across the network.
The grep utility is a good example of an I/O-intensive program. Instead, have users log
into the remote system to do their work.
Reorganize the computers and disks on your network so that as many users as
possible can do as much work as possible on a local system.
Use systems with good network performance as file servers.
lsattr -E -l sys0 is used to determine some current settings in some UNIX
environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the
setting that determines the maximum number of user background processes. On most UNIX
environments, this defaults to 40, but should be increased to 250 on most systems.
Choose a file system. Be sure to check the database vendor documentation to
determine the best file system for the specific machine. Typical choices include: s5, the
UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD);
vxfs, the Veritas file system; and lastly raw devices, which in reality are not a file system
at all. Additionally, for the PowerCenter Grid option cluster file system (CFS), products
such as GFS for RedHat Linux, Veritas CFS, and GPFS for IBM AIX are some of the
available choices.
Cluster File System Tuning
In order to take full advantage of the PowerCenter Grid option, cluster file system
(CFS) is recommended. PowerCenter Grid option requires that the directories for each
Integration Service to be shared with other servers. This allows Integration Services to
share files such as cache files between different session runs. CFS performance is a
result of tuning parameters and tuning the infrastructure. Therefore, using the
parameters recommended by each CFS vendor is the best approach for CFS tuning.
PowerCenter Options
The Integration Service Monitor is available to display system resource usage
information about associated nodes. The window displays resource usage information
about the running tasks, including CPU%, memory, and swap usage.
The PowerCenter 64-bit option can allocate more memory to sessions and achieve
higher throughputs compared to the 32-bit version of PowerCenter.
Last updated: 01-Feb-07 18:54
Performance Tuning Windows 2000/2003
Systems
Challenge
Windows Server is designed as a self-tuning operating system. Standard installation
of Windows Server provides good performance out-of-the-box, but optimal performance
can be achieved by tuning.
Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.
Description
The following tips have proven useful in performance-tuning Windows Servers. While
some are likely to be more helpful than others in any particular environment, all are
worthy of consideration.
The two places to begin tuning a Windows server are:
G Performance Monitor.
G Performance tab (hit ctrl+alt+del, choose task manager, and click on the
Performance tab).
Although the Performance Monitor can be tracked in real-time, creating a result-set
representative of a full day is more likely to render an accurate view of system
performance.
Resolving Typical Windows Server Problems
The following paragraphs describe some common performance problems in a Windows
Server environment and suggest tuning solutions.
Server Load: Assume that some software will not be well coded, and that some
background processes (e.g., a mail server or web server) running on the same machine
can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs
may be the only recourse.
Device Drivers: The device drivers for some types of hardware are notorious for
inefficient use of CPU clock cycles. Be sure to obtain the latest drivers from the hardware
vendor to minimize this problem.
Memory and services: Although adding memory to Windows Server is always a good
solution, it is also expensive and usually must be planned in advance. Before adding
memory, check the Services in Control Panel because many background applications
do not uninstall the old service when installing a new version. Thus, both the unused
old service and the new service may be using valuable CPU memory resources.
I/O Optimization: This is, by far, the best tuning option for database applications in
the Windows Server environment. If necessary, level the load across the disk devices
by moving files. In situations where there are multiple controllers, be sure to level the
load across the controllers too.
Using electrostatic devices and fast-wide SCSI can also help to increase performance.
Further, fragmentation can usually be eliminated by using a Windows Server disk
defragmentation product.
Finally, on Windows Servers, be sure to implement disk striping to split single data files
across multiple disk drives and take advantage of RAID (Redundant Arrays of
Inexpensive Disks) technology. Also increase the priority of the disk devices on the
Windows Server. Windows Server, by default, sets the disk device priority low.
Monitoring System Performance in Windows Server
In Windows Server, PowerCenter uses system resources to process transformation,
session execution, and reading and writing of data. The PowerCenter Integration
Service also uses system memory for other data such as aggregate, joiner, rank, and
cached lookup tables. With Windows Server, you can use the system monitor in the
Performance Console of the administrative tools, or system tools in the task manager,
to monitor the amount of system resources used by PowerCenter and to identify
system bottlenecks.
Windows Server provides the following tools (accessible under the Control Panel/
Administration Tools/Performance) for monitoring resource usage on your computer:
G System Monitor
G Performance Logs and Alerts
These Windows Server monitoring tools enable you to analyze usage and detect
bottlenecks at the disk, memory, processor, and network level.
System Monitor
The System Monitor displays a graph which is flexible and configurable. You can copy
counter paths and settings from the System Monitor display to the Clipboard and paste
counter paths from Web pages or other sources into the System Monitor display.
Because the System Monitor is portable, it is useful in monitoring other systems that
require administration.
Performance Monitor
The Performance Logs and Alerts tool provides two types of performance-related logs
(counter logs and trace logs) and an alerting function.
Counter logs record sampled data about hardware resources and system services
based on performance objects and counters in the same manner as System Monitor.
They can, therefore, be viewed in System Monitor. Data in counter logs can be saved
as comma-separated or tab-separated files that are easily viewed with Excel.
Trace logs collect event traces that measure performance statistics associated with
events such as disk and file I/O, page faults, or thread activity. The alerting function
allows you to define a counter value that will trigger actions such as sending a network
message, running a program, or starting a log. Alerts are useful if you are not actively
monitoring a particular counter threshold value but want to be notified when it exceeds
or falls below a specified value so that you can investigate and determine the cause of
the change. You may want to set alerts based on established performance baseline
values for your system.
Note: You must have Full Control access to a subkey in the registry in order to create
or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM
\CurrentControlSet\Services\SysmonLog\Log_Queries).
The predefined log settings under Counter Logs (i.e., System Overview) are configured
to create a binary log that, after manual start-up, updates every 15 seconds and logs
continuously until it achieves a maximum size. If you start logging with the default
settings, data is saved to the Perflogs folder on the root directory and includes the
counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and
Processor(_Total)\ % Processor Time.
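If you prefer to collect the same counters from a script rather than through the GUI, the typeperf command-line utility included with Windows Server 2003 can capture them to a CSV file. The invocation below is only an illustrative sketch; the sample interval, sample count, and output path are assumptions to adapt to your environment:

typeperf "\Memory\Pages/sec" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Processor(_Total)\% Processor Time" -si 15 -sc 240 -f CSV -o C:\PerfLogs\baseline.csv

This example samples every 15 seconds, stops after 240 samples (one hour), and writes results to a file that can be opened in Excel alongside the counter logs described above.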
If you want to create your own log setting, right-click one of the log types.
PowerCenter Options
The Integration Service Monitor is available to display system resource usage
information about associated nodes. The window displays resource usage information
about running tasks, including CPU%, memory, and swap usage.
PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-
bit Windows Server 2003 can allocate more memory to sessions and achieve higher
throughputs than the 32-bit version of PowerCenter on Windows Server.
Using the PowerCenter Grid option on Windows Server enables distribution of a session or
sessions in a workflow to multiple servers and reduces the processing load window.
The PowerCenter Grid option requires the directories for each integration service
to be shared with the other servers. This allows integration services to share files such as
cache files among various session runs. With a Cluster File System (CFS), integration
services running on various servers can perform concurrent reads and writes to the
same block of data.


Last updated: 01-Feb-07 18:54
Recommended Performance Tuning
Procedures
Challenge
To optimize PowerCenter load times by employing a series of performance tuning
procedures.
Description
When a PowerCenter session or workflow is not performing at the expected or desired
speed, there is a methodology that can help to diagnose problems that may be
adversely affecting various components of the data integration architecture. While
PowerCenter has its own performance settings that can be tuned, you must consider
the entire data integration architecture, including the UNIX/Windows servers, network,
disk array, and the source and target databases to achieve optimal performance. More
often than not, an issue external to PowerCenter is the cause of the performance
problem. In order to correctly and scientifically determine the most logical cause of the
performance problem, you need to execute the performance tuning steps in a specific
order. This enables you to methodically rule out individual pieces and narrow down the
specific areas on which to focus your tuning efforts.
1. Perform Benchmarking
You should always have a baseline of current load times for a given workflow or
session with a similar row count. Perhaps you are not meeting your required load
window, or you suspect that a process could run more efficiently based on comparison
with similar tasks that run faster. Use the benchmark to estimate your
desired performance goal and tune toward that goal. Begin with the problem
mapping that you created, along with a session and workflow that use all default
settings. This helps to identify which changes have a positive impact on performance.
2. Identify the Performance Bottleneck Area
This step helps to narrow down the areas on which to focus further. Follow the areas
and sequence below when attempting to identify the bottleneck:
G Target
G Source
G Mapping
G Session/Workflow
G System.
The methodology steps you through a series of tests using PowerCenter to identify
trends that point to where you should focus next. Go through these tests in a
scientific manner: run them multiple times before reaching any conclusion,
and always keep in mind that fixing one bottleneck area may create a different
bottleneck. For more information, see Determining Bottlenecks.
3. "Inside" or "Outside" PowerCenter
Depending on the results of the bottleneck tests, optimize inside or outside
PowerCenter. Be sure to perform the bottleneck test in the order prescribed
in Determining Bottlenecks, since this is also the order in which you should make any
performance changes.
Problems outside PowerCenter refer to anything indicating that the source of the
performance problem is external to PowerCenter. The most common performance
problems outside PowerCenter are source or target database problems, network
bottlenecks, and server or operating system problems.
G For source database related bottlenecks, refer to Tuning SQL Overrides and
Environment for Better Performance
G For target database related problems, refer to Performance Tuning Databases
- Oracle, SQL Server, or Teradata
G For operating system problems, refer to Performance Tuning UNIX Systems
or Performance Tuning Windows 2000/2003 Systems for more information.
Problems inside PowerCenter refer to anything that PowerCenter controls, such as the
actual transformation logic and the PowerCenter workflow/session settings. The session
settings contain quite a few memory settings and partitioning options that can greatly
improve performance. Refer to the Tuning Sessions for Better Performance for more
information.
Although there are certain procedures to follow to optimize mappings, keep in mind
that, in most cases, the mapping design is dictated by business logic; there may be a
more efficient way to perform the business logic within the mapping, but you cannot
ignore the necessary business logic to improve performance. Refer to Tuning
Mappings for Better Performance for more information.
4. Re-Execute the Problem Workflow or Session
After you have completed the recommended steps for each relevant performance
bottleneck, re-run the problem workflow or session and compare its load performance
against the baseline. This step is iterative, and should be performed after any
performance-related setting is changed. You are trying to answer the question: did the
performance change have a positive impact? If so, move on to the next bottleneck. Be sure
to prepare detailed documentation at every step along the way so you have a clear record
of what was and wasn't tried.
While it may seem like there are an enormous number of areas where a performance
problem can arise, if you follow the steps for finding the bottleneck(s), and apply the
tuning techniques specific to it, you are likely to improve performance and achieve your
desired goals.


Last updated: 01-Feb-07 18:54
Tuning and Configuring Data Analyzer and Data Analyzer
Reports
Challenge
A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can
be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some
suggestions for tuning Data Analyzer and Data Analyzer reports.
Description
Performance tuning reports occurs both at the environment level and the reporting level. Often report
performance can be enhanced by looking closely at the objective of the report rather than the suggested
appearance. The following guidelines should help with tuning the environment and the report itself.
1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform
benchmarks at various points throughout the day and evening hours to account for inconsistencies
in network traffic, database server load, and application server load. This provides a baseline to
measure changes against.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any
unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts
or graphs. Consider if the report can be broken into multiple reports or presented at a higher level.
These are often ways to create more visually appealing reports and allow for linked detail reports or
drill down to detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report,
schedule the report to run during hours when the system use is minimized. Consider scheduling
large numbers of reports to run overnight. If mid-day updates are required, test the performance at
lunch hours and consider scheduling for that time period. Reports that require filters by users can
often be copied and filters pre-created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used
in the report. Ensure that indexes have been created on dimension keys. If filters are used on
attributes, test the creation of secondary indices to improve the efficiency of the query. Next,
execute reports while a DBA monitors the database environment. This provides the DBA the
opportunity to tune the database for querying. Finally, look into changes in database settings.
Increasing the database memory in the initialization file often improves Data Analyzer performance
significantly.
5. Investigate Network. Reports are simply database queries, which can be found by clicking the
"View SQL" button on the report. Run the query from the report, against the database using a client
tool on the server that the database resides on. One caveat to this is that even the database tool on
the server may contact the outside network. Work with the DBA during this test to use a local
database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and
monitor the database throughout this process. This test may pinpoint if the bottleneck is occurring
on the network or in the database. If, for instance, the query performs well regardless of where it is
executed, but the report continues to be slow, this indicates an application server bottleneck.
Common locations for network bottlenecks include router tables, web server demand, and server
input/output. Informatica does recommend installing Data Analyzer on a dedicated application
server.
6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final
level of tuning involves changes to the database tables. Review the underperforming reports.
Can any of these be generated from aggregate tables instead of from base tables (see the SQL sketch after this list)? Data Analyzer makes
efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an
aggregate table. By studying the existing reports and future requirements, you can determine what key
aggregates can be created in the ETL tool and stored in the database.
Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data
Analyzer. Each time a calculation must be done in Data Analyzer, it is being performed as part of the query
process. To determine if a query can be improved by building these elements in the database, try removing
them from the report and comparing report performance. Consider if these elements are appearing in a
multitude of reports or simply a few.
7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual
report query. To determine if the query is the bottleneck, select the View SQL button on the report.
Next, copy the SQL into a query utility and execute. (DBA assistance may be beneficial here.) If
the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional
report changes are possible. Once you have confirmed that the report is as required, work to edit
the query while continuing to re-test it in a query utility. Additional options include utilizing database
views to cache data prior to report generation. Reports are then built based on the view.
Note: Editing the report query requires query editing for each report change and may require editing during
migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of
performance tuning.
The Data Analyzer repository database should be tuned for an OLTP workload.
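To illustrate the aggregate-table idea in step 6, the following SQL sketch pre-summarizes a hypothetical sales fact table to the month and product-line level. The table and column names are assumptions, and in practice the aggregate would normally be populated by the ETL tool as described above:

CREATE TABLE AGG_SALES_MONTH AS
SELECT t.month_key,
       p.product_line_key,
       SUM(f.revenue) AS total_revenue,
       COUNT(*)       AS order_count
FROM   sales_fact f, time_dim t, product_dim p
WHERE  f.time_key    = t.time_key
AND    f.product_key = p.product_key
GROUP BY t.month_key, p.product_line_key;

Reports that only need monthly totals by product line can then read AGG_SALES_MONTH instead of scanning the much larger base fact table.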
Tuning Java Virtual Machine (JVM)

JVM Layout
The Java Virtual Machine (JVM) is the repository for all live objects, dead objects, and free memory. It has
the following primary jobs:
Execute code
Manage memory
Remove garbage objects
The size of the JVM determines how often and how long garbage collection runs.
The JVM parameters can be set in the "startWebLogic.cmd" or "startWebLogic.sh" if using the WebLogic
application server.
Parameters of the JVM
1. -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like
Data Analyzer, the values should be set equal to each other.
2. Start with -ms=512m -mx=512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage
collection.
3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize
command-line parameter controls the permanent generation's size.
4. The "NewSize" and "MaxNewSize" parameters control the new generation's minimum and maximum
size.
5. -XX:NewRatio=5 divides the heap old-to-new in the ratio 5:1 (i.e., the old generation occupies 5/6 of the
heap while the new generation occupies 1/6 of the heap).
When the new generation fills up, it triggers a minor collection, in which surviving objects
are moved to the old generation.
When the old generation fills up, it triggers a major collection, which involves the entire
object heap. This is more expensive in terms of resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur
less often, but the frequency of major collection increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur
more, but the frequency of major collection decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the
heap size).
9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two
to three CPUs per JVM.
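As an illustration only, these heap and generation settings are typically passed to WebLogic through the JAVA_OPTIONS variable in startWebLogic.sh (or startWebLogic.cmd). The sizes below are assumed starting points, not recommendations for every installation:

JAVA_OPTIONS="-Xms1024m -Xmx1024m -XX:MaxPermSize=128m -XX:NewRatio=5 ${JAVA_OPTIONS}"
export JAVA_OPTIONS

After changing these values, monitor garbage collection behavior (for example, with verbose GC output) before increasing the heap further.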
Other Areas to Tune
Execute Threads
Threads available to process simultaneous operations in Weblogic.
Too few threads means CPUs are under-utilized and jobs are waiting for threads to become
available.
Too many threads means system is wasting resource in managing threads. The OS performs
unnecessary context switching.
The default is 15 threads. Informatica recommends using the default value, but you may need to
experiment to determine the optimal value for your environment.
Connection Pooling
The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.
Initial capacity = 15
Maximum capacity = 15
Sum of connections of all pools should be equal to the number of execution threads.
Connection pooling avoids the overhead of growing and shrinking the pool size dynamically by setting the
initial and maximum pool size at the same level.
Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are
available on: Windows NT/2000 (default installed), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.
Check Enable Native I/O on the server attribute tab. This adds <NativeIOEnabled> to config.xml as true.
For WebSphere, use the Performance Tuner to modify the configurable parameters.
For optimal configuration, separate the application server, the data warehouse database, and the repository
database onto separate dedicated machines.
Application Server-Specific Tuning Details
JBoss Application Server
Web Container. Tune the web container by modifying the following configuration file so that it accepts a
reasonable number of HTTP requests as required by the Data Analyzer installation. Ensure that the web
container has an optimal number of threads available so that it can accept and process more HTTP
requests.
<J BOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml
The following is a typical configuration:
<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector" port="8080" minProcessors="10"
maxProcessors="100" enableLookups="true" acceptCount="20" debug="0" tcpNoDelay="true"
bufferSize="2048" connectionLinger="-1" connectionTimeout="20000" />
The following parameters may need tuning:
minProcessors. Number of threads created initially in the pool.
maxProcessors. Maximum number of threads that can ever be created in the pool.
acceptCount. Controls the length of the queue of waiting requests when no more threads are
available from the pool to process the request.
connectionTimeout. Amount of time to wait before a URI is received from the stream. Default is
20 seconds. This avoids problems where a client opens a connection and does not send any data.
tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be
full. This reduces latency at the cost of more packets being sent over the network. The default is
true.
enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled to
prevent IP spoofing. Enabling this parameter can cause problems when a DNS is misbehaving.
The enableLookups parameter can be turned off when you implicitly trust all clients.
connectionLinger. How long connections should linger after they are closed. Informatica
recommends using the default value: -1 (no linger).
In the Data Analyzer application, each web page can potentially have more than one request to the
application server. Hence, the maxProcessors should always be more than the actual number of concurrent
users. For an installation with 20 concurrent users, a minProcessors of 5 and maxProcessors of 100 is a
suitable value.
If the number of threads is too low, the following message may appear in the log files:
ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for
the first time, Informatica ships Data Analyzer with pre-compiled JSPs.
The following is a typical configuration:
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml
<servlet>
<servlet-name>jsp</servlet-name>
<servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
<init-param>
<param-name>logVerbosityLevel</param-name>
<param-value>WARNING</param-value>
</init-param>
<init-param>
<param-name>development</param-name>
<param-value>false</param-value>
</init-param>
<load-on-startup>3</load-on-startup>
</servlet>
The following parameter may need tuning:
Set the development parameter to false in a production installation.
Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata
information. When it runs reports, it accesses the data sources to get the report information. Data Analyzer
keeps a pool of database connections for the repository. It also keeps a separate database connection pool
for each data source. To optimize Data Analyzer database connections, you can tune the database
connection pools.
Repository Database Connection Pool. To optimize the repository database connection pool, modify the
JBoss configuration file:
<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml
The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For
example, for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data
Analyzer, the configuration file may simply be named DataAnalyzer-ds.xml.
The following is a typical configuration:
<datasources>
<local-tx-datasource>
<jndi-name>jdbc/IASDataSource</jndi-name>
<connection-url>jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url>
<driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
<user-name>powera</user-name>
<password>powera</password>
<exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
<min-pool-size>5</min-pool-size>
<max-pool-size>50</max-pool-size>
<blocking-timeout-millis>5000</blocking-timeout-millis>
<idle-timeout-minutes>1500</idle-timeout-minutes>
</local-tx-datasource>
</datasources>
The following parameters may need tuning:
min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed; that is, it
will be empty until it is first accessed. Once used, it will always have at least the min-pool-size connections.)
max-pool-size. The strict maximum size of the connection pool.
blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no
more free connections are available in the pool.
idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed.
The max-pool-size value is recommended to be at least five more than the maximum number of concurrent
users because there may be several scheduled reports running in the background and each of them needs
a database connection.
A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository
very frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out.
Checking for idle connections may block other threads that require new connections.
Data Source Database Connection Pool. Similar to the repository database connection pools, the data
source also has a pool of connections that Data Analyzer dynamically creates as soon as the first client
requests a connection.
The tuning parameters for these dynamic pools are present in the following file:
<JBOSS_HOME>/bin/IAS.properties file
The following is a typical configuration:
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The following JBoss-specific parameters may need tuning:
dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
dynapool.poolNamePrefix. This parameter is a prefix added to the dynamic JDBC pool name for
identification purposes.
dynapool.waitSec. The maximum amount of time (in seconds) a client will wait to grab a connection from
the pool if none is readily available.
dynapool.refreshTestMinutes. This parameter determines the frequency at which a health check is
performed on the idle connections in the pool. This should not be performed too frequently because it locks
up the connection pool and may prevent other clients from grabbing connections from the pool.
dynapool.shrinkPeriodMins. This parameter determines the amount of time (in minutes) an idle connection
is allowed to be in the pool. After this period, the number of connections in the pool shrinks back to the value
of its initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.
EJB Container
Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than
60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the
scheduling and real-time functionalities.
Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You
can tune the EJB pool parameters in the following file:
<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml
The following is a typical configuration:
<container-configuration>
<container-name>Standard Stateless SessionBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>stateless-rmi-invoker</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<!-- CMT -->
<interceptor transaction="Container">org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor transaction="Container">org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor</interceptor>
<interceptor transaction="Bean">org.jboss.ejb.plugins.TxInterceptorBMT</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to
true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. If
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there
are requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two
parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper
iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that
only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to
be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that
requests wait for an object to be made available in the pool.
Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning
parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system
delivers messages to the MDB when they are available.
To tune the MDB parameters, modify the following configuration file:
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml
The following is a typical configuration:
<container-configuration>
<container-name>Standard Message Driven Bean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>message-driven-bean</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor</interceptor>
<!-- CMT -->
<interceptor transaction="Container">org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor transaction="Container">org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor</interceptor>
<interceptor transaction="Bean">org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to
true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there
are requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two
parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper
iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that
only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to
be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that
requests wait for an object to be made available in the pool.
Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to
CMP (container-managed persistence). The EJB tuning parameters are very similar to the stateless bean
tuning parameters.
The EJB tuning parameters are in the following configuration file:
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml
The following is a typical configuration:
<container-configuration>
<container-name>Standard BMP EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-rmi-invoker</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor</interceptor>
<interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntitySynchronizationInterceptor</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool</instance-pool>
<instance-cache>org.jboss.ejb.plugins.EntityInstanceCache</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy</cache-policy>
<cache-policy-conf>
<min-capacity>50</min-capacity>
<max-capacity>1000000</max-capacity>
<overager-period>300</overager-period>
<max-bean-age>600</max-bean-age>
<resizer-period>400</resizer-period>
<max-cache-miss-period>60</max-cache-miss-period>
<min-cache-miss-period>1</min-cache-miss-period>
<cache-load-factor>0.75</cache-load-factor>
</cache-policy-conf>
</container-cache-conf>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
<commit-option>A</commit-option>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to
true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there
are requests for more objects. However, only the <MaximumSize> number of objects are returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two
parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper
iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that
only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to
be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that
requests will wait for an object to be made available in the pool.
The J Boss Application Server can be configured to have a pool of
other custom applications, you can optimize the RMI thread p
To optimize the RMI pool, modify the foll
<J BOSS_HOME>/server/informatica/conf/jboss-service.xml
The following is a typical configuration:
<mbeancode="org.jboss.invocation.pooled.server.PooledInv
">
<attribute name="NumAcceptThreads">1</attribute>
<attribute name="MaxPoolSize">300</attribute>
<attribute name="ClientMaxPoolSize">300</attribute>
<attribute name="SocketTimeout">60000</attribute>
<attribute name="ServerBindAddress"></attribute>
<attribute name="ServerBindPort">0</attribute>
<attribute name="ClientConnectAddre
INFORMATICA CONFIDENTIAL BEST PRACTICE 555 of 702
<attribute nam
<depends
e="EnableTcpNoDelay">false</attribute>
optional-attribute-name="TransactionManagerService">
jboss:service=TransactionManager
The follo
lient.
Backlog. The number of requests in the queue when all the processing threads are in use.

re packets will be sent across the network.
ation Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior
of some of the parameters and arrive at a good settings.
Navi te
following

en

optimal.

ads
vironment, there is likely to be more than one server instance that may be
ines. In such a scenario, be sure that the changes have been properly
propagated to all of the server instances.
n certain circumstances (e.g., import of large XML files), the default value
nt and should be increased. This parameter can be modified during
runtime also.
Dia o
Disable the trace in a production environment .
Navigate to Application Servers >[your_server_instance] >Administration Services >Diagnostic
Trace Service and make sure Enable Tracing is not checked.
</depends>
</mbean>
wing parameters may need tuning:
NumAcceptThreads. The controlling threads used to accept connections from the client.
MaxPoolSize. A strict maximum size for the pool of threads to service requests on the server.
ClientMaxPoolSize. A strict maximum size for the pool of threads to service requests on the c
EnableTcpDelay. Indicates whether information should be sent before the buffer is full. Setting it to
true may increase the network traffic because mo
WebSphere Application Server 5.1
The Tivoli Performance Viewer can be used to observe the behavior of some of the parameters and arrive
at good settings.
Web Container
Navigate to Application Servers >[your_server_instance] >Web Container >Thread Pool to tune the
following parameters.
Minimum Size: Specifies the minimum number of threads to allow in the pool. The default value of 10 is
appropriate.
Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent
usage scenario (with a 3 VM load-balanced configuration), a value of 50-60 has been determined to be
optimal.
Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a
thread is reclaimed. The default of 3500ms is considered optimal.
Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured
for the thread pool. Be sure to leave this option unchecked. Also, the maximum threads should be
hard-limited to the value given in the Maximum Size.
Note: In a load-balanced environment, there is likely to be more than one server instance that may be
spread across multiple machines. In such a scenario, be sure that the changes have been properly
propagated to all of the server instances.
Transaction Services
Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value
of 120 seconds may not be sufficient and should be increased. This parameter can be modified during
runtime also.
Diagnostic Trace Services
Disable the trace in a production environment.
Navigate to Application Servers >[your_server_instance] >Administration Services >Diagnostic
Trace Service and make sure Enable Tracing is not checked.
Debugging Services
Ensure that the tracing is disabled in a production environment.
Navigate to Application Servers >[your_server_instance] >Logging and Tracing >Diagnostic Trace
Service >Debugging Service and make sure Startup is not checked.
Performance Monitoring Services
This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to
ping the application server after a certain interval; if the server is found to be dead, it then tries to restart the
server.
Navigate to Application Servers >[your_server_instance] >Process Definition >MonitoringPolicy and tune
the parameters according to a policy determined for each Data Analyzer installation.
Note: The parameter Ping Timeout determines the time after which a no-response from the server implies
that it is faulty. The monitoring service then attempts to kill the server and restart it if Automatic restart is
checked. Take care that Ping Timeout is not set to too small a value.
Process Definitions (JVM Parameters)
For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the
minimum and the maximum heap size be set to the same values. This avoids the heap allocation-
reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica
recommends setting the values of minimum heap and maximum heap size to at least 1000MB. Further
tuning of this heap-size is recommended after carefully studying the garbage collection behavior by turning
on the verbosegc option.
The following is a list of Java parameters (for IBM JVM 1.4.1) that should not be modified from the default
values for a Data Analyzer installation:
-Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap
compaction results in heap fragmentation. Since Data Analyzer frequently allocates large objects,
heap fragmentation can result in OutOfMemory exceptions.
-Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out
compaction, regardless of whether it is useful.
-Xgcthreads. This controls the number of garbage collection helper threads created by the
JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide
the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during
garbage collection.
-Xclassnogc. This disables collection of class objects.
-Xinitsh. This sets the initial size of the application-class system heap. The system heap is
expanded as needed and is never garbage collected.
You may want to alter the following parameters after carefully examining the application server processes:
Navigate to Application Servers >[your_server_instance] >Process Definition >Java Virtual Machine.
Verbose garbage collection. Check this option to turn on verbose garbage collection. This can help
in understanding the behavior of the garbage collection for the application. It has a very low
overhead on performance and can be turned on even in the production environment.
Initial heap size. This is the ms value. Only the numeric value (without MB) needs to be specified.
For concurrent usage, the initial heap-size should be started with a 1000 and, depending on the
garbage collection behavior, can be potentially increased up to 2000. A value beyond 2000 may
actually reduce throughput because the garbage collection cycles will take more time to go through
the large heap, even though the cycles may be occurring less frequently.
Maximum heap size. This is the mx value. It should be equal to the Initial heap size value.
RunHProf. This should remain unchecked in production mode, because it slows down the VM considerably.
Debug Mode. This should remain unchecked in production mode, because it slows down the VM considerably.
Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).
Performance Monitoring Services
Be sure that performance monitoring services are not enabled in a production environment.
Navigate to Application Servers >[your_server_instance] >Performance Monitoring Services and be sure
Startup is not checked.
Database Connection Pool
The repository database connection pool can be configured by navigating to JDBC Providers >User-
defined JDBC Provider >Data Sources >IASDataSource >Connection Pools.
The various parameters that may need tuning are:
Connection Timeout. The default value of 180 seconds should be good. This implies that after 180
seconds, the request to grab a connection from the pool will time out. After it times out, Data Analyzer will
throw an exception. In that case, the pool size may need to be increased.
Max Connections. The maximum number of connections in the pool. Informatica recommends a value of
50 for this.
Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10
for this.
Reap Time. This specifies the interval at which the pool maintenance thread runs. The maintenance thread
should not run too frequently because, while it is running, it blocks the whole pool and no process can grab
a new connection from the pool. If the database and the network are reliable, this should have a very high
value (e.g., 1000).
Unused Timeout. This specifies the time in seconds after which an unused connection will be discarded
until the pool size reaches the minimum size. In a highly concurrent usage, this should be a high value. The
default of 1800 seconds should be fine.
Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database
and the network are stable, there should not be a reason for age timeout. The default is 0 (i.e., connections
do not age). If the database or the network connection to the repository database frequently comes down
(compared to the life of the AppServer), this can be used to age-out the stale connections.
Much like the repository database connection pools, the data source or data warehouse databases also
have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a
request.
The tuning parameters for these dynamic pools are present in the
<WebSphere_Home>/AppServer/IAS.properties file.
The following is a typical configuration:
#
#Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The various parameters that may need tuning are:
dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.
dynapool.maxCapacity - the maximum number of connections that the data-source pool may grow
up to.
dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification
purposes.
dynapool.waitSec - the maximum amount of time (in seconds) that a client will wait to grab a
connection from the pool if none is readily available.
dynapool.refreshTestMinutes - determines the frequency at which a health check on the idle
connections in the pool is performed. Such checks should not be performed too frequently because
they lock up the connection pool and may prevent other clients from grabbing connections from the
pool.
dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is
allowed to be in the pool. After this period, the number of connections in the pool decreases (to its
initialCapacity). This is done only if allowShrinking is set to true.
Message Listeners Services
To process scheduled reports, Data Analyzer uses Message-Driven-Beans. It is possible to run multiple
reports within one schedule in parallel by increasing the number of instances of the MDB catering to the
Scheduler (InfScheduleMDB). Take care however, not to increase the value to some arbitrarily high value
since each report consumes considerable resources (e.g., database connections, and CPU processing at
both the application-server and database server levels) and setting this to a very high value may actually be
detrimental to the whole system.
Navigate to Application Servers >[your_server_instance] >Message Listener Service >Listener Ports >
IAS_ScheduleMDB_ListenerPort .
The parameters that can be tuned are:
Maximum sessions. The default value is one. On a highly-concurrent user scenario, Informatica
does not recommend going beyond five.
Maximum messages. This should remain as one. This implies that each report in a schedule will be
executed in a separate transaction instead of a batch. Setting it to more than one may have
unwanted effects like transaction timeouts, and the failure of one report may cause all the reports in
the batch to fail.
Plug-in Retry Intervals and Connect Timeouts
When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform
the load-balancing between each server in the cluster. The proxy http-server sends the request to the plug-
in and the plug-in then routes the request to the proper application-server.
The plug-in file can be generated automatically by navigating to
Environment >Update web server plugin configuration.
The default plug-in file contains ConnectTimeOut=0, which means that it relies on the tcp timeout setting of
the server. It is possible to have different timeout settings for different servers in the cluster. The timeout
setting implies that if the server does not respond within the given number of seconds, it is marked as
down and the request is sent over to the next available member of the cluster.
The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as
down. The default value is 10 seconds. This means if a cluster member is marked as down, the server does
not try to send a request to the same member for 10 seconds.
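For reference, both settings appear as attributes in the generated plugin-cfg.xml. The fragment below is an illustrative sketch only, with hypothetical cluster and server names; the file generated for your environment will contain additional attributes and transport definitions:

<ServerCluster Name="DataAnalyzerCluster" RetryInterval="10">
<Server Name="DA_Server1" ConnectTimeout="15"/>
<Server Name="DA_Server2" ConnectTimeout="15"/>
</ServerCluster>

With these assumed values, a server that does not respond within 15 seconds is marked as down and is not retried for 10 seconds.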

Last updated: 13-Feb-07 17:59

Tuning Mappings for Better Performance
Challenge
In general, mapping-level optimization takes time to implement, but can significantly boost performance.
Sometimes the mapping is the biggest bottleneck in the load process because business rules determine
the number and complexity of transformations in a mapping.
Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic
issues. Tuning mappings is a grouped approach. The first group of techniques can be of assistance almost
universally, bringing about a performance increase in most scenarios. The second group of tuning
processes may yield only a small performance increase, or it can be of significant value, depending on the situation.
Some factors to consider when choosing tuning processes at the mapping level include the specific
environment, software/ hardware limitations, and the number of rows going through a mapping. This Best
Practice offers some guidelines for tuning mappings.
Description
Analyze mappings for tuning only after you have tuned the target and source for peak performance. To
optimize mappings, you generally reduce the number of transformations in the mapping and delete
unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations),
limit connected input/output or output ports. Doing so can reduce the amount of data the transformations
store in the data cache. Having too many Lookups and Aggregators can encumber performance because
each requires index cache and data cache. Since both are fighting for memory space, decreasing the
number of these transformations in a mapping can help improve speed. Splitting them up into different
mappings is another option.
Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on
the cache directory. Unless the seek/access time is fast on the directory itself, having too many
Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk
and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.
Consider Single-Pass Reading
If several mappings use the same data source, consider a single-pass reading. If you have several
sessions that use the same sources, consolidate the separate mappings with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the
separate data flows.
Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that
function is called in the session. For example, if you need to subtract percentage from the PRICE ports for
both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage
before splitting the pipeline.
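As a sketch of that idea (the port and column names here are hypothetical), an Expression transformation placed before the pipeline splits could hold a single output port such as:

OUT_NET_PRICE = PRICE - (PRICE * DISCOUNT_PCT / 100)

Both the Aggregator and the Rank branches can then use OUT_NET_PRICE, so the calculation is performed once per row instead of once per branch.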
Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override
of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned
depends on the underlying source or target database system. See Tuning SQL Overrides and
Environment for Better Performance for more information .
Scrutinize Datatype Conversions
PowerCenter Server automatically makes conversions between compatible datatypes. When these
conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from
an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary.
In some instances however, datatype conversions can help improve performance. This is especially true
when integer values are used in place of other datatypes for performing comparisons using Lookup and
Filter transformations.
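As a hypothetical illustration (the port names are assumptions), comparing on an integer code is generally cheaper than comparing the equivalent string in a Filter or Lookup condition:

-- comparison on a string port:
CUST_STATUS_DESC = 'ACTIVE'
-- equivalent comparison on an integer port, typically faster:
CUST_STATUS_CD = 1

If the integer code is already available in the source, using it avoids both the string comparison and any implicit datatype conversion.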
Eliminate Transformation Errors
Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During
transformation errors, the PowerCenter Server engine pauses to determine the cause of the error,
removes the row causing the error from the data flow, and logs the error in the session log.
Transformation errors can be caused by many things including: conversion errors, conflicting mapping
logic, any condition that is specifically set up as an error, and so on. The session log can help point out the
cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for
these transformations. If you need to run a session that generates a large number of transformation errors,
you might improve performance by setting a lower tracing level. However, this is not a long-term response
to transformation errors. Any source of errors should be traced and eliminated.
Optimize Lookup Transformations
There are a several ways to optimize lookup transformations that are set up in a mapping.
When to Cache Lookups
Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table
and queries the lookup cache during the session. When this option is not enabled, the PowerCenter
Server queries the lookup table on a row-by-row basis.
Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for
lookups are sufficient to ensure that caches will not page to disks. Information regarding memory and
cache sizing for Lookup transformations are covered in the Best Practice: Tuning Sessions for Better
Performance.
A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard
to the number of rows expected to be processed. For example, consider the following example.
In Mapping X, the source and lookup tables contain the following number of records:
ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records
Number of Disk Reads
                            Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                       200                     0
  Read Source Records              5000                  5000
  Execute Lookup                      0                  5000
  Total # of Disk Reads            5200                 10000
LKP_DIM_ITEMS
  Build Cache                    100000                     0
  Read Source Records              5000                  5000
  Execute Lookup                      0                  5000
  Total # of Disk Reads          105000                 10000
Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a
total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it
will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the
lookup table is small in comparison with the number of times the lookup is executed. So this lookup should
be cached. This is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in
105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk
reads would total 10,000. In this case the number of records in the lookup table is not small in comparison
with the number of times the lookup will be executed. Thus, the lookup should not be cached.
Use the following eight step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational
source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created
in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note
this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds
and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and
divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:

(LS*NRS*CRS)/(CRS-NRS) = X

Where X is the breakeven point. If your expected number of source records is less than X, it is better
not to cache the lookup. If it is more than X, it is better to cache the lookup.

For example:

Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second (CRS=232).
Assume with a non-cached lookup the load is 147 rows per second (NRS = 147).

The formula would result in: (166*147*232)/(232-147) = 66,603.

Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more
than 66,603 records, then the lookup should be cached.
Sharing Lookup Caches
There are a number of methods for sharing lookup caches:
G Within a specific session run for a mapping, if the same lookup is used multiple times in a
mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup.
Using the same lookup multiple times in the mapping will be more resource intensive with each
successive instance. If multiple cached lookups are from the same table but are expected to
return different columns of data, it may be better to set up the multiple lookups to bring back the
same columns even though not all return ports are used in all lookups. Bringing back a common
set of columns may reduce the number of disk reads.
G Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple
runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a
persistent cache is set in the lookup properties, the memory cache created for the lookup during
the initial run is saved to the PowerCenter Server. This can improve performance because the
Server builds the memory cache from cache files instead of the database. This feature should
only be used when the lookup table is not expected to change between session runs.
G Across different mappings and sessions, the use of a named persistent cache allows
sharing an existing cache file.
Reducing the Number of Cached Rows
There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the
WHERE clause to reduce the set of records included in the resulting cache.
Note: If you use a SQL override in a lookup, the lookup must be cached.
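As an illustration only (the table and column names below are hypothetical), a lookup SQL override might
restrict the cache to the subset of rows the lookup can actually match:

SELECT ITEM_ID, ITEM_NAME, MANUFACTURER_ID
FROM DIM_ITEMS
-- Cache only current rows; historical versions can never satisfy the lookup condition
WHERE CURRENT_FLAG = 'Y'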
Optimizing the Lookup Condition
In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first
in order to optimize lookup performance.
Indexing the Lookup Table
The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a
result, indexes on the database table should include every column used in a lookup condition. This can
improve performance for both cached and un-cached lookups.
G In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement
used to create the cache. Columns used in the ORDER BY condition should be indexed.
The session log will contain the ORDER BY statement.
G In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the lookup transformation, performance can be helped by indexing
columns in the lookup condition.
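For example (the index, table, and column names here are hypothetical), if a lookup condition matches on
ITEM_ID and MANUFACTURER_ID, an index covering both columns supports the lookup query as well as the
ORDER BY used to build the cache:

-- Composite index on the columns used in the lookup condition
CREATE INDEX IDX_DIM_ITEMS_LKP ON DIM_ITEMS (ITEM_ID, MANUFACTURER_ID);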
Use a Persistent Lookup Cache for Static Lookups
If the lookup source does not change between sessions, configure the Lookup transformation to use a
persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to
session, eliminating the time required to read the lookup source.
Optimize Filter and Router Transformations
Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of
using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use
a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve
performance.
Avoid complex expressions when creating the filter condition. Filter transformations are most
effective when a simple integer or TRUE/FALSE expression is used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if
rejected rows do not need to be saved.
Replace multiple filter transformations with a router transformation. This reduces the number of
transformations in the mapping and makes the mapping easier to follow.
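As a sketch (the source table and condition are hypothetical), filtering in the Source Qualifier keeps the
rejected rows from ever entering the pipeline:

-- Source Qualifier SQL override: only shipped orders reach the mapping
SELECT ORDER_ID, CUSTOMER_ID, ORDER_AMOUNT
FROM ORDERS
WHERE ORDER_STATUS = 'SHIPPED'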
Optimize Aggregator Transformations
Aggregator Transformations often slow performance because they must group data before processing it.
Use simple columns in the group by condition to make the Aggregator Transformation more efficient.
When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex
expressions in the Aggregator expressions, especially in GROUP BY ports.
Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be
sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option
decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is
sorted by group and, as a group is passed through an Aggregator, calculations can be performed and
information passed on to the next transformation. Without sorted input, the Server must wait for all rows of
data before processing aggregate calculations. Use of the Sorted Inputs option is usually accompanied by
a Source Qualifier which uses the Number of Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can
only be used if the source data can be sorted. Further, using this option assumes that a mapping is using
an Aggregator with Sorted Input option. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to use the previous row of
data to determine whether the current row is a part of the current group or is the beginning of a new group.
Thus, if the row is a part of the current group, then its data would be used to continue calculating the
current group function. An Update Strategy Transformation would follow the Expression Transformation
and set the first row of a new group to insert, and the following rows to update.
Use incremental aggregation if you can capture changes from the source that change less than half the
target. When using incremental aggregation, you apply captured changes in the source to aggregate
calculations in a session. The PowerCenter Server updates your target incrementally, rather than
processing the entire source and recalculating the same calculations every time you run the session.
Joiner Transformation
Joining Data from the Same Source
You can join data from the same source in the following ways:
G Join two branches of the same pipeline.
G Create two instances of the same source and join pipelines from these source instances.
You may want to join data from the same source if you want to perform a calculation on part of the data
and join the transformed data with the original data. When you join the data using this method, you can
maintain the original data and transform parts of that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline. When you branch
a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at
least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for
sorted input.
If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
For example, you may have a source with the following ports:
G Employee
G Department
G Total Sales
In the target table, you want to view the employees who generated sales that were greater than the
average sales for their respective departments. To accomplish this, you create a mapping with the
following transformations:
G Sorter transformation. Sort the data.
G Sorted Aggregator transformation. Average the sales data and group by department. When
you perform this aggregation, you lose the data for individual employees. To maintain employee
data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch
with the same data to the Joiner transformation to maintain the original data. When you join both
branches of the pipeline, you join the aggregated data with the original data.
G Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated
data with the original data.
G Filter transformation. Compare the average sales data against the sales data for each employee
and filter out employees whose sales are not above the department average.
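For reference, the mapping described above produces the same result set as the following SQL sketch (the
table and column names are hypothetical):

-- Employees whose sales exceed the average sales of their department
SELECT e.EMPLOYEE, e.DEPARTMENT, e.TOTAL_SALES
FROM EMP_SALES e,
     (SELECT DEPARTMENT, AVG(TOTAL_SALES) AS AVG_SALES
      FROM EMP_SALES
      GROUP BY DEPARTMENT) d
WHERE d.DEPARTMENT = e.DEPARTMENT
AND e.TOTAL_SALES > d.AVG_SALES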

Note: You can also join data from output groups of the same transformation, such as the Custom
transformation or XML Source Qualifier transformations. Place a Sorter transformation between each
output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.
Joining two branches can affect performance if the Joiner transformation receives data from one branch
much later than the other branch. The Joiner transformation caches all the data from the first branch, and
writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk
when it receives the data from the second branch. This can slow processing.
You can also join same source data by creating a second instance of the source. After you create the
second source instance, you can join the pipelines from the two source instances.
Note: When you join data using this method, the PowerCenter Server reads the source data for each
source instance, so performance can be slower than joining two branches of a pipeline.
Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a
source:
G Join two branches of a pipeline when you have a large source or if you can read the source data
only once. For example, you can only read source data from a message queue once.
G Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you
use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
G Join two instances of a source when you need to add a blocking transformation to the pipeline
between the source and the Joiner transformation.
G Join two instances of a source if one pipeline may process much more slowly than the other
pipeline.
Performance Tips
Use the database to do the join when sourcing data from the same database schema. Database
systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a
join condition should be used when joining multiple tables from the same database schema.
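As a sketch (the table and column names are hypothetical), a Source Qualifier SQL override that lets the
database perform the join might look like this:

-- Join pushed to the database instead of using a Joiner transformation
SELECT o.ORDER_ID, o.ORDER_AMOUNT, c.CUSTOMER_NAME
FROM ORDERS o, CUSTOMERS c
WHERE c.CUSTOMER_ID = o.CUSTOMER_ID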
Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of
data is also smaller.
Join sorted data when possible. You can improve session performance by configuring the Joiner
transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the
PowerCenter Server improves performance by minimizing disk input and output. You see the greatest
performance improvement when you work with large data sets.
For an unsorted Joiner transformation, designate the source with fewer rows as the master source.
For optimal performance and disk storage, the master source should be the source with fewer rows.
During a session, the Joiner transformation compares each row of the master source against the detail
source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which
speeds the join process.
For a sorted Joiner transformation, designate as the master source the source with fewer duplicate
key values. For optimal performance and disk storage, designate the master source as the source with
fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it
caches rows for one hundred keys at a time. If the master source contains many rows with the same key
value, the PowerCenter Server must cache more rows, and performance can be slowed.
Optimizing sorted joiner transformations with partitions. When you use partitions with a sorted Joiner
transformation, you may optimize performance by grouping data and using n:n partitions.
Add a hash auto-keys partition upstream of the sort origin
To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you
must group and sort data. To group data, ensure that rows with the same key value are routed to the same
partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a
hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort
the data ensures that you maintain grouping and sort the data within each group.
Use n:n partitions
You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When
you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not
need to cache all of the master data. This reduces memory usage and speeds processing. When you use
1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache
to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline,
it must then read the data from disk to compare the master and detail pipelines.
Optimize Sequence Generator Transformations
Sequence Generator transformations need to determine the next available sequence number; thus,
increasing the Number of Cached Values property can increase performance. This property determines
the number of values the PowerCenter Server caches at one time. If it is set to cache no values, then the
PowerCenter Server must query the repository each time to determine the next number to be used. You
may consider configuring the Number of Cached Values to a value greater than 1000. Note that any
cached values not used in the course of a session are lost, because the repository is updated to the start
of the next block of values the next time the Sequence Generator is called.
Avoid External Procedure Transformations
For the most part, making calls to external procedures slows a session. If possible, avoid the use of these
Transformations, which include Stored Procedures, External Procedures, and Advanced External
Procedures.
Field-Level Transformation Optimization
As a final step in the tuning process, you can tune expressions used in transformations. When examining
expressions, focus on complex expressions and try to simplify them when possible.
To help isolate slow expressions, do the following:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a
constant.
5. Run and time the edited session.
Processing field level transformations takes time. If the transformation expressions are complex, then
processing is even slower. It's often possible to get a 10 to 20 percent performance improvement by
optimizing complex field level transformations. Use the target table mapping reports or the Metadata
Reporter to examine the transformations. Likely candidates for optimization are the fields with the most
complex expressions. Keep in mind that there may be more than one field causing performance problems.
Factoring Out Common Logic
Factoring out common logic can reduce the number of times a mapping performs the same logic. If a
mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the
logic to be performed just once. For example, a mapping has five target tables. Each target requires a
Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup
to a position before the data flow splits.
Minimize Function Calls
Anytime a function is called it takes resources to process. There are several common examples where
function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the
PowerCenter Server must search and group the data. Thus, the following expression:
SUM(Column A) + SUM(Column B)
Can be optimized to:
SUM(Column A + Column B)
In general, operators are faster than functions, so operators should be used whenever possible. For
example, if you have an expression that involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
It can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical
statements to be written in a more compact fashion. For example:
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))
Can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in
three IIFs, three comparisons, and two additions.
Be creative in making expressions more efficient. The following is an example of rework of an expression
that eliminates three comparisons down to one:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
Can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')
(Note that this particular rewrite is only equivalent when X is restricted to values for which both tests agree.)
Calculate Once, Use Many Times
Avoid calculating or testing the same value multiple times. If the same sub-expression is used several
times in a transformation, consider making the sub-expression a local variable. The local variable can be
used only within the transformation in which it was created. Calculating the variable only once and then
referencing the variable in following sub-expressions improves performance.
Choose Numeric vs. String Operations
The PowerCenter Server processes numeric operations faster than string operations. For example, if a
lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID,
configuring the lookup around EMPLOYEE_ID improves performance.
Optimizing Char-Char and Char-Varchar Comparisons
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows
each time it finds trailing blank spaces in the row. To resolve this, set the Treat CHAR as CHAR On Read
option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end of
CHAR source fields.
Use DECODE Instead of LOOKUP
When a LOOKUP function is used, the PowerCenter Server must query a table in the database. When a
DECODE function is used, the lookup values are incorporated into the expression itself, so the server does
not need to query a separate table. Thus, when looking up a small set of unchanging values, using
DECODE may improve performance.
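For example, a small, unchanging code set can be decoded inline rather than looked up in a table (the
column name and code values here are hypothetical):

-- Inline translation of a small, static code set
DECODE(STATUS_CODE, 'A', 'Active',
                    'I', 'Inactive',
                    'P', 'Pending',
                         'Unknown')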
Reduce the Number of Transformations in a Mapping
Because there is always overhead involved in moving data among transformations, try, whenever
possible, to reduce the number of transformations. Also, remove unnecessary links between
transformations to minimize the amount of data moved. This is especially important with data being pulled
from the Source Qualifier Transformation.
Use Pre- and Post-Session SQL Commands
You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier
transformation and in the Properties tab of the target instance in a mapping. To increase the load speed,
use these commands to drop indexes on the target before the session runs, then recreate them when the
session completes.
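For example (the index and table names are hypothetical), the target's pre- and post-session SQL might be:

-- Pre-session SQL on the target: drop the index before the bulk load
DROP INDEX IDX_FACT_SALES_CUST;

-- Post-session SQL: recreate the index once the load completes
CREATE INDEX IDX_FACT_SALES_CUST ON FACT_SALES (CUSTOMER_ID);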
Apply the following guidelines when using SQL statements:
G You can use any command that is valid for the database type. However, the PowerCenter Server
does not allow nested comments, even though the database may.
G You can use mapping parameters and variables in SQL executed against the source, but not
against the target.
G Use a semi-colon (;) to separate multiple statements.
G The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/.
G If you need to use a semi-colon outside of quotes or comments, you can escape it with a back
slash (\).
G The Workflow Manager does not validate the SQL.
Use Environmental SQL
For relational databases, you can execute SQL commands in the database environment when connecting
to the database. You can use this for source, target, lookup, and stored procedure connections. For
instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the
guidelines listed above for using the SQL statements.
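As a sketch (the statement shown is Oracle syntax; adjust it for other databases), the environment SQL for a
connection might set the isolation level like this:

-- Connection environment SQL: set the isolation level for this database session
ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED;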
Use Local Variables
You can use local variables in Aggregator, Expression, and Rank transformations.
Temporarily Store Data and Simplify Complex Expressions
Rather than parsing and validating the same expression each time, you can define these components as
variables. This also allows you to simplify complex expressions. For example, the following expressions:
AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
can use variables to simplify complex expressions and temporarily store data:
Port Value
V_CONDITION1 JOB_STATUS = 'Full-time'
V_CONDITION2 OFFICE_ID = 1000
AVG_SALARY AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM_SALARY SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )
Store Values Across Rows
You can use variables to store data from prior rows. This can help you perform procedural calculations.
To compare the previous state to the state just read:
IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )
Capture Values from Stored Procedures
Variables also provide a way to capture multiple columns of return values from stored procedures.


Last updated: 13-Feb-07 17:43
Tuning Sessions for Better Performance
Challenge
Running sessions is where the rubber meets the road. A common misconception is that
this is the area where most tuning should occur. While it is true that various specific
session options can be modified to improve performance, PowerCenter 8 also provides
Grid and Pushdown Optimization options that can improve performance
tremendously.
Description
Once you have optimized the source database, target database, and mapping, you can focus on
optimizing the session. The greatest area for improvement at the session level usually
involves tweaking memory cache settings. The Aggregator (without sorted ports),
Joiner, Rank, Sorter and Lookup transformations (with caching enabled) use caches.
The PowerCenter Server uses index and data caches for each of these
transformations. If the allocated data or index cache is not large enough to store the
data, the PowerCenter Server stores the data in a temporary disk file as it processes
the session data. Each time the PowerCenter Server pages to the temporary file,
performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining
the performance details. The transformation_readfromdisk or
transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or
Joiner transformation indicate the number of times the PowerCenter Server must page
to disk to process the transformation. Index and data caches should both be sized
according to the requirements of the individual lookup. The sizing can be done using
the estimation tools provided in the Transformation Guide, or through observation of
actual cache sizes in the session caching directory.
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used by
the PowerCenter Server for these files is PM[type of transformation][generated
session instance id number]_[transformation instance id number]_[partition index]
.dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint.
Informatica recommends that the cache directory be local to the PowerCenter
Server. A RAID 0 arrangement that gives maximum performance with no redundancy is
recommended for volatile cache file directories (i.e., no persistent caches).
If the PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow session
performance, the RAM allocated needs to be available on the server. If the server
doesn't have available RAM and uses paged memory, your session is again accessing
the hard disk. In this case, it is more efficient to allow PowerCenter to page the data
rather than the operating system. Adding additional memory to the server is, of course,
the best solution.
Refer to Session Caches in the Workflow Administration Guide for detailed information
on determining cache sizes.
The PowerCenter Server writes to the index and data cache files during a session in
the following cases:
G The mapping contains one or more Aggregator transformations, and the
session is configured for incremental aggregation.
G The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for the
first time.
G The mapping contains a Lookup transformation that is configured to initialize
the persistent lookup cache.
G The Data Transformation Manager (DTM) process in a session runs out of
cache memory and pages to the local cache files. The DTM may create
multiple files when processing large amounts of data. The session fails if the
local directory runs out of disk space.
When a session is running, the PowerCenter Server writes a message in the session
log indicating the cache file name and the transformation name. When a session
completes, the DTM generally deletes the overflow index and data cache files.
However, index and data files may exist in the cache directory if the session is
configured for either incremental aggregation or to use a persistent lookup cache.
Cache files may also remain if the session does not complete successfully.
Configuring Automatic Memory Settings
PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you
can configure the Integration Service to automatically calculate cache memory settings
at run time. When you run a session, the Integration Service allocates buffer memory to
the session to move the data from the source to the target. It also creates session
caches in memory. Session caches include index and data caches for the Aggregator,
Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches.
The values stored in the data and index caches depend upon the requirements of the
transformation. For example, the Aggregator index cache stores group values as
configured in the group by ports, and the data cache stores calculations based on the
group by ports. When the Integration Service processes a Sorter transformation or
writes data to an XML target, it also creates a cache.
Configuring Session Cache Memory
The Integration Service can determine cache memory requirements for the Lookup,
Aggregator, Rank, Joiner, and Sorter transformations, as well as XML targets.
You can configure auto for the index and data cache sizes in the transformation
properties or on the Mapping tab of the session properties.
Max Memory Limits
Configuring maximum memory limits allows you to ensure that you reserve a
designated amount or percentage of memory for other processes. You can configure
the memory limit as a numeric value and as a percent of total memory. Because
available memory varies, the Integration Service bases the percentage value on the
total memory on the Integration Service process machine.
For example, you configure automatic caching for three Lookup transformations in a
session. Then, you configure a maximum memory limit of 500MB for the session. When
you run the session, the Integration Service divides the 500MB of allocated memory
among the index and data caches for the Lookup transformations.
When you configure a maximum memory value, the Integration Service divides
memory among transformation caches based on the transformation type.
When you configure both a numeric value and a percentage, the Integration Service
compares the values and uses the lower value as the maximum memory limit.
When you configure automatic memory settings, the Integration Service specifies a
minimum memory allocation for the index and data caches. The Integration Service
allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for
each transformation instance. If you configure a maximum memory limit that is less
than the minimum value for an index or data cache, the Integration Service overrides
this value. For example, if you configure a maximum memory value of 500 bytes for
INFORMATICA CONFIDENTIAL BEST PRACTICE 576 of 702
session containing a Lookup transformation, the Integration Service overrides or
disable the automatic memory settings and uses the default values.
When you run a session on a grid and you configure Maximum Memory Allowed for
Auto Memory Attributes, the Integration Service divides the allocated memory among
all the nodes in the grid. When you configure Maximum Percentage of Total Memory
Allowed for Auto Memory Attributes, the Integration Service allocates the specified
percentage of memory on each node in the grid.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
G Allocate enough space to hold at least one row in each aggregate
group.
G Remember that you only need to configure cache memory for an Aggregator
transformation that does not use sorted ports. The PowerCenter Server uses
Session Process memory to process an Aggregator transformation with sorted
ports, not cache memory.
G Incremental aggregation can improve session performance. When it is used,
the PowerCenter Server saves index and data cache information to disk at the
end of the session. The next time the session runs, the PowerCenter Server
uses this historical information to perform the incremental aggregation. The
PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and
saves them to the cache directory. Mappings that have sessions which use
incremental aggregation should be set up so that only new detail records are
read with each subsequent run.
G When configuring Aggregate data cache size, remember that the data cache
holds row data for variable ports and connected output ports only. As a result,
the data cache is generally larger than the index cache. To reduce the data
cache size, connect only the necessary output ports to subsequent
transformations.
Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads
from master and detail sources concurrently and builds index and data caches based
on the master rows. The PowerCenter Server then performs the join based on the
detail source data and the cache data.
The number of rows the PowerCenter Server stores in the cache depends on the
partitioning scheme, the data in the master source, and whether or not you use sorted
input.
After the memory caches are built, the PowerCenter Server reads the rows from the
detail source and performs the joins. The PowerCenter Server uses the index cache to
test the join condition. When it finds source data and cache data that match, it retrieves
row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
G Persistent caches should be used when lookup data is not expected to change
often. Lookup cache files are saved after a session with a persistent cache
lookup is run for the first time. These files are reused for subsequent runs,
bypassing the querying of the database for the lookup. If the lookup table
changes, you must be sure to set the Recache from Database option to
ensure that the lookup cache files are rebuilt. You can also delete the cache
files before the session run to force the session to rebuild the caches.
G Lookup caching should be enabled for relatively small tables. Refer to the Best
Practice Tuning Mappings for Better Performance to determine when lookups
should be cached. When the Lookup transformation is not configured for
caching, the PowerCenter Server queries the lookup table for each input row.
The result of the lookup query and processing is the same, regardless of
whether the lookup table is cached or not. However, when the transformation
is configured to not cache, the PowerCenter Server queries the lookup table
instead of the lookup cache. Using a lookup cache can usually increase
session performance.
G Just as with a Joiner, the PowerCenter Server aligns all data for lookup caches
on an eight-byte boundary, which helps increase the performance of the
lookup.
Allocating Buffer Memory
The Integration Service can determine the memory requirements for the buffer memory:
G DTM Buffer Size
G Default Buffer Block Size
You can also configure DTM buffer size and the default buffer block size in the session
properties. When the PowerCenter Server initializes a session, it allocates blocks of
memory to hold source and target data. Sessions that use a large number of sources
and targets may require additional memory blocks.
To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the buffer
size and/or the buffer block size based on the default settings, to create the required
number of session blocks.
If there are XML sources or targets in the mappings, use the number of groups in the
XML source or target in the total calculation for the total number of sources and targets.
Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter
Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer
memory to create the internal data structures and buffer blocks used to bring data into
and out of the server. When the DTM buffer memory is increased, the PowerCenter
Server creates more buffer blocks, which can improve performance during momentary
slowdowns.
If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer
pool size may improve performance.
Using DTM buffer memory allocation generally causes performance to improve initially
and then level off. (Conversely, it may have no impact on source or target-bottlenecked
sessions at all and may not have an impact on DTM bottlenecked sessions). When the
DTM buffer memory allocation is increased, you need to evaluate the total memory
available on the PowerCenter Server. If a session is part of a concurrent batch, the
combined DTM buffer memory allocated for the sessions or batches must not exceed
the total memory for the PowerCenter Server system. You can increase the DTM buffer
size in the Performance settings of the Properties tab.
Running Workflows and Sessions Concurrently
The PowerCenter Server can process multiple sessions in parallel and can also
process multiple partitions of a pipeline within a session. If you have a symmetric multi-
processing (SMP) platform, you can use multiple CPUs to concurrently process session
data or partitions of data. This provides improved performance since true parallelism is
achieved. On a single processor platform, these tasks share the CPU, so there is no
parallelism.
To achieve better performance, you can create a workflow that runs several sessions in
parallel on one PowerCenter Server. This technique should only be employed on
servers with multiple CPUs available.
Partitioning Sessions
Performance can be improved by processing data in parallel in a single session by
creating multiple partitions of the pipeline. If you have PowerCenter partitioning
available, you can increase the number of partitions in a pipeline to improve session
performance. Increasing the number of partitions allows the PowerCenter Server to
create multiple connections to sources and process partitions of source data
concurrently.
When you create or edit a session, you can change the partitioning information for each
pipeline in a mapping. If the mapping contains multiple pipelines, you can specify
multiple partitions in some pipelines and single partitions in others. Keep the following
attributes in mind when specifying partitioning information for a pipeline:
G Location of partition points. The PowerCenter Server sets partition points at
several transformations in a pipeline by default. If you have PowerCenter
partitioning available, you can define other partition points. Select those
transformations where you think redistributing the rows in a different way is
likely to increase the performance considerably.
G Number of partitions. By default, the PowerCenter Server sets the number of
partitions to one. You can generally define up to 64 partitions at any partition
point. When you increase the number of partitions, you increase the number of
processing threads, which can improve session performance. Increasing the
number of partitions or partition points also increases the load on the server. If
the server contains ample CPU bandwidth, processing rows of data in a
session concurrently can increase session performance. However, if you
create a large number of partitions or partition points in a session that
processes large amounts of data, you can overload the system. You can also
overload source and target systems, so that is another consideration.
G Partition types. The partition type determines how the PowerCenter Server
redistributes data across partition points. The Workflow Manager allows you to
specify the following partition types:
1. Round-robin partitioning. PowerCenter distributes rows of data evenly
to all partitions. Each partition processes approximately the same
number of rows. In a pipeline that reads data from file sources of
different sizes, you can use round-robin partitioning to ensure that each
partition receives approximately the same number of rows.
2. Hash keys. The PowerCenter Server uses a hash function to group
rows of data among partitions. The Server groups the data based on a
partition key. There are two types of hash partitioning:
H Hash auto-keys. The PowerCenter Server uses all grouped or
sorted ports as a compound partition key. You can use hash
auto-keys partitioning at or before Rank, Sorter, and unsorted
Aggregator transformations to ensure that rows are grouped
properly before they enter these transformations.
H Hash user keys. The PowerCenter Server uses a hash
function to group rows of data among partitions based on a
user-defined partition key. You choose the ports that define the
partition key.
3. Key range. The PowerCenter Server distributes rows of data based on
a port or set of ports that you specify as the partition key. For each port,
you define a range of values. The PowerCenter Server uses the key and
ranges to send rows to the appropriate partition. Choose key range
partitioning where the sources or targets in the pipeline are partitioned
by key range.
4. Pass-through partitioning. The PowerCenter Server processes data
without redistributing rows among partitions. Therefore, all rows in a
single partition stay in that partition after crossing a pass-through
partition point.
5. Database partitioning. You can optimize session
performance by using the database partitioning partition type instead of
the pass-through partition type for IBM DB2 targets.
If you find that your system is under-utilized after you have tuned the application,
databases, and system for maximum single-partition performance, you can reconfigure
your session to have two or more partitions to make your session utilize more of the
hardware. Use the following tips when you add partitions to a session:
G Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before you add each partition.
G Set DTM buffer memory. For a session with n partitions, this value should be
at least n times the value for the session with one partition.
G Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the number of cached values property of the
Sequence Generator transformation. If you must set this value to a value
greater than zero, make sure it is at least n times the original value for the
session with one partition.
G Partition the source data evenly. Configure each partition to extract the
same number of rows. Or redistribute the data among partitions early using a
partition point with round-robin. This is actually a good way to prevent
hammering of the source system. You could have a session with multiple
partitions where one partition returns all the data and the override SQL in the
other partitions is set to return zero rows (where 1 = 2 in the where
clause prevents any rows being returned). Some source systems react better
to multiple concurrent SQL queries; others prefer smaller numbers of queries.
G Monitor the system while running the session. If there are CPU cycles
available (twenty percent or more idle time), then performance may improve
for this session by adding a partition.
G Monitor the system after adding a partition. If the CPU utilization does not
go up, the wait for I/O time goes up, or the total data transformation rate goes
down, then there is probably a hardware or software bottleneck. If the wait for I/
O time goes up a significant amount, then check the system for hardware
bottlenecks. Otherwise, check the database configuration.
G Tune databases and system. Make sure that your databases are tuned
properly for parallel ETL and that your system has no bottlenecks.
Increasing the Target Commit Interval
One method of resolving target database bottlenecks is to increase the commit interval.
Each time the target database commits, performance slows. If you increase the commit
interval, the number of times the PowerCenter Server commits decreases and
performance may improve.
When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate the larger
number of rows. One of the major reasons that Informatica set the default commit
interval to 10,000 is to accommodate the default rollback segment / extent size of most
databases. If you increase both the commit interval and the database rollback
segments, you should see an increase in performance. In some cases though, just
increasing the commit interval without making the appropriate database changes may
cause the session to fail part way through (i.e., you may get a database error like
"unable to extend rollback segments" in Oracle).
Disabling High Precision
If a session runs with high precision enabled, disabling high precision may improve
session performance.
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a
high-precision Decimal datatype in a session, you must configure it so that the
PowerCenter Server recognizes this datatype by selecting Enable High Precision in the
session property sheet. However, since reading and manipulating a high-precision
datatype (i.e., those with a precision of greater than 28) can slow the PowerCenter
Server down, session performance may be improved by disabling decimal arithmetic.
When you disable high precision, the PowerCenter Server reverts to using a datatype of
Double.
Reducing Error Tracking
If a session contains a large number of transformation errors, you may be able to
improve performance by reducing the amount of data the PowerCenter Server writes to
the session log.
To reduce the amount of time spent writing to the session log file, set the tracing level
to Terse. At this tracing level, the PowerCenter Server does not write error messages
or row-level information for reject data. However, if terse is not an acceptable level of
detail, you may want to consider leaving the tracing level at Normal and focus your
efforts on reducing the number of transformation errors. Note that the tracing level must
be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to verbose initialization or verbose data.
G Verbose initialization logs initialization details in addition to normal, names of
index and data files used, and detailed transformation statistics.
G Verbose data logs each row that passes into the mapping. It also notes where
the PowerCenter Server truncates string data to fit the precision of a column
and provides detailed transformation statistics. When you configure the tracing
level to verbose data, the PowerCenter Server writes row data for all rows in a
block when it processes a transformation.
However, the verbose initialization and verbose data logging options significantly affect
the session performance. Do not use Verbose tracing options except when testing
sessions. Always remember to switch tracing back to Normal after the testing is
complete.
The session tracing level overrides any transformation-specific tracing levels within the
mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any
recurring transformation errors. PowerCenter uses the mapping tracing level when the
session tracing level is set to none.
Pushdown Optimization
You can push transformation logic to the source or target database using pushdown
optimization. The amount of work you can push to the database depends on the
pushdown optimization configuration, the transformation logic, and the mapping and
session configuration.
When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and it processes any transformation logic
that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
Pushdown Optimization.
Source-Side Pushdown Optimization Sessions
In source-side pushdown optimization, the Integration Service analyzes the mapping
from the source to the target until it reaches a downstream transformation that cannot
be pushed to the database.
The Integration Service generates a SELECT statement based on the transformation
logic up to the transformation it can push to the database. The Integration Service pushes
all transformation logic that is valid to push to the database by executing the generated
SQL statement at run time. Then, it reads the results of this SQL statement and
continues to run the session. Similarly, if the source qualifier contains a SQL override, the
Integration Service creates a view for the override, generates a SELECT statement, and runs
that SELECT statement against the view. When the session completes, the Integration Service
drops the view from the database.
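As an illustration only (this is not the exact SQL the Integration Service generates; the
actual statement depends on the mapping, session, and database), pushing a Filter and an
Expression transformation to an Oracle source might collapse them into a single generated
SELECT along these lines:

-- Filter condition and expression logic folded into the source query
SELECT CUSTOMER_ID,
       UPPER(CUSTOMER_NAME) AS CUSTOMER_NAME
FROM CUSTOMERS
WHERE COUNTRY_CODE = 'US'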
Target-Side Pushdown Optimization Sessions
When you run a session configured for target-side pushdown optimization, the
Integration Service analyzes the mapping from the target to the source or until it
reaches an upstream transformation it cannot push to the database. It generates an
INSERT, DELETE, or UPDATE statement based on the transformation logic for each
transformation it can push to the database, starting with the first transformation in the
pipeline it can push to the database. The Integration Service processes the
transformation logic up to the point that it can push the transformation logic to the target
database. Then, it executes the generated SQL.
Full Pushdown Optimization Sessions
To use full pushdown optimization, the source and target must be on the same
database. When you run a session configured for full pushdown optimization, the
Integration Service analyzes the mapping from source to target and analyzes each
transformation in the pipeline until it reaches the target. It generates and executes the
SQL against the sources and targets.
When you run a session for full pushdown optimization, the database must run a long
transaction if the session contains a large quantity of data. Consider the following
database performance issues when you generate a long transaction:
G A long transaction uses more database resources.
G A long transaction locks the database for longer periods of time, and thereby
reduces the database concurrency and increases the likelihood of deadlock.
G A long transaction can increase the likelihood that an unexpected event may
occur.
The Rank transformation cannot be pushed to the database. For example, in a mapping
that contains a Source Qualifier, an Aggregator, a Rank, and an Expression transformation,
if you configure the session for full pushdown optimization, the Integration Service pushes the
Source Qualifier transformation and the Aggregator transformation to the source. It pushes the
Expression transformation and target to the target database, and it processes the Rank
transformation itself. The Integration Service does not fail the session if it can push only part
of the transformation logic to the database and the session is configured for full
optimization.
Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias
assigned to a group of nodes that allows you to automate the distribution of workflows
and sessions across nodes.
When you use a grid, the Integration Service distributes workflow tasks and session
threads across multiple nodes. Running workflows and sessions on the nodes of a grid
provides the following performance gains:
G Balances the Integration Service workload.
G Processes concurrent sessions faster.
G Processes partitions faster.
When you run a session on a grid, you improve scalability and performance by
distributing session threads to multiple DTM processes running on nodes in the grid.
To run a workflow or session on a grid, you assign resources to nodes, create and
configure the grid, and configure the Integration Service to run on a grid.
Running a Session on Grid
When you run a session on a grid, the master service process runs the workflow and
workflow tasks, including the Scheduler. Because it runs on the master service process
node, the Scheduler uses the date and time for the master service process node to
start scheduled workflows. The Load Balancer distributes Command tasks as it does
when you run a workflow on a grid. In addition, when the Load Balancer dispatches a
Session task, it distributes the session threads to separate DTM processes.
The master service process starts a temporary preparer DTM process that fetches the
session and prepares it to run. After the preparer DTM process prepares the session, it
acts as the master DTM process, which monitors the DTM processes running on other
nodes.
The worker service processes start the worker DTM processes on other nodes. The
worker DTM runs the session. Multiple worker DTM processes running on a node might
be running multiple sessions or multiple partition groups from a single session
depending on the session configuration.
For example, you run a workflow on a grid that contains one Session task and one
Command task. You also configure the session to run on the grid.
When the Integration Service process runs the session on a grid, it performs the
following tasks:
G On Node 1, the master service process runs workflow tasks. It also starts a
temporary preparer DTM process, which becomes the master DTM process.
The Load Balancer dispatches the Command task and session threads to
nodes in the grid.
G On Node 2, the worker service process runs the Command task and starts the
worker DTM processes that run the session threads.
G On Node 3, the worker service process starts the worker DTM processes that
run the session threads.
For information about configuring and managing a grid, refer to the PowerCenter
Administrator Guide.
For information about how the DTM distributes session threads into partition groups,
see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.


Last updated: 01-Feb-07 18:54
Tuning SQL Overrides and Environment for Better Performance
Challenge
Tuning SQL Overrides and SQL queries within the source qualifier objects can improve performance in selecting data from
source database tables, which positively impacts the overall session performance. This Best Practice explores ways to
optimize a SQL query within the source qualifier object. The tips here can be applied to any PowerCenter mapping. While
the SQL discussed here was executed on Oracle 8 (and above) and DB2, the techniques are generally applicable, but specifics for other
RDBMS products (e.g., SQL Server, Sybase) are not included.
Description

SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must
look at the type of execution being forced by hints, the execution plan, and the indexes on the query tables in the SQL, the
logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.
DB2 Coalesce and Oracle NVL
When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In
Oracle, the NVL function is used, while in DB2, the COALESCE function is used.
Here is an example of the Oracle NVL function:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM exp.exp_bio_result bio, sar.sar_data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log
WHERE load_status = 'P')
Here is the same query in DB2:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log
WHERE load_status = 'P')

Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around
this limitation.
You can create views in the database and use them as you would tables, either as source tables, or in the FROM clause of
the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain.
The logic is now in two places: in an Informatica mapping and in a database view.
You can use in-line views which are SELECT statements in the FROM or WHERE clause. This can help focus the query to
a subset of data in the table and work more efficiently than using a traditional join. Here is an example of an in-line view in
the FROM clause:
SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM EXPERIMENT_PARAMETER R,
NEW_GROUP_TMP TMP
WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID
) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID
Surmounting the Single SQL Statement Limitation in DB2: Using Common Table Expressions
and the WITH Clause
A Common Table Expression (CTE) stores intermediate results in a temporary work area for the duration of the SQL statement. The WITH clause lets you assign a name to a CTE block, which you can then reference in multiple places in the query by specifying that name. For example:
WITH maxseq AS (SELECT MAX(seq_no) AS seq_no FROM data_load_log WHERE load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log, maxseq
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = maxseq.seq_no
Here is another example using a WITH clause that uses recursive SQL:
WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
(SELECT PERSON_ID, NAME, PARENT_ID, 1
FROM PARENT_CHILD
WHERE NAME IN ('FRED', 'SALLY', 'JIM')
UNION ALL
SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
FROM PARENT_CHILD C, PERSON_TEMP RECURS
WHERE C.PARENT_ID = RECURS.PERSON_ID
AND RECURS.LVL < 5)
SELECT * FROM PERSON_TEMP
The PARENT_ID in any particular row refers to the PERSON_ID of that row's parent (a simplification, since everyone has two parents, but it illustrates the technique). The LVL counter prevents infinite recursion.
CASE (DB2) vs. DECODE (Oracle)
The CASE syntax is supported in Oracle, but you are much more likely to see DECODE logic, even for a single condition, because DECODE was the only way to test a condition in earlier Oracle versions.
DECODE is not supported in DB2.
In Oracle:
SELECT EMPLOYEE, FNAME, LNAME,
DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',
'THE REST OF US')) AS SALARY_COMMENT
FROM EMPLOYEE
In DB2:
SELECT EMPLOYEE, FNAME, LNAME,
CASE
WHEN SALARY < 10000 THEN 'NEED RAISE'
WHEN SALARY > 1000000 THEN 'OVERPAID'
ELSE 'THE REST OF US'
END AS SALARY_COMMENT
FROM EMPLOYEE
Debugging Tip: Obtaining a Sample Subset
It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The sampling logic can be commented out or removed once the query is put into general use.
DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
FETCH FIRST 12 ROWS ONLY
Oracle does it this way using the ROWNUM variable:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
AND ROWNUM <= 12
INTERSECT, INTERSECT ALL, UNION, UNION ALL
Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL
return all rows.
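As a quick illustration (the table and column names are taken from the anti-join examples later in this document), the first query below removes duplicate NAME_ID values and therefore incurs extra work to de-duplicate, while the second simply concatenates the two result sets and is usually cheaper when duplicates are acceptable or cannot occur:
SELECT NAME_ID FROM CUSTOMERS
UNION
SELECT NAME_ID FROM EMPLOYEES

SELECT NAME_ID FROM CUSTOMERS
UNION ALL
SELECT NAME_ID FROM EMPLOYEES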
System Dates in Oracle and DB2
Oracle uses the system variable SYSDATE for the current date and time, and allows you to display the date and/or the time however you want using date functions.
Here is an example that returns yesterday's date in Oracle (default format mm/dd/yyyy):
SELECT TRUNC(SYSDATE) - 1 FROM DUAL
DB2 uses system variables, called special registers: CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP.
Here is an example for DB2:
SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE
Oracle: Using Hints
Hints affect the way a query or sub-query is executed and can therefore provide a significant performance increase in
queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer
control over the execution. Hints are always honored unless execution is not possible. Because the database engine does
not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle has many types of
hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and
access method hints are the most common.
In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer possible. It was in rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle, the use of the proper INDEX hints should help performance.
The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below
provides a partial list of optimizer hints and descriptions.
Optimizer hints: Choosing the best join method
Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts
while the nested loop involves no sorts. The hash join also requires memory to build the hash table.
Hash joins are most effective when the amount of data is large and one table is much larger than the other.
Here is an example of a select that performs best as a hash join:
SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID
Considerations and preferred join type:
Better throughput: Sort/Merge
Better response time: Nested loop
Large subsets of data: Sort/Merge
Index available to support the join: Nested loop
Limited memory and CPU available for sorting: Nested loop
Parallel execution: Sort/Merge or Hash
Joining all or most of the rows of large tables: Sort/Merge or Hash
Joining small subsets of data with an index available: Nested loop
Hint: Description
ALL_ROWS: The database engine creates an execution plan that optimizes for throughput. Favors full table scans. The optimizer favors Sort/Merge joins.
FIRST_ROWS: The database engine creates an execution plan that optimizes for response time. It returns the first row of data as quickly as possible. Favors index lookups. The optimizer favors Nested loops.
CHOOSE: The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.
RULE: The database engine creates an execution plan based on a fixed set of rules.
USE_NL: Use nested loop joins.
USE_MERGE: Use sort merge joins.
HASH: The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.
Access method hints
Access method hints control how data is accessed. These hints are used to force the database engine to use indexes,
hash scans, or row id scans. The following table provides a partial list of access method hints.
Hint: Description
ROWID: The database engine performs a scan of the table based on ROWIDs.
INDEX: DO NOT USE in Oracle 9.2 and above. The database engine performs an index scan of a specific table, but in 9.2 and above the optimizer does not use any indexes other than those mentioned.
USE_CONCAT: The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.
The syntax for using a hint in a SQL statement is as follows:
Select /*+ FIRST_ROWS */ empno, ename
From emp;
Select /*+ USE_CONCAT */ empno, ename
From emp;
SQL Execution and Explain Plan
The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution, it relies upon optimized Oracle parameters and up-to-date database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time.
The developer can determine which type of execution is being used by running an explain plan on the SQL query in
question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results
of that statement are then used as input by the next level statement.
Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible. Full
table scans cause degradation in performance.
Information provided by the Explain Plan can be enhanced using the SQL Trace Utility, which provides the following additional information:
G The number of executions
G The elapsed time of the statement execution
G The CPU time used to execute the statement
The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can
immediately show the change in resource consumption after the statement has been tuned and a new explain plan has
been run.
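As a sketch of how this is typically done in Oracle (the table and filter in the sample query are only illustrative, DBMS_XPLAN requires Oracle 9.2 or later, and the SQL Trace output file is formatted with the tkprof utility):
EXPLAIN PLAN FOR
SELECT R.BATCH_TRACKING_NO FROM CDS_SUPPLIER R WHERE R.LOAD_DATE > SYSDATE - 1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

ALTER SESSION SET SQL_TRACE = TRUE;
-- re-run the statement being tuned, then format the resulting trace file with tkprof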
Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should
compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that
are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once
implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being
used, it is possible to force the query to use it by using an access method hint, as described earlier.
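For example, on an older Oracle release where INDEX hints are still appropriate (see the caution above for 9.2 and later), the hint names the table alias and the index; the index name below is hypothetical:
SELECT /*+ INDEX(C CUSTOMERS_NAME_ID_IDX) */ C.NAME_ID
FROM CUSTOMERS C
WHERE C.NAME_ID = 12345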
Reviewing SQL Logic
The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine
whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for
additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme
cases, the entire SQL statement may need to be re-written to become more efficient.
Reviewing SQL Syntax
SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:
G EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent
query and cannot take advantage of indexes, while the IN clause is executed once and does use indexes, and
may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:
SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN
(SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster
SELECT * FROM DEPARTMENTS D WHERE EXISTS
(SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)
How EXISTS and IN behave in various situations:
Index supports the subquery: EXISTS - Yes; IN - Yes
No index to support the subquery: EXISTS - No (table scan per parent row); IN - Yes (table scan once)
Subquery returns many rows: EXISTS - Probably not; IN - Yes
Subquery returns one or a few rows: EXISTS - Yes; IN - Yes
Most of the subquery rows are eliminated by the parent query: EXISTS - No; IN - Yes
Index in the parent that matches the subquery columns: EXISTS - Possibly not, since EXISTS cannot use the index; IN - Yes, IN uses the index
G Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent (see the sketch following this list).
G Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup
objects within the mapping to fill in the optional information.
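As a sketch of the EXISTS rewrite mentioned in the first bullet above (using the CUSTOMERS and EMPLOYEES tables from the anti-join examples below), an INTERSECT that finds customers who are also employees:
SELECT C.NAME_ID FROM CUSTOMERS C
INTERSECT
SELECT E.NAME_ID FROM EMPLOYEES E
can usually be rewritten as a correlated EXISTS:
SELECT DISTINCT C.NAME_ID FROM CUSTOMERS C
WHERE EXISTS
(SELECT * FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)
The DISTINCT is included to match the duplicate-removal behavior of INTERSECT.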

Choosing the Best Join Order
Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the
incremental ETL load.
Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work
from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
Outer joins limit the join order that the optimizer can use. Don't use them needlessly.
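For example (the staging and fact table names below are illustrative assumptions, as is the use of Oracle's ORDERED hint, which joins tables in the order they appear in the FROM clause), listing the small incremental staging table first lets it drive the join into the large table, with the small table's column on the right side of the join condition:
SELECT /*+ ORDERED */ F.SALES_ID, F.SALES_AMOUNT
FROM STG_INCREMENTAL_IDS S, SALES_FACT F
WHERE F.SALES_ID = S.SALES_ID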
Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN


G Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this
may not be a problem on small tables, it can become a performance drain on large tables.
SELECT NAME_ID FROM CUSTOMERS
WHERE NAME_ID NOT IN
(SELECT NAME_ID FROM EMPLOYEES)
G Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but still may cause a full table scan.
SELECT C.NAME_ID FROM CUSTOMERS C
WHERE NOT EXISTS
(SELECT * FROM EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID)
G In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.
SELECT C.NAME_ID FROM CUSTOMERS C
MINUS
SELECT E.NAME_ID FROM EMPLOYEES E
G Also consider using outer joins with IS NULL conditions for anti-joins.
SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID (+)
AND E.NAME_ID IS NULL
Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change
based on the database engine.
G In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source
qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders
entered into the system since the previous load of the database, then, in the product information lookup, only
select the products that match the distinct product IDs in the incremental sales orders.
G Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE clause, with the limit values retrieved from another table. Here is an example:
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
L.LOAD_DATE AS LOAD_DATE
FROM ETL_AUDIT_LOG L
WHERE L.LOAD_DATE_PREV IN
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y)
) Z
WHERE
R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE
The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds.
Here is the improved SQL:
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
/* In-line view for lower limit */
(SELECT
R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1,
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y) Z
WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
ORDER BY R1.LOAD_DATE) R,
/* end in-line view for lower limit */
(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
FROM ETL_AUDIT_LOG D) A /* upper limit */
WHERE R.LOAD_DATE <= A.LOAD_DATE
Tuning System Architecture
Use the following steps to improve the performance of any system:
1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider
reducing the number of measurements because performance monitoring itself uses system resources. Otherwise
continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a
time. If there are no options left at any level, this indicates that the system has reached its limits and hardware
upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.
System Resources
The PowerCenter Server uses the following system resources:
G CPU
G Load Manager shared memory
G DTM buffer memory
G Cache memory
When tuning the system, evaluate the following considerations during the implementation process.
G Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of
network hops between the PowerCenter Server and the databases.
G Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
G When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the
PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to
store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can
potentially slow session performance.
G Check hard disks on related machines. Slow disk access on source and target databases, source and target file
systems, as well as the PowerCenter Server and repository machines can slow session performance.
G When an operating system runs out of physical memory, it starts paging to disk to free physical memory.
Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system
memory when sessions use large cached lookups or sessions have many partitions.
G In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources.
Use processor binding to control processor usage by the PowerCenter Server.
G In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see the project system administrator and the Sun Solaris documentation.
G In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The
Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For
details, see project system administrator and HP-UX documentation.
G In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands.
The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details,
see project system administrator and AIX documentation.

Database Performance Features
Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the
many available alternatives is the best implementation choice for the particular database. The project team must have a
thorough understanding of the data, database, and desired use of the database by the end-user community prior to
beginning the physical implementation process. Evaluate the following considerations during the implementation process.
G Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and
primary key to foreign key relationships, and also eliminating join tables.
G Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, with post-session scripts to rebuild them after the load (see the sketch following this list).
G Constraints. Avoid constraints if possible; instead, enforce integrity by incorporating that additional logic in the mappings.
G Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for
queries (temporary) and INSERTs and UPDATES (rollback). The rollback area must be large enough to hold all
the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions,
particularly on initial loads.
G OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to
determine after the fact. DBAs must work with the System Administrator to ensure all the database processes
have the same priority.
G Striping. Database performance can be increased significantly by implementing RAID 0 (striping) or RAID 5 (striping with parity) to improve disk I/O throughput.
G Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk
controllers.
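As a sketch of the index handling described in the Indexes bullet above (the table and index names are hypothetical), the pre-session SQL drops the index before the load and the post-session SQL rebuilds it afterwards:
-- pre-session SQL
DROP INDEX IDX_SALES_FACT_CUST;

-- post-session SQL
CREATE INDEX IDX_SALES_FACT_CUST ON SALES_FACT (CUST_ID);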


Last updated: 13-Feb-07 17:47
Using Metadata Manager Console to Tune
the XConnects
Challenge
Improving the efficiency and reducing the run-time of your XConnects through the
parameter settings of the Metadata Manager console.
Description
Remember that the minimum system requirements for a machine hosting the Metadata
Manager console are:
G Windows operating system (2000, NT 4.0 SP 6a)
G 400MB disk space
G 128MB RAM (256MB recommended)
G 133 MHz processor.
If the system meets or exceeds the minimal requirements, but an XConnect is still
taking an inordinately long time to run, use the following steps to try to improve its
performance.
To improve performance of your XConnect loads from database catalogs:
G Modify the inclusion/exclusion schema list (if the number of schemas to be loaded is greater than the number to be excluded, use the exclusion list).
G Carefully examine how many old objects the project really needs by default. Modify the default sysdate - 5000 filter to a smaller value to reduce the result set.
To improve performance of your XConnect loads from the PowerCenter repository:
G Load only the production folders that are needed for a particular project.
G Run the XConnects with just one folder at a time, or select the list of folders for
a particular run.


Advanced Client Configuration Options
Challenge
Setting the Registry to ensure consistent client installations, resolve potential missing or invalid
license key issues, and change the Server Manager Session Log Editor to your preferred editor.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across the domain,
the Administrator can create a single "official" set of data sources, then use the Repository
Manager to export that connection information to a file. You can then distribute this file and import
the connection information for each client machine.
Solution:
G From Repository Manager, choose Export Registry from the Tools drop-down menu.
G For all subsequent client installs, simply choose Import Registry from the Tools drop-down
menu.
Resolving Missing or Invalid License Keys
The missing or invalid license key error occurs when attempting to install PowerCenter Client
tools on NT 4.0 or Windows 2000 with a userid other than Administrator.
This problem also occurs when the client software tools are installed under the Administrator
account, and a user with a non-administrator ID subsequently attempts to run the tools. The user
who attempts to log in using the normal non-administrator userid will be unable to start the
PowerCenter Client tools. Instead, the software displays the message indicating that the license
key is missing or invalid.
Solution:
G While logged in as the installation user with administrator authority, use regedt32 to edit
the registry.
G Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/.
G From the menu bar, select Security/Permissions, and grant read access to the users that
should be permitted to use the PowerMart Client. (Note that the registry entries for both
PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and
PowerMart Client tools.)
Changing the Session Log Editor
In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad
within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the
workflow monitor. Then browse for the editor that you want on the General tab.
For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the
wordpad.exe can be found in the path statement. Instead, a window appears the first time a
session log is viewed from the PowerCenter Server Manager prompting the user to enter the full
path name of the editor to be used to view the logs. Users often set this parameter incorrectly and
must access the registry to change it.
Solution:
G While logged in as the installation user with administrator authority, use regedt32 to go into
the registry.
G Move to registry path location: HKEY_CURRENT_USER Software\Informatica\PowerMart
Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar,
select View Tree and Data.
G Select the Log File Editor entry by double clicking on it.
G Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.exe).
G Select Registry --> Exit from the menu bar to save the entry.
For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow
Monitor.
The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for
workflow and session logs.

Adding a New Command Under Tools Menu
Other tools, in addition to the PowerCenter client tools, are often needed during development and
testing. For example, you may need a tool such as Enterprise manager (SQL Server) or Toad
(Oracle) to query the database. You can add shortcuts to executable programs from any client
tool's Tools drop-down menu to provide quick access to these programs.
Solution:
Choose Customize under the Tools menu and add a new item. Once it is added, browse to find
the executable it is going to call (as shown below).

When this is done once, you can easily call another program from your PowerCenter client tools.
In the following example, TOAD can be called quickly from the Repository Manager tool.
Changing Target Load Type
In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type
bulk, although this was not necessarily what was desired and could cause the session to fail under
certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow
Manager to choose the default load type to be either 'bulk' or 'normal'.
Solution:
G In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
G Click the button for either 'normal' or 'bulk', as desired.
G Click OK, then close and open the Workflow Manager tool.
After this, every time a session is created, the target load type for all relational targets will default to
your choice.
Resolving Undocked Explorer Windows
The Repository Navigator window sometimes becomes undocked. Docking it again can be
frustrating because double clicking on the window header does not put it back in place.
Solution:
G To get the Window correctly docked, right-click in the white space of the Navigator
window.
G Make sure that the Allow Docking option is checked. If it is checked, double-click on the title
bar of the Navigator Window.
Resolving Client Tool Window Display Issues
If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears, try the following solutions to recover it:
G Clicking View > Navigator
G Toggling the menu bar
G Uninstalling and reinstalling Client tools
Note: If none of the above solutions resolve the problem, you may want to try the following solution
using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause
serious problems that may require reinstalling the operating system. Informatica does not
guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the
Registry Editor at your own risk.
Solution:
Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can
often be resolved as follows:
G Close the client tool.
G Go to Start > Run and type "regedit".
G Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z
Where x.y.z is the version and maintenance release level of the PowerCenter client as
follows:
PowerCenter Version: Registry Folder Name
7.1: 7.1
7.1.1: 7.1.1
7.1.2: 7.1.1
7.1.3: 7.1.1
7.1.4: 7.1.1
8.1: 8.1
G Open the key of the affected tool (for the Repository Manager open Repository Manager
Options).
G Export all of the Toolbars sub-folders and rename them.
G Re-open the client tool.
Enhancing the Look of the Client Tools
The PowerCenter client tools allow you to customize the look and feel of the display. Here are a
few examples of what you can do.
Designer

G From the Menu bar, select Tools > Options.
G In the dialog box, choose the Format tab.
G Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).
Changing the background workspace colors can help identify which workspace is currently
open. For example, changing the Source Analyzer workspace color to green or the Target Designer
workspace to purple to match their respective metadata definitions helps to identify the workspace.
Alternatively, click the Select Theme button to choose a color theme, which displays background
colors based on predefined themes.

Workflow Manager
You can modify the Workflow Manager using the same approach as the Designer tool.
From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or
customize each element individually.
Workflow Monitor
You can modify the colors in the Gantt Chart view to represent the various states of a task. You can
also select two colors for one task to give it a dimensional appearance; this can be helpful in
distinguishing between running tasks, succeeded tasks, etc.
To modify the Gantt chart appearance, go to the Menu bar and select Tools > Options and Gantt
Chart.
Using Macros in Data Stencil
Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable
macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you
cannot run the Data Stencil macros.
To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium.
When you start Data Stencil, Visio displays a security warning about viruses in macros. Click
Enable Macros to enable the macros for Data Stencil.


Last updated: 09-Feb-07 15:58
Advanced Server Configuration Options
Challenge
Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings;
using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.
Description
Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To
edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties >
Edit.
The following Advanced properties are included:
Limit on Resilience Timeouts (Optional): Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.
Resilience Timeout (Optional): Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.
Configuring Integration Service Process Variables
One configuration best practice is to properly configure and leverage the Integration service (IS) process variables. The benefits
include:
G Ease of deployment across environments (DEV > TEST > PRD)
G Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.
G All the variables are related to directory paths used by a given Integration Service.
You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files
include run-time files, state of operation files, and session log files.
Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run
on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-
time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.
State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates
files to store the state of operations for the service. The state of operations includes information such as the active service requests,
scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover
operations from the point of interruption.
All Integration Service processes associated with an Integration Service must use the same shared location. However, each
Integration Service can use a separate location.
By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set
the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each
Integration Service process.
You must specify the directory path for each type of file. You specify the following directories using service process variables:
Each registered server has its own set of variables. The list is fixed, not user-extensible.
Service Process Variable: Value
$PMRootDir: (no default; the user must insert a path)
$PMSessionLogDir: $PMRootDir/SessLogs
$PMBadFileDir: $PMRootDir/BadFiles
$PMCacheDir: $PMRootDir/Cache
$PMTargetFileDir: $PMRootDir/TargetFiles
$PMSourceFileDir: $PMRootDir/SourceFiles
$PMExtProcDir: $PMRootDir/ExtProc
$PMTempDir: $PMRootDir/Temp
$PMSuccessEmailUser: (no default; must be set by the user)
$PMFailureEmailUser: (no default; must be set by the user)
$PMSessionLogCount: 0
$PMSessionErrorThreshold: 0
$PMWorkflowLogCount: 0
$PMWorkflowLogDir: $PMRootDir/WorkflowLogs
$PMLookupFileDir: $PMRootDir/LkpFiles
$PMStorageDir: $PMRootDir/Storage

Writing PowerCenter 8 Service Logs to Files
Starting with PowerCenter 8, all the logging for the services and sessions created use the log service and can only be viewed through
the PowerCenter Administration Console. However, it is still possible to get this information logged into a file similar to the previous
versions.
To write all Integration Service logs (session, workflow, server, etc.) to files:
1. Log in to the Admin Console.
2. Select the Integration Service
3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.
Integration Service Custom Properties (undocumented server parameters) can be entered here as well:
1. At the bottom of the list enter the Name and Value of the custom property
2. Click OK.
Adjusting Semaphore Settings on UNIX Platforms
When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent
collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.
Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending
on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as
database servers.
The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and
system. The method used to change the parameter depends on the operating system:
G HP/UX: Use sam (1M) to change the parameters.
G Solaris: Use admintool or edit /etc/system to change the parameters.
G AIX: Use smit to change the parameters.
Setting Shared Memory and Semaphore Parameters on UNIX Platforms
Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set
these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note
that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after
configuring the UNIX kernel.
HP-UX
For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to
500. NCALL is referred to as NCALLOUT.
Use the HP System V IPC Shared-Memory Subsystem to update parameters.
To change a value, perform the following steps:
1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.
The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.
IBM AIX
None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.
SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform:
1. Edit the /etc/system file and add the following variables to increase shared memory segments:
set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value
2. Verify the shared memory value changes:
# grep shmsys /etc/system
3. Restart the system:
# init 6
Red Hat Linux
The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a
restart.
For example, to allow 128MB, type the following command:
$ echo 134217728 >/proc/sys/kernel/shmmax
You can put this command into a script run at startup.
Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar
to the following:
kernel.shmmax = 134217728
This file is usually processed at startup, but sysctl can also be called explicitly later.
To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.
SuSE Linux
The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a
restart. For example, to allow 512MB, type the following commands:
#sets shmall and shmmax shared memory
echo 536870912 >/proc/sys/kernel/shmall #Sets shmall to 512 MB
echo 536870912 >/proc/sys/kernel/shmmax #Sets shmmax to 512 MB
You can also put these commands into a script run at startup.
Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:
#sets user limits (ulimit) for system memory resources
ulimit -v 512000 #set virtual (swap) memory to 512 MB
ulimit -m 512000 #set physical memory to 512 MB
Configuring Automatic Memory Settings
With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size
at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source
to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank,
Joiner, and Lookup transformations, as well as Sorter and XML target caches.
Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer
memory and cache memory settings, consider the overall memory usage for best performance.
Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the
Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the
Integration Service disables automatic memory settings and uses default values.


Last updated: 01-Feb-07 18:54
Causes and Analysis of UNIX Core Files
Challenge
This Best Practice explains what UNIX core files are and why they are created, and
offers some tips on analyzing them.
Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX
process by the operating system. Usually, when the operating system terminates a
process, a "core dump" file is also created, which can be used to analyze the reason
for the abnormal termination.
What is a Core File and What Causes it to be Created?
UNIX operating systems may terminate a process before its normal, expected exit for
several reasons. These reasons are typically for bad behavior by the program, and
include attempts to execute illegal or incorrect machine instructions, attempts to
allocate memory outside the memory space allocated to the program, attempts to write
to memory marked read-only by the operating system, and other similar incorrect low-
level operations. Most of these bad behaviors are caused by errors in programming
logic in the program.
UNIX may also terminate a process for some reasons that are not caused by
programming errors. The main examples of this type of termination are when a process
exceeds its CPU time limit, and when a process exceeds its memory limit.
When UNIX terminates a process in this way, it normally writes an image of the
processes memory to disk in a single file. These files are called "core files", and are
intended to be used by a programmer to help determine the cause of the failure.
Depending on the UNIX version, the name of the file may be "core", or in more recent
UNIX versions, "core.nnnn" where nnnn is the UNIX process ID of the process that was
terminated.
Core files are not created for "normal" runtime errors such as incorrect file permissions,
lack of disk space, inability to open a file or network connection, and other errors that a
program is expected to detect and handle. However, under certain error conditions a
program may not handle the error conditions correctly and may follow a path of
execution that causes the OS to terminate it and cause a core dump.
Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger
behavior that causes unexpected core dumps. For example, using an odbc driver
library from one vendor and an odbc driver manager from another vendor may result in
a core dump if the libraries are not compatible. A similar situation can occur if a process
is using libraries from different versions of a database client, such as a mixed
installation of Oracle 8i and 9i. An installation like this should not exist, but if it does,
core dumps are often the result.
Core File Locations and Size Limits
A core file is written to the current working directory of the process that was terminated.
For PowerCenter, this is always the directory the services were started from. For other
applications, this may not be true.
UNIX also implements a per user resource limit on the maximum size of core files. This
is controlled by the ulimit command. If the limit is 0, then core files will not be created. If
the limit is less than the total memory size of the process, a partial core file will be
written. Refer to the Best Practice Understanding and Setting UNIX Resources for
PowerCenter Installations.
Analyzing Core Files
Core files provide valuable insight into the state and condition the process was in just
before it was terminated. It also contains the history or log of routines that the process
went through before that fateful function call; this log is known as the stack trace. There
is little information in a core file that is relevant to an end user; most of the contents of a
core file are only relevant to a developer, or someone who understands the internals of
the program that generated the core file. However, there are a few things that an end
user can do with a core file in the way of initial analysis. The most important aspect of
analyzing a core file is the task of extracting this stack trace out of the core dump.
Debuggers are the tools that help retrieve this stack trace and other vital information
out of the core. Informatica recommends using the pmstack utility.

The first step is to save the core file under a new name so that it is not overwritten by a
later crash of the same application. One option is to append a timestamp to the core,
but it can be renamed to anything:
mv core core.ddmmyyhhmi
The second step is to log in with the same UNIX user id that started up the process that
crashed. This sets the debugger's environment to be same as that of the process at
startup time.
The third step is to go to the directory where the program is installed. Run the "file"
command on the core file. This returns the name of the process that created the core
file.
file <fullpathtocorefile>/core.ddmmyyhhmi
Core files can be generated by the PowerCenter executables (i.e., pmserver,
infaservices, and pmdtm) as well as from other UNIX commands executed by the
Integration Service, typically from command tasks and per- or post-session commands.
If a PowerCenter process is terminated by the OS and a core is generated, the session
or server log typically indicates Process terminating on Signal/Exception as its last
entry.
Using the pmstack Utility
Informatica provides a pmstack utility that can automatically analyze a core file. If the
core file is from PowerCenter, it will generate a complete stack trace from the core file,
which can be sent to Informatica Customer Support for further analysis. The track
contains everything necessary to further diagnose the problem. Core files themselves
are normally not useful on a system other than the one where they were generated.
The pmstack utility can be downloaded from the Informatica Support knowledge base
as article 13652, and from the support ftp server at tsftp.informatica.com. Once
downloaded, run pmstack with the -c option, followed by the name of the core file:
$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info :
-rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'

Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..

Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support
You can then look at the generated trace file or send it to support.
pmstack also supports a -p option, which can be used to extract a stack trace from a running process. This is sometimes useful when a process appears to be hung, to determine what the process is doing.


Last updated: 01-Feb-07 18:54
Domain Configuration
Challenge
The domain architecture introduced in PowerCenter 8 simplifies administration of disparate PowerCenter services across the enterprise, and allows application services and nodes that were previously administered separately to be grouped into logical folders within the domain based on administrative ownership. It is vital, when installing or upgrading PowerCenter, that the
Application Administrator understand the terminology and architecture surrounding the Domain
Configuration in order to effectively administer, upgrade, deploy, and maintain PowerCenter
Services throughout the enterprise.
Description
The domain architecture allows PowerCenter to provide a service-oriented architecture where you
can specify which services are running on which node or physical machine from one central
location. The components in the domain are aware of each other's presence and continually
monitor one another via heartbeats. The various services within the domain can move from one
physical machine to another without any interruption to the PowerCenter environment. As long as
clients can connect to the domain, the domain can route their needs to the appropriate physical
machine.
From a monitoring perspective, the domain provides the ability to monitor all services in the domain
from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment; instead, a single screen displays the current availability state of all services.
For more details on the individual components and detailed configuration of a domain, refer to the
PowerCenter Administrator Guide.
Key Domain Components
There are several key domain components to consider during installation and setup:
G Master Gateway The node designated as the master gateway or domain controller is
the main entry point to the domain. This server(s) should be your most reliable and
available machine in the architecture. It is the first point of entry for all clients wishing to
connect to one of the PowerCenter services. If the master gateway is unavailable, the
entire domain is unavailable. You may designate more than one node to run the gateway
service. One gateway is always the master or primary, but by having the gateway services
running on more than one node in a multi-node configuration, your domain can continue to
function if the master gateway is no longer available. In a high-availability environment, it
is critical to have one or more nodes running the gateway service as a backup to the
master gateway.
G Shared File System The PowerCenter domain architecture provides centralized logging
capability and, when high availability is enabled, a highly available environment with
automatic fail-over of workflows and sessions. In order to achieve this, the base
PowerCenter server file directories must reside on a file system that is accessible by all
nodes in the domain. When PowerCenter is initially installed, this directory is called
infa_shared and is located under the server directory of the PowerCenter installation. It
includes logs and checkpoint information that is shared among nodes of the
domain. Ideally, this file system is both high-performance and highly available.
G Domain Metadata As of PowerCenter 8, a store of metadata exists to hold all of the
configuration for the domain. This domain repository is separate from the one or more
PowerCenter repositories in your domain. Instead, it is a handful of tables that replace the
older pmserver.cfg, pmrep.cfg and other PowerCenter configuration information. Upon
installation you will be prompted for the RDBMS location for the domain repository. This
information should be treated similar to a PowerCenter repository, with regularly-
scheduled backups and a disaster recovery plan. Without this metadata, your domain is
unable to function. The RDBMS user provided to PowerCenter requires permissions to
create and drop tables, as well as insert, update, and delete records. Ideally, if you are
going to be grouping multiple independent nodes within this domain, the domain
configuration database should reside on a separate and independent server so as to
eliminate the single point of failure if the node hosting the domain configuration database
fails.
Domain Architecture
Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility
and scalability across the environment. There is no single best way to deploy the architecture.
Rather, each environment should be assessed for external factors and then PowerCenter
configured appropriately to function best in that particular environment. The advantage of the
service-oriented architecture is that components in the architecture (i.e., repository services,
integration services, and others) can be moved among nodes without needing to make changes to
the mappings or workflows. In this way, it is very simple to alter architecture components if you find
a suboptimal configuration and want to alter it in your environment. The key here is that you are
not tied to any choices you make at installation time and have the flexibility to make changes to
your architecture as your business needs change.
TIP
While the architecture is very flexible and provides easy movement of services throughout the environment, one area to consider carefully at installation time is the name of the domain and the names of the nodes. These are somewhat troublesome to change later because of how critical they are to the domain. It is not recommended that you embed server IP addresses or host names in the domain name or the node names; you never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain PowerCenter_11.5.8.20, consider naming it Enterprise_Dev_Test. This makes it much more intuitive to understand which domain you are attaching to, and if you ever decide to move the main gateway to another server, you don't need to change the domain or node name. While these names can be changed, the change is not easy and requires using command line programs to alter the domain metadata.
In the next sections, we look at a couple of sample domain configurations.
Single Node Domain
Even in a single server/single node installation, you must still create a domain. In this case, all
domain components reside on a single physical machine (i.e., node). You can have any number of
PowerCenter services running on this domain. It is important to note that with PowerCenter 8 and
beyond, you can run multiple Integration Services at the same time on the same machine, even in an NT/Windows environment.
Naturally, this configuration exposes a single point of failure for every component in the domain, and high availability is not possible in this situation.
Multiple Node Domains
Domains can continue to expand to meet the demands of true enterprise-wide data integration.
Domain Architecture for Production/Development/Quality Assurance
Environments
The architecture picture becomes more complex when you consider a typical development
environment, which usually includes some level of a Development, Quality Assurance, and
Production environment. In most implementations, these are separate PowerCenter repositories
and associated servers. It is possible to define a single domain to include one or more of these
development environments. However, there are a few points to consider:
G If the domain gateway is unavailable for any reason, the entire domain is inaccessible.
Keep in mind that if you place your development, quality assurance and production
services in a single domain, you have the possibility of affecting your production
environment with development and quality assurance work. If you decide to restart the
domain in Development for some reason, you are effectively restarting development,
quality assurance and production at the same time. Also, if you experience some sort of
failure that affects the domain in production, you have also brought down your
development environment and have no place to test a fix, since your entire
environment is compromised.
G For the domain you should have a common, shared, high-performance file system to
share the centralized logging and checkpoint files. If you have all three environments
together on one domain, you are mixing production logs with development logs and other
files on the same physical disk. Your production backups and disaster recovery files will
have more than just production information in them.
G For future upgrade, it is very likely that you will need to upgrade all components of the
domain at once to the new version of PowerCenter. If you have placed development,
quality assurance, and production in the same domain, you may need to upgrade all of it
at once. This is an undesirable situation in most data integration environments.
For these reasons, Informatica generally recommends having at least two separate domains in any
environment:
G Production Domain
G Development/Quality Assurance Domain
Some architects choose to deploy a separate domain for each environment to further isolate them
and ensure no disruptions occur in the Quality Assurance environment by any changes in the
development environment. The tradeoff is an additional administration console to log into and
maintain.
One thing to keep in mind is that while you may have separate domains with separate domain
metadata repositories, there is no need to migrate any of the metadata from the separate domain
repositories between development, Quality Assurance and production. The domain metadata
repositories collect information on the physical location and connectivity of the components and
thus, it makes no sense to migrate this metadata between environments. You do need to provide separate
database locations for each, but there is no need to migrate the data within; each one is
specific to the environment it services.
Administration
The domain administrator has permission to start and shut down all services within the domain, as
well as the ability to create other users and delegate roles and responsibilities to them. Keep in
mind that if the domain is shut down, it has to be restarted via the command line or the host
operating system GUI.
PowerCenter's High Availability option provides the ability to create multiple gateway nodes to a
domain, such that if the Master Gateway Node fails, another can assume its
responsibilities, including authentication, logging, and service management.
Security and Folders
Much like the Repository Manager, security in the domain interface is set up on a per-folder
basis, with owners being designated per logical grouping of objects/services in the domain. One of
the major differences is that Domain security allows the creation of subfolders to segment your
nodes and services in any way you like.
There are many considerations when deciding on a folder structure, keeping in mind that this is a
purely administrative interface and does not affect the users and permissions associated with a
developer role, which are designated at the Repository level. New legislation in the United States
and Europe, such as Basel II and the Public Company Accounting Reform and Investor Protection
Act of 2002 (also known as SOX, SarbOx and Sarbanes-Oxley) have been widely interpreted to
place many restrictions on the ability of persons in development roles to have direct write access to
production systems, and consequently, you may have to plan your administration roles
accordingly. Your organization may simply need to use different folders to group objects in
Development, Quality Assurance, and Production roles with separate administrators. In some
instances, systems may need to be entirely separate, with different domains for the Development,
Quality Assurance, and Production systems. Sharing of metadata remains simple between
separate domains, with PowerCenter's ability to link domains and copy data between linked
domains.
For Data Migration projects, it is recommended to establish a standardized
architecture that includes a set of folders, connections and developer access in accordance with
the needs of the project. Typically, this includes folders for:
G Acquiring data
G Converting data to match the target system
G The final load to the target application
G Establishing reference data structures
Maintenance
As part of your regular backup of metadata, you should schedule a recurring backup of your
PowerCenter domain configuration database metadata. This can be accomplished through
PowerCenter by using the infasetup command, further explained in the Command Line
Reference. You should also add the schema to your normal RDBMS backup schedule, providing
two reliable backup methods for disaster recovery purposes.
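As an illustration only (option names vary by PowerCenter version, so verify them against the Command Line Reference), a scheduled domain backup script might contain a command along these lines; the database connection values and file path shown here are hypothetical:

# Back up the domain configuration database to a dated file
infasetup BackupDomain -da dbhost:1521 -dt Oracle -ds ORCL -du domain_db_user -dp domain_db_pwd -bf /backup/infa_domain_20070209.bak -f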
Licensing
As part of PowerCenter's new Service-Oriented Architecture (SOA), licensing for PowerCenter
services has been centralized within the domain. You receive your license key file(s) from
Informatica at the same time the download location for your software is provided. Adding license
object(s) and assigning individual PowerCenter Services to the license(s) is the method by which
you enable a PowerCenter Service. You can do this during install, or add initial/incremental license
keys after install via the Administration Console web-based utility, or the infacmd command line
utility.
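For example, adding a license and assigning a service to it from the command line might look like the following; the domain, user, license, and service names are hypothetical, and the exact infacmd option names should be confirmed in the Command Line Reference for your version:

# Create a license object from the key file received from Informatica
infacmd AddLicense -dn Domain_Dev -un admin -pd admin_pwd -ln PC_License -lf /keys/pc_license.key

# Assign an existing Integration Service to that license
infacmd AssignLicense -dn Domain_Dev -un admin -pd admin_pwd -ln PC_License -sn INT_SVC_DEV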
Last updated: 09-Feb-07 16:10
Managing Repository Size
Challenge
The PowerCenter repository grows over time as new development and
production runs occur. Eventually, the repository can reach a size that
slows its performance or makes backups increasingly difficult.
This Best Practice discusses methods to manage the size of the repository.
The release of PowerCenter version 8.x added several features that aid in managing
the repository size. Although the repository is slightly larger with version 8.x than it was
with the previous versions, the client tools have increased functionality to limit the
dependency on the size of the repository. PowerCenter versions earlier than 8.x
require more administration to keep the repository sizes manageable.
Description
Why should we manage the size of the repository?
Repository size affects the following:
G DB backups and restores. If database backups are being performed, the size
required for the backup can be reduced. If PowerCenter backups are being
used, you can limit what gets backed up.
G Overall query time of the repository, which slows performance of the
repository over time. Analyzing tables on a regular basis can aid in repository
table performance.
G Migrations (i.e., copying from one repository to the next). Limit data transfer
between repositories to avoid locking up the repository for a long period of
time. Some options are available to avoid transferring all run statistics when
migrating.
A typical repository starts off small (i.e., 50MB to 60MB for an empty
repository) and grows to upwards of 1GB for a large repository. The type of
information stored in the repository includes:
H Versions
H Objects
H Run statistics
H Scheduling information
H Variables
Tips for Managing Repository Size
Versions and Objects
Delete old versions or purged objects from the repository. Use your repository queries
in the client tools to generate reusable queries that can determine out-of-date versions
and objects for removal. Use Query Browser to run object queries on both versioned
and non-versioned repositories.
Old versions and objects not only increase the size of the repository, but also make it
more difficult to manage further into the development cycle. Cleaning up the folders
makes it easier to determine what is valid and what is not.
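As a hypothetical illustration (the repository name, credentials, and command options shown here should be verified against the Command Line Reference for your version), old versions could be purged from the command line with pmrep:

# Connect to the repository, then purge object versions checked in before a cutoff date
pmrep connect -r DEV_REPO -d Domain_Dev -n admin -x admin_pwd
pmrep purgeversion -d 01/01/2007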
One way to keep the repository size small is to use shortcuts: create shared folders if
you are using the same source/target definitions or reusable transformations in multiple
folders.
Folders
Remove folders and objects that are no longer used or referenced. Unnecessary
folders increase the size of the repository backups. These folders should not be a part
of production but they may exist in development or test repositories.
Run Statistics
Remove old run statistics from the repository if you no longer need them. History is
important for determining trending, scaling, and performance tuning needs, but you can
always generate reports with the PowerCenter Metadata Reporter and
save the data you need. To remove the run statistics, go to Repository
Manager and truncate the logs based on the dates.
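The same cleanup can be scripted with pmrep; the repository name and credentials below are hypothetical, and the option names should be confirmed in the Command Line Reference:

# Connect to the repository, then remove workflow and session logs older than a cutoff date
pmrep connect -r DEV_REPO -d Domain_Dev -n admin -x admin_pwd
pmrep truncatelog -t "01/01/2007 00:00:00"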
Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter since
the most recent release includes such features as skip workflow and session log, skip
deployment group history, skip MX data and so forth. The repository size in version 8.x
and above is larger than the previous versions of PowerCenter, but the added size
does not significantly affect the performance of the repository. It is still advisable to
analyze the tables or run statistics to optimize the tables.
Informatica does not recommend directly querying the repository tables or performing
deletes on them. Use the client tools unless otherwise advised by Informatica technical
support personnel.
Last updated: 01-Feb-07 18:54
Organizing and Maintaining Parameter Files &
Variables
Challenge
Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.
Description
Parameter files are a means of providing run time values for parameters and variables defined in a
workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple
workflows, sessions, and mappings, and can be created with a text editor such as Notepad or vi, or
generated by a shell script or an Informatica mapping.
Variable values are stored in the repository and can be changed within mappings. However, variable
values specified in parameter files supersede values stored in the repository. The values stored in the
repository can be cleared or reset using workflow manager.
Parameter File Contents
A Parameter File contains the values for variables and parameters. Although a parameter file can
contain values for more than one workflow (or session), it is advisable to build a parameter file to contain
values for a single workflow or a logical group of workflows for ease of administration. When using the command
line mode to execute workflows, multiple parameter files can also be configured and used for a single
workflow if the same workflow needs to be run with different parameters.
Types of Parameters and Variables
A parameter file contains the following types of parameters and variables:
G Service Variable. Defines a service variable for an Integration Service.
G Service Process Variable. Defines a service process variable for an Integration Service that
runs on a specific node.
G Workflow Variable. References values and records information in a workflow. For example,
use a workflow variable in a decision task to determine whether the previous task ran properly.
G Worklet Variable. References values and records information in a worklet. You can use
predefined worklet variables in a parent workflow, but cannot use workflow variables from the
parent workflow in a worklet.
G Session Parameter. Defines a value that can change from session to session, such as a
database connection or file name.
G Mapping Parameter. Defines a value that remains constant throughout a session, such as a
state sales tax rate.
G Mapping Variable. Defines a value that can change during the session. The Integration
Service saves the value of a mapping variable to the repository at the end of each successful
session run and uses that value the next time the session runs.
Configuring Resources with Parameter File
If a session uses a parameter file, it must run on a node that has access to the file. You create a
resource for the parameter file and make it available to one or more nodes. When you configure the
session, you assign the parameter file resource as a required resource. The Load Balancer dispatches
the Session task to a node that has the parameter file resource. If no node has the parameter file
resource available, the session fails.
Configuring Pushdown Optimization with Parameter File
Depending on the database workload, you may want to use source-side, target-side, or full pushdown
optimization at different times. For example, you may want to use partial pushdown optimization during
the database's peak hours and full pushdown optimization when activity is low. Use the
$$PushdownConfig mapping parameter to use different pushdown optimization configurations at different
times. The parameter lets you run the same session using the different types of pushdown optimization.
When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute.
Define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in
the parameter file:
G None. The Integration Service processes all transformation logic for the session.
G Source. The Integration Service pushes part of the transformation logic to the source database.
G Source with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database.
G Target. The Integration Service pushes part of the transformation logic to the target database.
G Full. The Integration Service pushes all transformation logic to the database.
G Full with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database. The Integration Service pushes any remaining transformation logic to the
target database.
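For example, a parameter file entry for a hypothetical folder, workflow, and session might set the attribute as follows:

[SALES_DW.WF:wf_daily_load.ST:s_m_load_orders]
$$PushdownConfig=Source with View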
Parameter File Name
Informatica recommends giving the parameter file the same name as the workflow, with a suffix of
.par. This helps in identifying and linking the parameter file to a workflow.
Parameter File: Order of Precedence
While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a
file specified at the workflow level always supersedes files specified at session levels.
Parameter File Location
Each Integration Service process uses run-time files to process workflows and sessions. If you configure
an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a
shared location. Each node must have access to the run-time files used to process a session or
workflow. This includes files such as parameter files, cache files, input files, and output files.
Place the parameter files in a directory that can be accessed using the server variable. This helps to
move the sessions and workflows to a different server without modifying workflow or session properties.
You can override the location and name of parameter file specified in the session or workflow while
executing workflows via the pmcmd command.
The following points apply to both parameter and variable files; however, they are most relevant to
parameters and parameter files, and are therefore described in those terms.
Multiple Parameter Files for a Workflow
To run a workflow with different sets of parameter values during every run:
1. Create multiple parameter files with unique names.
2. Change the parameter file name (to match the parameter file name defined in Session or
Workflow properties). You can do this manually or by using a pre-session shell (or batch script).
3. Run the workflow.
Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3.
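For example (the service, domain, folder, workflow, and file names here are hypothetical):

pmcmd startworkflow -sv INT_SVC_DEV -d Domain_Dev -u admin -p admin_pwd -f SALES_DW -paramfile /app/infa/parms/wf_daily_load_monthend.par wf_daily_load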
Generating Parameter Files
Based on requirements, you can obtain the values for certain parameters from relational tables or
generate them programmatically. In such cases, the parameter files can be generated dynamically using
shell (or batch scripts) or using Informatica mappings and sessions.
Consider a case where a session has to be executed only on specific dates (e.g., the last working day of
every month), which are listed in a table. You can create the parameter file containing the next run date
(extracted from the table) in more than one way.
Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session: comparing the Current System
date against the date in the parameter file.
3. Use a shell (or batch) script to create a parameter file. Use an SQL query to extract a single
date, which is greater than the System Date (today) from the table and write it to a file with
the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily. The
following figure shows the use of a shell script to generate a parameter file.
The following figure shows a generated parameter file.
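A minimal sketch of such a script, assuming an Oracle control table named RUN_DATES queried with sqlplus, and hypothetical domain, service, folder, and workflow names:

#!/bin/ksh
# Extract the next run date (the first date after today) from the control table
NEXT_DATE=$(sqlplus -s ctl_user/ctl_pwd@CTLDB <<EOF
set heading off feedback off pagesize 0
SELECT TO_CHAR(MIN(run_date), 'MM/DD/YYYY') FROM run_dates WHERE run_date > TRUNC(SYSDATE);
EOF
)

# Write the parameter file in the format the workflow expects
{
  echo "[FINANCE.WF:wf_monthly_load]"
  echo "\$\$NEXT_RUN_DATE=$NEXT_DATE"
} > /app/infa/parms/wf_monthly_load.par

# Start the workflow against the generated file
pmcmd startworkflow -sv INT_SVC_DEV -d Domain_Dev -u admin -p admin_pwd -f FINANCE -paramfile /app/infa/parms/wf_monthly_load.par wf_monthly_load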
Method 2:
1. The Workflow is configured to use a parameter file.
2. The initial value for the data parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session: comparing the Current System
date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow (using
a command task calling a shell script) or a session task, which uses a mapping. This task
extracts a date that is greater than the system date (today) from the table and writes it into the
parameter file in the required format.
5. Schedule the workflow using Scheduler, to run daily (as shown in the following figure).
Parameter File Templates
In some other cases, the parameter values change between runs, but the change can be incorporated
into the parameter files programmatically. There is no need to maintain separate parameter files for
each run.
Consider, for example, a service provider who gets the source data for each client from flat files located
in client-specific directories and writes processed data into a global database. The source data structure,
target data structure, and processing logic are all same. The log file for each client run has to be
preserved in a client-specific directory. The directory names have the client id as part of directory
structure (e.g., /app/data/Client_ID/)
You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one
parameter file per client. However, the number of parameter files may become cumbersome to manage
when the number of clients increases.
In such cases, a parameter file template (i.e., a parameter file containing values for some parameters
and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual
parameter file (for a specific client), replacing the placeholders with actual values, and then execute the
workflow using pmcmd.
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
Using a script, replace Client_ID and curdate with actual values before executing the workflow.
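A minimal sketch of such a script, assuming the template above is saved as client_template.par and the client id is passed in as an argument (the service and domain names are hypothetical; the folder and workflow names come from the template):

#!/bin/ksh
CLIENT_ID=$1
CURDATE=$(date +%Y%m%d)

# Substitute the placeholders to produce a client-specific parameter file
sed -e "s/Client_ID/$CLIENT_ID/g" -e "s/curdate/$CURDATE/g" client_template.par > /app/infa/parms/wf_client_data_$CLIENT_ID.par

# Run the workflow against the generated parameter file
pmcmd startworkflow -sv INT_SVC_PROD -d Domain_Prod -u admin -p admin_pwd -f PROJ_DP -paramfile /app/infa/parms/wf_client_data_$CLIENT_ID.par Client_Data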
The following text is an excerpt from a parameter file that contains service variables for one Integration
Service and parameters for four workflows:
[Service:IntSvs_01]
$PMSuccessEmailUser=pcadmin@mail.com
$PMFailureEmailUser=pcadmin@mail.com
[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix
[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc
Use Case 1: Fiscal Calendar-Based Processing
Some financial and retail organizations use a fiscal calendar for accounting purposes. Use mapping
parameters to process the correct fiscal period.
For example, create a calendar table in the database with the mapping between the Gregorian calendar
and fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates.
Create another mapping with the logic to create a parameter file. Run the parameter file creation
session before running the main session.
The calendar table can be joined directly with the main table, but performance may be poor in
some databases depending upon how the indexes are defined. Using a parameter file avoids this
join and can result in better performance.
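As an illustration, the parameter-file-creation mapping (or a script) might derive the current fiscal period from a hypothetical FISCAL_CALENDAR table with a query such as:

SELECT fiscal_period_start, fiscal_period_end
FROM fiscal_calendar
WHERE TRUNC(SYSDATE) BETWEEN gregorian_start_date AND gregorian_end_date;

and then write the result into the parameter file, for example (the folder, workflow, and session names are hypothetical):

[FINANCE.WF:wf_fiscal_load.ST:s_m_load_gl]
$$FISCAL_PERIOD_START=04/01/2007
$$FISCAL_PERIOD_END=04/28/2007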
Use Case 2: Incremental Data Extraction
Mapping parameters and variables can be used to extract inserted/updated data since previous extract.
Use the mapping parameters or variables in the source qualifier to determine the beginning timestamp
and the end timestamp for extraction.
For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the
timestamp of the last row the Integration Service read in the previous session. Use this variable for the
beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source
filter.
Use the following filter to incrementally extract data from the database:
LOAN.record_update_timestamp > TO_DATE($$PREVIOUS_RUN_DATE_TIME) and
LOAN.record_update_timestamp <= TO_DATE($$$SessStartTime)
Use Case 3: Multi-Purpose Mapping
Mapping parameters can be used to extract data from different tables using a single mapping. In some
cases the table name is the only difference between extracts.
For example, there are two similar extracts from tables FUTURE_ISSUER and EQUITY_ISSUER; the
column names and data types within the tables are same. Use mapping parameter $$TABLE_NAME in
the source qualifier SQL override, and create two parameter files, one for each table name. Run the workflow
using the pmcmd command with the corresponding parameter file, or create two sessions, each with its
corresponding parameter file.
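For example, the SQL override and the two parameter files might look like the following (the folder, workflow, session, and column names are hypothetical):

SELECT issuer_id, issuer_name, issuer_type FROM $$TABLE_NAME

[ISSUER_DW.WF:wf_issuer_extract.ST:s_m_issuer_extract]
$$TABLE_NAME=FUTURE_ISSUER

[ISSUER_DW.WF:wf_issuer_extract.ST:s_m_issuer_extract]
$$TABLE_NAME=EQUITY_ISSUER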
Use Case 4: Using Workflow Variables
You can create variables within a workflow. When you create a variable in a workflow, it is valid only in
that workflow. Use the variable in tasks within that workflow. You can edit and delete user-defined
workflow variables.
Use user-defined variables when you need to make a workflow decision based on criteria you specify.
For example, you create a workflow to load data to an orders database nightly. You also need to load a
subset of this data to headquarters periodically, every tenth time you update the local orders database.
Create separate sessions to update the local database and the one at headquarters. Use a user-defined
variable to determine when to run the session that updates the orders database at headquarters.
To configure user-defined workflow variables, set up the workflow as follows:
Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow
has run. Add a Start task and both sessions to the workflow. Place a Decision task after the session that
updates the local orders database. Set up the decision condition to check to see if the number of
workflow runs is evenly divisible by 10. Use the modulus (MOD) function to do this. Create an
Assignment task to increment the $$WorkflowCount variable by one.
Link the Decision task to the session that updates the database at headquarters when the decision
condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.
When you configure workflow variables using conditions, the session that updates the local database
runs every time the workflow runs. The session that updates the database at headquarters runs every
10th time the workflow runs.
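Using the variable name above, the expressions involved might be as simple as:

Decision task condition:    MOD($$WorkflowCount, 10) = 0
Assignment task expression: $$WorkflowCount = $$WorkflowCount + 1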
Last updated: 09-Feb-07 16:20
Platform Sizing
Challenge
Determining the appropriate platform size to support the PowerCenter environment
based on customer environments and requirements.
Description
The required platform size to support PowerCenter depends on each customer's
unique environment and processing requirements. The Integration Service allocates
resources for individual extraction, transformation, and load (ETL) jobs or sessions.
Each session has its own resource requirements. The resources required for the
Integration Service depend on the number of sessions, what each session does while
moving data, and how many sessions run concurrently. This Best Practice discusses
the relevant questions pertinent to estimating the platform requirements.
TIP
An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle.
Too often, clients size their machines before any ETL is designed or developed, and in many cases these
platforms are too small for the resultant system. Thus, it is better to analyze sizing requirements after the data
transformation processes have been well defined during the design and development phases.
Environment Questions
To determine platform size, consider the following questions regarding your
environment:
G What sources do you plan to access?
G How do you currently access those sources?
G Have you decided on the target environment (i.e., database, hardware,
operating system)? If so, what is it?
G Have you decided on the PowerCenter environment (i.e., hardware, operating
system)?
G Is it possible for the PowerCenter services to be on the same machine as the
target?
G How do you plan to access your information (i.e., cube, ad-hoc query tool) and
what tools will you use to do this?
G What other applications or services, if any, run on the PowerCenter server?
G What are the latency requirements for the PowerCenter loads?
Engine Sizing Questions
To determine engine size, consider the following questions:
G Is the overall ETL task currently being performed? If so, how is it being done,
and how long does it take?
G What is the total volume of data to move?
G What is the largest table (i.e., bytes and rows)? Is there any key on this table
that can be used to partition load sessions, if needed?
G How often does the refresh occur?
G Will refresh be scheduled at a certain time, or driven by external events?
G Is there a "modified" timestamp on the source table rows?
G What is the batch window available for the load?
G Are you doing a load of detail data, aggregations, or both?
G If you are doing aggregations, what is the ratio of source to target rows for the
largest result set? How large is the result set (bytes and rows)?
The answers to these questions provide an approximation guide to the factors that
affect PowerCenter's resource requirements. To simplify the analysis, focus on large
jobs that drive the resource requirement.
Engine Resource Consumption
The following sections summarize some recommendations on the PowerCenter engine
resource consumption.
Processor
1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.
Memory
G 20 to 30MB of memory for the main engine for session coordination.
G 20 to 30MB of memory per session, if there are no aggregations, lookups, or
heterogeneous data joins. Note that 32-bit systems have an operating system
limitation of 2GB per session.
G Caches for aggregation, lookups or joins use additional memory:
G Lookup tables are cached in full; the memory consumed depends on the size
of the tables.
G Aggregate caches store the individual groups; more memory is used if there
are more groups.
G Sorting the input to aggregations greatly reduces the need for memory.
G Joins cache the master table in a join; memory consumed depends on the size
of the master.
System Recommendations
PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources
across multiple machines. Below are the recommendations for the system.
Minimum server
G 1 Node, 4 CPUs and 8GB of memory (instead of the minimal requirement of
4GB RAM).
Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless
the following conditions exist:
G Data is staged to flat files on the PowerCenter machine.
G Data is stored in incremental aggregation files for adding data to aggregates.
The space consumed is about the size of the data aggregated.
G Temporary space is needed for paging for transformations that require large
caches that cannot be entirely held in system memory.
G Session logs are saved by timestamp.
If any of these factors is true, Informatica recommends monitoring disk space on a
regular basis or maintaining some type of script to purge unused files.
Sizing Analysis
The basic goal is to size the machine so that all jobs can complete within the specified
load window. You should consider the answers to the questions in the "Environment"
and "Engine Sizing" sections to estimate the required number of sessions, the volume
of data that each session moves, and its lookup table, aggregation, and heterogeneous
join caching requirements. Use these estimates with the recommendations in the
"Engine Resource Consumption" section to determine the required number of
processors, memory, and disk space to achieve the required performance to meet the
load window.
Note that the deployment environment often creates performance constraints that
hardware capacity cannot overcome. The engine throughput is usually constrained by
one or more of the environmental factors addressed by the questions in the
"Environment" section. For example, if the data sources and target are both remote
from the PowerCenter machine, the network is often the constraining factor. At some
point, additional sessions, processors, and memory may not yield faster execution
because the network (not the PowerCenter services) imposes the performance limit.
The hardware sizing analysis is highly dependent on the environment in which the
server is deployed. You need to understand the performance characteristics of the
environment before making any sizing conclusions.
It is also vitally important to remember that other applications (in addition to
PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a
database engine and query/analysis tools. In fact, in an environment where
PowerCenter, the target database, and query/analysis tools all run on the same
machine, the query/analysis tool often drives the hardware requirements. However, if
the loading is performed after business hours, the query/analysis tools requirements
may not be a sizing limitation.
Last updated: 01-Feb-07 18:54
PowerCenter Admin Console
Challenge
Using the PowerCenter Administration Console to administer PowerCenter domain and
services.
Description
PowerCenter has a service-oriented architecture that provides the ability to scale
services and share resources across multiple machines. The PowerCenter domain is
the fundamental administrative unit in PowerCenter. A domain is a collection of nodes
and services that you can group in folders based on administration ownership.
The Administration Console consolidates administrative tasks for domain objects such
as services, nodes, licenses, and grids. For more information on domain configuration,
refer to the Best Practice on Domain Configuration.
Folders and Security
It is a good practice to create folders in the domain in order to organize objects
and manage security. Folders can contain nodes, services, grids, licenses and other
folders. Folders can be created based on functionality type, object type, or environment
type.
G Functionality-type folders group services based on a functional area such as
Sales or Marketing.
G Object-type folders group objects based on the service type. For
example, an Integration Services folder.
G Environment-type folders group objects based on the environment. For
example, if you have development and testing on the same domain, group the
services according to the environment.
Create user accounts in the admin console, then set permissions and privileges on the
folders the users need access to. It is a good practice for the Administrator to monitor
the user activity in the domain periodically and save the reports for audit purposes.
Nodes, Services, and Grids
A node is the logical representation of a machine in a domain. One node in the domain
acts as a gateway to receive service requests from clients and route them to the
appropriate service and node. Node properties can be set and modified using the
admin console. It is important to note that the property to set the maximum session/
tasks to run is Maximum Processes. Set this threshold to a maximum number; for
example, 200 is a good threshold. If you are using Adaptive Dispatch mode it is a good
practice to recalculate the CPU profile when the node is idle since it uses 100 percent
of the CPU.
The admin console also allows you to manage application services. You can access
properties of all services in one window using the admin console. For more
information on configuring the properties, refer to the Best Practice on Advanced Server
Configuration Options.
In addition, you can create grids and assign nodes to them using the admin console.
Last updated: 01-Feb-07 18:54
Understanding and Setting UNIX Resources for PowerCenter
Installations
Challenge
This Best Practice explains what UNIX resource limits are, and how to control and manage them.
Description
UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding
and setting these resources correctly is essential for PowerCenter installations.
Understanding UNIX Resource Limits
UNIX systems impose limits on several different resources. The resources that can be limited depend on the actual
operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the operating system. In general, all UNIX systems
implement per-process limits on the following resources. There may be additional resource limits, depending on the
operating system.
Resource Description
Processor time The maximum amount of processor time that can be used by a process, usually in seconds.
Maximum file size The size of the largest single file a process can create. Usually specified in blocks of 512
bytes.
Process data The maximum amount of data memory a process can allocate. Usually specified in KB.
Process stack The maximum amount of stack memory a process can allocate. Usually specified in KB.
Number of open files The maximum number of files that can be open simultaneously.
Total virtual memory The maximum amount of memory a process can use, including stack, instructions, and data.
Usually specified in KB.
Core file size The maximum size of a core dump file. Usually specified in blocks of 512 bytes.
These limits are implemented on an individual process basis. The limits are also inherited by child processes when they are
created.
In practice, this means that the resource limits are typically set at log-on time, and apply to all processes started from the log-
in shell. In the case of PowerCenter, any limits in effect before the Integration Service is started also apply to all sessions
(pmdtm) started from that node. Any limits in effect when the Repository Service is started also apply to
all pmrepagents started from that repository service (repository service process is an instance of the repository service
running on a particular machine or node).
When a process exceeds its resource limit, UNIX fails the operation that caused the limit to be exceeded. Depending on the
limit that is reached, memory allocations fail, files can't be opened, and processes are terminated when they exceed their
processor time.
Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of
memory, it is important to set resource limits correctly to prevent the operating system from limiting access to required
resources, while preventing problems.
Hard and Soft Limits
Each resource that can be limited actually allows two limits to be specified: a soft limit and a hard limit. Hard and soft
limits can be confusing.
From a practical point of view, the difference between hard and soft limits doesn't matter to PowerCenter or any other
process; the lower value is enforced when it is reached, whether it is a hard or soft limit.
The difference between hard and soft limits really only matters when changing resource limits. The hard limits are the
absolute maximums set by the System Administrator that can only be changed by the System Administrator. The soft limits
are recommended values set by the System Administrator, and can be increased by the user, up to the maximum limits.
UNIX Resource Limit Commands
The standard interface to UNIX resource limits is the ulimit shell command. This command displays and sets resource
limits. The C shell implements a variation of this command called limit, which has different syntax but the same functions.
G ulimit -a Displays all soft limits
G ulimit -a -H Displays all hard limits in effect
Recommended ulimit settings for a PowerCenter server:
Resource Description
Processor time Unlimited. This is needed for the pmserver and pmrepserver that run forever.
Maximum file size Based on what's needed for the specific application. This is an important parameter to keep a
session from filling a whole filesystem, but needs to be large enough to not affect normal
production operations.
Process data 1GB to 2GB
Process stack 32MB
Number of open files At least 256. Each network connection counts as a file so source, target, and repository
connections, as well as cache files all use file handles.
Total virtual memory The largest expected size of a session. 1Gig should be adequate, unless sessions are
expected to create large in-memory aggregate and lookup caches that require more memory.
If you have sessions that are likely to require more than 1Gig, set the Total virtual memory
appropriately. Remember that on a 32-bit OS, the maximum virtual memory for a session is
2Gigs.
Core file size Unlimited, unless disk space is very tight. The largest core files can be ~2-3GB, but after
analysis they should be deleted, and there really shouldn't be multiple core files lying around.
Setting Resource Limits
Resource limits are normally set in the log-in script, either .profile for the Korn shell or .bash_profile for the bash shell. One
ulimit command is required for each resource being set, and usually the soft limit is set. A typical sequence is:
ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited
after running this, the limits are changed:
% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited
Setting or Changing Hard Resource Limits
Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set the initial hard limits in
the file /etc/profile, which must be changed by a System Administrator. In some cases, it is necessary to run a system utility
such as smit on AIX to change the global system limits.
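On systems that do use /etc/profile for this purpose, the entries look much like the user-level examples above but with the -H flag; the values below are only illustrative, and the exact mechanism varies by platform, so confirm it with your System Administrator:

ulimit -H -c unlimited
ulimit -H -n 4096
ulimit -H -s 65536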
Last updated: 01-Feb-07 18:54
PowerExchange CDC for Oracle
Challenge
Configuration of the Oracle environment for optimal performance of PowerExchange
Change Data Capture in production environments.
Description
The performance of PowerExchange CDC on Oracle databases is dependent upon a
variety of factors, including:
G The type of connection that PowerExchange has to the Oracle database being
captured.
G The amount of data that is being written to the Oracle redo logs.
G The workload of the server where the Oracle database being captured resides.
Connection Type
Ensure that wherever possible PowerExchange has a connection type of Local mode to
the source database. Connections over slow networks and via SQL*Net should be
avoided.
Volume of Data
The volume of data that the Oracle Log Miner has to process in order to provide
changed data to PowerExchange has a significant impact upon performance. Bear in
mind that other processes may be writing large volumes of data to the Oracle redo
logs, in addition to the changed data rows. These include, but are not restricted to:
G Oracle Catalog dumps.
G Oracle Workload monitor customizations.
G Other (non-Oracle) tools that use the redo logs to provide proprietary
information.
In order to optimize the PowerExchange CDC performance, the amount of data these
processes write to the Oracle redo logs needs to be minimized, both in terms of volume
and frequency. Review the processes that are actively writing data to the Oracle redo
logs and tune them within the context of a production environment.
For example, is it strictly necessary to perform a Catalog dump every 30 minutes? In a
production environment, schema changes are less frequent than in a development
environment where Catalog dumps may be needed at this frequency.
Server Workload
Optimize the performance of the Oracle database server by reducing the number of
unnecessary tasks it is performing concurrently with the PowerExchange CDC
components. This may include a full review of the scheduling of backups and restores,
Oracle import and export processing, and other application software utilized within the
production server environment.
Last updated: 01-Feb-07 18:54
PowerExchange Installation (for
Mainframe)
Challenge
Installing and configuring a PowerExchange listener on a mainframe, ensuring that the
process is both efficient and effective.
Description
PowerExchange installation is very straightforward and can generally be accomplished
in a timely fashion. When considering a PowerExchange installation, be sure that the
appropriate resources are available. These include, but are not limited to:
G MVS systems operator
G Appropriate database administrator; this depends on what (if any) databases
are going to be sources/and or targets (e.g., IMS, IDMS, etc.).
G MVS Security resources
Be sure to adhere to the sequence of the following steps to successfully install
PowerExchange. Note that in this very typical scenario, the mainframe source data is
going to be pulled across to a server box.
1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
Complete the PowerExchange Pre-install Checklist and Obtain Valid
License Keys
Reviewing the environment and recording the information in a detailed checklist
facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed
in the Documentation Folder when the PowerExchange software is installed. It is also
available within the client from the PowerExchange Program Group. Be sure to
complete all relevant sections.
You will need a valid license key in order to run any of the PowerExchange
components. This is a 44-byte key that uses hyphens every 4 bytes. For example:
1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1
The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F).
Keys are valid for a specific time period and are also linked to an exact or generic TCP/
IP address. They also control access to certain databases and determine if the
PowerCenter Mover can be used. You cannot successfully install PowerExchange
without a valid key for all required components.
Note: When copying software from one machine to another, you may encounter
license key problems since the license key is IP specific. Be prepared to deal with this
eventuality, especially if you are going to a backup site for disaster recovery testing.
Install PowerExchange on the Mainframe
Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming
convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD to this
directory. Double click the file to unzip its contents into this directory.
Step 2: Create the PDS HLQ.PWXVxxx.RUNLIB and
HLQ.PWXVxxx.BINLIB on the mainframe in order to pre-allocate the needed
libraries. Ensure sufficient space for the required jobs/tasks by setting the
cylinders to 150 and directory blocks to 50.
Step 3: Run the MVS_Install file. This displays the MVS Install Assistant.
Configure the IP Address, Logon ID, Password, HLQ, and Default volume
setting on the display screen. Also, enter the license key.
Click the Custom buttons to configure the desired data sources.
Be sure that the HLQ on this screen matches the HLQ of the allocated
RUNLIB (from step 2).
Save these settings and click Process. This creates the JCL libraries
and opens the following screen to FTP these libraries to MVS. Click
XMIT to complete the FTP process.
Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g.,
execution class, message class, etc.)
Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and
SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with
return code 0 (success) or 1, and a list of the needed installation jobs can be
found in the XJOBS member.
Start The PowerExchange Jobs/Tasks on the Mainframe
The installed PowerExchange Listener can be run as a normal batch job or as a started
task. Informatica recommends that it initially be submitted as a batch job: RUNLIB
(STARTLST).
It should return: DTL-00607 Listener VRM x.x.x Build Vxxx_P0x started.
If implementing change capture, start the PowerExchange Agent (as a started task):
/S DTLA
It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.
Install The PowerExchange Client (Navigator) on a Workstation
Step 1: Run the Windows or UNIX installation file in the software folder on the
installation CD and follow the prompts.
Step 2: Enter the license key.
Step 3: Follow the wizard to complete the install and reboot the machine.
Step 4: Add a node entry to the configuration file \Program Files\Informatica
\Informatica Power Exchange\dbmover.cfg to point to the Listener on the
mainframe.
node = (mainframe location name, TCPIP, mainframe IP address, 2480)
Test Connectivity to the Mainframe from the Workstation
Ensure communication to the PowerExchange Listener on the mainframe by entering
the following in DOS on the workstation:
DTLREXE PROG=PING LOC=mainframe location or nodename in dbmover.cfg
It should return: DLT-00755 DTLREXE Command OK!
Install PowerExchange on the UNIX Server
Step 1: Create a user for the PowerExchange installation on the UNIX box.
Step 2: Create a UNIX directory /opt/inform/pwxvxxxp0x.
Step 3: FTP the file \software\Unix\dtlxxx_vxxx.tar on the installation CD to
the pwx installation directory on UNIX.
Step 4: Use the UNIX tar command to extract the files. The command is tar
xvf pwxxxx_vxxx.tar.
Step 5: Update the logon profile with the correct path, library path, and
home environment variables.
Step 6: Update the license key file on the server.
Step 7: Update the configuration file on the server (dbmover.cfg) by adding a
node entry to point to the Listener on the mainframe.
Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC,
update the odbc.ini file on the server by adding data source entries that point to
PowerExchange-accessed data:
[pwx_mvs_db2]
DRIVER=<install dir>/libdtlodbc.so
DESCRIPTION=MVS DB2
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T
Test Connectivity to the Mainframe from the Server
Ensure communication to the PowerExchange Listener on the mainframe by entering
the following on the UNIX server:
DTLREXE PROG=PING LOC=mainframe location
It should return: DLT-00755 DTLREXE Command OK!
Changed Data Capture
There is a separate manual for each type of change data capture adapter. This
manual contains the specifics on the following general steps. You will need
to understand the appropriate adapter guide to ensure success.
Step 1: APF authorize the .LOAD and the .LOADLIB libraries. This is required
for external security.
Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site
PROCLIB.
Step 3: After the Agent has been started, run job SETUP2.
Step 4: Create an active registration for a table/segment/record in Navigator
that is set up for changes.
Step 5: Start the ECCR.
Step 6: Issue a change to the table/segment/record that you registered in
Navigator.
Step 7: Perform an extraction map row test in Navigator.
Assessing the Business Case
Challenge
Assessing the business case for a project must consider both the tangible and
intangible potential benefits. The assessment should also validate the benefits and
ensure they are realistic to the Project Sponsor and Key Stakeholders to
ensure project funding.
Description
A Business Case should include both qualitative and quantitative measures of potential
benefits.
The Qualitative Assessment portion of the Business Case is based on the Statement
of Problem/Need and the Statement of Project Goals and Objectives (both generated in
Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the
project beneficiaries regarding the expected benefits in terms of problem alleviation,
cost savings or controls, and increased efficiencies and opportunities.
Many qualitative items are intangible, but you may be able to cite examples of the
potential costs or risks if the system is not implemented. An example may be the cost
of bad data quality resulting in the loss of a key customer or an invalid analysis
resulting in bad business decisions. Risk factors may be classified as business,
technical, or execution in nature. Examples of these risks are uncertainty of value or
the unreliability of collected information, new technology employed, or a major change
in business thinking for personnel executing change.
It is important to identify an estimated value added or cost eliminated to strengthen the
business case. The better the definition of the factors, the greater the value to the business
case.
The Quantitative Assessment portion of the Business Case provides specific
measurable details of the proposed project, such as the estimated ROI. This may
involve the following calculations:
G Cash flow analysis- Projects positive and negative cash flows for the
anticipated life of the project. Typically, ROI measurements use the cash flow
formula to depict results.
G Net present value - Evaluates cash flow according to the long-term value of
current investment. Net present value shows how much capital needs to be
invested currently, at an assumed interest rate, in order to create a stream of
payments over time. For instance, to generate an income stream of $500 per
month over six months at an interest rate of eight percent would require an
investment (i.e., a net present value) of $2,311.44 (a worked calculation follows this list).
G Return on investment - Calculates net present value of total incremental cost
savings and revenue divided by the net present value of total costs multiplied
by 100. This type of ROI calculation is frequently referred to as return-on-
equity or return-on-capital.
G Payback Period - Determines how much time must pass before an initial
capital investment is recovered.
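To illustrate the net present value figure above, here is the standard present-value-of-an-annuity calculation; note that it only yields $2,311.44 if the eight percent rate is treated as applying per payment period:

PV = PMT x (1 - (1 + i)^-n) / i
   = 500 x (1 - (1.08)^-6) / 0.08
   = 500 x 4.6229
   = 2,311.44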
The following are steps to calculate the quantitative business case or ROI:
Step 1 Develop Enterprise Deployment Map. This is a model of the project phases
over a timeline, estimating as specifically as possible participants, requirements, and
systems involved. A data integration or migration initiative or amendment may require
estimating customer participation (e.g., by department and location), subject area and
type of information/analysis, numbers of users, numbers and complexity of target data
systems (data marts or operational databases, for example) and data sources, types of
sources, and size of data set. A data migration project may require customer
participation, legacy system migrations, and retirement procedures. The types of
estimations vary by project types and goals. It is important to note that the more details
you have for estimations, the more precise your phased solutions are likely to be. The
scope of the project should also be made known in the deployment map.
Step 2 Analyze Potential Benefits. Discussions with representative managers and
users or the Project Sponsor should reveal the tangible and intangible benefits of the
project. The most effective format for presenting this analysis is often a "before" and
"after" format that compares the current situation to the project expectations, Include in
this step, costs that can be avoided by the deployment of this project.
Step 3 Calculate Net Present Value for all Benefits. Information gathered in this
step should help the customer representatives to understand how the expected
benefits are going to be allocated throughout the organization over time, using the
enterprise deployment map as a guide.
Step 4 Define Overall Costs. Customers need specific cost information in order to
assess the dollar impact of the project. Cost estimates should address the following
fundamental cost components:
G Hardware
G Networks
G RDBMS software
G Back-end tools
G Query/reporting tools
G Internal labor
G External labor
G Ongoing support
G Training
Step 5 Calculate Net Present Value for all Costs. Use either actual cost estimates
or percentage-of-cost values (based on cost allocation assumptions) to calculate costs
for each cost component, projected over the timeline of the enterprise deployment map.
Actual cost estimates are more accurate than percentage-of-cost allocations, but much
more time-consuming. The percentage-of-cost allocation process may be valuable for
initial ROI snapshots until costs can be more clearly predicted.
Step 6 Assess Risk, Adjust Costs and Benefits Accordingly. Review potential
risks to the project and make corresponding adjustments to the costs and/or benefits.
Some of the major risks to consider are:
G Scope creep, which can be mitigated by thorough planning and tight project
scope.
G Integration complexity, which may be reduced by standardizing on vendors
with integrated product sets or open architectures.
G Architectural strategy that is inappropriate.
G Current support infrastructure may not meet the needs of the project.
G Conflicting priorities may impact resource availability.
G Other miscellaneous risks from management or end users who may withhold
project support; from the entanglements of internal politics; and from
technologies that don't function as promised.
G Unexpected data quality, complexity, or definition issues often are discovered
late in the course of the project and can adversely affect effort, cost, and
schedule. This can be somewhat mitigated by early source analysis.
Step 7 Determine Overall ROI. When all other portions of the business case are
complete, calculate the project's "bottom line". Determining the overall ROI is simply a
matter of subtracting net present value of total costs from net present value of (total
incremental revenue plus cost savings).
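Expressed as a formula consistent with the definitions earlier in this Best Practice:

Bottom line = NPV(total incremental revenue + cost savings) - NPV(total costs)
Overall ROI (%) = Bottom line / NPV(total costs) x 100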
Final Deliverable
The final deliverable of this phase of development is a complete business case that
documents both tangible (quantified) and intangible (non-quantified, but estimated)
benefits and risks, to be presented to the Project Sponsor and Key Stakeholders. This
allows them to review the Business Case in order to justify the development effort.
If your organization has the concept of a Project Office that provides governance
for projects and priorities, much of this is often part of the original Project Charter, which
states items such as scope, initial high-level requirements, and key project stakeholders.
However, developing a full Business Case can validate any initial analysis and provide
additional justification. Additionally, the Project Office should provide guidance in
building and communicating the Business Case.
Once completed, the Project Manager is responsible for scheduling the review and
socialization of the Business Case.


Last updated: 01-Feb-07 18:54
Defining and Prioritizing Requirements
Challenge
Defining and prioritizing business and functional requirements is often accomplished
through a combination of interviews and facilitated meetings (i.e., workshops) between
the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.
Requirements need to be gathered from business users who currently use and/or have
the potential to use the information being assessed. All input is important since the
assessment should encompass an enterprise view of the data rather than a limited
functional, departmental, or line-of-business view.
Types of specific detailed data requirements gathered include:
G Data names to be assessed
G Data definitions
G Data formats and physical attributes
G Required business rules including allowed values
G Data usage
G Expected quality levels
By gathering and documenting some of the key detailed data requirements, a solid
understanding of the business rules involved is reached. Certainly, all elements can't be
analyzed in detail, but this helps in getting to the heart of the business system so you are
better prepared when speaking with business and technical users.
Description
The following steps are key for successfully defining and prioritizing requirements:
Step 1: Discovery
Gathering business requirements is one of the most important stages of any data
integration project. Business requirements affect virtually every aspect of the data
integration project starting from Project Planning and Management to End-User
Application Specification. They are like a hub that sits in the middle and touches the
various stages (spokes) of the data integration project. There are two basic techniques
for gathering requirements and investigating the underlying operational data: interviews
and facilitated sessions.
Data Profiling
Informatica Data Explorer (IDE) is an automated data profiling and analysis software
product that can be extremely beneficial in defining and prioritizing requirements. It
provides a detailed description of data content, structure, rules, and quality by profiling
the actual data that is loaded into the product.
Some industry examples of why data profiling is crucial prior to beginning the
development process are:
G Cost of poor data quality is 15 to 25 percent of operating profit.
G Poor data management is costing global business $1.4 billion a year.
G 37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.
G Using a data profiling tool can lower the risk and cost of the project and increase the chances of success.
G Data profiling reports can be posted to a central location where all team members can review results and track accuracy.
IDE provides the ability to promote collaboration through tags, notes, action items,
transformations, and rules. Profiling the information up front sets the framework for an
effective interview process with business and technical users.
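Where IDE is not yet in place, even a simple column profile can seed the interview questionnaire. The sketch below is a generic Python/pandas illustration; the file name and the statistics chosen are assumptions, and it only approximates the kind of content statistics a profiling tool reports, not the IDE product itself.

# Generic column-profiling sketch (illustrative only - not the IDE product).
# The source file name is an assumption for the example.
import pandas as pd

df = pd.read_csv("customer_extract.csv", dtype=str)

profile = pd.DataFrame({
    "rows": len(df),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
    "max_length": df.apply(lambda col: col.dropna().str.len().max()),
})
profile["null_pct"] = (profile["nulls"] / profile["rows"] * 100).round(1)

# Anomalies such as unexpected nulls or wildly varying lengths become
# concrete questions for the business and IT interviews.
print(profile.sort_values("null_pct", ascending=False))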
Interviews
By conducting interview research before starting the requirements gathering process,
you can categorize interviewees into functional business management and Information
Technology (IT) management. This, in conjunction with effective data profiling, helps
to establish a comprehensive set of business requirements.
Business Interviewees. Depending on the needs of the project, even though you may
be focused on a single primary business area, it is always beneficial to interview
horizontally to achieve a good cross-functional perspective of the enterprise. This also
provides insight into how extensible your project is across the enterprise.
Before you interview, be sure to develop an interview questionnaire based upon
profiling results, as well as business questions; schedule the interview time and place;
and prepare the interviewees by sending a sample agenda. When interviewing
business people, it is always important to start with the upper echelons of management
so as to understand the overall vision, assuming you have the business background,
confidence and credibility to converse at those levels.
If not adequately prepared, the safer approach is to interview middle management. If
you are interviewing across multiple teams, you might want to scramble interviews
among teams. This way if you hear different perspectives from finance and marketing,
you can resolve the discrepancies with a scrambled interview schedule. Keep in mind
that the business is sponsoring the data integration project and will be the end user of
the application. The business will decide the success criteria of your data integration
project and determine future sponsorship. Questioning during these sessions should
include the following:
G Who are the stakeholders for this milestone delivery (IT, field business
analysts, executive management)?
G What are the target business functions, roles, and responsibilities?
G What are the key relevant business strategies, decisions, and processes (in
brief)?
G What information is important to drive, support, and measure success for
those strategies/processes? What key metrics? What dimensions for those
metrics?
G What current reporting and analysis is applicable? Who provides it? How is it
presented? How is it used? How can it be improved?
IT interviewees. The IT interviewees have a different flavor than the business user
community. Interviewing the IT team is generally very beneficial because it is
composed of data gurus who deal with the data on a daily basis. They can provide
great insight into data quality issues, help in the systematic exploration of legacy source
systems, and aid in understanding business user needs around critical reports. If you are
developing a prototype, they can help get things done quickly and address important
business reports. Questioning during these sessions should include the following:
G Request an overview of existing legacy source systems. How does data currently flow from these systems to the users?
G What day-to-day maintenance issues does the operations team encounter with
these systems?
G Ask for their insight into data quality issues.
G What business users do they support? What reports are generated on a daily,
weekly, or monthly basis? What are the current service level agreements for
these reports?
G How can the DI project support the IS department's needs?
G Review data profiling reports and analyze the anomalies in the data. Note and
record each of the comments from the more detailed analysis. What are the
key business rules involved in each item?
Facilitated Sessions
Facilitated sessions - known sometimes as JAD (Joint Application Development) or
RAD (Rapid Application Development) - are ways to work as a group of business and
technical users to capture the requirements. This can be very valuable in gathering
comprehensive requirements and building the project team. The difficulty is the amount
of preparation and planning required to make the session a pleasant and worthwhile experience.
Facilitated sessions provide quick feedback by gathering all the people from the various
teams into a meeting and initiating the requirements process. You need a facilitator
who is experienced in these meetings to ensure that all the participants get a chance to
speak and provide feedback. During individual (or small group) interviews with high-
level management, there is often focus and clarity of vision that may be hindered in
large meetings. Thus, it is extremely important to encourage all attendees to
participate and minimize a small number from dominating the requirement process.
A challenge of facilitated sessions is matching everyones busy schedules and actually
getting them into a meeting room. However, this part of the process must be focused
and brief or it can become unwieldy with too much time expended just trying to
coordinate calendars among worthy forum participants. Set a time period and target list
of participants with the Project Sponsor, but avoid lengthening the process if some
participants aren't available. Questions asked during facilitated sessions are similar to
the questions asked of business and IS interviewees.
Step 2: Validation and Prioritization
The Business Analyst, with the help of the Project Architect, documents the findings of
the discovery process after interviewing the business and IT management. The next
step is to define the business requirements specification. The resulting Business
Requirements Specification includes a matrix linking the specific business requirements
to their functional requirements.
Defining the business requirements is a time-consuming process and should be
facilitated by forming a working group. A working group usually consists of
business users, business analysts, the project manager, and other individuals who can
help to define the business requirements. The working group should meet weekly to
define and finalize business requirements. The working group helps to:
G Design the current state and future state
G Identify supply format and transport mechanism
G Identify required message types
G Develop Service Level Agreement(s), including timings
G Identify supply management and control requirements
G Identify common verifications, validations, business validations and
transformation rules
G Identify common reference data requirements
G Identify common exceptions
G Produce the physical message specification
At this time also, the Architect develops the Information Requirements Specification to
clearly represent the structure of the information requirements. This document, based
on the business requirements findings, can facilitate discussion of informational details
and provide the starting point for the target model definition.
The detailed business requirements and information requirements should be reviewed
with the project beneficiaries and prioritized based on business need and the stated
project objectives and scope.
Step 3: The Incremental Roadmap
Concurrent with the validation of the business requirements, the Architect begins the
Functional Requirements Specification providing details on the technical requirements
for the project.
As general technical feasibility is compared to the prioritization from Step 2, the Project
Manager, Business Analyst, and Architect develop consensus on a project "phasing"
approach. Items of secondary priority and those with poor near-term feasibility are
relegated to subsequent phases of the project. Thus, they develop a phased, or
incremental, "roadmap" for the project (Project Roadmap).
Final Deliverable
The final deliverable of this phase of development is a complete list of business
requirements, a diagram of current and future state, and a list of high-level business
rules affected by the requirements that will drive the change from current to future state.
This provides the development team with much of the information needed to begin the
design effort for the system modifications. Once completed, the Project Manager is
responsible for scheduling the review and socialization of the requirements and plan to
achieve sign-off on the deliverable.
This is presented to the Project Sponsor for approval and becomes the first "increment"
or starting point for the Project Plan.


Last updated: 01-Feb-07 18:54
Developing a Work Breakdown Structure (WBS)
Challenge
Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required
for a data integration project. Many times, items such as full analysis, testing, or even specification
development are underestimated, creating a sense of false optimism for the project. The WBS clearly depicts all of the
various tasks and subtasks required to complete a project. Most project time and resource estimates are
supported by the WBS. A thorough, accurate WBS is critical for effective monitoring and also
facilitates communication with project sponsors and key stakeholders.
Description
The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of
related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various
resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves
as a starting point as well as a monitoring tool for the project.
One challenge in developing a thorough WBS is striking the correct balance between sufficient detail
and too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break
the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of
at least a day. It is also important to maintain a consistent level of detail across the project.
A well-designed WBS can be extracted at a higher level to communicate overall project progress, as shown
in the following sample. The actual WBS for the project manager may, for example, be a level of detail
deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll
up a level or two to make things clearer.
Plan | % Complete | Budget Hours | Actual Hours
Architecture - Set up of Informatica Environment | 82% | 167 | 137
Develop analytic solution architecture | 46% | 28 | 13
Design development architecture | 59% | 32 | 19
Customize and implement Iterative Framework | | |
Data Profiling | 100% | 32 | 32
Legacy Stage | 150% | 10 | 15
Pre-Load Stage | 150% | 10 | 15
Reference Data | 128% | 18 | 23
Reusable Objects | 56% | 27 | 15
Review and signoff of Architecture | 50% | 10 | 5
Analysis - Target-to-Source Data Mapping | 48% | 1000 | 479
Customer (9 tables) | 87% | 135 | 117
Product (7 tables) | 98% | 215 | 210
Inventory (3 tables) | 0% | 60 | 0
Shipping (3 tables) | 0% | 60 | 0
Invoicing (7 tables) | 0% | 140 | 0
Orders (13 tables) | 37% | 380 | 140
Review and signoff of Functional Specification | 0% | 10 | 0
Total Architecture and Analysis | 52% | 1167 | 602
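A roll-up like the sample above is easy to automate. The sketch below is an illustrative Python example with a few task rows hard-coded from the sample; it assumes the reported percent complete is derived from actual versus budgeted hours, which is how the sample figures appear to line up, so adjust the formula if your organization measures progress differently.

# Roll up budget and actual hours to a management-level summary.
# Task data is taken from the sample table; the percent-complete formula
# (actual / budget) is an assumption based on how the sample figures align.
tasks = [
    ("Architecture", "Develop analytic solution architecture", 28, 13),
    ("Architecture", "Design development architecture",        32, 19),
    ("Architecture", "Data Profiling",                         32, 32),
    ("Analysis",     "Customer (9 tables)",                   135, 117),
    ("Analysis",     "Product (7 tables)",                    215, 210),
    ("Analysis",     "Orders (13 tables)",                    380, 140),
]

summary = {}
for phase, _task, budget, actual in tasks:
    totals = summary.setdefault(phase, {"budget": 0, "actual": 0})
    totals["budget"] += budget
    totals["actual"] += actual

for phase, totals in summary.items():
    pct = totals["actual"] / totals["budget"] * 100
    print(f'{phase:<15} {pct:5.0f}%  budget {totals["budget"]:>5}  actual {totals["actual"]:>5}')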
A fundamental question is whether to include activities as part of a WBS. The following statements are
generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving
this question.
G The project manager should have the right to decompose the WBS to whatever level of detail he
or she requires to effectively plan and manage the project. The WBS is a project management tool
that can be used in different ways, depending upon the needs of the project manager.
G The lowest level of the WBS can be activities.
G The hierarchical structure should be organized by deliverables and milestones with process steps detailed within it. The WBS can be structured from a process or life cycle basis (i.e., the accepted concept of Phases), with non-deliverables detailed within it.
G At the lowest level in the WBS, an individual should be identified and held accountable for the result. This person should be an individual contributor, creating the deliverable personally, or a manager who will in turn create a set of tasks to plan and manage the results.
G The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks.
H For example, consider multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements.
H It is important to remember that a task is not complete until all of its corresponding subtasks are completed - whether sequentially or in parallel. For example, the Build Phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy Phase long before the Build Phase is complete.
The Project Plan provides a starting point for further development of the project WBS. This sample is a
Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the
Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to
ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or
adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the
development effort.
If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown
Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other
project management tools, simplifying the effort of developing the WBS.
Sometimes it is best to build an initial task list and timeline using a facilitator with the project team. The
project manager can act as the facilitator or can appoint one, freeing up the project
manager and enabling team members to focus on determining the actual tasks and effort needed.
Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams
creating their own project plans. The overall project manager then brings the plans together into a master
project plan. This group of projects can be defined as a program and the project manager and project
architect manage the interaction among the various development teams.
Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses;
new information becomes available; scope, resources and priorities change; deliverables are (or are not)
completed on time, etc. The process of estimating and modifying the plan should be repeated many times
throughout the project. Even initial planning is likely to take several iterations to gather enough
information. Significant changes to the project plan become the basis to communicate with the project
sponsor(s) and/or key stakeholders with regard to decisions to be made and priorities rearranged. The goal
of the project manager is to be non-biased toward any decision, but to place the responsibility with the
sponsor to shape direction.
Approaches to Building WBS Structures: Waterfall vs. Iterative
Data integration projects differ somewhat from other types of development projects, although they also
share some key attributes. The following list summarizes some unique aspects of data integration projects:
G Business requirements are less tangible and predictable than in OLTP (online transactional
processing) projects.
G Database queries are very data intensive, involving few or many tables, but with many, many
rows. In OLTP, transactions are data selective, involving few or many tables and comparatively
few rows.
G Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a data integration project (e.g., warehouse or common data management, etc.), metadata and traceability are much more critical.
G Data integration projects, like all development projects, must be managed. To manage them, they must follow a clear plan. Data integration project managers often have a more difficult job than those managing OLTP projects because there are so many pieces and sources to manage.
Two purposes of the WBS are to manage work and ensure success. Although this is the same as any
project, data integration projects are unlike typical waterfall projects in that they are based on an iterative
approach. Three of the main principles of iteration are as follows:
G Iteration. Division of work into small chunks of effort using lessons learned from earlier iterations.
G Time boxing. Delivery of capability in short intervals, with the first release typically requiring from three to nine months (depending on complexity) and quarterly releases thereafter.
G Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third of the way through.
Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The
danger is that projects can iterate or spiral out of control.
The three principles listed above are very important because even the best data integration plans are
likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a
fully detailed plan, is a large common data management project that gathers all requirements upfront and
delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all
requirements upfront" and the "all-at-once in three years."
Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles.
The feedback you can gather from increment to increment is critical to the success of the future
increments. The benefit is that such incremental deliveries establish patterns for development that can be
used and leveraged for future deliveries.
What is the Correct Development Approach?
The correct development approach is usually dictated by corporate standards and by departments such as
the Project Management Office (PMO). Regardless of the development approach chosen, high-level
phases typically include planning the project; gathering data requirements; developing data models;
designing and developing the physical database(s); developing the source, profile, and map data; and
extracting, transforming, and loading the data. Lower-level planning details are typically carried out by the
project manager and project team leads.
Preparing the WBS
The WBS can be prepared using manual or automated techniques, or a combination of the two.
In many cases, a manual technique is used to identify and record the high-level phases and tasks, then the
information is transferred to project tracking software such as Microsoft Project. Project team members
typically begin by identifying the high-level phases and tasks, writing the relevant information on large
sticky notes or index cards, then mounting the notes or cards on a wall or whiteboard. Use one sticky note or
card per phase or task so that you can easily rearrange them as the project order evolves. As the
project plan progresses, you can add information to the cards or notes to flesh out the details, such as task
owner, time estimates, and dependencies. This information can then be fed into the project tracking
software.
Once you have a fairly detailed methodology, you can enter the phase and task information into your
project tracking software. When the project team is assembled, you can enter additional tasks and details
directly into the software. Be aware, however, that the project team can better understand a project and its
various components if they actually participate in the high-level development activities, as they do in the
manual approach. Using software alone, without input from relevant project team members, to designate
phases, tasks, dependencies, and timelines can be difficult and prone to errors and omissions.
Benefits of developing the project timeline manually, with input from team members include:
G Tasks, effort and dependencies are visible to all team members.
G Team has a greater understanding of and commitment to the project.
G Team members have an opportunity to work with each other and set the foundation. This is particularly important if the team is geographically dispersed and cannot work face-to-face throughout much of the project.
How Much Descriptive Information is Needed?
The project plan should incorporate a thorough description of the project and its goals. Be sure to review
the business objectives, constraints, and high-level phases but keep the description as short and simple as
possible. In many cases, a verb-noun form works well (e.g., interview users, document requirements, etc.).
After you have described the project on a high-level, identify the tasks needed to complete each phase. It is
often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide narrative
for each task or subtask. In general, decompose the tasks until they have a rough durations of two to 20
days.
Remember to break down the tasks only to the level of detail that you are willing to track. Include key
checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones
(e.g., requirements completed, data model completed, etc.).
Assigning and Delegating Responsibility
Identify a single owner for each task in the project plan. Although other resources may help to complete the
task, the individual who is designated as the owner is ultimately responsible for ensuring that the task, and
any associated deliverables, is completed on time.
After the WBS is loaded into the selected project tracking software and refined for the specific project
requirements, the Project Manager can begin to estimate the level of effort involved in completing each of
the steps. When the estimate is complete, the project manager can assign individual resources and
INFORMATICA CONFIDENTIAL BEST PRACTICE 668 of 702
prepare a project schedule. The end result is the Project Plan. Refer to Developing and Maintaining the
Project Plan for further information about the project plan.
Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan
updated throughout the project.


Last updated: 09-Feb-07 16:29
Developing and Maintaining the Project Plan
Challenge
The challenge of developing and maintaining a project plan is to incorporate all of the necessary components while
retaining the flexibility necessary to accommodate change.
A two-fold approach is required to meet the challenge:
1. A project that is clear in scope contains the following elements:
G A designated begin and end date.
G Well-defined business and technical requirements.
G Adequately assigned resources.
Without these components, the project is subject to slippage and to setting incorrect expectations with the Project Sponsor.
2. Project Plans are subject to revision and change throughout the project. It is imperative to establish a
communication plan with the Project Sponsor; such communication may involve a weekly status report of
accomplishments and/or a report on issues and plans for the following week. This type of forum is very helpful in
involving the Project Sponsor in actively making decisions regarding changes in scope or timeframes.
If your organization has the concept of a Project Office that provides governance for the project and priorities, look for
a Project Charter that contains items like scope, initial high-level requirements, and key project stakeholders. Additionally,
the Project Office should provide guidance in funding and resource allocation for key projects.
Informatica's PowerCenter and Data Quality are not exempt from this project planning process. However, the purpose
here is to provide some key elements that can be used to develop and maintain a data integration, data migration,
or data quality project.
Description
Use the following steps as a guide for developing the initial project plan:
1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design,
development, and testing.)
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point
or for recommending tasks for inclusion.
3. Continue the detail breakdown, if possible, to a level at which logical chunks of work can be completed
and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate
estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as
assigning multiple resources, estimates are much less likely to be accurate and resource accountability becomes
difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if
applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must
start or complete concurrently with another).
6. Define the resources based on the role definitions and estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization's system development methodology.
Note: Informatica Professional Services has found success in projects that blend the waterfall method with the iterative
method. The waterfall method works well in the early stages of a project, such as analysis and initial design.
Iterative methods work well in accelerating development and testing, where feedback from extensive testing
validates the design of the system.
At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor
relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities.
Set the constraint type to As Soon As Possible and avoid setting a constraint date. Use the Effort-Driven approach so
that the Project Plan can be easily modified as adjustments are made.
By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the
project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in
your initial estimations, even if the resulting schedule is likely to miss Project Sponsor expectations. This helps to
establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good
faith.
This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for
opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule.
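If you want to sanity-check the schedule outside of Microsoft Project, a simple forward pass over the task dependencies shows where parallel work is possible. The sketch below is a generic Python illustration with invented tasks, durations, and predecessor links; it is not a substitute for the scheduling engine in your project tracking software.

# Forward pass: earliest start/finish (in working days) given durations and
# predecessor links. Tasks, durations, and dependencies are invented examples.
tasks = {
    "analysis":      {"duration": 10, "depends_on": []},
    "design":        {"duration": 8,  "depends_on": ["analysis"]},
    "build_maps":    {"duration": 15, "depends_on": ["design"]},
    "build_reports": {"duration": 12, "depends_on": ["design"]},   # parallel with build_maps
    "system_test":   {"duration": 10, "depends_on": ["build_maps", "build_reports"]},
}

finish = {}
def earliest_finish(name):
    if name not in finish:
        task = tasks[name]
        start = max((earliest_finish(dep) for dep in task["depends_on"]), default=0)
        finish[name] = start + task["duration"]
    return finish[name]

for name in tasks:
    end = earliest_finish(name)
    print(f'{name:<14} start day {end - tasks[name]["duration"]:>3}, finish day {end:>3}')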
When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions,
dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.
Reviewing and Revising the Project Plan
Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks
and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule
and updating the plan based on status and changes to assumptions.
One of the key communication methods is building the concept of a weekly or bi-weekly Project Sponsor
meeting. Attendance at this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the
Project Manager.
Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones, events at a high-level),
b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues and e) Plans for Next
Period.
Key Accomplishments
Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is
an opportunity to bring in the lead developers and have them report to management on what they have accomplished;
it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they
must own their work and account for it to management.
Keep accomplishments at a high-level and coach the team members to be brief, keeping their presentation to a five to
ten minute maximum during this portion of the meeting.
Progress against Initial Plan
The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so the
report is readable to the Project Sponsor (see sample below).
Plan | Percent Complete | Budget Hours
Architecture - Set up of Informatica Migration Environment | | 167
Develop data integration solution architecture | 10% | 28
Design development architecture | 28% | 32
Customize and implement Iterative Migration Framework | |
Data Profiling | 80% | 32
Legacy Stage | 100% | 10
Pre-Load Stage | 100% | 10
Reference Data | 83% | 18
Reusable Objects | 19% | 27
Review and signoff of Architecture | 0% | 10
Analysis - Target-to-Source Data Mapping | | 1000
Customer (9 tables) | 90% | 135
Product (6 tables) | 90% | 215
Inventory (3 tables) | 0% | 60
Shipping (3 tables) | 0% | 60
Invoicing (7 tables) | 57% | 140
Orders (19 tables) | 40% | 380
Review and signoff of Functional Specification | 0% | 10
Budget versus Actual
A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional
funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when
additional funding or a change in scope is required.
Many projects are cancelled because of cost overruns, so it is the Project Manager's job to keep expenditures under
control. The following example shows how a budgeted vs. actual report may look.

Weekly hours by resource (weeks ending 10-Apr through 29-May), with row totals:
Resource A: 28, 40, 24, 40, 40, 40, 40, 32 (total 284)
Resource B: 10, 40, 40, 40, 40, 32 (total 202)
Resource C: 40, 36, 40, 40, 32 (total 188)
Resource D: 24, 40, 36, 40, 40, 32 (total 212)
Project Manager: 12, 8, 8, 16, 32 (total 76)
Forecast total: 962 hours (actual to date: 462)
Weekly budget: 110, 160, 97, 160, 160, 160, 160, 160 (total 1,167; budgeted to date: 687)
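Tracking the weekly burn against budget is a small calculation that can be refreshed before each sponsor meeting. The sketch below is an illustrative Python example with invented weekly figures; it simply compares cumulative actual hours to cumulative budgeted hours so that a variance surfaces early.

# Compare cumulative actual hours to cumulative budgeted hours, week by week.
# The weekly figures are invented for illustration.
weeks  = ["10-Apr", "17-Apr", "24-Apr", "1-May", "8-May"]
budget = [110, 160, 97, 160, 160]   # planned hours per week
actual = [104, 150, 90, 118, 0]     # hours reported so far (0 = not yet worked)

cum_budget = cum_actual = 0
for week, b, a in zip(weeks, budget, actual):
    cum_budget += b
    cum_actual += a
    variance = cum_actual - cum_budget
    print(f"{week:<7} budget-to-date {cum_budget:>4}  actual-to-date {cum_actual:>4}  variance {variance:>5}")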
Key Issues
This is the most important part of the meeting. Presenting key issues such as resource commitments, user roadblocks,
and key design concerns to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate
decisions and minimizes the risk of impact to the project.
Plans for Next Period
This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change,
this is an opportunity to redirect the resources and use them correctly.
Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample
Deliverable), or changes in priority or approach as they arise to determine if they affect the plan. It may be necessary to
revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add
new tasks or postpone existing ones.
Tracking Changes
One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With
Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If
company and project management do not require tracking against a baseline, simply maintain the plan through updates
without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is
completed.
Summary
Managing a data integration, data migration, or data quality project requires good project planning and
communications. Many data integration projects fail because of issues such as poor data quality or the complexity of
integration. However, good communication and expectation setting with the Project Sponsor can prevent such
issues from causing a project to fail.


Last updated: 01-Feb-07 18:54
Developing the Business Case
Challenge
Identifying the departments and individuals that are likely to benefit directly from the
project implementation. Understanding these individuals, and their business information
requirements, is key to defining and scoping the project.
Description
The following four steps summarize business case development and lay a good
foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project
beneficiaries and understanding their business roles and project participation. In many
cases, the Project Sponsor can help to identify the beneficiaries and the various
departments they represent. This information can then be summarized in an
organization chart that is useful for ensuring that all project team members understand
the corporate/business organization.
G Activity - Interview project sponsor to identify beneficiaries, define their
business roles and project participation.
G Deliverable - Organization chart of corporate beneficiaries and participants.
2. The next step in establishing the business scope is to understand the business
problem or need that the project addresses. This information should be clearly defined
in a Problem/Needs Statement, using business terms to describe the problem. For
example, the problem may be expressed as "a lack of information" rather than "a lack
of technology" and should detail the business decisions or analysis that is required to
resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.
G Activity - Interview (individually or in forum) Project Sponsor and/or
beneficiaries regarding problems and needs related to project.
G Deliverable - Problem/Need Statement
3. The next step in creating the project scope is defining the business goals and
objectives for the project and detailing them in a comprehensive Statement of Project
Goals and Objectives. This statement should be a high-level expression of the desired
business solution (e.g., what strategic or tactical benefits does the business expect to
gain from the project) and should avoid any technical considerations at this point.
Again, the Project Sponsor and beneficiaries are the best sources for this type of
information. It may be practical to combine information gathering for the needs
assessment and goals definition, using individual interviews or general meetings to
elicit the information.
G Activity - Interview (individually or in forum) Project Sponsor and/or
beneficiaries regarding business goals and objectives for the project.
G Deliverable - Statement of Project Goals and Objectives
4. The final step is creating a Project Scope and Assumptions statement that clearly
defines the boundaries of the project based on the Statement of Project Goals and
Objectives and the associated project assumptions. This statement should focus on the
type of information or analysis that will be included in the project rather than what will
not.
The assumptions statements are optional and may include qualifiers on the scope,
such as assumptions of feasibility, specific roles and responsibilities, or availability of
resources or data.
G Activity - Business Analyst develops Project Scope and Assumptions
statement for presentation to the Project Sponsor.
G Deliverable - Project Scope and Assumptions statement


Last updated: 01-Feb-07 18:54
Managing the Project Lifecycle
Challenge
To establish an effective communications plan for ongoing management throughout the
project lifecycle and to keep the Project Sponsor informed of the status of the project.
Description
The quality of a project can be directly correlated to the amount of review that occurs
during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.
Project Status Reports
In addition to the initial project plan review with the Project Sponsor, it is critical to
schedule regular status meetings with the sponsor and project team to review status,
issues, scope changes and schedule updates. This is known as the project sponsor
meeting.
Gather status, issues and schedule update information from the team one day before
the status meeting in order to compile and distribute the Project Status Report. In
addition, make sure lead developers of major assignments are present to report on the
status and issues, if applicable.
Project Management Review
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans
and deliverables with company management, including business requirements reviews
with business personnel and technical reviews with project technical personnel.
Set a process in place beforehand to ensure appropriate personnel are invited, any
relevant documents are distributed at least 24 hours in advance, and that reviews focus
on questions and issues (rather than a laborious "reading of the code").
Reviews may include:
G Project scope and business case review.
G Business requirements review.
G Source analysis and business rules reviews.
G Data architecture review.
G Technical infrastructure review (hardware and software capacity and
configuration planning).
G Data integration logic review (source to target mappings, cleansing and
transformation logic, etc.).
G Source extraction process review.
G Operations review (operations and maintenance of load sessions, etc.).
G Reviews of operations plan, QA plan, deployment and support plan.
Project Sponsor Meetings
A project sponsor meeting should be held weekly or bi-weekly to communicate
progress to the Project Sponsor and Key Stakeholders. The purpose is to keep key
user management involved and engaged in the process. It also serves to
communicate any changes to the initial plan and to have them weigh in on the decision
process.
Elements of the meeting include:
G Key Accomplishments.
G Activities Next Week.
G Tracking of Progress to-Date (Budget vs. Actual).
G Key Issues / Roadblocks.
It is the Project Manager's role to stay neutral to any issue and to effectively state facts
and allow the Project Sponsor or other key executives to make decisions. Many times
this process builds the partnership necessary for success.
Change in Scope
Directly address and evaluate any changes to the planned project activities, priorities,
or staffing as they arise, or are proposed, in terms of their impact on the project plan.
The Project Manager should institute a change management process in response to
any issue or request that appears to add or alter expected activities and has the
potential to affect the plan.
G Use the Scope Change Assessment to record the background problem or
requirement and the recommended resolution that constitutes the potential
scope change. Note that such a change-in-scope document helps capture key
documentation that is particularly useful if the project overruns or fails
to deliver upon Project Sponsor expectations.
G Review each potential change with the technical team to assess its impact on
the project, evaluating the effect in terms of schedule, budget, staffing
requirements, and so forth.
G Present the Scope Change Assessment to the Project Sponsor for acceptance
(with formal sign-off, if applicable). Discuss the assumptions involved in the
impact estimate and any potential risks to the project.
Even if there is no evident effect on the schedule, it is important to document these
changes because they may affect project direction and it may become necessary, later
in the project cycle, to justify these changes to management.
Management of Issues
Any questions, problems, or issues that arise and are not immediately resolved should
be tracked to ensure that someone is accountable for resolving them and so that their
effect remains visible.
Use the Issues Tracking template, or something similar, to track issues, their owner,
and dates of entry and resolution as well as the details of the issue and of its solution.
Significant or "showstopper" issues should also be mentioned on the status report and
communicated through the weekly project sponsor meeting. This way, the Project
Sponsor has the opportunity to resolve and cure a potential issue.
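If no formal issues template is in place, even a flat log with a consistent set of fields is enough to keep owners accountable. The sketch below is a minimal Python illustration; the field names and the CSV file are assumptions, not the Velocity Issues Tracking template itself.

# Minimal issues log - an illustration, not the Velocity template itself.
# Each issue carries an owner and open/close dates so aging stays visible.
import csv
from datetime import date

FIELDS = ["id", "description", "owner", "opened", "resolved", "resolution"]

def log_issue(path, issue_id, description, owner):
    """Append a new, unresolved issue to the log."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writerow({
            "id": issue_id,
            "description": description,
            "owner": owner,
            "opened": date.today().isoformat(),
            "resolved": "",
            "resolution": "",
        })

log_issue("issues_log.csv", "DI-042",
          "Source system outage blocks order extract testing", "J. Smith")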
Project Acceptance and Close
A formal project acceptance and close helps document the final status of the project.
Rather than simply walking away from a project when it seems complete, this explicit
close procedure both documents and helps finalize the project with the Project Sponsor.
For most projects this involves a meeting where the Project Sponsor and/or department
managers acknowledge completion or sign a statement of satisfactory completion.
G Even for relatively short projects, use the Project Close Report to finalize the
project with a final status report detailing:
H What was accomplished.
H Any justification for tasks expected but not completed.
H Recommendations.
G Prepare for the close by considering what the project team has learned about
the environments, procedures, data integration design, data architecture, and
other project plans.
G Formulate the recommendations based on issues or problems that need to be
addressed. Succinctly describe each problem or recommendation and if
applicable, briefly describe a recommended approach.


Last updated: 01-Feb-07 18:54
Using Interviews to Determine Corporate
Data Integration Requirements
Challenge
Data warehousing projects are usually initiated out of a business need for a certain
type of report (e.g., "we need consistent reporting of revenue, bookings, and backlog").
Except in the case of narrowly-focused, departmental data marts however, this is not
enough guidance to drive a full data integration solution. Further, a successful, single-
purpose data mart can build a reputation such that, after a relatively brief period of
proving its value to users, business management floods the technical group with
requests for more data marts in other areas. The only way to avoid silos of data marts
is to think bigger at the beginning and canvass the enterprise (or at least the
department, if that's your limit of scope) for a broad analysis of data
integration requirements.
Description
Determining the data integration requirements in satisfactory detail and clarity is a
difficult task, however, especially while ensuring that the requirements are
representative of all the potential stakeholders. This Best Practice summarizes the
recommended interview and prioritization process for this requirements analysis.
Process Steps
The first step in the process is to identify and interview all major sponsors and
stakeholders. This typically includes the executive staff and CFO since they are likely to
be the key decision makers who will depend on the data integraton. At a minimum,
figure on 10 to 20 interview sessions.
The next step in the process is to interview representative information providers. These
individuals include the decision makers who provide the strategic perspective on what
information to pursue, as well as details on that information, and how it is currently
used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors
and stakeholders regarding the findings of the interviews and the recommended
subject areas and information profiles. It is often helpful to facilitate a Prioritization
Workshop with the major stakeholders, sponsors, and information providers in order to
set priorities on the subject areas.
Conduct Interviews
The following paragraphs offer some tips on the actual interviewing process. Two
sections at the end of this document provide sample interview outlines for the executive
staff and information providers.
Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A
focused, consistent interview format is desirable. Don't feel bound to the script,
however, since interviewees are likely to raise some interesting points that may not be
included in the original interview format. Pursue these subjects as they come up,
asking detailed questions. This approach often leads to discoveries of strategic uses
for information that may be exciting to the client and provide sparkle and focus to the
project.
Questions to the executives or decision-makers should focus on what business
strategies and decisions need information to support or monitor them. (Refer to Outline
for Executive Interviews at the end of this document). Coverage here is critical: if key
managers are left out, you may miss a critical viewpoint and an important buy-in.
Interviews of information providers are secondary but can be very useful. These are the
business analyst-types who report to decision-makers and currently provide reports
and analyses using Excel or Lotus or a database program to consolidate data from
more than one source and provide regular and ad hoc reports or conduct sophisticated
analysis. In subsequent phases of the project, you must identify all of these individuals,
learn what information they access, and how they process it. At this stage however,
you should focus on the basics, building a foundation for the project and discovering
what tools are currently in use and where gaps may exist in the analysis and reporting
functions.
Be sure to take detailed notes throughout the interview process. If there are a lot of
interviews, you may want the interviewer to partner with someone who can take good
notes, perhaps on a laptop to save note transcription time later. It is important to take
down the details of what each person says because, at this stage, it is difficult to know
what is likely to be important. While some interviewees may want to see detailed notes
from their interviews, this is not very efficient since it takes time to clean up the notes
for review. The most efficient approach is to simply consolidate the interview notes into
a summary format following the interviews.
Be sure to review previous interviews as you go through the interviewing process. You
can often use information from earlier interviews to pursue topics in later interviews in
more detail and with varying perspectives.
The executive interviews must be carried out in business terms. There can be no
mention of the data warehouse or systems of record or particular source data entities
or issues related to sourcing, cleansing or transformation. It is strictly forbidden to use
any technical language. It can be valuable to have an industry expert prepare and even
accompany the interviewer to provide business terminology and focus. If the interview
falls into technical details, for example, into a discussion of whether certain
information is currently available or could be integrated into the data warehouse, it is up
to the interviewer to re-focus immediately on business needs. If this focus is not
maintained, the opportunity for brainstorming is likely to be lost, which will reduce the
quality and breadth of the business drivers.
Because of the above caution, it is rarely acceptable to have IS resources present at
the executive interviews. These resources are likely to engage the executive (or vice
versa) in a discussion of current reporting problems or technical issues and thereby
destroy the interview opportunity.
Keep the interview groups small. One or two Professional Services personnel should
suffice with at most one client project person. Especially for executive interviews, there
should be one interviewee. There is sometimes a need to interview a group of middle
managers together, but if there are more than two or three, you are likely to get much
less input from the participants.
Distribute Interview Findings and Recommended Subject Areas
At the completion of the interviews, compile the interview notes and consolidate the
content into a summary. This summary should help to break out the input into
departments or other groupings significant to the client. Use this content and your
interview experience along with best practices or industry experience to recommend
specific, well-defined subject areas.
Remember that this is a critical opportunity to position the project to the decision-
makers by accurately representing their interests while adding enough creativity to
capture their imagination. Provide them with models or profiles of the sort of information
that could be included in a subject area so they can visualize its utility. This sort of
visionary concept of their strategic information needs is crucial to drive their
awareness and is often suggested during interviews of the more strategic thinkers. Tie
descriptions of the information directly to stated business drivers (e.g., key processes
and decisions) to further accentuate the business solution.
A typical table of contents in the initial Findings and Recommendations document might
look like this:
I. Introduction
II. Executive Summary
A. Objectives for the Data Warehouse
B. Summary of Requirements
C. High Priority Information Categories
D. Issues
III. Recommendations
A. Strategic Information Requirements
B. Issues Related to Availability of Data
C. Suggested Initial Increments
D. Data Warehouse Model
IV. Summary of Findings
A. Description of Process Used
B. Key Business Strategies [this includes descriptions of processes, decisions, other drivers]
C. Key Departmental Strategies and Measurements
D. Existing Sources of Information
E. How Information is Used
F. Issues Related to Information Access
V. Appendices
A. Organizational structure, departmental roles
B. Departmental responsibilities and relationships

Conduct Prioritization Workshop
This is a critical workshop for consensus on the business drivers. Key executives and
decision-makers should attend, along with some key information providers. It is
advisable to schedule this workshop offsite to assure attendance and attention, but the
workshop must be efficient, typically confined to a half-day.
Be sure to announce the workshop well enough in advance to ensure that key
attendees can put it on their schedules. Sending the announcement of the workshop
may coincide with the initial distribution of the interview findings.
The workshop agenda should include the following items:
G Agenda and Introductions
G Project Background and Objectives
G Validate Interview Findings: Key Issues
G Validate Information Needs
G Reality Check: Feasibility
G Prioritize Information Needs
G Data Integration Plan
G Wrap-up and Next Steps
Keep the presentation as simple and concise as possible, and avoid technical
discussions or detailed sidetracks.
Validate information needs
Key business drivers should be determined well in advance of the workshop, using
information gathered during the interviewing process. Prior to the workshop, these
business drivers should be written out, preferably in display format on flipcharts or
similar presentation media, along with relevant comments or additions from the
interviewees and/or workshop attendees.
During the validation segment of the workshop, attendees need to review and discuss
the specific types of information that have been identified as important for triggering or
monitoring the business drivers. At this point, it is advisable to compile as complete a
list as possible; it can be refined and prioritized in subsequent phases of the project.
As much as possible, categorize the information needs by function, maybe even by
specific driver (i.e., a strategic process or decision). Considering the information needs
on a function by function basis fosters discussion of how the information is used and by
whom.
Reality check: feasibility
With the results of brainstorming over business drivers and information needs listed (all
over the walls, presumably), take a brief detour into reality before prioritizing and
planning. You need to consider overall feasibility before establishing the first priority
information area(s) and setting a plan to implement the data warehousing solution with
initial increments to address those first priorities.
Briefly describe the current state of the likely information sources (systems of record, or SORs). What
information is currently accessible with a reasonable likelihood of the quality and
content necessary for the high priority information areas? If there is likely to be a high
degree of complexity or technical difficulty in obtaining the source information, you may
need to reduce the priority of that information area (i.e., tackle it after some successes
in other areas).
Avoid getting into too much detail or technical issues. Describe the general types of
information that will be needed (e.g., sales revenue, service costs, customer descriptive
information, etc.), focusing on what you expect will be needed for the highest priority
information needs.
Data Integration Plan
The project sponsors, stakeholders, and users should all understand that the process
of implementing the data warehousing solution is incremental. Develop a high-level
plan for implementing the project, focusing on increments that are both high-value and
high-feasibility. Implementing these increments first provides an opportunity to build
credibility for the project. The objective during this step is to obtain buy-in for your
implementation plan and to begin to set expectations in terms of timing. Be practical
though; don't establish too rigorous a timeline!
Wrap-up and next steps
At the close of the workshop, review the group's decisions (in 30 seconds or less),
schedule the delivery of notes and findings to the attendees, and discuss the next steps
of the data warehousing project.
Document the Roadmap
As soon as possible after the workshop, provide the attendees and other project
stakeholders with the results:
G Definitions of each subject area, categorized by functional area
G Within each subject area, descriptions of the business drivers and information
metrics
G Lists of the feasibility issues
G The subject area priorities and the implementation timeline.
Outline for Executive Interviews
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
G Interviews to understand business information strategies and
expectations
G Document strategy findings
G Consensus-building meeting to prioritize information
requirements and identify quick hits
G Model strategic subject areas
G Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Description of business vision, strategies
B. Perspective on strategic business issues and how they drive information
needs
G Information needed to support or achieve business goals
G How success is measured
IV. Briefly describe your roles and responsibilities.
G The interviewee may provide this information before the actual interview. In
this case, simply review with the interviewee and ask if there is anything to add.
A. What are your key business strategies and objectives?
G How do corporate strategic initiatives impact your group?
G These may include MBOs (personal performance objectives) and workgroup objectives or strategies.
B. What do you see as the Critical Success Factors for an Enterprise
Information Strategy?
G What are its potential obstacles or pitfalls?
C. What information do you need to achieve or support key decisions
related to your business objectives?
D. How will your organization's progress and final success be measured (e.g., metrics, critical success factors)?
E. What information or decisions from other groups affect your success?
F. What are other valuable information sources (e.g., computer reports, industry reports, email, key people, meetings, phone)?
G. Do you have regular strategy meetings? What information is shared as
you develop your strategy?
H. If it is difficult for the interviewee to brainstorm about information needs,
try asking the question this way: "When you return from a two-week
vacation, what information do you want to know first?"
I. Of all the information you now receive, what is the most valuable?
J. What information do you need that is not now readily available?
K. How accurate is the information you are now getting?
L. To whom do you provide information?
M. Who provides information to you?
N. Who would you recommend be involved in the cross-functional
Consensus Workshop?
Outline for Information Provider Interviews
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
G Interviews to understand business information strategies and expectations
G Document strategy findings and model the strategic subject areas
G Consensus-building meeting to prioritize information requirements and identify quick hits
G Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Understanding of how business issues drive information needs
B. High-level understanding of what information is currently provided to whom
G Where does it come from
G How is it processed
G What are its quality or access issues
IV. Briefly describe your roles and responsibilities.
G The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.
A. Who do you provide information to?
B. What information do you provide to help support or measure the progress/success of their key business decisions?
C. Of all the information you now provide, what is the most requested or most widely used?
D. What are your sources for the information (both in terms of systems and personnel)?
E. What types of analysis do you regularly perform (e.g., trends, investigating problems)? How do you provide these analyses (e.g., charts, graphs, spreadsheets)?
F. How do you change/add value to the information?
G. Are there quality or usability problems with the information you work with? How accurate is it?
Last updated: 01-Feb-07 18:54
Upgrading Data Analyzer
Challenge
Seamlessly upgrade Data Analyzer from one release to another while safeguarding the
repository.
Description
Upgrading Data Analyzer involves two steps:
1. Upgrading the Data Analyzer application.
2. Upgrading the Data Analyzer repository.
Steps Before The Upgrade
1. Back up the repository. To ensure a clean backup, shut down Data Analyzer and create the backup, following the steps in the Data Analyzer manual.
2. Restore the backed-up repository into an empty database or a new schema. This ensures that you have a hot backup of the repository if the upgrade fails for any reason (a hedged sketch of one way to stage such a copy follows this list).
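If the repository is hosted on Oracle and your DBA prefers to stage the hot backup as a database-level schema copy, the copy can be scripted. The sketch below is illustrative only: the connection string, schema names, and DATA_PUMP_DIR directory object are assumptions, and the Data Analyzer backup utility described in the product manual remains the supported backup mechanism.

import subprocess

# Hypothetical connection and schema names -- adjust for your environment.
DB_CONNECT = "system/manager@DAREPDB"   # assumption: DBA-level credentials
SOURCE_SCHEMA = "DA_REPO"               # existing Data Analyzer repository schema
COPY_SCHEMA = "DA_REPO_COPY"            # empty schema created by the DBA for the hot backup

def run(cmd):
    # Echo the command, then stop the script if it fails.
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# 1. Export the repository schema with Oracle Data Pump.
run([
    "expdp", DB_CONNECT,
    "schemas=%s" % SOURCE_SCHEMA,
    "directory=DATA_PUMP_DIR",
    "dumpfile=da_repo.dmp",
    "logfile=da_repo_exp.log",
])

# 2. Import it into the new schema so a hot backup exists if the upgrade fails.
run([
    "impdp", DB_CONNECT,
    "remap_schema=%s:%s" % (SOURCE_SCHEMA, COPY_SCHEMA),
    "directory=DATA_PUMP_DIR",
    "dumpfile=da_repo.dmp",
    "logfile=da_repo_imp.log",
])

Running the script leaves the original schema untouched and produces a second schema that the upgrade can be pointed at, which keeps a clean fallback available if the upgrade has to be rolled back.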
Steps for Upgrading the Data Analyzer Application
The upgrade process varies depending on the application server on which Data Analyzer is hosted.
For WebLogic:
1. Install WebLogic 8.1 without uninstalling the existing Application Server
(WebLogic 6.1).
2. Install the Data Analyzer application on the new WebLogic 8.1 Application
Server, making sure to use a different port than the one used in the old installation. When prompted for a repository, choose the existing repository option and provide the connection details of the database that hosts the backed-up copy of the old Data Analyzer repository.
3. When the installation is complete, use the Upgrade utility to connect to the
database that hosts the Data Analyzer backed up repository and perform the
upgrade.
For JBoss and WebSphere:
1. Uninstall Data Analyzer.
2. Install the new Data Analyzer version.
3. When prompted for a repository, choose the existing repository option and provide the connection details of the database that hosts the backed-up Data Analyzer repository.
4. Use the Upgrade utility and connect to the database that hosts the backed up
Data Analyzer repository and perform the upgrade.
When the repository upgrade is complete, start Data Analyzer and perform a simple
acceptance test.
You can use the following test cases (or a subset of them) as an acceptance test:
1. Open a simple report.
2. Open a cached report.
3. Open a report with filtersets.
4. Open a sectional report.
5. Open a workflow and also its nodes.
6. Open a report and drill through it.
When all the reports open without problems, your upgrade can be called complete.
Once the upgrade is complete, repeat the above process on the actual repository.
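A lightweight way to repeat this acceptance check in each environment is to request a known set of report URLs and confirm that each one is served. The sketch below is a hypothetical harness written in Python: the host, port, and report paths are assumptions that must be replaced with URLs captured from your own Data Analyzer instance, it does not handle Data Analyzer authentication, and it only confirms that the pages are returned, not that their contents are correct.

import urllib.request

# Hypothetical base URL and report paths captured while running the manual checklist.
BASE_URL = "http://da-host:18080"       # assumption: upgraded Data Analyzer host and port
REPORT_PATHS = [
    "/ias/report?simple_sales",         # a simple report
    "/ias/report?cached_inventory",     # a cached report
    "/ias/report?filtered_orders",      # a report with filtersets
]

def smoke_test(paths):
    # Request each report URL and collect anything that does not come back cleanly.
    failures = []
    for path in paths:
        url = BASE_URL + path
        try:
            with urllib.request.urlopen(url, timeout=60) as response:
                if response.getcode() != 200:
                    failures.append((url, "HTTP %d" % response.getcode()))
        except Exception as exc:        # connection refused, timeout, HTTP error, etc.
            failures.append((url, str(exc)))
    return failures

if __name__ == "__main__":
    problems = smoke_test(REPORT_PATHS)
    if problems:
        for url, reason in problems:
            print("FAILED: %s (%s)" % (url, reason))
    else:
        print("All sample reports were served by the upgraded instance.")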
Note: This upgrade process creates two instances of Data Analyzer. So when the
upgrade is successful, uninstall the older version, following the steps in the Data
Analyzer manual.
Last updated: 01-Feb-07 18:54
Upgrading PowerCenter
Challenge
Upgrading an existing installation of PowerCenter to a later version encompasses
upgrading the repositories, implementing any necessary modifications, testing, and
configuring new features. With PowerCenter 8.1, the expansion of the Service-Oriented Architecture with its domain and node concept brings additional considerations to the upgrade process. The challenge for data integration administrators is to approach the upgrade in a structured fashion and minimize risk to the environment and ongoing project work.
Some of the challenges typically encountered during an upgrade include:
G Limiting development downtime.
G Ensuring that development work performed during the upgrade is accurately
migrated to the upgraded environment.
G Testing the upgraded environment to ensure that data integration results are
identical to the previous version.
G Ensuring that all elements of the various environments (e.g., Development,
Test, and Production) are upgraded successfully.
Description
Some typical reasons for initiating a PowerCenter upgrade include:
G Additional features and capabilities in the new version of PowerCenter that
enhance development productivity and administration.
G To keep pace with higher demands for data integration.
G To achieve process performance gains.
G To maintain an environment of fully supported software as older PowerCenter versions reach the end of their support lifecycle.
Upgrade Team
Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade
is key to completing the process within schedule and budgetary guidelines. Typically,
the upgrade team needs the following key players:
G PowerCenter Administrator
G Database Administrator
G System Administrator
G Informatica team - the business and technical users that "own" the various
areas in the Informatica environment. These resources are required for
knowledge transfer and testing during the upgrade process and after the
upgrade is complete.
Upgrade Paths
The upgrade process details depend on which of the existing PowerCenter versions
you are upgrading from and which version you are moving to. The following bullet items
summarize the upgrade paths for the various PowerCenter versions:
G PowerCenter 8.1.1 (available since September 2006)
H Direct upgrade for PowerCenter 6.x to 8.1.1
H Direct upgrade for PowerCenter 7.x to 8.1.1
H Direct upgrade for PowerCenter 8.0 to 8.1.1
G Other versions:
H For version 4.6 or earlier - upgrade to 5.x, then to 7.x and to 8.1.1
H For version 4.7 or later - upgrade to 6.x and then to 8.1.1
Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help
to ensure that the upgrade process goes smoothly.
G Be sure to have sufficient memory and disk space (database) for the installed
software.
G As new features are added into PowerCenter, the repository grows in size
anywhere from 5 to 25 percent per release to accommodate the metadata for
the new features. Plan for this increase in all of your PowerCenter repositories.
G Always read and save the upgrade log file.
G Backup Repository Server and PowerCenter Server configuration files prior to
beginning the upgrade process.
G Test the AEP/EP (Advanced External Procedure/External Procedure) prior to
beginning the upgrade. Recompiling may be necessary.
G PowerCenter 8.x and beyond require Domain Metadata in addition to the
standard PowerCenter Repositories. Work with your DBA to create a location
for the Domain Metadata Repository, which is created at install time.
G Ensure that all repositories for upgrade are backed up and that they can be restored successfully (a scripted backup sketch follows this list). Repositories can be restored to the same database in a different schema to allow an upgrade to be carried out in parallel. This is especially useful if the PowerCenter test and development environments reside in a single repository.
G When naming your nodes and domains in PowerCenter 8, think carefully
about the naming convention before the upgrade. While changing the name of
a node or the domain later is possible, it is not an easy task since it is
embedded in much of the general operation of the product. Avoid using IP
addresses and machine names for the domain and node names since over
time machine IP addresses and server names may change.
G With PowerCenter 8, a central location exists for shared files (i.e., log files,
error files, checkpoint files, etc.) across the domain. If using the Grid option or
High Availability option, it is important that this file structure is on a high-
performance file system and viewable by all nodes in the domain. If High
Availability is configured, this file system should also be highly available.
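The repository backups mentioned in the tips above can be scripted so that every repository is captured the same way immediately before the upgrade window. The following Python sketch is a minimal wrapper around the pmrep command line; the domain, repository names, and credentials are placeholders, the options shown assume PowerCenter 8.x-style pmrep syntax, and pmrep must already be on the PATH of the machine running the script.

import subprocess

# Placeholder values -- replace with your own domain, repositories, and credentials.
DOMAIN = "Domain_Dev"
USER = "Administrator"
PASSWORD = "changeme"
REPOSITORIES = ["REP_DEV", "REP_TEST"]   # every repository that will be upgraded

def pmrep(*args):
    # Invoke pmrep and fail loudly if it returns a non-zero exit code.
    subprocess.check_call(["pmrep"] + list(args))

for repo in REPOSITORIES:
    # Connect to the repository (PowerCenter 8.x-style options are assumed here).
    pmrep("connect", "-r", repo, "-d", DOMAIN, "-n", USER, "-x", PASSWORD)
    # Write a backup file that can be restored into a parallel schema for the upgrade.
    pmrep("backup", "-o", "%s_preupgrade.rep" % repo)

Keeping the backup step in a script also makes it easy to prove, before the upgrade window, that every backup file can be restored successfully.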
Upgrading Multiple Projects
Be sure to consider the following items if the upgrade involves multiple projects:
G All projects sharing a repository must upgrade at the same time (and test concurrently).
G Projects using multiple repositories must all upgrade at the same time.
G After upgrade, each project should undergo full regression testing.
Upgrade Project Plan
The full upgrade process from version to version can be time-consuming, particularly during the testing and verification stages. Informatica strongly recommends developing a project plan to track progress and to inform managers and team members of the tasks that need to be completed, open uncertainties, and any missed steps.
Scheduling the Upgrade
When an upgrade is scheduled in conjunction with other development work, it is
prudent to have it occur within a separate test environment that mimics (or at least
closely resembles) production. This reduces the risk of unexpected errors and can
decrease the effort spent on the upgrade. It may also allow the development work to
continue in parallel with the upgrade effort, depending on the specific site setup.
Environmental Impact
With each new PowerCenter release, there is the potential for the upgrade to affect your data integration environment through new components and features. The
PowerCenter 8 upgrade changes the architecture from PowerCenter version 7, so you
should spend time planning the upgrade strategy concerning domains, nodes, domain
metadata, and the other architectural components with PowerCenter 8. Depending on
the complexity of your data integration environment, this may be a minor or major
impact. Single integration server/single repository installations are not likely to notice
much of a difference to the architecture, but customers striving for highly-available
systems with enterprise scalability may need to spend time understanding how to alter
their physical architecture to take advantage of these new features in PowerCenter
8. For more information on these architecture changes, reference the PowerCenter
documentation and the Best Practice on Domain Configuration.
Upgrade Process
Informatica recommends using the following approach to handle the challenges
inherent in an upgrade effort.
Choosing an Appropriate Environment
It is always advisable to have at least three separate environments: one each for
Development, Test, and Production.
The Test environment is generally the best place to start the upgrade process since it is
likely to be the most similar to Production. If possible, select a test sandbox that
parallels production as closely as possible. This enables you to carry out data
comparisons between PowerCenter versions. An added benefit of starting the upgrade
process in a test environment is that development can continue without interruption.
Your corporate policies on development, test, and sandbox environments and the work
that can or cannot be done in them will determine the precise order for the upgrade and
any associated development changes. Note that if changes are required as a result of
the upgrade, they need to be migrated to Production. Use the existing version to
backup the PowerCenter repository, then ensure that the backup works by restoring it
to a new schema in the repository database.
Alternatively, you can begin the upgrade process in the Development environment or
create a parallel environment in which to start the effort. The decision to use or copy an
existing platform depends on the state of project work across all environments. If it is
not possible to set up a parallel environment, the upgrade may start in Development,
then progress to the Test and Production systems. However, using a parallel
environment is likely to minimize development downtime. The important thing is to
understand the upgrade process and your own business and technical requirements,
then adapt the approaches described in this document to one that suits your particular
situation.
Organizing the Upgrade Effort
Begin by evaluating the entire upgrade effort in terms of resources, time, and
environments. This includes training, availability of database, operating system, and
PowerCenter administrator resources as well as time to perform the upgrade and carry
out the necessary testing in all environments. Refer to the release notes to help identify
mappings and other repository objects that may need changes as a result of the
upgrade.
Provide detailed training for the Upgrade team to ensure that everyone directly involved
in the upgrade process understands the new version and is capable of using it for their
own development work and assisting others with the upgrade process.
Run regression tests for all components on the old version. If possible, store the results
so that you can use them for comparison purposes after the upgrade is complete.
Before you begin the upgrade, be sure to backup the repository and server caches,
scripts, logs, bad files, parameter files, source and target files, and external
procedures. Also be sure to copy backed-up server files to the new directories as the
upgrade progresses.
If you are working in a UNIX environment and have to use the same machine for
existing and upgrade versions, be sure to use separate users and directories. Be
careful to ensure that profile path statements do not overlap between the new and old
versions of PowerCenter. For additional information, refer to the installation guide for
path statements and environment variables for your platform and operating system.
Installing and Configuring the Software
G Install the new version of the PowerCenter components on the server.
G Ensure that the PowerCenter client is installed on at least one workstation to
be used for upgrade testing and that connections to repositories are updated if
parallel repositories are being used.
G Re-compile any Advanced External Procedures/External Procedures if
necessary, and test them.
G The PowerCenter license key is now in the form of a file. During the installation
of PowerCenter, you'll be asked for the location of this key file. The key should
be saved on the server prior to beginning the installation process.
G When installing PowerCenter 8.x, you'll configure the domain, node, repository
service, and the integration service at the same time. Ensure that you have all
necessary database connections ready before beginning the installation
process.
G If upgrading to PowerCenter 8.x from PowerCenter 7.x (or earlier), you must
gather all of your configuration files that are going to be used in the automated
process to upgrade the Integration Services and Repositories. See the
PowerCenter Upgrade Manual for more information on how to gather them
and where to locate them for the upgrade process.
G Once the installation is complete, use the Administration Console to perform the repository upgrade. Unlike previous versions of PowerCenter, in version 8 the Administration Console is a web application. The Administration Console URL is http://hostname:portnumber, where hostname is the name of the server where the PowerCenter services are installed and portnumber is the port identified during the installation process. The default port number is 6001.
G Re-register any plug-ins (such as PowerExchange) to the newly upgraded
environment.
G You can start both the repository and integration services on the Admin
Console.
G Analyze the upgrade activity logs to identify areas where changes may be required, and rerun full regression tests on the upgraded repository.
G Execute test plans. Ensure that there are no failures and all the loads run
successfully in the upgraded environment.
G Verify the data to ensure that there are no changes and no additional or missing records (a simple row-count comparison sketch follows this list).
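One way to make the data verification step repeatable is to compare simple aggregates, such as row counts, between the targets loaded by the previous version and the targets loaded by the upgraded environment. The sketch below is illustrative: it assumes the pyodbc library, ODBC DSNs for both environments, and hypothetical table names; checksums or column-level comparisons can be added in the same way.

import pyodbc   # assumption: pyodbc is installed and ODBC DSNs exist for both environments

# Hypothetical DSNs and target tables -- substitute your own.
OLD_DSN = "DSN=DW_BASELINE;UID=etl;PWD=changeme"   # targets loaded by the previous version
NEW_DSN = "DSN=DW_UPGRADED;UID=etl;PWD=changeme"   # targets loaded after the upgrade
TARGET_TABLES = ["SALES_FACT", "CUSTOMER_DIM"]

def row_count(connection, table):
    # Return the number of rows currently in the table.
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM %s" % table)
    return cursor.fetchone()[0]

def compare_targets(tables):
    old_conn = pyodbc.connect(OLD_DSN)
    new_conn = pyodbc.connect(NEW_DSN)
    mismatches = []
    for table in tables:
        old_rows = row_count(old_conn, table)
        new_rows = row_count(new_conn, table)
        if old_rows != new_rows:
            mismatches.append((table, old_rows, new_rows))
    return mismatches

if __name__ == "__main__":
    for table, old_rows, new_rows in compare_targets(TARGET_TABLES):
        print("Row count mismatch in %s: baseline=%d, upgraded=%d" % (table, old_rows, new_rows))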
Implementing Changes and Testing
If changes are needed, decide where those changes are going to be made. It is
generally advisable to migrate work back from test to an upgraded development
environment. Complete the necessary changes, then migrate forward through test to
production. Assess the changes when the results from the test runs are available. If
you decide to deviate from best practice and make changes in test and migrate them
forward to production, remember that you'll still need to implement the changes in
development. Otherwise, these changes will be lost the next time work is migrated from
development to the test environment.
When you are satisfied with the results of testing, upgrade the other environments by
backing up and restoring the appropriate repositories. Be sure to closely monitor the
production environment and check the results after the upgrade. Also remember to
archive and remove old repositories from the previous version.
After the Upgrade
G If multiple nodes were configured and you own the PowerCenter Grid option, you can create a server grid to test performance gains.
G If you own the high-availability option, you should configure your environment
for high availability, including setting up failover gateway node(s) and
designating primary and backup nodes for your various PowerCenter services.
In addition, your shared file location for the domain should be located on a
highly available, high-performance file server.
G Start measuring data quality by creating a sample data profile.
G If LDAP is in use, associate LDAP users with PowerCenter users.
G Install PowerCenter Reports and configure the built-in reports for the
PowerCenter repository.
Repository Versioning
After upgrading to version 8.x, you can set the repository to versioned if you purchased
the Team-Based Management option and enabled it via the license key.
Keep in mind that once the repository is set to versioned, it cannot be set back to non-
versioned. You can invoke the team-based development option in the Administration
Console.
Upgrading Folder Versions
After upgrading to version 8.x, you'll need to remember the following:
G There are no more folder versions in version 8.
G The folder with the highest version number becomes the current folder.
G Other versions of the folders are renamed to folder_<folder_version_number>.
G Shortcuts are created to mappings from the current folder.
Upgrading Pmrep and Pmcmd Scripts
G Folder versions no longer apply to pmrep and pmrepagent scripts.
G Ensure that the workflow/session folder names match the upgraded names.
G Note that the pmcmd command structure changes significantly after version 5. Version 5 pmcmd commands can still run in version 8, but may not be backwards-compatible in future versions (a helper sketch based on the newer syntax follows this list).
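When reviewing scheduler scripts for the pmcmd syntax change, it can help to standardize on a single helper that builds the newer command form. The Python sketch below assumes PowerCenter 8.x-style pmcmd options and uses placeholder service, domain, folder, and workflow names; treat it as a starting point for updating legacy scripts rather than a drop-in replacement.

import subprocess

# Placeholder connection details for the upgraded environment.
INTEGRATION_SERVICE = "IS_Dev"
DOMAIN = "Domain_Dev"
USER = "Administrator"
PASSWORD = "changeme"

def start_workflow(folder, workflow, wait=True):
    # Build a PowerCenter 8.x-style pmcmd startworkflow command (assumed syntax).
    cmd = [
        "pmcmd", "startworkflow",
        "-sv", INTEGRATION_SERVICE,
        "-d", DOMAIN,
        "-u", USER,
        "-p", PASSWORD,
        "-f", folder,
    ]
    if wait:
        cmd.append("-wait")      # block until the workflow completes
    cmd.append(workflow)
    return subprocess.call(cmd)  # pmcmd's exit code indicates success or failure

# Example: rerun a regression workflow after the upgrade (names are placeholders).
# start_workflow("UPGRADE_TEST", "wf_regression_suite")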
Advanced External Procedure Transformations
AEPs are upgraded to the Custom transformation, which is non-blocking. To take advantage of the non-blocking behavior, you need to recompile the procedure; if recompilation is not required, you can continue to use the old DLL/library.
Upgrading XML Definitions
G Version 8 supports XML schema.
G The upgrade removes namespaces and prefixes for multiple namespaces.
G Circular reference definitions are read-only after the upgrade.
G Some datatypes are changed in XML definitions by the upgrade.
For more information on the specific changes to the PowerCenter software for your
particular upgraded version, reference the release notes as well as the PowerCenter
documentation.
Last updated: 01-Feb-07 18:54
Upgrading PowerExchange
Challenge
Upgrading and configuring PowerExchange on a mainframe to a new release while ensuring minimal impact on the current PowerExchange schedule.
Description
The PowerExchange upgrade is essentially an installation with a few additional steps and some changes to the steps of a new installation. When planning a PowerExchange upgrade, the same resources are required as for the initial implementation. These include, but are not limited to:
G MVS systems operator
G Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.).
G MVS Security resources
Since an upgrade is so similar to an initial implementation of PowerExchange, this document does not repeat the details of the installation. Instead, it addresses the steps that are not documented in the installation Best Practice, as well as changes to existing steps in that document. For details on installing a new PowerExchange release, see the Best Practice PowerExchange Installation (for Mainframe).
Upgrading PowerExchange on the Mainframe
The following steps are modifications to the installation steps or additional steps
required to upgrade PowerExchange on the mainframe. More detailed information for
upgrades can also be found in the PWX Migration Guide that comes with each release.
1. Choose a new high-level qualifier when allocating the libraries, RUNLIB and
BINLIB, on the mainframe. Consider using the version of PowerExchange as
part of the dataset name. An example would be SYSB.PWX811.RUNLIB.
These two libraries need to be APF authorized.
2. Backup the mainframe datasets and libraries. Also, backup the PowerExchange
paths on the client workstations and the PowerCenter server.
3. When executing the MVS Install Assistant and providing values on each screen,
make sure the following parameters differ from those used in the existing
version of PowerExchange.
H Specify new high-level qualifiers used for the PowerExchange datasets, libraries, and VSAM files. The value needs to match the qualifier used for the RUNLIB and BINLIB datasets allocated earlier. Consider including the version of PowerExchange in the high-level nodes of the datasets. An example could be SYSB.PWX811.
H The PowerExchange Agent/Logger three character prefix needs to be unique and differ from that used in the existing version of PowerExchange. Make sure the values on the Logger/Agent/Condenser Parameters screen reflect the new prefix.
H For DB2, the plan name specified should differ from that used in the existing release.
4. Run the jobs listed in the XJOBS member in the RUNLIB.
5. Before starting the Listener, rename the DBMOVER member in the new
RUNLIB dataset.
6. Copy the DBMOVER member from the current PowerExchange RUNLIB to the
corresponding library for the new release of PowerExchange. Update the port
numbers to reflect the new ports. Update any dataset names specified in the
NETPORTS to reflect the new high-level qualifier.
7. Start the Listener and make sure the PING works (a scripted version of this check appears below). See the PowerExchange Installation Best Practice or the Implementation Guide for more details.
8. The existing Datamaps must now be migrated to the new release using
the DTLURDMO utility. Details and examples can be found in the PWX Utilities
Guide and the PWX Migration Guide.
At this point, the mainframe upgrade is complete for bulk processing.
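The listener ping in step 7 can be scripted from the PowerCenter server or a workstation so that the same connectivity check is run after every environment is upgraded. The sketch below wraps the dtlrexe utility in Python; the prog=ping/loc= parameter style and the node name are assumptions to verify against your PowerExchange documentation before relying on the script.

import subprocess

# Assumption: "node1" is defined by a NODE statement in the local dbmover.cfg
# and points at the new listener port on the mainframe.
PWX_NODE = "node1"

def ping_listener(node):
    # Call dtlrexe to ping a PowerExchange listener; the parameter style is assumed.
    result = subprocess.run(
        ["dtlrexe", "prog=ping", "loc=%s" % node],
        capture_output=True, text=True,
    )
    print(result.stdout)
    return result.returncode == 0

if __name__ == "__main__":
    if ping_listener(PWX_NODE):
        print("Listener on %s answered the ping." % PWX_NODE)
    else:
        print("Listener ping failed -- check the new ports and dbmover.cfg entries.")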
For PowerExchange Change Data Capture or Change Data Capture Real-time,
complete the additional steps in the installation manual. Also perform the following
steps:
1. Use the DTLURDMO utility to migrate existing Capture Registrations and
Capture Extractions to the new release.
2. Create a Registration Group for each source.
3. Open and save each Extraction Map in the new Extraction Groups.
4. Ensure that the values for the CHKPT_BASENAME and EXT_CAPT_MASK parameters are correct before running a Condense.
Upgrade PowerExchange on a Client Workstation and the Server
The installation procedures on the client workstations and the server are the same as for an initial implementation, with a few exceptions. The differences are as follows:
1. New paths should be specified during the installation of the new release.
2. After the installation, copy the old DBMOVER.CFG configuration member to the new path and modify the ports to reflect those of the new release (a sketch of one way to script this follows this list).
3. Make sure the PATH settings reflect the path specified earlier for the new release.
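Updating the ports in the copied DBMOVER.CFG (step 2 above) can also be scripted if several workstation and server copies must be kept in sync. The sketch below is a simple text substitution in Python: the file locations and the old and new listener port numbers are assumptions, and it is intended for workstation or server dbmover.cfg files rather than the mainframe member.

import re
import shutil

# Assumptions: file locations and listener port numbers for the old and new releases.
OLD_CFG = r"C:\Informatica\PowerExchangeOld\dbmover.cfg"
NEW_CFG = r"C:\Informatica\PowerExchangeNew\dbmover.cfg"
PORT_MAP = {"2480": "2580"}   # old listener port -> new listener port

def copy_and_update_ports(source, target, port_map):
    # Copy the old configuration, then swap listener port numbers in the copy.
    shutil.copyfile(source, target)
    with open(target) as handle:
        text = handle.read()
    for old_port, new_port in port_map.items():
        # Replace the port only where it appears as a whole number, for example
        # inside a NODE or LISTENER style statement.
        text = re.sub(r"\b%s\b" % old_port, new_port, text)
    with open(target, "w") as handle:
        handle.write(text)

copy_and_update_ports(OLD_CFG, NEW_CFG, PORT_MAP)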
Testing can begin now. When testing is complete, the new version can go live.
Go Live With New Release
1. Stop all workflows.
2. Stop all production updates to the existing sources.
3. Ensure all captured data has been processed.
4. Stop all tasks on the mainframe (Agent, Listener, etc.).
5. Start the new tasks on the mainframe.
6. Resume production updates to the sources and resume the workflow schedule.
After the Migration
Consider removing or uninstalling the old release's software on the workstations and server to avoid any conflicts.
Last updated: 01-Feb-07 18:54