ETL Standards and Design Guidelines
Summary of standards:
3.1.1 SQL Override: Only use SQL overrides if they will either result in a substantial performance gain or alter data types. If used, Lookup and Source Qualifier transformations must have "_override" in the name.
3.1.2 Migrations: Migrating code will follow the same change management process as any other application code migration. A code review with the Data and Storage team is required for any new application or significant change to an existing application.
3.1.3 Surrogate Keys: Surrogate keys should be used to create a standard ID. It is recommended to use Informatica instead of DB scripts for creating the surrogate key.
3.1.4 Comments: Code commenting should occur as often as possible and should be useful to other developers.
3.1.5 Source Objects: Source objects should have the same structure as the system they are sourced from and be placed in the Shared Folder.
3.1.6 Target Objects: Target objects should have the same structure as the target table environment and be placed in the Shared Folder.
3.1.7 Data Objects & Transformations: Expression Transformations should be used to "bookend" other transformations. Filter Transformations should not be used; use a Router instead. Aggregators should rarely be used, and only with very specific grouping criteria. When using Lookup Transformations, use connected lookups where possible. Rarely use a Joiner Transformation; use a SQL override in a Source Qualifier Transformation instead.
3.1.8 Change Detection: Change detection via MD5 / CRC checksums (SCD-1, SCD-2, SCD-3).
3.3.1 Error Handling: Developers will follow the Informatica error log to identify technical errors. The error log will be loaded to a table to be more useful. Specific Informatica error codes will need to be fixed on a case-by-case basis.
3.3.2 Testing: It is recommended to unit test and peer review code before moving it to the project folder.
3.4 Recovery: A process to recover a session/workflow from failure.
3.5.1 User Access: Developers logging into PowerCenter should have their own account. Do not use shared accounts.
3.6 Versioning: Versioning will need to be used during any code development, with a tool to properly control versions. Code will be checked out when being developed and checked in when completed.
Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.4 Assumptions
2 ETL Standards & Procedures
3.4 Recovery
3.5 Security
3.5.1 User Access
3.6 Versioning
3.6.1 Informatica
6.3 Metadata
6.3.1 Informatica Technical Metadata
6.3.2 Business Metadata
1 Introduction
1.1 Purpose
This document serves as the starting point for Extract, Transform, Load (ETL) developers. It provides the standard processes, methods and components which will be utilized in creating, testing and deploying ETL integration interfaces. The guidelines in this document are ETL standards and can therefore be used across different ETL tools and processes.
1.2 Scope
The scope of this document is limited to the standards and guidelines for the ETL processes. It does not cover
hardware setup, software installs, operations support and other activities not directly related to development.
1.4 Assumptions
Audience will have a basic knowledge of ETL processes
Audience will have a basic understanding of Informatica PowerCenter
The current architecture is relevant for Data Conversion, Data Staging, Operational Data Store, and Data
Warehouse
After setting the folder types there are many ways to organize the folders. A few of the best practices /
approaches are: Development Environment, Object Type, and Location.
The folder structure comes from the start-up / instruction manual for Informatica. The approach taken aligns to
the recommended Object Type approach. The following folder structure is planned to be used:
Example: wf_s_STG_PERSON
Please make sure that all your objects have proper naming and comments.
Example: srt_EMPLOYEE_ID_DESC
Union Transformation Naming Standard: uni_(DESCRIPTOR).
Example: uni_PRODUCT_SOURCES
Update Strategy Transformation Naming Standard: upd_(UPDATE_TYPE(S)) or upd_(UPDATE_TYPE(S))_(TARGET_NAME) if there are multiple targets in the mapping. Example: upd_UPDATE_EXISTING_EMPLOYEES
Transaction Control Transformation Naming Standard: tct_(DESCRIPTOR)
Java Transformation Naming Standard: java_(meaningful name) that describes the processing being done.
Identity Resolution Transformation Naming Standard: ir_(meaningful name) that describes the processing being done.
Use of one parameter file per project is highly recommended. Group related mappings and sessions within a workflow section in the parameter file, in a nested approach.
[Global]
The parameters which are used by multiple workflows in a Project Folder are grouped and termed
as global parameters.
Examples:-
Database connections - $DBConnection_Src=TD_HRREPO_STG
$DBConnection_Tgt=TD_HRREPO_DM
Target File location – $OutputFilePath=/aaa/bbb/ccc -- full path
$InputFilePath=/aaa/bbb/ccc -- full path
When there are multiple Integration Services configured and used by objects within a single project folder, parameters can also be grouped based on the Integration Service that is going to use them.
The parameters which are used by multiple sessions within a single workflow are grouped under the workflow-level section (e.g. [FolderName.WF:WorkflowName]). The parameters declared in this section are local to the folder and workflow specified.
Examples are default/standard values and conventions.
Parameters that are local to a particular mapping and session are grouped in the session-level section (e.g. [FolderName.WF:WorkflowName.ST:SessionName]). Other sessions / workflows cannot use these parameters.
These parameters are customized for a particular session, i.e. they have situational use in the mapping flow.
Examples: suffix / prefix strings in string data types, reference dates for CDC
Maintain a separate parameter file for each integration layer, one file for each layer below:
Source to EDW Staging
EDW Staging to ODS
ODS to EDW
ODS to Reporting layer
ODS to Data Mart
ODS to Downstream
If initial values are not declared for the mapping parameter or variable then default values will be assigned based
on the data type. The value that is defined for the parameter remains constant throughout the entire session.
Create parameter files using a text editor such as WordPad or Notepad. Parameter files can contain only the
following types of parameters and variables:
Workflow variable
Session parameter
Mapping parameter and variables
The inbound and outbound file extracts should follow these guidelines:
The extension of a file can be .csv, .txt or .dat.
The delimiter in the file can be a comma (,), pipe (|) or tilde (~), with text fields preferably enclosed in double quotes ("").
The file must contain a header. The header should contain the following information:
Company
Functional Area
Timestamp
Count of total rows in the file
Comments (optional)
The file name is broken into the sections identified below, separated by underscores (_), padded with pound signs (#) where a section is not applicable, and in lower case only.
Example: xyzmod_corp_payrollded_00_0001_201210010715.ext
pruavi_ltc#_remittance_01_####_201204021130.ext
abcmed_ltc#_remittance_02_####_201204021131.ext
MD5_CHKSUM_VAL
CREATE_TS - load date timestamp
LAST_MOD_PROC_ID - Last run proc_id
OB table may be used as lookup to determine changes since last run for outbound extracts
OB table will have historical extracts
OB table may also be used for auditing purposes to determine what records were extracted in an
outbound file at any given point of time
OB Archive/ purge strategy to be implemented based on the need
Example: INT568 (wf_INT568_OUTBOUND_YTD) generates an output file which has 3 columns and 2 rows, as below:
1-INPUT_TRANSACTION
2- PRIMARY_TAXING
3- WORK_STATE_%
A separate outbound-specific table has to be created in the ODS schema with the following structure:
OB_INT568
INPUT_TRANSACTION | PRIMARY_TAXING | WORK_STATE_WITHHOLDING_% | MD5_CHKSUM_VAL | CREATE_TS  | LAST_MOD_PROC_ID
ABCD              | 20             | 2                        | HYGF1234…      | 2013-11-21 | 81
WXYZ              | 24             | 7                        | POUH681…       | 2013-11-21 | 81
So there will be an additional OB_INT target in the ETL map of each outbound interface, which will have similar fields to those of the actual target, along with the audit fields as shown above.
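A minimal DDL sketch for such an outbound audit table is shown below; data types and lengths are illustrative assumptions, and the actual column list simply mirrors the interface's output fields plus the audit columns.

CREATE TABLE ODS.OB_INT568 (
    INPUT_TRANSACTION           VARCHAR(50),   -- output field 1 of the interface
    PRIMARY_TAXING              INTEGER,       -- output field 2
    WORK_STATE_WITHHOLDING_PCT  DECIMAL(9,2),  -- output field 3 (shown as WORK_STATE_WITHHOLDING_% above)
    MD5_CHKSUM_VAL              CHAR(32),      -- MD5 checksum over the output fields
    CREATE_TS                   TIMESTAMP(0),  -- load date timestamp
    LAST_MOD_PROC_ID            INTEGER        -- last run proc_id from the job control table
);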
All Inbound / Outbound workflows are required to populate audit log information in the job control tables.
There are 2 tables that store metadata information about the job runs:
1 - PROJ_MAS: This table has information on Project Name, Division Name, and Domain Name.
2 - PROC_CTRL_TBL
For more details please refer to the attached documents, which contain example rows for the above tables.
It’s an accepted best practice to always load a flat file into a staging table before any transformations are done on
the data in the flat file.
Always use the LTRIM and RTRIM functions on string columns before loading data into a stage table.
You can also use the UPPER function on string columns, but before using it you need to ensure that the data is not case sensitive (i.e. that values such as ABC and Abc are not meant to be treated as distinct).
If you are loading data from a delimited file, make sure the delimiter is not a character which could appear in the data itself. Avoid using comma-separated files. Tilde (~) is a good delimiter to use.
Mappings which run on a regular basis should be designed in such a way that you query only that data from the
source table which has changed since the last time you extracted data from the source table.
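A sketch of the kind of Source Qualifier SQL override this implies is shown below; the table, the columns and the $$LAST_EXTRACT_TS mapping parameter are illustrative assumptions, not names from this project.

SELECT
    EMP_ID,
    LTRIM(RTRIM(FIRST_NAME)) AS FIRST_NAME,          -- trim string columns before staging
    LTRIM(RTRIM(LAST_NAME))  AS LAST_NAME,
    UPPER(LTRIM(RTRIM(DEPT_CODE))) AS DEPT_CODE,     -- UPPER only if the data is not case sensitive
    LAST_UPDATE_TS
FROM SRC.EMPLOYEE
WHERE LAST_UPDATE_TS > $$LAST_EXTRACT_TS             -- pull only rows changed since the previous run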
If you are extracting data from more than one table in the same database by joining them, then you can have multiple source definitions and a single source qualifier, instead of a Joiner transformation to join them, as
shown in the figure below. You can put the join conditions in the source qualifier. If the tables exist in different databases you can make use of synonyms for querying them from the same database.
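A minimal sketch of such a Source Qualifier override, joining two source tables instead of using a Joiner Transformation (table and column names are illustrative):

SELECT
    e.EMP_ID,
    e.FIRST_NAME,
    e.LAST_NAME,
    d.DEPT_NAME
FROM SRC.EMPLOYEE e
INNER JOIN SRC.DEPARTMENT d      -- both tables reside in (or are synonyms in) the same source database
    ON e.DEPT_ID = d.DEPT_ID     -- join condition placed in the Source Qualifier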
Try to make the scripts generic so that they can be used across projects.
No hardcoding in scripts.
No exposing passwords in scripts.
Make sure that the script saves the execution log for at least a few runs.
The script should have a brief description of its functionality, with proper indentation and comments throughout.
There should be proper error handling and notification in your script.
When designing mappings the developer should draw out a rough draft of the mapping that resembles a data
flow diagram in DI spec (DLD). Draw out the different paths that the data flow can take and the different actions
it will take on the target table. This diagram can act as a template for future mappings that will perform similar
tasks. Example below:
[Diagram: an Employee Table source flows through an Expression Transformation into a Router Transformation, which routes data validation errors to an error target and valid employee data to the valid target.]
An STM data map is an excel file which describes in detail where a field is sourced from and the exact target
destination of the field.
Exit criteria for ETL deliverables: ETL code reviewed and tested along with the STM, DI Tech spec and UTC (3.4).
Ex: Templates for Design Checklist, Technical Design Documents and Data Maps
Replace large Lookup tables (huge data volumes) with joins in SQL overrides wherever possible.
Make sure the SQL override is generated by the transformation, with other parts such as the WHERE clause added later, for easy validation.
Make sure to bring all the sources joined in the SQL override into the mapping, for better visibility and code maintenance.
In the case of a production target, an approved change request is needed; otherwise approval via a ServiceNow request is required.
Please note that manual change requests in QA or Production (for example changing a session property, changing a mapping, or adding a command task) are generally not supported, hence please take care of these kinds of requirements in your deployment.
1. For new application or significant change in existing application, setup a meeting with DA Arch, DA
Platform and Run teams for code review.
2. Open a change request with detail about code migration. This should include details on:
a. Source and target environments
b. Source and Target Folders
c. Parent object name (for example workflow name(s))
d. Informatica deployment objects
i. Label information
ii. Connection request information – please ensure connections are requested in proper
format only. Requestor is responsible for providing all the information including
username and password.
iii. Folder request information
iv. OS Profile request information
v. Scheduling information
vi. Special instruction if any
e. Server deployment objects
i. Folder structure setup information
ii. Source/target files
iii. Parameter files
iv. Configuration files
3. Open a related Remedy ticket for JOB SCHEDULER team with scheduling details if required
•Static: Static deployment is used in scenarios where objects are not expected to change. Objects are added manually to the deployment group object.
•Dynamic: Dynamic deployment groups are used where objects change often. A query is used in this case, which can dynamically be associated with the latest version of the objects.
A label is a versioning object that you can associate with any versioned object or group of
versioned objects in a repository
Advantages:
This method allows for a simple means to ensure that every record is unique in the table. Furthermore, it
becomes easier to insert large volumes of data quickly as no lookup on the target table is needed to see if it
already exists.
During the ETL load for parent and child relationship tables, parent tables are loaded with the surrogate key first
and a separate ETL process retrieves the primary key from the parent table to load child tables.
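As a sketch, the child-table load joins the child staging data to the already-loaded parent table on the natural key to pick up the parent's surrogate key (table and column names are illustrative):

SELECT
    p.PERSON_SK,                                     -- surrogate key generated when the parent was loaded
    c.ADDRESS_LINE_1,
    c.CITY,
    c.POSTAL_CODE
FROM STG.PERSON_ADDRESS c
INNER JOIN DM.PERSON p
    ON c.PERSON_NATURAL_ID = p.PERSON_NATURAL_ID     -- match on the natural/business key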
3.1.4 Comments
Proper commenting in code is essential to developing an effective module. An Informatica mapping which is commented clearly and concisely is easier to read, analyze, debug, modify and test. While commenting code is an art rather than a science, these guidelines should be followed to establish a system standard.
Ex: After selecting Edit Mapping, the developer can add comments.
EXAMPLE 1: --BH – 4/29/2011 – Added a column and updated the name field so that a flag could be set which
will be used in rtr_PERSONNEL_SCOPE.
EXAMPLE 2 if the SIR or defect number is known: // BH - 04/29/2011 - SIR1234 - ‘rounding of
AT_OPEN_MO_BAL_v is changed from 4 to 2 to ensure accuracy
The columns should have the same names and lengths whenever possible.
The source objects should be imported into a shared folder and all developers should create shortcuts to the table in the shared folder.
Sourcing table objects from the source system will be the ETL Administrator's responsibility.
Only the ETL Administrator should have read-write permission on the shared folder.
Using a shared folder allows numerous developers to work on the same tables without causing problems like unwanted editing or table definitions going out of sync.
o NOTE: Single-pass reading is the server’s ability to use one Source Qualifier to populate
multiple targets
o NOTE: For any additional Source Qualifier, the server reads this source. If there are different
Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the
server reads the source for each Source Qualifier
If processing intricate transformations, consider loading source flat file first into a relational database.
This allows the PowerCenter mappings to access the data in an optimized fashion by using filters and
custom SQL Selects where appropriate
The structure of the target objects should look nearly identical to where and how the target systems are set up.
In the case of the staging tables, the target tables will more reflect the source system as this is where the
extracts will be populated. For the ODS, the target tables will more reflect the eventual data warehouse.
Additional columns may be added for tracking purposes which is why the target table structure may not be
exactly the same as the target environment (Ex: Adding a unique key, Timestamps, Version Stamps, etc.).
All target table objects must be extracted into the shared object folder before they may be used for any
mapping
Sourcing table objects from the target database will be the ETL Administrator's responsibility.
Only the ETL Administrator should have read-write permission on the shared folder
Example: if a mapping has four instances of the CUSTOMER table according to the update strategy
(Update, Insert, Reject, and Delete), the tables should be named as follows:
CUSTOMER_UPD, CUSTOMER_INS, CUSTOMER_DEL, CUSTOMER_REJ.
Session Property Insert or Update – If the target table volume is large, use the session
property “insert” or “update” and route one of the targets to a flat file. The flat file is used
to insert or update within a separate mapping.
If the incoming source records usually require more updates than inserts into the target
table, create two target instances: the target database instance for updating and the flat file
instance for inserting. Then set the session property “target treat row as” to update (this can
also be performed by using session partitions on this session since update is more expensive
than insert). A second mapping will use the flat file from the first mapping to insert to the
same target table.
If the incoming source records usually require more inserts than updates into the target table, create two target instances: the target database instance for inserting and the flat file
instance for updating. Then set the session property “target treat row as” to insert. A
second mapping will use the flat file from the first mapping to update the same target table.
Update Strategy – If the target table volume is small, use the “insert else update” update
strategy (with or without target table lookup).
The value for ’Treat source rows as’ session property must be = DATA DRIVEN
The properties insert, update, delete etc pertaining to target instance must be checked /unchecked
appropriately
The Bulk Writer can be used to insert or update the targets, giving optimized performance compared to the conventional Relational connection.
Socket Buffer Size: Set the socket buffer size to 25 to 50% of the DTM buffer size to increase session performance. You might need to test different settings for optimal performance. Enter a value between 4096 and 2147483648 bytes. Default is 8388608 bytes.
EscapeCharacter: Escape character of an external table. If the data contains NULL, CR, and LF characters in a Char or Varchar field, you need to escape these characters in the source data before extracting. Enter an escape character before the data. The supported escape character is backslash (\).
Ignore Key Constraints: Ignores constraints on primary key fields. When you select this option, the PowerCenter Integration Service can write duplicate rows with the same primary key to the target. Default is disabled. The PowerCenter Integration Service ignores this value when the target operation is "update as update" or "update else insert."
Only required ports should be used across the mapping. Unused ports must be deleted.
The data types and lengths for the mapped fields should be consistent throughout the mapping.
A recommended practice is to always place an expression after the source qualifier to allow for the mapping to be edited later without disconnecting ports.
Use an Expression Transformation as a gathering location to make the mappings easier to read.
Create an Expression Transformation to bring all the ports together before going to the next transformation or target.
Note that Informatica processes the ports in an Expression Transformation in the following order of priority: input ports first, then variable ports (from top to bottom), then output ports.
Best Practices:
Calculate once, use many times. Avoid calculating or testing the same value over and over.
Calculate a formula once in an expression and then set a True/False flag
Use local variables to simplify complex calculations. Use variables to calculate a value used
several times
Watch the data types of fields and implicit conversions involved. Excessive data type conversions
will slow the mapping
A Router Transformation should be used in place of the Filter Transformation since routers redirect
unwanted data but allows filtered data to be stored if needed. Use a Router Transformation to separate
data flows instead of multiple Filter Transformations
Use a Router Transformation if more than one target requires some kind of filter condition
Use a dynamic lookup in scenarios where a single data set pulled in a run has multiple records with respect to the natural keys.
To ensure a match on a Lookup Transformation, the developer may need to generate a SQL override and trim leading and trailing spaces from the values in the lookup condition (Ex: RTRIM(LTRIM(fieldname))); a sketch of such an override follows the best practices below.
Size the Lookup Data and Index Cache Sizes and specify them as part of a tuning exercise
Best Practices:
When using a Lookup Table Transformation, improve lookup performance by placing all conditions that
use the equality operator = first in the list of conditions under the condition tab
When the source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This
typically improves performance by 10 to 20 percent
If caching lookups and performance is poor, consider replacing with an unconnected, un-cached lookup
If the same lookup is used in multiple mappings or the same lookup is used more than once in the same
mapping, take advantage of reusable transformation. In the case of using the same lookup multiple times
in the same mapping, the lookup will only be cached once and both instances will refer to the same cache
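A sketch of a Lookup SQL override that trims the condition column and limits the rows cached (table and column names are illustrative; equality conditions should still be listed first on the condition tab):

SELECT
    CUST_ID,
    RTRIM(LTRIM(CUST_CODE)) AS CUST_CODE,   -- trimmed so the lookup condition matches reliably
    CUST_NAME
FROM DM.CUSTOMER
WHERE CURR_ROW_IND = 'Y'                    -- cache only current records rather than full history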
Do not reset the value unless the logic of the mapping demands it.
Do not overwrite sequence generator values during migration from one environment to another unless explicitly instructed.
Make sure that the start value of the sequence is higher than 0, leaving a decent number of holes for default records and space for unexpected future exceptions. (Example: the start value in the HR Reporting Wave 1 project was 100 for each sequence generator.)
If only a DB refresh is performed in DEV/STAGE from PROD, then reset the sequence generators equal to PROD.
Set an optimum cache size for the sequence generator for better performance.
When joining two sources, if both sources have the same number of records, select as the master table the one having more unique values in the join column.
If the use of a joiner is necessary when loading parent and child tables then separate mappings must be
developed: one to load the parent table and one to load the child table(s)
A Normalizer is good for creating one-to-many records, which is useful for breaking a table's repeating columns out into individual rows.
Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.
Do not code update strategies when all the rows to the target are updates or inserts.
Rejected rows from an update strategy are logged to the bad file. Consider filtering before the update
strategy. Retaining these rows is not critical because logging causes extra overhead on the engine. Choose
the option in the update strategy to discard rejected rows
If an update override is necessary in a load, consider using a Lookup Transformation just in front of the
target to retrieve the primary key. The primary key update will be much faster than the non-indexed
lookup override
Target dimension tables or similar tables should have a column to store MD5 checksum value which is populated
with a hash value.
Use Informatica MD5 function to achieve this by passing all the column(s) that we intend to check for changes.
Change detection: calculate MD5() for the same columns in the ETL map and perform a lookup on the target table, comparing the two MD5 values for each key combination.
SCD 1: Update the existing record using an update strategy (insert or update).
SCD 2: Insert the current record and flag the existing record in the table as inactive (depending on the design). (This features Insert, or Insert and Update.)
SCD 3: Insert for a new record, and update the columns that contain previous and current values.
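The document implements the comparison with a Lookup on the target; an equivalent set-based sketch of the change classification (table and column names are illustrative) is:

SELECT
    s.EMP_ID,
    CASE
        WHEN t.EMP_ID IS NULL                      THEN 'INSERT'     -- new key: insert
        WHEN s.MD5_CHKSUM_VAL <> t.MD5_CHKSUM_VAL  THEN 'UPDATE'     -- checksum changed: apply SCD-1/2/3 handling
        ELSE 'NO_CHANGE'
    END AS CHANGE_FLAG
FROM STG.EMPLOYEE s
LEFT OUTER JOIN DM.EMPLOYEE t
    ON s.EMP_ID = t.EMP_ID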
Check / uncheck target instance properties for insert, update, delete etc. accordingly.
Tracing level at session / transformation level must be ‘NORMAL’ which is the default.
Do not use TRUNCATE TABLE option available in the session properties, rather create a separate dummy
session with a pre / post SQL to truncate tables.
Make sure that your workflow is HA aware; please select following properties at Workflow level.
– Enable HA Recovery
– Automatically recover terminated tasks – check this if you wish workflow to get automatically
recovered, else not.
There should be no hardcoded path in script/parameter file/workflow objects.
For CDC Workflows, make sure that each workflow has only one session. If possible, logically group
multiple CDC sources in one mapping. Following models are supported
Developers should request DBAs for truncate privilege over only certain required tables for data testing
The following sections explain how a developer would parameterize a workflow as well as properly name a
workflow.
– Create session logs / workflow logs and bad files with timestamp suffix by check marking the
option in session properties.
– Retain the logs for 15 days and purge them after the timeframe to maintain server space
In the case of a session or workflow error the developer should try to investigate and fix in order:
– Mapping Level – Not usually a quick fix and takes time to investigate. The reason why is that the
user will need to go through the debugging process
The following are common errors that a developer may need to walk through. For a complete listing please review the Informatica troubleshooting guide or help menu.
Error: The connection test failed. The directory [XXXXX] does not exist.
Resolution: Check the parameter file to see if the connection is listed correctly. If the connection is listed correctly then check to see if the server is down.

Error: Execution terminated unexpectedly
Resolution: Generic error. Generally specific to issues with Informatica settings, such as the log or cache being maxed out.

Error: Connection Error
Resolution: Check to see if parameter connections are set up correctly in the session. The error should list which transformation failed. Occasionally the error will just be "0". This is a definite sign that the connection is wrong in the session or the parameter.

Error: Sequence Generator Transformation: Overflow Error
Resolution: The sequence generator has reached the end of its user-specified value. The developer should look into the sequence generator's length to verify the issue. Then the developer must decide to either expand the maximum value for the sequence or reset the sequence generator back to its initial value. Either decision will have a large impact, so this issue should be raised up.

Error: Unique Record Constraint Failure
Resolution: The target table's id has to be unique but a duplicate ID attempted to load. The session will not fail but the individual record will be dropped. The record should be found in the error log.

Error: Invalid Mapping
Resolution: The saved mapping is invalid. Normally this occurs when a transformation is not connected to anything or, in the case of an active transformation, not all of the ports are connected. The error should specify the transformation.

Error: Data Value Overflowed or Too Large
Resolution: The precision settings cannot handle the amount of data being processed. The developer will need to edit the declared variable in the mapping to have a higher precision.

Error: User defined lookup override query contains invalid characters
Resolution: The listed lookup transformation contains an invalid SQL character. The developer will just need to go to the lookup transformation and correct the query.

Error: Error Truncating Target Table
Resolution: This is a user permissions error or the table being truncated is locked. In either scenario the DBA will have to fix it.

Error: Performance Error
Resolution: A session task takes a longer time than it should to load a small amount of data. A developer should check with the DBA of the source to see the DB performance. After that the developer should review any SQL used in the mapping. The developer should also try to run the mapping in another environment to see if their environment is down.

Error: Cannot find parameter file for the session
Resolution: The parameter file name has changed, the file has been moved or deleted, or it didn't exist at all. This will cause a workflow to fail immediately. The developer should check the parameter in Workflow Manager and verify that the file is correct on the server.

Error: Invalid lookup connect string
Resolution: The lookup transformation has an invalid location for its lookup table. The developer should check to see if the parameters are set up correctly in the lookup transformation as well as the session. If everything appears to be correct then the developer should verify that the table is still available within the Informatica Shared folder. If those appear to be fine then the developer should check the database to see if the table has been altered or dropped.

Error: Conversion from source type to target type is not supported
Resolution: There is an invalid data type conversion. This occurs when an object's data type is altered but there was not a valid conversion. Ex: A number field is changed to varchar in an expression transformation drop-down menu but no conversion function (TO_CHAR) is called out.
3.3.2 Testing
This section explains the standards and methodology for a developer to finalize their code. Before moving code
from a developer’s folder to the project folder the code will need to go through a series of unit tests.
The developer will need to select a number of specific cases that they are looking to test.
Retain test data, test scripts and test results to perform regression testing and possibly reuse the same
data for Integration test in a neat and readable manner.
The final step of unit test will be a review and signoff by the ETL team lead on the test checklist.
Developers are responsible for unit testing. The developer should check the following before and during
Unit testing:
Once a mapping is completed, unit tested and peer reviewed it will be moved from the individual developer folder
into a common subject area folder. The folder structure will be setup by the project Informatica admin. This
folder will be the project folder and will contain only final code.
3.4 Recovery
Define a process to restart session/workflow in case of a failure
Point of contact in build team to help research failure in case run team needs assistance
3.5 Security
Handled by Informatica administrators
Although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain.
Other forms of security available in PowerCenter include permissions for connections. Connections include
database, FTP, and external loader connections. These permissions are useful to limit access to schemas in a
relational database and can be set-up in the Workflow Manager when source and target connections are defined.
Occasionally, restriction changes to source and target definitions are needed in the repository. A recommended
approach to this security issue is to use shared folders, which are owned by an Administrator. Granting read access
to developers on these folders allows them to create read-only copies in their work folders. When implementing a
security model, keep the following guidelines in mind:
Create groups with limited privileges
Do not use shared accounts
Limit user and group access to multiple repositories
Customize user privileges
Limit the Administer Repository privilege
Restrict the Workflow Operator privilege
Follow a naming convention for user accounts and group names
Identify users and the environments they will support (development, UAT, QA,
Production, production support, etc.)
Identify the PowerCenter repositories in the environment (this may be similar to the basic groups listed in
Step 1, e.g., development, UAT, QA, production, etc).
Identify what users need to exist in each repository
Define the groups that will exist in each PowerCenter Repository. Repository privileges work in
conjunction with folder permissions to give a user or group authority to perform tasks. Consider the
privileges that each user group requires, as well as folder permissions, when determining the breakdown
of users into groups. It is recommended to create one group for each distinct combination of folder
permissions and privileges
Assign users to groups. When a user is assigned to a user group, the user receives all privileges granted to
the group
Define privileges for each group and assign folder permissions. Informatica PowerCenter can also assign
privileges to users individually. When a privilege is granted to an individual user, the user retains that
privilege even if his or her user group affiliation changes. Example: a user in a Developer group who has
limited group privileges needs to act as a backup Administrator when the current admin is not available.
To do so the user must have the Administrator privileges. Grant the Administrator privilege to the
individual user, not the entire Developer group
Root Administrator – This account is an admin console user with domain admin access. This user has the ability
to create and restrict other accounts. Most security will be run through this user and developers will be
dependent on this user to be able to have access. Since this user grants access this account should be heavily
restricted and essentially works as a security Administrator. To summarize, here are the security related tasks an
Administrator should be responsible for:
Creating user accounts
Defining and creating groups
Defining and granting folder permissions
Defining and granting repository privileges
Enforcing changes in passwords
Controlling requests for changes in privileges
Creating and maintaining database, FTP, and external loader connections in conjunction with
database Administrator
Working with operations group to ensure tight security in production environment
Domain user – This is the developer account and will only have access to objects it has been granted access to, with no user create/edit ability. The developer account will have read/write access to its own developer folder, which is where the majority of the coding will occur. It is possible to assign read-only permission to a developer so that they will have access to view finalized code as well as other developers' code.
Data and Storage Architects group manages the Informatica platform in development, stage and production
environments.
Currently there are four security groups for the platform in each environment where individual users log into the
active directory. The settings are:
1. Setup a meeting with Data and Storage Architect to discuss application and/or user access requirements.
The following items need to be discussed in this meeting.
a. Group membership
b. Folder permissions
c. Import and export directory requirement and transfer of files from/to Informatica Shares
d. Database source and target connectivity requirements
e. Code review
2. Open a Remedy ticket assigned to Data and Storage Architect group with the following information
a. Windows Domain where access is requested (BENHRIS, STAGEPRD, TWIDPRD):
b. Windows Account Name:
c. New Application Name:
d. File Share Requested:
e. Existing Application Name for which access is requested:
f. Type of Privilege requested (Developer, Support or Release Manager).
5. Follow naming conventions for Informatica objects listed in the Developer Guide.
3.6 Versioning
3.6.1 Informatica
Track objects during development: add Label, User, Last saved, or Comments parameters to queries to track objects during development.
Associate a query with a deployment group: for creating a dynamic deployment group, associate an object query with it.
3.6.1.1 Check-out
Identify mapping, sessions, and workflows that need to be modified for code changes within the
Integration folder
A developer should only check out the code that they will be working on, as this allows other users to continue working instead of waiting for the code to be checked back in. Lastly, a developer should always check out code being developed. Doing so will keep code consistent and will help with versioning if an issue arises.
3.6.1.2 Check-in
Once the object has been modified for code changes, the developer will then have to use the “check-in” feature to
commit the changes to the repository. This is done by right clicking the checked-out object and then selecting the
check-in option under versioning. Whenever a developer is done working on a mapping they must check the code
in to allow other developers to continue working.
Use TFS as a version control repository for SQL (DDL, DML) and Unix scripts.
4 Performance Optimization
4.1 Performance Tuning Steps in Informatica
The goal of performance tuning is to optimize session performance by eliminating performance bottlenecks.
To tune the performance of a session, first you identify a performance bottleneck, eliminate it, and then
identify the next performance bottleneck until you are satisfied with the session performance. You can use the
test load option to run sessions when you tune session performance.
The most common performance bottleneck occurs when the Informatica Server writes to a target database.
You can identify performance bottlenecks by the following methods:
Running test sessions. You can configure a test session to read from a flat file source or to write to a
flat file target to identify source and target bottlenecks.
Studying performance details. You can create a set of information called performance details to
identify session bottlenecks. Performance details provide information such as buffer input and output
efficiency.
Monitoring system performance. You can use system-monitoring tools to view percent CPU usage,
I/O waits, and paging to identify system bottlenecks.
Once you determine the location of a performance bottleneck, you can eliminate the bottleneck by
following these guidelines:
Eliminate source and target database bottlenecks. Have the database administrator optimize
database performance by optimizing the query, increasing the database network packet size, or
configuring index and key constraints.
Eliminate mapping bottlenecks. Fine-tune the pipeline logic and transformation settings and options
in mappings to eliminate mapping bottlenecks.
Eliminate session bottlenecks. You can optimize the session strategy and use performance details to
help tune session configuration.
Eliminate system bottlenecks. Have the system administrator analyze information from system
monitoring tools and improve CPU and network performance.
If you tune all the bottlenecks above, you can further optimize session performance by partitioning the
session. Adding partitions can improve performance by utilizing more of the system hardware while
processing the session.
Because determining the best way to improve performance can be complex, change only one variable at a
time, and time the session both before and after the change. If session performance does not improve, you
might want to return to your original configurations.
For more information check out the Informatica Help from any of the three informatica client tools.
Store all Sequence Generators as re-usable (even if they won't be reused) so they end up in the Transformation section of the Project Folder. This will make it easier to find and re-set the sequences if necessary. However, when a sequence is marked re-usable the cached value can't be zero. Make sure you don't keep the cached value at 1, because it will access the Repository for every row; instead make the cache value 100.
Optimize Query. Give Hints, add indexes, analyze tables, Create index on order by and group by columns.
Filter data in source side.
Single-pass reading. Use router, decode and other transformation.
Consider more shared memory for large number of transformations. Session shared memory at 40MB should
suffice.
Calculate once, use many times.
Only connect what is used.
Watch the data types.
The engine automatically converts compatible types.
Sometimes conversion is excessive, and happens on every transformation.
Minimize data type changes between transformations by planning data flow prior to developing the
mapping.
Facilitate reuse.
Plan for reusable transformations.
Use variables.
Use mapplets to encapsulate multiple reusable transformations.
Only manipulate data that needs to be moved and transformed.
Delete unused ports particularly in Source Qualifier and Lookups. Reducing the number of records used
throughout the mapping provides better performance
Use active transformations that reduce the number of records as early in the mapping as possible (i.e.,
placing filters, aggregators as close to source as possible).
Select appropriate driving/master table while using joins. The table with the lesser number of rows should
be the driving/master table.
When DTM bottlenecks are identified and session optimization has not helped,
Use tracing levels to identify which transformation is causing the bottleneck (use the Test Load option in
session properties).
Utilize single-pass reads.
Single-pass reading is the server’s ability to use one Source Qualifier to populate multiple targets.
For any additional Source Qualifier, the server reads this source. If you have different Source Qualifiers for
the same source (e.g., one for delete and one for update/insert), the server reads the source for each
Source Qualifier.
Remove or reduce field-level stored procedures.
If you use field-level stored procedures, Power Center has to make a call to that stored procedure for
every row so performance will be slow.
Lookup Transformation Optimizing Tips.
Indexing on lookup tables.
In a lookup, never pass a NULL value to an input port; instead use a default value like -999.
Use SQL Overrides whenever possible to limit the number of rows returned.
Only cache lookup tables if the number of lookup calls is more than 10-20% of the lookup table rows. For a small number of lookup calls, do not cache if the number of lookup table rows is big. For small lookup tables (less than 5,000 rows), cache if there are more than 5-10 lookup calls. Remove the cache if a low number of rows is coming in and the lookup table is large. When caching is required, only select the data needed for the lookup. For example, only select current records when caching tables.
Reuse cache when used by 3 or more sessions in a single job stream AND it takes greater than 15 minutes
to create the cache file
When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10-20%. Do this by adding condition logic to the SQL override whenever possible.
The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then the 500K rows will have to be adjusted down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to 250K-300K, so the lookup table would not be cached in this case).
When using a Lookup Table Transformation, improve lookup performance by placing all conditions that
use the equality operator = first in the list of conditions under the condition tab.
Replace lookup with decode or IIF (for small sets of values).
If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.
UN-connected lookups should be used when less than 30% of the input rows need to be looked up for a
value.
For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table
to a persistent file on the first run, enable update else insert on the dynamic cache and the engine will
never have to go back to the database to read data from this table. It would then also be possible to
partition this persistent cache at run time for further performance gains (Caution: Use only with
approval).
Review complex expressions.
Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
Minimize aggregate function calls.
Operations and Expression Optimizing Tips
Numeric operations are faster than string operations.
Optimize char-varchar comparisons (i.e., trim spaces before comparing).
Operators are faster than functions (i.e., || vs. CONCAT).
Optimize IIF expressions.
Avoid date comparisons in lookup; replace with string.
Test expression timing by replacing with constant.
Use Flat Files
Using flat files located on the server machine loads faster than a database source located in the server
machine.
Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
If processing intricate transformations, consider loading the source flat file into a relational database first, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL selects where appropriate.
If working with data that is not able to return sorted data (e.g., Web Logs) consider using the Sorter Advanced
External Procedure.
Use a Router Transformation to separate data flows instead of multiple Filter Transformations.
Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator Transformation to optimize
the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source
cannot be ordered.
Use a Normalizer Transformation to pivot rows rather than multiple Instances of the same Target.
When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master
source.
If an update override is necessary in a load, consider using a lookup transformation just in front of the target
to retrieve the primary key.
The primary key update will be much faster than the non-indexed lookup override.
Tune Session Parameters
Buffer Block Size (at least 20 rows at a time)
Enable or Disable lookup cache
Increase cache size (data and index):
Data cache = (column(s) size + 8) * number of rows
Index cache = (column(s) size + 16) * number of rows
Increase commit interval
Remove any verbose settings made in transformations for testing. Also, avoid running the session on verbose
for large set of data.
Monitor the sessions and document it.
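As an illustrative calculation of the cache sizing formulas above (the numbers are assumptions, not measured values): a lookup caching 400,000 rows with return columns totaling 200 bytes and a 50-byte condition column would need roughly (200 + 8) * 400,000 = ~83 MB of data cache and (50 + 16) * 400,000 = ~26 MB of index cache; actual sizes should be confirmed from session statistics.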
5 Teradata Connections
The connection should be chosen as per the below standards and based on the DBA’s advice.
The Run strategy should be discussed with the DBA on how many sessions/connections to be run in parallel with
the above set of combinations.
The Load operator does not support update, select, and delete operations on the target table. The data
sources for the Load operator can come from anywhere, such as a flat file, a queue, an ODBC-compliant
source, an access module provided by Teradata, a named pipe, or a customer access module created by an
end user, to name a few.
Features:
o Fastest way to load data into an empty table
o Moving data in bulk (block) fashion
o Multi-session
o Recommended for data volumes of more than 100,000 rows
Restrictions:
o Target table must be empty
o No secondary index on the target table
o No join index on the target table
o No foreign key on the target table
o If the job fails the table needs to be re-created
Considerations:
The LOAD operator (Teradata FASTLOAD utility) should be used when there is a need to load a very large volume of data quickly into an empty table in the EDW. This is generally for a load strategy utilizing an ELT approach, where data is loaded into a staging table, transformed in the database, and then coalesced with data from the production table via a rename process.
Stream (TPUMP)
The Stream operator is a consumer-type operator that emulates the Teradata TPump utility to perform
high-speed DML transactions (SQL INSERT, UPDATE, DELETE, or UPSERT) in a near-real-time mode to a
table (or tables) while queries are performed on the table (or tables).
The Stream operator allows parallel inserts, updates, deletes, and upserts to empty or preexisting Teradata
Database tables. The Stream operator uses Teradata SQL within the Teradata PT infrastructure in its
communication with the Teradata Database. It uses multiple sessions to load data into one or more empty
or preexisting tables.
The Stream operator provides an alternative to the Update operator for the maintenance of large
databases under control of a Teradata Database. The Stream operator can be used to maintain tables in
the Teradata Warehouse, as can the Update operator. However, the Stream operator allows access to the
database and does not lock the target tables that are updated, so that interactive read and write activities
can be performed concurrently. The Stream operator supports many of the same features and facilities as
the Teradata TPump standalone utility.
Features:
o Parallel data load activity
o Macro based operation for multiple statements packaged together.
o Serializable execution operation of statements
o Handles streams of data with a rich set of operations: insert/update/delete/upsert
o Not as fast as a bulk load, but better than conventional insert/update/delete
o All operations are primary index based
o Need careful tuning (speed, primary index, etc.)
o Ideal for mini-batch and small amounts of data to be loaded; the recommended data volume is thousands of records up to a few hundred thousand at a time.
Restrictions:
o Jobs have tunable parameters that need to be discussed and tested with Teradata DBAs to ensure
optimal performance.
6 Appendix
6.1 Mapping Examples
Ex: Mapping using one source and populating many different target tables. This occurs by using a Router Transformation.
Surrogate Key – A sequence-generated ID used to track records across tables. This is useful when working with data coming from different systems.
Time Stamp – Records when the mapping was last run. This is crucial when tracking data
extracts and allows the developer to know if the data is up to date
Source Name – This column is needed if numerous sources are used. This can also be used to list the name of the flat file, which lets developers know if the correct file was used.
Mapping Stamp – This column is needed if more than one mapping is used to populate a table.
If so it is recommended to list the mapping where the data came from
Job_ID - A distinct ID created to track the exact run of a mapping
There are two types of data staging architecture, one layer and two layers. In the one layer
architecture, if the source systems are online with small volume or offline systems (files) then there
should be a single staging area. In the two layers architecture the source systems are online systems
and a detailed level of quality checking needs to be performed at staging. This approach helps with source system contention and resource impact, and when there is a smaller window to access the source system with a large volume of data. The first staging area can be used to park the raw data (delta) from the source system, which provides the source system snapshot extracted.
one/two days of data based on requirements. The second staging area will be used to store the
cleansed and standardized data. This staging area can hold the delta or historical data based on
business needs. Retaining historical data helps in data analysis for future use without dependency on
source systems.
Based on the current planned data architecture the recommended approach is to use one layer data
staging as the Operational Data Store (ODS) will take the place of the function of the second layer.
The data staging area will be used to park the raw data from source system which will provide the
source system snapshot extracted. The staging area will hold only the previous extract and will be
truncated during each extract. The ODS will then be used for historical and quality purposes.
6.2.2 Debugging
Existing non-reusable session: Uses existing source, target, and non-reusable session
configuration properties. The Debugger does not suspend on error
Existing reusable session: Uses existing source, target, and session configuration properties.
When the developer runs the Debugger, it creates and runs a debug workflow for the reusable
session
Create a debug session instance: The developer can create source, target and session
configuration properties on their own through Debugger Wizard. It creates and runs a debug
workflow for the session
Selecting Instance Name: While selecting the instance name, the breakpoint can be created for an individual
transformation or for Global condition.
The recommended approach is to set breakpoints wherever data is being sourced or altered (Ex: Source Qualifier,
Lookup, and Router Transformations). The debugger will pause upon meeting the condition defined. In the
example below, a breakpoint is defined for Payee_name = “TED DOE”. The debugger will pause once the matching
record is found. The developer can also edit a breakpoint by altering the condition.
6.2.3 Encryption
Encryption should be based off of security policies put in place by the Security Team. The policies would list how
much and what data is encrypted (All, Specific Environments, Specific Tables, a Few Fields, or No encryption). ETL
developers would code for the policies but will not define the policies. If encryption is elected to be enforced it
can be handled in Informatica using the Source Qualifier, Lookup, and Expression Transformations. In both the
Source Qualifier and lookup transformations the developer would use a SQL override to encrypt the data fields.
The expression transformation can encrypt the data by editing the ports. Use the field that the developer wants to
encrypt as the input port then override the value using the encryption rule as the output port. The developer will
need to make sure that the database or staging table can handle the length of the encryption. This can be seen if
the values sourced do not match the values returned after encryption/decryption. If this occurs the developer
should check the field length of every port of the fields encrypted to see if it can handle the full length of
encryption. Encrypting the data normally extends the length of the value and if the port or table is not ready the
data will be truncated. Then when it is eventually decrypted the data will not match.
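As a sketch only, a Source Qualifier override applying a database-side encryption routine might look like the following; enc_fn is a hypothetical placeholder for whatever function or UDF the security policy actually mandates, and the target columns must be sized for the encrypted output.

SELECT
    EMP_ID,
    enc_fn(SSN)       AS SSN_ENC,        -- enc_fn is a placeholder, not a real built-in function
    enc_fn(BANK_ACCT) AS BANK_ACCT_ENC
FROM SRC.EMPLOYEE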
6.3 Metadata
Informatica PowerCenter captures various forms of metadata in its data repository, which is a queryable, though somewhat cryptic, database. Examples of such metadata are table definitions, column mappings, business rules, data definitions, and execution statistics. Repository metadata originates in two forms:
Static
Source and Target Tables
Workflow/Mapping Configuration
Last Saved and other edit audit fields.
Dynamic
Which jobs ran last night
Were they successful
What tables were affected
How many rows were read & written
How long did it take to run
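For example, last night's runs can be pulled with a query along these lines, assuming the MX reporting views are exposed; the view and column names used here (REP_WFLOW_RUN) should be confirmed against the repository guide for the installed version.

SELECT
    WORKFLOW_NAME,
    START_TIME,
    END_TIME,
    RUN_STATUS_CODE                      -- status code meanings are documented in the repository guide
FROM REP_WFLOW_RUN
WHERE START_TIME >= CURRENT_DATE - 1     -- runs that started since yesterday
ORDER BY START_TIME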
Only use outside functions if they will provide a significant increase in throughput or decrease in processing time. Otherwise, metadata benefits decrease, as no metadata will be stored in the repository. In short, Informatica metadata is strictly technical metadata.
Business Rules describing what is and is not included within the data warehouse
Definitions of Business Hierarchies and KPIs
Common Business Definitions and Calculations for data elements
Transformation and Conversion Rules in Business context
Source System Names/Locations
User security profile
Descriptions of warehouse attributes
A description of warehouse data transformations over time