Business Case: Why Do We Need ETL Tools?: Informatica Beginners
Think of GE: the company has more than 100 years of history and a presence in almost every industry.
Over these years, the company's management style has changed from bookkeeping to SAP.
This transition did not happen in a single day. Along the way, the company used a wide array of
technologies, ranging from mainframes to PCs, data storage ranging from flat files to relational
databases, and programming languages ranging from COBOL to Java. As a result, different businesses,
or to be precise different sub-businesses within a business, run different applications on
different hardware and different architectures.
Technologies were introduced as and when they were invented and as and when they were required.
This led directly to a scenario in which the HR department of the company runs on Oracle
Applications, Finance runs SAP, some parts of the process chain are supported by mainframes, some
data is stored on Oracle, some data on mainframes, some data in VSAM files, and the list goes on. If
one day the company requires a consolidated report of its assets, there are two ways to produce it.
ETL tools provide the facility to extract data from different non-coherent systems, cleanse it, merge
it and load it into target systems.
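The extract, cleanse, merge and load steps can be pictured with a small hand-written sketch. This is not how Informatica works internally; the sample data, column names and the `report` table are made up purely for illustration, and SQLite stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical extracts from two unrelated systems (layout is an assumption).
HR_CSV = "emp_id,name\n1, Alice \n2,Bob\n"
FINANCE_ROWS = [(1, 5000.0), (2, 4200.0)]

def extract_hr(text):
    """Extract: read rows from a flat-file export."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleanse: trim whitespace and convert ids to integers."""
    return [{"emp_id": int(r["emp_id"]), "name": r["name"].strip()} for r in rows]

def merge(hr_rows, finance_rows):
    """Merge: attach the finance amount to each HR row by employee id."""
    salary = dict(finance_rows)
    return [(r["emp_id"], r["name"], salary.get(r["emp_id"])) for r in hr_rows]

def load(conn, rows):
    """Load: write the consolidated rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS report (emp_id INTEGER, name TEXT, salary REAL)")
    conn.executemany("INSERT INTO report VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, merge(cleanse(extract_hr(HR_CSV)), FINANCE_ROWS))
result = conn.execute("SELECT * FROM report ORDER BY emp_id").fetchall()
```

An ETL tool replaces this kind of hand-coded glue with reusable, visually designed steps.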
What is Informatica?
Informatica is a tool that supports all the steps of the Extraction, Transformation and Load process. Nowadays
Informatica is also used as an integration tool.
Informatica is easy to use. It has a simple visual interface, like forms in Visual Basic. You just need to
drag and drop different objects (known as transformations) and design the process flow for data extraction,
transformation and load. These process flow diagrams are known as mappings. Once a mapping is made, it can be
scheduled to run as and when required. In the background, the Informatica server takes care of fetching data from
the source, transforming it, and loading it into the target systems/databases.
Informatica can communicate with all major data sources (mainframe/RDBMS/flat files/XML/VSAM/SAP etc.) and can
move and transform data between them. It can move huge volumes of data very effectively, often better
than even bespoke programs written specifically for that data movement. It can throttle transactions (do big updates
in small chunks to avoid long locking and filling the transaction log). It can effectively join data from two distinct
data sources (even an XML file can be joined with a relational table). In all, Informatica has the ability to
effectively integrate heterogeneous data sources and convert raw data into useful information.
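The idea of throttling transactions, doing a big update in small chunks, can be sketched by hand as follows. This only illustrates the technique, not Informatica's implementation; the `accounts` table, its columns and the chunk size are assumptions, and SQLite stands in for the target database.

```python
import sqlite3

def update_in_chunks(conn, ids, chunk_size):
    """Apply an update in small batches, committing after each batch."""
    commits = 0
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        conn.executemany(
            "UPDATE accounts SET status = 'done' WHERE id = ?",
            [(i,) for i in chunk],
        )
        conn.commit()  # short transactions hold locks only briefly
        commits += 1
    return commits

# Hypothetical target table with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, 'new')", [(i,) for i in range(1, 11)])
conn.commit()

commits = update_in_chunks(conn, list(range(1, 11)), chunk_size=4)
done = conn.execute("SELECT COUNT(*) FROM accounts WHERE status = 'done'").fetchone()[0]
```

Ten rows with a chunk size of four are updated in three short transactions instead of one long one.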
Before we actually start working in Informatica, let's get an idea of the company that owns this product.
In short, Informatica is the world's leading ETL tool, and it is rapidly gaining market share as an Enterprise
Integration Platform.
The Informatica ETL product, known as Informatica PowerCenter, consists of 3 main components.
1. Informatica PowerCenter Client Tools:
These are the development tools installed at the developer end. These tools enable a developer to
2. Informatica PowerCenter Repository:
The repository is the heart of the Informatica tools. It is a kind of data inventory where all the data related to
mappings, sources, targets etc. is kept. This is the place where all the metadata for your application is stored. All the
client tools and the Informatica Server fetch data from the repository. An Informatica client and server without a
repository are like a PC without memory or a hard disk: able to process data, but with no data to process. The
repository can be treated as the back end of Informatica.
3. Informatica PowerCenter Server:
The server is where all the executions take place. It makes
physical connections to sources and targets, fetches data, applies the
transformations defined in the mapping and loads the data into the
target system.
[Figure: Informatica PowerCenter Server between sources and targets. Sources and targets shown: legacy mainframes (DB2, VSAM, IMS, IDMS, Adabas), AS400 (DB2, flat file), remote sources and remote targets.]
This is sufficient knowledge to start with Informatica. So let's go straight to development in Informatica.
Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration
and ETL software.
Power Center
Power Mart
Power Exchange
Power Center Connect
Power Channel
Metadata Exchange
Power Analyzer
Super Glue
Power Center & Power Mart: Power Mart is a departmental version of Informatica for building, deploying, and
managing data warehouses and data marts. Power Center is used for a corporate enterprise data warehouse, and Power
Mart is used for departmental data warehouses such as data marts. Power Center supports global repositories and
networked repositories, and it can be connected to several sources. Power Mart supports a single repository and can
be connected to fewer sources than Power Center. Power Mart can grow into an enterprise
implementation, and its codeless environment helps developer productivity.
Power Exchange: Informatica Power Exchange, as a standalone service or along with Power Center, helps
organizations leverage data by avoiding manual coding of data extraction programs. Power Exchange supports
batch, real-time and changed data capture options for mainframe (DB2, VSAM, IMS etc.), mid-range (AS400 DB2
etc.) and relational databases (Oracle, SQL Server, DB2 etc.), and for flat files on Unix, Linux and Windows systems.
Power Center Connect: This is an add-on to Informatica Power Center. It helps to extract data and metadata from
ERP systems and other third-party applications such as IBM's MQSeries, PeopleSoft, SAP and Siebel.
Power Channel: This helps to transfer large amounts of encrypted and compressed data over LAN and WAN, through
firewalls, transfer files over FTP, etc.
Metadata Exchange: Metadata Exchange enables organizations to take advantage of the time and effort already
invested in defining data structures within their IT environment when used with Power Center. For example, an
organization may be using data modeling tools such as Erwin, Embarcadero, Oracle Designer or Sybase Power
Designer for developing data models. Functional and technical teams will have spent much time and effort in
creating the data model's data structures (tables, columns, data types, procedures, functions, triggers etc.). By using
Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings,
which saves time and effort. There is no need for the Informatica developer to create these data structures once
again.
Power Analyzer: Power Analyzer provides organizations with reporting facilities. It makes accessing,
analyzing, and sharing enterprise data simple and easily available to decision makers. It enables them to
gain insight into business processes and develop business intelligence.
With Power Analyzer, an organization can extract, filter, format, and analyze corporate information from data stored
in a data warehouse, data mart, operational data store, or other data storage models. Power Analyzer works best with a
dimensional data warehouse in a relational database. It can also run reports on data in any table in a relational
database that does not conform to the dimensional model.
Super Glue: Super Glue is used for loading metadata into a centralized place from several sources. Reports can be run
against Super Glue to analyze the metadata.
Note: This is not a complete tutorial on Informatica. To know more about Informatica, visit its official website,
www.informatica.com
Informatica Transformations
A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of
transformations that perform specific functions. For example, an Aggregator transformation performs calculations
on groups of data.
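Active Transformation
An active transformation can change the number of rows that pass through it, change the transaction
boundary, or change the row type.
Passive Transformation
A passive transformation does not change the number of rows that pass through it, maintains the transaction
boundary, and maintains the row type.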
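The row-count part of this distinction can be illustrated with two plain functions. This is only a sketch: the function names and sample rows are hypothetical and do not correspond to any Informatica API.

```python
def expression(rows):
    """Passive: one output row per input row (adds a full_name column)."""
    return [{**r, "full_name": r["first"] + " " + r["last"]} for r in rows]

def filter_rows(rows, city):
    """Active: may output fewer rows than it receives."""
    return [r for r in rows if r["city"] == city]

# Hypothetical input rows.
rows = [
    {"first": "Ann", "last": "Lee", "city": "New York"},
    {"first": "Raj", "last": "Rao", "city": "Boston"},
    {"first": "Eva", "last": "Kim", "city": "New York"},
]

passive_out = expression(rows)             # row count unchanged
active_out = filter_rows(rows, "New York")  # row count may shrink
```

Three rows go into both functions; the passive one returns three rows, while the active one returns only the two that satisfy the condition.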
The key point to note is that the Designer allows you to connect multiple transformations to the same downstream
transformation or transformation input group only if all transformations in the upstream branches are passive. The
transformation that originates the branch can be active or passive.
Connected Transformation
A connected transformation is connected to other transformations or directly to the target table in the mapping.
Unconnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is called within another
transformation, and returns a value to that transformation.
Informatica Transformations
Aggregator Transformation
Application Source Qualifier Transformation
Custom Transformation
Data Masking Transformation
Expression Transformation
External Procedure Transformation
Filter Transformation
HTTP Transformation
Input Transformation
Java Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Output Transformation
Rank Transformation
Reusable Transformation
Router Transformation
Sequence Generator Transformation
Sorter Transformation
Source Qualifier Transformation
SQL Transformation
Stored Procedure Transformation
Transaction Control Transformation
Union Transformation
Unstructured Data Transformation
Update Strategy Transformation
XML Generator Transformation
XML Parser Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Transformation
In the following pages, we will explain all the above Informatica transformations and their significance in the ETL
process in detail.
Aggregator Transformation
The Aggregator transformation performs aggregate functions such as average, sum and count on multiple rows or groups.
The Integration Service performs these calculations as it reads, and stores group and row data in an aggregate
cache. It is an Active & Connected transformation.
Difference between the Aggregator and Expression transformations: the Expression transformation lets you perform
calculations on a row-by-row basis only. In the Aggregator you can perform calculations on groups.
For example, an Aggregator transformation might have the ports State, State_Count, Previous_State and State_Counter.
Aggregate Expressions: are allowed only in Aggregator transformations. They can include conditional clauses and
non-aggregate functions. They can also include one aggregate function nested within another aggregate function.
Aggregate Functions: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM,
VARIANCE
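The difference between row-by-row and group calculations can be sketched in plain Python. The function names, the `state` and `price` columns and the discount rate are assumptions chosen for illustration; they are not Informatica objects.

```python
from collections import defaultdict

def expression_discount(rows, rate):
    """Row-by-row calculation: one result per input row."""
    return [{**r, "discounted": round(r["price"] * (1 - rate), 2)} for r in rows]

def aggregate_by_state(rows):
    """Group calculation: one result per group (SUM and COUNT per state)."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for r in rows:
        totals[r["state"]]["sum"] += r["price"]
        totals[r["state"]]["count"] += 1
    return dict(totals)

# Hypothetical input rows.
rows = [
    {"state": "NY", "price": 100.0},
    {"state": "NY", "price": 50.0},
    {"state": "CA", "price": 20.0},
]

per_row = expression_discount(rows, 0.10)  # three rows in, three rows out
per_group = aggregate_by_state(rows)       # three rows in, two groups out
```

The dictionary of running totals plays a role loosely comparable to the aggregate cache mentioned above.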
Custom Transformation
It works with procedures you create outside the Designer interface to extend PowerCenter functionality. It calls a
procedure from a shared library or DLL. It is active or passive, and connected.
You can use the Custom transformation to create transformations that require multiple input groups and multiple output groups.
The Custom transformation allows you to develop the transformation logic in a procedure. Some of the PowerCenter
transformations are built using the Custom transformation. Rules that apply to Custom transformations, such as
blocking rules, also apply to transformations built using Custom transformations. PowerCenter provides two sets of
functions, called generated functions and API functions. The Integration Service uses generated functions to interface with the
procedure. When you create a Custom transformation and generate the source code files, the Designer includes the
generated functions in the files. Use the API functions in the procedure code to develop the transformation logic.
Difference between the Custom and External Procedure transformations: in the Custom transformation, input and output
functions occur separately. The Integration Service passes the input data to the procedure using an input function. The output
function is a separate function that you must enter in the procedure code to pass output data to the Integration
Service. In contrast, in the External Procedure transformation, an external procedure function does both input and
output, and its parameters consist of all the ports of the transformation.
Expression Transformation
Passive & Connected. It is used to perform non-aggregate functions, i.e. to calculate values in a single row. Examples:
calculating the discount on each product, concatenating first and last names, or converting a date to a string field.
You can create an Expression transformation in the Transformation Developer or the Mapping Designer.
Components: Transformation, Ports, Properties, Metadata Extensions.
External Procedure
Passive & Connected or Unconnected. It works with procedures you create outside of the Designer interface to
extend PowerCenter functionality. You can create complex functions within a DLL or in the COM layer of Windows
and bind them to the External Procedure transformation. To get this kind of extensibility, use the Transformation Exchange
(TX) dynamic invocation interface built into PowerCenter. You must be an experienced programmer to use TX and
to use multi-threaded code in external procedures.
Filter Transformation
Active & Connected. It passes rows that meet the specified filter condition and removes the rows that do not meet
the condition. For example, to find all the employees who are working in New York, or to find all the faculty
members teaching Chemistry in a state. The input ports for the filter must come from a single transformation. You
cannot concatenate ports from more than one transformation into the Filter transformation. Components:
Transformation, Ports, Properties, Metadata Extensions.
HTTP Transformation
Java Transformation
Active or Passive & Connected. It provides a simple native programming interface to define transformation
functionality with the Java programming language. You can use the Java transformation to quickly define simple or
moderately complex transformation functionality without advanced knowledge of the Java programming language
or an external Java development environment.
Joiner Transformation
Active & Connected. It is used to join data from two related heterogeneous sources residing in different locations, or
to join data from the same source. In order to join two sources, there must be at least one pair of matching
columns between the sources, and you must specify one source as the master and the other as the detail. For example:
to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.
The Joiner transformation supports the following types of joins:
Normal
Normal join discards all the rows of data from the master and detail source that do not match, based on the
condition.
Master Outer
Master outer join discards all the unmatched rows from the master source and keeps all the rows from the
detail source and the matching rows from the master source.
Detail Outer
Detail outer join keeps all rows of data from the master source and the matching rows from the detail
source. It discards the unmatched rows from the detail source.
Full Outer
Full outer join keeps all rows of data from both the master and detail sources.
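The four join types can be compared with a small hand-written join. This is a sketch of the semantics described above, not Informatica code; the `join` function, its `join_type` names and the sample rows are assumptions, and unmatched rows simply lack the other side's columns rather than carrying NULLs.

```python
def join(master, detail, key, join_type="normal"):
    """Join two row lists on `key`.

    join_type is one of: normal, master_outer, detail_outer, full_outer.
    """
    master_keys = {m[key] for m in master}
    detail_keys = {d[key] for d in detail}
    out = []
    # Matching rows are kept by every join type.
    for d in detail:
        for m in master:
            if m[key] == d[key]:
                out.append({**m, **d})
    # Master outer and full outer keep the unmatched detail rows.
    if join_type in ("master_outer", "full_outer"):
        out += [dict(d) for d in detail if d[key] not in master_keys]
    # Detail outer and full outer keep the unmatched master rows.
    if join_type in ("detail_outer", "full_outer"):
        out += [dict(m) for m in master if m[key] not in detail_keys]
    return out

# Hypothetical sources: id 2 matches, id 1 is master-only, id 3 is detail-only.
master = [{"id": 1, "dept": "HR"}, {"id": 2, "dept": "Finance"}]
detail = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]

normal = join(master, detail, "id", "normal")
master_outer = join(master, detail, "id", "master_outer")
detail_outer = join(master, detail, "id", "detail_outer")
full_outer = join(master, detail, "id", "full_outer")
```

Note the naming: a master outer join keeps all detail rows, and a detail outer join keeps all master rows.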
Lookup Transformation
Passive & Connected or Unconnected. It is used to look up data in a flat file, relational table, view, or synonym. It
compares Lookup transformation ports (input ports) to the source column values based on the lookup condition. The
returned values can then be passed to other transformations. You can create a lookup definition from a source qualifier,
and you can also use multiple Lookup transformations in a mapping.
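A lookup on an equality condition can be pictured as a keyed search whose result is handed on to the next step. The sketch below assumes an in-memory lookup source; `build_cache`, `lookup` and the sample tables are hypothetical names, not part of Informatica.

```python
def build_cache(lookup_rows, condition_port):
    """Index the lookup source by the column used in the lookup condition."""
    return {row[condition_port]: row for row in lookup_rows}

def lookup(cache, input_value, return_port, default=None):
    """Return the requested column of the matching row, or a default."""
    match = cache.get(input_value)
    return match[return_port] if match is not None else default

# Hypothetical lookup source and pipeline rows.
departments = [{"dept_id": 10, "dept_name": "HR"}, {"dept_id": 20, "dept_name": "Finance"}]
employees = [{"name": "Ann", "dept_id": 10}, {"name": "Raj", "dept_id": 99}]

cache = build_cache(departments, "dept_id")
enriched = [{**e, "dept_name": lookup(cache, e["dept_id"], "dept_name")} for e in employees]
```

Every input row produces exactly one output row, with an empty value when no match is found.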
Normalizer Transformation
Active & Connected. The Normalizer transformation processes multiple-occurring columns or multiple-occurring
groups of columns in each source row and returns a row for each instance of the multiple-occurring data. It is used
mainly with COBOL sources where most of the time data is stored in de-normalized format.
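Turning multiple-occurring columns into rows can be sketched as follows. The `normalize` function, the `occurrence` column and the quarterly sales layout are assumptions used only to illustrate the idea.

```python
def normalize(row, key, repeating):
    """Return one output row per occurrence of the repeating columns."""
    return [
        {key: row[key], "occurrence": i, "value": row[col]}
        for i, col in enumerate(repeating, start=1)
    ]

# Hypothetical de-normalized source row: four quarterly sales columns.
source_row = {"store": "S1", "sales_q1": 10, "sales_q2": 20, "sales_q3": 30, "sales_q4": 40}

out = normalize(source_row, "store", ["sales_q1", "sales_q2", "sales_q3", "sales_q4"])
```

One source row with four occurrences yields four output rows, which is why the transformation is active.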
Rank Transformation
Active & Connected. It is used to select the top or bottom rank of data. You can use it to return the largest or
smallest numeric value in a port or group, or to return the strings at the top or the bottom of a session sort order. For
example, to select the top 10 regions by sales volume, or to select the 10 lowest-priced products. As
an active transformation, it might change the number of rows passed through it. For instance, if you pass 100 rows to the
Rank transformation but select to rank only the top 10 rows, only 10 rows pass from the Rank transformation to the next
transformation. You can connect ports from only one transformation to the Rank transformation. You can also create
local variables and write non-aggregate expressions.
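Selecting the top or bottom rows by one port can be sketched with the standard library. The `rank` function and the sample rows are hypothetical; only `heapq.nlargest` and `heapq.nsmallest` are real library calls.

```python
import heapq

def rank(rows, port, n, top=True):
    """Return the n rows with the largest (or smallest) value in `port`."""
    pick = heapq.nlargest if top else heapq.nsmallest
    return pick(n, rows, key=lambda r: r[port])

# Hypothetical input: 100 regions with sales 0..99.
rows = [{"region": f"R{i}", "sales": i} for i in range(100)]

top10 = rank(rows, "sales", 10)
bottom10 = rank(rows, "sales", 10, top=False)
```

One hundred rows go in and ten come out, matching the example in the text.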
Router Transformation
Sequence Generator Transformation
Passive & Connected transformation. It is used to create unique primary key values, cycle through a sequential
range of numbers, or replace missing primary keys.
It has two output ports: NEXTVAL and CURRVAL. You cannot edit or delete these ports. Likewise, you cannot
add ports to the transformation. The NEXTVAL port generates a sequence of numbers when you connect it to a
transformation or target. CURRVAL is NEXTVAL plus the Increment By value (NEXTVAL plus one when Increment By is 1).
You can make a Sequence Generator reusable, and use it in multiple mappings. You might reuse a Sequence
Generator when you perform multiple loads to a single target.
For non-reusable Sequence Generator transformations, Number of Cached Values is set to zero by default, and the
Integration Service does not cache values during the session. For non-reusable Sequence Generator transformations,
setting Number of Cached Values greater than zero can increase the number of times the Integration Service
accesses the repository during the session. It also causes sections of skipped values, since unused cached values are
discarded at the end of each session.
For reusable Sequence Generator transformations, you can reduce Number of Cached Values to minimize discarded
values; however, it must be greater than one. When you reduce the Number of Cached Values, you might increase
the number of times the Integration Service accesses the repository to cache values during the session.
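The relationship between NEXTVAL, CURRVAL and the Increment By value can be sketched with a generator. This is a simplified model, not the product's behavior: the `sequence` function and its parameters are hypothetical, caching is left out, and when the end value is passed without cycling the sketch simply stops.

```python
from itertools import islice

def sequence(start=1, increment_by=1, end=None, cycle=False):
    """Yield (NEXTVAL, CURRVAL) pairs, where CURRVAL = NEXTVAL + increment_by."""
    value = start
    while True:
        if end is not None and value > end:
            if not cycle:
                return  # assumption: stop quietly once the end value is passed
            value = start  # cycle back to the start of the range
        yield value, value + increment_by
        value += increment_by

# Cycling through the range 1..3.
cycled = list(islice(sequence(end=3, cycle=True), 5))

# Generating primary keys in steps of 10.
keys = [nextval for nextval, _ in islice(sequence(start=100, increment_by=10), 3)]
```

Here `cycled` walks 1, 2, 3 and then starts again at 1, while `keys` shows NEXTVAL being used on its own to assign key values.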
Sorter Transformation
Source Qualifier Transformation
Active & Connected transformation. When you add a relational or flat file source definition to a mapping, you need
to connect it to a Source Qualifier transformation. The Source Qualifier is used to join data originating from the
same source database, filter rows when the Integration Service reads source data, specify an outer join rather than
the default inner join, and specify sorted ports.
It is also used to select only distinct values from the source and to create a custom query that issues a special SELECT
statement for the Integration Service to read source data.
SQL Transformation
Active/Passive & Connected transformation. The SQL transformation processes SQL queries midstream in a
pipeline. You can insert, delete, update, and retrieve rows from a database. You can pass the database connection
information to the SQL transformation as input data at run time. The transformation processes external SQL scripts
or SQL queries that you create in an SQL editor. The SQL transformation processes the query and returns rows and
database errors.
Stored Procedure Transformation
Passive & Connected or Unconnected transformation. It is useful for automating time-consuming tasks, and it is also
used in error handling, to drop and recreate indexes, to determine the space in a database, to perform a specialized
calculation, etc. The stored procedure must exist in the database before you create a Stored Procedure transformation, and the
stored procedure can exist in a source, target, or any database with a valid connection to the Informatica Server.
A stored procedure is an executable script with SQL statements and control statements, user-defined variables and
conditional statements.
Transaction Control Transformation
Active & Connected. You can control commit and rollback of transactions based on a set of rows that pass through
a Transaction Control transformation. Transaction control can be defined within a mapping or within a session.
Components: Transformation, Ports, Properties, Metadata Extensions.
Union Transformation
Active & Connected. The Union transformation is a multiple input group transformation that you use to merge data
from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar
to the UNION ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION
ALL statement, the Union transformation does not remove duplicate rows.
Rules
1) You can create multiple input groups, but only one output group.
2) All input groups and the output group must have matching ports. The precision, datatype, and scale must be
identical across all groups.
3) The Union transformation does not remove duplicate rows. To remove duplicate rows, you must add another
transformation such as a Router or Filter transformation.
4) You cannot use a Sequence Generator or Update Strategy transformation upstream from a Union transformation.
5) The Union transformation does not generate transactions.
Components: Transformation tab, Properties tab, Groups tab, Group Ports tab.
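The UNION ALL behavior and the matching-ports rule can be sketched as follows. The `union` function and its explicit `ports` argument are assumptions for illustration, and only port names are checked here, not datatype, precision or scale.

```python
def union(ports, *input_groups):
    """Merge several input groups into one output group, keeping duplicates."""
    out = []
    for group in input_groups:
        for row in group:
            # Every input group must expose the same ports as the output group.
            if tuple(row) != tuple(ports):
                raise ValueError("all input groups must have matching ports")
            out.append(dict(row))
    return out

# Hypothetical input groups; one row appears in both.
group_a = [{"id": 1, "name": "x"}]
group_b = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]

merged = union(("id", "name"), group_a, group_b)
```

The duplicated row appears twice in `merged`, just as it would with UNION ALL.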
Update Strategy Transformation
Active & Connected transformation. It is used to update data in the target table, either to maintain a history of the data or
to keep only recent changes. It flags rows for insert, update, delete or reject within a mapping.
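Flagging rows can be sketched as a function that returns one of four codes per row. The constant names and numeric values follow Informatica's DD_INSERT, DD_UPDATE, DD_DELETE and DD_REJECT constants; the decision rules, the `flag_row` function and the sample rows are hypothetical.

```python
# Row-flag constants, named after the Informatica update strategy constants.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def flag_row(row, target_keys):
    """Decide what should happen to a row (illustrative rules only)."""
    if row.get("id") is None:
        return DD_REJECT  # assumption: rows without a key are rejected
    if row.get("deleted"):
        return DD_DELETE  # assumption: a 'deleted' marker column exists
    return DD_UPDATE if row["id"] in target_keys else DD_INSERT

# Hypothetical incoming rows and keys already present in the target.
rows = [
    {"id": 1, "amount": 5},
    {"id": 7, "amount": 9},
    {"id": None, "amount": 1},
    {"id": 2, "deleted": True},
]
target_keys = {1, 2}

flags = [flag_row(r, target_keys) for r in rows]
```

Each row is tagged independently, and the flag travels with the row to the target.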
XML Generator Transformation
Active & Connected transformation. It lets you create XML inside a pipeline. The XML Generator transformation
accepts data from multiple ports and writes XML through a single output port.
XML Parser Transformation
Active & Connected transformation. The XML Parser transformation lets you extract XML data from messaging
systems, such as TIBCO or MQ Series, and from other sources, such as files or databases. The XML Parser
transformation functionality is similar to the XML source functionality, except that it parses the XML in the pipeline.
XML Source Qualifier Transformation
Active & Connected transformation. The XML Source Qualifier is used only with an XML source definition. It
represents the data elements that the Informatica Server reads when it executes a session with XML sources. It has one
input or output port for every column in the XML source.
Advanced External Procedure Transformation
Active & Connected transformation. It operates in conjunction with procedures that are created outside of the
Designer interface to extend PowerCenter/PowerMart functionality. It is useful in creating external transformation
applications, such as sorting and aggregation, which require all input rows to be processed before emitting any
output rows.