Informatica Sequence Generation Techniquesv2
Informatica Sequence Generation Techniques Page 1 of 12
Dan Linstedt, 2005 All Rights Reserved
This document may not be reproduced in any form without express written consent of Dan Linstedt. Date: 7/23/2007
Informatica Sequence Generation Techniques Best Practices & Performance Architectures
Dan Linstedt, 2005, All rights reserved
Dan Linstedt, CTO/CIO, CoFounder of RapidACE. On the web at: http://www.RapidACE.com. Data Vault Data Modeling available at: http://www.DanLinstedt.com
Abstract: This document is based on batch loading techniques for scalability and performance of large batch sets of data. These techniques are not for OLTP or trickle-feed sequence generation. The trickle-feed and OLTP requirements are simple: use a trigger on the table to generate a new sequence, or use an identity column within the RDBMS. In this document we cover the different architectures for sequence loading, both within the RDBMS and within Informatica. We compare and contrast the different sequence loading styles, and discuss the pros and cons of each. We also present mapping structures and architectural diagrams to assist with the discussion. These techniques are version independent.
I now have a tool which generates mappings according to data models and templates. You can customize the templates as much as you want. Out of the box, the templates come pre-architected for best-of-breed performance, along with best-of-breed CDC and scalability. Keep in mind that the tool I've built consolidates disparate data models (multiple source systems) into a single common data model, and produces the mappings accordingly. You can find out more at: http://www.RapidACE.com

1.0 Introduction

Generating sequences (or dummy IDs) for RDBMS work is an important facet of database maneuvering. Sequences are used for everything from identifying unique rows (where the data itself is truly duplicated), to ordering data sets, to making joins more efficient (by replacing longer, more useful natural keys). Why? Because RDBMS engines operate on the principle that numerics are shorter than most natural keys (aka business keys). Numerical operations take less time to complete than character-based operations. Most natural keys are character based; therefore, when joining data in the database or enforcing constraints or uniqueness, the surrogate keys (sequence IDs) can help the database perform.
Assumptions:
- The map is responsible for generating sequences.
- Must resolve sequences within the mapping.
- Must be capable of running in parallel.
- Must be fully restartable.
- Must not be database dependent.
How can we make sequences run in parallel? How do we architect for external sequence numbers? Is it possible to architect a database neutral solution?
The real question is: How can we generate sequences without locking them into the repository?
What we'd truly like is for the RDBMS to hide the sequence keys. Maybe the RDBMS numbers everything internally (in Oracle this is called the ROWID; in SQL Server it is a UniqueIdentifier data type) and only exposes the sequences as read-only values on a row. Better yet, we'd like to believe that surrogate sequences are not needed at all, that natural or business keys are best and should be utilized throughout the system. However, that is not the reality.

First, before launching into the Sequence Generator techniques, let's set some ground rules.
1. There are people who care about holes in sequence numbers. Even though these sequences are meaningless and should never be seen by end-users, they often end up on the BI/SQL selection screen.
2. Restartability of sequences is an issue. There are techniques to deal with this; none of them are pretty, but they do work.
3. The architecture's main purpose should be efficient functionality; sequences should be secondary to that.
In this document, I will discuss the Informatica Sequence Generator object, its pros and cons, and discuss how it can or should be utilized (if at all), RDBMS sequence generation techniques, and their problems. We will then move on to discuss possible architectural solutions to these problems, and present the reader with options on how to solve these problems.
Problem / Description
Integer Based: 2.1 Billion upper limit.
Repository Based: Doesn't account for external sequence generators. Doesn't allow for restartability without changing the map.
Non-Sharable: Takes large chunks of sequences to parallelize, cache, and perform well.
Non-Restartable: Doesn't roll back to the last inserted number if the session dies.
Requires Caching: In order to be re-used, it requires caching, leaving holes in the sequence numbers.
2.0 The Informatica Sequence Generator Object

As you've read in the manuals, Informatica has a sequence generator object. It is quite nifty for some purposes, and not so good for others. Let's start the discussion by talking about its properties, all of which are stored in the repository after the successful or failed completion of every session run.

The properties are: Start Value, Increment By, End Value, Current Value, Cycle, Number of Cached Values, Reset, and Tracing Level. The values are all 32 bit integer based, which means they all have an upper limit of 2.1 Billion. Any time the current value reaches this hard limit it cycles back around to the start value, like it or not. Now, if I have a transactional system and I generate more than 2.1 Billion rows over the course of a year or six months, I could be in trouble if I use the Informatica Sequence Generator.

There's another issue with this particular generator: the current value (along with the rest) is stored in the repository tables, so if a session is run in Development, it sets the current value to Development increments. When the session is copied (using Repository Manager) from Development to Test, it offers you a choice: use the Test repository current value, or copy over the Development current value for this object. Prior to Informatica 7.0 this choice was not available.

If for some reason someone in Development accidentally resets the sequence numbers, the session will fail with duplicate key generation the next time it's run. Or if someone copies the wrong value from Development to Test, then it will fail again (in Test or Production) by producing duplicate keys. Manual processes can and do break the compliance of a sequence generator object.

Another issue with sequence generators: suppose the session has 20 million rows to move, commits 19 million successfully, and then for some reason dies. The session (even though it failed) will write the last used sequence back to the repository. Why does it do this? To maintain the integrity of the sequence generator. Because it ran through 19,999,900 values before dying, it writes this number back to the current value, thus leaving holes in the sequence numbering when it's restarted.
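To make the 2.1 Billion ceiling concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the daily row volume is a hypothetical figure, not one taken from this paper, and the limit used is simply the largest signed 32 bit integer.

```python
# Largest value a signed 32-bit integer can hold (the "2.1 Billion" limit).
INT32_MAX = 2**31 - 1


def days_until_exhausted(rows_per_day: int, current_value: int = 0) -> int:
    """Return how many full days of loading fit before the limit is reached."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    remaining = INT32_MAX - current_value
    return remaining // rows_per_day


if __name__ == "__main__":
    # Hypothetical volume: 12 million new rows per day.
    print(INT32_MAX)
    print(days_until_exhausted(12_000_000))
```

At that assumed volume the generator would run out of values in roughly six months.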
Let's talk performance. The Informatica Sequence Generator's performance is nothing to be proud of. It can be fast, but only by tuning the Number of Cached Values upward; the sweet spot is around 10,000 cached values. But this should only be done if:
1. the session is guaranteed not to crash,
2. the session doesn't load too many rows, which would cause the sequence to cycle back around,
3. the holes in the resulting sequence (from restart) are not of any consequence to management, OR
4. the session will be partitioned, and sequences across partitions are necessary.
Reusable sequence generators require cached values, and can slow performance down even more if the right number of cached values isn't chosen (around 10,000 to 20,000).
One more thing about sequence generators, look at the following screenshot from a mapping: what does it generate, one or two new sequence numbers?
NextVal and Informatica Sequence Generators
The answer may surprise you. This sequence generates 1 new value for EXPTRANS and another new value for EXPTRANS1. The reason has to do with the way Informatica passes data in blocks across objects. It takes 1 pass to populate 1 block for each expression, therefore 2 blocks; each requires its own sequence number. Not only will this mapping run slowly, but it will produce unwanted sequence numbers.
OK, so we've pretty much shot the sequence generator down, and we still need sequence numbers. So how do we do it? Before we get to the solutions, let's talk about RDBMS sequence generators.

3.0 RDBMS Sequence Generators

There are essentially two kinds of RDBMS Sequence Generators:
1. Independent objects (like Oracle SEQUENCE objects).
2. Embedded sequence columns, known as Identity Columns.
Identity columns can be found in SQL Server, Sybase, MySQL, Teradata, and DB2 UDB. Oracle has sequence objects that live outside the table, and these have their own issues.

Type 1: Independent Objects.

Independent sequence objects are quite interesting. They are built for, and work well in, a highly transactional environment; however, in a large batch environment, large data environment, or highly parallel environment they lead to all kinds of problems. The first is that, just like Informatica Sequence Generators, they have a start value, a max value (end value), and an increment value. The RDBMS caches a certain number of values in order to provide performance.

There are locking problems and usage problems. One of the most common problems is resetting the sequence object if the session that uses it fails and the cache has already increased the current value. In other words, suppose the number of cached values is 10,000, and the current value when we start the session is 10,000. Suppose our session fails at row 10,001. Because we performed a fetch from the sequence object, it has already cached out the next 10k values, so the next time the session is restarted, the sequence value will be 20,000. Every select from the sequence object is automatically committed. There is, however, a saving grace: in Oracle you can alter the sequence and reset the current value to what you think it should be, but that requires the appropriate RDBMS privileges.
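The cached-block behavior described above can be modeled with a toy Python class. This is a sketch of the effect only, not of how any particular RDBMS implements caching; the class and method names are hypothetical.

```python
class CachedSequence:
    """Toy model of a sequence object that hands out values in cached blocks."""

    def __init__(self, current_value: int, cache_size: int) -> None:
        self.persisted_value = current_value  # the value that survives a crash
        self.cache_size = cache_size
        self._next = current_value
        self._block_end = current_value  # empty cache: first fetch reserves a block

    def nextval(self) -> int:
        if self._next >= self._block_end:
            # Reserving a block advances the persisted value immediately;
            # this is not undone if the caller fails afterwards.
            self._next = self.persisted_value
            self._block_end = self.persisted_value + self.cache_size
            self.persisted_value = self._block_end
        self._next += 1
        return self._next

    def restart(self) -> None:
        """Simulate a failed session: unused cached values are discarded."""
        self._next = self.persisted_value
        self._block_end = self.persisted_value


seq = CachedSequence(current_value=10_000, cache_size=10_000)
first_key = seq.nextval()           # the only row loaded before the failure
seq.restart()                       # session dies and is restarted
key_after_restart = seq.nextval()   # numbering resumes after the lost block
```

In this model the first run uses a single value, yet the restarted session begins numbering after 20,000, so the rest of the first block becomes a hole.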
What are the ways we can use Oracle Sequence Generators within Informatica?
1. Select <sequence>.nextval, <our columns> from <our table>, read directly into a source qualifier. This is probably the fastest and most efficient, but it doesn't handle restartability or resetting of the sequence object.
2. Stored Procedure, connected, in-line for every row. This is one of the slowest and most inefficient ways to retrieve a sequence number from the database. It will cause undue headache, and can take performance down by a factor of 10x.
3. Lookup, uncached, connected. You'll have to put a view underneath it to select the sequence from dual, because Informatica won't understand the sequence object as a table. This is the second slowest way to pull sequences from the RDBMS and attach them to your rows. It can slow your mapping down by a factor of 5x to 7x.
4. Select rownum + max_targ.max_target_seq, <my fields> from <table>, (select max(seq) as max_target_seq from target) max_targ. This supports an insert-into-database operation. If you want to generate the ID in Informatica instead, read from the view with rownum and, during the run, add the MAX value from the target database to the rownum to get the current sequence value. This can be one of the fastest mechanisms to get sequences generated directly in the database.
5. Dynamic Lookup Caches, which increase the unique IDs with the unique rows that are inserted. However, this option is severely limited in performance, and is not recommended for parallelism, partitioning, or grid work.

So again, what is the solution here? The solutions are #4 and variants on #4. In other words, generating a row number in Informatica is cross-platform, and adding Max_SEQ in Informatica is also cross-platform. However, pulling RowNum + Max_SEQ in the select statement works only for Oracle. Believe it or not, pulling the rownum + max is faster than reading a sequence.nextval from the database.
Type 2: Identity Columns

Identity columns are embedded in the create table statements, for example:

CREATE TABLE X ( MySeq identity, myname varchar(20) );
They are used as a valid datatype in most cases. Sometimes they are specified as an attribute of the column or a constraint of the column. In any case, they provide a read-only column which increments whenever a new row is inserted. In these cases in Informatica we do NOT connect the identity column to sequence generated output. We let the RDBMS take care of that.
Are they fast? You bet. Are they parallel? Yes. Can you partition the table that contains an identity column and insert in parallel to multiple partitions? Yes. Are they recoverable? Yes. Do they leave holes in the database because of a failure? No. Do they work with Real-Time and large batch feeds? Yes. Is any special Informatica logic needed to work with these columns? No. Are they available in Oracle? No.
Loading Identity Columns in Informatica
Notice: Invoice_NO is not connected. In Sybase, SQLServer, DB2 UDB 8x, Teradata and MySQL you can use this method to load tables that contain identity columns. Sessions MUST be set to NORMAL to work. This mechanism will not work in Oracle.
Performance of this technique (even in NORMAL mode) can reach 80,000 to 120,000 rows per second, depending on the target database system chosen and the architecture of the mapping. BE AWARE FOR TERADATA: YOU CANNOT GET A SEQUENTIAL NUMBER FROM AN IDENTITY COLUMN!!! Because Teradata automatically splits everything into a parallel operation, identity columns provide you with guaranteed UNIQUE numbers, but not numbers ordered in sequence. This can be an issue (for some), but I will say this: even if you try for sequential numbers in Teradata, it isn't a good thing. Why? Because it limits the parallelism of the queries on return (not to mention that it makes the inserts single-strung). Queries load balance the work across the nodes, and like to return data according to what each node contains. There is no way to guarantee sequential order within a query statement except to sort (ORDER BY), but this has its drawbacks too: it can eat all the available spool space (TEMP area).
Use identities for fastest operations, just be aware that Teradata will not issue sequentially ordered numbers.
4.0 Architecture Issues with Sequence Generators

I've already discussed many of the technical issues with sequence generators; now let's discuss a few of the architectural issues that cause sequence generators to go awry. First, there's the restart: when a session fails, it must be able to be restarted, and it must (without editing) be able to pick up where it left off. Most sequence generators (except Identity columns) cannot recover.

Second, there's parallelism. If the target table is being loaded by both the Informatica process and one or more outside processes at the same time, then it has a parallel load situation. Identity columns handle this inside the two-phase commit locking that the RDBMS engine takes care of. Sequence generators are left in the cold, and are required to cache values in order to handle this circumstance. The higher the number of cached values, the less likely you are to see lock contention across the load processes.

Third, there's Informatica's own partitioning within a session. If you MUST have sequence numbers, but don't have outside processes loading at the same time as the Informatica session, then you must use Informatica's sequence generator. The partitioning mechanism shares the sequence generator object, much the same way Oracle shares its sequence generator object. Again, identity columns handle this without a hitch.

Now, are we ready to talk solutions? Yes. Here are the suggested best practices in the market today. Golden rule of thumb that's hard to swallow: use an RDBMS that allows Identity columns and doesn't force usage of a Sequence Object.

5.0 Solution Architecture Thoughts

In order to meet the goals (partitioning: possible; parallel contention loads: possible; real-time and batch high speed: yes; not limited to integer: possible), we must consider the architecture. The true solution, which is also recoverable and restartable, requires a 3 step process. Step 1 loads a staging table, Step 2 sets up sequence numbers, and Step 3 inserts/updates rows in the target. This is the only true way to get re-usable, restartable, cross-RDBMS sequence number solutions. That said, if I have an RDBMS with an identity column, I will use it and have my mapping done in a single step.
5.1 Solution 1: Max ID Technique

Requirements:
Parallel Contention Load: No
Partitioned in Informatica: Can be
Fast: Yes
Cross-Platform RDBMS: Yes
Restartable: Yes
Leaves Holes: No (except when used in a partitioned session in Informatica)

[Figure: Max ID mapping. Source -> SQ -> <other objects> -> Expression: Compute Next Sequence -> Target, with an Unconnected Lookup Max ID against the Target Table. The Max ID is cached once per run. The target can be Prod, Test, or Dev; the map is migratable.]

Expression:
V_ROW_ID = V_ROW_ID + 1
V_MAX_ID = IIF(V_MAX_ID = 0, :LKP.Max_ID(-1), V_MAX_ID)
Out_New_ID = IIF(ISNULL(V_MAX_ID), V_ROW_ID, V_MAX_ID + V_ROW_ID)

Sequence numbers in Informatica are better generated by lookups and expressions than by Sequence Generators. This method is portable, repeatable, and restartable. This method does not skip sequence numbers. This method works if used in a batch cycle where Informatica has control over the sequence number. Cross DB Method Best Practice, but NOT PARTITIONABLE!

Note: the MaxID lookup can be connected or unconnected; at this time it is preferable to run it as a connected lookup (see below).
How do we do it? For a single, non-partitioned session:
1. Set up a connected, in-stream Lookup MAX Value from the target table.
2. Set up a default value of ZERO for the looked up max sequence.
3. Override the SQL in the Lookup MAX: Select max(my_seq) as SEQ_ID from MyTable
4. Set up the match condition to be: SEQ_ID >= input_SEQ
5. Send in constant ZERO as the input_seq, and the output (lookup port).
6. Set up an expression downstream to be:
   v_rowcount = v_rowcount + 1
   o_NextVal = lkp_MAX_SEQ_ID + v_rowcount
7. Set v_rowcount and o_NextVal to the length of number / decimal desired (up to 64 bit access).
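The steps above can be sketched in Python, with sqlite3 standing in for the target database. This is a minimal illustration under stated assumptions, not an Informatica mapping: the table and column names come from the lookup override above, while the function names and sample rows are hypothetical.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple


def lookup_max_seq(conn: sqlite3.Connection) -> int:
    """Run the lookup override once; default to zero when the target is empty."""
    (max_seq,) = conn.execute("SELECT max(my_seq) AS SEQ_ID FROM MyTable").fetchone()
    return 0 if max_seq is None else max_seq


def number_rows(names: Iterable[str], max_seq: int) -> Iterator[Tuple[int, str]]:
    """Mirror the expression: v_rowcount = v_rowcount + 1, o_NextVal = max + v_rowcount."""
    v_rowcount = 0
    for name in names:
        v_rowcount += 1
        yield max_seq + v_rowcount, name


def load(conn: sqlite3.Connection, names: Iterable[str]) -> None:
    """One batch run: read the max ID once, then number and insert every row."""
    max_seq = lookup_max_seq(conn)
    conn.executemany(
        "INSERT INTO MyTable (my_seq, myname) VALUES (?, ?)",
        number_rows(names, max_seq),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (my_seq INTEGER PRIMARY KEY, myname TEXT)")
load(conn, ["a", "b", "c"])  # empty target: the lookup defaults to zero
load(conn, ["d", "e"])       # the next run continues from the stored max
keys = [row[0] for row in conn.execute("SELECT my_seq FROM MyTable ORDER BY my_seq")]
```

Because the starting point is read from the target at run time rather than from a stored repository value, the same logic works unchanged when it is pointed at a different target database.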
Why the connected lookup? Why not an unconnected lookup? Because connected lookups are faster than unconnected lookups, and we can setup a default value of zero in case the target has no rows and returns a null. Also, pushing all the data through the transformation keeps the lookup partitionable and keeps Informatica from passing too much data in too many blocks around the object.
What does this buy us? Speed, and the ability to copy the mapping from one repository to the next, or to change the target database connection at the session level, and it will still work. It's not dependent on any stored repository value. The lookup_MAX caches only one row and one column: the max ID.

There is another solution for working with Oracle Sequence Generators. Set up the source query to match the following:

SELECT rownum + maxnum.max_id, my_table.* FROM my_table, (SELECT max(seq_col) AS max_id FROM my_table) maxnum

This first gathers the max sequence number from the target table; then, as the rows are selected from the staging table, it numbers them by row number plus max_id. Again, this technique only works if you do NOT have any contending processes inserting into the target table at the same time. By the same token, if you wish to use Oracle Sequence number objects, you can change the select to:

SELECT MySeq.NextVal, my_table.* FROM My_table

This will achieve the same result as the above rownum query.

An overview is below (this one is shown with the unconnected lookup). I prefer the MAX_ID connected lookup solution proposed above.

[Figure: Unconnected Max ID lookup, called by an expression.]
SQL Override: SELECT max(id) AS id_out FROM <target>
Match: id >= input_id
The lookup returns NULL if there is no data in the target table; this must be dealt with in the expression:
v_max_id = IIF(v_max_id = 0, :LKP.max_lookup(-1), v_max_id)
O_new_id = IIF(ISNULL(v_max_id), v_row_seq, v_max_id + v_row_seq)
BENEFIT: Connects to any target specified by the session, finds the max ID value, and allows incrementing sequence numbers that stretch beyond integer format.
6.0 Solution 2: Max ID Parallel Sequence Generation

Requirements:
Parallel Contention Load: Doesn't work.
Partitioned in Informatica: No, but can be parallel sessions.
Fast: Yes
Cross-Platform RDBMS: Yes
Restartable: Yes
Leaves Holes: Depends on where it breaks, but the holes aren't usually very large.

How do we do this in parallel? Let's say we want three sessions to run in parallel.
1. Set up 3 mappings.
2. For mapping 1, set the v_rowcount default value to start at 1, and change the expression to: IIF( v_rowcount = 0, 1, v_rowcount + 3 )
3. For mapping 2, set v_rowcount to start at 2, and change the expression to: IIF( v_rowcount = 0, 2, v_rowcount + 3 )
4. For mapping 3, set v_rowcount to start at 3, and change the expression to: IIF( v_rowcount = 0, 3, v_rowcount + 3 )

Each of the sessions can now be run in parallel, and the sequences will not collide. This does NOT work if you then partition any of the sessions in the session partitioning tab.
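A short Python sketch shows why the three interleaved counters cannot collide. It assumes all three sessions read the same max ID from the target before any of them inserts; the function names and the starting max ID of 100 are hypothetical.

```python
from typing import Iterator, List


def session_row_counter(session_no: int, session_count: int) -> Iterator[int]:
    """Yield v_rowcount values: start at session_no, then step by session_count."""
    v_rowcount = 0
    while True:
        # Same shape as IIF(v_rowcount = 0, <start>, v_rowcount + 3).
        v_rowcount = session_no if v_rowcount == 0 else v_rowcount + session_count
        yield v_rowcount


def keys_for_session(
    session_no: int, session_count: int, max_id: int, row_count: int
) -> List[int]:
    """Return the keys one session would assign to row_count rows."""
    counter = session_row_counter(session_no, session_count)
    return [max_id + next(counter) for _ in range(row_count)]


# Three sessions, each loading four rows on top of an assumed max ID of 100.
all_keys = [keys_for_session(n, 3, max_id=100, row_count=4) for n in (1, 2, 3)]
flat = sorted(key for session_keys in all_keys for key in session_keys)
```

Here session 1 takes 101, 104, 107, 110, session 2 takes 102, 105, 108, 111, and session 3 takes the rest. If the sessions load different numbers of rows, the shorter ones leave their unused slots as holes.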
7.0 Solution 3: Informatica Shared Sequences for Partitioned Sessions

Requirements:
Parallel Contention Load: Cannot be used in contention with an outside process.
Partitioned in Informatica: Yes, can be.
Fast: Somewhat
Cross-Platform RDBMS: Yes
Restartable: Yes
Leaves Holes: Yes; how many depends on the number of values cached.

Create a single sequence generator. Set the cached values to 10,000 or more. Set the single session to be partitioned in the sessions tab, and you're done. Is it restartable? Yes. Will it leave holes? Yes.

8.0 Solution 4: Pre-Staged Sequence Generation

In this solution, it is possible to build sequences in the staging area. This bodes well for restartability in a two step process, but does not work well when outside processes are modifying the target table while we load, or inside our processing window. An example architecture for this type of sequence generation technique is below:

[Figure: Preset Staging Sequence Numbers. An Informatica expression (Compute Next Sequence: O_New_ID = O_New_ID + 1) inserts rows into the stage table with a Stage Seq of 1, 2, 3, 4, ... A stored procedure, P(x) Compute Target Sequences, reads the stage table and the DW target DB sequence numbers, and inserts rows of (Stg_Seq, DW_Seq, Flag):
1, 150002, Ins
2, 150003, Ins
3, 150015, Upd
4, 150060, DEL
The process to load the DW (a view, an Informatica map, or a stored procedure) joins the two tables and inserts or updates the DW. This last step can be partitioned / parallel.]

In this example, we set up each staging row with an incremental numeric column. The staging table in this case is not persistent. The next step is to use a stored procedure to scan through the staging table and add sequence numbers as it generates them from either the target or a database sequence generator object. The third step in this sequence is to run a bulk-insert process direct to target.
Of course, in this technique it is imperative that the sequences used by our stored procedure are not available to an outside process that is inserting rows direct to the target. Locking and using the sequences is important. This technique is best used with an external sequence object (like an Oracle sequence generator object).

Another alternative (without the sequence generator object) relies on a view in the second step. The view becomes the source for the second processing step. This technique can be used with Identity columns, and in conjunction with inserts from processes other than ours; however, it cannot be used in parallel unless it sources sequence numbers from a sequence number object.

[Figure: Pre-staged sequences using a view. An Informatica expression (Compute Next Sequence: O_New_ID = O_New_ID + 1) inserts rows into the stage table with a Stage Seq of 1, 2, 3, 4, ... A DB view, Gen New Numbers, selects from the stage table and reads the DW target DB sequence numbers. The process to load the DW (a view, an Informatica map, or a stored procedure) reads the view and inserts or updates the DW. This last step can be partitioned / parallel.]
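The second step of the pre-staged technique, where the stored procedure attaches target sequence numbers to staged rows, can be sketched in Python. This is a hedged illustration only: matching staged rows to the target by a business key is an assumption, as are the function name and the sample keys, and delete handling is left out.

```python
from typing import Dict, List, Tuple


def assign_target_sequences(
    staged_keys: List[str], target_seq_by_key: Dict[str, int]
) -> List[Tuple[int, int, str]]:
    """Return (Stg_Seq, DW_Seq, Flag) rows for a staged batch.

    staged_keys: business keys in staging order (assumed unique in the batch).
    target_seq_by_key: business key -> sequence already present in the target.
    """
    next_seq = max(target_seq_by_key.values(), default=0) + 1
    rows: List[Tuple[int, int, str]] = []
    for stg_seq, key in enumerate(staged_keys, start=1):
        if key in target_seq_by_key:
            # Existing row: reuse its sequence and flag it as an update.
            rows.append((stg_seq, target_seq_by_key[key], "Upd"))
        else:
            # New row: hand out the next sequence and flag it as an insert.
            rows.append((stg_seq, next_seq, "Ins"))
            next_seq += 1
    return rows


existing = {"cust-7": 150001, "cust-9": 150015}
plan = assign_target_sequences(["cust-20", "cust-21", "cust-9"], existing)
```

Because the assignment only reads the staging rows and the current target state, it can be rerun from scratch after a failure, as long as no outside process has inserted into the target in the meantime.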
Dan Linstedt, 2006-2007
Remember, I work in systems where a small load is considered 80 million rows (per mapping), a medium sized load is 150 M up to 350 M rows, and a large load is anything above that number. The algorithms and architectures designed here are built for maximum performance, flexibility, and scalability. I've used ALL these techniques for different reasons (except the dynamic lookup cache; I can't stand that one).