Utilities Which Can Be Used in DataStage: Advantages
In addition, you can use the Teradata (TD) API stage and the ODBC stage for extraction, loading, lookup, and manipulation of data. With IBM Information Server (DataStage 8.x, the latest version of DataStage), the long-awaited Teradata Connector for Teradata Parallel Transporter (TPT / Teradata PT) joined the TD stages fleet. There is more good news coming with IBM Information Server:
>> Supports TD stored procedures (see the example after this list)
>> Supports TD macros (see the example after this list)
>> Supports restart capability and reject links for bulk loads
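For illustration, statements like the following can now be issued from the stage (the database, procedure, and macro names here are hypothetical):
CALL SampleDB.Update_Customer(:cust_id, :cust_name); -- TD stored procedure
EXEC SampleDB.Load_Audit_Row(:job_name, :run_date); -- TD macro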
TD API as Source/Lookup/Target:
Uses the TD CLI native programming interface (API). This API lets network-attached systems connect and process SQL statements in the native TD environment. It submits whatever statements you define, exactly as you define them, on a per-incoming-row basis, with a COMMIT sent after each "Transaction Size" of rows.
Advantages:
--> Better performance and speed (rows / sec) over generic ODBC stage.
--> Support for TD client
--> Simplified configuration on UNIX platforms.
--> Advanced support for target table DDL (i.e., for CREATE and DROP)
--> Native metadata import support (Ability to import table metadata and store in DS repository)
--> Reject rows handling
Disadvantages:
--> Does not support non-ANSI SQL in stage-generated SQL statements.
--> Does not support byte data types.
--> Does not generate TD version-specific SQL as stage-generated SQL statements.
--> Does not support 'LIKE' in user-defined SQL when used as a lookup.
TD API also has the option to run in parallel mode. One can write/insert records using TD API in parallel mode, and it gives much better performance than running the same stage in sequential mode. When used in parallel mode, you must set the partitioning property to HASH and select all the columns in the Primary Index of the target table for partitioning. You might get occasional blocks due to Teradata hash synonyms, but this should be minimal on most low-volume loads. If you plan to implement a restart strategy in a job that uses TD API for loading data, it is advised to write as UPSERT, since there is no cleanup. The API will generate upsert code automatically, but it will be two SQL statements, not an atomic upsert. If you want an ATOMIC upsert, you have to take the auto-generated code and modify it slightly into the ATOMIC upsert syntax, as shown below.
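For illustration, assume a target SampleDB.Customer with primary index cust_id (names are hypothetical). The auto-generated, non-atomic form is two statements; the ATOMIC form folds them into one using Teradata's UPDATE ... ELSE INSERT syntax:
-- Auto-generated (non-atomic): the stage runs the UPDATE first and
-- issues the INSERT only for rows the UPDATE did not touch
UPDATE SampleDB.Customer SET cust_name = :cust_name WHERE cust_id = :cust_id;
INSERT INTO SampleDB.Customer (cust_id, cust_name) VALUES (:cust_id, :cust_name);
-- Hand-modified ATOMIC upsert: one statement, one round trip
UPDATE SampleDB.Customer SET cust_name = :cust_name WHERE cust_id = :cust_id
ELSE INSERT INTO SampleDB.Customer (cust_id, cust_name) VALUES (:cust_id, :cust_name);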
Note: Upsert strategy - If most incoming rows already exist in the target table, it is faster to update first (update existing, else insert new) than to insert first (insert new, else update existing); if most rows are new, the reverse holds.
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - No
-->If Yes, Utility used - NA
-->Parallel Features of DataStage Supported - Yes (Conditional)
-->Runs in sequential or Parallel Mode - Both
-->Reject rows handling inside DataStage - Yes
-->Before and After SQL execution facility - Yes
-->Lock strategy (Row/Page/Table) - Row
-->Uses Temp/Work tables - No
-->Uses Error Tables - No
-->Ability to control job based on rows rejected to error table - No
-->Ability to write output to files - No
-->Uses named pipes - No
-->Check Point restart support - No
-->Direct loading support - No (No TD Utility used)
-->Can be used as look up stage? - Yes
-->Can be used for sparse lookup - No
Tips:
* For best performance, TD client should NOT be the same machine where TD server is installed.
* Take note of the timestamp format used for TD - "%yyyy-%mm-%dd %hh:%nn:%ss" - while doing date conversions.
* The ODBC stage is the only stage that allows you to do a sparse lookup on Teradata tables.
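As a hedged sketch of such a sparse lookup (table and column names assumed), the user-defined SELECT in the ODBC stage references the incoming key with the ORCHESTRATE. prefix and is executed once per input row:
SELECT cust_name, cust_status
FROM SampleDB.Customer
WHERE cust_id = ORCHESTRATE.cust_id;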
TD EE as Source/Lookup/Target:
When used as a source, it calls FastExport (a TD utility); when used as a target, it calls FastLoad (a TD utility). The number of sessions created is governed by the RequestedSessions and SessionsPerPlayer options in the stage.
TD EE (Source/Lookup) invokes FastExport, which produces one answer set. All the parallel processing is performed on the TD server, not in DataStage. FastExport cannot handle SELECTs that use a unique-key constraint and would return only one row, so you have to be careful with unique key constraints; this is set on the TD side, not by DataStage. If you use this stage two or three times for source tables/lookups in a job and run a couple of such jobs at once, you have invoked too many FastExports. For explicit exports or big data pulls, TD EE works fine, or you can use the MLOAD stage (which runs FastExport when used as a source stage).
TD EE (Target) invokes FastLoad for bulk loading into TD tables. FastLoad does not support secondary indexes in Teradata. The TD EE stage will create a work table when an append to a Teradata table is selected for the job, because FastLoad must load into an empty table, which is impossible with an append operation. To get around this, DataStage FastLoads into a generated empty work table and then does an insert (select * from the work table) on the database side. The append option creates an empty TD table with ALL fields but NO defaults. It generates this empty table from the Teradata metadata, NOT your DataStage job metadata. Also, unknown columns are replaced with null.
Ex:
Incoming columns are col1, col2, col3
Target table columns are col1, col2, col3, col4, col5 with col4 default value as 0 and col5 as
current_timestamp.
Step 1:
An empty work table ORCH_WORK_xxxxx is created with all five target columns (col1 through col5) but without the default values.
Step 2:
Incoming records are loaded into ORCH_WORK_xxxxx with values col1, col2, col3, null, null:
INSERT INTO ORCH_WORK_xxxxx VALUES (:col1, :col2, :col3, NULL, NULL)
Step 3:
Append using Insert command into Target table:
INSERT INTO TargetTable SELECT * FROM ORCH_WORK_xxxxx
Caution: Step 3 will fail if col4 and col5 (or either of them) are set as NOT NULL in the target table. To avoid this, you need to pass col4 and col5 with default values inside the job itself, as sketched below.
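A minimal sketch of the safe pattern, reusing the column names from the example above: derive col4 and col5 inside the job (e.g., in a Transformer) so the work-table load carries real values instead of NULLs, and Step 3 then succeeds even with NOT NULL constraints:
-- col4 and col5 are now populated by the job (e.g., 0 and CURRENT_TIMESTAMP)
INSERT INTO ORCH_WORK_xxxxx VALUES (:col1, :col2, :col3, :col4, :col5)
INSERT INTO TargetTable SELECT * FROM ORCH_WORK_xxxxx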
FastLoad creates two error tables as part of "BEGIN LOADING", first executing DROP TABLE statements for any leftover tables with the same names. FastLoad drops these two tables if they are empty as part of the "END LOADING" transaction. In case of a problem (a duplicate primary index key, for example), error table number 2 must be removed manually. The names of the two tables are ERR_cookie_1 and ERR_cookie_2; the cookie value can be found in the terasync table, where the start time and end time (integer) fields help you find the most recently inserted entries. You cannot modify these error table names.
FastLoad delivers very high performance with only two constraints: duplicate rows are silently dropped, and error detection/correction is weak.
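For manual cleanup after a failed FastLoad, a sketch like the following helps (the cookie value 12345 is hypothetical; look the real one up in the terasync table as described above):
-- Inspect what was rejected during the acquisition phase
SELECT ErrorCode, ErrorFieldName FROM ERR_12345_1;
-- Drop the leftover error tables so the load can be rerun
DROP TABLE ERR_12345_1;
DROP TABLE ERR_12345_2;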
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - Yes
-->If Yes, Utility used - FastExport and FastLoad
-->Parallel Features of DataStage Supported - Yes
-->Runs in sequential or Parallel Mode - Parallel
-->Reject rows handling inside DataStage - No (7.x) / Yes (8.x)
-->Lock strategy (Row/Page/Table) - Table
-->Uses Temp/Work tables - Yes
-->Uses Error Tables - Yes
-->Ability to control job based on rows rejected to error table - Yes
-->Ability to write output to files - Yes
-->Uses named pipes - Yes
-->Any related APT or DS Parameters that can change/control the functionality - Yes
-->Direct loading support - Yes (TD Utility used)
-->Can be used as lookup stage - Yes
-->Before and after sql (open and close command) option available - Yes
-->Can be used for sparse lookup - No
Tips:
1) The TD EE stage creates a special terasync table in the source database, and if you don't have create and write privileges in that database, you will encounter an error. The fix is to point the TD EE stage to write the terasync table somewhere else in the database where you have sufficient privileges.
2) There is a relation between SessionsPerPlayer, the number of nodes, and the resulting Teradata sessions generated. For example, with SessionsPerPlayer=2 you will have 16 Teradata sessions on a 32-AMP Teradata system. RequestedSessions is completely independent of the number of nodes in the DataStage configuration file; it depends on the number of vprocs or AMPs in the Teradata system. A 168-AMP system will create 84 sessions using the defaults (SessionsPerPlayer = 2) regardless of the EE configuration file. In a high-volume environment, try to balance performance and Teradata sessions by tuning SessionsPerPlayer.
3) If TENACITY is not set with a timeout limit, a process that cannot get a Teradata session will abort immediately.
4) Restart on TD EE works basically by starting over from the beginning. It also does not clean up the old FastLoad work tables, so an insert strategy should include steps to handle this scenario.
5) If you have a large TD instance (a lot of vprocs), set RequestedSessions to something more manageable to keep repartitioning to a minimum. Repartitioning the data is expensive and can cause various resource issues. Set the right RequestedSessions in the Advanced connection options in the stage to keep this to a minimum. The TD operator uses two values to determine how many players to start: RequestedSessions and SessionsPerPlayer. By default, if no value is set for RequestedSessions, it looks at the number of vprocs used for the table, and SessionsPerPlayer defaults to 2 (one read session and one write session).
6) For some actions, the Teradata Enterprise stage lets you put in open and close commands that run at the start or end of the DataStage job (see the sketch below).
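As a hedged illustration (table and column names assumed), typical open and close commands are plain SQL run on the database before and after the load:
DELETE FROM SampleDB.Customer ALL; -- open command: empty the target before the FastLoad
COLLECT STATISTICS ON SampleDB.Customer COLUMN (cust_id); -- close command: refresh stats after the load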
TD Mload as Source/Lookup/Target:
MultiLoad (as target) is very efficient when you are doing maintenance activities on multiple large tables. The MultiLoad stage can perform inserts/updates on up to five different tables in one pass. Work tables and error tables are created each time you perform an operation using MultiLoad as well. They are automatically dropped once the job has run successfully; however, if the job aborts, the work tables have to be manually dropped before the job is run again (see the cleanup sketch below).
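A minimal cleanup sketch after an aborted run, assuming the common WT_/ET_/UV_ naming convention for the work and error tables (the actual stage-generated names may differ):
RELEASE MLOAD SampleDB.Customer; -- free the MultiLoad lock left on the target
DROP TABLE SampleDB.WT_Customer; -- work table
DROP TABLE SampleDB.ET_Customer; -- acquisition-phase error table
DROP TABLE SampleDB.UV_Customer; -- application-phase (uniqueness violation) error table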
Invoke MultiLoad - MLoad is invoked automatically when the job runs. The stage creates named pipes to transmit data to MLoad and then starts the MLoad process. The stage allows 720 seconds (12 minutes) to write to the named pipe (i.e., the acquisition phase) and then kicks off the load process; otherwise it fails the job. MultiLoad places locks on the entire table.
You can change this phase time by changing:
DS_TDM_PIPE_OPEN_TIMEOUT
Manual - Data is stored in a .dat file with the name you specify in the stage. You can then execute an MLoad script independent of the job, pointing it at this .dat file.
TD TPUMP as Source/Lookup/Target:
TPump is a highly parallel utility designed to continuously move data from data sources into Teradata. TPump is typically used for loading a small quantity of records in relation to the size of your target table. TPump works at the row level, whereas MultiLoad and FastLoad update whole blocks of data. TPump allows us to load data into tables with referential integrity, which MultiLoad does not. TPump only needs to take row-level locks; in other words, TPump only places a lock upon the row it is modifying, whereas MultiLoad places locks on the entire table. If you need multiple processes updating a table simultaneously, TPump may be the better solution. TPump also uses fewer system resources, so you can run it concurrently with user queries without impacting system performance. TPump lets you control the rate at which updates are applied: you can dynamically "throttle down" when the system is busy and throttle up when activity is lighter.
TPump does not take a utility slot. It is designed for "trickle feed", taking individual row-level locks. If you use TPump, make sure you follow normal TPump standards (set the KEY statements equal to the PI and turn SERIALIZE on). TPump is typically for processes that are constantly retrieving and processing data, such as from a message queue. Using TPump offers a controllable transition to updating that is closer to real time; it is the best fit for "active", "real-time", or "closed-loop" data warehousing.
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - Yes (TPump; it does not take a utility slot)
-->If Yes, Utility used - TPump
-->Parallel Features of DataStage Supported - Yes
-->Direct loading support - No
-->Can be used for sparse lookup - No
Tips:
* Feeding clean data from your transformation processes into TPump is important for overall performance; data with errors makes TPump slow (roughly +40% runtime). Fallback adds roughly +45% runtime, SERIALIZE adds about 30%, and ROBUST adds about 15%.
* Ask your DBA to assign the TPump user to a higher priority performance group when the TPump job
runs at the same time as decision support queries, if the TPump completion time is more critical than the
other work active in the system.
* It uses time based checkpoints not count based.
* Does not support MULTISET tables.
* Can have Secondary Indexes and RI on tables.
TD TPT (Teradata Parallel Transporter) as Source/Lookup/Target:
Main features:
--> All-in-one! Single infrastructure for all loading/unloading needs using single scripting language.
--> Greatly reduces the amount of file I/O and significantly improves performance.
--> Push up and Push down features.
--> Provides unlimited symbolic substitution for the script language and application programming interface
(API).
--> Combines functionality of TD FastLoad, MultiLoad, FastExport, TPump and API.
--> The respective modules (operators) for the FastLoad, MultiLoad, TPump, and FastExport protocols are named Load, Update, Stream, and Export.
--> Apart from those four operators, on the API & ODBC front there are operators like Selector, Inserter, and more.
The operators run in one of two modes:
1) Immediate - For the Selector, Inserter, etc. operators. Same as the CLI interface/API (no load/unload utilities).
2) Bulk - For the Load, Update, Stream, and Export operators (i.e., FastLoad, MultiLoad, TPump, and FastExport respectively).
Remember to install this utility (TPT) on the DataStage server in order to use the range of operators it supports (the same applies to most of the other TD utilities).