Recovered DataStage Tip
This example shows a typical configuration file. Pools can be applied to nodes or to
other resources; note the curly braces following some disk resources.
Following the keyword "node" is the name of the node (logical processing unit).
The order of resources is significant: the first disk is used before the second, and so
on. Pool names such as "sort" and "bigdata", when used, restrict the associated
processing to the resources carrying those labels. For example, "sort" restricts sorting
to the node pools and scratchdisk resources labeled "sort".
Database resources (not shown here) can also be created to restrict database access
to certain nodes. Question: can operators be constrained to CPUs? No; a request is
made to the operating system and the operating system chooses the CPU.
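Since the original example did not survive the recovery, here is a minimal sketch of what such a configuration file typically looks like; the host name, paths and pool names are illustrative assumptions, not values from the original.

   {
     node "node1"
     {
       fastname "etlhost"
       pools ""
       resource disk "/data/disk1" {pools ""}
       resource scratchdisk "/scratch/disk1" {pools "" "sort"}
     }
     node "node2"
     {
       fastname "etlhost"
       pools "" "bigdata"
       resource disk "/data/disk2" {pools ""}
       resource scratchdisk "/scratch/disk2" {pools "" "sort"}
     }
   }

Here both logical nodes map to the same physical host; the scratchdisk entries also belong to the "sort" pool, which is what the text above means by restricting sorting to labelled resources.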
2. What is APT_CONFIG_FILE?
APT_CONFIG_FILE is the environment variable through which DataStage determines
which configuration file to use (a project can have many configuration files). In fact,
this is what is generally used in production. If this environment variable is not
defined, how does DataStage determine which file to use? It falls back to the default
configuration file, default.apt, in the project's Configurations directory.
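The variable is usually set project-wide in the Administrator or per job as a job parameter; a quick way to check or override it from a shell session is shown below (the path is only an illustrative assumption):

   $ echo $APT_CONFIG_FILE
   $ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt   # hypothetical path
   $ orchadmin check                                                                     # validates the file named by APT_CONFIG_FILE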
Parallel Job execution:
The conductor node runs the start-up process: it creates the score and starts up the
section leaders.
Every player has to be able to communicate with every other player. There are
separate communication channels (pathways) for control, messages, errors, and data.
Note that the data channel does not go through the section leader/conductor, as this
would limit scalability. Data flows directly from upstream operator to downstream
operator using the APT Communicator class.
4. Explain the Run Time Architecture of Enterprise Edition
Enterprise Edition Job Startup
Generated OSH and the configuration file are used to "compose" a job "Score"
• Think of “Score” as in musical score, not game score
• Similar to the way an RDBMS builds a query optimization plan
• Identifies degree of parallelism and node assignments for each operator
• Inserts sorts and partitioners as needed to ensure correct results
• Defines connection topology (virtual datasets) between adjacent operators
• Inserts buffer operators to prevent deadlocks (e.g., in fork-joins)
• Defines number of actual OS processes
• Where possible, multiple operators are combined within a single OS process
to improve performance and optimize resource requirements
• Job Score is used to fork processes with communication interconnects for
data, message, and control
• Set $APT_STARTUP_STATUS to show each step of job startup
• Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log
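In practice these are just environment variables; for example, from a shell (setting them in dsenv, in the Administrator or as job parameters works equally well – setting them to 1 or True is the usual convention):

   $ export APT_STARTUP_STATUS=1    # log each step of job startup
   $ export APT_PM_SHOW_PIDS=1      # show player process IDs in the DataStage log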
Enterprise Edition Runtime
It is only after the job Score and processes are created that processing begins; this is
the "startup overhead" of an EE job.
Note that you do not see the word "Score" anywhere in the job log unless you ask for
it – set $APT_DUMP_SCORE to have the generated Score dumped to the log.
Sequential files, unlike most other data sources, do not have inherent column
definitions, and so DataStage cannot always tell where there are extra columns that
need propagating. You can only use RCP on sequential files if you have used the
Schema File property to specify a schema which describes all the columns in the
sequential file. You need to specify the same schema file for any similar stages in the
job where you want to propagate columns. A sample schema file is sketched after the
list below.
Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export
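As a concrete illustration, a schema file is just a plain text file containing a record definition in Orchestrate schema syntax; the column names below are made up for the example:

   record
   (
     custid: int32;
     custname: string[max=30];
     balance: decimal[10,2];
   )

Point the Schema File property of each of the stages listed above at the same file and enable RCP on the output link.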
8. What are the different options a logical node can have in the configuration file?
a. fastname – The fastname is the physical node name that stages use to open
connections for high volume data transfers. The attribute of this option is usually the
network name. Typically, you can get this name by using the UNIX command
'uname -n'.
b. pools – Names of the pools to which the node is assigned. Based on the
characteristics of the processing nodes you can group nodes into sets of pools.
A pool can be associated with many nodes and a node can be part of many
pools.
A node belongs to the default pool unless you explicitly specify a pools list for it
and omit the default pool name ("") from the list.
A parallel job, or a specific stage in the parallel job, can be constrained to run on a
pool (a set of processing nodes).
If both the job and a stage within the job are constrained to run on specific
processing nodes, the stage runs on the nodes that are common to both the stage
and the job constraints.
a. If a job or stage is not constrained to run on specific nodes, the parallel engine
executes a parallel stage on all nodes defined in the default node pool. (Default
behavior)
b. If the node is constrained, the constrained processing nodes are chosen while
executing the parallel stage. (Refer to 2.2.3 for more detail.)
c. When configuring an MPP, you specify the physical node in your system from
which the parallel engine will run your parallel jobs. This is called the conductor
node. For the other nodes, you do not need to specify the physical node. Also, you
need to copy the (.apt) configuration file only to the nodes from which you start
parallel engine applications. It is possible that the conductor node is not connected
to the high-speed network switches, while the other nodes are connected to each
other through very high-speed network switches.
10. How do you configure your system so that you will be able to achieve
optimized parallelism?
a. Make sure that none of the stages is specified to run on the conductor node.
b. Make sure that the conductor node is not part of the default pool.
A configuration sketch illustrating this is shown below.
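Extending the earlier sketch, a conductor entry whose pools list omits the default pool name "" could look like this (host name, paths and the "conductor" pool name are illustrative assumptions):

   node "conductor"
   {
     fastname "etlhost"
     pools "conductor"
     resource disk "/data/conductor" {pools ""}
     resource scratchdisk "/scratch/conductor" {pools ""}
   }

Because this node's pools list does not include "", ordinary stages are not scheduled on it by default, while it remains available to run the conductor process itself.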
11. Why is maximum parallelization not necessarily the optimal parallelization?
a. DataStage creates one process per stage for each processing node. Hence, if the
hardware resources are not available to support the maximum parallelization, the
performance of the overall system goes down. For example, suppose we have an
SMP system with three CPUs and a parallel job with 4 stages, and we define 3
logical nodes (one corresponding to each physical node, i.e. CPU). DataStage will
then start 3 * 4 = 12 processes, which have to be managed by a single operating
system, and significant time will be spent in context switching and process
scheduling.
b. Since we can have different logical processing nodes, it is possible that some
nodes will be more suitable for some stages while other nodes will be more suitable
for other stages.
12. So, when do you decide which node will be suitable for which stage?
a. If a stage is performing a memory-intensive task then it should be run on a node
which has more memory and scratch disk space available for it. E.g. sorting data is
a memory- and scratch-disk-intensive task and should be run on such nodes.
b. If a stage depends on a licensed version of software (e.g. SAS stage, RDBMS-
related stages, etc.) then you need to associate those stages with the processing
node which is physically mapped to the machine on which the licensed software is
installed. (Assumption: the machine on which the licensed software is installed is
connected to the other machines through a high-speed network.)
c. If a job contains stages which exchange large amounts of data then they should be
assigned to nodes where the stages communicate either by shared memory (SMP)
or by a high-speed link (MPP) in the most optimized manner.
d. Basically, nodes are nothing but a set of machines (especially in MPP systems).
You start the execution of parallel jobs from the conductor node. The conductor
node creates a shell on the remote machines (depending on the processing nodes)
and copies the same environment onto them. However, it is possible to create a
startup script which will selectively change the environment on a specific node.
This script has the default name startup.apt. However, like the main configuration
file, we can also have many startup scripts; the appropriate one can be picked up
using the environment variable APT_STARTUP_SCRIPT.
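As a rough sketch (the path, script name and exported variable are illustrative assumptions; check the template shipped in $APT_ORCHHOME/etc for the exact conventions a startup script must follow, in particular how it hands control back to the engine):

   $ export APT_STARTUP_SCRIPT=/opt/etl/scripts/sas_node_startup.apt

   #!/bin/sh
   # sas_node_startup.apt – hypothetical per-node environment tweak
   export SAS_HOME=/usr/local/sas   # node-specific setting, assumed for the example
   exec "$@"                        # assumed hand-off convention; verify against the shipped startup.apt template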
14. What are the generic things one must follow while creating a configuration file
so that optimal parallelization can be achieved?
a. Consider avoiding the disk(s) that your input files reside on.
b. Ensure that the different file systems mentioned as the disk and scratchdisk
resources hit disjoint sets of spindles, even if they are located on a RAID (Redundant
Array of Inexpensive Disks) system.
15. Know what is real and what is NFS:
a. Real disks are directly attached, or are reachable over a SAN (storage-area
network – dedicated, just for storage, low-level protocols).
b. Never use NFS file systems for scratchdisk resources; remember that scratchdisks
are used for temporary storage of files/data during processing.
c. If you use NFS file system space for disk resources, then you need to know what
you are doing. For example, your final result files may need to be written out onto
the NFS disk area, but that doesn't mean the intermediate data sets created and used
temporarily in a multi-job sequence should use this NFS disk area. It is better to set
up a "final" disk pool and constrain the result sequential file or data set to reside
there, but let intermediate storage go to local or SAN resources, not NFS.
• For example, suppose we have 2000 products and 2000 manufacturers, held in
two separate tables, and the requirement says that we need to join these tables and
create a new set of records. By brute force, it will take 2000 x 2000 / 2 = 2,000,000
steps. Now, if we divide this into 4 partitions, it will still take 4 * 500 * 2000 / 2 =
2,000,000 steps in total; but assuming the number of steps is directly proportional to
the time taken and we have sufficient computing power to run 4 separate instances
of the join, we can complete the task in effectively 1/4 of the time. This is what we
call the power of partitioning.
There are mainly 8 different kinds of partitioning supported by Enterprise Edition (I
have excluded DB2 from the list for the moment). Which partitioning mechanism to
use depends completely on what kind of data distribution we have or are going to
have. Do we need to keep related data together, or doesn't it matter? Do we need to
look at the complete dataset at a time, or is it fine to work on a subset of the data?
• The Funnel stage can run in parallel as well as sequential mode. In parallel mode
the different input datasets to the Funnel stage are partitioned and processed on
different processing nodes; the processed partitions are then collected and funneled
into the final output dataset. This can be controlled from the partitioning tab of the
Funnel stage. In sequential mode, if the Funnel stage is in the middle of the job
design, it first collects all the data from the different partitions and then funnels all
the incoming datasets.
Of course, the metadata for all the incoming inputs to the Funnel stage must be the
same for the Funnel stage to work. However, the Funnel stage does allow you to
specify (through the mapping tab) how output columns are derived from the input
columns. This is something a simple collection doesn't do.
• Remember that the metadata needs to be the same for all the input links. So only
one set of input columns is shown, and that too is read-only.
• Continuous Funnel
1. Takes one record from each input link in turn. If data is not available on an
input link, the stage skips to the next link rather than waiting.
• Sort Funnel
1. Based on the key columns, this method sorts the data from the different
sorted datasets into a single sorted dataset.
2. Typically all the input datasets to the Funnel stage are hash partitioned
before they are sorted.
3. If the data is not yet partitioned (by a previous stage) then partition the data
using hash or modulus partitioning. (This can be done from the partitioning
tab.) All the input datasets must be sorted on the same sort key.
4. Combines the input records in the order defined by the value(s) of one or
more key columns; the order of the output records is determined by these
sorting keys.
5. Produces a sorted output (assuming input links are all sorted on the key).
6. Data from all input links must be sorted on the same key column.
• Sequence Funnel
1. Similar to an Ordered collection. You can choose the order (through link
ordering) in which data from the different links will be funneled.
2. Copies all records from the first input link to the output link, then all the
records from the second input link, and so on.
o Null handling
o String handling
19. Describe the inputs to partitioning and re-partitioning in an MPP/cluster
environment.
The inputs to partitioning are:
o The number of processing nodes available to the stage, and the properties
specified on the partitioning tab of that particular stage.
The inputs to repartitioning are:
o Again, the number of processing nodes available to the current stage. If the
number of processing nodes for the current stage is different from (especially less
than) the previous stages, then DataStage decides to repartition the data.
o A different partition type being used by the current stage than the one used for
the previous stages.
o A data requirement such that data from different partitions needs to be re-
grouped (even though the same partition type is used!!).
For better performance of your job, you must try to avoid re-partitioning as
much as possible. Remember that repartitioning involves two steps (which may be
unnecessary and taxing); hence understanding where, when and why re-partitioning
occurs in a job is necessary to be able to improve the performance of the job.
o First, it collects all the data being processed by the different processing nodes.
o Then, it redistributes the data according to the new partitioning.
Each parallel stage in a job can partition or repartition incoming data, or accept
same-partitioned data, before it operates on it. There is an icon on the input link
(called link marking) to a stage which shows how the stage handles partitioning.
This information is really useful when deciding whether a given stage operates in
parallel or sequential mode. We have the following types of icons:
o None
o Auto
Shows that DataStage will decide the most suitable partitioning for the stage.
o Bow Tie
Shows that the data is being repartitioned.
o Fan Out
Shows that data is being partitioned – either this is the start of the job, or before this
stage the execution was in sequential mode: basically a sequential-to-parallel flow if
the stage is in the middle of the job flow.
o Same (box)
Shows that the existing partitioning is preserved.
o Fan In
Shows that data is being collected: a parallel-to-sequential flow.
21. Given a job design and configuration file, provide estimates of the number of
processes generated at runtime.
In the ideal case, where all the stages run on all the processing nodes and there are
no constraints on hardware, I/O devices, processing nodes etc., DataStage will start
one process per stage per partition, as the worked example below shows.
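For instance, with a 4-node configuration file and a job containing 3 parallel stages and 1 sequential stage (figures chosen purely for illustration, and ignoring operator combination and any inserted sort or buffer operators, which change the count):

   player processes   = 3 stages x 4 nodes + 1 sequential stage = 13
   section leaders    = 1 per logical node                      =  4
   conductor                                                    =  1
   total OS processes                                          ~= 18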
The configuration file is the first place where we put the information regarding
which nodes a stage can and cannot run on. In fact, it is not specified directly inside
the configuration file; rather, you declare Node Pools & Resource Pools in the
configuration file, and thus have a fair idea about what kind of stages can be run on
these pools.
o You also have the choice of not making a node or resource part of the default pool.
So, you can run processes selectively on a given node. In this case, in the job design
you select the list of nodes on which a stage can run, and this actually determines
how many processes will be started. Even here there may be further constraints: it is
quite possible that your job is constrained to run on a specific set of nodes, in which
case the processes will run on the nodes common to the job and the current stage.
In addition to Node Pool and Resource Pool constraints, you can also specify
Node Map constraints. These basically force the parallel execution to run on the
specified nodes. In fact, it is equivalent to creating another node pool (in addition to
whatever already exists in the configuration file) containing the set of nodes on
which this stage can run.
Also note that if a stage is running in sequential mode then one process will run
corresponding to it, and that process runs on the conductor node.
A node pool is used to create a set of logical nodes on which a job can run its
processes. Basically, all the nodes with similar characteristics are placed into the
same node pool. Depending on the hardware, software and software licenses
available, you create a list of node pools. During job design you decide which stage
will be suitable for which nodes and thus you constrain the stage to run on specific
nodes. This is needed because otherwise your job will not be able to take full
advantage of the available resources.
o Example – you do not want to run a sort stage on a machine which doesn't have
enough scratch disk space to hold the temporary files.
o It is very important to declare node pools and use node constraints in such a way
that they do not cause any unnecessary repartitioning.
A resource pool is mainly used to decide which nodes will have access to which
resources.
o E.g. if a job needs a lot of I/O operations then you need to assign high-speed I/O
resources to the nodes on which you plan to run the I/O-intensive processes.
o Similarly, if a machine has SAS installed on it (in an MPP system) and you have to
extract/load data into the SAS system, then you may like to create a SAS_NODEPOOL
containing information about these machines.
23. Given a source DataSet, describe the degree of parallelism using auto and same
partitioning.
Auto partitioning can be used to let DataStage decide the most suitable partitioning
for a given flow. In this case the degree of parallelism mainly depends on:
o What kind of Node Pool or Resource Pool constraints have been put on the given
stage.
There are mainly two factors which decide the partitioning method that DataStage
will choose:
o The constraints described above, and
o What kind of stage the current stage is – i.e. what kind of data requirement does
it have?
If the current stage doesn't have any specific data-setup requirement and there is no
previous stage, then DataStage typically uses Round Robin partitioning to make sure
that data is evenly distributed among the different partitions.
Same partitioning is a little different from auto partitioning. In this case the user is
instructing DataStage to use the same partitioning as was used by the previous
stage, so the partitioning method at the current stage is actually determined by the
partitioning configuration of the previous stages. This is generally used when you
know that keeping the existing partitioning will be more effective than allowing
DataStage to pick a partitioning, which may cause repartitioning. So if you use this
partitioning, data flows on within the same processing node – no redistribution of
data occurs!! The degree of parallelism is decided by:
o The constraints on the current stage. If the current stage is constrained to fewer
nodes than the previous stage – say the previous stages had data partitioned across 5
processing nodes while the current stage is constrained to run on only 3 – then using
Same partitioning doesn't make sense; the data has to be repartitioned.
You can export/import data between different frameworks. However, one thing you
must make sure of is that you provide the appropriate metadata (e.g. column
definitions, formatting rules, etc.) needed for exporting/importing the data.
25.Explain use of various file stages (e.g., file, CFF, fileset, dataset) and where
appropriate to use
1. In parallel jobs, data is moved around in data sets. These carry metadata with
them – column definitions and information about the configuration that was
in effect when the data set was created.
2. The Data Set stage allows you to store data being operated on in a persistent
form, which can then be used by other DataStage jobs. Persistent data sets are
stored in a series of files linked by a control file. So, never ever try to
manipulate these files using commands like rm, mv, tr, etc. as it will corrupt
the control file. If needed, use Dataset Management Utility to manage
datasets.
4. Preserve Partitioning
8. Using data sets wisely can be key to good performance in a set of linked jobs:
* No import/export conversions are required.
* No repartitioning is required.
9. Dataset stage allows you to read from and write to dataset (.ds) files.
Persistent Datasets
Two parts:
Descriptor file:
contains metadata and data location, but NOT the data itself, e.g.
record
(
partno: int32;
description: string;
)
Data file(s):
contain the data itself, as multiple Unix files (one per node), accessible in parallel, e.g.
node1:/local/disk1/…
node2:/local/disk2/…
FileSet Stage
Use of the File Set Stage
1. It allows you to read data from or write data to a file set.
2. DataStage can generate and name exported files, write them to their destination,
and list the files it has generated in a file whose extension is, by convention, ".fs".
3. A file set is really useful when the OS limits the size of a data file to 2GB and you
need to distribute files among nodes to prevent overruns.
4. The amount of data that can be stored in each destination file is limited by the
characteristics of the file system and the amount of free disk space available.
5. Unlike data sets, file sets carry formatting information that describes the format
of the files to be read or written.
2. Individual raw data files (the number of raw data files depends on the
configuration file).
Similar to a dataset, the main difference being that file sets are not in the internal
format and are therefore more accessible to external applications.
Descriptor File
1. You can specify that single files can be read by multiple nodes. This can improve
performance on cluster systems.
2. You can specify that a number of readers run on a single node. This means, for
example, that a single file can be partitioned as it is read (even though the stage is
constrained to running sequentially on the conductor node). Options 1 and 2 are
mutually exclusive.
Lookup File Set Stage
1. It allows you to create a lookup file set or reference one for a lookup.
2. The stage can have a single input link or a single output link. The output link
must be a reference link.
3. When performing lookups, Lookup File Set stages are used in conjunction with
Lookup stages.
4. If you are planning to perform a lookup on a particular key combination then it
is recommended to use this file stage. If another file stage is used for lookup
purposes, the lookup becomes sequential.
External Source Stage
1. This stage allows you to read data that is output from one or more source
programs.
2. The stage can have a single output link, and a single reject link.
3. External Source stages, unlike most other data sources, do not have inherent
column definitions, and so DataStage cannot always tell where there are extra
columns that need propagating. You can only use RCP on External Source stages if
you have used the Schema File property to specify a schema which describes all the
columns in the sequential files referenced by the stage. You need to specify the same
schema file for any similar stages in the job where you want to propagate columns.
Complex Flat File Stage
1. Allows you to read or write complex flat files on a mainframe machine. This is
intended for use on USS systems.
2. When used as a source, the stage allows you to read data from one or more
complex flat files, including MVS datasets with QSAM (Queued Sequential Access
Method) and VSAM (Virtual Storage Access Method, a file management system for
IBM mainframe systems) files.
3. Complex Flat File source stages execute in parallel mode when they are used to
read multiple files, but you can configure the stage to execute sequentially if it is
only reading one file with a single reader.
o Orchadmin
A Unix command-line utility
Lists records
Removes datasets
Removes all component files, not just the header file
o Dsrecords
Lists the number of records in a dataset
Both dsrecords and orchadmin are Unix command-line utilities.
The DataStage Designer GUI provides a mechanism to view and manage data sets.
Note: since datasets consist of a header file and multiple component files, you can't
delete them as you would delete a sequential file – you would just be deleting the
header file, and the remaining component files would continue to exist in limbo.
The Data Set Management screen is available from Manager, Designer, and
Director.
Manage Datasets from the System Command Line
Dsrecords
Gives record count
Unix command-line utility
$ dsrecords ds_name
E.g., $ dsrecords myDS.ds
156999 records
Orchadmin
Manages EE persistent data sets
Unix command-line utility
E.g., $ orchadmin delete myDataSet.ds
Key EE Concepts:
Parallel processing:
Executing the job on multiple CPUs
Scalable processing:
Add more resources (CPUs and disks) to increase system performance
Example system: 6 CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system
Parallel processing is the key to building jobs that are highly scalable.
The EE engine uses the processing node concept. "Standalone processes" rather than
"thread technology" are used. The process-based architecture is platform-independent
and allows greater scalability across resources within the processing pool.
A processing node is a CPU on an SMP, or a board on an MPP.
Pipeline Parallelism
Keeps processors busy
Still has limits on scalability
Partition Parallelism
Divide the incoming stream of data into subsets to be separately
processed by an operation
Subsets are called partitions (nodes)
Each partition of data is processed by the same operation
E.g., if operation is Filter, each partition will be filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed
Partitioning breaks a dataset into smaller sets. This is a key to scalability. However,
the data needs to be evenly distributed across the partitions; otherwise, the benefits
of Partitioning are reduced. It is important to note that what is done to each partition
of data is the same. How the data is processed or transformed is the same.
Three-Node Partitioning
The configuration file drives the parallelism by specifying the number of partitions.
Much of the parallel processing paradigm is hidden from the programmer: the
programmer simply designates the process flow, as shown in the upper portion of
this diagram. EE (Enterprise Edition), using the definitions in the configuration file,
will actually execute UNIX processes that are partitioned and parallelized, as
illustrated in the bottom portion.
28. What is the difference between a sequential file and a dataset? When to use the
copy stage?
The Sequential File stage stores a (typically smaller) amount of data in a plain file
with any extension, whereas a Data Set stores large amounts of data in the parallel
engine's internal format and is opened only through its descriptor file, which has the
.ds extension. The Copy stage copies a single input data set to a number of output
data sets. Each record of the input data set is copied to every output data set.
Records can be copied without modification, or you can drop columns or change the
order of columns.
1. Before using orchadmin, you should make sure that either the working directory
or $APT_ORCHHOME/etc contains the file "config.apt", OR the environment
variable $APT_CONFIG_FILE is defined for your session.
Orchadmin commands
The various commands available with orchadmin are:
1. CHECK: $ orchadmin check
Validates the configuration file contents: accessibility of all nodes defined in the
configuration file, scratch disk definitions, and so on. Throws an error when the
config file is not found or not defined properly.
2. COPY: $ orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the source dataset under a new destination descriptor file
name. Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a
new name; the data is not copied.
b. The new dataset will be arranged according to the configuration file that is
currently in use, not according to the old configuration file that was in use with the
source.
3. DELETE: $ orchadmin < delete | del | rm > [-f | -x] descriptorfiles…
The Unix rm utility cannot be used to delete datasets. The orchadmin delete or rm
command should be used to delete one or more persistent data sets.
-f forces a delete: if some nodes are not accessible then -f deletes the dataset
partitions from the accessible nodes and leaves the other partitions on the
inaccessible nodes as orphans.
-x forces the current config file to be used while deleting, rather than the one stored
in the data set.
4. DESCRIBE: $ orchadmin describe [options] descriptorfile.ds
This is the single most important command.
Without any options it lists the number of partitions, number of segments, valid
segments, and the preserve-partitioning flag of the persistent dataset.
-c: Print the configuration file that is written in the dataset, if any.
-p: List the partition-level information.
-f: List the file-level information in each partition.
-e: List the segment-level information.
-s: List the meta-data schema of the dataset.
-v: List all segments, valid or otherwise.
-l: Long listing; equivalent to -f -p -s -v -e.
5. DUMP: $ orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract) the records from the dataset.
Without any options the dump command lists all the records, starting from the first
record of the first partition up to the last record of the last partition.
-delim '<string>': Uses the given string as the delimiter for fields instead of a space.
-field <name>: Lists only the given field instead of all fields.
-name: Lists all the values preceded by the field name and a colon.
-n numrecs: Lists only the given number of records per partition.
-p period (N): Lists every Nth record from each partition, starting from the first record.
-skip N: Skips the first N records of each partition.
-x: Uses the current system configuration file rather than the one stored in the dataset.
6. TRUNCATE: $ orchadmin truncate [options] descriptorfile.ds
Without options, deletes all the data (i.e. segments) from the dataset.
-f: Forces the truncate: truncates accessible segments and leaves the inaccessible ones.
-x: Uses the current system config file rather than the one stored in the dataset.
-n N: Leaves the first N segments in each partition and truncates the remaining.
7. HELP: $ orchadmin -help OR $ orchadmin <command> -help
Gives the help manual for the usage of orchadmin or its commands.
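Putting the commands above together, a typical inspection session might look like this (the dataset name and config path are made up for the example):

   $ export APT_CONFIG_FILE=/opt/etl/configs/4node.apt   # or rely on config.apt, as noted above
   $ orchadmin check                                     # validate the configuration file
   $ dsrecords sales.ds                                  # record count
   $ orchadmin describe -p sales.ds                      # partition-level information
   $ orchadmin dump -n 10 -delim '|' sales.ds            # first 10 records per partition, pipe-delimited
   $ orchadmin rm sales.ds                               # remove the dataset and its component files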
31. If USS, define the native file format (e.g., EBCDIC, VSAM)
The native file format for USS (UNIX System Services) is EBCDIC. Extended Binary
Coded Decimal Interchange Code is an 8-bit character encoding used on IBM
mainframe operating systems.
ASCII is the American Standard Code for Information Interchange. EBCDIC and
ASCII are both ways of mapping computer codes to characters and numbers, as well
as other symbols typically used in writing. Most current computers have a basic
storage element of 8 bits, normally called a byte, which can have 256 possible values.
26 of these values are needed for A-Z, and another 26 for a-z; 0-9 take up 10, and
then there are many accented characters and punctuation marks, as well as control
codes such as carriage return (CR) and line feed (LF). EBCDIC and ASCII both
perform the same task, but they use different values for each symbol. For instance,
in ASCII an 'E' is code 69, but in EBCDIC it is 197. Text conversion is therefore very
easy; numeric conversion, however, is quite tricky. For example:
1. Text string: very simple and portable. A simple mapping can be used to map
the string to the code and vice versa.
2. Binary: binary numbers use the raw bytes in the computer to store numbers.
Thus a single byte can be used to store any number from 0 to 255; if two bytes
are used (16 bits) then numbers up to 65535 can be stored. The biggest
problem with this type of number storage is how the bytes are ordered, i.e.
Little Endian (used by Intel, least significant byte first, i.e. the high byte on
the right), Big Endian (used by Motorola, the high byte on the left), or Native
Endian. E.g. 260 in Little Endian is 04H 01H, while in Big Endian it is
01H 04H.
3. Packed decimal: in text mode each digit takes a single byte. In packed
decimal, each digit takes just 4 bits (a nibble). These nibbles are then packed
together, and the final nibble represents the sign: C (credit) is a +, D (debit) is
a -, and F is unsigned, i.e. +. The number 260 in packed decimal would be:
26H 0FH or 26H 0CH.
4. Floating point: floating-point numbers are much harder to describe but have
the advantage that they can represent a very large range of values including
many decimal places. Of course there are some rounding problems as well.
The problem with ASCII/EBCDIC conversion when dealing with records that
contain both text and numbers is that numbers must not be converted with the same
byte conversion that is used for ASCII/EBCDIC text conversion. The only truly
portable way to convert such records is on a per-field basis: typically text fields are
converted from EBCDIC to ASCII, while numeric fields, packed decimal fields etc.
are converted to ASCII strings.
How is this done? The only way to do an EBCDIC to ASCII conversion is with a
program that has knowledge of the record layout. With DataStage, details of each
record structure are entered and how each field is to be converted can be set. Files
with multiple record structures and multiple fields can then be converted on a field-
by-field basis to give exactly the correct type of conversion. It is an ideal solution for
an EBCDIC to ASCII conversion, as all data is retained. Packed decimal fields are
normally found in mainframe-type applications, often COBOL-related. RR32 can be
used to create Access-compatible files or even take ASCII files and create files with
packed decimal numbers. Reference for ASCII and EBCDIC code mappings:
4. Explain how to identify and capture rejected records (e.g., log counts, using a
reject link, options for rejection)
Considerations include whether the file is a fixed-length file (containing fixed-length
records) or a delimited file.
What is the volume of records in that flat file, and what kind of configuration
(e.g. number of nodes, I/O disks, etc.) do we have?
The scenario may also include whether you need to read a single file or multiple
files (e.g. a job can create 2 files named file_0 and file_1 which may be input for your
stage, and you must process both files for completeness). One more thing to
remember is that, while working with multiple files, they need not match a pattern;
you can mention the complete path name for each file.
• You can specify that single files can be read by multiple nodes. This can improve
performance on cluster systems.
• You can specify that a number of readers run on a single node. For example, a
single file can be partitioned as it is read (even though the stage is constrained to
running sequentially on the conductor node).
Using the Format tab you can supply information about the format of the files in the
file set to which you are writing. (The default is variable-length columns, surrounded
by double quotes, separated by commas, and rows delimited by the Unix newline
character.) An example record in the default format is shown below.
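For instance, with those defaults a three-column record (values invented for the example) would appear in the data file as:

   "1001","Acme Widget","42.50"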
Final Delimiter: specifies a delimiter for the last field of the record. When writing, a
space is now inserted after every field except the last in the record; previously, a
space was inserted after every field including the last. (The
APT_FINAL_DELIM_COMPATIBLE environment variable can be used for
compatibility with pre-7.5 releases.)
whitespace => import skips all standard white-space characters (space, tab, and
newline) trailing a field.
end => the last field in the record is composed of all remaining bytes until the end
of the record.
Final Delimiter String: specifies one or more ASCII delimiter characters as the
trailing delimiter for the last field of the record. Use backslash (\) as an escape
character to specify special characters within a string, e.g. '\t' for TAB. Import skips
the delimiter string in the source data file; on export, the trailing delimiter string is
written after the final field in the data file.
Fill Char: the byte value used to fill any gaps in an exported record caused by field
positioning properties. This is valid only for export. By default, the fill value is the
null byte (0); it may be an integer between 0 and 255, or a single-character value.
Record Delimiter String (mutually exclusive with Record Delimiter): specifies ASCII
characters to delimit a record. Pick 'DOS format' to get CR/LF delimiters; Unix
format is best specified using 'Record delimiter = UNIX newline', since that delimiter
is only a single character, the UNIX newline character. For fixed-width files with no
explicit record delimiter, remove this property altogether.
34. What are Job parameters and Environment variables and what is their significance?
Instead of entering inherently variable factors as part of the job design, you can set
up parameters which represent processing variables. Operators can be prompted for
values when they run or schedule the job.
Job Parameters:
• Used For
• Can also set/override environment variable values – valid only within the job
• Collecting is the process in which a number of virtual datasets are combined into
a single dataset or a single stream of data. The collection method specifies the
algorithm used for this process.
37. What are the guidelines to decide the number of nodes that suit a particular
DataStage job?
• The configuration file tells DataStage Enterprise Edition how to exploit the
underlying system resources. At runtime, EE first reads the configuration file to
determine what system resources are allocated to it, and then distributes the job
flow across these resources.
• There is not necessarily one ideal configuration file for a given system, because of
the high variability in the way different jobs work. For this reason, multiple
configuration files should be used to optimize overall throughput and to match job
characteristics to the available hardware resources.
• A configuration file with a larger number of nodes generates a larger number of
processes that use more system resources. Parallelism should be optimized rather
than maximized. Increasing parallelism may better distribute the workload, but it
also adds to the overhead because the number of processes increases. Therefore, one
must weigh the gains of added parallelism against the potential losses in processing
efficiency.
• If a job is highly I/O-dependent or dependent on external (e.g. database) sources
or targets, it may be appropriate to have more nodes than physical CPUs.
• For development environments, which are typically smaller and more
resource-constrained, create smaller configuration files (e.g. 2-4 nodes).
39. In how many different ways can a DataStage job be aborted, for example for a
requirement like 'stop processing the one-million-record input file if the error and
rejected records exceed 100'?
• First method: on the reject link of the Transformer stage we can set the property
'Abort After Rows'. This property is available on the link constraints.
• Second method: if the job has to be aborted based on a condition (for example, if
inputlink.value = 'xyz' then abort) then a call to the function DSLogFatal can be
issued, which logs a fatal error and aborts the job.
40. Can a Sequencer job be aborted if any job in the sequence fails? If so, how?
• Yes, the sequencer job can be aborted if any job in the sequence fails. This can be
achieved by connecting a Terminator Activity stage to the Job Activity stage and
then specifying the trigger to the Terminator Activity stage as Job Failed.
44. What file do you prefer to use as intermediate storage between two jobs – a
sequential file, a dataset or a file set – and why?
• A dataset is the most efficient stage for storing data between jobs. The advantage
of a dataset over a sequential file is that it preserves the partitioning of the data; this
removes the overhead of repartitioning the data which occurs when a sequential file
is used.
• A File Set stage, unlike a Dataset, carries formatting information with the files.
Even though the File Set stage can preserve partitioning, a Dataset is more efficient
because the data is kept in the parallel engine's internal format, so no import/export
conversion is needed.
45. What is the significance of the CRC32 and Checksum functions and their
relative merits?
• CRC32 function: returns a 32-bit cyclic redundancy check value for a string.
• Checksum function: returns a number that is a cyclic redundancy code for the
specified string.
• The Checksum implementation in Universe is 16-bit and the algorithm is additive,
which can lead to some very undesirable results. The probability that the same
checksum will be generated for a row that has changed (with the same key) is far
too high for change detection – somewhere around 1 in 65,536, or 2^16.
• CRC32, on the other hand, is not additive and the return value is a 32-bit integer.
Checksum has a difficult time detecting small changes in moderate to large fields,
and this is what makes it undesirable to use as a change-data mechanism.
46. What do you do to make sure the job logs don't fill up the entire space available
to DataStage over a period of time?
• We can set the 'Auto-purge of job log' property in the DataStage Administrator,
which clears up the logs periodically.
47. If the input dataset has 10 columns and only 5 columns need to be mapped to
the output dataset, without any transformations involved, then what stage do you
use and why?
• The Copy stage is the best choice, since using any other stage, like a Transformer,
adds to the overhead of the job.
48. All the operations that can be performed by a Copy stage, a Filter stage or a
Modify stage can also be performed by a Transformer stage. What is the
significance of having all these stages?
• The Transformer stage wraps a whole set of functions which add to the overhead
of a job; having more Transformer stages in a job degrades the performance.
• To achieve specific functions like filtering columns or altering schemas, using
specific stages like Copy or Modify can considerably reduce the overhead of the job
and hence make it more efficient.
49. What is the stage useful for debugging in PX? How is it used?
• The Peek stage is useful for debugging in PX. By diverting a stream of data into a
Peek stage, the records can be readily displayed in the log.
• DSJS.RUNWARN
• DSJS.STOPPED
• DSJS.VALFAILED
• DSJS.VALOK
• DSJS.VALWARN
52. What is the difference between a Normal lookup and a Sparse lookup and when
is each one used?
• Lookup Type: this property specifies whether the database stage will provide data
for an in-memory lookup (Lookup Type = Normal) or whether the lookup will
access the database directly (Lookup Type = Sparse).
• If the Lookup Type is Normal, the Lookup stage can have multiple reference
links. If the Lookup Type is Sparse, the Lookup stage can only have one reference
link.
• If the number of records on the input link is small and the reference data is large,
it is advisable to use a sparse lookup, since making a database call for each input
record may be less costly than loading all the reference data into memory.
53. How can you read a variable number of input files, all in the same format,
through a single DataStage job? Explain if there is more than one way.
• A number of input files in the same format can be read through one Sequential
File stage by specifying the file pattern property, as sketched below.
• A Folder stage can be used.
• A control file with extension .fs can be created and all the files can be read
through a single File Set stage. All the input files will have a single entry in
the .fs file with the complete path.
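As a rough sketch of the first option (the directory and pattern are invented for the example), the Sequential File stage properties would look something like:

   Read Method  = File Pattern
   File Pattern = /data/input/sales_*.txt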
Scenario …
– 3 Dimension Load Jobs and 2 Fact Load Jobs
– Fact Load Jobs to start after all Dimension Load Jobs complete successfully
– If any Dimension Load Job fails, terminate all running Jobs
Job Activity …
– Executes a DataStage Job
Scenario …
– 5 input files are available in a folder with the same layout
– A single Server Job is available to sort an input file
– Wait for a trigger to start the Job
– Send a message to a computer after Job completion (success or failure)
– Handle exceptions
Sample Job Sequence 2 …
Other Activity stages
Routine
– Specifies a routine from the Repository (but not transforms)
– Routine arguments can be accessed by subsequent activities
Email Notification
– Specifies the email notification to be sent using SMTP
– The email template file dssendmail_template.txt under the Projects folder
allows you to create different email templates for different projects
Nested Condition
– Allows branching the execution of the sequence based on a condition
– Example: if today is a weekday execute weekday_Job, else Weekend_Job
User Variables
– Allows you to define global variables within a sequence
– For example, the activity can be used to set Job parameters
55. What are the different trigger conditions available in a Job Activity stage in a
Sequencer?
• Conditional. A conditional trigger fires the target activity if the source activity
fulfills the specified condition. The condition is defined by an expression, and for a
Job Activity can be one of the following types: OK (job completed successfully),
Failed (job failed), Warnings (job finished with warnings), Custom (an expression
you supply), or UserStatus (a user status message set by the job).
• Unconditional. An unconditional trigger fires the target activity once the source
activity completes, regardless of the outcome.
• Otherwise. An otherwise trigger fires if none of the conditional triggers fired.
57. What are the options available to set the commit frequency while writing to an
Oracle database table?
APT_ORAUPSERT_COMMIT_ROW_INTERVAL and
APT_ORAUPSERT_COMMIT_TIME_INTERVAL.
You can make these two values job parameters and set them as necessary; see the
sketch below.
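For instance, committing roughly every 5000 rows or every 2 seconds (illustrative values; the variables can equally be set in the Administrator or as job parameters):

   $ export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=5000
   $ export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=2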
58. How can you use a single Join stage to remove duplicate records from the input
links and then join the records?
• In the Join stage, use the Hash partitioning property.
• Then select the Perform Sort checkbox and select Unique.
• Specify the keys on which the unique sort needs to be done.
This procedure will join the records after removing the duplicates on the input links.
62. What is the Usage Analysis report in DataStage Manager?
The Usage Analysis tool allows you to check where items in the DataStage
Repository are used. Usage Analysis gives you a list of the items that use a particular
source item. You can in turn use the tool to examine items on the list, and see where
they are used.
The Usage Analysis report gives the following information about each of the items
in the list:
• Relationship: the name of the relationship from the source to the target
• Type: the type of the target item
• Name: the name of the target item
• Category: the category where the item is held in the DataStage Repository
• Source: the name of the source item
63. How do you export and import DataStage components?
All the DataStage components can be exported into a file (.dsx or .xml file):
• Jobs
• Table definitions
• Shared Containers
• Data Elements
Specify the location and name of the file to which the components are to be
exported.
Importing the DataStage Components
• DataStage components which are available in a .dsx or .xml file can be imported
into a project
• The components are placed in the appropriate categories and all default
properties are migrated along with the components
• Specify the name and location of the file to import
• Options are available to overwrite the existing components by default or to
prompt for the same
RA – Documentation Tool
• The Documentation Tool can be invoked by clicking the Doc Tool button of the
Reporting Assistant
• Observe the registered components available
• Select individual components and click the Print Preview button to view the
reports
• Print the reports using the print icon
• Export the reports to files
Exporting
67. What is the difference between exporting Job components and Job executables?
Which one is preferred while exporting jobs from the Development environment to
the Production environment?
When a Job component is exported, the entire design information of the job is
exported.
When a job executable is exported, the design information is omitted; only the
executable is exported.
When moving jobs from development to production, exporting job executables is
recommended, since modification of jobs in production is not preferred and hence
is restricted.
68. What is the Number of Readers per Node property in the Sequential File stage?
• Specifies the number of instances of the file read operator on each processing
node. The default is one operator per node per input data file. If numReaders is
greater than one, each instance of the file read operator reads a contiguous range of
records from the input file. The starting record location in the file for each operator,
or seek location, is determined by the data file size, the record length, and the
number of instances of the operator, as specified by numReaders.
• The resulting data set contains one partition per instance of the file read operator,
as determined by numReaders. The data file(s) being read must contain fixed-length
records.
70. Can you use a Timestamp data type to read a timestamp value with
microseconds in it? For example '10-10-2004 10:10:10.3333'.
Yes, a timestamp data type can be used to read the microseconds, but care must be
taken that the Extended property of the timestamp data type is set to Microseconds.
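In Orchestrate schema terms this corresponds to declaring the column as timestamp[microseconds]; a one-column sketch (the column name is invented for the example):

   record
   (
     event_ts: timestamp[microseconds];
   )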
71. In a configuration file, if we keep on adding nodes, what is the downstream
impact?
A configuration file with a larger number of nodes generates a larger number of
processes that use more system resources. Parallelism should be optimized rather
than maximized. Increasing parallelism may better distribute the workload, but it
also adds to the overhead because the number of processes increases. Therefore, one
must weigh the gains of added parallelism against the potential losses in processing
efficiency.
72. What is the best way to read an input file with a variable number of columns, if
the maximum number of columns possible is known?
When there is a variable number of columns in an input file, it is better to read the
entire file as a single column and later split that column using a Column Import
stage, as sketched below.
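A minimal sketch of the read side (the column name is invented for the example): define the Sequential File stage with a single column and no field delimiter, e.g. a schema of

   record
   (
     raw_line: string;
   )

and then let the Column Import stage split raw_line into the known maximum set of output columns.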
73. Merge stage – no data in columns from the right table
The Merge stage is a processing stage. It can have any number of input links, a single
output link, and the same number of reject links as there are update input links.
Some example merges, and the steps you must take when deploying a Merge stage
in your job, are given in the Parallel Job Developer's Guide.
The Merge stage is one of three stages that join tables based on the values of key
columns. The other two are the Join stage and the Lookup stage.
The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements on the data being input (for example,
whether it is sorted).
The Merge stage combines a sorted master data set with one or more update data
sets. The columns from the records in the master and update data sets are merged so
that the output record contains all the columns from the master record plus any
additional columns from each update record. A master record and an update record
are merged only if both of them have the same values for the merge key column(s)
that you specify. Merge key columns are one or more columns that exist in both the
master and update records.
The data sets input to the Merge stage must be key partitioned and sorted. This
ensures that rows with the same key column values are located in the same partition
and will be processed by the same node. It also minimizes memory requirements
because fewer rows need to be in memory at any one time. Choosing the auto
partitioning method will ensure that partitioning and sorting is done. If sorting and
partitioning are carried out on separate stages before the Merge stage, DataStage in
auto partition mode will detect this and not repartition (alternatively you could
explicitly specify the Same partitioning method).
As part of preprocessing your data for the Merge stage, you should also remove
duplicate records from the master data set. If you have more than one update data
set, you must remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several
reject links. You can route update link rows that fail to match a master row down a
reject link that is specific for that link. You must have the same number of reject links
as you have update links. The Link Ordering tab on the Stage page lets you specify
which update links send rejected rows to which reject links. You can also specify
whether to drop unmatched master rows, or output them on the output data link.
The stage editor has three pages:
Stage page. This is always present and is used to specify general information about
the stage.
Inputs page. This is where you specify details about the data sets being merged.
Outputs page. This is where you specify details about the merged data being output
from the stage and about the reject links.
Many times this question pops up in the minds of DataStage developers.
All the above stages can be used to do the same task: match one set of data (say the
primary) with another set of data (the references) and see the results. DataStage
normally uses different execution plans (hmm… I should ignore my Oracle legacy
when posting on DataStage). Since DataStage is not so nice as Oracle about showing
its execution plan easily, we need to fill in the gap of the optimizer ourselves and
analyze our requirements. Well, I have come up with a nice table.
Most importantly, it is the primary/reference ratio that needs to be considered, not
the actual counts.

Primary Source Volume      Reference Volume            Preferred Method
Little (< 5 million)       Very Huge (> 50 million)    Sparse Lookup
Parallel DataStage jobs can have many sources of reference data for lookups
including database tables, sequential files or native datasets. Which is the most
efficient?
This question has popped up several times over on the DSExchange. In DataStage
server jobs the answer is quite simple: local hash files are the fastest method of a
key-based lookup, as long as the time taken to build the hash file does not wipe out
the benefit of using it.
In a parallel job there is a very large number of stages that can be used as a lookup –
a much wider variety than in server jobs – including most data sources and the
parallel staging formats of datasets and lookup filesets. I have discounted database
lookups, as the overhead of the database connectivity and any network passage
makes them slower than most local storage.
I did a test comparing datasets to sequential files to lookup filesets and increased row
volumes to see how they responded. The test had three jobs, each with a sequential
file input stage and a reference stage writing to a copy stage.
Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18
seconds. No lookup tables were created apart from the existing lookup fileset one.
This indicates the lookup data fit into memory and did not overflow to a resource
file.
2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only
three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file up
to 1:32. The cost of partitioning the lookup data is really showing now.
3 million rows
The fileset, still at 45 seconds, swallowed up the extra million rows with ease.
The dataset was up to 2:06 and the sequential file up to 2:20.
As a final test I replaced the lookup stage with a join stage and tested the dataset and
sequential file reference links. The dataset join finished in 1:02 and the sequential file
join finished in 1:15. A large join proved faster than a large lookup but not as fast as a
lookup fileset.
Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant:
they all load up very quickly, and even database lookups are fast. If you have very
large lookup files spilling into lookup table resources then the lookup fileset
outstrips the other options, and a join becomes a viable alternative. Joins are a bit
harder to design as you can only join one source at a time, whereas a lookup can
reference multiple sources.
I usually go with lookups for code to description or code to key type lookups
regardless of the size, I reserve the joins for references that bring back lots of
columns. I will certainly be making more use of the lookup fileset to get more
performance from jobs.
Sparse database lookups, which I didn't test for, are an option if you have a very
large reference table and a small number of input rows.
When we say "Validating a Job", we are talking about running the Job in the "check
only" mode. The following checks are made:
Connections are made to the data sources or the data warehouse.
SQL SELECT statements are prepared.
Files are opened (intermediate files are created if they do not already exist).
Select the data source stage depending upon the sources, for example flat file,
database, XML, etc.
Select the required stages for the transformation logic, such as Transformer, Link
Collector, Link Partitioner, Aggregator, Merge, etc.
Select the final target stage where you want to load the data, whether it is a data
warehouse, data mart, ODS, staging area, etc.
Parallel Jobs
1 - Generate OSH programs at job compilation
2 - Degree of parallelism defined at execution time
3 - Different stages (Data Sets, Lookup, ...)
You can read more about them in my blogs that compare server and parallel jobs.
Process in parallel or take up folk dancing:
http://blogs.ittoolbox.com/bi/websphere/archives/006622.asp
DataStage server v enterprise: some performance stats:
http://blogs.ittoolbox.com/bi/websphere/archives/006976.asp
There is also a description of each DataStage edition on the DataStage wiki page:
http://wiki.ittoolbox.com/index.php/Topic:WebSphere_DataStage
Some people will tell you if you are not doing your batch data integration using a
parallel processing engine you might as well shoot your ETL server and take up folk
dancing. You are just not a serious enterprise data integration player. This has led to
some confusion from developers who want to know whether parallel ETL is a better
career path, and consternation from DataStage customers who have the non-parallel
version and don't like folk dancing.
The Ab Initio ETL tool had such a good parallel engine they didn't need to advertise.
They just had a word in the ear of a customer, "you know how that fact load takes 12
hours, well we can do it in 1".
Now most of the serious ETL vendors such as IBM/Ascential, Informatica and SAS
have got automated parallel processing.
The obvious incentive for going parallel is data volume. Parallel jobs can remove
bottlenecks and run across multiple nodes in a cluster for almost unlimited
scalability. At this point parallel jobs become the faster and easier option. With the
release of DataStage Hawk next year an added incentive will be the extra
functionality of parallel jobs such as Quality Stage matching. Recent product
upgrades have made parallel jobs easier to build. Hopefully further improvements
and a lower price will be forthcoming in the next release.
Just change the very high volume jobs and do some performance testing to compare
different designs. Release a small number to production to see how they run.
version becomes cheaper and easier to use. The parallel version will remain popular
for large implementations such as master data management and enterprise
integration across clusters. Combination implementations will be popular with
customers upgrading to parallel jobs or starting with Enterprise Edition but choosing
to use server jobs for small volumes.
I ran some performance tests comparing DataStage server jobs against parallel jobs
running on the same machine and processing the same data. Interesting results.
Some people out there may be using the server edition, most DataStage for
PeopleSoft customers are in that boat, and getting to the type of data volumes that
make a switch to Enterprise Edition enticing. Most stages tested proved to be a lot
faster in a parallel job than in a server job, even when they are run on just one
parallel node.
All tests were run on a 2 CPU AIX box with plenty of RAM using DataStage 7.5.1.
The sort stage has long been a bugbear in DataStage server edition, prompting many
to sort data in operating system scripts instead:
1mill server: 3:17; parallel 1node:00:07; 2nodes: 00:07; 4nodes: 00:08
2mill server: 6:59; parallel 1node: 00:12; 2node: 00:11; 4 nodes: 00:12
10mill server: 60+; parallel 2 nodes: 00:42; parallel 4 nodes: 00:41
The parallel sort stage is quite a lot faster than the server edition sort. Moving from 2
nodes to 4 nodes on a 2 CPU machine did not see any improvement on these smaller
volumes and the nodes may have been fighting each other for resources. I didn't
have time to wait for the 10 million row sort to finish but it was struggling along
after 1 hour.
The next test was a transformer that ran four transformation functions including
trim, replace and calculation.
1 mill server: 00:25; parallel 1node: 00:11; 2node: 00:05: 4node: 00:06
2 mill server: 00:54; parallel 1node: 00:20; 2node: 00:08; 4node: 00:09
10mill server: 04:04; parallel 1node: 01:36; 2node: 00:35; 4node: 00:35
Even on one node with a compiled transformer stage the parallel version was three
times faster. When I added one node it became twelve times faster with the benefits
of the parallel architecture.
Aggregation:
1 mill server: 00:57; parallel 2node: 00:15
2 mill server: 01:55; parallel 2node: 00:28
The DB2 read was several times faster, and the source table with 2 million plus rows
had no DB2 partitioning applied.
So as you can see even on a 1 node configuration that does not have a lot of parallel
processing you can still get big performance improvements from an Enterprise
Edition job. The parallel stages seem to be more efficient. On a 2 CPU machine there
were some 10x to 50x improvements in most stages using 2 nodes.
If you are interested in these types of comparisons leave a comment and in a future
blog I may do some more complex test scenarios.
Large jobs are impacted by smaller jobs, especially when there is a flotilla of small
jobs constantly taking small chunks of CPU, RAM and disk I/O away from the larger
jobs.
Impact of Stress
I start a large parallel job processing millions of rows from a database over two
nodes. I rerun my two small jobs. The server job takes 2 seconds and the parallel job
is 10 seconds. On the next run they are 2 and 14. On the next run 2 and 11. The server
job is not impacted however the parallel job is.
I start up four more parallel jobs and retest my two small jobs. The server job is now
up to 3 seconds, the parallel job is up to 27 and 33 seconds. When I switch my small
parallel job to use four nodes instead of one it jumps to 44 and 43 seconds.
Now imagine you are running several hundred of these small jobs during a batch
window alongside some very large jobs. You begin to see how the performance of
the small jobs can be as important as the large jobs. Not only are the smaller jobs
taking much longer to run but they are taking resources away from the larger jobs.
Ease of Use
With DataStage Enterprise Edition you get both a parallel and a server job license so
you can use both types. Anyone proficient in parallel jobs will find server jobs easy to
write. Both jobs can be scheduled from the same sequence jobs or from the same
command line scheduling scripts via the dsjob command. Most types of custom or
MetaStage job reporting will collect results from each job type.
Going Parallel all the Way
If you do choose to use just parallel jobs then limit the overhead of small jobs by
running them on a single node. This can be done by adding the Environment
variable $APT_CONFIG_FILE and setting it to a single node configuration file. This
stops the job from starting too many processes and from partitioning data whose
volume is so small that partitioning it is a waste of time.
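For illustration, here is a minimal sketch of what a single node configuration file and
the job setting could look like; the host name, directory paths and file location below
are placeholders, so substitute your own.
    # a minimal sketch: create a single node configuration file
    # (host name "etlserver" and all paths are placeholders)
    cat > /opt/datastage/configs/one_node.apt <<'EOF'
    {
      node "node1"
      {
        fastname "etlserver"
        pools ""
        resource disk "/data/datastage/datasets" {pools ""}
        resource scratchdisk "/data/datastage/scratch" {pools ""}
      }
    }
    EOF
    # then point the small jobs at it, for example as the default of the
    # $APT_CONFIG_FILE job parameter or in the environment of the run script
    export APT_CONFIG_FILE=/opt/datastage/configs/one_node.apt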
I'm going to lay my cards on the table here, and unlike Gary Busey going all in on
celebrity Texas Hold-em, it's not a pair of deuces. You get greater flexibility and
control generating your surrogate keys within the ETL job rather than in the database
on insert. Especially when you have purchased a high end ETL tool like DataStage
and plan to use data quality best practices around that tool.
If those descriptions have done nothing for you then let's just call it a unique ID field
for data.
In DataStage, if the target table has an Identity field you set it by inserting all the
other columns and letting the Identity field set itself. For a DB2 sequence you set it by
using user-defined SQL on the insert statement so that it populates the field with the
sequence NEXTVAL command. E.g.
Gets changed to use a sequence to generate a new id:
INSERT INTO CUSTOMER
(CUSTOMER_ID, CUSTOMER_NAME)
VALUES
(CUSTOMERSEQ.NEXTVAL,
ORCHESTRATE.CUSTNAME)
ETL generated keys work very well on vertically banded ETL loads. For example,
you can process all your data during the day and load it into the database overnight.
Day time processing occurs mainly on the ETL server where you extract, transform,
consolidate, enrich, massage, wobble and masticate your data. This gives you a set of
load ready data files. Overnight you bulk load, insert or update them into your
database with almost no transformation. With ETL key generation you can have your
surrogate keys and foreign keys generated for new rows during this preparation
phase. This will save time and complexity on your overnight loads. This type of
vertical banding gives you easier rollback and recovery and lets you process data
without impacting on production databases.
An ETL generated key works no matter what type of database load you perform:
insert, bulk load, fast load, append, import etc. With database keys you need to work
out how the type of load affects your number generation. I still haven't worked out
how to bulk load into a DB2 table and use a DB2 sequence at the same time.
The ETL key generator offers very good performance whether using the stand alone
stage or in a transformer. It runs in parallel and uses very little memory.
The database prevents duplicate keys by remembering the last key used and
incrementing even when there are simultaneous loads. Mind you, good ETL design
will do the same thing.
84. what is job control ? what is the use of it explain with steps?
Job control (sometimes loosely called JCL) is used to run a number of jobs at a time,
with or without loops. Steps: click Edit in the menu bar, select Job Properties and
enter the parameters, for example:
    Parameter   Prompt    Type
    STEP_ID     STEP_ID   string
    Source      SRC       string
    DSN         DSN       string
    Username    unm       string
    Password    pwd       string
After editing the above, go to the Job control tab, select the jobs from the list box and
run the job.
Controlling DataStage jobs through other DataStage jobs. For example, consider two
jobs XXX and YYY. Job YYY can be executed from job XXX by using DataStage
macros in Routines.
To execute one job from another job, the usual steps in a routine are to attach the job
(DSAttachJob), set its parameters (DSSetParam), run it (DSRunJob), wait for it to
finish (DSWaitForJob) and check the result (DSGetJobInfo).
85. Is it possible to access the same job by two users at a time in DataStage?
No, it is not possible for two users to access the same job at the same time. DS will
produce the following error: "Job is accessed by other user"
86. How to drop the index before loading data into target and how to rebuild it in
DS?
This can be achieved by the "Direct Load" option of the SQL*Loader utility.
88.How do you track performance statistics and enhance it?
Through Monitor we can view the performance statistics.
89.What is the order of execution done internally in the transformer with the stage
editor having input links on the left hand side and output links?
Stage variables, constraints and column derivation or expressions.
91.How we use NLS function in Datastage? What are advantages of NLS function?
Where we can use that one? Explain briefly?
As per the manuals and documents, we have different levels of interfaces, for
example Teradata interface operators, DB2 interface operators, Oracle interface
operators and SAS interface operators. Orchestrate National Language Support
(NLS) makes it possible for you to process data in international languages using
Unicode character sets. International Components for Unicode (ICU) libraries
support NLS functionality in Orchestrate. Operators with NLS functionality include:
• Teradata interface operators
• switch operator
• filter operator
• DB2 interface operators
• Oracle interface operators
• SAS interface operators
• transform operator
• modify operator
• import and export operators
• generator operator
92.What are the Repository Tables in DataStage and what are they?
A datawarehouse is a repository (centralized as well as distributed) of Data, able to
answer any adhoc, analytical, historical or complex queries. Metadata is data about
data. Examples of metadata include data element descriptions, data type
descriptions, attribute/property descriptions, range/domain descriptions, and
process/method descriptions. The repository environment encompasses all
corporate metadata resources: database catalogs, data dictionaries, and navigation
services. Metadata includes things like the name, length, valid values, and
description of a data element. Metadata is stored in a data dictionary and repository.
It insulates the data warehouse from changes in the schema of operational systems.
In DataStage I/O and Transfer, under the interface tab, the input, output and transfer
pages each have four tabs; the last one is Build, and under it you can find the TABLE
NAME. The DataStage client components are: Administrator (administers DataStage
projects and conducts housekeeping on the server), Designer (creates DataStage jobs
that are compiled into executable programs), Director (used to run and monitor the
DataStage jobs) and Manager (allows you to view and edit the contents of the
repository).
• Follow correct syntax for definitions
– Import from an existing data set or file set
• Manager import > Table Definitions > Orchestrate Schema
Definitions
• Select checkbox for a file with .fs or .ds
– Import from a database table
– Create from a Table Definition
• Click Parallel on Layout tab
• Schema file for data accessed through stages that have the “Schema Files”
property, e.g. Sequential File
• Sample Use
• if source file format may change without functional impact to the DS code
• say columns inserted, reordered, deleted, etc.
• Jobs access the file only through the definition in the schema file
• The schema file may be changed without affecting the job(s)
• Refinement Case 1
– The input file may in the future
• include extra columns that are not relevant to the requirement, these
must be dropped/ignored by the job
• The record format may change, e.g. become comma delimited, order
in which the fields appear may change
– The job must be capable of accepting this input file without impact
• To Do
• Define a schema file to define the input file & point to it within the
sequential file stage
94. What is RUN TIME COLUMN PROPAGATION?
• Supports partial definition of meta data.
• Enable RCP to
– Recognize columns at runtime though they have not been used within
a job
– Propagate through the job stream
• Design and compile time column mapping enforcement
– RCP is off by default
– Enable
• project level. (Administrator project properties)
• job level. (Job properties General tab)
• Stage. (Link Output Column tab)
– Always enable if using Schema Files
• To use RCP in a Sequential stage:
– Use the “Schema File” option & provide path name of the schema file
• When RCP is enabled:
– DataStage does not enforce mapping rules
– Danger of runtime error if incoming column names do not match the
column names on the outgoing link
– Columns not used within the job also propagated if definition exists
• Note that RCP is available for specified stages
• Consider this requirement statement:
– Regional_Sales.txt is a pipe-delimited sequential file
– It will contain
• Region_ID
• Sales_Total
– Job must read this file and compute
• Sales_Total_USD = Sales_Total*45
– Write the data into
• data set Regional_Sales.ds
So far a simple job will do
• Refinement Case 1
– The input file may in the future
• include extra columns that are not relevant to the requirement, these
must be dropped/ignored by the job
• The record format may change, e.g. become comma delimited, order
in which the fields appear may change
– The job must be capable of accepting this input file without impact
• To Do
record
{final_delim=end, record_delim='\n', delim='|',
quote=double, charset="ISO8859-1"}
(
REGION_ID:int32 {quote=none};
SALES_TOTAL:int32 {quote=none};
)
TO DO:
Column definition will define all columns that must be carried through to the next stage
• Column names in the column definitions must match those defined in the
schema file
• Ensure RCP is disabled for the output links
record
{final_delim=end, record_delim='\n', delim='|',
quote=double, charset="ISO8859-1"}
(
REGION_ID:int32 {quote=none};
SALES_CITY:ustring[max=255];
SALES_ZONE:ustring[max=255];
SALES_TOTAL:int32 {quote=none};
)
– The Data Set will contain ALL columns in the schema (unless they are explicitly
accessed and dropped within the job), plus the computed field
Refinement Case 2
– The input file may in the future include extra columns BUT THESE
MUST BE CARRIED ON into the target DataSet as it is
– The job must be capable of accepting this input file without impact
• To Do
• Define & use schema file
• Ensure RCP is enabled at the project level as well as for all
output link along which data is to be propagated at run time
• Define all columns that require processing
• Other columns may or may not be defined
• In this case, Region_ID need not be defined in the stage
• But if a column is defined and found missing from schema
&/or data file at run time, the job will abort!
– The Data Set will contain ALL columns in the schema (unless they are explicitly
accessed and dropped within the job), plus the computed field
95.What are shared and Local Containers
Local containers
– are for making the job look less complicated
– Cannot be invoked from other jobs
– Can be converted to shared containers
– Can be deconstructed to embed the logic within the job itself
Shared containers
– Can be converted to a local container within a specific job, while still
retaining the original shared container definition
– Can be deconstructed
• Consider validation of geography
– Region_ID must exist in Region_Master.txt & Zone must exist in
Zone_Master.txt
– This rule is applicable for various streams including Regional_Sales
and Employee_Master
• Basic Solution:
– Create individual jobs to lookup each source against the master files
• Refined Solution
– Create a Shared container – “Validate Geography”
• Select the stages that are to be shared
• Select menu item Edit > Construct Container > Shared
To make it truly reusable
• Ensure that the stage defines only the fields used within the
processing, in this case, Zone & Region
– Column names that are used within the container must be mapped or
modified before/after the shared container is invoked
– The shared container icon within the job must be opened &
input/output in the job must be mapped against the corresponding
link name in the container
– Note that parameters required by the stages within the container must
be set through the container’s invocation stage in the calling job
• Container can be reused as shown to validate the geography information of
the employee-master file
• If in the future
– Say, geography validation no longer requires validation of Zone; then
• Change only the shared container
• Recompile all jobs that invoke the container
– Say if, the city must also be validated
• Provided all the inputs contain the required field,
– Change only the shared container
– Recompile all jobs that invoke the container
• Options
– Run through the DS client Director menu
– Command line interface DSJob Commands
• Used directly or within a OS shell or batch script
• DSJob available with client installation
– DataStage API
• callable through a C/C++ Program
• Distribute DLLs and header files to enable remote execution
without a DS Client
– Through other DataStage Executable Components - which have to be
in turn invoked through any of the listed means
• DataStage BASIC Job Control
– Written in DataStage Basic Script
– Embedded as a Job Control script within the job
definition OR
– Called as a Server Routine OR
– In Parallel Jobs, wrapped into a BASIC Transformer stage
• Invoked as an activity within a Sequence Job
– which has to be in turn invoked through any of the
listed means
Command Line Interface
• Use dsjob for controlling DataStage jobs
• Options available are:
• Logon
• Starting a job
• Stopping a job
• Listing projects, jobs, stages, links and parameters
• Setting an alias for job
• Retrieving information
• Accessing log files
• Importing job executables
• Generating a report
• CLI commands return a status code to the OS
• Use dsjob -run to start, validate or reset DataStage jobs, and dsjob -stop to stop them
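To make this concrete, here is a hedged sketch of a few dsjob calls; the project name
(MyProject), job name (DailyLoad) and the parameter are placeholders for your own.
    # list projects and the jobs in one of them
    dsjob -lprojects
    dsjob -ljobs MyProject
    # run a job, passing a parameter and waiting for its completion status
    dsjob -run -mode NORMAL -param SourceDir=/data/in -jobstatus MyProject DailyLoad
    # check the job status and summarise its log afterwards
    dsjob -jobinfo MyProject DailyLoad
    dsjob -logsum MyProject DailyLoad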
DataStage can be administered on a UNIX system by a non-root user. The user
created is 'dsadm' by default. The primary UNIX group of the administrator should
be the group to which all the other UNIX ids of DataStage users belong; by default
the group created is 'dstage'. Secondary UNIX groups can be created and assigned to
different DataStage roles from DS Administrator.
Your development directory must be visible globally across all nodes. Use NFS for
this.
Update the PWD variable to the same.
A user who runs the parallel jobs should have all the following rights :-
Login access
Read / Write access to Scratch disk
Read / Write access to Temporary dir
Read access to APT_ORCHHOME
Execute access to Local copies and scripts
• For accessing DB2 resources from DataStage, you need to define all the DB2
nodes acting as servers in your configuration file
• Execute the script $APT_ORCHHOME/bin/db2setup.ksh once for each
database in DB2
• Execute the script $APT_ORCHHOME/bin/db2grant.ksh for each user
accessing DB2
• For a remote connection to DB2, the DB2 client and the DS server should be on
the same machine
DB2 System configuration entails
1. Make sure db2nodes.cfg is readable by the DataStage administrator
2. For users running the jobs in Load Mode, the DataStage user needs to have
DBADM role granted. Give the grant by executing the command
1. GRANT DBADM ON DATABASE TO USER username
3. Grant the DataStage user select privileges on syscat.nodegroupdefs,
syscat.tablespaces, syscat.tables
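As a rough sketch under the steps above, the commands could look like the
following; the database name (SALESDB) and user (dsadm) are placeholders, and the
exact arguments of the setup script may vary by version.
    # run the DataStage DB2 setup script once per database (argument usage may vary)
    $APT_ORCHHOME/bin/db2setup.ksh SALESDB
    # grant the DataStage user the privileges listed above
    db2 "CONNECT TO SALESDB"
    db2 "GRANT DBADM ON DATABASE TO USER dsadm"
    db2 "GRANT SELECT ON syscat.nodegroupdefs TO USER dsadm"
    db2 "GRANT SELECT ON syscat.tablespaces TO USER dsadm"
    db2 "GRANT SELECT ON syscat.tables TO USER dsadm"
    db2 "CONNECT RESET"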
2. Oracle System configuration entails
1. Creating the user environment variables ORACLE_HOME and
ORACLE_SID and setting them to the correct values of $ORACLE_HOME
and $ORACLE_SID.
2. Adding ORACLE_HOME/bin to your PATH and ORACLE_HOME/lib to your
LIBPATH, LD_LIBRARY_PATH or SHLIB_PATH (depending on platform)
3. Have select privileges on
1. DBA_EXTENTS
2. DBA_TAB_PARTITIONS
3. DBA_DATA_FILES
4. DBA_OBJECTS
5. ALL_PART_INDEXES
6. ALL_INDEXES
7. ALL_PART_TABLES
8. GV_$INSTANCE ( Only if parallel server is used)
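A hedged sketch of those settings in the dsenv or user profile follows; the Oracle
home path and the SID below are placeholders for your own installation, and only
the library path variable for your platform is needed.
    export ORACLE_HOME=/u01/app/oracle/product/10.2.0    # placeholder path
    export ORACLE_SID=ORCL                               # placeholder SID
    export PATH=$ORACLE_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH   # Solaris/Linux
    export LIBPATH=$ORACLE_HOME/lib:$LIBPATH                   # AIX
    export SHLIB_PATH=$ORACLE_HOME/lib:$SHLIB_PATH             # HP-UX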
C++ Compiler
1. The DS env. Variables APT_COMPILER and APT_LINKER should be set to
the default location of the corresponding C++ compiler.
2. If the C++ compiler is installed elsewhere, then the above env. variable values
should be changed in each and every project from DS Administrator
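For example, on an AIX install using the VisualAge compiler the settings might look
like this; the compiler path is an assumption and should be replaced with the default
for your platform.
    # illustrative values for AIX/VisualAge; check your platform documentation
    APT_COMPILER=/usr/vacpp/bin/xlC_r
    APT_LINKER=/usr/vacpp/bin/xlC_r
    export APT_COMPILER APT_LINKER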
Configuring some environment variables
1. Values to the environment variables can be given from DS Administrator as
well as at the time of installation. Please configure there at the time of install
2. TMPDIR: By default is /tmp. Specify your own temporary directory
3. APT_IO_MAXIMUM_OUTSTANDING: Specifies the amount of memory
reserved in each node for TCP/IP communications. Default is 2 MB
4. APT_COPY_TRANSFORM_OPERATOR: Set it to True in order for the
Transformer stage to work under different env. Default is False.
5. APT_MONITOR_SIZE, APT_MONITOR_TIME: By default job monitoring is
time based and the monitor window is refreshed every 5 seconds. Specifying the
SIZE parameter while the TIME parameter is at this default makes the job reporting
window refresh when new instances are added. Overriding the default TIME
parameter will override the SIZE parameter.
6. APT_DEFAULT_TRANSPORT_BLOCK_SIZE,
APT_AUTO_TRANSPORT_BLOCK_SIZE: These variables specify the size of the
data block used when DS transports data across internal links. The default value of
APT_DEFAULT_TRANSPORT_BLOCK_SIZE is 32768. Set the other variable to True
if you want DS to calculate the block size automatically.
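Putting a few of these together, a sketch of the settings might look like the following;
the values are purely illustrative and these variables are normally maintained per
project from DS Administrator rather than hard-coded in a script.
    # illustrative values only - normally set per project in DS Administrator
    TMPDIR=/ds/tmp
    APT_COPY_TRANSFORM_OPERATOR=True
    APT_DEFAULT_TRANSPORT_BLOCK_SIZE=32768
    export TMPDIR APT_COPY_TRANSFORM_OPERATOR APT_DEFAULT_TRANSPORT_BLOCK_SIZE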
Starting and stopping the DS Server
1. To stop the DataStage server, use $DSHOME/bin/uv -admin -stop
2. To start the DataStage server, use $DSHOME/bin/uv -admin -start
3. Before stopping, please check for the dsrpcd process and client connections
using the command:
netstat -a | grep dsrpcd
Project location and assign DataStage EE roles
1. The project location and DataStage roles are assigned in DS Administrator.
2. In the projects dialog box and under generals tab, please assign the location of
the project.
3. The DataStage roles that can be assigned to UNIX groups ( not users ) under
permissions tab are
1. Developer - Has full access to all areas of a project.
2. Production Manager – Developer + access to create and manage
protected projects
3. Operator - Has no write access; can run the jobs and has access only to
Director
4. None - Cannot log in to any DataStage client component.
Configuring Unix Environment
Check the following tunable kernel parameters (recommended minimum values vary by UNIX platform):
MSGMAX 8192 32768 N/A 8192
MSGMNB 16384 32768 N/A 16384
SHMSEG 15* 15 N/A 32
MSGSEG N/A 7168 N/A N/A
SHMMAX 8388608 N/A N/A 8388608
SEMMNS 111 N/A N/A 51
SEMMSL 111 N/A N/A 128
SEMMNI 20 N/A N/A 128
Last week we got a requirement to Validate a date field.
The dates were in mm/dd/yyyy format.
1. 7/7/2007 format for some records
2. 07/7/2007 format for some records
3. 7/07/2007 format for some records
4. 07/07/2007 format for some records.
and in-between them there were some invalid dates like 09/31/2007, 09/040/200.
Being an Oracle developer before I started using DataStage, the first thing I went
for was TO_DATE() in the Oracle enterprise stage. Damn! Well, it wasn't so easy: the
stage aborted for invalid dates instead of throwing them down the reject link. I tried
some online help on how to capture the TO_DATE() errors into a reject link. After
searching for a couple of hours, nothing concrete came up.
OK. I decided to do the validation in DataStage. I already had a transformer in
the job and I included the constraint
IsValid('date', StringToDate(TrimLeadingTrailing(<inp_Col>), '%mm/%dd/%yyyy'))
and passed the invalid dates down the reject link. I compiled and ran the job. The
job ran successfully and I thought everything went fine, until I recognised that the
format mask is not intelligent enough to recognise single digit day and month fields.
Hmmm… I was back to square one. Then I tried some innovation using the format
mask as %[mm]/%[dd]/%yyyy, %m[m]/%d[d]/%yyyy etc… etc… Nothing worked.
Anyway, at last I was able to complete the task for the day using some troublesome
logic (identifying the single digit day and month values and concatenating a zero
before them with the help of the Field() function) inside the transformer and made
my boss happy, but for such a simple requirement I could not understand why the
DataStage date format mask is not modelled intelligently enough.
In Oracle, TO_DATE() is intelligent enough to recognise such data when the format
mask is specified as MM/DD/YYYY.
Also, we had a timestamp field coming in from the source with the format mask
'MM/DD/YYYY HH:MI:SS AM' (Oracle style). Even this I was not able to validate
properly in a DataStage timestamp field, since DataStage is not intelligent enough
to recognise 'AM/PM'. I guess I need to learn how regular expressions are
specified in DataStage.
These are day to day requirements in ETL, which I guess DataStage must handle.
Either I need to do more research or DataStage should provide more flexible format
masks in the next release and make the life of developers easy.
Job parameters should be used in all DataStage server, parallel and sequence jobs to
provide administrators access to changing run time values such as database login
details, file locations and job settings.
One option for maintaining these job parameters is to use project specific
environment variables. These are similar to operating system environment variables
but they are setup and maintained through the DataStage Administrator tool.
There is a blog entry with bitmaps that describes the steps in setting up these
variables at DataStage tip: using job parameters without losing your mind
Steps
To create a new project variable:
Start up DataStage Administrator.
Choose the project and click the "Properties" button.
On the General tab click the "Environment..." button.
Click on the "User Defined" folder to see the list of job specific environment
variables.
There are two types of variables - string and encrypted. If you create an encrypted
environment variable it will appear as the string "*******" in the Administrator tool
and will appear as junk text when saved to the DSParams file or when displayed in
a job log. This provides robust security of the value.
Note that encrypted environment variables are not supported in versions earlier than
7.5.
Environment Variables as Job Parameters
To create a job level variable:
* Open up a job.
* Go to Job Properties and move to the parameters tab.
* Click on the "Add Environment Variables..." button and choose the variable from
the list. Only values set in Administrator will appear. This list will show both the
system variables and the user-defined variables.
* Set the Default value of the new parameter to $PROJDEF. If it is an encrypted field
set it to $PROJDEF in both data entry boxes on the encrypted value entry form.
When the job parameter is first created it has a default value the same as the Value
entered in the Administrator. By changing this value to $PROJDEF you instruct
DataStage to retrieve the latest Value for this variable at job run time.
If you have an encrypted environment variable it should also be an encrypted job
parameter. Set the value of these encrypted job parameters to $PROJDEF. You will
need to type it in twice in the password entry box, or better yet cut and paste it into
the fields; a spelling mistake can lead to a connection error message that is not very
informative and leads to a long investigation.
Examples
These job parameters are used just like normal parameters by adding them to stages
in your job enclosed by the # symbol.
Job Parameter Examples
Field       Setting                                                       Result
Database    #$DW_DB_NAME#                                                 CUSTDB
Password    #$DW_DB_PASSWORD#                                             ********
File Name   #$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv     c:/data/custfiles/Customers_20040203.csv
Conclusion
These types of job parameters are useful for having a central location for storing all
job parameters that is password protected and supports encryption of passwords. It
can be difficult to migrate between environments. Migrating the entire DSParams file
can result in development environment settings being moved into production and
trying to migrate just the user defined section can result in a corrupt DSParams file.
Care must be taken.
101.Perform change data capture in an ETL job
Introduction
This HOWTO entry will describe how to identify changed data using a DataStage
server or parallel job. For an overview of other change capture options see the blog
on incremental loads
The objective of changed data identification is to compare two sets of data with
identical or similar metadata and determine the differences between the two. The
two sets of data represent an existing set of data and a new set of data, where the
change capture identifies the modified, added and removed rows in the new set.
Steps
The steps for change capture depend on whether you are using server jobs or parallel
jobs or one of the specialized change data capture products that integrate with
DataStage.
Server Job
Most change capture methods involve the transformer stage with new data as the
input and existing data as a left outer join reference lookup. The simplest form of
change capture is to compare all rows using output links for inserts and updates
with a constraint on each.
Column Compare
Update link constraint: input.firstname <> lookup.firstname and
input.lastname <> lookup.lastname and input.birthdate <> lookup.birthdate...
Insert link constraint: lookup.NOTFOUND
A delete output cannot be derived as the lookup is a left outer join.
These constraints can become very complex to write, especially if there are a lot of
fields to compare. It can also produce slow performance as the constraint needs to
run for every row. Performance improvement can be gained by using the CRC32
function to describe the data for comparison.
CRC Compare
CRC32 is a C function written by Michael Hester and is now on the Transformer
function list. It takes an input string and returns a signed 32 bit number that acts as a
digital signature of the input data.
When a row is processed and becomes existing data a CRC32 code is generated and
saved to a lookup along with the primary key of the row. When a new data row
comes through a primary key lookup determines if the row already exists and if it
does comparing the CRC32 of the new row to the existing row determines whether
the data has changed.
CRC32 change capture using a text file source:
• Read each row as a single long string. Do not specify a valid delimiter.
• In a transformer, find the key fields using the Field() function and generate a
CRC32 code for the entire record. Output all fields.
• In a transformer, look up the existing records using the key fields to join and
compare the new and existing CRC32 codes. Output new and updated
records.
• The output records have concatenated fields. Either write the output records
to a staging sequential file, where they can be processed by insert and update
jobs, or split the records into individual fields using the Row Splitter stage.
CRC32 change capture using a database source:
• To concatenate fields together use the Row Merge stage. Then follow the
steps described in the sequential file section above.
Parallel Job
The Change Capture stage uses "Before" and "After" input links to compare data.
? Before is the existing data.
? After is the new data.
The stage operates using the settings for Key and Value fields. Key fields are the
fields used to match a before and after record. Value fields are the fields that are
compared to find modified records. You can explicitly define all key and value fields
or use some of the options such as "All Keys, All Values" or "Explicit Keys, All
Values".
The stage outputs a change_code which by default is set to 0 for existing rows that
are unchanged, 1 for deleted rows, 2 for modified rows and 4 for new rows. A filter
stage or a transformer stage can then split the output using the change_code field
down different insert, update and delete paths.
Change Capture can also be performed in a Transformer stage as per the Server Job
instructions.
The CRC32 function is not part of the parallel job install but it is possible to write one
as a custom buildop.
Conclusion
The change capture stage in parallel jobs and the CRC32 function in server jobs
simplify the process of change capture in an ETL job.
Introduction
When an enterprise database stage such as DB2 or Oracle is set to upsert, it is possible
to create a reject link to trap rows that fail any update or insert statements. By default
this reject link holds just the columns written to the stage; it does not show any
columns indicating why the row was rejected, and often no warnings or error
messages appear in the job log.
Steps
There is an undocumented feature in the DB2 and Oracle enterprise stages where a
reject link out of the stage will carry two new fields, sqlstate and sqlcode. These hold
the return codes from the RDBMS engine for failed upsert transactions.
To see these values add a Peek to your reject link; the sqlstate and sqlcode should
turn up for each rejected row in the job log. To trap these values add a Copy stage to
your reject link, add sqlstate and sqlcode to the list of output columns, and on the
output columns tab check the "Runtime column propagation" check box; this will
turn your two new columns from invalid red columns to black and let your job
compile. If you do not see this check box, use the Administrator tool to turn on
column propagation for your project.
When the job runs and an RDBMS reject occurs, the record is sent down the reject
link; the two new columns are propagated down that link, defined by the Copy stage,
and can then be written out to an error handling table or file.
If you do not want to turn on column propagation for your project you can still
define the two new columns with a Modify stage by creating them in two
specifications. sqlcode=sqlcode and sqlstate=sqlstate. Despite column propagation
being turned off the Modify stage will still find the two columns on the input link
and use the specification to add them to the output schema.
Examples
Oracle: By default, oraupsert produces no output data set. By using the -reject option
you can specify an optional output data set containing the records that fail to be
inserted or updated. Its syntax is: -reject filename
For a failed insert record, these sqlcodes cause the record to be transferred to your
reject dataset:
-1400: cannot insert NULL
-1401: inserted value too large for column
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
For a failed update record, these sqlcodes cause the record to be transferred to your
reject dataset:
-1: unique constraint violation
-1401: inserted value too large for column
-1403: update record not found
-1407: cannot update to null
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
An insert record that fails because of a unique constraint violation (sqlcode of -1) is
used for updating.
DB2: When you specify the -reject option, any update record that receives a status of
SQL_PARAM_ERROR is written to your reject data set. Its syntax is: -reject filename
Conclusion
Always place a reject link on a Database stage that performs an upsert. There is no
other way to trap rows that fail that upsert statement.
For other database actions such as load or import a different method of trapping
rejects and messages is required.
103.Routine Generated Sequential Keys
When you require a system generated sequential key for a table this method will
allow you to control the starting number & preserve the next sequential for use in
subsequent loads.
After you have your output columns defined open the transformer stage (double
click on the transformer symbol) & open the stage variables properties box.
The stage variables properties box is opened by right-clicking on the stage
variables box within the transformer GUI. If the stage variables box is not displayed
in the transformer, press the "Show/Hide Stage Variables" button on the transformer
toolbar.
In the stage variables box enter a meaningful name for the variable which will
contain the sequential key and set its initial value to 0 (zero).
Enter the derivation for the stage variable, which in this case is the
KeyMgtGetNextValue routine provided in the SDK (Software Developers Kit) routine
category. The routine can be selected from the dropdown by right-clicking on the
derivation column. This routine, as viewed from DataStage Manager, is supplied
with DataStage along with source code to allow you to modify it for your needs.
Now all that is left is to use the variable in the derivation of your key field as shown
below.
Experiment with the methods and select the best for your situation.
Learn how to use a Simple Sequential Key here.
SCD type1
or
SCD type2
You can use one hash file to look up the target, take three instances of the target,
give different constraints depending on the process, give different update actions in
the target, and use system variables like sysdate and null.
We can handle SCD in the following ways Type I: Just overwrite; Type II: We need
versioning and dates; Type III: Add old and new copies of certain important fields.
Hybrid Dimensions: Combination of Type II and Type III
Yes, you can implement Type 1, Type 2 or Type 3. Let me try to explain Type 2 with
a timestamp.
Step 1: The timestamp is created via a shared container; it returns the system time
and a key. To satisfy the lookup condition we create a key column by using the
Column Generator.
Step 2: Our source is a Data Set and the lookup table is an Oracle OCI stage. Using
the Change Capture stage we find the differences; the stage returns a value in
change_code. Based on the return value we determine whether the row is an insert,
edit or update. If it is an insert we stamp it with the current timestamp and keep the
old timestamp row as history.
If you use more than 15 stages in a job, or around 10 lookup tables, you can call it a
complex job.
Complex design means having more joins and more lookups; that job design is then
called a complex job. We can easily implement any complex design in DataStage by
following simple tips, in terms of increasing performance as well. There is no
limitation on the number of stages in a job, but for better performance use at most
about 20 stages per job; if it exceeds 20 stages then go for another job. Use not more
than 7 lookups for a transformer, otherwise include one more transformer. I hope
that answers this rather abstract question.
107. Does Enterprise Edition only add parallel processing for better
performance?
Mainframe (MVS) jobs generate COBOL/JCL code and are transferred to a
mainframe to be compiled and run; the jobs are developed on a UNIX or Windows
server and then transferred to the mainframe. The first two versions share the same
Designer interface but have
a different set of design stages depending on the type of job you are working on.
Parallel jobs have parallel stages but also accept some server stages via a container.
Server jobs only accept server stages; MVS jobs only accept MVS stages. There are
some stages that are common to all types (such as aggregation) but they tend to have
different fields and options within that stage.
109. If you're running 4-way parallel and you have 10 stages on the canvas, how
many processes does DataStage create?
The answer is 40:
you have 10 stages and each stage can be partitioned and run on 4 nodes, which
makes the total number of processes generated 40.
110.If data is partitioned in your job on key 1 and then you aggregate on key 2,
what issues could arise?
If the data stays partitioned on key 1, rows with the same value of key 2 can end up
in different partitions, so the aggregation on key 2 can produce partial or duplicate
groups. The data needs to be repartitioned (for example hash partitioned) on key 2
before the Aggregator to get correct results.
111. What are the different types of errors you faced during loading and how did
you solve them?
Check the parameters, check whether the input files exist, check whether the input
tables exist, and also check the usernames, data source names and passwords.
112. How can I specify a filter command for processing data while defining
sequential file output data?
We have something called an after-job subroutine and a before-job subroutine, with
which we can execute UNIX commands. Here we can use the sort command or a
filter command.
Wherever the same lookup is needed in multiple places, we build the lookup in a
shared container and then use the shared container as the lookup.
114.Can any one tell me how to extract data from more than 1 heterogeneous
Sources.
Mean, example 1 sequential file, Sybase, Oracle in a single Job.
Yes, you can extract data from two heterogeneous sources in DataStage; it is simple,
you just need to bring the two sources together through a Transformer stage.
Alternatively, you can convert all heterogeneous sources into sequential files and
join them using Merge,
or
you can write a user-defined query in the source itself to join them.
DataStage doesn't know how large your data is, so cannot make an informed choice
whether to combine data using a join stage or a lookup stage. Here's how to decide
which to use:
if the reference datasets are big enough to cause trouble, use a join. A join does a
high-speed sort on the driving and reference datasets. This can involve I/O if the
data is big enough, but the I/O is all highly optimized and sequential. Once the sort
is over the join processing is very fast and never involves paging or other I/O
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several
reject links, as many as there are update links. The concept of merge and join is
different in the parallel edition; in server jobs you will not find a Join stage, so the
Merge stage serves this purpose.
As far as I know, join and merge are both used to join two files of the same structure,
whereas a lookup is mainly used to compare the previous data and the current data.
We can join two relational tables using a hash file only in server jobs; the Merge
stage is only for flat files.
116.There are three different types of user-created stages available for PX.
What are they? Which would you use? What are the disadvantages for using each
type?
These are the three different stages: i) Custom ii) Build iii) Wrapped
DataStage jobs.
119.What is DS Director used for - did u use it?
Datastage Director is used to monitor, run, validate & schedule datastage server jobs.
• Jobs created in the Designer can be run in the Director
• Director can be invoked by logging into Director separately or directly
through Designer.
• Director has three screens
o Status – Can be invoked by clicking the Status icon available in the
tool bar , Shows the status of the jobs available
o Schedule – Can be invoked by clicking the Schedule icon available in
the tool bar, shows the schedule of the job
o Log - Can be invoked by clicking the Log icon available in the tool
bar, shows the Log of the job
Director-Status
Director – Schedule
Director-Log
Director-Running a Job
• Select the job to be run and click the RUN NOW button available in the
tool bar.
• Tool bar also has options to
Reset
Stop
Sort the jobs by name (Ascending & Descending)
• Once the job starts running, click the log screen to monitor the job
• Job status screen will have the status finished once the job gets finished
successfully
Director-Scheduling a job
Select the job, right click and select Add to schedule menu to schedule the job –
Observe the various scheduling options available
Filtering a logs:
Select the job, right click and select Filter in the log screen to
filter the log – Observe the various options available.
Purging the logs:
Select the job, choose Clear Log from the toolbar, and observe the various
options available.
One is by using a hashed file and the other is by using Database/ODBC stage as a
lookup.
Duplicates can be eliminated by loading the corresponding data into a hash file;
specify the columns on which you want to eliminate duplicates as the keys of the hash file.
Removal of duplicates done in two ways:
1. Use "Duplicate Data Removal" stage
or
2. Use GROUP BY on all the columns used in the SELECT; the duplicates will go away.
If you connect the output of a hash file to a transformer, it will act as a reference;
there are no errors at all. It can be used in implementing SCDs.
If the hash file output is connected to a Transformer stage, the hash file is treated as
the lookup file when there is a primary link to the same Transformer stage; if there
is no primary link then the hash file is treated as the primary link itself. You can do
SCDs in server jobs by using this lookup functionality. This will not return any error
code.
123. What is merge and how can it be done? Please explain with a simple example
taking 2 tables.
Merge is used to join two tables. It takes the key columns and sorts them in
ascending or descending order. Let us consider two tables, Emp and Dept. If we
want to join these two tables, we have DeptNo as a common key, so we can give
that column name as the key, sort DeptNo in ascending order and join the two
tables.
The Merge stage is used only for flat files in the server edition and requires
configuration of an SMP/MPP server.
124.What are the enhancements made in datastage 7.5 compare with 7.0
Many new stages were introduced compared to datastage version 7.0. In server jobs
we have stored procedure stage, command stage and generate report option was
there in file tab. In job sequence many stages like startloop activity, end loop activity,
terminate loop activity and user variables activities were introduced. In parallel jobs
surrogate key stage, stored procedure stage were introduced. Complex file and
Surrogate key generator stages are added in Ver 7.5
127. What is the difference between in-process and inter-process?
Regarding the database it varies and depends upon the project; for the second
question, in-process means the server transfers only one row at a time to the
target, and inter-process means the server sends a group of rows to the target
table. Both are available on the Tunables tab page of the Administrator client
component.
In-process
You can improve the performance of most DataStage jobs by turning in-process row
buffering on and recompiling the job. This allows connected active stages to pass
data via buffers rather than row by row.
Note: You cannot use in-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice,
and it is advisable to redesign your job to use row buffering rather than COMMON
blocks.
Inter-process
Use this if you are running server jobs on an SMP parallel system. This enables the
job to run using a separate process for each active stage, which will run
simultaneously on a separate processor.
Note: You cannot use inter-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice,
and it is advisable to redesign your job to use row buffering rather than COMMON
blocks.
You can also define your own before/after subroutines using the Routine dialog box.
Custom UniVerse functions. These are specialized BASIC functions that have been
defined outside DataStage. Using the Routine dialog box, you can get DataStage to
create a wrapper that enables you to call these functions from within DataStage.
These functions are stored under the Routines branch in the Repository. You specify
the category when you create the routine. If NLS is enabled, you should be aware of
any mapping requirements when using custom UniVerse functions. If a function uses
data in a particular character set, it is your responsibility to map the data to and from
Unicode.
ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming
components within DataStage. Such functions are made accessible to DataStage by
importing them. This creates a wrapper that enables you to call the functions. After
import, you can view and edit the BASIC wrapper using the Routine dialog box. By
default, such functions are located in the Routines > Class name branch in the
Repository, but you can specify your own category when importing the functions.
When using the Expression Editor, all of these components appear under the DS
Routines… command on the Suggest Operand menu.
A special case of routine is the job control routine. Such a routine is used to set up a
DataStage job that controls other DataStage jobs. Job control routines are specified in
the Job control page on the Job Properties dialog box. Job control routines are not
stored under the Routines branch in the Repository.
Transforms. Transforms are stored in the Transforms branch of the DataStage
Repository, where you can create, view or edit them using the Transform dialog box.
Transforms specify the type of data transformed, the type it is transformed into, and
the expression that performs the transformation. DataStage is supplied with a
number of built-in transforms (which you cannot edit). You can also define your own
custom transforms, which are stored in the Repository and can be used by other
DataStage jobs. When using the Expression Editor, the transforms appear under the
DS Transform… command on the Suggest Operand menu.
Functions take arguments and return a value. The word "function" is applied to
many components in DataStage:
• BASIC functions. These are one of the fundamental building blocks of the BASIC
language. When using the Expression Editor, you can access the BASIC functions
via the Function… command on the Suggest Operand menu.
• DataStage BASIC functions. These are special BASIC functions that are specific to
DataStage and are mostly used in job control routines. DataStage functions begin
with DS to distinguish them from general BASIC functions. When using the
Expression Editor, you can access the DataStage BASIC functions via the DS
Functions… command on the Suggest Operand menu.
The following items, although called "functions", are classified as routines. When
using the Expression Editor, they all appear under the DS Routines… command on
the Suggest Operand menu:
• Transform functions
• Custom UniVerse functions
• ActiveX (OLE) functions
Expressions. An expression is an element of code that defines a value. The word
"expression" is used both as a specific part of BASIC syntax and to describe portions
of code that you can enter when defining a job. Areas of DataStage where you can
use such expressions are:
• Defining breakpoints in the debugger
• Defining column derivations, key expressions and constraints in Transformer stages
• Defining a custom transform
In each of these cases the DataStage Expression Editor guides you as to what
programming elements you can insert into the expression.
130.Which three are valid ways within a Job Sequence to pass parameters to
Activity stages?
ExecCommand Activity stage, UserVariables Activity stage, Routine Activity stage
132.You have a compiled job and parallel configuration file. Which three methods
can be used to determine the number of nodes actually used to run the job in
parallel?
Within DataStage Director, examine log entry for parallel configuration file
Within DataStage Director, examine log entry for parallel job score
Within DataStage Director, open a new DataStage Job Monitor
133.Which three features of datasets make them suitable for job restart points?
They are partitioned.
They use data types that are in the parallel engine's internal format.
They are persistent.
134.What would require creating a new parallel Custom stage rather than a new
parallel BuildOp stage?
In a Custom stage, the number of input links does not have to be fixed; it can vary, for example from one to two. BuildOp stages require a fixed number of input links. Note that creating a Custom stage requires knowledge of C/C++, whereas you do not need C/C++ knowledge to create a BuildOp stage.
136.Which two would cause a stage to sequentially process its incoming data?
The execution mode of the stage is sequential.
The stage has a constraint with a node pool containing only one node.
Stage variables:
• Counters
• Store values from previous rows to make comparisons
• Store derived values to be used in multiple target field derivations
• Can be used to control execution of constraints
• An intermediate processing variable that retains its value during the read and does not pass the value into a target column
Constraint - conditions that are either true or false that specify the flow of data within a link.
140.How do you pass the parameter to the job sequence if the job is running at
night?
Two ways:
1. Set the default values of the parameters in the Job Sequence and map these parameters to the job.
2. Run the job in the sequence using the dsjob utility, where we can specify the value to be taken for each parameter.
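For the second approach, a minimal sketch of the dsjob call (project, job and parameter names here are hypothetical placeholders):
dsjob -run -param SourceFile=/data/in/customers.txt -param RunDate=20060518 -jobstatus MyProject seq_load_customers
The -jobstatus option makes dsjob wait for the job to finish and return its status, which is useful when the call is made from a scheduler or wrapper script.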
141.What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
AUTOSYS: through AutoSys you can automate the job by invoking a shell script written to schedule the DataStage jobs.
142.What is meant by "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made"?
This means: try to improve performance by avoiding constraints wherever possible and instead filtering while selecting the data itself, using a WHERE clause in the source stage. Unwanted rows then never enter the job, which improves performance.
1) If an input file has an excessive number of rows and can be split-up then use
standard
144.What is OCI? And how is it used in the ETL tool?
OCI stands for Oracle Call Interface, the native Oracle client interface. The Oracle OCI stage uses the Oracle client libraries to read from and write to Oracle directly, while the ORABULK stage prepares data for bulk loading through SQL*Loader, which is faster for very large volumes.
145.What is merge and how can it be done? Explain with a simple example taking 2 tables.
Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Consider two tables, Emp and Dept: to join them we have DeptNo as a common key, so we give that column name as the key.
150.I want to process 3 files sequentially, one by one; how can I do that? While processing, it should fetch the files automatically.
If the metadata for all the files is the same, create a job with the file name as a parameter, then call that same job from a routine or a job sequence once per file, passing a different file name each time.
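A minimal command-line sketch of the same idea, looping over the three files with dsjob (project, job and parameter names are hypothetical):
for f in /data/in/file1.txt /data/in/file2.txt /data/in/file3.txt
do
    dsjob -run -param SourceFile=$f -jobstatus MyProject load_one_file
done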
151.Scenario-based question: suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded into the target table, the rest are not loaded, and the job aborts. How can you sort out the problem?
If the job sequencer that synchronises or controls the 4 jobs hits a problem in job 1, go to the Director and check what type of problem is being reported: a data type problem, a warning message, a job failure or a job abort. A job failure usually means a data type problem.
153.How many places can you call routines from?
Four places: (i) transforms (for example date transformations and string transformations), (ii) before- and after-subroutines, (iii) XML transformations, and (iv) web-based transformations.
154.How do you fix the error "OCI has fetched truncated data" in DataStage?
One suggested option is to use the Change Capture stage to identify the truncated data.
162.What about system variables?
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE The internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
output link. REJECTED is initially TRUE, but is set to FALSE whenever an output
link is successfully written.
171.Tell me one situation from your last project where you faced a problem and how you solved it.
A. Jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
B. A job aborted in the middle of loading some 500,000 rows. The options were either to clean/delete the loaded data and then run the fixed job, or to run the job again from the row at which it had aborted. To make sure the load was proper, we opted for the former.
The above might raise another question: why do we have to load the dimension tables first, then the fact tables? Because as we load the dimension tables their primary keys are generated, and those keys are the foreign keys in the fact tables.
How will you determine the sequence of jobs to load into the data warehouse?
First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregate tables (if any).
178.What are the command line functions that import and export the DS jobs?
dsimport.exe - imports the DataStage components.
dsexport.exe - exports the DataStage components.
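A sketch of typical usage from a Windows client (host, project, job and file names are placeholders, and the exact switches vary between DataStage releases, so verify them against the documentation for your version):
dsexport.exe /H=dshost /U=dsadmin /P=passw0rd /JOB=MyJob MyProject C:\export\MyJob.dsx
dsimport.exe /H=dshost /U=dsadmin /P=passw0rd MyProject C:\export\MyJob.dsx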
179.What is the utility you use to schedule the jobs on a UNIX server other than
using Ascential Director?
Use the crontab utility to schedule a shell script that runs the job, for example via the dsjob command line (or the DSExecute routine), passing the proper parameters.
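A minimal sketch of that approach (paths, project and job names are hypothetical): a wrapper script that sources dsenv and calls dsjob, plus a crontab entry that runs it every night at 1 a.m.
# /home/dsadmin/scripts/run_nightly.sh
. /home/dsadmin/Ascential/DataStage/DSEngine/dsenv
dsjob -run -param RunDate=`date +%Y%m%d` -jobstatus MyProject nightly_load

# crontab entry for the dsadmin user
0 1 * * * /home/dsadmin/scripts/run_nightly.sh >> /tmp/nightly_load.log 2>&1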
180.How would you call an external Java function which is not supported by DataStage?
Starting from DataStage 6.0 we have the ability to call external Java functions using a Java package from Ascential. We can even use the command line to invoke the Java function, write its return values (if any) to a file, and use that file as a source in a DataStage job.
181.What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
A. Under Windows: use the Wait For File Activity under the Sequencer and then run the job. You could also schedule the sequence around the time the file is expected to arrive.
B. Under UNIX: poll for the file; once the file has arrived, start the job.
Read up on the string functions in DataStage, for example the substring operator [] and the concatenation operator ':'. Syntax:
string[ [start,] length ]
string[ delimiter, instance, repeats ]
182.If you worked with DS 6.0 and later versions, what are Link Partitioner and Link Collector used for?
Link Partitioner - Used for partitioning the data. Link Collector - Used for collecting
the partitioned data.
183.What are OConv () and Iconv () functions and where are they used?
IConv() converts a string to an internal storage format; OConv() converts an expression to an output format.
184.Do you know about MetaStage?
MetaStage is Ascential's metadata management tool. In simple terms, metadata is data about data, and MetaStage is used to capture, share and analyse the metadata (table definitions, job designs, data lineage and so on) produced by tools such as DataStage.
Do you know about the INTEGRITY/QUALITY stage?
Integrity/QualityStage is a data quality tool from Ascential which is used to standardise and integrate data from different sources.
187.What are XML files, how do you read data from XML files, and what stage is to be used?
In the palette, under the Real Time category, there are stages such as XML Input, XML Output and XML Transformer.
188.Suppose there are a million records; did you use OCI? If not, which stage do you prefer?
Use Orabulk (bulk loading).
192.What is the order of execution done internally in the Transformer, with the stage editor having input links on the left hand side and output links on the right?
Stage variables first, then constraints, then column derivations/expressions.
193.What are the difficulties faced in using DataStage? Or what are the constraints
in using DataStage?
1) Performance can suffer when the number of lookups is large. 2) Handling the partially loaded data when, for some reason, a job aborts while loading.
195.Differentiate Primary Key and Partition Key?
A primary key is a combination of unique and not null; it can be a collection of key columns, called a composite primary key. A partition key is just a column (often part of the primary key) used to distribute the data. There are several methods of partitioning, such as Hash, DB2 and Random; when using Hash partitioning we specify the partition key.
197.What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.
199.How do you run a shell script within the scope of a DataStage job?
By using the ExecSH routine in the before/after-job subroutine properties.
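For example (the script path and argument are hypothetical): in Job Properties, set the before-job subroutine to ExecSH and pass the command to run as its input value, e.g.
Before-job subroutine: ExecSH
Input value: /home/dsadmin/scripts/prepare_input.sh /data/in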
Fast Load or Bulk Load: use the native load utility integrated into a DataStage job.
ODBC stage: provides standard or enterprise ODBC connectivity.
One of the trickiest databases to connect to from a DataStage enterprise stage is DB2, and the following article walks through that configuration.
Introduction
WebSphere DataStage is one of the foremost leaders in the ETL (Extract, Transform, and Load) market space. One of the great advantages of this tool is its scalability, as it is capable of parallel processing on an SMP, MPP or cluster environment. Although DataStage Enterprise Edition (DS/EE) provides many types of plug-in stages to connect to DB2, including DB2 API, DB2 Load and Dynamic RDBMS, only the DB2 Enterprise Stage is designed to support parallel processing for maximum scalability and performance.
The DB2 Data Partitioning Feature (DPF) offers the necessary scalability to distribute a large database over multiple partitions (logical or physical). ETL processing of large volumes of data across whole tables is very time-consuming using the traditional plug-in stages. The DB2 Enterprise Stage, however, provides a parallel execution engine, using direct communication with each database partition to achieve the best possible performance.
DB2 Enterprise Stage with DPF communication architecture
Figure 1. DS/EE remote DB2 communication architecture
As you see in Figure 1, the DS/EE primary server can be separate from the DB2 coordinator node. A 32-bit DB2 client must still be installed, but its role differs from typical remote DB2 access, which requires only the DB2 client for connectivity: here the client is used to pre-query the DB2 instance and determine the partitioning of the source or target table. On the DB2 server, every DB2 DPF partition must have the DS/EE engine installed. In addition, the DS/EE engine and libraries must be installed in the same location on all DS/EE servers and DB2 servers.
The following principles are important in understanding how this framework works:
• The DataStage conductor node uses the local DB2 environment variables to determine the DB2 instance.
• DataStage reads the db2nodes.cfg file to determine each DB2 partition. The db2nodes.cfg file is copied from the DB2 server node and can be placed in any sqllib subdirectory location on the DS/EE server; the DS/EE environment variable $APT_DB2INSTANCE_HOME is used to specify this location of sqllib.
• DataStage scans the current parallel configuration file specified by the environment variable $APT_CONFIG_FILE. Each fastname property in this file must match a node name in db2nodes.cfg, as shown in the sketch below.
• The DataStage conductor node queries the local DB2 instance, using the DB2 client, to determine the table partition information.
• DataStage starts up processes across the ETL and DB2 nodes in the cluster. The DB2/UDB Enterprise stage passes data to/from each DB2 node through the DataStage parallel framework, not the DB2 client. The parallel execution instances can be examined from the job monitor of the DataStage Director.
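For example, assuming a two-partition DB2 instance on host stage164 (node names and paths are illustrative), db2nodes.cfg might contain:
0 stage164 0
1 stage164 1
and the matching nodes in the parallel configuration file would carry the same host name in their fastname properties:
{
    node "db2node0" { fastname "stage164" pools "" resource disk "/ds/data" {pools ""} resource scratchdisk "/ds/scratch" {pools ""} }
    node "db2node1" { fastname "stage164" pools "" resource disk "/ds/data" {pools ""} resource scratchdisk "/ds/scratch" {pools ""} }
}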
In our example, we use two machines running the Red Hat Enterprise Linux 3.0 operating system for testing: one with 2 CPUs and 1 GB of memory for the DB2 server, and another with 1 CPU and 1 GB of memory for the DS/EE server. On the DB2 server we have two database partitions, which are configured via db2nodes.cfg, while on the DS/EE server the engine configuration file tells us which nodes are used to execute DataStage jobs concurrently.
The following are the steps we followed to successfully configure a remote DB2 instance using the DS/EE DB2 Enterprise Stage. We will begin this exercise from scratch, including DB2 server configuration, DS/EE installation and configuration.
Installation and configuration steps for the DB2 server
• Install DB2 Enterprise Server Edition (with DPF) and create a DB2 instance on the Stage164 node.
• Configure the rsh service and the remote authority file.
• Create the sample database and check the distribution of tables.
• Create the DS/EE users on both nodes.
If a DB2 DPF environment is already installed and configured, you can skip step 1 and step 3.
Step 1. Install DB2 Enterprise Server and create a DB2 instance on the Stage164 node
Check your DB2 version before installing DB2 ESE on the Stage164 node. For our example we used V8.1 FixPak 7. The DPF feature requires a separate license. Pay attention to the Linux kernel parameters, which can potentially affect the DB2 installation; follow the DB2 installation guide.
1. Before installation, create the DB2 groups and DB2 users.
[root@stage164 home]# groupadd db2grp1
[root@stage164 home]# groupadd db2fgrp1
[root@stage164 home]# useradd -g db2grp1 db2inst1
[root@stage164 home]# useradd -g db2fgrp1 db2fenc1
[root@stage164 home]# passwd db2inst1
2. Create the instance. Install DB2, then create the instance using the GUI or the command line. If using the command line, switch to the DB2 install path as the root user and issue the command below to create one DB2 instance, passing the users created in the previous step as parameters.
[root@stage164 home]# cd /opt/IBM/db2/V8.1/instance/
[root@stage164 instance]# ./db2icrt -u db2fenc1 db2inst1
Step 2. Configure the remote shell (rsh) service and the remote authority file
For the DPF environment, DB2 needs a remote shell utility to communicate and execute commands between the partitions. The rsh utility can be used for inter-partition communication; OpenSSH is another option that provides secure communication, but for simplicity we will not cover it in this article.
Check whether the rsh server is installed. If not, download it and issue "rpm -ivh rsh-server-xx.rpm" to install it.
[root@stage164 /]# rpm -qa | grep -i rsh
rsh-0.17-17
rsh-server-0.17-17
1. Confirm the rsh service can be started successfully.
[root@stage164 /]# service xinetd start
[root@stage164 /]# netstat -na | grep 514
2. Create or modify the file that authorises users to execute remote commands. You can create (or edit, if it already exists) an /etc/hosts.equiv file. The first column of this file is the machine name, and the second is the instance owner. For example, the following means only the db2inst1 user has authority to execute commands on Stage164 using rsh:
Stage164 db2inst1
3. Check whether rsh works correctly by issuing the command below as the db2inst1 user. If the date is not shown correctly, there is still a configuration problem.
[db2inst1@stage164 db2inst1]$ rsh stage164 date
Thu May 18 23:26:03 CST 2006
2. Restart the DB2 instance and make sure both partitions start successfully.
[db2inst1@stage164 db2inst1]$ db2stop force
05-18-2006 23:32:08 0 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful.
[db2inst1@stage164 db2inst1]$ db2start
05-18-2006 23:32:18 1 0 SQL1063N DB2START processing was successful.
05-18-2006 23:32:18 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.
3. Create the sample database and check the data distribution. According to the results below, the total row count of the DEPARTMENT table is 9: 4 of the 9 rows are distributed to partition 0 and 5 of the 9 to partition 1, according to the partitioning key DEPTNO.
[db2inst1@stage164 db2inst1]$ db2sampl
[db2inst1@stage164 db2inst1]$ db2 connect to sample
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department"
1
-----------
9
1 record(s) selected.
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where
dbpartitionnum(deptno)=0"
1
-----------
4
1 record(s) selected.
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where
dbpartitionnum(deptno)=1"
1
-----------
5
1 record(s) selected.
Step 4. Create the DS/EE users and configure them to access the DB2 database
If the DS/EE users and groups have already been created on the DS/EE node, then create the same users and groups on the DB2 server node. In any case, make sure you have the same DS/EE users and groups on both machines.
1. Create the DS/EE user and group on the DB2 server. In this example they are dsadmin/dsadmin. Also add the DS/EE user to the DB2 instance group.
[root@stage164 home]# groupadd -g 501 dsadmin
[root@stage164 home]# useradd -u 501 -g dsadmin -G db2grp1 dsadmin
[root@stage164 home]# passwd dsadmin
2. Add an entry to the /etc/hosts.equiv file which was created in Step 2. This gives dsadmin authority to execute commands on Stage164.
Stage164 db2inst1
Stage164 dsadmin
which is /home/dsadmin/Ascential/DataStage/DSEngine. Then, install the DB2
client and create one client instance at DS/EE node.
Step 2. Add the DB2 library and instance home to the DS/EE configuration file
The dsenv configuration file, located under the DSHOME directory, is one of the most important configuration files in DS/EE. It contains environment variables and library paths. In this step, we add the DB2 libraries to LD_LIBRARY_PATH so that the DS/EE engine can connect to DB2.
Note: the PXEngine library should precede the DB2 library in the LD_LIBRARY_PATH environment variable.
Configure the dsenv file as follows:
PATH=$PATH:/home/dsadmin/Ascential/DataStage/PXEngine/bin:/home/dsadmin/Ascential/DataStage/DSEngine/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dsadmin/Ascential/DataStage/PXEngine/lib:$DB2DIR/lib:$INSTHOME/sqllib/lib; export LD_LIBRARY_PATH
You can source this dsenv file from the dsadmin user's .bashrc file (/home/dsadmin/.bashrc) to avoid executing it manually every time. You then need to log out of the dsadmin user and log back on for it to be executed and take effect.
. /home/dsadmin/Ascential/DataStage/DSEngine/dsenv
Step 4. Copy db2nodes.cfg from the DB2 server to DS/EE and configure the environment variable
Copy the db2nodes.cfg file from the DB2 server to a directory on the DS/EE server. This file tells the DS/EE engine how many DB2 partitions there are on the DB2 server. Then create the environment variable APT_DB2INSTANCE_HOME in DataStage Administrator to point to that directory. This variable can be specified at the project level or the job level.
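A sketch of this step, assuming the file is placed under a sqllib subdirectory of the dsadmin home and that APT_DB2INSTANCE_HOME names the directory containing that sqllib (host names and paths are illustrative):
[dsadmin@transfer ~]$ mkdir -p /home/dsadmin/sqllib
[dsadmin@transfer ~]$ scp db2inst1@stage164:/home/db2inst1/sqllib/db2nodes.cfg /home/dsadmin/sqllib/
Then, in DataStage Administrator, set APT_DB2INSTANCE_HOME=/home/dsadmin at the project level, or add it as a job parameter.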
Step 5. NFS configuration: export /home/dsadmin/Ascential
First, add both machine names to the /etc/hosts file on both nodes so that they can resolve one another's network name. Then share the whole DS/EE directory to the DB2 server so that each partition can communicate with DS/EE.
1. On the DS/EE node, export the /home/dsadmin/Ascential directory. This is done by adding an entry to the /etc/exports file; it allows users from the stage164 machine to mount the /home/dsadmin/Ascential directory with read/write authority.
/home/dsadmin/Ascential stage164(rw,sync)
2. Once you have changed the /etc/exports file, you must notify the NFS daemon to reload the changes, or you can stop and restart the NFS services by issuing the following commands:
[root@transfer /]# service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
To avoid mounting it manually every time the machine restarts, you can also add this entry to the /etc/fstab file on the DB2 server to mount the directory automatically:
transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential nfs defaults 0 0
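To make the new export visible without restarting the services, and to perform the mount on the DB2 server, the following commands should work (illustrative):
[root@transfer /]# exportfs -ra
[root@stage164 /]# mkdir -p /home/dsadmin/Ascential
[root@stage164 /]# mount transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential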
Step 6. Verify the DB2 operator library and execute db2setup.sh and db2grants.sh
1. Execute the db2setup.sh script, which is located in $PXHOME/bin. Note that for a remote DB2 instance you may need to edit the script to change the connect user ID and password:
db2 connect to samp_02 user dsadmin using passw0rd
db2 bind ${APT_ORCHHOME}/bin/db2esql.bnd datetime ISO blocking all grant
public
# this statement must be run from /instance_dir/bnd
cd ${INSTHOME}/sqllib/bnd
db2 bind @db2ubind.lst blocking all grant public
db2 bind @db2cli.lst blocking all grant public
db2 connect reset
db2 terminate
2. Execute the db2grants.sh script:
db2 connect to samp_2 user dsadmin using passw0rd
db2 grant bind, execute on package dsadm.db2esql to group dsadmin
db2 connect reset
db2 terminate
Note: after stopping the DS/EE engine you need to exit the dsadmin user and log on again so that the dsenv configuration file is executed. Also make sure the interval between stop and start is longer than 30 seconds, so that the changed configuration takes effect.
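For reference, a typical way to stop and restart the engine on UNIX (run as the administrative user; the 30-second pause reflects the note above):
[dsadmin@transfer /]$ cd /home/dsadmin/Ascential/DataStage/DSEngine
[dsadmin@transfer DSEngine]$ . ./dsenv
[dsadmin@transfer DSEngine]$ ./bin/uv -admin -stop
[dsadmin@transfer DSEngine]$ sleep 30
[dsadmin@transfer DSEngine]$ ./bin/uv -admin -start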
Next, we test remote connectivity using DataStage Designer. Choose to import a plug-in table definition; in the window that appears, click Next. If the import succeeds, the remote DB2 connectivity configuration has succeeded.
Figure 3. DB2 Enterprise stage job
Double-click the DB2 Enterprise stage icon and set the following properties on the DB2 Enterprise stage. For detailed information, refer to the Parallel Job Developer's Guide.
• Client Instance Name: Set this to the DB2 client instance name. If you set this property, DataStage assumes you require a remote connection.
• Server: Optionally set this to the instance name of the DB2 server. Otherwise use the DB2 environment variable DB2INSTANCE to identify the instance name of the DB2 server.
• Client Alias DB Name: Set this to the DB2 client's alias database name for the remote DB2 server database. This is required only if the client's alias is different from the actual name of the remote server database.
• Database: Optionally set this to the remote server database name. Otherwise use the environment variables APT_DBNAME or APT_DB2DBDFT to identify the database.
• User: Enter the user name for connecting to DB2. This is required for a remote connection.
• Password: Enter the password for connecting to DB2. This is required for a remote connection.
Add the following two environment variables to this job via DataStage Manager: APT_DB2INSTANCE_HOME, which defines the db2nodes.cfg location, and APT_CONFIG_FILE, which specifies the engine configuration file.
To generate a quantity of test data, we created the following stored procedure:
while ( number>count)
do
set deptno=char( mod(number, 100) );
insert into department values( deptno, 'deptname', 'mgr', 'dep', 'location');
if( mod(number, 2000)=0) then
commit;
end if;
set number=number+1 ;
end while;
end@
Then we executed these two jobs against 100,000, 1M and 5M rows via DataStage Director and observed the results using the job monitor. The following figures show the test results with the DB2 Enterprise Stage and the DB2 API Stage.
Figure 10. 1,000,000 records (DB2 Enterprise Stage)
Figure 14. Compare performance between Enterprise Stage and API Stage
DB2 Enterprise Stage offers far better parallel performance than the other DB2 plug-in stages in a DB2 DPF environment; however, it requires that the hardware and operating system of the ETL server and the DB2 nodes be the same. Consequently, it is not a replacement for the other DB2 plug-in stages, especially in heterogeneous environments.
This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification
exam.
Regular readers may be feeling a sense of déjà vu. Haven't we seen this post before? I originally posted this in 2006 and this is the Director's Cut - I've added some deleted scenes, a commentary for DataStage 7.0 and 8.0 users and generally improved the entry. By reposting I retain the links from other sites such as my DataStage Certification Squidoo lens, with links to my certification blog entries and IBM certification pages.
This post shows all of the headings from the IBM exam objectives and describes how to prepare for each section.
Before you start, work out how you add environment variables to a job as job parameters, as they are handy for exercises and testing. See the DataStage Designer Guide for details.
Versions: Version 7.5.1 and 7.5.2 are the best to study and run exercises on. Version
6.x is risky but mostly the same as 7. Version 8.0 is no good for any type of
installation and configuration preparation as it has a new approach to installation
and user security.
Reading: Read the Installation and Upgrade Guide for DataStage, especially the
section on parallel installation. Read the pre-requisites for each type of install such as
users and groups, the compiler, project locations, kernel settings for each platform.
Make sure you know what goes into the dsenv file. Read the section on DataStage for
USS as you might get one USS question. Do a search for threads on dsenv on the
dsxchange forum to become familiar with how this file is used in different
production environments.
Exercise: installing your own DataStage Server Enterprise Edition is the best exercise
- getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the
DataStage Administrator and create some users and roles and give them access to
DataStage functions.
I've moved sections 4 and 9 up to the front as you need to study them before you run exercises and read about parallel stages in the other sections. Understanding how to use and monitor parallel jobs is worth a whopping 20%, so it's a good one to know well.
Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best
definition of the parallel architecture with better diagrams.
Reading: the Parallel Job Developer's Guide opening chapters on what the parallel engine and job partitioning are all about. Read about each partitioning type. Read how sequential file stages partition or repartition data and why datasets don't. The Parallel Job Advanced Developer's Guide has sections on environment variables that help with job monitoring; read about every parameter with the word SCORE in it.
The DataStage Director Guide describes how to run job monitoring - use the right
mouse click menu on the job monitor window to see extra parallel information.
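As a starting point for those environment variables (a sketch, not an exhaustive list), add these as job parameters and see what extra detail appears in the Director log; simply defining them is usually enough, but setting them to True is safe:
$APT_DUMP_SCORE=True
$APT_RECORD_COUNTS=True
$APT_PM_PLAYER_TIMING=True
The first dumps the job score (operators, partitions and node assignments) to the log, the second logs per-operator record counts at the end of the run, and the third reports CPU timing for each player process.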
Try creating one-node, two-node and four-node config files and see how jobs behave under each one. Try the remaining exercises on a couple of different configurations by adding the configuration environment variable to the job. Try some pooling options. I've got to admit I guessed my way through some of the pooling questions as I didn't do many exercises.
Generate a set of rows into a sequential file for testing out various partitioning types: one column with unique ids 1 to 100 and a second column with repeating codes such as A, A, A, A, A, B, B, B, B, B, etc. Write a job that reads from the input, sends it through a partitioning stage such as a Transformer and writes it to a Peek stage. The Director log shows which rows went where. You should also view the Director monitor and expand and show the row counts on each instance of each stage in the job, to see how stages are split and run on each node and how many rows each instance gets.
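If you prefer to generate that test file from the command line rather than with a Row Generator, something like this sketch does the trick (the file name is arbitrary):
awk 'BEGIN { for (i = 1; i <= 100; i++) printf "%d,%c\n", i, 65 + int((i - 1) / 5) % 26 }' > partition_test.txt
This writes ids 1 to 100 with a code column that repeats every five rows (A,A,A,A,A,B,B,B,B,B and so on).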
Use a filter stage to split the rows down two paths and bring them back together
with a funnel stage, then replace the funnel with a collector stage. Compare the two.
Test yourself on estimating how many processes will be created by a job and check
the result after the job has run using the Director monitor or log messages. Do this
throughout all your exercises across all sections as a habit.
Section 2 - Metadata
Section 3 - Persistent Storage
I’ve merged these into one. Both sections talk about sequential files, datasets, XML
files and Cobol files.
Reading: read the section in the DataStage Developers Guide on Orchestrate schemas
and partial schemas. Read the plugin guide for the Complex Flat File Stage to
understand how Cobol metadata is imported (if you don’t have any cobol copybooks
around you will just have to read about them and not do any exercises). Quickly scan
through the NLS guide - but don’t expect any hard questions on this.
Exercises: Step through the IBM XML tutorial to get the tricky part on reading XML
files. Find an XML file and do various exercises reading it and writing it to a
sequential file. Switch between different key fields to see the impact of the key on the
flattening of the XML hierarchy. Don’t worry too much about XML Transform.
Import from a database using the Orchestrate Import and the Plugin Import and
compare the table definitions. Run an exercise on column propagation using the Peek
stage where a partial schema is written to the Peek stage to reveal the propagated
columns.
Create a job using the Row Generator stage. Define some columns; on the Columns tab double-click a column to bring up the advanced column properties. Use some properties to generate values for different data types. Get to know the advanced properties page.
Create a really large sequential file, dataset and fileset and use each as a reference in
a lookup stage. Monitor the Resource and Scratch directories as the job runs to see
how these lookup sources are prepared prior to and during a job run. Get to know
the difference between the lookup fileset and other sources for a lookup stage.
One of the more difficult topics if you get questions for a database you are not
familiar with. I got one database parallel connectivity question that I still can’t find
the answer to in any of the manuals.
Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not
have enough database stages and DataStage 8 has a new approach to database
connections.
Reading: read the plugin guide for each enterprise database stage: Oracle, SQL
Server, DB2 and ODBC. In version 8 read the improved Connectivity Guides for
these targets. If you have time you can dig deeper, the Parallel Job Developers Guide
and/or the Advanced Developers Guide has a section on the
Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface libraries. Look for the
section called "Operator action" and read it for each stage. It’s got interesting bits like
whether the stage can run in parallel, how it converts data and handles record sizes.
Exercise: Add each Enterprise database stage to a parallel job as both an input and
output stage. Go in and fiddle around with all the different types of read and write
options. You don’t need to get a connection working or have access to that database,
you just need to have the stage installed and add it to your job. Look at the
differences between insert/update/add/load etc. Look at the different options for
each database stage. If you have time and a database try some loads to a database
table.
If you’ve used DataStage for longer than a year this is probably the topic you are
going to ace - as long as you have done some type of stage variable use.
Reading: there is more value in using the transformation stages than reading about it.
It’s hard to read about it and take it in as the Transformer stage is easier to navigate
and understand if you are using it. If you have to make do with reading then visit the
dsxchange and look for threads on stage variables, the FAQ on the parallel number
generator, removing duplicates using a transformer and questions in the parallel
forum on null handling. This will be better than reading the manuals as they will be
full of practical examples. Read the Parallel Job Advanced Developers Guide section
on "Specifying your own parallel stages".
Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and
Filter Stage.
Create some mixed-up source data with duplicate rows and a multi-field key. Try to remove duplicates using a Transformer preceded by a Sort stage, with a combination of stage variables holding the prior row's key value to compare against the new row's key value (see the sketch after this paragraph).
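A minimal sketch of that stage-variable pattern, assuming a single key column KeyCol on an input link named in, and stage variables evaluated top to bottom in the order listed (names are illustrative):
svIsDup:    If in.KeyCol = svPrevKey Then 1 Else 0
svPrevKey:  in.KeyCol
Output link constraint: svIsDup = 0
Because stage variables are evaluated in order, svIsDup compares the current key against the previous row's key before svPrevKey is refreshed, so only the first row of each key group passes the constraint.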
Process some data that has nulls in it. Use the null column in a Transformer
concatenate function with and without a nulltovalue function and with and without
a reject link from the Transformer. This gives you an understanding of how rows get
dropped and/or trapped from a transformer. Explore the right mouse click menu in
the Transformer, output some of the DS Macro values and System Variables to a
peek stage and think of uses for them in various data warehouse scenarios. Ignore
DS Routine, it’s not on the test.
Don’t spend much time on the Modify stage - it would take forever memorizing
functions. Just do an exercise on handle_null, string_trim and convert string to
number. Can be tricky getting it working and you might not even get a question
about it.
I’ve combined these since they overlap. Don’t underestimate this section, it covers a
very narrow range of functionality so it is an easy set of questions to prepare for and
get right. There are easy points on offer in this section.
Versions: Any version 7.x is best, version 6 has a completely different lookup stage,
version 8 can be used but remember that the Range lookup functionality is new.
Reading: The Parallel Job Developers Guide has a table showing the differences
between the lookup, merge and join stages. Try to memorize the parts of this table
about inputs and outputs and reject links. This is a good place to learn about some
more environment variables. Read the Parallel Job Advanced Developers Guide
looking for any environment variables with the word SORT or SCORE in them.
Exercises: Compare the way join, lookup and merge work. Create a job that switches
between each type.
Section 8 - Automation and Production Deployment (10%)
Versions: most deployment methods have remained the same from version 6, 7 and
8. Version 8 has the same import and export functions. DataStage 8 parameter sets
will not be in the version 7 exam.
Reading: you don't need to install or use the Version Control tool to pass this section; however, you should read the PDF that comes with it to understand the IBM recommendations for deployment. It covers the move from dev to test to prod. Read the Server Job Developer's Guide section on command-line calls to DataStage such as dsjob, dssearch and dsadmin.