
Datastage Interview Questions

1. What is a Configuration File?


• System size & configuration details are maintained external to the job design
• Can be modified to suit development & production environments, handle
hardware upgrades, etc. without redesigning/recompiling jobs
• The configuration file describes the available processing power in terms of
processing nodes, and so determines how many instances of a process will be
produced when you run a parallel job
– Minimum recommended #Nodes: ½ times #CPUs
– Usual starting point: #Nodes = #CPUs
– #Nodes < #CPUs if some CPUs are left free for the OS, database
and other applications
– #Nodes > #CPUs for I/O-intensive streams with poor CPU usage
• Associates the scratchdisk with each node
Configuration file separates configuration (hardware / software) from job design
• Specified per job at runtime by $APT_CONFIG_FILE
• Change hardware and resources without changing job design
Defines number of nodes (logical processing units) with their resources (need not
match physical CPUs)
• Dataset, Scratch, Buffer disk (file systems)
• Optional resources (Database, SAS, etc.)
• Advanced resource optimizations
o “Pools” (named subsets of nodes)
Multiple configuration files can be used at runtime
• Optimizes overall throughput and matches job characteristics to overall
hardware resources
• Allows runtime constraints on resource usage on a per job basis
DataStage jobs can point to different configuration files by using job parameters.
Thus, a job can utilize different hardware architectures without being recompiled.
It can pay to have a 4-node configuration file running on a 2 processor box, for
example, if the job is “resource bound.” We can spread disk I/O among more
controllers.

The sketch below shows the shape of a typical configuration file. Pools can be applied
to nodes or other resources. Note the curly braces following some disk resources.
Following the keyword "node" is the name of the node (logical processing unit).
The order of resources is significant: the first disk is used before the second, and so
on. Pool names such as "sort" and "bigdata", when used, restrict the corresponding
processes to the resources carrying that label. For example, "sort" restricts sorting to
node pools and scratchdisk resources labeled "sort".
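The example file itself was not reproduced in this copy; a minimal two-node sketch follows (the hostname and directory paths are illustrative assumptions, not values from the original):

{
    node "node1" {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets/node1" {pools ""}
        resource scratchdisk "/data/scratch/node1" {pools "" "sort"}
    }
    node "node2" {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets/node2" {pools ""}
        resource scratchdisk "/data/scratch/node2" {pools "" "sort"}
    }
}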
Database resources (not shown here) can also be created that restrict database access
to certain nodes. Question: Can objects be constrained to CPUs? No, a request is
made to the operating system and the operating system chooses the CPU.

2. What is APT_CONFIG_FILE?

APT_CONFIG_FILE is the environment variable through which DataStage determines
which configuration file to use (a project can have many configuration files). This is
generally how the file is selected in production.

If this environment variable is not defined, how does DataStage determine which
file to use?

If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for
the default configuration file (config.apt) in the following locations:

Current working directory.

INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level
directory of the DataStage installation.
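As a sketch (the path shown is purely illustrative), the variable is typically set in the environment, or overridden per job via a job-level environment variable parameter:

$ export APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/4node.apt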

3. What are the Conductor node, Section Leader and Player processes?

Parallel Job execution:

The conductor node runs the start-up process: it creates the score and starts up the
section leaders.

Section leaders communicate with the conductor only.

The conductor communicates with the players through the section leaders.

Every player has to be able to communicate with every other player. There are
separate communication channels (pathways) for control, messages, errors, and data.
Note that the data channel does not go through the section leader/conductor, as this
would limit scalability. Data flows directly from upstream operator to downstream
operator using APT Communicator class.
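As a rough sketch (not taken from the original, which illustrated this with a diagram) of how these processes relate for a two-node configuration:

conductor (1 per job: composes the score, starts the section leaders)
  |- section leader for node1 (1 per node)
  |    |- player ... player   (the operator processes on that node)
  |- section leader for node2
       |- player ... player

Control and message channels follow this tree, while data passes directly between players.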

4. Explain the Run Time Architecture of Enterprise Edition

Enterprise Edition Job Startup
Generated OSH and configuration file are used to “compose” a job
“Score”
• Think of “Score” as in musical score, not game score
• Similar to the way an RDBMS builds a query optimization plan
• Identifies degree of parallelism and node assignments for each operator
• Inserts sorts and partitioners as needed to ensure correct results
• Defines connection topology (virtual datasets) between adjacent operators
• Inserts buffer operators to prevent deadlocks (e.g., in fork-joins)
• Defines number of actual OS processes
• Where possible, multiple operators are combined within a single OS process
to improve performance and optimize resource requirements
• Job Score is used to fork processes with communication interconnects for
data, message, and control
• Set $APT_STARTUP_STATUS to show each step of job startup
• Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log
Enterprise Edition Runtime
It is only after the job Score and the processes are created that processing begins;
this is the "startup overhead" of an EE job.

• Job processing ends when either:


• Last row of data is processed by final operator
• A fatal error is encountered by any operator
• Job is halted (SIGINT) by DataStage Job Control or human intervention (e.g.
DataStage Director STOP)
Viewing the Job Score
Set $APT_DUMP_SCORE to output the Score to the job log
• For each job run, 2 separate Score dumps are written
• First score is for the license operator
• Second score entry is the real job score
To identify the Score dump, look for “main program: This step …”

Note that the score dump itself does not contain the word "Score" anywhere.

What are Job Scores?

The Score yields useful information:


# stages (operators in Framework)
# datasets
Mapping Node --> Partition
Remember: "1" in "node1" is not a number, but just a part of an arbitrary string to
name the node.
In this example, nodes get partitions 0,1,2, 3 in order of appearance in config file
(often, but not always the case in complex flows).
Stage combinations (none here, because we set APT_DISABLE_COMBINATION)
ANSWER to QUIZ:
1 sequential Generator
+ 2 parallel Peeks x 4 nodes
= 9 processes

5. What is runtime column propagation?


Runtime column propagation (RCP) allows DataStage to be flexible about the
columns you define in a job. If RCP is enabled for a project, you can just define the
columns you are interested in using in a job, but ask DataStage to propagate the
other columns through the various stages. So such columns can be extracted from the
data source and end up on your data target without explicitly being operated on in
between.

Sequential files, unlike most other data sources, do not have inherent column
definitions, and so DataStage cannot always tell where there are extra columns that
need propagating. You can only use RCP on sequential files if you have used the
Schema File property to specify a schema which describes all the columns in the
sequential file. You need to specify the same schema file for any similar stages in the
job where you want to propagate columns.
Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export

6. What happens if RCP is disabled?


DataStage Designer enforces Stage Input to Output column mappings.
• At job compile time Modify operators are inserted on output links in the
generated osh.

Modify operators can add or change columns in a data flow.


7. What happens if RCP is enabled?
DataStage does not enforce mapping rules
• No Modify operators are inserted at compile time
• Danger of runtime errors if incoming column names do not match the column
names on the outgoing link

8. What are the different options a logical node can have in the configuration file?

a. fastname – The fastname is the physical node name that stages use to open
connections for high-volume data transfers. The attribute of this option is often the
network name. Typically, you can get this name by using the UNIX command
'uname -n'.

b. pools – Names of the pools to which the node is assigned. Based on the
characteristics of the processing nodes you can group nodes into sets of pools.

A pool can be associated with many nodes and a node can be part of many
pools.

A node belongs to the default pool unless you explicitly specify a pools list for it
and omit the default pool name ("") from the list.

A parallel job, or a specific stage in the parallel job, can be constrained to run on a
pool (set of processing nodes).

If both the job and a stage within the job are constrained to run on specific
processing nodes, the stage will run on the nodes that are common to both the stage
constraint and the job constraint.

c. resource – resource resource_type "location" [{pools "disk_pool_name"}] |
resource resource_type "value". The resource_type can be canonical hostname (which
takes the quoted Ethernet name of a node in the cluster that is not connected to the
Conductor node by the high-speed network), or disk (a directory to which persistent
data is read/written), or scratchdisk (the quoted absolute path name of a directory on
a file system where intermediate data will be temporarily stored; it is local to the
processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
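Putting the three options together, a single node entry in the configuration file might look like the following sketch (the hostname, pool names and paths are invented for illustration):

node "node3" {
    fastname "etl_server2"
    pools "" "sort"
    resource disk "/data/datasets/node3" {pools ""}
    resource scratchdisk "/data/scratch/node3" {pools "" "sort"}
}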

9. How does DataStage decide on which processing node a stage should be run?

a. If a job or stage is not constrained to run on specific nodes, then the parallel engine
executes a parallel stage on all nodes defined in the default node pool. (Default
behavior.)

b. If the stage is constrained, then the constrained processing nodes are chosen while
executing the parallel stage.

c. When configuring an MPP, you specify the physical nodes in your system on
which the parallel engine will run your parallel jobs. The node from which you start
the parallel engine application is called the Conductor Node; for the other nodes, you
do not need to specify the physical node. Also, you need to copy the (.apt)
configuration file only to the nodes from which you start parallel engine
applications. It is possible that the conductor node is not connected to the high-speed
network switches, while the other nodes are connected to each other using very
high-speed network switches.

10. How do you configure your system so that you will be able to achieve
optimized parallelism?

a. Make sure that none of the stages are specified to be run on the conductor node.

b. Use the conductor node just to start the execution of the parallel job.

c. Make sure that the conductor node is not part of the default pool.

11. Although parallelization increases the throughput and speed of the process, why
is maximum parallelization not necessarily the optimal parallelization?

a. DataStage creates one process for every stage on each processing node. Hence, if
the hardware resources are not available to support the maximum parallelization, the
performance of the overall system goes down. For example, suppose we have an SMP
system with three CPUs and a parallel job with 4 stages, and we define 3 logical nodes
(one corresponding to each physical node, say a CPU). DataStage will then start
3 * 4 = 12 processes, which have to be managed by a single operating system.
Significant time will be spent in context switching and scheduling the processes.

b. Since we can have different logical processing nodes, it is possible that some nodes
will be more suitable for some stages while other nodes will be more suitable for other
stages.

12. How do you decide which node will be suitable for which stage?

a. If a stage is performing a memory- or scratch-intensive task then it should be run
on a node which has more memory and scratch disk space available for it. E.g.
sorting data is such a task and it should be run on such nodes.

b. If some stage depends on licensed software (e.g. the SAS stage, RDBMS-related
stages, etc.) then you need to associate those stages with the processing node that is
physically mapped to the machine on which the licensed software is installed.
(Assumption: the machine on which the licensed software is installed is connected to
the other machines by a high-speed network.)

c. If a job contains stages which exchange large amounts of data, they should be
assigned to nodes where the stages can communicate by either shared memory (SMP)
or a high-speed link (MPP) in the most optimized manner.

d. Basically, nodes are nothing but a set of machines (especially in MPP systems). You
start the execution of parallel jobs from the conductor node. The conductor node
creates a shell on the remote machines (depending on the processing nodes) and
copies the same environment to them. However, it is possible to create a startup
script which will selectively change the environment on a specific node. This script
has the default name startup.apt. As with the main configuration file, we can also
have many startup scripts; the appropriate one can be picked up using the
environment variable APT_STARTUP_SCRIPT.
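As a sketch (the path is illustrative only), the variable simply points at the startup script to be used for the session:

$ export APT_STARTUP_SCRIPT=/project/etc/startup_dev.apt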

13. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?

a. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the
parallel engine not to run the startup script on the remote shell.

14. What are the general guidelines one must follow while creating a configuration
file so that optimal parallelization can be achieved?

a. Consider avoiding the disk/disks that your input files reside on.

b. Ensure that the different file systems mentioned as the disk and scratchdisk
resources hit disjoint sets of spindles, even if they're located on a RAID (Redundant
Array of Inexpensive Disks) system.

15. Know what is real disk and what is NFS:

a. Real disks are directly attached, or are reachable over a SAN (storage area network
– dedicated, just for storage, low-level protocols).

b. Never use NFS file systems for scratchdisk resources; remember scratch disks are
also used for temporary storage of files/data during processing.

c. If you use NFS file system space for disk resources, then you need to know what
you are doing. For example, your final result files may need to be written out onto
the NFS disk area, but that doesn't mean the intermediate data sets created and used
temporarily in a multi-job sequence should use this NFS disk area. Better to set up a
"final" disk pool and constrain the result sequential file or data set to reside there,
but let intermediate storage go to local or SAN resources, not NFS.

• For example, suppose we have 2000 products and 2000 manufacturers, held in
two separate tables, and the requirement says that we need to join these tables
and create a new set of records. By brute force, it will take 2000 x 2000 / 2 =
2,000,000 steps. Now, if we divide this into 4 partitions, it will still take
4 * (500 * 2000 / 2) = 2,000,000 steps in total. Assuming the number of steps is
directly proportional to the time taken, and that we have sufficient computing
power to run 4 separate instances of the job performing the join, we can complete
this task in effectively 1/4th of the time. This is what we call the power of
partitioning.
There are mainly 8 different kinds of partitioning supported by Enterprise Edition (I
have excluded DB2 at this moment from the list). Usage of these partitioning
mechanisms completely depends on what kind of data distribution we have or are
going to have. Do we need to have the related data together, or does it not matter? Do
we need to look at the complete dataset at a time, or does it not matter if we are
working on a subset of the data?

16. Explain the Funnel Stage.


• A processing stage that combines data from multiple input links to a single
output link
• Useful to combine data from several identical data sources into a single large
dataset

17. Explain the differences between a funnel stage and a collector.


The Funnel stage copies different datasets into a single output dataset, while a
collector collects different partitions of the same dataset.

• In fact, a collector works on a single link – divided among different processing
nodes – to collect the data from the different partitions.

• The Funnel stage can run in parallel as well as sequential mode. In parallel
mode the different input datasets to the Funnel stage will be partitioned and
processed on different processing nodes; the processed partitions will then be
collected and funneled into the final output dataset. This can be controlled
from the partitioning tab of the Funnel stage. In sequential mode, if the
Funnel stage is in the middle of the job design then it first collects all the data
from the different partitions and then funnels all the incoming datasets.

Of course, the metadata for all the incoming inputs to the Funnel stage should be the
same for the Funnel stage to be able to work. However, the Funnel stage allows you to
specify (through the mapping tab) how output columns are derived from the input
columns. This is something a simple collection doesn't do.

• Remember that the metadata needs to be the same for all the input links. So, only
one set of input columns is shown, and that too is read-only.

The Funnel stage has mainly three modes in which it operates:

• Continuous Funnel

1. It behaves similarly to the Round Robin collection method. It reads one
record at a time from each link.

2. Combines the records of the input links in no guaranteed order.

3. It takes one record from each input link in turn. If data is not available
on an input link, the stage skips to the next link rather than waiting.

4. Does not attempt to impose any order on the data it is processing.

• Sort Funnel

1. Based on the key columns, this method sorts the data from the
different sorted datasets into a single sorted dataset.

2. Typically all the input datasets to the Funnel stage are hash
partitioned before they are sorted.

o Selecting the "Auto" partition type under the Input Partitioning tab
defaults to this (hash partitioning).

o Hash partitioning guarantees that all the records with the same
key column values are located in the same partition and are
processed on the same node.

3. If the data is not yet partitioned (by a previous stage) then partition the
data using hash or modulus partitioning. (This can be done from the
partitioning tab.)

o All the input datasets must be sorted on the same sort key.

o It is similar to the Sort Merge collection method.

o If the sorting and partitioning are carried out by a separate stage
prior to the Funnel stage then you must make sure that the
partitioning is preserved.

4. Combines the input records in the order defined by the value(s) of one
or more key columns; the order of the output records is determined by
these sorting keys.

5. Produces a sorted output (assuming the input links are all sorted on the
key).

6. Data from all input links must be sorted on the same key column.

7. Allows for multiple key columns:

o 1 primary key column, n secondary key columns.

o The Funnel stage first examines the primary key in each input
record.

o Where multiple records have the same primary key value, it will
then examine the secondary keys to determine the order of the
records it will output.

• Sequence Funnel

1. Similar to the Ordered collection method. You can choose the order
(through link ordering) in which data from the different links will be
funneled.

2. Copies all records from the first input link to the output link, then all
the records from the second input link, and so on.

18. Explain the Modify Stage.

• Modify column types

• Add or drop columns

• Less overhead than the Transformer

• Perform some types of derivations

o Null handling

o Date/time handling

o String handling

Ex: Specifying a Column Conversion (see the sketch below)
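The original example is not reproduced here. As a rough sketch of what goes into the stage's Specification property (the column names are invented, and the exact conversion-function names should be checked against the Modify stage documentation):

CUST_ID:int32 = CUST_ID
START_DATE:date = date_from_timestamp(START_TS)
BALANCE = handle_null(BALANCE, 0)
DROP TEMP_FLAG

The first line converts a column's type, the second derives a date from a timestamp, the third replaces nulls with 0, and the last drops a column.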

19. Describe input of partitioning and re-partitioning in an MPP/cluster
environment.

The inputs to partitioning are:

o The number of processing nodes available for the stage and the properties
specified on the partitioning tab of that particular stage.

o The partition type used by the stage.

The inputs to repartitioning are:

o Again, the number of processing nodes available to the current stage. If the
number of processing nodes for the current stage is different from (especially less
than) the previous stages, then DataStage decides to repartition the data.

It is possible that we have such limitations because of licensing (sometimes the
number of licenses for a given piece of software may be limited).

o A different partition type being used by the current stage than the one used for
the previous stages.

o A data requirement such that data from different partitions needs to be re-
grouped (even though the same partition type is used).

E.g. suppose customer data is grouped based on the age of the customers, but the
current stage needs to see all the customers belonging to a given state (possibly based
on state_cd). Now we do need to repartition the data.

For better performance of your job, you must try to avoid repartitioning as
much as possible. Remember that repartitioning involves two steps (which may be
unnecessary and taxing); hence understanding where, when and why re-
partitioning occurs in a job is necessary for being able to improve the performance of
the job.

o First, collect all the data being processed by the different processing nodes.

o Then, partition the data based on the new partitioning rules specified/identified.

o On MPP or cluster systems, there is the additional overhead of passing the data
over the network during the repartitioning.

20. Given a scenario, demonstrate knowledge of parallel execution. Identify the
partitioning type and parallel/sequential execution by analyzing a DataStage EE
screen shot.

Each parallel stage in a job can partition or repartition incoming data, or accept
same-partitioned data, before it operates on it. There is an icon on the input link
(called link marking) to a stage which shows how the stage handles partitioning.
While deciding about the parallel or sequential mode of operation at a given stage,
this information is really useful. We have the following types of icons:

o None

Shows a Sequential --> Sequential flow.

o Auto

Shows that DataStage will decide the most suitable partitioning for the stage.

Basically a Parallel --> Parallel flow.

o Bow Tie

It shows that repartitioning has occurred, mainly because the downstream stage
has different partitioning needs.

Basically a Parallel --> Parallel flow.

Well – it is still parallel to parallel; however, sometimes it is as bad as
sequential.

o Fan Out

It shows that data is being partitioned. This means that either it is the start of
the job or before this stage the execution was in sequential mode.

Basically a Sequential --> Parallel flow – if the stage is in the middle of the job flow.

o Same (box)

It shows that the next stage is going to use the same partitioning.

Basically a Parallel --> Parallel flow.

o Fan In

It shows that data is being collected at this stage.

Basically a Parallel --> Sequential flow.

21. Given a job design and configuration file, provide estimates of the number of
processes generated at runtime.

The number of processes generated at run time depends on various factors; moreover,
there is no straightforward formula to give the exact number of processes created.
Here are the main factors that affect the creation of processes:

In the ideal case, where all the stages are running on all the processing nodes and
there are no constraints on hardware, I/O devices, processing nodes etc., DataStage
will start one process per stage per partition.

o This is straightforward mathematics: if there are N partitions (i.e. N logical nodes)
and M stages in the job then approximately M * N processes will be created.
Remember that DataStage will also start section leader processes and a process on
the conductor node to keep the communications uniform and consistent. So, the
number of processes could actually be more than M * N.
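For example (a sketch under the ideal-case assumption above, with invented numbers): a job with 5 parallel stages run against a 4-node configuration file forks roughly 5 * 4 = 20 player processes, plus 4 section leader processes (one per node) and 1 conductor process, i.e. around 25 operating system processes in total before any operator combining.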

The configuration file is the first place where we put the information regarding
which nodes a stage can run on and which it cannot. In fact, it is not specified
directly inside the configuration file; rather, you declare Node Pools & Resource
Pools in the configuration file, and thus have a fair idea of what kind of stages can
be run on these pools.

o You also have the choice of not making a node or resource part of the default pool,
so you can run processes selectively on a given node. In this case, in the job design
you select the list of nodes on which a stage can run. This actually determines how
many processes will be started. Even here we have further constraints: it is quite
possible that your job is constrained to run on a specific set of nodes, in which case
the processes will run on the nodes common to the job as well as the current stage.

o Ultimately, the number of processes depends on how many partitions of data are
being processed. That again depends on whether the data is being partitioned or
collected and what kind of constraints are associated with the current stage.

In addition to Node Pool and Resource Pool constraints, you can also specify
Node Map constraints. This basically forces the parallel execution to run on specified
nodes. In fact, it is equivalent to creating another node pool (in addition to whatever
already exists in the configuration file) containing the set of nodes on which this
stage can run.

o So, processes will be started accordingly.

Also note that if a stage is running in sequential mode then one process will run
for it, and it will run on the Conductor node.

22. Explain the purpose and use of resource and node pools.

Before we discuss what resource pools and node pools are, it is important to
understand the meaning of a pool.

o A pool is a set of nodes or resources which can be used when configuring
DataStage jobs to make them run on suitable nodes and use appropriate resources
while processing data.

A node pool is used to create a set of logical nodes on which a job can run its
processes. Basically, all the nodes with similar characteristics are placed into the
same node pool. Depending on the hardware, software and software licenses
available you create a list of node pools. During job design you decide which stage
will be suitable on which nodes, and thus you constrain the stage to run on specific
nodes. This is needed; otherwise your job will not be able to take full advantage of
the available resources.

o Example – you do not want to run a Sort stage on a machine which doesn't have
enough scratch disk space to hold the temporary files.

o It is very important that the node pool declarations and the use of node constraints
do not cause any unnecessary repartitioning.

A resource pool is mainly used to decide which nodes will have access to which
resources.

o E.g. if a job needs a lot of I/O operations then you need to assign high-speed
I/O resources to the nodes on which you plan to run the I/O-intensive processes.

o Similarly, if a machine has SAS installed on it (in an MPP system) and you have to
extract/load data into the SAS system, then you may like to create a SAS_NODEPOOL
containing these machines.

23. Given a source DataSet, describe the degree of parallelism using auto and same
partitioning.

Auto partitioning can be used to let DataStage decide the most suitable
partitioning for a given flow. In this case the degree of parallelism mainly depends
on:

o How many logical processing nodes we have.

o What the suitable partitioning method at the given stage is.

o What kind of Node Pool or Resource Pool constraints have been put on the given
stage.

There are mainly two factors which decide the partitioning method that will be
chosen by DataStage:

o How the data has been partitioned by previous stages.

o What kind of stage the current stage is – i.e. what kind of data requirement does
it have?

If the current stage doesn't have any specific data setup requirement and it
doesn't have any previous stage, then DataStage typically uses Round Robin
partitioning to make sure that data is evenly distributed among the different
partitions.

Same partitioning is a little different from auto partitioning. In this case the
user is instructing DataStage to use the same partitioning as was used by the
previous stages. So, the partitioning method at the current stage is actually
determined by the partitioning configuration at the previous stages. This is generally
used when you know that Same partitioning will be more effective than allowing
DataStage to pick a suitable partitioning, which may cause repartitioning. So, if
you are using this partitioning, then data stays inside the same processing node; no
redistribution of data occurs. The degree of parallelism is decided by:

o The configuration at the previous stages.

o Constraints on the current node. In some cases there is no reason why one should
go for Same partitioning: suppose the previous stages had data partitioned across 5
processing nodes, while the current stage is constrained to run on only 3 processing
nodes; then using Same partitioning doesn't make sense. The data has to be
repartitioned.

24. Given a scenario, explain the process of importing/exporting data to/from the
framework (e.g., sequential file, external source/target).

You can export/import data between different frameworks. However, one thing you
must make sure of is that you are providing the appropriate metadata (e.g. column
definitions, formatting rules, etc.) needed for exporting/importing the data.

25. Explain the use of the various file stages (e.g., Sequential File, CFF, File Set, Data
Set) and where it is appropriate to use each.

Use of the Data Set Stage

1. In parallel jobs, data is moved around in data sets. These carry metadata with
them – column definitions and information about the configuration that was
in effect when the data set was created.

1. This information is used by DataStage in passing the data on to the
next stage, as well as in taking decisions such as whether Same
partitioning can be kept or repartitioning is required. For example, if
you have a stage which limits execution to a subset of the available
nodes, and the data set was created by a stage using all nodes,
DataStage can detect that the data will need repartitioning.

2. The Data Set stage allows you to store the data being operated on in a
persistent form, which can then be used by other DataStage jobs. Persistent
data sets are stored in a series of files linked by a control file. So, never try to
manipulate these files using commands like rm, mv, tr, etc., as it will corrupt
the control file. If needed, use the Data Set Management utility to manage
data sets.

3. Data sets are operating system (Framework) files.

4. Preserve partitioning.

5. Component data set files are written to on each partition.

6. Referred to by a header file.

7. Managed by the Data Set Management utility from the GUI (Manager,
Designer, Director).

8. Using data sets wisely can be key to good performance in a set of linked jobs:
no import/export conversions are required, and no repartitioning is required.

9. The Data Set stage allows you to read from and write to data set (.ds) files.

Persistent Data Sets

Accessed from/to disk using the Data Set Stage.

Two parts:

Descriptor file (user-specified name, e.g. "input.ds"):
Contains the metadata and the data location, but NOT the data itself:
the paths of the data files,
the table definition ("unformatted core" schema, no formats), e.g.
record (
partno: int32;
description: string;
)
and the config file used when the data was stored.

Data file(s):
Contain the data itself, as multiple Unix files (one per node), e.g.
node1:/local/disk1/…
node2:/local/disk2/…
accessible in parallel, with system-generated long file names to avoid naming
conflicts.

FileSet Stage
Use of File Set Stage
1. It allows you to read data from or write data to a file set.

2. It only operates in parallel mode.

3. DataStage can generate and name exported files, write them to their
destination, and list the files it has generated in a file whose extension is, by
convention, “.fs”.

4. A File Set is really useful when the OS limits the size of a data file to 2 GB and you
need to distribute files among nodes to prevent overruns.

5. The number of files created by a file set depends on:

1. The number of processing nodes in the default node pool.

2. The number of disks in the export or default disk pool connected to


each processing node in the default node pool.

3. The size of the partitions of the data set.

6. The amount of data that can be stored in the destination file is limited by:

1. The characteristics of the file system and

2. The amount of free disk space available.

7. Unlike data sets, file sets carry formatting information that describes the
format of the files to be read or written.

8. File set consists of:

1. Descriptor file – contains the location of the raw data files plus the metadata.
The descriptor file holds both the record metadata and the files' locations;
the locations are determined by the configuration file.

2. Individual raw data files (the number of raw data files depends on the
configuration file).

9. Similar to a dataset

1. Main difference is that file sets are not in the internal format and
therefore more accessible to external applications


Sequential File Stage

1. The stage executes in parallel mode if reading/writing to multiple files but


executes sequentially if it is only reading/writing one file.

2. You can specify that single files can be read by multiple nodes. This can
improve performance on cluster systems.

3. You can specify that a number of readers run on a single node. This means,
for example, that a single file can be partitioned as it is read (even though the
stage is constrained to running sequentially on the conductor node). 2 & 3 are
mutually exclusive.

4. Generally used to read/write flat/text files.

Lookup File Set Stage

1. It allows you to create a lookup file set or reference one for a lookup.

2. The stage can have a single input link or a single output link. The output link
must be a reference link.

3. When performing lookups, Lookup File stages are used in conjunction with
Lookup stages.

4. If you are planning to perform a lookup on a particular key combination then it
is recommended to use this file stage. If another file stage is used for lookup
purposes, the lookup becomes sequential.

External Source Stage

1. This stage allows you to read data that is output from one or more source
programs.

2. The stage calls the program and passes appropriate arguments.

3. The stage can have a single output link, and a single rejects link.

4. It allows you to perform actions such as interface with databases not


currently supported by the DataStage Enterprise Edition.

5. External Source stages, unlike most other data sources, do not have inherent
column definitions, and so DataStage cannot always tell where there are extra
columns that need propagating. You can only use RCP on External Source
stages if you have used the Schema File property to specify a schema which
describes all the columns in the sequential files referenced by the stage. You
need to specify the same schema file for any similar stages in the job where
you want to propagate columns.

External Target Stage: similar to the External Source stage, but used for writing data
to one or more target programs.

Complex Flat File Stage

Complex flat files typically have a hierarchical structure or include legacy data types.

1. Allows you to read or write complex flat files on a mainframe machine. This
is intended for use on USS systems.

2. When used as a source, the stage allows you to read data from one or more
complex flat files, including MVS datasets with QSAM (Queued Sequential
Access Method) and VSAM (Virtual Storage Access Method, a file
management system for IBM mainframe systems) files.

3. A complex flat file may contain one or more GROUPs, REDEFINES,
OCCURS, or OCCURS DEPENDING ON clauses.

4. Complex Flat File source stages execute in parallel mode when they are used
to read multiple files, but you can configure the stage to execute sequentially
if it is only reading one file with a single reader.

26. Explain how Data Sets are managed

• GUI (Manager, Designer, Director) – Tools > Data Set Management

• Data set management from the system command line

o orchadmin
- Unix command-line utility
- Lists records
- Removes datasets
Removes all component files, not just the header file
o dsrecords
- Lists the number of records in a dataset
Both dsrecords and orchadmin are Unix command-line utilities.
The DataStage Designer GUI provides a mechanism to view and manage data sets.
Note: since datasets consist of a header file and multiple component files, you can't
delete them as you would delete a sequential file; you would just be deleting the
header file, and the remaining component files would continue to exist in limbo.

Displaying Data and schema

The screen is available (data sets management) from Manager, Designer, and
Director.
Manage Datasets from the System Command Line
Dsrecords
Gives record count
Unix command-line utility
$ dsrecords ds_name
E.g., $ dsrecords myDS.ds
156999 records
Orchadmin
Manages EE persistent data sets
Unix command-line utility
E.g., $ orchadmin delete myDataSet.ds

27. Explain the Architecture of Ascential DataStage Parallel Extender

Key EE Concepts:
Parallel processing:
Executing the job on multiple CPUs
Scalable processing:
Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system
Parallel processing is the key to building jobs that are highly scalable.
The EE engine uses the processing node concept. "Standalone processes" rather than
"thread technology" are used. The process-based architecture is platform-independent,
and allows greater scalability across resources within the processing pool.
A processing node is a CPU on an SMP, or a board on an MPP.

Scalable Hardware Environments

DataStage Enterprise Edition is designed to be platform-independent – a single job, if


properly designed, can run across resources within a single machine (SMP) or
multiple machines (cluster, GRID, or MPP architectures).
While Enterprise Edition can run on a single-CPU environment, it is designed to take
advantage of parallel platforms.

Pipeline Parallelism

Transform, clean, load processes execute simultaneously


Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running
Advantages:
Reduces disk usage for staging areas

Keeps processors busy
Still has limits on scalability

Partition Parallelism
Divide the incoming stream of data into subsets to be separately
processed by an operation
Subsets are called partitions (nodes)
Each partition of data is processed by the same operation
E.g., if operation is Filter, each partition will be filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed
Partitioning breaks a dataset into smaller sets. This is a key to scalability. However,
the data needs to be evenly distributed across the partitions; otherwise, the benefits
of Partitioning are reduced. It is important to note that what is done to each partition
of data is the same. How the data is processed or transformed is the same.

Three-Node Partitioning

Here the data is partitioned into three partitions


The operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data will be processed three times faster

EE Combines Partitioning and Pipelining

Within EE, pipelining, partitioning, and repartitioning are automatic


Job developer only identifies:
Sequential vs. Parallel operations (by stage)
Method of data partitioning
Configuration file (which identifies resources)
Advanced stage options (buffer tuning, operator combining, etc.)
By combining both pipelining and partitioning, DataStage creates jobs with higher
volume throughput.
The configuration file drives the parallelism by specifying the number of partitions.

Job Design v. Execution

Much of the parallel processing paradigm is hidden from the programmer. The
programmer simply designates the process flow; EE (Enterprise Edition), using the
definitions in the configuration file, will actually execute UNIX processes that are
partitioned and parallelized.

28. What is the difference between a sequential file and a data set? When should the
Copy stage be used?
The Sequential File stage stores smaller amounts of data, in files with any extension,
whereas a Data Set is used to store huge amounts of data and is opened only with
the .ds extension. The Copy stage copies a single input data set to a number of
output data sets. Each record of the input data set is copied to every output data set.
Records can be copied without modification, or you can drop or change the order of
columns.

29. What is the difference between the Merge stage and the Join stage?

Merge and Join stage differences (features of the Merge stage):

1. Merge has reject links.
2. It can take multiple update links.
3. If you use it for comparison, then the first matching record will be the output,
because it uses the update links to extend the primary details coming from the
master link.

30. Explain the orchadmin Utility

orchadmin is a command-line utility provided by DataStage for working with data
sets. The general callable format is: $ orchadmin <command> [options] [descriptor
file]

1. Before using orchadmin, you should make sure that either the working directory
or $APT_ORCHHOME/etc contains the file "config.apt", OR the environment
variable $APT_CONFIG_FILE is defined for your session.

Orchadmin commands
The various commands available with orchadmin are:
1. CHECK: $orchadmin check
Validates the configuration file contents: accessibility of all the nodes defined in the
configuration file, scratch disk definitions, etc.
Throws an error when the config file is not found or not defined properly.
2. COPY : $orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the datasets of source with new destination descriptor file
name. Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new
name; the data is not copied.
b. The new datasets will be arranged according to the config file that is currently in
use, not according to the old config file that was in use with the source.
3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles….
The unix rm utility cannot be used to delete the datasets. The orchadmin delete or rm
command should be used to delete one or more persistent data sets.
-f makes a force delete. If some nodes are not accessible then -f forces deletion of
the dataset partitions on the accessible nodes and leaves the other partitions on the
inaccessible nodes as orphans.
-x forces the current config file to be used while deleting, rather than the one stored
in the data set.
4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds
This is the single most important command.
Without any options, it lists the number of partitions, number of segments, valid
segments, and the preserve-partitioning flag details of the persistent dataset.
-c : Print the configuration file that is written in the dataset if any
-p: Lists down the partition level information.
-f: Lists down the file level information in each partition
-e: List down the segment level information .
-s: List down the meta-data schema of the information.
-v: Lists all segments, valid or otherwise
-l : Long listing. Equivalent to -f -p -s -v –e
5. DUMP: $orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract) the records from the dataset.
Without any options the dump command lists down all the records starting from
first record from first partition till last record in last partition.
-delim '<string>' : Uses the given string as the delimiter for fields instead of space.
-field <name> : Lists only the given field instead of all fields.
-name : List all the values preceded by field name and a colon
-n numrecs : List only the given number of records per partition.
-p period (N): Lists every Nth record from each partition starting from first record.
-skip N: Skip the first N records from each partition.
-x : Use the current system configuration file rather than the one stored in dataset.
6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds
Without options deletes all the data (i.e. Segments) from the dataset.
-f: Force truncate. Truncates accessible segments and leaves the inaccessible ones.
-x: Uses the current system config file rather than the one stored in the dataset.
-n N: Leaves the first N segments in each partition and truncates the remaining.

7. HELP: $orchadmin -help OR $orchadmin <command> -help
Help manual about the usage of orchadmin or orchadmin commands.
31. If USS, define the native file format (e.g., EBCDIC, VSAM)

The native file format for USS (UNIX System Services) is EBCDIC. Extended Binary
Coded Decimal Interchange Code is an 8-bit character encoding used on IBM
mainframe operating systems.

ASCII is the American Standard Code for Information Interchange. EBCDIC and
ASCII are both ways of mapping computer codes to characters and numbers, as well
as other symbols typically used in writing. Most current computers have a basic
storage element of 8 bits, normally called a byte. This can have 256 possible values.
26 of these values need to be used for A-Z, and another 26 for a-z; 0-9 take up 10, and
then there are many accented characters and punctuation marks, as well as control
codes such as carriage return (CR) and line feed (LF). EBCDIC and ASCII both
perform the same task, but they use different values for each symbol. For instance,
in ASCII an 'E' is code 69, but in EBCDIC it is 197. Text conversion is very easy;
however, numeric conversion is quite tricky. For example:

1. Text string : It is very simple and portable. A simple mapping can be used to
map the string to the code and vice versa.

2. Binary : Binary numbers use the raw bytes in the computer to store numbers.
Thus a single byte can be used to store any number from 0 to 255. If two bytes
are used (16 bits) then numbers up to 65535 can be saved. The biggest
problem with this type of number storage is how the bytes are ordered, i.e.
Little Endian (Intel uses this; the least significant byte comes first, i.e. the high
byte is on the right), Big Endian (Motorola uses this; the high byte is on the
left) or Native Endian. E.g. 260 in Little Endian will be 04H 01H, while in Big
Endian it will be 01H 04H.

3. Packed decimal : In text mode each digit takes a single byte. In packed
decimal, each digit takes just 4 bits (a nibble). These nibbles are then packed
together, and the final nibble represents the sign. These are C (credit) is a +, D
(debit) is a - and F is unsigned, i.e. +. The number 260 in packed decimal
would be: 26H 0FH or 26H 0CH.

4. Floating point : Floating-point numbers are much harder to describe but have
the advantage that they can represent a very large range of values including
many decimal places.Of course there are some rounding problems as well.

The problem with ASCII/EBCDIC conversion when dealing with records that
contain both text and numbers is that numbers must not be converted with the same
byte conversion that is used for ASCII/EBCDIC text conversion. The only truly
portable way to convert such records is on a per-field basis: typically from EBCDIC
to ASCII for text, while numeric fields, packed decimal fields etc. are converted to
ASCII strings.

How is this done? The only way to do an EBCDIC to ASCII conversion is with a
program that has knowledge of the record layout. With DataStage, details of each
record structure are entered and how each field is to be converted can be set. Files
with multiple record structures and multiple fields can then be converted on a field-
by-field basis to give exactly the correct type of conversion. It is an ideal solution for
an EBCDIC to ASCII conversion as all data is retained. Packed decimal fields are
normally found in mainframe-type applications, often Cobol-related. RR32 can be
used to create Access-compatible files or even take ASCII files and create files with
packed decimal numbers.

32. Given a scenario, describe the proper use of a sequential file.

1. Read in parallel (e.g., reader per node, multiple files)

2. Handle various formats (e.g., fix Vs variable, delimited Vs non-delimited, etc.)

3. Describe how to import and export nullable data

4. Explain how to identify and capture rejected records (e.g., log counts, using
reject link, options for rejection)

What could be the various scenarios?

It could be whether the file is a fixed-length file (containing fixed-length records)
or a delimited file.

What is the volume of records in that flat file, and what kind of configuration
(e.g. number of nodes, I/O disks, etc.) do we have?

The scenario may also include whether you need to read a single file or multiple
files (e.g. a job can create 2 files with names file_0 and file_1, which may be input
for your stage; you must process both files for completeness). One more thing to
remember is that, when working with multiple files, they need not match a pattern;
you can mention the complete path name for each file.

We also need to consider whether the preceding stage is working in parallel
mode or sequential mode (of course, along with the mode of the Sequential File
stage). Accordingly we will have to decide which partitioning/collection mechanism
to use. The Sequential File stage executes in parallel mode if reading/writing multiple
files (i.e. the 'Read Method'/'Write Method' source option is set to the value "File
Pattern") but executes sequentially if it is only reading/writing one file. By default a
complete file will be read by a single node (although each node might read/write
more than one file). Each node writes to a single file, but a node can write more than
one file. For fixed-width files you can configure the stage to behave differently:

• You can specify that single files can be read by multiple nodes. This can
improve performance on cluster systems.

• You can specify that a number of readers run on a single node. For example, a
single file can be partitioned as it is read (even though the stage is constrained to
running sequentially on the conductor node)

Handling various formats (e.g., fix Vs variable, delimited Vs non-delimited, etc.)

Using the “Format tab” you can supply information about the format of the files in
the file set to which you are writing. (Default is variable length columns, surrounded
by double quote, separated by comma and rows delimited by Unix New Line
Character).

Final Delimiter: Specifies a delimiter for the last field of the record. When writing, a
space is now inserted after every field except the last in the record. Previously, a
space was inserted after every field including the
last.(APT_FINAL_DELIM_COMPATIBLE environment variable can be used for
compatibility with pre 7.5 releases.)

whitespace => Import skips all standard white-space characters (space, tab, and
new line) trailing a field.

end => The last field in the record is composed of all remaining bytes until the
end of the record.

none => Fields have no delimiter.

null => The delimiter is the ASCII null character.

<other> => A single ASCII character.

Final Delimiter String (Mutually exclusive with Final Delimeter):

Specifies one or more ASCII delimiter characters as the trailing delimiter for the last
field of the record. Use backslash (\) as an escape character to specify special
characters within a string, e.g. ‘\t’ for TAB. Import skips the delimiter string in the
source data file. On export, the trailing delimiter string is written after the final field
in the data file.

Fill Char: Byte value to fill in any gaps in an exported record caused by field
positioning properties. This is valid only for export. By default, the fill value is the
null byte (0); it may be an integer between 0 and 255, or a single-character value.

Record Length: Fixed-length records of a specified length (including any record
delimiter); or, if the keyword 'fixed' is used, the record must contain only fixed-
length columns so that the record length can be calculated.

Record Prefix (Mutually exclusive with Record Delimiter): Variable-length records
prefixed by a length prefix of 1, 2, or 4 bytes. The default is 1 byte.

Record Delimiter: Records terminated by a trailing delimiter character. Import skips


the delimiter; export writes the delimiter after each record. By default, the record
delimiter is the UNIX new line character. For fixed-width files with no explicit
record delimiter, remove this property altogether.

Record Delimiter String (Mutually exclusive with Record Delimiter): Specifies ASCII
characters to delimit a record. Pick 'DOS format' to get CR/LF delimiters; Unix
format is best specified using 'Record delimiter=UNIX new line', since it is only a
single character. For fixed-width files with no explicit record delimiter, remove this
property altogether.
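To tie these properties together, here is a hedged sketch of the record-level schema that such Format settings correspond to (the column names and types are invented for illustration):

record {final_delim=end, delim=',', quote=double, record_delim='\n'} (
    cust_id: int32;
    cust_name: string[max=30];
    balance: decimal[10,2];
)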

33. DataStage tip for beginners - parallel lookup types


Parallel DataStage jobs can have many sources of reference data for lookups
including database tables, sequential files or native datasets. Which is the most
efficient?

This question has popped up several times over on the DSExchange. In DataStage
server jobs the answer is quite simple: local hash files are the fastest method of a key-
based lookup, as long as the time taken to build the hash file does not wipe out the
benefits of using it.

In a parallel job there are a very large number of stages that can be used as a lookup,
a much wider variety than in server jobs; this includes most data sources and the
parallel staging formats of datasets and lookup filesets. I have discounted database
lookups, as the overhead of the database connectivity and any network passage
makes them slower than most local storage.

I did a test comparing datasets to sequential files to lookup filesets and increased row
volumes to see how they responded. The test had three jobs, each with a sequential
file input stage and a reference stage writing to a copy stage.

Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18
seconds. No lookup tables were created apart from the existing lookup fileset one.
This indicates the lookup data fit into memory and did not overflow to a resource
file.

1 Million Row Test


The lookup dataset took 35 seconds, the lookup fileset took 18 seconds and the
lookup sequential file took 35 seconds even though it had to partition the data. I
assume this is because the input also had to be partitioned and this was the
bottleneck in the job.

2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only
three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file up
to 1:32. The cost of partitioning the lookup data is really showing now.

3 million rows
The fileset, still at 45 seconds, swallowed up the extra 1 million rows with ease. The
dataset was up to 2:06 and the sequential file up to 2:20.

As a final test I replaced the lookup stage with a join stage and tested the dataset and
sequential file reference links. The dataset join finished in 1:02 and the sequential file
join finished in 1:15. A large join proved faster than a large lookup but not as fast as a
lookup fileset.

Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant,
they all load up very quickly, even database lookups are fast. If you have very large
lookup files spilling into lookup table resources then the lookup fileset outstrips the
other options. A join becomes a viable option. They are a bit harder to design as you
can only join one source at a time whereas a lookup can join multiple sources.

I usually go with lookups for code to description or code to key type lookups
regardless of the size, I reserve the joins for references that bring back lots of
columns. I will certainly be making more use of the lookup fileset to get more
performance from jobs.

Sparse database lookups, which I didn't test for, are an option if you have a very
large reference table and a small number of input rows.

34.What are Job parameters and Environment variables and their significance?

Instead of entering inherently variable factors as part of the job design, you can set
up parameters which represent processing variables. Operators can be prompted for
values when they run or schedule the job.

Environment variables are project wide defaults.

The significance of these parameters is the flexibility to configure the environment of
the ETL process, such as the source/target databases, their schema names and connection
parameters. The ETL process developed on a development server may use one set of
values and production another. By parameterizing these values, rework in
production is eliminated.
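As a minimal illustration (the parameter and file names here are hypothetical, not from the
original), job parameters are referenced in stage properties between # characters, so the
same job can point at different paths and DSNs per environment:

File name = #SourceDir#/customers_#RunDate#.txt
DSN = #TargetDSN#
User name = #DBUser#
Password = #DBPassword#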

Job Parameters:

Defining through Job Properties > Parameters

Used to pass business & control parameters to the job

Recap of sample usage:

• Setting Parameter Values

– Passed by a calling sequence/script/program

– If a value is set by the calling program, it overrides the default value

– If there is no default value, the calling sequence/script MUST set the
parameter, else the job fails

• Used For

• Flexibility – Change business parameters

• Reuse – Run the same job with different parameters to handle different


needs

• Portability – set path, user name, password, etc. according to the


environment

Some Common Parameters


• Run Date
• Business Date
• Path
• Filename suffix/prefix (input/output)
• User Name/password
• Database connection parameters, DSNs,
etc.
• Base currency, etc.

• Can also set/override Environment Variables Values - valid only within the
job

• Orchestrate Shell Script that is compiled by the engine

35. What is the difference between Link partitions and Link collectors?


• Link partitioning is the process in which one dataset is divided into a number
of virtual datasets specified by the number of nodes. Link partition methods
specify the algorithm used to execute the process.

• Collecting is the process in which a number of virtual datasets are combined
into a single dataset or a single stream of data. Link collector methods specify
the algorithm used to execute the process.

36. What is a Node map constraint and how is it useful?


• Though we define multiple nodes in the configuration file, which is used by
the whole of the job, we can restrict the execution of a stage to a particular
node.
• By enabling the Node map constraint for a particular stage and specifying a
node, we limit the execution of that stage to that node.
• This feature is helpful in a scenario where a particular stage, for example a
Sort stage, forms a bottleneck for the job. By assigning more resources to a
node and then limiting the Sort stage to that highly resourceful node, we can
improve the performance of the job.

37. What are the guidelines to decide the number of nodes that suit a particular
DataStage job?
• The configuration file tells DataStage Enterprise Edition how to exploit
underlying system resources. At runtime, EE first reads the configuration file
to determine what system resources are allocated to it, and then distributes
the job flow across these resources.
• There is not necessarily one ideal configuration file for a given system
because of the high variability between the way different jobs work. For this
reason, multiple configuration files should be used to optimize overall
throughput and to match job characteristics to available hardware resources.
• A configuration file with a larger number of nodes generates a larger number
of processes that use more system resources. Parallelism should be optimized
rather than maximized. Increasing parallelism may better distribute the
workload, but it also adds to the overhead because the number of processes
increases. Therefore, one must weigh the gains of added parallelism against
the potential losses in processing efficiency.
• If a job is highly I/O dependent or dependent on external (e.g. database)
sources or targets, it may be appropriate to have more nodes than physical
CPUs.
• For development environments, which are typically smaller and more
resource-constrained, create smaller configuration files (e.g. 2-4 nodes).

38. What is the significance of Resource Disk and Scratch Disk?


• Within the EE configuration file, each node is assigned one or more resource
disks and scratch disks using file system paths.
• Resource disks are used for storage of parallel datasets.
• Scratch disk resources are used for temporary storage, most notably during
sort operations.
• In an ideal configuration, these file system paths should reside on different
mount points, spread across available I/O channels using separate physical
disks.

39. In how many different ways can a DataStage job be aborted, for example for a
requirement like ‘stop processing the one million record input file if the error and
rejected records are greater than 100’?

• First method: On the reject link of the Transformer stage we can set the
property ‘Abort After Rows’. This property is available on the link
constraints.
• Second method: If the job has to be aborted based on a condition (for
example, if inputlink.value = ‘xyz’ then abort), then a call to the function
DSLogFatal can be issued, which logs a fatal error and aborts the job.
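A minimal BASIC sketch of the second method (the routine and value names are
illustrative, not from the original):

* Abort the job when an unexpected control value arrives
If InValue = "xyz" Then
   Call DSLogFatal("Aborting job: unexpected value received on the input link", "AbortCheckRoutine")
End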

40. Can a Sequencer job be aborted if any job in the sequence fails? If so, how?
• Yes, the sequencer job can be aborted if any job in the sequence fails. This can
be achieved by connecting a Terminator Activity stage to the Job Activity
stage and then specifying the trigger to the Terminator Activity stage as Job
Failed.

41. Can an email be sent from a Parallel Job?


• An email can be sent from a parallel job by using the DSSendMail routine as
an after-job or before-job subroutine.

42. What is an Invocation Id and what is its significance?


• We can specify a parallel job to run as multiple instances.
• After setting the multiple instance property for a job, whenever we run the
job it prompts for an invocation id. We can specify a string for this id, which
will appear as an extension to the job name in the Director.

43. What is the difference between the Parallel Transformer and the Basic Transformer?


Respective advantages and disadvantages?
• The Parallel Transformer stage uses a C++ operator. The routines that can be
called from this stage have to be written in C and not BASIC.
• The Basic Transformer stage is an added stage in the PX environment to ease
the migration of server jobs to PX. BASIC routines and BASIC transforms can
be called from this stage. Basic Transformers may degrade the performance of
a job.

44. What file do you prefer to use as intermediate storage between two jobs - a
sequential file, a dataset or a fileset - and why?
• A dataset is the most efficient stage for storing data between jobs. The advantage
of a dataset over a sequential file is that it preserves the partitioning of the
data. This removes the overhead of repartitioning the data, which happens
when a sequential file is used.
• A File Set stage, unlike a Data Set, carries formatting information with the file.
Even though the File Set stage can preserve partitioning, a dataset is more
efficient since the data is kept in the parallel engine's internal format in
operating system files referred to by a control file.

45. What is the significance of the CRC32 and Checksum functions and their relative
merits?
• CRC32 function: Returns a 32-bit cyclic redundancy check value for a string.
• Checksum function: Returns a number that is a cyclic redundancy code for
the specified string.
• The Checksum implementation in UniVerse is 16-bit and the algorithm is
additive, which can lead to some very undesirable results. The probability
that the same checksum will be generated for a row that has changed (with
the same key) is relatively high - somewhere around 1 in 65,536, or 2^16.
• CRC32, on the other hand, is not additive and the return is a 32-bit integer.
Checksum has a difficult time detecting small changes in moderate to large
fields, and this is what makes it undesirable to use as a change data capture
mechanism.
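A minimal BASIC sketch of CRC32-based change detection (the column names are
illustrative, not from the original):

* Build a single string image of the row and compare its CRC against the stored one
RowImage = CustomerId : "|" : CustomerName : "|" : Address
NewCRC = CRC32(RowImage)
If NewCRC <> OldCRC Then ChangedFlag = 1 Else ChangedFlag = 0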

46. What do you do to make sure the job logs don’t fill up the entire space
available for DataStage over a period of time?
• We can set the ‘Auto-purge of job log’ property in the DataStage Administrator,
which clears up the logs periodically.

47. If the input dataset has 10 columns and only 5 columns need to be mapped to
the output dataset without any transformations involved, then what stage do you
use and why?
• The Copy stage is the best choice, since by using any other stage, such as a
Transformer, we add to the overhead of the job.

48. All the operations that can be performed by a Copy stage or a Filter stage or a
Modify stage can be performed by a Transformer stage itself. What is the
significance of having all these stages?
• The Transformer stage wraps a whole set of functions which add to the overhead
of a job. Having a large number of Transformer stages in a job degrades its
performance.
• To achieve specific functions like filtering columns or altering schemas, using
specific stages like Copy or Modify can considerably reduce the overhead
of the job and hence make it efficient.

49. What is the stage useful for debugging in PX? How is it used?
• A Peek stage is useful for debugging in PX. By diverting a stream of data into
a Peek stage, the records can be readily displayed in the job log.

50. What is a phantom process?


• The "phantom" is an archaic term (coming from the UniVerse implementation
on PRIMOS) which just means a background process.
• The phantom processes are the ones actually doing the work, so they hold the
locks.
• The PHANTOM command causes a background session to be started from
the current user session and executes the item immediately following the
PHANTOM command.

51. What are the various return codes of a DataStage job?


Constants starting with DSJS are job status codes returned by the DSGetJobInfo
function. Possible constants are:
• DSJS.NOTRUNNABLE
• DSJS.NOTRUNNING
• DSJS.RESET
• DSJS.RUNFAILED
• DSJS.RUNNING
• DSJS.RUNOK
• DSJS.RUNWARN
• DSJS.STOPPED
• DSJS.VALFAILED
• DSJS.VALOK
• DSJS.VALWARN
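A minimal BASIC sketch of reading and testing these codes from job control code (the
job name is illustrative):

* Attach the job, read its status, log the outcome, then detach
hJob = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNOK Or Status = DSJS.RUNWARN Then
   Call DSLogInfo("Job finished successfully", "CheckStatus")
End Else
   Call DSLogWarn("Job did not finish cleanly", "CheckStatus")
End
ErrCode = DSDetachJob(hJob)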

52. What is the difference between a Normal lookup and a Sparse lookup and when
is each one used?
• Lookup Type: This property specifies whether the database stage will
provide data for an in-memory lookup (Lookup Type = Normal) or whether
the lookup will access the database directly (Lookup Type = Sparse).
• If the Lookup Type is Normal, the Lookup stage can have multiple reference
links. If the Lookup Type is Sparse, the Lookup stage can only have one
reference link.
• If the number of records on the input link is small and the reference data is
large, it is advisable to use a sparse lookup, since making a call to the database
for each input record may be less costly than loading all the reference data
into memory.

53. How can you read a variable number of input files, all in the same format, through
a single DataStage job? Explain if there is more than one way.
• A number of input files in the same format can be read through one
Sequential File stage by specifying a file pattern.
• A Folder stage can be used.
• A control file with the extension .fs can be created and all the files can be read
through a single File Set stage. Each input file will have an entry in
the .fs file with its complete path.

54. What are Job Sequences? Explain their significance.


– Specifies the sequence of jobs to run (control flow)
– Contains control information, such as different courses of action
depending on a job’s status (success or failure)
– Consists of Activity stages and Triggers
– Supports parameters for Activity stages and also for Job Sequences
– Restartable with the help of checkpoint information (maintained by
DataStage)
– Supports automatic exception handling

Creating Job Sequences


Create: File -> New

Scenario …
– 3 Dimension Load Jobs and 2 Fact Load Jobs
– Fact Load Jobs to start after all Dimension Load Jobs completes
successfully
– If any Dimension Load Jobs fail, terminate all running Jobs

Job Sequence for the above scenario involves …


Job Activity – Execute the Dimension and Fact Load Jobs
Sequencer Activity – Synchronize the control flow of Jobs
Terminator Activity – Stop the Job Sequence for failures

Sample Job Sequence 1 …

Job Activity …
– Executes a DataStage Job

Sequencer and Terminator Activities

Scenario …
– 5 input files are available in a folder with the same layout
– A single server job is available to sort an input file
– Wait for a trigger to start the Job
– Send a message to a computer after Job completion (success or failure)
– Handle exception

Job Sequence for the above scenario involves …


Job Activity – Execute the Sort Job
Wait-For-File Activity – Wait for the trigger file before executing the Job
Start and End Loop Activity – Create For…Next loop to process 5 files
ExecCommand – Send a message to a computer using OS command
Exception Handler – To handle exception when a failure occurs

Sample Job Sequence 2 …

Wait-For-File & Start and End Loop Activities

ExecCommand and Exception Handler Activities

Other Activity stages
Routine
– Specifies a routine from the Repository (but not transforms)
– Routine arguments can be accessed by subsequent activities
Email Notification
– Specifies the email notification to be sent using SMTP
– The email template file dssendmail_template.txt under the Projects folder
allows different email templates to be created for different projects
Nested Conditions
– Allows branching the execution of the sequence based on a condition
– Example: If today is a weekday execute weekday_Job, else Weekend_Job
User Variable
– Allows global variables to be defined within a sequence
– For example, the activity can be used to set job parameters

Job Sequence Properties


Job Sequence Properties…
– Select Edit -> Job Properties

55.What are the different trigger conditions available in a Job activity stage in a
Sequencer?
• Conditional. A conditional trigger fires the target activity if the source activity
fulfills the specified condition. The condition is defined by an expression, and
can be one of the following types:

OK. Activity succeeds.


Failed. Activity fails.
Warnings. Activity produced warnings.
ReturnValue. A routine or command has returned a value.
Custom. Allows you to define a custom expression.
User status. Allows you to define a custom status message to write to the log.

• Unconditional. An unconditional trigger fires the target activity once the


source activity completes, regardless of what other triggers are fired from the
same activity.

• Otherwise. An otherwise trigger is used as a default where a source activity


has multiple output triggers, but none of the conditional ones have fired.

56.What is the significance of sequencer stage in a sequencer job?


• A Sequencer allows you to synchronize the control flow of multiple activities
in a job sequence. It can have multiple input triggers as well as multiple
output triggers.
• The Sequencer operates in two modes:
ALL mode. In this mode all of the inputs to the sequencer must be TRUE for any of
the sequencer outputs to fire.
ANY mode. In this mode, output triggers can be fired if any of the sequencer inputs
are TRUE.

57. What are the options available to set the commit frequency while writing to an
Oracle database table?
APT_ORAUPSERT_COMMIT_ROW_INTERVAL and
APT_ORAUPSERT_COMMIT_TIME_INTERVAL.
You can make those two values a job parameter and set the value as necessary.
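As a hedged sketch (assuming the two environment variables have been added to the job
as parameters, and using an illustrative job name), a controlling job or routine could
override the commit interval for a single run:

* Override the commit row interval for this run only
hJob = DSAttachJob("LoadOracleTarget", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "$APT_ORAUPSERT_COMMIT_ROW_INTERVAL", "5000")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)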

58. How can you use a single Join stage to remove duplicate records from the input
links and then join the records?
• In the Join stage, we can use the Hash partitioning property.
• Then select the Perform Sort checkbox and then select Unique.
• Specify the keys on which unique sorting needs to be done.
This procedure will join the records after removing the duplicates on the input links.

59. What is Metadata and what is its significance?


• Metadata is the term used to describe data about data. It describes what the
data contains and where (and how) the data is stored.
• In DataStage the schemas used in the stages form the core of the metadata.
• Maintaining proper metadata standards can be critical for the project. It is
highly important that the metadata of the source and target tables/files is
properly maintained so that there is a unique source of metadata
definitions at any point in time.

60. What is a Build Op stage and what is its significance?


The Build Op stage is used to design and build our own bespoke operator as a stage to
be included in DataStage parallel jobs.
A PX buildop is a user-defined C++ program, written in a PX proprietary style, that
does record processing and data transformation.

61. Explain the usage of DataStage Manager


• Manager is the user interface for viewing and editing the contents of
the DataStage Repository.
• Manager allows you to import and export items between different
DataStage systems, or exchange metadata with other data warehousing
tools.
• You can also analyze where particular items are used in your project
and request reports on items held in the Repository.

62. What is usage analysis report in DataStage manager?
The Usage Analysis tool allows you to check where items in the DataStage
Repository are used. Usage Analysis gives you a list of the items that use a particular
source item. You can in turn use the tool to examine items on the list, and see where
they are used.
The Usage Analysis report gives the following information about each of
the items in the list:
• Relationship: The name of the relationship from the source to the target
• Type: The type of the target item
• Name: The name of the target item
• Category: The category where the item is held in the DataStage
Repository
• Source: The name of the source item

Usage Analysis – Example – F1

63. How do you Export and Import DataStage Components?
All the DataStage components can be exported into a file
(.dsx or .xml file).

Select Export -> DataStage Components to start the export.


Options are available to export the whole project or to export individual items:

• Jobs
• Table definitions
• Shared Containers
• Data Elements

Specify the location and name of the file to which the component has to be
exported.

Importing the DataStage Components
• DataStage components which are available in a .dsx or .xml file can be
imported into a project
• The components would be placed in the appropriate categories and all
default properties would also be migrated along with the components
• Specify the name and location of the file to Import.
Options are available to overwrite existing components by default or to prompt before
doing so.

64. DataStage Reporting Assistant


• The DataStage Reporting Tool is flexible and allows you to generate
reports at various levels within a project: entire job, single stage, set of
stages, etc.
• Information generated for reporting purposes is stored in a relational
database on the DataStage client.
• This information can then be used to print a report, write a report to a file,
or be interrogated by a third-party tool.
• A Microsoft Access database is provided on the DataStage client for this
purpose (C:\Program Files\Ascential\DataStage\Documentation Tool).
• If you want to use an alternative database, run the script to create the
database, assign it a DSN and then select that DSN in the Update
Options.
• In the Manager, click Tools -> Reporting Assistant. Select the whole project
or individual components to track and click the Update Now button
to register them in the Reporting Assistant database.

RA – Documentation Tool
• The Documentation Tool can be invoked by clicking the Doc Tool button of the
Reporting Assistant
• Observe the registered components available
• Select individual components and click on the Print Preview button to view
the reports
• Print the reports using the print icon
• Export the reports to files

65. What are Column Export and Column Import stages?


The Column Export stage exports data from a number of columns of different data
types into a single column of data type string or binary.
The Column Import stage imports data from a single column and outputs it to one or
more columns. You would typically use it to divide data arriving in a single column
into multiple columns.

66. What is Combine Records Stage?


The Combine Records stage combines records (i.e., rows), in which particular key-
column values are identical, into vectors of sub records. As input, the stage takes a
data set in which one or more columns are chosen as keys. All adjacent records
whose key columns contain the same value are gathered into the same record as sub
records.

67. What is the difference between exporting Job components and Job executables?
Which one is preferred while exporting jobs from Development environment to
Production Environment?
When a job component is exported, the entire design information of the job is
exported.
When a job executable is exported, the design information is omitted. Only the
executable is exported.
When moving jobs from development to production, exporting job executables is
recommended, since modification of jobs in production is not preferred and
hence restricted.

68. What is the Number of Readers per Node property in the Sequential File stage?
• Specifies the number of instances of the file read operator on each processing
node. The default is one operator per node per input data file. If numReaders
is greater than one, each instance of the file read operator reads a contiguous
range of records from the input file. The starting record location in the file for
each operator, or seek location, is determined by the data file size, the record
length, and the number of instances of the operator, as specified by
numReaders.
• The resulting data set contains one partition per instance of the file read
operator, as determined by numReaders. The data file(s) being read must
contain fixed-length records.

69.What are Head and Tail Stages?


The Head Stage selects the first N records from each partition of an input data set
and copies the selected records to an output data set.
The Tail Stage selects the last N records from each partition of an input data set and
copies the selected records to an output data set.

70. Can you use a Timestamp data type to read a timestamp value with
microseconds in it? For example ‘10-10-2004 10:10:10.3333’?
Yes, a timestamp data type can be used to read the microseconds,
but care must be taken that the Extended property of the timestamp data type is
set to Microseconds.

71.In a configuration file, if we keep on adding the nodes, what is the downstream
impact?
A configuration file with a larger number of nodes generates a larger number
of processes that use more system resources. Parallelism should be optimized rather
than maximized. Increasing parallelism may better distribute the workload, but it
also adds to the overhead because the number of processes increases. Therefore, one
must weigh the gains of added parallelism against the potential losses in processing
efficiency.

72. What is the best way to read an input file with a variable number of columns, say
if the maximum number of columns possible is known?
When there is a variable number of columns in an input file, it is better to read the
entire file as a single column and later split that column using a Column Import
stage.

73.Merge stage - no data in columns from right table

The Merge stage is a processing stage. It can have any number of input links, a single
output link, and the same number of reject links as there are update input links.
Some example merges are shown in the Parallel Job Developer's Guide.

The Merge stage is one of three stages that join tables based on the values of key
columns. The other two are:
Join stage and Lookup stage

The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for data being input (for example, whether
it is sorted).

The Merge stage combines a sorted master data set with one or more update data
sets. The columns from the records in the master and update data sets are merged so
that the output record contains all the columns from the master record plus any
additional columns from each update record. A master record and an update record
are merged only if both of them have the same values for the merge key column(s)
that you specify. Merge key columns are one or more columns that exist in both the
master
and update records.
The data sets input to the Merge stage must be key partitioned and sorted. This
ensures that rows with the same key column values are located in the same partition
and will be processed by the same node. It also minimizes memory requirements
because fewer rows need to be in memory at any one time. Choosing the auto
partitioning method will ensure that partitioning and sorting is done. If sorting and
partitioning are carried out on separate stages before the Merge stage, DataStage in
auto partition mode will detect this and not repartition (alternatively you could
explicitly specify the Same partitioning method).
As part of preprocessing your data for the Merge stage, you should also remove
duplicate records from the master data set. If you have more than one update data
set, you must remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several
reject links. You can route update link rows that fail to match a master row down a
reject link that is specific for that link. You must have the same number of reject links
as you have update links. The Link Ordering tab on the Stage page lets you specify
which update links send rejected rows to which reject links. You can also specify
whether to drop unmatched master rows, or output them on the output data link.
The stage editor has three pages:
Stage page. This is always present and is used to specify general information about
the stage.

Inputs page. This is where you specify details about the data sets being merged.

Outputs page. This is where you specify details about the merged data being output
from the stage and about the reject links.

74. DataStage: Join or Lookup or Merge or CDC?

Many times this question pops up in the mind of DataStage developers.
All the above stages can be used to do the same task: match one set of data (say
primary) with another set of data (references) and see the results. DataStage
normally uses different execution plans (hmm… I should ignore my Oracle legacy
when posting on DataStage). Since DataStage is not as nice as Oracle about showing
its execution plan easily, we need to fill in the gap of the optimizer and analyse our
requirements ourselves. I have come up with a nice table.
Most importantly, it is the primary/reference ratio that needs to be considered, not the
actual counts.
Primary Source Volume    Reference Volume           Preferred Method
Little (< 5 million)     Very Huge (> 50 million)   Sparse Lookup
Little (< 5 million)     Little (< 5 million)       Normal Lookup
Huge (> 10 million)      Little (< 5 million)       Normal Lookup
Huge (> 10 million)      Huge (> 10 million)        Join
Huge (> 10 million)      Huge (> 10 million)        Merge, if you want to handle
                                                    rejects in reference links

75. DataStage tip for beginners - parallel lookup types

(This tip repeats question 33 verbatim; see that entry for the full discussion and test
results.)

76. What is the difference between 'Validated OK' and 'Compiled' in DataStage?

When we say "Validating a Job", we are talking about running the Job in the "check
only" mode. The following checks are made:

- Connections are made to the data sources or data warehouse.


- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that
use the local data source are created, if they do not already exist.

77. What are the steps involved in the development of a job in DataStage?


The steps required are:

Select the data source stage depending upon the source, for example flat file, database,
XML, etc.

Select the required stages for the transformation logic, such as Transformer, Link
Collector, Link Partitioner, Aggregator, Merge, etc.

Select the final target stage where you want to load the data, whether it is a data
warehouse, data mart, ODS, staging area, etc.

78. DataStage Server vs DataStage Parallel


There are several differences. The top three (in my opinion):
Server Jobs
1 - Generate BASIC programs after job compilation
2 - Design-time parallelism level definition
3 - Different stages (Hashed File, IPC, ...)

Parallel Jobs
1 - Generate OSH programs after job compilation
2 - Execution-time parallelism level definition
3 - Different stages (Data Sets, Lookup, ...)

There are three types of DataStage ETL jobs:


- Server jobs are from the original DataStage version.
- Mainframe jobs are from DataStage MVS and only run on a mainframe.
- Parallel jobs are now in DataStage Enterprise Edition. Ascential purchased a
product called Orchestrate and referred to it in the version 5 and 6 releases as Parallel
Extender, but in version 7.x it is known as DataStage Enterprise Edition or Enterprise
server. The term Parallel Extender is no longer used. That's why some forums have
separate forums for DataStage Server Edition and DataStage Enterprise Edition.

You can read more about them in my blogs that compare server and parallel jobs.
Process in parallel or take up folk dancing:
http://blogs.ittoolbox.com/bi/websphere/archives/006622.asp
DataStage server v enterprise: some performance stats:
http://blogs.ittoolbox.com/bi/websphere/archives/006976.asp

There is also a description of each DataStage edition on the DataStage wiki page:
http://wiki.ittoolbox.com/index.php/Topic:WebSphere_DataStage

Process in parallel or take up folk dancing.

Some people will tell you that if you are not doing your batch data integration using a
parallel processing engine you might as well shoot your ETL server and take up folk
dancing. You are just not a serious enterprise data integration player. This has led to
some confusion from developers who want to know if parallel ETL is a better
career path, and consternation from DataStage customers who have the non-parallel
version and don't like folk dancing.

The Ab Initio ETL tool had such a good parallel engine they didn't need to advertise.
They just had a word in the ear of a customer, "you know how that fact load takes 12
hours, well we can do it in 1".

Now most of the serious ETL vendors such as IBM/Ascential, Informatica and SAS
have got automated parallel processing.

DataStage Version Confusion, more common than bird flu


WebSphere Information Integration has several varieties of ETL jobs including
parallel jobs, server jobs and mainframe jobs. These come in various editions such as
the Enterprise, Server, PeopleSoft, MVS or SOA Editions. There is not really a one-to-
one relationship between job types and editions and that's where things start to get
confusing. The WebSphere DataStage wiki page is an attempt to summarise the
differences between these versions and may inoculate you against version confusion.

Are you already using DataStage?


Customers with DataStage Server Edition such as older users and PeopleSoft EPM
users may be wondering if they need to upgrade to parallel jobs. It is not
encouraging to them that the parallel version is two to three times the price of the
server version. It is also a worry that recent discussions at the DataStage Exchange
forum shows some developers find server jobs easier to use and faster to implement.
By the way if you want to reach me I'll be out in the woods hiding from the parallel
police. It was good news that server jobs would still be around in the next release.

The obvious incentive for going parallel is data volume. Parallel jobs can remove
bottlenecks and run across multiple nodes in a cluster for almost unlimited
scalability. At this point parallel jobs become the faster and easier option. With the
release of DataStage Hawk next year an added incentive will be the extra
functionality of parallel jobs such as Quality Stage matching. Recent product
upgrades have made parallel jobs easier to build. Hopefully further improvements
and a lower price will be forthcoming in the next release.

So how do I go parallel without redoing the whole lot?


If you are upgrading from server to parallel jobs then consider an iterative
implementation rather than a big bang.
Small volume jobs may actually be faster if left as server jobs as parallel jobs can
have a longer start up time.
Server and parallel jobs will run quite happily from the same sequencer or
scheduler.
BASIC transformers and server job containers can be put into parallel jobs. This
gives you a parallel job that re-uses your old transformation routines; it will be
slower than a 100% parallel job but this might not matter, it will still be a lot faster
than the old server job and be faster to build.
PeopleSoft-EPM upgrades will continue to deliver server jobs so the more you can
leave as server versions the easier the upgrade path.
Those with an SOA requirement can turn either server or parallel jobs into real time
services.

Just change the very high volume jobs and do some performance testing to compare
different designs. Release a small number to production to see how they run.

Which version will get me a job?


There will certainly be demand for skilled developers with experience in server or
parallel or combined. The server job version will remain popular until the parallel

version becomes cheaper and easier to use. The parallel version will remain popular
for large implementations such as master data management and enterprise
integration across clusters. Combination implementations will be popular with
customers upgrading to parallel jobs or starting with Enterprise Edition but choosing
to use server jobs for small volumes.

DataStage server v enterprise: some performance stats

I ran some performance tests comparing DataStage server jobs against parallel jobs
running on the same machine and processing the same data. Interesting results.

Some people out there may be using the server edition, most DataStage for
PeopleSoft customers are in that boat, and getting to the type of data volumes that
make a switch to Enterprise Edition enticing. Most stages tested proved to be a lot
faster in a parallel job then a server job even when they are run on just one parallel
node.

All tests were run on a 2 CPU AIX box with plenty of RAM using DataStage 7.5.1.

The sort stage has long been a bugbear in DataStage server edition, prompting many
to sort data in operating system scripts:
1mill server: 3:17; parallel 1node:00:07; 2nodes: 00:07; 4nodes: 00:08
2mill server: 6:59; parallel 1node: 00:12; 2node: 00:11; 4 nodes: 00:12
10mill server: 60+; parallel 2 nodes: 00:42; parallel 4 nodes: 00:41

The parallel sort stage is quite a lot faster than the server edition sort. Moving from 2
nodes to 4 nodes on a 2 CPU machine did not see any improvement on these smaller
volumes and the nodes may have been fighting each other for resources. I didn't
have time to wait for the 10 million row sort to finish but it was struggling along
after 1 hour.

The next test was a transformer that ran four transformation functions including
trim, replace and calculation.
1 mill server: 00:25; parallel 1node: 00:11; 2node: 00:05: 4node: 00:06
2 mill server: 00:54; parallel 1node: 00:20; 2node: 00:08; 4node: 00:09
10mill server: 04:04; parallel 1node: 01:36; 2node: 00:35; 4node: 00:35

Even on one node with a compiled transformer stage the parallel version was three
times faster. When I added one node it became twelve times faster with the benefits
of the parallel architecture.

Aggregation:
1 mill server: 00:57; parallel 2node: 00:15
2 mill server: 01:55; parallel 2node: 00:28

Reading from DB2:


2 mill rows server: 5:27; parallel 1node: 01:56; 2node: 01:42

The DB2 read was several times faster and the source table with 2million plus rows
had no DB2 partitioning applied.

So as you can see even on a 1 node configuration that does not have a lot of parallel
processing you can still get big performance improvements from an Enterprise
Edition job. The parallel stages seem to be more efficient. On a 2 CPU machine there
were some 10x to 50x improvements in most stages using 2 nodes.

If you are interested in these type of comparisons leave a comment and in a future
blog I may do some more complex test scenarios.

79. What do you mean I need to optimize small jobs?


There is a lot of focus on the performance improvement techniques for large jobs, but
one of the important tips is how making smaller jobs faster can make larger jobs
faster.

Large jobs are impacted by smaller jobs, especially when there is a flotilla of small
jobs constantly taking small chunks of CPU, RAM and disk I/O away from the larger
jobs.

Impact of Startup Time


On my modest little dev box I have a server job and a parallel job that both process
10 rows read from a sequential file. The server job takes 1-2 seconds to finish and the
parallel job takes 7 seconds (1 node). Parallel jobs have a slower start up time as
multiple processes are started up on each processing node. IBM-Ascential has
improved on these startup times in the next release.

Impact of Stress
I start a large parallel job processing millions of rows from a database over two
nodes. I rerun my two small jobs. The server job takes 2 seconds and the parallel job
is 10 seconds. On the next run they are 2 and 14. On the next run 2 and 11. The server
job is not impacted however the parallel job is.

I start up four more parallel jobs and retest my two small jobs. The server job is now
up to 3 seconds, the parallel job is up to 27 and 33 seconds. When I switch my small
parallel job to use four nodes instead of one it jumps to 44 and 43 seconds.

Now imagine you are running several hundred of these small jobs during a batch
window alongside some very large jobs. You begin to see how the performance of
the small jobs can be as important as the large jobs. Not only are the smaller jobs
taking much longer to run but they are taking resources away from the larger jobs.

Using both versions


There are sites out there using parallel jobs for large volumes and server jobs for
small volumes. Server jobs have a faster startup up time and they do not use as much
of the node resources as the parallel jobs. Not only do the smaller jobs run faster but
the impact on the larger jobs is reduced allowing them to run faster.

Ease of Use
With DataStage Enterprise Edition you get both a parallel and a server job license so
you can use both types. Anyone proficient in parallel jobs will find server jobs easy to
write. Both jobs can be scheduled from the same sequence jobs or from the same
command line scheduling scripts via the dsjob command. Most types of custom or
MetaStage job reporting will collect results from each job type.

Going Parallel all the Way
If you do choose to use just parallel jobs then limit the overhead of small jobs by
running them on a single node. This can be done by adding the environment
variable $APT_CONFIG_FILE and setting it to a single node configuration file. This
stops the job from starting too many processes, or from partitioning data whose
volume is so small that partitioning is a waste of time.

Looking for Feedback


Are you using a combination of both job types? Is it working well? Do you notice a
difference in jobs running on one node versus multiple nodes?

80.Why database generated surrogate keys drive me nuts!!!


In the world of ETL data loads there are two ways to generate unique id numbers.
You can generate them in the ETL job or you can generate them in the target
database. After years of doing it both ways I am going to draw my line in the sand
and state that the database way drives me nuts!

I'm going to lay my cards on the table here, and unlike Gary Busey going all in on
celebrity Texas Hold-em, it's not a pair of deuces. You get greater flexibility and
control generating your surrogate keys within the ETL job rather than in the database
on insert. Especially when you have purchased a high end ETL tool like DataStage
and plan to use data quality best practices around that tool.

81. What is a surrogate key?


The dictionary tells us that a surrogate is someone who takes the place of another, a
substitute. In data modelling parlance, the primary key of a row in the legacy
database may have several fields; when loading into a BI database such as a data
warehouse or operational data store, a surrogate key is added, and being a single
numeric field it offers better performance on table joins.

If those descriptions have done nothing for you then let's just call it a unique ID field
for data.

82.The RDBMS generated key


Most databases have a way of generating a surrogate key for a table. In SQL Server
and Sybase it is called an identity field and the field automatically increments when a
row is inserted. In DB2 and Oracle you can create a counter or sequence object and
every time you add a row to the table you increment that counter.

In DataStage, if the target table has an Identity field you set it by inserting all the
other columns and letting the Identity field set itself. For a DB2 sequence you set it by
using user-defined SQL on insert statements that populate the field with the
sequence NEXTVAL command. E.g.

An automatically generated insert statement:


INSERT INTO CUSTOMER
(CUSTOMER_ID, CUSTOMER_NAME)
VALUES
(ORCHESTRATE.DUMMYCUSTID,
ORCHESTRATE.CUSTNAME)

Gets changed to use a sequence to generate a new id:
INSERT INTO CUSTOMER
(CUSTOMER_ID, CUSTOMER_NAME)
VALUES
(CUSTOMERSEQ.NEXTVAL,
ORCHESTRATE.CUSTNAME)

83.ETL Generated Surrogate Key


The ETL job can generate a surrogate key with a counter. The parallel Surrogate Key
Generator stage automatically creates unique numbers across each parallel node,
starting from a seed number. The parallel Transformer can generate unique numbers
using a stage variable; they can be made unique across nodes by using special macros.
I added an ITToolbox wiki entry to demonstrate it.
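A minimal sketch of the Transformer approach (KeySeed is an assumed job parameter
holding the starting value; @INROWNUM, @NUMPARTITIONS and @PARTITIONNUM are
Transformer system variables, so each partition generates a non-overlapping series):

Output column derivation:
KeySeed + ((@INROWNUM - 1) * @NUMPARTITIONS) + @PARTITIONNUM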

Pros and Cons


Now to be perfectly fair here I am going to present both sides of the argument. I
would hate to badmouth database engines with so many illustrious database
bloggers on IT Toolbox.

The case for ETL generated keys


You attach a surrogate key to a row of data at the earliest opportunity and this can be
used to track the row from that point on right up until insert into the database. This
offers a lot of opportunities to use this surrogate key on rejected records or on
messages attached to accepted records. The data quality firewall for example can
send messages to a data quality reporting system, it is easier to link these messages
back to loaded or rejected data via a generated key field.

ETL generated keys works very well on vertically banded ETL loads. For example,
you can process all your data during the day and load it into the database overnight.
Day time processing occurs mainly on the ETL server where you extract, transform,
consolidate, enrich, massage, wobble and masticate your data. This gives you a set of
load ready data files. Overnight you bulk load, insert or update them into your
database with almost no transformation. With ETL key generation you can have your
surrogate keys and foreign keys generated for new rows during this preparation
phase. This will save time and complexity on your overnight loads. This type of
vertical banding gives you easier rollback and recovery and lets you process data
without impacting on production databases.

An ETL generated key works no matter what type of database load you perform:
insert, bulk load, fast load, append, import etc. With database keys you need to work
out how the type of load affects your number generation. I still haven't worked out
how to bulk load into a DB2 table and use a DB2 sequence at the same time.

The ETL key generator offers very good performance whether using the stand alone
stage or in a transformer. It runs in parallel and uses very little memory.

The case for database generated keys


Here is my half-hearted attempt at defending database generated keys. Identity fields
that automatically increment when a row is inserted conveniently work no matter
what load tool you are using.

The database prevents duplicate keys by remembering the last key used and
incrementing even when there are simultaneous loads. Mind you, good ETL design
will do the same thing.

Running out of good things to say about database keys now.

84. What is job control? What is the use of it? Explain with steps.
Job control (sometimes written JCL, job control language) is used to run a number of
jobs at a time, with or without loops. Steps: click Edit in the menu bar, select 'Job
Properties' and enter the parameters, for example:

Parameter   Prompt    Type
STEP_ID     STEP_ID   string
Source      SRC       string
DSN         DSN       string
Username    unm       string
Password    pwd       string

After editing the above, go to the Job control tab, select the jobs from the list box and
run the job.

Job control also means controlling DataStage jobs from other DataStage jobs. Example:
consider two jobs, XXX and YYY. Job YYY can be executed from job XXX by using
DataStage macros in routines.

To execute one job from another job, the following steps need to be followed in the
routine:

1. Attach the job using the DSAttachJob function.

2. Run the other job using the DSRunJob function.

3. Stop the job, if required, using the DSStopJob function.
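A minimal BASIC sketch of such a controlling routine (job and routine names are
illustrative; DSWaitForJob is used here to wait for the controlled job to finish):

* Attach, run, wait for and check the controlled job
hJob = DSAttachJob("YYY", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Then Call DSLogWarn("Job YYY failed", "ControlXXX")
ErrCode = DSDetachJob(hJob)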

85. Is it possible to access the same job by two users at a time in DataStage?
No, it is not possible for two users to access the same job at the same time. DataStage
will produce the following error: "Job is accessed by other user".

86. How do you drop the index before loading data into the target and how do you
rebuild it in DataStage?
This can be achieved by the "Direct Load" option of the SQL*Loader utility.

87. How do I create a DataStage engine stop/start script?


Actually my idea is as below:
#!/bin/bash
# Stop/start the DataStage engine; run as dsadm or a user with equivalent rights
DSHOMEBIN=/Ascential/DataStage/home/dsadm/Ascential/DataStage/DSEngine/bin

# Kill any remaining client connections (dsapi_slave processes) before stopping
ps -ef | grep dsapi_slave | grep -v grep | awk '{print $2}' | xargs kill -9

$DSHOMEBIN/uv -admin -stop > /dev/null
sleep 30
$DSHOMEBIN/uv -admin -start > /dev/null

# Verify the engine daemon is back and report
ps -ef | grep dsrpcd | grep -v grep && echo "Started properly"

88. How do you track performance statistics and enhance them?
Through the Monitor (in the Director) we can view the performance statistics.

89. What is the order of execution done internally in the Transformer, with the stage
editor having input links on the left hand side and output links on the right?
Stage variables, constraints and then column derivations or expressions.

90. What is SQL tuning? How do you do it?


SQL tuning can be done using cost-based optimization.

These pfile parameters are very important:


sort_area_size, sort_area_retained_size, db_file_multiblock_read_count, open_cursors,
cursor_sharing,
optimizer_mode = choose/rule

91. How do we use NLS functionality in DataStage? What are the advantages of NLS?
Where can we use it? Explain briefly.
As per the manuals and documents, we have different levels of interfaces, such as
Teradata interface operators, DB2 interface operators, Oracle interface operators and
SAS interface operators. Orchestrate National Language Support (NLS) makes it
possible for you to process data in international languages using Unicode character
sets. International Components for Unicode (ICU) libraries support NLS functionality
in Orchestrate. Operators with NLS functionality include:
• Teradata interface operators
• switch operator
• filter operator
• DB2 interface operators
• Oracle interface operators
• SAS-interface operators
• transform operator
• modify operator
• import and export operators
• generator operator

92. What are the Repository tables in DataStage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data, able to
answer any ad hoc, analytical, historical or complex queries. Metadata is data about
data. Examples of metadata include data element descriptions, data type
descriptions, attribute/property descriptions, range/domain descriptions, and
process/method descriptions. The repository environment encompasses all
corporate metadata resources: database catalogs, data dictionaries, and navigation
services. Metadata includes things like the name, length, valid values, and
description of a data element. Metadata is stored in a data dictionary and repository.
It insulates the data warehouse from changes in the schema of operational systems.
In DataStage I/O and Transfer, under the interface tab, the input, output and transfer
pages have four tabs, and the last one is Build; under that you can find the TABLE
NAME.
The DataStage client components are: Administrator - administers DataStage
projects and conducts housekeeping on the server; Designer - creates DataStage jobs
that are compiled into executable programs; Director - used to run and monitor
DataStage jobs; Manager - allows you to view and edit the contents of the repository.

93.What are Schema Files


• Alternative way to specify column definitions for data used in EE jobs
• Written in a plain text file
• Can be imported into the DataStage Repository
• Creating a Schema
– Using a text editor
• Follow correct syntax for definitions
– Import from an existing data set or file set
• Manager import > Table Definitions > Orchestrate Schema
Definitions
• Select checkbox for a file with .fs or .ds
– Import from a database table
– Create from a Table Definition
• Click Parallel on Layout tab

• Schema file for data accessed through stages that have the “Schema Files”
property, e.g. Sequential File
• Sample Use
• if source file format may change without functional impact to the DS code
• say columns inserted, reordered, deleted, etc.
• Job access the file only through the definition in the schema file
• Schema file may be changed without affecting the job(s)

• Refinement Case 1
– The input file may in the future
• include extra columns that are not relevant to the requirement, these
must be dropped/ignored by the job
• The record format may change, e.g. become comma delimited, order
in which the fields appear may change
– The job must be capable of accepting this input file without impact
• To Do
• Define a schema file to define the input file & point to it within the
sequential file stage

94. What is RUN TIME COLUMN PROPAGATION (RCP)?
• Supports partial definition of meta data.
• Enable RCP to
– Recognize columns at runtime though they have not been used within
a job
– Propagate through the job stream
• Design and compile time column mapping enforcement
– RCP is off by default
– Enable
• project level. (Administrator project properties)
• job level. (Job properties General tab)
• Stage. (Link Output Column tab)
– Always enable if using Schema Files
• To use RCP in a Sequential stage:
– Use the “Schema File” option & provide path name of the schema file
• When RCP is enabled:
– DataStage does not enforce mapping rules
– Danger of runtime errors if incoming column names do not match
column names on the outgoing link
– Columns not used within the job are also propagated if a definition exists
• Note that RCP is available for specified stages
• Consider this requirement statement:
– Regional_Sales.txt is a pipe-delimited sequential file
– It will contain
• Region_ID
• Sales_Total
– Job must read this file and compute
• Sales_Total_USD = Sales_Total*45
– Write the data into
• data set Regional_Sales.ds
So far a simple job will do

• Refinement Case 1
– The input file may in the future
• include extra columns that are not relevant to the requirement, these
must be dropped/ignored by the job
• The record format may change, e.g. become comma delimited, order
in which the fields appear may change
– The job must be capable of accepting this input file without impact
• To Do

record
{final_delim=end, record_delim='\n', delim='|',
quote=double, charset="ISO8859-1"}
(
REGION_ID:int32 {quote=none};
SALES_TOTAL:int32 {quote=none};
)

• To Do:
• The column definitions will define all columns that must be carried through to the next stage
• Column definition column names must match those defined in the
schema file
• Ensure RCP is disabled for the output links

record
{final_delim=end, record_delim='\n', delim='|',
quote=double, charset="ISO8859-1"}
(
REGION_ID:int32 {quote=none};
SALES_CITY:ustring[max=255];
SALES_ZONE:ustring[max=255];
SALES_TOTAL:int32 {quote=none};
)

• When the input format changes


– ONLY the schema file must be modified!

– Data Set will contain ALL columns in the schema, unless explicitly
accessed & dropped within the job plus the computed field

Refinement Case 2
– The input file may in the future include extra columns BUT THESE
MUST BE CARRIED ON into the target DataSet as it is
– The job must be capable of accepting this input file without impact
• To Do
• Define & use schema file
• Ensure RCP is enabled at the project level as well as for all
output link along which data is to be propagated at run time
• Define all columns that require processing
• Other columns may or may not be defined
• In this case, Region_ID need not be defined in the stage
• But if a column is defined and found missing from schema
&/or data file at run time, the job will abort!

• When the input format changes

– ONLY the schema file must be modified!

– The Data Set will contain ALL columns in the schema (unless explicitly
accessed & dropped within the job), plus the computed field
95.What are shared and Local Containers

• Reuse of logic across jobs in a project


• Set of stages that provide a frequently performed sequence of operations
• Included at compile time within the logic of the job within which it is invoked
– On change, job(s) must be recompiled
• Accepts parameters passed by calling job
• Can be used along with RCP features to provide higher reuse
• Care must be taken to ensure that no deadlocks or other data integrity issues
are introduced through shared logic being invoked simultaneously
• Server job functionality can be embedded within the Parallel Job
– Invoked multiple times to allow the process to function in parallel
– Note that parallel containers cannot be invoked within a server job
Local containers

– are for making the job look less complicated
– Cannot be invoked from other jobs
– Can be converted to shared containers
– Can be deconstructed to embed the logic within the job itself

Shared containers
– Can be converted to a local container within a specific job, while still
retaining the original shared container definition
– Can be deconstructed
• Consider validation of geography
– Region_ID must exist in Region_Master.txt & Zone must exist in
Zone_Master.txt
– This rule is applicable for various streams including Regional_Sales
and Employee_Master
• Basic Solution:
– Create individual jobs to lookup each source against the master files

• Refined Solution
– Create a Shared container – “Validate Geography”
• Select the stages that are to be shared
• Select menu item Edit > Construct Container > Shared

To make it truly reusable

• Within the Shared Container Definition

• Rename the link & columns names to generic names

• Ensure that the stage defines only the fields used within the
processing, in this case, Zone & Region

• Ensure RCP is enabled on the output links. This ensures that


all fields in the input are passed on to the output

• Within the job(s)

– Column names that are used within the container must be mapped or
modified before/after the shared container is invoked

– Output links of the shared container must have RCP set

– The shared container icon within the job must be opened &
input/output in the job must be mapped against the corresponding
link name in the container

– Note that parameters required by the stages within the container must
be set through the container’s invocation stage in the calling job

• Container can be reused as shown to validate the geography information of
the employee-master file

• If in the future
– Say geography validation no longer requires validation of Zone; then
• Change only the shared container
• Recompile all jobs that invoke the container
– Say the city must also be validated
• Provided all the inputs contain the required field,
– Change only the shared container
– Recompile all jobs that invoke the container

96.Explain Job Sequence Invocation & Control

• Options
– Run through the DS client Director menu
– Command line interface DSJob Commands
• Used directly or within an OS shell or batch script
• DSJob available with client installation
– DataStage API
• callable through a C/C++ Program
• Distribute DLLs and header files to enable remote execution
without a DS Client
– Through other DataStage Executable Components - which have to be
in turn invoked through any of the listed means
• DataStage BASIC Job Control
– Written in DataStage Basic Script
– Embedded as a Job Control script within the job
definition OR
– Called as a Server Routine OR
– In Parallel Jobs, wrapped into a BASIC Transformer stage

• Invoked as an activity within a Sequence Job
– which has to be in turn invoked through any of the
listed means
Command Line Interface
• Use dsjob for controlling DataStage jobs
• Options available are:
• Logon
• Starting a job
• Stopping a job
• Listing projects, jobs, stages, links and parameters
• Setting an alias for job
• Retrieving information
• Accessing log files
• Importing job executables
• Generating a report
• CLI commands return a status code to the OS
• Use dsjob -run to start, validate or reset DataStage jobs, and dsjob -stop to stop them
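A few illustrative command lines (a sketch; the project, job and parameter names are
hypothetical and the available options can vary slightly by release):

dsjob -run -mode NORMAL -param SOURCE_DIR=/data/in -jobstatus MyProject MyJob
dsjob -stop MyProject MyJob
dsjob -logsum MyProject MyJob
dsjob -report MyProject MyJob BASIC

Here -jobstatus makes dsjob wait for the job to finish and return the job's status as its
exit code, which is useful when calling dsjob from a scheduler or shell script.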

97.Datastage : Email notification activity & job parameters


Yet again, some of the things that can be improved in DataStage. We had a
requirement of mailing the link statistics after the job has completed. A simple
task, but the difficulty here is that the end users (read: DataStage support
personnel) are not comfortable with before/after routines, so we had to
program using the Email Notification activity available in job sequences, and the
support guys will update the email address in the activity in the future if any
change is needed in email recipients. The problem is that the body of the Email
Notification activity is static, except for the usage of environment parameters
(we will come back to this). Now if we want any run-time content to be published
in the mail, we can't do that. I guess it's a minimum requirement for the body
to accept job parameters. My idea is to have the link counts collected (either
in the after-job subroutine of the DataStage job or else using the dsjob command) and
pass them as job parameters to the Email Notification activity. The Email Notification
activity actually generates a UNIX script and executes it (probably using
mailx and uuencode). The script is generated and destroyed within the
sequencer, so the execution details are kept obscure to the developer. I
strongly believe that the environment parameters are dynamic in the email
body, since they map directly to the UNIX variables exported under the same
UNIX shell. So you don't actually know that environment parameters get
substituted in the email body unless you type the environment parameters manually,
since the email body has not got any "Insert job parameter" button.

98.Installation & Configuration of Datastage

DataStage can be administered on a UNIX system by a non-root user. The user
created by default is 'dsadm'. The primary UNIX group of the administrator should
be the group to which all the other UNIX ids of DataStage users belong.
By default the group created is 'dstage'. Secondary UNIX groups can be created and
assigned to different DataStage roles from DS Administrator.

To configure the Parallel Environment

Your development directory must be visible globally across all nodes. Use NFS for
this.
Update the PWD variable to point to the same path on every node.
A user who runs the parallel jobs should have all the following rights :-
Login access
Read / Write access to Scratch disk
Read / Write access to Temporary dir
Read access to APT_ORCHHOME
Execute access to Local copies and scripts
• For accessing DB2 resources from DataStage, you need to
define all the DB2 nodes acting as servers in your configuration file
• Execute the script $APT_ORCHHOME/bin/db2setup.ksh for each and every
database in DB2 once
• Execute the script $APT_ORCHHOME/bin/db2grant.ksh for each and every
user accessing DB2
• For remote connection to DB2, the DB2 client and the DS server should be on the
same machine
DB2 System configuration entails
1. Make sure db2nodes.cfg is readable by the DataStage administrator
2. For users running the jobs in Load Mode, the DataStage user needs to have
DBADM role granted. Give the grant by executing the command
1. GRANT DBADM ON DATABASE TO USER username
3. Grant the DataStage user select privileges on syscat.nodegroupdefs,
syscat.tablespaces, syscat.tables

DB2 user configuration entails


1. Set the env. variable DB2INSTANCE to the owner of the DB2 instance, so
that it points to the correct db2nodes.cfg file, i.e. if you set DB2INSTANCE
to John, then the location of db2nodes.cfg is ~John/sqllib/db2nodes.cfg
2. Source db2profile in your current shell, i.e. set the following in .profile (if
Korn or Bourne shells)
1. . ~John/sqllib/db2profile
2. export LIBPATH=~John/sqllib/lib:${LIBPATH:-/usr/lib}
3. If C Shell,
1. source ~Mary/sqllib/db2profile
2. if (! $?LIBPATH) setenv LIBPATH /usr/lib
3. setenv LIBPATH ~Mary/sqllib/lib:$LIBPATH
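Putting the above together, a minimal .profile sketch for Korn/Bourne shells (the
instance owner 'John' is just the example name from the text; paths depend on the
installation):

DB2INSTANCE=John                                 # owner of the DB2 instance, so ~John/sqllib/db2nodes.cfg is picked up
export DB2INSTANCE
. ~John/sqllib/db2profile                        # source the DB2 environment
LIBPATH=~John/sqllib/lib:${LIBPATH:-/usr/lib}    # make the DB2 client libraries visible
export LIBPATH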

For accessing Oracle Resources, you need to


1. Grant the users select privileges on SYS.GV_$INSTANCE and
SYS.V_$CACHE

2. Oracle System configuration entails
1. Creating the user environment variables ORACLE_HOME and
ORACLE_SID and setting them to the values of $ORACLE_HOME and
$ORACLE_SID for the target instance.
2. Addition of $ORACLE_HOME/bin to your PATH and $ORACLE_HOME/lib to your
LIBPATH, LD_LIBRARY_PATH or SHLIB_PATH
3. Have select privileges on
1. DBA_EXTENTS
2. DBA_TAB_PARTITIONS
3. DBA_DATA_FILES
4. DBA_OBJECTS
5. ALL_PART_INDEXES
6. ALL_INDEXES
7. ALL_PART_TABLES
8. GV_$INSTANCE ( Only if parallel server is used)
C++ Compiler
1. The DS env. Variables APT_COMPILER and APT_LINKER should be set to
the default location of the corresponding C++ compiler.
2. If the C++ compiler is installed elsewhere, then the above env. variable values
should be changed in each and every project from DS Administrator
Configuring some environment variables
1. Values to the environment variables can be given from DS Administrator as
well as at the time of installation. Please configure there at the time of install
2. TMPDIR: By default is /tmp. Specify your own temporary directory
3. APT_IO_MAXIMUM_OUTSTANDING: Specifies the amount of memory
reserved in each node for TCP/IP communications. Default is 2 MB
4. APT_COPY_TRANSFORM_OPERATOR: Set it to True in order for the
Transformer stage to work under different env. Default is False.
5. APT_MONITOR_SIZE, APT_MONITOR_TIME: By default job monitoring is
done by time and the monitor window is refreshed every 5 sec. Specifying the
SIZE param while the TIME param is at this default value makes the job reporting
window get refreshed after the specified number of new rows has been processed.
Overriding the default TIME param will override the SIZE param.
6. APT_DEFAULT_TRANSPORT_BLOCK_SIZE,
APT_AUTO_TRANSPORT_BLOCK_SIZE: The values in these variables
specify the size of the data block used when DS transports data over internal links.
The default value of APT_DEFAULT_TRANSPORT_BLOCK_SIZE is 32768. Set the
other variable to True if you want DS to automatically calculate the block
size.
Starting and stopping the DS Server
1. To stop the DataStage server, use $DSHOME/bin/uv -admin -stop
2. To start the DataStage server, use $DSHOME/bin/uv -admin -start
3. Before stopping, check for the dsrpcd process and stop all client connections,
using the command:
netstat -a | grep dsrpcd
Project location and assign DataStage EE roles
1. The project location and DataStage roles are assigned in DS Administrator.

2. In the projects dialog box and under generals tab, please assign the location of
the project.
3. The DataStage roles that can be assigned to UNIX groups ( not users ) under
permissions tab are
1. Developer - Has full access to all areas of a project.
2. Production Manager – Developer + access to create and manage
protected projects
3. Operator - Has no write access, can run the jobs and access only to
Director
4. None - Cannot log in to any DataStage client component.
Configuring Unix Environment
Check the following tunable kernel parameters (recommended values vary by UNIX platform):
MSGMAX 8192 32768 N/A 8192
MSGMNB 16384 32768 N/A 16384
SHMSEG 15* 15 N/A 32
MSGSEG N/A 7168 N/A N/A
SHMMAX 8388608 N/A N/A 8388608
SEMMNS 111 N/A N/A 51
SEMMSL 111 N/A N/A 128
SEMMNI 20 N/A N/A 128
Last week we got a requirement to Validate a date field.
The dates were in mm/dd/yyyy format.
1. 7/7/2007 format for some records
2. 07/7/2007 format for some records
3. 7/07/2007 format for some records
4. 07/07/2007 format for some records
and in-between them there were some invalid dates like 09/31/2007, 09/040/200.
Being an Oracle developer before I started using DataStage, the first thing I went
for was TO_DATE() in the Oracle Enterprise stage. Damn! Well, it wasn't so easy: the stage
aborted for invalid dates instead of throwing them down the reject link. I tried some
online help on how to capture the TO_DATE() errors into a reject link. After searching
for a couple of hours, nothing concrete came up.
OK, I decided to do the validation in DataStage. I already had a transformer in
the job, so I included a constraint
isValid('date', StringToDate(trim_leading_trailing(<inp_Col>), '%mm/%dd/%yyyy'))
and passed the invalid dates down the reject link. I compiled and ran the job. The
job ran successfully and I thought everything went fine, until I recognized that the
format mask is not intelligent enough to recognize single-digit day and month fields.
Hmmm... I was back to square one. Then I tried some innovation using the format
mask as %[mm]/%[dd]/%yyyy, %m[m]/%d[d]/%yyyy etc., etc. Nothing worked.
Anyway, at last I was able to do the task for the day using some troublesome logic
(identifying the single-digit day and month and concatenating a zero before
them with the help of the Field() function) inside the transformer and made my boss
happy, but I wondered why, for such a simple requirement, the DataStage date field
format mask is not modeled intelligently enough.
In Oracle, TO_DATE() is intelligent enough to recognize such data when the format
mask is specified as MM/DD/YYYY.
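As a sketch of the Field()-based padding workaround described above (inp_Col is just a
placeholder for the input column), a Transformer derivation along these lines left-pads
the month and day to two digits before the StringToDate conversion:

Right('0' : Field(inp_Col, '/', 1), 2) : '/' :
Right('0' : Field(inp_Col, '/', 2), 2) : '/' :
Field(inp_Col, '/', 3)

Field() extracts the month, day and year pieces, and Right('0' : piece, 2) keeps the last
two characters, so '7' becomes '07' while '07' stays '07'.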

Also, we had a timestamp field coming in from the source with the format mask
'MM/DD/YYYY HH:MI:SS AM' (Oracle style). Even this I was not able to validate
properly in a DataStage timestamp field, since DataStage is not intelligent enough
to recognize 'AM/PM'. I guess I need to learn how regular expressions are
specified in DataStage.
These are day-to-day requirements in ETL, which I guess DataStage must handle.
Either I need to do more research, or DataStage should make more flexible format
masks in the next release and make the life of developers easy.

99.Datastage: Job design tips


I am just collecting the general design tips that help the developers to build clean &
effective jobs.
1. Turn off Runtime Column propagation wherever it’s not required.
2. Make use of Modify, Filter, Aggregator, Column Generator etc. stages instead of the
Transformer stage only if the anticipated volumes are high and performance
becomes a problem. Otherwise use the Transformer; it is much easier to code a transformer
than a Modify stage.
3. Avoid propagation of unnecessary metadata between the stages. Use Modify stage
and drop the metadata. Modify stage will drop the metadata only when explicitly
specified using DROP clause.
4. One of the most important mistakes that developers often make is not to
have volumetric analyses done before you decide to use Join or Lookup or Merge
stages. Estimate the volumes and then decide which stage to go for.
5. Add reject links wherever you need reprocessing of rejected records or you think
considerable data loss may happen. Try to keep reject links at least on Sequential File
stages and when writing to Database stages.
6. Make use of the ORDER BY clause when a DB stage is being used in a join. The intention
is to make use of the database's power for sorting instead of DataStage resources. Keep the
join partitioning as Auto. Indicate the "don't sort, previously sorted" option between the DB stage
and the Join stage using a Sort stage when using the ORDER BY clause.
7. While doing Outer joins, you can make use of Dummy variables for just Null
checking instead of fetching an explicit column from table.
8. Use Sort stages instead of Remove duplicate stages. Sort stage has got more
grouping options and sort indicator options.
9. One of the most frequent mistakes that developers face is lookup failures caused by not
taking care of the string pad character that DataStage appends when converting strings of
lower precision to higher precision. Try to decide on the APT_STRING_PADCHAR and
APT_CONFIG_FILE parameters from the beginning. Ideally
APT_STRING_PADCHAR should be set to 0x00 (C/C++ end of string) and the
configuration file to the maximum number of nodes available.
10. Data Partitioning is very important part of Parallel job design. It’s always
advisable to have the data partitioning as ‘Auto’ unless you are comfortable with
partitioning, since all DataStage stages are designed to perform in the required way
with Auto partitioning.
11. Do remember that Modify drops the Metadata only when it is explicitly asked to
do so using KEEP/DROP clauses.

100.Setup and use project specific environment variables


Introduction

Job parameters should be used in all DataStage server, parallel and sequence jobs to
provide administrators access to changing run time values such as database login
details, file locations and job settings.
One option for maintaining these job parameters is to use project specific
environment variables. These are similar to operating system environment variables
but they are setup and maintained through the DataStage Administrator tool.
There is a blog entry with bitmaps that describes the steps in setting up these
variables at DataStage tip: using job parameters without losing your mind

Steps
To create a new project variable:
Start up DataStage Administrator.
Choose the project and click the "Properties" button.
On the General tab click the "Environment..." button.
Click on the "User Defined" folder to see the list of job specific environment
variables.
There are two types of variables - string and encrypted. If you create an encrypted
environment variable it will appear as the string "*******" in the Administrator tool
and will appear as junk text when saved to the DSParams file or when displayed in
a job log. This provides robust security of the value.
Note that encrypted environment variables are not supported in versions earlier than
7.5.


Migrating Project Specific Job Parameters


It is possible to set or copy job specific environment variables directly to the
DSParams file in the project directory. There is also a DSParams.keep file in this
directory and if you make manual changes to the DSParams file you will find
Administrator can roll back those changes to DSParams.keep. It is possible to copy
project specific parameters between projects by overwriting the DSParams and
DSParams.keep files. It may be safer to just replace the User Defined section of these
files and not the General and Parallel sections.

Environment Variables as Job Parameters


To create a job level variable:
Open up a job.
Go to Job Properties and move to the parameters tab.
Click on the "Add Environment Variables..." button and choose the variable from the
list. Only values set in Administrator will appear. This list will show both the system
variables and the user-defined variables.
Set the Default value of the new parameter to $PROJDEF. If it is an encrypted field
set it to $PROJDEF in both data entry boxes on the encrypted value entry form.
When the job parameter is first created it has a default value the same as the Value
entered in the Administrator. By changing this value to $PROJDEF you instruct
DataStage to retrieve the latest Value for this variable at job run time.

74
If you have an encrypted environment variable it should also be an encrypted job
parameter. Set the value of these encrypted job parameters to $PROJDEF. You will
need to type it in twice in the password entry box, or better yet cut and paste it into
the fields; a spelling mistake can lead to a connection error message that is not very
informative and leads to a long investigation.

Creating sub folders


By default all parameters are put into a "User Defined" folder. This can make it
difficult to locate them through the Designer or Administrator tools. Sub folders can
be added by editing the DSParams folder and adding sub folder names to the
parameter definition section. Where the folder name is defined as "/User Defined/"
this can be changed to include a sub folder, e.g. "/User Defined\Database/".

Examples
These job parameters are used just like normal parameters by adding them to stages
in your job enclosed by the # symbol.
Job Parameter Examples
Field       Setting                                                      Result
Database    #$DW_DB_NAME#                                                CUSTDB
Password    #$DW_DB_PASSWORD#                                            ********
File Name   #$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv    c:/data/custfiles/Customers_20040203.csv
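As an illustration (the project, job and values are hypothetical), parameters that default
to $PROJDEF can usually also be overridden for a single run from the dsjob command
line; the single quotes stop the shell expanding the $:

dsjob -run -param '$DW_DB_NAME=CUSTDB_TEST' -param '$DW_DB_PASSWORD=secret' MyProject LoadCustomers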

Conclusion
These types of job parameters are useful for having a central location for storing all
job parameters that is password protected and supports encryption of passwords. It
can be difficult to migrate between environments. Migrating the entire DSParams file
can result in development environment settings being moved into production and
trying to migrate just the user defined section can result in a corrupt DSParams file.
Care must be taken.
101.Perform change data capture in an ETL job

Introduction
This HOWTO entry will describe how to identify changed data using a DataStage
server or parallel job. For an overview of other change capture options see the blog
on incremental loads
The objective of changed data identification is to compare two sets of data with
identical or similar metadata and determine the differences between the two. The
two sets of data represent an existing set of a data and a new set of data where the
change capture identifies the modified, added and removed rows in the new set.

Steps

The steps for change capture depend on whether you are using server jobs or parallel
jobs or one of the specialized change data capture products that integrate with
DataStage.

Change Data Capture Components


These are components that can be purchased in addition to a DataStage license in
order to perform change data capture against a specific database:
DataStage CDC for DB2
DataStage CDC for SQL Server 2000
Ascential CDC for IMS
Change Data Capture for Oracle
DataStage CDC for SQL Server 2000 (Windows only)
CDC for DB2 AS/400 (Windows only)

Server Job
Most change capture methods involve the transformer stage with new data as the
input and existing data as a left outer join reference lookup. The simplest form of
change capture is to compare all rows using output links for inserts and updates
with a constraint on each.

Column Compare
Update link constraint:
input.firstname <> lookup.firstname or input.lastname <> lookup.lastname or
input.birthdate <> lookup.birthdate ...
Insert link constraint:
lookup.NOTFOUND
A delete output cannot be derived as the lookup is a left outer join.
These constraints can become very complex to write, especially if there are a lot of
fields to compare. It can also produce slow performance as the constraint needs to
run for every row. Performance improvement can be gained by using the CRC32
function to describe the data for comparison.

CRC Compare
CRC32 is a C function written by Michael Hester and is now on the Transformer
function list. It takes an input string and returns a signed 32 bit number that acts as a
digital signature of the input data.
When a row is processed and becomes existing data a CRC32 code is generated and
saved to a lookup along with the primary key of the row. When a new data row
comes through a primary key lookup determines if the row already exists and if it
does comparing the CRC32 of the new row to the existing row determines whether
the data has changed.
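As a small sketch (the column names are hypothetical), the CRC32 code is typically
generated in a Transformer derivation over the concatenated value fields, with a
delimiter to avoid ambiguity:

CRC32(in.FirstName : '|' : in.LastName : '|' : in.BirthDate)

The resulting signed 32-bit code is stored alongside the primary key and compared with
the code generated for the incoming row.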
CRC32 change capture using a text file source:
• Read each row as a single long string. Do not specify a valid delimiter.
• In a transformer find the key fields using the FIELD command and generate a
CRC32 code for the entire record. Output all fields.
• In a transformer lookup the existing records using the key fields to join and
compare the new and existing CRC32 codes. Output new and updated
records.
• The output records have concatenated fields. Either write the output records
to a staging sequential file, where they can be processed by insert and update
jobs, or split the records into individual fields using the Row Splitter stage.
CRC32 change capture using a database source:
• To concatenate fields together use the Row Merge stage. Then follow the
steps described in the sequential file section above.

Shared Container Change Capture


One benefit of the CRC32 function is the ability to put change capture into a shared
container for use across multiple jobs. This code re-use can save a lot of time. The
container has a transformer in it with two input columns: keyfields and valuefields,
and two output links: inserts and updates. The keyfields column contains the key
fields concatenated into a string with a delimiter such as |. The valuefields column
contains all fields concatenated with a delimiter.
The job that uses the container needs to concatenate the input columns and pass
them to the container and then split the output insert and update rows. Row Merge
and Row Splitter can be used to do this.

Parallel Job
The Change Capture stage uses "Before" and "After" input links to compare data.
• Before is the existing data.
• After is the new data.
The stage operates using the settings for Key and Value fields. Key fields are the
fields used to match a before and after record. Value fields are the fields that are
compared to find modified records. You can explicitly define all key and value fields
or use some of the options such as "All Keys, All Values" or "Explicit Keys, All
Values".
The stage outputs a change_code which by default is set to 0 for unchanged (copy) rows,
1 for inserted (new) rows, 2 for deleted rows and 3 for edited (modified) rows. A Filter
stage or a Transformer stage can then split the output using the change_code field
down different insert, update and delete paths.
Change Capture can also be performed in a Transformer stage as per the Server Job
instructions.
The CRC32 function is not part of the parallel job install but it is possible to write one
as a custom buildop.

Conclusion
The change capture stage in parallel jobs and the CRC32 function in server jobs
simplify the process of change capture in an ETL job.

102.Retrieve sql codes on a failed upsert


Parallel jobs are useful for increasing the performance of the job, so nowadays all kinds
of jobs are built as parallel jobs.

Introduction

When an enterprise database stage such as DB2 or Oracle is set to upsert it is possible
to create a reject link to trap rows that fail any update or insert statements. By default
this reject link holds just the columns written to the stage, they do not show any
columns indicating why the row was rejected and often no warnings or error
messages appear in the job log.

Steps
There is an undocumented feature in the DB2 and Oracle enterprise stage where a
reject link out of the stage will carry two new fields, sqlstate and sqlcode. These hold
the return codes from the RDBMS engine for failed upsert transactions. The fields are
called sqlstate and sqlcode.
To see these values add a peek to your reject link, the sqlstate and sqlcode should
turn up for each rejected row in the job log. To trap these values add a copy stage to
your reject link, add sqlstate and sqlcode to the list of output columns, on the output
columns tab check the "Runtime column propagation" check box, this will turn your
two new columns from invalid red columns to black and let your job compile. If you
do not see this check box, use the Administrator tool to turn on column propagation
for your project.
When the job runs and a RDBMS reject occurs the record is sent down the reject link,
two new columns are propagated down that link and are defined by the copy stage
and can then be written out to an error handling table or file.
If you do not want to turn on column propagation for your project you can still
define the two new columns with a Modify stage by creating them in two
specifications. sqlcode=sqlcode and sqlstate=sqlstate. Despite column propagation
being turned off the Modify stage will still find the two columns on the input link
and use the specification to add them to the output schema.

Examples
Oracle: By default, oraupsert produces no output data set. By using the -reject option,
you can specify an optional output data set containing the records that fail to be
inserted or updated. Its syntax is: -reject filename
For a failed insert record, these sqlcodes cause the record to be transferred to your reject dataset:
-1400: cannot insert NULL
-1401: inserted value too large for column
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
For a failed update record, these sqlcodes cause the record to be transferred to your reject dataset:
-1: unique constraint violation
-1401: inserted value too large for column
-1403: update record not found
-1407: cannot update to null
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
An insert record that fails because of a unique constraint violation (sqlcode of -1) is used for updating.
DB2: When you specify the -reject option, any update record that receives a status of
SQL_PARAM_ERROR is written to your reject data set. Its syntax is: -reject filename

Conclusion
Always place a reject link on a Database stage that performs an upsert. There is no
other way to trap rows that fail that upsert statement.
For other database actions such as load or import a different method of trapping
rejects and messages is required.

103.Routine Generated Sequential Keys
When you require a system-generated sequential key for a table, this method will
allow you to control the starting number & preserve the next sequential value for use in
subsequent loads.
After you have your output columns defined open the transformer stage (double
click on the transformer symbol) & open the stage variables properties box.
The stage variables properties box is opened by RIGHT-clicking on the stage
variables box within the transformer GUI. If your stage variables box is not displayed
in the transformer, press the "Show/Hide Stage Variables" button on the transformer
toolbar.
In the stage variables box enter a meaningful name for the variable which will
contain the sequential key & set its initial value to 0 (zero).
Enter the derivation for the stage variable, which in this case is the
KeyMgtGetNextValue routine provided in the sdk (Software Developers Kit) routine
category. The routine can be selected from the dropdown by RIGHT-clicking on the
derivation column. This routine, as viewed from DataStage Manager, is supplied with
DataStage along with source code to allow you to modify it for your needs.
Now all that is left is to use the variable in the derivation of your key field.
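A minimal sketch of the pattern (the stage variable name svNextKey and the sequence
name "CUST_KEY" are hypothetical):

Stage variable svNextKey, initial value 0, derivation:   KeyMgtGetNextValue("CUST_KEY")
Target key column derivation:                            svNextKey

As supplied, the routine takes a sequence name argument and persists the last value
used under that name, so subsequent loads continue from where the previous run stopped.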

Experiment with the methods & select the best for your situation.

104.Can you join a flat file and database in DataStage? How?


Yes, we can join a flat file and a database in an indirect way. First create a job which
populates the data from the database into a sequential file, and name it e.g. Seq_First.
Take the flat file which you have and use a Merge stage to join these two files.
You have various join types in the Merge stage, like Pure Inner Join, Left Outer Join,
Right Outer Join etc. You can use any one of these which suits your requirements.

105.Implement SCD's in datastage

SCD type1

just use 'insert rows else update rows'

or

‘update rows else insert rows’ in update action of target

SCD type2

You have to use one hash file to look up the target, take 3 instances of the target, give
different conditions depending on the process, give different update actions in the target,
and use system variables like sysdate and null.
We can handle SCDs in the following ways: Type I: just overwrite; Type II: we need
versioning and dates; Type III: add old and new copies of certain important fields.
Hybrid dimensions: combination of Type II and Type III.
Yes, you can implement Type 1, Type 2 or Type 3. Let me try to explain Type 2 with
a timestamp.

Step 1: The timestamp we create via a shared container; it returns the system time and
one key. For satisfying the lookup condition we create a key column by using
the Column Generator.

Step 2: Our source is a Data Set and the lookup table is an Oracle OCI stage. By using the
Change Capture stage we find out the differences; the Change Capture stage will
return a value for change_code. Based on the return value we find out whether this
is an insert, edit or update. If it is an insert we set the current timestamp, and
the old timestamp is kept as history.

106.How can you implement Complex Jobs in DataStage? What do you mean by
complex jobs?

If you use more than 15 stages in a job, or 10 lookup tables in a job, then you
can call it a complex job.
Complex design means having more joins and more lookups; that job design
will be called a complex job. We can easily implement any complex design in
DataStage by following simple tips, in terms of increasing performance also. There is
no limitation on the number of stages in a job, but for better performance use at most 20
stages in each job. If it exceeds 20 stages then go for another job. Use not more
than 7 lookups in a transformer, otherwise include one more transformer.
I hope that answers your abstract question.

107.Does Enterprise Edition only add the parallel processing for better
performance?

Are any stages/transformations available in the enterprise edition only?


DataStage Standard Edition was previously called DataStage and DataStage Server
Edition. • DataStage Enterprise Edition was originally called Orchestrate, then
renamed to Parallel Extender when purchased by Ascential. • DataStage Enterprise:
Server jobs, sequence jobs, parallel jobs. The enterprise edition offers parallel
processing features for scalable high volume solutions. Designed originally for Unix,
it now supports Windows, Linux and Unix System Services on mainframes. •
DataStage Enterprise MVS: Server jobs, sequence jobs, parallel jobs, mvs jobs. MVS
jobs are jobs designed using an alternative set of stages that are generated into

cobol/JCL code and are transferred to a mainframe to be compiled and run. Jobs are
developed on a UNIX or Windows server transferred to the mainframe to be
compiled and run. The first two versions share the same Designer interface but have
a different set of design stages depending on the type of job you are working on.
Parallel jobs have parallel stages but also accept some server stages via a container.
Server jobs only accept server stages; MVS jobs only accept MVS stages. There are
some stages that are common to all types (such as aggregation) but they tend to have
different fields and options within that stage.

Row Merger, Row splitter are only present in parallel Stage.

108.How can you do incremental load in datastage?


You can create a table where you store the last successful refresh time for each
table/dimension.
Then in the source query take the delta between the last successful refresh time and
sysdate; that should give you the incremental load. Incremental load means daily load.
Whenever you select data from the source, select the records which were loaded or
updated between the timestamp of the last successful load and the current load's start
date and time. For this you have to pass parameters for those two dates: store the last
run date and time in a file, read it through job parameters, and pass the current date
and time as the second argument.
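A sketch of such a delta query, using DataStage job parameters in the source stage's
user-defined SQL (the table, column and parameter names are hypothetical):

SELECT *
FROM   customer_src
WHERE  last_update_ts >  '#LAST_RUN_TIMESTAMP#'
AND    last_update_ts <= '#CURRENT_RUN_TIMESTAMP#'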

109.If you're running 4-way parallel and you have 10 stages on the canvas, how
many processes does DataStage create?
The answer is 40.
You have 10 stages and each stage can be partitioned and run on 4 nodes, which
makes the total number of processes generated 40.

110.If data is partitioned in your job on key 1 and then you aggregate on key 2,
what issues could arise?

If the data stays partitioned on key 1, rows with the same key 2 value may sit on
different partitions, so the aggregation can produce incorrect (partial) results; the data
should be repartitioned (e.g. hash on key 2) before the Aggregator, which adds the cost
of a repartition.

111.What are the different types of errors you faced during loading and how did
you solve them?
Check for parameters, check whether the input files exist, check whether the
input tables exist, and also check usernames, data source names and passwords.

112.How can I specify a filter command for processing data while defining
sequential file output data?

We have something called an after-job subroutine and a before-job subroutine, with
which we can execute UNIX commands; there we can use the sort command or a filter
command. The Sequential File stage also has a filter option where a UNIX command
can be specified, and the data is piped through that command.
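For illustration (hypothetical commands; the data is piped through whatever command
is given in the stage's filter option), typical filter commands might be:

grep -v '^#'        # drop comment lines before the job sees them
sort -t'|' -k1,1    # pre-sort pipe-delimited data on the first field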

113.Can we use shared container as lookup in datastage server jobs?


Yes, we can use a shared container as a lookup in server jobs.

Wherever the same lookup is used in multiple places, we develop the lookup in a
shared container and then use that shared container as the lookup.

114.Can anyone tell me how to extract data from more than one heterogeneous
source, for example a sequential file, Sybase and Oracle, in a single job?
Yes, you can extract the data from two heterogeneous sources in DataStage using the
Transformer stage; you just need to form a link between the two sources in the
Transformer stage.
You can convert all heterogeneous sources into sequential files & join them using
Merge,

or

you can write a user-defined query in the source itself to join them.

115.The exact difference between Join, Merge and lookup is


The three stages differ mainly in the memory they use

DataStage doesn't know how large your data is, so cannot make an informed choice
whether to combine data using a join stage or a lookup stage. Here's how to decide
which to use:

if the reference datasets are big enough to cause trouble, use a join. A join does a
high-speed sort on the driving and reference datasets. This can involve I/O if the
data is big enough, but the I/O is all highly optimized and sequential. Once the sort
is over the join processing is very fast and never involves paging or other I/O
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several
reject links, as many as there are update input links. The concept of merge and join is
different in the parallel edition; you will not find a Join component in server jobs,
where Merge serves this purpose.
To my knowledge, join and merge are both used to join two files of the same structure,
whereas lookup is mainly used to compare previous data with current data.
We can join two relational tables using a hash file only in server jobs. The Merge stage
is only for flat files.

116.There are three different types of user-created stages available for PX.
What are they? Which would you use? What are the disadvantages for using each
type?
These are the three different stages: i) Custom ii) Build iii) Wrapped

117.What is the difference between buildopts and subroutines ?

118.What is a project? Specify its various components?.


You always enter DataStage through a DataStage project. When you start a
DataStage client you are prompted to connect to a project. Each project contains:

DataStage jobs.

Built-in components. These are predefined components used in a job.

User-defined components. These are customized components created using the


DataStage Manager or DataStage Designer

119.What is DS Director used for - did you use it?

Datastage Director is used to monitor, run, validate & schedule datastage server jobs.
• Jobs created in the Designer can be run in the Director
• Director can be invoked by logging into the director separately Or directly
through the designer.
• Director has three screens
o Status – Can be invoked by clicking the Status icon available in the
tool bar , Shows the status of the jobs available
o Schedule – Can be invoked by clicking the Schedule icon available in
the tool bar, shows the schedule of the job
o Log - Can be invoked by clicking the Log icon available in the tool
bar, shows the Log of the job

Director-Status

Director – Schedule

Director-Log

Director-Running a Job

• Select the job to be run and click the RUN NOW button available in the
tool bar.
• Tool bar also has options to

Reset
Stop
Sort the jobs by name (Ascending & Descending)

• Once the job starts running, click the log screen to monitor the job
• Job status screen will have the status finished once the job gets finished
successfully

Director-Scheduling a job
Select the job, right click and select Add to schedule menu to schedule the job –
Observe the various scheduling options available

Info, Warning and Fatal logs

Filtering a logs:
Select the job, right click and select Filter in the log screen to
filter the log – Observe the various options available.

Purging the logs:
Select the job, choose Clear Log from the toolbar, and observe the various
options available.

120.How can we implement Lookup in DataStage Server Jobs?


In server jobs we can perform 2 kinds of direct lookups

One is by using a hashed file and the other is by using Database/ODBC stage as a
lookup.

121.How do you eliminate duplicate rows?

The duplicates can be eliminated by loading the corresponding data into a hash file;
specify the columns on which you want to eliminate duplicates as the keys of the hash file.
Removal of duplicates can be done in two ways:
1. Use the Remove Duplicates stage,
or
2. use GROUP BY on all the columns used in the select; the duplicates will go away.

122.What happens if the output of a hash file is connected to a transformer?

What error does it throw?

If you connect the output of a hash file to a transformer, it will act as a reference; there
are no errors at all! It can be used in implementing SCDs.
If the hash file output is connected to a Transformer stage, the hash file will be
considered the lookup file if there is a primary link to the same Transformer stage; if
there is no primary link, it will be treated as the primary link itself. You can do SCD
in server jobs by using lookup functionality. This will not return any error code.

123.What is merge and how it can be done plz explain with simple example taking
2 tables.......
Merge is used to join two tables. It takes the key columns and sorts them in ascending
or descending order. Let us consider two tables, i.e. Emp and Dept. If we want to join
these two tables, we have DeptNo as a common key, so we can give that column
name as the key, sort DeptNo in ascending order and join those two tables.
The Merge stage is used only for flat files, in the server edition and in SMP/MPP
server configurations.

124.What are the enhancements made in datastage 7.5 compare with 7.0
Many new stages were introduced compared to DataStage version 7.0. In server jobs
we have the Stored Procedure stage, the Command stage, and a generate report option
in the File tab. In job sequences many activities like Start Loop, End Loop,
Terminate Loop and User Variables were introduced. In parallel jobs the
Surrogate Key stage and Stored Procedure stage were introduced. Complex Flat File
and Surrogate Key Generator stages were added in version 7.5.

125.How do we do the automation of dsjobs?


Dsjobs" can be automated by using Shell scripts in UNIX system.
We can call Datastage Batch Job from Command prompt using 'dsjob'. We can also
pass all the parameters from command prompt.
Then call this shell script in any of the market available schedulers.
The 2nd option is schedule these jobs using Data Stage director.
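A minimal shell sketch of this approach (project, job and parameter names are
hypothetical; the dsjob path and exact status codes depend on the installation and release):

#!/bin/ksh
# Run the job; with -jobstatus, dsjob waits and its exit code reflects the job status
# (typically 1 = finished OK, 2 = finished with warnings -- check your release's documentation).
$DSHOME/bin/dsjob -run -mode NORMAL -param LOAD_DATE=`date +%Y%m%d` -jobstatus MyProject LoadSales
rc=$?
if [ $rc -ne 1 ] && [ $rc -ne 2 ]; then
    echo "LoadSales failed with dsjob status $rc" >&2
    exit 1
fi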

126.What are types of Hashed File?

Hashed File is classified broadly into 2 types.

a) Static - Sub divided into 17 types based on Primary Key Pattern.


b) Dynamic - sub divided into 2 types
i) Generic
ii) Specific.
Default Hashed file is "Dynamic - Type Random 30 D"

127.What is the difference between Inprocess and Interprocess ?
Regarding the database it varies and depends upon the project. For the second
question: in-process buffering lets connected active stages within a single process pass
data via buffers rather than row by row, while inter-process buffering runs each active
stage as a separate process (useful on SMP systems). Both settings are available on the
Tunables tab page of the Administrator client component.
In-process

You can improve the performance of most DataStage jobs by turning in-process row
buffering on and recompiling the job. This allows connected active stages to pass
data via buffers rather than row by row.

Note: You cannot use in-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice,
and it is advisable to redesign your job to use row buffering rather than COMMON
blocks.

Inter-process

Use this if you are running server jobs on an SMP parallel system. This enables the
job to run using a separate process for each active stage, which will run
simultaneously on a separate processor.

Note: You cannot use inter-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice,
and it is advisable to redesign your job to use row buffering rather than COMMON
blocks.

128.How to handle the rejected rows in datastage?

We can handle by using constraints and store it in file or DB.


We can handle rejected rows in two ways with the help of constraints in a Transformer:
1) by ticking the Reject checkbox in the constraints area of the Transformer properties, or
2) by using REJECTED in the expression editor of the constraint.
Create a hash file as temporary storage for rejected rows. Create a link
and use it as one of the outputs of the transformer. Apply either of the two steps
above to that link. All the rows which are rejected by all the constraints will go
to the hash file.
129.What are Routines and where/how are they written and have you written any
routines before?
Routines: Routines are stored in the Routines branch of the DataStage
Repository, where you can create, view, or edit them using the Routine dialog box.
The following program components are classified as routines:

Transform functions. These are functions that you can use when defining custom
transforms. DataStage has a number of built-in transform functions which are located
in the Routines > Examples > Functions branch of the Repository. You can also define
your own transform functions in the Routine dialog box.

Before/After subroutines. When designing a job, you can specify a subroutine to run
before or after the job, or before or after an active stage. DataStage has a number of
built-in before/after subroutines, which are located in the Routines > Built-in >
Before/After branch in the Repository. You can also define your own before/after
subroutines using the Routine dialog box.

Custom UniVerse functions. These are specialized BASIC functions that have been
defined outside DataStage. Using the Routine dialog box, you can get DataStage to
create a wrapper that enables you to call these functions from within DataStage.
These functions are stored under the Routines branch in the Repository. You specify
the category when you create the routine. If NLS is enabled, you should be aware of
any mapping requirements when using custom UniVerse functions. If a function uses
data in a particular character set, it is your responsibility to map the data to and from
Unicode.

ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming
components within DataStage. Such functions are made accessible to DataStage by
importing them. This creates a wrapper that enables you to call the functions. After
import, you can view and edit the BASIC wrapper using the Routine dialog box. By
default, such functions are located in the Routines > Class name branch in the
Repository, but you can specify your own category when importing the functions.

When using the Expression Editor, all of these components appear under the
DS Routines… command on the Suggest Operand menu. A special case of routine is the
job control routine. Such a routine is used to set up a DataStage job that controls other
DataStage jobs. Job control routines are specified in the Job control page on the Job
Properties dialog box. Job control routines are not stored under the Routines branch
in the Repository.

Transforms: Transforms are stored in the Transforms branch of the DataStage
Repository, where you can create, view or edit them using the Transform dialog box.
Transforms specify the type of data transformed, the type it is transformed into, and
the expression that performs the transformation. DataStage is supplied with a number
of built-in transforms (which you cannot edit). You can also define your own custom
transforms, which are stored in the Repository and can be used by other DataStage
jobs. When using the Expression Editor, the transforms appear under the
DS Transform… command on the Suggest Operand menu.

Functions: Functions take arguments and return a value. The word "function" is
applied to many components in DataStage:
• BASIC functions. These are one of the fundamental building blocks of the BASIC
language. When using the Expression Editor, you can access the BASIC functions via
the Function… command on the Suggest Operand menu.
• DataStage BASIC functions. These are special BASIC functions that are specific to
DataStage. These are mostly used in job control routines. DataStage functions begin
with DS to distinguish them from general BASIC functions. When using the Expression
Editor, you can access the DataStage BASIC functions via the DS Functions… command
on the Suggest Operand menu.
The following items, although called "functions," are classified as routines and are
described under "Routines" above. When using the Expression Editor, they all appear
under the DS Routines… command on the Suggest Operand menu:
• Transform functions
• Custom UniVerse functions
• ActiveX (OLE) functions

Expressions: An expression is an element of code that defines a value. The word
"expression" is used both as a specific part of BASIC syntax, and to describe portions
of code that you can enter when defining a job. Areas of DataStage where you can use
such expressions are:
• Defining breakpoints in the debugger
• Defining column derivations, key expressions and constraints in Transformer stages
• Defining a custom transform
In each of these cases the DataStage Expression Editor guides you as to what
programming elements you can insert into the expression.
130.Which three are valid ways within a Job Sequence to pass parameters to
Activity stages?

ExecCommand Activity stage, UserVariables Activity stage, Routine Activity stage

Which three are valid trigger expressions in a stage in a Job Sequence?


Unconditional, ReturnValue (Conditional), Custom (Conditional)

131.Which three actions are performed using stage variables in a parallel


Transformer stage?
A function can be executed once per record.
A function can be executed once per run.
Identify the first row of an input group.

132.You have a compiled job and parallel configuration file. Which three methods
can be used to determine the number of nodes actually used to run the job in
parallel?
Within DataStage Director, examine log entry for parallel configuration file
Within DataStage Director, examine log entry for parallel job score
Within DataStage Director, open a new DataStage Job Monitor

133.Which three features of datasets make them suitable for job restart points?
They are partitioned.
They use datatype that are in the parallel engine internal format.
They are persistent.

134.What would require creating a new parallel Custom stage rather than a new
parallel BuildOp stage?
In a Custom stage, the number of input links does not have to be fixed, but can vary,
for example from one to two. BuildOp stages require a fixed number of input links.
C. Creating a Custom stage requires knowledge of C/C++. You do not need
knowledge of C/C++ to create a BuildOp stage.

135.Which task is performed by the DataStage JobMon daemon?


Provides a snapshot of a job's performance

136.Which two would cause a stage to sequentially process its incoming data?
The execution mode of the stage is sequential.
The stage has a constraint with a node pool containing only one node

137.Which two statements are true for parallel shared containers?


• Within DataStage Manager, Usage Analysis can be used to build a multi-job
compile for all jobs used by a given shared container.
• Parallel shared containers facilitate modular development by reusing
common stages and logic across multiple jobs.

138.What are Stage Variables, Derivations and Constants?


Stage Variable –
• Derivations execute in order from top to bottom
• Later stage variables can reference earlier stage variables
• Earlier stage variables can reference later stage variables; in that case they
will contain the value derived from the previous row that came into the Transformer
• Multi-purpose:
• Counters
• Store values from previous rows to make comparisons
• Store derived values to be used in multiple target field derivations
• Can be used to control execution of constraints
• An intermediate processing variable that retains its value during the read and
doesn't pass the value into a target column.

Derivation - Expression that specifies value to be passed on to the target column.

Constant (constraint) - Conditions that are either true or false that specify the flow of
data along a link.

139.How do you populate source files?


There are many ways to populate them; writing a SQL statement in Oracle to extract the data is one way.

140.How do you pass the parameter to the job sequence if the job is running at
night?
Two ways:
1. Set the default values of the parameters in the Job Sequence and map these
parameters to the job.
2. Run the job from the sequencer using the dsjob utility, where we can specify the
values to be taken for each parameter.

141.What is the utility you use to schedule the jobs on a UNIX server other than
using Ascential Director?
"AUTOSYS": Thru autosys u can automate the job by invoking the shell script written
to schedule the datastage jobs.

142.What is the meaning of "Try to have the constraints in the 'Selection' criteria of the
jobs itself. This will eliminate the unnecessary records even getting in before joins
are made"?
This means: try to improve performance by avoiding constraints applied later in the
job wherever possible and instead filtering while selecting the data itself, using a
WHERE clause. This improves performance.

143.What is the meaning of the following..

1) If an input file has an excessive number of rows and can be split up, then use
standard logic to run jobs in parallel.

2) Tuning should occur on a job-by-job basis.

3) Use the power of the DBMS.

If you have SMP machines you can use the IPC, Link Collector and Link Partitioner
stages for performance tuning.

If you have cluster or MPP machines you can use parallel jobs.

144.What is the OCI? And how to use the ETL Tools?
OCI stands for Oracle Call Interface, Oracle's native client interface; the DataStage
Oracle OCI stages use it to read from and write to Oracle directly. When the data
volume is very large, the Orabulk (bulk load) stage is normally used instead, because
retrieving and loading bulk data row by row takes much more time.

145.What is merge and how it can be done explain with simple example taking 2
tables?
Merge is used to join two tables. It takes the key columns and sorts them in ascending
or descending order. Consider two tables, Emp and Dept: if we want to join these
two tables, DeptNo is the common key, so we can give that column name as the key.

146.What is version Control?


Version Control stores different versions of DataStage jobs, runs different versions of
the same job, reverts to a previous version of a job and views version histories.

147.How can we pass parameters to job by using file?


You can do this by passing parameters from a UNIX file and then calling the
execution of the DataStage job with dsjob; the DataStage job has the parameters
defined, and their values are passed in by the UNIX script, for example:
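A minimal shell sketch of this approach (the file, project and job names are hypothetical, and it assumes the parameter values contain no spaces):

# params.txt holds one NAME=VALUE pair per line
PARAM_ARGS=""
while read LINE
do
  PARAM_ARGS="$PARAM_ARGS -param $LINE"
done < /data/params/params.txt
# -jobstatus waits for the job and returns its finishing status as the exit code
dsjob -run $PARAM_ARGS -jobstatus myproject myjob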

148.Where does a UNIX script of DataStage execute: on the client machine or on the
server? And if it executes on the server, how?
DataStage jobs are executed on the server machine only. Nothing is stored or
executed on the client machine.

149.What are the default nodes for DataStage Parallel Edition?


Actually the number of nodes depends on the number of processors in your system.
If your system has two processors, we get two nodes by default.

150.I want to process 3 files sequentially, one by one. How can I do that so that the
files are fetched automatically?
If the metadata for all the files is the same, then create a job with the file name as a
parameter, then use the same job in a routine and call the job with a different file
name each time, or you can create a sequence to do this. A minimal shell sketch of the
looping approach follows.
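This sketch assumes the job takes a FILE_NAME parameter; the project, job and file names are hypothetical:

for F in file1.txt file2.txt file3.txt
do
  # -jobstatus waits for each run to finish before the next file is processed
  dsjob -run -param FILE_NAME=$F -jobstatus myproject LoadSourceFile
done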

151.Scenario based question: suppose 4 jobs are controlled by a sequencer (job 1, job 2,
job 3, job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded
into the target table, the rest are not loaded, and the job aborts. How can you sort out
the problem?
If the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem, you
should go to the Director and check what type of problem is being reported: a data
type problem, a warning message, a job failure or a job abort. If the job fails, it usually
indicates a data type problem.

152.What is the Batch Program and how can generate?


A batch program is a program generated at run time and maintained by DataStage
itself, but you can easily change it on the basis of your requirements (Extraction,
Transformation, Loading). Batch programs are generated depending on the nature
of your job.

153.How many places u can call Routines?
Routines can be called from four places: (i) transform functions (e.g. date
transformations, string transformations), (ii) before & after subroutines, (iii) XML
transformations, (iv) web-based transformations.

154.How do you fix the error "OCI has fetched truncated data" in DataStage?
One suggestion is to use the Change Capture stage to identify the truncated data.

155.Importance of Surrogate Key in Data warehousing?


A Surrogate Key is a primary key for a dimension table. Its main importance is that it
is independent of the underlying database, i.e. the surrogate key is not affected by
changes going on in the database.

156.What's the difference between DataStage Developers and DataStage Designers?
What are the skills required for this?
A DataStage developer is the one who codes the jobs. A DataStage designer designs
the job: he deals with the blueprints and designs the jobs and the stages that are
required in developing the code.

157.How do you merge two files in DS?


Either use the copy command as a before-job subroutine if the metadata of the 2 files
is the same, or create a job to concatenate the 2 files into one if the metadata is
different.

158.How do we do the automation of ds jobs?


We can call a DataStage batch job from the command prompt using 'dsjob'. We can
also pass all the parameters from the command prompt. Then call this shell script
from any of the schedulers available on the market. The second option is to schedule
these jobs using the DataStage Director scheduler.

159.What is DS Manager used for - did u use it?


DataStage Manager is used for export and import purposes; the main use of export
and import is sharing jobs and projects from one project to another.
• Manager is the user interface for viewing and editing the contents of the
DataStage Repository.
• Manager allows you to import and export items between different DataStage
systems, or exchange metadata with other data warehousing tools.
• You can also analyze where particular items are used in your project and
request reports on items held in the Repository.

160.What are types of Hashed File?


Hashed files are classified broadly into 2 types: a) Static – subdivided into 17 types
based on the primary key pattern; b) Dynamic – subdivided into 2 types: i) Generic
ii) Specific. The default hashed file is Dynamic – Type 30.

161.How do you eliminate duplicate rows?


Removal of duplicates is done in two ways: 1. use the Remove Duplicates stage, or 2.
use GROUP BY on all the columns used in the SELECT, and the duplicates will go away.

162.What about System variables?
DataStage provides a set of variables containing useful system information that you
can access from a transform or routine. System variables are read-only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char (254).

@IM An item mark, Char (255).

@INROWNUM Input row counter. For use in constraints and derivations in
Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in
Transformer stages.

@LOGNAME The user login name.

@MONTH The current month extracted from the value in @DATE.

@NULL The null value.

@NULL.STR The internal representation of the null value, Char (128).

@PATH The pathname of the current DataStage project.

@SCHEMA The schema name of the current DataStage project.

@SM A subvalue mark (a delimiter used in Universe files), Char(252).

@SYSTEM.RETURN.CODE
Status codes returned by system processes or commands.

@TIME The internal time when the program started. See the Time function.

@TM A text mark (a delimiter used in Universe files), Char (251).

@TRUE The compiler replaces the value with 1.

@USERNO The user number.

@VM A value mark (a delimiter used in Universe files), Char (253).

@WHO The name of the current DataStage project directory.

@YEAR The current year extracted from @DATE.

REJECTED Can be used in the constraint expression of an output link of a
Transformer stage. REJECTED is initially TRUE, but is set to FALSE whenever an
output link is successfully written.

163.What is DS Designer used for - did u use it?


You use the Designer to build jobs by creating a visual design that models the flow
and transformation of data from the data source through to the target warehouse.
The Designer graphical interface lets you select stage icons, drop them onto the
Designer canvas and link them together.

164.What is DS Administrator used for - did u use it?


The Administrator enables you to set up DataStage users, control the purging of the
Repository, and, if National Language Support (NLS) is enabled, install and manage
maps and locales.

165.How will you call external function or subroutine from datastage?


There is a DataStage option to call external programs: ExecSH.

166.How do you pass filename as the parameter for a job?


During job development we can create a parameter 'FILE_NAME', and the value can
be passed while running the job.

167.How to handle date conversions in DataStage? Convert an mm/dd/yyyy format
to yyyy-dd-mm.
We use a) the "Iconv" function – internal conversion, and b) the "Oconv" function –
external conversion. The function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(Fieldname,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")

168.Difference between Hash file and Sequential File?


A hash file stores the data based on a hash algorithm and a key value. A sequential
file is just a file with no key column. A hash file can be used as a reference for a
lookup; a sequential file cannot.
169.How do you rename all of the jobs to support your new File-naming
conventions?
Create an Excel spreadsheet with the new and old names. Export the whole project as a
.dsx. Write a Perl program which can do a simple rename of the strings by looking up
the Excel file. Then import the new .dsx file, probably into a new project for testing.
170.Does the selection of 'Clear the table and Insert rows' in the ODBC stage send
a Truncate statement to the DB or does it do some kind of Delete logic.
There is no TRUNCATE on ODBC stages. 'Clear table' issues a DELETE FROM
statement. On an OCI stage such as Oracle, you do have both Clear and Truncate
options. They are radically different in permissions (Truncate requires you to have
ALTER TABLE permission, whereas Delete doesn't).

171.Tell me one situation from your last project, where you had faced problem and
How did u solve it?
A. The jobs in which data is read directly from OCI stages are running extremely
slow. I had to stage the data before sending to the transformer to make the jobs run
faster.
B. The job aborts in the middle of loading some 500,000 rows. We have the option of
either cleaning/deleting the loaded data and then running the fixed job, or running
the job again from the row at which the job aborted. To make sure the load was
proper, we opted for the former.
The above might raise another question: why do we have to load the dimension
tables first, then the fact tables?
As we load the dimension tables the (primary) keys are generated, and these keys
are foreign keys in the fact tables.

How will you determine the sequence of jobs to load into data warehouse?
First we execute the jobs that load the data into Dimension tables, then Fact tables,
then load the Aggregator tables (if any).

178.What are the command line functions that import and export the DS jobs?
dsimport.exe- imports the DataStage components.
dsexport.exe- exports the DataStage components

179.What is the utility you use to schedule the jobs on a UNIX server other than
using Ascential Director?
Use the crontab utility along with a shell script that invokes the jobs (for example via
the dsjob command) with the proper parameters passed, for example:
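A hedged illustration of a crontab entry that runs such a wrapper script every night at 2 a.m. (the paths and script name are hypothetical); the script itself would source dsenv and call dsjob:

0 2 * * * /home/dsadm/scripts/nightly_load.sh >> /home/dsadm/logs/nightly_load.log 2>&1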

180.How would call an external Java function which are not supported by
DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java
package from Ascential. In this case we can even use the command line to invoke the
Java function and write the return values from the Java program (if any) and use that
files as a source in DataStage job.

181.What will you do in a situation where somebody wants to send you a file and use
that file as an input or reference, and then run the job?
A. Under Windows: use the 'WaitForFileActivity' stage under the Sequencer and then
run the job. Maybe you can schedule the sequence around the time the file is expected
to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job (a
minimal sketch follows).
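A minimal UNIX polling sketch (the file path, project and job names are hypothetical):

# wait for the expected file, checking once a minute
while [ ! -f /data/incoming/source_file.txt ]
do
  sleep 60
done
dsjob -run -jobstatus myproject ProcessIncomingFile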
Read the String functions in DS
Functions like [] -> substring function and ':' -> concatenation operator.
Syntax: string[ [ start, ] length ] or string[ delimiter, instance, repeats ]

182.How did u connect with DB2 in your last project?


Most of the time the data was sent to us in the form of flat files; the data was dumped
and sent to us. In some cases where we needed to connect to DB2 for lookups, we
used ODBC drivers to connect to DB2 (or DB2-UDB) depending on the situation.

182.If worked with DS6.0 and latest versions what are Link-Partitioner and Link-
Collector used for?
Link Partitioner - Used for partitioning the data. Link Collector - Used for collecting
the partitioned data.

183.What are OConv () and Iconv () functions and where are they used?
IConv() - Converts a string to an internal storage format OConv() - Converts an
expression to an output format.

184.Do u know about METASTAGE?
In simple terms, metadata is data about data. MetaStage is Ascential's metadata
management tool; it stores and manages metadata about objects from DataStage
(datasets, sequential files, jobs, etc.) and other tools.
Do you know about INTEGRITY/QUALITY stage?
Integrity/QualityStage is a data quality tool from Ascential which is used to
standardize and integrate the data from different sources.

186.How many jobs have you created in your last project?


100+ jobs for every 6 months if you are in development; if you are in testing, 40 jobs
for every 6 months, although it need not be the same number for everybody.

187.What are XML files, how do you read data from XML files, and which stage is to
be used?
In the palette there are Real Time stages like XML Input, XML Output and XML
Transformer.

188.Suppose there are a million records: did you use OCI? If not, then which stage do
you prefer?
Use Orabulk.

189.How do you pass the parameter to the job sequence if the job is running at
night?
Two ways: 1. Set the default values of the parameters in the Job Sequence and map
these parameters to the job. 2. Run the job from the sequencer using the dsjob utility,
where we can specify the values to be taken for each parameter.

190.What happens if the job fails at night?


The Job Sequence aborts.

191.How do you track performance statistics and enhance it?


Through the Director's Monitor we can view the performance statistics.

192.What is the order of execution done internally in the transformer with the
stage editor having input links on the left hand side and output links?
Stage variables, then constraints, then column derivations or expressions.

193.What are the difficulties faced in using DataStage? Or what are the constraints
in using DataStage?
1) When the number of lookups is large. 2) When the job aborts for some reason
while loading the data.

194.Orchestrate Vs Datastage Parallel Extender?


Orchestrate itself is an ETL tool with extensive parallel processing capabilities,
running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta
version of 6.0) to incorporate the parallel processing capabilities. Ascential then
purchased Orchestrate, integrated it with DataStage XE and released a new version,
DataStage 6.0, i.e. Parallel Extender.

195.Differentiate Primary Key and Partition Key?
A Primary Key is a combination of unique and not null. It can be a collection of key
values, called a composite primary key. A Partition Key is just a part of the Primary Key.
There are several methods of partitioning such as Hash, DB2, Random etc. While
using Hash partitioning we specify the Partition Key.

196.How do you execute datastage job from command line prompt?


Using "dsjob" command as follows. dsjob -run -job status projectname jobname

197.What is the default cache size? How do you change the cache size if
needed?
The default cache size is 256 MB. We can increase it by going into DataStage
Administrator, selecting the Tunables tab and specifying the cache size there.

198.Compare and Contrast ODBC and Plug-In stages?


ODBC: a) Poor Performance. b) Can be used for Variety of Databases. c) Can handle
Stored Procedures. Plug-In: a) Good Performance. b) Database specific.(Only one
database) c) Cannot handle Stored Procedures.

199.How to run a Shell Script within the scope of a Data stage job?
By using "ExcecSH" command at Before/After job properties.

200.Functionality of Link Partitioner and Link Collector?


Link Partitioner: It actually splits data into various partitions or data flows using
various partition methods.
Link Collector: It collects the data coming from partitions, merges it into a single data
flow and loads to target.

201.What is Modulus and Splitting in Dynamic Hashed File?


In a dynamic hashed file the size of the file changes as data is added or removed. The
current number of groups in the file is called the modulus. When the amount of data
increases, groups are split ("splitting") and the modulus increases; when the amount
of data decreases, groups are merged and the modulus decreases.

202.Types of views in Datastage Director?


There are 3 types of views in DataStage Director: a) Job view – dates when jobs were
compiled; b) Status view – status of the job's last run; c) Log view – warning messages,
event messages and program-generated messages.

203.DataStage Tip: Extracting database data 250% faster


An IBM Developerworks article shows how to configure the remote DB2 Enterprise
stage and benchmarks it as 250% faster than a standard API connection.
It’s a useful article as it goes through the complex steps of connecting a parallel
DataStage configuration to a parallel remote DB2 database and it shows some
benchmark timings demonstrating an enterprise stage that is 250% faster than a
standard API stage.
DataStage parallel jobs come with four ways of connecting to the most popular
databases:
Use an Enterprise database stage: provides native parallel connectivity.
Use an API stage: provides native standard Application Programming Interface
connectivity.

Fast Load or Bulk Load: use the native load utility integrated into a DataStage job.
ODBC stage: provides standard or enterprise ODBC connectivity.
One of the trickiest databases to connect to from a DataStage enterprise stage is DB2
and the article

Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition


by authors Ming Wei Xu and Yun Feng Guo shows the steps to this configuration.
Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition
This article provides step-by-step instructions for configuring connectivity to remote
DB2® instances using DB2 Enterprise Stage. In addition, the authors compare the
performance of DB2 API Stage with DB2 Enterprise Stage running in the same
environment.

Introduction
WebSphere DataStage is one of the foremost leaders in the ETL (Extract, Transform,
and Load) market space. One of the great advantages of this tool is its scalability, as it
is capable of parallel processing on an SMP, MPP or cluster environment. Although
DataStage Enterprise Edition (DS/EE) provides many types of plug-in stages to
connect to DB2, including DB2 API, DB2 load, and dynamic RDBMS, only DB2
Enterprise Stage is designed to support parallel processing for maximum scalability
and performance.
The DB2 Data Partitioning Feature (DPF) offers the necessary scalability to distribute
a large database over multiple partitions (logical or physical). ETL processing of a
large bulk of data across whole tables is very time-expensive using traditional plug-
in stages. DB2 Enterprise Stage however provides a parallel execution engine, using
direct communication with each database partition to achieve the best possible
performance.
DB2 Enterprise Stage with DPF communication architecture
Figure 1. DS/EE remote DB2 communication architecture

As you see in Figure 1, the DS/EE primary server can be separate from the DB2
coordinator node. A 32-bit DB2 client still must be installed (unlike typical remote
DB2 access, which requires only the DB2 client for connectivity); it is used to
pre-query the DB2 instance and determine the partitioning of the source or target
table. On the DB2 server, every DB2 DPF partition must have the DS/EE engine
installed. In addition, the DS/EE engine and libraries must be installed in the same
location on all DS/EE servers and DB2 servers.
The following principles are important in understanding how this framework works:
• The DataStage conductor node uses local DB2 environment variables to determine
the DB2 instance.
• DataStage reads the DB2nodes.cfg file to determine each DB2 partition. The
DB2nodes.cfg file is copied from the DB2 server node and can be put in any
location of a sqllib subdirectory on the DS/EE server. The DS/EE environment
variable $APT_DB2INSTANCE_HOME can be used to specify this location of
sqllib.
• DataStage scans the current parallel configuration file specified by the
environment variable $APT_CONFIG_FILE. Each fastname property of this
file must have a match with a node name in DB2nodes.cfg.
• The DataStage conductor node queries the local DB2 instance using the DB2
client to determine table partition information.
• DataStage starts up processes across the ETL and DB2 nodes in the cluster.
The DB2/UDB Enterprise stage passes data to/from each DB2 node through
the DataStage parallel framework, not the DB2 client. The parallel execution
instances can be examined from the job monitor of the DataStage Director.

Environment used in our example


Figure 2. Example topology

In our example, we use 2 machines with the RedHat Enterprise Linux 3.0 operating
system for testing: one with 2 CPUs and 1 GB of memory for the DB2 server, and
another with 1 CPU and 1 GB of memory for the DS/EE server. On the DB2 server we
have 2 database partitions, which are configured via DB2nodes.cfg, while on the
DS/EE server the engine configuration file tells us which nodes are used to execute
DataStage jobs concurrently.
The following are steps we followed to successfully configure remote DB2 instance
using DS/EE DB2 Enterprise Stage. We will begin this exercise from scratch,
including DB2 server configuration, DS/EE installation and configuration.
Installation and configuration steps for the DB2 server
• Install DB2 Enterprise Server Edition (with DPF) and create a DB2 instance
at Stage164 node.

• Configure rsh service and remote authority file.
• Create sample database and check distribution of tables.
• Create DS/EE users on all members of both nodes.
If a DB2 DPF environment is already installed and configured, you can skip steps 1
and 3.
Step 1. Install DB2 Enterprise Server and create DB2 instance at Stage164 node
Check your DB2 version before installing DB2 ESE on Stage164 node. For our
example we used V8.1 fix pack 7. For DPF feature, you must have another separate
license. Pay attention to Linux kernel parameters which can potentially affect DB2
installation. Please follow the DB2 installation guide.
1. Before installation, create DB2 group and DB2 users.
[root@stage164 home]# groupadd db2grp1
[root@stage164 home]# groupadd db2fgrp1
[root@stage164 home]# useradd -g db2grp1 db2inst1
[root@stage164 home]# useradd -g db2fgrp1 db2fenc1
[root@stage164 home]# passwd db2inst1

2. Create the instance. Install DB2, then create the instance using the GUI or
command line. If using the command line, switch to the DB2 install path as the
root user and issue the command below to create one DB2 instance, with the
users created in the previous step as parameters.
[root@stage164 home]# cd /opt/IBM/db2/V8.1/instance/
[root@stage164 instance]# ./db2icrt -u db2fenc1 db2inst1

3. Confirm db2inst1 instance was created successfully. If it failed, please refer to


the official DB2 installation documentation.
[root@stage164 instance]# su - db2inst1
[db2inst1@stage164 db2inst1]$ db2start
05-19-2006 03:56:01 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.

4. Check to confirm that DBM SVCENAME configuration parameter was


configured successfully. If it is not set, the client has no way to connect to the
DB2 server. In addition, the TCPIP communication protocol also must be set.
[db2inst1@stage164 db2inst1]$ db2 get dbm cfg | grep -i svcename
TCP/IP Service name (SVCENAME) = 50000
[db2inst1@stage164 db2inst1]$ db2set DB2COMM=TCPIP

Step 2. Configure remote shell (rsh) service and remote authority file.
For the DPF environment, DB2 needs a remote shell utility to communicate and
execute commands between the partitions. The rsh utility can be used for
inter-partition communication; the OpenSSH utility is another option that provides
secure communication. For simplicity, we will not cover it in this article.
Check whether the rsh server has been installed. If not, download it and issue "rpm -ivh
rsh-server-xx.rpm" to install it.
[root@stage164 /]# rpm -qa | grep -i rsh
rsh-0.17-17
rsh-server-0.17-17

1. Confirm rsh service can be started successfully.
[root@stage164 /]#service xinetd start
[root@stage164 /]#netstat -na | grep 514

2. Create or modify the file that authorizes users to execute remote commands. You
can create (or edit if it already exists) an /etc/hosts.equiv file. The first
column of this file is the machine name, and the second is the instance owner.
For example, the following means only the db2inst1 user has authority to execute
commands on Stage164 using rsh:
Stage164 db2inst1

3. Check whether rsh works correctly or not by issuing the command below as the
db2inst1 user. If the date doesn't show correctly, that means there is still a
configuration problem.
[db2inst1@stage164 db2inst1]$ rsh stage164 date
Thu May 18 23:26:03 CST 2006

Step 3. Create DPF partitions and create sample database


1. Edit the database partition configuration file (DB2nodes.cfg) under
<DB2HOME>/sqllib. In this example, we have 2 logical partitions on
Stage164 host.
0 stage164 0
1 stage164 1

2. Restart DB2 instance and be sure both partitions can be started successfully.
[db2inst1@stage164 db2inst1]$ db2stop force
05-18-2006 23:32:08 0 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful.
[db2inst1@stage164 db2inst1]$ db2start
05-18-2006 23:32:18 1 0 SQL1063N DB2START processing was successful.
05-18-2006 23:32:18 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.

3. Create the sample database and check the data distribution. According to the
results, the total row count of the DEPARTMENT table is 9; 4 of the 9 rows are
distributed into partition 0 and 5 of the 9 into partition 1, according to the
partition key DEPTNO.
[db2inst1@stage164 db2inst1]$ db2sampl
[db2inst1@stage164 db2inst1]$ db2 connect to sample
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department"
1
-----------
9
1 record(s) selected.
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where
dbpartitionnum(deptno)=0"
1
-----------
4
1 record(s) selected.
[db2inst1@stage164 db2inst1]$ db2 "select count(*) from department where

dbpartitionnum(deptno)=1"
1
-----------
5
1 record(s) selected.

Step 4. Create DS/EE users and configure them to access the DB2 database
If DS/EE users and groups have been created on the DS/EE node, then create the
same users and groups on the DB2 server node. In any case, make sure you have the
same DS/EE users and groups on these two machines.
1. Create the DS/EE user/group on the DB2 server. In this example, they are
dsadmin/dsadmin. Also add the DS/EE user to the DB2 instance group.
[root@stage164 home]# groupadd -g 501 dsadmin
[root@stage164 home]# useradd -u 501 -g dsadmin -G db2grp1 dsadmin
[root@stage164 home]# passwd dsadmin

2. Add an entry in /etc/hosts.equiv file which was created in Step 2.3. This
gives dsadmin authority to execute some commands on Stage164.
Stage164 db2inst1
Stage164 dsadmin

3. Add DB2 profile environment variable at <DSEngine_HOME>/.bashrc file


(for example, <DSEngine_HOME> = /home/dsadmin).
. /home/db2inst1/sqllib/db2profile

4. Be sure dsadmin user can connect to sample db successfully.


# su - dsadm

$ db2 connect to sample

Database Connection Information

Database server = DB2/6000 8.2.3


SQL authorization ID = DSADM
Local database alias = SAMPLE
$
Installation and configuration steps for the DS/EE node
1. Install DataStage Enterprise Edition (DS/EE) and the DB2 client.
2. Add the DB2 library and instance home to the DS/EE configuration file.
3. Catalog the sample db to DS/EE using dsadmin.
4. Copy DB2nodes.cfg from the DB2 server to DS/EE and configure the environment variable.
5. NFS configuration: export /home/dsadmin/Ascential/.
6. Verify the DB2 operator library and execute DB2setup.sh and DB2grants.sh.
7. Create or modify the DS/EE configuration file.
8. Restart the DS/EE server.
Now let's walk through the process in detail.
Step 1. Install DataStage Enterprise Edition(DS/EE) and DB2 client
First, DS/EE users and groups need to be created in advance. In this example, the
user is dsadmin, group dsadmin. If DS/EE is not installed, follow the WebSphere
DataStage install guide. We assume the software is installed at the location given by
the DSHOME variable, which is /home/dsadmin/Ascential/DataStage/DSEngine.
Then, install the DB2 client and create one client instance on the DS/EE node.
Step 2. Add DB2 library and instance home at DS/EE configuration file
The dsenv configuration file, located under DSHOME directory, is one of the most
important configuration files in DS/EE. It contains the environment variables and
library path. At this step, we will add DB2 library to LD_LIBRARY_PATH so that
DS/EE engine can connect to DB2.
Note: the PXEngine library should precede the DB2 library in the LD_LIBRARY_PATH
environment variable.
Configure the dsenv file as follows:

PATH=$PATH:/home/dsadmin/Ascential/DataStage/PXEngine/bin:/home/dsadmin/Ascential/DataStage/DSEngine/bin

# for DB2 configuration


DB2DIR=/opt/IBM/db2/V8.1; export DB2DIR
DB2INSTANCE=db2inst1; export DB2INSTANCE
INSTHOME=/home/db2inst1; export INSTHOME
DB2PATH=/opt/IBM/db2/V8.1; export DB2PATH

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/dsadmin/Ascential/DataStage/PXEngine/lib:$DB2DIR/lib:$INSTHOME/sqllib/lib; export LD_LIBRARY_PATH

PATH=$PATH:$INSTHOME/sqllib/bin:$INSTHOME/sqllib/adm; export PATH

You can source this dsenv file from the dsadmin .bashrc file (/home/dsadmin/.bashrc)
to avoid executing it manually every time. All you need to do is exit the dsadmin
user and log on again for it to execute and take effect.
. /home/dsadmin/Ascential/DataStage/DSEngine/dsenv

Step 3. Catalog remote sample db to DS/EE using dsadmin


1. Catalog remote sample database from DB2 EE (Stage164) to DS/EE using
dsadmin.
[dsadmin@transfer dsadmin]$ db2 CATALOG TCPIP NODE stage164 REMOTE
stage164 SERVER 50000
[dsadmin@transfer dsadmin]$ db2 CATALOG DB sample AS samp_02 AT NODE
stage164

2. Configure the rsh utility according to Step 2 of "Installation and configuration
steps for the DB2 server." Be sure the dsadmin user on transfer can execute remote
commands on Stage164 using rsh.
[dsadmin@transfer dsadmin]$ rsh stage164 date
Thu May 19 10:22:09 CST 2006

Step 4. Copy DB2nodes.cfg from DB2 server to DS/EE and configure environment
variable.
Copy the DB2nodes.cfg file from the DB2 server to a directory on the DS/EE server.
This file tells the DS/EE engine how many DB2 partitions there are on the DB2 server.
Then create the environment variable APT_DB2INSTANCE_HOME through the
DataStage Administrator to point to that directory. This variable can be specified at
the project level or the job level.
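As an illustration only (the directory /home/dsadmin/db2 is a hypothetical choice, and rcp is used because rsh is already configured between the hosts), the copy and the variable setting might look like this:

# copy the nodes file from the DB2 server into a sqllib subdirectory on the DS/EE server
mkdir -p /home/dsadmin/db2/sqllib
rcp db2inst1@stage164:/home/db2inst1/sqllib/db2nodes.cfg /home/dsadmin/db2/sqllib/
# then, in DataStage Administrator, add the project (or job) level variable:
# APT_DB2INSTANCE_HOME=/home/dsadmin/db2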
Step 5. NFS configuration, export /home/dsadmin/Ascential/
First, add 2 machine names into /etc/hosts file at both nodes to identify one
another’s network name. Then, share the DS/EE whole directory to the DB2 server
so that each partition can communicate with DS/EE.
1. At DS/EE node, export /home/dsadmin/Ascential directory. This can be
done by adding an entry in /etc/exports file, it will allow users from stage164
machine to mount /home/dsadmin/Ascential directory with read/write
authority.
/home/dsadmin/Ascential stage164(rw,sync)

2. Once you have changed the /etc/exports file, you must notify the NFS daemon to
reload the changes, or you can stop and restart the NFS service by issuing the
following commands:
[root@transfer /]# service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]

3. Then at the DB2 server, create a directory called /home/dsadmin/Ascential (the
same path as on the DS/EE server), then mount this directory on the remote
DS/EE directory.
[root@stage164 home]# mount -t nfs -o rw transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential
You can check mounted files as follows:
[root@stage164 home]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 7052464 3491352 3202868 53% /
transfer:/home/dsadmin/Ascential
7052464 6420892 273328 96% /home/dsadmin/Ascential

To avoid mounting it every time the machine restarts, you can also add this entry to
the /etc/fstab file to mount the directory automatically:
transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential nfs defaults 0 0

Step 6. Verify DB2 operator library and execute DB2setup.sh and DB2grants.sh
1. Execute the DB2setup.sh script, which is located in $PXHOME/bin. Note that for
remote DB2 instances you may have a problem; you will need to change the
connect userid and password.
db2 connect to samp_02 user dsadmin using passw0rd
db2 bind ${APT_ORCHHOME}/bin/db2esql.bnd datetime ISO blocking all grant
public
# this statement must be run from /instance_dir/bnd
cd ${INSTHOME}/sqllib/bnd
db2 bind @db2ubind.lst blocking all grant public
db2 bind @db2cli.lst blocking all grant public

db2 connect reset
db2 terminate

2. Execute DB2grants.sh
db2 connect to samp_02 user dsadmin using passw0rd
db2 grant bind, execute on package dsadm.db2esql to group dsadmin
db2 connect reset
db2 terminate

Step 7. Create or modify DS/EE configuration file


DS/EE provides parallel engine configuration files. DataStage learns about the shape
and size of the system from the configuration file. It organizes the resources needed
for a job according to what is defined in the configuration file. The DataStage
configuration file needs to contain the node on which DataStage and the DB2 client
are installed and the nodes of the remote computer where the DB2 server is installed.
The following is one example. For more detailed information on the engine
configuration file, please refer to the "Parallel Job Developer's Guide."
{
node "node1"
{
fastname "transfer"
pools ""
resource disk "/home/dsadmin/Ascential/DataStage/Datasets" {pools ""}
resource scratchdisk "/home/dsadmin/Ascential/DataStage/Scratch" {pools
""}
}
node "node2"
{
fastname "stage164"
pools ""
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
}

Step 8. Restart DS/EE server and test connectivity


At this point you have completed all configurations on both nodes. Restart DS/EE
server by issuing the commands below:
[dsadmin@transfer bin]$ uv -admin -stop
[dsadmin@transfer bin]$ uv -admin -start

Note: after stopping the DS/EE engine, you need to exit dsadmin and re-logon, and
the dsenv configuration file will be executed. Also, be sure the time interval between
stop and start is longer than 30 seconds in order for the changed configuration to
take effect.
Next, we will test remote connectivity using DataStage Designer. Choose Import
plug-in table definition. The following window will appear. Click Next. If it imports
successfully, that means the remote DB2 connectivity configuration has succeeded.

Figure 3. DB2 Enterprise stage job

Develop one DB2 enterprise job on DS/EE


In this part, we will develop one parallel job with DB2 Enterprise Stage using
DataStage Designer. This job is very simple because it just demonstrates how to
extract DB2 department table data to one sequential file.

Figure 4. Import DSDB2 Meta Data

Double-click the DB2 Enterprise stage icon, and set the following properties to the
DB2 Enterprise stage. For detailed information, please reference the "Parallel job
developer’s guide."

Figure 5. DS/EE DB2 Enterprise stage properties

• Client Instance Name: Set this to the DB2 client instance name. If you set this
property, DataStage assumes you require remote connection.
• Server: Optionally set this to the instance name of the DB2 server. Otherwise
use the DB2 environment variable, DB2INSTANCE, to identify the instance
name of the DB2 server.
• Client Alias DB Name: Set this to the DB2 client's alias database name for the
remote DB2 server database. This is required only if the client's alias is
different from the actual name of the remote server database.
• Database: Optionally set this to the remote server database name. Otherwise
use the environment variables APT_DBNAME or APT_DB2DBDFT to
identify the database.
• User: Enter the user name for connecting to DB2. This is required for a remote
connection.
• Password: Enter the password for connecting to DB2. This is required for a
remote connection.
Add the following two environment variables into this job via DataStage Manager.
APT_DB2INSTANCE_HOME defines DB2nodes.cfg location, while
APT_CONFIG_FILE specifies the engine configuration file.

Figure 6. Job properties set

Performance comparison between Enterprise Stage and API Stage


In this part, we will execute the jobs developed above by DataStage Director and
compare the performance between DS/EE Enterprise Stage and API Stage. The
following is another job with DB2 API Stage.

Figure 7. DB2 API stage

To generate a quantity of test data, we created the following stored procedure:

CREATE PROCEDURE insert_department( IN count INT )


language sql
begin
declare number int;
declare str varchar(10);
declare deptno char(10);
set number=1;

while ( number <= count )
do
set deptno=char( mod(number, 100) );
insert into department values( deptno, 'deptname', 'mgr', 'dep', 'location');
if( mod(number, 2000)=0) then
commit;
end if;
set number=number+1 ;
end while;
end@

Execute the stored procedure:


db2 -td@ -f insert_department.sql
db2 "call insert_department( 5000000 )"

Then, we execute these 2 jobs against 100,000, 1M and 5M rows via DataStage
Director and observe the result using the job monitor. The following screenshots are
test results with DB2 Enterprise Stage and DB2 API Stage.

Figure 8. 100,000 records (DB2 Enterprise Stage)

Figure 9. 100,000 records (DB2 API stage)

Figure 10. 1,000,000 records (DB2 Enterprise Stage)

Figure 11. 1,000,000 records (DB2 API Stage)

Figure 12. 5,000,000 records (DB2 Enterprise Stage)

Figure 13. 5,000,000 records (DB2 API Stage)

Figure 14. Compare performance between Enterprise Stage and API Stage

DB2 Enterprise Stage has great parallel performance compared with the other DB2
plug-in stages in a DB2 DPF environment; however, it requires that the hardware and
operating system of the ETL server and the DB2 nodes be the same. Consequently,
it's not a replacement for other DB2 plug-in stages, especially in heterogeneous
environments.

204.How to release Locks using Administrator

1.Login to DataStage Administrator as an administrative user.


2.Go to the Command Prompt for that Project.
3.Look at the locks by executing: LIST.READU
4.The last column is the item ID. Look for the item ID with the job name; this is the
record you will want to unlock. Note the inode or user number, whichever
uniquely identifies the record.
5.The unlock command is not available by default in the projects. To create an entry
to use unlock execute the following two commands:
SET.FILE UV VOC UV.VOC
COPY FROM UV.VOC TO VOC UNLOCK
6.To unlock by either INODE or User No use the commands:
UNLOCK INODE inodenumber ALL
UNLOCK USER usernumber ALL

205.How To: Study Guide for DataStage Certification

This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification
exam.

Regular readers may be feeling a sense of déjà vu. Haven't we seen this post before? I
originally posted this in 2006 and this is the Director's Cut - I've added some deleted
scenes, a commentary for DataStage 7.0 and 8.0 users and generally improved the
entry. By reposting I retain the links from other sites such as my DataStage
Certification Squidoo lens with links to my certification blog entries and IBM
certification pages.

This post shows all of the headings from the IBM exam Objectives and describes how
to prepare for that section.

Before you start, work out how you add environment variables to a job as job
parameters, as they are handy for exercises and testing. See the DataStage Designer
Guide for details.

Section 1 - Installation and Configuration

Versions: Version 7.5.1 and 7.5.2 are the best to study and run exercises on. Version
6.x is risky but mostly the same as 7. Version 8.0 is no good for any type of
installation and configuration preparation as it has a new approach to installation
and user security.

Reading: Read the Installation and Upgrade Guide for DataStage, especially the
section on parallel installation. Read the pre-requisites for each type of install such as
users and groups, the compiler, project locations, kernel settings for each platform.
Make sure you know what goes into the dsenv file. Read the section on DataStage for
USS as you might get one USS question. Do a search for threads on dsenv on the
dsxchange forum to become familiar with how this file is used in different
production environments.

Exercise: installing your own DataStage Server Enterprise Edition is the best exercise
- getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the
DataStage Administrator and create some users and roles and give them access to
DataStage functions.

Section 4 - Parallel Architecture (10%)

Section 9 - Monitoring and Troubleshooting (10%)

I've moved sections 4 and 9 up to the front as you need to study them before you run
exercises and read about parallel stages in the other sections. Understanding how to
use and monitor parallel jobs is worth a whopping 20%, so it's a good one to know
well.

Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best
definition of the parallel architecture with better diagrams.

Reading: Parallel Job Developers Guide opening chapters on what the parallel engine
and job partitioning is all about. Read about each partitioning type. Read how
sequential file stages partition or repartition data and why datasets don’t. The
Parallel Job Advanced Developers Guide has sections on environment variables to
help with job monitoring, read about every parameter with the word SCORE in it.

The DataStage Director Guide describes how to run job monitoring - use the right
mouse click menu on the job monitor window to see extra parallel information.

Exercises: Turn on various monitoring environment variables such as


APT_PM_SHOW_PIDS and APT_DUMP_SCORE so you can see what happens
during your exercises. It shows you what really runs in a job - the extra processes
that get added across parallel nodes.

Try creating one node, two node and four node config files and see how jobs behave
under each one. Try the remaining exercises on a couple different configurations by
adding the configuration environment variable to the job. Try some pooling options.
I have to admit I guessed my way through some of the pooling questions as I didn't
do many exercises.

Generate a set of rows into a sequential file for testing out various partitioning types.
One column with unique ids 1 to 100 and a second column with repeating codes such
as A, A, A, A, A, B, B, B, B, B etc. Write a job that reads from the input, sends it
through a partitioning stage such as a transformer and writes it to a peek stage. The
Director log shows which rows went where. You should also view the Director
monitor and expand and show the row counts on each instance of each stage in the
job to see how stages are split and run on each node and how many rows each
instance gets.
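A quick way to knock up such a test file on the server (assuming a Unix shell with seq and awk available; the output path is arbitrary):

# column 1: unique ids 1-100; column 2: repeating codes A,A,A,A,A,B,B,B,B,B,...
seq 1 100 | awk '{ printf "%d,%s\n", $1, substr("ABCDE", int(($1-1)/5)%5+1, 1) }' > /tmp/partition_test.txt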

Use a filter stage to split the rows down two paths and bring them back together
with a funnel stage, then replace the funnel with a collector stage. Compare the two.

Test yourself on estimating how many processes will be created by a job and check
the result after the job has run using the Director monitor or log messages. Do this
throughout all your exercises across all sections as a habit.
Section 2 - Metadata
Section 3 - Persistent Storage

I’ve merged these into one. Both sections talk about sequential files, datasets, XML
files and Cobol files.

Versions: you can study this on DataStage 6, 7 or 8 as it is a narrow focus on


DataStage Designer metadata. DataStage 8 will have slightly different CFF options
but not enough to cause a problem.

Reading: read the section in the DataStage Developers Guide on Orchestrate schemas
and partial schemas. Read the plugin guide for the Complex Flat File Stage to
understand how Cobol metadata is imported (if you don’t have any cobol copybooks
around you will just have to read about them and not do any exercises). Quickly scan
through the NLS guide - but don’t expect any hard questions on this.

Exercises: Step through the IBM XML tutorial to get the tricky part on reading XML
files. Find an XML file and do various exercises reading it and writing it to a
sequential file. Switch between different key fields to see the impact of the key on the
flattening of the XML hierarchy. Don’t worry too much about XML Transform.

Import from a database using the Orchestrate Import and the Plugin Import and
compare the table definitions. Run an exercise on column propagation using the Peek

stage where a partial schema is written to the Peek stage to reveal the propagated
columns.

Create a job using the Row Generator stage. Define some columns; on the Columns
tab double-click a column to bring up the advanced column properties. Use some
properties to generate values for different data types. Get to know the advanced
properties page.

Create a really large sequential file, dataset and fileset and use each as a reference in
a lookup stage. Monitor the Resource and Scratch directories as the job runs to see
how these lookup sources are prepared prior to and during a job run. Get to know
the difference between the lookup fileset and other sources for a lookup stage.

Section 5 - Databases (15%)

One of the more difficult topics if you get questions for a database you are not
familiar with. I got one database parallel connectivity question that I still can’t find
the answer to in any of the manuals.

Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not
have enough database stages and DataStage 8 has a new approach to database
connections.

Reading: read the plugin guide for each enterprise database stage: Oracle, SQL
Server, DB2 and ODBC. In version 8 read the improved Connectivity Guides for
these targets. If you have time you can dig deeper, the Parallel Job Developers Guide
and/or the Advanced Developers Guide has a section on the
Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface libraries. Look for the
section called "Operator action" and read it for each stage. It’s got interesting bits like
whether the stage can run in parallel, how it converts data and handles record sizes.

Exercise: Add each Enterprise database stage to a parallel job as both an input and
output stage. Go in and fiddle around with all the different types of read and write
options. You don’t need to get a connection working or have access to that database,
you just need to have the stage installed and add it to your job. Look at the
differences between insert/update/add/load etc. Look at the different options for
each database stage. If you have time and a database try some loads to a database
table.

Section 6 - Data Transformation (15%)

If you’ve used DataStage for longer than a year this is probably the topic you are
going to ace - as long as you have done some type of stage variable use.

Versions: should be okay studying on versions 6, 7 or 8. Transformation stages such


as Transformer, Filter and Modify have not changed much.

Reading: there is more value in using the transformation stages than reading about it.
It’s hard to read about it and take it in as the Transformer stage is easier to navigate
and understand if you are using it. If you have to make do with reading then visit the
dsxchange and look for threads on stage variables, the FAQ on the parallel number
generator, removing duplicates using a transformer and questions in the parallel
forum on null handling. This will be better than reading the manuals as they will be

113
full of practical examples. Read the Parallel Job Advanced Developers Guide section
on "Specifying your own parallel stages".

Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and
Filter Stage.

Create some mixed up source data with duplicate rows and a multiple field key. Try
to remove duplicates using a Transformer with a sort stage and combination of stage
variables to hold the prior row key value to compare to the new row key value.

Process some data that has nulls in it. Use the null column in a Transformer
concatenate function with and without a nulltovalue function and with and without
a reject link from the Transformer. This gives you an understanding of how rows get
dropped and/or trapped from a transformer. Explore the right mouse click menu in
the Transformer, output some of the DS Macro values and System Variables to a
peek stage and think of uses for them in various data warehouse scenarios. Ignore
DS Routine, it’s not on the test.

Don’t spend much time on the Modify stage - it would take forever memorizing
functions. Just do an exercise on handle_null, string_trim and convert string to
number. Can be tricky getting it working and you might not even get a question
about it.

Section 7 - Combining and Sorting Data (10%)

Section 10 - Job Design (10%)

I’ve combined these since they overlap. Don’t underestimate this section, it covers a
very narrow range of functionality so it is an easy set of questions to prepare for and
get right. There are easy points on offer in this section.

Versions: Any version 7.x is best, version 6 has a completely different lookup stage,
version 8 can be used but remember that the Range lookup functionality is new.

Reading: The Parallel Job Developers Guide has a table showing the differences
between the lookup, merge and join stages. Try to memorize the parts of this table
about inputs and outputs and reject links. This is a good place to learn about some
more environment variables. Read the Parallel Job Advanced Developers Guide
looking for any environment variables with the word SORT or SCORE in them.

Exercises: Compare the way join, lookup and merge work. Create a job that switches
between each type.

Add various combinations of the SORT and COMBINE_OPERATORS environment


variables to your job and examine the score of the job to see what sorts get added to
your job at run time. A simple job with a Remove Duplicates or Join stage and no
sorting will add a sort into the job - even if the data comes from a sorted database
source. Use the SORT variables to turn off this sort insertion. See what happens to
Transformer, Lookup and Copy stages when COMBINE_OPERATORS is turned on
or off using the SCORE log entry.

Create an annotation and a description annotation and explore the differences


between the two. Use the Multiple Compile tool in the DataStage Manager (version
6, 7) or Designer (version 8). Create and use shared containers.

Section 8 - Automation and Production Deployment (10%)

Versions: most deployment methods have remained the same from version 6, 7 and
8. Version 8 has the same import and export functions. DataStage 8 parameter sets
will not be in the version 7 exam.

Reading: you don’t need to install or use the Version Control tool to pass this section
however you should read the PDF that comes with it to understand the IBM
recommendations for deployment. It covers the move from dev to test to prod. Read
the Server Job Developers Guide section on command line calls to DataStage such as
dsjob and dssearch and dsadmin.

Exercises: practice a few dsjob and dssearch commands.
