Datastage


1. Drag the stage icons onto the screen.

2. Link the stage icons, and name the stages and links.

Name passive stages with the table/file names they access.


Name active stages to match their function.
Name links to express the direction and type of data flowing through them.

3. Set the properties of the passive stage icons.


4. Set the properties of the active stage icons.
5. Performance tune the job.

Remove unused columns from transforms. This does not apply to columns in
sequential files or output to hash files.

6. Compile the job.


7. Validate the job (in DataStage Director).
8. Run (test) the job (in DataStage Director).

Active Stages
Eliminate unused columns.
Eliminate unused references.
In derivations, if possible, instead of calling routines, move the code into the
derivation. This eliminates the overhead of the procedure call.
Input Links
Use ODBC stage to access relational tables.

Move constraints from Transform stages to input stage WHERE clauses, if
possible, to reduce the number of rows the job has to process.

Output Links
Use OCI plugin stages to access relational tables if available.

Adjust the rows per transaction setting. Try 1000, 5000, or 10,000.

Adjust the array size setting. Try 10, 100, or 1000.

If output rows are INSERTs or APPENDs, not UPDATEs, consider using a native
bulk loader. Direct output to a sequential file compatible with the bulk loader,
then invoke the bulk loader using an after-job subroutine. The bulk loader for
Oracle is SQLLDR.
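
As a rough sketch of that last step, not taken from the original text, an after-job subroutine such as ExecSH (or a small wrapper script) could invoke SQLLDR with a command along these lines; the login, control file and data file names are placeholders only:

    sqlldr userid=dw_user/dw_pass@DWDB control=load_fact.ctl data=fact_rows.dat log=load_fact.log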

Reference Lookups
Compare the number of input rows with the number of rows in the referenced
table. If the referenced table is smaller than the number of input rows, pre-load the
reference table into a hashed file and then reference the hashed file.
Consider moving reference lookups to a join within the input stage. All columns
used to join the tables should be indexed to maximize performance.

If the number of rows in a hashed file is small, consider pre-loading the file into
memory by checking the pre-load into memory checkbox in the Hashed File
stage.

Loading Fact Tables


When populating fact tables, the value of foreign keys must be known before a fact row
can be inserted. These foreign key values, primary key values in dimension tables, may
not be initially known. This most often is the case when new dimension rows are
discovered, and your warehouse design requires that dimension table primary key values
be automatically assigned by the database. To overcome this situation, a multi-step
design must be applied to processing fact rows.

Processing fact rows in a multi-step manner exploits the unique capabilities of DataStage
to integrate operating system and database features within a DataStage job.

1. Process fact rows without regard to dimension key values, instead retaining
dimension column values. These dimension column values will later be used to
determine dimension key values.
2. For each dimension table, create a temporary dimension table in your database
whose structure is similar to the dimension table.
3. Populate the temporary dimension tables using the retained dimension column
values from step 1, setting the dimension key column value to NULL.
4. Join the temporary dimension tables with the dimension tables, updating the
dimension key column in each temporary dimension table.
5. For all rows in the temporary dimension tables with a dimension key column
value of NULL, insert the row into its dimension table.
6. Join the temporary dimension tables with the dimension tables for all rows in the
temporary dimension tables with a dimension key column value of NULL,
updating the dimension key column in each temporary dimension table.
7. Create a hash file for each temporary dimension table whose key columns are all
columns other than the dimension key value.
8. Populate the temporary dimension hash files with the rows from the temporary
dimension tables.
9. Process the fact rows created in step 1, performing reference lookups to the
temporary dimension hash files, resolving the dimension key values, and creating
an output file compatible with your database’s bulk loader (e.g. SQLLDR).
10. Execute your database’s bulk loader using the file created in step 9 as input.

The implementation of this multi-step process is simpler than its description. The entire
process can be implemented as three DataStage jobs and two database scripts.
 Step 1 of the process is simply a DataStage job with multiple outputs, one for the
fact table, and one for each (temporary) dimension table.
 Steps 2 through 6 are implemented as a single database script that is executed as
an after-job stage of the step 1 DataStage job (a minimal invocation sketch appears
after this list).
 Steps 7 and 8 represent a single DataStage job with an independent active stage to
load each hash dimension table from the temporary dimension table.
 Step 9 is another simple DataStage job with a single input and output, and
multiple reference lookups.
 Step 10 is another database script that is executed as an after-job stage of the step 9
DataStage job.
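
The after-job database script for steps 2 through 6 might be invoked like this (a minimal sketch; sqlplus is assumed because the bulk loader example is Oracle's SQLLDR, and the login and script name are placeholders):

    #!/bin/sh
    # After-job step for the step 1 job: run the temporary-dimension
    # update script covering steps 2 through 6
    sqlplus -s dw_user/dw_pass@DWDB @update_temp_dims.sql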

A Global View of Reference Lookups


This document represents work in progress.

Reference lookups are typically thought of on a per-job basis. It is assumed that good
DataStage development practices are being used, utilizing hash files for reference
lookups, or where possible, performing joins in the input stage of a job to offload this
burden onto the database server.

In most cases, this is satisfactory. However, large and/or time critical ETL applications
may find it advantageous to revisit the topic of reference lookups, and apply a more
global view. In this global view, four factors should be considered:

1. Data Source
2. Number of rows in source
3. Total number of rows retrieved
4. Total number of unique rows retrieved

How to check &PH&


How many times have you heard, "Is there anything in &PH&?"? That many. The
following is everything you ever wanted to know about &PH&, and then some.

When a DataStage job runs, it creates one or more phantom processes on your DataStage
server. Generally, one phantom process for the job, and one for each active stage within
the job. Each phantom process has its own log file that records information about the
process's execution. This information may be useful for debugging problems.

Log files are created in the folder &PH&. A &PH& folder exists in each DataStage
project folder. A DataStage job phantom will create a log file in the &PH& folder named
with the prefix DSD.RUN_; a DataStage active stage will create a log file in the &PH&
folder named with the prefix DSD.StageRun_. All log files end with a time and date
suffix. The time is seconds since midnight, and the date is a Universe Julian date. These
dates and times are usually close to those found in DataStage Director on the Control
event Starting job ...

A useful tool is to create a routine in DataStage Manager that will suggest a log file
name. The source of this routine is:

Ans = "DSD.RUN_":Iconv(TheTime,"MTS"):"_":Iconv(TheDate,"D-
YMD[4,2,2]")

TheDate and TheTime are the routine arguments. Get the job's run date and time from
DataStage Director, then use the test button for this routine in DataStage Manager to
compute a suggested log file name.

Another useful piece of information is the DataStage job's number. An easy way to find a
job number is with DataStage Administrator. In DataStage Administrator, select a project
on the Projects tab, then press the Command button. Enter the command:

LIST DS_JOBS JOBNO WITH NAME = "your job name"

Press the Execute button, and your job's name and number will be displayed.

Armed with a job's name, run date, and run time, and a suggested log file name, finding
the actual log file depends on the operating system of your DataStage server. The
following steps assume a workstation with capabilities similar to Windows NT and My
Computer or Explorer.

For NT servers, and Unix servers with SAMBA:


1. Map the folder containing the DataStage project to a drive on your workstation.
2. Open the &PH& folder in the project folder.
3. Change the view to arrange the items in the folder by date.
4. Now do a little pointing and clicking until you find the specific log file. It should
begin with the line:

DataStage Job your job number Phantom a phantom number

For other Unix servers:

To be determined.

1. Telnet to your DataStage server.


2. Change directory to the &PH& folder in the project folder:

cd \&PH\&
3. Using find and grep commands, locate the job's log files:

find . -type f -exec grep -l "DataStage Job #" {} \;

Replace the pound sign with the job number.

4. Use the view command to review each file. To exit the view command, enter :q.

view path name from the previous step
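
If you prefer a single command sequence covering steps 2 and 3, something along these lines lists the newest log files and then greps them for the job number (1234 is a placeholder for your job number):

    cd '&PH&'
    ls -t | head -20
    grep -l "DataStage Job 1234" DSD.RUN_* DSD.StageRun_*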

A Few Notes About Handling Dates


Following are some things I have learned about handling dates:

 When passing dates to the Pivot Stage, they should be in internal format.
 When passing dates to the Informix CLI Stage, they should be in internal format.
 Dates and timestamps to and from the ODBC stage should be in external format,
YYYY-MM-DD or YYYY-MM-DD hh:mm:ss.sss respectively.

Other Notes About DataStage


Following are other things I have discovered about DataStage:

 In my opinion, DataStage is a great ETL tool, but any tool is only as good as its
support. Ascential Software provides great customer support.
 Seriously consider using a disk defragment utility on NT DataStage Servers.
 When installing DataStage, plan the location of the project and UVTEMP
directories.
 When you create a routine in DataStage, it is cataloged in Universe with the
prefix "DSU." followed by the routine name.
 Output to a hash file writes a row (a Universe write). It does not update individual
columns (a Universe writev). Non-key columns are written as sequential hash file
fields (e.g. the first non-key column is field 1, the second is field 2, ...).
 The TCL command "CONFIG ALL" displays the current value of all uvconfig
parameters.
 The TCL command "ANALYZE.SHM -d" displays the contents of the Universe
dynamic file descriptor table.
 The &PH& folder fills over time. You may notice that the time between when a
job says it is finishing, and when it actually ends, increases. This may be a
symptom of a too full &PH& folder. To correct this, periodically delete the files
in the &PH& folder. One way to do this is in DataStage Administrator, select the
projects tab, click your project, then press the Command button, enter the
command CLEAR.FILE &PH&, and press the execute button. Another way is
to create a job with the command: EXECUTE "CLEAR.FILE &PH&" on the
job control tab of the job properties window. You may want to schedule this job
to run weekly, but at a point in your production cycle where it will not delete data
critical to debugging a problem. &PH& is a project level folder, so this job should
be created and scheduled in each project.

http://www.anotheritco.com/tools/download_list_routines.htm

Just wanted to know what is the difference between a Filter stage and a Switch stage.

As far as I know, both do the same job.

A few major differences:

- The Filter stage can have any number of output links, whereas the Switch stage is limited to a
max of 128 links.
- The Filter stage can optionally have a reject link; the Switch stage requires a reject link.
- A Switch stage is like the C switch statement. It goes through all the cases and if no
cases are met, it goes for the default value, which is specified by the reject link.

THE major difference is that the Switch stage operates on values, while the Filter stage
operates on WHERE expressions, to determine which rows are sent along each output link.

In Filter, a single input record can pass through one or many output links, as long as the record
satisfies each link's criteria.
In Switch, a single input record can ONLY pass through one of the many output links.

Code:

Field(in.Col[INDEX(in.Col,"V",1), Len(in.Col) - INDEX(in.Col,"V",1)+1]," ",1)

Trim(Arg1[Index(Arg1,"V-",1),4])

Here's how the data looks. The string has different formats namely

1. "5.5 L T-spark 24-Valve"

2. "24-Valve Al V-8"

3. "3.5 L V6 24-Valve"
4. "24-Valve Al V12"

5. "5.5 L Twin-Spark"

I want to populate "V-?" (V-numeric say V-6 or V-8 or V-12) I want to populate
"Null" for 1st and 5th formats.

If I use the following logic I m unable to populate correct value for the 3rd format
as the "V6" occurrence is 1.

Code:

If Index(DSLink2.ENGINE_DESCRIPTION,"V-",2)
   Then Right(DSLink2.ENGINE_DESCRIPTION,4)
   Else (If Index(DSLink2.ENGINE_DESCRIPTION,"V",2)
         Then Right(DSLink2.ENGINE_DESCRIPTION,3)
         Else 'Null')

If (in.Col Matches "...V-0N1N..." OR in.Col Matches "...V0N1N...")
   Then Field(EREPLACE(in.Col, "Valve", "")[INDEX(EREPLACE(in.Col, "Valve", ""),"V",1),
              Len(EREPLACE(in.Col, "Valve", "")) - INDEX(EREPLACE(in.Col, "Valve", ""),"V",1)+1]," ",1)
   Else @NULL

How to find the age:

YearFromDate(CurrentDate())-YearFromDate(DOB)

http://www.anotheritco.com/dswp.htm

mv myfixednamefile.txt filewithdateandtime`date +"%Y%m%d_%H%M%S"`.txt


UNIX Shell Scripting
These notes teach you how to write and run Bourne shell scripts on any UNIX computer.

What do you need to know to follow along? This was originally written as a second class
in UNIX. The first class taught how to use the basic UNIX commands (like sed, grep and
find) and this class teaches how to combine these tools to accomplish bigger tasks.

In addition to the material in this course you might be interested in the Korn shell (ksh)
and the Bourne again shell (bash), both of which are excellent shells that enhance the
original Bourne shell. These alternate shells are upwardly compatible with the Bourne
shell, meaning that a script written for sh can run in ksh or bash. However, there are
additional features in bash and ksh that are not available in the Bourne shell.

The focus of this guide is to get you to understand and run some Bourne shell scripts. On
several pages there are example scripts for you to run. On most of these pages there is a
link you can click on (with the right mouse button) and download the script to your
computer and run it.

You will learn several things:

 Ability to automate tasks, such as


o Software install procedures
o Backups
o Administration tasks
o Periodic operations on a database via cron
o Any repetitive operations on files
 Increase your general knowledge of UNIX
o Use of environment
o Use of UNIX utilities
o Use of features such as pipes and I/O redirection

For example, I recently wrote a script to make a backup of one of the subdirectories
where I was developing a project. I quickly wrote a shell script that uses /bin/tar to create
an archive of the entire subdirectory, copies it to one of our backup systems at my
computer center, and stores it under a subdirectory named for today's date.
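
A minimal sketch of such a backup script, with the directory, backup host and path names invented purely for illustration, might look like this:

    #!/bin/sh
    # Archive a project directory with tar and copy it to a backup host,
    # storing it under a subdirectory named for today's date
    today=`date +%Y%m%d`
    archive=$HOME/project_$today.tar
    tar cf $archive $HOME/project
    rsh backuphost "mkdir -p /backups/$today"
    rcp $archive backuphost:/backups/$today/
    rm -f $archive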

As another example, I have some software that runs on UNIX that I distribute and people
were having trouble unpacking the software and getting it running. I designed and wrote
a shell script that automated the process of unpacking the software and configuring it.
Now people can get and install the software without having to contact me for help, which
is good for them and good for me, too!

For shell script experts one of the things to consider is whether to use the Bourne shell (or
ksh or bash), the C shell, or a richer scripting language like perl or python. I like all these
tools and am not especially biased toward any one of them. The best thing is to use the
right tool for each job. If all you need to do is run some UNIX commands over and over
again, use a Bourne or C shell script. If you need a script that does a lot of arithmetic or
string manipulation, then you will be better off with perl or python. If you have a Bourne
shell script that runs too slowly then you might want to rewrite it in perl or python
because they can be much faster.

Historically, people have been biased toward the Bourne shell over the C shell because in
the early days the C shell was buggy. These problems are fixed in many C shell
implementations these days, especially the excellent 'T' C shell (tcsh), but many still
prefer the Bourne shell.

There are other good shells available. I don't mean to neglect them but rather to talk about
the tools I am familiar with.

If you are also interested in learning about programming in the C shell, I have a
comparison between features of the C shell and the Bourne shell.

Section 1: Review of a few Basic UNIX Topics


Shell scripting involves chaining several UNIX commands together to accomplish a task.
For example, you might run the 'date' command and then use today's date as part of a file
name. I'll show you how to do this below.

Some of the tools of the trade are variables, backquotes and pipes. First we'll study these
topics and also quickly review a few other UNIX topics.

Variables
 Topics covered: storing strings in variables
 Utilities covered: echo, expr
 To try the commands below start up a Bourne shell:
 /bin/sh
 A variable stores a string (try running these commands in a Bourne shell)
 name="John Doe"
 echo $name
 The quotes are required in the example above because the string contains a special
character (the space)
 A variable may store a number
 num=137
 The shell stores this as a string even though it appears to be a number
 A few UNIX utilities will convert this string into a number to perform arithmetic
 expr $num + 3
 Try defining num as '7m8' and try the expr command again
 What happens when num is not a valid number?
 Now you may exit the Bourne shell with
 exit
I/O Redirection
 Topics covered: specifying the input or capturing the output of a command in a
file
 Utilities covered: wc, sort
 The wc command counts the number of lines, words, and characters in a file
 wc /etc/passwd
 wc -l /etc/passwd
 You can save the output of wc (or any other command) with output redirection
 wc /etc/passwd > wc.file
 You can specify the input with input redirection
 wc < /etc/passwd
 Many UNIX commands allow you to specify the input file by name or by input
redirection
 sort /etc/passwd
 sort < /etc/passwd
 You can also append lines to the end of an existing file with output redirection
 wc -l /etc/passwd >> wc.file

Backquotes
 Topics covered: capturing output of a command in a variable
 Utilities covered: date
 The backquote character looks like the single quote or apostrophe, but slants the
other way
 It is used to capture the output of a UNIX utility
 A command in backquotes is executed and then replaced by the output of the
command
 Execute these commands
 date
 save_date=`date`
 echo The date is $save_date
 Notice how echo prints the output of 'date', and gives the time when you defined
the save_date variable
 Store the following in a file named backquotes.sh and execute it (right click and
save in a file)
 #!/bin/sh
 # Illustrates using backquotes
 # Output of 'date' stored in a variable
 Today="`date`"
 echo Today is $Today
 Execute the script with
 sh backquotes.sh
 The example above shows you how you can write commands into a file and
execute the file with a Bourne shell
 Backquotes are very useful, but be aware that they slow down a script if you use
them hundreds of times
 You can save the output of any command with backquotes, but be aware that the
results will be reformatted into one line. Try this:
 LS=`ls -l`
 echo $LS

Pipes
 Topics covered: using UNIX pipes
 Utilities covered: sort, cat, head
 Pipes are used for post-processing data
 One UNIX command prints results to the standard output (usually the screen), and
another command reads that data and processes it
 sort /etc/passwd | head -5
 Notice that a pipe like the following can be simplified
 cat /etc/passwd | head -5
 You could accomplish the same thing more efficiently with either of the two
commands:
 head -5 /etc/passwd
 head -5 < /etc/passwd
 For example, this command displays all the files in the current directory sorted by
file size
 ls -al | sort -n -r +4
 The command ls -al writes the file size in the fifth column, which is why we skip
the first four columns using +4.
 The options -n and -r request a numeric sort (which is different than the normal
alphabetic sort) in reverse order

awk
 Topics covered: processing columnar data
 Utilities covered: awk
 The awk utility is used for processing columns of data
 A simple example shows how to extract column 5 (the file size) from the output
of ls -l
 ls -l | awk '{print $5}'
 Cut and paste this line into a Bourne shell and you should see a column of file
sizes, one per file in your current directory.
 A more complicated example shows how to sum the file sizes and print the result
at the end of the awk run
 ls -al | awk '{sum = sum + $5} END {print sum}'
 In this example you should see printed just one number, which is the sum of the
file sizes in the current directory.
Section 2: Storing Frequently Used Commands in Files:
Shell Scripts

Shell Scripts
 Topics covered: storing commands in a file and executing the file
 Utilities covered: date, cal, last (shows who has logged in recently)
 Store the following in a file named simple.sh and execute it
 #!/bin/sh
 # Show some useful info at the start of the day
 date
 echo Good morning $USER
 cal
 last | head -6
 Shows the current date, a calendar, and the six most recent logins
 Notice that the commands themselves are not displayed, only the results
 To display the commands verbatim as they run, execute with
 sh -v simple.sh
 Another way to display the commands as they run is with -x
 sh -x simple.sh
 What is the difference between -v and -x? Notice that with -v you see '$USER' but
with -x you see your login name
 Run the command 'echo $USER' at your terminal prompt and see that the variable
$USER stores your login name
 With -v or -x (or both) you can easily relate any error message that may appear to
the command that generated it
 When an error occurs in a script, the script continues executing at the next
command
 Verify this by changing 'cal' to 'caal' to cause an error, and then run the script
again
 Run the 'caal' script with 'sh -v simple.sh' and with 'sh -x simple.sh' and verify the
error message comes from cal
 Other standard variable names include: $HOME, $PATH, $PRINTER. Use echo
to examine the values of these variables

Storing File Names in Variables


 Topics covered: variables store strings such as file names, more on creating and
using variables
 Utilities covered: echo, ls, wc
 A variable is a name that stores a string
 It's often convenient to store a filename in a variable
 Store the following in a file named variables.sh and execute it
 #!/bin/sh
 # An example with variables
 filename="/etc/passwd"
 echo "Check the permissions on $filename"
 ls -l $filename
 echo "Find out how many accounts there are on this system"
 wc -l $filename
 Now if we change the value of $filename, the change is automatically propagated
throughout the entire script

Scripting With sed


 Topics covered: global search and replace, input and output redirection
 Utilities covered: sed
 Here's how you can use sed to modify the contents of a variable:
 echo "Hello Jim" | sed -e 's/Hello/Bye/'
 Copy the file nlanr.txt to your home directory and notice how the word 'vBNS'
appears in it several times
 Change 'vBNS' to 'NETWORK' with
 sed -e 's/vBNS/NETWORK/g' < nlanr.txt
 You can save the modified text in a file with output redirection
 sed -e 's/vBNS/NETWORK/g' < nlanr.txt > nlanr.new
 Sed can be used for many complex editing tasks; we have only scratched the
surface here

Section 3: More on Using UNIX Utilities

Performing Arithmetic
 Topics covered: integer arithmetic, preceding '*' with backslash to avoid file
name wildcard expansion
 Utilities covered: expr
 Arithmetic is done with expr
 expr 5 + 7
 expr 5 \* 7
 Backslash required in front of '*' since it is a filename wildcard and would be
translated by the shell into a list of file names
 You can save arithmetic result in a variable
 Store the following in a file named arith.sh and execute it
 #!/bin/sh
 # Perform some arithmetic
 x=24
 y=4
 Result=`expr $x \* $y`
 echo "$x times $y is $Result"

Translating Characters
 Topics covered: converting one character to another, translating and saving string
stored in a variable
 Utilities covered: tr
 Copy the file sdsc.txt to your home directory
 The utility tr translates characters
 tr 'a' 'Z' < sdsc.txt
 This example shows how to translate the contents of a variable and display the
result on the screen with tr
 Store the following in a file named tr1.sh and execute it
 #!/bin/sh
 # Translate the contents of a variable
 Cat_name="Piewacket"
 echo $Cat_name | tr 'a' 'i'
 This example shows how to change the contents of a variable
 Store the following in a file named tr2.sh and execute it
 #!/bin/sh
 # Illustrates how to change the contents of a variable with tr
 Cat_name="Piewacket"
 echo "Cat_name is $Cat_name"
 Cat_name=`echo $Cat_name | tr 'a' 'i'`
 echo "Cat_name has changed to $Cat_name"
 You can also specify ranges of characters.
 This example converts upper case to lower case
 tr 'A-Z' 'a-z' < file
 Now you can change the value of the variable and your script has access to the
new value

Section 4: Performing Search and Replace in Several Files

Processing Multiple Files


 Topics covered: executing a sequence of commands on each of several files with
for loops
 Utilities covered: no new utilities
 Store the following in a file named loop1.sh and execute it
 #!/bin/sh
 # Execute ls and wc on each of several files
 # File names listed explicitly
 for filename in simple.sh variables.sh loop1.sh
 do
 echo "Variable filename is set to $filename..."
 ls -l $filename
 wc -l $filename
 done
 This executes the three commands echo, ls and wc for each of the three file names
 You should see three lines of output for each file name
 filename is a variable, set by "for" statement and referenced as $filename
 Now we know how to execute a series of commands on each of several files

Using File Name Wildcards in For Loops


 Topics covered: looping over files specified with wildcards
 Utilities covered: no new utilities
 Store the following in a file named loop2.sh and execute it
 #!/bin/sh
 # Execute ls and wc on each of several files
 # File names listed using file name wildcards
 for filename in *.sh
 do
 echo "Variable filename is set to $filename..."
 ls -l $filename
 wc -l $filename
 done
 You should see three lines of output for each file name ending in '.sh'
 The file name wildcard pattern *.sh gets replaced by the list of filenames that
exist in the current directory
 For another example with filename wildcards try this command
 echo *.sh

Search and Replace in Multiple Files


 Topics covered: combining for loops with utilities for global search and replace
in several files
 Utilities covered: mv
 Sed performs global search and replace on a single file
 sed -e 's/application/APPLICATION/g' sdsc.txt > sdsc.txt.new
 The original file sdsc.txt is unchanged
 How can we arrange to have the original file over-written by the new version?
 Store the following in a file named s-and-r.sh and execute it
 #!/bin/sh
 # Perform a global search and replace on each of several files
 # File names listed explicitly
 for text_file in sdsc.txt nlanr.txt
 do
 echo "Editing file $text_file"
 sed -e 's/application/APPLICATION/g' $text_file > temp
 mv -f temp $text_file
 done
 First, sed saves new version in file 'temp'
 Then, use mv to overwrite original file with new version
Section 5: Using Command-line Arguments for Flexibility

What's Lacking in the Scripts Above?


 Topics covered: looping over files specified with wildcards
 Utilities covered: no new utilities
 File names are hard-coded inside the script
 What if you want to run the script but with different file names?
 To execute for loops on different files, the user has to know how to edit the script
 Not simple enough for general use by the masses
 Wouldn't it be useful if we could easily specify different file names for each
execution of a script?

What are Command-line Arguments?


 Topics covered: specifying command-line arguments
 Utilities covered: no new utilities
 Command-line arguments follow the name of a command
 ls -l .cshrc /etc
 The command above has three command-line arguments
 -l (an option that requests long directory listing)
 .cshrc (a file name)
 /etc (a directory name)
 An example with file name wildcards:
 wc *.sh
 How many command-line arguments were given to wc? It depends on how many
files in the current directory match the pattern *.sh
 Use 'echo *.sh' to see them
 Most UNIX commands take command-line arguments. Your scripts may also
have arguments

Accessing Command-line Arguments


 Topics covered: accessing command-line arguments
 Utilities covered: no new utilities
 Store the following in a file named args1.sh
 #!/bin/sh
 # Illustrates using command-line arguments
 # Execute with
 # sh args1.sh On the Waterfront
 echo "First command-line argument is: $1"
 echo "Third argument is: $3"
 echo "Number of arguments is: $#"
 echo "The entire list of arguments is: $*"
 Execute the script with
 sh args1.sh -x On the Waterfront
 Words after the script name are command-line arguments
 Arguments are usually options like -l or file names

Looping Over the Command-line Arguments


 Topics covered: using command-line arguments in a for loop
 Utilities covered: no new utilities
 Store the following in a file named args2.sh and execute it
 #!/bin/sh
 # Loop over the command-line arguments
 # Execute with
 # sh args2.sh simple.sh variables.sh
 for filename in "$@"
 do
 echo "Examining file $filename"
 wc -l $filename
 done
 This script runs properly with any number of arguments, including zero
 The shorter form of the for statement shown below does exactly the same thing
 for filename
 do
 ...
 Don't use
 for filename in $*
 Fails if any arguments include spaces
 Also, don't forget the double quotes around $@

If Blocks
 Topics covered: testing conditions, executing commands conditionally
 Utilities covered: test (used by if to evaluate conditions)
 This will be covered on the whiteboard
 See Chapter 8 of the book
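 Until then, here is a minimal sketch of an if block that uses test to check whether a file is readable (the file name is arbitrary):
 #!/bin/sh
 # Use an if block with test to check that a file is readable
 filename="/etc/passwd"
 if [ -r "$filename" ]
 then
 echo "$filename is readable"
 else
 echo "$filename is missing or unreadable"
 fi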

The read Command


 Topics covered: reading a line from the standard input
 Utilities covered: no new utilities
 stdin is the keyboard unless input redirection used
 Read one line from stdin, store line in a variable
 read variable_name
 Ask the user if he wants to exit the script
 Store the following in a file named read.sh and execute it
 #!/bin/sh
 # Shows how to read a line from stdin
 echo "Would you like to exit this script now?"
 read answer
 if [ "$answer" = y ]
 then
 echo "Exiting..."
 exit 0
 fi

Command Exit Status


 Topics covered: checking whether a command succeeds or not
 Utilities covered: no new utilities
 Every command in UNIX should return an exit status
 Status is in range 0-255
 Only 0 means success
 Other statuses indicate various types of failures
 Status does not print on screen, but is available through the variable $?
 Example shows how to examine exit status of a command
 Store the following in a file named exit-status.sh and execute it
 #!/bin/sh
 # Experiment with command exit status
 echo "The next command should fail and return a status greater
than zero"
 ls /nosuchdirectory
 echo "Status is $? from command: ls /nosuchdirectory"
 echo "The next command should succeed and return a status equal to
zero"
 ls /tmp
 echo "Status is $? from command: ls /tmp"
 Example shows if block using exit status to force exit on failure
 Store the following in a file named exit-status-test.sh and execute it
 #!/bin/sh
 # Use an if block to determine if a command succeeded
 echo "This mkdir command fails unless you are root:"
 mkdir /no_way
 if [ "$?" -ne 0 ]
 then
 # Complain and quit
 echo "Could not create directory /no_way...quitting"
 exit 1 # Set script's exit status to 1
 fi
 echo "Created directory /no_way"
 Exit status is $status in C shell

Regular Expressions
 Topics covered: search patterns for editors, grep, sed
 Utilities covered: no new utilities
 Zero or more characters: .*
 grep 'provided.*access' sdsc.txt
 sed -e 's/provided.*access/provided access/' sdsc.txt
 Search for text at beginning of line
 grep '^the' sdsc.txt
 Search for text at the end of line
 grep 'of$' sdsc.txt
 Asterisk means zero or more of the preceding character
 a* zero or more a's
 aa* one or more a's
 aaa* two or more a's
 Delete all spaces at the ends of lines
 sed -e 's/ *$//' sdsc.txt > sdsc.txt.new
 Turn each line into a shell comment
 sed -e 's/^/# /' sdsc.txt

Greed and Eagerness


 Attributes of pattern matching
 Greed: a regular expression will match the largest possible string
 Execute this command and see how big a string gets replaced by an underscore
 echo 'Big robot' | sed -e 's/i.*o/_/'
 Eagerness: a regular expression will find the first match if several are present in
the line
 Execute this command and see whether 'big' or 'bag' is matched by the regular
expression
 echo 'big bag' | sed -e 's/b.g/___/'
 Contrast with this command (notice the extra 'g')
 echo 'big bag' | sed -e 's/b.g/___/g'
 Explain what happens in the next example
 echo 'black dog' | sed -e 's/a*/_/'
 Hint: a* matches zero or more a's, and there are many places where zero a's
appear
 Try the example above with the extra 'g'
 echo 'black dog' | sed -e 's/a*/_/g'

Regular Expressions Versus Wildcards


 Topics covered: clarify double meaning of asterisk in patterns
 Utilities covered: no new utilities
 Asterisk used in regular expressions for editors, grep, sed
 Different meaning in file name wildcards on command line and in find command
and case statement (see below)
 regexp     wildcard    meaning

 .*         *           zero or more characters, any type
 .          ?           exactly one character, any type
 [aCg]      [aCg]       exactly one character, from list: aCg
 Regexps can be anchored to beginning/ending of line with ^ and $
 Wildcards automatically anchored to both extremes
 Can use wildcards un-anchored with asterisks
 ls *bub*

Getting Clever With Regular Expressions


 Topics covered: manipulating text matched by a pattern
 Utilities covered: no new utilities
 Copy the file animals.txt to your home directory
 Try this sed command, which changes the first line of animals.txt
 sed -e "s/big \(.*\) dog/small \1 cat/" animals.txt
 Bracketing part of a pattern with \( and \) labels that part as \1
 Bracketing additional parts of a pattern creates labels \2, \3, ...
 This sed command reverses the order of two words describing the rabbit
 sed -e "s/Flopsy is a big \(.*\) \(.*\) rabbit/A big \2 \1
rabbit/" < animals.txt

The case Statement


 Topics covered: choosing which block of commands to execute based on value of
a string
 Utilities covered: no new utilities
 The next example shows how to use a case statement to handle several
contingencies
 The user is expected to type one of three words
 A different action is taken for each choice
 Store the following in a file named case1.sh and execute it
 #!/bin/sh
 # An example with the case statement
 # Reads a command from the user and processes it
 echo "Enter your command (who, list, or cal)"
 read command
 case "$command" in
 who)
 echo "Running who..."
 who
 ;;
 list)
 echo "Running ls..."
 ls
 ;;
 cal)
 echo "Running cal..."
 cal
 ;;
 *)
 echo "Bad command, your choices are: who, list, or cal"
 ;;
 esac
 exit 0
 The last case above is the default, which corresponds to an unrecognized entry
 The next example uses the first command-line arg instead of asking the user to
type a command
 Store the following in a file named case2.sh and execute it
 #!/bin/sh
 # An example with the case statement
 # Reads a command from the user and processes it
 # Execute with one of
 # sh case2.sh who
 # sh case2.sh ls
 # sh case2.sh cal
 echo "Took command from the argument list: '$1'"
 case "$1" in
 who)
 echo "Running who..."
 who
 ;;
 list)
 echo "Running ls..."
 ls
 ;;
 cal)
 echo "Running cal..."
 cal
 ;;
 *)
 echo "Bad command, your choices are: who, list, or cal"
 ;;
 esac
 The patterns in the case statement may use file name wildcards

The while Statement


 Topics covered: executing a series of commands as long as some condition is
true
 Utilities covered: no new utilities
 The example below loops over two statements as long as the variable i is less than
or equal to ten
 Store the following in a file named while1.sh and execute it
 #!/bin/sh
 # Illustrates implementing a counter with a while loop
 # Notice how we increment the counter with expr in backquotes
 i="1"
 while [ $i -le 10 ]
 do
 echo "i is $i"
 i=`expr $i + 1`
 done

Example With a while Loop


 Topics covered: Using a while loop to read and process a file
 Utilities covered: no new utilities
 Copy the file while2.data to your home directory
 The example below uses a while loop to read an entire file
 The while loop exits when the read command returns false exit status (end of file)
 Store the following in a file named while2.sh and execute it
 #!/bin/sh
 # Illustrates use of a while loop to read a file
 cat while2.data | \
 while read line
 do
 echo "Found line: $line"
 done

 The entire while loop reads its stdin from the pipe
 Each read command reads another line from the file coming from cat
 The entire while loop runs in a subshell because of the pipe
 Variable values set inside while loop not available after while loop

Interpreting Options With getopts Command


 Topics covered: Understand how getopts command works
 Utilities covered: getopts
 getopts is a standard UNIX utility used for our class in scripts getopts1.sh and
getopts2.sh
 Its purpose is to help process command-line options (such as -h) inside a script
 It handles stacked options (such as -la) and options with arguments (such as -P
used as -Pprinter-name in lpr command)
 This example will help you understand how getopts interprets options
 Store the following in a file named getopts1.sh and execute it
 #!/bin/sh

 # Execute with
 #
 # sh getopts1.sh -h -Pxerox file1 file2
 #
 # and notice how the information on all the options is displayed
 #
 # The string 'P:h' says that the option -P is a complex option
 # requiring an argument, and that h is a simple option not
 # requiring an argument.
 #

 # Experiment with getopts command
 while getopts 'P:h' OPT_LETTER
 do
 echo "getopts has set variable OPT_LETTER to '$OPT_LETTER'"
 echo " OPTARG is '$OPTARG'"
 done

 used_up=`expr $OPTIND - 1`

 echo "Shifting away the first \$OPTIND-1 = $used_up command-line
arguments"

 shift $used_up

 echo "Remaining command-line arguments are '$*'"

 Look over the script
 getopts looks for command-line options
 For each option found, it sets three variables: OPT_LETTER, OPTARG,
OPTIND
 OPT_LETTER is the letter, such as 'h' for option -h
 OPTARG is the argument to the option, such as -Pjunky has argument 'junky'
 OPTIND is a counter that determines how many of the command-line arguments
were used up by getopts (see the shift command in the script)
 Execute it several times with
 sh getopts1.sh -h -Pjunky
 sh getopts1.sh -hPjunky
 sh getopts1.sh -h -Pjunky /etc /tmp
 Notice how it interprets -h and gives you 'h' in variable OPT_LETTER
 Now you can easily implement some operation when -h is used
 Notice how the second execution uses stacked options
 Notice how the third execution examines the rest of the command-line after the
options (these are usually file or directory names)

Example With getopts


 Topics covered: interpreting options in a script
 Utilities covered: getopts
 The second example shows how to use if blocks to take action for each option
 Store the following in a file named getopts2.sh and execute it
 #!/bin/sh
 #
 # Usage:
 #
 # getopts2.sh [-P string] [-h] [file1 file2 ...]
 #
 # Example runs:
 #
 # getopts2.sh -h -Pxerox file1 file2
 # getopts2.sh -hPxerox file1 file2
 #
 # Will print out the options and file names given
 #

 # Initialize our variables so we don't inherit values
 # from the environment
 opt_P=''
 opt_h=''

 # Parse the command-line options
 while getopts 'P:h' option
 do
 case "$option" in
 "P") opt_P="$OPTARG"
 ;;
 "h") opt_h="1"
 ;;
 ?) echo "getopts2.sh: Bad option specified...quitting"
 exit 1
 ;;
 esac
 done

 shift `expr $OPTIND - 1`

 if [ "$opt_P" != "" ]
 then
 echo "Option P used with argument '$opt_P'"
 fi

 if [ "$opt_h" != "" ]
 then
 echo "Option h used"
 fi

 if [ "$*" != "" ]
 then
 echo "Remaining command-line:"
 for arg in "$@"
 do
 echo " $arg"
 done
 fi

 Execute it several times with
 sh getopts2.sh -h -Pjunky
 sh getopts2.sh -hPjunky
 sh getopts2.sh -h -Pjunky /etc /tmp
 Can also implement actions inside case statement if desired

Section 6: Using Functions


Functions
 Sequence of statements that can be called anywhere in script
 Used for
o Good organization
o Create re-usable sequences of commands

Define a Function
 Define a function
 echo_it () {
 echo "In function echo_it"
 }
 Use it like any other command
 echo_it
 Put these four lines in a script and execute it

Function Arguments
 Functions can have command-line arguments
 echo_it () {
 echo "Argument 1 is $1"
 echo "Argument 2 is $2"
 }
 echo_it arg1 arg2
 When you execute the script above, you should see
 Argument 1 is arg1
 Argument 2 is arg2
 Create a script 'difference.sh' with the following lines:
 #!/bin/sh
 echo_it () {
 echo Function argument 1 is $1
 }
 echo Script argument 1 is $1
 echo_it Barney
 Execute this script using
 sh difference.sh Fred
 Notice that '$1' is echoed twice with different values
 The function has separate command-line arguments from the script's

Example With Functions


 Use functions to organize script
 read_inputs () { ... }
 compute_results () { ... }
 print_results () { ... }
 Main program very readable
 read_inputs
 compute_results
 print_results
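 As a filled-in sketch of this structure (the task, summing two numbers read from stdin, is invented only to make the function bodies concrete):
 #!/bin/sh
 # A script organized as three functions plus a short main program
 read_inputs () {
 read x
 read y
 }
 compute_results () {
 sum=`expr $x + $y`
 }
 print_results () {
 echo "The sum is $sum"
 }
 read_inputs
 compute_results
 print_results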

Functions in Pipes
 Can use a function in a pipe
 ls_sorter () {
 sort -n +4
 }
 ls -al | ls_sorter
 Function in pipe executed in new shell
 New variables forgotten when function exits

Inherited Variables
 Variables defined before calling script available to script
 func_y () {
 echo "A is $A"
 return 7
 }
 A='bub'
 func_y
 if [ $? -eq 7 ] ; then ...
 Try it: is a variable defined inside a function available to the main program?

Functions -vs- Scripts


 Functions are like separate scripts
 Both functions and scripts can:
 Use command-line arguments
 echo First arg is $1
 Operate in pipes
 echo "test string" | ls_sorter
 Return exit status
 func_y arg1 arg2
 if [ $? -ne 0 ] ...

Libraries of Functions
 Common to store definitions of favorite functions in a file
 Then execute file with
 . file
 Period command executes file in current shell
 Compare to C shell's source command
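 A minimal sketch, assuming the function definitions live in a file named mylib.sh:
 # Contents of mylib.sh -- a library of favorite functions
 ls_sorter () {
 sort -n +4
 }
 # In a script that needs the library, load it into the current shell:
 . mylib.sh
 ls -al | ls_sorter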

Section 7: Miscellaneous

Here Files
 Data contained within script
 cat << END
 This script backs up the directory
 named as the first command-line argument,
 which in your case is $1.
 END
 Terminator string must begin in column one
 Variables and backquotes translated in data
 Turn off translation with \END

Example With Here File


 Send e-mail to each of several users
 for name in login1 login2 login3
 do
 mailx -s 'hi there' $name << EOF
 Hi $name, meet me at the water
 fountain
 EOF
 done
 Use <<- to remove initial tabs automatically

Set: Shell Options


 Can change Bourne shell's options at runtime
 Use set command inside script
 set -v
 set +v
 set -xv
 Toggle verbose mode on and off to reduce amount of debugging output

Set: Split a Line


 Can change Bourne shell's options
 set -- word1 word2
 echo $1, $2
 word1, word2
 Double dash important!
 Word1 may begin with a dash; what if word1 is '-x'?
 Double dash says "even if the first word begins with '-', do not treat it as an option to
the shell"

Example With Set


 Read a line from keyboard
 Echo words 3 and 5
 read var
 set -- $var
 echo $3 $5
 Best way to split a line into words

Section 8: Trapping Signals

What are Signals?


 Signals are small messages sent to a process
 Process interrupted to handle signal
 Possibilities for managing signal:
o Terminate
o Ignore
o Perform a programmer-defined action

Common Signals
 Common signals are
o SIGINT sent to the foreground process by ^C
o SIGHUP sent when a modem line gets hung up
o SIGTERM sent by the kill command by default (kill -9 sends the stronger SIGKILL)
 Signals have numeric equivalents
 2 SIGINT
 15 SIGTERM (9 is SIGKILL)

Send a Signal
 Send a signal to a process
 kill -2 PID
 kill -INT PID

Trap Signals
 Handling Signals
 trap "echo Interrupted; exit 2" 2
 Ignoring Signals
 trap "" 2 3
 Restoring Default Handler
 trap 2
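 A common use (a sketch, not part of the original notes) is removing a temporary file before exiting when the user presses ^C:
 #!/bin/sh
 # Clean up the temporary file if the script is interrupted with ^C
 tempfile=$HOME/sorted_$$
 trap "rm -f $tempfile; exit 2" 2
 sort /etc/passwd > $tempfile
 wc -l $tempfile
 rm -f $tempfile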

Where to Find List of Signals


 See file
 /usr/include/sys/signal.h

User Signals
 SIGUSR1, SIGUSR2 are for your use
 Send to a process with
 kill -USR1 PID
 Default action is to terminate process

Experiment With Signals


 Script that catches USR1
 Echo message upon each signal
 trap 'echo USR1' 16
 while : ; do
 date
 sleep 3
 done
 Try it: does signal interrupt sleep?

Section 9: Understanding Command Translation

Command Translation
 Common translations include
o Splitting at spaces, obey quotes
o $HOME -> /users/us/freddy
o `command` -> output of command
o I/O redirection
o File name wildcard expansion
 Combinations of quotes and metacharacters confusing
 Resolve problems by understanding order of translations
Experiment With Translation
 Try wildcards in echo command
 echo b*
 b budget bzzzzz
 b* translated by sh before echo runs
 When echo runs it sees
 echo b budget bzzzzz
 Echo command need not understand wildcards!

Order of Translations
 Splits into words at spaces and tabs
 Divides commands at
 ; & | && || (...) {...}
 Echos command if -v
 Interprets quotes
 Performs variable substitution

Order of Translations (continued)


 Performs command substitution
 Implements I/O redirection and removes redirection characters
 Divides command again according to IFS
 Expands file name wildcards
 Echos translated command if -x
 Executes command

Exceptional Case
 Delayed expansion for variable assignments
 VAR=b*
 echo $VAR
 b b_file
 Wildcard re-expanded for each echo

Examples With Translation


 Variables translated before execution
 Can store command name in variable
 command="ls"
 $command
 file1 file2 dir1 dir2...
 Variables translated before I/O redirection
 tempfile="/tmp/scriptname_$$"
 ls -al > $tempfile

Examples (continued)
 Delayed expansion of wildcards in variable assignment
 Output of this echo command changes when directory contents change (* is re-
evaluated each time the command is run)
 x=*
 echo $x
 Can view values stored in variables with
 set
 Try it: verify that the wildcard is stored in x without expansion

Examples (continued)
 Wildcards expanded after redirection (assuming file* matches exactly one file):
 cat < file*
 file*: No such file or directory
 Command in backquotes expanded fully (and before I/O redirection)
 cat < `echo file*`
 (contents of file sent to screen)

Eval Command
 Forces an extra evaluation of command
 eval cat \< file*
 (contents of matching file)
 Backslash delays translation of < until second translation

Section 10: Writing Advanced Loops

While loops
 Execute statements while a condition is true
 i=0
 while [ $i -lt 10 ]
 do
 echo I is $i
 i=`expr $i + 1`
 done

Until loops
 Execute statements as long as a condition is false
 until grep "sort" dbase_log > /dev/null
 do
 sleep 10
 done
 echo "Database has been sorted"
 Example executes until grep is unsuccessful

Redirection of Loops
 Can redirect output of a loop
 for f in *.c
 do
 wc -l $f
 done > loop.out
 Loop runs in separate shell
 New variables forgotten after loop
 Backgrounding OK, too

Continue Command
 Used in for, while, and until loops
 Skip remaining statements
 Return to top of loop
 for name in *
 do
 if [ ! -f $name ] ; then
 continue
 fi
 echo "Found file $name"
 done
 Example loops over files, skips directories

Break Command
 Used in for, while, and until loops
 Skip remaining statements
 Exit loop
 for name in *
 do
 if [ ! -r $name ] ; then
 echo "Cannot read $name, quitting loop"
 break
 fi
 echo "Found file or directory $name"
 done
 Example loops over files and directories, quits if one is not readable

Case Command
 Execute one of several blocks of commands
 case "string" in
 pattern1)
 commands ;;
 pattern2)
 commands ;;
 *) # Default case
 commands ;;
 esac
 Patterns specified with file name wildcards
 quit) ...
 qu*) ...

Example With Case


 Read commands from keyboard and interpret
 Enter this script 'case.sh'
 echo Enter a command
 while read cmd
 do
 case "$cmd" in
 list) ls -al ;;
 freespace) df . ;;
 quit|Quit) break ;;
 *) echo "$cmd: No such command" ;;
 esac
 done
 echo "All done"
 When you run it, the script waits for you to type one of:
 list
 freespace
 quit
 Quit
 Try it: modify the example so any command beginning with characters "free" runs
df

Infinite Loops
 Infinite loop with while
 while :
 do
 ...
 done
 : is no-op, always returns success status
 Must use break or exit inside loop for it to terminate

Section 11: Forking Remote Shells

Remote Shells
 Rsh command
 rsh hostname "commands"
 Runs commands on remote system
 Must have .rhosts set up
 Can specify different login name
 rsh -l name hostname "commands"

Examples With rsh


 Check who's logged on
 rsh spooky "finger"
 Run several remote commands
 rsh spooky "uname -a; time"
 Executes .cshrc on remote system
 Be sure to set path in .cshrc instead of .login

Access Control with .rhosts


 May get "permission denied" error from rsh
 Fix this with ~/.rhosts on remote system
 Example: provide for remote shell from spunky to spooky
 spunky % rlogin spooky
 spooky % vi ~/.rhosts
 (insert "spunky login-name")
 spooky % chmod 600 ~/.rhosts
 spooky % logout
 spunky % rsh spooky uname -a
 spooky 5.5 sparc SUNW,Ultra-1
 May also rlogin without password: security problem!

Remote Shell I/O


 Standard output sent to local host
 rsh spooky finger > finger.spooky
 Standard input sent to remote host
 cat local-file | rsh spooky lpr -

Return Status
 Get return status of rsh
 rsh mayer "uname -a"
 echo $?
 Returns 0 if rsh managed to connect to remote host
 Returns 1 otherwise
o Invalid hostname
o Permission denied

Remote Return Status


 What about exit status of remote command?
 Have to determine success or failure from stdout or stderr
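 One common workaround (a sketch, not from the original notes) is to have the remote command echo its own exit status and examine that string locally:
 remote_out=`rsh spooky 'ls /nosuchdirectory; echo REMOTE_STATUS=$?'`
 echo "$remote_out" | grep 'REMOTE_STATUS=0' > /dev/null
 if [ $? -ne 0 ]
 then
 echo "Remote command failed"
 fi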

Section 12: More Miscellaneous

Temporary Files
 Use unique names to avoid clashes
 tempfile=$HOME/Weq_$$
 command > $tempfile
 $$ is PID of current shell
 Avoids conflict with concurrent executions of script
 Do not use /tmp!

Wait Command
 Wait for termination of background job
 command &
 pid=$!
 (other processing)
 wait $pid
 Allows overlap of two or more operations
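 A concrete sketch (the commands are arbitrary examples):
 #!/bin/sh
 # Start a long-running command in the background, do other work,
 # then wait for the background job before using its output
 sort /etc/passwd > $HOME/sorted_$$ &
 pid=$!
 date
 wait $pid
 wc -l $HOME/sorted_$$
 rm -f $HOME/sorted_$$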

Section 13: Using Quotes


Quotes
 Provide control of collapsing of spaces and translation of variables
 Try it: run three examples
 No quotes (variables translated, spaces collapsed)
 echo Home: $HOME
 Home: /users/us/freddy
 Double quotes (no collapsing)
 echo "Home: $HOME"
 Home: /users/us/freddy
 Single quotes (no translation or collapsing)
 echo 'Home: $HOME'
 Home: $HOME
 Try it: single quotes within double quotes
 echo "Home directory '$HOME' is full..."

Metacharacters
 Characters with special meaning to shell
 " ' ` $ * [ ] ?
 ; > < & ( ) \
 Avoid special meaning with quoting
 echo 'You have $20'
 Backslash like single quotes
 Applies only to next character
 echo You have \$20

Examples With Quotes


 Bad command line:
 grep dog.*cat file
 Shell tries to expand dog.*cat as a file name wildcard
 Use quotes to avoid translation
 grep 'dog.*cat' file
 Single quotes OK in this case because we don't need variable translation

More Examples With Quotes


 Read name and search file for name
 read name
 grep "$name" dbase
 Single quotes not OK because we need variable translation

Searching for Metacharacters


 Bad command line: search for dollar sign
 grep "Gimme.*$20" file
 Problem: shell translates variable $20
 Solution: use single quotes

grep 'Gimme.*$20' file

DataStage tip for beginners - parallel lookup types


Vincent McBurney (Consultant, Solution Architect), posted 1/6/2006

Parallel DataStage jobs can have many sources of reference data for lookups
including database tables, sequential files or native datasets. Which is the most
efficient?

This question has popped up several times over on the DSExchange. In DataStage
server jobs the answer is quite simple: local hash files are the fastest method of a key-
based lookup, as long as the time taken to build the hash file does not wipe out your
benefits from using it.

In a parallel job there is a much wider variety of stages that can be used as a
lookup than in server jobs; this includes most data sources and the parallel staging
formats of datasets and lookup filesets. I have discounted database lookups, as the
overhead of the database connectivity and any network passage makes them slower
than most local storage.

I did a test comparing datasets to sequential files to lookup filesets and increased
row volumes to see how they responded. The test had three jobs, each with a
sequential file input stage and a reference stage writing to a copy stage.

Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18
seconds. No lookup tables were created apart from the existing lookup fileset one.
This indicates the lookup data fit into memory and did not overflow to a resource file.

1 Million Row Test


The lookup dataset took 35 seconds, the lookup fileset took 18 seconds and the
lookup sequential file took 35 seconds even though it had to partition the data. I
assume this is because the input also had to be partitioned and this was the
bottleneck in the job.

2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only
three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file
up to 1:32. The cost of partitioning the lookup data is really showing now.

3 million rows
The fileset, still at 45 seconds, swallowed up the extra 1 million rows with ease.
Dataset up to 2:06 and the sequential file up to 2:20.
As a final test I replaced the lookup stage with a join stage and tested the dataset
and sequential file reference links. The dataset join finished in 1:02 and the
sequential file join finished in 1:15. A large join proved faster than a large lookup but
not as fast as a lookup fileset.

Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant; they
all load up very quickly, even database lookups are fast. If you have very large
lookup files spilling into lookup table resources then the lookup fileset outstrips the
other options. A join also becomes a viable option. Joins are a bit harder to design, as you
can only join one source at a time whereas a lookup can join multiple sources.

I usually go with lookups for code to description or code to key type lookups
regardless of the size, I reserve the joins for references that bring back lots of
columns. I will certainly be making more use of the lookup fileset to get more
performance from jobs.

Sparse database lookups, which I didn't test for, are an option if you have a very
large reference table and a small number of input rows.

Introduction
Job parameters should be used in all DataStage server, parallel and sequence jobs to give
administrators access to run time values that change, such as database login details, file
locations and job settings.

One option for maintaining these job parameters is to use project specific environment
variables. These are similar to operating system environment variables but they are set up and
maintained through the DataStage Administrator tool.

There is a blog entry with bitmaps that describes the steps in setting up these variables at
DataStage tip: using job parameters without losing your mind

Steps
To create a new project variable:

 Start up DataStage Administrator.
 Choose the project and click the "Properties" button.
 On the General tab click the "Environment..." button.
 Click on the "User Defined" folder to see the list of job specific environment variables.

There are two types of variables - string and encrypted. If you create an encrypted
environment variable it will appear as the string "*******" in the Administrator tool and will
appear as junk text when saved to the DSParams file or when displayed in a job log. This
provides robust security of the value.

Note that encrypted environment variables are not supported in versions earlier than 7.5.
Migrating Project Specific Job Parameters
It is possible to set or copy job specific environment variables directly to the DSParams file in
the project directory. There is also a DSParams.keep file in this directory; if you make
manual changes to the DSParams file you will find Administrator can roll back those changes
from DSParams.keep. It is possible to copy project specific parameters between projects by
overwriting the DSParams and DSParams.keep files. It may be safer to replace just the User
Defined section of these files and not the General and Parallel sections.

Environment Variables as Job Parameters


To create a job level variable:

 Open up a job.
 Go to Job Properties and move to the parameters tab.
 Click on the "Add Environment Variables..." button and choose the variable from the
list. Only values set in Administrator will appear. This list will show both the system
variables and the user-defined variables.
 Set the Default value of the new parameter to $PROJDEF. If it is an encrypted field
set it to $PROJDEF in both data entry boxes on the encrypted value entry form.

When the job parameter is first created it has a default value the same as the Value entered in
the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the
latest Value for this variable at job run time.

If you have an encrypted environment variable it should also be an encrypted job parameter.
Set the value of these encrypted job parameters to $PROJDEF. You will need to type it
twice into the password entry boxes, or better yet cut and paste it into the fields; a typing
mistake can lead to a connection error message that is not very informative and leads to a long
investigation.

Creating sub folders


By default all parameters are put into a "User Defined" folder. This can make it difficult to
locate them through the Designer or Administrator tools. Sub folders can be added by editing
the DSParams file and adding sub folder names to the parameter definition section. Where
the folder name is defined as "\User Defined\" this can be changed to include a sub folder, e.g.
"\User Defined\Database\".

Examples
These job parameters are used just like normal parameters by adding them to stages in your
job enclosed in # symbols.

Job Parameter Examples


Field: Database
Setting: #$DW_DB_NAME#
Result: CUSTDB

Field: Password
Setting: #$DW_DB_PASSWORD#
Result: ********

Field: File Name
Setting: #$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv
Result: c:/data/custfiles/Customers_20040203.csv
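The # delimited names are resolved at run time by straightforward text substitution. As a rough illustration of the mechanics only (a hypothetical Python sketch, not DataStage code; the run time values are invented and taken from the table above):

import re

# Hypothetical run time values; in DataStage these come from job parameters
# and the project specific environment variables (resolved via $PROJDEF).
params = {
    "$PROJECT_PATH": "c:/data",
    "SOURCE_DIR": "custfiles",
    "PROCESS_DATE": "20040203",
}

def expand(setting):
    # Replace each #NAME# token with the corresponding parameter value.
    return re.sub(r"#([^#]+)#", lambda m: params[m.group(1)], setting)

print(expand("#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv"))
# c:/data/custfiles/Customers_20040203.csv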

Conclusion
This type of job parameter is useful for having a central, password protected location for storing
all job parameters that supports encryption of passwords. It can be
difficult to migrate between environments: migrating the entire DSParams file can result in
development environment settings being moved into production, and trying to migrate just the
user defined section can result in a corrupt DSParams file. Care must be taken.

Introduction
This HOWTO entry will describe how to identify changed data using a DataStage server or
parallel job. For an overview of other change capture options see the blog on incremental
loads.

The objective of changed data identification is to compare two sets of data with identical or
similar metadata and determine the differences between the two. The two sets of data
represent an existing set of data and a new set of data, where the change capture identifies
the modified, added and removed rows in the new set.

Steps
The steps for change capture depend on whether you are using server jobs or parallel jobs or
one of the specialised change data capture products that integrate with DataStage.

Change Data Capture Components


These are components that can be purchased in addition to a DataStage license in order to
perform change data capture against a specific database:

 DataStage CDC for DB2
 DataStage CDC for SQL Server 2000
 Ascential CDC for IMS
 Change Data Capture for Oracle
 DataStage CDC for SQL Server 2000 (Windows only)
 CDC for DB2 AS/400 (Windows only)

Server Job


Most change capture methods involve the transformer stage with new data as the input and
existing data as a left outer join reference lookup. The simplest form of change capture is to
compare all rows using output links for inserts and updates with a constraint on each.
Column Compare
Update link constraint:
input.firstname <> lookup.firstname or input.lastname <> lookup.lastname or input.birthdate <> lookup.birthdate ...
Insert link constraint:
lookup.NOTFOUND

A delete output cannot be derived as the lookup is a left outer join.

These constraints can become very complex to write, especially if there are a lot of fields to
compare. It can also produce slow performance as the constraint needs to run for every row.
Performance improvement can be gained by using the CRC32 function to describe the data
for comparison.

CRC Compare


CRC32 is a C function written by Michael Hester and is now on the Transformer function list. It
takes an input string and returns a signed 32 bit number that acts as a digital signature of the
input data.

When a row is processed and becomes existing data, a CRC32 code is generated and saved
to a lookup along with the primary key of the row. When a new data row comes through, a
primary key lookup determines whether the row already exists; if it does, comparing the CRC32 of
the new row with that of the existing row determines whether the data has changed.

CRC32 change capture using a text file source:

 Read each row as a single long string. Do not specify a valid delimiter.
 In a transformer find the key fields using the FIELD function and generate a CRC32
code for the entire record. Output all fields.
 In a transformer lookup the existing records, using the key fields to join, and compare
the new and existing CRC32 codes. Output new and updated records.
 The output records have concatenated fields. Either write the output records to a
staging sequential file, where they can be processed by insert and update jobs, or
split the records into individual fields using the Row Splitter stage.

CRC32 change capture using a database source:

 To concatenate fields together use the Row Merge stage. Then follow the steps
described in the sequential file section above.

WARNING! Because it uses a 32-bit integer, the CRC32 function introduces a finite probability
of false positives; that is, identifying a match when there is in fact no match. The larger the
number of rows processed, the higher the probability that such an error occurs.
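To make the comparison logic concrete, here is a minimal Python sketch of the idea (an illustration only: zlib.crc32 stands in for the DataStage CRC32 routine and may produce different values, and the column values are invented):

import zlib

# Existing data: primary key -> CRC32 of the concatenated non-key columns.
existing = {"C001": zlib.crc32(b"Smith|John|1970-01-01")}

def classify(key, value_string):
    # Return 'insert', 'update' or 'unchanged' for an incoming row.
    new_crc = zlib.crc32(value_string.encode())
    old_crc = existing.get(key)
    if old_crc is None:
        return "insert"        # primary key not found in the existing data
    if old_crc != new_crc:
        return "update"        # key found but the signatures differ
    return "unchanged"         # identical signature, nothing to do

print(classify("C001", "Smith|John|1970-01-01"))  # unchanged
print(classify("C001", "Smith|Jon|1970-01-01"))   # update
print(classify("C002", "Jones|Ann|1980-05-05"))   # insert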

Shared Container Change Capture


One benefit of the CRC32 function is the ability to put change capture into a shared container
for use across multiple jobs. This code re-use can save a lot of time. The container has a
transformer in it with two input columns, keyfields and valuefields, and two output links,
inserts and updates. The keyfields column contains the key fields concatenated into a string with a
delimiter such as |. The valuefields column contains all fields concatenated with a delimiter.
The job that uses the container needs to concatenate the input columns and pass them to the
container and then split the output insert and update rows. Row Merge and Row Splitter can
be used to do this.

Parallel Job


The Change Capture stage uses "Before" and "After" input links to compare data.

 Before is the existing data.
 After is the new data.

The stage operates using the settings for Key and Value fields. Key fields are the fields used
to match a before and after record. Value fields are the fields that are compared to find
modified records. You can explicitly define all key and value fields or use some of the options
such as "All Keys, All Values" or "Explicit Keys, All Values".

The stage outputs a change_code which by default is set to 0 for existing rows that are
unchanged, 1 for deleted rows, 2 for modified rows and 4 for new rows. A filter stage or a
transformer stage can then split the output using the change_code field down different insert,
update and delete paths.
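As a minimal sketch of that downstream split (plain Python rather than a Filter or Transformer stage, simply mirroring the change_code values listed above, which are configurable on the stage):

# Route rows down insert, update or delete paths based on change_code.
UNCHANGED, DELETED, MODIFIED, NEW = 0, 1, 2, 4   # codes as described above

def route(row):
    code = row["change_code"]
    if code == NEW:
        return "insert_link"
    if code == MODIFIED:
        return "update_link"
    if code == DELETED:
        return "delete_link"
    return "drop"   # unchanged rows are usually discarded

print(route({"change_code": 4, "cust_id": "C002"}))   # insert_link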

Change Capture can also be performed in a Transformer stage as per the Server Job
instructions.

The CRC32 function is not part of the parallel job install but it is possible to write one as a
custom buildop.

Parallel jobs are useful for increasing the performance of a job, so these days most new jobs
are built as parallel jobs.


Conclusion
The change capture stage in parallel jobs and the CRC32 function in server jobs simplify the
process of change capture in an ETL job.

Introduction
This entry describes various ways of creating a unique counter in DataStage jobs.

A parallel job has a Surrogate Key stage that creates unique IDs; however, it is limited in that it
does not support conditional code, and it may be more efficient to add a counter to an existing
transformer rather than add a new stage.
In a server job there is a set of key increment routines installed with the SDK routine samples
that offer a more complex counter that remembers values between job executions.

The following section outlines a transformer only technique.

Steps
In a DataStage job the easiest way to create a counter is within the Transformer stage with a
Stage Variable.

svMyCounter = svMyCounter + 1

This simple counter adds 1 each time a row is processed.

The counter can be given a seed value by passing a value in as a job parameter and setting
the initial value of svMyCounter to that job parameter.

In a parallel job this simple counter will create duplicate values on each node as the
transformer is split into parallel instances. It can be turned into a unique counter by using
special parallel macros.

1. Create a stage variable for the counter, e.g. svCounter.
2. At the Stage Properties form set the Initial Value of the Stage Variable to
"@PARTITIONNUM - @NUMPARTITIONS + 1".
3. Set the derivation of the stage variable to "svCounter + @NUMPARTITIONS". You
can embed this in an IF statement if it is a conditional counter.

Each instance will start at a different number, e.g. -3, -2, -1 and 0 on four partitions. When the
counter is incremented, each instance is incremented by the number of partitions, e.g. 4. This gives us a
sequence in instance 1 of 1, 5, 9, 13... Instance 2 is 2, 6, 10, 14... etc.
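To see why this produces unique values across instances, here is a small simulation of the macro arithmetic in plain Python (assuming, as in parallel jobs, that partition numbering starts at 0):

# Simulate the stage variable logic on a 4 partition configuration.
NUM_PARTITIONS = 4

def partition_counter(partition_num):
    # initial value: @PARTITIONNUM - @NUMPARTITIONS + 1
    # derivation:    svCounter + @NUMPARTITIONS
    counter = partition_num - NUM_PARTITIONS + 1
    while True:
        counter += NUM_PARTITIONS
        yield counter

for p in range(NUM_PARTITIONS):
    gen = partition_counter(p)
    print("partition", p, ":", [next(gen) for _ in range(4)])
# partition 0 : [1, 5, 9, 13]
# partition 1 : [2, 6, 10, 14]
# partition 2 : [3, 7, 11, 15]
# partition 3 : [4, 8, 12, 16]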

I know it's not polite dinner conversation, but are you suffering row leakage from
your data integration jobs? Do you have an unpleasant discharge and you don't know
what to do with it?

Row leakage is the rows that get dropped out of ETL jobs. Row discharge is the
row rejects from the ETL jobs that are not properly investigated. Untrapped leakage
and ignored discharge both lead to data quality problems in the ETL targets. In a
future blog I will talk about cleaning up your unpleasant discharge, but in this blog I
will address row leakage.

How does it happen?


Like any program, an ETL job can have an unexpected error. This can be an abort
error, which is easy to trap and investigate because it causes a catastrophic halt
to your overnight processing! More insidious is the individual row failure, where a
combination of data caused the row to be dropped. It can be more dangerous
because if it's treated as a minor error then processing can continue and the load can
give the appearance of being complete even though some data has been lost.

Row Leakage Common Causes


In the WebSphere DataStage tool, which is the most heavily used of the WebSphere
Information Integration suite, these are the most common causes of row leakage:
* Null values in transformer stages. In a parallel job especially any attempt to use a
field in a stage variable that has a null in it will drop the row. Any attempt to run a
derivation function against a field with a null in it can drop the row. The exception is
of course the null handling functions.
* Database rejection. A wide array of database errors can reject a row, common ones
being duplicate key, missing foreign key relationship, null in a not nullable field. Less
common are the unpredictable rejections such as lack of rollback space.
* Metadata mismatch. Quite common on sequential files. A metadata mismatch may
be a string value that is being forced into a numeric field or an invalid string date
being converted into a date. Sometimes the value will be set to a default such as 0 or
an empty string and the row is retained, sometimes the row is dropped.
* Metadata mismatch in a sequential file read. This is why the parallel sequential file
stage has the rarely used but quite important reject link option.

How is the job affected?


Usually there will be a warning message in the Director log for each row that is lost.
DataStage server jobs are better at trapping and reporting row leakage than
DataStage parallel jobs. If there is no reject link on the stage a parallel job message
will look something like this:
APT_CombinedOperatorController(0),1: Field
'CUSTOMER_TYPE' from input dataset '0' is NULL. Record dropped.
(transform/tfmp_functions.C:130)

If there is a reject link on the stage the message will be:


'CUSTOMER_TYPE' from input dataset '0' is NULL. Record sent to the reject dataset.
(transform/tfmp_functions.C:130)

Or sometimes it is a series of warning messages followed by an information message
that indicates row leakage (there is nothing in the warning messages that specifically
says the row has been dropped):
Information message: "Sequential_File_1,0: Import complete. 266 records imported
successfully, 6 rejected."

A reject link is a good option as the rejected record can be saved and investigated
and the reject link count can be used in ETL row reporting. Sometimes a row is leaked
and no warning message appears; see the section on black holes below.

The job that cried wolf


DataStage Parallel jobs can be the job that cries wolf. Many production sites have
jobs that produce half a dozen or more warning messages every time they run. This
is due to warnings that the developer could not remove from the job, usually very
minor metadata mismatch messages that do not harm the data.

This means row leakage messages are often missed, especially in jobs where the row
leakage happens rarely. This is why the parallel message handler was added to turn
warning messages into information messages and deliver "clean" jobs to production
environments.

Avoiding warnings in a parallel job


1. Try to build parallel DataStage jobs that do not have warning messages.
2. If a warning message is unavoidable, get it tested and signed off as an acceptable
warning.
3. Turn acceptable warnings into information messages via the message handler.

The Black Hole


The black hole in an ETL job is where rows get leaked but no warnings are reported.
The row has simply disappeared.

The most dangerous black holes are the enterprise database output stages. ETL row
by row processing means each row is handled as its own insert or update statement;
with million plus row processing that's quite a lot of DB statements that could go
wrong! Database bulk loads are different from direct inserts or updates: in these loads
the data is exported to a flat file and loaded via a database utility. This means the
error and log messages are not available in the DataStage log but can be found in an
external bulk load log file.

Trapping parallel job database errors


The first thing you need to do is capture rejected rows. For parallel jobs this is
described in the wiki entry HOWTO:Parallel job: Retrieve sql codes on a failed upsert.
This method is not in any of the DataStage documentation and involves propagating
two SQL error code fields that are generated by the enterprise database stages.

I add some special error reporting to this: the reject output of the db stage is passed
to a copy stage and then to a shared container with row propagation turned on. The
shared container retrieves the SQL code field, looks up the error description against a
prepared lookup fileset, aggregates and counts the errors and sends them to a peek
stage. The DataStage log receives a peek message with error code, error description
and error count for failed database statements. I elevate peek messages from
information to warning in all my DataStage projects (including production).

This method is especially effective in development and testing environments where
database errors can happen a lot.

Trapping server job database errors


Usually server jobs display a warning for database rejects that shows you the
database error message. You can also trap the rejected row and the SQL codes using a
transformer. Where parallel jobs send rejects to a downstream reject link, the server
job bounces the rejected rows back to the preceding stage. This is an important thing to
remember: parallel db stages spit out rejects, server db stages bounce them.

If you have a transformer leading to a db stage, which should happen at least 90% of
the time, you can catch the bounced db failure rows and output them down a reject
link. The link variables of the database link hold the SQL error codes and messages
and you can output these values down a reject link.

DATASTAGE

01. What can you do with Data stage?


A: Design Jobs that extract, integrate, Aggregate, Transform Data.
Create, manage, and reuse Metadata. Run, Monitor, Schedule Jobs.
Manage your Development Environment.

02. What are the DataStage application components?


A: Client Components-- Designer, Director, Administrator and Manager.
Server Components--- Server and Repository.

03. What do you do in Director?


A: Validate, run, schedule, and monitor jobs and gather status.
04. What do you do in Manager?
A: Create, manage, and reuse metadata, define routines, and export and import
table definitions, jobs, and projects.

05. What do you do in Administrator?


A: Change license information, set user privileges, set job monitoring
limits, set server connection timeout, enable/disable server tracing.

06. What is a job?


A: A job is an executable DataStage program. It is designed and built in Designer,
scheduled, run and monitored in Director, and executed under the control
of the DSEngine (UniVerse).

07. What are views of Director?


A: Status View, Schedule View and Log View.

08. Types of Jobs?


A: Server jobs, Parallel jobs and Mainframe.

09. Types of Stages?


A: Passive Stage: For Read/Write process only. EX- ODBC, Hash,
Sequential.
Active Stage: For transformation, filtering and aggregating etc.
Ex- Aggregate, Transformer.

10. What is a Constraint and Derivation?


A: A constraint specifies the condition under which data flows through a link;
constraints only apply to links.
A derivation specifies the value to be moved to a target field; derivations only
apply to fields.

11. What is ICONV and OCONV?


A: ICONV: Converts Date value to an internal format.
OCONV: Converts Date Value From an internal Format.

12. What are the types of Lookup?


A: Singleton Lookup-- Returns single row as output. EX- Hash, OCI.
Multiton Lookup--- Returns Multiple rows as output. Ex-ODBC,
Universe.

13. What is hash file stage?


A: Distributes rows into one or more evenly sized groups (the modulus)
based on the primary key. Uses a specifiable "hashing algorithm". Used
for lookups for better performance.

14.What are Datastage Transforms?


A: These are similar to routines: they take one or more input arguments but
return a single output value. Arguments have specific data elements
associated with them, and the output is defined by a single (but possibly very
complex) BASIC expression.
EX-- CAPITALS: converts a string to word-initial capitals
DIGITS: removes non-digits from a string
LETTERS: removes non-letters from a string
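For illustration only, these are rough Python equivalents of what the example transforms do (the real ones are BASIC expressions inside DataStage; these just mimic the behaviour):

def capitals(s):
    return s.title()                                   # word-initial capitals

def digits(s):
    return "".join(c for c in s if c.isdigit())        # keep only the digits

def letters(s):
    return "".join(c for c in s if c.isalpha())        # keep only the letters

print(capitals("john smith"), digits("AB12-34"), letters("AB12-34"))
# John Smith 1234 AB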

15. Where can job parameters be included/used?


A: Passive stage file and table names.
Passive stage directory path.
Account name for hash file stage.
Transformer stage constraints and derivations.

16. What is a job sequencer?


A: DataStage provides a graphical job sequencer which allows you to
specify a sequence of server jobs to run.

17. What is a Sequencer activity?


A: A Sequencer synchronizes the control
flow of multiple activities in a job sequence. It can have multiple
input triggers as well as multiple output triggers.
The Sequencer operates in two modes:
ALL mode. In this mode all of the inputs to the sequencer must be TRUE
for any of the sequencer outputs to fire.
ANY mode. In this mode, output triggers can be fired if any of the
sequencer inputs are TRUE.

18. What is an Execute Command activity?


A: An Execute Command activity executes an
operating system command.

19. What is a Routine activity?


A: A Routine activity executes a specified routine, which is
chosen from the built-in or custom routines supported by DataStage.

20. Types of Hash file Stages?


A: Dynamic: automatically adjusts the modulus in accordance with the data.
Static: does not adjust the modulus without explicit intervention.

21. In which format the Exported files will be (Extension)?


Ans: As .dsx or XML (.xml) files.

22. What are Routines?


Ans: Routines are functions which we develop in BASIC code for
required tasks that DataStage does not fully support (complex tasks).

23. How do you generate Sequence number in DS?


Ans: Using the routines
KeyMgtGetNextVal and
KeyMgtGetNextValConn.
It can also be done by
using an Oracle sequence.

24. What is merge (plug-in) stage?


Ans: This stage is used to merge the two sequential files

25. What is orabulk Stage?


Ans: This Stage is used to Bulk Load the Oracle Target Database

26. What is Container & Types of it?


Ans: Containers are the reusable set of stages.
Types are: Local Container and Shared Container
Local Container is local to the particular job in which we developed
the container.
A Shared Container can be used in other jobs as well.
27. Types of scheduling a job (Frequency)?
Ans:
Today: with specified time
Tomorrow: with specified time
Next: with specified Day, Time.
Every: with specified Month, Day, and Time
Daily: with specified Time
Time can be specified in a format like AM, PM or 24 hr.

28. What are Triggers? Types of Triggers?


Ans: Types of triggers:
Conditional - Ok, Failed, Warning, Custom, Return Value, User Status
Unconditional
Otherwise
29. What is the Job Invocation ID? What are its uses? How do you set it?
Ans: Only by setting this Invocation ID can we run multiple
instances of a job.
Using this we can run a job on different days with different parameters.
We set it when we design the job: check Allow Multiple Instance in the job
properties.

30. What are the different activities that can output different types of
trigger?
Ans:
Wait for File, Exec Command, and Routine:
    Unconditional, Otherwise, Conditional (Ok, Failed, Custom, Return Value)
Job:
    Unconditional, Otherwise, Conditional (Ok, Failed, Warning, Custom, User Status)
Nested Condition:
    Unconditional, Otherwise, Conditional (Custom)
Run-Activity-On-Exception, Sequencer, Email Notification:
    Unconditional
PART – II

1. What is the Job Invocation ID? What are its uses? How do you set it?
Ans: Only by setting this Invocation ID can we run multiple
instances of a job.
Using this we can run a job on different days with different parameters.
We set it when we design the job: check Allow Multiple Instance in the job
properties.

2. What are Triggers? Types of Triggers?


Ans: Types of triggers:
Conditional - Ok, Failed, Warning, Custom, Return Value, User Status
Unconditional
Otherwise
3. What are the different activities that can output different types of
trigger?
Ans:
Wait for File, Exec Command, and Routine:
    Unconditional, Otherwise, Conditional (Ok, Failed, Custom, Return Value)
Job:
    Unconditional, Otherwise, Conditional (Ok, Failed, Warning, Custom, User Status)
Nested Condition:
    Unconditional, Otherwise, Conditional (Custom)
Run-Activity-On-Exception, Sequencer, Email Notification:
    Unconditional

4. The Color of the links for different types of triggers?


Ans: Black: Unconditional, Otherwise
Green: Conditional - Ok
Red: Conditional - Failed, Warning
Blue: Conditional - Custom, Return Value, User Status

5. Types of Activities supported by Job Sequence?


Ans: Wait for File, Exec Command, Email Notification, Sequencer, Job,
Routine, Run-Activity-on-Exception, Nested Condition.

6. In how many ways can you build a batch/job sequence? What are they?
Ans: 1. Job Sequence - DS Designer
2. Batch Job Facilities - DS Director
3. Job Control Routine - DS Designer

7. What is a Link Partitioner stage? How many output links can be
defined?
Ans: The Link Partitioner stage is used to partition the data flow. 64 output links
can be defined.

8. What is the Difference between In-Process and Inter-Process?


Ans: In in-process mode the transformation is done on a row-by-row basis, whereas in
inter-process mode it is done on a bulk data basis.

9. How do you define Constraints and how do you handle rejects?


Ans: To define a constraint or specify a reject link, do one of the
following:
Select an output link and click the constraints button.
Double-click the output link’s constraint entry field.
Choose Constraints from the background or header shortcut menus.
A reject link can be defined by choosing Yes in the Reject Row field
and setting the
Constraint field as follows:
To catch rows which are rejected from a specific output link, set the
Constraint field to linkname.REJECTED.
To catch rows which caused a write failure on an output link, set the
Constraint field to linkname.REJECTEDCODE.

10. How can you Aggregate data without using Aggregator Stage?
Ans: We do this using ODBC Stage.

11. How do you create custom transforms?


Ans: To create a custom transform:
From the DataStage Manager, select the Transforms branch in the project
tree and do one of the following:
Choose File > New Transform… .
Choose New Transform… from the shortcut menu.
Click the New button on the toolbar.
12. How do you maintain metadata in DS?
Ans: Using MetaBroker. MetaBrokers allow you to exchange enterprise Meta
data between DataStage and other data warehousing tools. For example,
you can use MetaBrokers to import into DataStage table definitions that
you have set up using a data modeling tool. Similarly you can export
Meta data from a DataStage job to a business intelligence tool to help
it analyze your data warehouse.

13. How do you import a COBOL source file? Is there any separate stage to do
this?
Ans: In an ODBC stage, click Load and from the menu choose Import COBOL Definitions.
The CFF (Complex Flat File) plug-in stage can also be used.

14. How will you sort the data in DS? Any Separate Stage?
Ans: We can do it with a SQL override. We also have a Sort stage, which gives us
better performance than the above.

15. What is tracing Level and give the values of it and message.
Ans: Tracing Level is the information to be included in job log file
0 - No Stage properties information included in the job log file and
1 - Stage properties information included in the job log file

16. How will you merge two Sequential Files?


Ans: The Merge Stage is used to merge two sequential files.

17. What are the parallelisms supported in DS. How will we achieve this?
Ans: SMP, MPP, and clusters. For this we have to design parallel jobs using
the Parallel Extender.

18. What is default Buffer size for in -process and Inter-process?


Ans: (128 KB)

19. What is a sequencer?


Ans: A sequencer allows you to synchronize the control flow of multiple
activities in a job sequence. It can have multiple input triggers as
well as multiple output triggers .The sequencer operates in two modes:
ALL mode. In this mode all of the inputs to the sequencer must be TRUE
for any of the sequencer outputs to fire.
ANY mode. In this mode, output triggers can be fired if any of the
sequencer inputs are TRUE.
20. Reporting Tool? And how do you update whole project?
Ans: The Data Stage Reporting Tool is flexible and allows you to
generate reports at various levels within a project, for example, entire
job, single stage, set of stages, etc.

Whole Project. Click this button if you want all project details to be
updated in the reporting database. If you select Whole Project, other
fields in the dialog box are disabled. This is selected by default.

21. What is a Nested Condition?


Ans: A nested condition allows you to further branch the execution of a
sequence depending on a condition. Each nested condition can have one
input trigger and will normally have multiple output triggers.

22. Converting Containers!


Ans: You can convert local containers to shared containers and vice
versa.
By converting a local container to a shared one you can make the
functionality available to all jobs in the project.
You may want to convert a shared container to a local one if you want to
slightly modify its functionality within a job. You can also convert a
shared container to a local container and then deconstruct it into its
constituent parts.

23. What are Before - Stage and After - Stage Routines?


Ans: The Routines which we have to run before or after a particular
stage runs. Using this we run scripts, PL/SQL codes, OS commands.
There are several built-in before/after subroutines supplied with
DataStage:
DSSendMail. This routine is an interlude to the local send mail program.
DSWaitForFile. This routine is called to suspend a job until a named file
either exists, or does not exist.
ExecDOS. This routine executes a command via an MS-DOS shell. The
command executed is specified in the routine’s input argument.
ExecTCL. This routine executes a command via a DataStage Engine shell.
The command executed is specified in the routine’s input argument.
ExecSH. This routine executes a command via a UNIX Korn shell.

24. What are the types of joins we have in DS?


Ans: Pure Inner, Complete set, Right and Left Only, Right Only, Left
Only, Left Outer, Right Outer.

25. What are the loading modes in OraBulk Stage?


Ans: Loading modes in OraBulk:
Insert: Default mode. Requires the table to be empty before loading,
otherwise it raises an error.
Append: In Append mode new rows will be appended to the existing table;
we must have the SELECT privilege to use this mode.
Replace: With Replace mode all rows are deleted and new rows are
inserted. We must have the DELETE privilege to use this mode.
Truncate: This is the best possible option for performance.
It truncates the table and inserts the data. Table referential integrity
constraints must be disabled, otherwise an error is raised. The DELETE ANY
privilege is a must.

26. What are the files created by OraBulk?


Ans: The Orabulk stage creates a control file and a data file.
The control file contains the information required by SQL*Loader, such as userid,
password, table name and the fields to be loaded.
The data file (in flat file format) contains all the data that is to be
loaded into the table.

27. How SQL *Loader work for OraBulk Stage (process)?


Ans: The OraBulk stage uses SQL*Loader to bulk load. sqlldr uses
two files, a control file and a data file (there can be more than one
data file). Based on the information given in the control file, sqlldr
loads the data. sqlldr first disables the constraints and bulk loads the
data from the flat file (data file). All the disabled constraints are
enabled again after the completion of loading. sqlldr checks only the field
length and datatype. If any error occurs it will be raised at the DB level
only.
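As a rough sketch of how the generated files reach the loader (the file names and connect string here are hypothetical; in a job this call would typically sit in an after-job subroutine rather than Python):

import subprocess

cmd = [
    "sqlldr",
    "userid=scott/tiger@ORCL",   # placeholder credentials and TNS alias
    "control=customers.ctl",     # control file written by the Orabulk stage
    "log=customers.log",         # SQL*Loader errors go here, not to the DataStage log
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode)         # a non-zero return code indicates load problems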

28. What is a Folder Stage?


Ans: Folder stages are used to read or write data as files in a
directory located on the Data Stage server. The folder stages can read
multiple files from a single directory and can Deliver the files to the
job as rows on an output link. The file content is delivered with New
lines converted to char (254) field marks. The folder stage can also
write rows of data As files to a directory. The rows arrive at the stage
on an input link.

29. What we can do with Transformer Stage?


Ans : Transformer stages do not extract data or write data to a target
database. They are used to handle extracted data, perform any
conversions required, and pass data to another Transformer stage or a
stage that writes data to a target data table. We can define
Constraints, derivations and Stage variables also.

30. What is a Hash File Stage?


Ans: Hashed File stages represent a hashed file, i.e., a file that uses
a hashing algorithm for
distributing records in one or more groups on disk. We use a
Hashed File stage to access Universe files. The Data Stage Engine can
host Universe files locally. You can use a hashed file as an
intermediate file in a job, taking advantage of DSEngine’s local
hosting.
We can use a Hashed File stage to extract or write data, or to act
as an intermediate file in a
job. The primary role of a Hashed File stage is as a reference
table based on a single key field (Lookup).
PART – III

1. What are the client’s tools in DataStage?


2. What are the types of jobs in DataStage?
3. What do you do in DataStage Administrator?
4. What is DataStage Manager?
5. How do you export/import the project?
6. Can you export/import individual components in the project?
7. Can we import table definitions in Manager?
8. What is Designer used for?
9. What does a job contain?
10. What are active and passive stages?
11. What are containers?
12. What are the different types of containers?
13. What is the difference between them?
14. What is the job sequence?
15. What are triggers?
16. What are the types of triggers?
17. What is a job sequencer?
18. Where do you schedule a job?
19. What are the schedule options available in DataStage?
20. How do you schedule a job?
21. How do you compile a job?
22. What is the use of compiling a job?
23. What is a log file?
24. Can you clear a log file for the whole project?
25. How do you monitor the jobs?
26. What do you monitor?
27. What are stage variables?
28. What are custom transforms?
29. Where do you create custom transforms?
30. What is a hash file stage?
31. What is a merge stage?
32. What is a pivot stage?

PART – IV

1. What are the components of Ascential Data Stage?


Ans: Client Components---- Administrator, Director, Manager, and
Designer.
Server Components--- Repository, Server and Plug-ins

2. What can we do with DataStage Director?


Ans: Validating, Scheduling, Executing and Monitoring Jobs (server
Jobs)

3. What do you do with Manager?


Ans: Exporting, Importing-----Routines, Table Definitions, jobs,
project.
Developing Routines.

4. Can we export Executables?


Ans: Yes

5. In which format the Exported files will be (Extension)?


Ans: As .dsx or XML (.xml) files.

6. What are the Types of Jobs we have in DS & briefly


(Components used to for them)
Ans: Server Jobs:
Designed, developed and compiled are done in Designer.
Validated, Scheduled, Executed and Monitoring are done in Director.
Win NT/ Unix system Server
Parallel Jobs:
Designed, developed and compiled are done in Designer
Validated, Scheduled, Executed and Monitoring are done in Director.
Server should be on Unix System Only
Mainframe Jobs:
Designed and developed are done in Designer (win)
Compiled, Validated, Scheduled, Executed and Monitored in Mainframe
Systems Only.

7. Types of Stages.
Ans: 1) Active Stage: In which Transformation, Aggregation etc are
done.
Ex: Transformer, Aggregator.
2) Passive Stage: In which Read/Write Process is done.
Ex: ODBC, Sequential, Hash File...
Other Classification:
1) Built-in: These are default Stages. Ex: ODBC, Transformer, Sequential
etc.
2) Plug-in: These stages are to be installed separately, which are used
for special tasks, which DS not supported previously.

8. Difference between Normal Loading and Bulk Loading.


Ans:
Normal Loading                         Bulk Loading
Log information is maintained          Log information is not maintained
ETL processes rows one at a time       ETL processes data in bulk
DB constraints are followed            DB constraints are not followed

9. What is Hash File Stage?


Ans: A hash file is a binary file used for lookups, for better
performance.

10. What is a Stage Variable?


Ans: These are temporary variables created in a transformer for
calculations.

11. What are constraints?


Ans: These are filtering Conditions we specify in DS

12. What are Routines?


Ans: Routines are functions which we develop in BASIC code for
required tasks that DataStage does not fully support (complex tasks).

13. What are iConv and oConv?


Ans: These are date functions, which we use to convert dates
between internal and external formats.
iConv - External to Internal
oConv - Internal to External

14. How do you generate Sequence number in DS?


Ans: Using the routines
KeyMgtGetNextVal and
KeyMgtGetNextValConn.
It can also be done by
using an Oracle sequence.

15. What is merge (plug-in) stage?


Ans: This stage is used to merge the two sequential files
16. What is orabulk Stage?
Ans: This Stage is used to Bulk Load the Oracle Target Database

17. What is Container & Types of it?


Ans: Containers are the reusable set of stages.
Types are: Local Container and Shared Container
Local Container is local to the particular job in which we developed
the container.
A Shared Container can be used in other jobs as well.

18. Types of scheduling a job (Frequency)?


Ans:
Today: with specified time
Tomorrow: with specified time
Next: with specified Day, Time.
Every: with specified Month, Day, and Time
Daily: with specified Time
Time can be specified in a format like AM, PM or 24 hr.

How To Split a Single Sequential File into Multiple Sequential Files?

Ken_R
Originally posted: 2007 Jan 04 09:02 AM

I am working on a design for an ETL job that must accept a sequential file as input. The input file
contains data for a number of "locations", typically 10-15 locations or so. We do not know the
locations, only that the file will contain multiple records for each location. The job(s) must split the
input file into a single file for each location. So if there are 15 locations in the input file we will end
up with 15 individual files as the output. The output files must be named with the location name in
the filename.

My initial thought on this was to first run a job to obtain a dataset containing the unique list of
locations in the input file. Then, using this list, run another job that will filter the input file by each
item in the locations dataset, creating an output file with a name based on the location code. I'm not
sure how to do this or if it is even possible. It needs to be driven from the unique list of locations. I'm
thinking the Job Sequence would need to loop through the location dataset and pass each row value
to the main processing job as a parameter. Is this possible? Can you suggest how I should approach
this problem?
Yes, this will be processed on AIX servers. I have attached some sample test data. Here's what I'm
trying to do with it:

1. The file needs to be split based on the first column. Columns are delimited with semi-colons.

2. The first and last rows of the source (header/trailer) must exist in all output files as well.

3. Output files must be named as the source file with the value of the first column appended.

A shell script may be the way to go. If I can ever figure out how to set the UserStatus in a parallel job
(or server job) then I can probably get DataStage to do it. For some reason the DSSetUserStatus
routine is not available in our DataStage!
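For what it's worth, the split itself is simple enough outside DataStage. A minimal Python sketch of the shell-script approach, assuming the header is the first row, the trailer is the last row and using an invented input file name:

from collections import defaultdict

source = "locations.txt"                  # hypothetical input file name
with open(source) as f:
    lines = [line.rstrip("\n") for line in f]

header, trailer, body = lines[0], lines[-1], lines[1:-1]

# Group detail rows by the first semicolon-delimited column (the location).
groups = defaultdict(list)
for line in body:
    groups[line.split(";", 1)[0]].append(line)

# Write one file per location, repeating the header and trailer in each,
# and naming each file as the source file with the location appended.
for location, rows in groups.items():
    with open(source + "." + location, "w") as out:
        out.write("\n".join([header] + rows + [trailer]) + "\n")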
Dimensional Modeling - Dimension Table

In a Dimensional Model, the context of the measurements is represented in dimension tables. You can
also think of the context of a measurement as the characteristics such as who, what, where, when and
how of a measurement (the subject). In your business process Sales, the characteristics of the
'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).

The Dimension Attributes are the various columns in a dimension table. In the Location
dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension
Attributes are used in report labels, and query constraints such as where Country='USA'. The
dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what the data warehouse will contain. Say
you want to build a data warehouse containing monthly sales numbers across multiple store
locations, across time and across products; then your dimensions are:

Location
Time
Product

Each dimension table contains data for one dimension. In the above example you get all your store
location information and put that into one single table called Location. Your store location data may
be spanned across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize
all that data into one single table.

Dimensional modeling is the design concept used by many data warehouse designers to build
their data warehouse. Dimensional model is the underlying data model used by many of the
commercial OLAP products available today in the market. In this model, all data is contained in two
types of tables called Fact Table and Dimension Table.

Dimensional Modeling - Fact Table

In a Dimensional Model, Fact table contains the measurements or metrics or facts of business
processes. If your business process is Sales, then a measurement of this business process such as
"monthly sales number" is captured in the fact table. In addition to the measurements, the only
other things a fact table contains are foreign keys for the dimension tables.

DataStage server v enterprise: some performance stats


Vincent McBurney (Consultant, Solution Architect) Posted 12/19/2005

I ran some performance tests comparing DataStage server jobs against parallel jobs
running on the same machine and processing the same data. Interesting results.

Some people out there may be using the server edition, most DataStage for
PeopleSoft customers are in that boat, and getting to the type of data volumes that
make a switch to Enterprise Edition enticing. Most stages tested proved to be a lot
faster in a parallel job than a server job even when they are run on just one parallel
node.

All tests were run on a 2 CPU AIX box with plenty of RAM using DataStage 7.5.1.

The sort stage has long been a bugbear in DataStage server edition, prompting many
to sort data in operating system scripts:
1mill server: 3:17; parallel 1node: 00:07; 2nodes: 00:07; 4nodes: 00:08
2mill server: 6:59; parallel 1node: 00:12; 2node: 00:11; 4 nodes: 00:12
10mill server: 60+; parallel 2 nodes: 00:42; parallel 4 nodes: 00:41
The parallel sort stage is quite a lot faster than the server edition sort. Moving from 2
nodes to 4 nodes on a 2 CPU machine did not see any improvement on these smaller
volumes and the nodes may have been fighting each other for resources. I didn't
have time to wait for the 10 million row sort to finish but it was struggling along after
1 hour.

The next test was a transformer that ran four transformation functions including trim,
replace and calculation.
1 mill server: 00:25; parallel 1node: 00:11; 2node: 00:05: 4node: 00:06
2 mill server: 00:54; parallel 1node: 00:20; 2node: 00:08; 4node: 00:09
10mill server: 04:04; parallel 1node: 01:36; 2node: 00:35; 4node: 00:35

Even on one node with a compiled transformer stage the parallel version was three
times faster. When I added one node it became twelve times faster with the benefits
of the parallel architecture.

Aggregation:
1 mill server: 00:57; parallel 2node: 00:15
2 mill server: 01:55; parallel 2node: 00:28

Reading from DB2:


2 mill rows server: 5:27; parallel 1node: 01:56; 2node: 01:42

The DB2 read was several times faster and the source table with 2 million plus rows
had no DB2 partitioning applied.

So as you can see even on a 1 node configuration that does not have a lot of parallel
processing you can still get big performance improvements from an Enterprise
Edition job. The parallel stages seem to be more efficient. On a 2 CPU machine there
were some 10x to 50x improvements in most stages using 2 nodes.

If you are interested in these types of comparisons leave a comment and in a future
blog I may do some more complex test scenarios.

Learning DataStage: the Modify stage


Vincent McBurney (Consultant, Solution Architect) Posted 12/18/2006

This new series is about techniques I use to learn more about DataStage starting with
the Modify stage.

DataStage is always a learning experience; most developers have a part of
the product they don't know much about and haven't had to use. I am starting this
series on techniques for learning more about DataStage and I hope to cover parts of the
product I don't know yet.

Learning the Modify Stage


The Modify stage is a metadata panel beater. It does just one thing: converts
columns from one data type to another. It is quite restrictive, you cannot nest
functions or use constants as values. Almost all the functions in the Modify stage are
also available in the all rounder Transformer stage.

So why use the Modify stage? Well, I don't. As I said in the post Is the DataStage
parallel transformer evil? I always use the Transformer first. I only go to the Modify
stage when I need some extra performance on a very high data volume.

The Transformer is an automatic, the Modify Stage is a manual

Transformers are easy to use, which is one of the reasons why DataStage has been
successful; the Transformer is the most commonly used stage. When adding derivation /
mapping / transformation code into the text entry windows there is a right mouse
click menu for finding a list of transformation functions.

After choosing a function from the menu it comes with syntax to show how to use the
function.

By comparison, for the Modify stage you get next to nothing. You just get a text box
and you need to enter all the column and function text manually, without a right
mouse click menu.

There are almost no sample commands or example code in the DataStage manuals.
There is a list of functions and syntax in the DataStage Parallel Job Developer's Guide but
it usually takes several goes to get functions right.
The easiest way to learn the Modify stage or learn a new function you haven't used
before is to create a simple test job with a Modify stage in it:

 The Generate Rows stage creates test data for the Modify stage to work on.
Any number of columns and rows though you only need to start with one
column.
 The Peek stage shows the result of the Modify function so you can view the
output in the Director.
 The Modify stage can be used to test out functions before they are added to
actual jobs.

This makes it easier to try out different functions to use in real jobs. It is hard
learning and debugging Modify functions in a real job due to things like high data
volumes and job complexity. I prefer to get functions working in isolation first so I go
back to this type of learning job whenever I need to add a modify stage with some
unfamiliar functions.

1. I used to routinely get bitten by the transformer stage rejecting rows in which
some fields were participating in derivations or stage variables and the dang
things had NULLS in them.
2. Solving the NULLS problem with IF IsNull() for every single field being
evaluated in some way can get overly complex and very messy.

Instead I put a Modify stage before the Transformer, call the stage
MhandleNull and apply handle_null() to all fields being evaluated in the transformer.
This simplifies the already cumbersome IF THEN ELSE syntax of the
transformer and/or the stage variables.

The Modify stage is primitive, yes, but therein lies the secret of its success - it's slick.

With a Modify stage you can perform five classes of function, more than Vincent
suggests, but one major limitation is that Modify stage functions are limited to a
single argument - straight away arithmetic and concatenation are ruled out, for
example.

The five classes of function are:

 null handling
 date/time manipulation
 date/time interval calculation (limited)
 string trimming and substring
 data type conversion

The Modify stage is incompletely (and occasionally inaccurately) documented in the


Parallel Job Developer's Guide; you need to enrich your knowledge by consulting the
chapter on the modify operator in the Orchestrate Operators manual.

Null values are a hand grenade in a parallel Transformer. As soon as you do anything
to them they go off and you lose the row. It's one of the most common causes of row
leakage. A lot of my job designs have a Modify or Transformer stage near the
beginning of the job to handle all metadata issues such as null values and types, and a
transformer near the end of the job to do the business rule QA, derivations and
column mapping.

If I am playing with columns in a parallel transformer and I think there may be nulls I
do the null handling in stage variables and then only refer to these stage variables in
the other parts of the transformer.

DataStage tip 1 of 10: hacking job parameters


Vincent McBurney (Consultant, Solution Architect) Posted 11/13/2006

There is such a thing as a DataStage hack. One of my favourite hacks is the copy
parameters routine. It lets you copy parameters from a prototype job to a large
number of target jobs identified by using a folder name and/or a job name pattern.

The Set Default Parameters job is available from the Ken Bland and Associates
website. It's a DataStage job that hacks the DataStage repository to change job
parameters for one or more jobs.

Getting the Routine

Go to the http://www.kennethbland.com/ website and create a login. Go to the
Services page and on the left side of the page you will see a Publications link that will
take you to a list of DataStage articles, white papers and DataStage routines. This
page only seems to work for IE browsers and had login problems through a Firefox
browser. SetDefaultParameters can be downloaded and imported into a DataStage
project.

How it works

The download contains a DataStage dsx export file with a single job. Import this job
into DataStage. The first thing you will notice is that the job has no stages! It is a job
made entirely of BASIC code and each time it runs it executes this code to change
sets of parameters directly in the DataStage repository.

When you run the job you will see the following parameters:
 Update Parameters: set this to Y so the job does something.
 Enter the job text to search: this works with a LIKE statement to only change
jobs with a name that matches the string.
 Enter the folder name: an optional field that only changes jobs within that
folder.
 Enter the Prototype job: the name of the job that holds the parameters to be
copied from.
 Enter Y to replace block: this option removes all existing parameters before
adding the new parameters.

Add one parameter to a large number of jobs

Create an empty job with the one parameter. Make it the prototype, set Update to Y
and Replace Block to N. That one parameter will be added to all target jobs without
affecting existing parameters.

Synchronise jobs to one set of parameters

Create a job with your final set of parameters. Make it the prototype, set Update to Y
and Replace Block to Y. All existing parameters will be removed and the new
parameters added.

Overwrite parameter values

Add the parameters you want to reset into a prototype job. Set Update to Y and
Replace Block to N and the target jobs will have the values replaced and other
parameters will not be affected.

Happy Anniversary DataStage Hawk Beta


Vincent McBurney (Consultant, Solution Architect) Posted 6/28/2006

I am posting this a few days too late but it is the twelve month anniversary of the Hawk
beta test. It is a test that for some people has not yet started.
The Hawk Beta press release came out on June the 23rd of 2005, just four months
after Ascential became an IBM company. There was a round 1 Hawk beta test for
clients and partners who requested beta software for Windows, which completed
successfully. Round 2 has recently started for other platforms but with much
tighter acceptance criteria for clients to beta test production data. Some of you who
signed up for the beta test twelve months ago still may not have heard back from
IBM.

I was thinking of anniversaries as yesterday was my eighth wedding anniversary. Not


the most romantic of anniversaries. My wife has a big work load and I am looking
after four kids under 6. Our oldest did nag us to go out for dinner to celebrate our
anniversary but I ended up cooking chops and vegies at home! We were lucky to
celebrate it at all, it was only a chance read of the TV guide that made us realise our
anniversary date had arrived.

Since I am having a little dig at Hawk I thought I would take us down memory road
with this Ascential press release from AscentialWorld 2003 about some very
optimistic release dates:

Product Roadmap
Ascential Software unveiled three strategic product initiatives, code-named Trinity,
Hawk, and Rhapsody. The results of these development projects are anticipated to be
in the market as soon as the first half of 2004.
* "Trinity" - Anticipated completion mid 2004.
Trinity highlights include:
o Expanded service-oriented architecture capabilities
o 64-bit performance and expanded mainframe support
o Increased functionality for enterprise grid computing initiatives
o "Frictionless connectivity" - eliminating barriers to connectivity inside and outside
the enterprise
* "Hawk" - Expected delivery late 2004.
Hawk highlights include:
o User interface innovations for enhanced productivity
o RFID infrastructure support
o "Next generation" meta data services infrastructure
* "Rhapsody" - Next generation platform that will deliver a quantum leap in
productivity, manageability and scalability.

I'm afraid that 64-bit performance from Trinity never made it, nor did frictionless
connectivity (it's a big part of the Hawk release). However RFID infrastructure support
did make it to market on time. The "late 2004" release date for Hawk turned into
"much later than 2004".

Don't get me started on the beta for the analyser tool! I am assuming that the
Analyser product, which is the successor to ProfileStage and AuditStage, requires
completion of the Metadata Server beta testing before it can enter its own beta
testing.

So happy anniversary Hawk Beta. I've marked that press release date, October 29, as
the anniversary of the public announcement of the Hawk release. This October will be
the third anniversary of waiting for Hawk. You can expect another blog from me then
but hopefully it will be better news. In the meantime you can whet your appetite for
Hawk with my blog from a couple months ago My DB2 Magazine article: Hawk
overview, screenshots and questionnaires!.
You can also read my early entry, in fact my third blog entry, My top ten features in
DataStage Hawk.

Four flavours of DataStage FUD


Vincent McBurney (Consultant, Solution Architect) Posted 7/6/2006

The ETL market is highly competitive and with this comes the spreading of Fear,
Uncertainty and Doubt (FUD). I tackle four items of FUD about WebSphere DataStage.

Wikipedia has a history of the term FUD at the Fear, uncertainty and doubt topic
page. Interestingly it was first coined to describe an IBM marketing practice and here
I am talking about it being applied to IBM WebSphere DataStage.

Fear, uncertainty, and doubt (FUD) is a sales or marketing strategy of disseminating
negative (and vague) information on a competitor's product. The term originated to
describe misinformation tactics in the computer hardware industry and has since
been used more broadly. FUD is a manifestation of the appeal to fear.

FUD 1 - server jobs are going away


The first bit of DataStage FUD appeared on the dsxchange forum with Refuting FUD
where rumours were being spread that DataStage Server Edition would not be
supported beyond the next release. The truth is that server jobs will be supported in
the Hawk release and there are no plans to shelve them any time soon. Both server
and parallel jobs benefit greatly from the Hawk release improvements.

FUD 2 - DataStage Hawk will be a difficult upgrade


I was made aware of another piece of FUD that was being extracted out of my
DB2 Magazine article on the Hawk release. I spoke of some of the interesting features
in the Hawk release and provided a range of options for moving to Hawk. This has
been turned into collateral by a DataStage competitor claiming that you shouldn't use
DataStage because the Hawk release will be a difficult upgrade.

Tis a nugget of purest FUD.

DataStage upgrades are quite easy to run: DataStage projects get upgraded by the
installation process and DataStage export files can also be upgraded as they are
imported. This means you can upgrade your projects and have backups on hand in
case the installation runs into problems. There is no evidence anywhere that the
Hawk release will be a difficult upgrade.

Since Hawk is still in its beta phase we haven't even seen the full upgrade
documentation, so it is premature to use it as FUD.

FUD 3 - you have to use a pure play vendor


Another piece of FUD came from Philip Howard of Bloor Research in The market for
data migration last month. It concludes that the only data integration vendor
focusing on application migration is Informatica.

The fact is that most of the leading data integration tools are owned by third party
vendors and each of these suppliers has a different agenda to pure play data
integration companies. Thus SAS and Business Objects are much more focused on
business intelligence than on data integration per se, while IBM is focusing a lot of
attention on MDM and data governance, for example.

I would LOVE to know why data governance and MDM have nothing to do with
application migration!!!! They are not mutually exclusive areas of data integration.

The term pure play data integration appeared in a lot of Informatica press releases
after IBM bought Ascential. It reminds me of the Harry Potter story lines about pure
blood wizards. Apparently DataStage has been mixed with Muggle Blood and is no
longer a pure play data integration tool. It magically only works on AIX against DB2
databases and it has forgotten everything it knows about SAP, PeopleSoft and Oracle
Financials!

Pure play pure shmay is what I say. DataStage is still an ETL tool that works on
multiple platforms (Unix, Linux, Windows, mainframe, USS), it works equally well
against all the popular databases (Oracle, SQL Server, DB2, Teradata) and it has
enterprise packs for code free integration with the major ERP tools (SAP, PeopleSoft,
Oracle Financials). None of this magically disappeared when Ascential was acquired
IBM and none of this disappears in the Hawk release. The talk of agendas and pure
play vendors is FUD.

It is also amusing to think IBM can only focus on two marketing ideas at any one
time. Since they are focusing on MDM and data governance they must have
abandoned application migration, data warehousing, BI and data conversions! Oh,
and since Microsoft are focused on Vista they must no longer be competing in the
console market! Most of the IBM MDM press releases do not even refer to DataStage,
they refer to the MDM product IBM purchased.

There remains plenty of information about application migration using DataStage in
the DataStage Literature Library and from IBM sales:
- 7 Key Steps to Maximizing Value from SAP (3333 downloads)
- Lowering the Total Cost of Ownership of SAP Deployments Through Enterprise Data
Integration (2800 downloads)
- Accelerating Merger and Acquisition Success
- IBM WebSphere Connectivity for Enterprise Application

FUD 4 - Is ETL Obsolete?


I spoke about this FUD in my ELT post last month. It is a clever marketing message
from an ELT vendor, cleverly placed to appear as a Google ad each time the term
ETL is used on a site with Google ads. Both IBM and Microsoft have invested
heavily in ETL products in the last two years and they are not the type of vendors
that invest in obsolete products!

Sorts to the left of me, sorts to the right


Vincent McBurney(Consultant, Solution Architect) Posted 7/29/2006
Comments (0) | Trackbacks (0)
Sorting is the simplest task in ETL! Okay, so why am I blogging about it? Because
when you hit very large data volumes you need to consider all your sorting
alternatives.

In the process of moving data there are many places to do a sort:


- In the database during extract.
- In the operating system if data is landed to a file.
- In a specialist sorting product.
- In the transformation process.

With an ETL tool you can do the sort in the transformation process via a sort stage or
you can use the tool to execute the sort via external tools. Often the call to the
external sort is seamless - it can be called from the ETL database stages or
sequential file stages.

Why are sorts required?


- A remove-duplicates or de-duplication step often needs presorted data to find identical
adjacent rows.
- ETL (row-by-row) aggregations are more efficient if data is sorted by the
aggregation key fields so the collection of groups is faster.
- Some data needs to be loaded in the order in which it was captured in the source
system and needs some type of date, ID or timestamp sorting.
- Vertical pivots (turning multiple rows into one row) can require sorted data for
adjacent row merging.

Database sorting
Easily done by adding an ORDER BY to your SQL select statement in your database
stage. You can add it through the stage SQL builder, or as an extra property to a
generated SQL, or in a user-defined SQL statement. Database sorts are similar to ETL
sorts: both try to do the sort in memory and, when that fills up, overflow into
some type of temporary storage on disk. This means they both may need some
special care on very large volumes.
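
If the extract is being landed to a file anyway, the same ORDER BY can be pushed into a
command line extract. A minimal sketch using the DB2 EXPORT utility (the database,
schema, table and column names here are invented for the example, not taken from any
real project):

#!/bin/ksh
# Sketch only: land a pre-sorted, delimited extract straight from the database.
db2 connect to DWDB > /dev/null
db2 "EXPORT TO /staging/customer_sorted.txt OF DEL
     SELECT cust_id, cust_name, updated_ts
     FROM ods.customer
     ORDER BY cust_id"
db2 connect reset > /dev/null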

Be wary of sort commands against a database that cannot handle the extra load, or
in a complex SQL statement with outer joins, or in a database that has not been
configured to handle very large sorts. If you want to minimise the impact you have on
source databases then leave the sort on the ETL server.

If you had someone like Chris Eaton on hand to monitor performance (hit ratios,
sorts, dynamic queries) he might help you with the source SQL and make sure the
sorting is done efficiently. However, on a lot of ETL projects you have a lot of legacy
sources and the DBAs who look after those databases might not give you the time of
day.

Operating System Sorts


On Unix and Linux systems you can run a sort via an operating system command or a
script. This type of sort works best on text files. It can be a bottleneck in your process
as it is a single threaded operation and is difficult to partition into multiple instances.
By comparison, if you had a multi-instance DataStage server job it could filter on
key fields and then sort, giving you many parallel sorts instead of one.

On Windows you can find the Unix sort command at SourceForge.
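
As a minimal sketch, a pipe-delimited landing file could be sorted on a couple of keys
with the standard sort command before the ETL job reads it (the file names, delimiter
and key positions are only examples):

#!/bin/ksh
# Sort by customer id (field 2) then numerically by amount (field 5),
# pointing the sort work files at a roomy temp directory.
sort -t '|' -k 2,2 -k 5,5n -T /var/tmp \
     -o /staging/customers_sorted.txt /staging/customers.txt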


Specialised Sort Tools
The third party sort tool that works best with DataStage is CoSort as it has a
DataStage plugin that is compatible with the latest version. SyncSort is a similar tool
but I don't think they have a plugin.

CoSort started supplying specialised sorting tools in 1978! In 1997 the CoSort sorting
product was executing Unix sorts across multiple CPUs, years before DataStage
got parallel processing. CoSort offer plugins for both DataStage Server Edition and
Informatica PowerCenter that provide very fast sorting on Unix and Windows along
with some joining, filtering and aggregation.

If you are on the cheaper and older DataStage Server Edition then CoSort is worth
evaluating. It was good to see IBM recently re-affirm the partnership that CoSort
started with DataStage way back at version 3.6. It can save a huge amount of time if
you can sort, aggregate and filter large text files in the first step of your job in
parallel mode. They do not have a plugin for parallel jobs, but given the performance
of the sort stage in parallel jobs it may not be worth them developing one.

Some CoSort stats:


On an IBM p690 running with only 4 of its 32 CPUs, a 1GB CoSort sort took only 12
seconds. On a first generation Intel Itanium (IA-64 prototype) server with 4 CPUs
running Debian Linux, CoSort routines sorted 5GB in under 6 minutes.

ETL Sorts
The ETL sort is extremely easy to implement as it is a stage with just a couple of
options to set. Generally if you get one very large sort optimised for one job it will
work across every other job as well. My own humble parallel job versus server job
testing showed a parallel job sort could be at least 100 times faster than a server job
sort, so you may need to do some extra configuration to get server job sorts finely
tuned.

In parallel jobs the sort stage has a couple of extra benefits: it can remove duplicates
and it can partition. It is a good stage to put at the front of a job. I will talk more
about parallel sorting in a future article.

The Wrap
For parallel jobs I would advise using the parallel sort stage wherever possible due
to its very fast performance and partitioning. Database sorts are possible as long as
your databases can handle the extra load.

For server jobs all types of sorts should be evaluated as the server job sort can be
very slow. The database sort and the operating system sort are both good options.
CoSort offers the best performance for very large text files. I have not heard of any
plans for a parallel CoSort plugin so for now I would only use it with server jobs.

If you already have CoSort and Enterprise Edition you may be able to call CoSort
from within the parallel Sequential File stage. Let me know if any of you have had
success combining these two products.

DataStage tip for beginners: developer short cuts


Vincent McBurney(Consultant, Solution Architect) Posted 2/21/2006
Comments (5) | Trackbacks (0)

People who have been using a tool for a long time learn some shortcuts. I've done a
brain dump of all the DataStage shortcuts I can remember.

Feel free to add comments with your own shortcuts.

Import Export
* When you do an export, cut and paste the export file name. When you go to your
project and run an import, paste the file name instead of having to browse for it.
While export and import independently remember the last file name used, they do not
share that name with each other.
* When you switch export type between category and individual job it is quick to
switch the type, close the export form and open it again. That way the job name or
category you have highlighted will be automatically picked.
There is an Export option to export by individual job name or export by category
name. This is on the second tab in the export form. Often when you go to export
something it is on the wrong option, eg. you want a job but it is showing the
category. You switch from category export to individual job export but back on tab 1
your job is still not highlighted.
* When you do an export there is a "View" button; click this to open the export file
and run any type of search and replace on job parameter values when moving
between dev and test.
* If you want to export several jobs that are not in the same category use the append
option. Highlight and export the first job. Close the export window, find and highlight
the second job, and in the export form click the "Append" option to add it to the file.
Continue until all jobs have been selected and exported.
* On the Options tab there is a "Referenced shared containers" check box to include
any referenced shared containers in the export as well.

Things you could easily miss


You could use DataStage for months and not see some of these time savers; I've done
this myself.
* There is an automap button in a lot of stages, especially the transformers, which
maps fields with the same names.
* When you add a shared container into your job you need to map the columns of the
container to your job link. What you might miss is the extra option you get on the
Columns tab "Load" button. In addition to the normal column load you get "Load from
Container" which is a quick way to load the container metadata into your job.
* Don't create a job from an empty canvas. Always copy and use an existing job.
Don't create shared containers from a blank canvas, always build and test a full job
and then turn part of it into a container.
* If you want to copy and paste settings between jobs, for example database login
values or transformer functions, open each job in a separate Designer session. Most
property windows in DataStage are modal and you can only have one property
window open per Designer session, by opening two Designers you can have two
property windows open at the same time and copy or compare them more easily.
* You can load metadata into a stage by using the "Load" button on the column tab
or by dragging and dropping a table definition from the Designer repository window
onto a link in your job. For sequential file stages the drag and drop is faster as it
loads both the column names and the format values in one go. If you used the load
button you would need to load the column names and then the format details
separately.
* Can't get a Modify function or Transformer function working correctly? Trial and
error is often the only way to work out the syntax of a function. If you do this in a
large and complex job it can be time consuming to debug due to job startup times.
Consider having a couple of test jobs in your dev project with a row generator, a modify
or transformer stage and a peek stage. Have a column of each type in this test job.
Use this throughout your project as a quick way to test a function or conversion.
* You can put job parameters into stage properties text boxes, eg.
#filedir#/#filename# but you may not know that you can put macros into property
text boxes. #filedir#/#filename#_#DSJobName#. In this example the first two are
job parameters and the third value is not, it's a DataStage macro.

Sequence Jobs
My most annoying Sequence job "feature" is the constant need to enter job
parameters over and over and over again. If you have 10-20 parameters per job (as I
normally do) it becomes very repetitive and is open to manual coding errors.
Under version 7.1 and earlier you could copy and paste a job activity stage, change
the job name and retain most of the parameter settings. Under 7.5.x when you
change the job name all the parameter settings get wiped.

You need to set the parameters for every flippin job activity stage, even though they
are likely to have the same or similar parameter lists and settings. A faster way is to
do the job renaming in an export file.

* In an empty sequence job add your first job activity stage and set all parameter
values or copy one in from an existing job.
* Copy and paste as many copies of this job activity as you need for your sequence.
* Close the sequence job, export it and click the View button.
* Open the sequence job again; you need it open so you can see the stage names.
* Copy the name of the last job activity stage and search for it in the export
file. When the cursor is on that part of the export file, search and replace the old job
name with the new job name. Make sure you only replace from there to the bottom of
the file; most text editors should have this option. This will rename the job of the last
activity stage.
* Repeat this for the second last job activity, then the third last etc until you have
replaced all job names back to the second job activity stage.
* Import the job into your project.

This should give you the same set of job activity stages but with each one pointing at
a different job and with the full set of job parameters set.
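
If you prefer to do that search and replace from the command line, sed can restrict the
replace to everything from a given stage name down to the end of the export file. The
stage name, job names and file names below are only placeholders:

# Replace the job name only from the last job activity stage onwards.
sed '/JobActivity_Load_Customers/,$ s/LoadAccounts/LoadCustomers/' \
    seq_export.dsx > seq_export_new.dsx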

Debugging
* In parallel jobs I use the copy stage a lot to debug errors that do not indicate which
stage caused the error. I create a copy of the job and start removing output stages,
replacing them with a copy stage. I progressively remove stages until I locate the one
with the error.

101 uses for ETL job parameters


Vincent McBurney(Consultant, Solution Architect) Posted 4/7/2006
Comments (3) | Trackbacks (0)

This is my third post in what I am now calling job parameter week. In future it will be
better if I come up with an idea for a theme week at the start of the week instead of
the end of the week. I'll chalk that one up to experience.

Welcome to job parameter theme week, where we bring job parameters to life. (Note
to the marketing department: please provide better mottos for future theme weeks.) I
have tracked back to my previous two job parameter posts from this week. We
looked at Project Specific Job Parameters on Monday and saw how they made
management of parameters a lot easier, and on Wednesday we looked at the exposure
of unencrypted passwords by the NSW police.

Today I am going to talk about what I use job parameters for.

ETL job parameters are values you can change when you run the job, as opposed to
properties you can only change by modifying and recompiling the job. I don't know if I
am going to get to 101 but thought it was an eye-catching title.

Static Parameters
In programmer speak you could also call these constants. They are values you never
expect to change. So why make them a job parameter? Because they might change,
or because you don't want to have to try and remember the value every time you use
it in a job.

In my current project I have a static parameter called MAX_DATE which holds a high
end date value of 2999-12-31 23:59:00. I use this on tables that track variant data,
such as a slowly changing dimension type 2 table, where it provides a high end date
for an open or current item. This parameter gets copied into all jobs that update
dimension tables. The good thing about having it around is that I don't need to
remember the value of the field, I just enter MAX_DATE into my transformer.

Another good thing is that come July of 2999 when a project team is assembled to
address Y3K problems they only have to update this date in one place and will not
have to recompile jobs.

Slowly Changing Parameters


The most common example is the database password that **should** be changed
every month. At the current project the DBA team change the password on the
database and the DataStage support team change the project specific environment
variable. I would prefer to see IBM provide a tool that lets users maintain these
environment variables without having to go through so many Administrator screens
so they can change passwords more easily.

Environmental Parameters
These are the job parameters that change value as you move from one environment
to another, for example from development into testing and into production. Typical
values that change are directory locations and database login details.

The Customer Keeps Changing Their Mind Parameters


On a previous project I had a QualityStage plugin cleansing names and addresses for
delivery of marketing campaigns to mail houses. During testing there were some
doubts about company name cleansing and name cleansing. Sometimes the
cleansed name was worse than the raw name.

There was much umming and aahing in meetings as to whether to turn it on. I created
a set of job parameter flags that could be set to true or false to turn personal name
cleansing, company name cleansing and address cleansing on and off.
These parameters were used in a transformer that wrote out the final fields and had
original or cleansed fields to choose from. Now in production they can define what
type of cleansing the job will perform from the set of flags for each run.

Dynamic Parameters
These are parameters that get changed with every job run.
- A shell script starts a job with the dsjob command; it can derive a dynamic parameter
value using the Unix scripting language and set it on the dsjob command line.
- A sequence job calls parallel jobs that require a dynamic parameter; the value can
be set in a User Variables stage or in the Job Activity Stage using BASIC commands or
a BASIC routine.

Currently I have two parameters, PROCESS_ID and PROCESS_DATE. The first is a
unique number generated for every execution of a job. It goes into the ODS tables to
indicate which job execution loaded the data so we can roll back loads or identify data
volumes. The PROCESS_DATE is the date of the delta data we are processing as we do
daily loads.

The PROCESS_DATE is retrieved from a database table via a Unix script. The script
uses a simple DB2 command line statement to retrieve the date and then uses the
dsjob command to set the parameter and call a sequence job. It populates
PROCESS_ID with an eight character string representation of the PROCESS_DATE.
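
A stripped-down sketch of that kind of wrapper script might look like this. The database,
control table, project and job names are illustrative only, not the real ones from the
project:

#!/bin/ksh
# Fetch the delta date from a control table (db2 -x suppresses column headings).
db2 connect to DWDB > /dev/null
PROCESS_DATE=`db2 -x "SELECT process_date FROM etl.batch_control"`
db2 connect reset > /dev/null

# Eight character version of the date, e.g. 2006-04-07 becomes 20060407.
PROCESS_ID=`echo $PROCESS_DATE | tr -d ' -'`

# Start the sequence job with both values and wait for it to finish.
dsjob -run -mode NORMAL \
      -param PROCESS_DATE="$PROCESS_DATE" \
      -param PROCESS_ID="$PROCESS_ID" \
      -wait -jobstatus dwh_project seq_daily_load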

The Sequence job uses a User Variable stage to append a unique 2 character
sequence job code to the PROCESS_ID to make it unique for this sequence job. This
User Variable is then used in each job activity stage with an additional 2 character
code added that is unique for each job activity stage within the sequence job.
Therefore each parallel or server job across the entire project gets a unique
PROCESS_ID each time they run.

Running DataStage from outside of DataStage


Vincent McBurney(Consultant, Solution Architect) Posted 4/11/2006
Comments (4) | Trackbacks (0)

This is a follow-up to comments on my parameter week post on 101 uses of job
parameters. This post is about calling DataStage jobs and the range of job control
options.

The go-to command for interacting with DataStage from the command line, from
scripts or from other products is the dsjob command. The documentation for dsjob is
buried in the Server Job Developer's Guide; it is cunningly placed there to keep
Enterprise users, who would never think to read the Server Edition guide, in a state of
perpetual bewilderment.

I was born in a state of bewilderment so I am in my zone.

I am not going to go into the job control API or mobile device job control; refer to your
documentation for those options! I will cover the more commonly used methods.

Sequence Jobs and the DataStage Director


The easiest out-of-the-box job control comes from the DataStage Director product
and the Sequence Job. The Sequence job puts jobs in the right order and passes them
all a consistent set of job parameters. The DataStage Director runs the Sequence
job according to the defined schedule and lets the user set the job parameters at run
time.

A lot of additional stages within the Sequence Job provide dynamic parameter
setting, after-job notification, conditional triggers to control job flow, looping, waiting
for files and access to the DataStage BASIC programming language.

Third Party Scheduling and Scripting


DataStage comes with a scheduling tool, the Director. It provides a front end for
viewing jobs, running jobs and looking at job log results. Under the covers it adds
scheduled jobs to the operating system scheduler. The main advantage of it over a
third party scheduling tool is the job run options screen that lets you enter job
parameter values when you schedule the job.

In third party scheduling tools you need to set job parameters as you run the job in
some type of scripting language. Jobs are executed by scheduling tools using the
dsjob command. This command can require a lot of arguments so it is often run via a
script or batch file.
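
To give a feel for how many arguments a bare dsjob call can need, a single run started
from a scheduler might look something like this (the project, job and parameter names
are invented for the example):

#!/bin/ksh
# One scheduled run; the scheduler only sees this script's exit code.
dsjob -run -mode NORMAL \
      -param SRC_DIR=/data/landing \
      -param PROCESS_DATE=2006-07-29 \
      -warn 50 \
      -wait -jobstatus etl_project seq_nightly_batch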

The mother of all DataStage run scripts can be found in this dsxchange thread.
Written by Ken Bland and Steve Boyce it will start jobs, set run time parameters from
a parameter ini file, check the status of finished jobs, service your car, solve the Da
Vinci code and run an audit process after the job has finished.

This script is run from a scheduling tool to make the setup of the scheduling easier.

The mother of all job run scripts sets parameters that are saved in an ini file.
Parameters can also be saved in a database table, with a job extracting the settings
to an ini file before a batch run.

They can also be stored as environment parameters in a user's .profile file. These
environment parameters can be passed into the job via a script or they can be
accessed directly in the job by adding environment job parameters and setting the
value to the magic word $ENV.
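
A minimal sketch of that approach, with example names only: a couple of values
exported in the DataStage user's .profile, which a job can then pick up by adding
environment variable job parameters whose default value is set to $ENV.

# In the DataStage user's .profile - example names only.
export ETL_SRC_DIR=/data/landing
export ETL_DB_USER=etl_batch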

They can also be stored as project specific environment parameters as we saw during
the exhilarating job parameter week, where we brought job parameters to life and
struggled to come up with a good theme motto. These job parameters are much like
environment parameters but use the magic word $PROJDEF.

Job Control Code and Old School DataStage


Old school DataStage programmers, those who know who Ardent are and remember
the days when you only needed one Developer Guide, will be accomplished at writing
job control code. This uses a BASIC programming language based on the Universe
database code to prepare, execute and audit jobs.

The DataStage BASIC language has better access to jobs than operating system
scripts. While an external script has to do everything through the dsjob and dsadmin
commands the BASIC language has access to a much larger number of DataStage
commands. Like dsjob these commands are cunningly hidden in the Server Job
Developers Guide.

Before the days of sequence jobs (DataStage 5?), and before sequence jobs became
quite useful in version 7.5, this job control code was far more prevalent and easier to
code than job control in external scripts. It was extremely useful at putting jobs in the
right sequence, retrieving job parameters from files, checking the results of jobs and
shelling out to execute operating system commands.

Job control code is still widely used even when external scripts or sequence jobs are
in use. It fills gaps in functionality by providing job auditing, setting dynamically
calculated parameter values, checking for files etc etc etc. It is a very powerful
language.

Also from the dsxchange forum we can find examples of job control code. This time
from Arnd:
* Server routine GetJobParameter(ParameterName): returns, in Ans, the value of the
* named parameter read from a text file of NAME=VALUE lines.
EQUATE ProgramName TO 'GetJobParameter'
OPENSEQ 'ParameterFile' TO InFilePtr ELSE CALL DSLogFatal('Oh No, cannot open file', ProgramName)
Finished = 0
Ans = ''
* Prime the loop with the first line of the file.
READSEQ InRecord FROM InFilePtr ELSE Finished = 1
LOOP UNTIL Finished DO
   * Each line is of the form NAME=VALUE; split on the first '='.
   FileParameterName = TRIM(FIELD(InRecord, '=', 1))
   FileParameterValue = TRIM(FIELD(InRecord, '=', 2, 99))
   IF FileParameterName = ParameterName THEN
      * Found the parameter; keep its value and drop out of the loop.
      Finished = 1
      Ans = FileParameterValue
   END
   READSEQ InRecord FROM InFilePtr ELSE Finished = 1
REPEAT
IF NOT(Ans) THEN CALL DSLogFatal('Could not find value for "':ParameterName:'".', ProgramName)
CLOSESEQ InFilePtr

What are you comfortable with?


People from a Unix background are most comfortable with Unix scheduling
tools, .profile environment parameters and running and auditing of jobs from within
Unix scripts using the dsjob command.

People from database backgrounds like to have parameters in database tables and may
even put an entire job schedule into a table with dependencies and sequencing. They
need a bridge between the database and DataStage so they still need a layer of
either Unix scripts or job control code to run the jobs.

People from programming backgrounds will be very comfortable with the DataStage
BASIC programming language and find it can do just about anything regarding the
starting, stopping and auditing of jobs. They can retrieve settings and parameters
from files or databases.

The method I currently prefer is Sequence Jobs for all job dependencies, project
specific environment variables for most slowly changing job parameters, some job
control routines for job auditing and dynamic parameters, and external operating
system commands plus a dsjob script for starting Sequence Jobs from a third party
scheduling tool or from the command line.
What I like about project specific environment parameters is that the job can be
called up from anywhere without requiring any parameter settings. It can be called
up from within the Designer by developers, from ad hoc testing scripts by testers and
from third party scheduling tools in production.

DataStage tip for beginners - avoiding Stage Amnesia


Vincent McBurney(Consultant, Solution Architect) Posted 12/11/2005
Comments (0) | Trackbacks (0)

One of the most frustrating tasks for new DataStage users is adding a new stage
between two existing stages and discovering one of your stages has amnesia.

Amnesia isn't confined to soap opera characters and fraudulent business figures.


Stage Amnesia happens when a stage loses a link and gets attached to a different
link and forgets all the old link properties. This happens most commonly when you try
to add a stage between two existing stages.

For a DataStage novice adding a stage involves adding a new stage, moving the
existing link onto that stage and adding a new link. Let me represent old stages as O
and the new stage as N:
O--N--O

The problem is one of your stages loses some properties.


:(--:)--:)

The :( stage was disconnected from the link and got a new link. This means output
link properties in that stage went back to defaults. If it is a database stage then all
database login and SQL settings disappeared. If it's a sequential file the file name
disappeared. For a transformer stage you can lose a lot of time consuming derivation
and mapping settings.

Here is a better way to insert a stage. The trick is to make a copy of the link so you
get two sets of link properties instead of one. Since you cannot copy a link by itself
you have to copy the stage and all attached links also get copied:
1) O---O Start with the existing stages and their link.
2) ---O Cut the start stage to the clipboard.
3) N---O Add the new stage.
4) O--- N---O Paste to get the start stage back.
5) O--- N---O Rename one of the links to avoid duplicate link names.
6) O---N---O Hook up the links.

Now you have an input and output stage with properties set. You just need to
configure the middle stage.

If you have a long chain of stages you will need to disconnect a link before
cutting and pasting:
O---O---O---O---O
1) O---O---O ---O---O Detach one side of an adjoining stage.
2) O---O--- ---O---O Cut that stage.
3) O---O---N ---O---O Add the new stage.
4) O---O---N ---O ---O---O Paste the stage and rename the link.
5) O---O---N---O---O---O Hook it all up.
