DataStage
2. Link the stage icons, and name the stages and links.
Remove unused columns from transforms. This does not apply to columns in
sequential files or output to hash files.
Active Stages
Eliminate unused columns.
Eliminate unused references.
In derivations, if possible, instead of calling routines, move the code into the
derivation. This eliminates the overhead of the procedure call.
Input Links
Use ODBC stage to access relational tables.
Output links
Use OCI plugin stages to access relational tables if available.
Adjust the rows per transaction setting. Try 1000, 5000, or 10,000.
If output rows are INSERTs or APPENDs, not UPDATEs, consider using a native
bulk loader. Direct output to a sequential file compatible with the bulk loader,
then invoke the bulk loader using an after-job subroutine. The bulk loader for
Oracle is SQLLDR.
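As a rough sketch only (the control, data and log file names and the userid below are
placeholders, not taken from the text above), the after-job call to SQL*Loader might look like:
# all file names and the userid are placeholders
sqlldr userid=etl_user/etl_pwd control=fact_load.ctl data=fact_out.dat log=fact_load.log bad=fact_load.bad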
Reference Lookups
Compare the number of input rows with the number of rows in the referenced
table. If the referenced table is smaller than the number of input rows, pre-load the
reference table into a hashed file and then reference the hashed file.
Consider moving reference lookups to a join within the input stage. All columns
used to join the tables should be indexed to maximize performance.
If the number of rows in a hashed file is small, consider pre-loading the file into
memory by checking the pre-load into memory checkbox in the Hashed File
stage.
Processing fact rows in a multi-step manner exploits the unique capabilities of DataStage
to integrate operating system and database features within a DataStage job.
1. Process fact rows without regard to dimension key values, instead retaining
dimension column values. These dimension column values will later be used to
determine dimension key values.
2. For each dimension table, create a temporary dimension table in your database
whose structure is similar to the dimension table.
3. Populate the temporary dimension tables using the retained dimension column
values from step 1, setting the dimension key column value to NULL.
4. Join the temporary dimension tables with the dimension tables, updating the
dimension key column in each temporary dimension table.
5. For all rows in the temporary dimension tables with a dimension key column
value of NULL, insert the row into its dimension table.
6. Join the temporary dimension tables with the dimension tables for all rows in the
temporary dimension tables with a dimension key column value of NULL,
updating the dimension key column in each temporary dimension table.
7. Create a hash file for each temporary dimension table whose key columns are all
columns other than the dimension key value.
8. Populate the temporary dimension hash files with the rows from the temporary
dimension tables.
9. Process the fact rows created in step 1, performing reference lookups to the
temporary dimension hash files, resolving the dimension key values, and creating
an output file compatible with your database’s bulk loader (e.g. SQLLDR).
10. Execute your database’s bulk loader using the file created in step 9 as input.
The implementation of this multi-step process is simpler than its description. The entire
process can be implemented as three DataStage jobs and two database scripts.
Step 1 of the process is simply a DataStage job with multiple outputs, one for the
fact table, and one for each (temporary) dimension table.
Steps 2 through 6 are implemented as a single database script that is executed as
an after job stage of the step 1 DataStage job.
Steps 7 and 8 represent a single DataStage job with an independent active stage to
load each hash dimension table from the temporary dimension table.
Step 9 is another simple DataStage job with a single input and output, and
multiple reference lookups.
Step 10 is another database script that is executed as an after-job stage of the step 9
DataStage job.
Reference lookups are typically thought of on a per job basis. It is assumed that good
DataStage development practices are being used, utilizing hash files for reference
lookups, or where possible, performing joins in the input stage of a job to offload this
burden onto the database server.
In most cases, this is satisfactory. However, large and/or time critical ETL applications
may find it advantageous to revisit the topic of reference lookups, and apply a more
global view. In this global view, four factors should be considered:
1. Data Source
2. Number of rows in source
3. Total number of rows retrieved
4. Total number of unique rows retrieved
When a DataStage job runs, it creates one or more phantom processes on your DataStage
server. Generally, one phantom process for the job, and one for each active stage within
the job. Each phantom process has its own log file that records information about the
process's execution. This information may be useful for debugging problems.
Log files are created in the folder &PH&. A &PH& folder exists in each DataStage
project folder. A DataStage job phantom will create a log file in the &PH& folder named
with the prefix DSD.RUN_, a DataStage active stage will create a log file in the &PH&
folder named with the prefix DSD.StageRun_. All log files end with a time and date
suffix. The time is seconds since midnight, and the date is a Universe Julian date. These
dates and times are usually close to those found in DataStage Director on the Control
event Starting job ...
A useful tool is to create a routine in DataStage Manager that will suggest a log file
name. The source of this routine is:
Ans = "DSD.RUN_":Iconv(TheTime,"MTS"):"_":Iconv(TheDate,"D-
YMD[4,2,2]")
TheDate and TheTime are the routine arguments. Get the job's run date and time from
DataStage Director, then use the test button for this routine in DataStage Manager to
compute a suggested log file name.
Another useful piece of information is the DataStage job's number. An easy way to find a
job number is with DataStage Administrator. In DataStage Administrator, select a project
on the Projects tab, then press the Command button. Enter the command:
Press the Execute button, and your job's name and number will be displayed.
Armed with a job's name, run date, and run time, and a suggested log file name, finding
the actual log file is based on the operating system of your DataStage server. The
following steps assume a workstation with capabilities similar to Windows NT and My
Computer or Explorer.
To be determined.
cd \&PH\&
3. Using find and grep commands, locate the job's log files (a sketch follows these steps):
4. Use the view command to review each file. To exit the view command, enter :q.
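For example, on a Unix DataStage server the steps above might look like the following
sketch (the project path, job name and the time/date suffix are placeholders):
cd /your/project/path/\&PH\&
# list phantom logs whose suffix matches the name suggested by the routine (made-up values)
ls | grep "DSD.RUN_46800_14124"
# or search the log contents for the job name, assuming the name appears in the log
grep -l "MyJobName" DSD.RUN_* DSD.StageRun_*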
When passing dates to the Pivot Stage, they should be in internal format.
When passing dates to the Informix CLI Stage, they should be in internal format.
Dates and timestamps to and from the ODBC stage should be in external format,
YYYY-MM-DD or YYYY-MM-DD hh:mm:ss.sss respectively.
In my opinion, DataStage is a great ETL tool, but any tool is only as good as its
support. Ascential Software provides great customer support.
Seriously consider using a disk defragment utility on NT DataStage Servers.
When installing DataStage, plan the location of the project and UVTEMP
directories.
When you create a routine in DataStage, it is cataloged in Universe with the
prefix "DSU." followed by the routine name.
Output to a hash file writes a row (a Universe write). It does not update individual
columns (a Universe writev). Non-key columns are written as sequential hash file
fields (e.g. the first non-key column is field 1, the second is field 2, ...).
The TCL command "CONFIG ALL" displays the current value of all uvconfig
parameters.
The TCL command "ANALYZE.SHM -d" displays the contents of the Universe
dynamic file descriptor table.
The &PH& folder fills over time. You may notice that the time between when a
job says it is finishing, and when it actually ends, increases. This may be a
symptom of a too full &PH& folder. To correct this, periodically delete the files
in the &PH& folder. One way to do this is in DataStage Administrator, select the
projects tab, click your project, then press the Command button, enter the
command CLEAR.FILE &PH&, and press the execute button. Another way is
to create a job with the command: EXECUTE "CLEAR.FILE &PH&" on the
job control tab of the job properties window. You may want to schedule this job
to run weekly, but at a point in your production cycle where it will not delete data
critical to debugging a problem. &PH& is a project level folder, so this job should
be created and scheduled in each project.
http://www.anotheritco.com/tools/download_list_routines.htm
Just wanted to know what is the difference between a filter stage and switch stage .
-Filter stage can have any number of output links whereas Switch stage is limited to a
max of 128 links.
-Filter stage can optionally have a reject link. Switch stage requires a reject link.
-A switch stage is like the C switch statement. It goes through all the cases and if no
cases are met, it goes for the default value, which is specified by the reject link.
THE major difference is that the Switch stage operates on values, while the Filter stage
operates on WHERE expressions, to determine which rows are sent along each output link.
In Filter, a single input record can pass thru one or many output links, as long as the record
satisfies each link's criteria
In Switch, a single input record can ONLY pass thru one of the many output links.
Code:
Trim(Arg1[Index(Arg1,"V-",1),4])
Here's how the data looks. The string has different formats namely
2. "24-Valve Al V-8"
3. "3.5 L V6 24-Valve"
4. "24-Valve Al V12"
5. "5.5 L Twin-Spark"
I want to populate "V-?" (V-numeric say V-6 or V-8 or V-12) I want to populate
"Null" for 1st and 5th formats.
If I use the following logic I m unable to populate correct value for the 3rd format
as the "V6" occurrence is 1.
[code]
If Index(DSLink2.ENGINE_DESCRIPTION,"V-",2) Then
    Right(DSLink2.ENGINE_DESCRIPTION,4)
Else (If Index(DSLink2.ENGINE_DESCRIPTION,"V",2) Then
    Right(DSLink2.ENGINE_DESCRIPTION,3)
Else 'Null')
[/code]
YearFromDate(CurrentDate())-YearFromDate(DOB)
http://www.anotheritco.com/dswp.htm
What do you need to know to follow along? This was originally written as a second class
in UNIX. The first class taught how to use the basic UNIX commands (like sed, grep and
find) and this class teaches how to combine these tools to accomplish bigger tasks.
In addition to the material in this course you might be interested in the Korn shell (ksh)
and the Bourne again shell (bash), both of which are excellent shells that enhance the
original Bourne shell. These alternate shells are upwardly-compatible with the Bourne
shell, meaning that a script written for sh can run in ksh or bash. However, there are
additional features in bash and ksh that are not available in the Bourne shell.
The focus of this guide is to get you to understand and run some Bourne shell scripts. On
several pages there are example scripts for you to run. On most of these pages there is a
link you can click on (with the right mouse button) and download the script to your
computer and run it.
For example, I recently wrote a script to make a backup of one of the subdirectories
where I was developing a project. I quickly wrote a shell script that uses /bin/tar to create
an archive of the entire subdirectory and then copy it to one of our backup systems at my
computer center and store it under a subdirectory named according to today's date.
As another example, I have some software that runs on UNIX that I distribute and people
were having trouble unpacking the software and getting it running. I designed and wrote
a shell script that automated the process of unpacking the software and configuring it.
Now people can get and install the software without having to contact me for help, which
is good for them and good for me, too!
For shell script experts one of the things to consider is whether to use the Bourne shell (or
ksh or bash), the C shell, or a richer scripting language like perl or python. I like all these
tools and am not especially biased toward any one of them. The best thing is to use the
right tool for each job. If all you need to do is run some UNIX commands over and over
again, use a Bourne or C shell script. If you need a script that does a lot of arithmetic or
string manipulation, then you will be better off with perl or python. If you have a Bourne
shell script that runs too slowly then you might want to rewrite it in perl or python
because they can be much faster.
Historically, people have been biased toward the Bourne shell over the C shell because in
the early days the C shell was buggy. These problems are fixed in many C shell
implementations these days, especially the excellent 'T' C shell (tcsh), but many still
prefer the Bourne shell.
There are other good shells available. I don't mean to neglect them but rather to talk about
the tools I am familiar with.
If you are interested also in learning about programming in the C shell I also have a
comparison between features of the C shell and Bourne shell.
Some of the tools of the trade are variables, backquotes and pipes. First we'll study these
topics and also quickly review a few other UNIX topics.
Variables
Topics covered: storing strings in variables
Utilities covered: echo, expr
To try the commands below start up a Bourne shell:
/bin/sh
A variable stores a string (try running these commands in a Bourne shell)
name="John Doe"
echo $name
The quotes are required in the example above because the string contains a special
character (the space)
A variable may store a number
num=137
The shell stores this as a string even though it appears to be a number
A few UNIX utilities will convert this string into a number to perform arithmetic
expr $num + 3
Try defining num as '7m8' and try the expr command again
What happens when num is not a valid number?
Now you may exit the Bourne shell with
exit
I/O Redirection
Topics covered: specifying the input or capturing the output of a command in a
file
Utilities covered: wc, sort
The wc command counts the number of lines, words, and characters in a file
wc /etc/passwd
wc -l /etc/passwd
You can save the output of wc (or any other command) with output redirection
wc /etc/passwd > wc.file
You can specify the input with input redirection
wc < /etc/passwd
Many UNIX commands allow you to specify the input file by name or by input
redirection
sort /etc/passwd
sort < /etc/passwd
You can also append lines to the end of an existing file with output redirection
wc -l /etc/passwd >> wc.file
Backquotes
Topics covered: capturing output of a command in a variable
Utilities covered: date
The backquote character looks like the single quote or apostrophe, but slants the
other way
It is used to capture the output of a UNIX utility
A command in backquotes is executed and then replaced by the output of the
command
Execute these commands
date
save_date=`date`
echo The date is $save_date
Notice how echo prints the output of 'date', and gives the time when you defined
the save_date variable
Store the following in a file named backquotes.sh and execute it (right click and
save in a file)
#!/bin/sh
# Illustrates using backquotes
# Output of 'date' stored in a variable
Today="`date`"
echo Today is $Today
Execute the script with
sh backquotes.sh
The example above shows you how you can write commands into a file and
execute the file with a Bourne shell
Backquotes are very useful, but be aware that they slow down a script if you use
them hundreds of times
You can save the output of any command with backquotes, but be aware that the
results will be reformatted into one line. Try this:
LS=`ls -l`
echo $LS
Pipes
Topics covered: using UNIX pipes
Utilities covered: sort, cat, head
Pipes are used for post-processing data
One UNIX command prints results to the standard output (usually the screen), and
another command reads that data and processes it
sort /etc/passwd | head -5
Notice that this pipe can be simplified
cat /etc/passwd | head -5
You could accomplish the same thing more efficiently with either of the two
commands:
head -5 /etc/passwd
head -5 < /etc/passwd
For example, this command displays all the files in the current directory sorted by
file size
ls -al | sort -n -r +4
The command ls -al writes the file size in the fifth column, which is why we skip
the first four columns using +4.
The options -n and -r request a numeric sort (which is different than the normal
alphabetic sort) in reverse order
awk
Topics covered: processing columnar data
Utilities covered: awk
The awk utility is used for processing columns of data
A simple example shows how to extract column 5 (the file size) from the output
of ls -l
ls -l | awk '{print $5}'
Cut and paste this line into a Bourne shell and you should see a column of file
sizes, one per file in your current directory.
A more complicated example shows how to sum the file sizes and print the result
at the end of the awk run
ls -al | awk '{sum = sum + $5} END {print sum}'
In this example you should see printed just one number, which is the sum of the
file sizes in the current directory.
Section 2: Storing Frequently Used Commands in Files:
Shell Scripts
Shell Scripts
Topics covered: storing commands in a file and executing the file
Utilities covered: date, cal, last (shows who has logged in recently)
Store the following in a file named simple.sh and execute it
#!/bin/sh
# Show some useful info at the start of the day
date
echo Good morning $USER
cal
last | head -6
Shows the current date, a calendar, and the six most recent logins
Notice that the commands themselves are not displayed, only the results
To display the commands verbatim as they run, execute with
sh -v simple.sh
Another way to display the commands as they run is with -x
sh -x simple.sh
What is the difference between -v and -x? Notice that with -v you see '$USER' but
with -x you see your login name
Run the command 'echo $USER' at your terminal prompt and see that the variable
$USER stores your login name
With -v or -x (or both) you can easily relate any error message that may appear to
the command that generated it
When an error occurs in a script, the script continues executing at the next
command
Verify this by changing 'cal' to 'caal' to cause an error, and then run the script
again
Run the script containing 'caal' with 'sh -v simple.sh' and with 'sh -x simple.sh' and verify
that the error message comes from the 'caal' command
Other standard variable names include: $HOME, $PATH, $PRINTER. Use echo
to examine the values of these variables
Performing Arithmetic
Topics covered: integer arithmetic, preceding '*' with backslash to avoid file
name wildcard expansion
Utilities covered: expr
Arithmetic is done with expr
expr 5 + 7
expr 5 \* 7
Backslash required in front of '*' since it is a filename wildcard and would be
translated by the shell into a list of file names
You can save an arithmetic result in a variable
Store the following in a file named arith.sh and execute it
#!/bin/sh
# Perform some arithmetic
x=24
y=4
Result=`expr $x \* $y`
echo "$x times $y is $Result"
Translating Characters
Topics covered: converting one character to another, translating and saving string
stored in a variable
Utilities covered: tr
Copy the file sdsc.txt to your home directory
The utility tr translates characters
tr 'a' 'Z' < sdsc.txt
This example shows how to translate the contents of a variable and display the
result on the screen with tr
Store the following in a file named tr1.sh and execute it
#!/bin/sh
# Translate the contents of a variable
Cat_name="Piewacket"
echo $Cat_name | tr 'a' 'i'
This example shows how to change the contents of a variable
Store the following in a file named tr2.sh and execute it
#!/bin/sh
# Illustrates how to change the contents of a variable with tr
Cat_name="Piewacket"
echo "Cat_name is $Cat_name"
Cat_name=`echo $Cat_name | tr 'a' 'i'`
echo "Cat_name has changed to $Cat_name"
You can also specify ranges of characters.
This example converts upper case to lower case
tr 'A-Z' 'a-z' < file
Now you can change the value of the variable and your script has access to the
new value
If Blocks
Topics covered: testing conditions, executing commands conditionally
Utilities covered: test (used by if to evaluate conditions)
This will be covered on the whiteboard
See Chapter 8 of the book
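In the meantime, here is a minimal sketch of an if block using the test command (the
file and the threshold are arbitrary). Store the following in a file named if1.sh and execute it
#!/bin/sh
# Minimal if/else example using test ([ ])
count=`wc -l < /etc/passwd`
if [ $count -gt 10 ] ; then
    echo "/etc/passwd has more than 10 lines"
else
    echo "/etc/passwd has 10 lines or fewer"
fi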
Regular Expressions
Topics covered: search patterns for editors, grep, sed
Utilities covered: no new utilities
Zero or more characters: .*
grep 'provided.*access' sdsc.txt
sed -e 's/provided.*access/provided access/' sdsc.txt
Search for text at beginning of line
grep '^the' sdsc.txt
Search for text at the end of line
grep 'of$' sdsc.txt
Asterisk means zero or more of the preceding character
a* zero or more a's
aa* one or more a's
aaa* two or more a's
Delete all spaces at the ends of lines
sed -e 's/ *$//' sdsc.txt > sdsc.txt.new
Turn each line into a shell comment
sed -e 's/^/# /' sdsc.txt
Define a Function
Define a function
echo_it () {
echo "In function echo_it"
}
Use it like any other command
echo_it
Put these four lines in a script and execute it
Function Arguments
Functions can have command-line arguments
echo_it () {
echo "Argument 1 is $1"
echo "Argument 2 is $2"
}
echo_it arg1 arg2
When you execute the script above, you should see
Argument 1 is arg1
Argument 2 is arg2
Create a script 'difference.sh' with the following lines:
#!/bin/sh
echo_it () {
echo Function argument 1 is $1
}
echo Script argument 1 is $1
echo_it Barney
Execute this script using
sh difference.sh Fred
Notice that '$1' is echoed twice with different values
The function has separate command-line arguments from the script's
Functions in Pipes
Can use a function in a pipe
ls_sorter () {
sort -n +4
}
ls -al | ls_sorter
Function in pipe executed in new shell
New variables forgotten when function exits
Inherited Variables
Variables defined before calling script available to script
func_y () {
echo "A is $A"
return 7
}
A='bub'
func_y
if [ $? -eq 7 ] ; then ...
Try it: is a variable defined inside a function available to the main program?
Libraries of Functions
Common to store definitions of favorite functions in a file
Then execute file with
. file
Period command executes file in current shell
Compare to C shell's source command
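A small sketch (the library file name mylib.sh and its function are only illustrative):
# Contents of a hypothetical library file named mylib.sh
greet () {
    echo "Hello from the function library"
}
Load it into the current shell and call the function with
. mylib.sh
greet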
Section 7: Miscellaneous
Here Files
Data contained within script
cat << END
This script backs up the directory
named as the first command-line argument,
which in your case is $1.
END
Terminator string must begin in column one
Variables and backquotes translated in data
Turn off translation with \END
Common Signals
Common signals are
o SIGINT sent to the foreground process by ^C
o SIGHUP sent when a modem line gets hung up
o SIGTERM sent by a plain kill command (kill -9 sends the uncatchable SIGKILL)
Signals have numeric equivalents
2 SIGINT
15 SIGTERM
9 SIGKILL
Send a Signal
Send a signal to a process
kill -2 PID
kill -INT PID
Trap Signals
Handling Signals
trap "echo Interrupted; exit 2" 2
Ignoring Signals
trap "" 2 3
Restoring Default Handler
trap 2
User Signals
SIGUSR1, SIGUSR2 are for your use
Send to a process with
kill -USR1 PID
Default action is to terminate process
Command Translation
Common translations include
o Splitting at spaces, obey quotes
o $HOME -> /users/us/freddy
o `command` -> output of command
o I/O redirection
o File name wildcard expansion
Combinations of quotes and metacharacters confusing
Resolve problems by understanding order of translations
Experiment With Translation
Try wildcards in echo command
echo b*
b budget bzzzzz
b* translated by sh before echo runs
When echo runs it sees
echo b budget bzzzzz
Echo command need not understand wildcards!
Order of Translations
Splits into words at spaces and tabs
Divides commands at
; & | && || (...) {...}
Echos command if -v
Interprets quotes
Performs variable substitution
Exceptional Case
Delayed expansion for variable assignments
VAR=b*
echo $VAR
b b_file
Wildcard re-expanded for each echo
Examples (continued)
Delayed expansion of wildcards in variable assignment
Output of this echo command changes when directory contents change (* is re-
evaluated each time the command is run)
x=*
echo $x
Can view values stored in variables with
set
Try it: verify that the wildcard is stored in x without expansion
Examples (continued)
Wildcards expanded after redirection (assuming file* matches exactly one file):
cat < file*
file*: No such file or directory
Command in backquotes expanded fully (and before I/O redirection)
cat < `echo file*`
(contents of file sent to screen)
Eval Command
Forces an extra evaluation of command
eval cat \< file*
(contents of matching file)
Backslash delays translation of < until second translation
While loops
Execute statements while a condition is true
i=0
while [ $i -lt 10 ]
do
echo I is $i
i=`expr $i + 1`
done
Until loops
Execute statements as long as a condition is false
until grep "sort" dbase_log > /dev/null
do
sleep 10
done
echo "Database has been sorted"
Example executes until grep is unsuccessful
Redirection of Loops
Can redirect output of a loop
for f in *.c
do
wc -l $f
done > loop.out
Loop runs in separate shell
New variables forgotten after loop
Backgrounding OK, too
Continue Command
Used in for, while, and until loops
Skip remaining statements
Return to top of loop
for name in *
do
if [ ! -f $name ] ; then
continue
fi
echo "Found file $name"
done
Example loops over files, skips directories
Break Command
Used in for, while, and until loops
Skip remaining statements
Exit loop
for name in *
do
if [ ! -r $name ] ; then
echo "Cannot read $name, quitting loop"
break
fi
echo "Found file or directory $name"
done
Example loops over files and directories, quits if one is not readable
Case Command
Execute one of several blocks of commands
case "string" in
pattern1)
commands ;;
pattern2)
commands ;;
*) # Default case
commands ;;
esac
Patterns specified with file name wildcards
quit) ...
qu*) ...
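A concrete sketch (the patterns and messages are arbitrary); store it in a file named
case1.sh and run it with an argument:
#!/bin/sh
# Classify the first command-line argument
case "$1" in
    quit)   echo "You asked to quit" ;;
    qu*)    echo "Starts with qu but is not quit" ;;
    [0-9]*) echo "Starts with a digit" ;;
    *)      echo "Default case" ;;
esac
Running 'sh case1.sh quack' prints "Starts with qu but is not quit"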
Infinite Loops
Infinite loop with while
while :
do
...
done
: is no-op, always returns success status
Must use break or exit inside loop for it to terminate
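For example (the flag file name is arbitrary), this loop polls until a flag file appears
and then breaks out:
#!/bin/sh
# Poll for a flag file, then leave the infinite loop
while :
do
    if [ -f /tmp/stop_flag ] ; then
        break
    fi
    sleep 5
done
echo "Flag file found, loop finished"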
Remote Shells
Rsh command
rsh hostname "commands"
Runs commands on remote system
Must have .rhosts set up
Can specify different login name
rsh -l name hostname "commands"
Return Status
Get return status of rsh
rsh mayer "uname -a"
echo $?
Returns 0 if rsh managed to connect to remote host
Returns 1 otherwise
o Invalid hostname
o Permission denied
Temporary Files
Use unique names to avoid clashes
tempfile=$HOME/Weq_$$
command > $tempfile
$$ is PID of current shell
Avoids conflict with concurrent executions of script
Do not use /tmp!
Wait Command
Wait for termination of background job
command &
pid=$!
(other processing)
wait $pid
Allows overlap of two or more operations
Metacharacters
Characters with special meaning to shell
" ' ` $ * [ ] ?
; > < & ( ) \
Avoid special meaning with quoting
echo 'You have $20'
Backslash like single quotes
Applies only to next character
echo You have \$20
Parallel DataStage jobs can have many sources of reference data for lookups
including database tables, sequential files or native datasets. Which is the most
efficient?
This question has popped up several times over on the DSExchange. In DataStage
server jobs the answer is quite simple, local hash files are the fastest method of a key
based lookup, as long as the time taken to build the hash file does not wipe out your
benefits from using it.
In a parallel job there are a very large number of stages that can be used as a
lookup, a much wider variety than in server jobs; this includes most data sources and
the parallel staging formats of datasets and lookup filesets. I have discounted
database lookups, as the overhead of the database connectivity and any network
passage makes them slower than most local storage.
I did a test comparing datasets to sequential files to lookup filesets and increased
row volumes to see how they responded. The test had three jobs, each with a
sequential file input stage and a reference stage writing to a copy stage.
Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18
seconds. No lookup tables were created apart from the existing lookup fileset one.
This indicates the lookup data fit into memory and did not overflow to a resource file.
2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only
three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file
up to 1:32. The cost of partitioning the lookup data is really showing now.
3 million rows
The fileset, still at 45 seconds, swallowed up the extra 1 million rows with ease.
Dataset up to 2:06 and the sequential file up to 2:20.
As a final test I replaced the lookup stage with a join stage and tested the dataset
and sequential file reference links. The dataset join finished in 1:02 and the
sequential file join finished in 1:15. A large join proved faster than a large lookup but
not as fast as a lookup file.
Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant; they
all load up very quickly, even database lookups are fast. If you have very large
lookup files spilling into lookup table resources then the lookup fileset outstrips the
other options. A join becomes a viable option. They are a bit harder to design as you
can only join one source at a time whereas a lookup can join multiple sources.
I usually go with lookups for code to description or code to key type lookups
regardless of the size, I reserve the joins for references that bring back lots of
columns. I will certainly be making more use of the lookup fileset to get more
performance from jobs.
Sparse database lookups, which I didn't test for, are an option if you have a very
large reference table and a small number of input rows.
Introduction
Job parameters should be used in all DataStage server, parallel and sequence jobs to provide
administrators access to changing run time values such as database login details, file
locations and job settings.
One option for maintaining these job parameters is to use project specific environment
variables. These are similar to operating system environment variables but they are set up and
maintained through the DataStage Administrator tool.
There is a blog entry with bitmaps that describes the steps in setting up these variables at
DataStage tip: using job parameters without losing your mind
Steps
To create a new project variable:
There are two types of variables - string and encrypted. If you create an encrypted
environment variable it will appear as the string "*******" in the Administrator tool and will
appear as junk text when saved to the DSParams file or when displayed in a job log. This
provides robust security of the value.
Note that encrypted environment variables are not supported in versions earlier than 7.5.
Migrating Project Specific Job Parameters
It is possible to set or copy job specific environment variables directly to the DSParams file in
the project directory. There is also a DSParams.keep file in this directory and if you make
manual changes to the DSParams file you will find Administrator can roll back those changes
to DSParams.keep. It is possible to copy project specific parameters between projects by
overwriting the DSParams and DSParams.keep files. It may be safer to just replace the User
Defined section of these files and not the General and Parallel sections.
Open up a job.
Go to Job Properties and move to the parameters tab.
Click on the "Add Environment Variables..." button and choose the variable from the
list. Only values set in Administrator will appear. This list will show both the system
variables and the user-defined variables.
Set the Default value of the new parameter to $PROJDEF. If it is an encrypted field
set it to $PROJDEF in both data entry boxes on the encrypted value entry form.
When the job parameter is first created it has a default value the same as the Value entered in
the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the
latest Value for this variable at job run time.
If you have an encrypted environment variable it should also be an encrypted job parameter.
Set the value of these encrypted job parameters to $PROJDEF. You will need to type it in
twice in the password entry box, or better yet cut and paste it into the fields; a spelling
mistake can lead to a connection error message that is not very informative and leads to a long
investigation.
Examples
These job parameters are used just like normal parameters by adding them to stages in your
job enclosed by the # symbol.
Database: parameter #$DW_DB_NAME#, value CUSTDB
Conclusion
These types of job parameters are useful for having a central location for storing all job
parameters that is password protected and supports encryption of passwords. It can be
difficult to migrate between environments. Migrating the entire DSParams file can result in
development environment settings being moved into production and trying to migrate just the
user defined section can result in a corrupt DSParams file. Care must be taken.
Introduction
This HOWTO entry will describe how to identify changed data using a DataStage server or
parallel job. For an overview of other change capture options see the blog on incremental
loads.
The objective of changed data identification is to compare two sets of data with identical or
similar metadata and determine the differences between the two. The two sets of data
represent an existing set of data and a new set of data where the change capture identifies
the modified, added and removed rows in the new set.
Steps
The steps for change capture depend on whether you are using server jobs or parallel jobs or
one of the specialised change data capture products that integrate with DataStage.
These constraints can become very complex to write, especially if there are a lot of fields to
compare. It can also produce slow performance as the constraint needs to run for every row.
Performance improvement can be gained by using the CRC32 function to describe the data
for comparison.
When a row is processed and becomes existing data a CRC32 code is generated and saved
to a lookup along with the primary key of the row. When a new data row comes through a
primary key lookup determines if the row already exists and if it does comparing the CRC32 of
the new row to the existing row determines whether the data has changed.
Read each row as a single long string. Do not specify a valid delimiter.
In a transformer find the key fields using the FIELD command and generate a CRC32
code for the entire record. Output all fields.
In a transformer lookup the existing records using the key fields to join and compare
the new and existing CRC32 codes. Output new and updated records.
The output records have concatenated fields. Either write the output records to a
staging sequential file, where they can be processed by insert and update jobs, or
split the records into individual fields using the row splitter stage.
To concatenate fields together use the Row Merge stage. Then follow the steps
described in the sequential file section above.
WARNING! Because it uses a 32-bit integer, the CRC32 function introduces a finite probability
of false positives; that is, identifying a match when there is in fact no match. The larger the
number of rows processed, the higher the probability of such an error occurring becomes.
The stage operates using the settings for Key and Value fields. Key fields are the fields used
to match a before and after record. Value fields are the fields that are compared to find
modified records. You can explicitly define all key and value fields or use some of the options
such as "All Keys, All Values" or "Explicit Keys, All Values".
The stage outputs a change_code which by default is set to 0 for existing rows that are
unchanged, 1 for deleted rows, 2 for modified rows and 4 for new rows. A filter stage or a
transformer stage can then split the output using the change_code field down different insert,
update and delete paths.
Change Capture can also be performed in a Transformer stage as per the Server Job
instructions.
The CRC32 function is not part of the parallel job install but it is possible to write one as a
custom buildop.
Parallel jobs are useful for increasing the performance of the job, so nowadays most jobs are
built as parallel jobs.
Examples
Conclusion
The change capture stage in parallel jobs and the CRC32 function in server jobs simplify the
process of change capture in an ETL job.
Introduction
This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a surrogate key stage that creates unique IDs, however it is limited in that it
does not support conditional code and it may be more efficient to add a counter to an existing
transformer rather than add a new stage.
In a server job there are a set of key increment routines installed in the routine SDK samples
that offer a more complex counter that remembers values between job executions.
Steps
In a DataStage job the easiest way to create a counter is within the Transformer stage with a
Stage Variable.
svMyCounter = svMyCounter + 1
The counter can be given a seed value by passing a value in as a job parameter and setting
the initial value of svMyCounter to that job parameter.
In a parallel job this simple counter will create duplicate values on each node as the
transformer is split into parallel instances. It can be turned into a unique counter by using
special parallel macros.
Each instance will start at a different number, e.g. -1, -2, -3, -4. When the counter is
incremented, each instance is incremented by the number of partitions, e.g. 4. This gives us a
sequence in instance 1 of 1, 5, 9, 13... Instance 2 is 2, 6, 10, 14... etc.
I know it's not polite dinner conversation, but are you suffering row leakage from
your data integration jobs? Do you have an unpleasant discharge and you don't know
what to do with it?
Row leakage refers to the rows that get dropped out of ETL jobs; row discharge refers to the
row rejects from the ETL jobs that are not properly investigated. Untrapped leakage
and ignored discharge both lead to data quality problems in the ETL targets. In a
future blog I will talk about cleaning up your unpleasant discharge, but in this blog I
will address row leakage.
A reject link is a good option as the rejected record can be saved and investigated
and the reject link count can be used in etl row reporting. Sometimes a row is leaked
and no warning message appears, see the section on black holes below.
This means row leakage messages are often missed, especially in jobs where the row
leakage happens rarely. This is why the parallel message handler was added to turn
warning messages into information messages and deliver "clean" jobs to production
environments.
The most dangerous black holes are the enterprise database output stages. ETL row
by row processing means each row is handled as its own insert or update statement,
with million plus row processing that's quite a lot of DB statements that could go
wrong! Database bulk loads are different from direct inserts or updates: in these loads
the data is exported to a flat file and loaded via a database utility. This means the
error and log messages are not available in the DataStage log but can be found in an
external bulk load log file.
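As a sketch (the log file name and the ORA- prefix check are my assumptions, not from the
text above), an after-job check of an Oracle SQL*Loader log could be as simple as:
grep -i "ORA-" fact_load.log && echo "Bulk load reported Oracle errors"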
I add some special error reporting to this, the reject output of the db stage is passed
to a copy stage and then to a shared container with row propagation turned on. The
shared container retrieves the SQL code field, looks up the error description against a
prepared lookup fileset, aggregates and counts the errors and sends them to a peek
stage. The DataStage log receives a peek message with error code, error description
and error count for failed database statements. I elevate peek messages from
information to warning in all my DataStage projects (including production).
If you have a transformer leading to a db stage, which should happen at least 90% of
the time, you can catch the bounced db failure rows and output them down a reject
link. The Link Variables of the database link hold the SQL error codes and messages
and you can output these values down a reject link.
DATASTAGE
1. What is Job Invocation ID? What are the uses? How to set that.
Ans: Only by setting this Invocation ID can we run multiple
instances of a job.
Using this we can run a job on different days with different parameters.
We set in when we design the job. Check Allow Multiple Instance in job
GUI.
Nested Condition
Unconditional
Otherwise
Conditional- Custom
Run-Activity-On-Exception, Sequencer, Email Notification
Unconditional
6. In how many ways you can build a batch/job sequence. What are they?
Ans: 1. Job Sequence - DS Designer
2. Batch Job Facilities - DS Director
3. Job Control Routine - DS Designer
10. How can you Aggregate data without using Aggregator Stage?
Ans: We do this using ODBC Stage.
13. How to import a COBOL source file? Is there any separate stage to do
this?
Ans: In ODBC stage click Load ->From Menu -> import Cobol Definitions
CFF (Complex Flat File) Plug-in Stage
14. How will you sort the data in DS? Any Separate Stage?
Ans: We do it by SQL override. We also have a Sort stage, which gives us
better performance than the above.
15. What is tracing Level and give the values of it and message.
Ans: Tracing Level controls the information to be included in the job log file:
0 - No stage properties information included in the job log file
1 - Stage properties information included in the job log file
17. What are the parallelisms supported in DS. How will we achieve this?
Ans: SMP, MPP, and Clusters. For this we have to design parallel jobs using
the Parallel Extender.
Whole Project. Click this button if you want all project details to be
updated in the reporting database. If you select Whole Project, other
fields in the dialog box are disabled. This is selected by default.
PART – IV
7. Types of Stages.
Ans: 1) Active Stage: In which Transformation, Aggregation etc are
done.
Ex: Transformer, Aggregator.
2) Passive Stage: In which Read/Write Process is done.
Ex: ODBC, Sequential, Hash File...
Other Classification:
1) Built-in: These are default Stages. Ex: ODBC, Transformer, Sequential
etc.
2) Plug-in: These stages are to be installed separately; they are used
for special tasks that DS did not support previously.
I am working on a design for a ETL job that must accept a sequential file as input. The input file
contains data for a number of "locations", typically 10-15 or so. We do not know the
locations, only that the file will contain multiple records for each location. The job(s) must split the
input file into a single file for each location. So if there are 15 locations in the input file we will end
up with 15 individual files as the output. The output files must be named with the location name in
the filename.
My initial thought on this was to first run a job to obtain a dataset containing the unique list of
locations in the input file. Then using this list run another job that will filter the input file by each
item in the locations dataset creating an output file with a name based on the location code. I'm not
sure how to do this or if it is even possible. It needs to be driven from the unique list of locations. I'm
thinking the Job Sequence would need to loop through the location dataset and pass each row value
to the main processing job as a parameter. Is this possible? Can you suggest how I should approach
this problem?
Yes, this will be processed on AIX servers. I have attached some sample test data. Here's what I'm
trying to do with it:
1. The file needs to be split based on the first column. Columns are delimited with semi-colons.
2. The first and last rows of the source (header/trailer) must exist in all output files as well.
3. Output files must be named as the source file with the value of the first column appended.
A shell script may be the way to go. If I can ever figure out how to set the UserStatus in a parallel job
(or server job) then I can probably get DataStage to do it. For some reason the DSSetUserStatus
routine is not available in our DataStage!
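A rough sketch of such a shell script follows. It assumes the layout described above
(semicolon delimiters, a header as the first row, a trailer as the last row); the script name,
variable names and output naming are mine, not from the post:
#!/bin/sh
# split_by_location.sh <input file>
# Writes one file per distinct value in column 1, each including the original
# header and trailer, named <input file>_<location>.
src=$1
total=`wc -l < $src`
awk -F';' -v src="$src" -v total="$total" '
NR == 1     { header = $0; next }       # keep the header row
NR == total { trailer = $0; next }      # keep the trailer row
{
    out = src "_" $1
    if (!(out in seen)) {               # first record for this location
        seen[out] = 1
        print header > out
    }
    print $0 > out
}
END {
    for (out in seen) print trailer > out    # append the trailer to every file
}' "$src"
It could be run as sh split_by_location.sh input.txt, either by hand or from an after-job
subroutine.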
Dimensional Modeling - Dimension Table
In a Dimensional Model, context of the measurements are represented in dimension tables. You can
also think of the context of a measurement as the characteristics such as who, what, where, when,
how of a measurement (subject). In your business process Sales, the characteristics of the
'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes are the various columns in a dimension table. In the Location
dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension
Attributes are used in report labels, and query constraints such as where Country='USA'. The
dimension attributes also contain one or more hierarchical relationships.
Before designing your data warehouse, you need to decide what this data warehouse contains. Say
if you want to build a data warehouse containing monthly sales numbers across multiple store
locations, across time and across products then your dimensions are:
Location
Time
Product
Each dimension table contains data for one dimension. In the above example you get all your store
location information and put that into one single table called Location. Your store location data may
be spread across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize
all that data into one single table.
Dimensional modeling is the design concept used by many data warehouse designers to build
their data warehouse. Dimensional model is the underlying data model used by many of the
commercial OLAP products available today in the market. In this model, all data is contained in two
types of tables called Fact Table and Dimension Table.
In a Dimensional Model, Fact table contains the measurements or metrics or facts of business
processes. If your business process is Sales, then a measurement of this business process such as
"monthly sales number" is captured in the fact table. In addition to the measurements, the only
other things a fact table contains are foreign keys for the dimension tables.
I ran some performance tests comparing DataStage server jobs against parallel jobs
running on the same machine and processing the same data. Interesting results.
Some people out there may be using the server edition, most DataStage for
PeopleSoft customers are in that boat, and getting to the type of data volumes that
make a switch to Enterprise Edition enticing. Most stages tested proved to be a lot
faster in a parallel job than a server job even when they are run on just one parallel
node.
All tests were run on a 2 CPU AIX box with plenty of RAM using DataStage 7.5.1.
The sort stage has long been a bugbear in DataStage server edition prompting many
to sort data in operating system scripts:
1 mill server: 3:17; parallel 1 node: 00:07; 2 nodes: 00:07; 4 nodes: 00:08
2 mill server: 6:59; parallel 1 node: 00:12; 2 nodes: 00:11; 4 nodes: 00:12
10 mill server: 60:00+; parallel 2 nodes: 00:42; parallel 4 nodes: 00:41
The parallel sort stage is quite a lot faster than the server edition sort. Moving from 2
nodes to 4 nodes on a 2 CPU machine did not show any improvement on these smaller
volumes and the nodes may have been fighting each other for resources. I didn't
have time to wait for the 10 million row sort to finish but it was struggling along after
1 hour.
The next test was a transformer that ran four transformation functions including trim,
replace and calculation.
1 mill server: 00:25; parallel 1 node: 00:11; 2 nodes: 00:05; 4 nodes: 00:06
2 mill server: 00:54; parallel 1 node: 00:20; 2 nodes: 00:08; 4 nodes: 00:09
10 mill server: 04:04; parallel 1 node: 01:36; 2 nodes: 00:35; 4 nodes: 00:35
Even on one node with a compiled transformer stage the parallel version was three
times faster. When I added one node it became twelve times faster with the benefits
of the parallel architecture.
Aggregation:
1 mill server: 00:57; parallel 2 nodes: 00:15
2 mill server: 01:55; parallel 2 nodes: 00:28
The DB2 read was several times faster and the source table with 2 million plus rows
had no DB2 partitioning applied.
So as you can see even on a 1 node configuration that does not have a lot of parallel
processing you can still get big performance improvements from an Enterprise
Edition job. The parallel stages seem to be more efficient. On a 2 CPU machine there
were some 10x to 50x improvements in most stages using 2 nodes.
If you are interested in these type of comparisons leave a comment and in a future
blog I may do some more complex test scenarios.
This new series is about techniques I use to learn more about DataStage starting with
the Modify stage.
So why use the Modify stage? Well, I don't. As I said in the post Is the DataStage
parallel transformer evil? I always use the Transformer first. I only go to the Modify
stage when I need some extra performance on a very high data volume.
Transformers are easy to use, which is one of the reasons why DataStage has been
successful as it is the most commonly used stage. When adding derivation /
mapping / transformation code into the text entry windows there is a right mouse
click menu for finding a list of transformation functions:
After choosing a function from the menu it comes with syntax to show how to use the
function:
By comparison for the Modify stage you get next to nothing. You just get a text box
and you need to enter all the column and function text manually without a right
mouse click menu:
There are almost no sample commands or example code in the DataStage manuals.
There is a list of functions and syntax in the DataStage Parallel Developers Guide but
it usually takes several goes to get functions right.
The easiest way to learn the Modify stage or learn a new function you haven't used
before is to create a simple test job with a Modify stage in it:
The Generate Rows stage creates test data for the Modify stage to work on.
Any number of columns and rows though you only need to start with one
column.
The Peek stage shows the result of the Modify function so you can view the
output in the Director.
The Modify stage can be used to test out functions before they are added to
actual jobs.
This makes it easier to try out different functions to use in real jobs. It is hard
learning and debugging Modify functions in a real job due to things like high data
volumes and job complexity. I prefer to get functions working in isolation first so I go
back to this type of learning job whenever I need to add a modify stage with some
unfamiliar functions.
1. I used to routinely get bitten by the transformer stage rejecting rows in which
some fields were participating in derivations or stage variables and the dang
things had NULLS in them.
2. Solving the NULLS problem with IF IsNull() for every single field being
evaluated in some way can get overly complex and very messy.
Instead I put a Modify stage before the Transformer, call the stage
MhandleNull, and apply handle_null() to all fields being evaluated in the transformer.
This simplifies the already cumbersome IF THEN ELSE syntax of the
transformer and/or the stage variables
The Modify stage is primitive, yes, but therein lies the secret of its success - it's slick.
With a Modify stage you can perform five classes of function, more than Vincent
suggests, but one major limitation is that Modify stage functions are limited to a
single argument - straight away arithmetic and concatenation are ruled out, for
example.
null handling
date/time manipulation
date/time interval calculation (limited)
string trimming and substring
data type conversion
Null values are a hand grenade in a parallel Transformer. As soon as you do anything
to them they go off and you lose the row. It's one of the most common causes of row
leakage. A lot of my job designs have a Modify or Transformer stage near the
beginning of the job to handle all metadata issues such as null values and types and
a transformer near the end of the job to do the business rule qa and derivations and
column mapping.
If I am playing with columns in a parallel transformer and I think there may be nulls I
do the null handling in stage variables and then only refer to these stage variables in
the other parts of the transformer.
There is such a thing as a DataStage hack. One of my favourite hacks is the copy
parameters routine. It lets you copy parameters from a prototype job to a large
number of target jobs identified by using a folder name and/or a job name pattern.
The Set Default Parameters job is available from the Ken Bland and Associates
website. It's a DataStage job that hacks the DataStage repository to change job
parameters for one or more jobs.
How it works
The download contains a DataStage dsx export file with a single job. Import this job
into DataStage. The first thing you will notice is that the job has no stages! It is a job
made entirely of BASIC code and each time it runs it executes this code to change
sets of parameters directly in the DataStage repository.
When you run the job you will see the following parameters:
Update Parameters: set this to Y so the job does something.
Enter the job text to search: this works with a LIKE statement to only change
jobs with a name that matches the string.
Enter the folder name: an optional field that only changes jobs within that
folder.
Enter the Prototype job: the name of the job that holds the parameters to be
copied from.
Enter Y to replace block: this option removes all existing parameters before
adding the new parameters.
Create an empty job with the one parameter. Make it the prototype, set Update to Y
and Replace Block to N. That one parameter will be added to all target jobs without
affecting existing parameters.
Create a job with your final set of parameters. Make it the prototype, set Update to Y
and Replace Block to Y. All existing parameters will be removed and the new
parameters added.
Add the parameters you want to reset into a prototype job. Set Update to Y and
Replace Block to N and the target jobs will have the values replaced and other
parameters will not be affected.
I am posting this a few days too late but it is the twelve month anniversary of the Hawk
beta test. It is a test that for some people has not yet started.
The Hawk Beta press release came out on June the 23rd of 2005 just four months
after Ascential became an IBM company. There was a round 1 Hawk beta test for
clients and partners who requested beta software for Windows which completed
successfully. The round 2 has recently started for other platforms but with a much
tighter acceptance criteria for clients to beta test production data. Some of you who
signed up for the beta test twelve months ago still may not have heard back from
IBM.
Since I am having a little dig at Hawk I thought I would take us down memory road
with this Ascential press release from AscentialWorld 2003 about some very
optimistic release dates:
Product Roadmap
Ascential Software unveiled three strategic product initiatives, code-named Trinity,
Hawk, and Rhapsody. The results of these development projects are anticipated to be
in the market as soon as the first half of 2004.
* "Trinity" - Anticipated completion mid 2004.
Trinity highlights include:
o Expanded service-oriented architecture capabilities
o 64-bit performance and expanded mainframe support
o Increased functionality for enterprise grid computing initiatives
o "Frictionless connectivity" - eliminating barriers to connectivity inside and outside
the enterprise
* "Hawk" - Expected delivery late 2004.
Hawk highlights include:
o User interface innovations for enhanced productivity
o RFID infrastructure support
o "Next generation" meta data services infrastructure
* "Rhapsody" - Next generation platform that will deliver a quantum leap in
productivity, manageability and scalability.
I'm afraid that 64-bit performance from Trinity never made it, nor did frictionless
connectivity (it's a big part of the Hawk release). However RFID infrastructure support
did make it to market on time. The "late 2004" release date for Hawk turned into
"much later then 2004".
Don't get me started on the beta for the analyser tool! I am assuming that the
Analyser product, which is the successor to ProfileStage and AuditStage, requires
completion of the Metadata Server beta testing before it can enter its own beta
testing.
So happy anniversary Hawk Beta. I've marked that press release date, October 29, as
the anniversary of the public announcement of the Hawk release. This October will be
the third anniversary of waiting for Hawk. You can expect another blog from me then
but hopefully it will be better news. In the meantime you can whet your appetite for Hawk with my blog from a couple of months ago, "My DB2 Magazine article: Hawk overview, screenshots and questionnaires".
You can also read one of my early entries, in fact my third blog entry, "My top ten features in DataStage Hawk".
The ETL market is highly competitive and with this comes the spreading of Fear,
Uncertainty and Doubt (FUD). I tackle four items of FUD about WebSphere DataStage.
Wikipedia has a history of the term FUD at the Fear, uncertainty and doubt topic page. Interestingly, it was first coined to describe an IBM marketing practice, and here I am talking about it being applied to an IBM product, WebSphere DataStage.
DataStage upgrades are quite easy to run: DataStage projects get upgraded by the installation process, and DataStage export files can also be upgraded as they are imported. This means you can upgrade your projects and have backups on hand in case the installation runs into problems. There is no evidence anywhere that the Hawk release will be a difficult upgrade.
Since Hawk is still in beta phase we haven't even seen the full upgrade
documentation so it is premature to try and use it as FUD.
The fact is that most of the leading data integration tools are owned by third party
vendors and each of these suppliers has a different agenda to pure play data
integration companies. Thus SAS and Business Objects are much more focused on
business intelligence than on data integration per se, while IBM is focusing a lot of
attention on MDM and data governance, for example.
I would LOVE to know why data governance and MDM have nothing to do with
application migration!!!! They are not mutually exclusive areas of data integration.
The term pure play data integration appeared in a lot of Informatica press releases
after IBM bought Ascential. It reminds me of the Harry Potter story lines about pure
blood wizards. Apparently DataStage has been mixed with Muggle Blood and is no
longer a pure play data integration tool. It magically only works on AIX against DB2
databases and it has forgotten everything it knows about SAP, PeopleSoft and Oracle
Financials!
Pure play pure shmay is what I say. DataStage is still an ETL tool that works on
multiple platforms (Unix, Linux, Windows, mainframe, USS), it works equally well
against all the popular databases (Oracle, SQL Server, DB2, Teradata) and it has
enterprise packs for code free integration with the major ERP tools (SAP, PeopleSoft,
Oracle Financials). None of this magically disappeared when Ascential was acquired by IBM and none of this disappears in the Hawk release. The talk of agendas and pure
play vendors is FUD.
It is also amusing to think IBM can only focus on two marketing ideas at any one
time. Since they are focusing on MDM and data governance they must have
abandoned application migration, data warehousing, BI and data conversions! Oh,
and since Microsoft are focused on Vista they must no longer be competing in the
console market! Most of the IBM MDM press releases do not even refer to DataStage,
they refer to the MDM product IBM purchased.
With an ETL tool you can do the sort in the transformation process via a sort stage, or you can have the tool hand the sort off to an external tool. Often the call to the external sort is seamless - it can be made from the ETL database stages or sequential file stages.
Database sorting
Easily done by adding an ORDER BY to your SQL select statement in your database stage. You can add it through the stage SQL builder, or as an extra property to generated SQL, or in a user-defined SQL statement. Database sorts are similar to ETL sorts: both try to do the sort in memory and, when that fills up, they overflow into some type of temporary storage on disk. This means both may need some special care on very large volumes.
Be wary of sort commands against a database that cannot handle the extra load, or
in a complex SQL statement with outer joins, or in a database that has not been
configured to handle very large sorts. If you want to minimise the impact you have on
source databases then leave the sort on the ETL server.
If you had someone like a Chris Eaton on hand to monitor performance (hit ratios, sorts, dynamic queries) he might help you with the source SQL and make sure the
sorting is done efficiently. However, on a lot of ETL projects you have a lot of legacy
sources and the DBAs who look after those databases might not give you the time of
day.
CoSort started supplying specialised sorting tools in 1978! In 1997 the CoSort sorting product was executing Unix sorts across multiple CPUs, years before DataStage got parallel processing. CoSort offer plugins for both DataStage Server Edition and
Informatica PowerCenter that provide very fast sorting on Unix and Windows along
with some joining, filtering and aggregation.
If you are on the cheaper and older DataStage Server Edition then CoSort is worth
evaluating. It was good to see IBM recently re-affirm the partnership that CoSort
started with DataStage way back at version 3.6. It can save a huge amount of time if
you can sort, aggregate and filter large text files in the first step of your job in
parallel mode. They do not have a plugin for parallel jobs, but given the performance
of the sort stage in parallel jobs it may not be worth them developing one.
ETL Sorts
The ETL sort is extremely easy to implement as it is a stage with just a couple of
options to set. Generally if you get one very large sort optimised for one job it will
work across every other job as well. My own humble parallel job versus server job testing showed a parallel job sort could be at least 100 times faster than a server job sort, so you may need to do some extra configuration to get server job sorts finely tuned.
In parallel jobs the sort stage has a couple of extra benefits: it can remove duplicates and it can partition data. It is a good stage to put at the front of a job. I will talk more
about parallel sorting in a future article.
The Wrap
For parallel jobs I would advise using the parallel sort stage wherever possible due
to the very fast performance and partitioning. Database sorts are possible as long as
your databases can handle the extra load.
For server jobs all types of sorts should be evaluated as the server job sort can be
very slow. The database sort and the operating system sort are both good options.
CoSort offers the best performance for very large text files. I have not heard of any
plans for a parallel CoSort plugin so for now I would only use it with server jobs.
If you already have CoSort and Enterprise Edition you may be able to call CoSort from within the parallel Sequential File stage. Let me know if any of you have had
success combining these two products.
People who have been using a tool for a long time learn some shortcuts. I've done a brain dump of all the DataStage shortcuts I can remember.
Import Export
* When you do an export, copy the export file name to the clipboard. When you go to your project and run an import, paste the file name instead of having to browse for it. Export and import each remember the last file name they used, but they do not share that name with each other.
* There is an export option to export by individual job name or by category name, found on the second tab of the export form. Often when you go to export something it is on the wrong option, eg. you want a job but it is showing the category. If you switch from category export to individual job export, back on tab 1 your job is still not highlighted. The quick fix is to switch the type, then close the export form and open it again; that way the job name or category you have highlighted will be picked up automatically.
* When you do an export there is a "View" button; click this to open the export file and run any type of search and replace on job parameter values when moving between dev and test.
* If you want to export several jobs that are not in the same category use the append
option. Highlight and export the first job. Close the export window, find and highlight
the second job, in the export form click the "Append" option to add to the file.
Continue until all jobs have been selected and exported.
* On the Options tab there is a "Referenced shared containers" check box that includes any shared containers used by the exported jobs in the export.
Sequence Jobs
My most annoying Sequence job "feature" is the constant need to enter job
parameters over and over and over again. If you have 10-20 parameters per job (as I
normally do) it becomes very repetitive and is open to manual coding errors.
Under version 7.1 and earlier you could copy and paste a job activity stage, change
the job name and retain most of the parameter settings. Under 7.5.x when you
change the job name all the parameter settings get wiped.
You need to set the parameters for every flippin job activity stage, even though they
are likely to have the same or similar parameter lists and settings. A faster way is to
do the job renaming in an export file.
* In an empty sequence job add your first job activity stage and set all parameter
values or copy one in from an existing job.
* Copy and paste as many copies of this job activity as you need for your sequence.
* Close the sequence job, export it and click the View button.
* Open the sequence job again; you need it to look up the stage names.
* Copy the name of the last job activity stage name and search for it in the export
file. When the cursor is on that part of the export file search and replace the old job
name with the new job name. Make sure you only replace to the bottom of the file,
most text editors should have this option. This will rename the job of the last activity
stage.
* Repeat this for the second last job activity, then the third last etc until you have
replaced all job names back to the second job activity stage.
* Import the job into your project.
This should give you the same set of job activity stages but with each one pointing at
a different job and with the full set of job parameters set.
Debugging
* In parallel jobs I use the copy stage a lot to debug errors that do not indicate which
stage caused the error. I create a copy of the job and start removing output stages,
replacing them with a copy stage. I progressively remove stages until I locate the one
with the error.
This is my third post in what I am now calling job parameter week. In future it will be
better if I come up with an idea for a theme week at the start of the week instead of
the end of the week. I'll chalk that one up to experience.
Welcome to job parameter theme week, where we bring job parameters to life. (Note to marketing department, please provide better mottos for future theme weeks). I have tracked back to my previous two job parameter posts from this week. We looked at Project Specific Job Parameters on Monday and saw how they made management of parameters a lot easier, and on Wednesday we looked at the exposure of unencrypted passwords by the NSW police.
ETL job parameters are values you can change when you run the job, as opposed to properties you can only change by modifying and recompiling the job. I don't know if I am going to get to 101 uses but I thought it was an eye catching title.
Static Parameters
In programmer speak you could also call these constants. They are values you never
expect to change. So why make them a job parameter? Because they might change,
or because you don't want to have to try and remember the value every time you use
it in a job.
In my current project I have a static parameter called MAX_DATE which holds a high end date value of 2999-12-31 23:59:00. I use this on tables that track variant data, such as a slowly changing dimension type 2 table, where it provides a high end date for an open or current item. This parameter gets copied into all jobs that update dimension tables. The good thing about having it around is that I don't need to remember the value of the field, I just enter MAX_DATE into my transformer.
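As a rough sketch of what that looks like in a transformer (the link and column names below are made up for illustration, MAX_DATE is the job parameter):
* END_DATE derivation for a brand new, still open dimension row:
MAX_DATE
* END_DATE derivation where the incoming expiry date may be empty:
If lnk_dim.EXPIRY_DATE = '' Then MAX_DATE Else lnk_dim.EXPIRY_DATE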
Another good thing is that come July of 2999 when a project team is assembled to
address Y3K problems they only have to update this date in one place and will not
have to recompile jobs.
Environmental Parameters
These are the job parameters that change value as you move from one environment
to another, for example from development into testing and into production. Typical
values that change are directory locations and database login details.
There was much umming and aahing in meetings as to whether to turn data cleansing on, so I created a set of job parameter flags that could be set to true or false to turn personal name, company and address cleansing on and off.
These parameters were used in a transformer that wrote out the final fields and had both the original and the cleansed fields to choose from. Now, in production, the type of cleansing the job performs can be chosen from the set of flags for each run.
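As a sketch of how one of those derivations might look (the parameter, link and column names are made up for the example):
* Pick the cleansed value only when the ADDRESS_CLEANSING flag is turned on:
If ADDRESS_CLEANSING = 'TRUE' Then lnk_in.CLEANSED_ADDRESS Else lnk_in.RAW_ADDRESS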
Dynamic Parameters
These are parameters that get changed with every job run.
- A shell script starts a job with the dsjob command; it can derive a dynamic value in the Unix scripting language and pass it to the job through the dsjob command.
- A sequence job calls parallel jobs that require a dynamic parameter; the value can be set in a User Variables stage or in the Job Activity stage using BASIC commands or a BASIC routine.
The PROCESS_DATE is retrieved from a database table via a Unix script. The script
uses a simple DB2 command line statement to retrieve the date and then uses the
dsjob command to set the parameter and call a sequence job. It populates
PROCESS_ID with an eight character string representation of the PROCESS_DATE.
The Sequence job uses a User Variable stage to append a unique 2 character
sequence job code to the PROCESS_ID to make it unique for this sequence job. This
User Variable is then used in each job activity stage with an additional 2 character
code added that is unique for each job activity stage within the sequence job.
Therefore each parallel or server job across the entire project gets a unique
PROCESS_ID each time they run.
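A rough sketch of those expressions, with made-up stage and variable names and made-up codes ('S1' for the sequence job, '01' for the job activity):
* User Variables stage: define a variable, say uvProcessID, with the expression
PROCESS_ID : "S1"
* Job Activity stage: set the called job's PROCESS_ID parameter to
UserVars.uvProcessID : "01"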
This is a followup from comments on my parameter week post on 101 uses of job
parameters. This post is about calling DataStage jobs and the range of job control
options.
The go-to command for interacting with DataStage from the command line, from scripts or from other products is the dsjob command. The documentation for dsjob is buried in the Server Job Developers Guide; it is cunningly placed there to keep Enterprise users, who would never think to read the Server Edition guide, in a state of perpetual bewilderment.
I am not going to go into the job control API or mobile device job control, refer to your
documentation for those options! I will cover the more commonly used methods.
A lot of additional stages within the Sequence Job provide dynamic parameter setting, after-job notification, conditional triggers to control job flow, looping, waiting for files and access to the DataStage BASIC programming language.
In third party scheduling tools you need to set job parameters as you run the job in
some type of scripting language. Jobs are executed by scheduling tools using the
dsjob command. This command can require a lot of arguments so it is often run via a
script or batch file.
The mother of all DataStage run scripts can be found in this dsxchange thread.
Written by Ken Bland and Steve Boyce it will start jobs, set run time parameters from
a parameter ini file, check the status of finished jobs, service your car, solve the Da
Vinci code and run an audit process after the job has finished.
This script is run from a scheduling tool to make the setup of the scheduling easier.
The mother of all job run scripts sets parameters that are saved in an ini file.
Parameters can also be saved in a database table, with a job extracting the settings
to an ini file before a batch run.
They can also be stored as environment parameters in a user's .profile file. These environment parameters can be passed into the job via a script, or they can be accessed directly in the job by adding environment job parameters and setting the value to the magic word $ENV.
They can also be stored as project specific environment parameters, as we saw during the exhilarating job parameter week, where we brought job parameters to life and struggled to come up with a good theme motto. These job parameters are much like environment parameters but use the magic word $PROJDEF.
The DataStage BASIC language has better access to jobs than operating system scripts. While an external script has to do everything through the dsjob and dsadmin commands, the BASIC language has access to a much larger number of DataStage commands. Like dsjob, these commands are cunningly hidden in the Server Job Developers Guide.
Before the days of sequence jobs (DataStage 5?), and before sequence jobs became quite useful in version 7.5, this job control code was far more prevalent and easier to code than job control in external scripts. It was extremely useful for putting jobs in the right sequence, retrieving job parameters from files, checking the results of jobs and shelling out to execute operating system commands.
Job control code is still widely used even when external scripts or sequence jobs are
in use. They fill in gaps of functionality by providing job auditing, setting dynamic
calculated parameter values, checking for files etc etc etc. It is a very powerful
language.
Also from the dsxchange forum we can find examples of job control code. This time
from Arnd:
GetJobParameter(ParameterName)
* Server routine: reads name=value pairs from a sequential file called
* 'ParameterFile' and returns the value matching ParameterName.
* In a DataStage routine the value left in Ans is the return value.
EQUATE ProgramName TO 'GetJobParameter'
OPENSEQ 'ParameterFile' TO InFilePtr ELSE CALL DSLogFatal('Oh No, cannot open file',ProgramName)
Finished = 0
Ans = ''
READNEXT InRecord FROM InFilePtr ELSE Finished = 1
LOOP UNTIL Finished
   * Each record is of the form NAME=VALUE; FIELD splits it on the '=' sign.
   FileParameterName = TRIM(FIELD(InRecord,'=',1))
   FileParameterValue = TRIM(FIELD(InRecord,'=',2,99))
   IF (FileParameterName=ParameterName)
   THEN
      Finished = 1
      Ans = FileParameterValue
   END
   READNEXT InRecord FROM InFilePtr ELSE Finished = 1
REPEAT
IF NOT(Ans) THEN CALL DSLogFatal('Could not find value for "':ParameterName:'".',ProgramName)
CLOSESEQ InFilePtr
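To call this routine from other job control code you declare it and use it like a function; something like the following, where the 'TargetDatabase' parameter name is made up for the example:
DEFFUN GetJobParameter(ParameterName) CALLING 'DSU.GetJobParameter'
DatabaseName = GetJobParameter('TargetDatabase')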
People from database backgrounds like to have parameters in database tables and may even put an entire job schedule into a table with dependencies and sequencing. They
need a bridge between the database and DataStage so they still need a layer of
either Unix scripts or job control code to run the jobs.
People from programming backgrounds will be very comfortable with the DataStage
BASIC programming language and find it can do just about anything regarding the
starting, stopping and auditing of jobs. They can retrieve settings and parameters
from files or databases.
The method I currently prefer is Sequence Jobs for all job dependencies, project specific environment variables for most slowly changing job parameters, some job control routines for job auditing, dynamic parameters and external operating system commands, and a dsjob script for starting Sequence Jobs from a third party scheduling tool or from the command line.
What I like about project specific environment parameters is that the job can be
called up from anywhere without requiring any parameter settings. It can be called
up from within the Designer by developers, from ad hoc testing scripts by testers and
from third party scheduling tools in production.
One of the most frustrating tasks for new DataStage users is adding a new stage
between two existing stages and discovering one of your stages has amnesia.
For a DataStage novice adding a stage involves adding the new stage, moving the existing link onto that stage and adding a new link. Let me represent old stages as O and the new stage as N:
O--N--O
The old stage that was disconnected from the existing link (the :( stage) got a brand new link. This means the link properties in that stage went back to defaults. If it is a database stage then all the database login and SQL settings disappeared. If it's a sequential file the file name disappeared. For a transformer stage you can lose a lot of time consuming derivation and mapping settings.
Here is a better way to insert a stage. The trick is to make a copy of the link so you
get two sets of link properties instead of one. Since you cannot copy a link by itself
you have to copy the stage and all attached links also get copied:
1) O---O        The two existing stages joined by a link.
2)    ---O      Cut the start stage to the clipboard.
3) N  ---O      Add the new stage.
4) O--- N ---O  Paste to get the start stage back, along with a copy of its link.
5) O--- N ---O  Rename one of the links to avoid duplicate link names.
6) O---N---O    Hook up the links.
Now you have an input and output stage with properties set. You just need to
configure the middle stage.
If you have a long chain of stages you will need to detach a link before cutting and pasting:
O---O---O---O---O
1) O---O---O ---O---O   Detach one side of an adjoining stage.
2) O---O---  ---O---O   Cut that stage.
3) O---O---N ---O---O   Add the new stage.
4) O---O---N ---O ---O---O  Paste the stage and rename the link.
5) O---O---N---O---O---O    Hook it all up.