Best Practices
Incorporating Target's Standards
DataStage is a trademark of International Business Machines Corporation
Main Source
"DataStage Technical Design and Construction Procedures" This is a "living document"
\\nicsrv10\TTS\E\ETL\Best Practices\DataStageTechDoc\DataStageTech.doc
that is, a work in progress changes will be notified
Job Naming
Each DW "project" has three-letter code
for example GLB
Within Jobs branch create category with that name
keep all objects together in order to support MetaStage functions
Job Naming
Job name begins with the database identifier
for example GTL
followed by a job identifier and sequence number
for example GTLJB0001, GTLJB0002, GTLJB0002TEST, GTLJB0003
Stage Names
First 3-4 characters identify the stage type
for example SEQL (Sequential File stage), LKFS (Lookup File Set stage)
The remainder should be meaningful and descriptive, with the first character capitalized
Link Names
Links prior to the final active stage
shortdesc_InTo_stagedesc, shortdesc_OutTo_stagedesc
Links from a passive stage
In_linkdesc
Links after the final active stage / to a passive stage
Out_linkdesc_action
Links from a Lookup stage
Lkup_linkdesc
Example
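A minimal sketch of the naming conventions applied to a hypothetical job that reads a customer file, looks up country codes, and loads a target table. All stage, link, and column names below are illustrative and not taken from the standards document; the TRNS and DB2 prefixes are assumed extensions of the stage-type convention.

  SEQL_CustomerSource  --- In_Customers -------->  TRNS_CleanCustomer
  LKFS_CountryCodes    --- Lkup_CountryCodes --->  TRNS_CleanCustomer
  TRNS_CleanCustomer   --- Out_Customers_Ins --->  DB2_CustomerTarget

Here In_Customers names a link from a passive source stage, Lkup_CountryCodes a reference link from a Lookup, and Out_Customers_Ins a link to the passive target stage with its action (insert).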
Reusable Components
Create reusable components where possible
shared containers, flexible routines
Annotations
Annotations are to be used to explain processing
The Description Annotation shows the purpose of the job
Annotations
Description Annotation
Job Descriptions
Job descriptions become the text of the Description Annotation
The short description is visible in the Detail view (Manager)
Stage/Link Naming
Stages are named after
the data they access (passive stages)
the function they perform (active stages)
Links are named for the data they carry
Do not leave default names, such as Sequential_File_0
Developing Jobs
1. Keep it simple
jobs with many stages are hard to debug, maintain, and document
2. Start small and build to the final solution
plan your use of View Data, Copy, and Peek
start from the source and work outward
develop with a 1-node configuration file and a small set of data
Developing Jobs (continued)
3. Solve the business problem before the performance problem
don't worry too much about partitioning until the sequential flow works as expected
4. If you have to write to disk, use a persistent Data Set
Developing Jobs (continued)
Iterative Design
Use a Copy or Peek stage as a stub
Test the job in phases: small first, then increasing in complexity
Use the Peek stage to examine records
Example Phase 1
Example Phase 2
Example Phase 3
Transformer Stage
The Transformer stage generates code
Always include a reject link
Always test for null values before using a column in a function (see the sketch below)
Be aware of column and stage variable data types
often the developer does not pay attention to the Stage Variable data type; try to maintain the data type as imported
Avoid data type conversions
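A minimal sketch of an output-column derivation that tests for null before applying a string function; the link and column names are illustrative:

  If IsNull(In_Customers.CustName) Then '' Else Trim(In_Customers.CustName)

Without such a test, a null reaching Trim() would typically cause the row to be rejected or dropped, which is one reason the reject link matters.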
Job Parameters
Provide insurance against
things that change over time (for example, passwords and filter conditions)
things that differ between environments (for example, DSNs, pathnames, passwords)
Job Parameters
Created in Job Properties
Each parameter has
name, prompt text (mandatory), type, default value (design time), help text
Defining Job Parameters
Click to add environment variables
Using Job Parameters
In fields in passive stages, delimit with "#" characters
for example #SourceDir#
Names are case-sensitive
In expressions, choose the parameter from the expression editor
not delimited (see the sketch below)
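A minimal sketch of both usages, assuming the SourceDir parameter above plus hypothetical SourceFile and RegionCode parameters:

  Sequential File stage, File property:        #SourceDir#/#SourceFile#
  Transformer constraint (expression editor):  In_Customers.Region = RegionCode

In the stage property the parameter names are delimited with "#"; in the expression they are picked from the expression editor and appear undelimited.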
Useful Environment Variables
APT_DUMP_SCORE
reports the OSH score (datasets, operators, node-to-partition mapping) to the message log
APT_CONFIG_FILE
establishes the name of the configuration file and therefore the degree of parallelism
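For example, both variables can be added to a job as environment-variable parameters; the values shown are illustrative:

  $APT_DUMP_SCORE  = True
  $APT_CONFIG_FILE = /etl/config/2node.apt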
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Partitioner And Collector
Two DataSets
Mapping Node --> partition
Configuration Files
Make a set for 1X, 2X, ...
Use different ones for test versus production
Include as a parameter in each job (example below)
Automatic scaling
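A minimal sketch of a 2-node (2X) configuration file; the server name and resource paths are illustrative. Pointing $APT_CONFIG_FILE at a different file (for example, a 1-node file in development and a 4-node file in production) rescales the job with no design change.

  {
    node "node1"
    {
      fastname "etlserver"
      pools ""
      resource disk "/etl/data/node1" {pools ""}
      resource scratchdisk "/etl/scratch/node1" {pools ""}
    }
    node "node2"
    {
      fastname "etlserver"
      pools ""
      resource disk "/etl/data/node2" {pools ""}
      resource scratchdisk "/etl/scratch/node2" {pools ""}
    }
  }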