DataStage Adv Bootcamp All Presentations

This document discusses the parallel architecture of IBM DataStage and how jobs are executed in parallel. It covers key concepts like pipeline parallelism, partition parallelism, and how the configuration file separates the hardware configuration from the job design allowing a single job to run on any hardware configuration.


Advanced DataStage Workshop

Module 01 Parallel Architecture

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to:

Explain parallel processing architecture

Describe pipeline parallelism

Describe partition parallelism

Describe parallel job development environment

Describe parallel framework

Explain parallel job execution model

2010 IBM Corporation

Information Management

Why Study the Parallel Architecture?

Designer client is a productivity tool


GUI design functionality is intended for fast job development
Not intended to mirror underlying architecture

GUI design depicts standard ETL process


Parallelism is implemented under the covers
The GUI hides, and in some cases distorts, things, e.g. sort insertion

Sound, scalable, and high-performance job designs require a
good understanding of the underlying architecture

2010 IBM Corporation

Information Management

What Do We Need to Master?

Parallel runtime engine


How the GUI design gets executed
What is generated from the GUI
What happens in the parallel framework
Pipeline parallelism
Partition parallelism
Configuration file, which separates design from runtime environment

Development environment
How to take advantage of the parallel framework
How to debug and change GUI design based on parallel framework
messages in the job log

2010 IBM Corporation

Information Management

DataStage Documentation

Parallel Job Developer Guide


Information on GUI elements, e.g. stages, properties, etc.
Information on Data Set and Schema
Information on Transformer functions

Parallel Job Advanced Developer Guide


Stage usage guide
Job design tips
Environment variables
Extending the functionality of parallel framework
Operators

2010 IBM Corporation

Information Management

Information Server Backbone

(Diagram: Information Server components — Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, MetaBrokers, and Federation Server — layered over Metadata Access Services, Metadata Analysis Services, the Metadata Server, and the parallel engine)

2010 IBM Corporation

Information Management

Key Parallel Concepts

Parallel processing
Executing the job on multiple CPUs

Scalable processing
Adding more resources increases job performance

Example system: 6 CPUs (processing


nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to
an SMP system

2010 IBM Corporation

Information Management

Scalable Hardware Environments

Single CPU
Dedicated memory
& disk

SMP
Multi-CPU (2-64+)
Shared memory &
disk

GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per node
Typically SAN-based shared
storage

MPP
Multiple nodes with
dedicated memory, storage

2 to 1000s of CPUs

2010 IBM Corporation

Information Management

Drawbacks of Traditional Batch


Processing

Poor utilization of resources


Lots of idle processing time
Lots of disk I/O for staging

Impractical with large data volume


2010 IBM Corporation

Information Management

Three Types of Parallelism

(Diagram: sample job flow — Derivation, Link Constraint, Lookup, Sort — annotated with pipeline, data-partition, and explicit parallelism)

Implicit pipeline parallelism
Implicit partition parallelism
Explicit execution flow parallelism

10

2010 IBM Corporation

Information Management

Pipeline Parallelism

Transform, enrich, and load processes execute concurrently


Like a conveyor belt moving rows from process to process
Advantages

Data are all in memory, eliminating disk I/O between processes

Keeps all CPUs busy

11

Still has limits on scalability


2010 IBM Corporation

Information Management

Partition Parallelism

Divide the incoming stream of data into subsets to be


separately processed by an operation
Subsets are called partitions

Each partition of data is processed by the same operation


e.g. if operation is filter, then every partition will filter data the same way

Facilitates near-linear scalability

8 times faster with 8 CPUs
24 times faster with 24 CPUs

12

This assumes data are distributed evenly among all partitions

2010 IBM Corporation

Information Management

Three-Node Partitioning

(Diagram: incoming Data split into subset1, subset2, subset3, each processed by the same Operation on Node 1, Node 2, and Node 3)

Data are partitioned into three subsets


The operation is performed on data in each partition
separately in parallel
If the data are evenly distributed, the operation will finish in
1/3 of the time

13

2010 IBM Corporation

Information Management

DataStage Combines Pipelining &


Partitioning

With DataStage, pipelining, partitioning, and re-partitioning


are automatic
Job developer only identifies

Sequential vs. parallel operation (by stage)


Method of partitioning
Configuration file that defines resources
Advanced stage options, e.g. buffering, operator combining, etc.
14

2010 IBM Corporation

Information Management

Job Design vs. Execution


User assembles the flow using DataStage Designer

at runtime, this job runs in parallel for any configuration


(1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!


15

2010 IBM Corporation

Information Management

Execution in Production Environment


Supports all hardware
configurations with a
single job design
Scales up by adding
processors or nodes
with no application
change or re-compile
Configuration file
specifies hardware
configuration and
resources

Application Assembly: One Dataflow Design

UNLIMITED SCALABILITY
16

2010 IBM Corporation

Information Management

Defining Parallelism

Execution mode (sequential or parallel) is controlled by stage


definition and properties
Default is parallel for most stages
Can override default in most cases (stage advanced properties)

Example of stage parallelism


Sequential File stage when reading multiple files or with multiple readers
Sequential File stage when writing multiple files
Oracle stage reading from a partitioned table

Degree of parallelism is determined by configuration file


Total number of logical nodes in nameless (default) node pool
Nodes in named node pool

17

2010 IBM Corporation

Information Management

Configuration File

Separates hardware configuration from job design


Specified per job at execution time by $APT_CONFIG_FILE
Change hardware and resources without changing job design

Defines number of nodes (LOGICAL processing units) with


their resources
No need to match number of CPUs
Data Set, scratch, buffer disks (file systems)
Optional resources (Database, SAS, etc.)
Advanced resources optimization with pools

Different configuration file can be used for same job on


different executions

Different configuration file can be used for different jobs
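For example (the file names and paths below are illustrative only), the same compiled job can be pointed at different configurations simply by changing the job parameter:

Development run:  $APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/1node.apt
Production run:   $APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/8node.apt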

18

2010 IBM Corporation

Information Management

Example Configuration File


{

node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)

19

2010 IBM Corporation

Information Management

Configuration File Guidelines

Minimize I/O overlap across nodes


If multiple file systems are shared across nodes, have different order of
these file systems defined within each node
Spread file systems across RAID / SAN controllers and drives
Use local disks for scratch when possible

Named pools can be used to further separate I/O


buffer: file systems are only used for buffer overflow
sort: file systems are only used for sorting

In a GRID environment, named pools can be used to further
specify resources across physical systems
Specify servers with direct database connections
Careful job design can minimize data transport between servers
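As an illustrative sketch only (host names and paths are invented for this example), two nodes sharing the same file systems might list them in a different order — /fs1 first on n1 but second on n2 — and reserve a local scratch disk for the "sort" pool:

node "n1" {
  fastname "host1"
  pool "" "n1"
  resource disk "/fs1/ds" {}
  resource disk "/fs2/ds" {}
  resource scratchdisk "/local1/scratch" {"sort"}
}
node "n2" {
  fastname "host2"
  pool "" "n2"
  resource disk "/fs2/ds" {}
  resource disk "/fs1/ds" {}
  resource scratchdisk "/local2/scratch" {"sort"}
}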

20

2010 IBM Corporation

Information Management

Parallel Job Compilation

What gets generated


OSH: a kind of script
OSH represents the design data flow and stages
Stages become operators
A transform operator for each Transformer
A custom operator is built during the compile
Compiled into C++, then to the corresponding native object module
A C++ compiler is needed for jobs with Transformers

21

(Diagram: Designer client --> Compile on DataStage server --> Executable job, consisting of generated OSH plus C++ Transformer components)

2010 IBM Corporation

Information Management

Generated OSH
Enable viewing of
generated OSH in
Administrator:

Comments
Operator
Schema

22

OSH is visible in:


- Job properties
- Job run log
- View Data
- Table Defs

2010 IBM Corporation

Information Management

Stage to Operator Mapping Examples

Sequential File
Source: import
Target: export

Data Set: copy


Sort: tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key
Generator: generator

Oracle
Source: oraread
Sparse lookup: oralookup
Target load: orawrite

23

2010 IBM Corporation

Information Management

Generated OSH Primer

Comment blocks introduce each


operator
Operator order is determined by the
order stages were added to the design

Syntax
Operator name
Schema
Operator options (-name value)
Input (indicated by n<)
Output (indicated by n>)

24

Generated OSH for first 2 stages

Virtual Data Set is generated to


connect operators

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

Virtual dataset is
used to connect
output of one
operator to input of
another

####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;

2010 IBM Corporation

Information Management

Job SCORE
Generated from OSH and configuration file
SCORE is like an execution plan

Assigns nodes for each operator

Inserts sort and partitioner as needed

Defines topology (virtual data sets) between adjacent


operators

Inserts buffer operator to prevent deadlock

Defines the actual operating system processes to start


When possible, multiple operators are combined into a single process

25

2010 IBM Corporation

Information Management

Viewing the Job SCORE

Score message in job log

Set
$APT_DUMP_SCORE
to output job score to
the job log
To identify the score
dump, look for "main
program: This step has ..."

You don't see the word
"score" anywhere

Virtual
datasets
Operators
with node
assignments

26

2010 IBM Corporation

Information Management

Job Execution: The Orchestra


(Diagram: Conductor Node plus Processing Nodes, each Processing Node running a Section Leader (SL) and Player (P) processes)

Conductor - initial process
Composes the Score
Creates Section Leader processes (one per node)
Consolidates messages to the DataStage log
Manages orderly shutdown

Section Leader (one per node)
Forks Player processes (one per stage)
Manages up/down communication

Players
The actual processes associated with stages
Combined players: one process only
Send stderr, stdout to the Section Leader
Establish connections to other players for data flow
Clean up upon completion

27

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to:

Explain parallel processing architecture

Describe pipeline parallelism

Describe partition parallelism

Describe parallel job development environment

Describe parallel framework

Explain parallel job execution model

28

2010 IBM Corporation

Advanced DataStage Workshop


Module 02 Partitioning and Collecting

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to:

List and describe partitioning and collecting algorithms

Keyless vs. Keyed partitioning algorithms

Partitioning strategy and tips

2010 IBM Corporation

Information Management

Partitioning and Collecting

Partitioning breaks incoming rows into multiple streams of rows (one for
each node)

Each partition of rows is processed separately by the stage/operator

Collecting returns partitioned data back to a single stream

Partitioning / Collecting is specified on stage input links

2010 IBM Corporation

Information Management

Partitioning / Collecting Algorithms

Partitioning algorithms include:


Round robin
Random
Hash: Determine partition based on key value
Requires key specification

Modulus
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm

Collecting algorithms include:


Round robin
Auto
Collect first available record

Sort Merge
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key

Ordered
Read all records from first partition, then second, and so on

2010 IBM Corporation

Information Management

Keyless vs. Keyed Partitioning Algorithms

Keyless: Rows are distributed independently of data values

Round Robin
Random
Entire
Same

Keyed: Rows are distributed based on values in the specified key

Hash: Partition based on key

Modulus: Partition based on the key value modulo the number of
partitions. Key is a numeric type.

Example: Key is State. All CA rows go into the same partition; all MA rows go
in the same partition. Two rows of the same state never go into different
partitions

Example: Key is OrderNumber (numeric type). Rows with the same order
number will all go into the same partition.

DB2: Matches DB2 DPF partitioning

2010 IBM Corporation

Information Management

Round Robin and Random Partitioning

Keyless partitioning methods

Good for initial import of data if no other partitioning is needed
Useful for redistributing data
Fairly low overhead

Round Robin assigns rows to partitions like dealing cards
Rows are evenly distributed across partitions
Row/partition assignment will be the same for a given $APT_CONFIG_FILE

Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

(Diagram: input rows 8 7 6 5 4 3 2 1 0 dealt round robin into three partitions: {0, 3, 6}, {1, 4, 7}, {2, 5, 8})

2010 IBM Corporation

Information Management

ENTIRE Partitioning

Keyless partitioning method
Each partition gets a complete copy of the data

Useful for distributing lookup and reference data
May have a performance impact in MPP / clustered environments
On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data
On MPP platforms, each server uses shared memory for a single local copy

ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)

(Diagram: input rows 8 7 6 5 4 3 2 1 0; with ENTIRE, every partition receives all of the rows 0, 1, 2, 3, ...)

2010 IBM Corporation

Information Management

HASH Partitioning

Keyed partitioning method
Rows are distributed according to the values in key columns

Guarantees that rows with the same key values go into the same partition
Needed to prevent matching rows from "hiding" in other partitions
e.g. Join, Merge, Remove Duplicates, ...

Partition distribution is relatively equal if the data across the source key column(s) is evenly distributed

(Diagram: key column values 0 3 2 1 0 2 3 2 1 1 hashed into three partitions: {0, 3, 0, 3}, {1, 1, 1}, {2, 2, 2})
2010 IBM Corporation

Information Management

Modulus Partitioning

Keyed partitioning method
Rows are distributed according to the values in one integer key column

Uses modulus: partition = MOD(key_value, #partitions)
For example, with 3 partitions a key value of 7 goes to partition MOD(7, 3) = 1

Faster than HASH
Guarantees that rows with identical key values go in the same partition
Partition size is relatively equal if the data within the key column is evenly distributed

(Diagram: key column values 0 3 2 1 0 2 3 2 1 1 distributed by modulus into three partitions: {0, 3, 0, 3}, {1, 1, 1}, {2, 2, 2})

2010 IBM Corporation

Information Management

SAME Partitioning Algorithm

Keyless partitioning method
Rows retain the current distribution and order from the output of the previous parallel stage
Doesn't move data between nodes
Retains carefully partitioned data (such as the output of a previous sort)

Fastest partitioning method (no overhead)

(Diagram: row IDs {0, 3, 6}, {1, 4, 7}, {2, 5, 8} stay in the same partitions downstream; the link shows the SAME partitioning icon)


2010 IBM Corporation

Information Management

Caution Regarding SAME Partitioning

Degree of parallelism remains unchanged in the downstream stage

Don't follow a stage running sequentially (e.g., Sequential File stage)
with a stage using SAME partitioning
The downstream stage will run sequentially!
Don't follow a Data Set stage with a stage using SAME partitioning
The downstream stage will run with the degree of parallelism used to
create the dataset, regardless of the degree of parallelism defined in
the job's configuration file

2010 IBM Corporation

Information Management

Auto Partitioning

DataStage inserts partition components as necessary to ensure


correct results

Before any stage with Auto partitioning


Generally chooses ROUND-ROBIN or SAME
Inserts HASH on stages that require matched key values
(e.g. Join, Merge, Remove Duplicates)
Inserts ENTIRE on Normal (not Sparse) Lookup reference links

Since DataStage has limited awareness of your data and business


rules, explicitly specify HASH partitioning when needed

DataStage has no visibility into Transformer logic


Hash is required before Sort and Aggregator stages
DataStage sometimes inserts un-needed partitioning

12

NOT always appropriate for MPP/clusters

Check the log

2010 IBM Corporation

Information Management

Preserve Partitioning Flag

Preserve Partitioning flag is intended for downstream stages
that use Auto partitioning
Flag has 3 possible settings:
Set: downstream stages are to attempt to retain partitioning and sort order
Clear: downstream stages need not retain partitioning and sort order
Propagate: tries to pass the flag setting from input to output links
Set automatically by some operators (e.g. Sort, Hash partitioning)
Can be manually set on Advanced Properties tab
Functionally equivalent to explicitly specifying SAME partitioning
But allows DataStage to over-ride and optimize for performance (e.g. if the
degree of parallelism differs)

Preserve Partitioning setting is part of dataset metadata


Log warnings are issued when Preserve Partitioning flag is set but
downstream operators cannot use the same partitioning

2010 IBM Corporation

Information Management

Partitioning Strategy

Use HASH partitioning when stage requires grouping of related values

Use MODULUS if group key is a single integer column


Better performance than Hash

RANGE may be appropriate in rare instances when data distribution is


uneven but consistent over time
Know your data!
How many unique values in the Hash key columns?

If grouping is not required, use Round Robin


Little overhead

Try to optimize partitioning for the entire job flow

2010 IBM Corporation

Information Management

Partitioning Strategy, Cont.

Minimize the number of repartitions within and across job flows

Within a flow:
Examine up-stream partitioning and sort order and attempt to preserve for
down-stream stages using SAME partitioning

Across jobs:
Use datasets to retain partitioning

2010 IBM Corporation

Information Management

Collector Methods

(Auto)
Eagerly read any row from any input partition
Output row order is undefined (non-deterministic)
This is the default collector method

Round Robin
Pick row from input partitions in round robin order
Slower than auto, rarely used

Ordered
Read all rows from first partition, then second, and so on
Preserves order that exists within partitions

Sort Merge
Produces a single (sequential) stream of rows sorted on specified key
columns from input sorted on those keys
Row order is not preserved for non-key columns
That is, non-stable sort.

2010 IBM Corporation

Information Management

Choosing a Collector Method

Generally, Auto is the fastest and most efficient method of collection


To generate a single stream of sorted data, use the Sort Merge
collector
Input data must be sorted on these keys
Sort Merge does not perform a sort

Ordered is only appropriate when sorted input has been range-partitioned


No sort required to produce sorted output, when partitions have been sorted

Round robin collector can be used to reconstruct original (sequential)


row order for round-robin partitioned inputs
As long as intermediate processing (e.g. sort, aggregator) has not altered
row order or reduced number of rows
Rarely used

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to:

List and describe partitioning and collecting algorithms

Keyless vs. Keyed partitioning algorithms

Partitioning strategy and tips

18

2010 IBM Corporation

Advanced DataStage Workshop


Topic 03A Development & Debugging

Information Management

2010 IBM Corporation

Information Management

Topic Objectives
After completing this topic, you should be able to:

Understand the importance of reject link in DataStage

How to generate test data with the Row Generator, and how to peek at
data streams.

Guidelines for the stages of Row Generator, Lookup, Filter, Peek,


Sequential File and Transformer.

2010 IBM Corporation

Information Management

Reject Link Usage in DataStage

2010 IBM Corporation

Information Management

Reject Link Sequential File Stage

Compiled into import / export operators

Import operator converts data from external format to framework internal


format
External format described by Table Definition
Internal format described by schema

Export operator reverses the process

Job log messages use import / export


e.g. 40 records imported successfully; 3 rejected
e.g. 40 records exported successfully; 0 rejected
Records get rejected when they cannot be converted correctly

2010 IBM Corporation

Information Management

Reject Link Sequential File Stage

Sequential File stage characteristics


Stage needs to know how
File is divided into rows (record format)
Record is divided into fields (column format)

One output link if reading from file


One input link if writing to file
Optionally one reject link
Reject any record not matching metadata
e.g. columns are separated by commas but there is no comma in the record
e.g. column is an integer but contains alphabetic characters

2010 IBM Corporation

Information Management

Reject Link Sequential File Stage

Reject Mode =
Continue: continue reading
records
Fail: abort job
Output: send record down reject
link

Rejected records consist of one


column of data type = raw

Reject mode property

2010 IBM Corporation

Information Management

Reject Link Sequential File Stage

2010 IBM Corporation

Information Management

Reject Link Lookup Stage

Lookup stage characteristics

One stream (source) input link


One or more reference (lookup) input links
One stream output link
Optionally one reject link
Use the yellow constraint icon to change lookup failure action property

2010 IBM Corporation

Information Management

Reject Link Lookup Stage

Lookup Failure action =


Fail: log error and job aborts
Drop: source record is dropped silently (no warning message)
Continue: source record is transferred to output and reference link columns
filled with NULL or default value
Reject: source record is sent to reject link

2010 IBM Corporation

Information Management

Reject Link Lookup Stage

2010 IBM Corporation

Information Management

Reject Link Transformer Stage

Transformer stage characteristics

One input link


One or more output links
Optionally a reject link
Derivation and constraint expressions can reference:

Input columns
Job parameters
Stage variables
Functions

2010 IBM Corporation

Information Management

Reject Link Transformer Stage

Transformer stage and NULL expressions

Any expression that includes a NULL will always evaluate to NULL
e.g. 1 + NULL = NULL
"John" : NULL : "Doe" = NULL

When a derivation or constraint evaluates to NULL, the Transformer stage will
drop the record with a warning message when no reject link is present
output the record to the reject link if one is present

Recommendation
Always create a reject link
Always test for NULL in expressions that use NULLABLE = Yes columns
IF IsNull(link.col) THEN ... ELSE ...
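As a hedged illustration (the link and column names are invented for this example), a derivation on an output column might guard a nullable input column like this:

IF IsNull(lnk_in.MiddleName) THEN "" ELSE lnk_in.MiddleName

and a constraint that must not silently drop rows can test keys explicitly, e.g. NOT(IsNull(lnk_in.CustomerID)), with the remaining rows going to the reject link.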

2010 IBM Corporation

Information Management

Reject Link Transformer Stage

To create a reject link for the Transformer stage


Draw a standard output link from the Transformer stage to the desired stage
(e.g. Peek)
Select the output link, click the right mouse button, and choose "Convert to Reject"

2010 IBM Corporation

Information Management

Reject Link Transformer Stage

2010 IBM Corporation

Information Management

Reject Link Merge Stage

Merge stage characteristics

One master input link


One or more update input links
One stream output link
Optionally one or more reject links (the same number as update input links)
All input link records must be sorted
Master input link records cannot have duplicates
Master records without update records can be kept or discarded
Unused update records from update link n can be captured with a
corresponding reject link n
Master and update records are merged by matching key value

2010 IBM Corporation

Information Management

Reject Link Merge Stage

2010 IBM Corporation

Information Management

Reject Link Merge Stage

Example of reject link setting

2010 IBM Corporation

Information Management

Test Data with Row Generator

18

2010 IBM Corporation

Information Management

Job Design for Generating Mock Data


Lookup tables

Row Generator

2010 IBM Corporation

Information Management

Generating Mock Data

Row Generator stage


Define columns in which to generate the data
In Extended Properties page, select algorithm for generating values
Different types have different algorithms available

Lookups can be used to generate large amounts of robust mock data


Lookup tables map integers to values
Column Generator columns generate integers to look up
Cycling through integer sets can generate all possible combinations

2010 IBM Corporation

Information Management

Specifying the Generating Algorithm

Cycle through

Columns
generating
integers

2010 IBM Corporation

Information Management

Inside the Lookup Stage

Metadata for
data file
Return mapped
values
2010 IBM Corporation

Information Management

Lookup Stage

Lookup stage runs in two phases:


Read all rows from reference link into memory; index by lookup key
Process incoming rows

Reference data should be small enough to fit into physical (shared) memory
For reference datasets larger than available memory, use the JOIN or MERGE
stage

Lookup stage processing cannot begin until all reference links have been read
into memory

2010 IBM Corporation

Information Management

Partitioning Lookup Reference Data

ENTIRE is the default partitioning method for Lookup reference links


On SMP platforms, it is a good practice to set this explicitly

SMP configurations:
Lookup stage uses shared memory instead of duplicating ENTIRE reference data

Clustered / GRID / MPP configurations:


Be careful using ENTIRE
Reference data will be copied to all server machines

It may be appropriate to use a keyed partitioning method, especially if data is


already partitioned on those keys
Make sure input stream link and reference link partitioning keys match

2010 IBM Corporation

Information Management

Lookup Reference Data

The Lookup stage cannot output any rows until ALL reference link data has
been read into memory
In this job design, that means reading all the source data (which might be vast) into
memory

NEVER generate Lookup reference data using a fork-join of source data


Separate creation of Lookup reference data from lookup processing

(Diagram: a fork-join job design with links Src, Header, HeaderRef, Detail, and Out, where the Lookup reference data is created from the same source stream)

2010 IBM Corporation

Information Management

Sparse Lookups
Specified in a relational Enterprise stage such as DB2 or Oracle used
for a lookup

Perform singleton SQL lookups for every row

Limit use of Sparse Lookup


Extremely expensive (slow)
ONLY appropriate when the number of input rows is significantly
smaller (e.g. at least 1:100) than the number of reference rows

2010 IBM Corporation

Information Management

Peeking at Data Stream

2010 IBM Corporation

Information Management

Peeking at the Data Stream Design

Copy stream to
the peeks

Selecting records
to peek at

Copy stage
used as a
place holder

Peek stage

2010 IBM Corporation

Information Management

Peeking at the Data Steam

How do you view what's happening on a link at run time?

Use Copy stage to split the stream off to Peek at

Use Filter stage to select the data

Map out the columns you're interested in
Turn off RCP to avoid extra columns

Use Peek stage to display selected data in the job log

2010 IBM Corporation

Information Management

Using Transformer Stage Variables

Stage
variables are
executed top
to bottom.
Reference stage
variables in
column
derivations
2010 IBM Corporation

Information Management

Reading Sequential Files

Accessing sequential files in parallel depends on the access method


and options

Sequential I/O:

Specific files; single file


File pattern

Parallel I/O:

Single file when Readers Per Node > 1


Multiple individual files
Reading with a file pattern read, when
$APT_IMPORT_PATTERN_USES_FILESET is turned on
Note: Sequential row order cannot be maintained when reading a file in
parallel

2010 IBM Corporation

Information Management

Reading a Sequential File in Parallel

The Readers Per Node option can be used to read a single input file
in parallel at evenly spaced offsets

2010 IBM Corporation

Information Management

Partitioning and Sequential Files

Sequential File stage creates one partition for each input file
Always follow a Sequential file with ROUND ROBIN or other appropriate
partitioning type

Never follow a Sequential File stage with SAME partitioning!


If reading from one file, downstream flow will run sequentially!
If reading from multiple files, the number of files may not match the number
of partitions defined in the configuration file
SAME is only appropriate in cases where the source data is already running
in multiple partitions

2010 IBM Corporation

Information Management

Buffering Sequential File Writes

By default, target Sequential File stages write to memory buffers


Improves performance
Buffers are automatically flushed to disk when the job completes successfully
Not all rows may be written to disk if the job crashes

Environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the
number of rows to buffer
$APT_EXPORT_FLUSH_COUNT=1 flushes to disk after every row
Setting this value too low incurs a performance penalty!

2010 IBM Corporation

Information Management

Other Sequential File Tips

When writing to variable-length columns from fixed-length fields,


specify:
field width
pad char extended property
specifies character used to pad data to the specified field width
If not specified, an ASCII NUL character (0x0) is used
by default

When reading delimited files, extra characters are silently truncated for
source file values longer than the maximum specified length of
VARCHAR columns
Set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to
reject these records instead

2010 IBM Corporation

Information Management

Simplifying Transformer Expressions

Leverage built-in functions

More efficient than a complex expression


Original expression:
IF Link_1.ProdNum = "000" OR Link_1.ProdNum = "800" OR Link_1.ProdNum =
"888" OR Link_1.ProdNum = "866" OR Link_1.ProdNum = "877" OR Link_1.ProdNum
= "855" OR Link_1.ProdNum = "844" OR Link_1.ProdNum = "833" OR
Link_1.ProdNum = "822" OR Link_1.ProdNum = "900"
THEN 'N'
ELSE "Y"

Simplified expression:
IF index('000|800|888|866|877|855|844|833|822|900', Link_1.ProdNum, 1) > 0
THEN 'N'
ELSE "Y"

2010 IBM Corporation

Information Management

Transformer vs. Lookup

Consider implementing Lookup tables for expressions that depend on value mappings

For expressions such as:
link.A=1 OR link.A=3 OR link.A=5
link.A=2 OR link.A=7 OR link.A=15 OR link.A=20

Create a Lookup table for the source-value pairs

(Diagram: a lookup table mapping each source value of A, e.g. 15 and 20, to its Result value)

2010 IBM Corporation

Information Management

Transformer Decimal Arithmetic

Default internal decimal variables are precision 38, scale 10


Can be changed by:
$APT_DECIMAL_INTERM_PRECISION
$APT_DECIMAL_INTERM_SCALE
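For example (the values below are arbitrary, chosen only to illustrate the mechanism), a job working with smaller monetary amounts might set:

$APT_DECIMAL_INTERM_PRECISION = 28
$APT_DECIMAL_INTERM_SCALE = 4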

2010 IBM Corporation

Information Management

Transformer Decimal Rounding

Use $APT_DECIMAL_INTERM_ROUND_MODE to specify decimal


rounding

ceil: round up
1.4 -> 2, -1.6 -> -1

floor: round down


1.6 -> 1, -1.4 -> -2

round_inf: round to nearest integer. Up for ties


1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2

trunc_zero: discard any fractional digits to the right of the rightmost


fractional digit supported
1.56 -> 1.5, -1.56 -> -1.5

2010 IBM Corporation

Information Management

Conditionally Aborting a Job

Use the Abort After Rows setting in the Transformer constraints


to conditionally abort a job
Create a new output link and assign a link constraint that matches the
abort condition
Set the Abort After Rows for this link to the number of rows allowed
before the job aborts

When the Abort After Rows threshold is reached, the job aborts
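As a hedged sketch (the link and column names are invented for illustration), an extra output link might carry the constraint

IsNull(lnk_in.CustomerID) OR lnk_in.Amount < 0

with Abort After Rows set to, say, 100, so the job aborts once 100 such rows have been seen.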

2010 IBM Corporation

Information Management

Identifying the First Row in a Group

Stage variables can be used to identify the first row of an input group
Define a stage variable for each grouping key column
Define a stage variable to flag when input key columns do not match previous
values
When a new group is flagged, set stage variables to incoming key column
values

Compare flag

LName value from previous read
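A minimal sketch of the stage-variable pattern (names such as svNewGroup and svPrevLName are invented for this example; a real job would use its own grouping keys):

svNewGroup:  IF lnk_in.LName <> svPrevLName THEN 1 ELSE 0
svPrevLName: lnk_in.LName

Because stage variables are evaluated top to bottom, svNewGroup compares the incoming key against the previous row's value before svPrevLName is overwritten with the current value.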

2010 IBM Corporation

Information Management

Calculations Involving the Last Row

Transformers do not know when they are reading the last row of data

Other methods must be used

To aggregate calculations in a Transformer, generate a running total


for each group, outputting every row
Follow the Transformer with a Remove Duplicates stage to get the last result
in each group
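A hedged sketch of the running-total idea (svNewGroup is assumed to be derived as in the previous example; the column names are illustrative):

svRunningTotal: IF svNewGroup = 1 THEN lnk_in.Amount ELSE svRunningTotal + lnk_in.Amount

Every output row carries the total so far; a downstream Remove Duplicates stage keyed on the group columns and retaining the last row keeps only the final total per group.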

2010 IBM Corporation

Information Management

Topic Summary
After completing this topic, you should be able to:

Understand the importance of reject link in DataStage

How to generate test data with the Row Generator, and how to peek at
data streams.

Guidelines for the stages of Row Generator, Lookup, Filter, Peek,


Sequential File and Transformer.

43

2010 IBM Corporation

Advanced DataStage Workshop


Topic 03B Header/Detail Processing

Information Management

2010 IBM Corporation

Information Management

Topic Objectives
After completing this topic, you should be able to:

Understand Column Import stage usage

Design Header / Details processing job

Combine Data Using Join Stage

Design Fork Join job

2010 IBM Corporation

Information Management

Reading Data in Bulk

Read data into one column

Specify as Char or Varchar


Max length = record size
Parse using Column Import stage or Transformer

In the Transformer, use the Field function to parse columns (see the example below)


Define columns, data types, and other format options in Column Import
or Transformer stage, not the Sequential stage

Can improve performance

Column parsing can run in parallel


Separates parsing process from sequential read process
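A small illustration of the Field function (the record layout and column names are assumptions for this example): if the single input column rec holds pipe-delimited data such as "1001|B|25.99", the Transformer derivations might be

OrderNum: Field(lnk_in.rec, "|", 1)
RecType:  Field(lnk_in.rec, "|", 2)
Amount:   StringToDecimal(Field(lnk_in.rec, "|", 3))

Field(string, delimiter, occurrence) returns the nth delimited substring; conversion functions such as StringToDecimal then cast the text to the target type.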

2010 IBM Corporation

Information Management

Reading Bulk Data

Read into one


varchar field

Parse record into fields


using Column Import stage
2010 IBM Corporation

Information Management

Header / Detail Processing

Source file contains both


header and detail records
Second column is record
type
A=header
B=detail

First column is order number

Source file

Task: assign header


information to detail records
Resulting detail records
contain information from the
header record

Target file

2010 IBM Corporation

Information Management

Job Design

Variable format
data file. Read in
as single field

Split records into


header and detail
streams. Parse out
individual fields

Combine
records by
order number

2010 IBM Corporation

Information Management

Inside the Transformer


Constrain to
Header records
Use Field function to
parse: Field (string,
delimiter, num)

Convert string
integer to
integer

2010 IBM Corporation

Information Management

Examining the Score

Inserted Hash
partitioner

Inserted Hash
partitioner

2010 IBM Corporation

Information Management

Difficulties with the Design

Going into the Join stage, hash operators on the join key (OrderNum)
are inserted by default
Each group of Header/Detail records will be hashed into the same partition
Each group of records will run sequentially
Essentially, the whole job runs sequentially

Solution:
Select Entire partitioning algorithm for Header into the Join
Select Same partitioning algorithm for Detail into the Join
Not all Detail records are in the same partition, but in every partition they're in,
there's a Header

2010 IBM Corporation

Information Management

Examining the Score

No Hash
partitioner, now
Same

10

No Hash
partitioner, now
Entire

2010 IBM Corporation

Information Management

Generating a Header Trailer Data File

Export multiple
columns to a single
column

11

Combine multiple data streams


into a single stream

2010 IBM Corporation

Information Management

Inside the Column Export Stage

Columns to export

Single output column

12

2010 IBM Corporation

Information Management

Inside the Funnel Stage

All input links
must have
the same
column format

13

Single output
stream

2010 IBM Corporation

Information Management

Fork Join Job Example

Task: assign a group summary


value to each row
For each customer count the
number of customers in the
same zip
Add this count to each
customer record
Customer
records

Add column
with zip
count

14

2010 IBM Corporation

Information Management

Fork Join Job Design

Fork
Join by zip

Group by
zip; count
records in
the group

15

2010 IBM Corporation

Information Management

Examining the Score


Note inserted Hash
partitioners

Note inserted tsort


operators

16

2010 IBM Corporation

Information Management

Job Design

17

2010 IBM Corporation

Information Management

Difficulties with the Design

Here sorts are explicitly


specified

Under the covers, DataStage inserts hash partitioners and sort operators
before the Aggregator and Join stages
This is the default when Auto is chosen

Design is not optimized!


Need to minimize sorts and hashings

18

2010 IBM Corporation

Information Management

Optimized Solution

Move sort upstream

Add SAME
partitioners

Explicitly set a sort by zip before the Copy stage

Explicitly specify SAME partition for the Aggregator and Join stages

Data are repartitioned and sorted only once

19

2010 IBM Corporation

Information Management

Score of Optimized Job


Notice there
are no inserted
tsorts

20

2010 IBM Corporation

Information Management

Topic Summary
After completing this topic, you should be able to:

Understand Column Import stage usage

Design Header / Details processing job

Combine Data Using Join Stage

Design Fork Join job

21

2010 IBM Corporation

Advanced DataStage Workshop


Topic 03C Sort

Information Management

2010 IBM Corporation

Information Management

Topic Objectives
After completing this topic, you should be able to explain:

Sorting data in parallel framework

Inserted sorts

Using Sort stages to determine the last row

2010 IBM Corporation

Information Management

Sorting Data

2010 IBM Corporation

Information Management

Traditional (Sequential) Sort

Traditionally, the process of sorting data uses one primary


key column and (optionally) multiple secondary key columns
to generate a sequential, ordered result set
Order of key columns determines sequence (and groupings)
Each key column specifies an ascending or descending sort group

(Table: source data with columns ID, LName, FName, Address — rows for the Ford and Dodge families — and the sorted result after sorting on LName (asc), FName (desc))

2010 IBM Corporation

Information Management

Parallel Sort

In most cases, there is no need to globally sort data to produce a single


sequence of rows

Instead, sorting is most often needed to establish order within specified


groups of data
Join, Merge, Aggregator, RemoveDups, etc
This sort can be done in parallel!

Hash partitioning can be used to gather related rows


Assigns rows with the same key column values to the same partition

Sorting is used to establish grouping and order within each partition


based on key columns
Key values are adjacent

Hash and Sort keys need not be the same


Often the case before Remove Duplicates stage
Hash ensures that all duplicates are in the same partition
Sort establishes order of precedence, e.g., latest date

2010 IBM Corporation

Information Management

Example Parallel Sort

Using the same source data, Hash partition on LName, FName (4-node config)

Within each partition, sort using LName, FName

(Tables: the source rows are hashed into partitions 0-3 on LName, FName, so that all rows for a given name land in the same partition; each partition is then sorted independently on LName, FName)
2010 IBM Corporation

Information Management

Stages that require Sorted Data

Stages that process data on groups


Aggregator
Remove Duplicates
Compare (perhaps)
If only comparing values, not order between two sources

Transformer, Build stage (perhaps)


Depending on internal logic

Lightweight stages that minimize memory usage by


requiring data in key-column sort order
Join
Merge
Aggregator (using Sort method, rather than Hash method)

2010 IBM Corporation

Information Management

Parallel Sorting Methods

Two methods for parallel (grouped) sorting:


Sort stage
Parallel execution mode

OR

Sort on a stage input link (available when the partitioning method is not Auto)
Links with a sort defined will have a Sort icon

By default, both methods use the same internal sort package


(sort operator)

2010 IBM Corporation

Information Management

Resorting on Sub-Groups

Use Sort Key Mode property to re-use key column groupings from
previous sorts
Uses significantly less memory / disk!
Sort is now on previously-sorted key-column groups, not the entire dataset
Outputs rows after each group

Key column order is important!


Must be consistent across stages to be able to sub-sort on the same keys

Don't forget to retain incoming sort order and partitioning (SAME)

2010 IBM Corporation

Information Management

Don't Sort (Previously Grouped)

What's the difference between Don't Sort (Previously
sorted) and Don't Sort (Previously grouped)?

When rows were previously grouped by a key, all the rows


with the same key value are grouped together.
But the groups of rows are not necessarily in sort order.

When rows are previously sorted by a key, all the rows are
grouped together and, moreover, the groups are in sort order.
In either case the Sort stage can be used to sort by a sub-key
within each group of rows

10

2010 IBM Corporation

Information Management

Partitioning and Sort Order

After data is re-partitioned the sort


order may not be maintained

To restore row order / groupings, a


sort is required

In the example shown, the data
has been hashed by a key not
shown

(Diagram: sorted rows, including IDs 1, 101, 102, 103, passing through a Re-Partitioner into two partitions)

101 has been moved by the hashing to
the left partition
1 has been moved to the right partition
The data in either partition can no
longer be guaranteed to be in sort
order

11

2010 IBM Corporation

Information Management

Global Sorting Methods

DataStage provides two


methods for generating a
globally (totally) sorted
result:
Sort stage
Operating in sequential
execution mode

SortMerge Collector
For sorted input or partition
sorted data

12

2010 IBM Corporation

Information Management

Inserted Sorts

By default, tsort operators


are inserted into the score as
necessary
Before any stage that requires
matched key values (Join,
Merge, RemDups)

Only inserted if the user has


NOT explicitly defined a sort
Check the job score for
inserted tsort operators

13

op1[4p] {(parallel inserted


tsort operator
{key={value=LastName},
key={value=FirstName}}(0))
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}

Score showing
inserted tsort
operator

2010 IBM Corporation

Information Management

Changing Inserted Sorting Behavior

Set $APT_SORT_INSERTION_CHECK_ONLY
or $APT_NO_SORT_INSERTION to change
behavior of automatically inserted sorts
Set $APT_SORT_INSERTION_CHECK_ONLY
The inserted sort operators only VERIFY
that the data is sorted
If data is not sorted properly at runtime,
the job aborts
Recommended only on a per-job basis
during performance tuning

op1[4p] {(parallel inserted


tsort operator
{key={value=LastName,
subArgs={sorted}},
key={value=FirstName},
subArgs={sorted}}(0))
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}

Set $APT_NO_SORT_INSERTION to stop inserted sorts entirely

Score when
$APT_SORT_INSERTION_CHECK_ONLY is
turned on. Note subArgs = {sorted}

14

2010 IBM Corporation

Information Management

Sort Resource Usage

By default, sort uses 20MB per partition as an internal memory buffer


Applies to both user-defined sorts and framework-inserted sorts
A different size can be specified using the Restrict Memory Usage option
Increasing this value can improve performance
Especially if the entire (or group) data partition can fit into memory
Decreasing this value will hurt performance
This option is unavailable for link sorts

To change the amount of memory used by all tsort stages set:


$APT_TSORT_STRESS_BLOCKSIZE = [mb]
This overrides the per-stage memory settings

When the memory buffer is filled, sort uses temporary disk space in the
following order:
Scratch disks in the $APT_CONFIG_FILE sort named disk pool
Scratch disks in the $APT_CONFIG_FILE default disk pool
The default directory specified by $TMPDIR
The UNIX /tmp directory
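For example (the 256 MB figure is arbitrary, for illustration only), a job sorting very large groups might raise the buffer for all tsort operators with:

$APT_TSORT_STRESS_BLOCKSIZE = 256

The value is specified in MB and, as noted above, overrides any per-stage Restrict Memory Usage setting.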

15

2010 IBM Corporation

Information Management

Partition and Sort Keys

Note that partition and sort keys do not always have to be the
same
Partitioning assigns related records
Sorting establishes group order

Example: Remove Duplicates


Partition on SSN, FName, LName
Sort on SSN, FName, LName, Order Date
Remove Duplicates on SSN, FName, LName, Order Date

16

2010 IBM Corporation

Information Management

Optimizing Sort Performance

Minimize number of sorts within a job flow


Each sort interrupts the parallel pipeline
Must read all rows in the partition before generating output

Specify only necessary key columns


Avoid stable sorts unless needed
Use Sort Key Usage key column option to re-use
previous sort keys
Within Sort stage, adjust Restrict Memory Usage

17

2010 IBM Corporation

Information Management

Using Sort Stages to Get the Last Row

Use 3 Sort stages before Transformer to generate a change


key column for the last row in the group
Often, data is already sorted earlier in the same flow
Hash or Sort on key columns before first sort
Use SAME partitioning to ensure that subsequent stages keep the
grouping and sort order

18

2010 IBM Corporation

Information Management

Last Row Sort Details


First Sort
Sorts on key columns

Second Sort

Final Sub-Sort

No sorting
Create key-change column

No sort on key columns


Sub-sorts Ascending on keychange column

Choose Create Cluster Key


Change when group is
already sorted
19

2010 IBM Corporation

Information Management

Getting Last Row Example

Sort by State

20

Add key change


col; 1 will be in
first row of group

Sort by key change


col; 1 will be in last
row of group

2010 IBM Corporation

Information Management

Topic Summary
After completing this topic, you should be able to explain:

Sorting data in parallel framework

Inserted sorts

Using Sort stages to determine the last row

21

2010 IBM Corporation

Advanced DataStage Workshop


Topic 03D Aggregator

Information Management

2010 IBM Corporation

Information Management

Topic Objectives
After completing this topic, you should be able to explain:

Combine data using Aggregator stage

Aggregator Stage Guidelines

2010 IBM Corporation

Information Management

Aggregator Stage

2010 IBM Corporation

Information Management

Aggregator Stage
Purpose:

Perform data aggregations

Specify:

One or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions include, among many others:
Count (nulls/non-nulls)
Sum
Max / Min / Range

The grouping method (hash table or pre-sort) is a performance issue

2010 IBM Corporation

Information Management

Job with Aggregator Stage

Aggregator
stage

2010 IBM Corporation

Information Management

Aggregation Types

Count rows
Count rows in each group
Put result in a specified output column

Calculation
Select column
Put result of calculation in a specified output column
Calculations include:

Sum
Count
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation

2010 IBM Corporation

Information Management

Count Rows Aggregator Properties

Grouping key
columns

Count Rows
aggregation type
7

Output column
for the result
2010 IBM Corporation

Information Management

Calculation Type Aggregator Properties


Grouping key
columns

Calculation
aggregation type

Calculations with
output columns
Available
calculations
8

2010 IBM Corporation

Information Management

Grouping Methods

Hash (default)
Calculations are made for all groups and stored in memory
Hash table structure (hence the name)

Results are written out after all input has been processed
Input does not need to be sorted
Useful when the number of unique groups is small
Running tally for each group's aggregations needs to fit into memory

Sort
Requires the input data to be sorted by grouping keys
Does not perform the sort! Expects the sort

Only a single aggregation group is kept in memory


When a new group is seen, the current group is written out

Can handle unlimited numbers of groups

2010 IBM Corporation

Information Management

Grouping Method - Hash


(Diagram: unsorted input rows with Key values 4, 3, 1, 3, 2, 3, 1, 2 and Col values X, Y, K, C, P, D, A, L; the hash method builds an in-memory entry for each group as rows arrive: 4X; 3Y, 3C, 3D; 1K, 1A; 2P, 2L)

2010 IBM Corporation

Information Management

Grouping Method - Sort

11

(Diagram: input rows pre-sorted by Key — 1, 1, 2, 2, 3, 3, 3, 4 with Col values K, A, P, L, Y, C, D, X; the sort method holds only the current group in memory, writing each group out when the next key value arrives: 1K, 1A; 2P, 2L; 3Y, 3C, 3D; 4X)
2010 IBM Corporation

Information Management

Aggregator Stage Guidelines

12

2010 IBM Corporation

Information Management

Aggregator

Match input partitioning to Aggregate stage groupings

Use Hash method for a limited number of distinct key values (i.e.,
limited number of groups)
Uses 2K of memory per group
Incoming data does not need to be pre-sorted
Results are output after all rows
have been read
Output row order is undefined
Even if input data is sorted

Use Sort method with a large


number of distinct key-column values
Control break processing
Requires input pre-sorted on key columns

13

Results are output after each group

2010 IBM Corporation

Information Management

Using Aggregator to Sum All Input Rows

Generate a constant-value key column for all rows


Column Generator
Use cycle algorithm on one value

Transformer
Hardcode value

Aggregate on the new column

Parallel

Sequential

Run Aggregator in Sequential mode


Set in Stage Advanced Properties
Don't Hash-partition input data
There's no point, since it is running on only one node (partition)

Don't sort
The data is already sorted because there is only one key value!

Use Aggregators to reduce collection time


First aggregator processes rows in parallel
Second aggregator runs sequentially, getting the final global sum

14

2010 IBM Corporation

Information Management

Summing All Rows With Aggregator


Used to
generate a
new column
with a single
value

Aggregates
over the new
generated
column

Aggregator
must run
Sequentially

15

2010 IBM Corporation

Information Management

Topic Summary
After completing this topic, you should be able to:

Combine data using Aggregator stage

Aggregator Stage Guidelines

16

2010 IBM Corporation

Advanced DataStage Workshop


Topic 03E - Slowly Changing Dimensions

Information Management

2010 IBM Corporation

Information Management

Topic Objectives
After completing this topic, you should be able to:

Design a job that creates a surrogate key source key file

Design a job that updates a surrogate key source key file from a dimension
table

Design a job that processes a star schema database with Type 1 and Type
2 slowly changing dimensions

2010 IBM Corporation

Information Management

Surrogate Key Generation Stage

2010 IBM Corporation

Information Management

Surrogate Key Generation Stage


Use to create and update the surrogate key state file
One file per dimension table


Stores the last used surrogate key integer for the dimension table
Binary file

2010 IBM Corporation

Information Management

Example Job to Create Surrogate State Files

Create Surrogate
State File for
Product
dimension table

Create Surrogate
State File for
Store dimension
table

2010 IBM Corporation

Information Management

Editing the Surrogate Key Generator Stage

Path to state file

Create the
state file

2010 IBM Corporation

Information Management

Example Job to Update the Surrogate State File

2010 IBM Corporation

Information Management

Specifying the Update Information


Table column
containing
surrogate key
values

Update the
state file

2010 IBM Corporation

Information Management

Slowly Changing Dimensions Stage

2010 IBM Corporation

Information Management

Slowly Changing Dimensions Stage

Used for processing a star schema

Performs a lookup into a star schema dimension table


Multiple SCD stages can be chained to process multiple dimension
tables

Inserts new rows into the dimension table as required


Updates existing rows in the dimension table as required

Type 1 fields of a matching row are overwritten


Type 2 fields of a matching row are retained as history rows
A new record with the new field value is added to the dimension table and made
the current record

Generally used in conjunction with the Surrogate Key Generator stage
Creates a Surrogate Key state file that retains a list of the previously
used surrogate keys

10

2010 IBM Corporation

Information Management

Star Schema Database Structure and Mappings


Dimension
tables

Source rows

Fact table

11

2010 IBM Corporation

Information Management

Example Slowly Changing Dimensions Job


Check for
matching
Product
rows

Perform
Type 1 and
Type 2
updates to
Product
table

12

Check for
matching
StoreDim
rows

Perform
Type 1 and
Type 2
updates to
StoreDim
table

2010 IBM Corporation

Information Management

Working in the SCD Stage

Five Fast Path pages to edit

Select the Output link


This is the link coming out of the SCD stage that is not used to update the dimension table

Specify the purpose codes


Fields to match by
Business key fields and the source fields to match to it
Surrogate key field
Type 1 fields
Type 2 fields
Current Indicator field for Type 2
Effective Date, Expire Date for Type 2

Surrogate Key management


Location of State file

13

Dimension update specification


Output mappings
2010 IBM Corporation

Information Management

Selecting the Output Link

Select the
output link

14

2010 IBM Corporation

Information Management

Specifying the Purpose Codes

Lookup key
mapping

Type 1 field

Surrogate
key

Type 2 field

Fields used
for Type 2
handling

15

2010 IBM Corporation

Information Management

Surrogate Key Management

Path to state
file

Initial surrogate
key value
Number of values
to retrieve at one
time

16

2010 IBM Corporation

Information Management

Dimension Update Specification


Function used to
retrieve the next
surrogate key value

Value that
means
current

Functions used to calculate


history date range

17

2010 IBM Corporation

Information Management

Output Mappings

18

2010 IBM Corporation

Information Management

Topic Summary
After completing this topic, you should be able to:

Design a job that creates a surrogate key source key file

Design a job that updates a surrogate key source key file from a dimension
table

Design a job that processes a star schema database with Type 1 and Type
2 slowly changing dimensions

19

2010 IBM Corporation

Advanced DataStage Workshop


Module 04 Extending DataStage

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to:

Create new external function

Create a wrapped stage

Create a build stage

2010 IBM Corporation

Information Management

Ways of Adding New Functionality

External function
Define a new parallel routine to use in the Transformer stage
Specify input arguments
C++ function is created, compiled, and linked outside of DataStage

Wrapped stage
Wrap an existing executable into a new custom stage

Build stage
Using the Designer to create a new stage that will be compiled into a
new operator
Define the new stage by specifying
Properties
Input and output interfaces
C++ source code is compiled and linked within DataStage for execution

2010 IBM Corporation

Information Management

Ways of Adding New Functionality

Custom stage
A way to create a new stage that will be compiled into a new operator
New operator is an instantiation of the APT_Operator class

A new stage is defined to invoke this new operator


Stage properties are passed to the new operator instance
C++ function is created, compiled, and linked outside of DataStage

2010 IBM Corporation

Information Management

Parallel Routine

Two types
External function
Can be used in Transformer
Can return a value

External before / after


No return value
Can be executed before / after a job runs or a Transformer
Specified in job or Transformer properties

C++ function compiled outside of DataStage into shared


object library

In DataStage, define routine metadata, input and output


arguments, and static / dynamic linking
2010 IBM Corporation

Information Management

External Function Example

Function returns "Y" if keyword strings are found; otherwise it returns "N"
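
Below is a minimal C++ sketch of such an external function (the function name, argument, and keyword are illustrative, not the actual course example). It would be compiled outside of DataStage into an object file or shared library and then registered as a Parallel Routine:

    // keyword_check.cpp -- illustrative external function for a parallel routine.
    // Compiled and linked outside of DataStage; the resulting object file / library
    // is referenced in the Parallel Routine definition.
    #include <string.h>

    extern "C" char* keywordCheck(char* description)
    {
        // Return "Y" if the keyword appears anywhere in the input string, otherwise "N".
        // Static storage keeps the returned pointer valid after the function returns.
        static char yes[] = "Y";
        static char no[]  = "N";

        if (description != 0 && strstr(description, "URGENT") != 0)
            return yes;
        return no;
    }

In a Transformer, the routine could then be used in a derivation or constraint, for example: If KeywordCheck(inLink.Description) = "Y" Then ... (the routine and link names here are hypothetical).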

2010 IBM Corporation

Information Management

Another External Function Example

Framework classes
include file

Function returns "Y" if keyword strings are found; otherwise it returns "N"

This version uses the APT_String class functions, and orchestrate.h is included
2010 IBM Corporation

Information Management

Creating an External Function


DataStage
function name

External function
name

Static link

Return type

Object file path

To create an External Function Routine

Click right mouse button over Routines branch


Click New Parallel Routine
Select External Function type
2010 IBM Corporation

Information Management

Defining the Input Arguments

Define all the input arguments

Only input arguments are defined here


The required single output argument was defined on the General tab

2010 IBM Corporation

Information Management

Calling the External Function

New function

External Function Routines are listed under DSRoutines


in the DataStage Expression Editor
2010 IBM Corporation

Information Management

Building Wrapped Stages

You can wrap an executable


Binary
UNIX command
Shell script

and turn it into a custom stage capable of parallel


execution as long as the executable is
Amenable to data-partition parallelism: no dependencies between rows
Pipe safe: can read rows sequentially with no random data access

11

2010 IBM Corporation

Information Management

Wrapped Stages

Wrappers are treated as black boxes


DataStage has no knowledge of contents
DataStage has no means of managing anything that occurs inside
the wrapped stage
DataStage only knows how to export data into and import data out of
the wrapped stage
User must know at design time the intended behavior of the wrapped
stage and its schema interface
2010 IBM Corporation

Information Management

Wrapped Stage Example

Create a source stage that produces a listing of files


Wrap the UNIX ls directory command

The new stage will have no input and one output


Output will be a single varchar column containing the list returned from
the ls command

To create a new wrapped stage


Specify name of stage and operator
Specify how to invoke the executable
Specify properties to be passed to the operator including
-name value: pass the property name and the value
Value only: pass just the value

Create and load Table Definitions that define the input and output
interfaces
13

2010 IBM Corporation

Information Management

Creating a Wrapped Stage

New stage type

Default execution
mode
Executable
command

2010 IBM Corporation

Information Management

Defining the Wrapped Stage Interfaces


Optionally
define
properties

Select Table
Definition
defining output

Generate new
stage

2010 IBM Corporation

Information Management

Specifying Wrapped Stage Properties

Property
name
Required or
optional?

How to convert
the property
passed to the
operator

2010 IBM Corporation

Information Management

Job with Wrapped Stage

Wrapped
stage

2010 IBM Corporation

Information Management

Build Stage

Works like existing parallel stages

Extends the functionality of parallel jobs

Can be used in any parallel job


Coded in C++

Predefined macros can be used in the code


Predefined header files allow framework classes and class functions
available

18

2010 IBM Corporation

Information Management

Example Job With Build Stage

Build stage

2010 IBM Corporation

Information Management

Creating a Build Stage


Stage name

Class instantiation
name from
APT_Operator
class

Framework
operator name

Click right mouse button over any repository folder in


Designer

Click New->Other->Parallel Stage Type (Build)

20

2010 IBM Corporation

Information Management

Operator Stage Elements

Properties
Defined properties will show up on stage properties tab

Input / Output
Interfaces
Build stage requires at least one input and one output
Build stage has a static interface and cannot access a schema
dynamically

Reads / Writes
Auto / Non-auto
Combined / Separate transfer
Framework macros can be used to explicitly execute transfer

21

2010 IBM Corporation

Information Management

Operator Stage Elements

Code
Variables definition and initialization
Pre-loop: code executed before rows are read
Per-record: code executed for each row read
Post-loop: code executed after all rows are read

22

2010 IBM Corporation

Information Management

Anatomy of a Build Stage


Diagram: a Build stage is made up of Properties, Definitions, and Build (Pre-Loop, Per-Record, Post-Loop) sections.
The input link (columns a, b, ...) feeds input port in0, whose interface columns are a, b; output port out0, whose
interface columns are a, b, x, y, ..., feeds the output link (columns a, b, x, y, ...). Reads and writes can each be
auto or noauto, and transfers can be auto or noauto and combined or separate.

Interface Input / Output fields are defined by Table Definitions

C++ variables, includes added to Definitions tab

C++ code added to Pre-Loop, Per-Record, Post-Loop of Build tab


2010 IBM Corporation

Information Management

Defining the Input and Output Interfaces

Create Table Definition for each input and output link


Defined properties will show up on stage properties tab

List inputs and outputs on interfaces tab


Provide input port name (default: in0, in1, ...)
Provide output port name (default: out0, out1, ...)
Specify auto-read / auto-write
Enable / Disable RCP (Runtime Column Propagation)
RCP is incompatible with specifying transfers

Input / Output macros can be used to explicitly control reads and writes
readRecord(port_number), writeRecord(port_number)
inputDone(port_number)
Need to execute after read before referencing fields populated by the read
24

2010 IBM Corporation

Information Management

Interface Table Definition

An example Table Definition for the input interface of a build


stage
Provides column names and types
Data types must be C++ otherwise class function and operator
signature will not apply

25

Input link to the build stage must have columns with same
names and compatible data types as defined by the interface
2010 IBM Corporation

Information Management

Specifying the Input Interface


Intra-stage RCP
alternative to
Transfer. Don't use!

Port name.
Alias for in0

auto / noauto

Table definition
defining
interface

Default port names are in0, in1, ... in the order defined

Select Table Definition that defines the input fields

26

2010 IBM Corporation

Information Management

Specifying the Output Interface

Port name.
Alias for out0

auto / noauto

Table
Definition

Default port names are out0, out1, ... in the order defined

Select Table Definition that defines the output fields

27

2010 IBM Corporation

Information Management

Transfer

Used to pass unreferenced input link columns through the


build stage
Input link columns are passed as a block to the output link
Auto transfer occurs at end of each iteration of per-record loop code

Can be auto or noauto

Transfer type can be combined or separate


Combined: second transfer to same output link (port) replaces the first
Assumes same metadata for each transfer

Separate: second transfer to same output link (port) adds columns


Assumes different metadata for each transfer

28

2010 IBM Corporation

Information Management

Transfer

Transfer macros are used in code to explicitly move records


from input buffer to output buffer
doTransfer(transfer_index)
doTransfersFrom(input)
doTransfersTo(output)
transferAndWriteRecord(output)

29

2010 IBM Corporation

Information Management

Defining a Transfer

Input port name

Transfer type:
combined / separate

Output port name

auto / noauto

Definition order determines the transfer index number

Refer to ports by specified or default names

Specify whether transfer is to be done automatically at end of


each loop

Specify type of transfer, combined or separate

30

2010 IBM Corporation

Information Management

Anatomy of a Transfer
Diagram: the input link carries columns OrderNum, ItemNum, Qty, Price, TaxRate into the input buffer. The input
interface enumerates Qty, Price, TaxRate plus inRec.* (the transfer block); the output interface is Amount plus
outRec.*. The Per-Record code computes Amount = Qty*Price, and the output buffer holds OrderNum, ItemNum, Qty,
Price, TaxRate, Amount.

ReadRecs (auto or explicit) brings enumerated input interface values into the input buffer. If a Transfer is
specified, then the whole input record also comes in as a block of values.

Assignments in Per-Record move values to enumerated output fields

Transfers copy input fields as a block (inRec.*) to the output buffer

Duplicate columns coming from a Transfer are dropped with warnings in log

e.g., if the input record contained a column named Amount, this would be dropped. Explicit assignments in the code to output
interface columns take precedence over Transferred column values.

If RCP is enabled instead of a Transfer, the picture is the same. If neither Transfer nor RCP is specified, then
inRec.* and outRec.* won't exist.
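
As a minimal sketch of the Per-Record code implied by this diagram (assuming Auto Read, Auto Write, and an auto Transfer are all enabled, and using the column and port names shown above):

    // Per-Record code on the Build tab. With auto read/write and an auto transfer,
    // the framework reads the next input record before this code runs, copies inRec.*
    // to the output buffer at the end of the iteration, and then writes the output record.
    Amount = in0.Qty * in0.Price;   // qualified input columns, unqualified output column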

2010 IBM Corporation

Information Management

Defining Stage Properties

Property specifications
Name
Data type
Prompt
Default value
Required or optional
Conversion: how property gets processed
Choose Name Value

32

Extended properties can also be specified

2010 IBM Corporation

Information Management

Specifying Properties

Property type

Default value

Choose
-Name value

If data type is List, open the Extended Properties window


to define the members

2010 IBM Corporation

Information Management

Defining the Build Stage Logic

Definitions
Variables
Include files

Pre-loop
Code executed once, prior to entering the per-record loop

Per-record
Code executed once per input record

Post-loop
Code executed once, after exiting the per-record loop

34

2010 IBM Corporation

Information Management

Definitions Tab

Define variables

Include header files

2010 IBM Corporation

Information Management

Pre-Loop Tab

Initialize
variable

Code to be executed before input records are processed

This code is executed only once


2010 IBM Corporation

Information Management

Per-Record Tab

Build macro

Qualified input
column

Unqualified
output column

Code to be executed for each input record read in

This code is executed once for each input record


2010 IBM Corporation

Information Management

Post-Loop Tab
Framework types

Property

Framework
functions

Code to be executed after all input records are processed

This code is executed only once


2010 IBM Corporation

Information Management

Writing to the Job Log

Use ostream standard output objects: cout, clog, cerr


Standard output objects redirected to DataStage log
clog & cerr generate warning messages
Example: cout << "Message to log" << endl;

Use errorLog object


Defined in errorlog.h
Examples:
*errorLog() << "Message to log" << endl;
errorLog().logInfo(index): logs messages
errorLog().logWarning(index): generates warnings
errorLog().logError(index): generates errors
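
A minimal Per-Record sketch combining both styles (the column name, message text, and message index are illustrative):

    // Informational line redirected to the job log
    cout << "Processing input row" << endl;

    // Warning for suspect rows, using the errorLog object
    if (in0.Qty < 0)
    {
        *errorLog() << "Negative quantity encountered; row passed through" << endl;
        errorLog().logWarning(1);   // flush the accumulated text as a warning message
    }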

39

2010 IBM Corporation

Information Management

Using a Build Stage

Build stage

2010 IBM Corporation

Information Management

Stage Properties

Category and
property

List of property
values

2010 IBM Corporation

Information Management

Build Stage with Multiple Ports

Input / Output ports are indexed: 0, 1, 2, ...


Order defined by sequence of definitions, top to bottom
writeRecord(1) writes to second output port / link
readRecord(0) reads from first input port / link
Use link ordering on stage properties tab to specify order of links
(ports)
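
A minimal Per-Record sketch that routes rows between two output links, assuming Auto Write is disabled on both output ports (the column name and threshold are illustrative):

    // Port numbers follow link ordering: 0 = first output link, 1 = second output link
    if (in0.Amount >= 1000)
        writeRecord(0);   // high-value rows to the first output link
    else
        writeRecord(1);   // all other rows to the second output link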

42

2010 IBM Corporation

Information Management

Build Macros

Informational
inputs() number of inputs
outputs() number of outputs
transfers() number of transfers

Flow control
endLoop() exit per-record loop and go to post-loop code
nextLoop() read next input record
failStep() abort the job

43

2010 IBM Corporation

Information Management

Build Macros

Input / Output
Ports are indexed: 0, 1, 2, ...
readRecord(index), writeRecord(index), inputDone(index)
holdRecord(index) suspend next auto read
discardRecord(index) suspend next auto write
discardTransfer(index) suspend next auto transfer

Transfers
doTransfersFrom(index) do all transfers from specified input
doTransfersTo(index) do all transfers to specified output
transferAndWriteRecord(index) do all transfers to specified output
then write a record

44

2010 IBM Corporation

Information Management

Turning Off Auto Read, Write, and Transfer


Turn off Auto Read

Turn off Auto Write

Turn off Auto Transfer

2010 IBM Corporation

Information Management

Reading Records Using Macros

readRecord(0) reads a record from first input link

Use readRecord(0) in pre-loop logic to bring in first record

After all records have been read, readRecord(0) will not bring
in usable input data for processing
Use inputDone() to test for any additional record for processing
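
A minimal sketch of this pattern, assuming Auto Read is disabled for input port 0 (the actual row processing is elided):

    // Pre-Loop: prime the loop by reading the first record
    readRecord(0);

    // Per-Record: process the current record, then fetch the next one
    if (inputDone(0))
        endLoop();        // no usable input left; jump to the Post-Loop code
    else
    {
        // ... reference the input interface columns populated by the last readRecord(0) ...
        readRecord(0);    // bring in the next record for the next iteration
    }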

46

2010 IBM Corporation

Information Management

APT Framework and Utility Classes

Utility classes, functions, and macros


Used in the buildOp (build stage) code
Automatically included in the build stage
APT_ is the prefix that distinguishes utility objects from standard C++
objects
string.h defines string handling functions and operators
errorlog.h defines functions for writing to DataStage log

47

2010 IBM Corporation

Information Management

APT_String BuildOp Example

s1 and s2 are
declared as
APT_String
variables

APT_String
assignment

operator

APT_String
concatenation
operator

APT_String
function
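
A minimal sketch of the kind of code this example illustrates (column names are illustrative); s1 and s2 are declared on the Definitions tab and used in the Per-Record code:

    // Definitions tab:
    //   APT_String s1, s2;

    // Per-Record tab:
    s1 = in0.FirstName;        // APT_String assignment operator
    s2 = s1 + in0.LastName;    // APT_String concatenation operator
    FullName = s2;             // move the result to an output interface column
    NameLength = s2.length();  // APT_String member function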

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to:

Create new external function

Create a wrapped stage

Create a build stage

49

2010 IBM Corporation

Advanced DataStage Workshop


Module 05 Performance Tuning

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to:

Explain performance tuning methodology

Selectively disable operator combination

Understand configuration file guidelines

Understand the impact of partitioning

Understand the impact of sorting

Understand the impact of transformer

Use the performance analyzer

2010 IBM Corporation

Information Management

Optimizing Performance

The ability to process large volumes of data in a short period of time


requires optimizing all aspects of the job flow and environment for
maximum throughput and performance
Job design
Stage properties
DataStage parameters
Configuration file
Disk subsystems: RAID / SAN
Source and target databases
Network
etc....

2010 IBM Corporation

Information Management

Parallel Job Performance

Within DataStage examine (in order):


End-to-end process flow
Intermediate results, sources / targets, disk usage

Configuration file(s) for each job


Degree of parallelism
Impact to overall system resources
File systems mapping, scratch disk

Individual job design including shared containers


Stages chosen, overall design approach
Partitioning strategy
Operator combination
Buffering (as a last resort)

2010 IBM Corporation

Information Management

Parallel Job Performance

Ultimate job performance may be constrained by external sources /


targets
Disk subsystems, network, database, etc.
May be appropriate to scale back on degree of parallelism to conserve
resources

2010 IBM Corporation

Information Management

Performance Tuning Methodology

An iterative process

Test in isolation (nothing else should be running)


DataStage server
Source and target databases

Change ONE item at a time, then examine impact


Use job score to determine
Number of processes generated
Operator combination
Framework inserted sorting and partitioning

Use DataStage performance monitor to verify


Data distributions (partitioning)
Throughput and bottleneck

Use performance analyzer to check other metrics

2010 IBM Corporation

Information Management

Using DataStage Director Job Monitor

Enable Show Instances to show data distribution (skew) across


partitions
Best performance with even distribution

Be cautious with Rows/sec numbers calculated by Director


(elapsed time of entire job, not per stage)

2010 IBM Corporation

Information Management

Operator Combination

At run time, DataStage parallel framework will attempt to combine stages


(operators) into a single process
Operator combination is intended to improve overall performance and
lower resource usage
Combination only occurs between stages (operators) that:
Use the same partitioning method
Repartitioning prevents operator combination between the producer and
consumer stages
Implicit repartitioning (sequential operators) prevents combination

Are combinable
Set automatically within the stage / operator definition
Can also be set within stages advanced properties

2010 IBM Corporation

Information Management

Tuning Operator Combination

There may be instances where operator combination hurts performance


One process cannot use more than 100% of CPU
It is also a good practice to separate I/O from CPU tasks

Use performance monitor to identify CPU bottlenecks


Selectively disable combination through Designer stage properties

In some situations, disable all combinations by setting


$APT_DISABLE_COMBINATION = TRUE

2010 IBM Corporation

Information Management

Configuration File Guidelines

Minimize I/O overlaps across nodes


If multiple file systems are shared across nodes, alter order of file systems
within each node definition
Pay particular attention to map file systems to physical controllers / drives with
RAID / SAN
Use local disks for scratch storage if possible

Named pools can be used to further separate I/O


"buffer" file system is only used for buffer overflow
"sort" file system is only used for sorting

With cluster / grid / MPP environment, named pools can be used to


further control resources
Minimize data shipping, direct database connection, etc.
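
A minimal configuration file sketch illustrating these guidelines (the host name and paths are illustrative). Each node maps to its own file systems, and separate scratch pools named "buffer" and "sort" keep buffer overflow and sort I/O off the default scratch space:

    {
      node "node1" {
        fastname "etl_host"
        pools ""
        resource disk        "/data/ds/node1"     { pools "" }
        resource scratchdisk "/scratch/ds/node1"  { pools "" }
        resource scratchdisk "/scratch/buf/node1" { pools "buffer" }
        resource scratchdisk "/scratch/srt/node1" { pools "sort" }
      }
      node "node2" {
        fastname "etl_host"
        pools ""
        resource disk        "/data/ds/node2"     { pools "" }
        resource scratchdisk "/scratch/ds/node2"  { pools "" }
        resource scratchdisk "/scratch/buf/node2" { pools "buffer" }
        resource scratchdisk "/scratch/srt/node2" { pools "sort" }
      }
    }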

10

2010 IBM Corporation

Information Management

Use Parallel Data Set

Use parallel Data Set to land intermediate results between parallel jobs
No conversion overhead, stored in native internal format
Retains data partitioning and sort order
Maximum performance through parallel I/O

11

2010 IBM Corporation

Information Management

Impact of Partitioning

Ensure data is as close to evenly distributed as possible


When business rules dictate otherwise, re-partition to a more balanced
distribution as soon as possible to improve performance of downstream stages

Minimize re-partitions by optimizing the flow to reuse upstream


partitioning
Especially in Grid / MPP / Cluster environments

Know your data


Choose hash key columns that generate sufficient unique key combinations
while meeting business requirements

Use SAME partitioning carefully


Maintains the same degree of parallelism

12

2010 IBM Corporation

Information Management

Impact of Sorting

Use parallel sort if possible (sort by key-column groups)


Where global sort is required, using parallel sort and sort merge collector is
generally much faster than sequential sort

Complete sort is expensive


Rows cannot be output to next stage until all rows are read and sorted.
Pipeline is interrupted.
Must use scratch disk for intermediate storage

13

Use the Restrict Memory Usage option to increase the amount of


memory available for sorting per partition

2010 IBM Corporation

Information Management

More Impact of Sorting

Minimize and combine sorts where possible


Use the Don't Sort, Previously Sorted option to leverage previous sort
groupings
Uses much less memory
Output rows after each key-column group

Parallel Data Set maintains partitions and sort order across jobs

Stable sort is slower


Use only when needed to satisfy business requirement

14

2010 IBM Corporation

Information Management

Impact of Transformer

Minimize the number of Transformers


If possible, combine derivations from multiple Transformers

Use stage variables to perform calculations used by constraints and


multiple derivations
Never use the BASIC Transformer
Doesn't show up in the standard palette by default
Intended to provide a migration path for existing DataStage Server applications
that use DataStage BASIC routines
Runs sequentially
Invokes the DataStage server engine
Extremely expensive (slow)!

15

2010 IBM Corporation

Information Management

Impact of Transformer vs. Other Stages

For optimum performance, consider more appropriate stages instead


of a Transformer in parallel job flows:
Use non-Transformer stage (e.g., Copy stage) to:

Rename Columns
Drop Columns
Perform default type conversions
Split output

Transformer constraints are FASTER than the Filter or Switch stages


Filter and Switch expressions are interpreted at runtime
Transformer constraints are compiled

2010 IBM Corporation

Information Management

Impact of Buffering

Consider maximum row width


For very wide rows, it may be necessary to increase buffer size to hold more
rows in memory
Default is 3 MB per partition
Set in stage properties
For entire job, set with $APT_BUFFER_MAXIMUM_MEMORY

17

Tune all other factors before tuning buffer settings


Disabling buffering may cause deadlock
The best solution might be not to use a fork-join design pattern

2010 IBM Corporation

Information Management

Isolating Buffers

Buffer operators may make it difficult to identify performance bottlenecks

The following environment variables effectively isolate each stage (by


inserting buffers), and prevent the buffers from slowing down upstream
stages
$APT_BUFFERING_POLICY=FORCE
Insert buffer between each operator (isolates)

$APT_BUFFER_FREE_RUN=1000
Writes excess buffers to disk instead of slowing down producer
Buffer will not slow down producer until it has written
1000 * $APT_BUFFER_MAXIMUM_MEMORY to disk

These settings will generate a significant amount of disk I/O; therefore DO NOT
use these settings for production jobs

18

2010 IBM Corporation

Information Management

Other Performance Tips

Remove unnecessary columns as early as possible within the flow


Minimizes memory usage, optimizes buffering
Use SELECT with explicit column names, not SELECT *, when reading from a
database
Disable RCP if not required

Sequential File stage file pattern reads start with a single CAT process
Setting $APT_IMPORT_PATTERN_USES_FILESET allows parallel I/O
Dynamically builds a File Set header file for the list of files matching the pattern

19

2010 IBM Corporation

Information Management

Performance Analysis

20

2010 IBM Corporation

Information Management

Performance Analysis In the Past

Use the Director monitor to watch the throughput (rows/sec) during a job
run
Compare job run durations
Turn on APT_PM_PLAYER_TIMING and APT_PM_PLAYER_MEMORY
to report player calls and memory allocation

How This Fails You...

Long running jobs couldn't be watched for record throughput changes


throughout the job run

The job monitor didn't allow recording for playback

Job monitor throughput rates included time waiting for data

Couldn't determine what was happening on the machines

21

2010 IBM Corporation

Information Management

Performance Analyzer

Visualization tool that provides deeper insight into job runtime behavior

Offers several categories of visualizations:


Record throughput (rows/sec)
CPU utilization
Job timing & memory utilization
Physical machine utilization

Performance data to be visualized can be:


Filtered in selected ways, including
Hide startup processes
Hide license operators
Hide inserted operators

Isolated to selected stages (operators), partitions, and phases

22

Charts can be saved and printed

2010 IBM Corporation

Information Management

Enabling Performance Data Recording

Open the job in Designer

Select Record job


performance data in Job
Properties

Run your job. Performance


collection has little impact
on overall job performance

To view the results, click


the Performance Analysis
icon in Designer

23

2010 IBM Corporation

Information Management

Example Job

24

2010 IBM Corporation

Information Management

Job Timeline Chart


Job timeline
chart
Job timing
Throughput
CPU
utilization
Memory
utilization

Machine
utilization

25

Stages in
job

Lengths of
time

2010 IBM Corporation

Information Management

Expanding the Job Timeline Chart


View by
partition
Click to
expand

Process
phases

26

2010 IBM Corporation

Information Management

Another Job Timeline Chart

27

2010 IBM Corporation

Information Management

Record Throughput Chart

Blue is
reading
350,000
records per
second

Rows per
second

Run mouse
over line to
identify the
stage port
represented
Timeline

28

2010 IBM Corporation

Information Management

Displaying Selected Stages


Select stages
in a partition to
display
Select partitions
to display

Select the
stages to
display

29

2010 IBM Corporation

Information Management

CPU Utilization Totals Chart

Blue shows
CPU time

Amount of
CPU time
Red shows
system time

30

2010 IBM Corporation

Information Management

Machine Utilization - CPU

31

2010 IBM Corporation

Information Management

Filters

32

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to:

Explain performance tuning methodology

Selectively disable operator combination

Understand configuration file guidelines

Understand the impact of partitioning

Understand the impact of sorting

Understand the impact of transformer

Use the performance analyzer

33

2010 IBM Corporation

Advanced DataStage Workshop


Module 06: Repository Functions

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to:

Perform a simple Find

Perform an Advanced Find

Perform an impact analysis

Compare the differences between two Table Definitions

Compare the differences between two jobs

2010 IBM Corporation

Information Management

Searching the Repository

2010 IBM Corporation

Information Management

Quick Find
Name with wild
card character (*)

Include matches
in object
descriptions

Execute
Find

2010 IBM Corporation

Information Management

Found Results
Highlight
next item

Number found;
Click to open
Advanced Find
window

Found
item

2010 IBM Corporation

Information Management

Advanced Find Window

Filter
search

2010 IBM Corporation

Information Management

Advanced Find Filtering Options

Type: type of object


Job, Table Definition, etc

Creation: range of dates


e.g. Up to a week ago

Last modification: range of dates


e.g. Up to a week ago

Where used: objects that use specified objects


e.g. a job that uses a specified Table Definition

Dependencies of: objects that are dependencies of objects


e.g. a Table Definition that is referenced in a specified job

Options
Case sensitivity
Search within last result set

2010 IBM Corporation

Information Management

Using the Found Results

Create
impact
analysis

Export to a
file

Draw
Comparison

2010 IBM Corporation

Information Management

Impact Analysis

2010 IBM Corporation

Information Management

Performing an Impact Analysis

Find where Table Definitions are used


Right-click over a stage or Table Definition
Select Find where Table Definitions Used or
Select Find where Table Definitions Used (deep)
Deep includes additional object types

Displays a list of the objects using the Table Definition

Find object dependencies


Select Find dependencies or
Select Find dependencies (deep)
Displays list of objects dependent on the one selected

Graphical functionality
Display the dependency path
Collapse selected objects
Move the graphical object
Bird's-eye view

10

2010 IBM Corporation

Information Management

Initiating an Impact Analysis from a Stage

Select Table
Definition

11

2010 IBM Corporation

Information Management

Displaying the Dependencies Graphically

List of
dependent
objects

Table
Definition

Bird's-Eye
view

12

2010 IBM Corporation

Information Management

Displaying the Dependency Path

Table
Definition

Job containing
(dependent on)
Table Definition
13

2010 IBM Corporation

Information Management

Generating an HTML Report

14

2010 IBM Corporation

Information Management

Viewing Column-Level Data Flow at Design Time

Impact Analysis is also available at the column level


View where selected columns on a selected link flow to
View where selected columns on a selected link originate

Open a job:
Select a stage
Right-click Show where data flows to / originates
Select a link flowing in or out of the stage
Select one or more columns on the link
You can also right-click outside of any stage and select Configure data flow
view

The flow is graphically displayed on the diagram through high-lighting


You can also trace column data flow from Repository Table Definitions
Select a Table Definition in the Repository
Right-click Find where column used
Select columns to trace

15

2010 IBM Corporation

Information Management

Finding Where a Column Originates From

Select, then
click Find
where data
originates from
Select
columns

16

2010 IBM Corporation

Information Management

Displayed Results

17

2010 IBM Corporation

Information Management

Job and Table Difference Reports

18

2010 IBM Corporation

Information Management

Finding the Difference Between Two Jobs

Example: Job1 is saved as Job2. Changes are made to


Job2. What changes have been made?
Here Job1 may be a production job. Job2 is a copy of the production
job after enhancements or other changes have been made to it

19

2010 IBM Corporation

Information Management

Initiating the Comparison

Job with
the
changes

20

2010 IBM Corporation

Information Management

Comparison Results

Click stage and link


references to
highlight in open jobs

Click underlined
item to open stage
editor

21

2010 IBM Corporation

Information Management

Saving to an HTML File

Click when
Comparison window
is active
22

2010 IBM Corporation

Information Management

Comparing Table Definitions

23

Same procedure as when comparing jobs

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to:

Perform a simple Find

Perform an Advanced Find

Perform an impact analysis

Compare the differences between two Table Definitions

Compare the differences between two jobs

24

2010 IBM Corporation

Advanced DataStage Workshop


Module 07 DataStage Integration

Information Management

2010 IBM Corporation

Information Management

Module Objectives
After completing this module, you should be able to explain:

How DataStage is integrated with other products in the Information Server


and Foundation Tools

How DataStage integrates with other applications through Web Services


with Information Service Director

How DataStage integrates with WebSphere MQ

How DataStage integrates with MDM Server with Rapid Development


Package

How DataStage integrates with Change Data Capture

2010 IBM Corporation

Information Management

Information Server and Foundation Tools

2010 IBM Corporation

Information Management

Trusted Information Management


Create, manage, govern and deliver trusted information
Get a single view of your business
IBM InfoSphere MDM Server

Get your arms around


your data

Deliver better business


intelligence faster

IBM InfoSphere
Foundation Tools

IBM InfoSphere
Warehouse

Consolidate your application infrastructure


IBM InfoSphere Information Server
4

2010 IBM Corporation

Information Management

InfoSphere Foundation Tools


Software to convert information into a trusted strategic
asset
Only IBM has invested
to provide the breadth
of capabilities to
define and govern
your information

Business Vocabulary
Data Relationships
Data Quality Compliance
Data Models and
Mapping
Business Specification
Rules
Provenance of
information

Discover and understand the data across heterogeneous


systems

Design trusted information structures for business optimization

Govern that information over time


5

2010 IBM Corporation

Information Management

InfoSphere Foundation Tools


Govern your information assets

Business
Glossary
Term Pack

Manage Business
Terms

Design Enterprise
Models

Monitor
Data Flows

Business Glossary

Data Architect

Metadata Workbench

Target Model

Shared Metadata

Leverage Industry
Practices
IBM Industry
Models
Understand Data
Relationships
Discovery

Capture Design
Specifications
FastTrack

Assess Monitor,
Manage Data
Quality
Information Analyzer

2010 IBM Corporation

Information Management

IBM InfoSphere Information Server


Diagram: IBM InfoSphere Information Server - Unified Deployment
Understand: discover, model, and govern information structure and content
Cleanse: standardize, merge, and correct information
Transform: combine and restructure information for new uses
Deliver: replicate, virtualize, and move information for in-line delivery
Platform: parallel processing, connectivity, metadata, administration, and deployment services
Unified Metadata Management; Parallel Processing; Rich Connectivity to Applications, Data, and Content

2010 IBM Corporation

Information Management

IBM InfoSphere Information Server


Core Components
Diagram: IBM InfoSphere Information Server core components
Understand (discover, model, and govern information structure and content): InfoSphere Business Glossary + InfoSphere Business Glossary Anywhere, InfoSphere Information Analyzer, InfoSphere Discovery, InfoSphere FastTrack
Cleanse (standardize, merge, and correct information): InfoSphere QualityStage
Transform (combine and restructure information for new uses): InfoSphere DataStage
Deliver (capture, virtualize, and move information for in-line delivery): InfoSphere Information Services Director
Platform Services: Parallel Processing, Connectivity, Metadata (InfoSphere Metadata Workbench), Administration, and Deployment (InfoSphere Information Server Manager) services

2010 IBM Corporation

Information Management

Information Service Director

2010 IBM Corporation

Information Management

What is Information Service Director?

Packages information integration logic as


services that insulate developers from
underlying sources
Deploy services for DataStage,
QualityStage, Federation Server, MDM
Server, DB2 and Oracle*
Services are created in minutes without
coding

Controls the invocation of services via a


variety of protocols

Developers

Architects

InfoSphere Information Services Director


Flexibly deploy and manage reusable
information services without hand
coding

EJB, web services, JMS*, RSS*, REST*

Provides workload balancing and assurance of service

Provides foundation infrastructure for


Information Services
Rapid SOA Deployment

2010 IBM Corporation

Information Management

ISD Service Life Cycle


Design
(Info Provider)

Deploy

Invoke

(Information Server Console)

ISD Server

- or -

DataStage Servers

SELECT CUST_ID, NAME,


PHONE, ADDRESS
FROM CUSTOMER
WHERE CUST_ID = ?

2010 IBM Corporation

Information Management

InfoSphere Information Services Director v8.1


ESB

Process server

EJB

Web Services

JMS

REST
XML/JSON

Portal

RSS

Bindings

InfoSphere Information Services Director


Common Logging

Integrated Metadata
Management

Common Reporting

Service Registry &


Repository

Common Administration
Service Security
Design

Service Design and Publishing

Operational

Service Deployment
Load Balancing and Fault Tolerance

DataStage
QualityStage

DB2
Federation

Classic
Federation

Oracle

WCC (Partial)

Information
Providers

IBM Information Server


2010 IBM Corporation

Information Management

InfoSphere ISD to deliver Information as a Service


Cleansing Services - InfoSphere QualityStage
Kate A. Roberts
416 Columbos Street #2
Boston, Mass

Kate A. Roberts
416 Columbos St #2
Boston, MA

Customer Data

Standardize

Kate A. Roberts
416 Columbus Ave #2
Boston, MA 02116

Catherine A. Roberts
416 Columbus Ave. Apt. 2
Boston, MA 02116

Correct

Match

Postal
Records

Customers

Customer Data

Transformation Services - InfoSphere DataStage

Customer Name

Merge Account Details

Legacy
System

Calculate Aggregates

Transform to New Context

Account Summary

Data Warehouse
2010 IBM Corporation

Information Management

WebSphere MQ

2010 IBM Corporation

Information Management

MQ Architecture
Local Host
MQ Client
Library

MQ
Connector

MQ
Client
Communication

Remote Host

Client mode
connection

MQ connector links with


MQ Library dynamically
(during job execution)

Remote
Queue
Manager

MQ Server
Library

Local
Queue
Manager

MQ
Intercommunication

Server mode
connection
2010 IBM Corporation

Information Management

MQ Messages

Message types

Request: Reply is to be sent to the Reply-to-Queue


Reply: Sent in response to a request message
Report: A message about another message
Datagram: No response message is required

Logical messages
Composed of one or more physical messages on the queue
Each physical message is called a segment

Each segment has an offset


The last segment contains a flag

Message groups
Composed of one or more logical messages
Each logical message has a sequence number
The last message contains a flag

2010 IBM Corporation

Information Management

Message Structure

Two or three parts


Header: Information about the content and structure of the data
Optional format header: Information about the message format
Payload: Message data

Message schema
Defines the type and structure of the data

2010 IBM Corporation

Information Management

MQ Stages

MQ Stages
MQ Connector stage
MQ stage

Read messages from and write messages to a WebSphere MQ enterprise


messaging system

A queue manager manages one or more message queues


The MQ stages establish a connection to a queue manager in order to read
messages from or write messages to a queue
Connecting to the queue manager
Server mode: The queue manager resides on the same machine as the MQ
Connector stage
Client mode: The queue manager resides on a remote machine
Specify channel name, transport type (e.g., TCP), and remote connection name or IP
address
Not supported by the MQ stage
Supported by the MQ Connector stage

Queues
Store messages
Must be opened before messages can be written or read
2010 IBM Corporation

Information Management

Message Queue
Queue manager

Queue

Queue messages

2010 IBM Corporation

Information Management

MQ Stage Readers
MQ
Connector

MQ Stage

2010 IBM Corporation

Information Management

MQ Connector Stage

2010 IBM Corporation

Information Management

Connecting to the Queue Manager


Queue manager
Test connection

Client mode
connection

Save metadata
into data
connection

Client connection
properties
2010 IBM Corporation

Information Management

Reading Messages
Queue to read
from

Number to read

Delete record
after read

Records to read
before commit

2010 IBM Corporation

Information Management

Filtering Read Messages


Filter messages

Filter by payload
size (in bytes)

2010 IBM Corporation

Information Management

Selecting Message Information

Column for
payload

Select data element


describing header
information to retrieve

2010 IBM Corporation

Information Management

Writing Messages to a Queue

MQ
Connector

2010 IBM Corporation

Information Management

MQ Write Properties

Connection
Information

Queue to
write to

2010 IBM Corporation

Information Management

Rapid Deployment Package for Master Data


Management Server

2010 IBM Corporation

Information Management

What is MDM Server?

Single source of Truth for master data


for all applications
Maintains quality and integrity of your
master data

Works with ALL types of master data


Single View
Customer, product, item, supplier,
vendor, account, employee, location,
etc..

Provides greater insight to the business


by understanding complex relationships
between master data

Easily aligns and integrates into your


business processes

Supports tactical projects to enterprise-wide initiatives


2010 IBM Corporation

Information Management

InfoSphere MDM Server -- Single source of truth for


master data for all applications

Business Services

Enables business process to easily leverage


master data
Speed time-to-value, reduce subsequent phase
investment

InfoSphere MDM Server


Data Stewardship

Functionality

Stewardship: Data Quality, Stewardship, &


User Interfaces
Events: Event Management & Business Rules
Security & Entitlement: ROV

Business Services

Pre-built

Customizable

MDM Domains

Multi-domain

Extensible data model supporting domains


including Party, Product, Account & Location
Relationships between domains

Party

Account

Product Customer

Integration
Content

Data

Analytics

MDM Workbench

Tooling for easy extensions to data and UI


generation

Robust Data Integration

Pre-built Data Integration & Quality


2010 IBM Corporation

Information Management

MDM Rapid Deployment - Services Offering


Rapid Deployment
Data Preparation
Workshop
Architectural Design &
Requirement Definition
Workshop
Data Profiling & Analysis
of 2 Source Systems
Mapping Workshop

Rapid Deployment Package Offering


Implementation
Product Installation
Development Environment
Load Accelerators,
Validation Rules,
Suspect Duplicate
Processing,
Standardization,
Construct Configuration,
MDM Tech Specs,

Post
Implementation
Support for:
Test Plan
Integration Testing
Functional Testing
User Acceptance
Data Stewardship

Clients looking to become self sufficient benefit significantly from MDM RD


Data Preparation Workshops
31

31

2010 IBM Corporation

Information Management

RDP: A Well Defined Implementation Process

MDM Server Foundation Person and/or Organization


DataStage 8.0, Fixpak #2 and Quality Stage 8.0
Information Server Service Director (MDM-QS interface)
If workshops are to be provided, then Information Analyzer and FastTrack also required
Source
Systems

Source
#1

Profiling &
Analysis

Source to
Target
Mapping

Source
#2

SIF

ETL / DQ
Logic

Services

Information
Analyzer

Fast Track
DataStage

Mentoring Workshops

DataStage
QualityStage

MDM
Server

Rapid Deployment Package


2010 IBM Corporation

Information Management

Full Suite Saves Implementation Time & Costs

Key tasks in implementing RDP:


Data analysis to ensure that attributes contain what they should (part of workshops)
Mapping to the SIF format (part of workshops)
Extending the model for up to 10 additional attributes and attribute lengths
Tuning standardization and matching rules
MDM SERVER

Source Systems

Source
#2

Information Server

Source
#N

DataStage

QS

MDM Business Services


User Interface
&
Reporting
Duplicate Suspect Processing

Information Server
Information Analyzer
Fast Track
DataStage

SIF
Load Process
DS jobs

MDM Database

History

2010 IBM Corporation

Information Management

MDM Rapid Deployment Benefits


Pre-built quality and integration logic
Increase de-duplication using complex matching algorithms
Facilitates rapid integration of source systems
Data analysis and mapping
Generation of integration logic - 70% of code generated
Business & IT Alignment
Shared business vocabulary, business-level specification
Scalable to any data volume
Capable of handling incremental loads based on batch windows,
changed data events, messages, SOA, or other mechanisms

34

2010 IBM Corporation

Information Management

Change Data Capture

2010 IBM Corporation

Information Management

What is Change Data Capture?


Change Data Capture is a software solution that:

Connects two or more databases together

Works on a variety of systems (Windows, Linux, etc.)

Works with a variety of databases (Oracle, DB2, etc.)

Captures, transforms and flows the data in real-time

36
36

Capture: Grab/copy changes to data as the change occurs

Transform: Modify the data using filters, calculations or


functions

Flow: Send the data to another database

Near Real-time: With minimal delay, changes are sent immediately

Can be used in any industry

2010 IBM Corporation

Information Management

Capturing Changed Data Conventional Methods

Table differencing

Heavy resource intensive SQLs

Expensive and complex SQL coding

Multiple changes on a single transaction cannot be captured

Change value based on timestamp

Potentially expensive queries against source table

Source system applications and schemas have to be designed giving consideration


to this approach

Multiple changes on a single transaction cannot be captured

Custom built triggers

Expensive custom development work

Potentially cause performance issues to source system

All of these methods are inefficient, do not guarantee data integrity, and potentially impact
source system performance
37

2010 IBM Corporation

Information Management

InfoSphere CDC Workflow


Oracle

DB2

SQL Server
:

InfoSphere CDC
Binaries

InfoSphere CDC
Binaries

InfoSphere CDC Binaries


includes Source and Target
replication engines and
configuration agent.

InfoSphere CDC
Binaries

GUI Connection
GUI Connection
GUI Connection

Subscriptions
( Replication Threads)

Subscriptions
( Replication Threads)

Windows

Subscriptions
( Replication Threads)
Management
Console

Management Console: configuration interface.
Access Manager: controls access to the product. Can be run on a
separate server to allow for centralized access.

Access
Manager

GUI Connection
GUI Connection
GUI Connection
InfoSphere CDC
Binaries

InfoSphere CDC
Binaries

InfoSphere CDC
Binaries

Interface connections: only connected when the GUI is being used.
Replication connections: only connected when actively Mirroring or Refreshing.

Oracle
38
38

DB2

SQL Server
2010 IBM Corporation

Information Management

What CDC Cannot Do

CDC works based on logged operations; any non-logged operations are not captured.

CDC cannot capture DDL changes

Exception: if Oracle-to-Oracle only, consider CDC for Oracle Replication, which can replicate DDL changes

CDC cannot perform heavy-duty transformation on data (ETL style)

39
39

Example: a load operation. Requires a table refresh to sync the non-logged load operation.

Consider integrating with ETL tool for transformation requirement

2010 IBM Corporation

Information Management

CDC For DataStage


Enabling Real-Time Response to Data Changes and Business Events
Low impact log-based
changed data capture
New palette stages on
Information Server
Stream data changes into
Information Server
Flat File
Direct Connect

40

2010 IBM Corporation

Information Management

Option-1: MQ Based Integration Method


3

MQ

2
1

database

DS/QS job
5

database

1. A change is introduced to the source database
2. CDC captures the change made to the source database
3. Captured changes are written to MQ in XML format
4. DataStage (via the MQ Connector) processes messages continuously and passes data off to downstream stages
5. Updates are written to the target warehouse

41

2010 IBM Corporation

Information Management

Option-2: Flat File Connection Method


3

File

2
1


42

database

DS/QS job
5

database

1. A change is introduced to the source database
2. CDC captures changes made to the source database
3. CDC writes each transaction to a file
4. DataStage reads the changes from the file
5. The target database is updated with the changes

2010 IBM Corporation

Information Management

Option-3: Direct Connection Method


2
5

User Exit

DS Custom
Operator

3
1

database

DS/QS job

4
6

database

1. Custom operator, which runs on regular intervals, requests


the changed data from CDC
2. A change is introduced to source database
3. CDC captures changes made to source database
4. Captured changes are passed to the user exit, which writes to a private
TCP/IP comm port
5. Custom operator passes data off to downstream stages
6. Update target database with changes
43

2010 IBM Corporation

Information Management

Module Summary
After completing this module, you should be able to explain:

How DataStage is integrated with other products in the Information Server


and Foundation Tools

How DataStage integrates with other applications through Web Services


with Information Service Director

How DataStage integrates with WebSphere MQ

How DataStage integrates with MDM Server with Rapid Development


Package

44

How DataStage integrates with Change Data Capture

2010 IBM Corporation
