Data Stage


IBM WebSphere DataStage

Introduction To Enterprise Edition

Course Contents

Module 01: Introduction

Module 02: Setting Up Your DataStage Environment

Module 03: Creating Parallel Jobs

Module 04: Accessing Sequential Data

Module 05: Platform Architecture

Module 06: Combining Data

Module 07: Sorting and Aggregating Data

Module 08: Transforming Data

Module 09: Standards and Techniques

Module 10: Accessing Relational Data

Module 11: Compilation and Execution

Module 12: Testing and Debugging

Module 13: Metadata in Enterprise Edition

Module 14: Job Control

Course Objectives

DataStage Clients and Server

Setting up the parallel environment

Importing metadata

Building DataStage jobs

Loading metadata into job stages

Accessing Sequential data

Accessing Relational data

Introducing the Parallel framework architecture

Transforming data

Sorting and aggregating data

Merging data

Configuration files

Creating job sequences

IBM WebSphere DataStage


Module 01: Introduction

What is IBM WebSphere DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects, such as data warehouses, data marts,

and system migrations


Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs all within DataStage

Administer your DataStage development and execution environments

Create batch (controlling) jobs

DataStage Server and Clients


Microsoft Windows

Windows or Unix Server

Client Logon

DataStage Administrator

DataStage Manager

DataStage Designer

DataStage Director

Developing in DataStage
Define global and project properties in Administrator

Import metadata into the Repository


Manager
Designer Repository View

Build job in Designer

Compile job in Designer

Run and monitor job in Director

DataStage Projects

DataStage Jobs
Parallel jobs

Executed under control of DataStage Server runtime environment


Built-in functionality for Pipeline and Partitioning Parallelism
Compiled into OSH (Orchestrate Scripting Language)
OSH executes Operators
Executable C++ class instances
Runtime monitoring in DataStage Director

Job Sequences (Batch jobs, Controlling jobs)

Master Server jobs that kick off jobs and other activities


Can kick off Server or Parallel jobs
Runtime monitoring in DataStage Director

Server jobs (Requires Server Edition license)

Executed by the DataStage Server Edition


Compiled into Basic (interpreted pseudo-code)
Runtime monitoring in DataStage Director

Mainframe jobs (Requires Mainframe Edition license)

Compiled into COBOL


Executed on the Mainframe, outside of DataStage

Design Elements of Parallel Jobs

Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)

Read data
Write data

E.g., Sequential File, Oracle, Peek stages

Processor (active) stages (T of ETL)

Transform data
Filter data

Aggregate data

Generate data

Split / Merge data

E.g., Transformer, Aggregator, Join, Sort stages

Links

Pipes through which the data moves from stage to stage

Quiz True or False?

DataStage Designer is used to build and compile your ETL jobs

Manager is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

Administrator is used to set global and project properties

Introduction to the Lab Exercises


Two types of exercises in this course:

Conceptual exercises
Designed to reinforce a specific module's topics
Provide hands-on experiences with DataStage
Introduced by the word Concept
E.g., Conceptual Lab 01A

Solution Development exercises

Based on production applications


Provide development examples
Introduced by the word Solution
E.g., Solution Lab 05A
The Solution Development exercises are introduced and discussed in a later
module

Lab Exercises
Conceptual Lab 01A

Install DataStage clients


Test connection to the DataStage Server
Install lab files

IBM WebSphere DataStage

Module 02: Setting Up Your DataStage Environment

Module Objectives

Setting project properties in Administrator

Defining Environment Variables

Importing / Exporting DataStage objects in Manager

Importing Table Definitions defining sources and targets in Manager

Setting Project Properties

Project Properties
Projects can be created and deleted in Administrator

Each project is associated with a directory on the DataStage Server

Project properties, defaults, and environmental variables are specified

in Administrator
Can be overridden at the job level


Setting Project Properties


To set project properties, log onto Administrator, select your project,

and then click Properties

Project Properties General Tab

Environment Variables

Permissions Tab

Tracing Tab

Parallel Tab

Sequence Tab

Importing and Exporting DataStage Objects

What Is Metadata?

Data

Source

Transform

Target
Metadata

Metadata
Metadata

Repository

DataStage Manager

Manager Contents
Metadata

Describing sources and targets: Table definitions


Describing inputs / outputs from external routines
Describing inputs and outputs to BuildOp and CustomOp stages

DataStage objects

Jobs
Routines
Compiled jobs / objects
Stages

Import and Export

Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects with other developers

Export Procedure

In Manager, click Export>DataStage Components

Select DataStage objects for export

Specify type of export:

DSX: Default format


XML: Enables processing of export file by XML applications, e.g., for
generating reports

Specify file path on client machine

Quiz - True or False?


You can export DataStage objects such as jobs, but you can't export

metadata, such as field definitions of a sequential file.

Quiz - True or False?


The directory to which you export is on the DataStage client machine,

not on the DataStage server machine.

Exporting DataStage Objects

Select Objects for Export

Options Tab

Select by folder or
individual object

Import Procedure
In Manager, click Import>DataStage Components

Or Import>DataStage Components (XML) if you are importing an XML-format export file

Select DataStage objects for import

Importing DataStage Objects

Import Options

Importing Metadata

Metadata Import
Import format and column definitions from sequential files

Import relational table column definitions

Imported as Table Definitions

Table definitions can be loaded into job stages

Table definitions can be used to define Routine and Stage interfaces

Sequential File Import Procedure


In Manager, click Import>Table Definitions>Sequential File Definitions

Select directory containing sequential file and then the file

Select Manager category

Examine format and column definitions and edit if necessary

Importing Sequential Metadata

Sequential Import Window

Specify Format

Specify Column Names and Types


Double-click to define
extended properties

Extended Properties window

Property
categories

Available
properties

Table Definition General Tab

Second level
category
Top level
category

Table Definition Columns Tab

Table Definition Parallel Tab

Table Definition Format Tab

Lab Exercises
Conceptual Lab 02A
Set up your DataStage environment

Conceptual Lab 02B


Import a sequential file Table Definition

IBM WebSphere DataStage

Module 03: Creating Parallel Jobs

Module Objectives
Design a simple Parallel job in Designer

Compile your job

Run your job in Director

View the job log

Creating Parallel Jobs

What Is a Parallel Job?


Executable DataStage program

Created in DataStage Designer


Can use components from Manager Repository

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH) and object code


(from generated C++)

Job Development Overview


Import metadata defining sources and targets
Can be done within Designer or Manager

In Designer, add stages defining data extractions and loads

Add processing stages to define data transformations

Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job


Can also run the job in Designer
Can only view the job log in Director

Designer Work Area

Canvas
Repository

Tools
Palette

Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers

Run
Job properties

Compile

Tools Palette

Adding Stages and Links


Drag stages from the Tools Palette to the diagram
Can also be dragged from Stage Type branch to the diagram

Draw links from source to target stage


Right mouse over source stage
Release mouse button over target stage

Job Creation Example Sequence


Brief walkthrough of procedure

Assumes table definition of source already exists in the repository

Create New Job

Drag Stages and Links From Palette

Peek
Row
Generator

Annotation

Renaming Links and Stages


Click on a stage or link to rename it

Meaningful names have many


benefits
Documentation
Clarity
Fewer development errors

RowGenerator Stage
Produces mock data for specified columns

No inputs link; single output link

On Properties tab, specify number of rows

On Columns tab, load or specify column definitions


Click Edit Row over a column to specify the values to be generated for that
column
A number of algorithms for generating values are available depending on the
data type

Algorithms for Integer type


Random: seed, limit
Cycle: Initial value, increment

Algorithms for string type: Cycle, alphabet

Algorithms for date type: Random, cycle

Inside the Row Generator Stage


Properties
tab

Set property
value

Property

Columns Tab
View data

Load a
Table
definition
Select Table
Definition

Extended Properties

Specified
properties and
their values

Additional
properties to add

Peek Stage
Displays field values
Displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, and so on

Useful stub stage for iterative job development


Develop job to a stopping point and check the data

Peek Stage Properties

Output to
job log

Job Parameters
Defined in Job Properties window

Makes the job more flexible

Parameters can be:


Used in directory and file names
Used to specify property values
Used in constraints and derivations

Parameter values are determined at run time

When used for directory and file names and names of properties,
surround with pound signs (#)
E.g., #NumRows#

Job parameters can reference DataStage and system environment variables
$PROJDEF
$ENV
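
As an illustration (SourceDir and FileName are hypothetical parameter names; NumRows is the parameter shown in the example above), stage properties might be parameterized like this:

File = #SourceDir#/#FileName#
Number of rows = #NumRows#

At run time, Director prompts for (or defaults) these values, so the same compiled job can point at different files and generate different row counts.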

Defining a Job Parameter


Parameters tab

Parameter

Using a Job Parameter in a Stage

Job parameter surrounded


with pound signs

Adding Job Documentation


Job Properties
Short and long descriptions
Shows in Manager

Annotation stage
Added from the Tools Palette
Display formatted text descriptions on diagram

Job Properties Documentation

Documentation

Annotation Stage Properties

Compiling a Job

Compile

Errors or Successful Message

Highlight stage
with error

Click for more info

Running Jobs and Viewing the Job


Log in Designer

Prerequisite to Job Execution

DataStage Director
Use to run and schedule jobs

View runtime messages

Can invoke from DataStage Manager or Designer


Tools > Run Director

Run Options

Stop after number


of warnings

Stop after number


of rows

Director Log View

Click the open


book icon to view
log messages

Peek messages

Message Details

Other Director Functions


Schedule job to run on a particular date/time

Clear job log of messages

Set job log purging conditions

Set Director options


Row limits
Abort after x warnings

Running Jobs from Command Line


Use dsjob -run to run a job

Use dsjob -logsum to display messages in the job log

Documented in the Parallel Job Advanced Developer's Guide, ch. 7
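
A minimal command-line sketch (project name, job name, and the parameter value are placeholders; exact options are described in the Developer's Guide chapter cited above):

$ dsjob -run -jobstatus -param NumRows=100 MyProject GenDataJob
$ dsjob -logsum MyProject GenDataJob

Here -jobstatus waits for the job to finish and returns its status, and -param supplies a value for a job parameter defined in the job.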

Lab Exercises
Conceptual Lab 03A
Design a simple job in Designer
Define a job parameter
Document the job
Compile
Run
Monitor the job in Director

IBM WebSphere DataStage

Module 04: Accessing Sequential Data

Module Objectives
Understand the stages for accessing different kinds of sequential data

Sequential File stage

Data Set stage

Complex Flat File stage

Create jobs that read from and write to sequential files

Read from multiple files using file patterns

Use multiple readers

Types of Sequential Data Stages


Sequential
Fixed or variable length

Data Set

Complex Flat File

The Framework and Sequential Data


The EE Framework processes only datasets

For files other than datasets, such as sequential flat files, import and
export operations are done
Import and export OSH operators are generated by Sequential and
Complex Flat File stages

During import or export DataStage performs format translations


into, or out of, the EE internal format
Internally, the format of data is described by schemas
Like Table Definitions

Using the Sequential File Stage


Both import and export of general files (text, binary) are
performed by the SequentialFile Stage.
Data import:

Data export

EE internal format

EE internal format

Features of Sequential File Stage


Normally executes in sequential mode

Executes in parallel when reading multiple files

Can use multiple readers within a node


Reads chunks of a single file in parallel

The stage needs to be told:


How file is divided into rows (record format)
How row is divided into columns (column format)

File Format Example


Diagram: a record is a sequence of fields (Field 1, Field 2, ..., last field) separated by the Field Delimiter, with a record delimiter (nl) marking the end of each record.

Final Delimiter = end: the last field is followed directly by the record delimiter (nl)

Final Delimiter = comma: a trailing field delimiter appears after the last field, before the record delimiter (, nl)
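
For instance (illustrative data), with a comma field delimiter, a newline record delimiter, and Final Delimiter = end, one record might look like:

Henry,Ford,66 Edison Avenue<nl>

With Final Delimiter = comma, the same record carries a trailing comma: Henry,Ford,66 Edison Avenue,<nl>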

Sequential File Stage Rules


One input link

One stream output link

Optionally, one reject link


Will reject any records not matching metadata in the column definitions
Example: You specify three columns separated by commas, but the row
that's read has no commas in it

Job Design Using Sequential Stages

Reject link

Sequential Source Columns Tab

View data

Load Table Definition


Save as a new
Table Definition

Input Sequential Stage Properties


Output tab
File to
access

Column names
in first row

Click to add more files having


the same format

Format Tab

Record format

Column format

Reading Using a File Pattern


Use wild
cards

Select File
Pattern

Properties - Multiple Readers

Multiple readers option allows


you to set number of readers
per node

Sequential Stage As a Target


Input Tab

Append /
Overwrite

Reject Link
Reject mode =
Continue: Continue reading records
Fail: Abort job
Output: Send rejected records down the reject link

In a source stage
All records not matching the
metadata (column definitions) are
rejected

In a target stage
All records that fail to be written for
any reason

Rejected records consist of one


column, datatype = raw

Reject mode property

Inside the Copy Stage


Column mappings

DataSet Stage

Data Set
Operating system (Framework) file

Preserves partitioning
Component dataset files are written to disk on each partition

Suffixed by .ds

Referred to by a header file

Managed by Data Set Management utility from GUI (Manager, Designer,


Director)
Represents persistent data

Key to good performance in a set of linked jobs


No import / export conversions are needed
No repartitioning needed

Persistent Datasets

Accessed using DataSet Stage.

Two parts:

Descriptor file:
contains metadata, data location, but NOT the data itself

input.ds
Data file(s)

contain the data

multiple Unix files (one per node), accessible in parallel

record (
partno: int32;
description: string;
)

node1:/local/disk1/
node2:/local/disk2/

Data Translation
Occurs on import
From sequential files or file sets
From RDBMS

Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS

DataStage engine is most efficient when processing internally


formatted records (i.e. datasets)

FileSet Stage


File Set Stage

Can read or write file sets

Files suffixed by .fs

File set consists of:

1. Descriptor file: contains location of raw data files + metadata

2. Individual raw data files

Can be processed in parallel

Similar to a dataset

Main difference is that file sets are not in the internal format and
therefore more accessible to external applications

File Set Stage Example


Descriptor file

Lab Exercises
Conceptual Lab 04A
Read and write to a sequential file
Create reject links
Create a data set

Conceptual Lab 04B


Read multiple files using a file path

Conceptual Lab 04C


Read a file using multiple readers

DataStage Data Types


Standard types
Char

VarChar

Integer

Decimal (Numeric)

Floating point

Date

Time

Timestamp

VarBinary (raw)

Complex types
Vector (array, occurs)

Subrecord (group)

Standard Types

Char
Fixed length string

VarChar
Variable length string
Specify maximum length

Integer

Decimal (Numeric)

Precision (length including numbers after the decimal point)


Scale (number of digits after the decimal point)

Floating point

Date

Default string format: %yyyy-%mm-%dd

Time
Default string format: %hh:%nn:%ss

Timestamp
Default string format: %yyyy-%mm-%dd %hh:%nn:%ss

VarBinary (raw)
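
For example, using the default formats above (the particular values are just illustrative), 4 July 1776 at 12:30:45 is represented as the date string 1776-07-04, the time string 12:30:45, and the timestamp string 1776-07-04 12:30:45.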

Complex Data Types


Vector

A one-dimensional array
Elements are numbered 0 to n
Elements can be of any single type
All elements must have the same type
Can have fixed or variable number of elements

Subrecord
A group or structure of elements
Elements of the subrecord can be of any type
Subrecords can be embedded

Schema With Complex Types

subrecord

vector

Table Definition with complex types

Authors is a subrecord

Books is a vector of 3 strings of length 5
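
A sketch of how this Table Definition might look written as an Orchestrate-style schema (the element names and lengths inside the subrecord are illustrative; the syntax follows the record(...) notation shown earlier):

record (
  Authors: subrec (
    FirstName: string[20];
    LastName: string[20];
  );
  Books[3]: string[5];
)

Books[3] declares a fixed-length vector of 3 elements, each a string of length 5, and Authors groups its elements into a subrecord.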

Complex Types Column Definitions


subrecord

Elements of subrecord

Vector

Reading and Writing Complex Data

Complex Flat
File source
stage

Complex Flat
File target
stage

Importing Cobol Copybooks

Click Import>Table

Definitions>COBOL File Definitions


to begin the import
Each level 01 item begins a Table

Definition
Specify position of level 01 items

Level 01 start
position
Where to store the
Table Definition

Path to
copybook file

Reading and Writing NULL Values

Working with NULLs

Internally, NULL is represented by a special value outside the range of


any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort

Columns can be specified as nullable

NULLs can be written to nullable columns

You must handle NULLs written to non-nullable columns in a

Sequential File stage


You need to tell DataStage what value to write to the file
Unhandled rows are rejected

In a Sequential source stage, you can specify values you want

DataStage to convert to NULLs

Specifying a Value for NULL

Nullable
column

Added
property

Managing DataSets

Managing DataSets

GUI (Manager, Designer, Director) tools > data set management

Dataset management from the system command line


Orchadmin
Unix command line utility

List records

Remove datasets
Removes all component files, not just the header file
Dsrecords
Lists number of records in a dataset

Displaying Data and Schema

Display data

Schema

Manage Datasets from the System Command Line

Dsrecords

Gives record count


Unix command-line utility
$ dsrecords ds_name
E.g., $ dsrecords myDS.ds
156999 records

Orchadmin
Manages EE persistent data sets
Unix command-line utility
E.g., $ orchadmin delete myDataSet.ds

Lab Exercises
Conceptual Lab 04D

Use the dsrecords utility


Use Data Set Management tool

Conceptual Lab 04E

Reading and Writing NULLs

IBM WebSphere DataStage

Module 05: Platform Architecture


Module Objectives

Parallel processing architecture

Pipeline parallelism

Partition parallelism

Partitioning and collecting

Configuration files

Key EE Concepts

Parallel processing:
Executing the job on multiple CPUs

Scalable processing:

Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing


nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to
an SMP system

Scalable Hardware Environments

Single CPU

Dedicated memory &


disk

SMP

Multi-CPU (2-64+)

Shared memory & disk

GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per node
Typically SAN-based shared storage

MPP
Multiple nodes with dedicated memory,
storage

2 to 1000s of CPUs

Pipeline Parallelism

Transform, clean, load processes execute simultaneously

Like a conveyor belt moving rows from process to process


Start downstream process while upstream process is running

Advantages:

Reduces disk usage for staging areas


Keeps processors busy

Still has limits on scalability

Partition Parallelism
Divide the incoming stream of data into subsets to be separately

processed by an operation
Subsets are called partitions (nodes)

Each partition of data is processed by the same operation

E.g., if operation is Filter, each partition will be filtered in exactly the same
way

Facilitates near-linear scalability

8 times faster on 8 processors


24 times faster on 24 processors
This assumes the data is evenly distributed

Three-Node Partitioning
Diagram: the incoming Data stream is split into subset1, subset2, and subset3, and the same Operation runs on each subset on Node 1, Node 2, and Node 3.

Here the data is partitioned into three partitions

The operation is performed on each partition of data separately and in parallel

If the data is evenly distributed, the data will be processed three times faster

EE Combines Partitioning and Pipelining

Within EE, pipelining, partitioning, and repartitioning are automatic


Job developer only identifies:
Sequential vs. Parallel operations (by stage)

Method of data partitioning

Configuration file (which identifies resources)

Advanced stage options (buffer tuning, operator combining, etc.)

Job Design v. Execution


User assembles the flow using DataStage Designer

at runtime, this job runs in parallel for any configuration


(1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!

Configuration File

Configuration file separates configuration (hardware / software) from job design


Specified per job at runtime by $APT_CONFIG_FILE
Change hardware and resources without changing job design

Defines number of nodes (logical processing units) with their resources (need not

match physical CPUs)


Dataset, Scratch, Buffer disk (file systems)
Optional resources (Database, SAS, etc.)
Advanced resource optimizations
Pools (named subsets of nodes)

Multiple configuration files can be used at runtime


Optimizes overall throughput and matches job characteristics to overall hardware resources
Allows runtime constraints on resource usage on a per job basis

Example Configuration File


{

node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}

}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)

Partitioning and Collecting

Partitioning and Collecting

Partitioning breaks incoming rows into sets (partitions) of rows

Each partition of rows is processed separately by the stage/operator

If the hardware and configuration file supports parallel processing, partitions


of rows will be processed in parallel

Collecting returns partitioned data back to a single stream

Partitioning / Collecting occurs on stage Input links

Partitioning / Collecting is implemented automatically

Based on stage and stage properties


How the data is partitioned / collected can be specified

Partitioning / Collecting Algorithms

Partitioning algorithms include:


Round robin
Hash: Determine partition based on key value
Requires key specification
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm

Collecting algorithms include:


Round robin
Sort Merge
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key
Ordered
Read all records from the first partition, then the second, and so on
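
As an illustration (values are made up): if partition 0 holds sorted keys 1, 3, 5 and partition 1 holds 2, 4, 6, a Sort Merge collector on that key emits 1, 2, 3, 4, 5, 6, while an Ordered collector emits 1, 3, 5, 2, 4, 6.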

Keyless V. Keyed Partitioning Algorithms

Keyless: Rows are distributed independently of data values


Round Robin
Entire
Same

Keyed: Rows are distributed based on values in the specified key

Hash: Partition based on key


Example: Key is State. All CA rows go into the same partition; all MA
rows go in the same partition. Two rows of the same state never go into
different partitions
Modulus: Partition based on modulus of key divided by the number of
partitions. Key is a numeric type.
Example: Key is OrderNumber (numeric type). Rows with the same
order number will all go into the same partition.
DB2: Matches DB2 EEE partitioning
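
For example (illustrative numbers): with 4 partitions and Modulus partitioning on OrderNumber, a row with OrderNumber 1005 goes to partition 1005 mod 4 = 1, and a row with OrderNumber 1008 goes to partition 1008 mod 4 = 0; every row sharing an OrderNumber lands in the same partition.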

Partitioning Requirements for Related Records


Misplaced records

Using Aggregator stage to sum customer sales by customer number


If there are 25 customers, 25 records should be output
But suppose records with the same customer numbers are spread
across partitions
This will produce more than 25 groups (records)
Solution: Use hash partitioning algorithm

Partition imbalances
Peek stage shows number of records going down each partition

Unequal Distribution Example

Same key values are assigned to the same partition

Source Data (LName, FName, Address), hashed on LName with a 2-node config file:

Ford    Henry    66 Edison Avenue
Ford    Clara    66 Edison Avenue
Ford    Edsel    7900 Jefferson
Ford    Eleanor  7900 Jefferson
Dodge   Horace   17840 Jefferson
Dodge   John     75 Boston Boulevard
Ford    Henry    4901 Evergreen
Ford    Clara    4901 Evergreen
Ford    Edsel    1100 Lakeshore
Ford    Eleanor  1100 Lakeshore

After the hash, one partition receives only the two Dodge rows:

Dodge   Horace   17840 Jefferson
Dodge   John     75 Boston Boulevard

while the other partition receives all eight Ford rows, an unequal distribution.

Partitioning / Collecting Link Icons


Partitioning icon

Collecting icon

More Partitioning Icons


fan-out
Sequential to Parallel

SAME partitioner

Re-partition
watch for this!

AUTO partitioner

Partitioning Tab
Key specification

Algorithms

Collecting Specification
Key specification

Algorithms

Quiz

True or False?

Everything that has been data-partitioned must be


collected in the same job

Data Set Stage

Is the data partitioned?

Introduction to the Solution Development Exercises

Solution Development Jobs

Series of 4 jobs extracted from production jobs

Use a variety of stages in interesting, realistic configurations

Sort, Aggregator stages


Join, lookup stage
Peek, Filter stages
Modify stage
Oracle stage

Contain useful techniques

Use of Peeks
Datasets used to connect jobs
Use of project environment variables in job parameters
Fork Joins
Lookups for auditing

Warehouse Job 01

Glimpse Into the Sort Stage


Algorithms

Sort key to add

Copy Stage With Multiple Output Links

Select output link

Filter Stage

Used with Peek stage to select a portion of data for checking

On Properties tab, specify a Where clause to filter the data

On Mapping tab, map input columns to output columns

Setting the Filtering Condition


Filtering
condition

Warehouse Job 02

Warehouse Job 03

Warehouse Job 04

Warehouse Job 02 With Lookup

Lab Exercises

Conceptual Lab 05A


Experiment with partitioning / collecting

Solution Lab 05B (Build Warehouse_01 Job)

Add environment variables as job parameters


Read multiple sequential files
Use the Sort stage
Use Filter and Peek stages
Write to a DataSet stage

IBM WebSphere DataStage


Module 06: Combining Data

Module Objectives

Combine data using the Lookup stage

Combine data using Merge stage

Combine data using the Join stage

Combine data using the Funnel stage

Combining Data
Ways to combine data:
Horizontally:

Multiple input links


One output link made of columns from different input links.
Joins
Lookup
Merge

Vertically:

One input link, one output link combining groups of related records into a
single record
Aggregator
Remove Duplicates

Funneling: Multiple input streams funneled into a single output stream

Funnel stage

Lookup, Merge, Join Stages


These stages combine two or more input links

Data is combined by designated "key" column(s)

These stages differ mainly in:

Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)

Not all Links are Created Equal


DataStage distinguishes between:
- The Primary input: (Framework port 0)
- Secondary inputs: in some cases "Reference" (other Framework
ports)

Conventions:
Primary Input: port 0
Secondary Input(s): ports 1, …

Joins: primary = Left, secondary = Right
Lookup: primary = Source, secondary = Lookup table(s)
Merge: primary = Master, secondary = Update(s)

Tip: Check the "Link Ordering" tab to make sure the intended Primary is listed first

Lookup Stage


Lookup Features
One Stream Input link (Source)

Multiple Reference links (Lookup files)

One output link

Optional Reject link


Only one per Lookup stage, regardless of number of reference links

Lookup Failure options


Continue, Drop, Fail, Reject

Can return multiple matching rows

Hash tables are built in memory from the lookup files


Indexed by key
Should be small enough to fit into physical memory

The Lookup Stage


Uses one or more key columns as an index into a table
Usually contains other values associated with each key.

The lookup table is created in memory before any lookup source rows are processed

Example: the key column of the source is state_code, containing the value TN. The in-memory lookup table, indexed by key, holds:

Index   Associated Value
...     ...
SC      South Carolina
SD      South Dakota
TN      Tennessee
TX      Texas
UT      Utah
VT      Vermont
...     ...

The lookup on TN returns the associated value Tennessee.

Lookup from Sequential File Example

Driver (Source)
link

Reference link
(lookup table)

Lookup Key Column in Sequential File


Lookup key

Lookup Stage Mappings


Source link

Reference link
Derivation for lookup key

Handling Lookup Failures

Select action

Lookup Failure Actions

If the lookup fails to find a matching key column, one of these actions
can be taken:
fail: the lookup Stage reports an error and the job fails immediately.
This is the default.
drop: the input row with the failed lookup(s) is dropped
continue: the input row is transferred to the output, together with the successful table
entries. The failed table entry(s) are not transferred, resulting in either default output
values or null output values.
reject: the input row with the failed lookup(s) is transferred to a second output link, the
"reject" link.

There is no option to capture unused table entries


Compare with the Join and Merge stages

Lookup Stage Behavior


We shall first use the simplest case, optimal input:
Two input links: "Source" as primary, "Lookup" as secondary
sorted on the key column (here "Citizen"),
without duplicates on the key

Source link (primary input):

Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous

Lookup link (secondary input):

Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE

Lookup Stage

Output of Lookup with continue option on key Citizen:

Revolution  Citizen       Exchange
1789        Lefty         (empty string or NULL)
1776        M_B_Dextrous  Nasdaq

Same output as outer join and merge/keep

Output of Lookup with drop option on key Citizen:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as inner join and merge/drop

The Lookup Stage


Lookup Tables should be small enough to fit into physical memory

On an MPP you should partition the lookup tables using the Entire partitioning method
or partition them by the same hash key as the source link
Entire results in multiple copies (one for each partition)

On an SMP, choose Entire or accept the default (which is Entire)


Entire does not result in multiple copies because memory is shared

Join Stage

The Join Stage

Four types:

Inner
Left outer
Right outer
Full outer
2 or more sorted input links, 1 output link
"left" on primary input, "right" on secondary input
Pre-sorting makes joins "lightweight": few rows need to be in RAM

Follow the RDBMS-style relational model


Cross-products in case of duplicates
Matching entries are reusable for multiple matches
Non-matching entries can be captured (Left, Right, Full)

No fail/reject option for missed matches

Join Stage Editor

Link Order is immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins

One of four variants:


Inner
Left Outer
Right Outer
Full Outer

Multiple key columns


allowed

Join Stage Behavior


We shall first use the simplest case, optimal input:
two input links: "left" as primary, "right" as secondary
sorted on the key column (here "Citizen"),
without duplicates on the key

Left link (primary input):

Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous

Right link (secondary input):

Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE

Inner Join
Transfers rows from both data sets whose key columns
contain equal values to the output link
Treats both inputs symmetrically
Output of inner join on key Citizen:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as lookup/reject and merge/drop

Left Outer Join


Transfers all values from the left link and transfers values from the right link
only where key columns match.

Revolution  Citizen       Exchange
1789        Lefty         (empty or NULL)
1776        M_B_Dextrous  Nasdaq

Same output as lookup/continue and merge/keep

Left Outer Join

Check Link Ordering Tab


to make sure intended Primary is listed first

Right Outer Join


Transfers all values from the right link and transfers values from the left link only
where key columns match.

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Null or 0   Righty        NYSE

Full Outer Join


Transfers rows from both data sets, whose key columns contain equal values, to
the output link.
It also transfers rows, whose key columns contain unequal values, from both input
links to the output link.
Treats both inputs symmetrically.

Creates new columns, with new column names!

Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty            (null)            (null)
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0           (null)           Righty            NYSE

Merge Stage

Merge Stage Job

The Merge Stage


Allows composite keys

Diagram: one Master input (port 0) and one or more Update inputs feed the Merge stage, which produces one Output link plus optional Reject links.

Multiple update links

Matched update rows are consumed

Unmatched updates in input port n can be captured in output port n

Lightweight: like the Join stage, Merge relies on pre-sorted inputs, so few rows need to be held in RAM

Merge Stage Editor

Unmatched Master rows:
One of two options:
Keep [default]
Drop
(Capture in a reject link is NOT an option)

Unmatched Update rows option:
Capture in reject link(s)
Implemented by adding outgoing links

Comparison: Joins, Lookup, Merge

Attribute: Joins | Lookup | Merge

Model: RDBMS-style relational | Source with in-RAM lookup table | Master and Update(s)
Memory usage: light | heavy | light
Number and names of inputs: 2 or more (left, right) | 1 Source, N lookup tables | 1 Master, N Update(s)
Mandatory input sort: all inputs | no | all inputs
Duplicates in primary input: OK (cross-product) | OK | Warning!
Duplicates in secondary input(s): OK (cross-product) | Warning! | OK only when N = 1
Options on unmatched primary: Keep (left outer), Drop (inner) | [fail] / continue / drop / reject | [keep] / drop
Options on unmatched secondary: Keep (right outer), Drop (inner) | NONE | capture in reject set(s)
On match, secondary entries are: captured | captured | consumed
Number of outputs: 1 | 1 out (plus 1 reject) | 1 out (plus N rejects)
Captured in reject set(s): nothing (N/A) | unmatched primary entries | unmatched secondary entries

Funnel Stage

What is a Funnel Stage?


A processing stage that combines data from multiple input links to a
single output link
Useful to combine data from several identical data sources into a single
large dataset
Operates in three modes
Continuous
SortFunnel
Sequence

Three Funnel modes


Continuous:
Combines the records of the input links in no guaranteed order.
It takes one record from each input link in turn. If data is not available on an input link,
the stage skips to the next link rather than waiting.
Does not attempt to impose any order on the data it is processing.

Sort Funnel: Combines the input records in the order defined by the value(s) of one or
more key columns and the order of the output records is determined by these sorting
keys.
Sequence: Copies all records from the first input link to the output link, then all the
records from the second input link and so on.

Sort Funnel Method


Produces a sorted output (assuming input links are all sorted on key)

Data from all input links must be sorted on the same key column

Typically data from all input links are hash partitioned before they are sorted
Selecting Auto partition type under Input Partitioning tab defaults to this
Hash partitioning guarantees that all the records with same key column
values are located in the same partition and are processed on the same
node.

Allows for multiple key columns


1 primary key column, n secondary key columns
Funnel stage first examines the primary key in each input record.
For multiple records with the same primary key value, it will then
examine the secondary keys to determine the order of the records it outputs

Funnel Stage Example

Funnel Stage Properties

Lab Exercises
Conceptual Lab 06A

Use a Lookup stage


Handle lookup failures
Use a Merge stage
Use a Join stage
Use a Funnel stage

Solution Lab 06B (Build Warehouse_02 Job)

Use a Join stage

IBM WebSphere DataStage


Module 07: Sorting and Aggregating Data

Module Objectives

Sort data using in-stage sorts and Sort stage

Combine data using Aggregator stage

Combine data using the Remove Duplicates stage

Sort Stage

Sorting Data
Uses

Some stages require sorted input


Join, merge stages require sorted input
Some stages use less memory with sorted input
E.g., Aggregator

Sorts can be done:

Within stages
On input link Partitioning tab, set partitioning to anything other than Auto
In a separate Sort stage
Makes sort more visible on diagram
Has more options

Sorting Alternatives

Sort stage

Sort within
stage

In-Stage Sorting
Partitioning
tab

Do sort
Preserve non-key ordering
Remove dups

Can't be Auto when sorting

Sort key

Sort Stage
Sort key

Sort options

Sort keys

Add one or more keys

Specify sort mode for each key


Sort: Sort by this key
Don't sort (previously sorted):
Assume the data has already been sorted by this key
Continue sorting by any secondary keys

Specify sort order: ascending / descending

Specify case sensitive or not

Sort Options

Sort Utility
DataStage: the default
Unix: Don't use; slower than the DataStage sort utility

Stable

Allow duplicates

Memory usage

Sorting takes advantage of the available memory for increased performance


Uses disk if necessary
Increasing amount of memory can improve performance

Create key change column

Add a column with a value of 1 / 0


1 indicates that the key value has changed
0 means that the key value hasn't changed
Useful for processing groups of rows in a Transformer
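
For example (illustrative key values): with rows sorted on CustID arriving as A, A, B, B, B, C, the key change column contains 1, 0, 1, 0, 0, 1, so a downstream Transformer can detect the first row of each group.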

Sort Stage Mapping Tab

Partitioning V. Sorting Keys

Partitioning keys are often different than Sorting keys


Keyed partitioning (e.g., Hash) is used to group related records into the
same partition
Sort keys are used to establish order within each partition

For example, partition on HouseHoldID, sort on HouseHoldID,


PaymentDate
Important when removing duplicates. Sorting within each partition is used to
establish order for duplicate retention (first or last in the group)

Aggregator Stage

Aggregator Stage
Purpose: Perform data aggregations
Specify:

Zero or more key columns that define the aggregation units (or
groups)
Columns to be aggregated

Aggregation functions include, among many others:


count (nulls/non-nulls)
Sum
Max / Min / Range

The grouping method (hash table or pre-sort) is a performance


issue

Job with Aggregator Stage

Aggregator stage

Aggregator Stage Properties

Group columns

Group method
Aggregation
functions

Aggregator Functions
Aggregation type = Count rows

Count rows in each group


Put result in a specified output column

Aggregation type = Calculation

Select column
Put result of calculation in a specified output column
Calculations include:

Sum

Count

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation

Grouping Methods

Hash (default)
Intermediate results for each group are stored in a hash table
Final results are written out after all input has been processed

No sort required
Use when number of unique groups is small
Running tally for each group's aggregate calculations needs to fit into
memory. Requires about 1K RAM / group

E.g. average family income by state requires .05MB of RAM

Sort
Only a single aggregation group is kept in memory
When a new group is seen, the current group is written out

Requires input to be sorted by grouping keys


Can handle unlimited numbers of groups
Example: average daily balance by credit card

Aggregation Types

Calculation types

Remove Duplicates Stage

Removing Duplicates
Can be done by Sort stage

Use unique option

No choice on which to keep

Stable sort always retains the first row in the group

Non-stable sort is indeterminate

OR

Remove Duplicates stage

Has more sophisticated ways to remove duplicates


Can choose to retain first or last

Remove Duplicates Stage Job

Remove Duplicates
stage

Remove Duplicates Stage Properties


Key that defines
duplicates

Retain first or last


duplicate

Lab Exercises
Solution Development Lab 07A (Build Warehouse_03 job)

Use Sort stage


Use Aggregator stage
Use RemoveDuplicates stage

IBM WebSphere DataStage


Module 08: Transforming Data

Module Objectives

Understand ways DataStage allows you to transform data

Use this understanding to:

Create column derivations using user-defined code and system functions


Filter records based on business criteria
Control data flow based on data conditions

Transformed Data

Derivations may include incoming fields or parts of incoming


fields
Derivations may reference system variables and constants

Frequently uses functions performed on incoming values


Date and time
Mathematical
Logical
Null handling
More

Stages Review
Stages that can transform data

Transformer
Modify
Aggregator

Stages that do not transform data

File stages: Sequential, Dataset, Peek, etc.


Sort
Remove Duplicates
Copy
Filter
Funnel

Transformer Stage

Column mappings

Derivations

Written in Basic
Final compiled code is C++ generated object code

Constraints

Filter data
Direct data down different output links
For different processing or storage

Expressions for constraints and derivations can reference

Input columns
Job parameters
Functions
System variables and constants
Stage variables
External routines

Transformer Stage Uses


Control data flow

Constrain data
Direct data

Derivations

Transformer with
multiple outputs

Inside the Transformer Stage

Stage variables

Input columns

Output
Output columns
Constraints
Derivations / Mappings

Input / Output column defs

Defining a Constraint

Input column

Job parameter

Defining a Derivation
Input column

String in quotes

Concatenation
operator (:)

IF THEN ELSE Derivation

Use IF THEN ELSE to conditionally derive a value

Format:
IF <condition> THEN <expression1> ELSE <expression2>
If the condition evaluates to true then the result of expression1 will be copied
to the target column or stage variable
If the condition evaluates to false then the result of expression2 will be
copied to the target column or stage variable

Example:

Suppose the source column is named In.OrderID and the target column is
named Out.OrderID
Replace In.OrderID values of 3000 by 4000
IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID

String Functions and Operators

Substring operator
Format: String[loc, length]
Example:
Suppose In.Description contains the string "Orange Juice"
In.Description[8,5] returns "Juice"

UpCase(<string>) / DownCase(<string>)

Example: UpCase(In.Description) returns "ORANGE JUICE"

Len(<string>)

Example: Len(In.Description) returns 12

Checking for NULLs


Nulls can be introduced into the data flow from

lookups
Mismatches (lookup failures) can produce nulls

Can be handled in constraints, derivations,

stage variables, or a combination of these


NULL functions

Testing for NULL


IsNull(<column>)

IsNotNull(<column>)

Replace NULL with a value


NullToValue(<column>, <value>)

Set to NULL: SetNull()


Example: IF In.Col = 5 THEN SetNull()
ELSE In.Col
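
As a further illustration of combining these functions (the column names are illustrative), a derivation that replaces a NULL coming from a failed lookup with zero before doing arithmetic might read:

IF IsNull(In.Amount) THEN 0 ELSE In.Amount

or, equivalently, NullToValue(In.Amount, 0).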

Transformer Functions

Date & Time

Logical

Null Handling

Number

String

Type Conversion

Transformer Execution Order

Derivations in stage variables

Constraints are executed before derivations

Column derivations in earlier links are executed before later links

Derivations in higher columns are executed before lower columns

Transformer Stage Variables


Derivations execute in order from top to bottom

Later stage variables can reference earlier stage variables


Earlier stage variables can reference later stage variables
These variables will contain a value derived from the previous row
that came into the Transformer

Multi-purpose

Counters
Store values from previous rows to make comparisons
Store derived values to be used in multiple target field derivations
Can be used to control execution of constraints
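
A sketch of the previous-row pattern described above (the link and column names, In and CustID, and the stage variable names are illustrative). Stage variable derivations execute top to bottom:

svIsNewGroup: IF In.CustID <> svPrevCustID THEN 1 ELSE 0
svPrevCustID: In.CustID

Because svIsNewGroup is evaluated before svPrevCustID is updated, it compares the current row's CustID with the value carried over from the previous row; it can then drive constraints or output derivations that mark the first row of each group.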

Stage Variables Toggle

Show/Hide button

Transformer Reject Links

Reject link

Convert link to a
Reject link

Otherwise Link

Otherwise link

Defining an Otherwise Link

Check to create
otherwise link

Can specify abort


condition

Specifying Link Ordering


Link ordering toolbar icon

Last in
order

Transformer Stage Tips


Suggestions
Include reject links
Test for NULL values before using a column in a function
Use RCP (Runtime Column Propagation)
Map columns that have derivations (not just copies).
More on RCP later.
Be aware of column and stage variable data types.
Often developers do not pay attention to stage variable types.
Avoid type conversions.
Try to maintain the data type as imported.

Modify Stage

Modify Stage

Modify column types

Perform some types of derivations

Null handling
Date / time handling
String handling

Add or drop columns

Job With Modify Stage

Modify stage

Specifying a Column Conversion


New column

Specification
property

Derivation / Conversion

Lab Exercises
Conceptual Lab 08A

Add a Transformer to a job


Define a constraint
Work with null values
Define a rejects link
Define a stage variable
Define a derivation

IBM WebSphere DataStage


Module 09: Standards and Techniques

Module Objectives

Establish standard techniques for Parallel job development

Job documentation

Naming conventions for jobs, links, and stages

Iterative job design

Useful stages for job development

Using configuration files for development

Using environmental variables

Job parameters

Containers

Job Presentation

Document using the Annotation stage

Job Properties Documentation


Organize jobs into
categories

Description is displayed in
Manager and MetaStage

Naming Conventions
Stages named after the

Data they access


Function they perform
DO NOT leave default stage names like Sequential_File_0
One possible convention:
Use 2-character prefixes to indicate stage type, e.g.,
SF_ for Sequential File stage
DS_ for Dataset stage
CP_ for Copy stage

Links named for the data they carry

DO NOT leave default link names like DSLink3


One possible convention:
Prefix all link names with lnk_
Name links after the data flowing through them

Stage and Link Names

Name stages and


links for the data they
handle

Iterative Job Design


Use Copy and Peek stages as stubs

Test job in phases

Small sections first, then increasing in complexity

Use Peek stage to examine records

Check data at various locations


Check before and after processing stages

Copy Stage Stub Example

Copy stage

Copy Stage Example


With 1 link in, 1 link out:

The Copy Stage is the ultimate "no-op" (place-holder):


Partitioners
Sort / Remove Duplicates
Rename, Drop column

Can be placed on:


input link (Partitioning): Partitioners, Sort, Remove Duplicates)
output link (Mapping page): Rename, Drop.
Sometimes replaces the Transformer:
Rename,
Drop,
Implicit type Conversions
Link Constraint break up schema

Developing Jobs
1. Keep it simple
a) Jobs with many stages are hard to debug and maintain

2. Start small and build to final solution
a) Use view data, copy, and peek
b) Start from source and work out
c) Develop with a 1-node configuration file

3. Solve the business problem before the performance problem
a) Don't worry too much about partitioning until the sequential flow works as expected

4. If you land data in order to break complex jobs into smaller sets of jobs for purposes of restartability or maintainability, use persistent datasets
a) Retains partitioning and internal data types
b) This is true only as long as you don't need to read the data outside of DataStage
Final Result

Good Things to Have in each Job


Job parameters

Useful environmental variables to add to job parameters

$APT_DUMP_SCORE
Report OSH to message log
$APT_CONFIG_FILE

Establishes runtime parameters to EE engine

Establishes degree of parallelization

Setting Job Parameters

Click to add
environment
variables

DUMP SCORE Output


Setting APT_DUMP_SCORE yields:
Double-click

Partitioner
And
Collector

Mapping
Node--> partition

Use Multiple Configuration Files

Make a set for 1X, 2X, and so on

Use different ones for test versus production

Include as a parameter in each job
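
A minimal 1-node configuration file, sketched after the example in Module 05 (the host name and paths are placeholders for your environment):

{
  node "n1" {
    fastname "devserver"
    pool ""
    resource disk "/orch/n1/d1" {}
    resource scratchdisk "/orch/n1/scratch" {}
  }
}

Keeping variants such as a 1-node and a 4-node file and exposing $APT_CONFIG_FILE as a job parameter lets each run choose its degree of parallelism without changing or recompiling the job.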

Containers
Two varieties

Local
Shared

Local

Simplifies a large, complex diagram

Shared

Creates reusable object that many jobs within the project can
include

Reusable Job Components

Use Shared Containers for repeatedly used components

Container

Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared

Lab Exercises
Conceptual Lab 07A

Apply best practices when naming links and stages

IBM WebSphere DataStage


Module 10: Accessing Relational Data

Module Objectives

Understand how DataStage jobs read and write records to a


RDBMS tables
Import relational table definitions

Read from and write to database tables

Use database tables to lookup data

Parallel Database Connectivity


Diagram: Traditional client-server, where a single client connects to the parallel RDBMS, versus Enterprise Edition, where parallel processes (for example Sort and Load) each hold their own connection to the parallel RDBMS.

Traditional client-server:

Only RDBMS is running in parallel


Each application has only one connection

Suitable only for small data volumes

Enterprise Edition:

Parallel server runs APPLICATIONS


Application has parallel connections to RDBMS
Suitable for large data volumes

Higher levels of integration possible

Supported Database Access


Enterprise Edition provides high performance / scalable interfaces for:
DB2 / UDB

Informix

Oracle

Teradata

SQL Server

ODBC

Importing Table Definitions


Can import using ODBC or using Orchestrate schema definitions

Orchestrate schema imports are better because the data types are more
accurate

Import>Table Definitions>Orchestrate Schema Definitions

Import>Table Definitions>ODBC Table Definitions

Orchestrate Schema Import

ODBC Import
Select ODBC data
source name

RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table

Definitions

RDBMS NULLs converted to/from DataStage NULLs

Support for standard SQL syntax for specifying:


SELECT clause list
WHERE clause filter condition
INSERT / UPDATE

Supports user-defined queries

Native Parallel RDBMS Stages

DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

ODBC Enterprise

SQL Server Enterprise

RDBMS Usage
As a source

Extract data from table (stream link)


Read methods include: Table, Generated SQL SELECT, or User-defined SQL
User-defined can perform joins, access views
Lookup (reference link)
Normal lookup is memory-based (all table data read into memory)
Can perform one lookup at a time in DBMS (sparse option)
Continue/drop/fail options

As a target

Inserts
Upserts (Inserts and updates)
Loader

DB2 Enterprise Stage Source


Auto-generated
SELECT

Connection
information

Job example

Sourcing with User-Defined SQL


User-defined
read method

Columns in SQL must


match definitions on
Columns tab

DBMS Source Lookup

Reference
link

DBMS as a Target

Write Methods

Write methods
Delete
Load
Uses database load utility
Upsert
INSERT followed by an UPDATE
Write (DB2)
INSERT

Write modes

Truncate: Empty the table before writing


Create: Create a new table
Replace: Drop the existing table (if it exists) then create a new one
Append

DB2 Stage Target Properties


SQL INSERT

Drop table and


create
Database specified
by job parameter

Optional CLOSE command

DB2 Target Stage Upsert


SQL INSERT

SQL UPDATE

Upsert method

Generated OSH for first 2 stages

Generated OSH Primer

Comment blocks introduce each operator


Operator order is determined by the order stages
were added to the canvas

OSH uses the familiar syntax of the UNIX shell


Operator name
Schema
Operator options ( -name value format)
Input (indicated by n< where n is the input #)
Output (indicated by n> where n is the output #)
may include modify

For every operator, input and/or output datasets are


numbered sequentially starting from 0. E.g.:
op1 0> dst
op1 1< src
Virtual datasets are generated to connect operators

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

Virtual dataset is
used to connect
output of one
operator to input of
another

################### #################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;

Framework v. DataStage Terminology

Framework term -> DataStage term

schema -> table definition
property -> format
type -> SQL type and length
virtual dataset -> link
record / field -> row / column
operator -> stage
step, flow, OSH command -> job
Framework -> DS Parallel Engine

GUI uses both terminologies


Log messages (info, warnings, errors) use Framework terminology

Elements of a Framework Program


Operators
Virtual datasets: set of rows processed by Framework
Schema:
data description (metadata) for datasets and links

Enterprise Edition Runtime Architecture

Enterprise Edition Job Startup

Generated OSH and configuration file are used to compose a job


Score
Think of Score as in musical score, not game score
Similar to the way an RDBMS builds a query optimization plan

Identifies degree of parallelism and node assignments for each operator


Inserts sorts and partitioners as needed to ensure correct results
Defines connection topology (virtual datasets) between adjacent operators
Inserts buffer operators to prevent deadlocks
E.g., in fork-joins
Defines number of actual OS processes
Where possible, multiple operators are combined within a single OS process
to improve performance and optimize resource requirements

Job Score is used to fork processes with communication interconnects for


data, message, and control
Set $APT_STARTUP_STATUS to show each step of job startup
Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log

Enterprise Edition Runtime

It is only after the job Score and processes are created that
processing begins
Startup overhead of an EE job

Job processing ends when either:

Last row of data is processed by final operator


A fatal error is encountered by any operator
Job is halted (SIGINT) by DataStage Job Control or human intervention
(e.g. DataStage Director STOP)

Viewing the Job Score


Set $APT_DUMP_SCORE to output the Score to the job log
For each job run, 2 separate Score dumps are written
First score is for the license operator
Second score entry is the real job score

To identify the Score dump, look for "main program: This step ..."

Note that the word "Score" itself does not appear anywhere in the message

License operator job score

Job score

Example Job Score


Job scores are divided into two

sections
Datasets
partitioning and collecting

Operators
node/operator mapping

Both sections identify sequential or


parallel processing

Why 9 Unix processes?

Q&A

Thank You
