DataStage
Course Contents
Course Objectives
Importing metadata
DataStage architecture
Transforming data
Merging data
Configuration files
Ideal tool for data integration projects such as data warehouses and data marts
Client Logon
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
DataStage Projects
DataStage Jobs
Parallel jobs
Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)
Read data
Write data
Processor (active) stages (T of ETL)
Transform data
Filter data
Aggregate data
Generate data
Links
Conceptual exercises
Designed to reinforce a specific module's topics
Provide hands-on experiences with DataStage
Introduced by the word "Conceptual"
E.g., Conceptual Lab 01A
Lab Exercises
Conceptual Lab 01A
Module Objectives
Project Properties
Projects can be created and deleted in Administrator
Project properties are specified in Administrator
Can be overridden at the job level
Environment Variables
Permissions Tab
Tracing Tab
Parallel Tab
Sequence Tab
What Is Metadata?
Diagram: data flows from Source through Transform to Target; metadata describing each is stored in the Repository
DataStage Manager
Manager Contents
Metadata
DataStage objects
Jobs
Routines
Compiled jobs / objects
Stages
Export Procedure
Options Tab
Select by folder or individual object
Import Procedure
In Manager, click Import>DataStage Components
Import Options
Importing Metadata
Metadata Import
Import format and column definitions from sequential files
Specify Format
Property categories
Available properties
Second-level category
Top-level category
Lab Exercises
Conceptual Lab 02A
Set up your DataStage environment
Module Objectives
Design a simple Parallel job in Designer
Canvas
Repository
Tools Palette
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Run
Job properties
Compile
Tools Palette
Peek
Row Generator
Annotation
Row Generator Stage
Produces mock data for specified columns
Set property value
Property
Columns Tab
View data
Load a Table Definition
Select Table Definition
Extended Properties
Specified properties and their values
Additional properties to add
Peek Stage
Displays field values
Displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, and so on
Output to job log
Job Parameters
Defined in Job Properties window
When used for directory and file names and in property values, surround with pound signs (#)
E.g., #NumRows#
Parameter
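For example, parameters referenced in stage property values might look like this (the #TargetDir# parameter and file name are illustrative; #NumRows# is the parameter shown above):
Number of Records = #NumRows#
File = #TargetDir#/customers.txt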
Annotation stage
Added from the Tools Palette
Displays formatted text descriptions on the diagram
Documentation
Compiling a Job
Compile
Highlights stage with error
DataStage Director
Use to run and schedule jobs
Run Options
Peek messages
Message Details
Lab Exercises
Conceptual Lab 03A
Design a simple job in Designer
Define a job parameter
Document the job
Compile
Run
Monitor the job in Director
Module Objectives
Understand the stages for accessing different kinds of sequential data
Data Set
For files other than datasets (such as sequential flat files), import and export operations are performed
Import and export OSH operators are generated by the Sequential File and Complex Flat File stages
Data import and export
On import, the external record format (Field 1, Field 2, ..., last field, terminated by a newline) is converted into the EE internal format
On export, the EE internal format is converted back into the external record format
Reject link
View data
Column names in first row
Format Tab
Record format
Column format
Select File Pattern
Append / Overwrite
Reject Link
Reject mode =
Continue: Continue reading records
Fail: Abort job
Output: Send down output link
In a source stage
All records not matching the metadata (column definitions) are rejected
In a target stage
All records that fail to be written for any reason
DataSet Stage
Data Set
Operating system (Framework) file
Preserves partitioning
Component data files are written on each partition
Suffixed by .ds
Persistent Datasets
Two parts:
Descriptor file:
contains metadata, data location, but NOT the data itself
input.ds
Data file(s)
record (
partno: int32;
description: string;
)
node1:/local/disk1/
node2:/local/disk2/
Data Translation
Occurs on import
From sequential files or file sets
From RDBMS
Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS
FileSet Stage
1. Descriptor file
2. Individual data file(s)
Similar to a dataset
Main difference is that file sets are not in the internal format and are therefore more accessible to external applications
Lab Exercises
Conceptual Lab 04A
Read and write to a sequential file
Create reject links
Create a data set
VarChar
Integer
Decimal (Numeric)
Floating point
Date
Time
Timestamp
VarBinary (raw)
Complex types
Vector (array, occurs)
Subrecord (group)
Standard Types
Char
Fixed length string
VarChar
Variable length string
Specify maximum length
Integer
Decimal (Numeric)
Floating point
Date
Default string format: %yyyy-%mm-%dd
Time
Default string format: %hh:%nn:%ss
Timestamp
Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
VarBinary (raw)
Vector
A one-dimensional array
Elements are numbered 0 to n
Elements can be of any single type
All elements must have the same type
Can have fixed or variable number of elements
Subrecord
A group or structure of elements
Elements of the subrecord can be of any type
Subrecords can be embedded
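A sketch of how a vector and a subrecord might appear in an Orchestrate record schema (field names and sizes are illustrative):
record (
  title: string[max=80];
  authors[4]: subrec (
    fname: string[20];
    lname: string[20];
  );
)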
subrecord
vector
Authors is a subrecord
Elements of subrecord
Vector
Complex Flat
File source
stage
Complex Flat
File target
stage
Click Import > Table Definition
Specify position of level 01 items
Level 01 start position
Where to store the Table Definition
Path to copybook file
Nullable column
Added property
Managing DataSets
List records
Remove datasets
Removes all component data files, not just the descriptor file
Dsrecords
Lists number of records in a dataset
Display data
Schema
Orchadmin
Manages EE persistent data sets
Unix command-line utility
E.g., $ orchadmin delete myDataSet.ds
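Illustrative command-line usage (the dataset name is hypothetical):
$ dsrecords myDataSet.ds          # list the number of records
$ orchadmin describe myDataSet.ds # show the dataset description and schema
$ orchadmin dump myDataSet.ds     # display the data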
Lab Exercises
Conceptual Lab 04D
Module Objectives
Pipeline parallelism
Partition parallelism
Configuration files
Key EE Concepts
Parallel processing:
Executing the job on multiple CPUs
Scalable processing:
Single CPU
SMP
Multi-CPU (2-64+)
GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per node
Typically SAN-based shared storage
MPP
Multiple nodes with dedicated memory and storage
2 to 1000s of CPUs
Pipeline Parallelism
Advantages:
Downstream stages can start processing rows as soon as upstream stages produce them
Reduces the need to land intermediate data to disk
Keeps all processors busy
Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Subsets are called partitions (nodes)
E.g., if the operation is Filter, each partition will be filtered in exactly the same way
Three-Node Partitioning
Diagram: the incoming Data is divided into subset1, subset2, and subset3, which are processed by the same Operation on Node 1, Node 2, and Node 3
If the data is evenly distributed, the data will be processed approximately three times faster
Configuration File
Defines number of nodes (logical processing units) with their resources (need not match the number of physical CPUs)
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Key points:
1. The number of nodes defined determines the default degree of parallelism
2. The same job design can run unchanged on a different configuration file (selected via $APT_CONFIG_FILE)
3. Node pools and resources can constrain where stages run and where data is stored
Advanced resource optimizations and configuration (named pools, database, SAS)
Partition imbalances
Peek stage shows number of records going down each partition
Source Data (ID, LName, FName, Address):
Ford, Henry, 66 Edison Avenue
Ford, Clara, 66 Edison Avenue
Ford, Edsel, 7900 Jefferson
Ford, Eleanor, 7900 Jefferson
Dodge, Horace, 17840 Jefferson
Dodge, John, 75 Boston Boulevard
Ford, Henry, 4901 Evergreen
Ford, Clara, 4901 Evergreen
Ford, Edsel, 1100 Lakeshore
Ford, Eleanor, 1100 Lakeshore

Partition 0 (ID, LName, FName, Address):
Dodge, Horace, 17840 Jefferson
Dodge, John, 75 Boston Boulevard

Partition 1 (ID, LName, FName, Address):
Ford, Henry, 66 Edison Avenue
Ford, Clara, 66 Edison Avenue
Ford, Edsel, 7900 Jefferson
Ford, Eleanor, 7900 Jefferson
Ford, Henry, 4901 Evergreen
Ford, Clara, 4901 Evergreen
Ford, Edsel, 1100 Lakeshore
Ford, Eleanor, 1100 Lakeshore
Collecting icon
SAME partitioner
Re-partition: watch for this!
AUTO partitioner
Partitioning Tab
Key specification
Algorithms
Collecting Specification
Key specification
Algorithms
Quiz
True or False?
Use of Peeks
Datasets used to connect jobs
Use of project environment variables in job parameters
Fork Joins
Lookups for auditing
Warehouse Job 01
Filter Stage
Warehouse Job 02
Warehouse Job 03
Warehouse Job 04
Lab Exercises
Module Objectives
Combining Data
Ways to combine data:
Horizontally:
Multiple input links combined into one output link whose records contain columns from the matched inputs (e.g., Lookup, Merge, Join)
Vertically:
One input link, one output link, combining groups of related records into a single record
Aggregator
Remove Duplicates
Funnel stage
These stages differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)
Conventions:
Primary Input: port 0
Secondary Input(s): ports 1 and higher
Input link naming conventions:
Joins: Left, Right
Lookup: Source, Lookup table(s)
Merge: Master, Update(s)
Lookup Stage
Lookup Features
One Stream Input link (Source)
The lookup table is created in memory before any lookup source rows are processed
Diagram: the key column of the source row (state_code = 'TN') indexes the in-memory lookup table, which maps state_code to an associated value:
SC: South Carolina
SD: South Dakota
TN: Tennessee
TX: Texas
UT: Utah
VT: Vermont
Driver (Source) link and Reference link (lookup table) feed the Lookup stage
Reference link
Derivation for lookup key
Select action
If the lookup fails to find a matching key column, one of these actions can be taken:
Fail: the Lookup stage reports an error and the job fails immediately. This is the default.
Drop: the input row with the failed lookup(s) is dropped.
Continue: the input row is transferred to the output, together with the successful table entries. The failed table entry(s) are not transferred, resulting in either default output values or null output values.
Reject: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link.
Example data: source rows contain a Citizen column (values Lefty, M_B_Dextrous); the lookup table maps Citizen to Exchange (Nasdaq, NYSE)
Lookup Stage
Output of Lookup with continue option on key Citizen
Revolution | Citizen | Exchange
1789 | Lefty | (empty string or NULL)
1776 | M_B_Dextrous | Nasdaq
With the drop option, only the row with a successful lookup is output:
Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq
On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link
Entire results in multiple copies (one for each partition)
Join Stage
Four types:
Inner
Left outer
Right outer
Full outer
2 or more sorted input links, 1 output link
"left" on primary input, "right" on secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
Link order is immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins
Example input links:
Left link (Revolution, Citizen): 1789, Lefty; 1776, M_B_Dextrous
Right link (Citizen, Exchange): M_B_Dextrous, Nasdaq; Righty, NYSE
Inner Join
Transfers rows from both data sets whose key columns contain equal values to the output link
Treats both inputs symmetrically
Output of inner join on key Citizen
Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq

Output of left outer join on key Citizen:
Revolution | Citizen | Exchange
1789 | Lefty |
1776 | M_B_Dextrous | Nasdaq

Output of right outer join on key Citizen:
Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq
(Null or 0) | Righty | NYSE

Output of full outer join on key Citizen:
Revolution | leftRec_Citizen | rightRec_Citizen | Exchange
1789 | Lefty | |
1776 | M_B_Dextrous | M_B_Dextrous | Nasdaq
0 | | Righty | NYSE
Merge Stage
Inputs: one Master (port 0) and one or more Updates
Outputs: one merged Output plus optional Reject link(s)
Unmatched updates in input port n can be captured in output port n
Lightweight: because inputs are pre-sorted, few rows need to be in RAM
Comparison: Joins, Lookup, and Merge

Property | Joins | Lookup | Merge
Model | RDBMS-style relational | Source with in-RAM lookup table(s) | Master and update(s)
Memory usage | light | heavy | light
# and names of inputs | 2 or more: left, right | 1 Source, N LU Tables | 1 Master, N Update(s)
Mandatory input sort | all inputs | no | all inputs
Duplicates in primary input | OK (x-product) | OK | Warning!
Duplicates in secondary input(s) | OK (x-product) | Warning! | OK only when N = 1
Options on unmatched primary | Keep (left outer), Drop (Inner) | [fail] / continue / drop / reject | [keep] / drop
Options on unmatched secondary | Keep (right outer), Drop (Inner) | NONE | capture in reject set(s)
On match, secondary entries are | captured | captured | consumed
# Outputs | 1 | 1 out, (1 reject) | 1 out, (N rejects)
Captured in reject set(s) | Nothing (N/A) | unmatched primary entries | unmatched secondary entries
Funnel Stage
Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sort keys.
Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on.
Data from all input links must be sorted on the same key column
Typically data from all input links is hash partitioned before being sorted
Selecting the Auto partition type on the input Partitioning tab defaults to this
Hash partitioning guarantees that all records with the same key column values are located in the same partition and are processed on the same node
Lab Exercises
Conceptual Lab 06A
Module Objectives
Sort Stage
Sorting Data
Uses
Within stages
On input link Partitioning tab, set partitioning to anything other than Auto
In a separate Sort stage
Makes sort more visible on diagram
Has more options
Sorting Alternatives
Sort stage
Sort within
stage
In-Stage Sorting
Partitioning tab
Do sort
Preserve non-key ordering
Remove duplicates
Can't be Auto when sorting
Sort key
Sort Stage
Sort key
Sort options
Sort keys
Sort Options
Sort Utility
DataStage (the default)
Unix: Don't use; slower than the DataStage sort utility
Stable
Allow duplicates
Memory usage
Aggregator Stage
Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregator stage
Group columns
Group method
Aggregation
functions
Aggregator Functions
Aggregation type = Count rows
Select column
Put result of calculation in a specified output column
Calculations include:
Sum
Count
Min, max
Mean
Grouping Methods
Hash (default)
Intermediate results for each group are stored in a hash table
Final results are written out after all input has been processed
No sort required
Use when number of unique groups is small
The running tally for each group's aggregate calculations needs to fit into memory; requires about 1 KB of RAM per group
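For example, grouping on a key with about 100,000 unique values needs roughly 100,000 x 1 KB, or about 100 MB of RAM (an estimate based on the ~1 KB per group figure above)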
Sort
Requires input pre-sorted on the grouping keys
Only a single aggregation group is kept in memory
When a new group is seen, the current group is written out
Aggregation Types
Calculation types
Removing Duplicates
Can be done by Sort stage
OR
Remove Duplicates
stage
Lab Exercises
Solution Development Lab 07A (Build Warehouse_03 job)
Module Objectives
Transformed Data
Stages Review
Stages that can transform data
Transformer
Modify
Aggregator
Transformer Stage
Column mappings
Derivations
Written in Basic
Final compiled code is C++ generated object code
Constraints
Filter data
Direct data down different output links
For different processing or storage
Input columns
Job parameters
Functions
System variables and constants
Stage variables
External routines
Constrain data
Direct data
Derivations
Transformer with
multiple outputs
Stage variables
Input columns
Output
Output columns
Constraints
Derivations / Mappings
Defining a Constraint
Input column
Job parameter
Defining a Derivation
Input column
String in quotes
Concatenation operator (:)
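A sketch of a derivation combining these elements (link and column names are illustrative):
lnk_in.FirstName : ' ' : lnk_in.LastName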
Format:
IF <condition> THEN <expression1> ELSE <expression2>
If the condition evaluates to true then the result of expression1 will be copied to the target column or stage variable
If the condition evaluates to false then the result of expression2 will be copied to the target column or stage variable
Example:
Suppose the source column is named In.OrderID and the target column is named Out.OrderID
Replace In.OrderID values of 3000 by 4000
IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
Substring operator
Format: String [loc, length]
Example:
Suppose In.Description contains the string "Orange Juice"
In.Description[8,5] returns "Juice"
UpCase(<string>) / DownCase(<string>)
Len(<string>)
Example: Len(In.Description) returns 12
Lookups
Mismatches (lookup failures) can produce nulls
IsNotNull(<column>)
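A sketch of a derivation that guards against a failed-lookup null before applying string functions (link and column names are illustrative):
IF IsNotNull(lnk_lkp.Description) THEN UpCase(lnk_lkp.Description[1,6]) ELSE 'UNKNOWN'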
Transformer Functions
Logical
Null Handling
Number
String
Type Conversion
Stage Variables
Multi-purpose
Counters
Store values from previous rows to make comparisons
Store derived values to be used in multiple target field derivations
Can be used to control execution of constraints
Show/Hide button
Reject link
Convert link to a Reject link
Otherwise Link
Otherwise link
Check to create otherwise link
Last in order
Modify Stage
Null handling
Date / time handling
String handling
Modify stage
Specification property
Derivation / Conversion
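A sketch of Specification property values (column names are illustrative; handle_null and date_from_string are typical modify conversions):
CustName = handle_null(CustName, 'UNKNOWN')
OrderDate = date_from_string(OrderDateStr)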
Lab Exercises
Conceptual Lab 08A
Module Objectives
Job documentation
Job parameters
Containers
Job Presentation
Document using the Annotation stage
Description is displayed in Manager and MetaStage
Naming Conventions
Stages named after the data they process
Copy stage
Developing Jobs
1. Keep it simple
a)
2.
a)
b)
c)
3.
a) Don't worry too much about partitioning until the sequential flow works as expected
If you land data in order to break complex jobs into smaller sets of jobs for purposes of restartability or maintainability, use persistent datasets
4.
a)
b)
Final Result
$APT_DUMP_SCORE
Report the job score to the message log
$APT_CONFIG_FILE
Click to add environment variables
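For example, added as job parameters (paths and values are illustrative):
$APT_CONFIG_FILE = /opt/datastage/configs/4node.apt
$APT_DUMP_SCORE = True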
Partitioner and Collector
Mapping: node --> partition
Containers
Two varieties
Local
Shared
Local
Visible only within the job in which it is created
Shared
Creates reusable object that many jobs within the project can include
Container
Creating a Container
Create a job
Lab Exercises
Conceptual Lab 07A
Module Objectives
Diagram: client applications connect to Enterprise Edition, which runs operations such as Sort and Load in parallel against a parallel RDBMS
Informix
Oracle
Teradata
SQL Server
ODBC
Orchestrate schema imports are better because the data types are more accurate
ODBC Import
Select ODBC data source name
RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table Definitions
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
ODBC Enterprise
RDBMS Usage
As a source
As a target
Inserts
Upserts (Inserts and updates)
Loader
Connection information
Job example
Reference link
DBMS as a Target
Write Methods
Write methods
Delete
Load
Uses database load utility
Upsert
INSERT followed by an UPDATE
Write (DB2)
INSERT
Write modes
SQL UPDATE
Upsert method
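A sketch of user-defined upsert SQL (table and column names are illustrative; ORCHESTRATE.column refers to the incoming row's columns):
INSERT INTO customers (cust_id, cust_name) VALUES (ORCHESTRATE.cust_id, ORCHESTRATE.cust_name)
UPDATE customers SET cust_name = ORCHESTRATE.cust_name WHERE cust_id = ORCHESTRATE.cust_id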
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
Virtual dataset is
used to connect
output of one
operator to input of
another
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;
Framework term | DataStage term
schema | table definition
property | format
type | SQL type and length
virtual dataset | link
record / field | row / column
operator | stage
step | job
Framework | DS Parallel Engine
It is only after the job Score and processes are created that processing begins
Startup overhead of an EE job
To identify the Score dump, look for "main_program: This step ..."
Job score sections:
Datasets: partitioning and collecting
Operators: node/operator mapping
Q&A
Thank You