DataStage Adv Bootcamp All Presentations
Information Management
Module Objectives
After completing this module, you should be able to:
Information Management
Describe the development environment
Take advantage of the parallel framework
Debug and change GUI job designs based on parallel framework
messages in the job log
Information Management
DataStage Documentation
Information Management
Information Server components: Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, Metadata Access Services, Metadata Analysis Services, MetaBrokers, Federation Server, Metadata Server, and the parallel engine
Information Management
Parallel processing
Executing the job on multiple CPUs
Scalable processing
Adding more resources increases job performance
Information Management
Single CPU: dedicated memory & disk
SMP: multi-CPU (2-64+), shared memory & disk
GRID / Clusters: multiple multi-CPU systems, dedicated memory per node, typically SAN-based shared storage
MPP: multiple nodes with dedicated memory and storage, 2 to 1000s of CPUs
Information Management
Sample pipeline: Derivation, Link Constraint, Lookup, and Sort, with explicit data partitioning
10
Information Management
Pipeline Parallelism
11
Information Management
Partition Parallelism
12
Information Management
Three-Node Partitioning
Data is divided into subsets: the operation runs on subset1 on Node 1, subset2 on Node 2, and subset3 on Node 3
13
Information Management
UNLIMITED SCALABILITY
16
Information Management
Defining Parallelism
17
Information Management
Configuration File
18
Information Management
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Key points:
Advanced resource optimizations and configuration (named pools, database, SAS)
The configuration file in use is selected with $APT_CONFIG_FILE (example below)
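A job picks up a configuration file through the $APT_CONFIG_FILE environment variable; for example (the path shown is illustrative only):
$APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt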
19
Information Management
20
Information Management
Designer Client > Compile > DataStage server
21
Executable Job:
Generated OSH
C++ generated for each Transformer (Transformer components)
Information Management
Generated OSH
Enable viewing of generated OSH in the Administrator client:
Comments
Operator
Schema
22
Information Management
Sequential File
Source: import
Target: export
Oracle
Source: oraread
Sparse lookup: oralookup
Target load: orawrite
23
Information Management
Syntax
Operator name
Schema
Operator options (-name value)
Input (indicated by n<)
Output (indicated by n>)
24
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
A virtual dataset is used to connect the output of one operator to the input of another
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;
Information Management
Job SCORE
Generated from OSH and configuration file
SCORE is like an execution plan
25
Information Management
Set $APT_DUMP_SCORE to output the job score to the job log
To identify the score dump, look for the log entry that begins main_program: This step has ...
The score lists the virtual datasets and the operators with their node assignments
26
Information Management
Players
Each processing node runs a Section Leader (SL) process that manages its Player (P) processes
27
Information Management
Module Summary
After completing this module, you should be able to:
28
Information Management
Module Objectives
After completing this module, you should be able to:
Information Management
Partitioning breaks incoming rows into multiple streams of rows (one for
each node)
Information Management
Modulus
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm
Sort Merge
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key
Ordered
Read all records from the first partition, then the second, and so on
Information Management
Round Robin
Random
Entire
Same
Example: Key is State. All CA rows go into the same partition; all MA rows go
into the same partition. Two rows with the same state never go into different
partitions.
Example: Key is OrderNumber (numeric type). Rows with the same order
number will all go into the same partition.
Information Management
Keyless
ROUND ROBIN partitioning: input rows 0-8 are dealt in turn across three partitions (partition 1: 0, 3, 6; partition 2: 1, 4, 7; partition 3: 2, 5, 8)
Information Management
ENTIRE Partitioning
Keyless
ENTIRE partitioning: every partition receives a complete copy of all input rows (0, 1, 2, 3, ...)
Information Management
HASH Partitioning
Keyed
HASH partitioning: rows with the same key value always land in the same partition (for the input keys 0 3 2 1 0 2 3 2 1 1, the partitions contain 0 3 0 3, 1 1 1, and 2 2 2)
Information Management
Modulus Partitioning
Keyed
Uses the modulus of the key value: partition = key_value MOD #partitions
Rows with the same key value always land in the same partition (for the same input keys, the partitions contain 0 3 0 3, 1 1 1, and 2 2 2)
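For example, with 4 partitions a row whose key value is 6 goes to partition 6 MOD 4 = 2, and a row whose key value is 3 goes to partition 3 MOD 4 = 3.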
Information Management
Keyless
Row IDs: each partition keeps the same rows it already had (partition 1 keeps rows 0, 3, 6; partition 2 keeps 1, 4, 7; partition 3 keeps 2, 5, 8), i.e. SAME preserves the existing partitioning
Information Management
Auto Partitioning
12
Information Management
Partitioning Strategy
Information Management
Within a flow:
Examine upstream partitioning and sort order and attempt to preserve them for
downstream stages using SAME partitioning
Across jobs:
Use datasets to retain partitioning
Information Management
Collector Methods
(Auto)
Eagerly read any row from any input partition
Output row order is undefined (non-deterministic)
This is the default collector method
Round Robin
Pick row from input partitions in round robin order
Slower than auto, rarely used
Ordered
Read all rows from the first partition, then the second, and so on
Preserves order that exists within partitions
Sort Merge
Produces a single (sequential) stream of rows sorted on specified key
columns from input sorted on those keys
Row order is not preserved for non-key columns
That is, the sort is not stable.
Information Management
Module Summary
After completing this module, you should be able to:
18
Information Management
Topic Objectives
After completing this topic, you should be able to:
Generate test data with the Row Generator and peek at data streams
Information Management
Reject Mode =
Continue: continue reading records
Fail: abort job
Output: send record down reject link
Information Management
Input columns
Job parameters
Stage variables
Functions
Information Management
Recommendation
Always create a reject link
Always test for NULL in expressions that use NULLABLE = Yes columns (example below)
IF IsNull(link.col) THEN ... ELSE ...
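For instance, a derivation that substitutes zero for a missing value might look like the following (the link and column names are illustrative only):
IF IsNull(lnk_in.Amount) THEN 0 ELSE lnk_in.Amount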
Information Management
18
Information Management
Row Generator
Information Management
Cycle through
Columns
generating
integers
Information Management
Metadata for
data file
Return mapped
values
Information Management
Lookup Stage
Reference data should be small enough to fit into physical (shared) memory
For reference datasets larger than available memory, use the JOIN or MERGE
stage
Lookup stage processing cannot begin until all reference links have been read
into memory
Information Management
SMP configurations:
Lookup stage uses shared memory instead of duplicating ENTIRE reference data
Information Management
The Lookup stage cannot output any rows until ALL reference link data has
been read into memory
In this job design, that means reading all the source data (which might be vast) into
memory
HeaderRef
Header
Src
Out
Detail
Information Management
Sparse Lookups
Specified on a relational Enterprise stage (such as DB2 or Oracle) used as the
reference for a lookup
Information Management
Copy stream to
the peeks
Selecting records
to peek at
Copy stage used as a placeholder
Peek stage
Information Management
Stage variables are executed from top to bottom.
Reference stage
variables in
column
derivations
Information Management
Sequential I/O:
Parallel I/O:
Information Management
The Readers Per Node option can be used to read a single input file
in parallel at evenly spaced offsets
Information Management
Sequential File stage creates one partition for each input file
Always follow a Sequential file with ROUND ROBIN or other appropriate
partitioning type
Information Management
$APT_EXPORT_FLUSH_COUNT=1
Information Management
When reading delimited files, extra characters are silently truncated for
source file values longer than the maximum specified length of
VARCHAR columns
Set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to
reject these records instead
Information Management
Simplified expression:
IF index('000|800|888|866|877|855|844|833|822|900', Link_1.ProdNum, 1) > 0
THEN 'N'
ELSE 'Y'
Information Management
Result
15
20
Information Management
ceil: round up
1.4 -> 2, -1.6 -> -1
Information Management
When the Abort After Rows threshold is reached, the job aborts
Information Management
Stage variables can be used to identify the first row of an input group (sketched below)
Define a stage variable for each grouping key column
Define a stage variable to flag when the input key columns do not match the previous
values
When a new group is flagged, set the stage variables to the incoming key column
values
Compare the flag in downstream derivations
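A minimal sketch of the stage variable derivations for a single key column (the variable, link, and column names are illustrative; remember that stage variables evaluate top to bottom, so svNewGroup is computed before svPrevKey is updated):
svNewGroup:  IF lnk_in.KeyCol <> svPrevKey THEN 1 ELSE 0
svPrevKey:   lnk_in.KeyCol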
Information Management
Transformers do not know when they are reading the last row of data
Information Management
Topic Summary
After completing this topic, you should be able to:
Generate test data with the Row Generator and peek at data streams
43
Information Management
Topic Objectives
After completing this topic, you should be able to:
Information Management
Source file
Target file
Information Management
Job Design
Variable format
data file. Read in
as single field
Combine
records by
order number
Information Management
Convert string
integer to
integer
Information Management
Inserted Hash
partitioner
Inserted Hash
partitioner
Information Management
Going into the Join stage, hash operators on the join key (OrderNum)
are inserted by default
Each group of Header/Detail records will be hashed into the same partition
Each group of records will run sequentially
Essentially, the whole job runs sequentially
Solution:
Select Entire partitioning algorithm for Header into the Join
Select Same partitioning algorithm for Detail into the Join
Not all Detail records are in the same partition, but in every partition they're in,
there's a Header
Information Management
No Hash
partitioner, now
Same
10
No Hash
partitioner, now
Entire
Information Management
Export multiple
columns to a single
column
11
Information Management
Columns to export
12
Information Management
13
Single output
stream
Information Management
Add column
with zip
count
14
Information Management
Fork
Join by zip
Group by
zip; count
records in
the group
15
Information Management
16
Information Management
Job Design
17
Information Management
Under the covers, DataStage inserts hash partitioners and sort operators
before the Aggregator and Join stages
This is the default when Auto is chosen
18
Information Management
Optimized Solution
Add SAME
partitioners
Explicitly specify SAME partition for the Aggregator and Join stages
19
Information Management
20
Information Management
Topic Summary
After completing this topic, you should be able to:
21
Information Management
Topic Objectives
After completing this topic, you should be able to explain:
Inserted sorts
Information Management
Sorting Data
Information Management
Example: source data and sorted result (columns ID, LName, FName, Address)
Sort on: LName (asc), FName (desc)
Source rows include Ford (Henry, Clara, Edsel, Eleanor), Dodge (Horace, John), and additional Ford rows at other addresses
Sorted result: Dodge rows first (John, then Horace), followed by Ford rows ordered Henry, Eleanor, Edsel, Clara (FName descending within LName)
4
Information Management
Parallel Sort
Information Management
Example: the same data sorted in parallel (columns ID, LName, FName, Address)
Each partition (Part 0 - Part 3) is sorted independently on LName (asc), FName (desc), so rows are ordered within each partition but there is no global order across partitions
6
Information Management
OR
Information Management
Resorting on Sub-Groups
Use Sort Key Mode property to re-use key column groupings from
previous sorts
Uses significantly less memory / disk!
Sort is now on previously-sorted key-column groups, not the entire dataset
Outputs rows after each group
Information Management
When rows have previously been sorted by a key, all rows with the same key value are
grouped together and, moreover, the groups are in sort order.
In either case the Sort stage can be used to sort by a sub-key
within each group of rows
10
Information Management
Re-Partitioner diagram: rows from partitions 101, 102, and 103 are redistributed across the output partitions
Information Management
SortMerge Collector
For input data that is sorted within each partition (partition-sorted data)
12
Information Management
Inserted Sorts
13
Score showing
inserted tsort
operator
Information Management
Set $APT_SORT_INSERTION_CHECK_ONLY
or $APT_NO_SORT_INSERTION to change
behavior of automatically inserted sorts
Set $APT_SORT_INSERTION_CHECK_ONLY
The inserted sort operators only VERIFY
that the data is sorted
If data is not sorted properly at runtime,
the job aborts
Recommended only on a per-job basis
during performance tuning
14
Information Management
When the memory buffer is filled, sort uses temporary disk space in the
following order (an example scratch-disk entry appears below the list):
Scratch disks in the $APT_CONFIG_FILE sort named disk pool
Scratch disks in the $APT_CONFIG_FILE default disk pool
The default directory specified by $TMPDIR
The UNIX /tmp directory
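For reference, a scratch disk is assigned to the sort pool in the configuration file using the same syntax shown earlier (the path is illustrative only):
resource scratchdisk "/orch/scratch" {"sort"}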
15
Information Management
Note that partition and sort keys do not always have to be the
same
Partitioning assigns related records to the same partition
Sorting establishes group order within each partition
16
Information Management
17
Information Management
18
Information Management
Second Sort
Final Sub-Sort
No sorting
Create key-change column
Information Management
Sort by State
20
Information Management
Topic Summary
After completing this topic, you should be able to explain:
Inserted sorts
21
Information Management
Topic Objectives
After completing this topic, you should be able to explain:
Information Management
Aggregator Stage
Information Management
Aggregator Stage
Purpose:
Specify:
One or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions include, among many others:
Count (nulls/non-nulls)
Sum
Max / Min / Range
Information Management
Aggregator
stage
Information Management
Aggregation Types
Count rows
Count rows in each group
Put result in a specified output column
Calculation
Select column
Put result of calculation in a specified output column
Calculations include:
Sum
Count
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Information Management
Grouping key
columns
Count Rows
aggregation type
7
Output column
for the result
2010 IBM Corporation
Information Management
Calculation
aggregation type
Calculations with
output columns
Available
calculations
8
Information Management
Grouping Methods
Hash (default)
Calculations are made for all groups and stored in memory
Hash table structure (hence the name)
Results are written out after all input has been processed
Input does not need to be sorted
Useful when the number of unique groups is small
Running tally for each group's aggregations needs to fit into memory
Sort
Requires the input data to be sorted by grouping keys
Does not perform the sort itself! It expects the input to already be sorted
Information Management
Hash grouping method example: unsorted input key/value pairs 4X, 3Y, 1K, 3C, 2P, 3D, 1A, 2L are accumulated in an in-memory hash table; the grouped results (4X; 3C, 3Y, 3D; 1K, 1A; 2P, 2L) are output after all input is read
10
Information Management
11
Sort grouping method example: input sorted by key (1K, 1A, 2P, 2L, 3Y, 3C, 3D, 4X) lets each group be aggregated and output as soon as the key value changes
Information Management
12
Information Management
Aggregator
Use Hash method for a limited number of distinct key values (i.e.,
limited number of groups)
Uses 2K of memory per group
Incoming data does not need to be pre-sorted
Results are output after all rows
have been read
Output row order is undefined
Even if input data is sorted
13
Information Management
Transformer
Hardcode value
Parallel
Sequential
Don't sort
The data is already sorted because there is only one key value!
14
Information Management
Aggregates over the newly generated column
Aggregator
must run
Sequentially
15
Information Management
Topic Summary
After completing this topic, you should be able to:
16
Information Management
Topic Objectives
After completing this topic, you should be able to:
Design a job that updates a surrogate key state file from a dimension
table
Design a job that processes a star schema database with Type 1 and Type
2 slowly changing dimensions
Information Management
Create Surrogate
State File for
Product
dimension table
Create Surrogate
State File for
Store dimension
table
Information Management
Create the
state file
Information Management
Update the
state file
Information Management
10
Information Management
Source rows
Fact table
11
Information Management
Perform
Type 1 and
Type 2
updates to
Product
table
12
Check for
matching
StoreDim
rows
Perform
Type 1 and
Type 2
updates to
StoreDim
table
Information Management
13
Information Management
Select the
output link
14
Information Management
Lookup key
mapping
Type 1 field
Surrogate
key
Type 2 field
Fields used
for Type 2
handling
15
Information Management
Path to state
file
Initial surrogate
key value
Number of values
to retrieve at one
time
16
Information Management
Value that
means
current
17
Information Management
Output Mappings
18
Information Management
Topic Summary
After completing this topic, you should be able to:
Design a job that updates a surrogate key state file from a dimension
table
Design a job that processes a star schema database with Type 1 and Type
2 slowly changing dimensions
19
Information Management
Module Objectives
After completing this module, you should be able to:
Information Management
External function
Define a new parallel routine to use in the Transformer stage
Specify input arguments
C++ function is created, compiled, and linked outside of DataStage
Wrapped stage
Wrap an existing executable into a new custom stage
Build stage
Use the Designer to create a new stage that will be compiled into a
new operator
Define the new stage by specifying
Properties
Input and output interfaces
C++ source code is compiled and linked within DataStage for execution
Information Management
Custom stage
A way to create a new stage that will be compiled into a new operator
The new operator is derived from the APT_Operator class
Information Management
Parallel Routine
Two types
External function
Can be used in Transformer stage derivations
Can return a value (a sketch follows below)
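As a rough sketch of such an external function (the name, arguments, and return convention here are illustrative, not the course example; the function is compiled and linked into a shared library outside DataStage and then registered as a Parallel Routine of type External Function):

#include <string.h>

// Returns 1 if the candidate string starts with the given prefix, else 0.
extern "C" int startsWith(const char *candidate, const char *prefix)
{
    if (candidate == 0 || prefix == 0)
        return 0;
    return strncmp(candidate, prefix, strlen(prefix)) == 0 ? 1 : 0;
}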
Information Management
Framework classes
include file
Information Management
External function
name
Static link
Return type
Information Management
New function
Information Management
11
Information Management
Wrapped Stages
Information Management
Create and load Table Definitions that define the input and output
interfaces
13
Information Management
Default execution
mode
Executable
command
Information Management
Select Table
Definition
defining output
Generate new
stage
Information Management
Property
name
Required or
optional?
How to convert
the property
passed to the
operator
Information Management
Wrapped
stage
Information Management
Build Stage
18
Information Management
Build stage
Information Management
Class instantiation
name from
APT_Operator
class
Framework
operator name
20
Information Management
Properties
Defined properties will show up on stage properties tab
Input / Output
Interfaces
Build stage requires at least one input and one output
Build stage has a static interface and cannot access a schema
dynamically
Reads / Writes
Auto / Non-auto
Combined / Separate transfer
Framework macros can be used to explicitly execute transfer
21
Information Management
Code
Variable definitions and initialization
Pre-loop: code executed before rows are read
Per-record: code executed for each row read
Post-loop: code executed after all rows are read
22
Information Management
Build stage anatomy: Properties; Definitions; Pre-Loop; Per-Record; Post-Loop
Input link columns: a, b, ... (reads auto / noauto)
Output link columns: a, b, x, y, ... (writes auto / noauto)
Transfer: auto / noauto, combined / separate
Information Management
Input / Output macros can be used to explicitly control reads and writes
readRecord(port_number), writeRecord(port_number)
inputDone(port_number)
Must be called after a read, before referencing fields populated by that read
24
Information Management
25
Input link to the build stage must have columns with the same
names and compatible data types as defined by the interface
Information Management
Port name.
Alias for in0
auto / noauto
Table definition
defining
interface
26
Information Management
Port name.
Alias for out0
auto / noauto
Table
Definition
27
Information Management
Transfer
28
Information Management
Transfer
29
Information Management
Defining a Transfer
Transfer type:
combined / separate
auto / noauto
30
Information Management
Anatomy of a Transfer
Input link columns: OrderNum, ItemNum, Qty, Price, TaxRate
Input buffer (input interface): Qty, Price, TaxRate, plus inRec.* (the transferred record)
Per-record code: Amount = Qty*Price
Output buffer (output interface): Amount, plus outRec.* (the transferred record)
Output link columns: OrderNum, ItemNum, Qty, Price, TaxRate, Amount
ReadRecs (auto or explicit) brings the enumerated input interface values into the input buffer. If a Transfer is
specified, then the whole input record also comes in as a block of values.
Duplicate columns coming from a Transfer are dropped, with warnings in the log
e.g., if the input record contained a column named Amount, it would be dropped. Explicit assignments in the code to output
interface columns take precedence over transferred column values.
If RCP is enabled instead of a Transfer, the picture is the same. If neither Transfer nor RCP is specified, then
inRec.* and outRec.* won't exist.
Information Management
Property specifications
Name
Data type
Prompt
Default value
Required or optional
Conversion: how property gets processed
Choose Name Value
32
Information Management
Specifying Properties
Property type
Default value
Choose
-Name value
Information Management
Definitions
Variables
Include files
Pre-loop
Code executed once, prior to entering the per-record loop
Per-record
Code executed once per input record
Post-loop
Code executed once, after exiting the per-record loop
34
Information Management
Definitions Tab
Define variables
Information Management
Pre-Loop Tab
Initialize
variable
Information Management
Per-Record Tab
Build macro
Qualified input
column
Unqualified
output column
Information Management
Post-Loop Tab
Framework types
Property
Framework
functions
Information Management
39
Information Management
Build stage
Information Management
Stage Properties
Category and
property
List of property
values
Information Management
42
Information Management
Build Macros
Informational
inputs(): number of inputs
outputs(): number of outputs
transfers(): number of transfers
Flow control
endLoop(): exit the per-record loop and go to the post-loop code
nextLoop(): read the next input record
failStep(): abort the job
43
Information Management
Build Macros
Input / Output
Ports are indexed: 0, 1, 2, ...
readRecord(index), writeRecord(index), inputDone(index)
holdRecord(index): suspend the next auto read
discardRecord(index): suspend the next auto write
discardTransfer(index): suspend the next auto transfer
Transfers
doTransfersFrom(index): do all transfers from the specified input
doTransfersTo(index): do all transfers to the specified output
transferAndWriteRecord(index): do all transfers to the specified output,
then write a record
44
Information Management
After all records have been read, readRecord(0) will not bring
in usable input data for processing
Use inputDone() to test whether another record is available before processing it (see the sketch below)
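A minimal Per-Record sketch under these assumptions: one noauto input port aliased in0, one output port with an auto transfer, and hypothetical column names (Qty, Price on input, Amount on output); input columns are qualified by port and output columns are unqualified, as on the Per-Record tab shown earlier:

// Per-Record code: explicit read on input port 0
readRecord(0);                      // noauto input: fetch the next record
if (inputDone(0))                   // no usable record was returned
{
    endLoop();                      // leave the per-record loop
}
else
{
    Amount = in0.Qty * in0.Price;   // derive the output column
    transferAndWriteRecord(0);      // transfer inRec.* and write the output record
}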
46
Information Management
47
Information Management
s1 and s2 are declared as APT_String variables
APT_String assignment operator
APT_String concatenation operator
APT_String function
(an illustrative sketch follows)
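A sketch of the kind of code that slide illustrates, assuming the APT_String operations it names (assignment, concatenation, and a member function); the length() call and the column names are assumptions, not the course example:

// Definitions tab
APT_String s1;
APT_String s2;

// Per-Record tab
s1 = in0.FirstName;            // APT_String assignment operator
s2 = s1 + " " + in0.LastName;  // APT_String concatenation operator
NameLen = s2.length();         // APT_String member function (assumed name)
FullName = s2;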
Information Management
Module Summary
After completing this module, you should be able to:
49
Information Management
Module Objectives
After completing this module, you should be able to:
Information Management
Optimizing Performance
Information Management
An iterative process
Information Management
Operator Combination
Operators are combinable
Set automatically within the stage / operator definition
Can also be set within a stage's Advanced properties
Information Management
10
Information Management
Use parallel Data Set to land intermediate results between parallel jobs
No conversion overhead, stored in native internal format
Retains data partitioning and sort order
Maximum performance through parallel I/O
11
Information Management
Impact of Partitioning
12
Information Management
Impact of Sorting
13
Information Management
Parallel Data Set maintains partitions and sort order across jobs
14
Information Management
Impact of Transformer
15
Information Management
Rename Columns
Drop Columns
Perform default type conversions
Split output
Information Management
Impact of Buffering
17
Information Management
Isolating Buffers
$APT_BUFFER_FREE_RUN=1000
Writes excess buffer data to disk instead of slowing down the producer
The buffer will not slow down the producer until it has written
1000 * $APT_BUFFER_MAXIMUM_MEMORY to disk
These settings generate a significant amount of disk I/O, so DO NOT
use them for production jobs
18
Information Management
Sequential File stage file pattern reads start with a single CAT process
Setting $APT_IMPORT_PATTERN_USES_FILESET allows parallel I/O
Dynamically builds a File Set header file for the list of files matching the pattern (example setting below)
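For example, the variable can be defined in the job or project environment (the value shown is illustrative; defining the variable enables the behavior):
$APT_IMPORT_PATTERN_USES_FILESET=1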
19
Information Management
Performance Analysis
20
Information Management
Use the Director monitor to watch the throughput (rows/sec) during a job
run
Compare job run durations
Turn on APT_PM_PLAYER_TIMING and APT_PM_PLAYER_MEMORY
to report player calls and memory allocation
21
Information Management
Performance Analyzer
Visualization tool that provides deeper insight into job runtime behavior
22
Information Management
23
Information Management
Example Job
24
Information Management
Machine
utilization
25
Stages in
job
Lengths of
time
Information Management
Process
phases
26
Information Management
27
Information Management
Blue line is reading 350,000 records per second
Rows per second (vertical axis) over the timeline (horizontal axis)
Run the mouse over a line to identify the stage port represented
28
Information Management
Select the
stages to
display
29
Information Management
Blue shows
CPU time
Amount of
CPU time
Red shows
system time
30
Information Management
31
Information Management
Filters
32
Information Management
Module Summary
After completing this module, you should be able to:
33
Information Management
Module Objectives
After completing this module, you should be able to:
Information Management
Quick Find
Name with wild
card character (*)
Include matches
in object
descriptions
Execute
Find
Information Management
Found Results
Highlight
next item
Number found;
Click to open
Advanced Find
window
Found
item
Information Management
Filter
search
Information Management
Options
Case sensitivity
Search within last result set
Information Management
Create
impact
analysis
Export to a
file
Draw
Comparison
Information Management
Impact Analysis
Information Management
Graphical functionality
Display the dependency path
Collapse selected objects
Move the graphical object
Bird's-eye view
10
Information Management
Select Table
Definition
11
Information Management
List of
dependent
objects
Table
Definition
Bird's-Eye
view
12
Information Management
Table
Definition
Job containing
(dependent on)
Table Definition
13
Information Management
14
Information Management
Open a job:
Select a stage
Right-click Show where data flows to / originates
Select a link flowing in or out of the stage
Select one or more columns on the link
You can also right-click outside of any stage and select Configure data flow
view
15
Information Management
Select, then
click Find
where data
originates from
Select
columns
16
Information Management
Displayed Results
17
Information Management
18
Information Management
19
Information Management
Job with
the
changes
20
Information Management
Comparison Results
Click underlined
item to open stage
editor
21
Information Management
Click when
Comparison window
is active
22
Information Management
23
Information Management
Module Summary
After completing this module, you should be able to:
24
Information Management
Module Objectives
After completing this module, you should be able to explain:
Information Management
IBM InfoSphere
Foundation Tools
IBM InfoSphere
Warehouse
Information Management
Business Vocabulary
Data Relationships
Data Quality Compliance
Data Models and Mapping
Business Specification Rules
Provenance of Information
Information Management
Business Glossary Term Pack
Manage Business Terms: Business Glossary
Design Enterprise Models: Data Architect
Monitor Data Flows: Metadata Workbench
Target Model / Shared Metadata
Leverage Industry Practices: IBM Industry Models
Understand Data Relationships: Discovery
Capture Design Specifications: FastTrack
Assess, Monitor, Manage Data Quality: Information Analyzer
Information Management
Understand
Cleanse: standardize, merge, and correct information
Transform: combine and restructure information for new uses
Deliver: replicate, virtualize, and move information for in-line delivery
Platform services: parallel processing, connectivity, metadata, administration, and deployment services
Information Management
Cleanse: InfoSphere QualityStage (standardize, merge, and correct information)
Transform: InfoSphere DataStage (combine and restructure information for new uses)
Deliver
Platform Services: Parallel Processing Services, Connectivity Services, Metadata Services (InfoSphere Metadata Workbench), Administration Services, Deployment Services (InfoSphere Information Server Manager)
Information Management
Developers
Architects
Information Management
Deploy
Invoke
ISD Server
- or -
DataStage Servers
Information Management
Process server
Bindings: EJB, Web Services, JMS, REST, XML/JSON, Portal, RSS
Integrated Metadata Management, Common Reporting, Common Administration, Service Security
Design, Operational, Service Deployment
Load Balancing and Fault Tolerance
Information Providers: DataStage, QualityStage, DB2, Federation, Classic Federation, Oracle, WCC (partial)
Information Management
Customer data example: "Kate A. Roberts, 416 Columbos St #2, Boston, MA" is standardized to "Kate A. Roberts, 416 Columbus Ave #2, Boston, MA 02116", corrected against postal records, and matched to the customer record "Catherine A. Roberts, 416 Columbus Ave. Apt. 2, Boston, MA 02116"
Customer name and customer data from the legacy system feed the flow; aggregates are calculated for the Account Summary loaded into the data warehouse
Information Management
WebSphere MQ
Information Management
MQ Architecture
Client mode connection: the MQ Connector links with the MQ client library on the local host and communicates over the network with a remote queue manager on a remote host
Server mode connection: the MQ Connector links with the MQ server library and connects to a local queue manager, which uses MQ intercommunication to reach remote queue managers
Information Management
MQ Messages
Message types
Logical messages
Composed of one or more physical messages on the queue
Each physical message is called a segment
Message groups
Composed of one or more logical messages
Each logical message has a sequence number
The last message in the group contains a flag marking it as the last
Information Management
Message Structure
Message schema
Defines the type and structure of the data
Information Management
MQ Stages
MQ Stages
MQ Connector stage
MQ stage
Queues
Store messages
Must be opened before messages can be written or read
Information Management
Message Queue
Queue manager
Queue
Queue messages
Information Management
MQ Stage Readers
MQ
Connector
MQ Stage
Information Management
MQ Connector Stage
Information Management
Client mode
connection
Save metadata
into data
connection
Client connection
properties
Information Management
Reading Messages
Queue to read
from
Number to read
Delete record
after read
Records to read
before commit
Information Management
Filter by payload
size (in bytes)
Information Management
Column for
payload
Information Management
MQ
Connector
Information Management
MQ Write Properties
Connection
Information
Queue to
write to
Information Management
Business Services: pre-built, customizable functionality
MDM Domains: multi-domain (Party, Account, Product, Customer)
Integration: content, data, analytics
MDM Workbench
Information Management
Post
Implementation
Support for:
Test Plan
Integration Testing
Functional Testing
User Acceptance
Data Stewardship
31
Information Management
Source #1 and Source #2 flow through Profiling & Analysis (Information Analyzer), Source-to-Target Mapping (FastTrack), and ETL / DQ logic (DataStage, QualityStage) into the SIF and on to the MDM Server services
Mentoring workshops support each step
Information Management
Source systems (Source #2 through Source #N) are processed by Information Server (Information Analyzer, FastTrack, DataStage, QualityStage) into the SIF
Load-process DataStage jobs then populate the MDM database and its history
Information Management
34
Information Management
36
Information Management
Table differencing
All of these methods are inefficient, do not guarantee data integrity, and potentially impact
source system performance
37
Information Management
InfoSphere CDC binaries are installed on each source and target database server (DB2, SQL Server, Oracle, ...), with subscriptions (replication threads) between them
The Management Console, running on Windows, is the configuration interface; each CDC instance has a GUI connection to it
The Access Manager controls access to the product and can be run on a separate server to allow for centralized access
38
Information Management
39
Information Management
40
Information Management
Delivery via MQ: source database, MQ, DS/QS job, target database
41
Information Management
Delivery via flat file: source database, file, DS/QS job, target database
42
Information Management
Delivery via a User Exit invoking a DS Custom Operator: source database, User Exit / DS Custom Operator, DS/QS job, target database
Information Management
Module Summary
After completing this module, you should be able to explain:
44