E2 E3 Infosphere Datastage - Introduction To The Parallel Architecture


InfoSphere DataStage - Introduction to the Parallel Architecture
Agenda
• Parallel processing architecture
• Partition parallelism
• Pipeline parallelism
• Parallel development environment
• Parallel framework
• Parallel job execution model



Why Study the Parallel Architecture?
• DataStage Client is a productivity tool
 GUI design functionality is intended for fast development
 Not intended to mirror underlying architecture

• GUI depicts standard ETL process


 Parallelism is implemented under the covers
 GUI hides and in some cases distorts things
• E.g., sort insertions

• Sound, scalable designs require an understanding of underlying architecture

• Learning DataStage at the GUI job design level is not enough. In order to develop the
ability to design sound, scalable jobs, it is necessary to understand the underlying
architecture.



What We Need to Master
• Parallel runtime engine
 How the GUI design gets executed
 What is generated from the GUI
 How this is executed in the parallel framework
• Partition parallelism
• Pipeline parallelism
• Configuration file, which separates design from runtime environment

• Development environment
 How to take advantage of the parallel framework
 How to debug and change the GUI design based on parallel framework messages in
the job log

• To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We
need to understand what gets generated from the GUI design and how this gets executed
by the parallel framework. We also need to be able to debug and modify our job designs
based on what we see happen at runtime.



DataStage Documentation
• DataStage documentation

 “Parallel Job Developer's Guide”
• Info on GUI elements (stages, job properties, etc.)
• Info on extending the functionality of EE
• Info on datasets and schemas
• Info on functions used in the Transformer and Modify stages

 “Parallel Job Advanced Developer's Guide”
• Environment variables
• Extending the functionality of EE



Information Server Backbone
• Like other products in the Information Server Suite, DataStage uses the metadata access
and analysis services provided by the Information Server backbone. Its jobs, table
definitions, and other component objects are stored in the Information Server Repository,
which is shared by other products in the suite. Its jobs are executed using the DataStage
parallel engine, which is also used by some other members of the suite, including
Information Analyzer.

[Diagram: Information Server backbone, showing the Metadata Server with its Metadata Access Services and Metadata Analysis Services, and the parallel engine beneath]
Key Parallel Concepts
• Parallel processing:
 Executing the job on multiple CPUs
• Scalable processing:
 Add more resources (CPUs and disks) to increase system performance

• Example system: 6 CPUs (processing nodes) and disks

• Scale up by adding more CPUs

• Add CPUs as individual nodes or to an SMP system



Key Parallel Concepts
• Parallel processing is the key to building jobs that are highly scalable.

• The parallel engine uses the processing node concept. Standalone processes, rather than thread technology, are used. This process-based architecture is platform-independent and allows greater scalability across resources within the processing pool.



Scalable Hardware Environments

● Single CPU
 • Dedicated memory & disk

● SMP
 • Multi-CPU (2-64+)
 • Shared memory & disk

● GRID / Clusters
 • Multiple, multi-CPU systems
 • Dedicated memory per node
 • Typically SAN-based shared storage

● MPP
 • Multiple nodes with dedicated memory and storage
 • 2 – 1000’s of CPUs



Scalable Hardware Environments

• DataStage parallel jobs are designed to be platform-independent – a single job, if properly designed, can run across resources within a single machine (SMP) or multiple machines (cluster, GRID, or MPP architectures).

• While DataStage can run on a single-CPU environment, it is designed to take advantage of parallel platforms.



Drawbacks of Traditional Batch Processing

● Poor utilization of resources


• Lots of idle processing time
• Lots of disk and I/O for staging

● Complex to manage
• Lots of small jobs

● Impractical with large data volumes



Drawbacks of Traditional Batch Processing
• Traditional batch processing consists of a distinct set of steps, defined by business
requirements. Between each step, intermediate results are written to disk.

• This processing may exist outside of a database (using flat files for intermediate results) or
within a database (using SQL, stored procedures, and temporary tables).

• There are several problems with this approach:


1. Each step must complete and write its entire result set before the next step can
begin
2. Landing intermediate results incurs a large performance penalty through increased
I/O. In this example, a single source incurs 7 times the I/O to process.
3. With increased I/O requirements come increased storage costs.



Pipeline Parallelism
• Transform, clean, load processes execute simultaneously

• Like a conveyor belt moving rows from process to process


 Start downstream process while upstream process is running

• Advantages:
 Reduces disk usage for staging areas
 Keeps processors busy

• Still has limits on scalability
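
• As an illustration of the conveyor-belt idea only (not DataStage code), the short Python sketch below chains generator "stages"; the file name and the column being cleaned are made-up assumptions. Generators interleave work in one process rather than running on separate CPUs, but they show the key point: a downstream step starts consuming rows before the upstream step has finished, and nothing is staged to disk.

import csv

def extract(path):                      # upstream stage: reads rows one at a time
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def clean(rows):                        # middle stage: starts while extract is still running
    for row in rows:
        row["name"] = row["name"].strip().upper()
        yield row

def load(rows):                         # downstream stage: consumes rows as they arrive
    for row in rows:
        print(row)                      # stand-in for writing to a target table

# The three stages form one pipeline; rows flow through without staging files.
load(clean(extract("customers.csv")))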



Partition Parallelism
• Divide the incoming stream of data into subsets to be separately processed by an
operation
 Subsets are called partitions

• Each partition of data is processed by the same operation


 E.g., if operation is Filter, each partition will be filtered in exactly the same way

• Facilitates near-linear scalability


 8 times faster on 8 processors
 24 times faster on 24 processors
 This assumes the data is evenly distributed

• Partitioning breaks a dataset into smaller sets. This is a key to scalability. However, the
data needs to be evenly distributed across the partitions; otherwise, the benefits of
partitioning are reduced.
• It is important to note that what is done to each partition of data is the same. How the data
is processed or transformed is the same.
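
• A minimal Python sketch of the partitioning idea (illustration only, not DataStage code; the partitioning key, the filter condition, and the sample rows are made up). The rows are hash-partitioned on a key column, and the identical filter is applied to every partition, here in parallel worker processes:

from multiprocessing import Pool

def partition(rows, n):
    """Split rows into n partitions by hashing the key column."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row["customer_id"]) % n].append(row)
    return parts

def filter_partition(part):
    """The same operation is applied to every partition."""
    return [row for row in part if row["amount"] > 100]

if __name__ == "__main__":
    rows = [{"customer_id": i, "amount": (i * 7) % 250} for i in range(1000)]
    parts = partition(rows, 4)                       # 4 partitions ~ 4 processing nodes
    with Pool(4) as pool:
        results = pool.map(filter_partition, parts)  # each partition filtered identically
    print(sum(len(r) for r in results), "rows passed the filter")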



Three-Node Partitioning
• Here the data is partitioned into three partitions
• The operation is performed on each partition of data separately and in parallel
• If the data is evenly distributed, the data will be processed three times faster



DataStage Combines Partitioning and Pipelining
• Within DataStage, pipelining, partitioning, and repartitioning are automatic

• Job developer only identifies:


 Sequential vs. parallel operations (by stage)
 Method of data partitioning
 Configuration file (which identifies resources)
 Advanced stage options (buffer tuning, operator combining, etc.)

• By combining both pipelining and partitioning DataStage creates jobs with higher volume
throughput. The configuration file drives the parallelism by specifying the number of
partitions.



Job Design v. Execution
• Much of the parallel processing paradigm is hidden from the programmer. The
programmer simply designates the process flow, as shown in the upper portion of this
diagram. DataStage, using the definitions in the configuration file, will actually execute
UNIX processes that are partitioned and parallelized, as illustrated in the bottom portion.

• User assembles the flow using DataStage Designer

• At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

• No need to modify or recompile the job design!



Execution, Production Environment
• Supports all hardware configurations with a single job design

• Scale up by adding processors or nodes with no application change or recompile

• Configuration file specifies hardware configuration and resources



Execution, Production Environment
• DataStage isolates the specific details of the hardware environment from the job design.
This allows a properly designed job to run without modifications across SMP and
clustered/GRID/MPP environments.

• The job developer only needs to identify:

 Parallel vs. sequential operations (by stage)
 Method of partitioning data to meet business requirements
 Advanced per-stage options (for performance optimization)

• The parallel configuration file specifies the hardware resources used to run a particular job.



Three Types of Parallelism


• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism



Three Types of Parallelism

• There are really only two types of parallelism (pipeline and partition). "Explicit" parallelism refers to the developer explicitly controlling where processing runs: the degree of parallelism can be restricted by explicitly constraining a stage to a node or node pool (i.e., a named set of nodes).

• However, constraining pieces of a job to a specific pool of nodes reduces flexibility. Named node pools give more flexibility than constraining to specific nodes, but any constraint reduces flexibility.



Defining Parallelism
• Execution mode (sequential / parallel) is controlled by stage definition and properties
 Default is parallel for most stages
 Can override default in most cases (Advanced Properties)

• Examples of stage parallelism:


 Sequential file reads (when using multiple readers or files)
 Sequential file targets (when using multiple files)
 Oracle sources (when partition table is set)

• Degree of parallelism is determined by configuration file


 Total number of logical nodes in nameless default pool
 Nodes listed in node map or in named node pool

• The configuration file is used to assign resources to operators.



Configuration File
• Configuration file separates configuration (hardware / software) from job design
 Specified per job at runtime by $APT_CONFIG_FILE
 Change hardware and resources without changing job design

• Defines number of nodes (logical processing units) with their resources (need not match
physical CPUs)
 Dataset, Scratch, Buffer disk (file systems)
 Optional resources (Database, SAS, etc.)
 Advanced resource optimizations
• “Pools” (named subsets of nodes)

• Different configuration files can be used on different job runs


 Optimizes overall throughput and matches job characteristics to overall hardware
resources
 Allows runtime constraints on resource usage on a per job basis
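
• As a sketch of how this works in practice (an assumption, not a prescribed method: the project name, job name, configuration file paths, and the exact way $APT_CONFIG_FILE is exposed as a job parameter all depend on the installation), the same compiled job can be launched against different configuration files using the dsjob command line, here driven from Python:

import subprocess

def run_with_config(project, job, config_path):
    """Launch a DataStage job, overriding its $APT_CONFIG_FILE job parameter."""
    subprocess.run(
        ["dsjob", "-run", "-wait",
         "-param", f"$APT_CONFIG_FILE={config_path}",  # assumes the job exposes this parameter
         project, job],
        check=True,
    )

# Same compiled job, different degrees of parallelism (hypothetical paths)
run_with_config("MyProject", "LoadWarehouse", "/opt/configs/2node.apt")
run_with_config("MyProject", "LoadWarehouse", "/opt/configs/8node.apt")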



Configuration File

• DataStage jobs can point to different configuration files by using job parameters. Thus, a
job can utilize different hardware architectures without being recompiled.

• It can pay to have a 4-node configuration file running on a 2 processor box, for example, if
the job is “resource bound.” We can spread disk I/O among more controllers.



Example Configuration File
• Key points:
 Number of nodes defined
 Resources assigned to each node; their order is significant
 Advanced resource optimizations and configuration (named pools, database, SAS)

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}



Example Configuration File
• This example shows a typical configuration file. Pools can be applied to nodes or other resources. The curly braces following some disk resources specify the resource pools associated with that resource. A node pool is simply a collection of nodes. The pools a given node belongs to are listed after the keyword "pool" for that node. A stage that is constrained to use a particular named pool will run only on the nodes that are in that pool. By default, all stages run on the nodes that are in the nameless pool ("").

• Following the keyword “node” is the name of the node (logical processing unit).

• The order of resources is significant. The first disk is used before the second, and so on.



Example Configuration File

• Pool labels such as "sort" and "bigdata", when used, restrict the corresponding processing to the resources that carry those labels. For example, "sort" restricts sorting to the node pools and scratchdisk resources labeled "sort".

• Database resources (not shown here) can also be created that restrict database access to
certain nodes.

• Question: Can objects be constrained to specific CPUs? No, a request is made to the
operating system and the operating system chooses the CPU.



Configuration File Guidelines
• Minimize I/O overlap across nodes
 If multiple file systems are shared across nodes, alter order of file systems within
each node definition
 Pay particular attention with mapping of file systems to physical controllers / drives
within a RAID array or SAN
 Use local disks for scratch storage if possible

• Named Pools can be used to further separate I/O


 “buffer” – file systems are only used for buffer overflow
 “sort” – file systems are only used for sorting

• On clustered / GRID / MPP configurations, named pools can be used to further specify
resources across physical servers
 Through careful job design, can minimize data shipping
 Specifies server(s) with database connectivity



Job Design Examples
Generating Mock Data
• Row Generator stage
 Define the columns in which to generate data
 On the Extended Properties page, select the algorithm for generating values
• Different types have different algorithms available

• Lookups can be used to generate large amounts of robust mock data


 Lookup tables map integers to values
 Column Generator columns generate integers to look up
 Cycling through integer sets can generate all possible combinations
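
• A minimal Python sketch of this pattern (illustration only, not DataStage stages; the lookup tables and column names are made up). Generated integer keys cycle through small ranges, and lookup tables map each key to a realistic value:

from itertools import cycle

# Hypothetical lookup tables mapping generated integers to values
STATES = ["CA", "NY", "TX"]                                        # cycle length 3
PRODUCTS = ["widget", "gadget", "gizmo", "sprocket", "doohickey"]  # cycle length 5

def generate_rows(n):
    """Generate n mock rows by cycling integer keys and looking them up."""
    state_keys = cycle(range(len(STATES)))      # 0, 1, 2, 0, 1, 2, ...
    product_keys = cycle(range(len(PRODUCTS)))  # 0, 1, 2, 3, 4, 0, ...
    for i, s, p in zip(range(n), state_keys, product_keys):
        yield {"row_id": i, "state": STATES[s], "product": PRODUCTS[p]}

for row in generate_rows(8):
    print(row)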



Job Design for Generating Mock Data
[Diagram: job design with a Row Generator stage and lookup tables]



Specifying the Generating Algorithm
[Screenshot: extended properties for the columns generating integers, with the cycle-through algorithm specified]



Specifying the Generating Algorithm
• The number of values to cycle through should be different for each set of integers so that many distinct combinations are generated. Strictly, the number of distinct combinations produced equals the least common multiple of the cycle lengths, so pairwise-coprime cycle lengths guarantee that every possible combination eventually appears. For example:
0 0 0
1 1 1
2 2 0
3 0 1
0 1 0
1 2 1
2 0 0
etc.

• Here the first column cycles through 0-3, the second through 0-2, and the third through 0-1.
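
• The coverage of combinations can be checked with a few lines of Python (a sketch of the arithmetic only, not of DataStage behaviour). With cycle lengths that are not pairwise coprime, such as 4, 3, and 2, only lcm(4, 3, 2) = 12 of the 4 × 3 × 2 = 24 possible tuples appear; with 5, 3, and 2, all 30 appear:

from math import lcm

def distinct_combinations(cycle_lengths, n_rows):
    """Count the distinct tuples produced by cycling each column independently."""
    seen = {tuple(i % c for c in cycle_lengths) for i in range(n_rows)}
    return len(seen)

for lengths in [(4, 3, 2), (5, 3, 2)]:
    total = 1
    for c in lengths:
        total *= c
    produced = distinct_combinations(lengths, 1000)
    print(lengths, "-> distinct:", produced, "of", total, "(lcm =", lcm(*lengths), ")")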



Inside the Lookup Stage

[Screenshot: inside the Lookup stage, showing the metadata for the data file and the mapped values that are returned]
Config File Displayed in Job Log

[Screenshot: job log message displaying the configuration file, with the first and second partitions visible]



Thank You !
