E2 E3 Infosphere Datastage - Introduction To The Parallel Architecture


InfoSphere DataStage - Introduction to the Parallel Architecture
Agenda
• Parallel processing architecture
• Partition parallelism
• Pipeline parallelism
• Parallel development environment
• Parallel framework
• Parallel job execution model



Why Study the Parallel Architecture?
• DataStage Client is a productivity tool
 GUI design functionality is intended for fast development
 Not intended to mirror underlying architecture

• GUI depicts standard ETL process


 Parallelism is implemented under the covers
 GUI hides and in some cases distorts things
• E.g., sort insertions

• Sound, scalable designs require an understanding of underlying architecture

• Learning DataStage at the GUI job design level is not enough. In order to develop the
ability to design sound, scalable jobs, it is necessary to understand the underlying
architecture.



What We Need to Master
• Parallel runtime engine
 How the GUI design gets executed
 What is generated from the GUI
 How this is executed in the parallel framework
• Partition parallelism
• Pipeline parallelism
• Configuration file, which separates design from runtime environment

• Development environment
 How to take advantage of the parallel framework
 How to debug and change the GUI design based on parallel framework messages in
the job log

• To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We
need to understand what gets generated from the GUI design and how this gets executed
by the parallel framework. We also need to be able to debug and modify our job designs
based on what we see happen at runtime.



DataStage Documentation
• DataStage documentation

 “Parallel Job Developer's Guide”
• Info on GUI elements (stages, job properties, etc.)
• Info on extending the functionality of EE
• Info on datasets and schemas
• Info on functions used in the Transformer and Modify stages

 “Parallel Job Advanced Developer's Guide”
• Environment variables
• Extending the functionality of EE



Information Server Backbone
• Like other products in the Information Server Suite, DataStage uses the metadata access
and analysis services provided by the Information Server backbone. Its jobs, table
definitions, and other component objects are stored in the Information Server Repository,
which is shared by other products in the suite. Its jobs are executed using the DataStage
parallel engine, which is also used by some other members of the suite, including
Information Analyzer.

[Diagram: Information Server backbone, showing the Metadata Server with its Metadata Access Services and Metadata Analysis Services, and the parallel engine beneath]
Key Parallel Concepts
• Parallel processing:
 Executing the job on multiple CPUs
• Scalable processing:
 Add more resources (CPUs and disks) to increase system performance

• Example system: 6 CPUs (processing nodes) and disks

• Scale up by adding more CPUs

• Add CPUs as individual nodes or to an SMP system



Key Parallel Concepts
• Parallel processing is the key to building jobs that are highly scalable.

• The parallel engine uses the processing node concept. Standalone processes, rather than thread technology, are used. This process-based architecture is platform-independent and allows greater scalability across resources within the processing pool.



Scalable Hardware Environments

● Single CPU
 • Dedicated memory & disk

● SMP
 • Multi-CPU (2-64+)
 • Shared memory & disk

● GRID / Clusters
 • Multiple, multi-CPU systems
 • Dedicated memory per node
 • Typically SAN-based shared storage

● MPP
 • Multiple nodes with dedicated memory and storage
 • 2 – 1000’s of CPUs



Scalable Hardware Environments

• DataStage parallel jobs are designed to be platform-independent – a single job, if properly designed, can run across resources within a single machine (SMP) or multiple machines (cluster, GRID, or MPP architectures).

• While DataStage can run on a single-CPU environment, it is designed to take advantage of parallel platforms.



Drawbacks of Traditional Batch Processing

● Poor utilization of resources


• Lots of idle processing time
• Lots of disk and I/O for staging

● Complex to manage
• Lots of small jobs

● Impractical with large data volumes



Drawbacks of Traditional Batch Processing
• Traditional batch processing consists of a distinct set of steps, defined by business
requirements. Between each step, intermediate results are written to disk.

• This processing may exist outside of a database (using flat files for intermediate results) or
within a database (using SQL, stored procedures, and temporary tables).

• There are several problems with this approach:


1. Each step must complete and write its entire result set before the next step can
begin
2. Landing intermediate results incurs a large performance penalty through increased
I/O. In this example, a single source incurs 7 times the I/O to process.
3. With increased I/O requirements come increased storage costs.



Pipeline Parallelism
• Transform, clean, load processes execute simultaneously

• Like a conveyor belt moving rows from process to process


 Start downstream process while upstream process is running

• Advantages:
 Reduces disk usage for staging areas
 Keeps processors busy

• Still has limits on scalability
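
• As an illustration of the conveyor-belt idea only (not DataStage code), the short Python sketch below chains generator "stages"; the file name and the column being cleaned are made-up assumptions. Generators interleave work in one process rather than running on separate CPUs, but they show the key point: a downstream step starts consuming rows before the upstream step has finished, and nothing is staged to disk.

import csv

def extract(path):                      # upstream stage: reads rows one at a time
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def clean(rows):                        # middle stage: starts while extract is still running
    for row in rows:
        row["name"] = row["name"].strip().upper()
        yield row

def load(rows):                         # downstream stage: consumes rows as they arrive
    for row in rows:
        print(row)                      # stand-in for writing to a target table

# The three stages form one pipeline; rows flow through without staging files.
load(clean(extract("customers.csv")))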



Partition Parallelism
• Divide the incoming stream of data into subsets to be separately processed by an
operation
 Subsets are called partitions

• Each partition of data is processed by the same operation


 E.g., if operation is Filter, each partition will be filtered in exactly the same way

• Facilitates near-linear scalability


 8 times faster on 8 processors
 24 times faster on 24 processors
 This assumes the data is evenly distributed

• Partitioning breaks a dataset into smaller sets. This is a key to scalability. However, the
data needs to be evenly distributed across the partitions; otherwise, the benefits of
partitioning are reduced.
• It is important to note that what is done to each partition of data is the same. How the data
is processed or transformed is the same.
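
• A minimal Python sketch of the partitioning idea (illustration only, not DataStage code; the partitioning key, the filter condition, and the sample rows are made up). The rows are hash-partitioned on a key column, and the identical filter is applied to every partition, here in parallel worker processes:

from multiprocessing import Pool

def partition(rows, n):
    """Split rows into n partitions by hashing the key column."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row["customer_id"]) % n].append(row)
    return parts

def filter_partition(part):
    """The same operation is applied to every partition."""
    return [row for row in part if row["amount"] > 100]

if __name__ == "__main__":
    rows = [{"customer_id": i, "amount": (i * 7) % 250} for i in range(1000)]
    parts = partition(rows, 4)                       # 4 partitions ~ 4 processing nodes
    with Pool(4) as pool:
        results = pool.map(filter_partition, parts)  # each partition filtered identically
    print(sum(len(r) for r in results), "rows passed the filter")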



Three-Node Partitioning
• Here the data is partitioned into three partitions
• The operation is performed on each partition of data separately and in parallel
• If the data is evenly distributed, the data will be processed three times faster



DataStage Combines Partitioning and Pipelining
• Within DataStage, pipelining, partitioning, and repartitioning are automatic

• Job developer only identifies:


 Sequential vs. parallel operations (by stage)
 Method of data partitioning
 Configuration file (which identifies resources)
 Advanced stage options (buffer tuning, operator combining, etc.)

• By combining both pipelining and partitioning DataStage creates jobs with higher volume
throughput. The configuration file drives the parallelism by specifying the number of
partitions.



Job Design v. Execution
• Much of the parallel processing paradigm is hidden from the programmer. The
programmer simply designates the process flow, as shown in the upper portion of this
diagram. DataStage, using the definitions in the configuration file, will actually execute
UNIX processes that are partitioned and parallelized, as illustrated in the bottom portion.

• User assembles the flow using DataStage Designer

• At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

• No need to modify or recompile the job design!



Execution, Production Environment
• Supports all hardware configurations with a single job design

• Scale up by adding processors or nodes with no application change or recompile

• Configuration file specifies hardware configuration and resources



Execution, Production Environment
• DataStage isolates the specific details of the hardware environment from the job design.
This allows a properly designed job to run without modifications across SMP and
clustered/GRID/MPP environments.

• The job developer only needs to identify:

 Parallel vs. sequential operations (by stage)
 Method of partitioning data to meet business requirements
 Advanced per-stage options (for performance optimization)

• The parallel configuration file specifies the hardware resources used to run a particular job.



Three Types of Parallelism


• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism



Three Types of Parallelism

• There are really only two types of parallelism (pipeline and partition). "Explicit" parallelism refers to the developer explicitly controlling where processing runs: the degree of parallelism can be restricted by explicitly constraining a stage to a node or node pool (i.e., a named set of nodes).

• However, constraining pieces of a job to a specific pool of nodes reduces flexibility. Named node pools give more flexibility than constraining to specific nodes, but any constraint reduces flexibility.



Defining Parallelism
• Execution mode (sequential / parallel) is controlled by stage definition and properties
 Default is parallel for most stages
 Can override default in most cases (Advanced Properties)

• Examples of stage parallelism:


 Sequential file reads (when using multiple readers or files)
 Sequential file targets (when using multiple files)
 Oracle sources (when partition table is set)

• Degree of parallelism is determined by configuration file


 Total number of logical nodes in nameless default pool
 Nodes listed in node map or in named node pool

• The configuration file is used to assign resources to operators.



Configuration File
• Configuration file separates configuration (hardware / software) from job design
 Specified per job at runtime by $APT_CONFIG_FILE
 Change hardware and resources without changing job design

• Defines number of nodes (logical processing units) with their resources (need not match
physical CPUs)
 Dataset, Scratch, Buffer disk (file systems)
 Optional resources (Database, SAS, etc.)
 Advanced resource optimizations
• “Pools” (named subsets of nodes)

• Different configuration files can be used on different job runs


 Optimizes overall throughput and matches job characteristics to overall hardware
resources
 Allows runtime constraints on resource usage on a per job basis
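
• As a sketch of how this works in practice (an assumption, not a prescribed method: the project name, job name, configuration file paths, and the exact way $APT_CONFIG_FILE is exposed as a job parameter all depend on the installation), the same compiled job can be launched against different configuration files using the dsjob command line, here driven from Python:

import subprocess

def run_with_config(project, job, config_path):
    """Launch a DataStage job, overriding its $APT_CONFIG_FILE job parameter."""
    subprocess.run(
        ["dsjob", "-run", "-wait",
         "-param", f"$APT_CONFIG_FILE={config_path}",  # assumes the job exposes this parameter
         project, job],
        check=True,
    )

# Same compiled job, different degrees of parallelism (hypothetical paths)
run_with_config("MyProject", "LoadWarehouse", "/opt/configs/2node.apt")
run_with_config("MyProject", "LoadWarehouse", "/opt/configs/8node.apt")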



Configuration File

• DataStage jobs can point to different configuration files by using job parameters. Thus, a
job can utilize different hardware architectures without being recompiled.

• It can pay to have a 4-node configuration file running on a 2 processor box, for example, if
the job is “resource bound.” We can spread disk I/O among more controllers.



Example Configuration File
• Key points:
 Number of nodes defined
 Resources assigned to each node; their order is significant
 Advanced resource optimizations and configuration (named pools, database, SAS)

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}



Example Configuration File
• This example shows a typical configuration file. Pools can be applied to nodes or other resources. The curly braces following some disk resources specify the resource pools associated with that resource. A node pool is simply a collection of nodes. The pools a given node belongs to are listed after the keyword "pool" for that node. A stage that is constrained to use a particular named pool will run only on the nodes that are in that pool. By default, all stages run on the nodes that are in the nameless pool ("").

• Following the keyword “node” is the name of the node (logical processing unit).

• The order of resources is significant. The first disk is used before the second, and so on.



Example Configuration File

• Pool labels such as "sort" and "bigdata", when used, restrict the corresponding processing to the resources that carry those labels. For example, "sort" restricts sorting to the node pools and scratchdisk resources labeled "sort".

• Database resources (not shown here) can also be created that restrict database access to
certain nodes.

• Question: Can objects be constrained to specific CPUs? No, a request is made to the
operating system and the operating system chooses the CPU.



Configuration File Guidelines
• Minimize I/O overlap across nodes
 If multiple file systems are shared across nodes, alter order of file systems within
each node definition
 Pay particular attention with mapping of file systems to physical controllers / drives
within a RAID array or SAN
 Use local disks for scratch storage if possible

• Named Pools can be used to further separate I/O


 “buffer” – file systems are only used for buffer overflow
 “sort” – file systems are only used for sorting

• On clustered / GRID / MPP configurations, named pools can be used to further specify
resources across physical servers
 Through careful job design, can minimize data shipping
 Specifies server(s) with database connectivity



Job Design Examples
Generating Mock Data
• Row Generator stage
 Define the columns in which to generate data
 On the Extended Properties page, select the algorithm for generating values
• Different types have different algorithms available

• Lookups can be used to generate large amounts of robust mock data


 Lookup tables map integers to values
 Column Generator columns generate integers to look up
 Cycling through integer sets can generate all possible combinations
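
• A minimal Python sketch of this pattern (illustration only, not DataStage stages; the lookup tables and column names are made up). Generated integer keys cycle through small ranges, and lookup tables map each key to a realistic value:

from itertools import cycle

# Hypothetical lookup tables mapping generated integers to values
STATES = ["CA", "NY", "TX"]                                        # cycle length 3
PRODUCTS = ["widget", "gadget", "gizmo", "sprocket", "doohickey"]  # cycle length 5

def generate_rows(n):
    """Generate n mock rows by cycling integer keys and looking them up."""
    state_keys = cycle(range(len(STATES)))      # 0, 1, 2, 0, 1, 2, ...
    product_keys = cycle(range(len(PRODUCTS)))  # 0, 1, 2, 3, 4, 0, ...
    for i, s, p in zip(range(n), state_keys, product_keys):
        yield {"row_id": i, "state": STATES[s], "product": PRODUCTS[p]}

for row in generate_rows(8):
    print(row)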



Job Design for Generating Mock Data
[Diagram: job design with a Row Generator stage and lookup tables]



Specifying the Generating Algorithm
[Screenshot: extended properties for the columns generating integers, with the cycle-through algorithm specified]



Specifying the Generating Algorithm
• The number of values to cycle through should be different for each set of integers so that many distinct combinations are generated. Strictly, the number of distinct combinations produced equals the least common multiple of the cycle lengths, so pairwise-coprime cycle lengths guarantee that every possible combination eventually appears. For example:
0 0 0
1 1 1
2 2 0
3 0 1
0 1 0
1 2 1
2 0 0
etc.

• Here the first column cycles through 0-3, the second through 0-2, and the third through 0-1.
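
• The coverage of combinations can be checked with a few lines of Python (a sketch of the arithmetic only, not of DataStage behaviour). With cycle lengths that are not pairwise coprime, such as 4, 3, and 2, only lcm(4, 3, 2) = 12 of the 4 × 3 × 2 = 24 possible tuples appear; with 5, 3, and 2, all 30 appear:

from math import lcm

def distinct_combinations(cycle_lengths, n_rows):
    """Count the distinct tuples produced by cycling each column independently."""
    seen = {tuple(i % c for c in cycle_lengths) for i in range(n_rows)}
    return len(seen)

for lengths in [(4, 3, 2), (5, 3, 2)]:
    total = 1
    for c in lengths:
        total *= c
    produced = distinct_combinations(lengths, 1000)
    print(lengths, "-> distinct:", produced, "of", total, "(lcm =", lcm(*lengths), ")")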



Inside the Lookup Stage

[Screenshot: inside the Lookup stage, showing the metadata for the data file and the mapped values that are returned]
Config File Displayed in Job Log

[Screenshot: job log message displaying the configuration file, with the first and second partitions visible]



Thank You !
